P2P-Based Web Crawler Design: Crawling Module Design


Abstract

A web crawler is a program that autonomously collects the content of Web pages. With the explosive growth of data, traditional web crawlers are increasingly unable to meet people's growing need for information. With the rapid development of peer-to-peer (P2P) technology, P2P-based web crawlers were proposed and quickly became a research focus.

This project uses P2P network computing and parallel programming to implement a web crawler. It is divided into two main parts, a crawling module and a control module. The crawling module implements the basic functions of a single crawling node, chiefly downloading web pages from the Internet according to a URL queue.

This thesis has four chapters covering the technical background, system design, code implementation, and a demonstration, and describes in detail the purpose, technology, and process of developing the crawling module.

Keywords: web crawler; multithreading; hash table

Contents

Chapter 1 Introduction
1.1 Background of web crawler technology
1.1.1 How a web crawler works
1.1.2 Web crawler search strategies
1.1.3 Brief introduction to the Hypertext Transfer Protocol
1.1.4 Introduction to the development tools and language
1.2 Current state of the technology, and its problems and shortcomings
1.2.1 Application and development of hyperlink analysis algorithms
1.2.2 From traditional centralized crawlers to distributed crawlers
1.2.3 From traditional general-purpose crawlers to topic-focused crawlers
1.3 Main content and characteristics of the thesis
1.4 Organization of the thesis
Chapter 2 Overall Design and Implementation
2.1 Requirements analysis
2.2 System design
2.2.1 Functional design of the crawling module
2.2.2 Process design of the crawling module
2.2.3 Cooperation between the crawling module and the control module
2.3 Code implementation
2.3.1 Implementation of the Page class
2.3.2 Implementation of the UrlManager class
2.3.3 Implementation of the Spider class
2.3.4 Other modules
2.4 Summary
Chapter 3 Demonstration
3.1 Software and hardware environment
3.2 Testing the crawling module
3.3 Integration testing
Chapter 4 Conclusion and Outlook
4.1 Problems solved in this thesis
4.2 Outlook for the future of crawlers
4.2.1 Quality and performance
4.2.2 Personalized services
References
Acknowledgements

Chapter 1 Introduction

1.1 Background of web crawler technology

The World Wide Web (WWW) is a huge, globally distributed information service center that is expanding at a rapid pace. In 1998 the WWW held about 350 million documents, with about 1 million documents added each day, and the total number of documents doubled in less than 9 months. Compared with traditional documents, documents on the Web have many new characteristics: they are distributed, heterogeneous, and unstructured or semi-structured, which poses new challenges for traditional information retrieval techniques. In recent years, many researchers have found that the hyperlink structure of the WWW is a very rich and important resource which, if fully exploited, can greatly improve the quality of retrieval results.

A web crawler is a program that automatically fetches web pages. It downloads pages from the World Wide Web for a search engine and is an important component of the search engine [1,7]. When searching, it usually follows a particular search strategy. Research on crawler search strategies is therefore of great significance to the application and development of search engines.

1.1.1 How a web crawler works

The basic working principle of a web crawler is as follows. The crawler starts from the URLs of one or more initial pages, analyzes the source of each page to extract new page links, and then follows those links to find further links. This cycle continues until all pages have been fetched and analyzed [15,17]. The main functions of a crawler program are fetching page data and extracting and analyzing the hyperlinks in pages. When a crawler enters a hypertext document, it uses the tag structure of HTML to search for information and to obtain URLs pointing to other hypertext documents, so it can "crawl" and search the network automatically without any user intervention [19]. Crawler technology is therefore, in effect, a more proactive and specialized search technology.

1.1.2 Web crawler search strategies

A good strategy obtains more topic-relevant pages within a reasonable time limit while consuming fewer network, storage, and computing resources.

(1) Breadth-first or depth-first search strategy

First-generation web crawlers were mainly based on traditional graph algorithms [16], such as breadth-first or depth-first search. A core set of URLs serves as the seed set, and the algorithm recursively follows hyperlinks to other pages, usually ignoring page content, because the ultimate goal is for this traversal to cover the entire Web. This strategy is typically used in general-purpose search engines, which want as many pages as possible and have no specific requirements.

(2) Focused search strategy

Search engines based on first-generation crawlers generally fetched fewer than 1,000,000 pages [16], and they rarely re-collected pages to refresh their indexes. Retrieval was also very slow, typically requiring a wait of 10 s or longer. With the exponential growth and dynamic change of web page information, the limitations of these general-purpose search engines grew ever larger, and as technology advanced, focused crawlers that fetch relevant web resources in a targeted way emerged.

1.1.3 Brief introduction to the Hypertext Transfer Protocol

The Hypertext Transfer Protocol (HTTP) [1] is a simple protocol. A client process establishes a TCP connection to a server process, then sends a request and reads the server's response. The server process closes the connection to indicate the end of the response. The content returned by the server has two parts, a response header and a response body. The latter is usually an HTML file, which we call a "web page".

A complete HTML document begins with <html> and ends with </html>. Most HTML commands appear in pairs like this. An HTML document contains a head, which begins with <head> and ends with </head>, and a body, which begins with <body> and ends with </body>. The title is usually displayed by the client program at the top of the window. For a detailed description of the HTML specification, see (W3C, 1999).

1.1.4 Introduction to the development tools and language

The system is developed with Microsoft Visual Studio 2005. Visual Studio is a development environment from Microsoft and is currently the most popular environment for developing Windows applications.

The system is written in C#, because the language has built-in support for HTTP access and multithreading, and for a crawler these two capabilities ...
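As a rough illustration of the crawl loop from Section 1.1.1 combined with the breadth-first strategy from Section 1.1.2, the sketch below keeps a FIFO queue of URLs and a hash-based set of URLs already seen. It is written in Python for brevity and is not the thesis's C# implementation. The names `LinkExtractor`, `crawl_breadth_first`, `fetch`, `max_pages`, and `FAKE_WEB`, as well as the `example.test` URLs, are all assumptions made for this example. The `fetch` callback stands in for a real HTTP download so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_breadth_first(seeds, fetch, max_pages=100):
    """Visit pages in breadth-first order, starting from the seed URLs.

    `fetch` is a caller-supplied (hypothetical) function that maps a URL
    to its HTML text, or returns None if the page cannot be downloaded.
    """
    seeds = list(seeds)
    queue = deque(seeds)   # URLs waiting to be downloaded (FIFO)
    seen = set(seeds)      # hash-based set of URLs already queued
    visited = []           # URLs in the order they were downloaded

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A tiny in-memory "web" (made up for this example).
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/c">C</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/c">C</a>',
    "http://example.test/c": "<p>no links</p>",
}

order = crawl_breadth_first(["http://example.test/"], FAKE_WEB.get)
print(order)
```

Replacing `queue.popleft()` with `queue.pop()` turns the queue into a stack and gives a depth-first-style visiting order instead. The `seen` set is what prevents the crawler from downloading the same page twice when several pages link to it.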

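The two-part response described in Section 1.1.3 (a response header followed by a response body) can be illustrated with a small parsing sketch. It assumes the client has already read everything the server sent before closing the connection, and it only splits those bytes. The function name `split_http_response` and the sample response `RAW` are made up for this example, and the code is Python rather than the thesis's C#.

```python
def split_http_response(raw: bytes):
    """Split a raw HTTP/1.x response into (status_line, headers, body)."""
    # An empty line (CRLF CRLF) separates the header from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# A hand-written sample response; the body is a minimal HTML page.
BODY = b"<html><head><title>Demo</title></head><body>Hi</body></html>"
RAW = (
    b"HTTP/1.0 200 OK\r\n"
    + b"Content-Type: text/html\r\n"
    + b"Content-Length: " + str(len(BODY)).encode("ascii") + b"\r\n"
    + b"\r\n"
    + BODY
)

status, headers, body = split_http_response(RAW)
print(status)
print(headers)
print(body.decode("ascii"))
```

Header names are lower-cased here because HTTP header field names are case-insensitive. A real crawler would also need to handle redirects, character encodings, and error status codes, which this sketch leaves out.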