Research and Implementation of a Subject-Oriented Web Crawler (public DOC, graduation thesis)

Shared by a site member on 金锄头文库; the full document runs 51 pages.

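Undergraduate Thesis

Design and Implementation of a Subject-Oriented Crawler

Name:
Student ID: 23020051204554
School: School of Software
Department: Software Engineering
Major: Software Engineering
Grade:
Advisor: Associate Professor

Abstract (Chinese version)

The information network now holds a large amount of information, but manual browsing makes it difficult to browse and organize that information securely. Much useful information is lost as a result, large amounts of information cannot be put to use in time, and users are greatly inconvenienced. Search engines emerged as a new focus of technical attention to solve this problem. Drawing on the characteristics of the information network, this thesis uses information extraction and web page parsing techniques to design and implement the most important part of a search engine, the web crawler, in order to provide Internet search services with finer and more precise classification, more comprehensive and in-depth data, and more timely updates.

The thesis first gives an overview of the development of web crawlers, then analyzes their architecture and implementation principles, and examines in depth the distribution characteristics of subject pages on the Web and the algorithms for judging subject relevance. The specific work is as follows:

(1) Crawler. Crawling starts from designed seed websites and downloads websites that are as complete as possible and that match the user's requirements.

(2) Page preprocessing, including word segmentation, HTML parsing, and page denoising. Building on the pruning of tree nodes, a style-based page denoising method is designed, which further improves the page denoising process.

(3) Subject relevance judgment, including a feature extraction stage and a weight calculation stage. In the feature extraction stage, document frequencies are combined to obtain new features, which reduces dimensionality and improves classification accuracy. In the weight calculation stage, information gain, the traditional TFIDF algorithm, and the vector space model (VSM) algorithm are combined to obtain a weight calculation method better suited to judging subject relevance.

(4) Finally, a simple web crawler system is implemented on the MYECLIPSE platform, and a brief analysis of its operation shows satisfactory results.

Keywords: web page parsing; TFIDF algorithm; VSM algorithm

Abstract (English version)

The public security information website currently holds a large amount of information, but it is not possible to visit and organize all of it manually. Much important information is therefore lost, which hinders the solving of criminal cases and causes a great deal of inconvenience to users. Search engine technology emerged as a new focus of attention to deal with this problem. Based on the characteristics of information networks, this paper uses information extraction and web page analysis techniques to design and implement the most important part of a search engine, the Web Spider, in order to provide Internet search services with more detailed and accurate classification, more comprehensive and in-depth data, and more timely updates.

This paper first outlines the development of search engines and the state of web crawler research. It then analyzes the architecture of a topic search engine and examines in depth the distribution characteristics of subject pages on the Web and the algorithm for identifying subject relevance. The concrete work is as follows:

(1) Spider. Starting from designed seed websites, the spider downloads as many sites as possible, as completely as possible, in line with user requirements.

(2) Page preprocessing, including word segmentation, HTML parsing, and page denoising.

(3) Subject relevance judgment, including a feature extraction stage and a weight calculation stage. In the feature extraction stage, document frequencies are combined to obtain new features, which achieves dimensionality reduction and improves classification accuracy. In the weight calculation stage, information gain, the traditional TFIDF algorithm, and the vector space model algorithm are combined to obtain a weight calculation better suited to judging subject relevance.

(4) Finally, a simple web crawler system is implemented on the MYECLIPSE platform, and a brief analysis of its operation shows a satisfactory result.

Key words: page analysis; TFIDF algorithm; space vector algorithm

Table of Contents (Chinese version)

Chapter 1 Introduction
  1.1 Background and research significance of the topic
  1.2 Development of search engines
  1.3 Research status at home and abroad
  1.4 Main work and structure of this thesis
Chapter 2 Working principles of web crawlers
  2.1 The role of web crawlers in search engines
  2.2 Basic principles of web crawlers
    2.2.1 Architecture of the subject-oriented crawler
    2.2.2 Description of system module functions
  2.3 Content extraction
  2.4 Distribution characteristics of subject pages on the Web
  2.5 Chapter summary
Chapter 3 Key algorithms of web crawlers
  3.1 Web page search strategies
  3.2 Search strategies of subject-oriented crawlers
    3.2.1 Search strategies based on content evaluation
    3.2.2 Search strategies based on link structure evaluation
  3.3 Subject relevance algorithms
    3.3.1 Vector space model (VSM)
    3.3.2 Page subject relevance algorithm
  3.4 Chapter summary
Chapter 4 Analysis and design of the subject-oriented crawler
  4.1 Architecture of the subject-oriented crawler
  4.2 Initial seed selection and URL queue maintenance
    4.2.1 Initial seed selection
    4.2.2 URL queue maintenance
  4.3 Web page parsing
    4.3.1 Analysis of HTML syntax
    4.3.2 Extraction of information resources from web pages
  4.4 Implementation of the subject relevance algorithm
    4.4.1 Word segmentation algorithm
    4.4.2 Weight calculation: the TF-IDF algorithm
    4.4.3 Improving the weight algorithm: the IG algorithm
    4.4.4 VSM algorithm
  4.5 Building the index
  4.6 System implementation
  4.7 Summary
Chapter 5 Conclusion and outlook
  5.1 Summary of this thesis
  5.2 Research outlook
References
Acknowledgements

Contents (English version)

Chapter 1 Introduction
  1.1 Background of the topic and research significance
  1.2 History of the development of search engines
  1.3 Research status at home and abroad
  1.4 Main work and structure of this paper
Chapter 2 Working principle of crawler
  2.1 Status of crawler in search engine domain
  2.2 The basic principles of crawler
    2.2.1 Architecture of subject-oriented crawler
    2.2.2 Introduction of module function
  2.3 Information extraction
  2.4 Distribution features of subject-oriented pages on the Web
  2.5 Summary of this chapter
Chapter 3 Key algorithms of crawler
  3.1 Web searching strategy
  3.2 Searching strategy of subject-oriented crawler
    3.2.1 Content-based relevance algorithm
    3.2.2 Link-based relevance algorithm
  3.3 Subject relevance algorithm
    3.3.1 VSM (Vector Space Model)
    3.3.2 Relevance algorithm for web page subject
  3.4 Summary of this chapter
Chapter 4 Analysis and design of subject-oriented crawler
  4.1 Architecture of subject-oriented crawler
  4.2 Initial seed selection and URL queue maintenance
    4.2.1 Initial seed selection
    4.2.2 URL queue maintenance
  4.3 Web page extraction
    4.3.1 HTML syntax analysis
    4.3.2 Information resources ex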
