Computer Networks — Foreign-Language (English) Literature with Translation: A Discussion of Search Engine Crawlers

All links in that root HTML page are direct sons of the root; subsequent links are then sons of the previous sons.

A single URL server serves lists of URLs to a number of crawlers. A Web crawler starts by parsing a specified Web page, noting any hypertext links on that page that point to other Web pages. It then parses those pages for new links, and so on, recursively. Web crawler software does not actually move around to different computers on the Internet, as viruses or intelligent agents do. Each crawler keeps roughly 300 connections open at once; this is necessary to retrieve Web pages at a fast enough pace. A crawler resides on a single machine. It simply sends HTTP requests for documents to other machines on the Internet, just as a Web browser does when the user clicks on links. All the crawler really does is automate the process of following links.

Web crawling can be regarded as processing items in a queue. When the crawler visits a Web page, it extracts links to other Web pages, puts these URLs at the end of a queue, and continues crawling from a URL that it removes from the front of the queue [1].
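As a rough illustration of that queue discipline, the sketch below (Python standard library only; the seed URL, page cap, and timeout are arbitrary assumptions rather than values from the text) takes a URL from the front of a queue, fetches it, extracts its links, and appends unseen URLs to the back.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=50):              # max_pages is an arbitrary cap
    queue = deque([seed_url])                   # URLs waiting to be visited
    seen = {seed_url}                           # avoids re-queueing the same URL
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()                   # take from the front of the queue
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue                            # skip pages that fail to download
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)       # resolve relative links
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)          # new URLs go to the back of the queue
    return seen

if __name__ == "__main__":
    print(crawl("https://example.com"))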

A. Resource Constraints
Crawlers consume resources: network bandwidth to download pages, memory to maintain private data structures in support of their algorithms, CPU to evaluate and select URLs, and disk storage to store the text and links of fetched pages as well as other persistent data.

B. Robot Protocol
The robots.txt file gives directives for excluding a portion of a Web site from being crawled. Analogously, a simple text file can furnish information about the freshness and popularity of published objects. This information permits a crawler to optimize its strategy for refreshing collected data as well as its policy for replacing objects.
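Python's standard library ships a parser for this exclusion file; the short sketch below shows how a crawler might consult it before fetching a page (the user-agent string and URLs are illustrative placeholders, not values from the text).

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()                                       # download and parse the directives

# Ask whether this crawler may fetch a given page before requesting it.
if rp.can_fetch("MyCrawler/1.0", "https://example.com/private/page.html"):
    print("allowed to crawl")
else:
    print("excluded by robots.txt")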

C. Meta Search Engine
A meta-search engine is a kind of search engine that does not have its own database of Web pages. It sends search terms to the databases maintained by other search engines and gives users the results that come from all the search engines queried. Few meta-searchers allow you to delve into the largest, most useful search engine databases. They tend to return results from smaller and/or free search engines and miscellaneous free directories, often small and highly commercial.
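The core of such an engine can be pictured as fanning one query out to several back ends and merging their responses. In the sketch below the endpoint URLs and the JSON response shape are invented placeholders, since the text names no specific engines or APIs.

import json
from urllib.parse import quote
from urllib.request import urlopen

BACKENDS = [
    "https://engine-a.example/search?q=",       # placeholder endpoints, not real APIs
    "https://engine-b.example/search?q=",
]

def meta_search(query):
    """Send the same query to every back end and merge the hits, dropping duplicates."""
    merged, seen_urls = [], set()
    for base in BACKENDS:
        try:
            raw = urlopen(base + quote(query), timeout=5).read()
            hits = json.loads(raw).get("results", [])   # assumed {"results": [{"url": ..., "title": ...}]}
        except (OSError, ValueError):
            continue                            # a failing back end is simply skipped
        for hit in hits:
            if hit["url"] not in seen_urls:     # keep one copy of duplicate results
                seen_urls.add(hit["url"])
                merged.append(hit)
    return merged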

V. CRAWLING TECHNIQUES

A. Focused Crawling
A general-purpose Web crawler gathers as many pages as it can from a particular set of URLs, whereas a focused crawler is designed to gather only documents on a specific topic, thus reducing the amount of network traffic and downloads. The goal of the focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not with keywords but with exemplary documents. Rather than collecting and indexing all accessible Web documents so as to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. The focused crawler has three main components: a classifier, which makes relevance judgments on crawled pages to decide on link expansion; a distiller, which determines a measure of centrality of crawled pages to determine visit priorities; and a crawler with dynamically reconfigurable priority controls, which is governed by the classifier and distiller. The most crucial evaluation of focused crawling is to measure the harvest ratio, which is the rate at which relevant pages are acquired and irrelevant pages are effectively filtered out of the crawl. This harvest ratio must be high; otherwise the focused crawler would spend a lot of time merely eliminating irrelevant pages, and it may be better to use an ordinary crawler instead [17].
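A minimal sketch of that control loop follows, assuming a priority queue ordered by the classifier's relevance score; relevance_score(), extract_links(), and fetch() are hypothetical placeholders for the classifier, parser, and downloader, and the 0.5 relevance threshold is an arbitrary choice. It also computes the harvest ratio described above.

import heapq

def focused_crawl(seed_urls, relevance_score, extract_links, fetch, budget=1000):
    """Crawl up to `budget` pages, expanding links from pages the classifier deems relevant."""
    frontier = [(-1.0, url) for url in seed_urls]       # max-priority via negated score
    heapq.heapify(frontier)
    seen = set(seed_urls)
    relevant = fetched = 0
    while frontier and fetched < budget:
        neg_priority, url = heapq.heappop(frontier)     # most promising URL first
        page = fetch(url)
        if page is None:
            continue
        fetched += 1
        score = relevance_score(page)                   # classifier judgment on the page
        if score >= 0.5:                                # arbitrary relevance threshold
            relevant += 1
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score, link))  # out-links inherit the page's score
    harvest_ratio = relevant / fetched if fetched else 0.0    # relevant pages per fetched page
    return harvest_ratio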

B. Distributed Crawling
Indexing the Web is a challenge due to its growing and dynamic nature. As the size of the Web grows, it has become imperative to parallelize the crawling process in order to finish downloading the pages in a reasonable amount of time. A single crawling process, even if multithreading is used, will be insufficient for large-scale engines that need to fetch large amounts of data rapidly. When a single centralized crawler is used, all the fetched data passes through a single physical link. Distributing the crawling activity via multiple processes can help build a scalable, easily configurable system that is fault tolerant. Splitting the load decreases hardware requirements and at the same time increases the overall download speed and reliability. Each task is performed in a fully distributed fashion, that is, no central coordinator exists [3].
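One common way to achieve such coordinator-free splitting is to hash each URL's hostname to a crawler process, so every process can decide locally which URLs it owns; the sketch below assumes four processes, an arbitrary number not taken from the text.

import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 4                                # arbitrary example process count

def owner_of(url):
    """Map a URL to the index of the crawler process responsible for its host."""
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS       # stable assignment per host

urls = [
    "https://example.com/a",
    "https://example.com/b",                    # same host, so same crawler as /a
    "https://example.org/c",
]
for url in urls:
    print(f"crawler {owner_of(url)} handles {url}")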

VI. PROBLEM OF SELECTI
