Design and Implementation of a Distributed Search Engine


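University of Science and Technology of China
Master's Thesis
Design and Implementation of a Distributed Search Engine
Name: Li Wei (李伟)
Degree applied for: Master
Major: Pattern Recognition and Intelligent Systems
Advisor: Zhu Ming (朱明)
Date: 2006-05-01

Abstract

With so many web pages available today, people looking for information on the Internet usually have to rely on Internet search engines. This thesis designs a large-scale distributed search engine for Internet search. An Internet search engine system generally consists of four main parts: a crawling subsystem, a storage subsystem, an index subsystem, and a portal subsystem. The crawling subsystem first crawls the Internet by following page links, fetches web pages and other Web objects, and saves them to the storage subsystem. The index subsystem retrieves the pages that have not yet been indexed from the storage subsystem, calculates the index data, and builds the index. The portal provides the user interface: a user enters query keywords in the portal, and the portal builds a query, sends it to the index subsystem, looks up the pages that match the keywords, and returns them to the user. This thesis implements the core functions of an Internet search engine and delivers a basic distributed search engine platform for the large-scale Internet.

In the distributed crawling subsystem, multiple crawlers should avoid crawling the same pages. This thesis assigns each crawler a URL space according to the hash value of the URL. The URL spaces do not overlap, and load is balanced by adjusting the URL space each crawler covers. The crawler system implemented here also supports both IPv4 and IPv6 networks.

The storage subsystem consists of a number of storage teams. Each storage team stores the Web objects of one URL space that does not overlap with any other, and the master server publishes this storage strategy. Adding storage teams continually increases the storage capacity of the whole system. Each storage team in turn consists of several storage cells that hold exactly the same data. All data is therefore kept in multiple copies, which protects the data and increases the concurrency of data access. External clients access the storage subsystem directly, according to the storage strategy published by the master server. The master server takes no part in data access, so it is no longer a bottleneck under frequent data access.

The index subsystem has two parts: index calculation and index service. The index calculation subsystem downloads the data to be indexed from the storage subsystem, builds the index, and sends it to the index service subsystem. To improve the reliability of index calculation, indexers and storage teams are in a many-to-many relationship: multiple indexers work on the pending data of multiple storage teams at the same time. Each storage team provides an FTP service that allows only one indexer at a time to download pending data packages. Once a download completes, the package is moved to a to-be-deleted directory, which prevents multiple indexers from downloading and calculating the same index. In the index service subsystem, every index service server stores all of the index data, which keeps the index data safe.

Every subsystem uses a strategy-based distributed architecture. The strategy describes how services are distributed inside the system and the interfaces through which they must be accessed. The master server defines and publishes the service access strategy. Every server in the system provides services as the strategy prescribes, so the servers form an independent autonomous system and coordinate with one another directly. External clients also access services directly according to the strategy, without involving the master server. This access model greatly improves scalability, because the master server is no longer the system bottleneck. It also improves performance and reliability (when the master server goes down, the system can still provide service to some extent).

Search engine vendors have not published their Web storage solutions. Only Google has mentioned that its Web storage is built on the Google File System, and it has not published a detailed Web storage design either. This thesis describes in detail the Web storage solution of the implemented search engine. To improve performance and simplify the data access model, the Web storage system is not built on a distributed file system. Instead, it uses the strategy-based distributed architecture: each storage team stores, organizes, and maintains its Web objects on its own, and the master server neither maintains Web object metadata nor takes part in data access. An external client that needs the storage service simply accesses the appropriate storage team according to the access strategy.

All servers in the search engine are inexpensive PCs, so hardware and software failures are inevitable. To build a stable and reliable search engine on unreliable hardware and software, every server maintains heartbeats with some of the other servers, continuously detects abnormal conditions, and handles errors promptly. Important data has multiple copies and can be restored quickly after a disaster by simple data copying.

Overall, the search engine implemented in this thesis offers good scalability, high performance, and reliability, and it solves a number of problems in the crawler, storage, and index systems of a distributed Internet search engine.

Keywords: search engine, web crawler, Web storage, index, distributed

Abstract (original English version)

Today, people looking for all kinds of information on the Internet usually rely on the help of Internet search engines. We design a large-scale distributed Internet search engine here. Generally, an Internet search engine consists of four main components: a crawling subsystem, a storage subsystem, an indexing subsystem, and a portal subsystem. First, the crawling subsystem crawls web pages through the pages' links and stores them in the storage subsystem. The indexing subsystem downloads the crawled pages and calculates index data. Users input several keywords into the portal. It builds a query, sends it to the index subsystem, gets the hit pages, and returns them to the user. We have implemented the core component of the large-scale distributed Internet search engine platform.

In the distributed crawling subsystem, we assign a non-overlapping URL space, according to the URL hash value, to each of the crawlers. The crawlers can keep the load balanced by adjusting the URL space.

A number of storage teams constitute the storage subsystem. The master of the system publishes the storage strategy. The strategy divides the URL space and assigns each part to one storage team. Expanding the storage teams can continually increase the storage capacity of the system as a whole. Several storage cells constitute a team and store identical data. They keep multiple backups of all data to ensure data security, which also improves the parallel access capability. Storage clients access the data on the storage teams directly, according to the storage strategy, without the master's help. The master ceases to be a bottleneck for frequent data access operations.

The index subsystem has two parts: indexing and index service. The indexing subsystem downloads web page packages from the storage subsystem, calculates index data, and transfers it to the index service subsystem. To improve the reliability of index calculation, multiple indexers calculate the web pages on multiple storage teams. A storage team allows only one indexer to download data packages through FTP at one time. When the download completes, the storage team moves the packages out. The indexers therefore do not download and calculate the same index data. The index data is stored on each index service server, ensuring the security of the index data.

Each subsystem uses the strategy-based distributed architecture. The strategy describes the distribution of services within the system and is developed and published by the master. All servers within the system provide services in accordance with the strategy, as an independent autonomous system. They collaborate directly. Clients that access the services directly also act in accordance with the strategy, and there is no need for the master to be involved. This greatly enhances system scalability. In addition, it also increases system performance and reliability.

Currently, the Web search engine manufacturers have not made their storage system solutions public. This thesis gives a solution for an Internet search engine storing web pages. To improve performance and simplify the data access model, the Web storage system is no longer based on a distributed file system. It uses a strategy-based distributed architecture and is constituted from storage teams. They maintain the Web objects all by themselves. The master does not maintain object metadata, nor is it involved in data access. Clients access the storing service directly according to the strategy. The whole system continuously detects various exceptions and deals with them in time. In this way, we build a high avai…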
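
The URL-space assignment described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the 32-bit hash space, the use of SHA-256, the class name Strategy, and the crawler names are all assumptions made for illustration. It shows how a published strategy could map every URL to exactly one crawler (or storage team), and how moving a range boundary shifts load between neighbours.

```python
import hashlib
from bisect import bisect_right

HASH_SPACE = 2 ** 32  # assumption: URLs are hashed into a 32-bit space


def url_hash(url: str) -> int:
    """Map a URL to an integer in [0, HASH_SPACE). SHA-256 is an arbitrary choice."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class Strategy:
    """Hypothetical strategy: contiguous, non-overlapping hash ranges, one per owner."""

    def __init__(self, owners):
        if not owners:
            raise ValueError("at least one owner is required")
        self.owners = list(owners)
        n = len(self.owners)
        # lower_bounds[i] is the first hash value that belongs to owners[i].
        self.lower_bounds = [i * HASH_SPACE // n for i in range(n)]

    def owner_for_hash(self, h: int) -> str:
        if not 0 <= h < HASH_SPACE:
            raise ValueError("hash value outside the hash space")
        # The owner is the last range whose lower bound is <= h.
        return self.owners[bisect_right(self.lower_bounds, h) - 1]

    def owner_of(self, url: str) -> str:
        return self.owner_for_hash(url_hash(url))

    def move_boundary(self, index: int, new_lower: int) -> None:
        """Rebalance by moving the lower bound of owners[index].

        Lowering the bound takes hash values from owners[index - 1];
        raising it hands values back. Ranges stay non-overlapping.
        """
        if not 0 < index < len(self.owners):
            raise ValueError("only interior boundaries can move")
        upper = (self.lower_bounds[index + 1]
                 if index + 1 < len(self.owners) else HASH_SPACE)
        if not self.lower_bounds[index - 1] < new_lower < upper:
            raise ValueError("new boundary would make a range empty or overlap")
        self.lower_bounds[index] = new_lower


if __name__ == "__main__":
    strategy = Strategy(["crawler-0", "crawler-1", "crawler-2"])
    for url in ("http://example.com/", "http://example.com/about"):
        print(url, "->", strategy.owner_of(url))
```

Because any client that holds the strategy can compute the owner locally, no lookup at the master is needed per URL, which matches the direct-access model the abstract describes.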

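The package hand-off between a storage team and the indexers can be sketched in the same spirit. The abstract states that a storage team lets only one indexer download pending packages at a time and then moves each downloaded package to a to-be-deleted directory. The sketch below replaces the FTP service with a local directory and a lock, so the class StorageTeam, the directory names, and the helper run_indexers are hypothetical stand-ins rather than the thesis's design.

```python
import shutil
import tempfile
import threading
from pathlib import Path
from typing import Optional, Tuple


class StorageTeam:
    """Hypothetical stand-in for a storage team's package area (local dirs, not FTP)."""

    def __init__(self, root: Path):
        self.pending = root / "pending"
        self.to_delete = root / "to_delete"
        self.pending.mkdir(parents=True, exist_ok=True)
        self.to_delete.mkdir(parents=True, exist_ok=True)
        self._lock = threading.Lock()  # only one indexer may download at a time

    def add_package(self, name: str, data: bytes) -> None:
        (self.pending / name).write_bytes(data)

    def download_next(self) -> Optional[Tuple[str, bytes]]:
        """Hand out one pending package, then move it to the to-be-deleted directory."""
        with self._lock:
            packages = sorted(self.pending.iterdir())
            if not packages:
                return None
            package = packages[0]
            data = package.read_bytes()
            # Moving the file out of `pending` means no other indexer can get it.
            shutil.move(str(package), str(self.to_delete / package.name))
            return package.name, data


def run_indexers(team: StorageTeam, n_indexers: int) -> list:
    """Run several indexer threads against one team and return the claimed names."""
    claimed = []
    claimed_lock = threading.Lock()

    def indexer() -> None:
        while True:
            item = team.download_next()
            if item is None:
                return
            with claimed_lock:
                claimed.append(item[0])

    threads = [threading.Thread(target=indexer) for _ in range(n_indexers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        team = StorageTeam(Path(tmp))
        for i in range(5):
            team.add_package(f"pkg-{i:03d}.dat", b"page data")
        print(sorted(run_indexers(team, 3)))
```

Holding the lock for the whole read-and-move step is the simplest way to get the exclusivity the abstract asks for. A real deployment would also have to decide what happens when an indexer fails after a package has been moved, which the abstract does not cover.
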