Large-Scale Distributed Information Retrieval on the Web
Weiyi Meng 孟卫一
Department of Computer Science, State University of New York at Binghamton
July 9, 2012
萨师煊国际大数据分析与研究中心 (Sa Shixuan International Center for Big Data Analysis and Research) Summer Research Camp Seminar

About SUNY Binghamton
- Founded in 1946, after WWII.
- Located in Binghamton, a city in the Southern Tier of New York State.
- About 15,000 students (3,000 grad students).
- IBM was founded in Binghamton.
- One of the 4 University Centers of the SUNY system; the others are SUNY at Stony Brook, SUNY at Buffalo, and SUNY at Albany.
- For more information, see http://www2.binghamton.edu/features/premier/index.html

What is Information Retrieval?
Information retrieval (IR) is a computer science discipline for finding unstructured data (usually text documents) that satisfy an information need from within large collections that are stored on computers. In this seminar, we are going to extend this definition to include both unstructured and structured data.

What is Distributed Information Retrieval (DIR)?
It is a special branch of information retrieval where the data of the IR system are stored in multiple distributed locations/collections. In the Web environment, DIR deals with data that are distributed across many websites or web servers. Related terms for DIR: metasearch engine, federated search, web DB integration system.

The Scale: How Large?
It can be as large as the number of data sources on the Web. A survey (Madhavan et al. 2007) indicates there were about 50 million searchable Web data sources in 2007:
- 25 million for unstructured or less structured data (web pages, weibo, ...)
- 25 million for structured data (web databases)

Where do Web data reside?
Iceberg structure:
- A small fraction is on the Surface Web, with mostly static web pages that are crawlable by following hyperlinks. Publicly indexable portion: 40-60 billion pages.
- Most are in the Deep Web, with both structured data and less structured text documents hidden behind numerous search interfaces. About 1 trillion pages/records.

Two paradigms to provide integrated access to Web data
- Crawling-based: Gather Web data from various Web servers and/or search engines and build a search index for the gathered data.
  - Surface Web crawling
  - Deep Web crawling
- Metasearching-based (DIR-based): Integrate existing search engines into federated systems.
  - Metasearching text documents
  - Metasearching structured data by domain

Advantages of each approach
Crawling-based:
- Complete control over crawled data:
  - Can add metadata
  - Can link data from different sources in advance
  - Can create an archive gradually
- Complete control over retrieval techniques and ranking functions
- Fast response time
Metasearching-based:
- Capabilities of search engines can be leveraged
- Natural clustering of the data by individual search engines can be utilized
- Three-level query evaluation process (SE selection, SE retrieval, result merging) can lead to better effectiveness
- More likely to obtain fresher results

Disadvantages of each approach
Crawling-based:
- Deep Web crawling is difficult
- Often incomplete
- Many sites are not crawlable
- Loses semantics/structure of the data
- Cannot leverage search engines' capabilities
- Crawling delay leads to less up-to-date results
- Copyright and privacy issues
Metasearching-based:
- Performance depends on the quality of the search engines used
- May cause search engines to crash
- Access could be blocked by search engines
- No direct control of the data
- Slower response time

Conclusions?
Both technologies (crawling-based and metasearching-based) have unique value and they should co-exist. They actually complement each other!
Question: Is there an effective way to combine both technologies into a single platform?
Our seminar will focus on the metasearching (DIR)-based approach.

Two types of metasearching systems
Because structured and unstructured data have very different characteristics, they are often handled separately with different technologies:
- Metasearching systems for text documents (metasearch engines or DIR systems).
- Metasearching systems for structured data, each for a given domain (Web database integration systems).
We will first introduce large-scale metasearch engines and then introduce large-scale Web database integration systems. Due to limited time, we will focus on the challenges, including those that remain open, not on current solutions.

Large-Scale Metasearch Engines (MSE)
[Figure: A simple MSE architecture. The user submits a query through the user interface; the query dispatcher sends it to search engine 1, search engine 2, ..., search engine n, each searching its own text source (text source 1, text source 2, ..., text source n); the result merger combines the returned results for the user.]

What is a large-scale MSE?
A large-scale metasearch engine needs to satisfy ALL of the following requirements:
- It is a metasearch engine.
- It is connected to a large number (thousands or more) of component search engines.
- The component search engines are special-purpose search engines:
  - Covering a specific domain: news, sports, medicine, ...
  - Covering a specific organization: RenDa, IBM, ACM, ...
Why the third requirement? To retain the advantages in freshness and in searching the Deep Web.
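
The simple MSE architecture and the three-level query evaluation process (SE selection, SE retrieval, result merging) described above can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration and not a method from the seminar: the names `ComponentEngine`, `select_engines`, `merge_results`, and `metasearch`, the toy topic sets and documents, the term-overlap scoring, and the top-score normalization used for merging. Real component search engines would be remote services, each with its own ranking function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    engine: str
    doc: str
    score: float  # engine-local score, not comparable across engines


@dataclass
class ComponentEngine:
    """Hypothetical stand-in for a special-purpose component search engine."""
    name: str
    topics: set[str]   # toy description of the engine's domain
    docs: list[str]    # toy text source behind the engine

    def search(self, query: str) -> list[Result]:
        # Toy retrieval: score = number of query terms found in the document.
        terms = set(query.lower().split())
        hits = []
        for doc in self.docs:
            overlap = len(terms & set(doc.lower().split()))
            if overlap:
                hits.append(Result(self.name, doc, float(overlap)))
        return sorted(hits, key=lambda r: r.score, reverse=True)


def select_engines(query: str, engines: list[ComponentEngine],
                   limit: int = 2) -> list[ComponentEngine]:
    """Level 1 (SE selection): keep the engines whose topics match the query."""
    terms = set(query.lower().split())
    matching = [(len(terms & e.topics), e) for e in engines]
    matching = [pair for pair in matching if pair[0] > 0]
    matching.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in matching[:limit]]


def merge_results(result_lists: list[list[Result]]) -> list[Result]:
    """Level 3 (result merging): rescale each list by its top score, then sort."""
    merged = []
    for results in result_lists:
        if not results:
            continue
        top = results[0].score
        merged.extend(Result(r.engine, r.doc, r.score / top) for r in results)
    return sorted(merged, key=lambda r: r.score, reverse=True)


def metasearch(query: str, engines: list[ComponentEngine],
               limit: int = 2) -> list[Result]:
    selected = select_engines(query, engines, limit)
    # Level 2 (SE retrieval): the query dispatcher sends the query to each engine.
    result_lists = [engine.search(query) for engine in selected]
    return merge_results(result_lists)


# Toy component engines, one per domain (illustrative data only).
ENGINES = [
    ComponentEngine("news", {"news", "election", "economy"},
                    ["economy news today", "election results"]),
    ComponentEngine("sports", {"sports", "football", "olympics"},
                    ["olympics football final", "football transfer news"]),
    ComponentEngine("medicine", {"medicine", "flu", "vaccine"},
                    ["flu vaccine guidance"]),
]
```

Calling `metasearch("football news", ENGINES)` queries only the engines selected for the query and returns one merged list. With thousands of component engines, the selection step is what keeps the number of dispatched queries small.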