Sa Shixuan International Center for Big Data Analysis and Research (萨师煊国际大数据分析与研究中心)


Weiyi Meng (孟卫一)
Department of Computer Science
State University of New York at Binghamton
July 9, 2012

Large-Scale Distributed Information Retrieval on the Web

Sa Shixuan International Center for Big Data Analysis and Research, Summer Research Camp Seminar

About SUNY Binghamton
- Founded in 1946, after WWII.
- Located in Binghamton, a city in the Southern Tier of New York State.
- About 15,000 students (3,000 graduate students).
- IBM was founded in Binghamton.
- One of the four University Centers of the SUNY system; the others are SUNY at Stony Brook, SUNY at Buffalo, and SUNY at Albany.
- For more information, see http://www2.binghamton.edu/features/premier/index.html

What is Information Retrieval?

- Information retrieval (IR) is a computer science discipline for finding unstructured data (usually text documents) that satisfy an information need from within large collections stored on computers.
- In this seminar, we extend this definition to include both unstructured and structured data.

What is Distributed Information Retrieval (DIR)?
- It is a special branch of information retrieval in which the data of the IR system are stored in multiple distributed locations/collections.
- In the Web environment, DIR deals with data that are distributed across many websites or web servers.
- Related terms for DIR: metasearch engine, federated search, web DB integration system.

The Scale: How Large?
- It can be as large as the number of data sources on the Web.
- A survey (Madhavan et al. 2007) indicates there were about 50 million searchable Web data sources in 2007:
  - 25 million for un- or less structured data (web pages, weibo, etc.)
  - 25 million for structured data (web databases)

Where do Web data reside?
Iceberg structure:
- A small fraction is on the Surface Web, which consists mostly of static web pages that are crawlable by following hyperlinks.
  - Publicly indexable portion: 40-60 billion pages
- Most are in the Deep Web, with both structured data and less structured text documents hidden behind numerous search interfaces.
  - About 1 trillion pages/records

Two paradigms to provide integrated access to Web data
- Crawling-based: Gather Web data from various Web servers and/or search engines and build a search index for the gathered data.
  - Surface Web crawling
  - Deep Web crawling
- Metasearching-based (DIR-based): Integrate existing search engines into federated systems.
  - Metasearching text documents
  - Metasearching structured data by domain

Advantages of each approach
Crawling-based:
- Complete control over crawled data:
  - Can add metadata
  - Can link data from different sources in advance
  - Can create an archive gradually
- Complete control over retrieval techniques and ranking functions
- Fast response time
Metasearching-based:
- Capabilities of search engines can be leveraged
- Natural clustering of the data by individual search engines can be utilized
- Three-level query evaluation process (SE selection, SE retrieval, result merging) can lead to better effectiveness
- More likely to obtain fresher results

Disadvantages of each approach
Crawling-based:
- Deep Web crawling is difficult
- Often incomplete
- Many sites not crawlable
- Loses semantics/structure of the data
- Cannot leverage search engines' capabilities
- Crawling delay leads to less up-to-date results
- Copyright and privacy issues
Metasearching-based:
- Performance depends on the quality of the search engines used
- May cause search engines to crash
- Access could be blocked by search engines
- No direct control of the data
- Slower response time

Conclusions?
- Both technologies (crawling-based and metasearching-based) have unique value, and they should co-exist.
- They actually complement each other!
- Question: Is there an effective way to combine both technologies into a single platform?

Our seminar will focus on the metasearching (DIR)-based approach.

Two types of metasearching systems
- Because structured and unstructured data have very different characteristics, they are often handled separately with different technologies.
  - Metasearching systems for text documents (metasearch engines or DIR systems).
  - Metasearching systems for structured data, each for a given domain (Web database integration systems).
- We will first introduce large-scale metasearch engines and then introduce large-scale Web database integration systems.
- Due to limited time, we will focus on challenges and remaining challenges, not on current solutions.

Large-Scale Metasearch Engines (MSE)

[Figure: A simple MSE architecture. The user sends a query to the metasearch engine (user interface, query dispatcher, result merger), which forwards it to search engine 1, search engine 2, ..., search engine n, each built over its own text source 1, 2, ..., n; the merged result is returned to the user.]

What is a large-scale MSE?
- A large-scale metasearch engine needs to satisfy ALL of the following requirements:
  - It is a metasearch engine.
  - It is connected to a large number of (thousands or more) component search engines.
  - The component search engines are special-purpose s
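
As an illustration only (the seminar itself focuses on challenges rather than solutions), the components of the simple MSE architecture above, together with the three-level query evaluation process (SE selection, SE retrieval, result merging), can be sketched in Python. Everything in this sketch is an assumption made for illustration and is not taken from the seminar: the names select_engines, normalize, metasearch and make_engine, the toy in-memory engines, the term-overlap selection rule, and the min-max score normalization used for merging.

```python
from typing import Callable, Dict, List, Tuple

Result = Tuple[str, float]              # (document id, engine-local score)
Engine = Callable[[str], List[Result]]  # a component search engine


def make_engine(docs: Dict[str, str]) -> Engine:
    """Build a toy component engine that scores documents by term counts."""
    def search(query: str) -> List[Result]:
        terms = query.lower().split()
        hits = []
        for doc_id, text in docs.items():
            words = text.lower().split()
            score = float(sum(words.count(t) for t in terms))
            if score > 0:
                hits.append((doc_id, score))
        return hits
    return search


def select_engines(query: str, descriptions: Dict[str, str], k: int) -> List[str]:
    """SE selection: keep up to k engines whose description shares query terms."""
    terms = set(query.lower().split())
    overlap = {
        name: len(terms & set(text.lower().split()))
        for name, text in descriptions.items()
    }
    ranked = sorted(overlap, key=lambda name: (-overlap[name], name))
    return [name for name in ranked if overlap[name] > 0][:k]


def normalize(results: List[Result]) -> List[Result]:
    """Min-max scale one engine's scores to [0, 1] so engines are comparable."""
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [(doc, 1.0) for doc, _ in results]
    return [(doc, (score - lo) / (hi - lo)) for doc, score in results]


def metasearch(query: str, engines: Dict[str, Engine],
               descriptions: Dict[str, str], k: int = 2) -> List[Result]:
    """Query dispatcher plus result merger for the selected engines."""
    merged: Dict[str, float] = {}
    for name in select_engines(query, descriptions, k):   # SE selection
        local = engines[name](query)                       # SE retrieval
        for doc, score in normalize(local):                # result merging
            merged[doc] = max(score, merged.get(doc, 0.0))
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical component engines and their short descriptions.
ENGINES = {
    "news": make_engine({"n1": "web search news", "n2": "sports news"}),
    "papers": make_engine({"p1": "web search web retrieval",
                           "p2": "distributed search"}),
    "recipes": make_engine({"r1": "pasta recipes"}),
}
DESCRIPTIONS = {
    "news": "news articles web",
    "papers": "research papers search retrieval",
    "recipes": "cooking recipes",
}

if __name__ == "__main__":
    for doc_id, score in metasearch("web search", ENGINES, DESCRIPTIONS):
        print(doc_id, score)
```

In this sketch the "recipes" engine is never contacted for the query "web search", which is the point of the selection step: with thousands of component search engines, sending every query to every engine would be wasteful. Normalizing per engine before merging is one simple way to compare scores that come from different ranking functions; a real system would likely need something more careful.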
