Sa Shixuan International Big Data Analysis and Research Center: Lecture Slides


Large-Scale Distributed Information Retrieval on the Web
Weiyi Meng (孟卫一)
Department of Computer Science
State University of New York at Binghamton
July 9, 2012
Sa Shixuan International Big Data Analysis and Research Center (萨师煊国际大数据分析与研究中心)
Summer Research Camp Seminar

About SUNY Binghamton
- Founded in 1946 after WWII.
- Located in Binghamton, a city in the Southern Tier of New York State
- About 15,000 students (3,000 grad students)
- IBM was founded in Binghamton
- One of the 4 University Centers of the SUNY system; the others are SUNY at Stony Brook, SUNY at Buffalo, and SUNY at Albany.
- For more information, see

What is Information Retrieval?
- Information retrieval (IR) is a computer science discipline for finding unstructured data (usually text documents) that satisfy an information need from within large collections that are stored on computers.
- In this seminar, we are going to extend this definition to include both unstructured and structured data.

What is Distributed Information Retrieval (DIR)?
- It is a special branch of information retrieval where the data of the IR system are stored in multiple distributed locations/collections.
- In the Web environment, DIR deals with data that are distributed across many websites or web servers.
- Related terms for DIR: metasearch engine, federated search, web DB integration system

The Scale: How Large?
- It can be as large as the number of data sources on the Web.
- A 2007 survey (Madhavan et al. 2007) indicates there were about 50 million searchable Web data sources in 2007.
  - 25 million for un- or less structured data (web pages, weibo, ...)
  - 25 million for structured data (web databases)

Where do Web data reside?
Iceberg Structure:
- A small fraction is on the Surface Web with mostly static web pages that are crawlable by following hyperlinks.
  - Publicly indexable portion: 40-60 billion pages
- Most are in the Deep Web with both structured data and less structured text documents hidden behind numerous search interfaces.
  - About 1 trillion pages/records

Two paradigms to provide integrated access to Web data
- Crawling-based: Gather Web data from various Web servers and/or search engines and build a search index for the gathered data.
  - Surface Web crawling
  - Deep Web crawling
- Metasearching-based (DIR-based): Integrate existing search engines into federated systems.
  - Metasearching text documents
  - Metasearching structured data by domain

Advantages of each approach
Crawling-based:
- Complete control on crawled data:
  - Can add metadata
  - Can link data from different sources in advance
  - Can create an archive gradually
- Complete control on retrieving techniques and ranking functions
- Fast response time
Metasearching-based:
- Capabilities of search engines can be leveraged
- Natural clustering of the data by individual search engines can be utilized
- Three-level query evaluation process (SE selection, SE retrieval, result merging) can lead to better effectiveness
- More likely to obtain fresher results

Disadvantages of each approach
Crawling-based:
- Deep Web crawling is difficult
- Often incomplete
- Many sites not crawlable
- Lose semantics/structure of the data
- Cannot leverage search engines' capabilities
- Crawling delay leads to less up-to-date results
- Copyright and privacy issues
Metasearching-based:
- Performance depends on the quality of the search engines used
- May cause search engines to crash
- Access could be blocked by search engines
- No direct control of the data
- Slower response time

Conclusions?
- Both technologies (crawling-based and metasearching-based) have unique values and they should co-exist.
- They actually complement each other!
- Question: Is there an effective way to combine both technologies into a single platform?
Our seminar will focus on the metasearching (DIR)-based approach.

Two types of metasearching systems
- Because structured and unstructured data have very different characteristics, they are often handled separately with different technologies.
  - Metasearching systems for text documents (metasearch engines or DIR systems).
  - Metasearching systems for structured data, each for a given domain (Web database integration systems).
- We will first introduce large-scale metasearch engines and then introduce large-scale Web database integration systems.
- Due to limited time, we will focus on challenges and remaining challenges, not on current solutions.

Large-Scale Metasearch Engines (MSE)

[Figure: A simple MSE architecture. Components shown: user, user interface, query dispatcher, result merger, search engines 1 to n, and text sources 1 to n, with the labels "query" and "result".]

What is a large-scale MSE?
- A large-scale metasearch engine needs to satisfy ALL of the following requirements:
  - It is a metasearch engine.
  - It is connected to a large number of (thousands or more) component search engines.
  - The component search engines are special-purpose search engines
    - Covering a specific domain: news, sports, medicine, ...
    - Covering a specific organization: RenDa, IBM, ACM, ...
- Why the third requirement?
  - To retain the advantages on freshness and searching the deep Web.

Technical challenges with large-scale MSE
- Scalable and accurate search engine selection
  - Most search engines are useless for a given user query.
    - If only the best 10 results are wanted and there are 10,000 search engines, at least 9,990 of them are useless.
  - Using useless search engines is bad
    - Unnecessary network traffic
    - Waste resources of local search engines
    - Incur higher cost at the metasearch engine
    - Lead to poor effectiveness
  - How to identify the most appropriate search engines for any given query accurately and in a timely manner?
    - How to summarize a search engine's content (representative)?
    - How to collect the representatives?

    - How to use the representatives to perform selection?

Technical challenges (cont.)
- Automatic search engine inclusion into the metasearch engine
  - Automatic connection to search engines (automatic connection wrapper generation)
    - Submit queries and receive result pages via program
  - Automatic search result records (SRR) extraction (automatic extraction wrapper generation)
  - Automatic wrapper maintenance
    - Search engines may change the connection parameters and result presentation at any time

Technical challenges (cont.)
- Effective and efficient result merging
  - Autonomous component search engines likely employ different matching techniques between queries and documents (index techniques, weighting schemes, similarity functions, link-based ranking, etc.)
  - Local scores and ranks are generally not comparable
  - How to re-rank the results returned from different search engines into a single ranked list such that high effectiveness can be achieved in a speedy manner?

Large-scale MSE architecture
[Figure: Large-scale MSE architecture. The Metasearch Engine Construction Module covers Web search engine discovery (over the World Wide Web), the SE list, SE incorporation (automatic connection and result extraction), and search engine representatives generation. The Query Processing Module covers the search engine selector (backed by the global representative database), query dispatcher, result collector and extractor, and result merger, connecting the user query to Search Engine 1 through Search Engine m and returning the result.]

Two Recent Books (Monographs)
- W. Meng and C. Yu. Advanced Metasearch Engine Technology. Morgan & Claypool Publishers, December 2010. Table of contents:
  - Introduction
  - Metasearch engine architecture
  - Search engine selection
  - Search engine incorporation
  - Result merging
  - Summary and Future Research
- M. Shokouhi and L. Si. Federated Search. Foundations and Trends in Information Retrieval, 5(1), pp. 1-102, 2011. Table of contents:
  - Introduction
  - Collection representation
  - Collection selection
  - Result merging
  - Federated search testbeds
  - Conclusion and Future Research Challenges

Search Engine Selection (1)
- Problem: Given any user query and a set of search engines (or document collections), determine the search engines that match the user query the best.
- Basic solution:
  - Summarize the content of each search engine in advance.
  - For each user query, compare it with the search engine summaries and compute a matching score.
  - Rank search engines in descending order of their matching scores with the query and select the top-ranked search engines.

Search Engine Selection (2)
- Question 1: How to summarize the content of each search engine?
- Advanced solutions are statistics-based: One or more statistics for each term in the documents of a search engine.
- Some statistics used for a term t:
  - document frequency (df): The number of documents in the search engine that contain t.
  - collection frequency (cf): The number of search engines in a metasearch engine that contain t.
  - average normalized weight (anw): The average of the weights of t in all documents containing t in a search engine.
  - maximum normalized weight (mnw): The maximum of the weights of t in all documents in a search engine.

Search Engine Selection (3)
- Question 2: How to obtain the summaries of search engines?
- Two general scenarios:
  - Straightforward computation if the documents of the search engine are available.
  - Query-based sampling if the documents of the search engine are not directly available (i.e., deep web search engine).
- Many published solutions, but still not scalable to large-scale metasearch engines.

Search Engine Selection (4)
- Question 3: How to rank search engines for each user query?
- Sub-questions:
  - How to define a measure of usefulness of a search engine with respect to a query?
  - How to compute the measure very quickly (highly efficiently) in a large-scale metasearch engine?
- A large number of search engine selection algorithms have been proposed, but most are not very scalable.

Automatic Search Engine Incorporation
- Automatic connection to any search engine given its URL
  - Pass queries to the search engine programmatically.
  - Receive results from the search engine programmatically.
- Automatic extraction of retrieved search results
  - Extract the URLs and snippets of retrieved pages.
  - Extract the number of hits
  - Extract the URL pattern of the next page button.
- Automatic connection and extraction maintenance
  - Automatic failure detection

Automatic Search Engine Connection
- Extract connection parameters from the HTML form tag of each search engine.
- Apply HTTP request method (GET or POST) to perform connection.

Search form extraction: Difficulties
- Complex search forms with many control elements
- Ill-formatted HTML search forms
- Multiple search forms on the same page
- Search forms with JavaScript and/or CSS (Cascading Style Sheets)
- Search forms that have action redirections
- Search forms that utilize sessions/cookies
- Search engines that do not allow metasearching

Automatic Search Result Records (SRRs) Extraction (1)
- A search result record (SRR) consists of the returned information associated with a retrieved Web page.
  - URL of the page

  - Title of the page
  - A short summary of the page
  - Other misc.: size, date, category, ...
- Result pages often contain irrelevant information, such as that related to advertisements and the hosting organization, in addition to SRRs.

WebScales: Wrapper Generation
[Figure: screenshot with a region labeled "an SRR".]

[Figure: second screenshot with a region labeled "an SRR".]

Automatic SRR Extraction (2)
- Extract correct SRRs from returned response pages while discarding irrelevant information.
- The problem is to identify the rules (often called a wrapper) that can extract the correct SRRs.

Automatic SRR Extraction (3)
- General methodology
  - Utilize the tag strings/DOM trees/visual information on one or more result pages from the same search engine to mine extraction patterns.
  - Identify the minimal data-rich region/subtree that likely contains the SRRs.
  - Identify separator(s) that separate different SRRs.
- More recent solutions use more visual information on result pages.
- Still cannot handle complex result pages well (JavaScript, multiple columns, multiple sections, multiple SRR formats)

Result Merging (1)
Problem: Merge returned documents from multiple sources into a single ranked list.
Difficulties
- Full documents of search results are not available or too expensive to download and analyze on the fly.
- Local similarities (thus local ranks) are usually not comparable due to
  - different similarity functions
  - different term weighting schemes
  - different statistical values, e.g., global idf vs. local idf

Result Merging (2)
- A large number of solutions have been proposed to perform result merging.
  - Some use local similarities associated with each result (modern search engines no longer provide the information).
  - Some use local ranks of search results.
  - Some analyze downloaded full documents.
  - Some use the titles and snippets of the search results.
  - Some consider the quality of the search engine used.
  - Some consider whether a result is retrieved from multiple search engines.
  - Some use a sample set of documents from each search engine

Result Merging (3)
Information that could be utilized for result merging:
- Local similarity or local rank of each result
- Title of each result
- Snippet of each result
- Publication time of each result
- Organization/person who published the result (from URL)
- Size of each result
- Number of search engines that returned the result
- Ranking scores of the search engines that returned the result
- Full content of each result (or some of the results)
- PageRank or number of backlinks of each result
- A sample set of documents from each search engine

Remaining Research Challenges (1)
- Search engine summary generation and maintenance
  - Query-based sampling methods have not been shown to be practically viable for a large number of truly autonomous search engines.
  - Certain statistics used by some search engine selection algorithms, such as the maximum normalized weight, are still too expensive to collect, because collecting them may require submitting a substantial number of queries to cover a significant portion of the vocabulary of a search engine.
  - The important issue of how to effectively maintain the quality of summaries for search engines whose contents may change over time has started to get attention only recently, and more investigation into this issue is needed.

Remaining Research Challenges (2)
- Automatic search engine connection with complex search forms.
  - More and more search engines are employing more advanced tools to program their search forms. For example, more and more search forms now use JavaScript.

    Some search engines also include cookies and session IDs in their connection mechanism. These complexities make it significantly more difficult to automatically extract all needed connection information.

Remaining Research Challenges (3)
- Automatic maintenance.
  - Search engines used by metasearch engines may make various changes due to upgrades or other reasons. Possible changes include search form changes, query format changes, and result display format changes.
  - These changes can make the search engines unusable in the metasearch engines unless necessary adjustments are made automatically.
  - Automatic metasearch engine maintenance is critical for the smooth operation of a large-scale metasearch engine, but this important problem remains largely unsolved. There are mainly two issues:
    - detect and differentiate various changes automatically
    - fix the problem for each type of change automatically

Remaining Research Challenges (4)
- More advanced result merging algorithm. No existing solution has explored all of the following factors:
  - Local similarity or local rank of each result
  - Title of each result
  - Snippet of each result
  - Publication time of each result
  - Organization/person who published the result (from URL)
  - Size of each result
  - Number of search engines that returned the result
  - Ranking scores of the search engines that returned the result
  - Full content of each result (or some of the results)
  - PageRank or number of backlinks of each result
  - A sample set of documents from each search engine

Remaining Research Challenges (5)
- Building a truly large-scale metasearch engine
  - The number of specialized document search engines on the Web was estimated to be over 25 million (Madhavan et al. 2007).
  - A metasearch engine that can connect to a high percentage of these search engines, if built, will likely give Web users unprecedented search coverage of the Web.
  - Building a metasearch engine of such a scale involves many technical challenges beyond those we have already discussed.
    - How to identify all document search engines?
    - How to measure the quality of these search engines (some of them may have very poor quality and should not be used)?
    - How to identify and remove redundant search engines among them?
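
The basic solution on the Search Engine Selection (1) slide (summarize each search engine in advance, score each summary against the query, rank, and keep the top engines) can be sketched roughly as follows. This is only an illustration under stated assumptions: the summary holds nothing but the document frequency (df) statistic, the matching score is a made-up stand-in rather than any of the published selection algorithms, and every name in the code (build_summary, matching_score, select_engines, the toy engines) is hypothetical.

```python
from collections import Counter


def build_summary(documents):
    """Summarize one search engine: document frequency (df) per term plus collection size."""
    df = Counter()
    for doc in documents:
        # Count each term once per document, which is what df measures.
        df.update(set(doc.lower().split()))
    return {"df": df, "n_docs": len(documents)}


def matching_score(query, summary):
    """Assumed score: sum over query terms of the fraction of documents containing the term."""
    if summary["n_docs"] == 0:
        return 0.0
    return sum(summary["df"][term] / summary["n_docs"] for term in query.lower().split())


def select_engines(query, summaries, k=2):
    """Rank engines by descending matching score and keep the top k."""
    ranked = sorted(
        summaries,
        key=lambda name: matching_score(query, summaries[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy collections standing in for three special-purpose search engines.
engines = {
    "news": ["election results announced", "stock market news today"],
    "sports": ["football match results", "tennis final results", "football transfer news"],
    "medicine": ["flu vaccine trial", "heart disease study"],
}
summaries = {name: build_summary(docs) for name, docs in engines.items()}
selected = select_engines("football results", summaries, k=2)
print(selected)
```

With these toy collections the sports engine scores highest and the medicine engine scores zero, so it is never queried. The sketch computes summaries directly from the documents; for deep web engines the slides note that the summary would have to come from query-based sampling instead.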

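
Two of the signals listed on the Result Merging (3) slide, the local rank of each result and the number of search engines that returned it, can be combined without any local similarity scores. The sketch below uses reciprocal rank fusion, a generic rank-fusion technique that is not taken from the slides; the constant k=60 is the value commonly used with it, and the function name merge_results and the engine and URL names are hypothetical. It assumes each engine returns a given URL at most once.

```python
from collections import defaultdict


def merge_results(ranked_lists, k=60):
    """Merge per-engine ranked URL lists into one list using reciprocal rank fusion."""
    scores = defaultdict(float)
    for urls in ranked_lists.values():
        for rank, url in enumerate(urls, start=1):
            # A result gains more from a high local rank, and gains again
            # every time another engine also returns it.
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by URL so the order is deterministic.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Ranked result lists as they might come back from three component engines.
local_results = {
    "engine_a": ["u1", "u2", "u3"],
    "engine_b": ["u2", "u4"],
    "engine_c": ["u5", "u2", "u1"],
}
merged = merge_results(local_results)
print(merged)
```

Because only ranks are used, the differences in similarity functions and term weighting schemes between engines do not matter here. The other signals on the slide (titles, snippets, publication time, engine quality) are ignored in this sketch.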