Search Engine Technology


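Yan Hongfei (闫宏飞), Network Laboratory, Department of Computer Science, Peking University
December 24, 2004, CERNET2004

Outline
- How search engines work
- Information retrieval: related research and institutions

Web Search Engines
- Definition: a system that lets users submit queries, retrieves the web pages relevant to each query, and returns them as a ranked result list
- Index creation methods
  - Manual indexing
  - Automatic indexing
- System architecture
  - Centralized architecture
  - Distributed architecture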

[Figure labels: Browsing Services; Search Engine Services; Web Pages; Bag of Words; Two semantics extremes; Two service extremes; "?"]

The Three-Stage Workflow of a Search Engine
- Crawling: batch crawling and incremental crawling; crawl targets and crawl strategies
- Preprocessing: keyword extraction; duplicate page elimination; link analysis; indexing
- Serving: query modes and matching; result ranking; document summaries
[Figure labels: crawl, organize, serve]

Search Engine System Flow
[Figure]

Tianwang Search Engine System Flow
[Figure]

Architecture of a Distributed Web Crawling System
[Figure labels: coordinator process (node) and crawler process, three of each; scheduling module]

Tianwang Storage Format

version: 1.0                          // version number
url: http:/                           // URL
origin: http:/                        // original URL
date: Tue, 15 Apr 2003 08:13:06 GMT   // time of harvest
ip: 162.105.129.12                    // IP address
unzip-length: 30233                   // if included, the data must be compressed
length: 18133                         // data length
                                      // a blank line
XXXXXXXX                              // the following is the data part
XXXXXXXX
....
XXXXXXXX                              // data end
                                      // insert a new line

File Organizations (Indexes)
Choices for accessing data during query evaluation:
- Scan the entire collection
  - Typical in early (batch) retrieval systems
  - Computational and I/O costs are O(characters in collection)
  - Practical only for "small" text collections
  - Large-memory systems make scanning feasible
- Use indexes for direct access
  - Evaluation time is O(query term occurrences in collection)
  - Practical for "large" collections
  - Many opportunities for optimization
- Hybrids: use a small index, then scan a subset of the collection
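Returning to the Tianwang storage format shown above, a record reader might look like the sketch below. It is only an illustration under stated assumptions: the slide says that data accompanied by an unzip-length field is compressed but does not name the algorithm, so zlib is assumed here, and the function name read_record and the example URL are hypothetical.

```python
import io
import zlib


def read_record(stream):
    """Read one record: header lines, a blank line, `length` bytes of data, a newline."""
    header = {}
    while True:
        line = stream.readline()
        if not line:
            return None  # end of input, no further record
        line = line.decode("utf-8").rstrip("\r\n")
        if line == "":
            break  # the blank line separates the header from the data part
        key, _, value = line.partition(":")  # split at the first colon only
        header[key.strip()] = value.strip()
    data = stream.read(int(header["length"]))
    stream.readline()  # consume the newline inserted after the data part
    if "unzip-length" in header:
        # Assumption: zlib; the slide only states that the data is compressed.
        data = zlib.decompress(data)
    return header, data


# Build a small record in the same layout (hypothetical URL and content).
page = b"<html>hello</html>"
packed = zlib.compress(page)
raw = (
    b"version: 1.0\n"
    b"url: http://example.com/\n"
    b"date: Tue, 15 Apr 2003 08:13:06 GMT\n"
    b"unzip-length: %d\n"
    b"length: %d\n"
    b"\n" % (len(page), len(packed))
) + packed + b"\n"
header, data = read_record(io.BytesIO(raw))
```

Because the length field gives the exact byte count of the data part, the reader never has to search the page content for a delimiter, which is what makes concatenating many records in one file practical.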

Indexes
What should the index contain?
- Database systems index primary and secondary keys
  - This is the hybrid approach
  - The index provides fast access to a subset of database records
  - Scan the subset to find the solution set
- IR problem: we cannot predict the keys that people will use in queries
  - Every word in a document is a potential search term
- IR solution: index by all keys (words), i.e. full-text indexes

Index Contents
The contents depend on the retrieval model.
- Feature presence/absence
  - Boolean
  - Statistical (tf, df, ctf, doclen, maxtf)
  - Often about 10% of the size of the raw data, compressed
- Positional
  - Feature location within the document
  - Granularities include word, sentence, paragraph, etc.
  - Coarse granularities are less precise but take less space
  - Word-level granularity is about 20-30% of the size of the raw data, compressed
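To make the statistical index contents concrete, the sketch below computes them for a toy collection. It reads the abbreviations in their usual IR sense (tf: term frequency within a document; df: number of documents containing the term; ctf: term frequency over the whole collection; doclen: document length in tokens; maxtf: largest tf in a document); the slides do not spell these out, so this reading is an assumption, as are the function name and the two sample documents.

```python
from collections import Counter


def index_statistics(docs):
    """Compute per-document and per-term statistics for a {doc_id: text} collection."""
    tf = {}          # tf[doc_id][term]: occurrences of the term in that document
    df = Counter()   # df[term]: number of documents that contain the term
    ctf = Counter()  # ctf[term]: occurrences of the term in the whole collection
    doclen = {}      # doclen[doc_id]: number of tokens in the document
    maxtf = {}       # maxtf[doc_id]: highest tf of any term in the document
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())  # whitespace tokenization only
        tf[doc_id] = counts
        doclen[doc_id] = sum(counts.values())
        maxtf[doc_id] = max(counts.values(), default=0)
        df.update(counts.keys())  # each distinct term counts once per document
        ctf.update(counts)        # every occurrence counts
    return tf, df, ctf, doclen, maxtf


docs = {"d1": "web search engine", "d2": "search search index"}
tf, df, ctf, doclen, maxtf = index_statistics(docs)
print(tf["d2"]["search"], df["search"], ctf["search"], doclen["d1"], maxtf["d2"])
# prints: 2 2 3 3 2
```

None of these values requires word positions, which is consistent with the slide's point that a presence/absence index is much smaller than a positional one.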

Indexes: Implementation
- Common implementations of indexes
  - Bitmaps
  - Signature files
  - Inverted files
- Common index components
  - Dictionary (lexicon)
  - Postings
    - Document ids
    - Word positions

No positional data indexed

Inverted Files
[Figure]

Inverted Files
[Figure]

Word-Level Inverted File
[Figure]

Inverted Search Algorithm
1. Find query elements (terms) in the lexicon
2. Retrieve postings for each lexicon entry
3. Manipulate postings according to the retrieval model
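The three steps above can be sketched with a word-level inverted file and three ways of manipulating the postings: Boolean AND, phrase matching, and vector-space ranking. This is a minimal illustration, not the implementation behind the slides. The six toy documents are an assumption (lines of the "Pease porridge" nursery rhyme, since the slides' own collection is not preserved in the text), and the weighting uses raw tf times log(N/df), one common reading of "original tf, idf".

```python
import math
from collections import defaultdict


def build_index(docs):
    """Word-level inverted file: term -> {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, terms):
    """Documents that contain every query term."""
    postings = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []


def phrase(index, terms):
    """Documents in which the terms occur at consecutive word positions."""
    hits = []
    for doc_id in boolean_and(index, terms):
        starts = index[terms[0]][doc_id]
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
               for p in starts):
            hits.append(doc_id)
    return hits


def vsm(index, terms, n_docs):
    """Rank documents by cosine similarity using raw tf * idf weights."""
    idf = {t: math.log(n_docs / len(p)) for t, p in index.items()}
    doc_vecs = defaultdict(dict)
    for t, p in index.items():
        for doc_id, positions in p.items():
            doc_vecs[doc_id][t] = len(positions) * idf[t]
    q_vec = {t: idf[t] for t in terms if t in idf}
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    scores = {}
    for doc_id, vec in doc_vecs.items():
        dot = sum(w * vec.get(t, 0.0) for t, w in q_vec.items())
        if dot > 0:  # skip documents that share no weighted term with the query
            d_norm = math.sqrt(sum(w * w for w in vec.values()))
            scores[doc_id] = dot / (d_norm * q_norm)
    return sorted(scores, key=scores.get, reverse=True)


docs = {  # assumed toy collection, punctuation removed
    "d1": "pease porridge hot pease porridge cold",
    "d2": "pease porridge in the pot",
    "d3": "nine days old",
    "d4": "some like it hot some like it cold",
    "d5": "some like it in the pot",
    "d6": "nine days old",
}
index = build_index(docs)
print(boolean_and(index, ["porridge", "pot"]))  # ['d2']
print(phrase(index, ["porridge", "pot"]))       # []
print(vsm(index, ["porridge", "pot"], len(docs)))  # ['d2', 'd1', 'd5']
```

The phrase query is the only one that needs the word positions stored in the postings; the Boolean and vector-space queries would work equally well on a document-level inverted file.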

Word-Level Inverted File
Queries:
1. porridge & pot (BOOL)
2. "porridge pot" (BOOL)
3. porridge pot (VSM)
[Figure labels: lexicon, posting, answer]

Outline
- How search engines work
- Information retrieval: related research and institutions

A Brief History of Modern Information Retrieval
- In 1945, Vannevar Bush published "As We May Think" in The Atlantic Monthly.
- In the 1960s, Gerard Salton and his students developed the SMART system, and Cyril Cleverdon carried out the Cranfield evaluations.
- The 1970s and 1980s saw many developments built on the advances of the 1960s.
- In 1992, the Text Retrieval Conference (TREC) began.
- From 1996, the algorithms developed in IR were employed for searching the Web.

Clustering of SIGIR papers by topic vs. year
[Figure]
- Question answering

- Clustering
- Inverted files & implementations
- Message understanding & TDT
- Filtering
- Hypertext IR, multiple evidence
- Probabilistic & language models
- Distributed IR
- Evaluation
- Topic distillation & linkage retrieval

- Text categorisation
- Document summarisation
- Cross-lingual

Information Retrieval: Related Research and Institutions
- CIIR, University of Massachusetts
- LTI, Carnegie Mellon University
- The Stanford University DB Group
- Microsoft Research Asia
- TREC
- Tianwang Group, Network Laboratory, Peking University

Introduction to Lemur
http://www-2.cs.cmu.edu/lemur/

Lemur Toolkit
- Goal: a research system for advancing language modeling (LM) and IR research
  - Ad hoc retrieval, distributed retrieval, cross-language IR, summarization, filtering, and classification
- Features:
  - Index construction for large-scale document databases
  - Simple language models
  - Retrieval systems based on language models and several other retrieval models
- Implementation:
  - C and C++
  - Unix / Windows
  - Current version 3.1

MRA: Towards Next Generation Web Search
- From pages to blocks
  - Analyze the Web at finer granularity
- From surface Web to deep Web
  - Unleash the huge assets of high-value information
- From unstructured to structured
  - Provide well-organized results
- From relevance to intelligence
  - Contribute knowledge discovery with search
- From desktop search to mobile search
  - Bridge physical-world search to digital-world search

The Stanford Univ. DB Group
- WebBase: crawling, storage, indexing, and querying of large collections of Web pages
- Digital Libraries: infrastructure and services for creating, disseminating, sharing, and managing information

TREC Conference
- Established in 1992 to evaluate large-scale IR
  - Retrieving documents from a gigabyte collection
- Has run continuously since then
  - The TREC 2004 (13th) meeting is in November
- Run by NIST's Information Access Division
- Probably the best-known IR evaluation setting
  - Started with 25 participating organizations in the 1992 evaluation
  - In 2003, there were 93 groups from 22 different countries
- Proceedings available online (http://trec.nist.gov)
  - Overview of TREC 2003 at http://trec.nist.gov/pubs/trec12/papers/OVERVIEW.12.pdf

TREC General Format
- TREC consists of IR research tracks
  - Ad hoc, routing, confusion (scanned documents, speech recognition), video, filtering, multilingual (cross-language, Spanish, Chinese), question answering, novelty, high precision, interactive, Web, database merging, NLP, ...
- Each track works on roughly the same model
  - November: track approved by the TREC community
  - Winter: track members finalize the format for the track
  - Spring: researchers train systems based on the specification
  - Summer: researchers carry out the formal evaluation
    - Usually a "blind" evaluation: researchers do not know the answers
  - Fall: NIST carries out the evaluation
  - November: group meeting (TREC) to find out:
    - How well your site did
    - How others tackled the problem
- Many tracks are run by volunteers outside of NIST (e.g. Web)
- "Coopetition" model of evaluation
  - Successful approaches are generally adopted in the next cycle

TREC Tracks
[Figure]

Summary of VLC/Web Track evaluation 1996 - 2003
[Figure]

Tianwang Group PKU
[Figure]
http:/ = 28.4%

Teams That Submitted Results

TEAM                                                                            NAME      TD-RUNS  NPHP-RUNS
APEX Lab, Shanghai Jiao Tong University                                         APEX      5        5
Institute of Computer Science and Technology, Peking University                 ANS       3        2
TRS (company)                                                                   TRS       5        2
South China University of Technology, Mumian Team 1                             MUMIAN1   3        1
South China University of Technology, Mumian Team 2                             MUMIAN2   2        1
Database Application Lab, School of Computer Science, South China Univ. of Tech. SCUTDB   5        5
High School Affiliated to Fujian Normal University                              WLL       1

Note: the pooling also included the retrieval results of five search engines: google, yisou, baidu, sogou, and zhongsou.

Evaluation Results
[Figure labels: topic distillation; navigational search. TIANWANG_RUN is shown for reference only.]

Summary
- How search engines work
- Information retrieval: related research and institutions

Thank you!

Vector Space Model
A document d and a query q are represented in the vector space as two m-dimensional vectors. The weight of each dimension is its TF-IDF value, and the similarity of d and q is measured by the cosine of the angle between the two vectors:

\[ \mathrm{sim}(d, q) = \frac{\sum_{i=1}^{m} w_{i,d}\, w_{i,q}}{\sqrt{\sum_{i=1}^{m} w_{i,d}^{2}}\,\sqrt{\sum_{i=1}^{m} w_{i,q}^{2}}} \]

(using the original tf and idf formulas)

Query                        Answer
1. porridge & pot (BOOL)     d2
2. "porridge pot" (BOOL)     null
3. porridge pot (VSM)        d2, d1, d5

CIIR: Center for Intelligent Information Retrieval, UMASS
One of the leading research groups in IR:
- Improving the probabilistic models; first description of a retrieval system based on statistical language models
- Introduced and improved a number of techniques for text and query representation
- Automatically representing databases and combining local searches for distributed IR (DIR)
- First high-capacity probabilistic filtering architecture
- Defined and evaluated the first versions of event detection and tracking software
- Earliest research on ranking and representation techniques for Asian languages
- First approaches to information extraction that emphasized learning
- Novel techniques for indexing images and video

CIIR cont.
- Research
  - More than 500 journal and refereed conference papers over the past 12 years (52 submissions in 2003)
  - Industrial and government collaboration
- INQUERY
  - Software licensed to nearly 300 sites
- Education
  - 20 Ph.D.s, 29 M.S.
  - 123/145, 34/4 graduate/undergraduate

CIIR cont.
- Personnel
  - Faculty: 4 (W. Bruce Croft)
  - Technical personnel: 10
  - Graduate students: 34/10
- Groups
  - IESL: Information Extraction and Synthesis Laboratory
  - IR: Information Retrieval Laboratory
  - MIR: Multimedia Indexing and Retrieval Laboratory
- The CIIR is currently concentrating on the unsolved long-term research problems that underlie effective information retrieval: text representation, query acquisition, and retrieval models

LTI: Language Technologies Institute, CMU
- Machine Translation, Natural Language Processing, Speech, and Information Retrieval
- IR projects (Jamie Callan and Yiming Yang)
  - Adaptive Information Filtering
  - Distributed Information Retrieval / Federated Search
  - Email Classification and Prioritization
  - Minerva: Web Mining for Question Answering
  - MuchMore: Translingual Information Retrieval
  - JAVELIN: Open-Domain Question Answering
