Dragon Star Program Course: Information Retrieval
Overview of Text Retrieval: Part 2

ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, czhai@cs.uiuc.edu
2008 ChengXiang Zhai, Dragon Star Lecture at Beijing University, June 21-30, 2008

Outline
- Other retrieval models
- Implementation of a TR system
- Applications of TR techniques

P-norm (Extended Boolean) (Salton et al. 83)
- Motivation: how to rank documents with a Boolean query?
- Intuitions:
  - Docs satisfying the query constraint should get the highest ranks
  - Partial satisfaction of the query constraint can be used to rank the other docs
- Question: how to capture "partial satisfaction"?

P-norm: Basic Ideas
- Normalize term weights for the doc representation to [0,1]
- Define the similarity between a Boolean query and a doc vector
[Figure: a doc (x,y) in the unit square with corners (0,0), (1,0), (0,1), (1,1); for Q = T1 AND T2 the ideal point is (1,1), for Q = T1 OR T2 the point to avoid is (0,0)]

P-norm: Formulas
- Since the similarity value is normalized to [0,1], the two formulas (reconstructed below) can be applied recursively
- 1 ≤ p ≤ +∞ interpolates between the vector space model and Boolean/fuzzy logic
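The P-norm formulas themselves did not survive extraction. A standard reconstruction from Salton et al. (1983), for a document with normalized weights x and y for terms T1 and T2:

```latex
% P-norm similarity (reconstruction): OR rewards distance from (0,0),
% AND rewards closeness to (1,1).
\mathrm{sim}(T_1 \,\mathrm{OR}\, T_2,\, d) = \left( \frac{x^p + y^p}{2} \right)^{1/p}
\qquad
\mathrm{sim}(T_1 \,\mathrm{AND}\, T_2,\, d) = 1 - \left( \frac{(1-x)^p + (1-y)^p}{2} \right)^{1/p}
```

At p = 1 both reduce to (x + y)/2, i.e., vector-space-style averaging; as p → +∞ they approach max(x, y) and min(x, y), i.e., fuzzy-logic OR/AND.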
P-norm: Summary
- A general (and elegant) similarity function for a Boolean query and a regular document vector
- Connects the Boolean model and the vector space model, with models in between
- Allows different "confidence" on Boolean operators (a different p for different operators)
- A model worth more exploration (how to learn optimal p values from feedback?)
Probabilistic Retrieval Models

Overview of Retrieval Models
[Taxonomy, flattened from a diagram:]
- Relevance ≈ similarity(Rep(q), Rep(d)): different representations & similarity measures
  - Vector space model (Salton et al., 75)
  - Prob. distr. model (Wong & Yao, 89)
- Relevance as P(r=1|q,d), r ∈ {0,1}: probability of relevance
  - Regression model (Fox 83)
  - Generative models:
    - Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
    - Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
- Relevance as P(d→q) or P(q→d): probabilistic inference, different inference systems
  - Prob. concept space model (Wong & Yao, 95)
  - Inference network model (Turtle & Croft, 91)
- Learn to rank (Joachims 02; Burges et al. 05)

The Basic Question
- What is the probability that THIS document is relevant to THIS query?
- Formally, 3 random variables: query Q, document D, relevance R ∈ {0,1}
- Given a particular query q and a particular document d, p(R=1|Q=q,D=d) = ?

Probability of Relevance
- Goal: rank D based on P(R=1|Q,D)
- We only need to compare P(R=1|Q,D1) with P(R=1|Q,D2), i.e., rank the documents
- There are several different ways to refine P(R=1|Q,D)
Refining P(R=1|Q,D), Method 1: Conditional Models
- Basic idea: relevance depends on how well a query matches a document
- Define features on Q × D, e.g., # matched terms, the highest IDF of a matched term, document length, ...
- P(R=1|Q,D) = g(f1(Q,D), f2(Q,D), ..., fn(Q,D), θ)
- Use training data (known relevance judgments) to estimate the parameters θ
- Apply the model to rank new documents
- Special case: logistic regression

Logistic Regression (Cooper 92, Gey 94)
- Model the logit of P(R=1|Q,D) as a linear function of the features; the logistic (sigmoid) function maps the result back into [0,1]
- Uses 6 features X1, ..., X6
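The regression equations were lost in extraction; this is the standard form implied by "logit function" and "logistic (sigmoid) function", with coefficients β fit on relevance judgments:

```latex
% Logit of the relevance probability as a linear function of the features:
\log \frac{P(R=1 \mid Q,D)}{1 - P(R=1 \mid Q,D)} = \beta_0 + \sum_{i=1}^{6} \beta_i X_i
% Equivalently, through the logistic (sigmoid) function:
P(R=1 \mid Q,D) = \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{i=1}^{6} \beta_i X_i\right)}
```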
Features/Attributes
- Average absolute query frequency
- Query length
- Average absolute document frequency
- Document length
- Average inverse document frequency
- Inverse document frequency
- Number of terms in common between query and document (logged)

Logistic Regression: Pros & Cons
- Advantages:
  - An absolute probability of relevance is available
  - May reuse all past relevance judgments
- Problems:
  - Performance is very sensitive to the selection of features
  - Not much guidance on feature selection
- In practice, performance tends to be average

Refining P(R=1|Q,D), Method 2: Generative Models
- Basic idea: define P(Q,D|R), then compute the odds O(R=1|Q,D) using Bayes' rule
- Special cases:
  - Document "generation": P(Q,D|R) = P(D|Q,R) P(Q|R), where P(Q|R) can be ignored for ranking D
  - Query "generation": P(Q,D|R) = P(Q|D,R) P(D|R)
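The Bayes-rule step, reconstructed in its standard odds form (the slide's equations did not survive extraction):

```latex
% Odds of relevance via Bayes' rule; the prior odds do not depend on D,
% so they can be ignored for ranking.
O(R{=}1 \mid Q,D) = \frac{P(R{=}1 \mid Q,D)}{P(R{=}0 \mid Q,D)}
= \frac{P(Q,D \mid R{=}1)}{P(Q,D \mid R{=}0)} \cdot \frac{P(R{=}1)}{P(R{=}0)}
```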
Document Generation
- Uses two models: a model of relevant docs for Q and a model of non-relevant docs for Q
- Assume independent attributes A1...Ak (why?)
- Let D = d1...dk, where di ∈ {0,1} is the value of attribute Ai (similarly Q = q1...qk)
- Assume non-query terms are equally likely to appear in relevant and non-relevant docs

Robertson-Sparck Jones Model (Robertson & Sparck Jones 76)
- Two parameters for each term Ai:
  - pi = P(Ai=1|Q,R=1): the probability that term Ai occurs in a relevant doc
  - qi = P(Ai=1|Q,R=0): the probability that term Ai occurs in a non-relevant doc
- How to estimate the parameters? Suppose we have relevance judgments; the "+0.5" and "+1" in the estimates can be justified by Bayesian estimation
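The RSJ formulas, reconstructed in their standard form (ri = # relevant docs containing Ai, R = total # relevant docs, ni = # docs containing Ai, N = collection size):

```latex
% Rank-equivalent RSJ retrieval formula over terms present in both q and d:
\log O(R{=}1 \mid Q,D) \;\stackrel{rank}{=}\; \sum_{i:\; d_i = q_i = 1} \log \frac{p_i (1-q_i)}{q_i (1-p_i)}
% Smoothed estimates from relevance judgments (the "+0.5" and "+1"):
\hat{p}_i = \frac{r_i + 0.5}{R + 1}, \qquad \hat{q}_i = \frac{n_i - r_i + 0.5}{N - R + 1}
```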
RSJ Model: No Relevance Info (Croft & Harper 79)
- How to estimate the parameters when we have no relevance judgments?
  - Assume pi to be a constant
  - Estimate qi by assuming all documents to be non-relevant
- N: # documents in the collection; ni: # documents in which term Ai occurs
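Under these assumptions, the RSJ weight, in its usual reconstructed form, becomes an IDF-like weight:

```latex
% Croft & Harper 79: a constant p_i and q_i estimated from the whole
% collection yield
\mathrm{score}(Q,D) \;\stackrel{rank}{=}\; \sum_{i:\; d_i = q_i = 1} \log \frac{N - n_i + 0.5}{n_i + 0.5}
```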
RSJ Model: Summary
- The most important classic probabilistic IR model
- Uses only term presence/absence, and is thus also referred to as the Binary Independence Model
- Essentially Naive Bayes for document ranking
- Most natural for relevance/pseudo feedback
- Without relevance judgments, the model parameters must be estimated in an ad hoc way
- Performance isn't as good as a tuned VS model

Improving RSJ: Adding TF
- Let D = d1...dk, where di is now the frequency count of term Ai
- Basic document generation model: the 2-Poisson mixture model
- Many more parameters to estimate! (how many exactly?)
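A sketch of the 2-Poisson mixture (reconstruction; the slide's formula was lost): term frequency is Poisson with a high rate in "elite" documents and a low rate elsewhere:

```latex
% Mixture of two Poissons for the frequency d_k of term A_k:
P(d_k) = \pi\, \frac{\lambda_1^{d_k} e^{-\lambda_1}}{d_k!} + (1 - \pi)\, \frac{\lambda_2^{d_k} e^{-\lambda_2}}{d_k!}
```

Each term now needs two rates and a mixing weight, for each relevance class, which is where the "many more parameters" come from.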
BM25/Okapi Approximation (Robertson et al. 94)
- Idea: approximate p(R=1|Q,D) with a simpler function that shares similar properties
- Observations: log O(R=1|Q,D) is a sum of term weights Wi, where
  - Wi = 0 if TFi = 0
  - Wi increases monotonically with TFi
  - Wi has an asymptotic limit
- The simple function used is a saturating TF transformation with these same properties (its exact form was lost in extraction; see the BM25 formula below)

Adding Doc Length & Query TF
- Incorporating doc length:
  - Motivation: the 2-Poisson model assumes equal document lengths
  - Implementation: "carefully" penalize long docs
- Incorporating query TF:
  - Motivation: appears not to be well justified
  - Implementation: a similar TF transformation
- The final formula is called BM25, achieving top TREC performance
The BM25 Formula
- The document TF component is known as the "Okapi TF/BM25 TF" (the full formula, lost in extraction, is reconstructed below)
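BM25 as commonly written (Robertson et al. 94), with c(w,d) the count of w in d, |d| the document length, avdl the average document length, and typical settings k1 ∈ [1.2, 2], b ≈ 0.75:

```latex
\mathrm{BM25}(q,d) = \sum_{w \in q \cap d}
\underbrace{\log \frac{N - n_w + 0.5}{n_w + 0.5}}_{\text{RSJ/IDF weight}}
\cdot
\underbrace{\frac{(k_1 + 1)\, c(w,d)}{k_1 \left( (1-b) + b \frac{|d|}{avdl} \right) + c(w,d)}}_{\text{Okapi TF / BM25 TF}}
\cdot
\underbrace{\frac{(k_3 + 1)\, c(w,q)}{k_3 + c(w,q)}}_{\text{query TF}}
```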
Extensions of "Doc Generation" Models
- Capture term dependence (Rijsbergen & Harper 78)
- Alternative ways to incorporate TF (Croft 83, Kalt 96)
- Feature/term selection for feedback (Okapi's TREC reports)
- Other possibilities (machine learning)

Query Generation
- Assuming a uniform document prior, ranking reduces to the query likelihood p(q|d)
- Now, the question is how to compute p(q|d); generally this involves two steps:
  (1) estimate a language model based on D
  (2) compute the query likelihood according to the estimated model
- This leads to the so-called "Language Modeling Approach"
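The derivation sketched on the slide, reconstructed:

```latex
% With Bayes' rule (and assuming q is independent of d under R=0),
% ranking by relevance reduces to the query likelihood times a doc prior:
p(R{=}1 \mid q, d) \;\stackrel{rank}{\approx}\; p(q \mid d)\, p(d)
% With a uniform prior p(d) and a unigram document model \theta_d:
p(q \mid d) = p(q \mid \theta_d) = \prod_{i=1}^{m} p(q_i \mid \theta_d)
```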
What is a Statistical LM?
- A probability distribution over word sequences, e.g.:
  - p("Today is Wednesday") ≈ 0.001
  - p("Today Wednesday is") ≈ 0.0000000000001
  - p("The eigenvalue is positive") ≈ 0.00001
- Context-dependent!
- Can also be regarded as a probabilistic mechanism for "generating" text, and is thus also called a "generative" model

The Simplest Language Model (Unigram Model)
- Generate a piece of text by generating each word INDEPENDENTLY
- Thus p(w1 w2 ... wn) = p(w1) p(w2) ... p(wn)
- Parameters: {p(wi)}, with p(w1) + ... + p(wN) = 1 (N is the vocabulary size)
- Essentially a multinomial distribution over words
- A piece of text can be regarded as a sample drawn according to this word distribution
Text Generation with Unigram LM
- Sampling from a topic-specific model generates topic-specific text:
  - Topic 1 (text mining), p(w|θ1): text 0.2, mining 0.1, association 0.01, clustering 0.02, ..., food 0.00001 → generates a text mining paper
  - Topic 2 (health), p(w|θ2): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, ... → generates a food nutrition paper

Estimation of Unigram LM
- The inverse problem: given a document, estimate p(w|θ)
- Example: a "text mining paper" with 100 words in total and counts text 10, mining 5, association 3, database 3, algorithm 2, query 1, efficient 1, ... gives the estimates 10/100, 5/100, 3/100, 3/100, ..., 1/100
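The estimator behind those numbers, for reference:

```latex
% Maximum likelihood estimate of the unigram LM: the count of w in d
% divided by the total number of words in d.
\hat{p}_{ML}(w \mid \theta_d) = \frac{c(w,d)}{|d|}
```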
Language Models for Retrieval (Ponte & Croft 98)
- Estimate a language model for each document (e.g., one for a text mining paper, one for a food nutrition paper)
- Query = "data mining algorithms": which model would most likely have generated this query?

Ranking Docs by Query Likelihood
- For each document di, estimate a document LM θdi, compute the query likelihood p(q|θdi), and rank the documents by it

Retrieval as Language Model Estimation
- Document ranking is based on the query likelihood, so the retrieval problem essentially reduces to estimating the document language model p(wi|d)
- Smoothing is an important issue, and it distinguishes different approaches

How to Estimate p(w|d)?
- Simplest solution: the maximum likelihood estimator, p(w|d) = relative frequency of word w in d
- What if a word doesn't appear in the text? Then p(w|d) = 0
- In general, what probability should we give a word that has not been observed?
- If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of the observed words
- This is what "smoothing" is about
Language Model Smoothing (Illustration)
[Figure: P(w) plotted over words w, comparing the maximum likelihood estimate with a smoothed LM that lowers the probabilities of seen words and gives unseen words non-zero probability]

How to Smooth?
- All smoothing methods try to:
  - discount the probability of words seen in a document
  - re-allocate the extra counts so that unseen words will have a non-zero count
- A simple method (additive smoothing): add a constant δ to the count of each word; with δ = 1 this is "add one" (Laplace) smoothing, which adds the vocabulary size to the length of d (the total counts) in the denominator. Problems?

A General Smoothing Scheme
- Discount the probability of words seen in the doc, and re-allocate the extra probability so that unseen words have a non-zero probability
- Most methods use a reference model (the collection language model) to discriminate among unseen words, combining a discounted ML estimate with the collection language model
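The two formulas referenced above, in their standard reconstructed form:

```latex
% Additive ("add one"/Laplace) smoothing, |V| = vocabulary size:
p(w \mid d) = \frac{c(w,d) + 1}{|d| + |V|}
% General smoothing scheme: a discounted ML estimate for seen words,
% the collection LM p(w|C) for unseen words, with alpha_d chosen so
% that the probabilities sum to one:
p(w \mid d) =
\begin{cases}
p_{seen}(w \mid d) & \text{if } w \text{ is seen in } d \\
\alpha_d \, p(w \mid C) & \text{otherwise}
\end{cases}
```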
Smoothing & TF-IDF Weighting
- Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain a sum of a TF-weighting part, an IDF-weighting part, a doc length normalization part (a long doc is expected to have a smaller αd), and a document-independent part that can be ignored for ranking
- Smoothing with p(w|C) ≈ TF-IDF weighting + length normalization

Derivation of the Query Likelihood Retrieval Formula
- Start from the discounted ML estimate and the reference language model, apply the general smoothing scheme to the retrieval formula, and perform the key rewriting step shown below
- Similar rewritings are very common when using LMs for IR
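The key rewriting step, reconstructed as in Zhai & Lafferty (2001), for a query q of length n:

```latex
\log p(q \mid d)
= \sum_{w:\, c(w,q)>0,\; c(w,d)>0} c(w,q) \log \frac{p_{seen}(w \mid d)}{\alpha_d\, p(w \mid C)}
\;+\; n \log \alpha_d
\;+\; \sum_{w:\, c(w,q)>0} c(w,q) \log p(w \mid C)
```

The last sum does not depend on d and can be ignored for ranking; the first sum behaves like TF-IDF weighting (dividing by p(w|C) plays the IDF role), and n log αd supplies the length normalization.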
More Smoothing Methods
- Method 1 (absolute discounting): subtract a constant δ from the count of each word; the freed mass, proportional to the number of unique words in d, goes to p(w|REF)
- Method 2 (linear interpolation, Jelinek-Mercer): "shrink" the ML estimate uniformly toward p(w|REF), with an interpolation parameter λ

More Smoothing Methods (cont.)
- Method 3 (Dirichlet prior/Bayesian): assume pseudo counts μ p(w|REF), with parameter μ
- Method 4 (Good-Turing): assume the total count of unseen events to be n1 (the number of singletons), and adjust the counts of the seen events in the same way
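The discounting formulas in closed form (p(w|REF) is the reference model, |d|u the number of unique words in d):

```latex
% Method 1, absolute discounting:
p_{\delta}(w \mid d) = \frac{\max(c(w,d) - \delta,\, 0)}{|d|} + \frac{\delta\, |d|_u}{|d|}\, p(w \mid REF)
% Method 2, Jelinek-Mercer (linear interpolation):
p_{\lambda}(w \mid d) = (1 - \lambda)\, \frac{c(w,d)}{|d|} + \lambda\, p(w \mid REF)
% Method 3, Dirichlet prior:
p_{\mu}(w \mid d) = \frac{c(w,d) + \mu\, p(w \mid REF)}{|d| + \mu}
```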
Dirichlet Prior Smoothing
- ML estimator: M̂ = argmax_M p(d|M)
- Bayesian estimator:
  - First consider the posterior: p(M|d) = p(d|M) p(M) / p(d)
  - Then consider the mean or the mode of the posterior distribution
- p(d|M): the sampling distribution (of the data); p(M) = p(θ1, ..., θN): our prior on the model parameters
- A conjugate prior can be interpreted as "extra"/"pseudo" data
- The Dirichlet distribution is a conjugate prior for the multinomial sampling distribution, with "extra"/"pseudo" word counts αi = μ p(wi|REF)

Dirichlet Prior Smoothing (cont.)
- The posterior distribution of the parameters is again Dirichlet, and the predictive distribution is the same as its mean, which is exactly Dirichlet prior smoothing
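The posterior and predictive distribution, reconstructed:

```latex
% With a Dirichlet prior Dir(\alpha_1, ..., \alpha_N), \alpha_i = \mu\, p(w_i \mid REF),
% the posterior after observing d is again Dirichlet:
p(\theta \mid d) = \mathrm{Dir}\big(c(w_1,d) + \alpha_1,\; \ldots,\; c(w_N,d) + \alpha_N\big)
% Its mean gives the predictive distribution, i.e., Dirichlet prior smoothing:
p(w_i \mid d) = \frac{c(w_i,d) + \mu\, p(w_i \mid REF)}{|d| + \mu}
```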
Advantages of Language Models
- Solid statistical foundation
- Parameters can be optimized automatically using statistical estimation methods
- Can easily model many different retrieval tasks (to be covered more later)

What You Should Know
- The global relationship among the different probabilistic models
- How logistic regression works
- How the Robertson-Sparck Jones model works
- The BM25 formula
- All document-generation models have trouble when no relevance judgments are available
- How the language modeling approach (query likelihood scoring) works
- How Dirichlet prior smoothing works
- 3 state-of-the-art retrieval models: pivoted normalization, Okapi/BM25, query likelihood (Dirichlet prior smoothing)
Implementation of an IR System

IR System Architecture
[Figure: the user interacts through an INTERFACE; documents go through INDEXING to produce doc representations; the query goes through QUERY MODIFICATION (using feedback judgments) to produce a query representation; SEARCHING ranks the doc representations against the query representation and returns results]

Indexing
- Indexing = converting documents into data structures that enable fast search
- The inverted index is the dominant indexing method (used by all search engines)
- Other indices (e.g., a document index) may be needed for feedback
Inverted Index
- Fast access to all docs containing a given term (along with frequency and position information)
- For each term, we get a list of tuples (docID, freq, pos)
- Given a query, we can fetch the lists for all query terms and work only on the involved documents
  - Boolean query: set operations
  - Natural language query: term weight summing
- More efficient than scanning the docs (why?)

Inverted Index Example
- Doc 1: "This is a sample document with one sample sentence"
- Doc 2: "This is another sample document"
- Dictionary (term, # docs, total freq): this 2/2, is 2/2, sample 2/3, another 1/1
- Postings (doc id, freq): this → (1,1), (2,1); is → (1,1), (2,1); sample → (1,2), (2,1); another → (2,1)
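A minimal in-memory sketch of building such an index (illustrative only; the function name and structure are mine, not the lecture's):

```python
from collections import defaultdict

def build_index(docs):
    """docs: docID -> text. Returns term -> list of (docID, freq, positions)."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)           # record every occurrence
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return index

docs = {1: "This is a sample document with one sample sentence",
        2: "This is another sample document"}
index = build_index(docs)
print(index["sample"])    # [(1, 2, [3, 7]), (2, 1, [3])] -- matches the table above
```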
Data Structures for the Inverted Index
- Dictionary: modest size
  - Needs fast random access
  - Preferably kept in memory
  - Hash table, B-tree, trie, ...
- Postings: huge
  - Sequential access is expected
  - Can stay on disk
  - May contain docID, term freq., term positions, etc.
  - Compression is desirable

Inverted Index Compression
- Observations:
  - An inverted list is sorted (e.g., by docID or term freq)
  - Small numbers tend to occur more frequently
- Implications:
  - "d-gaps" (store differences): d1, d2-d1, d3-d2, ...
  - Exploit the skewed frequency distribution: fewer bits for small (high-frequency) integers
  - Binary code, unary code, γ-code, δ-code
Integer Compression Methods
- In general, exploit the skewed distribution
- Binary: equal-length coding
- Unary: x ≥ 1 is coded as x-1 one bits followed by a zero, e.g., 3 = 110, 5 = 11110
- γ-code: the unary code for 1 + ⌊log x⌋, followed by a uniform code for x - 2^⌊log x⌋ in ⌊log x⌋ bits, e.g., 3 = 101, 5 = 11001
- δ-code: same as the γ-code, but with the unary prefix replaced by a γ-code, e.g., 3 = 1001, 5 = 10101
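A sketch of the γ-code in Python, reproducing the slide's examples (assumes x ≥ 1):

```python
def gamma_encode(x: int) -> str:
    """Elias gamma code: unary(1 + floor(log2 x)), then the offset bits."""
    n = x.bit_length() - 1                 # floor(log2 x)
    unary = "1" * n + "0"                  # unary code for n + 1
    offset = format(x - (1 << n), "b").zfill(n) if n else ""
    return unary + offset

def gamma_decode(bits: str) -> int:
    n = bits.index("0")                    # number of leading ones
    offset = bits[n + 1 : 2 * n + 1]
    return (1 << n) + (int(offset, 2) if offset else 0)

assert gamma_encode(3) == "101" and gamma_encode(5) == "11001"
assert gamma_decode("11001") == 5
```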
Constructing an Inverted Index
- The main difficulty is building a huge index with limited memory
- Memory-based methods: not usable for large collections
- Sort-based methods:
  - Step 1: collect local (termID, docID, freq) tuples
  - Step 2: sort the local tuples (to make "runs")
  - Step 3: merge the runs pair-wise
  - Step 4: output the inverted file

Sort-based Inversion
[Figure: parse & count emits (termID, docID, freq) tuples using a term lexicon (the→1, cold→2, days→3, a→4, ...) and a docID lexicon (doc1→1, doc2→2, doc3→3, ...); "local" sorts by term-id produce runs, which are merge-sorted so that all info about term 1 ends up contiguous]

Searching
- Given a query, score documents efficiently
- Boolean query:
  - Fetch the inverted lists for all query terms
  - Perform set operations to get the subset of docs that satisfy the Boolean condition
  - E.g., Q1 = "info" AND "security", Q2 = "info" OR "security", with info: d1, d2, d3, d4 and security: d2, d4, d6
  - Results: d2, d4 (Q1); d1, d2, d3, d4, d6 (Q2)
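A sketch of the AND case as a sorted-postings intersection (docIDs ascending), which is the set operation behind Q1:

```python
def intersect(p1, p2):
    """Intersect two sorted docID lists in O(len(p1) + len(p2))."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1                          # advance the smaller pointer
        else:
            j += 1
    return result

print(intersect([1, 2, 3, 4], [2, 4, 6]))   # [2, 4], i.e., d2 and d4
```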
Ranking Documents
- Assumption: score(d,q) = f(g(w(d,q,t1), ..., w(d,q,tn)), w(d), w(q)), where the ti are the matched terms
- Maintain a score accumulator for each doc to compute the function g
- For each query term ti:
  - Fetch the inverted list {(d1,f1), ..., (dn,fn)}
  - For each entry (dj,fj), compute w(dj,q,ti) and update the score accumulator for doc dj
- Adjust the scores to compute f, and sort

Ranking Documents: Example
- Query = "info security"; S(d,q) = g(t1) + ... + g(tn) = sum of the frequencies of the matched terms
- info: (d1,3), (d2,4), (d3,1), (d4,5); security: (d2,3), (d4,1), (d5,3)
- Accumulators (d1 d2 d3 d4 d5), updated entry by entry:

  start:           0 0 0 0 0
  info (d1,3):     3 0 0 0 0
  info (d2,4):     3 4 0 0 0
  info (d3,1):     3 4 1 0 0
  info (d4,5):     3 4 1 5 0
  security (d2,3): 3 7 1 5 0
  security (d4,1): 3 7 1 6 0
  security (d5,3): 3 7 1 6 3
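The same computation as a term-at-a-time accumulator sketch (with g = the sum of raw frequencies, as in the example):

```python
from collections import defaultdict

postings = {"info":     [(1, 3), (2, 4), (3, 1), (4, 5)],
            "security": [(2, 3), (4, 1), (5, 3)]}

def rank(query_terms):
    acc = defaultdict(float)                 # one score accumulator per doc
    for t in query_terms:
        for doc_id, freq in postings.get(t, []):
            acc[doc_id] += freq              # here w(d, q, t) is just the freq
    return sorted(acc.items(), key=lambda kv: -kv[1])

print(rank(["info", "security"]))
# [(2, 7.0), (4, 6.0), (1, 3.0), (5, 3.0), (3, 1.0)] -- the final accumulators above
```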
Further Improving Efficiency
- Keep only the most promising accumulators
- Sort the inverted lists in decreasing order of weights and fetch only the N entries with the highest weights
- Pre-compute as much as possible

Open Source IR Toolkits
- Smart (Cornell)
- MG (RMIT & Melbourne, Australia; Waikato, New Zealand)
- Lemur (CMU / Univ. of Massachusetts)
- Terrier (Glasgow)
- Lucene (open source)
Smart
- The most influential IR system/toolkit
- Developed at Cornell since the 1960s
- Vector space model with lots of weighting options
- Written in C
- The Cornell/AT&T groups have used the Smart system to achieve top TREC performance

MG
- A highly efficient toolkit for retrieval of text and images
- Developed by people at Univ. of Waikato, Univ. of Melbourne, and RMIT in the 1990s
- Written in C, running on Unix
- Vector space model with lots of compression and speed-up tricks
- People have used it to achieve good TREC performance

Lemur/Indri
- An IR toolkit emphasizing language models
- Developed at CMU and Univ. of Massachusetts in the 2000s
- Written in C++, highly extensible
- Vector space and probabilistic models, including language models
- Achieves good TREC performance with a simple language model

Terrier
- A large-scale retrieval toolkit with lots of applications (e.g., desktop search) and TREC support
- Developed at the University of Glasgow, UK
- Written in Java, open source
- "Divergence from randomness" retrieval model and other modern retrieval formulas

Lucene
- Open-source IR toolkit
- Initially developed by Doug Cutting in Java
- Has now been ported to some other languages
- Good for building IR/Web applications
- Many applications have been built using Lucene (e.g., the Nutch search engine)
- Currently the retrieval algorithms have poor accuracy

What You Should Know
- What an inverted index is
- Why an inverted index helps make search fast
- How to construct a large inverted index
- Simple integer compression methods
- How to use an inverted index to rank documents efficiently
- HOW TO IMPLEMENT A SIMPLE IR SYSTEM

Applications of Basic IR Techniques
Some "Basic" IR Techniques
- Stemming
- Stop words
- Weighting of terms (e.g., TF-IDF)
- Vector/unigram representation of text
- Text similarity (e.g., cosine)
- Relevance/pseudo feedback (e.g., Rocchio)
They are not just for retrieval!

Generality of Basic Techniques
[Figure: raw text → (stemming & stop words) → tokenized text → (term weighting) → a term-document weight matrix (wij over terms t1...tn and docs d1...dm); term similarity, doc similarity, and vector centroids then feed CLUSTERING, CATEGORIZATION, and META-DATA/ANNOTATION, while sentence selection feeds SUMMARIZATION]

Sample Applications
- Information filtering (covered earlier)
- Text categorization
- Document/term clustering
- Text summarization
Text Categorization
- Pre-given categories and labeled document examples (categories may form a hierarchy)
- Classify new documents
- A standard supervised learning problem
[Figure: a categorization system assigns incoming documents to categories such as Sports, Business, Education, Science]

"Retrieval-based" Categorization
- Treat each category as representing an "information need"
- Treat the examples in each category as "relevant documents"
- Use feedback approaches to learn a good "query"
- Match all the learned queries to a new document
- A document gets the category (or categories) represented by the best matching query (or queries)

Prototype-based Classifier
- Key elements ("retrieval techniques"):
  - Prototype/document representation (e.g., term vector)
  - Document-prototype distance measure (e.g., dot product)
  - Prototype vector learning: Rocchio feedback
- Example: see the sketch below
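A minimal sketch of such a classifier (my own toy data and function names; the prototypes here are plain centroids, i.e., Rocchio with only positive examples):

```python
from collections import defaultdict

def centroid(vectors):
    """Average a list of term->weight dicts into a prototype vector."""
    proto = defaultdict(float)
    for v in vectors:
        for term, w in v.items():
            proto[term] += w / len(vectors)
    return proto

def classify(doc, prototypes):
    dot = lambda u, v: sum(w * v.get(t, 0.0) for t, w in u.items())
    return max(prototypes, key=lambda c: dot(prototypes[c], doc))

training = {"sports":  [{"game": 2.0, "team": 1.0}, {"team": 2.0, "score": 1.0}],
            "science": [{"theory": 2.0, "experiment": 1.0}]}
prototypes = {c: centroid(vs) for c, vs in training.items()}
print(classify({"team": 1.0, "game": 1.0}, prototypes))   # sports
```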
K-Nearest Neighbor Classifier
- Keep all training examples
- Find the k examples that are most similar to the new document ("neighbor" documents)
- Assign the category that is most common among these neighbor documents (the neighbors vote for the category)
- Can be improved by considering the distance of a neighbor (a closer neighbor has more influence)
- Technical elements ("retrieval techniques"): document representation; document distance measure

Example of K-NN Classifier
[Figure: the same test point classified with k=1 and with k=4]
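A matching k-NN sketch with cosine similarity (toy data; the term-vector representation is assumed):

```python
import math
from collections import Counter

def cosine(u, v):
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * \
          math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

def knn_classify(doc, examples, k=4):
    """examples: list of (term->weight dict, label) pairs; neighbors vote."""
    neighbors = sorted(examples, key=lambda ex: cosine(doc, ex[0]), reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

examples = [({"ball": 1.0, "team": 1.0}, "sports"),
            ({"cell": 1.0, "gene": 1.0}, "science"),
            ({"team": 1.0, "win": 1.0}, "sports")]
print(knn_classify({"team": 1.0, "ball": 1.0}, examples, k=3))   # sports
```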
Examples of Text Categorization
- News article classification
- Meta-data annotation
- Automatic email sorting
- Web page classification

Sample Applications
- Information filtering
- Text categorization
- Document/term clustering
- Text summarization

The Clustering Problem
- Discover "natural structure"
- Group similar objects together
- Objects can be documents, terms, passages, ...

Similarity-based Clustering (as opposed to "model-based")
- Define a similarity function to measure the similarity between two objects
- Gradually group similar objects together in a bottom-up fashion
- Stop when some stopping criterion is met
- Variations: different ways to compute group similarity based on individual object similarity
Similarity-induced Structure
[Figure: a dendrogram over the objects induced by repeated bottom-up merging]

How to Compute Group Similarity?
Given two groups g1 and g2, three popular methods:
- Single-link algorithm: s(g1,g2) = similarity of the closest pair
- Complete-link algorithm: s(g1,g2) = similarity of the farthest pair
- Average-link algorithm: s(g1,g2) = average similarity of all pairs

Three Methods Illustrated
[Figure: two groups g1 and g2 with the closest pair (single-link), the farthest pair (complete-link), and all pairs (average-link) highlighted]
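The three definitions in symbols, with s(x,y) the object-level similarity:

```latex
s_{single}(g_1, g_2) = \max_{x \in g_1,\, y \in g_2} s(x,y)
\qquad
s_{complete}(g_1, g_2) = \min_{x \in g_1,\, y \in g_2} s(x,y)
\qquad
s_{avg}(g_1, g_2) = \frac{1}{|g_1|\,|g_2|} \sum_{x \in g_1} \sum_{y \in g_2} s(x,y)
```

Note that the closest pair has the maximum similarity and the farthest pair the minimum.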
Examples of Doc/Term Clustering
- Clustering of retrieval results
- Clustering of documents in the whole collection
- Term clustering to define a "concept" or "theme"
- Automatic construction of hyperlinks
- In general, very useful for text mining

Sample Applications
- Information filtering
- Text categorization
- Document/term clustering
- Text summarization

The Summarization Problem
- Essentially "semantic compression" of text
- Selection-based vs. generation-based summary
- In general, we need a purpose for summarization, but it's hard to define one
"Retrieval-based" Summarization
- Observation: term vector → summary?
- Basic approach: rank the "sentences", and select the top N as a summary
- Methods for ranking sentences:
  - Based on term weights
  - Based on the position of sentences
  - Based on the similarity of the sentence vector and the document vector

Simple Discourse Analysis
[Figure: adjacent sentence vectors (vector 1, vector 2, ..., vector n) compared by similarity to find segment boundaries]

A Simple Summarization Method
[Figure: from each discourse segment, the sentence most similar to the document vector is selected into the summary]
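A sketch of the third ranking method (similarity of the sentence and document vectors); the representation and names are mine:

```python
import math
from collections import Counter

def vec(text):
    return Counter(text.lower().split())          # raw term-frequency vector

def cosine(u, v):
    num = sum(c * v.get(t, 0) for t, c in u.items())
    den = math.sqrt(sum(c * c for c in u.values())) * \
          math.sqrt(sum(c * c for c in v.values()))
    return num / den if den else 0.0

def summarize(sentences, n=2):
    """Select the n sentences most similar to the whole-document vector."""
    doc_vec = vec(" ".join(sentences))
    return sorted(sentences, key=lambda s: cosine(vec(s), doc_vec), reverse=True)[:n]

print(summarize(["the cat sat", "dogs bark loudly", "the cat ran"], n=1))
```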
Examples of Summarization
- News summary
- Summarizing retrieval results: single-doc summary; multi-doc summary
- Summarizing a cluster of documents (automatic label creation for clusters)

What You Should Know
- Retrieval touches some basic issues in text information management (what are these basic issues?)
- How to apply simple retrieval techniques, such as the vector space model, to information filtering, text categorization, clustering, and summarization
- There are many other tasks that can potentially benefit from simple IR techniques

Roadmap
- This lecture: other retrieval models; IR system implementation; applications of basic TR techniques
- Next lecture: a more in-depth treatment of language models