Dragon Star Program Course: Information Retrieval
Overview of Text Retrieval: Part 2

ChengXiang Zhai (翟成祥)
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, czhai@cs.uiuc.edu
2008 ChengXiang Zhai, Dragon Star Lecture at Beijing University, June 21-30, 2008

Outline
- Other retrieval models
- Implementation of a TR system
- Applications of TR techniques

P-norm (Extended Boolean) (Salton et al. 83)
- Motivation: how to rank documents with a Boolean query?
- Intuitions:
  - Docs satisfying the query constraint should get the highest ranks
  - Partial satisfaction of the query constraint can be used to rank the other docs
- Question: how to capture "partial satisfaction"?

P-norm: Basic Ideas
- Normalize term weights for the doc representation to [0,1]
- Define the similarity between a Boolean query and a doc vector
[Figure: a doc (x,y) in the unit square with corners (0,0), (1,0), (0,1), (1,1); for Q = T1 AND T2 the ideal point is (1,1), for Q = T1 OR T2 the point to avoid is (0,0)]

P-norm: Formulas
- Since the similarity value is normalized to [0,1], the two formulas (reconstructed below) can be applied recursively
- 1 ≤ p ≤ +∞ interpolates between the vector space model and Boolean/fuzzy logic
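The P-norm formulas themselves did not survive extraction. A standard reconstruction from Salton et al. (1983), for a document with normalized weights x and y for terms T1 and T2:

```latex
% P-norm similarity (reconstruction): OR rewards distance from (0,0),
% AND rewards closeness to (1,1).
\mathrm{sim}(T_1 \,\mathrm{OR}\, T_2,\, d) = \left( \frac{x^p + y^p}{2} \right)^{1/p}
\qquad
\mathrm{sim}(T_1 \,\mathrm{AND}\, T_2,\, d) = 1 - \left( \frac{(1-x)^p + (1-y)^p}{2} \right)^{1/p}
```

At p = 1 both reduce to (x + y)/2, i.e., vector-space-style averaging; as p → +∞ they approach max(x, y) and min(x, y), i.e., fuzzy-logic OR/AND.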
P-norm: Summary
- A general (and elegant) similarity function for a Boolean query and a regular document vector
- Connects the Boolean model and the vector space model, with models in between
- Allows different "confidence" on Boolean operators (a different p for different operators)
- A model worth more exploration (how to learn optimal p values from feedback?)
Probabilistic Retrieval Models

Overview of Retrieval Models
[Taxonomy, flattened from a diagram:]
- Relevance ≈ similarity(Rep(q), Rep(d)): different representations & similarity measures
  - Vector space model (Salton et al., 75)
  - Prob. distr. model (Wong & Yao, 89)
- Relevance as P(r=1|q,d), r ∈ {0,1}: probability of relevance
  - Regression model (Fox 83)
  - Generative models:
    - Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
    - Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
- Relevance as P(d→q) or P(q→d): probabilistic inference, different inference systems
  - Prob. concept space model (Wong & Yao, 95)
  - Inference network model (Turtle & Croft, 91)
- Learn to rank (Joachims 02; Burges et al. 05)

The Basic Question
- What is the probability that THIS document is relevant to THIS query?
- Formally, 3 random variables: query Q, document D, relevance R ∈ {0,1}
- Given a particular query q and a particular document d, p(R=1|Q=q,D=d) = ?

Probability of Relevance
- Goal: rank D based on P(R=1|Q,D)
- We only need to compare P(R=1|Q,D1) with P(R=1|Q,D2), i.e., rank the documents
- There are several different ways to refine P(R=1|Q,D)
Refining P(R=1|Q,D), Method 1: Conditional Models
- Basic idea: relevance depends on how well a query matches a document
- Define features on Q × D, e.g., # matched terms, the highest IDF of a matched term, document length, ...
- P(R=1|Q,D) = g(f1(Q,D), f2(Q,D), ..., fn(Q,D), θ)
- Use training data (known relevance judgments) to estimate the parameters θ
- Apply the model to rank new documents
- Special case: logistic regression

Logistic Regression (Cooper 92, Gey 94)
- Model the logit of P(R=1|Q,D) as a linear function of the features; the logistic (sigmoid) function maps the result back into [0,1]
- Uses 6 features X1, ..., X6
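The regression equations were lost in extraction; this is the standard form implied by "logit function" and "logistic (sigmoid) function", with coefficients β fit on relevance judgments:

```latex
% Logit of the relevance probability as a linear function of the features:
\log \frac{P(R=1 \mid Q,D)}{1 - P(R=1 \mid Q,D)} = \beta_0 + \sum_{i=1}^{6} \beta_i X_i
% Equivalently, through the logistic (sigmoid) function:
P(R=1 \mid Q,D) = \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{i=1}^{6} \beta_i X_i\right)}
```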
Features/Attributes
- Average absolute query frequency
- Query length
- Average absolute document frequency
- Document length
- Average inverse document frequency
- Inverse document frequency
- Number of terms in common between query and document (logged)

Logistic Regression: Pros & Cons
- Advantages:
  - An absolute probability of relevance is available
  - May reuse all past relevance judgments
- Problems:
  - Performance is very sensitive to the selection of features
  - Not much guidance on feature selection
- In practice, performance tends to be average

Refining P(R=1|Q,D), Method 2: Generative Models
- Basic idea: define P(Q,D|R), then compute the odds O(R=1|Q,D) using Bayes' rule
- Special cases:
  - Document "generation": P(Q,D|R) = P(D|Q,R) P(Q|R), where P(Q|R) can be ignored for ranking D
  - Query "generation": P(Q,D|R) = P(Q|D,R) P(D|R)
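The Bayes-rule step, reconstructed in its standard odds form (the slide's equations did not survive extraction):

```latex
% Odds of relevance via Bayes' rule; the prior odds do not depend on D,
% so they can be ignored for ranking.
O(R{=}1 \mid Q,D) = \frac{P(R{=}1 \mid Q,D)}{P(R{=}0 \mid Q,D)}
= \frac{P(Q,D \mid R{=}1)}{P(Q,D \mid R{=}0)} \cdot \frac{P(R{=}1)}{P(R{=}0)}
```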
Document Generation
- Uses two models: a model of relevant docs for Q and a model of non-relevant docs for Q
- Assume independent attributes A1...Ak (why?)
- Let D = d1...dk, where di ∈ {0,1} is the value of attribute Ai (similarly Q = q1...qk)
- Assume non-query terms are equally likely to appear in relevant and non-relevant docs

Robertson-Sparck Jones Model (Robertson & Sparck Jones 76)
- Two parameters for each term Ai:
  - pi = P(Ai=1|Q,R=1): the probability that term Ai occurs in a relevant doc
  - qi = P(Ai=1|Q,R=0): the probability that term Ai occurs in a non-relevant doc
- How to estimate the parameters? Suppose we have relevance judgments; the "+0.5" and "+1" in the estimates can be justified by Bayesian estimation
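The RSJ formulas, reconstructed in their standard form (ri = # relevant docs containing Ai, R = total # relevant docs, ni = # docs containing Ai, N = collection size):

```latex
% Rank-equivalent RSJ retrieval formula over terms present in both q and d:
\log O(R{=}1 \mid Q,D) \;\stackrel{rank}{=}\; \sum_{i:\; d_i = q_i = 1} \log \frac{p_i (1-q_i)}{q_i (1-p_i)}
% Smoothed estimates from relevance judgments (the "+0.5" and "+1"):
\hat{p}_i = \frac{r_i + 0.5}{R + 1}, \qquad \hat{q}_i = \frac{n_i - r_i + 0.5}{N - R + 1}
```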
RSJ Model: No Relevance Info (Croft & Harper 79)
- How to estimate the parameters when we have no relevance judgments?
  - Assume pi to be a constant
  - Estimate qi by assuming all documents to be non-relevant
- N: # documents in the collection; ni: # documents in which term Ai occurs
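Under these assumptions, the RSJ weight, in its usual reconstructed form, becomes an IDF-like weight:

```latex
% Croft & Harper 79: a constant p_i and q_i estimated from the whole
% collection yield
\mathrm{score}(Q,D) \;\stackrel{rank}{=}\; \sum_{i:\; d_i = q_i = 1} \log \frac{N - n_i + 0.5}{n_i + 0.5}
```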
RSJ Model: Summary
- The most important classic probabilistic IR model
- Uses only term presence/absence, and is thus also referred to as the Binary Independence Model
- Essentially Naive Bayes for document ranking
- Most natural for relevance/pseudo feedback
- Without relevance judgments, the model parameters must be estimated in an ad hoc way
- Performance isn't as good as a tuned VS model

Improving RSJ: Adding TF
- Let D = d1...dk, where di is now the frequency count of term Ai
- Basic document generation model: the 2-Poisson mixture model
- Many more parameters to estimate! (how many exactly?)
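A sketch of the 2-Poisson mixture (reconstruction; the slide's formula was lost): term frequency is Poisson with a high rate in "elite" documents and a low rate elsewhere:

```latex
% Mixture of two Poissons for the frequency d_k of term A_k:
P(d_k) = \pi\, \frac{\lambda_1^{d_k} e^{-\lambda_1}}{d_k!} + (1 - \pi)\, \frac{\lambda_2^{d_k} e^{-\lambda_2}}{d_k!}
```

Each term now needs two rates and a mixing weight, for each relevance class, which is where the "many more parameters" come from.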
BM25/Okapi Approximation (Robertson et al. 94)
- Idea: approximate p(R=1|Q,D) with a simpler function that shares similar properties
- Observations: log O(R=1|Q,D) is a sum of term weights Wi, where
  - Wi = 0 if TFi = 0
  - Wi increases monotonically with TFi
  - Wi has an asymptotic limit
- The simple function used is a saturating TF transformation with these same properties (its exact form was lost in extraction; see the BM25 formula below)

Adding Doc Length & Query TF
- Incorporating doc length:
  - Motivation: the 2-Poisson model assumes equal document lengths
  - Implementation: "carefully" penalize long docs
- Incorporating query TF:
  - Motivation: appears not to be well justified
  - Implementation: a similar TF transformation
- The final formula is called BM25, achieving top TREC performance
The BM25 Formula
- The document TF component is known as the "Okapi TF/BM25 TF" (the full formula, lost in extraction, is reconstructed below)
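BM25 as commonly written (Robertson et al. 94), with c(w,d) the count of w in d, |d| the document length, avdl the average document length, and typical settings k1 ∈ [1.2, 2], b ≈ 0.75:

```latex
\mathrm{BM25}(q,d) = \sum_{w \in q \cap d}
\underbrace{\log \frac{N - n_w + 0.5}{n_w + 0.5}}_{\text{RSJ/IDF weight}}
\cdot
\underbrace{\frac{(k_1 + 1)\, c(w,d)}{k_1 \left( (1-b) + b \frac{|d|}{avdl} \right) + c(w,d)}}_{\text{Okapi TF / BM25 TF}}
\cdot
\underbrace{\frac{(k_3 + 1)\, c(w,q)}{k_3 + c(w,q)}}_{\text{query TF}}
```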
Extensions of "Doc Generation" Models
- Capture term dependence (Rijsbergen & Harper 78)
- Alternative ways to incorporate TF (Croft 83, Kalt 96)
- Feature/term selection for feedback (Okapi's TREC reports)
- Other possibilities (machine learning)

Query Generation
- Assuming a uniform document prior, ranking reduces to the query likelihood p(q|d)
- Now, the question is how to compute p(q|d); generally this involves two steps:
  (1) estimate a language model based on D
  (2) compute the query likelihood according to the estimated model
- This leads to the so-called "Language Modeling Approach"
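The derivation sketched on the slide, reconstructed:

```latex
% With Bayes' rule (and assuming q is independent of d under R=0),
% ranking by relevance reduces to the query likelihood times a doc prior:
p(R{=}1 \mid q, d) \;\stackrel{rank}{\approx}\; p(q \mid d)\, p(d)
% With a uniform prior p(d) and a unigram document model \theta_d:
p(q \mid d) = p(q \mid \theta_d) = \prod_{i=1}^{m} p(q_i \mid \theta_d)
```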
What is a Statistical LM?
- A probability distribution over word sequences, e.g.:
  - p("Today is Wednesday") ≈ 0.001
  - p("Today Wednesday is") ≈ 0.0000000000001
  - p("The eigenvalue is positive") ≈ 0.00001
- Context-dependent!
- Can also be regarded as a probabilistic mechanism for "generating" text, and is thus also called a "generative" model

The Simplest Language Model (Unigram Model)
- Generate a piece of text by generating each word INDEPENDENTLY
- Thus p(w1 w2 ... wn) = p(w1) p(w2) ... p(wn)
- Parameters: {p(wi)}, with p(w1) + ... + p(wN) = 1 (N is the vocabulary size)
- Essentially a multinomial distribution over words
- A piece of text can be regarded as a sample drawn according to this word distribution
Text Generation with Unigram LM
- Sampling from a topic-specific model generates topic-specific text:
  - Topic 1 (text mining), p(w|θ1): text 0.2, mining 0.1, association 0.01, clustering 0.02, ..., food 0.00001 → generates a text mining paper
  - Topic 2 (health), p(w|θ2): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, ... → generates a food nutrition paper

Estimation of Unigram LM
- The inverse problem: given a document, estimate p(w|θ)
- Example: a "text mining paper" with 100 words in total and counts text 10, mining 5, association 3, database 3, algorithm 2, query 1, efficient 1, ... gives the estimates 10/100, 5/100, 3/100, 3/100, ..., 1/100
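The estimator behind those numbers, for reference:

```latex
% Maximum likelihood estimate of the unigram LM: the count of w in d
% divided by the total number of words in d.
\hat{p}_{ML}(w \mid \theta_d) = \frac{c(w,d)}{|d|}
```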
Language Models for Retrieval (Ponte & Croft 98)
- Estimate a language model for each document (e.g., one for a text mining paper, one for a food nutrition paper)
- Query = "data mining algorithms": which model would most likely have generated this query?

Ranking Docs by Query Likelihood
- For each document di, estimate a document LM θdi, compute the query likelihood p(q|θdi), and rank the documents by it

Retrieval as Language Model Estimation
- Document ranking is based on the query likelihood, so the retrieval problem essentially reduces to estimating the document language model p(wi|d)
- Smoothing is an important issue, and it distinguishes different approaches

How to Estimate p(w|d)?
- Simplest solution: the maximum likelihood estimator, p(w|d) = relative frequency of word w in d
- What if a word doesn't appear in the text? Then p(w|d) = 0
- In general, what probability should we give a word that has not been observed?
- If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of the observed words
- This is what "smoothing" is about
Language Model Smoothing (Illustration)
[Figure: P(w) plotted over words w, comparing the maximum likelihood estimate with a smoothed LM that lowers the probabilities of seen words and gives unseen words non-zero probability]

How to Smooth?
- All smoothing methods try to:
  - discount the probability of words seen in a document
  - re-allocate the extra counts so that unseen words will have a non-zero count
- A simple method (additive smoothing): add a constant δ to the count of each word; with δ = 1 this is "add one" (Laplace) smoothing, which adds the vocabulary size to the length of d (the total counts) in the denominator. Problems?

A General Smoothing Scheme
- Discount the probability of words seen in the doc, and re-allocate the extra probability so that unseen words have a non-zero probability
- Most methods use a reference model (the collection language model) to discriminate among unseen words, combining a discounted ML estimate with the collection language model
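The two formulas referenced above, in their standard reconstructed form:

```latex
% Additive ("add one"/Laplace) smoothing, |V| = vocabulary size:
p(w \mid d) = \frac{c(w,d) + 1}{|d| + |V|}
% General smoothing scheme: a discounted ML estimate for seen words,
% the collection LM p(w|C) for unseen words, with alpha_d chosen so
% that the probabilities sum to one:
p(w \mid d) =
\begin{cases}
p_{seen}(w \mid d) & \text{if } w \text{ is seen in } d \\
\alpha_d \, p(w \mid C) & \text{otherwise}
\end{cases}
```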
Smoothing & TF-IDF Weighting
- Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain a sum of a TF-weighting part, an IDF-weighting part, a doc length normalization part (a long doc is expected to have a smaller αd), and a document-independent part that can be ignored for ranking
- Smoothing with p(w|C) ≈ TF-IDF weighting + length normalization

Derivation of the Query Likelihood Retrieval Formula
- Start from the discounted ML estimate and the reference language model, apply the general smoothing scheme to the retrieval formula, and perform the key rewriting step shown below
- Similar rewritings are very common when using LMs for IR
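The key rewriting step, reconstructed as in Zhai & Lafferty (2001), for a query q of length n:

```latex
\log p(q \mid d)
= \sum_{w:\, c(w,q)>0,\; c(w,d)>0} c(w,q) \log \frac{p_{seen}(w \mid d)}{\alpha_d\, p(w \mid C)}
\;+\; n \log \alpha_d
\;+\; \sum_{w:\, c(w,q)>0} c(w,q) \log p(w \mid C)
```

The last sum does not depend on d and can be ignored for ranking; the first sum behaves like TF-IDF weighting (dividing by p(w|C) plays the IDF role), and n log αd supplies the length normalization.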
More Smoothing Methods
- Method 1 (absolute discounting): subtract a constant δ from the count of each word; the freed mass, proportional to the number of unique words in d, goes to p(w|REF)
- Method 2 (linear interpolation, Jelinek-Mercer): "shrink" the ML estimate uniformly toward p(w|REF), with an interpolation parameter λ

More Smoothing Methods (cont.)
- Method 3 (Dirichlet prior/Bayesian): assume pseudo counts μ p(w|REF), with parameter μ
- Method 4 (Good-Turing): assume the total count of unseen events to be n1 (the number of singletons), and adjust the counts of the seen events in the same way
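The discounting formulas in closed form (p(w|REF) is the reference model, |d|u the number of unique words in d):

```latex
% Method 1, absolute discounting:
p_{\delta}(w \mid d) = \frac{\max(c(w,d) - \delta,\, 0)}{|d|} + \frac{\delta\, |d|_u}{|d|}\, p(w \mid REF)
% Method 2, Jelinek-Mercer (linear interpolation):
p_{\lambda}(w \mid d) = (1 - \lambda)\, \frac{c(w,d)}{|d|} + \lambda\, p(w \mid REF)
% Method 3, Dirichlet prior:
p_{\mu}(w \mid d) = \frac{c(w,d) + \mu\, p(w \mid REF)}{|d| + \mu}
```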
Dirichlet Prior Smoothing
- ML estimator: M̂ = argmax_M p(d|M)
- Bayesian estimator:
  - First consider the posterior: p(M|d) = p(d|M) p(M) / p(d)
  - Then consider the mean or the mode of the posterior distribution
- p(d|M): the sampling distribution (of the data); p(M) = p(θ1, ..., θN): our prior on the model parameters
- A conjugate prior can be interpreted as "extra"/"pseudo" data
- The Dirichlet distribution is a conjugate prior for the multinomial sampling distribution, with "extra"/"pseudo" word counts αi = μ p(wi|REF)

Dirichlet Prior Smoothing (cont.)
- The posterior distribution of the parameters is again Dirichlet, and the predictive distribution is the same as its mean, which is exactly Dirichlet prior smoothing
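The posterior and predictive distribution, reconstructed:

```latex
% With a Dirichlet prior Dir(\alpha_1, ..., \alpha_N), \alpha_i = \mu\, p(w_i \mid REF),
% the posterior after observing d is again Dirichlet:
p(\theta \mid d) = \mathrm{Dir}\big(c(w_1,d) + \alpha_1,\; \ldots,\; c(w_N,d) + \alpha_N\big)
% Its mean gives the predictive distribution, i.e., Dirichlet prior smoothing:
p(w_i \mid d) = \frac{c(w_i,d) + \mu\, p(w_i \mid REF)}{|d| + \mu}
```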
Advantages of Language Models
- Solid statistical foundation
- Parameters can be optimized automatically using statistical estimation methods
- Can easily model many different retrieval tasks (to be covered more later)

What You Should Know
- The global relationship among the different probabilistic models
- How logistic regression works
- How the Robertson-Sparck Jones model works
- The BM25 formula
- All document-generation models have trouble when no relevance judgments are available
- How the language modeling approach (query likelihood scoring) works
- How Dirichlet prior smoothing works
- 3 state-of-the-art retrieval models: pivoted normalization, Okapi/BM25, query likelihood (Dirichlet prior smoothing)
Implementation of an IR System

IR System Architecture
[Figure: the user interacts through an INTERFACE; documents go through INDEXING to produce doc representations; the query goes through QUERY MODIFICATION (using feedback judgments) to produce a query representation; SEARCHING ranks the doc representations against the query representation and returns results]

Indexing
- Indexing = converting documents into data structures that enable fast search
- The inverted index is the dominant indexing method (used by all search engines)
- Other indices (e.g., a document index) may be needed for feedback
Inverted Index
- Fast access to all docs containing a given term (along with frequency and position information)
- For each term, we get a list of tuples (docID, freq, pos)
- Given a query, we can fetch the lists for all query terms and work only on the involved documents
  - Boolean query: set operations
  - Natural language query: term weight summing
- More efficient than scanning the docs (why?)

Inverted Index Example
- Doc 1: "This is a sample document with one sample sentence"
- Doc 2: "This is another sample document"
- Dictionary (term, # docs, total freq): this 2/2, is 2/2, sample 2/3, another 1/1
- Postings (doc id, freq): this → (1,1), (2,1); is → (1,1), (2,1); sample → (1,2), (2,1); another → (2,1)
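A minimal in-memory sketch of building such an index (illustrative only; the function name and structure are mine, not the lecture's):

```python
from collections import defaultdict

def build_index(docs):
    """docs: docID -> text. Returns term -> list of (docID, freq, positions)."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)           # record every occurrence
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return index

docs = {1: "This is a sample document with one sample sentence",
        2: "This is another sample document"}
index = build_index(docs)
print(index["sample"])    # [(1, 2, [3, 7]), (2, 1, [3])] -- matches the table above
```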
Data Structures for the Inverted Index
- Dictionary: modest size
  - Needs fast random access
  - Preferably kept in memory
  - Hash table, B-tree, trie, ...
- Postings: huge
  - Sequential access is expected
  - Can stay on disk
  - May contain docID, term freq., term positions, etc.
  - Compression is desirable

Inverted Index Compression
- Observations:
  - An inverted list is sorted (e.g., by docID or term freq)
  - Small numbers tend to occur more frequently
- Implications:
  - "d-gaps" (store differences): d1, d2-d1, d3-d2, ...
  - Exploit the skewed frequency distribution: fewer bits for small (high-frequency) integers
  - Binary code, unary code, γ-code, δ-code
Integer Compression Methods
- In general, exploit the skewed distribution
- Binary: equal-length coding
- Unary: x ≥ 1 is coded as x-1 one bits followed by a zero, e.g., 3 = 110, 5 = 11110
- γ-code: the unary code for 1 + ⌊log x⌋, followed by a uniform code for x - 2^⌊log x⌋ in ⌊log x⌋ bits, e.g., 3 = 101, 5 = 11001
- δ-code: same as the γ-code, but with the unary prefix replaced by a γ-code, e.g., 3 = 1001, 5 = 10101
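A sketch of the γ-code in Python, reproducing the slide's examples (assumes x ≥ 1):

```python
def gamma_encode(x: int) -> str:
    """Elias gamma code: unary(1 + floor(log2 x)), then the offset bits."""
    n = x.bit_length() - 1                 # floor(log2 x)
    unary = "1" * n + "0"                  # unary code for n + 1
    offset = format(x - (1 << n), "b").zfill(n) if n else ""
    return unary + offset

def gamma_decode(bits: str) -> int:
    n = bits.index("0")                    # number of leading ones
    offset = bits[n + 1 : 2 * n + 1]
    return (1 << n) + (int(offset, 2) if offset else 0)

assert gamma_encode(3) == "101" and gamma_encode(5) == "11001"
assert gamma_decode("11001") == 5
```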
Constructing an Inverted Index
- The main difficulty is building a huge index with limited memory
- Memory-based methods: not usable for large collections
- Sort-based methods:
  - Step 1: collect local (termID, docID, freq) tuples
  - Step 2: sort the local tuples (to make "runs")
  - Step 3: merge the runs pair-wise
  - Step 4: output the inverted file

Sort-based Inversion
[Figure: parse & count emits (termID, docID, freq) tuples using a term lexicon (the→1, cold→2, days→3, a→4, ...) and a docID lexicon (doc1→1, doc2→2, doc3→3, ...); "local" sorts by term-id produce runs, which are merge-sorted so that all info about term 1 ends up contiguous]

Searching
- Given a query, score documents efficiently
- Boolean query:
  - Fetch the inverted lists for all query terms
  - Perform set operations to get the subset of docs that satisfy the Boolean condition
  - E.g., Q1 = "info" AND "security", Q2 = "info" OR "security", with info: d1, d2, d3, d4 and security: d2, d4, d6
  - Results: d2, d4 (Q1); d1, d2, d3, d4, d6 (Q2)
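A sketch of the AND case as a sorted-postings intersection (docIDs ascending), which is the set operation behind Q1:

```python
def intersect(p1, p2):
    """Intersect two sorted docID lists in O(len(p1) + len(p2))."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1                          # advance the smaller pointer
        else:
            j += 1
    return result

print(intersect([1, 2, 3, 4], [2, 4, 6]))   # [2, 4], i.e., d2 and d4
```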
Ranking Documents
- Assumption: score(d,q) = f(g(w(d,q,t1), ..., w(d,q,tn)), w(d), w(q)), where the ti are the matched terms
- Maintain a score accumulator for each doc to compute the function g
- For each query term ti:
  - Fetch the inverted list {(d1,f1), ..., (dn,fn)}
  - For each entry (dj,fj), compute w(dj,q,ti) and update the score accumulator for doc dj
- Adjust the scores to compute f, and sort

Ranking Documents: Example
- Query = "info security"; S(d,q) = g(t1) + ... + g(tn) = sum of the frequencies of the matched terms
- info: (d1,3), (d2,4), (d3,1), (d4,5); security: (d2,3), (d4,1), (d5,3)
- Accumulators (d1 d2 d3 d4 d5), updated entry by entry:

  start:           0 0 0 0 0
  info (d1,3):     3 0 0 0 0
  info (d2,4):     3 4 0 0 0
  info (d3,1):     3 4 1 0 0
  info (d4,5):     3 4 1 5 0
  security (d2,3): 3 7 1 5 0
  security (d4,1): 3 7 1 6 0
  security (d5,3): 3 7 1 6 3
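The same computation as a term-at-a-time accumulator sketch (with g = the sum of raw frequencies, as in the example):

```python
from collections import defaultdict

postings = {"info":     [(1, 3), (2, 4), (3, 1), (4, 5)],
            "security": [(2, 3), (4, 1), (5, 3)]}

def rank(query_terms):
    acc = defaultdict(float)                 # one score accumulator per doc
    for t in query_terms:
        for doc_id, freq in postings.get(t, []):
            acc[doc_id] += freq              # here w(d, q, t) is just the freq
    return sorted(acc.items(), key=lambda kv: -kv[1])

print(rank(["info", "security"]))
# [(2, 7.0), (4, 6.0), (1, 3.0), (5, 3.0), (3, 1.0)] -- the final accumulators above
```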
Further Improving Efficiency
- Keep only the most promising accumulators
- Sort the inverted lists in decreasing order of weights and fetch only the N entries with the highest weights
- Pre-compute as much as possible

Open Source IR Toolkits
- Smart (Cornell)
- MG (RMIT & Melbourne, Australia; Waikato, New Zealand)
- Lemur (CMU / Univ. of Massachusetts)
- Terrier (Glasgow)
- Lucene (open source)
Smart
- The most influential IR system/toolkit
- Developed at Cornell since the 1960s
- Vector space model with lots of weighting options
- Written in C
- The Cornell/AT&T groups have used the Smart system to achieve top TREC performance

MG
- A highly efficient toolkit for retrieval of text and images
- Developed by people at Univ. of Waikato, Univ. of Melbourne, and RMIT in the 1990s
- Written in C, running on Unix
- Vector space model with lots of compression and speed-up tricks
- People have used it to achieve good TREC performance

Lemur/Indri
- An IR toolkit emphasizing language models
- Developed at CMU and Univ. of Massachusetts in the 2000s
- Written in C++, highly extensible
- Vector space and probabilistic models, including language models
- Achieves good TREC performance with a simple language model

Terrier
- A large-scale retrieval toolkit with lots of applications (e.g., desktop search) and TREC support
- Developed at the University of Glasgow, UK
- Written in Java, open source
- "Divergence from randomness" retrieval model and other modern retrieval formulas

Lucene
- Open-source IR toolkit
- Initially developed by Doug Cutting in Java
- Has now been ported to some other languages
- Good for building IR/Web applications
- Many applications have been built using Lucene (e.g., the Nutch search engine)
- Currently the retrieval algorithms have poor accuracy

What You Should Know
- What an inverted index is
- Why an inverted index helps make search fast
- How to construct a large inverted index
- Simple integer compression methods
- How to use an inverted index to rank documents efficiently
- HOW TO IMPLEMENT A SIMPLE IR SYSTEM

Applications of Basic IR Techniques
Some "Basic" IR Techniques
- Stemming
- Stop words
- Weighting of terms (e.g., TF-IDF)
- Vector/unigram representation of text
- Text similarity (e.g., cosine)
- Relevance/pseudo feedback (e.g., Rocchio)
They are not just for retrieval!

Generality of Basic Techniques
[Figure: raw text → (stemming & stop words) → tokenized text → (term weighting) → a term-document weight matrix (wij over terms t1...tn and docs d1...dm); term similarity, doc similarity, and vector centroids then feed CLUSTERING, CATEGORIZATION, and META-DATA/ANNOTATION, while sentence selection feeds SUMMARIZATION]

Sample Applications
- Information filtering (covered earlier)
- Text categorization
- Document/term clustering
- Text summarization
Text Categorization
- Pre-given categories and labeled document examples (categories may form a hierarchy)
- Classify new documents
- A standard supervised learning problem
[Figure: a categorization system assigns incoming documents to categories such as Sports, Business, Education, Science]

"Retrieval-based" Categorization
- Treat each category as representing an "information need"
- Treat the examples in each category as "relevant documents"
- Use feedback approaches to learn a good "query"
- Match all the learned queries to a new document
- A document gets the category (or categories) represented by the best matching query (or queries)

Prototype-based Classifier
- Key elements ("retrieval techniques"):
  - Prototype/document representation (e.g., term vector)
  - Document-prototype distance measure (e.g., dot product)
  - Prototype vector learning: Rocchio feedback
- Example: see the sketch below
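A minimal sketch of such a classifier (my own toy data and function names; the prototypes here are plain centroids, i.e., Rocchio with only positive examples):

```python
from collections import defaultdict

def centroid(vectors):
    """Average a list of term->weight dicts into a prototype vector."""
    proto = defaultdict(float)
    for v in vectors:
        for term, w in v.items():
            proto[term] += w / len(vectors)
    return proto

def classify(doc, prototypes):
    dot = lambda u, v: sum(w * v.get(t, 0.0) for t, w in u.items())
    return max(prototypes, key=lambda c: dot(prototypes[c], doc))

training = {"sports":  [{"game": 2.0, "team": 1.0}, {"team": 2.0, "score": 1.0}],
            "science": [{"theory": 2.0, "experiment": 1.0}]}
prototypes = {c: centroid(vs) for c, vs in training.items()}
print(classify({"team": 1.0, "game": 1.0}, prototypes))   # sports
```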
K-Nearest Neighbor Classifier
- Keep all training examples
- Find the k examples that are most similar to the new document ("neighbor" documents)
- Assign the category that is most common among these neighbor documents (the neighbors vote for the category)
- Can be improved by considering the distance of a neighbor (a closer neighbor has more influence)
- Technical elements ("retrieval techniques"): document representation; document distance measure

Example of K-NN Classifier
[Figure: the same test point classified with k=1 and with k=4]
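A matching k-NN sketch with cosine similarity (toy data; the term-vector representation is assumed):

```python
import math
from collections import Counter

def cosine(u, v):
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * \
          math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0

def knn_classify(doc, examples, k=4):
    """examples: list of (term->weight dict, label) pairs; neighbors vote."""
    neighbors = sorted(examples, key=lambda ex: cosine(doc, ex[0]), reverse=True)[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

examples = [({"ball": 1.0, "team": 1.0}, "sports"),
            ({"cell": 1.0, "gene": 1.0}, "science"),
            ({"team": 1.0, "win": 1.0}, "sports")]
print(knn_classify({"team": 1.0, "ball": 1.0}, examples, k=3))   # sports
```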
Examples of Text Categorization
- News article classification
- Meta-data annotation
- Automatic email sorting
- Web page classification

Sample Applications
- Information filtering
- Text categorization
- Document/term clustering
- Text summarization

The Clustering Problem
- Discover "natural structure"
- Group similar objects together
- Objects can be documents, terms, passages, ...

Similarity-based Clustering (as opposed to "model-based")
- Define a similarity function to measure the similarity between two objects
- Gradually group similar objects together in a bottom-up fashion
- Stop when some stopping criterion is met
- Variations: different ways to compute group similarity based on individual object similarity
Similarity-induced Structure
[Figure: a dendrogram over the objects induced by repeated bottom-up merging]

How to Compute Group Similarity?
Given two groups g1 and g2, three popular methods:
- Single-link algorithm: s(g1,g2) = similarity of the closest pair
- Complete-link algorithm: s(g1,g2) = similarity of the farthest pair
- Average-link algorithm: s(g1,g2) = average similarity of all pairs

Three Methods Illustrated
[Figure: two groups g1 and g2 with the closest pair (single-link), the farthest pair (complete-link), and all pairs (average-link) highlighted]
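The three definitions in symbols, with s(x,y) the object-level similarity:

```latex
s_{single}(g_1, g_2) = \max_{x \in g_1,\, y \in g_2} s(x,y)
\qquad
s_{complete}(g_1, g_2) = \min_{x \in g_1,\, y \in g_2} s(x,y)
\qquad
s_{avg}(g_1, g_2) = \frac{1}{|g_1|\,|g_2|} \sum_{x \in g_1} \sum_{y \in g_2} s(x,y)
```

Note that the closest pair has the maximum similarity and the farthest pair the minimum.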
Examples of Doc/Term Clustering
- Clustering of retrieval results
- Clustering of documents in the whole collection
- Term clustering to define a "concept" or "theme"
- Automatic construction of hyperlinks
- In general, very useful for text mining

Sample Applications
- Information filtering
- Text categorization
- Document/term clustering
- Text summarization

The Summarization Problem
- Essentially "semantic compression" of text
- Selection-based vs. generation-based summary
- In general, we need a purpose for summarization, but it's hard to define one
"Retrieval-based" Summarization
- Observation: term vector → summary?
- Basic approach: rank the "sentences", and select the top N as a summary
- Methods for ranking sentences:
  - Based on term weights
  - Based on the position of sentences
  - Based on the similarity of the sentence vector and the document vector

Simple Discourse Analysis
[Figure: adjacent sentence vectors (vector 1, vector 2, ..., vector n) compared by similarity to find segment boundaries]

A Simple Summarization Method
[Figure: from each discourse segment, the sentence most similar to the document vector is selected into the summary]
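A sketch of the third ranking method (similarity of the sentence and document vectors); the representation and names are mine:

```python
import math
from collections import Counter

def vec(text):
    return Counter(text.lower().split())          # raw term-frequency vector

def cosine(u, v):
    num = sum(c * v.get(t, 0) for t, c in u.items())
    den = math.sqrt(sum(c * c for c in u.values())) * \
          math.sqrt(sum(c * c for c in v.values()))
    return num / den if den else 0.0

def summarize(sentences, n=2):
    """Select the n sentences most similar to the whole-document vector."""
    doc_vec = vec(" ".join(sentences))
    return sorted(sentences, key=lambda s: cosine(vec(s), doc_vec), reverse=True)[:n]

print(summarize(["the cat sat", "dogs bark loudly", "the cat ran"], n=1))
```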
Examples of Summarization
- News summary
- Summarizing retrieval results: single-doc summary; multi-doc summary
- Summarizing a cluster of documents (automatic label creation for clusters)

What You Should Know
- Retrieval touches some basic issues in text information management (what are these basic issues?)
- How to apply simple retrieval techniques, such as the vector space model, to information filtering, text categorization, clustering, and summarization
- There are many other tasks that can potentially benefit from simple IR techniques

Roadmap
- This lecture: other retrieval models; IR system implementation; applications of basic TR techniques
- Next lecture: a more in-depth treatment of language models