Information Retrieval Models
Peng Bo, Oct 30, 2010

Review of the last lecture
- Basic index techniques
  - Inverted index
  - Dictionary & postings
- Scoring and ranking
  - Term weighting
  - tf.idf
  - Vector Space Model
  - Cosine similarity
- IR evaluation
  - Precision, Recall, F
  - Interpolation
  - MAP, interpolated AP

Outline of this lecture
- Information Retrieval Models
  - Vector Space Model (VSM)
  - Latent Semantic Model (LSI)
  - Language Model (LM)
- Relevance Feedback
- Query Expansion

Vector Space Model

Documents as vectors
- Each document j can be viewed as a vector: each term is one dimension, and the value is the log-scaled tf.idf weight.
- So we have a vector space
  - terms are axes
  - docs live in this space
- High-dimensional space: even with stemming, there may be 20,000+ dimensions

Example: log-scaled tf.idf weights for six terms in six documents

             D1     D2     D3     D4     D5     D6
  中国       4.1    0.0    3.7    5.9    3.1    0.0
  文化       4.5    4.5    0      0      11.6   0
  日本       0      3.5    2.9    0      2.1    3.9
  留学生     0      3.1    5.1    12.8   0      0
  教育       2.9    0      0      2.2    0      0
  北京       7.1    0      0      0      4.4    3.8

Intuition
- Postulate: documents that are "close together" in the vector space talk about the same things.
  (Figure: documents d1-d5 plotted in a space whose axes are the terms t1, t2, t3.)
- Use cases: query-by-example; a free-text query is itself treated as a vector.

Cosine similarity
- The "closeness" of two vectors d1 and d2 can be measured by the angle between them.
- Concretely, the cosine of the angle x between the vectors is used as the similarity score.
- Vectors are normalized by their length (cosine normalization).

#1. COS Similarity (exercise)
- Compute the similarity between the query "digital cameras" and the document "digital cameras and video cameras".
- Assume N = 10,000,000; both query and document use logarithmic term weighting (the wf columns); the query additionally uses idf weighting; the document uses cosine normalization. "and" is treated as a stop word.

#2. Evaluation (exercise)
- Define a precision-recall graph as follows: for a query's ranked result list, compute a precision/recall point at each returned document; the graph consists of these points.
- On this graph, define the breakeven point as a point where the precision and recall values are equal.
- Question: can a graph have more than one breakeven point? If so, give an example; if not, prove it.
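A small Python sketch of the weighting scheme in exercise #1. The document frequencies needed for the query's idf weights are not given in the text above, so the df values below are hypothetical placeholders; only the weighting steps themselves follow the exercise statement.

```python
import math
from collections import Counter

def log_tf(count):
    # logarithmic term weighting: wf = 1 + log10(tf) for tf > 0, else 0
    return 1.0 + math.log10(count) if count > 0 else 0.0

def cosine_score(query_tokens, doc_tokens, df, N, stop_words=frozenset({"and"})):
    """Query: log tf x idf; document: log tf with cosine (length) normalization."""
    q_tf = Counter(t for t in query_tokens if t not in stop_words)
    d_tf = Counter(t for t in doc_tokens if t not in stop_words)

    # query weights: wf * idf
    q_w = {t: log_tf(c) * math.log10(N / df[t]) for t, c in q_tf.items()}
    # document weights: wf, then length normalization
    d_w = {t: log_tf(c) for t, c in d_tf.items()}
    d_len = math.sqrt(sum(w * w for w in d_w.values()))
    d_w = {t: w / d_len for t, w in d_w.items()}

    # dot product over shared terms
    return sum(q_w[t] * d_w.get(t, 0.0) for t in q_w)

# hypothetical document frequencies (the slide's df table is not in the text)
df = {"digital": 10_000, "video": 100_000, "cameras": 50_000}
score = cosine_score("digital cameras".split(),
                     "digital cameras and video cameras".split(),
                     df, N=10_000_000)
print(round(score, 3))
```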

Latent Semantic Model

Vector Space Model: Pros
- Automatic selection of index terms
- Partial matching of queries and documents (dealing with the case where no document contains all search terms)
- Ranking according to similarity score (dealing with large result sets)
- Term weighting schemes (improves retrieval performance)
- Various extensions
  - Document clustering
  - Relevance feedback (modifying the query vector)
- Geometric foundation

Problems with Lexical Semantics
- Polysemy: words usually have a multitude of meanings and different usages. The Vector Space Model cannot distinguish between different senses of the same word, i.e. it cannot handle ambiguity.
- Synonymy: different terms may have an identical or a similar meaning. The Vector Space Model cannot represent associations between words.

Issues in the VSM
- Independence assumption between terms
  - some terms are more likely to appear together
  - synonyms, related words, spelling mistakes, etc.
  - depending on context, terms may have different meanings
- The term-document matrix has very high dimensionality
  - are there really that many important features for each document / each word?

Singular Value Decomposition
- Apply singular value decomposition to the term-document matrix W:

    W (t x d) = T (t x r) Σ (r x r) D^T (r x d)

  - r : the rank of the matrix
  - Σ : diagonal matrix of singular values (in decreasing order)
  - T, D : have orthonormal, unit-length columns (T^T T = I, D^T D = I)
- The squared singular values on the diagonal of Σ are the eigenvalues of W W^T (equivalently of W^T W); the columns of T and D are the corresponding eigenvectors of W W^T and W^T W.
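A minimal numpy sketch of this factorization, using the small tf.idf table from earlier in the lecture as W; it checks the orthonormal-column properties stated above.

```python
import numpy as np

# the example term-document matrix from earlier in the lecture (terms x docs)
W = np.array([
    [4.1, 0.0, 3.7, 5.9,  3.1, 0.0],   # 中国
    [4.5, 4.5, 0.0, 0.0, 11.6, 0.0],   # 文化
    [0.0, 3.5, 2.9, 0.0,  2.1, 3.9],   # 日本
    [0.0, 3.1, 5.1, 12.8, 0.0, 0.0],   # 留学生
    [2.9, 0.0, 0.0, 2.2,  0.0, 0.0],   # 教育
    [7.1, 0.0, 0.0, 0.0,  4.4, 3.8],   # 北京
])

# SVD: W = T * diag(sigma) * D^T
T, sigma, Dt = np.linalg.svd(W, full_matrices=False)

print(sigma)                                         # singular values, in decreasing order
print(np.allclose(T.T @ T, np.eye(T.shape[1])))      # T has orthonormal columns
print(np.allclose(Dt @ Dt.T, np.eye(Dt.shape[0])))   # D has orthonormal columns
print(np.allclose(T @ np.diag(sigma) @ Dt, W))       # exact reconstruction
```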

Singular Values
- Σ gives an ordering to the dimensions
- The values drop off very quickly
- The trailing singular values represent "noise"
- Cutting off the low-value dimensions can reduce noise and improve performance

Low-rank Approximation
- Keep only the k largest singular values:

    W (t x d) = T (t x r) Σ (r x r) D^T (r x d)   ≈   W_k (t x d) = T_k (t x k) Σ_k (k x k) D_k^T (k x d)

Latent Semantic Indexing (LSI)
- Perform a low-rank approximation of the term-document matrix (typical rank 100-300)
- General idea
  - Map documents (and terms) to a low-dimensional representation.
  - Design the mapping such that the low-dimensional space reflects semantic associations (a latent semantic space).
  - Compute document similarity based on the inner product in this latent semantic space.

What it is
- From the original term-document matrix A (of rank r), we compute its approximation A_k.
- In A_k, each row still corresponds to a term and each column to a document.
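A short numpy sketch of the rank-k approximation just described, reusing the toy matrix from the previous snippet; k = 2 here only because the matrix is 6 x 6, whereas the slide's typical rank is 100-300.

```python
import numpy as np

# the 6 x 6 toy term-document matrix from the previous snippet
W = np.array([
    [4.1, 0.0, 3.7, 5.9,  3.1, 0.0],
    [4.5, 4.5, 0.0, 0.0, 11.6, 0.0],
    [0.0, 3.5, 2.9, 0.0,  2.1, 3.9],
    [0.0, 3.1, 5.1, 12.8, 0.0, 0.0],
    [2.9, 0.0, 0.0, 2.2,  0.0, 0.0],
    [7.1, 0.0, 0.0, 0.0,  4.4, 3.8],
])

k = 2                                            # tiny because the toy matrix is tiny
T, sigma, Dt = np.linalg.svd(W, full_matrices=False)

# rank-k approximation: keep only the k largest singular values
W_k = T[:, :k] @ np.diag(sigma[:k]) @ Dt[:k, :]
print(np.linalg.matrix_rank(W_k))                # 2

# each document (column) now has a k-dimensional representation in the latent space
doc_coords = np.diag(sigma[:k]) @ Dt[:k, :]      # shape (k, number of documents)
print(np.round(doc_coords, 2))
```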

What it is (continued)
- The difference is that in the new space each document is represented in only k dimensions, with k much smaller than r.

Language Model

Example: two unigram models M1 and M2 assign different probabilities P(s | M1) and P(s | M2) to the same string s. Word probabilities under one of the models: the 0.2, class 0.0001, sayst 0.03, pleaseth 0.02, yon 0.1, maiden 0.01, woman 0.0001.

Stochastic Language Models
- A statistical model for generating text
- A probability distribution over strings in a given language, defined by a model M:

    P(s | M) = P(w1 | M) x P(w2 | M, w1) x P(w3 | M, w1 w2) x P(w4 | M, w1 w2 w3) ...

Unigram and higher-order models
- Unigram Language Models
- Bigram (generally, n-gram) Language Models
- Other Language Models
  - Grammar-based models (PCFGs), etc.
  - Probably not the first thing to try in IR
- Chain rule:  P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w1 w2) P(w4 | w1 w2 w3)
- Unigram:     P(w1 w2 w3 w4) = P(w1) P(w2) P(w3) P(w4)   (Easy. Effective!)
- Bigram:      P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)
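A small Python sketch contrasting the unigram and bigram factorizations above; the toy corpus and the four-token string are invented for illustration.

```python
from collections import Counter

corpus = "the frog said that the toad likes the frog".split()

# MLE unigram and bigram estimates from the toy corpus
unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_unigram(w):
    return unigram[w] / N

def p_bigram(w, prev):
    return bigram[(prev, w)] / unigram[prev] if unigram[prev] else 0.0

s = "the toad likes the".split()

# Unigram model: P(w1 w2 w3 w4) = P(w1) P(w2) P(w3) P(w4)
p_uni = 1.0
for w in s:
    p_uni *= p_unigram(w)

# Bigram model: P(w1 w2 w3 w4) = P(w1) P(w2|w1) P(w3|w2) P(w4|w3)
p_bi = p_unigram(s[0])
for prev, w in zip(s, s[1:]):
    p_bi *= p_bigram(w, prev)

print(p_uni, p_bi)
```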

The fundamental problem of LMs
- The model M is unknown
- We only see a sample of text that is representative of the model
- Estimate the model M from the sample text
- Then compute the probability of the observed text: P(s | M(sample))

Using Language Models in IR
- Each document corresponds to a model
- Rank documents by P(d | q)
- P(d | q) = P(q | d) x P(d) / P(q)
  - P(q) is the same for all documents, so ignore it
  - P(d), the prior, is often treated as the same for all d
    - but we could use criteria like authority, length, genre
  - P(q | d) is the probability of q given d's model
- A very general formal approach

Language Models for IR
- Language modeling approaches
  - model the query generation process
  - rank documents by the probability that a query would be observed as a random sample from the respective document model
  - multinomial approach

Retrieval based on probabilistic LM
- Treat query generation as a random process
- Approach
  - Infer a language model for each document.
  - Estimate the probability that each document model generated the query.
  - Rank documents by this probability.
  - Usually a unigram model is used.

Query generation probability (1)
- Ranking formula: rank d by the probability of generating the query from d's model

    P(Q | M_d) = product over terms t in Q of P(t | M_d)

  (Unigram assumption: given a particular language model, the query terms occur independently.)
- Maximum-likelihood estimate:

    P_ml(t | M_d) = tf(t,d) / dl(d)

  - M_d : the language model of document d
  - tf(t,d) : the raw tf of term t in document d
  - dl(d) : the total number of tokens in document d

Insufficient data
- Zero probability
  - arises whenever a document contains none of a query term
- General approach
  - a term that does not occur in the document is given its probability of occurring in the collection:
    if tf(t,d) = 0, use P(t | M_c) = cf(t) / cs
  - cf(t) : the raw count of term t in the collection
  - cs : the raw collection size (total number of tokens in the collection)

Insufficient data
- Zero probabilities spell disaster
- Use smoothing: smooth the probabilities
  - Discount nonzero probabilities
  - Give some probability mass to unseen things
  - There are many approaches, e.g. adding 1 (or a small ε) to counts, Dirichlet priors, discounting, and interpolation
  - See FSNLP ch. 6 if you want more
- Use a mixture model: a mixture between the document multinomial and the collection multinomial distribution
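A minimal Python sketch of the maximum-likelihood query-generation score, using the d1 example document that appears a couple of slides below; it shows how a single unseen query term zeroes out a document's score, which is exactly the "insufficient data" problem.

```python
from collections import Counter

def p_ml(term, doc_tokens):
    # maximum-likelihood estimate: raw tf divided by document length
    return Counter(doc_tokens)[term] / len(doc_tokens)

def query_likelihood_ml(query_tokens, doc_tokens):
    # unigram assumption: query terms are generated independently
    score = 1.0
    for t in query_tokens:
        score *= p_ml(t, doc_tokens)
    return score

d = "xerox reports a profit but revenue is down".split()
print(query_likelihood_ml("revenue down".split(), d))   # 1/8 * 1/8 = 0.015625
print(query_likelihood_ml("revenue loss".split(), d))   # "loss" unseen -> 0.0
```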

Mixture model
- P(w | d) = λ · P_ml(w | M_d) + (1 − λ) · P_ml(w | M_c)
- The parameter λ matters a lot
  - a high λ makes the query "conjunctive-like" -- suitable for short queries
  - a low λ is more suitable for long queries
- Tune λ to optimize performance
  - e.g. make it depend on document length (cf. Dirichlet prior or Witten-Bell smoothing)

Basic mixture model summary
- General formulation of the LM for IR:

    P(Q | d) = product over terms w in Q of ( λ · P_ml(w | M_d) + (1 − λ) · P_ml(w | M_c) )

  where M_d is the individual-document model and M_c is the general (collection) language model.

Example
- Document collection (2 documents)
  - d1: Xerox reports a profit but revenue is down
  - d2: Lucent narrows quarter loss but revenue decreases further
- Model: MLE unigram from the documents; λ = 1/2
- Query: revenue down
- P(Q | d1)
  = (1/8 + 2/16)/2 x (1/8 + 1/16)/2
  = 1/8 x 3/32 = 3/256
- P(Q | d2)
  = (1/8 + 2/16)/2 x (0 + 1/16)/2
  = 1/8 x 1/32 = 1/256
- Ranking: d1 > d2
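A short Python sketch that reproduces the worked example above (λ = 1/2, MLE unigram document and collection models).

```python
from collections import Counter

d1 = "Xerox reports a profit but revenue is down".lower().split()
d2 = "Lucent narrows quarter loss but revenue decreases further".lower().split()
collection = d1 + d2
query = "revenue down".split()

def mixture_score(query, doc, collection, lam=0.5):
    """P(Q|d) = product over query terms of lam*P_ml(t|d) + (1-lam)*P_ml(t|collection)."""
    doc_tf, col_tf = Counter(doc), Counter(collection)
    score = 1.0
    for t in query:
        p_doc = doc_tf[t] / len(doc)
        p_col = col_tf[t] / len(collection)
        score *= lam * p_doc + (1 - lam) * p_col
    return score

print(mixture_score(query, d1, collection))   # 3/256 ~= 0.01172
print(mixture_score(query, d2, collection))   # 1/256 ~= 0.00391
```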

Alternative Models of Text Generation
- (Diagram) A searcher generates a query through a query model; a writer generates a document through a doc model. Is this the same model?

Retrieval Using Language Models
- Three ways of using language models for retrieval:
  1. Query likelihood
  2. Document likelihood
  3. Model comparison

Query Likelihood
- Rank by P(Q | D_m)
- The main issue is estimating the document model
  - i.e. smoothing techniques instead of tf.idf weights
- Retrieval performance is good
  - e.g. UMass, BBN, Twente, CMU
- Problem: hard to handle relevance feedback, query expansion, structured queries

Document Likelihood
- Rank by P(D | R) / P(D | NR)
  - P(w | R) is estimated by P(w | Q_m); Q_m is the query or relevance model
  - P(w | NR) is estimated by collection probabilities P(w)
- The issue is estimating the relevance model
  - Treat the query as generated by a mixture of topic and background
  - Estimate the relevance model from related documents (query expansion)
  - Relevance feedback is easily incorporated
- Good retrieval results
  - e.g. UMass at SIGIR 01
  - inconsistent with heterogeneous document collections

Model Comparison
- Estimate both a query model and a document model, then compare the models
- KL divergence D(Q_m || D_m)
- Achieves better results than the previous two approaches
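A minimal sketch of option (3), ranking documents by the KL divergence between a query model and a smoothed document model; the MLE estimates and the λ = 1/2 mixture below are illustrative choices, not the exact estimators used in the cited systems.

```python
import math
from collections import Counter

def unigram(tokens):
    tf = Counter(tokens)
    n = len(tokens)
    return {w: c / n for w, c in tf.items()}

def kl_divergence(q_model, d_model):
    """D(Q_m || D_m) = sum_w P(w|Q_m) * log( P(w|Q_m) / P(w|D_m) ).
    Smaller divergence means the document model is closer to the query model."""
    return sum(p * math.log(p / d_model.get(w, 1e-12)) for w, p in q_model.items())

docs = {
    "d1": "xerox reports a profit but revenue is down".split(),
    "d2": "lucent narrows quarter loss but revenue decreases further".split(),
}
collection = [t for toks in docs.values() for t in toks]
col_model = unigram(collection)

def smoothed_doc_model(tokens, lam=0.5):
    d_model = unigram(tokens)
    return {w: lam * d_model.get(w, 0.0) + (1 - lam) * p for w, p in col_model.items()}

q_model = unigram("revenue down".split())
ranking = sorted(docs, key=lambda d: kl_divergence(q_model, smoothed_doc_model(docs[d])))
print(ranking)   # documents ordered by increasing divergence from the query model
```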

Language models: pro & con
- A novel way of looking at the problem of text retrieval, based on probabilistic language modeling
  - Conceptually simple and explanatory
  - Formal mathematical model
  - Natural use of collection statistics, not heuristics (almost)
- LMs provide effective retrieval and can be improved to the extent that the following conditions can be met
  - Our language models are accurate representations of the data.
  - Users have some sense of term distribution.

Comparison With Vector Space
- There are clear connections to traditional tf.idf models:
  - (unscaled) term frequency is directly in the model
  - the probabilities do length normalization of term frequencies
  - the effect of doing a mixture with overall collection frequencies is a little like idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking

Comparison With Vector Space
- Similarities
  - Term weights based on frequency
  - Terms often used as if they were independent
  - Inverse document/collection frequency used
  - Some form of length normalization used
- Differences
  - Based on probability rather than similarity
  - Intuitions are probabilistic rather than geometric
  - Details of the use of document length and term, document, and collection frequency differ

Summary of this lecture
- Latent Semantic Indexing
  - singular value decomposition
  - matrix low-rank approximation
- Language Model
  - generative model
  - smoothed probabilities
  - mixture model

Resources
- The Template Numerical Toolkit (TNT)
- The Lemur Toolkit for Language Modeling and Information Retrieval. http://www-2.cs.cmu.edu/lemur/ CMU/UMass LM and IR system in C(++), currently actively developed.

Thank You!  Q&A

Reading materials
- [1] IIR Ch. 12, Ch. 18
- [2] A. Moffat, J. Zobel, and D. Hawking, "Recommended reading for IR research students," SIGIR Forum, vol. 39, pp. 3-14, 2005.

#2 Evaluation (answer)
- There cannot be two or more breakeven points.
- Proof: consider one retrieval run I with relevant-document set R. Suppose the current point is a breakeven point, with retrieved set A and retrieved-relevant set Ra; then precision = |Ra|/|A| and recall = |Ra|/|R|. By the definition of the breakeven point, precision = recall, so |R| = |A|. Suppose that after retrieving k more documents (k > 0) another breakeven point appears, with retrieved set A' and retrieved-relevant set Ra'; then precision = |Ra'|/|A'| and recall = |Ra'|/|R|, so |A'| = |R|. But |A'| = |A| + k with k > 0 and |A| = |R|, which is a contradiction. Hence there cannot be two or more breakeven points.
- Note: when no relevant document has been retrieved yet, both precision and recall are zero. Is that a breakeven point? If this case is counted, then there can be two or more breakeven points.
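A tiny Python sketch of the note above: it scans a ranked result list and collects every rank at which precision equals recall; under the convention that both are zero before the first relevant document is retrieved, more than one such point appears.

```python
def breakeven_points(relevance, total_relevant):
    """relevance: list of booleans for the ranked result list (True = relevant).
    Returns the ranks at which precision == recall."""
    points, hits = [], 0
    for rank, rel in enumerate(relevance, start=1):
        hits += rel
        precision = hits / rank
        recall = hits / total_relevant
        if precision == recall:
            points.append(rank)
    return points

# ranked list: first two results are non-relevant, then a relevant one, ...
ranked = [False, False, True, False, True]
print(breakeven_points(ranked, total_relevant=2))   # [1, 2]: both are "zero" breakeven points
```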

Matrix Low-rank Approximation for LSI

Eigenvalues & Eigenvectors
- Eigenvectors (for a square m x m matrix S): a non-zero vector v with S v = λ v, where λ is the (right) eigenvalue and v the (right) eigenvector.
- How many eigenvalues are there at most?
  - S v = λ v only has a non-zero solution if det(S − λ I) = 0
  - this is an m-th order equation in λ, which can have at most m distinct solutions (roots of the characteristic polynomial)
  - eigenvalues can be complex even though S is real

Example: matrix-vector multiplication
- An example matrix S has eigenvalues 3, 2, 0 with corresponding eigenvectors v1, v2, v3.
- Any vector (say x) can be viewed as a combination of the eigenvectors: x = 2 v1 + 4 v2 + 6 v3.

Matrix-vector multiplication
- Thus a matrix-vector multiplication such as S x (S, x as on the previous slide) can be rewritten in terms of the eigenvalues/vectors:

    S x = S (2 v1 + 4 v2 + 6 v3) = 2 S v1 + 4 S v2 + 6 S v3 = 2 λ1 v1 + 4 λ2 v2 + 6 λ3 v3

- Even though x is an arbitrary vector, the action of S on x is determined by the eigenvalues/vectors.
- Suggestion: the effect of "small" eigenvalues is small.

Eigenvalues & Eigenvectors
- For symmetric matrices, eigenvectors for distinct eigenvalues are orthogonal.
- All eigenvalues of a real symmetric matrix are real.

Example
- Let S be a real, symmetric matrix. Then det(S − λ I) = 0 gives the eigenvalues 1 and 3 (nonnegative, real).
- The eigenvectors are orthogonal (and real): plug these eigenvalues in and solve for the eigenvectors.
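A quick numpy check of the symmetric example; the matrix below is an assumed stand-in (any real symmetric 2 x 2 matrix with eigenvalues 1 and 3 would do, and [[2, 1], [1, 2]] is one such matrix).

```python
import numpy as np

# a real, symmetric matrix whose eigenvalues are 1 and 3 (assumed example)
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is for symmetric (Hermitian) matrices: real eigenvalues, orthonormal eigenvectors
eigenvalues, eigenvectors = np.linalg.eigh(S)

print(eigenvalues)                              # [1. 3.]
print(eigenvectors[:, 0] @ eigenvectors[:, 1])  # ~0: the eigenvectors are orthogonal
print(np.allclose(S @ eigenvectors[:, 1], eigenvalues[1] * eigenvectors[:, 1]))  # S v = lambda v
```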

Eigen/diagonal Decomposition
- Let S be a square matrix with m linearly independent eigenvectors.
- Theorem: there exists an eigen decomposition S = U Λ U^-1, with Λ diagonal.
  - The columns of U are eigenvectors of S.
  - The diagonal elements of Λ are the eigenvalues of S.

Diagonal decomposition: why/how
- Let U have the eigenvectors as columns: U = [v1 ... vm].
- Then S U can be written S U = [S v1 ... S vm] = [λ1 v1 ... λm vm] = U Λ.
- Thus S U = U Λ, or U^-1 S U = Λ. And S = U Λ U^-1.

Diagonal decomposition - example
- Recall the 2 x 2 symmetric example above: its eigenvectors form the columns of U; inverting U gives U^-1, and then S = U Λ U^-1. (Recall U U^-1 = I.)

Example continued
- Let's divide U (and multiply U^-1) by the lengths of the eigenvectors, so that the columns have unit length.
- Then S = Q Λ Q^T, with Q^-1 = Q^T. Why? Stay tuned...

Symmetric Eigen Decomposition
- If S is a symmetric matrix:
- Theorem: there exists a (unique) eigen decomposition S = Q Λ Q^T, where Q is orthogonal:
  - Q^-1 = Q^T
  - columns of Q are normalized eigenvectors
  - columns are orthogonal
  - (everything is real)

Time out!
- What do these matrices have to do with text?
- Recall the m x n term-document matrices...
- But everything so far needs square matrices - so...

Singular Value Decomposition
- For an m x n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows:

    A = U Σ V^T,   where U is m x m and V is n x n

  - The columns of U are orthogonal eigenvectors of A A^T.
  - The columns of V are orthogonal eigenvectors of A^T A.
  - The eigenvalues λ1 ... λr of A A^T are also the eigenvalues of A^T A.
  - Σ = diag(σ1, ..., σr) with σi = sqrt(λi): the singular values.

Singular Value Decomposition
- (Figure: illustration of SVD dimensions and sparseness.)

SVD example
- Let A be a 3 x 2 matrix (thus m = 3, n = 2); its SVD is A = U Σ V^T, with the singular values typically arranged in decreasing order.

Low-rank Approximation
- SVD can be used to compute optimal low-rank approximations.
- Approximation problem: find A_k of rank k such that

    A_k = argmin over X with rank(X) = k of || A − X ||_F

  where A_k and X are both m x n matrices.

Low-rank Approximation
- Typically, we want k << r.  (Frobenius norm: || M ||_F = sqrt( sum over i, j of M_ij^2 ).)
- Solution via SVD: set the smallest r − k singular values to zero,

    A_k = U diag(σ1, ..., σk, 0, ..., 0) V^T

- Column notation: a sum of k rank-1 matrices,

    A_k = σ1 u1 v1^T + ... + σk uk vk^T

Approximation error
- How good (bad) is this approximation?
- It's the best possible, measured by the Frobenius norm of the error:

    min over X with rank(X) = k of || A − X ||_F = || A − A_k ||_F = sqrt( σ_{k+1}^2 + ... + σ_r^2 )

  where the σi are ordered such that σi ≥ σ_{i+1}.
- This suggests why the Frobenius error drops as k is increased.
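A quick numpy check of the error formula above, comparing the Frobenius error of a truncated SVD against the tail singular values; the random matrix is just a stand-in for a term-document matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))                   # any m x n matrix will do
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]   # keep the k largest singular values

frobenius_error = np.linalg.norm(A - A_k, "fro")
predicted = np.sqrt(np.sum(sigma[k:] ** 2))       # sqrt(sigma_{k+1}^2 + ... + sigma_r^2)

print(np.isclose(frobenius_error, predicted))     # True
```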

SVD Low-rank approximation
- Whereas the term-doc matrix A may have m = 50,000 and n = 10 million (and rank close to 50,000),
- we can construct an approximation A_100 with rank 100.
- Of all rank-100 matrices, it would have the lowest Frobenius error.
- Great... but why would we?
- Answer: Latent Semantic Indexing

C. Eckart, G. Young, "The approximation of a matrix by another of lower rank," Psychometrika, 1, 211-218, 1936.

Performing the maps
- Each row and column of A gets mapped into the k-dimensional LSI space, by the SVD.
- A query q is also mapped into this space.
