《《clique是一个图中两两相邻的一个点集,或是一个完全子图》-精选课件(公开PPT)》由会员分享,可在线阅读,更多相关《《clique是一个图中两两相邻的一个点集,或是一个完全子图》-精选课件(公开PPT)(24页珍藏版)》请在金锄头文库上搜索。
1、Efficient Unsupervised Discovery of Word Categories Using Symmetric Patterns and High Frequency Words,Dmitry Davidov, Ari Rappoport The Hebrew University ACL 2006,Introduction,Discovering word categories, sets of words sharing a significant aspect of their meaning context feature vectors pattern-bas
2、ed discovery Manually prepared pattern set (ex. x and y) requiring POS tagging or partial or full parsing,Pattern Candidates,high frequency word (HFW) A word appearing more than TH times per million words Ex. and, or, from, to content word (CW) word appearing less than TC times per a million words,P
3、attern Candidates,meta-patterns obey the following constraints at most 4 words exactly two content words no two consecutive CWs Example CHC, CHCH, CHHC, and HCHC from x to y (HCHC), x and y (CHC), x and a y (CHHC),Symmetric Patterns,In order to find a usable subset from pattern candidates, we focus
4、on the symmetric patterns Example x belongs to y (asymmetric relationships) X and y (asymmetric relationships),Symmetric Patterns,We use single pattern graph G(P) to identifying symmetric patterns there is a directed arc A(x, y) from node x to node y iff the words x and y both appear in an instance
5、of the pattern P as its two CWs x precedes y in P SymG(P), the symmetric subgraph of G(P), containing only the bidirectional arcs and nodes of G(P),Symmetric Patterns,We compute three measures on G(P) M1 counts the proportion of words that can appear in both slots of the pattern M2, M3 measures coun
6、t the proportion of the number of symmetric nodes and edges in G(P),Symmetric Patterns,We removed patterns that appear in the corpus less than TP times per million words We remove patterns that are not in the top ZT in any of the three lists We remove patterns that are in the bottom ZB in at least o
7、ne of the lists,Discovery of Categories,words that are highly interconnected are good candidates to form a category word relationship graph G merging all of the single-pattern graphs into a single unified graph clique是一個圖中兩兩相鄰的一個點集,或是一個完全子圖,The Clique-Set Method,Strong n-cliques subgraphs containing
8、 n nodes that are all bidirectionally interconnected A clique Q defines a category that contains the nodes in Q plus all of the nodes that are (1) at least unidirectionally connected to all nodes in Q (2) bidirectionally connected to at least one node in Q,The Clique-Set Method,we use 2-cliques For
9、each 2-cliques, a category is generated that contains the nodes of all triangles that contain 2-cliques and at least one additional bidirectional arc. For example book and newspaper, newspaper and book, book and note, note and book and note and newspaper,The Clique-Set Method,a pair of nodes connect
10、ed by a symmetric arc can appear in more than a single category define exactly three strongly connected triangles, ABC,ABD,ACE arc (A,B) yields a category A,B,C,D arc (A,C) yields a category A,C,B,E,Category Merging,In order to cover as many words as possible, we use the smallest clique, a single sy
11、mmetric arc. This creates redundant categories We use two simple merge heuristics If two categories are identical we treat them as one Given two categories Q,R, we merge them iff theres more than a 50% overlap between them,Corpus Windowing,In order to increase category quality and remove categories
12、that are too context-specific Instead of running the algorithm of this section on the whole corpus, we divide the corpus into windows of equal size we select only those categories that appear in at least two of the windows,Evaluation,three corpora, two for English and one for Russian BNC, containing
13、 about 100M words Dmoz, 68GB containing about 8.2G words from 50M web pages Russian corpus was assembled from many web sites and carefully filtered for duplicates, to yield 33GB and 4G words,Evaluation,Thresholds thresholds TH, TC, TP ,ZT ,ZB ZB, were determined by memory size 100, 50, 20, 100, 100
14、window size 5% of the different words in the corpus assigned into categories,V=number of different words in the corpus. W=number of words belonging to at least one categories. C= is the number of categories AS= is the average category size,Human Judgment Evaluation,Part I given 40 triplets of words
15、and were asked to rank them using 1-5 (1 is best) 40 triplets were obtained as follows 20 of our categories were selected at random from the non-overlapping categories 10 triplets were selected from the categories produced by k-means (Pantel and Lin, 2002) 10 triplets were generated by random select
16、ion of content words,Human Judgment Evaluation,Part II, subjects were given the full categories of the triplets that were graded as 1 or 2 in Part I They were asked to grade the categories from 1 (worst) to 10 (best) according to how much the full category had met the expectations they had when seeing only the triplet,Human Judgment Evaluation,shared meaning average percentage of triplets that were given scores of 1 or 2,WordNet-Based Evaluation,Took 10 WN subsets referred to a