Clustering Methods
Applications of Multivariate Statistical Analysis

Jiangsheng Yu
School of Electronics Engineering and Computer Science
Peking University, Beijing

Clustering Methods p.1/48

Topics
1. What's clustering?
2. Similarity Measures
3. Hierarchical Clustering Methods
4. Nonhierarchical Clustering Methods
5. Multidimensional Scaling
6. Correspondence Analysis
7. Biplots for Viewing Sampling Units and Variables
8. Procrustes Analysis: A Method
9. Conclusion
10. References

Clustering Problem
Exploratory procedures are helpful in understanding the complex nature of multivariate relationships.

Problem 1. Given a set of data x_i \in R^p satisfying
1. the number of classes is unknown;
2. the class of any individual is unknown;
we intend to
1. define some suitable statistics;
2. clarify the number of classes K;
3. find a reasonable clustering method; and
4. classify the data into K categories.*

* For this reason, clustering is also called unsupervised classification.

Example of Clustering
Partition a given set by some similarity.

[Figure 1: fuzzy c-means clustering]

Observation Matrix
Given n sample points, each with m variables:

            X_1   ...  X_j   ...  X_m
  x_1       x_11  ...  x_1j  ...  x_1m
  ...
  x_i       x_i1  ...  x_ij  ...  x_im
  ...
  x_n       x_n1  ...  x_nj  ...  x_nm
  mean      x̄_1   ...  x̄_j   ...  x̄_m
  std       s_1   ...  s_j   ...  s_m

Table 1: Observation data

No Best Clustering Method

[Figure: the playing cards A, K, Q, J]

Cluster Methods
1. System Method: merge the most similar classes, update the data, and repeat the procedure until all the data are classified.
2. Dynamic Method: give an initial classification of the data first, then adjust the classes by minimizing a loss function until no improvement can be made.
3. Fuzzy Method: for instance fuzzy c-means clustering; usually works well for data with fuzzy characteristics.
4. Method of Minimum Spanning Tree
5. …
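The System Method above can be sketched as a small agglomerative loop. This is a minimal illustration, not the lecture's own implementation: the sample points, the choice of K, and the use of single linkage (class distance = smallest pairwise point distance) are all assumptions for the sketch.

```python
# Sketch of the "System Method": repeatedly merge the two most similar
# classes until K classes remain. Single linkage is an assumed choice.
from itertools import combinations

def euclidean(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

def single_linkage(c1, c2, points):
    # distance between two classes = smallest pairwise point distance
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points, K):
    clusters = [[i] for i in range(len(points))]  # start: every point alone
    while len(clusters) > K:
        # find the most similar (closest) pair of classes ...
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: single_linkage(clusters[p[0]],
                                               clusters[p[1]], points))
        clusters[a] += clusters[b]   # ... merge them,
        del clusters[b]              # update the class list, and repeat
    return clusters

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (0.2, 0.1)]
print(sorted(sorted(c) for c in agglomerate(pts, 2)))  # [[0, 1, 4], [2, 3]]
```

Recomputing all pairwise class distances in every pass is quadratic per merge; it keeps the sketch short, while a serious implementation would cache the distance matrix.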
Transformation of Data
Centralization: make the mean 0 and keep the variance-covariance matrix unchanged.

  x^*_{ij} = x_{ij} - \bar{x}_j    (1)

  S^* = S = (s_{ij})_{m \times m}    (2)

  s_{ij} = \frac{1}{n-1} \sum_{t=1}^{n} (x_{ti} - \bar{x}_i)(x_{tj} - \bar{x}_j) = \frac{1}{n-1} \sum_{t=1}^{n} x^*_{ti} x^*_{tj}    (3)

Standardization: make the mean 0 and the standard deviation 1.

  x^*_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}    (4)

Measuring Similarity
1. Distance
   (a) Minkowski Distance
   (b) Statistical Distance
   (c) Lance Distance
   (d) Mahalanobis Distance, …
2. Measure (not necessarily a distance)
   (a) Canberra Measure
   (b) Czekanowski Coefficient, …
3. Similarity Coefficient
   (a) Cosine Similarity
   (b) Correlation Coefficient, …
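The two transformations can be checked numerically. A minimal sketch, using the 1/(n-1) convention from equation (3); the 3×2 data matrix is an illustrative assumption.

```python
# Centralization (eq. 1) and standardization (eq. 4) of an n-by-m data
# matrix, column by column, with the sample std from eq. (3).

def column_stats(X):
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    stds = [(sum((row[j] - means[j]) ** 2 for row in X) / (n - 1)) ** 0.5
            for j in range(m)]
    return means, stds

def centralize(X):                                   # eq. (1)
    means, _ = column_stats(X)
    return [[x - mu for x, mu in zip(row, means)] for row in X]

def standardize(X):                                  # eq. (4)
    means, stds = column_stats(X)
    return [[(x - mu) / s for x, mu, s in zip(row, means, stds)]
            for row in X]

X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
Z = standardize(X)
print(column_stats(Z))  # each column of Z now has mean 0 and std 1
```

Centralizing leaves the covariance matrix unchanged (equation 2), while standardizing rescales it into the correlation matrix.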
Minkowski Distance
1. Minkowski Distance:

  d_k(x_i, x_j) = \left[ \sum_{t=1}^{m} |x_{it} - x_{jt}|^k \right]^{1/k}    (5)

2. Euclidean Distance:

  d_2(x_i, x_j) = \sqrt{ \sum_{t=1}^{m} (x_{it} - x_{jt})^2 }    (6)

3. Chebyshev Distance:

  d_\infty(x_i, x_j) = \max_{1 \le t \le m} |x_{it} - x_{jt}|    (7)

Distances without Measurement Unit
1. Statistical Distance:

  d_k(x_i, x_j) = \left[ \sum_{t=1}^{m} \left| \frac{x_{it} - x_{jt}}{s_t} \right|^k \right]^{1/k}    (8)

2. Lance Distance:

  d_L(x_i, x_j) = \frac{1}{m} \sum_{t=1}^{m} \frac{|x_{it} - x_{jt}|}{x_{it} + x_{jt}}    (9)

3. Mahalanobis Distance:

  d_M(x_i, x_j) = (x_i - x_j)^T S^{-1} (x_i - x_j)    (10)

Measures
1. Canberra Measure:

  m(x_i, x_j) = \sum_{t=1}^{m} \frac{|x_{it} - x_{jt}|}{x_{it} + x_{jt}}    (11)

2. Czekanowski Coefficient:

  m(x_i, x_j) = 1 - \frac{2 \sum_{t=1}^{m} \min(x_{it}, x_{jt})}{\sum_{t=1}^{m} (x_{it} + x_{jt})}    (12)

Neither of them is a distance.

Shortcomings of Distances
1. Both the Minkowski Distance and the Lance Distance assume independence between the random variables (r.v.), i.e., they measure distance in an orthogonal space. In practice, however, the r.v.s are correlated. The Mahalanobis Distance overcomes this shortcoming, but …
2. The Mahalanobis Distance works badly if the covariance matrix S is calculated from all the data. Restricted to a particular class, its performance is good; but we know nothing about the classes before clustering.
3. The mathematics of non-distance measures is not beautiful.
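A few of these formulas are short enough to state directly in code. A minimal sketch of equations (5), (7), (9), and (11); the function names and sample vectors are assumptions, and the Lance and Canberra forms assume positive coordinates, as the denominators in (9) and (11) require.

```python
# Distances and measures from the slides above, computed coordinate-wise.

def minkowski(x, y, k):        # eq. (5); k = 2 gives the Euclidean case (6)
    return sum(abs(a - b) ** k for a, b in zip(x, y)) ** (1.0 / k)

def chebyshev(x, y):           # eq. (7), the k -> infinity limit of (5)
    return max(abs(a - b) for a, b in zip(x, y))

def lance(x, y):               # eq. (9); assumes positive coordinates
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y)) / len(x)

def canberra(x, y):            # eq. (11); Lance without the 1/m factor
    return sum(abs(a - b) / (a + b) for a, b in zip(x, y))

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(minkowski(x, y, 2))        # Euclidean: sqrt(1 + 4 + 9)
print(chebyshev(x, y))           # 3.0
print(round(canberra(x, y), 4))  # 1/3 + 2/6 + 3/9 = 1.0
```

Note how the Canberra value treats every coordinate's relative difference equally, which is exactly why it is unit-free, and why it fails the triangle inequality needed to be a distance.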
Clustering by Shading the D-Matrix
Let D = (d_ij) denote the matrix of distances between the i-th and j-th objects, with each distance replaced by a prescribed shading class. Shade the distance matrix D; the objects are then clustered by the dark triangles that appear along the diagonal.

Example of Shading Method
Consider the 11 languages. Entry (i, j) counts the concordant first letters of the words for the numbers 1-10.

       E   N   Da  Du  G   Fr  Sp  I   P   H   Fi
  E    10
  N     8  10
  Da    8   9  10
  Du    3   5   4  10
  G     4   6   5   5  10
  Fr    4   4   4   1   3  10
  Sp    4   4   5   1   3   8  10
  I     4   4   5   1   3   9   9  10
  P     3   3   4   0   2   5   7   6  10
  H     1   2   2   2   1   0   0   0   0  10
  Fi    1   1   1   1   1   1   1   1   1   2  10

How to Describe the Difference?
Consider the 0-1 feature vector based on m variables.

          0-1 valued variables
  Item    1     ...  l     ...  m
  i       x_i1  ...  x_il  ...  x_im
  j       x_j1  ...  x_jl  ...  x_jm

Table 2: Comparison between item i and item j

The difference between item i and item j can be measured by \sum_{l=1}^{m} (x_{il} - x_{jl})^2. But it suffers from weighting the 1-1 and 0-0 matches equally.

Example 1. In grouping people, the characteristic of doing Algebraic Geometry is significant.

Contingency Table
To illustrate the matches and mismatches, we arrange the counts into a contingency table.

                 Item j
  Item i       1      0      Totals
  1            n11    n12    n1· = n11 + n12
  0            n21    n22    n2· = n21 + n22
  Totals       n·1    n·2    N = n1· + n2·

Table 3: Contingency table

In Example 1, it might be reasonable to discount the 0-0 matches
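The counts in Table 3 are easy to tabulate from two 0-1 vectors. A minimal sketch: the Jaccard coefficient below is one standard way to discount 0-0 matches, chosen here for illustration rather than named in the slides, and the two feature vectors are assumptions.

```python
# Build the 2x2 contingency counts of Table 3 for two 0-1 feature vectors,
# then compare a coefficient that counts 0-0 matches with one that doesn't.

def contingency(x, y):
    n11 = sum(a == 1 and b == 1 for a, b in zip(x, y))  # 1-1 matches
    n12 = sum(a == 1 and b == 0 for a, b in zip(x, y))  # mismatches
    n21 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    n22 = sum(a == 0 and b == 0 for a, b in zip(x, y))  # 0-0 matches
    return n11, n12, n21, n22

def simple_matching(x, y):     # weights 1-1 and 0-0 matches equally
    n11, n12, n21, n22 = contingency(x, y)
    return (n11 + n22) / (n11 + n12 + n21 + n22)

def jaccard(x, y):             # discounts the 0-0 matches entirely
    n11, n12, n21, _ = contingency(x, y)
    return n11 / (n11 + n12 + n21)

# two people sharing one rare trait (variable 1) plus many common zeros
x = [1, 0, 0, 0, 0, 0, 0, 0]
y = [1, 1, 0, 0, 0, 0, 0, 0]
print(contingency(x, y))       # (1, 0, 1, 6)
print(simple_matching(x, y))   # 7/8 = 0.875, inflated by the shared zeros
print(jaccard(x, y))           # 1/2 = 0.5
```

As in Example 1, the shared rare trait dominates once the uninformative 0-0 matches are discounted.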