[Engineering] Chapter 8: Principal Component Analysis



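Chapter 8: Multivariate Data Analysis

1. The concept of principal component analysis
2. The method of principal component analysis

The concept of principal component analysis

Large samples with many variables give scientific research a wealth of information, but they also increase the data-collection workload. More importantly, in most cases many of the variables are correlated with one another, which makes the problem more complex and harder to analyze.

If each indicator is analyzed separately, the analysis is isolated rather than integrated. Blindly dropping indicators loses a great deal of information and easily leads to wrong conclusions. What is needed is a principled way to reduce the number of indicators while losing as little as possible of the information they carry, so that the collected data can be analyzed as a whole.

Because the variables are correlated, a smaller number of composite indicators can be used, each summarizing a different kind of information contained in the original variables. Principal component analysis (PCA) is such a dimension-reduction method: a multivariate statistical technique that transforms many measured variables into a few uncorrelated composite indicators.

The composite indicators are mutually uncorrelated, that is, the information they represent does not overlap. They are called factors or principal components. Two rules are commonly used to decide how many to keep: an eigenvalue greater than 1, or a cumulative contribution rate of at least 0.8.

Example: exam score data

The scores of 100 students in mathematics, physics, chemistry, Chinese, history, and English are given in a table (partial listing, not reproduced here).

Questions raised by this example

Can the six variables be represented by one or two composite variables? How much of the original information do those composite variables retain? Can they be used to rank the students? Problems of this kind extend to analyzing, ranking, discriminating between, and classifying companies, schools, and so on.

The data points in this example are six-dimensional.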

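Each observation is a point in six-dimensional space, and we would like to represent that space with a lower-dimensional one.

First suppose there are only two dimensions, that is, two variables represented by the horizontal and vertical axes, so that each observation has two coordinates. If the data form an elliptical cloud of points, the ellipse has a major axis and a minor axis. Along the minor axis the data vary very little. In the extreme case where the minor axis shrinks to a point, the variation of the points can be explained along the major axis alone, and two dimensions have been reduced to one.

When the coordinate axes are parallel to the axes of the ellipse, the variable along the major axis describes the main variation in the data and the variable along the minor axis describes the secondary variation. Usually, however, the coordinate axes are not parallel to the axes of the ellipse. We therefore need to find the major and minor axes and apply a transformation so that the new variables are parallel to them. If the major-axis variable carries most of the information in the data, it replaces the original two variables (the minor dimension is dropped) and the dimension reduction is complete. The greater the difference between the lengths of the axes of the ellipse (or ellipsoid), the better justified the reduction.

The multidimensional case is similar, with a high-dimensional ellipsoid in place of the ellipse. First find the principal axes of the ellipsoid, then take the longest few axes, which represent most of the information in the data, as the new variables. With that, principal component analysis is essentially complete. Note that, as in two dimensions, the principal axes of a high-dimensional ellipsoid are mutually perpendicular. These mutually orthogonal new variables are linear combinations of the original variables and are called principal components.

Just as a two-dimensional ellipse has two principal axes and a three-dimensional ellipsoid has three, there are as many principal components as there are variables. The fewer principal components selected, the greater the dimension reduction. What is the criterion? The lengths of the selected principal axes should account for most of the total length of all principal axes; about 85% of the total is sufficient.

In the output table (not reproduced here), the Initial Eigenvalues are the six principal-axis lengths, measured as the variance along each axis. They are also called eigenvalues (the eigenvalues of the correlation matrix of the data). The eigenvalues of the first two components together account for 81.142% of the total variance. The contributions of the later eigenvalues become smaller and smaller.

How should these two principal components be interpreted? As noted above, the principal components are linear combinations of the original six variables. What do these combinations look like?

In the coefficient table (not reproduced here), each column gives the coefficients (proportions) of one principal component as a linear combination of the original variables. For example, the first principal component, as a linear combination of mathematics, physics, chemistry, Chinese, history, and English, has coefficients -0.806, -0.674, -0.675, 0.893, 0.825, 0.836.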
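The selection rule described above (eigenvalues of the correlation matrix, then a cumulative share of about 85%, or eigenvalues greater than 1) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `correlation_eigen` and `choose_k` are hypothetical, and the data are synthetic, generated to resemble a 100 x 6 score table with a "science" group and an "arts" group. They are not the scores from the example.

```python
import numpy as np

def correlation_eigen(X):
    """Eigenvalues and eigenvectors of the correlation matrix, largest first."""
    R = np.corrcoef(X, rowvar=False)        # p x p correlation matrix of the columns
    eigvals, eigvecs = np.linalg.eigh(R)    # eigh is for symmetric matrices; ascending order
    order = np.argsort(eigvals)[::-1]       # reorder so the longest axis comes first
    return eigvals[order], eigvecs[:, order]

def choose_k(eigvals, threshold=0.85):
    """Apply two common rules: cumulative share >= threshold, and eigenvalue > 1."""
    cum_share = np.cumsum(eigvals) / eigvals.sum()
    k_cum = min(int(np.searchsorted(cum_share, threshold)) + 1, len(eigvals))
    k_gt1 = int(np.sum(eigvals > 1.0))
    return k_cum, k_gt1, cum_share

# Synthetic stand-in for a 100 x 6 score table (NOT the data from the example):
rng = np.random.default_rng(0)
science = rng.normal(size=(100, 1))         # hypothetical latent "science" ability
arts = rng.normal(size=(100, 1))            # hypothetical latent "arts" ability
X = np.hstack([np.repeat(science, 3, axis=1), np.repeat(arts, 3, axis=1)])
X = X + rng.normal(scale=0.4, size=X.shape)

eigvals, eigvecs = correlation_eigen(X)
k_cum, k_gt1, cum_share = choose_k(eigvals)
print("eigenvalues:", np.round(eigvals, 3))
print("cumulative share:", np.round(cum_share, 3))
print("k (cumulative >= 0.85):", k_cum, " k (eigenvalue > 1):", k_gt1)
```

Because the correlation matrix of six variables has ones on its diagonal, the eigenvalues always sum to 6, which is why each eigenvalue can be read directly as a share of total variance.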

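If x1, x2, x3, x4, x5, x6 denote the original six variables and y1, y2, y3, y4, y5, y6 denote the new principal components, then the relationship between the original six variables and the first and second principal components y1, y2 is:

x1 = -0.806 y1 + 0.353 y2
x2 = -0.674 y1 + 0.531 y2
x3 = -0.675 y1 + 0.513 y2
x4 = 0.893 y1 + 0.306 y2
x5 = 0.825 y1 + 0.435 y2
x6 = 0.836 y1 + 0.425 y2

These coefficients are called principal component loadings. Each loading is the correlation coefficient between a principal component and the corresponding original variable. For example, the coefficient of y1 in the expression for x1 is -0.806.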
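As a rough consistency check on the loadings quoted in this example, assume they are the usual correlation-matrix loadings (eigenvector entries scaled by the square root of the eigenvalue) and that the six variables are standardized, so the total variance is 6. Under that assumption the sum of squared loadings for a component equals its eigenvalue. The symbols \lambda_1, \lambda_2, and h_1^2 are introduced here for illustration.

```latex
\begin{align*}
\lambda_1 &\approx (-0.806)^2 + (-0.674)^2 + (-0.675)^2 + 0.893^2 + 0.825^2 + 0.836^2 \approx 3.737 \\
\lambda_2 &\approx 0.353^2 + 0.531^2 + 0.513^2 + 0.306^2 + 0.435^2 + 0.425^2 \approx 1.133 \\
\frac{\lambda_1 + \lambda_2}{6} &\approx \frac{4.870}{6} \approx 0.812 \\
h_1^2 &= (-0.806)^2 + 0.353^2 \approx 0.774
\end{align*}
```

The combined share of about 81.2% is close to the 81.142% reported for the first two components; a gap of this size is what rounding the loadings to three decimals would produce. The last line, h_1^2, is the share of the variance of mathematics (x1) captured by the two components together, about 77%.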

The loading of -0.806 on x1 means that the correlation coefficient between the first principal component and the mathematics variable is -0.806. The larger the correlation coefficient in absolute value, the better the principal component represents that variable. The first principal component explains every variable well, whereas the last few principal components are only weakly correlated with the original variables.

The loadings of the first and second principal components can be plotted in two dimensions to show visually how they explain the original variables. This plot is called a loading plot.

[Figure: loading plot. The three points on the left are mathematics, physics, and chemistry; the three points on the right are Chinese, history, and foreign language.]

A typical data analysis situation

12 jam samples were made from berries picked from various cultivars at different times of the season. Several parameters (sensory measurements) were measured on each sample.

Data set: Raspberry Jams

Which samples are similar or dissimilar to each other?

Sample comparison according to 1 variable: Redness. What about the 11 other parameters?

Sample comparison according to 2 variables: Redness and colour. What about the 10 other parameters?

Sample comparison according to 3 variables: Redness, colour and R. Smell. What about the 9 other parameters?

Sample comparison according to all 12 variables: multivariate model (PCA)

[Figures: map of samples; map of variables; map of samples and variables]

Principal Component Analysis (PCA)

Principles behind PCA

The principles of Principal Component Analysis (PCA)

[Figure: axes X1 (Variable 1), X2 (Variable 2), X3 (Variable 3)]

The original data points, plotted on the original axes of variables (X1, X2, X3). For convenience, we have assumed that the data points are in the shape of a cuboid.

[Figure: samples plotted on X1 and X2]

PCA Scores Plot

Instead, we use PCA to create "new" variables, called Principal Components. These are ordered as PC1, PC2, PC3, etc. PC1 explains the maximum variation of the data. PC2 then explains the maximum of the residual variation, etc.

[Figure: axes PC1, PC2, PC3 drawn over X1, X2, X3; samples plotted on PC1 and PC2]

Scores Plot: by default, after doing PCA in The Unscrambler, it is shown in the top left corner.

PCA Loadings Plot

[Figure: axes PC1, PC2, PC3]

It is also necessary to find out how the individual PCs relate to the or
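The scores and loadings described in these slides can be sketched with NumPy as follows. This is a minimal illustration, not the procedure used by The Unscrambler: the function name `pca_scores_loadings` is hypothetical, the data are synthetic (12 samples by 5 variables, not the jam measurements), and PCA is assumed to be run on the correlation matrix, as in the score example earlier.

```python
import numpy as np

def pca_scores_loadings(X, k=2):
    """Scores (n x k), loadings (p x k) and eigenvalues from the correlation matrix."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each variable
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)               # ascending order
    order = np.argsort(eigvals)[::-1][:k]              # keep the k largest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs                    # sample coordinates on PC1..PCk (scores plot)
    loadings = eigvecs * np.sqrt(eigvals)   # variable-to-PC correlations (loadings plot)
    return scores, loadings, eigvals

# Synthetic data for illustration only (NOT the jam data set):
rng = np.random.default_rng(1)
latent = rng.normal(size=(12, 2))           # two hypothetical underlying factors
mix = rng.normal(size=(2, 5))               # how the factors feed 5 measured variables
X = latent @ mix + 0.3 * rng.normal(size=(12, 5))

scores, loadings, eigvals = pca_scores_loadings(X, k=2)
print("variance of PC1, PC2 scores:", np.round(scores.var(axis=0, ddof=1), 3))
print("loadings:\n", np.round(loadings, 3))
```

Plotting the two columns of `scores` against each other gives a map of samples, and plotting the two columns of `loadings` gives a map of variables. The sign of each component is arbitrary, so a plot may appear mirrored relative to another program's output.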
