Semi-Supervised Clustering Analysis: Strategy Design and Extensibility


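Nanjing University of Aeronautics and Astronautics, Doctoral Dissertation
Semi-Supervised Clustering Analysis: Strategy Design and Extensibility
Author: Yin Xuesong (尹学松)
Degree applied for: Doctor
Major: Computer Application Technology
Supervisor: Chen Songcan (陈松灿)
Date: 2009-12

Abstract (translated from the Chinese)

In machine learning and data mining, one often encounters large amounts of unlabeled data. Labeling such data can require considerable human and material resources, as in segmenting and recognizing speakers' voices in a conversation, detecting roads in GPS data, and grouping different actors or actresses in movie clips. Using prior knowledge of a small number of samples to solve these problems has therefore become a research hotspot in machine learning. Semi-supervised clustering (SSC) learns from the supervised information of a few samples together with a large number of unlabeled samples in order to cluster the data. Naturally, it can also be applied to unsupervised clustering to improve its performance, so SSC has gradually become one of the important research topics in machine learning and data mining. This dissertation focuses on the two cores of SSC research, learning algorithms and metric learning, and studies SSC algorithms in some depth. Drawing on discriminant analysis and the kernel trick, both popular in machine learning, it proposes corresponding improved SSC models and extends them to unsupervised learning methods. The main contributions are summarized as follows:

(1) A discriminative semi-supervised clustering analysis method based on pairwise constraints (DSCA) is proposed. Starting from linear discriminant analysis, the method uses supervised information and a large number of unlabeled samples to perform clustering and dimensionality reduction simultaneously. Existing SSC methods either attend only to how supervised information helps clustering and ignore dimensionality reduction, or treat clustering and dimensionality reduction separately. DSCA overcomes these problems by iterating between clustering and dimensionality reduction and describing both in one joint framework. In addition, a pairwise-constraint-based K-means clustering method (PCBKM) is proposed to overcome the violation of pairwise constraints (cannot-link and must-link constraints). Experimental results on the datasets used in this dissertation show that, under the same pairwise constraints, DSCA improves clustering performance more effectively than other comparable semi-supervised learning methods.

(2) Using the widely used kernel method, and addressing the difficulty that existing SSC algorithms have in improving the separability between samples of different clusters, an adaptive semi-supervised clustering kernel method based on metric learning (SCKMM) is proposed. The method has four main characteristics: i) it introduces metric learning into nonlinear SSC to enlarge the separability between samples of different clusters; ii) it uses the clustering result as a bridge for iterating between metric learning and SSC, which effectively improves clustering accuracy; iii) to address the reliance on manual tuning of kernel parameters in kernel clustering algorithms, it constructs an objective function from cannot-link and must-link constraints that automatically optimizes the Gaussian kernel parameter. This design helps to explore the effect of hyperparameters on generalization performance while keeping the algorithm simple and mathematically tractable, and ultimately offers an alternative route for solving similar hyperparameter problems; iv) it gives a preliminary examination of how noisy cannot-link and must-link constraints affect the performance of semi-supervised learning algorithms. Existing semi-supervised learning algorithms usually assume that the given pairwise constraints are correct and are designed accordingly, ignoring noisy pairwise constraints. This dissertation generates noisy pairwise constraints by randomly flipping cannot-link and must-link constraints, and studies their effect on algorithm performance. Experiments confirm that, under the same noisy pairwise constraints, the proposed algorithm is more robust than other semi-supervised algorithms.

(3) A more general discriminative clustering framework is developed, with the following characteristics: i) it effectively integrates generalized linear discriminant analysis (GLDA) and regularized soft K-means (RSKM). Within this framework, cluster memberships take values between 0 and 1 rather than only the two values 0 and 1, and the LDA-guided adaptive dimensionality reduction algorithm (LDA-Km) developed by Chris Ding becomes a special case of the framework; ii) by using maximum entropy as the regularization term, the proposed discriminant-analysis-based regularized soft K-means method (ResKmeans) improves clustering performance more effectively than comparable work; iii) the equivalence of the objective functions optimized by GLDA and soft K-means is proved theoretically; iv) GLDA generalizes the well-known linear discriminant analysis (LDA); v) ResKmeans not only generalizes comparable work but also naturally accommodates existing clustering methods, including Gaussian mixtures and fuzzy C-means.

Keywords: semi-supervised clustering, pairwise constraints, cluster analysis, closure center, metric learning, discriminant analysis, soft clustering, dimensionality reduction

Abstract

In the machine learning and data mining communities, one is often confronted with a large amount of unlabeled data. In many real applications, labeled instances are difficult, expensive, or time-consuming to obtain, as they require the efforts of experienced human annotators. Consequently, prior knowledge of a few instances is used to partition the data, which has become a focus in the field of machine learning. Semi-supervised clustering (SSC) addresses this problem by using a large amount of unlabeled data, together with prior information about a few instances, to achieve higher accuracy. It can also naturally be applied to unsupervised clustering to improve clustering performance. Thus, SSC has become an important topic in the machine learning and data mining communities. In this dissertation, we improve SSC algorithms by closely following the two cores of SSC research, learning algorithms and metric learning. With discriminant analysis and the kernel trick, both popular in machine learning, we propose corresponding improved SSC models and generalize them to unsupervised learning. The main contributions of this dissertation are summarized as follows:

(1) This dissertation presents a discriminative semi-supervised clustering analysis algorithm with pairwise constraints (DSCA). The proposed algorithm uses supervised information and abundant unlabeled instances to simultaneously perform clustering and dimensionality reduction with linear discriminant analysis (LDA). Most existing SSC algorithms either focus on supervised information that directs clustering and are not designed for handling high-dimensional data, or ignore the relationship between clustering and dimensionality reduction. The key idea in DSCA is to integrate dimensionality reduction and clustering in a joint framework so that the above problems can be mitigated. Meanwhile, the pairwise-constraint-based K-means algorithm (PCBKM) presented in DSCA both reduces the computational complexity of constraint-based semi-supervised algorithms and solves the problem of violating pairwise constraints in existing semi-supervised clustering algorithms. The experimental results show that DSCA can effectively deal with high-dimensional data and provides appealing clustering performance compared with state-of-the-art SSC algorithms.

(2) With the widely used kernel method, this dissertation presents an adaptive semi-supervised clustering kernel method based on metric learning (SCKMM). The main characteristics of this method can be summarized as follows: i) SCKMM introduces metric learning into nonlinear SSC to improve the separability of the data for clustering; ii) SCKMM makes effective use of cluster membership as the bridge connecting metric learning and SSC. With this connection, clusters are effectively discovered in the transformed space, while the transformed space is adaptively re-adjusted for global optimality; iii) we construct an objective function from pairwise constraints to automatically optimize the parameter of the Gaussian kernel, addressing the problem of manually tuning the kernel parameters.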
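The abstract refers to two operations on pairwise constraints: checking whether a clustering violates must-link or cannot-link constraints, and generating noisy constraints by randomly flipping them. The sketch below is only an illustration of those two ideas and is not the dissertation's implementation. The function names `flip_constraints` and `count_violations`, the `noise_rate` parameter, and the representation of a constraint as a pair of sample indices are all assumptions made for this example.

```python
import random


def flip_constraints(must_link, cannot_link, noise_rate, seed=0):
    """Return noisy copies of the two constraint lists (hypothetical helper).

    Each constraint is moved to the opposite list with probability
    noise_rate, which is one simple reading of "randomly flipping"
    must-link and cannot-link constraints.
    """
    rng = random.Random(seed)
    noisy_ml, noisy_cl = [], []
    for pair in must_link:
        # A flipped must-link pair becomes a cannot-link pair.
        (noisy_cl if rng.random() < noise_rate else noisy_ml).append(pair)
    for pair in cannot_link:
        # A flipped cannot-link pair becomes a must-link pair.
        (noisy_ml if rng.random() < noise_rate else noisy_cl).append(pair)
    return noisy_ml, noisy_cl


def count_violations(labels, must_link, cannot_link):
    """Count constraints that a cluster assignment violates (hypothetical helper).

    labels[i] is the cluster index assigned to sample i.
    """
    ml_violations = sum(1 for i, j in must_link if labels[i] != labels[j])
    cl_violations = sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return ml_violations + cl_violations


# Usage: four samples, two must-link pairs and two cannot-link pairs.
must_link = [(0, 1), (2, 3)]
cannot_link = [(0, 2), (1, 3)]
labels = [0, 0, 1, 1]
clean_violations = count_violations(labels, must_link, cannot_link)
noisy_ml, noisy_cl = flip_constraints(must_link, cannot_link, noise_rate=0.5)
noisy_violations = count_violations(labels, noisy_ml, noisy_cl)
```

Comparing `clean_violations` with `noisy_violations` for a fixed labeling gives a rough sense of how much the flipped constraints disagree with a clustering that satisfies the original ones.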
