Research on the K-Means Algorithm and Its Application in Text Clustering

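School code: *
Student ID: *
Confidentiality level:

The Research and Application in Text Clustering of K-Means Algorithm

Name: *
Discipline: Computer Application Technology
Research direction: Database and Web Technology
Supervisor: *
Completion date: April 2013

Declaration of Originality

I declare that the thesis submitted here presents research work carried out, and results obtained, by me under the guidance of my supervisor. To the best of my knowledge, except where specifically marked and acknowledged in the text, the thesis contains no research results previously published or written by others, and no material used to obtain a degree or certificate from ____ or any other educational institution. Every contribution made to this research by colleagues who worked with me has been clearly stated and acknowledged in the thesis.

Signature of the thesis author:
Date: ____

Thesis Copyright Authorization

The author of this thesis fully understands the regulations of ____ on the retention and use of theses: ____ has the right to retain the thesis, to submit photocopies and disks of it to the relevant national departments or institutions, and to allow the thesis to be consulted and borrowed. I authorize ____ to include all or part of the thesis in relevant databases for retrieval, and to preserve and compile the thesis by photocopying, reduced-size printing, scanning, or other means of reproduction. (A confidential thesis is subject to this authorization once it has been declassified.)

Signature of the thesis author:
Supervisor's signature:
Date: ____
Date: ____

Author's destination after graduation:
Employer:
Telephone:
Mailing address:
Postal code:

Abstract

With the rapid development of the Internet, storing large volumes of text has become much easier, and the number of documents available on the Web is growing rapidly. While the total amount of available information keeps increasing, users' capacity to understand and process it stays the same. How to find the information one is interested in within this mass of data, and how to sort unclassified text into categories, are questions that belong to a new research direction: text mining.

One of the most important branches of text mining is text clustering mining. Text clustering mining is a method for discovering the category information and content of a text collection. It divides text documents into a specified number of categories according to a chosen similarity measure, so that the samples in each category are highly similar, and it gives a summary description of each category. Compared with clustering ordinary experimental data, text clustering has characteristics of its own, and research on it is highly challenging. Research on the K-Means algorithm and its applications, especially in text clustering mining, is currently increasing. This thesis first systematically introduces the basic theory of cluster analysis and text clustering mining, then proposes an improved method that addresses the limitations of the K-Means algorithm, and finally applies the improved K-Means algorithm to text clustering mining.

First, the thesis reviews the current state of research on clustering algorithms and text clustering mining in China and abroad. By comparison, research abroad is relatively mature, while most domestic research is still at the stage of theoretical study. The thesis also briefly introduces the theory of data mining, including the concept of data mining and its steps.

Next, after introducing the theory of cluster analysis, including the concept of clustering and clustering algorithms, the thesis explains the K-Means algorithm in detail and analyzes its advantages and disadvantages. To address the sensitivity of the original K-Means algorithm to isolated points (outliers) and its random selection of initial cluster centers, an improved K-Means clustering algorithm with outlier analysis is proposed. The outlier analysis adopts the statistical idea that a data point is an outlier when the absolute value of its Z-score (standard score) is greater than 2. This method has a rigorous mathematical basis and removes the requirement that the user set a threshold. The strategy for determining the initial cluster centers is to split off the relatively concentrated data first each time, which ensures that the data objects assigned to each cluster are highly similar. Outlier detection reduces the influence of outliers on the clustering result, and the initial cluster center strategy of the improved K-Means algorithm lowers the likelihood that the algorithm gets trapped in a local optimum and, to some extent, reduces the number of iterations. The improved algorithm is then tested on the iris data set, and the experiments show that both its clustering quality and its performance are considerably better than those of the original algorithm.

The thesis then describes the concept and main process of text mining and implements an application example of text clustering mining based on the improved K-Means algorithm. The example consists of three modules: a text preprocessing module, a clustering module, and a performance evaluation module. A detailed design and a brief code structure are given for each module. In the implementation, a "trade space for time" performance optimization is proposed for the tf-idf computation in the preprocessing module, and a method for computing accuracy is given for the performance evaluation module. The application example is then run on the "Text Classification Corpus" data set from Sogou Labs, and the text clustering results are reported.

Finally, the thesis summarizes the work, points out related problems that could not be studied in depth, and suggests directions for future research on clustering mining.

Keywords: K-Means algorithm; data preprocessing; text clustering

Abstract (English version)

With the rapid development of the Internet, storing large amounts of textual information has become easier, and the number of documents available on the Web is growing rapidly. While the amount of usable information keeps growing, users' ability to understand and manage it remains unchanged. Problems such as how to find the information of interest among so much data, and how to categorize unclassified text, naturally lead to a new research direction: text mining. As one of the most important branches of text mining, text clustering mining is a method for discovering the category information and content of a text corpus. It divides text documents into a specified number of categories according to a chosen similarity metric, so that the documents in each category are highly similar, and it gives a summary description of each category. Compared with clustering ordinary experimental data, text clustering mining has its own characteristics, which makes it a challenging field for researchers. Research on and applications of the K-Means algorithm are currently increasing, especially in text clustering mining.

This thesis first introduces the basic theory of cluster analysis and text clustering mining, then proposes an improved method that addresses the limitations of the K-Means algorithm, and finally applies the improved K-Means algorithm to text clustering mining.

First, the thesis summarizes the background and achievements of research on clustering algorithms and text clustering mining at home and abroad. The field is well developed abroad, while domestic work is still at the stage of theoretical research. The theory of data mining is then briefly introduced, including the concept and steps of data mining.

After introducing the theory of cluster analysis, such as the concept of clustering and clustering algorithms, the thesis explains the K-Means algorithm in detail together with its advantages and disadvantages. To address problems in the original algorithm, such as the impact of isolated points and the selection of the initial cluster centers, an improved K-Means algorithm with outlier analysis is proposed. The outlier analysis mainly uses the statistical idea that a data point is an outlier when the absolute value of its Z-score (standard score) is greater than 2. This method not only has a rigorous mathematical basis but also removes the requirement that the user set a threshold. The strategy for determining the initial cluster centers is to split off relatively concentrated data first each time, which ensures that the samples in each cluster are highly similar. Outlier detection reduces the influence of outliers on the clustering result. At the same time, the initial cluster center selection strategy in the improved K-Means algorithm can not only reduce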
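
The outlier rule stated in the abstract (a data point is an outlier when the absolute value of its Z-score exceeds 2) can be sketched roughly as below. This is only an illustration under stated assumptions, not the thesis's implementation: the function name `zscore_outliers`, the use of the population standard deviation, and the reduction of each sample to a single number are all hypothetical choices, since the abstract does not say how multi-dimensional samples are scored.

```python
from statistics import fmean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the indices of values whose |Z-score| exceeds the threshold.

    Assumption: Z = (x - mean) / population standard deviation.
    The threshold of 2 comes from the abstract; the rest is illustrative.
    """
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # All values are identical, so no value can be an outlier.
        return []
    return [i for i, x in enumerate(values) if abs(x - mean) / std > threshold]


if __name__ == "__main__":
    data = [10, 11, 9, 10, 12, 10, 11, 50]
    print(zscore_outliers(data))  # flags index 7, the value 50
```

For multi-dimensional samples, one possible (again hypothetical) adaptation is to apply the same function to each sample's distance from the overall mean vector and to drop the flagged samples before clustering.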

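The abstract mentions a "trade space for time" optimization for the tf-idf computation but gives no details. One plausible reading, sketched here purely as an assumption, is to build the document-frequency and idf tables once and keep them in memory, instead of rescanning the corpus for every term. The function name `tfidf_vectors` and the formula variant (tf as relative frequency, idf as log(N / df)) are illustrative choices and may differ from what the thesis uses.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    Assumptions (not from the thesis): tf = count / document length,
    idf = log(N / df). The document-frequency table is built once and kept
    in memory, so no term triggers a second scan of the corpus.
    """
    n_docs = len(docs)
    # One pass over the corpus: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    # Cache idf per term (extra memory) instead of recomputing it per document.
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors


if __name__ == "__main__":
    corpus = [
        ["cluster", "text", "text"],
        ["cluster", "mining"],
        ["cluster", "data"],
    ]
    for vec in tfidf_vectors(corpus):
        print(vec)
```

In this toy corpus the term "cluster" appears in every document, so its idf and therefore its weight is zero, while terms confined to one document receive positive weights.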
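
The abstract states that the performance evaluation module computes an accuracy figure but does not define it. A common choice for clustering against a labeled corpus, shown here only as an assumed stand-in, is to credit each cluster with its most frequent true class and count the matching samples (often called purity). The function name `clustering_accuracy` and the category labels in the example are hypothetical.

```python
from collections import Counter, defaultdict


def clustering_accuracy(cluster_labels, true_labels):
    """Fraction of samples that match the majority class of their cluster.

    Assumption: each cluster is credited with its most frequent true class.
    The thesis may define accuracy differently.
    """
    if len(cluster_labels) != len(true_labels):
        raise ValueError("label sequences must have the same length")
    if not cluster_labels:
        return 0.0
    # For every cluster, count how often each true class occurs in it.
    members = defaultdict(Counter)
    for cluster, truth in zip(cluster_labels, true_labels):
        members[cluster][truth] += 1
    # Samples belonging to the majority class of their cluster count as correct.
    correct = sum(counts.most_common(1)[0][1] for counts in members.values())
    return correct / len(cluster_labels)


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 1]
    truth = ["sports", "sports", "finance", "finance", "finance", "sports"]
    print(clustering_accuracy(clusters, truth))  # 4 of 6 samples match
```

This measure rises trivially as the number of clusters grows, so it is most meaningful when the number of clusters equals the number of known categories.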