Text Classification Algorithm: Graduation Thesis

School: School of Computer Science and Technology
Major: Electronic Information Science and Technology
Thesis title: A Text Classification Algorithm Based on Semi-Supervised Learning

Abstract

With the emergence of the Internet, a large volume of textual information now exists in computer-readable form. Organizing this information by traditional manual means is time-consuming and laborious, and the results are unsatisfactory. Text classification, a key technology for processing and organizing large amounts of text data, lets machines analyze and organize text, freeing users from tedious document-processing work and greatly improving the utilization of information. Text classification is a supervised learning task that analyzes the content of a natural-language document and, following a given strategy, assigns it to one or more suitable predefined categories. As the technical basis of information filtering, information retrieval, search engines, text databases, digital libraries, and related fields, text classification has broad application prospects.

This thesis first introduces the background of text classification, the semi-supervised algorithms used for text classification, and several key techniques of text classification. High classification accuracy requires a large labeled training set, yet labeled documents are scarce, so semi-supervised learning algorithms that learn from unlabeled documents have become a research focus in text classification. The thesis therefore concentrates on semi-supervised classification algorithms. Finally, it designs a prototype text classification system and, to verify classification accuracy, tests it on several standard data sets and evaluates its classification performance. The experiments show that when enough labeled documents are available, the proposed algorithm performs comparably to other algorithms, and that when labeled documents are very few, it outperforms the existing algorithms.

Keywords: text classification; semi-supervised learning; clustering; EM; KNN

Contents

1 Introduction
1.1 Background of the topic
1.2 Organization of this thesis
2 Semi-supervised learning
2.1 Concept and significance of semi-supervised learning
2.2 Research progress in semi-supervised learning
2.3 Semi-supervised learning methods
2.3.1 Co-training
2.3.2 Self-training
2.3.3 Semi-supervised support vector machines (S3VMs)
2.3.4 Graph-based methods
2.4 Chapter summary
3 Text classification
3.1 Concept and significance of text classification



