Research on Automatic Chinese Text Classification Algorithms


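Shanghai Jiao Tong University Master's Thesis

Research on Automatic Chinese Text Classification Algorithms

Name: Wang Xianggang (王香港)
Degree applied for: Master's
Major: Electronics and Communication Engineering
Supervisor: Ni Yousheng (倪佑生)
Date: 2007-12-01

Abstract

With the rapid development and growing popularity of the Internet, the volume of electronic text is expanding quickly. How to organize and manage this information effectively, and how to find what users need quickly, accurately, and comprehensively, is a major challenge for information science and technology. As a key technology for processing and organizing large volumes of text data, text classification can largely resolve the problem of information disorder and helps users locate and route the information they need. As the technical foundation of information filtering, information retrieval, search engines, text databases, digital libraries, and related fields, text classification also has broad application prospects.

This thesis studies text classification and its related technologies. With the aim of improving the speed, accuracy, and stability of classification methods, it proposes several effective solutions and improvements. It gives a fairly systematic survey of the state of research and the methods used in Chinese text classification for automatic word segmentation, feature extraction, text classification models, and performance evaluation. It gives a fairly comprehensive discussion of three Chinese text classification methods: the Bayesian method, the k-nearest-neighbor (kNN) method, and AdaBoost. Using these three models, the author implements a naive Bayes classifier, a kNN classifier, and an AdaBoost classifier, and integrates them into a practical experimental system.

The thesis analyzes the shortcomings of the kNN method in depth and proposes four improvements: a method based on latent semantics, feature aggregation, strengthening the semantic-chain attribute factor of the text, and an iterative nearest-neighbor method combined with retrieval. These improvements raise the performance of the classifier.

The thesis focuses on issues related to AdaBoost and outlines the main content and applications of boosting theory. The naive Bayesian classifier is an effective text classification method, but because it is highly stable, its performance is hard to improve through the Boosting mechanism. The main problem in using a naive Bayesian classifier as the base classifier for Boosting is therefore how to break its stability. Three methods for breaking the stability of the naive Bayesian learner are proposed. The first changes the samples in the training set. The second uses randomly selected groups of attributes. The third uses a different text feature extraction method in each Boosting iteration to build a different set of feature words. Experiments show that each method has its own strengths and weaknesses, but all are more accurate and efficient than the original method.

Experiments show that all three classifiers meet the needs of Chinese text classification, with the AdaBoost classifier performing best. Naive Bayes is simple and fast, and the kNN method offers moderate performance; both are also suitable for Chinese text classification.

Keywords: feature selection, text classification, Bayesian classifier, k-nearest-neighbor classifier, AdaBoost classification algorithm

A STUDY ON CHINESE TEXT CATEGORIZATION

ABSTRACT

With the rapid development and spread of the Internet, the amount of electronic text information is increasing greatly. How to organize and process large amounts of document data, and how to find the information a user is interested in quickly, accurately, and completely, is a great challenge for information science and technology. As the key technology for organizing and processing large amounts of document data, text classification can solve the problem of information disorder to a great extent and makes it convenient for users to find the required information quickly. Moreover, text classification has broad application prospects as the technical basis of information filtering, information retrieval, search engines, text databases, digital libraries, and so on.

This thesis presents research on text classification and its related technologies. With a view to improving speed, precision, and stability, several methods and techniques are presented. The thesis systematically summarizes techniques for word segmentation, feature selection, categorization algorithms, and performance evaluation in Chinese text categorization. It discusses three Chinese text categorization methods: the Bayes method, k Nearest Neighbor (kNN), and AdaBoost. The author develops three text classifiers: a naive Bayes classifier, a kNN classifier, and an AdaBoost classifier. These three classifiers are combined into one text categorization system of high practicability.

Because of the weaknesses of kNN prediction, improved kNN methods with more satisfactory performance are put forward: latent semantics, feature aggregation, the text attribute factor of the semantic chain, and iterative kNN.

The thesis places its emphasis on AdaBoost and discusses boosting theory and its applications. The naive Bayesian classifier is an effective text categorization method, but because of its stability it is hard to improve its performance through the Boosting procedure. The main problem in using the naive Bayesian classifier as the base classifier in the Boosting procedure is therefore how to break its stability. Three methods that break the stability of the naive Bayesian classifier are given. The first changes the samples of the training set, the second adopts randomly selected feature groups, and the third creates a different feature set in each iteration of the Boosting procedure by using a different method to extract text features. The three methods have their respective advantages and disadvantages, but all of them are more accurate and effective than the original naive Bayesian classifier.

The experimental results show clearly that the AdaBoost classifier is more satisfactory than the others. The naive Bayesian classifier is simple and quick, and the performance of the kNN classifier is moderate. All three classifiers are suitable for Chinese text categorization.

Keywords: Feature Selection, Text Categorization, Naive Bayesian Text Classifier, K Nearest Neighbor Classifier, AdaBoost Classifier Algorithm

Shanghai Jiao Tong University
Declaration of Originality of the Dissertation

I solemnly declare that the dissertation submitted here is the result of research I carried out independently under the guidance of my supervisor. Except for content explicitly marked as cited in the text, this thesis contains no work that has been published or written by any other individual or group. Individuals and groups who made important contributions to this research are clearly identified in the text. I am fully aware that I bear the legal consequences of this declaration.

Signature of the author: Wang Xianggang (王香港)
Date: January 16, 2008

Shanghai Jiao Tong University
Authorization for Use of Dissertation Copyright

The author of this dissertation fully understands the university's regulations on the retention and use of dissertations, and agrees that the university may retain the thesis, submit paper and electronic copies of it to the relevant national departments or institutions, and allow it to be consulted and borrowed. I authorize Shanghai Jiao Tong University to include all or part of this dissertation in relevant databases for retrieval, and to preserve and compile it by photocopying, reduced-format printing, scanning, or other means of reproduction.

This dissertation is: confidential (this authorization applies after declassification in ___ years) / not confidential. (Please mark the appropriate box above.)

Signature of the author: Wang Xianggang (王香港)
Signature of the supervisor: Ni Yousheng (倪佑生)
Date: 2008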
