Research on Chinese Web Page Classification Based on Feature Extraction and Weight Calculation Algorithms


Anhui University Master's Thesis
Name: Kong Lingcheng (孔令成)
Degree applied for: Master
Major: Computer Software and Theory
Advisor: Zheng Cheng (郑诚)
2010-03

Abstract

In modern society the Internet is rapidly changing our lives. Faced with the enormous volume of information online, finding the information we actually want has become an important problem, and web page classification has become an active research area. Web page classification automatically assigns large numbers of web pages to categories according to given rules. This organizes the pages in an orderly way, improves the performance of information retrieval, and raises the utilization of web resources. Feature extraction and weighting are key steps in the classification process and a prerequisite for improving classification efficiency; the quality of these algorithms directly affects classifier performance.

This work was carried out during the development of a "Chinese Web Page Classification System". It studies web page classification techniques in some depth, including Chinese web page information extraction, automatic word segmentation, feature extraction, weight calculation, and automatic web page classification, and it proposes improved algorithms based on the traditional feature extraction and weight calculation algorithms. The main work is as follows.

First, the thesis reviews the state of web page classification research in China and abroad, together with the main research methods, and identifies the key points and difficulties of the topic.

Second, it examines the application of the traditional MI (mutual information) algorithm and the tf-idf formula to web page classification, along with their shortcomings. The traditional MI algorithm ignores features whose mutual information is negative and is biased too strongly toward low-frequency words, while the traditional tf-idf formula ignores how a feature is distributed across categories. The thesis proposes improvements that target these shortcomings and verifies their superiority and feasibility through experiments.

Finally, it uses supervised machine learning to build a web page classifier: the improved mutual information algorithm extracts features from the word segmentation results, a modified tf-idf formula weights them, and the KNN algorithm is used to construct the classifier. Extensive experiments show that the improved algorithms outperform the traditional ones and achieve higher precision.

With the rapid growth of information on the Internet, web data mining is becoming an increasingly important field of academic research. As an important branch of web data mining, Chinese web page classification has great research value and practical significance.

Keywords: Chinese web page classification; feature extraction; weight calculation

Chapter 1 Introduction

1.1 Research Background and Significance

The Internet was born late in the last century. Its significance in the history of human civilization is extraordinary: its impact on the development of human society is comparable to that of the steam engine and electricity, and even exceeds that of any invention in history. In particular, in less than twenty years after the web browser was invented, the Internet became a vast information space spanning the whole world. The most important sign of this is the spread of World Wide Web technology, which has made every aspect of our lives dependent on the network. From each individual to entire countries, the influence and importance of the network are self-evident.

The network hosts a great variety of resources. Among them the WWW is the fastest growing and the most widely adopted, and it has the greatest impact on us; for most people, any mention of the Internet is naturally associated with the WWW. With the rapid development of the WWW, the network we face is an enormous repository of information, and the size of this repository is still growing by leaps and bounds. As of January 2007, the network covered 233 countries and regions worldwide, with more than one billion users in total and a penetration rate above 15 percent. In China, the network developed much later than in the leading countries, but it has grown rapidly since the start of the 21st century. According to the network development survey report published by CNNIC in December 2006, the number of Internet users in China had reached 137 million, whereas the first national survey, conducted in October 1997, found only 620,000 users. In less than ten years the total had grown to more than 200 times its earlier size.

While the network brings us abundant resources, it also creates a new problem: in this vast sea of information, how do we find what we truly need and are interested in, and how do we manage and organize these resources sensibly? Search engines [1] emerged in response to this problem. They have indeed made it much more convenient to look up resources online. Well-known examples include Baidu and Google, which Internet users rely on every day. However, search engines also have significant problems: a query often returns many similar results (information redundancy), and the query hit rate is rather low, so users cannot really find the information they are hoping for.

People therefore hope to classify web pages automatically according to their content. This requires no human intervention and saves a great deal of labor and material resources. Automatic classification also far exceeds manual classification in speed and accuracy and can fully meet current needs. Because of its practical value, automatic web page classification is now an active research area with a very wide range of applications, chiefly digital libraries [2] and information retrieval. In essence, automatic web page classification organizes web pages in a sensible and orderly way, improves the performance and usefulness of search engines, lets us better retrieve and manage the web resources we need, and improves the utilization of web information resources.

Automatic web page classification grew out of automatic text classification [3,4]. With the development of the Web, text classification techniques were applied to the web domain, and the object of classification changed from text to web pages. Text classification can thus be seen as the foundation of web page classification, and web page classification as the application of text classification techniques to the Web. However, web page classification is not entirely equivalent to text
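The two shortcomings that the abstract attributes to the traditional algorithms can be made concrete with a small numeric sketch. The code below implements only the textbook forms of pointwise mutual information and tf-idf, estimated from document counts; the thesis's improved formulas do not appear in this excerpt and are not reproduced here. The function names, parameter names, and the corpus counts are all hypothetical and chosen purely for illustration.

```python
import math


def mutual_information(n_tc: int, n_t: int, n_c: int, n: int) -> float:
    """Textbook pointwise MI between a term t and a class c, from document counts.

    MI(t, c) = log2(P(t, c) / (P(t) * P(c))) = log2(n_tc * n / (n_t * n_c))
    n_tc: documents of class c that contain t
    n_t:  documents that contain t
    n_c:  documents of class c
    n:    all documents
    """
    if n_tc == 0:
        return float("-inf")  # the term never appears in the class
    return math.log2((n_tc * n) / (n_t * n_c))


def tf_idf(tf: int, df: int, n: int) -> float:
    """Textbook tf-idf: raw term frequency times log inverse document frequency."""
    return tf * math.log2(n / df)


# Hypothetical corpus: 1000 documents, 100 of them in class c.
N, N_C = 1000, 100

# A rare term found in only 2 documents, both in class c.
rare = mutual_information(n_tc=2, n_t=2, n_c=N_C, n=N)
# A common term found in 200 documents, covering 90 of the 100 class documents.
common = mutual_information(n_tc=90, n_t=200, n_c=N_C, n=N)
# A term found in 300 documents, only 3 of which are in class c.
negative = mutual_information(n_tc=3, n_t=300, n_c=N_C, n=N)

print(round(rare, 2), round(common, 2), round(negative, 2))  # 3.32 2.17 -3.32

# tf-idf receives no class information: the weight is the same whether the
# 100 documents containing the term sit in one class or are spread over all.
print(round(tf_idf(tf=3, df=100, n=N), 2))  # 9.97
```

In this sketch the rare term scores higher than the common one even though the common term covers 90 percent of the class, which illustrates the bias toward low-frequency words. The third term has a strongly negative score and would be discarded by a ranking that keeps the largest MI values, although its absence is informative about the class. The tf-idf weight depends only on `tf`, `df`, and `n`, so it cannot reflect how a term is distributed across categories.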
