Research on Web Text Classification Methods and System Implementation


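Abstract (摘要)

In recent years the Web has rapidly grown into the world's largest public source of information. Enabling Web users to locate the information they need conveniently and quickly within this vast body of resources has become an urgent problem, and correct classification of Web text lies at its core. Web text classification derives from automatic classification technology and is an important part of Web text mining. It not only improves users' search efficiency by helping them locate target knowledge quickly and accurately, but also captures the category interests of different users, providing a reference for meeting their demands for personalized service.

Most current classification research treats document categories as flat and disjoint and does not consider the hierarchical relations among categories. When the number of categories is large, learning a flat classifier is time-consuming, and classifying an unknown document requires comparing it against every class model, which is clearly inappropriate. Building on an in-depth study of Web text mining and automatic classification, and taking the hierarchical relations among categories into account, this thesis implements a multi-level Web text classification system. Its innovations and key technologies are as follows:

1. A hierarchical training and classification model. Web pages are rich in content and span many categories across many domains. This thesis analyzes the problems that flat classification methods face when there are many categories, proposes the idea of hierarchical classification, and builds a hierarchical training and classification model.

2. An automatic Web text extractor. Noise mixed into Web pages, such as advertisements and hyperlinks, greatly hinders Web text classification. This thesis implements an automatic Web text extractor that turns a Web page into relatively clean plain text containing only the title and body.

3. A keyword extraction method suited to Web pages. Words in different positions and with different parts of speech contribute differently to expressing a page's content. Based on this observation, this thesis proposes a keyword extraction method weighted by part of speech, position, and word frequency to further filter out noise words in Web pages, with good results.

4. A classification method based on χ² statistic weighting. The χ² statistic reflects the correlation between features and categories well. This thesis applies the χ² statistic to text classification in a novel way, which not only simplifies the classification process but also achieves good classification speed and accuracy in practice.

Based on the characteristics of Web text, this thesis proposes an implementation scheme for classifying large-scale, multi-category Web text and designs a multi-level Web text classification system. The results show that in practice the system's classification performance is better than that of a typical flat classifier.

Keywords: hierarchical, Web automatic extraction, text classification, feature extraction, keyword extraction

ABSTRACT

In recent years, the Web has rapidly become the world's largest public source of information. Helping Web users locate the desired information easily and quickly within this vast body of information is an urgent problem, and correct Web text classification is the key issue in solving it. Deriving from automatic classification technology, Web text classification is an important part of Web text mining. It not only improves the efficiency of users' searches by helping them locate the desired knowledge quickly and accurately, but also captures the category interests of different users, which can serve as a reference for meeting users' personalized demands.

Current classification research treats text categories as disjoint and flat, without considering the hierarchy of categories. When the number of categories is large, obtaining a classifier through flat classification learning is expensive, and when handling an unknown document it is clearly inappropriate to compare the document with all of the class models. Building on an in-depth study of Web text mining and automatic classification, this thesis implements a multi-level Web text classification system that incorporates the hierarchical relations among classes. The innovations and key technologies are as follows:

1. Established a hierarchical model of training and classification. Because Web content is abundant and involves many categories in many fields, and because flat classification has problems in the multi-class situation, this thesis proposes the idea of hierarchical classification and establishes a hierarchical training and classification model.

2. Designed and implemented an automatic Web text extractor. The advertisements and hyperlinks in Web pages cause great trouble for Web text classification. This thesis implements an automatic Web text extractor that reduces a Web page to cleaner text containing only the title and body.

3. Proposed a keyword extraction method suited to Web pages. Because words in different locations and with different parts of speech play different roles in Web content, this thesis proposes a keyword extraction method based on part of speech, location, and word frequency weighting, and uses the method in Web text classification with good results.

4. Presented a classification method based on χ² statistic weighting. The χ² statistic reflects the relation between features and categories well. This thesis innovatively uses the χ² statistic in text classification, which not only simplifies the classification process but also obtains better speed and accuracy in practical applications.

Based on the characteristics of Web text, this thesis puts forward an approach to large-scale, multi-category Web text classification and designs a multi-level Web text classification system. The results show that in practice the system's performance is better than that of the general flat classifier.

Key words: hierarchical, Web automatic extraction, text classification, feature selection, keyword extraction

Contents

Chapter 1 Introduction
1.1 Research Background and Significance
1.1.1 Research Background
1.1.2 Significance of Web Mining
1.1.3 Significance of Web Text Classification
1.2 Research Status of Text Classification Technology
1.2.1 Research Status of Text Classification Abroad
1.2.2 Research Status of Text Classification in China
1.2.3 Research Status of Web Text Classification
1.3 Research Difficulties and Outstanding Problems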
