Research on an ICVSM-Based Summary Extraction Algorithm


Henan University of Science and Technology, master's thesis
Thesis title: Research on an ICVSM-Based Summary Extraction Algorithm (the thesis's own English title: Research on Summarization Abstract Algorithm based on Improved CVSM)
Specialty: Computer Application Technology
Graduate student: Guo Zhibing
Supervisor: Associate Professor Huang Guangjun
Degree applied for: Master's
Date: 2009-12-01

Abstract

Summary extraction is an information-distillation technique that arose in response to the modern information society. It quickly and accurately extracts, from a long text, the sentences that express the text's main topic and assembles them into a summary, helping people obtain useful information efficiently.

This thesis first reviews the state of research on summary extraction and the related techniques. It then addresses the shortcomings of existing Chinese summary extraction algorithms that combine statistical and semantic methods, and proposes an improved algorithm. The new algorithm improves on the original in two respects.

First, to handle the polysemy of Chinese words, the thesis proposes an improved word sense disambiguation algorithm. The algorithm uses HowNet and a training corpus to build a sememe (primitive) co-occurrence frequency database, which serves as the basis for disambiguation. When computing the correlation coefficient between each sense of the ambiguous word and the feature words in its context, it takes into account the correspondence among the four classes of sememes, which differ in their ability to express meaning, together with two distance factors that affect how a word's meaning is expressed: the distance in the text between a feature word and the ambiguous word, and the distance between the ambiguous word and the most recent occurrence of the same word form for which that sense was selected.

Second, to address the assumed independence among terms in the concept vector space model (CVSM), the thesis proposes a fuzzy concept equivalence-class partitioning algorithm based on the idea of clustering. Starting from practical meaning, the algorithm groups concepts that show no clear difference in the meaning they express and that are highly similar into equivalence classes, merging them into concept sets. Concept sets then replace individual concepts as the terms of the vector space model, and texts are represented with this improved concept vector space model (ICVSM), so that they are quantified more accurately and more concise summaries can be generated.

Finally, an experimental system was developed to verify the proposed ICVSM-based summary extraction algorithm. The experimental results show that, compared with earlier algorithms, the improved algorithm raises both accuracy and recall in disambiguating ambiguous words, and also improves the quality of the generated summaries.

Keywords: summary extraction, HowNet, sememe co-occurrence frequency, concept vector space model
Thesis type: applied research

Chapter 1 Introduction

1.1 Background and significance

With the spread of computers and the Internet, more and more electronic documents and information appear in daily life and scientific research, while the information that is actually valuable is submerged in this sea of information. How to obtain the content people need from it has become an issue of growing concern. An abstract can convey the main content of an article in a few concise, accurate, and highly general sentences [1]. Writing abstracts by hand, however, is time-consuming and laborious, and because different people judge differently, the resulting abstracts vary considerably in quality and form. For highly specialized articles, domain experts must also take part in preparing the abstract, which makes manual abstracting harder still and less timely. For these reasons, researchers proposed having machines prepare abstracts in place of people. Automatic summarization is an effective way to solve this problem [2,3]: a computer applies an algorithm to extract from a large amount of text a certain number of sentences that reflect the central idea of the article, and arranges them in order to produce an outline of its content [4].

The earliest research on automatic summarization began in 1952, more than half a century ago, and results have been achieved both internationally and in China. These results still fall far short of the needs of today's highly information-oriented society, mainly because the summaries produced by the four main existing extraction algorithms are not fully satisfactory in speed, accuracy, or readability. Research on Chinese automatic summarization started relatively late, at the end of the 1980s, so work in China lags further behind the advanced level abroad in technique, which also makes automatic summarization one of the key research areas for China in the future.

Automatic summarization has now become an international research topic that is receiving attention from a growing number of countries and scholars. In theoretical terms [5], studying automatic summarization contributes to the understanding and cognition of natural language text, especially electronic text, and thus helps provide a theoretical basis for related research in other areas of information processing. In practical terms, effective use of automatic summarization can greatly improve the speed and quality of abstract preparation. It also lets people facing massive amounts of electronic text obtain the information most useful to them in a short time, improving reading speed and the efficiency of acquiring valuable information without spending a great deal of time reading full texts [6]. Because research on automatic summarization has great theoretical value and practical significance, obtaining summaries of higher quality and accuracy has become a current research focus, and will also become one of the key research areas of natural science in the future [7-10].

1.2 Current state of automatic summarization research

In 1952, Luhn first …
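The idea from the abstract of merging highly similar concepts into concept sets, and using those sets as the terms of the vector space model, can be illustrated with a small sketch. This excerpt does not give the thesis's actual partitioning algorithm or similarity measure, so everything below is an assumption made for illustration: the function names `partition_concepts` and `to_vector`, the similarity function, the threshold of 0.9, and the toy concepts and similarity values. The sketch also uses a crisp union-find grouping (any chain of sufficiently similar pairs ends up in one set), which is a simplification and not the fuzzy partition the thesis proposes.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence


def partition_concepts(
    concepts: Sequence[str],
    sim: Callable[[str, str], float],
    threshold: float,
) -> Dict[str, int]:
    """Map each concept to the index of its concept set.

    Hypothetical rule (not from the thesis): two concepts share a set when
    a chain of pairwise similarities >= threshold connects them.
    """
    parent = list(range(len(concepts)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(concepts)):
        for j in range(i + 1, len(concepts)):
            if sim(concepts[i], concepts[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    # Renumber roots as consecutive set indices in order of first appearance.
    set_index: Dict[int, int] = {}
    result: Dict[str, int] = {}
    for i, concept in enumerate(concepts):
        root = find(i)
        set_index.setdefault(root, len(set_index))
        result[concept] = set_index[root]
    return result


def to_vector(doc_concepts: Sequence[str], concept_to_set: Dict[str, int]) -> List[int]:
    """Count occurrences per concept set instead of per individual concept."""
    n_sets = len(set(concept_to_set.values()))
    counts = Counter(concept_to_set[c] for c in doc_concepts if c in concept_to_set)
    return [counts.get(k, 0) for k in range(n_sets)]


# Toy similarity table; the values are invented for illustration only.
TOY_SIM = {
    frozenset(("computer", "PC")): 0.95,
    frozenset(("computer", "network")): 0.40,
    frozenset(("PC", "network")): 0.35,
}


def toy_sim(a: str, b: str) -> float:
    return TOY_SIM.get(frozenset((a, b)), 0.0)


concepts = ["computer", "PC", "network"]
mapping = partition_concepts(concepts, toy_sim, threshold=0.9)
vector = to_vector(["computer", "PC", "network", "PC"], mapping)
print(mapping)
print(vector)
```

In this toy run, "computer" and "PC" fall into one concept set, so the document vector has two dimensions instead of three. That is the kind of dimensionality reduction and term decorrelation the abstract attributes to the improved model.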
