Unsupervised Word Sense Disambiguation Based on Semantic Vectors


Classified Index:
U.D.C.:
University code: 10075
Student No.: 20091677

A Dissertation for the Degree of Master of Engineering

A Chinese Unsupervised Word Sense Disambiguation Method Based on Semantic Vectors

Candidate: Cui Lei (崔磊)
Supervisor: Prof. Li Xinfu (李新福)
Academic degree applied for: Master of Engineering
Specialty: Computer Technology
University: Hebei University
Date of oral examination: May 2012

Abstract

Word sense disambiguation (WSD) is an important research topic in computational linguistics and natural language processing, with theoretical and practical significance in many application areas. A WSD method that is both accurate and practical would greatly benefit research on, and applications of, machine translation, text classification, automatic summarization, information retrieval, and text mining.

Supervised machine learning approaches to WSD require the words in the training corpus to be annotated with their senses. To overcome the data sparseness problem and achieve good disambiguation results, a large-scale sense-tagged corpus must be built, and obtaining such a corpus carries a high cost in manual labor. To address this problem, this thesis proposes an unsupervised WSD method based on semantic vectors. The method does not require manual sense annotation of each word in the training samples, and it can effectively overcome the data sparseness problem.

The method combines mutual information (PMI) with the Z test to select feature words from the context of the ambiguous word, within a range of six words (stated in the thesis's English abstract as within three words around the word). It uses sense words to describe each sense of a polysemous word. Borrowing from traditional information retrieval the idea of computing the similarity between a natural-language query and a document, it treats the context of the polysemous word as the query and the sense words as the documents. It then constructs the semantic vectors and the context query vector of the word to be disambiguated, and determines the correct sense of the polysemous word by computing the similarity between each semantic vector and the query vector. Experiments on 150 typical polysemous words demonstrate the effectiveness of the method.

Keywords: word sense disambiguation; unsupervised learning; mutual information (PMI); similarity

Contents

Chapter 1 Introduction
1.1 Research background and significance
1.2 Research on word sense disambiguation in China and abroad
1.2.1 Research abroad
1.2.2 Research in China
Chapter 2 Word sense disambiguation methods and HowNet
2.1 Overview of word sense disambiguation
2.1.1 The concept of word sense disambiguation
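The retrieval-style step described in the abstract (context as query, sense words as documents, sense chosen by vector similarity) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names (pmi, cosine, context_vector, disambiguate), the use of raw counts as vector weights, cosine as the similarity measure, the half_window=3 reading of the context range, and the toy English data are all assumptions made for illustration. The Z test and the HowNet-derived sense words are not modeled here, because the abstract does not give their details.

```python
import math
from collections import Counter


def pmi(pair_count, count_a, count_b, total):
    """Pointwise mutual information from co-occurrence counts (log base 2)."""
    if pair_count == 0 or count_a == 0 or count_b == 0:
        return float("-inf")  # the pair never co-occurs in the sample
    return math.log2(pair_count * total / (count_a * count_b))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def context_vector(tokens, target_index, half_window=3):
    """Build the 'query': counts of up to half_window words on each side.

    half_window=3 is an assumed reading of the context range in the abstract.
    """
    left = tokens[max(0, target_index - half_window):target_index]
    right = tokens[target_index + 1:target_index + 1 + half_window]
    return dict(Counter(left + right))


def disambiguate(query, sense_vectors):
    """Return the sense label whose vector is most similar to the query."""
    return max(sense_vectors, key=lambda s: cosine(query, sense_vectors[s]))


# Hypothetical toy data: two senses of "bank", each described by sense words.
sense_vectors = {
    "finance": {"cash": 1.0, "deposit": 1.0, "loan": 1.0},
    "river": {"water": 1.0, "shore": 1.0, "river": 1.0},
}
tokens = ["she", "deposited", "cash", "at", "the", "bank", "on", "friday"]
query = context_vector(tokens, tokens.index("bank"))
chosen = disambiguate(query, sense_vectors)
```

In this toy run the query shares only the word "cash" with the "finance" sense vector and nothing with the "river" vector, so the "finance" sense is selected. In a fuller pipeline, a score such as pmi could be used to decide which context words are kept as features before the query vector is built.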
