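Classified Index: TP391.1
U.D.C.: 681.3

Thesis for the Master Degree in Engineering

RESEARCH ON CHINESE WORD AND PHRASE SENTIMENT ANALYSIS

Candidate: Zhu Li
Supervisor: Prof. Guan Yi
Academic Degree Applied for: Master of Engineering
Specialty: Computer Science and Technology
Affiliation: School of Computer Science and Technology
Date of Defence: June, 2009
Degree-Conferring Institution: Harbin Institute of Technology

Harbin Institute of Technology Master of Engineering Thesis

Chinese Abstract

With the development of the Internet and the spread of network access, more and more information appears in textual form, and text has gradually become the most accessible and richest interactive resource available to us. As a result, sentiment orientation analysis has gradually become a new hotspot in natural language processing.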
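Research on sentiment orientation analysis can be roughly divided into four levels: word-level sentiment orientation analysis, sentence-level sentiment orientation analysis, document-level sentiment orientation analysis, and prediction of the overall orientation of massive information.

Text sentiment orientation analysis analyzes the speaker's attitude (also called opinion or emotion), that is, the subjective information in a text. Word sentiment orientation analysis analyzes the polarity, intensity, and contextual patterns of individual words or entities. Word sentiment orientation analysis is therefore the prerequisite and foundation of text sentiment orientation analysis. At present, there is still relatively little research on sentiment analysis in China, so the work in this thesis is of important and far-reaching significance.

Regarding the current state of resources for sentiment orientation analysis, this thesis examines in detail the methods for building sentiment lexicons and, by comparing them, explains the difficulties and limitations of word sentiment orientation analysis. It also describes in detail the roles of degree adverbs, negation adverbs, and conjunctions in sentiment analysis and how they are collected. Finally, it reviews the current state of sentiment corpus construction.

For word sentiment orientation analysis, this thesis builds on the sentiment lexicon and introduces a method that combines the χ² statistic with naïve Bayes classification. Experimental results show that it is effective at discovering sentiment words that newly appear in text. In addition, this thesis proposes using sentiment phrase templates to identify sentiment phrases in text; experimental results show that all evaluation metrics improve markedly once sentiment words and sentiment phrases are combined.

For text sentiment orientation analysis, this thesis compares traditional text sentiment computation methods with text sentiment classification methods and highlights the important role of the latter in text sentiment analysis. For text sentiment classification, it uses sentiment words and sentiment phrases as target features, information gain and the χ² statistic as feature selection strategies, and naïve Bayes and support vector machines as classification algorithms. After comparison, the best-performing method was chosen to implement a text sentiment orientation analysis system based on the sentiment lexicon. Experimental results show that the system reaches 86% precision on the Chinese Opinion Analysis Evaluation corpus.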
Keywords: sentiment orientation analysis; sentiment lexicon; sentiment phrase collocation; sentiment classification

Abstract

With the development of the Internet, more and more information is available in textual form, and text has become one of the richest interactive resources available to us. As a result, text sentiment analysis has become a hotspot in natural language processing in recent years.

Sentiment analysis can broadly be divided into four levels: word and phrase sentiment analysis, sentence sentiment analysis, text sentiment analysis, and overall sentiment analysis of massive information. Text sentiment polarity analysis means analyzing the attitude (or point of view, or emotion) of speakers and writers, that is, analyzing the subjective information in a text. Word and phrase sentiment analysis means analyzing the semantic orientation and intensity of a single word or phrase. Word and phrase sentiment analysis is therefore the foundation of text sentiment polarity analysis. However, little research in China has focused on sentiment analysis, so our research in this area has far-reaching significance.

Regarding the state of sentiment analysis resources, this paper analyzes and compares different methods for building sentiment lexicons and lays out the difficulties and limitations of word sentiment analysis. It also describes the function and the collection methods of adverbs, negation operators, and conjunctions in sentiment analysis. Finally, we present the status of sentiment corpus building.

For word and phrase sentiment analysis, this paper proposes a method that automatically identifies sentiment words in a corpus by combining word sentiment classification with chi-square statistics, and the experimental results show that it performs well on word sentiment analysis.
In addition, the paper proposes using sentiment phrase templates to identify sentiment phrases in a corpus, and the experimental results show a marked improvement after sentiment words and sentiment phrases are combined in our system.

For text sentiment analysis, this paper contrasts the traditional methods for calculating text sentiment polarity with text sentiment classification methods and highlights the important role of the latter in sentiment analysis. For text sentiment classification, we use sentiment words and sentiment phrases as the target features, information gain and chi-square statistics as the feature selection strategies, and the naïve Bayes classifier and the support vector machine as the classification algorithms. Finally, we choose the best method to construct our text sentiment analysis system, and the experimental results show that the system achieves 86% precision on the COAE2008 corpus.

Keywords: Sentiment analysis; Sentiment lexicon; Sentiment phrase; Text sentiment classification

Contents

Chinese Abstract
Abstract
Chapter 1 Introduction
  1.1 Background and significance of the research
  1.2 Current state of research on sentiment orientation analysis
    1.2.1 Word sentiment orientation analysis
    1.2.2 句子情感倾向性分。