A Discussion of Chinese Word and Sentence Segmentation Techniques and Machine Translation Evaluation Methods


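Classification No.:
Security level:
UDC:
Serial No.:

Graduate School of the Chinese Academy of Sciences
Master's Thesis

A Study of Chinese Word and Sentence Segmentation Techniques and Machine Translation Evaluation Methods

Author: Ding Liu (刘丁)
Supervisor: Chengqing Zong (宗成庆), Research Professor, Ph.D., Institute of Automation, Chinese Academy of Sciences
Degree applied for: Master of Engineering
Discipline: Pattern Recognition and Intelligent Systems
Thesis submitted: June 2004
Thesis defended: June 2004
Training institution: Institute of Automation, Chinese Academy of Sciences
Degree-granting institution: Graduate School of the Chinese Academy of Sciences
Chair of the defense committee:

Approaches to Chinese Word Analysis, Utterance Segmentation and Automatic Evaluation of Machine Translation

Dissertation Submitted to
Institute of Automation, Chinese Academy of Sciences
in partial fulfillment of the requirements
for the degree of
Master of Engineering
by
Ding Liu
(Pattern Recognition and Intelligence System)

Dissertation Supervisor: Professor Chengqing Zong

Declaration of Originality

I declare that the thesis submitted here presents research work that I carried out, and results that I obtained, under the guidance of my supervisor. To the best of my knowledge, except where specifically marked and acknowledged in the text, the thesis contains no research results that have been published or written by others. Any contribution made to this research by colleagues who worked with me has been clearly stated and acknowledged in the thesis.

Signature: _  Supervisor's signature: _  Date: _

Statement on Authorization for Use of the Thesis

I fully understand the regulations of the Institute of Automation, Chinese Academy of Sciences, on the retention and use of degree theses, namely: the Institute of Automation has the right to retain copies of the submitted thesis and to allow the thesis to be consulted and borrowed; it may publish all or part of the thesis, and it may preserve the thesis by photocopying, reduced-size printing, or other means of reproduction. (Classified theses are subject to these regulations once declassified.)

Signature: _  Supervisor's signature: _  Date: _

Abstract

Building on statistical models and drawing on a large body of prior work, this thesis studies Chinese lexical analysis, spoken-utterance segmentation, and machine translation evaluation in some depth. Chinese lexical analysis is the first step in most Chinese language processing, so its importance is self-evident. Utterance segmentation is the bridge between speech recognition and text translation in speech translation: however well recognition and translation perform on their own, overall performance cannot improve if this bridge is poorly built. Automatic evaluation of machine translation is an important supporting task in building a machine translation system, because it speeds up development and shortens the development cycle. In short, all three topics belong to the foundational research areas of natural language processing, and their quality directly affects the level that higher-level applications can reach.

In lexical analysis, we use hidden Markov models (HMMs) to propose a unified approach that integrates word segmentation, part-of-speech tagging, and named entity recognition. We first used a class-based HMM. Its advantages are broad word coverage and low system overhead; its drawback is that it cannot accurately predict the probability of a word occurring.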
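The idea of a class-based HMM that segments and tags in one pass can be illustrated with a small Viterbi search over candidate words. The sketch below is not the thesis's model: the lexicon, the two classes (`n`, `v`), and every probability are made-up values chosen only so the example runs, and the names `LEXICON`, `TRANS`, and `segment` are hypothetical.

```python
import math

# Hypothetical toy model. All numbers are invented for illustration.
# LEXICON[word][cls] plays the role of P(word | class).
LEXICON = {
    "研究": {"n": 0.02, "v": 0.03},
    "研究生": {"n": 0.01},
    "生命": {"n": 0.02},
    "命": {"n": 0.005},
    "生": {"v": 0.004},
    "起源": {"n": 0.01},
}
# TRANS[(prev_cls, cls)] plays the role of P(class | previous class).
TRANS = {
    ("<s>", "n"): 0.5, ("<s>", "v"): 0.3,
    ("n", "n"): 0.3, ("n", "v"): 0.3,
    ("v", "n"): 0.5, ("v", "v"): 0.1,
}


def segment(sentence, max_len=4):
    """Return the best (word, class) sequence, or None if no path exists."""
    n = len(sentence)
    # best[i][cls] = (log prob, start offset, previous class) of the best
    # path that ends at character offset i with a word of class cls.
    best = [dict() for _ in range(n + 1)]
    best[0]["<s>"] = (0.0, None, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            for cls, p_emit in LEXICON.get(word, {}).items():
                for prev_cls, (prev_lp, _, _) in best[start].items():
                    p_trans = TRANS.get((prev_cls, cls))
                    if p_trans is None:
                        continue
                    lp = prev_lp + math.log(p_trans) + math.log(p_emit)
                    if cls not in best[end] or lp > best[end][cls][0]:
                        best[end][cls] = (lp, start, prev_cls)
    if not best[n]:
        return None  # the toy lexicon cannot cover the sentence
    # Backtrace from the best final class.
    cls = max(best[n], key=lambda c: best[n][c][0])
    end, path = n, []
    while end > 0:
        _, start, prev_cls = best[end][cls]
        path.append((sentence[start:end], cls))
        end, cls = start, prev_cls
    return path[::-1]


if __name__ == "__main__":
    print(segment("研究生命起源"))
```

With these toy numbers, `segment("研究生命起源")` prefers 研究/v 生命/n 起源/n over the competing reading that starts with 研究生. Because emission probabilities are tied to classes rather than to word histories, a model of this shape stays small, which matches the coverage-versus-precision trade-off described above.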

To improve the model's accuracy, we introduced a word-based HMM, combined it with the class-based HMM, and smoothed the word-based HMM with a "word-to-character" probability smoothing method. Experimental results show that our hybrid model, because it jointly takes into account knowledge of characters, words, parts of speech, and named entities, clearly outperforms a purely class-based or purely word-based HMM in segmentation precision and recall. On the implementation side, we present the overall framework and functional modules of the general-purpose word segmentation system APCWS and use them to discuss technical details such as how to store and load data efficiently.

In spoken-utterance segmentation, we propose a sentence segmentation algorithm based on a bidirectional N-gram model and a maximum entropy model. Because the algorithm uses maximum entropy to combine forward and reverse N-gram segmentation, it takes into account the context on both the left and the right of a candidate segmentation point, and it therefore segments well. We trained and tested the model on Chinese and English corpora, and the results show that it clearly outperforms basic forward N-gram segmentation. We then analyzed and compared the segmentation results of the individual models, which confirmed our original expectation: the combined model keeps the correct segmentations of the forward N-gram algorithm while using the reverse N-gram algorithm to avoid the forward algorithm's incorrect segmentations.

In automatic evaluation of machine translation, we first introduce two commonly used reference-based evaluation algorithms, BLEU and NIST, and then present E3, a sentence fluency evaluation method based on N-gram models. E3 requires no reference translations; by treating the transition probabilities of different words in a sentence differently, it evaluates fluency well.

In summary, this thesis proposes innovative statistical-model-based methods for Chinese lexical analysis, spoken-utterance segmentation, and machine translation evaluation. They are a valuable reference in terms of scientific method and are also significant for practical applications.

ABSTRACT

This thesis proposes novel statistical approaches to Chinese word analysis, utterance segmentation and automatic evaluation of machine translation (MT). Word analysis is the first step for most applications based on Chinese language technologies; utterance segmentation is the bridge that connects speech recognition and text translation in a speech translation system; automatic evaluation of MT systems can speed up the research and development of an MT system and reduce its development cost. In short, the three topics all belong to the basic research areas of Natural Language Processing (NLP) and are significant for many important applications such as text translation and speech translation.

In Chinese word analysis, we proposed a novel unified approach based on HMM, which efficiently combines word segmentation, Part of Speech (POS) tagging and Named Entity (NE) recognition. Our first model is a class-based HMM. To increase its accuracy, we introduced a word-based HMM and combined it with the class-based HMM. Finally, we used a "word-to-character" smoothing method to predict the probability of words that do not occur in the training set. The experimental results show that our combined model, by comprehensively considering the information of Chinese characters, words, POS and NE, achieved much better precision and recall in Chinese word segmentation. Based on the knowledge of our combined model, we described the details of implementing the general word segmentation system APCWS. We discussed some technical problems in data saving and loading, and described our modules for knowledge management and word lattice construction.

In utterance segmentation, this thesis proposed a novel approach based on a bi-directional N-gram model and a Maximum Entropy model. This method, which effectively combines the normal and reverse N-gram algorithms, is able to make use of both the left and right context of the candidate site and achieved very good performance in utterance segmentation. We conducted experiments in both Chinese and English. The results showed that our method performed much better than the normal N-gram algorithm. By analyzing the experimental results, we then found the reason why our method achieved better results: on the one hand it retained the correct segmentations of the normal N-gram algorithm, and on the other hand it avoided incorrect segmentations by making use of the reverse N-gram algorithm.

In automatic evaluation of MT systems, we first introduced two clas
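The bi-directional idea described for utterance segmentation can be sketched as a two-feature log-linear (logistic) classifier over each gap between words. This is only an illustration under stated assumptions, not the thesis's trained model: the tables `FWD` and `REV`, the back-off value `DEFAULT`, and the weights `W_FWD`, `W_REV`, and `BIAS` are hand-set, hypothetical values standing in for N-gram estimates and maximum-entropy-trained parameters.

```python
import math

# Hypothetical stand-ins for model estimates (all values invented).
# FWD[w]: P(boundary | previous word w), the forward (left-context) view.
FWD = {"you": 0.6, "fine": 0.6}
# REV[w]: P(boundary | next word w), the reverse (right-context) view.
REV = {"i": 0.7, "thank": 0.7, "very": 0.1}
DEFAULT = 0.05                       # back-off for unseen words
W_FWD, W_REV, BIAS = 1.0, 1.0, 0.0   # hand-set, not trained


def logit(p):
    return math.log(p / (1.0 - p))


def boundary_prob(prev_word, next_word):
    """Log-linear combination of the forward and reverse scores."""
    z = (BIAS
         + W_FWD * logit(FWD.get(prev_word, DEFAULT))
         + W_REV * logit(REV.get(next_word, DEFAULT)))
    return 1.0 / (1.0 + math.exp(-z))


def split_bidirectional(tokens, threshold=0.5):
    """Split a token stream using both left and right context."""
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if i + 1 < len(tokens) and boundary_prob(tok, tokens[i + 1]) > threshold:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def split_forward_only(tokens, threshold=0.5):
    """Baseline that looks only at the word to the left of each gap."""
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if i + 1 < len(tokens) and FWD.get(tok, DEFAULT) > threshold:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    stream = "how are you i am fine thank you very much".split()
    print(split_forward_only(stream))
    print(split_bidirectional(stream))
```

On this toy stream the forward-only baseline wrongly cuts after the second "you", while the combined score keeps "thank you very much" together because the right-context estimate for "very" is low. That is the behaviour the abstract attributes to the reverse N-gram: it vetoes boundaries that the forward model alone would accept.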
