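Master's Thesis, Graduate School of the Chinese Academy of Sciences

Approaches to Chinese Word Analysis, Utterance Segmentation and Automatic Evaluation of Machine Translation

Author: Ding Liu
Supervisor: Chengqing Zong, Research Professor, Ph.D., Institute of Automation, Chinese Academy of Sciences
Degree applied for: Master of Engineering
Discipline: Pattern Recognition and Intelligent Systems
Thesis submission date: June 2004
Thesis defense date: June 2004
Training institution: Institute of Automation, Chinese Academy of Sciences
Degree-granting institution: Graduate School of the Chinese Academy of Sciences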

Dissertation submitted to the Institute of Automation, Chinese Academy of Sciences, in partial fulfillment of the requirements for the degree of Master of Engineering, by Ding Liu (Pattern Recognition and Intelligent Systems). Dissertation supervisor: Professor Chengqing Zong

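Abstract (Chinese version)

Based on statistical models and drawing on a large body of previous work, this thesis studies Chinese word analysis, utterance segmentation in spoken language, and machine translation evaluation in a relatively …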




… has significant reference value and is also of great importance for practical applications.

ABSTRACT

This thesis proposes novel statistical approaches to Chinese word analysis, utterance segmentation, and automatic evaluation of machine translation (MT). Word analysis is the first step in most applications based on Chinese language technologies; utterance segmentation is the bridge that connects speech recognition and text translation in a speech translation system; and automatic evaluation of MT systems can speed up the research and development of an MT system and reduce its development cost. In short, all three topics belong to the basic research areas of Natural Language Processing (NLP) and are significant for many important applications, such as text translation and speech translation.

In Chinese word analysis, we propose a novel unified approach based on the hidden Markov model (HMM), which efficiently combines word segmentation, Part-of-Speech (POS) tagging, and Named Entity (NE) recognition. Our first model is a class-based HMM. To increase its accuracy, we introduce a word-based HMM and combine it with the class-based HMM. Finally, we use a “word-to-character” smoothing method to estimate the probability of words that do not occur in the training set. The experimental results show that our combined model, by jointly considering information about Chinese characters, words, POS, and NE, achieves much better precision and recall in Chinese word segmentation. Based on the combined model, we describe the implementation details of the general word segmentation system APCWS. We discuss some technical problems in saving and loading data, and describe our modules for knowledge management and word lattice construction.

In utterance segmentation, this thesis proposes a novel approach based on a bi-directional N-gram model and a Maximum Entropy model. This method, which effectively combines the normal and reverse N-gram algorithms, makes use of both the left and right context of a candidate site and achieves very good performance in utterance segmentation. We conducted experiments in both Chinese and English. The results show that our method performs much better than the normal N-gram algorithm. By analyzing the experimental results, we found the reason for this improvement: on the one hand, the method retains the correct segmentations of the normal N-gram algorithm; on the other hand, it avoids incorrect segmentations by making use of the reverse N-gram algorithm.

In automatic evaluation of MT systems, we first introduced two cla



