Hujiang.com: A Pioneer in Internet Learning

Uploader: 正** Document ID: 46687626 Upload date: 2018-06-27 Format: PDF Pages: 16 Size: 258.95 KB

Research Paper
CJLIS Vol. 5 No. 4, 2012, pp 77-92
National Science Library, Chinese Academy of Sciences
http:/
Received Nov. 27, 2012; Revised Dec. 3, 2012; Accepted Dec. 22, 2012
Translated with permission from New Technology of Library and Information Service (in Chinese), 2012, 28: 36

A method for improving the accuracy of automatic indexing of Chinese-English mixed documents*

Yan ZHAO1,2

[...] String matching; Accuracy of automatic indexing; Cybernetics; Dedicated hepatitis B virus (HBV) database

With the development of the Internet and communications technologies, people speaking different languages communicate with one another more frequently, and the number of multilingual documents is increasing rapidly. These documents cannot be used without first being indexed. However, the significant differences between languages in grammatical rules and word formation rules make it difficult to index multilingual documents accurately. Given the importance of indexing in document processing, we study the automatic indexing of Chinese-English mixed documents with the aim of finding a method that increases the accuracy of multilingual document indexing.

1 A brief introduction to document indexing

Document indexing means that a document is indexed according to certain rules and criteria. By using index terms to describe the document, an originally isolated document is connected with existing conceptual systems. Indexing makes the document easier to search, store and use. Depending on the classification criterion, document indexing can be divided into assignment indexing and derivative indexing, or into manual indexing and automatic indexing, etc. The production of the keyword-in-context (KWIC) index by Luhn of IBM was a great step forward in the development of modern, computer-assisted automatic indexing [1]. Assignment indexing, where index terms are taken from a controlled vocabulary, is also called controlled indexing [2]. Indexing is the very first step in text processing, and its quality plays a decisive role in later steps such as knowledge classification, data mining and knowledge discovery. Because both the controlled vocabulary and the target documents are changeable and complex, it is hard to ensure the accuracy of an indexing method. This is especially the case for multilingual document indexing (exemplified by Chinese-English mixed documents), owing to the significant differences between the languages in grammatical rules and word formation rules.

2 Factors affecting the accuracy of automatic indexing of Chinese-English mixed documents

How to ensure accuracy has long been one of the major problems in computer-assisted automatic assignment indexing. For years, scholars both in China and abroad have been focusing on this research area, but numerous research results

[...] in “rheumatoid factors (RF)”, “rheumatoid factors” was marked while “RF” was left unmarked. Similarly, for the term “anti-calmodulin antibodies (anti-CaM)”, “anti-calmodulin antibodies” was marked while “anti-CaM” was not.

4.2.3 Indexing efficiency

The whole indexing process took nearly 9 hours. The reasons for the long running time are as follows.

First, there are over 35,000 entries, more than 50% of which are long terms of more than 10 characters each. Second, there were no separate controlled vocabularies for Chinese documents and English documents, which caused a substantial amount of ineffective matching. Third, for convenience in programming, the present indexing system adopted the brute-force (BF) algorithm, which is considered the easiest to implement and the least efficient algorithm in string matching.

4.3 Improvement of the indexing process

As mentioned in Section 4.2, the recall of the indexing system is 97.37%, which is acceptable because the controlled vocabulary is relatively complete. However, the precision of 88.54% is far from satisfactory, and mis-indexing remains the main reason. Based on the analysis in Section 3, we introduce cybernetics theory to improve indexing quality in three phases. Moreover, Chinese and English texts need to be processed differently.

4.3.1 Feed-forward control

Before indexing, we pre-processed the controlled vocabulary and the target database.

(i) Pre-processing the controlled vocabulary

We divided the original controlled vocabulary, which has both Chinese and English content, into 3 parts: Chinese, English and mixed Chinese-English terms. The Chinese controlled vocabulary consists only of Chinese characters, and the English controlled vocabulary contains only English letters, Greek letters and Arabic numerals. All remaining terms were grouped under the Chinese-English mixed controlled vocabulary.
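The three-way split of the controlled vocabulary described in Section 4.3.1 (i) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the use of the CJK Unified Ideographs block to stand for "Chinese characters", and the decision to tolerate spaces, hyphens and parentheses inside English terms are all assumptions made for illustration, and the sample terms other than "anti-CaM" are hypothetical.

```python
import re
from typing import Dict, Iterable, List

# Assumption: "Chinese characters" = CJK Unified Ideographs (U+4E00..U+9FFF).
CHINESE_ONLY = re.compile(r"[\u4e00-\u9fff]+")
# Assumption: English terms may hold Latin letters, Greek letters (U+0370..U+03FF)
# and Arabic numerals; spaces, hyphens and parentheses are tolerated as
# separators (this tolerance is our addition, not stated in the text).
ENGLISH_ONLY = re.compile(r"[A-Za-z\u0370-\u03ff0-9 \-()]+")


def classify_term(term: str) -> str:
    """Return 'chinese', 'english' or 'mixed' for one vocabulary term."""
    if CHINESE_ONLY.fullmatch(term):
        return "chinese"
    if ENGLISH_ONLY.fullmatch(term):
        return "english"
    return "mixed"


def partition_vocabulary(terms: Iterable[str]) -> Dict[str, List[str]]:
    """Split a controlled vocabulary into Chinese, English and mixed parts."""
    parts: Dict[str, List[str]] = {"chinese": [], "english": [], "mixed": []}
    for term in terms:
        term = term.strip()
        if term:  # skip empty lines in the vocabulary file
            parts[classify_term(term)].append(term)
    return parts


if __name__ == "__main__":
    sample = ["乙型肝炎病毒", "anti-CaM", "HBV DNA", "乙肝e抗原", "α-fetoprotein"]
    for name, group in partition_vocabulary(sample).items():
        print(name, group)
```

With separate sub-vocabularies, a Chinese passage would only need to be matched against the Chinese and mixed parts, which is one plausible way to reduce the ineffective matching noted in Section 4.2.3.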
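For reference, the brute-force (BF) string matching that Section 4.2.3 names as a cause of the long running time can be sketched as follows. This is a generic textbook version under our own assumptions, not the code of the indexing system described here. The names `brute_force_find` and `index_document` are hypothetical, and matching is case-sensitive by assumption.

```python
from typing import Dict, Iterable, List


def brute_force_find(text: str, pattern: str) -> List[int]:
    """Return every start offset at which pattern occurs in text."""
    hits: List[int] = []
    n, m = len(text), len(pattern)
    if m == 0:
        return hits
    # Try every alignment of the pattern against the text.
    for start in range(n - m + 1):
        offset = 0
        # Compare character by character until a mismatch or a full match.
        while offset < m and text[start + offset] == pattern[offset]:
            offset += 1
        if offset == m:
            hits.append(start)
    return hits


def index_document(text: str, vocabulary: Iterable[str]) -> Dict[str, List[int]]:
    """Map each vocabulary term found in text to its match offsets."""
    found: Dict[str, List[int]] = {}
    for term in vocabulary:
        positions = brute_force_find(text, term)
        if positions:
            found[term] = positions
    return found


if __name__ == "__main__":
    doc = "rheumatoid factors (RF)"
    print(index_document(doc, ["rheumatoid factors", "RF", "HBV"]))
```

In the worst case each term costs on the order of (text length × term length) character comparisons, and the whole vocabulary is scanned against every document, which is consistent with the observation that long terms and an unpartitioned vocabulary slow the process down.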
