Literature Mining Application System Based On Concept Associated Network


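Master's Thesis, Tongji University
In conformity with the requirements for the degree of Master of Science

Literature Mining Application System Based On Concept Associated Network

Candidate: Yueqian Wang (王越千)
Student Number: 1532757
School/Department: School of Life Sciences and Technology
Discipline: Science
Major: Biology
Supervisor: Xiaoyan Zhang (张晓艳)
May, 2018

Tongji University Thesis Copyright Authorization

I fully understand Tongji University's regulations on the collection, preservation, and use of theses, and agree to the following: to submit printed and electronic copies of the thesis as the university requires; the university has the right to keep printed and electronic copies of the thesis and to preserve it by photocopying, reduced-format printing, scanning, digitization, or other means; the university has the right to provide catalog retrieval and reading services for all or part of this thesis; the university has the right, under the relevant regulations, to send photocopies and electronic copies of the thesis to the relevant national departments or institutions; and, provided it is not for profit, the university may reproduce part or all of the thesis as appropriate for academic activities.

Signature of thesis author:        Date:

Tongji University Thesis Declaration of Originality

I solemnly declare that the thesis submitted is the result of research I carried out under the guidance of my supervisor. Except for content that is cited and marked in the text, the research results of this thesis contain no work created by others, whether published or unpublished. Other individuals and groups who contributed to the research covered by this thesis are clearly identified in the text. I bear the legal responsibility for this declaration of originality.

Signature of thesis author:        Date:

Abstract

With the rapid development of the life sciences, the volume of literature has grown explosively.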

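For example, PubMed, a biomedical literature database, adds 300,000 to 350,000 new records each year, and the number keeps growing. How to retrieve the relevant literature and the concepts associated with a target, and thereby better extract the information needed, is a current research focus and challenge. The traditional approach of obtaining information by reading the literature is highly inefficient. More efficient approaches are therefore needed so that researchers can systematically obtain information from target literature and mine latent relationships from it.

This project explores four research methods in literature mining (entity recognition, information extraction, text mining, and information integration) and builds a literature mining and application system based on a concept association network. For entity recognition, the recognition results of the MetaMap software were integrated, the concept terms extracted from literature on CRISPR/Cas9 technology were classified hierarchically, and the precision of MetaMap's concept extraction was evaluated at the different levels of classification. During this evaluation, erroneous MetaMap extractions were filtered out, which raised the precision of the extracted concepts and validated the MetaMap-based approach to concept extraction.

On this basis, two approaches were combined for information extraction: natural language processing and a co-word (co-occurrence) strategy. First, the natural language processing tool SemRep was integrated to extract relationships between concepts in the literature, and a network of genes and diseases based on semantic relationships was built from the liver cancer literature. Second, the extracted semantic relationships were filtered to explore the various kinds of relationships between genes and diseases in the liver cancer literature set. The extracted gene-disease relationships were compared with manually annotated gene-disease pairs. The results show that the semantic network built with the natural language approach mines concept relationships in the literature more accurately, with high precision; its drawbacks are low recall and difficulty in extracting multiple complex types of relationships and in discovering latent gene-disease associations.

Therefore, for text mining, this project uses the co-word strategy to establish relationships between concepts and thereby discover latent associations among them. Centered on digestive tract tumors, it mined the concept relationships in 32,751 articles on digestive tract tumors. MetaMap was first used to extract tumor and gene concepts for the different digestive tract tumors. A concept association network was then built with the co-word strategy, and the degree of association between tumors and genes was assessed with three measures: the Phi correlation coefficient, pointwise mutual information, and cosine similarity. Based on the strength of association between tumors and genes, latent gene-disease associations were identified and new tumor markers were predicted.

Finally, for information integration, the tumor-gene pairs obtained by text mining were integrated with public TCGA data on different cancers, including gene expression, methylation levels, and patient clinical data, to build the literature mining application system. The system offers a useful reference for clinical aspects of digestive tract tumors such as diagnosis, treatment, and prognosis, and provides information support for more accurate personalized medicine.

The results show that the literature mining and application system designed and built in this project has both research and practical value. It can summarize and analyze large volumes of literature and present hot research topics and latent associations between pieces of knowledge in the relevant field. These capabilities can play an important role in areas such as literature information extraction and information integration.

Keywords: literature mining, association analysis, concept association network

ABSTRACT

With the rapid development of the biomedical field, research results published as literature have grown explosively. PubMed, a biomedical literature database developed by the U.S. National Center for Biotechnology Information (NCBI), grows by 300,000 to 350,000 records each year. Given this growth, how to obtain the targeted articles and how to relate the concepts of interest are two major problems that researchers currently face. Reading the literature manually in the traditional way is time-consuming and laborious. Therefore, more efficient methods are needed to enable researchers to obtain target literature information and to mine potential relationships from the literature systematically.

This research explores four literature mining methods (Entity Recognition, Information Extraction, Text Data Mining, and Information Integration) and establishes a literature mining application system based on a concept associated network. For Entity Recognition, the recognition results of the MetaMap software are integrated. The concepts extracted from the CRISPR/Cas9 technical literature are classified into hierarchical categories, and the precision of the concepts extracted by MetaMap is evaluated under the different categories. In the process of evaluating the results, erroneous MetaMap extractions are screened out to improve precision, and the feasibility of MetaMap-based concept extraction is verified.

Building on Entity Recognition by MetaMap, the research adopts Natural Language Processing (NLP) and a co-occurrence strategy in the exploration of Information Extraction. First, SemRep, a natural language processing tool, is integrated to extract concept relationships from the literature, and a semantic network is established from the liver cancer literature. In addition, the extracted semantic relationships between genes and diseases in the liver cancer literature are further screened. The relationships extracted from the literature are compared with manually annotated disease-gene pairs. The advantage of the semantic network established by the NLP method is that it mines concept relationships in the literature more accurately, in other words, with high precision. The disadvantage is that recall is low, and it is difficult to extract a variety of relationship types and to mine potential correlations between genes and diseases.

Therefore, when exploring Text Mining methods, this research adopts a co-occurrence strategy to establish the relationship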
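
The abstract names three measures for scoring tumor-gene association strength under the co-occurrence strategy: the Phi correlation coefficient, pointwise mutual information, and cosine similarity. The sketch below shows one common way to compute them from document-level co-occurrence counts. It is an illustration under assumptions, not the thesis's implementation: the abstract does not give its formulas, so the standard 2x2 contingency-table definitions are assumed, and the function names, concept labels, and toy corpus are all hypothetical.

```python
import math
from itertools import product


def association_scores(n_total, n_a, n_b, n_ab):
    """Score one concept pair from document counts (assumed standard definitions).

    n_total: number of documents; n_a / n_b: documents mentioning A / B;
    n_ab: documents mentioning both.
    """
    # 2x2 contingency table over documents.
    n11 = n_ab
    n10 = n_a - n_ab
    n01 = n_b - n_ab
    n00 = n_total - n_a - n_b + n_ab

    # Phi coefficient: correlation between the two binary occurrence indicators.
    denom = math.sqrt(n_a * (n_total - n_a) * n_b * (n_total - n_b))
    phi = (n11 * n00 - n10 * n01) / denom if denom else 0.0

    # Pointwise mutual information: log2 of observed vs. expected co-occurrence.
    pmi = math.log2(n_ab * n_total / (n_a * n_b)) if n_ab else float("-inf")

    # Cosine similarity of the two binary occurrence vectors.
    cosine = n_ab / math.sqrt(n_a * n_b) if n_a and n_b else 0.0

    return {"phi": phi, "pmi": pmi, "cosine": cosine}


def score_corpus(docs, tumors, genes):
    """Score every tumor-gene pair that co-occurs in at least one document.

    docs is a list of sets, each holding the concepts found in one article
    (for example, the output of a concept extractor).
    """
    n_total = len(docs)
    scores = {}
    for tumor, gene in product(tumors, genes):
        n_a = sum(tumor in d for d in docs)
        n_b = sum(gene in d for d in docs)
        n_ab = sum(tumor in d and gene in d for d in docs)
        if n_ab:
            scores[(tumor, gene)] = association_scores(n_total, n_a, n_b, n_ab)
    return scores


# Hypothetical toy corpus with placeholder concept names.
docs = [
    {"tumor_X", "gene_1"},
    {"tumor_X", "gene_1"},
    {"tumor_X", "gene_2"},
    {"tumor_Y", "gene_2"},
]
scores = score_corpus(docs, ["tumor_X", "tumor_Y"], ["gene_1", "gene_2"])
for pair, s in sorted(scores.items()):
    print(pair, {k: round(v, 3) for k, v in s.items()})
```

The three measures can rank pairs differently: pointwise mutual information tends to favor rare pairs, while cosine similarity and Phi are bounded and less sensitive to low counts, which is one plausible reason to compare all three before choosing a threshold for the network.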
