


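Master's Thesis

Literature Mining Application System Based On Concept Associated Network

Tongji University in conformity with the requirements for the degree of Master of Science

Candidate: Yueqian Wang (王越千)
Student Number: 1532757
School/Department: School of Life Sciences and Technology
Discipline: Science
Major: Biology
Supervisor: Xiaoyan Zhang (张晓艳)
Co-supervisor:
May, 2018

Tongji University Thesis Copyright Authorization

I fully understand Tongji University's regulations on the collection, preservation, and use of theses, and agree to the following: I will submit printed and electronic copies of the thesis as required by the university; the university has the right to keep the printed and electronic copies of the thesis and to preserve it by photocopying, reduced-format printing, scanning, digitization, or other means; the university has the right to provide catalogue search and reading services for all or part of this thesis; the university has the right, in accordance with the relevant regulations, to send copies and electronic versions of the thesis to the relevant state departments or institutions; and, provided it is not for profit, the university may reproduce part or all of the thesis as appropriate for academic activities.

Signature of thesis author:
Date (year / month / day):

Tongji University Thesis Declaration of Originality

I solemnly declare that the thesis submitted here is the result of research I carried out under the guidance of my supervisor. Except for content that is cited and marked as such in the text, the research results of this thesis do not contain any work created by others, whether published or unpublished. Other individuals and groups who contributed to the research described in this thesis are clearly indicated in the text. I bear the legal responsibility for this declaration of originality.

Signature of thesis author:
Date (year / month / day):

Tongji University Master's Thesis: Abstract

Abstract

With the rapid development of the life sciences, the volume of literature is growing explosively.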



[...] relationships and to discover potential associations between genes and diseases. Therefore, in studying text mining methods, this project uses a co-word strategy to establish relationships between concepts and thereby discover potential associations among them. Centered on digestive tract tumors, the project mined concept relationships from 32,751 articles on digestive tract tumors. First, tumor and gene concepts for the different digestive tract tumors were extracted with MetaMap. A concept association network was then built with the co-word strategy, and the strength of the relationship between tumors and genes was evaluated with three measures: the Phi correlation coefficient, pointwise mutual information, and cosine similarity. Based on the strength of association between tumors and genes, potential gene-disease associations were identified and new tumor markers were predicted.

Finally, in exploring information integration methods, the tumor-gene pairs obtained by text mining were integrated with public TCGA data, including gene expression, methylation levels, and patient clinical data for different cancers, to build a literature mining application system. The system provides a useful reference for clinical aspects of digestive tract tumors such as diagnosis, treatment, and prognosis, and offers information support for more accurate personalized medicine.

The results show that the literature mining and application system designed and built in this project has both research value and practical value. It can summarize and analyze large volumes of literature and present the hot research topics and potential associations between pieces of knowledge in a field. These functions can play an important role in literature information extraction, information integration, and related areas.

Keywords: literature mining, association analysis, concept association network

Tongji University Master of Philosophy Abstract

ABSTRACT

With the rapid development of the biomedical field, research results published as literature are increasing explosively. PubMed, a biomedical literature database developed by the U.S. National Center for Biotechnology Information (NCBI), has been growing by 300,000 to 350,000 records each year. Given this growth, how to obtain the targeted articles and how to relate the concepts of interest are two major problems that researchers currently face. Reading the literature manually in the traditional way is obviously time-consuming and laborious. Therefore, more efficient methods are needed to enable researchers to obtain target literature information and to mine potential relationships from the literature systematically.

This research explores four literature mining methods (Entity Recognition, Information Extraction, Text Data Mining, and Information Integration) and establishes a literature mining application system based on a concept associated network. For Entity Recognition, the recognition results of the MetaMap software are integrated. The concepts extracted from the CRISPR/Cas9 technical literature are classified into different hierarchical categories, and the precision of the concepts extracted by MetaMap is evaluated for each category. During this evaluation, erroneous MetaMap extraction results are screened out to improve precision, and the feasibility of MetaMap-based concept extraction is tested in practice.

Building on Entity Recognition with MetaMap, the research adopts Natural Language Processing (NLP) and a co-occurrence strategy in exploring Information Extraction. First, SemRep, a natural language processing tool, is integrated to extract concept relationships from the literature, and a semantic network is established for the liver cancer literature. In addition, the extracted semantic relationships between genes and diseases in the liver cancer literature are further screened. The relationships extracted from the literature are then compared with manually annotated disease-gene pairs. It is found that the advantage of the semantic network established by the NLP method is that it extracts concept relationships from the literature more accurately, in other words, with high precision. The disadvantage is that the recall is not high. It is also difficult to extract a variety of relationship types and to mine potential associations between genes and diseases.

Therefore, when exploring Text Mining methods, this research adopts a co-occurrence strategy to establish the relationship [...]
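
The abstract names three measures for scoring tumor-gene association on the co-word network (the Phi correlation coefficient, pointwise mutual information, and cosine similarity), but this excerpt does not give their formulas. The sketch below is only an illustration, assuming the standard document-count definitions of each measure. The function names and the toy corpus of placeholder concepts (tumor_A, gene_1, and so on) are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter
from itertools import combinations


def count_cooccurrence(docs):
    """Count, per concept and per concept pair, the documents they appear in.

    Pairs are keyed as alphabetically sorted tuples.
    """
    single = Counter()
    pair = Counter()
    for concepts in docs:
        unique = sorted(set(concepts))
        single.update(unique)
        pair.update(combinations(unique, 2))
    return single, pair


def association_scores(n_docs, n_x, n_y, n_xy):
    """Return (phi, pmi, cosine) for two concepts from document counts.

    Assumes n_x >= 1 and n_y >= 1 (both concepts occur somewhere).
    """
    # Phi on the 2x2 presence/absence table. It is undefined when a
    # concept occurs in every document, so 0.0 is returned in that
    # case (a convention chosen for this sketch).
    denom = math.sqrt(n_x * n_y * (n_docs - n_x) * (n_docs - n_y))
    phi = (n_docs * n_xy - n_x * n_y) / denom if denom else 0.0
    # PMI = log2( P(x, y) / (P(x) * P(y)) ); -inf if the pair never co-occurs.
    pmi = math.log2(n_docs * n_xy / (n_x * n_y)) if n_xy else float("-inf")
    # Cosine similarity of the two binary document-occurrence vectors.
    cosine = n_xy / math.sqrt(n_x * n_y)
    return phi, pmi, cosine


# Toy corpus: each document is the set of concepts extracted from it.
# All names and counts are placeholders, not data from the thesis.
docs = [
    {"tumor_A", "gene_1"},
    {"tumor_A", "gene_1", "gene_2"},
    {"tumor_B", "gene_1"},
    {"tumor_A", "gene_2"},
]

single, pair = count_cooccurrence(docs)
key = tuple(sorted(("tumor_A", "gene_2")))
phi, pmi, cosine = association_scores(
    len(docs), single[key[0]], single[key[1]], pair[key]
)
print(f"{key}: phi={phi:.3f} pmi={pmi:.3f} cosine={cosine:.3f}")
```

Ranking all tumor-gene pairs by any one of these scores gives one possible view of association strength. The three measures weight rare concepts differently, which may be one reason to evaluate them side by side.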

