A Semi-Supervised Learning Method and Its Application to Named Entity Recognition


2018/11/16
Yanpeng Li (李彦鹏)
Information Retrieval Laboratory, Dalian University of Technology

Outline
- Biomedical named entity recognition
- Data sparseness
- Feature coupling generalization
- Experimental results
- Current & future work

Biomedical Named Entity Recognition (BioNER)
- Recognize named entities in biomedical texts, e.g., genes, proteins, and cells.
- An important preliminary step for advanced text mining tasks.
- For example: "The TCF-1 alpha binding site was also required for TCR alpha enhancer activity in transcriptionally active extracts from Jurkat but not HeLa cells, confirming that TCF-1 alpha is a T-cell-specific transcription factor."

Challenges of BioNER
- Huge vocabulary size, e.g., millions of gene/protein names.
- Long names, e.g., "ethanol repression autoregulation (ERA) / twelve-fold TA repeat (TAB) repressor element".
- Ambiguous definition of entity boundaries.
- The same term can refer to different types in different contexts.

Current state-of-the-art
Challenge evaluations:

In these challenges
- Dictionary look-up methods yield poor performance (50%-70% F-score):
  - Low coverage, e.g., long names, variants.
  - Large noise, e.g., common English terms, entities of other types.
- Machine learning methods show great success. The framework "Lexical features + regularized linear model" is applied by all the top-performing systems.

Lexical-level features
- Example: IL 2 gene
- Binary feature vector: 0 1 0 1 1 0 1 1
- Features: W=IL, W=2, bigram=IL 2, norm=ILgene, suffix=*ene

Data sparseness in lexical features
- Large out-of-vocabulary (OOV) rate
  - Terms not in the training corpus are not modeled well.
- Extremely low-frequency terms cannot provide sufficient information to train a good classifier.
- Regular expression features can alleviate the problem, but are far from enough:
  - Surface information is not always indicative.
  - Indicative patterns also lead to sparseness.

Overcome data sparseness
- Taxonomy-based methods
  - WordNet, UMLS
  - Depend on the quality of these resources
- Subspace-based methods
  - LSA, KPCA, sparse coding
  - Automatic methods
  - Huge space and time cost
- Corpus-based methods
  - PMI-IR, Web search, ESA
  - Good scalability and easy to implement

Web-based methods for NER
- Finkel et al. (2005) used co-occurrence of a named entity and indicative context to validate the gene name: 0.17 F-score improvement.
- Etzioni et al. (2005) used PMI of the current entity and discriminator phrases as the input of a naive Bayes classifier.
- Disadvantages:
  - Not general enough to be extended to a broader area of NLP and machine learning.
  - No systematic comparison with elaborately designed lexical features.
  - Large room for further improvement.

Our method
- We analyze the nature of these methods and give a general framework.
- Generate new features from the relatedness of two special components.
- We try to find answers to:
  - What are these two components?
  - How to find them?
  - How to convert the relatedness measures into new features?

Key definitions
- Example-distinguishing features (EDFs)
  - Features with high ability to distinguish the current examples from others.
  - E.g., "bigram=IL 2".
  - Tend to be sparse, and sometimes lead to data sparseness.
  - EDF roots: the "higher-level" concepts, e.g., "bigram".
- Class-distinguishing features (CDFs)
  - Strongly indicative of the target classes.
  - E.g., patterns: "X gene", "X proteins".
  - Tend to be dense, and reflect the characteristics of a large number of examples.
- Feature coupling degree (FCD)
  - The relatedness measure of an EDF-CDF pair in universal data.

The algorithm

An example of FCG
- Classify the name candidate: prnp gene
- EDF: leftmost 1-gram = prnp
- CDF: expression of X
- EDF root: leftmost 1-gram
- FCD type: PMI
- The FCD feature is indexed by the combination (leftmost 1-gram, expression of X, PMI).
- Feature value: FCD(U, leftmost 1-gram = prnp, expression of X) = PMI(leftmost 1-gram = prnp, expression of X)

An example of FCG (2)

The entity classification task
- Construct a gene dictionary by:
  - Combining two resources: BioThesaurus 2.0 and the ABGene lexicon.
  - Generating variants by simple rules.
- The task is to determine whether a dictionary entry is a gene name in most cases.
- The training set is derived from the BioCreative 2 GM task: 12567 positive and 36862 negative examples.

FCG for gene entity classification
- EDFs
  - EDF I: normalized names
  - EDF II: boundary n-grams
- CDFs
  - CDF I: indicative context patterns
  - CDF II: outputs of a local context predictor
- FCD measures:

Model selection
- The density of FCD features is around 25%.
- Interestingly, we found that this feature space was somewhat like that in the task of image recognition.
- Inspired by the prevailing techniques in such tasks, we first used singular value decomposition (SVD) to get a subspace of the original features and then used an SVM with a radial basis function (RBF) kernel to classify the examples.

Lexical features - the baseline
- Bag-of-n-grams (n = 1, 2, 3)
- Boundary n-grams (n = 1, 2, 3)
  - left-2-gram = IL 2 for IL 2 gene
- Sliding windows: character-level sliding windows with the size of 5.
- Bounda
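
The baseline lexical features listed above (bag-of-n-grams, boundary n-grams, and character-level sliding windows of size 5) can be sketched roughly as follows. This is only an illustration under assumptions: the function names, the feature-string format, and the "right-n-gram" feature are hypothetical and not taken from the slides, which only show "left-2-gram = IL 2".

```python
def word_ngrams(tokens, n):
    """Return all contiguous word n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def lexical_features(name, max_n=3, window=5):
    """Build a set of sparse lexical feature strings for a name candidate.

    Assumption: whitespace tokenization and the feature naming scheme are
    illustrative choices, not the authors' exact implementation.
    """
    tokens = name.split()
    feats = set()
    for n in range(1, max_n + 1):
        grams = word_ngrams(tokens, n)
        # Bag-of-n-grams: every n-gram inside the name.
        for gram in grams:
            feats.add(f"{n}-gram={gram}")
        # Boundary n-grams: the leftmost and rightmost n-gram.
        # "right-n-gram" is an assumed counterpart of the slide's left-2-gram.
        if grams:
            feats.add(f"left-{n}-gram={grams[0]}")
            feats.add(f"right-{n}-gram={grams[-1]}")
    # Character-level sliding windows of fixed size over the raw string.
    for i in range(max(len(name) - window + 1, 1)):
        feats.add(f"char-window={name[i:i + window]}")
    return feats


features = lexical_features("IL 2 gene")
```

For the slide's example "IL 2 gene", the resulting set contains "left-2-gram=IL 2". Each distinct string would become one binary dimension, which is why such features tend to be very sparse.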
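
The PMI-based FCD feature from the "An example of FCG" slide (EDF "leftmost 1-gram = prnp", CDF "expression of X") can be illustrated with a small sketch. Several things here are assumptions rather than details from the slides: co-occurrence is counted per sentence, the unlabeled "universal data" is a tiny hypothetical corpus, a zero count backs off to a PMI of 0, and the names `pmi`, `fcd_pmi`, and `CORPUS` are invented for illustration.

```python
import math
import re


def pmi(joint, count_x, count_y, total):
    """Pointwise mutual information: log(p(x, y) / (p(x) * p(y)))."""
    if joint == 0 or count_x == 0 or count_y == 0:
        return 0.0  # assumption: back off to 0 when the pair is unseen
    return math.log((joint * total) / (count_x * count_y))


def fcd_pmi(sentences, edf_token, cdf_pattern):
    """FCD of an EDF token and a CDF pattern such as 'expression of X'.

    Assumption: counts are sentence-level, and X matches a single word.
    """
    prefix, _, suffix = cdf_pattern.partition("X")
    cdf_re = re.compile(
        r"\b" + re.escape(prefix) + r"(\w+)" + re.escape(suffix) + r"\b"
    )
    total = len(sentences)
    n_edf = n_cdf = n_joint = 0
    for sentence in sentences:
        fillers = cdf_re.findall(sentence)      # words filling the X slot
        if edf_token in re.findall(r"\w+", sentence):
            n_edf += 1                          # EDF occurs in this sentence
        if fillers:
            n_cdf += 1                          # CDF pattern occurs
        if edf_token in fillers:
            n_joint += 1                        # pattern matched with X = EDF token
    return pmi(n_joint, n_edf, n_cdf, total)


# Hypothetical unlabeled corpus standing in for the universal data U.
CORPUS = [
    "expression of prnp was reduced in mutant mice",
    "we measured expression of prnp in brain tissue",
    "prnp encodes the prion protein",
    "expression of the reporter was stable",
    "the patients were treated for two weeks",
    "cells were cultured for two days",
]

value = fcd_pmi(CORPUS, "prnp", "expression of X")
```

On this toy corpus the value for "prnp" is positive, while a common word such as "the" gets a negative value. The resulting number is dense and real-valued, and a new name sharing the same EDF root can receive it even if the name never appeared in the labeled training set, which is the point of coupling sparse EDFs with dense CDFs.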
