3 Typical Work on Automatic Relation Extraction

Format: PPT · 75 slides · 1.52 MB · uploaded 2024-09-03

3 Typical Work on Automatic Relation Extraction
Three influential approaches to automatic relation extraction
Wu Wenjuan, June 4, 2009

Outline
- DIPRE, 1998
- KnowItAll, 2005
- Open IE, 2007

1 DIPRE: Dual Iterative Pattern Relation Expansion

Sergey Brin, "Extracting Patterns and Relations from the World Wide Web", in: Proceedings of the International Workshop on the Web and Databases, 1998.

- The first work to discover patterns and relations between data entities by iteration; it successfully discovered (author, title) pairs for books.
- Input: a seed set of 5 (author, title) book pairs.
- Output: automatically expanded to roughly 15,000 books, some of which even Amazon, the largest online bookstore, did not carry.

1.1 Idea

Patterns and tuples are dual: patterns are discovered from tuples, and tuples are extracted by patterns, so the two can bootstrap each other.

1.2 Algorithm

- Start from R, the current tuple set.
- FindOccurrences(R, D): scan the corpus D for occurrences of the tuples. Each occurrence is a 7-tuple (author, title, order, url, prefix, middle, suffix).
- Generate and filter patterns: group the occurrences by (order, middle); from each group O1, O2, ..., Ok, GenOnePattern produces a candidate pattern p, a 5-tuple (order, urlprefix, prefix, middle, suffix). If p is specific enough, output it; otherwise discard it.
- Search: match the patterns against the corpus to extract new tuples. A page matches p if its URL matches urlprefix* and its text matches *prefix, author, middle, title, suffix*.
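The dual iteration above can be sketched in Python. The toy corpus, the fixed-width prefix/suffix windows, and the minimum-support specificity test are illustrative assumptions; Brin's system scanned millions of pages and built patterns from longest common contexts rather than the crude grouping shown here.

```python
# Sketch of DIPRE's dual iteration over a toy in-memory "corpus".
# find_occurrences, the specificity test, and the pattern
# representation are all simplified stand-ins.
import re

corpus = [
    ("http://books.example/a", "a classic: Isaac Asimov, author of The Robots of Dawn, wrote"),
    ("http://books.example/b", "a classic: Frank Herbert, author of Dune, wrote"),
    ("http://books.example/c", "a classic: Arthur C. Clarke, author of Rendezvous with Rama, wrote"),
]

def find_occurrences(pairs, corpus):
    """For each seed (author, title), record (url, prefix, middle, suffix)."""
    occs = []
    for url, text in corpus:
        for author, title in pairs:
            a, t = text.find(author), text.find(title)
            if a != -1 and t != -1 and a < t:
                occs.append((url, text[:a][-12:],
                             text[a + len(author):t],
                             text[t + len(title):][:8]))
    return occs

def generate_patterns(occs):
    """Group occurrences by middle context; keep groups seen on 2+ URLs."""
    by_middle = {}
    for url, prefix, middle, suffix in occs:
        by_middle.setdefault(middle, []).append((url, prefix, suffix))
    patterns = []
    for middle, group in by_middle.items():
        if len({u for u, _, _ in group}) >= 2:   # crude specificity test
            _, prefix, suffix = group[0]
            patterns.append((prefix, middle, suffix))
    return patterns

def match(patterns, corpus):
    """Extract new (author, title) pairs wherever a pattern matches."""
    pairs = set()
    for prefix, middle, suffix in patterns:
        rx = re.compile(re.escape(prefix) + r"(.+?)" + re.escape(middle)
                        + r"(.+?)" + re.escape(suffix))
        for _, text in corpus:
            for author, title in rx.findall(text):
                pairs.add((author.strip(), title.strip()))
    return pairs

seeds = [("Isaac Asimov", "The Robots of Dawn"), ("Frank Herbert", "Dune")]
patterns = generate_patterns(find_occurrences(seeds, corpus))
print(match(patterns, corpus))   # now also finds the Clarke pair
```

In the full system the new pairs are fed back into the next round of FindOccurrences, which is what lets 5 seeds grow to 15,000 books.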

1.3 Experiments

Corpus: a repository of 24 million web pages, 147 GB.

Starting from the 5-pair initial sample, the first iteration produced 3 patterns and 4,047 new (author, title) pairs:

  iteration   corpus scanned   occurrences   patterns   (author, title) pairs
  1st         24 million       199           3          4,047
  2nd         5 million        3,972         105        9,369
  3rd         156,000          9,938         346        15,257

1.4 Conclusion

- DIPRE is the earliest work on semi-supervised relation learning.
- It exploits the duality between relations and patterns: on a Web-scale corpus, starting from a handful of seed samples, it iteratively extracts new patterns and new instances.

KnowItAll
Oren Etzioni et al., University of Washington

"Unsupervised Named-Entity Extraction from the Web: An Experimental Study", AAAI 2005.

Introduction
- Prior work: HMMs, CRFs; small corpora; required seed data.
- KnowItAll: an unsupervised, domain-independent system that extracts information from the Web.
- Key challenges:
  - Ensuring precision: a novel generate-and-test architecture.
  - Improving recall: Pattern Learning (PL), Subclass Extraction (SE), List Extraction (LE).

1 Flowchart of the main components in KnowItAll

For every predicate, the system:
- creates extraction rules and discriminators, e.g. the rule "cities such as" NPList;
- trains the discriminators.

Information focus: the only domain-specific input is a set of predicates specifying the domain of interest.

Extraction rules
- Generic extraction templates, combined with a predicate's label, yield domain-specific extraction rules.
- For Class1 = city, the rules are "cities such as" NPList and "towns such as" NPList.
- The keywords "cities such as" and "towns such as" are submitted to a search engine.

Discriminators
- A discriminator validates whether an extracted instance is correct, using pointwise mutual information (PMI) estimated from search-engine hit counts.
- Discriminators are trained by bootstrapping. The result of training is a set of discriminators, e.g.:
  - Discriminator: "<X> is a city"
  - Learned threshold T = 0.000016
  - Conditional probabilities: P(PMI > T | class) = 0.83, P(PMI > T | not class) = 0.08

An example
- Predicate: city.
- Bootstrapping: generate extraction rules and discriminators; train all discriminators and select the 5 best.
- Main cycle, Extract: suppose the query is "and other cities", issued for a rule with extraction pattern NP "and other cities". Two instances are extracted: Fes and East Coast.
- Main cycle, Assess: to compute the probability of City(Fes), the system sends six queries: "Fes" has 446,000 hits; "Fes is a city" has 14 hits; "cities Fes" has 201 hits; "cities such as Fes" has 10 hits; "cities including Fes" has 4 hits; "Fes and other towns" has 0 hits. For City(East Coast), PMI falls below threshold for all discriminators. Combining the discriminator probabilities gives a final probability of 0.99815 for Fes and 0.00027 for East Coast.

1.2 Experiments
- Noise tolerance.
- Finding negative training seeds for the assessor.
- Search cutoff metrics:
  - Signal-to-Noise ratio (STN): the ratio of positive to negative extractions.
  - Query Yield Ratio (QYR): the amount of new information extracted per n pages retrieved.

2 Improving recall
- Pattern Learning (PL): learns both extraction rules and the validation patterns used to assess instance accuracy.
- Subclass Extraction (SE): automatically identifies subclasses to aid extraction. For example, to extract instances of Scientist, first find its subclasses (physicist, geologist, ...), then extract instances of those subclasses.
- List Extraction (LE): learns a "wrapper" for each list, and uses the wrapper to extract list elements.
- The instances extracted with the generic templates serve as initial seeds for all three methods, so none of them needs hand-labeled training data.

2.1 Pattern Learning (PL)
- For a specific domain, the generic templates are usually not the most effective patterns; learned patterns such as "the film starring" or "headquartered in" can be far more productive.
- The PL algorithm estimates each candidate pattern's recall and precision.
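A minimal sketch of this idea, reusing the "headquartered in" pattern mentioned above: collect the contexts around seed instances and rank candidate patterns by how many distinct seeds they cover. The toy corpus, the seed cities, the window size, and the coverage threshold are illustrative assumptions, not KnowItAll's actual scoring.

```python
# Sketch of KnowItAll-style pattern learning: find contexts around seed
# instances and keep the contexts that cover several distinct seeds.
from collections import defaultdict

corpus = [
    "The firm is headquartered in Seattle and employs 50 people.",
    "Its rival is headquartered in Chicago according to filings.",
    "We visited lovely Seattle last spring.",
]
seeds = ["Seattle", "Chicago"]

def candidate_patterns(corpus, seeds, window=3):
    """Map each context (the `window` words before a seed) to the seeds it covers."""
    coverage = defaultdict(set)
    for sentence in corpus:
        words = sentence.rstrip(".").split()
        for i, w in enumerate(words):
            if w in seeds and i >= window:
                coverage[" ".join(words[i - window:i])].add(w)
    return coverage

# Keep patterns covering at least two distinct seeds: a crude stand-in
# for PL's recall/precision filtering.
best = [p for p, covered in candidate_patterns(corpus, seeds).items()
        if len(covered) >= 2]
print(best)  # ['is headquartered in']
```

In KnowItAll the surviving patterns are turned into new extraction rules and validation patterns for the class.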

- PL estimates recall and precision efficiently by taking the positive examples of one class to be negative examples for all other classes.
- Pipeline: I, a set of seed instances; collect the contexts of each instance; search; select the best patterns; filter by recall and precision; keep the 3 most productive rules.

2.2 Subclass Extraction (SE)

Basic subclass extraction (SEbase):
- Extracting candidate subclasses: the generic extraction rules extract subclasses alongside instances. To tell them apart: instances are proper nouns, capitalized ("Scientists such as Einstein, Newton, ..."); subclasses are common nouns ("Scientists such as physical scientist, biologist, ...").
- Assessing candidate subclasses, by a combination of methods:
  - Does the subclass name contain the superclass name? E.g. "microbiologist" is a subclass of "biologist".
  - Is there a parent-child link in WordNet?
  - The SEbase assessor: a bootstrapped training method.
- Rules for subclass extraction.

Improving subclass-extraction recall:
- Apply the last two rules in Table 2 to the extracted candidate subclasses to extract their siblings, yielding more candidates.
- Two kinds of subclasses:
  - Context-independent: Person, e.g. Priest.
  - Context-dependent: Person, e.g. Pharmacist.
- Two assessing methods:
  - SEself: train a classifier by self-training.
  - SEiter: iteratively compute a confidence for each extraction rule.
- Experimental results are reported for both context-independent and context-dependent subclasses.

2.3 List Extraction (LE)

- Unlike the two methods above, which work on unstructured text, LE exploits the structure of web pages.
- Many lists on web pages are generated from databases and therefore have clear structural regularities.
- Basic method: locate the lists in a page; learn a wrapper that automatically extracts every item in each list.
- Learning a wrapper, by example: W3 is the best wrapper because (1) its HTML block is as small as possible and (2) it matches as many keywords as possible.
- Experiments with LE; discussion:
  - LE extracts a large amount of information with relatively few queries.
  - Its precision is not high, but it narrows the candidate set, greatly reducing the Assessor's workload.
  - It finds information that standard IE methods miss, e.g. rare cities in long selection lists in HTML documents.

2.4 Comparing PL, SE and LE

- Recall (on city, film, scientist): for extracting instances of generic concepts, SE is the most effective.
- Extraction rate: extraction rate = num(unique extractions) / num(queries).
- There is a trade-off between recall and precision.

3 Conclusion

KnowItAll performs unsupervised information extraction from the Web:
- its input is a set of predicate names, with no hand-labeled training examples of any kind;

- precision: it utilizes a novel generate-and-test architecture (Extractor, Assessor);
- recall: Pattern Learning, Subclass Extraction, List Extraction.

Open IE
Michele Banko, University of Washington
"Open Information Extraction from the Web", IJCAI 2007.

1 Introduction

Traditional IE:
- small, homogeneous corpora, which let it rely heavily on NLP techniques such as named-entity recognition;
- a fixed set of relation types.

1.1 New challenges

Automation
- At first, systems took hand-labeled instances, document fragments, and automatically learned domain-specific extraction templates as input.
- Later, only a few seed instances or hand-written extraction templates per target relation were needed (DIPRE, SNOWBALL, Web-based question-answering systems).
- But producing even this data still requires expertise; training data must be supplied for every target relation; and the relations to extract must be fixed in advance.

Corpus heterogeneity

- Prior work extracted only a few specific relations from small, domain-specific corpora:
  - kernel-based methods [Bunescu and Mooney, 2005]
  - maximum-entropy models [Kambhatla, 2004]
  - graphical models [Rosario and Hearst, 2004; Culotta et al., 2006]
  - co-occurrence statistics [Lin and Pantel, 2001; Ciaramita et al., 2005]
- Almost all of this work uses NER, lexical analysis, and dependency parsing. These NLP components make far more errors on heterogeneous web text, and existing NER systems do not cope with the number and variety of named entities on the Web.

Efficiency
- KNOWITALL:
  - Automation: it labels its own training set using a small number of domain-independent extraction patterns.
  - Web heterogeneity: it uses a part-of-speech tagger instead of a parser, and needs no NER.
  - But it requires a huge number of search-engine queries and page downloads; experiments often take weeks.
  - It takes relation names as input, so every change of target relation requires a full re-run.

1.2 Contributions of the paper
- Open Information Extraction: possible relations are discovered automatically, without specifying them in advance, so a single scan over the corpus suffices.
- TEXTRUNNER: a complete implementation of Open IE.

- The paper also reports statistics over TEXTRUNNER's extraction results.

2 Open IE

Three modules:
- Self-Supervised Learner. Input: a small corpus; output: a classifier that judges whether a candidate relation is trustworthy.
- Single-Pass Extractor. Scans the entire corpus once, extracts candidate relation tuples, runs the classifier, and keeps the positives.
- Redundancy-Based Assessor.

2.1 Self-Supervised Learner

Automatically labels positive and negative training tuples (e_i, r_ij, e_j):
- parse several thousand sentences into dependency graphs;
- in each sentence, take noun phrases as the candidate entities e_i;
- for each pair (e_i, e_j), follow their connection in the dependency graph to find the word sequence r_ij expressing the relation between them;
- label a tuple positive if it satisfies predefined heuristics:
  - the connecting path between e_i and e_j is no longer than a given threshold;
  - the path stays within a single sentence;
  - neither e_i nor e_j consists solely of a pronoun.

The labeled tuples train a classifier; each tuple is mapped to a feature vector:
- the number of tokens in r_ij;
- the number of stopwords in r_ij;
- whether or not an entity e is a proper noun;
- the part-of-speech tag to the left of e_i;
- the part-of-speech tag to the right of e_j.
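The feature mapping above can be sketched as follows. The stopword list, the helper signature, and the precomputed POS tags are illustrative placeholders for TextRunner's lightweight NLP, not its actual implementation.

```python
# Sketch of mapping a candidate tuple (e_i, r_ij, e_j) to the feature
# vector described above. POS tags are supplied precomputed here;
# TextRunner obtains them from a part-of-speech tagger.

STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "was"}

def features(e_i, relation, e_j, pos_left_of_ei, pos_right_of_ej,
             ei_is_proper_noun, ej_is_proper_noun):
    tokens = relation.split()
    return {
        "rel_num_tokens": len(tokens),
        "rel_num_stopwords": sum(t.lower() in STOPWORDS for t in tokens),
        "ei_proper_noun": ei_is_proper_noun,
        "ej_proper_noun": ej_is_proper_noun,
        "pos_left_of_ei": pos_left_of_ei,
        "pos_right_of_ej": pos_right_of_ej,
    }

# Hypothetical tuple ("Einstein", "derived the theory of", "relativity").
fv = features("Einstein", "derived the theory of", "relativity",
              pos_left_of_ei="IN", pos_right_of_ej="IN",
              ei_is_proper_noun=True, ej_is_proper_noun=False)
print(fv["rel_num_tokens"], fv["rel_num_stopwords"])  # -> 4 2
```

Vectors like this form the training set for the Naive Bayes classifier; note that none of the features is relation-specific or lexical.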

The feature vectors form the training set for a Naive Bayes classifier; the classifier is not relation-specific and uses no lexical features.

2.2 Single-Pass Extractor
- Identifies noun phrases, then looks for a relation in the text connecting each pair of noun phrases, yielding candidate tuples.
- Uses only lightweight NLP techniques, which makes the method robust on heterogeneous web text.
- Heuristically removes prepositional phrases that over-modify entities, and other unnecessary modifiers.
- Runs the classifier and keeps only the tuples labeled "trustworthy".

2.3 Redundancy-Based Assessor
- Merges identical tuples, dropping unnecessary modifiers.
- For each tuple, counts the number of distinct sentences it occurs in, and uses that count to estimate the probability that the tuple is correct.
- This probabilistic method is shown to be far more accurate than alternatives based on noisy-or or pointwise mutual information.

2.4 Query Processing
- Queries are answered at interactive speeds.
- An inverted index is built over the tuples and their source text, with each relation assigned to one machine.
- Because the relation names are themselves extracted from web text, they make natural query keywords.
- The relation-centric index differs from the standard inverted index of today's search engines: it supports complex relational queries, including relationship queries, unnamed-item queries, and multiple-attribute queries.

2.5 Analysis
- Time complexity: Open IE is O(D) for extraction plus O(T log T) to sort, count, and assess the tuples; traditional IE is O(R * D) for R target relations.
- Speed: without dependency parsing or similar heavyweight NLP, Open IE processes a sentence in 0.036 CPU seconds, versus 3 CPU seconds for traditional IE.

3 Experimental Results

3.1 Comparison with traditional IE
- Setup: a 9 million web page corpus, 10 relations.
- On the same set of correct tuples, TEXTRUNNER has a lower error rate.
- TEXTRUNNER is somewhat slower (85 vs. 63 CPU hours), but it extracts many more relations at the same time.

3.2 Global statistics on the facts learned
- 11.3 million tuples containing 278,085 distinct relation strings.
- Filtering rules: a tuple is kept if
  - its probability is at least 0.8;
  - its relation is supported by at least 10 distinct sentences in the corpus;
  - its relation is not in the top 0.1% of relations by number of supporting sentences (those relations are so general as to be nearly vacuous, e.g. (NP1, has, NP2)).

Estimating the correctness of facts
- 400 tuples were hand-labeled as a sample and judged on three questions:
  - Well-formed? The relation in (FCI, specializes in, software development) is; the entities in (29, dropped, instruments) are not.
  - Concrete or abstract? Concrete facts, e.g. (Tesla, invented, coil transformer), serve IE and question answering; abstract ones, e.g. (Einstein, derived, theory), serve ontology learning.
  - True or false?
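Returning to the Redundancy-Based Assessor of Section 2.3, its merge-and-count step can be sketched as follows. The normalization rules and the example tuples are simplified illustrations; TextRunner's actual probability model over the resulting counts is different (and more accurate than a noisy-or score).

```python
# Sketch of the Redundancy-Based Assessor: normalize relation strings,
# merge duplicate tuples, and count the distinct supporting sentences.
from collections import defaultdict

LEADING_STOPWORDS = {"that", "which", "who"}

def normalize(relation: str) -> str:
    """Drop edge punctuation, leading stopwords, and fold auxiliary-verb
    variants, so near-duplicate relation strings can merge."""
    tokens = relation.strip(" ,.;").split()
    while tokens and tokens[0].lower() in LEADING_STOPWORDS:
        tokens.pop(0)
    if tokens and tokens[0].lower() in {"is", "are", "was", "were"}:
        tokens[0] = "be"
    return " ".join(tokens).lower()

def assess(extractions):
    """extractions: iterable of ((e1, rel, e2), sentence).
    Returns each merged tuple with its distinct-sentence support count."""
    support = defaultdict(set)
    for (e1, rel, e2), sentence in extractions:
        support[(e1.lower(), normalize(rel), e2.lower())].add(sentence)
    return {t: len(s) for t, s in support.items()}

counts = assess([
    (("Berlin", "is the capital of", "Germany"),
     "Berlin is the capital of Germany."),
    (("Berlin", ", which is the capital of", "Germany"),
     "Berlin, which is the capital of Germany, hosted the summit."),
])
print(counts)  # the two variants merge into one tuple with support 2
```

Support counts like these feed the probability estimate that decides which tuples survive the filters in Section 3.2.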

A tuple is judged true if it is consistent with the meaning of the sentence it was extracted from.

Estimating the number of distinct facts

Distinct relations:
- Merging: strip leading and trailing punctuation, auxiliary verbs, and leading stopwords, so that e.g. "are consistent with" and ", which is consistent with" merge; active and passive voice are also merged.
- Relation polysemy (e.g. "developed") means that, without domain-specific type checking, synonymous relations correspond to tuple sets that overlap but differ substantially.

Distinct facts:
- "Synonymy clusters" were built over the 11.3 million tuples, grouping (e1, r, e2) and (e1, q, e2) where r and q are synonymous; about 1/3 of the tuples belong to such clusters.
- Counting the distinct facts inside the clusters gives 2/3 + (1/3 * 3/4), or roughly 92%, of the tuples found by TEXTRUNNER expressing distinct assertions (likely an overestimate).

4 Conclusion
- Open IE: an unsupervised extraction paradigm; Web-scale; all relations; one-time relation discovery.
- TEXTRUNNER: a fully implemented Open IE system; it demonstrates the ability to extract massive amounts of high-quality information from a 9 million web page corpus, and is compared with KnowItAll.

SUMMARY
- DIPRE, 1998: the first work to discover patterns and relations between data entities by iteration; the earliest semi-supervised relation extraction work.
- KnowItAll, 2005: unsupervised and domain-independent; extracts information from the Web.
- Open IE, 2007: unsupervised and domain-independent; Web-scale; all relations; one-time relation discovery; higher precision than KnowItAll.

Thank you! Questions?
