Crowdsourcing a Wikipedia Vandalism Corpus


Martin Potthast
Bauhaus-Universität Weimar
99421 Weimar, Germany
martin.potthast@uni-weimar.de

ABSTRACT
We report on the construction of the PAN Wikipedia vandalism corpus, PAN-WVC-10, using Amazon's Mechanical Turk. The corpus compiles 32452 edits on 28468 Wikipedia articles, among which 2391 vandalism edits have been identified. 753 human annotators cast a total of 193022 votes on the edits, so that each edit was reviewed by at least 3 annotators, whereas the achieved level of agreement was analyzed in order to label an edit as “regular” or “vandalism.” The corpus is available free of charge.¹

Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software - Performance Evaluation
General Terms: Experimentation
Keywords: Wikipedia, Vandalism Detection, Evaluation, Corpus

¹ Download the corpus from http://www.webis.de/research/corpora

Copyright is held by the author/owner(s).
SIGIR'10, July 19-23, 2010, Geneva, Switzerland.
ACM 978-1-60558-896-4/10/07.

1. INTRODUCTION
Wikipedia is an encyclopedia written by the crowd. The key to Wikipedia's success is a collaborative writing process, where everybody can edit every article. Ideally, the reader of an article also revises it to the best of her abilities, e.g. by correcting errors, by improving the writing style, by adding missing information, or by removing redundancy. In this way Wikipedia's articles get continuously improved and updated. This “freedom of editing” gave the lie to those who suggested that the resulting articles would be characterized by poor quality and instability. Wikipedia thrives. There is no free lunch, however, and Wikipedia faces problems that limit its growth, such as vandalism, edit wars, and lobbyism. Our concern is the automatic detection of vandalism in Wikipedia, i.e., the detection of edits that were made with bad intentions. We contribute to this research field by developing a large corpus of human-annotated edits, which is a prerequisite for the meaningful evaluation of vandalism detection algorithms. In particular, we report on our efforts to use Amazon's Mechanical Turk as a possibility to drive the corpus size to the necessary order of magnitude without compromising the corpus quality.

Related Work. Although vandalism has been observed in Wikipedia right from the start, and, although vandalism is often deemed one of Wikipedia's biggest problems, research has addressed automatic vandalism detection only recently, for the first time in [3, 5, 7]. Vandalized articles often get restored rather quickly by other editors, but still, the authors of [6] find that the number of times vandalized articles get viewed amounts up to hundreds of millions, and that the probability of encountering vandalism grew exponentially between 2003 and 2006. In reaction to this development, the Wikipedia community has developed a number of rule-based robots that are capable of restoring the most obvious cases of vandalism, or that aid editors to do so [2]. However, the performance of the robots is surpassed, for instance, by an approach based on machine learning [5]. Other reactions include the temporary suspension of the freedom of editing for articles that are often vandalized, which threatens the very idea of Wikipedia.

The first vandalism corpus was the Webis-WVC-07, which consists of 940 human-annotated edits of which 301 are vandalism [4]. The PAN-WVC-10 is two orders of magnitude larger and has been annotated by many different people; it thus forms a more representative sample of vandalism and allows for better estimates of whether a vandalism retrieval model will actually work in practice. In this respect, the Mechanical Turk provides an exciting new way to scale up corpus construction, which has also been applied successfully, e.g., to recreate TREC assessments [1].

2. CORPUS DESIGN
Corpus Layout. An edit marks the transition from one article revision to another. On Wikipedia, each revision of every article is accessible by means of a permanent identifier, so that an edit is described uniquely by a pair of revision IDs referencing the old article revision and the new revision.² Basically, our corpus is a list of revision ID pairs along with labels whether or not the respective edit is vandalism. Moreover, for each edit meta information is given as well as the plain texts of both the old and the new article revision.

Corpus Acquisition. Our sample of edits is drawn from the revision histories of Wikipedia articles by means of probability proportional to size sampling, where in our case, the “size” of an article is the average number of times it gets edited in a given time frame. We hypothesize that the average edit ratio of an article correlates with the number of times it gets viewed. In that case, our edit sample resembles well the distribution of article importance at the time of sampling, which presumably also influences the articles chosen by vandals. By contrast, the edits of the Webis-WVC-07 were chosen in search for vandalism from a
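
The probability proportional to size sampling described under Corpus Acquisition can be illustrated with a minimal Python sketch. It is not the procedure used to build the PAN-WVC-10: the function name `pps_sample`, the dictionary of per-article edit rates, and the choice to draw with replacement are all assumptions made for illustration, since the text does not specify these details.

```python
import random


def pps_sample(sizes, k, seed=None):
    """Draw k article titles with probability proportional to size.

    `sizes` maps an article title to its "size", here taken to be the
    average number of edits in some time frame (hypothetical input).
    Assumption: articles are drawn with replacement.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    titles = list(sizes)
    weights = [sizes[title] for title in titles]
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("sizes must be non-negative with a positive total")
    rng = random.Random(seed)
    # random.choices selects each element with probability weight / total.
    return rng.choices(titles, weights=weights, k=k)


# Hypothetical edit rates: frequently edited articles are drawn more often.
edit_rates = {"Article A": 10.0, "Article B": 0.0, "Article C": 1.0}
sample = pps_sample(edit_rates, k=1000, seed=0)
```

An article with an edit rate of zero is never selected under this scheme, while an article edited ten times as often as another is expected to appear about ten times as often in the sample.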
