Crowdsourcing a Wikipedia Vandalism Corpus

Martin Potthast
Bauhaus-Universität Weimar
99421 Weimar, Germany
martin.potthast@uni-weimar.de

ABSTRACT

We report on the construction of the PAN Wikipedia vandalism corpus, PAN-WVC-10, using Amazon's Mechanical Turk. The corpus compiles 32452 edits on 28468 Wikipedia articles, among which 2391 vandalism edits have been identified. 753 human annotators cast a total of 193022 votes on the edits, so that each edit was reviewed by at least 3 annotators, whereas the achieved level of agreement was analyzed in order to label an edit as “regular” or “vandalism.” The corpus is available free of charge.¹

Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software - Performance Evaluation
General Terms: Experimentation
Keywords: Wikipedia, Vandalism Detection, Evaluation, Corpus

1. INTRODUCTION

Wikipedia is an encyclopedia written by the crowd.
The key to Wikipedia's success is a collaborative writing process, where everybody can edit every article. Ideally, the reader of an article also revises it to the best of her abilities, e.g., by correcting errors, by improving the writing style, by adding missing information, or by removing redundancy.
In this way, Wikipedia's articles get continuously improved and updated. This “freedom of editing” gave the lie to those who suggested that the resulting articles would be characterized by poor quality and instability. Wikipedia thrives. There is no free lunch, however, and Wikipedia faces problems that limit its growth, such as vandalism, edit wars, and lobbyism. Our concern is the automatic detection of vandalism in Wikipedia, i.e., the detection of edits that were made with bad intentions. We contribute to this research field by developing a large corpus of human-annotated edits, which is a prerequisite for the meaningful evaluation of vandalism detection algorithms. In particular, we report on our efforts to use Amazon's Mechanical Turk as a means to scale the corpus to the necessary order of magnitude without compromising its quality.

¹ Download the corpus from http:/
Copyright is held by the author/owner(s). SIGIR'10, July 19-23, 2010, Geneva, Switzerland. ACM 978-1-60558-896-4/10/07.

Related Work. Although vandalism has been observed in Wikipedia right from the start, and although it is often deemed one of Wikipedia's biggest problems, research has addressed automatic vandalism detection only recently, for the first time in [3, 5, 7]. Vandalized articles often get restored rather quickly by other editors, but still, the authors of [6] find that the number of times vandalized articles get viewed amounts to hundreds of millions, and that the probability of encountering vandalism grew exponentially between 2003 and 2006. In reaction to this development, the Wikipedia community has developed a number of rule-based robots that are capable of restoring the most obvious cases of vandalism, or that aid editors in doing so [2]. However, the performance of the robots is surpassed, for instance, by an approach based on machine learning [5]. Other reactions include the temporary suspension of the freedom of editing for articles that are often vandalized, which threatens the very idea of Wikipedia.

The first vandalism corpus was the Webis-WVC-07, which consists
of 940 human-annotated edits, of which 301 are vandalism [4]. The PAN-WVC-10 is more than an order of magnitude larger (32452 versus 940 edits) and has been annotated by many different people; it thus forms a more representative sample of vandalism and allows for better estimates of whether a vandalism retrieval model will actually work in practice. In this respect, the Mechanical Turk provides an exciting new way to scale up corpus construction, which has also been applied successfully, e.g., to recreate TREC assessments [1].

2. CORPUS DESIGN

Corpus Layout. An edit marks the transition from one article revision to another. On Wikipedia, each revision of every article is accessible by means of a permanent identifier, so that an edit is described uniquely by a pair of revision IDs referencing the old article revision and the new revision.² Basically, our corpus is a list of revision ID pairs along with labels stating whether or not the respective edit is vandalism. Moreover, for each edit, meta information is given as well as the plain texts of both the old and the new article revision.

Corpus Acquisition. Our sample of edits is drawn from the revision histories of Wikipedia articles by means of probability-proportional-to-size sampling,
where, in our case, the “size” of an article is the average number of times it gets edited in a given time frame. We hypothesize that the average edit rate of an article correlates with the number of times it gets viewed. In that case, our edit sample closely resembles the distribution of article importance at the time of sampling, which presumably also influences the articles chosen by vandals. By contrast, the edits of the Webis-WVC-07 were chosen in search for vandalism from a



