Impact of HIT Design on Crowdsourcing Relevance

Gabriella Kazai (1), Jaap Kamps (2), Marijn Koolen (2), Natasa Milic-Frayling (1)
(1) Microsoft Research, Cambridge, UK
(2) University of Amsterdam, The Netherlands

* This is an extended abstract of Kazai et al. [1]. Copyright is held by the author/owner(s). DIR-2012, February 23-24, 2012, Gent, Belgium. Copyright 2012 by the author(s).

ABSTRACT
In this paper we investigate the design and implementation of effective crowdsourcing tasks in the context of book search evaluation. We observe the impact of aspects of the Human Intelligence Task (HIT) design on the quality of relevance labels provided by the crowd. We assess the output in terms of label agreement with a gold standard data set and observe the effect of the crowdsourced relevance judgments on the resulting system rankings. This enables us to observe the effect of crowdsourcing on the entire IR evaluation process. Using the test set and experimental runs from the INEX 2010 Book Track, we find that varying the HIT design and the pooling and document ordering strategies leads to considerable differences in agreement with the gold set labels. We then observe the impact of the crowdsourced relevance label sets on the relative system rankings using four IR performance metrics. System rankings based on MAP and Bpref remain less affected by different label sets, while Precision@10 and nDCG@10 lead to dramatically different system rankings, especially for labels acquired from HITs with weaker quality controls. Overall, we find that crowdsourcing can be an effective tool for the evaluation of IR systems, provided that care is taken when designing the HITs.

1. INTRODUCTION
The evaluation and tuning of Information Retrieval (IR) systems based on the Cranfield paradigm requires purpose-built test collections, at the heart of which lie the human relevance judgments. With the ever increasing size and diversity of both the document collections and the query sets, gathering relevance labels by editorial judges has become a challenge. Recently, crowdsourcing has emerged as a feasible approach to gathering relevance data. However, the use of crowdsourcing presents a radical departure from the controlled conditions in which editorial judgments are collected. In this paper, we explore the effectiveness of various HIT designs as a means of controlling the crowd workers' engagement and, consequently, the quality of the resulting relevance labels and the reliability of the IR evaluation in terms of the relative system rankings.

We focus our investigation of HIT designs on three aspects: 1) quality control elements, 2) document pooling and sampling for relevance judgments by the crowd, and 3) document ordering within a HIT for presentation to the workers. Based on the analysis of the collected data, we provide insights on 1) how design decisions influence both the raw label quality, i.e., agreement with the gold standard (GS) obtained from traditional editorial judges, and 2) the usefulness of crowdsourced relevance labels in IR evaluation, i.e., their impact on the system rankings.

Figure 1: Part of a HIT showing the question series to solicit relevance labels for book pages from workers on Amazon Mechanical Turk: Full design.

2. EXPERIMENTAL SETUP
Data. We use the books, search topics, official runs, and relevance judgments provided by the INEX 2010 Book Track. The corpus comprises 50,239 out-of-copyright books, containing over 17 million pages and amounting to 400GB. There are 15 Best Books runs (ad hoc retrieval of whole books) and 10 Prove It! runs (return pages that confirm or refute a claim).

HIT designs used two different sets of quality control mechanisms. Full design (FullD), see Figure 1, controls all the stages of the task and explicitly pre-qualifies workers, restricting participation to those who completed over 100 HITs at a 95+% approval rate. It includes trap questions, qualification questions, a captcha, and dependencies between the questions. Simple design (SimpleD) does not impose restrictions on the workers who can participate. No qualifying test is included to check if workers are familiar with the claim. No warning is displayed to workers about the expected quality of their labels. Finally, no captcha is used in this design, also simplifying the structure of the flow of questions.
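As an illustration of the pre-qualification and trap-question filtering that FullD relies on, the following minimal sketch filters collected assignments by the thresholds stated above (over 100 completed HITs, 95+% approval). The Assignment structure and the all-traps-correct rule are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Assignment:
    """One completed HIT assignment (hypothetical structure)."""
    worker_id: str
    approved_hits: int          # worker's lifetime approved HITs
    approval_rate: float        # lifetime approval rate, in percent
    trap_answers_correct: int   # trap questions answered correctly in this HIT
    trap_answers_total: int

def passes_fulld_controls(a: Assignment) -> bool:
    """Mirror FullD's stated pre-qualification (>100 completed HITs at a 95+%
    approval rate) plus a hypothetical all-traps-correct requirement."""
    pre_qualified = a.approved_hits > 100 and a.approval_rate >= 95.0
    traps_ok = a.trap_answers_correct == a.trap_answers_total
    return pre_qualified and traps_ok

def filter_assignments(assignments: List[Assignment]) -> List[Assignment]:
    """Keep only assignments that would survive the FullD quality controls."""
    return [a for a in assignments if passes_fulld_controls(a)]
```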

Figure 2: Distribution of workers over agreement as histogram and probability density function. (a) Full Design; (b) Simple Design. Axes in both panels: fraction agreement on relevance (x) vs. density (y).
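The quantity summarized in Figure 2 is, per worker, the fraction of that worker's judgments that agree with the gold standard. A minimal sketch of computing it, assuming binary labels and a simple (worker, page, label) tuple layout that is illustrative rather than the Track's actual data format:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def worker_agreement(
    crowd_labels: List[Tuple[str, str, int]],   # (worker_id, page_id, label), label in {0, 1}
    gold_labels: Dict[str, int],                # page_id -> gold-standard label
) -> Dict[str, float]:
    """Fraction of each worker's judgments that agree with the gold standard,
    counting only pages that actually have a gold label."""
    agree: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for worker_id, page_id, label in crowd_labels:
        if page_id not in gold_labels:
            continue
        total[worker_id] += 1
        if label == gold_labels[page_id]:
            agree[worker_id] += 1
    return {w: agree[w] / total[w] for w in total}

# The per-worker values returned here are what a histogram / density plot
# like Figure 2 would be drawn over, separately for FullD and SimpleD.
```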

Table 1: System rank correlation between the different designs

Design    MAP    Bpref   P@10   nDCG@10
FullD     0.76   0.45    0.85   0.73
SimpleD   0.96   0.87    0.34   0.02
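Table 1 can be read as a rank correlation between the system orderings produced under different relevance label sets, computed per metric. The excerpt names neither the correlation coefficient nor the reference ranking; the sketch below assumes Kendall's tau (via SciPy) and a comparison against a gold-standard-based ordering, with P@10 as an example of a shallow metric that is sensitive to label differences.

```python
from typing import Dict, List
from scipy.stats import kendalltau

def precision_at_10(ranked_pages: List[str], qrels: Dict[str, int]) -> float:
    """P@10 for one topic: fraction of the top 10 retrieved pages labelled relevant."""
    top = ranked_pages[:10]
    return sum(1 for page in top if qrels.get(page, 0) > 0) / 10.0

def rank_correlation(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Kendall's tau between two system rankings, where each dict maps a run id to
    its effectiveness score (e.g., MAP or P@10) under one relevance label set."""
    runs = sorted(set(scores_a) & set(scores_b))
    tau, _ = kendalltau([scores_a[r] for r in runs], [scores_b[r] for r in runs])
    return tau
```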

Pooling strategies. We used three interleaved pools: 1) Top-n pool, based on the submissions; 2) Rank-boosted pool, re-ranking the submissions based on the popularity of books; 3) Answer-boosted pool, re-ranking the submissions while insisting on keywords of the topic being present on the page.
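A minimal sketch of top-n pooling over the submitted runs and of interleaving several pools into a single judging order. Round-robin interleaving and the function names are assumptions for illustration; the rank-boosted and answer-boosted pools are treated here as pre-computed orderings.

```python
from typing import Dict, List

def top_n_pool(runs: Dict[str, List[str]], n: int) -> List[str]:
    """Union of the top-n pages from each submitted run, taken depth by depth,
    without duplicates (standard top-n pooling)."""
    pool, seen = [], set()
    for depth in range(n):
        for run_id in sorted(runs):
            ranking = runs[run_id]
            if depth < len(ranking) and ranking[depth] not in seen:
                seen.add(ranking[depth])
                pool.append(ranking[depth])
    return pool

def interleave_pools(pools: List[List[str]]) -> List[str]:
    """Round-robin interleaving of several pools (e.g., top-n, rank-boosted,
    answer-boosted), skipping pages already taken from an earlier pool."""
    merged, seen = [], set()
    for depth in range(max(len(p) for p in pools)):
        for pool in pools:
            if depth < len(pool) and pool[depth] not in seen:
                seen.add(pool[depth])
                merged.append(pool[depth])
    return merged
```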

Page ordering. We used