Design and Implementation of Relevance Assessments Using Crowdsourcing

Omar Alonso (1) and Ricardo Baeza-Yates (2)
(1) Microsoft Corp., Mountain View, California, USA
(2) Yahoo! Research, Barcelona, Spain
rbaeza@acm.org

P. Clough et al. (Eds.): ECIR 2011, LNCS 6611, pp. 153-164, 2011. © Springer-Verlag Berlin Heidelberg 2011

Abstract. In recent years crowdsourcing has emerged as a viable platform for conducting relevance assessments. The main reason behind this trend is that it makes it possible to conduct experiments extremely fast, with good results and at low cost. However, as in any experiment, there are several details that can make an experiment work or fail. To gather useful results, user interface guidelines, inter-agreement metrics, and justification analysis are important aspects of a successful crowdsourcing experiment. In this work we explore the design and execution of relevance judgments using Amazon Mechanical Turk as the crowdsourcing platform, introducing a methodology for crowdsourcing relevance assessments and the results of a series of experiments using TREC 8 with a fixed budget. Our findings indicate that workers are as good as TREC experts, even providing detailed feedback for certain query-document pairs. We also explore the importance of document design and presentation when performing relevance assessment tasks. Finally, we show our methodology at work with several examples that are interesting in their own right.

1 Introduction

In the world of Web 2.0 and user-generated content, one important sub-class of peer collaborative production is the phenomenon known as crowdsourcing. In crowdsourcing, potentially large jobs are broken into many small tasks that are then outsourced directly to individual workers via public solicitation. One of the best examples is Wikipedia, where each entry, or part of an entry, could be considered as a task being solicited. As in the latter example, workers sometimes do it for free, motivated either because the work is fun or due to some form of social reward [1,12]. However, successful examples of volunteer crowdsourcing are difficult to replicate. As a result, crowdsourcing increasingly uses financial compensation, usually as micro-payments on the order of a few cents per task. This is the model of Amazon Mechanical Turk (AMT), where many tasks can be done quickly and cheaply.

AMT is currently used as a feasible alternative for conducting all kinds of relevance experiments in information retrieval and related areas. The lower cost of running experiments, in conjunction with the flexibility of the editorial approach at a larger scale, makes this approach very attractive for testing new ideas with a fast turnaround. In AMT, workers choose from a list of jobs on offer, where the reward offered per task and the number of tasks available for that request are indicated. Workers can click on a link to view a brief description or a preview of each task. The unit of work to be performed is called a HIT (Human Intelligence Task). Each HIT has an associated payment and an allotted completion time; workers can see sample HITs, along with the payment and time information, before choosing whether to work on them or not. After seeing the preview, workers can choose to accept the task; optionally, a qualification exam must be passed before the task is officially assigned to them. Tasks are very diverse in size and nature, requiring from seconds to minutes to complete. The typical compensation, on the other hand, ranges from one cent to less than a dollar per task and is usually correlated with the task complexity.
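
To make the workflow above concrete, the following is a minimal sketch of how a single relevance-judgment HIT with a payment, an allotted completion time, and a fixed number of assignments could be posted through the AMT API. It uses the boto3 MTurk client (which post-dates the paper), and the title, reward, durations, and question file are hypothetical placeholders rather than the settings used in the paper's experiments.

```python
# Hypothetical illustration of posting one relevance-judgment HIT via the
# AMT API (boto3 MTurk client); none of these values come from the paper.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

# AMT expects the task form as an XML "question" document (for example an
# ExternalQuestion or HTMLQuestion); here we assume it was prepared separately.
with open("relevance_question.xml") as f:
    question_xml = f.read()

response = mturk.create_hit(
    Title="Judge whether a document is relevant to a query",
    Description="Read the document and mark it relevant or non-relevant.",
    Keywords="relevance, search, judgment",
    Reward="0.02",                    # payment per assignment, in USD
    AssignmentDurationInSeconds=600,  # allotted completion time per worker
    LifetimeInSeconds=3 * 24 * 3600,  # how long the HIT stays listed
    MaxAssignments=5,                 # workers per query-document pair
    Question=question_xml,
    # An optional qualification exam could be attached here through the
    # QualificationRequirements parameter.
)
print(response["HIT"]["HITId"])
```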

However, what is not clear is how exactly to implement an experiment. First, given a certain budget, how do we spend it and how do we design the tasks? That is, in our case, how many people evaluate how many queries, looking at how many documents? Second, how should the information for each relevance assessment task be presented? What is the right interaction? How can we collect the right user feedback, considering that relevance is a personal, subjective decision?
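
As a rough illustration of the first question, a fixed budget directly constrains the trade-off between the number of queries, documents per query, and redundant workers per query-document pair. The sketch below is a hypothetical back-of-the-envelope check; the budget, reward, fee rate, and counts are made-up values, not the allocation used in the paper's TREC 8 experiments.

```python
# Hypothetical budget check for a crowdsourced relevance-assessment design;
# none of these figures come from the paper.
BUDGET_USD = 100.00       # total budget for the experiment
REWARD_PER_HIT = 0.02     # payment per judgment (one query-document pair)
FEE_RATE = 0.20           # assumed platform commission on top of each reward

QUERIES = 10              # topics to evaluate
DOCS_PER_QUERY = 50       # documents judged per topic
WORKERS_PER_PAIR = 5      # redundant judgments per query-document pair

judgments = QUERIES * DOCS_PER_QUERY * WORKERS_PER_PAIR
cost = judgments * REWARD_PER_HIT * (1 + FEE_RATE)

print(f"{judgments} judgments, estimated cost ${cost:.2f}")
if cost > BUDGET_USD:
    print("Over budget: reduce topics, documents per topic, or redundancy.")
```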

In this paper we explore these questions, providing a methodology for crowdsourcing relevance assessments and its evaluation, and giving guidelines to answer the questions above. In our analysis we consider binary relevance assessments. That is, after presenting a document to the user with some context, the possible outcome of the task is relevant or non-relevant. Ranked-list relevance assessment is out of the scope of this work, but it is a matter of future research.
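
Because several workers judge the same query-document pair, the binary labels can be aggregated and their inter-agreement inspected, one of the aspects highlighted earlier as important for a successful experiment. The sketch below shows one simple, generic way to do this (majority vote plus the fraction of workers agreeing with the majority); the labels are invented and this is not the specific analysis or metric used in the paper.

```python
# Hypothetical aggregation of redundant binary relevance judgments by majority
# vote, with a simple per-item agreement score; the labels below are made up.
from collections import Counter

# Worker labels per (topic, document) pair: 1 = relevant, 0 = non-relevant.
judgments = {
    ("topic-401", "doc-17"): [1, 1, 1, 0, 1],
    ("topic-401", "doc-42"): [0, 0, 1, 0, 0],
    ("topic-402", "doc-03"): [1, 0, 1, 0, 1],
}

for pair, labels in judgments.items():
    majority_label, majority_votes = Counter(labels).most_common(1)[0]
    agreement = majority_votes / len(labels)  # fraction agreeing with majority
    verdict = "relevant" if majority_label == 1 else "non-relevant"
    print(pair, "->", verdict, f"(agreement {agreement:.2f})")
```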

This paper is organized as follows. First, in Section 2 we present an overview of the related work in this area. Second, we describe our proposed methodology in Section 3. Then, we explain the experimental setup in Section 4 and discuss our experiments in Section 5. We end with some final remarks in Section 6.

2 Related Work

There is previous work on using crowdsourcing for IR and NLP. Alonso & Mizzaro [2] compared a single topic t