Crowdsourcing for Book Search Evaluation: Impact of HIT Design on Comparative System Ranking

Gabriella Kazai¹, Jaap Kamps², Marijn Koolen², Natasa Milic-Frayling¹
¹ Microsoft Research, Cambridge, UK
² University of Amsterdam, The Netherlands ({kamps, m.h.a.koolen}@uva.nl)

ABSTRACT

The evaluation of information retrieval (IR) systems over special collections, such as large book repositories, is out of reach of traditional methods that rely upon editorial relevance judgments. Increasingly, the use of crowdsourcing to collect relevance labels has been regarded as a viable alternative that scales with modest costs. However, crowdsourcing suffers from undesirable worker practices and low quality contributions. In this paper we investigate the design and implementation of effective crowdsourcing tasks in the context of book search evaluation. We observe the impact of aspects of the Human Intelligence Task (HIT) design on the quality of relevance labels provided by the crowd. We assess the output in terms of label agreement with a gold standard data set and observe the effect of the crowdsourced relevance judgments on the resulting system rankings. This enables us to observe the effect of crowdsourcing on the entire IR evaluation process. Using the test set and experimental runs from the INEX 2010 Book Track, we find that varying the HIT design, and the pooling and document ordering strategies, leads to considerable differences in agreement with the gold set labels. We then observe the impact of the crowdsourced relevance label sets on the relative system rankings using four IR performance metrics. System rankings based on MAP and Bpref remain less affected by different label sets, while Precision@10 and nDCG@10 lead to dramatically different system rankings, especially for labels acquired from HITs with weaker quality controls. Overall, we find that crowdsourcing can be an effective tool for the evaluation of IR systems, provided that care is taken when designing the HITs.
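To make the evaluation pipeline described above concrete (relevance labels feed metric scores, which in turn induce a comparative system ranking), the following is a minimal sketch and not the track's actual evaluation code: it scores two hypothetical systems under a gold label set and a crowd label set using Precision@10 and nDCG@10 (two of the four metrics named above; MAP and Bpref are omitted), reports a plain fraction-of-matching-labels agreement, and prints the system ordering each label set induces. All identifiers and label values are invented for illustration.

import math

# Toy data: binary relevance labels per (topic, page), one set treated as the
# gold standard and one as crowd-collected. All IDs and values are invented.
gold_labels  = {("t1", "p1"): 1, ("t1", "p2"): 0, ("t1", "p3"): 1, ("t1", "p4"): 0}
crowd_labels = {("t1", "p1"): 1, ("t1", "p2"): 1, ("t1", "p3"): 1, ("t1", "p4"): 0}

# Toy ranked results for two hypothetical systems on a single topic.
runs = {
    "sysA": {"t1": ["p1", "p2", "p3", "p4"]},
    "sysB": {"t1": ["p2", "p4", "p1", "p3"]},
}

def precision_at_k(ranking, labels, topic, k=10):
    # Unjudged documents count as non-relevant, as is usual for P@k.
    return sum(labels.get((topic, d), 0) for d in ranking[:k]) / k

def ndcg_at_k(ranking, labels, topic, k=10):
    dcg = sum(labels.get((topic, d), 0) / math.log2(i + 2)
              for i, d in enumerate(ranking[:k]))
    ideal = sorted((v for (t, _), v in labels.items() if t == topic), reverse=True)[:k]
    idcg = sum(v / math.log2(i + 2) for i, v in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def system_order(labels, metric, k=10):
    # Mean metric over topics, then systems sorted best-first.
    scores = {s: sum(metric(r, labels, t, k) for t, r in topics.items()) / len(topics)
              for s, topics in runs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Label agreement: fraction of judged pages where crowd and gold labels match.
shared = gold_labels.keys() & crowd_labels.keys()
agreement = sum(gold_labels[x] == crowd_labels[x] for x in shared) / len(shared)
print("crowd/gold label agreement:", agreement)

# Does swapping the label set change which system looks better?
for name, metric in [("P@10", precision_at_k), ("nDCG@10", ndcg_at_k)]:
    print(name, "| gold order:", system_order(gold_labels, metric),
          "| crowd order:", system_order(crowd_labels, metric))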

Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems and Software - performance evaluation (efficiency and effectiveness)
General Terms: Experimentation, Measurement, Performance
Keywords: Prove It, Crowdsourcing Quality, Book Search

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGIR'11, July 24-28, 2011, Beijing, China.
Copyright 2011 ACM 978-1-4503-0757-4/11/07 ...$10.00.

1. INTRODUCTION

The evaluation and tuning of Information Retrieval (IR) systems based on the Cranfield paradigm [4, 27] requires purpose-built test collections, at the heart of which lie the human judgments that indicate the relevance of search results to a set of queries. With the ever increasing size and diversity of both the document collections and the query sets, gathering relevance labels by traditional methods, i.e., from a select group of trained experts, has become increasingly challenging [9]. This issue is especially prevalent in specialized search domains such as academic papers or books, which can support a range of tailored search tasks but also present additional complexities in IR evaluation. A good illustration is the INEX Book Track [13], which aims to provide a test bed for the evaluation of book search systems. The track reports on a range of issues related to the gathering of relevance labels [12, 13], one of which is the sheer effort of reviewing whole books and rendering relevance judgments for pages across a large number of retrieved books. While the INEX book collection comprises only 50,000 books, the effort to judge a single topic is estimated at 33 days if the assessor spent 95 minutes a day judging pages on that topic alone [12]. This estimate is based on a relatively shallow pool of 200 books per topic. The issue of scale in collecting human assessments is even more evident in the case of large online repositories that store millions of digitized books, such as the Million Books repository and the Google Books Library.
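As a back-of-the-envelope check on the effort estimate quoted above (the per-book figure below is our own derivation from the quoted numbers, not a figure reported by the track):

\[
33~\text{days} \times 95~\tfrac{\text{min}}{\text{day}} = 3135~\text{min} \approx 52~\text{hours per topic},
\qquad
\frac{3135~\text{min}}{200~\text{books}} \approx 15.7~\text{min per book}.
\]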

Recently, crowdsourcing [8] has emerged as a feasible approach to gathering relevance data in the context of IR evaluations [13, 7, 11, 15]. As such, it promises to offer a solution to the scalability problem that hinders traditional approaches based on editorial judgments, which have been the cornerstone of IR evaluation since its conception at Cranfield [4] over 50 years ago. In general, crowdsourcing is a method of outsourcing work through an open call for contributions from members of a crowd, who are invited to carry out Human Intelligence Tasks (HITs) in exchange for micro-payments, social recognition, or entertainment value. Crowdsourcing platforms, such as Crow…
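The "pool of 200 books per topic" mentioned in the previous paragraph comes from pooling, the standard way test collections limit judging effort: only books that appear near the top of the participating systems' runs are sent for judgment. The sketch below shows plain depth-k pooling over invented run data; the actual pooling and document ordering strategies the paper varies (and the INEX pooling procedure itself) differ in their details.

# Toy run data: ranked book IDs per topic for two hypothetical systems.
book_runs = {
    "sysA": {"t1": ["b1", "b2", "b3", "b4", "b5"]},
    "sysB": {"t1": ["b3", "b6", "b1", "b7", "b8"]},
}

def depth_k_pool(runs, topic, depth):
    # Union of the top-`depth` books from every run, kept in first-seen rank
    # order so that a shallow pool still favours highly ranked books.
    pool, seen = [], set()
    for rank in range(depth):
        for system in sorted(runs):
            ranking = runs[system].get(topic, [])
            if rank < len(ranking) and ranking[rank] not in seen:
                seen.add(ranking[rank])
                pool.append(ranking[rank])
    return pool

# A shallow pool of depth 3; in the effort estimate above the per-topic pool
# is capped at around 200 books.
print(depth_k_pool(book_runs, "t1", depth=3))  # ['b1', 'b3', 'b2', 'b6']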
