Crowdsourcing for Relevance Evaluation

Omar Alonso, Daniel E. Rose, Benjamin Stewart
Palo Alto, CA

Abstract

Relevance evaluation is an essential part of the development and maintenance of information retrieval systems. Yet traditional evaluation approaches have several limitations; in particular, conducting new editorial evaluations of a search system can be very expensive. We describe a new approach to evaluation called TERC, based on the crowdsourcing paradigm, in which many online users, drawn from a large community, each performs a small evaluation task.

1 Introduction

Relevance evaluation for information retrieval is a notoriously difficult and expensive task. In the early years of the field, a set of volunteer editors (often graduate students) would painstakingly read through every document in a corpus to determine its relevance to a set of test queries. This process was sufficiently difficult that only a few small test collections (Cranfield, CACM, etc.) were created.

With the advent of TREC in 1992 [9], researchers had access to test collections with millions of full-text documents. However, the scale of TREC was only possible by eliminating the notion that every document would be read and evaluated. Instead, the pooling approach was developed, in which only the top N documents retrieved by at least one of the participating systems were examined. The other factor that made TREC possible was the availability of a large number of professional assessors (retired intelligence analysts) who were paid for their work with funds from the sponsoring agencies.
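
To make the pooling idea concrete, the following sketch (not taken from the paper; the run format, the pool depth, and the function name are illustrative assumptions) builds the set of documents to be judged for one topic as the union of the top-N documents from each participating system's run.

```python
def build_pool(runs, depth=100):
    """Build a TREC-style judging pool for one topic.

    runs  -- one ranked run per participating system; each run is a
             list of document ids ordered by rank
    depth -- pool depth N: only the top-N documents of each run are
             considered for assessment
    """
    pool = set()
    for run in runs:
        pool.update(run[:depth])  # union of the top-N of every run
    return pool

# Illustrative use: three systems, pool depth 2.
run_a = ["d3", "d7", "d1"]
run_b = ["d7", "d2", "d9"]
run_c = ["d5", "d3", "d4"]
print(sorted(build_pool([run_a, run_b, run_c], depth=2)))
# ['d2', 'd3', 'd5', 'd7'] -- only these documents are read by assessors;
# everything outside the pool goes unjudged (typically treated as non-relevant).
```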

While the TREC collections (and, as importantly, the query sets and evaluations) have been invaluable in furthering IR research over the past 15 years, they still have some limitations. The most obvious of these is that researchers are limited to the types of IR tasks that TREC has studied. For example, if a researcher wishes to study search in a particular vertical domain (for example, a yellow-pages-style search for local businesses) or experiment with a new search interaction paradigm (for example, collaborative search), then existing TREC collections may not help. Furthermore, despite the presence of a Web track, evaluating general Web search has unique challenges [7], which often require another approach.

For these reasons, many researchers in both industry and academia now rely on the strategy of using editorial resources to create their own new relevance assessments from scratch, specific to the needs of the system they are testing. Many web search engines reportedly use large editorial staffs, either in-house or under contract, to judge the relevance of web pages for queries in an evaluation set. Academic researchers, without access to such editors, often rely instead on small groups of student volunteers. Because of the students' limited time and availability, test sets are often smaller than desired, making it harder to detect statistically significant differences in performance between the experimental systems being tested. A recent article by Saracevic presents an in-depth discussion of how people behave around relevance and how it has been studied [8]. Looking at the summary of the studies, the article shows that most of them involve only a handful of individuals. This is not surprising, because setting up an experiment takes time and resources, people being the most important factor.

As an alternative, Joachims [6] and others have proposed exploiting user behavior as an evaluation signal. This approach allows relevance evaluation to be performed at a much larger scale and at much lower cost than the editorial method. While behavioral evaluation can be very effective in certain circumstances, it has limitations as well. It requires access to a large stream of actual behavioral data, something not always available to a researcher testing an experimental system. And, as with TREC, there are certain tasks for which it does not make sense. For example, Rose [7] points out that user click behavior cannot be used to assess the quality of a web search result snippet, since the lack of a click might indicate either a perfect snippet that satisfies the user's information need, or a poor one that fails to convey the relevance of the underlying page.
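
As a concrete illustration of the behavioral approach, the sketch below derives pairwise relevance preferences from a single result list and the clicks it received, in the spirit of Joachims-style clickthrough analysis. The log format and the "a clicked document beats unclicked documents ranked above it" heuristic are simplifying assumptions, not the exact method of any cited study.

```python
def click_preferences(ranking, clicked):
    """Derive pairwise preferences from one result list and its clicks.

    ranking -- list of document ids in the order they were shown
    clicked -- set of document ids the user clicked
    Returns (preferred, over) pairs: each clicked document is taken as
    preferred over every unclicked document ranked above it.
    """
    prefs = []
    for pos, doc in enumerate(ranking):
        if doc in clicked:
            for above in ranking[:pos]:
                if above not in clicked:
                    prefs.append((doc, above))
    return prefs

# Illustrative use: the user skipped d1, clicked d2, skipped d3, clicked d4.
print(click_preferences(["d1", "d2", "d3", "d4"], {"d2", "d4"}))
# [('d2', 'd1'), ('d4', 'd1'), ('d4', 'd3')]
```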

For many tasks, then, what is needed is a third approach, one that provides the customizability of the editorial approach but on a larger scale. We propose the use of crowdsourcing for this purpose. Jeff Howe coined the word "crowdsourcing" in a Wired magazine article to describe tasks that are outsourced to a large group of people instead of being performed by an employee [4]. Crowdsourcing is an open call to solve a problem or carry out a task, and it usually involves a monetary reward in exchange for the service. Crowdsourcing has the "Web 2.0"-style attribute of increased interactivity.
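
To make the "many users, each performing a small task" idea concrete, a crowdsourced evaluation typically collects several independent judgments per query-document pair and then aggregates them. The sketch below is an illustrative assumption, not the TERC procedure itself: it aggregates redundant binary labels by simple majority vote.

```python
from collections import Counter

def majority_label(judgments):
    """Aggregate one query-document pair's worker labels by majority vote.

    judgments -- labels from independent workers,
                 e.g. ["relevant", "relevant", "not relevant"]
    Returns the most common label. Ties are broken arbitrarily here; a
    real evaluation would collect an extra judgment or use a more
    careful aggregation scheme.
    """
    label, _count = Counter(judgments).most_common(1)[0]
    return label

# Illustrative use: three workers judge the same query-document pair.
print(majority_label(["relevant", "not relevant", "relevant"]))  # relevant
```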
