Crowdsourcing Performance Evaluations of User Interfaces

Steven Komarov, Katharina Reinecke, Krzysztof Z. Gajos
Intelligent Interactive Systems Group
Harvard School of Engineering and Applied Sciences
33 Oxford St., Cambridge, MA 02138, USA
{komarov, reinecke, kgajos}@seas.harvard.edu

ABSTRACT

Online labor
markets, such as Amazon's Mechanical Turk (MTurk), provide an attractive platform for conducting human subjects experiments because the relative ease of recruitment, low cost, and a diverse pool of potential participants enable larger-scale experimentation and a faster experimental revision cycle compared to lab-based settings. However, because the experimenter gives up direct control over the participants' environments and behavior, concerns about the quality of the data collected in online settings are pervasive. In this paper, we investigate the feasibility of conducting online performance
evaluations of user interfaces with anonymous, unsupervised, paid participants recruited via MTurk. We implemented three performance experiments to re-evaluate three previously well-studied user interface designs. We conducted each experiment both in lab and online with participants recruited via MTurk. The analysis of our results did not yield any evidence of significant or substantial differences in the data collected in the two settings: All statistically significant differences detected in lab were also present on MTurk, and the effect sizes were similar. In addition, there were no significant
differences between the two settings in the raw task completion times, error rates, consistency, or the rates of utilization of the novel interaction mechanisms introduced in the experiments. These results suggest that MTurk may be a productive setting for conducting performance evaluations of user
interfaces, providing a complementary approach to existing methodologies.

Author Keywords
Crowdsourcing; Mechanical Turk; User Interface Evaluation

ACM Classification Keywords
H.5.2. Information Interfaces and Presentation (e.g. HCI): User Interfaces-Evaluation/Methodology

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CHI '13, April 27-May 2, 2013, Paris, France.
Copyright 2013 ACM 978-1-4503-1899-0 $10.00.

INTRODUCTION

Online labor markets, such as Amazon's Mechanical Turk (MTurk), have emerged as an attractive platform for human subjects research.
Researchers are drawn to MTurk because the relative ease of recruitment affords larger-scale experimentation (in terms of the number of conditions tested and the number of participants per condition), a faster experimental revision cycle, and potentially greater diversity of participants compared to what is typical for lab-based experiments in an academic setting [15, 24].

The downside of such remote experimentation is that the researchers give up the direct supervision of the participants' behavior and the control over the participants' environments. In lab-based settings, the direct contact with the experimenter motivates participants to perform as instructed and allows the experimenter to detect and correct any behaviors that might compromise the validity of the data. Because remote participants may lack the motivation to focus on the task, or may be more exposed to distraction than lab-based
participants, concerns about the quality of the data collected in such settings are pervasive [11, 19, 23, 26].

A variety of interventions and filtering methods have been explored to either motivate participants to perform as instructed or to assess the reliability of the data once it has been collected. Such methods have been developed for experiments that measure visual perception [8, 12, 3], decision making [20, 27, 14], and subjective judgement [11, 19, 21]. Missing from the literature and practice are methods for remotely conducting performance-based evaluations of user interface innovations, such as evaluations of novel input methods, interaction techniques, or adaptive interfaces, where accurate measurements of task completion times are the primary measure of interest. Such experiments may be difficult to conduct in unsupervised settings for several reasons. First, poor data quality may be hard to
detect: For example, while major outliers caused by a participant taking a phone call in the middle of the experiment are easy to spot, problems such as a systematically slow performance due to a participant watching TV while completing the experimental tasks may not be as easily identifiable. Second, the most popular quality control mechanisms used in paid crowdsourcing, such as gold standard tasks [1, 18], verifiable tasks [11], or checking for output agreement [13, 17], do not
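To illustrate the first difficulty (this sketch is not a method from the paper; the function name and thresholds are illustrative), a common robust per-trial screen flags completion times that deviate far from the median, so a single interrupted trial is caught, but a participant who is uniformly slower on every trial passes such a filter unnoticed:

```python
import statistics

def mad_outliers(times, k=3.0):
    """Flag trial times more than k scaled MADs from the median.

    The 1.4826 factor makes the median absolute deviation a
    consistent estimate of the standard deviation for normal data.
    """
    med = statistics.median(times)
    mad = 1.4826 * statistics.median(abs(t - med) for t in times)
    if mad == 0:
        return []
    return [t for t in times if abs(t - med) / mad > k]

# A trial interrupted by a phone call stands out immediately:
interrupted = [1.1, 1.3, 1.2, 1.4, 1.2, 1.3, 45.0]
print(mad_outliers(interrupted))   # the 45 s trial is flagged

# A distracted participant who is uniformly 20% slower does not:
slow = [t * 1.2 for t in [1.1, 1.3, 1.2, 1.4, 1.2, 1.3]]
print(mad_outliers(slow))          # nothing flagged
```

The second case is the one the text warns about: every individual trial looks plausible in isolation, so per-trial outlier screens provide no signal that the session as a whole is degraded.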