Large-Scale Data Processing / Cloud Computing, Lecture 1: Introduction to MapReduce



What is this course about?
  Data-intensive information processing
  Large-data ("web-scale") problems
  Focus on MapReduce programming
  An entry-level course

What is MapReduce?
  Programming model for expressing distributed computations at a massive scale
  Execution framework for organizing and performing such computations
  Open-source implementation called Hadoop

Why Large Data?

How much data?
  Google processes 20 PB a day (2008)
  Wayback Machine has 3 PB + 100 TB/month (3/2009)
  Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
  eBay has 6.5 PB of user data + 50 TB/day (5/2009)
  CERN's LHC will generate 15 PB a year (?)
  "640K ought to be enough for anybody."

Happening everywhere!
  Figure labels: Molecular biology (cancer); microarray chips; Particle events (LHC); particle colliders; microprocessors; Simulations (Millennium); Network traffic (spam); fiber optics; 300M/day; 1B; 1M/sec
  Photos: Maximilien Brice, CERN

No data like more data!
  (Banko and Brill, ACL 2001)
  (Brants et al., EMNLP 2007)
  s/knowledge/data/g;
  How do we get here if we're not Google?

Example: information extraction
  Answering factoid questions
    Pattern matching on the Web
    Works amazingly well
    Who shot Abraham Lincoln? -> X shot Abraham Lincoln
  Learning relations
    Start with seed instances
    Search for patterns on the Web
    Using patterns to find more instances
    Seeds: Birthday-of(Mozart, 1756), Birthday-of(Einstein, 1879)
    Text: "Wolfgang Amadeus Mozart (1756 - 1791)", "Einstein was born in 1879"
    Patterns: "PERSON (DATE", "PERSON was born in DATE"
  (Brill et al., TREC 2001; Lin, ACM TOIS 2007)
  (Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; ...)

Example: Scene Completion
  Image database grouped by semantic content
    30 different F groups
    2.3 M images total (396 GB)
  Select candidate images most suitable for filling the hole
    Classify images with gist scene detector (Torralba)
    Color similarity
    Local context matching
  Computation
    Index images offline
    50 min. scene matching, 20 min. local matching, 4 min. compositing
    Reduces to 5 minutes total by using 5 machines
  Extension
    F has over 500 million images
  Hays, Efros (CMU), "Scene Completion Using Millions of Photographs", SIGGRAPH 2007

More Data, More Gains?
  CNNIC Statistical Report on Internet Development in China
  As of the end of June 2010, China had 420 million Internet users, and the Internet penetration rate had continued to rise, reaching 31.8%. Mobile Internet users were the main driver of overall growth: 43.34 million were added in half a year, bringing the total to 277 million, an increase of 18.6%. Notably, the Internet became commercial very quickly: online shoppers nationwide reached 140 million, and the half-year user growth rates for online payment, online shopping, and online banking were all around 30%, far higher than for other kinds of Internet applications.

Overview of China's news and publishing industry, 2009
  In 2009, 238,868 book titles were published (145,475 first editions; 93,393 new editions and reprints). The total print run was 3.788 billion copies (sheets) and 31.246 billion printed sheets, equivalent to 734,000 tonnes of paper (including 141 million printed sheets for supplements, equivalent to 3,300 tonnes). The total list price was 56.727 billion yuan (including 473 million yuan for supplements). Compared with the previous year, the number of titles grew 8.86% (first editions 11.24%; new editions and reprints 5.36%), the total print run 4.53%, total printed sheets 4.61%, and total list price 8.94%.

Did you know?
  "We are currently preparing our students for jobs that don't yet exist ..."
  "It is estimated that a week's worth of the New York Times contains more information than a person was likely to come across in a lifetime in the 18th century"
  "The amount of new technical information is doubling every 2 years"
  "So what does IT ALL MEAN?"
  "We are living in exponential times"

Two Different Views
  Jennifer Widom: a "thrower-awayer"
  Gordon Bell: MyLifeBits
  "The cost of throwing things away and finding them again when necessary is far lower than the cost of maintaining them."
  "trying to live an efficient life so that one has time to work and be with one's family."

Information Overloading
  One reason we fail to put what we learn into practice: information overload
  Of information we encounter only once, we usually remember only a small part.
  We should learn less but in depth, rather than more but superficially.
  The key to mastering something is spaced repetition.
  Once people have truly and thoroughly mastered their work, they become more creative and can even work wonders.

What is Cloud Computing?

The best thing since sliced bread?
  Before clouds
    Grids
    Vector supercomputers
  Cloud computing means many different things:
    Large-data processing
    Rebranding of web 2.0
    Utility computing
    Everything as a service

Rebranding of web 2.0
  Rich, interactive web applications
    Clouds refer to the servers that run them
    AJAX as the de facto standard (for better or worse)
    Examples: Facebook, YouTube, Gmail, ...
  "The network is the computer": take two
    User data is stored "in the clouds"
    Rise of the netbook, smartphones, etc.
    Browser is the OS

Utility Computing
  Source: Wikipedia (Electricity meter)
  What?
    Computing resources as a metered service ("pay as you go")
    Ability to dynamically provision virtual machines
  Why?
    Cost: capital vs. operating expenses
    Scalability: "infinite" capacity
    Elasticity: scale up or down on demand
  Does it make sense?
    Benefits to cloud users
    Business case for cloud providers
  "I think there is a world market for about five computers."

Everything as a Service
  Utility computing = Infrastructure as a Service (IaaS)
    Why buy machines when you can rent cycles?
    Examples: Amazon's EC2, Rackspace
  Platform as a Service (PaaS)
    Give me nice API and take care of the maintenance, upgrades, ...
    Example: Google App Engine
  Software as a Service (SaaS)
    Just run it for me!
    Example: Gmail, Salesforce

Utility Computing
  "Pay as you go" is like plugging into a wall socket: you get the same voltage as Microsoft does, you simply use less and pay less. The goal of utility computing is to give computing resources the same kind of service capability: users can draw on the computing resources that Fortune 500 companies have, and simply use less, pay less. This is an important aspect of cloud computing.
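The pay-as-you-go argument can be made concrete with a little arithmetic. The sketch below is only an illustration: the function names, the 730-hour month, and every price are hypothetical and do not come from the lecture. It compares owning enough servers to cover the peak load (capital expense) with renting server-hours as they are needed (operating expense).

```python
# Illustrative only: all names and numbers here are assumptions,
# not figures from the lecture or from any real cloud provider.

HOURS_PER_MONTH = 730  # assumed average number of hours in a month


def owned_cost(peak_servers, monthly_cost_per_server):
    """Fixed cost: enough servers must be owned for the peak, busy or idle."""
    return peak_servers * monthly_cost_per_server


def rented_cost(server_hours_used, price_per_server_hour):
    """Metered cost: only the server-hours actually consumed are billed."""
    return server_hours_used * price_per_server_hour


def break_even_utilization(monthly_cost_per_server, price_per_server_hour):
    """Fraction of the month a server must be busy before owning is cheaper."""
    return monthly_cost_per_server / (price_per_server_hour * HOURS_PER_MONTH)


if __name__ == "__main__":
    # Hypothetical workload: peaks at 100 servers, averages 10 busy servers.
    own = owned_cost(peak_servers=100, monthly_cost_per_server=150)
    rent = rented_cost(server_hours_used=10 * HOURS_PER_MONTH,
                       price_per_server_hour=0.5)
    print(f"own: {own}, rent: {rent}")
    print(f"break-even utilization: {break_even_utilization(150, 0.5):.2f}")
```

With these made-up numbers, the bursty workload costs 15,000 per month to own and 3,650 to rent, and owning only becomes cheaper once a server is busy more than about 41% of the time. That gap between peak and average load is what the "elasticity" bullet is pointing at.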
