Hadoop and Data Analysis @ Taobao
Taobao Data Platform & Product Division, Infrastructure R&D Group
Zhou Min, 2014-05-26

Outline
- Basic Hadoop concepts
- Where Hadoop applies
- How Hadoop works underneath
- Hive and data analysis
- Hadoop cluster management
- A typical Hadoop offline analysis architecture
- Common problems and their solutions

The philosophy of card games
Playing cards maps neatly onto MapReduce: input, split, shuffle and output correspond to dealing the cards, each player sorting their own hand, exchanging cards, and sorting the hand again. Done.

Counting words
Input:

    The weather is good
    This guy is a good man
    Today is good

The map phase emits a (word, 1) pair for every word, the shuffle phase groups the pairs by word, and the reduce phase sums each group. Result:

    a 1, good 5, guy 1, is 4, man 2, the 1, this 1, today 1, weather 1

What Hadoop is used for
- Traffic computation
- Trend analysis (example: http://www.trendingtopics.org/)
- User recommendation
- Distributed indexing

The Hadoop core
- Hadoop Common
- HDFS, the distributed file system
- The MapReduce framework

The Hadoop ecosystem
- Pig, a parallel data analysis language
- HBase, a column-oriented NoSQL database
- Zookeeper, a distributed coordination service
- Hive, a data warehouse queried with SQL
- Chukwa, a log collection and analysis tool

[Diagram: input data is split into DFS blocks, each block replicated three times across the Hadoop cluster; parallel MAP tasks process the blocks, and a Reduce task combines their output into the result.]

How Hadoop executes a job
[Diagram: job execution flow.]

Hadoop case study (1): the map method of MapClass1

```java
public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    String strLine = value.toString();
    String[] strList = strLine.split(...); // delimiter not legible in the slide
    String mid = strList[3];
    String sid = strList[4];
    String timestr = strList[0];
    try {
        timestr = timestr.substring(0, 10);
    } catch (Exception e) {
        return;
    }
    timestr += "0000";
    // ... dozens of lines omitted ...
    output.collect(new Text(mid + "" + sid + "" + timestr), ...); // separators not legible in the slide
}
```

Hadoop case study (2): Reducer1

```java
public static class Reducer1 extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    private Text word = new Text();
    private Text str = new Text();

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String[] t = key.toString().split(...); // delimiter not legible in the slide
        word.set(t[0]);
        // str.set(t[1]);
        output.collect(word, str); // uid, kind
    }
}
```

Hadoop case study (3): MapClass2

```java
public static class MapClass2 extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    private Text word = new Text();
    private Text str = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String strLine = value.toString();
        String[] strList = strLine.split("\\s+");
        word.set(strList[0]);
        str.set(strList[1]);
        output.collect(word, str);
    }
}
```

Hadoop case study (4): Reducer2

```java
public static class Reducer2 extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    private Text word = new Text();
    private Text str = new Text();

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        while (values.hasNext()) {
            String t = values.next().toString();
            // ... dozens of lines omitted ...
        }
        // ... dozens of lines omitted ...
        output.collect(new Text(mid + "" + sid + ""), ...);
    }
}
```

Thinking in MapReduce
[Diagram: expressing relational operations as MapReduce jobs: Group, Co-group, Function, Aggregate, Filter.]

Magics of Hive
What the hand-written MapReduce code above computes can be expressed as a single query:

    SELECT COUNT(DISTINCT mid) FROM log_table

Why did Taobao adopt Hadoop?
- The webalizer / awstats / 般若 / Atpanel era: logs peaked at 250 GB/day, and up to about 50 jobs took more than 20 hours to run each day.
- The Hadoop era: logs currently total 470 GB/day, and 366 jobs finish in about 6-7 hours on average.

Who else uses Hadoop?
Yahoo! Beijing Global Software R&D Center, China Mobile Research Institute, Intel Research, Kingsoft, Baidu, Tencent, Sina, Sohu, IBM, Facebook, Amazon, Yahoo!

A typical Hadoop architecture for a web site
Web Servers -> Log Collection Servers -> Filers -> Data Warehousing on a Cluster -> Oracle RAC / Federated MySQL

How Taobao uses Hadoop and Hive
[Diagram: a rich client, a CLI/GUI JobClient, and a client program behind a Web Server submit work through a Thrift Server; the MetaStore Server keeps metadata in MySQL; a Scheduler drives the jobs.]

Debugging
- Standard output and standard error
- The web UIs (port 50030 JobTracker, 50060 TaskTracker, 50070 NameNode)
- The NameNode, JobTracker, DataNode and TaskTracker logs
- Local reproduction with LocalRunner
- Shipping debug code via the DistributedCache

Profiling
- Goal: find performance bottlenecks, memory leaks, thread deadlocks, and the like
- Tools: jmap, jstat, hprof, jconsole, jprofiler, mat, jstack
- Profile the JobTracker
- Profile the TaskTracker on each slave node
- Profile an individual Child process on a slave node (a single task may be running far too slowly)

Monitoring
- Goal: watch I/O, memory and CPU for the whole cluster or a single node
- Tool: Ganglia

How can data movement be reduced?

Data skew
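The word-count walkthrough earlier can be simulated end to end without a cluster. Below is a sketch in plain Java rather than the Hadoop API (the class name and the use of streams are my own): the flatMap stands in for the map phase, and the grouping collector plays the role of shuffle plus reduce.

```java
import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    // Simulates MapReduce word count in-process:
    // map emits one word per occurrence, grouping stands in for the shuffle,
    // and counting each group is the reduce.
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+"))) // map phase
                .collect(Collectors.groupingBy(w -> w, TreeMap::new,
                                               Collectors.counting()));           // shuffle + reduce
    }

    public static void main(String[] args) {
        List<String> input = List.of("The weather is good",
                                     "This guy is a good man",
                                     "Today is good");
        System.out.println(wordCount(input));
        // prints {a=1, good=3, guy=1, is=3, man=1, the=1, this=1, today=1, weather=1}
    }
}
```

For these three lines, lowercasing folds "The" and "the" together; the counts follow directly from the input shown.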
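The COUNT(DISTINCT mid) query is itself executed as MapReduce: Hive typically shuffles on mid so that identical values meet in the same reducer, which then counts each distinct key once. A rough model of that plan, with class and method names of my own:

```java
import java.util.*;

public class CountDistinctMid {
    // Approximates how SELECT COUNT(DISTINCT mid) FROM log_table can run as a job:
    // the map phase emits each mid as a key, the shuffle groups equal mids,
    // and the reduce phase counts one per distinct key.
    public static long countDistinct(List<String> mids) {
        Set<String> grouped = new TreeSet<>(mids); // stands in for the shuffle's grouping by key
        return grouped.size();                     // reduce: one count per group
    }

    public static void main(String[] args) {
        List<String> logMids = List.of("m1", "m2", "m1", "m3", "m2", "m1");
        System.out.println(countDistinct(logMids)); // prints 3
    }
}
```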
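One standard answer to the question of reducing data movement is a combiner: pre-aggregating map output on each node so the shuffle carries one record per distinct key per node instead of one per occurrence. A minimal illustration (the class name and the record counts are invented for the example):

```java
import java.util.*;

public class CombinerEffect {
    // A combiner locally sums (word, 1) pairs on a map node before the shuffle,
    // collapsing repeated keys into a single record each.
    static Map<String, Long> combine(List<String> words) {
        Map<String, Long> local = new HashMap<>();
        for (String w : words) local.merge(w, 1L, Long::sum);
        return local;
    }

    public static void main(String[] args) {
        List<String> split = Collections.nCopies(10_000, "good"); // one map task's output keys
        Map<String, Long> combined = combine(split);
        System.out.println("records shuffled without combiner: " + split.size());    // 10000
        System.out.println("records shuffled with combiner:    " + combined.size()); // 1
    }
}
```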
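For data skew, a common remedy (the slides name the problem but not a fix) is two-stage aggregation with salted keys: a random prefix spreads one hot key across several reducers, and a second pass strips the prefix and merges the partial sums. A self-contained sketch, with all names illustrative:

```java
import java.util.*;

public class SkewMitigation {
    // Stage 1 salts each key so a hot key is split across several reducers;
    // stage 2 strips the salt and combines the partial sums.
    public static Map<String, Long> saltedSum(List<String> keys, int salts) {
        Random rnd = new Random(42);
        Map<String, Long> partial = new HashMap<>();      // stage-1 reduce output
        for (String k : keys) {
            String salted = rnd.nextInt(salts) + "#" + k; // map: prepend a random salt
            partial.merge(salted, 1L, Long::sum);
        }
        Map<String, Long> total = new TreeMap<>();        // stage-2 reduce
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            String original = e.getKey().substring(e.getKey().indexOf('#') + 1);
            total.merge(original, e.getValue(), Long::sum);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>(Collections.nCopies(1000, "hot"));
        keys.add("cold");
        System.out.println(saltedSum(keys, 8)); // prints {cold=1, hot=1000}
    }
}
```

The final totals are identical to a single-stage sum; only the intermediate load per reducer changes, which is the point of the trick.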