Hadoop平台简介-肖韬南京大学计算机系

资源描述

《Hadoop平台简介-肖韬南京大学计算机系》由会员分享，可在线阅读，更多相关《Hadoop平台简介-肖韬南京大学计算机系（44页珍藏版）》请在金锄头文库上搜索。

1、肖韬南京大学计算机科学与技术系2010使用Hadoop的Java API接口在Hadoop文件系统中的文件是由一个Hadoop Path对象来表示的，可以把一个Path对象想象成一个Hadoop文件系统的URI，例如 hdfs:/localhost:9000/user/xt/input/text.dat通过2个静态工厂方法从抽象的Hadoop文件系统中抽取出一个具体的实现的实例。public static FileSystem get(Configuration conf) throws IOException;返回默认的文件系统(在conf/core-site.xml中指定)，或者本

2、地的文件系统( 如果该文件中没有指定)public static FileSystem get(URI uri, Configuration conf) throws IOException;返回由uri决定的文件系统，或者默认的文件系统(如果uri无效)新旧API变化的对比以0.20.0版本为分水岭，有一些API在新的版本中被舍弃了，且推荐不使用，而是改为使用新的API下面将以WordCount程序为例进行说明0.20.0之前的WordCount程序public WordCount public static void main(String args) throws Throwable

3、 JobConf conf = new JobConf(WordCount.class); conf.setJobName(“A Sample WordCount Example”);FileInputFormat.addInputPath(conf, new Path(args0); FileInputFormat.setOutputPath(conf, new Path(args1);conf.setMapperClass(WordCountMapper.class); conf.setReducerClass(WordCountReducer.class);conf.setOutputK

4、eyClass(Text.class); conf.setOutputValueClass(IntWritable.class);JobClient.runJob(conf); class WordCountMapper extends MapReduceBase implements Mapper public void map(LongWritable offset, Text line, OutputCollector collector, Reporter reporter) throws IOException StringTokenizer tokenzier = new Stri

5、ngTokenizer(line.toString();while (tokenizer.hasMoreTokens()collector.collect(new Text(tokenizer.nextToken(), new IntWritable(1); class WordCountReducer extends MapReduceBase implements Reducer public void reduce(Text word, Iterator counts, OutputCollector collector, Reporter reporter) throws IOExce

6、ption int sum = 0;while (counts.hasNext()sum += counts.next().get();collector.collect(word, new IntWritable(sum); 0.20.0之后的WordCount程序public class WordCount public static void main(String args) throw Exception Configuration conf = new Configuration(); Job job = new Job(conf, “A Sample WordCount Exam

7、ple”);job.setJarByClass(WordCount.class); job.setMapperClass(WordCountMapper.class); job.setReducerClass(WordCountReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class);FileInputFormat.addInputPath(job, new Path(args0); FileOutputFormat.setOutputPath(job, new P

8、ath(args1);job.waitForCompletion(true); class WordCountMapper extends Mapper public void map(LongWritable offset, Text line, Context context) throws IOException, InterruptedException StringTokenizer tokenizer = new StringTokenizer(line.toString();while (tokenizer.hasMoreTokens()context.write(new Tex

9、t(tokenizer.nextToken(), new IntWritable(1); class WordCountReducer extends Reducer public void reduce(Text word, Iterator counts, Context context) throws IOException, InterruptedException int sum = 0;while (counts.hasNext()sum += counts.next().get();context.write(word, new IntWritable(sum); Shuffle

10、 and SortMapReduce保证每一个reduce task的输入基于key排序的。(MapReduce makes the guarantee that the input to every reducer is sorted by key).系统进行排序的过程(包括将map的输出转换为reduce的输入) 被称为shuffle.The process by which the system performs the sort and transfers the map outputs to the reducer as inputs is known as the shuffle.

11、shuffle过程map task中生成了3个spill file，每个spill file中有3个partitionshuffle过程: map task side当一个map task开始产生它的输出时，输出并非不经处理被直接就写到磁盘上去的。每一个map task都有一个circular memory buffer，缺省大小为 100MB，map task会将它产生的输出（key-value pairs）写入到它的 memory buffer中去。当map task写入到memory buffer的数据占memory buffer的大小百分比到达一个阈值(缺省为80%)时，一个bac

12、kground thread(记为 thread)将开始把memory buffer中的内容spill到磁盘上去。在thread将memory buffer中的数据spill到磁盘中之前，thread首先将这些数据分成若干partition，每一个partition将被发送至一个 reducer。在每一个partition内，thread将根据key对该partition内的数据（即key-value pairs）进行in-memory sort，如果指定了 combiner function，那么该combiner function将会被作用于上述in-memory sort的输出。每

13、当memory buffer中的数据达到一个阈值时，就会产生一个spill file，所以在map task输出了所有的record之后，就会存在多个spill files. (1个record即1个key-value pair)在map task结束之前，所有的spill files将被merge到一个单独的output file中，该output file在结构上由多个partition组成，每一个partition内的数据都是排好序的，且每一个 partition将被送至对应的一个reduce task.如果指定了combiner function并且spill的数量不低于3个，

14、那么在生成output file之前，combiner function将会作用于将要被写入到output file里的每一个partition内的数据。reduce task sidemap task的输出存储在map task节点所在机器的本地文件系统中，reduce task会自己所需的某个partition数据复制到自己所在的HDFS中，且一个reduce task将会从多个map task复制其所需要的partition（这些partition都是同一类的）。reducer 怎样知道从哪些map tasktracker那里去取自己所需要的partition（亦即map t

15、ask的输出）？当map task成功完成后，它会将状态更新通知它所属的 tasktracker，该tasktracker进而又会通知其所属的jobtracker 。这些通知是通过heartbeat通信机制实现的。这样，对于一个 job而言，jobtracker知道map output与tasktracker之间的映射关系。reducer中的一个线程会周期性地向jobtracker询问map output所在的位置，直到该reducer接收了所有的map biner function 与 partitioner function当存在多个reducer时，map tasks将会对它们的输出进

16、行 partition，每一个mask task都会为每一个reduce task生成一个 partition.在每一个partition内都可能会有很多keys(以及相应的values)，但是对于任一个key而言，它的records都在一个partition内。partition的过程可以由用户定义的partitioning函数来控制，但是一般来说，默认的partitioner函数（根据key进行hash映射）已经可以令人满意。存在多个reduce task时的partitioningpartition的数量与reducer的数量是一致的定制个性化的partitioner自定义的partitioner function需要继承于一个抽象类Partitioner controls the partitionini

展开阅读全文