An Introductory Walkthrough of the Map-Reduce Process (Temperature Example)


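Hadoop Study Notes, Part 3: Introduction to Map-Reduce

2010-11-29 21:31

1. The logical process of Map-Reduce

Suppose we need to process a batch of weather data in the following format:

- The data is stored as ASCII text, one record per line.
- Characters in each line are counted from 0; characters 15 through 18 hold the year.
- Characters 25 through 29 hold the temperature, and character 25 is the sign (+/-).

    0067011990999991950051507+0000+
    0043011990999991950051512+0022+
    0043011990999991950051518-0011+
    0043012650999991949032412+0111+
    0043012650999991949032418+0078+
    0067011990999991937051507+0001+
    0043011990999991937051512-0002+
    0043011990999991945051518+0001+
    0043012650999991945032412+0002+
    0043012650999991945032418+0078+

The goal is to find the highest temperature for each year.

Map-Reduce consists of two main steps, Map and Reduce. Each step takes key-value pairs as input and produces key-value pairs as output:

- The format of the key-value pairs entering the map phase is determined by the input format. With the default, TextInputFormat, each line is processed as one record: the key is the offset of the start of the line from the beginning of the file, and the value is the text of the line.
- The format of the key-value pairs output by the map phase must match the format of the key-value pairs that the reduce phase takes as input.

For the example above, the input key-value pairs of the map phase are:

    (0, 0067011990999991950051507+0000+)
    (33, 0043011990999991950051512+0022+)
    (66, 0043011990999991950051518-0011+)
    (99, 0043012650999991949032412+0111+)
    (132, 0043012650999991949032418+0078+)
    (165, 0067011990999991937051507+0001+)
    (198, 0043011990999991937051512-0002+)
    (231, 0043011990999991945051518+0001+)
    (264, 0043012650999991945032412+0002+)
    (297, 0043012650999991945032418+0078+)

In the map phase, each line is parsed to produce year-temperature key-value pairs as output:

    (1950, 0)
    (1950, 22)
    (1950, -11)
    (1949, 111)
    (1949, 78)
    (1937, 1)
    (1937, -2)
    (1945, 1)
    (1945, 2)
    (1945, 78)

For the reduce phase, the map output is grouped so that all values with the same key are placed in one list, which becomes the reduce input:

    (1950, [0, 22, -11])
    (1949, [111, 78])
    (1937, [1, -2])
    (1945, [1, 2, 78])

In the reduce phase, the maximum temperature is selected from each list, and the year/maximum-temperature key-value pairs are emitted as output:

    (1950, 22)
    (1949, 111)
    (1937, 1)
    (1945, 78)

The logical process can be illustrated by the following figure:

[Figure: logical process of the Map-Reduce temperature example]

2. Writing a Map-Reduce program

Writing a Map-Reduce program generally means implementing two functions: the map function of a mapper and the reduce function of a reducer. They generally follow this pattern:

- map: (K1, V1) -> list(K2, V2)

    public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
      void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
          throws IOException;
    }

- reduce: (K2, list(V2)) -> list(K3, V3)

    public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {
      void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output,
          Reporter reporter) throws IOException;
    }

For the example above, the mapper is implemented as follows:

    // Needs java.io.IOException, org.apache.hadoop.io.* and org.apache.hadoop.mapred.*
    public class MaxTemperatureMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      @Override
      public void map(LongWritable key, Text value,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        String line = value.toString();
        // Characters 15..18 are the year.
        String year = line.substring(15, 19);
        int airTemperature;
        // Skip a leading '+'; a leading '-' is kept and parsed as part of the number.
        if (line.charAt(25) == '+') {
          airTemperature = Integer.parseInt(line.substring(26, 30));
        } else {
          airTemperature = Integer.parseInt(line.substring(25, 30));
        }
        output.collect(new Text(year), new IntWritable(airTemperature));
      }
    }

The reducer is implemented as follows:

    // Additionally needs java.util.Iterator
    public class MaxTemperatureReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        // Keep the largest temperature seen for this year.
        int maxValue = Integer.MIN_VALUE;
        while (values.hasNext()) {
          maxValue = Math.max(maxValue, values.next().get());
        }
        output.collect(key, new IntWritable(maxValue));
      }
    }

To run the Mapper and Reducer implemented above, we need to create a Map-Reduce job, which basically consists of three parts:

- The input data, that is, the data to be processed
- The Map-Reduce program, that is, the Mapper and Reducer implemented above
- The configuration of the job, JobConf

To configure JobConf, it helps to have a rough understanding of how Hadoop runs a job:

- Hadoop divides a job into tasks. There are two kinds of task: map tasks and reduce tasks.
- Hadoop has two kinds of node that control the running of a job: the JobTracker and the TaskTracker.
  o The JobTracker coordinates the whole job and assigns tasks to different TaskTrackers.
  o A TaskTracker runs tasks and returns the results to the JobTracker.
- Hadoop divides the input data into fixed-size pieces, which we call input splits.
- Hadoop creates one task for each input split, and that task processes the records in the split one by one.
- Hadoop tries to make the DataNode that holds an input block and the DataNode on which the task runs the same node (every DataNode has a TaskTracker on it), which improves efficiency. For this reason the size of an input split is usually the size of an HDFS block.
- The input of a reduce task is generally the output of the map tasks. The output of the reduce tasks is the output of the whole job and is stored on HDFS.
- In the reduce phase, all records with the same key are guaranteed to be processed on the same TaskTracker, while different keys may be processed on different TaskTrackers. We call this partitioning.
  o The partition rule is (K2, V2) -> Integer: a partition id is generated from K2, and all K2 with the same id go to the same partition, where they are processed by the same Reducer on the same TaskTracker.

    public interface Partitioner<K2, V2> extends JobConfigurable {
      int getPartition(K2 key, V2 value, int numPartitions);
    }

The following figure roughly describes how a Map-Reduce job runs:

[Figure: how a Map-Reduce job runs]

Next we look at JobConf, which has many options that can be configured:

- setInputFormat: sets the input format of the map phase. The default is TextInputFormat, with LongWritable keys and Text values.
- setNumMapTasks: sets the number of map tasks. This setting usually has no effect, because the number of map tasks depends on how many input splits the input data can be divided into.
- setMapperClass: sets the Mapper. The default is IdentityMapper.
- setMapRunnerClass: sets the class that runs a map task, which must implement the MapRunnable interface. The default is MapRunner, which reads the records of an input split one by one and calls the Mapper's map function on each.
- setMapOutputKeyClass and setMapOutputValueClass: set the types of the key-value pairs output by the Mapper.
- setOutputKeyClass and setOutputValueClass: set the types of the key-value pairs output by the Reducer.
- setPartitionerClass and setNumReduceTasks: set the Partitioner and the number of reduce tasks. The default Partitioner is HashPartitioner, which uses the hash value of the key to decide which partition a record goes to. Each partition is processed by one reduce task, so the number of partitions equals the number of reduce tasks.
- setReducerClass: sets the Reducer. The default is IdentityReducer.
- setOutputFormat: sets the output format of the job. The default is TextOutputFormat.
- FileInputFormat.addInputPath: sets the path of the input. It can be a file, a directory, or a glob pattern, and it can be called several times to add several paths.
- FileOutputFormat.setOutputPath: sets the path of the output. This path must not exist before the job runs.

Not all of these options need to be set. Based on the example above, we can write the Map-Red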
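
The logical process from section 1 (map, group by key, reduce) and the partition rule (K2, V2) -> Integer can also be tried out without a cluster. The following is a minimal sketch in plain Java that uses no Hadoop classes at all. The class name MaxTemperatureSketch and its method names are made up for illustration, and getPartition only mirrors the hash-and-modulo idea of HashPartitioner on a plain String key, so the partition ids it yields are not necessarily the ones Hadoop would assign to a Text key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the map -> group -> reduce flow; no Hadoop classes are used.
class MaxTemperatureSketch {

    // The ten sample records from the text.
    static final List<String> SAMPLE = Arrays.asList(
        "0067011990999991950051507+0000+",
        "0043011990999991950051512+0022+",
        "0043011990999991950051518-0011+",
        "0043012650999991949032412+0111+",
        "0043012650999991949032418+0078+",
        "0067011990999991937051507+0001+",
        "0043011990999991937051512-0002+",
        "0043011990999991945051518+0001+",
        "0043012650999991945032412+0002+",
        "0043012650999991945032418+0078+");

    // Map step, key part: characters 15..18 hold the year.
    static String parseYear(String line) {
        return line.substring(15, 19);
    }

    // Map step, value part: character 25 is the sign, characters 26..29 the digits.
    static int parseTemperature(String line) {
        int magnitude = Integer.parseInt(line.substring(26, 30));
        return line.charAt(25) == '-' ? -magnitude : magnitude;
    }

    // Grouping step: collect all values that share a key into one list.
    static Map<String, List<Integer>> group(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            grouped.computeIfAbsent(parseYear(line), k -> new ArrayList<>())
                   .add(parseTemperature(line));
        }
        return grouped;
    }

    // Reduce step: keep the maximum of one key's value list.
    static int reduce(List<Integer> values) {
        int max = Integer.MIN_VALUE;
        for (int v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Partition rule (illustrative): non-negative hash of the key modulo the partition count.
    static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Runs the whole flow and returns year -> maximum temperature, sorted by year.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : group(lines).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(SAMPLE)); // {1937=1, 1945=78, 1949=111, 1950=22}
    }
}
```

A TreeMap is used only so that the printed result has a stable order. In a real job the grouping is done by the framework between the map and reduce tasks, and records with the same key always end up in the same partition because getPartition depends on the key alone.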
