Large-Scale Data Processing / Cloud Computing, Lecture 5: MapReduce Algorithm Design


Large-Scale Data Processing / Cloud Computing
Lecture 5: MapReduce Algorithm Design
Peng Bo, School of Information Science and Technology, Peking University. 7/19/2011.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License; see http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details. Original slides by Jimmy Lin, University of Maryland. Course development: SEWM Group.

Quiz

What is MapReduce? "Hello World": Word Count

    Map(String docid, String text):
        for each word w in text:
            Emit(w, 1)

    Reduce(String term, Iterator<Int> values):
        int sum = 0
        for each v in values:
            sum += v
        Emit(term, sum)

[diagram: mappers emit intermediate (key, value) pairs; shuffle and sort aggregates values by key; reducers sum the values for each key]

Putting everything together
[diagram: each slave node runs a datanode daemon over the Linux file system; the namenode runs the namenode daemon; jobs enter through the job submission node]

Data Center
[photo]

Today's Agenda
MapReduce algorithm design:
- How do you express everything in terms of m, r, c, p?
- Toward "design patterns"

MapReduce: Recap
Programmers must specify:
- map (k, v) -> (k', v')*
- reduce (k', v') -> (k', v'')*
All values with the same key are reduced together.
Optionally, also:
- partition (k', number of partitions) -> partition for k': often a simple hash of the key, e.g., hash(k') mod n; divides up the key space for parallel reduce operations
- combine (k', v') -> (k', v')*: mini-reducers that run in memory after the map phase; used as an optimization to reduce network traffic
The execution framework handles everything else.

[diagram: map -> combine -> partition; shuffle and sort aggregates values by key; reduce produces the final output]

"Everything Else"
The execution framework handles everything else:
- Scheduling: assigns workers to map and reduce tasks
- "Data distribution": moves processes to data
- Synchronization: gathers, sorts, and shuffles intermediate data
- Errors and faults: detects worker failures and restarts
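The word-count pseudocode above can be sketched as a small runnable simulation (not part of the slides): a map phase, an in-memory stand-in for shuffle and sort that aggregates values by key, and a reduce phase.

```python
from itertools import groupby
from operator import itemgetter

def map_word_count(docid, text):
    """map(k, v) -> [(k', v')]: emit (word, 1) for every word."""
    return [(w, 1) for w in text.split()]

def reduce_word_count(term, values):
    """reduce(k', [v']) -> [(k', v'')]: sum the partial counts."""
    return [(term, sum(values))]

def run_mapreduce(inputs, mapper, reducer):
    # Map phase: apply the mapper to every input record.
    intermediate = []
    for k, v in inputs:
        intermediate.extend(mapper(k, v))
    # Shuffle and sort: aggregate values by key.
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(key, [v for _, v in group]))
    return dict(output)

docs = [("d1", "a b a"), ("d2", "b c")]
print(run_mapreduce(docs, map_word_count, reduce_word_count))
# {'a': 2, 'b': 2, 'c': 1}
```

The sort-then-group step plays the role of the framework's shuffle and sort; in Hadoop this happens across the network, not in one list.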

Limited control over data and execution flow: all algorithms must be expressed in m, r, c, p. You don't know:
- where mappers and reducers run
- when a mapper or reducer begins or finishes
- which input a particular mapper is processing
- which intermediate key a particular reducer is processing

Tools for Synchronization
- Preserving state in mappers and reducers: capture dependencies across multiple keys and values
- Cleverly-constructed data structures: bring partial results together
- Sort order of intermediate keys: control the order in which reducers process keys
- Partitioner: control which reducer processes which keys
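The partitioner hook can be sketched in a couple of lines (not from the slides; crc32 stands in for whatever stable hash the framework uses):

```python
import zlib

def partition(key, num_partitions):
    # A simple hash of the key, mod n, divides the key space
    # among the parallel reduce operations.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Every occurrence of a key maps to the same reducer, which is what
# guarantees that all values with the same key are reduced together.
reducers = {k: partition(k, 4) for k in ["apple", "banana", "cherry"]}
print(reducers)
```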

Preserving State
- Mapper object: configure / map / close, with per-object state; one object per task; one call to map per input key-value pair
- Reducer object: configure / reduce / close, with per-object state; one call to reduce per intermediate key
- configure is the API initialization hook; close is the API cleanup hook

Scalable Hadoop Algorithms: Themes
- Avoid object creation: an inherently costly operation that also triggers garbage collection
- Avoid buffering: heap size is limited; buffering works for small datasets, but won't scale!

Importance of Local Aggregation
Ideal scaling characteristics: twice the data, twice the running time; twice the resources, half the running time. Why can't we achieve this? Synchronization requires communication, and communication kills performance. Thus, avoid communication: reduce intermediate data via local aggregation; combiners can help.

Shuffle and Sort
[diagram: the mapper writes into a circular buffer in memory, which spills to disk; spills are merged into intermediate files on disk; the combiner can run on spills and merges; reducers fetch the merged output of this and other mappers]

Word Count: Baseline
What's the impact of combiners?

Word Count: Version 1
Are combiners still needed?

Word Count: Version 2
Are combiners still needed? Key: preserve state across input key-value pairs!

Design Pattern for Local Aggregation
"In-mapper combining": fold the functionality of the combiner into the mapper by preserving state across multiple map calls.
- Advantages: speed. Why is this faster than actual combiners?
- Disadvantages: explicit memory management is required, and there is potential for order-dependent bugs

Combiner Design
Combiners and reducers share the same method signature, so sometimes reducers can serve as combiners; often, they cannot. Remember: combiners are optional optimizations. They should not affect algorithm correctness, and they may be run 0, 1, or multiple times. Example: find the average of all integers associated with the same key.

Computing the Mean: Version 1
Why can't we use the reducer as the combiner?

Computing the Mean: Version 2
Why doesn't this work?

Computing the Mean: Version 3
Fixed?

Computing the Mean: Version 4
Are combiners still needed?

Co-occurrence Matrix
Term co-occurrence matrix for a text collection: M is an N x N matrix (N = vocabulary size); Mij is the number of times terms i and j co-occur in some context (for concreteness, let's say context = sentence). Why? Distributional profiles are a way of measuring semantic distance, and semantic distance is useful for many language processing tasks.

MapReduce: Large Counting Problems
A term co-occurrence matrix for a text collection is a specific instance of a large counting problem:
- a large event space (the number of terms)
- a large number of observations (the collection itself)
- goal: keep track of interesting statistics about the events
Basic approach: mappers generate partial counts; reducers aggregate partial counts.
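The combiner-correctness caveat is easiest to see with the computing-the-mean example from the slides above. A small sketch (the numbers are invented for illustration): averaging partial averages gives the wrong answer, while carrying (sum, count) pairs keeps the combiner safe to run 0, 1, or many times.

```python
# Naive: reuse the mean-computing reducer as a combiner. Wrong, because
# the mean of two partial means is not the overall mean.
partial1 = [1, 2, 3]              # mean 2.0 over 3 values
partial2 = [10]                   # mean 10.0 over 1 value
mean_of_means = (2.0 + 10.0) / 2
true_mean = sum(partial1 + partial2) / 4
print(mean_of_means, true_mean)   # 6.0 4.0 -- they disagree

# Fix: make the intermediate value a (sum, count) pair. Summing pairs is
# associative and commutative, so running the combiner any number of
# times cannot change the final result.
def combine(pairs):
    return (sum(s for s, _ in pairs), sum(c for _, c in pairs))

s, c = combine([combine([(sum(partial1), len(partial1))]),
                (sum(partial2), len(partial2))])
print(s / c)                      # 4.0 -- the true mean
```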

How do we aggregate partial counts efficiently?

First Try: "Pairs"
Each mapper takes a sentence:
- generate all co-occurring term pairs
- for all pairs, emit (a, b) -> count
Reducers sum up the counts associated with these pairs. Use combiners!

Pairs: Pseudo-Code

"Pairs" Analysis
- Advantages: easy to implement, easy to understand
- Disadvantages: lots of pairs to sort and shuffle around (upper bound?); not many opportunities for combiners to work

Another Try: "Stripes"
Idea: group pairs together into an associative array. Each mapper takes a sentence:
- generate all co-occurring term pairs
- for each term a, emit a -> { b: count_b, c: count_c, d: count_d, ... }
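The two mapper designs can be sketched in a few lines of Python (illustrative only; this is not the slides' pseudocode):

```python
from collections import Counter

def pairs_mapper(sentence):
    """'Pairs': emit ((a, b), 1) for every co-occurring term pair."""
    terms = sentence.split()
    out = []
    for i, a in enumerate(terms):
        for b in terms[:i] + terms[i + 1:]:   # every other term in the sentence
            out.append(((a, b), 1))
    return out

def stripes_mapper(sentence):
    """'Stripes': for each term a, emit a -> {b: count_b, ...}."""
    terms = sentence.split()
    return [(a, Counter(terms[:i] + terms[i + 1:]))
            for i, a in enumerate(terms)]

def stripes_reducer(term, stripes):
    """Element-wise sum of associative arrays."""
    total = Counter()
    for s in stripes:
        total.update(s)
    return term, total

print(pairs_mapper("a b c"))
# [(('a', 'b'), 1), (('a', 'c'), 1), (('b', 'a'), 1),
#  (('b', 'c'), 1), (('c', 'a'), 1), (('c', 'b'), 1)]

# Element-wise sum of two stripes for 'a', as in the slides' example.
print(stripes_reducer("a", [Counter(b=1, d=5, e=3),
                            Counter(b=1, c=2, d=2, f=2)]))
```

Note how the stripe for a term is itself a partial result: the same key 'a' carries many counts at once, which is exactly what gives combiners more to work with.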

Reducers perform an element-wise sum of the associative arrays:

    (a, b) -> 1, (a, c) -> 2, (a, d) -> 5, (a, e) -> 3, (a, f) -> 2
    becomes  a -> { b: 1, c: 2, d: 5, e: 3, f: 2 }

      a -> { b: 1,       d: 5, e: 3       }
    + a -> { b: 1, c: 2, d: 2,       f: 2 }
    = a -> { b: 2, c: 2, d: 7, e: 3, f: 2 }

Key: a cleverly-constructed data structure brings together partial results.

Stripes: Pseudo-Code

"Stripes" Analysis
- Advantages: far less sorting and shuffling of key-value pairs; can make better use of combiners
- Disadvantages: more difficult to implement; the underlying object is more heavyweight; fundamental limitation in terms of the size of the event space

Experimental setup: cluster size of 38 cores; data source: the Associated Press Worldstream (APW) portion of the English Gigaword Corpus (v3), which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed).

Relative Frequencies
How do we estimate relative frequencies from counts? Why do we want to do this? How do we do this with MapReduce?

f(B|A): "Stripes"
Easy! One pass to compute (a, *), and another pass to directly compute f(B|A).

    a -> { b1: 3, b2: 12, b3: 7, b4: 1, ... }

f(B|A): "Pairs"
For this to work:
- must emit an extra (a, *) for every bn in the mapper
- must make sure all a's get sent to the same reducer (use a partitioner)
- must make sure (a, *) comes first (define the sort order)
- must hold state in the reducer across different key-value pairs

    (a, *)  -> 32    (the reducer holds this value in memory)
    (a, b1) -> 3     becomes  (a, b1) -> 3 / 32
    (a, b2) -> 12    becomes  (a, b2) -> 12 / 32
    (a, b3) -> 7     becomes  (a, b3) -> 7 / 32
    (a, b4) -> 1     becomes  (a, b4) -> 1 / 32

"Order Inversion"
A common design pattern: computing relative frequencies requires the marginal counts, but the marginal cannot be computed until you see all the counts, and buffering is a bad idea! The trick: get the marginal counts to arrive at the reducer before the joint counts. Optimizations: apply the in-memory combining pattern to accumulate the marginal counts. Should we apply combiners?

Synchronization: Pairs vs. Stripes
Approach 1: turn synchronization into an ordering problem.
- Sort keys into the correct order of computation.
- Partition the key space so that each reducer gets the appropriate set of partial results.
- Hold state in the reducer across multiple key-value pairs to perform the computation.
Illustrated by the "pairs" approach.
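The order-inversion trick above can be simulated compactly (the function names and the toy counts are invented for illustration): a custom sort order delivers the marginal (a, *) before any joint (a, b), and the reducer holds it in memory.

```python
STAR = "*"

def sort_key(item):
    (a, b), _ = item
    # False sorts before True, so (a, *) precedes every (a, b):
    # the marginal count arrives at the reducer first.
    return (a, b != STAR, b)

def relative_frequencies(pair_counts):
    """Reducer-side pass: divide each joint count by the marginal."""
    freqs = {}
    marginal = None                     # state held across key-value pairs
    for (a, b), count in sorted(pair_counts.items(), key=sort_key):
        if b == STAR:
            marginal = count            # seen before any (a, b) pair
        else:
            freqs[(a, b)] = count / marginal
    return freqs

counts = {("a", "b1"): 3, ("a", "b2"): 12, ("a", STAR): 15}
print(relative_frequencies(counts))
# {('a', 'b1'): 0.2, ('a', 'b2'): 0.8}
```

In Hadoop the same effect would come from a custom sort comparator plus a partitioner on the left element of the pair; here a single sorted list stands in for both.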

Approach 2: construct data structures that bring partial results together.
- Each reducer receives all the data it needs to complete the computation.
Illustrated by the "stripes" approach.

Secondary Sorting
MapReduce sorts input to reducers by key; values may be arbitrarily ordered.

What if we want to sort the values as well? E.g., k -> (v1, r), (v3, r), (v4, r), (v8, r).

Secondary Sorting: Solutions
- Solution 1: buffer values in memory, then sort. Why is this a bad idea?
- Solution 2: the "value-to-key conversion" design pattern: form a composite intermediate key, (k, v1); let the execution framework do the sorting; preserve state across multiple key-value pairs to handle processing. Anything else we need to do?

Recap: Tools for Synchronization
- Preserving state in mappers and reducers: capture dependencies across multiple keys and values
- Cleverly-constructed data structures: bring data together
- Sort order of intermediate keys: control the order in which reducers process keys
- Partitioner: control which reducer processes which keys

Issues and Tradeoffs
- Number of key-value pairs: object creation overhead; time for sorting and shuffling pairs across the network
- Size of each key-value pair: de/serialization overhead
- Local aggregation: opportunities to perform local aggregation vary; combiners make a big difference; combiners vs. in-mapper combining; RAM vs. disk vs. network

Debugging at Scale
Works on small datasets, but won't scale. Why?
- Memory management issues (buffering and object creation)
- Too much intermediate data
- Mangled input records
Real-world data is messy!
- Word count: how many unique words are in Wikipedia?
- There's no such thing as "consistent data"
- Watch out for corner cases
- Isolate unexpected behavior, bring it local

Q&A? Thank you!
