The Little Flying Elephant in the Cloud: Report Series, Part 2

Uploaded by: 大米 | Document ID: 569707506 | Upload date: 2024-07-30 | Format: PPT | Pages: 28 | Size: 1.26 MB

Hadoop in SIGMOD 2011 (Cloud Group)

Outline
- Introduction
- Nova: Continuous Pig/Hadoop Workflows
- Apache Hadoop Goes Realtime at Facebook
- Emerging Trends in the Enterprise Data Analytics
- A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses

Industrial Sessions in SIGMOD 2011
- Data Management for Feeds and Streams (2 papers)
- Dynamic Optimization and Unstructured Content (4)
- Business Analytics (2)
- Support for Business Analytics and Warehousing (4)
- Applying Hadoop (4)

Nova: Continuous Pig/Hadoop Workflows (by Yahoo!)

Nova Overview
Scenarios:
- Ingesting and analyzing user behavior logs
- Building and updating a search index from a stream of crawled web pages
- Processing semi-structured data
Key ideas:
- Two-layer programming model (Nova over Pig)
- Continuous processing
- Independent scheduling
- Cross-module optimization
- Manageability features

Workflow Model
- A workflow has two kinds of vertices: tasks (processing steps) and channels (data containers)
- Edges connect tasks to channels and channels to tasks
- Four common patterns of processing:
  - Non-incremental (template detection)
  - Stateless incremental (shingling)
  - Stateless incremental with lookup table (template tagging)
  - Stateful incremental (de-duping)

Data and Update Model
- Blocks: a channel's data is divided into blocks
- Base block: contains a complete snapshot of the data on a channel as of some point in time; base blocks are assigned increasing sequence numbers (B0, B1, B2, ..., Bn)
- Delta block: used in conjunction with incremental processing; contains instructions for transforming a base block into a new base block (e.g. Δ0→1 transforms B0 into B1)

Task/Data Interface
- Consumption mode: all or new
- Production mode: B (base) or Δ (delta)

Workflow Programming and Scheduling
- Data-based trigger
- Time-based trigger
- Cascade trigger

Data Compaction and Garbage Collection
- If a channel has blocks B0, Δ0→1, Δ1→2, Δ2→3, the compaction operation computes and adds B3 to the channel
- After compaction has added B3 to the channel and the current cursor is at sequence number 2, then B0, Δ0→1, and Δ1→2 can be garbage-collected

Nova System Architecture (architecture diagram; figure omitted)

Apache Hadoop Goes Realtime at Facebook (by Facebook)

Workload Types
- Facebook Messaging: high write throughput; large tables; data migration
- Facebook Insights: realtime analytics; high-throughput increments
- Facebook Metrics System (ODS): automatic sharding; fast reads of recent data and table scans

Why Hadoop & HBase
- Elasticity
- High write throughput
- Efficient, low-latency strong consistency semantics within a data center
- Efficient random reads from disk
- High availability and disaster recovery
- Fault isolation
- Atomic read-modify-write primitives
- Range scans
- Tolerance of network partitions within a single data center
- Zero downtime in case of individual data center failure
- Active-active serving capability across different data centers

Realtime HDFS
- High availability: AvatarNode
- Hadoop RPC compatibility
- Block availability: a pluggable block placement policy
- Performance improvements for a realtime workload: RPC timeout; reads from local replicas
- New features: HDFS sync; concurrent readers

Production HBase
- ACID compliance (RWCC: Read Write Consistency Control); atomicity (WALEdit); consistency
- Availability improvements: HBase master rewrite (region assignment kept in ZooKeeper rather than in master memory); online upgrades; distributed log splitting
- Performance improvements: compaction (minor and major); read optimizations

Emerging Trends in the Enterprise Data Analytics: Connecting Hadoop and DB2 Warehouse (by IBM)

Motivation
1. Increasing volumes of data
2. Hadoop-based solutions used in conjunction with data warehouses

A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses (by Teradata)

Motivation
- ETL (Extraction, Transformation, Loading) is a critical part of data warehousing
- While data are partitioned and replicated across all nodes in a parallel data warehouse, load utilities reside on a single node (a bottleneck)

Why Hadoop for Teradata EDW (Enterprise Data Warehouse)?
- More disk space can be easily added
- Use as intermediate storage
- MapReduce for transformation
- Load data in parallel

Block Assignment Problem
An HDFS file F is stored on a cluster of P nodes, each node uniquely identified by an integer i with 1 ≤ i ≤ P. The problem is defined by assignment(X, Y, n, m, k, r), where:
- X = {1, ..., n} is the set of n blocks of F
- Y ⊆ {1, ..., P} is the set of m nodes running the PDBMS (the PDBMS nodes)
- k is the number of copies (replicas) of each block
- r is the mapping recording the replicated block locations: r(i) returns the set of nodes holding a copy of block i

An assignment g from the blocks in X to the nodes in Y is a mapping from X = {1, ..., n} to Y, where g(i) = j (i ∈ X, j ∈ Y) means that block i is assigned to node j. An assignment g is even if for all i ∈ Y and j ∈ Y:
  | |{x | 1 ≤ x ≤ n ∧ g(x) = i}| − |{y | 1 ≤ y ≤ n ∧ g(y) = j}| | ≤ 1.
The cost of an assignment g is cost(g) = |{i | g(i) ∉ r(i), 1 ≤ i ≤ n}|, i.e., the number of blocks assigned to remote nodes (nodes holding no local copy).
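The cost and evenness definitions above translate directly into code. A minimal sketch (the function names and the small example instance are mine, not Teradata's):

```python
from collections import Counter

def cost(g, r):
    """cost(g): number of blocks assigned to a node without a local replica.
    g maps block -> assigned node; r maps block -> set of replica-holding nodes."""
    return sum(1 for i, node in g.items() if node not in r[i])

def is_even(g, Y):
    """g is even if per-node block counts (over all PDBMS nodes Y,
    including nodes assigned zero blocks) differ by at most one."""
    counts = Counter(g.values())
    loads = [counts[j] for j in Y]
    return max(loads) - min(loads) <= 1

# Example instance: n = 6 blocks, m = 3 PDBMS nodes, k = 2 replicas per block.
Y = {1, 2, 3}
r = {1: {1, 2}, 2: {1, 3}, 3: {2, 3},
     4: {1, 3}, 5: {2, 3}, 6: {1, 3}}
g = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}   # candidate assignment
# Each node receives exactly 2 blocks, so g is even; only block 4 is
# assigned to node 2, which holds no replica of it, so cost(g) = 1.
```

A solver for the full problem would search for an even assignment of minimum cost; the sketch only evaluates a given candidate.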
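Returning to Nova's data and update model: the base/delta block scheme, compaction, and garbage collection described above can be sketched as follows. This is an illustrative toy, not Nova's actual API; it models delta blocks as simple appended records, whereas real deltas carry general transformation instructions.

```python
class Channel:
    """Toy Nova-style channel: base blocks are full snapshots keyed by
    sequence number; delta blocks transform one snapshot into the next."""

    def __init__(self, base):
        self.bases = {0: list(base)}   # seq number -> snapshot (B0, B1, ...)
        self.deltas = {}               # (from_seq, to_seq) -> appended records

    def append_delta(self, from_seq, to_seq, records):
        self.deltas[(from_seq, to_seq)] = list(records)

    def compact(self, to_seq):
        """Compute a new base block B_to_seq by applying the chain of
        delta blocks to the latest base at or below to_seq."""
        start = max(s for s in self.bases if s <= to_seq)
        snapshot = list(self.bases[start])
        for s in range(start, to_seq):
            snapshot += self.deltas[(s, s + 1)]   # apply Δ s -> s+1
        self.bases[to_seq] = snapshot

    def garbage_collect(self, cursor):
        """Drop blocks no consumer can still need: bases below the first
        base at or above the cursor, and deltas ending at or below it."""
        keep = min(s for s in self.bases if s >= cursor)
        self.bases = {s: b for s, b in self.bases.items() if s >= keep}
        self.deltas = {k: d for k, d in self.deltas.items() if k[1] > cursor}

# The slide's scenario: channel holds B0, Δ0→1, Δ1→2, Δ2→3.
ch = Channel(base=["a"])
ch.append_delta(0, 1, ["b"])
ch.append_delta(1, 2, ["c"])
ch.append_delta(2, 3, ["d"])
ch.compact(3)                  # adds B3 = B0 + Δ0→1 + Δ1→2 + Δ2→3
ch.garbage_collect(cursor=2)   # B0, Δ0→1, Δ1→2 are collected; Δ2→3 survives
```

After collection, a consumer at cursor 2 can still read incrementally via Δ2→3 or do a full read from B3, which is exactly why only the blocks below the cursor are collectible.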
