The Little Flying Elephant in the Cloud: Report Series, Part 2

Uploaded by: 大米 | Document ID: 569707506 | Upload date: 2024-07-30 | Format: PPT | Pages: 28 | Size: 1.26 MB

Hadoop in SIGMOD 2011 (Cloud Group)

Outline
- Introduction
- Nova: Continuous Pig/Hadoop Workflows
- Apache Hadoop Goes Realtime at Facebook
- Emerging Trends in the Enterprise Data Analytics
- A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses

Industrial Sessions in SIGMOD 2011
- Data Management for Feeds and Streams (2 papers)
- Dynamic Optimization and Unstructured Content (4)
- Business Analytics (2)
- Support for Business Analytics and Warehousing (4)
- Applying Hadoop (4)

Nova: Continuous Pig/Hadoop Workflows (by Yahoo!)

Nova Overview
Scenarios:
- Ingesting and analyzing user behavior logs
- Building and updating a search index from a stream of crawled web pages
- Processing semi-structured data
Key ideas:
- Two-layer programming model (Nova over Pig)
- Continuous processing
- Independent scheduling
- Cross-module optimization
- Manageability features

Workflow Model
- A workflow has two kinds of vertices: tasks (processing steps) and channels (data containers)
- Edges connect tasks to channels and channels to tasks
- Four common patterns of processing:
  - Non-incremental (template detection)
  - Stateless incremental (shingling)
  - Stateless incremental with lookup table (template tagging)
  - Stateful incremental (de-duping)

Data and Update Model
- Blocks: a channel's data is divided into blocks
- Base block: contains a complete snapshot of the data on a channel as of some point in time; base blocks are assigned increasing sequence numbers (B0, B1, B2, ..., Bn)
- Delta block: used in conjunction with incremental processing; contains instructions for transforming a base block into a new base block (e.g. Δ0→1 transforms B0 into B1)

Task/Data Interface
- Consumption mode: all or new
- Production mode: B (base) or Δ (delta)

Workflow Programming and Scheduling
- Data-based trigger
- Time-based trigger
- Cascade trigger

Data Compaction and Garbage Collection
- If a channel has blocks B0, Δ0→1, Δ1→2, Δ2→3, the compaction operation computes and adds B3 to the channel
- After compaction has added B3 to the channel and the current cursor is at sequence number 2, then B0, Δ0→1, and Δ1→2 can be garbage-collected

Nova System Architecture (architecture diagram; figure omitted)

Apache Hadoop Goes Realtime at Facebook (by Facebook)

Workload Types
- Facebook Messaging: high write throughput; large tables; data migration
- Facebook Insights: realtime analytics; high-throughput increments
- Facebook Metrics System (ODS): automatic sharding; fast reads of recent data and table scans

Why Hadoop & HBase
- Elasticity
- High write throughput
- Efficient, low-latency strong consistency semantics within a data center
- Efficient random reads from disk
- High availability and disaster recovery
- Fault isolation
- Atomic read-modify-write primitives
- Range scans
- Tolerance of network partitions within a single data center
- Zero downtime in case of individual data center failure
- Active-active serving capability across different data centers

Realtime HDFS
- High availability: AvatarNode
- Hadoop RPC compatibility
- Block availability: a pluggable block placement policy
- Performance improvements for a realtime workload: RPC timeout; reads from local replicas
- New features: HDFS sync; concurrent readers

Production HBase
- ACID compliance (RWCC: Read Write Consistency Control); atomicity (WALEdit); consistency
- Availability improvements: HBase master rewrite (region assignment kept in ZooKeeper rather than in master memory); online upgrades; distributed log splitting
- Performance improvements: compaction (minor and major); read optimizations

Emerging Trends in the Enterprise Data Analytics: Connecting Hadoop and DB2 Warehouse (by IBM)

Motivation
1. Increasing volumes of data
2. Hadoop-based solutions used in conjunction with data warehouses

A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses (by Teradata)

Motivation
- ETL (Extraction, Transformation, Loading) is a critical part of data warehousing
- While data are partitioned and replicated across all nodes in a parallel data warehouse, load utilities reside on a single node (a bottleneck)

Why Hadoop for Teradata EDW (Enterprise Data Warehouse)?
- More disk space can be easily added
- Use as intermediate storage
- MapReduce for transformation
- Load data in parallel

Block Assignment Problem
An HDFS file F is stored on a cluster of P nodes, each node uniquely identified by an integer i with 1 ≤ i ≤ P. The problem is defined by assignment(X, Y, n, m, k, r), where:
- X = {1, ..., n} is the set of n blocks of F
- Y ⊆ {1, ..., P} is the set of m nodes running the PDBMS (the PDBMS nodes)
- k is the number of copies (replicas) of each block
- r is the mapping recording the replicated block locations: r(i) returns the set of nodes holding a copy of block i

An assignment g from the blocks in X to the nodes in Y is a mapping from X = {1, ..., n} to Y, where g(i) = j (i ∈ X, j ∈ Y) means that block i is assigned to node j. An assignment g is even if for all i ∈ Y and j ∈ Y:
  | |{x | 1 ≤ x ≤ n ∧ g(x) = i}| − |{y | 1 ≤ y ≤ n ∧ g(y) = j}| | ≤ 1.
The cost of an assignment g is cost(g) = |{i | g(i) ∉ r(i), 1 ≤ i ≤ n}|, i.e., the number of blocks assigned to remote nodes (nodes holding no local copy).
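The cost and evenness definitions above translate directly into code. A minimal sketch (the function names and the small example instance are mine, not Teradata's):

```python
from collections import Counter

def cost(g, r):
    """cost(g): number of blocks assigned to a node without a local replica.
    g maps block -> assigned node; r maps block -> set of replica-holding nodes."""
    return sum(1 for i, node in g.items() if node not in r[i])

def is_even(g, Y):
    """g is even if per-node block counts (over all PDBMS nodes Y,
    including nodes assigned zero blocks) differ by at most one."""
    counts = Counter(g.values())
    loads = [counts[j] for j in Y]
    return max(loads) - min(loads) <= 1

# Example instance: n = 6 blocks, m = 3 PDBMS nodes, k = 2 replicas per block.
Y = {1, 2, 3}
r = {1: {1, 2}, 2: {1, 3}, 3: {2, 3},
     4: {1, 3}, 5: {2, 3}, 6: {1, 3}}
g = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}   # candidate assignment
# Each node receives exactly 2 blocks, so g is even; only block 4 is
# assigned to node 2, which holds no replica of it, so cost(g) = 1.
```

A solver for the full problem would search for an even assignment of minimum cost; the sketch only evaluates a given candidate.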
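Returning to Nova's data and update model: the base/delta block scheme, compaction, and garbage collection described above can be sketched as follows. This is an illustrative toy, not Nova's actual API; it models delta blocks as simple appended records, whereas real deltas carry general transformation instructions.

```python
class Channel:
    """Toy Nova-style channel: base blocks are full snapshots keyed by
    sequence number; delta blocks transform one snapshot into the next."""

    def __init__(self, base):
        self.bases = {0: list(base)}   # seq number -> snapshot (B0, B1, ...)
        self.deltas = {}               # (from_seq, to_seq) -> appended records

    def append_delta(self, from_seq, to_seq, records):
        self.deltas[(from_seq, to_seq)] = list(records)

    def compact(self, to_seq):
        """Compute a new base block B_to_seq by applying the chain of
        delta blocks to the latest base at or below to_seq."""
        start = max(s for s in self.bases if s <= to_seq)
        snapshot = list(self.bases[start])
        for s in range(start, to_seq):
            snapshot += self.deltas[(s, s + 1)]   # apply Δ s -> s+1
        self.bases[to_seq] = snapshot

    def garbage_collect(self, cursor):
        """Drop blocks no consumer can still need: bases below the first
        base at or above the cursor, and deltas ending at or below it."""
        keep = min(s for s in self.bases if s >= cursor)
        self.bases = {s: b for s, b in self.bases.items() if s >= keep}
        self.deltas = {k: d for k, d in self.deltas.items() if k[1] > cursor}

# The slide's scenario: channel holds B0, Δ0→1, Δ1→2, Δ2→3.
ch = Channel(base=["a"])
ch.append_delta(0, 1, ["b"])
ch.append_delta(1, 2, ["c"])
ch.append_delta(2, 3, ["d"])
ch.compact(3)                  # adds B3 = B0 + Δ0→1 + Δ1→2 + Δ2→3
ch.garbage_collect(cursor=2)   # B0, Δ0→1, Δ1→2 are collected; Δ2→3 survives
```

After collection, a consumer at cursor 2 can still read incrementally via Δ2→3 or do a full read from B3, which is exactly why only the blocks below the cursor are collectible.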
