Cloud Data Management: Challenges and Opportunities

Slide 1: Cloud-based Data Management: Challenges & Opportunities
Jiaheng Lu (陆嘉恒), 2009-08-25
Institute of Software, Chinese Academy of Sciences; Renmin University of China

Slide 2: Research experience and interests
- National University of Singapore, PhD
  - XML query processing and XML keyword search
- University of California, Irvine, Postdoc
  - Approximate string processing
  - Data integration and data cleaning
- Renmin University of China
  - Cloud data management
  - XML data management

Slide 3: Outline
- Motivation: cloud data management
- Database future and challenges:
  - Large-scale data management & transaction processing
  - Cloud-based data indexing and query optimization
- Recent research work:
  - An efficient multi-dimensional index for cloud data management (CIKM Workshop CloudDB 2009)

Slide 4: Motivation: Internet Chatter

Slide 5: BLOG Wisdom
- “If you want vast, on-demand scalability, you need a non-relational database.”
- Scalability requirements:
  - Can change very quickly, and
  - Can grow very rapidly.
  - Are difficult to manage with a single in-house RDBMS server.
- RDBMSs scale well:
  - When limited to a single node.
  - Scaling across multiple server nodes brings overwhelming complexity.

Slide 6: Current State
- Most enterprise solutions are based on RDBMS technology.
- Significant operational challenges:
  - Provisioning for peak demand
  - Resource under-utilization
  - Capacity planning: too many variables
  - Storage management: a massive challenge
  - System upgrades: extremely time-consuming

Slide 7: Internet Search Data Analytics: A Case Study
- Data analytics:
  - Parsed web logs ingested into an RDBMS store.
  - Hourly and daily summarization for custom reporting.
- Operational nightmare:
  - Keeping the live reporting system on at all costs and at all times.
  - Timely completion of hourly summarization.
  - Constant tension between the ad-hoc workload and the reporting workload.
  - Data-driven feedback to live products.
  - Temporal depth of detailed data.

Slide 8: Internet Search Data Analytics: A Case Study
- Various solutions explored:
  - Data warehousing appliance for fast summarization.
  - Parallel RDBMS technology for fast ad-hoc queries.
  - Business intelligence products (data cubes) for fast and intuitive reporting and analysis.
- None of the solutions was completely satisfactory:
  - Plans to migrate low-level data to a file-based
    system to overcome database scalability bottlenecks.

Slide 9: Paradigm Shift in Computing

Slide 10: The Web is replacing the desktop

Slide 11: What is Cloud Computing?
- Old idea: Software as a Service (SaaS)
  - Def: delivering applications over the Internet
- Recently: “[Hardware, Infrastructure, Platform] as a service”
  - Poorly defined, so we avoid all “X as a service”
- Utility computing: pay-as-you-go computing
  - Illusion of infinite resources
  - No up-front cost
  - Fine-grained billing (e.g. hourly)

Slide 12: Why Now?
- Experience with very large datacenters
  - Unprecedented economies of scale
- Other factors
  - Pervasive broadband Internet
  - Pay-as-you-go billing model

Slide 13: Cloud Computing Spectrum
- Instruction set VM (Amazon EC2, 3Tera)
- Framework VM
  - Google AppEngine, F

Slide 14: Cloud Killer Apps
- Mobile and web applications
- Extensions of desktop software
  - Matlab, Mathematica
- Batch processing/MapReduce

Slide 15: Economics of Cloud Users
- Pay by use instead of provisioning for peak

Slide 16: Economics of Cloud Users
- Risk of over-provisioning: underutilization

Slide 17: Economics of Cloud Users
- Heavy penalty for under-provisioning

Slide 18: Economics of Cloud Providers
- 5-7X economies of scale [Hamilton 2008]
- Extra benefits
  - Amazon: utilize off-peak capacity
  - Microsoft: sell .NET tools
  - Google: reuse existing infrastructure

Slide 19: Engineering Definition
- Providing services on virtual machines allocated on top of a large physical machine pool.

Slide 20: Business Definition
- A method to address scalability and availability concerns for large-scale applications.

Slide 21: Data Management in the Cloud?

Slide 22: Cloud Computing Implications on DBMSs
- Where do databases fit in this paradigm?
- Generational reality:
  - A
    - Started with 50 servers on Amazon EC2
    - Growth of 25,000 users/hour
    - Need to scale to 3,500 servers in 2 days.
- Many similar stories:
  - RightScale
  - Joyent

Slide 23: Clouded Data?
- Reality Number 1: Unlimited processing assumption
  - Interactive page views:
    - By targeting a large number of SQL
      queries against MySQL
    - Still expect sub-millisecond object retrieval
- Reality Number 2:
  - Why can't the database tier be replicated in the same way as the web server and app server can?
- These are the major challenges for data management in the cloud.

Slide 24: The Vision
- R&D challenges at the macro level:
  - Where and how does the DBMS fit into this model?
- R&D challenges at the micro level:
  - Specific technology components that must be developed to enable the migration of enterprise data into the clouds.

Slide 25: Data and Networks: Attempt
- Distributed Database (1980s):
  - Idealized view: unified access to distributed data
  - Prohibitively expensive: global synchronization
  - Remained a laboratory prototype:
    - Associated technology widely in use: 2PC

Slide 26: Data and Networks: Attempt

Slide 27: Data and Networks: Pragmatics

Slide 28: Database on S3: SIGMOD08
- Amazon's Simple Storage Service (S3):
  - Updates may not preserve initiation order
  - No “force” writes
  - Eventual guarantee
- Proposed solution:
  - Pending update queue
  - Checkpoint protocol to ensure consistent ordering
  - ACID: only Atomicity + Durability

Slide 29: Unbundling Txns in the Cloud
- Research results:
  - CIDR09 proposal to unbundle transaction management for cloud infrastructures
  - Attempts to refit the DBMS engine in the cloud storage
    and computing

Slide 30: Analytical Processing

Slide 31: Architectural and System Impacts
- Current state:
  - MapReduce paradigm for data analysis
- What is missing:
  - Auxiliary structures and indexes for associative access to data (i.e., attribute-based access)
  - Caveat: inherent inconsistency and approximation
- Future projection:
  - Eventual merger of databases (ODSs) and data warehouses as we have learned to use and implement them.

Slide 32: Underlying Principles: CIDR2009
- Business data may not always reflect the state of the world or the business:
  - Inherent lack of perfect information
- Secondary data need not be updated with primary data:
  - Inherent latency
- Transactions/events may temporarily violate integrity constraints:
  - Referential integrity may need to be compromised

Slide 33: Data Security & Privacy
- Data privacy remains a show-stopper in the context of database outsourcing.
- Encryption-based solutions are too expensive and are projected to remain so for the foreseeable future:
  - Private Information Retrieval (Sion2008)
- Other approaches:
  - Information-theoretic approaches that use data partitioning for security (Emekci2007)
  - Hardware-based solutions for information security

Slide 34: Self management and self tuning in cloud-based data management
- Self management and
  self tuning
- Query optimization on thousands of nodes

Slide 35: Remarks
- Data management for cloud computing poses a fundamental challenge to database researchers:
  - Scalability
  - Reliability
  - Data consistency
- Radically different approaches and solutions are warranted to overcome this challenge:
  - Need to understand the nature of new applications

Slide 36: References
- P. Helland, “Life Beyond Distributed Transactions: An Apostate's Opinion,” CIDR07
- M. Brantner, D. Florescu, D. Graf, D. Kossmann, T. Kraska, “Building a Database on S3,” SIGMOD08
- D. Lomet, A. Fekete, G. Weikum, M. Zwilling, “Unbundling Transaction Services in the Cloud,” CIDR09
- S. Finkelstein, R. Brendle, D. Jacobs, “Principles of Inconsistency,” CIDR09
- VLDB Database School (China) 2009 http:/

Slide 37: Efficient Multi-Dimensional Index for Cloud Data Management
CIKM workshop CloudDB09

Slide 38: Outline
- INTRODUCTION
- MULTI-DIMENSIONAL INDEX WITH KDTREE AND RTREE
- Extended Nodes partition
  - Node partition
  - Cost Estimation Strategy
- EVALUATION

Slide 39: Cloud Computing
- Google File System
- Yahoo PNUTS

Slide 40: Distributed Cloud base?
- BigTable, HBase
- How to query on attributes other than the primary key?

Slide 41: Distributed Index: Single Dimension?
- S. Wu and K.-L. Wu, “An indexing framework for efficient retrieval on the cloud,” IEEE Data Eng. Bull., vol. 32, pp. 75-82, 2009.
- H.-c. Yang and D. S. Parker, “Traverse: Simplified indexing on large map-reduce-merge clusters,” in Proceedings of DASFAA 2009, Brisbane, Australia, April 2009, pp. 308-322.
- M. K. Aguilera, W. Golab, and M. A. Shah, “A practical scalable distributed B-tree,” in Proceedings of VLDB08, Auckland, New Zealand, August 2008, pp. 598-609.

Slide 42: Outline
- INTRODUCTION
- MULTI-DIMENSIONAL INDEX WITH KDTREE AND RTREE
- Extended Nodes partition
  - Node partition
  - Cost Estimation Strategy
- EVALUATION

Slide 43: Framework of Request Processing in Cloud

Slide 44: R-Tree
- An R-tree is a tree data structure that is similar to
  a B-tree, but is used for spatial access methods.

Slide 45: KD-Tree
- A kd-tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space.

Slide 46: R-Tree & KD-Tree: RKDTree
- Figure: one master node and five slave nodes, each slave labelled with the two-dimensional range it covers:
  - range: 0-2000, 500-1200
  - range: 800-3500, 300-1300
  - range: 6300-7000, 599-1400
  - range: 2000-40000, 3400-8900
  - range: 6800-9000, 3400-8900

Slide 47: Outline
- INTRODUCTION
- MULTI-DIMENSIONAL INDEX WITH KDTREE AND RTREE
- Extended Nodes partition
  - Node partition
  - Cost Estimation Strategy
- EVALUATION

Slide 48: Nodes partition for data summary
- Random cutting: pick several random values on the attribute and cut at those points. The random method may give very good performance, but it may also give poor performance.
- Equal cutting: cut the attribute into several equal intervals. This method is relatively stable, since no extreme case can occur.
- Clustering-based cutting: cluster the values on the attribute and cut between clusters. This method can be expected to perform better, but its time cost is also noticeably higher: the time complexity of a clustering algorithm is typically O(n log n) or higher.

Slide 49: Nodes partition
- Figure panels: random cutting, equal cutting, clustering-based cutting

Slide 51: Dynamic maintenance of Indexes
- Update of node cube:
  - Why? The data distribution in the node cube has changed “greatly”, leaving the cube sparse or greatly uneven.
  - How? Reorganize the nodes partition.
  - When? A two-phase approach:
    - After each update, compute the minimal T for the next update.
    - When T expires, check whether an update is needed.

Slide 52: Dynamic maintenance of Indexes
- Basic idea: $\text{benefit} \ge \text{cost}$
- The volume of a node cube is defined as the number of combinations of records that can be made out of the cube. It can be calculated as the product of the lengths of all the intervals. We denote the volume of a cube by $v$.
- For the cube ([1, 11], [2, 5]),
  the volume is $(11 - 1) \times (5 - 2) = 30$.

Slide 53: Dynamic maintenance of Indexes
- Assumption: the number of queries forwarded to each slave node is proportional to the total volume of all the node cubes of that slave node.

Slide 54: Dynamic maintenance of Indexes
- $\text{benefit} = (\Delta v / v) \cdot nq \cdot T$
  - $\Delta v$: decrement of volume after the update
  - $nq$: number of queries this node must process before the update
- $\text{cost} = mt / qt$
  - $mt$: time cost of the last update
  - $qt$: time needed to process one query
- $\text{benefit} \ge \text{cost} \implies T \ge (mt \cdot v) / (qt \cdot \Delta v \cdot nq)$

Slide 55: Dynamic maintenance of Indexes
- After T expires, check whether an update is needed. This check involves the following:
  - Record update frequency
  - Expected benefit ratio
  - Performance requirement
- We leave this as future work.

Slide 56: Experimental Setup
- 6 machines
  - 1 master
  - 5 slaves: 100-1000 nodes
- Each machine had a 2.33 GHz Intel Core2 Quad CPU, 4 GB of main memory, and a 320 GB disk.
- Machines ran Ubuntu 9.04 Server OS.

Slide 57: Point Query Experiment Results
- Figure: time cost (ms) against number of nodes (100 to 1000) for Scan Table, RKDTree, NBRKDTree (Random), NBRKDTree (Equal), and NBRKDTree (K-Means)

Slide 58: Range Query Experiment Results
- Figure: time cost (ms) against number of nodes (100 to 1000) for Scan Table, RKDTree, NBRKDTree (Random), NBRKDTree (Equal), and NBRKDTree (K-Means)
- Result Cover Rate: one ten-thousandth

Slide 59: Conclusions
- In this paper we presented a series of approaches for building an efficient multi-dimensional index on a cloud platform.
- We used a combination of R-tree and KD-tree to support the index structure.
- We developed the node partition technique to reduce query processing cost on the cloud platform.
- To maintain the efficiency of the index, we proposed a cost-estimation-based approach for index updates.

Slide 60: Future work
- Better node partition algorithms
- Improve the estimation-based approach
- Consider multiple replicas of data

Slide 61: Thank you. Questions and discussion are welcome!

Slide 62: Backup (1)
- Figure: six data series (Series1 to Series6); Result Cover Rate: one thousandth

Slide 63: Backup (2)
- Figure: six data series (Series1 to Series6); Result Cover Rate: one thousandth
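
Worked sketch for the "Dynamic maintenance of Indexes" slides: the node-cube volume and the minimal update interval T can be computed as below. This is only an illustration under stated assumptions. The slides lost their comparison symbol, so "benefit >= cost" is a reconstructed reading; the function names `cube_volume` and `minimal_interval` and all numbers other than the ([1, 11], [2, 5]) cube are hypothetical.

```python
from math import prod


def cube_volume(cube):
    """Volume of a node cube: the product of the lengths of its intervals."""
    return prod(high - low for low, high in cube)


def minimal_interval(v, delta_v, nq, mt, qt):
    """Smallest T for which benefit >= cost (assumed reading of the slide).

    benefit = (delta_v / v) * nq * T and cost = mt / qt,
    hence T >= (mt * v) / (qt * delta_v * nq).
    """
    if delta_v <= 0 or nq <= 0 or qt <= 0:
        raise ValueError("delta_v, nq and qt must be positive")
    return (mt * v) / (qt * delta_v * nq)


# The cube from the slides: intervals [1, 11] and [2, 5].
cube = [(1, 11), (2, 5)]
v = cube_volume(cube)  # (11 - 1) * (5 - 2)

# Hypothetical measurements, chosen only to exercise the formula.
t_min = minimal_interval(v=v, delta_v=10, nq=5, mt=6.0, qt=0.5)
print(v, t_min)
```

With these made-up inputs the benefit at `t_min` equals the cost `mt / qt`, which is the break-even point the two-phase approach waits for before checking whether an update is worthwhile.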

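The three cutting strategies from the "Nodes partition for data summary" slide can be sketched on a single attribute as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the clustering-based variant uses a simple widest-gap heuristic as a stand-in for the K-Means clustering named in the experiment slides.

```python
import random


def random_cuts(values, k, seed=None):
    """Random cutting: pick k - 1 random cut points inside the value range."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    return sorted(rng.uniform(lo, hi) for _ in range(k - 1))


def equal_cuts(values, k):
    """Equal cutting: split the value range into k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def gap_cuts(values, k):
    """Clustering-style cutting (stand-in for K-Means): cut in the middle of
    the k - 1 widest gaps between consecutive distinct sorted values."""
    ordered = sorted(set(values))
    gaps = sorted(
        ((b - a, (a + b) / 2) for a, b in zip(ordered, ordered[1:])),
        reverse=True,
    )
    return sorted(mid for _, mid in gaps[: k - 1])


# Hypothetical attribute values forming three obvious clusters.
values = [1, 2, 3, 50, 51, 52, 98, 99, 100]
print(random_cuts(values, 3, seed=0))
print(equal_cuts(values, 3))
print(gap_cuts(values, 3))
```

The gap-based variant is dominated by the sort, so it runs in O(n log n), in line with the slide's remark that clustering costs more than the other two methods.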