Bekeley-Seminar--December-2003

资源描述

《Bekeley-Seminar--December-2003》由会员分享，可在线阅读，更多相关《Bekeley-Seminar--December-2003（28页珍藏版）》请在金锄头文库上搜索。

1、Approximation Techniques for Approximation Techniques for Data Management SystemsData Management Systems “We are drowning in data but “We are drowning in data but starved for knowledge”starved for knowledge” John NaisbittJohn Naisbitt CS 186 Fall 2005 2 Traditional Query ProcessingTraditional Query

2、Processing Exact answers NOT always required DSS applications usually exploratory: early feedback to help identify “interesting” regions Aggregate queries: precision to “last decimal” not needed e.g., “What percentage of the US sales are in NJ?” SQL Query Exact Answer Decision Support Systems (DSS)

3、Long Response Times!GB/TB 3 Primarily for Aggregate queries Goal is to quickly report the leading digits of answers In seconds instead of minutes or hours Most useful if can provide error guarantees E.g., Average salary $59,000 +/- $500 (with 95% confidence) in 10 seconds vs. $59,152.25 in 10 minute

4、s Achieved by answering the query based on compact synopsescompact synopses of the data Speed-up obtained because synopses are orders of magnitude smaller than the original data Fast Approximate AnswersFast Approximate Answers 4 ApproximateApproximate Query Processing Query Processing How do you How

5、 do you build effective data synopses?build effective data synopses? SQL Query Exact Answer Decision Support Systems (DSS) Long Response Times!GB/TB Compact Data Synopses “Transformed” Query KB/MB Approximate Answer FAST! 5 Sampling: BasicsSampling: Basics Idea: A small random sample S of the data o

6、ften well-represents all the data For a fast approx answer, apply the query to S (S) = s.d. for S.A 7 Sampling from DatabasesSampling from Databases Sampling disk-resident data is slow Row-level sampling has high I/O cost: must bring in entire disk block to get the row Block-level sampling: rows may

7、 be highly correlated Random access pattern, possibly via an index Need to account for the variable number of rows in a page, children in an index node, etc. Alternatives Random physical clustering: destroys “natural” clustering Precomputed samples: must incrementally maintain (at specified size) Fa

8、st to use: packed in disk blocks, can sequentially scan, can store as relation and leverage full DBMS query support, can store in main memory 8 One-Pass Uniform SamplingOne-Pass Uniform Sampling Best choice for incremental maintenance Low overheads, no random data access Reservoir Sampling Vit85: Ma

9、intains a sample S of a fixed-size MMaintains a sample S of a fixed-size M Add each new item to S with probability M/N, where N is the current number of data items If add an item, evict a random item from S Instead of flipping a coin for each item, determine the number of items to skip before the ne

10、xt to be added to S 9 HistogramsHistograms Partition attribute value(s) domain into a set of buckets Issues: How to partition What to store for each bucket How to estimate an answer using the histogram Long history of use for selectivity estimation within a query optimizer Recently explored as a too

11、l for fast approximate query processing 10 1-D Histograms1-D Histograms Number of buckets B log(d) 0L-1 25 A Little Streaming PuzzleA Little Streaming Puzzle Input: A stream of N numbers/elements Output: The stream majority element (if one exists) e is a majority element if frequency(e) N/2 Q: How d

12、o you do this in small space? Hint: Use just two memory locations Hint+: Look at this as a “knockout tournament” Feeling adventurous? How do you do the same majority query over a stream of insertions and deletions? Input: Stream of = insert e , = delete e Hint: Use a little more memory 26 In Summary

13、: Not your parents DBMS!In Summary: Not your parents DBMS! Database/data-management research goes far beyond the basics! Extends from distributed systems to theory to approximation algorithms to probability/statistics to Applications: data mining, sensornets, p2p, Just pick up a copy of recent SIGMO

14、D/VLDB proceedings More and more relevant in dealing with the “data tsunami” Data is everywhere! And, its constantly growing in volume! Exciting, relevant research! 27 More detailsMore details http:/www2.berkeley.intel- Tutorial slides on approximate query processing and data streams 28 你只闻闻到我的香水，却没看到我的汗水。你否定我的现现在，我决定我的未来！你嘲笑我一无所有，不配去爱爱，我可怜你总总是等待。你可以轻视轻视我们们的年轻轻，我们们会证证明这这是谁谁的时时代。梦想是注定孤独的旅行，路上少不了质质疑和嘲笑，但那又怎样样？哪怕遍体鳞伤鳞伤，也要活得漂亮！

展开阅读全文