Data Mining of Very Large Data

(PPT, 50 slides; uploaded 2024-07-31)

Outline
- Frequent itemsets, market baskets
- A-Priori algorithm
- Hash-based improvements
- One- or two-pass approximations
- High-correlation mining

The Market-Basket Model
- A large set of items, e.g., things sold in a supermarket.
- A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day.
- Problem: find the frequent itemsets: those that appear in at least s ("support") baskets.

Example
- Items = {milk, coke, pepsi, beer, juice}. Support threshold s = 3 baskets.
- B1 = {m, c, b}; B2 = {m, p, j}; B3 = {m, b}; B4 = {c, j}; B5 = {m, p, b}; B6 = {m, p, b, j}; B7 = {c, b, j}; B8 = {b, p}.
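For baskets this small, the frequent itemsets can be verified by brute force. A short Python sketch (exhaustive counting is only feasible because the example is tiny; real data needs the algorithms that follow):

```python
from collections import Counter
from itertools import combinations

# The eight example baskets, support threshold s = 3.
baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "p", "b", "j"}, {"c", "b", "j"}, {"b", "p"}]
s = 3

def frequent_itemsets(baskets, s, max_size=2):
    """Count every subset of each basket up to max_size; keep those in >= s baskets."""
    counts = Counter()
    for basket in baskets:
        for k in range(1, max_size + 1):
            counts.update(combinations(sorted(basket), k))
    return {itemset for itemset, c in counts.items() if c >= s}

# Yields the singletons b, c, j, m, p and the pairs {b,m}, {b,p}, {m,p}.
print(sorted(frequent_itemsets(baskets, s), key=lambda t: (len(t), t)))
```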

- Frequent itemsets: {m}, {c}, {b}, {p}, {j}, {m, b}, {m, p}, {b, p}.

Applications 1
- Real market baskets: chain stores keep terabytes of information about what customers buy together.
- Tells how typical customers navigate stores; lets them position tempting items.
- Suggests tie-in "tricks," e.g., run a sale on hamburger and raise the price of ketchup.

Applications 2
- "Baskets" = documents; "items" = words in those documents.
  - Lets us find words that appear together unusually frequently, i.e., linked concepts.
- "Baskets" = sentences; "items" = documents containing those sentences.
  - Items that appear together too often could represent plagiarism.

Applications 3
- "Baskets" = Web pages; "items" = linked pages.
  - Pairs of pages with many common references may be about the same topic.
- "Baskets" = Web pages p; "items" = pages that link to p.
  - Pages with many of the same links may be mirrors or about the same topic.

Scale of the Problem
- WalMart sells 100,000 items and can store hundreds of millions of baskets.
- The Web has 100,000,000 words and several billion pages.

Computation Model
- Data is stored in a file, basket by basket.
- As we read the file one basket at a time, we can generate all the sets of items in that basket.
- The principal cost of an algorithm is the number of times we must read the file, measured in disk I/Os.
- The bottleneck is often the amount of main memory available on a pass.

A-Priori Algorithm 1
- Goal: find the pairs of items that appear together in at least s baskets.
- The naive algorithm reads the file once, counting the occurrences of each pair in main memory.
- It fails if the square of the number of items exceeds main memory.

A-Priori Algorithm 2
- A two-pass approach called A-Priori limits the need for main memory.
- Key idea: monotonicity: if a set of items appears at least s times, so does every subset.
- Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.

A-Priori Algorithm 3
- Pass 1: read baskets and count the occurrences of each item in main memory.
  - Requires memory proportional only to the number of items.
- Pass 2: read baskets again and count only those pairs both of whose items were found on Pass 1 to have occurred at least s times.
  - Requires memory proportional to the square of the number of frequent items only.

PCY Algorithm 1
- A hash-based improvement to A-Priori.
- During Pass 1 of A-Priori, most memory is idle; use it to keep counts of buckets into which pairs of items are hashed.
  - Just the count, not the pairs themselves.
- This gives an extra condition that candidate pairs must satisfy on Pass 2.

PCY Algorithm 2
[Diagram: Pass 1 memory holds the item counts plus a hash table of bucket counts; Pass 2 memory holds the frequent items, a bitmap summarizing the hash table, and counts of candidate pairs.]

PCY Algorithm 3
- PCY Pass 1: count the items; hash each pair to a bucket and increment its count by 1.
- PCY Pass 2: summarize the buckets by a bitmap: 1 = frequent (count >= s); 0 = not.
  - Count only those pairs that (a) consist of two frequent items and (b) hash to a frequent bucket.
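The two A-Priori passes can be sketched in a few lines of Python (an illustrative in-memory sketch; the real algorithm streams baskets from disk, which is why the number of passes matters):

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Two-pass A-Priori for frequent pairs; each loop over `baskets` models one file read."""
    # Pass 1: count individual items -- memory proportional to the number of items.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent = {item for item, c in item_counts.items() if c >= s}
    # Pass 2: count only pairs of two frequent items -- memory proportional
    # to the square of the number of frequent items.
    pair_counts = Counter()
    for basket in baskets:
        pair_counts.update(combinations(sorted(frequent & set(basket)), 2))
    return {pair for pair, c in pair_counts.items() if c >= s}

# The baskets from the earlier example, support s = 3.
baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "p", "b", "j"}, {"c", "b", "j"}, {"b", "p"}]
print(apriori_pairs(baskets, 3))  # the three frequent pairs: (b,m), (b,p), (m,p)
```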

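PCY's two passes can be sketched the same way (an illustrative sketch: the bucket count of 101 and Python's built-in `hash` are arbitrary stand-ins; a real implementation sizes the hash table to fill spare Pass-1 memory and packs the bitmap into single bits):

```python
from collections import Counter
from itertools import combinations

NUM_BUCKETS = 101  # illustrative; in practice as large as spare Pass-1 memory allows

def pcy_pairs(baskets, s):
    """PCY: Pass 1 counts items and hashed-pair buckets; Pass 2 counts surviving candidates."""
    item_counts = Counter()
    bucket_counts = [0] * NUM_BUCKETS
    # Pass 1: item counts, plus a count per bucket over all pairs in all baskets.
    for basket in baskets:
        item_counts.update(basket)
        for pair in combinations(sorted(basket), 2):
            bucket_counts[hash(pair) % NUM_BUCKETS] += 1
    frequent = {item for item, c in item_counts.items() if c >= s}
    bitmap = [c >= s for c in bucket_counts]  # one bit per bucket in a real implementation
    # Pass 2: count only pairs that are both frequent and hash to a frequent bucket.
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(frequent & set(basket)), 2):
            if bitmap[hash(pair) % NUM_BUCKETS]:
                pair_counts[pair] += 1
    return {pair for pair, c in pair_counts.items() if c >= s}
```

The filter is safe because any truly frequent pair occurs in at least s baskets, so its bucket count is at least s and its bitmap bit is set; the bitmap can only discard pairs that cannot be frequent.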
Multistage Algorithm
- Key idea: after Pass 1 of PCY, rehash only those pairs that qualify for Pass 2 of PCY.
- On the middle pass, fewer pairs contribute to buckets, so there are fewer false drops: buckets that have count >= s, yet no pair hashing to that bucket has count >= s.

Multistage Picture
[Diagram: Pass 1 holds the item counts and the first hash table; Pass 2 holds the frequent items, Bitmap 1, and the second hash table; Pass 3 holds the frequent items, Bitmap 1, Bitmap 2, and counts of candidate pairs.]

Finding Larger Itemsets
- We may proceed beyond frequent pairs to find frequent triples, quadruples, and so on.
- Key A-Priori idea: a set of items S can only be frequent if S - {a} is frequent for all a in S.
- The k-th pass through the file counts the candidate sets of size k: those whose every immediate subset (subset of size k - 1) is frequent.
- The cost is proportional to the maximum size of a frequent itemset.

All Frequent Itemsets in ...
[Slides 18-42 are truncated in the source; the extract resumes mid-discussion of locality-sensitive hashing, where minhash signatures are split into b bands of r rows and column pairs identical in >= 1 band become candidate pairs.]
- Tune b, r, k to catch most similar pairs and few nonsimilar pairs.

Example
- Suppose 100,000 columns and signatures of 100 integers; the signatures take 40 MB.
- But 5,000,000,000 pairs of signatures can take a while to compare.
- Choose 20 bands of 5 integers per band.

Suppose C1, C2 Are 80% Similar
- Probability C1, C2 are identical in one particular band: (0.8)^5 ≈ 0.328.
- Probability C1, C2 are identical in none of the 20 bands: (1 - 0.328)^20 ≈ 0.00035.
- I.e., we miss about 1/3000 of the 80%-similar column pairs.

Suppose C1, C2 Are Only 40% Similar
- Probability C1, C2 are identical in any one particular band: (0.4)^5 ≈ 0.01.
- Probability C1, C2 are identical in >= 1 of the 20 bands: <= 20 * 0.01 = 0.2.
- There is a small probability that C1, C2 are not identical in a band but still hash to the same bucket.
- But false positives are much lower for smaller similarities.

[Slides 45-49 are truncated in the source.]

Summary
- Finding frequent pairs: A-Priori -> PCY (hashing) -> multistage.
- Finding all frequent itemsets: simple approach -> SON -> Toivonen.
- Finding similar pairs: minhash + LSH, Hamming LSH.
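The two banding calculations are instances of one formula: with b bands of r rows each, two columns of similarity s agree in at least one band with probability 1 - (1 - s^r)^b. A quick numeric check in plain Python:

```python
def candidate_probability(sim, r=5, b=20):
    """Probability two columns with the given similarity are identical in >= 1 of b bands of r rows."""
    return 1 - (1 - sim ** r) ** b

# 80%-similar pairs are almost always caught: miss probability ~3.6e-4,
# i.e. about 1 in 3000, matching the slide's estimate.
print(1 - candidate_probability(0.8))
# 40%-similar pairs rarely become candidates: ~0.186, consistent with
# (and slightly below) the 20 * 0.01 = 0.2 union bound on the slide.
print(candidate_probability(0.4))
```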
