《MyFavoriteAlgorithmsforLargeScaleDataMining》由会员分享,可在线阅读,更多相关《MyFavoriteAlgorithmsforLargeScaleDataMining(48页珍藏版)》请在金锄头文库上搜索。
1、My Favorite Algorithms for Large-Scale Data MiningShinglingMinhashingLocality-Sensitive Hashing1Similarity SearchuA universal set of “objects.”uA collection of sets of objects.uFind the pairs of sets that are “similar.”wJaccard similarity of sets = size of intersection divided by size of union.2Exam
2、ple: Jaccard Similarity3 in intersection.8 in union.Jaccard similarity = 3/83ApplicationsuCollaborative Filtering : Amazon customers as the set of products they buy.wRecommend what similar customers bought.uSimilar Documents : A document as its set of k-shingles = strings of k consecutive characters
3、.wExamples: news articles from same source, plagiarism.4Applications (2)uFingerprint Checking : Represent a fingerprint by the set of positions of minutiae.wRequires discretization.uEntity Resolution : Represent records describing individuals by sets of attribute/value pairs.5Key Ideas1.Shingling :
4、(Andrei Broder) Convert documents into sets.2.Minhashing : (Edith Cohen, Broder) Construct small signatures for sets so Jaccard similarity of sets can be determined from the signatures.3.Locality-Sensitive Hashing : (Rajeev Motwani, Piotr Indyk) Focus on (likely) similar pairs without looking at all
5、 pairs.6The Big Picture DocumentsShinglingDocu-mentThe setof stringsof length kthat appearin the doc-umentMinhash-ingSignatures :short integervectors thatrepresent thesets andreflect theirsimilarityLocality-sensitiveHashingCandidatepairs :those pairsof signaturesthat we needto test forsimilarity.7Th
6、e Big Picture FingerprintsExtractminutiaeanddiscret-izeFinger-printLocality-sensitiveHashingCandidatepairs :those pairsof fingerprintsthat we needto test forsimilarity. Optionalminhashing here8When Is Similarity Interesting?1.When the sets are so large or so many that they cannot fit in main memory.
7、2.When there are so many sets that comparing all pairs of sets takes too much time.9Shinglingk -ShinglesDocuments as Sets10ShinglesuA k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document.uExample: k=2; doc = abcab. Set of 2-shingles = ab, bc, ca.wOption: re
8、gard shingles as a bag, and count ab twice.uRepresent a doc by its set of k -shingles.11Working AssumptionuDocuments that have lots of shingles in common have similar text, even if the text appears in different order.uCareful: you must pick k large enough, or most documents will have most shingles.w
9、k = 5 is OK for short documents; k = 10 is better for long documents.12Shingles: Compression OptionuTo compress long shingles, we can hash them to (say) 4 bytes.uRepresent a doc by the set of hash values of its k-shingles.uTwo documents could (rarely) appear to have shingles in common, when in fact
10、only the hash-values were shared.13MinhashingMatrix FormulationSignaturesSimilarity of Signatures14Similarity as a Matrix ProblemuThink of sets represented by a matrix of 0s and 1s.uRow = object.uColumn = set.u1 means that object is in that set.15Example: Similarity of ColumnsC1C2u01v10w 11Sim (C1,
11、C2) =x002/5 = 0.4y11z01*C1 = v,w,yC2 = u,w,y,z16Four Types of RowsuGiven columns C1 and C2, rows may be classified as:C1C2a11b10c01d00uAlso, a = # rows of type a , etc.uNote Sim (C1, C2) = a /(a +b +c ).17MinhashinguImagine the rows permuted randomly.uDefine “hash” function h (C ) = the number of th
12、e first (in the permuted order) row in which column C has 1.uUse several (100?) independent hash functions to create a signature with that number of integer hash-values.18Minhashing ExampleInput matrix 0101010110101010101010010101 3476125Signature matrix M1212576312414124526731212119Surprising Prope
13、rtyuThe probability (over all permutations of the rows) that h (C1) = h (C2) is the same as Sim (C1, C2).uBoth are a /(a +b +c )! Why?wLook down columns C1 and C2 (in permuted order) until we see a 1.wIf its a type-a row, then h (C1) = h (C2). If a type-b or type-c row, then not.20Similarity for Sig
14、naturesuThe similarity of signatures is the fraction of the rows in which they agree.wRemember, each row corresponds to a permutation or “hash function.”21Implementation (1)uYou cant really permute rows physically.uGood approximation to permuting rows: pick 100 (?) hash functions.uFor each column c
15、and each hash function hi , keep a “slot” M (i, c ) for that minhash value.22Implementation (2)for each row r for each column c if c has 1 in row r for each hash function hi do if hi (r ) is a smaller value than M (i, c ) thenM (i, c ) := hi (r );23ExampleRowC1C2 1 1 0 2 0 1 3 1 1 4 1 0 5 0 1h(x) =
16、x mod 5g(x) = 2x +1 mod 5h(1) = 11-g(1) = 33-h(2) = 212g(2) = 030h(3) = 312g(3) = 220h(4) = 412g(4) = 420h(5) = 010g(5) = 120Sig1Sig224Implementation (3)uOften, data is given by column, not row.wE.g., columns = documents, rows = shingles.uIf so, sort matrix once so it is by row.uAnd always compute h
17、i (r ) only once for each row.25Locality-Sensitive HashingThe All-Pairs ProblemBanding of Signature MatricesOther LSH Techniques26Finding Similar SetsuWe can use minhashing to replace sets (columns of the matrix) by short lists of integers.uBut we still need to compare each pair of signatures.uExamp
18、le: 20 million Amazon customers; 2*1014 pairs of customers to evaluate.27Locality-Sensitive HashinguWhat we want seems impossible. Map signatures to buckets so that:1.Two similar signatures have a very good chance of appearing in the same bucket.2.If two signatures are not very similar, they probabl
19、y dont appear in one bucket.uThen, we only have to compare bucket-mates (candidate pairs ).28LSH for SignaturesuThink of the signature for each column (set) as a column of the signature matrix S.uDivide the rows of S into b bands of r rows each.29Partition Into Bands (1)Matrix Sr rowsper bandb bands
20、 Onesignature30Partition into Bands (2)uFor each band, hash its portion of each column to a hash table with many buckets.uCandidate column pairs are those that hash to the same bucket for 1 band.uTune b and r to catch most similar pairs, but few nonsimilar pairs.31Matrix Sr rowsb bandsBuckets32Analy
21、sis of LSH What We Want Similarity s of two setsProbabilityof sharinga buckettNo chanceif s t33What One Band of One Row Gives YouSimilarity s of two setsProbabilityof sharinga buckettRemember:probability ofequal hash-values= similarity34What b Bands of r Rows Gives YouSimilarity s of two setsProbabi
22、lityof sharinga bucketts r All rowsof a bandare equal1 -Some rowof a bandunequal()b No bandsidentical1 -At leastone bandidenticalt (1/b)1/r 35Example: b = 20; r = 5 s 1-(1-sr)b.2 .006.3 .047.4 .186.5 .470.6 .802.7 .975.8 .999636Summary of Minhash/LSH1.Represent the objects you are comparing by sets
23、(e.g., shingling).2.Represent the sets by signatures (minhashing).3.Use LSH to create buckets; candidate pairs are those in the same bucket.4.Evaluate only the candidate pairs.37Application: LSH for FingerprintsuPlace a grid on a fingerprint.wNormalize so identical prints will overlap.uSet of grid p
24、oints where minutiae are located represents the fingerprint.wPossibly, treat minutiae near a grid boundary as if also present in adjacent grid points.38Discretizing MinutiaeMinutialocatedhereMaybe pretendit is here also39Applying LSH to FingerprintsuWe could minhash the bit-vectors to obtain signatu
25、res.uBut since there probably arent too many grid points, we can work from the bit-vectors directly.40LSH/Fingerprints (2)uPick 100 (?) sets of 3 (?) grid points, randomly.uFor each set of three grid points, those prints that have 1 for all three points are placed in a bucket.wAll pairs in this buck
26、et are candidates.41Application: Matching Customer RecordsuI once took a consulting job solving the following problem:wCompany A agreed to solicit customers for Company B, for a fee.wThey then argued over how many customers.wNeither recorded exactly which customers were involved.42Customer Records (
27、2)uCompany B had about 1 million records of all its customers.uCompany A had about 1 million records describing customers, some of which it had signed up for B.uRecords had name, address, and phone, but for various reasons, they could be different for the same person.43Customer Records (3)uStep 1: D
28、esign a measure (“score ”) of how similar records are:wE.g., deduct points for small misspellings (“Jeffrey” vs. “Geoffery”) or same phone with different area code.uStep 2: Score all pairs of records; report high scores as matches.44Customer Records (4)uProblem: (1 million)2 is too many pairs of rec
29、ords to score.uSolution: A simple LSH.wThree hash functions: exact values of name, address, phone.Compare iff records are identical in at least one.wMisses similar records with a small difference in all three fields.45Aside: Validation of ResultsuWe were able to tell what values of the scoring funct
30、ion were reliable in an interesting way.wIdentical records had a creation date difference of 10 days.wWe only looked for records created within 90 days, so bogus matches had a 45-day average.46Validation (2)uBy looking at the pool of matches with a fixed score, we could compute the average time-difference, say x, and deduce that fraction (45-x )/35 of them were valid matches.uAlas, the lawyers didnt think the jury would understand.47The EndThanks for Listening48