生物信息学课件英文原版课件57

资源描述

《生物信息学课件英文原版课件57》由会员分享，可在线阅读，更多相关《生物信息学课件英文原版课件57（41页珍藏版）》请在金锄头文库上搜索。

1、Efficient Mining of NichesGuozhu DongOrganizationnData miningnSurprising knowledge patterns nNiches and examplenCapturing policy setnSemantic based representative-pattern selectionnThe whole algorithmnFuture worknOther works of GD et alData MiningnWe are drowning in data, but starving for knowledge!

2、nSensors, gene chips, bar code readers, wwwnData mining: Extraction of interesting (non-trivial, implicit, previously unknown, potentially useful) patterns from data in large databases.nPattern: Age=boomers and Car=sportsnIt summarizes the set of boomer-aged sports car drivers.nOther names: Knowledg

3、e discovery in databases (KDD), data archeology, data/pattern analysis, information harvesting, business intelligence, .Mining surprising knowledgenExceptions to strong rules (Suzuki et al)nTry to find exceptions to all strong rulesnPeculiar rules (Zhong et al)nIsolate unusual data and then mine reg

4、ularity rules on such datanHoles in data (Liu et al)nGenerally: interesting rules, profitable rules, “actionable rules, Niche mining Insurance ConWhen XYZ Insurance started, business was good: it used some common-sense rules & niche-market rules to build up customer base.nAfter a few years of succes

5、s, business stopped growing. nCEO: “Strategic Opportunities Department must find niche markets, so that we can expand our market share. Similar scenario: medicine, government, scienceWhat are niches?nNiches: “profitable opportunities, overlooked by current operating policies/principlesnFor XYZ auto

6、insurance: nPolicy 35: If customer drives sports car, charge more. (sporty drivers risky)nNiche: If customer drives sports car, is boomer, has = children, =2 cars, then charge less. (such sporty drivers not risky)nAttract profitable customers from competitorsPoliciesnPolicies/principles are strong d

7、iscriminating rules, useful for predicting business outcomesnFor auto insurance:nDrive-Sports-Car risky sup:5%, conf:28%nAge 30 not risky 70%, 83%nThere are millions of such rules. Not all strong rules are used as policies.RiskyNon riskyPolicies contdnDrive-Sports-Car risky sup:5%, conf:28%nSports-c

8、ar owners: n6% of no-claim ownersn24% of claims owners nGrowth rate of 4nEmerging patterns (Dong-Li KDD 1999). Niche Mining Approach 1: Brute ForcenFind all strong rule/exception pairs.nShortcomings: nNot computationally feasible for low supports for nichesnHuman user will be overwhelmed by the moun

9、tains of pairsNiche Mining Approach 2: Find policies firstnTry to find from policy manualnTry to mine from operational datanMine discriminating (emerging) patterns, whose frequencies change a lot b/w Risky and NonRisky Use ConsEPMiner (Zhang, Dong, Rao KDD 2000)nSelect a small subset of strong rules

10、 to approximate the policiesnMine exceptions to policies, & select niches.How to select policies: semantic representatives of patternsn“Semantically disjoint rules: Each rule covers a “unique segment of cases, & different rules cover “disjoint segments. nCannot use total disjointness.nSyntactic disj

11、ointness is not good enough. E.G. small-car drivers vs young drivers.nSemantic-based selection mostly ignored so farSemantic-based representativesnLet SAT(X) be set of casesnsatisfying (covered by) X.nSet S of strong rules makes a policy set if nSAT(X) SAT(Y) is small for X,Y in S.nUNION SAT(X) cove

12、rs all cases.nEach rule in S corresponds to a unique segment of “clientst1: abcdt2: acdet3: cdefSAT(ac)=t1,t2 SAT(ac) SAT(de) = t2How to mine semantic-based representativesnComputationally expensive (many patterns, many cases extremely high dimensions).t1 t2 t3 t4 t5 t6 t7 tm p1: 0 1 0 0 1 0 1 0m=10

13、6 p2: 0 0 1 1 0 0 1 0n=105 pn 1 0 0 1 1 0 0 1nWe use an efficient (greedy) incremental algorithm, to compute the overlap and difference between SATs of the candidate and selected rules.Dong-Deshpande PAKDD 2001Iterative Bit Driven Incremental SelectionTrans/EPse1e2e3e4t11100t21101t30010t41010t50110t

15、the decision (e.g. risky, non-risky)2.Select, in semantic-based way, a policy set.3.Mine exceptions to each rule R in the policy set from relativized dataset; select a semantic-based representative of exceptions to R; remove exceptions implied by other rules in the policy set.ExperimentsnOur increme

16、ntal algorithm is about 30 times faster than “nave selection.nTested on “Adult (income prediction): nDominant rule:nage32,never-married,cap-mgin20K income 50Kn,edu-years=13,race=whiteincome 50Kn,relnship=not-in-family,race=Amer-Ind-Eskimo, HPW:20.6,40.2)income50KFuture worknThe niche and policy set

17、mining process can be an interactive process selecting next policy and niches, depending on what has been chosen. Policy rules can be ordered.nOther ways to discover policies, including pushing semantic-based selection into initial mining process.nUse semantic representative pattern discovery for ot

18、her data mining/interestingness analysis tasks.nApply methodology on real data perhaps for your company.GDs other worksnNew pattern types for KDDnKnowledge pattern mining algorithmsnClassification algorithms for extremely high dimensions (for texts, gene expressions, )nData cube; iceberg algorithms

19、to discover the unusual; “gradient analysisnBioinformatics: micro-array gene expression analysis; also involved with QSAR (for drug design) in WSUs bioinformatics group.How to model and mine nichesnNiches are exceptions to some rules in SRS, but are not implied by any rule in SRS.nNiches have small

20、(but above statistically meaningful level of) supports nWe mine these by “relativizing the whole datasets, to get smaller datasets, using the rules in the SRS.nRelativization: Only keep items that occur in the given rule.DiscussionnWithout semantic selection of SRS before mining niches nToo much com

21、putation for too many strong rulesnToo much repeated computation for different strong rulesnHard to get to low-support niche rulesnWithout semantic selection of niches, show too many exceptions to user.Potential applicationsnInsurance attract “perceived risky but non-risky clientsnMedicine find segm

22、ents that should be treated in ways contrary to conventional wisdomsnScience use data to discover specialized laws for special situationsnIn general, customized treatmentsnRelated WorksnException mining (Suzuki et al, Liu et al): find strong-rule exception pairs.nHoles in data (Liu et al).nInteresti

23、ngness (many refs).nSemantic cover based pattern selection (Toivonen et al).nClosed patterns and Galois connections (Zaki).KDD Lab, Bioinformatics GroupnNew pattern types for KDD, of practical usenKnowledge pattern mining algorithmsnClassification algorithms for extremely high dimensions (for texts,

24、 gene expressions, )nData cube; iceberg algorithms to discover the unusual; “gradient analysisnBioinformatics: micro-array gene expression analysis; also involved with QSAR (for drug design) in WSUs bioinformatics group.Need for Data MiningnData explosion problem nAutomated data collection tools + m

25、ature database technology huge amounts of data in databases, data warehouses, other data repositories nWe are drowning in data, but starving for knowledge! nSolutionsnData warehousing and on-line analytical processingnData mining to extract interesting knowledge (rules, patterns, regularities, const

26、raints) from large databasesGene chips, scanners, satellitesWhat is Data MiningnExtraction of interesting (non-trivial, implicit, previously unknown, potentially useful) info or patterns from data in large databases.nOther names: Knowledge discovery (mining) in databases (KDD), knowledge extraction,

27、 data archeology, data dredging, data/pattern analysis, information harvesting, business intelligence, etc.nWhat is not data mining?n(Deductive) query processing - not exploratory.nExpert systems or small ML/statistical programs.Data PreparationEx: KDs Insurance Drivers AgeType of CarNumber of kids

28、(#kids)Other_carsParking of carsClaims Made20Sports3AutoGarageYes60Sports2MinivanDrivewayNo35Auto4TruckStreetNo29Truck0TruckDrivewayNo30Auto1TruckStreetYes40Sports2TruckStreetYesEqual length binning for Age:(-.28,(28.36,(36.44,(44.52,(52.+Type: Sports, Auto, Truck#Kids: (-.1,(1.2,(2.3,(3.4,(4.+Other

29、_cars: Auto, Truck, MinivanParking: Garage, Driveway, StreetClasses: Claims made CClaims Not made: NotCAn interval is equivalent to an item. So (-.28 1 & Type:Auto 7DefinitionsnEmerging patternsEx: With GrowthRate threshold of 2, an EP is Other_cars:Truck, Parking:StreetSuppC = 2, SuppNotC = 1, Grow

30、thRate = 2nDominant trend (DT)Ex: dt1 = Type: Sports with SuppC = 2, SuppNotC = 1, GrowthRate = 2 and dt2 = e1nSet of dominant trends (DTS) Ex: dt1, dt2nNiche Contradiction to the DTMining of the DTS : SAT approachCandidate DTSSAT(DTS)OverlapType:Sports,#kids:2t1,t3t1,t3#kids:2,Other_cars:Truck,Park

31、ing:Streett1,t2,t3t3Type:Sports,Other_cars:Truck,Parking:Streett1,t2,t3t3Type:Sports,#kids:2,Other_cars:Truck,Parking:Streett1,t2,t3t1,t3Tuple idDrivers AgeType of CarNumber of kids (#kids)Other_carsParking of carst120Sports2AutoGaraget230Auto1TruckStreett340Sports2TruckStreetExpensive approach if w

32、e consider hundreds of thousands of EPsSAT based incremental approachEPSATType:Sportst1,t3#kids:2t1,t3Other_cars:Truck, Parking:Streett2,t3DTS = Type:SportsSAT(DTS) = t1,t3One with highest supportEPSATDTSOverlap with DTS#kids:20/2 = 0t1,t2,t3t1,t3Other_cars:Truck, Parking:Street1/1 = 1t1,t2,t3t3Tupl

33、e idDrivers AgeType of CarNumber of kids (#kids)Other_carsParking of carst120Sports2AutoGaraget230Auto1TruckStreett340Sports2TruckStreet =|SAT(X)SAT(DTS)| |SAT(X) SAT(DTS)| DTS=Type:Sports,Other_cars:Truck,Parking:Streeta)b)Mining of DTS : Techniques to improve efficiency of SATnExpensive steps of a

34、pproachnFind initial SATsnSAT(X)SAT(DTS) and SAT(X) SAT(DTS)nTechniques to improve efficiencynReduce initial cost by reducing search space - Use hash table indexingnMinimize cost to find difference and overlap using SAT matrix and drivers to drive the computationHash Table Indexing to Minimize Cost

35、of Finding Initial SATsnTypical “which-which problem123451,52,434,551,2,32,33,51,3,5EPs are stored in the hash tableHashing is based on first item in the EPWhich EPs are contained in a transaction can be easilyFound. For Ex: Transaction t1: 2,4,5Effect of Hash-table based initial SAT computationSAT

36、MatrixnMatrix composed of EPs and transactionsTrans/EPType:Sports#kids:2Other_cars: Truck, Parking: Streett1110t2001t3111SATt1,t3t1,t3t2,t3Tuple idDrivers AgeType of CarNumber of kids (#kids)Other_carsParking of carst120Sports2AutoGaraget230Auto1TruckStreett340Sports2TruckStreetBit Driven Incrementa

37、l SAT ApproachnIf EP Y is included in the DTS and p is the column position for YFor each row r of SAT matrixIf the r,p position of the matrix = 1For each column c If the r,c position of the matrix = 1 then change the value to 0 and decrement count for c;Mining of the DTSnComparison of various approa

38、chesExample of Relativization ApproachTuple idDrivers AgeType of CarNumber of kids (#kids)Other_carsParking of carsClasst120Sports2AutoDrivewayCt230Auto1TruckStreetCt343Sports2TruckStreetCt358Sports2MinivanStreetCt460Sports3MinivanDrivewayNotCt555Sports1AutoGarageNotCt635Auto2TruckStreetNotCt729Auto

39、2TruckDrivewayNotCTuple idDrivers AgeType of CarNumber of kids (#kids)Other_carsParking of carsClasst120Sports2AutoDrivewayCt343Sports2TruckStreetCt358Sports2MinivanStreetCt460Sports3MinivanDrivewayNotCt555Sports1AutoGarageNotCScalabilitynVolume scalabilityScalabilitynAttribute scalabilityRelated Ar

40、eas of ResearchnHoles Missing patterns different from exceptions in that they are expected to be there but are not presentnSurprising patterns Patterns not conforming to standards may not necessarily contradict the standard expected patternnExceptions Contradictory patternsnNiches Today companies ar

41、e moving towards customization. Niches help identify customers that need special treatment different form others. They are in a much global sense for the entire organization.Approaches to Mine NichesnUsing DFS approachnExpanding association rules using a tree and applying DFS to get exceptions to th

42、emnSemantic selection of exceptions gives nichesnSingle step mining in reverse direction and then doing semantic selection Not feasible as support thresholds have to be very low and that decreases feasibilitynOur method Three step approachConclusionnKey ideas introduced in this thesis are:nDTS Domin

43、ant trend set to capture operational policies of an organizationnNiches that signify segments of data that need special attention to generate profitsnEfficient mining of the DTS using the SAT based bit driven incremental approachnEfficient mining of niches using the three step approach & technique of relativization

展开阅读全文

生物信息学课件英文原版课件57

最新文档