《微软SQL数据挖掘数据仓库技术研讨会材料》由会员分享,可在线阅读,更多相关《微软SQL数据挖掘数据仓库技术研讨会材料(54页珍藏版)》请在金锄头文库上搜索。
1、欢迎光临微软SQL数据挖掘/数据仓库技术研讨会今日安排微软SQL数据挖掘技术概述左洪 微软公司数据仓库在电信的应用贝志城 明天高科数据挖掘在CRM中的应用王立军 中圣公司灵通IT Service维护管理服务系统 邹雄文 广州灵通Introduction to Data Mining with SQL Server 2000 左洪左洪 高级产品市场经理高级产品市场经理高级产品市场经理高级产品市场经理微软(中国)有限公司微软(中国)有限公司微软(中国)有限公司微软(中国)有限公司Agendal lWhat is Data Miningl lThe Data Mining Marketl lOLE
2、DB for Data Miningl lOverview of the Data Mining Features in SQL Server 2000l lQ&AWhat Is Data Mining?What is DM?l lA process of data exploration and analysis using automatic or semi-automatic means“Exploring data” scanning samples of known “Exploring data” scanning samples of known facts about “cas
3、es”.facts about “cases”.“knowledge”: “knowledge”: Clusters, Rules, Decision treesClusters, Rules, Decision trees, , Equations, Equations, Association rulesAssociation rulesl lOnce the “knowledge” is extracted it:Once the “knowledge” is extracted it:Can be browsed Can be browsed Provides a very usefu
4、l insight on the cases behaviorProvides a very useful insight on the cases behaviorCan be used to predict values of other casesCan be used to predict values of other cases Can serve as a key element in closed loop analysisCan serve as a key element in closed loop analysisWhat drive high school stude
5、nts to attend college?The deciding factors for high school students to attend college areAttend College:55% Yes45% NoAll StudentsAttend College:79% Yes11% NoIQ=HighAttend College:45% Yes55% NoIQ=LowIQ ?WealthAttend College:94% Yes6% NoWealth = TrueAttend College:69% Yes21% NoWealth = FalseParentsEnc
6、ourage?Attend College:70% Yes30% NoAttend College:31% Yes69% NoParentsEncourage = NoParentsEncourage = YesBusiness Oriented DM Problemsl lTargeted adsTargeted ads“What banner should I display to this visitor?”“What banner should I display to this visitor?”l lCross sellsCross sells“What other product
7、s is this customer likely to buy?“What other products is this customer likely to buy?l lFraud detectionFraud detection“Is this insurance claim a fraud?”“Is this insurance claim a fraud?”l lChurn analysisChurn analysis“Who are those customers likely to churn?”“Who are those customers likely to churn?
8、”l lRisk ManagementRisk Management“Should I approve the loan to this customer?”“Should I approve the loan to this customer?”l l Http:/Mining ModelMining Process - IllustratedDMEngineData To PredictDMEnginePredicted DataTraining DataMining ModelMining ModelThe Data Mining MarketThe $: Y2000 Market Si
9、ze l lDM Tools Market: $250M40% - license fees40% - license fees60% consulting60% consulting* GartnerThe Playersl lLeading vendorsSASSASSPSSSPSS IBM IBMHundreds of smaller vendors offering DM Hundreds of smaller vendors offering DM algorithmsalgorithmsl lOracle Thinking Machines acquisitionThe Produ
10、ctsl lEnd-to-end Data Mining toolsExtraction, Cleansing, Loading, Modeling, Algorithms (dozens), Extraction, Cleansing, Loading, Modeling, Algorithms (dozens), Analysts workbench, Reporting, Charting.Analysts workbench, Reporting, Charting.l lThe customer is the power-analystPhD in statistics is usu
11、ally requiredPhD in statistics is usually requiredl lClosed tools no standard APIClosed tools no standard APITotal vendor lock-inTotal vendor lock-inLimited integration with applicationsLimited integration with applicationsl lDM an “outsider” in the Data WarehouseDM an “outsider” in the Data Warehou
12、sel lExtensive consulting requiredExtensive consulting requiredl lSky rocketing pricesSky rocketing prices$60K+ for a single user license$60K+ for a single user licenseWhat the analysts sayl l“Stand-alone Data Mining Is Dead” - Forresterl l“The demise of stand alone data mining” GartnerThe Microsoft
13、 ApproachDataPro Users Survey 1999-2001“Data mining will be the fastest-growing BI technology”The $: 2000 Market Size l lDM Applications Market Size: $1.5B* IDCSQL Server 2000 - The Analysis Platforml lSQL 2000 provides a complete Analysis PlatformNot an isolated, stand alone DM productNot an isolat
14、ed, stand alone DM productl lPlatform means:The infrastructure for applicationsThe infrastructure for applications Not an application by itselfNot an application by itselfIntegrated vision for all technologies, toolsIntegrated vision for all technologies, toolsStandard based APIs (OLE DB for DM)Stan
15、dard based APIs (OLE DB for DM)ExtensibleExtensibleScaleableScaleableData FlowDWOLTPOLAPDMAppsReports& AnalysisDMAnalysis Services 2000 - ArchitectureManager UIDSOAnalysis ServerClientOLE DB OLAPOLAPEngine(local)OLAPEngineDMEngineDMEngine(local)DMDMMDM WizardsDM DTS TaskExt.Ext.OLE DB for Data Minin
16、gWhy OLE DB for DM?Make DM a Make DM a mass market technologymass market technology by: by:l lLeverage existing technologies and knowledge Leverage existing technologies and knowledge SQL and OLE DB SQL and OLE DB l lCommon industry wide concepts and data Common industry wide concepts and data prese
17、ntationpresentationl lChanging DM market perception from Changing DM market perception from “proprietary” to “open”“proprietary” to “open”l lIncreasing the number of players:Increasing the number of players:Reduce the cost and risk of becoming a consumer one tool Reduce the cost and risk of becoming
18、 a consumer one tool works with multiple providersworks with multiple providersReduce the cost and risk of becoming a provider focus on Reduce the cost and risk of becoming a provider focus on expertise and find many partners to complement offeringexpertise and find many partners to complement offer
19、ingDramatically increase the number of DM developersDramatically increase the number of DM developersIntegration With RDBMSl lCustomers would like to Build DM models from within their RDBMSBuild DM models from within their RDBMSTrain the models directly off their relational tablesTrain the models di
20、rectly off their relational tablesPerform predictions as relational queries (tables Perform predictions as relational queries (tables in, tables out)in, tables out)Feel that DM is a native part of their database.Feel that DM is a native part of their database.l lThereforeThereforeData mining models
21、are relational objectsData mining models are relational objectsAll operations on the models are relationalAll operations on the models are relationalThe language used is SQL (w/Extensions)The language used is SQL (w/Extensions)l lThe effect: every DBA and VB developer can The effect: every DBA and V
22、B developer can become a DM developerbecome a DM developerCreating a Data Mining Model (DMM)Identifying the “Cases”l lDM algorithms analyze “cases”DM algorithms analyze “cases”l lThe “case” is the entity being categorized The “case” is the entity being categorized and classifiedand classifiedl lExam
23、plesExamplesCustomer credit risk analysis: Customer credit risk analysis: Case = Case = CustomerCustomerProduct profitability analysis: Product profitability analysis: Case = ProductCase = ProductPromotion success analysis: Promotion success analysis: Case = PromotionCase = Promotionl lEach case enc
24、apsulate all we know about Each case encapsulate all we know about the entitythe entityA Simple Set of CasesStudentIDStudentID GenderGenderParent Parent IncomeIncomeIQIQEncouragementEncouragementCollegeCollegePlansPlans1 1MaleMale2340023400120120 Not EncouragedNot Encouraged NoNo2 2FemaleFemale79200
25、792009090 EncouragedEncouragedYesYes3 3MaleMale4200042000105105 Not EncouragedNot Encouraged YesYesMore Complicated CasesCustCust ID IDAgeAgeMaritalMaritalStatusStatusIQIQFavorite MoviesFavorite MoviesTitleTitleScoreScore1 13535MM2 2Star WarsStar Wars8 8Toy StoryToy Story9 9TerminatorTerminator7 72
26、22020S S3 3Star WarsStar Wars7 7BraveheartBraveheart7 7The MatrixThe Matrix10103 35757MM2 2Sixth SenseSixth Sense9 9CasablancaCasablanca1010A DMM is a Table!l lA DMM structure is defined as a tableTraining a DMM means inserting data into Training a DMM means inserting data into the tablethe tablePre
27、dicting from a DMM means querying Predicting from a DMM means querying the tablethe tablel lAll information describing the case are contained in columnsCreating a Mining ModelCREATE MINING MODEL Plans PredictionCREATE MINING MODEL Plans Prediction( (StudentIDStudentID LONG KEY, LONG KEY,Gender TEXT
28、DISCRETE,Gender TEXT DISCRETE,ParentIncomeParentIncome LONG CONTINUOUS, LONG CONTINUOUS,IQ DOUBLE CONTINUOUS,IQ DOUBLE CONTINUOUS,Encouragement TEXT DISCRETE, Encouragement TEXT DISCRETE, CollegePlansCollegePlans TEXT DISCRETE PREDICT TEXT DISCRETE PREDICT) )USING Microsoft_Decision_TreesUSING Micro
29、soft_Decision_TreesCreating a mining model with nested tableCreate Mining Model Create Mining Model MoviePredictionMoviePrediction ( ( CutomerIdCutomerId long key, long key, Age long continuous, Age long continuous, Gender discrete,Gender discrete,Education discrete,Education discrete,MovieListMovie
30、List table predict ( table predict ( MovieNameMovieName text key text key ) ) ) ) using using microsoft_decision_treesmicrosoft_decision_treesTraining a DMMTraining a DMMl lTraining a DMM means passing it data for Training a DMM means passing it data for which the attributes to be predicted are know
31、nwhich the attributes to be predicted are knownMultiple passes are handled internally by the Multiple passes are handled internally by the provider!provider!l lUse an INSERT INTO statementUse an INSERT INTO statementl lThe DMM will not persist the inserted data The DMM will not persist the inserted
32、data l lInstead it will analyze the given cases and Instead it will analyze the given cases and build the DMM content (decision tree, build the DMM content (decision tree, segmentation model, association rules)segmentation model, association rules)INSERT INTO INSERT INTO (columns list)(columns list)
33、 INSERT INTOINSERT INTO Plans PredictionPlans Prediction(StudentID, Gender, ParentIncome, IQ,Encouragement, CollegePlans)SELECT StudentID, Gender, ParentIncome, IQ,Encouragement, CollegePlans FROM CollegePlans When Insert Into Is Donel lThe DMM is trainedThe model can be retrained The model can be r
34、etrained Content (rules, trees, formulas) can be Content (rules, trees, formulas) can be exploredexploredOLE DB Schema rowsetOLE DB Schema rowsetSELECT * FROM SELECT * FROM .CONTENT.CONTENTXML string (PMML)XML string (PMML)l lPrediction queries can be executedPredictionsWhat are Predictions?l lPredi
35、ctions apply the rules of a trained Predictions apply the rules of a trained model to a new set of data in order to model to a new set of data in order to estimate missing attributes or valuesestimate missing attributes or valuesl lPredictions = queriesPredictions = queriesThe syntax is SQL - likeTh
36、e syntax is SQL - likeThe output is a rowsetThe output is a rowsetl lIn order to predict you need:In order to predict you need:Input data setInput data setA trained DMMA trained DMMBinding (mapping) information between the Binding (mapping) information between the input data and the DMMinput data an
37、d the DMMSpecification of what to predictSpecification of what to predictThe Truth Table ConceptGenderGenderParent Parent IncomeIncomeIQIQEncouragementEncouragementCollegeCollegePlansPlansProbabilityProbabilityMaleMale20000200008585Not EncouragedNot EncouragedNoNo85%85%MaleMale20000200008585Not Enco
38、uragedNot EncouragedYesYes15%15%MaleMale20000200008585EncouragedEncouragedNoNo60%60%MaleMale20000200008585EncouragedEncouragedYesYes40%40%MaleMale20000200009090Not EncouragedNot EncouragedNoNo80%80%MaleMale20000200009090Not EncouragedNot EncouragedYesYes20%20%MaleMale20000200009090EncouragedEncourag
39、edNoNo58%58%PredictionGenderGenderParentParentIncomeIncomeIQIQEncouragementEncouragementCollege College PlansPlansProbabilityProbabilityMaleMale20000200008585Not EncouragedNot EncouragedNoNo85%85%MaleMale20000200008585Not EncouragedNot EncouragedYesYes15%15%MaleMale20000200008585EncouragedEncouraged
40、NoNo60%60%MaleMale20000200008585EncouragedEncouragedYesYes40%40%MaleMale20000200009090Not EncouragedNot EncouragedNoNo80%80%MaleMale20000200009090Not EncouragedNot EncouragedYesYes20%20%MaleMale20000200009090EncouragedEncouragedNoNo58%58%MaleMale20000200009090EncouragedEncouragedYesYes42%42%MaleMale
41、20000200009595Not EncouragedNot EncouragedNoNo78%78%MaleMale20000200009595Not EncouragedNot EncouragedYesYes22%22%MaleMale20000200009595EncouragedEncouragedNoNo45%45%Its a JOIN!StudentStudentIDIDGenderGenderParentParentIncomeIncomeIQIQEncouragementEncouragement1 1MaleMale43000430008585Not Encouraged
42、Not Encouraged2 2MaleMale2000020000135135Not EncouragedNot Encouraged3 3FemaleFemale2500025000105105EncouragedEncouraged4 4MaleMale9600096000100100EncouragedEncouraged5 5FemaleFemale5600056000125125Not EncouragedNot Encouraged6 6FemaleFemale46000460009090Not EncouragedNot EncouragedThe Prediction Qu
43、ery SyntaxSELECT SELECT FROM FROM PREDICTION JOIN PREDICTION JOIN ONON = = dmm input columnExampleSELECT SELECT New New Students.StudentIDStudents.StudentID, , Plans Plans Prediction.CollegePlansPrediction.CollegePlans, , PredictProbability(CollegePlansPredictProbability(CollegePlans)FROM FROM Plans
44、 Prediction Plans Prediction PREDICTION JOINPREDICTION JOIN New Students New StudentsON ON Plans Prediction.GenderPlans Prediction.Gender = = New Students.Gender New Students.Gender ANDAND Plans Prediction.IQPlans Prediction.IQ = = New Students.IQ New Students.IQ AND .AND .OLE DB DM Sample Provider
45、with Sourcel lAll required OLE DB objects, such as session, All required OLE DB objects, such as session, command, and command, and rowsetrowset l lThe OLE DB for Data Mining syntax parser The OLE DB for Data Mining syntax parser l lTokenization of input data Tokenization of input data l lQuery proc
46、essing engine Query processing engine l lA sample Nave A sample Nave BayesBayes algorithm algorithm l lModel persistence in XML and binary formats Model persistence in XML and binary formats l lAvailable at Available at OLAP and DM AnalysisWhy Use DM with OLAPl lRelational DM is designed for:Report
47、s of patternsReports of patternsBatch predictions fed into an OLTP systemBatch predictions fed into an OLTP systemReal-time singleton prediction in an Real-time singleton prediction in an operational environmentoperational environmentl lOLAP is designed for interactive analysis by a knowledge worker
48、 interactive analysis by a knowledge worker Consistent and convenient navigational Consistent and convenient navigational modelmodelPre-aggregations of OLAP allow faster Pre-aggregations of OLAP allow faster performanceperformanceUnderstanding DM Content Decision TreesCredit Risk:65% Good35% BadAll
49、CustomersCredit Risk:89% Good11% BadDebt=LowCredit Risk:94% Good6% BadET = SalariedCredit Risk:70% Good30% BadEducation?Credit Risk:31% Good69% BadEducation=High SchoolCredit Risk:79% Good21% BadCredit Risk:45% Good55% BadDebt=HighDebt ?Employ-ment Type?ET = SelfEmployedEducation=CollegeCustomers ha
50、ving high debt and college education:Filter(Individual Customers.Members,Customers.CurrentMember.Properties(“Debt”) = “High”And Customers.CurrentMember.Properties(“Education”) = “College”)Customers having low debt and are self employed:Filter(Individual Customers.Members,Customers.CurrentMember.Prop
51、erties(“Debt”) = LowAnd Customers.CurrentMember.Properties(“Employment Type”) = “Self Employed”)Equivalent DM DimensionCustomers with high debt and college educationAll CustomersCustomers with high debtCustomers with high debt and high school educationCustomers with low debt and self employedCustome
52、rs with low debtCustomers with low debt and salariedCustom Custom Roll-upRoll-upCredit RiskCredit Risk- -Good = 65%, Bad = 35%Good = 65%, Bad = 35%Aggregate(Filter(Aggregate(Filter( Good = 89%, Bad = 11%Good = 89%, Bad = 11%Aggregate(Filter(Aggregate(Filter( Good = 79%, Bad = 21%Good = 79%, Bad = 21
53、%Aggregate(Filter(Aggregate(Filter(Good = 94%, Bad = 6%Good = 94%, Bad = 6%Aggregate(Filter(Aggregate(Filter( Good = 45%, Bad = 55%Good = 45%, Bad = 55%Aggregate(Filter(Aggregate(Filter( Good = 70%, Bad = 30%Good = 70%, Bad = 30%Aggregate(Filter(Aggregate(Filter( Good = 31%, Bad = 69%Good = 31%, Bad
54、 = 69%Tree = Dimensionl lEvery node on the tree is a dimension memberEvery node on the tree is a dimension memberl lThe node statistics are the member propertiesThe node statistics are the member propertiesl lAll members are calculatedAll members are calculatedFormula aggregates the case dimension m
55、embers Formula aggregates the case dimension members that apply to this nodethat apply to this nodeThe MDX is generated by the DM algorithmThe MDX is generated by the DM algorithml lAnalysis Service will automatically generate the Analysis Service will automatically generate the calculated dimension
56、 based on the DM content calculated dimension based on the DM content and also a virtual cubeand also a virtual cubel lApplies to Applies to Classification (decision trees)Classification (decision trees)Segmentation (clusters)Segmentation (clusters)Browsing the Virtual Cubel lPivot the DM dimension:
57、WAWAORORCACAAll CustomersAll Customers320032002500250080008000 Customers with low debt Customers with low debt232023201503150343004300 Customers with high debt Customers with high debt88088099799747004700 Customers college Customers college32032045045023102310 Customers high school Customers high sc
58、hool56056054754723902390Credit Risk: 70% Good, 30% BadPredictionsl lYou might want to view predictions for each You might want to view predictions for each casecasel lFor example:For example:What is the expected profitability of a product?What is the expected profitability of a product?What is the c
59、redit risk of a specific customer?What is the credit risk of a specific customer?What are the products this customer is likely to buy?What are the products this customer is likely to buy?l lAll of those predictions are available through All of those predictions are available through MDX calculated m
60、embersMDX calculated membersl lSingleton query is created automaticallySingleton query is created automaticallyPrediction Calculated MemberMeasures.Probability of High Credit Risk:Measures.Probability of High Credit Risk:PREDICT(Customers.CurrentMember, PREDICT(Customers.CurrentMember, “Credit Risk
61、Model”,“Credit Risk Model”,“ “PredictionProbability(PredictionProbability( PredictionHistogram(“PredictionHistogram(“Credit RiskCredit Risk,(”,(” HighHigh) )“ “) )Predictions ExampleProbability of Probability of High Credit RiskHigh Credit RiskProbability of Probability of Low Credit RiskLow Credit RiskJoe SmithJoe Smith73%73%27%27%John DowJohn Dow68%68%32%32%William ClingtonWilliam Clington45%45%55%55%Robert MaxwellRobert Maxwell98%98%2%2%Denis RodmanDenis Rodman81%81%19%19%Questions ?E-Mail: http:/