Lesson 4: Data Classification and Prediction


Still waters run deep. Where there is life, there is hope.

Outline
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Prediction
- Summary
- Reference

I. Classification vs. Prediction
- Classification
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
- Prediction
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications
  - Credit approval
  - Target marketing
  - Medical diagnosis
  - Fraud detection

Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the classified result from the model
    - The accuracy rate is the percentage of test-set samples that are correctly classified by the model
    - The test set is independent of the training set; otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction
- Training data is fed to a classification algorithm, which outputs the classifier (model)
- Example of a learned rule: IF rank = "professor" OR years > 6 THEN tenured = "yes"

Classification Process (2): Use the Model in Prediction
- The classifier is evaluated on testing data and then applied to unseen data
- Example: for the unseen tuple (Jeff, Professor, 4), the classifier answers the question "Tenured?"
- (A short code sketch of this two-step process follows.)
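The two-step process can be made concrete with a minimal sketch. This is only an illustration: scikit-learn is an assumed library choice, and the toy encoding of the (rank, years) → tenured example below is invented, not the lecture's own data.

```python
# Step 1: model construction on a labeled training set.
# Step 2: model usage -- estimate accuracy on an independent test set, then classify unseen tuples.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy tuples: (rank_is_professor, years), label = tenured yes/no (illustrative only).
X = [[1, 7], [1, 2], [0, 8], [0, 3], [1, 10], [0, 1], [1, 5], [0, 9]]
y = ["yes", "no", "yes", "no", "yes", "no", "no", "yes"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)          # model construction

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))  # model evaluation
print("Tenured?", model.predict([[1, 4]])[0])                   # classify an unseen tuple, e.g. (Jeff, Professor, 4)
```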

Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

II. Issues Regarding Classification and Prediction (1): Data Preparation
- Data cleaning: preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection): remove irrelevant or redundant attributes
- Data transformation: generalize and/or normalize data

Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
- Accuracy: classifier accuracy and predictor accuracy
- Speed and scalability
  - time to construct the model (training time)
  - time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understanding and insight provided by the model
- Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

III. Decision Tree Induction: Training Dataset
- This follows an example from Quinlan's ID3 (the "playing tennis" data)

Output: A Decision Tree for "buys_computer"
- [Figure: the root tests age; the branch age <= 30 leads to a test on student (no -> no, yes -> yes), the branch age 31..40 is a leaf labeled yes, and the branch age > 40 leads to a test on credit_rating with leaves for excellent and fair]

Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm; a code sketch follows this slide)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning: majority voting is employed for classifying the leaf
  - There are no samples left
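A minimal sketch of the greedy, top-down, divide-and-conquer induction described above. For brevity the attribute to split on is simply the first remaining one; a real ID3 run would pick the attribute with the highest information gain (next slide). All names and data here are illustrative, not the lecture's own code.

```python
from collections import Counter

def build_tree(rows, attributes, target, parent_majority=None):
    if not rows:                                   # no samples left
        return parent_majority
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:                      # all samples belong to the same class
        return labels[0]
    if not attributes:                             # no attributes left: majority voting
        return majority
    attr = attributes[0]                           # placeholder for the information-gain choice
    tree = {attr: {}}
    # Partition the examples by the selected attribute and recurse on each partition.
    for value in sorted(set(row[attr] for row in rows)):
        subset = [row for row in rows if row[attr] == value]
        rest = [a for a in attributes if a != attr]
        tree[attr][value] = build_tree(subset, rest, target, majority)
    return tree

rows = [{"age": "<=30", "student": "no",  "buys_computer": "no"},
        {"age": "<=30", "student": "yes", "buys_computer": "yes"},
        {"age": "31..40", "student": "no", "buys_computer": "yes"}]
print(build_tree(rows, ["age", "student"], "buys_computer"))
```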

Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain
- Let S contain s tuples, of which si belong to class Ci, for i = 1, ..., m
- The information (entropy) required to classify an arbitrary tuple:
  I(s1, ..., sm) = - Σi (si/s) log2(si/s)
- The entropy of attribute A with values {a1, a2, ..., av}, where sij is the number of class-Ci tuples in the subset with A = aj:
  E(A) = Σj ((s1j + ... + smj)/s) · I(s1j, ..., smj)
- The information gained by branching on attribute A:
  Gain(A) = I(s1, ..., sm) - E(A)

Attribute Selection by Information Gain Computation
- Class P: buys_computer = "yes"
- Class N: buys_computer = "no"
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age (the slide's class-frequency table for each age range is not reproduced here; a code sketch using the standard counts follows this slide)
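Entropy and information gain for the buys_computer example. The class counts used below (9 yes / 5 no overall; for age: <=30 -> 2 yes / 3 no, 31..40 -> 4/0, >40 -> 3/2) are the standard textbook figures and are assumed here, since the slide's frequency table is not reproduced above.

```python
from math import log2

def info(*counts):
    """I(s1, ..., sm): expected information needed to classify a tuple."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(class_counts, partitions):
    """Gain(A) = I(s1, ..., sm) - E(A), where E(A) is the weighted entropy of the partitions."""
    total = sum(class_counts)
    e_a = sum(sum(part) / total * info(*part) for part in partitions)
    return info(*class_counts) - e_a

print(f"I(9, 5)   = {info(9, 5):.3f}")                               # 0.940, as on the slide
print(f"Gain(age) = {gain([9, 5], [[2, 3], [4, 0], [3, 2]]):.3f}")   # 0.247 (0.246 with rounded intermediate values)
```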

Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example (a sketch of the extraction follows this slide):
  - IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  - IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
  - IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"
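A small sketch of turning each root-to-leaf path of a decision tree into an IF-THEN rule. The nested-dict tree below is an assumed encoding (the same shape produced by the build_tree sketch earlier), not the lecture's own data structure; its leaves match the example rules above.

```python
def extract_rules(tree, conditions=()):
    if not isinstance(tree, dict):                  # a leaf: emit one rule for this path
        conds = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        return [f'IF {conds} THEN buys_computer = "{tree}"']
    (attr, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():         # each attribute-value pair is a conjunct
        rules += extract_rules(subtree, conditions + ((attr, value),))
    return rules

tree = {"age": {"<=30": {"student": {"no": "no", "yes": "yes"}},
                "31..40": "yes",
                ">40": {"credit_rating": {"excellent": "yes", "fair": "no"}}}}
for rule in extract_rules(tree):
    print(rule)
```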

Avoid Overfitting in Classification
- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the "best pruned tree"

Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross-validation
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution

Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication

Classification in Large Databases
- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
- Why decision tree induction in data mining?
  - relatively fast learning speed (compared with other classification methods)
  - convertible to simple, easily understood classification rules
  - can use SQL queries for accessing databases
  - classification accuracy comparable with other methods

Scalable Decision Tree Induction Methods
- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning: stop growing the tree earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)

Presentation of Classification Results
- Visualization of a decision tree in SGI/MineSet 3.0
- Interactive visual mining by Perception-Based Classification (PBC)

IV. Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics
- Let X be a data sample whose class label is unknown
- Let H be the hypothesis that X belongs to class C
- For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
- P(H): prior probability of hypothesis H (i.e., the initial probability before we observe any data; it reflects the background knowledge)
- P(X): probability that the sample data is observed
- P(X|H): probability of observing the sample X, given that the hypothesis holds

Bayesian Theorem
- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem (written out after this slide)
- Informally, this can be written as posterior = likelihood x prior / evidence
- MAP (maximum a posteriori) hypothesis: choose the hypothesis with the largest posterior probability
- Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost
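The slide's equation images are not reproduced above; for reference, the standard statements of Bayes' theorem and of the MAP hypothesis it mentions are:

```latex
% Bayes' theorem and the MAP (maximum a posteriori) hypothesis, in their standard forms.
\[
  P(H \mid X) \;=\; \frac{P(X \mid H)\, P(H)}{P(X)},
  \qquad
  h_{\mathrm{MAP}} \;=\; \arg\max_{h \in \mathcal{H}} P(h \mid X)
  \;=\; \arg\max_{h \in \mathcal{H}} P(X \mid h)\, P(h).
\]
```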

Naïve Bayes Classifier
- A simplified assumption: attributes are conditionally independent given the class
- The probability of observing, say, two attribute values y1 and y2 together, given that the current class is C, is the product of the probabilities of each value taken separately given the same class: P(y1, y2 | C) = P(y1 | C) * P(y2 | C)
- No dependence relations between attributes are modeled
- Greatly reduces the computation cost: only the class distribution needs to be counted
- Once the probability P(X|Ci) is known, assign X to the class with the maximum P(X|Ci) * P(Ci)

Training Dataset
- Classes: C1: buys_computer = "yes"; C2: buys_computer = "no"
- Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Naïve Bayesian Classifier: An Example
- Compute P(X|Ci) for each class (a code sketch reproducing these numbers follows this slide):
  - P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  - P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  - P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  - P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  - P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  - P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  - P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  - P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
- X = (age <= 30, income = medium, student = yes, credit_rating = fair)
- P(X|Ci):
  - P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  - P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
- P(X|Ci) * P(Ci):
  - P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
  - P(X | buys_computer = "no") * P(buys_computer = "no") = 0.007
- Therefore, X belongs to class "buys_computer = yes"
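A short sketch reproducing the naïve Bayes computation above. The conditional probabilities come from the slide; the priors P(yes) = 9/14 and P(no) = 5/14 are assumed (they are what the slide's products 0.028 and 0.007 imply), and the dictionary encoding is just an illustration.

```python
cond = {
    "yes": {"age<=30": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9},
    "no":  {"age<=30": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5},
}
prior = {"yes": 9/14, "no": 5/14}          # assumed class priors

x = ["age<=30", "income=medium", "student=yes", "credit=fair"]

for c in ("yes", "no"):
    p_x_given_c = 1.0
    for attr_value in x:                   # conditional independence: multiply per-attribute terms
        p_x_given_c *= cond[c][attr_value]
    score = p_x_given_c * prior[c]         # P(X|Ci) * P(Ci)
    print(f"P(X|{c}) = {p_x_given_c:.3f}, P(X|{c})*P({c}) = {score:.3f}")
# Prints roughly 0.044 / 0.028 for "yes" and 0.019 / 0.007 for "no",
# so X is classified as buys_computer = "yes".
```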

Naïve Bayesian Classifier: Comments
- Advantages
  - Easy to implement
  - Good results obtained in most cases
- Disadvantages
  - The assumption of class-conditional independence causes a loss of accuracy
  - In practice, dependencies exist among variables
    - E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    - Dependencies among these cannot be modeled by a naïve Bayesian classifier
- How to deal with these dependencies? Bayesian belief networks

Bayesian Belief Networks
- A Bayesian belief network allows a subset of the variables to be conditionally independent
- A graphical model of causal relationships
  - Represents dependencies among the variables
  - Gives a specification of the joint probability distribution
- [Figure: a small network with nodes X, Y, Z, P]
  - Nodes: random variables
  - Links: dependency
  - X and Y are the parents of Z, and Y is the parent of P
  - There is no dependency between Z and P
  - The graph has no loops or cycles

Bayesian Belief Network: An Example
- [Figure: a network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea]
- The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents, FamilyHistory (FH) and Smoker (S); a lookup sketch follows this slide:

              (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
    LC          0.8        0.5        0.7        0.1
    ~LC         0.2        0.5        0.3        0.9
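A small sketch of how the LungCancer CPT above would be used. The dictionary is keyed by the truth values of the parents (FamilyHistory, Smoker); only this one CPT comes from the slide, everything else here is illustrative.

```python
lung_cancer_cpt = {
    (True,  True):  0.8,   # P(LC | FH,  S)
    (True,  False): 0.5,   # P(LC | FH, ~S)
    (False, True):  0.7,   # P(LC | ~FH, S)
    (False, False): 0.1,   # P(LC | ~FH, ~S)
}

def p_lung_cancer(family_history, smoker, lung_cancer=True):
    """P(LungCancer = lung_cancer | FamilyHistory, Smoker), read from the CPT."""
    p_true = lung_cancer_cpt[(family_history, smoker)]
    return p_true if lung_cancer else 1.0 - p_true

# In a belief network the joint distribution factorizes as
# P(x1, ..., xn) = product over i of P(xi | Parents(Xi)); each factor is a CPT lookup like this.
print(p_lung_cancer(family_history=False, smoker=True))                      # 0.7
print(p_lung_cancer(family_history=False, smoker=True, lung_cancer=False))   # 0.3
```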

Learning Bayesian Networks
- Several cases:
  - Both the network structure and all variables observable: learn only the CPTs
  - Network structure known, some variables hidden: use a gradient-descent method, analogous to neural network learning
  - Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology
  - Unknown structure, all variables hidden: no good algorithms are known for this case
- See D. Heckerman, "Bayesian networks for data mining"

V. What Is Prediction?
- (Numerical) prediction is similar to classification
  - construct a model
  - use the model to predict a continuous or ordered value for a given input
- Prediction is different from classification
  - Classification predicts a categorical class label
  - Prediction models continuous-valued functions
- Major method for prediction: regression
  - model the relationship between one or more independent (predictor) variables and a dependent (response) variable
- Regression analysis
  - Linear and multiple regression
  - Non-linear regression
  - Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees

Linear Regression
- Linear regression involves a response variable y and a single predictor variable x:
  y = w0 + w1 x
  where w0 (y-intercept) and w1 (slope) are the regression coefficients
- Method of least squares: estimates the best-fitting straight line (a short sketch follows this slide)
- Multiple linear regression involves more than one predictor variable
  - Training data is of the form (X1, y1), (X2, y2), ..., (X|D|, y|D|)
  - Ex.: for 2-D data, we may have y = w0 + w1 x1 + w2 x2
  - Solvable by an extension of the least-squares method or using packages such as SAS or S-Plus
  - Many nonlinear functions can be transformed into the above form
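A minimal least-squares sketch for y = w0 + w1·x, solved through the normal equations. numpy is an assumed library choice and the data points are invented for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.1, 6.2, 7.9, 10.1])

# Design matrix with a column of ones for the intercept w0.
X = np.column_stack([np.ones_like(x), x])
w0, w1 = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"y = {w0:.2f} + {w1:.2f} x")     # roughly y = 0.1 + 2.0 x for these points
```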

Nonlinear Regression
- Some nonlinear models can be modeled by a polynomial function
- A polynomial regression model can be transformed into a linear regression model. For example,
  y = w0 + w1 x + w2 x^2 + w3 x^3
  is convertible to linear form with the new variables x2 = x^2 and x3 = x^3:
  y = w0 + w1 x + w2 x2 + w3 x3
  (a sketch of this transformation follows this slide)
- Other functions, such as the power function, can also be transformed into a linear model
- Some models are intractably nonlinear (e.g., a sum of exponential terms); it is still possible to obtain least-squares estimates through extensive calculation on more complex formulae
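A sketch of the polynomial-to-linear transformation: fit y = w0 + w1·x + w2·x² + w3·x³ by creating the new variables x2 = x² and x3 = x³ and solving an ordinary least-squares problem. numpy is an assumed library and the sample data is invented.

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 9)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.25 * x**3     # noiseless data from a known cubic

# The new variables turn the cubic model into a model that is linear in (x, x2, x3).
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
w = np.linalg.lstsq(X, y, rcond=None)[0]
print(w)   # approximately [1.0, 2.0, -0.5, 0.25], recovering the coefficients
```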

Other Regression-Based Models
- Generalized linear model: the foundation on which linear regression can be applied to modeling categorical response variables
  - The variance of y is a function of the mean value of y, not a constant
  - Logistic regression: models the probability of some event occurring as a linear function of a set of predictor variables
  - Poisson regression: models data that exhibit a Poisson distribution
- Log-linear models (for categorical data)
  - Approximate discrete multidimensional probability distributions
  - Also useful for data compression and smoothing
- Regression trees and model trees
  - Trees that predict continuous values rather than class labels

Regression Trees and Model Trees
- Regression tree: proposed in the CART system (Breiman et al., 1984)
  - CART: Classification And Regression Trees
  - Each leaf stores a continuous-valued prediction
  - It is the average value of the predicted attribute for the training tuples that reach the leaf (a short sketch follows this slide)
- Model tree: proposed by Quinlan (1992)
  - Each leaf holds a regression model, a multivariate linear equation for the predicted attribute
  - A more general case than the regression tree
- Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model
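A small sketch of a regression tree in the CART style: each leaf predicts the mean target value of the training tuples that reach it. scikit-learn's DecisionTreeRegressor is an assumed library choice, and the data is invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])

# With depth 1, the tree splits the two groups apart and each leaf stores its group's mean.
tree = DecisionTreeRegressor(max_depth=1).fit(X, y)
print(tree.predict([[2.5], [11.5]]))   # [1.0, 5.0]: continuous leaf means, not class labels
```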

Predictive Modeling in Multidimensional Databases
- Predictive modeling: predict data values or construct generalized linear models based on the database data
- One can only predict value ranges or category distributions
- Method outline
  - Minimal generalization
  - Attribute relevance analysis
  - Generalized linear model construction
  - Prediction
- Determine the major factors which influence the prediction
  - Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
- Multi-level prediction: drill-down and roll-up analysis

Prediction: Numerical Data / Prediction: Categorical Data
- [Figure slides illustrating prediction on numerical and on categorical data]

VI. Summary
- Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.
- Effective and scalable methods have been developed for decision tree induction, naïve Bayesian classification, Bayesian belief networks, rule-based classifiers, backpropagation, Support Vector Machines (SVM), associative classification, nearest-neighbor classifiers, case-based reasoning, and other classification methods such as genetic algorithms, rough set, and fuzzy set approaches.
- Linear, nonlinear, and generalized linear regression models can be used for prediction. Many nonlinear problems can be converted to linear problems by performing transformations on the predictor variables. Regression trees and model trees are also used for prediction.

VII. Reference
- J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
- J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
- R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd ed.). John Wiley & Sons, 2001.
- T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
