Lesson 4: Data Classification and Prediction

Xu Congfu, Associate Professor
Institute of Artificial Intelligence, Zhejiang University
Slide deck for the undergraduate course "Introduction to Data Mining", Zhejiang University

Outline
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Prediction
- Summary
- Reference

I. Classification vs. Prediction
- Classification
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
- Prediction
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications
  - credit approval, target marketing, medical diagnosis, fraud detection

Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the classified result from the model
    - Accuracy rate is the percentage of test-set samples that are correctly classified by the model
    - The test set must be independent of the training set, otherwise overfitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction
[Figure: a training table of (name, rank, years, tenured) tuples feeds a classification algorithm, which outputs the classifier (model), e.g. the rule IF rank = "professor" OR years > 6 THEN tenured = "yes"]

Classification Process (2): Use the Model in Prediction
[Figure: the classifier is checked against testing data, then applied to unseen data, e.g. (Jeff, Professor, 4) -> Tenured?]
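To make the two-step process concrete, here is a minimal sketch using scikit-learn (an assumed library, not part of the original slides). The feature encoding and labels are hypothetical toy data chosen to follow the slide's tenure rule.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: model construction from a training set with known class labels.
# Features: [rank_is_professor (0/1), years]; class label: tenured (0/1).
X = [[1, 7], [1, 2], [0, 8], [0, 3], [1, 10], [0, 1], [1, 5], [0, 9]]
y = [1, 1, 1, 0, 1, 0, 1, 1]   # hypothetical labels consistent with the rule above

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: estimate accuracy on an independent test set, then classify unseen data.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("(Jeff, Professor, 4) -> tenured?", model.predict([[1, 4]]))
```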

Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation
  - New data are classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

II. Issues Regarding Classification and Prediction (1): Data Preparation
- Data cleaning: preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection): remove irrelevant or redundant attributes
- Data transformation: generalize and/or normalize data

Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
- Accuracy: classifier accuracy and predictor accuracy
- Speed and scalability
  - time to construct the model (training time)
  - time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency on disk-resident databases
- Interpretability: understanding and insight provided by the model
- Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

III. Decision Tree Induction: Training Dataset
- This follows an example in the style of Quinlan's ID3 (Playing Tennis)
- The original slide shows the 14-tuple "buys_computer" training table as an image; the reconstruction below is inferred from the counts used in the Naive Bayes example later in this lesson:

  age    | income | student | credit_rating | buys_computer
  <=30   | high   | no      | fair          | no
  <=30   | high   | no      | excellent     | no
  31..40 | high   | no      | fair          | yes
  >40    | medium | no      | fair          | yes
  >40    | low    | yes     | fair          | yes
  >40    | low    | yes     | excellent     | no
  31..40 | low    | yes     | excellent     | yes
  <=30   | medium | no      | fair          | no
  <=30   | low    | yes     | fair          | yes
  >40    | medium | yes     | fair          | yes
  <=30   | medium | yes     | excellent     | yes
  31..40 | medium | no      | excellent     | yes
  31..40 | high   | yes     | fair          | yes
  >40    | medium | no      | excellent     | no

Output: A Decision Tree for "buys_computer"
[Figure: the induced tree, root test on age]
- age <=30 -> test student: no -> no; yes -> yes
- age 31..40 -> yes
- age >40 -> test credit_rating: excellent -> no; fair -> yes

Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm), sketched in code below
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
  - There are no samples left
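A compact sketch of this greedy induction loop, under stated assumptions: examples are (attribute-dict, label) pairs, and `best_attribute` is a placeholder for the attribute-selection measure introduced on the next slide.

```python
from collections import Counter

def induce_tree(examples, attributes, best_attribute, default=None):
    """examples: list of (attr_dict, class_label) pairs."""
    if not examples:                       # stopping condition: no samples left
        return default
    labels = [c for _, c in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:              # all samples belong to the same class
        return labels[0]
    if not attributes:                     # no attributes left: majority voting
        return majority
    a = best_attribute(examples, attributes)   # heuristic, e.g. information gain
    tree = {a: {}}
    for v in {x[a] for x, _ in examples}:      # partition on the selected attribute
        subset = [(x, c) for x, c in examples if x[a] == v]
        rest = [b for b in attributes if b != a]
        tree[a][v] = induce_tree(subset, rest, best_attribute, default=majority)
    return tree
```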

Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain
- Let S contain s_i tuples of class C_i, for i = 1, ..., m (the original slide's formulas, lost in extraction, are restated here in their standard ID3 form)
- Expected information needed to classify an arbitrary tuple:
  I(s_1, ..., s_m) = -sum_{i=1..m} p_i log2(p_i), where p_i = s_i / s
- Entropy of attribute A with values {a_1, a_2, ..., a_v}:
  E(A) = sum_{j=1..v} ((s_1j + ... + s_mj) / s) * I(s_1j, ..., s_mj)
- Information gained by branching on attribute A:
  Gain(A) = I(s_1, ..., s_m) - E(A)

Attribute Selection by Information Gain Computation
- Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age:
  E(age) = 5/14 * I(2,3) + 4/14 * I(4,0) + 5/14 * I(3,2) = 0.694
  (5/14 * I(2,3) means the branch age "<=30" covers 5 of the 14 samples, with 2 yes's and 3 no's)
- Hence Gain(age) = I(9, 5) - E(age) = 0.246
- Similarly: Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048, so age is selected at the root
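The worked numbers above can be checked directly; this short sketch computes I and E from the class counts in the reconstructed training table.

```python
from math import log2

def info(counts):
    """Expected information I(s1,...,sm) = -sum p_i * log2(p_i)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

i_all = info([9, 5])                     # I(9,5)
# Partition by age: <=30 -> (2 yes, 3 no), 31..40 -> (4, 0), >40 -> (3, 2)
e_age = 5/14 * info([2, 3]) + 4/14 * info([4, 0]) + 5/14 * info([3, 2])
print(round(i_all, 3), round(e_age, 3), round(i_all - e_age, 3))
# -> 0.94 0.694 0.247  (the slide's 0.246 comes from rounding I and E first)
```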

Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example (reading off the buys_computer tree above)
  - IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  - IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
  - IF age = "31..40" THEN buys_computer = "yes"
  - IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  - IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
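One way to mechanize this rule extraction is scikit-learn's `export_text` (an assumed library), which prints the root-to-leaf paths; each path is one IF-THEN rule. The integer encoding of the attributes and the labels below are hypothetical, chosen to match the five rules above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy encoding: age bucket (0: <=30, 1: 31..40, 2: >40), student (0/1),
# credit_excellent (0/1); labels follow the buys_computer tree above.
X = [[0, 0, 0], [0, 1, 0], [1, 0, 0], [2, 0, 1], [2, 0, 0]]
y = [0, 1, 1, 0, 1]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(export_text(clf, feature_names=["age", "student", "credit_excellent"]))
```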

Avoid Overfitting in Classification
- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the "best pruned tree"

Approaches to Determine the Final Tree Size (a pruning sketch follows)
- Separate training (2/3) and testing (1/3) sets
- Use cross-validation
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node is likely to improve performance over the entire distribution
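As one concrete postpruning recipe, scikit-learn's cost-complexity pruning (an assumed library, not named by the slides) produces exactly such a sequence of progressively pruned trees, with a held-out set choosing the best member.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=1/3, random_state=0)

# Each ccp_alpha along the path yields one tree in the pruning sequence.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
         for a in path.ccp_alphas]

# Pick the "best pruned tree" by accuracy on data held out from training.
best = max(trees, key=lambda t: t.score(X_val, y_val))
print(best.get_n_leaves(), "leaves, validation accuracy", best.score(X_val, y_val))
```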

Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication

Classification in Large Databases
- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
- Why decision tree induction in data mining?
  - relatively fast learning speed (compared with other classification methods)
  - convertible to simple, easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - classification accuracy comparable with other methods

Scalable Decision Tree Induction Methods
- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning; stops growing the tree earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)

Presentation of Classification Results
[Figure only]

Visualization of a Decision Tree in SGI/MineSet 3.0
[Figure only]

Interactive Visual Mining by Perception-Based Classification (PBC)
[Figure only]

IV. Bayesian Classification: Why?
- Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics
- Let X be a data sample whose class label is unknown
- Let H be the hypothesis that X belongs to class C
- For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
- P(H): prior probability of hypothesis H (the initial probability before we observe any data; reflects background knowledge)
- P(X): probability that the sample data is observed
- P(X|H): probability of observing the sample X, given that the hypothesis holds

Bayesian Theorem
- Given training data X, the posterior probability of a hypothesis H follows Bayes' theorem:
  P(H|X) = P(X|H) P(H) / P(X)
- Informally: posterior = likelihood x prior / evidence
- MAP (maximum a posteriori) hypothesis: h_MAP = argmax_h P(h|X) = argmax_h P(X|h) P(h)
- Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost

Naive Bayes Classifier
- A simplifying assumption: attributes are conditionally independent given the class
- The probability of observing, say, two attribute values x1 and x2 together, given that the class is C, is the product of the probabilities of each value taken separately given the same class:
  P(x1, x2 | C) = P(x1 | C) * P(x2 | C)
- No dependence relations between attributes are modeled
- Greatly reduces the computation cost: only the class distribution needs to be counted
- Once P(X|Ci) is known, assign X to the class with maximum P(X|Ci) * P(Ci)

Training Dataset (for the example below; see the buys_computer table above)
- Classes: C1: buys_computer = "yes"; C2: buys_computer = "no"
- Data sample X = (age = "<=30", income = "medium", student = "yes", credit_rating = "fair")

Naive Bayesian Classifier: An Example
- Compute P(X|Ci) for each class:
  P(age="<=30" | buys_computer="yes") = 2/9 = 0.222
  P(age="<=30" | buys_computer="no") = 3/5 = 0.6
  P(income="medium" | buys_computer="yes") = 4/9 = 0.444
  P(income="medium" | buys_computer="no") = 2/5 = 0.4
  P(student="yes" | buys_computer="yes") = 6/9 = 0.667
  P(student="yes" | buys_computer="no") = 1/5 = 0.2
  P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667
  P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4
- For X = (age="<=30", income="medium", student="yes", credit_rating="fair"):
  P(X | buys_computer="yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X | buys_computer="no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
- P(X|Ci) * P(Ci):
  P(X | buys_computer="yes") * P(buys_computer="yes") = 0.044 x 9/14 = 0.028
  P(X | buys_computer="no") * P(buys_computer="no") = 0.019 x 5/14 = 0.007
- Therefore, X belongs to class buys_computer = "yes"
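The arithmetic above is easy to reproduce; this sketch hard-codes the conditional probabilities and priors taken from the slide and prints the likelihoods and unnormalized posteriors.

```python
# Conditional probabilities and priors copied from the worked example above.
probs = {
    "yes": {"age<=30": 2/9, "income=medium": 4/9, "student=yes": 6/9,
            "credit=fair": 6/9, "prior": 9/14},
    "no":  {"age<=30": 3/5, "income=medium": 2/5, "student=yes": 1/5,
            "credit=fair": 2/5, "prior": 5/14},
}
for c, p in probs.items():
    # Naive Bayes: multiply the per-attribute conditionals, then weight by the prior.
    likelihood = p["age<=30"] * p["income=medium"] * p["student=yes"] * p["credit=fair"]
    print(c, round(likelihood, 3), round(likelihood * p["prior"], 3))
# yes: 0.044, 0.028   no: 0.019, 0.007  -> predict buys_computer = "yes"
```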

Naive Bayesian Classifier: Comments
- Advantages
  - Easy to implement
  - Good results obtained in most cases
- Disadvantages
  - The class-conditional independence assumption causes a loss of accuracy
  - In practice, dependencies exist among variables
    - E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    - Such dependencies cannot be modeled by the naive Bayesian classifier
- How to deal with these dependencies? Bayesian belief networks

Bayesian Belief Networks
- A Bayesian belief network allows a subset of the variables to be conditionally independent
- A graphical model of causal relationships
  - Represents dependencies among the variables
  - Gives a specification of the joint probability distribution
- [Figure: a directed acyclic graph over nodes X, Y, Z, P]
  - Nodes: random variables
  - Links: dependencies
  - X and Y are the parents of Z, and Y is the parent of P
  - No dependency between Z and P
  - The graph has no loops or cycles

Bayesian Belief Network: An Example
[Figure: FamilyHistory and Smoker are parents of LungCancer; Smoker also influences Emphysema; LungCancer leads to PositiveXRay; LungCancer and Emphysema lead to Dyspnea]
- The conditional probability table (CPT) for the variable LungCancer (LC), with the parent-combination labels restored (the negation signs were lost in extraction):

           (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
  LC       0.8       0.5        0.7        0.1
  ~LC      0.2       0.5        0.3        0.9

- The CPT shows the conditional probability for each possible combination of the node's parents
- The joint probability of a tuple (z1, ..., zn) factorizes over the network:
  P(z1, ..., zn) = product over i of P(zi | Parents(zi))
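Reading the CPT amounts to a table lookup keyed on the parent values; a minimal sketch, assuming the reconstructed numbers above.

```python
# CPT for LungCancer (LC) given its parents FamilyHistory (FH) and Smoker (S).
cpt_lc = {(True, True): 0.8, (True, False): 0.5,
          (False, True): 0.7, (False, False): 0.1}

def p_lung_cancer(fh, s, lc=True):
    """P(LC = lc | FH = fh, S = s); the complement gives the ~LC row."""
    p = cpt_lc[(fh, s)]
    return p if lc else 1.0 - p

# e.g. P(LC | FH, ~S) = 0.5 and P(~LC | ~FH, ~S) = 0.9
print(p_lung_cancer(True, False), p_lung_cancer(False, False, lc=False))
```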

Learning Bayesian Networks
- Several cases:
  - Given both the network structure and all variables observable: learn only the CPTs
  - Network structure known, some variables hidden: method of gradient descent, analogous to neural network learning
  - Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology
  - Unknown structure, all variables hidden: no good algorithms are known for this purpose
- Reference: D. Heckerman, Bayesian Networks for Data Mining

V. What Is Prediction?
- (Numerical) prediction is similar to classification:
  - construct a model
  - use the model to predict a continuous or ordered value for a given input
- Prediction is different from classification:
  - classification predicts categorical class labels
  - prediction models continuous-valued functions
- Major method for prediction: regression
  - models the relationship between one or more independent (predictor) variables and a dependent (response) variable
- Regression analysis
  - linear and multiple regression
  - nonlinear regression
  - other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees

Linear Regression
- Linear regression involves a response variable y and a single predictor variable x:
  y = w0 + w1 x
  where w0 (y-intercept) and w1 (slope) are the regression coefficients
- Method of least squares: estimates the best-fitting straight line (see the sketch below)
- Multiple linear regression involves more than one predictor variable
  - Training data is of the form (X1, y1), (X2, y2), ..., (X|D|, y|D|)
  - E.g., for 2-D data we may have: y = w0 + w1 x1 + w2 x2
  - Solvable by an extension of the least-squares method, or with packages such as SAS or S-Plus
  - Many nonlinear functions can be transformed into the above
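The least-squares line has the standard closed form w1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2) and w0 = y_mean - w1 * x_mean; a minimal sketch on made-up data:

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w0 + w1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    return my - w1 * mx, w1   # (w0, w1)

# Toy data lying roughly on y = 1 + 2x.
w0, w1 = fit_line([1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8])
print(round(w0, 2), round(w1, 2))   # -> 1.15 1.94, close to (1, 2)
```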

Nonlinear Regression
- Some nonlinear models can be modeled by a polynomial function
- A polynomial regression model can be transformed into a linear regression model. For example,
  y = w0 + w1 x + w2 x^2 + w3 x^3
  is convertible to linear form with the new variables x2 = x^2, x3 = x^3:
  y = w0 + w1 x + w2 x2 + w3 x3
- Other functions, such as the power function, can also be transformed into a linear model
- Some models are intractably nonlinear (e.g., sums of exponential terms)
  - it may still be possible to obtain least-squares estimates through extensive calculation on more complex formulae
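The variable substitution above turns polynomial fitting into ordinary multiple linear regression; a sketch using numpy's least-squares solver (an assumed library) on a known cubic:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1 + 2 * x + 0.5 * x**2 - 0.1 * x**3          # a known cubic, for illustration

# New variables x2 = x^2 and x3 = x^3 make the model linear in the weights.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w, 3))                             # recovers [1, 2, 0.5, -0.1]
```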

Other Regression-Based Models
- Generalized linear model: the foundation on which linear regression can be applied to modeling categorical response variables
  - The variance of y is a function of the mean value of y, not a constant
  - Logistic regression: models the probability of some event occurring as a linear function of a set of predictor variables
  - Poisson regression: models data that exhibit a Poisson distribution
- Log-linear models (for categorical data)
  - Approximate discrete multidimensional probability distributions
  - Also useful for data compression and smoothing
- Regression trees and model trees
  - Trees that predict continuous values rather than class labels

Regression Trees and Model Trees
- Regression tree: proposed in the CART system (Breiman et al., 1984)
  - CART: Classification And Regression Trees
  - Each leaf stores a continuous-valued prediction: the average value of the predicted attribute for the training tuples that reach the leaf
- Model tree: proposed by Quinlan (1992)
  - Each leaf holds a regression model, a multivariate linear equation for the predicted attribute
  - A more general case than the regression tree
- Regression and model trees tend to be more accurate than linear regression when the data are not well represented by a simple linear model (see the sketch below)
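A regression-tree sketch using scikit-learn (an assumed library): each leaf predicts the average target value of the training tuples that reach it, as described above, so the fitted function is piecewise constant.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 6, 60).reshape(-1, 1)
y = np.sin(X).ravel()                       # a smooth nonlinear target

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(tree.predict([[1.5], [4.5]]))         # piecewise-constant leaf averages
```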

Predictive Modeling in Multidimensional Databases
- Predictive modeling: predict data values, or construct generalized linear models, based on the database data
- One can only predict value ranges or category distributions
- Method outline: minimal generalization, attribute relevance analysis, generalized linear model construction, prediction
- Determine the major factors that influence the prediction
  - data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
- Multi-level prediction: drill-down and roll-up analysis

Prediction: Numerical Data
[Figure only]

Prediction: Categorical Data
[Figure only]

VI. Summary
- Classification and prediction are two forms of data analysis used to extract models describing important data classes or to predict future data trends.
- Effective and scalable methods have been developed for decision tree induction, naive Bayesian classification, Bayesian belief networks, rule-based classifiers, backpropagation, Support Vector Machines (SVM), associative classification, nearest-neighbor classifiers, case-based reasoning, and other classification methods such as genetic algorithms, rough set, and fuzzy set approaches.
- Linear, nonlinear, and generalized linear regression models can be used for prediction. Many nonlinear problems can be converted to linear problems by transforming the predictor variables. Regression trees and model trees are also used for prediction.

VII. Reference
- J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.
- J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
- R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification (2nd ed.). John Wiley & Sons, 2001.
- T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
