Statistical Software R Assignment: adult and babiesI Data (PPT courseware)


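Data Analysis and Statistical Software Assignment
Name: 杨烨军  Student ID: 2010110148
adult and babiesI data

Part 1: adult data

Contents
2.1, 2.2, 2.3: rpart analysis
2.4: Ensemble methods: adaboost, bagging, random forest analysis
2.5: Nearest-neighbor analysis
2.6: Artificial neural network analysis
2.7: Support vector machine analysis
2.8: Association rule analysis

1. Data overview
The data come from the 1994 census. After filtering on age > 16, AGI > 100, AFNLWGT > 1 and hours worked per week > 0, there are 48842 observations: 32561 in the training set and 16281 in the test set. There are 15 variables: 6 continuous and 9 nominal.
Source: http://archive.ics.uci.edu/ml/datasets/Adult
Task: predict whether a person's income exceeds 50K per year.

Variable description

Data preview
  age        workclass fnlwgt education education.num     marital.status
1  39        State-gov  77516 Bachelors            13      Never-married
2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
3  38          Private 215646   HS-grad             9           Divorced
4  53          Private 234721      11th             7 Married-civ-spouse
5  28          Private 338409 Bachelors            13 Married-civ-spouse
         occupation  relationship  race    sex capital.gain capital.loss
1      Adm-clerical Not-in-family White   Male         2174            0
2   Exec-managerial       Husband White   Male            0            0
3 Handlers-cleaners Not-in-family White   Male            0            0
4 Handlers-cleaners       Husband Black   Male            0            0
5    Prof-specialty          Wife Black Female            0            0
  hours.per.week native.country class
1             40  United-States <=50K
2             13  United-States <=50K
3             40  United-States <=50K
4             40  United-States <=50K
5             40           Cuba <=50K

2.1 Classification tree (rpart) analysis: program
    library(rpart)
    # Training and test sets, comma-separated with a header row
    w  = read.table("e:/adult.txt", header = TRUE, sep = ",")
    wt = read.table("e:/adulttest.txt", header = TRUE, sep = ",")
    summary(w); summary(wt)
    # Fit a classification tree on all predictors and print it
    (b = rpart(class ~ ., w))
    plot(b, uniform = T, branch = 1, margin = 0.1, cex = 0.9); text(b, cex = 0.85)
    # Confusion tables: predicted class (rows) against observed class (columns)
    table(predict(b, w, type = "class"), w$class)
    table(predict(b, wt, type = "class"), wt$class)

2.1 Classification tree (rpart) analysis: output
    n= 32561

    node), split, n, loss, yval, (yprob)
          * denotes terminal node

    1) root 32561 7841 [...]
       [...] =7073.5 318 12 >50K (0.03773585 0.96226415) *
    3) relationship= Husband, Wife 14761 6663 [...]
       [...] =5095.5 522 10 >50K (0.01915709 0.98084291) *
    7) education= Bachelors, Doctorate, Masters, Prof-school 4432 1225 >50K (0.27639892 0.72360108) *

Tree diagram annotations: relationship: unmarried, own-child, not-in-family, other; relationship: husband, wife; higher education; lower education; capital gain above 5096; capital gain below 5096; capital gain above 7074; capital gain below 7074.

2.1 Classification tree (rpart) analysis: conclusion
Whether annual income exceeds 50K is related to a person's role in the family, level of education, and capital gain.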

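People who are the husband or wife of a household tend to have higher income; the higher the education, the higher the income tends to be; the higher the capital gain, the higher the income tends to be. Whether a person's annual income exceeds 50K can therefore be judged from three variables: relationship, education, and capital gain.

2.2 Classification tree (rpart) analysis: program (variable selection 1)
Because education and education.num (years of education) are highly correlated, only education.num is used. The formula below also leaves out fnlwgt and relationship.
    summary(w)
    # Fit and print the tree on the selected predictors
    (b1 = rpart(class ~ age + workclass + education.num + marital.status +
                  occupation + race + sex + capital.gain + capital.loss +
                  hours.per.week + native.country, w))
    plot(b1); text(b1, use.n = T)
    # Confusion tables for the training and test sets
    table(predict(b1, w, type = "class"), w$class)
    table(predict(b1, wt, type = "class"), wt$class)

2.2 Classification tree (rpart) analysis: output (variable selection 1)
    n= 32561

    node), split, n, loss, yval, (yprob)
          * denotes terminal node

    1) root 32561 7841 [...]
       [...] =7139.5 310 11 >50K (0.03548387 0.96451613) *
    3) marital.status= Married-AF-spouse, Married-civ-spouse 14999 6702 [...]
       [...] =5095.5 528 11 >50K (0.02083333 0.97916667) *
    7) education.num>=12.5 4473 1255 >50K (0.28057232 0.71942768) *

2.2 Classification tree (rpart) analysis: result (variable selection 1)
Tree diagram annotations: marital status: divorced, spouse absent, widowed, etc.; marital status: married with spouse; higher education; lower education.

Furthermore, capital.gain and capital.loss are themselves closely tied to the income class. To explore how the remaining variables relate to the income class, the next analysis excludes capital.gain and capital.loss.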
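Dropping predictors and comparing the resulting trees, as described above, can be sketched on a small stand-in data set. The example below is only an illustration under stated assumptions: it uses the `kyphosis` data bundled with rpart instead of the adult data, removes a variable with the `. - Start` formula shorthand rather than listing every predictor, and the helpers `misclass` and `used_vars` are hypothetical names introduced here.

```r
library(rpart)  # recommended package shipped with standard R installations

# kyphosis is a small example data set bundled with rpart (stand-in for adult)
full    <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")
# Same model with one predictor excluded from the formula
reduced <- rpart(Kyphosis ~ . - Start, data = kyphosis, method = "class")

# Misclassification rate: share of rows whose predicted class differs from the truth
misclass <- function(fit, data) {
  mean(predict(fit, data, type = "class") != data$Kyphosis)
}

# Variables that actually appear in splits (leaf rows are marked "<leaf>")
used_vars <- function(fit) {
  setdiff(unique(as.character(fit$frame$var)), "<leaf>")
}

print(used_vars(full))
print(used_vars(reduced))
cat("training error, full:   ", misclass(full, kyphosis), "\n")
cat("training error, reduced:", misclass(reduced, kyphosis), "\n")
```

With the adult data the same pattern would compare the full tree against one fitted without capital.gain and capital.loss, evaluated on both `w` and `wt`.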

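2.3 Classification tree (rpart) analysis: program (variable selection 2)
    # Same predictors as before, without capital.gain and capital.loss
    (b2 = rpart(class ~ age + workclass + education.num + marital.status +
                  occupation + race + sex + hours.per.week + native.country, w))
    plot(b2); text(b2, use.n = T)
    # Confusion tables for the training and test sets
    table(predict(b2, w, type = "class"), w$class)
    table(predict(b2, wt, type = "class"), wt$class)

    n= 32561

    node), split, n, loss, yval, (yprob)
          * denotes terminal node

    1) root 32561 7841 [...]
       [...] =12.5 4473 1255 >50K (0.28057232 0.71942768) *

2.3 Classification tree (rpart) analysis: result (variable selection 2)
Tree diagram annotations: marital status: divorced, spouse absent, separated, etc.; marital status: married with spouse; years of education.

Compared with the previous analyses, the misclassification rates on both the training set and the test set increase, because the capital gain and capital loss information is no longer available.

    library(adabag)
    # AdaBoost.M1 with 15 trees of depth at most 5
    b4 = adaboost.M1(class ~ ., data = w, mfinal = 15, maxdepth = 5)
    # Predict on the training and test sets; print confusion matrix and error
    b4.pred <- predict.boosting(b4, newdata = w);  b4.pred[-1]
    b5.pred <- predict.boosting(b4, newdata = wt); b5.pred[-1]
    barplot(b4$importance)
    b4$importance

Training set:
                   Observed Class
    Predicted Class  [...]   >50K
              [...]
              >50K     869   3937
    $error
    [1] 0.1465864

Test set:
                   Observed Class
    Predicted Class  [...]  >50K.
             <=50K.  12435   3846
    $error
    [1] 0.2362263

Every observation in the test set is classified as <=50K.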
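The reported test error follows directly from the confusion counts: when every test observation is assigned to <=50K, the only errors are the observations whose observed class is >50K. The short check below uses only the two counts printed above; the variable names are introduced here for illustration.

```r
# Counts from the test-set confusion row above (all rows predicted as <=50K)
n_obs_le50k <- 12435  # observed <=50K, predicted <=50K (correct)
n_obs_gt50k <- 3846   # observed >50K, predicted <=50K (errors)

n_test   <- n_obs_le50k + n_obs_gt50k  # size of the test set
test_err <- n_obs_gt50k / n_test       # misclassification rate

cat("test set size:", n_test, "\n")
cat("test error:   ", round(test_err, 7), "\n")
```

The total of 16281 matches the test-set size given in the data overview, and the rate agrees with the printed `$error` of 0.2362263.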

2.4 Ensemble methods: adaboost analysis
    b4$importance
               age      workclass         fnlwgt      education  education.num
         11.764706       0.000000       0.000000      15.294118       1.176471
    marital.status     occupation   relationship           race            sex
          7.058824      12.941176       9.411765       0.000000       0.000000
      capital.gain   capital.loss hours.per.week native.country
         24.705882       9.411765       8.235294       0.000000

The most important variables are capital.gain, education, occupation, and age.

    library(mlbench)
    # AdaBoost.M1 again, now with 25 trees of depth at most 5
    b6 = adaboost.M1(class ~ ., data = w, mfinal = 25, maxdepth = 5)
    # Predict on the training and test sets; print confusion matrix and error
    b6.pred <- predict.boosting(b6, newdata = w);  b6.pred[-1]
    b7.pred <- predict.boosting(b6, newdata = wt); b7.pred[-1]
    barplot(b6$importance)
    b6$importance

Training set:
                   Observed Class
    Predicted Class  [...]   >50K
              [...]
              >50K    1255   4475
    $error
    [1] 0.1419182

Test set:
                   Observed Class
    Predicted Class  [...]  >50K.
             <=50K.  12435   3846
    $error
    [1] 0.2362263

The test set is still classified entirely as <=50K.

With mfinal increased to 25, the training misclassification rate drops slightly (from 0.1465864 to 0.1419182); the difference is small.

    b6$importance
               age      workclass         fnlwgt      education  education.num
        10.6666667      0.0000000      0.0000000     12.6666667      2.0000000
    marital.status     occupation   relationship           race            sex
         6.6666667     12.6666667      9.3333333      0.0000000      0.0000000
      capital.gain   capital.loss
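The ranking of important variables reported for the 15-tree model can be reproduced by sorting the importance vector. The sketch below hard-codes the complete `b4$importance` values printed earlier instead of refitting the model (the `b6` output is cut off in this preview, so it is not used); `imp` and `ranked` are names introduced here.

```r
# b4$importance values copied from the printed output (percentages)
imp <- c(age = 11.764706, workclass = 0, fnlwgt = 0, education = 15.294118,
         education.num = 1.176471, marital.status = 7.058824,
         occupation = 12.941176, relationship = 9.411765, race = 0, sex = 0,
         capital.gain = 24.705882, capital.loss = 9.411765,
         hours.per.week = 8.235294, native.country = 0)

# Sort from most to least important and keep the variables that were used
ranked <- sort(imp, decreasing = TRUE)
print(ranked[ranked > 0])

# The four leading variables
print(head(names(ranked), 4))
# "capital.gain" "education" "occupation" "age"
```

The values sum to 100 up to rounding, so each entry can be read as a percentage share of total importance.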
