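Data Analysis and Statistical Software Assignment
Name: 杨烨军    Student ID: 2010110148
Data sets: adult, babiesI
Part 1: adult data

Contents
1. Data introduction
2.1, 2.2, 2.3: rpart analysis
2.4: Ensemble methods: adaboost, bagging, random forest analysis
2.5: Nearest-neighbor analysis
2.6: Artificial neural network analysis
2.7: Support vector machine analysis
2.8: Association rule analysis

1. Data introduction
The data come from the 1994 census, filtered by the conditions age > 16, AGI > 100, AFNLWGT > 1 and hours per week > 0.
There are 48842 observations in total: 32561 in the training set and 16281 in the test set.
There are 15 variables: 6 continuous and 9 nominal.
Source: http://archive.ics.uci.edu/ml/datasets/Adult
Task: predict whether a person's income exceeds 50K per year.

Variable description

Data overview
      age         workclass fnlwgt education education.num     marital.status
    1  39         State-gov  77516 Bachelors            13      Never-married
    2  50  Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
    3  38           Private 215646   HS-grad             9           Divorced
    4  53           Private 234721      11th             7 Married-civ-spouse
    5  28           Private 338409 Bachelors            13 Married-civ-spouse
             occupation  relationship  race    sex capital.gain capital.loss
    1      Adm-clerical Not-in-family White   Male         2174            0
    2   Exec-managerial       Husband White   Male            0            0
    3 Handlers-cleaners Not-in-family White   Male            0            0
    4 Handlers-cleaners       Husband Black   Male            0            0
    5    Prof-specialty          Wife Black Female            0            0
      hours.per.week native.country class
    1             40  United-States <=50K
    2             13  United-States <=50K
    3             40  United-States <=50K
    4             40  United-States <=50K
    5             40           Cuba <=50K

2.1 Classification tree (rpart) analysis: program
    library(rpart)
    # Read the training and test sets (comma-separated, with a header row).
    # stringsAsFactors = TRUE keeps the nominal columns as factors in R >= 4.0.
    w  <- read.table("e:/adult.txt", header = TRUE, sep = ",", stringsAsFactors = TRUE)
    wt <- read.table("e:/adulttest.txt", header = TRUE, sep = ",", stringsAsFactors = TRUE)
    summary(w); summary(wt)
    # Fit a classification tree on all predictors and print it
    (b <- rpart(class ~ ., w))
    plot(b, uniform = TRUE, branch = 1, margin = 0.1, cex = 0.9)
    text(b, cex = 0.85)
    # Confusion matrices: training set, then test set
    table(predict(b, w, type = "class"), w$class)
    table(predict(b, wt, type = "class"), wt$class)

2.1 Classification tree (rpart) analysis: output
    n= 32561
    node), split, n, loss, yval, (yprob)
          * denotes terminal node
     1) root 32561 7841 [...]
        [...]>=7073.5 318 12 >50K (0.03773585 0.96226415) *
     3) relationship= Husband, Wife 14761 6663 [...]
        [...]>=5095.5 522 10 >50K (0.01915709 0.98084291) *
     7) education= Bachelors, Doctorate, Masters, Prof-school 4432 1225 >50K (0.27639892 0.72360108) *

[Figure: classification tree. Branch labels: relationship Unmarried, Own-child, Not-in-family, Other; relationship Husband, Wife; higher education; lower education; capital gain above 5096; capital gain above 7074; capital gain below 7074; capital gain below 5096.]

2.1 Classification tree (rpart) analysis: conclusion
Whether annual income exceeds 50K is related to a person's role in the household (relationship), education, and capital gain.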
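A person who is the husband or wife of a household tends to have a higher income; income tends to rise with education; income tends to rise with capital gain.
Whether a person's annual income exceeds 50K can therefore be judged from three variables: relationship, education, and capital gain.

2.2 Classification tree (rpart) analysis: program (variable selection 1)
Because education and education.num (years of education) are highly correlated, only education.num is used.
    summary(w)
    # Fit the tree on the selected predictors and print it
    (b1 <- rpart(class ~ age + workclass + education.num + marital.status +
                   occupation + race + sex + capital.gain + capital.loss +
                   hours.per.week + native.country, w))
    plot(b1); text(b1, use.n = TRUE)
    # Confusion matrices: training set, then test set
    table(predict(b1, w, type = "class"), w$class)
    table(predict(b1, wt, type = "class"), wt$class)

2.2 Classification tree (rpart) analysis: output (variable selection 1)
    n= 32561
    node), split, n, loss, yval, (yprob)
          * denotes terminal node
     1) root 32561 7841 [...]
        [...]>=7139.5 310 11 >50K (0.03548387 0.96451613) *
     3) marital.status= Married-AF-spouse, Married-civ-spouse 14999 6702 [...]
        [...]>=5095.5 528 11 >50K (0.02083333 0.97916667) *
     7) education.num>=12.5 4473 1255 >50K (0.28057232 0.71942768) *

2.2 Classification tree (rpart) analysis: results (variable selection 1)
[Figure: classification tree. Branch labels: marital status Divorced, Married-spouse-absent, Widowed, etc.; marital status married with spouse; higher education; lower education.]

Furthermore, capital.gain and capital.loss are themselves closely tied to the income class. To uncover how the remaining variables relate to the income class, the next analysis excludes capital.gain and capital.loss.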
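2.3 Classification tree (rpart) analysis: program (variable selection 2)
    # Same predictors as before, without capital.gain and capital.loss
    (b2 <- rpart(class ~ age + workclass + education.num + marital.status +
                   occupation + race + sex + hours.per.week + native.country, w))
    plot(b2); text(b2, use.n = TRUE)
    # Confusion matrices: training set, then test set
    table(predict(b2, w, type = "class"), w$class)
    table(predict(b2, wt, type = "class"), wt$class)

2.3 Classification tree (rpart) analysis: results (variable selection 2)
    n= 32561
    node), split, n, loss, yval, (yprob)
          * denotes terminal node
     1) root 32561 7841 [...]
        [...]>=12.5 4473 1255 >50K (0.28057232 0.71942768) *

[Figure: classification tree. Branch labels: marital status Divorced, Married-spouse-absent, Separated, etc.; marital status married with spouse; years of education.]

Compared with the previous analysis, the misclassification rates on both the training set and the test set rise, because the information on capital gain and capital loss is no longer available.

2.4 Ensemble methods: adaboost analysis
    library(adabag)
    # adaboost.M1() as provided by the adabag version used here
    b4 <- adaboost.M1(class ~ ., data = w, mfinal = 15, maxdepth = 5)
    # Predictions on the training set and on the test set
    b4.pred <- predict.boosting(b4, newdata = w);  b4.pred[-1]
    b5.pred <- predict.boosting(b4, newdata = wt); b5.pred[-1]
    barplot(b4$importance)
    b4$importance

Training set:
                   Observed Class
    Predicted Class <=50K >50K
              <=50K [...]
              >50K    869 3937
    $error
    [1] 0.1465864

Test set:
                   Observed Class
    Predicted Class <=50K. >50K.
             <=50K.  12435  3846
    $error
    [1] 0.2362263

Every observation in the test set is classified as <=50K.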
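The two adaboost error rates reported above can be checked by hand from the counts that survive in the output. The following is only an illustrative sketch. It assumes that the surviving training row (869, 3937) is the predicted >50K row, and that the root-node loss of 7841 in the rpart output is the number of >50K cases in the training set. All variable names are made up for the example.

```r
# Counts copied from the slides; variable names are illustrative only.
n_train      <- 32561  # training observations
n_train_high <- 7841   # observed >50K in training (root-node loss in the rpart output)
fp <- 869              # predicted >50K, observed <=50K (assumed reading of the row)
tp <- 3937             # predicted >50K, observed >50K

# Training errors = false positives + the >50K cases that were not predicted >50K
train_error <- (fp + (n_train_high - tp)) / n_train

# Test set: every case is predicted <=50K, so the error is the share of >50K cases
test_low   <- 12435
test_high  <- 3846
test_error <- test_high / (test_low + test_high)

print(round(c(train = train_error, test = test_error), 7))
```

Under these assumptions both values agree with the reported $error figures (0.1465864 and 0.2362263). The test figure is simply the proportion of >50K cases, which is what a classifier that always answers <=50K would score.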
2.4 Ensemble methods: adaboost analysis
    > b4$importance
               age      workclass         fnlwgt      education  education.num
         11.764706       0.000000       0.000000      15.294118       1.176471
    marital.status     occupation   relationship           race            sex
          7.058824      12.941176       9.411765       0.000000       0.000000
      capital.gain   capital.loss hours.per.week native.country
         24.705882       9.411765       8.235294       0.000000

The most important variables are capital.gain, education, occupation and age.

2.4 Ensemble methods: adaboost analysis
    library(mlbench)
    # Same model with more boosting iterations (mfinal = 25)
    b6 <- adaboost.M1(class ~ ., data = w, mfinal = 25, maxdepth = 5)
    # Predictions on the training set and on the test set
    b6.pred <- predict.boosting(b6, newdata = w);  b6.pred[-1]
    b7.pred <- predict.boosting(b6, newdata = wt); b7.pred[-1]
    barplot(b6$importance)
    b6$importance

Training set:
                   Observed Class
    Predicted Class <=50K >50K
              <=50K [...]
              >50K   1255 4475
    $error
    [1] 0.1419182

Test set:
                   Observed Class
    Predicted Class <=50K. >50K.
             <=50K.  12435  3846
    $error
    [1] 0.2362263

Every observation in the test set is still classified as <=50K.
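In the output above, the test-set class labels carry a trailing period (<=50K., >50K.) while the training labels do not. One possible explanation for the all-<=50K test predictions, offered here as an assumption that the slides do not investigate, is that this mismatch in factor levels keeps the predicted classes from being matched to the observed ones. The sketch below shows one way the labels could be harmonized before prediction. It uses toy vectors, and `harmonize` is a hypothetical helper rather than part of any package.

```r
# Toy stand-ins for the class columns; in the slides these would be w$class and wt$class.
train_class <- factor(c("<=50K", ">50K", "<=50K"))
test_class  <- factor(c("<=50K.", ">50K.", "<=50K.", "<=50K."))

# Hypothetical helper: drop a trailing period, then reuse the training levels
harmonize <- function(x, ref_levels) {
  cleaned <- sub("\\.$", "", as.character(x))
  factor(cleaned, levels = ref_levels)
}

test_fixed <- harmonize(test_class, levels(train_class))
print(levels(test_fixed))
print(table(test_fixed))
```

In the actual script the same step would be applied as `wt$class <- harmonize(wt$class, levels(w$class))` before calling the prediction functions. Any label that still fails to match a training level becomes NA, so checking `anyNA(wt$class)` afterwards is a cheap safeguard.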
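With mfinal raised to 25, the training-set misclassification rate falls slightly (from 0.1465864 to 0.1419182); the difference is small, and the test-set rate is unchanged at 0.2362263.

2.4 Ensemble methods: adaboost analysis
    > b6$importance
               age      workclass         fnlwgt      education  education.num
        10.6666667      0.0000000      0.0000000     12.6666667      2.0000000
    marital.status     occupation   relationship           race            sex
         6.6666667     12.6666667      9.3333333      0.0000000      0.0000000
      capital.gain   capital.loss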