Applications of Random Forest using R: Classification and Regression

Li Xinhai (李欣海), Institute of Zoology, Chinese Academy of Sciences
5th China R Conference, Beijing, 2012
Email: lixh

Introduction to random forests

Random Forest
Random forest is an ensemble classifier that consists of many decision trees. It outputs the class that is the mode of the classes output by the individual trees (Breiman 2001). It handles "small n, large p" problems, high-order interactions, and correlated predictor variables.
Breiman, L. 2001. Random forests. Machine Learning 45:5-32. (Cited 6500 times as of 2012.)
an-introduction-to-data-mining-for-marketing-and-business-intelligence/

History
The algorithm for inducing a random forest was developed by Leo Breiman (2001) and Adele Cutler, and "Random Forests" is their trademark. The term came from random decision forests, which were first proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman's "bagging" idea with the random selection of features, introduced independently by Ho (1995) and by Amit and Geman (1997), in order to construct a collection of decision trees with controlled variation.
an-introduction-to-data-mining-for-marketing-and-business-intelligence/

Tree models
Regression tree (Crawley 2007, The R Book, p. 691)
Classification tree (Crawley 2007, The R Book, p. 694)
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i$

The statistical community uses irrelevant theory and reaches questionable conclusions?
David R. Cox, Bruce Hoadley, Brad Efron, Emanuel Parzen
NO / YES

Ensemble classifiers
Tree models are simple and often produce noisy (bushy) or weak (stunted) classifiers.
Bagging (Breiman, 1996): fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
Boosting (Freund [...]

Random forest: classification

[...] impvar

varImpPlot(RF)
[Figure: variable importance plot with two panels. MeanDecreaseAccuracy (axis 0.35 to 0.55), predictors in the order extracted: prec_jan, footprint, t_july, aspect, pop, t_ann, elevation, t_jan, landcover, prec_july, GDP, prec_ann, slope, x, y. MeanDecreaseGini (axis 0 to 150), predictors in the order extracted: prec_jan, footprint, aspect, pop, slope, GDP, prec_july, prec_ann, t_july, t_jan, t_ann, elevation, landcover, x, y.]

partialPlot: partial dependence of elevation
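A minimal sketch of what a partial dependence curve computes, written in base R so that it runs without the ibis data. Everything in it is an assumption made for illustration: the toy data frame, the function name partial_dependence, and the logistic regression (glm) used as a stand-in for the random forest. It averages predicted probabilities, whereas partialPlot reports classification results on a log-probability scale, so the y axis is not comparable with the one on the slide.

```r
# Hypothetical toy data standing in for the ibis table (not the author's data).
set.seed(1)
n <- 500
toy <- data.frame(
  elevation = runif(n, 500, 3500),
  slope     = runif(n, 0, 40)
)
# Presence is made most likely near 1500 m (an arbitrary choice for the demo).
p_true <- plogis(2 - ((toy$elevation - 1500) / 800)^2 + 0.02 * toy$slope)
toy$use <- rbinom(n, 1, p_true)

# Stand-in model: logistic regression instead of a random forest.
fit <- glm(use ~ I(elevation / 1000) + I((elevation / 1000)^2) + slope,
           data = toy, family = binomial)

# Partial dependence on one predictor: set that predictor to a grid value
# for every row, predict, and average the predictions over the rows.
partial_dependence <- function(model, data, var, grid) {
  sapply(grid, function(v) {
    data[[var]] <- v
    mean(predict(model, newdata = data, type = "response"))
  })
}

grid <- seq(500, 3500, length.out = 25)
pd <- partial_dependence(fit, toy, "elevation", grid)
```

Calling plot(grid, pd, type = "l", xlab = "Elevation") draws the curve. The same function works with any model whose predict method accepts newdata, which is why the stand-in model can be swapped for a fitted forest with the appropriate predict arguments.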
[Figure: four panels. Two scatter plots of ibis$elevation against Index (Index 0 to 3000, elevation 500 to 3500), and two partial dependence plots against Elevation (500 to 3500), titled Absence (y axis 1 to 6) and Presence (y axis -6 to -1).]

    # Plot elevation with rows ordered by x, then by elevation; color marks ibis$use
    ibis <- ibis[order(ibis$x), ]
    plot(ibis$elevation, col = 4 - as.numeric(ibis$use), cex = 0.5, pch = 4)
    ibis <- ibis[order(ibis$elevation), ]
    plot(ibis$elevation, col = 4 - as.numeric(ibis$use), cex = 0.5, pch = 4)
    # Partial dependence on elevation for class "0" (absence) and class "1" (presence)
    partialPlot(RF, ibis, elevation, "0", main = "Absence", xlab = "Elevation")
    partialPlot(RF, ibis, elevation, "1", main = "Presence", xlab = "Elevation")

Model comparison
[Figure: map panels labelled Occurrences, GLM, GAM, CTA, ANN, SRE, GBM, RF, MDA and MARS.]
Predicted current suitable habitat of crested ibis using the models in BIOMOD (warm colors mark the suitable areas).

Predicted current suitable habitat of black snub-nosed monkey using BIOMOD (warm colors mark the suitable areas).
[Figure: map panels labelled GLM, GAM, CTA, ANN, SRE, GBM, RF, MDA and MARS.]

Random forest: regression

Historical decline of Asian elephant
[Figure: map with point legend "Year": 211; 212-500; 501-600; 601-700; 701-800; 801-900; 901-1000; 1001-1100; 1101-1200; 1201-1300; 1301-1400; 1401-1500; 1501-1600; 1601-1700; 1701-1800; 1801-1900; 1901-1940; 1941-1960; 1961-1980; 1981-2000.]

Variables associated with species range

lat.max  Temp. Yang  Temp. Ljungqvist  Temp. Moberg  Temp. Mann.cps  Temp. Mann.eiv  Precipitation  Drought  Flood  Population
  34.42       -0.52             -0.41         -0.49           -0.59           -0.22          -0.07    -0.10  -5.32       16.44
  31.57       -0.41             -0.21         -0.35            0.00           -0.02          -0.35    -2.12  -6.37       16.44
  29.53        0.28             -0.23         -0.25           -0.21           -0.11          -0.78    -3.76  -2.19       17.74
  31.87        0.35             -0.25         -0.29           -0.27            0.02          -0.34    -3.24  -5.07       16.31
  28.93       -0.13             -0.09         -0.31           -0.62           -0.16           0.11    -2.38  -1.79       16.93
  34.79        0.19              0.09         -0.13           -0.38            0.14           0.67     7.30  22.57       16.60

    library(randomForest)
    RF <- randomForest(lat.max ~ ., data = dd, ntree = 1000, importance = TRUE)
    imp <- importance(RF)
    impvar <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]  # sort by importance
    # Plot partial effects, one panel per predictor
    op <- par(mfrow = c(3, 3), mar = c(4, 4, 2, 2))
    for (i in seq_along(impvar)) {
      partialPlot(RF, dd, impvar[i], xlab = impvar[i],  # partial effects
                  ylab = "Lat.max", ylim = c(26, 30), main = "")
    }

Partial effect of variables on maximum latitude
[Figure: partial dependence panels with Lat.max (26 to 30) on the y axis. Panels recovered: Population (16.5 to 19.5) and Temp.Mann.eiv.cru (-0.4 to 0.0); the remaining panels are truncated in the source.]
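The panels are ordered by the first column of importance(RF), which for a regression forest fitted with importance = TRUE is a permutation-based measure (%IncMSE). The idea behind it can be sketched in base R. This is only an illustration under stated assumptions: the toy data, the helper names mse and permutation_importance, the held-out test set, and the linear model (lm) standing in for the forest are all hypothetical. randomForest itself permutes each predictor on the out-of-bag cases of every tree and scales the result, which this sketch does not reproduce.

```r
# Hypothetical toy data: y depends strongly on x1, weakly on x2, not on x3.
set.seed(42)
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 3 * toy$x1 + 1 * toy$x2 + rnorm(n)

train <- toy[1:300, ]
test  <- toy[301:400, ]

# Stand-in model: linear regression instead of a random forest.
fit <- lm(y ~ ., data = train)

# Mean squared prediction error of a model on a data frame with column y.
mse <- function(model, data) {
  mean((data$y - predict(model, newdata = data))^2)
}

# Permutation importance: shuffle one predictor at a time and record the
# average increase in error relative to the unshuffled data.
permutation_importance <- function(model, data, vars, n_rep = 20) {
  base <- mse(model, data)
  sapply(vars, function(v) {
    increases <- replicate(n_rep, {
      shuffled <- data
      shuffled[[v]] <- sample(shuffled[[v]])
      mse(model, shuffled) - base
    })
    mean(increases)
  })
}

imp <- permutation_importance(fit, test, c("x1", "x2", "x3"))
impvar <- names(imp)[order(imp, decreasing = TRUE)]  # sort by importance
print(round(imp, 2))
print(impvar)
```

Shuffling a predictor breaks its link with the response while keeping its marginal distribution, so a large increase in error marks a predictor the model relies on. A predictor the model ignores gives an increase close to zero.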