Variable Selection in High-dimensional Regression Models

Tong Wang, Shanxi Medical University, November 2014

Big Data Era

- Business & Finance: recommender systems, fraud detection, risk management, portfolio optimization, ...
- Engineering: computer vision, natural language processing, signal/image processing, ...
- Sociology & Psychology: customer behavior, social networks, ...
- Meteorology: climate change, weather forecasting, ...
- Genomics & Computational Biology: microarray, proteomics, sequencing data, ...

In many applications we need to identify a few variables that are important to the response of interest.

Talk Outline

- Review of approaches to variable selection: the classical approach, penalized methods, screening methods
- Work of our team
- Future work

Regression Model

Data: observations (x_1, y_1), ..., (x_n, y_n), with covariates x ∈ R^p and response y.

Goal in regression analysis: explore the relationship between x and y. A typical target of estimation is the conditional mean m(x) = E(y | x).

A popular regression model is
\[
y = m(x) + \varepsilon,
\]
where ε is an error term with E(ε | x) = 0; for simplicity we can often assume ε is independent of x.

With m(x) fully unspecified this is a nonparametric regression model. It is not very useful if dim(x) is large: m(x) is very difficult to estimate due to the so-called "curse of dimensionality".

Modeling strategy: impose some structure on m(x):
- linear model
- additive model (Hastie & Tibshirani, 1990)
- varying coefficient model (Hastie & Tibshirani, 1993)
We focus on the linear model here.

Classical Variable Selection Approach

Mallows Cp and related criteria (standard forms are recalled below).

Summary

- Best subset selection plus a model selection criterion (AIC, BIC, etc.)
- A combinatorial optimization problem (NP-hard)
- Instability in the selection process (Breiman, 1996)
- Not a good idea for high-dimensional data.
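The formulas for these criteria were lost in extraction; for reference, their standard textbook forms for a Gaussian linear model with k fitted parameters are (RSS_k is the residual sum of squares of the candidate model and σ̂² an estimate of the error variance from the full model):

\[
C_p = \frac{\mathrm{RSS}_k}{\hat{\sigma}^2} - n + 2k, \qquad
\mathrm{AIC} = n \log(\mathrm{RSS}_k / n) + 2k, \qquad
\mathrm{BIC} = n \log(\mathrm{RSS}_k / n) + k \log n .
\]

Best subset selection evaluates such a criterion over all 2^p candidate submodels, which is what makes the problem combinatorial.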

Penalized Variable Selection Approach

Impact of High Dimensionality

Noise accumulation is a common phenomenon in high-dimensional prediction.

High collinearity and spurious correlation are commonly associated with large-scale data sets. An experiment illustrating this is sketched below.

Scientific implication: when gene X1 is responsible for breast cancer, one can also possibly "discover" five other genes that are in fact independent of the outcome.

(Fan and Lv, "A selective overview of variable selection in high dimensional feature space", 2010)
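The details of the experiment did not survive extraction; the following is a minimal sketch in the same spirit (the sample size, dimensions, and seed are our choices, not from the slides), showing that the largest marginal correlation between a response and predictors that are completely independent of it grows with the dimensionality:

import numpy as np

rng = np.random.default_rng(0)
n = 50  # sample size
for p in (10, 100, 1000, 10000):
    X = rng.standard_normal((n, p))   # predictors, independent of y by construction
    y = rng.standard_normal(n)
    # absolute marginal sample correlation of each predictor with y
    Xc = (X - X.mean(0)) / X.std(0)
    yc = (y - y.mean()) / y.std()
    corr = np.abs(Xc.T @ yc) / n
    print(f"p = {p:6d}: max |corr| = {corr.max():.3f}")

With n = 50 the largest spurious correlation typically climbs past 0.5 by p = 10000, which is why a variable can look strongly associated with the outcome purely by chance.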

Summary

- To reduce the effect of noise in high-dimensional inference, it is often useful to impose some sparsity structure on the model.
- Sparse modeling has been widely used to deal with high dimensionality.
- The main assumption is that the p-dimensional parameter vector is sparse, with many components being exactly zero or negligibly small.
- Such an assumption is crucial for identifiability, especially when the sample size is relatively small.
- Variable selection can increase estimation accuracy by effectively identifying the important predictors, and it improves model interpretability.

Penalized Regression Framework

Penalized least squares:
\[
\min_{\beta} \; \frac{1}{2n} \|y - X\beta\|_2^2 + \sum_{j=1}^{p} p_{\lambda}(|\beta_j|).
\]
Penalized likelihood formulation:
\[
\max_{\beta} \; \ell(\beta) - n \sum_{j=1}^{p} p_{\lambda}(|\beta_j|).
\]
L0-penalized least squares recovers best subset selection; AIC and BIC correspond to particular choices of λ, and there are other information criteria with different λ. A minimal worked example of the framework follows.
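As a concrete instance of the penalized least squares framework, here is a minimal sketch using scikit-learn's Lasso (the library, the regularization level, and the simulated data are our choices, not from the slides); it shows the L1 penalty zeroing out the coefficients of irrelevant predictors:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 100, 200                          # more predictors than observations
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, -1.0, 2.5]   # only 5 truly active predictors
y = X @ beta + 0.5 * rng.standard_normal(n)

# scikit-learn calls the regularization parameter lambda "alpha"
fit = Lasso(alpha=0.1).fit(X, y)
print("selected predictors:", np.flatnonzero(fit.coef_))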

Penalty Functions

- Hard thresholding penalty, Antoniadis (1997)
- Lq penalty, also known as bridge regression, Frank and Friedman (1993); ridge regression is the q = 2 case
- Lasso (Least Absolute Shrinkage and Selection Operator), Tibshirani (1996); its estimator is the soft-thresholding estimator (Donoho and Johnstone, 1994)
- SCAD (Smoothly Clipped Absolute Deviation), Fan and Li (2001)
- Elastic Net, Zou and Hastie (2005)
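The penalty formulas on this slide were lost in extraction; their standard forms (following the cited papers; the SCAD constant a > 2 is commonly set to 3.7) are:

\[
\begin{aligned}
\text{Hard:}\quad & p_{\lambda}(t) = \lambda^2 - (|t| - \lambda)^2 \, 1(|t| < \lambda), \\
L_q \text{ (bridge):}\quad & p_{\lambda}(t) = \lambda |t|^q, \quad 0 < q \le 2 \quad (q = 1 \text{ is the Lasso, } q = 2 \text{ is ridge}), \\
\text{SCAD:}\quad & p'_{\lambda}(t) = \lambda \Bigl\{ 1(t \le \lambda) + \frac{(a\lambda - t)_+}{(a - 1)\lambda} \, 1(t > \lambda) \Bigr\}, \quad t > 0, \\
\text{Elastic net:}\quad & p_{\lambda_1,\lambda_2}(t) = \lambda_1 |t| + \lambda_2 t^2 .
\end{aligned}
\]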

Penalty Functions (continued)

- Adaptive Lasso, Zou (2006): reweights the Lasso penalty by w_j = 1/|β̃_j|^γ, where β̃_j is an initial estimate of the jth parameter.
- MCP (Minimax Concave Penalty), Zhang (2010): as γ → ∞ the MCP penalty becomes the LASSO penalty, and as γ → 1 it becomes the hard thresholding penalty.
- Seamless-L0 (SELO) penalty, Dicker, Huang and Lin (2013).
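The formulas for these two penalties, also lost from the slides, take the following standard forms (from Zou, 2006, and Zhang, 2010):

\[
\text{Adaptive Lasso:}\quad \lambda \sum_{j=1}^{p} w_j |\beta_j|, \quad w_j = 1/|\tilde{\beta}_j|^{\gamma}, \; \gamma > 0;
\qquad
\text{MCP:}\quad p_{\lambda,\gamma}(t) = \lambda \int_0^{|t|} \Bigl( 1 - \frac{x}{\gamma \lambda} \Bigr)_{+} dx .
\]

The integral form makes the two limits above easy to check: as γ → ∞ the integrand tends to 1, giving the Lasso penalty λ|t|.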

Desirable Properties for Penalty Functions

Fan and Li (2001) advocate penalty functions that give estimators with three properties:
- Sparsity: the resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero to reduce model complexity;
- Unbiasedness: the resulting estimator is nearly unbiased when the true unknown parameter is large, to avoid unnecessary modeling bias;
- Continuity: the resulting estimator is continuous in the data z, to avoid instability in model prediction.

Thresholding Properties

(Fan and Li (2001). Variable selection via nonconcave penalized likelihood and its oracle properties.) The corresponding thresholding rules are sketched below.
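The thresholding plots from this slide did not survive extraction; a minimal sketch of the three canonical univariate rules (standard formulas, stated for a least-squares estimate z under an orthonormal design) shows how each trades off the properties above:

import numpy as np

def hard_threshold(z, lam):
    # sparse and unbiased for large |z|, but discontinuous at |z| = lam
    return z * (np.abs(z) > lam)

def soft_threshold(z, lam):
    # the Lasso rule: sparse and continuous, but biased (shrinks large |z| by lam)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    # the SCAD rule: sparse, continuous, and nearly unbiased for large |z|
    absz = np.abs(z)
    return np.where(absz <= 2 * lam,
                    np.sign(z) * np.maximum(absz - lam, 0.0),
                    np.where(absz <= a * lam,
                             ((a - 1) * z - np.sign(z) * a * lam) / (a - 2),
                             z))

z = np.linspace(-6.0, 6.0, 13)
print(scad_threshold(z, 1.0))   # large |z| pass through unshrunk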

Screening Methods for Variable Selection

Curse of Ultrahigh Dimensionality

Sure Independence Screening (SIS) Method

Two steps: marginal screening first reduces the p variables to d variables; penalized likelihood estimation then reduces the d variables to a final set of s selected variables. A sketch of the procedure follows.
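A minimal sketch of SIS followed by penalized estimation (the screening size d = ⌊n / log n⌋ follows Fan and Lv, 2008; the Lasso step and the simulated data are our choices, not from the slides):

import numpy as np
from sklearn.linear_model import Lasso

def sis_lasso(X, y, alpha=0.1):
    n, p = X.shape
    d = int(n / np.log(n))               # screening size suggested by Fan and Lv (2008)
    # rank variables by absolute marginal correlation with y
    Xc = (X - X.mean(0)) / X.std(0)
    yc = (y - y.mean()) / y.std()
    corr = np.abs(Xc.T @ yc) / n
    keep = np.argsort(corr)[-d:]         # indices of the d strongest variables
    # penalized estimation on the screened submodel
    fit = Lasso(alpha=alpha).fit(X[:, keep], y)
    return keep[np.flatnonzero(fit.coef_)]  # the final s selected variables

# usage: recover a sparse signal buried among p = 5000 noise variables
rng = np.random.default_rng(2)
n, p = 100, 5000
X = rng.standard_normal((n, p))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.standard_normal(n)
print("selected:", sis_lasso(X, y))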

Potential Drawbacks

Iterative SIS (ISIS)

Illustration of ISIS

Iterative Feature Selection

Extensions of Marginal Correlation

Extensions of Model

The ideas of penalized regression and ISIS are widely applicable. They can be applied to:
- Classification
- Categorical data analysis
- Survival analysis
- Nonparametric learning
- Robust and quantile regression

Work of Our Team

Big Data in Medicine

Interesting questions:
- Disease classification, e.g. for diagnosis
- Predicting clinical outcome, e.g. for personalized treatment
- Associations between phenotypes and SNPs or eQTLs
- Analysis of interactions, e.g. gene-gene and gene-environment

Our team focuses on predicting the clinical outcomes of cancer and carries out comparative studies of the predictive performance of different approaches.
