Variable Selection Techniques
This post addresses the following prompt:
When building regression models, it often takes a lot of experience and knowledge about your data in order to determine the variables and transformation of variables that you want to include in the model building process. There are many variable selection techniques (or feature selection) but it can be a confusing practice when you are first learning. Write up a brief discussion of how you would plan to determine variables to use in a regression model. What variable selection techniques do you prefer and why?
Variable selection is important in regression analysis because the variance of our estimates grows as more variables enter the model; the overall goal is the smallest model that still fits the data well. Keeping too many variables also raises the risk of collinearity among the predictors. Although there are many techniques for deciding which variables to include in a regression model, I believe the criterion-based technique is ideal. Stepwise procedures are the simplest, but they are fairly restrictive: they follow a single add-or-drop path through the space of possible models. Criterion-based selection is much less restrictive because we fit and score many more candidate models.
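To make the contrast concrete, here is a minimal sketch of criterion-based (best-subset) selection in Python with statsmodels. It assumes a pandas DataFrame with hypothetical predictor columns X1 through X4 and a response column y; none of these names come from a real dataset. Where stepwise selection would visit only one path of models, this enumerates every subset and scores each fit.

```python
# A minimal sketch of best-subset selection, assuming a pandas DataFrame
# `df` with predictor columns X1..X4 and a response column `y`.
from itertools import combinations

import pandas as pd
import statsmodels.api as sm

def best_subset(df, predictors, response):
    """Fit an OLS model for every non-empty subset of predictors
    and record a few selection criteria for each."""
    results = []
    for k in range(1, len(predictors) + 1):
        for subset in combinations(predictors, k):
            X = sm.add_constant(df[list(subset)])  # include an intercept
            fit = sm.OLS(df[response], X).fit()
            results.append({
                "predictors": subset,
                "adj_r2": fit.rsquared_adj,
                "aic": fit.aic,
            })
    return pd.DataFrame(results)

# Example usage (hypothetical column names):
# scores = best_subset(df, ["X1", "X2", "X3", "X4"], "y")
# print(scores.sort_values("adj_r2", ascending=False).head())
```

With four predictors this fits 15 models; the count doubles with each added predictor, which is the main practical cost of the criterion-based approach.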
However, even within criterion-based selection, there are multiple criteria we must decide between. I believe Mallows' Cp statistic is the best choice for a few reasons. First, it is closely related to R^2 and to AIC, two of the other common selection criteria; adjusted R^2 is often used in conjunction with Cp to find the best model, where we want to minimize Cp and maximize adjusted R^2. Second, Cp = p for the full model, where p counts the model's parameters. This gives us a built-in benchmark: for any submodel, a Cp value near its own p indicates little bias in that model. Finally, Cp is not a complex value to compute, which is always ideal: for a submodel with p parameters, Cp = SSE_p / sigma_hat^2 - n + 2p, where SSE_p is the submodel's residual sum of squares and sigma_hat^2 is the mean squared error of the full model.
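That formula translates directly into a few lines of code. The sketch below assumes fitted statsmodels OLS results for both the full model and a candidate submodel; the names `full_fit` and `sub_fit` are illustrative, not from any particular analysis.

```python
# A short sketch of computing Mallows' Cp for a candidate submodel,
# given statsmodels OLS fits for the full model and the submodel.
import statsmodels.api as sm

def mallows_cp(sub_fit, full_fit):
    """Cp = SSE_p / sigma_hat^2 - n + 2p, where sigma_hat^2 is the
    full model's MSE and p counts the submodel's parameters
    (intercept included). For the full model itself, Cp equals p."""
    sigma2 = full_fit.mse_resid       # error variance estimate from the full model
    n = int(full_fit.nobs)            # number of observations
    p = int(sub_fit.df_model) + 1     # submodel parameters, incl. intercept
    return sub_fit.ssr / sigma2 - n + 2 * p
```

In practice, I would scan the candidate submodels for those whose Cp is close to their own p, and among those pick the one that also maximizes adjusted R^2.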