The statistical dangers of standard stepwise variable selection

Yesterday I attended the first internal economics research seminar of the academic year, and it got me thinking about this issue. Most standard financial econometrics textbooks present stepwise regression as a way of selecting a multivariate regression model, with little regard for its fatal flaws (this is not true of all textbooks, but it is common among some of the most popular).

In recent years I have been more influenced by the field of statistics, which has known about the fatal flaws of this method for some time.

These flaws can be summarised as follows (see Harrell (2010) for detailed proof):

  1. Standard errors are biased towards zero
  2. P-values are also biased towards zero
  3. Parameter estimates are biased away from zero
  4. F and Chi-Squared test statistics do not have the claimed distribution
  5. R-Squared is biased upwards
  6. Resulting models are complex with exacerbated collinearity problems
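A small simulation makes the upward bias in R-Squared concrete. This is only a sketch under my own assumptions (the sample size, the number of candidates and the function names are illustrative, and it mimics just the first step of forward selection): every candidate predictor is pure noise, yet the one picked because it fits best looks far better than a predictor chosen before seeing the data.

```python
import random
import statistics


def correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_obs=50, n_candidates=20, n_trials=200, seed=1):
    """Return (mean R^2 of a pre-specified predictor,
    mean R^2 of the best-fitting predictor) over many noise datasets."""
    rng = random.Random(seed)
    fixed, selected = [], []
    for _ in range(n_trials):
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        r2 = []
        for _ in range(n_candidates):
            # Each candidate is pure noise, unrelated to y by construction.
            x = [rng.gauss(0, 1) for _ in range(n_obs)]
            r2.append(correlation(x, y) ** 2)
        fixed.append(r2[0])       # predictor chosen before seeing the data
        selected.append(max(r2))  # predictor chosen because it fits best
    return statistics.fmean(fixed), statistics.fmean(selected)


mean_fixed, mean_selected = simulate()
print(f"mean R^2, pre-specified predictor: {mean_fixed:.3f}")
print(f"mean R^2, best of 20 candidates:   {mean_selected:.3f}")
```

The true R-Squared is zero for every candidate, so any gap between the two printed numbers is created purely by the act of selecting.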

These flaws arise because a test designed for a single, pre-specified hypothesis is applied many times over, under the assumption that consecutive tests are independent.  Flom & Cassell (2007) use a nice analogy to summarise this issue:

In stepwise regression, this assumption is grossly violated in ways that are difficult to determine. For example, if you toss a coin ten times and get ten heads, then you are pretty sure that something weird is going on. You can quantify exactly how unlikely such an event is, given that the probability of heads on any one toss is 0.5. If you have 10 people each toss a coin ten times, and one of them gets 10 heads, you are less suspicious, but you can still quantify the likelihood. But if you have a bunch of friends (you don’t count them) toss coins some number of times (they don’t tell you how many) and someone gets 10 heads in a row, you don’t even know how suspicious to be. That’s stepwise.
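The two cases in the analogy that can be quantified work out as follows. This is my own arithmetic rather than anything taken from the paper, and it assumes fair coins and independent tossers.

```python
# One person, ten tosses of a fair coin: probability of ten heads.
p_ten_heads = 0.5 ** 10

# Ten people each toss ten times: probability that at least one gets ten heads.
n_people = 10
p_any_of_ten = 1 - (1 - p_ten_heads) ** n_people

print(f"one person:              {p_ten_heads:.6f}")
print(f"at least one of ten:     {p_any_of_ten:.6f}")

# The third case cannot be computed at all: the number of friends and the
# number of tosses are both unknown, so there is no value to plug in.
```

Stepwise selection sits in that third case, because the number and nature of the tests actually carried out depend on the data.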

Flom & Cassell (2007) go on to provide a number of more statistically valid solutions, my favourite being the Lasso method.
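To give a flavour of the Lasso, here is a minimal sketch using cyclic coordinate descent on made-up data. It is not the procedure from Flom & Cassell's paper; the penalty value, the simulated data and all function names are my own illustrative assumptions, and in practice you would use an established implementation and choose the penalty by cross-validation.

```python
import random


def centre(values):
    """Subtract the mean so that no intercept term is needed."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold, setting small values to zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso(columns, y, penalty, n_sweeps=200):
    """Minimise RSS / (2n) + penalty * sum(|b_j|) by cyclic coordinate descent.

    columns is a list of centred predictor columns; y is centred too.
    """
    n = len(y)
    coefs = [0.0] * len(columns)
    residual = list(y)  # equals y - Xb, and b starts at zero
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            scale = sum(v * v for v in col) / n
            # Covariance of column j with the residual that excludes its own term.
            rho = sum(v * r for v, r in zip(col, residual)) / n + scale * coefs[j]
            new = soft_threshold(rho, penalty) / scale
            delta = new - coefs[j]
            if delta != 0.0:
                residual = [r - delta * v for r, v in zip(residual, col)]
                coefs[j] = new
    return coefs


# Made-up data: 10 candidate predictors, only the first two matter.
rng = random.Random(42)
n_obs, n_vars = 200, 10
raw = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
y_raw = [2.0 * raw[0][i] - 1.5 * raw[1][i] + rng.gauss(0, 1) for i in range(n_obs)]

columns = [centre(col) for col in raw]
y = centre(y_raw)
coefs = lasso(columns, y, penalty=0.3)
for j, b in enumerate(coefs):
    print(f"x{j}: {b: .3f}")
```

Unlike stepwise selection, all candidates are considered jointly in a single penalised fit, and irrelevant coefficients are shrunk to exactly zero rather than being dropped by a sequence of tests. The surviving coefficients are themselves shrunk towards zero, which is the price paid for the selection.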

In my Financial Econometrics class I now encourage students to think carefully when selecting variables for a model, and to use the general principles for building a model set out by Gelman & Hill (2006):


Copy of slides: General principles for building a model


Flom, P. L. and Cassell, D. L. (2007). Stopping stepwise: Why stepwise and similar selection methods are bad, and what you should use. NESUG 2007 Proceedings.

Harrell, F. E. (2010). Regression modeling strategies: With applications to linear models, logistic regression, and survival analysis. Springer-Verlag, New York.

Gelman, A. and Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, New York.