‘Metrics Monday: Multicollinearity
Suppose you have the following regression model:
(1) Y = a + b1X1 + b2X2 + … + bKXK + e.
You have N observations which you use to estimate the regression. If N < K, you will not be able to estimate the vector of parameters b = (b1, b2, …, bK). That’s because you have fewer equations than you have unknowns in your system–recall from your middle-school algebra classes that you need at least as many equations as you have unknowns in order to solve for those unknowns. So in econometrics, N < K means that you cannot “solve” for b (i.e., it is under-determined, and infinitely many values of b fit the data equally well), N = K means that your equation has a unique solution for b (i.e., it is exactly determined), and N > K means that you have more equations than unknowns (i.e., it is over-determined), so that in general no b fits every observation exactly and least squares picks the b that minimizes the sum of squared residuals.
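To make the counting argument concrete, here is a minimal sketch in Python with NumPy. The data are simulated, and the choices N = 3 and K = 5 are assumptions made purely for illustration. The sketch shows that when N < K the regressor matrix has fewer independent rows than columns, so two different coefficient vectors can fit the data equally well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration: N = 3 observations, K = 5 regressors.
n_obs, n_regressors = 3, 5
X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, n_regressors))])
y = rng.normal(size=n_obs)

# The rank of X cannot exceed N, so it falls short of the number of columns.
rank_X = np.linalg.matrix_rank(X)
n_cols = X.shape[1]

# lstsq still returns an answer: the minimum-norm b among the many that fit.
b_min, *_ = np.linalg.lstsq(X, y, rcond=None)

# Adding any vector from the null space of X gives a different b with the same fit.
_, _, Vt = np.linalg.svd(X)
b_alt = b_min + Vt[-1]

same_fit = np.allclose(X @ b_min, X @ b_alt)
different_b = not np.allclose(b_min, b_alt)
print(rank_X, n_cols, same_fit, different_b)
```

Because both coefficient vectors reproduce the same fitted values, the data cannot tell them apart, which is what "cannot estimate b" means in practice.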
Multicollinearity is the problem that arises when N is too small relative to K, or what Arthur Goldberger called “micronumerosity,” referring to too small a number of observations relative to the number of parameters. The most extreme version of multicollinearity is N < K, in which case you cannot estimate anything.
A less extreme version of multicollinearity is when there is an exact linear relationship between two variables. Suppose X1 and X2 in equation (1) above are respectively dummy variables for whether one is male and whether one is female. Barring the unlikely case where the data include one or more intersex individuals, trying to estimate equation (1) will lead to one of the two variables being dropped, simply because X1 + X2 = 1, i.e., there is an exact linear relationship between the two. If you were to try to “force” that estimation, your statistical package would not be able to invert the (X’X) matrix necessary to estimate b by least squares, and the only way to include both variables would be to estimate equation (1) without a constant.
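The male/female example can be sketched as follows in Python with NumPy. The sample (40 men and 60 women, with a made-up outcome) is an assumption for illustration only. The sketch checks that a constant plus both dummies is rank-deficient, and that dropping the constant restores full column rank.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample: 40 men followed by 60 women.
n_obs = 100
male = (np.arange(n_obs) < 40).astype(float)
female = 1.0 - male
y = 1.0 + 2.0 * male + rng.normal(size=n_obs)  # made-up outcome

# Constant + both dummies: the constant equals male + female, so X'X is singular.
X_trap = np.column_stack([np.ones(n_obs), male, female])
rank_trap = np.linalg.matrix_rank(X_trap)

# Both dummies without a constant: full column rank, so b is estimable.
X_no_const = np.column_stack([male, female])
rank_no_const = np.linalg.matrix_rank(X_no_const)
coef, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)

# Without a constant, each dummy's coefficient is simply that group's mean of y.
group_means = np.array([y[male == 1].mean(), y[female == 1].mean()])
print(rank_trap, rank_no_const, coef, group_means)
```

In the no-constant version the two coefficients are the group means, so the difference between them carries the same information as the single dummy coefficient you would get after dropping one of the two variables.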
The more common version of the multicollinearity problem is when the correlation between two or more variables is “too high,” meaning that there is an approximate linear relationship between those variables. A good example would be between the amount of food one purchases which one consumes, the amount of food one purchases which one wastes, and the total amount of food one purchases. Food consumed and food wasted need not sum up to one’s total food purchases–sometimes one gives food to someone else–but the correlation is high.
When that happens, the OLS estimator is still unbiased, and as Kennedy (2008)–my Bible when it comes to the fundamentals of econometrics–notes, the Gauss-Markov theorem still holds, and OLS is BLUE. Rather, the problem is that the standard errors blow up, and b is imprecisely estimated, and so hypothesis tests will tend to fail to reject the null hypothesis that the components of b are equal to zero.
Kennedy provides a neat intuitive discussion of why that is. Think of the variation in X1 and X2 in the context of a Venn diagram. Each of two sets represents the variation in one variable, with the intersection between the two representing the variation that is common to both variables. Then, the variation in each variable that is not common to the other is represented by the part of the set for that variable which lies outside the intersection. This means that the more highly correlated two variables are, the less variation is available to identify their coefficients–that is, the more imprecisely estimated those coefficients will be. It is in that sense that multicollinearity is a consequence of there being not enough variation in the data, which is why the common recommendation that is made to deal with collinearity is to “get more data,” i.e., increase N, since multicollinearity is caused by N being too small relative to K.
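The Venn-diagram intuition can be checked numerically. The sketch below, in Python with NumPy, is an illustration under stated assumptions: two simulated regressors with a correlation rho that I choose, a known error standard deviation of 1, and the textbook OLS covariance matrix sigma^2 (X’X)^-1. The helper `slope_standard_error` is a hypothetical name, not something from Kennedy.

```python
import numpy as np

def slope_standard_error(rho, n_obs, sigma=1.0, seed=2):
    """Standard deviation of the OLS coefficient on X1 when corr(X1, X2) is about rho."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n_obs)
    z = rng.normal(size=n_obs)
    # x2 shares a fraction rho of its variation with x1 (the "intersection").
    x2 = rho * x1 + np.sqrt(1.0 - rho**2) * z
    X = np.column_stack([np.ones(n_obs), x1, x2])
    # Textbook OLS covariance with a known error variance: sigma^2 (X'X)^-1.
    cov_b = sigma**2 * np.linalg.inv(X.T @ X)
    return float(np.sqrt(cov_b[1, 1]))

# Same N, increasing overlap between X1 and X2.
se_by_rho = {rho: slope_standard_error(rho, n_obs=500) for rho in (0.0, 0.9, 0.99)}

# Same overlap, ten times as many observations ("get more data").
se_more_data = slope_standard_error(0.99, n_obs=5000)

for rho, se in se_by_rho.items():
    print(f"rho = {rho:4.2f}, N = 500:  se(b1) = {se:.3f}")
print(f"rho = 0.99, N = 5000: se(b1) = {se_more_data:.3f}")
```

The standard error rises as the overlap between the two variables grows, and falls again when N increases with the correlation held fixed, which is the sense in which collinearity and too few observations are the same problem.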
Unless you have perfect collinearity, in which case Stata will drop a regressor, detecting multicollinearity is tricky, given that having imprecise estimates is not uncommon with observational data. One thing I see often in the manuscripts I review or am in charge of as an editor is a correlation matrix, which shows the correlation between the variables in a regression. But this is only useful insofar as you have multicollinearity issues between two variables; if the multicollinearity issue stems from an approximate linear relationship between three or more variables, the correlation matrix will be useless.
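Here is a small sketch, in Python with NumPy, of why a correlation matrix can miss the problem. The setup is an assumption for illustration: three independent simulated variables and a fourth that is almost exactly their sum. As an alternative check, the sketch regresses the fourth variable on the other three. The R-squared of that auxiliary regression is the ingredient of the variance inflation factor, 1 / (1 - R^2), a diagnostic that is not discussed in this post but is widely available in statistical packages.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: x4 is nearly an exact linear combination of x1, x2, and x3.
n_obs = 1000
x1, x2, x3 = rng.normal(size=(3, n_obs))
x4 = x1 + x2 + x3 + 0.1 * rng.normal(size=n_obs)

# Largest pairwise correlation (off-diagonal entry) in the correlation matrix.
corr = np.corrcoef(np.column_stack([x1, x2, x3, x4]), rowvar=False)
max_pairwise = float(np.abs(corr - np.eye(4)).max())

# Auxiliary regression of x4 on a constant and the other three variables.
others = np.column_stack([np.ones(n_obs), x1, x2, x3])
coef, *_ = np.linalg.lstsq(others, x4, rcond=None)
resid = x4 - others @ coef
r2_aux = float(1.0 - np.sum(resid**2) / np.sum((x4 - x4.mean()) ** 2))

print(f"largest pairwise correlation: {max_pairwise:.2f}")
print(f"auxiliary R-squared for x4:   {r2_aux:.3f}")
```

No pair of variables looks alarmingly correlated, yet the auxiliary regression shows that almost none of the variation in the fourth variable is its own.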
What to do when you suspect you are dealing with a multicollinearity problem? Kennedy offers a few ideas; I am listing those that strike me as the most practical:
- Do nothing. This is especially useful if your coefficient estimates turn out to be statistically significant–if you do get significance even with imprecisely estimated coefficients, you’re in relatively good shape.
- Get more data. See the discussion above for why that might be a good idea. This can be a costly option, however, and by “costly,” I mean “impossible.”
- Drop one of the collinear variables. That would have been my default prior to writing this post, but this only is a workable solution if that variable adds nothing to the regression to begin with, i.e., if its estimated coefficient is zero. But then, how can you tell whether that is the case if that coefficient is imprecisely estimated? Moreover, doing this introduces bias, so you need to think carefully about whether you’re willing to deal with bias in order to mitigate imprecision.
- Use principal components or factor analysis. This boils down to creating an index with the multicollinear variables or estimating some linear combination of those same variables which is then used as a single regressor. The latter is especially useful when you have several variables that aim to measure the same thing, and you want to include them all.
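The last option in the list can be sketched as follows in Python with NumPy. Everything here is simulated and assumed for illustration: three noisy measures of one underlying variable (`latent`, a hypothetical name), standardized and collapsed into their first principal component, which then enters the regression as a single index. This is one simple way to build such an index, not necessarily the one Kennedy has in mind.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: three noisy measures of the same underlying variable.
n_obs = 500
latent = rng.normal(size=n_obs)
measures = np.column_stack(
    [latent + 0.3 * rng.normal(size=n_obs) for _ in range(3)]
)
y = 1.0 + 2.0 * latent + rng.normal(size=n_obs)

# Standardize each measure, then take the first principal component via the SVD.
Z = (measures - measures.mean(axis=0)) / measures.std(axis=0)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
index = Z @ Vt[0]  # the sign of a principal component is arbitrary

# Regress y on a constant and the single index instead of the three measures.
X = np.column_stack([np.ones(n_obs), index])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# How closely the index tracks the underlying variable (absolute value, given the sign).
corr_with_latent = float(abs(np.corrcoef(index, latent)[0, 1]))
print(coef, corr_with_latent)
```

The trade-off is interpretability: the coefficient on the index is in units of the principal component, not of any one of the original variables.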
I must confess that I hardly ever worry about collinearity in my own work. That’s because if the problem gets too extreme, Stata will drop one of the collinear variables, and if the problem is not extreme, it is hard to determine whether a statistically insignificant coefficient estimate is imprecisely estimated because of multicollinearity or because of there being no statistically significant relationship. In the latter case, my personal preference is to just call it a day and go with the assumption that there is no statistically significant relationship rather than mine the data by combining variables into an index, omitting a variable, or doing some factor analysis.
That said, in his chapter on collinearity, Kennedy also has a neat quote by Williams (1992), which states that “the worth of an econometrics textbook tends to be inversely related to the technical material devoted to multicollinearity.” Relatedly, it is perhaps no surprise then that the people who most worry about collinearity tend to be people who are just starting out in applied econometrics, i.e., people who have not gotten their hands dirty enough with data yet.