It’s been a while since I wrote a post for this series, so I thought I should discuss overcontrolling.
Two of our doctoral students are working together on an article in which they are interested in the effect of a spike in the price of a staple food on the welfare of consumers. The staple they are looking at is primarily sold at two types of retail outlets; let's call them A and B. (Because I did not consult with the students before writing this post, I am remaining purposely vague about the application.)
In principle, then, the students are interested in identifying the causal effect of a change in the price of the staple at retailers of type A and the causal effect of a change in the price of the staple at retailers of type B. Let's denote those prices as [math]p_A[/math] and [math]p_B[/math]. Letting [math]y[/math] denote welfare, the students are interested in the effect of [math]p_A[/math] on [math]y[/math] and the effect of [math]p_B[/math] on [math]y[/math].
Initially, they estimated the following equation
(1) [math]y = \alpha + \beta_A p_A + \beta_B p_B + \epsilon[/math],
from which I omit controls for brevity. When I first saw their results, one of them explained that they had obtained some really weird results. If I recall correctly, they'd found something like "both [math]\beta_A[/math] and [math]\beta_B[/math] are positive," which, given that their identification strategy is pretty solid, did not make sense: how could a sharp increase in the price of a staple food actually be beneficial to consumers?
After thinking about it for a minute, I explained: it does not make sense to include two highly correlated variables that measure the same thing (i.e., the price of a given staple) together in the same regression. Including both "overcontrols" for the staple price, so it is not surprising that the results are weird: the common price signal gets split arbitrarily between the two coefficients, and both are imprecisely estimated.
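To see the mechanics, here is a minimal simulation sketch in Python. All of the numbers are made up for illustration: a latent staple price [math]p[/math] with a true effect of -3 on welfare, and [math]p_A[/math] and [math]p_B[/math] as noisy measures of it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical data: one latent staple price p, whose true effect on y is -3
p = rng.normal(size=n)
p_A = p + 0.05 * rng.normal(size=n)   # price at outlet type A: noisy measure of p
p_B = p + 0.05 * rng.normal(size=n)   # price at outlet type B: noisy measure of p
y = 2.0 - 3.0 * p + rng.normal(size=n)

def ols(y, X):
    """OLS coefficients and conventional standard errors."""
    X = np.column_stack([np.ones(len(y)), *X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

# Joint regression as in equation (1): "overcontrols" for the staple price
b_joint, se_joint = ols(y, [p_A, p_B])
# Separate regressions: one price at a time
b_A, se_A = ols(y, [p_A])
b_B, se_B = ols(y, [p_B])

print("joint:    b_A=%.2f (se %.2f), b_B=%.2f (se %.2f)"
      % (b_joint[1], se_joint[1], b_joint[2], se_joint[2]))
print("separate: b_A=%.2f (se %.2f), b_B=%.2f (se %.2f)"
      % (b_A[1], se_A[1], b_B[1], se_B[1]))
```

In the separate regressions, each slope sits close to -3 with a small standard error. In the joint regression, only the *sum* of the two coefficients is pinned down; the individual coefficients are split arbitrarily between the two prices, with standard errors inflated by an order of magnitude, so one of them can easily come out positive in any given sample.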
The solution, as I see it, is to run two regressions–one with [math]p_A[/math] as the treatment variable, and one with [math]p_B[/math] as the treatment variable–and see whether they give you roughly the same answer. A colleague suggested running a principal components analysis on [math]p_A[/math] and [math]p_B[/math] so as to extract the notional staple-price signal from the signal-plus-noise combination of the two variables.
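The PCA suggestion can be sketched in the same simulated setting (again, the latent price and its -3 effect are made-up numbers). With only two variables, the first principal component is essentially the average of the two prices, and it should soak up nearly all of their common variation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000

# Hypothetical data, as before: both prices are noisy measures of one latent price p
p = rng.normal(size=n)
p_A = p + 0.05 * rng.normal(size=n)
p_B = p + 0.05 * rng.normal(size=n)
y = 2.0 - 3.0 * p + rng.normal(size=n)

# PCA on (p_A, p_B): eigen-decomposition of their covariance matrix
P = np.column_stack([p_A, p_B])
P_c = P - P.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(P_c, rowvar=False))  # ascending order
share = eigvals[-1] / eigvals.sum()        # variance share of the first PC
pc1 = P_c @ eigvecs[:, -1]                 # first PC: the common price signal

# Regress y on the first PC alone
X = np.column_stack([np.ones(n), pc1])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"first PC explains {share:.1%} of the variance in (p_A, p_B)")
print(f"slope on the first PC: {beta[1]:.2f}")
```

The first component tracks the latent price almost perfectly, and the regression on it yields one stable, precisely estimated price coefficient instead of two unstable ones. (The sign and scale of a principal component are arbitrary, so the slope is the true effect only up to the component's normalization.)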
The bottom line is this: although with control variables we tend to think "the more, the merrier," there are cases where specific controls should be left out. A related–but distinct–problem is that of bad control variables, i.e., control variables that are themselves outcomes of the treatment variable, and whose inclusion will bias your estimate of the treatment effect.
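The bad-control problem can also be shown in a few lines of simulated data (all numbers hypothetical): if a treatment affects the outcome through a mediator, controlling for that mediator wipes out the estimated treatment effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000

# Hypothetical data: treatment t affects y only through the mediator m
t = rng.normal(size=n)               # treatment
m = 1.0 * t + rng.normal(size=n)     # mediator: itself an outcome of the treatment
y = 2.0 * m + rng.normal(size=n)     # total effect of t on y is 1.0 * 2.0 = 2.0

def ols(y, X):
    """OLS coefficients (intercept first)."""
    X = np.column_stack([np.ones(len(y)), *X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_simple = ols(y, [t])      # recovers the total effect, about 2.0
b_bad = ols(y, [t, m])      # "bad control": coefficient on t collapses toward 0

print(f"without the bad control: {b_simple[1]:.2f}")
print(f"with the bad control:    {b_bad[1]:.2f}")
```

Holding the mediator fixed blocks the very channel through which the treatment operates, so the regression with the bad control says the treatment does nothing, even though its total effect is substantial.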