Last updated on June 14, 2015
Continuing the ‘Metrics Monday series, and continuing on last week’s theme of control variables discussed in the de Luca et al. working paper, I wanted to discuss endogenous control variables. Note that a lot of what follows is me thinking out loud, and I may well be mistaken about all of this. If so, I welcome comments exploring this topic.
As always, suppose you have observational data, and you are interested in estimating the causal effect of your variable of interest D on your outcome of interest Y, and you also have access to a vector of control variables X. For the sake of argument, let’s assume there is only one control variable in the equation
(1) Y = a + bX + cD + e.
The parameter of interest is c. If you have observational data, then you know that in most cases E(D’e) is different from zero–that is, D is endogenous to Y in equation 1, and c does not capture the causal effect of D on Y.
But what about X? It often happens that X is also obviously endogenous to Y–say, because X is a decision variable which is determined by each individual respondent’s expectation of Y, which would constitute a case of reverse causality.
In terms of the peer-review process, one thing I would not encourage you to do is to try to find an instrumental variable for X. Why is that? To put it simply, if a bit cynically: D is your variable of interest, and it is difficult enough to deal with the fact that D is endogenous (how well you do so will determine how well your paper is received by reviewers and editors). Attempting to deal with the endogeneity of your control variable as well exponentially expands the number of reasons why your reviewers might recommend that your paper be rejected.
Seriously, I still sometimes see papers where the authors are looking at the effect of some variable of interest D on some outcome of interest Y, but where they spend a considerable amount of time trying to deal with X (generally, those authors are also waist-deep in likelihood procedures like the Heckman selection model, too, so dealing with X is only one of a laundry list of things they burden the reader with). But that is really beside the point, because it is D that is the variable of interest, not X.
So how do we deal with endogenous controls? First, let’s think about what an endogenous control means:
- An endogenous control X means that E(X’e) is different from zero, which obviously means that the estimated b in equation 1 will be biased.
- An endogenous control X also means that the OLS estimator for c–the parameter of interest–will be biased, since X appears in the formula for the OLS estimator of c (see here for the OLS estimator in a simple, two-variable case). Moreover, see this article by Frölich (2008) for a discussion of how both OLS and 2SLS will be inconsistent in the presence of endogenous controls. That is, they do not converge to the true value of the parameter of interest.
- Excluding the endogenous control X means that X is now in the error term e, and so if X is correlated with D, then your estimate of c is also biased.
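To make the second and third points concrete, here is a sketch of the large-sample limits of the two OLS estimators of c in equation 1, using the standard two-regressor algebra. The subscripts "with X" and "without X" are labels introduced here for illustration, and the expressions assume the usual regularity conditions (finite variances, and X and D not perfectly collinear):

```latex
\begin{align*}
\operatorname{plim}\,\hat{c}_{\text{with } X}
  &= c + \frac{\operatorname{Var}(X)\operatorname{Cov}(D,e) - \operatorname{Cov}(X,D)\operatorname{Cov}(X,e)}
              {\operatorname{Var}(X)\operatorname{Var}(D) - \operatorname{Cov}(X,D)^{2}}, \\
\operatorname{plim}\,\hat{c}_{\text{without } X}
  &= c + \frac{b\,\operatorname{Cov}(X,D) + \operatorname{Cov}(D,e)}{\operatorname{Var}(D)}.
\end{align*}
```

In the first line, Cov(X,e) enters the limit of the estimate of c only through the product Cov(X,D) Cov(X,e). In the second line, leaving X out adds the familiar omitted-variable term b Cov(X,D)/Var(D).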
This suggests the following: If D and X are uncorrelated, then it is better to leave X out of your regression altogether, because in that case, omitting X does not bias your estimate of c, no matter how much variation in Y is explained by X.
If D and X are correlated, then you have a problem either way. Omitting X means that you have an omitted variable bias. Including it means that your estimates are inconsistent. (See here for an enlightening, short discussion of bias vs. consistency.) What should you do? I think the middle-of-the-road approach is the usual “do both,” that is to present results both with and without the endogenous control, and see what changes. But even that is not terribly satisfactory, since there is bias in both cases, and “get a better research design” is even less helpful.
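A small simulation can illustrate the "do both" comparison. What follows is only a sketch under assumptions that are mine rather than part of the argument above: the true values a = 1, b = 2, c = 3, a hypothetical parameter `gamma` that controls how strongly D and X are correlated, a shared shock `u` that makes X endogenous, and a D that is exogenous by construction so that any bias in the estimate of c comes from X alone.

```python
import numpy as np


def simulate(gamma, n=200_000, seed=0):
    """Return the OLS estimates of c with and without the control X.

    Illustrative data-generating process (all values are assumptions):
      D is exogenous, X = gamma * D + u + noise, e = u + noise,
      Y = 1 + 2 * X + 3 * D + e, so the true c is 3.
    The shared shock u makes X correlated with e (an endogenous control).
    """
    rng = np.random.default_rng(seed)
    d = rng.normal(size=n)
    u = rng.normal(size=n)                    # shock shared by X and e
    x = gamma * d + u + rng.normal(size=n)    # gamma sets Cov(X, D)
    e = u + rng.normal(size=n)
    y = 1.0 + 2.0 * x + 3.0 * d + e

    ones = np.ones(n)
    # Regression of Y on a constant, X, and D (endogenous control included).
    coef_with = np.linalg.lstsq(np.column_stack([ones, x, d]), y, rcond=None)[0]
    # Regression of Y on a constant and D only (endogenous control omitted).
    coef_without = np.linalg.lstsq(np.column_stack([ones, d]), y, rcond=None)[0]
    return coef_with[2], coef_without[1]


if __name__ == "__main__":
    for gamma in (0.0, 1.0):
        c_with, c_without = simulate(gamma)
        print(f"gamma={gamma}: c with X = {c_with:.3f}, c without X = {c_without:.3f}")
```

Under these assumed values, when `gamma` is 1 neither estimate lands on the true c of 3: the estimate with X settles near 2.5 and the estimate without X near 5, so presenting both shows how much the answer moves. When `gamma` is 0, the estimate without X settles near 3. In this particular design the estimate with X is also near 3 in that case, because D was built to be exogenous.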
Ideally, you would find a good (i.e., valid and relevant) IV for X, but those are difficult to find. In the papers I have seen that try to tackle the problem of endogenous controls X, the IVs used for the endogenous variables of interest D were usually not the best, and the IVs used for the endogenous controls were even worse.
Also see here for a discussion of this issue which I found a bit difficult to follow given the many voices involved (and a few typos, I think). There is also this article by Lechner (2008), but it seems specifically geared towards matching methods.