Suppose you are interested in the effect of a treatment variable D on some outcome Y, and you have some controls X. You can thus estimate the following equation by ordinary least squares (OLS):

(1) Y = a + bX + cD + e.

As is so often the case in the social sciences, the problem is that E(D’e) = 0 does not hold, i.e., D is endogenous, and so estimating equation 1 by OLS means that the estimated coefficient (let’s call it c_{OLS}, for simplicity) is biased, meaning that its expected value will not equal the true value c of the coefficient.
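To make the bias concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: there are no controls X, D is continuous, the true effect c is set to 1 for every unit, and an unobserved confounder v (a hypothetical name) enters both D and the error e. Under those assumptions, cov(D, e)/var(D) = 1/2, so c_{OLS} should land near 1.5 rather than 1 in a large sample.

```python
import random


def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    return cov_xy / var_x


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 20_000
c_true = 1.0            # assumed true effect of D on Y (illustrative)

v = [rng.gauss(0, 1) for _ in range(n)]   # unobserved confounder
D = [vi + rng.gauss(0, 1) for vi in v]    # treatment depends on v
e = [vi + rng.gauss(0, 1) for vi in v]    # error also depends on v, so E(D'e) != 0
Y = [2.0 + c_true * di + ei for di, ei in zip(D, e)]

c_ols = slope(D, Y)
print(f"true c = {c_true:.2f}, c_OLS = {c_ols:.2f}")
```

The gap between the two printed numbers is the endogeneity bias, and it does not shrink as n grows.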

Suppose further that you have an instrumental variable (IV) Z for the (endogenous) treatment variable D. Assume Z is a valid IV: it explains enough of the variation in D (i.e., it is not weak) and, perhaps more importantly, it meets the exclusion restriction in that it affects Y only through D. You can thus estimate the following two equations by two-stage least squares (2SLS):

(2) D = f + gX + hZ + u, and

(3) Y = a’ + b’X + c’D + e.

Let’s re-label the coefficient c’ and call it c_{2SLS} for simplicity.
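The two stages can be sketched by hand as follows. As before, this is an illustration under stated assumptions rather than a recipe: the controls X are omitted, D is continuous, the effect of D on Y is fixed at 1 for every unit, Z is drawn independently of the confounder v (a hypothetical name) and of the noise, and the first-stage coefficient h is set to 1 so that Z is not weak.

```python
import random


def mean(x):
    return sum(x) / len(x)


def cov(x, y):
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / len(x)


def ols(x, y):
    """Return (intercept, slope) from regressing y on x."""
    b = cov(x, y) / cov(x, x)
    return mean(y) - b * mean(x), b


rng = random.Random(1)  # fixed seed so the run is reproducible
n = 20_000
c_true = 1.0            # assumed constant effect of D on Y (illustrative)

v = [rng.gauss(0, 1) for _ in range(n)]   # unobserved confounder
Z = [rng.gauss(0, 1) for _ in range(n)]   # instrument, independent of v
D = [zi + vi + rng.gauss(0, 1) for zi, vi in zip(Z, v)]
Y = [2.0 + c_true * di + vi + rng.gauss(0, 1) for di, vi in zip(D, v)]

# First stage (equation 2 without X): regress D on Z, keep the fitted values.
f_hat, h_hat = ols(Z, D)
D_hat = [f_hat + h_hat * zi for zi in Z]

# Second stage (equation 3 without X): regress Y on the fitted values of D.
_, c_2sls = ols(D_hat, Y)
print(f"first-stage h = {h_hat:.2f}, c_2SLS = {c_2sls:.2f}")
```

Running the second stage by hand like this gives the 2SLS point estimate, but the standard errors from that second regression would not be valid; in practice a dedicated 2SLS routine handles that correction.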

One thing I still read in manuscripts or hear in seminars way too often is people comparing c_{OLS} and c_{2SLS} as though they estimate the same thing.

It usually goes something like this: Someone presents OLS and 2SLS results, and then they (or someone in the audience) will compare the OLS and 2SLS coefficients. If c_{OLS} > (<) c_{2SLS}, they will say something like “Ignoring endogeneity concerns leads to overstating (understating) the relationship between D and Y.”

The problem is that you can’t compare OLS and 2SLS coefficients. At least not that way.