Last updated on January 10, 2016
“If you know a good story, tell it from time to time.” — Noah Smith.
Actually, I know two related stories, which I will recount in this post. Both need to be understood much more widely than they currently are, given how often the problems they describe crop up in the manuscripts I read.
Take the most basic theoretical problem in microeconomics: A producer has to choose how much labor ℓ to use in order to maximize its profit from producing and selling some output q whose production is dictated by the production function q = f(ℓ), where f(.) is the technology available to the producer. The output q sells at price p, and labor ℓ sells at wage w.
Setting up the maximization problem, taking the first-order condition, checking that the second-order condition is satisfied, and solving for the profit-maximizing quantity of labor will yield a labor input demand ℓ* = ℓ(p,w). In such a problem, we say that ℓ is an endogenous variable (it is determined within the context of the problem), while p and w are exogenous variables (they are predetermined, that is, they are given, and they do not depend on the problem). Alternatively, we also say that p, w, and f(.) are the primitives of the problem, but that is neither here nor there for the purposes of this discussion.
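The steps just described can be sketched as follows. This is a minimal derivation that assumes f is twice differentiable with f′ > 0 and f″ < 0, which the setup above does not state explicitly:

```latex
\begin{align*}
\max_{\ell \ge 0} \; \pi(\ell) &= p\,f(\ell) - w\,\ell \\
\text{First-order condition:} \quad p\,f'(\ell^*) - w &= 0 \\
\text{Second-order condition:} \quad p\,f''(\ell^*) &< 0 \\
\text{Labor demand:} \quad \ell^* &= (f')^{-1}\!\left(\frac{w}{p}\right) \equiv \ell(p, w)
\end{align*}
```

As a purely hypothetical example, take f(ℓ) = ℓ^(1/2). The first-order condition becomes p/(2ℓ^(1/2)) = w, so ℓ* = (p/(2w))^2, which rises with p and falls with w.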
Now suppose you wanted to study the labor allocation decisions on farms in a developing country. If you believe the theoretical model above, the least you would want to do is to regress each farm’s labor allocation ℓ on the price p of the crop grown on that farm and on the wage w that farm pays its workers. It would be a mistake, however, to claim that because p and w are exogenous in the theoretical problem above, you can treat them as exogenous in the empirical problem.
So my first story is this: Endogeneity and exogeneity have vastly different theoretical and empirical meanings.
My second story is related: The fact that the output price p and the wage w are not caused by ℓ does not make them exogenous. Indeed, there is more than one cause of (statistical) endogeneity. In the regression
(1) ℓ = a + bp + cw + e,
statistical endogeneity can bias your estimates of a, b, and c in three ways:
- Unobserved heterogeneity. This is also known as the omitted variables problem. Suppose it is more physically demanding to work on a low-quality plot than it is to work on a high-quality plot, and that you have to pay workers accordingly. In this case, your estimate of c is biased because of the correlation between (omitted) soil quality, which is in the error term e, and w.
- Measurement error. Suppose the farmers you collected data from tend to lie about the price at which they sell their crop (say, because they wish to under-report their actual income). Then the price p you observe is such that p = p* + u, where p* is the real price they receive for their crop, and u is the “adjustment” they make to that price when they tell you how much they received for their crop. Because the observed p contains u, u ends up in the error term of regression (1), which is then correlated with p, and your estimate of b is biased. If u is pure noise unrelated to p*, b is biased toward zero; if u is correlated with p* (say, the higher the price a farmer receives, the bigger the lie), the bias can go in either direction.
- Reverse causality or simultaneity. This is what a lot of people think of as the source of statistical endogeneity. Suppose that, for some reason, the amount of labor a farmer employs on his farm has an effect on the price that farmer receives for his crops or on the wage he has to pay (say, because he has to pay his workers for overtime). Then p or w depends on ℓ, and therefore on e, and this too will bias your coefficient estimates.
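To make the three sources of bias concrete, here is a minimal simulation sketch of regression (1). Everything numerical in it is an assumption chosen for illustration: the "true" values a = 2, b = 0.5, c = -1, the distributions of p and w, the strength of each distortion, and the helper names `ols` and `simulate`. In particular, the measurement-error case uses pure reporting noise rather than deliberate under-reporting, which is the simplest case to simulate.

```python
import numpy as np


def ols(y, *regressors):
    """Return OLS coefficients [intercept, slope_1, slope_2, ...]."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def simulate(n=100_000, seed=0):
    """Estimate (a, b, c) in labor = a + b*p + c*w + e under four scenarios."""
    rng = np.random.default_rng(seed)
    a, b, c = 2.0, 0.5, -1.0          # assumed true parameters

    p = rng.normal(10, 1, n)          # crop price
    w = rng.normal(5, 1, n)           # wage
    e = rng.normal(0, 1, n)           # error term

    # Benchmark: p and w are unrelated to e, so OLS recovers b and c.
    labor = a + b * p + c * w + e
    benchmark = ols(labor, p, w)

    # (i) Unobserved heterogeneity: soil quality affects labor directly
    # and lowers the wage, but is left out of the regression.
    soil = rng.normal(0, 1, n)
    w_het = 5 - 0.8 * soil + rng.normal(0, 0.6, n)
    labor_het = a + b * p + c * w_het + 1.0 * soil + e
    omitted = ols(labor_het, p, w_het)

    # (ii) Measurement error: we observe the true price plus noise u.
    p_obs = p + rng.normal(0, 1, n)
    mismeasured = ols(labor, p_obs, w)

    # (iii) Simultaneity: the wage rises with labor (w = 5 + g*labor + v).
    # Solve the two-equation system for labor, then build the wage.
    g = 0.3
    v = rng.normal(0, 1, n)
    labor_sim = (a + b * p + c * (5 + v) + e) / (1 - c * g)
    w_sim = 5 + g * labor_sim + v
    simultaneous = ols(labor_sim, p, w_sim)

    return {
        "benchmark": benchmark,
        "omitted": omitted,
        "mismeasured": mismeasured,
        "simultaneous": simultaneous,
    }


if __name__ == "__main__":
    for name, coef in simulate().items():
        print(f"{name:>12}: a={coef[0]:.3f}  b={coef[1]:.3f}  c={coef[2]:.3f}")
```

Under these assumed parameters, the benchmark estimates should land close to b = 0.5 and c = -1. The omitted soil-quality variable pushes the estimate of c to roughly -1.8, reporting noise shrinks the estimate of b to roughly 0.25, and the feedback from labor to the wage pulls the estimate of c to roughly -0.64. None of the distorted scenarios requires ℓ to cause p, and only the last one involves ℓ causing w.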
Now, none of this should be new to people who received their graduate training in the past 10 years. But there are still some folks who believe the third reason, reverse causality or simultaneity, is the only source of statistical endogeneity, and who confuse theoretical and statistical endogeneity.
The foregoing, however, suggests a systematic way to think through and discuss identification issues when writing applied papers: In my own work, I almost always include a point-by-point discussion of whether (i) unobserved heterogeneity, (ii) measurement error, and (iii) reverse causality/simultaneity are a source of bias in the application at hand, and of how I deal with each source of statistical endogeneity. I see such a discussion as second only to the introduction in terms of importance in the grand scheme of a research paper, and I think most young researchers would benefit from including such a discussion when using observational data.