Last updated on October 22, 2017
A few weeks ago I was in a meeting with a team of graduate students with whom I am working on a research project. As we were going over their estimation results, I asked a few questions to make sure that those results were sound.
At some point, I asked: “Are there any generated regressors in those regressions?” Hearing no answer, I looked up and saw a bunch of puzzled faces looking back at me. Before I even began explaining, one of the students alluded to how this would make a good blog post.
Suppose you want to estimate the equation
(1) y = f(x,D) + e
where y is your dependent variable, D is your treatment variable, and x is a control variable. Suppose, however, that you don’t observe x, or you observe it only partially, and you have to generate it by estimating
(2) x = g(w) + u,
and getting \hat{x} so as to estimate
(1′) y = f(\hat{x},D) + e.
For example, observations on x might be missing for a number of respondents, and you might be able to forecast x by using the information in w. Or you might be trying to deal with endogeneity, and equation (2) is the first-stage equation in a 2SLS or a control function setup. Or you might be estimating x structurally from w. There are many possibilities as to why you want to include a generated regressor, or how you obtain such a regressor.
The problem with generated regressors is that, because they are estimated from the data, they have a sampling variance of their own, which the usual standard errors from equation (1′) ignore. Those standard errors are therefore too small. This means that using generated regressors in equation (1′) and leaving the standard errors “as is” leads to too many type I errors, i.e., to over-rejecting null hypotheses and to finding significance where there isn’t any.
Though a few methods have been developed to obtain the right standard errors when dealing with generated regressors (see, for instance, Murphy and Topel 1985), a commonly accepted and relatively easy way to deal with generated regressors nowadays is to just bootstrap the whole procedure.
Quoth the Maven:
In the example above, you would just need to write a short program that bootstraps the whole procedure defined by equations (2) and (1′), re-estimating both equations on each bootstrap sample.
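To make this concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption made for illustration rather than something from a particular application: the data are simulated, both g and f are taken to be linear, x is missing at random for about 40 percent of observations, and the names `ols`, `two_step`, and `bootstrap_se` are hypothetical helpers. The point is only that the first stage is re-estimated inside every bootstrap replication, so that the sampling variance of \hat{x} carries through to the standard errors of equation (1′).

```python
import numpy as np

def ols(X, y):
    """OLS coefficients of y on X (X must already contain a constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def two_step(y, D, x, w):
    """Estimate equation (2), build x_hat, then estimate equation (1')."""
    obs = ~np.isnan(x)                        # rows where x is observed
    W = np.column_stack([np.ones(len(w)), w])
    gamma = ols(W[obs], x[obs])               # equation (2), observed rows only
    x_hat = W @ gamma                         # generated regressor for every row
    X = np.column_stack([np.ones(len(y)), x_hat, D])
    return ols(X, y)                          # equation (1'): [const, x_hat, D]

def bootstrap_se(y, D, x, w, reps=500, seed=0):
    """Pairs bootstrap of the whole two-step procedure."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty((reps, 3))
    for b in range(reps):
        idx = rng.integers(0, n, size=n)      # resample rows with replacement
        draws[b] = two_step(y[idx], D[idx], x[idx], w[idx])
    return draws.std(axis=0, ddof=1)

# Simulated data (all parameter values are made up for illustration).
rng = np.random.default_rng(42)
n = 400
w = rng.normal(size=n)
D = rng.integers(0, 2, size=n).astype(float)  # binary treatment
x = 1.0 + 2.0 * w + rng.normal(size=n)
y = 0.5 + 1.5 * x + 1.0 * D + rng.normal(size=n)
x[rng.random(n) < 0.4] = np.nan               # x goes missing for ~40% of rows

beta_hat = two_step(y, D, x, w)
se_boot = bootstrap_se(y, D, x, w)
print("coefficients [const, x_hat, D]:", beta_hat)
print("bootstrap standard errors:     ", se_boot)
```

The design choice that matters is that `two_step` is called on each resampled data set, rather than generating \hat{x} once and bootstrapping only the second regression. With clustered or panel data you would resample whole clusters instead of individual rows.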
When estimating 2SLS with -ivreg-, Stata takes care of the standard error correction all by itself; for many other things, however, it is necessary to correct the standard errors yourself. This is practically always true for things that you code by hand, without relying on a canned command.
Other than Murphy and Topel (1985), the classic paper on generated regressors is by Pagan (1984), and there is also a survey by Oxley and McAleer (1993), but the latter two articles seem to be distinctly about the use of generated regressors in macro.