‘Metrics Monday: Proxy Variables
It often happens in the course of doing empirical work that we wish to study the relationship between some variable of interest D and some outcome Y, but that we don’t have access to a good measure of D. Rather, what we have is a proxy for D, which Wikipedia defines as
a variable that is not in itself directly relevant, but that serves in place of an unobservable or immeasurable variable. In order for a variable to be a good proxy, it must have a close correlation, not necessarily linear or positive, with the variable of interest.
For example, we may observe a dummy variable for whether one has started a business as a proxy for entrepreneurial ability. Or we may observe one’s IQ as a proxy for intellectual ability. Or we may observe the frequency of elections as a proxy for democracy. The possibilities here are endless.
For the sake of argument, then, let’s denote our proxy variable–what we actually observe in lieu of D–as D*, so that
D* = f(D) + u,
where f(.) is a mapping from D to D* and u is some kind of error term to make the relationship between D and D* stochastic, for if that relationship were deterministic and D* were equal to f(D), then observing the proxy D* would be as good as observing the variable of interest D.
Our ideal goal is to estimate the coefficient c accurately in the regression
(1) Y = a + bX + cD + e,
but the best we can do is to estimate
(2) Y = a* + b*X + c*D* + e*,
where the stars denote that a and a*, b and b*, c and c*, as well as e and e* are different things. Substituting D* = f(D) + u into (2) yields the following equation:
(2′) Y = a* + b*X + c*f(D) + c*u + e*.
If you’ve taken a minimal amount of econometrics, you already know where this is going: We now have to contend with u being in the error term, and so if u is correlated with any of the variables on the right-hand side of (2′), then we are dealing with an endogeneity problem.
An example might be useful here. Suppose we are using a dummy variable D* for whether one has started a business as a proxy for entrepreneurial ability D. In this case, it is reasonable to argue that the proxy is measured with error. Specifically, D* will tend to underestimate entrepreneurial ability. Indeed, many academics are what one would deem “entrepreneurial” in that they undertake risky activities that might have high payoff and are willing to invest in activities where the production function is nonconvex (such risky and nonconvex activities are really what tenure is designed to foster, by the way), but few academics start businesses, save for the occasional consulting business on the side. In this example, D* would tend to understate D, which would lead to a biased estimate of c* (and this discussion implicitly assumes that the correlation between D and D* is “good enough” for D* to pick up any significance that D might have in its relationship with Y).
In the best-case scenario, u is uncorrelated with the variables on the right-hand side of (2′), but that isn’t always the case, and it isn’t even clear that this is frequently the case. And then there are cases where the variable that you use as a proxy really does not have a monotonic relationship with Y, in which case any statistical test related to c* is unidentified because you don’t know what to test for. Zack Brown and I once won an award for pointing out, using relatively simple micro theory, that wealth is a terrible proxy for risk aversion in applied contract theory–in addition to changing the curvature of a utility function, a change in wealth also changes utility at the margin, which complicates the relationship between contract choice and wealth by making it nonmonotonic in most cases. Here is the abstract of that paper:
Tests of risk sharing in the contracting literature often rely on wealth as a proxy for risk aversion. The intuition behind these tests is that since contract choice is monotonic in the coefficients of risk aversion, which are themselves assumed monotonic in wealth, the effect of a change in wealth on contract choice is clearly identified. We show that tests of risk sharing relying on wealth as a proxy for risk aversion are only identified insofar as the econometrician is willing to assume that (i) the principal is risk-neutral or her preferences exhibit CARA; and (ii) the agent is risk-neutral.
Given the frequent use of proxies, it pays to think carefully about two related questions when doing empirical work:
1. Are any of the variables in the regression of interest proxies for something else? Even a variable like GDP per capita is really only a proxy for standard of living, and it often happens that we treat proxies as being the thing we are really interested in. With all due respect to my poli sci friends, a number of variables used in the international political economy literature to measure “democracy,” “human rights,” etc. strike me as obvious proxies.
2. If any of the variables in the regression is a proxy for something else, how might that proxy measure the variable of interest with error? Is this error more like classical measurement error, in which case this causes attenuation bias (i.e., c* is biased towards zero, or what a friend and colleague once called “the good kind of bias”), or is this error systematic, which introduces systematic bias (and not necessarily of the good kind!) in your estimates?
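To make the distinction in question 2 concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration rather than something taken from real data: the parameter values, the choice f(D) = D, the standard normal draws, and the helper name ols are all hypothetical. It compares a classical error (u independent of everything else) with a systematic error of the same variance that happens to be correlated with e.

```python
import numpy as np

def ols(y, *regressors):
    """Return OLS coefficients [intercept, slope_1, slope_2, ...]."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 200_000
a, b, c = 1.0, 1.0, 0.5  # hypothetical "true" parameters of equation (1)

X = rng.normal(size=n)
D = rng.normal(size=n)   # variable of interest; unobserved in real data
e = rng.normal(size=n)
Y = a + b * X + c * D + e

# Classical measurement error: f(D) = D, and u is independent of X, D, and e.
# Because X is independent of D and u here, the probability limit of c* is
# c * Var(D) / (Var(D) + Var(u)) = 0.5 * 1 / 2 = 0.25 (attenuation).
u_classical = rng.normal(size=n)
c_classical = ols(Y, X, D + u_classical)[2]

# Systematic error with the same variance (0.64 + 0.36 = 1), but correlated
# with e. The probability limit of c* is now
# (c * Var(D) + Cov(u, e)) / (Var(D) + Var(u)) = (0.5 + 0.8) / 2 = 0.65,
# which overshoots the true c instead of shrinking towards zero.
u_systematic = 0.8 * e + 0.6 * rng.normal(size=n)
c_systematic = ols(Y, X, D + u_systematic)[2]

print(f"true c                   : {c:.3f}")
print(f"c* with classical error  : {c_classical:.3f}")
print(f"c* with systematic error : {c_systematic:.3f}")
```

In this particular setup the classical case understates the effect while the systematic case overstates it, even though both proxies are equally noisy. With other assumed correlations the systematic bias could go in either direction, which is exactly why its sign has to be argued case by case.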
Against 1, I guess the solution is to be honest about what we are truly measuring and use careful language in doing so. In our forthcoming JDE article on female genital cutting (FGC), for example, Lindsey Novak, Tara Steinmetz, and I were careful to talk of “respondents who report having undergone FGC” rather than of “respondents who have undergone FGC,” because short of interviewers physically verifying FGC status, it is impossible to know whether a woman has actually undergone FGC.
Against 2, the solution is to assess whether there is bias and, if so, what its direction is. In some cases (the “good kind of bias” cases), the bias serves to strengthen a significant result by providing a lower bound (in absolute value) on the true effect. In other cases, it gets trickier. As with any endogeneity problem, one might have to use an instrumental variable to deal with such cases.
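As a rough illustration of the instrumental-variable route, the sketch below simulates a proxy D* whose error u is correlated with e, and then instruments D* with a hypothetical variable Z. The key assumption, which is easy to impose in a simulation and hard to defend in practice, is that Z is correlated with D but uncorrelated with both u and e. The parameter values, the choice f(D) = D, and the helper name ols are likewise made up for the example, and the two-stage least squares is done by hand only to show the mechanics.

```python
import numpy as np

def ols(y, *regressors):
    """Return OLS coefficients [intercept, slope_1, slope_2, ...]."""
    design = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 200_000
a, b, c = 1.0, 1.0, 0.5  # hypothetical "true" parameters of equation (1)

X = rng.normal(size=n)
D = rng.normal(size=n)   # variable of interest; unobserved in real data
e = rng.normal(size=n)
Y = a + b * X + c * D + e

# Proxy with systematic error: u is correlated with the outcome error e.
u = 0.8 * e + 0.6 * rng.normal(size=n)
D_star = D + u

# Hypothetical instrument: related to D, unrelated to u and e by construction.
Z = D + rng.normal(size=n)

# Naive OLS of Y on X and the proxy (biased in this setup).
c_ols = ols(Y, X, D_star)[2]

# Two-stage least squares by hand.
# Stage 1: project the proxy on the exogenous regressor X and the instrument Z.
g0, g1, g2 = ols(D_star, X, Z)
D_star_hat = g0 + g1 * X + g2 * Z
# Stage 2: replace the proxy with its fitted values.
c_iv = ols(Y, X, D_star_hat)[2]

print(f"true c       : {c:.3f}")
print(f"OLS estimate : {c_ols:.3f}")
print(f"IV estimate  : {c_iv:.3f}")
```

The point estimate from the manual second stage is the 2SLS estimate, but the standard errors that a naive second-stage regression would report are not valid, so a dedicated IV routine is the better choice for actual inference.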