Last updated on June 23, 2015
Last week I talked about what to do with an obviously endogenous control variable. This week, I answer a question received via email:
… [Y]ou should consider publishing a blog post about how you handle various types of missing data when you are working with secondary data. … I come across data with a lot of [missing] values when analyzing managing household data. I get confusing and contradicting responses when I search on Google as well as when I ask my peers about how to treat missing values. I feel how we handle missing values affects the reproducibility of one’s results hence I wanted to learn if you have any suggestions on how to manage missing values. I am of the view that I may not be the only one who can benefit from learning how you handle this issue when analyzing data for your various research projects.
That is a good question, and its subject is rarely discussed in econometrics classes, where students are usually given data sets that have been cleaned and have no missing values. As the email indicates, real-world data are often much messier.
Suppose you have the following regression:
Y = a + bX + cD + e,
where, as is usual, I use Y to denote the outcome of interest, X to denote control variables, and D to denote the variable of interest, i.e., the treatment variable. The parameters a, b, and c are what we are interested in, especially c. To keep things simple, let’s say X is a single variable instead of a vector of control variables.
Suppose you observe D for everyone in your sample, but you have missing data for X. What should you do? Here are a few options:
1. Ignore the problem. When I taught at a policy school, I often had to remind students that, as much as people in policy schools would like to ignore it, doing nothing is always an option in terms of policy. Same thing in econometrics: you can choose to ignore problems. With missing data, there is an implicit assumption that is made when you ignore the problem, viz. that data are missing at random. If you are going to ignore the problem, you should think carefully about whether data are likely to be missing at random. For example, when I asked people in Madagascar whether they had a bank account and, if so, how much they had in it, all in an effort to figure out people’s assets, many people refused to answer. I suspect that the more people had in their bank account, the more likely they were to refuse to answer the question, and so ignoring the problem would lead to a sample that under-represents people who have a higher savings rate, or who are wealthier.
2. Run a balancing test. If you want to have an idea of how missing data may bias your sample, you can also run balancing tests. That is, use a t-test to compare the mean of Y for those observations with missing X versus those observations with X present, and do the same for D. If you fail to reject the null hypotheses that (i) the mean of Y is equal for those with X and those with missing X, and (ii) the mean of D is equal for those with X and those with missing X, you can be a bit more confident that your missing values for X appear to leave the sample intact. If you find, say, that there are systematic differences in some variable between those with X and those with missing X, that tells you how those missing values might bias your sample.
3. Run the sub-regression Y = a + cD + e with and without those observations for which X is missing. Is c roughly the same across samples? If so, then that is an additional reason not to worry about missing values for X, given that c is the parameter of interest. Of course, if you have missing values for D, that is a different problem.
4. Use “missing dummies” to keep those observations. You can create a dummy variable, call it Z, equal to 1 if X is missing and equal to zero otherwise. Then, create a variable X’ equal to X if X is nonmissing and equal to zero otherwise, and estimate
Y = a + bX’ + gZ + cD + e.
This has the advantage of retaining all observations. This is something a reviewer once asked me to do, and though it feels like a bit of a kludge, I think it is fine when presented alongside the results of a regression where you treat the missing values of X as missing at random (UPDATE: A comment on Twitter links to this, noting that this strategy really isn’t great), which brings me to…
5. “Do both.” This is pretty much my mantra when it comes to applied econometrics, which is more like rhetoric than dialectic, and in which you need to show that your finding holds over and over in different specifications, building your case for it like a lawyer would build his client’s case in court. So don’t be afraid to do all of 1 to 4 above.
6. Another thing you can do is to impute those missing values. That is, regress X on D and get the predicted values of X, i.e., X hat, and replace missing values of X with the X hats. This also feels like a bit of a kludge, but when used with other methods, and not as your only solution, it should be all right.
7. Finally, should you be lucky enough to have an instrumental variable that (i) is relevant, i.e., it is correlated with missing values, and (ii) is valid, i.e., it only affects Y through X, you can try to estimate a 2SLS or selection correction model, but this seems like a lot of work, and it is rare that we have a good IV for D, not to mention for X.
(UPDATE: 8. Some commenters on Twitter said they missed having “get better data” among my list of proposed solutions. It was missing (heh) on purpose, because it really should go without saying that if you can get better data, in most cases, you will.)
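As a rough sketch of options 2 and 3, here is what the balancing test and the sub-regression comparison might look like in plain Python. Everything in it is an assumption for illustration: the ten-row toy data set is made up, missing X is coded as None, and the helper names welch_t and slope are hypothetical. I use Welch's unequal-variance t statistic; turning it into a p-value requires a t distribution, which a statistics library would supply.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for mean(a) - mean(b)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a) / na, statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

def slope(d, y):
    """OLS slope c from the sub-regression Y = a + cD + e."""
    d_bar, y_bar = statistics.mean(d), statistics.mean(y)
    num = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
    den = sum((di - d_bar) ** 2 for di in d)
    return num / den

# Hypothetical toy data: (Y, D, X), with None marking a missing X.
rows = [
    (2.1, 0, 1.0), (3.4, 1, 1.5), (2.8, 0, 0.7), (4.0, 1, 2.2),
    (3.1, 1, 1.1), (2.5, 0, 0.9), (3.9, 1, None), (2.2, 0, None),
    (3.0, 0, None), (4.4, 1, None),
]
observed = [r for r in rows if r[2] is not None]
missing = [r for r in rows if r[2] is None]

# Option 2: compare the means of Y and D across the two groups.
results = {}
for name, col in (("Y", 0), ("D", 1)):
    results[name] = welch_t([r[col] for r in missing], [r[col] for r in observed])
    print(name, "t = %.3f, df = %.1f" % results[name])

# Option 3: estimate c with and without the observations where X is missing.
c_all = slope([r[1] for r in rows], [r[0] for r in rows])
c_observed = slope([r[1] for r in observed], [r[0] for r in observed])
print("c (all rows) = %.3f, c (X observed only) = %.3f" % (c_all, c_observed))
```

Small t statistics and similar values of c across the two samples are reassuring, though with a sample this tiny the comparison is only meant to show the mechanics.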
The foregoing presupposes that you have a sizable proportion of your sample with missing X. If you only have five cases where X is missing out of 500 observations, I don’t think anyone will seriously mind if you treat those missing values as missing at random. But if, say, X is missing for more than 5% of your sample, you might want to run through the list above, and even that is an arbitrary rule of thumb. The best thing to do, as always, is to be forthcoming about the problem, explore how it might compromise (i.e., bias) your results, and try to show robustness as best as you can.
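To make the “do both” advice concrete, here is a minimal sketch that estimates c three ways on simulated data: dropping observations with missing X (option 1), using a missing dummy (option 4), and imputing X from D (option 6). The data-generating process is entirely my assumption: true c = 3, X correlated with D, and about 20% of X deleted completely at random. The small ols helper is a hypothetical stand-in for whatever regression routine you normally use.

```python
import random

def ols(columns, y):
    """OLS coefficients via the normal equations; the intercept comes first."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for p in range(k):
        pivot = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[pivot] = A[pivot], A[p]
        for r in range(k):
            if r != p:
                f = A[r][p] / A[p][p]
                A[r] = [a - f * b for a, b in zip(A[r], A[p])]
    return [A[r][k] / A[r][r] for r in range(k)]

# Simulated data (assumed): Y = 1 + 2X + 3D + e, with X correlated with D.
random.seed(0)
n = 500
D = [1.0 if random.random() < 0.5 else 0.0 for _ in range(n)]
X_true = [1.0 + 0.5 * d + random.gauss(0, 1) for d in D]
Y = [1.0 + 2.0 * x + 3.0 * d + random.gauss(0, 1) for x, d in zip(X_true, D)]
# Delete roughly 20% of X completely at random (None marks a missing value).
X = [None if random.random() < 0.2 else x for x in X_true]

obs = [i for i in range(n) if X[i] is not None]
share_missing = 1 - len(obs) / n

# Option 1: ignore the problem, i.e., keep only observations with X present.
_, _, c_complete = ols([[X[i] for i in obs], [D[i] for i in obs]],
                       [Y[i] for i in obs])

# Option 4: missing dummy Z, with X' set to zero where X is missing.
Z = [1.0 if x is None else 0.0 for x in X]
X_zero = [0.0 if x is None else x for x in X]
_, _, _, c_dummy = ols([X_zero, Z, D], Y)

# Option 6: regress X on D among observed cases, fill in the predicted values.
a0, a1 = ols([[D[i] for i in obs]], [X[i] for i in obs])
X_imputed = [a0 + a1 * d if x is None else x for x, d in zip(X, D)]
_, _, c_imputed = ols([X_imputed, D], Y)

print("share of X missing: %.3f" % share_missing)
print("c, complete cases:  %.3f" % c_complete)
print("c, missing dummy:   %.3f" % c_dummy)
print("c, imputed X:       %.3f" % c_imputed)
```

The three estimates need not coincide, and where they diverge is itself informative. In particular, because X is correlated with D in this simulation, the missing-dummy estimate can drift away from the other two, which is one reason to report it alongside them rather than on its own.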