I spent a few days last week attending the annual meetings of the Agricultural and Applied Economics Association (AAEA) in San Francisco as well as a one-day pre-conference on agricultural value chains in developing countries. At that pre-conference, I happened to be sitting with Ben Wood, who co-manages 3ie’s replication program, and in light of his experience with replication in economics, he suggested I write a post about data cleaning.
Why data cleaning? Because most students will typically have very little experience with that thankless task, which happens because most econometrics classes usually present students with nice, picture-perfect data sets for applied problem sets (and that’s when those classes actually have students estimate stuff instead of just memorizing the properties of various estimators).
Many textbooks now come with a number of data sets that readers can use to apply various techniques and replicate the examples in the book (for example, here are the data sets from Jeff Wooldridge’s textbook), which is great. But as I noted above, the problem with those data sets is that they are “perfect.” That is, no data are missing, no values are the product of an obvious typo, all the data are in one neat file, and so on.
Very often, however, the data you will want to use for a research project is not clean. It will come in several files covering different questionnaire modules across different years. Monetary values will have been recorded in real terms. Some people will have refused to answer some questions; others will have trolled the enumerators with crazy answers. Whoever entered the data will have made typos.
The list of possible issues is almost endless, and each data set has its unique set of data-cleaning, which is why it is very difficult to actually teach students how to clean data. But if there is one thing that you need to remember on the data-cleaning front, it’s this:
Cleaning data will typically involve running a .do file wherein
- You merge different data files together. This can range from easy if you only have to match observations with themselves (i.e., individuals’ answers to demographic questions with the same individuals’ answers to financial questions) to very tricky if you have to ascribe several sub-observations (e.g., a household’s individual plots) to one “master” variable (e.g., the household itself), and you might want to check that step several times over to make sure everything is okay, going so far as inspecting a few observations to see if they line up with the actual values recorded in the survey questionnaire.
- You tab each variable to see whether there are obvious irregularities: missing values, outliers, censoring, truncation, etc. For cases where you have several sub-observations per unit (say, several country-year observations), you might want to check that the time-invariant values are indeed time-invariant, checking the mean of those variables by country. Here, you might also want to plot your dependent variable against each right-hand side variable, just to get a visual sense of what is going on as well as to detect outliers and leverage points.
- You drop some observations because of missing values, outliers, typos, etc.
- You transform some variables by taking a log, applying an inverse hyperbolic sine transformation, expressing them in real terms, converting two-week recall into seasonal data, dividing by 1,000 to have estimated coefficients more in line with your other estimated coefficients, and so on.
- You generate new variables from those you currently have, whether this means adding variables together (e.g., to calculate household size), creating dummies from continuous variables (e.g., to break up income into income brackets), creating ratios of two variables (e.g, to use firms’ price-earnings ratios as a regressor), etc.
- You perform other operations that will lead to a nice, clean data set you can just run a parsimonious estimation .do file on.
So what I suggest–and what I try to do myself–is to write a .do file that begins by loading raw data files (i.e., Excel or ASCII files) in memory, merges and appends them with one another, and which documents every data-cleaning decision via embedded comments (in Stata, those comments are lines that begin with an asterisk) so as to allow others to see what assumptions have been made and when. This is like writing a chemistry lab report which another chemist could use to replicate your work.
Another important rule is to never, ever save over (i.e., replace) a data file. If you replace a data file from which you have dropped something or in which you have transform the data in some irreversible way (say, because you failed to follow the “Document everything” rule and did not document what you did to the data), then that file is gone forever. Thankfully, most people now have way more storage space than they need to (after three years of use, my laptop’s C:\ drive is still 67% empty), and Stata is pretty helpful when it comes to that: Every time you want to save a data file, you have to actually tell Stata to do it, and when you quit Stata, if the data have changed, Stata will ask you whether you want to save before quitting (the answer to that question is almost always “No.”)
Lastly, another thing I did when I first cleaned data was to “replicate” my own data cleaning: When I had received all the files for my dissertation data in 2004, the data were spread across a dozen spreadsheets. I first merged and cleaned them and did about a month’s worth of empirical work with the data. I then decided to re-merge and re-clean everything from scratch just to make sure I had done everything right the first time.
Really, there is no big secret to cleaning data other than “Document everything” and to save everything in different files and in different locations (i.e., your computer, Dropbox, Google Drive), and there is no other way to learn data cleaning than by doing it. But it is something that is so rarely discussed that it is worth having a discussion, however short, of what it involves.