This month’s issue of Foreign Affairs has a great article (you’ll need to log in to read the whole thing, unfortunately) on the rise of big data, which Wikipedia defines as
a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
So far, so good. As a development economist, I have to make do with 500 observations more often than not (the largest dataset I have ever worked with had about 8,000 observations), so I obviously welcome ever larger datasets.
Kenneth Cukier and Viktor Mayer-Schoenberger, the authors of the article, argue that big data introduces three changes in the information landscape:
- We have much larger datasets, which are often closer to encompassing entire populations rather than being small samples.
- We have slightly messier data, which are a byproduct of more data being collected much faster.
- We have to sacrifice the identification of causal relationships and be happy with mere correlations.
About that third and last point, the authors also write:
Take UPS, the delivery company. It places sensors on vehicle parts to identify certain heat or vibrational patterns that in the past have been associated with failures in those parts. In this way, the company can predict a breakdown before it happens and replace the part when it is convenient, instead of on the side of the road. The data do not reveal the exact relationship between the heat or the vibrational patterns and the part’s failure. They do not tell UPS why the part is in trouble. But they reveal enough for the company to know what to do in the near term and guide its investigation into any underlying problem that might exist with the part in question or with the vehicle.
That’s great news for UPS, but I can see two problems with this. First, without the identification of causal relationships, there can be no science, social or otherwise. This means that no matter how large a dataset is, if it does not allow answering questions of the form “Does X cause Y?”, that dataset is worthless to scientists.
Sure, the dataset can be used for forecasting, much like UPS does. But UPS was never in the business of identifying causal relationships to begin with. Rather, UPS’ purpose is to maximize its profits — to make big money.
This brings me to my second point: There is a fundamental difference between estimating causal relationships and forecasting. The former requires a research design in which X is plausibly exogenous to Y. The latter only requires that X include as much stuff as possible.
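To make that distinction concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption made up for illustration, not something from the article: the true effect of X on Y is set to 1.0, an unobserved confounder U drives both X and Y, and names like `simulate`, `ols_slope`, and `r_squared` are hypothetical helpers. The observational sample stands in for "big data" and the randomized sample stands in for a research design in which X is exogenous to Y.

```python
import random

TRUE_EFFECT = 1.0  # assumed causal effect of X on Y (illustrative value)


def simulate(n, randomized, seed):
    """Draw n (x, y) pairs; u is a confounder the analyst never observes."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)
        if randomized:
            x = rng.gauss(0, 1)        # X assigned independently of U
        else:
            x = u + rng.gauss(0, 0.5)  # X is driven by U (endogenous)
        y = TRUE_EFFECT * x + 2.0 * u + rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    return xs, ys


def _moments(x, y):
    """Return (cov(x, y), var(x), var(y)) as sums of centered products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov, var_x, var_y


def ols_slope(x, y):
    """Slope from a simple regression of y on x."""
    cov, var_x, _ = _moments(x, y)
    return cov / var_x


def r_squared(x, y):
    """Share of the variance in y explained by x in a simple regression."""
    cov, var_x, var_y = _moments(x, y)
    return cov ** 2 / (var_x * var_y)


obs_x, obs_y = simulate(20000, randomized=False, seed=1)
rct_x, rct_y = simulate(20000, randomized=True, seed=2)

obs_slope, obs_r2 = ols_slope(obs_x, obs_y), r_squared(obs_x, obs_y)
rct_slope, rct_r2 = ols_slope(rct_x, rct_y), r_squared(rct_x, rct_y)

print(f"observational: slope={obs_slope:.2f}, R^2={obs_r2:.2f}")
print(f"randomized:    slope={rct_slope:.2f}, R^2={rct_r2:.2f}")
```

Under these assumed parameters, the observational slope should land near 2.6 rather than the true 1.0, even though X explains far more of the variation in Y in that sample than in the randomized one. In other words, the confounded X is the better forecaster and the worse guide to what would happen if we actually changed X, and collecting more observations of the same kind would not close that gap.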
When it comes to forecasting, big data is unbeatable. With an ever larger number of observations and variables, it should become very easy to forecast all kinds of things, from election results to sports scores and from stock prices to terrorist attacks.
But when it comes to doing science, big data is dumb. It is only when we think carefully about the research design required to answer the question “Does X cause Y?” that we know which data to collect, and how many of them. The trend in the social sciences over the last 20 years has been toward identifying causal relationships, and away from observational data, big or not.