‘Metrics Monday: What to Do When You Have the Population Instead of a Sample?

Rob writes:

I am not an econometrician–I spend my time playing with CGE models–but have to know something about econometrics. Recently I have been reviewing draft papers on a project using detailed tax data in my country–firm-level, matched with individual returns of employees, valued-added tax, import duties, etc.–for the period 2009-2014. A massive and rather unusual database.

It is all good work, but I have two concerns. (Note: I will get to Rob’s second concern in next week’s installment of ‘Metrics Monday. – MFB.) One is about big data. Many of the researchers report t-statistics and other statistics as if this does not matter. In fact some say they are dealing with the population of firms, in which case my sense is that standard errors say nothing about statistical fit, but maybe about economic significance of relations between means. Even if it is a sample, as n/N becomes closer to 1, sample statistics become problematic.

That is a very interesting question. Let me just rephrase it a bit more broadly to this: What do you do when you are dealing with the population itself instead of dealing with a sample that is representative of a population?

Your first reaction might be to think “Well, since I have the entire population, I don’t need to compute standard errors anymore, and everything I estimate is statistically significant.”

Though that kind of reasoning is intuitively appealing, let me ask this: Would you be willing to submit for publication a paper where you make that claim? Suppose, for example, that you are in my shoes, and you have data on food-borne illness and farmers markets for all 50 states plus the District of Columbia for 2004-2013. Would you really submit an article for publication in which you tell the editor and reviewers that you don’t need to compute standard errors and run t-tests because you have the entire population of states?

I didn’t think so.

In fact, if you look at published papers using data on all 50 states (here is a favorite example of mine, which I use when I teach my graduate class in applied econometrics), those papers still report standard errors in tables of regression results. So what gives?

On the one hand, I understand analytically why having access to a whole population obviates the need to compute those pesky standard errors. On the other hand, there are a few reasons why you might still want to treat your population as a sample.
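To make the analytical side of this concrete, here is a minimal Python sketch of the textbook finite population correction for the standard error of a sample mean under simple random sampling without replacement. Everything in it is made up for illustration: the population of 1,000 hypothetical firms, the values attached to them, and the function name `se_mean_with_fpc`. None of it comes from Rob's data.

```python
import math
import random
import statistics

def se_mean_with_fpc(sample, population_size):
    """Design-based standard error of a sample mean when the sample is
    drawn without replacement from a finite population (hypothetical helper)."""
    n = len(sample)
    s = statistics.stdev(sample)            # sample standard deviation (n - 1 divisor)
    fpc = math.sqrt(1 - n / population_size)  # finite population correction
    return (s / math.sqrt(n)) * fpc

random.seed(0)
# Hypothetical population of 1,000 firms with made-up values.
population = [random.gauss(100, 15) for _ in range(1000)]
N = len(population)

# The standard error shrinks as n/N approaches 1 and vanishes at n = N.
results = {}
for n in (50, 500, 900, 1000):
    sample = random.sample(population, n)
    results[n] = se_mean_with_fpc(sample, N)
    print(n, round(results[n], 4))
```

When n equals N, the correction factor is exactly zero, so the design-based standard error is zero too. That is the sense in which having the whole population removes sampling uncertainty, and it is also why the reasons below rest on a different notion of uncertainty than sampling from a fixed list of units.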

One reason why you might want to treat your population as a sample is to test whether some estimated relationship is meaningful. In other words, you might want to check that the relationship between your dependent variable and some regressor is statistically significant as a means of testing whether there really is a relationship between the two in your population, or whether the estimated relationship is indistinguishable from zero and the result of chance.

Another reason why you might want to treat your population as a sample and calculate standard deviations around the means of the various variables you are interested in is simply that those standard deviations are means of a sort themselves–each is the square root of the average squared departure from the mean of its variable, which tells you roughly how far from the average you can expect each observation to be.
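A small sketch with made-up numbers shows in what sense a standard deviation is itself a mean: it is the square root of the mean squared deviation, which is close to, but not the same as, the mean absolute deviation. The data below are invented purely for illustration.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data
mean = statistics.fmean(values)                     # 5.0

# Population standard deviation: root of the mean squared deviation.
pop_sd = statistics.pstdev(values)                  # 2.0

# Mean absolute deviation: the literal average distance from the mean.
mean_abs_dev = statistics.fmean([abs(v - mean) for v in values])  # 1.5

print(mean, pop_sd, mean_abs_dev)
```

Both numbers describe the population as it is, so they are worth reporting whether or not you think of your data as a sample.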

Lastly, a more compelling reason why you might want to treat your population as a sample is that you might be interested in prediction. For example, a policy maker might ask you to predict the effect on your dependent variable of changing an explanatory variable by a certain amount. Without treating the population as a sample, you would be making a very sharp prediction: In essence, you’d give the policy maker a single number, without there being any uncertainty around it. In practice, you would likely want to qualify that number with a range of credible values. And what better way to do that than a 90/95/99 percent confidence interval?
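Here is a rough sketch of what that might look like, assuming a simple bivariate regression with homoskedastic errors and a normal approximation to the t critical value (with about 50 observations this slightly understates the interval). The data are simulated, and the function name `ols_slope_ci`, the 51 units, and the policy change of 3 units are all hypothetical; a real application with state-level panel data would likely call for clustered standard errors instead.

```python
import math
import random
from statistics import NormalDist, fmean

def ols_slope_ci(x, y, level=0.95):
    """OLS slope, its classical standard error, and a confidence interval
    (hypothetical helper; homoskedasticity and normal approximation assumed)."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)          # classical OLS standard error
    z = NormalDist().inv_cdf(0.5 + level / 2)    # about 1.96 for level=0.95
    return slope, se, (slope - z * se, slope + z * se)

random.seed(1)
# Simulated data for 51 hypothetical units (think 50 states plus DC).
x = [random.uniform(0, 10) for _ in range(51)]
y = [2.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

slope, se, (lo, hi) = ols_slope_ci(x, y)

delta = 3.0  # hypothetical policy change in the explanatory variable
print("point prediction:", round(delta * slope, 3))
print("95% interval:", (round(delta * lo, 3), round(delta * hi, 3)))
```

The point prediction is the single sharp number; the interval is what you would hand the policy maker alongside it.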

Update: Matthew Martin adds the following via Twitter:

https://twitter.com/hyperplanes/status/699226275181944832

And Cyrus Samii points to a previous blog post of his on this topic:

https://twitter.com/cdsamii/status/699233078401748992
