## Follow-Up on Achieving Statistical Significance via Covariates

I received the following email from Indiana University’s Dan Sacks in response to my post yesterday:

Dear Marc,

I enjoy reading your blog and find it informative and stimulating. However, your recent post on “achieving statistical significance with covariates” may mislead some readers. The basic issue is that it seems fine to me if the *precision* of your coefficient is sensitive to the inclusion of pre-determined covariates, as long as the *expected value* is not. That is, in such cases it seems fine to emphasize the precisely estimated result.

Here are more details. You note that in the model

Y = a + bX + cD + e,

the estimated coefficient c on D might or might not be statistically significant, depending on what is included in the control vector X. The usual concern in the applied literature, which of course I share completely, is that if we don’t condition on a sufficient set of confounders, then c is estimated with bias. We all want to avoid bias. Bias is about expected values, though, not statistical significance, and it is not obvious to me that we should be worried about models in which including covariates changes the statistical significance (but not the expected value) of the results. Including pre-determined regressors which are uncorrelated with D but (conditionally) correlated with Y will generally reduce var(e), reducing the standard error of c and possibly leading to statistical significance. The fact that our results are only significant if we control for some set of X’s does not necessarily mean that there is bias, only that we might be underpowered without enough controls.

Here’s a hypothetical example. You run an RCT looking at the effect of an unconditional cash transfer on happiness. You randomly assign different people to get money or not. This is an expensive intervention so you don’t have a huge sample. You estimate that c = 0.1 (se = 0.06) without controlling for anything, but when you control for a vector of characteristics measured at baseline, you estimate c = 0.09 (se = 0.04). In this case, I think we would all agree that it’s fine to control for baseline characteristics such as age. [There is a different issue which is about data mining for statistical significance, but I think that’s not the point you’re raising, either.]

I used to be a real purist about this, particularly as a grad student. “If your experiment/IV is valid,” I would ask, “then why do you need to include controls?” But the answer is that controls help with statistical precision.

Dan

Three things:

- Dan raises an excellent point, and I am grateful for his email. In my (poor) attempt to lump as many of my recent econometric readings together in a single blog post hastily written at the tail end of a Sunday afternoon spent at the office, I mistakenly rolled the issue of precision (i.e., the Ganz and Lenz working paper) together with that of expected value (i.e., the Oster JBES article and the Pei et al. NBER working paper, or the original Altonji et al. JPE article).
- That said, data mining for statistical significance was the point I was trying to discuss in relation to Ganz and Lenz.
- The RCT example is great because it clarifies the matter, but RCT data lend themselves systematically less to data mining than observational data (the focus of the Ganz and Lenz working paper).

Generally, note that whether an estimated coefficient is statistically significant depends both on the precision and on the expected value of the parameter of interest. As Ganz and Lenz note in their abstract, “[r]esearchers choose which covariates to include in statistical models and these choices affect the size and statistical significance of estimates reported in studies.” And so holding an estimated coefficient’s precision (i.e., its standard error) constant, the larger that coefficient in absolute value, the more likely it is to be statistically significant, because the more likely its confidence interval will exclude zero. Conversely, and again holding the standard error constant, the smaller that coefficient in absolute value, the more likely it is to be statistically insignificant, because the more likely its confidence interval will include zero.
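To make the arithmetic concrete, here is a minimal sketch that plugs the two hypothetical estimates from Dan's RCT example into a normal-approximation 95% confidence interval. The helper name `summarize` and the 1.96 critical value are my own assumptions for illustration; with a small sample one would use a t critical value instead.

```python
# Normal-approximation check of the hypothetical RCT numbers in Dan's email.
Z_95 = 1.96  # assumed two-sided 95% critical value (normal approximation)


def summarize(estimate, std_err, z=Z_95):
    """Return the t-statistic, the confidence interval, and whether it excludes zero."""
    t_stat = estimate / std_err
    lower = estimate - z * std_err
    upper = estimate + z * std_err
    excludes_zero = lower > 0 or upper < 0
    return t_stat, (lower, upper), excludes_zero


# Estimates and standard errors taken from the hypothetical example.
no_controls = summarize(0.10, 0.06)
with_controls = summarize(0.09, 0.04)

for label, (t, (lo, hi), sig) in [
    ("no controls", no_controls),
    ("with controls", with_controls),
]:
    print(f"{label}: t = {t:.2f}, 95% CI = [{lo:.3f}, {hi:.3f}], excludes zero: {sig}")
```

The point estimate barely moves between the two specifications; what flips the verdict is the narrower interval.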

Given the foregoing, the worst-case scenario is when adding in new covariates on the RHS substantially affects both the estimated coefficient *and* its standard error. Again, all of this militates in favor of showing a bare-knuckles regression of outcome on treatment before anything else, and of coming up with a good rationale in those cases where the bare-knuckles estimate of c is insignificant but becomes significant when controls are included.
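For readers who want to see the benign case Dan describes, the following is a rough simulation sketch, not anyone's actual analysis. It assumes a randomized binary treatment `d`, a single predetermined covariate `x` that is independent of `d` but correlated with the outcome, and classical (homoskedastic) standard errors. The sample size, effect size, coefficients, and seed are all made-up values chosen for illustration.

```python
import numpy as np


def ols(y, X):
    """OLS with classical (homoskedastic) standard errors; X must include a constant."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)  # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return beta, se


rng = np.random.default_rng(0)  # arbitrary seed
n = 400                         # hypothetical sample size
true_c = 0.1                    # hypothetical treatment effect

x = rng.normal(size=n)                        # predetermined baseline covariate
d = rng.integers(0, 2, size=n).astype(float)  # random assignment, independent of x
y = 1.0 + 0.8 * x + true_c * d + rng.normal(scale=0.5, size=n)

const = np.ones(n)
beta_bare, se_bare = ols(y, np.column_stack([const, d]))     # bare-knuckles regression
beta_ctrl, se_ctrl = ols(y, np.column_stack([const, d, x]))  # with the baseline control

c_bare, se_c_bare = beta_bare[1], se_bare[1]
c_ctrl, se_c_ctrl = beta_ctrl[1], se_ctrl[1]

print(f"bare:     c = {c_bare:.3f} (se = {se_c_bare:.3f})")
print(f"controls: c = {c_ctrl:.3f} (se = {se_c_ctrl:.3f})")
```

Because `x` is independent of `d` by construction, both regressions target the same `true_c`, and the control only absorbs residual variance. Reporting both lines side by side is the bare-knuckles-first presentation argued for above.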