
Follow-Up on Achieving Statistical Significance via Covariates

I received the following email from Indiana University’s Dan Sacks in response to my post yesterday:

Dear Marc,

I enjoy reading your blog and find it informative and stimulating. However your recent post on “achieving statistical significance with covariates” may mislead some readers. The basic issue is that it seems fine to me if the precision of your coefficient is sensitive to the inclusion of pre-determined covariates, as long as the expected value is not. That is, in such cases it seems fine to emphasize the precisely estimated result.

Here are more details. You note that in the model

Y = a + bX + cD + e,

the estimated coefficient c on D might or might be statistically significant, depending on what is included in the control vector X. The usual concern in the applied literature—which of course I share completely—is that if we don’t condition on a sufficient set of confounders, then c is estimated with bias. We all want to avoid bias. Bias is about expected values, though, not statistical significance, and it is not obvious to me that we should be worried about models in which including covariates changes the statistical significance (but not the expected value) of the results. Including pre-determined regressors which are uncorrelated with D but (conditionally) correlated with Y will generally reduce var(e), reducing the standard error of c and possibly leading to statistical significance. The fact that our results are only significant if we control for some set of X’s does not necessarily mean that there is bias – only that we might be underpowered without enough controls.

Here’s a hypothetical example. You run an RCT looking at the effect of an unconditional cash transfer on happiness. You randomly assign different people to get money or not. This is an expensive intervention, so you don’t have a huge sample. You estimate a coefficient on D of 0.1 (s.e. = 0.06) without controlling for anything, but when you control for a vector of characteristics measured at baseline, you estimate 0.09 (s.e. = 0.04). In this case, I think we would all agree that it’s fine to control for those baseline characteristics. [There is a different issue, which is about data mining for statistical significance, but I think that’s not the point you’re raising, either.]

I used to be a real purist about this, particularly as a grad student. “If your experiment/IV is valid,” I would ask, “then why do you need to include controls?” But the answer is that controls help with statistical precision.

Dan
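Dan’s point can be illustrated with a short simulation. This is a minimal sketch with made-up numbers (nothing here comes from the exchange above): a pre-determined covariate that is uncorrelated with the randomly assigned treatment D but correlated with the outcome Y leaves the estimate of c essentially unchanged while shrinking its standard error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
D = rng.integers(0, 2, n)                    # randomly assigned treatment
X = rng.normal(size=n)                       # pre-determined covariate, independent of D
Y = 0.1 * D + 1.0 * X + rng.normal(size=n)   # true c = 0.1

def ols(y, Z):
    """OLS coefficients and conventional standard errors."""
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (Z.shape[0] - Z.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Z.T @ Z)))
    return beta, se

ones = np.ones(n)
b_short, se_short = ols(Y, np.column_stack([ones, D]))   # Y on D alone
b_long, se_long = ols(Y, np.column_stack([ones, D, X]))  # Y on D and X

print(f"c without control: {b_short[1]:.3f} (se {se_short[1]:.3f})")
print(f"c with control:    {b_long[1]:.3f} (se {se_long[1]:.3f})")
```

Because X soaks up residual variance in Y without being correlated with D, the point estimate of c barely moves across the two specifications, but its standard error falls.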

‘Metrics Monday: Achieving Statistical Significance with Covariates (Updated)

Those of us who do applied work for a living will have at some point noticed that, depending on which variables we include in X on the right-hand side (RHS) of an equation like

(1) y = a + bX + cD + e,

the coefficient c on the treatment variable D might go from significant to insignificant or vice versa.

This is the very reason it is common practice in applied work to present several specifications of equation (1) in the same table, ranging from the most parsimonious (i.e., a regression of y on D alone) to slightly less parsimonious ones (i.e., regressions of y on D and ever larger subsets of X) to the least parsimonious (i.e., a regression of y on D and all the controls in X). It is also the rationale behind the method put forth by Altonji et al. (2005) to assess the robustness of a finding.
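That practice can be sketched in a few lines. The data and coefficient values below are made up for illustration: the loop fits equation (1) with an ever larger subset of the controls in X and reports the coefficient on D and its standard error for each specification, exactly as one would lay them out across the columns of a table.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 400, 3
X = rng.normal(size=(n, k))        # controls
D = rng.integers(0, 2, n)          # treatment
y = 0.15 * D + X @ np.array([0.5, -0.8, 0.3]) + rng.normal(size=n)

def fit_c(cols):
    """OLS of y on an intercept, D, and the chosen subset of controls;
    returns the coefficient on D and its conventional standard error."""
    Z = np.column_stack([np.ones(n), D] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (n - Z.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(Z.T @ Z)[1, 1])
    return beta[1], se

results = []
for m in range(k + 1):             # spec 1: D alone; spec 4: D plus all of X
    c_hat, se = fit_c(list(range(m)))
    results.append((c_hat, se))
    print(f"spec {m + 1}: c = {c_hat:.3f} (se {se:.3f})")
```

Reading down the printed column mimics reading across a regression table from the most to the least parsimonious specification.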

By way of Dave Giles’ blog, I came across an interesting new working paper by Lenz and Sahn titled “Achieving Statistical Significance with Covariates.” The authors conduct a meta-analysis of articles published in the American Journal of Political Science and find that in almost 40% of the observational studies analyzed, researchers obtain statistical significance on c by tinkering with which covariates are included in (or excluded from) X.

Here is the abstract of Lenz and Sahn’s paper:

Contract Farming as Partial Insurance

That is the title of a new working paper of mine coauthored with my former doctoral students Yu Na Lee (University of Guelph Food, Agricultural & Resource Economics) and Lindsey Novak (who is joining the Department of Economics at Colby College in a few weeks). Here is the abstract:

The institution of contract farming, wherein a processing firm contracts out the production of an agricultural commodity to a grower household, has received much attention in recent years. We look at whether participation in contract farming is associated with lower levels of income variability for a sample of 1,200 households in rural Madagascar. Relying on a framed field experiment aimed at eliciting respondent marginal utility of participation in contract farming for identification in a selection-on-observables design, we find that participation in contract farming is associated with a 0.2-standard deviation decrease in income variability. Looking at the mechanism behind this finding, we find strong support for the hypothesis that fixed-price contracts explain the reduction in income variability associated with contract farming. Then, because the same assumption that makes the selection-on-observables design possible also satisfies the conditional independence assumption, we estimate propensity score matching models, the results of which show that our core results are robust and that participation in contract farming would have greater beneficial effects for those households that do not participate than for those who do, i.e., the magnitude of the average treatment effect on the untreated exceeds that of the average treatment effect on the treated. Our findings thus show that participation in contract farming can help rural households partially insure against income risk via contracts that transfer price risk from growers to processors.
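None of the paper’s data or code appears here, but the distinction the abstract draws between the average treatment effect on the treated (ATT) and the average treatment effect on the untreated (ATU) can be illustrated with a toy nearest-neighbor matching exercise on simulated data. This is a deliberately simplified stand-in for propensity score matching, with a single observed confounder and a heterogeneous treatment effect built in so that the two quantities differ.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)                       # single observed confounder
p = 1 / (1 + np.exp(-x))                     # true propensity score
d = (rng.random(n) < p).astype(int)          # selection on observables
effect = 0.5 - 0.3 * x                       # heterogeneous treatment effect
y = x + effect * d + rng.normal(scale=0.5, size=n)

treated, control = np.where(d == 1)[0], np.where(d == 0)[0]

def match_effect(group, pool, sign):
    """For each unit in `group`, impute its counterfactual outcome from the
    nearest `pool` unit on x; average the implied treatment effects."""
    diffs = []
    for i in group:
        j = pool[np.argmin(np.abs(x[pool] - x[i]))]
        diffs.append(sign * (y[i] - y[j]))
    return float(np.mean(diffs))

att = match_effect(treated, control, +1)     # effect on the treated
atu = match_effect(control, treated, -1)     # effect on the untreated
print(f"ATT: {att:.2f}, ATU: {atu:.2f}")
```

Because units with higher x select into treatment and the treatment effect declines in x, the ATU exceeds the ATT in this simulation—the same ordering the abstract reports for contract farming participation, though via a much cruder estimator.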