
Splitting a Random Sample Along Some Control Variable to Get at Treatment Heterogeneity (Updated)

I teach the second-year PhD research seminar in the Department, and it’s that time of year again when students have to submit a draft of their second-year paper. In case you are not familiar with a second-year paper, it is the widespread practice in applied economics and economics departments of having students who have completed their first-year courses write an entire publishable paper from start to finish.

As such, teaching the second-year paper involves reading a lot of drafts. One of the drafts I read last week did something that always baffles me when I see it. This might be a simple question whose answer is obvious, so bear with me, but the practice is so common that I thought I would ask readers whether it is me who is missing something. The practice is as follows (note that I am positing all of this for observational data, not experimental data):

  1. You have a random sample in which you estimate a relationship of interest. Say, you are interested in whether a land title means that a plot is more productive, or whether having a college degree means that an individual makes more money.
  2. You are interested in heterogeneous treatment effects. Say, you are interested in whether the effect of the land title differs by plot size, say for plots smaller than one hectare versus plots larger than one hectare. Or you are interested in whether having a college degree has different effects by race.
  3. To look at treatment heterogeneity, you split your sample up by group and re-estimate your relationship of interest. So you re-estimate your productivity equation once for small plots, and once for larger plots. Or you re-estimate your wage equation once for each race.

My problem is this: You start from a random sample. The minute you split that sample by group, however, your sub-samples are no longer random! Intuitively, it is unclear to me what the estimate you get for each sub-group means given the non-randomness of the sub-sample on which it is based.

To get at treatment heterogeneity, wouldn’t it be better to keep your sample as is, but to interact your treatment (i.e., land title, college degree, etc.) with group dummies (i.e., small and large plots, race, etc.), going so far as to omit the constant in order to retain a dummy for each group? If anything, you would get much better statistical power from preserving a larger sample. Does anyone have any insight as to whether my intuition is right regarding the presumed bias that comes from looking at sub-groups, or is the overall effect simply the weighted average of the group effects? So, readers: Is the practice legitimate and I simply haven’t seen good applications of it?
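The interacted specification I have in mind can be sketched as follows. This is a minimal illustration with simulated data, not taken from any actual paper; all variable names and effect sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
group = rng.integers(0, 2, n)   # e.g., small (0) vs. large (1) plots
title = rng.integers(0, 2, n)   # treatment: has a land title

# Simulated truth: treatment effect of 1.0 in group 0 and 2.5 in group 1
y = (0.5 + 1.0 * title * (group == 0) + 2.5 * title * (group == 1)
     + 0.8 * group + rng.normal(0, 1, n))

# Fully interacted regression on the whole sample: one dummy per group
# (constant omitted) plus a group-specific treatment slope.
X = np.column_stack([
    (group == 0).astype(float),   # intercept for group 0
    (group == 1).astype(float),   # intercept for group 1
    title * (group == 0),         # treatment effect, group 0
    title * (group == 1),         # treatment effect, group 1
])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta[2], beta[3])  # should recover roughly 1.0 and 2.5
```

Here the third and fourth coefficients are the group-specific treatment effects, estimated off the full sample rather than off two separate sub-samples.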

Update: Many good comments below. My new colleague Jason Kerwin even came up with his own proof-by-Stata (see the link in his comment below). Comments like the ones on this post are one of the reasons why I love blogging.

8 Comments

  1. Devon Devon

    Efficiency is definitely one reason not to split up the samples. Your standard errors will be much larger in the two split samples than if you estimated it in a large model with the interactions you indicated.

    I’m not seeing the problem with losing the randomness of the sub-sample. Conditional upon the group, isn’t it still random within that group? Maybe I’m missing something here.

  2. Thanks for your comment, Devon. Yes, you are absolutely right regarding efficiency. Intuitively, it just seems to me that the estimate you get off of the subgroup might be driven by unobservable factors that differ systematically along your splitting variable. For example, suppose you are interested in how much of some behavior someone undertakes (e.g., smoking) as your dependent variable and want to know the effect of income on smoking, and you split along gender. If men and women differ in risk aversion (which you don’t observe in my example) and this relates to your variable of interest (risk preferences change with income, at least in theory), then the estimate you get for each group may partly reflect that unobserved difference.

    By splitting the sample, the estimates you get for men and women in each sub-sample are both biased by this unobserved confounder–and it is conceivable that said treatment effect wouldn’t be biased so in the whole random sample. Again, as I said in my post, maybe I’m the one who’s confused!

  3. Devon Devon

    Thanks for the reply, Marc. I follow you up until “and it is conceivable that said treatment effect wouldn’t be biased so in the whole random sample.” If the bias exists in the sub-sample, wouldn’t it also exist in the whole sample? The unobserved confounder is still there. I guess I’m struggling to see how moving from the sub-sample to the whole sample would ever remove that issue.

    Here’s how I’m thinking about it. Wouldn’t the point estimate in a regression using only the sub-sample be the same as the regression on the whole sample using the interactions? That’s not to say the point estimate isn’t biased in both samples by an unobserved confounder, but I don’t see how moving between sub-sample and whole sample would ever change that problem. Again, I might be off here. I’ll need to think more about your example…

  4. In terms of bias, it isn’t a problem as long as you are splitting the sample using a variable that isn’t affected by your main variable of interest. In your land title example, you could split landholders into quantiles and estimate impacts within each group as long as landholdings are measured pre-treatment. You would end up with unbiased estimates of the average impact within each land group, that is, conditional treatment effects. Of course, it is true that there may be something else about individuals with landholdings of different sizes that is driving the impact heterogeneity. But that doesn’t mean your estimated impacts by subgroup are biased. The treatment effect estimate is measuring the impact of assigning someone within a given land group to the treatment group instead of the control group. It isn’t measuring what would happen if you took someone from a smaller land group that is treated and gave them more land.
    Now, you could instead run a single saturated regression where you have dummies for land quantiles and each dummy is interacted with all of the covariates in your regression. This should get you the exact same point estimates for the average treatment effect (as long as you subtract the subgroup means from the interactions between the covariates and the subgroup dummies).
    Someone can correct me if I’m wrong, but you should get the same standard errors on the subgroup effects in either approach as well. I could, for example, regress income on an intercept in one subgroup, do the same in another subgroup, and do a t-test using the variance of the intercept from each individual regression to construct the standard error of the difference. I could instead regress income on an intercept and a group dummy and use the standard error on the dummy to do a t-test. In both cases, the standard error and the degrees of freedom for the difference in means are the same. So I don’t think pooling the groups gets you more power for testing group differences.
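The point-estimate equivalence claimed above is easy to check by simulation. The following is an illustrative sketch with made-up data, not the commenter's own calculation:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000
g = rng.integers(0, 2, n)   # splitting variable (e.g., plot size group)
d = rng.integers(0, 2, n)   # treatment (e.g., land title)
y = 1.0 + 0.5 * d + 1.5 * d * g + 0.3 * g + rng.normal(0, 1, n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# (a) split the sample and run one regression per group
b0 = ols(np.column_stack([np.ones((g == 0).sum()), d[g == 0]]), y[g == 0])
b1 = ols(np.column_stack([np.ones((g == 1).sum()), d[g == 1]]), y[g == 1])

# (b) one saturated regression: a dummy for each group, and the treatment
# interacted with each group dummy
X = np.column_stack([(g == 0).astype(float), (g == 1).astype(float),
                     d * (g == 0), d * (g == 1)])
b = ols(X, y)

# the point estimates are numerically identical
print(np.allclose([b0[0], b1[0], b0[1], b1[1]], b))  # True
```

The equivalence holds exactly because the saturated design matrix is block-diagonal across groups, so the pooled least-squares problem separates into the two split-sample problems.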

  5. Anna Anna

    Hi Marc,

    I agree with Devon. I don’t see how conditioning on an independent variable would ‘create’ bias vis-a-vis the full random sample. If you use the full sample and interact income with a male dummy, the male dummy will still be picking up some of the variation caused by the omitted risk aversion variable.

    Conditioning on the dependent variable is different and, in that case, we can think of the researcher ‘introducing sample selection bias’ (a la Heckman, 1979). I recall reading something in Mostly Harmless Econometrics on splitting the sample based on independent variables, but I don’t have the book with me right now… I might have read that somewhere else.

    p.s. I really enjoy reading your blog (and papers – recently started reading the FGC paper – very nice). I appreciate the emphasis on solid applied microeconomics / econometrics. Thanks for sharing.

  6. My response to this got kind of long so I put it on my own blog instead:
    https://nonparibus.wordpress.com/2015/03/27/understanding-heterogeneous-treatment-effect-estimation-using-proof-by-stata/
    A brief summary is:
    1. I think it helps with intuition to simulate the data-generating process in Stata.
    2. Splitting your sample by an exogenous X won’t bias your coefficient estimates
    3. A saturated regression that interacts a dummy for the sample split with the treatment variable and all other variables in the model gives identical point estimates to splitting the sample first and then running two regressions
    4. However, the SEs will be different
    5. If the variable you want to split by is plausibly affected by the treatment, Marc’s intuition that this is a bad idea is spot-on. So I would be careful with splitting the sample by risk aversion, but not by gender.
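Jason’s proof-by-Stata can be mirrored with a quick simulation in Python (a purely illustrative sketch under the same assumptions). In particular, his point 4 shows up because the pooled saturated regression imposes a single residual variance across groups, while the split-sample regressions estimate one per group:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4_000
g = rng.integers(0, 2, n)   # exogenous splitting variable
d = rng.integers(0, 2, n)   # treatment
# group 1 has a noisier outcome (sd 2 vs. sd 1), so its split-sample SE
# should be larger than the pooled one, and group 0's smaller
y = 1.0 + 0.5 * d + 1.0 * d * g + rng.normal(0, 1 + g, n)

def ols_se(X, y):
    """Classical (homoskedastic) OLS coefficients and standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

# split-sample SE of the treatment effect in each group
_, se0 = ols_se(np.column_stack([np.ones((g == 0).sum()), d[g == 0]]), y[g == 0])
_, se1 = ols_se(np.column_stack([np.ones((g == 1).sum()), d[g == 1]]), y[g == 1])

# pooled saturated regression: same point estimates, one common error variance
X = np.column_stack([(g == 0).astype(float), (g == 1).astype(float),
                     d * (g == 0), d * (g == 1)])
_, se_pooled = ols_se(X, y)

print(se0[1], se_pooled[2])  # group 0: split SE is smaller than pooled
print(se1[1], se_pooled[3])  # group 1: split SE is larger than pooled
```

With heteroskedasticity-robust standard errors the two approaches would line up again, which is one reason the choice between them is about convenience and power for testing differences, not bias.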

  7. Thanks, Jason and everyone else! Quality discussion does occur in comments sections on the Internet; this post is proof!
