‘Metrics Monday: The Tobit Temptation
And because thou wast acceptable to God, it was necessary that temptation should prove thee. And now the Lord hath sent me to heal thee … — Tobit 12:13.
This week I wanted to discuss tobit estimators. In case you are not familiar with them, Wiki describes the tobit estimator (people say tobit “models,” but I don’t like calling estimators models, which confuses theory with empirics a bit too much for my taste) as
a statistical model proposed by James Tobin (1958) to describe the relationship between a non-negative dependent variable Y and an independent variable X. The term tobit was derived from Tobin’s name by truncating and adding -it by analogy with the probit model.
The model supposes that there is a latent (i.e. unobservable) variable Y*. This variable linearly depends on X via a parameter (vector) b which determines the relationship between the independent variable (or vector) X and the latent variable Y* (just as in a linear model). In addition, there is a normally distributed error term U to capture random influences on this relationship. The observable variable Y is defined to be equal to the latent variable whenever the latent variable is above zero and zero otherwise.
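To make the latent-variable setup concrete, here is a minimal simulation sketch of the definition above. The parameter values (`b0`, `b1`, `sigma`), the sample size, and the seed are all hypothetical choices made for illustration; nothing here comes from Tobin's paper. It generates a latent Y*, censors it at zero to get the observed Y, and compares the OLS slope on the latent variable with the OLS slope on the censored one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
b0, b1, sigma = -1.0, 2.0, 1.0  # hypothetical parameter values

x = rng.normal(size=n)
u = rng.normal(scale=sigma, size=n)   # normally distributed error term U
y_star = b0 + b1 * x + u              # latent variable Y*
y = np.maximum(y_star, 0.0)           # observed Y: equals Y* above zero, zero otherwise

X = np.column_stack([np.ones(n), x])
ols_latent = np.linalg.lstsq(X, y_star, rcond=None)[0]   # infeasible in practice: Y* is unobserved
ols_censored = np.linalg.lstsq(X, y, rcond=None)[0]      # what OLS gives on the censored data
share_censored = np.mean(y == 0.0)

print("slope on latent Y*:  ", ols_latent[1])
print("slope on censored Y: ", ols_censored[1])
print("share of observations censored at zero:", share_censored)
```

In this setup the slope estimated on the censored Y is pulled toward zero relative to the slope on Y*, which is the problem the tobit estimator is meant to address.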
There are many types of tobits. Wiki lists five:
- Type I Tobit: “a special case of a censored regression model, because the latent variable Y* cannot always be observed while the independent variable X is observable.” This would be the case, for example, if you observe age for everyone, but incomes below $30,000 per year are censored. The censoring can also occur above a certain threshold, or both above and below specific thresholds.
- Type II Tobit: “Heckman (1987) falls into the Type II Tobit. In Type I Tobit, the latent variable absorb both the process of participation and ‘outcome’ of interest. Type II Tobit allows the process of participation/selection and the process of ‘outcome’ to be independent, conditional on x.” This would be the case, for example, if you want to account for selection into a specific thing. Taking an example from my own work, you might want to account for whether a farmer participates in contract farming when studying whether participation in contract farming increases welfare, since participation in contract farming is not randomly sprinkled across farmers.
- Type III Tobit: This is the bivariate version of the tobit, i.e., it simultaneously estimates two tobits.
- Type IV Tobit: This is the trivariate version of the tobit, i.e., it simultaneously estimates three tobits.
- Type V Tobit: “Similar to type II, in type V we only observe the sign of Y*.”
My goal with this post is simply to discuss the temptation among some people to control for selection with a type II tobit, also known as a Heckman selection model or a heckit (once again, following the tradition of adding “-it” at the end of those limited and discrete-choice ML estimators).
Indeed, my view is this: Assuming you have a decent variable that you can exclude from the equation of interest to explain selection into treatment, why go through the trouble of estimating a heckit when you can estimate a plain-old 2SLS?
(And I say this as someone who, in a past, more structural life, was also tempted by heckits, wrote a likelihood function that involved something similar, and slapped the label of “ordered tobit” on it–proof that fads and fashions are definitely a thing in econometrics, as in almost everything else.)
Why should you go for the 2SLS instead of the heckit? Simply because the current preference is to keep it simple, and because the 2SLS does just that relative to the heckit. Indeed, both the 2SLS and the heckit estimate two equations. The first equation attempts to purge treatment of its correlation with the second-equation error term due to selection by using a plausibly exogenous variable to do so,* and the second equation estimates a treatment effect on the basis of this purged-of-endogeneity version of the treatment variable.**
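For readers who want to see the two equations at work, here is a minimal sketch of 2SLS done by hand on simulated data. Everything in it is an assumption made for illustration: the data-generating process, the coefficient values, the names `z`, `d`, `v`, and `e`, and a homogeneous treatment effect of 1.0. Selection enters through an unobservable `v` that drives both treatment and the outcome error, and `z` is the excluded variable.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
true_effect = 1.0  # hypothetical, homogeneous treatment effect

z = rng.normal(size=n)                       # excluded variable (instrument)
v = rng.normal(size=n)                       # unobservable that drives selection
e = 0.8 * v + rng.normal(size=n)             # outcome error, correlated with v
d = (0.5 + 1.0 * z + v > 0).astype(float)    # treatment, selected on v
y = 2.0 + true_effect * d + e                # outcome equation

def ols(X, y):
    """Least-squares coefficients for y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)

# Naive OLS of y on d, ignoring selection.
b_ols = ols(np.column_stack([const, d]), y)[1]

# First stage: linear projection of treatment on the instrument.
Z = np.column_stack([const, z])
d_hat = Z @ ols(Z, d)

# Second stage: regress the outcome on the fitted treatment.
b_2sls = ols(np.column_stack([const, d_hat]), y)[1]

print("naive OLS estimate:", b_ols)
print("2SLS estimate:     ", b_2sls)
```

Running the second stage by hand like this gives the 2SLS point estimate, but the standard errors from that second regression would be wrong; in practice a dedicated IV routine handles that correction.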
There is a difference, of course. Whereas 2SLS will just use the exogenized version of your treatment variable as a regressor of interest, the heckit will transform said treatment variable through something called the inverse Mills ratio (IMR), i.e., “the ratio of the probability density function to the cumulative distribution function of a distribution,” per Wiki. But this imposes quite a bit of structure on the estimation problem (and it makes a distributional assumption, usually a Gaussian one), which is unnecessary. Not only is it unnecessary, it can lead to identification that rests solely on the specific functional form assumed (i.e., the IMR) rather than on the excluded variable. All of this leads to a clear case for estimating a relatively simple linear setup like 2SLS rather than a heckit.
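As a small illustration of the definition quoted above, here is a sketch of the IMR under the usual Gaussian assumption, i.e., the standard normal density divided by the standard normal CDF. The function names are my own, and the grid of index values at the bottom is arbitrary.

```python
import math

def norm_pdf(z: float) -> float:
    """Standard normal probability density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inverse_mills_ratio(z: float) -> float:
    """Ratio of the standard normal pdf to its cdf, evaluated at z.

    Note: for very negative z the cdf underflows to zero and this
    simple version raises ZeroDivisionError.
    """
    return norm_pdf(z) / norm_cdf(z)

# Evaluate the IMR on an arbitrary grid of first-stage index values.
for index in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(index, inverse_mills_ratio(index))
```

The IMR is a smooth, decreasing function of the index and is fairly close to linear over part of its range, which is one reason a correction term built from it, absent a good excluded variable, leans so heavily on the assumed functional form.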
* And this is in those cases where the heckit setup follows that of 2SLS, viz. cases where the variable of interest is instrumented by your IV and other variables serve as instruments for themselves, or cases where the controls are exactly the same across both equations. The heckit accommodates cases where the variables on the RHS are not the same across the two equations, which seems to me like it can lead to serious data mining.
** This is assuming, of course, that you are interested in the effect of treatment itself, and not in controlling for treatment while studying the effect of some other variable.