# A Rant on Estimation with Binary Dependent Variables (Technical)

Suppose you are trying to explain some outcome $y$, where $y$ is equal to 0 or 1 (e.g., whether someone is a nonsmoker or a smoker). You also have data on a vector of explanatory variables $x$ (e.g., someone’s age, their gender, their level of education, etc.) and on a treatment variable $D$, which we will also assume is binary, so that $D$ is equal to 0 or 1 (e.g., whether someone has attended an information session on the negative effects of smoking).

Suppose you are interested in knowing the effect of attending the information session on the likelihood that someone is a smoker, i.e., the impact of $D$ on $y$. The equation of interest in this case is

(1) $y = \alpha + \beta x + \gamma D + \epsilon$,

where $\alpha$ is a constant, $\beta$ is a vector of the coefficients attached to the explanatory variables, $\gamma$ is the parameter of interest, and $\epsilon$ is the error term.

This post is about why, in most cases, you should be estimating equation (1) by ordinary least squares, i.e., estimating a linear probability model (LPM). I have heard and read so many arguments against the LPM and for the probit or logit that I wanted to write something on this.
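To fix ideas, here is a minimal sketch of estimating equation (1) by OLS on simulated data. Everything here — the sample size, the coefficient values, the data-generating process — is an illustrative assumption, not an estimate from any real smoking dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

x = rng.normal(size=n)                         # one explanatory variable
D = rng.integers(0, 2, size=n)                 # randomly assigned binary treatment
p = np.clip(0.4 + 0.05 * x - 0.15 * D, 0, 1)   # true P(y = 1), so gamma = -0.15
y = rng.binomial(1, p)                         # binary outcome

# Linear probability model: regress y on a constant, x, and D by OLS.
X = np.column_stack([np.ones(n), x, D])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha_hat, beta_hat, gamma_hat = coef
print(f"gamma_hat = {gamma_hat:.3f}")          # close to the true -0.15
```

With $D$ randomly assigned as it is here, the OLS coefficient on $D$ recovers the treatment effect on the probability that $y = 1$.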

### Arguments Typically Made against the LPM

The arguments typically made against the LPM are:

1. The error term of a binary variable has a Bernoulli structure, i.e., $Var(\epsilon_i)=p_i(1-p_i)$, where $p_i=Pr(y_i=1)$. This non-constant variance of the error term means that you have a heteroskedasticity problem and the LPM standard errors will be wrong.
2. The LPM can yield values of $\hat{y}_i$, i.e., predicted values of the dependent variable, outside of the $[0,1]$ interval. In other words, the LPM can yield predicted probabilities that are negative or greater than 100%.
3. The LPM imposes linearity on the relationship between the dependent variable and the right-hand side variables.
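Argument 2 is easy to see in a small simulation. Below, the true relationship is logistic and steep in $x$ (all numbers are illustrative assumptions), and the fitted LPM duly returns "probabilities" below 0 and above 1 for extreme values of $x$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

x = rng.normal(size=n)
p = 1 / (1 + np.exp(-2 * x))     # true P(y = 1) is logistic, steep in x
y = rng.binomial(1, p)

# Fit the LPM by OLS and inspect the range of the fitted values.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef

print(f"min fitted value: {y_hat.min():.3f}")  # below 0
print(f"max fitted value: {y_hat.max():.3f}")  # above 1
```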

For these reasons, many will dismiss LPM estimates as wrong. I respond:

1. With robust standard errors, the standard errors are correct, and it is very easy to implement robust standard errors in most statistical packages. Indeed, in the package that I use the most, it is simply a matter of adding “, robust” at the end of my estimation command.
2. This is only a concern if your reason for estimating equation (1) is to forecast probabilities. For most readers of this blog, that will not be why they are estimating equation (1). Rather, they will be interested in knowing the precise value of $\gamma$.
3. Sure, but who says an assumed nonlinear relationship is much better? On their Mostly Harmless Econometrics blog, Angrist and Pischke write:

If the conditional expectation function (CEF) is linear, as it is for a saturated model, regression gives the CEF – even for LPM. If the CEF is non-linear, regression approximates the CEF. Usually it does it pretty well. Obviously, the LPM won’t give the true marginal effects from the right nonlinear model. But then, the same is true for the “wrong” nonlinear model! The fact that we have a probit, a logit, and the LPM is just a statement to the fact that we don’t know what the “right” model is. Hence, there is a lot to be said for sticking to a linear regression function as compared to a fairly arbitrary choice of a non-linear one! Nonlinearity per se is a red herring.
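On point 1 above, here is a sketch of what “, robust” computes under the hood: the heteroskedasticity-consistent (HC1) “sandwich” variance estimator, compared with the classical OLS variance that wrongly assumes $Var(\epsilon_i)$ is constant. The data are simulated to mirror equation (1), and all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

x = rng.normal(size=n)
D = rng.integers(0, 2, size=n)
p = np.clip(0.4 + 0.05 * x - 0.15 * D, 0, 1)
y = rng.binomial(1, p)

X = np.column_stack([np.ones(n), x, D])
k = X.shape[1]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ coef                               # OLS residuals

XtX_inv = np.linalg.inv(X.T @ X)

# Classical OLS variance: assumes Var(e_i) = sigma^2 for all i.
s2 = e @ e / (n - k)
se_classical = np.sqrt(np.diag(s2 * XtX_inv))

# HC1 sandwich: the "meat" uses each squared residual, so no
# constant-variance assumption is needed.
meat = (X * e[:, None] ** 2).T @ X
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv) * n / (n - k))

print(f"SE(gamma): classical {se_classical[2]:.4f}, robust {se_robust[2]:.4f}")
```

The robust and classical standard errors need not differ much in any given sample, but only the robust ones remain valid under the Bernoulli error structure in argument 1.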

### Arguments One Can Make Against the Probit or Logit

People who dismiss the LPM, usually by invoking the arguments above, usually argue in favor of estimating a probit or a logit instead. Here are some arguments one can make against the probit or logit:

1. Both the probit and the logit can lead to identification by functional form. If you are interested in identifying the causal relationship flowing from $D$ to $y$, i.e., in precisely estimating $\gamma$, you want to avoid this.
2. The probit and logit are not well-suited to the use of fixed effects because of the incidental parameters problem.

So when should you use an LPM, and when should you use a probit or a logit? If you have experimental data, i.e., if values of $D$ were randomly assigned, there is no harm in estimating a probit or a logit — your estimate of $\gamma$ is cleanly identified because of the random assignment. If you want to forecast the likelihood that something will happen, estimate a probit or a logit.

But if you are interested in estimating the causal impact of $D$ on $y$ and have any reason to believe that your identification is less than clean, if you want to use fixed effects, and if you are not interested in forecasting the value of $y$, you should prefer the LPM with robust standard errors.

### Conclusion

I have made the points above several times over the last few years, in conversations with colleagues, when advising students, in referee reports, etc. But every once in a while, I will get admonished by an anonymous reviewer for my use of the LPM, and so I wanted to write something about it.

Ultimately, I think the preference for one or the other is largely generational, with people who went to graduate school prior to the Credibility Revolution preferring the probit or logit to the LPM, and with people who went to graduate school during or after the Credibility Revolution preferring the LPM.

As always, the right way to approach things is probably to estimate all three if possible, to present your preferred specification, and to explain in a footnote (or show in an appendix) that your results are robust to the choice of estimator.
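That robustness check can be sketched as follows: estimate $\gamma$ by LPM, then fit a logit (here by a hand-rolled Newton-Raphson, since the point is the comparison rather than any particular package) and compute the logit's average marginal effect of $D$. The data are simulated and all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

x = rng.normal(size=n)
D = rng.integers(0, 2, size=n)
p = np.clip(0.4 + 0.05 * x - 0.15 * D, 0, 1)
y = rng.binomial(1, p)
X = np.column_stack([np.ones(n), x, D])

# LPM: gamma is read straight off the OLS coefficient on D.
gamma_lpm = np.linalg.lstsq(X, y, rcond=None)[0][2]

# Logit fit by Newton-Raphson on the log-likelihood.
b = np.zeros(X.shape[1])
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ b))              # predicted probabilities
    W = mu * (1 - mu)                          # logit weights
    step = np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - mu))
    b += step
    if np.max(np.abs(step)) < 1e-10:
        break

# Average marginal effect of the binary D: the mean difference in predicted
# probabilities with D switched on versus off for every observation.
X1, X0 = X.copy(), X.copy()
X1[:, 2], X0[:, 2] = 1, 0
ame = np.mean(1 / (1 + np.exp(-X1 @ b)) - 1 / (1 + np.exp(-X0 @ b)))

print(f"LPM gamma: {gamma_lpm:.3f}, logit AME of D: {ame:.3f}")
```

In well-behaved samples like this one, the two numbers land close together, which is exactly the kind of robustness one would report in the footnote or appendix.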


1. J Wells

Is there not a selection effect present here, where D is also affected by X (or Y)? Surely there are demographics that are more likely than others to attend an informative session.

2. Marc F. Bellemare

Yes, of course, that is almost always a problem with observational data. That’s why I noted that if D is exogenous to y, then one can estimate a probit. I’ve seen some people on Twitter mention that if D is randomly assigned, then you don’t need a regression. That is not completely true, since even with a randomized D, running a regression helps increase the precision of your estimate of gamma.

3. Matt

As a counter-counter-argument to #2, robust standard errors cannot make up for a misspecified model (King and Roberts 2012):
“We show that settling for a misspecified model (even with robust standard errors) can be a big mistake, in that all but a few quantities of interest will be impossible to estimate (or simulate) from the model without bias. We suggest a different practice: Recognize that differences between robust and classical standard errors are like canaries in the coal mine, providing clear indications that your model is misspecified and your inferences are likely biased.”
http://gking.harvard.edu/publications/how-robust-standard-errors-expose-methodological-problems-they-do-not-fix

4. Conner Mullally

The King and Roberts results are more relevant for cases where identification of all parameters of interest requires that we have the correct model, e.g., forecasting probabilities. This isn’t the case when we are looking at binary treatment assignment and are interested in estimating average treatment effects. You just need the expectation of the error term to be the same in the treatment and control groups.
King and Roberts more or less make this point themselves on page 3 of their paper.
