One of the advantages of having really smart colleagues — the kind who exhibit genuine intellectual curiosity, and who are truly interested in doing things well — is that you get to learn a lot from them.
I was recently having a conversation with my colleague and next-door office neighbor Joe Ritter in which we were discussing the possibility that the (binary) treatment variable in a paper I am working on might suffer from some misclassification. That is, my variable D = 1 if an individual has received the treatment and D = 0 otherwise, but it is possible that some people who actually received the treatment report D = 0, and that some people who did not receive it report D = 1.
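To make the concern concrete, here is a minimal simulation sketch of what misclassification can do to an estimated treatment effect. Everything in it is an assumption chosen for illustration and not taken from the paper I am working on: the continuous outcome, the true effect of 1.0, the 10 percent misreporting rates in each direction, and the helper names `ols_slope` and `simulate`.

```python
import numpy as np

def ols_slope(d, y):
    """Slope from a simple regression of y on a constant and d."""
    d = np.asarray(d, dtype=float)
    return np.cov(d, y, ddof=1)[0, 1] / np.var(d, ddof=1)

def simulate(n=200_000, beta=1.0, p_miss=0.1, p_false=0.1, seed=0):
    """Compare the slope on true D with the slope on misreported D.

    p_miss:  assumed probability that a treated person reports D = 0
    p_false: assumed probability that an untreated person reports D = 1
    """
    rng = np.random.default_rng(seed)
    d_true = rng.integers(0, 2, size=n)            # true treatment status
    y = 0.5 + beta * d_true + rng.normal(size=n)   # outcome depends on true D
    u = rng.random(n)
    # Decide who misreports, with a rate that depends on true status
    flip = np.where(d_true == 1, u < p_miss, u < p_false)
    d_reported = np.where(flip, 1 - d_true, d_true)
    return ols_slope(d_true, y), ols_slope(d_reported, y)

slope_true, slope_reported = simulate()
print(f"slope using true D:     {slope_true:.3f}")
print(f"slope using reported D: {slope_reported:.3f}")
```

In this setup, where misreporting is unrelated to the outcome, the slope on reported D is pulled toward zero relative to the slope on true D. Whether that carries over to a given application depends on how the misclassification actually arises.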
When the possibility that my treatment variable might suffer from misclassification (or measurement error) arose, Joe recalled that he’d read a paper by Christopher R. Bollinger about this a while back. A few hours later, he sent me an email to which he’d attached the paper. Here is the abstract:
A Rant on Estimation with Binary Dependent Variables (Technical)
Suppose you are trying to explain some outcome [math]y[/math], where [math]y[/math] is equal to 0 or 1 (e.g., whether someone is a nonsmoker or a smoker). You also have data on a vector of explanatory variables [math]x[/math] (e.g., someone’s age, their gender, their level of education, etc.) and on a treatment variable [math]D[/math], which we will also assume is binary, so that [math]D[/math] is equal to 0 or 1 (e.g., whether someone has attended an information session on the negative effects of smoking).
Suppose you are interested in the effect of attending the information session on the likelihood that someone is a smoker, i.e., the impact of [math]D[/math] on [math]y[/math]. The equation of interest in this case is