Last updated on March 6, 2016
Picking up where I left off at the end of last week's 'Metrics Monday' post, I wanted to continue discussing the interpretation of coefficients this week.
Specifically, I wanted to discuss the interpretation of coefficients on dummy variables in semi-logarithmic equations. What’s a semi-logarithmic equation? It’s an equation of the form
[math]\ln{y} = \alpha + \beta{D} + \gamma{x} + \epsilon[/math],*
where [math]y[/math] is the dependent variable, [math]D[/math] is a binary (i.e., zero or one) treatment variable, [math]x[/math] is a vector of control variables, and [math]\epsilon[/math] is an error term whose mean is equal to zero. To take a classic example, [math]y[/math] could be an individual’s wage, [math]D[/math] a variable equal to one if they have a college degree and equal to zero otherwise, and [math]x[/math] their age, gender, etc. The equation above is called “semi-log” because we take the logarithm of only one side of the equation.
A log-log equation would regress the logarithm of [math]y[/math] on the logarithm of the right-hand-side variable of interest, in which case the estimated coefficient is directly interpretable as an elasticity, i.e., the percentage change in [math]y[/math] for a 1% increase in that variable. It is unfortunately not possible to take the log of a binary treatment like [math]D[/math] above,** because the log of zero is undefined.
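To make the elasticity interpretation concrete, suppose (the 0.4 here is a made-up value, purely for illustration) that we estimate [math]\ln{y} = \alpha + \beta\ln{x} + \epsilon[/math] and obtain [math]\hat{\beta} = 0.4[/math]. Since [math]\frac{d\ln{y}}{d\ln{x}} = \frac{dy/y}{dx/x} = \beta[/math], a 1% increase in [math]x[/math] is associated with an approximate 0.4% increase in [math]y[/math].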
As an aside, if you are interested in the question of why we log some variables (e.g., wages, prices, incomes) but not others (e.g., age, years of education, etc.), see this discussion.
Perhaps because of the foregoing, a common mistake in interpreting [math]\beta[/math] in the equation above is to treat it as a percentage. That is, to claim that [math]\hat{\beta}[/math] tells us by how much [math]y[/math] changes in percentage terms when an observation goes from untreated to treated, i.e., when [math]D[/math] goes from zero to one.
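To see why that reading is only an approximation, hold [math]x[/math] and [math]\epsilon[/math] fixed and let [math]D[/math] go from zero to one. Then

[math]\ln{y_1} - \ln{y_0} = \beta \implies \frac{y_1 - y_0}{y_0} = e^{\beta} - 1[/math],

which is the transformation Halvorsen and Palmquist proposed. For small coefficients the two are close, since [math]e^{\beta} - 1 \approx \beta[/math] when [math]\beta[/math] is near zero, but the gap grows quickly: a coefficient of 0.30 (a made-up value for illustration) implies a percentage effect of [math]e^{0.30} - 1 \approx 0.35[/math], i.e., 35%, not 30%.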
In what is perhaps the shortest paper I have ever read, however, Kennedy (1981), correcting a mistake in an earlier paper by Halvorsen and Palmquist (1980), derived a formula that gives the effect of the treatment in percentage terms:
[math]{\hat{g}} = \exp[\hat{\beta} - \frac{1}{2}\hat{V}(\hat{\beta})] - 1[/math],
where [math]{\hat{g}}[/math] is, in Kennedy's words, "the percentage impact of the dummy variable on the variable being explained."
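The intuition for the variance term is Jensen's inequality: even if [math]\hat{\beta}[/math] is unbiased for [math]\beta[/math], [math]\exp(\hat{\beta})[/math] is biased upward for [math]\exp(\beta)[/math] because the exponential is convex. Under the usual normality assumption for [math]\hat{\beta}[/math],

[math]E[\exp(\hat{\beta})] = \exp[\beta + \frac{1}{2}V(\hat{\beta})][/math],

which is the mean of a lognormal, so subtracting half the estimated variance inside the exponential (approximately) undoes that upward bias.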
I thought this was reasonably well-known, but I still review too many papers that directly interpret the coefficient on a dummy in a semi-log equation as the percentage change in [math]y[/math].
As with so many other things I talk about in this series, Dave Giles had a nice long post on this (and other related topics) five years ago. In his post, Dave links to two papers of his: a 1982 paper where he corrects a slight mistake in Kennedy’s analysis, and a 2011 paper where he discusses exact distributional results for coefficients in semi-log equations. (Among other things, one interesting point Dave’s post makes is that when you have a log on the left-hand side, discrete changes in one of the explanatory variables will have asymmetric effects.)
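To see the asymmetry point with made-up numbers: with [math]\hat{\beta} = 0.30[/math], going from [math]D = 0[/math] to [math]D = 1[/math] implies [math]e^{0.30} - 1 \approx 0.35[/math], or +35%, while going from [math]D = 1[/math] to [math]D = 0[/math] implies [math]e^{-0.30} - 1 \approx -0.26[/math], or -26%; switching the dummy off is not the mirror image of switching it on.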
Kennedy’s [math]\hat{g}[/math] is easily implementable in Stata as follows:
. reg y D x
. nlcom exp(_b[D]-0.5*((_se[D])^2))-1
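If you want to see this at work end to end, here is a minimal simulated sketch; the seed, parameter values, and variable names are made up purely for illustration. With a true coefficient of 0.3 on [math]D[/math], the naive reading of the coefficient says 30%, whereas the true percentage effect is [math]e^{0.3} - 1 \approx 0.35[/math], or about 35%:

. * simulate data where the true coefficient on D is 0.3
. clear
. set seed 123
. set obs 1000
. gen D = runiform() > 0.5
. gen x = rnormal()
. gen lny = 1 + 0.3*D + 0.5*x + rnormal(0, 0.5)
. reg lny D x
. * Kennedy's g-hat; should land close to 0.35
. nlcom exp(_b[D] - 0.5*(_se[D]^2)) - 1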
* Note the spiffy use of TeX in this post. After someone on Reddit noted that the math was hard to understand in my last post, I decided to up my blog notation game.
** A common alternative to the log is the inverse hyperbolic sine (IHS) transformation, which behaves like a log but accommodates zero and negative values. In my 2013 AJAE article with Barrett and Just on the welfare impacts of price volatility, we used the IHS transformation extensively because our dependent variables, the net sales of each crop, can take positive, zero, or negative values. The IHS is now much more common and acceptable than adding 1, 0.01, etc. to your variable of interest.
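For what it's worth, the IHS is easy to implement by hand, and recent versions of Stata also ship a built-in asinh() function; the variable name below is a placeholder, not one from our paper:

. gen y_ihs = ln(net_sales + sqrt(net_sales^2 + 1))
. * or, equivalently, with the built-in function:
. gen y_ihs2 = asinh(net_sales)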