

‘Metrics Monday: Learning Machine Learning

A long time ago I promised myself that I would not become one of those professors who gets too comfortable knowing what he already knows. This means that I do my best to stay up to date on recent developments in applied econometrics.

So my incentives in writing this series of posts aren't entirely selfless: Because good (?) writing is clear thinking made visible, doing so helps me better understand econometrics and keep up with recent developments in the field.

By “applied econometrics,” I mean applied econometrics almost exclusively of the causal-inference-with-observational-data variety. I haven’t really thought about time-series econometrics since the last time I took a doctoral-level class on the subject in 2000, but that’s mostly because I don’t foresee doing anything involving those methods in the future.

One thing that I don’t necessarily foresee using but that I really don’t want to ignore, however, is machine learning (ML), especially since ML methods are now being combined with causal inference techniques. So having been nudged by Dave Giles’ post on the topic earlier this week, I figured 2019 would be a good time–my only teaching this spring is our department’s second-year paper seminar, and I’m on sabbatical in the fall, so it really is now or never.

I’m not a theorem prover, so I really needed a gentle, intuitive introduction to the topic. Luckily, my friend and office neighbor Steve Miller also happens to teach our PhD program’s ML class and to do some work in this area (see his forthcoming Economics Letters article on FGLS using ML, for instance), and he recommended just what I needed: Introduction to Statistical Learning, by James et al. (2013).

The cool thing about James et al. is that it also provides an introduction to R for newbies. Since I am exactly such a neophyte, going through this book will provide a double learning dividend for me. Even better is the fact that the book is available for free on the companion website, which features R code, data sets, and so on.

I’m only in chapter 2, but I have already learned some new things. Most of those things have to do with new terminology (e.g., supervised learning wherein you have a dependent variable, vs. unsupervised learning, wherein there is no such thing as a dependent variable), but here is one thing that was new to me: The idea that there is a tradeoff between flexibility and interpretability.
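For what it's worth, here is a toy way of seeing that distinction in Stata (my own sketch using Stata's built-in auto data, not an example from the book): a regression is supervised in that there is a dependent variable to predict, whereas k-means clustering is unsupervised in that it just looks for groups in the data, with no dependent variable in sight.

* Toy illustration (mine, not from James et al.), using Stata's built-in auto data.
sysuse auto, clear

* Supervised learning: there is a dependent variable (price) to predict.
regress price mpg weight

* Unsupervised learning: no dependent variable; just look for structure.
cluster kmeans mpg weight, k(3) name(groups)
tabulate groups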

(Figure: the tradeoff between flexibility and interpretability. Source: James et al., 2013, Introduction to Statistical Learning, Springer.)

Specifically, what this tradeoff says is this: The more flexible your estimation method gets, the less interpretable it is. OLS, for instance, is relatively inflexible: It imposes a linear relationship between Y and X, which is rather restrictive. But it is also rather easy to interpret, since the coefficient on X is an estimate of the change in Y associated with a one-unit increase in X. And so in the figure above, OLS tends to be low on flexibility, but high on interpretability.

Conversely, really flexible methods–those that are very good at accounting for the specific features of the data–tend to be harder to interpret. Think, for instance, of kernel density estimation. You get a nice graph approximating the distribution of the variable you're interested in, whose smoothness depends on the specific bandwidth you choose, but that's it: You only get a graph, and there is little in the way of interpretation to be provided beyond "Look at this graph."
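To make the tradeoff concrete, here is a quick Stata sketch (again my own toy example, using the built-in auto data, not something from the book): the OLS coefficient has a ready "one-unit increase in X" interpretation, while the kernel density estimate is just a picture whose shape changes with the bandwidth you pick.

* Toy illustration of the flexibility-interpretability tradeoff (my own sketch).
sysuse auto, clear

* Inflexible but interpretable: the coefficient on mpg is the estimated change
* in price associated with a one-unit increase in mpg.
regress price mpg

* Flexible but hard to interpret: the estimated density of price is just a
* graph, and its smoothness depends on the bandwidth you choose.
kdensity price
kdensity price, bwidth(500)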

Bonus: In the course of all the reading I've done this week, I also came across the following joke (apologies for forgetting where I saw it):

Q: What’s a data scientist?

A: A statistician who lives in San Francisco.

Read Bad Papers

Last summer, Advice to Writers, one of my favorite blogs, had a post titled “Read Bad Stuff.” Given that Advice to Writers posts are usually very short, I reproduce the post here in full:

If you are going to learn from other writers don’t only read the great ones, because if you do that you’ll get so filled with despair and the fear that you’ll never be able to do anywhere near as well as they did that you’ll stop writing. I recommend that you read a lot of bad stuff, too. It’s very encouraging. “Hey, I can do so much better than this.” Read the greatest stuff but read the stuff that isn’t so great, too. Great stuff is very discouraging. — Edward Albee.

This applies to many other areas of life, and academic research is no exception.

Over the years, I have found that besides learning by doing (i.e., writing your own papers), one of the best ways to improve as a researcher is to learn from others. Obviously, this means that you should read good papers–but not good papers exclusively.

The issue, as I see it, is that doctoral courses tend to have students read only the very best papers on any given topic. At best, a doctoral course will have students referee current working papers as an assignment, but even then, those working papers tend to be selected from the output of researchers who produce high-quality work.

If you were interested in knowing what makes some people poor and others not, you would need to sample both poor people and people who aren’t poor. Likewise, if you are interested in knowing what makes a piece of research good and another one not as good, it helps to read widely, and to make some time for reading bad papers. For most people, this comes in the form of refereeing, especially early on in their career.

(When I started out, a journal editor told me that “like referees like,” and I’ve found that to be true. That is, early-career researchers often review the work of other early-career researchers, and senior researchers often review the work of other senior researchers. So if you have ever asked yourself “When will I get better papers to referee?,” the answer is generally “Just wait,” assuming of course that the quality of academic output increases with time spent in a discipline.)

Many scholars–economists, in particular–see refereeing as an unfortunate tax they need to pay in order to get their own papers reviewed and published. Unlike a tax, however, there is almost always something to be learned from refereeing, and from refereeing bad papers in particular.

‘Metrics Monday: Front-Door Criterion Follow-Up (Updated)

My last post in this series, on how to use Pearl’s front-door criterion in a regression context, generated lots of page views as well as lots of commentary on Twitter–enough so that I thought a follow-up post might be useful.

Recall that with outcome Y, treatment X, mechanism M, and an unobserved confounder U affecting both Y and X but not M, the method I outlined in the post is pretty simple:

  1. Regress M on X, get b_{MX}, the coefficient on X.
  2. Regress Y on X and M, get b_{YM}, the coefficient on M.
  3. The product of b_{MX} and b_{YM} is the effect of treatment X on outcome Y estimated by the front-door criterion (see the sketch right below for why this works).
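To see why the product of those two coefficients is the front-door estimate, here is a back-of-the-envelope version of the argument in a linear setting (a sketch under the linear structural equations below, not a general proof, and abusing notation slightly by using b for both population coefficients and their estimates). Suppose

M = b_{MX} X + e_M and Y = b_{YM} M + g U + e_Y,

where g is the coefficient on the unobserved confounder U, and where U does not enter the M equation. Because U does not affect M, regressing M on X recovers b_{MX}. And because, conditional on X, the remaining variation in M comes only from e_M (which is unrelated to U), regressing Y on X and M recovers b_{YM}. Substituting the first equation into the second gives

Y = b_{YM} b_{MX} X + b_{YM} e_M + g U + e_Y,

so the effect of X on Y that operates through M is the product of b_{MX} and b_{YM}.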

One of the things that came up on Twitter was whether someone should use the procedure outlined above, or do the following instead:

  1. Regress M on X, get \hat{e}, the residual.
  2. Regress Y on \hat{e}, get b_{Ye}, the coefficient on \hat{e}.
  3. Regress M on X, get b_{MX}, the coefficient on X.
  4. The product of b_{Ye} and b_{MX} is the effect of treatment X on outcome Y estimated by the front-door criterion.

Note that the two methods yield the exact same treatment effect: the equivalence is an application of the Frisch-Waugh-Lovell theorem, since the coefficient on M in a regression of Y on M and X is identical to the coefficient from a regression of Y on the residual from regressing M on X. Here is a Kerwinian proof by Stata:

clear
drop _all
set obs 1000

* Simulate data consistent with the front-door setup: u confounds treat and
* outcome but does not affect mech, and treat affects outcome only through mech.
set seed 123456789
gen u = rnormal(0,1)
gen treat = u + rnormal(0,1)
gen mech = -0.3 * treat + rnormal(0,1)
gen outcome = 0.5 * mech + u + rnormal(0,1)

* Method 1: estimate both equations jointly by seemingly unrelated regression,
* then compute the product of the two relevant coefficients.
sureg (mech treat) (outcome mech treat)
nlcom [mech]_b[treat]*[outcome]_b[mech]

* Method 2: the residual-based procedure described above.
reg mech treat
predict e, resid

reg outcome e
matrix a = _b[e]
reg mech treat
matrix b = _b[treat]
matrix c = a*b

* This product matches the -nlcom- estimate from Method 1.
matrix list c

Note how the estimates obtained from the line that begins with -nlcom- and from the line that begins with -matrix list- are identical. In terms of implementation, I prefer the (somewhat old-fashioned, I realize) use of seemingly unrelated regression, since it allows the error terms to be correlated across the two component regressions.

To reiterate what I talked about at the end of my last post: Caveat emptor. When it comes to observational data, rare is the scenario where one can claim that the mechanism M through which an endogenous treatment X affects outcome Y is entirely unaffected by the unobserved confounders U that simultaneously affect treatment X and outcome Y. So this post and the previous one are really meant to illustrate something that might work in some rare situations rather than to encourage applying the front-door criterion unthinkingly as a means of identifying a causal relationship on the cheap. In this as in so many things, TINSTAAFL.

On Twitter, Daniel Millimet adds: