

‘Metrics Monday: Learning Machine Learning

A long time ago I promised myself that I would not become one of those professors who gets too comfortable knowing what he already knows. This means that I do my best to stay up to date on recent developments in applied econometrics.

So my incentives in writing this series of posts aren't entirely selfless: Because good (?) writing is clear thinking made visible, doing so helps me better understand econometrics and keep up with recent developments in the field.

By “applied econometrics,” I mean applied econometrics almost exclusively of the causal-inference-with-observational-data variety. I haven’t really thought about time-series econometrics since the last time I took a doctoral-level class on the subject in 2000, but that’s mostly because I don’t foresee doing anything involving those methods in the future.

One thing that I don’t necessarily foresee using but that I really don’t want to ignore, however, is machine learning (ML), especially since ML methods are now being combined with causal inference techniques. So having been nudged by Dave Giles’ post on the topic earlier this week, I figured 2019 would be a good time–my only teaching this spring is our department’s second-year paper seminar, and I’m on sabbatical in the fall, so it really is now or never.

I’m not a theorem prover, so I really needed a gentle, intuitive introduction to the topic. Luckily, my friend and office neighbor Steve Miller also happens to teach our PhD program’s ML class and to do some work in this area (see his forthcoming Economics Letters article on FGLS using ML, for instance), and he recommended just what I needed: Introduction to Statistical Learning, by James et al. (2013).

The cool thing about James et al. is that it also provides an introduction to R for newbies. Since I am exactly such a neophyte, going through this book will provide a double learning dividend for me. Even better is the fact that the book is available for free on the companion website, which features R code, data sets, and so on.

I’m only in chapter 2, but I have already learned some new things. Most of those things have to do with new terminology (e.g., supervised learning wherein you have a dependent variable, vs. unsupervised learning, wherein there is no such thing as a dependent variable), but here is one thing that was new to me: The idea that there is a tradeoff between flexibility and interpretability.
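For what it's worth, here is a toy way of seeing that distinction in Stata (my own sketch using Stata's built-in auto data, not an example from the book): a regression is supervised in that there is a dependent variable to predict, whereas k-means clustering is unsupervised in that it just looks for groups in the data, with no dependent variable in sight.

* Toy illustration (mine, not from James et al.), using Stata's built-in auto data.
sysuse auto, clear

* Supervised learning: there is a dependent variable (price) to predict.
regress price mpg weight

* Unsupervised learning: no dependent variable; just look for structure.
cluster kmeans mpg weight, k(3) name(groups)
tabulate groups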

(Figure: the tradeoff between flexibility and interpretability. Source: James et al., 2013, Introduction to Statistical Learning, Springer.)

Specifically, what this tradeoff says is this: The more flexible your estimation method gets, the less interpretable it is. OLS, for instance, is relatively inflexible: It imposes a linear relationship between Y and X, which is rather restrictive. But it is also rather easy to interpret, since the coefficient on X is an estimate of the change in Y associated with a one-unit increase in X. And so in the figure above, OLS tends to be low on flexibility, but high on interpretability.

Conversely, really flexible methods–those that are very good at accounting for the specific features of the data–tend to be harder to interpret. Think, for instance, of kernel density estimation. You get a nice graph approximating the distribution of the variable you're interested in, whose smoothness depends on the specific bandwidth you choose, but that's it: You only get a graph, and there is little in the way of interpretation to be provided beyond "Look at this graph."
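To make the tradeoff concrete, here is a quick Stata sketch (again my own toy example, using the built-in auto data, not something from the book): the OLS coefficient has a ready "one-unit increase in X" interpretation, while the kernel density estimate is just a picture whose shape changes with the bandwidth you pick.

* Toy illustration of the flexibility-interpretability tradeoff (my own sketch).
sysuse auto, clear

* Inflexible but interpretable: the coefficient on mpg is the estimated change
* in price associated with a one-unit increase in mpg.
regress price mpg

* Flexible but hard to interpret: the estimated density of price is just a
* graph, and its smoothness depends on the bandwidth you choose.
kdensity price
kdensity price, bwidth(500)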

Bonus: In the course of all the reading I've done this week, I also came across the following joke (apologies for forgetting where I saw it):

Q: What’s a data scientist?

A: A statistician who lives in San Francisco.

Read Bad Papers

Last summer, Advice to Writers, one of my favorite blogs, had a post titled “Read Bad Stuff.” Given that Advice to Writers posts are usually very short, I reproduce the post here in full:

If you are going to learn from other writers don’t only read the great ones, because if you do that you’ll get so filled with despair and the fear that you’ll never be able to do anywhere near as well as they did that you’ll stop writing. I recommend that you read a lot of bad stuff, too. It’s very encouraging. “Hey, I can do so much better than this.” Read the greatest stuff but read the stuff that isn’t so great, too. Great stuff is very discouraging. — Edward Albee.

This applies to many other areas of life, and academic research is no exception.

Over the years, I have found that besides learning by doing (i.e., writing your own papers), one of the best ways to improve as a researcher is to learn from others. Obviously, this means that you should read good papers–but not good papers exclusively.

The issue, as I see it, is that doctoral courses tend to have students read only the very best papers on any given topic. At best, a doctoral course will have students referee current working papers as an assignment, but even then, those working papers tend to be selected from the output of researchers who produce high-quality work.

If you were interested in knowing what makes some people poor and others not, you would need to sample both poor people and people who aren’t poor. Likewise, if you are interested in knowing what makes a piece of research good and another one not as good, it helps to read widely, and to make some time for reading bad papers. For most people, this comes in the form of refereeing, especially early on in their career.

(When I started out, a journal editor told me that “like referees like,” and I’ve found that to be true. That is, early-career researchers often review the work of other early-career researchers, and senior researchers often review the work of other senior researchers. So if you have ever asked yourself “When will I get better papers to referee?,” the answer is generally “Just wait,” assuming of course that the quality of academic output increases with time spent in a discipline.)

Many scholars–economists, in particular–see refereeing as an unfortunate tax they need to pay in order to get their own papers reviewed and published. Unlike a tax, however, there is almost always something to be learned from refereeing, and from refereeing bad papers in particular.

‘Metrics Monday: Front-Door Criterion Follow-Up (Updated)

My last post in this series, on how to use Pearl’s front-door criterion in a regression context, generated lots of page views as well as lots of commentary on Twitter–enough so that I thought a follow-up post might be useful.

Recall that with outcome Y, treatment X, mechanism M, and an unobserved confounder U affecting both Y and X but not M, the method I outlined in the post is pretty simple:

  1. Regress M on X, get b_{MX}, the coefficient on X.
  2. Regress Y on X and M, get b_{YM}, the coefficient on M.
  3. The product of b_{MX} and b_{YM} is the effect of treatment X on outcome Y estimated by the front-door criterion (see the sketch right below for why this works).
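To see why the product of those two coefficients is the front-door estimate, here is a back-of-the-envelope version of the argument in a linear setting (a sketch under the linear structural equations below, not a general proof, and abusing notation slightly by using b for both population coefficients and their estimates). Suppose

M = b_{MX} X + e_M and Y = b_{YM} M + g U + e_Y,

where g is the coefficient on the unobserved confounder U, and where U does not enter the M equation. Because U does not affect M, regressing M on X recovers b_{MX}. And because, conditional on X, the remaining variation in M comes only from e_M (which is unrelated to U), regressing Y on X and M recovers b_{YM}. Substituting the first equation into the second gives

Y = b_{YM} b_{MX} X + b_{YM} e_M + g U + e_Y,

so the effect of X on Y that operates through M is the product of b_{MX} and b_{YM}.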

One of the things that came up on Twitter was whether someone should use the procedure outlined above, or do the following instead:

  1. Regress M on X, get \hat{e}, the residual.
  2. Regress Y on \hat{e}, get b_{Ye}, the coefficient on \hat{e}.
  3. Regress M on X, get b_{MX}, the coefficient on X.
  4. The product of b_{Ye} and b_{MX} is the effect of treatment X on outcome Y estimated by the front-door criterion.

Note that the two methods yield the exact same treatment effect: the equivalence is an application of the Frisch-Waugh-Lovell theorem, since the coefficient on M in a regression of Y on M and X is identical to the coefficient from a regression of Y on the residual from regressing M on X. Here is a Kerwinian proof by Stata:

clear
drop _all
set obs 1000

* Simulate data consistent with the front-door setup: u confounds treat and
* outcome but does not affect mech, and treat affects outcome only through mech.
set seed 123456789
gen u = rnormal(0,1)
gen treat = u + rnormal(0,1)
gen mech = -0.3 * treat + rnormal(0,1)
gen outcome = 0.5 * mech + u + rnormal(0,1)

* Method 1: estimate both equations jointly by seemingly unrelated regression,
* then compute the product of the two relevant coefficients.
sureg (mech treat) (outcome mech treat)
nlcom [mech]_b[treat]*[outcome]_b[mech]

* Method 2: the residual-based procedure described above.
reg mech treat
predict e, resid

reg outcome e
matrix a = _b[e]
reg mech treat
matrix b = _b[treat]
matrix c = a*b

* This product matches the -nlcom- estimate from Method 1.
matrix list c

Note how the estimates obtained from the line that begins with -nlcom- and from the line that begins with -matrix list- are identical. In terms of implementation, I prefer the (somewhat old-fashioned, I realize) use of seemingly unrelated regression, since it allows the error terms to be correlated across the two component regressions.

To reiterate what I talked about at the end of my last post: Caveat emptor. When it comes to observational data, rare is the scenario where one can claim that the mechanism M through which an endogenous treatment X affects outcome Y is entirely unaffected by the unobserved confounders U that simultaneously affect treatment X and outcome Y. So this post and the previous one are really meant to illustrate something that might work in some rare situations rather than to encourage applying the front-door criterion unthinkingly as a means of identifying a causal relationship on the cheap. In this as in so many things, TINSTAAFL.

On Twitter, Daniel Millimet adds: