
Marc F. Bellemare Posts

The Art of Research Discovery and Writing Good Articles

Tom Reardon is one of my favorite agricultural economists. Not only is he incredibly productive (he has published over 150 articles and his work has garnered over 27,000 Google Scholar citations), his work also has real-world policy impact (he was the first agricultural economist invited to the World Economic Forum in Davos). Over the years, Tom has been a wonderful mentor, and he has become a very good friend.

To know Tom is to love him, and if you know Tom well, you know that he has laser-like focus when it comes to his research, but that it can be hard to get him to focus on anything other than whatever he is currently writing. Over dinner, he is likely to go from discussing the etymology of an obscure French word to how e-commerce is disrupting food systems to how he has been struggling to make good brisket sous vide… all within five minutes!

So I was particularly happy to receive an email from Tom earlier this week in which he linked to a talk where he spends 75 minutes on the art of research discovery and writing good articles. If you are a researcher, whether early-career or seasoned, this is one of those rare occasions where a master craftsman takes the time to generously share some deep insights into his craft.

‘Metrics Monday: Goodness of Fit with Panel Data in Stata

With panel data, it is not uncommon to present regression results by starting with a pooled ordinary least squares (OLS) regression, then moving on to a specification with fixed effects (FE). If anything, this helps the reader see how important time-invariant unobserved heterogeneity is to your coefficient estimates.

Let y denote your outcome variable, x denote your control variables, and unit denote the unit of observation within which you have variation. If you use Stata, one of the problems that come from using

xtreg y x, fe i(unit)

instead of

reg y x i.unit

is that none of the R-square measures returned by Stata after the former is comparable to the R-square returned by Stata after the latter. This is explained in the “Assessing goodness of fit” section of the xtreg entry in the Stata manual.

What this means in practice is that if you don’t pay attention to what is going on when making tables of results, you often end up with tables where the R-square in your OLS specification is higher than the R-square in your FE specification. But this is impossible: with the same outcome, control variables, and sample, including unit FEs cannot lower the R-square, and it usually raises it (often by a lot), since a higher percentage of the variation in the outcome is explained by variables on the RHS when using FEs.
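A short sketch of why the R-square cannot fall. The notation here (unit intercepts \alpha_i, slope vector \beta, SSE for the sum of squared residuals, SST for the total sum of squares) is introduced for illustration and is not in the original post. Pooled OLS is the FE model with all unit intercepts restricted to be equal, so relaxing that restriction can only lower, or leave unchanged, the sum of squared residuals, while SST stays the same as long as the estimation sample is the same:

```latex
\begin{align*}
\text{FE (dummy-variable form):}\quad & y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it} \\
\text{Pooled OLS:}\quad & y_{it} = \alpha + x_{it}'\beta + \varepsilon_{it}
  \quad (\alpha_i = \alpha \text{ for all } i) \\
\mathit{SST} &= \sum_{i}\sum_{t} (y_{it} - \bar{y})^2 \\
\mathit{SSE}_{FE} \le \mathit{SSE}_{OLS}
  \;&\Longrightarrow\;
  R^2_{FE} = 1 - \frac{\mathit{SSE}_{FE}}{\mathit{SST}}
  \;\ge\; 1 - \frac{\mathit{SSE}_{OLS}}{\mathit{SST}} = R^2_{OLS}
\end{align*}
```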

This isn’t too bad in and of itself, but of course the first time I noticed this was when someone asked me in a seminar: “Why is your R-square going down instead of up when including fixed effects?,” and I had no good answer other than “I’ll have to check and get back to you on this,” which is seminar-speak for “Beats me.”

Here is a simple (if not terribly elegant) workaround I have come up with and have used and reused in papers where I use the xtreg set of commands. After estimating

xtreg y x, fe i(unit)

I add the following lines of code

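* Run immediately after the xtreg command above
egen ybar = mean(y) if e(sample)   // grand mean of y, estimation sample only
gen y2 = (y - ybar)^2              // squared deviation from the grand mean
predict resid, e                   // idiosyncratic residual (net of the unit effect)
gen e2 = resid^2
drop resid
egen sse = total(e2)               // sum of squared residuals
egen sst = total(y2)               // total sum of squares
gen r2 = 1 - sse/sst
sum r2
drop sse sst y2 e2 ybar r2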

The variable r2 is then the “right” (i.e., comparable to OLS) R-square.
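One way to check the workaround is to run it on simulated data and compare the result with the R-square that areg reports when the unit effects are absorbed. What follows is only a sketch under stated assumptions: the data-generating process, the seed, and the names alpha, t, r2_within, and r2_manual are all made up for illustration, and scalars are used in place of the temporary variables above simply to keep the dataset clean.

```stata
* Simulated panel: 50 units, 10 periods each (all values are illustrative)
clear
set seed 12345
set obs 500
gen unit = ceil(_n/10)
bysort unit: gen t = _n
bysort unit: gen double alpha = rnormal() if _n == 1
bysort unit: replace alpha = alpha[1]        // one fixed effect per unit
gen double x = 0.5*alpha + rnormal()         // x is correlated with the unit effect
gen double y = 1 + 2*x + alpha + rnormal()

* Fixed effects via xtreg; keep the within R-square that xtreg reports
xtset unit t
xtreg y x, fe
scalar r2_within = e(r2_w)

* Same calculation as the workaround, using scalars
predict double resid, e                      // idiosyncratic residual
gen double e2 = resid^2
quietly summarize e2
scalar sse = r(sum)
quietly summarize y if e(sample)
scalar sst = r(Var)*(r(N) - 1)               // total sum of squares around the grand mean
scalar r2_manual = 1 - sse/sst

* areg absorbs the unit dummies and reports the R-square including them
areg y x, absorb(unit)
display "xtreg within R-square: " r2_within
display "workaround R-square:   " r2_manual
display "areg R-square:         " e(r2)
```

The last two numbers should agree up to rounding, and both will generally differ from the within R-square that xtreg displays.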

 

‘Metrics Monday: Identification by Functional Form (Updated)

One of the things I often tell students when discussing whether to use linear regression or a more complicated nonlinear (i.e., maximum likelihood-based) procedure is that one advantage of linear regression is that it prevents identification by functional form.

By “identification by functional form,” what I mean is that the distributional or functional form assumptions made in the context of more complicated nonlinear procedures can lead you to estimate a coefficient which is identified purely because of those distributional or functional form assumptions.

I always had a hard time clearly explaining the intuition behind this, until my colleague Arne Henningsen, with whom I co-taught my advanced econometrics class at the University of Copenhagen, gave a really good example to the class. Here is that example.