

‘Metrics Monday: Least Squares Is But One Approach to Linear Regression

One of the things I learned as an undergraduate at Montreal is the equivalence of ordinary least squares (OLS), maximum likelihood (ML), and the generalized method of moments (GMM) when it comes to linear regression.

This is something which I suspect a lot of people have lost track of in the wake of the Credibility Revolution, which has emphasized the use of linear methods.

(For instance, when I taught my causal inference with observational data class last semester, I remember showing my students a likelihood function and asking them whether they had covered maximum likelihood estimation in their first-year courses, drawing a number of blank stares.)

The idea is simple. When estimating the equation

(1) y = bx + e,

one has a choice of estimator, viz. OLS, ML, or GMM.

Intuitively,

  1. OLS picks b so as to minimize the sum of squared residuals,
  2. ML picks b so as to maximize the likelihood that the estimation sample is a random sample from a population of interest, and
  3. GMM picks b by solving what is known as a moment condition which, in the case of OLS, is E(x'e) = 0. That is, it chooses b to solve E(x'(y - xb)) = 0. Note that this simply assumes that the regressors are uncorrelated with the errors, i.e., it is an assumption of exogeneity.

If e is distributed normally, the OLS and ML estimators of b in equation 1 are identical. If the observations are independent and identically distributed (iid), the OLS and GMM estimators of b in equation 1 are equivalent.

Here is a bit of code that offers proof by Stata:

* Linear Regression Three Ways

clear
drop _all
set obs 1000
set seed 123456789
gen x = rnormal(0,1)
gen y = rnormal(5,1) + 10*x + rnormal(0,1)

* Ordinary Least Squares

reg y x

* Maximum Likelihood

capture program drop ols
program ols
  * Linear-form (lf) evaluator: per-observation log likelihood under normality
  args lnf xb lnsigma
  local y "$ML_y1"
  quietly replace `lnf' = ln(normalden(`y', `xb', exp(`lnsigma')))
end

ml model lf ols (xb: y = x) (lnsigma:)
ml maximize

* Generalized Method of Moments
 
* Moment conditions: E[e] = 0 and E[x*e] = 0, where e = y - {alpha} - {beta}*x
gmm (y - x*{beta} - {alpha}), instruments(x) vce(unadjusted)

In that code, I generate a variable x distributed N(0,1), and a variable y equal to a constant distributed N(5,1) added to 10 times variable x plus an error term e distributed N(0,1). Note that, by default, GMM estimates an IV-like setup; with OLS, this collapses to x serving as an instrument for itself.

The OLS, ML, and GMM estimators all yield the same point estimates of 5.014798 for the constant and 9.932382 for the slope coefficient. The standard errors in the example above are identical for the MLE and GMM cases, and they differ only very, very slightly for the OLS case.

A few things to note:

  • MLE is a special case of GMM in which a specific distribution is imposed on the data; in the ML example above, that distribution is the normal distribution.
  • Both GMM and MLE are iterative procedures, meaning that they start from a guess as to the value of b and update that guess until convergence. In contrast, OLS requires no iteration: its closed-form formula directly solves for the value of b that minimizes the sum of squared residuals, as the short Mata sketch below illustrates.
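
For the record, a minimal Mata sketch of that closed-form solution, computed from the x and y generated above, might look like this:

* Closed-form OLS: b = (X'X)^(-1)X'y, computed directly from the data

mata:
  y = st_data(., "y")                    // outcome as a column vector
  X = st_data(., "x"), J(rows(y), 1, 1)  // regressor plus a constant
  invsym(cross(X, X))*cross(X, y)        // no iteration involved
end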

‘Metrics Monday: When In Doubt, Standardize

A few weeks ago, I received an email from one of my PhD students asking me whether I could help him interpret his coefficient estimates, given that his outcome variable was measured using an unfamiliar scale.

I told him to standardize his outcome variable, which would allow interpreting the estimate of coefficient b in the following regression

(1) y = a + bD + e

as follows, assuming a dichotomous treatment D: On average, when a unit goes from untreated (i.e., D = 0) to treated (i.e., D = 1), y increases by b standard deviations.

For example, suppose you are interested in looking at the effect of taking a test-prep class D on someone’s quantitative GRE score, which is scored on a scale of 130 to 170.

Suppose you have data on a number of test takers’ quantitative GRE scores and whether they took a test-prep class. You estimate equation (1) above and get an estimate of a equal to 140 and an estimate of b equal to 10.

(The example in this post is entirely fictitious, for what it’s worth; I have never taken a test-prep class for the GRE nor did I ever estimate anything involving data on GRE scores.)

Suppose further that, like me, you took the test a long time ago, when each section of the GRE was scored on a scale of 200 to 800,* so that the 130-170 scale is really unfamiliar to you. How would you assess whether taking the test-prep class is worth it (assuming, for the sake of argument, that b is identified)?

Standardizing y would go a long way toward helping you, because it would allow expressing the impact of the test-prep class in a familiar quantity. How do you standardize? Simply by taking the variable that is expressed in unfamiliar terms (here, GRE test scores), subtracting its mean from each observation, and dividing the result by the variable’s standard deviation. In other words,

(2) y’ = (y – m)/s,

where y’ is the standardized version of y, m is the mean of y, and s is the standard deviation of y. You would then estimate

(3) y’ = a + bD + e,

where the estimate of b becomes the effect measured in standard deviations of y instead of in points on the test. So if you find that the estimate of b in equation (3) is 0.15, you would conclude that taking the test-prep class leads to an increase of 0.15 standard deviation in one’s quantitative GRE score.
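
In Stata, and using hypothetical variable names y for the GRE score and D for the test-prep indicator, a minimal sketch of the whole procedure might look like this:

quietly summarize y
generate y_std = (y - r(mean))/r(sd)   // equation (2)
regress y_std D                        // equation (3): b is now in SDs of y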

You can standardize your outcome variable, a right-hand side (RHS) variable, or both. If you standardize an RHS variable x, the estimate of its coefficient tells you what happens to y, in its own units, when x increases by one standard deviation. If you standardize both sides, it tells you by how many standard deviations y changes for a one-standard-deviation increase in x, as in the continuation of the sketch below. That really is all there is to standardization.
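
Continuing that sketch, and assuming a hypothetical continuous regressor x, the RHS and both-sides cases might look like this (egen’s std() performs the same operation as equation (2)):

egen x_std = std(x)     // standardize a RHS variable
regress y x_std         // b: change in y, in its own units, per one SD of x
regress y_std x_std     // b: change in y, in SDs of y, per one SD of x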

* Yes, I took the GRE before the analytical portion was a written essay. I am old.

‘Metrics Monday: New Version of “Elasticities and the Inverse Hyperbolic Sine Transformation”

Casey Wichman and I have just finished revising our paper titled “Elasticities and the Inverse Hyperbolic Sine Transformation,” in which we derive exact elasticities for log-linear, linear-log, and log-log specifications where the inverse hyperbolic sine transformation is used in lieu of logs so as to be able to keep zero-valued observations instead of systematically dropping them due to the undefined nature of ln(0).
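
For what it is worth, the transformation itself is simple to apply; a minimal Stata sketch, assuming a variable y that contains zero-valued observations, might look like this:

generate ln_y  = ln(y)      // missing wherever y == 0
generate ihs_y = asinh(y)   // = ln(y + sqrt(y^2 + 1)), defined at y == 0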

Here is the abstract:

Applied econometricians frequently apply the inverse hyperbolic sine (or arcsinh) transformation to a variable because it approximates the natural logarithm of that variable and allows retaining zero-valued observations. We provide derivations of elasticities in common applications of the inverse hyperbolic sine transformation and show empirically that the difference in elasticities driven by ad hoc transformations can be substantial. We conclude by offering practical guidance for applied researchers.

In this new version, we have made a number of changes in response to reviewer comments. In my view, the most important of these changes is appendix B, where we provide Stata code to compute elasticities with an inverse hyperbolic sine transformation.