
‘Metrics Monday: When In Doubt, Standardize

A few weeks ago, I received an email from one of my PhD students asking me whether I could help him interpret his coefficient estimates, given that his outcome variable was measured using an unfamiliar scale.

I told him to standardize his outcome variable, which would allow interpreting the estimate of coefficient b in the following regression

(1) y = a + bD + e

as follows, assuming a dichotomous treatment D: On average, when a unit goes from untreated (i.e., D = 0) to treated (i.e., D = 1), y increases by b standard deviations.

For example, suppose you are interested in looking at the effect of taking a test-prep class D on someone’s quantitative GRE score, which is scored on a scale of 130 to 170.

Suppose you have data on a number of test takers’ quantitative GRE scores and whether they took a test-prep class. You estimate equation (1) above and obtain an estimate of a equal to 140 and an estimate of b equal to 10.

(The example in this post is entirely fictitious, for what it’s worth; I have never taken a test-prep class for the GRE nor did I ever estimate anything involving data on GRE scores.)

Suppose further that, like me, you took the test a long time ago, when each section of the GRE was scored on a scale of 200 to 800,* so that the 130-170 scale is really unfamiliar to you. How would you assess whether taking the test-prep class is worth it (assuming, for the sake of argument, that b is identified)?

Standardizing y would go a long way toward helping you, because it would allow expressing the impact of the test-prep class in a familiar quantity. How do you standardize? Simply take the variable that is expressed in unfamiliar terms (here, GRE test scores), subtract its mean from each observation, and divide the result by the variable’s standard deviation. In other words,

(2) y’ = (y – m)/s,

where y’ is the standardized version of y, m is the mean of y, and s is the standard deviation of y. You would then estimate

(3) y’ = a + bD + e,

where the estimate of b becomes the effect measured in standard deviations of y instead of in points on the test. So if you find that the estimate of b in equation (3) is 0.15, you would conclude that taking the test-prep class leads to an increase in one’s quantitative GRE score of 0.15 standard deviation.

You can standardize your outcome variable, a right-hand side (RHS) variable, or both. If you standardize an RHS variable x, the interpretation is in terms of what happens to y, in its own units, when x increases by one standard deviation. If you standardize on both sides, the interpretation is in terms of standard deviations on both sides, i.e., what happens to y (in standard deviations of y) for a one standard deviation increase in x. That really is all there is to standardization.
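
To make this concrete, here is a minimal Stata sketch of the whole procedure using simulated data (the data-generating process and variable names are made up purely for illustration):

clear
set obs 500
set seed 987654321
gen prep = rbinomial(1,0.5)                       // test-prep dummy D
gen gre_q = 148 + 3*prep + rnormal(0,8)           // quantitative GRE score y
replace gre_q = round(max(130, min(170, gre_q)))  // keep scores on the 130-170 scale
reg gre_q prep                                    // b is measured in points on the test
sum gre_q
gen gre_q_std = (gre_q - r(mean))/r(sd)           // standardize y as in equation (2)
reg gre_q_std prep                                // b is measured in standard deviations of y

The two regressions estimate the same effect; only the units change, from points on the test in the first regression to standard deviations of the score in the second.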

* Yes, I took the GRE before the analytical portion was a written essay. I am old.

‘Metrics Monday: New Version of “Elasticities and the Inverse Hyperbolic Sine Transformation”

Casey Wichman and I have just finished revising our paper titled “Elasticities and the Inverse Hyperbolic Sine Transformation,” in which we derive exact elasticities for log-linear, linear-log, and log-log specifications where the inverse hyperbolic sine transformation is used in lieu of logs, which makes it possible to keep zero-valued observations rather than systematically dropping them because ln(0) is undefined.

Here is the abstract:

Applied econometricians frequently apply the inverse hyperbolic sine (or arcsinh) transformation to a variable because it approximates the natural logarithm of that variable and allows retaining zero-valued observations. We provide derivations of elasticities in common applications of the inverse hyperbolic sine transformation and show empirically that the difference in elasticities driven by ad hoc transformations can be substantial. We conclude by offering practical guidance for applied researchers.

In this new version, we have made a number of changes in response to reviewer comments. In my view, the most important of these changes is appendix B, where we provide Stata code to compute elasticities with an inverse hyperbolic sine transformation.
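
To give a flavor of what those computations look like, here is a rough Stata sketch for the case where both the outcome and the regressor are IHS-transformed. This is not the appendix B code; it only uses the fact that the derivative of arcsinh(y) with respect to y is 1/sqrt(y^2 + 1), it evaluates the elasticity at the sample means, and y and x are placeholder variable names:

gen ihs_y = asinh(y)    // inverse hyperbolic sine of the outcome
gen ihs_x = asinh(x)    // inverse hyperbolic sine of the regressor
reg ihs_y ihs_x
sum y, meanonly
local ybar = r(mean)
sum x, meanonly
local xbar = r(mean)
* elasticity of y with respect to x, evaluated at the sample means
display _b[ihs_x] * (`xbar'/`ybar') * sqrt(`ybar'^2 + 1)/sqrt(`xbar'^2 + 1)

The last line is just the chain rule applied at the means: d arcsinh(y) = dy/sqrt(y^2 + 1), so the estimated coefficient has to be rescaled by sqrt(ybar^2 + 1)/sqrt(xbar^2 + 1) and by xbar/ybar to turn it into an elasticity. The exact expressions in the paper handle the point of evaluation more carefully.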

‘Metrics Monday: Recoding Dummy Variables (Updated)

Continuing my reading of James et al.’s Introduction to Statistical Learning in an attempt to learn machine learning during this calendar year largely free from teaching, I learned something new (new to me, that is) in section 3.3, on pages 84-85, about the use of dummy variables as regressors.

Suppose you have a continuous dependent variable y (e.g., an individual’s wage) and a dichotomous regressor x such that x is either equal to zero or one (e.g., it equals one if an individual has graduated from college and zero if she has not). Linearly projecting y on x yields the following equation:

(1) y = a_0 + b_0*x + e_0.

Estimating equation 1 by least squares gets you \hat{a_0} and \hat{b_0}, where the former is the average wage among the individuals in your sample who haven’t graduated from college and \hat{a_0} + \hat{b_0} is the average wage among the individuals in your sample who have graduated from college, with the conclusion that \hat{b_0} is the wage differential associated (not caused, as that requires that we make more assumptions) with having graduated from college.

(Or you can also recode x as being equal to zero if someone has graduated from college and equal to one if he hasn’t, with the opposite interpretation.)

Jusqu’ici tout va bien…

So far so good.

What was new to me in James et al.’s discussion of dichotomous regressors is this: You can also recode x as being equal to -1 if someone hasn’t graduated from college and equal to 1 if someone has graduated from college. Then \hat{a_0} can be interpreted as the average wage in your sample (whether one has graduated from college or not), and \hat{b_0} becomes the amount college graduates earn over and above \hat{a_0} and the amount that those who haven’t graduated from college earn below \hat{a_0}.

Here is a proof by Stata:

clear
drop _all
set obs 1000
set seed 123456789
gen base_wage = rnormal(50,15)         // baseline hourly wage
gen diff = rnormal(20,1)               // college wage premium
gen college = rbinomial(1,0.5)         // dummy coded 0/1
gen wage = base_wage + college * diff  // observed hourly wage
reg wage college                       // regression with the 0/1 coding
gen college2 = -1 if college==0        // recode the dummy as -1/1
replace college2 = 1 if college==1
reg wage college2                      // regression with the -1/1 coding

The first regression yields estimates of (I drop the hats for clarity) a_0 = 49.27 and of b_0 = 20.19, i.e., individuals who haven’t graduated from college make $49.27 per hour on average, and the increase in wage associated with having graduated from college is $20.19 per hour, for an average of $69.46 per hour for college graduates in your sample.

The second regression, which recodes the college dummy as either -1 or 1, yields estimates of a_0 = 59.36 and of b_0 = 10.09. In other words, the average individual in the sample (ignoring whether one has graduated from college or not) makes $59.36 an hour. Using information about whether someone has graduated college, we know that that person makes on average $59.36 + $10.09 = $69.45 per hour if she has graduated from college, but $59.36 – $10.09 = $49.27 if she has not graduated from college.

In retrospect, this may all seem obvious, but as I said, this was new to me. I think coding a dummy as -1 or 1 gets to the spirit of what regression is about, viz. “What is happening on average?” By coding a dummy as -1 or 1, the constant returns the (true) average of the dependent variable. By coding a dummy as zero or one, the constant instead tells you the average for one group, and what you add to that average to recover the average for the other group.

As the example above illustrates, the two present the same information; they just package it differently, and you might want to report different things to different audiences. For example, academics might not care either way, or they might prefer the results with the dummy coded as 0/1, but policy makers might be interested in the results with the dummy coded as -1/1, because they might want to know the average wage regardless of whether one has graduated from college.

UPDATE: A few readers wrote to correct a mistake I made. For example, Climent Quintana-Domeque writes:

Perhaps I missed something (a clarification) but the intercept when the binary is classified as -1 or 1 is not the unconditional mean unless the fraction of observations in each category is the same (i.e., 1/2). Let’s say that p is the fraction with D=1 and (1-p) with D=0. Recoding, p is the fraction with D2=1 and (1-p) with D2=-1. Then, it is the case that:
a_1=E[Y] if p=1/2.
a_1 is approx. E[Y] if p is approx. 1/2
Linear projection of Y on D is
Y = a_0 + b_0 D + e_0
E[Y|D=1]=a_0 +b_0
E[Y|D=0]=a_0
E[Y] = E[E[Y|D]] = p(a_0 + b_0) +(1-p)a_0 = pa_0 + pb_0 + (1-p)a_0 = a_0 + pb_0
With your simulated example,
display .497 * 69.45112 + (1-.497) * 49.26481
59.297406
or
display 49.26481 + .497 * 20.18631
59.297406
sum wage

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        wage |      1,000    59.29741    17.99279   2.976877   108.1303
Linear projection of Y on D2 is
Y = a_1 + b_1 D2 + e_1
E[Y|D2=1] = a_1 + b_1
E[Y|D2=-1] = a_1 – b_1
E[Y] = E[E[Y|D2]] = p(a_1 + b_1) + (1-p)(a_1 – b_1) = pa_1 + pb_1 + (1-p)a_1 – (1-p)b_1 = a_1 + pb_1 – b_1 + pb_1 = a_1 + (2p-1)b_1
. display .497 * (59.35797 + 10.09316) + (1-.497) * (59.35797-10.09316)
59.297411
or
. display 59.35797 + (2*.497-1) * 10.09316
59.297411