## Testing Thursday: Comparing Distributions Redux

On the subject of tests used to compare two distributions, Varun writes with two questions. His first question is as follows:

I teach part of a data analysis course at our institute. Using the auto.dta example dataset that comes with Stata, we found the variable miles per gallon (mpg) not to be normally distributed.
To find out whether mpg is statistically different for domestic (a sub-sample of 52 cases) and foreign (a sub-sample of 22 cases) cars (a total sample size of 74), I told students there are two nonparametric tests: the Wilcoxon rank-sum test and the two-sample Kolmogorov-Smirnov test. The Wilcoxon rank-sum test tests whether the medians are significantly different, while the two-sample Kolmogorov-Smirnov test tests whether the distributions of the two groups are different. My students asked me which of these two tests they should use for mpg in auto.dta.

Answer: I had never used the Wilcoxon rank-sum test, so I was not familiar with the procedure. After digging around, however, I think I can answer the question “Should one use -ksmirnov- or -ranksum-?” These guidelines might be helpful:

Both the Mann-Whitney [Note: the test conducted by -ranksum-. –MFB.] and the Kolmogorov-Smirnov tests are nonparametric tests that compare two unpaired groups of data. Both compute p-values testing the null hypothesis that the two groups have the same distribution. But they work very differently:

• The Mann-Whitney test first ranks all the values from low to high, then computes a P value that depends on the discrepancy between the mean ranks of the two groups.

• The Kolmogorov-Smirnov test compares the cumulative distributions of the two data sets, and computes a P value that depends on the largest discrepancy between the distributions.

Here are some guidelines for choosing between the two tests:

• The KS test is sensitive to any difference between the two distributions. Substantial differences in shape, spread, or median will all result in a small P value. In contrast, the MW test is mostly sensitive to differences in the median.
• The MW test is used more often and is recognized by more people, so choose it if you have no idea which to pick.
• The MW test has been extended to handle tied values. The KS test does not handle ties as well, so if your data are categorical, and therefore have many ties, don’t choose the KS test.
• Some fields of science tend to prefer the KS test over the MW test. It makes sense to follow the traditions of your field.
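The first of those guidelines is easy to see in a small simulation. The sketch below is in Python with scipy rather than Stata, purely so the two tests can be shown compactly side by side; the sample sizes and variable names are made up for the example. It draws two samples with the same median but very different spreads: the KS test flags the difference in distributions, while the MW test, which keys on the ranks of the two groups, generally does not.

```python
import numpy as np
from scipy.stats import ks_2samp, mannwhitneyu

rng = np.random.default_rng(42)

# Two groups with the same median (zero) but very different spreads.
a = rng.normal(loc=0.0, scale=1.0, size=500)
b = rng.normal(loc=0.0, scale=3.0, size=500)

# KS compares the full empirical CDFs; MW compares mean ranks.
ks_stat, ks_p = ks_2samp(a, b)
mw_stat, mw_p = mannwhitneyu(a, b, alternative="two-sided")

print(f"KS: D = {ks_stat:.3f}, p = {ks_p:.2g}")  # p should be tiny here
print(f"MW: U = {mw_stat:.0f}, p = {mw_p:.2g}")  # p should be much larger
```

With equal medians, the rank-based test has essentially nothing to detect, whereas the spread difference pushes the two empirical CDFs far apart.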

In my contract farming data, both -ksmirnov- and -ranksum- give a similar answer:

```
. ksmirnov hunger, by(cf) exact

Two-sample Kolmogorov-Smirnov test for equality of distribution functions

 Smaller group       D       P-value      Exact
 ----------------------------------------------
 0:                0.0001      1.000
 1:               -0.0887      0.010
 Combined K-S:     0.0887      0.019          .

Note: Ties exist in combined dataset;
      there are 21 unique values out of 1182 observations.

. ranksum hunger, by(cf)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

          cf |      obs    rank sum    expected
-------------+---------------------------------
           0 |      601    373279.5    355491.5
           1 |      581    325873.5    343661.5
-------------+---------------------------------
    combined |     1182      699153      699153

Ho: hunger(cf==0) = hunger(cf==1)
             z =  3.050
    Prob > |z| =  0.0023
```

In both cases, the null (of equality of distributions) is rejected.

Which test are people more likely to use in economics? My hunch was that the Kolmogorov-Smirnov test was more popular than the Mann-Whitney U test, the test conducted by -ranksum-. (That hunch was a nice instance of the Law of Small Numbers: it was based on the fact that, when I was in grad school, a problem set once asked us to conduct a Kolmogorov-Smirnov test, but we were never asked to conduct a Mann-Whitney test.) So I looked on JSTOR. It turns out that between 2010 and 2016, there were 108 instances of “Kolmogorov-Smirnov” in economics articles on JSTOR versus 101 instances of “Mann-Whitney,” so it looks as though there is no clear preference for either, at least in the recent past.

As always, my advice is “do (and report) both.” This is especially so with nonparametric procedures, which, compared to parametric procedures aimed at answering the same question (e.g., probit vs. logit), are more likely to yield different answers by virtue of looking at the same problem very differently.
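In the spirit of “do both,” here is a minimal sketch of what that advice looks like in code (again in Python with scipy; the simulated “hunger” scores and the two group sizes are stand-ins echoing my contract farming data, not the real data): run both tests on the same two samples and report both p-values side by side.

```python
import numpy as np
from scipy.stats import ks_2samp, mannwhitneyu

rng = np.random.default_rng(0)

# Stand-in data: a discrete score with many ties (like the real
# hunger variable, which has 21 unique values), shifted between
# the cf==0 and cf==1 groups.
hunger_cf0 = rng.binomial(20, 0.35, size=601)
hunger_cf1 = rng.binomial(20, 0.30, size=581)

ks_stat, ks_p = ks_2samp(hunger_cf0, hunger_cf1)
mw_stat, mw_p = mannwhitneyu(hunger_cf0, hunger_cf1,
                             alternative="two-sided")

# Report both, so readers can see whether the two tests agree.
print(f"Kolmogorov-Smirnov: D = {ks_stat:.4f}, p = {ks_p:.4f}")
print(f"Mann-Whitney:       U = {mw_stat:.1f}, p = {mw_p:.4f}")
```

When the two tests agree, as in my data above, reporting both costs nothing; when they disagree, the disagreement itself is informative about whether the groups differ in location or in some other feature of the distribution.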