Wednesday, August 7, 2013

Are your data normal? Hint: no.

One of the frequently-asked questions over at the statistics subreddit (reddit.com/r/statistics) is how to test whether a dataset is drawn from a particular distribution, most often the normal distribution.

There are standard tests for this sort of thing, many with double-barreled names like Anderson-Darling, Kolmogorov-Smirnov, Shapiro-Wilk, Ryan-Joiner, etc.

But these tests are almost never what you really want.  When people ask these questions, what they really want to know (most of the time) is whether a particular distribution is a good model for a dataset.  And that's not a statistical test; it is a modeling decision.

All statistical analysis is based on models, and all models are based on simplifications.  Models are only useful if they are simpler than the real world, which means you have to decide which aspects of the real world to include in the model, and which things you can leave out.

For example, the normal distribution is a good model for many physical quantities.  The distribution of human height is approximately normal (see this previous blog post).  But human heights are not normally distributed.  For one thing, human heights are bounded within a narrow range, and the normal distribution goes to infinity in both directions.  But even ignoring the non-physical tails (which have very low probability anyway), the distribution of human heights deviates in systematic ways from a normal distribution.

So if you collect a sample of human heights and ask whether they come from a normal distribution, the answer is no.  And if you apply a statistical test, you will eventually (given enough data) reject the hypothesis that the data came from the distribution.

Instead of testing whether a dataset is drawn from a distribution, let's ask what I think is the right question: how can you tell whether a distribution is a good model for a dataset?

The best approach is to create a visualization that compares the data and the model, and there are several ways to do that.

Comparing to a known distribution

If you want to compare data to a known distribution, the simplest visualization is to plot the CDFs of the model and the data on the same axes.  Here's an example from an earlier blog post:


I used this figure to confirm that the data generated by my simulation matches a chi-squared distribution with 3 degrees of freedom.

Many people are more familiar with histograms than CDFs, so they sometimes try to compare histograms or PMFs.  This is a bad idea.  Use CDFs.

Comparing to an estimated distribution

If you know what family of distributions you want to use as a model, but don't know the parameters, you can use the data to estimate the parameters, and then compare the estimated distribution to the data. Here's an example from Think Stats that compares the distribution of birth weight (for live births in the U.S.) to a normal model.

This figure shows that the normal model is a good match for the data, but there are more light babies than we would expect from a normal distribution.  This model is probably good enough for many purposes, but probably not for research on premature babies, which account for the deviation from the normal model.  In that case, a better model might be a mixture of two normal distribution.

Depending on the data, you might want to transform the axes.  For example, the following plot compares the distribution of adult weight for a sample of Americans to a lognormal model:


I plotted the x-axis on a log scale because under a log transform a lognormal distribution is a nice symmetric sigmoid where both tails are visible and easy to compare.  On a linear scale the left tail is too compressed to see clearly.

UPDATE: Someone asked what that graph would look like on a linear scale:


So that gives a sense of what it looks like when you fit a normal model to lognormal data.

Instead of plotting CDFs, a common alternative is a Q-Q plot, which you can generate like this:

def QQPlot(cdf, fit):
    """Makes a QQPlot of the values from actual and fitted distributions.

    cdf: actual Cdf
    fit: model Cdf
    """
    ps = cdf.ps
    actual = cdf.xs
    fitted = [fit.Value(p) for p in ps]

    pyplot.plot(fitted, actual)

cdf and fit are Cdf objects as defined in thinkbayes.py.

The ps are the percentile ranks from the actual CDF.  actual contains the corresponding values from the CDF.  fitted contains values generated by looping through the ps and, for each percentile rank, looking up the corresponding value in the fitted CDF.

The resulting plot shows fitted values on the x-axis and actual values on the y-axis.  If the model matches the data, the plot should be the identity line (y=x).

Here's an example from an earlier blog post:


The blue line is the identity line.  Clearly the model a good match for the data, except the last point.  In this example, I reversed the axes and put the actual values on the x-axis.  It seemed like a good idea at the time, but probably was not.


Comparing to a family of distributions

When you compare data to an estimated model, the quality of fit depends on the quality of your estimate.  But there are ways to avoid this problem.

For some common families of distributions, there are simple mathematical operations that transform the CDF to a straight line.   For example, if you plot the complementary CDF of an exponential distribution on a log-y scale, the result is a straight line (see this section of Think Stats).

Here's an example that shows the distribution of time between deliveries (births) at a hospital in Australia:


The result is approximately a straight line, so the exponential model is probably a reasonable choice.
There are similar transformations for the Weibull and Pareto distributions.

Comparing to a normal distribution

Sadly, things are not so easy for the normal distribution.  But there is a good alternative: the normal probability plot.

There are two ways to generate normal probability plots.  One is based on rankits, but there is a simpler method:
  1. From a normal distribution with µ = 0 and σ = 1, generate a sample with the same size as your dataset and sort it.
  2. Sort the values in the dataset.
  3. Plot the sorted values from your dataset versus the random values.
Here's an example using the birthweight data from a previous figure:

A normal probability plot is basically a Q-Q plot, so you read it the same way.  This figure shows that the data deviate from the model in the range from 1.5 to 2.5 standard deviations below the mean.  In this range, the actual birth weights are lower than expected according to the model (I should have plotted the model as a straight line; since I didn't, you have to imagine it).

There's another example of a probability plot (including a fitted line) in this previous post.

The normal probability plot works because the family of normal distributions is closed under linear transformation, so it also works with other stable distributions (see http://en.wikipedia.org/wiki/Stable_distribution).


In summary: choosing an analytic model is not a statistical question; it is a modeling decision.  No statistical test that can tell you whether a particular distribution is a good model for your data.  In general, modeling decisions are hard, but I think the visualizations in this article are some of the best tools to guide those decisions.


UPDATE:

I got the following question on reddit:
I don't understand. Why not go to the next step and perform a Kolmogorov-Smirnov test, etc? Just looking at plots is good but performing the actual test and looking at plots is better.
And here's my reply:
The statistical test does not provide any additional information. Real world data never match an analytic distribution perfectly, so if you perform a test, there are only two possible outcomes: 
1) You have a lot of data, so the p-value is low, so you (correctly) conclude that the data did not really come from the analytic distribution. But this doesn't tell you how big the discrepancy is, or whether the analytic model would still be good enough. 
2) You don't have enough data, so the p-value is high, and you conclude that there is not enough evidence to reject the null hypothesis. But again, this doesn't tell you whether the model is good enough. It only tells you that you don't have enough data. 
Neither outcome helps with what I think is the real problem: deciding whether a particular model is good enough for the intended purpose.

You can read the rest of the conversation here.

15 comments:

  1. Could you explain why you are using a "normal probability plot" instead of a QQ-plot?

    ReplyDelete
    Replies
    1. Yes! The QQ plot requires you to compare your data to a specific distribution, so it is only as good as your estimated parameters.

      The normal probability plot does not require you to estimate parameters. In the example I used the standard normal distribution, but I could have used any normal distribution. This works because of the property I mentioned: normal distributions are closed under linear transformation.

      Delete
  2. What you're describing is actually a problem with hypothesis testing in general, rejecting the null does not tell you anything about effect size, so we can fool ourselves with over-powered tests. Goodness of fit tests like Kolmogorov-Smirnov and Anderson-Darling model the empirical process as a Brownian Bridge and look for irregularities in the sup-norm or weighted L2 norm respectively. It's certainly possible to interpret an effect size here, assuming one bothers to report it, and it tells the same kind of stories that your pictures do.

    For example suppose I want to know if 10^6 data points are well-modeled by a uniform distribution on [0,1] -- I can form the empirical process, which should be approximately the Brownian Bridge. The sup of the empirical process, which is after all a one-sided KS statistic is nothing more than 1000 * (max difference between empirical and desired cdf). If I reject above some level T, I can say what I actually mean, I am rejecting for errors of probability on order T/1000. If this effect size was reported, you could make a decision about the meaning of the test, and would probably conclude for many purposes that it was ridiculously over-powered.

    What I'm saying is just basic hygiene when we talk about tests for a single parameter, but for some reason as soon as things get non-parametric, common sense goes out the window.

    ReplyDelete
    Replies
    1. Yes, most of my objections about this kind of distribution testing also apply to hypothesis testing in general.

      I am picking on this particular application because when people ask about distribution testing, it is almost always the wrong question. What they really want, most of the time, is help making modeling decisions.

      Thanks for this comment!

      Delete
  3. This comment has been removed by a blog administrator.

    ReplyDelete
    Replies
    1. This comment has been removed by the author.

      Delete
  4. Can you say something about why CDF is preferable to PDF for comparing distributions?

    ReplyDelete
    Replies
    1. I am working on a talk that addresses exactly this question. I will post it soon.

      Delete
    2. And here it is: http://allendowney.blogspot.com/2013/09/how-to-consume-statistical-analysis.html

      Delete
  5. For the normality plot, I wanted to mention a little trick that helps out with small sample sizes (which will have varying results because of your small random draw from the normal distribution): you can draw a much larger set of normally distributed points, sort them, and then take only a subset of those sorted points. Honestly, the code is simpler than my prose, so here is a little sample:

    n = 5000 # this makes for very solid normpts
    data = [820,770,710,910,840,670,830]
    sdata = sorted(data)
    normpts = sorted(np.random.normal(0,1,len(data)*n))[n/2::n]

    ReplyDelete
    Replies
    1. Good suggestion. Thanks!

      Or there's one more alternative -- you can generate several rows with size N, sort the rows, then take the average of each column. As the number of rows increases, the averages converge on the rankits.

      Delete
  6. Do you ever use a maximum likelihood estimation to determine distributions?

    ReplyDelete
    Replies
    1. I usually go the other way: I use Bayesian methods to generate a posterior distribution, and then if I need an MLE, I can get it from the posterior. But if all you have is an MLE, that's not enough to determine the distribution (in general).

      Delete
  7. Hi sir
    I have data of all blood pressure readings max 150 min 70. All data are recorded from normal pattern and patients looks very normal. Totally 872 reading s of 30 patients.However AD test non normal. It fits only nearly 3 parameter logo logistical distribution AD of 4,.. And p of 0.0005<. Now can I use process capability evaluation on these data just considering it as normal data? If I do so I will get excellent results

    ReplyDelete
    Replies
    1. That's a good example of what I'm talking about: if you have enough data, Anderson-Darling and related tests will eventually indicate that your data are not normally distributed, but that doesn't mean a normal model would not be reasonable and useful.

      Delete