Probably Overthinking It: August 2014

According to an exciting new study, childhood vaccines are almost miraculously effective at preventing suffering and death due to infectious disease.

Sadly, that is not actually a headline, because it doesn't generate clicks. What does generate clicks? This: Journal questions validity of autism and vaccine study (CNN Health).

If, at this point, headlines like this make you roll your eyes and click on more interesting things, like "Teen catches one-in-2 million lobster," let me assure you that you are absolutely right. In fact, if you want to skip the rest of this post, and read something that will make you happy to live in the 21st century, I recommend this list of vaccine-preventable diseases.

But if you have already heard about this new paper and you are curious to know why it is, from a statistical point of view, completely bogus, read on!

Let me start by reviewing the basics of statistical hypothesis testing. Suppose you see an apparent effect like a difference in risk of autism between groups of children who are vaccinated at different ages. You might wonder whether the effect you see in your selected sample is likely to exist in the general population, or whether it might appear by chance in your sample only, and not in the general population.

To answer at least part of that question, you can compute a p-value, which is the probability of seeing a difference as big as the one you saw if, in fact, there is no difference between the groups. If this value is small, you can conclude that the difference is unlikely to be an artifact of random sampling. And if you are willing to cut a couple of logical corners, you can conclude that the effect is more likely to exist in the general population.

In many fields it is common to use 5% as a magical threshold for statistical significance. If the p-value is less than 5%, the effect is considered statistically significant and, more importantly, fit for publication. If it comes in at 6%, it doesn't count.

This is, of course, arbitrary and mostly silly, but it does have one virtue. If we follow this process with care, it should yield a known and small false positive rate: if, in fact, there is no difference between the groups, we should expect to conclude, incorrectly, that the apparent difference is significant about 5% of the time.

But that expectation only holds if we apply hypothesis tests very carefully. In practice, people don't, and there is accumulating evidence that the false positive rate in published research is much higher than 5%, possibly higher than 50%.

There are lots of ways to mess up hypothesis testing, but one of the most common is performing multiple tests. Every time you perform a test you have a 5% chance of generating a false positive, so if you perform 20 independent tests, you should expect about 1 false positive. This problem is ably demonstrated by this classic xkcd cartoon:

If you are not already familiar with xkcd, you're welcome.

So let's get back to the paper reported in the CNN article. Just looking at the results that appear in the tables, this paper reports 35 p-values. Five of them are identified as significant at the 5% level. That's more than the expected number of false positives, but they might still be due to chance (especially since the tests are not independent).

Fortunately, there is a simple process that corrects for multiple tests, the Holm-Bonferroni method. You can get the details from the Wikipedia article, but I'll give you a quick-and dirty version here: if you perform n tests, the lowest p-value needs to be below 0.05 / n to be considered statistically significant.

Since the paper reports 35 tests, their threshold is 0.05 / 35, which is 0.0014. Since their lowest p-value is 0.0019, it should not be considered statistically significant, and neither should any of the other test results.

And I am being generous by assuming that the authors only performed 35 tests. It is likely that they performed many more and chose carefully which ones to report. And I assume that they are doing everything else right as well.

But even with the benefit of the doubt, these results are not statistically significant. Given how hard the authors apparently tried to find evidence that vaccines cause autism, the fact that they failed should be considered evidence that vaccines do NOT cause autism.

Before I read this paper, I was nearly certain that vaccines do not cause autism. After reading this paper, I am very slightly more certain that vaccines do not cause autism. And by the way, I am also nearly certain that vaccines do prevent suffering and death due to infectious disease.

Now go check out the video of that blue lobster.

UPDATE: Or go read this related article by my friend Ted Bunn.

I've just turned in the manuscript for the second edition of Think Stats. If you're dying to get your hands on a copy, you can pre-order one here.

Most of the book is about computational methods, but in the last chapter I break out some analytic methods, too. In the last section of the book, I explain the underlying philosophy:

This book focuses on computational methods like resampling and permutation. These methods have several advantages over analysis:

They are easier to explain and understand. For example, one of the most difficult topics in an introductory statistics class is hypothesis testing. Many students don’t really understand what p-values are. I think the approach I presented in Chapter 9—simulating the null hypothesis and computing test statistics—makes the fundamental idea clearer.

They are robust and versatile. Analytic methods are often based on assumptions that might not hold in practice. Computational methods require fewer assumptions, and can be adapted and extended more easily.

They are debuggable. Analytic methods are often like a black box: you plug in numbers and they spit out results. But it’s easy to make subtle errors, hard to be confident that the results are right, and hard to find the problem if they are not. Computational methods lend themselves to incremental development and testing, which fosters confidence in the results.

But there is one drawback: computational methods can be slow. Taking into account these pros and cons, I recommend the following process:

Use computational methods during exploration. If you find a satisfactory answer and the run time is acceptable, you can stop.

If run time is not acceptable, look for opportunities to optimize. Using analytic methods is one of several methods of optimization.

If replacing a computational method with an analytic method is appropriate, use the computational method as a basis of comparison, providing mutual validation between the computational and analytic results.

For the vast majority of problems I have worked on, I didn’t have to go past Step 1.

The last exercise in the book is based on a question my colleague, Lynn Stein, asked me for a paper she was working on:

In a recent paper², Stein et al. investigate the effects of an intervention intended to mitigate gender-stereotypical task allocation within student engineering teams. Before and after the intervention, students responded to a survey that asked them to rate their contribution to each aspect of class projects on a 7-point scale.

Before the intervention, male students reported higher scores for the programming aspect of the project than female students; on average men reported a score of 3.57 with standard error 0.28. Women reported 1.91, on average, with standard error 0.32.

Question 1: Compute a 90% confidence interval and a p-value for the gender gap (the difference in means).

After the intervention, the gender gap was smaller: the average score for men was 3.44 (SE 0.16); the average score for women was 3.18 (SE 0.16). Again, compute the sampling distribution of the gender gap and test it.

Question 2: Compute a 90% confidence interval and a p-value for the change in gender gap.

[2] Stein et al. “Evidence for the persistent effects of an intervention to mitigate gender-sterotypical task allocation within student engineering teams,” Proceedings of the IEEE Frontiers in Education Conference, 2014.

In the book I present ways to do these computations, and I will post my "solutions" here soon. But first I want to pose these questions as a challenge for statisticians and people learning statistics. How would you approach these problems?

The reason I ask: Question 1 is pretty much a textbook problem; you can probably find an online calculator to do it for you. But you are less likely to find a canned solution to Question 2, so I am curious to see how people go about it. I hope to post some different solutions soon.

By the way, this is not meant to be a "gotcha" question. If some people get it wrong, I am not going to make fun of them. I am looking for different correct approaches; I will ignore mistakes, and only point out incorrect approaches if they are interestingly incorrect.

You can post a solution in the comments below, or discuss it on reddit.com/r/statistics, or if you don't want to be influenced by others, send me email at downey at allendowney dot com.

UPDATE August 26, 2014

The discussion of this question on reddit.com/r/statistics was as interesting as I hoped. People suggested several very different approaches to the problem. The range of responses extends from something like, "This is a standard problem with a known, canned answer," to something like, "There are likely to be dependencies among the values that make the standard model invalid, so the best you can do is an upper bound." In other words, the problem is either trivial or impossible!

The approach I had in mind is to compute sampling distributions for the gender gap and the change in gender gap using normal approximations, and then use the sampling distributions to compute standard errors, confidence intervals, and p-values.

I used a simple Python class that represents a normal distribution. Here is the API:

class Normal(object):
"""Represents a Normal distribution"""

def __init__(self, mu, sigma2, label=''):
"""Initializes a Normal object with given mean and variance."""

def __add__(self, other):
"""Adds two Normal distributions."""

def __sub__(self, other):
"""Subtracts off another Normal distribution."""

def __mul__(self, factor):
"""Multiplies by a scalar."""

def Sum(self, n):
"""Returns the distribution of the sum of n values."""

def Prob(self, x):
"""Cumulative probability of x."""

def Percentile(self, p):
"""Inverse CDF of p (0 - 100)."""

The implementation of this class is here.

Here's a solution that uses the Normal class. First we make normal distributions that represent the sampling distributions of the estimated means, using the given means and sampling errors. The variance of the sampling distribution is the sampling error squared:

male_before = normal.Normal(3.57, 0.28**2)
male_after = normal.Normal(3.44, 0.16**2)

female_before = normal.Normal(1.91, 0.32**2)
female_after = normal.Normal(3.18, 0.16**2)

Now we compute the gender gap before the intervention, and print the estimated difference, p-value, and confidence interval:

diff_before = female_before - male_before
print('mean, p-value', diff_before.mu, 1-diff_before.Prob(0))
print('CI', diff_before.Percentile(5), diff_before.Percentile(95))

The estimated gender gap is -1.66 with SE 0.43, 90% CI (-2.3, -0.96) and p-value 5e-5. So that's statistically significant.

Then we compute the gender gap after intervention and the change in gender gap:

diff_after = female_after - male_after

diff = diff_after - diff_before

print('mean, p-value', diff.mu, diff.Prob(0))

print('CI', diff.Percentile(5), diff.Percentile(95))

The estimated change is 1.4 with SE 0.48, 90% CI (0.61, 2.2) and p-value 0.002. So that's statistically significant, too.

This solution is based on a two assumptions:

1) It assumes that the sampling distribution of the estimated means is approximately normal. Since the data are on a Likert scale, the variance and skew are small, so the sum of n values converges to normal quickly. The samples sizes are in the range of 30-90, so the normal approximation is probably quite good.

This claim is based on the Central Limit Theorem, which only applies if the samples are drawn from the population independently. In this case, there are dependencies within teams: for example, if someone on a team does a larger share of a task, the rest of the team necessarily does less. But the team sizes are 2-4 people and the sample sizes are much larger, so these dependencies have short enough "range" that I think it is acceptable to ignore them.

2) Every time we subtract two distributions to get the distribution of the difference, we are assuming that values are drawn from the two distributions independently. In theory, dependencies within teams could invalidate this assumption, but I don't think it's likely to be a substantial effect.

3) As always, remember that the standard error (and confidence interval) indicate uncertainty due to sampling, but say nothing about measurement error, sampling bias, and modeling error, which are often much larger sources of uncertainty.

The approach I presented here is a bit different from what's presented in most introductory stats classes. If you have taken (or taught) a stats class recently, I would be curious to know what you think of this problem. After taking the class, would you be able to solve problems like this? Or if you are teaching, could your students do it?

Your comments are welcome.

Probably Overthinking It

Friday, August 29, 2014

New study: vaccines prevent disease and death

Friday, August 22, 2014

An exercise in hypothesis testing