According to an exciting new study, childhood vaccines are almost miraculously effective at preventing suffering and death due to infectious disease.
Sadly, that is not actually a headline, because it doesn't generate clicks. What does generate clicks? This: Journal questions validity of autism and vaccine study (CNN Health).
If, at this point, headlines like this make you roll your eyes and click on more interesting things, like "Teen catches one-in-2 million lobster," let me assure you that you are absolutely right. In fact, if you want to skip the rest of this post, and read something that will make you happy to live in the 21st century, I recommend this list of vaccine-preventable diseases.
But if you have already heard about this new paper and you are curious to know why it is, from a statistical point of view, completely bogus, read on!
Let me start by reviewing the basics of statistical hypothesis testing. Suppose you see an apparent effect like a difference in risk of autism between groups of children who are vaccinated at different ages. You might wonder whether the effect you see in your selected sample is likely to exist in the general population, or whether it might appear by chance in your sample only, and not in the general population.
To answer at least part of that question, you can compute a p-value, which is the probability of seeing a difference as big as the one you saw if, in fact, there is no difference between the groups. If this value is small, you can conclude that the difference is unlikely to be an artifact of random sampling. And if you are willing to cut a couple of logical corners, you can conclude that the effect is more likely to exist in the general population.
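If it helps to see that logic as code, here is a minimal sketch in Python of one way to compute a p-value: a permutation test on made-up data. Nothing in it comes from the paper; the groups, sample sizes, and the `permutation_p_value` function are all hypothetical, chosen only to illustrate the idea.

```python
import numpy as np

def permutation_p_value(group1, group2, iters=10000, rng=None):
    """Estimate the p-value for the observed difference in means by
    shuffling group labels, which simulates a world with no real difference."""
    rng = np.random.default_rng() if rng is None else rng
    observed = abs(np.mean(group1) - np.mean(group2))
    pooled = np.concatenate([group1, group2])
    n = len(group1)
    count = 0
    for _ in range(iters):
        rng.shuffle(pooled)
        diff = abs(np.mean(pooled[:n]) - np.mean(pooled[n:]))
        if diff >= observed:
            count += 1
    return count / iters

# Hypothetical example: two groups drawn from the same distribution,
# so any apparent difference is due to chance.
rng = np.random.default_rng(42)
group1 = rng.normal(0, 1, size=100)
group2 = rng.normal(0, 1, size=100)
print(permutation_p_value(group1, group2, rng=rng))
```

The key move is to simulate the world where the null hypothesis is true, by shuffling the group labels, and then ask how often that world produces a difference at least as big as the one you actually saw.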
In many fields it is common to use 5% as a magical threshold for statistical significance. If the p-value is less than 5%, the effect is considered statistically significant and, more importantly, fit for publication. If it comes in at 6%, it doesn't count.
This is, of course, arbitrary and mostly silly, but it does have one virtue. If we follow this process with care, it should yield a known and small false positive rate: if, in fact, there is no difference between the groups, we should expect to conclude, incorrectly, that the apparent difference is significant about 5% of the time.
But that expectation only holds if we apply hypothesis tests very carefully. In practice, people don't, and there is accumulating evidence that the false positive rate in published research is much higher than 5%, possibly higher than 50%.
There are lots of ways to mess up hypothesis testing, but one of the most common is performing multiple tests. Every time you run a test on data where there is no real effect, you have a 5% chance of generating a false positive, so if you perform 20 independent tests, you should expect about one false positive. This problem is ably demonstrated by this classic xkcd cartoon:
If you are not already familiar with xkcd, you're welcome.
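To put numbers on the cartoon: if each test uses a 5% threshold and none of the effects are real, the expected number of false positives in 20 independent tests is 20 × 0.05 = 1, and the probability of getting at least one is 1 - 0.95^20, or about 0.64. Here is a quick simulation, again on hypothetical null data rather than anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
num_tests = 20
alpha = 0.05

# Probability of at least one false positive across 20 independent tests.
print(1 - (1 - alpha) ** num_tests)   # about 0.64

# Simulate many "studies", each running 20 tests on data where the null
# hypothesis is true (p-values are uniform under the null), and count
# how often at least one test looks "significant" by chance.
num_studies = 10000
p_values = rng.uniform(0, 1, size=(num_studies, num_tests))
false_positives = (p_values < alpha).sum(axis=1)
print(false_positives.mean())        # close to 1 false positive per study
print((false_positives >= 1).mean()) # close to 0.64
```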
So let's get back to the paper reported in the CNN article. Just looking at the results that appear in the tables, this paper reports 35 p-values. Five of them are identified as significant at the 5% level. That's more than the expected number of false positives, but they might still be due to chance (especially since the tests are not independent).
Fortunately, there is a simple process that corrects for multiple tests, the Holm-Bonferroni method. You can get the details from the Wikipedia article, but I'll give you a quick-and-dirty version here: if you perform n tests, the lowest p-value needs to be below 0.05 / n to be considered statistically significant.
Since the paper reports 35 tests, their threshold is 0.05 / 35, which is 0.0014. Since their lowest p-value is 0.0019, it should not be considered statistically significant, and neither should any of the other test results.
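For the curious, here is a sketch of the full Holm-Bonferroni procedure in Python. The list of p-values below is hypothetical; the only value taken from the paper is the smallest one, 0.0019, which already fails the first threshold of 0.05 / 35 ≈ 0.0014, so the procedure stops there and nothing is declared significant.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return the p-values that remain significant after the
    Holm-Bonferroni correction for multiple tests."""
    significant = []
    n = len(p_values)
    for i, p in enumerate(sorted(p_values)):
        # The smallest p-value is compared to alpha/n, the next to
        # alpha/(n-1), and so on; stop at the first one that fails.
        if p < alpha / (n - i):
            significant.append(p)
        else:
            break
    return significant

# Hypothetical list of 35 p-values whose smallest entry is 0.0019,
# standing in for the paper's results (the other values are made up).
p_values = [0.0019, 0.01, 0.02, 0.03, 0.04] + [0.5] * 30
print(holm_bonferroni(p_values))   # [] -- 0.0019 fails the 0.05/35 threshold
```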
And I am being generous by assuming that the authors only performed 35 tests. It is likely that they performed many more and chose carefully which ones to report. I am also assuming that they did everything else correctly.
But even with the benefit of the doubt, these results are not statistically significant. Given how hard the authors apparently tried to find evidence that vaccines cause autism, the fact that they failed should be considered evidence that vaccines do NOT cause autism.
Before I read this paper, I was nearly certain that vaccines do not cause autism. After reading this paper, I am very slightly more certain that vaccines do not cause autism. And by the way, I am also nearly certain that vaccines do prevent suffering and death due to infectious disease.
Now go check out the video of that blue lobster.
UPDATE: Or go read this related article by my friend Ted Bunn.
Thank you for this simple, useful explanation. I appreciate it.
That xkcd is a perfect analogy to the Hooker paper nonsense. Thank you for clarifying it!
I'm glad you point out this fallacious use of p-values. What bothers me, though, is that people fail to dub erroneous p-values the pretend p-values that they are. In order for a p-value, say of .05, to be an actual and not merely a nominal (computed or pretend) p-value, it's required that
Prob(p-value < .05; H0) ≈ .05.
With the multiple testing in the jelly bean case, say, the probability of so impressive-seeming a p-value is ~.65. I will look carefully at the paper you cite, but I just wanted to note this because it drives me crazy when pretend p-values aren't immediately called out for what they are.
Thank you for bringing out the fallacy of failing to adjust p-values.
errorstatistics.com