Monday, March 2, 2015

Statistical inference is only mostly wrong

p-values banned!

The journal Basic and Applied Social Psychology (BASP) made news recently by "banning" p-values.  Here's a summary of their major points:
  1. "...the null hypothesis significance testing procedure (NHSTP) is invalid...".   "We believe that the p<0.05 bar is too easy to pass and sometimes serves as an excuse for lower quality research."
  2. "Confidence intervals suffer from an inverse problem that is not very different from that suffered by the NHSTP.   ... the problem is that, for example, a 95% confidence interval does not indicate that the parameter of interest has a 95% probability of being within the interval."
  3. "Bayesian procedures are more interesting.  The usual problem ... is that they depend on some sort of Laplacian assumption to generate numbers where none exist."
There are many parts of this new policy, and the rationale for it, that I agree with.  But there are several things I think they got wrong, both in the diagnosis and the prescription.  So I find myself -- to my great surprise -- defending classical statistical inference.  I'll take their points one at a time:


Hypothesis testing

Classical hypothesis testing is problematic, but it is not "invalid", provided that you remember what it is and what it means.  A hypothesis test is a partial answer to the question, "Is it likely that I am getting fooled by randomness?"

Suppose you see an apparent effect, like a difference between two groups.  An example I use in Think Stats is the apparent difference in pregnancy length between first babies and others: in a sample of more than 9000 pregnancies in a national survey, first babies are born about 13 hours earlier, on average, than other babies.

There are several possible explanations for this difference:
  1. The effect might be real; that is, a similar difference would be seen in the general population.
  2. The apparent effect might be due to a biased sampling process, so it would not appear in the general population.
  3. The apparent effect might be due to measurement errors.
  4. The apparent effect might be due to chance; that is, the difference might appear in a random sample, but not in the general population.
Hypothesis testing addresses only the last possibility.  It asks, "If there were actually no difference between first babies and others, what would be the chance of selecting a sample with a difference as big as 13 hours?"  The answer to this question is the p-value.

If the p-value is small, you can conclude that the fourth possibility is unlikely, and infer that the other three possibilities are more likely.  In the actual sample I looked at, the p-value was 0.17, which means that there is a 17% chance of seeing a difference as big as 13 hours, even if there is no actual difference between the groups.  So I concluded that the fourth possibility can't be ruled out.
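This kind of p-value can be estimated by simulation, using a permutation test: shuffle the group labels to simulate the null hypothesis, and count how often the shuffled difference is as big as the observed one.  Here is a minimal sketch; the data below are synthetic stand-ins, not the NSFG sample, and the helper name is my own.

```python
import numpy as np

rng = np.random.default_rng(42)

def permutation_p_value(group1, group2, iters=2000):
    """Estimate a p-value by simulating the null hypothesis:
    if group labels don't matter, shuffling them should produce
    differences as big as the observed one reasonably often."""
    observed = abs(group1.mean() - group2.mean())
    pooled = np.concatenate([group1, group2])
    n = len(group1)
    count = 0
    for _ in range(iters):
        rng.shuffle(pooled)
        diff = abs(pooled[:n].mean() - pooled[n:].mean())
        if diff >= observed:
            count += 1
    return count / iters

# Synthetic pregnancy lengths in hours (NOT the real survey data),
# with a small built-in difference comparable to the 13-hour example
firsts = rng.normal(38.6 * 7 * 24, 450, size=1000)
others = rng.normal(38.6 * 7 * 24 - 13, 450, size=1000)
p = permutation_p_value(firsts, others)
```

The returned fraction is the p-value: the probability, under the null hypothesis, of a difference at least as big as the one observed.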

There is nothing invalid about these conclusions, provided that you remember a few caveats:
  1. Hypothesis testing can help rule out one explanation for the apparent effect, but it doesn't account for others, particularly sampling bias and measurement error.
  2. Hypothesis testing doesn't say anything about how big or important the effect is.
  3. There is nothing magic about the 5% threshold.

Confidence intervals

The story is pretty much the same for confidence intervals; they are problematic for many of the same reasons, but they are not useless.  A confidence interval is a partial answer to the question: "How precise is my estimate of the effect size?"

If you run an experiment and observe an effect, like the 13-hour difference in pregnancy length, you might wonder whether you would see the same thing if you ran the experiment again.  Would it always be 13 hours, or might it range between 12 and 14?  Or maybe between 11 and 15?

You can answer these questions by 
  1. Making a model of the experimental process,
  2. Analyzing or simulating the model, and
  3. Computing the sampling distribution, which quantifies how much the estimate varies due to random sampling. 
Confidence intervals and standard errors are two ways to summarize the sampling distribution and indicate the precision of the estimate.
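The three steps above can be sketched with a bootstrap, which simulates the experimental process by resampling the observed data with replacement.  The sample here is synthetic, and the 90% level and helper name are my choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_ci(data, stat=np.mean, iters=2000, level=90):
    """Approximate the sampling distribution of `stat` by resampling
    the data with replacement, then summarize it with percentiles."""
    estimates = [stat(rng.choice(data, size=len(data), replace=True))
                 for _ in range(iters)]
    tail = (100 - level) / 2
    return np.percentile(estimates, [tail, 100 - tail])

# Synthetic sample of pregnancy lengths in weeks
sample = rng.normal(38.5, 2.7, size=500)
low, high = bootstrap_ci(sample)
```

The standard deviation of the same list of estimates is the standard error, so both summaries come out of one computation of the sampling distribution.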

Again, confidence intervals are useful if you remember what they mean:
  1. The CI quantifies the variability of the estimate due to random sampling, but does not address other sources of error.
  2. You should not make any claim about the probability that the actual value falls in the CI.
As the editors of BASP point out, "a 95% confidence interval does not indicate that the parameter of interest has a 95% probability of being within the interval."  Ironically, the situation is worse when the sample size is large.  In that case, the CI is usually small, other sources of error dominate, and the CI is less likely to contain the actual value.

One other limitation to keep in mind: both p-values and confidence intervals are based on modeling decisions: a p-value is based on a model of the null hypothesis, and a CI is based on a model of the experimental process.  Modeling decisions are subjective; that is, reasonable people could disagree about the best model of a particular experiment.  For any non-trivial experiment, there is no unique, objective p-value or CI.

Bayesian inference

The editors of BASP write that the problem with Bayesian statistics is that "they depend on some sort of Laplacian assumption to generate numbers where none exist."  This is an indirect reference to the fact that Bayesian analysis depends on the choice of a prior distribution, and to Laplace's principle of indifference, which is an approach some people recommend for choosing priors.

The editors' comments suggest a misunderstanding of Bayesian methods.  If you use Bayesian methods as another way to compute p-values and confidence intervals, you are missing the point.  Bayesian methods don't do the same things better; they do different things, which are better.

Specifically, the result of Bayesian analysis is a posterior distribution, which includes all possible estimates and their probabilities.  Using posterior distributions, you can compute answers to questions like:
  1. What is the probability that a given hypothesis is true, compared to a suite of alternative hypotheses?
  2. What is the probability that the true value of a parameter falls in any given interval?
These are questions we care about, and their answers can be used directly to inform decision-making under uncertainty.
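As a minimal sketch of what a posterior distribution buys you, here is a grid approximation for a binomial proportion; the uniform prior, the data, and the interval are all invented for illustration, and the prior is exactly the kind of modeling choice the editors are worried about.

```python
import numpy as np

# Grid approximation of the posterior for a binomial proportion
grid = np.linspace(0, 1, 1001)
prior = np.ones_like(grid)        # uniform prior: a modeling choice
heads, tails = 140, 110           # hypothetical observed data
likelihood = grid**heads * (1 - grid)**tails
posterior = prior * likelihood
posterior /= posterior.sum()

# Question 2: probability the parameter falls in a given interval
p_in_interval = posterior[(grid >= 0.5) & (grid <= 0.6)].sum()

# Question 1: probability of one hypothesis vs. the alternative
p_biased_up = posterior[grid > 0.5].sum()
```

Both answers are direct probability statements about the parameter, which is exactly what a confidence interval declines to provide.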

But I don't want to get bogged down in a Bayesian-frequentist debate.  Let me get back to the BASP editorial.


Conventional statistical inference, using p-values and confidence intervals, is problematic, and it fosters bad science, as the editors of BASP claim.  These problems have been known for a long time, but previous attempts to instigate change have failed.  The BASP ban (and the reaction it provoked) might be just what we need to get things moving.

But the proposed solution is too blunt; statistical inference is broken but not useless.  Here is the approach I recommend:
  1. The effect size is the most important thing.  Papers that report statistical results should lead with the effect size and explain its practical impact in terms relevant to the context.
  2. The second most important thing -- a distant second -- is a confidence interval or standard error, which quantifies error due to random sampling.  This is a useful additional piece of information, provided that we don't forget about other sources of error.
  3. The third most important thing -- a distant third -- is a p-value, which provides a warning if an apparent effect could be explained by randomness.
  4. We should drop the arbitrary 5% threshold, and forget about the conceit that 5% is the false positive rate.
For me, the timing of the BASP editorial is helpful.  I am preparing a new tutorial on statistical inference, which I will present at PyCon 2015 in April.  The topic just got a little livelier!


"The CI quantifies uncertainty due to random sampling, but it says nothing about sampling bias or measurement error. Because these other sources of error are present in nearly all real studies, published CIs will contain the population value far less often than 90%."

A fellow redditor raised an objection I'll try to paraphrase:

"The 90% CI contains the true parameter 90% of the time because that's the definition of the 90% CI.  If you compute an interval that doesn't contain the true parameter 90% of the time, it's not a 90% CI."

The reason for the disagreement is that my correspondent and I were using different definitions of CI:
  1. I was defining CI by procedure; that is, a 90% CI is what you get if you compute the sampling distribution and take the 5th and 95th percentiles.
  2. My correspondent was defining CI by intent; that is, a CI is an interval that contains the true value 90% of the time.
If you use the first definition, the problem with CIs is that they don't contain the true value 90% of the time.

If you use the second definition, the problem with CIs is that our standard way of computing them doesn't work, at least not in the real world.

Once this point was clarified, my correspondent and I agreed about the conclusion: the fraction of published CIs that contain the true value of the parameter is less than 90%, and probably a lot less.
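The disagreement can be made concrete with a simulation: under the model, 90% CIs computed by the standard procedure cover the true value about 90% of the time, but a systematic measurement bias (invented here for illustration) makes coverage collapse.

```python
import numpy as np

rng = np.random.default_rng(0)

def coverage(bias=0.0, true_mean=0.0, n=100, trials=2000):
    """Fraction of normal-theory 90% CIs that contain the true mean,
    with an optional systematic bias added to every measurement."""
    hits = 0
    for _ in range(trials):
        sample = rng.normal(true_mean, 1, n) + bias
        se = sample.std(ddof=1) / np.sqrt(n)
        mean = sample.mean()
        hits += mean - 1.645 * se <= true_mean <= mean + 1.645 * se
    return hits / trials
```

With `bias=0.0` the procedure delivers close to its nominal 90% coverage; with a bias of a few standard errors, almost no interval contains the true value, even though each one was computed "correctly".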


Comments

  1. You're still giving undue credit to CIs as measures of precision: and I think your recommended approach - essentially sticking with orthodox statistical inference - is unjustifiable. We know that it's fundamentally, conceptually flawed, fosters bad science, lends itself to misuse and misinterpretation even by experts, and that there is a well-founded, good science-fostering, reason and intuition-congruent alternative. It just doesn't make sense to recommend sticking with it.

  2. "The apparent effect might be due to chance; that is, the difference might appear in a random sample, but not in the general population."

    p-values don't tell you this, either. All a p-value tells you is the probability of your data (or data more extreme) given the null is true. They say nothing about whether your data is "due to chance."

    1. I'm not sure I understand your objection. I was using "due to chance" as a shorthand for "under the null hypothesis", since the null hypothesis is a model of random variation if there is no actual effect.

      The sentence you quoted is one of four possible explanations for an apparent effect: it might be caused by random variation in the absence of a real effect.

      As you said, the p-value is the probability of the apparent effect under the null hypothesis, which is the probability of the effect under (at least a model of) random chance.

      Can you clarify what you are objecting to?

  3. GamingLifer nails it. Allen, you appear to have made *the* mistake the editors are so concerned about re p-values.

  4. Allen, his/her point is that the p value is "the probability of the data given chance". It is not "the probability of chance given the data".