Thursday, October 6, 2011

Repeated tests: how bad can it be?

A couple of weeks ago Red Dave wrote this blog post about the dangers of A/B testing with repeated tests.  I think he's right, and he does a good job of explaining why.  But to get a better understanding of the issue, I wrote some simulations, and it turns out that things can be even worse than Dave suggests.

As an example of A/B testing, suppose you are advertising a business and you develop two versions of an ad, A and B, with different text, graphics, etc.  You want to know which ad is more effective.

If you post your ads online, you can find out how many times each version was shown (number of impressions), and how many people clicked on each ad (number of clicks).  The number of clicks per impression is the "click-through rate," or CTR.  At least, that's what Google calls it.

If you show each ad 1000 times, you might see a CTR of 1% for Version A and 2% for Version B.  To see whether this difference is significant, you can use a chi-square test, which compares the actual number of clicks for each ad with the expected number.  The result is a p-value, which is the probability of seeing a difference as big as the observed one if the ads are actually identical.  If the p-value is small, the difference is unlikely to be due to chance, so it is more likely to be real.
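With the hypothetical numbers above (10 clicks out of 1000 for A, 20 out of 1000 for B), the test takes just a few lines with SciPy.  Note that SciPy's contingency-table test uses df=1 for a 2x2 table, because it estimates the pooled click rate from the data; the simulations later in this post use a setup where the null hypothesis specifies the expected counts directly.

```python
import scipy.stats

# Hypothetical counts: 1000 impressions per version,
# 10 clicks on A (1% CTR) and 20 clicks on B (2% CTR).
table = [[10, 990],   # version A: clicks, non-clicks
         [20, 980]]   # version B: clicks, non-clicks

# correction=False turns off Yates's continuity correction, so the
# statistic matches the plain chi-squared formula.
chi2, p, df, expected = scipy.stats.chi2_contingency(table, correction=False)
print(chi2, p)   # chi2 is about 3.38, p is about 0.066
```

So even though B's CTR is double A's, with these sample sizes the difference would not quite be significant at the 5% level.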

So far, this is just classical hypothesis testing.  No problem.  Here's where things get ugly.

If, instead of showing each ad 1000 times, you run the ad for a while, and then check for a statistically-significant difference, run a while longer, check again, and repeat until you get a significant difference, then what you are doing is Bogus.

And when I say "Bogus", I mean capital-B Bogus, because the p-values you compute are not going to be off by a factor of 2, which would actually be acceptable for most purposes.  They are not going to be off by a factor of 10, which would make them meaningless.  They are going to be completely and totally wrong, to the point where, if you run long enough, you are nearly certain to see a "statistically significant" difference, even if both versions are the same.

To see why, let's look at the case where you check for significance after every ad impression.  Suppose you show version A or version B with equal probability, and in either case the click-through rate is 50%.  After 100 impressions, you expect to show A and B 50 times each, and you expect 25 clicks on each.

After each impression, we can compute the CTR for A and B, and plot the difference as a random walk.  Here's what it looks like for 10 simulations of 200 impressions:

When the number of trials is small, the difference in CTR can be big, but for n>50 the law of large numbers kicks in and the difference gets small.
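Here is a sketch of the simulation behind that figure (the function name and seed are mine; plotting one curve per seed reproduces the random walks):

```python
import random

def ctr_difference_walk(num_impressions, ctr=0.5, seed=0):
    """Show version A or B with equal probability, simulate a click
    with probability ctr, and record the running difference in CTR
    after each impression."""
    rng = random.Random(seed)
    shown = {'A': 0, 'B': 0}
    clicks = {'A': 0, 'B': 0}
    diffs = []
    for _ in range(num_impressions):
        version = rng.choice('AB')
        shown[version] += 1
        if rng.random() < ctr:
            clicks[version] += 1
        # CTR difference is defined once both versions have been shown.
        if shown['A'] and shown['B']:
            diffs.append(clicks['A'] / shown['A'] - clicks['B'] / shown['B'])
    return diffs
```

Running this for ten seeds and plotting each list of differences gives ten random walks like the ones in the figure.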

Instead of computing the difference, we can compute the chi-square statistic, which yields another random walk.  Here's what it looks like after 1000 steps:

After an initial transient, the distribution of the lines stabilizes, which shows that for large N the distribution of the chi-squared statistic does not depend on N.  That is exactly why we can use the chi-squared distribution to compute p-values analytically.
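For reference, here is the statistic itself.  Under the null hypothesis each of the four cells (clicks and misses for A and B) has expected count N/4, because the versions are shown with equal probability and the CTR is 50%:

```python
def chi_squared_stat(clicks_a, misses_a, clicks_b, misses_b):
    """Pearson chi-squared statistic for the four observed counts,
    with expected count N/4 in each cell under the null hypothesis."""
    observed = [clicks_a, misses_a, clicks_b, misses_b]
    n = sum(observed)
    expected = n / 4
    return sum((o - expected)**2 / expected for o in observed)

print(chi_squared_stat(25, 25, 25, 25))   # 0.0 -- exactly the expected table
print(chi_squared_stat(30, 20, 25, 25))   # 2.0
```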

SciPy provides a function that evaluates the chi-squared CDF.  The following function uses it to look up chi-squared values and return the corresponding p-value:

import scipy.stats

def Pvalue(chi2, df):
    """Probability of getting chi2 or greater from a chi-squared distribution.

    chi2: observed chi-squared statistic
    df: degrees of freedom
    """
    return 1 - scipy.stats.chi2.cdf(chi2, df)

In this example, the number of degrees of freedom is 3, because there are four values in the table (hits and misses for A and B), but they are constrained to add up to N.
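Putting the statistic and the lookup together, with a hypothetical table of counts:

```python
import scipy.stats

# Hypothetical table: 30 clicks / 20 misses for A, 25 / 25 for B.
observed = [30, 20, 25, 25]
n = sum(observed)
expected = n / 4
chi2 = sum((o - expected)**2 / expected for o in observed)
p = 1 - scipy.stats.chi2.cdf(chi2, 3)   # df = 3, as argued above
print(chi2, p)   # 2.0 and about 0.57 -- nowhere near significant
```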

Now here's the key question: if we take values that are actually from a chi-squared distribution and look up the corresponding p-values, what is the distribution of the resulting p-values?

Answer: they are uniformly distributed.  Why?  Because we are using the chi-squared CDF to get their percentile ranks.  By the definition of the CDF, 1% of them fall below the 1st percentile, 2% of them fall below the 2nd percentile, and so on.
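This is easy to check numerically (sample size and seed here are arbitrary choices of mine):

```python
import scipy.stats

# Draw from a chi-squared distribution with df=3, then convert each
# value to a p-value using the same distribution's CDF.
values = scipy.stats.chi2.rvs(3, size=10000, random_state=1)
pvalues = 1 - scipy.stats.chi2.cdf(values, 3)

# If the p-values are uniform, about half should fall below 0.5
# and about 5% should fall below 0.05.
print((pvalues < 0.5).mean())
print((pvalues < 0.05).mean())
```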

If we plot the p-values as a random walk, here's what it looks like (20 simulations, 2000 trials):

For large N, the lines are spread out uniformly from 0 to 1, as expected.

The red line shows the significance criterion, α, which determines the false positive rate.  If α=5%, then at any point in time, we expect 5% of the lines to be below this threshold.  And in the figure, that looks about right.

If we stop the test at a pre-determined time, or at a randomly-chosen time, we expect to find one line out of twenty below the threshold.  But if you do repeated tests, checking for significance after every trial, then the relevant question is how many of these lines cross below the threshold, ever?

In this example, it is almost 50%.  After 10000 trials, it is 85%.  As the number of trials increases, the false positive rate approaches 100%.
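Here is a sketch of the simulation behind those numbers (the burn-in of 20 impressions, which lets the chi-squared approximation settle, is an arbitrary choice of mine):

```python
import random
import scipy.stats

def ever_significant(num_trials, ctr=0.5, alpha=0.05, seed=None):
    """Simulate one A/B test with identical versions, checking for
    significance after every impression; return True if the test is
    ever "significant" at level alpha."""
    rng = random.Random(seed)
    # chi2 > crit is equivalent to p < alpha, and much faster to check.
    crit = scipy.stats.chi2.ppf(1 - alpha, 3)
    counts = [0, 0, 0, 0]   # clicks A, misses A, clicks B, misses B
    for i in range(1, num_trials + 1):
        version = rng.randrange(2)          # 0 = A, 1 = B
        click = rng.random() < ctr
        counts[2 * version + (0 if click else 1)] += 1
        if i < 20:
            continue   # arbitrary burn-in before testing
        expected = i / 4
        chi2 = sum((c - expected)**2 / expected for c in counts)
        if chi2 > crit:
            return True
    return False

runs = 100
hits = sum(ever_significant(1000, seed=s) for s in range(runs))
print(hits / runs)   # far above the nominal 5%
```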

A standard quip among statisticians goes like this: "A frequentist is a person whose long-run ambition is to be wrong 5% of the time."  Here's an update: "A/B testing with repeated tests is totally legitimate, if your long-run ambition is to be wrong 100% of the time."

You can download the code I used to generate these figures from  The file is named after the Khan Academy because that's the context of the blog post that brought the problem to my attention.

Note: in these examples, I used a CTR of 50%, which is much higher than you are likely to get with an online ad campaign: 1% is considered good.  If you run these simulations with more realistic CTRs, the random walks get jaggier, due to the statistics of small numbers, but the overall effect is the same.

UPDATE August 5, 2013:  In the comments below, I got the following inquiry:

The d.f. for chi square is 1, not 3, as the null hypothesis of independence adds 2 more constraints. For JxK 2-way tables, df for independence = (I-1)(J-1). See any stat text, e.g. A. Agresti, 2002, Categorical Data Analysis, 2nd Ed, Wiley, p. 79

I maintain that the argument I made above is correct: there are four numbers in the table, but because they are constrained to add up to N, there are three degrees of freedom.  But getting df right is notoriously tricky, so let's check.  I ran 100 simulations, computed the resulting chi2 values and plotted their CDF.  Then I compared it to chi2 distributions with df = 1, 2, 3.  The figure below shows the result:

This figure shows that the values from my simulation are consistent with the chi2 distribution with df=3 (and substantially different from a chi2 distribution with df=1).
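The check can be reproduced along these lines; instead of eyeballing CDFs, this sketch uses a Kolmogorov-Smirnov test against each candidate distribution:

```python
import random
import scipy.stats

def final_chi2(num_trials=1000, ctr=0.5, seed=None):
    """Simulate one run with identical versions and return the
    chi-squared statistic for the final table of four counts."""
    rng = random.Random(seed)
    counts = [0, 0, 0, 0]   # clicks A, misses A, clicks B, misses B
    for _ in range(num_trials):
        version = rng.randrange(2)
        click = rng.random() < ctr
        counts[2 * version + (0 if click else 1)] += 1
    expected = num_trials / 4
    return sum((c - expected)**2 / expected for c in counts)

values = [final_chi2(seed=s) for s in range(100)]

# KS test of the simulated values against chi2 with df = 1, 2, 3.
# A tiny p-value means the candidate distribution is rejected.
pvals = {}
for df in (1, 2, 3):
    result = scipy.stats.kstest(values, scipy.stats.chi2(df).cdf)
    pvals[df] = result.pvalue
    print(df, result.pvalue)
```

In my runs, df=1 is rejected decisively while df=3 is consistent with the simulated values, matching the figure.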


  1. Great post. I love the visualisations.

    This also reminds me I need to finish that post on multi armed bandit algorithms. Partly because they work, partly because of the name but mainly because "Originally considered by Allied scientists in World War II, it proved so intractable that it was proposed the problem be dropped over Germany so that German scientists could also waste their time on it."

    Also thanks for saying such nice things about my post

  2. The d.f. for chi square is 1, not 3, as the null hypothesis of independence adds 2 more constraints. For JxK 2-way tables,
    df for independence = (I-1)(J-1). See any stat text, e.g. A. Agresti, 2002, Categorical Data Analysis, 2nd Ed, Wiley, p. 79

    1. I think the argument I made above is correct. Please see the update I just added with an additional figure.

  3. I appreciate the dialogue about degrees of freedom prompted by Unknown's comment. I can't remember where I picked it up, but I always thought that for online A/B tests, the # of DoF was determined by the number of variants. In other words, an A/B test would have a DoF of 1, A/B/C would have DoF of 2, etc. After reading this post, I went back through my stats textbook and realized that you're right. While I can't cite any resources off the top of my head, I'm all but positive that I'd read one DoF per variant in a marketing forum or blog post somewhere. I'm glad to realize the error; I'm now updating the tools that I've built.

    BTW, there is A LOT of misinformation about statistics & optimization in the marketing community. Almost every A/B testing tool available encourages letting your tests run until confidence is reached. Many of the prominent tools for MVT encourage Taguchi method for speeding up test results. I'm sure this isn't limited to the world of online marketing, but as I learn more about statistics I'm shocked by the amount of bad math offered by otherwise reputable companies. I'm preaching against this within my company (which is why I've built my own tools to begin with).

    This leads to a question. For online A/B/n tests, is there ever a circumstance where you'll have an even DoF? In other words, if version A gets no clicks, would there only be 3 outcomes (no for ver A, yes/no for ver B), leading to DoF of 2? Or would you still count 4 possible outcomes, leading to DoF of 3?

    Thanks for informative post!

    1. Sorry for the slow reply -- I missed your question. The smallest case I can think of with an even DoF is A/B/C testing with three possible outcomes (for example, a user might click on a pop-up ad, close it, or ignore it). Then there are 9 cells in the table and one constraint, so DoF=8. More generally, the number of test conditions and the number of outcomes must both be odd.

  4. Allen,

    I so liked your take on how "legitimate" it is to do repeated A/B tests that I quoted you in an article of mine: I argue that online marketing should be much more science-based and I introduce some statistical concepts. Your quote was perfect for my "arbitrary stopping" part, thanks!