A couple of weeks ago Red Dave wrote this blog post about the dangers of A/B testing with repeated tests. I think he's right, and does a good job of explaining why. But to get a better understanding of the issue, I wrote some simulations, and things can be even worse than Dave suggests.
As an example of A/B testing, suppose you are advertising a business and you develop two versions of an ad, A and B, with different text, graphics, etc. You want to know which ad is more effective.
If you post your ads online, you can find out how many times each version was shown (number of impressions), and how many people clicked on each ad (number of clicks). The number of clicks per impression is the "click-through rate," or CTR. At least, that's what Google calls it.
If you show each ad 1000 times, you might see a CTR of 1% for Version A and 2% for Version B. To see whether this difference is significant, you can use a chi-square test, which compares the actual number of clicks for each ad with the expected number. The result is a p-value, which tells you the chance of seeing a difference as big as that if the ads are actually the same. If the p-value is small, the effect is unlikely to be due to chance, so it is more likely to be real.
So far, this is just classical hypothesis testing. No problem. Here's where things get ugly.
If, instead of showing each ad 1000 times, you run the ad for a while, and then check for a statistically-significant difference, run a while longer, check again, and repeat until you get a significant difference, then what you are doing is Bogus.
And when I say "Bogus", I mean capital-B Bogus, because the p-values you compute are not going to be off by a factor of 2, which would actually be acceptable for most purposes. They are not going to be off by a factor of 10, which would make them meaningless. They are going to be completely and totally wrong, to the point where, if you run long enough, you are nearly certain to see a "statistically significant" difference, even if both versions are the same.
To see why, let's look at the case where you check for significance after every ad impression. Suppose you show version A or version B with equal probability, and in either case the click-through rate is 50%. After 100 impressions, you expect to show A and B 50 times each, and you expect 25 clicks on each.
After each impression, we can compute the CTR for A and B, and plot the difference as a random walk. Here's what it looks like for 10 simulations of 200 impressions:
When the number of trials is small, the difference in CTR can be big, but for n>50 the law of large numbers kicks in and the difference gets small.
Instead of computing the difference, we can compute the chi-square statistic, which yields another random walk. Here's what it looks like after 1000 steps:
After an initial transient, the distribution of the lines stabilizes, which shows that for large N, the distribution of the chi-square stat does not depend on N, which is exactly why we can use the chi-square distribution to compute p-values analytically.
SciPy provides a function that evaluates the chi-squared CDF. The following function uses it to look up chi-squared values and return the corresponding p-value:
def Pvalue(chi2, df):
"""Probability of getting chi2 or greater from a chi-squared distribution.
chi2: observed chi-squared statistic
df: degrees of freedom
return 1 - scipy.stats.chi2.cdf(chi2, df)
In this example, the number of degrees of freedom is 3, because there are four values in the table (hits and misses for A and B), but they are constrained to add up to N.
Now here's the key question: if we take values that are actually from a chi-squared distribution and look up the corresponding p-values, what is the distribution of the resulting p-values?
Answer: they are uniformly distributed. Why? Because we are using the chi-squared CDF to get their percentile ranks. By the definition of the CDF, 1% of them fall below the 1st percentile, 2% of them fall below the 2nd percentile, and so on.
If we plot the p-values as a random walk, here's what it looks like (20 simulations, 2000 trials):
The red line shows the significance criterion, α, which determines the false positive rate. If α=5%, then at any point in time, we expect 5% of the lines to be below this threshold. And in the figure, that looks about right.
If we stop the test at a pre-determined time, or at a randomly-chosen time, we expect to find one line out of twenty below the threshold. But if you do repeated tests, checking for significance after every trial, then the relevant question is how many of these lines cross below the threshold, ever?
In this example, it is almost 50%. After 10000 trials, it is 85%. As the number of trials increases, the false positive rate approaches 100%.
A standard quip among statisticians goes like this: "A frequentist is a person whose long-run ambition is to be wrong 5% of the time." Here's an update: "A/B testing with repeated tests is totally legitimate, if your long-run ambition is to be wrong 100% of the time."
You can download the code I used to generate these figures from thinkstats.com/khan.py. The file is named after the Khan Academy because that's the context of the blog post that brought the problem to my attention.
Note: in these examples, I used a CTR of 50%, which is much higher than you are likely to get with an online ad campaign: 1% is considered good. If you run these simulations with more realistic CTRs, the random walks get jaggier, due to the statistics of small numbers, but the overall effect is the same.