Thursday, October 27, 2011

All your Bayes are belong to us!


This week's post contains solutions to My Favorite Bayes's Theorem Problems, and one new problem.  If you missed last week's post, go back and read the problems before you read the solutions!

If you don't understand the title of this post, brush up on your memes.

1) The first one is a warm-up problem.  I got it from Wikipedia (but it's no longer there):
Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of Bowl #1?
First the hypotheses:
A: the cookie came from Bowl #1
B: the cookie came from Bowl #2

And the priors:
P(A) = P(B) = 1/2

The evidence:
E: the cookie is plain

And the likelihoods:
P(E|A) = prob of a plain cookie from Bowl #1 = 3/4
P(E|B) = prob of a plain cookie from Bowl #2 = 1/2

Plug in Bayes's theorem and get
P(A|E) = 3/5

You might notice that when the priors are equal they drop out of the BT equation, so you can often skip a step.
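If you want to check the update mechanically, here is a quick sketch in Python (the dictionary-of-hypotheses style is mine, not code from the post; exact fractions keep the arithmetic clean):

```python
from fractions import Fraction

# priors: Fred picks each bowl with equal probability
prior = {'Bowl 1': Fraction(1, 2), 'Bowl 2': Fraction(1, 2)}

# likelihoods: probability of a plain cookie from each bowl
likelihood = {'Bowl 1': Fraction(30, 40), 'Bowl 2': Fraction(20, 40)}

# Bayes's theorem: multiply prior by likelihood, then normalize
unnorm = {h: prior[h] * likelihood[h] for h in prior}
total = sum(unnorm.values())
posterior = {h: p / total for h, p in unnorm.items()}

print(posterior['Bowl 1'])  # 3/5
```

The same three-line pattern (multiply, sum, divide) works for every problem below.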

2) This one is also an urn problem, but a little trickier.
The blue M&M was introduced in 1995.  Before then, the color mix in a bag of plain M&Ms was (30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan).  Afterward it was (24% Blue , 20% Green, 16% Orange, 14% Yellow, 13% Red, 13% Brown). 
A friend of mine has two bags of M&Ms, and he tells me that one is from 1994 and one from 1996.  He won't tell me which is which, but he gives me one M&M from each bag.  One is yellow and one is green.  What is the probability that the yellow M&M came from the 1994 bag?
Hypotheses:
A: Bag #1 from 1994 and Bag #2 from 1996
B: Bag #2 from 1994 and Bag #1 from 1996

Again, P(A) = P(B) = 1/2.

The evidence is:
E: yellow from Bag #1, green from Bag #2

We get the likelihoods by multiplying the probabilities for the two M&Ms:

P(E|A) = (0.2)(0.2)
P(E|B) = (0.1)(0.14)

For example, P(E|B) is the probability of a yellow M&M in 1996 (0.14) times the probability of a green M&M in 1994 (0.1).

Plugging the likelihoods and the priors into Bayes's theorem, we get P(A|E) = 40 / 54 ~ 0.74

By introducing the terms Bag #1 and Bag #2, rather than "the bag the yellow M&M came from" and "the bag the green came from," I avoided the part of this problem that can be tricky: keeping the hypotheses and the evidence straight.
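A quick check in Python (a sketch; the variable names are mine):

```python
from fractions import Fraction

mix1994 = {'yellow': Fraction(20, 100), 'green': Fraction(10, 100)}
mix1996 = {'yellow': Fraction(14, 100), 'green': Fraction(20, 100)}

# A: Bag #1 is the 1994 bag, so the yellow came from 1994, the green from 1996
like_A = mix1994['yellow'] * mix1996['green']

# B: Bag #1 is the 1996 bag, so the yellow came from 1996, the green from 1994
like_B = mix1996['yellow'] * mix1994['green']

# the equal priors cancel
posterior_A = like_A / (like_A + like_B)
print(posterior_A)  # 20/27, the same as 40/54
```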

3) This one is from one of my favorite books, David MacKay's Information Theory, Inference, and Learning Algorithms:
Elvis Presley had a twin brother who died at birth.  What is the probability that Elvis was an identical twin?
To answer this one, you need some background information: According to the Wikipedia article on twins:  ``Twins are estimated to be approximately 1.9% of the world population, with monozygotic twins making up 0.2% of the total---and 8% of all twins.''

There are several ways to set up this problem; I think the easiest is to think about twin birth events, rather than individual twins, and to take the fact that Elvis was a twin as background information.

So the hypotheses are
A: Elvis's birth event was an identical birth event
B: Elvis's birth event was a fraternal twin event

If identical twins are 8% of all twins, then identical birth events are 8% of all twin birth events, so the priors are

P(A) = 8%
P(B) = 92%

The relevant evidence is
E: Elvis's twin was male

So the likelihoods are
P(E|A) = 1
P(E|B) = 1/2

Identical twins are necessarily the same sex, but a fraternal twin is equally likely to be male or female (or, at least, I assume so).  So

P(A|E) = 8/54 ~ 0.15.

The tricky part of this one is realizing that the sex of the twin provides relevant information!
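Here is the computation as a sketch in Python:

```python
from fractions import Fraction

prior = {'identical': Fraction(8, 100), 'fraternal': Fraction(92, 100)}

# an identical twin of Elvis is male for sure; a fraternal twin, half the time
likelihood = {'identical': Fraction(1), 'fraternal': Fraction(1, 2)}

unnorm = {h: prior[h] * likelihood[h] for h in prior}
posterior_identical = unnorm['identical'] / sum(unnorm.values())
print(posterior_identical)  # 4/27, about 0.15
```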

4) Also from MacKay's book:
Two people have left traces of their own blood at the scene of a crime.  A suspect, Oliver, is tested and found to have type O blood.  The blood groups of the two traces are found to be of type O (a common type in the local population, having frequency 60%) and of type AB (a rare type, with frequency 1%).  Do these data (the blood types found at the scene) give evidence in favour [sic] of the proposition that Oliver was one of the two people whose blood was found at the scene?
For this problem, we are not asked for a posterior probability; rather we are asked whether the evidence is incriminating.  This depends on the likelihood ratio, but not the priors.

The hypotheses are
X: Oliver is one of the people whose blood was found
Y: Oliver is not one of the people whose blood was found

The evidence is
E: two blood samples, one O and one AB

We don't need priors, so we'll jump to the likelihoods.  If X is true, then Oliver accounts for the O blood, so we just have to account for the AB sample:

P(E|X) = 0.01

If Y is true, then we assume the two samples are drawn from the general population at random.  The chance of getting one O and one AB is

P(E|Y) = 2(0.6)(0.01) = 0.012

Notice that there is a factor of two here because there are two permutations that yield E.

So the evidence is slightly more likely under Y, which means that it is actually exculpatory!  This problem is a nice reminder that evidence that is consistent with a hypothesis does not necessarily support the hypothesis.
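Since only the likelihood ratio matters, the whole computation fits in a few lines (a sketch):

```python
p_O, p_AB = 0.60, 0.01  # frequencies of the two blood types

# X: Oliver accounts for the O trace, so only the AB trace is unexplained
like_X = p_AB

# Y: two unknown people left the traces; either one could have left the O
like_Y = 2 * p_O * p_AB

ratio = like_X / like_Y
print(round(ratio, 3))  # 0.833: less than 1, so the evidence favors Y
```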


5) I like this problem because it doesn't provide all of the information.  You have to figure out what information is needed and go find it.
According to the CDC, ``Compared to nonsmokers, men who smoke are about 23 times more likely to develop lung cancer and women who smoke are about 13 times more likely.''
If you learn that a woman has been diagnosed with lung cancer, and you know nothing else about her, what is the probability that she is a smoker?
I find it helpful to draw a tree:

If y is the fraction of women who smoke, and x is the fraction of nonsmokers who get lung cancer, the number of smokers who get cancer is proportional to 13xy, and the number of nonsmokers who get lung cancer is proportional to x(1-y).

Of all women who get lung cancer, the fraction who smoke is 13xy / (13xy + x(1-y)).

The x's cancel, so it turns out that we don't actually need to know the absolute risk of lung cancer, just the relative risk.  But we do need to know y, the fraction of women who smoke.  According to the CDC, y was 17.9% in 2009.  So we just have to compute

13y / (13y + 1-y) ~ 74%

This is higher than many people guess.
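As a sketch in Python, with y taken from the CDC figure quoted above:

```python
y = 0.179  # fraction of women who smoked (CDC, 2009)

# fraction of women with lung cancer who smoke; the absolute risk x cancels
p_smoker = 13 * y / (13 * y + (1 - y))
print(round(p_smoker, 3))  # 0.739
```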

6) Next, a mandatory Monty Hall Problem.  First, here's the general description of the scenario, from Wikipedia:
Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say Door A [but the door is not opened], and the host, who knows what's behind the doors, opens Door B, which has a goat. He then says to you, "Do you want to pick Door C?" Is it to your advantage to switch your choice?
The answer depends on the behavior of the host when the car is behind Door A.  In this case the host can open either B or C.  Suppose he chooses B with probability p and C otherwise.  What is the probability that the car is behind Door A (as a function of p)?

The hypotheses are
A: the car is behind Door A
B: the car is behind Door B
C: the car is behind Door C

And the priors are
P(A) = P(B) = P(C) = 1/3

The likelihoods are
P(E|A) = p, because in this case Monty has a choice and chooses B with probability p,
P(E|B) = 0, because if the car were behind B, Monty would not have opened B, and
P(E|C) = 1, because in this case Monty has no choice.

Applying Bayes's Theorem,
P(A|E) = p / (1+p)

In the canonical scenario, p=1/2, so P(A|E) = 1/3, which is the canonical solution.  If p=0, P(A|E) = 0, so you can switch and win every time (when Monty opens B, that is).  If p=1, P(A|E) = 1/2, so in that case it doesn't matter whether you stick or switch.

When Monty opens C, P(A|E) = (1-p) / (2-p)

[Correction: the answer in this case is not (1-p) / (1+p), which is what I wrote in a previous version of this article.  Sorry!].
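Both cases fit in one small function (a sketch; since the priors are equal they cancel, so only the likelihoods appear):

```python
def prob_car_behind_A(p, opened='B'):
    """Posterior probability that the car is behind Door A.

    p: probability that Monty opens Door B when the car is behind A
    opened: the door Monty actually opened, 'B' or 'C'
    """
    if opened == 'B':
        like = {'A': p, 'B': 0.0, 'C': 1.0}
    else:
        like = {'A': 1.0 - p, 'B': 1.0, 'C': 0.0}
    return like['A'] / sum(like.values())

print(prob_car_behind_A(0.5))       # 1/3, the canonical answer
print(prob_car_behind_A(0.0))       # 0: switching always wins
print(prob_car_behind_A(1.0))       # 1/2: stick or switch, it doesn't matter
print(prob_car_behind_A(0.5, 'C'))  # 1/3 again, by symmetry
```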

7) And finally, here is a new problem I just came up with:
If you meet a man with (naturally) red hair, what is the probability that neither of his parents has red hair?
Hints: About 2% of the world population has red hair.  You can assume that the alleles for red hair are purely recessive.  Also, you can assume that the Red Hair Extinction theory is false, so you can apply the Hardy–Weinberg principle.

Solution to this one next week!



Please let me know if you have suggestions for more problems. An ideal problem should meet at least some of these criteria:
1) It should be based on a context that is realistic or at least interesting, and not too contrived. 
2) It should make good use of Bayes's Theorem -- that is, it should be easier to solve with BT than without.
3) It should involve some real data, which the solver might have to find.
4) It might involve a trick, but should not be artificially hard.

If you send me something that is not under copyright, or is usable under fair use, I will include it in the next edition of Think Stats and add you to the contributors list.


Thursday, October 20, 2011

My favorite Bayes's Theorem problems

This week: some of my favorite problems involving Bayes's Theorem.  Next week: solutions.

1) The first one is a warm-up problem.  I got it from Wikipedia (but it's no longer there):
Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of Bowl #1?
This is a thinly disguised urn problem.  It is simple enough to solve without Bayes's Theorem, but good for practice.

2) This one is also an urn problem, but a little trickier.
The blue M&M was introduced in 1995.  Before then, the color mix in a bag of plain M&Ms was (30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan).  Afterward it was (24% Blue , 20% Green, 16% Orange, 14% Yellow, 13% Red, 13% Brown). 
A friend of mine has two bags of M&Ms, and he tells me that one is from 1994 and one from 1996.  He won't tell me which is which, but he gives me one M&M from each bag.  One is yellow and one is green.  What is the probability that the yellow M&M came from the 1994 bag?
3) This one is from one of my favorite books, David MacKay's "Information Theory, Inference, and Learning Algorithms":
Elvis Presley had a twin brother who died at birth.  What is the probability that Elvis was an identical twin?
To answer this one, you need some background information: According to the Wikipedia article on twins:  ``Twins are estimated to be approximately 1.9% of the world population, with monozygotic twins making up 0.2% of the total---and 8% of all twins.''

4) Also from MacKay's book:
Two people have left traces of their own blood at the scene of a crime.  A suspect, Oliver, is tested and found to have type O blood.  The blood groups of the two traces are found to be of type O (a common type in the local population, having frequency 60%) and of type AB (a rare type, with frequency 1%).  Do these data (the blood types found at the scene) give evidence in favour [sic] of the proposition that Oliver was one of the two people whose blood was found at the scene?
5) I like this problem because it doesn't provide all of the information.  You have to figure out what information is needed and go find it.
According to the CDC, ``Compared to nonsmokers, men who smoke are about 23 times more likely to develop lung cancer and women who smoke are about 13 times more likely.''
If you learn that a woman has been diagnosed with lung cancer, and you know nothing else about her, what is the probability that she is a smoker?
6) And finally, a mandatory Monty Hall Problem.  First, here's the general description of the scenario, from Wikipedia:
Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say Door A [but the door is not opened], and the host, who knows what's behind the doors, opens another door, say Door B, which has a goat. He then says to you, "Do you want to pick Door C?" Is it to your advantage to switch your choice?
The answer depends on the behavior of the host if the car is behind Door A.  In this case the host can open either B or C.  Suppose he chooses B with probability p and C otherwise.  What is the probability that the car is behind Door A (as a function of p)?

If you like this problem, you might also like the Blinky Monty Problem.

Solutions next week!


Please let me know if you have suggestions for more problems. An ideal problem should meet at least some of these criteria:
1) It should be based on a context that is realistic or at least interesting, not too contrived. 
2) It should make good use of Bayes's Theorem -- that is, it should be easier to solve with BT than without.
3) It should involve some real data, which the solver might have to find.
4) It might involve a trick, but should not be artificially hard.

If you send me something that is not under copyright, or is usable under fair use, I will include it in the next edition of Think Stats and add you to the contributors list.

Monday, October 17, 2011

The Blinky Monty Problem

I read Jason Rosenhouse's book about The Monty Hall Problem recently, and I use the problem as an example in my statistics class.  Last semester I wrote a variation of the problem that turns out to be challenging, and a motivating problem for Bayesian estimation.  Here's what I call the "Blinky Monty Problem."
Suppose you are on Let's Make a Deal and you are playing the Monty Hall Game, with one twist.  Before you went on the show you analyzed tapes of previous shows and discovered that Monty has a tell: when the contestant picks the correct door, Monty is more likely to blink.
Of the 18 shows you watched, the contestant chose the correct door 5 times, and Monty blinked three of those times.  Of the other 13 times, Monty blinked three times. 
Assume that you choose Door A.  Monty opens door B and blinks.  What should you do, and what is your chance of winning?

To get started, let's review the standard version of the Monty Hall problem (from Wikipedia):
"Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say A [but the door is not opened], and the host, who knows what's behind the doors, opens another door, say B, which has a goat. He then says to you, "Do you want to pick door C?" Is it to your advantage to switch your choice?"
We can solve this problem using Bayes's Theorem.  There are two relevant hypotheses:

A: the car is behind door A
C: the car is behind door C

Before Monty opens door B, the prior probability for both hypotheses is 1/3.  Now the evidence is

E: Monty opened door B.

To compute the posterior probabilities, we need the likelihoods P(E|A) and P(E|C).

P(E|A) is the probability of the evidence if the car is behind door A.  In this case Monty has a choice; he could have opened door B or C.  In the canonical version of the problem, he chooses at random, so

P(E|A) = 1/2

P(E|C) is the probability of the evidence if the car is behind door C.  In this case Monty has no choice, so

P(E|C) = 1

Applying Bayes's Theorem yields:

P(A|E) = 1/3
P(C|E) = 2/3

So if you switch to Door C, you increase the chances of winning from 1/3 to 2/3.  This is the standard answer to the standard version of the problem.  If you are not familiar with the Monty Hall problem and this answer comes as a shock to you, please don't take it out on me.  Go read about it, starting here, and come back when you have calmed down.

Are you ready to go on?  Ok.  We can do a similar analysis for the Blinky Monty problem, but now we have more evidence to consider:

E: Monty opens Door B and blinks.

According to the scouting report, the probability that Monty blinks if the car is behind A is 3/5, so

P(E|A) = (1/2)(3/5) = 3/10

If the car is behind C, Monty blinks 3/13 of the time, so

P(E|C) = (1)(3/13) = 3/13

Plugging into Bayes's Theorem, we get P(A|E) = 13/23 ~ 0.57, so in this scenario you are slightly better off sticking with Door A.
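Here's the arithmetic as a sketch in Python:

```python
p_blink_A = 3 / 5   # estimated blink rate when the contestant's door is right
p_blink_C = 3 / 13  # estimated blink rate when it is wrong

like_A = 0.5 * p_blink_A  # Monty opens B half the time, then blinks
like_C = 1.0 * p_blink_C  # Monty must open B, then blinks

post_A = like_A / (like_A + like_C)
print(round(post_A, 2))  # 0.57
```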

But is that all there is to it?  No!  And here's where this blog earns the name "Probably Overthinking It." Remember that the probabilities for Monty's blinking are estimates based on a sample, so we have some uncertainty about them.

To take that uncertainty into account, we have to

1) Use the scouting report to estimate the distribution of P(blink|A) and P(blink|C).

2) For each pair of values from those distributions, compute P(A|E).

3) Apply the law of total probability to get the overall P(A|E).

That might sound complicated, but it is straightforward to estimate computationally.

First, let me take a minute to make fun of frequentists.  According to conventional hypothesis testing, the data from the scouting report is not statistically significant.  If the null hypothesis is that Monty has the same chance of blinking regardless of the location of the car, the p-value of the sample we saw is 0.27 (I used Fisher's exact test, computed using this online calculator).  We can't reject the null hypothesis, so if we play by the rules of conventional hypothesis testing, I guess that means we can't take advantage of Monty's tell.  If you are a committed frequentist, you should stop reading now.

Good.  Now that the riff-raff are gone, let's proceed.  Here's how we make the posterior distribution for P(blink|doorA):

    doorA = MakeUniformSuite(0.0, 1.0, 101, name='Door A')
    evidence = 3, 2
    Update(doorA, evidence)

MakeUniformSuite() makes a Pmf of 101 values equally spaced between 0 and 1, with equal probability. Update() does the usual Bayesian update:

def Update(suite, evidence):
    """Updates a suite of hypotheses based on new evidence.

    Modifies the suite directly; if you want to keep the original, make
    a copy.

    Args:
        suite: Pmf object
        evidence: whatever kind of object Likelihood expects
    """
    for hypo in suite.Values():
        likelihood = Likelihood(evidence, hypo)
        suite.Mult(hypo, likelihood)
    suite.Normalize()


def Likelihood(evidence, hypo):
    """Computes the likelihood of the evidence assuming the hypothesis is true.

    Args:
        evidence: a tuple of (number of heads, number of tails)
        hypo: float probability of heads

    Returns:
        probability of tossing the given number of heads and tails with a
        coin that has p probability of heads
    """
    heads, tails = evidence
    p = hypo
    return pow(p, heads) * pow(1-p, tails)

If you are not familiar with that, you can read Chapter 8 of Think Stats, or see this blog post.

To find P(blink|C), it's pretty much the same:

    doorC = MakeUniformSuite(0.0, 1.0, 101, name='Door C')
    evidence = 3, 10
    Update(doorC, evidence)

And to finish it off, we apply the law of total probability:

    print TotalProbability(doorA, doorC, ProbWinning)

TotalProbability() enumerates the values from doorA and doorC, and calls ProbWinning() for each pair:

def TotalProbability(pmf1, pmf2, func):
    """Enumerates pairs from the Pmfs, calls the func, and returns
    the total probability.

    func: takes a value from each Pmf and returns a probability.
    """
    total = 0.0
    for x, px in pmf1.Items():
        for y, py in pmf2.Items():
            if px and py:
                total += px * py * func(x, y)
    return total

Finally, ProbWinning() takes P(blink|A) and P(blink|C) and returns P(A|E):

def ProbWinning(pbA, pbC):
    """Computes the probability that the car is behind door A:

    pbA: probability that Monty blinks if the car is behind A
    pbC: probability that Monty blinks if the car is behind C
    """
    pea = 0.5 * pbA
    pec = 1.0 * pbC

    pae = pea / (pea + pec)
    return pae


You can download all this code here.  If you run it, you'll see the posterior distributions for P(blink|A) and P(blink|C):

The most likely values are 0.6 and 0.23, but since we don't have much data, the uncertainty is still large.
Taking it all together, the probability that the car is behind Door A, P(A|E), is 0.52.  So you are slightly better off sticking with Door A.

To summarize:

1) In the canonical version of the game, P(A|E) = 1/3, so you are better off switching.

2) In the Blinky Monty version, if we take P(blink|A) and P(blink|C) as givens, we estimate P(A|E) = 0.57, so you are slightly better off sticking with Door A.

3) If you are a frequentist, you conclude that Monty's tell is not statistically significant, so you ignore it, switch, and lose most of the time.

4) If you are a Bayesian, you can take advantage of the tell and maximize your chance of winning.  If Monty blinks and you stick, you win 52% of the time.  If he doesn't blink and you switch, you win 87% of the time.  Since he blinks 40% of the time, overall you can expect to win 73% of the time.  So the tell provides a substantial advantage.

Exercise for the reader: download my code and modify it to analyze the case where Monty doesn't blink.  Hint: you only have to change two lines.

EDIT October 19, 2011: I got a comment on reddit that raises a legitimate concern about data collection. In the scenario I presented, if you review tapes of the show and look for several potential tells (blinking, facial tics, verbal habits, etc.) and you choose one that seems to be correlated with the outcome, you could easily get fooled (see this post on repeated tests).  This is a fair warning for both the frequentist and Bayesian analyses.

So to be more careful, maybe I should say that you were tipped off to the possibility that Monty's tell is a blink before you reviewed the tape, and that's the only data you collected.

EDIT January 26, 2012: In the comments below, Jacob pointed out an error in my calculation.  As a result, I went back and changed the scenario to match my calculations, which forced me to make a few other revisions.  I think it is all settled now, but let me know if you disagree.

Thursday, October 6, 2011

Repeated tests: how bad can it be?

A couple of weeks ago Red Dave wrote this blog post about the dangers of A/B testing with repeated tests.  I think he's right, and does a good job of explaining why.  But to get a better understanding of the issue, I wrote some simulations, and things can be even worse than Dave suggests.

As an example of A/B testing, suppose you are advertising a business and you develop two versions of an ad, A and B, with different text, graphics, etc.  You want to know which ad is more effective.

If you post your ads online, you can find out how many times each version was shown (number of impressions), and how many people clicked on each ad (number of clicks).  The number of clicks per impression is the "click-through rate," or CTR.  At least, that's what Google calls it.

If you show each ad 1000 times, you might see a CTR of 1% for Version A and 2% for Version B.  To see whether this difference is significant, you can use a chi-square test, which compares the actual number of clicks for each ad with the expected number.  The result is a p-value, which tells you the chance of seeing a difference as big as that if the ads are actually the same.  If the p-value is small, the effect is unlikely to be due to chance, so it is more likely to be real.

So far, this is just classical hypothesis testing.  No problem.  Here's where things get ugly.

If, instead of showing each ad 1000 times, you run the ad for a while, and then check for a statistically-significant difference, run a while longer, check again, and repeat until you get a significant difference, then what you are doing is Bogus.

And when I say "Bogus", I mean capital-B Bogus, because the p-values you compute are not going to be off by a factor of 2, which would actually be acceptable for most purposes.  They are not going to be off by a factor of 10, which would make them meaningless.  They are going to be completely and totally wrong, to the point where, if you run long enough, you are nearly certain to see a "statistically significant" difference, even if both versions are the same.

To see why, let's look at the case where you check for significance after every ad impression.  Suppose you show version A or version B with equal probability, and in either case the click-through rate is 50%.  After 100 impressions, you expect to show A and B 50 times each, and you expect 25 clicks on each.

After each impression, we can compute the CTR for A and B, and plot the difference as a random walk.  Here's what it looks like for 10 simulations of 200 impressions:


When the number of trials is small, the difference in CTR can be big, but for n>50 the law of large numbers kicks in and the difference gets small.

Instead of computing the difference, we can compute the chi-square statistic, which yields another random walk.  Here's what it looks like after 1000 steps:


After an initial transient, the distribution of the lines stabilizes, which shows that for large N, the distribution of the chi-square stat does not depend on N, which is exactly why we can use the chi-square distribution to compute p-values analytically.

SciPy provides a function that evaluates the chi-squared CDF.  The following function uses it to look up chi-squared values and return the corresponding p-value:

import scipy.stats


def Pvalue(chi2, df):
    """Probability of getting chi2 or greater from a chi-squared distribution.


    chi2: observed chi-squared statistic
    df: degrees of freedom
    """
    return 1 - scipy.stats.chi2.cdf(chi2, df)


In this example, the number of degrees of freedom is 3, because there are four values in the table (hits and misses for A and B), but they are constrained to add up to N.

Now here's the key question: if we take values that are actually from a chi-squared distribution and look up the corresponding p-values, what is the distribution of the resulting p-values?

Answer: they are uniformly distributed.  Why?  Because we are using the chi-squared CDF to get their percentile ranks.  By the definition of the CDF, 1% of them fall below the 1st percentile, 2% of them fall below the 2nd percentile, and so on.
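You can check this numerically without SciPy; for three degrees of freedom the chi-squared CDF happens to have a closed form in terms of erf (a sketch; the seed and sample size are arbitrary choices of mine):

```python
import math
import random

def chi2_cdf_df3(x):
    """Chi-squared CDF with 3 degrees of freedom (closed form)."""
    return math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

random.seed(17)
n = 20000

# draw chi-squared(3) values as sums of three squared standard normals
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(3)) for _ in range(n)]
pvalues = [1 - chi2_cdf_df3(x) for x in samples]

# if the p-values are uniform, about 5% fall below 0.05 and the mean is near 0.5
print(sum(1 for p in pvalues if p < 0.05) / n)
print(sum(pvalues) / n)
```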

If we plot the p-values as a random walk, here's what it looks like (20 simulations, 2000 trials):

For large N, the lines are spread out uniformly from 0 to 1, as expected.

The red line shows the significance criterion, α, which determines the false positive rate.  If α=5%, then at any point in time, we expect 5% of the lines to be below this threshold.  And in the figure, that looks about right.

If we stop the test at a pre-determined time, or at a randomly-chosen time, we expect to find one line out of twenty below the threshold.  But if you do repeated tests, checking for significance after every trial, then the relevant question is how many of these lines cross below the threshold, ever?

In this example, it is almost 50%.  After 10000 trials, it is 85%.  As the number of trials increases, the false positive rate approaches 100%.
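The blow-up is easy to reproduce.  The sketch below uses an even simpler setup than the chi-squared test above: flip a fair coin, run a two-sided test for fairness after every flip, and count how many runs ever look "significant" (the normal approximation and the choice to start testing at 20 flips are mine):

```python
import math
import random

def two_sided_p(heads, n):
    """Normal-approximation p-value for testing whether a coin is fair."""
    z = abs(2 * heads - n) / math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

random.seed(17)
alpha = 0.05
runs, flips = 200, 1000

ever_significant = 0
for _ in range(runs):
    heads = 0
    for n in range(1, flips + 1):
        heads += random.random() < 0.5
        if n >= 20 and two_sided_p(heads, n) < alpha:
            ever_significant += 1
            break

# far more than the nominal 5%, even though the coin really is fair
print(ever_significant / runs)
```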

A standard quip among statisticians goes like this: "A frequentist is a person whose long-run ambition is to be wrong 5% of the time."  Here's an update: "A/B testing with repeated tests is totally legitimate, if your long-run ambition is to be wrong 100% of the time."

You can download the code I used to generate these figures from thinkstats.com/khan.py.  The file is named after the Khan Academy because that's the context of the blog post that brought the problem to my attention.

Note: in these examples, I used a CTR of 50%, which is much higher than you are likely to get with an online ad campaign: 1% is considered good.  If you run these simulations with more realistic CTRs, the random walks get jaggier, due to the statistics of small numbers, but the overall effect is the same.

UPDATE August 5, 2013:  In the comments below, I got the following inquiry:

The d.f. for chi square is 1, not 3, as the null hypothesis of independence adds 2 more constraints. For JxK 2-way tables, df for independence = (I-1)(J-1). See any stat text, e.g. A. Agresti, 2002, Categorical Data Analysis, 2nd Ed, Wiley, p. 79

I maintain that the argument I made above is correct: there are four numbers in the table, but because they are constrained to add up to N, there are three degrees of freedom.  But getting df right is notoriously tricky, so let's check.  I ran 100 simulations, computed the resulting chi2 values and plotted their CDF.  Then I compared it to chi2 distributions with df = 1, 2, 3.  The figure below shows the result:


This figure shows that the values from my simulation are consistent with the chi2 distribution with df=3 (and substantially different from a chi2 distribution with df=1).