Tuesday, September 1, 2015

Bayesian analysis of gluten sensitivity

Last week a new study showed that many subjects diagnosed with non-celiac gluten sensitivity (NCGS) were not able to distinguish gluten flour from non-gluten flour in a blind challenge.

In this article, I review the study and use a simple Bayesian model to show that the results support the hypothesis that none of the subjects are sensitive to gluten.  But there are complications in the design of the study that might invalidate the model.

Here is a description of the study:
"We studied 35 non-CD subjects (31 females) that were on a gluten-free diet (GFD), in a double-blind challenge study. Participants were randomised to receive either gluten-containing flour or gluten-free flour for 10 days, followed by a 2-week washout period and were then crossed over. The main outcome measure was their ability to identify which flour contained gluten.
"The gluten-containing flour was correctly identified by 12 participants (34%)..."
Since 12 out of 35 participants were able to identify the gluten flour, the authors conclude "Double-blind gluten challenge induces symptom recurrence in just one-third of patients fulfilling the clinical diagnostic criteria for non-coeliac gluten sensitivity."

This conclusion seems odd to me, because if none of the patients were sensitive to gluten, we would expect some of them to identify the gluten flour by chance.  So the results are consistent with the hypothesis that none of the subjects are actually gluten sensitive.

We can use a Bayesian approach to interpret the results more precisely.  But first, as always, we have to make some modeling decisions.

First, of the 35 subjects, 12 identified the gluten flour based on resumption of symptoms while they were eating it.  Another 17 subjects wrongly identified the gluten-free flour based on their symptoms, and 6 subjects were unable to distinguish.  So each subject gave one of three responses.  To keep things simple I follow the authors of the study and lump together the second two groups; that is, I consider two groups: those who identified the gluten flour and those who did not.

Second, I assume (1) people who are actually gluten sensitive have a 95% chance of correctly identifying gluten flour under the challenge conditions, and (2) subjects who are not gluten sensitive have only a 40% chance of identifying the gluten flour by chance (and a 60% chance of either choosing the other flour or failing to distinguish).

Under this model, we can estimate the actual number of subjects who are gluten sensitive, gs.  I chose a uniform prior for gs, from 0 to 35.  To perform the Bayesian analysis, we have to compute the likelihood of the data under each hypothetical value of gs.  Here is the likelihood function in Python:

    def Likelihood(self, data, hypo):
        """Computes the probability of the data under the hypothesis.

        data: tuple of (number who identified the gluten flour,
                        number who did not)
        hypo: gs, the hypothetical number of gluten-sensitive subjects
        """
        gs = hypo
        yes, no = data
        n = yes + no
        ngs = n - gs
        # identifications by sensitive and insensitive subjects
        pmf1 = thinkbayes2.MakeBinomialPmf(gs, 0.95)
        pmf2 = thinkbayes2.MakeBinomialPmf(ngs, 0.4)
        # adding the Pmfs computes the distribution of the total
        pmf = pmf1 + pmf2
        return pmf[yes]

The function works by computing the PMF of the number of gluten identifications conditioned on gs, and then selecting the actual number of identifications, yes, from the PMF.  The details of the computation are in this IPython notebook.
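For readers who don't have thinkbayes2 installed, here is a sketch of the same computation using only NumPy and SciPy.  The function name and grid are mine, but the model is the one described above: sensitivity parameters 0.95 and 0.4, and a uniform prior on gs from 0 to 35.

```python
import numpy as np
from scipy.stats import binom

def likelihood(gs, yes=12, n=35, p_sens=0.95, p_chance=0.4):
    """Probability that `yes` of `n` subjects identify the gluten flour,
    given that `gs` of them are actually gluten sensitive."""
    ngs = n - gs
    pmf_sens = binom.pmf(np.arange(gs + 1), gs, p_sens)
    pmf_chance = binom.pmf(np.arange(ngs + 1), ngs, p_chance)
    # the total count is the sum of two binomials: convolve their PMFs
    return np.convolve(pmf_sens, pmf_chance)[yes]

# with a uniform prior on gs, the posterior is the normalized likelihood
likes = np.array([likelihood(gs) for gs in range(36)])
posterior = likes / likes.sum()
```

The maximum of this posterior is at gs = 0, consistent with the figure.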

And here is the posterior distribution:

The most likely value of gs is 0, so it is quite possible that none of the respondents are gluten sensitive.  The 95% credible interval for gs is (0, 8), so a reasonable upper bound on the number of gluten-sensitive respondents is 8 out of 35, or 23%.

We can also use this analysis to compare two hypotheses:

A) Some of the respondents are gluten sensitive (equally likely from 0 to 35).
B) None of the respondents are gluten sensitive.

The Bayes factor in support of B turns out to be about 8.4, which is moderately strong.  If you believed, before reading this study, that the probability of B was 50%, you should now believe it is about 90%.
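Here is a sketch of that comparison, using the same convolution-of-binomials likelihood as the model above.  I assume hypothesis A spreads its prior uniformly over gs from 0 to 35, as stated, so the exact Bayes factor may differ slightly from the value reported in the notebook.

```python
import numpy as np
from scipy.stats import binom

def likelihood(gs, yes=12, n=35, p_sens=0.95, p_chance=0.4):
    # probability of `yes` identifications given gs sensitive subjects
    ngs = n - gs
    pmf_sens = binom.pmf(np.arange(gs + 1), gs, p_sens)
    pmf_chance = binom.pmf(np.arange(ngs + 1), ngs, p_chance)
    return np.convolve(pmf_sens, pmf_chance)[yes]

like_B = likelihood(0)                                  # B: no one is sensitive
like_A = np.mean([likelihood(gs) for gs in range(36)])  # A: uniform over gs
bayes_factor = like_B / like_A
posterior_prob_B = bayes_factor / (bayes_factor + 1)    # from a 50% prior
```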

However, there are complications in the design of the study that might invalidate this simple model.  In particular, the gluten-free flour in the study contained corn starch, which some people may be sensitive to.  And several subjects reported symptoms under both challenge conditions; that is, when eating both gluten flour and gluten-free flour.  So it is possible that misidentification of the gluten flour, as well as failure to distinguish, indicates sensitivity to both gluten and corn starch.

But if we limit ourselves to the question of whether people diagnosed with non-celiac gluten sensitivity are specifically sensitive to gluten, this study suggests that they are not.

Thank yous: I heard about this study in this blog post.  And I want to thank the authors of the study and their publisher for making the entire paper available on the web, which made my analysis possible.

Thursday, August 27, 2015

Bayes theorem in real life

I had a chance to practice Bayesian inference in real life today: at 1pm my wife called to tell me that the carbon monoxide (CO) alarm at the house was going off.  Immediately two hypotheses came to mind: (1) there is a dangerous amount of CO in my house, (2) it's a false alarm.

It's summer and all the windows are open in the house.  The furnace is not running and we don't have a gas stove.  And the detector is about 10 years old.  This background information makes a false alarm more plausible, so I started with a low prior for (1).  Since my wife was on her way out anyway, I suggested she disconnect the detector, turn on a fan, and leave.

After we hung up, I searched for information on CO detectors and false alarms.  Apparently, the rate of false alarms is low, at least in one sense: CO detectors are very specific, that is, unlikely to go off because of anything other than CO.  Of course the other possibility is that the detector is broken.  On balance, this information made me less confident of a false alarm.

I tried to think of other possibilities:
  • A major street nearby is being paved.  Does fresh pavement off-gas anything that would set off a CO detector?
  • At a construction site down the street, they just poured a concrete foundation, and my neighbor is having some masonry done.  Does fresh concrete off-gas CO?  I vaguely remember that it sequesters oxygen, which turned out to be a problem for one of the Biosphere projects.
  • What about a smoldering fire inside a wall?  Could it produce enough CO to set off the detector, but not enough smoke to set off the smoke alarms?
The first two seemed unlikely, but the third seemed plausible.  In fact, it seemed plausible enough that I decided to go home and check it out.  When I got there, the windows were open and fans were running, so I figured the air had exchanged at least once, and the detector had a chance to reset.

I put on over-the-head 30dB earmuffs and plugged the detector back in.  It went off immediately.  I plugged it into an outlet in another room, and it went off immediately.  At this point it seemed unlikely that every room in my wide-open, highly ventilated house was full of CO.

I still couldn't rule out a smoldering fire, but it sure seemed unlikely.  Nevertheless, I called the fire department, using their non-emergency number to preserve some dignity, and told them the story.  They sent a truck, walked through the house with a gas detector, and reported no CO anywhere.

One of the firemen told me that CO detectors are good for about 7 years, so it's time for a new one.  And while I'm at it, he suggested, I should get one more for the first floor (currently I have one in the basement and one on the second floor).  And that's what I did.

So what have we learned?

1) Make sure you have a CO detector on each floor of your house.  CO poisoning is no joke.  Want proof?  Run this search query any time.  Chances are you'll find several serious accidents within the last week or so.

2)  Replace detectors after 5-7 years.  Apparently newer ones have an end-of-life alert, so you don't have to keep track.  If you have an older one, figure out the replacement day and put it on your calendar.

3) Think like a Bayesian.  As you get more information, update your probabilities, and change your decisions accordingly.  My initial estimate for the probability of false alarm was so high, I decided not to check it out.  As I thought of more possible explanations, the probability of a real problem crept higher.  It was still very low, but high enough that I decided to go home.

4) On the other hand, a strictly Bayesian framework doesn't always fit real-world thinking.  In this case, my decision flipped when I thought of a new hypothesis (a smoldering fire).  Should I consider that a new piece of information and update my prior?  Or is it a new piece of background information that caused me to go back, revise my prior, and start over?  In practice, it makes no difference.  But from a Bayesian point of view, it's not quite as neat as it might be.

Tuesday, August 18, 2015

The Inspection Paradox is Everywhere

The following is a draft of an article I have submitted for publication in CHANCE Magazine, a publication of the American Statistical Association. With their encouragement, I am publishing it here to solicit comments from readers (and possibly corrections).

The Inspection Paradox is Everywhere

The inspection paradox is a common source of confusion, an occasional source of error, and an opportunity for clever experimental design.  Most people are unaware of it, but like the cue marks that appear in movies to signal reel changes, once you notice it, you can’t stop seeing it.

A common example is the apparent paradox of class sizes.  Suppose you ask college students how big their classes are and average the responses.  The result might be 56.  But if you ask the school for the average class size, they might say 31.  It sounds like someone is lying, but they could both be right.

The problem is that when you survey students, you oversample large classes.  If there are 10 students in a class, you have 10 chances to sample that class.  If there are 100 students, you have 100 chances.  In general, if the class size is x, it will be overrepresented in the sample by a factor of x.

That’s not necessarily a mistake.  If you want to quantify student experience, the average across students might be a more meaningful statistic than the average across classes.  But you have to be clear about what you are measuring and how you report it.
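The size-biasing can be sketched in a few lines.  The two-class example below is made up for illustration (one class of 10 students and one of 100), not the Purdue data.

```python
import numpy as np

def class_size_views(sizes, counts):
    """Average class size across classes, and as sampled by students.

    sizes: distinct class sizes; counts: number of classes of each size.
    A class of size x appears in the student sample with weight x.
    """
    sizes = np.asarray(sizes, dtype=float)
    counts = np.asarray(counts, dtype=float)
    actual = np.average(sizes, weights=counts)
    biased = np.average(sizes, weights=counts * sizes)
    return actual, biased

actual, biased = class_size_views([10, 100], [1, 1])
```

With one class of each size, the average across classes is 55, but the average across students is almost 92, because 100 of the 110 students are in the big class.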

By the way, I didn’t make up the numbers in this example.  They come from class sizes reported by Purdue University for undergraduate classes in the 2013-14 academic year.  https://www.purdue.edu/datadigest/2013-14/InstrStuLIfe/DistUGClasses.html

From the data in their report, I estimate the actual distribution of class sizes; then I compute the “biased” distribution you would get by sampling students.  The CDFs of these distributions are in Figure 1.

Going the other way, if you are given the biased distribution, you can invert the process to estimate the actual distribution.  You could use this strategy if the actual distribution is not available, or if it is easier to run the biased sampling process.

Figure 1: Undergraduate class sizes at Purdue University, 2013-14 academic year: actual distribution and biased view as seen by students.

The same effect applies to passenger planes.  Airlines complain that they are losing money because so many flights are nearly empty.  At the same time passengers complain that flying is miserable because planes are too full.  They could both be right.  When a flight is nearly empty, only a few passengers enjoy the extra space.  But when a flight is full, many passengers feel the crunch.

Once you notice the inspection paradox, you see it everywhere.  Does it seem like you can never get a taxi when you need one?  Part of the problem is that when there is a surplus of taxis, only a few customers enjoy it.  When there is a shortage, many people feel the pain.

Another example happens when you are waiting for public transportation.  Buses and trains are supposed to arrive at constant intervals, but in practice some intervals are longer than others.  With your luck, you might think you are more likely to arrive during a long interval.  It turns out you are right: a random arrival is more likely to fall in a long interval because, well, it's longer.

To quantify this effect, I collected data from the Red Line in Boston.  Using their real-time data service, I recorded the arrival times for 70 trains between 4pm and 5pm over several days.
Figure 2: Distribution of time between trains on the Red Line in Boston, between 4pm and 5pm.

The shortest gap between trains was less than 3 minutes; the longest was more than 15.  Figure 2 shows the actual distribution of time between trains, and the biased distribution that would be observed by passengers.  The average time between trains is 7.8 minutes, so we might expect the average wait time to be 3.9 minutes.  But the average of the biased distribution is 8.8 minutes, and the average wait time for passengers is 4.4 minutes, about 15% longer.
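The relationship between the two averages can be sketched with hypothetical gap data (the numbers below are made up; the Red Line data is in the notebooks).  A random arrival falls in a gap with probability proportional to its length, and on average waits half of the gap it falls in.

```python
import numpy as np

def wait_times(gaps):
    """Mean gap, passenger-weighted mean gap, and expected wait."""
    gaps = np.asarray(gaps, dtype=float)
    mean_gap = gaps.mean()
    # weighting each gap by its own length gives E[x^2] / E[x]
    biased_mean = np.average(gaps, weights=gaps)
    return mean_gap, biased_mean, biased_mean / 2

mean_gap, biased_mean, wait = wait_times([5, 10])
```

With gaps of 5 and 10 minutes, the mean gap is 7.5 minutes, but the biased mean is 8.33, so the expected wait is 4.17 minutes rather than 3.75.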

In this case the difference between the two distributions is not very big because the variance of the actual distribution is moderate.  When the actual distribution is long-tailed, the effect of the inspection paradox can be much bigger.

An example of a long-tailed distribution comes up in the context of social networks.  In 1991, Scott Feld presented the “friendship paradox”: the observation that most people have fewer friends than their friends have.  He studied real-life friends, but the same effect appears in online networks: if you choose a random Facebook user, and then choose one of their friends at random, the chance is about 80% that the friend has more friends.

The friendship paradox is a form of the inspection paradox.  When you choose a random user, every user is equally likely.  But when you choose one of their friends, you are more likely to choose someone with a lot of friends.  Specifically, someone with x friends is overrepresented by a factor of x.
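To see the effect on a small scale, here is a toy simulation on a hypothetical star-shaped network, one hub connected to five leaves; the network and numbers are illustrative, not from the Facebook data.

```python
import random
from collections import defaultdict

def friendship_paradox(edges, trials=10000, seed=1):
    """Fraction of (user, random friend) pairs where the friend
    has more friends than the user."""
    friends = defaultdict(list)
    for u, v in edges:
        friends[u].append(v)
        friends[v].append(u)
    nodes = list(friends)
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        user = rng.choice(nodes)           # every user equally likely
        friend = rng.choice(friends[user]) # popular users oversampled here
        if len(friends[friend]) > len(friends[user]):
            hits += 1
    return hits / trials

# toy network: node 0 is a hub connected to five leaves
frac = friendship_paradox([(0, i) for i in range(1, 6)])
```

In the star network, five of the six users are leaves whose only friend is the hub, so about 5/6 of sampled pairs show a friend with more friends.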

To demonstrate the effect, I use data from the Stanford Large Network Dataset Collection (http://snap.stanford.edu/data), which includes a sample of about 4000 Facebook users.  We can compute the number of friends each user has and plot the distribution, shown in Figure 3.  The distribution is skewed: most users have only a few friends, but some have hundreds.

We can also compute the biased distribution we would get by choosing random friends, also shown in Figure 3.  The difference is huge.  In this dataset, the average user has 42 friends; the average friend has more than twice as many, 91.
Figure 3: Number of online friends for Facebook users: actual distribution and biased distribution seen by sampling friends.

Some examples of the inspection paradox are more subtle.  One of them occurred to me when I ran a 209-mile relay race in New Hampshire.  I ran the sixth leg for my team, so when I started running, I jumped into the middle of the race.  After a few miles I noticed something unusual: when I overtook slower runners, they were usually much slower; and when faster runners passed me, they were usually much faster.

At first I thought the distribution of runners was bimodal, with many slow runners, many fast runners, and few runners like me in the middle.  Then I realized that I was fooled by the inspection paradox.

In many long relay races, teams start at different times, and most teams include a mix of faster and slower runners.  As a result, runners at different speeds end up spread over the course; if you stand at a random spot and watch runners go by, you see a nearly representative sample of speeds.  But if you jump into the race in the middle, the sample you see depends on your speed.

Whatever speed you run, you are more likely to pass very slow runners, more likely to be overtaken by fast runners, and unlikely to see anyone running at the same speed as you.  The chance of seeing another runner is proportional to the difference between your speed and theirs.

We can simulate this effect using data from a conventional road race.  Figure 4 shows the actual distribution of speeds from the James Joyce Ramble, a 10K race in Massachusetts.  It also shows biased distributions that would be seen by runners at 6.5 and 7.5 mph.  The observed distributions are bimodal, with fast and slow runners oversampled and fewer runners in the middle.
Figure 4: Distribution of speed for runners in a 10K, and biased distributions as seen by runners at different speeds.
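The biasing rule from the previous paragraphs can be sketched directly; the field of runners below is made up (speeds 5 to 9 mph, equally common), not the James Joyce Ramble data.

```python
import numpy as np

def observer_view(speeds, pmf, observer_speed):
    """Distribution of speeds seen by a runner at observer_speed.

    The chance of seeing a runner at speed v is proportional to
    p(v) * |v - observer_speed|.
    """
    speeds = np.asarray(speeds, dtype=float)
    weights = np.asarray(pmf, dtype=float) * np.abs(speeds - observer_speed)
    return weights / weights.sum()

# hypothetical uniform field of runners, observed by a 7 mph runner
seen = observer_view([5, 6, 7, 8, 9], [0.2] * 5, 7.0)
```

Runners at exactly the observer's speed are never seen, and the farther a speed is from 7 mph, the more it is oversampled, which produces the bimodal shape.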

A final example of the inspection paradox occurred to me when I was reading Orange is the New Black, a memoir by Piper Kerman, who spent 13 months in a federal prison.  At several points Kerman expresses surprise at the length of the sentences her fellow prisoners are serving.  She is right to be surprised, but it turns out that she is the victim of not just an inhumane prison system, but also the inspection paradox.

If you arrive at a prison at a random time and choose a random prisoner, you are more likely to choose a prisoner with a long sentence.  Once again, a prisoner with sentence x is oversampled by a factor of x.

Using data from the U.S. Sentencing Commission, I made a rough estimate of the distribution of sentences for federal prisoners, shown in Figure 5.  I also computed the biased distribution as observed by a random arrival.
Figure 5: Approximate distribution of federal prison sentences, and a biased distribution as seen by a random arrival.

As expected, the biased distribution is shifted to the right.  In the actual distribution the mean is 121 months; in the biased distribution it is 183 months.

So what happens if you observe a prison over an interval like 13 months?  It turns out that if your sentence is y months, the chance of overlapping with a prisoner whose sentence is x months is proportional to x + y.
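As a sketch of the x + y rule, here is a made-up two-point distribution of sentences (12 and 120 months, equally common); the numbers are illustrative, not from the Sentencing Commission data.

```python
import numpy as np

def overlap_view(xs, pmf, y):
    """Distribution of sentences x seen by a prisoner serving y months.

    The chance of overlapping a prisoner with sentence x is
    proportional to p(x) * (x + y).
    """
    weights = np.asarray(pmf, dtype=float) * (np.asarray(xs, dtype=float) + y)
    return weights / weights.sum()

xs = [12, 120]
instant = overlap_view(xs, [0.5, 0.5], 0)    # random arrival, y = 0
lifer = overlap_view(xs, [0.5, 0.5], 6000)   # very long observation
```

With y = 0 this reduces to the factor-of-x bias, so the long sentence dominates; as y grows, the weights approach the actual distribution.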

Figure 6 shows biased distributions as seen by hypothetical prisoners serving sentences of 13, 120, and 600 months.
Figure 6: Biased distributions as seen by prisoners with different sentences.

Over an interval of 13 months, the observed sample is not much better than the biased sample seen by a random arrival.  After 120 months, the magnitude of the bias is about halved.  Only after a very long sentence, 600 months, do we get a more representative sample, and even then it is not entirely unbiased.

These examples show that the inspection paradox appears in many domains, sometimes in subtle ways.  If you are not aware of it, it can cause statistical errors and lead to invalid inferences.  But in many cases it can be avoided, or even used deliberately as part of an experimental design.

Further reading
I discuss the class size example in my book, Think Stats, 2nd Edition, O’Reilly Media, 2014, and the Red Line example in Think Bayes, O’Reilly Media, 2013.  I wrote about relay races, social networks, and Orange Is the New Black in my blog, “Probably Overthinking It”.  http://allendowney.blogspot.com/

John Allen Paulos presents the friendship paradox in Scientific American, “Why You're Probably Less Popular Than Your Friends”, 18 January 2011.  http://www.scientificamerican.com/article/why-youre-probably-less-popular/
The original paper on the topic might be Scott Feld,  “Why Your Friends Have More Friends Than You Do”, American Journal of Sociology, Vol. 96, No. 6 (May, 1991), pp. 1464-1477. http://www.jstor.org/stable/2781907

Amir Aczel discusses some of these examples, and a few different ones, in a Discover Magazine blog article, “On the Persistence of Bad Luck (and Good)”, 4 September 2013.

The code I used to generate these examples is in these Jupyter notebooks:


Allen Downey is a Professor of Computer Science at Olin College of Engineering in Massachusetts.  He is the author of several books, including Think Python, Think Stats, and Think Bayes.  He is a runner with a maximum 10K speed of 8.7 mph.

Tuesday, August 4, 2015

Orange is the new stat

I've been reading Piper Kerman's Orange Is the New Black, a memoir by a woman who served 11 months in a federal prison.  One of the recurring themes is the author's surprise at the length of the sentences her fellow prisoners are serving, especially the ones convicted of drug offenses.  In my opinion, she is right to be shocked.

About half of federal prisoners were convicted of drug crimes, according to this fact sheet from the US Sentencing Commission (USSC).  In minimum security prisons, the proportion is higher, and in women's prisons, I would guess it is even higher.  About 45% of federal prisoners were sentenced under mandatory minimum guidelines that sensible people would find shocking.  And about a third of them had no prior criminal record, according to this report, also from the USSC.

In many cases, minor drug offenders are serving sentences much longer than sentences for serious violent crimes.  For a list of heart-breaking examples, see these prisoner profiles at Families Against Mandatory Minimums.

Or watch this clip from John Oliver's Last Week Tonight:

When you are done being outraged, here are a few things to do:

1) Read more about Families Against Mandatory Minimums, write about them on social media, and consider making a donation (Charity Navigator gives them an excellent rating).

2) Another excellent source of information, and another group that deserves support, is the Prison Policy Initiative.

3) Then read the rest of this article, which points out that although Kerman's observations are fundamentally correct, her sampling process is biased in an interesting way.

The inspection paradox

It turns out that Kerman is the victim not just of a criminal justice system that is out of control, but also of a statistical error called the inspection paradox.  I wrote about it in Chapter 3 of Think Stats, where I called it the Class Size Paradox, using the example of average class size.

If you ask students how big their classes are, the average of their responses will be higher than the actual average, often substantially higher.  And if you ask them how many children are in their families, the average of their responses will be higher than the average family size.

The problem is not the students, for once, but the sampling process.  Large classes are overrepresented in the sample because in a large class there are more students to report a large class size.  If there is a class with only one student, only one student will report that size.

And similarly with the number of children, large families are overrepresented and small families underrepresented; in fact, families with no children aren't represented at all.

The inspection paradox is an example of the Paradox Paradox: a large majority of the things called paradoxes are not actually paradoxes, just counter-intuitive truths.  The apparent contradiction between the different averages is resolved when you realize that they are averages over different populations.  One is the average in the population of classes; the other is the average in the population of student-class experiences.

Neither is right or wrong, but they are useful for different things.  Teachers might care about the average size of the classes they teach; students might care more about the average of the classes they take.

Prison inspection

The same effect occurs if you visit a prison.  Suppose you pick a random day, choose a prisoner at random, and ask the length of her sentence.  The response is more likely to be a long sentence than a short one, because a prisoner with a long sentence has a better chance of being sampled.  For each sentence duration, x, suppose the fraction of convicts given that sentence is p(x).  In that case the probability of observing someone with that sentence is proportional to x p(x).

Now imagine a different scenario: suppose you are serving an absurdly long prison sentence, like 55 years for a minor drug offense.  During that time you see prisoners with shorter sentences come and go, and if you keep track of their sentences, you get an unbiased view of the distribution of sentence lengths.  So the probability of observing someone with sentence x is just p(x).

And that brings me to the question that occurred to me while I was reading Orange: what happens if you observe the system for a relatively short time, like Kerman's 11 months?  Presumably the answer is somewhere between p(x) and x p(x).   But where?  And how does it depend on the length of the observer's sentence?

UPDATE 17 August 2015: A few days after I posted the original version of this article, Jerzy Wieczorek dropped by my office.  Jerzy is an Olin alum who is now a grad student in statistics at CMU, so I posed this problem to him.  A few days later he emailed me the solution, which is that the probability of observing a sentence, x, during an interval, t, is proportional to x + t.  Couldn't be much simpler than that!

To see why, imagine a row of sentences arranged end to end along the number line.  If you make an instantaneous observation, that's like throwing a dart at the number line.  Your chance of hitting a sentence with length x is (again) proportional to x.

Now imagine that instead of throwing a dart, you throw a piece of spaghetti with length t.  What is the chance that the spaghetti overlaps with a sentence of length x?  If we say arbitrarily that the sentence runs from 0 to x, the spaghetti will overlap the sentence if the left side falls anywhere between -t and x.  So the size of the target is x + t.

Based on this result, here's a Python function that takes the actual PMF and returns a biased PMF as seen by someone serving a sentence with duration t:

def bias_pmf(pmf, t=0):
    """Reweights each sentence x by (x + t) and renormalizes."""
    new_pmf = pmf.Copy()

    for x, p in pmf.Items():
        new_pmf[x] *= (x + t)

    new_pmf.Normalize()
    return new_pmf

This IPython notebook has the details, and here's a summary of the results.


To model the distribution of sentences, I use random values from a gamma distribution, rounded to the nearest integer. All sentences are in units of months.  I chose parameters that very roughly match the histogram of sentences reported by the USSC.

The following code generates a sample of sentences as observed by a series of random arrivals.  The notebook explains how it works.

# generate a model of sentence lengths; the cumulative sum lays the
# sentences end-to-end on a timeline, giving each release time
sentences = np.random.gamma(shape=2, scale=60, size=1000).astype(int)
releases = sentences.cumsum()
# sample random arrival times and find which sentence each one falls in
arrivals = np.random.random_integers(1, releases[-1], 10000)
prisoners = releases.searchsorted(arrivals)
sample = sentences[prisoners]
cdf2 = thinkstats2.Cdf(sample, label='biased')

The following figure shows the actual distribution of sentences (that is, the model I chose), and the biased distribution as would be seen by random arrivals:

Due to the inspection paradox, we oversample long sentences.  As expected, the sample mean is substantially higher than the actual mean, about 190 months compared to 120 months.

The following function simulates the observations of a person serving a sentence of t months.  Again, the notebook explains how it works:

def simulate_sentence(sentences, t):
    counter = Counter()
    releases = sentences.cumsum()
    last_release = releases[-1]
    # start observing at a random arrival time
    arrival = np.random.random_integers(1, max(sentences))
    # every 100 months, record the sentences of the prisoners serving
    # at any point during the observer's interval, (i, i+t)
    for i in range(arrival, last_release-t, 100):
        first_prisoner = releases.searchsorted(i)
        last_prisoner = releases.searchsorted(i+t)
        observed_sentences = sentences[first_prisoner:last_prisoner+1]
        counter.update(observed_sentences)
    return thinkstats2.Cdf(counter, label='observed %d' % t)

Here's the distribution of sentences as seen by someone serving 11 months, as Kerman did:

The observed distribution is almost as biased as what would be seen by an instantaneous observer.  Even after 120 months (near the average sentence), the observed distribution is substantially biased:

After 600 months (50 years!) the observed distribution nearly converges to the actual distribution.

I conclude that during Kerman's 11 month sentence, she would have seen a biased sample of the distribution of sentences.  Nevertheless, her observation that many prisoners are serving long sentences that do not fit their crimes is still valid, in my opinion.

Monday, July 13, 2015

Will Millennials Ever Get Married?

At SciPy last week I gave a talk called "Will Millennials Ever Get Married?  Survival Analysis and Marriage Data".  I presented results from my analysis of data from the National Survey of Family Growth (NSFG).  The slides I presented are here.

The video is here:

The paper the talk is based on will appear in the conference proceedings, which should be out any day now.

And here's the press release from Olin talking about the project:

More women will get married later and the number of women who remain unmarried will increase according to a new statistical investigation into the marriage patterns of millennials by Olin College professor Allen Downey. The analysis was unveiled at the SciPy conference in Austin, Texas. 
The data shows that millennials are already marrying much later than previous generations, but now they are on pace to remain unmarried in substantially larger numbers than prior generations of women. 
“The thing that is surprising to me is the size of the effect,” said Downey.
Downey, a data scientist, set out to analyze marriage patterns for women born in the 1980s and 1990s using information from the National Survey of Family Growth compiled by the Centers for Disease Control and Prevention (CDC).
  • Women born in the 90s = 13 percent were married at age 22
  • Women born in the 80s = 23 percent were married at age 22
  • Women born in the 40s = 69 percent were married at age 22
“Going from 69 percent to 13 percent in 50 years is a huge magnitude of change,” said Downey. 
He then used a statistical survival analysis to project what percentage of women will marry later in life – or at all.  Downey only studied numbers for women because the data from the CDC is more comprehensive for females.  
Of women born in the 1980s, Downey’s statistical survival projections show 72 percent are likely to marry by age 42, down from 82 percent of women born in the 1970s who were married by that age. For women born in the 1990s, the projection shows that only 68 percent are likely to marry by age 42. Contrast these figures with the fact that 92 percent of women born in the 1940s were married by age 42. 
“These are projections, they are based on current trends and then seeing what would happen if the current trends continue…but at this point women born in the 80s and 90s are so far behind in marriage terms they would have to start getting married at much higher rates than previous generations in order to catch up,” said Downey.

Thursday, June 25, 2015

Bayesian Billiards

I recently watched a video of Jake VanderPlas explaining Bayesian statistics at SciPy 2014.  It's an excellent talk that includes some nice examples.  In fact, it looks like he had even more examples he didn't have time to present.

Although his presentation is very good, he includes one statement that is a little misleading, in my opinion.  Specifically, he shows an example where frequentist and Bayesian methods yield similar results, and concludes, "for very simple problems, frequentist and Bayesian results are often practically indistinguishable."

But as I argued in my recent talk, "Learning to Love Bayesian Statistics," frequentist methods generate point estimates and confidence intervals, whereas Bayesian methods produce posterior distributions.  That's two different kinds of things (different types, in programmer-speak) so they can't be equivalent.

If you reduce the Bayesian posterior to a point estimate and an interval, you can compare Bayesian and frequentist results, but in doing so you discard useful information and lose what I think is the most important advantage of Bayesian methods: the ability to use posterior distributions as inputs to other analyses and decision-making processes.

The next section of Jake's talk demonstrates this point nicely.  He presents the "Bayesian Billiards Problem", which Bayes wrote about in 1763.  Jake presents a version of the problem from this paper by Sean Eddy, which I'll quote:
"Alice and Bob are playing a game in which the first person to get 6 points wins. The way each point is decided is a little strange. The Casino has a pool table that Alice and Bob can't see. Before the game begins, the Casino rolls an initial ball onto the table, which comes to rest at a completely random position, which the Casino marks. Then, each point is decided by the Casino rolling another ball onto the table randomly. If it comes to rest to the left of the initial mark, Alice wins the point; to the right of the mark, Bob wins the point. The Casino reveals nothing to Alice and Bob except who won each point. 
"Clearly, the probability that Alice wins a point is the fraction of the table to the left of the mark—call this probability p; and Bob's probability of winning a point is 1 - p. Because the Casino rolled the initial ball to a random position, before any points were decided every value of p was equally probable. The mark is only set once per game, so p is the same for every point. 
"Imagine Alice is already winning 5 points to 3, and now she bets Bob that she's going to win. What are fair betting odds for Alice to offer Bob? That is, what is the expected probability that Alice will win?"
Eddy solves the problem using a Beta distribution to estimate p, then integrates over the posterior distribution of p to get the probability that Bob wins.  If you like that approach, you can read Eddy's paper.  If you prefer a computational approach, read on!  My solution is in this Python file.

The problem statement indicates that the prior distribution of p is uniform from 0 to 1.  Given a hypothetical value of p and the observed number of wins and losses, we can compute the likelihood of the data under each hypothesis:

  class Billiards(thinkbayes.Suite):

    def Likelihood(self, data, hypo):
        """Likelihood of the data under the hypothesis.

        data: tuple (#wins, #losses)
        hypo: float probability of win
        """
        p = hypo
        win, lose = data
        like = p**win * (1-p)**lose
        return like

Billiards inherits the Update function from Suite (which is defined in thinkbayes.py and explained in Think Bayes) and provides Likelihood, which uses the binomial formula:

\textstyle {n \choose k}\, p^k (1-p)^{n-k}

I left out the first term, the binomial coefficient, because it doesn't depend on p, so it would just get normalized away.
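To see why dropping the coefficient is safe, here's a quick sketch (not from the original post) that does the update on a plain numpy grid, with and without the binomial coefficient, and confirms the normalized posteriors are identical:

```python
import numpy as np
from math import comb

ps = np.linspace(0, 1, 101)        # grid of hypothetical values of p
win, lose = 5, 3

like = ps**win * (1 - ps)**lose            # without the coefficient
like_full = comb(win + lose, win) * like   # with the coefficient

post = like / like.sum()                   # normalize
post_full = like_full / like_full.sum()

# the constant factor cancels during normalization
assert np.allclose(post, post_full)
```

Any factor that doesn't depend on the hypothesis cancels the same way, which is why Likelihood only has to be correct up to a constant.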

Now I just have to create the prior and update it:

    ps = numpy.linspace(0, 1, 101)
    bill = Billiards(ps)
    bill.Update((5, 3))

The following figure shows the resulting posterior:

Now to compute the probability that Bob wins the match.  Since Alice is ahead 5 points to 3, Bob needs to win the next three points.  His chance of winning each point is (1-p), so the chance of winning the next three is (1-p)³.

We don't know the value of p, but we have its posterior distribution, which we can "integrate over" like this:

def ProbWinMatch(pmf):
    total = 0
    for p, prob in pmf.Items():
        total += prob * (1-p)**3
    return total

The result is 0.091, which corresponds to 10:1 odds against.
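If you don't have thinkbayes handy, here's a self-contained sketch of the whole calculation on a numpy grid (variable names are mine, not from the post), which reproduces the same answer:

```python
import numpy as np

ps = np.linspace(0, 1, 101)          # uniform prior over p
like = ps**5 * (1 - ps)**3           # likelihood of 5 wins, 3 losses
post = like / like.sum()             # normalized posterior

# "integrate over" the posterior: E[(1-p)^3]
prob_bob_wins = np.sum(post * (1 - ps)**3)
print(round(prob_bob_wins, 3))       # ~0.091
```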

Using a frequentist approach, we get a substantially different answer.  Instead of a posterior distribution, we get a single point estimate.  Assuming we use the MLE, we estimate p = 5/8, and (1-p)³ = 0.053, which corresponds to 18:1 odds against.

Needless to say, the Bayesian result is right and the frequentist result is wrong.

But let's consider why the frequentist result is wrong.  The problem is not the estimate itself.  In fact, in this example, the Bayesian maximum a posteriori (MAP) estimate is the same as the frequentist MLE.  The difference is that the Bayesian posterior contains all of the information we have about p, whereas the frequentist result discards a large part of that information.

The result we are interested in, the probability of winning the match, is a non-linear transform of p, and in general for a non-linear transform f, the expectation E[f(p)] does not equal f(E[p]).  The Bayesian method computes the first, which is right; the frequentist method approximates the second, which is wrong.
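Here's a small numeric check of that claim for f(p) = (1-p)³, using the same grid posterior (again my own sketch, not code from the post):

```python
import numpy as np

ps = np.linspace(0, 1, 101)
like = ps**5 * (1 - ps)**3
post = like / like.sum()

e_of_f = np.sum(post * (1 - ps)**3)     # E[(1-p)^3], the Bayesian answer
f_of_e = (1 - np.sum(post * ps))**3     # (1 - E[p])^3, the plug-in answer

print(e_of_f, f_of_e)   # ~0.091 vs ~0.064: clearly not equal
```

Because f is convex here, the plug-in value understates the true expectation, which is exactly the error the point estimate makes.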

To summarize, Bayesian methods are better not just because the results are correct, but more importantly because the results are in a form, the posterior distribution, that lends itself to answering questions and guiding decision-making under uncertainty.

[UPDATE June 26, 2015]  Some readers have objected that this article is unfair, because any competent statistician would know that the frequentist method I presented here would behave badly, and even a classical statistician could happily use Bayes's theorem to solve this problem, because the prior is provided by the problem statement.

A few thoughts in response:

1) This article compares a Bayesian method and frequentist method.  I never said anything about Bayesians and frequentists as people, or about what methods a hypothetical statistician might choose.

2) The problem with the frequentist approach to this problem is that it produces a point estimate rather than a posterior distribution.  Any method that produces a point estimate is going to have the same problem.

3) Maybe it's true that a competent statistician would expect the frequentist approach to perform badly for this problem, but it still seems like that is a disadvantage of the frequentist approach.  I would rather have a method that is always right than a method that sometimes works and sometimes fails, and requires substantial expertise to know when to expect which.

Friday, June 12, 2015

The Sleeping Beauty Problem

How did I get to my ripe old age without hearing about The Sleeping Beauty Problem?  I've taught and written about The Monty Hall Problem and The Girl Named Florida Problem.  But I didn't hear about Sleeping Beauty until this reddit post, pointing to this video:

Of course, there's a Wikipedia page about it, which I'll borrow to provide the background:
"Sleeping Beauty volunteers to undergo the following experiment and is told all of the following details: On Sunday she will be put to sleep. Once or twice during the experiment, Beauty will be awakened, interviewed, and put back to sleep with an amnesia-inducing drug that makes her forget that awakening. 
A fair coin will be tossed to determine which experimental procedure to undertake: if the coin comes up heads, Beauty will be awakened and interviewed on Monday only. If the coin comes up tails, she will be awakened and interviewed on Monday and Tuesday. In either case, she will be awakened on Wednesday without interview and the experiment ends. 
Any time Sleeping Beauty is awakened and interviewed, she is asked, 'What is your belief now for the proposition that the coin landed heads?'"
The problem is discussed at length on this CrossValidated thread.  As the person who posted the question explains, there are two common reactions to this problem:
The Halfer position. Simple! The coin is fair--and SB knows it--so she should believe there's a one-half chance of heads. 
The Thirder position. Were this experiment to be repeated many times, then the coin will be heads only one third of the time SB is awakened. Her probability for heads will be one third.
The thirder position is correct, and I think the argument based on long-run averages is the most persuasive.  From Wikipedia:
Suppose this experiment were repeated 1,000 times. It is expected that there would be 500 heads and 500 tails. So Beauty would be awoken 500 times after heads on Monday, 500 times after tails on Monday, and 500 times after tails on Tuesday. In other words, only in one-third of the cases would heads precede her awakening. This long-run expectation should give the same expectations for the one trial, so P(Heads) = 1/3.
But here's the difficulty (from CrossValidated):

Most, but not all, people who have written about this are thirders. But:
  • On Sunday evening, just before SB falls asleep, she must believe the chance of heads is one-half: that’s what it means to be a fair coin.
  • Whenever SB awakens, she has learned absolutely nothing she did not know Sunday night.
What rational argument can she give, then, for stating that her belief in heads is now one-third and not one-half?
As a stark raving Bayesian, I find this mildly disturbing.  Is this an example where frequentism gets it right and Bayesianism gets it wrong?  One of the responses on reddit pursues the same thought:

I wonder where exactly in Bayes' rule does the formula "fail". It seems like P(wake|H) = P(wake|T) = 1, and P(H) = P(T) = 1/2, leading to the P(H|wake) = 1/2 conclusion.
Is it possible to get 1/3 using Bayes' rule?
I have come to a resolution of this problem that works, I think, but it made me realize the following subtle point: even if two things are inevitable, that doesn't make them equally likely.

In the previous calculation, the priors are correct: P(H) = P(T) = 1/2

It's the likelihoods that are wrong.  The datum is "SB wakes up".  This event happens once if the coin is heads and twice if it is tails, so the likelihood ratio P(wake|H) / P(wake|T) = 1/2

If you plug that into Bayes's theorem, you get the correct answer, 1/3.

This is an example where the odds form of Bayes's theorem is less error prone: the prior odds are 1:1.  The likelihood ratio is 1:2, so the posterior odds are 1:2.  By thinking in terms of likelihood ratio, rather than conditional probability, we avoid the pitfall.

If this example is still making your head hurt, here's an analogy that might help: suppose you live near a train station, and every morning you hear one express train and two local trains go past.  The probability of hearing an express train is 1, and the probability of hearing a local train is 1.  Nevertheless, the likelihood ratio is 1:2, and if you hear a train, the probability is only 1/3 that it is the express.
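If you'd rather see it by brute force, here's a quick Monte Carlo sketch of the long-run argument (assuming, as in the problem statement, a fair coin, one awakening after heads, and two after tails):

```python
import random

random.seed(1)
awakenings_after_heads = 0
total_awakenings = 0

for _ in range(100_000):
    heads = random.random() < 0.5
    wakes = 1 if heads else 2        # Monday only vs Monday and Tuesday
    total_awakenings += wakes
    if heads:
        awakenings_after_heads += 1

# fraction of awakenings that follow heads
print(awakenings_after_heads / total_awakenings)   # close to 1/3
```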

[UPDATE June 15, 2015]  In the comments below, you’ll see an exchange between me and a reader named James.  It took me a few tries to understand his question, so I’ll take the liberty of editing the conversation to make it clearer (and to make me seem a little quicker on the uptake):

James: I'd be interested in your reaction to the following extension. Before going to sleep on Sunday, Sleeping Beauty makes a bet at odds of 3:2 that the coin will come down heads. (This is favourable for her when the probability of heads is 1/2, and unfavourable when the probability of heads is 1/3). She is told that whenever she is woken up, she will be offered the opportunity to cancel any outstanding bets. Later she finds herself woken up, and asked whether she wants to cancel any outstanding bets. Should she say yes or no? (Let's say she doesn't have access to any external randomness to help her choose). Is her best answer compatible with a "belief of 1/3 that the coin is showing heads"?

Allen: If the bet is only resolved once (on Wednesday), then SB should accept the bet (and not cancel it) because she is effectively betting on a coin toss with favorable odds, and the whole sleeping-waking scenario is irrelevant.

James: Right, the bet is only resolved once.  So, we agree that she should not cancel. But isn't there something odd? Put yourself in SB's position when you are woken up. You say that you have a "belief of 1/3 in the proposition that the coin is heads". The bet is unfavourable to you if the probability of heads is 1/3. And yet you don't cancel it. That suggests one sense in which you do NOT have a belief of 1/3 after all.

Allen: Ah, now I see why this is such an interesting problem.  You are right that I seem to have SB keeping a bet that is inconsistent with her beliefs.  But SB is not obligated to bet based on her current beliefs. If she knows that more information is coming in the future, she can compute a posterior based on that future information and bet accordingly.

Each time she wakes up, she should believe that she is more likely to be in the Tails scenario -- that is, that P(H) = 1/3 -- but she also knows that more information is coming her way.

Specifically, she knows that when she wakes up on Wednesday, and is told that it is Wednesday and the experiment is over, she will update her beliefs and conclude that the probability of Heads is 50% and the bet is favorable.

So when she wakes up on Monday or Tuesday and has the option to cancel the bet, she could think: "Based on my current beliefs, this bet is unfavorable, but I know that before the bet is resolved I will get more information that makes the bet favorable. So I will take that future information into account now and keep the bet (decline to cancel)."

I think the weirdness here is not in her beliefs but in the unusual scenario where she knows that she will get more information in the future. The Bayesian formulation of the problem tells you what she should believe after performing each update, but [the rest of the sentence deleted because I don’t think it’s quite right any more].


Upon further reflection, I think there is a general rule here:

When you evaluate a bet, you should evaluate it relative to what you will believe when the bet is resolved, which is not necessarily what you believe now.  I’m going to call this the Fundamental Theorem of Betting, because it reminds me of Sklansky’s Fundamental Theorem of Poker, which says that the correct decision in a poker game is the decision you would make if all players’ cards were visible.

Under normal circumstances, we don’t know what we will believe in the future, so we almost always use our current beliefs as a heuristic for, or maybe estimate of, our future beliefs.  Sleeping Beauty’s situation is unusual because she knows that more information is coming in the future, and she knows what the information will be!
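To make James's point concrete, here's a sketch of the expected values in his bet ($100 on heads at 3:2, so SB wins $150 or loses $100), evaluated under each belief:

```python
def expected_value(p_heads, win=150, lose=100):
    """Expected profit of the bet given a belief about P(heads)."""
    return p_heads * win - (1 - p_heads) * lose

# Evaluated at the belief SB will hold on Wednesday, when the bet is resolved:
print(expected_value(1/2))   # +25: favorable, so she keeps the bet

# Evaluated at her mid-experiment belief:
print(expected_value(1/3))   # about -16.67: looks unfavorable, hence the puzzle
```

The rule above says the first number is the one that matters, because it's computed from the belief she will hold when the bet is settled.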

To see how this theorem holds up, let me run the SB scenario and see if we can make sense of Sleeping Beauty’s beliefs and betting strategy:

Experimenter: Ok, SB, it’s Sunday night.  After you go to sleep, we’re going to flip this fair coin.  What do you believe is the probability that it will come up heads, P(H)?

Sleeping Beauty:  I think P(H) is ½.

Ex: Ok.  In that case, I wonder if you would be interested in a wager.  If you bet on heads and win, I’ll pay 3:2, so if you bet $100, you will either win $150 or lose $100.  Since you think P(H) is ½, this bet is in your favor.  Do you want to accept it?

SB: Sure, why not?

Ex: Ok, on Wednesday I’ll tell you the outcome of the flip and we’ll settle the bet.  Good night.

SB:  Zzzzz.

Ex: Good morning!

SB: Hello.  Is it Wednesday yet?

Ex: No, it’s not Wednesday, but that’s all I can tell you.  At this point, what do you think is the probability that I flipped heads?

SB: Well, my prior was P(H) = ½.  I’ve just observed an event (D = waking up before Wednesday) that is twice as likely under the tails scenario, so I’ll update my beliefs and conclude that  P(H|D) = ⅓.

Ex: Interesting.  Well, if the probability of heads is only ⅓, the bet we made Sunday night is no longer in your favor.  Would you like to call it off?

SB: No, thanks.

Ex: But wait, doesn’t that mean that you are being inconsistent?  You believe that the probability of heads is ⅓, but you are betting as if it were ½.

SB: On the contrary, my betting is consistent with my beliefs.  The bet won’t be settled until Wednesday, so my current beliefs are not important.  What matters is what I will believe when the bet is settled.

Ex: I suppose that makes sense.  But do you mean to say that you know what you will believe on Wednesday?

SB: Normally I wouldn’t, but this scenario seems to be an unusual case.  Not only do I know that I will get more information tomorrow; I even know what it will be.

Ex: How’s that?

SB: When you give me the amnesia drug, I will forget about the update I just made and revert to my prior.  Then when I wake up on Wednesday, I will observe an event (E = waking up on Wednesday) that is equally likely under the heads and tails scenarios, so my posterior will equal my prior, I will believe that P(H|E) is ½, and I will conclude that the bet is in my favor.

Ex: So just before I tell you the outcome of the bet, you will believe that the probability of heads is ½?

SB: Right.

Ex: Well, if you know what information is coming in the future, why don’t you do the update now, and start believing that the probability of heads is ½?

SB: Well, I can compute P(H|E) now if you want.  It’s ½ -- always has been and always will be.  But that’s not what I should believe now, because I have only seen D, and not E yet.

Ex: So right now, do you think you are going to win the bet?

SB: Probably not.  If I’m losing, you’ll ask me that question twice.  But if I’m winning, you’ll only ask once.  So ⅔ of the time you ask that question, I’m losing.

Ex: So you think you are probably losing, but you still want to keep the bet?  That seems crazy.

SB: Maybe, but even so, my beliefs are based on the correct analysis of my situation, and my decision is consistent with my beliefs.

Ex: I’ll need to think about that.  Well, good night.

SB: Zzzzz.