Thursday, June 25, 2015

Bayesian Billiards

I recently watched a video of Jake VanderPlas explaining Bayesian statistics at SciPy 2014.  It's an excellent talk that includes some nice examples.  In fact, it looks like he had even more examples he didn't have time to present.

Although his presentation is very good, he includes one statement that is a little misleading, in my opinion.  Specifically, he shows an example where frequentist and Bayesian methods yield similar results, and concludes, "for very simple problems, frequentist and Bayesian results are often practically indistinguishable."

But as I argued in my recent talk, "Learning to Love Bayesian Statistics," frequentist methods generate point estimates and confidence intervals, whereas Bayesian methods produce posterior distributions.  That's two different kinds of things (different types, in programmer-speak) so they can't be equivalent.

If you reduce the Bayesian posterior to a point estimate and an interval, you can compare Bayesian and frequentist results, but in doing so you discard useful information and lose what I think is the most important advantage of Bayesian methods: the ability to use posterior distributions as inputs to other analyses and decision-making processes.

The next section of Jake's talk demonstrates this point nicely.  He presents the "Bayesian Billiards Problem", which Bayes wrote about in 1763.  Jake presents a version of the problem from this paper by Sean Eddy, which I'll quote:
"Alice and Bob are playing a game in which the first person to get 6 points wins. The way each point is decided is a little strange. The Casino has a pool table that Alice and Bob can't see. Before the game begins, the Casino rolls an initial ball onto the table, which comes to rest at a completely random position, which the Casino marks. Then, each point is decided by the Casino rolling another ball onto the table randomly. If it comes to rest to the left of the initial mark, Alice wins the point; to the right of the mark, Bob wins the point. The Casino reveals nothing to Alice and Bob except who won each point. 
"Clearly, the probability that Alice wins a point is the fraction of the table to the left of the mark—call this probability p; and Bob's probability of winning a point is 1 - p. Because the Casino rolled the initial ball to a random position, before any points were decided every value of p was equally probable. The mark is only set once per game, so p is the same for every point. 
"Imagine Alice is already winning 5 points to 3, and now she bets Bob that she's going to win. What are fair betting odds for Alice to offer Bob? That is, what is the expected probability that Alice will win?"
Eddy solves the problem using a Beta distribution to estimate p, then integrates over the posterior distribution of p to get the probability that Bob wins.  If you like that approach, you can read Eddy's paper.  If you prefer a computational approach, read on!  My solution is in this Python file.

The problem statement indicates that the prior distribution of p is uniform from 0 to 1.  Given a hypothetical value of p and the observed number of wins and losses, we can compute the likelihood of the data under each hypothesis:

  class Billiards(thinkbayes.Suite):

    def Likelihood(self, data, hypo):
        """Likelihood of the data under the hypothesis.

        data: tuple (#wins, #losses)
        hypo: float probability of win
        p = hypo
        win, lose = data
        like = p**win * (1-p)**lose
        return like

Billiards inherits the Update function from Suite (which is defined in and explained in Think Bayes) and provides Likelihood, which uses the binomial formula:

\textstyle {n \choose k}\, p^k (1-p)^{n-k}

I left out the first term, the binomial coefficient, because it doesn't depend on p, so it would just get normalized away.

Now I just have to create the prior and update it:

    ps = numpy.linspace(0, 1, 101)
    bill = Billiards(ps)
    bill.Update((5, 3))

The following figure shows the resulting posterior:

Now to compute the probability that Bob wins the match.  Since Alice is ahead 5 points to 3, Bob needs to win the next three points.  His chance of winning each point is (1-p), so the chance of winning the next three is (1-p)³.

We don't know the value of p, but we have its posterior distribution, which we can "integrate over" like this:

def ProbWinMatch(pmf):
    total = 0
    for p, prob in pmf.Items():
        total += prob * (1-p)**3
    return total

The result is = 0.091, which corresponds to 10:1 odds against.

Using a frequentist approach, we get a substantially different answer.  Instead of a posterior distribution, we get a single point estimate.  Assuming we use the MLE, we estimate p = 5/8, and (1-p)³ = 0.056, which corresponds to 18:1 odds against.

Needless to say, the Bayesian result is right and the frequentist result is wrong.

But let's consider why the frequentist result is wrong.  The problem is not the estimate itself.  In fact, in this example, the Bayesian maximum aposteriori probability (MAP) is the same as the frequentist MLE.  The difference is that Bayesian posterior contains all of the information we have about p, whereas the frequentist result discards a large part of that information.

The result we are interested in, the probability of winning the match, is a non-linear transform of p, and in general for a non-linear transform f, the expectation E[f(p)] does not equal f(E[p]).  The Bayesian method computes the first, which is right; the frequentist method approximates the second, which is wrong.

To summarize, Bayesian methods are better not just because the results are correct, but more importantly because the results are in a form, the posterior distribution, that lends itself to answering questions and guiding decision-making under uncertainty.

[UPDATE June 26, 2015]  Some readers have objected that this article is unfair, because any competent statistician would know that the frequentist method I presented here would behave badly, and even a classical statistician could happily use Bayes's theorem to solve this problem, because the prior is provided by the problem statement.

A few thoughts in response:

1) This article compares a Bayesian method and frequentist method.  I never said anything about Bayesians and frequentists as people, or about what methods a hypothetical statistician might choose.

2) The problem with the frequentist approach to this problem is that it produces a point estimate rather than a posterior distribution.  Any method that produces a point estimate is going to have the same problem.

3) Maybe it's true that a competent statistician would expect the frequentist approach to perform badly for this problem, but it still seems like that is a disadvantage of the frequentist approach.  I would rather have a method that is always right than a method that sometimes works and sometimes fails, and requires substantial expertise to know when to expect which.

1 comment:

  1. Hi Allen - thank you for this article! I'm trying to convert to bayesian b/c it looks so much more useful and practical, but I think this frequent objection mentioned in the Nature article sums up one of my biggest discomforts. Curious how you think about this. Is there any possible negative effect of using a uniform prior if in reality the distribution is quite different (although even that kind of blows my mind since in reality it's a point value, which i guess is covered in this first objection Nature covers above the one i paste below).

    "Second, Bayesian methods require specifying prior probability distributions, which are often themselves unknown. Bayesian analyses generally assume so-called 'uninformative' (often uniform) priors in such cases. Introducing subjective assumptions into an inference is unpalatable to some statisticians. The usual counterargument is that non-Bayesian methods make comparable assumptions implicitly, and it's probably better to have one's assumptions out in the open."