Thursday, June 25, 2015

Bayesian Billiards

I recently watched a video of Jake VanderPlas explaining Bayesian statistics at SciPy 2014.  It's an excellent talk that includes some nice examples.  In fact, it looks like he had even more examples he didn't have time to present.

Although his presentation is very good, he includes one statement that is a little misleading, in my opinion.  Specifically, he shows an example where frequentist and Bayesian methods yield similar results, and concludes, "for very simple problems, frequentist and Bayesian results are often practically indistinguishable."

But as I argued in my recent talk, "Learning to Love Bayesian Statistics," frequentist methods generate point estimates and confidence intervals, whereas Bayesian methods produce posterior distributions.  That's two different kinds of things (different types, in programmer-speak) so they can't be equivalent.

If you reduce the Bayesian posterior to a point estimate and an interval, you can compare Bayesian and frequentist results, but in doing so you discard useful information and lose what I think is the most important advantage of Bayesian methods: the ability to use posterior distributions as inputs to other analyses and decision-making processes.

The next section of Jake's talk demonstrates this point nicely.  He presents the "Bayesian Billiards Problem", which Bayes wrote about in 1763.  Jake presents a version of the problem from this paper by Sean Eddy, which I'll quote:
"Alice and Bob are playing a game in which the first person to get 6 points wins. The way each point is decided is a little strange. The Casino has a pool table that Alice and Bob can't see. Before the game begins, the Casino rolls an initial ball onto the table, which comes to rest at a completely random position, which the Casino marks. Then, each point is decided by the Casino rolling another ball onto the table randomly. If it comes to rest to the left of the initial mark, Alice wins the point; to the right of the mark, Bob wins the point. The Casino reveals nothing to Alice and Bob except who won each point. 
"Clearly, the probability that Alice wins a point is the fraction of the table to the left of the mark—call this probability p; and Bob's probability of winning a point is 1 - p. Because the Casino rolled the initial ball to a random position, before any points were decided every value of p was equally probable. The mark is only set once per game, so p is the same for every point. 
"Imagine Alice is already winning 5 points to 3, and now she bets Bob that she's going to win. What are fair betting odds for Alice to offer Bob? That is, what is the expected probability that Alice will win?"
Eddy solves the problem using a Beta distribution to estimate p, then integrates over the posterior distribution of p to get the probability that Bob wins.  If you like that approach, you can read Eddy's paper.  If you prefer a computational approach, read on!  My solution is in this Python file.

The problem statement indicates that the prior distribution of p is uniform from 0 to 1.  Given a hypothetical value of p and the observed number of wins and losses, we can compute the likelihood of the data under each hypothesis:

import thinkbayes

class Billiards(thinkbayes.Suite):

    def Likelihood(self, data, hypo):
        """Likelihood of the data under the hypothesis.

        data: tuple (#wins, #losses)
        hypo: float probability of win
        """
        p = hypo
        win, lose = data
        like = p**win * (1-p)**lose
        return like

Billiards inherits the Update function from Suite (which is defined in thinkbayes.py and explained in Think Bayes) and provides Likelihood, which uses the binomial formula:

{n \choose k}\, p^k (1-p)^{n-k}

I left out the first term, the binomial coefficient, because it doesn't depend on p, so it would just get normalized away.

Now I just have to create the prior and update it:

import numpy
import thinkplot

ps = numpy.linspace(0, 1, 101)
bill = Billiards(ps)
bill.Update((5, 3))
thinkplot.Pdf(bill)

The following figure shows the resulting posterior:


Now to compute the probability that Bob wins the match.  Since Alice is ahead 5 points to 3, Bob needs to win the next three points.  His chance of winning each point is (1-p), so the chance of winning the next three is (1-p)³.

We don't know the value of p, but we have its posterior distribution, which we can "integrate over" like this:

def ProbWinMatch(pmf):
    """Probability that Bob wins the next three points, averaged over pmf."""
    total = 0
    for p, prob in pmf.Items():
        total += prob * (1-p)**3
    return total
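Calling this function with the posterior computed above gives the result reported below:

print(ProbWinMatch(bill))    # approximately 0.091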

The result is 0.091, which corresponds to about 10:1 odds against.

Using a frequentist approach, we get a substantially different answer.  Instead of a posterior distribution, we get a single point estimate.  Assuming we use the MLE, we estimate p = 5/8, and (1-p)³ ≈ 0.053, which corresponds to 18:1 odds against.

Needless to say, the Bayesian result is right and the frequentist result is wrong.

But let's consider why the frequentist result is wrong.  The problem is not the estimate itself.  In fact, in this example, the Bayesian maximum a posteriori (MAP) estimate is the same as the frequentist MLE.  The difference is that the Bayesian posterior contains all of the information we have about p, whereas the frequentist result discards a large part of that information.

The result we are interested in, the probability of winning the match, is a non-linear transform of p, and in general for a non-linear transform f, the expectation E[f(p)] does not equal f(E[p]).  The Bayesian method computes the first, which is right; the frequentist method approximates the second, which is wrong.
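To make the distinction concrete, here is a minimal sketch comparing the two quantities, using the posterior bill computed above (Mean is a Pmf method provided by thinkbayes):

# E[f(p)]: average f(p) = (1-p)**3 over the posterior distribution
print(ProbWinMatch(bill))        # approximately 0.091

# f(E[p]): apply f to a single point estimate of p
print((1 - bill.Mean())**3)      # approximately 0.064 -- too low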

To summarize, Bayesian methods are better not just because the results are correct, but more importantly because the results are in a form, the posterior distribution, that lends itself to answering questions and guiding decision-making under uncertainty.



[UPDATE June 26, 2015]  Some readers have objected that this article is unfair, because any competent statistician would know that the frequentist method I presented here would behave badly, and even a classical statistician could happily use Bayes's theorem to solve this problem, because the prior is provided by the problem statement.

A few thoughts in response:

1) This article compares a Bayesian method and frequentist method.  I never said anything about Bayesians and frequentists as people, or about what methods a hypothetical statistician might choose.

2) The problem with the frequentist approach to this problem is that it produces a point estimate rather than a posterior distribution.  Any method that produces a point estimate is going to have the same problem.

3) Maybe it's true that a competent statistician would expect the frequentist approach to perform badly for this problem, but it still seems like that is a disadvantage of the frequentist approach.  I would rather have a method that is always right than a method that sometimes works and sometimes fails, and requires substantial expertise to know when to expect which.



Friday, June 12, 2015

The Sleeping Beauty Problem

How did I get to my ripe old age without hearing about The Sleeping Beauty Problem?  I've taught and written about The Monty Hall Problem and The Girl Named Florida Problem.  But I didn't hear about Sleeping Beauty until this reddit post, pointing to this video:


Of course, there's a Wikipedia page about it, which I'll borrow to provide the background:
"Sleeping Beauty volunteers to undergo the following experiment and is told all of the following details: On Sunday she will be put to sleep. Once or twice during the experiment, Beauty will be awakened, interviewed, and put back to sleep with an amnesia-inducing drug that makes her forget that awakening. 
A fair coin will be tossed to determine which experimental procedure to undertake: if the coin comes up heads, Beauty will be awakened and interviewed on Monday only. If the coin comes up tails, she will be awakened and interviewed on Monday and Tuesday. In either case, she will be awakened on Wednesday without interview and the experiment ends. 
Any time Sleeping Beauty is awakened and interviewed, she is asked, 'What is your belief now for the proposition that the coin landed heads?'"
The problem is discussed at length on this CrossValidated thread.  As the person who posted the question explains, there are two common reactions to this problem:
The Halfer position. Simple! The coin is fair--and SB knows it--so she should believe there's a one-half chance of heads. 
The Thirder position. Were this experiment to be repeated many times, then the coin will be heads only one third of the time SB is awakened. Her probability for heads will be one third.
The thirder position is correct, and I think the argument based on long-run averages is the most persuasive.  From Wikipedia:
Suppose this experiment were repeated 1,000 times. It is expected that there would be 500 heads and 500 tails. So Beauty would be awoken 500 times after heads on Monday, 500 times after tails on Monday, and 500 times after tails on Tuesday. In other words, only in one-third of the cases would heads precede her awakening. This long-run expectation should give the same expectations for the one trial, so P(Heads) = 1/3.
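This long-run argument is easy to check with a quick simulation; here is a minimal sketch (the number of trials is arbitrary):

import random

heads_awakenings = 0
total_awakenings = 0
for _ in range(100000):
    if random.random() < 0.5:    # heads: awakened on Monday only
        heads_awakenings += 1
        total_awakenings += 1
    else:                        # tails: awakened on Monday and Tuesday
        total_awakenings += 2

print(heads_awakenings / float(total_awakenings))   # approximately 1/3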
But here's the difficulty (from CrossValidated):


Most, but not all, people who have written about this are thirders. But:
  • On Sunday evening, just before SB falls asleep, she must believe the chance of heads is one-half: that’s what it means to be a fair coin.
  • Whenever SB awakens, she has learned absolutely nothing she did not know Sunday night.  What rational argument can she give, then, for stating that her belief in heads is now one-third and not one-half?
As a stark raving Bayesian, I find this mildly disturbing.  Is this an example where frequentism gets it right and Bayesianism gets it wrong?  One of the responses on reddit pursues the same thought:


I wonder where exactly in Bayes' rule does the formula "fail". It seems like P(wake|H) = P(wake|T) = 1, and P(H) = P(T) = 1/2, leading to the P(H|wake) = 1/2 conclusion.
Is it possible to get 1/3 using Bayes' rule?
I have come to a resolution of this problem that works, I think, but it made me realize the following subtle point: even if two things are inevitable, that doesn't make them equally likely.

In the previous calculation, the priors are correct: P(H) = P(T) = 1/2

It's the likelihoods that are wrong.  The datum is "SB wakes up".  This event happens once if the coin is heads and twice if it is tails, so the likelihood ratio P(wake|H) / P(wake|T) = 1/2

If you plug that into Bayes's theorem, you get the correct answer, 1/3.

This is an example where the odds form of Bayes's theorem is less error prone: the prior odds are 1:1.  The likelihood ratio is 1:2, so the posterior odds are 1:2.  By thinking in terms of likelihood ratio, rather than conditional probability, we avoid the pitfall.
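Here is that computation as a minimal sketch in Python:

prior_odds = 1.0                # heads : tails = 1 : 1
likelihood_ratio = 1.0 / 2      # "SB wakes up" is twice as likely under tails

posterior_odds = prior_odds * likelihood_ratio
p_heads = posterior_odds / (1 + posterior_odds)
print(p_heads)                  # 1/3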

If this example is still making your head hurt, here's an analogy that might help: suppose you live near a train station, and every morning you hear one express train and two local trains go past.  The probability of hearing an express train is 1, and the probability of hearing a local train is 1.  Nevertheless, the likelihood ratio is 1:2, and if you hear a train, the probability is only 1/3 that it is the express.


[UPDATE June 15, 2015]  In the comments below, you’ll see an exchange between me and a reader named James.  It took me a few tries to understand his question, so I’ll take the liberty of editing the conversation to make it clearer (and to make me seem a little quicker on the uptake):

James: I'd be interested in your reaction to the following extension. Before going to sleep on Sunday, Sleeping Beauty makes a bet at odds of 3:2 that the coin will come down heads. (This is favourable for her when the probability of heads is 1/2, and unfavourable when the probability of heads is 1/3). She is told that whenever she is woken up, she will be offered the opportunity to cancel any outstanding bets. Later she finds herself woken up, and asked whether she wants to cancel any outstanding bets. Should she say yes or no? (Let's say she doesn't have access to any external randomness to help her choose). Is her best answer compatible with a "belief of 1/3 that the coin is showing heads"?

Allen: If the bet is only resolved once (on Wednesday), then SB should accept the bet (and not cancel it) because she is effectively betting on a coin toss with favorable odds, and the whole sleeping-waking scenario is irrelevant.

James: Right, the bet is only resolved once.  So, we agree that she should not cancel. But isn't there something odd? Put yourself in SB's position when you are woken up. You say that you have a "belief of 1/3 in the proposition that the coin is heads". The bet is unfavourable to you if the probability of heads is 1/3. And yet you don't cancel it. That suggests one sense in which you do NOT have a belief of 1/3 after all.

Allen: Ah, now I see why this is such an interesting problem.  You are right that I seem to have SB keeping a bet that is inconsistent with her beliefs.  But SB is not obligated to bet based on her current beliefs. If she knows that more information is coming in the future, she can compute a posterior based on that future information and bet accordingly.

Each time she wakes up, she should believe that she is more likely to be in the Tails scenario -- that is, that P(H) = 1/3 -- but she also knows that more information is coming her way.

Specifically, she knows that when she wakes up on Wednesday, and is told that it is Wednesday and the experiment is over, she will update her beliefs and conclude that the probability of Heads is 50% and the bet is favorable.

So when she wakes up on Monday or Tuesday and has the option to cancel the bet, she could think: "Based on my current beliefs, this bet is unfavorable, but I know that before the bet is resolved I will get more information that makes the bet favorable. So I will take that future information into account now and keep the bet (decline to cancel)."

I think the weirdness here is not in her beliefs but in the unusual scenario where she knows that she will get more information in the future. The Bayesian formulation of the problem tells you what she should believe after performing each update, but [the rest of the sentence deleted because I don’t think it’s quite right any more].

----

Upon further reflection, I think there is a general rule here:

When you evaluate a bet, you should evaluate it relative to what you will believe when the bet is resolved, which is not necessarily what you believe now.  I’m going to call this the Fundamental Theorem of Betting, because it reminds me of Sklansky’s Fundamental Theorem of Poker, which says that the correct decision in a poker game is the decision you would make if all players’ cards were visible.

Under normal circumstances, we don’t know what we will believe in the future, so we almost always use our current beliefs as a heuristic for, or maybe estimate of, our future beliefs.  Sleeping Beauty’s situation is unusual because she knows that more information is coming in the future, and she knows what the information will be!
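For concreteness, here is the arithmetic behind the bet as a quick sketch (the $100 stake matches the dialogue below):

stake, payout = 100.0, 150.0    # betting on heads at 3:2 odds

# Expected value if the probability of heads is 1/2 (Sunday night)
print(0.5 * payout - 0.5 * stake)        # +25, favorable

# Expected value if the probability of heads is 1/3 (on waking)
print(payout / 3 - 2 * stake / 3)        # about -17, unfavorable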

To see how this theorem holds up, let me run the SB scenario and see if we can make sense of Sleeping Beauty’s beliefs and betting strategy:

Experimenter: Ok, SB, it’s Sunday night.  After you go to sleep, we’re going to flip this fair coin.  What do you believe is the probability that it will come up heads, P(H)?

Sleeping Beauty:  I think P(H) is ½.

Ex: Ok.  In that case, I wonder if you would be interested in a wager.  If you bet on heads and win, I’ll pay 3:2, so if you bet $100, you will either win $150 or lose $100.  Since you think P(H) is ½, this bet is in your favor.  Do you want to accept it?

SB: Sure, why not?

Ex: Ok, on Wednesday I’ll tell you the outcome of the flip and we’ll settle the bet.  Good night.

SB:  Zzzzz.

Ex: Good morning!

SB: Hello.  Is it Wednesday yet?

Ex: No, it’s not Wednesday, but that’s all I can tell you.  At this point, what do you think is the probability that I flipped heads?

SB: Well, my prior was P(H) = ½.  I’ve just observed an event (D = waking up before Wednesday) that is twice as likely under the tails scenario, so I’ll update my beliefs and conclude that  P(H|D) = ⅓.

Ex: Interesting.  Well, if the probability of heads is only ⅓, the bet we made Sunday night is no longer in your favor.  Would you like to call it off?

SB: No, thanks.

Ex: But wait, doesn’t that mean that you are being inconsistent?  You believe that the probability of heads is ⅓, but you are betting as if it were ½.

SB: On the contrary, my betting is consistent with my beliefs.  The bet won’t be settled until Wednesday, so my current beliefs are not important.  What matters is what I will believe when the bet is settled.

Ex: I suppose that makes sense.  But do you mean to say that you know what you will believe on Wednesday?

SB: Normally I wouldn’t, but this scenario seems to be an unusual case.  Not only do I know that I will get more information tomorrow; I even know what it will be.

Ex: How’s that?

SB: When you give me the amnesia drug, I will forget about the update I just made and revert to my prior.  Then when I wake up on Wednesday, I will observe an event (E = waking up on Wednesday) that is equally likely under the heads and tails scenarios, so my posterior will equal my prior, I will believe that P(H|E) is ½, and I will conclude that the bet is in my favor.

Ex: So just before I tell you the outcome of the bet, you will believe that the probability of heads is ½?

SB: Right.

Ex: Well, if you know what information is coming in the future, why don’t you do the update now, and start believing that the probability of heads is ½?

SB: Well, I can compute P(H|E) now if you want.  It’s ½ -- always has been and always will be.  But that’s not what I should believe now, because I have only seen D, and not E yet.

Ex: So right now, do you think you are going to win the bet?

SB: Probably not.  If I’m losing, you’ll ask me that question twice.  But if I’m winning, you’ll only ask once.  So ⅔ of the time you ask that question, I’m losing.

Ex: So you think you are probably losing, but you still want to keep the bet?  That seems crazy.

SB: Maybe, but even so, my beliefs are based on the correct analysis of my situation, and my decision is consistent with my beliefs.

Ex: I’ll need to think about that.  Well, good night.

SB: Zzzzz.






Friday, May 1, 2015

Hypothesis testing is only mostly useless

UPDATE May 4: Nature recently reported the outcome of an attempt to replicate 100 results from psychology research. They characterize the findings as "worrying."




In my previous article I argued that classical statistical inference is only mostly wrong. About null-hypothesis significance testing (NHST),  I wrote:


“If the p-value is small, you can conclude that the fourth possibility is unlikely, and infer that the other three possibilities are more likely.”


The “fourth possibility” I referred to was:


“The apparent effect might be due to chance; that is, the difference might appear in a random sample, but not in the general population.”


Several commenters chastised me; for example:


“All a p-value tells you is the probability of your data (or data more extreme) given the null is true. They say nothing about whether your data is ‘due to chance.’”


My correspondent is partly right.  The p-value is the probability of the apparent effect under the null hypothesis, which is not the same thing as the probability we should assign to the null hypothesis after seeing the data.  But they are related to each other by Bayes’s theorem, which is why I said that you can use the first to make an inference about the second.


Let me explain by showing a few examples.  I’ll start with what I think is a common scenario:  suppose you are reading a scientific paper that reports a new relationship between two variables.  There are (at least) three explanations you might consider for this apparent effect:


A:  The effect might be actual; that is, it might exist in the unobserved population as well as the observed sample.
B:  The effect might be bogus, caused by errors like sampling bias and measurement error, or by fraud.
C:  The effect might be due to chance, appearing randomly in the sample but not in the population.


If we think of these as competing hypotheses to explain the apparent effect, we can use Bayes’s theorem to update our belief in each hypothesis.


Scenario #1



As a first scenario, suppose that the apparent effect is plausible, and the p-value is low.  The following table shows what the Bayesian update might look like:



         Prior    Likelihood    Posterior
Actual   70       0.5 to 1.0    ~75
Bogus    20       0.5 to 1.0    ~25
Chance   10       p = 0.01      ~0.001


Since the apparent effect is plausible, I give it a prior probability of 70%.  I assign a prior of 20% to the hypothesis that the result is due to error or fraud, and 10% to the hypothesis that it’s due to chance.  If you don’t agree with the numbers I chose, we’ll look at some alternatives soon.  Or feel free to plug in your own.


Now we compute the likelihood of the apparent effect under each hypothesis.  If the effect is real, it is quite likely to appear in the sample, so I figure the likelihood is between 50% and 100%.   And in the presence of error or fraud, I assume the apparent effect is also quite likely.


If the effect is due to chance, we can compute the likelihood directly.  The likelihood of the data under the null hypothesis is the p-value.  As an example, suppose p=0.01.


The table shows the resulting posterior probabilities.  The hypothesis that the effect is due to chance has been almost eliminated, and the other hypotheses are marginally more likely.  The hypothesis test helps a little, by ruling out one source of error, but it doesn't help a lot, because randomness was not the source of error I was most worried about.
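Here is a minimal sketch of the update behind this table (the 0.75 likelihood for Actual and Bogus is just a representative value from the 0.5 to 1.0 range):

def UpdateTable(priors, likelihoods):
    """Multiplies priors by likelihoods and normalizes."""
    unnorm = [prior * like for prior, like in zip(priors, likelihoods)]
    total = sum(unnorm)
    return [x / total for x in unnorm]

# Scenario #1: Actual, Bogus, Chance
print(UpdateTable([0.70, 0.20, 0.10], [0.75, 0.75, 0.01]))
# approximately [0.78, 0.22, 0.001]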


Scenario #2



In the second scenario, suppose the p-value is low, again, but the apparent effect is less plausible.  In that case, I would assign a lower prior probability to Actual and a higher prior to Bogus.  I am still inclined to assign a low prior to Chance, simply because I don’t think it is the most common cause of scientific error.



         Prior    Likelihood    Posterior
Actual   20       0.5 to 1.0    ~25
Bogus    70       0.5 to 1.0    ~75
Chance   10       p = 0.01      ~0.001


The results are pretty much the same as in Scenario #1: we can be reasonably confident that the result is not due to chance.  But again, the hypothesis test does not make a lot of difference in my belief about the validity of the paper.


I believe these examples cover a large majority of real-world scenarios, and in each case my claim holds up:  If the p-value is small, you can conclude that the apparent effect is unlikely to be due to chance, and the other possibilities (Actual and Bogus) are more likely.

Scenario #3


I admit that there are situations where this conclusion would not be valid.  For example, if the effect is implausible and you have good reason to think that error and fraud are unlikely, you might start with a larger prior belief in Chance.  In that case a p-value like 0.01 might not be sufficient to rule out Chance:



         Prior    Likelihood    Posterior
Actual   10       0.5           46
Bogus    10       0.5           46
Chance   80       p = 0.01      8


But even in this contrived scenario, the p-value has a substantial effect on our belief in the Chance hypothesis.  And a somewhat smaller p-value, like 0.001, would be sufficient to rule out Chance as a likely cause of error.
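Plugging the Scenario #3 numbers into the same sketch shows the difference a smaller p-value makes:

# Scenario #3 with p = 0.01, then with p = 0.001
print(UpdateTable([0.10, 0.10, 0.80], [0.5, 0.5, 0.01]))     # approximately [0.46, 0.46, 0.07]
print(UpdateTable([0.10, 0.10, 0.80], [0.5, 0.5, 0.001]))    # approximately [0.50, 0.50, 0.008]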


In summary, NHST is problematic but not useless.  If you think an apparent effect might be due to chance, choosing an appropriate null hypothesis and computing a p-value is a reasonable way to check.

Wednesday, April 15, 2015

Two hour marathon by 2041 -- probably

Last fall I taught an introduction to Bayesian statistics at Olin College.  My students worked on some excellent projects, and I invited them to write up their results as guest articles for this blog.  Previous articles in the series are:

Bayesian predictions for Super Bowl XLIX
Bayesian analysis of match rates on Tinder
Bayesian survival analysis for Game of Thrones

Now, with the Boston Marathon coming up, we have an article about predicting world record marathon times.

Modeling Marathon Records With Bayesian Linear Regression

Cypress Frankenfeld

In this blog article, Allen Downey shows that marathon world record times have been advancing at a linear rate. He uses least squares estimation to calculate the parameters of the line of best fit, and predicts that the first 2-hour marathon will occur around 2041.

I extend his model and calculate how likely it is that someone will run a 2-hour marathon by 2041. Or more generally, I find the probability distribution of a certain record being hit on a certain date.

I break down my process into 4 steps:
  1. Decide on a set of prior hypotheses about the trend of marathon records.
  2. Update my priors using the world record data.
  3. Use my posterior to run a Monte Carlo simulation of world records.
  4. Use this simulation to estimate the probability distribution of dates and records.

What are my prior hypotheses?

Since we are assuming the trend in running times is linear, I write my guesses as linear functions,
yᵢ = α + βxᵢ + eᵢ ,
where α and β are constants and eᵢ is random error drawn from a normal distribution with standard deviation σ. Thus I choose my priors to be a list of combinations of α, β, and σ.

# The least squares fit provides the center of the grid of hypotheses
intercept_estimate, slope_estimate = LeastSquares(dates, records)
alpha_range = 0.007
beta_range = 0.2
alphas = np.linspace(intercept_estimate * (1 - alpha_range),
                     intercept_estimate * (1 + alpha_range),
                     40)
betas = np.linspace(slope_estimate * (1 - beta_range),
                    slope_estimate * (1 + beta_range),
                    40)
sigmas = np.linspace(0.001, 0.1, 40)
hypos = [(alpha, beta, sigma) for alpha in alphas
         for beta in betas for sigma in sigmas]

I base the values of α, β, and σ partly on the least squares estimate and partly on trial and error. I initialize my prior hypotheses to be uniformly distributed.

How do I update my priors?

I perform a Bayesian update based on a list of running records. For each running record, (date, record_mph), I calculate the likelihood using the following function.

def Likelihood(self, data, hypo):
    """Likelihood of the records under an (alpha, beta, sigma) hypothesis."""
    alpha, beta, sigma = hypo
    total_likelihood = 1
    for date, measured_mph in data:
        hypothesized_mph = alpha + beta * date
        error = measured_mph - hypothesized_mph
        total_likelihood *= EvalNormalPdf(error, mu=0, sigma=sigma)
    return total_likelihood

I find the error between the hypothesized record and the measured record (in miles per hour), then I find the probability of seeing that error given the normal distribution with standard deviation of σ.

The posterior PMFs for α, β, and σ


[Figures: posterior marginal distributions of α, β, and σ]

Here are the posterior marginal distributions of α, β, and σ after updating with all the running records. The distributions of α and β peak close to the least squares estimate. How close are they to the least squares estimate? Here I’ve plotted a line based on the maximum likelihood values of α and β, compared to the least squares estimate. Graphically, they are almost indistinguishable.
But I use these distributions to estimate more than a single line! I create a Monte Carlo simulation of running records.

How do I simulate these records?

I pull a random α, β, and σ from their PMF (weighted by probability), generate a list of records for each year from 2014 to 2060, then repeat. The results are in the figure below, plotted alongside the previous records.
It may seem counterintuitive to see simulated running speeds fall below the current records, but it makes sense! Think of those as years in which the first-place marathon didn’t set a world record.
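Here is a rough sketch of that simulation loop. I assume the updated suite of (α, β, σ) hypotheses is called suite, that dates are expressed in years, and that Random draws a hypothesis in proportion to its posterior probability (as the Pmf method in thinkbayes does):

import numpy as np

def SimulateRecords(suite, years):
    """Draws one (alpha, beta, sigma) hypothesis and generates one simulated
    record attempt per year, in mph."""
    alpha, beta, sigma = suite.Random()
    errors = np.random.normal(0, sigma, len(years))
    return alpha + beta * np.asarray(years) + errors

years = np.arange(2014, 2061)
simulated = [SimulateRecords(suite, years) for _ in range(1000)]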

These simulated records are most dense where they are most likely, but the scatter plot is a poor visualization of their densities. To fix this, I merge the points into discrete buckets and count the number of points in each bucket to estimate their probabilities. This resulting joint probability distribution is what I’ve been looking for: given a date and a record, I can tell you the probability of setting that record on that date. I plot the joint distribution in the figure below.
It is difficult to create a pcolor plot with pandas Timestamp objects as the unit, so I use nanoseconds since 1970. The vertical and horizontal lines are at year 2041 and 13.11mph (the two-hour marathon pace). The lines fall within a dense region of the graph, which indicates that they are likely.

To identify the probability of the two-hour record being hit before the year 2042, I take the conditional distribution for the running speed of 13.11mph to obtain the PMF of years, then I compute the total probability of that running speed being hit before the year 2042.

>>> year2042 = pandas.to_datetime('2042')
>>> joint_distribution.Conditional(0, 1, 13.11).ProbLess(year2042.value)
0.385445

This answers my original question. There is a 38.5% probability that someone will have run a two-hour marathon by the end of 2041.

To get a bigger picture, I create a similar conditional PMF for each running speed bucket, convert it to a CDF, and extract the years corresponding to several percentiles. The 95th percentile means there is a 95% probability of reaching that particular running speed by that year. I plot these percentiles to get a better sense of how likely any record is to be broken by any date.
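Here is roughly how those percentiles can be extracted (a sketch; speeds stands for the list of speed buckets, and Conditional and Items are the methods used above):

def PercentileYear(joint, speed, percentage):
    """Date by which speed is reached with the given probability, based on
    the conditional distribution of dates for that speed."""
    pmf = joint.Conditional(0, 1, speed)
    total = 0
    for date, prob in sorted(pmf.Items()):
        total += prob
        if total >= percentage / 100.0:
            return date

percentile_years = [PercentileYear(joint_distribution, speed, 95)
                    for speed in speeds]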

Again, the black horizontal and vertical lines correspond to the running speed for a two-hour marathon and the year 2041. You can see that they fall between 25% and 50% probability (as we already know). By looking at this plot, we can also say that the two-hour marathon will be broken by the year 2050 with about a 95% probability.