At SciPy last week I gave a talk called "Will Millennials Ever Get Married? Survival Analysis and Marriage Data". I presented results from my analysis of data from the National Survey of Family Growth (NSFG). The slides I presented are here.

The video is here:

The paper the talk is based on will appear in the conference proceedings, which should be out any day now.

More women will get married later and the number of women who remain unmarried will increase according to a new statistical investigation into the marriage patterns of millennials by Olin College professor Allen Downey. The analysis was unveiled at the SciPy conference in Austin, Texas.

The data shows that millennials are already marrying much later than previous generations, but now they are on pace to remain unmarried in substantially larger numbers than prior generations of women.

“The thing that is surprising to me is the size of the effect,” said Downey. Downey, a data scientist, set out to analyze marriage patterns for women born in the 1980s and 1990s using information from the National Survey of Family Growth compiled by the Center for Disease Control (CDC).

Women born in the 90s = 13 percent were married at age 22

Women born in the 80s = 23 percent were married at age 22

Women born in the 40s = 69 percent were married at age 22

“Going from 69 percent to 13 percent in 50 years is a huge magnitude of change,” said Downey.

He then used a statistical survival analysis to project what percentage of women will marry later in life – or at all. Downey only studied numbers for women because the data from the CDC is more comprehensive for females.

Of women born in the 1980s, Downey’s statistical survival projections show 72 percent are likely to marry by age 42, down from 82 percent of women born in the 1970s who were married by that age. For women born in the 1990s, the projection shows that only 68 percent are likely to marry by age 42. Contrast these figures with the fact that 92 percent of women born in the 1940s were married by age 42.

“These are projections, they are based on current trends and then seeing what would happen if the current trends continue…but at this point women born in the 80s and 90s are so far behind in marriage terms they would have to start getting married at much higher rates than previous generations in order to catch up,” said Downey.

Although his presentation is very good, he includes one statement that is a little misleading, in my opinion. Specifically, he shows an example where frequentist and Bayesian methods yield similar results, and concludes, "for very simple problems, frequentist and Bayesian results are often practically indistinguishable."

But as I argued in my recent talk, "Learning to Love Bayesian Statistics," frequentist methods generate point estimates and confidence intervals, whereas Bayesian methods produce posterior distributions. That's two different kinds of things (different types, in programmer-speak) so they can't be equivalent.

If you reduce the Bayesian posterior to a point estimate and an interval, you can compare Bayesian and frequentist results, but in doing so you discard useful information and lose what I think is the most important advantage of Bayesian methods: the ability to use posterior distributions as inputs to other analyses and decision-making processes.

The next section of Jake's talk demonstrates this point nicely. He presents the "Bayesian Billiards Problem", which Bayes wrote about in 1763. Jake presents a version of the problem from this paper by Sean Eddy, which I'll quote:

"Alice and Bob are playing a game in which the first person to get 6 points wins. The way each point is decided is a little strange. The Casino has a pool table that Alice and Bob can't see. Before the game begins, the Casino rolls an initial ball onto the table, which comes to rest at a completely random position, which the Casino marks. Then, each point is decided by the Casino rolling another ball onto the table randomly. If it comes to rest to the left of the initial mark, Alice wins the point; to the right of the mark, Bob wins the point. The Casino reveals nothing to Alice and Bob except who won each point.

"Clearly, the probability that Alice wins a point is the fraction of the table to the left of the mark—call this probability p; and Bob's probability of winning a point is 1 - p. Because the Casino rolled the initial ball to a random position, before any points were decided every value of p was equally probable. The mark is only set once per game, so p is the same for every point.

"Imagine Alice is already winning 5 points to 3, and now she bets Bob that she's going to win. What are fair betting odds for Alice to offer Bob? That is, what is the expected probability that Alice will win?"

Eddy solves the problem using a Beta distribution to estimate p, then integrates over the posterior distribution of p to get the probability that Bob wins. If you like that approach, you can read Eddy's paper. If you prefer a computational approach, read on! My solution is in this Python file.

The problem statement indicates that the prior distribution of p is uniform from 0 to 1. Given a hypothetical value of p and the observed number of wins and losses, we can compute the likelihood of the data under each hypothesis:

class Billiards(thinkbayes.Suite): def Likelihood(self, data, hypo): """Likelihood of the data under the hypothesis. data: tuple (#wins, #losses) hypo: float probability of win """ p = hypo win, lose = data like = p**win * (1-p)**lose return like

Billiards inherits the Update function from Suite (which is defined in thinkbayes.py and explained in Think Bayes) and provides Likelihood, which uses the binomial formula:

I left out the first term, the binomial coefficient, because it doesn't depend on p, so it would just get normalized away.

Now I just have to create the prior and update it:

ps = numpy.linspace(0, 1, 101)

bill = Billiards(ps)

bill.Update((5, 3))

thinkplot.Pdf(bill)

The following figure shows the resulting posterior:

Now to compute the probability that Bob wins the match. Since Alice is ahead 5 points to 3, Bob needs to win the next three points. His chance of winning each point is (1-p), so the chance of winning the next three is (1-p)³.

We don't know the value of p, but we have its posterior distribution, which we can "integrate over" like this:

def ProbWinMatch(pmf):

total = 0

for p, prob in pmf.Items():

total += prob * (1-p)**3

return total

The result is p = 0.091, which corresponds to 10:1 odds against.

Using a frequentist approach, we get a substantially different answer. Instead of a posterior distribution, we get a single point estimate. Assuming we use the MLE, we estimate p = 5/8, and (1-p)³ = 0.056, which corresponds to 18:1 odds against.

Needless to say, the Bayesian result is right and the frequentist result is wrong.

But let's consider why the frequentist result is wrong. The problem is not the estimate itself. In fact, in this example, the Bayesian maximum aposteriori probability (MAP) is the same as the frequentist MLE. The difference is that Bayesian posterior contains all of the information we have about p, whereas the frequentist result discards a large part of that information.

The result we are interested in, the probability of winning the match, is a non-linear transform of p, and in general for a non-linear transform f, the expectation E[f(p)] does not equal f(E[p]). The Bayesian method computes the first, which is right; the frequentist method approximates the second, which is wrong.

To summarize, Bayesian methods are better not just because the results are correct, but more importantly because the results are in a form, the posterior distribution, that lends itself to answering questions and guiding decision-making under uncertainty.

[UPDATE June 26, 2015] Some readers have objected that this article is unfair, because any competent statistician would know that the frequentist method I presented here would behave badly, and even a classical statistician could happily use Bayes's theorem to solve this problem, because the prior is provided by the problem statement.

A few thoughts in response:

1) This article compares a Bayesian method and frequentist method. I never said anything about Bayesians and frequentists as people, or about what methods a hypothetical statistician might choose.

2) The problem with the frequentist approach to this problem is that it produces a point estimate rather than a posterior distribution. Any method that produces a point estimate is going to have the same problem.

3) Maybe it's true that a competent statistician would expect the frequentist approach to perform badly for this problem, but it still seems like that is a disadvantage of the frequentist approach. I would rather have a method that is always right than a method that sometimes works and sometimes fails, and requires substantial expertise to know when to expect which.

"Sleeping Beauty volunteers to undergo the following experiment and is told all of the following details: On Sunday she will be put to sleep. Once or twice during the experiment, Beauty will be awakened, interviewed, and put back to sleep with an amnesia-inducing drug that makes her forget that awakening.

A fair coin will be tossed to determine which experimental procedure to undertake: if the coin comes up heads, Beauty will be awakened and interviewed on Monday only. If the coin comes up tails, she will be awakened and interviewed on Monday and Tuesday. In either case, she will be awakened on Wednesday without interview and the experiment ends.

Any time Sleeping Beauty is awakened and interviewed, she is asked, 'What is your belief now for the proposition that the coin landed heads?'"

The Halfer position. Simple! The coin is fair--and SB knows it--so she should believe there's a one-half chance of heads.

The Thirder position. Were this experiment to be repeated many times, then the coin will be heads only one third of the time SB is awakened. Her probability for heads will be one third.

The thirder position is correct, and I think the argument based on long-run averages is the most persuasive. From Wikipedia:

Suppose this experiment were repeated 1,000 times. It is expected that there would be 500 heads and 500 tails. So Beauty would be awoken 500 times after heads on Monday, 500 times after tails on Monday, and 500 times after tails on Tuesday. In other words, only in one-third of the cases would heads precede her awakening. This long-run expectation should give the same expectations for the one trial, so P(Heads) = 1/3.

Most, but not all, people who have written about this are thirders. But:

On Sunday evening, just before SB falls asleep, she must believe the chance of heads is one-half: that’s what it means to be a fair coin.

Whenever SB awakens, she has learned absolutely nothing she did not know Sunday night. What rational argument can she give, then, for stating that her belief in heads is now one-third and not one-half?

I wonder where exactly in Bayes' rule does the formula "fail". It seems like P(wake|H) = P(wake|T) = 1, and P(H) = P(T) = 1/2, leading to the P(H|wake) = 1/2 conclusion. Is it possible to get 1/3 using Bayes' rule?

I have come to a resolution of this problem that works, I think, but it made me realize the following subtle point: even if two things are inevitable, that doesn't make them equally likely.

In the previous calculation, the priors are correct: P(H) = P(T) = 1/2

It's the likelihoods that are wrong. The datum is "SB wakes up". This event happens once if the coin is heads and twice if it is tails, so the likelihood ratio P(wake|H) / P(wake|T) = 1/2

If you plug that into Bayes's theorem, you get the correct answer, 1/3.

This is an example where the odds form of Bayes's theorem is less error prone: the prior odds are 1:1. The likelihood ratio is 1:2, so the posterior odds are 1:2. By thinking in terms of likelihood ratio, rather than conditional probability, we avoid the pitfall.

If this example is still making your head hurt, here's an analogy that might help: suppose you live near a train station, and every morning you hear one express train and two local trains go past. The probability of hearing an express train is 1, and the probability of hearing a local train is 1. Nevertheless, the likelihood ratio is 1:2, and if you hear a train, the probability is only 1/3 that it is the express.

[UPDATE June 15, 2015] In the comments below, you’ll see an exchange between me and a reader named James. It took me a few tries to understand his question, so I’ll take the liberty of editing the conversation to make it clearer (and to make me seem a little quicker on the uptake):

James: I'd be interested in your reaction to the following extension. Before going to sleep on Sunday, Sleeping Beauty makes a bet at odds of 3:2 that the coin will come down heads. (This is favourable for her when the probability of heads is 1/2, and unfavourable when the probability of heads is 1/3). She is told that whenever she is woken up, she will be offered the opportunity to cancel any outstanding bets. Later she finds herself woken up, and asked whether she wants to cancel any outstanding bets. Should she say yes or no? (Let's say she doesn't have access to any external randomness to help her choose). Is her best answer compatible with a "belief of 1/3 that the coin is showing heads"?

Allen: If the bet is only resolved once (on Wednesday), then SB should accept the bet (and not cancel it) because she is effectively betting on a coin toss with favorable odds, and the whole sleeping-waking scenario is irrelevant.

James: Right, the bet is only resolved once. So, we agree that she should not cancel. But isn't there something odd? Put yourself in SB's position when you are woken up. You say that you have a "belief of 1/3 in the proposition that the coin is heads". The bet is unfavourable to you if the probability of heads is 1/3. And yet you don't cancel it. That suggests one sense in which you do NOT have a belief of 1/3 after all.

Allen: Ah, now I see why this is such an interesting problem. You are right that I seem to have SB keeping a bet that is inconsistent with her beliefs. But SB is not obligated to bet based on her current beliefs. If she knows that more information is coming in the future, she can compute a posterior based on that future information and bet accordingly.

Each time she wakes up, she should believe that she is more likely to be in the Tails scenario -- that is, that P(H) = 1/3 -- but she also knows that more information is coming her way.

Specifically, she knows that when she wakes up on Wednesday, and is told that it is Wednesday and the experiment is over, she will update her beliefs and conclude that the probability of Heads is 50% and the bet is favorable.

So when she wakes up on Monday or Tuesday and has the option to cancel the bet, she could think: "Based on my current beliefs, this bet is unfavorable, but I know that before the bet is resolved I will get more information that makes the bet favorable. So I will take that future information into account now and keep the bet (decline to cancel)."

I think the weirdness here is not in her beliefs but in the unusual scenario where she knows that she will get more information in the future. The Bayesian formulation of the problem tells you what she should believe after performing each update, but [the rest of the sentence deleted because I don’t think it’s quite right any more].

----

Upon further reflection, I think there is a general rule here:

When you evaluate a bet, you should evaluate it relative to what you will believe when the bet is resolved, which is not necessarily what you believe now. I’m going to call this the Fundamental Theorem of Betting, because it reminds me of Sklansky’s Fundamental Theorem of Poker, which says that the correct decision in a poker game is the decision you would make if all players’ cards were visible.

Under normal circumstances, we don’t know what we will believe in the future, so we almost always use our current beliefs as a heuristic for, or maybe estimate of, our future beliefs. Sleeping Beauty’s situation is unusual because she knows that more information is coming in the future, and she knows what the information will be!

To see how this theorem holds up, let me run the SB scenario and see if we can make sense of Sleeping Beauty’s beliefs and betting strategy:

Experimenter: Ok, SB, it’s Sunday night. After you go to sleep, we’re going to flip this fair coin. What do you believe is the probability that it will come up heads, P(H)?

Sleeping Beauty: I think P(H) is ½.

Ex: Ok. In that case, I wonder if you would be interested in a wager. If you bet on heads and win, I’ll pay 3:2, so if you bet $100, you will either win $150 or lose $100. Since you think P(H) is ½, this bet is in your favor. Do you want to accept it?

SB: Sure, why not?

Ex: Ok, on Wednesday I’ll tell you the outcome of the flip and we’ll settle the bet. Good night.

SB: Zzzzz.

Ex: Good morning!

SB: Hello. Is it Wednesday yet?

Ex: No, it’s not Wednesday, but that’s all I can tell you. At this point, what do you think is the probability that I flipped heads?

SB: Well, my prior was P(H) = ½. I’ve just observed an event (D = waking up before Wednesday) that is twice as likely under the tails scenario, so I’ll update my beliefs and conclude that P(H|D) = ⅓.

Ex: Interesting. Well, if the probability of heads is only ⅓, the bet we made Sunday night is no longer in your favor. Would you like to call it off?

SB: No, thanks.

Ex: But wait, doesn’t that mean that you are being inconsistent? You believe that the probability of heads is ⅓, but you are betting as if it were ½.

SB: On the contrary, my betting is consistent with my beliefs. The bet won’t be settled until Wednesday, so my current beliefs are not important. What matters is what I will believe when the bet is settled.

Ex: I suppose that makes sense. But do you mean to say that you know what you will believe on Wednesday?

SB: Normally I wouldn’t, but this scenario seems to be an unusual case. Not only do I know that I will get more information tomorrow; I even know what it will be.

Ex: How’s that?

SB: When you give me the amnesia drug, I will forget about the update I just made and revert to my prior. Then when I wake up on Wednesday, I will observe an event (E = waking up on Wednesday) that is equally likely under the heads and tails scenarios, so my posterior will equal my prior, I will believe that P(H|E) is ½, and I will conclude that the bet is in my favor.

Ex: So just before I tell you the outcome of the bet, you will believe that the probability of heads is ½?

SB: Right.

Ex: Well, if you know what information is coming in the future, why don’t you do the update now, and start believing that the probability of heads is ½?

SB: Well, I can compute P(H|E) now if you want. It’s ½ -- always has been and always will be. But that’s not what I should believe now, because I have only seen D, and not E yet.

Ex: So right now, do you think you are going to win the bet?

SB: Probably not. If I’m losing, you’ll ask me that question twice. But if I’m winning, you’ll only ask once. So ⅔ of the time you ask that question, I’m losing.

Ex: So you think you are probably losing, but you still want to keep the bet? That seems crazy.

SB: Maybe, but even so, my beliefs are based on the correct analysis of my situation, and my decision is consistent with my beliefs.

Ex: I’ll need to think about that. Well, good night.

“If the p-value is small, you can conclude that the fourth possibility is unlikely, and infer that the other three possibilities are more likely.”

Where “the fourth possibility” I referred to was:

“The apparent effect might be due to chance; that is, the difference might appear in a random sample, but not in the general population.”

Several commenters chastised me; for example:

“All a p-value tells you is the probability of your data (or data more extreme) given the null is true. They say nothing about whether your data is ‘due to chance.’”

My correspondent is partly right. The p-value is the probability of the apparent effect under the null hypothesis, which is not the same thing as the probability we should assign to the null hypothesis after seeing the data. But they are related to each other by Bayes’s theorem, which is why I said that you can use the first to make an inference about the second.

Let me explain by showing a few examples. I’ll start with what I think is a common scenario: suppose you are reading a scientific paper that reports a new relationship between two variables. There are (at least) three explanations you might consider for this apparent effect:

A: The effect might be actual; that is, it might exist in the unobserved population as well as the observed sample.

B: The effect might be bogus, caused by errors like sampling bias and measurement error, or by fraud.

C: The effect might be due to chance, appearing randomly in the sample but not in the population.

If we think of these as competing hypotheses to explain the apparent effect, we can use Bayes’s theorem to update our belief in each hypothesis.

Scenario #1

As a first scenario, suppose that the apparent effect is plausible, and the p-value is low. The following table shows what the Bayesian update might look like:

Prior

Likelihood

Posterior

Actual

70

0.5–1.0

~75

Bogus

20

0.5–1.0

~25

Chance

10

p = 0.01

~0.001

Since the apparent effect is plausible, I give it a prior probability of 70%. I assign a prior of 20% to the hypothesis that the result is to due to error or fraud, and 10% to the hypothesis that it’s due to chance. If you don’t agree with the numbers I chose, we’ll look at some alternatives soon. Or feel free to plug in your own.

Now we compute the likelihood of the apparent effect under each hypothesis. If the effect is real, it is quite likely to appear in the sample, so I figure the likelihood is between 50% and 100%. And in the presence of error or fraud, I assume the apparent effect is also quite likely.

If the effect is due to chance, we can compute the likelihood directly. The likelihood of the data under the null hypothesis is the p-value. As an example, suppose p=0.01.

The table shows the resulting posterior probabilities. The hypothesis that the effect is due to chance has been almost eliminated, and the other hypotheses are marginally more likely. The hypothesis test helps a little, by ruling out one source of error, but it doesn't help a lot, because randomness was not the source of error I was most worried about.

Scenario #2

In the second scenario, suppose the p-value is low, again, but the apparent effect is less plausible. In that case, I would assign a lower prior probability to Actual and a higher prior to Bogus. I am still inclined to assign a low priority to Chance, simply because I don’t think it is the most common cause of scientific error.

Prior

Likelihood

Posterior

Actual

20

0.5–1.0

~25

Bogus

70

0.5–1.0

~75

Chance

10

p = 0.01

~0.001

The results are pretty much the same as in Scenario #1: we can be reasonably confident that the result is not due to chance. But again, the hypothesis test does not make a lot of difference in my belief about the validity of the paper.

I believe these examples cover a large majority of real-world scenarios, and in each case my claim holds up: If the p-value is small, you can conclude that apparent effect is unlikely to be due to chance, and the other possibilities (Actual and Bogus) are more likely.

Scenario #3

I admit that there are situations where this conclusion would not be valid. For example, if the effect is implausible and you have good reason to think that error and fraud are unlikely, you might start with a larger prior belief in Chance. In that case a p-value like 0.01 might not be sufficient to rule out Chance:

Prior

Likelihood

Posterior

Actual

10

0.5

46

Bogus

10

0.5

46

Chance

80

p = 0.01

8

But even in this contrived scenario, the p-value has a substantial effect on our belief in the Chance hypothesis. And a somewhat smaller p-value, like 0.001, would be sufficient to rule out Chance as a likely cause of error.

In summary, NHST is problematic but not useless. If you think an apparent effect might be due to chance, choosing an appropriate null hypothesis and computing a p-value is a reasonable way to check.

Last fall I taught an introduction to Bayesian statistics at Olin College. My students worked on some excellent projects, and I invited them to write up their results as guest articles for this blog. Previous articles in the series are:

In this blog article, Allen Downey shows that marathon world record times have been advancing at a linear rate. He uses least squares estimation to calculate the parameters of the line of best fit, and predicts that the first 2-hour marathon will occur around 2041.

I extend his model, and calculate how likely it will be that someone will run a 2-hour marathon in 2041. Or more generally, I find the probability distribution of a certain record being hit on a certain date.

I break down my process into 4 steps:

Decide on a set of prior hypotheses about the trend of marathon records.

Update my priors using the world record data.

Use my posterior to run a monte-carlo simulation of world records.

Use this simulation to estimate the probability distribution of dates and records.

What are my prior hypotheses?

Since we are assuming the trend in running times is linear, I write my guesses as linear functions,

yi(xi) = Î± + Î²x + ei ,

where Î± and Î² are constants and ei is random error drawn from a normal distribution characterized by standard deviation Ïƒ. Thus I choose my priors to be list of combinations of Î±, Î², and Ïƒ.

I set the values of Î±, Î², and Ïƒ partially on the least squares estimate and partially on trial and error. I initialize my prior hypotheses to be uniformly distributed.

How do I update my priors?

I perform a bayesian update based on a list of running records. For each running record, (date, record_mph), I calculate the likelihood based on the following function.

I find the error between the hypothesized record and the measured record (in miles per hour), then I find the probability of seeing that error given the normal distribution with standard deviation of Ïƒ.

The posterior PMFs for Î±, Î², and Ïƒ

Î±

Î²

Ïƒ

Here are the posterior marginal distributions of Î±, Î², and Ïƒ after updating with all the running records. The distributions of Î± and Î² peak close to the least squares estimate. How close are they to least squares estimate? Here I’ve plotted a line based on the maximum likelihood of Î±, Î², compared to the least squares estimate. Graphically, they are almost indistinguishable.

But I use these distributions to estimate more than a single line! I create a monte-carlo simulation of running records.

How do I simulate these records?

I pull a random Î±, Î², and Ïƒ from their PMF (weighted by probability), generate a list of records for each year from 2014 to 2060, then repeat. The results are in the figure below, plotted alongside the previous records.

It may seem counterintuitive to see simulated running speeds fall below the current records, but it make sense! Think of those as years wherein the first-place marathon didn’t set a world-record.

These simulated records are most dense where they are most likely, but the scatter plot is a poor visualization of their densities. To fix this, I merge the points into discrete buckets and count the number of points in each bucket to estimate their probabilities. This resulting joint probability distribution is what I’ve been looking for: given a date and a record, I can tell you the probability of setting that record on that date. I plot the joint distribution in the figure below.

It is difficult to create a pcolor plot with pandas Timestamp objects as the unit, so I use nanoseconds since 1970. The vertical and horizontal lines are at year 2041 and 13.11mph (the two-hour marathon pace). The lines fall within a dense region of the graph, which indicates that they are likely.

To identify the probability of the two-hour record being hit before the year 2042, I take the conditional distribution for the running speed of 13.11mph to obtain the PMF of years, then I compute the total probability of that running speed being hit before the year 2042.

This answers my original question. There is a 38.5% probability that someone will have run a two-hour marathon by the end of 2041.

To get a bigger picture, I create a similar conditional PMF for each running speed bucket, convert it to a CDF, and extract the years corresponding to several percentiles. The 95th percentile means there is a 95% probability of reaching that particular running speed by that year. I plot these percentiles to get a better sense of how likely any record is to be broken by any date.

Again, the black horizontal and vertical lines correspond to the running speed for a two-hour marathon and the year 2041. You can see that they fall between 25% and 50% probability (as we already know). By looking at this plot, we can also say that the two-hour marathon will be broken by the year 2050 with about a 95% probability.