Wednesday, September 26, 2018

This blog has moved

As of September 2018, I am moving Probably Overthinking It to a new location.

Blogger has been an excellent host; so far, this blog has received more than two million page views.

But earlier this month, when I published a new article, Blogger prompted me to post it on Google+, and I did.  A few hours later, I discovered that my Google+ account had been suspended for violating the terms of service, but I got no information about which terms I had violated.

While my Google+ account was suspended, I was unable to access Blogger and some other Google services.  And since Probably Overthinking It is a substantial part of my professional web presence, that was unacceptable.

I appealed the suspension by pressing a button, with no opportunity to ask a question.  Within 24 hours, my account was restored, but with no communication and still no information.

So for me, using Google+ has become a game of Russian Roulette.  Every time I post something, there seems to be a random chance that I will lose control of my web presence.  And maybe next time it will be permanent.

It is nice that using Blogger is free, but this episode has been a valuable reminder that "If you are not paying for it, you are not the customer".  (Who said that?)

I have moved Probably Overthinking It to a site I control, hosted by a company I pay, a company that has provided consistently excellent customer service.

Lesson learned.

[When I published this article, Blogger asked if I wanted to post it on Google+.  I did not.]

UPDATE: See the discussion of this post on Hacker News, with lots of good advice for migrating to services you have more control over.


Wednesday, September 19, 2018

Two hour marathon in 2031, maybe

On Sunday (September 16, 2018) Eliud Kipchoge ran the Berlin Marathon in 2:01:39, smashing the previous world record by more than a minute and taking a substantial step in the progression toward a two hour marathon.

In a previous article, I noted that the marathon record pace since 1970 has been progressing linearly over time, and I proposed a model that explains why we might expect it to continue.  Based on a linear extrapolation of the data so far, I predicted that someone would break the two hour barrier in 2041, plus or minus a few years.

Now it is time to update my predictions in light of the new record.  The following figure shows the progression of world record pace since 1970 (orange line), a linear fit to the data (blue line), and a 90% predictive interval (shaded area).  The dashed lines show the two hour marathon pace (13.1 mph, that is, 26.2 miles in two hours) and lower and upper bounds for the year we will reach it.

[Figure: world record marathon pace since 1970, with the linear fit, the 90% predictive interval, and the two hour pace.]
Since the previous record was broken in 2014, we have been slightly behind the long-term trend.  But the new record more than makes up for it, putting us at the upper edge of the predictive interval.

This model predicts that we might see a two hour marathon as early as 2031, and probably will before 2041.
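For readers who want to play with this kind of extrapolation, here is a minimal sketch in Python.  The year/pace pairs below are illustrative placeholders, not the actual record data behind the figure, and a full analysis would also compute the predictive interval shown above.

import numpy as np
from scipy.stats import linregress

# Illustrative placeholder data: year of each record and pace in mph
# (the real analysis uses the full progression of records since 1970).
years = np.array([1970, 1981, 1988, 1999, 2003, 2008, 2014, 2018.7])
paces = np.array([12.05, 12.25, 12.35, 12.55, 12.65, 12.75, 12.85, 12.93])

fit = linregress(years, paces)

# Extrapolate the fitted line to the two hour pace, 13.1 mph.
goal = 13.1
year_goal = (goal - fit.intercept) / fit.slope
print(year_goal)   # the year the fitted line crosses 13.1 mph (toy data)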

Note that this model is based on data from races.  It is possible that we will see a two hour marathon sooner under time trial conditions, as in the Nike Breaking2 project.

Thursday, September 13, 2018

Tom Bayes and the case of the double dice


The double dice problem

Suppose I have a box that contains one each of 4-sided, 6-sided, 8-sided, and 12-sided dice. I choose a die at random, and roll it twice without letting you see the die or the outcome. I report that I got the same outcome on both rolls.

1) What is the posterior probability that I rolled each of the dice?
2) If I roll the same die again, what is the probability that I get the same outcome a third time?

You can see the complete solution in this Jupyter notebook, or read the HTML version here.

Solution

Here's a BayesTable that represents the four hypothetical dice.
In [3]:
from fractions import Fraction   # exact arithmetic for priors and posteriors

# BayesTable (defined earlier in the notebook) holds one row per hypothesis
hypo = [Fraction(sides) for sides in [4, 6, 8, 12]]
table = BayesTable(hypo)
Out[3]:
hypo prior likelihood unnorm posterior
0 4 1 NaN NaN NaN
1 6 1 NaN NaN NaN
2 8 1 NaN NaN NaN
3 12 1 NaN NaN NaN

Since we didn't specify prior probabilities, the default value is equal priors for all hypotheses. They don't have to be normalized, because we have to normalize the posteriors anyway.
Now we can specify the likelihoods: if a die has n sides, the chance of getting the same outcome twice is 1/n, because whatever the first roll shows, the second roll matches it with probability 1/n.
So the likelihoods are:
In [4]:
table.likelihood = 1/table.hypo
table
Out[4]:
hypo prior likelihood unnorm posterior
0 4 1 1/4 NaN NaN
1 6 1 1/6 NaN NaN
2 8 1 1/8 NaN NaN
3 12 1 1/12 NaN NaN
Now we can use update to compute the posterior probabilities:
In [5]:
table.update()
table
Out[5]:
hypo prior likelihood unnorm posterior
0 4 1 1/4 1/4 2/5
1 6 1 1/6 1/6 4/15
2 8 1 1/8 1/8 1/5
3 12 1 1/12 1/12 2/15
In [6]:
table.posterior.astype(float)
Out[6]:
0    0.400000
1    0.266667
2    0.200000
3    0.133333
Name: posterior, dtype: float64
The 4-sided die is most likely because you are more likely to get doubles on a 4-sided die than on a 6-, 8-, or 12-sided die.
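As a sanity check, not part of the original notebook, here is a quick simulation sketch that estimates the same posteriors by rejection sampling: choose a die at random, roll it twice, and keep only the trials where the two rolls match.

import random
from collections import Counter

dice = [4, 6, 8, 12]
counts = Counter()

for _ in range(100000):
    n = random.choice(dice)
    # keep only trials where both rolls show the same outcome
    if random.randint(1, n) == random.randint(1, n):
        counts[n] += 1

total = sum(counts.values())
for n in dice:
    print(n, counts[n] / total)   # approaches 2/5, 4/15, 1/5, 2/15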

Part two

The second part of the problem asks for the (posterior predictive) probability of getting the same outcome a third time, if we roll the same die again.
If the die has n sides, the probability of getting the same value again is 1/n, which should look familiar.
To get the total probability of getting the same outcome, we add up, over the four dice, the product
P(n | data) * P(same outcome | n)
The first factor is the posterior probability of die n; the second factor is 1/n.
In [7]:
total = 0
for _, row in table.iterrows():
    # accumulate posterior * likelihood for each hypothesis
    total += row.posterior / row.hypo

total
Out[7]:
Fraction(13, 72)
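Writing the sum out term by term confirms the result:

2/5 * 1/4 + 4/15 * 1/6 + 1/5 * 1/8 + 2/15 * 1/12
= 36/360 + 16/360 + 9/360 + 4/360
= 65/360
= 13/72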
This calculation is similar to the first step of the update, so we can also compute it by
1) Creating a new table with the posteriors from table.
2) Adding the likelihood of getting the same outcome a third time.
3) Computing the normalizing constant.
In [8]:
table2 = table.reset()   # new table whose priors are the old posteriors
table2.likelihood = 1/table.hypo
table2
Out[8]:
hypo prior likelihood unnorm posterior
0 4 2/5 1/4 NaN NaN
1 6 4/15 1/6 NaN NaN
2 8 1/5 1/8 NaN NaN
3 12 2/15 1/12 NaN NaN
In [9]:
table2.update()
Out[9]:
Fraction(13, 72)
In [10]:
table2
Out[10]:
hypo prior likelihood unnorm posterior
0 4 2/5 1/4 1/10 36/65
1 6 4/15 1/6 2/45 16/65
2 8 1/5 1/8 1/40 9/65
3 12 2/15 1/12 1/90 4/65
This result is the same as the posterior after seeing the same outcome three times.
This example demonstrates a general truth: to compute the predictive probability of an event, you can pretend you saw the event, do a Bayesian update, and record the normalizing constant.
(With one caveat: this only works if your priors are normalized.)
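To see the trick in isolation, here is a minimal sketch that uses plain dictionaries instead of BayesTable; the update helper is hypothetical, written just for this illustration.

from fractions import Fraction

def update(prior, likelihood):
    """Bayesian update: return (posterior, normalizing constant)."""
    unnorm = {h: p * likelihood(h) for h, p in prior.items()}
    norm = sum(unnorm.values())
    posterior = {h: p / norm for h, p in unnorm.items()}
    return posterior, norm

# normalized priors over the four dice
prior = {n: Fraction(1, 4) for n in [4, 6, 8, 12]}

# update on the first observation: doubles, with likelihood 1/n
posterior, _ = update(prior, lambda n: Fraction(1, n))

# pretend we see the same outcome again; the normalizing constant
# of this second update is the predictive probability
_, predictive = update(posterior, lambda n: Fraction(1, n))
print(predictive)   # 13/72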