Monday, September 29, 2014

Two hour marathon in 2041

Today Dennis Kimetto ran the Berlin Marathon in 2:02:57, shaving 26 seconds off the previous world record.  That means it's time to check in on the world record progression and update my update from last year of  my article from two years ago.  The following is a revised version of that article, including the new data point.

Abstract: I propose a model that explains why world record progressions in running speed are improving linearly, and should continue as long as the population of potential runners grows exponentially.  Based on recent marathon world records, I extrapolate that we will break the two-hour barrier in 2041.


Let me start with the punchline:

The blue points show the progression of marathon records since 1970, including the new mark.  The blue line is a least-squares fit to the data, and the red line is the target pace for a two-hour marathon, 13.1 mph.  The blue line hits the target pace in 2041.

In general, linear extrapolation of a time series is a dubious business, but in this case I think it is justified:

1) The distribution of running speeds is not a bell curve.  It has a long tail of athletes who are much faster than normal runners.  Below I propose a model that explains this tail, and suggests that there is still room between the fastest human ever born and the fastest possible human.

2) I’m not just fitting a line to arbitrary data; there is theoretical reason to expect the progression of world records to be linear, which I present below.  And since there is no evidence that the curve is starting to roll over, I think it is reasonable to expect it to continue for a while.

3) Finally, I am not extrapolating beyond reasonable human performance. The target pace for a two-hour marathon is 13.1 mph, which is slower than the current world record for the half marathon (58:23 minutes, or 13.5 mph). It is unlikely that the current top marathoners will be able to maintain this pace for two hours, but we have no reason to think that it is beyond theoretical human capability.

My model, and the data it is based on, are below.


In April 2011, I collected the world record progression for running events of various distances and plotted speed versus year (here's the data, mostly from Wikipedia). The following figure shows the results:

You might notice a pattern; for almost all of these events, the world record progression is a remarkably straight line. I am not the first person to notice, but as far as I know, no one has proposed an explanation for the shape of this curve.

Until now -- I think I know why these lines are straight.  Here are the pieces:

1) Each person's potential is determined by several factors that are independent of each other; for example, your VO2 max and the springiness of your tendons are probably unrelated.

2) Each runner's overall potential is limited by their weakest factor. For example, if there are 10 factors, and you are really good at 9 of them, but below average on the 10th, you will probably not be a world-class runner.

3) As a result of (1) and (2), potential is not normally distributed; it is long-tailed. That is, most people are slow, but there are a few people who are much faster.

4) Runner development has the structure of a pipeline. At each stage, the fastest runners are selected to go on to the next stage.

5) If the same number of people went through the pipeline each year, the rate of new records would slow down quickly.

6) But the number of people in the pipeline actually grows exponentially.

7) As a result of (5) and (6) the rate of new records is linear.

8) This model suggests that linear improvement will continue as long as the world population grows exponentially.

Let's look at each of those pieces in detail:

Physiological factors that determine running potential include VO2 max, anaerobic capacity, height, body type, muscle mass and composition (fast and slow twitch), conformation, bone strength, tendon elasticity, healing rate and probably more. Psychological factors include competitiveness, persistence, tolerance of pain and boredom, and focus.

Most of these factors have a large component that is inherent, they are mostly independent of each other, and any one of them can be a limiting factor. That is, if you are good at all of them, and bad at one, you will not be a world-class runner. To summarize: there is only one way to be fast, but there are a lot of ways to be slow.

As a simple model of these factors, we can generate a random person by picking N random numbers, where each number is normally-distributed under a logistic transform. This yields a bell-shaped distribution bounded between 0 and 1, where 0 represents the worst possible value (at least for purposes of running speed) and 1 represents the best possible value.

Then to represent the running potential of each person, we compute the minimum of these factors. Here's what the code looks like:

def GeneratePerson(n=10):
    factors = [random.normalvariate(0.0, 1.0) for i in range(n)]
    logs = [Logistic(x) for x in factors]
    return min(logs)

Yes, that's right, I just reduced a person to a single number. Cue the humanities majors lamenting the blindness and arrogance of scientists. Then explain that this is supposed to be an explanatory model, so simplicity is a virtue. A model that is as rich and complex as the world is not a model.

Here's what the distribution of potential looks like for different values of N:

When N=1, there are many people near the maximum value. If we choose 100,000 people at random, we are likely to see someone near 98% of the limit. But as N increases, the probability of large values drops fast. For N=5, the fastest runner out of 100,000 is near 85%. For N=10, he is at 65%, and for N=50 only 33%.

In this kind of lottery, it takes a long time to hit the jackpot. And that's important, because it suggests that even after 7 billion people, we might not have seen anyone near the theoretical limit.

Let's see what effect this model has on the progression of world records. Imagine that we choose a million people and test them one at a time for running potential (and suppose that we have a perfect test). As we perform tests, we keep track of the fastest runner we have seen, and plot the "world record" as a function of the number of tests.

Here's the code:

def WorldRecord(m=100000, n=10):
    data = []
    best = 0.0
    for i in xrange(m):
        person = GeneratePerson(n)
        if person > best:
            best = person
            data.append(i/m, best))
    return data

And here are the results with M=100,000 people and the number of factors N=10:

The x-axis is the fraction of people we have tested. The y-axis is the potential of the best person we have seen so far. As expected, the world record increases quickly at first and then slows down.

In fact, the time between new records grows geometrically. To see why, consider this: if it took 100 people to set the current record, we expect it to take 100 more to exceed the record. Then after 200 people, it should take 200 more. After 400 people, 400 more, and so on. Since the time between new records grows geometrically, this curve is logarithmic.

So if we test the same number of people each year, the progression of world records is logarithmic, not linear. But if the number of tests per year grows exponentially, that's the same as plotting the previous results on a log scale. Here's what you get:

That's right: a log curve on a log scale is a straight line. And I believe that that's why world records behave the way they do.

This model is unrealistic in some obvious ways. We don't have a perfect test for running potential and we don't apply it to everyone. Not everyone with the potential to be a runner has the opportunity to develop that potential, and some with the opportunity choose not to.

But the pipeline that selects and trains runners behaves, in some ways, like the model.  If a person with record-breaking potential is born in Kenya, where running is the national sport, the chances are good that he will be found, have opportunities to train, and become a world-class runner.  It is not a certainty, but the chances are good.

If the same person is born in rural India, he may not have the opportunity to train; if he is in the United States, he might have options that are more appealing.

So in some sense the relevant population is not the world, but the people who are likely to become professional runners, given the talent.  As long as this population is growing exponentially, world records will increase linearly.

That said, the slope of the line depends on the parameter of exponential growth.  If economic development increases the fraction of people in the world who have the opportunity to become professional runners, these curves could accelerate.

So let's get back to the original question: when will a marathoner break the 2-hour barrier?  Before 1970, marathon times improved quickly; since then the rate has slowed.  But progress has been consistent for the last 40 years, with no sign of an impending asymptote.  So I think it is reasonable to fit a line to recent data and extrapolate.  Here are the results [based on 2011 data; see above for the update]:

The red line is the target: 13.1 mph.  The blue line is a least squares fit to the data.  So here's my prediction: there will be a 2-hour marathon in 2041.  I'll be 76, so my chances of seeing it are pretty good.  But that's a topic for another article (see Chapter 1 of Think Bayes).

Thursday, September 25, 2014

Bayesian election forecasting

Last week Nate Silver posted this article explaining how the FiveThirtyEight Senate forecast model works.  If you are familiar with Silver's work, you probably know that (1) he has been notably successful at predicting outcomes of elections, and (2) he is an advocate for Bayesian statistics.  In last week's article, Silver provides a high-level view of his modeling philosophy and some details about how his model works, but he didn't say much that is explicitly Bayesian.

His first principle of modeling hints at some Bayesian ideas:
Principle 1: A good model should be probabilistic, not deterministic.
The FiveThirtyEight model produces probabilistic forecasts as opposed to hard-and-fast predictions.[...] the FiveThirtyEight model might estimate the Democrat has a 20 percent chance of winning the Senate race in Kentucky.  My view is that this is often the most important part of modeling — and often the hardest part. 
Silver suggests that it is more useful to report the probability of winning than the margin of victory.  Bayesian models are generally good at this kind of thing; classical statistical methods are not.  But Silver never makes the connection between this principle and Bayesian methods; in fact the article doesn't mention Bayes at all.

And that's fine; it was not central to the point of his article.  But since I am teaching my Bayesian statistics class this semester, I will take this opportunity to fill in some details.  I don't know anything about Silver's model other than what's in his article, but I think it is a good guess that there is something in there similar to what follows.

But before I get into it, here's an outline:

  1. I present an example problem and formulate a solution using a Bayesian framework.
  2. I develop Python code to compute a solution; if you don't speak Python, you can skip this part.
  3. I show results for an update with a single poll result.
  4. I show how to combine results from a different polls.
My example supports Silver's argument that it is more useful to predict the probability of winning than the margin of victory: after the second update, the predicted margin of victory decreases, but the probability of winning increases.  In this case, predicting only the margin of victory would misrepresent the effect of the second poll.

Formulating the problem

Here's the exercise I presented to my class:
Exercise 1: The polling company Strategic Vision reports that among likely voters, 53% intend to vote for your favorite candidate and 47% intend to vote for the opponent (let's ignore undecided voters for now).  Suppose that, based on past performance, you estimate that the distribution of error for this company has mean 1.1 percentage point (in favor of your candidate) and standard deviation 3.7 percentage points.  What is the probability that the actual fraction of likely voters who favor your candidate is less than 50%?
Strategic Vision is an actual polling company, but other than that, everything about this example is made up.  Also, the standard deviation of the error is probably lower than what you would see in practice.

To solve this problem, we can treat the polling company like a measurement instrument with known error characteristics.  If we knew the actual fraction of the electorate planning to vote for your candidate, which I'll call A for "actual", and we knew the distribution of the error, ε, we could compute the distribution of the measured value, M:

M = A + ε

But in this case we want to solve the inverse problem: given a measurement M and the distribution of ε, compute the posterior distribution of A.  As always with this kind of problem, we need three things:

1) A prior distribution for A,
2) Data that allow us to improve the estimate of A, and
3) A likelihood function that computes the probability of the data for hypothetical values of A.

Once you have these three things, the Bayesian framework does the rest.

Implementing the solution

To demonstrate, I use the Suite class from thinkbayes2, which is a Python module that goes with my book, Think Bayes.  The Suite class is documented here, but I will explain what you need to know below.  You can download the code from this file in this GitHub repository.

I'll start by defining a new class called Electorate that inherits methods from thinkbayes2.Suite:

class Electorate(thinkbayes2.Suite):
    """Represents hypotheses about the state of the electorate."""

As a starting place, I'll create a uniform prior distribution.  In practice this would not be a good choice, for reasons I'll explain soon, but it will allow me to make a point.

    hypos = range(0, 101)
    suite = Electorate(hypos)

hypos is a list of integers from 0 to 100, representing the percentage of the electorate planning to vote for your candidate.

I'll represent the data with a tuple of three values: the estimated bias of the polling company in percentage points, the standard deviation of their previous errors (also in percentage points), and the result of the poll:

    data = 1.1, 3.7, 53

When we call Update, it loops through the hypotheses and computes the likelihood of the data under each hypothesis; we have to provide a Likelihood function that does this calculation:

class Electorate(thinkbayes2.Suite):
    """Represents hypotheses about the electorate."""

    def Likelihood(self, data, hypo):
        """Likelihood of the data under the hypothesis.

        hypo: fraction of the population
        data: poll results
        bias, std, result = data
        error = result - hypo
        like = thinkbayes2.EvalNormalPdf(error, bias, std)
        return like

Likelihood unpacks the given data into bias, std, and result.  Given a hypothetical value for A, it computes the hypothetical error.  For example, if hypo is 50 and result is 53, that means the poll is off by 3 percentage points.  The resulting likelihood is the probability that we would be off by that much, given the bias and standard deviation of the poll.

We estimate this probability by evaluating the normal/Gaussian distribution with the given parameters.  I am assuming that the distribution of errors is approximately normal, which is probably not a bad assumption when the probabilities are near 50%.

One technical detail: The result of EvalNormalPdf is actually a probability density, not a probability.  But the result from Likelihood doesn't actually have to be a probability; it only has to be proportional to a probability, so a probability density will do the job nicely.

The results

And that's it -- we've solved the problem!  Here are the results:

The prior distribution is uniform from 0 to 100.  The mean of the posterior is 51.9, which makes sense because the result is 53 and the known bias is 1.1, so the posterior mean is (53 - 1.1).  The standard deviation of the posterior is 3.7, the same as the standard deviation of the error.

To compute the probability of losing the election (if it were held today), we can loop through the hypotheses and add up the probability of all values less than 50%.  The Suite class provides ProbLess, which does that calculation.  The result is 0.26, which means your candidate is a 3:1 favorite.

In retrospect we could have computed this posterior analytically with a lot less work, which is the point I wanted to make by using a uniform prior.  But in general it's not quite so simple, as we can see by incorporating a second poll:
Exercise 2: What if another poll comes out from Research 2000 showing that 49% of likely voters intend to vote for your candidate, but past poll show that this company's results tend to favor the opponent by 2.3 points, and their past errors (after correcting for this bias) have standard deviation 4.1 points.  Now what should you believe?
The second update looks just like the first:

    data = -2.3, 4.1, 49

The bias is negative now because this polling company (in my fabricated world) tends to favor the opponent.  Here are the results after the second update:

The mean of the new posterior is 51.6, slightly lower than the mean after the first update, 51.9.  The two polls are actually consistent with each other after we correct for the biases of the two companies.

The predicted margin of victory is slightly smaller, but the uncertainty of the prediction is also smaller.  Based on the second update, the probability is 0.22, which means your candidate is now nearly a 4:1 favorite.

Again, this example demonstrates Silver's point: predicting the probability of winning is more meaningful that predicting the margin of error.  And that's exactly the kind of thing Bayesian models are good for.

One more technical note:  This analysis is based on the assumption that errors in one poll iare independent of errors in another.  It seems likely that in practice there is correlation between polls; in that case we could extend this solution to model the errors with a joint distribution that includes the correlation. 

Sunday, September 14, 2014

Regression with Python, pandas and StatsModels

I was at Boston Data-Con 2014 this morning, which was a great event.  The organizer, John Verostek, seems to have created this three-day event single-handedly, so I am hugely impressed.

Imran Malek started the day with a very nice iPython tutorial.  The description is here, and his slides are here.  He grabbed passenger data from the MBTA and generated heat maps showing the number of passengers at each stop in the system during each hour.  The tutorial covered a good range of features, and it seemed like many of the participants were able to download the data and follow along in iPython.

And Imran very kindly let me use his laptop to project slides for my talk, which was next.  The description of my talk is here:
Regression is a powerful tool for fitting data and making predictions. In this talk I present the basics of linear regression and logistic regression and show how to use them in Python. I demonstrate pandas, a Python module that provides structures for data analysis, and StatsModels, a module that provides tools for regression and other statistical analysis. 
As an example, I will use data from the National Survey of Family Growth to generate predictions for the date of birth, weight, and sex of an expected baby. This presentation is based on material from the recent revision of Think Stats, an introduction to data analysis and statistics with Python.
This talk is appropriate for people with no prior experience with regression. Basic familiarity with Python is recommended but not required.
 And here are my slides:

The material for this talk is from the second edition of Think Stats, which is in production now and scheduled for release in early November.  My draft is available here, and you can pre-order a paper copy here.

As I expected, I prepared way more material than I could present.  The audience had some excellent questions, so we spent more time on linear regression and did not get to logistic regression.

The nice people at O'Reilly Media sent over 25 copies of my book, Think Python, so we had a book signing after the talk.  I had a chance to chat with everyone who got a book, which is always a lot of fun.

I believe video of the talk will be available soon.  I will post it here when it is.

Many thanks to John Verostek for organizing this conference, and to the sponsors for breakfast and lunch.  Looking forward to next year!