Monday, October 26, 2015

When will I win the Great Bear Run?

Almost every year since 2008 I have participated in the Great Bear Run, a 5K road race in Needham, MA. I usually finish in the top 40 or so, and in my age group I have come in 4th, 6th, 4th, 3rd, 2nd, 4th and 4th. In 2015 I didn't run because of a scheduling conflict, but based on the results I estimate that I would have come in 4th again.

Having come close in 2012, I have to wonder what my chances are of winning my age group.  I've developed two models to predict the results.

Binomial model

According to a simple binomial model (the details are in this IPython notebook), the predictive distribution for the number of people who finish ahead of me is:

And my chances of winning my age group are about 5%.
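The notebook has the actual model. As a rough sketch of the idea only, the snippet below treats each of n potential rivals as finishing ahead of me independently with the same probability p, and plugs in a pooled estimate of p from the placements quoted above. The pool size n = 15 is an assumption chosen for illustration, and the plug-in estimate is simpler than a full Bayesian treatment.

```python
from math import comb

# Runners ahead of me in my age group each year, derived from the
# placements quoted above (4th place means 3 runners ahead, and so on).
# The last entry is the estimated 2015 result.
ahead = [3, 5, 3, 2, 1, 3, 3, 3]

# ASSUMPTION: a fixed pool of n potential rivals; n = 15 is illustrative.
n = 15

# Pooled plug-in estimate of the per-runner probability of beating me.
p = sum(ahead) / (len(ahead) * n)


def binomial_pmf(k, n, p):
    """Probability that exactly k of n independent rivals finish ahead."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Predictive distribution of the number of runners ahead of me.
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Winning the age group means nobody finishes ahead of me.
p_win = pmf[0]
print(f"P(win) = {p_win:.3f}")
```

Under these assumptions the chance of winning comes out at a few percent, the same order of magnitude as the figure above.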

An improved model

The binomial model ignores an important piece of information: some of the people who have displaced me from my rightful victory have come back year after year.  Here are the people who have finished ahead of me in my age group:

    2008: Gardiner, McNatt, Terry
    2009: McNatt, Ryan, Partridge, Turner, Demers
    2010: Gardiner, Barrett, Partridge
    2011: Barrett, Partridge
    2012: Sagar
    2013: Hammer, Wang, Hahn
    2014: Partridge, Hughes, Smith
    2015: Barrett, Sagar, Fernandez

Several of them are repeat interlopers.  I have developed a model that takes this information into account and predicts my chances of winning.  It is based on the assumption that there is a population of n runners who could displace me, each with some probability p.
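To see who the repeat interlopers are, the list above can be tallied directly. This is only a bookkeeping sketch; the variable names are mine, and the data is copied from the list.

```python
from collections import Counter

# Runners who finished ahead of me in my age group, by year.
finishers_ahead = {
    2008: ["Gardiner", "McNatt", "Terry"],
    2009: ["McNatt", "Ryan", "Partridge", "Turner", "Demers"],
    2010: ["Gardiner", "Barrett", "Partridge"],
    2011: ["Barrett", "Partridge"],
    2012: ["Sagar"],
    2013: ["Hammer", "Wang", "Hahn"],
    2014: ["Partridge", "Hughes", "Smith"],
    2015: ["Barrett", "Sagar", "Fernandez"],
}

# Number of years each runner displaced me.
counts = Counter(
    name for names in finishers_ahead.values() for name in names
)

# Runners who displaced me more than once.
repeaters = {name: c for name, c in counts.most_common() if c > 1}
print(repeaters)
```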

In order to displace me, a runner has to

  • Show up,
  • Outrun me, and
  • Be in my age group.

For each runner, the probability of displacing me is a product of these factors:

    p_i = S_i × O_i × B_i

Some runners have a higher SOB factor than others; we can use previous results to estimate it.

First we have to think about an appropriate prior.   Again, the details are in this IPython notebook.

  • Based on my experience, I conjecture that the prior distribution of S is an increasing function, with many people who run nearly every year, and fewer who run only occasionally.
  • The prior distribution of O is biased toward high values. Of the people who have the potential to beat me, many will beat me every time; I am only competitive with a few of them. (For example, of the 15 people who have beaten me, I have only ever beaten 2.)
  • The probability that a runner is in my age group depends on the difference between his age and mine. Someone exactly my age will always be in my age group. Someone 4 years older will be in my age group only once every 5 years (the Great Bear Run uses 5-year age groups). So the distribution of B is uniform.

I used Beta distributions for each of the three factors, so each p_i is the product of three Beta-distributed variates. In general, the result is not a Beta distribution, but it turns out there is a Beta distribution that is a good approximation of the actual distribution.
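One way to find such an approximation is to sample the product and match its first two moments. The sketch below does that with NumPy. The shape parameters for S and O are assumptions picked to match the verbal descriptions above (increasing, and biased toward high values), not values taken from the notebook, and moment matching is just one possible fitting method.

```python
import numpy as np

rng = np.random.default_rng(17)
size = 100_000

# ASSUMPTION: illustrative priors, not the notebook's actual parameters.
S = rng.beta(2, 1, size)  # increasing density: most people run most years
O = rng.beta(3, 1, size)  # biased toward high values
B = rng.beta(1, 1, size)  # uniform

# Per-runner probability of displacing me.
p = S * O * B

# Method of moments: choose the Beta distribution with the same
# mean and variance as the sampled product.
m, v = p.mean(), p.var()
common = m * (1 - m) / v - 1
alpha_hat = m * common
beta_hat = (1 - m) * common

print(f"approximating distribution: Beta({alpha_hat:.2f}, {beta_hat:.2f})")
```

For these particular shapes, the exact mean and variance of the product coincide with those of Beta(1, 3), so the fitted parameters should land close to that.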

Using this distribution as a prior, we can compute the posterior distribution of p for each runner who has displaced me, and another posterior for any hypothetical runner who has not displaced me in any of 8 races.  As an example, for Rich Partridge, who has displaced me in 4 of 8 years, the mean of the posterior distribution is 42%.  For a runner who has only displaced me once, it is 17%.
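Because the prior is (approximately) a Beta distribution and each race is a yes/no outcome, the update is the standard conjugate one: add the number of times a runner displaced me to the first parameter and the number of times he didn't to the second. The sketch below assumes a Beta(1, 3) prior; that choice is hypothetical, made because it reproduces the posterior means quoted above to within rounding.

```python
def posterior_mean(alpha, beta, displaced, races):
    """Mean of the Beta(alpha + displaced, beta + races - displaced) posterior."""
    return (alpha + displaced) / (alpha + beta + races)


# ASSUMPTION: Beta(1, 3) prior, chosen for illustration.
ALPHA, BETA = 1, 3
RACES = 8

for displaced in (4, 1, 0):
    mean = posterior_mean(ALPHA, BETA, displaced, RACES)
    print(f"displaced me {displaced} of {RACES} times: mean p = {mean:.3f}")
```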

Then, for any hypothetical number of runners, n, we can draw samples from these distributions of p and compute the conditional distribution of k, the number of runners who finish ahead of me.  Here's what that looks like for a few values of n:

If there are 18 runners, the most likely value of k is 3, so I would come in 4th.  As the number of runners increases, my prospects look a little worse.
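The sampling step can be sketched as follows: for a given n, draw a value of p for every runner from his posterior, flip a coin with that probability, and count the heads. This is a minimal version under stated assumptions, not the notebook's code. It assumes a Beta(1, 3) prior (hypothetical), and treats every runner beyond the 15 known rivals as someone who has never displaced me in 8 races.

```python
import numpy as np

rng = np.random.default_rng(18)

# Times each known rival displaced me in 8 races, from the list above:
# Partridge 4, Barrett 3, three runners twice, ten runners once.
known_counts = [4, 3, 2, 2, 2] + [1] * 10

# ASSUMPTION: Beta(1, 3) prior on p, chosen for illustration.
ALPHA, BETA, RACES = 1, 3, 8


def conditional_pmf(n, iters=10_000):
    """Estimate P(k | n) by simulation; requires n >= len(known_counts)."""
    # Runners beyond the known rivals have never displaced me.
    counts = np.array(known_counts + [0] * (n - len(known_counts)))
    # One posterior draw of p per runner per simulated race.
    p = rng.beta(ALPHA + counts, BETA + RACES - counts, size=(iters, n))
    # Each runner finishes ahead of me with probability p.
    k = (rng.random((iters, n)) < p).sum(axis=1)
    return np.bincount(k, minlength=n + 1) / iters


pmf = conditional_pmf(18)
print("most likely k for n=18:", pmf.argmax())
print("P(k=0 | n=18):", pmf[0])
```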

These represent distributions of k conditioned on n, so we can use them to compute the likelihood of the observed values of k.  Then, using Bayes's theorem, we can compute the posterior distribution of n, which looks like this:

It's noisy because I used random sampling to estimate the conditional distributions of k. But that's ok because we don't really care about n; we care about the predictive distribution of k. And noise in the distribution of n has very little effect on k.
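The update over n can be sketched in a few lines: start with a prior over candidate values of n, multiply by the likelihood of the observed values of k under each conditional distribution, and normalize. To keep the example self-contained and deterministic, the conditional distribution here is a stand-in (a plain binomial with a made-up p = 0.12) rather than the sampled distributions described above; the range of n and the uniform prior are also assumptions.

```python
from math import comb

import numpy as np

# Observed number of runners ahead of me each year (2008 through 2015).
observed_k = [3, 5, 3, 2, 1, 3, 3, 3]


def conditional_pmf(n, p=0.12):
    """STAND-IN for P(k | n): Binomial(n, p) with an illustrative p."""
    n = int(n)
    return np.array(
        [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    )


# ASSUMPTION: uniform prior over a hypothetical range of n.
ns = np.arange(15, 61)
prior = np.ones(len(ns)) / len(ns)

# Likelihood of all observed k values for each candidate n.
likelihood = np.array(
    [np.prod(conditional_pmf(n)[observed_k]) for n in ns]
)

# Bayes's theorem: posterior is proportional to prior times likelihood.
posterior = prior * likelihood
posterior /= posterior.sum()

print("most probable n:", ns[posterior.argmax()])
```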

The predictive distribution of k is a weighted mixture of the conditional distributions we already computed, and it looks like this:

Sadly, according to this model, my chance of winning my age group is less than 2% (compared to the binomial model, which predicts that my chances are more than 5%).
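For readers who want to see the mixture step concretely, here is a minimal sketch: each conditional distribution of k is weighted by the posterior probability of its n, and the weighted distributions are added up. The function name and the toy inputs are hypothetical and exist only to show the mechanics.

```python
import numpy as np


def predictive_mixture(posterior_n, conditional_pmfs):
    """Mix P(k | n) over n, weighting each by P(n | data).

    posterior_n: dict mapping n to its posterior probability.
    conditional_pmfs: dict mapping n to a sequence of P(k | n), k = 0..n.
    """
    max_len = max(len(pmf) for pmf in conditional_pmfs.values())
    mix = np.zeros(max_len)
    for n, weight in posterior_n.items():
        pmf = np.asarray(conditional_pmfs[n])
        # Shorter distributions contribute zero probability to large k.
        mix[: len(pmf)] += weight * pmf
    return mix


# Toy inputs (hypothetical): two candidate values of n, equally likely.
posterior_n = {2: 0.5, 3: 0.5}
conditional_pmfs = {
    2: [0.25, 0.5, 0.25],
    3: [0.125, 0.375, 0.375, 0.125],
}

mix = predictive_mixture(posterior_n, conditional_pmfs)
print("P(k=0), the chance of winning:", mix[0])
```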

1 comment:

  1. For some interesting statistics I would like to see a post that shows how your likelihood of winning the 5K race varies if you do the following things, at least 6 months before the next race, analyzing each contribution independently:
    - you start lifting weights and strengthen upper body, core, and legs
    - one day a week you run 8-10k to build up your endurance
    - you eat organic bison or lean beef twice a week, and cut carbs as low as you can
    - you get a coach to comment on your form to make sure you are as efficient as possible
    I used to run a lot, talked to a lot of other runners, took a long-distance class, and read a running book written by a world-class Olympic runner. I bet there is a change you can make that you haven't discovered or implemented yet that will get you that first-place spot.