Having come close in 2012, I have to wonder what my chances are of winning my age group. I've developed two models to predict the results.
Binomial model
According to a simple binomial model (the details are in this IPython notebook), the predictive distribution for the number of people who finish ahead of me is:
And my chances of winning my age group are about 5%.
An improved model
The binomial model ignores an important piece of information: some of the people who have displaced me from my rightful victory have come back year after year. Here are the people who have finished ahead of me in my age group:
2008: Gardiner, McNatt, Terry
2009: McNatt, Ryan, Partridge, Turner, Demers
2010: Gardiner, Barrett, Partridge
2011: Barrett, Partridge
2012: Sagar
2013: Hammer, Wang, Hahn
2014: Partridge, Hughes, Smith
2015: Barrett, Sagar, Fernandez
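Tallying the list above makes the repeat finishers easy to spot. This is only a bookkeeping sketch; the variable names are illustrative and do not come from the notebook.

```python
from collections import Counter

# Runners who finished ahead of me in my age group, by year
# (copied from the list above).
ahead = {
    2008: ["Gardiner", "McNatt", "Terry"],
    2009: ["McNatt", "Ryan", "Partridge", "Turner", "Demers"],
    2010: ["Gardiner", "Barrett", "Partridge"],
    2011: ["Barrett", "Partridge"],
    2012: ["Sagar"],
    2013: ["Hammer", "Wang", "Hahn"],
    2014: ["Partridge", "Hughes", "Smith"],
    2015: ["Barrett", "Sagar", "Fernandez"],
}

# Number of years in which each runner finished ahead of me.
counts = Counter(name for names in ahead.values() for name in names)

# k for each year: how many runners finished ahead of me.
ks = [len(names) for names in ahead.values()]

# Runners who appear in more than one year.
repeaters = {name: c for name, c in counts.items() if c > 1}

print("distinct runners:", len(counts))
print("repeaters:", repeaters)
print("k by year:", ks)
```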
Several of them are repeat interlopers. I have developed a model that takes this information into account and predicts my chances of winning. It is based on the assumption that there is a population of n runners who could displace me, each with some probability p.
In order to displace me, a runner has to
- Show up,
- Outrun me, and
- Be in my age group.
For each runner, the probability of displacing me is a product of these factors:
p_i = S × O × B
Some runners have a higher SOB factor than others; we can use previous results to estimate it.
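To make the product concrete, here is a tiny sketch with made-up factor values. None of these numbers come from the race data, and multiplying the factors assumes they are independent.

```python
def displacement_probability(s, o, b):
    """Return S * O * B: shows up, outruns me, and is in my age group."""
    for name, value in (("s", s), ("o", o), ("b", b)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a probability, got {value}")
    return s * o * b

# Hypothetical runners, for illustration only.
# A regular who always beats me but is rarely in my age group:
regular = displacement_probability(s=0.9, o=1.0, b=0.2)
# An occasional runner, evenly matched with me, always in my age group:
occasional = displacement_probability(s=0.3, o=0.5, b=1.0)

print(f"regular: {regular:.2f}, occasional: {occasional:.2f}")
```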
First we have to think about an appropriate prior. Again, the details are in this IPython notebook.
- Based on my experience, I conjecture that the prior distribution of S is an increasing function, with many people who run nearly every year, and fewer who run only occasionally.
- The prior distribution of O is biased toward high values. Of the people who have the potential to beat me, many of them will beat me every time. I am only competitive with a few of them. (For example, of the 15 people who have beaten me, I have only ever beaten 2.)
- The probability that a runner is in my age group depends on the difference between his age and mine. Someone exactly my age will always be in my age group. Someone 4 years older will be in my age group only once every 5 years (the Great Bear run uses 5-year age groups). So the distribution of B is uniform.
I used Beta distributions for each of the three factors, so each p_i is the product of three Beta-distributed variates. In general, the result is not a Beta distribution, but it turns out there is a Beta distribution that is a good approximation of the actual distribution.
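One way to find such an approximation is to sample the product and match its mean and variance to a Beta distribution. The sketch below does that with hypothetical parameters chosen only to mimic the shapes described above (increasing for S, skewed high for O, uniform for B); the notebook's actual parameters and fitting method may differ.

```python
import random
from statistics import fmean, pvariance

def sample_sob(rng, n_samples=20000):
    """Draw samples of S * O * B using hypothetical Beta priors."""
    samples = []
    for _ in range(n_samples):
        s = rng.betavariate(2, 1)  # increasing density (assumed parameters)
        o = rng.betavariate(3, 1)  # skewed toward high values (assumed)
        b = rng.betavariate(1, 1)  # uniform on [0, 1]
        samples.append(s * o * b)
    return samples

def fit_beta_by_moments(samples):
    """Return (alpha, beta) of the Beta with the same mean and variance."""
    m = fmean(samples)
    v = pvariance(samples, m)
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

rng = random.Random(17)
samples = sample_sob(rng)
alpha, beta = fit_beta_by_moments(samples)
print(f"mean of product: {fmean(samples):.3f}")
print(f"moment-matched Beta: alpha={alpha:.2f}, beta={beta:.2f}")
```

For these particular made-up parameters, the exact mean of the product is 2/3 × 3/4 × 1/2 = 0.25 and the exact variance is 0.0375, which moment matching maps to Beta(1, 3); the sampled estimates should land close to that.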
Using this distribution as a prior, we can compute the posterior distribution of p for each runner who has displaced me, and another posterior for any hypothetical runner who has not displaced me in any of 8 races. As an example, for Rich Partridge, who has displaced me in 4 of 8 years, the mean of the posterior distribution is 42%. For a runner who has only displaced me once, it is 17%.
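Because a Beta prior is conjugate to a binomial likelihood, this update is just addition. The sketch below assumes a hypothetical Beta(1, 3) prior, which is a stand-in and not necessarily the prior used in the notebook.

```python
def posterior_params(alpha, beta, displaced, races):
    """Conjugate update: Beta(alpha, beta) prior, `displaced` successes in `races` trials."""
    if not 0 <= displaced <= races:
        raise ValueError("displaced must be between 0 and races")
    return alpha + displaced, beta + (races - displaced)

def posterior_mean(alpha, beta, displaced, races):
    """Mean of the posterior Beta distribution of p."""
    a, b = posterior_params(alpha, beta, displaced, races)
    return a / (a + b)

PRIOR = (1, 3)  # hypothetical prior parameters, assumed for illustration
RACES = 8       # races from 2008 through 2015

four_times = posterior_mean(*PRIOR, displaced=4, races=RACES)  # e.g. Partridge
once = posterior_mean(*PRIOR, displaced=1, races=RACES)
never = posterior_mean(*PRIOR, displaced=0, races=RACES)

print(f"4 of 8: {four_times:.3f}")
print(f"1 of 8: {once:.3f}")
print(f"0 of 8: {never:.3f}")
```

With this assumed prior the posterior means are 5/12 and 2/12, which happen to round to the 42% and 17% quoted above. That agreement does not establish which prior the notebook actually uses.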
Then, for any hypothetical number of runners, n, we can draw samples from these distributions of p and compute the conditional distribution of k, the number of runners who finish ahead of me. Here's what that looks like for a few values of n:
If there are 18 runners, the most likely value of k is 3, so I would come in 4th. As the number of runners increases, my prospects look a little worse.
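A simulation along these lines can be sketched as follows. It assumes a hypothetical Beta(1, 3) prior, treats the 15 runners who have displaced me as known members of the population, and fills the remaining n - 15 slots with runners who have never displaced me. The notebook's details may differ.

```python
import random
from collections import Counter

PRIOR = (1, 3)  # hypothetical prior parameters, assumed for illustration
RACES = 8       # races from 2008 through 2015

# Years each known runner has displaced me: Partridge 4, Barrett 3,
# Gardiner, McNatt, and Sagar 2 each, and ten runners once each.
OBSERVED = [4, 3, 2, 2, 2] + [1] * 10

def conditional_pmf(n, rng, iters=5000):
    """Estimate P(k | n) by sampling p for each runner, then a race outcome."""
    if n < len(OBSERVED):
        raise ValueError("n must be at least the number of known runners")
    displaced_counts = OBSERVED + [0] * (n - len(OBSERVED))
    tally = Counter()
    for _ in range(iters):
        k = 0
        for d in displaced_counts:
            # Draw p from this runner's posterior, then flip a coin with bias p.
            p = rng.betavariate(PRIOR[0] + d, PRIOR[1] + RACES - d)
            if rng.random() < p:
                k += 1
        tally[k] += 1
    return {k: tally[k] / iters for k in sorted(tally)}

rng = random.Random(1)
pmfs = {n: conditional_pmf(n, rng) for n in (18, 24, 30)}
for n, pmf in pmfs.items():
    mode = max(pmf, key=pmf.get)
    print(f"n={n}: most likely k={mode}, P(k=0)={pmf.get(0, 0.0):.3f}")
```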
These represent distributions of k conditioned on n, so we can use them to compute the likelihood of the observed values of k. Then, using Bayes's theorem, we can compute the posterior distribution of n, which looks like this:
It's noisy because I used random sampling to estimate the conditional distributions of k. But that's ok because we don't really care about n; we care about the predictive distribution of k. And noise in the distribution of n has very little effect on k.
The predictive distribution of k is a weighted mixture of the conditional distributions we already computed, and it looks like this:
Sadly, according to this model, my chance of winning my age group is less than 2% (compared to the binomial model, which predicts that my chances are more than 5%).
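The last two steps, the posterior over n and the mixture over k, can be sketched as below. To keep the example self-contained, the conditional distributions are plain binomial stand-ins with an assumed p of 0.17 and an assumed range of n with a uniform prior; in the model described above they would be the simulated conditionals instead, so the number printed here is not the model's result.

```python
from math import comb

def binomial_pmf(n, p):
    """Stand-in for a conditional distribution P(k | n)."""
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

def posterior_over_n(conditionals, observed_ks):
    """Bayes's theorem with a uniform prior over the candidate values of n."""
    weights = {}
    for n, pmf in conditionals.items():
        likelihood = 1.0
        for k in observed_ks:
            likelihood *= pmf.get(k, 0.0)
        weights[n] = likelihood
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}

def predictive_k(conditionals, post_n):
    """Mixture of the conditionals, weighted by the posterior probability of n."""
    mix = {}
    for n, weight in post_n.items():
        for k, prob in conditionals[n].items():
            mix[k] = mix.get(k, 0.0) + weight * prob
    return dict(sorted(mix.items()))

# Number of runners who finished ahead of me each year, 2008 through 2015.
observed_ks = [3, 5, 3, 2, 1, 3, 3, 3]

# Hypothetical candidate values of n and stand-in conditionals.
conditionals = {n: binomial_pmf(n, 0.17) for n in range(15, 41)}

post_n = posterior_over_n(conditionals, observed_ks)
pred = predictive_k(conditionals, post_n)
print(f"P(k = 0) under the stand-in conditionals: {pred[0]:.3f}")
```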
And one more time, the details are in this IPython notebook.