Monday, October 26, 2015

When will I win the Great Bear Run?

Almost every year since 2008 I have participated in the Great Bear Run, a 5K road race in Needham MA. I usually finish in the top 40 or so, and in my age group I have come in 4th, 6th, 4th, 3rd, 2nd, 4th and 4th. In 2015 I didn't run because of a scheduling conflict, but based on the results I estimate that I would have come 4th again.

Having come close in 2012, I have to wonder what my chances are of winning my age group.  I've developed two models to predict the results.

Binomial model

According to a simple binomial model (the details are in this IPython notebook), the predictive distribution for the number of people who finish ahead of me is:



And my chances of winning my age group are about 5%.

An improved model

The binomial model ignores an important piece of information: some of the people who have displaced me from my rightful victory have come back year after year.  Here are the people who have finished ahead of me in my age group:

    2008: Gardiner, McNatt, Terry
    2009: McNatt, Ryan, Partridge, Turner, Demers
    2010: Gardiner, Barrett, Partridge
    2011: Barrett, Partridge
    2012: Sagar
    2013: Hammer, Wang, Hahn
    2014: Partridge, Hughes, Smith
    2015: Barrett, Sagar, Fernandez

Several of them are repeat interlopers.  I have developed a model that takes this information into account and predicts my chances of winning.  It is based on the assumption that there is a population of n runners who could displace me, each with some probability p.

In order to displace me, a runner has to

  • Show up,
  • Outrun me, and
  • Be in my age group.

For each runner, the probability of displacing me is a product of these factors:

pi=SOB

Some runners have a higher SOB factor than others; we can use previous results to estimate it.

First we have to think about an appropriate prior.   Again, the details are in this IPython notebook.

  • Based on my experience, I conjecture that the prior distribution of S is an increasing function, with many people who run nearly every year, and fewer who run only occasionally.
  • The prior distribution of O is biased toward high values. Of the people who have the potential to beat me, many of them will beat me every time. I am only competitive with a few of them.  (For example, of the 15 people who have beat me, I have only ever beat 2).
  • The probability that a runner is in my age group depends on the difference between his age and mine. Someone exactly my age will always be in my age group. Someone 4 years older will be in my age group only once every 5 years (the Great Bear run uses 5-year age groups).  So the distribution of B is uniform.
I used Beta distributions for each of the three factors, so each piis the product of three Beta-distributed variates. In general, the result is not a Beta distribution, but it turns out there is a Beta distribution that is a good approximation of the actual distribution.


Using this distribution as a prior, we can compute the posterior distribution of p for each runner who has displaced me, and another posterior for any hypothetical runner who has not displaced me in any of 8 races.  As an example, for Rich Partridge, who has displaced me in 4 of 8 years, the mean of the posterior distribution is 42%.  For a runner who has only displaced me once, it is 17%.

Then, for any hypothetical number of runners, n, we can draw samples from these distributions of p and compute the conditional distribution of k, the number of runners who finish ahead of me.  Here's what that looks like for a few values of n:



If there are 18 runners, the mostly likely value of k is 3, so I would come in 4th.  As the number of runners increases, my prospects look a little worse.

These represent distributions of k conditioned on n, so we can use them to compute the likelihood of the observed values of k.  Then, using Bayes theorem, we can compute the posterior distribution of n, which looks like this:



It's noisy because I used random sampling to estimate the conditional distributions of k. But that's ok because we don't really care about n; we care about the predictive distribution of k. And noise in the distribution of n has very little effect on k.

The predictive distribution of k is a weighted mixture of the conditional distributions we already computed, and it looks like this:


Sadly, according to this model, my chance of winning my age group is less than 2% (compared to the binomial model, which predicts that my chances are more than 5%).

Friday, October 23, 2015

Bayes meets Fourier

Joseph Fourier never met Thomas Bayes—Fourier was born in 1768, seven years after Bayes died.  But recently I have been exploring connections between the Bayes filter and the Fourier transform.

By "Bayes filter", I don't mean spam filtering using a Bayesian classifier, but rather recursive Bayesian estimation, which is used in robotics and other domains to estimate the state of a system that evolves over time, for example, the position of a moving robot.  My interest in this topic was prompted by Roger Labbe's book, Kalman and Bayesian Filters in Python, which I am reading with my book club.

The Python code for this article is in this IPython notebook.  If you get sick of the math, you might want to jump into the code.

The Bayes filter

A Bayes filter starts with a distribution that represents probabilistic beliefs about the initial position of the robot.  It proceeds by alternately executing predict and update steps:

  1. The predict step uses a physical model of the system and its current state to predict the state in the future.  For example, given the position and velocity of the robot, you could predict its location at the next time step.
  2. The update step uses a measurement and a model of measurement error to update beliefs about the state of the system.  For example, if we use GPS to measure the location of the robot, the measurement would include some error, and we could characterize the distribution of errors.

The update step uses Bayes theorem, so computationally it involves multiplying the prior distribution by the likelihood distribution and then renormalizing.

The predict step involves adding random variables, like position and velocity, so computationally it involves the convolution of two distributions.

The Convolution Theorem

The Convolution Theorem suggests a symmetry between the predict and update steps, which leads to an efficient algorithm.  I present the Convolution Theorem in Chapter 8 of Think DSP.  For discrete signals, it says:

DFT(f ∗ g) = DFT(f) · DFT(g)

That is, if you want to compute the discrete Fourier transform (DFT) of a convolution, you can compute the DFT of the signals separately and then multiply the results.  In words, convolution in the time domain corresponds to multiplication in the frequency domain.  And conversely, multiplication in the time domain corresponds to convolution in the frequency domain.

In the context of Bayes filters, we are working with spatial distributions rather than temporal signals, but the convolution theorem applies: instead of the "time domain", we have the "space domain", and instead of the DFT, we have the characteristic function.

The characteristic function

This section shows that the characteristic function and the Fourier transform are the same thing.  If you already knew that, you can skip to the next section.

If you took mathematical statistics, you might have seen this definition of the characteristic function:

    \varphi_X(t) = \operatorname{E} \left [ e^{itX} \right ]

Where X is a random variable, t is a transform variable that corresponds to frequency (temporal, spatial, or otherwise), and E is the expectation operator.

If you took signals and systems, you saw this definition of the Fourier transform:

\hat{f}(\xi) = \int_{-\infty}^\infty f(x)\ e^{- 2\pi i x \xi}\,dx,

where f is a function, \hat{f} is its Fourier transform, and ΞΎ is a transform variable that corresponds to a frequency.

The two definitions look different, but if you write out the expectation operator, you get 

\varphi_X(t) = \langle e^{itX} \rangle = \int_{\mathbf{R}} e^{itx}p(x)\, dx = \overline{\left( \int_{\mathbf{R}} e^{-itx}p(x)\, dx \right)} = \overline{P(t)},

where p(x) is the probability density function of X, and P(t) is its Fourier transform.  The only difference between the characteristic function and the Fourier transform is the sign of the exponent, which is just a convention choice.  (The bar above P(t) indicates the complex conjugate, which is there because of the sign of the exponent.)

And that brings us to the whole point of the characteristic function: to add two random variables, you have to convolve their distributions.  By the Convolution Theorem, convolution in the space domain corresponds to multiplication in the frequency domain, so you can add random variables like this:
  1. Compute the characteristic function of both distributions.
  2. Multiply the characteristic functions elementwise.
  3. Apply the inverse transform to compute the distribution of the sum.

Back to Bayes

Now we can see the symmetry between the predict and update steps more clearly:
  1. The predict step involves convolution in the space domain, which corresponds to multiplication in the frequency domain.
  2. The update step involves multiplication in the space domain, which corresponds to convolution in the frequency domain.
This observation is interesting, at least to me, and possibly useful, because it suggests an efficient algorithm based on the Fast Fourier Transform (FFT).  The simple implementation of convolution takes time proportional to N², where N is the number of values in the distributions.  Using the FFT, we can get that down to N log N.

Using this algorithm, a complete predict-update step looks like this:
  1. Compute the FFT of the distributions for position and velocity.
  2. Multiply them elementwise; the result is the characteristic function of the convolved distributions.
  3. Compute the inverse FFT of the result, which is the predictive distribution.
  4. Multiply the predictive distribution by the likelihood and renormalize.
Steps 1 and 3 are N log N; steps 2 and 4 are linear.

I demonstrate this algorithm in this IPython notebook.

For more on the Bayesian part of the Bayes filter, you might want to read Chapter 6 of Think Bayes.  For more on the Convolution Theorem, see Chapter 8 of Think DSP.  And for more about adding random variables, see Chapter 6 of Think Stats.

Acknowledgements

Thanks to Paul Ruvolo, the other member of my very exclusive reading club, and Wikipedia for the typeset equations I borrowed.