Monday, January 26, 2015

Bayesian predictions for Super Bowl XLIX

Last fall I taught an introduction to Bayesian statistics at Olin College. My students worked on some excellent projects, and I invited them to write up their results as guest articles for this blog. Here is the first article of the series:


Predicting Super Bowl XLIX
Alex Crease and Matt Wismer


All code for this post can be found on Github: https://github.com/MattWis/PoissonFootball


With the Super Bowl coming up this Sunday, everyone is probably wondering where to place their bets. Fortunately, we've developed a model that uses Bayesian statistics to predict the results of football matches based upon prior season data of each team. We then used our model to predict a game between the Seahawks and the Patriots.


Our model is an expanded version of Allen Downey's solution to the “World Cup Problem”, where one can predict goals scored in a soccer match based on prior data. While the soccer model uses a Poisson process to predict when the next goal will be scored, our football model is a bit trickier: it not only accounts for defense, but also incorporates the two types of scoring, 3-point field goals and 7-point touchdowns, to predict final scores. We eliminated safeties and two-point conversions to simplify the model, and we didn't include overtime, so the two teams can tie at the end of regulation.


This model can be useful in a number of different betting schemes. Here’s what it predicts:


Who’s going to win?


Patriots, easy. Well, we may be biased because we both go to school in New England. Here’s the real story:


We model scoring as a single Poisson process and keep a separate distribution that represents our belief about the probability that a given score is a touchdown rather than a field goal. We then calculate the probability of scoring a certain number of touchdowns and field goals using the binomial distribution. The rate of the Poisson process reflects the offense's ability to get into field goal range, and the touchdown percentage reflects the team's red zone efficiency.


We also keep track of offense and defense separately. To calculate the distribution of points scored, we average the offense's scoring rate with the rate at which the opposing defense allows scoring, and average the offense's touchdown percentage with the percentage of touchdowns that the defense allows.
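To make this concrete, here is a minimal sketch of how a single team's points distribution could be built under these assumptions. This is not our actual code (that is in the GitHub repo linked above); the function and its parameter values are illustrative, and in the full model the scoring rate and touchdown percentage are themselves uncertain quantities estimated from last season's games.

import numpy as np
from scipy.stats import poisson, binom

def score_pmf(off_rate, def_rate, off_td_pct, def_td_pct, max_scores=12):
    # Average the offense's rates with what the opposing defense allows.
    lam = (off_rate + def_rate) / 2          # expected scoring plays per game
    p_td = (off_td_pct + def_td_pct) / 2     # chance a given score is a touchdown
    pmf = np.zeros(7 * max_scores + 1)
    for k in range(max_scores + 1):          # k scoring plays in the game
        for td in range(k + 1):              # td touchdowns, k - td field goals
            points = 7 * td + 3 * (k - td)
            pmf[points] += poisson.pmf(k, lam) * binom.pmf(td, k, p_td)
    return pmf / pmf.sum()                   # renormalize after truncating at max_scores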


Having generated the points distribution for each team as described above, we calculated the probability that one would be greater than the other, or, in simpler terms, the probability that each team would win:


Outcome         Probability
Patriots Win    52.0%
Seahawks Win    44.1%
Overtime        3.9%
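For readers following along in code, here is a minimal sketch of how that comparison can be made from two points distributions. Again, this is illustrative rather than our actual code, and the parameter values passed to score_pmf below are placeholders, not the fitted ones.

def win_probabilities(pmf_a, pmf_b):
    p_a = p_b = p_tie = 0.0
    for i, pa in enumerate(pmf_a):           # i = team A's points
        for j, pb in enumerate(pmf_b):       # j = team B's points
            if i > j:
                p_a += pa * pb
            elif i < j:
                p_b += pa * pb
            else:
                p_tie += pa * pb
    return p_a, p_b, p_tie

pats_pmf = score_pmf(4.5, 4.2, 0.60, 0.55)   # placeholder parameters
sea_pmf = score_pmf(4.2, 4.0, 0.58, 0.52)    # placeholder parameters
print(win_probabilities(pats_pmf, sea_pmf))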

At this point, the teams seem well matched, with our model slightly favoring the Patriots. Looking more closely at the data, the 90% credible interval for the Patriots' score is 6-45 points, and for the Seahawks' it is 6-44, which brings us to the point spread.


Betting the Point Spread


This game is likely to be pretty close. We looked at the probability of different spreads by subtracting the Seahawks' points distribution from the Patriots' points distribution. The cumulative distribution of the point spread is shown below. As you can see, the 50% mark falls right around the Patriots winning by 1, so if you can find a spread significantly far from that, you can consult this chart to figure out how likely you are to beat it. Keep in mind, though, that this result does not treat overtime as a possibility; it allows the teams to tie at the end of regulation, so the actual margin may be different if the game goes into overtime.
[Figure Point_Spread.png: cumulative distribution of the point spread (Patriots minus Seahawks)]
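The spread distribution is just the distribution of the difference of the two scores. A compact way to compute it, reusing numpy and the placeholder PMFs from the sketches above, is a convolution with one PMF reversed:

spread_pmf = np.convolve(pats_pmf, sea_pmf[::-1])
spread_values = np.arange(len(spread_pmf)) - (len(sea_pmf) - 1)   # Patriots minus Seahawks
spread_cdf = np.cumsum(spread_pmf)                                # the curve plotted above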


Bets on the Total


Some people like to bet on the total points scored by both teams, so we wanted to figure that out as well. We added our two points distributions and got an interesting result: the total is almost exactly as likely to be 37 (3.636%) as 44 (3.634%). You can consult the graph below for other likely totals to see where to place your bets.


[Figure Total_Points.png: distribution of total points scored]


Another method of betting on the total is based on an over/under scale, like the point spread. We've got you covered in that case, too. We expect the total score to be centered around 45 points, as that is the mean of the distribution. According to http://www.oddsshark.com/nfl/odds, the over/under line is currently at 48 or 48.5, and with our results in mind, we advise betting under. However, if the game goes into overtime, all bets are off (pun intended) and the total score could go higher.
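As a sketch of that calculation: the total is the sum of the two scores, so its PMF is the plain convolution of the two team PMFs, and the probability of the under at a 48.5 line is the probability that the total is 48 or less. This reuses the placeholder distributions from above, so the numbers will not match the ones quoted in the text.

total_pmf = np.convolve(pats_pmf, sea_pmf)                     # index = total points
expected_total = np.dot(np.arange(len(total_pmf)), total_pmf)  # mean of the total
p_under = total_pmf[:49].sum()                                 # P(total <= 48), the under at 48.5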


[Figure Total_CDF.png: cumulative distribution of total points]
Office Pool Grid Betting


Because of the multiple ways of scoring in football, certain scores are more likely to occur than others. As you can see in the histograms below, 17 points is the most likely score for both teams, because it corresponds to 2 touchdowns and 1 field goal, both plausible tallies for a football game. However, it is impossible to score a value like 11 points, because no combination of 7s and 3s adds up to it. From the two graphs, you can also see that the Seahawks are more likely to score field goals than the Patriots, because their probability of scoring 3 points is higher.


[Figure Patriots.png: distribution of the Patriots' points]


[Figure Seahawks.png: distribution of the Seahawks' points]
To satisfy your betting needs, we've created histograms below displaying the likelihood that teams will score values with specific last digits so you can (hopefully) blow your office mates away with your office pool bets. If you are familiar with the way points add up in football, our data isn't very surprising. Across both teams, the top two digits to place your bets on are 0 and 7, with 4 and 3 coming up behind.


[Figure Patriots_Digits.png: distribution of the last digit of the Patriots' score]


[Figure Seahawks_Digits.png: distribution of the last digit of the Seahawks' score]
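These last-digit probabilities fall straight out of a team's points PMF. A minimal sketch, again reusing the placeholder pats_pmf from above:

digit_probs = np.zeros(10)
for points, p in enumerate(pats_pmf):
    digit_probs[points % 10] += p            # probability of each final digit
print(digit_probs)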

Sadly, our results do not show a clear outcome to place your bets on. However, we hope that we've shed some light on how you can make an educated guess when deciding where to put your money. Please remember that you ultimately decide how to bet, and we are in no way liable for any money you may lose (or win) over the Super Bowl weekend.

Wednesday, December 17, 2014

Statistics tutorials at PyCon 2015

I am happy to announce that I will offer two statistics tutorials at PyCon 2015 on April 9 in Montreal.  In the morning session I am teaching Bayesian Statistics Made Simple, which I have taught several times before, including the last three PyCons.  In the afternoon I am offering a new tutorial, Statistical Inference with Computational Methods.

The whole tutorial schedule is here, along with registration information.  And here are some details about the tutorials:

Bayesian statistics made simple

Audience level:
Intermediate
Category:
Science

Description

An introduction to Bayesian statistics using Python.  Bayesian statistics are usually presented mathematically, but many of the ideas are easier to understand computationally.  People who know Python can get started quickly and use Bayesian analysis to solve real problems.  This tutorial is based on material and case studies from Think Bayes (O’Reilly Media).

Abstract

Bayesian statistical methods are becoming more common and more important, but there are not many resources to help beginners get started.  People who know Python can use their programming skills to get a head start.
I will present simple programs that demonstrate the concepts of Bayesian statistics, and apply them to a range of example problems.  Participants will work hands-on with example code and practice on example problems.
Attendees should have basic Python skills and basic statistics.  If you learned about Bayes’s theorem and probability distributions at some time, that’s enough, even if you don’t remember it!
Attendees should bring a laptop with Python and matplotlib.  You can work in any environment; you just need to be able to download a Python program and run it.  I will provide code to help attendees get set up ahead of time.

Statistical inference with computational methods

Audience level:
Intermediate
Category:
Science

Description

Statistical inference is a fundamental tool in science and engineering, but it is often poorly understood.  This tutorial uses computational methods, including Monte Carlo simulation and resampling, to explore estimation, hypothesis testing and statistical modeling.  Attendees will develop understanding of statistical concepts and learn to use real data to answer relevant questions.

Abstract

Do you know the difference between standard deviation and standard error?  Do you know what statistical test to use for any occasion?  Do you really know what a p-value is?  How about a confidence interval?
Most students don’t really understand these concepts, even after taking several statistics classes.  The problem is that these classes focus on mathematical methods that bury the concepts under a mountain of details.
This tutorial uses Python to implement simple statistical experiments that develop deep understanding.  Attendees will learn about resampling and related tools that use random simulation to perform statistical inference, including estimation and hypothesis testing.  We will use pandas, which provides structures for data analysis, along with NumPy and SciPy.
I will present examples using real-world data to answer relevant questions.  The tutorial material is based on my book, Think Stats, a class I teach at Olin College, and my blog, “Probably Overthinking It.”

More information and registration here.

Thursday, December 4, 2014

The Rock Hyrax Problem



This is the third of a series of articles about Bayesian analysis.  The previous article is here.

Earlier this semester I posed this problem to my Bayesian statistics class at Olin College:
Suppose I capture and tag 10 rock hyraxes.  Some time later, I capture another 10 hyraxes and find that two of them are already tagged.  How many hyraxes are there in this environment?
This is an example of a mark and recapture experiment, which you can read about on Wikipedia.  The Wikipedia page also includes the photo of a tagged hyrax shown above.

As always with problems like this, we have to make some modeling assumptions.

1) For simplicity, you can assume that the environment is reasonably isolated, so the number of hyraxes does not change between observations.

2) And you can assume that each hyrax is equally likely to be captured during each phase of the experiment, regardless of whether it has been tagged.  In reality, it is possible that tagged animals would avoid traps in the future, or possible that the same behavior that got them caught the first time makes them more likely to be caught again.  But let's start simple.
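Under these assumptions, the number of tagged animals in the second sample follows a hypergeometric distribution, and a grid over candidate population sizes gives a quick posterior. The short sketch below only illustrates that likelihood, with a uniform prior over an arbitrary grid; it is not the solution notebook linked below.

import numpy as np
from scipy.stats import hypergeom

Ns = np.arange(18, 501)                       # candidate population sizes
likelihood = hypergeom.pmf(2, Ns, 10, 10)     # P(2 tagged among 10 recaptured | N)
posterior = likelihood / likelihood.sum()     # uniform prior over the grid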

My solution to this problem uses the computation framework from my book, Think Bayes.  The framework is described in this notebook.  If you have read Think Bayes or attended one of my workshops, you might want to attempt this problem before you look at my solution.

If you solve this problem analytically, or use MCMC, and you want to share your solution, please let me know and I will post it here.

And when you are ready, you can see my solution in this notebook.

I will post more of the exercises from my class over the next few weeks.

UPDATE December 5, 2014: João Neto posted a solution to this problem in BUGS using a Jeffreys prior.

Tuesday, November 25, 2014

The World Cup Problem Part 2: Germany v. Argentina

This is the second of two articles about Bayesian analysis applied to World Cup soccer.  The previous article is here.

Earlier this semester I posed this problem to my Bayesian statistics class at Olin College:
In the final match of the 2014 FIFA World Cup, Germany defeated Argentina 1-0. How much evidence does this victory provide that Germany had the better team? What is the probability that Germany would win a rematch?
Before you can answer a question like this, you have to make some modeling decisions.  As I suggested to my class, scoring in games like soccer and hockey can be well modeled by a Poisson process, which assumes that each team, against a given opponent, will score goals at some goal-scoring rate, λ, and that this rate is stationary; in other words, the probability of scoring a goal is about the same at any point during the game.

If this model holds, we expect the distribution of time between goals to be exponential, and the distribution of goals per game to be Poisson.
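To see that connection concretely, here is a quick simulation sketch (not part of the original analysis): it generates goal times from a Poisson process with a placeholder rate of 1.3 goals per game and checks that the average number of goals per game comes out near that rate.

import random

def simulate_game(lam, minutes=90):
    # Goal times for one game under a Poisson process with rate lam goals per game.
    times, t = [], 0.0
    while True:
        t += random.expovariate(lam / minutes)   # exponential gap between goals
        if t > minutes:
            return times
        times.append(t)

goals_per_game = [len(simulate_game(1.3)) for _ in range(10000)]
print(sum(goals_per_game) / len(goals_per_game))  # should be close to 1.3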

My solution to this problem uses the computation framework from my book, Think Bayes.  The framework is described in this notebook.  If you have read Think Bayes or attended one of my workshops, you might want to attempt this problem before you look at my solution.

If you solve this problem analytically, or use MCMC, and you want to share your solution, please let me know and I will post it here.

And when you are ready, you can see my solution in this notebook.

I will post more of the exercises from my class over the next few weeks.

One technical note: Last week on reddit, someone asked about my use of "fake data" to construct the prior distribution.  This move is not as bogus as it sounds; it is just a convenient way to construct a prior that has the right mean (or whatever other statistics are known) and a shape that is consistent with background information.
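For readers who wondered what that looks like in practice, here is a minimal sketch of the idea (not the code from the notebook): start from a vague grid prior over the goal-scoring rate and update it with a few invented games whose average matches whatever background rate you want the prior to have. The invented scores below are placeholders.

import numpy as np
from scipy.stats import poisson

lams = np.linspace(0.01, 10, 1000)       # grid of goal-scoring rates
prior = np.ones_like(lams)               # start from a vague (uniform) prior
for goals in [1, 1, 2]:                  # fake games chosen to match the desired mean
    prior *= poisson.pmf(goals, lams)
prior /= prior.sum()                     # a prior with roughly the right mean and shape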

UPDATE November 25, 2014:  Cameron Davidson-Pilon has once again risen to the challenge and implemented a solution using PyMC.  You can see his solution here.

If you want to learn more about PyMC, an excellent place to start is Cameron's online book Bayesian Methods for Hackers.

Tuesday, November 18, 2014

The World Cup Problem: Germany v. Brazil

Earlier this semester I posed this problem to my Bayesian statistics class at Olin College:
In the 2014 FIFA World Cup, Germany played Brazil in a semifinal match. Germany scored after 11 minutes and again at the 23 minute mark. At that point in the match, how many goals would you expect Germany to score after 90 minutes? What was the probability that they would score 5 more goals (as, in fact, they did)?
Before you can answer a question like this, you have to make some modeling decisions.  As I suggested to my class, scoring in games like soccer and hockey can be well modeled by a Poisson process, which assumes that each team, against a given opponent, will score goals at some goal-scoring rate, λ, and that this rate is stationary; in other words, the probability of scoring a goal is about the same at any point during the game.

If this model holds, we expect the distribution of time between goals to be exponential, and the distribution of goals per game to be Poisson.

My solution to this problem uses the computation framework from my book, Think Bayes.  The framework is described in this notebook.  If you have read Think Bayes or attended one of my workshops, you might want to attempt this problem before you look at my solution.

If you solve this problem analytically, or use MCMC, and you want to share your solution, please let me know and I will post it here.

And when you are ready, you can see my solution in this notebook.

I will post more of the exercises from my class over the next few weeks.  Coming next: The World Cup Problem Part II: Germany v. Argentina.


UPDATE November 19, 2014:  Cameron Davidson-Pilon kindly (and quickly!) responded to my request for a solution to this problem using PyMC.  You can see his solution here.  If you want to learn more about PyMC, an excellent place to start is Cameron's online book Bayesian Methods for Hackers.

Saturday, October 4, 2014

On efficient algorithms for finding the goddamn endnotes

Introduction

In many recent books, the goddamn endnotes are numbered sequentially within each chapter, but chapter numbers seldom appear in the header on each page.  So a reader who wants to find a goddamn endnote typically has to search backward to figure out which chapter they are reading before they can find the goddamn endnote.  Several simple changes can make this process more efficient and less infuriating.

In this paper, we present a novel algorithm for finding a goddamn endnote (FGE) in O(1) time; that is, time bounded by a constant regardless of the size or number of chapters.

Background

The most widely deployed algorithm for FGE requires O(n+m) time; that is, time proportional to either n, the number of pages in a chapter, or m, the number of chapters, whichever is larger.  The following pseudocode sketches this algorithm:

1) Store the endnote number you want to find as e.
2) Figure out what chapter you are reading by searching for the beginning of the chapter, and store the result as c.
3) Find the endnotes at the end of the book and locate the notes for chapter c.
4) Look up endnote e.

The most common implementation of Step 2 is linear in n, the number of pages in the chapter.  Some readers can execute a binary search by comparing chapter titles in the header, reducing search time to O(log n).

Similarly, the most common implementation of Step 3 is linear in m, the number of chapters, but some readers perform a binary search.

Assuming that the number of endnotes in a chapter is bounded by a constant, we consider Step 4 to be constant time.
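For concreteness, here is a toy Python sketch of the baseline algorithm, using a hypothetical book structure (a page-to-chapter map and per-chapter note lists in the back matter); it exists only to illustrate where the linear costs come from.

def find_goddamn_endnote(book, current_page, e):
    # Step 2: scan backward for the start of the current chapter, O(n).
    page = current_page
    while book['chapter_starting_on_page'].get(page) is None:
        page -= 1
    c = book['chapter_starting_on_page'][page]
    # Step 3: scan the back matter for chapter c's notes, O(m).
    for chapter, notes in book['endnote_sections']:
        if chapter == c:
            return notes[e - 1]              # Step 4: constant time
    return None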

Alternatives

1) Put the goddamn chapter number (GCN) in the goddamn header.  This simple change immediately reduces Step 2 to constant time.

2) Number the goddamn endnotes from the beginning to the end of the book.  Do not start over at the beginning of each goddamn chapter.  This change also reduces Step 2 to constant time, but with this change Step 4 may no longer be considered constant time.

3) Along with the endnote number, e, also include p, the page where the goddamn endnote appears. This change makes Step 2 constant time, and makes Step 3, although technically log time, practically constant time for most readers.

4) Put the goddamn endnotes at the bottom of the goddamn page (cf. footnotes).  For a bounded page size, looking at the bottom of the page is a constant time operation.

Recommendations

We recommend that publishers replace existing linear-time algorithms with any of several easy-to-implement, efficient alternatives, goddammit.

Monday, September 29, 2014

Two hour marathon in 2041

Today Dennis Kimetto ran the Berlin Marathon in 2:02:57, shaving 26 seconds off the previous world record.  That means it's time to check in on the world record progression and update last year's update of my article from two years ago.  The following is a revised version of that article, including the new data point.

Abstract: I propose a model that explains why world record progressions in running speed are improving linearly, and should continue as long as the population of potential runners grows exponentially.  Based on recent marathon world records, I extrapolate that we will break the two-hour barrier in 2041.

-----

Let me start with the punchline:


The blue points show the progression of marathon records since 1970, including the new mark.  The blue line is a least-squares fit to the data, and the red line is the target pace for a two-hour marathon, 13.1 mph.  The blue line hits the target pace in 2041.

In general, linear extrapolation of a time series is a dubious business, but in this case I think it is justified:

1) The distribution of running speeds is not a bell curve.  It has a long tail of athletes who are much faster than normal runners.  Below I propose a model that explains this tail, and suggests that there is still room between the fastest human ever born and the fastest possible human.

2) I’m not just fitting a line to arbitrary data; there is theoretical reason to expect the progression of world records to be linear, which I present below.  And since there is no evidence that the curve is starting to roll over, I think it is reasonable to expect it to continue for a while.

3) Finally, I am not extrapolating beyond reasonable human performance. The target pace for a two-hour marathon is 13.1 mph, which is slower than the current world record for the half marathon (58:23 minutes, or 13.5 mph). It is unlikely that the current top marathoners will be able to maintain this pace for two hours, but we have no reason to think that it is beyond theoretical human capability.

My model, and the data it is based on, are below.

-----

In April 2011, I collected the world record progression for running events of various distances and plotted speed versus year (here's the data, mostly from Wikipedia). The following figure shows the results:

You might notice a pattern; for almost all of these events, the world record progression is a remarkably straight line. I am not the first person to notice, but as far as I know, no one has proposed an explanation for the shape of this curve.

Until now -- I think I know why these lines are straight.  Here are the pieces:

1) Each person's potential is determined by several factors that are independent of each other; for example, your VO2 max and the springiness of your tendons are probably unrelated.

2) Each runner's overall potential is limited by their weakest factor. For example, if there are 10 factors, and you are really good at 9 of them, but below average on the 10th, you will probably not be a world-class runner.

3) As a result of (1) and (2), potential is not normally distributed; it is long-tailed. That is, most people are slow, but there are a few people who are much faster.

4) Runner development has the structure of a pipeline. At each stage, the fastest runners are selected to go on to the next stage.

5) If the same number of people went through the pipeline each year, the rate of new records would slow down quickly.

6) But the number of people in the pipeline actually grows exponentially.

7) As a result of (5) and (6) the rate of new records is linear.

8) This model suggests that linear improvement will continue as long as the world population grows exponentially.

Let's look at each of those pieces in detail:

Physiological factors that determine running potential include VO2 max, anaerobic capacity, height, body type, muscle mass and composition (fast and slow twitch), conformation, bone strength, tendon elasticity, healing rate and probably more. Psychological factors include competitiveness, persistence, tolerance of pain and boredom, and focus.

Most of these factors have a large component that is inherent, they are mostly independent of each other, and any one of them can be a limiting factor. That is, if you are good at all of them but one, and bad at that one, you will not be a world-class runner. To summarize: there is only one way to be fast, but there are a lot of ways to be slow.

As a simple model of these factors, we can generate a random person by picking N random numbers, drawing each from a normal distribution and passing it through a logistic transform. This yields a bell-shaped distribution bounded between 0 and 1, where 0 represents the worst possible value (at least for purposes of running speed) and 1 represents the best possible value.

Then to represent the running potential of each person, we compute the minimum of these factors. Here's what the code looks like:

import math, random

def Logistic(x):                 # squash a normal value into (0, 1)
    return 1 / (1 + math.exp(-x))

def GeneratePerson(n=10):
    factors = [random.normalvariate(0.0, 1.0) for i in range(n)]
    logs = [Logistic(x) for x in factors]
    return min(logs)             # potential is limited by the weakest factor


Yes, that's right, I just reduced a person to a single number. Cue the humanities majors lamenting the blindness and arrogance of scientists. Then explain that this is supposed to be an explanatory model, so simplicity is a virtue. A model that is as rich and complex as the world is not a model.

Here's what the distribution of potential looks like for different values of N:


When N=1, there are many people near the maximum value. If we choose 100,000 people at random, we are likely to see someone near 98% of the limit. But as N increases, the probability of large values drops fast. For N=5, the fastest runner out of 100,000 is near 85%. For N=10, he is at 65%, and for N=50 only 33%.
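A quick way to check those numbers is to run the lottery directly, using GeneratePerson from above; the exact percentages vary from run to run.

for n in [1, 5, 10, 50]:
    best = max(GeneratePerson(n) for _ in range(100000))
    print(n, round(best, 3))     # best potential seen among 100,000 people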

In this kind of lottery, it takes a long time to hit the jackpot. And that's important, because it suggests that even after 7 billion people, we might not have seen anyone near the theoretical limit.

Let's see what effect this model has on the progression of world records. Imagine that we choose 100,000 people and test them one at a time for running potential (and suppose that we have a perfect test). As we perform tests, we keep track of the fastest runner we have seen, and plot the "world record" as a function of the number of tests.

Here's the code:

def WorldRecord(m=100000, n=10):
    data = []
    best = 0.0
    for i in range(m):
        person = GeneratePerson(n)
        if person > best:
            best = person
            data.append((i / m, best))   # (fraction tested, best so far)
    return data

And here are the results with M=100,000 people and the number of factors N=10:


The x-axis is the fraction of people we have tested. The y-axis is the potential of the best person we have seen so far. As expected, the world record increases quickly at first and then slows down.

In fact, the time between new records grows geometrically. To see why, consider this: if it took 100 people to set the current record, we expect it to take 100 more to exceed the record. Then after 200 people, it should take 200 more. After 400 people, 400 more, and so on. Since the time between new records grows geometrically, this curve is logarithmic.

So if we test the same number of people each year, the progression of world records is logarithmic, not linear. But if the number of tests per year grows exponentially, that's the same as plotting the previous results on a log scale. Here's what you get:


That's right: a log curve on a log scale is a straight line. And I believe that that's why world records behave the way they do.

This model is unrealistic in some obvious ways. We don't have a perfect test for running potential and we don't apply it to everyone. Not everyone with the potential to be a runner has the opportunity to develop that potential, and some with the opportunity choose not to.

But the pipeline that selects and trains runners behaves, in some ways, like the model.  If a person with record-breaking potential is born in Kenya, where running is the national sport, the chances are good that he will be found, have opportunities to train, and become a world-class runner.  It is not a certainty, but the chances are good.

If the same person is born in rural India, he may not have the opportunity to train; if he is in the United States, he might have options that are more appealing.

So in some sense the relevant population is not the world, but the people who are likely to become professional runners, given the talent.  As long as this population is growing exponentially, world records will increase linearly.

That said, the slope of the line depends on the parameter of exponential growth.  If economic development increases the fraction of people in the world who have the opportunity to become professional runners, these curves could accelerate.

So let's get back to the original question: when will a marathoner break the 2-hour barrier?  Before 1970, marathon times improved quickly; since then the rate has slowed.  But progress has been consistent for the last 40 years, with no sign of an impending asymptote.  So I think it is reasonable to fit a line to recent data and extrapolate.  Here are the results [based on 2011 data; see above for the update]:



The red line is the target: 13.1 mph.  The blue line is a least squares fit to the data.  So here's my prediction: there will be a 2-hour marathon in 2043.  I'll be 76, so my chances of seeing it are pretty good.  But that's a topic for another article (see Chapter 1 of Think Bayes).
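For completeness, the extrapolation itself is just a least-squares line solved for the target pace. Here is a sketch, assuming years and speeds_mph hold the post-1970 record progression from the linked data:

import numpy as np

slope, intercept = np.polyfit(years, speeds_mph, 1)   # least-squares fit to the records
target = 13.1                                          # two-hour marathon pace in mph
print((target - intercept) / slope)                    # predicted year of a 2:00 marathon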