Olin College is Hiring

Olin College is Hiring. I teach at Olin College, a new undergraduate engineering college with the mission to fix engineering education. If you're interested in joining our team, here is information about the Faculty Search at Olin College.

Saturday, April 30, 2011

Two Hour Marathon in 2045

On April 18 Geoffry Mutai finished the Boston Marathon in 2:03:02, which is the fastest a human has ever run 26.2 miles. This remarkable performance raises the question: when will we break the 2-hour barrier?

To answer that question, I collected the world record progression for running events of various distances and plotted them (here's the data, mostly from Wikipedia). The following figure shows the results:

You might notice a pattern; for almost all of these events, the world record progression is a remarkably straight line. I am not the first person to notice, but as far as I know, no one has proposed an explanation for the shape of this curve.

Until now -- I think I know why these lines are straight.  Here are the pieces:

1) Each person's potential is determined by several factors that are independent of each other; for example, your VO2 max and the springiness of your tendons are probably unrelated.

2) Each runner's overall potential is limited by their weakest factor. For example, if there are 10 factors, and you are really good at 9 of them, but below average on the 10th, you will probably not be a world-class runner.

3) As a result of (1) and (2), potential is not normally distributed; it is long-tailed. That is, most people are slow, but there are a few people who are much faster.

4) Runner development has the structure of a pipeline. At each stage, the fastest runners are selected to go on to the next stage.

5) If the same number of people went through the pipeline each year, the rate of new records would slow down quickly.

6) But the number of people in the pipeline actually grows exponentially.

7) As a result of (5) and (6) the rate of new records is linear.

8) This model suggests that linear improvement will continue as long as the world population grows exponentially.

Let's look at each of those pieces in detail:

Physiological factors that determine running potential include VO2 max, anaerobic capacity, height, body type, muscle mass and composition (fast and slow twitch), conformation, bone strength, tendon elasticity, healing rate and probably more. Psychological factors include competitiveness, persistence, tolerance of pain and boredom, and focus.

Most of these factors have a large component that is inherent, they are mostly independent of each other, and any one of them can be a limiting factor. That is, if you are good at all of them, and bad at one, you will not be a world-class runner. There is only one way to be fast, but there are a lot of ways to be slow.

As a simple model of these factors, we can generate a random person by picking N random numbers, where each number is normally-distributed under a logistic transform. This yields a bell-shaped distribution bounded between 0 and 1, where 0 represents the worst possible value (at least for purposes of running speed) and 1 represents the best possible value.

Then to represent the running potential of each person, we compute the minimum of these factors. Here's what the code looks like:

def GeneratePerson(n=10):
    factors = [random.normalvariate(0.0, 1.0) for i in range(n)]
    logs = [Logistic(x) for x in factors]
    return min(logs)


Yes, that's right, I just reduced a person to a single number. Bring in the humanities majors lamenting the blindness and arrogance of scientists. Then explain to them that this is supposed to be an explanatory model, so simplicity is a virtue. A model that is as rich and complex as the world is not a model.

Here's what the distribution of potential looks like for different values of N:


When N=1, there are many people near the maximum value. If we choose 100,000 people at random, we are likely to see someone near 98% of the limit. But as N increases, the probability of large values drops fast. For N=5, the fastest runner out of 100,000 is near 85%. For N=10, he is at 65%, and for N=50 only 33%.

In this kind of genetic lottery, it takes a long time to hit the jackpot. And that's important, because it suggests that even after 6 billion people, we might not have seen anyone near the theoretical limit.

Let's see what effect this model has on the progression of world records. Imagine that we choose a million people and test them one at a time for running potential (and suppose that we have a perfect test). As we perform tests, we keep track of the fastest runner we have seen, and plot the "world record" as a function of the number of tests.

Here's the code:

def WorldRecord(m=100000, n=10):
    data = []
    best = 0.0
    for i in xrange(m):
        person = GeneratePerson(n)
        if person > best:
            best = person
            data.append(i/m, best))
    return data

And here are the results with M=100,000 people and the number of factors N=10:


The x-axis is the fraction of people we have tested. The y-axis is the potential of the best person we have seen so far. As expected, the world record increases quickly at first and then slows down.

In fact, the time between new records grows geometrically. To see why, consider this: if it took 100 people to set the current record, we expect it to take 100 more to exceed the record. Then after 200 people, it should take 200 more. After 400 people, 400 more, and so on. Since the time between new records grows geometrically, this curve is logarithmic.

So if we test the same number of people each year, the progression of world records is logarithmic, not linear. But if the number of tests per year grows exponentially, that's the same as plotting the previous results on a log scale. Here's what you get:


That's right: a log curve on a log scale is a straight line. And I believe that that's why world records behave the way they do.

This model is unrealistic in some obvious ways. We don't have a perfect test for running potential and we don't apply it to everyone. Not everyone with the potential to be a runner has the opportunity to develop that potential, and some with the opportunity choose not to.

But the pipeline that selects and trains runners behaves, in some ways, like the model.  If a person with record-breaking potential is born in Kenya, where running is the national sport, the chances are good that he will be found, he will have opportunities to train, and he will become a world-class runner.  It is not a certainty, but the chances are good.

If the same person is born in rural India, he may not have the opportunity to train; if he is in the United States, he might have options that are more appealing.

So in some sense the relevant population is not the world, but the people who are likely to become professional runners, given the talent.  As long as this population is growing exponentially, world records will increase linearly.

That said, the slope of the line depends on the parameter of exponential growth.  If economic development increases the fraction of people in the world who have the opportunity to become professional runners, these curves could accelerate.

So let's get back to the original question: when will a marathoner break the 2-hour barrier?  Before 1970, marathon times improved quickly; since then the rate has slowed.  But progress has been consistent for the last 40 years, with no sign of an impending asymptote.  So I think it is reasonable to fit a line to recent data and extrapolate.  Here are the results:


The red line is the target: 13.1 mph.  The blue line is a least squares fit to the data.  So here's my prediction: there will be a 2-hour marathon in 2045.  I'll be 78, so my chances of seeing it are pretty good.  But maybe that's a topic for another article.

Related posts:

Wednesday, April 20, 2011

Bayesianness is next to Godliness

Is there evidence for God?

I watched a debate recently between William Lane Craig and Lawrence Krauss on the topic "Is there evidence for God?"  Dr. Craig argued the affirmative; Dr. Krauss the negative.  Since their names are monosyllabic, Germanic, and alliterative, it might be hard to keep them straight.  I will call them Affirmative and Negative.

Affirmative spoke first and offered the following definition of evidence: "To say that there is evidence for a hypothesis is just to say that the hypothesis is more probable given certain facts than it would have been without them."  He explained that if E is a set of facts and H is a hypothesis, then E is evidence for H if

P(H | E & B) > P(H | B),

where B represents background information independent of E.  To simplify I will make B implicit and write

P(H | E) > P(H).

In Bayesian terms, P(H) is the prior probability, that is, prior to taking E into account;  P(H | E) is the posterior probability.

This definition of evidence sounds reasonable to me, and I think it is uncontroversial.  But Negative chose to refute it: "I will get to the fact that [Affirmative's] claim for what evidence is is not at all what we use in science nowadays.  It doesn’t relate at all to what we use in science."

I think I have a pretty good idea what we use in science nowadays, and I have no idea what Negative is talking about.  He never explained.

In his rebuttal, Affirmative replied, "This is the standard definition of 'is evidence for,' used in probability theory.  And I’m astonished to hear [Negative] attacking logic in Bayesian probability theory as the basis for his argument."

So am I.  On this point I think Affirmative is basically right, but I do have one reservation.  People often assume, incorrectly, that E is evidence for H if E is consistent with H, but in fact, that's not enough.

Oliver's blood!

To see why not, let's look at an exercise from one of my favorite books, David MacKay's Information Theory, Inference, and Learning Algorithms:
MacKay's blood type problem
Two people have left traces of their own blood at the scene of a crime.  A suspect, Oliver, is tested and found to have type O blood.  The blood groups of the two traces are found to be of type O (a common type in the local population, having frequency 60%) and of type AB (a rare type, with frequency 1%).  Do these data (the blood types found at the scene) give evidence in favour of the proposition that Oliver was one of the two people whose blood was found at the scene?
Because the findings at the scene are consistent with the hypothesis that Oliver was at the scene, it is tempting to say that they are evidence for the hypothesis.  In fact, they are not.

Let's apply the test proposed by Affirmative.  I use G for "guilty" to represent the hypothesis that Oliver left blood at the scene.  I assume that the the prior probability P(G) is some unspecified value p; it will turn out not to matter.  Then according to Bayes's theorem the posterior probability is

P(G | E) = p * P(E | G) / P(E)

The term P(E | G) is the likelihood of the evidence if the hypothesis is true. This is easy to evaluate.  If Oliver left blood at the scene, that accounts for the type O trace.  So P(E | G) is the probability that the other sample, from an unknown person, is type AB, which is 1%.

The denominator, called the normalizing constant, is the probability of seeing E under any circumstances at all.  It is not always easy to formulate P(E).  But in this case we can, by defining NG, which is the hypothesis that Oliver did not leave blood at the scene.  Applying Bayes's theorem again, the posterior probability of NG is

P(NG | E) = (1-p) * P(E | NG) / P(E)

P(E | NG) is the likelihood of the evidence under the hypothesis that Oliver is not one of the people who left blood at the scene.  In that case, the blood was left by two unknown people.  If we choose two people at random, the chance of finding types O and AB is 2 * 60% * 1% = 1.2%.

Now we can get rid of P(E) by computing the ratio of the posterior probabilities:

P(G | E) / P(NG | E) = p / (1-p) * P(E | G) / P(E | NG)

In words, the posterior ratio is the product of the prior ratio and the likelihood ratio.  That means that the probability of G will increase if

P(E | G) / P(E | NG) > 1;

that is, if the evidence is more likely under G than under NG.  In this example the ratio is 1/1.2, which means that the posterior probability of G is slightly lower.  So the evidence is mildly exculpatory.

Surprised?  So was I.  To understand why, think about the evidence.  In the local population, AB blood is rare, so the AB sample is harder to explain.  If Oliver accounts for the type O sample, there is only one candidate to account for the AB sample.  If neither sample is accounted for, we get two bites at the apple.  And in this example, that makes the difference.

The moral of this story is that evidence consistent with your hypothesis does not necessarily support your hypothesis.  The evidence has to be more likely under your hypothesis than under the alternative.

Evidence for God?

In the previous section, I started with the definition that E is evidence for H if

P(H | E) > P(H),

and used Bayes's theorem to derive the equivalent requirement

P(E | H) > P(E | NH),

where NH is the complement of H.  I think this way of expressing the requirement is clearer and less prone to abuse.

To demonstrate, I will analyze one of the points of evidence proposed by Affirmative, the existence of contingent beings.  He explains, "By a contingent being I mean a being which exists but which might not have existed.  Examples: mountains, planets, galaxies, you and me.  Such things might not have existed."

He then offers a version of Aquinas's argument from First Cause.  You can listen to it here (between 2:30 and 4:30) and read about it here.  Based on this argument, he concludes "The most plausible explanation of the universe is God.  Hence the existence of contingent beings makes God's existence more probable than it would have been without them."

But I think he skipped a few steps.  If E is the existence of contingent beings and G is the hypothesis that God exists, we have to evaluate both P(E | G) and P(E | NG).

If Aquinas's First Cause is Yahweh as described in the Torah, then P(E | G) = 1, and it's hard to do better than that.  But one of the standard objections to Aquinas's argument is that it provides no reason to identify the First Cause as Yahweh.

Because we don't know anything about what the First Cause is (or even whether the concept makes sense), P(E | G) is just P(E); that is, the hypothesis has no bearing on the probability of the evidence.

How about under NG?  What is the probability of the universe if there is no god?

I have no idea.

And that brings me to...

The long-awaited point of this article

Bayes's theorem is a computational tool that is only useful if the right side of the equation is easier to evaluate than the left.  In the case of Oliver's Blood, it provides a classic divide-and-conquer algorithm.  But for the God Hypothesis, all it does is replace one imponderable with another.

So I applaud Dr. Craig (Affirmative) for breaking out Bayesian probability, but I don't think it helps his case.  As for Dr. Krauss (Negative), I am still waiting to hear "what we use in science nowadays."

Wednesday, April 13, 2011

Survival analysis

xkcd and I are in sync.  In case you live in a cave, xkcd is a webcomic "of romance, sarcasm, math, and language," by Randall Munroe.  Several times in the last year, the topics that appear in xkcd have been strangely synchronized with topics I am covering in class.

It happened again this week: I was planning to write something about survival analysis and yoga when I read this:


The figure in the first frame is a survival curve.  The solid line is the survival function, S(t), which is the percentage of people who survive at least t years.   The dotted line is the hazard function, which is proportional to the fraction of deaths that occur during year t.  In this figure the death rate is low immediately after diagnosis, highest after 5 years, then decreasing.

The numbers in the second frame are survival rates: 81% survive at least 5 years and 77% survive at least 10.

To explore this topic, and to demonstrate computational techniques in survival analysis, I downloaded data from the Surveillance, Epidemiology and End Results (SEER) Program run by the National Cancer Institute, which is one of the U.S. National Institutes of Health.

SEER collects "data on patient demographics, primary tumor site, tumor morphology and stage at diagnosis, first course of treatment, and follow-up for vital status," currently covering "approximately 28 percent of the US population."

I will use data from the SEER 9 registry, which covers Atlanta, Connecticut, Detroit, Hawaii, Iowa, New Mexico, San Francisco-Oakland, Seattle-Puget Sound, and Utah.  This dataset includes cases diagnosed from 1973 through 2007.

As usual I wrote Python programs to read and parse the datasets.  For my analysis, I select cases of breast cancer where the behavior of the tumor was classified as malignant, where the patient had only one primary tumor, and where the case has been followed up.  There are 370263 records in this subset.

I consider all causes of death, not just deaths due to cancer.  This kind of analysis is most useful for prognosis, but because it includes other causes of death, it depends on patient characteristics like age, health, income, etc.  Also, survival depends strongly on the stage of the cancer.  The best way to deal with that problem is to generate curves that control for these characteristics.  For this example, I limit the analysis to 19168 patients who were in their 30s at the time of diagnosis, but I don't control for other factors.

To estimate survival curves, I use the Kaplan-Meier product-limit method, which is sufficiently obvious that I re-invented it before I knew what it was called.  The following figure shows the survival and hazard curves for this cohort:

These curves are similar to the ones in the cartoon figure, but the 5 and 10 year survival rates are lower, about 75% and 64%.  However, there are a few things to keep in mind:

1) You should not use these curves to make a prognosis for any patient.  This curve includes patients with different stages of cancer and other health factors.  Also...

2) For purposes of demonstration, I am keeping the analysis simple.  People who do this for a living do it more carefully.  And finally...

3) There is a systematic source of bias in this figure: the longest survival times are necessarily based on patients with earlier dates of diagnosis.  In this figure, the leftmost data point is based on a cohort with a mean date of diagnosis in 1992.  The rightmost data point is based on a cohort centered around 1982.  Since patient prognosis improves over time, the entire curve is too pessimistic for current patients, and it gets more pessimistic from left to right.

To see how prognoses are changing over time, I partitioned the dataset by date of diagnosis.  The following figure shows the results:

So prognoses have been improving consistently.  Part of this effect is due to improvements in treatment; part is due to earlier diagnosis.  To distinguish between these causes, we could control for stage of cancer, but I have not done that analysis.

Another reason for optimism is that survival rates generally increase after diagnosis; that is, the longer a patient survives, the longer they are likely to survive.  The following figure shows 5 and 10 year survival rates conditioned on t; that is, Pr(T > t+5 | T > t) and Pr(T > t+10 | T > t):

At the time of diagnosis, the 5-year survival rate is 75%.  But for patients who survive 5 years, the chance of surviving 10 years is 85%, and if you survive 10 years, the chance of surviving 15 is 92%.

In manufacturing, this property is called UBNE, which stands for "used better than new in expectation:" a used part is expected to last longer than a new part, because it has demonstrated its longevity.

Newborn babies, cancer patients and novice motorcyclists have the UBNE property; the longer they live, the longer they are expected to live.  So that's a reason to be optimistic.

Note: Sadly, the xkcd strip might be autobiographical.  I don't know details, but Munroe has indicated that he is dealing with an illness in his family, and several recent strips have been on medical topics.  If my inference is right, I want to extend my best wishes to Mr. Munroe and his family.


Data source: Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov) Research Data (1973-2007), National Cancer Institute, DCCPS, Surveillance Research Program, Cancer Statistics Branch, released April 2010, based on the November 2009 submission.

Update July 31, 2011: Here's an update from xkcd, including a first-person view of a survival curve: