Friday, May 1, 2015

Hypothesis testing is only mostly useless

UPDATE May 4: Nature recently reported the outcome of an attempt to replicate 100 results from psychology research. They characterize the findings as "worrying."

In my previous article I argued that classical statistical inference is only mostly wrong. About null-hypothesis significance testing (NHST), I wrote:

“If the p-value is small, you can conclude that the fourth possibility is unlikely, and infer that the other three possibilities are more likely.”

Here “the fourth possibility” referred to:

“The apparent effect might be due to chance; that is, the difference might appear in a random sample, but not in the general population.”

Several commenters chastised me; for example:

“All a p-value tells you is the probability of your data (or data more extreme) given the null is true. They say nothing about whether your data is ‘due to chance.’”

My correspondent is partly right.  The p-value is the probability of the apparent effect under the null hypothesis, which is not the same thing as the probability we should assign to the null hypothesis after seeing the data.  But they are related to each other by Bayes’s theorem, which is why I said that you can use the first to make an inference about the second.

Let me explain by showing a few examples.  I’ll start with what I think is a common scenario:  suppose you are reading a scientific paper that reports a new relationship between two variables.  There are (at least) three explanations you might consider for this apparent effect:

A:  The effect might be actual; that is, it might exist in the unobserved population as well as the observed sample.
B:  The effect might be bogus, caused by errors like sampling bias and measurement error, or by fraud.
C:  The effect might be due to chance, appearing randomly in the sample but not in the population.

If we think of these as competing hypotheses to explain the apparent effect, we can use Bayes’s theorem to update our belief in each hypothesis.

Scenario #1

As a first scenario, suppose that the apparent effect is plausible, and the p-value is low.  The following table shows what the Bayesian update might look like:

[Table: priors, likelihoods, and posteriors for the three hypotheses, with p = 0.01]

Since the apparent effect is plausible, I give it a prior probability of 70%.  I assign a prior of 20% to the hypothesis that the result is due to error or fraud, and 10% to the hypothesis that it’s due to chance.  If you don’t agree with the numbers I chose, we’ll look at some alternatives soon.  Or feel free to plug in your own.

Now we compute the likelihood of the apparent effect under each hypothesis.  If the effect is real, it is quite likely to appear in the sample, so I figure the likelihood is between 50% and 100%.   And in the presence of error or fraud, I assume the apparent effect is also quite likely.

If the effect is due to chance, we can compute the likelihood directly.  The likelihood of the data under the null hypothesis is the p-value.  As an example, suppose p=0.01.

The table shows the resulting posterior probabilities.  The hypothesis that the effect is due to chance has been almost eliminated, and the other hypotheses are marginally more likely.  The hypothesis test helps a little, by ruling out one source of error, but it doesn't help a lot, because randomness was not the source of error I was most worried about.
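Concretely, the update behind this scenario can be sketched in a few lines of Python. The priors come from the text; the 0.75 likelihoods for Actual and Bogus are assumed values picked from the 50%–100% range mentioned above, and the likelihood for Chance is the p-value:

```python
# Bayesian update for Scenario #1.
# Priors are from the text; the 0.75 likelihoods for Actual and Bogus
# are assumed values in the 50-100% range; Chance uses the p-value.
priors = {'Actual': 0.70, 'Bogus': 0.20, 'Chance': 0.10}
likelihoods = {'Actual': 0.75, 'Bogus': 0.75, 'Chance': 0.01}

unnorm = {h: priors[h] * likelihoods[h] for h in priors}
total = sum(unnorm.values())
posteriors = {h: unnorm[h] / total for h in unnorm}

for h, p in posteriors.items():
    print(h, round(p, 3))
```

With these numbers, Chance drops from 10% to well under 1%, while Actual and Bogus each rise modestly, which is the pattern described above.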

Scenario #2

In the second scenario, suppose the p-value is low again, but the apparent effect is less plausible.  In that case, I would assign a lower prior probability to Actual and a higher prior to Bogus.  I am still inclined to assign a low prior to Chance, simply because I don’t think it is the most common cause of scientific error.

[Table: priors, likelihoods, and posteriors for the three hypotheses, with p = 0.01]

The results are pretty much the same as in Scenario #1: we can be reasonably confident that the result is not due to chance.  But again, the hypothesis test does not make a lot of difference in my belief about the validity of the paper.

I believe these examples cover a large majority of real-world scenarios, and in each case my claim holds up:  If the p-value is small, you can conclude that the apparent effect is unlikely to be due to chance, and that the other possibilities (Actual and Bogus) are more likely.

Scenario #3

I admit that there are situations where this conclusion would not be valid.  For example, if the effect is implausible and you have good reason to think that error and fraud are unlikely, you might start with a larger prior belief in Chance.  In that case a p-value like 0.01 might not be sufficient to rule out Chance:

[Table: priors, likelihoods, and posteriors for the three hypotheses, with p = 0.01]

But even in this contrived scenario, the p-value has a substantial effect on our belief in the Chance hypothesis.  And a somewhat smaller p-value, like 0.001, would be sufficient to rule out Chance as a likely cause of error.
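The same sketch shows how a large prior on Chance changes the picture. The priors here (5%, 5%, 90%) and the 0.75 likelihoods are assumed values chosen only to illustrate the contrived scenario:

```python
# Scenario #3 sketch: assumed priors heavily skewed toward Chance,
# with assumed 0.75 likelihoods for Actual and Bogus as before.
priors = {'Actual': 0.05, 'Bogus': 0.05, 'Chance': 0.90}

def update(priors, p_value):
    likelihoods = {'Actual': 0.75, 'Bogus': 0.75, 'Chance': p_value}
    unnorm = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnorm.values())
    return {h: unnorm[h] / total for h in unnorm}

post_01 = update(priors, 0.01)    # Chance keeps a non-negligible posterior
post_001 = update(priors, 0.001)  # a smaller p-value nearly rules it out
```

With p = 0.01 the posterior for Chance stays above a few percent; dropping to p = 0.001 pushes it down to around one percent.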

In summary, NHST is problematic but not useless.  If you think an apparent effect might be due to chance, choosing an appropriate null hypothesis and computing a p-value is a reasonable way to check.

Wednesday, April 15, 2015

Two hour marathon by 2041 -- probably

Last fall I taught an introduction to Bayesian statistics at Olin College.  My students worked on some excellent projects, and I invited them to write up their results as guest articles for this blog.  Previous articles in the series are:

Bayesian predictions for Super Bowl XLIX
Bayesian analysis of match rates on Tinder
Bayesian survival analysis for Game of Thrones

Now, with the Boston Marathon coming up, we have an article about predicting world record marathon times.

Modeling Marathon Records With Bayesian Linear Regression

Cypress Frankenfeld

In this blog article, Allen Downey shows that marathon world record times have been advancing at a linear rate. He uses least squares estimation to calculate the parameters of the line of best fit, and predicts that the first 2-hour marathon will occur around 2041.

I extend his model and calculate how likely it is that someone will run a 2-hour marathon in 2041. Or more generally, I find the probability distribution of a certain record being hit on a certain date.

I break down my process into 4 steps:
  1. Decide on a set of prior hypotheses about the trend of marathon records.
  2. Update my priors using the world record data.
  3. Use my posterior to run a Monte Carlo simulation of world records.
  4. Use this simulation to estimate the probability distribution of dates and records.

What are my prior hypotheses?

Since we are assuming the trend in running times is linear, I write my guesses as linear functions,
yi = α + βxi + ei ,
where α and β are constants and ei is random error drawn from a normal distribution with standard deviation σ. Thus I choose my priors to be a list of combinations of α, β, and σ.

intercept_estimate, slope_estimate = LeastSquares(dates, records)
alpha_range = 0.007
beta_range = 0.2
num = 40  # grid resolution (assumed; matches the sigmas grid below)
alphas = np.linspace(intercept_estimate * (1 - alpha_range),
                     intercept_estimate * (1 + alpha_range),
                     num)
betas = np.linspace(slope_estimate * (1 - beta_range),
                    slope_estimate * (1 + beta_range),
                    num)
sigmas = np.linspace(0.001, 0.1, 40)
hypos = [(alpha, beta, sigma) for alpha in alphas
         for beta in betas for sigma in sigmas]

I base the ranges of α, β, and σ partly on the least squares estimate and partly on trial and error. I initialize my prior hypotheses to be uniformly distributed.

How do I update my priors?

I perform a Bayesian update based on a list of running records. For each running record, (date, record_mph), I calculate the likelihood using the following function.

def Likelihood(self, data, hypo):
    alpha, beta, sigma = hypo
    total_likelihood = 1
    for date, measured_mph in data:
        # error between the hypothesized line and the measured record
        hypothesized_mph = alpha + beta * date
        error = measured_mph - hypothesized_mph
        total_likelihood *= EvalNormalPdf(error, mu=0, sigma=sigma)
    return total_likelihood

I find the error between the hypothesized record and the measured record (in miles per hour), then I find the probability of seeing that error under a normal distribution with standard deviation σ.

The posterior PMFs for α, β, and σ


Here are the posterior marginal distributions of α, β, and σ after updating with all the running records. The distributions of α and β peak close to the least squares estimate. How close are they to the least squares estimate? Here I’ve plotted a line based on the maximum likelihood values of α and β, compared to the least squares estimate. Graphically, they are almost indistinguishable.
But I use these distributions to estimate more than a single line! I create a Monte Carlo simulation of running records.

How do I simulate these records?

I pull a random α, β, and σ from their PMF (weighted by probability), generate a list of records for each year from 2014 to 2060, then repeat. The results are in the figure below, plotted alongside the previous records.
It may seem counterintuitive to see simulated running speeds fall below the current records, but it makes sense! Think of those as years in which the first-place marathon time didn’t set a world record.
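A minimal version of this simulation step, written with numpy, looks like the following. The (α, β, σ) triples and their weights are made-up stand-ins for the real posterior PMF, and dates are plain year numbers:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the posterior: (alpha, beta, sigma) triples
# with weights.  Values are made up for illustration.
hypos = [(-11.38, 0.0120, 0.02), (-11.50, 0.0121, 0.03), (-11.20, 0.0119, 0.02)]
probs = [0.5, 0.3, 0.2]

def simulate(hypos, probs, years, n_runs=1000):
    """For each run, draw one hypothesis (weighted by probability)
    and generate a noisy record speed for every year."""
    runs = []
    for _ in range(n_runs):
        alpha, beta, sigma = hypos[rng.choice(len(hypos), p=probs)]
        speeds = alpha + beta * years + rng.normal(0, sigma, size=len(years))
        runs.append(speeds)
    return np.array(runs)

years = np.arange(2014, 2061)
sims = simulate(hypos, probs, years)
```

Each row of `sims` is one simulated future; stacking many of them gives the scatter of points described above.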

These simulated records are most dense where they are most likely, but the scatter plot is a poor visualization of their densities. To fix this, I merge the points into discrete buckets and count the number of points in each bucket to estimate their probabilities. This resulting joint probability distribution is what I’ve been looking for: given a date and a record, I can tell you the probability of setting that record on that date. I plot the joint distribution in the figure below.
It is difficult to create a pcolor plot with pandas Timestamp objects as the unit, so I use nanoseconds since 1970. The vertical and horizontal lines are at the year 2041 and 13.11 mph (the two-hour marathon pace). The lines fall within a dense region of the graph, which indicates that they are likely.

To identify the probability of the two-hour record being hit before the year 2042, I take the conditional distribution for the running speed of 13.11mph to obtain the PMF of years, then I compute the total probability of that running speed being hit before the year 2042.

>>> year2042 = pandas.to_datetime('2042')
>>> joint_distribution.Conditional(0, 1, 13.11).ProbLess(year2042.value)

This answers my original question. There is a 38.5% probability that someone will have run a two-hour marathon by the end of 2041.
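For readers without thinkbayes2, the same conditional computation can be approximated directly from simulated points. The data below are hypothetical stand-ins for the Monte Carlo output, with dates as plain year numbers rather than Timestamps:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated (year, speed) points standing in for the
# Monte Carlo output; the line parameters and noise level are made up.
years = rng.uniform(2014, 2061, size=100_000)
speeds = -11.382 + 0.012 * years + rng.normal(0, 0.1, size=100_000)

target = 13.11     # two-hour marathon pace in mph
half_width = 0.05  # bucket half-width around the target speed

# Conditional slice: points whose speed lands in the 13.11 mph bucket,
# then the fraction of that slice occurring before 2042.
in_bucket = np.abs(speeds - target) < half_width
prob = np.mean(years[in_bucket] < 2042)
```

Bucketing the points and slicing at a fixed speed is exactly the joint-then-conditional step described above, just done by hand.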

To get a bigger picture, I create a similar conditional PMF for each running speed bucket, convert it to a CDF, and extract the years corresponding to several percentiles. The 95th percentile means there is a 95% probability of reaching that particular running speed by that year. I plot these percentiles to get a better sense of how likely any record is to be broken by any date.

Again, the black horizontal and vertical lines correspond to the running speed for a two-hour marathon and the year 2041. You can see that they fall between 25% and 50% probability (as we already know). By looking at this plot, we can also say that the two-hour marathon will be broken by the year 2050 with about a 95% probability.

Wednesday, March 25, 2015

Bayesian survival analysis for "Game of Thrones"

Last fall I taught an introduction to Bayesian statistics at Olin College.  My students worked on some excellent projects, and I invited them to write up their results as guest articles for this blog.

One of the teams applied Bayesian survival analysis to the characters in A Song of Ice and Fire, the book series by George R. R. Martin.  Using data from the first 5 books, they generate predictions for which characters are likely to survive and which might die in the forthcoming books.  

With Season 5 of the Game of Thrones television series starting on April 12, we thought this would be a good time to publish their report.

Bayesian Survival Analysis in A Song of Ice and Fire

Erin Pierce and Ben Kahle

The Song of Ice and Fire series has a reputation for being quite deadly. No character, good or bad, major or minor, is safe from Martin’s pen. The reputation is not unwarranted; of the 916 named characters who populate Martin’s world, a third have died, alongside uncounted nameless ones.

In this report, we take a closer look at the patterns of death in the novels and create a Bayesian model that predicts the probability that characters will survive the next two books.

Using data from A Wiki of Ice and Fire, we created a dataset of all 916 characters that have appeared in the books so far. For every character, we know in which chapter and book they first appeared, whether they are male or female, whether they are part of the nobility, which major house they are loyal to, and, if applicable, the chapter and book of their death.  We used this data to predict which characters will survive the next couple of books.


We extrapolated the survival probabilities of the characters through the seventh book using Weibull distributions. A Weibull distribution provides a way to model the hazard function, which measures the probability of death at a specific age. The Weibull distribution depends on two parameters, k and lambda, which control its shape.
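As a sketch of the model (the report's code uses scipy's exponweib; the standard two-parameter forms are shown here), the Weibull survival and hazard functions are:

```python
import numpy as np

def weibull_survival(t, k, lam):
    """S(t): probability of still being alive at age t
    under a Weibull model with shape k and scale lam."""
    return np.exp(-(t / lam) ** k)

def weibull_hazard(t, k, lam):
    """h(t): instantaneous death rate at age t.  k < 1 means the
    risk of death falls with age; k > 1 means it rises."""
    return (k / lam) * (t / lam) ** (k - 1)
```

With k = 1 the hazard is constant (death strikes new and old characters alike); the update below infers which shape fits Martin's world.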

To estimate these parameters, we start with a uniform prior.  For each living character, we check how well a given k and lambda predict the fact that the character is still alive, using the Weibull survival function evaluated at the character’s age. For each dead character, we check how well the parameters predict the time of death, using the Weibull density at the age of death.

The main code used to update these distributions is:

class GOT(thinkbayes2.Suite, thinkbayes2.Joint):

    def Likelihood(self, data, hypo):
        """Determines how well a given k and lam
        predict the life/death of a character."""
        age, alive = data
        k, lam = hypo
        if alive:
            # still alive at this age: use the survival function, 1 - CDF
            prob = 1 - exponweib.cdf(age, k, lam)
        else:
            # died at this age: use the density at the time of death
            prob = exponweib.pdf(age, k, lam)
        return prob

def Update(k, lam, age, alive):
    """Performs the Bayesian update and returns the PMFs
    of k and lam."""
    joint = thinkbayes2.MakeJoint(k, lam)
    suite = GOT(joint)
    suite.Update((age, alive))
    k = suite.Marginal(0, label=k.label)
    lam = suite.Marginal(1, label=lam.label)
    return k, lam

def MakeDistr(introductions, lifetimes, k, lam):
    """Iterates through all the characters, updating
    k and lambda with each lifetime."""
    k.label = 'K'
    lam.label = 'Lam'
    print('Updating deaths')
    for age in lifetimes:
        k, lam = Update(k, lam, age, False)
    print('Updating alives')
    for age in introductions:
        k, lam = Update(k, lam, age, True)
    return k, lam

For the Night’s Watch, this led to the posterior distribution in Figure 3.

Figure 3: The distribution for lambda is quite tight, around 0.27, but the distribution for k is broader.

To translate this back to a survival curve, we took the mean of k and lambda, as well as the 90 percent credible interval for each parameter. We then plotted the original data, the credible interval, and the survival curve based on the posterior means.
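Extracting the mean and credible interval from a discrete posterior PMF can be sketched as follows (thinkbayes2 has built-ins for this; the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical posterior marginal for k: value -> probability.
k_pmf = {0.8: 0.2, 1.0: 0.5, 1.2: 0.3}

def pmf_mean(pmf):
    """Expected value of a discrete PMF."""
    return sum(v * p for v, p in pmf.items())

def credible_interval(pmf, pct=90):
    """Central credible interval covering pct percent of the mass."""
    tail = (100 - pct) / 200
    vals = sorted(pmf)
    cum = np.cumsum([pmf[v] for v in vals])
    lo = vals[int(np.searchsorted(cum, tail))]
    hi = vals[int(np.searchsorted(cum, 1 - tail))]
    return lo, hi
```

Doing this for k and lambda gives the band of plausible survival curves plotted around the mean-value curve.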

Jon Snow

Using this analysis, we can begin to make a prediction for an individual character like Jon Snow.  At the end of A Dance with Dragons, the credible interval for Night’s Watch survival (Figure 4) stretches from 36 percent to 56 percent. The odds are not exactly rosy that Jon Snow is still alive. Even if Jon is still alive at the end of book 5, the odds that he will survive the next two books drop to between 30 percent and 51 percent.

Figure 4: The credible interval closely encases the data, and the mean-value curve appears to be a reasonable approximation.

However, it is worth considering that Jon is not an average member of the Night’s Watch. He had a noble upbringing and is well trained at arms. We repeated the same analysis with only members of the Night’s Watch considered noble due to their family, rank, or upbringing.

There have only been 11 nobles in the Night’s Watch, so the credible interval seen in Figure 5 is understandably much wider. However, the best approximation of the survival curve suggests that a noble background does not increase the survival rate for brothers of the Night’s Watch.

Figure 5: When only noble members of the Night’s Watch are included, the credible interval widens significantly and the lower bound gets quite close to zero.

The Houses of ASOIAF

We computed 90 percent credible intervals for all of the major houses: the 9 major houses, the Night’s Watch, the Wildlings, and a "None" category for non-allied characters.

Figure 6: 90 percent credible interval for Arryn (Blue), Lannister (Gold), None (Green), and Stark (Grey)

Figure 7: 90 percent credible interval for Tyrell (Green), Tully (Blue), Baratheon (Orange), and Night’s Watch (Grey)

Figure 8: 90 percent credible interval for Martell (Orange), Targaryen (Maroon), Greyjoy (Yellow), and Wildling (Purple)

These intervals, shown in Figures 6, 7, and 8, demonstrate a much higher survival probability for the houses Arryn, Tyrell, and Martell. Supporting these results, these houses have stayed out of most of the major conflicts in the books; however, this also means there is less information about them. We have 5 or fewer examples of dead members for those houses, so the survival curves don’t have very many points. This uncertainty is reflected in the wide credible intervals.

In contrast, our friends in the north, the Starks, Night’s Watch, and Wildlings have the lowest projected survival rates and smaller credible intervals given their warring positions in the story and the many important characters included amongst their ranks. This analysis considers entire houses, but there are also additional ways to sort the characters.

Men and women

While A Song of Ice and Fire has been lauded for portraying women as complex characters who take on a variety of roles, there are still many more male characters (769) than female ones (157). Despite a wider credible interval, the women tend to fare better than their male counterparts, out-surviving them by a wide margin as seen in Figure 9.

Figure 9: The women of Westeros appear to have a better chance of surviving than the men.


The ratio between noble characters (429) and smallfolk characters (487) is much more even than the gender ratio, and provides an interesting comparison for analysis. Figure 10 suggests that while more smallfolk tend to die quickly after being introduced, those that survive their introductions tend to live for a longer period of time and may in fact outpace the nobles.

Figure 10: The nobility might have a slight advantage when introduced, but their survival probability continues to fall while the smallfolk’s levels off much more quickly.

Selected Characters

The same analysis can be extended to combine traits, sorting by gender, house, and class to provide a rough model for individual characters. One of the most popular characters in the books is Arya, and many readers are curious about her fate in the books to come. The category of noblewomen loyal to the Starks also includes other noteworthy characters like Sansa and Brienne of Tarth (though she was introduced later). Other intriguing characters to investigate are the Lannister noblewomen Cersei and poor Myrcella. As it turns out, not a lot of noblewomen die. In order to get more precise credible intervals for the specific female characters, we included data on both noble and smallfolk women.

Figure 11: While both groups have very wide ranges of survival probabilities, the Lannister noblewomen may be a bit more likely to die than the Starks.

The data presented in Figure 11 is inconclusive, but it looks like Arya has a slightly better chance of survival than Cersei.

Two minor characters we are curious about are Val, the wildling princess, and the mysterious Quaithe.

Figure 12: Representing the survival curves of more minor characters, Quaithe and Val have dramatically different odds of surviving the series.

They both had more data than the Starks and Lannisters, but they have the complication that they were not introduced at the beginning of the series. Val is introduced at 2.1 books, and so her chances of surviving the whole series are between 10 percent and 53 percent, which are not the most inspiring of chances.

Quaithe is introduced at 1.2 books, and her chances are between 58 percent and 85 percent, which are significantly better than Val’s. These curves are shown in Figure 12.

For most of the male characters (with the exception of Mance), there was enough data to narrow down by house, gender, and class.

Figure 13: The survival curves of different classes and alliances of men shown through various characters.

Figure 13 shows the Lannister brothers with middling survival chances ranging from 35 percent to 79 percent. The data for Daario is less conclusive, but seems hopeful, especially considering he was introduced at 2.5 books. Mance seems to have the worst chance of surviving until the end. He was introduced at 2.2 books, giving him a chance of survival between 19 percent and 56 percent.

Figure 14: The survival curves of different classes and alliances of men shown through various characters.

Some characters who many wouldn’t mind seeing kick the bucket include Lord Walder Frey and Theon Greyjoy. However, Figure 14 suggests that neither is likely to meet an untimely (or in Walder Frey’s case, very timely) death. Theon seems likely to survive to the bitter end. Walder Frey was introduced at 0.4 books, putting his chances at 44 percent to 72 percent. As it is now, Hoster Tully may be the only character to die of old age, so perhaps Frey will hold out until the end.


Of course, who lives and who dies in the next two books has more to do with plot and storyline than with statistics. Nonetheless, using our data we were able to see patterns of life and death among groups of characters. For some characters, especially males, we are able to make specific predictions of how they will fare in the next novels.  For female characters and characters from the less central houses, the verdict is still out.

Our data and code are available from this GitHub repository.

Notes on the Data Set

Most characters were fairly easy to classify, but there are always edge cases.
  1. Gender - This was the most straightforward. There are not really any gender-ambiguous characters.
  2. Nobility - Members of major and minor Westeros houses were counted as noble, but hedge knights were not. For characters from Essos, I used my best judgment based on money and power, and it was usually an easy call. For the wildlings, I named military leaders as noble, though that was often a blurry line. For members of the Night’s Watch, I looked at their status before joining in the same way I looked at other Westeros characters. For bastards, we decided on a case by case basis. Bastards who were raised in a noble family and who received the education and training of nobles were counted as noble. Thus Jon Snow was counted as noble, but someone like Gendry was not.
  3. Death - Characters that have come back alive-ish (like Beric Dondarrion) were judged dead at the time of their first death. Wights are not considered alive, but others are. For major characters whose deaths are uncertain, we argued and made a case by case decision.
  4. Houses - This was the trickiest one because some people have allegiances to multiple houses or have switched loyalties. We decided on a case by case basis. The people with no allegiance were of three main groups:
    • People in Essos who are not loyal to the Targaryens.
    • People in the Riverlands, either smallfolk whose loyalty is not known, or groups like the Brotherhood Without Banners or the Brave Companions with ambiguous loyalty.
    • Nobility that are mostly looking out for their own interests, like the Freys, Ramsay Bolton, or Petyr Baelish.