## Wednesday, June 8, 2011

### A hierarchical Bayesian model of pond scum

This week I am working with one of my colleagues, the extraordinary biologist Jean Huang, on an interesting problem related to bioinformatics.  This project is my first attempt at implementing a hierarchical Bayesian model in Python, and it has been a pleasant surprise.  If you are familiar with tree-processing algorithms, it's pretty straightforward.  First the likelihoods propagate up the tree, then the updates propagate down.

Here is the background: Prof Huang and her students collect samples of pond water and culture the bacteria they find under narrow-spectrum artificial light, trying to find species that photosynthesize at different wavelengths.

To identity the species in each culture, they amplify the 16S ribosomal RNA gene, send it out to be sequenced, and then look up the sequence in a database like greengenes.  The result is either the name of a species that has been cultured, the genus of an uncultured species, or an unknown genus.

They usually start by identifying 15 samples from each culture.  For example, one of their cultures yielded 9 samples of one species, three of another, and one each of three more.

Looking at the results for a given culture, biologists would like to be able to answer these questions:

1) How many more species are there likely to be?
2) What is the likely prevalence of each species?
3) If we test m additional samples, how many new species are we likely to find?
4) Given a limited budget, which cultures warrant additional sampling?

To answer these questions, I implemented a hierarchical Bayesian model where the top level is "How many species are there?" and the second level is "Given that there are n species, what is the prevalence of each species?"  The prevalence of a species is its proportion of the population.

Knowing how many species there are helps model their prevalences.  In the simplest case, if we know that there is only 1 species, the prevalence of that species must be 100%.  If we know there are 2 species, I assume that the distribution of prevalences for both species is uniform from 0 to 100, so the expected prevalence for both is 50%, which makes sense.

If we know there are 3 species, it is less obvious what the prior distribution should be; the Beta distribution is a natural choice because

1) If we know there are k species, we can use Beta(1, k-1) as a prior, and the expected value is 1/k, which makes sense.  For k=2, the result is a uniform distribution, so that makes sense, too.

2) For this scenario, the Beta distribution is a conjugate prior, which means that after an update the posterior is also a Beta distribution.

We can represent a Beta distribution with an object that keeps track of the parameters:

class Beta:
def __init__(self, yes, no):
"""Initializes a Beta distribution.
yes and no are the number of
successful/unsuccessful trials.
"""
self.alpha = yes+1
self.beta = no+1

Update couldn't be easier.

def Update(self, yes, no):
self.alpha += yes
self.beta += no

At the top level ("How many species are there?") the prior is a uniform distribution from 1 to 20, at least for now.   I could replace that with a distribution that reflects more of the domain knowledge biologists have about the experiment; for example, based on how the cultures are processed, it would be rare to find more than 10 species with any substantial prevalence.

Now that the priors are in place, how do we update?  Here's the top level:

def Update(self, evidence):
"""Updates based on observing a given taxon."""
for hypo, prob in self.hypos.Items():
likelihood = hypo.Likelihood(evidence)
if likelihood:
self.hypos.Mult(hypo, likelihood)
else:
# if a hypothesis has been ruled out, remove it
self.hypos.Remove(hypo)

self.hypos.Normalize()

The evidence is a string that indicates which taxon (species or genus) was observed.  self.hypos is a Pmf that maps from a hypo to its probability (Pmf is provided by the thinkstats package, a collection of modules used in my book, Think Stats).  Each hypo is a hypothesis about how many taxons there are.  We ask each hypo to compute the likelihood of the evidence, then update hypos accordingly.  If we see N different taxons, all hypotheses with k<N are eliminated.

Having updated the top level, we pass the evidence down to the lower level and update each hypothesis.

# update the hypotheses
for hypo, prob in self.hypos.Items():
hypo.Update(evidence)

To compute the likelihood of the evidence under a hypothesis, we ask "What is the probability of observing this taxon, given our current belief about the prevalence of each taxon?"  There are two cases:

1) If the observed taxon is one we have seen before, we just look up the current Beta distribution and compute its mean:

def Mean(self):
"""Computes the mean of a Beta distribution."""
return self.alpha / (self.alpha + self.beta)

2) If we are seeing a taxon for the first time, we get the likelihoods for all unseen taxons and add them up.

To update the lower level distributions, we loop through the Beta distributions that represent the prevalences, and update them with either a hit or a miss.

def Update(self, evidence):
"""Updates each Beta dist with new evidence.

evidence is a taxon
"""
for taxon, dist in self.taxa.iteritems():
if taxon == evidence:
dist.Update(1, 0)
else:
dist.Update(0, 1)

And that's it.  After doing a series of updates, the top level is the posterior distribution of the number of taxons, and each hypothesis at the lower level is a set of Beta distributions that represent the prevalences.

Let's look at an example.  Suppose we see 11 of one taxon, 4 of another, and 2 of a third.  The posterior distribution of the number of taxons is:

The probability is 47% that there are only 3 taxons, 28% that there are 4, and 13% that there are 5.

By summing over the hypotheses, we can generate distributions for the prevalence of each taxon:

The median prevalence of taxon A is 58%; the credible interval is (39%, 75%).  For taxons B and C the medians are 23% and 13%.  For taxons D, E, and F, there is a high probability that the prevalence is 0, because these taxons have not been observed.

To predict the number of new taxons we might find by sequencing additional samples, I use Monte Carlo simulation.  Here is the kernel of the algorithm:

for i in range(m):
taxon = self.GenerateTaxon()
taxons.append(taxon)
meta.Update(taxon)

curve = MakeCurve(taxons)

m is the number of samples to simulate.  Each time through the loop, we generate a random taxon and then update the hierarchy.  The result is a rarefaction curve, which is the number of taxons we observe as a function of the number of samples.

To generate a random taxon, we take advantage of the recursive structure of the hierarchy; that is, we ask each hypothesis to choose a random taxon, and then choose among them in accordance with the probability for each hypothesis:

def GenerateTaxon(self):
"""Chooses a random taxon."""
pmf = Pmf.Pmf()
for hypo, prob in sorted(self.hypos.Items()):
taxon = hypo.GenerateTaxon()
pmf.Incr(taxon, prob)
return pmf.Random()

The algorithm the hypotheses use to generate taxons is ugly, so I will spare you the details, but ultimately it depends on the ability to generate Beta variates, which is provided in Python's random module.

Given a sequence of taxons (real or simulated), it is easy to generate a rarefaction curve:

def MakeCurve(sample):
"""Makes a rarefaction curve for the given sample."""
s = set()
curve = []
for i, taxon in enumerate(sample):
curve.append((i+1, len(s)))
return curve

To show what additional samples might yield, I generate 100 rarefaction curves and plot them:

The first 17 points are random permutations of the observed data, so after 17 samples, we have seen 3 taxa on every curve.  The last 15 points are based on simulated data.  The lines are shifted slightly so they don't overlap; that way you can eyeball high-probability paths.

Even after m=15 additional samples, there is a good chance that we don't see any more taxons -- but there is a reasonable chance of seeing 1-2, and a small chance of seeing 3.

Finally, we estimate the probability of seeing additional taxons as a function of the number of addition samples:

With 5 additional samples, the chance of seeing at least one additional taxon is 24%; with 10 samples it's 40%, and with 15 it's 53%.

These estimates should help biologists allocate their budget for additional samples is order to minimize the chance of missing an important low-prevalence taxon.

Next steps: I am working with Prof. Huang and her students to report the results most usefully; then we are thinking about distributing the code or deploying it as a web app.

## Thursday, June 2, 2011

### More hypotheses, less trivia

In my last post I wrote about hypothesis testing with simulation, and one of the readers at reddit/r/statistics wrote this comment:
Is there a book or similar reference [with] examples of building such simulations? I never have enough confidence to run such simulations and I guess that is mainly because I have only seen them for trivial problems. A book or a tutorial with a number of problems with various complexity would help a lot.
My answer to the first part (is there a good book?) is "I don't know."  My answer to the second part (non-trivial examples) is this post with some less trivial examples.

Difference in means

If you think there is a difference between two groups, the most common test is for a difference in the means.  In Think Stats, I apply this test to pregnancy length and birth weight for first babies and others.

In this case the test statistic is obvious: compute the mean for each group and subtract.  The only hard part is deciding whether to do a one-sided test or a two-sided test.  It depends on what hypothesis you are testing.  If you think men are taller than women, or that the new drug is better than the old drug, then a one-sided test is appropriate.  If you think there is a difference between the groups but you don't know what it is, you can use a two-sided test; that is, use the absolute value of the difference as the test statistic.

This decision can seem arbitrary, but don't worry.  The effect on the p-value is just a factor of two, and (as I argued last time) we only care about the order of magnitude.  Whether the p-value is 2% or 4% or 8% doesn't really matter.

The null hypothesis is that the two groups are actually the same.  There are several ways to implement this model:

Parametric: Merge the groups.  If the pooled distribution is well modeled by an analytic distribution, estimate the parameters (for example, the mean and variance of a normal distribution) and then generate random samples from the fitted distribution.

Resampling: Again, merge the groups, and then draw samples (with replacement) from the pooled empirical distribution.

Permutation: Or shuffle the sample and assign the elements to the groups at random.  This is equivalent to resampling without replacement.

In this example, the difference in birthweight is significant (first babies are lighter) and the difference in pregnancy length is not.  But it turns out that first babies are more likely to be early and late (and less likely to be on time).  To test that hypothesis, I used the procedure I described last time for the casino problem.

Slope of a fitted line

In this post I fit a line to a time series and compute the slope.  The test statistic is the estimated slope, so that part is easy.

The null hypothesis is that the apparent slope is due to chance, so I used a permutation model; that is, I shuffled the time series 1000 times and computed the slope each time.

Since my hypothesis is specifically that the slope is positive, I could have used a one-sided test, but to be conservative I reported a two sided p-value:
The red line shows the linear least squares fit, with slope 0.033; the p-value (chance of seeing an absolute slope as big as that) is 0.006, so you can either reject the null hypothesis or update your subjective degree of belief accordingly.
Notice that I didn't say anything about how I computed the p-value.  Depending on where you publish, you might or might not have to be more specific.

Tail curvature

In this paper, I was characterizing the tail behavior of long-tailed distributions and trying to quantify how well Pareto and lognormal models captured that behavior.

If you plot the complementary CDF of a Pareto distribution on a log-log scale, you get a straight line (example here); for a lognormal distribution, the line drops off with increasing slope.  To distinguish these behaviors, I wanted a robust measure of tail curvature, so I computed the first derivative numerically and the estimated the slope.  Since slope of a derivative is the second derivative, I called the result "tail curvature" and used it as the test statistic.

In this example I had two hypotheses (HP for Pareto and HL for lognormal) and no null hypothesis.  To compute P(E | HP) I chose a Pareto distribution that was a good fit for the tail of the observed distribution, then generated random samples, and computed tail curvature.  Likewise for P(E | HL).

Here's how I reported the results:
For these cases, the curvature test provides some insight. The tail curvature of the Calgary dataset is 0.141, which has a negligible p-value under the Pareto model. This means that we can reject the hypothesis that the data are a sample from a Pareto distribution. The tail curvature of the ClarkNet dataset is 0.023, which has p > 0.5, which means we cannot reject the hypothesis that the sample comes from a long-tailed distribution.
I am not crazy about the "reject the null hypothesis" way of talking, but it was appropriate for the venue.

ANOVA

If you collect a sample of adult human heights in centimeters, the variance might be 94 cm^2.  If you divide the sample into men and women, the variance in each group is smaller (49 and 44 cm^2), which suggests that there is a relationship between height and gender.  To test whether that is significant, you can divide the sample at random and see how often the variance drops as much.  That's an analysis of variance (ANOVA) test.

The nice thing about ANOVA is that is generalizes easily to more than two groups.

To summarize, the test statistic is the decrease in variance when you divide the sample into groups, also known as "assigning labels."  The null hypothesis is that the mean is the same for all groups, so the simulation is to assign the labels at random.

Time of day effect?

Here's a recent post from reddit/r/statistics:
A reviewer wants me to test if various behavioural assays I performed were influenced by the time of day I performed them (i.e. biased).The assays have either binary or continuous responses. What is the best approach to this? It is as simple as fitting a GLM, response~time, and testing for a non-zero slope? That seems overly simplistic to me. I've also tried GAMs with gaussian or binomial error functions, but it doesn't seem to be doing what I expect it to do...Any thoughts would be greatly appreciated. I'm sure its stupidly simple, but I'm having trouble wrapping my head around the problem!
At first I suggested a permutation test to check whether time of day had a substantial effect.  But the submitter added these details:
We assayed the behaviour of six species of spiders. The purpose of the study was to compare the behaviour of some species to others. For various reasons, we could not balance the time of day at which we assayed the behaviours for each species (the wonders of field biology...). It may be the case that these spiders have some sort of daily behavioural cycle, e.g. more active at night (however it could be the opposite, or really any function of time; we have no idea). I think the reviewer's concern is that this could influence the comparisons, e.g. if we assayed species 1 more often at night than species 2?
I suppose the explicit hypothesis I am interested in testing is that time of day has a significant effect on behavioural response, but I am unsure how to construct a model to test this, given that I have no idea if the relationship will be linear or some complex function.
So there are several questions on the table here:

1) Are the observed differences between the species significant?

I assume the answer is yes, or the submitter would not be trying to publish (see publication bias).

2) Is there a time of day effect?

One option is to divide the day into intervals, maybe 6 periods of 4 hours.  Then for the continuous response variable, you could use an ANOVA test.  For the binary variable you could use a logistic regression with one dummy variable for each interval.

Or if the question is whether there are any times of day that differ from the others, then an appropriate test statistic would be the maximum difference between any interval and the overall average (the next case study talks about this option in more detail).

3) If there is a time of day effect, can we control for it?

If it turns out that there is a time of day effect, it may or may not be possible to control for it.  For example, if you only observed Species 1 in the morning and Species 2 in the afternoon, then there is no way to separate the effect of the two variables.  In that case, the problem is experimental design, not statistics.

As a quick check, you could make a table with one row per species and one column per time interval, and count how many observations fall in each cell.  If there are a lot of empty (or nearly empty) cells, you are in trouble.

If you make the intervals bigger, you will have more observations per cell, but you might obscure an actual time of day effect.

4) If we control for time of day, are the differences between the species still significant?

If you get this far, you could do ANOVA or regression with dummy variables for each species and for each interval.  This is what some people suggested on reddit (if I understood correctly), but I would suggest spending some time on steps (2) and (3) before diving into (4).

Testing pair-wise differences

Here's another recent post from reddit/r/statistics:

My data can be divided into 4 samples. I wanted to test whether the distribution of a binary variable differs significantly between the samples, so I did 6 t-tests - one for each pair of samples.
However, I also need to analyze how another categorical variable differs between the samples, this time with 3 possible values.
Is it okay to compute new t-tests, each time for a pair of values and a pair of samples, for a total of 6*3 tests? The problem is that this is too detailed: I don't want to test each pair of values, only each pair of samples.
SAS also gives me the option (in proc freq) of calculating a chi-square test for all 4 samples, but this is not enough information: it only tells me that not all of the means are equal.
And then there's a one-way ANOVA test, which can be used to compare the means of >2 samples, but apparently can't be used for categorical variables with >2 values. Or can it?
How would you analyze this data?
If you run lots of tests, some of them are likely to be "significant" even if there is no real effect going on.  The problem is demonstrated in this xkcd cartoon.  So in general I think it is a good idea to apply one big test to the entire dataset, not lots of little tests.

In this case, the question seems to be whether there are significant differences between any pair of groups.  So an appropriate test statistic would be the maximum difference between any pair.

The null hypothesis is that the groups are the same, so an easy implementation is to divide the sample into groups at random.

For categorical variables, you can think of the distribution within each group as a vector and use a distance function to compute the difference between groups.  If the groups are roughly the same size, you could use a 1-norm or a 2-norm.  If not, you should probably normalize the vectors to length 1 before computing norms.

If you are not really interested in pair-wise differences, but you want to know whether any group deviates significantly from the expected, you could pool all the groups to get the expected distribution of the categorical variable, then compute a chi-squared statistic for each group (how much it deviates from the expected).  Again the test statistic would be the maximum of the chi-square statistic.

The important thing here is that the choice of test statistic depends on the question you are asking.  By using the maximum difference, you are effectively asking, "What is the probability of finding any group with this much deviance, just by chance?"

Running a series of tests

Here's one more question from reddit/r/statistics
I'm running a series of chi-square analyses and for several of them, I'm getting a message which states that "# cells have an expected count less than 5". If the results are significant, are they still worth reporting in spite of this message? I'm comparing participant selections of one of three alternatives when assigned to one of four conditions, examining pairs of comparisons.
The problem here is that the analytic version of the chi-squared test uses an approximation that is only accurate if the sample size is large enough; otherwise you have to use a different test.

But this is a perfect example of what I talked about in my last post: if you spend all your time in statistics class learning one damn test after another, and your entire cognitive capacity is consumed trying to remember when each of them is applicable, you have no time and no available brain cells to do the real work, which is thinking carefully about the what you are doing.  [Note: I mean this as a general warning, not a criticism of the submitter.]

In this case the biggest hazard is not the sample size; it's "running a series of chi-square analyses."  When I hear "series of tests," alarm bells go off in my head.  If you don't know why, please review this xkcd cartoon.

So for this question, I have two suggestions:

1) If you use simulation, you don't have to worry about the expected count in each cell.  If the sample size is small, that tends to make the p-value bigger, but that happens without any additional effort.

2) If you are running a series of tests, you should either (a) adjust the significance criterion using something like the Holm–Bonferroni method, or (b) treat the whole series as a single test, using the maximum of the chi-square values as a test statistic.  I think (b) is simpler, and does a better job of capturing what you are trying to test.

That's all for now!