Here is the background: Prof Huang and her students collect samples of pond water and culture the bacteria they find under narrow-spectrum artificial light, trying to find species that photosynthesize at different wavelengths.
To identity the species in each culture, they amplify the 16S ribosomal RNA gene, send it out to be sequenced, and then look up the sequence in a database like greengenes. The result is either the name of a species that has been cultured, the genus of an uncultured species, or an unknown genus.
They usually start by identifying 15 samples from each culture. For example, one of their cultures yielded 9 samples of one species, three of another, and one each of three more.
Looking at the results for a given culture, biologists would like to be able to answer these questions:
1) How many more species are there likely to be?
2) What is the likely prevalence of each species?
3) If we test m additional samples, how many new species are we likely to find?
4) Given a limited budget, which cultures warrant additional sampling?
To answer these questions, I implemented a hierarchical Bayesian model where the top level is "How many species are there?" and the second level is "Given that there are n species, what is the prevalence of each species?" The prevalence of a species is its proportion of the population.
Knowing how many species there are helps model their prevalences. In the simplest case, if we know that there is only 1 species, the prevalence of that species must be 100%. If we know there are 2 species, I assume that the distribution of prevalences for both species is uniform from 0 to 100, so the expected prevalence for both is 50%, which makes sense.
If we know there are 3 species, it is less obvious what the prior distribution should be; the Beta distribution is a natural choice because
1) If we know there are k species, we can use Beta(1, k-1) as a prior, and the expected value is 1/k, which makes sense. For k=2, the result is a uniform distribution, so that makes sense, too.
2) For this scenario, the Beta distribution is a conjugate prior, which means that after an update the posterior is also a Beta distribution.
We can represent a Beta distribution with an object that keeps track of the parameters:
def __init__(self, yes, no):
"""Initializes a Beta distribution.
yes and no are the number of
self.alpha = yes+1
self.beta = no+1
Update couldn't be easier.
def Update(self, yes, no):
"""Updates a Beta distribution."""
self.alpha += yes
self.beta += no
At the top level ("How many species are there?") the prior is a uniform distribution from 1 to 20, at least for now. I could replace that with a distribution that reflects more of the domain knowledge biologists have about the experiment; for example, based on how the cultures are processed, it would be rare to find more than 10 species with any substantial prevalence.
Now that the priors are in place, how do we update? Here's the top level:
def Update(self, evidence):
"""Updates based on observing a given taxon."""
for hypo, prob in self.hypos.Items():
likelihood = hypo.Likelihood(evidence)
# if a hypothesis has been ruled out, remove it
The evidence is a string that indicates which taxon (species or genus) was observed. self.hypos is a Pmf that maps from a hypo to its probability (Pmf is provided by the thinkstats package, a collection of modules used in my book, Think Stats). Each hypo is a hypothesis about how many taxons there are. We ask each hypo to compute the likelihood of the evidence, then update hypos accordingly. If we see N different taxons, all hypotheses with k<N are eliminated.
Having updated the top level, we pass the evidence down to the lower level and update each hypothesis.
# update the hypotheses
for hypo, prob in self.hypos.Items():
To compute the likelihood of the evidence under a hypothesis, we ask "What is the probability of observing this taxon, given our current belief about the prevalence of each taxon?" There are two cases:
1) If the observed taxon is one we have seen before, we just look up the current Beta distribution and compute its mean:
"""Computes the mean of a Beta distribution."""
return self.alpha / (self.alpha + self.beta)
2) If we are seeing a taxon for the first time, we get the likelihoods for all unseen taxons and add them up.
To update the lower level distributions, we loop through the Beta distributions that represent the prevalences, and update them with either a hit or a miss.
def Update(self, evidence):
"""Updates each Beta dist with new evidence.
evidence is a taxon
for taxon, dist in self.taxa.iteritems():
if taxon == evidence:
And that's it. After doing a series of updates, the top level is the posterior distribution of the number of taxons, and each hypothesis at the lower level is a set of Beta distributions that represent the prevalences.
Let's look at an example. Suppose we see 11 of one taxon, 4 of another, and 2 of a third. The posterior distribution of the number of taxons is:
The probability is 47% that there are only 3 taxons, 28% that there are 4, and 13% that there are 5.
By summing over the hypotheses, we can generate distributions for the prevalence of each taxon:
The median prevalence of taxon A is 58%; the credible interval is (39%, 75%). For taxons B and C the medians are 23% and 13%. For taxons D, E, and F, there is a high probability that the prevalence is 0, because these taxons have not been observed.
To predict the number of new taxons we might find by sequencing additional samples, I use Monte Carlo simulation. Here is the kernel of the algorithm:
for i in range(m):
taxon = self.GenerateTaxon()
curve = MakeCurve(taxons)
m is the number of samples to simulate. Each time through the loop, we generate a random taxon and then update the hierarchy. The result is a rarefaction curve, which is the number of taxons we observe as a function of the number of samples.
To generate a random taxon, we take advantage of the recursive structure of the hierarchy; that is, we ask each hypothesis to choose a random taxon, and then choose among them in accordance with the probability for each hypothesis:
"""Chooses a random taxon."""
pmf = Pmf.Pmf()
for hypo, prob in sorted(self.hypos.Items()):
taxon = hypo.GenerateTaxon()
The algorithm the hypotheses use to generate taxons is ugly, so I will spare you the details, but ultimately it depends on the ability to generate Beta variates, which is provided in Python's random module.
Given a sequence of taxons (real or simulated), it is easy to generate a rarefaction curve:
"""Makes a rarefaction curve for the given sample."""
s = set()
curve = 
for i, taxon in enumerate(sample):
To show what additional samples might yield, I generate 100 rarefaction curves and plot them:
The first 17 points are random permutations of the observed data, so after 17 samples, we have seen 3 taxa on every curve. The last 15 points are based on simulated data. The lines are shifted slightly so they don't overlap; that way you can eyeball high-probability paths.
Even after m=15 additional samples, there is a good chance that we don't see any more taxons -- but there is a reasonable chance of seeing 1-2, and a small chance of seeing 3.
Finally, we estimate the probability of seeing additional taxons as a function of the number of addition samples:
With 5 additional samples, the chance of seeing at least one additional taxon is 24%; with 10 samples it's 40%, and with 15 it's 53%.
These estimates should help biologists allocate their budget for additional samples is order to minimize the chance of missing an important low-prevalence taxon.
Next steps: I am working with Prof. Huang and her students to report the results most usefully; then we are thinking about distributing the code or deploying it as a web app.