This example is from *Think Bayes: Bayesian Statistics Made Simple*, the book I am working on now. You can read the entire current draft at http://thinkbayes.com.

### Belly button bacteria

Belly Button Biodiversity 2.0 (BBB2) is a nation-wide citizen science project with the goal of identifying bacterial species that can be found in human navels (`http://bbdata.yourwildlife.org`).

The project might seem whimsical, but it is part of an increasing interest in the human microbiome, the set of microorganisms that live on human skin and other surfaces that contact the environment.

In their pilot study, BBB2 researchers collected swabs from the navels of 60 volunteers, used multiplex pyrosequencing to extract and sequence fragments of 16S rDNA, then identified the species or genus the fragments came from. Each identified fragment is called a “read.”

We can use these data to answer several related questions:

- Based on the number of species observed, can we estimate the total number of species in the environment?
- Can we estimate the prevalence of each species; that is, the fraction of the total population belonging to each species?
- If we are planning to collect additional samples, can we predict how many new species we are likely to discover?
- How many additional reads are needed to increase the fraction of observed species to a given threshold?

### Lions and tigers and bears

I’ll start with a simplified version of the problem where we know that there are exactly three species. Let’s call them lions, tigers and bears. Suppose we visit a wild animal preserve and see 3 lions, 2 tigers and one bear.

If we have an equal chance of observing any animal in the preserve, then the number of each species we see is governed by the multinomial distribution. If the prevalences of lions, tigers and bears are `p_lion`, `p_tiger` and `p_bear`, the likelihood of seeing 3 lions, 2 tigers and one bear is `p_lion**3 * p_tiger**2 * p_bear**1`.
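As a sanity check, this likelihood (without the multinomial coefficient, which is constant for a fixed dataset) can be computed directly. This is a minimal sketch with illustrative prevalences, not code from the book:

```python
# Likelihood of the observed counts under assumed prevalences, ignoring
# the multinomial coefficient (constant for a fixed dataset).
def likelihood(prevalences, counts):
    like = 1.0
    for p, x in zip(prevalences, counts):
        like *= p ** x
    return like

# Illustrative prevalences for lions, tigers and bears.
print(likelihood([0.5, 0.3, 0.2], [3, 2, 1]))  # about 0.00225
```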

An approach that is tempting, but not correct, is to use beta
distributions, as in Section 4.6, to describe the prevalence of
each species separately. For example, we saw 3 lions and 3 non-lions;
if we think of that as 3 “heads” and 3 “tails,” then the posterior
distribution of `p_lion` is:

```
beta = thinkbayes.Beta()
beta.Update((3, 3))
print beta.MaximumLikelihood()
```
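For comparison, the mode of a beta posterior can be computed by hand; this sketch does not depend on `thinkbayes` (the formula is valid for parameters greater than 1):

```python
# Mode (maximum likelihood point) of a Beta(alpha, beta) distribution,
# valid when alpha, beta > 1.  With a uniform prior and 3 "heads" and
# 3 "tails", the posterior is Beta(4, 4).
def beta_mode(alpha, beta):
    return (alpha - 1) / (alpha + beta - 2)

print(beta_mode(4, 4))  # -> 0.5
```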

The maximum likelihood estimate for `p_lion` is the observed rate, 50%. Similarly, the MLEs for `p_tiger` and `p_bear` are 33% and 17%.

But there are two problems:

- We have implicitly used a prior for each species that is uniform from 0 to 1, but since we know that there are three species, that prior is not correct. The right prior should have a mean of 1/3, and there should be zero likelihood that any species has a prevalence of 100%.
- The distributions for each species are not independent, because the prevalences have to add up to 1. To capture this dependence, we need a joint distribution for the three prevalences.

We can solve both problems with a Dirichlet distribution (see `http://en.wikipedia.org/wiki/Dirichlet_distribution`). In the same way we used the beta distribution to describe the distribution of bias for a coin, we can use a Dirichlet distribution to describe the joint distribution of `p_lion`, `p_tiger` and `p_bear`.

The Dirichlet distribution is the multi-dimensional generalization of the beta distribution. Instead of two possible outcomes, like heads and tails, the Dirichlet distribution handles any number of outcomes: in this example, three species.

If there are `n` outcomes, the Dirichlet distribution is described by `n` parameters, written α_i.

Here’s the definition, from `thinkbayes.py`, of a class that represents a Dirichlet distribution:

```
class Dirichlet(object):

    def __init__(self, n):
        self.n = n
        self.params = numpy.ones(n, dtype=numpy.int)
```

`n` is the number of dimensions; initially the parameters are all 1. I use a `numpy` array to store the parameters so I can take advantage of array operations.

Given a Dirichlet distribution, the marginal distribution for each prevalence is a beta distribution, which we can compute like this:

```
def MarginalBeta(self, i):
    alpha0 = self.params.sum()
    alpha = self.params[i]
    return Beta(alpha, alpha0-alpha)
```

`i` is the index of the marginal distribution we want; `alpha0` is the sum of the parameters; `alpha` is the parameter for the given species.

In the example, the prior marginal distribution for each species is `Beta(1, 2)`. We can compute the prior means like this:

```
dirichlet = thinkbayes.Dirichlet(3)
for i in range(3):
    beta = dirichlet.MarginalBeta(i)
    print beta.Mean()
```

As expected, the prior mean prevalence for each species is 1/3.

To update the Dirichlet distribution, we add the number of observations to each parameter, like this:

```
def Update(self, data):
    m = len(data)
    self.params[:m] += data
```

Here `data` is a sequence of counts in the same order as `params`, so in this example, it should be the number of lions, tigers and bears.

But `data` can be shorter than `params`; in that case there are some hypothetical species that have not been observed.
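The effect of the update rule can be sketched with plain `numpy`; the parameter values here are illustrative:

```python
import numpy as np

# Five hypothetical species with a uniform Dirichlet prior (all parameters 1).
params = np.ones(5)
data = [3, 2, 1]              # counts for the three observed species

# The update adds the counts to the first m parameters; the two unseen
# species keep their prior parameter of 1.
params[:len(data)] += data
print(params)                 # -> [4. 3. 2. 1. 1.]
```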

Here’s code that updates `dirichlet` with the observed data and computes the posterior marginal distributions:

```
data = [3, 2, 1]
dirichlet.Update(data)
for i in range(3):
    beta = dirichlet.MarginalBeta(i)
    pmf = beta.MakePmf()
    print i, pmf.Mean()
```
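The posterior means can be checked by hand: each marginal mean is α_i / α_0, where α_0 is the sum of the parameters. A quick sketch:

```python
# After updating a 3-species Dirichlet(1, 1, 1) with counts [3, 2, 1],
# the parameters are (4, 3, 2).  Each marginal mean is alpha_i / alpha_0.
params = [4, 3, 2]
alpha0 = sum(params)
means = [alpha / alpha0 for alpha in params]
print(means)  # roughly [0.44, 0.33, 0.22]
```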

The posterior mean prevalences are 44%, 33% and 22%.

### A hierarchical model

We have solved a simplified version of the problem: if we know how many species there are, we can estimate the prevalence of each.

Now let’s get back to the original problem, estimating the total number of species. To solve this problem I’ll define a metasuite, which is a Suite that contains other Suites as hypotheses. In this case, the top-level Suite contains hypotheses about the number of species; the bottom level contains hypotheses about prevalences. A multi-level model like this is called “hierarchical.”

Here’s the class definition:

```
class Species(thinkbayes.Suite):

    def __init__(self, ns):
        hypos = [thinkbayes.Dirichlet(n) for n in ns]
        thinkbayes.Suite.__init__(self, hypos)
```

`__init__` takes a list of possible values for `n` and makes a list of Dirichlet objects.

Here’s the code that creates the top-level suite:

```
ns = range(3, 30)
suite = Species(ns)
```

`ns` is the list of possible values for `n`. We have seen 3 species, so there have to be at least that many. I chose an upper bound that seemed reasonable, but we will have to check later that the probability of exceeding this bound is low. And, at least initially, we assume that any value in this range is equally likely.

To update a hierarchical model, you have to update all levels. Sometimes it is necessary or more efficient to update the bottom level first and work up. In this case it doesn’t matter, so I update the top level first:

```
# class Species

def Update(self, data):
    thinkbayes.Suite.Update(self, data)
    for hypo in self.Values():
        hypo.Update(data)
```

`Species.Update` invokes `Update` in the parent class, then loops through the sub-hypotheses and updates them.

Now all we need is a likelihood function. As usual, `Likelihood` gets a hypothesis and a dataset as arguments:

```
# class Species

def Likelihood(self, hypo, data):
    dirichlet = hypo
    like = 0
    for i in range(1000):
        like += dirichlet.Likelihood(data)
    return like
```

`hypo` is a Dirichlet object; `data` is a sequence of observed counts. `Species.Likelihood` calls `Dirichlet.Likelihood` 1000 times and returns the total.

Why do we have to call it 1000 times? Because `Dirichlet.Likelihood` doesn’t actually compute the likelihood of the data under the whole Dirichlet distribution. Instead, it draws one sample from the hypothetical distribution and computes the likelihood of the data under the sampled set of prevalences.
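This is Monte Carlo integration: averaging the likelihood over random draws from the prior approximates the integral of the likelihood over the Dirichlet distribution. A toy two-species sketch (counts, seed and iteration count are illustrative, not from the book) shows the idea; the exact answer for a uniform prior and counts (3, 2) is the beta function B(4, 3) = 1/60 ≈ 0.0167:

```python
import random

random.seed(17)

def sample_uniform_dirichlet2():
    # A 2-dimensional Dirichlet(1, 1) sample is just a uniform p in [0, 1].
    return random.random()

counts = (3, 2)
iters = 100000
total = 0.0
for _ in range(iters):
    p = sample_uniform_dirichlet2()
    total += p**counts[0] * (1 - p)**counts[1]

# The average approaches the marginal likelihood, 1/60 ~ 0.0167.
print(total / iters)
```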

Here’s what it looks like:

```
# class Dirichlet

def Likelihood(self, data):
    m = len(data)
    if self.n < m:
        return 0

    x = data
    p = self.Random()
    q = p[:m]**x
    return q.prod()
```

The length of `data` is the number of species observed. If we see more species than we thought existed, the likelihood is 0.

Otherwise we select a random set of prevalences, `p`, and compute the multinomial PMF, which is

c(x) ∏ p_i^{x_i}

where p_i is the prevalence of the *i*th species and x_i is the observed number. The first term, c(x), is the multinomial coefficient; I left it out of the computation because it is a multiplicative factor that depends only on the data, not the hypothesis, so it gets normalized away (see `http://en.wikipedia.org/wiki/Multinomial_distribution`).

Also, I truncated `p` at `m`, which is the number of observed species. For the unseen species, x_i is 0, so p_i^{x_i} is 1, and we can leave them out of the product.

### Random sampling

There are two ways to generate a random sample from a Dirichlet distribution. One is to use the marginal beta distributions, but in that case you have to select one at a time and scale the rest so they add up to 1 (see `http://en.wikipedia.org/wiki/Dirichlet_distribution#Random_number_generation`).

A less obvious, but faster, way is to select values from `n` gamma distributions, then normalize by dividing through by the total. Here’s the code:

```
# class Dirichlet

def Random(self):
    p = numpy.random.gamma(self.params)
    return p / p.sum()
```
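The gamma trick can be sketched standalone with `numpy` (the parameters and seed here are illustrative): independent draws from Gamma(α_i), divided by their sum, are a sample from Dirichlet(α_1, …, α_n).

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative Dirichlet parameters.
params = np.array([4.0, 3.0, 2.0])

# One Dirichlet sample: independent gamma draws, normalized to sum to 1.
gammas = rng.gamma(params)
p = gammas / gammas.sum()
print(p, p.sum())   # components are nonnegative and sum to 1
```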

Now we’re ready to look at some results. Here is the code that
updates the top-level suite and extracts the posterior PMF of `n`:

```
data = [3, 2, 1]
suite.Update(data)
pmf = suite.DistOfN()
```

To get the posterior distribution of `n`, `DistOfN` iterates through the top-level hypotheses:

```
def DistOfN(self):
    pmf = thinkbayes.Pmf()
    for hypo, prob in self.Items():
        pmf.Set(hypo.n, prob)
    return pmf
```

The most likely value of `n` is 5. Values from 3 to 8 are all likely; after that the probabilities drop off quickly. The probability that there are 29 species is low enough to be negligible; if we chose a higher bound, we would get the same result.

But remember that we started with a uniform prior for `n`. If we have background information about the number of species in the environment, we might choose a different prior.

### Optimization

I have to admit that I am proud of this example. The unseen species problem is not easy, and I think this solution is simple and clear, and takes surprisingly few lines of code (about 50 so far).

The only problem is that it is slow. It’s good enough for the example with only 3 observed species, but not good enough for the belly button data, with more than 100 species in some samples.

In *Think Bayes* I present a series of optimizations we need to make this solution scale. Here’s a road map of the steps:

- The first step is to recognize that if we update the Dirichlet distributions with the same data, the first `m` parameters are the same for all of them. The only difference is the number of hypothetical unseen species. So we don’t really need `n` Dirichlet objects; we can store the parameters in the top level of the hierarchy. `Species2` implements this optimization.
- `Species2` also uses the same set of random values for all of the hypotheses. This saves time generating random values, but it has a second benefit that turns out to be more important: by giving all hypotheses the same selection from the sample space, we make the comparison between the hypotheses more fair, so it takes fewer iterations to converge.
- But there is still a major performance problem. As the number of observed species increases, the array of random prevalences gets bigger, and the chance of choosing one that is approximately right becomes small. So the vast majority of iterations yield small likelihoods that don’t contribute much to the total, and don’t discriminate between hypotheses. The solution is to do the updates one species at a time. `Species4` is a simple implementation of this strategy using Dirichlet objects to represent the sub-hypotheses.
- Finally, `Species5` combines the sub-hypotheses into the top level and uses `numpy` array operations to speed things up.

### Comments

You can be proud of this :)

My own over-simple version never got further than uniform prevalence for each species.

I've found these kinds of problems fascinating since as a child I learned how capture-recapture works. Something about using data and logic to quantify unobserved events really captured my imagination.

One application I've been thinking about casually for a little while is estimating the incidence of scientific fraud. It seems tricky, and I have to admit I haven't made much progress. Any suggestions?

Hi Tom. Thanks for the link to your article. It's very interesting, and I like the variations on this problem you presented.

Estimating the prevalence of scientific fraud sounds interesting, too. The obvious challenge is the difficulty of estimating the fraction of instances that are detected. The same problem comes up in estimating the frequency of under-reported violent crimes.

I'll think about it. Thanks again!