Thursday, May 30, 2013

Belly Button Biodiversity: The End Game

In the previous installment of this saga, I admitted that my predictions had completely failed, and I outlined the debugging process I began.  Then the semester happened, so I didn't get to work on it again until last week.

It turns out that there were several problems, but the algorithm is now calibrating and validating!  Before I proceed, I should explain how I am using these words:

  • Calibrate: Generate fake data from the same model the analysis is based on.  Run the analysis on fake data and generate predictive distributions.  Check whether the predictive distributions are correct.
  • Validate: Using real data, generate a rarefied sample.  Run the analysis on the sample and generate predictive distributions.  Check whether the predictive distributions are correct.

If the analysis calibrates, but fails to validate, that suggests that there is some difference between the model and reality that is causing a problem.  And that turned out to be the case.

Here are the problems I discovered, and what I had to do to fix them:

The prior distribution of prevalences

For the prior I used a Dirichlet distribution with all parameters set to 1.  I neglected to consider the "concentration parameter," which represents the prior belief about how uniform or concentrated the prevalences are.  As the concentration parameter approaches 0, prevalences tend to be close to 1 or 0; that is, there tends to be one dominant species and many species with small prevalences.  As the concentration parameter gets larger, all species tend to have the same prevalence.  It turns out that a concentration parameter near 0.1 yields a distribution of prevalences that resembles real data.
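
To see the effect, here is a minimal sketch using NumPy's Dirichlet sampler rather than the thinkbayes.Dirichlet class from the analysis; the species count and the random seed are arbitrary:

    import numpy as np

    rng = np.random.default_rng(17)

    def sample_prevalences(num_species, conc):
        # one draw from a symmetric Dirichlet with concentration parameter conc
        return rng.dirichlet([conc] * num_species)

    for conc in [0.1, 1.0, 10.0]:
        prevs = np.sort(sample_prevalences(100, conc))[::-1]
        print(conc, prevs[:3].round(3))   # the few largest prevalences

With conc near 0.1, a few species account for most of the total prevalence; with conc at 10, all 100 prevalences are close to 1/100.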

The prior distribution of n

With a smaller concentration parameter, there are more species with small prevalences, so I had to increase the range of n (the number of species).  The prior distribution for n is uniform up to an upper bound, where I choose the upper bound to be big enough to avoid cutting off non-negligible probability.  I had to increase this upper bound to 1000, which slows the analysis down a little, but it still takes only a few seconds per subject (on my not-very-fast computer).
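
For concreteness, here is a sketch of that prior, assuming a plain dictionary representation rather than the suite classes used in the actual analysis:

    # uniform prior over n, the number of species
    upper_bound = 1000   # big enough not to cut off non-negligible probability
    prior_n = {n: 1.0 / upper_bound for n in range(1, upper_bound + 1)}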

Up to this point I hadn't discovered any real errors; it was just a matter of choosing appropriate prior distributions, which is ordinary work for Bayesian analysis.

But it turns out there were two legitimate errors.

Bias due to the definition of "unseen"

I was consistently underestimating the prevalence of unseen species because of a bias that underlies the definition of "unseen."  To see the problem, consider a simple scenario where there are two species, A and B, with equal prevalence.  If I only collect one sample, I get A or B with equal probability.

Suppose I am trying to estimate the prevalence of A.  If my sample is A, the posterior marginal distribution for the prevalence of A is Beta(2, 1), which has mean 2/3.  If the sample is B, the posterior is Beta(1, 2), which has mean 1/3.  So the expected posterior mean is the average of 2/3 and 1/3, which is 1/2.  That is the actual prevalence of A, so this analysis is unbiased.

But now suppose I am trying to estimate the prevalence of the unseen species.  If I draw A, the unseen species is B and the posterior mean is 1/3.  If I draw B, the unseen species is A and the posterior mean is 1/3.  So either way I believe that the prevalence of the unseen species is 1/3, but it is actually 1/2.  Since I did not specify in advance which species is unseen, the result is biased.
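
Here is a small Monte Carlo check of that argument (my construction, not code from the analysis): two species with true prevalence 1/2, one read per trial, and a flat Beta(1, 1) prior on each prevalence.

    import random

    def posterior_mean(successes, failures):
        # mean of Beta(1 + successes, 1 + failures)
        return (1 + successes) / (2 + successes + failures)

    trials = 100000
    total_A, total_unseen = 0.0, 0.0
    for _ in range(trials):
        read = random.choice(['A', 'B'])
        # species A is specified in advance; "unseen" is whichever species was not read
        total_A += posterior_mean(read == 'A', read != 'A')
        total_unseen += posterior_mean(0, 1)

    print(total_A / trials)       # close to 1/2: unbiased
    print(total_unseen / trials)  # exactly 1/3, but the true value is 1/2: biased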

This seems obvious in retrospect.  So that's embarrassing (the price I pay for this experiment in Open Science), but it is easy to fix:

a) The posterior distribution I generate has the right relative prevalences for the seen species (based on the data) and the right relative prevalences for the unseen species (all the same), but the total prevalence for the unseen species (which I call q) is too low.

b) Fortunately, there is only one part of the analysis where this bias is a problem: when I draw a sample from the posterior distribution.  To fix it, I can draw a value of q from the correct posterior distribution (just by running a forward simulation) and then unbias the posterior distribution with the selected value of q.

Here's the code that generates q:


    def RandomQ(self, n):
        
        # generate random prevalences
        dirichlet = thinkbayes.Dirichlet(n, conc=self.conc)
        prevalences = dirichlet.Random()

        # generate a simulated sample
        pmf = thinkbayes.MakePmfFromItems(enumerate(prevalences))
        cdf = pmf.MakeCdf()
        sample = cdf.Sample(self.num_reads)
        seen = set(sample)

        # add up the prevalence of unseen species
        q = 0
        for species, prev in enumerate(prevalences):
            if species not in seen:
                q += prev

        return q


n is the hypothetical number of species.  conc is the concentration parameter.  RandomQ creates a Dirichlet distribution, draws a set of prevalences from it, then draws a simulated sample with the appropriate number of reads, and adds up the total prevalence of the species that don't appear in the sample.

And here's the code that unbiases the posterior:


    def Unbias(self, n, m, q_desired):
        """Adjusts the parameters so the unseen species have total prevalence q_desired."""
        params = self.params[:n].copy()

        if n == m:
            # no unseen species, so there is nothing to adjust
            return params

        x = sum(params[:m])
        y = sum(params[m:])
        a = x + y

        g = q_desired * a / y
        f = (a - g * y) / x
        params[:m] *= f
        params[m:] *= g

        return params


n is the hypothetical number of species; m is the number seen in the actual data.

x is the total prevalence of the seen species; y is the total prevalence of the unseen species.  f and g are the factors we have to multiply by so that the corrected prevalence of unseen species is q_desired.
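
As a quick numeric check of that algebra (with made-up values, not real data), the corrected parameters put exactly the desired share of prevalence on the unseen species:

    import numpy as np

    params = np.array([2.0, 1.0, 0.5, 0.5])   # n = 4 species, the first m = 2 seen
    m, q_desired = 2, 0.5

    x = params[:m].sum()      # 3.0, total for the seen species
    y = params[m:].sum()      # 1.0, total for the unseen species
    a = x + y

    g = q_desired * a / y     # 2.0
    f = (a - g * y) / x       # 2/3
    params[:m] *= f
    params[m:] *= g

    print(params[m:].sum() / params.sum())    # 0.5, as requested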

After fixing this error, I find that the analysis calibrates nicely.


From each predictive distribution I generate credible intervals with ideal percentages 10, 20, ... 90, and then count how often the actual value falls in each interval.

For example, the blue line is the calibration curve for n, the number of species.  After 100 runs, the 10% credible interval contains the actual value 9.5% of the time.  The 50% credible interval contains the actual value 51.5% of the time.  And the 90% credible interval contains the actual value 88% of the time.  These results show that the posterior distribution for n is, in fact, the posterior distribution for n.
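
Here is a sketch of how that coverage count might be computed; this is my reconstruction of the procedure, not the actual script, and the names are hypothetical.  It assumes each run yields the known true value and a sample from the predictive distribution:

    import numpy as np

    def coverage(true_values, posterior_samples, percentages=range(10, 100, 10)):
        # fraction of runs in which the true value falls inside each
        # central credible interval (10%, 20%, ..., 90%)
        counts = {p: 0 for p in percentages}
        for truth, sample in zip(true_values, posterior_samples):
            for p in percentages:
                low, high = np.percentile(sample, [50 - p / 2, 50 + p / 2])
                if low <= truth <= high:
                    counts[p] += 1
        runs = len(true_values)
        return {p: counts[p] / runs for p in percentages}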

The results are similar for q, the prevalence of unseen species, and l, the predicted number of new species seen after additional sampling.

To check whether the analysis validates, I used the dataset collected by the Belly Button Biodiversity project.  For each subject with more than 400 reads, I chose a random subset of 100 reads, ran the analysis, and checked the predictive distributions for q and l.  I can't check the predictive distribution of n, because I don't know the actual value.

Sadly, the analysis does not validate with the collected data.  The reason is:

The data do not fit the model

The data deviate substantially from the model that underlies the analysis.  To see this, I tried this experiment (step b is sketched below):

a) Use the data to estimate the parameters of the model.
b) Generate fake samples from the model.
c) Compare the fake samples to the real data.
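
Step b might look something like this sketch, where n_est, conc_est, and num_reads stand in for the fitted parameters (hypothetical names); the result is the rank-ordered cumulative prevalence curve that gets compared to the data:

    import numpy as np

    rng = np.random.default_rng(5)

    def fake_sample_cdf(n_est, conc_est, num_reads):
        # draw prevalences from the model, simulate reads, and return the
        # cumulative share of the sample accounted for by the top-ranked species
        prevalences = rng.dirichlet([conc_est] * n_est)
        reads = rng.choice(n_est, size=num_reads, p=prevalences)
        counts = np.bincount(reads, minlength=n_est)
        fractions = np.sort(counts)[::-1] / num_reads
        return np.cumsum(fractions)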

Here's a typical result:


The blue line is the CDF of prevalences, in order by rank.  The top-ranked species makes up about 27% of the sample.  The top 10 species make up about 75%, and the top 100 species make up about 90%.

The green lines show CDFs from 10 fake samples.  The model is a good match for the data for the first 10-20 species, but then it deviates substantially.  The prevalence of rare species is higher in the data than in the model.

The problem is that the real data seem to come from a mixture of two distributions, one for dominant species and one for rare species.  Among the dominant species the concentration parameter is near 0.1.  For the rare species, it is much higher; that is, the rare species all have about the same prevalence.

There are two possible explanations: this effect might be real or it might be an artifact of errors in identifying reads.  If it's real, I would have to extend my model to account for it.  If it is due to errors, it might be possible to clean the data.

I have heard from biologists that when a large sample yields only a single read of a particular species, that read is likely to be in error; that is, the identified species might not actually be present.

So I explored a simple error model with the following features:

1) If a species appears only once after r reads, the probability that the read is bogus is p = (1 - alpha/r), where alpha is a parameter.

2) If a species appears k times after r reads, the probability that all k reads are bogus is p^k.

To clean the data, I compute the probability that each observed species is bogus, and then delete it with the computed probability.
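
Here is one way the cleaning step might look under this error model; this is my reading of it, not the actual script.  counts (mapping each observed species to its number of reads), r (the total number of reads), and alpha are hypothetical names:

    import random

    def clean(counts, r, alpha=50):
        p = 1 - alpha / r            # probability that a single read is bogus
        cleaned = {}
        for species, k in counts.items():
            # a species is deleted only if all k of its reads are judged bogus,
            # which happens with probability p**k
            if random.random() >= p ** k:
                cleaned[species] = k
        return cleaned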


With cleaned data (alpha=50), the model fits very nicely.  And since the model fits, and the analysis calibrates, we expect the analysis to validate.  And it does.


For n there is no validation curve because we don't know the actual values.

For q the validation curve is a little off because we only have a lower bound for the prevalence of unseen species, so the actual values used for validation are too high.

But for l the validation curve is quite good, and that's what we are actually trying to predict, after all.

At this point the analysis depends on two free parameters, the concentration parameter and the cleaning parameter, alpha, which controls how much of the data gets discarded as erroneous.

So the next step is to check whether these parameters cross-validate.  That is, if we tune the parameters based on a training set, how well do those values do on a test set?
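
That check might look roughly like the following sketch; run_analysis and score are placeholders for code that does not exist yet, and the parameter grids are arbitrary:

    import random
    import itertools

    def cross_validate(subjects, run_analysis, score):
        # tune conc and alpha on a training set, then score them on a held-out test set
        random.shuffle(subjects)
        half = len(subjects) // 2
        train, test = subjects[:half], subjects[half:]

        grid = list(itertools.product([0.05, 0.1, 0.2], [25, 50, 100]))   # (conc, alpha)
        best = max(grid, key=lambda params: score(run_analysis(train, *params)))
        return best, score(run_analysis(test, *best))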

Another next step is to improve the error model.  I chose something very simple, and it does a nice job of getting the data to conform to the analysis model, but it is not well motivated.  If I can get more information about where the errors are coming from, I could take a Bayesian approach (what else?) and compute the probability that each datum is legit or bogus.

Or if the data are legit and the prevalences are drawn from a mixture of Dirichlet distributions with different concentrations, I will have to extend the analysis accordingly.

Summary

There were four good reasons my predictions failed:

1) The prior distribution of prevalences had the wrong concentration parameter.

2) The prior distribution of n was too narrow.

3) I neglected an implicit bias due to the definition of "unseen species."

4) The data deviate from the model the analysis is based on.  If we "clean" the data, they fit the model and the analysis validates, but the cleaning process is a bit of a hack.

I was able to solve these problems, but I had to introduce two free parameters, so my algorithm is not as versatile as I hoped.  However, it seems like it should be possible to choose these parameters automatically, which would be an improvement.

And now I have to stop, incorporate these corrections into Think Bayes, and then finish the manuscript!
