Wednesday, April 24, 2013

The Price is Right Problem: Part Two


This article is an excerpt from Think Bayes, a book I am working on.  The entire current draft is available from http://thinkbayes.com.  I welcome comments and suggestions.

In the previous article, I presented The Price is Right problem and a Bayesian approach to estimating the value of a showcase of prizes.  This article picks up from there...


Optimal bidding

Now that we have a posterior distribution, we can use it to compute the optimal bid, which I define as the bid that maximizes expected gain.

To compute optimal bids, I wrote a class called GainCalculator:

class GainCalculator(object):

    def __init__(self, player, opponent):
        self.player = player
        self.opponent = opponent

player and opponent are Player objects.

GainCalculator provides ExpectedGains, which computes a sequence of bids and the expected gain for each bid:

    def ExpectedGains(self, low=0, high=75000, n=101):
        bids = numpy.linspace(low, high, n)

        gains = [self.ExpectedGain(bid) for bid in bids]

        return bids, gains

low and high specify the range of possible bids;
n is the number of bids to try. Here is the function
that computes expected gain for a given bid:

    def ExpectedGain(self, bid):
        suite = self.player.posterior
        total = 0
        for price, prob in suite.Items():
            gain = self.Gain(bid, price)
            total += prob * gain
        return total

ExpectedGain loops through the values in the posterior
and computes the gain for each bid, given the actual prices of
the showcase. It weights each gain with the corresponding
probability and returns the total.
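The weighting step can be sketched with a toy posterior. The prices, probabilities, and simplified gain rule below are invented for illustration; the real Gain also accounts for the probability of winning:

```python
# Toy posterior: hypothetical showcase prices mapped to probabilities.
posterior = {20000: 0.3, 25000: 0.5, 30000: 0.2}

def gain(bid, price):
    # simplified rule: overbidding wins nothing; otherwise you win the showcase
    return 0 if bid > price else price

def expected_gain(bid):
    # weight the gain for each hypothetical price by its posterior probability
    return sum(prob * gain(bid, price) for price, prob in posterior.items())

print(expected_gain(24000))  # 0.5 * 25000 + 0.2 * 30000 = 18500
```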

Gain takes a bid and an actual price and returns the expected gain:

    def Gain(self, bid, price):
        if bid > price:
            return 0

        diff = price - bid
        prob = self.ProbWin(diff)

        if diff <= 250:
            return 2 * price * prob
        else:
            return price * prob

If you overbid, you get nothing. Otherwise we compute 
the difference between your bid and the price, which determines
your probability of winning.

If diff is less than $250, you win both showcases. For simplicity, I assume that both showcases have the same price. Since this outcome is rare, it doesn’t make much difference.

Finally, we have to compute the probability of winning based on diff:

    def ProbWin(self, diff):
        prob = (self.opponent.ProbOverbid() + 
                self.opponent.ProbWorseThan(diff))
        return prob

If your opponent overbids, you win. Otherwise, you have to hope
that your opponent is off by more than diff. Player
provides methods to compute both probabilities:

# class Player:

    def ProbOverbid(self):
        return self.cdf_diff.Prob(-1)

    def ProbWorseThan(self, diff):
        return 1 - self.cdf_diff.Prob(diff)

This code might be confusing because the computation is now from
the point of view of the opponent, who is computing, “What is
the probability that I overbid?” and “What is the probability
that my bid is off by more than diff?”

Both answers are based on the CDF of diff [CDFs are described here].  If your opponent’s diff is less than or equal to -1, you win. If your opponent’s diff is worse than yours, you win. Otherwise you lose.
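These two probabilities can also be computed directly from a sample of historical diffs, without building a Cdf object. The sample below is invented:

```python
import numpy

# Invented sample of diffs (diff = price - bid) for one opponent.
diffs = numpy.array([-2000, -500, 1000, 3000, 5000, 8000])

def prob_overbid(diffs):
    # an overbid means the bid exceeded the price, so diff is negative
    return numpy.mean(diffs < 0)

def prob_worse_than(diffs, diff):
    # probability that the opponent is off by more than diff
    return numpy.mean(diffs > diff)

print(prob_overbid(diffs))           # 2 of the 6 diffs are negative
print(prob_worse_than(diffs, 1000))  # 3 of the 6 diffs exceed 1000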

Finally, here’s the code that computes optimal bids:

# class Player:

    def OptimalBid(self, guess, opponent):
        self.MakeBeliefs(guess)
        calc = GainCalculator(self, opponent)
        bids, gains = calc.ExpectedGains()
        gain, bid = max(zip(gains, bids))
        return bid, gain

Given a guess and an opponent, OptimalBid computes
the posterior distribution, instantiates a GainCalculator,
computes expected gains for a range of bids and returns
the optimal bid and expected gain. Whew!




Figure 6.4




Figure 6.4 shows the results for both players, based on a scenario where Player 1’s best guess is $20,000 and Player 2’s best guess is $40,000.

For Player 1 the optimal bid is $21,000, yielding an expected return of almost $16,700. This is a case (which turns out to be unusual) where the optimal bid is actually higher than the contestant’s best guess.

For Player 2 the optimal bid is $31,500, yielding an expected return of almost $19,400. This is the more typical case where the optimal bid is less than the best guess.

Discussion

One of the most useful features of Bayesian estimation is that the result comes in the form of a posterior distribution. Classical estimation usually generates a single point estimate or a confidence interval, which is sufficient if estimation is the last step in the process, but if you want to use an estimate as an input to a subsequent analysis, point estimates and intervals are often not much help.

In this example, the Bayesian analysis yields a posterior distribution we can use to compute an optimal bid. The gain function is asymmetric and discontinuous (if you overbid, you lose), so it would be hard to solve this problem analytically. But it is relatively simple to do computationally.

Newcomers to Bayesian thinking are often tempted to summarize the posterior distribution by computing the mean or the maximum likelihood estimate. These summaries can be useful, but if that’s all you need, then you probably don’t need Bayesian methods in the first place.

Bayesian methods are most useful when you can carry the posterior distribution into the next step of the process to perform some kind of optimization, as we did in this chapter, or some kind of prediction, as we will see in the next chapter [which you can read here].

Monday, April 22, 2013

The Price is Right Problem

This article is an excerpt from Think Bayes, a book I am working on.  The entire current draft is available from http://thinkbayes.com.  I welcome comments and suggestions.

The Price is Right Problem

On November 1, 2007, contestants named Letia and Nathaniel appeared on The Price is Right, an American game show. They competed in a game called the Showcase, where the objective is to guess the price of a showcase of prizes. The contestant who comes closest to the actual price of the showcase, without going over, wins the prizes.

Nathaniel went first. His showcase included a dishwasher, a wine cabinet, a laptop computer, and a car. He bid $26,000.

Letia’s showcase included a pinball machine, a video arcade game, a pool table, and a cruise of the Bahamas. She bid $21,500.

The actual price of Nathaniel’s showcase was $25,347. His bid was too high, so he lost.

The actual price of Letia’s showcase was $21,578. She was only off by $78, so she won her showcase and, because her bid was off by less than $250, she also won Nathaniel’s showcase.

For a Bayesian thinker, this scenario suggests several questions:
  1. Before seeing the prizes, what prior beliefs should the contestant have about the price of the showcase?
  2. After seeing the prizes, how should the contestant update those prior beliefs?
  3. Based on the posterior distribution, what should the contestant bid?
The third question demonstrates a common use of Bayesian analysis: optimization. Given a posterior distribution, we can choose the bid that maximizes the contestant’s expected return.

This problem is inspired by an example in Cameron Davidson-Pilon’s book, Bayesian Methods for Hackers.


The prior 

To choose a prior distribution of prices, we can take advantage of data from previous episodes. Fortunately, fans of the show keep detailed records. When I corresponded with Mr. Davidson-Pilon about his book, he sent me data collected by Steve Gee at http://tpirsummaries.8m.com. It includes the price of each showcase from the 2011 and 2012 seasons, and the bids offered by the contestants.



Figure 6.1: Distribution of prices for showcases on The Price is Right, 2011-12.



Figure 6.1 shows the distribution of prices for these showcases. The most common value for both showcases is around $28,000, but the first showcase has a second mode near $50,000, and the second showcase is occasionally worth more than $70,000.

These distributions are based on actual data, but they have been smoothed by Gaussian kernel density estimation (KDE). So before we go on, I want to take a detour to talk about probability density functions and KDE.


Probability density functions

So far [in Think Bayes, that is] we have been working with probability mass functions, or PMFs. A PMF is a mapping from each possible value to its probability. In my implementation, a Pmf object provides a method named Prob that takes a value and returns a probability, also known as a “probability mass.”

A probability density function, or PDF, is the continuous version of a PMF, where the possible values make up a continuous range rather than a discrete set.

In mathematical notation, PDFs are usually written as functions; for example, here is the PDF of a Gaussian distribution with mean 0 and standard deviation 1:
f(x) = exp(−x²/2)

To represent this PDF in Python, I could define a class like this:

class StandardGaussianPdf(object):

    def Density(self, x):
        return math.exp(-x**2 / 2)

Density takes a value, x, and returns the probability density evaluated at x.

A probability density is similar to a probability mass in one way: higher density indicates that a value is more likely.

But a density is not a probability. If you integrate a density over a continuous range, the result is a probability. But for the applications in this book we seldom have to do that.
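For example, integrating the standard Gaussian density over [−1, 1] yields the familiar probability of falling within one standard deviation of the mean. A crude midpoint-rule integration is enough to see it:

```python
import math

def density(x):
    # standard Gaussian PDF, including the normalizing constant
    return math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

# midpoint-rule integration of the density over [-1, 1]
n = 10000
dx = 2.0 / n
prob = sum(density(-1 + (i + 0.5) * dx) * dx for i in range(n))
print(round(prob, 3))  # about 0.683
```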

In this book we primarily use probability densities as part of a Likelihood function. We will see an example soon.


Representing Pdfs

Before we get back to The Price is Right, I want to present a more general way to represent PDFs.
thinkbayes.py provides a class named Pdf that defines two functions, Density and MakePmf:

class Pdf(object):

    def Density(self, x):
        raise UnimplementedMethodException()

    def MakePmf(self, xs):
        pmf = Pmf()
        for x in xs:
            pmf.Set(x, self.Density(x))
        pmf.Normalize()
        return pmf



Pdf is an abstract type, which means that it defines
the interface a Pdf is supposed to have, but does not provide
a complete implementation. Specifically, Pdf provides
MakePmf but not Density.  [PMFs are described in Chapter 2 of Think Bayes.]

A concrete type is a class that extends an abstract parent class and provides an implementation of the missing methods.

For example, GaussianPdf extends Pdf and provides Density:

class GaussianPdf(Pdf):

    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma
        
    def Density(self, x):
        density = scipy.stats.norm.pdf(x, 
                                       loc=self.mu, 
                                       scale=self.sigma)
        return density



__init__ takes mu and sigma, which are
the mean and standard deviation of the distribution.
Density uses a function from scipy to evaluate the Gaussian PDF.

The Gaussian PDF is defined by a simple mathematical function, so it is easy to evaluate. And it is useful because many quantities in the real world have distributions that are approximately Gaussian.
But with real data, there is no guarantee that the PDF is Gaussian, or any other simple mathematical function. In that case we can use a data sample to estimate the PDF of the whole population.

For example, in The Price Is Right data, we have 313 prices for the first showcase. We can think of these values as a sample from the population of all possible showcase prices.

Near the middle of the distribution, we see the following values:

28800, 28868, 28941, 28957, 28958 

In the sample, no values appear between 28801 and 28867, but there is no reason to think that these values are impossible. Based on our background information, we would expect all values in this range to be equally likely. In other words, we expect the PDF to be reasonably smooth.

Kernel density estimation (KDE) is an algorithm that takes a sample of values and finds an appropriately-smooth PDF that fits the data. You can read about the details at http://en.wikipedia.org/wiki/Kernel_density_estimation.
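The core idea is simple enough to sketch by hand: center a Gaussian kernel at each sample value and average them. (scipy's gaussian_kde does this with an automatically chosen bandwidth; the bandwidth below is an invented placeholder.)

```python
import math

# the five sample prices quoted above
sample = [28800, 28868, 28941, 28957, 28958]
bandwidth = 100  # invented smoothing parameter; real KDE chooses this from the data

def kde_density(x):
    # average of Gaussian kernels centered at each sample value
    total = sum(math.exp(-((x - s) / bandwidth)**2 / 2) for s in sample)
    return total / (len(sample) * bandwidth * math.sqrt(2 * math.pi))

# values between observations get nonzero density, so the estimate is smooth
print(kde_density(28830) > 0)                   # True
print(kde_density(28900) > kde_density(28000))  # True: density is higher near the data
```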

scipy provides an implementation of KDE. thinkbayes provides a class called EstimatedPdf that extends Pdf and uses KDE:

class EstimatedPdf(Pdf):

    def __init__(self, sample):
        xs = numpy.array(sample, dtype=numpy.double)
        self.kde = scipy.stats.gaussian_kde(xs)

    def Density(self, x):
        return self.kde.evaluate(x)


__init__ takes a sample, converts it to a NumPy array,
and computes a kernel density estimate. The result is a
gaussian_kde object that provides an evaluate
method.

Density takes a value, calls gaussian_kde.evaluate, and returns the resulting density.

Finally, here’s an outline of the code I used to generate Figure 6.1:

    prices = ReadData()
    kde = thinkbayes.EstimatedPdf(prices)

    low, high = 0, 75000
    n = 101
    xs = numpy.linspace(low, high, n) 
    pmf = kde.MakePmf(xs)

    myplot.Pmf(pmf)


And now back to The Price is Right.


Modeling the contestants

The PDFs in Figure 6.1 estimate the distribution of possible prices for each showcase. If you were a contestant on the show, you could use this distribution to quantify your prior belief about the price of the showcases (before you see the prizes).

To update these priors, you have to answer these questions:
  1. What data should we consider and how should we quantify it?
  2. Can we compute a Likelihood function; that is, for each hypothetical value of price, can we compute the conditional likelihood of the data?
To answer these questions, I am going to model the contestant as a price-guessing instrument with known error characteristics. In other words, when the contestant sees the prizes, he or she guesses the price of each prize—ideally without taking into consideration the fact that the prize is part of a showcase—and adds up the prices. Let’s call this total guess.

Under this model, the question we have to answer is, “If the actual price is price, what is the likelihood that the contestant’s total estimate would be guess?”

Or if we define

    error = price - guess

then we could ask, “What is the likelihood
that the contestant’s estimate is off by error?”
To answer this question, we can use the historical data again. Figure 6.2 shows the cumulative distribution of diff, the difference between the contestant’s bid and the actual price of the showcase.
The definition of diff is

    diff = price - bid

When diff is negative, the bid is too high. As an
aside, we can use this CDF to compute the probability that the
contestants overbid: the first contestant overbids 25% of the
time; the second contestant overbids 29% of the time.



Figure 6.2: Cumulative distribution (CDF) of the difference between the contestant’s bid and the actual price.




We can also use this distribution to estimate the reliability of the contestants’ guesses. This step is a little tricky because we don’t actually know the contestant’s guesses; we only know what they bid.
In Figure 6.2 we can see that the bids are biased; that is, they are more likely to be too low than too high. And that makes sense, given the rules of the game.

So we’ll have to make some assumptions. Specifically, I assume that the distribution of error is Gaussian with mean 0 and the same variance as diff.

The Player class implements this model:

class Player(object):

    def __init__(self, price, bid, diff):
        self.price = price
        self.bid = bid
        self.diff = diff

        self.pdf_price = thinkbayes.EstimatedPdf(price)
        self.cdf_diff = thinkbayes.MakeCdfFromList(diff)

        mu = 0
        sigma = numpy.std(self.diff)
        self.pdf_error = thinkbayes.GaussianPdf(mu, sigma)

price is a sequence of showcase prices, bid is a
sequence of bids, and diff is a sequence of diffs, where
again diff = price - bid.

pdf_price is the smoothed PDF of prices, estimated by KDE. cdf_diff is the cumulative distribution of diff, which we saw in Figure 6.2. And pdf_error is the PDF that characterizes the distribution of errors; where error = price - guess.

Again, we use the variance of diff to estimate the variance of error. But contestants’ bids are sometimes strategic; for example, if Player 2 thinks that Player 1 has overbid, Player 2 might make a very low bid. In that case diff does not reflect error. If this strategy is common, the observed variance in diff might overestimate the variance in error. Nevertheless, I think it is a reasonable modeling decision.

As an alternative, someone preparing to appear on the show could estimate their own distribution of error by watching previous shows and recording their guesses and the actual prices.


Likelihood

Now we are ready to write the likelihood function. As usual, I define a new class that extends thinkbayes.Suite:

class Price(thinkbayes.Suite):

    def __init__(self, pmf, player):
        thinkbayes.Suite.__init__(self)

        for price, prob in pmf.Items():
            self.Set(price, prob)

        self.player = player

pmf represents the prior distribution. The for
loop copies the values and probabilities from pmf into
the new Suite.

player is a Player object as described in the previous section.  And here’s Likelihood:

    def Likelihood(self, hypo, data):
        price = hypo
        guess = data

        error = price - guess
        like = self.player.ErrorDensity(error)

        return like

hypo is the hypothetical price of the showcase. data
is the contestant’s best guess at the price. error is
the difference, and like is the likelihood of the data,
given the hypothesis.

ErrorDensity is defined in Player:
# class Player:

    def ErrorDensity(self, error):
        return self.pdf_error.Density(error)

ErrorDensity works by evaluating pdf_error at
the given value of error.

The result is a probability density, which means we can’t treat it as a probability. But remember that Likelihood does not really need to compute a probability; it only has to compute something proportional to a probability. As long as the constant of proportionality is the same for all likelihoods, it gets cancelled out when we normalize the posterior distribution.

And therefore, a probability density is a perfectly good likelihood.
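A toy example makes the point: scaling all likelihoods by a constant leaves the normalized posterior unchanged. The hypotheses and numbers here are invented:

```python
def normalize(weights):
    total = sum(weights.values())
    return {hypo: w / total for hypo, w in weights.items()}

prior = {'A': 0.5, 'B': 0.5}
density = {'A': 0.25, 'B': 0.75}                  # densities, not probabilities
scaled = {hypo: 10 * d for hypo, d in density.items()}

post1 = normalize({h: prior[h] * density[h] for h in prior})
post2 = normalize({h: prior[h] * scaled[h] for h in prior})
print(post1 == post2)  # True: the constant factor cancels during normalization
```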


Update

Player provides a method that takes the contestant’s guess and computes the posterior distribution:

    def MakeBeliefs(self, guess):
        pmf = self.PmfPrice()
        self.prior = Price(pmf, self, name='prior')
        self.posterior = self.prior.Copy(name='posterior')
        self.posterior.Update(guess)

PmfPrice evaluates pdf_price at an equally-spaced
series of values:

    def PmfPrice(self):
        return self.pdf_price.MakePmf(self.price_xs)

The result is a new Pmf object, which we use to construct
the prior. To construct the posterior, we make a copy of the
prior and then invoke Update, which invokes Likelihood
for each hypothesis, multiplies the priors by the likelihoods,
and then renormalizes.

So let’s get back to the original scenario. Suppose you are Player 1 and when you see your showcase, your best guess is that the total price of the prizes is $20,000.

The following code constructs and plots your prior and posterior beliefs about the actual price:

    player1.MakeBeliefs(20000)
    myplot.Pmf(player1.prior)
    myplot.Pmf(player1.posterior)


Figure 6.3: Prior and posterior distributions for Player 1, based on a best guess of $20,000.


Figure 6.3 shows the results. The value of your guess is on the low end of the prior range, so the posterior is shifted to the left. The mean of the posterior is $25,096; the most likely value is $24,000.

On one level, this result makes sense. The most likely value in the prior is $27,750. Your best guess is $20,000. And the most likely value in the posterior is about halfway in between.

On another level, you might find this result bizarre, because it suggests that if you think the price is $20,000, then you should believe the price is $24,000.

To resolve this apparent paradox, remember that you are combining two sources of information, historical data about past showcases and guesses about the prizes you see.

We are treating the historical data as the prior and updating it based on your guesses, but we could equivalently use your guess as a prior and update it based on historical data.

If you think of it that way, maybe it is less surprising that the most likely value in the posterior is not your original guess.

In the next installment, we'll use the posterior distribution to compute the optimal bid for each player.

Thursday, April 11, 2013

The price is right

On the reddit statistics forum, I recently posted a link to my tutorial on Bayesian statistics.  One of my fellow redditors drew my attention to this article, which uses pymc to do a Bayesian analysis of The Price is Right.  He or she asked how I would solve this problem using the framework in Think Bayes.  So, here goes.

First, we have to define a Suite of hypotheses.  In this example, each hypothesis represents a belief about the total price of the showcase.  We are told, based on data from previous shows, that a reasonable prior distribution is normal with mean 35000 dollars and standard deviation 7500.

So here's the code that creates the Suite:

class Price(thinkbayes.Suite):

    def __init__(self, error_sigma):
        """Constructs the suite.

        error_sigma: standard deviation of the distribution of error
        """
        thinkbayes.Suite.__init__(self)
        pmf = thinkbayes.MakeGaussianPmf(35000, 7500, num_sigmas=4)

        # copy items from pmf to self
        for val, prob in pmf.Items():
            self.Set(val, prob)

        # store error_sigma for use in Likelihood
        self.error_sigma = error_sigma

MakeGaussianPmf makes a Pmf (probability mass function) that approximates a normal distribution.  It truncates the range of the distribution at 4 sigmas in each direction from the mean.
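A hand-rolled version of that truncated Pmf might look like the sketch below. MakeGaussianPmf is the book's helper; this stand-in (with an invented name and grid size) only assumes the standard library:

```python
import math

def make_truncated_gaussian_pmf(mu, sigma, num_sigmas=4, n=9):
    # evaluate the Gaussian density at n equally spaced points
    # within num_sigmas standard deviations of the mean
    low = mu - num_sigmas * sigma
    high = mu + num_sigmas * sigma
    xs = [low + i * (high - low) / (n - 1) for i in range(n)]
    ps = [math.exp(-((x - mu) / sigma)**2 / 2) for x in xs]
    total = sum(ps)
    return {x: p / total for x, p in zip(xs, ps)}

pmf = make_truncated_gaussian_pmf(35000, 7500)
print(max(pmf, key=pmf.get))  # the mode is at the mean, 35000
```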

I'll explain error_sigma in a minute.

Now that we have a suite of hypotheses, we think about how to represent the data.  In this case, the "data" is my best guess about the total value of the showcase, which we can think of as a measurement produced by a somewhat unreliable instrument, my brain.

According to the problem statement, my best guess is 3000 for the price of the snowmobile, 12000 for the price of the trip, so 15000 for the total price.

Now we need a Likelihood function that takes a hypothetical price for the showcase, and my best guess, and returns the Likelihood of the data given the hypothesis.  That is, if the actual price of the showcase is X, what is the probability that I would guess 15000?

To answer that question, we need some information about how good I am at guessing prices, and that's where error_sigma comes in.

We are told that my uncertainty about the price of the snowmobile can be captured by a normal distribution with sigma=500.  And my uncertainty about the price of the trip is normal with sigma=3000.  So let's compute the standard deviation of my total error:

    error_snowmobile = 500
    error_trip = 3000
    error_total = math.sqrt(error_snowmobile**2 + error_trip**2)

Now we can create the Suite

    suite = Price(error_total)

and update it with my guess

    my_guess = 15000
    suite.Update(my_guess)

When we invoke Update, it invokes Likelihood once for each hypothetical showcase price.  Here's the Likelihood function:

    def Likelihood(self, hypo, data):
        """Computes the likelihood of the data under the hypothesis.

        hypo: actual price
        data: my guess
        """
        actual_price = hypo
        my_guess = data

        error = my_guess - actual_price
        like = thinkbayes.EvalGaussianPdf(
            mu=0, 
            sigma=self.error_sigma,
            x=error)

        return like

x is the error; that is, how much my guess is off by.  like is the likelihood that I would be off by that much, computed by evaluating the density function of the Gaussian with sigma=error_sigma.

And we're done.  Let's see what the results look like.


The mean of the posterior distribution is 17800, substantially less than the prior mean (35000).  It is also substantially less than the result reported in the original article, near 28000.  So we are left with a suite of three hypotheses:

1) There is a mistake in my implementation.
2) There is a mistake in the other author's implementation.
3) We are actually making subtly different modeling assumptions.

Option #3 is possible because I am not sure I understand the model in the original article.  The author describes my beliefs about the snowmobile and the trip as "priors", which suggests that they are going to get updated.  In contrast, I am treating my guess about the prices as data (that is, a summary of what I learned by seeing the contents of the showcase), but I am also modeling myself as a measurement instrument with a characteristic distribution of errors.

Under my interpretation, the posterior shown above makes sense.  For example, if my guess is 15000, and the standard deviation of my error is about 3041, then it is very unlikely that I am off by 4 standard deviations, so the upper bound of the posterior should be around 15000 + 4 * 3041 ≈ 27200.  That makes the mean reported in the original article (28000) seem too high.

But maybe I am not interpreting the statement of the problem (or the model) as intended.  I will check in with my correspondent and update this article when we have an explanation!

UPDATE April 18, 2013:  I exchanged a few emails with the author of the original article, Cameron Davidson-Pilon.  He found a bug in his code that explains at least part of the difference between his results and mine.  So I think he is planning to update his article and the book he is working on.

He also sent me some data on the value of recent showcases on The Price is Right and the bids offered by the contestants.  The data were collected by Steve Gee and posted at The Price is Right Stats.

I have written code that uses this data to form the prior distribution of prices, and also to estimate the distribution of errors for the contestants.  And I am writing it up as a chapter in Think Bayes.  I'll post the new chapter here when it is done!

UPDATE April 22, 2013: I have added a chapter to Think Bayes, and I am publishing it as a two-part series, starting here.

Tuesday, April 9, 2013

Freshman hordes regress to the mean

More nones, no nuns

For several years I have been following one of the most under-reported stories of the decade: the fraction of college freshmen who report no religious preference has tripled since 1985, from 8% to 24%, and the trend is accelerating.

Two years ago I wrote Freshman hordes more godless than ever; last year I updated it with Freshman hordes even more godless.  Each year, the number of students with no religious preference increased, and the number attending religious services decreased.

In last year's installment, I made the bold prediction that the trend would continue, and that the students starting college in 2012 would again be the most godless ever.  It turns out I was wrong: attendance went up slightly, and the fraction of "Nones" dropped slightly, in both cases reverting toward long-term trends.

My analysis is based on survey results from the Cooperative Institutional Research Program (CIRP) of the Higher Education Research Institute (HERI).  In 2012, more than 190,000 students at 283 colleges and universities completed the CIRP Freshman Survey, which includes questions about students’ backgrounds, activities, and attitudes.

In one question, students select their “current religious preference,” from a choice of seventeen common religions, “Other religion,” or “None.”

Another question asks students how often they “attended a religious service” in the last year. The choices are “Frequently,” “Occasionally,” and “Not at all.” Students are instructed to select “Occasionally” if they attended one or more times.

The following figure shows the fraction of Nones over more than 40 years of the survey:


The blue line shows actual data through 2011; the red line shows a quadratic fit to the data.  The dark gray region shows a 90% confidence interval, taking into account sampling error, so it reflects uncertainty about the parameters of the fit.

The light gray region shows a 90% confidence interval taking into account both sampling error and residual error.  So it reflects total uncertainty about the predicted value, including uncertainty due to random variation from year to year.

We expect the new data point from 2012, shown as a blue square, to fall within the light gray interval, and it does.  In fact, at 23.8% it falls almost exactly on the fitted curve.

Here is the corresponding plot for attendance at religious services:

Again, the new data point for 2012, 26.8%, falls comfortably in the predicted range.  Don't listen to Nate Silver; prediction is easy :)

Predictions for 2013

Using the new 2012 data, we can generate predictions for 2013.  Here is the revised plot for "Nones":
The prediction for next year is that the fraction of Nones will hit a new all-time high at 25% (up from 23.8%).

And here is the prediction for "No attendance":

The prediction for 2013 is a small decrease to 26.6% (from 26.8%).  I'll be back next year to check on these predictions.


Other updates

1) This year the survey repeated two questions from 2010, asking students if they consider themselves "Born again Christian" or "Evangelical".  The fraction reporting "Born again" dropped from 22.8% to 20.2%.  The fraction who consider themselves Evangelical dropped from 8.9% to 8.5%.  But it's too early to declare a trend.

2) As always, more males than females report no religious preference.  The gender gap increased this year, but still falls in the predicted range, as shown in the following plot:
Evidence that the gender gap is increasing is strong.  The p-value of the slope of the fitted curve is less than 10^-5.


Data Source


The American Freshman: National Norms Fall 2012
Pryor, J.H., Eagan, K., Palucki Blake, L., Hurtado, S., Berdan, J., Case, M.H.
ISBN: 978-1-878477-22-4     90 pages.
Jan 2013


This and all previous reports are available from the HERI publications page.

Friday, March 22, 2013

Belly Button Biodiversity: Part Four


Well, I've started testing the predictions I made in my previous post, and exactly as I expected and deserved, I am getting killed.  The actual results pretty consistently show more species than I predicted, sometimes way more.

I have started the process of debugging the problem.  Of course, now that I have looked at the right answers, I can no longer use this data set for validation, especially since I plan to bang on my algorithm until it produces the right answers.

But in the interest of transparent science, I will at least document the debugging process.  My first step was to review the most recent (and least-tested) code for obvious bugs, and I found one.  I made an error parsing one of the data files, which had the effect of double-counting the total reads for each subject.  I fixed that, but it didn't help the results much.

To start the debugging process, I am looking at the various places where my predictions could go wrong:

0) The data could be wrong.  In particular, I assume that the rarefacted data I got from one data file is consistent with the complete dataset I got from another.

1) The posterior distribution could be right, but the predictive distribution could be wrong.

2) The posterior distribution might be wrong because of modeling errors.

3) The posterior distribution might be wrong because of implementation errors.

To check (0), I used the complete dataset to generate a few re-rarefacted datasets to see if my rarefaction process looks like theirs.  It does, so I accept the data, at least for now.

To check (1), I used the posterior distribution to generate a 90% credible interval for the total number of species.  Since the number of observed species in the complete dataset is necessarily less than the total number of species, the actual values should fall in or below the CIs, but in fact they often exceed the CIs, meaning that there are just more species than my algorithm expects.

While investigating (1) I discovered one problem.  In my prior distribution on the number of species, I was setting the upper bound too low, cutting off some values of n with non-negligible probability.  So I cranked it up high enough that any additional increase has no further effect on the results.  That helps, but my predictions are still too low.

The next step is to test (2).  I will generate simulated datasets, generate predictions, and then validate them.  Since the simulated data come straight from the model, there can be no modeling errors.  If I can't validate on simulated data, the problem has to be the algorithm or the implementation, not the model.

Of course, I should have done all this first, before blowing my testing data.

I won't have a chance to get back to this for a little while, but I'll update this post when I do.

Monday, February 18, 2013

Belly Button Biodiversity: Part Three

This is part three of a series of articles about a Bayesian solution to the Unseen Species problem, applied to data from the Belly Button Biodiversity project.

In Part One I presented the simplest version of the algorithm, which I think is easy to understand, but slow.  In Think Bayes I present some ways to optimize it.  In Part Two I apply the algorithm to real data and generate predictive distributions.  Now in Part Three, as promised, I publish the predictions the algorithm generates.  In Part Four I will compare the predictions to actual data.

Background: Belly Button Biodiversity 2.0 (BBB2) is a nation-wide citizen science project with the goal of identifying bacterial species that can be found in human navels (http://bbdata.yourwildlife.org).

Transparent science

In an effort to explore the limits of transparent science, I have started publishing my research in this blog as I go along.  This past summer I wrote a series of articles exploring the relationship between Internet use and religious disaffiliation.  This "publish as you go" model should help keep researchers honest.  Among other things, it might mitigate publication bias due to the "file drawer effect."  And if the data and code are published along with the results, that should help make experiments more reproducible.

Toward that end, I will now subject myself to public humiliation by generating a set of predictions using my almost-entirely-unvalidated solution to the Unseen Species problem.  In the next installment I will publish the correct answers and score my predictions.  Here are the details:

1) I am working with data from the Belly Button Biodiversity project; this data was used in a paper published in PLOS ONE and made available on the web pages of the journal and the researchers.  The data consists of rDNA "reads" from 60 subjects.  In order to facilitate comparisons between subjects, the researchers chose subjects with at least 400 reads, and for each subject they chose a random subset of 400 reads.  The data for the other reads was not published.

2) For each subject, I know the results of the 400 selected reads, and the total number of reads.  I will use my algorithm to generate a "prediction" for each subject, which is the number of additional species in the complete dataset.

3) Specifically, for each subject I will generate 9 posterior credible intervals (CIs) for the number of additional species: a 10% CI, a 20% CI, and so on up to a 90% CI.

4) To validate my predictions, I will count the number of CIs that contain the actual, correct value.  Ideally, 10% of the correct values should fall in the 10% CIs, 20% should fall in the 20% CIs, and so on.  Since the predictions and actual values are integers, a value that hits one end of a predicted CI counts as a half-hit.
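In code, the scoring rule might look like this; the intervals and actual values below are made up for illustration, not taken from the real predictions:

```python
def ci_score(ci, actual):
    # 1 for a value strictly inside the interval, 0.5 for a value
    # that hits an endpoint, 0 for a miss
    low, high = ci
    if actual == low or actual == high:
        return 0.5
    return 1 if low < actual < high else 0

# hypothetical 50% CIs and actual values for three subjects
cis = [(3, 6), (8, 14), (3, 7)]
actuals = [5, 14, 9]
hits = sum(ci_score(ci, a) for ci, a in zip(cis, actuals))
print(hits / float(len(cis)))   # 0.5
```

If the predictions are well calibrated, this hit rate should come out near the nominal level of the intervals.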

Predictions

And now, without further ado, here are my predictions.  The columns labelled 10, 20, etc. are 10% credible intervals, 20% CIs, and so on.


Code # reads # species 10 20 30 40 50 60 70 80 90
B1234 1392 48 (4, 4) (4, 5) (3, 5) (3, 5) (3, 6) (2, 6) (2, 7) (1, 7) (1, 9)
B1235 2452 69 (11, 12) (10, 12) (10, 13) (9, 13) (8, 14) (8, 15) (7, 16) (6, 17) (5, 19)
B1236 2964 45 (4, 5) (4, 5) (4, 6) (4, 6) (3, 7) (3, 7) (3, 8) (2, 9) (1, 10)
B1237 3090 62 (9, 10) (9, 11) (8, 11) (8, 11) (7, 12) (7, 12) (6, 13) (5, 14) (4, 16)
B1242 3056 61 (9, 9) (8, 10) (8, 10) (7, 11) (7, 11) (6, 12) (6, 14) (5, 15) (5, 16)
B1243 1518 71 (10, 11) (10, 12) (9, 12) (8, 13) (8, 13) (8, 14) (7, 15) (6, 16) (5, 17)
B1246 4230 91 (23, 24) (22, 25) (21, 26) (21, 27) (19, 28) (18, 29) (17, 30) (16, 33) (14, 35)
B1253 1928 86 (16, 17) (15, 18) (14, 18) (14, 20) (13, 20) (13, 21) (12, 23) (11, 24) (10, 26)
B1254 918 58 (5, 5) (4, 6) (4, 6) (3, 6) (3, 7) (3, 7) (2, 8) (2, 9) (1, 10)
B1258 1350 87 (15, 16) (14, 17) (14, 17) (13, 18) (12, 19) (11, 19) (11, 20) (10, 21) (8, 24)
B1259 1002 80 (10, 11) (10, 12) (10, 12) (9, 13) (9, 14) (8, 14) (7, 15) (6, 16) (6, 18)
B1260 1944 96 (22, 23) (21, 24) (20, 25) (19, 25) (19, 26) (18, 27) (17, 29) (15, 30) (14, 32)
B1264 1122 83 (12, 13) (12, 14) (11, 14) (10, 15) (10, 15) (9, 16) (8, 17) (7, 18) (6, 20)
B1265 2928 85 (18, 19) (17, 20) (16, 21) (16, 22) (15, 23) (14, 24) (13, 25) (12, 26) (11, 28)
B1273 2980 61 (9, 9) (8, 10) (8, 10) (7, 11) (7, 12) (6, 12) (6, 13) (5, 14) (4, 16)
B1275 1672 85 (16, 17) (15, 18) (15, 19) (14, 19) (13, 20) (13, 21) (12, 22) (11, 24) (9, 25)
B1278 1242 47 (4, 4) (3, 4) (3, 5) (3, 5) (2, 6) (2, 6) (2, 6) (2, 7) (1, 8)
B1280 1772 46 (4, 4) (4, 5) (3, 5) (3, 5) (3, 6) (2, 6) (2, 7) (2, 8) (1, 9)
B1282 1132 67 (8, 9) (7, 9) (7, 10) (6, 10) (6, 11) (6, 11) (5, 12) (5, 13) (3, 15)
B1283 1414 67 (8, 9) (8, 10) (7, 10) (7, 11) (7, 11) (6, 12) (5, 13) (4, 14) (3, 16)
B1284 1158 91 (15, 16) (14, 17) (14, 17) (13, 18) (13, 19) (12, 19) (12, 20) (10, 21) (9, 23)
B1285 2340 55 (7, 7) (6, 8) (6, 8) (5, 8) (5, 9) (4, 9) (4, 10) (3, 12) (2, 13)
B1286 1728 66 (9, 10) (9, 11) (8, 11) (8, 12) (8, 12) (7, 13) (6, 14) (6, 14) (4, 16)
B1288 1280 107 (23, 24) (22, 25) (21, 25) (21, 26) (20, 27) (19, 27) (18, 29) (17, 31) (15, 32)
B1289 2054 103 (26, 27) (25, 28) (24, 29) (23, 30) (23, 30) (22, 32) (21, 33) (20, 34) (17, 36)
B1291 1248 94 (17, 18) (16, 19) (16, 20) (15, 20) (15, 21) (13, 22) (13, 23) (12, 25) (10, 27)
B1292 1864 82 (15, 16) (14, 16) (13, 17) (13, 18) (13, 19) (12, 20) (11, 21) (10, 22) (9, 24)
B1293 1904 76 (13, 14) (12, 14) (12, 15) (11, 16) (11, 16) (10, 17) (9, 18) (8, 19) (7, 22)
B1294 1784 78 (14, 15) (13, 16) (12, 16) (12, 17) (11, 18) (11, 19) (10, 19) (9, 20) (8, 23)
B1295 1408 70 (10, 10) (9, 11) (9, 12) (8, 12) (8, 12) (7, 13) (7, 14) (6, 15) (4, 17)
B1296 2034 55 (7, 7) (6, 8) (6, 8) (6, 8) (5, 9) (4, 9) (4, 10) (4, 11) (3, 12)
B1298 1478 72 (10, 11) (9, 12) (9, 12) (9, 13) (8, 13) (8, 14) (7, 15) (6, 16) (5, 18)
B1308 1160 58 (6, 6) (5, 7) (5, 7) (5, 7) (4, 8) (4, 8) (3, 9) (3, 10) (2, 11)
B1310 1066 80 (11, 12) (11, 13) (10, 13) (9, 14) (9, 15) (8, 15) (7, 16) (7, 17) (5, 19)
B1374 2364 48 (5, 5) (4, 6) (4, 6) (4, 6) (3, 7) (3, 7) (3, 8) (2, 9) (2, 10)
B940 2874 93 (22, 24) (21, 25) (21, 25) (20, 26) (19, 27) (19, 28) (18, 30) (16, 32) (14, 33)
B941 2154 48 (5, 6) (5, 6) (4, 6) (4, 7) (4, 7) (3, 7) (3, 8) (2, 9) (2, 11)
B944 954 52 (4, 4) (4, 5) (3, 5) (3, 5) (3, 6) (2, 6) (2, 6) (2, 7) (1, 9)
B945 2390 67 (10, 11) (10, 12) (9, 12) (9, 13) (8, 13) (8, 14) (7, 15) (7, 16) (5, 17)
B946 5012 85 (20, 21) (19, 22) (19, 23) (18, 24) (18, 24) (17, 26) (16, 27) (15, 28) (12, 31)
B947 3356 62 (10, 11) (9, 11) (9, 12) (8, 12) (7, 13) (7, 14) (6, 14) (5, 15) (5, 17)
B948 2384 80 (16, 17) (15, 18) (14, 18) (14, 19) (13, 20) (12, 21) (11, 22) (10, 23) (9, 25)
B950 1560 63 (8, 9) (8, 10) (8, 10) (7, 10) (7, 11) (6, 11) (5, 12) (5, 13) (4, 15)
B952 1648 57 (7, 7) (6, 8) (6, 8) (6, 8) (5, 9) (5, 9) (4, 10) (3, 11) (3, 12)
B953 1452 32 (2, 2) (1, 2) (1, 2) (1, 3) (1, 3) (1, 3) (0, 3) (0, 4) (0, 5)
B954 1996 29 (2, 2) (1, 2) (1, 2) (1, 2) (1, 3) (1, 3) (0, 3) (0, 4) (0, 4)
B955 1474 65 (8, 9) (8, 9) (7, 9) (7, 10) (7, 10) (6, 11) (5, 12) (5, 13) (4, 14)
B956 1482 71 (10, 11) (10, 12) (10, 12) (9, 13) (8, 14) (8, 14) (7, 15) (6, 16) (5, 18)
B957 2604 36 (3, 3) (3, 3) (2, 4) (2, 4) (2, 5) (1, 5) (1, 6) (1, 6) (1, 7)
B958 2840 29 (2, 2) (2, 2) (1, 2) (1, 3) (1, 3) (1, 3) (1, 4) (0, 4) (0, 5)
B961 1214 36 (2, 3) (2, 3) (2, 3) (2, 4) (1, 4) (1, 4) (1, 5) (1, 5) (0, 6)
B962 1138 41 (3, 3) (2, 3) (2, 3) (2, 4) (2, 4) (1, 4) (1, 5) (1, 6) (0, 7)
B963 1600 71 (10, 12) (10, 12) (9, 12) (9, 13) (8, 14) (8, 14) (7, 15) (5, 16) (4, 19)
B966 1950 80 (15, 16) (14, 16) (14, 17) (13, 17) (12, 18) (11, 18) (11, 20) (10, 22) (9, 23)
B967 1108 47 (3, 4) (3, 4) (3, 4) (2, 5) (2, 5) (2, 5) (2, 6) (1, 7) (1, 7)
B968 2432 52 (6, 7) (6, 7) (6, 7) (5, 8) (5, 8) (4, 9) (3, 9) (3, 10) (2, 11)
B971 1462 49 (4, 5) (4, 5) (4, 5) (3, 6) (3, 6) (3, 7) (2, 7) (2, 8) (1, 9)
B972 1438 75 (11, 12) (11, 13) (10, 13) (10, 14) (9, 14) (8, 15) (8, 16) (7, 17) (6, 18)
B974 5072 54 (7, 8) (7, 9) (6, 9) (6, 9) (5, 10) (5, 10) (4, 11) (4, 12) (2, 14)
B975 1542 63 (7, 8) (7, 9) (6, 9) (6, 9) (6, 10) (5, 11) (5, 11) (4, 12) (3, 14)

Reading the last row, subject B975 yielded 1542 reads; in a random subset of 400 reads, there were 63 different species (or more precisely, OTUs).  My algorithm predicts that if we look at all 1542 reads, the number of additional species we'll find is between 3 and 14, with 90% confidence.

I have to say that this table fills me with dread.  The intervals seem quite small, which is to say that the algorithm is more confident than I am.  The 90% CIs seem especially narrow to me; it is hard for me to believe that 90% of them will contain the correct values.  Well, I guess that's why Karl Popper called them "bold hypotheses".  We'll find out soon whether they are bold, or just reckless.

I want to thank Rob Dunn at BBB2 for his help with this project.  The code and data I used to generate these results are available from this SVN repository.

EDIT 2-22-13: I ran the predictions again with more simulations.  The results are not substantially different.  I still haven't looked at the answers.

Friday, February 8, 2013

Belly Button Biodiversity: Part Two


This is part two of a series of articles about a Bayesian solution to the Unseen Species problem, applied to data from the Belly Button Biodiversity project.

In Part One I presented the simplest version of the algorithm, which I think is easy to understand, but slow.  In Think Bayes I present some ways to optimize it.  Now in Part Two I apply the algorithm to real data and generate predictive distributions.  In Part Three I will publish the predictions the algorithm generates, and in Part Four I will compare the predictions to actual data.

Background: Belly Button Biodiversity 2.0 (BBB2) is a nation-wide citizen science project with the goal of identifying bacterial species that can be found in human navels (http://bbdata.yourwildlife.org).

The belly button data


To get a sense of what the data look like, consider subject B1242, whose sample of 400 reads yielded 61 species with the following counts:
92, 53, 47, 38, 15, 14, 12, 10, 8, 7, 7, 5, 5, 
4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
There are a few dominant species that make up a substantial fraction of the whole, but many species that yielded only a single read. The number of these “singletons” suggests that there are likely to be at least a few unseen species.
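A quick check of the counts listed above makes the shape of the data concrete:

```python
# the counts for subject B1242, reconstructed from the list above
counts = ([92, 53, 47, 38, 15, 14, 12, 10, 8, 7, 7, 5, 5] +
          [4] * 7 + [3] * 7 + [2] * 4 + [1] * 30)

print(len(counts))       # 61 species
print(sum(counts))       # 400 reads
print(counts.count(1))   # 30 singletons
```

Almost half of the observed species are singletons, which is why we should expect unseen species.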
In the example with lions and tigers, we assume that each animal in the preserve is equally likely to be observed. Similarly, for the belly button data, we assume that each bacterium is equally likely to yield a read.
In reality, it is possible that each step in the data-collection process might introduce consistent biases. Some species might be more likely to be picked up by a swab, or to yield identifiable amplicons. So when we talk about the prevalence of each species, we should remember this source of error.
I should also acknowledge that I am using the term “species” loosely. First, bacterial species are not well-defined. Second, some reads identify a particular species, others only identify a genus. To be more precise, I should say “operational taxonomic unit”, or OTU.
Now let’s process some of the belly button data. I defined a class called Subject to represent information about each subject in the study:
class Subject(object):

    def __init__(self, code):
        self.code = code
        self.species = []
Each subject has a string code, like “B1242”, and a list of (count, species name) pairs, sorted in increasing order by count. Subject provides several methods to make it easy to access these counts and species names. You can see the details in http://thinkbayes.com/species.py.

Figure 12.3: Distribution of n for subject B1242.

In addition, Subject.Process creates a suite, specifically a suite of type Species5, which represents the distribution of n and the prevalences after processing the data.
It also provides PlotDistOfN, which plots the posterior distribution of n. Figure 12.3 shows this distribution for subject B1242. The probability that there are exactly 61 species, and no unseen species, is nearly zero. The most likely value is 72, with 90% credible interval 66 to 79. At the high end, it is unlikely that there are as many as 87 species.
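For readers who want to see the mechanics, here is one way to read a central credible interval off a posterior distribution; the Pmf here is a plain dict with made-up probabilities, not the actual posterior for B1242:

```python
def credible_interval(pmf, percentage=90):
    # pmf: dict mapping value -> probability
    # scan the CDF for the two tail percentiles
    alpha = (1 - percentage / 100.0) / 2
    low = high = None
    total = 0.0
    for value in sorted(pmf):
        total += pmf[value]
        if low is None and total >= alpha:
            low = value
        if high is None and total >= 1 - alpha:
            high = value
    return low, high

# hypothetical posterior on n, peaked near 72
pmf = {70: 0.1, 71: 0.2, 72: 0.4, 73: 0.2, 74: 0.1}
print(credible_interval(pmf, 90))   # (70, 74)
```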
Next we compute the posterior distribution of prevalence for each species. Species2 provides DistOfPrevalence:
# class Species2

    def DistOfPrevalence(self, index):
        pmfs = thinkbayes.Pmf()

        for n, prob in zip(self.ns, self.probs):
            beta = self.MarginalBeta(n, index)
            pmf = beta.MakePmf()
            pmfs.Set(pmf, prob)

        mix = thinkbayes.MakeMixture(pmfs)
        return pmfs, mix
index indicates which species we want. For each value of n, we have a different posterior distribution of prevalence.

Figure 12.4: Distribution of prevalences for subject B1242.

So the loop iterates through the possible values of n and their probabilities. For each value of n it gets a Beta object representing the marginal distribution for the indicated species. Remember that Beta objects contain the parameters alpha and beta; they don’t have values and probabilities like a Pmf, but they provide MakePmf which generates a discrete approximation to the continuous beta distribution.
pmfs is a MetaPmf that contains the distributions of prevalence, conditioned on n. MakeMixture combines the MetaPmf into mix, a single distribution of prevalence that averages over the conditional distributions.
Figure 12.4 shows these distributions for the five species with the most reads. The most prevalent species accounts for 23% of the 400 reads, but since there are almost certainly unseen species, the most likely estimate for its prevalence is 20%, with 90% credible interval between 17% and 23%.
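The mixture step can be sketched with plain dicts standing in for Pmf and MetaPmf; this is a simplified stand-in, not the implementation in thinkbayes:

```python
def make_mixture(pairs):
    # pairs: list of (pmf, weight), where each pmf is a dict mapping
    # value -> probability; returns the weighted mixture as one dict
    mix = {}
    for pmf, weight in pairs:
        for x, p in pmf.items():
            mix[x] = mix.get(x, 0.0) + weight * p
    return mix

# two conditional distributions of prevalence, weighted by P(n)
pairs = [({0.1: 0.5, 0.2: 0.5}, 0.5),
         ({0.1: 1.0}, 0.5)]
print(make_mixture(pairs))   # {0.1: 0.75, 0.2: 0.25}
```

Each conditional distribution contributes in proportion to the posterior probability of its value of n.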

Predictive distributions


Figure 12.5: Simulated rarefaction curves for subject B1242.

I introduced the hidden species problem in the form of four related questions. We have answered the first two by computing the posterior distribution for n and the prevalence of each species.
The other two questions are:
  • If we are planning to collect additional samples, can we predict how many new species we are likely to discover?
  • How many additional reads are needed to increase the fraction of observed species to a given threshold?
To answer predictive questions like this we can use the posterior distributions to simulate possible future events and compute predictive distributions for the number of species, and fraction of the total, we are likely to see.
The kernel of these simulations looks like this:
  1. Choose n from its posterior distribution.
  2. Choose a prevalence for each species, including possible unseen species, using the Dirichlet distribution.
  3. Generate a random sequence of future observations.
  4. Compute the number of new species, num_new, as a function of the number of additional samples, k.
  5. Repeat the previous steps and accumulate the joint distribution of num_new and k.
And here’s the code. RunSimulation runs a single simulation:
# class Subject

    def RunSimulation(self, num_samples):
        m, seen = self.GetSeenSpecies()
        n, observations = self.GenerateObservations(num_samples)

        curve = []
        for k, obs in enumerate(observations):
            seen.add(obs)

            num_new = len(seen) - m
            curve.append((k+1, num_new))

        return curve
num_samples is the number of additional samples to simulate. m is the number of seen species, and seen is a set of strings with a unique name for each species. n is a random value from the posterior distribution, and observations is a random sequence of species names.
The result of RunSimulation is a “rarefaction curve”, represented as a list of pairs with the number of samples and the number of new species seen.
Before we see the results, let’s look at GetSeenSpecies and GenerateObservations.
#class Subject

    def GetSeenSpecies(self):
        names = self.GetNames()
        m = len(names)
        seen = set(SpeciesGenerator(names, m))
        return m, seen
GetNames returns the list of species names that appear in the data files, but for many subjects these names are not unique. So I use SpeciesGenerator to extend each name with a serial number:
def SpeciesGenerator(names, num):
    i = 0
    for name in names:
        yield '%s-%d' % (name, i)
        i += 1

    while i < num:
        yield 'unseen-%d' % i
        i += 1
Given a name like Corynebacterium, SpeciesGenerator yields names like Corynebacterium-1. When the list of names is exhausted, it yields names like unseen-62.
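A quick check of the generator's behavior, with hypothetical species names:

```python
def SpeciesGenerator(names, num):
    # extend each observed name with a serial number, then
    # pad out to num names for the unseen species
    i = 0
    for name in names:
        yield '%s-%d' % (name, i)
        i += 1

    while i < num:
        yield 'unseen-%d' % i
        i += 1

print(list(SpeciesGenerator(['Corynebacterium', 'Staphylococcus'], 4)))
# ['Corynebacterium-0', 'Staphylococcus-1', 'unseen-2', 'unseen-3']
```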
Here is GenerateObservations:
# class Subject

    def GenerateObservations(self, num_samples):
        n, prevalences = self.suite.Sample()

        names = self.GetNames()
        name_iter = SpeciesGenerator(names, n)

        d = dict(zip(name_iter, prevalences))
        cdf = thinkbayes.MakeCdfFromDict(d)
        observations = cdf.Sample(num_samples)

        return n, observations
Again, num_samples is the number of additional samples to generate. n and prevalences are samples from the posterior distribution.
cdf is a Cdf object that maps species names, including the unseen, to cumulative probabilities. Using a Cdf makes it efficient to generate a random sequence of species names.
Finally, here is Species2.Sample:
    def Sample(self):
        pmf = self.DistOfN()
        n = pmf.Random()
        prevalences = self.SampleConditional(n)
        return n, prevalences
And SampleConditional, which generates a sample of prevalences conditioned on n:
# class Species2

    def SampleConditional(self, n):
        params = self.params[:n]
        gammas = numpy.random.gamma(params)
        gammas /= gammas.sum()
        return gammas
We saw this algorithm for generating prevalences previously in Species2.SampleLikelihood.
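Normalizing gamma variates like this is the standard way to draw a sample from a Dirichlet distribution. Here is a self-contained check with made-up parameters:

```python
import numpy

# hypothetical Dirichlet parameters: observed counts plus one
params = numpy.array([92, 53, 47, 10, 1, 1]) + 1
gammas = numpy.random.gamma(params)
prevalences = gammas / gammas.sum()

print(round(prevalences.sum(), 6))   # 1.0
print((prevalences >= 0).all())      # True
```

The resulting vector is a valid set of prevalences: non-negative and summing to one, with species that have more reads tending to get higher prevalence.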
Figure 12.5 shows 100 simulated rarefaction curves for subject B1242. I shifted each curve by a random offset so they would not all overlap. By inspection we can estimate that after 400 more samples we are likely to find 2–6 new species.

Joint posterior


Figure 12.6: Distributions of the number of new species conditioned on the number of additional samples.

To be more precise, we can use the simulations to estimate the joint distribution of num_new and k, and from that we can get the distribution of num_new conditioned on any value of k.
# class Subject

    def MakeJointPredictive(self, curves):
        joint = thinkbayes.Joint()
        for curve in curves:
            for k, num_new in curve:
                joint.Incr((k, num_new))
        joint.Normalize()
        return joint
MakeJointPredictive makes a Joint object, which is a Pmf whose values are tuples.
curves is a list of rarefaction curves created by RunSimulation. Each curve contains a list of pairs of k and num_new.
The resulting joint distribution is a map from each pair to its probability of occurring. Given the joint distribution, we can get the distribution of num_new conditioned on k:
# class Joint

    def Conditional(self, i, j, val):
        pmf = Pmf()
        for vs, prob in self.Items():
            if vs[j] != val: continue
            pmf.Incr(vs[i], prob)

        pmf.Normalize()
        return pmf
i is the index of the variable whose distribution we want; j is the index of the conditional variable, and val is the value the jth variable has to have. You can think of this operation as taking vertical slices out of Figure 12.5.
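Here is a simplified stand-in for Conditional that works on a plain dict of tuples; the joint distribution below is made up for illustration:

```python
def conditional(joint, i, j, val):
    # joint: dict mapping tuples to probabilities; returns the
    # distribution of variable i given that variable j equals val
    pmf = {}
    for vs, prob in joint.items():
        if vs[j] != val:
            continue
        pmf[vs[i]] = pmf.get(vs[i], 0.0) + prob

    total = sum(pmf.values())
    return dict((x, p / total) for x, p in pmf.items())

# hypothetical joint distribution of (k, num_new)
joint = {(100, 2): 0.3, (100, 3): 0.2, (200, 4): 0.5}
print(conditional(joint, i=1, j=0, val=100))   # {2: 0.6, 3: 0.4}
```

Note the renormalization at the end: after discarding the pairs that don't match, the remaining probabilities have to be scaled up to sum to one.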
Subject.MakeConditionals takes a list of ks and computes the conditional distribution of num_new for each k. The result is a list of Cdf objects.
# class Subject

    def MakeConditionals(self, curves, ks):
        joint = self.MakeJointPredictive(curves)

        cdfs = []
        for k in ks:
            pmf = joint.Conditional(1, 0, k)
            pmf.name = 'k=%d' % k
            cdf = thinkbayes.MakeCdfFromPmf(pmf)
            cdfs.append(cdf)

        return cdfs
Figure 12.6 shows the results. After 100 samples, the median predicted number of new species is 2; the 90% credible interval is 0 to 5. After 800 samples, we expect to see 3 to 12 new species.

Coverage


Figure 12.7: Complementary CDF of coverage for a range of additional samples.

The last question we want to answer is, “How many additional reads are needed to increase the fraction of observed species to a given threshold?”
To answer this question, we’ll need a version of RunSimulation that computes the fraction of observed species rather than the number of new species.
# class Subject

    def RunSimulation(self, num_samples):
        m, seen = self.GetSeenSpecies()
        n, observations = self.GenerateObservations(num_samples)

        curve = []
        for k, obs in enumerate(observations):
            seen.add(obs)

            frac_seen = len(seen) / float(n)
            curve.append((k+1, frac_seen))

        return curve
Next we loop through each curve and make a dictionary, d, that maps from the number of additional samples, k, to a list of fracs; that is, a list of values for the coverage achieved after k samples.
    def MakeFracCdfs(self, curves):
        d = {}
        for curve in curves:
            for k, frac in curve:
                d.setdefault(k, []).append(frac)

        cdfs = {}
        for k, fracs in d.iteritems():
            cdf = thinkbayes.MakeCdfFromList(fracs)
            cdfs[k] = cdf

        return cdfs
Then for each value of k we make a Cdf of fracs; this Cdf represents the distribution of coverage after k samples.
Remember that the CDF tells you the probability of falling below a given threshold, so the complementary CDF tells you the probability of exceeding it. Figure 12.7 shows complementary CDFs for a range of values of k.
To read this figure, select the level of coverage you want to achieve along the x-axis. As an example, choose 90%.
Now you can read up the chart to find the probability of achieving 90% coverage after k samples. For example, with 300 samples, you have about a 60% chance of getting 90% coverage. With 700 samples, you have a 90% chance of getting 90% coverage.
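The same reading can be reproduced numerically; the coverage fractions below are made up for illustration, standing in for the results of many simulations with a fixed k:

```python
# hypothetical coverage fractions after k additional samples,
# one per simulated rarefaction curve
fracs = [0.85, 0.87, 0.88, 0.90, 0.91, 0.92, 0.93, 0.95]

def prob_exceeds(fracs, threshold):
    # complementary CDF: fraction of simulations at or above threshold
    return sum(1 for f in fracs if f >= threshold) / float(len(fracs))

print(prob_exceeds(fracs, 0.90))   # 0.625
```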
With that, we have answered the four questions that make up the unseen species problem. Next time: validation!