March 22, 2013
Well, I've started testing the predictions I made in my previous post, and exactly as I expected and deserved, I am getting killed. The actual results pretty consistently show more species than I predicted, sometimes way more.
I have started the process of debugging the problem. Of course, now that I have looked at the right answers, I can no longer use this data set for validation, especially since I plan to bang on my algorithm until it produces the right answers.
But in the interest of transparent science, I will at least document the debugging process. My first step was to review the most recent (and least-tested) code for obvious bugs, and I found one. I made an error parsing one of the data files, which had the effect of double-counting the total reads for each subject. I fixed that, but it didn't help the results much.
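The details of that bug aren't interesting, but here is a toy illustration of the general kind of error, in Python; the file format and the specific mistake are invented for illustration, not taken from the actual data files.

```python
from collections import defaultdict

def total_reads(lines, skip_total_rows=True):
    """Each line looks like 'subject_id,species,count'.  If a file also
    carries a per-subject 'TOTAL' row, naively summing every row counts
    each subject's reads twice."""
    totals = defaultdict(int)
    for line in lines:
        subject, species, count = line.strip().split(',')
        if skip_total_rows and species == 'TOTAL':
            continue  # the fix: don't add the total row on top of the species rows
        totals[subject] += int(count)
    return totals

# Without the fix, subject S1 appears to have 60 reads instead of 30.
lines = ['S1,speciesA,10', 'S1,speciesB,20', 'S1,TOTAL,30']
print(dict(total_reads(lines)))                         # {'S1': 30}
print(dict(total_reads(lines, skip_total_rows=False)))  # {'S1': 60}
```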
To organize the rest of the debugging, I am looking at the various places where my predictions could go wrong:
0) The data could be wrong. In particular, I assume that the rarefied data I got from one data file are consistent with the complete dataset I got from another.
1) The posterior distribution could be right, but the predictive distribution could be wrong.
2) The posterior distribution might be wrong because of modeling errors.
3) The posterior distribution might be wrong because of implementation errors.
To check (0), I used the complete dataset to generate a few re-rarefied datasets to see if my rarefaction process looks like theirs. It does, so I accept the data, at least for now.
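The check itself is simple. Here is a sketch, assuming each subject's complete data is available as a map from species to read count; the names and data structures are mine, not the ones in the actual files.

```python
import random

def rarefy(species_counts, sample_size):
    """species_counts maps species -> read count for one subject.
    Draw sample_size reads without replacement and count the distinct
    species that appear in the subsample."""
    reads = [species for species, n in species_counts.items()
             for _ in range(n)]
    subsample = random.sample(reads, sample_size)
    return len(set(subsample))

# Repeating the subsampling a few times gives a range of species counts
# to compare against the number of species in the rarefied file.
counts = {'speciesA': 120, 'speciesB': 35, 'speciesC': 3}   # toy data
print([rarefy(counts, 50) for _ in range(5)])
```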
To check (1), I used the posterior distribution to generate a 90% credible interval for the total number of species. Since the number of species observed in the complete dataset can be no more than the total number of species, the actual values should fall within or below the CIs. In fact they often exceed the upper bounds, which means there are simply more species than my algorithm expects.
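For concreteness, here is a sketch of reading a central credible interval off a posterior represented as a map from each candidate number of species n to its probability; this is an illustration, not the actual data structure in my code.

```python
def credible_interval(posterior, prob=0.9):
    """posterior maps each candidate number of species n to its probability.
    Returns the central `prob` credible interval as (low, high)."""
    tail = (1 - prob) / 2
    total = 0.0
    low = high = None
    for n in sorted(posterior):
        total += posterior[n]
        if low is None and total >= tail:
            low = n
        if high is None and total >= 1 - tail:
            high = n
    return low, high

# Toy posterior, not real results.
posterior = {40: 0.05, 45: 0.20, 50: 0.40, 55: 0.25, 60: 0.10}
low, high = credible_interval(posterior)
# The observed species count can never exceed the true total, so it
# should be at or below `high`; when it isn't, the posterior is putting
# too little weight on large n.
```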
While investigating (1) I discovered one problem. In my prior distribution on the number of species, I was setting the upper bound too low, cutting off some values of n with non-negligible probability. So I cranked it up high enough that any additional increase has no effect on the results. That helps, but my predictions are still too low.
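Checking that the bound is high enough amounts to a simple sensitivity test, sketched below; run_analysis here is a placeholder for the real analysis, not a function from my code.

```python
def find_stable_upper_bound(run_analysis, start=100, tol=0.001, max_doublings=20):
    """run_analysis(upper_bound) should return the upper end of the
    credible interval on the number of species.  Keep doubling the prior's
    upper bound until the result stops changing (relative change < tol)."""
    bound = start
    prev = run_analysis(bound)
    for _ in range(max_doublings):
        bound *= 2
        curr = run_analysis(bound)
        if abs(curr - prev) <= tol * abs(prev):
            return bound, curr
        prev = curr
    raise ValueError('results never stabilized; something else is wrong')
```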
The next step is to test (2). I will generate simulated datasets, make predictions, and then validate them. Since the simulated data come straight from the model, there can be no modeling errors. If I can't validate on simulated data, the problem has to be in the algorithm or the implementation, not the model.
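The logic of that test looks something like the sketch below, where draw_from_prior, simulate_dataset, and estimate_interval are placeholders for the real model and estimator, not functions from my code.

```python
def check_calibration(draw_from_prior, simulate_dataset, estimate_interval,
                      iters=100, prob=0.9):
    """If the model and the implementation agree, roughly `prob` of the
    credible intervals should contain the true number of species."""
    hits = 0
    for _ in range(iters):
        true_n, params = draw_from_prior()    # true number of species and parameters
        dataset = simulate_dataset(params)    # data generated by the model itself
        low, high = estimate_interval(dataset, prob)
        if low <= true_n <= high:
            hits += 1
    return hits / float(iters)                # should be close to `prob`
```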
Of course, I should have done all this first, before blowing my testing data.
I won't have a chance to get back to this for a little while, but I'll update this post when I do.