Thursday, October 6, 2011

Repeated tests: how bad can it be?

A couple of weeks ago Red Dave wrote this blog post about the dangers of A/B testing with repeated tests.  I think he's right, and does a good job of explaining why.  But to get a better understanding of the issue, I wrote some simulations, and things can be even worse than Dave suggests.

As an example of A/B testing, suppose you are advertising a business and you develop two versions of an ad, A and B, with different text, graphics, etc.  You want to know which ad is more effective.

If you post your ads online, you can find out how many times each version was shown (number of impressions), and how many people clicked on each ad (number of clicks).  The number of clicks per impression is the "click-through rate," or CTR.  At least, that's what Google calls it.

If you show each ad 1000 times, you might see a CTR of 1% for Version A and 2% for Version B.  To see whether this difference is significant, you can use a chi-square test, which compares the actual number of clicks for each ad with the expected number.  The result is a p-value, which tells you the chance of seeing a difference as big as that if the ads are actually the same.  If the p-value is small, the effect is unlikely to be due to chance, so it is more likely to be real.

So far, this is just classical hypothesis testing.  No problem.  Here's where things get ugly.

If, instead of showing each ad 1000 times, you run the ad for a while, and then check for a statistically-significant difference, run a while longer, check again, and repeat until you get a significant difference, then what you are doing is Bogus.

And when I say "Bogus", I mean capital-B Bogus, because the p-values you compute are not going to be off by a factor of 2, which would actually be acceptable for most purposes.  They are not going to be off by a factor of 10, which would make them meaningless.  They are going to be completely and totally wrong, to the point where, if you run long enough, you are nearly certain to see a "statistically significant" difference, even if both versions are the same.

To see why, let's look at the case where you check for significance after every ad impression.  Suppose you show version A or version B with equal probability, and in either case the click-through rate is 50%.  After 100 impressions, you expect to show A and B 50 times each, and you expect 25 clicks on each.

After each impression, we can compute the CTR for A and B, and plot the difference as a random walk.  Here's what it looks like for 10 simulations of 200 impressions:


When the number of trials is small, the difference in CTR can be big, but for n>50 the law of large numbers kicks in and the difference gets small.
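Here is one way such a walk can be generated (a minimal, illustrative sketch; the function name is mine and is not necessarily what appears in khan.py):

import random

def simulate_ctr_walk(num_impressions=200, ctr=0.5):
    """Simulates one A/B test and returns the running difference in CTR.

    Each impression shows version A or B with equal probability;
    a click occurs with probability ctr for either version.
    """
    shows = dict(A=0, B=0)
    clicks = dict(A=0, B=0)
    diffs = []
    for i in range(num_impressions):
        version = random.choice('AB')
        shows[version] += 1
        if random.random() < ctr:
            clicks[version] += 1
        # difference in click-through rates so far (0 if a version is unseen)
        rate_a = float(clicks['A']) / shows['A'] if shows['A'] else 0.0
        rate_b = float(clicks['B']) / shows['B'] if shows['B'] else 0.0
        diffs.append(rate_a - rate_b)
    return diffs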

Instead of computing the difference, we can compute the chi-square statistic, which yields another random walk.  Here's what it looks like after 1000 steps:


After an initial transient, the distribution of the lines stabilizes, which shows that for large N, the distribution of the chi-square stat does not depend on N, which is exactly why we can use the chi-square distribution to compute p-values analytically.

SciPy provides a function that evaluates the chi-squared CDF.  The following function uses it to look up chi-squared values and return the corresponding p-value:

import scipy.stats


def Pvalue(chi2, df):
    """Probability of getting chi2 or greater from a chi-squared distribution.

    chi2: observed chi-squared statistic
    df: degrees of freedom
    """
    return 1 - scipy.stats.chi2.cdf(chi2, df)


In this example, the number of degrees of freedom is 3, because there are four values in the table (hits and misses for A and B), but they are constrained to add up to N.
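To make that table explicit, here is an illustrative helper (my own sketch, not necessarily the code in khan.py) that computes the chi-squared statistic from the four counts; under the null hypothesis each version is shown with probability 0.5 and clicked with probability 0.5, so each cell expects a quarter of the impressions.

def chi_squared_stat(clicks_a, shows_a, clicks_b, shows_b):
    """Computes the chi-squared statistic for the four-cell table
    (clicks and non-clicks for A and B).

    Under the null hypothesis each cell expects n/4, where n is the
    total number of impressions.
    """
    n = shows_a + shows_b
    observed = [clicks_a, shows_a - clicks_a,
                clicks_b, shows_b - clicks_b]
    expected = [n * 0.25] * 4
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

With the Pvalue function above, Pvalue(chi_squared_stat(clicks_a, shows_a, clicks_b, shows_b), 3) gives the corresponding p-value.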

Now here's the key question: if we take values that are actually from a chi-squared distribution and look up the corresponding p-values, what is the distribution of the resulting p-values?

Answer: they are uniformly distributed.  Why?  Because we are using the chi-squared CDF to get their percentile ranks.  By the definition of the CDF, 1% of them fall below the 1st percentile, 2% of them fall below the 2nd percentile, and so on.

If we plot the p-values as a random walk, here's what it looks like (20 simulations, 2000 trials):

For large N, the lines are spread out uniformly from 0 to 1, as expected.

The red line shows the significance criterion, α, which determines the false positive rate.  If α=5%, then at any point in time, we expect 5% of the lines to be below this threshold.  And in the figure, that looks about right.

If we stop the test at a pre-determined time, or at a randomly-chosen time, we expect to find one line out of twenty below the threshold.  But if you do repeated tests, checking for significance after every trial, then the relevant question is how many of these lines cross below the threshold, ever?

In this example, it is almost 50%.  After 10000 trials, it is 85%.  As the number of trials increases, the false positive rate approaches 100%.
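Here is a sketch of the repeated-testing simulation (again illustrative, reusing chi_squared_stat and Pvalue from above): it runs one experiment where A and B are identical, checks for significance after every impression, and reports whether the p-value ever drops below the threshold.

import random

def ever_significant(num_impressions=2000, ctr=0.5, alpha=0.05):
    """Simulates one A/B test where A and B are identical and returns
    True if the p-value ever drops below alpha.
    """
    shows = dict(A=0, B=0)
    clicks = dict(A=0, B=0)
    for i in range(num_impressions):
        version = random.choice('AB')
        shows[version] += 1
        if random.random() < ctr:
            clicks[version] += 1
        chi2 = chi_squared_stat(clicks['A'], shows['A'],
                                clicks['B'], shows['B'])
        if Pvalue(chi2, 3) < alpha:
            return True
    return False

# fraction of experiments that ever look "significant"
false_positive_rate = sum(ever_significant() for i in range(100)) / 100.0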

A standard quip among statisticians goes like this: "A frequentist is a person whose long-run ambition is to be wrong 5% of the time."  Here's an update: "A/B testing with repeated tests is totally legitimate, if your long-run ambition is to be wrong 100% of the time."

You can download the code I used to generate these figures from thinkstats.com/khan.py.  The file is named after the Khan Academy because that's the context of the blog post that brought the problem to my attention.

Note: in these examples, I used a CTR of 50%, which is much higher than you are likely to get with an online ad campaign: 1% is considered good.  If you run these simulations with more realistic CTRs, the random walks get jaggier, due to the statistics of small numbers, but the overall effect is the same.

UPDATE August 5, 2013:  In the comments below, I got the following inquiry:

The d.f. for chi square is 1, not 3, as the null hypothesis of independence adds 2 more constraints. For JxK 2-way tables, df for independence = (I-1)(J-1). See any stat text, e.g. A. Agresti, 2002, Categorical Data Analysis, 2nd Ed, Wiley, p. 79

I maintain that the argument I made above is correct: there are four numbers in the table, but because they are constrained to add up to N, there are three degrees of freedom.  But getting df right is notoriously tricky, so let's check.  I ran 100 simulations, computed the resulting chi2 values and plotted their CDF.  Then I compared it to chi2 distributions with df = 1, 2, 3.  The figure below shows the result:


This figure shows that the values from my simulation are consistent with the chi2 distribution with df=3 (and substantially different from a chi2 distribution with df=1).
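Here is roughly how such a check can be done (an illustrative sketch): simulate many chi-squared statistics with identical versions, then measure how far their empirical CDF is from each analytic chi-squared CDF.

import numpy
import scipy.stats

def max_cdf_difference(simulated_chi2s, df):
    """Returns the largest vertical distance (KS-style) between the
    empirical CDF of the simulated statistics and the analytic
    chi-squared CDF with the given degrees of freedom.
    """
    xs = numpy.sort(numpy.asarray(simulated_chi2s, dtype=float))
    empirical = numpy.arange(1, len(xs) + 1) / float(len(xs))
    analytic = scipy.stats.chi2.cdf(xs, df)
    return numpy.max(numpy.abs(empirical - analytic))

The df whose analytic CDF is closest to the empirical CDF is the best candidate; in this case, df=3.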

Monday, August 29, 2011

The Jimmy Nut Company problem

I read with interest an article by Robert Hayden called “Advice to mathematics teachers on evaluating introductory statistics textbooks.” It makes two claims:
  • Statistics textbooks should emphasize real problems with real data. I couldn't agree more.
  • You should choose a statistics textbook written by a statistician at a large university who has good research publications. I couldn't agree less.
On the second point, Prof. Hayden argues that statistics books contain more errors than other math books, and that the errors are more often conceptual, not just typos.  The reason, he explains, is that “introductory statistics textbooks are often written by people with little or no training in statistics.”

As an example (in fact, the only example) he cites this problem from an unnamed popular textbook:


“Jimmy Nut Company advertises that their nut mix contains 40% cashews, 15% brazil nuts, 20% almonds, and only 25% peanuts. The truth in advertising investigators took a random sample (of size 20 lb) of the nut mix and found the distribution to be as follows:

Cashews    Brazil nuts    Almonds    Peanuts
6 lb       3 lb           5 lb       6 lb

At the 0.01 level of significance, is the claim made by Jimmy Nuts true?”

Prof. Hayden hates this problem.  He thinks it is ill-posed, and evidence of everything wrong with statistics books.  He explains: “The problem here is that the chi-squared goodness of fit test applies only to categorical (discrete) data. It compares the actually observed counts in each category to the counts we would expect if the hypothesis being tested were true.”


This is true, and if you blindly plug the numbers from the problem into the formula for a chi-squared statistic, the result is meaningless.  In fact, if you express the given data in other units, you can make the resulting p-value as big or small as you like.


But that doesn’t mean the textbook author has made a “gross error,” because the question doesn’t say “plug these numbers in blindly and write down the result without thinking.”  The question doesn’t even specify the chi-squared statistic.


The question, as written, is perfectly sensible, moderately challenging, and interesting.  It requires thought and care to formulate the question in a way that leads to a meaningful answer, and it is exactly the kind of question someone doing statistics is likely to be asked.


I don’t know what the textbook author had in mind, but there is nothing wrong with the question as written.  And to prove it, I’m going to answer it. I follow the procedure I recommend in my article "There is only one test."


The first step is to make a model of the null hypothesis, which is that the proportion of nuts is as advertised.  But if the proportions are always as advertised, the data observed would be impossible.  So we need a random model that characterizes the variation we expect to see in a sample.


What are the sources of variation in a 20 lb sample of nuts?  One source is mechanical error; that is, the people and machines that weigh and mix nuts might be miscalibrated.  If the accuracy of this process varies from day to day, we could treat it as a random variation.  To quantify this effect, we would need some information about how the nuts are processed.


Another source of error, and probably the one the textbook author had in mind, is discrete sampling.  If we take a small sample, we would not expect the proportions to be exactly as advertised. To model this kind of variation, we have to count individual nuts.


To convert the data from weights to counts, we have to know how much each kind of nut weighs.  This step requires some research, which I think is an interesting element of the problem.  Since we live in a post-Google information utopia, it is reasonable to expect students to answer this question, and useful for them to practice.


I found several sources with information about the weight of typical nuts, some more authoritative than others.  The one I use is “Nuts for Nutrition,” by Alice Henneman, MS, RD, Extension Educator at the University of Nebraska Cooperative Extension - Lancaster County.  In addition to the credentials and affiliation of the author, and the witty article title, I chose this source because it provides numbers for all four nuts in one table, so even if the methodology that produced them is not ideal, it is probably consistent, which means that the relative values are likely to be good even if the absolute values are not. From the table:

Nuts per ounce: Cashew 16-18, Brazil nut 6-8, Almonds 20-24, Peanuts 28.


To start I use the middle of each range. With this data we can convert observed values from pounds to counts and convert the expected proportions from “by weight” to “by count”.  Here’s a function that does it:


def ConvertToCount(sample, count_per):
    """Convert from weight to count.

    sample: Hist that maps from category to weight in pounds
    count_per: dict that maps from category to count per ounce
    """
    for value, count in sample.Items():
        # 16 ounces per pound times nuts per ounce gives nuts per pound
        sample.Mult(value, 16 * count_per[value])


The code examples here use the Pmf and Cdf libraries from Think Stats. Here’s the code that processes the information we’re given:

  
sample = dict(cashew=6, brazil=3, almond=5, peanut=6)
count_per = dict(cashew=17, brazil=7, almond=22, peanut=28)

observed = Pmf.MakeHistFromDict(sample)
ConvertToCount(observed, count_per)

advertised = dict(cashew=40, brazil=15, almond=20, peanut=25)
expected = Pmf.MakePmfFromDict(advertised)
ConvertToCount(expected, count_per)
expected.Normalize(observed.Total())



Here are the results in tabular form:


Nut       Expected    Observed    % error    Price per pound
cashew    2266        1632        -28%       $10
brazil     350         336         -4%        $8
almond    1467        1760        +20%        $8
peanut    2333        2688        +15%        $3

So there are 28% fewer cashews than expected, and 20% more almonds.  Is this suspicious?  Intuitively, this looks like more deviation than I expect in a sample this large.  And the relationship between “error” and price makes it look even worse.


Let’s make that more rigorous.  The next step is to choose a test statistic.  Mean relative error would be a reasonable choice.  Or I could devise a statistic that measures the excess of cheap nuts and lack of expensive nuts.  But to keep it simple I'll use the chi-square statistic.  The primary advantage of chi-square is that we can compute p-values analytically, which is efficient.  But since I am planning to use simulation, the only advantage is that it is conventional.
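The ChiSquared function is not shown in the post; here is a plausible sketch, assuming expected and observed are thinkstats Hist/Pmf objects over the same categories (Items and Freq are methods those classes provide):

def ChiSquared(expected, observed):
    """Computes the chi-squared statistic.

    expected: Hist/Pmf that maps from category to expected count
    observed: Hist that maps from category to observed count
    """
    total = 0.0
    for value, exp in expected.Items():
        obs = observed.Freq(value)
        total += (obs - exp) ** 2 / exp
    return total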


For the given data the chi-square statistic is 291, which doesn’t mean anything.  To make it mean something, we have to compute the p-value, which is the probability of seeing a chi-square statistic so large under the null hypothesis.


Which means we need to get back to our model of the null hypothesis.  One option is to imagine choosing nuts one at a time from a vat that contains a large number of nuts mixed in the advertised proportions.  Here’s what that looks like:


num_nuts = observed.Total()
cdf = Cdf.MakeCdfFromHist(expected)
t = cdf.Sample(num_nuts)
simulated = Pmf.MakeHistFromList(t)

chi2 = ChiSquared(expected, simulated)



If we run that 1000 times, we can count how many samples yield chi2 > 290.  And the answer is...0.  In fact, after 1000 simulations, the largest test statistic is 20.  So the p-value is much smaller than 1/1000 and it is unlikely that the observed sample came from the advertised mixture, at least under this model.
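Wrapping that simulation in a loop yields the estimated p-value; here is a sketch of the loop (the structure is illustrative, built from the pieces above):

def EstimatePvalue(observed, expected, num_trials=1000):
    """Estimates the p-value: the fraction of simulated samples whose
    chi-squared statistic exceeds the observed statistic.
    """
    threshold = ChiSquared(expected, observed)
    num_nuts = observed.Total()
    cdf = Cdf.MakeCdfFromHist(expected)

    count = 0
    for i in range(num_trials):
        t = cdf.Sample(num_nuts)
        simulated = Pmf.MakeHistFromList(t)
        if ChiSquared(expected, simulated) > threshold:
            count += 1
    return float(count) / num_trials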


Since there was some uncertainty in the data we collected, we should do a perturbation analysis.  For example, Jimmy might claim that he used big nuts, which would make the counts smaller and the results less significant.  So let’s give Jimmy the benefit of the doubt and run the numbers with these counts per ounce:


Nuts per ounce: Cashew 12, Brazil nut 5, Almonds 18, Peanuts 24.


With these data, the observed test statistic is only 215 (down from 290).  But the chance of seeing a value that large under the null hypothesis is still negligible.


More credibly, Jimmy might claim that the model of the null hypothesis is unrealistic because the nuts are not perfectly mixed, so the sampling process is not independent.  In fact, correlation between successive draws would increase variation in the sample.  To see why, imagine the extreme scenario where the nuts are not mixed at all; in that case, most samples contain only one kind of nut.


To model this scenario, we can make a list that represents the vat of nuts the sample is drawn from. Initially the vat is completely unmixed. Then we "stir" by choosing random pairs and swapping them.

def MakeVat(expected, num_nuts, factor=10, stir=0.0):
    """Makes a list of nuts with partial stirring."""
    t = []
    for value, freq in expected.Items():
        t.extend([value] * int(freq * factor))

    # stir by swapping randomly-chosen pairs of nuts
    for i in xrange(int(num_nuts * factor * stir)):
        RandomSwap(t)

    return t


expected is a Pmf that represents the expected distribution. num_nuts is the total number of nuts in the sample. factor is the size of the vat, in multiples of num_nuts. And stir is the number of swaps to perform per nut in the vat.
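RandomSwap is not shown in the post; a minimal version (an assumed sketch) just swaps two randomly-chosen elements in place:

import random

def RandomSwap(t):
    """Swaps two randomly-chosen elements of the list t, in place."""
    i = random.randrange(len(t))
    j = random.randrange(len(t))
    t[i], t[j] = t[j], t[i]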

I set factor to 10, which means nuts are mixed in 200 lb batches. With stir = 0, the nuts are unmixed and the p-value is 1; that is, it is certain that we will see a large deviation from the advertised proportions. If stir > 2, the nuts are well mixed and the p-value is 0, as we saw before.

Between 1.1 and 1.3 swaps per nut, the p-value drops quickly from 1 to 0. So Jimmy's argument is plausible: if the nuts are not well mixed, that could explain the deviation we saw in the sample.

In summary, what can we conclude about the Jimmy Nut Company?

1) If the sample is drawn from a well mixed batch of nuts, it provides strong evidence that the mix is not as advertised.

2) But if the nuts are not well mixed, that might explain the deviation we saw. We could test that hypothesis by mixing the vat more thoroughly or taking small samples from several parts of the vat.

And what can we conclude about the Jimmy Nut Company problem?

1) I think it's an excellent problem, as written. It requires careful thinking and a little research, and the results could lead to interesting class discussion.

2) I see no evidence that this problem indicates a conceptual error on the part of the textbook writer.

In summary, I agree with Prof. Hayden that statistics students should work with real data, but respectfully disagree with his assertion that statistics books should be written by statisticians. I think they should be written by computer scientists.
The code I used for this article is here.

UPDATE August 30, 2011: With the help of Google Books, I found the source of this problem:
Understandable statistics: concepts and methods by Charles Henry Brase and Corrinne Pellillo Brase. The problem appears in the 5th edition, but not the current 9th edition.

Tuesday, August 16, 2011

Upcoming webcast: Only One Test

The nice people at O'Reilly Media have asked me to do a webcast or two on topics related to Think Stats.  The first one is scheduled for October 4 at 1pm EDT.  Here is the announcement.


There's Only One Test

Date: Tuesday, October 4, 2011
Time: 10am PT, San Francisco
6pm - London | 1pm - New York | Wed, Oct 5th at 3am - Sydney | Wed, Oct 5th at 2am - Tokyo | Wed, Oct 5th at 1am - Beijing | 10:30pm - Mumbai


Presented by: Allen B. Downey
Duration: Approximately 60 minutes.
Cost: Free
Can I use a t-test if my data are non-normal? What if the sample size is small? What's an exact test? And how many tests are there, anyway?
People working with real data are often confused about hypothesis testing and paralyzed by the number of tests and their requirements. But for many common problems, you don't need specialized tests at all.
In this talk, I present a framework for using simple simulations to estimate p-values. This framework works well for a wide range of problems, which is why I say that there's only one test.
You don't need prior knowledge of statistics (and you might be better off without it). I show programming examples in Python, but the talk is accessible to anyone with basic programming skills.

About Allen B. Downey

Allen B. Downey, author of Think Stats. Allen is a Professor of Computer Science at Olin College of Engineering. He has taught at Wellesley College, Colby College and U.C. Berkeley. He has a Ph.D. in Computer Science from U.C. Berkeley and Master's and Bachelor's degrees from MIT.
Questions? Please send email to webcast@oreilly.com


Here are the slides I used for this webcast:



EDIT Oct 12, 2011: Posted final version of slides.

Wednesday, June 8, 2011

A hierarchical Bayesian model of pond scum

This week I am working with one of my colleagues, the extraordinary biologist Jean Huang, on an interesting problem related to bioinformatics.  This project is my first attempt at implementing a hierarchical Bayesian model in Python, and it has been a pleasant surprise.  If you are familiar with tree-processing algorithms, it's pretty straightforward.  First the likelihoods propagate up the tree, then the updates propagate down.

Here is the background: Prof Huang and her students collect samples of pond water and culture the bacteria they find under narrow-spectrum artificial light, trying to find species that photosynthesize at different wavelengths.

To identify the species in each culture, they amplify the 16S ribosomal RNA gene, send it out to be sequenced, and then look up the sequence in a database like greengenes.  The result is either the name of a species that has been cultured, the genus of an uncultured species, or an unknown genus.

They usually start by identifying 15 samples from each culture.  For example, one of their cultures yielded 9 samples of one species, three of another, and one each of three more.

Looking at the results for a given culture, biologists would like to be able to answer these questions:

1) How many more species are there likely to be?
2) What is the likely prevalence of each species?
3) If we test m additional samples, how many new species are we likely to find?
4) Given a limited budget, which cultures warrant additional sampling?

To answer these questions, I implemented a hierarchical Bayesian model where the top level is "How many species are there?" and the second level is "Given that there are n species, what is the prevalence of each species?"  The prevalence of a species is its proportion of the population.

Knowing how many species there are helps model their prevalences.  In the simplest case, if we know that there is only 1 species, the prevalence of that species must be 100%.  If we know there are 2 species, I assume that the distribution of prevalences for both species is uniform from 0 to 100, so the expected prevalence for both is 50%, which makes sense.

If we know there are 3 species, it is less obvious what the prior distribution should be; the Beta distribution is a natural choice because

1) If we know there are k species, we can use Beta(1, k-1) as a prior, and the expected value is 1/k, which makes sense.  For k=2, the result is a uniform distribution, so that makes sense, too.

2) For this scenario, the Beta distribution is a conjugate prior, which means that after an update the posterior is also a Beta distribution.

We can represent a Beta distribution with an object that keeps track of the parameters:


class Beta:
    def __init__(self, yes, no):
        """Initializes a Beta distribution.
        yes and no are the number of 
        successful/unsuccessful trials.
        """
        self.alpha = yes+1
        self.beta = no+1

Update couldn't be easier.

    def Update(self, yes, no):
        """Updates a Beta distribution."""
        self.alpha += yes
        self.beta += no

At the top level ("How many species are there?") the prior is a uniform distribution from 1 to 20, at least for now.   I could replace that with a distribution that reflects more of the domain knowledge biologists have about the experiment; for example, based on how the cultures are processed, it would be rare to find more than 10 species with any substantial prevalence.

Now that the priors are in place, how do we update?  Here's the top level:


    def Update(self, evidence):
        """Updates based on observing a given taxon."""
        for hypo, prob in self.hypos.Items():
            likelihood = hypo.Likelihood(evidence)
            if likelihood:
                self.hypos.Mult(hypo, likelihood)
            else:
                # if a hypothesis has been ruled out, remove it
                self.hypos.Remove(hypo)


        self.hypos.Normalize()

The evidence is a string that indicates which taxon (species or genus) was observed.  self.hypos is a Pmf that maps from a hypo to its probability (Pmf is provided by the thinkstats package, a collection of modules used in my book, Think Stats).  Each hypo is a hypothesis about how many taxons there are.  We ask each hypo to compute the likelihood of the evidence, then update hypos accordingly.  If we see N different taxons, all hypotheses with k<N are eliminated.

Having updated the top level, we pass the evidence down to the lower level and update each hypothesis.

        # update the hypotheses
        for hypo, prob in self.hypos.Items():
            hypo.Update(evidence)

To compute the likelihood of the evidence under a hypothesis, we ask "What is the probability of observing this taxon, given our current belief about the prevalence of each taxon?"  There are two cases:

1) If the observed taxon is one we have seen before, we just look up the current Beta distribution and compute its mean:

    def Mean(self):
        """Computes the mean of a Beta distribution."""
        # float() avoids truncating integer division in Python 2
        return float(self.alpha) / (self.alpha + self.beta)

2) If we are seeing a taxon for the first time, we get the likelihoods for all unseen taxons and add them up.
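Here is a sketch of what such a Likelihood method might look like (illustrative only; it assumes each hypothesis keeps self.taxa, a map from taxon names to Beta distributions, and self.seen, the set of taxa observed so far):

    def Likelihood(self, evidence):
        """Computes the likelihood of observing the given taxon.

        If the taxon has been seen before, the likelihood is the mean
        of its Beta distribution; if it is new, it is the total
        probability of all taxa not yet seen.
        """
        # self.seen is assumed to be the set of taxa observed so far
        if evidence in self.seen:
            return self.taxa[evidence].Mean()

        return sum(dist.Mean()
                   for taxon, dist in self.taxa.iteritems()
                   if taxon not in self.seen)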

To update the lower level distributions, we loop through the Beta distributions that represent the prevalences, and update them with either a hit or a miss.

    def Update(self, evidence):
        """Updates each Beta dist with new evidence.

        evidence is a taxon
        """
        for taxon, dist in self.taxa.iteritems():
            if taxon == evidence:
                dist.Update(1, 0)
            else:
                dist.Update(0, 1)

And that's it.  After doing a series of updates, the top level is the posterior distribution of the number of taxons, and each hypothesis at the lower level is a set of Beta distributions that represent the prevalences.

Let's look at an example.  Suppose we see 11 of one taxon, 4 of another, and 2 of a third.  The posterior distribution of the number of taxons is:


The probability is 47% that there are only 3 taxons, 28% that there are 4, and 13% that there are 5.

By summing over the hypotheses, we can generate distributions for the prevalence of each taxon:


The median prevalence of taxon A is 58%; the credible interval is (39%, 75%).  For taxons B and C the medians are 23% and 13%.  For taxons D, E, and F, there is a high probability that the prevalence is 0, because these taxons have not been observed.

To predict the number of new taxons we might find by sequencing additional samples, I use Monte Carlo simulation.  Here is the kernel of the algorithm:

        for i in range(m):
            taxon = self.GenerateTaxon()
            taxons.append(taxon)
            meta.Update(taxon)

        curve = MakeCurve(taxons)

m is the number of samples to simulate.  Each time through the loop, we generate a random taxon and then update the hierarchy.  The result is a rarefaction curve, which is the number of taxons we observe as a function of the number of samples.

To generate a random taxon, we take advantage of the recursive structure of the hierarchy; that is, we ask each hypothesis to choose a random taxon, and then choose among them in accordance with the probability for each hypothesis:

    def GenerateTaxon(self):
        """Chooses a random taxon."""
        pmf = Pmf.Pmf()
        for hypo, prob in sorted(self.hypos.Items()):
            taxon = hypo.GenerateTaxon()
            pmf.Incr(taxon, prob)
        return pmf.Random()

The algorithm the hypotheses use to generate taxons is ugly, so I will spare you the details, but ultimately it depends on the ability to generate Beta variates, which is provided in Python's random module.

Given a sequence of taxons (real or simulated), it is easy to generate a rarefaction curve:

def MakeCurve(sample):
    """Makes a rarefaction curve for the given sample."""
    s = set()
    curve = []
    for i, taxon in enumerate(sample):
        s.add(taxon)
        curve.append((i+1, len(s)))
    return curve
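For example, applying it to a shuffled copy of the observed data (11, 4, and 2 samples of taxa A, B, and C):

import random

sample = ['A'] * 11 + ['B'] * 4 + ['C'] * 2
random.shuffle(sample)
curve = MakeCurve(sample)
# curve is a list of (number of samples, number of distinct taxa) pairs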

To show what additional samples might yield, I generate 100 rarefaction curves and plot them:


The first 17 points are random permutations of the observed data, so after 17 samples, we have seen 3 taxa on every curve.  The last 15 points are based on simulated data.  The lines are shifted slightly so they don't overlap; that way you can eyeball high-probability paths.

Even after m=15 additional samples, there is a good chance that we don't see any more taxons -- but there is a reasonable chance of seeing 1-2, and a small chance of seeing 3.

Finally, we estimate the probability of seeing additional taxons as a function of the number of additional samples:


With 5 additional samples, the chance of seeing at least one additional taxon is 24%; with 10 samples it's 40%, and with 15 it's 53%.

These estimates should help biologists allocate their budget for additional samples in order to minimize the chance of missing an important low-prevalence taxon.

Next steps: I am working with Prof. Huang and her students to report the results most usefully; then we are thinking about distributing the code or deploying it as a web app.


Thursday, June 2, 2011

More hypotheses, less trivia

In my last post I wrote about hypothesis testing with simulation, and one of the readers at reddit/r/statistics wrote this comment:
Is there a book or similar reference [with] examples of building such simulations? I never have enough confidence to run such simulations and I guess that is mainly because I have only seen them for trivial problems. A book or a tutorial with a number of problems with various complexity would help a lot.
My answer to the first part (is there a good book?) is "I don't know."  My answer to the second part (non-trivial examples) is this post with some less trivial examples.


Difference in means

If you think there is a difference between two groups, the most common test is for a difference in the means.  In Think Stats, I apply this test to pregnancy length and birth weight for first babies and others.

In this case the test statistic is obvious: compute the mean for each group and subtract.  The only hard part is deciding whether to do a one-sided test or a two-sided test.  It depends on what hypothesis you are testing.  If you think men are taller than women, or that the new drug is better than the old drug, then a one-sided test is appropriate.  If you think there is a difference between the groups but you don't know what it is, you can use a two-sided test; that is, use the absolute value of the difference as the test statistic.

This decision can seem arbitrary, but don't worry.  The effect on the p-value is just a factor of two, and (as I argued last time) we only care about the order of magnitude.  Whether the p-value is 2% or 4% or 8% doesn't really matter.

The null hypothesis is that the two groups are actually the same.  There are several ways to implement this model:

Parametric: Merge the groups.  If the pooled distribution is well modeled by an analytic distribution, estimate the parameters (for example, the mean and variance of a normal distribution) and then generate random samples from the fitted distribution.

Resampling: Again, merge the groups, and then draw samples (with replacement) from the pooled empirical distribution.

Permutation: Or shuffle the sample and assign the elements to the groups at random.  This is equivalent to resampling without replacement.
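Here is a sketch of the permutation version (illustrative code; the groups are assumed to be sequences of numbers):

import random

def PermutationPvalue(group1, group2, num_trials=1000):
    """Estimates a two-sided p-value for the difference in means
    by shuffling the pooled data and splitting it at random.
    """
    def mean(t):
        return float(sum(t)) / len(t)

    observed = abs(mean(group1) - mean(group2))
    pool = list(group1) + list(group2)
    n = len(group1)

    count = 0
    for i in range(num_trials):
        random.shuffle(pool)
        diff = abs(mean(pool[:n]) - mean(pool[n:]))
        if diff >= observed:
            count += 1
    return float(count) / num_trials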

In this example, the difference in birthweight is significant (first babies are lighter) and the difference in pregnancy length is not.  But it turns out that first babies are more likely to be early and late (and less likely to be on time).  To test that hypothesis, I used the procedure I described last time for the casino problem.


Slope of a fitted line

In this post I fit a line to a time series and compute the slope.  The test statistic is the estimated slope, so that part is easy.

The null hypothesis is that the apparent slope is due to chance, so I used a permutation model; that is, I shuffled the time series 1000 times and computed the slope each time.

Since my hypothesis is specifically that the slope is positive, I could have used a one-sided test, but to be conservative I reported a two-sided p-value:
The red line shows the linear least squares fit, with slope 0.033; the p-value (chance of seeing an absolute slope as big as that) is 0.006, so you can either reject the null hypothesis or update your subjective degree of belief accordingly.
Notice that I didn't say anything about how I computed the p-value.  Depending on where you publish, you might or might not have to be more specific.
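Here is roughly how such a computation might look (an illustrative sketch using numpy, not the original code):

import random
import numpy

def SlopePvalue(ts, ys, num_trials=1000):
    """Estimates a two-sided p-value for the slope of a least-squares
    fit by shuffling the series and refitting.
    """
    def slope(ts, ys):
        # polyfit with degree 1 returns (slope, intercept)
        return numpy.polyfit(ts, ys, 1)[0]

    observed = abs(slope(ts, ys))
    ys = list(ys)

    count = 0
    for i in range(num_trials):
        random.shuffle(ys)
        if abs(slope(ts, ys)) >= observed:
            count += 1
    return float(count) / num_trials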


Tail curvature

In this paper, I was characterizing the tail behavior of long-tailed distributions and trying to quantify how well Pareto and lognormal models captured that behavior.

If you plot the complementary CDF of a Pareto distribution on a log-log scale, you get a straight line (example here); for a lognormal distribution, the line drops off with increasing slope.  To distinguish these behaviors, I wanted a robust measure of tail curvature, so I computed the first derivative numerically and then estimated its slope.  Since the slope of the first derivative is the second derivative, I called the result "tail curvature" and used it as the test statistic.

In this example I had two hypotheses (HP for Pareto and HL for lognormal) and no null hypothesis.  To compute P(E | HP) I chose a Pareto distribution that was a good fit for the tail of the observed distribution, then generated random samples, and computed tail curvature.  Likewise for P(E | HL).

Here's how I reported the results:
For these cases, the curvature test provides some insight. The tail curvature of the Calgary dataset is 0.141, which has a negligible p-value under the Pareto model. This means that we can reject the hypothesis that the data are a sample from a Pareto distribution. The tail curvature of the ClarkNet dataset is 0.023, which has p > 0.5, which means we cannot reject the hypothesis that the sample comes from a long-tailed distribution.
I am not crazy about the "reject the null hypothesis" way of talking, but it was appropriate for the venue.


ANOVA

If you collect a sample of adult human heights in centimeters, the variance might be 94 cm^2.  If you divide the sample into men and women, the variance in each group is smaller (49 and 44 cm^2), which suggests that there is a relationship between height and gender.  To test whether that is significant, you can divide the sample at random and see how often the variance drops as much.  That's an analysis of variance (ANOVA) test.

The nice thing about ANOVA is that it generalizes easily to more than two groups.

To summarize, the test statistic is the decrease in variance when you divide the sample into groups, also known as "assigning labels."  The null hypothesis is that the mean is the same for all groups, so the simulation is to assign the labels at random.
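Here is an illustrative sketch of that simulation (the helper names are mine; data is a list of values, labels is a parallel list of group names, and each group's variance is weighted by its size):

import random

def Variance(t):
    """Computes the variance of a sequence of numbers."""
    mu = float(sum(t)) / len(t)
    return sum((x - mu) ** 2 for x in t) / len(t)

def VarianceDrop(data, labels):
    """Computes the decrease in variance when the data are divided
    into groups according to labels (the ANOVA-style test statistic).
    """
    groups = {}
    for x, label in zip(data, labels):
        groups.setdefault(label, []).append(x)
    within = sum(Variance(g) * len(g) for g in groups.values()) / len(data)
    return Variance(data) - within

def AnovaPvalue(data, labels, num_trials=1000):
    """Estimates the p-value by shuffling the labels at random."""
    observed = VarianceDrop(data, labels)
    labels = list(labels)
    count = 0
    for i in range(num_trials):
        random.shuffle(labels)
        if VarianceDrop(data, labels) >= observed:
            count += 1
    return float(count) / num_trials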


Time of day effect?

Here's a recent post from reddit/r/statistics:
A reviewer wants me to test if various behavioural assays I performed were influenced by the time of day I performed them (i.e. biased). The assays have either binary or continuous responses. What is the best approach to this? Is it as simple as fitting a GLM, response~time, and testing for a non-zero slope? That seems overly simplistic to me. I've also tried GAMs with gaussian or binomial error functions, but it doesn't seem to be doing what I expect it to do... Any thoughts would be greatly appreciated. I'm sure its stupidly simple, but I'm having trouble wrapping my head around the problem!
At first I suggested a permutation test to check whether time of day had a substantial effect.  But the submitter added these details:
We assayed the behaviour of six species of spiders. The purpose of the study was to compare the behaviour of some species to others. For various reasons, we could not balance the time of day at which we assayed the behaviours for each species (the wonders of field biology...). It may be the case that these spiders have some sort of daily behavioural cycle, e.g. more active at night (however it could be the opposite, or really any function of time; we have no idea). I think the reviewer's concern is that this could influence the comparisons, e.g. if we assayed species 1 more often at night than species 2?
I suppose the explicit hypothesis I am interested in testing is that time of day has a significant effect on behavioural response, but I am unsure how to construct a model to test this, given that I have no idea if the relationship will be linear or some complex function.
 So there are several questions on the table here:

1) Are the observed differences between the species significant?

I assume the answer is yes, or the submitter would not be trying to publish (see publication bias).

2) Is there a time of day effect?

One option is to divide the day into intervals, maybe 6 periods of 4 hours.  Then for the continuous response variable, you could use an ANOVA test.  For the binary variable you could use a logistic regression with one dummy variable for each interval.

Or if the question is whether there are any times of day that differ from the others, then an appropriate test statistic would be the maximum difference between any interval and the overall average (the next case study talks about this option in more detail).

3) If there is a time of day effect, can we control for it?

If it turns out that there is a time of day effect, it may or may not be possible to control for it.  For example, if you only observed Species 1 in the morning and Species 2 in the afternoon, then there is no way to separate the effect of the two variables.  In that case, the problem is experimental design, not statistics.

As a quick check, you could make a table with one row per species and one column per time interval, and count how many observations fall in each cell.  If there are a lot of empty (or nearly empty) cells, you are in trouble.

If you make the intervals bigger, you will have more observations per cell, but you might obscure an actual time of day effect.

4) If we control for time of day, are the differences between the species still significant?

If you get this far, you could do ANOVA or regression with dummy variables for each species and for each interval.  This is what some people suggested on reddit (if I understood correctly), but I would suggest spending some time on steps (2) and (3) before diving into (4).

Testing pair-wise differences

Here's another recent post from reddit/r/statistics:

My data can be divided into 4 samples. I wanted to test whether the distribution of a binary variable differs significantly between the samples, so I did 6 t-tests - one for each pair of samples.
However, I also need to analyze how another categorical variable differs between the samples, this time with 3 possible values.
Is it okay to compute new t-tests, each time for a pair of values and a pair of samples, for a total of 6*3 tests? The problem is that this is too detailed: I don't want to test each pair of values, only each pair of samples.
SAS also gives me the option (in proc freq) of calculating a chi-square test for all 4 samples, but this is not enough information: it only tells me that not all of the means are equal.
And then there's a one-way ANOVA test, which can be used to compare the means of >2 samples, but apparently can't be used for categorical variables with >2 values. Or can it?
How would you analyze this data?
If you run lots of tests, some of them are likely to be "significant" even if there is no real effect going on.  The problem is demonstrated in this xkcd cartoon.  So in general I think it is a good idea to apply one big test to the entire dataset, not lots of little tests.


In this case, the question seems to be whether there are significant differences between any pair of groups.  So an appropriate test statistic would be the maximum difference between any pair.


The null hypothesis is that the groups are the same, so an easy implementation is to divide the sample into groups at random.


For categorical variables, you can think of the distribution within each group as a vector and use a distance function to compute the difference between groups.  If the groups are roughly the same size, you could use a 1-norm or a 2-norm.  If not, you should probably normalize the vectors to length 1 before computing norms.


If you are not really interested in pair-wise differences, but you want to know whether any group deviates significantly from the expected, you could pool all the groups to get the expected distribution of the categorical variable, then compute a chi-squared statistic for each group (how much it deviates from the expected).  Again the test statistic would be the maximum of the chi-square statistic.


The important thing here is that the choice of test statistic depends on the question you are asking.  By using the maximum difference, you are effectively asking, "What is the probability of finding any group with this much deviance, just by chance?"
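Here is an illustrative sketch of that approach for a categorical variable (the helper names are mine; samples is a list of lists of category values, and I use the 1-norm between normalized distributions so group sizes don't matter):

import random
from itertools import combinations

def Distribution(sample):
    """Maps each category to its proportion within the sample."""
    counts = {}
    for x in sample:
        counts[x] = counts.get(x, 0) + 1
    n = float(len(sample))
    return dict((x, c / n) for x, c in counts.items())

def MaxPairwiseDistance(samples):
    """Returns the largest 1-norm distance between the category
    distributions of any pair of groups.
    """
    dists = [Distribution(sample) for sample in samples]
    best = 0
    for d1, d2 in combinations(dists, 2):
        keys = set(d1) | set(d2)
        distance = sum(abs(d1.get(k, 0) - d2.get(k, 0)) for k in keys)
        best = max(best, distance)
    return best

def MaxDiffPvalue(samples, num_trials=1000):
    """Estimates the p-value by pooling the data and re-dividing it
    into groups of the original sizes at random.
    """
    observed = MaxPairwiseDistance(samples)
    pool = [x for sample in samples for x in sample]
    sizes = [len(sample) for sample in samples]

    count = 0
    for i in range(num_trials):
        random.shuffle(pool)
        groups, start = [], 0
        for size in sizes:
            groups.append(pool[start:start + size])
            start += size
        if MaxPairwiseDistance(groups) >= observed:
            count += 1
    return float(count) / num_trials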


Running a series of tests


Here's one more question from reddit/r/statistics
I'm running a series of chi-square analyses and for several of them, I'm getting a message which states that "# cells have an expected count less than 5". If the results are significant, are they still worth reporting in spite of this message? I'm comparing participant selections of one of three alternatives when assigned to one of four conditions, examining pairs of comparisons.
The problem here is that the analytic version of the chi-squared test uses an approximation that is only accurate if the sample size is large enough; otherwise you have to use a different test.


But this is a perfect example of what I talked about in my last post: if you spend all your time in statistics class learning one damn test after another, and your entire cognitive capacity is consumed trying to remember when each of them is applicable, you have no time and no available brain cells to do the real work, which is thinking carefully about what you are doing.  [Note: I mean this as a general warning, not a criticism of the submitter.]


In this case the biggest hazard is not the sample size; it's "running a series of chi-square analyses."  When I hear "series of tests," alarm bells go off in my head.  If you don't know why, please review this xkcd cartoon.

So for this question, I have two suggestions:

1) If you use simulation, you don't have to worry about the expected count in each cell.  If the sample size is small, that tends to make the p-value bigger, but that happens without any additional effort.

2) If you are running a series of tests, you should either (a) adjust the significance criterion using something like the Holm–Bonferroni method, or (b) treat the whole series as a single test, using the maximum of the chi-square values as a test statistic.  I think (b) is simpler, and does a better job of capturing what you are trying to test.

That's all for now!