Probably Overthinking It: More hypotheses, less trivia

In my last post I wrote about hypothesis testing with simulation, and one of the readers at reddit/r/statistics wrote this comment:

Is there a book or similar reference [with] examples of building such simulations? I never have enough confidence to run such simulations and I guess that is mainly because I have only seen them for trivial problems. A book or a tutorial with a number of problems with various complexity would help a lot.

My answer to the first part (is there a good book?) is "I don't know." My answer to the second part (non-trivial examples) is this post with some less trivial examples.

Difference in means

If you think there is a difference between two groups, the most common test is for a difference in the means. In Think Stats, I apply this test to pregnancy length and birth weight for first babies and others.

In this case the test statistic is obvious: compute the mean for each group and subtract. The only hard part is deciding whether to do a one-sided test or a two-sided test. It depends on what hypothesis you are testing. If you think men are taller than women, or that the new drug is better than the old drug, then a one-sided test is appropriate. If you think there is a difference between the groups but you don't know what it is, you can use a two-sided test; that is, use the absolute value of the difference as the test statistic.

This decision can seem arbitrary, but don't worry. The effect on the p-value is just a factor of two, and (as I argued last time) we only care about the order of magnitude. Whether the p-value is 2% or 4% or 8% doesn't really matter.

The null hypothesis is that the two groups are actually the same. There are several ways to implement this model:

Parametric: Merge the groups. If the pooled distribution is well modeled by an analytic distribution, estimate the parameters (for example, the mean and variance of a normal distribution) and then generate random samples from the fitted distribution.

Resampling: Again, merge the groups, and then draw samples (with replacement) from the pooled empirical distribution.

Permutation: Or shuffle the sample and assign the elements to the groups at random. This is equivalent to resampling without replacement.

In this example, the difference in birthweight is significant (first babies are lighter) and the difference in pregnancy length is not. But it turns out that first babies are more likely to be early and late (and less likely to be on time). To test that hypothesis, I used the procedure I described last time for the casino problem.

Slope of a fitted line

In this post I fit a line to a time series and compute the slope. The test statistic is the estimated slope, so that part is easy.

The null hypothesis is that the apparent slope is due to chance, so I used a permutation model; that is, I shuffled the time series 1000 times and computed the slope each time.

Since my hypothesis is specifically that the slope is positive, I could have used a one-sided test, but to be conservative I reported a two sided p-value:

The red line shows the linear least squares fit, with slope 0.033; the p-value (chance of seeing an absolute slope as big as that) is 0.006, so you can either reject the null hypothesis or update your subjective degree of belief accordingly.

Notice that I didn't say anything about how I computed the p-value. Depending on where you publish, you might or might not have to be more specific.

Tail curvature

In this paper, I was characterizing the tail behavior of long-tailed distributions and trying to quantify how well Pareto and lognormal models captured that behavior.

If you plot the complementary CDF of a Pareto distribution on a log-log scale, you get a straight line (example here); for a lognormal distribution, the line drops off with increasing slope. To distinguish these behaviors, I wanted a robust measure of tail curvature, so I computed the first derivative numerically and the estimated the slope. Since slope of a derivative is the second derivative, I called the result "tail curvature" and used it as the test statistic.

In this example I had two hypotheses (HP for Pareto and HL for lognormal) and no null hypothesis. To compute P(E | HP) I chose a Pareto distribution that was a good fit for the tail of the observed distribution, then generated random samples, and computed tail curvature. Likewise for P(E | HL).

Here's how I reported the results:

For these cases, the curvature test provides some insight. The tail curvature of the Calgary dataset is 0.141, which has a negligible p-value under the Pareto model. This means that we can reject the hypothesis that the data are a sample from a Pareto distribution. The tail curvature of the ClarkNet dataset is 0.023, which has p > 0.5, which means we cannot reject the hypothesis that the sample comes from a long-tailed distribution.

I am not crazy about the "reject the null hypothesis" way of talking, but it was appropriate for the venue.

ANOVA

If you collect a sample of adult human heights in centimeters, the variance might be 94 cm^2. If you divide the sample into men and women, the variance in each group is smaller (49 and 44 cm^2), which suggests that there is a relationship between height and gender. To test whether that is significant, you can divide the sample at random and see how often the variance drops as much. That's an analysis of variance (ANOVA) test.

The nice thing about ANOVA is that is generalizes easily to more than two groups.

To summarize, the test statistic is the decrease in variance when you divide the sample into groups, also known as "assigning labels." The null hypothesis is that the mean is the same for all groups, so the simulation is to assign the labels at random.

Time of day effect?

Here's a recent post from reddit/r/statistics:

A reviewer wants me to test if various behavioural assays I performed were influenced by the time of day I performed them (i.e. biased).The assays have either binary or continuous responses. What is the best approach to this? It is as simple as fitting a GLM, response~time, and testing for a non-zero slope? That seems overly simplistic to me. I've also tried GAMs with gaussian or binomial error functions, but it doesn't seem to be doing what I expect it to do...Any thoughts would be greatly appreciated. I'm sure its stupidly simple, but I'm having trouble wrapping my head around the problem!

At first I suggested a permutation test to check whether time of day had a substantial effect. But the submitter added these details:

We assayed the behaviour of six species of spiders. The purpose of the study was to compare the behaviour of some species to others. For various reasons, we could not balance the time of day at which we assayed the behaviours for each species (the wonders of field biology...). It may be the case that these spiders have some sort of daily behavioural cycle, e.g. more active at night (however it could be the opposite, or really any function of time; we have no idea). I think the reviewer's concern is that this could influence the comparisons, e.g. if we assayed species 1 more often at night than species 2?
I suppose the explicit hypothesis I am interested in testing is that time of day has a significant effect on behavioural response, but I am unsure how to construct a model to test this, given that I have no idea if the relationship will be linear or some complex function.

So there are several questions on the table here:

1) Are the observed differences between the species significant?

I assume the answer is yes, or the submitter would not be trying to publish (see publication bias).

2) Is there a time of day effect?

One option is to divide the day into intervals, maybe 6 periods of 4 hours. Then for the continuous response variable, you could use an ANOVA test. For the binary variable you could use a logistic regression with one dummy variable for each interval.

Or if the question is whether there are any times of day that differ from the others, then an appropriate test statistic would be the maximum difference between any interval and the overall average (the next case study talks about this option in more detail).

3) If there is a time of day effect, can we control for it?

If it turns out that there is a time of day effect, it may or may not be possible to control for it. For example, if you only observed Species 1 in the morning and Species 2 in the afternoon, then there is no way to separate the effect of the two variables. In that case, the problem is experimental design, not statistics.

As a quick check, you could make a table with one row per species and one column per time interval, and count how many observations fall in each cell. If there are a lot of empty (or nearly empty) cells, you are in trouble.

If you make the intervals bigger, you will have more observations per cell, but you might obscure an actual time of day effect.

4) If we control for time of day, are the differences between the species still significant?

If you get this far, you could do ANOVA or regression with dummy variables for each species and for each interval. This is what some people suggested on reddit (if I understood correctly), but I would suggest spending some time on steps (2) and (3) before diving into (4).

Testing pair-wise differences

Here's another recent post from reddit/r/statistics:

My data can be divided into 4 samples. I wanted to test whether the distribution of a binary variable differs significantly between the samples, so I did 6 t-tests - one for each pair of samples.
However, I also need to analyze how another categorical variable differs between the samples, this time with 3 possible values.
Is it okay to compute new t-tests, each time for a pair of values and a pair of samples, for a total of 6*3 tests? The problem is that this is too detailed: I don't want to test each pair of values, only each pair of samples.
SAS also gives me the option (in proc freq) of calculating a chi-square test for all 4 samples, but this is not enough information: it only tells me that not all of the means are equal.
And then there's a one-way ANOVA test, which can be used to compare the means of >2 samples, but apparently can't be used for categorical variables with >2 values. Or can it?
How would you analyze this data?

If you run lots of tests, some of them are likely to be "significant" even if there is no real effect going on. The problem is demonstrated in this xkcd cartoon. So in general I think it is a good idea to apply one big test to the entire dataset, not lots of little tests.

In this case, the question seems to be whether there are significant differences between any pair of groups. So an appropriate test statistic would be the maximum difference between any pair.

The null hypothesis is that the groups are the same, so an easy implementation is to divide the sample into groups at random.

For categorical variables, you can think of the distribution within each group as a vector and use a distance function to compute the difference between groups. If the groups are roughly the same size, you could use a 1-norm or a 2-norm. If not, you should probably normalize the vectors to length 1 before computing norms.

If you are not really interested in pair-wise differences, but you want to know whether any group deviates significantly from the expected, you could pool all the groups to get the expected distribution of the categorical variable, then compute a chi-squared statistic for each group (how much it deviates from the expected). Again the test statistic would be the maximum of the chi-square statistic.

The important thing here is that the choice of test statistic depends on the question you are asking. By using the maximum difference, you are effectively asking, "What is the probability of finding any group with this much deviance, just by chance?"

Running a series of tests

Here's one more question from reddit/r/statistics

I'm running a series of chi-square analyses and for several of them, I'm getting a message which states that "# cells have an expected count less than 5". If the results are significant, are they still worth reporting in spite of this message? I'm comparing participant selections of one of three alternatives when assigned to one of four conditions, examining pairs of comparisons.

The problem here is that the analytic version of the chi-squared test uses an approximation that is only accurate if the sample size is large enough; otherwise you have to use a different test.

But this is a perfect example of what I talked about in my last post: if you spend all your time in statistics class learning one damn test after another, and your entire cognitive capacity is consumed trying to remember when each of them is applicable, you have no time and no available brain cells to do the real work, which is thinking carefully about the what you are doing. [Note: I mean this as a general warning, not a criticism of the submitter.]

In this case the biggest hazard is not the sample size; it's "running a series of chi-square analyses." When I hear "series of tests," alarm bells go off in my head. If you don't know why, please review this xkcd cartoon.

So for this question, I have two suggestions:

1) If you use simulation, you don't have to worry about the expected count in each cell. If the sample size is small, that tends to make the p-value bigger, but that happens without any additional effort.

2) If you are running a series of tests, you should either (a) adjust the significance criterion using something like the Holm–Bonferroni method, or (b) treat the whole series as a single test, using the maximum of the chi-square values as a test statistic. I think (b) is simpler, and does a better job of capturing what you are trying to test.

That's all for now!

1 comment:

KieranJune 2, 2011 at 9:11 PM
Hi! I am OP of the reddit effect of time of day post. I actually ended up trying both the dummy variable regression approach and a general additive model approach (both methods indicated there was no effect of time, so I didn't pursue it further). Thanks for your help on /r/statistics and thanks for your post here!

Thursday, June 2, 2011

More hypotheses, less trivia

1 comment: