Friday, August 22, 2014

An exercise in hypothesis testing

I've just turned in the manuscript for the second edition of Think Stats.  If you're dying to get your hands on a copy, you can pre-order one here.

Most of the book is about computational methods, but in the last chapter I break out some analytic methods, too.  In the last section of the book, I explain the underlying philosophy:

This book focuses on computational methods like resampling and permutation. These methods have several advantages over analysis:
  • They are easier to explain and understand. For example, one of the most difficult topics in an introductory statistics class is hypothesis testing. Many students don’t really understand what p-values are. I think the approach I presented in Chapter 9—simulating the null hypothesis and computing test statistics—makes the fundamental idea clearer.
  • They are robust and versatile. Analytic methods are often based on assumptions that might not hold in practice. Computational methods require fewer assumptions, and can be adapted and extended more easily.
  • They are debuggable. Analytic methods are often like a black box: you plug in numbers and they spit out results. But it’s easy to make subtle errors, hard to be confident that the results are right, and hard to find the problem if they are not. Computational methods lend themselves to incremental development and testing, which fosters confidence in the results.
But there is one drawback: computational methods can be slow. Taking into account these pros and cons, I recommend the following process:
  1. Use computational methods during exploration. If you find a satisfactory answer and the run time is acceptable, you can stop.
  2. If run time is not acceptable, look for opportunities to optimize. Using analytic methods is one of several methods of optimization.
  3. If replacing a computational method with an analytic method is appropriate, use the computational method as a basis of comparison, providing mutual validation between the computational and analytic results.
For the vast majority of problems I have worked on, I didn’t have to go past Step 1.
The last exercise in the book is based on a question my colleague, Lynn Stein, asked me for a paper she was working on:

 In a recent paper2, Stein et al. investigate the effects of an intervention intended to mitigate gender-stereotypical task allocation within student engineering teams.  Before and after the intervention, students responded to a survey that asked them to rate their contribution to each aspect of class projects on a 7-point scale. 
Before the intervention, male students reported higher scores for the programming aspect of the project than female students; on average men reported a score of 3.57 with standard error 0.28. Women reported 1.91, on average, with standard error 0.32. 
Question 1:  Compute a 90% confidence interval and a p-value for the gender gap (the difference in means). 
After the intervention, the gender gap was smaller: the average score for men was 3.44 (SE 0.16); the average score for women was 3.18 (SE 0.16). Again, compute the sampling distribution of the gender gap and test it. 
Question 2:  Compute a 90% confidence interval and a p-value for the change in gender gap. 
[2] Stein et al. “Evidence for the persistent effects of an intervention to mitigate gender-sterotypical task allocation within student engineering teams,” Proceedings of the IEEE Frontiers in Education Conference, 2014.
In the book I present ways to do these computations, and I will post my "solutions" here soon.  But first I want to pose these questions as a challenge for statisticians and people learning statistics.  How would you approach these problems?

The reason I ask:  Question 1 is pretty much a textbook problem; you can probably find an online calculator to do it for you.  But you are less likely to find a canned solution to Question 2, so I am curious to see how people go about it.  I hope to post some different solutions soon.

By the way, this is not meant to be a "gotcha" question.  If some people get it wrong, I am not going to make fun of them.  I am looking for different correct approaches; I will ignore mistakes, and only point out incorrect approaches if they are interestingly incorrect.

You can post a solution in the comments below, or discuss it on, or if you don't want to be influenced by others, send me email at downey at allendowney dot com.

UPDATE August 26, 2014

The discussion of this question on was as interesting as I hoped.  People suggested several very different approaches to the problem.  The range of responses extends from something like, "This is a standard problem with a known, canned answer," to something like, "There are likely to be dependencies among the values that make the standard model invalid, so the best you can do is an upper bound."  In other words, the problem is either trivial or impossible!

The approach I had in mind is to compute sampling distributions for the gender gap and the change in gender gap using normal approximations, and then use the sampling distributions to compute standard errors, confidence intervals, and p-values.

I used a simple Python class that represents a normal distribution.  Here is the API:

class Normal(object):
    """Represents a Normal distribution"""

    def __init__(self, mu, sigma2, label=''):
        """Initializes a Normal object with given mean and variance."""

    def __add__(self, other):
        """Adds two Normal distributions."""

    def __sub__(self, other):
        """Subtracts off another Normal distribution."""

    def __mul__(self, factor):
        """Multiplies by a scalar."""

    def Sum(self, n):
        """Returns the distribution of the sum of n values."""

    def Prob(self, x):
        """Cumulative probability of x."""
   def Percentile(self, p):
        """Inverse CDF of p (0 - 100)."""

The implementation of this class is here.

Here's a solution that uses the Normal class.  First we make normal distributions that represent the sampling distributions of the estimated means, using the given means and sampling errors.  The variance of the sampling distribution is the sampling error squared:

    male_before = normal.Normal(3.57, 0.28**2)
    male_after = normal.Normal(3.44, 0.16**2)

    female_before = normal.Normal(1.91, 0.32**2)
    female_after = normal.Normal(3.18, 0.16**2)

Now we compute the gender gap before the intervention, and print the estimated difference, p-value, and confidence interval:

    diff_before = female_before - male_before
    print('mean, p-value',, 1-diff_before.Prob(0))
    print('CI', diff_before.Percentile(5), diff_before.Percentile(95))

The estimated gender gap is -1.66 with SE 0.43, 90% CI (-2.3, -0.96) and p-value 5e-5.  So that's statistically significant.

Then we compute the gender gap after intervention and the change in gender gap:

    diff_after = female_after - male_after
    diff = diff_after - diff_before
    print('mean, p-value',, diff.Prob(0))
    print('CI', diff.Percentile(5), diff.Percentile(95))

The estimated change is 1.4 with SE 0.48, 90% CI (0.61, 2.2) and p-value 0.002.  So that's statistically significant, too.

This solution is based on a two assumptions:

1) It assumes that the sampling distribution of the estimated means is approximately normal.  Since the data are on a Likert scale, the variance and skew are small, so the sum of n values converges to normal quickly.  The samples sizes are in the range of 30-90, so the normal approximation is probably quite good.

This claim is based on the Central Limit Theorem, which only applies if the samples are drawn from the population independently.  In this case, there are dependencies within teams: for example, if someone on a team does a larger share of a task, the rest of the team necessarily does less.  But the team sizes are 2-4 people and the sample sizes are much larger, so these dependencies have short enough "range" that I think it is acceptable to ignore them.

2) Every time we subtract two distributions to get the distribution of the difference, we are assuming that values are drawn from the two distributions independently.  In theory, dependencies within teams could invalidate this assumption, but I don't think it's likely to be a substantial effect.

3) As always, remember that the standard error (and confidence interval) indicate uncertainty due to sampling, but say nothing about measurement error, sampling bias, and modeling error, which are often much larger sources of uncertainty.

The approach I presented here is a bit different from what's presented in most introductory stats classes.  If you have taken (or taught) a stats class recently, I would be curious to know what you think of this problem.  After taking the class, would you be able to solve problems like this?  Or if you are teaching, could your students do it?

Your comments are welcome.


  1. I am pleased to discover that the approach I provided on Reddit (as username blippage) was basically sound. Phew. It's good to know that time hasn't totally withered away my reasoning ability.

    When you say "The approach I presented here is a bit different from what's presented in most introductory stats classes", how so? Isn't there only one basic way to solve this problem: namely, by understanding that the mean of the sum/difference of two normally distributed variables is the sum/differences of the means, and the variance is the sum of the variance?

    Also, don't you think that there is a danger that by approaching problems programmatically, it is teaching students to think like engineers rather than mathematicians; that is to say, "I know that it does work, but I'm not sure why". Having said that, an approach that is "too" mathematical can end up looking like symbols just being pushed around the page, with the underlying concepts lost in the process.

    Your Think books look really interesting, and I think I owe it to myself to read them.

    All the best, Professor. You're doing a great job of educating the general public about statistical ideas.

    1. Thanks for these comments, and for your kind words. You asked how what I presented differs from most intro stats classes. Based on the textbooks I've seen, I get the impression that many stats classes teach hypothesis testing as a cookbook process, so students learn how to perform various tests and when to use which test. I have not seen much emphasis on the sampling distribution as the basis for standard error and confidence interval (but I am sure there are example of books and classes that do).

      About the computational approach, you suggested that students might learn how to use tools, but not how they work. I don't think the computational approach prevents students from learning both, and compared to the standard mathematical approaches, it provides a lot of flexibility: students can learn how to use a black box, then learn how it works (a top-down approach) or start with building blocks and assemble the black box (bottom-up).