Probably Overthinking It: August 2011

I read with interest an article by Robert Hayden called “Advice to mathematics teachers on evaluating introductory statistics textbooks.” It makes two claims:

Statistics textbooks should emphasize real problems with real data. I couldn't agree more.
You should choose a statistics textbook written by a statistician at a large university who has good research publications. I couldn't agree less.

On the second point, Prof. Hayden argues that statistics books contain more errors than other math books, and that the errors are more often conceptual, not just typos. The reason, he explains, is that “introductory statistics textbooks are often written by people with little or no training in statistics.”

As an example (in fact, the only example) he cites this problem from an unnamed popular textbook:

“Jimmy Nut Company advertises that their nut mix contains 40% cashews, 15% brazil nuts, 20% almonds, and only 25% peanuts. The truth in advertising investigators took a random sample (of size 20 lb) of the nut mix and found the distribution to be as follows:

Cashews Brazil Nuts Almonds Peanuts

6 lb 3 lb 5 lb 6 lb

At the 0.01 level of significance, is the claim made by Jimmy Nuts true?”

Prof. Hayden hates this problem. He thinks it is ill-posed, and evidence of everything wrong with statistics books. He explains: “The problem here is that the chi-squared goodness of fit test applies only to categorical (discrete) data. It compares the actually observed counts in each category to the counts we would expect if the hypothesis being tested were true.”

This is true, and if you blindly plug the numbers from the problem into the formula for a chi-squared statistic, the result is meaningless. In fact, if you express the given data in other units, you can make the resulting p-value as big or small as you like.

But that doesn’t mean the textbook author has made a “gross error,” because the question doesn’t say “plug these numbers in blindly and write down the result without thinking.” The question doesn’t even specify the chi-squared statistic.

The question, as written, is perfectly sensible, moderately challenging, and interesting. It requires thought and care to formulate the question in a way that leads to a meaningful answer, and it is exactly the kind of question someone doing statistics is likely to be asked.

I don’t know what the textbook author had in mind, but there is nothing wrong with the question as written. And to prove it, I’m going to answer it. I follow the procedure I recommend in my article "There is only one test."

The first step is to make a model of the null hypothesis, which is that the proportion of nuts is as advertised. But if the proportions are always as advertised, the data observed would be impossible. So we need a random model that characterizes the variation we expect to see in a sample.

What are the sources of variation in a 20 lb sample of nuts? One source is mechanical error; that is, the people and machines that weigh and mix nuts might be miscalibrated. If the accuracy of this process varies from day to day, we could treat it as a random variation. To quantify this effect, we would need some information about how the nuts are processed.

Another source of error, and probably the one the textbook author had in mind, is discrete sampling. If we take a small sample, we would not expect the proportions to be exactly as advertised. To model this kind of variation, we have to count individual nuts.

To convert the data from weights to counts, we have to know how much each kind of nut weighs. This step requires some research, which I think is an interesting element of the problem. Since we live in a post-Google information utopia, it is reasonable to expect students to answer this question, and useful for them to practice.

I found several sources with information about the weight of typical nuts, some more authoritative than others. The one I use is “Nuts for Nutrition,” by Alice Henneman, MS, RD, Extension Educator at the University of Nebraska Cooperative Extension - Lancaster County. In addition to the credentials and affiliation of the author, and the witty article title, I chose this source because it provides numbers for all four nuts in one table, so even if the methodology that produced them is not ideal, it is probably consistent, which means that the relative values are likely to be good even if the absolute values are not. From the table:

Nuts per ounce: Cashew 16-18, Brazil nut 6-8, Almonds 20-24, Peanuts 28.

To start I use the middle of each range. With this data we can convert observed values from pounds to counts and convert the expected proportions from “by weight” to “by count”. Here’s a function that does it:

def ConvertToCount(sample, count_per):
   """Convert from weight to count.

   sample: Hist that maps from category to weight in pounds
   count_per: dict that maps from category to count per ounce
   """
   for value, count in sample.Items():
       sample.Mult(value, 16 * count_per[value])

The code examples here use the Pmf and Cdf libraries from Think Stats. Here’s the code that processes the information we’re given:

   sample = dict(cashew=6, brazil=3, almond=5, peanut=6)
   count_per = dict(cashew=17, brazil=7,
almond=22, peanut=28)

   observed = Pmf.MakeHistFromDict(sample)
   ConvertToCount(observed, count_per)

   advertised = dict(cashew=40, brazil=15,
almond=20, peanut=25)
   expected = Pmf.MakePmfFromDict(advertised)
   ConvertToCount(expected, count_per)
   expected.Normalize(observed.Total())

Here are the results in tabular form:

Nut       Expected     Observed    % error    price per pound
cashew    2266         1632        -28 %      $10
brazil     350          336        - 4 %      $8
almond    1467         1760        +20 %      $8
peanut    2333         2688        +15 %     $3

So there are 28% fewer cashews than expected, and 20% more almonds. Is this suspicious? Intuitively, this looks like more deviation than I expect in a sample this large. And the relationship between “error” and price makes it look even worse.

Let’s make that more rigorous. The next step to choose a test statistic. Mean relative error would be a reasonable choice. Or I could devise a statistic that measures the excess of cheap nuts and lack of expensive nuts. But to keep it simple I'll use the chi-square statistic. The primary advantage of chi-square is that we can compute p-values analytically, which is efficient. But since I am planning to use simulation, the only advantage is that it is conventional.

For the given data the chi-square statistic is 291, which doesn’t mean anything. To make it mean something, we have to compute the p-value, which is the probability of seeing a chi-square statistic so large under the null hypothesis.

Which means we need to get back to our model of the null hypothesis. One option is to imagine choosing nuts one at a time from a vat that contains a large number of nuts mixed in the advertised proportions. Here’s what that looks like:

   num_nuts = observed.Total()
   cdf = Cdf.MakeCdfFromHist(expected)
   t = cdf.Sample(num_nuts)
   hist = Pmf.MakeHistFromList(t)
   chi2 = ChiSquared(expected, simulated)

If we run that 1000 times, we can count how many samples yield chi2 > 290. And the answer is...0. In fact, after 1000 simulations, the largest test statistic is 20. So the p-value is much smaller than 1/1000 and it is unlikely that the observed sample came from the advertised mixture, at least under this model.

Since there was some uncertainty in the data we collected, we should do a perturbation analysis. For example, Jimmy might claim that he used big nuts, which would make the counts smaller and the results less significant. So let’s give Jimmy the benefit of the doubt and run the numbers with these counts per ounce:

Nuts per ounce: Cashew 12, Brazil nut 5, Almonds 18, Peanuts 24.

With these data, the observed test statistic is only 215 (down from 290). But the chance of seeing a value that large under the null hypothesis is still negligible.

More credibly, Jimmy might claim that the model of the null hypothesis is unrealistic because the nuts are not perfectly mixed, so the sampling process is not independent. In fact, correlation between successive draws would increase variation in the sample. To see why, imagine the extreme scenario where the nuts are not mixed at all; in that case, most samples contain only one kind of nut.

To model this scenario, we can make a list that represents the vat of nuts the sample is drawn from. Initially the vat is completely unmixed. Then we "stir" by choosing random pairs and swapping them.

def MakeVat(expected, num_nuts, factor=10, stir=0.0):
"""Makes a list of nuts with partial stirring."""
t = []
for value, freq in expected.Items():
t.extend([value] * int(freq * factor))

[RandomSwap(t) for i in xrange(int(num_nuts*factor*stir))]

return t

expected is a Pmf that represents the expected distribution. num_nuts is the total number of nuts in the sample. factor is the size of the vat, in multiples of num_nuts. And stir is the number of swaps to perform per nut in the vat.

I set factor to 10, which means nuts are mixed in 200 lb batches. With stir = 0, the nuts are unmixed and the p-value is 1; that is, it is certain that we will see a large deviation from the advertised proportions. If stir > 2, the nuts are well mixed and the p-value is 0, as we saw before.

Between 1.1 and 1.3 swaps per nut, the p-value drops quickly from 1 to 0. So Jimmy's argument is plausible: if the nuts are not well mixed, that could explain the deviation we saw in the sample.

In summary, what can we conclude about the Jimmy Nut Company?

1) If the sample is drawn from a well mixed batch of nuts, it provides strong evidence that the mix is not as advertised.

2) But if the nuts are not well mixed, that might explain the deviation we saw. We could test that hypothesis by mixing the vat more thoroughly or taking small samples from several parts of the vat.

And what can we conclude about the Jimmy Nut Company problem?

1) I think it's an excellent problem, as written. It requires careful thinking and a little research, and the results could lead to interesting class discussion.

2) I see no evidence that this problem indicates a conceptual error on the part of the textbook writer.

In summary, I agree with Prof. Hayden that statistics students should work with real data, but respectfully disagree with his assertion that statistics books should be written by statisticians. I think they should be written by computer scientists.

The code I used for this article is here.

UPDATE August 30, 2011: With the help of Google Books, I found the source of this problem: Understandable statistics: concepts and methods by Charles Henry Brase and Corrinne Pellillo Brase. The problem appears in the 5th edition, but not the current 9th edition.

Probably Overthinking It

Monday, August 29, 2011

The Jimmy Nut Company problem

Tuesday, August 16, 2011

Upcoming webcast: Only One Test

There's Only One Test

About Allen B. Downey