When I started writing Think Stats, I wanted to avoid dire warnings about all the things people do wrong with statistics. I enjoy feeling smug and pointing out other people's mistakes as much as the next statistics Nazi, but I don't see any evidence that the warnings have much effect on the quality of statistical analysis in the press.
Maybe another approach is in order. Instead of making statistics seem like an arcane art that can only be practiced correctly by trained professionals, I would like to emphasize that the majority of statistical analysis is very simple. Deep mathematics is seldom necessary; usually it is enough to ask good questions and apply simple techniques.
As an example, I'm going to do what I said I wouldn't: point out other people's mistakes. Here is an excerpt from a recent ASEE newsletter, Connections:
The paragraph tries to summarize the data in the table, and fails. Let's take it point by point:
Claim 1) The percentages of recipients of doctoral degrees from all engineering disciplines by race and ethnicity show a great deal of stability over the last ten years.
Validity: BASICALLY TRUE. If they had just stopped here, everything would be fine. A graph would make this conclusion easier to see. I copied their data into Google Docs and generated this graph:
The other thing that jumps out of this graph is that something funny happened in 2010. The caption in the article explains, "New race and ethnicity categories, first reported in 2010 ... are combined under 'other'." This change in the survey seems to have caused a decrease in the number of respondents reporting "other", and an increase in "Caucasian." I can't explain why it had that effect, but it is not surprising that it had an effect.
Claim 2) African Americans, as a percentage of total of all recipients of doctoral degrees grew about half a percent from 2001 to 2010;
Validity: FALSE. Because the survey changed in 2010, it is not a good idea to summarize the results by comparing the first and last data points. If we drop 2010, there is no evidence of any meaningful change in the percentage of African Americans.
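To see why comparing endpoints is fragile, here is a small sketch using `scipy.stats.linregress`. The numbers are made up for illustration, not the ASEE data: a flat series with one anomalous final point (like a survey change would produce) shows a "change" between the first and last values, but a trend line fit to the stable years shows nothing.

```python
from scipy.stats import linregress

# Hypothetical percentages, NOT the ASEE data: flat from 2001-2009,
# with an anomalous jump in 2010 (as a survey change might cause).
years = list(range(2001, 2011))
pct = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 5.6]

# Comparing endpoints suggests a half-point increase...
print(round(pct[-1] - pct[0], 1))

# ...but a trend line fit to the years before the anomaly
# shows no evidence of a trend: slope near zero, large p-value.
result = linregress(years[:-1], pct[:-1])
print(result.slope, result.pvalue)
```

The endpoint difference is 0.5 percentage points, but the fitted slope over the stable years is essentially zero and nowhere near statistically significant.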
Claim 3) Hispanics increased by about two percent during the same time period;
Validity: FALSE. Again, if we ignore 2010, there is no evidence of change.
Claim 4) Asian Americans stayed virtually unchanged;
Validity: MAYBE. If anything, there is a small decrease. Again ignoring 2010, the last three data points are all below the previous six. But if you fit a trend line, the slope is not statistically significant.
Claim 5) Caucasians increased by percent.
Validity: FALSE. If we ignore 2010, there is a clear downward trend. If you fit a trend line, the slope is about -0.6 percentage points per year, and the p-value is 0.003.
Claim 6) No comment on "Other"
Validity: ERROR OF OMISSION. There is a clear upward trend, with or without the last data point. The fitted slope is almost 0.9 percentage points per year, and the p-value is 0.005.
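Fitting those trend lines takes only a couple of lines in Python. Here is a sketch with illustrative numbers (again, not the actual survey data): a series that declines by roughly 0.6 percentage points per year, similar in shape to the Caucasian trend described above.

```python
from scipy.stats import linregress

# Illustrative percentages, NOT the ASEE data: a steady decline of
# roughly 0.6 percentage points per year, plus a little noise.
years = list(range(2001, 2010))   # drop 2010, when the survey changed
pct = [30.1, 29.3, 28.9, 28.2, 27.6, 27.1, 26.2, 25.8, 24.9]

result = linregress(years, pct)
print(f"slope = {result.slope:.2f} %age points/year, p = {result.pvalue:.4f}")
# A clearly negative slope with a tiny p-value: strong evidence of a trend.
```

The slope estimate and p-value come straight from the fit; a small p-value means a slope this steep is very unlikely to arise by chance in a series with no real trend.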
So let's summarize:
Race               Article claims     Actually
----               --------------     --------
African American   +0.5 %age point    No change
Hispanic           +2 %age points     No change
Asian American     No change          Maybe down
Caucasian          Up                 -4 %age points
Other              No comment         +6 %age points
What's the point of this? Granted, a newsletter from ASEE is not the Proceedings of the National Academy of Sciences, so maybe I shouldn't pick on it. But it makes a nice example of simple statistics gone wrong. I guess that makes me a statistics Nazi after all.
Here's one more lesson: if you run a survey every year, avoid changing the questions, or even the selection of responses. It is almost impossible to do time series analysis across different versions of a question.
If you read this far, here's a small reward. The electronic edition of Think Stats is on sale now at 50% off, which makes it $8.49. Click here to get the deal.