Wednesday, February 29, 2012

To be a statistics Nazi?

When I started writing Think Stats, I wanted to avoid dire warnings about all the things people do wrong with statistics.  I enjoy feeling smug and pointing out other people's mistakes as much as the next statistics Nazi, but I don't see any evidence that the warnings have much effect on the quality of statistical analysis in the press.

Maybe another approach is in order.  Instead of making statistics seem like an arcane art that can only be practiced correctly by trained professionals, I would like to emphasize that the majority of statistical analysis is very simple.  Deep mathematics is seldom necessary; usually it is enough to ask good questions and apply simple techniques.

As an example, I'm going to do what I said I wouldn't: point out other people's mistakes.  Here is an excerpt from a recent ASEE newsletter, Connections:


I. Databytes

Doctoral Degrees
by Race and Ethnicity:
A Decade of Little Change

The percentages of recipients of doctoral degrees from all engineering disciplines by race and ethnicity show a great deal of stability over the last ten years.  African Americans, as a percentage of total of all recipients of doctoral degrees grew about half a percent from 2001 to 2010; Hispanics increased by about two percent during the same time period; Asian Americans stayed virtually unchanged; and Caucasians increased by percent.  
Doctoral Degrees by Race and Ethnicity*

Race                 2001    2002    2003    2004    2005    2006    2007    2008    2009    2010
----                 ----    ----    ----    ----    ----    ----    ----    ----    ----    ----
African American     3.9%    3.5%    3.4%    3.8%    3.7%    3.7%    3.6%    3.2%    3.8%    4.4%
Hispanic             3.3%    3.9%    3.6%    3.5%    3.7%    3.0%    3.5%    3.6%    3.8%    5.2%
Other                14.2%   11.4%   11.9%   14.0%   14.2%   15.1%   18.7%   19.5%   17.6%   10.7%
Asian American       13.9%   14.6%   14.4%   14.0%   14.4%   16.6%   12.0%   12.4%   13.2%   14.0%
Caucasian            64.7%   66.6%   66.7%   64.7%   64.0%   61.6%   62.2%   61.3%   61.6%   65.7%

*Data on ethnicity does not include schools from Puerto Rico or foreign nationals. The percentage of Hispanic graduates is 5.5% in 2010 if graduates from the University of Puerto Rico, Mayaguez are included. New race and ethnicity categories, first reported in 2010, American Indians (0.4%), Hawaiian/Pacific Islanders (0.1%) and Two or More (0.5%) are combined under “other”. Six institutions reported virtually all degrees in the Unknown field. These institutions were removed from the calculations for race, ethnicity and residency.


The paragraph tries to summarize the data in the table, and fails.  Let's take it point by point:

Claim 1) The percentages of recipients of doctoral degrees from all engineering disciplines by race and ethnicity show a great deal of stability over the last ten years.

Validity: BASICALLY TRUE.  If they had just stopped here, everything would be fine.  A graph would make this conclusion easier to see.  I copied their data into Google Docs and generated this graph:
Yup.  Pretty flat.

The other thing that jumps out of this graph is that something funny happened in 2010.  The caption in the article explains, "New race and ethnicity categories, first reported in 2010 ... are combined under “other”."  This change in the survey seems to have caused a decrease in the percentage of respondents reporting "Other", and an increase in "Caucasian."  I can't explain why it had that effect, but it is not surprising that it had an effect.

Claim 2) African Americans, as a percentage of total of all recipients of doctoral degrees grew about half a percent from 2001 to 2010;

Validity: FALSE.  Because the survey changed in 2010, it is not a good idea to summarize the results by comparing the first and last data points.  If we drop 2010, there is no evidence of any meaningful change in the percentage of African Americans.

Claim 3) Hispanics increased by about two percent during the same time period;

Validity: FALSE.  Again, if we ignore 2010, there is no evidence of change.
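
For Claims 2 and 3, here is a rough sketch of the difference between comparing endpoints and looking at the trend without 2010.  The numbers come from the table above; the helper names `slope` and `summarize` are made up for this example, and using a least-squares line to measure "change" is an assumption, not something the article or the table prescribes.

```python
# Percentages copied from the ASEE table, 2001 through 2010.
YEARS = list(range(2001, 2011))
AFRICAN_AMERICAN = [3.9, 3.5, 3.4, 3.8, 3.7, 3.7, 3.6, 3.2, 3.8, 4.4]
HISPANIC = [3.3, 3.9, 3.6, 3.5, 3.7, 3.0, 3.5, 3.6, 3.8, 5.2]


def slope(xs, ys):
    """Least-squares slope of ys on xs (hypothetical helper)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def summarize(ys):
    """Return (2001-to-2010 change, fitted slope for 2001-2009)."""
    endpoint_change = ys[-1] - ys[0]
    trend = slope(YEARS[:-1], ys[:-1])  # drop the 2010 survey
    return endpoint_change, trend


for name, ys in [("African American", AFRICAN_AMERICAN), ("Hispanic", HISPANIC)]:
    change, trend = summarize(ys)
    print(f"{name}: endpoints {change:+.1f}, slope 2001-2009 {trend:+.3f} per year")
```

The endpoint differences reproduce the article's numbers, while the slopes fitted to 2001-2009 are close to zero for both groups.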

Claim 4) Asian Americans stayed virtually unchanged;

Validity: MAYBE.  If anything, there is a small decrease.  Again ignoring 2010, the last three data points are all below the previous six.  But if you fit a trend line, the slope is not statistically significant.
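
The "last three below the previous six" observation is easy to check directly.  This is only a sketch using the Asian American row of the table, with 2010 left out; the variable names are made up for the example.

```python
# Asian American percentages from the ASEE table, 2001 through 2009
# (2010 is left out because the survey categories changed that year).
ASIAN_AMERICAN = [13.9, 14.6, 14.4, 14.0, 14.4, 16.6, 12.0, 12.4, 13.2]

earlier, recent = ASIAN_AMERICAN[:6], ASIAN_AMERICAN[6:]

# True when every one of the last three values is below all six earlier ones.
all_below = max(recent) < min(earlier)

mean_earlier = sum(earlier) / len(earlier)
mean_recent = sum(recent) / len(recent)

print(f"last three all below previous six: {all_below}")
print(f"mean 2001-2006: {mean_earlier:.2f}, mean 2007-2009: {mean_recent:.2f}")
```

With only nine points, a gap like this is suggestive rather than conclusive, which is consistent with the verdict of MAYBE.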

Claim 5) Caucasians increased by percent.

Validity: FALSE. If we ignore 2010, there is a clear downward trend.  If you fit a trend line, the slope is about -0.6 percentage points per year, and the p-value is 0.003.
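
Here is one way to reproduce the slope, sketched in plain Python.  The p-value above presumably comes from an ordinary regression; the sketch below swaps in a permutation test instead, which is an assumption on my part, so its p-value need not match 0.003 exactly.  The function names, the number of shuffles and the seed are all arbitrary choices for the example.

```python
import random

# Caucasian percentages from the ASEE table, 2001 through 2009 (2010 dropped).
YEARS = list(range(2001, 2010))
CAUCASIAN = [64.7, 66.6, 66.7, 64.7, 64.0, 61.6, 62.2, 61.3, 61.6]


def slope(xs, ys):
    """Least-squares slope of ys on xs (hypothetical helper)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def permutation_p_value(xs, ys, iters=10_000, seed=17):
    """Two-sided p-value: share of shuffles with a slope at least as steep."""
    rng = random.Random(seed)
    observed = abs(slope(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(iters):
        rng.shuffle(shuffled)  # break any link between year and percentage
        if abs(slope(xs, shuffled)) >= observed:
            hits += 1
    return hits / iters


fitted = slope(YEARS, CAUCASIAN)
p_value = permutation_p_value(YEARS, CAUCASIAN)
print(f"slope: {fitted:.2f} points per year, permutation p-value: {p_value:.4f}")
```

Shuffling the percentages simulates a world where the year has no effect; if slopes this steep rarely show up by chance, the downward trend is unlikely to be an accident.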

Claim 6) No comment on "Other"

Validity: ERROR OF OMISSION.  There is a clear upward trend, with or without the last data point.  The fitted slope is almost 0.9 percentage points per year, and the p-value is 0.005.
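
To see how much the 2010 point moves the fitted line for "Other", here is a small sketch that fits the slope both ways.  As before, the `slope` helper is a made-up name and least squares is an assumed choice of method.

```python
# "Other" percentages from the ASEE table, 2001 through 2010.
YEARS = list(range(2001, 2011))
OTHER = [14.2, 11.4, 11.9, 14.0, 14.2, 15.1, 18.7, 19.5, 17.6, 10.7]


def slope(xs, ys):
    """Least-squares slope of ys on xs (hypothetical helper)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


slope_without_2010 = slope(YEARS[:-1], OTHER[:-1])
slope_with_2010 = slope(YEARS, OTHER)

print(f"slope 2001-2009: {slope_without_2010:+.2f} points per year")
print(f"slope 2001-2010: {slope_with_2010:+.2f} points per year")
```

Both slopes are positive, but the 2010 point pulls the fitted line down considerably, which is another sign of how much the survey change matters.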

So let's summarize:

Race                 Article claims       Actually
----                 --------------       --------
African American     +0.5 %age point      No change
Hispanic             +2   %age point      No change
Asian American       No change            Maybe down
Caucasian            Up                   -4 %age point
Other                No comment           +6 %age point
                                   
What's the point of this?  Granted, a newsletter from ASEE is not the Proceedings of the National Academy of Sciences, so maybe I shouldn't pick on it.  But it makes a nice example of simple statistics gone wrong.  I guess that makes me a statistics Nazi after all.

Here's one more lesson: if you run a survey every year, avoid changing the questions, or even the selection of responses.  It is almost impossible to do time series analysis across different versions of a question.

If you read this far, here's a small reward.  The electronic edition of Think Stats is on sale now at 50% off, which makes it $8.49.  Click here to get the deal.
