Thursday, March 31, 2011

Freshman hordes more godless than ever!

Since 1978 the percentage of college freshmen whose religious preference is "None" has nearly tripled from 8% to 23%, and the trend is accelerating.


In 2007 I wrote an article for Free Inquiry magazine reporting on survey results from the Cooperative Institutional Research Program (CIRP) of the Higher Education Research Institute (HERI).  Here is some background on the survey, from my article:
The CIRP survey includes questions about students’ backgrounds, activities, and attitudes. In one question, students were asked their “current religious preference” and given a choice of seventeen common religions and Christian denominations, “Other Christian,” “Other religion,” or “None.” Another question asked students how often they “attended a religious service” in the last year. The choices were “Frequently,” “Occasionally,” and “Not at all.” The instructions directed students to select “Occasionally” if they attended one or more times, so a nonobservant student who attended a wedding and a funeral (and follows instructions) would not be counted among the apostates.
The following figure shows students' responses over the history of the survey (updated with the most recent data):




Here's what I said about this figure 4 years ago:
The number of students with no religious preference has been increasing steadily since the low point ... in 1978. ... The rate of growth from 1980 to the present has been between 0.25 and 0.35 percentage points per year... Since 1997, the rate may have increased to 0.6 or 0.7 percentage points per year.  At that rate, the class of 2056 will have an atheist majority.
A linear extrapolation from data like this is mostly ridiculous, but as it turns out the next four points are pretty much in line.  And finally, I claimed:
Both curves show a possible acceleration between 2005 and 2006. This jump may be due to increased visibility of atheism following the publication of books by Sam Harris, Daniel C. Dennett, and Richard Dawkins.
That last point is pure speculation on my part, but it is also what you might call a testable hypothesis.  Which gives me an excuse to talk about another article of mine, "A novel changepoint detection algorithm," which you can download from arXiv.  Here's the abstract:
We [my imaginary co-author and I] propose an algorithm for simultaneously detecting and locating changepoints in a time series, .... The kernel of the algorithm is a system of equations that computes, for each index i, the probability that the last (most recent) change point occurred at i.
In a time series, a changepoint is a time where the behavior of the system changes abruptly.  By applying my algorithm to the CIRP data, we can test whether there are changepoints and when they are likely to have occurred.


My conjecture is that the rate of increase changed in 1997 and maybe again in 2006.  Since this is a hypothesis about rates, I'll start by computing differences between successive elements as an estimate of the first derivative.  Where there is missing data, I use the average yearly change.
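Here's a minimal sketch of that computation, assuming the series is stored in parallel sequences years and nones (the variable names are mine, not from the original code):

import numpy as np

def yearly_changes(years, nones):
    # First differences as an estimate of the first derivative.
    # Dividing by the gap between surveys spreads each change
    # across skipped years -- the average yearly change.
    years = np.asarray(years, dtype=float)
    nones = np.asarray(nones, dtype=float)
    return np.diff(nones) / np.diff(years)

This figure shows the result: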



As usual, taking differences amplifies noise and makes it harder to see patterns.  But that's exactly what my algorithm (which is mine) is good for.  Here are the results:



The y-axis is the probability that the last (most recent) change point occurred during a given year, accumulated from right to left.  The gray lines indicate years with a relatively high probability of being the last changepoint.  So, reading from right to left, there is a 5% chance of a changepoint in 2006 and a 5% chance for 1998.  The most likely location of the last changepoint is 1984 (about 20%) or 1975 (25%).  So that provides little if any evidence for my conjecture, which is pretty much what I deserve.


A simpler, and more likely, hypothesis is that the trend is accelerating; that is, the slope is changing continuously, not abruptly.  And that's easy to test by fitting a line to the yearly changes.




The red line shows the linear least squares fit, with slope 0.033; the p-value (chance of seeing an absolute slope as big as that) is 0.006, so you can either reject the null hypothesis or update your subjective degree of belief accordingly.
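If you want to replicate the test, here's a sketch using scipy, where changes = yearly_changes(years, nones) from the sketch above:

from scipy.stats import linregress

# fit a line to the yearly changes; the p-value tests the
# null hypothesis that the slope is zero (no acceleration)
slope, intercept, r, p, stderr = linregress(years[1:], changes)
print(slope, p)    # the values reported above: 0.033 and 0.006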


The fitted value for the current rate of growth is 0.9 percentage points per year, accelerating at 0.033 percentage points per year^2.  So here's my prediction: in 2011 the percentage of freshmen who report no religious affiliation will be 23.0 + 0.9 + 0.03 = 23.9%.



Wednesday, March 23, 2011

Predicting marathon times

I ran the New Bedford Half Marathon on Sunday in 1:34:08, which is a personal record for me.  Like a lot of runners in New England, I ran New Bedford as a training run for the Boston Marathon, coming up in about 4 weeks.


In addition to the training, the half marathon is also useful for predicting your marathon time.  There are online calculators where you can type in a recent race time and get predicted times at other distances.  Most of them are based on the Riegel formula; here is the explanation from Runner's World:

The Distance Finish Times calculator ... uses the formula T2 = T1 x (D2/D1)^1.06 where T1 is the given time, D1 is the given distance, D2 is the distance to predict a time for, and T2 is the calculated time for D2. 
The formula was developed by Pete Riegel and published first in a slightly different form in Runner's World, August 1977, in an article in that issue entitled "Time Predicting." The formula was refined for other sports (swimming, bicycling, walking) in an article "Athletic Records and Human Endurance," also written by Pete Riegel, which appeared in American Scientist, May-June 1981.
Based on my half marathon, the formula says I should be able to run a marathon in 3:16:09.  Other predictors use different parameters or different formulas, but even the most conservative prediction is under 3:20, which just happens to be my qualifying time.  So I should be able to qualify, right?
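If you want to check those numbers, here's the Riegel calculation in a few lines (riegel is my function name, not from the online calculators):

def riegel(t1, d1, d2, exponent=1.06):
    # predict the time at distance d2, given time t1 at distance d1
    return t1 * (d2 / d1) ** exponent

half = 94 + 8 / 60.0                # 1:34:08 in minutes
print(riegel(half, 13.1, 26.2))     # about 196 minutes, or 3:16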


There are a few caveats.  First, you have to train for the distance.  If you have never run farther than 13.1 miles, you will have a hard time with the marathon, no matter what your half marathon time is.  For me, this base is covered.  I have been following the FIRST training program since January, including several runs over 20 miles (and another coming up on Sunday).


Second, weather is a factor.  New Bedford this weekend was 40 degF and sunny, which is pretty close to optimal.  But if Marathon Monday is 80 degF, no one is going to hit their target time.


Finally, terrain is a factor.  Boston is a net-downhill course, which would normally make it fast, but the hills, especially in miles 17-21, make it a tough course.


So that raises my question-of-the-week: how well does New Bedford predict Boston?


The 2010 results from New Bedford are here, and they are available in an easy-to-parse format.  Results from Boston are here, but you can't download the whole thing; you have to search by name, etc.  So I wrote a program that loops through the New Bedford results and searches for someone in Boston with the same name and age (or age+1 for anyone born under Aries).


I found 520 people who ran both races and I matched up their times.  This scatter plot shows the results:




Not surprisingly, the results are correlated: R^2 is 0.86.  There are a few outliers, which might be different people with the same name and age, or people who had one bad day and one good day.


Now let's fit a curve.  Taking the log of both sides of the Riegel formula, with exponent A and D2/D1 = 2 (a marathon is twice a half marathon), we get


log(T2) = log(T1) + A log(2)


So if we plot T2 vs T1 on a log-log scale, we should see a line with slope 1, and we can estimate the intercept.  This figure shows the result:





The fitted line has slope 1.0 and the intercept that minimizes squared error.  The estimated intercept corresponds to an exponent in the Riegel model of 1.16, substantially higher than the value used by the race time predictors (1.06).
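Here's a sketch of that estimate, assuming paired finish times in minutes in arrays new_bedford and boston (my names).  With the slope fixed at 1, the intercept that minimizes squared error is just the mean residual:

import numpy as np

x = np.log(new_bedford)
y = np.log(boston)

# with slope fixed at 1, the least-squares intercept is the
# mean of the residuals y - x
intercept = np.mean(y - x)

# Riegel: log(T2) = log(T1) + A log(2), so A = intercept / log(2)
print(intercept / np.log(2))    # about 1.16 for this dataset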


Visually, the line is not a particularly good fit for the data, which suggests that this model is not capturing the shape of the relationship.  We could certainly look for better models, but instead let's apply Downey's Law, which says "The vast majority of statistical questions in the real world can be answered by counting."


What are my chances of qualifying for Boston?  Let's zoom in on the people who finished near me in New Bedford (minus one year):




The red line isn't a fitted curve; it's the 3:20 line.  Of the 50 people who ran my time in New Bedford (plus or minus two minutes), only 13 ran a 3:20 or better in Boston.  So my chances are about 25%.  Or maybe less -- most of those 13 were faster than me.
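In code, the count looks something like this, reusing the hypothetical arrays from above (200 minutes is 3:20, and 94.13 minutes is my half marathon time):

# runners whose New Bedford time was within 2 minutes of mine
near_me = np.abs(new_bedford - 94.13) <= 2

# fraction of them who ran 3:20 or better in Boston
print(np.mean(boston[near_me] <= 200))    # about 13/50, or 26%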


Why are the race predictors so optimistic?  One possibility is weather.  If Boston was hot last year, that would slow people down.  But according to Wolfram|Alpha, New Bedford on March 21, 2010 was 55-60 degF and sunny.  And Boston on April 18 was 40-50 degF and cloudy.  So if anything we might expect times in Boston to be slightly faster.


The most likely explanation (or at least the only one left) is terrain.  Curse you, Heartbreak Hill!


-----


Update March 24, 2011:  Here's one more figure showing percentiles of marathon times as a function of half marathon time.  The blue line is at 94 minutes; for that time the percentiles are 192, 200, 207, 216 and 239 minutes.  So the median time for people in my cohort is 3:27:00.



Update April 19, 2011: I ran with a group of friends aiming for a 3:25 pace.  We hit the halfway mark in 1:45, so we revised our goal to 3:30.  But I thrashed my quads in the first half and finished in 3:45, pretty close to that dotted blue line.  Still, it was an exciting day and a lot of fun, for some definition of fun.


Update May 17, 2011: It turns out I don't get credit for Downey's Law; Francis Galton beat me to it.  And to make matters worse, he said it better: "Whenever you can, count."


If you find this sort of thing interesting, you might like my free statistics textbook, Think Stats. You can download it or read it at thinkstats.com.

Wednesday, March 2, 2011

BQ is unfair to women

Last week I wrote about the changes the BAA is making in the qualifying times for the Boston Marathon.  Based on a sample of qualifiers from the Chicago Marathon, I predicted that the proportion of women in the open division (ages 18-34) will drop in the next two years.


This raises an obvious question: are the new standards fair?  BAA executive director Tom Grilk explained: “Looking back at the data that we have... we found that the fairest way to deal with this is to have a uniform reduction in qualifying standards across the board.”


But he didn’t explain what he means by “fair.”  There are several possibilities.


1) E-fairness (E for elite): By this definition, a standard is fair if the gender gap for qualifiers is the same as for elite runners.  I discussed this standard in the previous post, and showed two problems: (a) elite women are farther from the pack than elite men, so qualifying times would be determined by a small number of outliers; and as a result, (b) this standard would disqualify 47% of the women in the open division.


A variation of E-fairness uses the relative difference in speeds rather than the absolute difference in times.  This option reduces the impact, but doesn’t address what I think is the basic problem: it doesn’t make sense to base qualifying times on the performance of elite runners.


2) R-fairness (R for representative): By this definition a standard is fair if the qualifiers are a representative sample of the population of marathoners.  I don’t have good data to evaluate R-fairness for the open division, but for the field as a whole the current standard is R-fair: according to Running USA, 41% of marathoners are female, and in 2010 42% of Boston Marathon finishers were female.


[Note: this article in the Wall Street Journal claims that 42% is "higher than the percentage of all U.S. marathoners who are women," but I don't know what they are basing that on.]


A problem with R-fairness is that the population of marathoners includes some people who are competitive racers and others who are... not competitive racers.  I don’t think it makes sense for the middle and the back of the pack to affect the standard.


3) C-fairness (C for contenders): Qualifying times should be determined by the most relevant population, runners who finish close to the standard.  Specifically, I define a “contender” as someone who finishes within X minutes of the standard, where X is something like 20 minutes (we’ll look at some different values for X and see that it doesn’t matter very much).


And here’s what I propose: a standard is C-fair if the percentage of contenders who qualify is the same in each group.  As an example, I'll compute a fair standard for men and women in the open division.  Here’s how:


1) Like last week, I use data from the last three Chicago marathons as a sample of the population of marathoners.  [Note: If this sample is not representative, that will affect my results, so I would like to get a more comprehensive dataset.  I contacted marathonguide.com, but have not heard back.]


2) For the current male standard, 3:10, I select runners who finish within X minutes of the standard, and compute the percentage of these contenders that qualify.


3) Then I search for the female standard that yields the same percentage of qualifiers.  (A sketch of this search appears just below.)
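Here's a minimal sketch of steps 2 and 3, assuming arrays men and women of finish times in minutes (hypothetical names), with X = 30 minutes:

import numpy as np

def qualify_rate(times, standard, window):
    # fraction of contenders -- runners within `window` minutes
    # of the standard -- who beat the standard
    contenders = times[np.abs(times - standard) <= window]
    return np.mean(contenders <= standard)

# male standard is 3:10, or 190 minutes
target = qualify_rate(men, 190, 30)

# scan candidate female standards from 3:20 to 4:20 for the one
# whose contender qualify rate is closest to the male rate
candidates = np.arange(200.0, 260.0)
rates = np.array([qualify_rate(women, s, 30) for s in candidates])
print(candidates[np.argmin(np.abs(rates - target))])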


This figure shows the results:






The x-axis is the gender gap: the difference in minutes between male and female qualifying times.  The y-axis is the difference in the percentage of contenders who qualify.  The lines show results for values of X from 20 to 40 minutes.  For smaller X, the results are noisier.


The gap where the lines cross zero is the one that is C-fair.  By that definition, the gap should be about 38 minutes.  So if the male standard is 3:10, the female standard should be 3:48.


In 2013 the male standard will be 3:05.  In that case, based on the same analysis, the gap should be 34 minutes, so the female standard should be 3:39.  


C-fairness also has the property of “equal marginal impact,” which means that if we tighten the standard by 1 minute, we disqualify the same percentage of runners in all groups, which leaves the demographics of the field unchanged.  Last week we saw that the current standard does not have this property -- tightening the qualifying times has a disproportionate impact on women.


In summary:


1) I think qualifying times should be based on the population of contenders -- runners near the standard -- not on the elites or the back of the pack.


2) A standard is fair if it qualifies the same proportion of contenders from each group.


3) By that definition, the gender gap in the open division should be 38 minutes in 2012 and 34 minutes in 2013.


4) The common belief that the standard for women is too easy is mistaken; by the definition of fair that I think is most appropriate, the standard for women is relatively hard.



Thursday, February 24, 2011

Moving the goalposts

Last week I wrote about the effect of the Boston Marathon qualifying standard on finish times in other marathons.  The next day the organizers of the Boston Marathon, the BAA, announced new qualifying standards for 2012 and 2013 and a new registration process.  Here is my summary of the announcement:

  • For 2012, qualifying times for all groups are lower by 1 minute.
  • For 2013, qualifying times for all groups drop another 5 minutes.
  • For both years, runners who beat their qualifying time by 20 minutes are allowed to register first, followed by runners who qualify by 10 minutes, 5 minutes, and then 0 minutes.

Among runners, there has been a lot of discussion of the effect these changes will have on the demographics of the race, especially the proportion of women and older runners.

To answer this question, we have to get a sense of what the pool of potential qualifiers looks like.  I collected data from the 2008, 2009 and 2010 Chicago Marathons (don’t ask me how).  This dataset includes 101135 finish times with the gender and age of each runner.  To keep things simple, I treat each finish time as a different runner, even though many people ran the race in more than one of the years I looked at.

I selected runners in the 18-34 group, which includes 22619 men and 24471 women.  Then I selected the runners who finished no more than 30 minutes over the current qualifying standard, which includes 4951 men and 5502 women.
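The selection looks something like this, assuming the results are in a pandas DataFrame df with columns gender, age, and minutes (column names are mine):

# open division: ages 18-34
open_div = df[(df.age >= 18) & (df.age <= 34)]

# standards in minutes: 3:10 for men, 3:40 for women
standard = open_div.gender.map({'M': 190, 'F': 220})

# keep runners no more than 30 minutes over their standard
sample = open_div[open_div.minutes <= standard + 30]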

For each finish time, I computed the difference between the actual time and the standard (3:10 for men, 3:40 for women).  This figure shows the distribution of those differences:



The BQ Effect, which I wrote about last week, is apparent for men and women.  For the men there is another mode 10 minutes below the standard, which happens to be 3:00:00.  Apparently the only thing more important than qualifying for Boston is breaking 3 hours.

The runners in this dataset are a sample of the population that constitutes the field at Boston.  If we assume that this sample is representative, we can use it to predict the effect of the changes in the qualifying standard.

As a baseline, in 2010 there were 9602 finishers in the open division, 4651 men and 4951 women.  In the Chicago sample, tightening the standard by 1 minute (which will happen in 2012) disqualifies 92 men (4.9%) and 102 women (5.6%).  Extrapolating these changes to the open division in Boston (9602), the net effect is to replace 16 women with men.
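For the curious, here's one way to do that extrapolation, holding the size of the field fixed; with the rounded rates above it displaces about 18 women rather than 16, so the figure in the text evidently comes from unrounded rates:

men, women = 4651, 4951            # 2010 Boston open division
dq_men, dq_women = 0.049, 0.056    # disqualified by a 1-minute cut

# scale the surviving sample back up to a field of 9602
keep_men = men * (1 - dq_men)
keep_women = women * (1 - dq_women)
new_women = 9602 * keep_women / (keep_men + keep_women)
print(women - new_women)           # about 18 with these rounded rates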

In 2013, the standard will be 6 minutes lower, which disqualifies 30.1% of men and 33.9% of women, and displaces 134 women.

The new registration process might make the effect even more dramatic.  If the field fills before the last group has a chance, the standard is effectively tougher by 11 minutes.  That disqualifies 41.6% of men, 50.5% of women, and displaces 395 women.

This change would be noticeable.  In 2010, the open division was almost 52% female.  In 2013, it might be 52% male.

Is that fair?  I will argue that it is not, but it will take some time to make the argument clear.  Let me start by explaining why some people think it’s fair.  It is conventional wisdom that the current standard is relatively easy for women.  If that’s true, then these changes will be a correction.

For example, in this article Amby Burfoot proposes a standard based on the age-graded tables created by World Masters Athletics (WMA).  Under that standard, the qualifying times would be 3:12 for men and 3:30 for women.  In the Chicago sample, that would increase the number of men by 8% and decrease the number of women by 47%.  Extrapolating to Boston, it would replace 1751 women with men, yielding a field of 66% men.

I think this standard is unreasonable because the WMA tables are based on world records in each group, not the performance of more typical qualifiers.  The proposed standard has a bigger effect on women because the elite females are farther from the pack than the elite males.

To see the tail of the distribution, let’s zoom in on the runners in the Chicago sample who beat the standard by 30 minutes or more:



The tail for women extends farther to the left.  The fastest men beat the standard by 65 minutes; the fastest women beat it by more than 80 minutes.  But the existence of a small number of outliers should not have such a drastic effect on the qualifying standard.

In conclusion:

1) The changes in the qualifying standard and process might decrease the representation of women in Boston from 52% to 48%.

2) The common belief that the standard is relatively easy for women is based on the performance of a small number of outliers.  The gender gap for near-qualifiers is much larger than the gap for elites.

3) Implementing the standard proposed by Runner's World would decrease the percentage of women in the open division at Boston to 33%.

-----

Coming soon:

1) I will evaluate possible definitions of a “fair” standard and compute qualifying times for each definition.

2) I will extend this analysis to the other age groups.


Monday, February 14, 2011

The BQ Effect

In order to register for the Boston Marathon, you have to run another marathon faster than the qualifying standard for your age group and gender.  For example, men 34 and under have to run 3:10 or better.  Since I am 43, my qualifying time is 3:20 -- on my 45th birthday I get another ten minutes (wrapped up with a bow).

These standards are challenging.  At most major marathons, less than 20% of the field runs a Boston qualifying time, known as a “BQ.”  According to marathonguide.com, the races in 2010 with the highest percentage of qualifiers were the Bay State Marathon (37%) and Boston itself (44%).

Two years ago I ran the Bay State Marathon in my first (unsuccessful) attempt to qualify.  I was almost on pace for the first 20 miles, then hit the wall and slowed from 7:50 per mile to 10:00.  While I was shuffling through the last few miles, I wondered whether the Boston qualifying times have a visible effect on the distribution of finish times.

I can imagine several possible BQ effects: (1) an excess of runners finishing just under the qualifying time, (2) a deficit of runners just over, and maybe (3) an excess of runners like me who attempt a qualifying time, bonk, and end up 10-20 minutes over.

To look for these effects, I collected data from the 2010 Bay State Marathon (available at coolrunning.com).  For each of the 1564 finishers, I looked up their age and gender to get qualifying times, and computed the difference in minutes between their actual and qualifying times.
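The core computation looks something like this; the standards dictionary here shows only two age groups, but the real table has an entry for each group and gender:

# BAA qualifying standards in minutes, keyed by (gender, age group)
standards = {('M', '18-34'): 190, ('F', '18-34'): 220,
             ('M', '40-44'): 200, ('F', '40-44'): 230}

def bq_diff(minutes, gender, age_group):
    # difference between actual and qualifying time; negative is a BQ
    return minutes - standards[gender, age_group]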

For comparison, I also collected data from the 2010 Sun Lowell Half Marathon, which runs on (mostly) the same course at the same time.  For each half-marathoner, I computed a marathon time prediction using the well-known Riegel formula.

The following figure shows the PMF of finish time, relative to qualifying time, for runners in both races whose finish time (real or imagined) was within 60 minutes of their qualifying time.  A negative difference means the runner qualified.



As expected, there is a substantial mode just under the qualifying time for the marathoners, but not for the half marathoners.  The other two hypothetical effects are not apparent.

PMFs tend to be noisy, so I generally prefer to look at CDFs:



Again, the BQ effect is clear: in the full marathon there are more runners just under the qualifying time and fewer just over.  But either the number of people who bonk is small, or their times are sufficiently spread out that they don’t constitute an apparent mode.  I am a mode of one.


Monday, February 7, 2011

Are first babies more likely to be late?

UPDATE: The version of this article with the most recent data is here.





When my wife and I were expecting our first child, we heard from several people that first babies tend to be late.  If you Google this question, you will find plenty of discussion.  Some people claim it's true, others say it's a myth, and some people say it's the other way around: first babies come early.

What you will probably not find is any data to support these claims.  Except this kind of data:
“My two friends that have given birth recently to their first babies, BOTH went almost 2 weeks overdue before going into labour or being induced.”

“I don't think that can be true because my sister was my mother's first and she was early, as with many of my cousins.”
If you don’t find those arguments compelling, you might be interested in the National Survey of Family Growth (NSFG), a survey conducted by the U.S. Centers for Disease Control and Prevention (CDC) to gather “information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men's and women's health.”

Their 2002 dataset includes 7643 women, who reported the gestational age for their 9148 live births.  This figure shows the distribution of pregnancy length:


The distributions are similar for first babies and others.  The mode is 39 weeks; the distribution is skewed to the left, with some births as early as 24 weeks, but almost none later than 44 weeks.

On average, first babies are 0.078 weeks later than others.  This difference, 13 hours, is not statistically significant.  But there are differences in the shape of the distribution.  This figure shows the percent difference between first babies and others for weeks 34 through 46:


The general pattern is that first babies are more likely to be early (37 weeks or less), less likely to be on time (38-40), and more likely to be late (41 or more).  In terms of relative risk, first babies are 8% more likely to be born early and 66% more likely to be late.  And those differences are statistically significant (using a chi-square test, p < 0.001).

So far, none of this is useful for planning, so let’s consider a scenario: suppose you are in week 38 and you want to know your chances of delivering during the next three weeks.  For first babies, the chance is 81%; for others, it is 89%.  So yes, in this scenario, first babies are more likely to be late.  Sorry.
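That kind of conditional probability is easy to compute by counting.  Here's a sketch, assuming an array lengths of pregnancy lengths in weeks for the relevant group (names are mine):

import numpy as np

def prob_deliver_soon(lengths, week=38, horizon=3):
    # P(deliver within `horizon` weeks | still pregnant at `week`)
    still_pregnant = lengths[lengths >= week]
    return np.mean(still_pregnant <= week + horizon)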


Wednesday, February 2, 2011

Yet another reason SAT scores are non-predictive

Most colleges use the SAT as a criterion for admission because they assume that SAT scores contain information about students’ preparation.  The College Board defends this assumption with evidence that SAT scores are correlated with first-year college grades.  But even in their own studies, the correlation is only R=0.53.

Correlations are often reported in terms of “explained variability”: a correlation of 0.53 explains R^2=28% of the variance in the data.  But even that overstates the predictive power of SAT scores because it is expressed in terms of variance, not standard deviation.

To make that more concrete, suppose you are trying to guess students’ first year grades; how much better could you guess if you knew their SAT scores?  Without using SAT scores, the best you could do is to guess the mean, which is 2.97 in the College Board study.  In that case your root-mean-squared-error (RMS) would be the standard deviation, 0.71.

Using SAT scores, you could reduce RMS to 0.60, which is a 15% improvement.  That’s a little bit better, but 0.6 is the difference between a B average and an A-, and you should expect to be off by at least that much 32% of the time.
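Here's the arithmetic behind those numbers: for a linear predictor, the RMS of the residuals is the standard deviation scaled by sqrt(1 - R^2).

import math

r = 0.53       # correlation reported by the College Board
sigma = 0.71   # standard deviation of first-year grades

rmse = sigma * math.sqrt(1 - r**2)
print(rmse)                 # about 0.60
print(1 - rmse / sigma)     # about 0.15, a 15% improvement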

At colleges with strict admissions standards, the correlation tends to be even lower.  There are two reasons for this effect.  One is explained in this article by John Allen Paulos, who points out that “Colleges usually accept students from a fairly narrow swath of the SAT spectrum,” and “the degree of correlation between two variables depends on the range of the variables considered.”

To see why that’s true, consider Harvard, where the 25th percentile for the combined SAT is 2100 (see satscores.us).  So 75% of the students averaged 700 or better on the three parts of the exam.  That’s a “fairly narrow swath.”

In fact the problem is even worse than Paulos suggests, because the SAT doesn’t discriminate very well between people at the top of the scale.  For example, the difference between a 700 and a 770 can be as few as 4 questions, which might be due to chance.

To quantify that, I will compute the chance that someone who gets a 700 is actually better than someone who gets a 770.  By “better,” I only mean “better at answering SAT questions,” which I will describe with P, the probability of getting a question right.

To estimate a student’s value of P, we start with a prior distribution and use the test score to perform an update.  We can get a reasonable prior distribution by taking the distribution of scaled scores, mapping them back to raw scores, and converting raw scores to percentages.

I got the distribution of scores and the scale from College Board data tables.  The distribution of raw scores is roughly normal (by design).  This figure shows a normal probability plot, which compares the actual values to random values from a normal distribution:



The red line shows a linear fit to the data.  At both ends, the data deviate from the model, which indicates that the range of the test is not as wide as the range of ability; if it were, the top score would be 80, not 54.

Nevertheless, we can convert these scores to percentages (54/54 = 100%) and use the distribution of percentages as a prior for P.  Then to perform a Bayesian update, we compute the likelihood of a score, R, given P:

def Likelihood(self, R, P):
    # R is the raw score (number of questions right, out of 54);
    # P is the hypothetical probability of getting a question right
    right = R
    wrong = 54 - R
    return pow(P, right) * pow(1 - P, wrong)

This is an unnormalized binomial probability: the binomial coefficient is omitted because it does not depend on P, so it cancels when the posterior is normalized.
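To get a posterior, multiply the prior by this likelihood and renormalize.  Here's a minimal sketch of that step, assuming the prior is represented as a grid of values of P with probabilities prior_probs (the representation and names are mine, not the code from the actual analysis):

import numpy as np

def update(ps, prior_probs, R, n=54):
    # posterior over P after observing R of n questions correct
    likes = ps**R * (1 - ps)**(n - R)
    post = prior_probs * likes
    return post / post.sum()

This figure shows the posterior distributions of P for hypothetical students with scores of 700 and 770: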



The 90% credible interval for a student who gets a 700 is (650, 730); for a student who gets a 770 it is (700, 780).  To put that in perspective, at Harvard the interquartile range for the math exam is 700-790.  So a student who gets a 770 might really be in the 25th percentile of his or her class.  And the chance is about 10% that a classmate who got a 700 is actually better (at answering SAT questions, anyway).

-----

Notes: I made some simplifications, all of which tend to underestimate the margin of error for SAT scores.

1) Some SAT questions are easier than others, so the probability of getting each question right is not the same.  For students at the top of the range, the probability of getting the easy questions right is close to 1, so the observed differences are effectively based on a small number of more difficult questions.

2) The compression of scores at the high end also compresses the margins of error.  For example, the credible interval for a student who gets a 770 only goes up to 780.  This is clearly an underestimate.

3) I ignored the fact that students who miss a question lose ¼ point.  Since students at the top end of the range usually answer every question, a student who missed 4 questions loses 5 points, amplifying the effect of random errors.