Wednesday, March 2, 2011

BQ is unfair to women

Last week I wrote about the changes the BAA is making in the qualifying times for the Boston Marathon.  Based on a sample of qualifiers from the Chicago Marathon, I predicted that the proportion of women in the open division (ages 18-34) will drop in the next two years.

This raises an obvious question: are the new standards fair?  BAA executive director Tom Grilk explained, “Looking back at the data that we have... we found that the fairest way to deal with this is to have a uniform reduction in qualifying standards across the board.”

But he didn’t explain what he means by “fair.”  There are several possibilities.

1) E-fairness (E for elite): By this definition, a standard is fair if the gender gap for qualifiers is the same as for elite runners.  I discussed this standard in the previous post, and showed two problems: (a) elite women are farther from the pack than elite men, so qualifying times would be determined by a small number of outliers; and as a result, (b) this standard would disqualify 47% of the women in the open division.

A variation of E-fairness uses the relative difference in speeds rather than the absolute difference in times.  This option reduces the impact, but doesn’t address what I think is the basic problem: it doesn’t make sense to base qualifying times on the performance of elite runners.

2) R-fairness (R for representative): By this definition, a standard is fair if the qualifiers are a representative sample of the population of marathoners.  I don’t have good data to evaluate R-fairness for the open division, but for the field as a whole the current standard is R-fair: according to Running USA, 41% of marathoners are female, and in 2010 42% of Boston Marathon finishers were female.

[Note: this article in the Wall Street Journal claims that 42% is "higher than the percentage of all U.S. marathoners who are women," but I don't know what they are basing that on.]

A problem with R-fairness is that the population of marathoners includes some people who are competitive racers and others who are... not competitive racers.  I don’t think it makes sense for the middle and the back of the pack to affect the standard.

3) C-fairness (C for contenders): Qualifying times should be determined by the most relevant population, runners who finish close to the standard.  Specifically, I define a “contender” as someone who finishes within X minutes of the standard, where X is something like 20 minutes (we’ll look at some different values for X and see that it doesn’t matter very much).

And here’s what I propose: a standard is C-fair if the percentage of contenders who qualify is the same in each group.  As an example, I'll compute a fair standard for men and women in the open division.  Here’s how:

1) Like last week, I use data from the last three Chicago marathons as a sample of the population of marathoners.  [Note: If this sample is not representative, that will affect my results, so I would like to get a more comprehensive dataset.  I have asked for one, but have not heard back.]

2) For the current male standard, 3:10, I select runners who finish within X minutes of the standard, and compute the percentage of these contenders that qualify.

3) Then I search for the female standard that yields the same percentage of qualifiers.
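The three steps above can be sketched in Python. This is only an illustration under stated assumptions, not the code behind the analysis: the function names `contender_pass_rate` and `c_fair_gap` are hypothetical, times are assumed to be finishing times in minutes, and the toy lists at the bottom are made up so the example runs on its own.

```python
def contender_pass_rate(times, standard, x):
    """Fraction of contenders (finishers within x minutes of the
    standard) who finish at or under the standard.
    Returns None if there are no contenders."""
    contenders = [t for t in times if abs(t - standard) <= x]
    if not contenders:
        return None
    qualifiers = [t for t in contenders if t <= standard]
    return len(qualifiers) / len(contenders)


def c_fair_gap(male_times, female_times, male_standard, x, gaps):
    """Among the candidate gaps (in minutes), return the one whose
    female pass rate is closest to the male pass rate."""
    target = contender_pass_rate(male_times, male_standard, x)
    if target is None:
        raise ValueError("no male contenders near the standard")

    def mismatch(gap):
        rate = contender_pass_rate(female_times, male_standard + gap, x)
        # A gap with no female contenders can never be the best match.
        return float("inf") if rate is None else abs(rate - target)

    return min(gaps, key=mismatch)


# Made-up toy data (minutes), NOT real race results.
toy_men = [180, 185, 190, 195, 200]
toy_women = [t + 30 for t in toy_men]

print(c_fair_gap(toy_men, toy_women, male_standard=190, x=10,
                 gaps=range(20, 41)))  # 30
```

With real results, the mismatch would rarely be exactly zero, so it makes sense to look at how the difference in pass rates changes with the gap rather than trusting a single minimum.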

This figure shows the results:

The x-axis is the gender gap: the difference in minutes between male and female qualifying times.  The y-axis is the difference in the percentage of contenders who qualify.  The lines show results for values of X from 20 to 40 minutes.  For smaller X, the results are noisier.

Where the lines cross through 0 is the gap that is C-fair.  By that definition, the gap should be about 38 minutes.  So if the male standard is 3:10, the female standard should be 3:48.

In 2013 the male standard will be 3:05.  In that case, based on the same analysis, the gap should be 34 minutes, so the female standard should be 3:39.  
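For anyone who wants to check the clock arithmetic, here is a small sketch that adds a gap in minutes to a standard written as hours and minutes. The helper name `add_gap` is just an assumption for this example.

```python
def add_gap(standard, gap_minutes):
    """Add a gap in minutes to an 'H:MM' standard and return 'H:MM'."""
    hours, minutes = map(int, standard.split(":"))
    total = hours * 60 + minutes + gap_minutes
    return f"{total // 60}:{total % 60:02d}"


print(add_gap("3:10", 38))  # 3:48
print(add_gap("3:05", 34))  # 3:39
```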

C-fairness also has the property of “equal marginal impact,” which means that if we tighten the standard by 1 minute, we disqualify the same percentage of runners in all groups, which leaves the demographics of the field unchanged.  Last week we saw that the current standard does not have this property -- tightening the qualifying times has a disproportionate impact on women.
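One way to measure marginal impact on a set of finishing times is sketched below. It assumes "percentage of runners disqualified" means the share of current qualifiers who would miss a standard tightened by one minute; that reading, the function name `marginal_impact`, and the toy data are assumptions for illustration, not part of the original analysis.

```python
def marginal_impact(times, standard, step=1):
    """Fraction of current qualifiers who would no longer qualify if
    the standard were tightened by `step` minutes."""
    qualifiers = [t for t in times if t <= standard]
    if not qualifiers:
        raise ValueError("no qualifiers at this standard")
    # Runners who meet the old standard but not the tightened one.
    dropped = [t for t in qualifiers if t > standard - step]
    return len(dropped) / len(qualifiers)


# Made-up toy data (minutes), NOT real race results.
toy_times = [180, 185, 189.5, 190, 200]
print(marginal_impact(toy_times, standard=190))  # 0.5
```

Comparing this number for men and women at their respective standards shows whether a one-minute change would shift the demographics of the field.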

In summary:

1) I think qualifying times should be based on the population of contenders -- runners near the standard -- not on the elites or the back of the pack.

2) A standard is fair if it qualifies the same proportion of contenders from each group.

3) By that definition, the gender gap in the open division should be 38 minutes in 2012 and 34 minutes in 2013.

4) The common belief that the standard for women is too easy is mistaken; by the definition of fair that I think is most appropriate, the standard for women is relatively hard.


If you find this sort of thing interesting, you might like my free statistics textbook, Think Stats, which you can download or read online.


  1. I'm just not getting the idea behind C-fairness. I can't map it onto any intuitive notion of "fairness."

    Let me see if I've got it right: for any value S of the standard, you define the subset of people whose times are in the interval [S-X,S+X] ("contenders"), and then ask what fraction of them are in the range [S-X,S] (contenders who qualify). For small X, that's essentially equivalent to looking at the second derivative of the cdf, or the first derivative of the pdf, as a function of S.

    Is that right? If so, can you say a little more about how that maps onto a notion of fairness?

  2. Great question, and you are right about matching derivatives. The reason this is fair is that we are trying to match up corresponding parts of different distributions. We can't use extrema because they are dictated by outliers, and we can't use percentiles, because they are dominated by the middle and back of the pack, so I am using local curvature to find similar regions. The underlying assumption is that the shape of the distributions is similar in the vicinity of the standard, which is true if X is big enough to provide some smoothing.

  3. Sorry, still just as baffled as before. I just don't see why local cdf curvature is the relevant (or even *a* relevant) consideration. Why the second derivative of the cdf, rather than the first or third, for instance?

  4. The essential pieces are (1) the standard should be based on a relevant population of competitive runners, and (2) the field in Boston should be representative of that population.

    If you buy that, then the only question is how to define the relevant population. I chose people within X minutes of S because (1) it's simple, (2) it has the property of equal marginal impact, and (3) as you pointed out, it corresponds to matching curvature.

    By itself, #3 doesn't provide a lot of intuition, but it makes some sense if what we are trying to do is align corresponding parts of the distributions.

    I considered matching first derivatives instead, but that gives the middle of the pack too much influence, and it doesn't map onto the notion of choosing qualifiers from a pool of contenders.

  5. I have a couple of issues with your choice of "C-fairness".

    (1) I wouldn't be so quick to dismiss Elites as "outliers". If anything, the elites serve as a great measure of what the female/male human body is capable of if you win the genetic lottery and train hard. If you're looking for the true biological difference between male and female, I'd see that as the way to go. When you start looking further down the pack, there are too many other factors involved: maybe more slower women are inclined to take up the sport of running, whereas their equivalently-not-very-talented male counterpart is more likely to take up ping-pong. Or whatever. You get cultural/sociological factors messing things up.

    (2) I think your data may be skewed by the current BQ differences and its resulting "BQ-effect". Isn't that what that big hump is in your figure?

    All that said, I actually think that the BQ gender difference should be whatever it needs to be to keep the field at 50/50. But that's just my gut feeling based on the fact that we're already up to our eyeballs in enough gender disparity in the world of politics/upper echelons of business/ science/ math etc, and this is one blessed area in which creating at least superficial gender equality is really easy, so we should do it.

    Sorry for the late comment, I just discovered your blog and it is awesome!

  6. @pentalith: You are right that the shape of the pack is influenced by cultural factors. But the problem with using elites to measure the male/female gap is that there is a biological asymmetry: a woman who is physiologically similar to a man is likely to be a good athlete, but a man who is physiologically similar to a woman is not. I speculate that this asymmetry is the reason the elite women are "more elite" than the elite men, and why the gap between the elites is smaller than the gap between the packs.

    Thanks for the comment, and for the kind words!

  7. If the male record is about 2:02 and the female record is about 2:15, then the gender gap should only be about 13 minutes. It seems like BQ is currently unfair to men. I think you need to be careful with your statistics. Are we comparing to pure ability or are we comparing to the pool of people who currently run? There may be a statistically significant difference between the % of men who are talented for running and actually run compared with the % of women. Your analysis assumes a similar spread.

    1. What you are proposing is what I called E-fairness in this article. I explained two problems with E-fairness, and then proposed two alternative definitions of "fair". And I discuss the question of what the relevant population of comparison should be. By proposing alternatives and evaluating their consequences, I am not assuming anything.