Wednesday, November 30, 2011

Are first babies more likely to be light?

Way back in February I analyzed data from the National Survey of Family Growth (NSFG) and answered the question "Are first babies more likely to be late?"  Now I am getting back to that data to look at the related question, "Are first babies more likely to be light?"

In Think Stats (Chapter 7), I showed that the mean birth weight for first babies is lower than the mean for others by about 2 ounces, and that difference is statistically significant.  But there is also a relationship between birth weight and mother's age, and the mothers of first babies tend to be younger.  So the question is: if we control for the age of the mother, are first babies still lighter?

Several students in my class are working on projects involving multiple regression, so I want to use this question as an example.  I will follow the steps I recommend to my students:

1) Before looking at relationships between variables, look at each variable in isolation.  In particular, characterize the distributions and identify issues like outliers or long tails.

2) Look at the variables pairwise.  For each pair, look at CDFs and scatterplots, and compute correlations and/or least squares fits.

3) If there seem to be relationships among the variables, look for ways to separate the effects, either by breaking the data into subsets or doing multiple regression.

So let's proceed.

One variable at a time

The variables I'll use are
  1. Mother's age in years, 
  2. Birth weight in ounces, and 
  3. first, which is a dummy variable, 1 for first babies and 0 for others.
For live births we have 9148 records with valid ages; here is the distribution:
It looks like there is a little skew to the right, but other than that, nothing to worry about.

We have 9038 records with valid weights, with this distribution:
The middle of the distribution is approximately normal, with some jaggies due to round-off.  In the tails there are some values that are are certainly errors, but it is hard to draw a clear line between exceptional cases and bad data.  It might be a good idea to exclude some extreme values, but for the analysis below I did not.

There are only two values for first, so it's not much of a distribution, but with categorical data, we should check that we have enough values in each bin.  As it turns out, there are 4413 first babies and 4735 others, so that's just fine.

One pair at a time

To characterize the relationship between weight and first, we compare the CDF of weights for first babies and others:
There is some space between the distributions, and the difference in means is 2 ounces.  It's not obvious whether that difference is statistically significant, but it is, with p < 0.001.

Similarly we can compare the CDF of mother's age for first babies and others:
Here there is clearly space between the distributions.  The difference in means is 3.6 years, which is significant (no surprise this time).

It is not as easy to see the relationship between age and birthweight.  I often tell my students to start with a scatterplot:
But in this case it's not much help.  If there is a relationship there, it is hard to see.  An alternative is to break the age range into bins and compute the mean in each bin.  Here's what that looks like with 2-year bins:
We can't take the first and last points too seriously; there are not many cases in those bins.  And ideally I should represent the variability in each bin so we have a sense of whether the apparent differences are real.  Based on this figure, it looks like there is a relationship, but it might be nonlinear.

Pearson's coefficient of correlation between age and birthweight is 0.07, which is small but statistically significant.  Spearman's coefficient of correlation is 0.10; the difference between the two coefficients is another warning that the relationship is nonlinear.

If we ignore the non-linearity for now, we can compute a least squares fit for birthweight as a function of age.  The slope is 0.28, which means that we expect an additional 0.28 ounces per year of age.  Since first mothers are 3.6 years younger than others, we expect their babies to be 1.0 ounces lighter.  In fact, they are 2.0 ounces lighter, so the linear model of weight vs age accounts for 50% of the observed difference.

Multiple regression

So far I have been using the thinkstats libraries to compute correlation coefficients and least squares fits.  But for multiple regression I am going to break out rpy, which is the Python interface to R a "language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories."  To see how it works, you can download the code I used in this section:

The first model I ran is the same linear model we were just looking at:

weight = intercept + slope * age

In R's shorthand, that's:

weights ~ ages

Here are the results:

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 109.28635    1.08775 100.470  < 2e-16 ***
ages          0.27926    0.04258   6.559 5.72e-11 ***

Multiple R-squared: 0.004738, Adjusted R-squared: 0.004628 

The intercept is 109 ounces; the slope is 0.28 ounces per year, which what we saw before---comparing my implementation to R makes me more confident that R is correct :)

The standard error provides a confidence interval for the estimates; plus or minus 2 standard errors is (roughly) a 95% confidence interval.

The last column is the p-value, which is very small, indicating that the slope and intercept are significantly different from 0.  The t-value is the test statistic used to compute the p-value, but I don't know why it gets reported; it doesn't mean much.

Since the p-values are so small, it might be surprising that R2 is so low, only 0.0047.  But there is no contradiction.  We can say with high confidence that there is a relationship between these variables; nevertheless, birth weight is highly variable, and even if you know the age of the mother, that does not reduce the variability by much.

To understand adjusted R2, consider this: if you add more explanatory variables to a model, R2 usually goes up even if there is no real relationship.  The adjusted R2 takes this into account, which makes it more meaningful to compare models with a different number of variables.

Now let's add first as an explanatory variable, so the model looks like this:

weights ~ first + ages

Here are the results:

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 110.62791    1.24198  89.073  < 2e-16 ***
first        -1.11688    0.49940  -2.236   0.0253 *  
ages          0.24708    0.04493   5.499 3.93e-08 ***

Multiple R-squared: 0.005289, Adjusted R-squared: 0.005069 

The coefficient for first is -1.1, which means that we expect first babies to be 1.1 ounces lighter, controlling for age.  The p-value for this estimate is 2.5%, which I consider borderline significant.

The coefficient for ages is about the same as before, and significant.  And R2 is a little higher, but still small.

But remember that the relationship between weight and age is non-linear.  We can explore that by introducing a new variable:

ages2 = ages^2

Now if we run this model:

weights ~ ages + ages2

We are effectively fitting a parabola to the weight vs age curve.

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 89.151237   4.407677  20.226  < 2e-16 ***
ages         1.898151   0.346071   5.485 4.25e-08 ***
ages2       -0.031002   0.006577  -4.714 2.47e-06 ***

Multiple R-squared: 0.00718, Adjusted R-squared: 0.00696 

Estimates for both variables are significant.  The coefficient for ages2 is negative, which means that the parabola has downward curvature, as expected.  And R2 is a little bigger.

Finally, we can bring it all together:

weights ~ first + ages + ages2

With this model we get:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 91.07627    4.56813  19.937  < 2e-16 ***
first       -0.80708    0.50373  -1.602    0.109    
ages         1.79807    0.35163   5.113 3.23e-07 ***
ages2       -0.02953    0.00664  -4.447 8.80e-06 ***

Multiple R-squared: 0.007462, Adjusted R-squared: 0.007132 

When we include the parabolic model of weight and age, the coefficient for first gets smaller, and the p-value is 11%.

I conclude that the difference in weight for first babies is explained by the difference in mothers' ages.  When we control for age, the difference between first babies and others is no longer statistically significant.  It is still possible that there is a small difference in weight for first babies, but this dataset provides little evidence for it.

Monday, November 28, 2011

Estimating the age of renal tumors

UPDATE April 2, 2012: I wrote a paper describing this work and submitted it to arXiv. You can download it here.

Abstract: We present a Bayesian method for estimating the age of a renal tumor given its size. We use a model of tumor growth based on published data from observations of untreated tumors. We find, for example, that the median age of a 5 cm tumor is 20 years, with interquartile range 16-23 and 90% confidence interval 11-30 years.


A few weeks ago I read this post on
"I have Stage IV Kidney Cancer and am trying to determine if the cancer formed before I retired from the military. ... Given the dates of retirement and detection is it possible to determine when there was a 50/50 chance that I developed the disease? Is it possible to determine the probability on the retirement date?  My tumor was 15.5 cm x 15 cm at detection. Grade II."
I contacted the original poster and got more information; I learned that veterans get different benefits if it is "more likely than not" that a tumor formed while they were in military service (among other considerations).

Because renal tumors grow slowly, and often do not cause symptoms, they are often left untreated.  As a result, we can observe the rate of growth for untreated tumors by comparing scans from the same patient at different times.  Several papers have reported these growth rates.

I collected data from a paper by Zhang et al.  I contacted the authors to see if I could get raw data, but they refused on grounds of medical privacy.  Nevertheless, I was able to extract the data I needed by printing one of their graphs and measuring it with a ruler.  It's silly, but it works.

They report growth rates in reciprocal doubling time (RDT), which is in units of doublings per year.  So a tumor with RDT=1 doubles in volume each year; with RDT=2 it quadruples in the same time, and with RDT=-1, it halves.  The following figure shows the distribution of RDT for 53 patients:

The squares are the data points from the paper; the line is a model I fit to the data.  The positive tail fits an exponential distribution well, so I used a mixture of two exponentials.

As a simple model of tumor growth, I chose the median value of RDT, which is 0.45, and used that to estimate the age of a tumor with maximum dimension 15.5 cm.  Here's what I wrote in my letter to the Veterans Benefits Administration:
  1. In the largest study I reviewed (53 patients) the median volume doubling time is 811 days. By definition of median, 50% of observed tumors grew faster and 50% slower.  By geometry, the doubling time for the maximum linear dimension is approximately (811)(3) = 2433 days or 6.7 years.  Therefore, for a tumor with maximum linear dimension 15.5 cm on [diagnosis date], it is as likely as not that the size on [discharge date] was 6 cm.
  2. If the diameter of the tumor on [discharge date] were 1 mm and it grew to 15.5 cm by [diagnosis date], the effective volume doubling time would be 150 days.  Fewer than half of the tumors in the studies I reviewed grew at this rate or faster, so it is more likely than not that the tumor grew more slowly.
Based on this analysis, I conclude that it is more likely than not that this tumor formed prior to [discharge date].
I think this model is sufficient to answer the question as posed, but it occurred to me later (in the shower, where all good ideas come from) that we can do better.  By sampling from the distribution of growth rates and generating simulated tumor histories, we can estimate the distribution of size as a function of time and then, using Bayes's Theorem, get the distribution of age as a function of size.

Here's how.  The simulation starts with a small tumor (0.3 cm) and runs these steps:
  1. Choose a growth rate from the distribution of RDT.
  2. Compute the size of the tumor at the end of an 8 month interval (that's the median interval between  measurements in the data source).
  3. Repeat until the tumor is 20 cm in diameter.
This figure shows 100 simulated growth trajectories:
The line at 10 cm shows the range of ages for tumors at that size: the fastest-growing tumor gets there in 8 years; the slowest takes more than 35.

By drawing the line at different sizes, we can estimate the distribution of age as a function of size.  There's an implicit use of Bayes's Theorem in there, but because I did everything discretely, I didn't have to think too hard.  This figure shows the distribution of age for a few different sizes:
Not surprisingly, bigger tumors are likely to be older.  For any size, we can generate the CDF and compute the median, interquartile range, and 90% confidence interval.  Here's what that looks like (with size on a log scale):
The points are data from simulation, which produces some variability due to discrete approximation.  The lines are fitted to the data.

With these results, doctors can look up the size of a tumor and get the distribution of ages; for example, the median age of a 15 cm tumor is 27 years, with interquartile range 22-31 and 90% confidence interval 16-39 years.

This model yields more detail than the simple model I started with, but the results are qualitatively similar; a tumor this size is more likely than not to have formed prior to the original poster's date of discharge.  It looks like there is also a good chance that it formed prior to enlistment, but I don't know what the VBA makes of that.


I think this model makes the best use of the available data, but there are several limitations:

1) The factors that limit tumor growth are different for very small tumors, so the observed data doesn't apply.  We can extrapolate back to when the tumor was small (I chose 0.3 cm, a bit smaller than the smallest tumor in the study).  That gives us a lower bound on the age of the tumor, but we can't say much about when the first cancer cell appeared.

2) The distribution of growth rates is based on a sample of 53 patients.  A different sample would yield a different distribution.  I could use resampling to characterize this source of error, but haven't.

3) The growth model does not take into account tumor subtype or grade, which is consistent with the conclusion of Zhang et al: “Growth rates in renal tumors of different sizes, subtypes and grades represent a wide range and overlap substantially.”  

4) In our model of tumor growth, the growth rate during each interval is independent of previous growth rates.  It is plausible that, in reality, tumors that have grown quickly in the past are more likely to grow quickly.

If this correlation exists, it affects the location and spread of the results.  For example, running simulations with ρ = 0.4 increases the estimated median age by about a year, and the interquartile range
by about 3 years.  However, if there were a strong serial correlation in growth rate, there would be also be a correlation between tumor volume and growth rate, and prior work has shown no such relationship.

There could still be a weak serial correlation, but since there is currently no evidence for it, I ran these simulations with ρ  = 0.

Monday, November 21, 2011

Comment on "Racism and Meritocracy"

WARNING:  This article is on a topic that elicits emotional reactions.  I welcome comments, but please make them thoughtful and keep them civil.

Eric Ries wrote an article for TechCrunch last week, talking about racism and meritocracy among Silicon Valley entrepreneurs.  It's a good article; you should read it and then come back.

Although I mostly agree with him, Ries undermines his argument with a statistical bait-and-switch: he starts out talking about race, but most of the article (and the slide deck he refers to) are about gender.  Unfortunately, for both his argument and the world, the race gap is bigger than the gender gap, and it is compounded because racial minorities, unlike women, are minorities.

To quantify the size of the gap, I use data from Academically Adrift, a recent book that reports the results from the Collegiate Learning Assessment (CLA) database, collected by the Council for Aid to Education, "a national nonprofit organization ... established in 1952 to advance corporate support of education and to conduct policy research on higher education..."

"The CLA consists of three types of prompts within two types of task: the Performance Task and the Analytic Writing Task...The Analytic Writing Task includes a pair of prompts called Make-an-Argument and Critique-an-Argument.  
"The CLA uses direct measures of skills in which students perform cognitively demanding tasks... All CLA measures are administered online and contain open-ended prompts that require constructed responses. There are no multiple-choice questions. The CLA tasks require that students integrate critical thinking and written communication skills. The holistic integration of these skills on the CLA tasks mirrors the requirements of serious thinking and writing tasks faced in life outside of the classroom. "
This is not your father's SAT.  The exam simulates realistic workplace tasks and assesses skills that are relevant to many jobs, including (maybe especially) entrepreneurship.

On this assessment, the measured differences between black and white college students are stark.  For white college students, the mean and standard deviation are 1170 ± 179.  For black students, they are 995 ± 167.

To get a sense of what that difference looks like, suppose there are just two groups, which I call "blue" and "green" as a reminder that I am presenting an abstract model and not a realistic description.  This figure shows Gaussian distributions with the parameters reported in Academically Adrift:

The difference in means is 175 points, which is about one standard deviation.  If we select people from the upper tail, the majority are blue.  But the situation is even worse if greens are a minority.  If greens make up 20% of the population, the picture looks like this:

The fraction of greens in the upper tail is even smaller.  If, as Ries suggests, "Here in Silicon Valley, we’re looking for the absolute best and brightest, the people far out on the tail end of aptitude," the number of greens in that tail is very small.

How small?  That depends on where we draw the line.  If we select people who score above 1200, which includes 37% of the population, we get 6% greens (remember that they are 20% of the hypothetical population).  Above 1300 the proportion of greens is 3%, and above 1400 only 2%.

And that's not very "far out on the tail end of aptitude."  Above 1500, we are still talking about 3% of the general population, but more than 99% of them are blue.  So in this hypothetical world of blues and greens, perfect meritocracy does not lead to proportional representation.

Ries suggests that blind screening of applicants might help.  I think the system he proposes is a good idea, because it improves fairness and also the perception of fairness.  But if the racial gap in Y Combinator's applicant pool is similar to the racial gap in CLA scores, making the selection process more meritocratic won't make a big difference.

These numbers are bad.  I'm sorry to be reporting them, and if I know the Internet, some people are going to call me a racist for doing it.  But I didn't make them up, and I'm pretty sure I did the math right.  Of course, you are welcome to disagree with my conclusions.

Here are some of the objections I expect:

1) The CLA does not capture the full range of skills successful entrepreneurs need.

Of course it doesn't; no test could.  But I chose the CLA because I think it assesses thinking skills better than other standardized tests, and because the database includes "over 200,000 student results across hundreds of colleges."  I can't think of a better way to estimate the magnitude of the racial gap in the applicant pool.

2) The application process is biased against racial minorities and women.

The statistics I am reporting here, and my analysis of them, don't say anything about whether or not the application process is biased.  But they do suggest (a) We should not assume that because racial minorities are underrepresented among Silicon Valley entrepreneurs, racial bias explains a large part of the effect, and (b) We should not assume that eliminating bias from the process will have a large effect.

Of course, trying to eliminate bias is the right thing to do, whether the effect is big or small.


NOTE: The range of scores for the CLA was capped at 1600 until 2007, which changed the shape of the distribution at the high end.  For those years, the Gaussian distributions in the figures are not exactly right, but I don't think it affects my analysis much.  Since 2007, scores are no longer capped, but I don't know what the tail of the distribution looks like now.

EDIT 11-28-11: I revised a few sentences to clarify whether I was talking about representation or absolute numbers.  The fraction of greens in the population affects the absolute numbers in the tail but not their representation.

Thursday, November 10, 2011

Girl Named Florida solutions

In The Drunkard's Walk, Leonard Mlodinow presents "The Girl Named Florida Problem":
"In a family with two children, what are the chances, if one of the children is a girl named Florida, that both children are girls?"
I like this problem, and I use it on the first day of my class to introduce the topic of conditional probability.  But I've decided that it's too easy.  To give it a little more punch, I've decided to combine it with the Red-Haired Problem from last week:
In a family with two children, what are the chances, if at least one of the children is a girl with red hair, that both children are girls?
Just like last week, you can make some simplifying assumptions:
About 2% of the world population has red hair.  You can assume that the alleles for red hair are purely recessive.  Also, you can assume that the Red Hair Extinction theory is false, so you can apply the Hardy–Weinberg principle.  And you can ignore the effect of identical twins.
Before I present my solution, I want to sneak up on it with a series of warm-up problems.
  1. P[GG | two children]: if a family has two children, what is the chance that they have two girls?
  2. P[GG | two children, at least one girl]: if we know they have at least one girl, what is the chance that they have two girls?
  3. P[GG | two children, older child is a girl]: if the older child is a girl, what is the chance that they have two girls?
  4. P[GG | two children, at least one is a girl named Florida].
  5. P[GG | two children, at least one is a girl with red hair, and the parents have brown hair].
  6. P[GG | two children, at least one is a girl with red hair].

Problem 1: P[GG | two children]

If we assume that the probability that each child is a girl is 50%, then P[GG | two children] = 1/4.

Problem 2: P[GG | two children, at least one girl]

There are four equally-likely kinds of two child families:  BB, BG, GB and GG.  We know that BB is out, so the conditional probability is

P[GG | at least one girl] = P[GG and at least one girl] / P[at least one girl] = 1/3.

Problem 3: P[GG | two children, older child is a girl]

Now there are only two possible families, GB and GG, so the conditional probability is 1/2.  Informally we can argue that once we know about the older child we can treat the younger child as independent.  But if there's one thing we learn from this problem, it's that our intuition for independence is not reliable.

Problem 4: P[GG | two children, at least one girl named Florida]

Here's the one that makes people's head hurt.  For each child, there are three possibilities, boy, girl not named Florida, and girl named Florida, with these probabilities:

B: 1/2
G: 1/2 - x
GF: x

where x is the unknown percentage of people who are girls named Florida.  Of families with at least one girl named Florida, there are these possible combinations, with these probabilities

B GF: 1/2 x
GF B: 1/2 x
G GF: x (1/2 - x)
GF G: x (1/2 - x)
GF GF: x^2

The highlighted cases have two girls, so the probability we want is the sum of the highlighted cases over the sum of all cases.  With a little algebra, we get:

P(GG | at least one girl named Florida) = (1 - x) / (2 - x)

Assuming that Florida is not a common name, x approaches 0 and the answer approaches 1/2.  So it turns out, surprisingly, that the name of the girl is relevant information.

As x approaches 1/2, the answer converges on 1/3.  For example, if we know that at least one child is a girl with two X chromosomes, x is close to 1/2 and the problem reduces to Problem 2.

If this problem is still making your head hurt, this figure might help:
Here B a boy, Gx is a girl with some property X, and G is a girl who doesn't have that property.  If we select all families with at least one Gx, we get the five blue squares (light and dark).  The families with two girls are the three dark blue squares.

If property X is common, the ratio of dark blue to all blue approaches 1/3.  If X is rare, the same ratio approaches 1/2.

Problem 5: P[GG | two children, at least one girl with red hair, parents have brown hair]

If the parents have brown hair and one of their children has red hair, we know that both parents are heterozygous, so their chance of having a red-haired girl is 1/8.

Using the girl-named-Florida formula, we get

P[GG | two children, at least one girl with red hair, parents have brown hair] = (1 - 1/8) / (2 - 1/8) = 7/15.

And finally:

Problem 6: P[GG | two children, at least one girl with red hair]

In this case we don't know the genotype of the parents.  There are three possibilities: Aa Aa, Aa aa, and aa aa.

We follow these steps:

1) Use the prevalence of red hair to compute the prior probabilities of each parental genotype.

2) Use the evidence (at least one girl with red hair) to compute the posteriors.

3) For each combination, compute the conditional probability.  We have already computed

P(GG | two children, at least one with red hair, Aa Aa) = 7/15

The others are

P(GG | two children, at least one with red hair, Aa aa) = 3/7
P(GG | two children, at least one with red hair, aa aa) = 1/3

4) Apply the law of total probability to get the answer.

I'm too lazy to do the algebra, so I got Mathematica to do it for me.  Here is the notebook with the answer.

In general, if p is the prevalence of red hair alleles,

P(GG | two children, at least one with red hair) = (p^2 + 2p - 7) / (p^2 + 2p - 15)

If the prevalence of red hair is 0.02, then p = sqrt(0.02) =  0.141, and

P(GG | two children, at least one with red hair) = 45.6%

Congratulations to Professor Ted Bunn at the University of Richmond, the only person who submitted a correct answer before the deadline!

At least, I think it's the right answer.  Maybe we both made the same mistake.

For more fun with probability, see Chapter 5 of my book, Think Stats,





Monday, November 7, 2011

The red-haired girl named Florida

In The Drunkard's Walk, Leonard Mlodinow presents "The Girl Named Florida Problem":
"In a family with two children, what are the chances, if one of the children is a girl named Florida, that both children are girls?"
I like this problem, and I use it on the first day of my class to introduce the topic of conditional probability.  But I've decided that it's too easy.  To give it a little more punch, I've decided to combine it with the Red-Haired Problem from last week:
In a family with two children, what are the chances, if at least one of the children is a girl with red hair, that both children are girls?
[Edit: I clarified the wording to say that at least one of the children is a girl with red hair.  To be even more precise, I am looking for the conditional probability P(two girls | at least one girl with red hair).] 

Just like last week, you can make some simplifying assumptions:
About 2% of the world population has red hair.  You can assume that the alleles for red hair are purely recessive.  Also, you can assume that the Red Hair Extinction theory is false, so you can apply the Hardy–Weinberg principle.  And you can ignore the effect of identical twins.

Here is my solution.

For more fun with probability, see Chapter 5 of my book, Think Stats,

Thursday, November 3, 2011

Somebody bet on the Bayes

In last week's post I wrote solutions to some of my favorite Bayes's Theorem problems, and posed this new problem:
If you meet a man with (naturally) red hair, what is the probability that neither of his parents has red hair?
Hints: About 2% of the world population has red hair.  You can assume that the alleles for red hair are purely recessive.  Also, you can assume that the Red Hair Extinction theory is false, so you can apply the Hardy–Weinberg principle.
Given the prevalence of red hair, we know what fraction of the population is homozygous recessive.  To solve the problem, we also need to know how many are heterozygous and homozygous dominant.

I'll use a to represent the recessive allele (or alleles) for red hair, and A for the dominant alleles that code for other colors.  p is the prevalence of a and q is the prevalence of A, so p+q = 1.

If these prevalences are not changing over time, and they don't affect people's mating decisions, we can invoke the Hardy-Weinberg principle to get:

P(AA) = prevalence of homozygous dominant = q**2
P(Aa) = prevalence of heterozygous = 2 * p * q
P(aa) = prevalence of homozygous recessive = p**2

And P(AA) + P(Aa) + P(aa) = 1.

Given (aa), we can compute p, q and:

P(Aa) = 0.243
P(AA) = 0.737

Now if a child has red hair, both parents have at least one recessive allele, so the possible combinations are (Aa Aa), (Aa aa), and (aa aa).  If we assume, again, that mating decisions are not based on hair color, we can get the prevalence of each parental pair:

P(Aa Aa) = Aa**2
P(Aa aa) = 2 Aa aa
P(aa aa) = aa**2

And we can use those as the priors.  The evidence is

E: the child has red hair

To get the likelihoods, we apply Mendelian genetics:

P(E | Aa Aa) = 0.25
P(E | Aa aa) = 0.5
P(E | aa aa) = 1.0

Finally, applying Bayes's Theorem, we have

P(Aa Aa | E) = P(Aa Aa) P(E | Aa Aa) / P(E)

And the answer is 0.737.  With a little algebra we can show that

P(Aa Aa | E) = P(AA)

That is, the probability that neither parent has red hair is exactly the fraction of the population that is homozygous dominant.

Almost 75% of red-haired people come from parents with non-red hair, which might explain why a "red haired child" is a metaphor for a child that doesn't resemble his parents. In the expression, "beat like a red haired step child," some of the humor (for people who find child abuse funny) comes from the suggestion that the parentage of a red haired child is suspect.

But why should red hair be funnier than blue eyes or blonde hair?  It turns out that we can answer this question mathematically. If the prevalence of red hair were higher, say 10%, most red haired people would have at least one red-haired parent, and that would be less funny.

In general, as the prevalence of the recessive phenotype increases, the potential for amusing insinuations of infidelity decreases; near 0 it drops off steeply, as shown in this figure:

Since I believe I am the first person to quantify this effect, I humbly submit that it should be called "Downey's inverse law of mailman jokes."

For more fun with probability, see Chapter 5 of my book, Think Stats,


If you don't get the title of this post, it is a play on "Somebody bet on the bay," a lyric from the minstrel song "Camptown Races."  A bay is a horse with a reddish-brown coat, so I thought it was a pretty good fit.