Probably Overthinking It: 2011

Wednesday, November 30, 2011

Are first babies more likely to be light?

Way back in February I analyzed data from the National Survey of Family Growth (NSFG) and answered the question "Are first babies more likely to be late?" Now I am getting back to that data to look at the related question, "Are first babies more likely to be light?"

In Think Stats (Chapter 7), I showed that the mean birth weight for first babies is lower than the mean for others by about 2 ounces, and that difference is statistically significant. But there is also a relationship between birth weight and mother's age, and the mothers of first babies tend to be younger. So the question is: if we control for the age of the mother, are first babies still lighter?

Several students in my class are working on projects involving multiple regression, so I want to use this question as an example. I will follow the steps I recommend to my students:

1) Before looking at relationships between variables, look at each variable in isolation. In particular, characterize the distributions and identify issues like outliers or long tails.

2) Look at the variables pairwise. For each pair, look at CDFs and scatterplots, and compute correlations and/or least squares fits.

3) If there seem to be relationships among the variables, look for ways to separate the effects, either by breaking the data into subsets or doing multiple regression.

So let's proceed.

One variable at a time

The variables I'll use are

Mother's age in years,
Birth weight in ounces, and
first, which is a dummy variable, 1 for first babies and 0 for others.

For live births we have 9148 records with valid ages; here is the distribution:

It looks like there is a little skew to the right, but other than that, nothing to worry about.

We have 9038 records with valid weights, with this distribution:

The middle of the distribution is approximately normal, with some jaggies due to round-off. In the tails there are some values that are are certainly errors, but it is hard to draw a clear line between exceptional cases and bad data. It might be a good idea to exclude some extreme values, but for the analysis below I did not.

There are only two values for first, so it's not much of a distribution, but with categorical data, we should check that we have enough values in each bin. As it turns out, there are 4413 first babies and 4735 others, so that's just fine.

One pair at a time

To characterize the relationship between weight and first, we compare the CDF of weights for first babies and others:

There is some space between the distributions, and the difference in means is 2 ounces. It's not obvious whether that difference is statistically significant, but it is, with p < 0.001.

Similarly we can compare the CDF of mother's age for first babies and others:

Here there is clearly space between the distributions. The difference in means is 3.6 years, which is significant (no surprise this time).

It is not as easy to see the relationship between age and birthweight. I often tell my students to start with a scatterplot:

But in this case it's not much help. If there is a relationship there, it is hard to see. An alternative is to break the age range into bins and compute the mean in each bin. Here's what that looks like with 2-year bins:

We can't take the first and last points too seriously; there are not many cases in those bins. And ideally I should represent the variability in each bin so we have a sense of whether the apparent differences are real. Based on this figure, it looks like there is a relationship, but it might be nonlinear.

Pearson's coefficient of correlation between age and birthweight is 0.07, which is small but statistically significant. Spearman's coefficient of correlation is 0.10; the difference between the two coefficients is another warning that the relationship is nonlinear.

If we ignore the non-linearity for now, we can compute a least squares fit for birthweight as a function of age. The slope is 0.28, which means that we expect an additional 0.28 ounces per year of age. Since first mothers are 3.6 years younger than others, we expect their babies to be 1.0 ounces lighter. In fact, they are 2.0 ounces lighter, so the linear model of weight vs age accounts for 50% of the observed difference.

Multiple regression

So far I have been using the thinkstats libraries to compute correlation coefficients and least squares fits. But for multiple regression I am going to break out rpy, which is the Python interface to R, a "language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories." To see how it works, you can download the code I used in this section: age_lm.py

The first model I ran is the same linear model we were just looking at:

weight = intercept + slope * age

In R's shorthand, that's:

weights ~ ages

Here are the results:

Estimate Std. Error t value Pr(>|t|)
(Intercept) 109.28635 1.08775 100.470 < 2e-16 ***
ages 0.27926 0.04258 6.559 5.72e-11 ***

Multiple R-squared: 0.004738, Adjusted R-squared: 0.004628

The intercept is 109 ounces; the slope is 0.28 ounces per year, which what we saw before---comparing my implementation to R makes me more confident that R is correct :)

The standard error provides a confidence interval for the estimates; plus or minus 2 standard errors is (roughly) a 95% confidence interval.

The last column is the p-value, which is very small, indicating that the slope and intercept are significantly different from 0. The t-value is the test statistic used to compute the p-value, but I don't know why it gets reported; it doesn't mean much.

Since the p-values are so small, it might be surprising that R² is so low, only 0.0047. But there is no contradiction. We can say with high confidence that there is a relationship between these variables; nevertheless, birth weight is highly variable, and even if you know the age of the mother, that does not reduce the variability by much.

To understand adjusted R², consider this: if you add more explanatory variables to a model, R² usually goes up even if there is no real relationship. The adjusted R² takes this into account, which makes it more meaningful to compare models with a different number of variables.

Now let's add first as an explanatory variable, so the model looks like this:

weights ~ first + ages

Here are the results:

Estimate Std. Error t value Pr(>|t|)
(Intercept) 110.62791 1.24198 89.073 < 2e-16 ***
first -1.11688 0.49940 -2.236 0.0253 *
ages 0.24708 0.04493 5.499 3.93e-08 ***

Multiple R-squared: 0.005289, Adjusted R-squared: 0.005069

The coefficient for first is -1.1, which means that we expect first babies to be 1.1 ounces lighter, controlling for age. The p-value for this estimate is 2.5%, which I consider borderline significant.

The coefficient for ages is about the same as before, and significant. And R² is a little higher, but still small.

But remember that the relationship between weight and age is non-linear. We can explore that by introducing a new variable:

ages2 = ages^2

Now if we run this model:

weights ~ ages + ages2

We are effectively fitting a parabola to the weight vs age curve.

Estimate Std. Error t value Pr(>|t|)
(Intercept) 89.151237 4.407677 20.226 < 2e-16 ***
ages 1.898151 0.346071 5.485 4.25e-08 ***
ages2 -0.031002 0.006577 -4.714 2.47e-06 ***

Multiple R-squared: 0.00718, Adjusted R-squared: 0.00696

Estimates for both variables are significant. The coefficient for ages2 is negative, which means that the parabola has downward curvature, as expected. And R² is a little bigger.

Finally, we can bring it all together:

weights ~ first + ages + ages2

With this model we get:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 91.07627 4.56813 19.937 < 2e-16 ***

first -0.80708 0.50373 -1.602 0.109

ages 1.79807 0.35163 5.113 3.23e-07 ***

ages2 -0.02953 0.00664 -4.447 8.80e-06 ***

Multiple R-squared: 0.007462, Adjusted R-squared: 0.007132

When we include the parabolic model of weight and age, the coefficient for first gets smaller, and the p-value is 11%.

I conclude that the difference in weight for first babies is explained by the difference in mothers' ages. When we control for age, the difference between first babies and others is no longer statistically significant. It is still possible that there is a small difference in weight for first babies, but this dataset provides little evidence for it.

Monday, November 28, 2011

Estimating the age of renal tumors

UPDATE April 2, 2012: I wrote a paper describing this work and submitted it to arXiv. You can download it here.

Abstract: We present a Bayesian method for estimating the age of a renal tumor given its size. We use a model of tumor growth based on published data from observations of untreated tumors. We find, for example, that the median age of a 5 cm tumor is 20 years, with interquartile range 16-23 and 90% confidence interval 11-30 years.

-----

A few weeks ago I read this post on reddit.com/r/statistics:

"I have Stage IV Kidney Cancer and am trying to determine if the cancer formed before I retired from the military. ... Given the dates of retirement and detection is it possible to determine when there was a 50/50 chance that I developed the disease? Is it possible to determine the probability on the retirement date? My tumor was 15.5 cm x 15 cm at detection. Grade II."

I contacted the original poster and got more information; I learned that veterans get different benefits if it is "more likely than not" that a tumor formed while they were in military service (among other considerations).

Because renal tumors grow slowly, and often do not cause symptoms, they are often left untreated. As a result, we can observe the rate of growth for untreated tumors by comparing scans from the same patient at different times. Several papers have reported these growth rates.

I collected data from a paper by Zhang et al. I contacted the authors to see if I could get raw data, but they refused on grounds of medical privacy. Nevertheless, I was able to extract the data I needed by printing one of their graphs and measuring it with a ruler. It's silly, but it works.

They report growth rates in reciprocal doubling time (RDT), which is in units of doublings per year. So a tumor with RDT=1 doubles in volume each year; with RDT=2 it quadruples in the same time, and with RDT=-1, it halves. The following figure shows the distribution of RDT for 53 patients:

The squares are the data points from the paper; the line is a model I fit to the data. The positive tail fits an exponential distribution well, so I used a mixture of two exponentials.

As a simple model of tumor growth, I chose the median value of RDT, which is 0.45, and used that to estimate the age of a tumor with maximum dimension 15.5 cm. Here's what I wrote in my letter to the Veterans Benefits Administration:

In the largest study I reviewed (53 patients) the median volume doubling time is 811 days. By definition of median, 50% of observed tumors grew faster and 50% slower. By geometry, the doubling time for the maximum linear dimension is approximately (811)(3) = 2433 days or 6.7 years. Therefore, for a tumor with maximum linear dimension 15.5 cm on [diagnosis date], it is as likely as not that the size on [discharge date] was 6 cm.
If the diameter of the tumor on [discharge date] were 1 mm and it grew to 15.5 cm by [diagnosis date], the effective volume doubling time would be 150 days. Fewer than half of the tumors in the studies I reviewed grew at this rate or faster, so it is more likely than not that the tumor grew more slowly.

Based on this analysis, I conclude that it is more likely than not that this tumor formed prior to [discharge date].

I think this model is sufficient to answer the question as posed, but it occurred to me later (in the shower, where all good ideas come from) that we can do better. By sampling from the distribution of growth rates and generating simulated tumor histories, we can estimate the distribution of size as a function of time and then, using Bayes's Theorem, get the distribution of age as a function of size.

Here's how. The simulation starts with a small tumor (0.3 cm) and runs these steps:

Choose a growth rate from the distribution of RDT.
Compute the size of the tumor at the end of an 8 month interval (that's the median interval between measurements in the data source).
Repeat until the tumor is 20 cm in diameter.

This figure shows 100 simulated growth trajectories:

The line at 10 cm shows the range of ages for tumors at that size: the fastest-growing tumor gets there in 8 years; the slowest takes more than 35.

By drawing the line at different sizes, we can estimate the distribution of age as a function of size. There's an implicit use of Bayes's Theorem in there, but because I did everything discretely, I didn't have to think too hard. This figure shows the distribution of age for a few different sizes:

Not surprisingly, bigger tumors are likely to be older. For any size, we can generate the CDF and compute the median, interquartile range, and 90% confidence interval. Here's what that looks like (with size on a log scale):

The points are data from simulation, which produces some variability due to discrete approximation. The lines are fitted to the data.

With these results, doctors can look up the size of a tumor and get the distribution of ages; for example, the median age of a 15 cm tumor is 27 years, with interquartile range 22-31 and 90% confidence interval 16-39 years.

This model yields more detail than the simple model I started with, but the results are qualitatively similar; a tumor this size is more likely than not to have formed prior to the original poster's date of discharge. It looks like there is also a good chance that it formed prior to enlistment, but I don't know what the VBA makes of that.

-----

I think this model makes the best use of the available data, but there are several limitations:

1) The factors that limit tumor growth are different for very small tumors, so the observed data doesn't apply. We can extrapolate back to when the tumor was small (I chose 0.3 cm, a bit smaller than the smallest tumor in the study). That gives us a lower bound on the age of the tumor, but we can't say much about when the first cancer cell appeared.

2) The distribution of growth rates is based on a sample of 53 patients. A different sample would yield a different distribution. I could use resampling to characterize this source of error, but haven't.

3) The growth model does not take into account tumor subtype or grade, which is consistent with the conclusion of Zhang et al: “Growth rates in renal tumors of different sizes, subtypes and grades represent a wide range and overlap substantially.”

4) In our model of tumor growth, the growth rate during each interval is independent of previous growth rates. It is plausible that, in reality, tumors that have grown quickly in the past are more likely to grow quickly.

If this correlation exists, it affects the location and spread of the results. For example, running simulations with ρ = 0.4 increases the estimated median age by about a year, and the interquartile range
by about 3 years. However, if there were a strong serial correlation in growth rate, there would be also be a correlation between tumor volume and growth rate, and prior work has shown no such relationship.

There could still be a weak serial correlation, but since there is currently no evidence for it, I ran these simulations with ρ = 0.

Monday, November 21, 2011

Comment on "Racism and Meritocracy"

WARNING: This article is on a topic that elicits emotional reactions. I welcome comments, but please make them thoughtful and keep them civil.

Eric Ries wrote an article for TechCrunch last week, talking about racism and meritocracy among Silicon Valley entrepreneurs. It's a good article; you should read it and then come back.

Although I mostly agree with him, Ries undermines his argument with a statistical bait-and-switch: he starts out talking about race, but most of the article (and the slide deck he refers to) are about gender. Unfortunately, for both his argument and the world, the race gap is bigger than the gender gap, and it is compounded because racial minorities, unlike women, are minorities.

To quantify the size of the gap, I use data from Academically Adrift, a recent book that reports the results from the Collegiate Learning Assessment (CLA) database, collected by the Council for Aid to Education, "a national nonprofit organization ... established in 1952 to advance corporate support of education and to conduct policy research on higher education..."

Here is a description of the CLA:

"The CLA consists of three types of prompts within two types of task: the Performance Task and the Analytic Writing Task...The Analytic Writing Task includes a pair of prompts called Make-an-Argument and Critique-an-Argument.

"The CLA uses direct measures of skills in which students perform cognitively demanding tasks... All CLA measures are administered online and contain open-ended prompts that require constructed responses. There are no multiple-choice questions. The CLA tasks require that students integrate critical thinking and written communication skills. The holistic integration of these skills on the CLA tasks mirrors the requirements of serious thinking and writing tasks faced in life outside of the classroom. "

This is not your father's SAT. The exam simulates realistic workplace tasks and assesses skills that are relevant to many jobs, including (maybe especially) entrepreneurship.

On this assessment, the measured differences between black and white college students are stark. For white college students, the mean and standard deviation are 1170 ± 179. For black students, they are 995 ± 167.

To get a sense of what that difference looks like, suppose there are just two groups, which I call "blue" and "green" as a reminder that I am presenting an abstract model and not a realistic description. This figure shows Gaussian distributions with the parameters reported in Academically Adrift:

The difference in means is 175 points, which is about one standard deviation. If we select people from the upper tail, the majority are blue. But the situation is even worse if greens are a minority. If greens make up 20% of the population, the picture looks like this:

The fraction of greens in the upper tail is even smaller. If, as Ries suggests, "Here in Silicon Valley, we’re looking for the absolute best and brightest, the people far out on the tail end of aptitude," the number of greens in that tail is very small.

How small? That depends on where we draw the line. If we select people who score above 1200, which includes 37% of the population, we get 6% greens (remember that they are 20% of the hypothetical population). Above 1300 the proportion of greens is 3%, and above 1400 only 2%.

And that's not very "far out on the tail end of aptitude." Above 1500, we are still talking about 3% of the general population, but more than 99% of them are blue. So in this hypothetical world of blues and greens, perfect meritocracy does not lead to proportional representation.

Ries suggests that blind screening of applicants might help. I think the system he proposes is a good idea, because it improves fairness and also the perception of fairness. But if the racial gap in Y Combinator's applicant pool is similar to the racial gap in CLA scores, making the selection process more meritocratic won't make a big difference.

These numbers are bad. I'm sorry to be reporting them, and if I know the Internet, some people are going to call me a racist for doing it. But I didn't make them up, and I'm pretty sure I did the math right. Of course, you are welcome to disagree with my conclusions.

Here are some of the objections I expect:

1) The CLA does not capture the full range of skills successful entrepreneurs need.

Of course it doesn't; no test could. But I chose the CLA because I think it assesses thinking skills better than other standardized tests, and because the database includes "over 200,000 student results across hundreds of colleges." I can't think of a better way to estimate the magnitude of the racial gap in the applicant pool.

2) The application process is biased against racial minorities and women.

The statistics I am reporting here, and my analysis of them, don't say anything about whether or not the application process is biased. But they do suggest (a) We should not assume that because racial minorities are underrepresented among Silicon Valley entrepreneurs, racial bias explains a large part of the effect, and (b) We should not assume that eliminating bias from the process will have a large effect.

Of course, trying to eliminate bias is the right thing to do, whether the effect is big or small.

-----

NOTE: The range of scores for the CLA was capped at 1600 until 2007, which changed the shape of the distribution at the high end. For those years, the Gaussian distributions in the figures are not exactly right, but I don't think it affects my analysis much. Since 2007, scores are no longer capped, but I don't know what the tail of the distribution looks like now.

EDIT 11-28-11: I revised a few sentences to clarify whether I was talking about representation or absolute numbers. The fraction of greens in the population affects the absolute numbers in the tail but not their representation.

Thursday, November 10, 2011

Girl Named Florida solutions

In The Drunkard's Walk, Leonard Mlodinow presents "The Girl Named Florida Problem":

"In a family with two children, what are the chances, if one of the children is a girl named Florida, that both children are girls?"

I like this problem, and I use it on the first day of my class to introduce the topic of conditional probability. But I've decided that it's too easy. To give it a little more punch, I've decided to combine it with the Red-Haired Problem from last week:

In a family with two children, what are the chances, if at least one of the children is a girl with red hair, that both children are girls?

Just like last week, you can make some simplifying assumptions:

About 2% of the world population has red hair. You can assume that the alleles for red hair are purely recessive. Also, you can assume that the Red Hair Extinction theory is false, so you can apply the Hardy–Weinberg principle. And you can ignore the effect of identical twins.

Before I present my solution, I want to sneak up on it with a series of warm-up problems.

P[GG | two children]: if a family has two children, what is the chance that they have two girls?
P[GG | two children, at least one girl]: if we know they have at least one girl, what is the chance that they have two girls?
P[GG | two children, older child is a girl]: if the older child is a girl, what is the chance that they have two girls?
P[GG | two children, at least one is a girl named Florida].
P[GG | two children, at least one is a girl with red hair, and the parents have brown hair].
P[GG | two children, at least one is a girl with red hair].

Problem 1: P[GG | two children]

If we assume that the probability that each child is a girl is 50%, then P[GG | two children] = 1/4.

Problem 2: P[GG | two children, at least one girl]

There are four equally-likely kinds of two child families: BB, BG, GB and GG. We know that BB is out, so the conditional probability is

P[GG | at least one girl] = P[GG and at least one girl] / P[at least one girl] = 1/3.

Problem 3: P[GG | two children, older child is a girl]

Now there are only two possible families, GB and GG, so the conditional probability is 1/2. Informally we can argue that once we know about the older child we can treat the younger child as independent. But if there's one thing we learn from this problem, it's that our intuition for independence is not reliable.

Problem 4: P[GG | two children, at least one girl named Florida]

Here's the one that makes people's head hurt. For each child, there are three possibilities, boy, girl not named Florida, and girl named Florida, with these probabilities:

B: 1/2
G: 1/2 - x
GF: x

where x is the unknown percentage of people who are girls named Florida. Of families with at least one girl named Florida, there are these possible combinations, with these probabilities

B GF: 1/2 x
GF B: 1/2 x
G GF: x (1/2 - x)
GF G: x (1/2 - x)
GF GF: x^2

The highlighted cases have two girls, so the probability we want is the sum of the highlighted cases over the sum of all cases. With a little algebra, we get:

P(GG | at least one girl named Florida) = (1 - x) / (2 - x)

Assuming that Florida is not a common name, x approaches 0 and the answer approaches 1/2. So it turns out, surprisingly, that the name of the girl is relevant information.

As x approaches 1/2, the answer converges on 1/3. For example, if we know that at least one child is a girl with two X chromosomes, x is close to 1/2 and the problem reduces to Problem 2.

If this problem is still making your head hurt, this figure might help:

Here B a boy, Gx is a girl with some property X, and G is a girl who doesn't have that property. If we select all families with at least one Gx, we get the five blue squares (light and dark). The families with two girls are the three dark blue squares.

If property X is common, the ratio of dark blue to all blue approaches 1/3. If X is rare, the same ratio approaches 1/2.

Problem 5: P[GG | two children, at least one girl with red hair, parents have brown hair]

If the parents have brown hair and one of their children has red hair, we know that both parents are heterozygous, so their chance of having a red-haired girl is 1/8.

Using the girl-named-Florida formula, we get

P[GG | two children, at least one girl with red hair, parents have brown hair] = (1 - 1/8) / (2 - 1/8) = 7/15.

And finally:

Problem 6: P[GG | two children, at least one girl with red hair]

In this case we don't know the genotype of the parents. There are three possibilities: Aa Aa, Aa aa, and aa aa.

We follow these steps:

1) Use the prevalence of red hair to compute the prior probabilities of each parental genotype.

2) Use the evidence (at least one girl with red hair) to compute the posteriors.

3) For each combination, compute the conditional probability. We have already computed

P(GG | two children, at least one with red hair, Aa Aa) = 7/15

The others are

P(GG | two children, at least one with red hair, Aa aa) = 3/7
P(GG | two children, at least one with red hair, aa aa) = 1/3

4) Apply the law of total probability to get the answer.

I'm too lazy to do the algebra, so I got Mathematica to do it for me. Here is the notebook with the answer.

In general, if p is the prevalence of red hair alleles,

P(GG | two children, at least one with red hair) = (p^2 + 2p - 7) / (p^2 + 2p - 15)

If the prevalence of red hair is 0.02, then p = sqrt(0.02) = 0.141, and

P(GG | two children, at least one with red hair) = 45.6%

Congratulations to Professor Ted Bunn at the University of Richmond, the only person who submitted a correct answer before the deadline!

At least, I think it's the right answer. Maybe we both made the same mistake.

For more fun with probability, see Chapter 5 of my book, Think Stats, which you can read here, or buy here.

Monday, November 7, 2011

The red-haired girl named Florida

In The Drunkard's Walk, Leonard Mlodinow presents "The Girl Named Florida Problem":

"In a family with two children, what are the chances, if one of the children is a girl named Florida, that both children are girls?"

In a family with two children, what are the chances, if at least one of the children is a girl with red hair, that both children are girls?

[Edit: I clarified the wording to say that at least one of the children is a girl with red hair. To be even more precise, I am looking for the conditional probability P(two girls | at least one girl with red hair).]

Just like last week, you can make some simplifying assumptions:

About 2% of the world population has red hair. You can assume that the alleles for red hair are purely recessive. Also, you can assume that the Red Hair Extinction theory is false, so you can apply the Hardy–Weinberg principle. And you can ignore the effect of identical twins.

Here is my solution.

For more fun with probability, see Chapter 5 of my book, Think Stats, which you can read here, or buy here.

Thursday, November 3, 2011

Somebody bet on the Bayes

In last week's post I wrote solutions to some of my favorite Bayes's Theorem problems, and posed this new problem:

If you meet a man with (naturally) red hair, what is the probability that neither of his parents has red hair?

Hints: About 2% of the world population has red hair. You can assume that the alleles for red hair are purely recessive. Also, you can assume that the Red Hair Extinction theory is false, so you can apply the Hardy–Weinberg principle.

Given the prevalence of red hair, we know what fraction of the population is homozygous recessive. To solve the problem, we also need to know how many are heterozygous and homozygous dominant.

I'll use a to represent the recessive allele (or alleles) for red hair, and A for the dominant alleles that code for other colors. p is the prevalence of a and q is the prevalence of A, so p+q = 1.

If these prevalences are not changing over time, and they don't affect people's mating decisions, we can invoke the Hardy-Weinberg principle to get:

P(AA) = prevalence of homozygous dominant = q**2
P(Aa) = prevalence of heterozygous = 2 * p * q
P(aa) = prevalence of homozygous recessive = p**2

And P(AA) + P(Aa) + P(aa) = 1.

Given (aa), we can compute p, q and:

P(Aa) = 0.243
P(AA) = 0.737

Now if a child has red hair, both parents have at least one recessive allele, so the possible combinations are (Aa Aa), (Aa aa), and (aa aa). If we assume, again, that mating decisions are not based on hair color, we can get the prevalence of each parental pair:

P(Aa Aa) = Aa**2

P(Aa aa) = 2 Aa aa

P(aa aa) = aa**2

And we can use those as the priors. The evidence is

E: the child has red hair

To get the likelihoods, we apply Mendelian genetics:

P(E | Aa Aa) = 0.25

P(E | Aa aa) = 0.5

P(E | aa aa) = 1.0

Finally, applying Bayes's Theorem, we have

P(Aa Aa | E) = P(Aa Aa) P(E | Aa Aa) / P(E)

And the answer is 0.737. With a little algebra we can show that

P(Aa Aa | E) = P(AA)

That is, the probability that neither parent has red hair is exactly the fraction of the population that is homozygous dominant.

Almost 75% of red-haired people come from parents with non-red hair, which might explain why a "red haired child" is a metaphor for a child that doesn't resemble his parents. In the expression, "beat like a red haired step child," some of the humor (for people who find child abuse funny) comes from the suggestion that the parentage of a red haired child is suspect.

But why should red hair be funnier than blue eyes or blonde hair? It turns out that we can answer this question mathematically. If the prevalence of red hair were higher, say 10%, most red haired people would have at least one red-haired parent, and that would be less funny.

In general, as the prevalence of the recessive phenotype increases, the potential for amusing insinuations of infidelity decreases; near 0 it drops off steeply, as shown in this figure:

Since I believe I am the first person to quantify this effect, I humbly submit that it should be called "Downey's inverse law of mailman jokes."

For more fun with probability, see Chapter 5 of my book, Think Stats, which you can read here, or buy here.

-----

If you don't get the title of this post, it is a play on "Somebody bet on the bay," a lyric from the minstrel song "Camptown Races." A bay is a horse with a reddish-brown coat, so I thought it was a pretty good fit.

Thursday, October 27, 2011

All your Bayes are belong to us!

This week's post contains solutions to My Favorite Bayes's Theorem Problems, and one new problem. If you missed last week's post, go back and read the problems before you read the solutions!

If you don't understand the title of this post, brush up on your memes.

1) The first one is a warm-up problem. I got it from Wikipedia (but it's no longer there):

Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of Bowl #1?

First the hypotheses:
A: the cookie came from Bowl #1
B: the cookie came from Bowl #2

And the priors:
P(A) = P(B) = 1/2

The evidence:
E: the cookie is plain

And the likelihoods:
P(E|A) = prob of a plain cookie from Bowl #1 = 3/4
P(E|B) = prob of a plain cookie from Bowl #2 = 1/2

Plug in Bayes's theorem and get
P(A|E) = 3/5

You might notice that when the priors are equal they drop out of the BT equation, so you can often skip a step.

2) This one is also an urn problem, but a little trickier.

The blue M&M was introduced in 1995. Before then, the color mix in a bag of plain M&Ms was (30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan). Afterward it was (24% Blue , 20% Green, 16% Orange, 14% Yellow, 13% Red, 13% Brown).

A friend of mine has two bags of M&Ms, and he tells me that one is from 1994 and one from 1996. He won't tell me which is which, but he gives me one M&M from each bag. One is yellow and one is green. What is the probability that the yellow M&M came from the 1994 bag?

Hypotheses:
A: Bag #1 from 1994 and Bag #2 from 1996
B: Bag #2 from 1994 and Bag #1 from 1996

Again, P(A) = P(B) = 1/2.

The evidence is:
E: yellow from Bag #1, green from Bag #2

We get the likelihoods by multiplying the probabilities for the two M&M:

P(E|A) = (0.2)(0.2)
P(E|B) = (0.1)(0.14)

For example, P(E|B) is the probability of a yellow M&M in 1996 (0.14) times the probability of a green M&M in 1994 (0.1).

Plugging the likelihoods and the priors into Bayes's theorem, we get P(A|E) = 40 / 54 ~ 0.74

By introducing the terms Bag #1 and Bag #2, rather than "the bag the yellow M&M came from" and "the bag the green came from," I avoided the part of this problem that can be tricky: keeping the hypotheses and the evidence straight.

3) This one is from one of my favorite books, David MacKay's Information Theory, Inference, and Learning Algorithms:

Elvis Presley had a twin brother who died at birth. What is the probability that Elvis was an identical twin?

To answer this one, you need some background information: According to the Wikipedia article on twins: ``Twins are estimated to be approximately 1.9% of the world population, with monozygotic twins making up 0.2% of the total---and 8% of all twins.''

There are several ways to set up this problem; I think the easiest is to think about twin birth events, rather than individual twins, and to take the fact that Elvis was a twin as background information.

So the hypotheses are
A: Elvis's birth event was an identical birth event
B: Elvis's birth event was a fraternal twin event

If identical twins are 8% of all twins, then identical birth events are 8% of all twin birth events, so the priors are

P(A) = 8%
P(B) = 92%

The relevant evidence is
E: Elvis's twin was male

So the likelihoods are
P(E|A) = 1
P(E|B) = 1/2

Because identical twins are necessarily the same sex, but fraternal twins are equally likely to be opposite sex (or, at least, I assume so). So

P(A|E) = 8/54 ~ 0.15.

The tricky part of this one is realizing that the sex of the twin provides relevant information!

4) Also from MacKay's book:

Two people have left traces of their own blood at the scene of a crime. A suspect, Oliver, is tested and found to have type O blood. The blood groups of the two traces are found to be of type O (a common type in the local population, having frequency 60%) and of type AB (a rare type, with frequency 1%). Do these data (the blood types found at the scene) give evidence in favour [sic] of the proposition that Oliver was one of the two people whose blood was found at the scene?

For this problem, we are not asked for a posterior probability; rather we are asked whether the evidence is incriminating. This depends on the likelihood ratio, but not the priors.

The hypotheses are
X: Oliver is one of the people whose blood was found
Y: Oliver is not one of the people whose blood was found

The evidence is
E: two blood samples, one O and one AB

We don't need priors, so we'll jump to the likelihoods. If X is true, then Oliver accounts for the O blood, so we just have to account for the AB sample:

P(E|X) = 0.01

If Y is true, then we assume the two samples are drawn from the general population at random. The chance of getting one O and one AB is

P(E|Y) = 2(0.6)(0.01) = 0.012

Notice that there is a factor of two here because there are two permutations that yield E.

So the evidence is slightly more likely under Y, which means that it is actually exculpatory! This problem is a nice reminder that evidence that is consistent with a hypothesis does not necessarily support the hypothesis.

5) I like this problem because it doesn't provide all of the information. You have to figure out what information is needed and go find it.

According to the CDC, ``Compared to nonsmokers, men who smoke are about 23 times more likely to develop lung cancer and women who smoke are about 13 times more likely.''
If you learn that a woman has been diagnosed with lung cancer, and you know nothing else about her, what is the probability that she is a smoker?

I find it helpful to draw a tree:

If y is the fraction of women who smoke, and x is the fraction of nonsmokers who get lung cancer, the number of smokers who get cancer is proportional to 13xy, and the number of nonsmokers who get lung cancer is proportional to x(1-y).

Of all women who get lung cancer, the fraction who smoke is 13xy / (13xy + x(1-y)).

The x's cancel, so it turns out that we don't actually need to know the absolute risk of lung cancer, just the relative risk. But we do need to know y, the fraction of women who smoke. According to the CDC, y was 17.9% in 2009. So we just have to compute

13y / (13y + 1-y) ~ 74%

This is higher than many people guess.

6) Next, a mandatory Monty Hall Problem. First, here's the general description of the scenario, from Wikipedia:

Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say Door A [but the door is not opened], and the host, who knows what's behind the doors, opens Door B, which has a goat. He then says to you, "Do you want to pick Door C?" Is it to your advantage to switch your choice?

The answer depends on the behavior of the host when the car is behind Door A. In this case the host can open either B or C. Suppose he chooses B with probability p and C otherwise. What is the probability that the car is behind Door A (as a function of p)?

The hypotheses are
A: the car is behind Door A
B: the car is behind Door B
C: the car is behind Door C

And the priors are
P(A) = P(B) = P(C) = 1/3

The likelihoods are
P(E|A) = p, because in this case Monty has a choice and chooses B with probability p,
P(E|B) = 0, because if the car were behind B, Monty would not have opened B, and
P(E|C) = 1, because in this case Monty has no choice.

Applying Bayes's Theorem,
P(A|E) = p / (1+p)

In the canonical scenario, p=1/2, so P(A|E) = 1/3, which is the canonical solution. If p=0, P(A|E) = 0, so you can switch and win every time (when Monty opens B, that it). If p=1, P(A|E) = 1/2, so in that case it doesn't matter whether you stick or switch.

When Monty opens C, P(A|E) = (1-p) / (2-p)

[Correction: the answer in this case is not (1-p) / (1+p), which what I wrote in a previous version of this article. Sorry!].

7) And finally, here is a new problem I just came up with:

If you meet a man with (naturally) red hair, what is the probability that neither of his parents has red hair?

Hints: About 2% of the world population has red hair. You can assume that the alleles for red hair are purely recessive. Also, you can assume that the Red Hair Extinction theory is false, so you can apply the Hardy–Weinberg principle.

Solution to this one next week!

Please let me know if you have suggestions for more problems. An ideal problem should meet at least some of these criteria:

1) It should be based on a context that is realistic or at least interesting, and not too contrived.

2) It should make good use of Bayes's Theorem -- that is, it should be easier to solve with BT than without.

3) It should involve some real data, which the solver might have to find.

4) It might involve a trick, but should not be artificially hard.

If you send me something that is not under copyright, or is usable under fair use, I will include it in the next edition of Think Stats and add you to the contributors list.

Thursday, October 20, 2011

My favorite Bayes's Theorem problems

This week: some of my favorite problems involving Bayes's Theorem. Next week: solutions.

1) The first one is a warm-up problem. I got it from Wikipedia (but it's no longer there):

Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of Bowl #1?

This is a thinly disguised urn problem. It is simple enough to solve without Bayes's Theorem, but good for practice.

2) This one is also an urn problem, but a little trickier.

The blue M&M was introduced in 1995. Before then, the color mix in a bag of plain M&Ms was (30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan). Afterward it was (24% Blue , 20% Green, 16% Orange, 14% Yellow, 13% Red, 13% Brown).

A friend of mine has two bags of M&Ms, and he tells me that one is from 1994 and one from 1996. He won't tell me which is which, but he gives me one M&M from each bag. One is yellow and one is green. What is the probability that the yellow M&M came from the 1994 bag?

3) This one is from one of my favorite books, David MacKay's "Information Theory, Inference, and Learning Algorithms":

Elvis Presley had a twin brother who died at birth. What is the probability that Elvis was an identical twin?

Two people have left traces of their own blood at the scene of a crime. A suspect, Oliver, is tested and found to have type O blood. The blood groups of the two traces are found to be of type O (a common type in the local population, having frequency 60%) and of type AB (a rare type, with frequency 1%). Do these data (the blood types found at the scene) give evidence in favour [sic] of the proposition that Oliver was one of the two people whose blood was found at the scene?

5) I like this problem because it doesn't provide all of the information. You have to figure out what information is needed and go find it.

According to the CDC, ``Compared to nonsmokers, men who smoke are about 23 times more likely to develop lung cancer and women who smoke are about 13 times more likely.''
If you learn that a woman has been diagnosed with lung cancer, and you know nothing else about her, what is the probability that she is a smoker?

6) And finally, a mandatory Monty Hall Problem. First, here's the general description of the scenario, from Wikipedia:

Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say Door A [but the door is not opened], and the host, who knows what's behind the doors, opens another door, say Door B, which has a goat. He then says to you, "Do you want to pick Door C?" Is it to your advantage to switch your choice?

The answer depends on the behavior of the host if the car is behind Door A. In this case the host can open either B or C. Suppose he chooses B with probability p and C otherwise. What is the probability that the car is behind Door A (as a function of p)?

If you like this problem, you might also like the Blinky Monty Problem.

Solutions next week!

Please let me know if you have suggestions for more problems. An ideal problem should meet at least some of these criteria:

1) It should be based on a context that is realistic or at least interesting, not too contrived.

2) It should make good use of Bayes's Theorem -- that is, it should be easier to solve with BT than without.

3) It should involve some real data, which the solver might have to find.

4) It might involve a trick, but should not be artificially hard.

If you send me something that is not under copyright, or is usable under fair use, I will include it in the next edition of Think Stats and add you to the contributors list.

Monday, October 17, 2011

The Blinky Monty Problem

I read Jason Rosenhouse's book about The Monty Hall Problem recently, and I use the problem as an example in my statistics class. Last semester I wrote a variation of the problem that turns out to be challenging, and a motivating problem for Bayesian estimation. Here's what I call the "Blinky Monty Problem."

Suppose you are on Let's Make a Deal and you are playing the Monty Hall Game, with one twist. Before you went on the show you analyzed tapes of previous shows and discovered that Monty has a tell: when the contestant picks the correct door, Monty is more likely to blink.

Of the 18 shows you watched, the contestant chose the correct door 5 times, and Monty blinked three of those times. Of the other 13 times, Monty blinked three times.

Assume that you choose Door A. Monty opens door B and blinks. What should you do, and what is your chance of winning?

To get started, let's review the standard version of the Monty Hall problem (from Wikipedia):

"Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say A [but the door is not opened], and the host, who knows what's behind the doors, opens another door, say B, which has a goat. He then says to you, "Do you want to pick door C?" Is it to your advantage to switch your choice?"

We can solve this problem using Bayes's Theorem. There are two relevant hypotheses:

A: the car is behind door A
C: the car is behind door C

Before Monty opens door B, the prior probability for both hypotheses is 1/3. Now the evidence is

E: Monty opened door B.

To compute the posterior probabilities, we need the likelihoods P(E|A) and P(E|C).

P(E|A) is the probability of the evidence if the car is behind door A. In this case Monty has a choice; he could have opened door B or C. In the canonical version of the problem, he chooses at random, so

P(E|A) = 1/2

P(E|C) is the probability of the evidence if the car is behind door C. In this case Monty has no choice, so

P(E|C) = 1

Applying Bayes's Theorem yields:

P(A|E) = 1/3
P(C|E) = 2/3

So if you switch to Door C, you increase the chances of winning from 1/3 to 2/3. This is the standard answer to the standard version of the problem. If you are not familiar with the Monty Hall problem and this answer comes as a shock to you, please don't take it out on me. Go read about it, starting here, and come back when you have calmed down.

Are you ready to go on? Ok. We can do a similar analysis for the Blinky Monty problem, but now we have more evidence to consider:

E: Monty opens Door B and blinks.

According to the scouting report, the probability that Monty blinks if the car is behind A is 3/5, so

P(E|A) = (1/2)(3/5) = 3/10

If the car is behind C, Monty blinks 3/13 of the time, so

P(E|C) = (1)(3/13) = 3/13

Plugging in Bayes' Theorem, we get P(A|E) = 0.51, so in this scenario you are slightly better off sticking with Door A.

But is that all there is to it? No! And here's where this blog earns the name "Probably Overthinking It." Remember that the probabilities for Monty's blinking are estimates based on a sample, so we have some uncertainty about them.

To take that uncertainty into account, we have to

1) Use the scouting report to estimate the distribution of P(blink|A) and P(blink|C).

2) For each pair of values from those distributions, compute P(A|E).

3) Apply the law of total probability to get the overall P(A|E).

That might sound complicated, but it is straightforward to estimate computationally.

First, let me take a minute to make fun of frequentists. According to conventional hypothesis testing, the data from the scouting report is not statistically significant. If the null hypothesis is that Monty has the same chance of blinking regardless of the location of the car, the p-value of the sample we saw is 0.27 (I used Fisher's exact test, computed using this online calculator). We can't reject the null hypothesis, so if we play by the rules of conventional hypothesis testing, I guess that means we can't take advantage of Monty's tell. If you are a committed frequentist, you should stop reading now.

Good. Now that the riff-raff are gone, let's proceed. Here's how we make the posterior distribution for P(blink|doorA):

doorA = MakeUniformSuite(0.0, 1.0, 101, name='Door A')
evidence = 3, 2
Update(doorA, evidence)

MakeUniformSuite() makes a Pmf of 101 values equally spaced between 0 and 1, with equal probability. Update() does the usual Bayesian update:

def Update(suite, evidence):

"""Updates a suite of hypotheses based on new evidence.

Modifies the suite directly; if you want to keep the original, make

a copy.

Args:

suite: Pmf object

evidence: whatever kind of object Likelihood expects

"""

for hypo in suite.Values():

likelihood = Likelihood(evidence, hypo)

suite.Mult(hypo, likelihood)

suite.Normalize()

def Likelihood(evidence, hypo):

"""Computes the likelihood of the evidence assuming the hypothesis is true.

Args:

evidence: a tuple of (number of heads, number of tails)

hypo: float probability of heads

Returns:

probability of tossing the given number of heads and tails with a

coin that has p probability of heads

"""

heads, tails = evidence

p = hypo

return pow(p, heads) * pow(1-p, tails)

If you are not familiar with that, you can read Chapter 8 of Think Stats, or see this blog post.

To find P(blink|C), it's pretty much the same:

doorC = MakeUniformSuite(0.0, 1.0, 101, name='Door C')

evidence = 3, 10

Update(doorC, evidence)

And to finish it off, we apply the law of total probability:

print TotalProbability(doorA, doorC, ProbWinning)

TotalProbability() enumerates the values from doorA and doorC, and calls ProbWinning() for each pair:

def TotalProbability(pmf1, pmf2, func):

"""Enumerates pairs from the Pmfs, calls the func, and returns

the total probability.

func: takes a value from each Pmf and returns a probability.

"""

total = 0.0

for x, px in pmf1.Items():

for y, py in pmf2.Items():

if px and py:

total += px * py * func(x, y)

return total

Finally, ProbWinning() takes P(blink|A) and P(blink|C) and returns P(A|E):

def ProbWinning(pbA, pbC):

"""Computes the probability that the car is behind door A:

pbA: probability that Monty blinks if the car is behind A

pbC: probability that Monty blinks if the car is behind C

"""

pea = 0.5 * pbA

pec = 1.0 * pbC

pae = pea / (pea + pec)

return pae

You can download all this code here. If you run it, you'll see the posterior distributions for P(blink|A) and P(blink|C):

The most likely values are 0.6 and 0.23, but since we don't have much data, the uncertainty is still large.

Taking it all together, the probability that the car is behind Door A, P(A|E), is 0.52. So you are slightly better off sticking with Door A.

To summarize:

1) In the canonical version of the game, P(A|E) = 1/3, so you are better off switching.

2) In the Blinky Monty version, if we take P(blink|A) and P(blink|C) as givens, we estimate P(A|E) = 0.51, so you are slightly better off sticking with Door A.

3) If you are a frequentist, you conclude that Monty's tell is not statistically significant, so you ignore it, switch, and lose most of the time.

4) If you are a Bayesian, you can take advantage of the tell and maximize your chance of winning. If Monty blinks and you stick, you win 52% of the time. If he doesn't blink and you switch, you win 87% of the time. Since he blinks 40% of the time, overall you can expect to win 73% of the time. So the tell provides a substantial advantage.

Exercise for the reader: download my code and modify it to analyze the case where Monty doesn't blink. Hint: you only have to change two lines.

EDIT October 19, 2011: I got a comment on reddit that raises a legitimate concern about data collection. In the scenario I presented, if you review tapes of the show and look for several potential tells (blinking, facial tics, verbal habits, etc.) and you choose one that seems to be correlated with the outcome, you could easily get fooled (see this post on repeated tests). This is a fair warning for both the frequentist and Bayesian analyses.

So to be more careful, maybe I should say that you were tipped off to the possibility that Monty's tell is a blink before you reviewed the tape, and that's the only data you collected.

EDIT January 26, 2012: In the comments below, Jacob pointed out an error in my calculation. As a result, I went back and changed the scenario to match my calculations, which forced me to make a few other revisions. I think it is all settled now, but let me know if you disagree.