Friday, August 22, 2014

An exercise in hypothesis testing

I've just turned in the manuscript for the second edition of Think Stats.  If you're dying to get your hands on a copy, you can pre-order one here.

Most of the book is about computational methods, but in the last chapter I break out some analytic methods, too.  In the last section of the book, I explain the underlying philosophy:

This book focuses on computational methods like resampling and permutation. These methods have several advantages over analysis:
  • They are easier to explain and understand. For example, one of the most difficult topics in an introductory statistics class is hypothesis testing. Many students don’t really understand what p-values are. I think the approach I presented in Chapter 9—simulating the null hypothesis and computing test statistics—makes the fundamental idea clearer.
  • They are robust and versatile. Analytic methods are often based on assumptions that might not hold in practice. Computational methods require fewer assumptions, and can be adapted and extended more easily.
  • They are debuggable. Analytic methods are often like a black box: you plug in numbers and they spit out results. But it’s easy to make subtle errors, hard to be confident that the results are right, and hard to find the problem if they are not. Computational methods lend themselves to incremental development and testing, which fosters confidence in the results.
But there is one drawback: computational methods can be slow. Taking into account these pros and cons, I recommend the following process:
  1. Use computational methods during exploration. If you find a satisfactory answer and the run time is acceptable, you can stop.
  2. If run time is not acceptable, look for opportunities to optimize. Using analytic methods is one of several methods of optimization.
  3. If replacing a computational method with an analytic method is appropriate, use the computational method as a basis of comparison, providing mutual validation between the computational and analytic results.
For the vast majority of problems I have worked on, I didn’t have to go past Step 1.
The last exercise in the book is based on a question my colleague, Lynn Stein, asked me for a paper she was working on:

 In a recent paper [2], Stein et al. investigate the effects of an intervention intended to mitigate gender-stereotypical task allocation within student engineering teams.  Before and after the intervention, students responded to a survey that asked them to rate their contribution to each aspect of class projects on a 7-point scale. 
Before the intervention, male students reported higher scores for the programming aspect of the project than female students; on average men reported a score of 3.57 with standard error 0.28. Women reported 1.91, on average, with standard error 0.32. 
Question 1:  Compute a 90% confidence interval and a p-value for the gender gap (the difference in means). 
After the intervention, the gender gap was smaller: the average score for men was 3.44 (SE 0.16); the average score for women was 3.18 (SE 0.16). Again, compute the sampling distribution of the gender gap and test it. 
Question 2:  Compute a 90% confidence interval and a p-value for the change in gender gap. 
[2] Stein et al., “Evidence for the persistent effects of an intervention to mitigate gender-stereotypical task allocation within student engineering teams,” Proceedings of the IEEE Frontiers in Education Conference, 2014.
In the book I present ways to do these computations, and I will post my "solutions" here soon.  But first I want to pose these questions as a challenge for statisticians and people learning statistics.  How would you approach these problems?

The reason I ask:  Question 1 is pretty much a textbook problem; you can probably find an online calculator to do it for you.  But you are less likely to find a canned solution to Question 2, so I am curious to see how people go about it.  I hope to post some different solutions soon.

By the way, this is not meant to be a "gotcha" question.  If some people get it wrong, I am not going to make fun of them.  I am looking for different correct approaches; I will ignore mistakes, and only point out incorrect approaches if they are interestingly incorrect.

You can post a solution in the comments below, or discuss it on reddit.com/r/statistics, or if you don't want to be influenced by others, send me email at downey at allendowney dot com.

UPDATE August 26, 2014

The discussion of this question on reddit.com/r/statistics was as interesting as I hoped.  People suggested several very different approaches to the problem.  The range of responses extends from something like, "This is a standard problem with a known, canned answer," to something like, "There are likely to be dependencies among the values that make the standard model invalid, so the best you can do is an upper bound."  In other words, the problem is either trivial or impossible!

The approach I had in mind is to compute sampling distributions for the gender gap and the change in gender gap using normal approximations, and then use the sampling distributions to compute standard errors, confidence intervals, and p-values.

I used a simple Python class that represents a normal distribution.  Here is the API:

class Normal(object):
    """Represents a Normal distribution"""

    def __init__(self, mu, sigma2, label=''):
        """Initializes a Normal object with given mean and variance."""

    def __add__(self, other):
        """Adds two Normal distributions."""

    def __sub__(self, other):
        """Subtracts off another Normal distribution."""

    def __mul__(self, factor):
        """Multiplies by a scalar."""

    def Sum(self, n):
        """Returns the distribution of the sum of n values."""

    def Prob(self, x):
        """Cumulative probability of x."""
     
    def Percentile(self, p):
        """Inverse CDF of p (0 - 100)."""

The implementation of this class is here.
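If you want to follow along without downloading it, here is a minimal sketch that is consistent with the API above.  It is not the book's normal.py; it leans on scipy.stats for the CDF and inverse CDF, and it assumes the values being combined are independent:

import scipy.stats

class Normal(object):
    """Represents a Normal distribution (a sketch, not the book's code)."""

    def __init__(self, mu, sigma2, label=''):
        self.mu = mu          # mean
        self.sigma2 = sigma2  # variance
        self.label = label

    def __add__(self, other):
        # sum of independent normals: means add, variances add
        return Normal(self.mu + other.mu, self.sigma2 + other.sigma2)

    def __sub__(self, other):
        # difference of independent normals: means subtract, variances add
        return Normal(self.mu - other.mu, self.sigma2 + other.sigma2)

    def __mul__(self, factor):
        # scaling by a constant scales the mean; the variance picks up factor**2
        return Normal(factor * self.mu, factor**2 * self.sigma2)

    def Sum(self, n):
        # distribution of the sum of n independent values
        return Normal(n * self.mu, n * self.sigma2)

    def Prob(self, x):
        # cumulative probability of x
        return scipy.stats.norm.cdf(x, self.mu, self.sigma2**0.5)

    def Percentile(self, p):
        # inverse CDF; p is a percentage between 0 and 100
        return scipy.stats.norm.ppf(p / 100.0, self.mu, self.sigma2**0.5)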

Here's a solution that uses the Normal class.  First we make normal distributions that represent the sampling distributions of the estimated means, using the given means and standard errors.  The variance of the sampling distribution is the standard error squared:

    male_before = normal.Normal(3.57, 0.28**2)
    male_after = normal.Normal(3.44, 0.16**2)

    female_before = normal.Normal(1.91, 0.32**2)
    female_after = normal.Normal(3.18, 0.16**2)

Now we compute the gender gap before the intervention, and print the estimated difference, p-value, and confidence interval:

    diff_before = female_before - male_before
    print('mean, p-value', diff_before.mu, 1-diff_before.Prob(0))
    print('CI', diff_before.Percentile(5), diff_before.Percentile(95))

The estimated gender gap is -1.66 with SE 0.43, 90% CI (-2.3, -0.96) and p-value 5e-5.  So that's statistically significant.

Then we compute the gender gap after intervention and the change in gender gap:

    diff_after = female_after - male_after
    diff = diff_after - diff_before
    print('mean, p-value', diff.mu, diff.Prob(0))
    print('CI', diff.Percentile(5), diff.Percentile(95))

The estimated change is 1.4 with SE 0.48, 90% CI (0.61, 2.2) and p-value 0.002.  So that's statistically significant, too.
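As a sanity check (this is not in the book's code), the same standard errors and p-values fall out of the usual rule that when you subtract independent normal sampling distributions, the variances add:

    from math import sqrt
    from scipy.stats import norm

    se_gap_before = sqrt(0.28**2 + 0.32**2)               # about 0.43
    se_gap_after = sqrt(0.16**2 + 0.16**2)                 # about 0.23
    se_change = sqrt(se_gap_before**2 + se_gap_after**2)   # about 0.48

    gap_before = 1.91 - 3.57                # -1.66
    change = (3.18 - 3.44) - gap_before     # 1.40

    print(norm.sf(0, gap_before, se_gap_before))   # p-value for the gap, about 5e-5
    print(norm.cdf(0, change, se_change))          # p-value for the change, about 0.002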

This solution is based on two assumptions (and comes with a general caveat):

1) It assumes that the sampling distribution of the estimated means is approximately normal.  Since the data are on a Likert scale, the variance and skew are small, so the sum of n values converges to normal quickly.  The sample sizes are in the range of 30-90, so the normal approximation is probably quite good.

This claim is based on the Central Limit Theorem, which only applies if the samples are drawn from the population independently.  In this case, there are dependencies within teams: for example, if someone on a team does a larger share of a task, the rest of the team necessarily does less.  But the team sizes are 2-4 people and the sample sizes are much larger, so these dependencies have short enough "range" that I think it is acceptable to ignore them.

2) Every time we subtract two distributions to get the distribution of the difference, we are assuming that values are drawn from the two distributions independently.  In theory, dependencies within teams could invalidate this assumption, but I don't think it's likely to be a substantial effect.

3) As always, remember that the standard error (and confidence interval) indicate uncertainty due to sampling, but say nothing about measurement error, sampling bias, and modeling error, which are often much larger sources of uncertainty.

The approach I presented here is a bit different from what's presented in most introductory stats classes.  If you have taken (or taught) a stats class recently, I would be curious to know what you think of this problem.  After taking the class, would you be able to solve problems like this?  Or if you are teaching, could your students do it?

Your comments are welcome.

Friday, July 18, 2014

More likely to be killed by a terrorist

I am working on the second edition of Think Stats, adding chapters on some topics that didn't make it into the first edition, including survival analysis.  Here is the draft of the new chapter; this post contains some of the highlights.

Survival analysis

Survival analysis is a way to describe how long things last. It is often used to study human lifetimes, but it also applies to “survival” of mechanical and electronic components, or more generally to intervals in time before an event.

If someone you know has been diagnosed with a life-threatening disease, you might have seen a “5-year survival rate,” which is the probability of surviving five years after diagnosis. That estimate and related statistics are the result of survival analysis.

As a more cheerful example, I will use data from the National Survey of Family Growth (NSFG) to quantify how long respondents “survive” until they get married for the first time. The range of respondents’ ages is 14 to 44 years, so the dataset provides a snapshot of women at different stages in their lives, in the same way that a medical cohort might include patients at different stages of disease.

For women who have been married, the dataset includes the date of their first marriage and their age at the time. For women who have not been married, we know their age when interviewed, but have no way of knowing when or if they will get married.

Since we know the age at first marriage for some women, it might be tempting to exclude the rest and compute the distribution of the known data. That is a bad idea. The result would be doubly misleading: (1) older women would be overrepresented, because they are more likely to be married when interviewed, and (2) married women would be overrepresented! In fact, this analysis would lead to the conclusion that all women get married, which is obviously incorrect.

Kaplan-Meier estimation

In this example it is not only desirable but necessary to include observations of unmarried women, which brings us to one of the central algorithms in survival analysis, Kaplan-Meier estimation.
The general idea is that we can use the data to estimate the hazard function, then convert the hazard function to a survival function. To estimate the hazard function, we consider, for each age, (1) the number of women who got married at that age and (2) the number of women “at risk” of getting married, which includes all women who were not married at an earlier age.
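To make the idea concrete, here is a rough sketch of the estimator.  It is not the book's survival.py, and the function names are mine; the point is only to show the bookkeeping: at each age, divide the number of marriages by the number still at risk, then remove both the newly married and the newly censored respondents from the risk pool.

from collections import Counter

def estimate_hazard(complete, ongoing):
    """Estimate the hazard function.

    complete: ages at first marriage, for respondents who have married
    ongoing: ages at interview, for respondents who have not

    Returns a dict that maps from age to estimated hazard.
    """
    hist_complete = Counter(complete)
    hist_ongoing = Counter(ongoing)
    at_risk = len(complete) + len(ongoing)

    hazard = {}
    for age in sorted(set(hist_complete) | set(hist_ongoing)):
        married = hist_complete[age]      # marriages at this age
        censored = hist_ongoing[age]      # interviewed at this age, still unmarried
        hazard[age] = married / at_risk
        at_risk -= married + censored
    return hazard

def survival_from_hazard(hazard):
    """Convert a hazard function to a survival function."""
    survival, s = {}, 1.0
    for age in sorted(hazard):
        s *= 1 - hazard[age]
        survival[age] = s
    return survival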

The details of the algorithm are in the book; we'll skip to the results:

The top graph shows the estimated hazard function; it is low in the teens, higher in the 20s, and declining in the 30s. It increases again in the 40s, but that is an artifact of the estimation process; as the number of respondents “at risk” decreases, a small number of women getting married yields a large estimated hazard. The survival function will smooth out this noise.

The bottom graph shows the survival function, which shows for each age the fraction of people who are still unmarried at that age.  The curve is steepest between 25 and 35, when most women get married. Between 35 and 45, the curve is nearly flat, indicating that women who do not marry before age 35 are unlikely to get married before age 45.

A curve like this was the basis of a famous magazine article in 1986; Newsweek reported that a 40-year-old unmarried woman was “more likely to be killed by a terrorist” than to get married. These statistics were widely reported and became part of popular culture, but they were wrong then (because they were based on faulty analysis) and turned out to be even more wrong (because of cultural changes that were already in progress and continued). In 2006, Newsweek ran another article admitting that they were wrong.



I encourage you to read more about this article, the statistics it was based on, and the reaction. It should remind you of the ethical obligation to perform statistical analysis with care, interpret the results with appropriate skepticism, and present them to the public accurately and honestly.

Cohort effects

One of the challenges of survival analysis, and one of the reasons Newsweek was wrong, is that different parts of the estimated curve are based on different groups of respondents. The part of the curve at time t is based on respondents whose age was at least t when they were interviewed. So the leftmost part of the curve includes data from all respondents, but the rightmost part includes only the oldest respondents.

If the relevant characteristics of the respondents are not changing over time, that’s fine, but in this case it seems likely that marriage patterns are different for women born in different generations. We can investigate this effect by grouping respondents according to their decade of birth. Groups like this, defined by date of birth or similar events, are called cohorts, and differences between the groups are called cohort effects.

To investigate cohort effects in the NSFG marriage data, I gathered Cycle 5 data from 1995, Cycle 6 data from 2002, and Cycle 7 data from 2006–2010. In total these datasets include 30,769 respondents.
I divided respondents into cohorts by decade of birth and estimated the survival curve for each cohort:



Several patterns are visible:

  • Women born in the 50s married earliest, with successive cohorts marrying later and later, at least until age 30 or so.
  • Women born in the 60s follow a surprising pattern. Prior to age 25, they were marrying at slower rates than their predecessors. After age 25, they were marrying faster. By age 32 they had overtaken the 50s cohort, and at age 44 they are substantially more likely to have married.  Women born in the 60s turned 25 between 1985 and 1995. Remembering that the Newsweek article was published in 1986, it is tempting to imagine that the article triggered a marriage boom. That explanation would be too pat, but it is possible that the article and the reaction to it were indicative of a mood that affected the behavior of this cohort.
  • The pattern of the 70s cohort is similar. They are less likely than their predecessors to be married before age 25, but at age 35 they have caught up with both of the previous cohorts.
  • Women born in the 80s are even less likely to marry before age 25. What happens after that is not clear; for more data, we have to wait for the next cycle of the NSFG, coming in late fall 2014.


Expected remaining lifetime

Given a survival curve, we can compute the expected remaining lifetime as a function of current age. For example, given the survival function of pregnancy length, we can compute the expected time until delivery.
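Here is one way to do that computation, as a sketch rather than the book's code.  It assumes the survival function is stored as a dict mapping each time t to the probability of surviving past t, like the output of the earlier sketch:

def remaining_lifetime(survival, current):
    """Expected remaining time, given survival up to `current`.

    survival: dict mapping time t to P(T > t)
    current: the time already survived
    """
    # turn the survival function into a PMF of lifetimes
    pmf, prev = {}, 1.0
    for t in sorted(survival):
        pmf[t] = prev - survival[t]
        prev = survival[t]

    # condition on surviving to `current`, then average the remaining time
    total = sum(p for t, p in pmf.items() if t >= current)
    return sum((t - current) * p for t, p in pmf.items() if t >= current) / total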

For example, the following figure (left) shows the expected remaining pregnancy length as a function of the current duration. During Week 0, the expected remaining duration is about 34 weeks. That’s less than full term (39 weeks) because terminations of pregnancy in the first trimester bring the average down.

The curve drops slowly during the first trimester: after 13 weeks, the expected remaining lifetime has dropped by only 9 weeks, to 25. After that the curve drops faster, by about a week per week.



Between Week 37 and 42, the curve levels off between 1 and 2 weeks. At any time during this period, the expected remaining lifetime is the same; with each week that passes, the destination gets no closer. Processes with this property are called “memoryless,” because the past has no effect on the predictions. This behavior is the mathematical basis of the infuriating mantra of obstetrics nurses: “any day now.”

The figure also shows the median remaining time until first marriage, as a function of age. For an 11-year-old girl, the median time until first marriage is about 14 years. The curve decreases until age 22, when the median remaining time is about 7 years. After that it increases again: by age 30 it is back where it started, at 14 years.

For the marriage data I used median rather than mean because the dataset includes women who are unmarried at age 44.  The survival curve is cut off at about 20%, so we can't compute a mean.  But the median is well defined as long as more than 50% of the remaining values are known.

Based on this data, young women have decreasing remaining "lifetimes".  Mechanical components with this property are called NBUE for "new better than used in expectation," meaning that a new part is expected to last longer.

Women older than 22 have increasing remaining time until first marriage.  Components with this property are UBNE for "used better than new in expectation."  That is, the older the part, the longer it is expected to last.  Newborns and cancer patients are also UBNE; their life expectancy increases the longer they live.  Also, people learning to ride a motorcycle.

Details of the calculations in this article are in Think Stats, Chapter 13.  The code is in survival.py.

Thursday, July 10, 2014

Bayesian solution to the Lincoln index problem

Last year my occasional correspondent John D. Cook wrote an excellent blog post about the Lincoln index, which is a way to estimate the number of errors in a document (or program) by comparing results from two independent testers.  Here's his presentation of the problem:
Suppose you have a tester who finds 20 bugs in your program. You want to estimate how many bugs are really in the program. You know there are at least 20 bugs, and if you have supreme confidence in your tester, you may suppose there are around 20 bugs. But maybe your tester isn't very good. Maybe there are hundreds of bugs. How can you have any idea how many bugs there are? There’s no way to know with one tester. But if you have two testers, you can get a good idea, even if you don’t know how skilled the testers are.
Then he presents the Lincoln index, an estimator "described by Frederick Charles Lincoln in 1930," where Wikipedia's use of "described" is a hint that the index is another example of Stigler's law of eponymy.
Suppose two testers independently search for bugs. Let k1 be the number of errors the first tester finds and k2 the number of errors the second tester finds. Let c be the number of errors both testers find. The Lincoln Index estimates the total number of errors as k1 k2 / c [I changed his notation to be consistent with mine].
So if the first tester finds 20 bugs, the second finds 15, and they find 3 in common, we estimate that there are about 100 bugs.

Of course, whenever I see something like this, the idea that pops into my head is that there must be a (better) Bayesian solution!  And there is.  You can read and download my solution here.

I represent the data using the tuple (k1, k2, c), where k1 is the number of bugs found by the first tester, k2 is the number of bugs found by the second, and c is the number they find in common.

The hypotheses are a set of tuples (n, p1, p2), where n is the actual number of errors, p1 is the probability that the first tester finds any given bug, and p2 is the probability for the second tester.

Now all I need is a likelihood function:

class Lincoln(thinkbayes.Suite, thinkbayes.Joint):

    def Likelihood(self, data, hypo):
        """Computes the likelihood of the data under the hypothesis.

        hypo: n, p1, p2
        data: k1, k2, c
        """
        n, p1, p2 = hypo
        k1, k2, c = data

        part1 = choose(n, k1) * binom(k1, n, p1)
        part2 = choose(k1, c) * choose(n-k1, k2-c) * binom(k2, n, p2)
        return part1 * part2

Where choose evaluates the binomial coefficient (n choose k), and binom evaluates the rest of the binomial pmf:

def binom(k, n, p):
    return p**k * (1-p)**(n-k)
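The choose function isn't shown here; if you don't want to pull it from a library, a plain implementation of the binomial coefficient works:

from math import factorial

def choose(n, k):
    """Binomial coefficient: the number of ways to choose k of n items."""
    if k < 0 or k > n:
        return 0
    return factorial(n) // (factorial(k) * factorial(n - k))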

And that's pretty much it.  Here's the code that builds and updates the suite of hypotheses:

    data = 20, 15, 3
    probs = numpy.linspace(0, 1, 101)
    hypos = []
    for n in range(32, 350):
        for p1 in probs:
            for p2 in probs:
                hypos.append((n, p1, p2))

    suite = Lincoln(hypos)
    suite.Update(data)

The suite contains the joint posterior distribution for (n, p1, p2), but p1 and p2 are nuisance parameters; we only care about the marginal distribution of n.  Lincoln inherits Marginal from Joint, so we can get the marginal distribution like this:

    n_marginal = suite.Marginal(0)

Where 0 is the index of n in the tuple.  And here's what the distribution looks like:

The lower bound is 32, which is the total number of bugs found by the two testers.  I set the upper bound at 350, which chops off a little bit of the tail.

The maximum likelihood estimate in this distribution is 72; the mean is 106.  So those are consistent with the Lincoln index, which is 100.  But as usual, it is more useful to have the whole posterior distribution, not just a point estimate.  For example, this posterior distribution can be used as part of a risk-benefit analysis to guide decisions about how much effort to allocate to finding additional bugs.

This solution generalizes to more than 2 testers, but figuring out the likelihood function, and evaluating it quickly, becomes increasingly difficult.  Also, as the number of testers increases, so does the number of dimensions in the space of hypotheses.  With two testers there are about 350 * 100 * 100 hypotheses.  On my not-very-fast desktop computer, that takes about a minute.  I could speed it up by evaluating the likelihood function more efficiently, but each new tester would multiply the run time by 100.

The library I used in my solution is thinkbayes.py, which is described in my book, Think Bayes.  Electronic copies are available under a Creative Commons licence from Green Tea Press.  Hard copies are published by O'Reilly Media and available from Amazon.com and other booksellers.

I believe that the approach to Bayesian statistics I present in Think Bayes is a good way to solve problems like this.  As evidence, I offer this example: from the time I read John's article to the time I pressed "Publish" on this post, it took me about 3 hours.



Wednesday, June 4, 2014

Yet another power-law tail, explained

At the next Boston Python user group meeting, participants will present their solutions to a series of puzzles, posted here.  One of the puzzles lends itself to a solution that uses Python iterators, which is something I was planning to get more familiar with.  So I took on this puzzle, by John Bohannon (who says he was inspired by this programming problem).  Here's John's version:
Consider these base-10 digits: 123456789. If you insert spaces between them, you get various sequences of numbers:
1 2 3 4 5 6 7 8 9
12 3 4 5 67 8 9
1 2 34 5 6 7 89
12 34 56 78 9
1 23456 78 9
12345 6789
etc.
 
1) Write a program that generates all possible combinations of those digits.
How many are there?
Now let's insert a maximum of 8 addition or subtraction operators between the numbers, like this:
1+2+3+4+5+6+7-8+9
12-3+4+5-67-8+9
1+2+34+5-6-7-89
12-34+56+78+9
1+23456-78-9
12345+6789
etc.
Notice that those arithmetic expressions equate to different values:
1+2+3+4+5+6+7-8+9 = 29
12-3+4+5-67-8+9 = -48
1+2+34+5-6-7-89 = -60
12-34+56+78+9 = 121
1+23456-78-9 = 23370
12345+6789 = 19134
etc.
 
2) Write a program that generates all possible expressions in this way.
How many sum to 100?
 
3) Write a program that finds all such expressions for any sum.
Which sum is the most popular, i.e. has the most expressions?
 
4) Bonus: We can haz pretty data viz?
Like how about a histogram of the number of expressions with sums from -23456788 to 123456789. (A log scale might help. Maybe binning, too.)
I put my solution in an iPython notebook, which you can view here.  Some conclusions:

  1. The distribution of values you can generate by arranging these digits and operators has a power-law tail.
  2. The distribution is a mixture of (approximately) uniform distributions, where each element of the mixture is characterized by the length of the largest term in the expression.  That is, if the biggest term has 3 digits, the values tend to be in the 100s.  If the biggest number has 4 digits, the value is usually in the 1000s, etc.
  3. The reason for the power law is that each element of the mixture is an order of magnitude bigger, but about 0.6 orders of magnitude more rare.  So the parameter of the power tail is about 0.6.
I also ran the analysis with hexadecimal numbers and confirmed that the pattern persists.

For future exploration: what do you think happens if you add multiplication to the operator mix?  Download the notebook and try it!
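Or, if you would rather start from scratch, here is a minimal enumerator to build on.  This is my sketch, not the notebook's code; it uses itertools and eval, which is safe here because we construct every string ourselves:

from itertools import product

def expressions(digits='123456789', seps=('', '+', '-')):
    """Yield every expression formed by inserting nothing, '+', or '-'
    between consecutive digits."""
    for combo in product(seps, repeat=len(digits) - 1):
        expr = digits[0]
        for sep, digit in zip(combo, digits[1:]):
            expr += sep + digit
        yield expr

exprs = list(expressions())
print(len(exprs))                                # 3**8 expressions
print(sum(1 for e in exprs if eval(e) == 100))   # how many sum to 100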

Thursday, May 8, 2014

Implementing PMFs in Python

Last year I gave a keynote talk at PyCon Taiwan called "Python Epistemology," and I wrote this blog article about it.  The video is here, but unfortunately the sound quality is poor.  In the talk, I demonstrate the use of a Counter, one of the data structures in Python's collections module;  specifically, I use a Counter to implement a probability mass function (PMF) and a suite of Bayesian hypotheses.
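The core of that demonstration fits in a few lines.  Here is a rough sketch of a Counter-based PMF, along the lines of what the talk shows; the real code is in the repository linked below, so treat this as an approximation:

from collections import Counter

class Pmf(Counter):
    """A Counter whose values are probabilities."""

    def normalize(self):
        """Scale the frequencies so they add up to 1."""
        total = sum(self.values())
        for key in self:
            self[key] /= total

d6 = Pmf([1, 2, 3, 4, 5, 6])   # the outcomes of a six-sided die
d6.normalize()
print(d6[3])                   # 1/6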

This year I was at PyCon 2014 in Montreal and three things happened that led to this post:
  1. I talked with Fernando Peréz, who gave an excellent keynote talk about doing open science with iPython.  He convinced me to give iPython notebooks another chance,
  2. I talked with Travis Oliphant, co-founder of Continuum Analytics, who convinced me to try Wakari for hosting iPython notebooks, and
  3. I shared a taxi to the airport with Raymond Hettinger, who received an award at PyCon this year for his contributions to core Python modules, including collections.
Raymond told me that he heard about my talk from David Beazley, who was in the audience in Taiwan, and asked if I would send him my code to use as an example of what you can do with Counters.  I agreed, of course, but it has taken several weeks to get it done.

Since then, I created this Github repository, which contains the code examples from my talk.  I also put the code into an iPython notebook, which I posted on nbviewer.  I found nbviewer incredibly easy to use; I pasted in the URL of my Github repo, and it generated this static view of the notebook.

Wakari is similar, but it generates a dynamic view of the notebook where anyone can execute and modify the code.  I set up an account, uploaded the notebook, and shared this dynamic notebook, all in less than 10 minutes.  I was very impressed.

If you are interested, please read either the static or dynamic version, then come back here if you have comments.

I want to thank everyone at PyCon for an excellent conference, and especially Fernando Peréz, Travis Oliphant, Raymond Hettinger, and David Beazley for taking the time to talk with me about this example.

Finally, here's an excerpt from the notebook:




Monday, April 28, 2014

Bayes's theorem and logistic regression

This week's post has more math than most, so I wrote it in LaTeX and translated it to HTML using HeVeA. Some of the formulas are not as pretty as they could be. If you prefer, you can read this article in PDF.
Abstract: My two favorite topics in probability and statistics are Bayes’s theorem and logistic regression. Because there are similarities between them, I have always assumed that there is a connection. In this note, I demonstrate the connection mathematically, and (I hope) shed light on the motivation for logistic regression and the interpretation of the results.

1  Bayes’s theorem

I’ll start by reviewing Bayes’s theorem, using an example that came up when I was in grad school. I signed up for a class on Theory of Computation. On the first day of class, I was the first to arrive. A few minutes later, another student arrived. Because I was expecting most students in an advanced computer science class to be male, I was mildly surprised that the other student was female. Another female student arrived a few minutes later, which was sufficiently surprising that I started to think I was in the wrong room. When another female student arrived, I was confident I was in the wrong place (and it turned out I was).
As each student arrived, I used the observed data to update my belief that I was in the right place. We can use Bayes’s theorem to quantify the calculation I was doing intuitively.
I’ll use H to represent the hypothesis that I was in the right room, and F to represent the observation that the first other student was female. Bayes’s theorem provides an algorithm for updating the probability of H:
P(H|F) = P(H) P(F|H) / P(F)
Where
  • P(H) is the prior probability of H before the other student arrived.
  • P(H|F) is the posterior probability of H, updated based on the observation F.
  • P(F|H) is the likelihood of the data, F, assuming that the hypothesis is true.
  • P(F) is the likelihood of the data, independent of H.
Before I saw the other students, I was confident I was in the right room, so I might assign P(H) something like 90%.
When I was in grad school most advanced computer science classes were 90% male, so if I was in the right room, the likelihood of the first female student was only 10%. And the likelihood of three female students was only 0.1%.
If we don’t assume I was in the right room, then the likelihood of the first female student was more like 50%, so the likelihood of all three was 12.5%.
Plugging those numbers into Bayes’s theorem yields P(H|F) = 0.64 after one female student, P(H|FF) = 0.26 after the second, and P(H|FFF) = 0.07 after the third.
[UPDATE: An earlier version of this article had incorrect values in the previous sentence. Thanks to David Burger for catching the error.]
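If you want to check those numbers, the update is easy to replicate.  Here is a small script of mine (not from the original note) that applies Bayes's theorem once per observed student:

def update(prior, like_given_H, like_given_notH):
    """Apply Bayes's theorem for one observation and return the posterior."""
    numer = prior * like_given_H
    denom = numer + (1 - prior) * like_given_notH
    return numer / denom

p = 0.9                       # prior: probably in the right room
for _ in range(3):            # three female students arrive
    p = update(p, 0.1, 0.5)   # likelihoods if right room vs. wrong room
    print(round(p, 2))        # 0.64, 0.26, 0.07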

2  Logistic regression

Logistic regression is based on the following functional form:
logit(p) = β0 + β1 x1 + ... + βn xn 
where the dependent variable, p, is a probability, the xs are explanatory variables, and the βs are coefficients we want to estimate. The logit function is the log-odds, or
logit(p) = ln(p / (1 − p))
When you present logistic regression like this, it raises three questions:
  • Why is logit(p) the right choice for the dependent variable?
  • Why should we expect the relationship between logit(p) and the explanatory variables to be linear?
  • How should we interpret the estimated parameters?
The answer to all of these questions turns out to be Bayes’s theorem. To demonstrate that, I’ll use a simple example where there is only one explanatory variable. But the derivation generalizes to multiple regression.
On notation: I’ll use P(H) for the probability that some hypothesis, H, is true. O(H) is the odds of the same hypothesis, defined as
O(H) = P(H) / (1 − P(H))
I’ll use LO(H) to represent the log-odds of H:
LO(H) = ln O(H)
I’ll also use LR for a likelihood ratio, and OR for an odds ratio. Finally, I’ll use LLR for a log-likelihood ratio, and LOR for a log-odds ratio.

3  Making the connection

To demonstrate the connection between Bayes’s theorem and logistic regression, I’ll start with the odds form of Bayes’s theorem. Continuing the previous example, I could write
   O(H|F) = O(H) LR(F|H)     (1)
where
  • O(H) is the prior odds that I was in the right room,
  • O(H|F) is the posterior odds after seeing one female student,
  • LR(F|H) is the likelihood ratio of the data, given the hypothesis.
The likelihood ratio of the data is:
LR(F|H) = P(F|H) / P(F|¬ H)
where ¬ H means H is false.
Noticing that logistic regression is expressed in terms of log-odds, my next move is to write the log-odds form of Bayes’s theorem by taking the log of Eqn 1:
   LO(H|F) = LO(H) + LLR(F|H)     (2)
If the first student to arrive had been male, we would write
    LO(H|M) = LO(H) + LLR(M|H)     (3)
Or more generally if we use X as a variable to represent the sex of the observed student, we would write
   LO(H|X) = LO(H) + LLR(X|H)     (4)
I’ll assign X=0 if the observed student is female and X=1 if male. Then I can write:
   LLR(X|H) = LLR(F|H) if X = 0,
              LLR(M|H) if X = 1     (5)
Or we can collapse these two expressions into one by using X as a multiplier:
   LLR(X|H) = LLR(F|H) + X [LLR(M|H) − LLR(F|H)]     (6)

4  Odds ratios

The next move is to recognize that the part of Eqn 6 in brackets is the log-odds ratio of H. To see that, we need to look more closely at odds ratios.
Odds ratios are often used in medicine to describe the association between a disease and a risk factor. In the example scenario, we can use an odds ratio to express the odds of the hypothesis H if we observe a male student, relative to the odds if we observe a female student:
ORX(H) = O(H|M) / O(H|F)
I’m using the notation ORX to represent the odds ratio associated with the variable X.
Applying Bayes’s theorem to the top and bottom of the previous expression yields
ORX(H) = [O(H) LR(M|H)] / [O(H) LR(F|H)] = LR(M|H) / LR(F|H)
Taking the log of both sides yields
   LORX(H) = LLR(M|H) − LLR(F|H)     (7)
This result should look familiar, since it appears in Eqn 6.

5  Conclusion

Now we have all the pieces we need; we just have to assemble them. Combining Eqns 6 and 7 yields
   LLR(X|H) = LLR(F|H) + X LORX(H)     (8)
Combining Eqns 4 and 8 yields
   LO(H|X) = LO(H) + LLR(F|H) + X LORX(H)     (9)
Finally, combining Eqns 2 and 9 yields
LO(H|X) = LO(H|F) + X LORX(H)
We can think of this equation as the log-odds form of Bayes’s theorem, with the update term expressed as a log-odds ratio. Let’s compare that to the functional form of logistic regression:
logit(p) = β0 + X β1 
The correspondence between these equations suggests the following interpretation:
  • The predicted value, logit(p), is the posterior log odds of the hypothesis, given the observed data.
  • The intercept, β0, is the log-odds of the hypothesis if X=0.
  • The coefficient of X, β1, is a log-odds ratio that represents the odds of H when X=1, relative to when X=0.
This relationship between logistic regression and Bayes’s theorem tells us how to interpret the estimated coefficients. It also answers the question I posed at the beginning of this note: the functional form of logistic regression makes sense because it corresponds to the way Bayes’s theorem uses data to update probabilities.
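In practice, that interpretation is what lets you read fitted coefficients directly.  For example (the coefficient values below are made up for illustration), exponentiating recovers odds and odds ratios:

import numpy as np

# hypothetical fitted coefficients for logit(p) = beta0 + beta1 * X
beta0, beta1 = -0.405, 1.386

odds_x0 = np.exp(beta0)           # odds of H when X = 0, about 0.67
odds_ratio = np.exp(beta1)        # odds ratio for X = 1 relative to X = 0, about 4
odds_x1 = odds_x0 * odds_ratio    # odds of H when X = 1, about 2.7
p_x1 = odds_x1 / (1 + odds_x1)    # converting odds back to probability, about 0.73

print(odds_x0, odds_ratio, p_x1)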


Monday, April 21, 2014

Inferring participation rates in service projects

About a week ago I taught my tutorial, Bayesian Statistics Made Simple, at PyCon 2014 in Montreal.  My slides, the video, and all the code, are on this site.  The turnout was great.  We had a room full of enthusiastic Pythonistas who are now, if I was successful, enthusiastic Bayesians.

Toward the end, one of the participants asked a great question based on his work at Do Something.org (if I understand the background correctly).  Here's my paraphrase:
"A group of people sign up to do a community service project.  Some fraction of them actually participate, and then some fraction of the participants report back to confirm.  In other words, some of the people who sign up don't participate, and some of the people that participate don't report back. 
Given the number of people who sign up and the number of people who report back, can we estimate the number of people who actually participated?"
At the tutorial I wasn't able to generate an answer on the fly, except to say

1) It sounds like a two-dimensional problem, where we want to estimate the fraction of people who participate, which I'll call q, and the fraction of participants who report back, r.

2) If we only know the number of people who sign up and the number of participants who report back, we won't be able to estimate q and r separately.  We'll be able to narrow the distribution of q, but not by much.

3) But if we can do additional sampling, we should be able to make a good estimate.  For example, we could choose a random subset of people who sign up and ask them whether they participated (and check whether they reported back).

With these two kinds of data, we can solve the problem using the same tools we used in the tutorial for the Dice and the Euro problems.  I wrote a solution and checked it into the repository I used for the tutorial.  You can read and download it here.

Here's the code that creates the suite of hypotheses:

    probs = numpy.linspace(0, 1, 101)

    hypos = []
    for q in probs:
        for r in probs:
            hypos.append((q, r))

    suite = Volunteer(hypos)

probs is a sequence of 101 values equally-spaced between 0 and 1.  hypos is a list of tuples where each tuple represents a hypothetical pair, (q, r).

Volunteer is a new class that extends Suite and provides a Likelihood function:

class Volunteer(thinkbayes.Suite):

    def Likelihood(self, data, hypo):
        """Computes the likelihood of the data.

        hypo: pair of (q, r)
        data: one of two possible formats
        """
        if len(data) == 2:
            return self.Likelihood1(data, hypo)
        elif len(data) == 3:
            return self.Likelihood2(data, hypo)
        else:
            raise ValueError()

For this problem, we do two kinds of update, depending on the data.  The first update takes a pair of values, (signed_up, reported), which are the number of people who signed up and the number that reported back:

    def Likelihood1(self, data, hypo):
        """Computes the likelihood of the data.

        hypo: pair of (q, r)
        data: tuple (signed up, reported)
        """
        q, r = hypo
        p = q * r
        signed_up, reported = data
        yes = reported
        no = signed_up - reported

        like = p**yes * (1-p)**no
        return like

Given the hypothetical values of q and r, we can compute p, which is the probability that someone who signs up will participate and report back.  Then we compute the likelihood using the binomial PMF (well, almost: I didn't bother to compute the binomial coefficient because it drops out, anyway, when we renormalize).

Since I don't have any real data, I'll make some up for this example.  Suppose 140 people sign up and only 50 report back:

    data = 140, 50
    suite.Update(data)
    PlotMarginals(suite, root='volunteer1')

PlotMarginals displays the marginal distributions of q and r.  We can extract these distributions like this:

def MarginalDistribution(suite, index):
    pmf = thinkbayes.Pmf()
    for t, prob in suite.Items():
        pmf.Incr(t[index], prob)
    return pmf

MarginalDistribution loops through the Suite and makes a new Pmf that represents the posterior distribution of either q (when index=0) or r (when index=1).  Here's what they look like:
The two distributions are identical, which makes sense because this dataset doesn't give us any information about q and r separately, only about their product.

To estimate them separately, we need to sample people who sign up to see if they participated.  Here's what the second Likelihood function looks like:

    def Likelihood2(self, data, hypo):
        """Computes the likelihood of the data.

        hypo: pair of (q, r)
        data: tuple (signed up, participated, reported)
        """
        q, r = hypo

        signed_up, participated, reported = data

        yes = participated
        no = signed_up - participated
        like1 = q**yes * (1-q)**no

        yes = reported
        no = participated - reported
        like2 = r**yes * (1-r)**no

        return like1 * like2

Again, we are given hypothetical pairs of q and r, but now data is a tuple of (signed_up, participated, reported).  We use q to compute the likelihood of signed_up and participated, and r to compute the likelihood of participated and reported.  The product of these factors, like1 * like2, is the return value.

Again, I don't have real data, but suppose we survey 5 people who signed up, and learn that 3 participated and 1 reported back.  We would do this update:

    data = 5, 3, 1
    suite.Update(data)
    PlotMarginals(suite, root='volunteer2')

And the result would look like this:
Now we can discriminate between q and r.  Based on my imaginary data, we would estimate that more than 60% of the people who signed up participated, but only 50% of them reported back.  Or (more usefully) we could use the posterior distribution of q to form a most likely estimate and credible interval for the number of participants.
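Here is one way to do that last step.  This is a sketch I am adding here, not code from the notebook; it reuses the 140 sign-ups from the made-up data above and the MarginalDistribution function defined earlier:

    # scale the posterior of q by the number of sign-ups to get a
    # posterior distribution for the number of participants
    signed_up = 140
    q_marginal = MarginalDistribution(suite, 0)

    participants = thinkbayes.Pmf()
    for q, prob in q_marginal.Items():
        participants.Incr(int(round(q * signed_up)), prob)

    print('mean', participants.Mean())

    # central 90% credible interval, accumulated by hand
    total, low, high = 0.0, None, None
    for n, prob in sorted(participants.Items()):
        total += prob
        if low is None and total >= 0.05:
            low = n
        if high is None and total >= 0.95:
            high = n
    print('90% CI', (low, high))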

This example demonstrates some features we didn't see during the tutorial:

1) The framework I presented extends naturally to handle multi-dimensional estimation.

2) It also handles the case where you want to update the same distribution with different kinds of data.

It also demonstrates two features of Bayesian statistics: the ability to combine information from multiple sources and to extract the maximum amount of information from a small sample.

Many thanks to everyone who participated in the tutorial this year, and especially to the participant who asked about this problem.  Let me know if there are questions and I will update this post.

And if you want to learn more about Bayesian statistics, allow me to recommend Think Bayes.