Monday, April 21, 2014

Inferring participation rates in service projects

About a week ago I taught my tutorial, Bayesian Statistics Made Simple, at PyCon 2014 in Montreal.  My slides, the video, and all the code are on this site.  The turnout was great.  We had a room full of enthusiastic Pythonistas who are now, if I was successful, enthusiastic Bayesians.

Toward the end, one of the participants asked a great question based on his work (if I understand the background correctly) at Do.  Here's my paraphrase:
"A group of people sign up to do a community service project.  Some fraction of them actually participate, and then some fraction of the participants report back to confirm.  In other words, some of the people who sign up don't participate, and some of the people that participate don't report back. 
Given the number of people who sign up and the number of people who report back, can we estimate the number of people who actually participated?"
At the tutorial I wasn't able to generate an answer on the fly, except to say

1) It sounds like a two-dimensional problem, where we want to estimate the fraction of people who participate, which I'll call q, and the fraction of participants who report back, r.

2) If we only know the number of people who sign up and the number of participants who report back, we won't be able to estimate q and r separately.  We'll be able to narrow the distribution of q, but not by much.

3) But if we can do additional sampling, we should be able to make a good estimate.  For example, we could choose a random subset of people who sign up and ask them whether they participated (and check whether they reported back).

With these two kinds of data, we can solve the problem using the same tools we used in the tutorial for the Dice and the Euro problems.  I wrote a solution and checked it into the repository I used for the tutorial.  You can read and download it here.

Here's the code that creates the suite of hypotheses:

    probs = numpy.linspace(0, 1, 101)

    hypos = []
    for q in probs:
        for r in probs:
            hypos.append((q, r))

    suite = Volunteer(hypos)

probs is a sequence of 101 values equally-spaced between 0 and 1.  hypos is a list of tuples where each tuple represents a hypothetical pair, (q, r).

Volunteer is a new class that extends Suite and provides a Likelihood function

class Volunteer(thinkbayes.Suite):

    def Likelihood(self, data, hypo):
        """Computes the likelihood of the data.

        hypo: pair of (q, r)
        data: one of two possible formats
        """
        if len(data) == 2:
            return self.Likelihood1(data, hypo)
        elif len(data) == 3:
            return self.Likelihood2(data, hypo)
        else:
            raise ValueError()

For this problem, we do two kinds of update, depending on the data.  The first update takes a pair of values, (signed_up, reported), which are the number of people who signed up and the number that reported back:

    def Likelihood1(self, data, hypo):
        """Computes the likelihood of the data.

        hypo: pair of (q, r)
        data: tuple (signed up, reported)
        """
        q, r = hypo
        p = q * r
        signed_up, reported = data
        yes = reported
        no = signed_up - reported

        like = p**yes * (1-p)**no
        return like

Given the hypothetical values of q and r, we can compute p, which is the probability that someone who signs up will participate and report back.  Then we compute the likelihood using the binomial PMF (well, almost: I didn't bother to compute the binomial coefficient, because it drops out anyway when we renormalize).
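As a sanity check (my own sketch, not from the repository), here is a standalone example showing that omitting the binomial coefficient doesn't change the normalized posterior, since the coefficient is the same constant for every hypothesis:

```python
from math import comb

import numpy as np

# Grid of hypothetical (q, r) pairs, as in the suite above.
probs = np.linspace(0, 1, 101)
hypos = [(q, r) for q in probs for r in probs]

signed_up, reported = 140, 50

def posterior(with_coeff):
    """Normalized posterior over hypos, with or without the coefficient."""
    likes = []
    for q, r in hypos:
        p = q * r
        like = p**reported * (1 - p)**(signed_up - reported)
        if with_coeff:
            like *= comb(signed_up, reported)
        likes.append(like)
    likes = np.array(likes)
    return likes / likes.sum()

# The constant factor cancels when we renormalize.
print(np.allclose(posterior(False), posterior(True)))  # True
```

Because every hypothesis gets multiplied by the same constant, renormalizing erases it, so we might as well skip it.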

Since I don't have any real data, I'll make some up for this example.  Suppose 140 people sign up and only 50 report back:

    data = 140, 50
    suite.Update(data)
    PlotMarginals(suite, root='volunteer1')

PlotMarginals displays the marginal distributions of q and r.  We can extract these distributions like this:

def MarginalDistribution(suite, index):
    pmf = thinkbayes.Pmf()
    for t, prob in suite.Items():
        pmf.Incr(t[index], prob)
    return pmf

MarginalDistribution loops through the Suite and makes a new Pmf that represents the posterior distribution of either q (when index=0) or r (when index=1).  Here's what they look like:
The two distributions are identical, which makes sense because this dataset doesn't give us any information about q and r separately, only about their product.
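To see this numerically without the thinkbayes dependency, here's a minimal sketch (mine, not the repository code) that builds the joint posterior for the first update and compares the two marginals:

```python
import numpy as np

# Same 101 x 101 grid of (q, r) hypotheses, uniform prior.
probs = np.linspace(0, 1, 101)
Q, R = np.meshgrid(probs, probs, indexing='ij')

# First update: 140 signed up, 50 reported back.
signed_up, reported = 140, 50
P = Q * R  # probability of participating AND reporting back
joint = P**reported * (1 - P)**(signed_up - reported)
joint /= joint.sum()

# Marginals: sum the joint posterior over the other dimension.
marg_q = joint.sum(axis=1)   # posterior of q
marg_r = joint.sum(axis=0)   # posterior of r

# The likelihood depends only on the product q*r, so the marginals match.
print(np.allclose(marg_q, marg_r))  # True
```

Since the likelihood is a function of q*r alone and the grid is symmetric in q and r, swapping the two labels leaves the joint distribution unchanged, which is why the marginals coincide.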

To estimate them separately, we need to sample people who sign up to see if they participated.  Here's what the second Likelihood function looks like:

    def Likelihood2(self, data, hypo):
        """Computes the likelihood of the data.

        hypo: pair of (q, r)
        data: tuple (signed up, participated, reported)
        """
        q, r = hypo

        signed_up, participated, reported = data

        yes = participated
        no = signed_up - participated
        like1 = q**yes * (1-q)**no

        yes = reported
        no = participated - reported
        like2 = r**yes * (1-r)**no

        return like1 * like2

Again, we are given hypothetical pairs of q and r, but now data is a tuple of (signed_up, participated, reported).  We use q to compute the likelihood of signed_up and participated, and r to compute the likelihood of participated and reported.  The product of these factors, like1 * like2, is the return value.

Again, I don't have real data, but suppose we survey 5 people who signed up, and learn that 3 participated and 1 reported back.  We would do this update:

    data = 5, 3, 1
    suite.Update(data)
    PlotMarginals(suite, root='volunteer2')

And the result would look like this:
Now we can discriminate between q and r.  Based on my imaginary data, we would estimate that more than 60% of the people who signed up participated, but only 50% of them reported back.  Or (more usefully) we could use the posterior distribution of q to form a most likely estimate and credible interval for the number of participants.
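Putting it all together, here's a hedged sketch (plain numpy, not the repository code) that applies both updates and summarizes the posterior of q with a most likely value and a 90% credible interval:

```python
import numpy as np

probs = np.linspace(0, 1, 101)
Q, R = np.meshgrid(probs, probs, indexing='ij')

# First update: 140 signed up, 50 reported back.
signed_up, reported = 140, 50
P = Q * R
joint = P**reported * (1 - P)**(signed_up - reported)

# Second update: of 5 sign-ups surveyed, 3 participated, 1 reported back.
surveyed, participated, rep = 5, 3, 1
joint *= Q**participated * (1 - Q)**(surveyed - participated)
joint *= R**rep * (1 - R)**(participated - rep)

joint /= joint.sum()
marg_q = joint.sum(axis=1)            # posterior distribution of q

map_q = probs[np.argmax(marg_q)]      # most likely value of q
cdf = np.cumsum(marg_q)
ci = (probs[np.searchsorted(cdf, 0.05)],   # 90% credible interval
      probs[np.searchsorted(cdf, 0.95)])
print(map_q, ci)
```

Multiplying the posterior of q by the number of sign-ups then yields a corresponding estimate (and interval) for the number of participants.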

This example demonstrates some features we didn't see during the tutorial:

1) The framework I presented extends naturally to handle multi-dimensional estimation.

2) It also handles the case where you want to update the same distribution with different kinds of data.

It also demonstrates two features of Bayesian statistics: the ability to combine information from multiple sources and to extract the maximum amount of information from a small sample.

Many thanks to everyone who participated in the tutorial this year, and especially to the participant who asked about this problem.  Let me know if there are questions and I will update this post.

And if you want to learn more about Bayesian statistics, allow me to recommend Think Bayes.
