Friday, May 6, 2016

Probability is hard, part two

If you read yesterday's post, you know that my colleague Sanjoy Mahajan and I have been working on a series of problems related to conditional probability and Bayesian statistics.  In the previous article, I presented the Red Dice problem, which is relatively simple.  I posted it here because it presents four different versions of the problem, which we'll need for the next step, and it demonstrates the computational tools I use in my solution.

Now I'll present the problem we are really working on: interpreting medical tests.  Suppose we test a patient to see if they have a disease, and the test comes back positive. What is the probability that the patient is actually sick (that is, has the disease)?

To answer this question, we need to know:

• The prevalence of the disease in the population the patient is from. Let's assume the patient is identified as a member of a population where the known prevalence is p.
• The sensitivity of the test, s, which is the probability of a positive test if the patient is sick.
• The false positive rate of the test, t, which is the probability of a positive test if the patient is not sick.

Given these parameters, it is straightforward to compute the probability that the patient is sick, given a positive test.  This problem is a staple in probability classes.
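
As a minimal sketch of that textbook computation (not the code from the notebook), Bayes's theorem can be written as a small Python function. The function name and the parameter values below are illustrative choices for this sketch:

```python
from fractions import Fraction


def prob_sick_given_positive(p, s, t):
    """Return P(sick | positive test) by Bayes's theorem.

    p: prevalence, s: sensitivity, t: false positive rate.
    """
    true_pos = p * s          # sick and tests positive
    false_pos = (1 - p) * t   # not sick and tests positive
    return true_pos / (true_pos + false_pos)


# Illustrative values only: p = 0.1, s = 0.9, t = 0.2.
example = prob_sick_given_positive(Fraction(1, 10), Fraction(9, 10), Fraction(1, 5))
print(example)  # 1/3
```

Using `Fraction` keeps the arithmetic exact, which makes the result easy to compare with a hand calculation.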

But here's the wrinkle.  What if you are uncertain about one of the parameters: p, s, or t?  For example, suppose you have reason to believe that t is either 0.2 or 0.4 with equal probability.  And suppose p=0.1 and s=0.9.

The questions we would like to answer are:

1. What is the probability that a patient who tests positive is actually sick?
2. Given two patients who have tested positive, what is the probability that both are sick?

As we did with the Red Dice problem, we'll consider four scenarios:
Scenario A: The patients are drawn at random from the relevant population, and the reason we are uncertain about t is that either (1) there are two versions of the test, with different false positive rates, and we don't know which test was used, or (2) there are two groups of people, the false positive rate is different for different groups, and we don't know which group the patient is in.
Scenario B: As in Scenario A, the patients are drawn at random from the relevant population, but the reason we are uncertain about t is that previous studies of the test have been contradictory. That is, there is only one version of the test, and we have reason to believe that t is the same for all groups, but we are not sure what the correct value of t is.
Scenario C: As in Scenario A, there are two versions of the test or two groups of people. But now the patients are being filtered so we only see the patients who tested positive and we don't know how many patients tested negative. For example, suppose you are a specialist and patients are only referred to you after they test positive.
Scenario D: As in Scenario B, we have reason to think that t is the same for all patients, and as in Scenario C, we only see patients who test positive and don't know how many tested negative.

I have posted a Jupyter notebook with solutions for the basic scenario (where we are certain about t) and for Scenario A.  You might want to write your own solution before you read mine.  And then you might want to work on Scenarios B, C, and D.

You can read a static version of the notebook here.

OR

You can run the notebook on Binder.

If you click the Binder link, you should get a home page showing the notebooks and Python modules in the repository for this blog.  Click on test_scenario_a.ipynb to load the notebook for this article.

I'll post my solution for Scenarios B, C, and D next week.

1. It gives me some hope that even you guys are having some difficulty moving the complexity up a level. I can understand the base case for Bayes fine, but I keep getting lost as I try to move on. So, glad to hear it is not just me! I really thought I was not bright enough to "get it".

2. I got the same answers as you for scenario A. The probability that the first patient is sick is

p s / ( p s + (1-p) t1 / 2 + (1-p) t2 / 2).

The three terms in the denominator are the probabilities associated with the three ways of getting a positive test: true positive, false positive with test t1, false positive with test t2.

This works out to 1/4.

In scenario A, the second patient's outcome is independent of the first, so the probability that both are sick is just the square of the above probability.
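
A quick way to check these Scenario A numbers is a short Python sketch. It assumes the parameters from the post (p=0.1, s=0.9, t equal to 0.2 or 0.4 with equal probability); the variable names are made up for this example:

```python
from fractions import Fraction

p, s = Fraction(1, 10), Fraction(9, 10)
t_values = [Fraction(1, 5), Fraction(2, 5)]  # t1 and t2
weight = Fraction(1, 2)                      # each t is equally likely

# The three ways of getting a positive test.
true_pos = p * s
false_pos = sum(weight * (1 - p) * t for t in t_values)

p_one = true_pos / (true_pos + false_pos)
p_both = p_one ** 2  # the two patients are independent in Scenario A

print(p_one, p_both)  # 1/4 1/16
```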

In scenario B, the answer to question 1 is the same. If I'm not mistaken, the answer to question 2 is

p^2 s^2 / ((p s + (1-p) t1)^2 / 2 + (p s + (1-p) t2)^2 / 2)

Sorry, that's a bit hard to read. The numerator is the probability of getting two true positives. The denominator is of the form A^2/2 + B^2/2, where A is the probability of getting a single positive result (true or false) under hypothesis t1, and B is the probability under hypothesis t2. Under either hypothesis, the outcomes for patients 1 and 2 are independent, so the probability of getting two positive results under hypothesis t1 is A^2, and similarly for hypothesis t2.

Anyway, the numerical value of the above expression is 1/17, so I claim that that's the answer to scenario B, question 2.
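
The Scenario B expression can be checked the same way. This is only a sketch under the post's parameters (p=0.1, s=0.9, t equal to 0.2 or 0.4 with equal prior probability), with hypothetical variable names:

```python
from fractions import Fraction

p, s = Fraction(1, 10), Fraction(9, 10)
# Prior probability of each hypothesis about t.
prior = {Fraction(1, 5): Fraction(1, 2), Fraction(2, 5): Fraction(1, 2)}

# Two true positives: this does not depend on t.
numerator = (p * s) ** 2

# Two positive results of any kind, averaged over the hypotheses.
# Given t, the two patients are independent, hence the square.
denominator = sum(w * (p * s + (1 - p) * t) ** 2 for t, w in prior.items())

p_both = numerator / denominator
print(p_both)  # 1/17
```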

1. So far you are 4 for 4 (two scenarios, two questions each). Want to lock in your answers for C and D before I publish solutions?

2. For any given t, the probability that a person who tests positive is actually sick is

psick(t) = p s / (p s + (1-p) t).

For the given parameters, the two relevant values are

psick(0.2) = 1/3
psick(0.4) = 1/5

In both scenarios C and D, the probability that the first person is sick is simply the average of the two:

P1 = (1/3 + 1/5)/2 = 4/15.

That's because the two possible values of t remain equiprobable throughout these scenarios.

In scenario C, each new positive-testing person is an independent event with this same probability, so the probability that the first two people are both sick is

P2C = P1^2 = 16/225.

In scenario D, the two events are not independent, because they have the same underlying value of t. But for any given t, they would be independent. So we can compute the probability that both people are sick under hypothesis t=0.2, and the probability that both people are sick under hypothesis t=0.4, and average the two:

P2D = (psick(0.2)^2 + psick(0.4)^2) / 2 = 17/225.

3. Dammit, I got C wrong, didn't I? More of the positive results come from hypothesis t = 0.4 than from t=0.2. I'll revise, but I wanted to get this comment into the record right away.

4. Let me see if I can get it right this time. Feel free to leave my earlier wrong answer up. I deserve to be at least mildly shamed.

Scenario C:

Let t1=0.2 and t2=0.4. Here are the ways to get a positive test result, with their associated probabilities:

Sick: p s
Not sick, t=t1: (1-p) t1 / 2
Not sick, t=t2: (1-p) t2 / 2

The probability that one person is sick, given a positive test result, is

ps / (sum of all three terms above),

which is 1/4.

That's the probability that the first positive-tester is sick, in scenario C (as it was in A and B).

In scenario C, each trial is independent, so the answer to question 2 is the square of the answer to question 1, i.e., 1/16.
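
One way to see why the plain average of 1/3 and 1/5 does not apply here is to weight each value of t by its share of the positive tests. The sketch below assumes the post's parameters (p=0.1, s=0.9, t equal to 0.2 or 0.4 with equal probability) and uses names invented for this example:

```python
from fractions import Fraction

p, s = Fraction(1, 10), Fraction(9, 10)
prior = {Fraction(1, 5): Fraction(1, 2), Fraction(2, 5): Fraction(1, 2)}

# Probability of a positive test under each value of t.
p_pos = {t: p * s + (1 - p) * t for t in prior}

# Share of all positive tests that comes from each value of t.
total = sum(prior[t] * p_pos[t] for t in prior)
weight = {t: prior[t] * p_pos[t] / total for t in prior}

# P(sick | positive, t) for each t, then the weighted average.
p_sick = {t: p * s / p_pos[t] for t in prior}
p_one = sum(weight[t] * p_sick[t] for t in prior)
p_both = p_one ** 2  # patients are independent in Scenario C

print(weight[Fraction(1, 5)], weight[Fraction(2, 5)])  # 3/8 5/8
print(p_one, p_both)  # 1/4 1/16
```

The weights 3/8 and 5/8 show that more of the positive results come from t=0.4, which pulls the answer from 4/15 down to 1/4.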

Scenario D:

If we hypothesize any given value of t, then we can calculate the probability that any given positive-tester is in fact sick. That turns out to be

psick = 1/3 if t= 0.2
psick = 1/5 if t = 0.4.

In scenario D, we never find out any information that tells us which value of t is correct, so they remain equiprobable. There's a 50% chance that psick=1/3 and a 50% chance that psick = 1/5, so when you meet that first positive-tester, the probability that he's sick is the average of the two.

psick = 0.5 (1/3 + 1/5) = 4/15. [Scenario D, question 1]

Under each hypothesis for t, the probability that any given positive-tester is sick is independent of the others, so the probability that the first two testers are both sick is the square of the probability that one is sick. We still have no information about which value of t is correct, so the probability that both folks are sick is the average of the probabilities for each of the two t's:

0.5 ( 1/9 + 1/25 ) = 17/225 [Scenario D, question 2].
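
The two Scenario D averages can be reproduced with a few lines of Python. As before, this is a sketch assuming p=0.1, s=0.9, and two equally likely values of t; the variable names are illustrative:

```python
from fractions import Fraction

p, s = Fraction(1, 10), Fraction(9, 10)
t_values = [Fraction(1, 5), Fraction(2, 5)]

# P(sick | positive) under each hypothesis for t: 1/3 and 1/5.
p_sick = [p * s / (p * s + (1 - p) * t) for t in t_values]

# The two hypotheses stay equally likely, so take plain averages.
p_one = sum(p_sick) / len(p_sick)
p_both = sum(x ** 2 for x in p_sick) / len(p_sick)

print(p_one, p_both)  # 4/15 17/225
```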

5. Your correction on Scenario C is correct, and your answer on Scenario D was correct all along.

Interestingly, I made the same mistake on Scenario C (but it took me longer to catch it).

Nice job!

3. Hi, I have a question about Scenario A, question 2. I computed the solution by treating the patients as statistically independent with respect to the test, so I get (1/4)^2 = 1/16.
Looking at the solution, I saw a lot of computation to get the same answer. Am I missing something, or is my answer correct?
By the way, thank you for this series of posts, Allen; they are very useful.
Giovanni

1. Yes, that's right. I did the computation partly to get the other three probabilities (both well, and one well and one sick), and partly to demonstrate the computation I need for some of the other scenarios.