Probably Overthinking It: Probability is hard, part two

Friday, May 6, 2016

Probability is hard, part two

If you read yesterday's post, you know that my colleague Sanjoy Mahajan and I have been working on a series of problems related to conditional probability and Bayesian statistics. In the previous article, I presented the Red Dice problem, which is relatively simple. I posted it here because it presents four different versions of the problem, which we'll need for the next step, and it demonstrates the computational tools I use in my solution.

Now I'll present the problem we are really working on: interpreting medical tests. Suppose we test a patient to see if they have a disease, and the test comes back positive. What is the probability that the patient is actually sick (that is, has the disease)?

To answer this question, we need to know:

The prevalence of the disease in the population the patient is from. Let's assume the patient is identified as a member of a population where the known prevalence is p.
The sensitivity of the test, s, which is the probability of a positive test if the patient is sick.
The false positive rate of the test, t, which is the probability of a positive test if the patient is not sick.

Given these parameters, it is straightforward to compute the probability that the patient is sick, given a positive test. This problem is a staple in probability classes.

But here's the wrinkle. What if you are uncertain about one of the parameters: p, s, or t? For example, suppose you have reason to believe that t is either 0.2 or 0.4 with equal probability. And suppose p=0.1 and s=0.9.
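
As a reference point, here is a minimal sketch of the basic computation for a known t, evaluated at each of the two candidate values. The function name `prob_sick_given_positive` is made up for this illustration and is not taken from the notebook; exact fractions are used only to keep the arithmetic readable.

```python
from fractions import Fraction

def prob_sick_given_positive(p, s, t):
    """Bayes's theorem: P(sick | positive test) for known p, s, and t."""
    true_pos = p * s           # patient is sick and tests positive
    false_pos = (1 - p) * t    # patient is not sick but tests positive
    return true_pos / (true_pos + false_pos)

p = Fraction(1, 10)   # prevalence
s = Fraction(9, 10)   # sensitivity

# Evaluate the posterior for each candidate false positive rate.
for t in (Fraction(2, 10), Fraction(4, 10)):
    print(t, prob_sick_given_positive(p, s, t))
```

With these parameters the result is 1/3 when t = 0.2 and 1/5 when t = 0.4. The scenarios below differ in how those two cases should be combined.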

The questions we would like to answer are:

What is the probability that a patient who tests positive is actually sick?
Given two patients who have tested positive, what is the probability that both are sick?

As we did with the Red Dice problem, we'll consider four scenarios:

Scenario A: The patients are drawn at random from the relevant population, and the reason we are uncertain about t is that either (1) there are two versions of the test, with different false positive rates, and we don't know which test was used, or (2) there are two groups of people, the false positive rate is different for different groups, and we don't know which group the patient is in.

Scenario B: As in Scenario A, the patients are drawn at random from the relevant population, but the reason we are uncertain about t is that previous studies of the test have been contradictory. That is, there is only one version of the test, and we have reason to believe that t is the same for all groups, but we are not sure what the correct value of t is.

Scenario C: As in Scenario A, there are two versions of the test or two groups of people. But now the patients are being filtered so we only see the patients who tested positive and we don't know how many patients tested negative. For example, suppose you are a specialist and patients are only referred to you after they test positive.

Scenario D: As in Scenario B, we have reason to think that t is the same for all patients, and as in Scenario C, we only see patients who test positive and don't know how many tested negative.

I have posted a Jupyter notebook with solutions for the basic scenario (where we are certain about t) and for Scenario A. You might want to write your own solution before you read mine. And then you might want to work on Scenarios B, C, and D.

You can read a static version of the notebook here.

OR

You can run the notebook on Binder.

If you click the Binder link, you should get a home page showing the notebooks and Python modules in the repository for this blog. Click on test_scenario_a.ipynb to load the notebook for this article.

I'll post my solutions for Scenarios B, C, and D next week.

9 comments:

dartdog, May 6, 2016 at 12:02 PM
It gives me some hope that even you guys are having some difficulty moving the complexity up a level. I can understand the base case for Bayes fine, but I keep getting lost as I try to move on. So, glad to hear it is not just me! I really thought I was not bright enough to "get it".
Ted Bunn, May 6, 2016 at 3:16 PM
I got the same answers as you for scenario A. The probability that the first patient is sick is

p s / ( p s + (1-p) t1 / 2 + (1-p) t2 / 2).

The three terms in the denominator are the probabilities associated with the three ways of getting a positive test: true positive, false positive with test t1, false positive with test t2.

This works out to 1/4.

In scenario A, the second patient's outcome is independent of the first, so the probability that both are sick is just the square of the above probability.
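
The expression in this comment can be checked numerically. The sketch below simply evaluates the commenter's formula with the parameters from the post (p = 0.1, s = 0.9, t1 = 0.2, t2 = 0.4); the function name is hypothetical, and this is not the code from the notebook.

```python
from fractions import Fraction

def scenario_a_one_patient(p, s, t1, t2):
    """P(sick | positive) when t is t1 or t2 with equal probability."""
    true_pos = p * s                    # true positive
    false_pos_t1 = (1 - p) * t1 / 2     # false positive, test with rate t1
    false_pos_t2 = (1 - p) * t2 / 2     # false positive, test with rate t2
    return true_pos / (true_pos + false_pos_t1 + false_pos_t2)

p, s = Fraction(1, 10), Fraction(9, 10)
t1, t2 = Fraction(2, 10), Fraction(4, 10)

one_sick = scenario_a_one_patient(p, s, t1, t2)
# Assuming the two patients are independent, as the comment argues.
both_sick = one_sick ** 2
print(one_sick, both_sick)  # 1/4 1/16
```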

In scenario B, the answer to question 1 is the same. If I'm not mistaken, the answer to question 2 is

p^2 s^2 / ((p s + (1-p) t1)^2 / 2 + (p s + (1-p) t2)^2 / 2)

Sorry that's a bit hard to read. The numerator is the probability of getting two true positives. The denominator is of the form A^2/2 + B^2/2, where A is the probability of getting a single positive result (true or false) under hypothesis t1, and B is the probability under hypothesis t2. Under either hypothesis, the outcomes for patients 1 and 2 are independent, so the probability of getting two positive results under hypothesis t1 is A^2, and similarly for hypothesis t2.

Anyway, the numerical value of the above expression is 1/17, so I claim that that's the answer to scenario B, question 2.
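
As a quick check of that arithmetic, the sketch below evaluates the expression described above, with numerator p^2 s^2 and denominator A^2/2 + B^2/2. It reproduces the commenter's claim only; it is not the author's solution to Scenario B, and the function name is made up for illustration.

```python
from fractions import Fraction

def scenario_b_both_sick(p, s, t1, t2):
    """Commenter's formula: P(both sick | both positive), t shared by both patients."""
    both_true_pos = (p * s) ** 2     # two true positives (same under either t)
    a = p * s + (1 - p) * t1         # P(one positive result) if t = t1
    b = p * s + (1 - p) * t2         # P(one positive result) if t = t2
    # Each hypothesis about t has prior probability 1/2.
    return both_true_pos / (a ** 2 / 2 + b ** 2 / 2)

result = scenario_b_both_sick(
    Fraction(1, 10), Fraction(9, 10), Fraction(2, 10), Fraction(4, 10)
)
print(result)  # 1/17
```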
jimi75, June 2, 2016 at 12:50 AM
Hi, I have a question about Scenario A, question 2. I computed the solution by treating the patients as statistically independent with respect to the test, so I get (1/4)^2 = 1/16.
Looking at the solution, I saw a lot of computation to get the same answer. Am I missing something, or is my answer correct?
By the way, thank you for this series of posts, Allen. They are very useful.
Giovanni