This week's post has more math than most, so I wrote it in LaTeX and translated it to HTML using HeVeA. Some of the formulas are not as pretty as they could be. If you prefer, you can read this
article in PDF.
Abstract:
My two favorite topics in probability and statistics are
Bayes’s theorem and logistic regression. Because there are
similarities between them, I have always assumed that there is
a connection. In this note, I demonstrate the
connection mathematically, and (I hope) shed light on the
motivation for logistic regression and the interpretation of
the results.
1 Bayes’s theorem
I’ll start by reviewing Bayes’s theorem, using an example that came up
when I was in grad school. I signed up for a class on Theory of
Computation. On the first day of class, I was the first to arrive. A
few minutes later, another student arrived. Because I was expecting
most students in an advanced computer science class to be male, I was
mildly surprised that the other student was female. Another female
student arrived a few minutes later, which was sufficiently
surprising that I started to think I was in the wrong room. When
another female student arrived, I was confident I was in the wrong
place (and it turned out I was).
As each student arrived, I used the observed data to update my
belief that I was in the right place. We can use Bayes’s theorem to
quantify the calculation I was doing intuitively.
I’ll use H to represent the hypothesis that I was in the right room, and F to represent the observation that the first other student was female. Bayes’s theorem provides an algorithm for updating the probability of H:
P(H|F) = P(H) P(F|H) / P(F)
Where
 P(H) is the prior probability of H before the other student arrived.
 P(H|F) is the posterior probability of H, updated based on the observation F.
 P(F|H) is the likelihood of the data, F, assuming that the hypothesis is true.
 P(F) is the likelihood of the data, independent of H; that is, the total probability of F whether H is true or false.
Before I saw the other students, I was confident I was in the right
room, so I might assign P(H) something like 90%.
When I was in grad school, most advanced computer science classes were
90% male, so if I was in the right room, the likelihood of the
first female student was only 10%. And the likelihood of three
female students was only 0.1%.
If I was not in the right room, then the likelihood of
the first female student was more like 50%, so the likelihood
of all three was 12.5%.
Plugging those numbers into Bayes’s theorem yields P(H|F) = 0.64 after one female student, P(H|FF) = 0.26 after the second, and P(H|FFF) = 0.07 after the third.
[UPDATE: An earlier version of this article had incorrect values in the previous sentence. Thanks to David Burger for catching the error.]
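The arithmetic can be checked with a short script. This is a minimal sketch: the function name `posterior` and its parameters are illustrative assumptions, and it assumes the students arrive independently, with P(F|H) = 0.1 and P(F|¬H) = 0.5 as in the text (¬H means I was in the wrong room).

```python
def posterior(prior, like_h, like_not_h, n):
    """P(H | n female students in a row), assuming independent arrivals.

    prior:      P(H), prior probability of being in the right room
    like_h:     P(F|H), chance a student is female if H is true
    like_not_h: P(F|not H), chance a student is female if H is false
    """
    numer = prior * like_h ** n
    # Denominator: total probability of the data, summed over H and not-H.
    denom = numer + (1 - prior) * like_not_h ** n
    return numer / denom


for n in (1, 2, 3):
    print(n, round(posterior(0.9, 0.1, 0.5, n), 2))
```

With the numbers above, the three posteriors round to 0.64, 0.26, and 0.07.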
2 Logistic regression
Logistic regression is based on the following functional form:
logit(p) = β_{0} + β_{1} x_{1} + ... + β_{n} x_{n} 
where the dependent variable, p, is a probability, the xs are explanatory variables, and the βs are coefficients we want to estimate. The logit function is the log-odds, or
logit(p) = ln( p / (1 − p) )
When you present logistic regression like this, it raises
three questions:
 Why is logit(p) the right choice for the dependent
variable?
 Why should we expect the relationship between logit(p)
and the explanatory variables to be linear?
 How should we interpret the estimated parameters?
The answer to all of these questions turns out to be Bayes’s
theorem. To demonstrate that, I’ll use a simple example where
there is only one explanatory variable. But the derivation
generalizes to multiple regression.
On notation: I’ll use P(H) for the probability that some hypothesis, H, is true. O(H) is the odds of the same hypothesis, defined as
O(H) = P(H) / (1 − P(H))
I’ll use LO(H) to represent the log-odds of H:
LO(H) = ln O(H)
I’ll also use LR for a likelihood ratio, and OR for an odds ratio. Finally, I’ll use LLR for a log-likelihood ratio, and LOR for a log-odds ratio.
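For readers who like to see notation as code, here is a small sketch of the conversions between probability, odds, and log-odds. The function names are illustrative assumptions, and the formulas are the standard definitions, odds = p / (1 − p) and log-odds = ln(odds).

```python
import math


def odds(p):
    """O(H): odds in favor of a hypothesis with probability p."""
    return p / (1 - p)


def log_odds(p):
    """LO(H): natural log of the odds, the same thing as logit(p)."""
    return math.log(odds(p))


def prob_from_log_odds(lo):
    """Invert log_odds to recover a probability."""
    return 1 / (1 + math.exp(-lo))


print(odds(0.9))                          # about 9, that is, 9-to-1 in favor
print(prob_from_log_odds(log_odds(0.9)))  # back to about 0.9
```

Working in log-odds is convenient because multiplying odds by a likelihood ratio becomes adding a log-likelihood ratio.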
3 Making the connection
To demonstrate the connection between Bayes’s theorem and
logistic regression, I’ll start with the odds form
of Bayes’s theorem. Continuing the previous example,
I could write
O(H|F) = O(H) LR(F|H)
(1)
where
 O(H) is the prior odds that I was in the right room,
 O(H|F) is the posterior odds after seeing one female student,
 LR(F|H) is the likelihood ratio of the data, given the hypothesis.
The likelihood ratio of the data is:
LR(F|H) = P(F|H) / P(F|¬H)
where ¬H means H is false.
Noticing that logistic regression is expressed in terms of log-odds, my next move is to write the log-odds form of Bayes’s theorem by taking the log of Eqn 1:
LO(H|F) = LO(H) + LLR(F|H)
(2)
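As a numerical check on Eqns 1 and 2, the sketch below plugs in the numbers from Section 1: P(H) = 0.9, P(F|H) = 0.1, and P(F|¬H) = 0.5. The variable names are illustrative assumptions.

```python
import math

p_h = 0.9          # P(H), prior probability from Section 1
p_f_h = 0.1        # P(F|H)
p_f_not_h = 0.5    # P(F|not H)

prior_odds = p_h / (1 - p_h)   # O(H)
lr = p_f_h / p_f_not_h         # LR(F|H)

# Eqn 1: the odds form multiplies prior odds by the likelihood ratio.
post_odds = prior_odds * lr

# Eqn 2: the log-odds form adds the log-likelihood ratio instead.
post_log_odds = math.log(prior_odds) + math.log(lr)

# Convert posterior odds back to a probability.
post_prob = post_odds / (1 + post_odds)
print(round(post_odds, 2), round(post_prob, 2))
```

The posterior odds come out to 1.8, which is a probability of about 0.64, the same answer as the probability form of Bayes’s theorem.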
If the first student to arrive had been male, we would write
LO(H|M) = LO(H) + LLR(M|H)
(3)
Or more generally if we use X as a variable to represent the sex of the observed student, we would write
LO(H|X) = LO(H) + LLR(X|H)
(4)
I’ll assign X=0 if the observed student is female and X=1 if male. Then I can write:
LLR(X|H) = LLR(F|H)  if X = 0
LLR(X|H) = LLR(M|H)  if X = 1
(5)
Or we can collapse these two expressions into one by using X as a multiplier:
LLR(X|H) = LLR(F|H) + X [LLR(M|H) − LLR(F|H)]
(6)
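To confirm that Eqn 6 is just Eqn 5 in disguise, here is a small sketch that evaluates both for X = 0 and X = 1. It assumes every student is either female or male, so P(M|H) = 0.9 and P(M|¬H) = 0.5 are the complements of the values used for F; the function names are illustrative.

```python
import math

llr_f = math.log(0.1 / 0.5)   # LLR(F|H) = ln[P(F|H) / P(F|not H)]
llr_m = math.log(0.9 / 0.5)   # LLR(M|H), assuming P(M|.) = 1 - P(F|.)


def llr_piecewise(x):
    """Eqn 5: choose the log-likelihood ratio by case."""
    return llr_f if x == 0 else llr_m


def llr_linear(x):
    """Eqn 6: the same quantity written as a linear function of X."""
    return llr_f + x * (llr_m - llr_f)


for x in (0, 1):
    print(x, math.isclose(llr_piecewise(x), llr_linear(x)))
```

Both lines print True. Writing the update as intercept plus X times a difference is what makes it look like a regression equation.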
4 Odds ratios
The next move is to recognize that the part of Eqn 6 in brackets is the log-odds ratio of H. To see that, we need to look more closely at odds ratios.
Odds ratios are often used in medicine to describe the association
between a disease and a risk factor. In the example scenario, we
can use an odds ratio to express the odds of the hypothesis
H if we observe a male student, relative to the odds if we
observe a female student:
OR_{X}(H) = O(H|M) / O(H|F)
I’m using the notation OR_{X} to represent the odds ratio associated with the variable X.
Applying Bayes’s theorem to the top and bottom of the previous expression yields
OR_{X}(H) = O(H) LR(M|H) / ( O(H) LR(F|H) ) = LR(M|H) / LR(F|H)
Taking the log of both sides yields
LOR_{X}(H) = LLR(M|H) − LLR(F|H)
(7) 
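A quick numerical sketch shows the prior odds canceling out of the odds ratio. It uses the example’s likelihoods and again assumes P(M|H) = 0.9 and P(M|¬H) = 0.5; the helper `posterior_odds` is an illustrative name.

```python
import math


def posterior_odds(prior, like_h, like_not_h):
    """Odds form of Bayes's theorem: O(H|data) = O(H) * LR(data|H)."""
    return prior / (1 - prior) * like_h / like_not_h


# Try two different priors; the odds ratio should not depend on them.
for prior in (0.9, 0.5):
    o_m = posterior_odds(prior, 0.9, 0.5)   # O(H|M)
    o_f = posterior_odds(prior, 0.1, 0.5)   # O(H|F)
    print(prior, round(o_m / o_f, 2))

# Eqn 7: the log-odds ratio is a difference of log-likelihood ratios.
lor = math.log(0.9 / 0.5) - math.log(0.1 / 0.5)
print(round(lor, 3))
```

The odds ratio is 9 for both priors, and the log-odds ratio is ln 9, about 2.197. In this example the odds ratio happens to equal the prior odds of 9, but that is a coincidence of the chosen numbers.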
This result should look familiar, since it appears in Eqn 6.
5 Conclusion
Now we have all the pieces we need; we just have to assemble them.
Combining Eqns 6 and 7 yields
LLR(X|H) = LLR(F|H) + X LOR_{X}(H)
(8)
Combining Eqns 4 and 8 yields
LO(H|X) = LO(H) + LLR(F|H) + X LOR_{X}(H)
(9)
Finally, combining Eqns 2 and 9 yields
LO(H|X) = LO(H|F) + X LOR_{X}(H)
We can think of this equation as the log-odds form of Bayes’s theorem,
with the update term expressed as a log-odds ratio. Let’s compare
that to the functional form of logistic regression:
logit(p) = β_{0} + X β_{1} 
The correspondence between these equations suggests the following
interpretation:
 The predicted value, logit(p), is the posterior log-odds of the hypothesis, given the observed data.
 The intercept, β_{0}, is the log-odds of the hypothesis if X=0.
 The coefficient of X, β_{1}, is a log-odds ratio that represents the odds of H when X=1, relative to when X=0.
This relationship between logistic regression and Bayes’s theorem
tells us how to interpret the estimated coefficients. It also
answers the question I posed at the beginning of this note:
the functional form of logistic regression makes sense because
it corresponds to the way Bayes’s theorem uses data to update
probabilities.
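One way to test this interpretation is to fit a logistic regression to data that follow the classroom example and compare the fitted coefficients with the Bayesian quantities. The sketch below is an illustration under stated assumptions: it treats H as the outcome and X as the explanatory variable, uses the exact joint probabilities P(X, H) as observation weights instead of simulated students, assumes P(M|·) = 1 − P(F|·), and fits by Newton’s method in plain Python. The names `cells` and `fit_logistic` are hypothetical.

```python
import math

p_h, p_f_h, p_f_not_h = 0.9, 0.1, 0.5

# Each row is (x, h, weight), where the weight is P(X=x, H=h).
# X = 0 means female and X = 1 means male.
cells = [
    (0, 1, p_h * p_f_h),
    (0, 0, (1 - p_h) * p_f_not_h),
    (1, 1, p_h * (1 - p_f_h)),
    (1, 0, (1 - p_h) * (1 - p_f_not_h)),
]


def fit_logistic(cells, iterations=25):
    """Weighted maximum likelihood for logit(p) = b0 + b1 * x."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, h, w in cells:
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            # Gradient of the weighted log-likelihood.
            g0 += w * (h - p)
            g1 += w * x * (h - p)
            # Entries of the negated Hessian.
            v = w * p * (1 - p)
            h00 += v
            h01 += v * x
            h11 += v * x * x
        # Newton step: solve the 2x2 linear system by hand.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


b0, b1 = fit_logistic(cells)

# Bayesian quantities from the derivation.
lo_h_f = math.log(p_h / (1 - p_h)) + math.log(p_f_h / p_f_not_h)  # LO(H|F)
lor = math.log((1 - p_f_h) / (1 - p_f_not_h)) - math.log(p_f_h / p_f_not_h)

print(round(b0, 3), round(lo_h_f, 3))   # intercept vs. LO(H|F)
print(round(b1, 3), round(lor, 3))      # slope vs. log-odds ratio
```

Under these assumptions the fitted intercept matches LO(H|F) and the fitted slope matches the log-odds ratio, which is the correspondence described above. With real data the estimates would only approximate these values.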
This document was translated from LaTeX by HeVeA.
Wow, this was interesting! While I don't have a background in logistic regression (yet), this was a fantastic first exposure to the versatility and usefulness of Bayes's theorem.
I loved your example about the college students: we really do use statistics and probability in everyday situations.
There seems to be something intuitively wrong with your calculation of P(F). You are saying:
P(F) = P(F|H) P(H) + P(F|¬H) P(¬H)
= 0.1 * 0.9 + 0.5 * 0.1
= 0.14
and using P(H) = 0.9, P(F|H) = 0.1 to obtain
P(H|F) = P(H) * P(F|H) / P(F)
= 0.9 * 0.1 / 0.14 = 0.64
But here's the problem: the university is large, and your class is relatively small. So, given that P(F) is the likelihood of the data INDEPENDENT of H, you would expect that the overall university ratio of females would swamp out any skewing that your classes might introduce. In other words, I would expect P(F) = 0.5 (approx), not 0.14.
The problem appears to be that you are assuming that the students in your class constitute 90% of the population of students (i.e. P(H)) of the university, whereas in fact they are likely to constitute only a minuscule proportion. That's why there's a skewing.
Comments?
I'm not positive I understand where you see a problem, but I think I agree with you. P(F) is the probability of a female student regardless of H, so it should be the overall fraction of female students at the university, probably close to 0.5. That's what I used in my calculations.
But if you take P(H) = 0.9, P(F|H) = 0.1, P(F) = 0.5 and plug it into the formula, you get
P(H|F) = P(H) * P(F|H) / P(F)
= 0.9 * 0.1 / 0.5 = 0.18
which is not the answer of 0.64 that you gave in your post.
Ah, now I see the problem! My previous reply was wrong, but the numbers in the article are correct (but explained badly).
As you said, the denominator P(F) should be P(F|H) P(H) + P(F|¬H) P(¬H), which is 0.14, not 0.5, and that yields P(H|F) = 0.64.
In your first message, you objected to this denominator because you said it assumes that my class makes up 90% of the population of students. I think that's not right; rather, it takes into account that I am initially 90% sure that I am in the right class. But the term P(F|¬H) = 0.5 assumes (as you suggest) that my class is an insignificant part of the student population.
Sorry for my confusion, and thanks for pointing this out. When I have a chance, I will edit the article to clarify.
Many thanks for your replies, professor. I hope I'm not overposting.
Would it be fair to say that my misconception of the problem is that I'm taking Olin University as the population, whereas I should be considering the population not as Olin University, but as a "Downeyian University"?
The Downeyian University is a special university ... one in which you have a 90% chance of turning up to the right class ... and not the Olin University, in which you would have only a very small chance of turning up at the right class if you just chose one at random.
And this Downeyian University is a very strange University indeed ... because although you can specify what classes you are likely to turn up correctly for (it's the ones that you teach), you don't know what classes constitute the incorrect choices. They will be some proper subset of the entire Olin University, but we don't know what. The only thing we can say about it is that there is the same proportion of males as females. That would presumably be an assumption.
But there's more! Although you're assuming that proportion for the incorrectly chosen classes, you might be wrong. In fact, that's even plausible. How? Well, suppose you mostly give lectures in the science faculty. Suppose that the science students are 90% male, not 50% male, and exactly the same proportion as your own class. What happens then, of course, is that the presence of females would actually give you no information.
And maybe the situation is even worse than that! Maybe the actual "Downeyian" population contains more than 90% males, but that the males have a disproportionately larger distaste for mathematics and programming. Maybe they prefer engineering, or something. In that case, your intuition would have to be entirely flipped around ... the presence of females would be a positive indication that you're actually in the right class.
Or perhaps I've got the wrong end of the stick. But I think that what I'm saying makes sense.
Who would have thought statistics could be so much fun? ;)
Don't worry about overposting, but at some point I might have to stop overreplying :)
Reading between the lines, I think you are coming face to face with one of the central issues of Bayesian inference, which is how to interpret probabilities, and especially the prior probability.
In this case, P(H) is the prior probability that I am in the right class. If I chose the classroom at random, P(H) would be low. But I am basing my solution on the assumption that I did not choose the classroom at random, but rather tried to go to the right place. And based on my prior experience with navigating unfamiliar campuses, I estimate that my chance of being in the right place is about 90%.
In frequentist terms, you could say that the relevant sample space is "all the times I've tried to find the right room", rather than "all the classrooms on campus."
In (subjective) Bayesian terms, you would say that 90% is my subjective degree of belief that I am in the right place, based on relevant background information.
But I would not say (as I think you did) that I am making a claim about the university, or that my Downeyian university is very different from a real university. My analysis is based on a model and the simplifications that come with it, but I don't think the model is as weird as you suggest.
Thanks for this line of questions; I think it is productive.