"The canonical example is to imagine that a precocious newborn observes his first sunset, and wonders whether the sun will rise again or not. He assigns equal prior probabilities to both possible outcomes, and represents this by placing one white and one black marble into a bag. The following day, when the sun rises, the child places another white marble in the bag. The probability that a marble plucked randomly from the bag will be white (ie, the child’s degree of belief in future sunrises) has thus gone from a half to two-thirds. After sunrise the next day, the child adds another white marble, and the probability (and thus the degree of belief) goes from two-thirds to three-quarters. And so on. Gradually, the initial belief that the sun is just as likely as not to rise each morning is modified to become a near-certainty that the sun will always rise."This example made me wonder about two things:
1) Although they call it a "canonical example", I had not heard it before, so I wondered where it came from, and
2) Although the example demonstrates the general idea of updating beliefs in light of new evidence, it is not obvious that the hypothetical newborn is actually doing a Bayesian update.
At the risk of spoiling the fun, here is what I found:
1) The example is from Richard Price's commentary on Bayes's original essay, which was discovered after Bayes's death and presented what came to be called Bayes's Theorem. So I am embarrassed that I didn't know it.
2) The analysis is correct for a particular prior distribution and for a particular interpretation of the posterior, but presented in a way that obscures these details.
A false start
In fact, it is not easy to formulate this example in a Bayesian framework. One option (which fails) is to compute posterior probabilities for two hypotheses: either A (the sun will rise tomorrow) or B (the sun will not rise tomorrow).
We are given the priors: P(A) = P(B) = 1/2.
But when the sun rises in the morning, how do we compute the posteriors? We need the likelihoods; that is, the probability of the evidence (sunrise) under the two hypotheses. It is hard to make sense of these likelihoods, so we conclude that this formulation of the problem isn't working.
The beta distribution
A more fruitful (and more complicated) alternative is to imagine that the sun will rise with some probability, p, and try to estimate p.
If we know nothing about p, we could choose a prior distribution that is uniform between 0 and 1. This is a reasonable choice, but it is not the only choice. For example, we could instead make the odds (rather than the probabilities) uniform over a range of values, which has the intuitive appeal of giving more prior weight to values near 0 and 1.
But a uniform prior for p is equivalent to a Beta distribution with parameters α = β = 1, which has a mean value of α / (α + β), or 1/2. And the Beta distribution has the nice property of being a conjugate prior for this kind of binary data, which means that after an update the posterior is also a Beta distribution.
We can show (but I won't) that observing a success has the effect of increasing α by 1, and observing a failure increases β by 1. And that's where the white and black marbles come in. The author of the example is using the marbles as a sneaky way to talk about a Beta distribution.
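As a sanity check, here is a minimal sketch of that update in Python (my addition, not part of the original example), assuming SciPy is available; the marble counts play the role of the Beta parameters:

```python
from scipy.stats import beta

alpha, b = 1, 1                  # uniform prior, Beta(1, 1): one white and one black marble

for day in range(1, 4):
    alpha += 1                   # each observed sunrise adds 1 to alpha (a white marble)
    posterior = beta(alpha, b)
    print(f"after {day} sunrise(s): posterior mean = {posterior.mean():.4f}")
```

The printed means are 2/3, 3/4, and 4/5, which match the marble counts.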
After one sunrise, the newborn's posterior belief about p is a Beta distribution with α=2, β=1, which looks like this:
This represents the newborn's posterior belief about p, the probability of sunrise. At the far left, the hypothesis that p=0 has been refuted. At the far right, the most likely value (after one success) is p=1. A 90% credible interval is [0.225, 0.975], which indicates that the newborn is still very unsure about p. But if we insist on a single-value estimate, he or she might reasonably compute the mean of the posterior, which is 2/3.
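For the record, here is how the credible interval and the mean can be computed (my sketch, again assuming SciPy):

```python
from scipy.stats import beta

posterior = beta(2, 1)                   # posterior after one sunrise
low, high = posterior.ppf([0.05, 0.95])  # central 90% credible interval
print(low, high)                         # approximately 0.22 and 0.97
print(posterior.mean())                  # 2/3
```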
Similarly, after two successful sunrises, the posterior looks like this:
And the mean value of p is 3/4. So the marble method is correct after all! Well, sort of. It is correct if we assume a uniform prior for p, and if we use the mean of the posterior to generate a single-point estimate. For both decisions there are reasonable alternatives; for example, after any number of successes (without a failure) the maximum likelihood estimate of p is 1.
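To make the contrast concrete, here is a small sketch (mine) comparing the two summaries after two sunrises and no failures:

```python
from scipy.stats import beta

successes, trials = 2, 2
posterior = beta(successes + 1, trials - successes + 1)  # uniform prior updated with the data
print(posterior.mean())                                  # posterior mean: 3/4
print(successes / trials)                                # maximum likelihood estimate: 1.0
```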
In summary, this example demonstrates the general idea of a Bayesian update, but I think the way the calculation is presented is misleading.
Price's version
As I said, the example is from Price's comments on Bayes's article. Here is Price's version, from the page numbered 409 [in the original typeface, the letter "s" looks like "f", but I will fpare you]:
"One example here it will not be amiss to give.
"Let us image to ourselves the case of a person just brought forth into this world and left to collect from his observation of the order and course of events what powers and causes take place in it. The Sun would, probably, be the first object that would engage his attention; but after losing it the first night, he would be entirely ignorant whether he should ever see it again. He would therefore be in the condition of a person making a first experiment about an event entirely unknown to him. But let him see a second appearance or one return of the Sun, and an expectation would be raised in him of a second return, and he might know that there was an odds of 3 to 1 for some probability of this. This odds would increase, as before represented, with the number of returns to which he was witness. But no finite number of returns would be sufficient to produce absolute or physical certainty. For let it be supposed that he has seen it return at regular and stated intervals a million of times. The conclusions this would warrant would be such as follow --- There would be the odds of the millioneth power of 2, to one, that it was likely that it would return again at the end of the usual interval."To interpret "an odds of 3 to 1 for some probability of this" we have to look back to page 405, which computes the odds "for somewhat more than an even chance that it would happen on a second trial;" that is, the odds that p > 0.5. In the example, the odds are 3:1, which corresponds to a probability of 0.75.
If we use the Beta distribution to compute the same probability, the result is also 0.75. So Price's calculations are consistent with mine; he just uses a different way of summarizing them.
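Here is that check (my sketch): the probability that p > 0.5 under a Beta(2, 1) posterior, and the corresponding odds:

```python
from scipy.stats import beta

prob = 1 - beta(2, 1).cdf(0.5)   # P(p > 0.5) after one observed return of the Sun
print(prob)                      # 0.75
print(prob / (1 - prob))         # odds of 3 to 1, as Price says
```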
This is just Laplace's Rule of Succession: http://en.wikipedia.org/wiki/Rule_of_succession
Thanks -- I knew that Laplace had already discovered Bayes's Theorem (and probably should get credit for it) but I didn't know he had posed and solved the Sunrise problem (http://en.wikipedia.org/wiki/Sunrise_problem).
It seems to me that the hypotheses presented in the Economist article (either the sun will rise or it will not) make no connection between events on successive days, and so it is impossible for the updated probabilities to differ from the originals.
Your analysis is evidently a huge improvement, but I wonder if your terminology is a bit confusing when you talk about calculating the probability for a probability. (I noticed a similar terminology recently in your excellent Blinky Monty Hall problem.) I think such phrasing might be hard for some to stomach, is avoidable, and furthermore comes dangerously close to supposing that a probability is a physical property of the system under study, which gets one into all sorts of difficulties.
Might it be conceptually easier to suppose that the sun rises with some fixed relative frequency, and that the purpose of the analysis is to ascertain a probability distribution over the possible frequencies?
PS Bayes came first, but Laplace was the first to write down the theorem in its general form.
PPS - Your discussion of the form of the prior distribution has jogged my memory:
In fact the prior that best represents 'complete ignorance' does indeed place more weight near 0 and 1. Jaynes, in 'Prior Probabilities' (http://bayes.wustl.edu/etj/articles/prior.pdf), showed using transformation groups that the correct prior for P(f) in the case of 'complete ignorance' is proportional to 1/(f(1-f)), which converges on the law of succession for large numbers of samples.
Oddly, the uniform prior (and resulting law of succession) seems to amount to an additional piece of information to the effect that the frequency is neither 0 nor 1.
I see also that in the same article Jaynes discusses the conceptual difficulty of a 'probability of a probability', with an interesting general solution in terms of a population of people, each performing their own probability assignments. I think this can still be confusing, though, and can (probably) be avoided here (the problem, though, is what exactly do I mean by 'a fixed frequency'?).
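Coming back to the 1/(f(1-f)) prior: if I read it as the improper Beta(0, 0), then after s successes and f failures (both nonzero) the posterior is Beta(s, f), and its mean, s / (s + f), does converge on the law of succession for large samples. A quick numerical comparison (a sketch under that reading):

```python
s, f = 900, 100                  # observed successes and failures
n = s + f

jaynes = s / n                   # predictive under the 1/(p(1-p)) prior: mean of Beta(s, f)
laplace = (s + 1) / (n + 2)      # rule of succession under the uniform prior
print(jaynes, laplace)           # 0.9 vs about 0.8992, converging for large n
```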
Probably overthinking?
Thanks for a great post!
Following up (I think) somewhat on Tom's first comment, Boole argued that the flat prior on frequency of success (or probability p of success) could not be rationally preferred to assigning equal probability to all possible states or constitutions of the universe. Assigning equal probabilities to the frequencies is like assigning equal probability to values of a variable that counts the number of successes in a given sequence.
To take a simple example, say I've observed three risings out of three trials. Laplace assigns equal probability to {000}, {001, 010, 100}, {011, 101, 110}, and {111}. So, P(111) = 1/4 and P(011) = 1/12 before any evidence is collected. Boole's alternative makes all of the possible sequences equally likely. So, P(111) = P(011) = 1/8 before any evidence is collected.
The problem with that assignment (as Boole noticed) is that it makes learning from experience impossible. If one assigns equal probability to the constitutions of the universe, then the probability that the m+1st observation will be a success given that the last m have been successes is 1/2, regardless of the size of m.
I think one can get the same effect by letting alpha and beta go to infinity in the beta distribution. But I'm not very confident about this.
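A quick numerical check of both points (a sketch, under my reading of Boole's assignment): with every sequence equally likely, the conditional probability of another success is 1/2 no matter how long the run of successes; and as alpha = beta grows, the Beta posterior predictive also approaches 1/2:

```python
from itertools import product

m = 5

# Boole: all 2**(m+1) binary sequences equally likely
seqs = list(product([0, 1], repeat=m + 1))
given = [s for s in seqs if all(s[:m])]        # sequences whose first m entries are successes
p_next = sum(s[m] for s in given) / len(given)
print(p_next)                                  # 0.5, regardless of m

# Beta(a, a) prior: predictive probability after m successes is (a + m) / (2a + m)
for a in [1, 10, 1000]:
    print((a + m) / (2 * a + m))               # 0.857..., 0.6, 0.5012...: approaches 0.5
```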
I don't know if you have time to overthink this any further, but I would really appreciate seeing some simple examples and how some alternatives to the flat prior over p work out. For example, I would like to see how a flat prior over the odds compares.
I will run the numbers with a flat prior over the odds. I am also curious to see what kind of difference it makes.
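Here is a rough sketch of how I might run those numbers on a grid, with a prior that is flat over the odds on a bounded range (the range here is an arbitrary choice):

```python
import numpy as np

odds = np.linspace(0.01, 100, 100001)   # grid of odds values; flat prior over this range
p = odds / (1 + odds)                   # corresponding probabilities of sunrise

k = 2                                   # two observed sunrises, no failures
posterior = np.ones_like(odds) * p**k   # flat prior times likelihood p^k
posterior /= posterior.sum()            # normalize on the (uniform) odds grid

print((p * posterior).sum())            # posterior mean of p
```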
Jonathan, thanks, you raise a historically important issue.
Boole was one of many who argued against Laplace's reasoning, without understanding it. Boole's distribution is quite appropriate for the case where we have no reason to suppose a constant mechanism governing the outcomes of our experiments - and surprise, surprise, induction is impossible under these circumstances.
But the law of succession (along with Jaynes' pre-prior that I mentioned above) was derived for cases where we decide that it is reasonable to assume such a constant mechanism, specifically, sequences of Bernoulli trials.
Of course, we don't have to assume a constant frequency, in order to perform inductive reasoning. It is enough to recognize that our data result from repetitions of similar experiments - the frequencies we fit can vary according to any model that is indicated by the observations.
Sadly, the objections of Boole, Venn, and several others prevailed for a while, and frequentist statistics took hold.
"Boole was one of many who argued against Laplace's reasoning, without understanding it. Boole's distribution is quite appropriate for the case where we have no reason to suppose a constant mechanism governing the outcomes of our experiments."
That seems unfair to Boole. The case imagined -- where a precocious newborn observes his first sunrise (or in Hume's language, we imagine Adam with fully formed rational capacity but no previous experience) -- is exactly a case where we have no reason to suppose a constant mechanism governs the outcomes of our experiments.
"Sadly, the objections of Boole, Venn, and several others prevailed for a while, and frequentist statistics took hold."
That is certainly the way Fisher tells the history. But I think there are good reasons to be skeptical of that history. For more detail on this, I highly recommend this paper by Sandy Zabell. Or read some of Boole's later work, in which he explains that he is not opposed to Laplace's principle but to its arbitrary application.
Anyway, I agree with Zabell that it wasn't until the 1930s that Bayesian-Laplacean methods fell out of favor.
You are right about cases of truly complete ignorance - application of the law of succession in such cases would need to be done with great care (we are still free to examine the consequences of such a model, but we need to consider other models also), but the rule was not derived for such cases.
You might be right that I misrepresent Boole; I can't claim to have made a detailed historical study, but from my limited reading, it seems as though he had strong objections to the principle under any application (certainly many others did). Thanks for the additional sources.
(Of course, Laplace was also opposed to arbitrary application of his principles, and made this very clear in his writing.)
It's not that I particularly wanted to blame Boole either - others attacked the law of succession on really illogical grounds, claiming instances where it demonstrably provided absurd results. They never stopped to wonder how they knew the results were absurd - if it is clear that the outcome is ridiculous, then it is also clear that your reasoning process is making use of prior information much stronger than the uniform prior for which the method was designed.
Thanks again for the sources and the different point of view.
Thank you both for the great comments!