Wednesday, May 18, 2016

Learning to Love Bayesian Statistics

I did a webcast earlier today about Bayesian statistics.  Some time in the next week, the video should be available from O'Reilly.  In the meantime, you can see my slides here:

And here's a transcript of what I said:

Thanks everyone for joining me for this webcast.  At the bottom of this slide you can see the URL for my slides, so you can follow along at home.

I’m Allen Downey and I’m a professor at Olin College, which is a new engineering college right outside Boston.  Our mission is to fix engineering education, and one of the ways I’m working on that is by teaching Bayesian statistics.

Bayesian methods have been the victim of a 200 year smear campaign.  If you are interested in the history and the people involved, I recommend this book, The Theory That Would Not Die.  But the result of this campaign is that many of the things you have heard about Bayesian statistics are wrong, or at least misleading.

In particular, you might have heard that Bayesian stats are difficult, that they are computationally slow, that the results are subjective, and that we don’t need Bayesian stats because they are redundant with more mainstream methods.  I’ll explain what each of these claims is about, and then I’ll explain why I disagree.

At most universities, Bayesian stats is a graduate level topic that you can only study after you’ve learned a lot of math and a lot of classical statistics.

And if you pick up most books, you can see why people think it’s hard.  This is from one of the most popular books about Bayesian statistics, and this is just page 7.  Only 563 pages to go!

Also, you might have heard that Bayesian methods are computationally intensive, so they don’t scale up to real-world problems or big data.

Furthermore, the results you get depend on prior beliefs, so they are subjective.  That might be ok for some applications, but not for science, where the results are supposed to be objective.

In some cases where the prior is objective, the results you get from Bayesian methods are the same as the results from classical methods, so what’s the point?

Finally, and this is something I hear from statisticians, we’ve made 100 years of scientific and engineering progress using classical statistics.  We’re doing just fine without Bayes.

Ronald Fisher, one of the most prominent statisticians of the 20th century, summed it up like this: “The theory of inverse probability [which he called it because Bayes is the theory that dare not speak its name] is founded upon an error, and must be wholly rejected.”  Well, that’s the end of the webcast.  Thanks for joining me.

Wait, not so fast.  None of what I just said is right; they are all myths.  Let me explain them one at a time (although not in the same order).  I’ll start with #3.

The results from Bayesian methods are subjective.  Actually, this one is true.  But that’s a good thing.

In fact, it’s an I.J. Good thing.  Another prominent statistician, and a cryptologist, summed it up like this, “The Bayesian states his judgements, whereas the objectivist sweeps them under the carpet by calling assumptions knowledge, and he basks in the glorious objectivity of science.”

His point is that we like to pretend that science is strictly objective, but we’re kidding ourselves.  Science is always based on modeling decisions, and modeling decisions are always subjective.  Bayesian methods make these decisions explicit, and that’s a feature, not a bug.

But all hope for objectivity is not lost.  Even if you and I start with different priors, if we see enough data (and agree on how to interpret it) our beliefs will converge.  And if we don’t have enough data to converge, we’ll be able to quantify how much uncertainty remains.
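To make that convergence concrete, here's a minimal sketch (my own numbers, not from the talk): two observers start with very different Beta priors over a coin's probability of heads, see the same data, and their posterior means converge.

```python
# Two different priors over a coin's probability of heads,
# expressed as Beta(a, b) pseudo-counts.
prior_1 = (1, 1)    # uniform: no strong opinion
prior_2 = (20, 5)   # strongly believes the coin favors heads

def posterior_mean(prior, heads, flips):
    """Mean of the Beta posterior after observing `heads` in `flips`."""
    a, b = prior
    return (a + heads) / (a + b + flips)

# Simulated data: a fair coin, so about half the flips are heads.
for flips in [10, 100, 10000]:
    heads = flips // 2
    m1 = posterior_mean(prior_1, heads, flips)
    m2 = posterior_mean(prior_2, heads, flips)
    print(flips, round(m1, 3), round(m2, 3), round(abs(m1 - m2), 3))
```

With 10 flips the two observers still disagree noticeably; by 10,000 flips the gap between their posterior means is negligible.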

On to the next myth, that Bayesian methods are unnecessary because science and engineering are getting along fine without them.

Well, the cracks in that wall are starting to show.  In 2005, this paper explained why many published findings are probably false because of unintended flexibility in experimental design, definitions, outcomes and analysis.

This 2011 paper introduced the term “researcher degrees of freedom” to explain why false positive rates might be much higher than 5%.

And in 2015 a psychology journal shook the foundations of classical statistics when it banned p-values and confidence intervals in favor of estimated effect sizes.

Most recently, there has been a large-scale effort to replicate effects reported in previous papers.  The initial results are discouraging, with many failures to replicate, including some well-known and generally accepted effects.

So everything is not great.

Of course, I don’t want to dismiss every statistical method invented in the last 100 years.  I am a big fan of regression, for example.  But we would be better off without these particular tools: confidence intervals and hypothesis testing with p-values.

In fact, here’s what the world would look like today if p-values had never been invented.

The next myth I’ll address is that Bayesian methods are redundant because they produce the same results as classical methods, for particular choices of prior beliefs.  To be honest, I have always found this claim baffling.

Classical methods produce results in the form of point estimates and confidence intervals.

Bayesian methods produce a posterior distribution, which contains every possible outcome and its corresponding probability.  That’s a different kind of thing from a point estimate or an interval, and it contains a lot more information.  So it’s just not the same.

If you want to compare results from Bayesian methods and classical methods, you can use the posterior distribution to generate point estimates and intervals.  But that’s not a fair comparison.  It’s like running a race between a car and an airplane, but to make them comparable, you keep the plane on the ground.  You’re missing the whole point!

Bayesian methods don’t do the same things better.  They do different things, and those things are better.  Let me give you some examples.

Suppose you run a poll of likely voters and you find that 52% of your sample intends to vote for one of the candidates, Alice.  You compute a confidence interval like 42% to 62%, and you’ve got a p-value, 0.34.  Now here’s what you really want to know: what is the probability that Alice will win?

Based on these statistics, I have no idea.  This is a diagram from a friend of mine, Ted Bunn, explaining that p-values can answer lots of questions, just not the questions we care about.

In contrast, what you get from Bayesian statistics is a posterior distribution, like this, that shows the possible outcomes of the election and the probability of each outcome.  With a distribution like this, you can answer the questions you care about, like the probability of victory, or the probability of winning by more than 5 points.
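Here's a hedged sketch of that computation using the talk's poll numbers (52% for Alice) and an assumed sample size of 100, which the talk doesn't give.  With a uniform Beta(1, 1) prior, the posterior over Alice's true support is Beta(53, 49), and the probability of victory is the posterior probability that her support exceeds 50%.

```python
from statistics import NormalDist

votes_for, votes_against = 52, 48          # assumed sample of 100 voters
a, b = 1 + votes_for, 1 + votes_against    # Beta posterior parameters

# Normal approximation to the Beta posterior (fine for counts this large).
mean = a / (a + b)
var = a * b / ((a + b) ** 2 * (a + b + 1))
p_victory = 1 - NormalDist(mean, var ** 0.5).cdf(0.5)
print(round(p_victory, 2))
```

With these numbers the probability of victory comes out around two-thirds: a much more useful summary than a p-value of 0.34.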

And when you get new data, Bayes’s theorem tells you how to update the previous distribution to get a new distribution that incorporates the new data.  By the way, these figures are from Nate Silver’s blog; he’s the guy who correctly predicted the outcome of the 2008 presidential election in 49 out of 50 states, and in 2012 he got 50 out of 50, using Bayesian statistics.

As this example shows, Bayesian methods answer the questions we actually care about, unlike classical methods, and produce results in a form that makes information actionable.

Let me give you another example.  Suppose you are testing a new drug, A, compared to an existing drug B, and it looks like A is better, with a nice small p-value.  But A is more expensive.  If you’re a doctor, which drug should you prescribe?

Again, classical statistics provides almost no help with this kind of decision making.  But suppose I gave you results in the form of probabilities, like the probability of survival, or positive outcomes, or side effects.  Now if you have prices, you can do cost-effectiveness analysis, like dollars per life saved, or per quality adjusted life-year, and so on.  That kind of information is actually useful.

I’ll do one more example that’s a little more fun.  Suppose you are on The Price Is Right, and you are a contestant in the Showdown at the end of the show.  You get to see a prize and then you have to guess the price.  Your opponent sees a different prize and then they have to guess the price.  Whoever is closer, without going over, wins the prize.

Well, here’s what you could do.  If you know the distribution of prices for prizes in the past, you could use that as your prior.  And conveniently, there’s a super-fan on the Internet who has done that for you, so you can download the results.

Then you can use your price estimate to do a Bayesian update.  Here’s what it looks like: the darker line is the prior; the lighter line is the posterior, which is what you believe about the price after you see the prize.

Here’s what it looks like for the second contestant.  Again, the dark line is the prior, the light line is what your opponent believes after they see the second prize.

Now you can each perform an optimization that computes your expected gain depending on what you bid.  In this example, the best bid for you is about $21,000, and the best bid for your opponent is about $32,000.
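Here's a toy version of that optimization (my own simplification, not the talk's code, with made-up numbers): given a posterior pmf over the true price and a fixed opponent bid, compute the expected gain of each candidate bid under the "closest without going over" rule, and pick the best one.

```python
def expected_gain(bid, opponent_bid, posterior):
    """Expected winnings for `bid`, where posterior maps price -> probability."""
    total = 0.0
    for price, prob in posterior.items():
        if bid > price:
            continue                     # you overbid: you gain nothing
        # You win if the opponent overbid, or if your (non-overbid)
        # bid is higher and therefore closer to the true price.
        if opponent_bid > price or bid > opponent_bid:
            total += prob * price        # you win a prize worth the price
    return total

# Tiny illustrative posterior over the price of your showcase.
posterior = {18000: 0.2, 20000: 0.3, 22000: 0.3, 24000: 0.2}
opponent_bid = 21000

best = max(posterior, key=lambda bid: expected_gain(bid, opponent_bid, posterior))
print(best, expected_gain(best, opponent_bid, posterior))
```

Note the discontinuity the talk mentions: bidding slightly over the true price drops your gain to zero, which is why a discrete, computational treatment is easier than a continuous one.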

This example is mostly for fun, but it makes a point: Bayesian methods support complex decision making under uncertainty.  The Price Is Right problem is not easy; it involves some discontinuities that make it hard to do with continuous mathematics.  But using Bayesian statistics and some simple computational tools, it’s not that hard.

Ok, one more thing you hear about Bayesian methods is that they are too slow.  Well, compared to what?  If you want the wrong answer, you can have it very fast.  If you want the right answer, there are a few ways to get there.

For a lot of problems, you can get away with brute force.  It turns out that computers are fast, and computation is cheap.  For many real world problems, a simple implementation of Bayesian statistics is fast enough.  Maybe it takes a 10th of a second instead of a millionth of a second, but most of the time we don’t care.

If you do care, there are alternatives.  MCMC stands for Markov chain Monte Carlo, which is a killer computational technique for speeding up Bayesian methods, especially when the number of parameters you are trying to estimate gets big.

And if that’s not fast enough, sometimes there are analytic methods you can use.  Here’s an example I worked on that uses a Dirichlet distribution, one of the cases where you can perform a Bayesian update just by doing a few additions.

Ok, last myth I’ll talk about is the idea that Bayesian methods are hard.  Again, if you look at things like the Wikipedia page, it can be a little intimidating.  But that’s a problem with the math, not the methods.

The fundamental ideas are very simple, and if you take a computational approach, they are really not hard.

I’ll show you an example, starting with a toy problem and then using it to solve a real problem.  Suppose I have a box of dice where one is 4-sided, one is 6-sided, one 8-sided and one 12-sided.

I pick a die, but I don’t let you see it.  I roll the die and report that I got a six.  Then I ask, what is the probability that I rolled each die?  Immediately you can figure that I didn’t roll the 4-sided die.  And you might have intuition that the six-sided die is the most likely candidate.  But how do we quantify that intuition?

I’ll show you two ways to solve this problem, first on paper and then computationally.

Here’s a table you can use to solve problems like this when you have a small number of possible hypotheses.

I’ll fill in the first column, which is the list of possible dice.

Now, let’s assume that I was equally likely to choose any of the dice.  So the prior probability is the same for all of them.  I’ll set them to 1.  I could divide through and set them all to ¼ , but it turns out it doesn’t matter.

So that’s what you should believe before I roll the die and tell you the outcome.  Then you get some data: I tell you I rolled a 6.  What you have to do is compute the likelihood of the data under each hypothesis.  That is, for example, what’s the probability of rolling a 6 on a 4 sided die?  Zero.  What’s the probability of getting a 6 on a six-sided die?  One in six.  So I’ll fill those in.

The chance of rolling a 6 on an 8-sided die is one in 8.  On a 12-sided die it’s one in 12.  And that’s it.  Everything from here on is just arithmetic.  First we multiply the prior probabilities by the likelihoods.

The result is the posterior probabilities, but they are not normalized; that is, they don’t add up to 1.  We can normalize them by adding them up and dividing through.

But first, just to make the arithmetic easy, I’m going to multiply through by 24.

Now I’ll add them up… and divide through by 9.  So the posterior probabilities are 0 (the 4-sided die has been eliminated from the running), 4 ninths, 3 ninths, and 2 ninths.  As expected, the 6-sided die is the most likely.

Now here’s what that looks like computationally.  I’ve got a function, likelihood, that computes the likelihood of the data under each hypothesis.  The hypothesis is the number of sides on the die; the data is the outcome that I rolled, the 6.

If the outcome exceeds the number of sides, that’s impossible, so the likelihood is zero.  Otherwise, the likelihood is one out of the number of sides.  Once you provide a likelihood function, you’re done.  The Suite class knows how to do a Bayesian update; it does the same thing we just did with the table.
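If you don't have the Suite class from thinkbayes handy, here's a self-contained stand-in that does the same update we just did with the table:

```python
from fractions import Fraction

def likelihood(hypo, data):
    """hypo: number of sides on the die; data: the outcome rolled."""
    if data > hypo:
        return Fraction(0)        # can't roll above the number of sides
    return Fraction(1, hypo)      # otherwise every face is equally likely

hypos = [4, 6, 8, 12]
prior = {h: Fraction(1, len(hypos)) for h in hypos}

# Multiply each prior by the likelihood of rolling a 6, then normalize.
unnorm = {h: p * likelihood(h, 6) for h, p in prior.items()}
total = sum(unnorm.values())
posterior = {h: p / total for h, p in unnorm.items()}

# Posterior probabilities: 0, 4/9, 1/3 (= 3/9), and 2/9,
# matching the table computation above.
print(posterior)
```

Using Fraction keeps the arithmetic exact, so the results match the paper-and-pencil ninths exactly.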

So, we solved the dice problem, which might not seem very interesting.  But we also solved the German tank problem, which was very interesting during World War II.

When the Germans were making tanks, they allocated serial numbers to different factories, in different months, in blocks of 100.  But not all of the serial numbers got used.  So if you captured a tank and looked at the serial number, you could estimate how many tanks were made at a particular factory in a particular month.

To see how that works, let’s look at the likelihood function.  The hypothesis now is the number of tanks that were made, out of a possible 100.  The data is the serial number of a tank that was captured.

You might recognize this likelihood function.  It’s the same as the dice problem, except instead of 4 dice, we have 100 dice.

Here’s what the update looks like.  We create an object called Tank that represents the prior distribution, then update it with the data, the serial number 37.  And here’s what the posterior distribution looks like.

Everything below 37 has been eliminated, because that’s not possible.  The most likely estimate is 37.  But the other values up to 100 are also possible.  If you see more data, you can do another update and get a new posterior.
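Here's a self-contained sketch of that update (the talk's Tank object is a Suite subclass; this does the same computation inline):

```python
def likelihood(hypo, data):
    """hypo: total number of tanks made; data: a captured serial number."""
    if data > hypo:
        return 0.0               # can't capture a serial above the total made
    return 1.0 / hypo            # each serial in range is equally likely

hypos = range(1, 101)                     # 1 to 100 tanks
prior = {h: 1.0 / 100 for h in hypos}     # uniform prior

unnorm = {h: p * likelihood(h, 37) for h, p in prior.items()}
total = sum(unnorm.values())
posterior = {h: p / total for h, p in unnorm.items()}

# Everything below 37 is eliminated; 37 itself is the most likely value,
# and the probability falls off as 1/hypo above it.
best = max(posterior, key=posterior.get)
print(best)    # 37
```

The same likelihood function as the dice problem, just with 100 hypotheses instead of 4.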

It turns out that this works.  During WWII there were statisticians producing estimates like this (although not using exactly this method), and their estimates were consistently much lower than what was coming from conventional intelligence.  After the war, the production records were captured, and it turned out that the statistical estimates were much better.

So Bayesian methods are not as hard as people think.  In particular, if you use computational methods, you can get started very quickly.

I teach a class at Olin College for people who have almost no prior statistical knowledge, but they know how to program in Python.  In 7 weeks, they work on projects where they apply Bayesian methods to problems they choose and formulate, and I publish the good ones.

Here’s one where the students predicted the outcome of the 2015 Super Bowl, which the New England Patriots won.

Here’s one where they analyzed responses on Tinder, computing the probability that someone would respond.  It sounds a little bit sad, but it’s not really like that.  They were having some fun with it.

And here’s one that actually got a lot of attention: two students who used data from Game of Thrones to predict the probability that various characters would survive for another book or another chapter.

So people can get started with this very quickly and work on real world problems, although some of the problems are more serious than others.

In summary, most of what you’ve heard about Bayesian methods is wrong, or at least misleading.  The results are subjective, but so is everything we believe about the world.  Get over it.

Bayesian methods are not redundant; they are different, and better.  And we need them more than ever.

Bayesian methods can be computationally intensive, but there are lots of ways to deal with that.  And for most applications, they are fast enough, which is all that matters.

Finally, they are not that hard, especially if you take a computational approach.

Of course, I am not impartial.  I teach workshops where I introduce people to Bayesian statistics using Python.  I’ve got one coming up this weekend in Boston and another at PyCon in Portland Oregon.

And I’m also trying to help teachers get this stuff into the undergraduate curriculum.  It’s not just for graduate students!  In June I’m doing a one-day workshop for college instructors, along with my colleague, Sanjoy Mahajan.

And I’ve got a book on the topic, called Think Bayes, which you can read at

I’ve got a few more books that I recommend, but I won’t read them to you.  Let me get to here, where you can go to this URL to get my slides.  And here are a few ways you can get in touch with me if you want to follow up.

1 comment:

  1. This is great! We have been using Bayesian methods in economic science for quite some time. I am glad to see that engineering is, finally, starting to pick up on it.