Probably Overthinking It: Why are we so surprised?

Abstract: In theory, we should not be surprised by the outcome of the 2016 presidential election, but in practice we are. I think there are two reasons: in this article, I explain problems in the way we think about probability; in the next article, I present problems in the way the predictions were reported. In the end, I am sympathetic to people who were surprised, but I'll suggest some things we could do better next time.

Less surprising than flipping two heads

First, let me explain why I think we should not have been surprised. One major forecaster, FiveThirtyEight, estimated the probability of a Clinton win at 71%. To the degree that we have confidence in their methodology, we should be less surprised by the outcome than we would be if we tossed a coin twice and got heads both times; that is, not very surprised at all.

Even if you followed The New York Times Upshot and believed their final prediction, that Clinton had a 91% chance, the result is only slightly more surprising than tossing three coins and getting three heads; still, not very.
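To make the coin comparison concrete, here is a minimal Python sketch. It uses only the two forecast probabilities quoted above; the names `forecasts`, `all_heads`, and `trump_chance` are chosen for illustration.

```python
# P(Clinton win) as quoted in the text for each forecaster.
forecasts = {"FiveThirtyEight": 0.71, "Upshot": 0.91}

def all_heads(n):
    """Probability that n tosses of a fair coin all land heads."""
    return 0.5 ** n

# The observed outcome was a Trump win, so its probability is 1 - P(Clinton win).
trump_chance = {name: 1 - p for name, p in forecasts.items()}

for name, p in trump_chance.items():
    print(f"{name}: P(Trump win) = {p:.2f}")
for n in (2, 3, 4):
    print(f"{n} heads in a row: {all_heads(n):.4f}")
```

The FiveThirtyEight figure lands above the probability of two heads in a row, and the Upshot figure lands between three heads and four heads.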

Judging by the reaction, a lot of people were more surprised than that. And I have to admit that even though I thought the FiveThirtyEight prediction was the most credible, my own reaction suggests that at some level, I thought the probability of a Trump win was closer to 10% than 30%.

Probabilistic predictions

So why are we so surprised? I think a major factor is that few people have experience interpreting probabilistic predictions. As Nate Silver explained (before the fact):

The goal of a probabilistic model is not to provide deterministic predictions (“Clinton will win Wisconsin”) but instead to provide an assessment of probabilities and risks.

I conjecture that many people interpreted the results from the FiveThirtyEight models as the deterministic prediction that Clinton would win, with various degrees of confidence. If you think the outcome means that the prediction was wrong, that suggests you are treating the prediction as deterministic.

A related reason people might be surprised is that they did not distinguish between qualitative and quantitative probabilities. People use qualitative probabilities in conversation all the time; for example, someone might say, "I am 99% sure that it will rain tomorrow". But if you offer to bet on it at odds of 99 to 1, they might say, "I didn't mean it literally; I just mean it will probably rain".

As an example, in their post mortem article, "How Data Failed Us in Calling an Election", Steve Lohr and Natasha Singer wrote:

Virtually all the major vote forecasters, including Nate Silver’s FiveThirtyEight site, The New York Times Upshot and the Princeton Election Consortium, put Mrs. Clinton’s chances of winning in the 70 to 99 percent range.

Lohr and Singer imply that there was a consensus, that everyone said Clinton would win, and they were all wrong. If this narrative sounds right to you, that suggests that you are interpreting the predictions as deterministic and qualitative.

In contrast, if you interpret the predictions as probabilistic and quantitative, the narrative goes like this: there was no consensus; different models produced very different predictions (which should have been a warning). But the most credible sources all indicated that Trump had a substantial chance to win, and they were right.

Distances between probabilities

Qualitatively, there is not much difference between 70% and 99%. Maybe it's the difference between "likely" and "very likely".

But for probabilities, the difference between 70% and 99% is huge. The problem is that we are not good at comparing probabilities, because they don't behave like other things we measure.

For example, if you are trying to estimate the size of an object, and the range of measurements is from 70 inches to 99 inches, you might think:

1) The most likely value is the midpoint, near 85 inches,
2) That estimate might be off by 15/85, or 18%, and
3) There is some chance that the true value is more than 100 inches.

For probabilities, our intuitions about measurements are all wrong. Given the range between 70% and 99% probability, the most meaningful midpoint is 94%, not 85% (explained below), and there is no chance that the true value is more than 100%.

Also, it would be very misleading to say that the estimate might be off by 15 percentage points out of 85, or 18%. To see why, flip it around: if Clinton's chance to win is 70% to 99%, that means Trump's chance is 1% to 30%. Immediately, it is clearer that this is a very big range. If we think the most likely value is 15%, and it might be off by 15 percentage points, the relative error is 100%, not 18%.
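The same arithmetic can be checked in a few lines of Python. This is only a sketch: `relative_half_width` is a hypothetical helper, and it uses the exact midpoints (84.5% and 15.5%) rather than the rounded 85% and 15% in the text, so it gives about 17% and 94% instead of 18% and 100%.

```python
# Range of forecasts for a Clinton win, from the text.
low, high = 0.70, 0.99

def relative_half_width(lo, hi):
    """Half the width of the range, divided by its arithmetic midpoint."""
    midpoint = (lo + hi) / 2
    return (hi - lo) / 2 / midpoint

clinton_view = relative_half_width(low, high)         # range 70% to 99%
trump_view = relative_half_width(1 - high, 1 - low)   # range 1% to 30%

print(f"Relative error, Clinton's chance: {clinton_view:.0%}")
print(f"Relative error, Trump's chance:   {trump_view:.0%}")
```

The width of the range is identical in both views; only the midpoint it is compared with changes, which is why relative error is a poor guide for probabilities.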

To get a better understanding of the distances between probabilities, the best option is to express them in terms of log odds. For a probability, p, the corresponding odds are

o = p / (1-p)

and the log odds are log10(o).

The following table shows selected probabilities and their corresponding odds and log odds.

Prob    Odds    Log odds
5%      1:19    -1.28
50%     1:1      0.00
70%     7:3      0.37
94%     94:6     1.19
99%     99:1     2.00
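For readers who want to compute these values themselves, here is a short Python sketch of the two formulas above. The function names `odds` and `log_odds` are my own choices for illustration, and the odds are printed as decimals rather than ratios.

```python
import math

def odds(p):
    """Odds in favor, o = p / (1 - p)."""
    return p / (1 - p)

def log_odds(p):
    """Base-10 logarithm of the odds."""
    return math.log10(odds(p))

print("Prob    Odds  Log odds")
for p in (0.05, 0.50, 0.70, 0.94, 0.99):
    print(f"{p:4.0%} {odds(p):7.3f} {log_odds(p):9.2f}")
```

Negative log odds correspond to probabilities below 50%, and each unit of log odds is a factor of 10 in the odds.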

This table shows why, earlier, I said that the midpoint between 70% and 99% is 94%, because in terms of log odds, the distance is the same from 0.37 to 1.19 as from 1.19 to 2.0 (for now, I'll ask you to take my word that this is the most meaningful way to measure distance).

And now, to see why I said that the difference between 70% and 99% is huge, consider this: the distance from 70% to 99% is about the same as the distance from 70% to 5% (roughly 1.6 in log odds either way).

If credible election predictions had spanned the range from 5% to 70%, you would have concluded that there was no consensus, and you might have dismissed them all. But the range from 70% to 99% is just as big.
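Both claims, the 94% midpoint and the matching distances, can be checked with a small Python sketch. The helper names here (`logit10`, `inv_logit10`, `log_odds_midpoint`, `log_odds_distance`) are hypothetical, and the sketch assumes base-10 log odds as defined above.

```python
import math

def logit10(p):
    """Base-10 log odds of probability p."""
    return math.log10(p / (1 - p))

def inv_logit10(x):
    """Convert base-10 log odds back to a probability."""
    o = 10 ** x
    return o / (1 + o)

def log_odds_midpoint(p1, p2):
    """Probability halfway between p1 and p2 on the log odds scale."""
    return inv_logit10((logit10(p1) + logit10(p2)) / 2)

def log_odds_distance(p1, p2):
    """Absolute difference between p1 and p2 on the log odds scale."""
    return abs(logit10(p1) - logit10(p2))

midpoint = log_odds_midpoint(0.70, 0.99)
up = log_odds_distance(0.70, 0.99)
down = log_odds_distance(0.70, 0.05)

print(f"Midpoint of 70% and 99%: {midpoint:.1%}")
print(f"Distance 70% to 99%: {up:.2f}")
print(f"Distance 70% to 5%:  {down:.2f}")
```

The two distances are not exactly equal, but they agree to within a few hundredths of a log odds unit.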

In an ideal world, maybe we would use log odds, rather than probability, to talk about uncertainty; in some areas of science and engineering, people do. But that won't happen soon; in the meantime, we have to remember that our intuition for spatial measurements does not apply to probability.

Single case probabilities

So how should we interpret a probabilistic prediction like "Clinton has a 70% chance of winning"?

If I say that a coin has a 50% chance of landing heads, there is a natural interpretation of that claim in terms of long-run frequencies. If I toss the coin 100 times, I expect about 50 heads.

But a prediction like "Clinton has a 70% chance of winning" refers to a single case, which is notoriously hard to interpret. It is tempting to say that if we ran the election 100 times, we would expect Clinton to win 70 times, but that approach raises more problems than it solves.

Another option is to evaluate the long-run performance of the predictor rather than the predictee. For example, if we use the same methodology to predict the outcome of many elections, we could group the predictions by probability, considering the predictions near 10%, the predictions near 20%, and so on. In the long run, about 10% of the 10% predictions should come true, about 20% of the 20% predictions should come true, and so on. This approach is called calibrated probability assessment.
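As a rough sketch of what such a check might look like, the following Python groups predictions into bins and compares the average predicted probability with the observed frequency in each bin. The function `calibration_table` is hypothetical, and the data are synthetic: they simulate a forecaster who is perfectly calibrated by construction, not any real forecaster.

```python
import random
from collections import defaultdict

def calibration_table(predictions, outcomes, n_bins=10):
    """Group predictions into equal-width bins and return, for each
    non-empty bin, (mean predicted probability, observed frequency, count)."""
    bins = defaultdict(list)
    for p, happened in zip(predictions, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[index].append((p, happened))
    table = []
    for index in sorted(bins):
        pairs = bins[index]
        mean_pred = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(1 for _, h in pairs if h) / len(pairs)
        table.append((mean_pred, observed, len(pairs)))
    return table

# Synthetic data for illustration only: each event occurs with exactly
# the probability that was predicted for it.
rng = random.Random(0)
predictions = [rng.random() for _ in range(20000)]
outcomes = [rng.random() < p for p in predictions]

table = calibration_table(predictions, outcomes)
for mean_pred, observed, count in table:
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={count}")
```

With thousands of predictions per bin the two columns track each other closely. With only a handful of predictions per bin, the observed frequencies are too noisy to say much.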

For daily predictions, like weather forecasting, this kind of calibration is possible. In fact, it is one of the examples in Nate Silver's book, The Signal and the Noise. But for rare events like presidential elections, it is not practical.

At this point it might seem like probabilistic predictions are immune to contradiction, but that is not entirely true. Although Trump's victory does not prove that FiveThirtyEight and the other forecasters were wrong, it provides evidence about their trustworthiness.

Consider two hypotheses:

A: The methodology is sound and Clinton had a 70% chance.
B: The methodology is bogus and the prediction is just a random number from 0 to 100.

Under A, the probability of the outcome is 30%; under B it's 50%. So the outcome is evidence against A with a likelihood ratio of 3/5. If you were initially inclined to believe A with a confidence of 90%, this evidence should lower your confidence to 84% (applying Bayes rule). If your initial inclination was 50%, you should downgrade it to 37%. In other words, the outcome provides only weak evidence that the prediction was wrong.
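Here is a minimal Python sketch of that update, assuming A and B are the only two hypotheses under consideration. The function name `update` and its parameter names are my own.

```python
def update(prior_a, likelihood_a, likelihood_b):
    """Posterior probability of hypothesis A after seeing the outcome,
    given P(outcome | A) and P(outcome | B), with B the only alternative."""
    numerator = prior_a * likelihood_a
    return numerator / (numerator + (1 - prior_a) * likelihood_b)

# Hypothesis A: Clinton had a 70% chance, so P(Trump wins | A) = 0.30.
# Hypothesis B: the forecast is a random number, so P(Trump wins | B) = 0.50.
for prior in (0.9, 0.5):
    posterior = update(prior, likelihood_a=0.30, likelihood_b=0.50)
    print(f"prior {prior:.0%} -> posterior {posterior:.1%}")
```

The posteriors come out near 84% and 37.5%, in line with the figures above.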

Forecasters who went farther out on a limb took more damage. The New York Times Upshot predicted that Clinton had a 91% chance. For them, the outcome provides evidence against A with a Bayes factor of about 5.6 (50% divided by 9%). So if your confidence in A was 90% before the election, it should be about 62% now.

And any forecaster who said Clinton had a 99% chance has been strongly contradicted, with Bayes factor 50. A starting confidence of 90% should be reduced to 15%. As Nate Silver tweeted on election night:

In summary, the result of the election does not mean the forecasters were wrong. It provides evidence that anyone who said Clinton had a 99% chance is a charlatan. But it is still reasonable to believe that FiveThirtyEight, Upshot, and other forecasters using similar methodology were basically right.
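The three updates discussed above can be tabulated with one function that works in odds form. This is a sketch under the same assumptions as hypotheses A and B; `posterior_after_upset` is a hypothetical name. For the 91% forecast, the unrounded likelihood of 9% gives a Bayes factor near 5.6 and a posterior near 62%; rounding the likelihood to 10% gives a Bayes factor of 5 and a posterior of about 64%.

```python
def posterior_after_upset(prior_a, forecast, p_upset_if_bogus=0.5):
    """Odds-form Bayes update after the forecast favorite loses.

    Divides the prior odds for A by the Bayes factor against A, then
    converts the posterior odds back to a probability.
    Returns (Bayes factor against A, posterior probability of A)."""
    bayes_factor_against = p_upset_if_bogus / (1 - forecast)
    prior_odds = prior_a / (1 - prior_a)
    posterior_odds = prior_odds / bayes_factor_against
    return bayes_factor_against, posterior_odds / (1 + posterior_odds)

results = {}
for forecast in (0.70, 0.91, 0.99):
    results[forecast] = posterior_after_upset(0.9, forecast)
    bf, post = results[forecast]
    print(f"forecast {forecast:.0%}: Bayes factor {bf:5.1f}, "
          f"confidence 90% -> {post:.0%}")
```

The more extreme the forecast, the larger the Bayes factor against it and the further a 90% prior falls.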

So why are we surprised?

Many of us are more surprised than we should be because we misinterpreted the predictions before the election, and we are misinterpreting the results now. But it is not entirely our fault. I also think the predictions were presented in ways that made the problems worse. In the next article, I will review some of the best and worst presentations, and suggest ways we can do better.

If you are interested in the problem of single case probabilities, I wrote about it in this article about criminal recidivism.

Monday, November 14, 2016
