Tuesday, November 22, 2016

Problematic presentation of probabilistic predictions

Theoretically, people should not be surprised by the results of the 2016 election; several credible forecasters predicted that it would be a close race. But a lot of people were surprised anyway. In my previous article I explained one reason: many people misinterpret probabilistic predictions.

But it is not entirely our fault. In many cases, the predictions were presented in forms that contributed to the confusion. In this article, I review the forecasts and suggest ways we can present them better.

The two forecasters I followed in the weeks prior to the election were FiveThirtyEight and the New York Times's Upshot. If you visited the FiveThirtyEight forecast page, you saw something like this:

And if you visited the Upshot page, you saw

FiveThirtyEight and Upshot use similar methods, so they generate the same kind of results: probabilistic predictions expressed in terms of probabilities.

The problem with probability

The problem with predictions in this form is that people do not have a good sense for what probabilities mean. In my previous article I explained part of the issue -- probabilities do not behave like other kinds of measurements -- but there might be a more basic problem: our brains naturally represent uncertainty in the form of frequencies, not probabilities.

If this frequency format hypothesis is true, it suggests that people understand predictions like "Trump's chances are 1 in 4" better than "Trump's chances are 25%".

As an example, suppose one forecaster predicts that Trump has a 14% chance and another says he has a 25% chance. At first glance, those predictions seems consistent with each other: they both think Clinton is likely to win. But in terms of frequencies, one of those forecasts means 1 chance in 7; the other means 1 chance in 4. When we put it that way, it is clearer that there is a substantial difference.

The Upshot tried to help people interpret probabilities by expressing predictions in the form of American football field goal attempts:

Obviously this analogy doesn't help people who don't watch football, but even for die-hard fans, it doesn't help much. In my viewing experience, people are just as surprised when a kicker misses -- even though they shouldn't be -- as they were by the election results.

One of the best predictions I saw came in the form of frequencies, sort of. On November 8, Nate Silver tweeted this:

"Basically", he wrote, "these 3 cases are equally likely". The implication is that Trump had about 1 chance in 3. That's a little higher than the actual forecast, but I think it communicates the meaning of the prediction more effectively than a probability like 28.6%.

The problem with pink

I think Silver's tweet would be even better if it dropped the convention of showing swing states in pink (for "leaning Republican") and light blue (for "leaning Democratic"). One of the reasons people are unhappy with the predictions is that the predictive maps look like this one from FiveThirtyEight:

And the results look like this (from RealClear Politics):

The results don't look like the predictions because in the results all the states are either dark red or dark blue. There is no pink in the electoral college.

Now suppose you saw a prediction like this (which I generated here):

These three outcomes are equally likely;

Trump wins one of them, Clinton wins two.

With a presentation like this, I think:

1) Before the election, people would understand the predictions better, because they are presented in terms of frequencies like "1 in 3".

2) After the election, people would evaluate the accuracy of the forecasts more fairly, because the results would look like one of the predicted possibilities.

The problem with asymmetry

Consider these two summaries of the predictions:

1) "FiveThirtyEight gives Clinton a 71% chance and The Upshot gives her an 85% chance."

2) "The Upshot gives Trump a 15% chance and FiveThirtyEight gives him a 29% chance."

These two forms are mathematically equivalent, but they are not interpreted the same way. The first form makes it sound like everyone agrees that Clinton is likely to win. The second form makes it clearer that Trump has a substantial chance of winning, and makes it more apparent that the two predictions are substantially different.

If the way we interpret probabilities is asymmetric, as it seems to be, we should be careful to present them both ways. But most forecasters and media reported the first form, in terms of Clinton's probabilities, more prominently, and sometimes exclusively.

The problem with histograms

One of the best ways to explain probabilistic predictions is to run simulations and report the results. The Upshot presented simulation results using a histogram, where the x-axis is the number of electoral college votes for Clinton (notice the asymmetry) and the y-axis is the fraction of simulations where Clinton gets that number:

This visualization has one nice property: the blue area is proportional to Clinton's chances and the red area is proportional to Trump's. To the degree that we assess those areas visually, this is probably more comprehensible than numerical probabilities.

But there are a few problems:

1) I am not sure how many people understand this figure in the first place.

2) Even for people who understand it, the jagginess of the results makes it hard to assess the areas.

3) This visualization tries to do too many things. Is it meant to show that some outcomes are more likely than others, as the title suggests, or is it meant to represent the probabilities of victory visually?

4) Finally, this representation fails to make the possible outcomes concrete. Even if I estimate the areas accurately, I still walk away expecting Clinton to win, most likely in a landslide.

The problem with proportions

One of the reasons people have a hard time with probabilistic predictions is that they are relatively new. The 2004 election was the first where we saw predictions in the form of probabilities, at least in the mass media. Prior to that, polling results were reported in terms of proportions, like "54% of likely voters say they will vote for Alice, 40% for Bob, with 6% undecided, and a margin of error of 5%".

Reports like this didn't provide probabilities explicitly, but over time, people developed a qualitative sense of likelihood. If a candidate was ahead by 10 points in the poll, they were very likely to win; if they were only ahead by 2, it could go either way.

The problem is that proportions and probabilities are reported in the same units: percentages. So even if we know that probabilities and proportions are not the same thing, our intuition can mislead us.

For example, if a candidate is ahead in the polls by 70% to 30%, there is almost no chance they will lose. But if the probabilities are 70% and 30%, that's a very close race. I suspect that when people saw that Clinton had a 70% probability, some of their intuition for proportions leaked in.

Solution: publish the simulations

I propose a simple solution that addresses all of these problems: forecasters should publish one simulated election each day, with state-by-state results.

As an example, here is a fake FiveThirtyEight page I mocked up:

If people saw predictions like this every day, they would experience the range of possible outcomes, including close finishes and landslides. By election day, nothing would surprise them.

Publishing simulations in this form solves the problems I identified:

The problem with probabilities: If you checked this page daily for a week, you would see Clinton win 5 times and Trump win twice. If the frequency format hypothesis is true, this would get the prediction into your head in a way you understand naturally.

The problem with pink: The predictions would look like the results, because in each simulation the states are red or blue; there would be no pink or light blue.

The problem with asymmetry: This way of presenting results doesn't break symmetry, especially if the winner of each simulation is shown on the left.

The problem with histograms: By showing only one simulation per day, we avoid the difficulty of summarizing large numbers of simulations.

The problem with proportions: If we avoid reporting probabilities of victory, we avoid confusing them with proportions of the vote.

Publishing simulations also addresses two problems I haven't discussed:

The problem with precision: Some forecasters presented predictions with three significant digits, which suggests an unrealistic level of precision. The intuitive sense of likelihood you get from watching simulations is not precise, but the imprecision in your head is an honest reflection of the imprecision in the models.

The problem with the popular vote: Most forecasters continue to predict the popular vote and some readers still follow it, despite the fact (underscored by this election) that the popular vote is irrelevant. If it is consistent with the electoral college, it's redundant; otherwise it's just a distraction.

In summary, election forecasters used a variety of visualizations to report their predictions, but they were prone to misinterpretation. Next time around, we can avoid many of the problems by publishing the results of one simulated election each day.

Monday, November 14, 2016

Why are we so surprised?

Abstract In theory, we should not be surprised by the outcome of the 2016 presidential election, but in practice we are. I think there are two reasons: in this article, I explain problems in the way we think about probability; in the next article, I present problems in the way the predictions were reported. In the end, I am sympathetic to people who were surprised, but I'll suggest some things we could do better next time.

Less surprising than flipping two heads

First, let me explain why I think we should not have been surprised. One major forecaster, FiveThirtyEight estimated the probability of a Clinton win at 71%. To the degree that we have confidence in their methodology, we should be less surprised by the outcome than we would be if we tossed a coin twice and got heads both times; that is, not very surprised at all.

Even if you followed the The New York Times Upshot and believed their final prediction, that Clinton had a 91% chance, the result is slightly more surprising than tossing three coins and getting three heads; still, not very.

Judging by the reaction, a lot of people were more surprised than that. And I have to admit that even though I thought the FiveThirtyEight prediction was the most credible, my own reaction suggests that at some level, I thought the probability of a Trump win was closer to 10% than 30%.

Probabilistic predictions

So why are we so surprised? I think a major factor is that few people have experience interpreting probabilistic predictions. As Nate Silver explained (before the fact):

The goal of a probabilistic model is not to provide deterministic predictions (“Clinton will win Wisconsin”) but instead to provide an assessment of probabilities and risks.

I conjecture that many people interpreted the results from the FiveThirtyEight models as the deterministic prediction that Clinton would win, with various degrees of confidence. If you think the outcome means that the prediction was wrong, that suggests you are treating the prediction as deterministic.

A related reason people might be surprised is that they did not distinguish between qualitative and quantitative probabilities. People use qualitative probabilities in conversation all the time; for example, someone might say, "I am 99% sure that it will rain tomorrow". But if you offer to bet on it at odds of 99 to 1, they might say, "I didn't mean it literally; I just mean it will probably rain".

As an example, in their post mortem article, "How Data Failed Us in Calling an Election", Steve Lohr and Natasha Singer wrote:

Virtually all the major vote forecasters, including Nate Silver’s FiveThirtyEight site, The New York Times Upshot and the Princeton Election Consortium, put Mrs. Clinton’s chances of winning in the 70 to 99 percent range.

Lohr and Singer imply that there was a consensus, that everyone said Clinton would win, and they were all wrong. If this narrative sounds right to you, that suggests that you are interpreting the predictions as deterministic and qualitative.

In contrast, if you interpret the predictions as probabilistic and quantitative, the narrative goes like this: there was no consensus; different models produced very different predictions (which should have been a warning). But the most credible sources all indicated that Trump had a substantial chance to win, and they were right.

Distances between probabilities

Qualitatively, there is not much difference between 70% and 99%. Maybe it's the difference between "likely" and "very likely".

But for probabilities, the difference between 70% and 99% is huge. The problem is that we are not good at comparing probabilities, because they don't behave like other things we measure.

For example, if you are trying to estimate the size of an object, and the range of measurements is from 70 inches to 99 inches, you might think:

1) The most likely value is the midpoint, near 85 inches,
2) That estimate might be off by 15/85, or 18%, and
3) There is a some chance that the true value is more than 100 inches.

For probabilities, our intuitions about measurements are all wrong. Given the range between 70% and 99% probability, the most meaningful midpoint is 94%, not 85% (explained below), and there is no chance that the true value is more than 100%.

Also, it would be very misleading to say that the estimate might be off by 15/85 percentage points, or 18%. To see why, see what happens if we flip it around: if Clinton's chance to win is 70% to 99%, that means Trump's chance is 1% to 30%. Immediately, it is clearer that this is a very big range. If we think the most likely value is 15%, and it might be off by 15%, the relative error is 100%, not 18%.

To get a better understanding of the distances between probabilities, the best option is to express them in terms of log odds. For a probability, p, the corresponding odds are

o = p / (1-p)

and the log odds are log10(o).

The following table shows selected probabilities and their corresponding odds and log odds.

Prob Odds Log odds
5% 1:19 -1.26
50% 1 0
70% 7:3 0.37
94% 94:6 1.19
99% 99:1 2.00

This table shows why, earlier, I said that the midpoint between 70% and 99% is 94%, because in terms of log odds, the distance is the same from 0.37 to 1.19 as from 1.19 to 2.0 (for now, I'll ask you to take my word that this is the most meaningful way to measure distance).

And now, to see why I said that the difference between 70% and 99% is huge, consider this: the distance from 70% to 99% is the same as the distance from 70% to 5%.

If credible election predictions had spanned the range from 5% to 70%, you would have concluded that there was no consensus, and you might have dismissed them all. But the range from 70% to 99% is just as big.

In an ideal world, maybe we would use log odds, rather than probability, to talk about uncertainty; in some areas of science and engineering, people do. But that won't happen soon; in the meantime, we have to remember that our intuition for spatial measurements does not apply to probability.

Single case probabilities

So how should we interpret a probabilistic prediction like "Clinton has a 70% chance of winning"?

If I say that a coin has a 50% chance of landing heads, there is a natural interpretation of that claim in terms of long-run frequencies. If I toss the coin 100 times, I expect about 50 heads.

But a prediction like "Clinton has a 70% chance of winning" refers to a single case, which is notoriously hard to interpret. It is tempting to say that if we ran the election 100 times, we would expect Clinton to win 70 times, but that approach raises more problems than it solves.

Another option is to evaluate the long-run performance of the predictor rather than the predictee. For example, if we use the same methodology to predict the outcome of many elections, we could group the predictions by probability, considering the predictions near 10%, the predictions near 20%, and so on. In the long run, about 10% of the 10% predictions should come true, about 20% of the 20% predictions should come true, and so on. This approach is called calibrated probability assessment.

For daily predictions, like weather forecasting, this kind of calibration is possible. In fact, it is one of the examples in Nate Silver's book, The Signal and the Noise. But for rare events like presidential elections, it is not practical.

At this point it might seem like probabilistic predictions are immune to contradiction, but that is not entirely true. Although Trump's victory does not prove that FiveThirtyEight and the other forecasters were wrong, it provides evidence about their trustworthiness.

Consider two hypotheses:

A: The methodology is sound and Clinton had a 70% chance.
B: The methodology is bogus and the prediction is just a random number from 0 to 100.

Under A, the probability of the outcome is 30%; under B it's 50%. So the outcome is evidence against A with a likelihood ratio of 3/5. If you were initially inclined to believe A with a confidence of 90%, this evidence should lower your confidence to 84% (applying Bayes rule). If your initial inclination was 50%, you should downgrade it to 37%. In other words, the outcome provides only weak evidence that the prediction was wrong.

Forecasters who went farther out on a limb took more damage. The New York Times Upshot predicted that Clinton had a 91% chance. For them, the outcome provides evidence against A with a Bayes factor of 5. So if your confidence in A was 90% before the election, it should be 64% now.

And any forecaster who said Clinton had a 99% chance has been strongly contradicted, with Bayes factor 50. A starting confidence of 90% should be reduced to 15%. As Nate Silver tweeted on election night:

In summary, the result of the election does not mean the forecasters were wrong. It provides evidence that anyone who said Clinton had a 99% chance is a charlatan. But it is still reasonable to believe that FiveThirtyEight, Upshot, and other forecasters using similar methodology were basically right.

So why are we surprised?

Many of us are more surprised than we should be because we misinterpreted the predictions before the election, and we are misinterpreting the results now. But it is not entirely our fault. I also think the predictions were presented in ways that made the problems worse. Next time I will review some of the best and worst, and suggest ways we can do better next time.

If you are interested in the problem of single case probabilities, I wrote about it in this article about criminal recidivism.

Tuesday, November 8, 2016

Election day, finally!

Abstract: I defend my claim that voter suppression is 1000 times worse than voter fraud, and furthermore, that voter suppression introduces systematic bias, whereas fraud introduces noise. And from a statistical point of view, bias is much worse than noise.

My run to the polls

Today I left work around 11am, ran 2.5 miles from Olin College to Needham High School, voted, and ran back, all in less than an hour. It was all very pleasant, but I am acutely aware of how lucky I am:

I live in a country where I can vote (even if some of the choices stink),
I am not one of the hundreds of thousands of people who are permanently disenfranchised because of non-violent drug offenses,
I have a job that allows the flexibility to take an hour off to vote,
I have a job,
I have the good health to run 5 miles round trip,
I live in an affluent suburb where I don't have to wait in line to vote,
I have a government-issued photo ID, and
I don't need one to vote in Massachusetts.

And I could go on. When I got back, I posted this tweet, which prompted the replies below:

Data, please.

As I said, my primary point is that voting should be equally easy for everyone; I didn't really intend to get into the voter ID debate. But the challenge my correspondent raised is legitimate. I posited a number, and I should be able to back it up with data.

So let me see if I can defend my claim that voter suppression is a 1000x bigger problem than voter fraud.

According to the ACLU, there are 21 million US citizens who don't have government-issued photo ID. They reference "research" but don't actually provide a citation. So that's not ideal, but even if they are off by a factor of 2, we are talking about 10 million people.

Presumably some of them are not eligible voters. And some of them, who live in states that don't require voter ID, would get one if they needed it to vote. But even if we assume that 90% of them live in states that don't require ID, and all of them would get ID if required, we are still talking about 1 million people who would be prevented from voting because they don't have ID.

So I am pretty comfortable with the conclusion that the number of people whose vote would be suppressed by voter ID laws is in the millions.

Now, how many people commit election fraud that would be prevented by voter ID laws? This Wall Street Journal article cites research that counts people convicted of voter fraud, which seems to be on the order of hundreds per election at most. Even if we assume that only 10% of perpetrators are caught and convicted, the total number of cases is only in the thousands.

Is that reasonable? Should we expect there to be more? I don't think so. Impersonating a voter in order to cast an illegal ballot is probably very rare, for a good reason: it is a risky strategy that is vanishingly unlikely to be effective. And in a presidential election, with a winner-takes-all electoral college, the marginal benefit of a few extra votes is remarkably close to nil.

So I would not expect many individuals to attempt voter fraud. Also, organizing large scale voter impersonation would be difficult or impossible, and as far as I know, no one has tried in a recent presidential election.

In summary, it is likely that the suppressive effect of voter ID laws is on the order of millions (or more), and the voter fraud it might prevent is on the order of thousands (or less). So that's where I get my factor of 1000.

But what about bias?

Let me add one more point: even if the ratio were smaller, I would still be more concerned about voter suppression, because:

Voter suppression introduces systematic bias: It is more likely to affect "low-income individuals, racial and ethnic minority voters, students, senior citizens, [and] voters with disabilities", according to the ACLU.
Unless the tendency to commit fraud is strongly correlated with political affiliation or other socioeconomic factors, fraud introduces random noise.

And as any statistician will tell you, I'll take noise over bias any day, and especially on election day.

UPDATE: The New York Times Upshot provides evidence and reasons to think that claims of large scale election fraud are false.

Tuesday, November 1, 2016

The Skeet Shooting problem

Last week I posted the Skeet Shooting Problem:

At the 2016 Summer Olympics in the Women's Skeet event, Kim Rhode faced Wei Meng in the bronze medal match. After 25 shots, they were tied, sending the match into sudden death. In each round of sudden death, each competitor shoots at two targets. In the first three rounds, Rhode and Wei hit the same number of targets. Finally in the fourth round, Rhode hit more targets, so she won the bronze medal, making her the first Summer Olympian to win an individual medal at six consecutive summer games. Based on this information, should we infer that Rhode and Wei had an unusually good or bad day?

As background information, you can assume that anyone in the Olympic final has about the same probability of hitting 13, 14, 15, or 16 out of 25 targets.

Now here's a solution:

As part of the likelihood function, I'll use binom.pmf, which computes the Binomial PMF.
In the following example, the probability of hitting k=10 targets in n=25 attempts, with probability p=13/15 of hitting each target, is about 8%.

In [16]:

from scipy.stats import binom

k = 10
n = 25
p = 13/25
binom.pmf(k, n, p)

Out[16]:

0.078169085615240511

The following function computes the likelihood of a tie (or no tie) after a given number of shots, n, given the hypothetical value of p.
It loops through the possible number of hits, k, from 0 to n and uses binom.pmf to compute the probability that each shooter hits k targets.
To get the probability that both shooters hit k targets, we square the result.
To get the total likelihood of the outcome, we add up the probability for each value of k.

In [17]:

def likelihood(data, hypo):
    """Likelihood of data under hypo.
        
    data: tuple of (number of shots, 'tie' or 'no tie')
    hypo: hypothetical number of hits out of 25
    """
    p = hypo / 25
    n, outcome = data
    like = sum([binom.pmf(k, n, p)**2 for k in range(n+1)])
    return like if outcome=='tie' else 1-like

Let's see what that looks like for n=2

In [18]:

data = 2, 'tie'

hypos = range(0, 26)
likes = [likelihood(data, hypo) for hypo in hypos]
thinkplot.Plot(hypos, likes)
thinkplot.Config(xlabel='Probability of a hit (out of 25)',
                 ylabel='Likelihood of a tie',
                 ylim=[0, 1])

As we saw in the Sock Drawer problem and the Alien Blaster problem, the probability of a tie is highest for extreme values of p, and minimized when p=0.5.
The result is similar, but more extreme, when n=25:

In [19]:

data = 25, 'tie'

hypos = range(0, 26)
likes = [likelihood(data, hypo) for hypo in hypos]
thinkplot.Plot(hypos, likes)
thinkplot.Config(xlabel='Probability of a hit (out of 25)',
                 ylabel='Likelihood of a tie',
                 ylim=[0, 1])

In the range we care about (13 through 16 hits out of 25) this curve is pretty flat, which means that a tie after the round of 25 doesn't discriminate strongly among the hypotheses.
We could use this likelihood function to run the update, but just for purposes of demonstration, I'll use the Suite class from thinkbayes2:

In [20]:

from thinkbayes2 import Suite

class Skeet(Suite):
    
    def Likelihood(self, data, hypo):
        """Likelihood of data under hypo.
        
        data: tuple of (number of shots, 'tie' or 'no tie')
        hypo: hypothetical number of hits out of 25
        """
        p = hypo / 25
        n, outcome = data
        like = sum([binom.pmf(k, n, p)**2 for k in range(n+1)])
        return like if outcome=='tie' else 1-like

Now I'll create the prior.

In [21]:

suite = Skeet([13, 14, 15, 16])
suite.Print()

The prior mean is 14.5.

In [22]:

suite.Mean()

Out[22]:

14.5

Here's the update after the round of 25.

In [23]:

suite.Update((25, 'tie'))
suite.Print()

13 0.245787744767
14 0.247411480833
15 0.250757985003
16 0.256042789397

The higher values are a little more likely, but the effect is pretty small.
Interestingly, the rounds of n=2 provide more evidence in favor of the higher values of p.

In [24]:

suite.Update((2, 'tie'))
suite.Print()

13 0.240111712333
14 0.243807684231
15 0.251622537592
16 0.264458065844

In [25]:

suite.Update((2, 'tie'))
suite.Print()

13 0.234458172307
14 0.240145161205
15 0.252373188397
16 0.273023478091

In [26]:

suite.Update((2, 'tie'))
suite.Print()

13 0.228830701632
14 0.236427057892
15 0.253007722855
16 0.28173451762

After three rounds of sudden death, we are more inclined to think that the shooters are having a good day.
The fourth round, which ends with no tie, provides a small amount of evidence in the other direction.

In [27]:

suite.Update((2, 'no tie'))
suite.Print()

13 0.2323322732
14 0.23878553469
15 0.252684685857
16 0.276197506253

And the posterior mean, after all updates, is a little higher than 14.5, where we started.

In [28]:

suite.Mean()

Out[28]:

14.572747425162735

In summary, the outcome of this match, with four rounds of sudden death, provides weak evidence that the shooters were having a good day.
In general for this kind of contest, a tie is more likely if the probability of success is very high or low:

In the Alien Blaster problem, the hypothetical value of p are all less than 50%, so a tie causes us to revise beliefs about p downward.
In the Skeet Shooter problem, the hypothetical values are greater than 50%, so ties make us revise our estimates upward.

Using a continuous prior¶

For simplicity, I started with a discrete prior with only 4 values. But the analysis we just did generalizes to priors with an arbitrary number of values. As an example I'll use a discrete approximation to a Beta distribution.
Suppose that among the population of Olympic skeet shooters, the distribution of p (the probability of hitting a target) is well-modeled by a Beta distribution with parameters (15, 10). Here's what that distribution looks like:

In [29]:

from thinkbayes2 import Beta

beta = Beta(15, 10).MakePmf()
thinkplot.Pdf(beta)
thinkplot.Config(xlabel='Probability of a hit',
                 ylabel='PMF')

We can use that distribution to intialize the prior:

In [30]:

prior = Skeet(beta)
prior.Mean() * 25

Out[30]:

14.999999999999991

And then perform the same updates:

In [31]:

posterior = prior.Copy()
posterior.Update((25, 'tie'))
posterior.Update((2, 'tie'))
posterior.Update((2, 'tie'))
posterior.Update((2, 'tie'))
posterior.Update((2, 'no tie'))

Out[31]:

0.088322129669977739

Here are the prior and posterior distributions. You can barely see the difference:

In [32]:

thinkplot.Pdf(prior, color='red', alpha=0.5)
thinkplot.Pdf(posterior, color='blue', alpha=0.5)
thinkplot.Config(xlabel='Probability of a hit',
                 ylabel='PMF')

And the posterior mean is only slightly higher.

In [33]:

posterior.Mean() * 25

Out[33]:

15.041026161545973