Monday, November 16, 2015

Internet use and religious affiliation in Europe

A few years ago I wrote a paper about Internet use and religious affiliation using data from the General Social Survey (GSS).  After controlling for things like education, income, and religious upbringing, I found that people who use the Internet more are less likely to have a religious affiliation; that is, they are more likely to be "Nones".  I argued that this effect is plausibly causal and, under that assumption, estimated that it accounts for about 20% of the decrease in religious affiliation in the U.S. between 1990 and 2010.  Last April it got some media attention (including this brief discussion on Fox News); here is the blog post I wrote about it at the time.

Since then, I have been planning to replicate the study using data from the European Social Survey, but I didn't get back to it until a few weeks ago.  I was reminded about it because they recently released data from Round 7, conducted in 2014.  I am always excited about new data, but in this case it doesn't help me: in Rounds 6 and 7 they did not ask about Internet use.

But they did ask in Rounds 1 through 5, collected between 2002 and 2010.  So that's something.  I downloaded the data, loaded it into Pandas dataframes, and started cleaning, validating, and doing some preliminary exploration.  This IPython notebook shows what the first pass looks like.

In the interest of open, replicable science, I am posting preliminary work here, but at this point we should not take the results too seriously.

Data inventory

The dependent variables I plan to study are

rlgblg: Do you consider yourself as belonging to any particular religion or denomination?

rlgdgr: Regardless of whether you belong to a particular religion, how religious would you say you are?

The explanatory variables are

yrbrn: And in what year were you born?

hinctnta: Using this card, please tell me which letter describes your household's total income, after tax and compulsory deductions, from all sources? If you don't know the exact figure, please give an estimate. Use the part of the card that you know best: weekly, monthly or annual income.

eduyrs: About how many years of education have you completed, whether full-time or part-time? Please report these in full-time equivalents and include compulsory years of schooling.

tvtot: On an average weekday, how much time, in total, do you spend watching television?

rdtot: On an average weekday, how much time, in total, do you spend listening to the radio?

nwsptot: On an average weekday, how much time, in total, do you spend reading the newspapers?

netuse: Now, using this card, how often do you use the internet, the World Wide Web or e-mail - whether at home or at work - for your personal use?

Recodes

Income: I created a variable, hinctnta5, which subtracts 5 from hinctnta, so the mean is near 0.  This shift makes the parameters of the model easier to interpret.

Year born: Similarly, I created yrbrn60, which subtracts 1960 from yrbrn.

Years of education: The distribution of eduyrs includes some large values that might be errors, and the question was posed differently in the first few rounds.  I will investigate more carefully later, but for now I am replacing values greater than 25 years with 25, and subtracting off the mean, 12, to create eduyrs12.
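These recodes are simple shifts and caps.  As a sketch (with made-up values; the column names follow the ESS variables above), the Pandas version looks something like this:

```python
import pandas as pd

# Toy rows standing in for the ESS extract (values are made up).
df = pd.DataFrame({
    'yrbrn': [1948, 1975, 1990],
    'hinctnta': [3, 5, 9],
    'eduyrs': [12, 16, 40],   # 40 is the kind of outlier being capped
})

# Shift income and year of birth so the means are near 0.
df['hinctnta5'] = df['hinctnta'] - 5
df['yrbrn60'] = df['yrbrn'] - 1960

# Replace education values greater than 25 with 25, then subtract 12.
df['eduyrs12'] = df['eduyrs'].clip(upper=25) - 12
```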


Results

Just to get a quick look at things, I ran a logistic regression with rlgblg as the dependent variable, using data from Rounds 1-5 and including all countries.  The sample size is 229,307.  Here are the estimated parameters (computed by StatsModels):
              coef   std err        z    P>|z|   [95.0% Conf. Int.]
Intercept   0.9811     0.014   70.014    0.000     0.954    1.009
yrbrn60    -0.0078     0.000  -27.826    0.000    -0.008   -0.007
eduyrs12   -0.0376     0.001  -29.619    0.000    -0.040   -0.035
hinctnta5  -0.0220     0.002  -12.934    0.000    -0.025   -0.019
tvtot      -0.0161     0.002   -7.205    0.000    -0.021   -0.012
rdtot      -0.0149     0.002   -8.826    0.000    -0.018   -0.012
nwsptot    -0.0320     0.004   -8.924    0.000    -0.039   -0.025
netuse     -0.0758     0.002  -42.062    0.000    -0.079   -0.072

The parameters are all statistically significant with very small p-values.  And they are all negative, which indicates:

  • Younger people are less likely to report a religious affiliation.
  • More educated people are less likely...
  • People with higher income are less likely...
  • People who consume more media (television, radio, newspaper) are less likely...
  • People who use the Internet more are less likely...
The effect of Internet use (per hour per week) appears to be about twice the effect of reading the newspaper, which is about twice the effect of television or radio.

The effect of the Internet is comparable to about a decade of age, two years of education, or 3 deciles of income.
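These comparisons are just arithmetic on the coefficients in the table above; a quick check (the coefficient values are copied from the model):

```python
import math

# Coefficients from the rlgblg model above.
netuse, yrbrn60, eduyrs12, hinctnta5 = -0.0758, -0.0078, -0.0376, -0.0220

# One step on the netuse scale multiplies the odds of affiliation
# by exp(-0.0758), about a 7% drop in the odds.
odds_ratio = math.exp(netuse)   # about 0.927

# The netuse coefficient is roughly a decade of age, two years of
# education, or three deciles of income:
print(10 * yrbrn60)    # about -0.078
print(2 * eduyrs12)    # about -0.0752
print(3 * hinctnta5)   # about -0.066
```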


Most of these results are consistent with what I saw in my previous study and what other studies have shown.  One exception is income: in other studies, the usual pattern is that people in the lowest income groups are less likely to be affiliated, and after that, income has no effect.  We'll see if this preliminary result holds up.

I ran a similar model using rlgdgr (degree of religiosity) as the dependent variable:
              coef   std err        t    P>|t|   [95.0% Conf. Int.]
Intercept   5.6668     0.019  300.071    0.000     5.630    5.704
yrbrn60    -0.0180     0.000  -47.352    0.000    -0.019   -0.017
eduyrs12   -0.0688     0.002  -40.159    0.000    -0.072   -0.065
hinctnta5  -0.0266     0.002  -11.397    0.000    -0.031   -0.022
tvtot      -0.0801     0.003  -26.334    0.000    -0.086   -0.074
rdtot      -0.0179     0.002   -7.791    0.000    -0.022   -0.013
nwsptot    -0.0531     0.005  -10.873    0.000    -0.063   -0.044
netuse     -0.1020     0.002  -40.942    0.000    -0.107   -0.097

The results are similar.  Again, this IPython notebook has the details.

Limitations

Again, we should not take these results too seriously yet:

  1. So far I am not taking into account the weights associated with respondents, either within or across countries.  So for now I am oversampling people in small countries, as well as some groups within countries.
  2. At this point I haven't done anything careful to fill missing values, so the results will change when I get that done.
  3. And I think it will be more meaningful to break the results down by country.
Stay tuned.  More coming soon!





Friday, November 13, 2015

Recidivism and logistic regression

In my previous article, I presented the problem of estimating a criminal's risk of recidivism, focusing on the philosophical problems of attributing a probability to an individual.  In this article I turn to the more practical problem of doing the estimation.

My collaborator asked me to compare two methods and explain their assumptions, properties, pros and cons:

1) Logistic regression: This is the standard method in the field.  Researchers have designed a survey instrument that assigns each offender a score from -3 to 12.  Using data from previous offenders, they estimate the parameters of a logistic regression model and generate a predicted probability of recidivism for each score.

2) Post test probabilities: An alternative of interest to my collaborator is the idea of treating survey scores the way some tests are used in medicine, with a threshold that distinguishes positive and negative results.  Then for people who "test positive" we can compute the post-test probability of recidivism.

In addition to these two approaches, I considered three other models:

3) Independent estimates for each group: In the previous article, I cited the notion that "the probability of recidivism for an individual offender will be the same as the observed recidivism rate for the group to which he most closely belongs." (Harris and Hanson 2004).  If we take this idea to an extreme, the estimate for each individual should be based only on observations of the same risk group.  The logistic regression model is holistic in the sense that observations from each group influence the estimates for all groups.  An individual who scores a 6, for example, might reasonably object if the inclusion of offenders from other risk classes has the effect of increasing the estimated scores for his own class.  To evaluate this concern, I also considered a simple model where the risk in each group is estimated independently.

4) Logistic regression with a quadratic term: Simple logistic regression is based on the assumption that the scores are linear in the sense that each increment corresponds to the same increase in risk; for example, it assumes that the odds ratio between groups 1 and 2 is the same as the odds ratio between groups 9 and 10.  To test that assumption, I ran a model that includes the square of the scores as a predictive variable.  If risks are actually non-linear, this quadratic term might provide a better fit to the data.  But it doesn't, so the linear assumption holds up, at least to this simple test.

5) Bayesian logistic regression: This is a natural extension of logistic regression that takes as inputs prior distributions for the parameters, and generates posterior distributions for the parameters and predictive distributions for the risks in each group.  For this application, I didn't expect this approach to provide any great advantage over conventional logistic regression, but I think the Bayesian version is a powerful and surprisingly simple idea, so I used this project as an excuse to explore it.

Details of these methods are in this IPython notebook; I summarize the results below.

Logistic regression

My collaborator provided a dataset with 10-year recidivism outcomes for 703 offenders.  The definition of recidivism in this case is that the individual was charged with another offense within 10 years of release (but not necessarily convicted).  The range of scores for offenders in the dataset is from -2 to 11.

I estimated parameters using the StatsModels module.  The following figure shows the estimated probability of recidivism for each score and the predictive 95% confidence interval.



A striking feature of this graph is the range of risks, from less than 10% to more than 60%.  This range suggests that it is important to get this right, both for the individuals involved and for society.

[One note: the logistic regression model is linear when expressed in log-odds, so it is not a straight line when expressed in probabilities.]
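The fit in the figure comes from StatsModels, but the model itself is small enough to sketch from scratch.  The following is an illustration only, not the real analysis: it generates made-up (score, outcome) pairs and fits the two parameters of the logistic model by Newton's method.

```python
import math
import random

def logistic(x):
    return 1 / (1 + math.exp(-x))

# Made-up data: 50 offenders per score, with risk rising in the score.
# The "true" parameters (-1.5, 0.25) are arbitrary choices for this sketch.
random.seed(1)
data = [(s, int(random.random() < logistic(-1.5 + 0.25 * s)))
        for s in range(-2, 12) for _ in range(50)]

# Fit intercept b0 and slope b1 by Newton's method on the log-likelihood.
b0 = b1 = 0.0
for _ in range(20):
    ps = [logistic(b0 + b1 * s) for s, _ in data]
    g0 = sum(y - p for (_, y), p in zip(data, ps))            # gradient
    g1 = sum(s * (y - p) for (s, y), p in zip(data, ps))
    h00 = sum(p * (1 - p) for p in ps)                        # Hessian
    h01 = sum(s * p * (1 - p) for (s, _), p in zip(data, ps))
    h11 = sum(s * s * p * (1 - p) for (s, _), p in zip(data, ps))
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

# Predicted risk for each score; it rises monotonically with the score.
risks = {s: logistic(b0 + b1 * s) for s in range((-2), 12)}
```

StatsModels performs the same maximization, and also computes the standard errors behind the confidence band in the figure.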

Post test probabilities

In the context of medical testing, a "post-test probability/positive" (PTPP) is the answer to the question, "Given that a patient has tested positive for a condition of interest (COI), what is the probability that the patient actually has the condition, as opposed to a false positive test?"  The advantage of PTPP in medicine is that it addresses the question that is arguably most relevant to the patient.

In the context of recidivism, if we want to treat risk scores as a binary test, we have to choose a threshold that splits the range into low and high risk groups.  As an example, if I choose "6 or more" as the threshold, I can compute the test's sensitivity, specificity, and PTPP:
  1. With threshold 6, the sensitivity of the test is 42%, meaning that the test successfully identifies (or predicts) 42% of recidivists.
  2. The specificity is 76%, which is the fraction of non-recidivists who correctly test negative.
  3. The PTPP is 42%, meaning that 42% of the people who test positive will reoffend.  [Note: it is only a coincidence in this case that sensitivity and PTPP are approximately equal.]
One of the questions my collaborator asked is whether it's possible to extend this analysis to generate a posterior distribution for PTPP, rather than a point estimate.  In the notebook I show how to do that using beta distributions for the priors and posteriors.  For this example, the 95% posterior credible interval is 35% to 49%.
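To make those computations concrete, here is a sketch using hypothetical confusion-matrix counts (the real counts are in the notebook; these are chosen so the rates come out close to the ones reported above):

```python
import math

# Hypothetical counts for threshold "6 or more": true positives,
# false negatives, false positives, true negatives.
tp, fn, fp, tn = 86, 120, 119, 378

sensitivity = tp / (tp + fn)   # about 0.42
specificity = tn / (tn + fp)   # about 0.76
ptpp = tp / (tp + fp)          # about 0.42

# With a uniform prior, the posterior for PTPP is Beta(tp+1, fp+1).
a, b = tp + 1, fp + 1

def beta_pdf(p):
    logc = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(logc + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

# Crude 95% credible interval from the discretized posterior.
grid = [i / 10000 for i in range(1, 10000)]
pdf = [beta_pdf(p) for p in grid]
total = sum(pdf)
cum, low, high = 0.0, None, None
for p, d in zip(grid, pdf):
    cum += d / total
    if low is None and cum >= 0.025:
        low = p
    if high is None and cum >= 0.975:
        high = p
# low and high come out near 0.35 and 0.49
```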

Now suppose we customize the test for each individual.  So if someone scores 3, we would set the threshold to 3 and compute a PTPP and a credible interval.  We could treat the result as the probability of recidivism for people who test 3 or higher.  The following figure shows the results (in blue) superimposed on the results from logistic regression (in grey).


A few features are immediately apparent:
  1. At the low end of the range, PTPP is substantially higher than the risk estimated by logistic regression.
  2. At the high end, PTPP is lower.
  3. With this dataset, it's not possible to compute PTPP for the lowest and highest scores (as contrasted with logistic regression, which can extrapolate).
  4. For the lowest risk scores, the credible intervals on PTPP are a little smaller than the confidence intervals of logistic regression.
  5. For the high risk scores, the credible intervals are very large, due to the small amount of data.
So there are practical challenges in estimating PTPPs with a small dataset.  There are also philosophical challenges.  If the goal is to estimate the risk of recidivism for an individual, it's not clear that PTPP is the answer to the right question.

This way of using PTPP has the effect of putting each individual in a reference class with everyone who scored the same or higher.  Someone who is subject to this test could reasonably object to being lumped with higher-risk offenders.  And they might reasonably ask why it's not the other way around: why is the reference class "my risk or higher" and not "my risk or lower"?

I'm not sure there is a principled answer to that question.  In general, we should prefer a smaller reference class, and one that does not systematically bias the estimates.  Using PTPP (as proposed) doesn't do well by these criteria.

But logistic regression is vulnerable to the same objection: outcomes from high-risk offenders have some effect on the estimates for lower-risk groups.  Is that fair?

Independent estimates

To answer that question, I also ran a simple model that estimates the risk for each score independently.  I used a uniform distribution as a prior for each group, computed a Bayesian posterior distribution, and plotted the estimated risks and credible intervals.  Here are the results, again superimposed on the logistic regression model.


People who score 0, 1, or 8 or more would like this model.  People who score -2, 2, or 7 would hate it.  But given the size of the credible intervals, it's clear that we shouldn't take these variations too seriously.
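The per-group computation is a textbook Bayesian update: with a uniform prior, k recidivists out of n gives a Beta(k+1, n-k+1) posterior for the group's risk.  A minimal sketch, using a made-up group (not the real counts):

```python
import math

def beta_interval(k, n, lo=0.025, hi=0.975):
    """Posterior quantiles for a proportion with a uniform prior,
    i.e. a Beta(k+1, n-k+1) posterior, computed on a crude grid."""
    a, b = k + 1, n - k + 1
    logc = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    grid = [i / 10000 for i in range(1, 10000)]
    pdf = [math.exp(logc + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))
           for p in grid]
    total = sum(pdf)
    cum, bounds = 0.0, []
    for p, d in zip(grid, pdf):
        cum += d / total
        if len(bounds) == 0 and cum >= lo:
            bounds.append(p)
        if len(bounds) == 1 and cum >= hi:
            bounds.append(p)
    return tuple(bounds)

# Hypothetical group: 12 recidivists out of 40 offenders with some score.
low, high = beta_interval(12, 40)   # a wide interval around 0.3
```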

With a bigger dataset, the credible intervals would be smaller, and we expect the results to be ordered so that groups with higher scores have higher rates of recidivism.  But with this small dataset, we have some practical problems.

We could smooth the data by combining adjacent scores into risk ranges.  But then we're faced again with the objection that this kind of lumping is unfair to people at the low end of each range, and dangerously generous to people at the high end.

I think logistic regression balances these conflicting criteria reasonably: by taking advantage of background information -- the assumption that higher scores correspond to higher risks -- it uses data efficiently, generating sensible estimates for all scores and confidence intervals that represent the precision of the estimates.

However, it is based on an assumption that might be too strong, that the relationship between log-odds and risk score is linear.  If that's not true, the model might be unfair to some groups and too generous to others.

Logistic regression with a quadratic term

To see whether the linear assumption is valid, we can run logistic regression again with two explanatory variables: score and the square of score.  If the relationship is actually non-linear, this quadratic model might fit the data better.

The estimated coefficient for the quadratic term is negative, so the line of best fit is a parabola with downward curvature.  That's consistent with the shape of the independent estimates in the previous section.  But the estimate for the quadratic parameter is not statistically significant (p=0.11), which suggests that it might be due to randomness.

Here is what the results look like graphically, again compared to the logistic regression model:


The estimates are lower for the lowest score groups, about the same in the middle, and lower again for the highest scores.  At the high end, it seems unlikely that the risk of recidivism actually declines for the highest scores, and more likely that we are overfitting the data.

This model suggests that the linear logistic regression might overestimate risk for the lowest-scoring groups.  It might be worth exploring this possibility with more data.

Bayesian logistic regression

One of the best features of logistic regression is that the results are in the form of a probability, which is useful for all kinds of analysis and especially for the kind of risk-benefit analysis that underlies parole decisions.

But conventional logistic regression does not admit prior information about the parameters, and the result is not a posterior distribution, but just a point estimate and confidence interval.  In general, confidence intervals are less useful as part of a decision making process.  And if you have ever tried to explain a confidence interval to a general audience, you know they can be problematic.

So, mostly for my own interest, I applied Bayesian logistic regression to the same data.  The Bayesian version of logistic regression is conceptually simple and easy to implement, in part because at the core of logistic regression there is a tight connection to Bayes's theorem.  I wrote about this connection in a previous article.
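My implementation is in the notebook; as a sketch of the idea (with made-up data, not the real dataset), a grid approximation is enough: evaluate the likelihood of the data at every point on a grid of parameter pairs, normalize to get the posterior, and average the per-score risks over it.

```python
import math

def logistic(x):
    return 1 / (1 + math.exp(-x))

# Made-up (score, outcome) pairs standing in for the real data.
data = [(-2, 0), (0, 0), (1, 0), (3, 1), (4, 0), (6, 1), (8, 1), (10, 1)]

# Uniform prior over a grid of (intercept, slope) pairs.
b0s = [i / 10 for i in range(-40, 1)]      # -4.0 .. 0.0
b1s = [i / 100 for i in range(0, 101)]     # 0.00 .. 1.00

posterior = {}
for b0 in b0s:
    for b1 in b1s:
        loglike = sum(math.log(logistic(b0 + b1 * s)) if y
                      else math.log(1 - logistic(b0 + b1 * s))
                      for s, y in data)
        posterior[(b0, b1)] = math.exp(loglike)

total = sum(posterior.values())
posterior = {k: v / total for k, v in posterior.items()}

# Posterior predictive risk for an offender who scores 6: average the
# risk implied by each parameter pair, weighted by posterior probability.
risk6 = sum(w * logistic(b0 + b1 * 6) for (b0, b1), w in posterior.items())
```

With a uniform prior, the posterior means land close to the maximum-likelihood estimates, which is why the Bayesian figure looks almost identical to the conventional fit.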

My implementation, with explanatory comments, is in the IPython notebook; here are the results:


With a uniform prior for the parameters, the results are almost identical to the results from (conventional) logistic regression.  At first glance it seems like the Bayesian version has no advantage, but there are a few reasons it might:

  1. For each score, we can generate a posterior distribution that represents not just a point estimate, but all possible estimates and their probabilities.  If the posterior mean is 55%, for example, we might want to know the probability that the correct value is more than 65%, or less than 45%.  The posterior distribution can answer this question; conventional logistic regression cannot.
  2. If we use estimates from the model as part of another analysis, we could carry the posterior distribution through the subsequent computation, producing a posterior predictive distribution for whatever results the analysis generates.
  3. If we have background information that guides the choice of the prior distributions, we could use that information to produce better estimates.
But in this example, I don't have a compelling reason to choose a different prior, and there is no obvious need for a posterior distribution: a point estimate is sufficient.  So (although it hurts me to admit it) the Bayesian version of logistic regression has no practical advantage for this application.

Summary

Using logistic regression to estimate the risk of recidivism for each score group makes good use of the data and background information.  Although the estimate for each group is influenced by outcomes in other groups, this influence is not unfair in any obvious way.

The alternatives I considered are problematic:  PTPP is vulnerable to both practical and philosophical objections.  Estimating risks for each group independently is impractical with a small dataset, but might work well with more data.

The quadratic logistic model provides no apparent benefit, and probably overfits the data.  The Bayesian version of logistic regression is consistent with the conventional version, and has no obvious advantages for this application.


[UPDATE 17 November 2015]  Sam Mason wrote to ask if I had considered using Gaussian process regression to model the data.  I had not, but it strikes me as a reasonable choice if we think that the risk of recidivism varies smoothly with risk score, but not necessarily with the same risk ratio between successive scores (as assumed by logistic regression).

And then he kindly sent me this IPython notebook, where he demonstrates the technique.  Here are two models he fit, using an exponential kernel (left) and a Matern kernel (right).


Here's what Sam has to say about the methods:
"I've done a very naive thing in a Gaussian process regression. I use the GPy toolbox (http://sheffieldml.github.io/GPy), which is a reasonably nice library for doing this sort of thing. GPs have n^2 complexity in the number of data points, so scaling beyond a few hundred points can be awkward, but there are various sparse approximations (some included in GPy) should you want to play with much more data. The main thing to choose is the covariance function, i.e. how related points at a given distance should be considered to be. The GPML book by Rasmussen and Williams is a good reference if you want to know more.
"The generally recommended kernel if you don't know much about the data is a Matern kernel and this ends up looking a bit like your quadratic fit (a squared exponential kernel is also commonly used, but tends to be too strong). I'd agree with the GP model that the logistic regression is probably saying that those with high scores are too likely to recidivate. Another kernel that looked like a good fit is the Exponential kernel. This smooths the independent estimates a bit and says that the extreme rates (i.e. <=0 or >=7) are basically equal—which seems reasonable to me."
The shape of these fitted models is plausible to me, and consistent with the data.

At this point, we have probably squeezed out all the information we're going to get from 702 data points, but I would be interested to see if this shape appears in models of other datasets.  If so, it might have practical consequences.  For people in the highest risk categories, the predictions from the logistic regression model are substantially higher than the predictions from these GP models.

I'll see if I can get my hands on more data.



Monday, November 9, 2015

Recidivism and single-case probabilities

I am collaborating with a criminologist who studies recidivism.  In the context of crime statistics, a recidivist is a convicted criminal who commits a new crime after a previous conviction.  Statistical studies of recidivism are used in parole hearings to assess the risks of releasing a prisoner.  This application of statistics raises questions that go to the foundations of probability theory.

Specifically, assessing risk for an individual is an example of a "single-case probability", a well-known problem in the philosophy of probability.  For example, we would like to be able to make a statement like, "If released, there is a 55% probability that Mr. Smith will be charged with another crime within 10 years."  But how could we estimate a probability like that, and what would it mean?

I suggest we answer these questions in three steps.  The first is to choose a "reference class"; the second is to estimate the probability of recidivism in the reference class; the third is to interpret the result as it applies to Mr. Smith.

For example, if the reference class includes all people convicted of the same crime as Mr. Smith, we could find a sample of that population and compute the rate of recidivism in the sample.  If 55% of the sample were recidivists, we might claim that Mr. Smith has a 55% probability of reoffending.

Let's look at those steps in more detail:

1) The reference class problem  For any individual offender, there are an unbounded number of reference classes we might choose.  For example, if we consider characteristics like age, marital status, and severity of crime, we might form a reference class using any one of those characteristics, any two, or all three.  As the number of characteristics increases, the number of possible classes grows exponentially (and I mean that literally, not as a sloppy metaphor for "very fast").

Some reference classes are preferable to others; in general, we would like the people in each class to be as similar as possible to each other, and the classes to be as different as possible from each other.  But there is no objective procedure for choosing the "right" reference class.

2) Sampling and estimation  Assuming we have chosen a reference class, we would like to estimate the proportion of recidivists in the class.  First, we have to define recidivism in a way that's measurable.  Ideally we would like to know whether each member of the class commits another crime, but that's not possible.  Instead, recidivism is usually defined to mean that the person is either charged or convicted of another crime.

If we could enumerate the members of the class and count recidivists and non-recidivists, we would know the true proportion, but normally we can't do that.  Instead we choose a sampling process intended to select a representative sample, meaning that every member of the class has the same probability of appearing in the sample, and then use the observed proportion in the sample to estimate the true proportion in the class.

3) Interpretation  Suppose we agree on a reference class, C, a sampling process, and an estimation process, and estimate that the true proportion of recidivists in C is 55%.  What can we say about Mr. Smith?

As my collaborator has demonstrated, this question is a topic of ongoing debate.  Among practitioners in the field, some take the position that "the probability of recidivism for an individual offender will be the same as the observed recidivism rate for the group to which he most closely belongs." (Harris and Hanson 2004).  On this view, we would conclude that Mr. Smith has a 55% chance of reoffending.

Others take the position that this claim is nonsense because it could never be confirmed or denied; whether Mr. Smith goes on to commit another crime or not, neither outcome supports or refutes the claim that the probability was 55%.  On this view, probability cannot be meaningfully applied to a single event.

To understand this view, consider an analogy suggested by my colleague Rehana Patel:  Suppose you estimate that the median height of people in class C is six feet.  You could not meaningfully say that the median height of Mr. Smith is six feet.  Only the class has a median height, individuals do not.  Similarly, only the class has a proportion of recidivists; individuals do not.

And the answer is...

At this point I have framed the problem and tried to state the opposing views clearly.  Now I will explain my own view and try to justify it.

I think it is both meaningful and useful to say, in the example, that Mr. Smith has a 55% chance of offending again.  Contrary to the view that no possible outcome supports or refutes this claim, Bayes's theorem tells us otherwise.  Suppose we consider two hypotheses:

H55: Mr. Smith has a 55% chance of reoffending.
H45: Mr. Smith has a 45% chance of reoffending.

If Mr. Smith does in fact commit another crime, this datum supports H55 with a Bayes factor of (55)/(45) = 1.22.  And if he does not, that datum supports H45 by the same factor.  In either case it is weak evidence, but nevertheless it is evidence, which means that H55 is a meaningful claim that can be supported or refuted by data.  The same argument applies if there are more than two discrete hypotheses or a continuous set of hypotheses.

Furthermore, there is a natural operational interpretation of the claim.  If we consider some number, n, of individuals from class C, and estimate that each of them has probability, p, of reoffending, we can compute a predictive distribution for k, the number who reoffend.  It is just the binomial distribution of k with parameters p and n:

f(k; n, p) = \Pr(X = k) = \binom{n}{k} \, p^k (1 - p)^{n-k}

For example, if n=100 and p=0.55, the most likely value of k is 55, and the probability that k=55 is about 8%.  As far as I know, no one has a problem with that conclusion.
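Both of the numerical claims in this section (the Bayes factor above and the binomial probability here) are one-liners to check:

```python
import math

# Bayes factor for H55 over H45, given that Mr. Smith reoffends.
print(0.55 / 0.45)   # about 1.22

# Binomial pmf: probability that k of n reoffend, each with probability p.
def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# With n=100 and p=0.55, the most likely k is 55, with probability near 8%.
best_k = max(range(101), key=lambda k: binom_pmf(k, 100, 0.55))
print(best_k)                     # 55
print(binom_pmf(55, 100, 0.55))   # about 0.08
```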

But if there is no philosophical problem with the claim, "Of these 100 members of class C, the probability that 55 of them will reoffend is 8%", there should be no special problem when n=1.  In that case we would say, "Of these 1 members of class C, the probability that 1 will reoffend is 55%."  Granted, that's a funny way to say it, but that's a grammatical problem, not a philosophical one.

Now let me address the challenge of the height analogy.  I agree that Mr. Smith does not have a median height; only groups have medians.  However, if the median height in C is six feet, I think it is correct to say that Mr. Smith has a 50% chance of being taller than six feet.

That might sound strange; you might reply, "Mr. Smith's height is a deterministic physical quantity.  If we measure it repeatedly, the result is either taller than six feet every time, or shorter every time.  It is not a random quantity, so you can't talk about its probability."

I think that reply is mistaken, because it leaves out a crucial step:

1) If we choose a random member of C and measure his height, the result is a random quantity, and the probability that it exceeds six feet is 50%.

2) We can consider Mr. Smith to be a randomly-chosen member of C, because part of the notion of a reference class is that we consider the members to be interchangeable.

3) Therefore Mr. Smith's height is a random quantity and we can make probabilistic claims about it.

My use of the word "consider" in the second step is meant to highlight that this step is a modeling decision: if we choose to regard Mr. Smith as a random selection from C, we can treat his characteristics as random variables.  The decision not to distinguish among the members of the class is part of what it means to choose a reference class.

Finally, if the proportion of recidivists in C is 55%, the probability that a random member of C will reoffend is 55%.  If we consider Mr. Smith to be a random member of C, his probability of reoffending is 55%.

Is this useful?

I have argued that it is meaningful to claim that Mr. Smith has a 55% probability of recidivism, and addressed some of the challenges to that position.

I also think that claims like this are useful because they guide decision making under uncertainty.  For example, if Mr. Smith is being considered for parole, we have to balance the costs and harms of keeping him in prison with the possible costs and harms of releasing him.  It is probably not possible to quantify all of the factors that should be taken into consideration in this decision, but it seems clear that the probability of recidivism is an important factor.

Furthermore, this probability is most useful if expressed in absolute, as opposed to relative, terms.  If we know that one prisoner has a higher risk than another, that provides almost no guidance about whether either should be released.  But if one has a probability of 10% and another 55% (and those are realistic numbers for some crimes) those values could be used as part of a risk-benefit analysis, which would usefully inform individual decisions, and systematically yield better outcomes.


What about the Bayesians and the Frequentists?

When I started this article, I thought the central issue was going to be the difference between the Frequentist and Bayesian interpretations of probability.  But I came to the conclusion that this distinction is mostly irrelevant.

Considering the three steps of the process again:

1) Reference class: Choosing a reference class is equally problematic under either interpretation of probability; there is no objective process that identifies the right, or best, reference class.  The choice is subjective, but that is not to say arbitrary.  There are reasons to prefer one model over another, but because there are multiple relevant criteria, there is no unique best choice.

2) Estimation: The estimation step can be done using frequentist or Bayesian methods, and there are reasons to prefer one or the other.  Some people argue that Bayesian methods are more subjective because they depend on a prior distribution, but I think both approaches are equally subjective; the only difference is whether the priors are explicit.  Regardless, the method you use to estimate the proportion of recidivists in the reference class has no bearing on the third step.

3) Interpretation: In my defense of the claim that the proportion of recidivists in the reference class is the probability of recidivism for the individual, I used Bayes's theorem, which is a universally accepted law of probability, but I did not invoke any particular interpretation of probability.
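To make the estimation step (2) concrete, here is a minimal sketch comparing a frequentist point estimate with a Bayesian posterior mean for the proportion of recidivists in a reference class.  The counts are made up for illustration; the point is that with an explicit, uninformative prior the two estimates barely differ, and neither affects the interpretation step.

```python
# Hypothetical reference class: 120 released prisoners, 66 reoffended
k, n = 66, 120

# Frequentist point estimate: the observed proportion
mle = k / n

# Bayesian estimate with an explicit uniform prior, Beta(1, 1):
# the posterior is Beta(k+1, n-k+1), whose mean is (k+1) / (n+2)
posterior_mean = (k + 1) / (n + 2)

print(mle)             # exactly 0.55
print(posterior_mean)  # slightly less, approximately 0.549
```

With any reasonable amount of data, the prior is swamped; the substantive disagreements are about the reference class, not the arithmetic.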

I argued that we could treat an unknown quantity as a random variable.  That idea is entirely unproblematic under Bayesianism, but less obviously compatible with frequentism.  Some sources claim that frequentism is specifically incompatible with single-case probabilities.

For example, the Wikipedia article on probability interpretations claims that "Frequentists are unable to take this approach [a propensity interpretation], since relative frequencies do not exist for single tosses of a coin, but only for large ensembles or collectives."

I don't agree that assigning a probability to a single case is a special problem for frequentism.  Single case probabilities seem hard because they make the choice of the reference class more difficult.  But choosing a reference class is hard under any interpretation of probability; it is not a special problem for frequentism.

And once you have chosen a reference class, you can estimate parameters of the class and generate predictions for individuals, or groups, without commitment to a particular interpretation of probability.

[For more about alternative interpretations of probability, and how they handle single-case probabilities, see this article in the Stanford Encyclopedia of Philosophy, especially the section on frequency interpretations.  As I read it, the author agrees with me that (1) the problem of the single case relates to choosing a reference class, not attributing probabilities to individuals, and (2) in choosing a reference class, there is no special problem with the single case that is not also a problem in other cases.  If there is any difference, it is one of degree, not kind.]


Summary

In summary, the assignment of a probability to an individual depends on three subjective choices:

1) The reference class,
2) The prior used for estimation,
3) The modeling decision to regard an individual as a random sample with n=1.

You can think of (3) as an additional choice or as part of the definition of the reference class.

These choices are subjective but not arbitrary; that is, there are justified reasons to prefer one over others, and to accept some as good enough for practical purposes.

Finally, subject to those choices, it is meaningful and useful to make claims like "Mr. Smith has a 55% probability of recidivism".


Q&A

1) Isn't it dehumanizing to treat an individual as if he were an interchangeable, identical member of a reference class?  Every human is unique; you can't treat a person like a number!

I conjecture that everyone who applies quantitative methods to human endeavors has heard a complaint like this.  If the intended point is, "You should never apply quantitative models to humans," I disagree.  Everything we know about the world, including the people in it, is mediated by models, and all models are based on simplifications.  You have to decide what to include and what to leave out.  If your model includes the factors that matter and omits the ones that don't, it will be useful for practical purposes.  If you make bad modeling decisions, it won't.

But if the intent of this question is to say, "Think carefully about your modeling decisions, validate your models as much as practically possible, and consider the consequences of getting it wrong," then I completely agree.

Model selection has consequences.  If we fail to include a relevant factor (that is, one that has predictive power), we will treat some individuals unfairly and systematically make worse decisions for society.  If we include factors that are not predictive, our predictions will be more random and possibly less fair.

And those are not the only criteria.  For example, if it turns out that a factor, like race, has predictive power, we might choose to exclude it from the model anyway, possibly decreasing accuracy in order to avoid a particularly unacceptable kind of injustice.

So yes, we should think carefully about model selection, but no, we should not exclude humans from the domain of statistical modeling.

2) Are you saying that everyone in a reference class has the same probability of recidivism?  That can't be true; there must be variation within the group.

I'm saying that an individual only has a probability AS a member of a reference class (or, for the philosophers, qua a member of a reference class).  You can't choose a reference class, estimate the proportion for the class, and then speculate about different probabilities for different members of the class.  If you do, you are just re-opening the reference class question.

To make that concrete, suppose there is a standard model of recidivism that includes factors A, B, and C, and excludes factors D, E, and F.  And suppose that the model estimates that Mr. Smith's probability of recidivism is 55%.

You might be tempted to think that 55% is the average probability in the group, and the probability for Mr. Smith might be higher or lower.  And you might be tempted to adjust the estimate for Mr. Smith based on factor D, E, or F.

But if you do that, you are effectively replacing the standard model with a new model that includes additional factors.  In effect, you are saying that you think the standard model leaves out an important factor, and would be better if it included more factors.

That might be true, but it is a question of model selection, and should be resolved by considering model selection criteria.

It is not meaningful or useful to talk about differences in probability among members of a reference class.  Part of the definition of reference class is the commitment to treat the members as equivalent.

That commitment is a modeling decision, not a claim about the world.  In other words, when we choose a model, we are not saying that we think the model captures everything about the world.  On the contrary, we are explicitly acknowledging that it does not.  But the claim (or sometimes the hope) is that it captures enough about the world to be useful.

3) What happened to the problem of the single case?  Is it a special problem for frequentism?  Is it a special problem at all?

There seems to be some confusion about whether the problem of the single case relates to choosing a reference class (my step 1) or attributing a probability to an individual (step 3).

I have argued that it does not relate to step 3.  Once you have chosen a reference class and estimated its parameters, there is no special problem in applying the result to the case of n=1, not under frequentism or any other interpretation I am aware of.

During step 1, single case predictions might be more difficult, because the choice of reference class is less obvious or people might be less likely to agree.  But there are exceptions of both kinds: for some single cases, there might be an easy consensus on an obvious good reference class; for some non-single cases, there are many choices and no consensus.  In all cases, the choice of reference class is subjective, but guided by the criteria of model choice.

So I think the single case problem is just an instance of the reference class problem, and not a special problem at all.

Tuesday, November 3, 2015

Learning to Love Bayesian Statistics

At Strata NYC 2015, O'Reilly Media's data science conference, I gave a talk called "Learning to Love Bayesian Statistics".  The video is available now:


The sound quality is not great, and the clicker to advance the slides was a little wonky, but other than that, it went pretty well.

Here are the slides, if you want to follow along at home.


I borrowed illustrations from The Phantom Tollbooth, in a way I think is consistent with fair use.  I think they work remarkably well.

Thanks to the folks at Strata for inviting me to present, and for allowing me to make the video freely available.  It's actually the first video I have posted on YouTube.  I'm a late adopter.


Monday, November 2, 2015

One million is a lot

When I was in third grade, the principal of my elementary school announced a bottle cap drive with the goal of collecting one million bottle caps.  The point, as I recall, was to demonstrate that one million is a very large number.  After a few months, we ran out of storage space, the drive was cancelled, and we had to settle for the lesson that 100,000 is a lot of bottle caps.

So it is a special pleasure for me to announce that, early Sunday morning (November 1, 2015), this blog reached one million page views.  I am celebrating the occasion with a review of some of my favorite articles and, of course, some analysis of the page view statistics.

Here's a screenshot of my Blogger stats page to make it official:


And here are links to the 10 most read articles:

Posts

Entry — Pageviews

1. 130,446
2. Oct 27, 2011 (56 comments) — 47,773
3. Oct 20, 2011 (13 comments) — 33,020
4. 32,210
5. Aug 18, 2015 (23 comments) — 30,330
6. 21,718
7. Aug 7, 2013 (13 comments) — 15,035
8. 13,806
9. Jan 29, 2012 (6 comments) — 7,904
10. 7,096

By far the most popular is my article about whether first babies are more likely to be late.  It turns out they are, but only by a couple of days.

Two of the top 10 are articles written by students in my Bayesian statistics class: "Bayesian survival analysis for Game of Thrones" by Erin Pierce and Ben Kahle, and "Bayesian analysis of match rates on Tinder", by Ankur Das and Mason del Rosario.  So congratulations, and thanks, to them!

Five of the top 10 are explicitly Bayesian, which is clearly the intersection of my interests and popular curiosity.  But the other common theme is the application of statistical methods (of any kind) to questions people are interested in.



According to Blogger stats, my readers are mostly in the U.S., with most of the rest in Europe.  No surprises there, with the exception of Ukraine, which is higher in the rankings than expected.  Some of those views are probably bogus; anecdotally, Blogger does not do a great job of filtering robots and fake clicks (I don't have ads on my blog, so I am not sure how anyone benefits from fake clicks, but I have to conclude that some of my million are not genuine readers).


Most of my traffic comes from Google, Reddit, Twitter, and Green Tea Press, which is the home of my free books.  It looks like a lot of people find me through "organic" search, as opposed to my attempts at publicity.  And what are those people looking for?


People who find my blog are looking for Bayesian statistics, apparently, and the answer to the eternal question, "Are first babies more likely to be late?"

Those are all the reports you can get from Blogger (unless you are interested in which browsers my readers use).  But if I let it go at that, this blog wouldn't be called "Probably Overthinking It".

I used the Chrome extension SingleFile to grab the stats for each article in a form I can process, then used the Pandas read_html function to get it all into a table.  The results, and my analysis, are in this IPython notebook.
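Here is a minimal sketch of that read_html step.  In the real workflow the argument would be the path to the page saved with SingleFile; since that file isn't included here, this example parses a tiny stand-in table instead.

```python
from io import StringIO

import pandas as pd

# Stand-in for the saved Blogger stats page; the real call would be
# something like pd.read_html("saved_stats_page.html")
html = """
<table>
  <tr><th>Post</th><th>Pageviews</th></tr>
  <tr><td>First babies</td><td>130446</td></tr>
  <tr><td>Another post</td><td>7096</td></tr>
</table>
"""

# read_html parses every <table> in the document into a list of DataFrames
tables = pd.read_html(StringIO(html))
posts = tables[0]
print(posts.sort_values("Pageviews", ascending=False))
```

One nice property of read_html is that it infers the header row from the <th> cells and converts the page-view counts to integers, so the result is immediately ready for sorting and analysis.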

My first post, "Proofiness and Elections", was on January 4, 2011.  I've published 115 posts since then; the average time between posts is 15 days, but that includes a 180 day hiatus after "Secularization in America: Part Seven" in July 2012.  I spent the fall working on Think Bayes, and got back to blogging in January 2013.

Blogger provides stats on the most popular posts; I had to do my own analysis to extract the least popular posts:


Some of these deserve their obscurity, but not all.  "Will Millennials Ever Get Married?" is one of my favorite projects, and I think the video from the talk is pretty good.  And "When will I win the Great Bear Run?" is one of the best statistical models I've developed, albeit applied to a problem that is mostly silly.

Measures of popularity often follow Zipf's law, and my blog is no different.   As I suggest in Chapter 5 of Think Complexity, the most robust way to check for Zipf-like behavior is to plot the complementary CDF of frequency (for example, page views) on a log-log scale:



For articles with more than 1000 page views, the CCDF is approximately straight, consistent with Zipf's law.
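As a sketch of that check, here is how to compute and plot the complementary CDF on log-log axes, using the page-view counts of the top 10 articles (the full analysis in the notebook uses all posts):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Page-view counts for the top 10 articles
views = np.array([130446, 47773, 33020, 32210, 30330,
                  21718, 15035, 13806, 7904, 7096])

# Complementary CDF: fraction of articles with at least x page views
xs = np.sort(views)
ccdf = 1.0 - np.arange(len(xs)) / len(xs)

# Under Zipf-like behavior, this is roughly a straight line on log-log axes
plt.plot(xs, ccdf, marker="o")
plt.xscale("log")
plt.yscale("log")
plt.xlabel("Page views")
plt.ylabel("CCDF")
plt.savefig("ccdf.png")
```

Plotting the CCDF this way avoids the binning artifacts you get from a log-log histogram, which is why it is the more robust test.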

The posts that elicited the most comments are:


Apparently, people like their veridical paradoxes!  The Girl Named Florida problem attracted the attention and wrath of JeffJo, the reader who has contributed by far the most comments.  He also accounts for many of the comments on The Sleeping Beauty Problem, along with Brian Mays.  Between them, they might have posted more words on my blog than I have.

A few of my posts have attracted attention on the social network of Google employees, Google+:


I'm glad someone appreciates The Inspection Paradox.  I submitted it for publication in CHANCE magazine, but they didn't want it.  Thirty thousand readers, 909 Google employees, and I think they blew it.

One thing I have learned from this blog is that I can never predict whether an article will be popular.  One of the most technically challenging articles, "Bayes meets Fourier", apparently found an audience of people interested in Bayesian statistics and signal processing.  At the same time, some of my favorites, like The Rock Hyrax Problem and Belly Button Biodiversity, have fallen flat.  I've given up trying to predict what will hit.

I have posts coming up in the next few weeks that I am excited about, including an analysis of Internet use and religion using data from the European Social Survey.  Watch this space.

Thanks to everyone who contributed to the first million page views.  I hope you found it interesting and learned something, and I hope you'll be back for the next million!