Monday, April 28, 2014

Bayes's theorem and logistic regression

This week's post has more math than most, so I wrote it in LaTeX and translated it to HTML using HeVeA. Some of the formulas are not as pretty as they could be. If you prefer, you can read this article in PDF.
Abstract: My two favorite topics in probability and statistics are Bayes’s theorem and logistic regression. Because there are similarities between them, I have always assumed that there is a connection. In this note, I demonstrate the connection mathematically, and (I hope) shed light on the motivation for logistic regression and the interpretation of the results.

1  Bayes’s theorem

I’ll start by reviewing Bayes’s theorem, using an example that came up when I was in grad school. I signed up for a class on Theory of Computation. On the first day of class, I was the first to arrive. A few minutes later, another student arrived. Because I was expecting most students in an advanced computer science class to be male, I was mildly surprised that the other student was female. Another female student arrived a few minutes later, which was sufficiently surprising that I started to think I was in the wrong room. When another female student arrived, I was confident I was in the wrong place (and it turned out I was).
As each student arrived, I used the observed data to update my belief that I was in the right place. We can use Bayes’s theorem to quantify the calculation I was doing intuitively.
I’ll use H to represent the hypothesis that I was in the right room, and F to represent the observation that the first other student was female. Bayes’s theorem provides an algorithm for updating the probability of H:
P(H|F) = P(H) P(F|H) / P(F)
  • P(H) is the prior probability of H before the other student arrived.
  • P(H|F) is the posterior probability of H, updated based on the observation F.
  • P(F|H) is the likelihood of the data, F, assuming that the hypothesis is true.
  • P(F) is the likelihood of the data, independent of H.
Before I saw the other students, I was confident I was in the right room, so I might assign P(H) something like 90%.
When I was in grad school most advanced computer science classes were 90% male, so if I was in the right room, the likelihood of the first female student was only 10%. And the likelihood of three female students was only 0.1%.
If we don’t assume I was in the right room, then the likelihood of the first female student was more like 50%, so the likelihood of all three was 12.5%.
Plugging those numbers into Bayes’s theorem yields P(H|F) = 0.64 after one female student, P(H|FF) = 0.26 after the second, and P(H|FFF) = 0.07 after the third.
[UPDATE: An earlier version of this article had incorrect values in the previous sentence. Thanks to David Burger for catching the error.]
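
The three updates above can be reproduced in a few lines of Python. This is a minimal sketch, not code from the original post: the function name update is an assumption for illustration, and P(F) is computed by the law of total probability from the 10% and 50% figures given above.

```python
def update(prior, like_h, like_not_h):
    """Return the posterior P(H|data) from P(H), P(data|H) and P(data|not H)."""
    numer = prior * like_h
    # P(data) by total probability over H and not-H
    denom = numer + (1 - prior) * like_not_h
    return numer / denom

prior = 0.9        # P(H): confident I was in the right room
p_f_h = 0.1        # P(F|H): 10% of students female if the room is right
p_f_not_h = 0.5    # P(F|not H): 50% female otherwise

posteriors = []
belief = prior
for _ in range(3):             # three female students arrive in turn
    belief = update(belief, p_f_h, p_f_not_h)
    posteriors.append(belief)

print([round(p, 2) for p in posteriors])   # [0.64, 0.26, 0.07]
```

Updating one student at a time gives the same result as a single update with the likelihoods 0.1% and 12.5%, because each posterior becomes the prior for the next observation.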

2  Logistic regression

Logistic regression is based on the following functional form:
logit(p) = β0 + β1 x1 + ... + βn xn 
where the dependent variable, p, is a probability, the xs are explanatory variables, and the βs are coefficients we want to estimate. The logit function is the log-odds, or
logit(p) = ln( p / (1 − p) )


When you present logistic regression like this, it raises three questions:
  • Why is logit(p) the right choice for the dependent variable?
  • Why should we expect the relationship between logit(p) and the explanatory variables to be linear?
  • How should we interpret the estimated parameters?
The answer to all of these questions turns out to be Bayes’s theorem. To demonstrate that, I’ll use a simple example where there is only one explanatory variable. But the derivation generalizes to multiple regression.
On notation: I’ll use P(H) for the probability that some hypothesis, H, is true. O(H) is the odds of the same hypothesis, defined as
O(H) = P(H) / (1 − P(H))
I’ll use LO(H) to represent the log-odds of H:
LO(H) = ln O(H)
I’ll also use LR for a likelihood ratio, and OR for an odds ratio. Finally, I’ll use LLR for a log-likelihood ratio, and LOR for a log-odds ratio.
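
To make the notation concrete, here is a small sketch of the conversions between probability, odds, and log-odds. The function names are assumptions chosen for illustration; only the definitions of O(H) and LO(H) come from the text.

```python
import math

def odds(p):
    """O(H) = P(H) / (1 - P(H))."""
    return p / (1 - p)

def log_odds(p):
    """LO(H) = ln O(H), which is the same as logit(p)."""
    return math.log(odds(p))

def prob_from_log_odds(lo):
    """Invert the log-odds transform to recover a probability."""
    return 1 / (1 + math.exp(-lo))

p = 0.9
print(odds(p))                          # close to 9, i.e. odds of 9:1
print(log_odds(p))                      # close to ln 9
print(prob_from_log_odds(log_odds(p)))  # recovers 0.9 up to rounding
```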

3  Making the connection

To demonstrate the connection between Bayes’s theorem and logistic regression, I’ll start with the odds form of Bayes’s theorem. Continuing the previous example, I could write
   O(H|F) = O(H) LR(F|H)     (1)
  • O(H) is the prior odds that I was in the right room,
  • O(H|F) is the posterior odds after seeing one female student,
  • LR(F|H) is the likelihood ratio of the data, given the hypothesis.
The likelihood ratio of the data is:
LR(F|H) = P(F|H) / P(F|¬ H)
where ¬ H means H is false.
Noticing that logistic regression is expressed in terms of log-odds, my next move is to write the log-odds form of Bayes’s theorem by taking the log of Eqn 1:
   LO(H|F) = LO(H) + LLR(F|H)     (2)
If the first student to arrive had been male, we would write
    LO(H|M) = LO(H) + LLR(M|H)     (3)
Or more generally if we use X as a variable to represent the sex of the observed student, we would write
   LO(H|X) = LO(H) + LLR(X|H)     (4)
I’ll assign X=0 if the observed student is female and X=1 if male. Then I can write:
    LLR(X|H) = LLR(F|H)   if  X = 0
    LLR(X|H) = LLR(M|H)   if  X = 1     (5)
Or we can collapse these two expressions into one by using X as a multiplier:
   LLR(X|H) = LLR(F|H) + X [LLR(M|H) − LLR(F|H)]     (6)
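
As a numeric check, the sketch below applies the log-odds update of Eqn 2 to the classroom example and confirms that the collapsed expression in Eqn 6 reproduces both cases. The values P(M|H) = 0.9 and P(M|¬H) = 0.5 are the complements of the 10% and 50% figures from Section 1; the names like, llr and llr_collapsed are assumptions for illustration.

```python
import math

p_h = 0.9
# (P(x|H), P(x|not H)) for each possible observation
like = {
    "F": (0.1, 0.5),
    "M": (0.9, 0.5),
}

def llr(obs):
    """Log-likelihood ratio LLR(obs|H)."""
    p_given_h, p_given_not_h = like[obs]
    return math.log(p_given_h / p_given_not_h)

def llr_collapsed(x):
    """Eqn 6, with X = 0 for a female student and X = 1 for a male student."""
    return llr("F") + x * (llr("M") - llr("F"))

lo_prior = math.log(p_h / (1 - p_h))     # LO(H)
lo_post = lo_prior + llr("F")            # Eqn 2: LO(H|F)
p_post = 1 / (1 + math.exp(-lo_post))    # back to a probability

print(round(p_post, 2))                  # 0.64, matching Section 1
print(llr_collapsed(0), llr("F"))        # the two values agree
print(llr_collapsed(1), llr("M"))        # the two values agree
```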

4  Odds ratios

The next move is to recognize that the part of Eqn 6 in brackets is the log-odds ratio of H. To see that, we need to look more closely at odds ratios.
Odds ratios are often used in medicine to describe the association between a disease and a risk factor. In the example scenario, we can use an odds ratio to express the odds of the hypothesis H if we observe a male student, relative to the odds if we observe a female student:
OR_X(H) = O(H|M) / O(H|F)
I’m using the notation OR_X to represent the odds ratio associated with the variable X.
Applying Bayes’s theorem to the top and bottom of the previous expression yields
OR_X(H) = [O(H) LR(M|H)] / [O(H) LR(F|H)] = LR(M|H) / LR(F|H)
Taking the log of both sides yields
   LOR_X(H) = LLR(M|H) − LLR(F|H)     (7)
This result should look familiar, since it appears in brackets in Eqn 6.

5  Conclusion

Now we have all the pieces we need; we just have to assemble them. Combining Eqns 6 and 7 yields
   LLR(X|H) = LLR(F|H) + X LOR_X(H)     (8)
Combining Eqns 4 and 8 yields
   LO(H|X) = LO(H) + LLR(F|H) + X LOR_X(H)     (9)
Finally, combining Eqns 2 and 9 yields
   LO(H|X) = LO(H|F) + X LOR_X(H)     (10)
We can think of this equation as the log-odds form of Bayes’s theorem, with the update term expressed as a log-odds ratio. Let’s compare that to the functional form of logistic regression:
logit(p) = β0 + X β1 
The correspondence between these equations suggests the following interpretation:
  • The predicted value, logit(p), is the posterior log odds of the hypothesis, given the observed data.
  • The intercept, β0, is the log-odds of the hypothesis if X=0.
  • The coefficient of X, β1, is a log-odds ratio that represents the odds of H when X=1, relative to when X=0.
This relationship between logistic regression and Bayes’s theorem tells us how to interpret the estimated coefficients. It also answers the question I posed at the beginning of this note: the functional form of logistic regression makes sense because it corresponds to the way Bayes’s theorem uses data to update probabilities.
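
The correspondence can be checked numerically with the classroom example. In this sketch, β0 is computed as the posterior log-odds after seeing a female student and β1 as the difference of log-likelihood ratios; the sum β0 + β1 should then equal the posterior log-odds after seeing a male student, computed directly from Bayes’s theorem. The variable and function names are assumptions for illustration, and P(M|H) = 0.9, P(M|¬H) = 0.5 are the complements of the figures in Section 1.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def posterior(prior, like_h, like_not_h):
    """Bayes's theorem in probability form."""
    numer = prior * like_h
    return numer / (numer + (1 - prior) * like_not_h)

prior = 0.9
p_h_given_f = posterior(prior, 0.1, 0.5)    # observed X = 0 (female)
p_h_given_m = posterior(prior, 0.9, 0.5)    # observed X = 1 (male)

beta0 = logit(p_h_given_f)                        # intercept: LO(H|F)
beta1 = math.log(0.9 / 0.5) - math.log(0.1 / 0.5) # LLR(M|H) - LLR(F|H)

# logit(p) = beta0 + X * beta1 should match the direct Bayesian posterior
for x, p in [(0, p_h_given_f), (1, p_h_given_m)]:
    print(x, beta0 + x * beta1, logit(p))
```

In this example β1 works out to ln 9, so observing a male rather than a female student multiplies the odds of H by 9.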


Monday, April 21, 2014

Inferring participation rates in service projects

About a week ago I taught my tutorial, Bayesian Statistics Made Simple, at PyCon 2014 in Montreal.  My slides, the video, and all the code, are on this site.  The turnout was great.  We had a room full of enthusiastic Pythonistas who are now, if I was successful, enthusiastic Bayesians.

Toward the end, one of the participants asked a great question based on his work (if I understand the background correctly) at Do  Here's my paraphrase:
"A group of people sign up to do a community service project.  Some fraction of them actually participate, and then some fraction of the participants report back to confirm.  In other words, some of the people who sign up don't participate, and some of the people that participate don't report back. 
Given the number of people who sign up and the number of people who report back, can we estimate the number of people who actually participated?"
At the tutorial I wasn't able to generate an answer on the fly, except to say

1) It sounds like a two-dimensional problem, where we want to estimate the fraction of people who participate, which I'll call q, and the fraction of participants who report back, r.

2) If we only know the number of people who sign up and the number of participants who report back, we won't be able to estimate q and r separately.  We'll be able to narrow the distribution of q, but not by much.

3) But if we can do additional sampling, we should be able to make a good estimate.  For example, we could choose a random subset of people who sign up and ask them whether they participated (and check whether they reported back).

With these two kinds of data, we can solve the problem using the same tools we used in the tutorial for the Dice and the Euro problems.  I wrote a solution and checked it into the repository I used for the tutorial.  You can read and download it here.

Here's the code that creates the suite of hypotheses:

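    import numpy

    # candidate values for q and r: 0.00, 0.01, ..., 1.00
    probs = numpy.linspace(0, 1, 101)

    # one hypothesis for every (q, r) pair
    hypos = []
    for q in probs:
        for r in probs:
            hypos.append((q, r))

    # Volunteer is the Suite subclass defined below
    suite = Volunteer(hypos)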

probs is a sequence of 101 values equally-spaced between 0 and 1.  hypos is a list of tuples where each tuple represents a hypothetical pair, (q, r).

Volunteer is a new class that extends Suite and provides a Likelihood function:

import thinkbayes

class Volunteer(thinkbayes.Suite):

    def Likelihood(self, data, hypo):
        """Computes the likelihood of the data.

        hypo: pair of (q, r)
        data: one of two possible formats
        """
        # dispatch on the shape of the data
        if len(data) == 2:
            return self.Likelihood1(data, hypo)
        elif len(data) == 3:
            return self.Likelihood2(data, hypo)
        else:
            raise ValueError()

For this problem, we do two kinds of update, depending on the data.  The first update takes a pair of values, (signed_up, reported), which are the number of people who signed up and the number that reported back:

    def Likelihood1(self, data, hypo):
        """Computes the likelihood of the data.

        hypo: pair of (q, r)
        data: tuple (signed up, reported)
        """
        q, r = hypo
        # probability that someone who signs up participates and reports back
        p = q * r
        signed_up, reported = data
        yes = reported
        no = signed_up - reported

        like = p**yes * (1-p)**no
        return like

Given the hypothetical values of q and r, we can compute p, which is the probability that someone who signs up will participate and report back.  Then we compute the likelihood using the binomial PMF (well, almost: I didn't bother to compute the binomial coefficient because it drops out, anyway, when we renormalize).
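
To see why the binomial coefficient can be skipped, the following sketch normalizes the likelihood over a grid of p values with and without the coefficient and compares the results. It is a standalone illustration in plain Python, not part of the Volunteer class; the helper name normalize is an assumption, and math.comb requires Python 3.8 or later.

```python
import math

def normalize(weights):
    """Scale a list of non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

signed_up, reported = 140, 50
ps = [i / 100 for i in range(101)]     # grid of hypothetical values of p

# likelihood without the binomial coefficient
kernel = [p**reported * (1 - p)**(signed_up - reported) for p in ps]

# full binomial PMF: the same values times a constant that does not depend on p
coeff = math.comb(signed_up, reported)
full = [coeff * k for k in kernel]

post_kernel = normalize(kernel)
post_full = normalize(full)

# largest difference between the two normalized posteriors
max_diff = max(abs(a - b) for a, b in zip(post_kernel, post_full))
print(max_diff)    # zero up to floating-point rounding
```

Because the coefficient is the same for every hypothesis, it cancels when the posterior is normalized.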

Since I don't have any real data, I'll make some up for this example.  Suppose 140 people sign up and only 50 report back:

    data = 140, 50
    suite.Update(data)
    PlotMarginals(suite, root='volunteer1')

PlotMarginals displays the marginal distributions of q and r.  We can extract these distributions like this:

def MarginalDistribution(suite, index):
    pmf = thinkbayes.Pmf()
    for t, prob in suite.Items():
        pmf.Incr(t[index], prob)
    return pmf

MarginalDistribution loops through the Suite and makes a new Pmf that represents the posterior distribution of either q (when index=0) or r (when index=1).  Here's what they look like:
The two distributions are identical, which makes sense because this dataset doesn't give us any information about q and r separately, only about their product.
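
The symmetry can be checked without thinkbayes. The sketch below is a plain-Python stand-in for the Suite machinery, assuming a uniform prior over the same 101 x 101 grid; the function names likelihood1, update and marginal are illustrative and are not the thinkbayes API.

```python
def likelihood1(data, hypo):
    """Likelihood of (signed_up, reported) given hypothetical (q, r)."""
    q, r = hypo
    p = q * r
    signed_up, reported = data
    return p**reported * (1 - p)**(signed_up - reported)

def update(prior, data, likelihood):
    """Multiply each prior weight by the likelihood, then normalize."""
    post = {h: w * likelihood(data, h) for h, w in prior.items()}
    total = sum(post.values())
    return {h: w / total for h, w in post.items()}

def marginal(joint, index):
    """Sum the joint distribution over the other coordinate."""
    pmf = {}
    for hypo, prob in joint.items():
        pmf[hypo[index]] = pmf.get(hypo[index], 0.0) + prob
    return pmf

probs = [i / 100 for i in range(101)]
prior = {(q, r): 1.0 for q in probs for r in probs}   # uniform prior

posterior = update(prior, (140, 50), likelihood1)
marg_q = marginal(posterior, 0)
marg_r = marginal(posterior, 1)

# the data constrain only the product q * r, so the marginals coincide
max_gap = max(abs(marg_q[v] - marg_r[v]) for v in probs)
print(max_gap)
```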

To estimate them separately, we need to sample people who sign up to see if they participated.  Here's what the second Likelihood function looks like:

    def Likelihood2(self, data, hypo):
        """Computes the likelihood of the data.

        hypo: pair of (q, r)
        data: tuple (signed up, participated, reported)
        """
        q, r = hypo

        signed_up, participated, reported = data

        # likelihood of the participation count, given q
        yes = participated
        no = signed_up - participated
        like1 = q**yes * (1-q)**no

        # likelihood of the report count among participants, given r
        yes = reported
        no = participated - reported
        like2 = r**yes * (1-r)**no

        return like1 * like2

Again, we are given hypothetical pairs of q and r, but now data is a tuple of (signed_up, participated, reported).  We use q to compute the likelihood of signed_up and participated, and r to compute the likelihood of participated and reported.  The product of these factors, like1 * like2, is the return value.

Again, I don't have real data, but suppose we survey 5 people who signed up, and learn that 3 participated and 1 reported back.  We would do this update:

    data = 5, 3, 1
    suite.Update(data)
    PlotMarginals(suite, root='volunteer2')

And the result would look like this:
Now we can discriminate between q and r.  Based on my imaginary data, we would estimate that more than 60% of the people who signed up participated, but only 50% of them reported back.  Or (more usefully) we could use the posterior distribution of q to form a most likely estimate and credible interval for the number of participants.
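
Here is a rough, self-contained sketch of both updates followed by a summary of the posterior. It is a plain-Python stand-in for the thinkbayes classes, assuming a uniform prior; the helper names are illustrative, and scaling the percentiles of q by the number of sign-ups is only a simple approximation to a credible interval for the number of participants.

```python
def likelihood1(data, hypo):
    q, r = hypo
    p = q * r
    signed_up, reported = data
    return p**reported * (1 - p)**(signed_up - reported)

def likelihood2(data, hypo):
    q, r = hypo
    signed_up, participated, reported = data
    like1 = q**participated * (1 - q)**(signed_up - participated)
    like2 = r**reported * (1 - r)**(participated - reported)
    return like1 * like2

def update(joint, data):
    """Pick the likelihood by the shape of the data, then normalize."""
    likelihood = likelihood1 if len(data) == 2 else likelihood2
    post = {h: w * likelihood(data, h) for h, w in joint.items()}
    total = sum(post.values())
    return {h: w / total for h, w in post.items()}

def marginal(joint, index):
    pmf = {}
    for hypo, prob in joint.items():
        pmf[hypo[index]] = pmf.get(hypo[index], 0.0) + prob
    return pmf

def percentile(pmf, fraction):
    """Smallest value whose cumulative probability reaches fraction."""
    total = 0.0
    for value in sorted(pmf):
        total += pmf[value]
        if total >= fraction:
            return value
    return max(pmf)

probs = [i / 100 for i in range(101)]
joint = {(q, r): 1.0 for q in probs for r in probs}   # uniform prior
joint = update(joint, (140, 50))     # sign-ups and reports
joint = update(joint, (5, 3, 1))     # follow-up survey of 5 sign-ups

marg_q = marginal(joint, 0)
marg_r = marginal(joint, 1)
mean_q = sum(v * p for v, p in marg_q.items())
mean_r = sum(v * p for v, p in marg_r.items())

signed_up = 140
low, high = percentile(marg_q, 0.05), percentile(marg_q, 0.95)
print(mean_q, mean_r)                        # posterior means of q and r
print(low * signed_up, high * signed_up)     # rough 90% interval for participants
```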

This example demonstrates some features we didn't see during the tutorial:

1) The framework I presented extends naturally to handle multi-dimensional estimation.

2) It also handles the case where you want to update the same distribution with different kinds of data.

It also demonstrates two features of Bayesian statistics: the ability to combine information from multiple sources and to extract the maximum amount of information from a small sample.

Many thanks to everyone who participated in the tutorial this year, and especially to the participant who asked about this problem.  Let me know if there are questions and I will update this post.

And if you want to learn more about Bayesian statistics, allow me to recommend Think Bayes.

Saturday, April 12, 2014

Think X, Y and Z: What's in the pipeline?

Greetings from PyCon 2014 in Montreal!  I did a book signing yesterday at the O'Reilly Media booth.  I had the pleasure of working side by side with David Beazley, who was signing copies of The Python Cookbook, now updated for Python 3 and, I assume, including all of the perverse things Prof. Beazley does with Python generators.

Normally O'Reilly provides 25 copies of the book for signing, but due to a mix up, they brought extra copies of all of my books, so I ended up signing about 50 books.  That was fun, but tiring.

Several people asked me what I am working on now, so I thought I would answer here:

1) This semester I am revising Think Stats and using it to teach Data Science at Olin College.  Think Stats was the first book in the Think X series to get published.  There are still many parts I am proud of, but a few parts that make me cringe.  So I am fixing some errors and adding new sections on multiple linear regression, logistic regression, and survival analysis.  The revised edition should be done in June 2014.  I will post my draft, real soon, at

2) I am also teaching Software Systems this semester, a class that covers system-level programming in C along with topics in operating systems, networks, databases, and embedded systems.  We are using Head First C to learn the language, and I am writing Think OS to present the OS topics.  The current (very rough) draft is at  I am planning to incorporate some material from The Little Book of Semaphores.  Then I will finish it up this summer.

3) Next semester I am teaching two half-semester classes: Bayesian statistics during the first half, based on Think Bayes; and digital signal processing during the second half.  I am taking a computational approach to DSP (surprise!), with emphasis on applications like sound and image processing.  The goal is for students to understand spectral analysis using FFT and to wrap their heads around time-domain and frequency-domain representations of time series data.  Working with sound takes advantage of human hearing, which is a pretty good spectral analysis system.  Applications will include pitch tracking as in Rock Band, and maybe we will reimplement Auto-Tune.

I have drafted a few chapters and posted them at I'll be working on it over the summer, then the students will work with it in the fall. If it all goes well, I'll revise it in January 2015, or maybe that summer.

One open question is what environment I will develop the book in. My usual work environment is LaTeX for the book, with Makefiles that generate PDF, HTML, and DocBook. I write code in emacs and run it on the Linux command line. So, that's a pretty old school toolkit.

For a long time I've been thinking about switching to IPython, which would allow me to combine text and code examples, and give users the ability to run the code while reading.

But two things have kept me from making the leap:

a) I like editing scripts, and I hate notebooks.  I find the interface awkward, and I don't want to keep track of which cells have to execute when I make a change.

b) I have been paralyzed by the number of options for distributing the results. For each of my books I have a Python module that contains my library code, so I need to distribute the module along with the notebook. Notebook viewers like nbviewer would not work for me (I think). At the other extreme, I could package my development in a Virtual Machine and let readers run my VM. But that seems like overkill, and many of my readers are not comfortable with virtual machines.

But I just listened to Fernando Perez's keynote at PyCon, where he talked about his work on IPython. I grabbed him after the talk and he was very helpful; in particular, he suggested the option of using Wakari Cloud. Continuum Analytics, which runs Wakari, is a PyCon sponsor, so I will try to talk to them later today (and find out whether I am misunderstanding what they do).

4) The other class I am teaching in the fall is Modeling and Simulation, which I helped develop, along with my colleagues John Geddes and Mark Somerville, about six years ago. For the last few years we've been using my book, Physical Modeling in MATLAB.

The book presents the computational part of the class, but doesn't include all of the other material, especially the modeling part of Modeling and Simulation! Prof. Somerville has drafted additional chapters to present that material, and Prof. Geddes and I have talked about expanding some of the math topics.

This summer the three of us are hoping to draft a major revision of the book. We'll try it out in the fall, and then finish it off in January.

The current version of the book uses MATLAB, but the features we use are also available in Python with SciPy. So at some point we might switch the class and the book over to Python.


That's what's in the pipeline! I'm not sure I will actually be able to finish four books in the next year, so this schedule might be "aspirational" (that's Olin-speak for things that might not be completely realistic).

Let me know what you think. Which ones should I work on first?

Monday, April 7, 2014

The Internet and religious affiliation

A few weeks ago I published this paper on arXiv: "Religious affiliation, education and Internet use".  Regular readers of this blog will recognize this as the article I was writing about in July 2012, including this article.

A few days ago, MIT Technology Review wrote about my paper: How the Internet Is Taking Away America’s Religion

[UPDATE April 16, 2014: Removed some broken links here.]

The story has attracted some media attention, so I am using this blog article to correct a few errors and provide more information.


1) Some stories are reporting that my paper was published by MIT Technology Review.  That is not correct.  It was published in arXiv, "a repository of electronic preprints of scientific papers in the fields of mathematics, physics, astronomy, computer science, quantitative biology, statistics, and quantitative finance, which can be accessed online."  It is not a peer-reviewed venue.  My paper has not been peer reviewed.

2) I am a Professor of Computer Science at Olin College of Engineering in Needham, Massachusetts.

3) My results suggest that Internet use might account for about 20% of the decrease in religious affiliation between 1990 and 2010, or about 5 million out of 25 million people.

Note that this result doesn't mean that 5 million people who used to be affiliated are now disaffiliated because of the Internet.  Rather, my study estimates that if the Internet had no effect on affiliation, there would be an additional 5 million affiliated people.

Some additional questions and answers follow.


How do you know the Internet causes disaffiliation?  Why not global warming?  Or pirates?

The statistical analysis in this paper is not just a correlation between two time series.  That would tell us very little about causation.

The analysis I report is based on more than 9000 individual respondents to the General Social Survey (GSS).  Each respondent answered questions about their Internet use and religious affiliation, and provided demographic information like religious upbringing, region, income, socioeconomic status, etc.

These variables allowed me to control for most of the usual confounding variables and isolate the association between Internet use and religious affiliation.

Now, there are still two alternative explanations.  Religious disaffiliation could cause Internet use.  That seems less plausible to me, but it is still possible.  The other possibility is that a third factor could cause both Internet use and disaffiliation.  But that factor would have to be new or increasing since 1990, and it would have to be uncorrelated or only weakly correlated with the control variables I included.

I can't think of any good candidates for a third factor like that, and I have not heard any candidates that meet the criteria.  So I think it is reasonable to conclude that Internet use causes disaffiliation.

Was this research published in MIT Technology Review?

No. I published the paper on arXiv: Religious affiliation, education and Internet use

Tech Review was one of the first places to pick up the story.  Some sites are reporting, incorrectly, that my research was published in MIT Technology Review.

Correlation and causation

By far the most common response to my paper is something like what Matt McFarland wrote:
This is an interesting read, but the problem with the theory is that correlation isn’t causation. With so much changing since 1990, it’s difficult to conclude what variables were factors, and to what degree.
Let me address that in two parts.

1) My study is not based on simple correlations.  I used statistical methods (specifically logistic regression) that are designed to answer the question Matt asks: which variables were factors and to what degree?  I controlled for all the usual confounding factors, including income, education, religious upbringing, and a few others.

My results show that there is an association between Internet use and disaffiliation, after controlling for these other factors.

2) The question that remains is which direction the causation goes.  There are three possibilities:

a) Internet use causes disaffiliation
b) disaffiliation causes Internet use
c) some other third factor causes both

In the paper I argue that (a) is more likely than (b) because it is easy to imagine several ways Internet use might cause disaffiliation, and harder to imagine how disaffiliation causes Internet use.

Similarly, it is hard to think of a third factor that causes both Internet use and disaffiliation.  In order to explain the data, this third factor would have to be new, or changing substantially between 1990 and 2010, and it would have to have a strong effect on both Internet use and disaffiliation.

There are certainly many factors that contribute to disaffiliation.  According to my analysis, Internet use might account for 20% of the observed change.  Another 25% is due to changes in religious upbringing, and 5% is due to increases in college education.  That leaves 50% of the observed changes unexplained by the factors I was able to include in the study.

If you think that Internet use does not cause disaffiliation, it is not enough to list other causes of disaffiliation, or other things that have changed since 1990.  To make your argument, you have to find a third factor that

a) Is not already controlled for in my analysis,
b) Is new in 1990, or began to change substantially around that time, and
c) Actually causes (and is not just associated with) both Internet use and disaffiliation.

So far I have not heard a candidate that meets these criteria.  For example, some people have suggested  personality traits that might cause both Internet use and disaffiliation.  That's certainly possible.  But unless those traits are new, or started becoming more prevalent, in 1990, they don't explain the recent changes.

Questions from the media

1. Why did you initiate this study?

As a college teacher, I have been following the CIRP Freshman Survey for several years.  It is a survey of incoming college students that asks, among other things, about their religious preference.  Since 1985 the fraction of students reporting no religious preference has more than tripled, from 8% to 25%.  I think this is an underreported story.

My latest blog post about the Freshman Survey is here:

About two years ago I started working with data from the General Social Survey (GSS), and realized there was an opportunity to investigate factors associated with disaffiliation, and to predict future trends.

2. How did you gather your data?

I am not involved in running either the Freshman Survey or the GSS.  I use data made available by the Higher Education Research Institute (HERI) and National Opinion Research Center (NORC).  Obviously, their work is a great benefit to researchers like me.  On the other hand, there are always challenges working with “found data.”  Even with surveys like these that are well designed and executed, you never have exactly the data you want; it takes some creativity to find the data that answer your questions and the questions your data can answer.

3. Could you elaborate on your overall findings?

I think there are two important results from this study.  One is that I identified several factors associated with religious disaffiliation and measured the strength of each association.  By controlling for things like income, education, and religious upbringing, I was able to isolate the effect of Internet use.  I found something like a dose-response curve.  People who reported some Internet use (a few hours a week) were less likely to report a religious preference, by about 2 percentage points.  People who use the Internet more than 7 hours per week were even less likely to be religious, by an additional 3 percentage points.  That effect turns out to be stronger than a four-year college education, which reduces religious affiliation by about 2 percentage points.

With this kind of data, we can’t know for sure that Internet use causes religious disaffiliation.  It is always possible that disaffiliation causes Internet use, or that a third factor causes both.  In the paper I explain why I think these alternatives are less plausible, and provide some additional analysis.  Based on these results, I conclude, tentatively, that Internet use causes religious disaffiliation, but a reasonable person could disagree.

In the second part of the paper I use parameters from the regression models to run simulations of counterfactual worlds, which allows me to estimate the number of people in the U.S. whose disaffiliation is due to education, Internet use, and other factors.  Between 1990 and 2010, the total decrease in religious affiliation is about 25 million people.  About 25% of that decrease is because fewer people are being raised with a religious affiliation.  Another 20% might be due to increases in Internet use.  And another 5% might be due to increases in college education.

That leaves 50% of the decrease unexplained by the factors I was able to include in the study, which raises interesting questions for future research.

4. Are you using this research for anything specific?

I do work in this area because a lot of people find it interesting; I also use it as an example in my classes.  I teach Data Science at Olin College of Engineering.  I want my students to be prepared to work with real data and use it to answer real questions.  I use examples like this to demonstrate the tools and to show what’s possible as data like this becomes more readily available.

5. Based on your research, are you able to hypothesize what might happen to religion in America in 50, 100, or more years?

I made some predictions in this blog post:

The most likely changes between now and 2040 are: the fraction of people with no religious preference will increase to about 25%, overtaking the fraction of Catholics, which will decline slowly.  The fraction of Protestants will drop more quickly, to about 45%.

These predictions are based on generational replacement: the people in the surveyed population will get older; some will die and be replaced by the next generation.  Most adults don’t change religious affiliation, so these predictions should be reliable.

But they are based on generational replacement only, not on any other factors that might speed up or slow down the trends.  Going farther into the future, those other factors become more important.

6. Is there anything in particular you think people should know/understand about your research and findings?

Again, it’s important to remember that my results are based on observational studies.  With that kind of data, we don’t know for sure whether the statistical relationships we see are due to causation.  In this case I think we can make a strong argument that Internet use causes religious disaffiliation, but a reasonable person could disagree.

My paper includes some analysis that is pretty standard stuff, like logistic regression.  But I also used methods that are less common; for example, using parameters from the regression models, I ran simulations of counterfactual worlds, which allowed me to estimate the number of people in the U.S. whose disaffiliation might be due to education, Internet use, and other factors.

7. Although your research cannot determine for sure that the Internet causes less religious affiliation, what about it might you speculate could be decreasing religion?

In the paper I wrote “it is easy to imagine at least two ways Internet use could contribute to disaffiliation.  For people living in homogeneous communities, the Internet provides opportunities to find information about people of other religions (and none), and to interact with them personally.  Also, for people with religious doubt, the Internet provides access to people in similar circumstances all over the world.”

These are speculations based on anecdotal evidence, not the kind of data I used in the statistical analysis.  One place I see people from different religious backgrounds engaging on the Internet is in online forums like Reddit.  Here's an example from just a few hours ago.

This NYT article provides similar anecdotal evidence:

8. Do you think that Internet will advance in this secularization process?

There are two parts of secularization: disaffiliation from organized religion and decrease in religious faith.  The data I reported in my paper provide some evidence that the Internet is contributing to disaffiliation in the U.S.  I haven’t had a chance to look into the related issue of religious faith, but I am interested in that question, too.

9. The access to a wider range of information would be one of the explanations for the decrease in religious affiliation?

Again, I don’t have data to support that, but it seems likely to be at least part of the explanation.

More questions

There is a difference between those who are religiously affiliated (belong to or active with a church, for example) and those who consider themselves spiritual or religious. Can you clarify what you’re talking about?

Yes, good point!  My paper is only about religious affiliation, or religious preference.  The GSS also asks about religious faith and spirituality, but I have not had a chance to do the same analysis with those variables.

I have seen other studies that suggest that belief in God, other forms of religious faith, and spirituality are not changing as quickly as religious affiliation.  But I don't have a good reference handy.