Friday, June 22, 2018

Inference in three hours

I am preparing a talk for the Joint Statistical Meetings (JSM 2018) in August.  It's part of a session called "Bringing Intro Stats into a Multivariate and Data-Rich World"; my talk is called "Inference in Three Hours, and More Time for the Good Stuff".

Here's what I said I would talk about:
Teaching statistical inference using mathematical methods takes too much time, emphasizes the least important material, and leaves many students unprepared to apply statistics in the real world. Simple computer simulations can demonstrate the fundamental ideas of statistical inference quickly, clearly, and memorably. Computational methods are also robust and flexible, making it possible to work with a wider range of data and experiments. And by teaching statistical inference better and faster, we leave time for the most important goals of statistics education: preparing students to use data to answer questions and guide decision making under uncertainty. In this talk, I discuss problems with current approaches and present educational material I have developed based on computer simulations in Python.
I have slides for the talk now:

And here's the Jupyter notebook they are based on.

I have a few weeks until the conference, so comments and suggestions are welcome.


Coincidentally, I got a question on Twitter today that's related to my talk:
Very late to this post by @AllenDowney, but quite informative:  …
Have one question though: seems a lot of the single case reasoning here is similar to what I was taught was a mistaken conclusion: “that there is a 95% prob that a parameter lies within the given 95% CI.” What is the difference? Seems I am missing some nuance?
The post @cutearguments asks about is "Recidivism and single-case probabilities", where I make an argument that single-case probabilities are not a special problem, even under the frequentist interpretation of probability; they only seem like a special problem because they make the reference class problem particularly salient.

So what does that have to do with confidence intervals?  Let me start with the example in my talk: suppose you are trying to estimate the average height of men in the U.S.  You collect a sample and generate an estimate, like 178 cm, and a 95% confidence interval, like (177, 179) cm.
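As a minimal sketch of how an estimate and interval like these can be generated by simulation: the code below resamples a sample with replacement and takes percentiles of the resampled means. The data are synthetic (drawn from a normal distribution with made-up parameters), and the names `sample` and `resample_means` are assumptions for illustration, not code from the talk.

```python
import numpy as np

rng = np.random.default_rng(17)

# Synthetic stand-in for a real sample of heights in cm (assumed values).
sample = rng.normal(loc=178, scale=7.7, size=200)

def resample_means(sample, iters=1000):
    """Simulate the sampling process by resampling with replacement."""
    n = len(sample)
    return np.array([rng.choice(sample, size=n, replace=True).mean()
                     for _ in range(iters)])

means = resample_means(sample)
estimate = sample.mean()

# The middle 95% of the simulated means serves as the confidence interval.
ci_low, ci_high = np.percentile(means, [2.5, 97.5])
print(f"estimate: {estimate:.1f} cm, 95% CI: ({ci_low:.1f}, {ci_high:.1f}) cm")
```

The width of the interval here reflects only the variation between resampled datasets, which is the sense in which it quantifies variability due to random sampling.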

Naively, it is tempting to say that there is a 95% chance that the true value (the actual average height of every male resident in the population) falls in the 95% confidence interval.  But that's not true.

There are two reasons you might hear for why it's not true:

1) The true value is unknown, but it is not a random quantity, so it is either in the interval or it's not.  You can't assign a probability to it.

2) The 95% confidence interval does not have a 95% chance of containing the true value because that's just not what it means.  A confidence interval quantifies variability due to random sampling; that's all.

The first argument is bogus; the second is valid.

If you are a Bayesian, the first argument is bogus because it is entirely unproblematic to make probability statements about unknown quantities, whether they are considered random or not.

If you are a frequentist, the first argument is still bogus because even if the true value is not a random quantity, the confidence interval is.  And furthermore, it belongs to a natural reference class, the set of confidence intervals we would get by running the experiment many times.  If we agree to treat it as a member of that reference class, we should have no problem giving it a probability of containing the true value.

But that probability is not 95%.   If you want an interval with a 95% chance of containing the true value, you need a Bayesian credible interval.

Friday, June 1, 2018

Bayesian Zig Zag

Almost two years ago, I had the pleasure of speaking at the inaugural meeting of the Boston Bayesians, where I presented "Bayesian Bandits from Scratch" (the notebook for that talk is here).  Since then, the group has flourished, thanks to the organizers, Jordi Diaz and Colin Carroll.

Last night I made my triumphant return for the 21st meeting, where I presented a talk I called "Bayesian Zig Zag".  Here's the abstract:
Tools like PyMC and Stan make it easy to implement probabilistic models, but getting started can be challenging.  In this talk, I present a strategy for simultaneously developing and implementing probabilistic models by alternating between forward and inverse probabilities and between grid algorithms and MCMC.  This process helps developers validate modeling decisions and verify their implementation. 
As an example, I will use a version of the "Boston Bruins problem", which I presented in Think Bayes, updated for the 2017-18 season. I will also present and request comments on my plans for the second edition of Think Bayes.
When I wrote the abstract, I was confident that the Bruins would be in the Stanley Cup final, but that is not how it worked out.  I adapted, using results from the first two games of the NHL final series to generate predictions for the next game.
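For readers who have not seen a grid algorithm, here is a minimal sketch of the idea, alternating between an inverse step (updating beliefs about a scoring rate) and a forward step (predicting the next game). It is not the code from the notebook: the uniform prior, the grid bounds, and the goal counts are all made up for illustration.

```python
import numpy as np
from math import factorial

# Grid of hypothetical goal-scoring rates (goals per game).
mus = np.linspace(0.01, 10, 501)

# Assumed prior: uniform over the grid.
prior = np.ones_like(mus)
prior /= prior.sum()

def poisson_pmf(k, mu):
    """Probability of scoring k goals in one game, given rate mu."""
    return mu**k * np.exp(-mu) / factorial(k)

# Made-up goal counts for two games (not real results).
goals = [4, 2]

# Inverse probability: multiply the prior by each likelihood, then normalize.
posterior = prior.copy()
for k in goals:
    posterior *= poisson_pmf(k, mus)
posterior /= posterior.sum()
posterior_mean = np.sum(mus * posterior)

# Forward probability: predictive distribution of goals in the next game,
# averaging the Poisson PMF over the posterior on the grid.
pred = np.array([np.sum(posterior * poisson_pmf(k, mus)) for k in range(15)])

print(f"posterior mean rate: {posterior_mean:.2f}")
print(f"most likely number of goals next game: {pred.argmax()}")
```

Because the grid version is simple enough to check by hand, it can serve as a reference when validating an MCMC implementation of the same model.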

Here are the slides from the talk:

And here is the Jupyter notebook I presented.  If you want to follow along, you'll see that there is a slide that introduces each section of the notebook, and then you can read the details.  If you have a Python installation with PyMC, you can download the notebook from the repository and try it out.

The talk starts with basic material that should be accessible for beginners, and ends with a hierarchical Bayesian model of a Poisson process, so it covers a lot of ground!  I hope you find it useful.

For people who were there, thank you for coming (all the way from Australia!), and thanks for the questions, comments, and conversation.  Thanks again to Jordi and Colin for organizing, to WeWork for hosting, and to QuantumBlack for sponsoring.

Thursday, May 3, 2018

Some people hate custom libraries

For most of my books, I provide a Python module that defines the functions and objects I use in the book.  That makes some people angry.

The following Amazon review does a nice job of summarizing the objections, and it demonstrates the surprising passion this issue evokes:

March 29, 2018
Format: Paperback
Echoing another reviewer, the custom code requirement means you learn their custom code rather than, you know, the standard modules numpy and scipy. For example, at least four separate classes are required, representing hundreds of lines of code, are required just to execute the first six lines of code in the book. All those lines do is define two signals, a cosine and a sine, sums them, then plots them. This, infuriatingly, hides some basic steps. Here's how you can create a cosine wave with frequency 440Hz:

duration = 0.5
framerate = 11025
n = round(duration*framerate)
ts = np.arange(n)/framerate
amp = 1.0
freq = 440
offset = 0.0
cos_sig = amp * numpy.cos( 2*numpy.pi*ts*freq + offset)
freq = 880
sin_sig = amp * numpy.sin( 2*numpy.pi*ts*freq + offset)

Instead, these clowns have

cos_sig = thinkdsp.CosSignal(freq=440,amp=1.0,offset=0)
sin_sig = thinkdsp.SinSignal(freq=440,amp=1.0,offset=0)
mix = cos_sig + sin_sig

where CosSignal and SinSignal are custom classes, not functions, which inherits four separate classes, NONE of which are necessary, and all of which serve to make things more complex than necessary, on the pretense this makes things easier. The classes these class inherit are a generic Sinusoid and SumSignal classes, which inherits a Signal class, which depends on a Wave class, which performs plotting using pyplot in matplotlib. None of which make anything really any easier, but does serve to hide a lot of basic functionality, like hiding how to use numpy, matplotlib, and pyplot.

In short, just to get through the first two pages, you have to have access to github to import their ridiculous thinkdsp, thinkplot, and thinkstats, totalling around 5500 lines of code, or you are just screwed and can't use this book. All decent teaching books develops code you need as necessary and do NOT require half a dozen files with thousands of lines of custom code just to get to page 2. What kind of clown does this when trying to write a book to show how to do basic signal processing? Someone not interested in teaching you DSP, but trying to show off their subpar programming skills by adding unnecessary complexity (a sure sign of a basic programmer, not a good).

The authors openly admit their custom code is nothing more than wrappers in numpy and scipy, so the authors KNEW they were writing a crappy book and filling it with a LOT of unnecessary complexity. Bad code is bad code. Using bad code to teach makes bad teaching. It's obvious Allen B. Downey has spent his career in academia, where writing quality code doesn't matter.

Well, at least he spelled my name right.

Maybe I should explain why I think it's a good idea to provide a custom library along with a book like Think DSP.  Importantly, the goal of the book is to help people learn the core ideas of signal processing; the software is a means to this end.

Here's what I said in the preface:
The premise of this book is that if you know how to program, you can use that skill to learn other things, and have fun doing it. 
With a programming-based approach, I can present the most important ideas right away. By the end of the first chapter, you can analyze sound recordings and other signals, and generate new sounds. Each chapter introduces a new technique and an application you can apply to real signals. At each step you learn how to use a technique first, and then how it works.
For example, in the first chapter, I introduce two objects defined in thinkdsp: Wave and Spectrum.  Wave provides a method called make_spectrum that creates a Spectrum object, and Spectrum provides make_wave, which creates a Wave.

When readers use these objects and methods, they are implicitly learning one of the fundamental ideas of signal processing: that a Wave and its Spectrum are equivalent representations of the same information -- given one, you can always compute the other.
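The equivalence can be checked directly with NumPy. This is a minimal sketch using a synthetic signal, not code from the book: it transforms a signal to its spectrum and back, and confirms that the round trip recovers the original samples.

```python
import numpy as np

framerate = 11025
ts = np.arange(framerate // 2) / framerate   # about 0.5 seconds of samples

# Synthetic signal: a 440 Hz cosine plus an 880 Hz sine.
ys = np.cos(2 * np.pi * 440 * ts) + np.sin(2 * np.pi * 880 * ts)

# Wave -> Spectrum: the real FFT of the samples.
hs = np.fft.rfft(ys)

# Spectrum -> Wave: the inverse real FFT; n tells irfft the original length.
ys2 = np.fft.irfft(hs, n=len(ys))

print(np.allclose(ys, ys2))
```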

This example demonstrates one reason I use custom libraries in my books: The API is the lesson.  As you learn about these objects and how they interact, you are also learning the core ideas of the topic.

Another reason I think these libraries are a good idea is that they let me introduce ideas top-down: that is, I can show what a method does -- and why it is useful -- first; then I can present details when they are necessary or most useful.

For example, I introduce the Spectrum object in Chapter 1.  I use it to apply a low pass filter, and the reader can hear what that sounds like.  You can too, by running the Chapter 1 notebook on Binder.

In Chapter 2, I reveal that my make_spectrum function is a thin wrapper on two NumPy functions, and present the source code:

from numpy.fft import rfft, rfftfreq

# class Wave:
    def make_spectrum(self):
        n = len(self.ys)
        d = 1 / self.framerate

        hs = rfft(self.ys)
        fs = rfftfreq(n, d)

        return Spectrum(hs, fs, self.framerate)

At this point, anyone who prefers to use NumPy directly, rather than my wrappers, knows how.
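For example, a reader working with NumPy directly might do something like the following. It is a sketch with a synthetic 440 Hz cosine; the variable names are arbitrary.

```python
import numpy as np

framerate = 11025
n = round(0.5 * framerate)
ts = np.arange(n) / framerate
ys = np.cos(2 * np.pi * 440 * ts)

# The same two calls that make_spectrum wraps.
hs = np.fft.rfft(ys)
fs = np.fft.rfftfreq(n, d=1 / framerate)

# The frequency with the largest amplitude should be close to 440 Hz;
# it is not exact because 440 Hz does not fall precisely on a frequency bin.
peak_freq = fs[np.argmax(np.abs(hs))]
print(f"peak frequency: {peak_freq:.1f} Hz")
```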

In Chapter 7, I unwrap one more layer and show how the FFT algorithm works.  Why Chapter 7?  Because I introduce correlation in Chapter 5, which helps me explain the Discrete Cosine Transform in Chapter 6, which helps me explain the Discrete Fourier Transform.

Using custom libraries lets me organize the material in the way I think works best, based on my experience working with students and seeing how they learn.

This example demonstrates another benefit of defining my own objects: data encapsulation.  When you use NumPy's rfft to compute a spectrum, you get an array of amplitudes, but not the frequencies they correspond to.  You can call rfftfreq to get the frequencies, and that's fine, but now you have two arrays that represent one spectrum.  Wouldn't it be nice to wrap them up in an object?  That's what a Spectrum object is.
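To make the encapsulation argument concrete, here is a stripped-down sketch of such an object. `SimpleSpectrum` and its `low_pass` method are hypothetical stand-ins written for this example; they are not the thinkdsp implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SimpleSpectrum:
    """Hypothetical stand-in that keeps amplitudes and frequencies together."""
    hs: np.ndarray      # complex amplitudes from rfft
    fs: np.ndarray      # matching frequencies from rfftfreq
    framerate: int

    def low_pass(self, cutoff):
        """Zero out every component above the cutoff frequency, in place."""
        self.hs[self.fs > cutoff] = 0

def make_spectrum(ys, framerate):
    """Compute both arrays once and wrap them in a single object."""
    hs = np.fft.rfft(ys)
    fs = np.fft.rfftfreq(len(ys), 1 / framerate)
    return SimpleSpectrum(hs, fs, framerate)

framerate = 11025
ts = np.arange(framerate // 2) / framerate
ys = np.cos(2 * np.pi * 440 * ts) + 0.5 * np.sin(2 * np.pi * 880 * ts)

spectrum = make_spectrum(ys, framerate)
spectrum.low_pass(600)
```

Because the object carries its own frequencies, `low_pass` can be expressed in Hz; with two loose arrays, the caller would have to keep them aligned by hand.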

Finally, I think these examples demonstrate good software engineering practice, particularly bottom-up design.  When you work with libraries like NumPy, it is common and generally considered a good idea to define functions and objects that encapsulate data, hide details, eliminate repeated code, and create new abstractions.  Paul Graham wrote about this idea in one of his essays on software:
[...] you don't just write your program down toward the language, you also build the language up toward your program. [...] the boundary between language and program is drawn and redrawn, until eventually it comes to rest along [...] the natural frontiers of your problem. In the end your program will look as if the language had been designed for it.
That's why, in the example that makes my correspondent so angry, it takes just three lines to create and add the signals; and more importantly, those lines contain exactly the information relevant to the operations and no more.  I think that's good quality code.

In summary, I provide custom libraries for my books because:

1) They demonstrate good software engineering practice, including bottom-up design and data encapsulation.

2) They let me present ideas top-down, showing how they are used before how they are implemented.

3) And as readers learn the APIs I defined, they are implicitly learning the key ideas.

I understand that not everyone agrees with this design decision, and maybe it doesn't work for everyone.  But I am still surprised that it makes people so angry.

Wednesday, April 18, 2018

Computing at Olin Q&A

I was recently interviewed by Sally Phelps, the Director of Postgraduate Planning at Olin.  We talked about computer science at Olin, which is something we are often asked to explain to prospective students and their parents, employers, and other external audiences.

Afterward, I wrote the following approximation of our conversation, which I have edited to be much more coherent than what I actually said.

I should note: My answers to the following questions are my opinions.  I believe that other Olin professors who teach software classes would say similar things, but I am sure we would not all say the same things.

Photo Credit: Sarah Deng

Q: What is the philosophy of Olin when it comes to training software engineers of the future?

To understand computer science at Olin, you have to understand that Olin really has one curriculum, and it's engineering.

We have degrees in Engineering, Mechanical Engineering, and Electrical and Computer Engineering.  But everyone sees the same approach to engineering: it starts with people and it ends with people.  That means you can't wait for someone to hand you a well-formulated problem from a textbook; you have to understand the people you are designing for, and the context of the problem.  You have to know when an engineering solution can help and when it might not.  And then when you have a solution, you have to be able to get it out of the lab and into the world.

Q: That sounds very different from a traditional computer science degree.  

It is.  Because we already have a lot of computer scientists who know how data structures work; we don't have as many who can identify opportunities, work on open-ended problems, work on teams with people from other disciplines, work on solutions that might involve electrical and mechanical systems as well as software.

And we don't have a lot of computer scientists who can communicate clearly about their work; to have impact, they have to be able to explain the value of what they are doing.  Most computer science programs don't teach those things very well.

Also most CS programs don't do a great job of preparing students to work as software engineers.  A lot of classes are too theoretical, too mathematical, and too focused on the computer itself, not the things you want to do with it, the applications.

At Olin, we've got some theory, some mathematical foundations, some focus on the design of software systems.  But we've turned that dial down because the truth is that a lot of that material is not relevant to practice.  I always get a fight when I say that, because you can never take anything out of the curriculum.  There's always someone who says you have to know how to balance a red-black tree or you can't be a computer scientist; or you have to know about Belady's anomaly, or you have to know X, Y, and Z.

Well, you don't.  For the vast majority of our students, for all the things they are going to do, a big chunk of the traditional curriculum is irrelevant.  So we look at the traditional curriculum with some skepticism, and we make cuts.

We have to; there's only so much time.  In four years, students take about 32 classes.  We have to spend them wisely.  We have to think about where they are going after graduation.  Some will go to grad school, some will start companies, some will work in industry.  Some of them will be software engineers, some will be product managers, some will work in other fields; they might develop software, or work with software developers.

Q: So how do you prepare people for all of that?

It depends what "prepare" means.  If it means teach them everything they need to know, it's impossible.  But you can identify the knowledge, skills and attitudes they are most likely to need.

It helps if you have faculty with industry experience.  A lot of professors go straight to grad school and straight into academics, and then they have long arguments about what software engineers need to know.  Sometimes they don't know what they are talking about.

If you're designing a curriculum, just like a good engineer, you have to understand the people you're designing for and the context of the solution.  Who are your students, where are they going, and what are they going to need?  Then you can decide what to teach.

Q: So if a student is interested in CS and they're deciding between Olin and another school, what do you tell them?

I usually tell them about the Olin curriculum, what I just explained.  And I suggest they look at our graduation requirements.  Students at Olin who do the Engineering major with a concentration in computing, they take a relatively small number of computer science classes, usually around seven.  And they take a lot of other engineering classes.

In the first semester, everyone takes the same three engineering classes, so everyone does some mechanical design, some circuits and measurement, and some computational modeling.

Everyone takes a foundation class in humanities, and another in entrepreneurship.  Everyone takes Principles of Engineering, where they design and build a mechatronic system.

In the fourth semester everyone takes user-centric design, and finally, in the senior year, everyone does a two-semester engineering capstone, which is usually interdisciplinary.

If a prospective student looks at those classes and they're excited about doing design and engineering -- and several kinds of engineering -- along with computer science, then Olin is probably a good choice for them.

If they look at the requirements and they dread them -- if the requirements are preventing them from doing what they really want -- then maybe Olin's not the right place.

Q: I understand there are student-taught software classes – can you tell us more about that?

We do, and a lot of them have been related to software, because that's an area where we have students doing internships, and sometimes starting companies, and they get a lot of industry experience.

And they come back with skills and knowledge they can share with their peers.  Sometimes that happens in classes, especially on projects.  But it can also be a student-led class where student instructors propose a class, and then they work with faculty advisors to develop and present the material.  As an advisor, I can help with curriculum design and the pedagogy, and sometimes I have a good view of the context or the big picture.  And then a lot of times the students have a better view of the details.  They've spent the summer working in a particular domain, or with a particular technology, and they can help their peers get a jump start.

They also bring some of the skills and attitudes of software engineering.  For example, we teach students about testing, and version control, and code quality.  But in a class it can be artificial; a lot of times students want to get code working and they have to move on to the next thing.  They don't want to hear from me about coding "style".

It can be more effective when it's coming from peers, and when it's based on industry experience.  The student instructors might say they worked at Pivotal, and they had to do pair programming, or they worked at Google, and all of their code was reviewed for readability before they could check it in.  Sometimes that's got more credibility than I do.

Q: What does the future look like for computing at Olin?

A big part of it is programming in context.  For example, the first software class is Modeling and Simulation, which is about computational models in science, including physics, chemistry, medicine, ecology…  So right from the beginning, we're not just learning to program, we're applying it to real world problems.

Programming isn't just a way of translating well-understood solutions into code.  It's a way of communicating, teaching, learning, and thinking.  Students with basic programming skills can use coding as a "pedagogic lever" to learn other topics in engineering, math, natural and social science, arts and humanities.

I think we are only starting to figure out what that looks like.  We have some classes that use computation in these ways, but I think there are a lot more opportunities.  A lot of ideas that we teach mathematically, we could be doing computationally, maybe in addition to, or maybe instead of the math.

One of my examples is signal processing, where probably the most important idea is the Fourier transform.  If you do that mathematically, you have to start with complex numbers and work your way up.  It takes a long time before you get to anything interesting.

With a computational approach, I can give you a program on the first day to compute the Fourier transform, and you can use it, and apply it to real problems, and see what it does, and run experiments and listen to the results, all on day one.  And now that you know what it's good for, maybe you'll want to know how it works.  So we can go top-down, start with applications, and then open the hood and look at the engine.

I'd like to see us apply this approach throughout the curriculum, especially engineering, math, and science, but also arts, humanities and social science.

Thursday, March 22, 2018

Generational changes in support for gun laws

This is the fourth article in a series about changes in support for gun control laws over the last 50 years.

In the first article I looked at data from the General Social Survey and found that young adults are less likely than previous generations to support gun control.

In the second article I looked at data from the CIRP Freshman Survey and found that even the youngest adults, who grew up with lockdown drills and graphic news coverage of school shootings, are LESS likely to support strict gun control laws.

In the third article, I ran graphical tests to distinguish age, period, and cohort effects.  I found strong evidence for a period effect: support for gun control among all groups increased during the 1980s and 90s, and has been falling in all groups since 2000.  I also saw some evidence of a cohort effect: people born in the 1980s and 90s are less likely to support strict gun control laws.

In this article, I dive deeper, using logistic regression to estimate the sizes of these effects separately, while controlling for demographic factors like sex, race, urban or rural residence, etc.


As in the previous articles, I am using data from the General Social Survey (GSS), and in particular the variable 'gunlaw', which records responses to the question:
Would you favor or oppose a law which would require a person to obtain a police permit before he or she could buy a gun?
I characterize respondents who answer "favor" as more likely to support strict gun control laws.

The explanatory factors I consider are:

'nineties', 'eighties', 'seventies', 'fifties', 'forties', 'thirties', 'twenties':  These variables encode the respondent's decade of birth.

'female': indicates that the respondent is female.

'black': indicates that the respondent is black.

'otherrace': indicates that the respondent is neither white nor black (most people in this category are mixed race).

'hispanic': indicates that the respondent is Hispanic.

'conservative', 'liberal': indicates that the respondent reports being conservative or liberal (not moderate).

'lowrealinc', 'highrealinc': indicates that the respondent's household income is in the bottom or top 25%, based on self-report, converted to constant dollars.

'college': indicates whether the respondent attended any college.

'urban', 'rural': indicates whether the respondent lives in an urban or rural area (not suburban).

'gunhome': indicates whether the respondent reports that they "have in [their] home or garage any guns or revolvers".

'threatened': indicates whether the respondent reports that they have "ever been threatened with a gun, or shot at". 

These factors are all binary.  In addition, I also estimate the period effect by including the following  variables: 'yminus10', 'yminus20', 'yminus30', and 'yminus40', to indicate respondents surveyed 10, 20, 30, and 40 years prior to 2016. 


I used logistic regression to estimate the effect of each of these variables.  The regression also includes a cubic model of time, intended to capture the period effect.  You can see the period effect in the following figure, which shows actual changes in support for a gun permit law over the history of the GSS (in gray) and the retroactive predictions of the model (in red).

To present the results in an interpretable form, I define a collection of hypothetical people with different attributes and estimate the probability that each one would favor a gun permit law.

As a baseline, I start with a white, non-Hispanic male born in the 1960s who is politically moderate, in the middle 50% of the income range, who attended college, lives in a suburb, has never been threatened with a gun or shot at, and does not have a gun at home.  People like that interviewed in 2016 have a 74% chance of favoring "a law which would require a person to obtain a police permit before he or she could buy a gun".

The following table shows results for people with different attributes: the first row, which is labeled 'baseline' is the baseline person from the previous paragraph; the second row, labeled 'nineties', is identical to the baseline in every way, except born in the 1990s rather than the 1960s.  The line labeled 'female' is identical to the baseline, but female.

These results are generated by running 201 random samples from the GSS data and computing the median, 2.5th and 97.5th percentiles.  The range from 'low2.5' to 'high97.5' forms a 95% confidence interval.

                   low2.5   median high97.5
baseline             71.5     73.6     75.1
nineties             60.1     63.8     68.5
eighties             64.9     67.7     69.8
seventies            67.8     70.1     72.0
fifties              70.7     72.8     74.8
forties              71.4     73.7     75.4
thirties             69.9     72.0     73.8
twenties             70.7     72.9     74.8
female               83.2     84.7     85.6
black                75.9     78.0     79.5
otherrace            78.2     80.4     82.9
hispanic             70.7     73.5     76.0
conservative         64.1     65.8     67.8
liberal              75.9     77.7     79.3
lowrealinc           69.1     71.1     72.9
highrealinc          73.2     75.2     76.7
college              73.2     75.1     76.3
urban                66.0     68.3     69.9
rural                59.5     62.0     64.0
threatened           68.8     71.1     73.1
gunhome              51.1     53.6     55.8
yminus10             84.1     85.3     86.2
yminus20             84.3     85.3     86.4
yminus30             80.7     81.5     82.8
yminus40             78.8     79.9     81.4
lowest combo         16.5     19.2     22.1
highest combo        91.3     92.3     93.4

Again, the hypothetical baseline person has a 74% chance of favoring a gun permit law.  A nearly identical person born in the 1990s has only a 64% chance.

To see this and the other effects more clearly, I computed the difference between each hypothetical person and the baseline, then sorted by the magnitude of the apparent effect. 

                   low2.5   median high97.5
lowest combo        -57.7    -54.5    -49.5
gunhome             -21.1    -20.0    -18.6
rural               -13.4    -11.5     -9.9
nineties            -13.5     -9.8     -4.5
conservative         -8.9     -7.6     -6.5
eighties             -7.9     -5.9     -2.7
urban                -6.6     -5.4     -4.1
seventies            -5.6     -3.2     -1.4
lowrealinc           -3.7     -2.5     -1.2
threatened           -3.6     -2.4     -1.1
thirties             -3.2     -1.5     -0.1
fifties              -2.2     -0.8      0.8
twenties             -2.8     -0.6      1.0
forties              -1.7     -0.0      1.5
baseline              0.0      0.0      0.0
hispanic             -1.5      0.2      2.0
college               0.6      1.6      2.6
highrealinc           0.7      1.7      2.6
liberal               2.9      4.2      5.2
black                 3.1      4.5      5.8
yminus40              4.9      6.6      8.0
otherrace             4.8      6.9      9.6
yminus30              6.6      8.1      9.8
female               10.1     11.1     12.4
yminus20             10.2     11.7     14.0
yminus10             10.1     11.7     13.7
highest combo        16.9     18.8     21.1

All else being equal, someone who owns a gun is about 20 percentage points less likely to favor gun permit laws.

Compared to people born in the 1960s, people born in the 1990s are 10 points less likely.  People born in the 1980s and 1970s are also less likely, by 6 and 3 points.  People born in previous generations are not substantially different from people born in the 1960s (and the effect is not statistically significant).

Compared to suburbanites, people in rural and urban communities are less likely, by 12 and 5 points.

People in the lowest 25% of household income are less likely by 2.5 points; people in the highest 25% are more likely by 2 percentage points.

Blacks and other non-whites are more likely to favor gun permit laws, by 4.5 and 7 percentage points.

Compared to political moderates, conservatives are 8 points less likely and liberals are 4 points more likely to favor gun permit laws.

Compared to men, women are 11 points more likely to favor gun permit laws.

Controlling for all of these factors, the period effect persists: people with the same attributes surveyed 10, 20, 30, and 40 years ago would have been about 12, 12, 8, and 7 points more likely to favor gun permit laws.

In these results, Hispanics are not significantly different from non-Hispanic whites.  But because of the way the GSS asked about Hispanic background, this variable is missing a lot of data; these results might not mean much.

Surprisingly, people who report that they have been "threatened with a gun or shot at" are 2 percentage points LESS likely to favor gun permit laws.  This effect is small but statistically significant, and it is consistent in many versions of the model.  A possible explanation is that this variable captures information about the respondent's relationship with guns that is not captured by other variables.  For example, if a respondent does not have a gun at home, but spends time around people who do, they might be more likely to have been threatened and also more likely to share cultural values with gun owners.  Alternatively, since this question was only asked until 1996, it's possible that it is capturing a period effect, at least in part.

"Lowest combo" represents a hypothetical person with all attributes associated with lower support for gun laws: a white conservative male born in the 1990s, living in a rural area, with household income in the lowest 25%, who has not attended college, who owns a gun, and has been threatened with a gun or shot at.  Such a person has a 19% chance of favoring a gun permit law, 54 points lower than the baseline.

"Highest combo" represents a hypothetical person with all attributes associated with higher support for gun laws: a mixed race liberal woman born in the 1960s or before, living in a suburb, with household income in the highest 25%, who has attended college, does not own a gun, and has not been threatened with one.  Such a person has a 92% chance of favoring a gun permit law, 19 points higher than the baseline.

[You might be surprised that these results are asymmetric: that is, that the lowest combo is farther from the baseline than the highest combo.  The reason is that the "distance" between probabilities is not linear.  For more about that, see my previous article on the challenges of interpreting probabilistic predictions.]
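
To make the nonlinearity concrete, here is a minimal sketch using only Python's standard library. It assumes a baseline of 73%, which is implied by the figures above (19 + 54 = 73 and 92 - 19 = 73), and applies the same shift of 1.5 in log odds in each direction. The shift size and the helper names `logit` and `expit` are illustrative choices, not values or functions taken from the model.

```python
import math

def logit(p):
    """Convert a probability to log odds."""
    return math.log(p / (1 - p))

def expit(x):
    """Convert log odds back to a probability."""
    return 1 / (1 + math.exp(-x))

baseline = 0.73   # implied by the numbers above
shift = 1.5       # illustrative shift in log odds, not a model estimate

# Apply the same shift in both directions on the log-odds scale.
up = expit(logit(baseline) + shift) - baseline
down = baseline - expit(logit(baseline) - shift)

print(f"Up: {100 * up:.0f} points, down: {100 * down:.0f} points")
```

With these assumptions, the same shift in log odds moves the probability about 19 points up but about 35 points down, because there is less room between 73% and 100% than between 73% and 0%.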


The entire analysis is in this Jupyter notebook.  The steps are:

1) Load the subset of GSS data I selected, which you can download here.

2) For each year of the survey, use weighted bootstrap to select a random sample that accounts for the stratified sampling in the GSS design.

3) Fill missing values in each column by drawing random samples from the valid responses.

4) Convert some numerical and categorical variables to boolean; for example 'conservative' and 'liberal' are based on the categorical variable 'polviews'; and 'lowrealinc' and 'highrealinc' are based on the numerical variable 'realinc'.

5) Use logistic regression to estimate model parameters, which are in terms of log odds.

6) Use the model to make predictions for each of the hypothetical people in the tables, in terms of probabilities.

7) Compute the predicted difference between each hypothetical person and the baseline.

By repeating steps (2) through (7) about 200 times, we get a distribution of estimates that accounts for uncertainty due to random sampling and missing values.  From these distributions, we can select the median and a 95% confidence interval, as reported in the tables above.   
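
The resampling loop can be sketched with the standard library alone. This is a simplified illustration, not the notebook's code: the data are synthetic, the sampling weights are made up, and a simple difference in proportions stands in for the logistic regression and predictions in steps (4) through (7). The names `rows` and `one_iteration` are hypothetical.

```python
import random
import statistics

random.seed(17)

# Synthetic stand-in for one survey year (NOT GSS data). Each row is
# (owns_gun, favor, weight), where favor is 1.0, 0.0, or None if missing.
rows = []
for _ in range(2000):
    owns_gun = random.random() < 0.4
    favor = 1.0 if random.random() < (0.6 if owns_gun else 0.8) else 0.0
    if random.random() < 0.1:
        favor = None                      # about 10% missing responses
    rows.append((owns_gun, favor, random.uniform(0.5, 2.0)))

weights = [w for _, _, w in rows]

def one_iteration(rows, weights):
    # Step 2: weighted bootstrap, sampling rows with replacement.
    sample = random.choices(rows, weights=weights, k=len(rows))

    # Step 3: fill missing values by drawing from the valid responses.
    valid = [f for _, f, _ in sample if f is not None]
    filled = [(g, f if f is not None else random.choice(valid))
              for g, f, _ in sample]

    # Stand-in for steps 4-7: difference in percent favoring the law
    # between gun owners and everyone else.
    owners = [f for g, f in filled if g]
    others = [f for g, f in filled if not g]
    return 100 * (sum(owners) / len(owners) - sum(others) / len(others))

# Repeat about 200 times to get a distribution of estimates.
estimates = [one_iteration(rows, weights) for _ in range(200)]

# Median and 95% interval from the 2.5th and 97.5th percentiles.
cuts = statistics.quantiles(estimates, n=40)
low, high = cuts[0], cuts[-1]
median = statistics.median(estimates)
print(f"{median:.1f} points (95% CI {low:.1f} to {high:.1f})")
```

Because each iteration draws a new sample and fills the missing values again, the spread of `estimates` reflects both sources of uncertainty.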

Thursday, March 1, 2018

Support for gun control is decreasing in all age groups

This is the third article in a series about changes in support for gun control over the last 50 years.

In the first article I looked at data from the General Social Survey and found that young adults are less likely than previous generations to support gun control.

In the second article I looked at data from the CIRP Freshman Survey and found that even the youngest adults, who grew up with lockdown drills and graphic news coverage of school shootings, are still LESS likely to support gun control.

Untangling age, period, and cohort effects

In this article, I do some age-period-cohort analysis to see if the changes over the last 50 years are due to age, period, or cohort effects:

Age effect: People's views might change over the course of their lives.  For example, they might be more likely to support gun rights when they are young, and more likely to support gun control when they have children. (This turns out not to be true.)

Period effect: People's views might change due to an external factor that affects all age groups and cohorts over the same time period.  For example, if gun crime rates increase, we might expect support for gun control to increase.  (There is some evidence for this.)

Cohort effect: People's views might be different from one generation to the next, due to differences in the environment.  For example, current teenagers might support gun control because of their experiences with school shootings. (This turns out not to be true.)

As the composition of the population changes over time, it can be hard to untangle these effects, but the design of the General Social Survey (GSS) makes it possible.  From 1972 to 2016, the GSS asked respondents:
Would you favor or oppose a law which would require a person to obtain a police permit before he or she could buy a gun?
The following figure shows the fraction of respondents who would favor this law:

In the 1970s and 80s, support for this policy was near 75%.  It increased during the 1990s, peaking near 85% around 2000, and has been declining ever since.  In the most recent survey year, it is at 71%.

Testing for age effects

To test for age effects, we can group respondents into cohorts by decade of birth and plot support for gun control as a function of age.

If there is an age effect, we would expect all cohorts to follow a similar trajectory as they age.  For example, if people are more likely to support gun control during their child-bearing years, we would expect these lines to generally increase from left to right.

Here are the results:

There are no obvious patterns here, which suggests that there is no age effect.

Testing for period effects

To test for period effects, we group by decade of birth again, and plot the results over time.  If there is a period effect, we expect all cohorts to follow a similar trajectory.
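
The groupings used here and in the previous section can be sketched as follows. This is a minimal illustration with a handful of made-up respondents, not the GSS data; the function name `support_by_cohort` and the choice to bin ages by decade are assumptions.

```python
from collections import defaultdict

# Hypothetical respondent records: (survey_year, age, favors_permit_law)
respondents = [
    (1976, 25, 1), (1976, 52, 1), (1988, 37, 1), (1988, 22, 0),
    (2000, 49, 1), (2000, 34, 1), (2016, 28, 0), (2016, 65, 1),
]

def support_by_cohort(records, x_axis):
    """Percent in favor, keyed by (birth decade, x value).

    x_axis="age" groups by age decade (to look for an age effect);
    x_axis="year" groups by survey year (to look for a period effect).
    """
    counts = defaultdict(lambda: [0, 0])   # [number in favor, total]
    for year, age, favor in records:
        cohort = (year - age) // 10 * 10   # decade of birth
        x = age // 10 * 10 if x_axis == "age" else year
        counts[cohort, x][0] += favor
        counts[cohort, x][1] += 1
    return {key: 100 * yes / total for key, (yes, total) in counts.items()}

by_age = support_by_cohort(respondents, "age")
by_year = support_by_cohort(respondents, "year")

for (cohort, year), pct in sorted(by_year.items()):
    print(f"born {cohort}s, surveyed {year}: {pct:.0f}% in favor")
```

Plotting one line per birth decade from `by_age` gives the age view, and doing the same with `by_year` gives the period view.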

Here are the results:

This figure shows clear evidence for a period effect: all cohorts follow a similar trajectory over the same period.  (Don't be distracted by the extreme first points in the green and purple curves; they are based on a small number of respondents.)

Looking at the last few points in each cohort, it looks like people born in the 1980s and 90s are less likely to support gun control than previous generations, but this figure does not show strong evidence for a cohort effect.

In summary, there is strong evidence for a period effect: support for gun control increased among all groups during the 1980s and 90s, and has been falling in all groups since 2000.

Violent crime rates

A possible explanation is that these trends are driven by changes in violent crime, especially gun violence, which increased during the 1980s, peaked in 1993, and has been falling ever since, according to this study from the Pew Research Center.

To investigate this more carefully, I would like to see a graph of people's perception of violent crime rates, which does not always track reality.

Breakdown by political views

In general, liberals are more likely to support gun control than conservatives; we might expect a period effect to have a different impact on different groups.  The following figure shows support for gun control over time, grouped by self-reported political identity:

Whatever external forces caused the increase and subsequent decrease in support for gun control, they affected all groups over the same period.  The most recent decrease seems to be bigger among conservatives, so the gap may be growing.

Breakdown by race

Nonwhites are more likely to support gun control than whites by about 8 percentage points.  The following figure shows how this difference has changed over time:

Both groups were affected similarly over the same period.  Among nonwhites, support for gun control might have increased sooner, in the 1980s, and might be falling more slowly now.

Wednesday, February 28, 2018

Post-Columbine students do not support gun control

In their coverage of the Parkland school shooting, The Economist writes:
Though polling suggests that young people are only slightly more in favour of gun-control measures than their elders, those surveys focus on those aged 18 and above. There may be a pre- and post-Columbine divide within that group.
Based on my analysis of data from the General Social Survey (GSS) and the CIRP Freshman Survey, I think the first sentence is false and the second is unlikely: young people are substantially less in favor of gun-control measures than their elders.

Here's the figure, from my previous article, showing these trends:

The blue line shows the fraction of respondents in the GSS who would favor "a law which would require a person to obtain a police permit before he or she could buy a gun."

Among people born before 1980, support for this form of gun control is strong: around 75% for people born between 1910 and 1940, and approaching 80% for people born between 1950 and 1980.
But among people born in the 1980s and 90s, support for gun control is below 70%.

The orange line shows the fraction of respondents to the CIRP Freshman Survey who "Agree" or "Strongly agree" that
The federal government should do more to control the sale of handguns.
This dataset does not go back as far, but shows the same pattern: a large majority of people born before 1980 supported gun control (when they were surveyed as college freshmen); among people born after 1980, far fewer support gun control.

However, these results are based on people who are currently young adults.  Maybe, as The Economist speculates:
The pupils, in their late teens, started their education after a massacre at Columbine High School in Colorado in 1999, in which 13 were killed. That means they have been practising active-shooter drills in the classroom since kindergarten. Seeing a school shooting as an event to prepare for, rather than an awful aberration, seems to have fuelled the students’ anger. 
They may be angry, but at least so far, their anger has not led them to support gun control.  Data from the Freshman Survey makes this clear.  The following figure shows, for survey respondents from 1989 to 2013, the fraction that agree or strongly agree that:
The federal government should do more to control the sale of handguns.
And for respondents in 2016, the fraction that agree or strongly agree that:
The federal government should have stricter gun control laws.

The change in wording makes it hard to compare the last data point with the previous trend, but it is clear at least that college freshmen in 2013 were substantially less likely than previous generations to support gun control: at 64%, they were 20 percentage points down from the peak, at 84%.

A large majority of the 2013 respondents were born in 1995.  They were 3 when Columbine was in the news, 10 during the Red Lake shootings, 11 during the West Nickel Mines school shooting, 12 during the Virginia Tech shooting, and 13 during the Northern Illinois University shooting.

They were 17 during the Chardon High School shooting, the Oikos University shooting, and the Sandy Hook Elementary School shooting.

And when they were surveyed in 2013, less than a year after Sandy Hook, more than 33% of them did not agree that the federal government should do more to control the sale of handguns, more than in any previous year of the survey.

Seeing these horrific events in the news, during their entire conscious lives, with increasingly dramatic and graphic coverage, might have made these students angry, but it did not make more of them support gun control.

Practicing active-shooter drills since kindergarten might have made these students angry, but it did not make more of them support gun control.

Maybe, as The Economist suggests, these students see a school shooting as "an event to prepare for, rather than an awful aberration".   But that does not make them more likely to support gun control.