Wednesday, September 26, 2018

This blog has moved

As of September 2018, I am moving Probably Overthinking It to a new location.

Blogger has been an excellent host; so far, this blog has received more than two million page views.

But earlier this month, when I published a new article, Blogger prompted me to post it on Google+, and I did.  A few hours later I discovered that my Google+ account had been suspended for violating terms of service, but I got no information about what terms I had violated.

While my Google+ account was suspended, I was unable to access Blogger and some other Google services.  And since Probably Overthinking It is a substantial part of my professional web presence, that was unacceptable.

I appealed the suspension by pressing a button, with no opportunity to ask a question.  Within 24 hours, my account was restored, but with no communication and still no information.

So for me, using Google+ has become a game of Russian Roulette.  Every time I post something, there seems to be a random chance that I will lose control of my web presence.  And maybe next time it will be permanent.

It is nice that using Blogger is free, but this episode has been a valuable reminder that "If you are not paying for it, you are not the customer".  (Who said that?)

I have moved Probably Overthinking It to a site I control, hosted by a company I pay, a company that has provided consistently excellent customer service.

Lesson learned.

[When I published this article, Blogger asked if I wanted to post it on Google+.  I did not.]

UPDATE: See the discussion of this post on Hacker News, with lots of good advice for migrating to services you have more control over.

Wednesday, September 19, 2018

Two hour marathon in 2031, maybe

On Sunday (September 16, 2018) Eliud Kipchoge ran the Berlin Marathon in 2:01:39, smashing the previous world record by more than a minute and taking a substantial step in the progression toward a two hour marathon.

In a previous article, I noted that the marathon record pace since 1970 has been progressing linearly over time, and I proposed a model that explains why we might expect it to continue.  Based on a linear extrapolation of the data so far, I predicted that someone would break the two hour barrier in 2041, plus or minus a few years.

Now it is time to update my predictions in light of the new record.  The following figure shows the progression of world record pace since 1970 (orange line), a linear fit to the data (blue line) and a 90% predictive confidence interval (shaded area).  The dashed lines show the two hour marathon pace (13.1 mph) and lower and upper bounds for the year we will reach it.

Since the previous record was broken in 2014, we have been slightly behind the long-term trend.  But the new record more than makes up for it, putting us at the upper edge of the predictive interval.

This model predicts that we might see a two hour marathon as early as 2031, and probably will before 2041.

Note that this model is based on data from races.  It is possible that we will see a two hour marathon sooner under time trial conditions, as in the Nike Breaking2 project.
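For readers who want to try the extrapolation themselves, here is a minimal sketch of the method: fit a line to world-record pace over time, then solve for the year the line crosses two-hour pace (13.1 mph).  The record points below are made-up stand-ins, not the actual data from the figure.

```python
import numpy as np

# Hypothetical (year, pace in mph) record points -- illustrative only
years = np.array([1970, 1981, 1988, 1999, 2003, 2008, 2014, 2018])
paces = np.array([12.2, 12.4, 12.5, 12.7, 12.8, 12.9, 13.0, 13.06])

# Fit a line: pace = slope * year + intercept
slope, intercept = np.polyfit(years, paces, 1)

# Solve for the year pace reaches 13.1 mph (26.2 miles in 2 hours)
year_two_hour = (13.1 - intercept) / slope
print(round(year_two_hour))
```

With real data you would also want the predictive interval shown in the figure, which you can get by resampling the residuals of the fit.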

Thursday, September 13, 2018

Tom Bayes and the case of the double dice

The double dice problem

Suppose I have a box that contains one each of 4-sided, 6-sided, 8-sided, and 12-sided dice. I choose a die at random, and roll it twice without letting you see the die or the outcome. I report that I got the same outcome on both rolls.

1) What is the posterior probability that I rolled each of the dice?
2) If I roll the same die again, what is the probability that I get the same outcome a third time?

You can see the complete solution in this Jupyter notebook, or read the HTML version here.


Here's a BayesTable that represents the four hypothetical dice.
In [3]:
hypo = [Fraction(sides) for sides in [4, 6, 8, 12]]
table = BayesTable(hypo)
hypo prior likelihood unnorm posterior
0 4 1 NaN NaN NaN
1 6 1 NaN NaN NaN
2 8 1 NaN NaN NaN
3 12 1 NaN NaN NaN

Since we didn't specify prior probabilities, the default value is equal priors for all hypotheses. They don't have to be normalized, because we have to normalize the posteriors anyway.
Now we can specify the likelihoods: if a die has n sides, the chance of getting the same outcome twice is 1/n.
So the likelihoods are:
In [4]:
table.likelihood = 1/table.hypo
hypo prior likelihood unnorm posterior
0 4 1 1/4 NaN NaN
1 6 1 1/6 NaN NaN
2 8 1 1/8 NaN NaN
3 12 1 1/12 NaN NaN
Now we can use update to compute the posterior probabilities:
In [5]:
table.update()
hypo prior likelihood unnorm posterior
0 4 1 1/4 1/4 2/5
1 6 1 1/6 1/6 4/15
2 8 1 1/8 1/8 1/5
3 12 1 1/12 1/12 2/15
In [6]:
table.posterior.astype(float)
0    0.400000
1    0.266667
2    0.200000
3    0.133333
Name: posterior, dtype: float64
The 4-sided die is most likely because you are more likely to get doubles on a 4-sided die than on a 6-, 8-, or 12-sided die.

Part two

The second part of the problem asks for the (posterior predictive) probability of getting the same outcome a third time, if we roll the same die again.
If the die has n sides, the probability of getting the same value again is 1/n, which should look familiar.
To get the total probability of getting the same outcome, we have to add up, over the hypotheses, the products
P(n | data) * P(same outcome | n)
The first term is the posterior probability; the second term is 1/n.
In [7]:
total = 0
for _, row in table.iterrows():
    total += row.posterior / row.hypo
Fraction(13, 72)
This calculation is similar to the first step of the update, so we can also compute it by
1) Creating a new table with the posteriors from table.
2) Adding the likelihood of getting the same outcome a third time.
3) Computing the normalizing constant.
In [8]:
table2 = table.reset()
table2.likelihood = 1/table2.hypo
hypo prior likelihood unnorm posterior
0 4 2/5 1/4 NaN NaN
1 6 4/15 1/6 NaN NaN
2 8 1/5 1/8 NaN NaN
3 12 2/15 1/12 NaN NaN
In [9]:
table2.update()
Fraction(13, 72)
In [10]:
table2
hypo prior likelihood unnorm posterior
0 4 2/5 1/4 1/10 36/65
1 6 4/15 1/6 2/45 16/65
2 8 1/5 1/8 1/40 9/65
3 12 2/15 1/12 1/90 4/65
This result is the same as the posterior after seeing the same outcome three times.
This example demonstrates a general truth: to compute the predictive probability of an event, you can pretend you saw the event, do a Bayesian update, and record the normalizing constant.
(With one caveat: this only works if your priors are normalized.)
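For readers who don't want to install BayesTable, the whole calculation can be done with nothing but fractions from the standard library.  This is a minimal sketch of the same update, not code from the notebook:

```python
from fractions import Fraction

hypos = [4, 6, 8, 12]
prior = {n: Fraction(1) for n in hypos}

# Likelihood of getting doubles on an n-sided die is 1/n
unnorm = {n: prior[n] * Fraction(1, n) for n in hypos}
norm = sum(unnorm.values())
posterior = {n: p / norm for n, p in unnorm.items()}

for n in hypos:
    print(n, posterior[n])     # 4 2/5, 6 4/15, 8 1/5, 12 2/15

# Posterior predictive probability of the same outcome a third time
predictive = sum(posterior[n] * Fraction(1, n) for n in hypos)
print(predictive)              # 13/72
```

Because the priors are normalized here, the predictive probability equals the normalizing constant of a second update, as described above.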

Tuesday, July 10, 2018

The Physics of Bungee Jumping

Bungee jumping turns out to be more complicated than I realized.  I use bungee jumping as an example in Modeling and Simulation in Python, which I am revising this summer.  The book is not done, but you can see the current draft here.

During the first phase of the jump, before the cord is fully extended, I treat the jumper as if they are in free fall, including the effect of gravity and air resistance, but ignoring the interaction between the jumper and the cord.

It turns out that this interaction is non-negligible.  As the cord drops from its folded initial condition to its extended final condition, it loses potential energy.  Where does that energy go?  It is transferred to the jumper!

The following diagram shows the scenario, courtesy of this web page on the topic:

The acceleration of the jumper turns out to be

a = g + μv² / (2(2L + μ(L − y)))

where a is the net acceleration of the jumper, g is acceleration due to gravity, v is the velocity of the jumper, y is the position of the jumper relative to the starting point, L is the length of the cord, and μ is the mass ratio of the cord and jumper.
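Here is a sketch of how that acceleration might be computed and integrated.  The formula in the code is my transcription of the model described above, and the parameters (cord length 25 m, mass ratio 0.1) are made up for illustration, so treat this as a sketch rather than the book's implementation (which also includes air resistance):

```python
def acceleration(y, v, L=25.0, mu=0.1, g=9.8):
    """Net downward acceleration during the drop phase.

    y: distance fallen (m), v: velocity (m/s),
    L: cord length (m), mu: cord mass / jumper mass.
    Assumes a = g + mu*v**2 / (2*(2*L + mu*(L - y))).
    """
    return g + mu * v**2 / (2 * (2*L + mu*(L - y)))

def drop(L=25.0, mu=0.1, dt=0.001):
    """Euler integration of the drop, until the cord is fully extended."""
    y, v = 0.0, 0.0
    while y < L:               # cord becomes taut at y = L
        v += acceleration(y, v, L, mu) * dt
        y += v * dt
    return y, v

y, v = drop()
print(v)   # slightly faster than free fall from 25 m (about 22.1 m/s)
```

Setting mu=0 recovers plain free fall, which is a quick sanity check on the implementation.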

For a bungee jumper with mass 75 kg, I've computed the trajectory with and without the effect of the cord.  The difference is more than two meters, which could be the difference between a successful jump and a bad day.

The details are in this Jupyter notebook.

Friday, June 22, 2018

Inference in three hours

I am preparing a talk for the Joint Statistical Meetings (JSM 2018) in August.  It's part of a session called "Bringing Intro Stats into a Multivariate and Data-Rich World"; my talk is called "Inference in Three Hours, and More Time for the Good Stuff".

Here's what I said I would talk about:
Teaching statistical inference using mathematical methods takes too much time, emphasizes the least important material, and leaves many students unprepared to apply statistics in the real world. Simple computer simulations can demonstrate the fundamental ideas of statistical inference quickly, clearly, and memorably. Computational methods are also robust and flexible, making it possible to work with a wider range of data and experiments. And by teaching statistical inference better and faster, we leave time for the most important goals of statistics education: preparing students to use data to answer questions and guide decision making under uncertainty. In this talk, I discuss problems with current approaches and present educational material I have developed based on computer simulations in Python.
I have slides for the talk now:

And here's the Jupyter notebook they are based on.

I have a few weeks until the conference, so comments and suggestions are welcome.


Coincidentally, I got a question on Twitter today that's related to my talk:
Very late to this post by @AllenDowney, but quite informative:  …
Have one question though: seems a lot of the single case reasoning here is similar to what I was taught was a mistaken conclusion: “that there is a 95% prob that a parameter lies within the given 95% CI.” What is the difference? Seems I am missing some nuance?
The post @cutearguments asks about is "Recidivism and single-case probabilities", where I make an argument that single-case probabilities are not a special problem, even under the frequentist interpretation of probability; they only seem like a special problem because they make the reference class problem particularly salient.

So what does that have to do with confidence intervals?  Let me start with the example in my talk: suppose you are trying to estimate the average height of men in the U.S.  You collect a sample and generate an estimate, like 178 cm, and a 95% confidence interval, like (177, 179) cm.

Naively, it is tempting to say that there is a 95% chance that the true value (the actual average height of every male resident in the population) falls in the 95% confidence interval.  But that's not true.

There are two reasons you might hear for why it's not true:

1) The true value is unknown, but it is not a random quantity, so it is either in the interval or it's not.  You can't assign a probability to it.

2) The 95% confidence interval does not have a 95% chance of containing the true value because that's just not what it means.  A confidence interval quantifies variability due to random sampling; that's all.

The first argument is bogus; the second is valid.

If you are a Bayesian, the first argument is bogus because it is entirely unproblematic to make probability statements about unknown quantities, whether they are considered random or not.

If you are a frequentist, the first argument is still bogus because even if the true value is not a random quantity, the confidence interval is.  And furthermore, it belongs to a natural reference class, the set of confidence intervals we would get by running the experiment many times.  If we agree to treat it as a member of that reference class, we should have no problem giving it a probability of containing the true value.

But that probability is not 95%.   If you want an interval with a 95% chance of containing the true value, you need a Bayesian credible interval.
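That reference-class argument is easy to check by simulation, in the spirit of the talk: draw many samples, compute a 95% interval from each, and count how often the interval contains the true value.  A minimal sketch, with a made-up population (the 178 cm and the other parameters are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(17)
true_mean, sigma, n = 178, 7, 100   # hypothetical heights in cm
trials = 1000

covered = 0
for _ in range(trials):
    sample = rng.normal(true_mean, sigma, n)
    sem = sample.std(ddof=1) / np.sqrt(n)       # standard error of the mean
    low = sample.mean() - 1.96 * sem
    high = sample.mean() + 1.96 * sem
    covered += (low <= true_mean <= high)

coverage = covered / trials
print(coverage)   # close to 0.95
```

As a member of this reference class, a particular interval has about a 95% chance of containing the true value -- but that is a statement about the sampling procedure, not about the parameter.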

Friday, June 1, 2018

Bayesian Zig Zag

Almost two years ago, I had the pleasure of speaking at the inaugural meeting of the Boston Bayesians, where I presented "Bayesian Bandits from Scratch" (the notebook for that talk is here).  Since then, the group has flourished, thanks to the organizers, Jordi Diaz and Colin Carroll.

Last night I made my triumphant return for the 21st meeting, where I presented a talk I called "Bayesian Zig Zag".  Here's the abstract:
Tools like PyMC and Stan make it easy to implement probabilistic models, but getting started can be challenging.  In this talk, I present a strategy for simultaneously developing and implementing probabilistic models by alternating between forward and inverse probabilities and between grid algorithms and MCMC.  This process helps developers validate modeling decisions and verify their implementation. 
As an example, I will use a version of the "Boston Bruins problem", which I presented in Think Bayes, updated for the 2017-18 season. I will also present and request comments on my plans for the second edition of Think Bayes.
When I wrote the abstract, I was confident that the Bruins would be in the Stanley Cup final, but that is not how it worked out.  I adapted, using results from the first two games of the NHL final series to generate predictions for the next game.

Here are the slides from the talk:

And here is the Jupyter notebook I presented.  If you want to follow along, you'll see that there is a slide that introduces each section of the notebook, and then you can read the details.  If you have a Python installation with PyMC, you can download the notebook from the repository and try it out.

The talk starts with basic material that should be accessible for beginners, and ends with a hierarchical Bayesian model of a Poisson process, so it covers a lot of ground!  I hope you find it useful.

For people who were there, thank you for coming (all the way from Australia!), and thanks for the questions, comments, and conversation.  Thanks again to Jordi and Colin for organizing, to WeWork for hosting, and to QuantumBlack for sponsoring.

Thursday, May 3, 2018

Some people hate custom libraries

For most of my books, I provide a Python module that defines the functions and objects I use in the book.  That makes some people angry.

The following Amazon review does a nice job of summarizing the objections, and it demonstrates the surprising passion this issue evokes:

March 29, 2018
Format: Paperback
Echoing another reviewer, the custom code requirement means you learn their custom code rather than, you know, the standard modules numpy and scipy. For example, at least four separate classes are required, representing hundreds of lines of code, are required just to execute the first six lines of code in the book. All those lines do is define two signals, a cosine and a sine, sums them, then plots them. This, infuriatingly, hides some basic steps. Here's how you can create a cosine wave with frequency 440Hz:

duration = 0.5
framerate = 11025
n = round(duration*framerate)
ts = np.arange(n)/framerate
amp = 1.0
freq = 440
offset = 0.0
cos_sig = amp * numpy.cos( 2*numpy.pi*ts*freq + offset)
freq = 880
sin_sig = amp * numpy.sin( 2*numpy.pi*ts*freq + offset)

Instead, these clowns have

cos_sig = thinkdsp.CosSignal(freq=440,amp=1.0,offset=0)
sin_sig = thinkdsp.SinSignal(freq=440,amp=1.0,offset=0)
mix = cos_sig + sin_sig

where CosSignal and SinSignal are custom classes, not functions, which inherits four separate classes, NONE of which are necessary, and all of which serve to make things more complex than necessary, on the pretense this makes things easier. The classes these class inherit are a generic Sinusoid and SumSignal classes, which inherits a Signal class, which depends on a Wave class, which performs plotting using pyplot in matplotlib. None of which make anything really any easier, but does serve to hide a lot of basic functionality, like hiding how to use numpy, matplotlib, and pyplot.

In short, just to get through the first two pages, you have to have access to github to import their ridiculous thinkdsp, thinkplot, and thinkstats, totalling around 5500 lines of code, or you are just screwed and can't use this book. All decent teaching books develops code you need as necessary and do NOT require half a dozen files with thousands of lines of custom code just to get to page 2. What kind of clown does this when trying to write a book to show how to do basic signal processing? Someone not interested in teaching you DSP, but trying to show off their subpar programming skills by adding unnecessary complexity (a sure sign of a basic programmer, not a good).

The authors openly admit their custom code is nothing more than wrappers in numpy and scipy, so the authors KNEW they were writing a crappy book and filling it with a LOT of unnecessary complexity. Bad code is bad code. Using bad code to teach makes bad teaching. It's obvious Allen B. Downey has spent his career in academia, where writing quality code doesn't matter.

Well, at least he spelled my name right.

Maybe I should explain why I think it's a good idea to provide a custom library along with a book like Think DSP.  Importantly, the goal of the book is to help people learn the core ideas of signal processing; the software is a means to this end.

Here's what I said in the preface:
The premise of this book is that if you know how to program, you can use that skill to learn other things, and have fun doing it. 
With a programming-based approach, I can present the most important ideas right away. By the end of the first chapter, you can analyze sound recordings and other signals, and generate new sounds. Each chapter introduces a new technique and an application you can apply to real signals. At each step you learn how to use a technique first, and then how it works.
For example, in the first chapter, I introduce two objects defined in thinkdsp: Wave and Spectrum.  Wave provides a method called make_spectrum that creates a Spectrum object, and Spectrum provides make_wave, which creates a Wave.

When readers use these objects and methods, they are implicitly learning one of the fundamental ideas of signal processing: that a Wave and its Spectrum are equivalent representations of the same information -- given one, you can always compute the other.

This example demonstrates one reason I use custom libraries in my books: The API is the lesson.  As you learn about these objects and how they interact, you are also learning the core ideas of the topic.

Another reason I think these libraries are a good idea is that they let me introduce ideas top-down: that is, I can show what a method does -- and why it is useful -- first; then I can present details when they are necessary or most useful.

For example, I introduce the Spectrum object in Chapter 1.  I use it to apply a low pass filter, and the reader can hear what that sounds like.  You can too, by running the Chapter 1 notebook on Binder.

In Chapter 2, I reveal that my make_spectrum function is a thin wrapper on two NumPy functions, and present the source code:

from numpy.fft import rfft, rfftfreq

# class Wave:
    def make_spectrum(self):
        n = len(self.ys)
        d = 1 / self.framerate

        hs = rfft(self.ys)
        fs = rfftfreq(n, d)

        return Spectrum(hs, fs, self.framerate)

At this point, anyone who prefers to use NumPy directly, rather than my wrappers, knows how.

In Chapter 7, I unwrap one more layer and show how the FFT algorithm works.  Why Chapter 7?  Because I introduce correlation in Chapter 5, which helps me explain the Discrete Cosine Transform in Chapter 6, which helps me explain the Discrete Fourier Transform.

Using custom libraries lets me organize the material in the way I think works best, based on my experience working with students and seeing how they learn.

This example demonstrates another benefit of defining my own objects: data encapsulation.  When you use NumPy's rfft to compute a spectrum, you get an array of amplitudes, but not the frequencies they correspond to.  You can call rfftfreq to get the frequencies, and that's fine, but now you have two arrays that represent one spectrum.  Wouldn't it be nice to wrap them up in an object?  That's what a Spectrum object is.
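To make the two-array point concrete, here is what the raw NumPy version looks like (a standalone sketch, not code from the book):

```python
import numpy as np
from numpy.fft import rfft, rfftfreq

framerate = 11025
duration = 0.5
ts = np.arange(round(duration * framerate)) / framerate
ys = np.cos(2 * np.pi * 440 * ts)   # a 440 Hz cosine

hs = rfft(ys)                          # complex amplitudes
fs = rfftfreq(len(ys), 1 / framerate)  # the matching frequencies

# Two parallel arrays represent one spectrum; e.g., to find the peak
# frequency you have to index one array with the argmax of the other:
peak = fs[np.argmax(np.abs(hs))]
print(peak)   # close to 440 Hz
```

A Spectrum object bundles hs and fs together, so operations like finding the peak or applying a filter don't require keeping two arrays in sync by hand.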

Finally, I think these examples demonstrate good software engineering practice, particularly bottom-up design.  When you work with libraries like NumPy, it is common and generally considered a good idea to define functions and objects that encapsulate data, hide details, eliminate repeated code, and create new abstractions.  Paul Graham wrote about this idea in one of his essays on software:
[...] you don't just write your program down toward the language, you also build the language up toward your program. [...] the boundary between language and program is drawn and redrawn, until eventually it comes to rest along [...] the natural frontiers of your problem. In the end your program will look as if the language had been designed for it.
That's why, in the example that makes my correspondent so angry, it takes just three lines to create and add the signals; and more importantly, those lines contain exactly the information relevant to the operations and no more.  I think that's good quality code.

In summary, I provide custom libraries for my books because:

1) They demonstrate good software engineering practice, including bottom-up design and data encapsulation.

2) They let me present ideas top-down, showing how they are used before how they are implemented.

3) And as readers learn the APIs I defined, they are implicitly learning the key ideas.

I understand that not everyone agrees with this design decision, and maybe it doesn't work for everyone.  But I am still surprised that it makes people so angry.

Wednesday, April 18, 2018

Computing at Olin Q&A

I was recently interviewed by Sally Phelps, the Director of Postgraduate Planning at Olin.  We talked about computer science at Olin, which is something we are often asked to explain to prospective students and their parents, employers, and other external audiences.

Afterward, I wrote the following approximation of our conversation, which I have edited to be much more coherent than what I actually said.

I should note: My answers to the following questions are my opinions.  I believe that other Olin professors who teach software classes would say similar things, but I am sure we would not all say the same things.

Photo Credit: Sarah Deng

Q: What is the philosophy of Olin when it comes to training software engineers of the future?

To understand computer science at Olin, you have to understand that Olin really has one curriculum, and it's engineering.

We have degrees in Engineering, Mechanical Engineering, and Electrical and Computer Engineering.  But everyone sees the same approach to engineering: it starts with people and it ends with people.  That means you can't wait for someone to hand you a well-formulated problem from a textbook; you have to understand the people you are designing for, and the context of the problem.  You have to know when an engineering solution can help and when it might not.  And then when you have a solution, you have to be able to get it out of the lab and into the world.

Q: That sounds very different from a traditional computer science degree.  

It is.  We already have a lot of computer scientists who know how data structures work; we don't have as many who can identify opportunities, work on open-ended problems, work on teams with people from other disciplines, and work on solutions that might involve electrical and mechanical systems as well as software.

And we don't have a lot of computer scientists who can communicate clearly about their work; to have impact, they have to be able to explain the value of what they are doing.  Most computer science programs don't teach those things very well.

Also most CS programs don't do a great job of preparing students to work as software engineers.  A lot of classes are too theoretical, too mathematical, and too focused on the computer itself, not the things you want to do with it, the applications.

At Olin, we've got some theory, some mathematical foundations, some focus on the design of software systems.  But we've turned that dial down because the truth is that a lot of that material is not relevant to practice.  I always get pushback when I say that, because you can never take anything out of the curriculum.  There's always someone who says you have to know how to balance a red-black tree or you can't be a computer scientist; or you have to know about Belady's anomaly, or you have to know X, Y, and Z.

Well, you don't.  For the vast majority of our students, for all the things they are going to do, a big chunk of the traditional curriculum is irrelevant.  So we look at the traditional curriculum with some skepticism, and we make cuts.

We have to; there's only so much time.  In four years, students take about 32 classes.  We have to spend them wisely.  We have to think about where students are going after graduation.  Some will go to grad school, some will start companies, some will work in industry.  Some of them will be software engineers, some will be product managers, some will work in other fields; they might develop software, or work with software developers.

Q: So how do you prepare people for all of that?

It depends what "prepare" means.  If it means teach them everything they need to know, it's impossible.  But you can identify the knowledge, skills and attitudes they are most likely to need.

It helps if you have faculty with industry experience.  A lot of professors go straight to grad school and straight into academics, and then they have long arguments about what software engineers need to know.  Sometimes they don't know what they are talking about.

If you're designing a curriculum, just like a good engineer, you have to understand the people you're designing for and the context of the solution.  Who are your students, where are they going, and what are they going to need?  Then you can decide what to teach.

Q: So if a student is interested in CS and they're deciding between Olin and another school, what do you tell them?

I usually tell them about the Olin curriculum, what I just explained.  And I suggest they look at our graduation requirements.  Students at Olin who do the Engineering major with a concentration in computing take a relatively small number of computer science classes, usually around seven.  And they take a lot of other engineering classes.

In the first semester, everyone takes the same three engineering classes, so everyone does some mechanical design, some circuits and measurement, and some computational modeling.

Everyone takes a foundation class in humanities, and another in entrepreneurship.  Everyone takes Principles of Engineering, where they design and build a mechatronic system.

In the fourth semester everyone takes user-centric design, and finally, in the senior year, everyone does a two-semester engineering capstone, which is usually interdisciplinary.

If a prospective student looks at those classes and they're excited about doing design and engineering -- and several kinds of engineering -- along with computer science, then Olin is probably a good choice for them.

If they look at the requirements and they dread them -- if the requirements are preventing them from doing what they really want -- then maybe Olin's not the right place.

Q: I understand there are student-taught software classes – can you tell us more about that?

We do, and a lot of them have been related to software, because that's an area where we have students doing internships, and sometimes starting companies, and they get a lot of industry experience.

And they come back with skills and knowledge they can share with their peers.  Sometimes that happens in classes, especially on projects.  But it can also be a student-led class, where student instructors propose a class and then work with faculty advisors to develop and present the material.  As an advisor, I can help with curriculum design and the pedagogy, and sometimes I have a good view of the context or the big picture.  And then a lot of times the students have a better view of the details.  They've spent the summer working in a particular domain, or with a particular technology, and they can help their peers get a jump start.

They also bring some of the skills and attitudes of software engineering.  For example, we teach students about testing, and version control, and code quality.  But in a class it can be artificial; a lot of times students want to get code working and they have to move on to the next thing.  They don't want to hear from me about coding "style".

It can be more effective when it's coming from peers, and when it's based on industry experience.  The student instructors might say they worked at Pivotal, and they had to do pair programming, or they worked at Google, and all of their code was reviewed for readability before they could check it in.  Sometimes that's got more credibility than I do.

Q: What does the future look like for computing at Olin?

A big part of it is programming in context.  For example, the first software class is Modeling and Simulation, which is about computational models in science, including physics, chemistry, medicine, ecology…  So right from the beginning, we're not just learning to program, we're applying it to real world problems.

Programming isn't just a way of translating well-understood solutions into code.  It's a way of communicating, teaching, learning, and thinking.  Students with basic programming skills can use coding as a "pedagogic lever" to learn other topics in engineering, math, natural and social science, arts and humanities.

I think we are only starting to figure out what that looks like.  We have some classes that use computation in these ways, but I think there are a lot more opportunities.  A lot of ideas that we teach mathematically, we could be doing computationally, maybe in addition to, or maybe instead of the math.

One of my examples is signal processing, where probably the most important idea is the Fourier transform.  If you do that mathematically, you have to start with complex numbers and work your way up.  It takes a long time before you get to anything interesting.

With a computational approach, I can give you a program on the first day to compute the Fourier transform, and you can use it, and apply it to real problems, and see what it does, and run experiments and listen to the results, all on day one.  And now that you know what it's good for, maybe you'll want to know how it works.  So we can go top-down, start with applications, and then open the hood and look at the engine.

I'd like to see us apply this approach throughout the curriculum, especially engineering, math, and science, but also arts, humanities and social science.

Thursday, March 22, 2018

Generational changes in support for gun laws

This is the fourth article in a series about changes in support for gun control laws over the last 50 years.

In the first article I looked at data from the General Social Survey and found that young adults are less likely than previous generations to support gun control.

In the second article I looked at data from the CIRP Freshman Survey and found that even the youngest adults, who grew up with lockdown drills and graphic news coverage of school shootings, are LESS likely to support strict gun control laws.

In the third article, I ran graphical tests to distinguish age, period, and cohort effects.  I found strong evidence for a period effect: support for gun control among all groups increased during the 1980s and 90s, and has been falling in all groups since 2000.  I also saw some evidence of a cohort effect: people born in the 1980s and 90s are less likely to support strict gun control laws.

In this article, I dive deeper, using logistic regression to estimate the sizes of these effects separately, while controlling for demographic factors like sex, race, urban or rural residence, etc.


As in the previous articles, I am using data from the General Social Survey (GSS), and in particular the variable 'gunlaw', which records responses to the question:
Would you favor or oppose a law which would require a person to obtain a police permit before he or she could buy a gun?
I characterize respondents who answer "favor" to be more likely to support strict gun control laws.

The explanatory factors I consider are:

'nineties', 'eighties', 'seventies', 'fifties', 'forties', 'thirties', 'twenties':  These variables encode the respondent's decade of birth.

'female': indicates that the respondent is female.

'black': indicates that the respondent is black.

'otherrace': indicates that the respondent is neither white nor black (most people in this category are mixed race).

'hispanic': indicates that the respondent is Hispanic.

'conservative', 'liberal': indicates that the respondent reports being conservative or liberal (not moderate).

'lowrealinc', 'highrealinc': indicates that the respondent's household income is in the bottom or top 25%, based on self-report, converted to constant dollars.

'college': indicates whether the respondent attended any college.

'urban', 'rural': indicates whether the respondent lives in an urban or rural area (not suburban).

'gunhome': indicates whether the respondent reports that they "have in [their] home or garage any guns or revolvers".

'threatened': indicates whether the respondent reports that they have "ever been threatened with a gun, or shot at". 

These factors are all binary.  In addition, I estimate the period effect by including the variables 'yminus10', 'yminus20', 'yminus30', and 'yminus40', which indicate respondents surveyed 10, 20, 30, and 40 years prior to 2016.
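As an aside, here is roughly how the decade-of-birth indicators can be encoded, assuming a 'cohort' column containing year of birth (the GSS variable name), with the 1960s cohort left out as the baseline.  This is a sketch, not the notebook's actual code:

```python
import pandas as pd

# Hypothetical encoding of the decade-of-birth indicators; a few example
# respondents stand in for the GSS extract.
df = pd.DataFrame({'cohort': [1955, 1963, 1978, 1992, 1948]})

# The real model includes more decades; the 1960s cohort is the baseline.
decades = {'forties': 1940, 'fifties': 1950, 'seventies': 1970,
           'eighties': 1980, 'nineties': 1990}
for name, start in decades.items():
    df[name] = ((df.cohort >= start) & (df.cohort < start + 10)).astype(int)

print(df[['cohort', 'nineties']])
```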


I used logistic regression to estimate the effect of each of these variables.  The regression also includes a cubic model of time, intended to capture the period effect.  You can see the period effect in the following figure, which shows actual changes in support for a gun permit law over the history of the GSS (in gray) and the retroactive predictions of the model (in red).
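To make the model concrete, here is a sketch of what such a regression might look like in statsmodels, using synthetic data in place of the GSS extract; the variable names and coefficients are illustrative, not the notebook's actual values:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the GSS extract (only two of the many predictors).
rng = np.random.default_rng(17)
n = 2000
df = pd.DataFrame({
    'year': rng.integers(1972, 2017, n),
    'female': rng.integers(0, 2, n),
    'gunhome': rng.integers(0, 2, n),
})
# Center and scale the year so the cubic terms are well-conditioned.
df['t'] = (df.year - 1994) / 10

# Generate responses from assumed "true" effects: women more likely to
# favor, gun owners less likely.
log_odds = 1.0 + 0.6 * df.female - 1.0 * df.gunhome - 0.1 * df.t
df['favor'] = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# Logistic regression with a cubic model of time for the period effect.
model = smf.logit('favor ~ female + gunhome + t + I(t**2) + I(t**3)', data=df)
results = model.fit(disp=False)
print(results.params['female'], results.params['gunhome'])
```

The fitted parameters are in terms of log odds; converting them to probabilities for a specific hypothetical person is what the next section does.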

To present the results in an interpretable form, I define a collection of hypothetical people with different attributes and estimate the probability that each one would favor a gun permit law.

As a baseline, I start with a white, non-Hispanic male born in the 1960s who is politically moderate, in the middle 50% of the income range, who attended college, lives in a suburb, has never been threatened with a gun or shot at, and does not have a gun at home.  People like that interviewed in 2016 have a 74% chance of favoring "a law which would require a person to obtain a police permit before he or she could buy a gun".

The following table shows results for people with different attributes: the first row, labeled 'baseline', is the baseline person from the previous paragraph; the second row, labeled 'nineties', is identical to the baseline in every way, except born in the 1990s rather than the 1960s.  The line labeled 'female' is identical to the baseline, but female.

These results are generated by running 201 random samples from the GSS data and computing the median, 2.5th and 97.5th percentiles.  The range from 'low2.5' to 'high97.5' forms a 95% confidence interval.
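The percentile computation itself is simple; a sketch with made-up estimates standing in for the real resampling results:

```python
import numpy as np

# 201 bootstrap estimates of a predicted probability (made-up numbers).
rng = np.random.default_rng(3)
estimates = rng.normal(73.6, 1.0, size=201)

# The median and a 95% percentile interval, as reported in the tables.
low, median, high = np.percentile(estimates, [2.5, 50, 97.5])
print(f'{low:.1f} {median:.1f} {high:.1f}')
```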

low2.5 median high97.5
baseline 71.5 73.6 75.1
nineties 60.1 63.8 68.5
eighties 64.9 67.7 69.8
seventies 67.8 70.1 72.0
fifties 70.7 72.8 74.8
forties 71.4 73.7 75.4
thirties 69.9 72.0 73.8
twenties 70.7 72.9 74.8
female 83.2 84.7 85.6
black 75.9 78.0 79.5
otherrace 78.2 80.4 82.9
hispanic 70.7 73.5 76.0
conservative 64.1 65.8 67.8
liberal 75.9 77.7 79.3
lowrealinc 69.1 71.1 72.9
highrealinc 73.2 75.2 76.7
college 73.2 75.1 76.3
urban 66.0 68.3 69.9
rural 59.5 62.0 64.0
threatened 68.8 71.1 73.1
gunhome 51.1 53.6 55.8
yminus10 84.1 85.3 86.2
yminus20 84.3 85.3 86.4
yminus30 80.7 81.5 82.8
yminus40 78.8 79.9 81.4
lowest combo 16.5 19.2 22.1
highest combo 91.3 92.3 93.4

Again, the hypothetical baseline person has a 74% chance of favoring a gun permit law.  A nearly identical person born in the 1990s has only a 64% chance.

To see this and the other effects more clearly, I computed the difference between each hypothetical person and the baseline, then sorted by the magnitude of the apparent effect. 

low2.5 median high97.5
lowest combo -57.7 -54.5 -49.5
gunhome -21.1 -20.0 -18.6
rural -13.4 -11.5 -9.9
nineties -13.5 -9.8 -4.5
conservative -8.9 -7.6 -6.5
eighties -7.9 -5.9 -2.7
urban -6.6 -5.4 -4.1
seventies -5.6 -3.2 -1.4
lowrealinc -3.7 -2.5 -1.2
threatened -3.6 -2.4 -1.1
thirties -3.2 -1.5 -0.1
fifties -2.2 -0.8 0.8
twenties -2.8 -0.6 1.0
forties -1.7 -0.0 1.5
baseline 0.0 0.0 0.0
hispanic -1.5 0.2 2.0
college 0.6 1.6 2.6
highrealinc 0.7 1.7 2.6
liberal 2.9 4.2 5.2
black 3.1 4.5 5.8
yminus40 4.9 6.6 8.0
otherrace 4.8 6.9 9.6
yminus30 6.6 8.1 9.8
female 10.1 11.1 12.4
yminus20 10.2 11.7 14.0
yminus10 10.1 11.7 13.7
highest combo 16.9 18.8 21.1

All else being equal, someone who owns a gun is about 20 percentage points less likely to favor gun permit laws.

Compared to people born in the 1960s, people born in the 1990s are 10 points less likely.  People born in the 1980s and 1970s are also less likely, by 6 and 3 points.  People born in previous generations are not substantially different from people born in the 1960s (and the effect is not statistically significant).

Compared to suburbanites, people in rural and urban communities are less likely, by 12 and 5 points.

People in the lowest 25% of household income are less likely by 2.5 points; people in the highest 25% are more likely by 2 percentage points.

Blacks and other non-whites are more likely to favor gun permit laws, by 4.5 and 7 percentage points.

Compared to political moderates, conservatives are 8 points less likely and liberals are 4 points more likely to favor gun permit laws.

Compared to men, women are 11 points more likely to favor gun permit laws.

Controlling for all of these factors, the period effect persists: people with the same attributes surveyed 10, 20, 30, and 40 years ago would have been 10, 10, 7, and 5 points more likely to favor gun permit laws.

In these results, Hispanics are not significantly different from non-Hispanic whites.  But because of the way the GSS asked about Hispanic background, this variable is missing a lot of data; these results might not mean much.

Surprisingly, people who report that they have been "threatened with a gun or shot at" are 2 percentage points LESS likely to favor gun permit laws.  This effect is small but statistically significant, and it is consistent in many versions of the model.  A possible explanation is that this variable captures information about the respondent's relationship with guns that is not captured by other variables.  For example, if a respondent does not have a gun at home, but spends time around people who do, they might be more likely to have been threatened and also more likely to share cultural values with gun owners.  Alternatively, since this question was only asked until 1996, it's possible that it is capturing a period effect, at least in part.

"Lowest combo" represents a hypothetical person with all attributes associated with lower support for gun laws: a white conservative male born in the 1990s, living in a rural area, with household income in the lowest 25%, who has not attended college, who owns a gun, and has been threatened with a gun or shot at.  Such a person has a 19% chance of favoring a gun permit law, 54 points lower than the baseline.

"Highest combo" represents a hypothetical person with all attributes associated with higher support for gun laws: a mixed race liberal woman born in the 1960s or before, living in a suburb, with household income in the highest 25%, who has attended college, does not own a gun, and has not been threatened with one.  Such a person has a 92% chance of favoring a gun permit law, 19 points higher than the baseline.

[You might be surprised that these results are asymmetric: that is, that the lowest combo is farther from the baseline than the highest combo.  The reason is that the "distance" between probabilities is not linear.  For more about that, see my previous article on the challenges of interpreting probabilistic predictions.]
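To see why, consider equal-size steps in log-odds space starting from the 74% baseline; a small sketch (the step size of 2.5 is arbitrary, chosen only to illustrate the asymmetry):

```python
import math

def prob(log_odds):
    """Convert log odds to a probability (the inverse logit)."""
    return 1 / (1 + math.exp(-log_odds))

base = math.log(0.74 / 0.26)   # log odds of the 74% baseline
down = prob(base - 2.5)        # one step toward "oppose"
up = prob(base + 2.5)          # the same-size step toward "favor"

# The downward step moves the probability much farther than the upward one.
print(round(100 * (0.74 - down)), round(100 * (up - 0.74)))  # → 55 23
```

Because the baseline is already close to 1, there is less room to move up than down, so equal effects in log-odds space look unequal in probability space.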


The entire analysis is in this Jupyter notebook.  The steps are:

1) Load the subset of GSS data I selected, which you can download here.

2) For each year of the survey, use a weighted bootstrap to select a random sample that accounts for the stratified sampling in the GSS design.

3) Fill missing values in each column by drawing random samples from the valid responses.

4) Convert some numerical and categorical variables to boolean; for example 'conservative' and 'liberal' are based on the categorical variable 'polviews'; and 'lowrealinc' and 'highrealinc' are based on the numerical variable 'realinc'.

5) Use logistic regression to estimate model parameters, which are in terms of log odds.

6) Use the model to make predictions for each of the hypothetical people in the tables, in terms of probabilities.

7) Compute the predicted difference between each hypothetical person and the baseline.

By repeating steps (2) through (7) about 200 times, we get a distribution of estimates that accounts for uncertainty due to random sampling and missing values.  From these distributions, we can select the median and a 95% confidence interval, as reported in the tables above.
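Step (2) can be sketched as follows, assuming a 'wtssall' column of GSS sampling weights (the notebook's actual code may differ):

```python
import numpy as np
import pandas as pd

def resample_weighted(df, column='wtssall', rng=None):
    """Draw rows with replacement, with selection probability
    proportional to each row's sampling weight."""
    rng = rng or np.random.default_rng()
    p = df[column] / df[column].sum()
    idx = rng.choice(df.index, size=len(df), replace=True, p=p)
    return df.loc[idx].reset_index(drop=True)

# Toy data: rows with x=1 carry nine times the weight of rows with x=0,
# so they should make up about 90% of the resample.
df = pd.DataFrame({'x': np.repeat([0, 1], 500),
                   'wtssall': np.repeat([1.0, 9.0], 500)})
sample = resample_weighted(df, rng=np.random.default_rng(0))
print(sample['x'].mean())  # close to 0.9
```

Because each bootstrap sample also refills missing values at random in step (3), the spread of the 200 estimates reflects both sampling error and imputation uncertainty.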