
Saturday, April 12, 2014

Think X, Y and Z: What's in the pipeline?

Greetings from PyCon 2014 in Montreal!  I did a book signing yesterday at the O'Reilly Media booth.  I had the pleasure of working side by side with David Beazley, who was signing copies of the Python Cookbook, now updated for Python 3 and, I assume, including all of the perverse things Prof. Beazley does with Python generators.

Normally O'Reilly provides 25 copies of the book for signing, but due to a mix-up, they brought extra copies of all of my books, so I ended up signing about 50 books.  That was fun, but tiring.

Several people asked me what I am working on now, so I thought I would answer here:

1) This semester I am revising Think Stats and using it to teach Data Science at Olin College.  Think Stats was the first book in the Think X series to get published.  There are still many parts I am proud of, but a few parts that make me cringe.  So I am fixing some errors and adding new sections on multiple linear regression, logistic regression, and survival analysis.  The revised edition should be done in June 2014.  I will post my draft, real soon, at thinkstats2.com.

2) I am also teaching Software Systems this semester, a class that covers system-level programming in C along with topics in operating systems, networks, databases, and embedded systems.  We are using Head First C to learn the language, and I am writing Think OS to present the OS topics.  The current (very rough) draft is at think-os.com.  I am planning to incorporate some material from The Little Book of Semaphores.  Then I will finish it up this summer.

3) Next semester I am teaching two half-semester classes: Bayesian statistics during the first half, based on Think Bayes; and digital signal processing during the second half.  I am taking a computational approach to DSP (surprise!), with emphasis on applications like sound and image processing.  The goal is for students to understand spectral analysis using FFT and to wrap their heads around time-domain and frequency-domain representations of time series data.  Working with sound takes advantage of human hearing, which is a pretty good spectral analysis system.  Applications will include pitch tracking as in Rock Band, and maybe we will reimplement Auto-Tune.

I have drafted a few chapters and posted them at think-dsp.com. I'll be working on it over the summer, then the students will work with it in the fall. If it all goes well, I'll revise it in January 2015, or maybe that summer.
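As a taste of the computational approach, here is a minimal NumPy sketch (my illustration here, not material from the draft) of spectral analysis with the FFT: a pure tone shows up as a single peak in the frequency domain.

```python
import numpy as np

# One second of a 440 Hz sine wave, sampled at 11025 frames per second
framerate = 11025
t = np.arange(framerate) / framerate
ys = np.sin(2 * np.pi * 440 * t)

# rfft computes the frequency-domain representation of a real signal
hs = np.fft.rfft(ys)
fs = np.fft.rfftfreq(len(ys), 1 / framerate)

# The magnitude spectrum peaks at the frequency of the tone
peak = fs[np.argmax(np.abs(hs))]
print(peak)  # the peak is at 440 Hz
```

Pitch tracking, as in Rock Band, starts from exactly this kind of peak-finding, although real voices and instruments add harmonics and noise that make the problem harder.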

One open question is what environment I will develop the book in. My usual work environment is LaTeX for the book, with Makefiles that generate PDF, HTML, and DocBook. I write code in emacs and run it on the Linux command line. So, that's a pretty old school toolkit.

For a long time I've been thinking about switching to IPython, which would allow me to combine text and code examples, and give users the ability to run the code while reading.

But two things have kept me from making the leap:

a) I like editing scripts, and I hate notebooks.  I find the interface awkward, and I don't want to keep track of which cells have to execute when I make a change.

b) I have been paralyzed by the number of options for distributing the results. For each of my books I have a Python module that contains my library code, so I need to distribute the module along with the notebook. Notebook viewers like nbviewer would not work for me (I think). At the other extreme, I could package my development environment in a Virtual Machine and let readers run my VM. But that seems like overkill, and many of my readers are not comfortable with virtual machines.

But I just listened to Fernando Perez's keynote at PyCon, where he talked about his work on IPython. I grabbed him after the talk and he was very helpful; in particular, he suggested the option of using Wakari Cloud. Continuum Analytics, which runs Wakari, is a PyCon sponsor, so I will try to talk to them later today (and find out whether I am misunderstanding what they do).

4) The other class I am teaching in the fall is Modeling and Simulation, which I helped develop, along with my colleagues John Geddes and Mark Somerville, about six years ago. For the last few years we've been using my book, Physical Modeling in MATLAB.

The book presents the computational part of the class, but doesn't include all of the other material, especially the modeling part of Modeling and Simulation! Prof. Somerville has drafted additional chapters to present that material, and Prof. Geddes and I have talked about expanding some of the math topics.

This summer the three of us are hoping to draft a major revision of the book. We'll try it out in the fall, and then finish it off in January.

The current version of the book uses MATLAB, but the features we use are also available in Python with SciPy. So at some point we might switch the class and the book over to Python.
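For example, MATLAB's ode45 has a close counterpart in SciPy's odeint. This sketch (a stand-in system I made up for illustration, not an example from the book) solves a falling-object model with quadratic air resistance:

```python
import numpy as np
from scipy.integrate import odeint

def slope_func(state, t, g=9.8, b=0.1):
    """Derivatives for free fall with quadratic air resistance."""
    y, v = state
    return [v, -g - b * v * abs(v)]

# Drop from 381 m with zero initial velocity, as we might with ode45 in MATLAB
t = np.linspace(0, 5, 101)
states = odeint(slope_func, [381, 0], t)

# Velocity approaches terminal velocity, -sqrt(g/b), about -9.9 m/s
print(states[-1, 1])
```

The newer scipy.integrate.solve_ivp offers more options, but odeint's call signature is the closest match to the way ode45 is used in MATLAB.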


That's what's in the pipeline! I'm not sure I will actually be able to finish four books in the next year, so this schedule might be "aspirational" (that's Olin-speak for things that might not be completely realistic).

Let me know what you think. Which ones should I work on first?

Monday, April 7, 2014

The Internet and religious affiliation

A few weeks ago I published this paper on arXiv: "Religious affiliation, education and Internet use".  Regular readers of this blog will recognize it as the work I was writing about in July 2012, in articles like this one.

A few days ago, MIT Technology Review wrote about my paper: How the Internet Is Taking Away America’s Religion

[UPDATE April 16, 2014: Removed some broken links here.]

The story has attracted some media attention, so I am using this blog article to correct a few errors and provide more information.


1) Some stories are reporting that my paper was published by MIT Technology Review.  That is not correct.  It was published on arXiv, "a repository of electronic preprints of scientific papers in the fields of mathematics, physics, astronomy, computer science, quantitative biology, statistics, and quantitative finance, which can be accessed online."  It is not a peer-reviewed venue.  My paper has not been peer reviewed.

2) I am a Professor of Computer Science at Olin College of Engineering in Needham, Massachusetts.

3) My results suggest that Internet use might account for about 20% of the decrease in religious affiliation between 1990 and 2010, or about 5 million out of 25 million people.

Note that this result doesn't mean that 5 million people who used to be affiliated are now disaffiliated because of the Internet.  Rather, my study estimates that if the Internet had no effect on affiliation, there would be an additional 5 million affiliated people.

Some additional questions and answers follow.


How do you know the Internet causes disaffiliation?  Why not global warming?  Or pirates?

The statistical analysis in this paper is not just a correlation between two time series.  That would tell us very little about causation.

The analysis I report is based on more than 9000 individual respondents to the General Social Survey (GSS).  Each respondent answered questions about their Internet use and religious affiliation, and provided demographic information like religious upbringing, region, income, socioeconomic status, etc.

These variables allowed me to control for most of the usual confounding variables and isolate the association between Internet use and religious affiliation.

Now, there are still two alternative explanations.  Religious disaffiliation could cause Internet use.  That seems less plausible to me, but it is still possible.  The other possibility is that a third factor could cause both Internet use and disaffiliation.  But that factor would have to be new or increasing since 1990, and it would have to be uncorrelated or only weakly correlated with the control variables I included.

I can't think of any good candidates for a third factor like that, and I have not heard any candidates that meet the criteria.  So I think it is reasonable to conclude that Internet use causes disaffiliation.

Was this research published in MIT Technology Review?

No. I published the paper on arXiv: Religious affiliation, education and Internet use

Tech Review was one of the first places to pick up the story.  Some sites are reporting, incorrectly, that my research was published in MIT Technology Review.

Correlation and causation

By far the most common response to my paper is something like what Matt McFarland wrote:
This is an interesting read, but the problem with the theory is that correlation isn’t causation. With so much changing since 1990, it’s difficult to conclude what variables were factors, and to what degree.
Let me address that in two parts.

1) My study is not based on simple correlations.  I used statistical methods (specifically logistic regression) that are designed to answer the question Matt asks: which variables were factors and to what degree?  I controlled for all the usual confounding factors, including income, education, religious upbringing, and a few others.

My results show that there is an association between Internet use and disaffiliation, after controlling for these other factors.
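To illustrate what "controlling for" a confounder means, here is a sketch that fits a logistic regression to synthetic, made-up data (it is not the GSS analysis, and the coefficients are invented). Education drives Internet use, and both education and Internet use drive the outcome; the regression still recovers the separate Internet effect:

```python
import numpy as np

rng = np.random.default_rng(17)
n = 10000

# Synthetic confounding: education influences Internet use...
educ = rng.normal(0, 1, n)
internet = (educ + rng.normal(0, 1, n) > 0).astype(float)

# ...and the outcome depends on BOTH Internet use and education
true_logit = -1.0 + 0.5 * internet + 0.3 * educ
y = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(float)

# Fit logistic regression by gradient ascent: intercept, internet, educ
X = np.column_stack([np.ones(n), internet, educ])
beta = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-X @ beta))
    beta += 0.1 * X.T @ (y - p) / n

print(beta)  # roughly [-1.0, 0.5, 0.3], the Internet effect net of education
```

A simple difference in rates between Internet users and non-users would overstate the effect here, because users are more educated by construction; the regression separates the two.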

2) The question that remains is which direction the causation goes.  There are three possibilities:

a) Internet use causes disaffiliation
b) disaffiliation causes Internet use
c) some other third factor causes both

In the paper I argue that (a) is more likely than (b) because it is easy to imagine several ways Internet use might cause disaffiliation, and harder to imagine how disaffiliation causes Internet use.

Similarly, it is hard to think of a third factor that causes both Internet use and disaffiliation.  In order to explain the data, this new factor would have to be new, or changing substantially between 1990 and 2010, and it would have to have a strong effect on both Internet use and disaffiliation.

There are certainly many factors that contribute to disaffiliation.  According to my analysis, Internet use might account for 20% of the observed change.  Another 25% is due to changes in religious upbringing, and 5% is due to increases in college education.  That leaves 50% of the observed changes unexplained by the factors I was able to include in the study.

If you think that Internet use does not cause disaffiliation, it is not enough to list other causes of disaffiliation, or other things that have changed since 1990.  To make your argument, you have to find a third factor that

a) Is not already controlled for in my analysis,
b) Is new in 1990, or began to change substantially around that time, and
c) Actually causes (not just associated with) both Internet use and disaffiliation.

So far I have not heard a candidate that meets these criteria.  For example, some people have suggested  personality traits that might cause both Internet use and disaffiliation.  That's certainly possible.  But unless those traits are new, or started becoming more prevalent, in 1990, they don't explain the recent changes.

Questions from the media

1. Why did you initiate this study?

As a college teacher, I have been following the CIRP Freshman Survey for several years.  It is a survey of incoming college students that asks, among other things, about their religious preference.  Since 1985 the fraction of students reporting no religious preference has more than tripled, from 8% to 25%.  I think this is an underreported story.

My latest blog post about the Freshman Survey is here: http://allendowney.blogspot.com/2014/03/freshman-hordes-slightly-more-godless.html

About two years ago I started working with data from the General Social Survey (GSS), and realized there was an opportunity to investigate factors associated with disaffiliation, and to predict future trends.

2. How did you gather your data?

I am not involved in running either the Freshman Survey or the GSS.  I use data made available by the Higher Education Research Institute (HERI) and National Opinion Research Center (NORC).  Obviously, their work is a great benefit to researchers like me.  On the other hand, there are always challenges working with “found data.”  Even with surveys like these that are well designed and executed, you never have exactly the data you want; it takes some creativity to find the data that answer your questions and the questions your data can answer.

3. Could you elaborate on your overall findings?

I think there are two important results from this study.  One is that I identified several factors associated with religious disaffiliation and measured the strength of each association.  By controlling for things like income, education, and religious upbringing, I was able to isolate the effect of Internet use.  I found something like a dose-response curve.  People who reported some Internet use (a few hours a week) were less likely to report a religious preference, by about 2 percentage points.  People who use the Internet more than 7 hours per week were even less likely to be religious, by an additional 3 percentage points.  That effect turns out to be stronger than a four-year college education, which reduces religious affiliation by about 2 percentage points.

With this kind of data, we can’t know for sure that Internet use causes religious disaffiliation.  It is always possible that disaffiliation causes Internet use, or that a third factor causes both.  In the paper I explain why I think these alternatives are less plausible, and provide some additional analysis.  Based on these results, I conclude, tentatively, that Internet use causes religious disaffiliation, but a reasonable person could disagree.

In the second part of the paper I use parameters from the regression models to run simulations of counterfactual worlds, which allows me to estimate the number of people in the U.S. whose disaffiliation is due to education, Internet use, and other factors.  Between 1980 and 2010, the total decrease in religious affiliation is about 25 million people.  About 25% of that decrease is because fewer people are being raised with a religious affiliation.  Another 20% might be due to increases in Internet use.  And another 5% might be due to increases in college education.

That leaves 50% of the decrease unexplained by the factors I was able to include in the study, which raises interesting questions for future research.
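The shape of that counterfactual computation can be sketched as follows, with made-up coefficients and a made-up population (these are not the paper's fitted values):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Hypothetical logistic model: logit(disaffiliation) = b0 + b1 * internet_use
b0, b1 = -1.5, 0.4
internet = (rng.random(n) < 0.7).astype(float)  # made-up: 70% use the Internet

def frac_disaffiliated(internet_use):
    """Expected fraction disaffiliated, given an array of Internet-use values."""
    return np.mean(1 / (1 + np.exp(-(b0 + b1 * internet_use))))

actual = frac_disaffiliated(internet)
counterfactual = frac_disaffiliated(np.zeros(n))  # a world with no Internet use

# The difference is the share of disaffiliation attributable to Internet use
print(actual - counterfactual)
```

Multiplying that difference by the population size gives an estimated head count, which is how a percentage of the observed change translates into millions of people.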

4. Are you using this research for anything specific?

I do work in this area because a lot of people find it interesting; I also use it as an example in my classes.  I teach Data Science at Olin College of Engineering.  I want my students to be prepared to work with real data and use it to answer real questions.  I use examples like this to demonstrate the tools and to show what’s possible as data like this becomes more readily available.

5. Based on your research, are you able to hypothesize what might happen to religion in America in 50, 100, or more years?

I made some predictions in this blog post:


The most likely changes between now and 2040 are: the fraction of people with no religious preference will increase to about 25%, overtaking the fraction of Catholics, which will decline slowly.  The fraction of Protestants will drop more quickly, to about 45%.

These predictions are based on generational replacement: the people in the surveyed population will get older; some will die and be replaced by the next generation.  Most adults don’t change religious affiliation, so these predictions should be reliable.

But they are based on generational replacement only, not on any other factors that might speed up or slow down the trends.  Going farther into the future, those other factors become more important.

6. Is there anything in particular you think people should know/understand about your research and findings?

Again, it’s important to remember that my results are based on observational studies.  With that kind of data, we don’t know for sure whether the statistical relationships we see are due to causation.  In this case I think we can make a strong argument that Internet use causes religious disaffiliation, but a reasonable person could disagree.

My paper includes some analysis that is pretty standard stuff, like logistic regression.  But I also used methods that are less common; for example, using parameters from the regression models, I ran simulations of counterfactual worlds, which allowed me to estimate the number of people in the U.S. whose disaffiliation might be due to education, Internet use, and other factors.

7. Although your research cannot determine for sure that the Internet causes less religious affiliation, what about it might you speculate could be decreasing religion?

In the paper I wrote “it is easy to imagine at least two ways Internet use could contribute to disaffiliation.  For people living in homogeneous communities, the Internet provides opportunities to find information about people of other religions (and none), and to interact with them personally.  Also, for people with religious doubt, the Internet provides access to people in similar circumstances all over the world.”

These are speculations based on anecdotal evidence, not the kind of data I used in the statistical analysis.  One place I see people from different religious backgrounds engaging on the Internet is in online forums like Reddit.  Here's an example from just a few hours ago.

This NYT article provides similar anecdotal evidence: http://www.nytimes.com/2013/07/21/us/some-mormons-search-the-web-and-find-doubt.html?pagewanted=all&_r=1&

8. Do you think that Internet will advance in this secularization process?

There are two parts of secularization, disaffiliation from organized religion and decrease in religious faith.  The data I reported in my paper provide some evidence that the Internet is contributing to disaffiliation in the U.S.  I haven’t had a chance look into the related issue of religious faith, but I am interested in that question, too.

9. The access to a wider range of information would be one of the explanations for the decrease in religious affiliation?

Again, I don’t have data to support that, but it seems likely to be at least part of the explanation.

More questions

There is a difference between those who are religiously affiliated (belong to or active with a church, for example) and those who consider themselves spiritual or religious. Can you clarify what you’re talking about?

Yes, good point!  My paper is only about religious affiliation, or religious preference.  The GSS also asks about religious faith and spirituality, but I have not had a chance to do the same analysis with those variables.

I have seen other studies that suggest that belief in God, other forms of religious faith, and spirituality are not changing as quickly as religious affiliation.  But I don't have a good reference handy.

Thursday, March 6, 2014

Freshman hordes slightly more godless than ever

This article is an update to my annual series on one of the most under-reported stories of the decade: the fraction of college freshmen who report no religious preference has tripled since 1985, from 8% to 24%, and the trend is accelerating.

In last year's installment, I made the bold prediction that the trend would continue, and that the students starting college in 2013 would again be the most godless ever.  It turns out I was right -- just barely.  The fraction of students reporting no religious preference increased to 24.6%, slightly higher than the previous record, 24.5% in 2011.  Of course, that "difference" is not statistically meaningful.  More valid conclusions are:

1) This year's data point is consistent with previous predictions, and

2) Data since 1990 support the conclusion that the fraction of incoming college students with no religious preference is increasing and probably accelerating.

This analysis is based on survey results from the Cooperative Institutional Research Program (CIRP) of the Higher Education Research Institute (HERI).  In 2013, more than 165,000 students at 234 colleges and universities completed the CIRP Freshman Survey, which includes questions about students’ backgrounds, activities, and attitudes.

In one question, students select their “current religious preference,” from a choice of seventeen common religions, “Other religion,” or “None.”

Another question asks students how often they “attended a religious service” in the last year. The choices are “Frequently,” “Occasionally,” and “Not at all.” Students are instructed to select “Occasionally” if they attended one or more times.

The following figure shows the fraction of Nones over more than 40 years of the survey:

The blue line shows actual data through 2012; the red line shows a quadratic fit to the data.  The dark gray region shows a 90% confidence interval, which quantifies sampling error, so it reflects uncertainty about the parameters of the fit.

The light gray region shows a 90% confidence interval taking into account both sampling error and residual error.  So it reflects total uncertainty about the predicted value, including uncertainty due to random variation from year to year.

We expect the new data point from 2013, shown as a blue square, to fall within the light gray interval, and it does.  In fact, at 24.6% it falls only slightly below the fitted curve.

Here is the corresponding plot for attendance at religious services:

Again, the new data point for 2013, 27.3%,  falls comfortably in the predicted range and slightly ahead of the long term trend.

Predictions for 2014

Using the new 2013 data, we can generate predictions for 2014.  Here is the revised plot for "Nones":

The prediction for next year is that the fraction of Nones will hit a new all-time high at 25.8% (from 24.6%).  If so, it is likely to match or exceed the fraction of students whose preference is Roman Catholic.

And here is the prediction for "No attendance":

The prediction for 2014 is a small increase to 27.5% (from 27.3%).  I'll be back next year to check on these predictions.


1) As always, more males than females report no religious preference.  The gender gap decreased this year, but still falls in the predicted range, as shown in the following plot:

Evidence that the gender gap is increasing over the long term is strong.  The p-value of the slope of the fitted curve is less than 1e-6.

2) I notice that the number of schools and the number of students participating in the Freshman Survey has been falling for several years.  I wonder if the mix of schools represented in the survey is changing over time, and what effect this might have on the trends I am watching.  The percentage of "Nones" is different across different kinds of institutions (colleges, universities, public, private, etc.)  If participation rates are changing among these groups, that would affect the results.

3) Obviously college students are not representative of the general population.  Data from other sources indicate that the same trends are happening in the general population, but I haven't been able to make a quantitative comparison between college students and others.  Data from other sources also indicate that college graduates are slightly more likely to attend religious services, and to report a religious preference, than the general population.

Data Source

The American Freshman: National Norms Fall 2013
Eagan, K., Lozano, J.B., Hurtado, S., & Case, M.H.
ISBN: 978-1-878477-26-2     187 pages.
Mar 2014

This and all previous reports are available from the HERI publications page.

Thursday, February 20, 2014

Correlation is evidence of causation

In class last week, I was talking about correlation and linear regression, and I made the outrageous claim that correlation is evidence of causation.  One of my esteemed colleagues, who is helping out with the class, was sitting in the back of the room, and immediately challenged my claim.  It wasn't a good time for a long discussion of the question, so I didn't elaborate.

The point I was trying to make (and will elaborate here) is that the usual mantra, "Correlation does not imply causation," is true only in a trivial sense, so we need to think about it more carefully.  And as regular readers might expect, I'll take a Bayesian approach.

It is true that correlation doesn't imply causation in the mathematical sense of "imply;" that is, finding a correlation between A and B does not prove that A causes B.  However, it does provide evidence that A causes B.  It also provides evidence that B causes A, and if there is a hypothetical C that might cause A and B, the correlation is evidence for that hypothesis, too.

In Bayesian terms, a dataset, D, is evidence for a hypothesis, H, if the probability of H is higher after seeing D.  That is, if P(H|D) > P(H).

For any two variables, A and B, we should consider 4 hypotheses:

A: A causes B
B: B causes A
C: C causes A and B
N: there are no causal relationships among A, B, and C

And there might be multiple versions of C, for different hypothetical factors.  If I have no prior evidence of any causal relationships among these variables, I would assign a high probability (in the sense of a subjective degree of belief) to the null hypothesis, N, and low probabilities to the others.  If I have background information that makes A, B, or C more plausible, I might assign prior probabilities accordingly.  Otherwise I would assign them equal priors.

Now suppose I find a correlation between A and B, with p-value=0.01.  I would compute the likelihood of this result under each hypothesis:

L(D|A) ≈ 1: If A causes B, the chance of finding a correlation is probably high, depending on the noisiness of the relationship and the size of the dataset.

L(D|B) ≈ 1, for the same reason.

L(D|C) ≈ 1, or possibly a bit lower than the previous likelihoods, because any noise in the two causal relationships would be additive.

L(D|N) = 0.01.  The probability of seeing a correlation with the observed strength, or more, under the null hypothesis, is the computed p-value, 0.01.

When we multiply the prior probabilities by the likelihoods, the probability assigned to N drops by a factor of 100; the other probabilities are almost unchanged.  When we renormalize, the other probabilities go up.

In other words, the update takes most of the probability mass away from N and redistributes it to the other hypotheses.  The result of the redistribution depends on the priors, but for all of the alternative hypotheses, the posterior is greater than the prior.  That is

P(A|D) > P(A)
P(B|D) > P(B)
P(C|D) > P(C)

Thus, the correlation is evidence in favor of A, B and C.  In this example, the Bayes factor for all three is about 100:1, maybe a bit lower for C.  So the correlation alone does not discriminate much, if at all, between the alternative hypotheses.
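To make the arithmetic concrete, here is the update spelled out, with illustrative prior probabilities that I chose for this example (they are not canonical):

```python
# Hypotheses: A causes B, B causes A, C causes both, N: no causal links
priors = {'A': 0.1, 'B': 0.1, 'C': 0.1, 'N': 0.7}        # illustrative priors
likelihoods = {'A': 1.0, 'B': 1.0, 'C': 1.0, 'N': 0.01}  # the p-value enters here

# Bayes's theorem: multiply priors by likelihoods, then renormalize
unnorm = {h: priors[h] * likelihoods[h] for h in priors}
total = sum(unnorm.values())
posterior = {h: unnorm[h] / total for h in unnorm}

# N's probability collapses; the causal hypotheses share the redistributed mass
for h, p in posterior.items():
    print(h, round(p, 3))
```

With these numbers, each causal hypothesis rises from 0.1 to about 0.33, while N falls from 0.7 to about 0.02, reflecting the roughly 100:1 Bayes factor against the null.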

If there is a good reason to think that A is more plausible than B and C, that would be reflected in the priors.  In that case the posterior probability might be substantially higher for A than for B and C.

And if the resulting posterior, P(A|D), were sufficiently high, I would be willing to say that the observed correlation implies causation, with the qualification that I am using "imply" in the sense of strong empirical evidence, not a mathematical proof.

People who have internalized the mantra that correlation does not imply causation might be surprised by my casual (not causal) blasphemy.  But I am not alone.  This article from Slate makes a similar point, but without the Bayesian mumbo-jumbo.

And the Wikipedia page on "Correlation does not imply causation" includes this discussion of correlation as scientific evidence:
Much of scientific evidence is based upon a correlation of variables – they are observed to occur together. Scientists are careful to point out that correlation does not necessarily mean causation. The assumption that A causes B simply because A correlates with B is often not accepted as a legitimate form of argument. However, sometimes people commit the opposite fallacy – dismissing correlation entirely, as if it does not suggest causation. This would dismiss a large swath of important scientific evidence. 
In conclusion, correlation is a valuable type of scientific evidence in fields such as medicine, psychology, and sociology. But first correlations must be confirmed as real, and then every possible causative relationship must be systematically explored. In the end correlation can be used as powerful evidence for a cause-and-effect relationship between a treatment and benefit, a risk factor and a disease, or a social or economic factor and various outcomes. But it is also one of the most abused types of evidence, because it is easy and even tempting to come to premature conclusions based upon the preliminary appearance of a correlation.
I think this is a reasonable conclusion, and hopefully not too shocking to my colleagues in the back of the room.

UPDATE February 21, 2014:  There is a varied and lively discussion of this article on reddit/r/statistics.

One of the objections raised there is that I treat the hypotheses A, B, C, and N as mutually exclusive, when in fact they are not.  For example, it's possible that A causes B and B causes A.  This is a valid objection, but we can address it by adding additional hypotheses for A&B, B&C, A&C, etc.  The rest of my argument still holds.  Finding a correlation between A and B is evidence for all of these hypotheses, and evidence against N.

One of my anonymous correspondents on reddit added this comment, which gives examples where correlation alone might be used, in the absence of better evidence, to guide practical decisions:
This [meaning my article] is not too different from the standard view in medicine, though usually phrased in more of a discrete "levels of evidence" sense than a Bayesian sense. While direct causal evidence is the gold standard in medicine, correlational studies are still taken as providing some evidence that is sometimes worth acting on, in the absence of better evidence. For example, correlations with negative health outcomes are sometimes taken as reasons to issue recommendation to avoid certain behaviors/drugs/foods (pending further data), and unexpected correlations are often taken as good justification for funding further studies into a relationship.
In general, one of the nice things about Bayesian analysis is that it provides useful inputs for decision analysis, especially when we have to make decisions in the absence of conclusive evidence.

Thursday, January 23, 2014

Bayesian statistics in Montreal

I am happy to announce that I will be offering my tutorial, "Bayesian Statistics Made Simple" at PyCon 2014 in Montreal.  The tutorial is based on material from Think Bayes.  It includes some of the examples and exercises in the book.  Participants will work on examples during the workshop, so it should be engaging and reasonably fun.  This is the third time I have done the tutorial at PyCon.

You can get more information about my tutorial (and the other tutorials) on the PyCon site.

And here is the description:

Thursday 10 April 1:20 p.m.–4:40 p.m.

Bayesian statistics made simple

Allen Downey


An introduction to Bayesian statistics using Python.  Bayesian statistics are usually presented mathematically, but many of the ideas are easier to understand computationally.  People who know Python can get started quickly and use Bayesian analysis to solve real problems.  This tutorial is based on material and case studies from Think Bayes (O’Reilly Media).


Bayesian statistical methods are becoming more common and more important, but there are not many resources to help beginners get started.  People who know Python can use their programming skills to get a head start.
I will present simple programs that demonstrate the concepts of Bayesian statistics, and apply them to a range of example problems.  Participants will work hands-on with example code and practice on example problems.
Students should have at least basic Python and basic statistics.  If you learned about Bayes’s theorem and probability distributions at some point, that’s enough, even if you don’t remember them!
Students should bring a laptop with Python 2.x and matplotlib.  You can work in any environment; you just need to be able to download a Python program and run it.

Friday, December 27, 2013

Leslie Valiant is probably British. Or old.

I got Leslie Valiant's new book, Probably Approximately Correct, for Christmas.  I'm embarrassed to admit that I was not familiar with the author, especially since he won the Turing Award in 2010.  But I wasn't, and that led to a funny sequence of thoughts, which leads to an interesting problem in Bayesian inference.

When I saw the first name "Leslie," I thought that the author was probably female, since Leslie is a primarily female name, at least for young people in the US.  But the dust jacket identifies the author as a computer scientist, and when I read that I saw blue and smelled cheese, which is the synesthetic sensation I get when I encounter high-surprisal information that causes large updates in my subjective probabilistic beliefs (or maybe it's just the title of a TV show).

Specifically, the information that the author is a computer scientist caused two immediate updates: I concluded that the author is more likely to be male and, if male, more likely to be British, or old, or both.

A quick flip to the back cover revealed that both of those conclusions were true, but it made me wonder if they were justified.  That is, was my internal Bayesian update system (IBUS) working correctly, or leaping to conclusions?

Part One: Is the author male?

To check, I will try to quantify the analysis my IBUS performed.  First let's think about the odds that the author is male.  Starting with the name "Leslie" I would have guessed that about 1/3 of Leslies are male.  So my prior odds were 1:2 against.

Now let's update with the information that Leslie is a computer scientist who writes popular non-fiction.  I have read lots of popular computer science books, and of them about 1 in 20 were written by women.  I have no idea what fraction of computer science books are actually written by women.  My estimate might be wrong because my reading habits are biased, or because my recollection is not accurate.  But remember that we are talking about my subjective probabilistic beliefs.   Feel free to plug in your own numbers.

Writing this formally, I'll define

M: the author is male
F: the author is female
B: the author is a computer scientist
L: the author's name is Leslie


odds(M | L, B) = odds(M | L) like(B | M) / like(B | F)

If the prior odds are 1:2 and the likelihood ratio is 20, the posterior odds are 10:1 in favor of male.  Intuitively, "Leslie" is weak evidence that the author is female, but "computer scientist" is stronger evidence that the author is male.
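The arithmetic is easy to check with a few lines of Python.  Here is the Part One update as a sketch (my code, not from the book), using the odds form of Bayes's theorem: posterior odds = prior odds × likelihood ratio.

```python
def update_odds(prior_odds, likelihood_ratio):
    """Odds form of Bayes's theorem: multiply prior odds by the likelihood ratio."""
    return prior_odds * likelihood_ratio

prior_odds_male = 1 / 2    # 1:2 against, based on the name "Leslie"
likelihood_ratio = 20      # 20 popular CS books by men for each one by a woman

posterior_odds_male = update_odds(prior_odds_male, likelihood_ratio)
print(posterior_odds_male)    # 10.0, that is, 10:1 in favor of male

# Converting odds to probability, p = o / (o + 1):
print(posterior_odds_male / (posterior_odds_male + 1))    # about 0.91
```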

Part Two: Is the author British?

So what led me to think that the author is British?  Well, I know that "Leslie" is primarily female in the US, but closer to gender-neutral in the UK.  If someone named Leslie is more likely to be male in the UK (compared to the US), then maybe men named Leslie are more likely to be from the UK.  But not necessarily.  We need to be careful.

If the name Leslie is much more common in the US than in the UK, then the absolute number of men named Leslie might be greater in the US.  In that case, "Leslie" would be evidence in favor of the hypothesis that the author is American.

I don't know whether "Leslie" is more popular in the US.  I could do some research, but for now I will stick with my subjective update process, and assume that the number of people named Leslie is about the same in the US and the UK.

So let's see what the update looks like.  I'll define

US: the author is from the US
UK: the author is from the UK


odds(UK | L, B) = odds(UK | B) like(L | UK) / like(L | US)

Again thinking about my collection of popular computer science books, I guess that one author in 10 is from the UK, so my prior odds are about 10:1 against.

To compute the likelihoods, I use the law of total probability, conditioning on whether the author is male (using the probability I just computed).  So:

like(L | UK) = prob(M) like(L | UK, M) + prob(F) like(L | UK, F)


like(L | US) = prob(M) like(L | US, M) + prob(F) like(L | US, F)

Based on my posterior odds from Part One:

prob(M) = 90%
prob(F) = 10%

Assuming that the number of people named Leslie is about the same in the US and the UK, and guessing that "Leslie" is gender neutral in the UK:

like(L | UK, M) = 50%
like(L | UK, F) = 50%

And guessing that "Leslie" is primarily female in the US:

like(L | US, M) = 10%
like(L | US, F) = 90%

Taken together, the likelihood ratio is about 3:1, which means that knowing L and suspecting M is evidence in favor of UK.  But not very strong evidence.
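Putting the Part Two numbers into Python (a sketch using the subjective guesses above, not code from the book):

```python
prob_male, prob_female = 0.9, 0.1    # posterior from Part One (10:1 odds)

# Law of total probability: like(L | country), averaged over the author's sex.
like_L_UK = prob_male * 0.5 + prob_female * 0.5    # "Leslie" gender-neutral in UK
like_L_US = prob_male * 0.1 + prob_female * 0.9    # "Leslie" mostly female in US

likelihood_ratio = like_L_UK / like_L_US
print(likelihood_ratio)    # about 2.8, or roughly 3:1 in favor of UK

prior_odds_UK = 1 / 10     # one author in 10 from the UK
print(prior_odds_UK * likelihood_ratio)    # posterior odds, still below even
```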


It looks like my IBUS is functioning correctly or, at least, my analysis can be justified provided that you accept the assumptions and guesses that went into it.  Since any of those numbers could easily be off by a factor of two, or more, don't take the results too seriously.

Monday, November 25, 2013

Six Ways Coding Teaches Math

Last week I attended the Computer-Based Mathematics Education Summit in New York City, run by Wolfram Research and hosted at UNICEF headquarters.

The motivation behind the summit is explained by Conrad Wolfram in this TED talk.  His idea is that mathematical modeling almost always involves these steps:
  1. Posing the right question.
  2. Taking a real world problem and expressing it in a mathematical formulation.
  3. Computation.
  4. Mapping the math formulation back to the real world.
Wolfram points out that most math classes spend all of their time on step 3, and no time on steps 1, 2, and 4.  But step 3 is exactly what computers are good at, and what humans are bad at.  And furthermore, steps 1, 2, and 4 are important, and hard, and can only be done by humans (at least for now).

So he claims, and I agree, that we should be spending 80% of the time in math classes on steps 1, 2, and 4, and only 20% on computation, which should be done primarily using computers.

When I saw Wolfram's TED talk, I was struck by the similarity of his 4 steps to the framework we teach in Modeling and Simulation, a class developed at Olin College by John Geddes, Mark Somerville, and me.  We use this diagram to explain what we mean by modeling:

Our four steps are almost the same, but we use some different language:
  1. The globe in the lower left is a physical system you are interested in.  You have to make modeling decisions about which aspects of the real world can be ignored and which are important to include in the model.
  2. The result, in the upper left, is a model, or several models, which you can analyze mathematically or simulate.
  3. Either way, you arrive at the upper-right corner: a set of results.
  4. Finally, you have to compare your results with the real world.

The exclamation points represent the work the model does, which might be
  • Prediction: What will this system do in the future?
  • Explanation: Why does the system behave as it does (and in what regime might it behave differently)?
  • Optimization: How can we design this system to behave better (for some definition of "better")?
In Modeling and Simulation, students use simulations more than mathematical analysis, so they can choose to study systems more interesting than what you see in freshman physics.  And we don't do the modeling for them.  They have to make, and justify, decisions about what should be in the model depending on what kind of work it is intended to do.

Leaving aside whether we should call this activity math, or modeling, or something else, it's clear that Wolfram's framework and ours are getting at the same set of ideas.  So I was looking forward to the summit.

I proposed to present a talk called "Six Ways Coding Teaches Math," based on Modeling and Simulation, and also on classes I have developed for Data Science and Bayesian Statistics.  For reasons I'm not sure I understand, my proposal was not accepted initially, but on the second day of the conference, I got an email from one of the organizers asking if I could present later that day.

I put some slides together in about 30 minutes and did the presentation two hours later!  Here are the slides:

Special thanks to John Geddes, who also attended the CBM summit, and who helped me prepare the presentation and facilitate discussions.  And thanks to Mark Somerville, who answered a last minute email and sent the figure above, which is much prettier than my old version.

Here's an outline of what I talked about.

Six Ways Coding Teaches Math

My premise is that programming is a pedagogic wedge.  If students can write basic programs in any language, they can use that skill to learn almost anything, especially math.

This is also the premise of my book series, which uses Python code to explain statistics, complexity science, and (my current project) digital signal processing.

I presented six ways you can use coding to learn math topics:

1) Translation.

This is probably the most obvious of the six, but students can learn computational mechanisms and algorithms by translating them into code from English, or from math notation, or even from another programming language.  Any misunderstandings will be reflected in their code, so when they debug programs, they are also debugging their brains.
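For instance (my illustration, not an example from the talk), translating the math notation x̄ = (1/n) Σ xᵢ into a loop makes every piece of the formula explicit, and any misunderstanding shows up as a wrong answer:

```python
def mean(xs):
    """Translate the formula: sum the elements, then divide by n."""
    total = 0.0
    for x in xs:
        total += x
    return total / len(xs)

print(mean([1, 2, 3, 4]))   # 2.5
```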

2) "Proof by example".

If you prove a result mathematically, you can check whether it is true by trying out some examples.  For example, in my statistics class, we test the Central Limit Theorem by adding up variates from different distributions.  Students get insight into why the CLT works, when it does.  And we also try examples where the CLT doesn't apply, like adding up Pareto variates.  I hope this exercise helps students remember not just the rule but also the exceptions.

3) Understanding math entities by API design.

Many mathematical entities are hard to explain because they are so abstract.  When you represent these entities as software objects, you define an application programming interface (API) that specifies the operations the entities support, and how they interact with other entities.  Students can understand what these entities ARE by understanding what they DO.

An example from my statistics class is a library I provide that defines objects to represent PMFs, CDFs, PDFs, etc.  The methods these objects provide define, in some sense, what they are.
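A stripped-down sketch of what such an API might look like (my sketch, not the actual code from the library): the entity is defined by what it can do.

```python
class Pmf(dict):
    """A probability mass function: a map from value to probability."""

    def normalize(self):
        """Scale the probabilities so they add up to 1."""
        total = sum(self.values())
        for value in self:
            self[value] /= total

    def mean(self):
        """The expected value: sum of value times probability."""
        return sum(value * prob for value, prob in self.items())

pmf = Pmf({1: 1, 2: 2, 3: 1})   # unnormalized frequencies
pmf.normalize()
print(pmf.mean())                # 2.0
```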

This pedagogic approach needed more explaining than the others, and one participant pointed out that it might require more than just basic programming skills.  I agreed, but I would also suggest that students benefit from using these APIs, even if they don't design them.

4) Option to go top down.

When students have programming skills, you don't have to present every topic bottom-up.  You can go top-down; that is, students can start using new tools before they understand how they work.

An example came up when I started working on a new textbook for Digital Signal Processing (DSP).  In DSP books, Chapter 1 is usually about complex arithmetic.  If you approach the topic mathematically, that's where you have to start.  Then it takes 9 chapters and 300 pages to get to the Fast Fourier Transform, which is the most important algorithm in DSP.

Approaching the topic computationally, we can use an implementation of FFT (readily available in any language) to start doing spectral analysis on the first day.  Students can download sound files, or record their own, and start looking at spectra and spectrograms right away.  Once they understand what spectral analysis is, they are motivated and better prepared to understand how it works.  And the exercises are infinitely more interesting.
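As a taste of the top-down approach, here is a sketch (using NumPy, not code from the book) that finds the dominant frequency of a signal with an off-the-shelf FFT:

```python
import numpy as np

framerate = 11025                        # samples per second
t = np.arange(framerate) / framerate     # one second of sample times
signal = np.sin(2 * np.pi * 440 * t)     # a 440 Hz sine wave (concert A)

# Real-input FFT, and the frequency corresponding to each bin.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / framerate)

peak = freqs[np.argmax(np.abs(spectrum))]
print(peak)   # 440.0
```

Students can run something like this on the first day, on their own recordings, before they know what a Fourier transform is.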

5) Two bites at each apple.

Some people like math notation and appreciate how it expresses complex ideas concisely.  Other people find that the same ideas expressed in code are easier to read.  If you present ideas both ways, everyone gets two chances to understand.

Sometimes math notation and code look similar, but often they are very different.  An example that comes up in Think Bayes is a Bayesian update.  Here it is in math notation, applied to each hypothesis H given data D:

P(H | D) = P(H) P(D | H) / P(D)

And here is the code:

class Suite(Pmf):
    """Map from hypothesis to probability."""

    def update(self, data):
        """Multiply each hypothesis by the likelihood of the data."""
        for hypo in self:
            like = self.likelihood(data, hypo)
            self[hypo] *= like
        # renormalize so the posterior probabilities add up to 1
        return self.normalize()


If you are a mathematician who doesn't program, you might prefer the first.  If you know Python, you probably find the second easier to read.  And if you are comfortable with both, you might find it enlightening to see the idea expressed in different ways.

6) Connect to the real world.

Finally, with computational tools, students can work on real world problems.  In my Data Science class, students aren't limited to data that comes in the right format, or toy examples from a book.  They can work with datasets from almost any source.

And according to Big Data Borat:
In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data.
So students work with real data and interact with real clients.  Which reminds me: I am looking for external collaborators to work with students in my Data Science class, starting January 2014.

UPDATE: Here's an article by Michael Ernst about his class, which combines introductory Python programming and data processing: Teaching intro CS and programming by way of scientific data analysis

Audience participation

So that was my presentation.  Then I had a little fun.  I asked the participants to assemble into small groups, introduce themselves to their neighbors, and discuss these prompts:  Are there other categories in addition to the six I described?  Are the people in the audience doing similar things?  How do these ideas apply in secondary education, and primary?

After a day and a half of sitting in presentations with limited interaction, many of the participants seemed happy to talk and hear from each other.  Although when you spring active learning methods on a naive audience, not everyone appreciates it!

Anyway, I sat in on some excellent conversations, and then asked the groups to report out.  I wish I could summarize their comments, but I have to concentrate to understand and facilitate group discussion, and I usually don't remember it well afterward.

One concern that came up more than once is the challenge of building programming skills (which my premise takes for granted) in the first place.  There are, of course, two options.  You can require programming as a prerequisite or teach it on demand.  In the Olin curriculum, there are examples of both.

Modeling and Simulation does not require any programming background, and each year about half of the students have no programming experience at all.  While they are working on projects, they work on a series of exercises to develop the programming skills they need.  And they read "The Cat Book," also known as Physical Modeling in MATLAB.

For most of my other classes, Python programming is a prerequisite.  Most students meet the requirement by taking our introductory programming class, which uses my book, Think Python.  But some of them just read the book.

That's all for now from the CBM summit.  If you read this far, let me ask you the same questions I asked the summit participants:

  1. Are there other categories in addition to the six I described?
  2. Are you doing similar things?
  3. How do these ideas apply in secondary education, and primary?
Please comment below or send me email.