Probably Overthinking It: January 2013

Wednesday, January 23, 2013

Bayesian Statistics Made Simple

I am happy to announce that I will offer an updated and revised version of my tutorial, Bayesian Statistics Made Simple, at PyCon 2013.

Registration is open now. Here are the details:

PyCon 2013
Santa Clara, CA

Wednesday 13 March, 1:20 p.m.–4:40 p.m.

Bayesian statistics made simple
Allen Downey

Audience level: Intermediate

DESCRIPTION

An introduction to Bayesian statistics using Python. Bayesian statistics are usually presented mathematically, but many of the ideas are easier to understand computationally. People who know some Python have a head start.

We will use material from Think Stats: Probability and Statistics for Programmers (O’Reilly Media), and Think Bayes, a work in progress at http://thinkbayes.com.

ABSTRACT

Bayesian statistical methods are becoming more common and more important, but there are not many resources to help beginners get started. People who know Python can use their programming skills to get a head start.

I will present simple programs that demonstrate the concepts of Bayesian statistics, and apply them to a range of example problems. Participants will work hands-on with example code and practice on example problems.

Students should have at least basic Python and basic statistics. If you learned about Bayes’s Theorem and probability distributions at some time, that’s enough, even if you don’t remember it!

Students should bring a laptop with Python 2.x and matplotlib. You can work in any environment; you just need to be able to download a Python program and run it.

Tuesday, January 8, 2013

Are first babies more likely to be late, revisited.

UPDATE: The version of this article with the most recent data is here.

Two years ago I wrote an article called Are first babies more likely to be late?, based on a question that came up when my wife and I were expecting our first child. I compared the pool of first babies to the pool of all other babies, and found:

There is a small difference in the mean pregnancy length for the two groups, about 13 hours, but it is not practically or statistically significant.
If we group babies into Early, On Time, or Late (where On Time is 38, 39 or 40 weeks), first babies are a little more likely to be Early or Late, and less likely to be On Time.

Then yesterday I got the following question from an Unknown correspondent:

While interesting, I can't help but think you need to compare the first and others for the same woman. While may be unlikely it could still be that a tendency exists for a woman's second, third, etc, child comes earlier.

This is an excellent suggestion. It is possible that the variability between mothers is masking a real difference between first and later babies. By pairing the first and second babies of the same mother, we can control for variation between mothers.

So I ran that experiment, selecting all mothers with at least two children and computing the difference in pregnancy length between the second and first child (so a positive value means the second child was later). Here is the distribution of these values for 4387 women in the NSFG (National Survey of Family Growth):

Visually the distribution looks symmetric, and the summary statistics support that conclusion. The mean is -0.034 weeks, which means that (if anything) the second baby is born about 6 hours earlier, but this difference is not statistically significant.
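For readers who want to try a paired comparison like this, here is a minimal sketch in Python 3, using only the standard library. It is not the code I ran on the NSFG data: the data below is synthetic, the function names are made up for illustration, and the p-value uses a normal approximation, which is an assumption that is only reasonable for large samples.

```python
import math
import random
import statistics

HOURS_PER_WEEK = 7 * 24


def paired_summary(first, second):
    """Summarize second - first for paired pregnancy lengths (in weeks).

    Returns (mean difference, standard error, test statistic, two-sided p-value).
    The p-value uses a normal approximation to the test statistic.
    """
    diffs = [s - f for f, s in zip(first, second)]
    mean = statistics.mean(diffs)
    stderr = statistics.stdev(diffs) / math.sqrt(len(diffs))
    t = mean / stderr
    p = math.erfc(abs(t) / math.sqrt(2))
    return mean, stderr, t, p


def weeks_to_hours(weeks):
    """Convert a duration in weeks to hours."""
    return weeks * HOURS_PER_WEEK


# Synthetic example (NOT the NSFG data): each mother has her own baseline,
# and both pregnancies vary around it with no built-in first-baby effect.
rng = random.Random(1)
baseline = [rng.gauss(39, 1.5) for _ in range(1000)]
first = [b + rng.gauss(0, 2) for b in baseline]
second = [b + rng.gauss(0, 2) for b in baseline]

mean, stderr, t, p = paired_summary(first, second)
print("mean difference: %.3f weeks (%.1f hours)" % (mean, weeks_to_hours(mean)))
print("standard error: %.3f, test statistic: %.2f, p-value: %.3f" % (stderr, t, p))
```

Because the differences are computed within each mother, the mother-to-mother baseline drops out of the comparison, which is the point of the pairing. The same conversion shows where "about 6 hours" comes from: 0.034 weeks times 168 hours per week is about 5.7 hours.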

Conclusion: good question, definitely worth running the experiment, but the primary result is the same as what we saw before: no significant difference in the means.

Monday, January 7, 2013

Call for Bayesian case studies

It's been a while since the last post because I have been hard at work on Think Bayes. As always, I have been posting drafts as I go along, so you can read the current version at thinkbayes.com.

I am teaching Computational Bayesian Statistics in the spring, using the draft edition of the book. The students will work on case studies, some of which will be included in the book. And then I hope the book will be published as part of the Think X series (for all X). At least, that's the plan.

In the next couple of weeks, students will be looking for ideas for case studies. An ideal project has at least some of these characteristics:

An interesting real-world application (preferably not a toy problem).
Data that is either public or can be made available for use in the case study.
Permission to publish the case study!
A problem that lends itself to Bayesian analysis, in particular if there is a practical advantage to generating a posterior distribution rather than a point or interval estimate.

Examples in the book include:

The hockey problem: estimating the rate of goals scored by two hockey teams in order to predict the outcome of a seven-game series.
The paintball problem, a version of the lighthouse problem. This one verges on being a toy problem, but recasting it in the context of paintball got it over the bar for me.
The kidney problem. This one is as real as it gets -- it was prompted by a question posted by a cancer patient who needed a statistical estimate of when a tumor formed.
The unseen species problem: a nice Bayesian solution to a standard problem in ecology.
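To give a flavor of what "generating a posterior distribution" looks like in code, here is a rough sketch of the first step of a problem like the hockey one: estimating a scoring rate from goals per game. This is not the solution in the book. It assumes goals per game follow a Poisson distribution, uses a uniform prior over a grid of rates, and the goal counts are invented for illustration. It stops at the posterior for one team and does not predict a series.

```python
import math


def poisson_pmf(k, lam):
    """Probability of k events when the expected number is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def posterior_rate(goals_per_game, rates):
    """Grid posterior over scoring rates, starting from a uniform prior."""
    unnormalized = []
    for lam in rates:
        likelihood = 1.0
        for k in goals_per_game:
            likelihood *= poisson_pmf(k, lam)
        unnormalized.append(likelihood)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical data: goals scored by one team in four games.
goals = [3, 2, 4, 1]

# Candidate rates from 0.1 to 10.0 goals per game.
rates = [0.1 * i for i in range(1, 101)]
post = posterior_rate(goals, rates)

# Summaries of the posterior: most likely rate and posterior mean.
map_rate = rates[post.index(max(post))]
post_mean = sum(r * p for r, p in zip(rates, post))
print("most likely rate: %.1f goals per game" % map_rate)
print("posterior mean: %.2f goals per game" % post_mean)
```

The result is a whole distribution over rates rather than a single number, so the uncertainty about each team can be carried forward into whatever prediction comes next.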

So far I have a couple of ideas prompted by questions on Reddit:

But I would love to get more ideas. If you have a problem you would like to contribute, let me know!