Wednesday, September 23, 2015

First babies are more likely to be late

UPDATE: The version of this article with the most recent data is here.

If you are pregnant with your first child, you might have heard that first babies are more likely to be late.  This turns out to be true, but the difference is small, about 16 hours.

Averaged across all live births, the mean duration of pregnancy for first babies is 38.6 weeks, compared to 38.5 weeks for other babies.

Those means include pre-term babies, which affect the averages in a way that understates the difference.  For full-term babies, the differences are a little bigger.

For example, if you are at the beginning of Week 36, the average time until delivery is 3.4 weeks for first babies and 3.1 weeks for others, a difference of 1.8 days.  The gap is about the same for weeks 37 through 40.  After that, there is no consistent difference between first babies and others.

The following figure shows average remaining duration in weeks, for first babies and others, computed for weeks 36 through 43.

The gap between first babies and others is consistent until Week 41.  As an aside, this figure also shows a surprising pattern: after Week 38, the expected remaining duration levels off at about one week.  For more than a month, the finish line is always a week away!
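As a rough sketch of what "expected remaining duration" means here: among pregnancies that reach a given week, take the mean of the weeks still to go.  The function name and the toy durations below are made up for illustration, and I am assuming durations are recorded in whole weeks; this is not the code from the notebook and it ignores the survey weights.

```python
def mean_remaining(durations, week):
    """Mean number of weeks left for pregnancies that reach `week`."""
    remaining = [d - week for d in durations if d >= week]
    if not remaining:
        raise ValueError("no pregnancies reach this week")
    return sum(remaining) / len(remaining)


# Hypothetical durations in weeks, for illustration only (not NSFG data).
toy_durations = [36, 38, 39, 39, 40, 41, 42]

for week in range(36, 43):
    print(week, round(mean_remaining(toy_durations, week), 2))
```

Running the same function separately on first babies and on others would give the two curves being compared.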

Looking at the probability of delivering in the next week, we see a similar pattern: from Week 38 on, the probability is almost the same, with some increase after Week 41.

The difference between first babies and others is highest in Weeks 39 and 40; for example, in Week 39, the chance of delivering in the next week is 52% for first babies, compared to 64% for others.  By Week 41, this gap has closed.
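The probability of delivering in the next week can be sketched the same way: among pregnancies that reach a given week, what fraction end during that week?  Again, the function name and toy data are hypothetical, and I am assuming durations are whole weeks, so "ends during week w" means the recorded duration equals w.

```python
def prob_next_week(durations, week):
    """Fraction of pregnancies reaching `week` that end during that week."""
    at_risk = [d for d in durations if d >= week]
    if not at_risk:
        raise ValueError("no pregnancies reach this week")
    delivered = sum(1 for d in at_risk if d == week)
    return delivered / len(at_risk)


# Hypothetical durations in weeks, for illustration only (not NSFG data).
toy_durations = [36, 38, 39, 39, 40, 41]

for week in range(36, 42):
    print(week, round(prob_next_week(toy_durations, week), 2))
```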

In summary, among full-term pregnancies, first babies arrive a little later than others, by about two days.  After Week 38, the expected remaining duration is about one week.


The code I used to generate these results is in this IPython Notebook.  I used data from the National Survey of Family Growth (NSFG).  During the last three survey cycles, the NSFG interviewed more than 25,000 women and collected data about more than 48,000 pregnancies.  Of those, I selected the 30,110 pregnancies whose outcome was a live birth.

Among these live births, there were 13,864 first babies and 16,246 others.  The mean duration of pregnancy for first babies is 38.61 weeks, with SE 0.024; for others it is 38.52 weeks, with SE 0.019.  The difference is statistically significant with p < 0.001.

However, those means could be misleading for two reasons.  First, they include pre-term babies, which bring down the averages for both groups.  Second, they do not take into account the stratified survey design.

To address the second point, I use weighted resampling, running each analysis 101 times and selecting the 10th, 50th, and 90th percentile of the results.  The lines in the figure above show median values (50th percentile).  The gray areas show an 80% confidence interval (between the 10th and 90th percentiles).
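Here is a minimal sketch of that procedure, assuming "weighted resampling" means drawing rows with replacement with probability proportional to each respondent's sampling weight.  The function names, the toy values, and the toy weights are all made up for illustration; the notebook has the real implementation.

```python
import random


def resample_weighted(values, weights, rng):
    """Draw len(values) items with replacement, proportional to weights."""
    return rng.choices(values, weights=weights, k=len(values))


def percentile_summary(values, weights, statistic, iters=101, seed=17):
    """Return the 10th, 50th, and 90th percentiles of a resampled statistic."""
    rng = random.Random(seed)
    results = sorted(
        statistic(resample_weighted(values, weights, rng))
        for _ in range(iters)
    )

    # With 101 sorted results, positions 10, 50, and 90 fall exactly
    # on the 10th, 50th, and 90th percentiles.
    def pick(q):
        return results[round(q * (iters - 1))]

    return pick(0.10), pick(0.50), pick(0.90)


def mean(xs):
    return sum(xs) / len(xs)


# Hypothetical durations (weeks) and sampling weights, for illustration only.
toy_values = [36, 38, 39, 39, 40, 41, 42]
toy_weights = [1.0, 2.5, 3.0, 1.5, 2.0, 0.5, 1.0]

low, median, high = percentile_summary(toy_values, toy_weights, mean)
print(low, median, high)
```

The same wrapper works for any statistic, so the remaining-duration and next-week-probability curves can each be resampled this way.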


This analysis is based on data reported by respondents, so it includes errors due to inaccurate memory and reporting.  In most cases respondents are reporting estimates made by doctors, but some might be reporting their own estimates.

The observed differences between first babies and others might be caused by differences in measurement error.  For example, estimates for first-time mothers might be less accurate.  Based on this data, we can't tell whether the observed differences are due to biological factors or procedural factors.

But for purposes of prediction, it doesn't matter.  If you are a first-time mother and your doctor estimates that you are at Week 36, your chance of delivering in the next week is lower, relative to other mothers, and your expected time until delivery is longer, regardless of what causes the difference.


I use this question—whether first babies are more likely to be late—as a case study in my book, Think Stats.  There, I used data from only one cycle of the NSFG.  I report a small difference between first babies and others, but it is not statistically significant.

I also wrote about this question in a previous blog article, "Are first babies more likely to be late?", which has been viewed more than 100,000 times, more than any other article on this blog.

I am reviewing the question now for two reasons:

1) I worked on another project that required me to load data from other cycles of the NSFG.  Having done that work, I saw an opportunity to run my analysis again with more data.

2) Since my previous articles were intended partly for statistics education, I kept the analysis simple.  In particular, I ignored the stratified design of the survey, which made the results suspect.  Fortunately, it turns out that the effect is small; the new results are consistent with what I saw before.

Since I've been writing about this topic and using it as a teaching example for more than 5 years, I hope the question is settled now.

Tuesday, September 1, 2015

Bayesian analysis of gluten sensitivity

Last week a new study showed that many subjects diagnosed with non-celiac gluten sensitivity (NCGS) were not able to distinguish gluten flour from non-gluten flour in a blind challenge.

In this article, I review the study and use a simple Bayesian model to show that the results support the hypothesis that none of the subjects are sensitive to gluten.  But there are complications in the design of the study that might invalidate the model.

Here is a description of the study:
"We studied 35 non-CD subjects (31 females) that were on a gluten-free diet (GFD), in a double-blind challenge study. Participants were randomised to receive either gluten-containing flour or gluten-free flour for 10 days, followed by a 2-week washout period and were then crossed over. The main outcome measure was their ability to identify which flour contained gluten.
"The gluten-containing flour was correctly identified by 12 participants (34%)..."
Since 12 out of 35 participants were able to identify the gluten flour, the authors conclude "Double-blind gluten challenge induces symptom recurrence in just one-third of patients fulfilling the clinical diagnostic criteria for non-coeliac gluten sensitivity."

This conclusion seems odd to me, because if none of the patients were sensitive to gluten, we would expect some of them to identify the gluten flour by chance.  So the results are consistent with the hypothesis that none of the subjects are actually gluten sensitive.

We can use a Bayesian approach to interpret the results more precisely.  But first, as always, we have to make some modeling decisions.

First, of the 35 subjects, 12 identified the gluten flour based on resumption of symptoms while they were eating it.  Another 17 subjects wrongly identified the gluten-free flour based on their symptoms, and 6 subjects were unable to distinguish.  So each subject gave one of three responses.  To keep things simple I follow the authors of the study and lump together the second two groups; that is, I consider two groups: those who identified the gluten flour and those who did not.

Second, I assume (1) people who are actually gluten sensitive have a 95% chance of correctly identifying gluten flour under the challenge conditions, and (2) subjects who are not gluten sensitive have a 40% chance of identifying the gluten flour by luck alone (and a 60% chance of either choosing the other flour or failing to distinguish).
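To get a feel for what these two assumptions imply, here is a small sketch of the expected number of correct identifications as a function of the number of gluten-sensitive subjects.  The function and parameter names are mine, chosen for illustration.

```python
def expected_identifications(gs, n=35, p_sensitive=0.95, p_chance=0.4):
    """Expected number of subjects who identify the gluten flour,
    if gs of the n subjects are gluten sensitive."""
    return gs * p_sensitive + (n - gs) * p_chance


for gs in [0, 5, 10, 35]:
    print(gs, round(expected_identifications(gs), 2))
```

With gs = 0 the expected count is 0.4 × 35 = 14, which is already more than the 12 correct identifications observed in the study.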

Under this model, we can estimate the actual number of subjects who are gluten sensitive, gs.  I chose a uniform prior for gs, from 0 to 35.  To perform the Bayesian analysis, we have to compute the likelihood of the data under each hypothetical value of gs.  Here is the likelihood function in Python:

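    def Likelihood(self, data, hypo):
        # hypo: hypothetical number of gluten-sensitive subjects
        gs = hypo
        # data: counts of subjects who did and did not identify the gluten flour
        yes, no = data
        n = yes + no
        ngs = n - gs  # number of subjects who are not gluten sensitive
        # identifications among sensitive subjects (95% chance each)
        pmf1 = thinkbayes2.MakeBinomialPmf(gs, 0.95)
        # identifications among the rest (40% chance each)
        pmf2 = thinkbayes2.MakeBinomialPmf(ngs, 0.4)
        # distribution of the sum, i.e. the total number of identifications
        pmf = pmf1 + pmf2
        return pmf[yes]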

The function works by computing the PMF of the number of gluten identifications conditioned on gs, and then selecting the actual number of identifications, yes, from the PMF.  The details of the computation are in this IPython notebook.

And here is the posterior distribution:

The most likely value of gs is 0, so it is quite possible that none of the respondents are gluten sensitive.  The 95% credible interval for gs is (0, 8), so a reasonable upper bound on the number of gluten-sensitive respondents is 8 out of 35, or 23%.
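For readers who don't have thinkbayes2 installed, here is a standalone sketch of the same model using only the standard library.  It is a re-implementation under the assumptions stated above (95% and 40%, uniform prior on 0 to 35), not the code from the notebook; the helper names are mine, and the interval is computed from the 2.5th and 97.5th percentiles of the posterior, which is my assumption about how the credible interval is defined.

```python
from math import comb


def binomial_pmf(n, p):
    """Return [P(k successes) for k = 0..n] for a binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def convolve(a, b):
    """PMF of the sum of two independent counts with PMFs a and b."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def likelihood(yes, no, gs, p_sensitive=0.95, p_chance=0.4):
    """Probability of exactly `yes` identifications if gs subjects are sensitive."""
    n = yes + no
    total = convolve(binomial_pmf(gs, p_sensitive),
                     binomial_pmf(n - gs, p_chance))
    return total[yes]


def percentile(pmf, q):
    """Smallest value whose cumulative probability reaches q."""
    cumulative = 0.0
    for value, prob in enumerate(pmf):
        cumulative += prob
        if cumulative >= q:
            return value
    return len(pmf) - 1


yes, no = 12, 23  # 12 identified the gluten flour, 23 did not
hypos = range(yes + no + 1)
likes = [likelihood(yes, no, gs) for gs in hypos]

# With a uniform prior, the posterior is the normalized likelihood.
norm = sum(likes)
posterior = [x / norm for x in likes]

most_likely = max(hypos, key=lambda gs: posterior[gs])
interval = (percentile(posterior, 0.025), percentile(posterior, 0.975))
print(most_likely, interval)
```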

We can also use this analysis to compare two hypotheses:

A) Some of the respondents are gluten sensitive (equally likely from 0 to 35).
B) None of the respondents are gluten sensitive.

The Bayes factor in support of B turns out to be about 8.4, which is moderately strong.  If you believed, before reading this study, that the probability of B was 50%, you should now believe it is about 90%.
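The last step is just Bayes's rule in odds form: posterior odds equal prior odds times the Bayes factor.  Here is a small sketch that takes the reported factor of 8.4 as given; the function name is mine.

```python
def update_probability(prior, bayes_factor):
    """Convert a prior probability to a posterior using a Bayes factor."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)


posterior_b = update_probability(0.5, 8.4)
print(round(posterior_b, 3))
```

A prior of 50% corresponds to odds of 1:1, so the posterior odds are 8.4:1, or 8.4 / 9.4, which is a little under 90%.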

However, there are complications in the design of the study that might invalidate this simple model.  In particular, the gluten-free flour in the study contained corn starch, which some people may be sensitive to.  And several subjects reported symptoms under both challenge conditions; that is, when eating both gluten flour and gluten-free flour.  So it is possible that misidentification of the gluten flour, as well as failure to distinguish, indicates sensitivity to both gluten and corn starch.

But if we limit ourselves to the question of whether people diagnosed with non-celiac gluten sensitivity are specifically sensitive to gluten, this study suggests that they are not.

Thank yous: I heard about this study in this blog post.  And I want to thank the authors of the study and their publisher for making the entire paper available on the web, which made my analysis possible.

Update 2 Sep 2015:  There is some additional discussion of this analysis on Reddit, including a very nice generalization from PhaethonPrime.