Averaged across all live births, the mean duration of pregnancy for first babies is 38.6 weeks, compared to 38.5 weeks for other babies.
Those means include pre-term babies, which affect the averages in a way that understates the difference. For full-term babies, the differences are a little bigger.
For example, if you are at the beginning of Week 36, the average time until delivery is 3.4 weeks for first babies and 3.1 weeks for others, a difference of 1.8 days (the weekly averages are rounded to one decimal place). The gap is about the same for weeks 37 through 40. After that, there is no consistent difference between first babies and others.
The following figure shows average remaining duration in weeks, for first babies and others, computed for weeks 36 through 43.
The gap between first babies and others is consistent until Week 41. As an aside, this figure also shows a surprising pattern: after Week 38, the expected remaining duration levels off at about one week. For more than a month, the finish line is always a week away!
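To make "average remaining duration" concrete, here is a minimal sketch of the calculation, assuming durations are recorded in whole weeks. The function name and the sample data are made up for illustration; this is not the code from the notebook, and it ignores survey weights.

```python
from statistics import mean


def mean_remaining(durations, week):
    """Average weeks remaining for pregnancies that reached `week`.

    `durations` holds pregnancy lengths in whole weeks. A pregnancy
    counts as still in progress at the start of `week` if its
    duration is at least `week`.
    """
    remaining = [d - week for d in durations if d >= week]
    return mean(remaining) if remaining else float("nan")


# Made-up durations for illustration only (not NSFG data).
durations = [35, 37, 38, 39, 39, 40, 40, 41, 42]

for week in range(36, 44):
    print(week, mean_remaining(durations, week))
```

Running this separately on first babies and on others, week by week, gives the two curves to compare.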
Looking at the probability of delivering in the next week, we see a similar pattern: from Week 38 on, the probability is almost the same, with some increase after Week 41.
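The probability of delivering in the next week can be sketched the same way: among pregnancies that have reached a given week, count the fraction that end during that week. Again, the function name and data are hypothetical, durations are assumed to be whole weeks, and survey weights are ignored.

```python
def prob_next_week(durations, week):
    """Fraction of pregnancies reaching `week` that end during `week`."""
    at_risk = [d for d in durations if d >= week]
    if not at_risk:
        return float("nan")
    delivered = sum(1 for d in at_risk if d == week)
    return delivered / len(at_risk)


# Made-up durations for illustration only (not NSFG data).
durations = [35, 37, 38, 39, 39, 40, 40, 41, 42]

for week in range(36, 43):
    print(week, round(prob_next_week(durations, week), 2))
```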
In summary, among full-term pregnancies, first babies arrive a little later than others, by about two days. After Week 38, the expected remaining duration is about one week.
The code I used to generate these results is in this IPython Notebook. I used data from the National Survey of Family Growth (NSFG). During the last three survey cycles, the NSFG interviewed more than 25,000 women and collected data about more than 48,000 pregnancies. Of those, I selected the 30,110 pregnancies whose outcome was a live birth.
Among these live births, there were 13,864 first babies and 16,246 others. The mean duration of pregnancy for first babies is 38.61 weeks, with SE 0.024; for others it is 38.52 weeks, with SE 0.019. The difference is statistically significant with p < 0.001.
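For readers who want to see how a standard error and a test of the difference in means fit together, here is a minimal sketch using a normal approximation. This is an assumption on my part about one reasonable way to do it, not necessarily the test used in the notebook, and the data are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def standard_error(values):
    """Standard error of the mean: sample std dev over sqrt(n)."""
    return stdev(values) / sqrt(len(values))


def two_sample_z(a, b):
    """Difference in means, its standard error, z, and two-sided p.

    Uses a normal approximation, which is reasonable for samples
    as large as the ones described above.
    """
    diff = mean(a) - mean(b)
    se_diff = sqrt(standard_error(a) ** 2 + standard_error(b) ** 2)
    z = diff / se_diff
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, se_diff, z, p


# Made-up durations for illustration only (not NSFG data).
firsts = [38, 39, 39, 40, 40, 41, 41, 42]
others = [37, 38, 39, 39, 40, 40, 40, 41]
print(two_sample_z(firsts, others))
```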
However, those means could be misleading for two reasons. First, they include pre-term babies, which bring down the averages for both groups. Second, they do not take into account the stratified survey design.
To address the second point, I use weighted resampling, running each analysis 101 times and selecting the 10th, 50th, and 90th percentile of the results. The lines in the figure above show median values (50th percentile). The gray areas show an 80% confidence interval (between the 10th and 90th percentiles).
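Here is a minimal sketch of the resampling idea, assuming each record carries a sampling weight and that resampling means drawing records with replacement with probability proportional to that weight. The function names, weights, and data are hypothetical; the notebook's implementation may differ.

```python
import random
from statistics import mean


def resample_weighted(values, weights, rng):
    """Draw len(values) records with replacement, proportional to weight."""
    return rng.choices(values, weights=weights, k=len(values))


def percentile_summary(values, weights, stat=mean, iters=101, seed=0):
    """Run `stat` on `iters` weighted resamples; return the
    10th, 50th, and 90th percentiles of the results."""
    rng = random.Random(seed)
    results = sorted(
        stat(resample_weighted(values, weights, rng)) for _ in range(iters)
    )

    def pick(p):
        # With 101 sorted results (indices 0..100), this is index p.
        return results[round(p / 100 * (iters - 1))]

    return pick(10), pick(50), pick(90)


# Made-up durations and weights for illustration only (not NSFG data).
durations = [37, 38, 39, 39, 40, 40, 41, 42]
weights = [1.0, 2.5, 3.0, 1.5, 2.0, 4.0, 1.0, 0.5]
print(percentile_summary(durations, weights))
```

Running the analysis 101 times is convenient because the sorted results then line up exactly with percentiles 0 through 100.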
This analysis is based on data reported by respondents, so it includes errors due to inaccurate memory and reporting. In most cases respondents are reporting estimates made by doctors, but some might be reporting their own estimates.
The observed differences between first babies and others might be caused by differences in measurement error. For example, estimates for first-time mothers might be less accurate. Based on this data, we can't tell whether the observed differences are due to biological factors or procedural factors.
But for purposes of prediction, it doesn't matter. If you are a first-time mother and your doctor estimates that you are at Week 36, your chance of delivering in the next week is lower, relative to other mothers, and your expected time until delivery is longer, regardless of what causes the difference.
I use this question, whether first babies are more likely to be late, as a case study in my book, Think Stats. There, I used data from only one cycle of the NSFG and reported a small difference between first babies and others that was not statistically significant.
I also wrote about this question in a previous blog article, "Are first babies more likely to be late?", which has been viewed more than 100,000 times, more than any other article on this blog.
I am revisiting the question now for two reasons:
1) I worked on another project that required me to load data from other cycles of the NSFG. Having done that work, I saw an opportunity to run my analysis again with more data.
2) Since my previous articles were intended partly for statistics education, I kept the analysis simple. In particular, I ignored the stratified design of the survey, which made the results suspect. Fortunately, it turns out that the effect of ignoring the design is small; the new results are consistent with what I saw before.
Since I've been writing about this topic and using it as a teaching example for more than 5 years, I hope the question is settled now.