Sunday, January 29, 2012

Freshman hordes even more godless!

[This is an update of an article I wrote last year, "Freshman hordes more godless than ever."  There is a followup to this article here.]

For several years I have been following one of the most under-reported stories of the decade: the fraction of college freshmen who report no religious preference has more than tripled since 1985, from 8% to 25%, and the trend is accelerating.

Similarly, the fraction of students reporting that they did not attend a religious service in the last year has grown from 8% to more than 27%.

My analysis is based on survey results from the Cooperative Institutional Research Program (CIRP) of the Higher Education Research Institute (HERI).  In 2011, more than 200,000 students at 270 colleges and universities completed the CIRP Freshman Survey, which includes questions about students’ backgrounds, activities, and attitudes.


In one question, students select their “current religious preference,” from a choice of seventeen common religions, “Other religion,” or “None.”


Another question asks students how often they “attended a religious service” in the last year. The choices are “Frequently,” “Occasionally,” and “Not at all.” Students are instructed to select “Occasionally” if they attended one or more times.
This figure shows students' responses over the history of the survey:
It's clear that both measurements are increasing, and it looks like they might be accelerating.  To make that more precise, I fit a parabola to each curve.  This figure shows the data for "No religion" and a least squares fit:
The red line shows the fitted model; the dark gray area shows the sampling error of the model.  The lighter gray area shows the sum of the sampling error and the residual error.  R² for this model is 0.95; the p-values for the model and the parameters are < 0.001. 
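For readers who want to reproduce the fit, here is a minimal sketch (not the code behind the figures) that fits the parabola with NumPy; the array names years and percents are my own, standing in for the survey years and the "No religion" percentages:

import numpy as np

# years and percents are assumed arrays of survey years and the
# corresponding "No religion" percentages (loading not shown)
coeffs = np.polyfit(years, percents, deg=2)    # quadratic least squares fit
model = np.poly1d(coeffs)

# R^2 = 1 - SS_residual / SS_total
residuals = percents - model(years)
r2 = 1 - (residuals**2).sum() / ((percents - percents.mean())**2).sum()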


Similarly, here is the data for "No attendance" and a least squares fit:
R² for this model is 0.89, and again the p-values are effectively 0. [Note for stats geeks: as it happens, the coefficient of the linear term is close to zero, so it is not statistically significant.  My first thought was to remove it from the model, but if I did that, I would understate the sampling error of the model.  In this case, it is correct to keep an "insignificant" variable in the model; the fact that it is near zero doesn't mean we can ignore it, or the error associated with it.]


To test more explicitly whether the growth is accelerating, I computed the change from one year to the next in percentage points.  The following figure shows these changes and a least squares fit:
Subtraction amplifies noise, so for this model R² is only 0.24, but the p-value of the slope is 0.002, so this data provides strong evidence of acceleration.  The slope is 0.035 percentage points per year.
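The acceleration test amounts to differencing the series and fitting a line to the differences; here is a rough sketch along the same lines, again assuming years and percents arrays:

from scipy import stats
import numpy as np

changes = np.diff(percents)       # year-over-year changes, in percentage points
                                  # (assumes annual, consecutive survey years)
slope, intercept, r, p, stderr = stats.linregress(years[1:], changes)
print(slope, p)                   # slope of the changes and its p-value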


Based on this model, the predicted change for next year is 1 percentage point, so we expect the fraction of freshmen reporting no religious preference to be 26%.


The gender gap


Since the beginning of the CIRP survey, more men than women have reported no religious preference:
And the gender gap seems to be growing.  Here is the difference between men and women, in percentage points, and a least squares fit:
R² for this model is 0.44; the slope is 0.035 percentage points per year, with p-value < 0.0001.

Discussion



I first wrote about this in 2007, in this article for Free Inquiry magazine.  There I wrote:

College students are hardly a random sample of the population. People with more education are less likely to believe in heaven, the devil, miracles, and the literal truth of the Christian Bible. However, contrary to many people’s expectations, educated people are more likely to attend services. So, we expect the students in this sample to be less believing than the general population but also more observant. 
There is reason to think that the rate of secularization in the general population is faster than what we see in this sample. Over the lifetime of the CIRP survey, college education has democratized; the percentage of high-school graduates who enter college immediately after graduation has increased from roughly 50 percent in 1970 to 65 percent in 2003. Over this time, CIRP has included more poor students, more racial minorities, and more students from families with less education. These groups tend to be more religious than the general population, so we expect their participation to increase the religiosity in the sample. Thus, the observed decrease probably underestimates the trend in the general population. 
The theory of secularization—that there is a global, long-term trend away from religion—is controversial. Early sociologists, notably Max Weber, hypothesized that secularization is a predictable effect of rationalization—the increasing tendency for social actions to be based on reason rather than emotion or tradition.

In the 1950s and 1960s, many sociologists of religion defended strong theories of secularization, but, since then, several of them—including Peter Berger and Harvey Cox—have reversed their positions, arguing that religion is resurging in some areas, including the United States. 
The data presented here speak directly to this debate. The CIRP survey has posed almost the same questions to a large sample of a consistently defined group for almost forty years, and the results show a clear and consistent trend away from both identification with religious sects and participation in religious services. These data make a strong case for secularization in the United States that has, if anything, accelerated in the last decade.



Data Source

Data from the 2011 CIRP Survey are reported in The American Freshman: National Norms for Fall 2011, by Pryor, J. H., DeAngelo, L., Palucki Blake, L., Hurtado, S., & Tran, S., January 2012.


This and all previous reports are available from the HERI publications page.

Friday, January 27, 2012

Think Complexity, Part Two

My new book, Think Complexity, will be published by O'Reilly Media in March. For people who can't stand to wait that long, I am publishing excerpts here.  If you really can't wait, you can read the free version at thinkcomplex.com.

In the previous installment I outlined the topics in Think Complexity and contrasted a classical physical model of planetary orbits with an example from complexity science: Schelling's model of racial segregation.  In this installment I outline some of the ways complexity differs from classical science.

Paradigm shift?  

When I describe this book to people, I am often asked if this new kind of science is a paradigm shift.  I don't think so, and here's why.  Thomas Kuhn introduced the term "paradigm shift" in The Structure of Scientific Revolutions in 1962.  It refers to a process in the history of science where the basic assumptions of a field change, or where one theory is replaced by another. He presents as examples the Copernican revolution, the displacement of phlogiston by the oxygen model of combustion, and the emergence of relativity.

The development of complexity science is not the replacement of an older model, but (in my opinion) a gradual shift in the criteria models are judged by, and in the kinds of models that are considered acceptable.  For example, classical models tend to be law-based, expressed in the form of equations, and solved by mathematical derivation.  Models that fall under the umbrella of complexity are often rule-based, expressed as computations, and simulated rather than analyzed.  Not everyone finds these models satisfactory.

For example, in Sync, Steven Strogatz writes about his model of spontaneous synchronization in some species of fireflies.  He presents a simulation that demonstrates the phenomenon, but then writes:
I repeated the simulation dozens of times, for other random initial conditions and for other numbers of oscillators.  Sync every time. [...] The challenge now was to prove it.  Only an ironclad proof would demonstrate, in a way that no computer ever could, that sync was inevitable; and the best kind of proof would clarify why it was inevitable.  
Strogatz is a mathematician, so his enthusiasm for proofs is understandable, but his proof doesn't address what is, to me, the most interesting part of the phenomenon.  In order to prove that "sync was inevitable," Strogatz makes several simplifying assumptions, in particular that each firefly can see all the others.

In my opinion, it is more interesting to explain how an entire valley of fireflies can synchronize despite the fact that they cannot all see each other.  Think Complexity discusses how this kind of global behavior emerges from local interactions. Explanations of these phenomena often use agent-based models, which explore (in ways that would be difficult or impossible with mathematical analysis) the conditions that allow or prevent synchronization.

I am a computer scientist, so my enthusiasm for computational models is probably no surprise.  I don't mean to say that Strogatz is wrong, but rather that people disagree about what questions to ask and what tools to use to answer them.  These decisions are based on value judgments, so there is no reason to expect agreement.  Nevertheless, there is rough consensus among scientists about which models are considered good science, and which others are fringe science, pseudoscience, or not science at all.

I claim, and this is a central thesis of the book, that the criteria this consensus is based on change over time, and that the emergence of complexity science reflects a gradual shift in these criteria.

The axes of scientific models

I have described classical models as based on physical laws, expressed in the form of equations, and solved by mathematical analysis; conversely, models of complex systems are often based on simple rules and implemented as computations.  We can think of this trend as a shift over time along two axes:

Equation-based → simulation-based
Analysis → computation

The new kind of science is different in several other ways:

Continuous → discrete: Classical models tend to be based on continuous mathematics, like calculus; models of complex systems are often based on discrete mathematics, including graphs and cellular automata.

Linear → non-linear: Classical models are often linear, or use linear approximations to non-linear systems; complexity science is more friendly to non-linear models.  One example is chaos theory, which explores the dynamics of systems of non-linear equations.

Deterministic → stochastic: Classical models are usually deterministic, which may reflect underlying philosophical determinism; complex models often feature randomness.

Abstract → detailed: In classical models, planets are point masses, planes are frictionless, and cows are spherical (see http://en.wikipedia.org/wiki/Spherical_cow).  Simplifications like these are often necessary for analysis, but computational models can be more realistic.

One, two → many: In celestial mechanics, the two-body problem can be solved analytically; the three-body problem cannot.  Where classical models are often limited to small numbers of interacting elements, complexity science works with larger complexes (which is where the name comes from).

Homogeneous → composite: In classical models, the elements tend to be interchangeable; complex models more often include heterogeneity.

These are generalizations, so we should not take them too seriously. And I don't mean to deprecate classical science.  A more complicated model is not necessarily better; in fact, it is usually worse.

Also, I don't mean to say that these changes are abrupt or complete. Rather, there is a gradual migration in the frontier of what is considered acceptable, respectable work.  Some tools that used to be regarded with suspicion are now common, and some models that were widely accepted are now subject to more scrutiny.  For example, when Appel and Haken proved the four-color theorem in 1976, they used a computer to enumerate 1,936 special cases that were, in some sense, lemmas of their proof.  At the time, many mathematicians did not consider the theorem truly proved.  Now computer-assisted proofs are common and generally (but not universally) accepted.

Conversely, a substantial body of economic analysis is based on a model of human behavior called "Economic man," or, with tongue in cheek, Homo economicus.  Research based on this model was highly regarded for several decades, especially if it involved mathematical virtuosity.  More recently, this model is treated with more skepticism, and models that include imperfect information and bounded rationality are hot topics.

At this point I have laid out a lot of ideas for one article, and explained them very briefly.  Think Complexity gets into these topics in more detail, but I will stop here for now.  Next time I will talk about related shifts in engineering and (a little farther afield) in ways of thinking.

Monday, January 23, 2012

Think Complexity


On Friday I turned in the manuscript for Think Complexity, which will be published by O'Reilly Media in March (or, at least, that's the schedule). For people who can't stand to wait that long, I am going to publish excerpts here.  And if you really can't wait, you can read the free version at thinkcomplex.com.

The book is about data structures and algorithms, intermediate programming in Python, computational modeling and the philosophy of science:

  • Data structures and algorithms: A data structure is a collection of data elements organized in a way that supports particular operations.  For example, a Python dictionary organizes key-value pairs in a way that provides fast mapping from keys to values, but mapping from values to keys is slower. An algorithm is a mechanical process for performing a computation. Designing efficient programs often involves the co-evolution of data structures and the algorithms that use them.  For example, in the first few chapters I present graphs, data structures that implement graphs, and graph algorithms based on those data structures.
  • Python programming: Think Complexity picks up where Think Python leaves off.  I assume that you have read that book or have equivalent knowledge of Python.  I try to emphasize fundamental ideas that apply to programming in many languages, but along the way I present some useful features that are specific to Python.
  • Computational modeling: A model is a simplified description of a system used for simulation or analysis. Computational models are designed to take advantage of cheap, fast computation.
  • Philosophy of science: The experiments and results in this book raise questions relevant to the philosophy of science, including the nature of scientific laws, theory choice, realism and instrumentalism, holism and reductionism, and epistemology.

The book is also about complexity science, which is an interdisciplinary field---at the intersection of mathematics, computer science and natural science---that focuses on discrete models of physical systems.  In particular, it focuses on complex systems, which are systems with many interacting components.

Complex systems include networks and graphs, cellular automata, agent-based models and swarms, fractals and self-organizing systems, chaotic systems and cybernetic systems.  Here is a framework for these topics (from http://en.wikipedia.org/wiki/Complex_systems).



A new kind of science?

In 2002 Stephen Wolfram published A New Kind of Science, in which he presents his and others' work on cellular automata and describes a scientific approach to the study of computational systems.  There's a chapter about cellular automata in the book, but for now I'm going to borrow his title for something a little broader.

I think complexity is a "new kind of science" not because it applies the tools of science to a new subject, but because it uses different tools, allows different kinds of work, and ultimately changes what we mean by "science."

To demonstrate the difference, I'll start with an example of classical science: suppose someone asked you why planetary orbits are elliptical.  You might invoke Newton's law of universal gravitation and use it to write a differential equation that describes planetary motion.  Then you could solve the differential equation and show that the solution is an ellipse.  Voila!

Most people find this kind of explanation satisfying.  It includes a mathematical derivation---so it has some of the rigor of a proof---and it explains a specific observation, elliptical orbits, by appealing to a general principle, gravitation.

Let me contrast that with a different kind of explanation.  Suppose you move to a city like Detroit that is racially segregated, and you want to know why it's like that.  If you do some research, you might find a paper by Thomas Schelling called "Dynamic Models of Segregation," which proposes a simple model of racial segregation (you can download the paper here).

Here is an outline of the paper's primary experiment (a rough code sketch follows the list):

  • The Schelling model of the city is an array of cells where each cell represents a house.  The houses are occupied by two kinds of agents, labeled red and blue, in roughly equal numbers.  About 10% of the houses are empty.   
  • At any point in time, an agent might be happy or unhappy, depending on the other agents in the neighborhood.  In one version of the model, agents are happy if they have at least two neighbors like themselves, and unhappy if they have one or zero.
  • The simulation proceeds by choosing an agent at random and checking to see whether it is happy.  If so, nothing happens; if not, the agent chooses one of the unoccupied cells at random and moves.
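Here is a minimal sketch of that experiment in Python. It is my own simplified version, not Schelling's code or the implementation in the book; the grid wraps around at the edges (my simplification), and the sizes and the "at least two like neighbors" rule follow the outline above.

import random

SIZE = 50                       # the city is a SIZE x SIZE grid of houses
EMPTY, RED, BLUE = 0, 1, 2

def make_city():
    """Populate the grid: roughly 45% red, 45% blue, 10% empty."""
    cells = [RED] * 1125 + [BLUE] * 1125 + [EMPTY] * 250
    random.shuffle(cells)
    return [cells[i*SIZE:(i+1)*SIZE] for i in range(SIZE)]

def neighbors(city, i, j):
    """Yield the occupants of the (up to 8) adjacent cells, wrapping at the edges."""
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if (di, dj) != (0, 0):
                yield city[(i+di) % SIZE][(j+dj) % SIZE]

def is_happy(city, i, j):
    """An agent is happy if at least two neighbors are like itself."""
    me = city[i][j]
    return sum(1 for n in neighbors(city, i, j) if n == me) >= 2

def step(city, empties):
    """Pick a random cell; if it holds an unhappy agent, move it to a random empty cell."""
    i, j = random.randrange(SIZE), random.randrange(SIZE)
    if city[i][j] == EMPTY or is_happy(city, i, j):
        return
    k = random.randrange(len(empties))
    ei, ej = empties[k]
    city[ei][ej] = city[i][j]
    city[i][j] = EMPTY
    empties[k] = (i, j)

city = make_city()
empties = [(i, j) for i in range(SIZE)
                  for j in range(SIZE) if city[i][j] == EMPTY]
for _ in range(200000):
    step(city, empties)

Rendering the grid before and after the loop shows the clustering described in the next paragraph.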

If you start with a simulated city that is entirely unsegregated and run the model for a short time, clusters of similar agents appear.  As time passes, the clusters grow and coalesce until there are a small number of large clusters and most agents live in homogeneous neighborhoods.

The degree of segregation in the model is surprising, and it suggests an explanation of segregation in real cities.  Maybe Detroit is segregated because people prefer not to be greatly outnumbered and will move if the composition of their neighborhoods makes them unhappy.    

Is this explanation satisfying in the same way as the explanation of planetary motion?  Most people would say not, but why?

Most obviously, the Schelling model is highly abstract, which is to say not realistic.  It is tempting to say that people are more complex than planets, but when you think about it, planets are just as complex as people (especially the ones that have people).

Both systems are complex, and both models are based on simplifications; for example, in the model of planetary motion we include forces between the planet and its sun, and ignore interactions between planets.

The important difference is that, for planetary motion, we can defend the model by showing that the forces we ignore are smaller than the ones we include.  And we can extend the model to include other interactions and show that the effect is small.  For Schelling's model it is harder to justify the simplifications.

To make matters worse, Schelling's model doesn't appeal to any physical laws, and it uses only simple computation, not mathematical derivation.  Models like Schelling's don't look like classical science, and many people find them less compelling, at least at first. But as I will try to demonstrate, these models do useful work, including prediction, explanation, and design.  One of the goals of Think Complexity is to explain how.

In the next excerpt, I will present some of the ways I think complexity science differs from classical science.  But in the meantime I welcome comments on Part One.

Friday, January 20, 2012

Some of my best friends are crackpots

I have a soft spot for crank science.  Recently I visited Norumbega Tower, which is an enduring monument to the crackpot theories of Eben Norton Horsford, inventor of double-acting baking powder and faux history.  But that's not what this article is about.

This article is about the Variability Hypothesis, which (quoth Wikipedia):
"originated in the early nineteenth century with Johann Meckel, who argued that males have a greater range of ability than females, especially in intelligence. In other words, he believed that most geniuses and most mentally retarded people are men. Because he considered males to be the 'superior animal,' Meckel concluded that females' lack of variation was a sign of inferiority."
I particularly like the last part, because I suspect that if it turns out that women are actually more variable, Meckel would take that as a sign of inferiority, too.  Anyway, you will not be surprised to hear that evidence for the Variability Hypothesis is mixed at best.

Nevertheless, it came up in my class recently when we looked at data from the CDC's Behavioral Risk Factor Surveillance System (BRFSS), specifically the self-reported height of adult American men and women.  The dataset includes responses from 154407 men and 254722 women.  Here's what we found:

  • The average height for men is 178 cm; the average height for women is 163 cm.  So men are taller, on average.  No surprises so far.
  • For men the standard deviation is 7.7 cm; for women it is 7.3 cm.  So in absolute terms, men's heights are more variable.
  • But to compare variability between groups, it is more meaningful to use the coefficient of variation (CV), which is the standard deviation divided by the mean.  It is a dimensionless measure of variability relative to scale.  For men CV is 0.0433; for women it is 0.0444.

That's very close, so we could conclude that this dataset provides no support for the Variability Hypothesis.  But in keeping with the theme of this blog, I won't be happy until I have overthought the question.  And along the way I want to use this problem to demonstrate Bayesian estimation in 2 dimensions.

 Bayesian estimation in 2-D

In previous articles I have presented Bayesian estimation for the number of species in a biological sample and the age of a renal tumor.  But those are both 1-D problems.  Estimating the mean and variance of a sample is a 2-D problem.  There are analytic solutions for several forms of the prior distribution, but as usual I will demonstrate a computational approach.

I'll present the code for the update first, and get back to the prior.  Here's what the update looks like:


import math
import numpy
import Pmf, thinkstats   # support modules that come with Think Stats (thinkstats.com)

def LogUpdate(suite, evidence):
    for hypo in suite.Values():
        likelihood = LogLikelihood(evidence, hypo)
        suite.Incr(hypo, likelihood)

suite is a Pmf object that maps from each hypothesis to its probability.  While we're performing the update, the probabilities are unnormalized, which is why they are called likelihoods.  And the whole thing happens under a log transform, because otherwise the likelihoods get so small they get rounded off to zero (see underflow).


Here's the function that computes likelihoods:


def LogLikelihood(evidence, hypo):
    t = evidence
    mu, sigma = hypo


    total = Summation(t, mu)
    return -len(t) * math.log(sigma) - total / 2 / sigma**2

evidence is the tuple of heights in centimeters; hypo is a tuple of hypothetical values for mu and sigma.  Summation computes the sum of the squared deviations from mu; the return value is the log likelihood of getting this sample from a Gaussian distribution with the hypothetical parameters.

I pulled Summation out into a separate function because it gets called repeatedly with the same parameters, so I can improve the run time (a lot) by caching previous results:


def Summation(t, mu, cache={}):
    try:
        return cache[t, mu]
    except KeyError:
        total = sum((x-mu)**2 for x in t)
        cache[t, mu] = total
        return total

That's it for the update.  Now we need a prior.  One option is a uniform prior where mu can be any value and sigma can be any positive value.  In practice, there is no point in evaluating values far from the true parameters, because their likelihoods will be vanishingly small.  So we'll use the data to choose the range for the prior, then use the data to update the prior.

That might sound completely bogus, because we are using the same data twice.  But choosing the range for the prior is just a computational convenience; it has no effect on the result, as long as the range is wide enough that the likelihood for values outside the range is effectively zero.

To choose the prior range for mu and sigma, I use the conventional estimators and their standard errors.  Here's the code:


def MakeUniformPrior(t, num_points, label, spread=3.0):
    
    # estimate mean and stddev of t
    n = len(t)
    xbar, S2 = thinkstats.MeanVar(t)
    sighat = math.sqrt(S2)


    # compute standard error for mu and the range of ms
    stderr_xbar = sighat / math.sqrt(n)
    mspread = spread * stderr_xbar
    ms = numpy.linspace(xbar-mspread, xbar+mspread, num_points)


    # compute standard error for sigma and the range of ss
    stderr_sighat = sighat / math.sqrt(2 * (n-1))
    sspread = spread * stderr_sighat
    ss = numpy.linspace(sighat-sspread, sighat+sspread, num_points)


    # populate the PMF
    pmf = Pmf.Pmf(name=label)
    for m in ms:
        for s in ss:
            pmf.Set((m, s), 1)
    return ms, ss, pmf

Finally, here's the code that makes the prior, updates it, and normalizes the posterior:

def EstimateParameters(t, label, num_points=31):
    xs, ys, suite = MakeUniformPrior(t, num_points, label)

    suite.Log()
    LogUpdate(suite, tuple(t))
    suite.Exp()
    suite.Normalize()

    return xs, ys, suite

Ok, that was more code than I usually put in these articles.  You can download it all from thinkcomplex.com/bayes_height.py.
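For concreteness, here is a hypothetical usage sketch; male_heights is my own stand-in for the BRFSS heights (in cm) for men, and it assumes the Pmf class provides Items(), as in the Think Stats support code:

# male_heights is assumed to be a sequence of self-reported heights in cm
xs, ys, suite = EstimateParameters(male_heights, 'male')

# the (mu, sigma) pair with the highest posterior probability
(mu_map, sigma_map), _ = max(suite.Items(), key=lambda item: item[1])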

Results

The result is a map from pairs of mu and sigma to probabilities; one simple way to display the map is a contour plot.  Here is the result for men:

And for women:

As expected, the central location is near the estimated parameters.  But one advantage of having the result in numerical form is that it is easy to compute the marginal distributions for mu and sigma, or their confidence intervals.  We can also compute posterior distributions for CV, and compare the distributions of CV for men and women:

Since there is no overlap between the distributions, we can conclude that the small observed difference is statistically significant; or, if we want to be more Bayesian about it, we can compute the probability that the CV for women is higher, which is pretty close to 1.
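Here is a sketch of that last step (my addition, not code from bayes_height.py): because each hypothesis is a (mu, sigma) pair, the posterior for CV falls directly out of the joint posterior, assuming the Pmf class can be constructed empty and provides Items and Incr as in the Think Stats support code.

import Pmf

def PosteriorCV(suite):
    """Posterior distribution of the coefficient of variation, sigma/mu."""
    cv = Pmf.Pmf()
    for (mu, sigma), prob in suite.Items():
        cv.Incr(sigma / mu, prob)
    cv.Normalize()
    return cv

Comparing PosteriorCV for the male and female suites should reproduce, up to the discrete approximation, the comparison in the figure.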

Finally we conclude:
  • The standard deviation of height for men is higher.
  • But the coefficient of variation is slightly lower.
  • Even though the difference is tiny, it is very unlikely to be due to chance.
  • And therefore women are inferior.
Hey, I told you I have a soft spot for crank science.

Thursday, January 5, 2012

Frank is a scoundrel, probably

My friend Ted Bunn wrote an article, "Who knows what evil lurks in the hearts of men? The Bayesian doesn’t care," inspired in part by one of my posts, "Repeated tests: how bad can it be?"

He presents this scenario:

Frank and Betsy are wondering whether a particular coin is a fair coin (i.e., comes up heads and tails equally often when flipped).  Frank, being a go-getter type, offers to do some tests to find out. He takes the coin away, flips it a bunch of times, and eventually comes back to Betsy to report his results. 
“I flipped the coin 3022 times,” he says, “and it came up heads 1583 times. That’s 72 more heads than you’d expect with a fair coin. I worked out the p-value — that is, the probability of this large an excess occurring if the coin is fair — and it’s under 1%. So we can conclude that the coin is unfair at a significance level of  1% (or ’99% confidence’ as physicists often say).”
And suggests that there are two ways Betsy can interpret this report (again, quoting Ted):


  1. Frank is an honest man, who has followed completely orthodox (frequentist) statistical procedure. To be specific, he decided on the exact protocol for his test (including, for some reason, the decision to do 3022 trials) in advance.
  2. Frank is a scoundrel who, for some reason, wants to reach the conclusion that the coin is unfair. He comes up with a nefarious plan: he keeps flipping the coin for as long as it takes to reach that 1% significance threshold, and then he stops and reports his results.

Finally, Ted asks, "What should Betsy conclude on the basis of the information Frank has given her?"  If you want to know Ted's answer, you have to read his article (and you should -- it is very interesting).

But I'm going to give you my answer: Frank is a scoundrel.  Well, probably.

I'll follow Ted by making one more assumption: "Suppose that Betsy’s initial belief is that 95% of coins are fair — that is, the probability P that they come up heads is exactly 0.5."  Now we can evaluate the evidence that Frank is a scoundrel.

If Frank is a scoundrel, the probability that he reports a positive result is 1, provided that he is willing to keep flipping long enough, or 1-x, for small x, if we put a bound on the number of flips he is willing to do.  So

P(positive test | Frank is a scoundrel) = 1-x

If Frank is honest, then the probability of a positive result is

P(fair coin) * P(false positive | fair coin) + P(biased coin) * P(true positive | biased coin)

Betsy believes that P(fair coin) is 95% and P(biased coin) is 5%.  Since Frank's significance level is 1%, P(false positive | fair coin) is 1%.

The probability of a true positive is the power of the test, which depends on how biased the coin actually is.  But I will make the approximation that 3022 flips is enough to detect any substantial bias, so I'll take P(true positive | biased coin) = 1-y, for small y.  So,

P(positive test | Frank is honest) = (0.95)(0.01) + (0.05)(1-y)

As x and y get small, the likelihood ratio is (1 / 0.0595), which is about 17.  So that is fairly strong evidence that Frank is a scoundrel.
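As a quick check of that arithmetic (taking x and y to be negligible):

p_fair, p_biased = 0.95, 0.05

p_pos_given_scoundrel = 1.0                              # he flips until he "succeeds"
p_pos_given_honest = p_fair * 0.01 + p_biased * 1.0      # 0.0595

print(p_pos_given_scoundrel / p_pos_given_honest)        # about 16.8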

I don't know about Betsy's prior beliefs about Frank, so you will have to fill in your own punchline about her posterior.

Wednesday, November 30, 2011

Are first babies more likely to be light?

Way back in February I analyzed data from the National Survey of Family Growth (NSFG) and answered the question "Are first babies more likely to be late?"  Now I am getting back to that data to look at the related question, "Are first babies more likely to be light?"

In Think Stats (Chapter 7), I showed that the mean birth weight for first babies is lower than the mean for others by about 2 ounces, and that difference is statistically significant.  But there is also a relationship between birth weight and mother's age, and the mothers of first babies tend to be younger.  So the question is: if we control for the age of the mother, are first babies still lighter?

Several students in my class are working on projects involving multiple regression, so I want to use this question as an example.  I will follow the steps I recommend to my students:

1) Before looking at relationships between variables, look at each variable in isolation.  In particular, characterize the distributions and identify issues like outliers or long tails.

2) Look at the variables pairwise.  For each pair, look at CDFs and scatterplots, and compute correlations and/or least squares fits.

3) If there seem to be relationships among the variables, look for ways to separate the effects, either by breaking the data into subsets or doing multiple regression.

So let's proceed.

One variable at a time

The variables I'll use are
  1. Mother's age in years, 
  2. Birth weight in ounces, and 
  3. first, which is a dummy variable, 1 for first babies and 0 for others.
For live births we have 9148 records with valid ages; here is the distribution:
It looks like there is a little skew to the right, but other than that, nothing to worry about.

We have 9038 records with valid weights, with this distribution:
The middle of the distribution is approximately normal, with some jaggies due to round-off.  In the tails there are some values that are certainly errors, but it is hard to draw a clear line between exceptional cases and bad data.  It might be a good idea to exclude some extreme values, but for the analysis below I did not.

There are only two values for first, so it's not much of a distribution, but with categorical data, we should check that we have enough values in each bin.  As it turns out, there are 4413 first babies and 4735 others, so that's just fine.

One pair at a time

To characterize the relationship between weight and first, we compare the CDF of weights for first babies and others:
There is some space between the distributions, and the difference in means is 2 ounces.  It's not obvious whether that difference is statistically significant, but it is, with p < 0.001.

Similarly we can compare the CDF of mother's age for first babies and others:
Here there is clearly space between the distributions.  The difference in means is 3.6 years, which is significant (no surprise this time).

It is not as easy to see the relationship between age and birthweight.  I often tell my students to start with a scatterplot:
But in this case it's not much help.  If there is a relationship there, it is hard to see.  An alternative is to break the age range into bins and compute the mean in each bin.  Here's what that looks like with 2-year bins:
We can't take the first and last points too seriously; there are not many cases in those bins.  And ideally I should represent the variability in each bin so we have a sense of whether the apparent differences are real.  Based on this figure, it looks like there is a relationship, but it might be nonlinear.
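Binned means are easy to compute with NumPy; here is a rough sketch, where ages and weights are assumed to be arrays of equal length (the names are mine, not from the NSFG code):

import numpy as np

bin_edges = np.arange(10, 46, 2)                   # 2-year bins over the age range
bin_index = np.digitize(ages, bin_edges)
bin_means = [weights[bin_index == i].mean()
             for i in range(1, len(bin_edges))
             if np.any(bin_index == i)]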

Pearson's coefficient of correlation between age and birthweight is 0.07, which is small but statistically significant.  Spearman's coefficient of correlation is 0.10; the difference between the two coefficients is another warning that the relationship is nonlinear.

If we ignore the non-linearity for now, we can compute a least squares fit for birthweight as a function of age.  The slope is 0.28, which means that we expect an additional 0.28 ounces per year of age.  Since first mothers are 3.6 years younger than others, we expect their babies to be 1.0 ounces lighter.  In fact, they are 2.0 ounces lighter, so the linear model of weight vs age accounts for 50% of the observed difference.

Multiple regression

So far I have been using the thinkstats libraries to compute correlation coefficients and least squares fits.  But for multiple regression I am going to break out rpy, which is the Python interface to R, a "language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories."  To see how it works, you can download the code I used in this section: age_lm.py
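If you would rather stay in Python end to end, a rough equivalent of the fits below can be done with pandas and statsmodels; this is my substitution, not the code in age_lm.py, and it assumes weights, ages, and first are arrays of equal length:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame(dict(weights=weights, ages=ages, first=first))
model = smf.ols('weights ~ ages', data=df).fit()
print(model.summary())    # add '+ first' or '+ I(ages**2)' for the later models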


The first model I ran is the same linear model we were just looking at:


weight = intercept + slope * age


In R's shorthand, that's:



weights ~ ages


Here are the results:

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 109.28635    1.08775 100.470  < 2e-16 ***
ages          0.27926    0.04258   6.559 5.72e-11 ***


Multiple R-squared: 0.004738, Adjusted R-squared: 0.004628 


The intercept is 109 ounces; the slope is 0.28 ounces per year, which is what we saw before---comparing my implementation to R makes me more confident that R is correct :)

The standard error provides a confidence interval for the estimates; plus or minus 2 standard errors is (roughly) a 95% confidence interval.

The last column is the p-value, which is very small, indicating that the slope and intercept are significantly different from 0.  The t-value is the test statistic used to compute the p-value, but I don't know why it gets reported; it doesn't mean much.

Since the p-values are so small, it might be surprising that R² is so low, only 0.0047.  But there is no contradiction.  We can say with high confidence that there is a relationship between these variables; nevertheless, birth weight is highly variable, and even if you know the age of the mother, that does not reduce the variability by much.

To understand adjusted R², consider this: if you add more explanatory variables to a model, R² usually goes up even if there is no real relationship.  The adjusted R² takes this into account, which makes it more meaningful to compare models with a different number of variables.

Now let's add first as an explanatory variable, so the model looks like this:

weights ~ first + ages

Here are the results:

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 110.62791    1.24198  89.073  < 2e-16 ***
first        -1.11688    0.49940  -2.236   0.0253 *  
ages          0.24708    0.04493   5.499 3.93e-08 ***


Multiple R-squared: 0.005289, Adjusted R-squared: 0.005069 

The coefficient for first is -1.1, which means that we expect first babies to be 1.1 ounces lighter, controlling for age.  The p-value for this estimate is 2.5%, which I consider borderline significant.

The coefficient for ages is about the same as before, and significant.  And R² is a little higher, but still small.

But remember that the relationship between weight and age is non-linear.  We can explore that by introducing a new variable:

ages2 = ages^2

Now if we run this model:

weights ~ ages + ages2

We are effectively fitting a parabola to the weight vs age curve.

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 89.151237   4.407677  20.226  < 2e-16 ***
ages         1.898151   0.346071   5.485 4.25e-08 ***
ages2       -0.031002   0.006577  -4.714 2.47e-06 ***


Multiple R-squared: 0.00718, Adjusted R-squared: 0.00696 

Estimates for both variables are significant.  The coefficient for ages2 is negative, which means that the parabola has downward curvature, as expected.  And R² is a little bigger.

Finally, we can bring it all together:

weights ~ first + ages + ages2

With this model we get:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 91.07627    4.56813  19.937  < 2e-16 ***
first       -0.80708    0.50373  -1.602    0.109    
ages         1.79807    0.35163   5.113 3.23e-07 ***
ages2       -0.02953    0.00664  -4.447 8.80e-06 ***

Multiple R-squared: 0.007462, Adjusted R-squared: 0.007132 

When we include the parabolic model of weight and age, the coefficient for first gets smaller, and the p-value is 11%.

I conclude that the difference in weight for first babies is explained by the difference in mothers' ages.  When we control for age, the difference between first babies and others is no longer statistically significant.  It is still possible that there is a small difference in weight for first babies, but this dataset provides little evidence for it.

Monday, November 28, 2011

Estimating the age of renal tumors

UPDATE April 2, 2012: I wrote a paper describing this work and submitted it to arXiv. You can download it here.

Abstract: We present a Bayesian method for estimating the age of a renal tumor given its size. We use a model of tumor growth based on published data from observations of untreated tumors. We find, for example, that the median age of a 5 cm tumor is 20 years, with interquartile range 16-23 and 90% confidence interval 11-30 years.

-----

A few weeks ago I read this post on reddit.com/r/statistics:
"I have Stage IV Kidney Cancer and am trying to determine if the cancer formed before I retired from the military. ... Given the dates of retirement and detection is it possible to determine when there was a 50/50 chance that I developed the disease? Is it possible to determine the probability on the retirement date?  My tumor was 15.5 cm x 15 cm at detection. Grade II."
I contacted the original poster and got more information; I learned that veterans get different benefits if it is "more likely than not" that a tumor formed while they were in military service (among other considerations).

Because renal tumors grow slowly, and often do not cause symptoms, they are often left untreated.  As a result, we can observe the rate of growth for untreated tumors by comparing scans from the same patient at different times.  Several papers have reported these growth rates.

I collected data from a paper by Zhang et al.  I contacted the authors to see if I could get raw data, but they refused on grounds of medical privacy.  Nevertheless, I was able to extract the data I needed by printing one of their graphs and measuring it with a ruler.  It's silly, but it works.

They report growth rates in reciprocal doubling time (RDT), which is in units of doublings per year.  So a tumor with RDT=1 doubles in volume each year; with RDT=2 it quadruples in the same time, and with RDT=-1, it halves.  The following figure shows the distribution of RDT for 53 patients:

The squares are the data points from the paper; the line is a model I fit to the data.  The positive tail fits an exponential distribution well, so I used a mixture of two exponentials.

As a simple model of tumor growth, I chose the median value of RDT, which is 0.45, and used that to estimate the age of a tumor with maximum dimension 15.5 cm.  Here's what I wrote in my letter to the Veterans Benefits Administration:
  1. In the largest study I reviewed (53 patients) the median volume doubling time is 811 days. By definition of median, 50% of observed tumors grew faster and 50% slower.  By geometry, the doubling time for the maximum linear dimension is approximately (811)(3) = 2433 days or 6.7 years.  Therefore, for a tumor with maximum linear dimension 15.5 cm on [diagnosis date], it is as likely as not that the size on [discharge date] was 6 cm.
  2. If the diameter of the tumor on [discharge date] were 1 mm and it grew to 15.5 cm by [diagnosis date], the effective volume doubling time would be 150 days.  Fewer than half of the tumors in the studies I reviewed grew at this rate or faster, so it is more likely than not that the tumor grew more slowly.
Based on this analysis, I conclude that it is more likely than not that this tumor formed prior to [discharge date].
I think this model is sufficient to answer the question as posed, but it occurred to me later (in the shower, where all good ideas come from) that we can do better.  By sampling from the distribution of growth rates and generating simulated tumor histories, we can estimate the distribution of size as a function of time and then, using Bayes's Theorem, get the distribution of age as a function of size.

Here's how.  The simulation starts with a small tumor (0.3 cm) and runs these steps (a rough code sketch follows the list):
  1. Choose a growth rate from the distribution of RDT.
  2. Compute the size of the tumor at the end of an 8 month interval (that's the median interval between  measurements in the data source).
  3. Repeat until the tumor is 20 cm in diameter.
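Here is a minimal sketch of that loop. It is a simplification of what I actually ran: it resamples the observed RDTs directly (rather than sampling from the fitted mixture of exponentials), and the names are my own.

import random

INTERVAL = 8 / 12.0     # 8 months, in years, between simulated measurements

def SimulateTrajectory(rdts, d0=0.3, dmax=20.0):
    """Simulates one growth trajectory.

    rdts: sequence of observed RDT values (volume doublings per year)
    returns: list of (age in years, diameter in cm) pairs
    """
    age, diameter = 0.0, d0
    history = [(age, diameter)]
    while diameter < dmax:
        rdt = random.choice(rdts)
        volume_factor = 2 ** (rdt * INTERVAL)       # volume growth over the interval
        diameter *= volume_factor ** (1.0 / 3.0)    # diameter scales with the cube root
        age += INTERVAL
        history.append((age, diameter))
    return history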
This figure shows 100 simulated growth trajectories:
The line at 10 cm shows the range of ages for tumors at that size: the fastest-growing tumor gets there in 8 years; the slowest takes more than 35.

By drawing the line at different sizes, we can estimate the distribution of age as a function of size.  There's an implicit use of Bayes's Theorem in there, but because I did everything discretely, I didn't have to think too hard.  This figure shows the distribution of age for a few different sizes:
Not surprisingly, bigger tumors are likely to be older.  For any size, we can generate the CDF and compute the median, interquartile range, and 90% confidence interval.  Here's what that looks like (with size on a log scale):
The points are data from simulation, which produces some variability due to discrete approximation.  The lines are fitted to the data.
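Collecting those conditional distributions from the simulated trajectories amounts to recording when each trajectory first crosses a given size; here is a sketch, building on the hypothetical SimulateTrajectory above:

def AgesAtSize(trajectories, size):
    """For each trajectory, the age at which the tumor first reaches the given diameter."""
    ages = []
    for history in trajectories:
        for age, diameter in history:
            if diameter >= size:
                ages.append(age)
                break
    return ages

# for example, the ages of simulated tumors when they reach 10 cm:
# trajectories = [SimulateTrajectory(rdts) for _ in range(1000)]
# ages_10cm = AgesAtSize(trajectories, 10.0)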

With these results, doctors can look up the size of a tumor and get the distribution of ages; for example, the median age of a 15 cm tumor is 27 years, with interquartile range 22-31 and 90% confidence interval 16-39 years.

This model yields more detail than the simple model I started with, but the results are qualitatively similar; a tumor this size is more likely than not to have formed prior to the original poster's date of discharge.  It looks like there is also a good chance that it formed prior to enlistment, but I don't know what the VBA makes of that.

-----

I think this model makes the best use of the available data, but there are several limitations:

1) The factors that limit tumor growth are different for very small tumors, so the observed data doesn't apply.  We can extrapolate back to when the tumor was small (I chose 0.3 cm, a bit smaller than the smallest tumor in the study).  That gives us a lower bound on the age of the tumor, but we can't say much about when the first cancer cell appeared.

2) The distribution of growth rates is based on a sample of 53 patients.  A different sample would yield a different distribution.  I could use resampling to characterize this source of error, but haven't.

3) The growth model does not take into account tumor subtype or grade, which is consistent with the conclusion of Zhang et al: “Growth rates in renal tumors of different sizes, subtypes and grades represent a wide range and overlap substantially.”  

4) In our model of tumor growth, the growth rate during each interval is independent of previous growth rates.  It is plausible that, in reality, tumors that have grown quickly in the past are more likely to grow quickly.

If this correlation exists, it affects the location and spread of the results.  For example, running simulations with ρ = 0.4 increases the estimated median age by about a year, and the interquartile range by about 3 years.  However, if there were a strong serial correlation in growth rate, there would also be a correlation between tumor volume and growth rate, and prior work has shown no such relationship.

There could still be a weak serial correlation, but since there is currently no evidence for it, I ran these simulations with ρ = 0.