Friday, January 20, 2012

Some of my best friends are crackpots

I have a soft spot for crank science.  Recently I visited Norumbega Tower, which is an enduring monument to the crackpot theories of Eben Norton Horsford, inventor of double-acting baking powder and faux history.  But that's not what this article is about.

This article is about the Variability Hypothesis, which (quoth Wikipedia):
"originated in the early nineteenth century with Johann Meckel, who argued that males have a greater range of ability than females, especially in intelligence. In other words, he believed that most geniuses and most mentally retarded people are men. Because he considered males to be the 'superior animal,' Meckel concluded that females' lack of variation was a sign of inferiority."
I particularly like the last part, because I suspect that if it turns out that women are actually more variable, Meckel would take that as a sign of inferiority, too.  Anyway, you will not be surprised to hear that evidence for the Variability Hypothesis is mixed at best.

Nevertheless, it came up in my class recently when we looked at data from the CDC's Behavioral Risk Factor Surveillance System (BRFSS), specifically the self-reported height of adult American men and women.  The dataset includes responses from 154407 men and 254722 women.  Here's what we found:

  • The average height for men is 178 cm; the average height for women is 163 cm.  So men are taller, on average.  No surprises so far.
  • For men the standard deviation is 7.7 cm; for women it is 7.3 cm.  So in absolute terms, men's heights are more variable.
  • But to compare variability between groups, it is more meaningful to use the coefficient of variation (CV), which is the standard deviation divided by the mean.  It is a dimensionless measure of variability relative to scale.  For men CV is 0.0433; for women it is 0.0444 (see the sketch after this list).
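
Here's a minimal sketch of that calculation; the function name and the heights argument are mine, not part of the BRFSS code:

import math

def CoefficientOfVariation(heights):
    # mean, standard deviation (population formula) and CV of heights in cm
    n = len(heights)
    mean = sum(heights) / float(n)
    var = sum((x - mean)**2 for x in heights) / n
    std = math.sqrt(var)
    return mean, std, std / mean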

That's very close, so we could conclude that this dataset provides no support for the Variability Hypothesis.  But in keeping with the theme of this blog, I won't be happy until I have overthought the question.  And along the way I want to use this problem to demonstrate Bayesian estimation in 2 dimensions.

Bayesian estimation in 2-D

In previous articles I have presented Bayesian estimation for the number of species in a biological sample and the age of a renal tumor.  But those are both 1-D problems.  Estimating the mean and variance of a sample is a 2-D problem.  There are analytic solutions for several forms of the prior distribution, but as usual I will demonstrate a computational approach.

I'll present the code for the update first, and get back to the prior.  Here's what the update looks like:


def LogUpdate(suite, evidence):
    for hypo in suite.Values():
        likelihood = LogLikelihood(evidence, hypo)
        suite.Incr(hypo, likelihood)

suite is a Pmf object that maps from each hypothesis to its probability.  While we're performing the update, the probabilities are unnormalized, which is why they are called likelihoods.  And the whole thing happens under a log transform, because otherwise the likelihoods get so small they get rounded off to zero (see underflow).
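
To see why the log transform matters, here's a toy illustration (mine, not from the original code): even with a modest number of data points, the product of the individual likelihoods underflows to zero, while the sum of their logs is an ordinary floating-point number.

import math

likes = [0.05] * 1000        # 1000 data points, each with likelihood 0.05

product = 1.0
for like in likes:
    product *= like
print(product)               # 0.0 -- the product underflows

log_total = sum(math.log(like) for like in likes)
print(log_total)             # about -2995.7 -- no underflow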


Here's the function that computes likelihoods:


def LogLikelihood(evidence, hypo):
    t = evidence
    mu, sigma = hypo


    total = Summation(t, mu)
    return -len(t) * math.log(sigma) - total / 2 / sigma**2

evidence is the tuple of heights in centimeters; hypo is a tuple of hypothetical values for mu and sigma.  Summation computes the sum of the squared deviations from mu; the return value is the log likelihood of getting this sample from a Gaussian distribution with the hypothetical parameters.
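
(For reference: dropping terms that don't depend on the hypothesis, the Gaussian log likelihood of a sample of size n is -n log(sigma) - sum((x - mu)^2) / (2 sigma^2), which is exactly what the last line computes.  The omitted -n/2 log(2 pi) term is the same for every hypothesis, so it disappears when the posterior is normalized.)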

I pulled Summation out into a separate function because it gets called repeatedly with the same parameters, so I can improve the run time (a lot) by caching previous results:


def Summation(t, mu, cache={}):
    try:
        return cache[t, mu]
    except KeyError:
        total = sum((x-mu)**2 for x in t)
        cache[t, mu] = total
        return total
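
(The cache works because t is a tuple, so it can be used as a dictionary key; presumably that's why EstimateParameters, below, passes tuple(t) to LogUpdate rather than a list.)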

That's it for the update.  Now we need a prior.  One option is a uniform prior where mu can be any value and sigma can be any positive value.  In practice, there is no point in evaluating values far from the true parameters, because their likelihoods will be vanishingly small.  So we'll use the data to choose the range for the prior, then use the data to update the prior.

That might sound completely bogus, because we are using the same data twice.  But choosing the range for the prior is just a computational convenience; it has no effect on the result, as long as the range is wide enough that the likelihood for values outside the range is effectively zero.

To choose the prior range for mu and sigma, I use the conventional estimators and their standard errors.  Here's the code:


def MakeUniformPrior(t, num_points, label, spread=3.0):
    
    # estimate mean and stddev of t
    n = len(t)
    xbar, S2 = thinkstats.MeanVar(t)
    sighat = math.sqrt(S2)


    # compute standard error for mu and the range of ms
    stderr_xbar = sighat / math.sqrt(n)
    mspread = spread * stderr_xbar
    ms = numpy.linspace(xbar-mspread, xbar+mspread, num_points)


    # compute standard error for sigma and the range of ss
    stderr_sighat = sighat / math.sqrt(2 * (n-1))
    sspread = spread * stderr_sighat
    ss = numpy.linspace(sighat-sspread, sighat+sspread, num_points)


    # populate the PMF
    pmf = Pmf.Pmf(name=label)
    for m in ms:
        for s in ss:
            pmf.Set((m, s), 1)
    return ms, ss, pmf

Finally, here's the code that makes the prior, updates it, and normalizes the posterior:

def EstimateParameters(t, label, num_points=31):
    xs, ys, suite = MakeUniformPrior(t, num_points, label)

    suite.Log()
    LogUpdate(suite, tuple(t))
    suite.Exp()
    suite.Normalize()

    return xs, ys, suite

Ok, that was more code than I usually put in these articles.  You can download it all from thinkcomplex.com/bayes_height.py.

Results

The result is a map from pairs of mu and sigma to probabilities; one simple way to display the map is a contour plot.  Here is the result for men:

And for women:
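
If you want to reproduce these plots, here is a rough sketch of one way to do it with matplotlib; it assumes that the suite returned by EstimateParameters provides a Prob method, which might differ from the actual Pmf interface.

import numpy
import matplotlib.pyplot as plt

def PlotPosterior(ms, ss, suite):
    # 2-D grid of posterior probabilities, rows indexed by sigma
    zs = numpy.array([[suite.Prob((m, s)) for m in ms] for s in ss])

    plt.contour(ms, ss, zs)
    plt.xlabel('mu (cm)')
    plt.ylabel('sigma (cm)')
    plt.show()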

As expected, the central location is near the estimated parameters.  But the advantage of having this result in numerical form is that it is easy to compute the marginal distributions for mu and sigma, or their confidence intervals.  We can also compute posterior distributions for CV, and compare the distributions of CV for men and women:

Since there is no overlap between the distributions, we can conclude that the small observed difference is statistically significant; or, if we want to be more Bayesian about it, we can compute the probability that the CV for women is higher, which is pretty close to 1.
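
Here is a rough sketch of how those quantities can be computed from the joint posterior; the function names are mine, and the sketch assumes the suite provides Items() and that Pmf supports Incr(), which might differ slightly from the actual interface.

def Marginal(suite, index):
    # marginal posterior of mu (index=0) or sigma (index=1)
    pmf = Pmf.Pmf()
    for pair, prob in suite.Items():
        pmf.Incr(pair[index], prob)
    return pmf

def MarginalCV(suite):
    # posterior distribution of CV = sigma / mu
    pmf = Pmf.Pmf()
    for (mu, sigma), prob in suite.Items():
        pmf.Incr(sigma / mu, prob)
    return pmf

def ProbGreater(pmf1, pmf2):
    # probability that a value drawn from pmf1 exceeds a value drawn from pmf2
    total = 0.0
    for x, px in pmf1.Items():
        for y, py in pmf2.Items():
            if x > y:
                total += px * py
    return total

For example, ProbGreater(MarginalCV(female_suite), MarginalCV(male_suite)) gives the posterior probability that the CV for women is higher, where female_suite and male_suite stand for the posteriors computed from the two samples.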

Finally we conclude:
  • The standard deviation of height for men is higher.
  • But the coefficient of variation is slightly lower.
  • Even though the difference is tiny, it is very unlikely to be due to chance.
  • And therefore women are inferior.
Hey, I told you I have a soft spot for crank science.

2 comments:

  1. In your book Think Bayes, you use the same example to illustrate Approximate Bayesian Computation. And you use `scipy.stats.norm.logpdf(s, sigma, sigma/math.sqrt(2*(n-1)))` for the likelihood of the sample standard deviation under the (mu, sigma) hypothesis. And I wonder, wouldn't it be more appropriate to use the sampling distribution of the sample variance for that likelihood instead? Something like `loglike += scipy.stats.chi2.logpdf((s**2*(n-1))/(sigma**2),df=(n-1))`.

    I posted this as a question on stackoverflow: http://stats.stackexchange.com/questions/238046/what-is-the-likelihood-of-drawing-a-sample-with-standard-deviation-s-from-a-no

    Replies
    1. Great question, thanks! Yes, the chi2 would be correct (and just as easy to implement). I think my normal approximation is pretty close, but when I get a chance, I will run both and see how close. If I did this again, I would use the chi2 distribution, since the normal approximation doesn't provide any advantage for this problem.
