Wednesday, April 27, 2016

Bayes on Jupyter

I am working on an updated version of my workshop, Bayesian Statistics Made Simple, now using Jupyter notebooks (formerly known as IPython). It's still a work in progress, but you can see a draft of my slides here:

 

If you want to run the code, you can launch the notebooks in your browser by hitting this button:



You should see a home page like this:


If you want to try the exercises, open workshop01.ipynb.  If you just want to see the answers, open workshop01_soln.ipynb.

Either way, you should be able to run the notebooks in your browser and try out the examples.

If you run into any problems, let me know.  Comments and suggestions are welcome.

Special thanks to the generous people who run Binder, which makes it easy to share and reproduce computation.  You can watch their video here:




Wednesday, April 6, 2016

Bayesian update with a multivariate normal distribution

This notebook contains a solution to a problem posted on Reddit; here's the original statement of the problem:

So, I have two sets of data where the elements correspond to each other:

A = {122.8, 115.5, 102.5, 84.7, 154.2, 83.7, 122.1, 117.6, 98.1, 
     111.2, 80.3, 110.0, 117.6, 100.3, 107.8, 60.2}
B = {82.6, 99.1, 74.6, 51.9, 62.3, 67.2, 82.4, 97.2, 68.9, 77.9,
     81.5, 87.4, 92.4, 80.8, 74.7, 42.1}

I'm trying to find out the probability that (91.9 <= A <= 158.3) and (56.4 <= B <= 100). I know that P(91.9 <= A <= 158.3) = 0.727098 and that P(56.4 <= B <= 100) = 0.840273, given that A is a normal distribution with mean 105.5 and standard deviation 21.7 and that B is a normal distribution with mean 76.4 and standard deviation 15.4. However, since they are dependent events, P(BA)=P(A)P(B|A)=P(B)P(A|B). Is there any way that I can find out P(A|B) and P(B|A) given the data that I have?
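
Before diving into the Bayesian analysis, we can check the marginal probabilities quoted in the problem; they follow directly from the normal CDF, so something like this (using scipy.stats.norm with the stated means and standard deviations) should reproduce them:

# quick check of the marginal probabilities quoted in the problem
from scipy.stats import norm

p_A = norm(105.5, 21.7).cdf(158.3) - norm(105.5, 21.7).cdf(91.9)
p_B = norm(76.4, 15.4).cdf(100) - norm(76.4, 15.4).cdf(56.4)
print(p_A, p_B)    # roughly 0.727 and 0.840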

The original poster added this clarification:

I'm going to give you some background on what I'm trying to do here first. I'm doing sports analysis trying to find the best quarterback of the 2015 NFL season using passer rating and quarterback rating, two different measures of how the quarterback performs during a game. The numbers in the sets above are the different ratings for each of the 16 games of the season (A being passer rating, B being quarterback rating, the first element being the first game, the second element being the second, etc.) The better game the quarterback has, the higher each of the two measures will be; I'm expecting that they're correlated and dependent on each other to some degree. I'm assuming that they're normally distributed because most things done by humans tend to be normally distributed.

As a first step, let's look at the data. I'll put the two datasets into NumPy arrays.
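
This excerpt doesn't show the notebook's setup cell, but judging from the names used in the cells below, it presumably includes imports roughly like these:

# assumed setup cell (not shown in this excerpt)
from itertools import product, starmap

import numpy as np
import pandas as pd
from scipy.stats import multivariate_normal, wishart

import thinkbayes2
import thinkplot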

In [2]:
a = np.array([122.8, 115.5, 102.5, 84.7, 154.2, 83.7,
              122.1, 117.6, 98.1, 111.2, 80.3, 110.0,
              117.6, 100.3, 107.8, 60.2])
b = np.array([82.6, 99.1, 74.6, 51.9, 62.3, 67.2,
              82.4, 97.2, 68.9, 77.9, 81.5, 87.4,
              92.4, 80.8, 74.7, 42.1])
n = len(a)
n
Out[2]:
16

And make a scatter plot:

In [3]:
thinkplot.Scatter(a, b, alpha=0.7)

It looks like modeling this data with a bivariate normal distribution is a reasonable choice.

Let's make a single array out of them:

In [4]:
X = np.array([a, b])

And compute the sample mean

In [29]:
x̄ = X.mean(axis=1)
print(x̄)
[ 105.5375   76.4375]

Sample standard deviation

In [30]:
std = X.std(axis=1)
print(std)
[ 21.04040384  14.93640163]

Covariance matrix

In [32]:
S = np.cov(X)
print(S)
[[ 472.21183333  161.33583333]
 [ 161.33583333  237.96916667]]

And correlation coefficient

In [33]:
corrcoef = np.corrcoef(a, b)
print(corrcoef)
[[ 1.         0.4812847]
 [ 0.4812847  1.       ]]

Now, let's start thinking about this as a Bayesian estimation problem.

There are 5 parameters we would like to estimate:

  • The means of the two variables, μ_a, μ_b

  • The standard deviations, σ_a, σ_b

  • The coefficient of correlation, ρ.

As a simple starting place, I'll assume that the prior distributions for these variables are uniform over all possible values.

I'm going to use a mesh algorithm to compute the joint posterior distribution, so I'll "cheat" and construct the mesh using conventional estimates for the parameters.

For each parameter, I'll compute a range of possible values where

  • The center of the range is the value estimated from the data.

  • The width of the range is 6 standard errors of the estimate.

The likelihood of any point outside this mesh is so low, it's safe to ignore it.

Here's how I construct the ranges:

In [9]:
def make_array(center, stderr, m=11, factor=3):
    return np.linspace(center-factor*stderr, 
                       center+factor*stderr, m)

μ_a = x̄[0]
μ_b = x̄[1]
σ_a = std[0]
σ_b = std[1]
ρ = corrcoef[0][1]

μ_a_array = make_array(μ_a, σ_a / np.sqrt(n))
μ_b_array = make_array(μ_b, σ_b / np.sqrt(n))
σ_a_array = make_array(σ_a, σ_a / np.sqrt(2 * (n-1)))
σ_b_array = make_array(σ_b, σ_b / np.sqrt(2 * (n-1)))
#ρ_array = make_array(ρ, np.sqrt((1 - ρ**2) / (n-2)))
ρ_array = make_array(ρ, 0.15)

def min_max(array):
    return min(array), max(array)

print(min_max(μ_a_array))
print(min_max(μ_b_array))
print(min_max(σ_a_array))
print(min_max(σ_b_array))
print(min_max(ρ_array))
(89.757197120005102, 121.31780287999489)
(65.235198775056304, 87.639801224943696)
(9.5161000378105989, 32.564707642175762)
(6.7553975307657055, 23.117405735750815)
(0.031284703359568844, 0.93128470335956881)

Although the mesh is constructed in 5 dimensions, for doing the Bayesian update, I want to express the parameters in terms of a vector of means, μ, and a covariance matrix, Σ.

Params is an object that encapsulates these values. pack is a function that takes the 5 parameters and returns a Params object.

In [10]:
class Params:
    def __init__(self, μ, Σ):
        self.μ = μ
        self.Σ = Σ
        
    def __lt__(self, other):
        # needed so Params objects can be sorted; compare by mean vector
        return tuple(self.μ) < tuple(other.μ)
In [11]:
def pack(μ_a, μ_b, σ_a, σ_b, ρ):
    μ = np.array([μ_a, μ_b])
    cross = ρ * σ_a * σ_b
    Σ = np.array([[σ_a**2, cross], [cross, σ_b**2]])
    return Params(μ, Σ)

Now we can make a prior distribution. First, mesh is the Cartesian product of the parameter arrays. Since there are 5 dimensions with 11 points each, the total number of points is 11**5 = 161,051.

In [12]:
mesh = product(μ_a_array, μ_b_array, 
               σ_a_array, σ_b_array, ρ_array)

The result is an iterator. We can use itertools.starmap to apply pack to each of the points in the mesh:

In [13]:
mesh = starmap(pack, mesh)

Now we need an object to encapsulate the mesh and perform the Bayesian update. MultiNorm represents a map from each Param object to its probability.

It inherits Update from thinkbayes2.Suite and provides Likelihood, which computes the probability of the data given a hypothetical set of parameters.

If we know the mean is μ and the covariance matrix is Σ:

  • The sampling distribution of the sample mean, x̄, is multivariate normal with parameters μ and Σ/n.

  • The sampling distribution of (n-1) S is Wishart with parameters n-1 and Σ.

So the likelihood of the observed summary statistics, x̄ and S, is the product of two probability densities:

  • The pdf of the multivariate normal distribution evaluated at x̄.

  • The pdf of the Wishart distribution evaluated at (n-1) S.
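
In symbols, the likelihood computed by the Likelihood method below is

$p(\bar{x}, S \mid \mu, \Sigma) = \mathcal{N}(\bar{x} \mid \mu, \Sigma/n) \cdot \mathrm{Wishart}\big((n-1)S \mid n-1, \Sigma\big)$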

In [14]:
class MultiNorm(thinkbayes2.Suite):
    
    def Likelihood(self, data, hypo):
        x̄, S, n = data

        pdf_x̄ = multivariate_normal(hypo.μ, hypo.Σ/n)
        pdf_S = wishart(df=n-1, scale=hypo.Σ)
        
        like = pdf_x̄.pdf(x̄) * pdf_S.pdf((n-1) * S)
        return like

Now we can initialize the suite with the mesh.

In [15]:
suite = MultiNorm(mesh)

And update it using the data (the return value is the total probability of the data, aka the normalizing constant). This takes about 30 seconds on my machine.

In [16]:
suite.Update((x̄, S, n))
Out[16]:
1.6385250666091713e-15

Now to answer the original question, about the conditional probabilities of A and B, we can either enumerate the parameters in the posterior or draw a sample from the posterior.

Since we don't need a lot of precision, I'll draw a sample.

In [17]:
sample = suite.MakeCdf().Sample(300)

For a given pair of values, μ and Σ, in the sample, we can generate a simulated dataset.

The size of the simulated dataset is arbitrary, but should be large enough to generate a smooth distribution of P(A|B) and P(B|A).

In [18]:
def generate(μ, Σ, sample_size):
    return np.random.multivariate_normal(μ, Σ, sample_size)

# run an example using sample stats
fake_X = generate(x̄, S, 300)

The following function takes a sample of $a$ and $b$ values and computes the conditional probabilities P(A|B) and P(B|A):

In [19]:
def conditional_probs(sample):
    df = pd.DataFrame(sample, columns=['a', 'b'])
    # rows where each condition holds
    pA = df[(91.9 <= df.a) & (df.a <= 158.3)]
    pB = df[(56.4 <= df.b) & (df.b <= 100)]
    # rows where both conditions hold
    pBoth = pA.index.intersection(pB.index)
    pAgivenB = len(pBoth) / len(pB)
    pBgivenA = len(pBoth) / len(pA)
    return pAgivenB, pBgivenA

conditional_probs(fake_X)
Out[19]:
(0.8174603174603174, 0.865546218487395)

Now we can loop through the sample of parameters, generate simulated data for each, and compute the conditional probabilities:

In [20]:
def make_predictive_distributions(sample):
    pmf = thinkbayes2.Joint()

    for params in sample:
        fake_X = generate(params.μ, params.Σ, 300)
        probs = conditional_probs(fake_X)
        pmf[probs] += 1

    pmf.Normalize()
    return pmf

predictive = make_predictive_distributions(sample)

Then pull out the posterior predictive marginal distribution of P(A|B), and print the posterior predictive mean:

In [21]:
thinkplot.Cdf(predictive.Marginal(0).MakeCdf())
predictive.Marginal(0).Mean()
Out[21]:
0.7246540147158328

And then pull out the posterior predictive marginal distribution of P(B|A), with the posterior predictive mean:

In [22]:
thinkplot.Cdf(predictive.Marginal(1).MakeCdf())
predictive.Marginal(1).Mean()
Out[22]:
0.8221366197059744

We don't really care about the posterior distributions of the parameters, but it's good to take a look and make sure they are not crazy.

The following function takes μ and Σ and unpacks them into a tuple of 5 parameters:

In [23]:
def unpack(μ, Σ):
    μ_a = μ[0]
    μ_b = μ[1]
    σ_a = np.sqrt(Σ[0][0])
    σ_b = np.sqrt(Σ[1][1])
    ρ = Σ[0][1] / σ_a / σ_b
    return μ_a, μ_b, σ_a, σ_b, ρ
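
As a quick sanity check, unpack should invert pack: packing the conventional estimates from earlier and unpacking the result should give back the same five values.

# round-trip check: pack the conventional estimates, then unpack them
params = pack(μ_a, μ_b, σ_a, σ_b, ρ)
print(unpack(params.μ, params.Σ))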

So we can iterate through the posterior distribution and make a joint posterior distribution of the parameters:

In [24]:
def make_marginals(suite):
    joint = thinkbayes2.Joint()
    for params, prob in suite.Items():
        t = unpack(params.μ, params.Σ)
        joint[t] = prob
    return joint

marginals = make_marginals(suite)

And here are the posterior marginal distributions for μ_a and μ_b:

In [25]:
thinkplot.Cdf(marginals.Marginal(0).MakeCdf())
thinkplot.Cdf(marginals.Marginal(1).MakeCdf());

And here are the posterior marginal distributions for σ_a and σ_b:

In [26]:
thinkplot.Cdf(marginals.Marginal(2).MakeCdf())
thinkplot.Cdf(marginals.Marginal(3).MakeCdf());

Finally, the posterior marginal distribution for the correlation coefficient, ρ:

In [27]:
thinkplot.Cdf(marginals.Marginal(4).MakeCdf());

Monday, March 28, 2016

Engineering education is good preparation for a lot of things

I am a big fan of engineering education.  I think studying engineering for four years is intellectually challenging, exciting, and (when it is done right) fun.  An engineering education is good preparation for careers in many fields and (again, when it is done right) good preparation for citizenship.

So you can imagine my reaction to a click-bait headline like, "Does Engineering Education Breed Terrorists?", which appeared recently in the Chronicle of Higher Education.

The article presents results from a new book, Engineers of Jihad: The Curious Connection Between Violent Extremism and Education, by Diego Gambetta and Steffen Hertog.  If you don't want to read the book, it looks like you can get most of the results from this working paper, which is free to download.

I have not read the book, but I will summarize results from the paper.

Engineers are over-represented

Gambetta and Hertog compiled a list of 404 "members of violent Islamist groups".  Of those, they found educational information on 284, and found that 196 had some higher education.  For 178 of those cases, they were able to determine the subject of study, and of those, the most common fields were engineering (78), Islamic studies (34), medicine (14), and economics/business (12).

So the representation of engineers among this group is 78/178, or 44%.  Is that more than we should expect?  To answer that, the ideal reference group would be a random sample of males from the same countries, with the same distribution of ages, selecting people with some higher education.

It is not easy to assemble a reference sample like that, but Gambetta and Hertog approximate it with data from several sources.  Their Table 4 summarizes the results, broken down by country:


In every country, the fraction of engineers in the sample of violent extremists is higher than the fraction in the reference group (with the possible exceptions of Singapore and Saudi Arabia).

As always in studies like this, there are methodological issues that could be improved, but taking the results at face value, it looks like people who studied engineering are over-represented among violent extremists by a substantial factor -- Gambetta and Hertog put it at "two to four times the size we would expect [by random selection]".

Why engineers?

In the second part of the paper, Gambetta and Hertog turn to the question of causation; as they formulate it, "we will try to explain only why engineers became more radicalized than people with other degrees".

They consider four hypotheses:

1) "Random appearance of engineers as first movers, followed by diffusion through their network", which means that if the first members of these groups were engineers by chance, and if they recruited primarily through their social networks, and if engineers are over-represented in those networks, the observed over-representation might be due to historical accident.

2) "Selection of engineers because of their technical skills"; that is, engineers might be over-represented because extremist groups recruit them specifically.

3) "Engineers' peculiar cognitive traits and dispositions"; which includes both selection and self-selection; that is, extremist groups might recruit people with the traits of engineers, and engineers might be more likely to have traits that make them more likely to join extremist groups.  [I repeat "more likely" as a reminder that under this hypothesis, not all engineers have the traits, and not all engineers with the traits join extremist groups.]

4) "The special social difficulties faced by engineers in Islamic countries"; which is the hypothesis that (a) engineering is a prestigious field of study, (b) graduates expect to achieve high status, (c) in many Islamic countries, their expectations are frustrated by lack of economic opportunity, and (d) the resulting "frustrated expectations and relative deprivation" explain why engineering graduates are more likely to become radicalized.

Their conclusion is that 1 and 2 "do not survive close scrutiny" and that the observed over-representation is best explained by the interaction of 3 and 4.

My conclusions

Here is my Bayesian interpretation of these hypotheses and the arguments Gambetta and Hertog present:

1) Initially, I didn't find this hypothesis particularly plausible.  The paper presents weak evidence against it, so I find it less plausible now.

2) I find it very plausible that engineers would be recruited by extremist groups, not only for specific technical skills but also for their general ability to analyze systems, debug problems, design solutions, and work on teams to realize their designs.  And once recruited, I would expect these skills to make them successful in exactly the ways that would cause them to appear in the selected sample of known extremists.  The paper addresses only a narrow version of this hypothesis (specific technical skills), and provides only weak evidence against it, so I still find it highly plausible.

3) Initially I found it plausible that there are character traits that make a person more likely to pursue engineering and more likely to join an extremist group.  This hypothesis could explain the observed over-representation even if both of those connections are weak; that is, even if few of the people who have the traits pursue engineering and even fewer join extremist groups.  The paper presents some strong evidence for this hypothesis, so I find it quite plausible.

4) My initial reaction to this hypothesis is that the causal chain sounds fragile.  And even if true, it is so hard to test, it verges on being unfalsifiable.  But Gambetta and Hertog were able to do more work with it than I expected, so I grudgingly upgrade it to "possible".

In summary, my interpretation of the results differs slightly from the authors': I think (2) and (3) are more plausible than (3) and (4).

But overall I think the paper is solid: it starts with an intriguing observation, makes good use of data to refine the question and to evaluate alternative hypotheses, and does a reasonable job of interpreting the results.

So at this point, you might be wondering "What about engineering education?  Does it breed terrorists, like the headline said?"

Good question!  But Gambetta and Hertog have nothing to say about it.  They did not consider this hypothesis, and they presented no evidence in favor or against.  In fact, the phrase "engineering education" does not appear in their 90-page paper except in the title of a paper they cite.

Does bad journalism breed click bait?

So where did that catchy headline come from?  From the same place all click bait comes from: the intersection of dwindling newspaper subscriptions and eroding journalistic integrity.

In this case, the author of the article, Dan Berrett, might not be the guilty party.  He does a good job of summarizing Gambetta and Hertog's research.  He presents summaries of a related paper on the relationship between education and extremism, and a related book.  He points out some limitations of the study design.  So far, so good.

The second half of the article is weaker: Berrett reports speculation from researchers farther and farther from the topic.

For example, Donna M. Riley at Virginia Tech suggests that traditional [engineering] programs "reinforce positivist patterns of thinking" because "general-education courses [...] are increasingly being narrowed" and "engineers spend almost all their time with the same set of epistemological rules."

Berrett also quotes a 2014 study of American students in four engineering programs (MIT, UMass Amherst, Olin, and Smith), which concludes, "Engineering education fosters a culture of disengagement that defines public welfare concerns as tangential to what it means to practice engineering."

But even if these claims are true, they are about American engineering programs in the 2010s.  They are unlikely to explain the radicalization of the extremists in the study, who were educated in Islamic countries in the "mid 70s to late 90s".

At this point there is no reason to think that engineering education contributed to the radicalization of the extremists in the sample.  Gambetta and Hertog don't consider the hypothesis, and present no evidence that supports it.  To their credit, they "wouldn’t presume to make curricular recommendations to [engineering] educators."

In summary, Gambetta and Hertog provide several credible explanations, supported by evidence, that could explain the observed over-representation.  Speculating on other causes, assuming they are true—and then providing prescriptions for American engineering education—is gratuitous.  And suggesting in a headline that engineering education causes terrorism is hack journalism.

We have plenty of problems

Traditional engineering programs are broken in a lot of ways.  That's why Olin College was founded, and why I wanted to work here.  I am passionate about engineering education and excited about the opportunities to make it better.  Furthermore, some of the places we have the most room for improvement are exactly the ones highlighted in Berrett's article.

Yes, the engineering curriculum should "emphasize human needs, contexts, and interactions", as the executive director of the American Society for Engineering Education (ASEE) suggests.  As he explains, "We want engineers to understand what motivates humans, because that’s how you satisfy human needs.  You have to understand people as people."

And yes, we want engineers "who recognize needs, design solutions and engage in creative enterprises for the good of the world".  That's why we put it in Olin College's mission statement.

We may not be doing these things very well, yet.  We are doing our best to get better.

In the meantime, it is premature to add "preventing terrorism" to our list of goals.  We don't know the causes of terrorism, we have no reason to think engineering education is one of them, and even if we did, I'm not sure we would know what to do about it.

Debating changes in engineering education is hard enough.  Bringing terrorism into it—without evidence or reason—doesn't help.