Monday, March 2, 2015

Statistical inference is only mostly wrong

p-values banned!

The journal Basic and Applied Social Psychology (BASP) made news recently by "banning" p-values.  Here's a summary of their major points:
  1. "...the null hypothesis significance testing procedure (NHSTP) is invalid...".   "We believe that the p<0.05 bar is too easy to pass and sometimes serves as an excuse for lower quality research."
  2. "Confidence intervals suffer from an inverse problem that is not very different from that suffered by the NHSTP.   ... the problem is that, for example, a 95% confidence interval does not indicate that the parameter of interest has a 95% probability of being within the interval."
  3. "Bayesian procedures are more interesting.  The usual problem ... is that they depend on some sort of Laplacian assumption to generate numbers where none exist."
There are many parts of this new policy, and the rationale for it, that I agree with.  But there are several things I think they got wrong, both in the diagnosis and the prescription.  So I find myself -- to my great surprise -- defending classical statistical inference.  I'll take their points one at a time:


Classical hypothesis testing is problematic, but it is not "invalid", provided that you remember what it is and what it means.  A hypothesis test is a partial answer to the question, "Is it likely that I am getting fooled by randomness?"

Suppose you see an apparent effect, like a difference between two groups.  An example I use in Think Stats is the apparent difference in pregnancy length between first babies and others: in a sample of more than 9000 pregnancies in a national survey, first babies are born about 13 hours earlier, on average, than other babies.

There are several possible explanations for this difference:
  1. The effect might be real; that is, a similar difference would be seen in the general population.
  2. The apparent effect might be due to a biased sampling process, so it would not appear in the general population.
  3. The apparent effect might be due to measurement errors.
  4. The apparent effect might be due to chance; that is, the difference might appear in a random sample, but not in the general population.
Hypothesis testing addresses only the last possibility.  It asks, "If there were actually no difference between first babies and others, what would be the chance of selecting a sample with a difference as big as 13 hours?"  The answer to this question is the p-value.

If the p-value is small, you can conclude that the fourth possibility is unlikely, and infer that the other three possibilities are more likely.  In the actual sample I looked at, the p-value was 0.17, which means that there is a 17% chance of seeing a difference as big as 13 hours, even if there is no actual difference between the groups.  So I concluded that the fourth possibility can't be ruled out.

There is nothing invalid about these conclusions, provided that you remember a few caveats:
  1. Hypothesis testing can help rule out one explanation for the apparent effect, but it doesn't account for others, particularly sampling bias and measurement error.
  2. Hypothesis testing doesn't say anything about how big or important the effect is.
  3. There is nothing magic about the 5% threshold.

Confidence intervals

The story is pretty much the same for confidence intervals; they are problematic for many of the same reasons, but they are not useless.  A confidence interval is a partial answer to the question: "How precise is my estimate of the effect size?"

If you run an experiment and observe an effect, like the 13-hour difference in pregnancy length, you might wonder whether you would see the same thing if you ran the experiment again.  Would it always be 13 hours, or might it range between 12 and 14?  Or maybe between 11 and 15?

You can answer these questions by 
  1. Making a model of the experimental process,
  2. Analyzing or simulating the model, and
  3. Computing the sampling distribution, which quantifies how much the estimate varies due to random sampling. 
Confidence intervals and standard errors are two ways to summarize the sampling distribution and indicate the precision of the estimate.

Again, confidence intervals are useful if you remember what they mean:
  1. The CI quantifies the variability of the estimate due to random sampling, but does not address other sources of error.
  2. You should not make any claim about the probability that the actual value falls in the CI.
As the editors of BASP point out, "a 95% confidence interval does not indicate that the parameter of interest has a 95% probability of being within the interval."  Ironically, the situation is worse when the sample size is large.  In that case, the CI is usually small, other sources of error dominate, and the CI is less likely to contain the actual value.

One other limitation to keep in mind: both p-values and confidence intervals are based on modeling decisions: a p-value is based on a model of the null hypothesis, and a CI is based on a model of the experimental process.  Modeling decisions are subjective; that is, reasonable people could disagree about the best model of a particular experiment.  For any non-trivial experiment, there is no unique, objective p-value or CI.

Bayesian inference

The editors of BASP write that the problem with Bayesian statistics is that "they depend on some sort of Laplacian assumption to generate numbers where none exist."  This is an indirect reference to the fact that Bayesian analysis depends on the choice of a prior distribution, and to Laplace's principle of indifference, which is an approach some people recommend for choosing priors.

The editors' comments evoke a misunderstanding of Bayesian methods.  If you use Bayesian methods as another way to compute p-values and confidence intervals, you are missing the point.  Bayesian methods don't do the same things better; they do different things, which are better.

Specifically, the result of Bayesian analysis is a posterior distribution, which includes all possible estimates and their probabilities.  Using posterior distributions, you can compute answers to questions like:
  1. What is the probability that a given hypothesis is true, compared to a suite of alternative hypotheses?
  2. What is the probability that the true value of a parameter falls in any given interval?
These are questions we care about, and their answers can be used directly to inform decision-making under uncertainty.

But I don't want to get bogged down in a Bayesian-frequentist debate.  Let me get back to the BASP editorial.


Conventional statistical inference, using p-values and confidence intervals, is problematic, and it fosters bad science, as the editors of BASP claim.  These problems have been known for a long time, but previous attempts to instigate change have failed.  The BASP ban (and the reaction it provoked) might be just what we need to get things moving.

But the proposed solution is too blunt; statistical inference is broken but not useless.  Here is the approach I recommend:
  1. The effect size is the most important thing.  Papers that report statistical results should lead with the effect size and explain its practical impact in terms relevant to the context.
  2. The second most important thing -- a distant second -- is a confidence interval or standard error, which quantifies error due to random sampling.  This is a useful additional piece of information, provided that we don't forget about other sources of error.
  3. The third most important thing -- a distant third -- is a p-value, which provides a warning if an apparent effect could be explained by randomness.
  4. We should drop the arbitrary 5% threshold, and forget about the conceit that 5% is the false positive rate.
For me, the timing of the BASP editorial is helpful.  I am preparing a new tutorial on statistical inference, which I will present at PyCon 2015 in April.  The topic just got a little livelier!


"The CI quantifies uncertainty due to random sampling, but it says nothing about sampling bias or measurement error. Because these other sources of error are present in nearly all real studies, published CIs will contain the population value far less often than 90%."

A fellow redditor raised an objection I'll try to paraphrase:

"The 90% CI contains the true parameter 90% of the time because that's the definition of the 90% CI.  If you compute an interval that doesn't contain the true parameter 90% of the time, it's not a 90% CI."

The reason for the disagreement is that my correspondent and I were using different definitions of CI:
  1. I was defining CI by procedure; that is, a 90% CI is what you get if you compute the sampling distribution and take the 5th and 95th percentiles.
  2. My correspondent was defining CI by intent: A CI is an interval that contains the true value 90% of the time.
If you use the first definition, the problem with CIs is that they don't contain the true value 90% of the time.

If you use the second definition, the problem with CIs is that our standard way of computing them doesn't work, at least not in the real world.

Once this point was clarified, my correspondent and I agreed about the conclusion: the fraction of published CIs that contain the true value of the parameter is less than 90%, and probably a lot less.

Tuesday, February 24, 2015

Upcoming talk on survival analysis in Python

On March 2, 2015 I am presenting a short talk for the Python Data Science meetup.  Here is the announcement for the meetup.

And here are my slides:

The code for the talk is in an IPython notebook you can view on nbviewer.  It is still a work in progress!

And here's the punchline graph:

Each line corresponds to a different cohort: 50s means women born during the 1950s, and so on.  The curves show the probability of being unmarried as a function of age; for example, about 50% of women born in the 50s had been married by age 23.  For women born in the 80s, only about 25% were married by age 23.

The gray lines show predictions based on applying patterns from previous cohorts.

A few patterns emerge from this figure:

1) Successive generations of women are getting married later and later.  No surprise there.

2) For women born in the 50s, the curve leveled off; if they didn't marry early, they were unlikely to marry at all.  For later generations, the curve keeps dropping, which indicates some women getting married for the first time at later ages.

3) The predictions suggest that the fraction of women who eventually marry is not changing substantially.  My predictions for Millennials suggest that they will end up marrying, eventually, at rates similar to previous cohorts.

Tuesday, February 10, 2015

Bayesian analysis of match rates on Tinder

Last fall I taught an introduction to Bayesian statistics at Olin College. My students worked on some excellent projects, and I invited them to write up their results as guest articles for this blog. Just in time for Valentine's Day, here's the second in the series:

It’s (Not) a Match!
Ankur Das and Mason del Rosario

All code for this post can be found on Github:

Valentine’s day is fast approaching, and if you are anything like us engineering students, you will most likely be dateless on the 14th of February. Luckily, there are lots of great apps geared towards bringing singles together. One of the most popular is Tinder. For those unfamiliar with the app, Tinder presents the user with other users’ profiles, and the user then either swipes right for yes or left for no. When two people have swiped right on one another, they become a “match” and can chat with each other via text through the app.

Images of our Tinder profiles. Aren’t we a couple of handsome devils?

Some time ago, two of our friends decided to conduct an experiment on Tinder. The friends, one male and one female, each swiped right on the next 100 profiles and recorded how many matches they got. Though our female friend received over 90 matches, the male garnered only one, giving a 90% match rate for the female and an abysmal 1% response rate for the male! The discrepancy we saw is not an isolated incident. This article from reports an experiment where male profiles at  best garnered 7% match rates while a female profile received a 20% match rate.

Why such poor odds for men on Tinder? Women may be more discriminating with their swipes, but another possibility is that women are simply less active on Tinder. Pursuing the less depressing option, we set about trying to answer the following question: how does the activity rate of Tinder’s population affect a given user’s match rate?

p q r

Luckily for all us dateless men and women out there, the number of responses we get is not an accurate measure of your attractiveness on Tinder. Adapting Allen Downey’s work on “The Volunteer Problem,” a user’s total response rate p is the product of r, the true attractiveness rating, and q, the activity rating. For every 100 users we swipe, only some q fraction of them will even see our profiles, let alone swipe back. With that in mind, we have something to attribute our Tinder failures to: the inactivity of users who may not have logged in recently.

Still, it would be nice to know what our actual r response rate is. Finding p is simple --- simply measure the number of return swipes --- but we need to determine q to solve for r. With our set of Bayesian tools, Python scripts, and swiping fingers, we set out to estimate the activity rate, q, of other Tinder users.

How often do we Tinder?

For some of us, Tinder is a way of life. For a reasons we won’t get into here, we check Tinder hourly, keeping up to date with the latest swipes. But many users check the app more infrequently, going days or weeks without looking at new potential matches. Unfortunately Tinder’s best selling points also make life difficult for the amateur Bayesian statistician. The app provides minimal information on profiles and has no public history of past swipes and encounters. When seeing a new profile, the user’s last log-in time provides the only hint of her future activity, leaving us guessing if she chose to ignore our profile or simply never saw it.

Luckily, we can adapt Prof. Downey’s solution to “The Red Line Problem” to turn this sparse data into a predictive model. We start with an assumption that the average time between Tinder activity follows a normal distribution centered at 10 hours. Even then, someone who uses Tinder every 10 hours on average will occasionally check twice an hour or just once a week. We model this variance with an exponential distribution for each average activity period to show different activity rates for the same user.

(Note: all times and arrival rates are given in hours unless otherwise noted)

While some other user may go 10 hours without checking Tinder, our last arrival data only gives us a single point in that interval, skewing our results. First, we are 5 times more likely to observe someone during a 5 hour interval than a 1 hour interval, simply due to its length. We account for this bias towards larger intervals by multiplying each probability by time.
Exponential PDF scaled by time  for for a user with an average of 10 hours between activity

Even with these likelihoods, our data comes in the form of time since last activity. 15 minutes ago could indicate a future log in 10 minutes or 10 days later, so we need to adapt our distribution to this type of input. For every gap between activity, we have an equal chance of making our observation at any point in between. Combining the chances of each activity period happening with the chances of finding a time in that period produces a mixture of various observed last log-in times, specific to each hypothetical user with a different average activity rate.
Exponential CDF for observed last activity time for a user with an average of 10 hours between activity

Now what?

That was a lot of work to get to a distribution within a distribution within a distribution, but now we finally have a chance to use our data! Looking back at our prior distribution of average activity rates, we now have a corresponding distribution of observed past log-in times for each. We sampled 100 users to record their activity times, giving us accurate information for the users relevant to us. Our findings vary between genders, areas, and age ranges, so new time data must be collected for each aspiring Tinder statistician. Updating each of our hypotheses with the chance of observing the times we saw shows the graph below, a steep distribution centered at 10 hours.

Exponential PDF scaled by time  for for a user with an average of 10 hours between activity

The posterior distribution peaks at the same central value as the prior, but with a smaller standard deviation, due the large amount of data. The small standard deviation likely indicates more about Tinder’s sorting algorithm than the true userbase, however. Given the number of people who have likely stopped using the app, we should see a much longer tail for users checking Tinder every 24 hours or longer. This data actually indicates that Tinder only shows us users that tend to be fairly active.  This behavior makes sense: what dating app would match you with inactive users?

Predictive Modeling

After a bit of data collection and analysis, we now know the breakdown of our potential matches. Most of them check Tinder about every 10 hours on average, so they should respond promptly to our swipes. For the sake of modeling ease, let’s say that when someone checks Tinder, they instantly see our profile, regardless of the time they spend active or Tinder’s display algorithm. Not only does this make our analysis much easier, it also gives us a worst case estimate. If our results for Tinder success assume extra people see our profile, our true response rate can only be better.

Using the spread of potential Tinder matches, we can predict how many will use Tinder within some threshold time. If we check again in 5 hours, for instance, we will of course receive fewer views than 10 hours later. To find the distributions of response rates, we looked at each average response rate, finding the chance of activity for that rate within some threshold time. Scaling all these probabilities by the likelihood of that average response rate produces a distribution of q, the portion of users active within some time.
q distributions for active users within 5 and 10 hours

True Response Rate

Now that we are able to generate a distribution for q, we are half way towards generating a distribution for a given user’s true response rate, r. Next, we need to know what the distribution of perceived response rate, p, looks like. Recalling that the product of q and r gives us p, we can reverse engineer our desired r distribution by taking the quotient of p over q!  We’re working on that part of the problem now.


Clearly, we still have some work to do before we can say anything quantitative regarding a user’s true match rate, but we can say some positive things without going much further. One encouraging observation is that a user’s true match rate will always be better than the perceived match rate. While a user may see that only 10 out of every 100 profiles become matches, a percentage of those users have not even been on Tinder for quite some time. The upshot of this: you’re probably more attractive than Tinder makes you think you are!

That’s all for now. Happy swiping!

Thursday, February 5, 2015

Godless freshmen: now more Nones than Catholics

This article is an update to my annual series on one of the most under-reported stories of the decade: the fraction of college freshmen who report no religious preference has more than tripled since 1985, from 8% to 27%, and the trend is accelerating.

In last year's installment, I made the bold prediction that the trend would continue, and that the students starting college in 2014 would again, be the most godless ever.  It turns out I was right for the fifth year in a row.  The number of students reporting no religious preference increased to 27.5%, a substantial increase since last year's record-high 24.6%.  Also, for the first time in the history of the survey, the number of "Nones" exceeds the number of Catholics (25.3%).

The number of people reporting that they never attended a religious service also reached an all-time high at 29.3%, up from 27.3% last year.

Of course, we should not over-interpret a single data point, but generally:

1) This year's data points are consistent with previous predictions, and

2) Data since 1990 support the conclusion that the number of incoming college students with no religious preference is increasing and accelerating.

This analysis is based on survey results from the Cooperative Institutional Research Program (CIRP) of the Higher Education Research Insitute (HERI).  In 2014, more than 153,000 students at 227 colleges and universities completed the CIRP Freshman Survey, which includes questions about students’ backgrounds, activities, and attitudes.

In one question, students select their “current religious preference,” from a choice of seventeen common religions, “Other religion,” or “None.”

Another question asks students how often they “attended a religious service” in the last year. The choices are “Frequently,” “Occasionally,” and “Not at all.” Students are instructed to select “Occasionally” if they attended one or more times.

The following figure shows the fraction of Nones over more than 40 years of the survey
Fraction of college Freshmen with no religious preference.
The blue line shows actual data through 2013; the blue square shows the new data point for 2014.  The gray regions shows the predictions I generated last year based on data through 2013.  The new data point falls at the high end of the predicted interval.

The red line shows a quadratic fit to the data.  The dark gray region shows a 90% confidence interval, which quantifies sampling error, so it reflects uncertainty about the parameters of the fit.  The light gray region shows a 90% confidence interval taking into account both sampling error and residual error.  So it reflects total uncertainty about the predicted value, including uncertainty due to random variation from year to year.

Here is the corresponding plot for attendance at religious services:

Fraction of college Freshmen who report no attendance at religious services.

Again, the new data point for 2014, 29.3%,  falls in the predicted range, although somewhat ahead of the long term trend.

Predictions for 2015

Using the new 2014 data, we can generate predictions for 2015.  Here is the revised plot for "Nones":

Predictive interval for 2015.
This year's measurement is ahead of the long-term trend, so next year's is likely to regress, slightly, to 27.1% (down 0.4%).

And here is the prediction for "No attendance":
Predictive interval for 2015.

Again, because this year's value is ahead of the long term trend, the center of the predictive distribution is slightly lower, at 28.6% (down 0.7%).

I'll be back next year to check on these predictions.


1) As always, more males than females report no religious preference, and the gender gap appears to be growing.
Difference between men and women, fraction reporting no religious preference.
Evidence that the gender gap is increasing is strong.  The p-value of the slope of the fitted curve is less than 1e-6.

2) I notice that the number of schools and the number of students participating in the Freshman Survey has been falling for several years.  I wonder if the mix of schools represented in the survey is changing over time, and what effect this might have on the trends I am watching.  The percentage of "Nones" is different across different kinds of institutions (colleges, universities, public, private, etc.)  If participation rates are changing among these groups, that would affect the results.

3) Obviously college students are not representative of the general population.  Data from other sources indicate that the same trends are happening in the general population, but I haven't been able to make a quantitative comparison between college students and others.  Data from other sources also indicate that college graduates are slightly more likely to attend religious services, and to report a religious preference, than the general population.

Data Source
Eagan, K., Stolzenberg, E. B., Ramirez, J. J., Aragon, M. C., Suchard, M. R., & Hurtado, S. (2014). The American freshman: National norms fall 2014. Los Angeles: Higher Education Research Institute, UCLA.

This and all previous reports are available from the HERI publications page.

Monday, January 26, 2015

Bayesian predictions for Super Bowl XLIX

Last fall I taught an introduction to Bayesian statistics at Olin College. My students worked on some excellent projects, and I invited them to write up their results as guest articles for this blog. Here is the first article of the series:

Predicting Super Bowl XLIX
Alex Crease and Matt Wismer

All code for this post can be found on Github:

With the Super Bowl coming up this Sunday, everyone is probably wondering where to place their bets. Fortunately, we've developed a model that uses Bayesian statistics to predict the results of football matches based upon prior season data of each team. We then used our model to predict a game between the Seahawks and the Patriots.

Our model is an expanded version of Allen Downey's solution to the “World Cup Problem”, where one can predict goals scored in a soccer match based on prior data. While the soccer model uses a Poisson process to predict when the next goal will be scored, our football model is a bit trickier because it not only accounts for defense, but also incorporates the two types of scoring; field goals and 7 point touchdowns, to predict final scores. Safeties and two-point conversions were eliminated to simplify our model. We also didn't include overtime considerations, allowing the two teams to tie at the end of regulation.

This model can be useful in a number of different betting schemes. Here’s what it predicts:

Who’s going to win?

Patriots, easy. Well, we may be biased because we both go to school in New England. Here’s the real story:

We considered scoring to be a single Poisson process, and have a separate distribution that represents our belief about the probability of a team scoring a touchdown or a field goal. We then calculate the probability of scoring a certain number of touchdowns and field goals based on the binomial distribution. The single Poisson process is related to the ability of the offense to get into field goal range, and the percentage of touchdowns is related to the team’s red zone efficiency.

We also keep track of offense and defense separately. To calculate the distribution of points scored, we average the offense’s scoring rate with the rate at which the defense allows scoring, and average the offense’s touchdown percent with the percentage of touchdowns that the defense allows.

The points distributions for each team were generated as described above, so we then calculated the probability that one would be greater than the other, or, in simpler terms, the probability that each team would win:

Patriots Win
Seahawks Win

At this point, the teams seem well matched, with our model slightly favoring the Patriots. Looking more closely at the data, the 90% confidence interval gives the Patriots a scoring range of 6-45, and the Seahawks 6-44, bringing us to the investigation of the point spread.

Betting the Point Spread

This game is likely to be pretty close. We looked at the probability of different spreads by subtracting the Seahawk's point distribution from the Patriot’s point distribution. The cumulative distribution of the point spread is shown below. As you can see, the 50% mark appears right around the Patriots winning by 1. So if you can find a spread significantly distant from that, you can consult this chart to figure out how likely you will be to win the spread. However, this result does not consider overtime a possibility, it depends on the chance that the teams will tie. Thus the point spread may be different if the game goes into overtime.

Bets on the Total

Some people like to bet on the total points scored by both teams, so we wanted to figure that out as well. We added our two points distributions, and got an interesting result. There is almost an identical probability of the total being 37 (3.636%) or 44 (3.634%). You can consult the below graph for other likely total scores to see where to place your bets.


Another method of betting on the total is based on an over/under scale, like the point spread. We've got you covered in that case, too. We expect the total score to be centered around 45 points, as that is the mean of the distribution. According to, the over/under line is currently at 48 or 48.5, and with our results in mind, we advise betting under. However, if the game goes into overtime, all bets are off (pun intended) and the total score could go higher.

Office Pool Grid Betting

Because of the multiple ways of scoring in football, certain scores are more likely to occur than others. As you can see in the histograms below, scoring 17 points in a match is the most likely scenario for both teams because it is a combination of 2 touchdowns and 1 field goal, both quantities being likely for a football match. However, it is impossible to score something like 11 points because no combinations of 7 and 3 add up to that value. From the two graphs, you can see that the Seahawks are actually more likely to score field goals than the Patriots because their probability of scoring 3 points is higher.


To satisfy your betting needs, we've created histograms below displaying the likelihood that teams will score values with specific last digits so you can (hopefully) blow your office mates away with your office pool bets. If you are familiar with the way points add up in football, our data isn't very surprising. Across both teams, the top two digits to place your bets on are 0 and 7, with 4 and 3 coming up behind.



Sadly, our results do not show a clear outcome to place your bets on. However, we hope that we've shed some light on how you can make an educated guess when deciding where to put your money. Please remember that you ultimately decide how to bet, and we are no way liable for any money you may lose (or win) over the Super Bowl weekend.

Wednesday, December 17, 2014

Statistics tutorials at PyCon 2015

I am happy to announce that I will offer two statistics tutorials at PyCon 2015 on April 9 in Montreal.  In the morning session I am teaching Bayesian Statistics Made Simple, which I have taught several times before, including the last three PyCons.  In the afternoon I am offering a new tutorial, Statistical Inference with Computational Methods.

The whole tutorial schedule is here, along with registration information.  And here are some details about the tutorials:

Bayesian statistics made simple

Audience level:


An introduction to Bayesian statistics using Python.  Bayesian statistics are usually presented mathematically, but many of the ideas are easier to understand computationally.  People who know Python can get started quickly and use Bayesian analysis to solve real problems.  This tutorial is based on material and case studies from Think Bayes (O’Reilly Media).


Bayesian statistical methods are becoming more common and more important, but there are not many resources to help beginners get started.  People who know Python can use their programming skills to get a head start.
I will present simple programs that demonstrate the concepts of Bayesian statistics, and apply them to a range of example problems.  Participants will work hands-on with example code and practice on example problems.
Attendees should have at least basic level Python and basic statistics.  If you learned about Bayes’s theorem and probability distributions at some time, that’s enough, even if you don’t remember it!
Attendees should bring a laptop with Python and matplotlib.  You can work in any environment; you just need to be able to download a Python program and run it.  I will provide code to help attendees get set up ahead of time.

Statistical inference with computational methods

Audience level:


Statistical inference is a fundamental tool in science and engineering, but it is often poorly understood.  This tutorial uses computational methods, including Monte Carlo simulation and resampling, to explore estimation, hypothesis testing and statistical modeling.  Attendees will develop understanding of statistical concepts and learn to use real data to answer relevant questions.


Do you know the difference between standard deviation and standard error?  Do you know what statistical test to use for any occasion?  Do you really know what a p-value is?  How about a confidence interval?
Most students don’t really understand these concepts, even after taking several statistics classes.  The problem is that these classes focus on mathematical methods that bury the concepts under a mountain of details.
This tutorial uses Python to implement simple statistical experiments that develop deep understanding.  Attendees will learn about resampling and related tools that use random simulation to perform statistical inference, including estimation and hypothesis testing.  We will use pandas, which provides structures for data analysis, along with NumPy and SciPy.
I will present examples using real-world data to answer relevant questions.  The tutorial material is based on my book, Think Stats, a class I teach at Olin College, and my blog, “Probably Overthinking It.

More information and registration here.

Thursday, December 4, 2014

The Rock Hyrax Problem

This is the third of a series of articles about Bayesian analysis.  The previous article is here.

Earlier this semester I posed this problem to my Bayesian statistics class at Olin College:
Suppose I capture and tag 10 rock hyraxes.  Some time later, I capture another 10 hyraxes and find that two of them are already tagged.  How many hyraxes are there in this environment?
This is an example of a mark and recapture experiment, which you can read about on Wikipedia.  The Wikipedia page also includes the photo of a tagged hyrax shown above.

As always with problems like this, we have to make some modeling assumptions.

1) For simplicity, you can assume that the environment is reasonably isolated, so the number of hyraxes does not change between observations.

2) And you can assume that each hyrax is equally likely to be captured during each phase of the experiment, regardless of whether it has been tagged.  In reality, it is possible that tagged animals would avoid traps in the future, or possible that the same behavior that got them caught the first time makes them more likely to be caught again.  But let's start simple.

My solution to this problem uses the computation framework from my book, Think Bayes.  The framework is described in this notebook.  If you have read Think Bayes or attended one of my workshops, you might want to attempt this problem before you look at my solution.

If you solve this problem analytically, or use MCMC, and you want to share your solution, please let me know and I will post it here.

And when you are ready, you can see my solution in this notebook.

I will post more of the exercises from my class over the next few weeks.

UPDATE December 5, 2014: João Neto posted a solution to this problem in BUGS using a Jeffrey's prior.