Wednesday, September 26, 2018

This blog has moved

As of September 2018, I am moving Probably Overthinking It to a new location.

Blogger has been an excellent host; so far, this blog has received more than two million page views.

But earlier this month, when I published a new article, Blogger prompted me to post it on Google+, and I did.  A few hours later I discovered that my Google+ account had been suspended for violating terms of service, but I got no information about what terms I had violated.

While my Google+ account was suspended, I was unable to access Blogger and some other Google services.  And since Probably Overthinking It is a substantial part of my professional web presence, that was unacceptable.

I appealed the suspension by pressing a button, with no opportunity to ask a question.  Within 24 hours, my account was restored, but with no communication and still no information.

So for me, using Google+ has become a game of Russian Roulette.  Every time I post something, there seems to be a random chance that I will lose control of my web presence.  And maybe next time it will be permanent.

It is nice that using Blogger is free, but this episode has been a valuable reminder that "If you are not paying for it, you are not the customer".  (Who said that?)

I have moved Probably Overthinking It to a site I control, hosted by a company I pay, a company that has provided consistently excellent customer service.

Lesson learned.

[When I published this article, Blogger asked if I wanted to post it on Google+.  I did not.]

UPDATE: See the discussion of this post on Hacker News, with lots of good advice for migrating to services you have more control over.

Wednesday, September 19, 2018

Two hour marathon in 2031, maybe

On Sunday (September 16, 2018) Eliud Kipchoge ran the Berlin Marathon in 2:01:39, smashing the previous world record by more than a minute and taking a substantial step in the progression toward a two hour marathon.

In a previous article, I noted that the marathon record pace since 1970 has been progressing linearly over time, and I proposed a model that explains why we might expect it to continue.  Based on a linear extrapolation of the data so far, I predicted that someone would break the two hour barrier in 2041, plus or minus a few years.

Now it is time to update my predictions in light of the new record.  The following figure shows the progression of world record pace since 1970 (orange line), a linear fit to the data (blue line) and a 90% predictive confidence interval (shaded area).  The dashed lines show the two hour marathon pace (13.1 mph) and lower and upper bounds for the year we will reach it.

Since the previous record was broken in 2014, we have been slightly behind the long-term trend.  But the new record more than makes up for it, putting us at the upper edge of the predictive interval.

This model predicts that we might see a two hour marathon as early as 2031, and probably will before 2041.

Note that this model is based on data from races.  It is possible that we will see a two hour marathon sooner in under time trial conditions, as in the Nike Breaking2 project.

Thursday, September 13, 2018

Tom Bayes and the case of the double dice

The double dice problem

Suppose I have a box that contains one each of 4-sided, 6-sided, 8-sided, and 12-sided dice. I choose a die at random, and roll it twice without letting you see the die or the outcome. I report that I got the same outcome on both rolls.

1) What is the posterior probability that I rolled each of the dice?
2) If I roll the same die again, what is the probability that I get the same outcome a third time?

You can see the complete solution in this Jupyter notebook, or read the HTML version here.


Here's a BayesTable that represents the four hypothetical dice.
In [3]:
hypo = [Fraction(sides) for sides in [4, 6, 8, 12]]
table = BayesTable(hypo)
hypo prior likelihood unnorm posterior
0 4 1 NaN NaN NaN
1 6 1 NaN NaN NaN
2 8 1 NaN NaN NaN
3 12 1 NaN NaN NaN

Since we didn't specify prior probabilities, the default value is equal priors for all hypotheses. They don't have to be normalized, because we have to normalize the posteriors anyway.
Now we can specify the likelihoods: if a die has n sides, the chance of getting the same outcome twice is 1/n.
So the likelihoods are:
In [4]:
table.likelihood = 1/table.hypo
hypo prior likelihood unnorm posterior
0 4 1 1/4 NaN NaN
1 6 1 1/6 NaN NaN
2 8 1 1/8 NaN NaN
3 12 1 1/12 NaN NaN
Now we can use update to compute the posterior probabilities:
In [5]:
hypo prior likelihood unnorm posterior
0 4 1 1/4 1/4 2/5
1 6 1 1/6 1/6 4/15
2 8 1 1/8 1/8 1/5
3 12 1 1/12 1/12 2/15
In [6]:
0    0.400000
1    0.266667
2    0.200000
3    0.133333
Name: posterior, dtype: float64
The 4-sided die is most likely because you are more likely to get doubles on a 4-sided die than on a 6-, 8-, or 12- sided die.

Part two

The second part of the problem asks for the (posterior predictive) probability of getting the same outcome a third time, if we roll the same die again.
If the die has n sides, the probability of getting the same value again is 1/n, which should look familiar.
To get the total probability of getting the same outcome, we have to add up the conditional probabilities:
P(n | data) * P(same outcome | n)
The first term is the posterior probability; the second term is 1/n.
In [7]:
total = 0
for _, row in table.iterrows():
    total += row.posterior / row.hypo
Fraction(13, 72)
This calculation is similar to the first step of the update, so we can also compute it by
1) Creating a new table with the posteriors from table.
2) Adding the likelihood of getting the same outcome a third time.
3) Computing the normalizing constant.
In [8]:
table2 = table.reset()
table2.likelihood = 1/table.hypo
hypo prior likelihood unnorm posterior
0 4 2/5 1/4 NaN NaN
1 6 4/15 1/6 NaN NaN
2 8 1/5 1/8 NaN NaN
3 12 2/15 1/12 NaN NaN
In [9]:
Fraction(13, 72)
In [10]:
hypo prior likelihood unnorm posterior
0 4 2/5 1/4 1/10 36/65
1 6 4/15 1/6 2/45 16/65
2 8 1/5 1/8 1/40 9/65
3 12 2/15 1/12 1/90 4/65
This result is the same as the posterior after seeing the same outcome three times.
This example demonstrates a general truth: to compute the predictive probability of an event, you can pretend you saw the event, do a Bayesian update, and record the normalizing constant.
(With one caveat: this only works if your priors are normalized.)

Tuesday, July 10, 2018

The Physics of Bungee Jumping

Bungee jumping turns out to be more complicated than I realized.  I use bungee jumping as an example in Modeling and Simulation in Python, which I am revising this summer.  The book is not done, but you can see the current draft here.

During the first phase of the jump, before the cord is fully extended, I treat the jumper as if they are in free fall, including the effect of gravity and air resistance, but ignoring the interaction between the jumper and the cord.

It turns out that this interaction is non-negligible.  As the cord drops from its folded initial condition to its extended final condition, it loses potential energy.  Where does that energy go?  It is transferred to the jumper!

The following diagram shows the scenario, courtesy of this web page on the topic:

The acceleration of the jumper turns out to be

where a is the net acceleration of the jumper, g is acceleration due to gravity, v is the velocity of the jumper, y is the position of the jumper relative to the starting point, L is the length of the cord, and μ is the mass ratio of the cord and jumper.

For a bungee jumper with mass 75 kg, I've computed the trajectory of a jumper with and without the effect of the cord.  The difference is more than two meters, which could be the difference between a successful jump and a bad day.

The details are in this Jupyter notebook.

Friday, June 22, 2018

Inference in three hours

I am preparing a talk for the Joint Statistical Meetings (JSM 2018) in August.  It's part of a session called "Bringing Intro Stats into a Multivariate and Data-Rich World"; my talk is called "Inference in Three Hours, and More Time for the Good Stuff".

Here's what I said I would talk about:
Teaching statistical inference using mathematical methods takes too much time, emphasizes the least important material, and leaves many students unprepared to apply statistics in the real world. Simple computer simulations can demonstrate the fundamental ideas of statistical inference quickly, clearly, and memorably. Computational methods are also robust and flexible, making it possible to work with a wider range of data and experiments. And by teaching statistical inference better and faster, we leave time for the most important goals of statistics education: preparing students to use data to answer questions and guide decision making under uncertainty. In this talk, I discuss problems with current approaches and present educational material I have developed based on computer simulations in Python.
I have slides for the talk now:

And here's the Jupyter notebook they are based on.

I have a few weeks until the conference, so comments and suggestions are welcome.


Coincidentally, I got question on Twitter today that's related to my talk:
Very late to this post by @AllenDowney, but quite informative:  …
Have one question though: seems a lot of the single case reasoning here is similar to what I was taught was a mistaken conclusion: “that there is a 95% prob that a parameter lies within the given 95% CI.” What is the difference? Seems I am missing some nuance?
The post @cutearguments asks about is "Recidivism and single-case probabilities", where I make an argument that single-case probabilities are not a special problem, even under the frequentist interpretation of probability; they only seem like a special problem because they make the reference class problem particularly salient.

So what does that have to do with confidence intervals?  Let me start with the example in my talk: suppose you are trying to estimate the average height of men in the U.S.  You collect a sample and generate an estimate, like 178 cm, and a 95% confidence interval, like (177, 179) cm.

Naively, it is tempting to say that there is a 95% chance that the true value (the actual average height of every male resident in the population) falls in the 95% confidence interval.  But that's not true.

There are two reasons you might hear for why it's not true:

1) The true value is unknown, but it is not a random quantity, so it is either in the interval or it's not.  You can't assign a probability to it.

2) The 95% confidence interval does not have a 95% chance of containing the true value because that's just not what it means.  A confidence interval quantifies variability due to random sampling; that's all.

The first argument is bogus; the second is valid.

If you are a Bayesian, the first argument is bogus because it is entirely unproblematic to make probability statements about unknown quantities, whether they are considered random or not.

If you are a frequentist, the first argument is still bogus because even if the true value is not a random quantity, the confidence interval is.  And furthermore, it belongs to a natural reference class, the set of confidence intervals we would get by running the experiment many times.  If we agree to treat it as a member of that reference class, we should have no problem giving it a probability of containing the true value.

But that probability is not 95%.   If you want an interval with a 95% chance of containing the true value, you need a Bayesian credible interval.

Friday, June 1, 2018

Bayesian Zig Zag

Almost two years ago, I had the pleasure of speaking at the inaugural meeting of the Boston Bayesians, where I presented "Bayesian Bandits from Scratch" (the notebook for that talk is here).  Since then, the group has flourished, thanks to the organizers, Jordi Diaz and Colin Carroll.

Last night I made my triumphant return for the 21st meeting, where I presented a talk I called "Bayesian Zig Zag".  Here's the abstract:
Tools like PyMC and Stan make it easy to implement probabilistic models, but getting started can be challenging.  In this talk, I present a strategy for simultaneously developing and implementing probabilistic models by alternating between forward and inverse probabilities and between grid algorithms and MCMC.  This process helps developers validate modeling decisions and verify their implementation. 
As an example, I will use a version of the "Boston Bruins problem", which I presented in Think Bayes, updated for the 2017-18 season. I will also present and request comments on my plans for the second edition of Think Bayes.
When I wrote the abstract, I was confident that the Bruins would be in the Stanley Cup final, but that is not how it worked out.  I adapted, using results from the first two games of the NHL final series to generate predictions for the next game.

Here are the slides from the talk:

And here is the Jupyter notebook I presented.  If you want to follow along, you'll see that there is a slide that introduces each section of the notebook, and then you can read the details.  If you have a Python installation with PyMC, you can download the notebook from the repository and try it out.

The talk starts with basic material that should be accessible for beginners, and ends with a hierarchical Bayesian model of a Poisson process, so it covers a lot of ground!  I hope you find it useful.

For people who were there, thank you for coming (all the way from Australia!), and thanks for the questions, comments, and conversation.  Thanks again to Jordi and Colin for organizing, to WeWork for hosting, and to QuantumBlack for sponsoring.

Thursday, May 3, 2018

Some people hate custom libraries

For most of my books, I provide a Python module that defines the functions and objects I use in the book.  That makes some people angry.

The following Amazon review does a nice job of summarizing the objections, and it demonstrates the surprising passion this issue evokes:

March 29, 2018
Format: Paperback
Echoing another reviewer, the custom code requirement means you learn their custom code rather than, you know, the standard modules numpy and scipy. For example, at least four separate classes are required, representing hundreds of lines of code, are required just to execute the first six lines of code in the book. All those lines do is define two signals, a cosine and a sine, sums them, then plots them. This, infuriatingly, hides some basic steps. Here's how you can create a cosine wave with frequency 440Hz:

duration = 0.5
framerate = 11025
n = round(duration*framerate)
ts = np.arange(n)/framerate
amp = 1.0
freq = 440
offset = 0.0
cos_sig = amp * numpy.cos( 2*numpy.pi*ts*freq + offset)
freq = 880
sin_sig = amp * numpy.sin( 2*numpy.pi*ts*freq + offset)

Instead, these clowns have

cos_sig = thinkdsp.CosSignal(freq=440,amp=1.0,offset=0)
sin_sig = thinkdsp.SinSignal(freq=440,amp=1.0,offset=0)
mix = cos_sig + sin_sig

where CosSignal and SinSignal are custom classes, not functions, which inherits four separate classes, NONE of which are necessary, and all of which serve to make things more complex than necessary, on the pretense this makes things easier. The classes these class inherit are a generic Sinusoid and SumSignal classes, which inherits a Signal class, which depends on a Wave class, which performs plotting using pyplot in matplotlib. None of which make anything really any easier, but does serve to hide a lot of basic functionality, like hiding how to use numpy, matplotlib, and pyplot.

In short, just to get through the first two pages, you have to have access to github to import their ridiculous thinkdsp, thinkplot, and thinkstats, totalling around 5500 lines of code, or you are just screwed and can't use this book. All decent teaching books develops code you need as necessary and do NOT require half a dozen files with thousands of lines of custom code just to get to page 2. What kind of clown does this when trying to write a book to show how to do basic signal processing? Someone not interested in teaching you DSP, but trying to show off their subpar programming skills by adding unnecessary complexity (a sure sign of a basic programmer, not a good).

The authors openly admit their custom code is nothing more than wrappers in numpy and scipy, so the authors KNEW they were writing a crappy book and filling it with a LOT of unnecessary complexity. Bad code is bad code. Using bad code to teach makes bad teaching. It's obvious Allen B. Downey has spent his career in academia, where writing quality code doesn't matter.

Well, at least he spelled my name right.

Maybe I should explain why I think it's a good idea to provide a custom library along with a book like Think DSP.  Importantly, the goal of the book is to help people learn the core ideas of signal processing; the software is a means to this end.

Here's what I said in the preface:
The premise of this book is that if you know how to program, you can use that skill to learn other things, and have fun doing it. 
With a programming-based approach, I can present the most important ideas right away. By the end of the first chapter, you can analyze sound recordings and other signals, and generate new sounds. Each chapter introduces a new technique and an application you can apply to real signals. At each step you learn how to use a technique first, and then how it works.
For example, in the first chapter, I introduce two objects defined in Wave and Spectrum.  Wave provides a method called make_spectrum that creates a Spectrum object, and Spectrum provides make_wave, which creates a Wave.

When readers use these objects and methods, they are implicitly learning one of the fundamental ideas of signal processing: that a Wave and its Spectrum are equivalent representations of the same information -- given one, you can always compute the other.

This example demonstrates one reason I use custom libraries in my books: The API is the lesson.  As you learn about these objects and how they interact, you are also learning the core ideas of the topic.

Another reason I think these libraries are a good idea is that they let me introduce ideas top-down: that is, I can show what a method does -- and why it is useful -- first; then I can present details when they necessary or most useful.

For example, I introduce the Spectrum object in Chapter 1.  I use it to apply a low pass filter, and the reader can hear what that sounds like.  You can too, by running the Chapter 1 notebook on Binder.

In Chapter 2, I reveal that my make_spectrum function is a thin wrapper on two NumPy functions, and present the source code:

from np.fft import rfft, rfftfreq

# class Wave:
    def make_spectrum(self):
        n = len(self.ys)
        d = 1 / self.framerate

        hs = rfft(self.ys)
        fs = rfftfreq(n, d)

        return Spectrum(hs, fs, self.framerate)

At this point, anyone who prefers to use NumPy directly, rather than my wrappers, knows how.

In Chapter 7, I unwrap one more layer and show how the FFT algorithm works.  Why Chapter 7?  Because I introduce correlation in Chapter 5, which helps me explain the Discrete Cosine Transform in Chapter 6, which helps me explain the Discrete Fourier Transform.

Using custom libraries lets me organize the material in the way I think works best, based on my experience working with students and seeing how they learn.

This example demonstrates another benefit of defining my own objects: data encapsulation.  When you use NumPy's rfft to compute a spectrum, you get an array of amplitudes, but not the frequencies they correspond to.  You can call rfftfreq to get the frequencies, and that's fine, but now you have two arrays that represent one spectrum.  Wouldn't it be nice to wrap them up in an object?  That's what a Spectrum object is.

Finally, I think these examples demonstrate good software engineering practice, particularly bottom-up design.  When you work with libraries like NumPy, it is common and generally considered a good idea to define functions and objects that encapsulate data, hide details, eliminate repeated code, and create new abstractions.  Paul Graham wrote about this idea in one of his essays on software:
[...] you don't just write your program down toward the language, you also build the language up toward your program. [...] the boundary between language and program is drawn and redrawn, until eventually it comes to rest along [...] the natural frontiers of your problem. In the end your program will look as if the language had been designed for it.
That's why, in the example that makes my correspondent so angry, it takes just three lines to create and add the signals; and more importantly, those lines contain exactly the information relevant to the operations and no more.  I think that's good quality code.

In summary, I provide custom libraries for my books because:

1) They demonstrate good software engineering practice, including bottom-up design and data encapsulation.

2) They let me present ideas top-down, showing how they are used before how they are implemented.

3) And as readers learn the APIs I defined, they are implicitly learning the key ideas.

I understand that not everyone agrees with this design decision, and maybe it doesn't work for everyone.  But I am still surprised that it makes people so angry.