## Thursday, April 11, 2013

### The price is right

On the reddit statistics forum, I recently posted a link to my tutorial on Bayesian statistics.  One of my fellow redditors drew my attention to this article, which uses pymc to do a Bayesian analysis of The Price is Right.  He or she asked how I would solve this problem using the framework in Think Bayes.  So, here goes.

First, we have to define a Suite of hypotheses.  In this example, each hypothesis represents a belief about the total price of the showcase.  We are told, based on data from previous shows, that a reasonable prior distribution is normal with mean 35000 dollars and standard deviation 7500.

So here's the code that creates the Suite:

class Price(thinkbayes.Suite):

def __init__(self, error_sigma):
"""Constructs the suite.

error_sigma: standard deviation of the distribution of error
"""
thinkbayes.Suite.__init__(self)
pmf = thinkbayes.MakeGaussianPmf(35000, 7500, num_sigmas=4)

# copy items from pmf to self
for val, prob in pmf.Items():
self.Set(val, prob)

# store error_sigma for use in Likelihood
self.error_sigma = error_sigma

MakeGaussianPmf makes a Pmf (probability mass function) that approximates a normal distribution.  It truncates the range of the distribution 4 sigmas in each direction from the mean.

I'll explain error_sigma in a minute.

Now that we have a suite of hypotheses, we think about how to represent the data.  In this case, the "data" is my best guess about the total value of the showcase, which we can think of as a measurement produced by a somewhat unreliable instrument, my brain.

According to the problem statement, my best guess is 3000 for the price of the snowmobile, 12000 for the price of the trip, so 15000 for the total price.

Now we need a Likelihood function that takes a hypothetical price for the showcase, and my best guess, and returns the Likelihood of the data given the hypothesis.  That is, if the actual price of the showcase is X, what is the probability that I would guess 12000?

To answer that question, we need some information about how good I am at guessing prices, and that's where error_sigma comes in.

We are told that my uncertainty about the price of the snowmobile can be captured by a normal distribution with sigma=500.  And my uncertainty about the price of the trip is normal with sigma=3000.  So let's compute the standard deviation of my total error:

error_snowmobile = 500
error_trip = 3000
error_total = math.sqrt(error_snowmobile**2 + error_trip**2)

Now we can create the Suite

suite = Price(error_total)

and update it with my guess

my_guess = 15000
suite.Update(my_guess)

When we invoke Update, it invokes Likelihood once for each hypothetical showcase price.  Here's the Likelihood function:

def Likelihood(self, hypo, data):
"""Computes the likelihood of the data under the hypothesis.

hypo: actual price
data: my guess
"""
actual_price = hypo
my_guess = data

error = my_guess - actual_price
like = thinkbayes.EvalGaussianPdf(
mu=0,
sigma=self.error_sigma,
x=error)

return like

x is the error; that is, how much my guess is off by.  like is the likelihood that I would be off by that much, computing by evaluating the density function of the Gaussian with sigma=error_sigma.

And we're done.  Let's see what the results look like.

The mean of the posterior distribution is 17800, substantially less than the prior mean (35000).  It is also substantially less than the result reported in the original article, near 28000.  So we are left with a suite of three hypotheses:

1) There is a mistake in my implementation.
2) There is a mistake in the other author's implementation.
3) We are actually making subtly different modeling assumptions.

Option #3 is possible because I am not sure I understand the model in the original article.  The author describes my beliefs about the snowmobile and the trip as "priors", which suggests that they are going to get updated.  In contrast, I am treating my guess about the prices as data (that is, a summary of what I learned by seeing the contents of the showcase), but I am also modeling myself as a measurement instrument with a characteristic distribution of errors.

Under my interpretation, the posterior shown above makes sense.  For example, if my guess is 15000, and the standard deviation of my guesses is 3050, then it is very unlikely that I am off by 4 standard deviations, so the upper bound of the posterior should be around 15000 + 4 * 3050 = 27200.  That makes the mean reported in the original article (28000) seem too high.

But maybe I am not interpreting the statement of the problem (or the model) as intended.  I will check in with my correspondent and update this article when we have an explanation!

UPDATE April 18, 2013:  I exchanged a few emails with the author of the original article, Cameron Davidson-Pilon.  He found a bug in his code that explains at least part of the difference between his results and mine.  So I think he is planning to update his article and the book he is working on.

He also sent me some data on the value of recent showcases on The Price is Right and the bids offered by the contestants.  The data were collected by Steve Gee and posted at The Price is Right Stats.

I have written code that uses this data to form the prior distribution of prices, and also to estimate the distribution of errors for the contestants.  And I am writing it up as a chapter in Think Bayes.  I'll post the new chapter here when it is done!

UPDATE April 22, 2013: I have added a chapter to Think Bayes, and I am publishing it as a two-part series, starting here.

1. Hi, Dr. Downey -

I am continuing to watch your presentations online and also digging into ThinkBayes. Thank you for your work in this area, and your commitment to teaching.

I'm new to the whole area of probabilistic programming, and ran into your book and lectures after I found pyMC. Honestly, I had a little difficulty following some of the reasoning in their examples, at least on the first pass through initial chapters, but your book is helping to clarify the concepts.

The question emerging in my mind, now that I'm starting into your book and code is, when is it appropriate to use your approach/code vs. pyMC? Should I understand your book as more conceptual, and then move on to use pyMC in practice? Would pyMC be as performant (or more?) as your code for the problems you cover, or does it involve simulation and therefore require more CPU cycles? Being new, again, and facing these options, I feel like I need a little more in the way of the lay of the land.

Thank you for your time. Regards,

Steve

1. Hi, Steve. This is a great question, and one I am thinking about myself. When I started Think Bayes I imagined that I would start with simple methods and switch to pyMC when needed. But I found that for the case studies that I wanted to do, the simple methods were sufficient, so I never made the switch.

When I have a chance to get back to it, I want to reimplement some of my solutions using pyMC and try to answer your question: when to prefer one or the other, and how to make a smooth transition from one to the other.

One way to think about the difference is that pyMC generates samples from the posterior distribution, so you let it run as long as necessary to generate a posterior with sufficient precision. The method in Think Bayes is basically enumeration of the posterior or a discrete approximation to it. So it takes a fixed amount of time and yields high precision, but the run time grows exponentially with the number of dimensions / number of parameters you are trying to estimate. And that's not practical for more than 3-4 parameters.

2. 