Thursday, June 16, 2016

What is a distribution?

This article uses object-oriented programming to explore of one of the most useful concepts in statistics, distributions.  The code is in a Jupyter notebook.

You can read a static version of the notebook on nbviewer.


You can run the code in a browser by clicking this link and then selecting distribution.ipynb from the list.

The following is a summary of the material in the notebook, which you might want to read before you dive into the code.

Random processes and variables

One of the recurring themes of my books is the use of object-oriented programming to explore mathematical ideas. Many mathematical entities are hard to define because they are so abstract. Representing them in Python puts the focus on what operations each entity supports  that is, what the objects can do  rather than on what they are.

In this article, I explore the idea of a probability distribution, which is one of the most important ideas in statistics, but also one of the hardest to explain. To keep things concrete, I'll start with one of the usual examples: rolling dice.
  • When you roll a standard six-sided die, there are six possible outcomes  numbers 1 through 6  and all outcomes are equally likely.
  • If you roll two dice and add up the total, there are 11 possible outcomes  numbers 2 through 12 — but they are not equally likely.  The least likely outcomes, 2 and 12, only happen once in 36 tries; the most likely outcome happens 1 times in 6.
  • And if you roll three dice and add them up, you get a different set of possible outcomes with a different set of probabilities. 
What I've just described are three random number generators.   The output from each generator is a random variable. And each random variable has probability distribution, which is the set of possible outcomes and the corresponding set of probabilities.

Representing distributions

There are many ways to represent a probability distribution. The most obvious is a probability mass function, or PMF, which is a function that maps from each possible outcome to its probability. And in Python, the most obvious way to represent a PMF is a dictionary that maps from outcomes to probabilities.

So is a Pmf a distribution? No. At least in my framework, a Pmf is one of several representations of a distribution. Other representations include the cumulative distribution function (CDF) and the characteristic function (CF).

These representations are equivalent in the sense that they all contain the same information; if I give you any one of them, you can figure out the others.

So why would we want different representations of the same information? The fundamental reason is that there are many operations we would like to perform with distributions; that is, questions we would like to answer. Some representations are better for some operations, but none of them is the best for all operations.

Here are some of the questions we would like a distribution to answer:

  • What is the probability of a given outcome?
  • What is the mean of the outcomes, taking into account their probabilities?
  • What is the variance of the outcome?  Other moments?
  • What is the probability that the outcome exceeds (or falls below) a threshold?
  • What is the median of the outcomes, that is, the 50th percentile?
  • What are the other percentiles?
  • How can get generate a random sample from this distribution, with the appropriate probabilities?
  • If we run two random processes and choose the maximum of the outcomes (or minimum), what is the distribution of the result?
  • If we run two random processes and add up the results, what is the distribution of the sum?

Each of these questions corresponds to a method we would like a distribution to provide. But there is no one representation that answers all of them easily and efficiently.

As I demonstrate in the notebook, the PMF representation makes it easy to look up an outcome and get its probability, and it can compute mean, variance, and other moments efficiently.

The CDF representation can look up an outcome and find its cumulative probability efficiently.  And it can do a reverse lookup equally efficiently; that is, given a probability, it can find the corresponding value, which is useful for computing medians and other percentiles.

The CDF also provides an easy way to generate random samples, and a remarkably simple way to compute the distribution of the maximum, or minimum, of a sample.

To answer the last question, the distribution of a sum, we can use the PMF representation, which is simple, but not efficient.  An alternative is to use the characteristic function (CF), which is the Fourier transform of the PMF.  That might sound crazy, but using the CF and the Convolution Theorem, we can compute the distribution of a sum in linearithmic time, or O(n log n).

If you are not familiar with the Convolution Theorem, you might want to read Chapter 8 of Think DSP.

So what's a distribution?

The Pmf, Cdf, and CharFunc are different ways to represent the same information. For the questions we want to answer, some representations are better than others. So how should we represent the distribution itself?

In my implementation, each representation is a mixin; that is, a class that provides a set of capabilities. A distribution inherits all of the capabilities from all of the representations. Here's a class definition that shows what I mean:

class Dist(Pmf, Cdf, CharFunc):
    def __init__(self, d):
        """Initializes the Dist.
        Calls all three __init__ methods.
        Pmf.__init__(self, d)
        Cdf.__init__(self, *compute_cumprobs(d))
        CharFunc.__init__(self, compute_fft(d))

When you create a Dist, you provide a dictionary of values and probabilities.
Dist.__init__ calls the other three __init__ methods to create the Pmf, Cdf, and CharFunc representations. The result is an object that has all the attributes and methods of the three representations.

From a software engineering point of view, that might not be the best design, but it is meant to illustrate what it means to be a distribution.

In short, if you give me any representation of a distribution, you have told me everything I need to answer questions about the possible outcomes and their probabilities. Converting from one representation to another is mostly a matter of convenience and computational efficiency.

Conversely, if you are trying to find the distribution of a random variable, you can do it by computing whichever representation is easiest to figure out.

So that's the idea.  If you want more details, take a look at the notebook by following one of the links at the top of the page.


  1. Just an observation: your meaning of "random process" makes sense, but it happens not to be how the phrase is generally used. Generally "random process" is used to mean a collection of random variables indexed by time (or some other similar index set). See for example . Maybe yours could alternatively be called a "random procedure"?

    1. Thanks, James. If each die roll yields a random variable, then the sequence of RVs makes a discrete time random process. But since they are independent and identically distributed, I treat them as a single RV. Admittedly I am being imprecise, but I didn't want this article to turn into "What is a random process?"

  2. Well, I think you need to bring this essay up to date! OOP is so 90's -- to get people's attention I think you need to rework this as an exercise in functional programming. That's what all the cool kids are doing these days. It's a toss up between Haskell (superior FP street cred) and Scala (messy, kitchen-sink approach, but undeniably the next big thing).

    I'm only half joking -- probability distributions are, like most or all mathematical objects, essentially declarative in nature instead of procedural. Ideally one would just declare the properties of a distribution in one representation and then properties of other representations would just fall out of the machinery. Failing that kind of mathematical programming language (to which existing computer algebra systems are an approximation) I think functional programming is a better fit than procedural.

    I don't know how one would best define probability distributions in a FP language. I'll call on those with more expertise to give us some ideas at this point.