Wednesday, April 26, 2017

Python as a way of thinking

This article contains supporting material for this blog post at Scientific American.  The thesis of the post is that modern programming languages (like Python) are qualitatively different from the first generation (like FORTRAN and C), in ways that make them effective tools for teaching, learning, exploring, and thinking.

I presented a longer version of this argument in a talk I presented at Olin College last fall.  The slides are here:

Here are Jupyter notebooks with the code examples I mentioned in the talk:

Here's my presentation at SciPy 2015, where I talked more about Python as a way of teaching and learning DSP:

Finally, here's the notebook "Using Counters", which uses Python's Counter object to implement a PMF (probability mass function) and perform Bayesian updates.

In [13]:
from __future__ import print_function, division

from collections import Counter
import numpy as np
A counter is a map from values to their frequencies. If you initialize a counter with a string, you get a map from each letter to the number of times it appears. If two words are anagrams, they yield equal Counters, so you can use Counters to test anagrams in linear time.
In [3]:
def is_anagram(word1, word2):
    """Checks whether the words are anagrams.

    word1: string
    word2: string

    returns: boolean
    return Counter(word1) == Counter(word2)
In [4]:
is_anagram('tachymetric', 'mccarthyite')
In [5]:
is_anagram('banana', 'peach')
A Counter is a natural representation of a multiset, which is a set where the elements can appear more than once. You can extend Counter with set operations like is_subset:
In [6]:
class Multiset(Counter):
    """A multiset is a set where elements can appear more than once."""

    def is_subset(self, other):
        """Checks whether self is a subset of other.

        other: Multiset

        returns: boolean
        for char, count in self.items():
            if other[char] < count:
                return False
        return True
    # map the <= operator to is_subset
    __le__ = is_subset
You could use is_subset in a game like Scrabble to see if a given set of tiles can be used to spell a given word.
In [7]:
def can_spell(word, tiles):
    """Checks whether a set of tiles can spell a word.

    word: string
    tiles: string

    returns: boolean
    return Multiset(word) <= Multiset(tiles)
In [8]:
can_spell('SYZYGY', 'AGSYYYZ')

Probability Mass Functions

You can also extend Counter to represent a probability mass function (PMF).
normalize computes the total of the frequencies and divides through, yielding probabilities that add to 1.
__add__ enumerates all pairs of value and returns a new Pmf that represents the distribution of the sum.
__hash__ and __id__ make Pmfs hashable; this is not the best way to do it, because they are mutable. So this implementation comes with a warning that if you use a Pmf as a key, you should not modify it. A better alternative would be to define a frozen Pmf.
render returns the values and probabilities in a form ready for plotting
In [9]:
class Pmf(Counter):
    """A Counter with probabilities."""

    def normalize(self):
        """Normalizes the PMF so the probabilities add to 1."""
        total = float(sum(self.values()))
        for key in self:
            self[key] /= total

    def __add__(self, other):
        """Adds two distributions.

        The result is the distribution of sums of values from the
        two distributions.

        other: Pmf

        returns: new Pmf
        pmf = Pmf()
        for key1, prob1 in self.items():
            for key2, prob2 in other.items():
                pmf[key1 + key2] += prob1 * prob2
        return pmf

    def __hash__(self):
        """Returns an integer hash value."""
        return id(self)
    def __eq__(self, other):
        return self is other

    def render(self):
        """Returns values and their probabilities, suitable for plotting."""
        return zip(*sorted(self.items()))
As an example, we can make a Pmf object that represents a 6-sided die.
In [10]:
d6 = Pmf([1,2,3,4,5,6])
d6.normalize() = 'one die'
Pmf({1: 0.16666666666666666, 2: 0.16666666666666666, 3: 0.16666666666666666, 4: 0.16666666666666666, 5: 0.16666666666666666, 6: 0.16666666666666666})
Using the add operator, we can compute the distribution for the sum of two dice.
In [11]:
d6_twice = d6 + d6 = 'two dice'

for key, prob in d6_twice.items():
    print(key, prob)
2 0.0277777777778
3 0.0555555555556
4 0.0833333333333
5 0.111111111111
6 0.138888888889
7 0.166666666667
8 0.138888888889
9 0.111111111111
10 0.0833333333333
11 0.0555555555556
12 0.0277777777778
Using numpy.sum, we can compute the distribution for the sum of three dice.
In [14]:
# if we use the built-in sum we have to provide a Pmf additive identity value
# pmf_ident = Pmf([0])
# d6_thrice = sum([d6]*3, pmf_ident)

# with np.sum, we don't need an identity
d6_thrice = np.sum([d6, d6, d6]) = 'three dice'
And then plot the results (using Pmf.render)
In [19]:
import matplotlib.pyplot as plt
%matplotlib inline
In [20]:
for die in [d6, d6_twice, d6_thrice]:
    xs, ys = die.render()
    plt.plot(xs, ys,, linewidth=3, alpha=0.5)

Bayesian statistics

A Suite is a Pmf that represents a set of hypotheses and their probabilities; it provides bayesian_update, which updates the probability of the hypotheses based on new data.
Suite is an abstract parent class; child classes should provide a likelihood method that evaluates the likelihood of the data under a given hypothesis. update_bayesian loops through the hypothesis, evaluates the likelihood of the data under each hypothesis, and updates the probabilities accordingly. Then it re-normalizes the PMF.
In [21]:
class Suite(Pmf):
    """Map from hypothesis to probability."""

    def bayesian_update(self, data):
        """Performs a Bayesian update.
        Note: called bayesian_update to avoid overriding dict.update

        data: result of a die roll
        for hypo in self:
            like = self.likelihood(data, hypo)
            self[hypo] *= like

As an example, I'll use Suite to solve the "Dice Problem," from Chapter 3 of Think Bayes.
"Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die. If you have ever played Dungeons & Dragons, you know what I am talking about. Suppose I select a die from the box at random, roll it, and get a 6. What is the probability that I rolled each die?"
I'll start by making a list of Pmfs to represent the dice:
In [31]:
def make_die(num_sides):
    die = Pmf(range(1, num_sides+1)) = 'd' + str(num_sides)
    return die

dice = [make_die(x) for x in [4, 6, 8, 12, 20]]
for die in dice:
Pmf({1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25})
Pmf({1: 0.16666666666666666, 2: 0.16666666666666666, 3: 0.16666666666666666, 4: 0.16666666666666666, 5: 0.16666666666666666, 6: 0.16666666666666666})
Pmf({1: 0.125, 2: 0.125, 3: 0.125, 4: 0.125, 5: 0.125, 6: 0.125, 7: 0.125, 8: 0.125})
Pmf({1: 0.08333333333333333, 2: 0.08333333333333333, 3: 0.08333333333333333, 4: 0.08333333333333333, 5: 0.08333333333333333, 6: 0.08333333333333333, 7: 0.08333333333333333, 8: 0.08333333333333333, 9: 0.08333333333333333, 10: 0.08333333333333333, 11: 0.08333333333333333, 12: 0.08333333333333333})
Pmf({1: 0.05, 2: 0.05, 3: 0.05, 4: 0.05, 5: 0.05, 6: 0.05, 7: 0.05, 8: 0.05, 9: 0.05, 10: 0.05, 11: 0.05, 12: 0.05, 13: 0.05, 14: 0.05, 15: 0.05, 16: 0.05, 17: 0.05, 18: 0.05, 19: 0.05, 20: 0.05})
Next I'll define DiceSuite, which inherits bayesian_update from Suite and provides likelihood.
data is the observed die roll, 6 in the example.
hypo is the hypothetical die I might have rolled; to get the likelihood of the data, I select, from the given die, the probability of the given value.
In [26]:
class DiceSuite(Suite):
    def likelihood(self, data, hypo):
        """Computes the likelihood of the data under the hypothesis.

        data: result of a die roll
        hypo: Pmf object representing a die
        return hypo[data]
Finally, I use the list of dice to instantiate a Suite that maps from each die to its prior probability. By default, all dice have the same prior.
Then I update the distribution with the given value and print the results:
In [33]:
dice_suite = DiceSuite(dice)


for die in sorted(dice_suite):
    print(len(die), dice_suite[die])
4 0.0
6 0.392156862745
8 0.294117647059
12 0.196078431373
20 0.117647058824
As expected, the 4-sided die has been eliminated; it now has 0 probability. The 6-sided die is the most likely, but the 8-sided die is still quite possible.
Now suppose I roll the die again and get an 8. We can update the Suite again with the new data
In [30]:

for die, prob in sorted(dice_suite.items()):
    print(, prob)
d4 0.0
d6 0.0
d8 0.623268698061
d12 0.277008310249
d20 0.0997229916898
Now the 6-sided die has been eliminated, the 8-sided die is most likely, and there is less than a 10% chance that I am rolling a 20-sided die.
These examples demonstrate the versatility of the Counter class, one of Python's underused data structures.
In [ ]:

Tuesday, April 4, 2017

Honey, money, weather, terror

In my Data Science class this semester, students are working on a series of reports where they explore a freely-available dataset, use data to answer questions, and present their findings.  After each batch of reports, I will publish the abstracts here; you can follow the links below to see what they found.

The Impact of Military Status on Income Bracket

Joey Maalouf

Using the National Health Interview Survey data, I was able to look for a potential link between military status and financial status. More specifically, I wanted to check if whether someone served in the United States military affects their current income bracket. It was apparent that people who served in the military were underrepresented in low income brackets and overrepresented in high income brackets compared to the rest of the population. This difference appears more clearly if we group the income data into even broader brackets for further analysis; being in the military increased one's chances of being in the upper half of respondents by 16.06 percentage points, and of being in the upper third by 14.64 percentage points. Further statistical analysis reported a Cohen effect size of 0.32, which is above the standard threshold to be considered more than a small effect.

Getting Treatment

Kaitlyn Keil

In the so-called "War on Drugs", one of the primary tactics is teaching children to "Just Say No!" However, less attention is paid to treatment for those who are already addicted. Except for the occasional comment on how a celebrity disappeared off to rehab, there is a silence in our culture about the apparently shameful act of getting treatment. This silence made me begin to wonder: how many people who struggle with addictions actually get treated, and how long does it take before they receive this help? Using the National Survey on Drug Use and Health data from 2014, I found that very few people who use drugs report getting treatment or counseling, and the length of time they go without getting treatment isn't particularly correlated with other factors.

What's the Chance You will Die Due to Terrorism?

Kevin Zhang

With Trump's recent travel ban and the escalation of controversial actions against Middle Eastern people, there has been a rise of paranoia towards the Middle East region for fear of the possibility of a terrorist attack. But is there is a reason to be so afraid of the Middle East, or even terrorism in general? What is the chance that an American would be a victim to terrorism? This article looks into just how likely the average person in the US will be affected by a terrorist attack, should one happen. Results show that the chance of a person being affected by terrorism in the North American region is almost 0, especially when compared to the probability in the Middle East itself. The data suggests that people's fears are unfounded and that the controversial reactions towards Middle East citizens because of a 1 in 15 million chance are irrational.

Is There a Seasonality in the Response to Social Media Posts?

Sungwoo Park

Using a dataset containing over 4 million Facebook posts from 15 mainstream news outlets, I investigate the existence of seasonality in the number of likes a Facebook post from a news outlet gets. The dataset contains contents and attributes, such as number of likes and timestamp, of all facebook posts posted by the top media sources from 2012 to 2016. The media outlets included in the data are ABC, BBC, CBS, CNN, Fox & Friends, Fox, LA Times, NBC, NPR, The Huffington Post, The New York Times, The Wall Street Journal, The Washington Post, Time, and USA Today.

The Association Between Drug Usage and Depression

David Papp and Willem Thorbecke

The goal of this article is to explore the association between drug usage and depression. Intuitively, many would argue that those who use drugs are more likely to be depressed. To explore this relationship, we took data from the National Drug Usage and Health Survey from 2014. We conducted logistical regression on cocaine, marijuana, alcohol, and heroin while controlling for possible confounding variables such as sex, income, and health conditions. Surprisingly, there appears to be a negative correlation between drug usage and depression.

US Apiculture and Honey Production

Matthew Ruehle and Sean Carter

We examine the historic honey-producing bee colony counts, yield, and honey production of states as collected by the USDA, finding statistical evidence for regional "clustering" of production and a negative correlation between per-hive yield and overall price, most strongly reflected in states with the greatest absolute production

Most Terrorism is Local

Radmer van der Heyde

I explored the Global Terrorism Database to see how terror has evolved over time, and whether international terrorism has any defining features. Over time terrorism has increased and gotten deadlier, but shifted regions. However, international terrorism was too small a percentage of the dataset to reach an appropriate conclusion.

Does it get warmer before it rains?

Apurva Raman and William Lu

Speculating about the weather has been a staple of small talk and human curiosity for a long time, and as a result, many weather "myths" exist. One such myth we’ve heard is that it gets warmer before a precipitation event (e.g. rain, snow, hail, sleet, etc.) occurs. Using data from the US National Oceanic and Atmospheric Association (NOAA), we find that change in temperature is a poor indicator for whether or not there will be a precipitation event.