Monday, January 9, 2017

Third batch of notebooks for Think Stats

As I mentioned in the previous post and the one before that, I am getting ready to teach Data Science in the spring, so I am going back through Think Stats and updating the Jupyter notebooks.  I am done with Chapters 1 through 9 now.

If you are reading the book, you can get the notebooks by cloning this repository on GitHub and running the notebooks on your computer.  Or you can read (but not run) the notebooks on GitHub:

Chapter 7 Notebook (Chapter 7 Solutions)
Chapter 8 Notebook (Chapter 8 Solutions)
Chapter 9 Notebook (Chapter 9 Solutions)

I'll post the next batch soon; in the meantime, here are some examples from Chapter 7, demonstrating the surprising difficulty of making an effective scatter plot, especially with large datasets.  In this example, I use data from the Behavioral Risk Factor Surveillance System (BRFSS), which includes responses from more than 300,000 people.



Scatter plots

I'll start with the data from the BRFSS again.
In [2]:
df = brfss.ReadBrfss(nrows=None)
The following function selects a random subset of a DataFrame.
In [3]:
def SampleRows(df, nrows, replace=False):
    indices = np.random.choice(df.index, nrows, replace=replace)
    sample = df.loc[indices]
    return sample
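Note: newer versions of pandas (0.16.1 and later) provide a built-in DataFrame.sample method that does the same thing.  Here's a minimal equivalent, just for reference:

def SampleRowsPandas(df, nrows, replace=False):
    # same behavior as SampleRows, using the built-in pandas method
    return df.sample(n=nrows, replace=replace)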
I'll extract the height in cm and the weight in kg of the respondents in the sample.
In [4]:
sample = SampleRows(df, 5000)
heights, weights = sample.htm3, sample.wtkg2
Here's a simple scatter plot with alpha=1, so each data point is fully saturated.
In [5]:
thinkplot.Scatter(heights, weights, alpha=1)
thinkplot.Config(xlabel='Height (cm)',
                 ylabel='Weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)
The data fall in obvious columns because they were rounded off. We can reduce this visual artifact by adding some random noise to the data.
NOTE: The version of Jitter in the book uses noise with a uniform distribution. Here I am using a normal distribution. The normal distribution does a better job of blurring artifacts, but the uniform distribution might be more true to the data.
In [6]:
def Jitter(values, jitter=0.5):
    n = len(values)
    return np.random.normal(0, jitter, n) + values
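For comparison, here's a sketch of a uniform-noise version, like the one described in the book (the name JitterUniform is mine):

def JitterUniform(values, jitter=0.5):
    # add uniform noise in the range [-jitter, jitter]
    n = len(values)
    return np.random.uniform(-jitter, jitter, n) + values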
Heights were probably rounded off to the nearest inch, which is about 2.5 cm, so I'll add noise with a standard deviation of 1.4 cm.
In [7]:
heights = Jitter(heights, 1.4)
weights = Jitter(weights, 0.5)
And here's what the jittered data look like.
In [8]:
thinkplot.Scatter(heights, weights, alpha=1.0)
thinkplot.Config(xlabel='Height (cm)',
                 ylabel='Weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)
The columns are gone, but now we have a different problem: saturation. Where many points overlap, the plot cannot get any darker, so the dense regions are not as dark as they should be relative to the outliers. As a result, the outliers look darker than they should, giving the impression that the data are more scattered than they actually are.
This is a surprisingly common problem, even in papers published in peer-reviewed journals.
We can usually solve the saturation problem by adjusting alpha and the size of the markers, s.
In [9]:
thinkplot.Scatter(heights, weights, alpha=0.1, s=10)
thinkplot.Config(xlabel='Height (cm)',
                 ylabel='Weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)
That's better. This version of the figure shows the location and shape of the distribution most accurately. There are still some apparent columns and rows where, most likely, people reported their height and weight using rounded values. If that effect is important, this figure makes it apparent; if it is not important, we could use more aggressive jittering to minimize it.
An alternative to a scatter plot is something like a HexBin plot, which breaks the plane into bins, counts the number of respondents in each bin, and colors each bin in proportion to its count.
In [10]:
thinkplot.HexBin(heights, weights)
thinkplot.Config(xlabel='Height (cm)',
                 ylabel='Weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)
In this case the binned plot does a pretty good job of showing the location and shape of the distribution. It obscures the row and column effects, which may or may not be a good thing.
Exercise: So far we have been working with a subset of only 5000 respondents. When we include the entire dataset, making an effective scatterplot can be tricky. As an exercise, experiment with Scatter and HexBin to make a plot that represents the entire dataset well.
In [11]:
# Solution

# With smaller markers, I needed more aggressive jittering to
# blur the measurement artifacts

# With this dataset, using all of the rows might be more trouble
# than it's worth.  Visualizing a subset of the data might be
# more practical and more effective.

heights = Jitter(df.htm3, 2.8)
weights = Jitter(df.wtkg2, 1.0)

thinkplot.Scatter(heights, weights, alpha=0.01, s=2)
thinkplot.Config(xlabel='Height (cm)',
                 ylabel='Weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)

Plotting percentiles

Sometimes a better way to get a sense of the relationship between variables is to divide the dataset into groups using one variable, and then plot percentiles of the other variable.
First I'll drop any rows that are missing height or weight.
In [12]:
cleaned = df.dropna(subset=['htm3', 'wtkg2'])
Then I'll divide the dataset into groups by height.
In [13]:
bins = np.arange(135, 210, 5)
indices = np.digitize(cleaned.htm3, bins)
groups = cleaned.groupby(indices)
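As an aside, pandas provides a cut function that can do similar binning directly; this alternative is my addition and is not used in what follows:

import pandas as pd

# label each row with its height interval; rows outside the bin edges get NaN
groups_alt = cleaned.groupby(pd.cut(cleaned.htm3, bins))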
Here is the number of respondents in each group:
In [14]:
for i, group in groups:
    print(i, len(group))
0 305
1 228
2 477
3 2162
4 18759
5 45761
6 70610
7 72138
8 61725
9 49938
10 43555
11 20077
12 7784
13 1777
14 405
15 131
Now we can compute the CDF of weight within each group.
In [15]:
mean_heights = [group.htm3.mean() for i, group in groups]
cdfs = [thinkstats2.Cdf(group.wtkg2) for i, group in groups]
And then extract the 25th, 50th, and 75th percentiles from each group.
In [16]:
for percent in [75, 50, 25]:
    weight_percentiles = [cdf.Percentile(percent) for cdf in cdfs]
    label = '%dth' % percent
    thinkplot.Plot(mean_heights, weight_percentiles, label=label)
    
thinkplot.Config(xlabel='Height (cm)',
                 ylabel='Weight (kg)',
                 axis=[140, 210, 20, 200],
                 legend=False)
Exercise: Yet another option is to divide the dataset into groups and then plot the CDF for each group. As an exercise, divide the dataset into a smaller number of groups and plot the CDF for each group.
In [17]:
# Solution

bins = np.arange(140, 210, 10)
indices = np.digitize(cleaned.htm3, bins)
groups = cleaned.groupby(indices)
cdfs = [thinkstats2.Cdf(group.wtkg2) for i, group in groups]

thinkplot.PrePlot(len(cdfs))
thinkplot.Cdfs(cdfs)
thinkplot.Config(xlabel='Weight (kg)',
                 ylabel='CDF',
                 axis=[20, 200, 0, 1],
                 legend=False)

Correlation

The following function computes the covariance of two variables using NumPy's dot function.
In [18]:
def Cov(xs, ys, meanx=None, meany=None):
    xs = np.asarray(xs)
    ys = np.asarray(ys)

    if meanx is None:
        meanx = np.mean(xs)
    if meany is None:
        meany = np.mean(ys)

    cov = np.dot(xs-meanx, ys-meany) / len(xs)
    return cov
And here's an example:
In [19]:
heights, weights = cleaned.htm3, cleaned.wtkg2
Cov(heights, weights)
Out[19]:
103.33290857697797
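As a quick sanity check (my addition, not part of the book), this should match NumPy's built-in covariance with ddof=0, since Cov divides by n rather than n-1:

np.cov(heights, weights, ddof=0)[0, 1]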
Covariance is useful for some calculations, but it doesn't mean much by itself. The coefficient of correlation is a standardized version of covariance that is easier to interpret.
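Specifically, the correlation is the covariance divided by the product of the standard deviations: Corr(x, y) = Cov(x, y) / (SD(x) * SD(y)).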
In [20]:
def Corr(xs, ys):
    xs = np.asarray(xs)
    ys = np.asarray(ys)

    meanx, varx = thinkstats2.MeanVar(xs)
    meany, vary = thinkstats2.MeanVar(ys)

    corr = Cov(xs, ys, meanx, meany) / np.sqrt(varx * vary)
    return corr
The correlation of height and weight is about 0.51, which is a moderately strong correlation.
In [21]:
Corr(heights, weights)
Out[21]:
0.50873647897347707
NumPy provides a function that computes correlations, too:
In [22]:
np.corrcoef(heights, weights)
Out[22]:
array([[ 1.        ,  0.50873648],
       [ 0.50873648,  1.        ]])
The result is a matrix with self-correlations on the diagonal (which are always 1) and the correlation between the two variables in the off-diagonal elements (which are equal because the matrix is symmetric).
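If you just want the scalar correlation, you can pull it out of the off-diagonal:

np.corrcoef(heights, weights)[0, 1]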
Pearson's correlation is not robust in the presence of outliers, and it tends to underestimate the strength of non-linear relationships.
Spearman's correlation is more robust, and it can handle non-linear relationships as long as they are monotonic. Here's a function that computes Spearman's correlation:
In [23]:
import pandas as pd

def SpearmanCorr(xs, ys):
    xranks = pd.Series(xs).rank()
    yranks = pd.Series(ys).rank()
    return Corr(xranks, yranks)
For heights and weights, Spearman's correlation is a little higher:
In [24]:
SpearmanCorr(heights, weights)
Out[24]:
0.54058462623204762
A pandas Series provides a corr method that computes correlations, and it offers 'spearman' as one of the options.
In [25]:
def SpearmanCorr(xs, ys):
    xs = pd.Series(xs)
    ys = pd.Series(ys)
    return xs.corr(ys, method='spearman')
The result is essentially the same as the one from the function we wrote.
In [26]:
SpearmanCorr(heights, weights)
Out[26]:
0.54058462623204839
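Yet another option, not used in the book, is scipy.stats.spearmanr, which returns the correlation along with a p-value (assuming SciPy is installed):

from scipy import stats

rho, p = stats.spearmanr(heights, weights)
rho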
An alternative to Spearman's correlation is to transform one or both of the variables in a way that makes the relationship closer to linear, and then compute Pearson's correlation.
In [27]:
Corr(cleaned.htm3, np.log(cleaned.wtkg2))
Out[27]:
0.53172826059834655

Exercises

Using data from the NSFG, make a scatter plot of birth weight versus mother’s age. Plot percentiles of birth weight versus mother’s age. Compute Pearson’s and Spearman’s correlations. How would you characterize the relationship between these variables?
In [28]:
import first

live, firsts, others = first.MakeFrames()
live = live.dropna(subset=['agepreg', 'totalwgt_lb'])
In [29]:
# Solution

ages = live.agepreg
weights = live.totalwgt_lb
print('Corr', Corr(ages, weights))
print('SpearmanCorr', SpearmanCorr(ages, weights))
Corr 0.0688339703541
SpearmanCorr 0.0946100410966
In [30]:
# Solution

def BinnedPercentiles(df):
    """Bin the data by age and plot percentiles of weight for each bin.

    df: DataFrame
    """
    bins = np.arange(10, 48, 3)
    indices = np.digitize(df.agepreg, bins)
    groups = df.groupby(indices)

    ages = [group.agepreg.mean() for i, group in groups][1:-1]
    cdfs = [thinkstats2.Cdf(group.totalwgt_lb) for i, group in groups][1:-1]

    thinkplot.PrePlot(3)
    for percent in [75, 50, 25]:
        weights = [cdf.Percentile(percent) for cdf in cdfs]
        label = '%dth' % percent
        thinkplot.Plot(ages, weights, label=label)

    thinkplot.Config(xlabel="Mother's age (years)",
                     ylabel='Birth weight (lbs)',
                     xlim=[14, 45], legend=True)
    
BinnedPercentiles(live)
In [31]:
# Solution

def ScatterPlot(ages, weights, alpha=1.0, s=20):
    """Make a scatter plot.

    ages: sequence of float
    weights: sequence of float
    alpha: float transparency of the markers
    s: float size of the markers
    """
    thinkplot.Scatter(ages, weights, alpha=alpha, s=s)
    thinkplot.Config(xlabel='Age (years)',
                     ylabel='Birth weight (lbs)',
                     xlim=[10, 45],
                     ylim=[0, 15],
                     legend=False)
    
ScatterPlot(ages, weights, alpha=0.05, s=10)
In [32]:
# Solution

# My conclusions:

# 1) The scatterplot shows a weak relationship between the variables but
#    it is hard to see clearly.

# 2) The correlations support this.  Pearson's is around 0.07, Spearman's
#    is around 0.09.  The difference between them suggests some influence
#    of outliers or a non-linear relationship.

# 3) Plotting percentiles of weight versus age suggests that the
#    relationship is non-linear.  Birth weight increases more quickly
#    in the range of mother's age from 15 to 25.  After that, the effect
#    is weaker.
