## Monday, January 24, 2011

### Are you popular? Hint: no.

According to this feel-good article in Scientific American, you are probably less popular than your friends.  The author, John Allen Paulos, explains:

“...it’s more probable that we will be among a popular person’s friends simply because he or she has a larger number of them.”

The same logic explains the class size paradox I wrote about in Think Stats.  If you ask students how big their classes are, more of them will report large classes, because more of them are in large classes.
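As a quick illustration of the paradox, here is the computation with made-up numbers (not survey data): suppose a college offers classes of three sizes.

```python
# Hypothetical distribution of class sizes at a college
# (sizes and probabilities are made up for illustration).
sizes = {10: 0.4, 30: 0.4, 100: 0.2}

# Mean class size from the college's point of view.
actual_mean = sum(size * p for size, p in sizes.items())

# If we survey students instead, each class is sampled in proportion
# to its size, so we reweight each probability by the size and renormalize.
biased = {size: p * size for size, p in sizes.items()}
total = sum(biased.values())
biased = {size: p / total for size, p in biased.items()}

# Mean class size as experienced by a randomly chosen student.
observed_mean = sum(size * p for size, p in biased.items())

print(actual_mean, observed_mean)
```

The college reports a mean class size of 36, but the mean size experienced by students is about 67, because each student's answer is weighted by the size of his or her class.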

To see how this works in social networks, I downloaded data from the Stanford Large Network Dataset Collection; one of the datasets there contains friend/foe information from the Slashdot Zoo.  For each (anonymized) user, we have the number of other users he or she tagged as a friend or foe.

There are more than 82,000 users in the dataset.  The average user has 55 friends and foes, but the median is only 18, which suggests that the distribution is skewed to the right.  In fact, the distribution is heavy-tailed, as shown in the figure below.

The blue line is the distribution for a randomly chosen user.  As expected, it is right-skewed, even on a log scale.  Most people have fewer than 20 connections, but 15% have more than 100 and 0.2% have more than 1000.

The green line is the distribution for your friends.  It is biased because more popular people are more likely to be your friends.  At the extremes, a person with no friends has no chance of being your friend, while a person with thousands of friends has a much higher probability.  In general, a person with n friends is n times more likely to be your friend than a person with 1 friend.

So if we choose a person at random, and then choose one of their friends at random, we get the biased distribution.  The difference is substantial, about an order of magnitude.  In the biased distribution, the mean is 308, the median 171.  So what are the chances that your friend is more popular than you?  In this distribution, about 84%.

Loser.

-----

To compute the biased distribution, we loop through the items in the PMF and multiply each probability by its corresponding value.  Here’s what the code looks like:

```python
def BiasPmf(pmf, name, invert=False):
    """Returns the Pmf with oversampling proportional to value.

    If pmf is the distribution of true values, the result is the
    distribution that would be seen if values are oversampled in
    proportion to their values; for example, if you ask students
    how big their classes are, large classes are oversampled in
    proportion to their size.

    If invert=True, computes the inverse operation; for example,
    unbiasing a sample collected from students.

    Args:
        pmf: Pmf object.
        name: string name for the new Pmf.
        invert: boolean

    Returns:
        Pmf object
    """
    new_pmf = pmf.Copy()
    new_pmf.name = name

    for x, p in pmf.Items():
        if invert:
            new_pmf.Mult(x, 1.0/x)
        else:
            new_pmf.Mult(x, x)

    new_pmf.Normalize()
    return new_pmf
```

This code uses the PMF library, which you can read about in my book, Think Stats.
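If you don't have the Pmf library handy, the same computation can be sketched with plain dictionaries.  The distribution below is a toy example I made up, not the Slashdot data:

```python
# Toy friend-count distribution (made up; the real numbers come from
# the Slashdot dataset, not from this sketch).
pmf = {1: 0.5, 10: 0.3, 100: 0.15, 1000: 0.05}

# A person with n friends is n times more likely to be your friend,
# so reweight each probability by its value and renormalize.
biased = {n: p * n for n, p in pmf.items()}
total = sum(biased.values())
biased = {n: p / total for n, p in biased.items()}

# Probability that a randomly chosen person's randomly chosen friend
# has more connections than the person does.
prob = sum(p * q
           for you, p in pmf.items()
           for friend, q in biased.items()
           if friend > you)
print(prob)
```

Even with this toy distribution, the randomly chosen friend is more popular most of the time.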

## Monday, January 17, 2011

### Obesity epidemic cured!

I have good news to report -- the obesity epidemic is over.  And the solution turns out to be the best kind: simple and completely painless.  The solution is self-reporting.  Instead of weighing people, all we have to do is ask them what they weigh.  Problem solved!

I got data from the Behavioral Risk Factor Surveillance System (BRFSS), run by the National Center for Chronic Disease Prevention and Health Promotion.  In 2008, they surveyed 414,509 respondents and asked about their demographics, health and health risks.  In particular, they asked respondents for their weight now and their weight one year ago.

By taking the difference, we can characterize weight change in the population.  And the news is good.  According to these reports, the average weight change was -0.64 kilos; that is, the average respondent lost 1.4 pounds.

Of course, this result is unrelated to reality.  In fact, most adults gain about a pound a year.  The most likely explanation is some combination of inaccurate recall, self-delusion, and attempts by respondents to impress surveyors.

But before making harsh judgments, let’s look more closely at the data.  Before computing the mean change, I discarded a few outliers -- anyone who reported a change of more than 40 kg.  Even so, the mean can be misleading, so we should look at the distribution.  This figure shows the cumulative distribution (CDF) of reported weight changes:

About 25% of respondents reported a weight loss, 52% reported no change, and 23% reported a gain.  The distribution is roughly symmetric, but we can get a better look by plotting CDFs for gains and losses:

The curves have the same shape, but the distribution of losses is shifted to the right.  This result suggests that we can’t blame a small number of wildly inaccurate respondents; delusion seems to be widespread.

These data demonstrate the dangers of self-reporting.  Even when surveys are administered carefully, people tend to remember wrongly, kid themselves, and portray themselves in a positive light.

But if you want the obesity epidemic to go away, just ask!

----------

If you are interested in topics like this, you might like my book, Think Stats: Probability and Statistics for Programmers.

## Monday, January 10, 2011

### Observer effect in relay races

In most foot races, everyone starts at the same time.  If you are a fast runner, you usually pass a lot of people at the beginning of the race, but after a few miles everyone around you is going at the same speed.

Last September I ran the Reach the Beach relay, where teams of 12 run 209 miles in New Hampshire from Cannon to Hampton Beach.  While I was running my second leg, I noticed an odd phenomenon: when I overtook another runner, I was usually much faster, and when other runners overtook me, they were usually much faster.

At first I thought that the distribution of speeds might be bimodal; that is, there were many slow runners and many fast runners, but few at my speed.  Then I realized that I was the victim of selection bias.

The race was unusual in two ways: it used a staggered start, so teams started at different times; also, many teams included runners at different levels of ability.  As a result, runners were spread out along the course with little relationship between speed and location.  When I started my leg, the runners near me were (pretty much) a random sample of the runners in the race.

So where does the bias come from?  During my time on the course, the chance of overtaking a runner, or being overtaken, is proportional to the difference in our speeds.  To see why, think about the extremes.  If another runner is going at the same speed as me, neither of us will overtake the other.  If someone is going so fast that they cover the entire course while I am running, they are certain to overtake me.

To see what effect this has on the distribution of speeds, I downloaded the results from a race I ran last spring (the James Joyce Ramble 10K in Dedham MA) and converted the pace of each runner to MPH.  Here’s what the probability mass function (PMF) of speeds looks like in a normal road race (not a relay):

It is bell-shaped, which suggests a Gaussian distribution.  There are more fast runners than we would expect in a Gaussian distribution, but that’s a topic for another post.

Now, let’s see what this looks like from the point of view of a runner in a relay race going 7.5 MPH.  For each speed, x, I apply a weight proportional to abs(x-7.5).  The result looks like this:

It’s bimodal, with many runners faster and slower than the observer, but few runners at or near the same speed.  So that’s consistent with my observation while I was running.  (The tails are spiky, but that’s an artifact of the small sample size.  I could apply some smoothing, but I like to keep data-mangling to a minimum.)
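The reweighting is easy to reproduce with a discretized bell curve (the parameters below are made up for illustration, not fit to the race data):

```python
import math

# A discretized bell curve of runner speeds: mean 7.5 MPH, sd 1.5 MPH
# (made-up parameters, not the James Joyce Ramble results).
speeds = [x / 10 for x in range(40, 121)]  # 4.0 to 12.0 MPH
pmf = {x: math.exp(-0.5 * ((x - 7.5) / 1.5) ** 2) for x in speeds}
total = sum(pmf.values())
pmf = {x: p / total for x, p in pmf.items()}

# Reweight by |x - 7.5|: the chance of an overtaking event is
# proportional to the difference between the runner's speed and
# the observer's.
observed = {x: p * abs(x - 7.5) for x, p in pmf.items()}
total = sum(observed.values())
observed = {x: p / total for x, p in observed.items()}

# The observer's own speed now has zero probability, leaving a
# mode on each side of 7.5 MPH.
print(observed[7.5])  # 0.0
```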

One of the nice things about long-distance running is that you have time to think about things like this.

-----

Appendix: Here’s the code I used to compute the observed speeds:

```python
def BiasPmf(pmf, speed, name=None):
    """Returns a new Pmf representing speeds observed at a given speed.

    The chance of observing a runner is proportional to the
    difference in speed.

    Args:
        pmf: distribution of actual speeds
        speed: speed of the observing runner
        name: string name for the new dist

    Returns:
        Pmf object
    """
    new = pmf.Copy(name=name)
    for val, prob in new.Items():
        diff = abs(val - speed)
        new.Mult(val, diff)
    new.Normalize()
    return new
```

This code uses the PMF library, which you can read about in my book, Think Stats.

## Tuesday, January 4, 2011

### Proofiness and elections

I am enjoying Charles Seife’s Proofiness, but have to point out what I think is a flaw.  A major part of the book addresses errors in elections and election polls, and Seife presents a detailed analysis of two contested elections: Franken vs. Coleman in Minnesota and Gore vs. Bush in Florida.

In both cases the margin of victory was (much) smaller than the margin of error, and Seife concludes that the result should be considered a tie.  He points out:
“...both states, by coincidence, happen to break a tie in exactly the same way.  In the case of a tie vote, the winner shall be determined by lot.  In other words, flip a coin.  It’s hard to swallow, but the 2008 Minnesota Senate race and, even more startling, the 2000 presidential election should have been settled with the flip of a coin.”
So far, I agree, but Seife implies that this solution would avoid the legal manipulations that follow close elections (which he describes in entertaining detail, including a contested ballot in Minnesota full of write-in votes for “Lizard People”).

But Seife doesn’t solve the problem; he only moves it.  Instead of two outcomes (A wins and B wins) there are three (A wins, statistical tie, B wins), but the lines between outcomes are just as sharp, so any election that approaches them will be just as sharply contested.

As always, the correct solution is easy if you just apply Bayesian statistics.  In order to compute the margin of error, we need a model of the error process (for example, there is a chance that any vote will be lost, and a chance that any vote will be double counted).  With a model like that, we can use the observed vote count to compute the probability that either candidate received more votes.  Then we toss a biased coin, with the computed probability, to determine the winner.

For example, the final count in the 2008 Minnesota Senate race was 1,212,629 votes for Franken and 1,212,317 votes for Coleman.  But the actual number of votes cast for each candidate could easily be 1000 more or less than the count, so it is possible that Coleman actually received more votes; if we do the math, we might find that the probability is 30%.  In that case, I suggest, we should choose a random number between 1 and 100, and if it is less than 30, Coleman wins.
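Here is a sketch of that procedure under a deliberately simplified error model: assume the error in the reported margin is roughly normal with a standard deviation chosen by fiat (a real analysis would derive it from a model of lost and double-counted votes).

```python
import math
import random

# Reported counts from the 2008 Minnesota Senate race.
franken, coleman = 1212629, 1212317
margin = franken - coleman   # 312 votes
sigma = 1000.0               # assumed sd of the counting error (made up)

def normal_cdf(x, mu=0.0, sd=1.0):
    """Cumulative distribution of a normal, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sd * math.sqrt(2))))

# Probability that Coleman actually received more votes: the chance
# that the true margin is negative under this error model.
p_coleman = normal_cdf(0, mu=margin, sd=sigma)

# Toss a biased coin with that probability to pick the winner.
winner = 'Coleman' if random.random() < p_coleman else 'Franken'
print(round(p_coleman, 3), winner)
```

With sigma = 1000, the probability that Coleman was actually ahead comes out near 38%, in the same ballpark as the 30% figure used in the example above; the point is the procedure, not the particular number.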

This system has a few nice properties: it solves the problem Seife calls “disestimation,” which is the tendency to pretend that measurements are more precise than they are.  And it is fair in the sense that the probability of winning equals the probability of receiving more votes.

There are drawbacks.  One is that, in order to model the error process, we have to acknowledge that there are errors, which is unpalatable.  The other is that it is possible for a candidate with a higher vote count to lose.  That outcome would be strange, but certainly no stranger than the result of the 2000 presidential election.

-----

If you find this sort of thing interesting, you might like my free statistics textbook, Think Stats. You can download it or read it at thinkstats.com.