According to this feel-good article in Scientific American, you are probably less popular than your friends. The author, John Allen Paulos, explains:

“...it’s more probable that we will be among a popular person’s friends simply because he or she has a larger number of them.”

The same logic explains the class size paradox I wrote about in Think Stats. If you ask students how big their classes are, more of them will report large classes, because more of them are in large classes.

To see how this works in social networks, I downloaded data from the Stanford Large Network Dataset Collection; one of the datasets there contains friend/foe information from the Slashdot Zoo. For each (anonymized) user, we have the number of other users he or she tagged as a friend or foe.

There are more than 82,000 users in the dataset. The average user has 55 friends and foes, but the median is only 18, which suggests that the distribution is skewed to the right. In fact, the distribution is heavy tailed, as shown in the figure below.

“...it’s more probable that we will be among a popular person’s friends simply because he or she has a larger number of them.”

The same logic explains the class size paradox I wrote about in Think Stats. If you ask students how big their classes are, more of them will report large classes, because more of them are in large classes.

To see how this works in social networks, I downloaded data from the Stanford Large Network Dataset Collection; one of the datasets there contains friend/foe information from the Slashdot Zoo. For each (anonymized) user, we have the number of other users he or she tagged as a friend or foe.

There are more than 82,000 users in the dataset. The average user has 55 friends and foes, but the median is only 18, which suggests that the distribution is skewed to the right. In fact, the distribution is heavy tailed, as shown in the figure below.

The blue line is the distribution for a randomly-chosen user. As expected, it is right-skewed, even on a log scale. Most people have fewer than 20 connections, but 15% have more than 100 and 0.2% have more than 1000.

The green line is the distribution for your friends. It is biased because more popular people are more likely to be your friend. In the extremes, a person with no friends has no chance to be your friend; a person with thousands of friends has a much higher probability. In general, a person with n friends is n times more likely than a person with 1 friend.

So if we choose a person at random, and then choose one of their friends at random, we get the biased distribution. The difference is substantial, about an order of magnitude. In the biased distribution, the mean is 308, the median 171. So what are the chances that your friend is more popular than you? In this distribution, about 84%.

Loser.

-----

To compute the biased distribution, we loop through the items in the PMF and multiply each probability by its corresponding value. Here’s what the code looks like:

def BiasPmf(pmf, name, invert=False):

"""Returns the Pmf with oversampling proportional to value.

If pmf is the distribution of true values, the result is the

distribution that would be seen if values are oversampled in

proportion to their values; for example, if you ask students

how big their classes are, large classes are oversampled in

proportion to their size.

If invert=True, computes the inverse operation; for example,

unbiasing a sample collected from students.

Args:

pmf: Pmf object.

invert: boolean

Returns:

Pmf object

"""

new_pmf = pmf.Copy()

new_pmf.name = name

for x, p in pmf.Items():

if invert:

new_pmf.Mult(x, 1.0/x)

else:

new_pmf.Mult(x, x)

new_pmf.Normalize()

return new_pmf

I'm working my way through your book, but cannot wrap my head around this.

ReplyDelete1. I don't understand the difference between the three PMFs: the one perceived by the Dean, biased PMF, and unbiased PMF.

2. I don't why multiplying by the value or 1.0/x creates the Biased and Unbiased PMF.