Correlations are often reported in terms of “explained variability”: a correlation of 0.53 corresponds to R^2 = 0.53^2 ≈ 28%, so SAT scores “explain” 28% of the variance in the data. But even that overstates the predictive power of SAT scores, because it is expressed in terms of variance, not standard deviation.

To make that more concrete, suppose you are trying to guess students’ first-year grades; how much better could you guess if you knew their SAT scores? Without using SAT scores, the best you could do is to guess the mean, which is 2.97 in the College Board study. In that case your root mean squared error (RMSE) would be the standard deviation, 0.71.

Using SAT scores, you could reduce the RMSE to 0.60, a 15% improvement. That’s a little better, but 0.6 is the difference between a B average and an A−, and (assuming roughly normal errors) you should expect to be off by at least that much 32% of the time.
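The relationship between correlation and RMSE reduction can be checked with a quick calculation. This is a sketch; the 0.71 and 0.53 are the values from the College Board study quoted above, and `rmse_given_correlation` is a helper name I made up:

```python
from math import sqrt

def rmse_given_correlation(std, r):
    """RMSE of the best linear prediction of an outcome with
    standard deviation `std`, using a predictor with correlation `r`."""
    return std * sqrt(1 - r**2)

std = 0.71   # standard deviation of first-year GPA (College Board study)
r = 0.53     # correlation between SAT scores and first-year GPA

baseline = std                              # RMSE when guessing the mean
improved = rmse_given_correlation(std, r)   # RMSE using SAT scores
reduction = 1 - improved / baseline

print(round(improved, 2), round(reduction, 2))   # roughly 0.6 and 0.15
```

Note that a 28% reduction in variance translates to only a 15% reduction in RMSE, which is why reporting R^2 overstates the practical benefit.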

At colleges with strict admissions standards, the correlation tends to be even lower. There are two reasons for this effect. One is explained in an article by John Allen Paulos, who points out that “Colleges usually accept students from a fairly narrow swath of the SAT spectrum,” and “the degree of correlation between two variables depends on the range of the variables considered.”

To see why that’s true, consider Harvard, where the 25th percentile for the combined SAT is 2100 (see satscores.us). So 75% of the students averaged 700 or better on the three parts of the exam. That’s a “fairly narrow swath.”

In fact the problem is even worse than Paulos suggests, because the SAT doesn’t discriminate very well between people at the top of the scale. For example, the difference between a 700 and a 770 can be as few as 4 questions, which might be due to chance.

To quantify that, I will compute the chance that someone who gets a 700 is actually better than someone who gets a 770. By “better,” I mean only “better at answering SAT questions,” which I will describe with P, the probability of getting a question right.

To estimate a student’s value of P, we start with a prior distribution and use the test score to perform an update. We can get a reasonable prior distribution by taking the distribution of scaled scores, mapping them back to raw scores, and converting raw scores to percentages.
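As a sketch of that construction (the raw-score counts below are made up for illustration; the real analysis uses the College Board tables described next):

```python
import numpy as np

# Hypothetical counts of test takers by raw score (out of 54).
# The actual analysis uses the distribution from College Board data tables.
raw_counts = {20: 1100, 30: 2800, 40: 3100, 45: 2400, 50: 900, 54: 120}

# Map raw scores to percentages and normalize the counts,
# yielding a prior PMF over P, the probability of a correct answer.
scores = sorted(raw_counts)
ps = np.array(scores) / 54
weights = np.array([raw_counts[r] for r in scores], dtype=float)
prior = weights / weights.sum()
```

The result is a discrete distribution over possible values of P, weighted by how common each level of ability is in the population.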

I got the distribution of scores and the scale from College Board data tables. The distribution of raw scores is roughly normal (by design). This figure shows a normal probability plot, which compares the actual values to random values from a normal distribution:

The red line shows a linear fit to the data. At both ends, the data deviate from the model, which indicates that the range of the test is not as wide as the range of ability; if it were, the top score would be 80, not 54.

Nevertheless, we can convert these scores to percentages (54/54 = 100%) and use the distribution of percentages as a prior for P. Then to perform a Bayesian update, we compute the likelihood of a score, R, given P:

```python
def Likelihood(self, R, P):
    right = R
    wrong = 54 - R
    return pow(P, right) * pow(1 - P, wrong)
```

This is an unnormalized binomial probability; the binomial coefficient is a constant that cancels when the posterior is normalized. This figure shows the posterior distributions of P for hypothetical students with scores of 700 and 770:

The 90% credible interval for a student who gets a 700 is (650, 730); for a student who gets a 770 it is (700, 780). To put that in perspective, at Harvard the interquartile range for the math exam is 700-790. So a student who gets a 770 might really be in the 25th percentile of his or her class. And the chance is about 10% that a classmate who got a 700 is actually better (at answering SAT questions, anyway).
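The whole computation can be sketched end to end. For simplicity, this version uses a uniform prior over P rather than the empirical prior described above, and it assumes raw scores of 46/54 and 50/54 for the 700 and 770 (illustrative values, not the actual College Board scale), so its numbers differ somewhat from those in the post:

```python
import numpy as np

N = 54  # number of questions on one part of the exam

def posterior(raw, ps, prior):
    """Bayesian update with an unnormalized binomial likelihood."""
    likes = ps**raw * (1 - ps)**(N - raw)
    post = prior * likes
    return post / post.sum()

ps = np.linspace(0.01, 0.99, 99)      # grid of hypothetical values of P
prior = np.ones_like(ps) / len(ps)    # uniform prior (a simplification)

post_700 = posterior(46, ps, prior)   # assumed raw score for a 700
post_770 = posterior(50, ps, prior)   # assumed raw score for a 770

# Probability that the 700-scorer's P exceeds the 770-scorer's P,
# treating the two posteriors as independent.
prob = sum(post_700[i] * post_770[:i].sum() for i in range(len(ps)))
print(round(prob, 2))
```

Even with these simplifications, the overlap between the two posteriors is substantial, which is the point: a 70-point gap in scaled scores does not guarantee a difference in ability.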

-----

Notes: I made some simplifications, all of which tend to underestimate the margin of error for SAT scores.

1) Some SAT questions are easier than others, so the probability of getting each question right is not the same. For students at the top of the range, the probability of getting the easy questions is close to 1, so the observed differences are effectively based on a small number of more difficult questions.

2) The compression of scores at the high end also compresses the margins of error. For example, the credible interval for a student who gets a 770 only goes up to 780. This is clearly an underestimate.

3) I ignored the fact that students lose ¼ point for each wrong answer. Since students at the top end of the range usually answer every question, a student who misses 4 questions loses 5 points, amplifying the effect of random errors.
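The guessing penalty in note 3 works out like this (`raw_score` is a hypothetical helper for checking the arithmetic, not part of the original analysis):

```python
def raw_score(right, wrong):
    """SAT raw score: one point per correct answer,
    minus 1/4 point per wrong answer."""
    return right - wrong / 4

perfect = raw_score(54, 0)   # 54.0
missed4 = raw_score(50, 4)   # 49.0: missing 4 questions costs 5 points

print(perfect - missed4)     # 5.0
```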

I would like to say that I think these articles are great and I plan on reading a lot of them in the upcoming weeks. I like how they provide nice real situations that explain lots of problems faced in statistics.

I understand the distribution of the number of wrong responses is from a random distribution, but can we really state that the chance that someone has a question wrong is random?

Thanks for your kind words.

The model I present here is simple. It assumes that each test taker has some probability of answering each question correctly.

I present a more detailed model, based on item response theory, in this chapter of Think Bayes:

http://www.greenteapress.com/thinkbayes/html/thinkbayes013.html#toc99

The results from the detailed model are not very different, which suggests that the simple model is good enough.

Thanks Dr. Downey. I look forward to doing more reading.
