Wednesday, January 7, 2009

Fun with Statistics

I was working with my daughter on her AP Stats homework the other night. The section is all about using the normal model (the old Bell curve of infamy) based on information about proportions. Since normal distributions pop up fairly often in the real world, and proportions are a common way of expressing information about distributions, this is all good stuff with wide applicability.

It pays to be careful, however. Not every distribution is a normal distribution (the distribution of ages in Gaza, for example, is strongly skewed, since the population is overwhelmingly young) and not every sample is a reliable one. One thing I like about the presentation in this section is that all the problems require a check of a set of assumptions that should be met for the normal model to be applicable. Even better, some of the problems violate one or the other of the assumptions. Here are the conditions:
  1. The randomization condition: The sample needs to be an unbiased representative selection from the population.
  2. The 10% condition: The sample represents less than 10% of the population.
  3. Success/failure condition: The sample needs to be large enough that the expected numbers of successes and failures are both at least 10. So if the sample size is n and the proportion is p, then both np >= 10 and n(1-p) >= 10.
Undoubtedly the last couple of assumptions will get mapped to some stronger mathematical basis involving confidence levels and the like later on (Statistics is annoyingly different from the rest of mathematics, being built from the top down instead of the bottom up), but this is a nice set of rules of thumb to apply to many statistical claims floating about out there; a rough sketch of the checks in code appears below.
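Here is one way the three rules of thumb might look as code. The function name, the randomized flag, and the decision to treat an unknown population size as "effectively unlimited" are all my own choices for illustration, not anything from the textbook:

```python
def normal_model_ok(n, p, population_size=None, randomized=True):
    """Rough check of the three rules of thumb for using the normal model
    for a sample proportion. (Names and structure are just one way to
    organize the checks.)"""
    return {
        # 1. Randomization: no formula can verify this; the caller has to
        #    judge whether the sample is an unbiased selection.
        "randomization": randomized,
        # 2. The 10% condition: the sample is less than 10% of the population
        #    (treated as satisfied if the population is effectively unlimited).
        "ten_percent": population_size is None or n < 0.10 * population_size,
        # 3. Success/failure: expect at least 10 successes and 10 failures.
        "success_failure": n * p >= 10 and n * (1 - p) >= 10,
    }

# A made-up example: a sample of 50 from a population of 10,000 with a
# claimed proportion of 4%. The success/failure check fails, since the
# expected number of successes is only 50 * 0.04 = 2.
print(normal_model_ok(50, 0.04, population_size=10_000))
```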

So, given a proportion p and a sample size n that meet these conditions, what can we do? The fundamental operations all take place by computing areas under the normal curve to give probabilities. For example, given that 13% of the population is left-handed and an auditorium with 120 seats, 15 of which have left-handed desks, what is the probability that there will not be a left-handed desk available for some poor lefty?

If we know how many standard deviations away from the mean this cutoff (15 lefties out of 120) represents, there are standard tables and fancy calculators that give the probability that a value is less than (or, by subtracting from 1, greater than) that value. The standard deviation of the sample proportion is easy enough to compute: σ = √(p(1-p)/n) ≈ 0.0307

Now we want to know P(#lefties > 15) based on this model. The cutoff as a proportion is p̂ = 15/120 = 0.125, and its z-score (number of standard deviations from the mean) is computed simply as well: z = (p̂ - p)/σ = (0.125 - 0.13)/0.0307 ≅ -0.163
Looking this up, we get a cumulative probability of about 0.435 that the sample proportion will fall below 15/120, so the probability that it will be greater is 1 - 0.435 ≈ 0.565. So more than half the time, some poor lefty will have to do without a left-handed desk (not too surprising, since the expected number of lefties, 120*0.13 = 15.6, already exceeds the 15 desks available).
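For anyone who wants to check the arithmetic without a z-table, here is a minimal sketch in Python; the variable names are mine, and math.erf is used in place of a printed table to get the standard normal cumulative probability:

```python
import math

def normal_cdf(z):
    """Standard normal cumulative probability, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p = 0.13          # proportion of lefties in the population
n = 120           # seats (and students) in the auditorium
lefty_desks = 15  # left-handed desks available

sigma = math.sqrt(p * (1 - p) / n)   # sd of the sample proportion, ~0.0307
p_hat = lefty_desks / n              # 0.125, the cutoff proportion
z = (p_hat - p) / sigma              # ~ -0.163

# P(sample proportion > 15/120) = 1 - P(sample proportion <= 15/120)
prob_lefty_goes_without = 1 - normal_cdf(z)
print(round(sigma, 4), round(z, 3), round(prob_lefty_goes_without, 3))
# roughly 0.0307, -0.163, 0.565
```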

All this assumes that we meet the conditions in the first place, however. So, let's look. The success/failure condition is the easiest: np = 120*0.13 = 15.6 and n(1-p) = 120*0.87 = 104.4, both of which are at least 10. Does the sample represent less than 10% of the population? Well, that depends on what you take the "population" to be. You have to be a little careful to avoid getting fooled. Is the population "all possible students everywhere"? Then the sample surely does represent less than 10% of the population. If this auditorium is for the exclusive use of the fine folks of the East Krumwich School for the Sinister Arts, maybe a better characterization of the population is "all possible students of EKSSA" and it could be that 120 is not less than 10% of the population. But in all likelihood, the 10% condition will be met, or the auditorium was a very bad investment. What about randomization? Again, it all depends. If this auditorium is at a highly selective college, then you have to ask whether the selection the college applied in accepting students biases the sample in an important way with respect to left-handedness.
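Plugging the example's numbers into the same rules of thumb (the candidate population sizes below are made up purely to show how the 10% condition can go either way):

```python
n, p = 120, 0.13

# Success/failure condition: expected successes and failures both at least 10.
print(round(n * p, 1), round(n * (1 - p), 1))   # 15.6 and 104.4
print(n * p >= 10 and n * (1 - p) >= 10)        # True

# The 10% condition depends entirely on what you call "the population".
for population in (1_200, 50_000):              # hypothetical population sizes
    print(population, n < 0.10 * population)
# 1200 False  (120 is exactly 10% of 1,200, so the condition fails)
# 50000 True
```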

The randomization condition is where the assumptions of normality can really head south. It is very easy to come up with samples that are biased in some way or other. In the recent election, certain national polls were biased against younger voters because the sampling was conducted exclusively on land-lines, and a larger percentage of younger voters use cell-phones and have no land-line. Similarly, estimating how many people make confidential information available over the web by counting the Facebook users who post confidential information is useless: people who aren't going to make confidential information available over the web are less likely to use Facebook in the first place.

The take-home message here is, whenever you see some statement of the form "X% of Ys are Z", your very first question needs to be "how could the selection of Ys be biased?"

It is perhaps worth pointing out that there is also a fundamental assumption in applying the normal model in this way that you know what the proportion actually is. Since the usual way to come up with p is by performing some kind of statistical sampling, there is an uncomfortable circularity possible. Maybe the proportion of lefties in the world at large is 13%, but the proportion of lefties at an art school is going to be higher, because the population is different. (Or, alternatively, you can say that the selection of a sample of students in an art school auditorium is not an unbiased sampling of the population of students as a whole.)

Another interesting aspect of the normal model is that everything hinges on the variance. A distribution with greater variance is more flattened out, with larger tails; one with less variance is squished in towards the middle, with less in the tails. That is, loss of variance means reversion to the mean, and reversion to the mean means fewer extremes, and that means that the expected difference between two selections is smaller. At some level this is obvious (true by definition), but the implications can be interesting and are not always appreciated. Stephen Jay Gould devoted a whole book to this idea (Full House, ISBN 978-0609801406) with examples from such diverse arenas as baseball and the Cambrian explosion. The capsule summary is that in competitive arenas there is a long-term tendency to reduce variance, which means that the difference between the worst and the best gets smaller, which means that you no longer see the 22-0 blow-outs in English league play that you did in the 1880s, and you can make the case that it was easier for Babe Ruth to rack up a lot of home runs than it was for Hank Aaron, because Babe Ruth got to play against relatively worse pitchers more of the time. The long-term trend is towards mediocrity. (Although please note: the absolute mean may in fact be higher, and probably is, as there is a convergence on the techniques that get the most benefit to the limits of what is possible. So "mediocre" in the sense of "closer to the mean", not in the sense of "bad".)
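As a quick illustration of that point about the expected difference between two selections (my own toy simulation, not anything from Gould), the typical gap between two random draws from a normal distribution scales directly with the standard deviation, so shrinking the variance shrinks the typical best-versus-worst difference:

```python
import random

def mean_gap(sigma, mu=0.0, trials=100_000):
    """Average absolute difference between two independent normal draws."""
    total = 0.0
    for _ in range(trials):
        a = random.gauss(mu, sigma)
        b = random.gauss(mu, sigma)
        total += abs(a - b)
    return total / trials

# As the spread shrinks, so does the expected difference between any two
# random selections (theoretically 2*sigma/sqrt(pi), about 1.13*sigma).
for sigma in (4.0, 2.0, 1.0):
    print(sigma, round(mean_gap(sigma), 3))
```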

I think you can make the case that this also applies to the world of politics: as time goes by, the gap in excellence between candidates tends to narrow, so we have fewer really bad candidates as well as fewer really outstanding candidates, and the result is closer elections. Where are the statesmen of the caliber of the founding fathers (setting aside the bias from the rosy glow of the passage of time)? Less likely to appear. On the plus side, grossly inadequate candidates are less likely as well. The only way to escape from the trap of mediocrity and small differences in excellence is to change the game and jump to a completely new distribution: take steroids, start leveraging new social media, get a fancy high-tech swimsuit.