Who Was Karl Pearson?

An offhanded joke in the Usenet newsgroup alt.fan.cecil-adams — a great spot for offhanded jokes, as the audience is reasonably demanding — to the effect that baseball may be a game of statistics, but this is ridiculous, prompted me to say I hoped the Pearson Chi-Squared Test had a good season, since it was at the core of my baseball statistics fantasy team. One respondent asked if this was connected to Pearson Publishing, which has recently screwed up its composition of standardized tests for New York State quite severely, including giving as a reading comprehension assignment a bit of nonsense composed to have no meaning, and letting twenty mistakes into the non-English translation of a math exam. There’s no connection of which I’m aware; but why not take a couple of paragraphs to talk about Karl Pearson?

Karl Pearson is one of the safe names to guess if you’re asked who invented some statistical tool. He even gave the name “standard deviation” to that quantity, which measures how spread out a collection of data is, and is the first thing a statistics student learns to calculate after getting the mean, median, and mode down. The Pearson Chi-Squared Test, besides having a wonderful sound to its name, is one of those rare mathematical things named for the person it ought to be named for; the test is a way of seeing whether a particular result from a discrete set of possible outcomes turns up so much more often than expected that the result is suspicious (or, equivalently, that there are improbably few occurrences of that result). He’s also behind a number of tools in the study of correlations: as one variable changes, how can we expect another variable to change?
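To make that a bit concrete — with die-roll counts I’ve invented for the purpose, not anything from a real experiment — the statistic behind the test is just the sum, over each possible outcome, of (observed count minus expected count) squared, divided by the expected count:

```python
def chi_squared(observed, expected):
    """Pearson's chi-squared statistic: sum of (O - E)^2 / E per category."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Made-up counts of each face in 60 rolls of a die we suspect of unfairness;
# a fair die should give about 10 of each face.
observed = [16, 15, 4, 6, 14, 5]
expected = [10] * 6

stat = chi_squared(observed, expected)
print(round(stat, 2))  # 15.4 — well above the 5% critical value (about
# 11.07 at five degrees of freedom), so these rolls look suspicious
```

The bigger the statistic, the further the observed counts stray from what chance alone would suggest; comparing it against the chi-squared distribution’s critical values turns that into a verdict.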

We encounter correlations all the time, nearly whenever we plot one variable against another. Usually when we hear of a study linking one thing to another, it’s a correlation between, say, incidents of heart disease and amounts of chocolate eaten per day. Finding such a correlation is interesting, because it can provoke a search for whether one quantity affects another and, if so, how much. Or, in the comment threads on every news article reporting it, it provokes people to insist that since correlation does not equal causation, finding this link between one property and another means we can stop looking for how the two affect one another, as it would be fallacious to suppose there was any causal link. This is among the reasons comment threads on news articles should never, ever be read.
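The most famous of Pearson’s correlation tools is the correlation coefficient that bears his name, a number between -1 and 1 measuring how well two variables move together. Here’s a little sketch of computing it from scratch — the data points are invented for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient: covariance of x and y,
    divided by the product of their spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented data: y mostly rises with x, but not perfectly.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r(xs, ys), 3))  # 0.775 — a fairly strong positive link
```

A value near 1 or -1 says the points lie close to a straight line; a value near 0 says knowing one variable tells you little about the other. None of which, as the comment threads will remind you, says anything by itself about what causes what.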

Where Pearson gets some notoriety, and the occasional mention in non-mathematics books (specifically, history and biology texts), is that he was one of the big figures in the eugenics movement of the late 19th and early 20th century. Many of the founding names in statistics were, and for about the reason you’d expect: it’s much easier to claim that people like you are superior to those breeding hordes if you have sheaves of carefully-reasoned symbols on top of piles of raw data to demonstrate how you must be more fit than the masses. With that sarcasm delivered, I must admit my ignorance here: I haven’t studied Pearson’s biography at any length, and I am unfit to express much of an opinion about what motivated his eugenics work. The late 19th and early 20th century was a time when, among other things, the implications of thermodynamics — that all things decay — and of evolution — that all things die off — were starting to penetrate the public consciousness in a way the usual thoughts of how things used to be better never had, and many people came to fear that the things which were good about civilization were being worn down, and might end within generations. The handful of quotes I’ve read from Pearson does not make me like him, but recall Robert Benchley’s much-quoted quote about quotes.

Pearson ended up in a long-running scientific feud with Sir Ronald Aylmer Fisher, one of the other safe bets if you need to name someone who invented some statistical tool. Fisher’s great work that gets in every Intro to Statistics book is the Analysis of Variance, which tries to measure how much of the variability of a quantity depends on two (or more) possible sources of it (and is abbreviated to ANOVA immediately after it is named). Pearson was more interested in large sample sizes, and Fisher in smaller ones, when it came to identifying the causes of correlations, and you can see how a difference like that makes it utterly impossible for two people to get along.
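A sketch of the arithmetic behind the simplest case, one-way ANOVA, with groups of numbers I’ve made up for the purpose: you split the total variability into the part between the group averages and the part within each group, and take their ratio (Fisher’s F statistic). A big ratio suggests the groups genuinely differ; a ratio near 1 suggests the differences are just noise.

```python
def one_way_anova_f(groups):
    """Fisher's F statistic for one-way ANOVA: the ratio of between-group
    variability to within-group variability, each scaled by its degrees
    of freedom."""
    all_vals = [v for g in groups for v in g]
    n, k = len(all_vals), len(groups)
    grand_mean = sum(all_vals) / n
    # Between-group sum of squares: how far each group mean sits from the
    # grand mean, weighted by group size.
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: scatter of each value about its own
    # group's mean.
    ssw = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Three invented groups with clearly different averages.
print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [4, 5, 6]]))  # 7.0
```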