Today’s A To Z term is another from Mr Wu, of mathtuition88.com. The term does not, technically, start with X. But the Greek letter χ certainly looks like an X. And the modern English letter X traces back to that χ. So that’s near enough for my needs.

# χ^{2} Test.

The χ^{2} test is a creature of statistics. Creatures, really. But if one just says “*the* χ^{2} test” without qualification they mean Pearson’s χ^{2} test. Pearson here is a familiar name to anyone reading the biographical sidebar in their statistics book. He was Karl Pearson, who in the late 19th and early 20th century developed pretty much every tool of inferential statistics.

Pearson was, besides a ferocious mathematical talent, a white supremacist and eugenicist. This is something to say about many pioneers of statistics. Many of the important basics of statistics were created to prove that some groups of humans were inferior to the kinds of people who get offered an OBE. They were created at a time that white society was very afraid that it might be out-bred by Italians or something even worse. This is not to say the tools of statistics are wrong, or bad. It is to say that anyone telling you mathematics is a socially independent, politically neutral thing is a fool or a liar.

Inferential statistics is the branch of statistics used to test hypotheses. The hypothesis, generally, is about whether one sample of things is really distinguishable from a population of things. It is different from descriptive statistics, which is that thing I do each month when I say how many pages got a view and from how many countries. Descriptive statistics give us a handful of numbers with which to approximate a complicated thing. Both do valuable work, although I agree it seems like descriptive statistics are the boring part. Without them, though, inferential statistics has nothing to study.

The χ^{2} test works like many hypothesis-testing tools do. It takes two parts. One of these is observations. We start with something that comes in two or more categories. Categories come in many kinds: the postal code where a person comes from. The color of a car. The number of years of schooling someone has had. The species of flower. What is important is that the categories be mutually exclusive. One has either been a smoker for more than one year or else one has not.

Count the number of observations of … whatever is interesting … for each category. There is some fraction of observations that belong to the first category, some fraction that belong to the second, some to the third, and so on. Find those fractions. This is all easy enough stuff, really. Counting and dividing by the total number of observations. Which is a hallmark of many inferential statistics tools. They are often tedious, involving a lot of calculation. But they rarely involve *difficult* calculations. Square roots are often where they top out.
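The counting and dividing might look like this in Python. It is only a sketch: the list of car colors is made up for illustration, and the names `observations`, `counts`, and `fractions` are my own.

```python
from collections import Counter

# Hypothetical observations: the color of each car seen in a parking lot.
observations = ["red", "blue", "red", "white", "blue", "red", "white", "red"]

counts = Counter(observations)   # number of observations in each category
total = sum(counts.values())     # total number of observations

# Fraction of all observations that fall in each category.
fractions = {category: n / total for category, n in counts.items()}

for category in counts:
    print(category, counts[category], fractions[category])
```

Nothing here is harder than a tally and a division, which is rather the point.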

That covers observations. What we also need are expectations. This is our hypothesis for what fraction “ought” to be in each category. How do you know what there “ought” to be? … This is the hard part of inferential statistics. Often we are interested in showing that some class is more likely than another to have whatever we’ve observed happen. So we can use as a hypothesis that the thing is observed just as much in one case as another. If we want to test whether one sample is indistinguishable from another, we use the proportions from the other sample. If we want to test whether one sample matches a theoretical ideal, we use that theoretical ideal. People writing probability and statistics problems love throwing dice. Let me make that my example. We hypothesize that on throwing a six-sided die a thousand times, each number comes up exactly one-sixth of the time.

It’s impossible that each number will come up exactly one-sixth of the time, in a thousand throws. We could only hope to achieve this if we tossed some ridiculous number like a thousand and two times. But even if we went to that much extra work, it’s impossible that each number would come up exactly the 167 times. Here I mean it’s “impossible” in the same way it’s impossible I could drop eight coins from my pocket and have them all come up tails. Undoubtedly, some number will be unlucky and just not turn up the full 167 times. Some other number will come up a bit too much. But it’s not *required*; it’s just like that. Some coin lands heads.
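If you would rather not throw a die a thousand and two times, a simulation can stand in. This is a sketch that assumes Python's pseudorandom generator is a fair enough die. The seed is an arbitrary choice of mine, there only so the run can be repeated.

```python
import random
from collections import Counter

rng = random.Random(2019)  # arbitrary seed, so the run is repeatable

# Throw a simulated fair six-sided die 1002 times.
throws = [rng.randint(1, 6) for _ in range(1002)]
counts = Counter(throws)

# Show each face, how often it came up, and how far that is from 167.
for face in range(1, 7):
    print(face, counts[face], counts[face] - 167)
```

The surpluses and shortfalls have to add up to zero, since the counts add up to 1002. Try a few seeds and see how rarely every face lands on 167.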

This doesn’t necessarily mean the die is biased. The question is whether the observations are too far off from the prediction. How far is that? For each category, take the difference between the observed count and the expected count. Use the counts here, not the fractions; the expected count is the expected fraction times the total number of observations. Square that difference. Divide it by the expected count. Once you’ve done that for every category, add up all these numbers. This is χ^{2}. Do all this and you’ll get some nice nonnegative number like, oh, 5.094 or 11.216 or, heck, 20.482.
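Here is that recipe as a short Python sketch. The function name `chi_square` is mine, and the six counts are invented for the example. They are not from any real die. The function works on counts rather than fractions, so the expected count for each face is the total number of throws divided by six.

```python
def chi_square(observed, expected):
    """Sum of (observed - expected)**2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("need one expected count per observed count")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Invented counts for faces 1 through 6 in a thousand throws.
observed = [160, 170, 155, 180, 165, 170]
n = sum(observed)

# Hypothesis: every face comes up one-sixth of the time.
expected = [n / 6] * 6

statistic = chi_square(observed, expected)
print(round(statistic, 3))
```

For these made-up counts the statistic works out to about 2.3.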

The χ^{2} test works like many inferential-statistics tests do. It tells us how likely it is that, if the hypothetical expected values were right, random chance would give us the observed data. The farther χ^{2} is from zero, the less likely it is this was pure chance. Which, all right. But how big does it have to be?

It depends on two important things. First is the number of categories that you have. Or, to use the lingo, the degrees of freedom in your problem. This is the total number of categories minus one. The greater the number of degrees of freedom, the bigger χ^{2} can be without it saying this difference can’t just be chance.

The second important thing is called the alpha level. This is a judgement call. This is how unlikely you want a result to be before you’ll declare that it couldn’t be chance. We have an instinctive idea of this. If you toss a coin twenty times and it comes up tails every time, you’ll declare that was impossible and the coin must be rigged. But it isn’t *impossible*. Start a run of twenty coin flips right now. You have a 0.000 095 37% chance of it being all tails. But I would be comfortable, on the 20th tail, to say something is up. I accept that I am ascribing to malice what is in fact just one of those things.
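The arithmetic behind that percentage is short enough to check. This sketch assumes a fair coin and independent flips. The name `p_all_tails` is mine.

```python
# Each flip of a fair coin is tails with probability one-half,
# and twenty independent flips multiply together.
p_all_tails = 0.5 ** 20

print(p_all_tails)                   # as a probability
print(f"{p_all_tails * 100:.8f}%")   # as a percentage
```

That is one chance in 1,048,576, which agrees with the figure above.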

So the choice of alpha level is a measure of how willing we are to make a mistake in our conclusions. In a simple science like particle physics we can set very stringent standards. There are many particles around and we can smash them as long as the budget holds out. In more difficult sciences, such as epidemiology, we must let alpha be larger. We often accept an alpha of five percent or one percent.

What we must do, then, is find, for an alpha level and a number of degrees of freedom, what the threshold χ^{2} is. If the sample’s χ^{2} is below that threshold, OK. The observations are consistent with the hypothesis. If the sample’s χ^{2} is larger than that threshold, OK. Were the hypothesis true, there would be less than an alpha-level chance of the observations landing this far off. This is what most statistical inference tests are like. You calculate a number and check whether it is above or below a threshold. If it’s below the threshold, the observation is consistent with the hypothesis. If it’s above the threshold, the hypothesis would produce something that far off less than the alpha-level fraction of the time.
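The above-or-below check is a one-liner. In this sketch I assume five degrees of freedom, as for a six-sided die, and an alpha of five percent. The threshold 11.070 is the usual textbook table value for that pair. The function name `exceeds_threshold` is my own, and the three sample values are just the illustrative numbers from earlier.

```python
def exceeds_threshold(statistic, threshold):
    """True when the sample's chi-squared is past the threshold."""
    return statistic > threshold

# Textbook table value for alpha = 0.05 and 5 degrees of freedom.
threshold = 11.070

for statistic in (5.094, 11.216, 20.482):
    if exceeds_threshold(statistic, threshold):
        verdict = "farther off than chance readily explains"
    else:
        verdict = "consistent with the hypothesis"
    print(statistic, verdict)
```

With these assumptions the first value falls under the threshold and the other two fall over it.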

How do we find these threshold values? … Well, under no circumstances do we try to calculate *those*. They’re based on things called the χ^{2} distributions, the name you’d expect. They’re hard to calculate. There is no earthly reason for you to calculate them. You can find them in the back of your statistics textbook. Or do a web search for χ^{2} test tables. I’m sure Matlab has a function to give you this. If it doesn’t, there’s a function you can download from somebody to work it out. There’s no need to calculate that yourself. Which is again common to inferential statistics tests. You find the thresholds by just looking them up.
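Looking them up can be literal. This sketch copies a corner of the standard table, covering one through six degrees of freedom at alphas of five percent and one percent, into a Python dictionary. The names `CRITICAL_VALUES` and `threshold_for` are my own. If you have SciPy installed, I believe `scipy.stats.chi2.ppf(1 - alpha, df)` gives the same numbers without the typing, but the code below does not depend on it.

```python
# A small piece of the usual chi-squared table:
# CRITICAL_VALUES[alpha][degrees of freedom] -> threshold.
CRITICAL_VALUES = {
    0.05: {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592},
    0.01: {1: 6.635, 2: 9.210, 3: 11.345, 4: 13.277, 5: 15.086, 6: 16.812},
}

def threshold_for(alpha, categories):
    """Look up the threshold for an alpha level and a number of categories."""
    degrees_of_freedom = categories - 1  # categories minus one
    return CRITICAL_VALUES[alpha][degrees_of_freedom]

# A six-sided die has six categories, so five degrees of freedom.
print(threshold_for(0.05, 6))
print(threshold_for(0.01, 6))
```

Anything outside this little table raises a `KeyError`, which is the code's way of sending you to the back of the textbook.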

χ^{2} tests are just one of the hypothesis-testing tools of inferential statistics. They are a good example of such. They’re designed for observations that can be fit into several categories, and comparing those to an expected forecast. But the calculations one does, and the way one interprets them, are typical for these tests. Even the way they are more tedious than hard is typical. It’s a good example of the family of tools.

I have two letters, and one more week, to go in this series. I hope to have the letter Y published on Tuesday. All the other A-to-Z essays for this year are also at that link. Past A-to-Z essays are at this link, and for the end of this week I’ll feature two past essays at this link. Thank you for reading all this.