## My 2019 Mathematics A To Z: Chi-squared test

Today’s A To Z term is another from Mr Wu, of mathtuition88.com. The term does not, technically, start with X. But the Greek letter χ certainly looks like an X. And the modern English letter X traces back to that χ. So that’s near enough for my needs.

# χ2 Test.

The χ2 test is a creature of statistics. Creatures, really. But if one just says “the χ2 test” without qualification they mean Pearson’s χ2 test. Pearson here is a familiar name to anyone reading the biographical sidebar in their statistics book. He was Karl Pearson, who in the late 19th and early 20th century developed pretty much every tool of inferential statistics.

Pearson was, besides a ferocious mathematical talent, a white supremacist and eugenicist. This is something to say about many pioneers of statistics. Many of the important basics of statistics were created to prove that some groups of humans were inferior to the kinds of people who get offered an OBE. They were created at a time that white society was very afraid that it might be out-bred by Italians or something even worse. This is not to say the tools of statistics are wrong, or bad. It is to say that anyone telling you mathematics is a socially independent, politically neutral thing is a fool or a liar.

Inferential statistics is the branch of statistics used to test hypotheses. The hypothesis, generally, is about whether one sample of things is really distinguishable from a population of things. It is different from descriptive statistics, which is that thing I do each month when I say how many pages got a view and from how many countries. Descriptive statistics give us a handful of numbers with which to approximate a complicated thing. Both do valuable work, although I agree it seems like descriptive statistics are the boring part. Without them, though, inferential statistics has nothing to study.

The χ2 test works like many hypothesis-testing tools do. It takes two parts. One of these is the observations. We start with something that comes in two or more categories. Categories come in many kinds: the postal code where a person comes from. The color of a car. The number of years of schooling someone has had. The species of flower. What is important is that the categories be mutually exclusive. One has either been a smoker for more than one year or else one has not.

Count the number of observations of … whatever is interesting … for each category. There is some fraction of observations that belong to the first category, some fraction that belong to the second, some to the third, and so on. Find those fractions. This is all easy enough stuff, really. Counting and dividing by the total number of observations. Which is a hallmark of many inferential statistics tools. They are often tedious, involving a lot of calculation. But they rarely involve difficult calculations. Square roots are often where they top out.

That covers observations. What we also need are expectations. This is our hypothesis for what fraction “ought” to be in each category. How do you know what there “ought” to be? … This is the hard part of inferential statistics. Often we are interested in showing that some class is more likely than another to have whatever we’ve observed happen. So we can use as a hypothesis that the thing is observed just as much in one case as another. If we want to test whether one sample is indistinguishable from another, we use the proportions from the other sample. If we want to test whether one sample matches a theoretical ideal, we use that theoretical ideal. People writing probability and statistics problems love throwing dice. Let me make that my example. We hypothesize that on throwing a six-sided die a thousand times, each number comes up exactly one-sixth of the time.

It’s impossible that each number will come up exactly one-sixth of the time, in a thousand throws. We could only hope to achieve this if we tossed some ridiculous number like a thousand and two times. But even if we went to that much extra work, it’s impossible that each number would come up exactly the 167 times. Here I mean it’s “impossible” in the same way it’s impossible I could drop eight coins from my pocket and have them all come up tails. Undoubtedly, some number will be unlucky and just not turn up the full 167 times. Some other number will come up a bit too much. But it’s not required; it’s just like that. Some coin lands heads.

This doesn’t necessarily mean the die is biased. The question is whether the observations are too far off from the prediction. How far is that? For each category, take the difference between the observed frequency and the expected frequency. Square that. Divide it by the expected frequency. Once you’ve done that for every category, add up all these numbers. This is χ2. Do all this and you’ll get some nice nonnegative number like, oh, 5.094 or 11.216 or, heck, 20.482.
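The recipe in that paragraph is short enough to write out. Here is a minimal sketch in Python; the function name and the tallies are invented for illustration, and it works with counts rather than fractions, which is the usual convention for this statistic.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("need one expected count per observed count")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical tallies for faces 1 through 6 over 1,000 throws.
# These numbers are made up; they are not data from the essay.
observed_counts = [172, 158, 181, 149, 165, 175]
throws = sum(observed_counts)

# The fair-die hypothesis: every face expected one-sixth of the time.
expected_counts = [throws / 6] * 6

chi_squared = chi_squared_statistic(observed_counts, expected_counts)
print(f"chi-squared = {chi_squared:.3f}")
```

A die that matched the hypothesis exactly would give zero; every departure, in either direction, pushes the number up.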

The χ2 test works like many inferential-statistics tests do. It tells us how likely it is that, if the hypothetical expected values were right, random chance would give us data at least as far off as what we observed. The farther χ2 is from zero, the less likely it is this was pure chance. Which, all right. But how big does it have to be?

It depends on two important things. First is the number of categories that you have. Or, to use the lingo, the degrees of freedom in your problem. This is the total number of categories minus one. The greater the number of degrees of freedom, the bigger χ2 can be without it saying this difference can’t just be chance.

The second important thing is called the alpha level. This is a judgement call. This is how unlikely you want a result to be before you’ll declare that it couldn’t be chance. We have an instinctive idea of this. If you toss a coin twenty times and it comes up tails every time, you’ll declare that was impossible and the coin must be rigged. But it isn’t impossible. Start a run of twenty coin flips right now. You have a 0.000 095 37% chance of it being all tails. But I would be comfortable, on the 20th tail, to say something is up. I accept that I am ascribing to malice what is in fact just one of those things.
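The twenty-tails figure is a one-line calculation, assuming a fair coin and independent flips. The variable names here are only for illustration.

```python
flips = 20

# Each flip is tails with probability 1/2, and the flips are independent,
# so the chance of all tails is (1/2) multiplied by itself twenty times.
p_all_tails = 0.5 ** flips

print(f"{p_all_tails * 100:.8f} %")
```

That is one chance in 1,048,576, which is the tiny percentage quoted above.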

So the choice of alpha level is a measure of how willing we are to make a mistake in our conclusions. In a simple science like particle physics we can set very stringent standards. There are many particles around and we can smash them as long as the budget holds out. In more difficult sciences, such as epidemiology, we must let alpha be larger. We often accept an alpha of five percent or one percent.

What we must do, then, is find, for an alpha level and a number of degrees of freedom, what the threshold χ2 is. If the sample’s χ2 is below that threshold, OK. The observations are consistent with the hypothesis. If the sample’s χ2 is larger than that threshold, OK. If the hypothesis were right, there would be less than an alpha-level chance of observations landing this far from it. This is what most statistical inference tests are like. You calculate a number and check whether it is above or below a threshold. If it’s below the threshold, the observation is consistent with the hypothesis. If it’s above the threshold, there’s less than the alpha-level chance that a true hypothesis would produce an observation this far off.
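As a sketch of that compare-to-a-threshold step, take the die example: six categories, so five degrees of freedom. The two thresholds below are the standard table entries for five degrees of freedom at the five percent and one percent alpha levels. The function name and the wording of the verdicts are made up for illustration.

```python
# Upper-tail chi-squared thresholds for 5 degrees of freedom, copied from a
# standard table: 11.070 at alpha = 5 percent, 15.086 at alpha = 1 percent.
THRESHOLDS_5_DF = {0.05: 11.070, 0.01: 15.086}


def verdict(chi_squared, alpha):
    """Compare a sample's chi-squared to the looked-up threshold."""
    if chi_squared > THRESHOLDS_5_DF[alpha]:
        return "reject"
    return "consistent"


# The three sample values mentioned earlier as typical results.
for value in (5.094, 11.216, 20.482):
    for alpha in (0.05, 0.01):
        print(f"chi-squared {value:6.3f}, alpha {alpha:.2f}: {verdict(value, alpha)}")
```

The middle value is the interesting one: it is over the five percent threshold and under the one percent threshold, so the conclusion depends on the judgement call made about alpha.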

How do we find these threshold values? … Well, under no circumstances do we try to calculate those. They’re based on a thing called the χ2 distributions, the name you’d expect. They’re hard to calculate. There is no earthly reason for you to calculate them. You can find them in the back of your statistics textbook. Or do a web search for χ2 test tables. I’m sure Matlab has a function to give you this. If it doesn’t, there’s a function you can download from somebody to work it out. There’s no need to calculate that yourself. Which is again common to inferential statistics tests. You find the thresholds by just looking them up.
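Looking the number up remains the sensible route. For anyone who would rather have software do the looking, here is a hedged sketch that uses only Python's standard library. It relies on the finite-sum formulas for the upper tail of a χ2 distribution with a whole number of degrees of freedom; those formulas are standard but are not from this essay, and the function name is invented. It returns the chance of a χ2 at least as large as the one observed, which you then compare against alpha.

```python
import math


def chi_squared_tail(x, df):
    """Chance that a chi-squared variable with `df` degrees of freedom is >= x.

    Works only for whole-number df >= 1 and x >= 0.
    """
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^r / r! for r = 0 .. df/2 - 1.
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus
    # sqrt(2/pi) * exp(-x/2) * sum of x^(r - 1/2) / (1*3*...*(2r-1)),
    # for r = 1 .. (df - 1)/2.
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2 / math.pi) * math.exp(-half) * total


# Tail chances for the three sample values, with five degrees of freedom.
for value in (5.094, 11.216, 20.482):
    print(f"{value:6.3f}: {chi_squared_tail(value, 5):.4f}")
```

A statistics package will do the same job with less fuss and for non-integer cases too; this is only meant to show there is no magic behind the table.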

χ2 tests are just one of the hypothesis-testing tools of inferential statistics. They are a good example of such. They’re designed for observations that can be fit into several categories, and comparing those to an expected forecast. But the calculations one does, and the way one interprets them, are typical for these tests. Even the way they are more tedious than hard is typical. It’s a good example of the family of tools.

I have two letters, and one more week, to go in this series. I hope to have the letter Y published on Tuesday. All the other A-to-Z essays for this year are also at that link. Past A-to-Z essays are at this link, and for the end of this week I’ll feature two past essays at this link. Thank you for reading all this.

## How Much Might I Have Lost At Pinball?

After the state pinball championship last month there was a second, side tournament. It was a sort-of marathon event in which I played sixteen games in short order. I won three of them and lost thirteen, a disheartening record. The question I can draw from this: was I hopelessly outclassed in the side tournament? Is it plausible that I could do so awfully?

The answer would be “of course not”. I was playing against, mostly, the same people who were in the state finals. (A few who didn’t qualify for the finals joined the side tournament.) In that I had done well enough, winning seven games in all out of fifteen played. It’s implausible that I got significantly worse at pinball between the main and the side tournament. But can I make a logically sound argument about this?

In full, probably not. It’s too hard. The question is, did I win way too few games compared to what I should have expected? But what should I have expected? I haven’t got any information on how likely it should have been that I’d win any of the games, especially not when I faced something like a dozen different opponents. (I played several opponents twice.)

But we can make a model. Suppose that I had a fifty percent chance of winning each match. This is a lie in detail. The model contains lies; all models do. The lies might let us learn something interesting. Some people there I could only beat with a stroke of luck on my side. Some people there I could fairly often expect to beat. If we pretend I had the same chance against everyone, though, we get something that we can model. It might tell us something about what really happened.

If I play 16 matches, and have a 50 percent chance of winning each of them, then I should expect to win eight matches. But there’s no reason I might not win seven instead, or nine. Might win six, or ten, without that being too implausible. It’s even possible I might not win a single match, or that I might win all sixteen matches. How likely?

This calls for a creature from the field of probability that we call the binomial distribution. It’s “binomial” because it’s about stuff for which there are exactly two possible outcomes. This fits. Each match I can win or I can lose. (If we tie, or if the match is interrupted, we replay it, so there’s not another case.) It’s a “distribution” because we describe, for a set of some number of attempted matches, how the possible outcomes are distributed. The outcomes are: I win none of them. I win exactly one of them. I win exactly two of them. And so on, all the way up to “I win exactly all but one of them” and “I win all of them”.

To answer the question of whether it’s plausible I should have done so badly I need to know more than just how likely it is I would win only three games. I need to also know the chance I’d have done worse. If I had won only two games, or only one, or none at all. Why?

Here I admit: I’m not sure I can give a compelling reason, at least not in English. I’ve been reworking it all week without being happy at the results. Let me try pieces.

One part is that as I put the question — is it plausible that I could do so awfully? — isn’t answered just by checking how likely it is I would win only three games out of sixteen. If that’s awful, then doing even worse must also be awful. I can’t rule out even-worse results from awfulness without losing a sense of what the word “awful” means. Fair enough, to answer that question. But I made up the question. Why did I make up that one? Why not just “is it plausible I’d get only three out of sixteen games”?

Habit, largely. Experience shows me that the probability of any particular result turns out to be implausibly low. It isn’t quite that case here; there are only seventeen possible noticeably different outcomes of playing sixteen games. But there can be so many possible outcomes that even the most likely one isn’t likely.

Take an extreme case. (Extreme cases are often good ways to build an intuitive understanding of things.) Imagine I played 16,000 games, with a 50-50 chance of winning each one of them. It is most likely that I would win 8,000 of the games. But the probability of winning exactly 8,000 games is small: only about 0.6 percent. What’s going on there is that there’s almost the same chance of winning exactly 8,001 or 8,002 games. As the number of games increases the number of possible different outcomes increases. If there are 16,000 games there are 16,001 possible outcomes. It’s less likely that any of them will stand out. What saves our ability to predict the results of things is that the number of plausible outcomes increases more slowly. It’s plausible someone would win exactly three games out of sixteen. It’s impossible that someone would win exactly three thousand games out of sixteen thousand, even though that’s the same ratio of won games.
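The 16,000-game numbers can be checked with a short sketch. The binomial coefficients involved are enormous, so this works with logarithms (through `math.lgamma`) rather than multiplying everything out; the function name is made up for illustration.

```python
import math


def log_chance_of_exactly(wins, games, p=0.5):
    """Natural log of the chance of exactly `wins` wins in `games` games."""
    # log of "games choose wins", using lgamma(n + 1) = log(n!)
    log_ways = (math.lgamma(games + 1)
                - math.lgamma(wins + 1)
                - math.lgamma(games - wins + 1))
    return log_ways + wins * math.log(p) + (games - wins) * math.log(1 - p)


# The single most likely outcome: exactly 8,000 wins out of 16,000.
p_exactly_8000 = math.exp(log_chance_of_exactly(8000, 16000))
print(f"exactly 8,000 wins: {100 * p_exactly_8000:.2f} %")

# Exactly 3,000 wins is too unlikely to fit in an ordinary floating-point
# number, so report the base-10 logarithm of the chance instead.
log10_p_exactly_3000 = log_chance_of_exactly(3000, 16000) / math.log(10)
print(f"exactly 3,000 wins: about 10 to the power {log10_p_exactly_3000:.0f}")
```

The second figure is a probability with well over a thousand zeroes after the decimal point, which is a fair working definition of "impossible".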

Card games offer another way to get comfortable with this idea. A bridge hand, for example, is thirteen cards drawn out of fifty-two. But the chance that you were dealt the hand you just got? Impossibly low. Should we conclude from this all bridge hands are hoaxes? No, but ask my mother sometime about the bridge class she took that one cruise. “Three of sixteen” is too particular; “at best three of sixteen” is a class I can study.

Unconvinced? I don’t blame you. I’m not sure I would be convinced of that, but I might allow the argument to continue. I hope you will. So here are the specifics. These are the possible counts of wins, and the chance of having exactly that many wins, for sixteen matches, assuming a 50 percent chance of winning each:

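| Wins | Percentage |
|------|------------|
| 0 | 0.002 % |
| 1 | 0.024 % |
| 2 | 0.183 % |
| 3 | 0.854 % |
| 4 | 2.777 % |
| 5 | 6.665 % |
| 6 | 12.219 % |
| 7 | 17.456 % |
| 8 | 19.638 % |
| 9 | 17.456 % |
| 10 | 12.219 % |
| 11 | 6.665 % |
| 12 | 2.777 % |
| 13 | 0.854 % |
| 14 | 0.183 % |
| 15 | 0.024 % |
| 16 | 0.002 % |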

So the chance of doing as awfully as I had — winning zero or one or two or three games — is pretty dire. It’s a little above one percent.
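The table and that "little above one percent" can be reproduced with a few lines, under the same pretend model of a 50 percent chance in each of sixteen independent games. The names are invented for illustration.

```python
from math import comb

GAMES = 16


def chance_of_exactly(wins, games=GAMES):
    """Chance of exactly `wins` wins when each game is a 50-50 proposition."""
    # comb(games, wins) counts the win/loss sequences with that many wins;
    # each particular sequence has chance (1/2)^games.
    return comb(games, wins) / 2 ** games


for wins in range(GAMES + 1):
    print(f"{wins:2d}  {100 * chance_of_exactly(wins):6.3f} %")

# Doing as badly as three wins, or worse: add up the rows for 0, 1, 2 and 3.
chance_three_or_fewer = sum(chance_of_exactly(wins) for wins in range(4))
print(f"three or fewer wins: {100 * chance_three_or_fewer:.3f} %")
```

The sum works out to 697 chances in 65,536.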

Is that implausibly low? Is there so small a chance that I’d do so badly that we have to figure I didn’t have a 50-50 chance of winning each game?

I hate to think that. I didn’t think I was outclassed. But here’s a problem. We need some standard for what is “it’s implausibly unlikely that this happened by chance alone”. If there were only one chance in a trillion that someone with a 50-50 chance of winning any game would put in the performance I did, we could suppose that I didn’t actually have a 50-50 chance of winning any game. If there were only one chance in a million of that performance, we might also suppose I didn’t actually have a 50-50 chance of winning any game. But here there was only one chance in a hundred? Is that too unlikely?

It depends. We should have set a threshold for “too implausibly unlikely” before we started research. It’s bad form to decide afterward. There are some thresholds that are commonly taken. Five percent is often useful for stuff where it’s hard to do bigger experiments and the harm of guessing wrong (dismissing the idea I had a 50-50 chance of winning any given game, for example) isn’t so serious. One percent is another common threshold, again common in stuff like psychological studies where it’s hard to get more and more data. In a field like physics, where experiments are relatively cheap to keep running, you can gather enough data to insist on fractions of a percent as your threshold. Setting the threshold after is bad form.

In my defense, I thought (without doing the work) that I probably had something like a five percent chance of doing that badly by luck alone. It suggests that I did have a much worse than 50 percent chance of winning any given game.

Is that credible? Well, yeah; I may have been in the top sixteen players in the state. But a lot of those people are incredibly good. Maybe I had only one chance in three, or something like that. That would make the chance I did that poorly something like one in six, likely enough.
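The "one in six" estimate can be checked the same way, swapping the 50 percent chance for one chance in three. This is a sketch with made-up names, and it still assumes every game is independent and has the same chance of a win.

```python
from math import comb


def chance_at_most(max_wins, games, p):
    """Chance of `max_wins` or fewer wins in `games` games, each won with chance p."""
    return sum(comb(games, k) * p ** k * (1 - p) ** (games - k)
               for k in range(max_wins + 1))


one_in_three = chance_at_most(3, 16, 1 / 3)
fifty_fifty = chance_at_most(3, 16, 1 / 2)

print(f"three or fewer wins, 1-in-3 chance per game: {100 * one_in_three:.1f} %")
print(f"three or fewer wins, 50-50 chance per game:  {100 * fifty_fifty:.1f} %")
```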

It’s also plausible that games are not independent, that whether I win one game depends in some way on whether I won or lost the previous. It does feel like it’s easier to win after a win, or after a close loss. And it feels harder to win a game after a string of losses. I don’t know that this can be proved, not on the meager evidence I have available. And you can almost always question the independence of a string of events like this. It’s the safe bet.

## Proving Something With One Month’s Counting

One week, it seems, isn’t enough to tell the difference conclusively between the first bidder on Contestants Row having a 25 percent chance of winning — winning one out of four times — or a 17 percent chance of winning — winning one out of six times. But we’re not limited to watching just the one week of The Price Is Right, at least in principle. Some more episodes might help us, and we can test how many episodes are needed to be confident that we can tell the difference. I won’t be clever about this. I have a tool — Octave — which makes it very easy to figure out whether it’s plausible for something which happens 1/4 of the time to turn up only 1/6 of the time in a set number of attempts, and I’ll just keep trying larger numbers of attempts until I’m satisfied. Sometimes the easiest way to solve a problem is to keep trying numbers until something works.

In two weeks (or any ten episodes, really, as talked about above), with 60 items up for bids, a 25 percent chance of winning suggests the first bidder should win 15 times. A 17 percent chance of winning would be a touch over 10 wins. The chance of 10 or fewer successes out of 60 attempts, with a 25 percent chance of success each time, is about 8.6 percent, still none too compelling.
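The keep-trying-larger-numbers approach can be sketched in Python rather than Octave. Two things here are assumptions, not the essay's: the stopping rule (stop once the tail chance drops below five percent) and the choice to use one-sixth of the attempts, rounded down, as the number of wins to test against.

```python
from math import comb


def chance_at_most(max_wins, attempts, p):
    """Chance of `max_wins` or fewer successes in `attempts` tries."""
    return sum(comb(attempts, k) * p ** k * (1 - p) ** (attempts - k)
               for k in range(max_wins + 1))


ITEMS_PER_WEEK = 30   # six items up for bids, five episodes a week
ALPHA = 0.05          # assumed stopping threshold

weeks = 1
while True:
    attempts = weeks * ITEMS_PER_WEEK
    cutoff = attempts // 6   # wins expected if the true chance were 1/6
    tail = chance_at_most(cutoff, attempts, 0.25)
    print(f"{weeks} week(s), {attempts} items: "
          f"chance of {cutoff} or fewer wins at 25 percent = {100 * tail:.1f} %")
    if tail < ALPHA:
        break
    weeks += 1

weeks_needed = weeks
```

A stricter threshold, such as one percent, would keep the loop running for more weeks.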

Here we might turn to despair: 6,000 episodes — about 35 years of production — weren’t enough to give perfectly unambiguous answers about whether there were fewer clean sweeps than we expected. There were too few at the 5 percent significance level, but not too few at the 1 percent significance level. Do we really expect to do better with only 60 shows?

## What Can One Week Prove?

We have some reason to think the chance of winning an Item Up For Bids, if you’re the first one of the four to place bids — let’s call this the first bidder or first seat so there’s a name for it — is lower than the 25 percent which we’d expect if every contestant in The Price Is Right‘s Contestants Row had an equal shot at it. Based on the assertion that only one time in about six thousand episodes had all six winning bids in one episode come from the same seat, we reasoned that the chance for the first bidder — the same seat as won the previous bid — could be around 17 percent. My next question is how we could test this? The chance for the first bidder to win might be higher than 17 percent — around 1/6, which is near enough and easier to work with — or lower than 25 percent — exactly 1/4 — or conceivably even be outside that range.

The obvious thing to do is test: watch a couple episodes, and see whether it’s nearer to 1/6 or to 1/4 of the winning bids come from the first seat. It’s easy to tally the number of items up for bid and how often the first bidder wins. However, there are only six items up for bid each episode, and there are five episodes per week, for 30 trials in all. I talk about a week’s worth of episodes because it’s a convenient unit, easy to record on the Tivo or an equivalent device, easy to watch at The Price Is Right‘s online site, but it doesn’t have to be a single week. It could be any five episodes. But I’ll say a week just because it’s convenient to do so.

If the first seat has a chance of 25 percent of winning, we expect 30 times 1/4, or seven or eight, first-seat wins per week. If the first seat has a 17 percent chance of winning, we expect 30 times 1/6, or 5, first-seat wins per week. That’s not much difference. What’s the chance we see 5 first-seat wins if the first seat has a 25 percent chance of winning?
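That closing question is a binomial calculation: 30 trials, a 25 percent chance of success on each. A minimal sketch, with invented names, gives both the chance of exactly five first-seat wins and the chance of five or fewer.

```python
from math import comb

ITEMS = 30       # one week of items up for bids
P_WIN = 0.25     # the every-seat-is-equal hypothesis


def chance_of_exactly(wins):
    """Chance of exactly `wins` first-seat wins in a week."""
    return comb(ITEMS, wins) * P_WIN ** wins * (1 - P_WIN) ** (ITEMS - wins)


p_exactly_5 = chance_of_exactly(5)
p_at_most_5 = sum(chance_of_exactly(w) for w in range(6))

print(f"exactly five wins: {100 * p_exactly_5:.1f} %")
print(f"five or fewer wins: {100 * p_at_most_5:.1f} %")
```

Neither number is small enough to count as surprising, which is the sense in which one week cannot settle the matter.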

## Interpreting Drew Carey

If we’ve decided that at the significance level we find comfortable there are too few clean sweeps of any position in Contestants Row, the natural question is why there are so few. We estimated there should have been six clean sweeps, based on modelling clean-sweep occurrences as a binomial distribution. Something in the model went wrong. Let’s try to reason out what it was.

One assumption for a binomial distribution is that we have some trial, some event, which happens many times. Each episode is the obvious trial here. The outcome we’re interested in seeing has some probability of happening on each trial; there is indeed some probability of a clean sweep each episode. The binomial distribution assumes that this probability is constant for every trial, that it doesn’t become more or less likely the tenth or hundredth or thousandth time around, and this seems likely to hold for The Price Is Right episodes. Granted there is some chance of a clean sweep in one episode; what could be done to increase or decrease the likelihood from episode to episode?

## Finding, and Starting to Understand, the Answer

If the probability of having one or fewer clean sweep episodes of The Price Is Right out of 6,000 aired shows is a little over one and a half percent — and it is — and we consider outcomes whose probability is less than five percent to be so unlikely that we can rule them out as happening by chance — and, last time, we did — then there are improbably few episodes where all six contestants came from the same seat in Contestants Row, and we can usefully start looking for possible explanations as to why there are so few clean sweeps. At least, that’s the conclusion at our significance level, that five percent.

But there’s no law dictating that we pick that five percent significance level. If we picked a one percent significance level, which is still common enough and not too stringent, then we would say this might be fewer clean sweeps than we expected, but it isn’t so drastically few as to raise our eyebrows yet. And we would be correct to do so. Depending on the significance level, what we saw is either so few clean sweeps as to be suspicious, or it’s not. This is why it’s better form to choose the significance level before we know the outcome; doing it the other way around feels like drawing the bullseye after shooting the arrow.
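Here is a sketch of that comparison. The per-episode chance of a clean sweep is an assumption: it is taken to be six in 6,000, which is what an expectation of six sweeps in 6,000 episodes implies under a binomial model. The names are invented.

```python
from math import comb

EPISODES = 6000
P_SWEEP = 6 / EPISODES   # assumed: chosen so the expected count is six


def chance_of_exactly(sweeps):
    """Chance of exactly `sweeps` clean-sweep episodes out of EPISODES."""
    return (comb(EPISODES, sweeps)
            * P_SWEEP ** sweeps
            * (1 - P_SWEEP) ** (EPISODES - sweeps))


chance_one_or_fewer = chance_of_exactly(0) + chance_of_exactly(1)
print(f"one or fewer clean sweeps: {100 * chance_one_or_fewer:.2f} %")

for alpha in (0.05, 0.01):
    too_few = chance_one_or_fewer < alpha
    print(f"at the {alpha:.0%} level: {'too few' if too_few else 'not too few'}")
```

The same number lands on opposite sides of the two thresholds, which is the whole awkwardness.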

## The First Tail

We became suspicious of the number of clean sweeps in The Price Is Right when there were not the expected six of them in 6,000 episodes. The chance there would be only one was about one and a half percent, not very high. But are there so few clean sweeps that we should be suspicious? That is, is the difference between the expected number of sweeps and the observed number so large as to be significant? Is it too big to just result from chance?

This is significance testing: is whatever quantity we mean to observe dramatically less than what is expected? Is it dramatically more? Is it at least different? Are these differences bigger than what could be expected by mere chance? For every statistician’s favorite example, a tossed fair coin will come up tails half the time; that means, of twenty flips, there are expected to be ten tails. But there being merely nine or as many as twelve is reasonable. Three or fifteen tails may be a little unlikely. Zero or twenty seem impossible. There’s a point where if our observations are so different from what we expect then we have to reject the idea that our observations and our expectations agree.

It’s not enough to say there’s a probability of only 1.5 percent that there should be exactly one clean sweep episode out of 6,000, though. It’s unlikely that should happen, but if we look at it, it’s unlikely there should be any outcome. Even the most likely result of 6,000 episodes, six clean sweeps, has only about one chance in six of happening. That’s near the chance that the next person you meet will have a birthday in either September or November. That isn’t absurdly unlikely, but, the person betting against it has the surer deal.
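To see how thinly the probability is spread, one can list the chance of each individual count of clean sweeps. As before, the per-episode chance of one in a thousand is an assumption picked to make the expected count six, and the names are made up.

```python
from math import comb

EPISODES = 6000
P_SWEEP = 1 / 1000   # assumed per-episode chance of a clean sweep


def chance_of_exactly(sweeps):
    """Chance of exactly `sweeps` clean-sweep episodes out of EPISODES."""
    return (comb(EPISODES, sweeps)
            * P_SWEEP ** sweeps
            * (1 - P_SWEEP) ** (EPISODES - sweeps))


chances = {sweeps: chance_of_exactly(sweeps) for sweeps in range(13)}
for sweeps, chance in chances.items():
    print(f"{sweeps:2d} sweeps: {100 * chance:5.2f} %")

most_likely = max(chances, key=chances.get)

# The birthday comparison: September and November have 30 days each.
birthday_chance = (30 + 30) / 365
print(f"most likely count: {most_likely}, "
      f"chance {100 * chances[most_likely]:.1f} %; "
      f"birthday in September or November: {100 * birthday_chance:.1f} %")
```

No single row is large, which is why the test looks at a whole range of outcomes rather than one.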