When I wrote about how interesting the results of a basketball tournament were, and came to the conclusion that it was 63 (and filled in that I meant 63 bits of information), I was careful to say that the outcome of a basketball game between two evenly-matched opponents has an information content of 1 bit. If the game is a foregone conclusion, then the game hasn’t got so much information about it. If the game really is foregone, the information content is 0 bits; you already know what the result will be. If the game is an almost sure thing, there’s very little information to be had from actually seeing the game. An upset might be thrilling to watch, but you would hardly count on that, if you’re being rational. But most games aren’t sure things; we might expect the higher-seed to win, but it’s plausible they don’t. How does that affect how much information there is in the results of a tournament?
Last year, the NCAA College Men’s Basketball tournament inspired me to look up what the outcomes of various types of matches were, and which teams were more likely to win than others. If some person who wrote something for statistics.about.com is correct, based on 27 years of March Madness outcomes, the play between a number one and a number 16 seed is a foregone conclusion — the number one seed always wins — while number two versus number 15 is nearly sure. So while the first round of play will involve 32 games — four regions, each region having eight games — there’ll be something less than 32 bits of information in all these games, since many of them are so predictable.
If we take the results from that statistics.about.com page as accurate and reliable as a way of predicting the outcomes of various-seeded teams, then we can estimate the information content of the first round of play at least.
Here’s how I work it out, anyway:
|Contest||Probability the Higher Seed Wins||Information Content of this Outcome|
|#1 seed vs #16 seed||100%||0 bits|
|#2 seed vs #15 seed||96%||0.2423 bits|
|#3 seed vs #14 seed||85%||0.6098 bits|
|#4 seed vs #13 seed||79%||0.7415 bits|
|#5 seed vs #12 seed||67%||0.9149 bits|
|#6 seed vs #11 seed||67%||0.9149 bits|
|#7 seed vs #10 seed||60%||0.9710 bits|
|#8 seed vs #9 seed||47%||0.9974 bits|
So if the eight contests in a single region were all evenly matched, the information content of that region would be 8 bits. But there’s one sure and one nearly-sure game in there, and there’s only a couple games where the two teams are close to evenly matched. As a result, I make out the information content of a single region to be about 5.392 bits of information. Since there’s four regions, that means the first round of play — the first 32 games — have altogether about 21.567 bits of information.
A statistical analysis of the tournaments which I dug up last year indicated that in the last three rounds — the Elite Eight, Final Four, and championship game — the higher- and lower-seeded teams are equally likely to win, and therefore those games have an information content of 1 bit per game. The last three rounds therefore have 7 bits of information total.
Unfortunately, experimental data seems to fall short for the second round — 16 games, where the 32 winners in the first round play, producing the Sweet Sixteen teams — and the third round — 8 games, producing the Elite Eight. If someone’s done a study of how often the higher-seeded team wins I haven’t run across it.
There are six of these games in each of the four regions, for 24 games total. Presumably the higher-seeded is more likely than the lower-seeded to win, but I don’t know how much more probable it is the higher-seed will win. I can come up with some bounds: the 24 games total in the second and third rounds can’t have an information content less than 0 bits, since they’re not all foregone conclusions. The higher-ranked seed won’t win all the time. And they can’t have an information content of more than 24 bits, since that’s how much there would be if the games were perfectly even matches.
So, then: the first round carries about 21.567 bits of information. The second and third rounds carry between 0 and 24 bits. The fourth through sixth rounds (the sixth round is the championship game) carry seven bits. Overall, the 63 games of the tournament carry between 28.567 and 52.567 bits of information. I would expect that many of the second-round and most of the third-round games are pretty close to even matches, so I would expect the higher end of that range to be closer to the true information content.
Let me make the assumption that in this second and third round the higher-seed has roughly a chance of 75 percent of beating the lower seed. That’s a number taken pretty arbitrarily as one that sounds like a plausible but not excessive advantage the higher-seeded teams might have. (It happens it’s close to the average you get of the higher-seed beating the lower-seed in the first round of play, something that I took as confirming my intuition about a plausible advantage the higher seed has.) If, in the second and third rounds, the higher-seed wins 75 percent of the time and the lower-seed 25 percent, then the outcome of each game is about 0.8113 bits of information. Since there are 24 games total in the second and third rounds, that suggests the second and third rounds carry about 19.471 bits of information.
Taking all these numbers, though — the first round with its something like 21.567 bits of information; the second and third rounds with something like 19.471 bits; the fourth through sixth rounds with 7 bits — the conclusion is that the win/loss results of the entire 63-game tournament are about 48 bits of information. It’s a bit higher the more unpredictable the games involving the final 32 and the Sweet 16 are; it’s a bit lower the more foregone those conclusions are. But 48 bits sounds like a plausible enough answer to me.