## Let Me Remind You How Interesting a Basketball Tournament Is

Several years ago I stumbled into a nice sequence. All my nice sequences have been things I stumbled upon. This one looked at the most basic elements of information theory by what they tell us about the NCAA College Basketball tournament. This is (in the main) a 64-team single-elimination playoff. It’s been a few years since I ran through the sequence. But it’s been a couple years since the tournament could be run with a reasonably clear conscience too. So here’s my essays:

And this spins off to questions about other sports events.

And I still figure to get to this year’s Pi Day comic strips. Soon. It’s been a while since I felt I had so much to write up.

## Let Me Tell You How Interesting March Madness Could Possibly Be

I read something alarming in the daily “Best of GoComics” e-mail this morning. It was a panel of Dave Whamond’s Reality Check. It’s a panel comic, although it stands out from the pack by having a squirrel character in the margins. And here’s the panel.

Certainly a solid enough pun to rate a mention. I don’t know of anyone actually doing a March Mathness bracket, but it’s not a bad idea. Rating mathematical terms for their importance or usefulness or just beauty might be fun. And might give a reason to talk about their meaning some. It’s a good angle to discuss what’s interesting about mathematical terms.

And that lets me segue into talking about a set of essays. The next few weeks see the NCAA college basketball tournament, March Madness. I’ve used that to write some stuff about information theory, as it applies to the question: is a basketball game interesting?

Along the way here I got to looking up actual scoring results from major sports. This let me estimate the information-theory content of the scores of soccer, (US) football, and baseball scores, to match my estimate of basketball scores’ information content.

• How Interesting Is A Football Score? Football scoring is a complicated thing. But I was able to find a trove of historical data to give me an estimate of the information theory content of a score.
• How Interesting Is A Baseball Score? Some Partial Results I found some summaries of actual historical baseball scores. Somehow I couldn’t find the detail I wanted for baseball, a sport that since 1845 has kept track of every possible bit of information, including how long the games ran, about every game ever. I made do, though.
• How Interesting Is A Baseball Score? Some Further Results Since I found some more detailed summaries and refined the estimate a little.
• How Interesting Is A Low-Scoring Game? And here, well, I start making up scores. It’s meant to represent low-scoring games such as soccer, hockey, or baseball to draw some conclusions. This includes the question: just because a distribution of small whole numbers is good for mathematicians, is that a good match for what sports scores are like?

## Is A Basketball Tournament Interesting? My Thoughts

It’s a good weekend to bring this back. I have some essays about information theory and sports contests and maybe you missed them earlier. Here goes.

And then for a follow-up I started looking into actual scoring results from major sports. This let me estimate the information-theory content of the scores of soccer, (US) football, and baseball scores, to match my estimate of basketball scores’ information content.

Don’t try to use this to pass your computer science quals. But I hope it gives you something interesting to talk about while sulking over your brackets, and maybe to read about after that.

## How Interesting Is March Madness?

And now let me close the week with some other evergreen articles. A couple years back I mixed the NCAA men’s basketball tournament with information theory to produce a series of essays that fit the title I’ve given this recap. They also sprawl out into (US) football and baseball. Let me link you to them:

## But How Interesting Is A Real Basketball Tournament?

When I wrote about how interesting the results of a basketball tournament were, and came to the conclusion that it was 63 (and filled in that I meant 63 bits of information), I was careful to say that the outcome of a basketball game between two evenly-matched opponents has an information content of 1 bit. If the game is a foregone conclusion, then the game hasn’t got so much information about it. If the game really is foregone, the information content is 0 bits; you already know what the result will be. If the game is an almost sure thing, there’s very little information to be had from actually seeing the game. An upset might be thrilling to watch, but you would hardly count on that, if you’re being rational. But most games aren’t sure things; we might expect the higher-seed to win, but it’s plausible they don’t. How does that affect how much information there is in the results of a tournament?

Last year, the NCAA College Men’s Basketball tournament inspired me to look up what the outcomes of various types of matches were, and which teams were more likely to win than others. If some person who wrote something for statistics.about.com is correct, based on 27 years of March Madness outcomes, the play between a number one and a number 16 seed is a foregone conclusion — the number one seed always wins — while number two versus number 15 is nearly sure. So while the first round of play will involve 32 games — four regions, each region having eight games — there’ll be something less than 32 bits of information in all these games, since many of them are so predictable.

If we take the results from that statistics.about.com page as accurate and reliable as a way of predicting the outcomes of various-seeded teams, then we can estimate the information content of the first round of play at least.

Here’s how I work it out, anyway:

Contest Probability the Higher Seed Wins Information Content of this Outcome
#1 seed vs #16 seed 100% 0 bits
#2 seed vs #15 seed 96% 0.2423 bits
#3 seed vs #14 seed 85% 0.6098 bits
#4 seed vs #13 seed 79% 0.7415 bits
#5 seed vs #12 seed 67% 0.9149 bits
#6 seed vs #11 seed 67% 0.9149 bits
#7 seed vs #10 seed 60% 0.9710 bits
#8 seed vs #9 seed 47% 0.9974 bits

So if the eight contests in a single region were all evenly matched, the information content of that region would be 8 bits. But there’s one sure and one nearly-sure game in there, and there’s only a couple games where the two teams are close to evenly matched. As a result, I make out the information content of a single region to be about 5.392 bits of information. Since there’s four regions, that means the first round of play — the first 32 games — have altogether about 21.567 bits of information.

Warning: I used three digits past the decimal point just because three is a nice comfortable number. Do not by hypnotized into thinking this is a more precise measure than it really is. I don’t know what the precise chance of, say, a number three seed beating a number fourteen seed is; all I know is that in a 27-year sample, it happened the higher-seed won 85 percent of the time, so the chance of the higher-seed winning is probably close to 85 percent. And I only know that if whoever it was wrote this article actually gathered and processed and reported the information correctly. I would not be at all surprised if the first round turned out to have only 21.565 bits of information, or as many as 21.568.

A statistical analysis of the tournaments which I dug up last year indicated that in the last three rounds — the Elite Eight, Final Four, and championship game — the higher- and lower-seeded teams are equally likely to win, and therefore those games have an information content of 1 bit per game. The last three rounds therefore have 7 bits of information total.

Unfortunately, experimental data seems to fall short for the second round — 16 games, where the 32 winners in the first round play, producing the Sweet Sixteen teams — and the third round — 8 games, producing the Elite Eight. If someone’s done a study of how often the higher-seeded team wins I haven’t run across it.

There are six of these games in each of the four regions, for 24 games total. Presumably the higher-seeded is more likely than the lower-seeded to win, but I don’t know how much more probable it is the higher-seed will win. I can come up with some bounds: the 24 games total in the second and third rounds can’t have an information content less than 0 bits, since they’re not all foregone conclusions. The higher-ranked seed won’t win all the time. And they can’t have an information content of more than 24 bits, since that’s how much there would be if the games were perfectly even matches.

So, then: the first round carries about 21.567 bits of information. The second and third rounds carry between 0 and 24 bits. The fourth through sixth rounds (the sixth round is the championship game) carry seven bits. Overall, the 63 games of the tournament carry between 28.567 and 52.567 bits of information. I would expect that many of the second-round and most of the third-round games are pretty close to even matches, so I would expect the higher end of that range to be closer to the true information content.

Let me make the assumption that in this second and third round the higher-seed has roughly a chance of 75 percent of beating the lower seed. That’s a number taken pretty arbitrarily as one that sounds like a plausible but not excessive advantage the higher-seeded teams might have. (It happens it’s close to the average you get of the higher-seed beating the lower-seed in the first round of play, something that I took as confirming my intuition about a plausible advantage the higher seed has.) If, in the second and third rounds, the higher-seed wins 75 percent of the time and the lower-seed 25 percent, then the outcome of each game is about 0.8113 bits of information. Since there are 24 games total in the second and third rounds, that suggests the second and third rounds carry about 19.471 bits of information.

Warning: Again, I went to three digits past the decimal just because three digits looks nice. Given that I do not actually know the chance a higher-seed beats a lower-seed in these rounds, and that I just made up a number that seems plausible you should not be surprised if the actual information content turns out to be 19.468 or even 19.472 bits of information.

Taking all these numbers, though — the first round with its something like 21.567 bits of information; the second and third rounds with something like 19.471 bits; the fourth through sixth rounds with 7 bits — the conclusion is that the win/loss results of the entire 63-game tournament are about 48 bits of information. It’s a bit higher the more unpredictable the games involving the final 32 and the Sweet 16 are; it’s a bit lower the more foregone those conclusions are. But 48 bits sounds like a plausible enough answer to me.

## What We Talk About When We Talk About How Interesting What We’re Talking About Is

When I wrote last weekend’s piece about how interesting a basketball tournament was, I let some terms slide without definition, mostly so I could explain what ideas I wanted to use and how they should relate. My love, for example, read the article and looked up and asked what exactly I meant by “interesting”, in the attempt to measure how interesting a set of games might be, even if the reasoning that brought me to a 63-game tournament having an interest level of 63 seemed to satisfy.

When I spoke about something being interesting, what I had meant was that it’s something whose outcome I would like to know. In mathematical terms this “something whose outcome I would like to know” is often termed an experiment’ to be performed or, even better, a message’ that presumably I wil receive; and the outcome is the “information” of that experiment or message. And information is, in this context, something you do not know but would like to.

So the information content of a foregone conclusion is low, or at least very low, because you already know what the result is going to be, or are pretty close to knowing. The information content of something you can’t predict is high, because you would like to know it but there’s no (accurately) guessing what it might be.

This seems like a straightforward idea of what information should mean, and it’s a very fruitful one; the field of “information theory” and a great deal of modern communication theory is based on them. This doesn’t mean there aren’t some curious philosophical implications, though; for example, technically speaking, this seems to imply that anything you already know is by definition not information, and therefore learning something destroys the information it had. This seems impish, at least. Claude Shannon, who’s largely responsible for information theory as we now know it, was renowned for jokes; I recall a Time Life science-series book mentioning how he had built a complex-looking contraption which, turned on, would churn to life, make a hand poke out of its innards, and turn itself off, which makes me smile to imagine. Still, this definition of information is a useful one, so maybe I’m imagining a prank where there’s not one intended.

And something I hadn’t brought up, but which was hanging awkwardly loose, last time was: granted that the outcome of a single game might have an interest level, or an information content, of 1 unit, what’s the unit? If we have units of mass and length and temperature and spiciness of chili sauce, don’t we have a unit of how informative something is?

We have. If we measure how interesting something is — how much information there is in its result — using base-two logarithms the way we did last time, then the unit of information is a bit. That is the same bit that somehow goes into bytes, which go on your computer into kilobytes and megabytes and gigabytes, and onto your hard drive or USB stick as somehow slightly fewer gigabytes than the label on the box says. A bit is, in this sense, the amount of information it takes to distinguish between two equally likely outcomes. Whether that’s a piece of information in a computer’s memory, where a 0 or a 1 is a priori equally likely, or whether it’s the outcome of a basketball game between two evenly matched teams, it’s the same quantity of information to have.

So a March Madness-style tournament has an information content of 63 bits, if all you’re interested in is which teams win. You could communicate the outcome of the whole string of matches by indicating whether the “home” team wins or loses for each of the 63 distinct games. You could do it with 63 flashes of light, or a string of dots and dashes on a telegraph, or checked boxes on a largely empty piece of graphing paper, coins arranged tails-up or heads-up, or chunks of memory on a USB stick. We’re quantifying how much of the message is independent of the medium.

## How Interesting Is A Basketball Tournament?

Yes, I can hear people snarking, “not even the tiniest bit”. These are people who think calling all athletic contests “sportsball” is still a fresh and witty insult. No matter; what I mean to talk about applies to anything where there are multiple possible outcomes. If you would rather talk about how interesting the results of some elections are, or whether the stock market rises or falls, whether your preferred web browser gains or loses market share, whatever, read it as that instead. The work is all the same.

To talk about quantifying how interesting the outcome of a game (election, trading day, whatever) means we have to think about what “interesting” qualitatively means. A sure thing, a result that’s bound to happen, is not at all interesting, since we know going in that it’s the result. A result that’s nearly sure but not guaranteed is at least a bit interesting, since after all, it might not happen. An extremely unlikely result would be extremely interesting, if it could happen.

## Reading the Comics, March 15, 2015: Pi Day Edition

I had kind of expected the 14th of March — the Pi Day Of The Century — would produce a flurry of mathematics-themed comics. There were some, although they were fewer and less creatively diverse than I had expected. Anyway, between that, and the regular pace of comics, there’s plenty for me to write about. Recently featured, mostly on Gocomics.com, a little bit on Creators.com, have been:

Brian Anderson’s Dog Eat Doug (March 11) features a cat who claims to be “pondering several quantum equations” to prove something about a parallel universe. It’s an interesting thing to claim because, really, how can the results of an equation prove something about reality? We’re extremely used to the idea that equations can model reality, and that the results of equations predict real things, to the point that it’s easy to forget that there is a difference. A model’s predictions still need some kind of validation, reason to think that these predictions are meaningful and correct when done correctly, and it’s quite hard to think of a meaningful way to validate a predication about “another” universe.

## The Math Blog Statistics, March 2014

It’s the start of a fresh month, so let me carry on my blog statistics reporting. In February 2014, apparently, there were a mere 423 pages viewed around here, with 209 unique visitors. That’s increased a bit, to 453 views from 257 visitors, my second-highest number of views since last June and second-highest number of visitors since last April. I can make that depressing, though: it means views per visitor dropped from 2.02 to 1.76, but then, they were at 1.76 in January anyway. And I reached my 14,000th page view, which is fun, but I’d need an extraordinary bit of luck to get to 15,000 this month.

March’s most popular articles were a mix of the evergreens — trapezoids and comics — with a bit of talk about March Madness serving as obviously successful clickbait:

1. How Many Trapezoids I Can Draw, and again, nobody’s found one I overlooked.
2. Calculating March Madness, and the tricky problem of figuring out the chance of getting a perfect bracket.
3. Reading The Comics, March 1, 2014: Isn’t It One-Half X Squared Plus C? Edition, showing how well an alleged joke will make comic strips popular.
4. Reading The Comics, March 26, 2014: Kitchen Science Department, showing that maybe it’s just naming the comics installments that matters.
5. What Are The Chances Of An Upset, which introduces some of the interesting quirks of the bracket and seed system of playoffs, such as the apparent advantage an eleventh seed has over an eighth seed.

There’s a familiar set of countries sending me the most readers: as ever the United States up top (277), with Denmark in second (26) and Canada in third (17). That’s almost a tie, though, as the United Kingdom (16), Austria (15), and the Philippines (13) could have taken third easily. I don’t want to explicitly encourage international rivalries to drive up my page count here, I’m just pointing it out. Singapore is in range too. The single-visitor countries this past month were the Bahamas, Belgium, Brazil, Colombia, Hungary, Mexico, Peru, Rwanda, Saudi Arabia, Spain, Sri Lanka, Sweden, Syria, and Taiwan. Hungary, Peru, and Saudi Arabia are the only repeat visitors from February, and nobody’s got a three-month streak going.

There wasn’t any good search-term poetry this month; mostly it was questions about trapezoids, but there were a couple interesting ones:

So, that’s where things stand: I need to get back to writing about trapezoids and comic strips.

## What Are The Chances Of An Upset?

I’d wondered idly the other day if a number-16 seed had ever lost to a number-one seed in the NCAA Men’s Basketball tournament. This finally made me go and actually try looking it up; a page on statistics.about.com has what it claims are the first-round results from 1985 (when the current 64-team format was adopted) to 2012. This lets us work out roughly the probability of, for example, the number-three seed beating the number-14, at least by what’s termed the “frequentist” interpretation of probability. In that interpretation, the probability of something happening is roughly how many times the thing you’re interested in happens for the number of times it could happen. From 1985 to 2012 each of the various first-round possibilites was played 112 times (28 tournaments with four divisions each); if we make some plausible assumptions about games being independent events (how one seed did last year doesn’t affect how it does this year), we should have a decent rough idea of the probability of each seed winning.

According to its statistics, and remarkable to me, is that apparently the number-one seed has never been beaten by the number-16. I’m surprised; I’d have guessed the bottom team had at least a one percent chance of victory. I’m also surprised that the Internet seems to have only the one page that’s gathered explicitly how often the first rounds go to the various seeds, although perhaps I’m just not searching for the right terms.

From http://bracketodds.cs.illinois.edu I learn that Dr Sheldon Jacobson and Dr Douglas M King of the University of Illinois (Urbana) published an interesting paper “Seeding In The NCAA Men’s Basketball Tournament: When is A Higher Seed Better?” which runs a variety of statistical tests on the outcomes of March Madness tournaments and finds that the seeding does seem to correspond to the stronger team in the first few rounds, but that after the Elite Eight round there’s not the evidence that a higher seed is more likely to win than the lower; effectively, after the first few rounds you might as well make a random pick.

Jacobson and King, along with Dr Alexander Nikolaev at SUNY/Buffalo and Dr Adrian J Lee, Central Illinois Technology and Education Research Institute, also wrote “Seed Distributions for the NCAA Men’s Basketball Tournament” which tries to model the tournament’s outcomes as random variables, and compares how these random-variable projections compare to what actually happened between 1985 and 2010. This includes some interesting projections about how often we might expect the various seeds to appear in the Sweet Sixteen, Elite Eight, or Final Four. It brings out some surprises — which make sense when you look back at the brackets — such as that the number-eight or number-nine seed has a worse chance of getting to the Sweet Sixteen than the eleventh- or twelfth-seed does.

(The eighth or ninth seed, if they win, have to play whoever wins the sixteen-versus-one contest, which will be the number-one seed. The eleventh seed has to beat first the number-six seed, and then either the number-three or the number-14 seed, either one of which is more likely.)

Meanwhile, it turns out that in my brackets I had picked Connecticut to beat Villanova, which has me doing well in my group — we get bonus points for calling upsets — apart from the accusations of witchcraft.

## Calculating March Madness

I did join a little group of people competing to try calling the various NCAA basketball tournament brackets. It’s a silly pastime and way to commiserate with other people about how badly we’re doing forecasting the outcome of the 63 games in the match. We’re competing just for points and the glory of doing a little better than our friends, but there’s some actual betting pools out there, and some contests that offer, for perfect brackets, a billion dollars (Warren Buffet, if I have that right), or maybe even a new car (WLNS-TV, channel 6, Lansing).

Working out what the odds are of getting all 63 games right is more interesting than it might seem at first. The natural (it seems to me) first guess at working out the odds is to say, well, there are 63 games, and whatever team you pick has a 50 percent chance of winning that game, so the chance of getting all 63 games right is $\left(\frac{1}{2}\right)^{63}$, or one chance in 9,223,372,036,854,775,808.

But it’s not quite so, and the reason is buried in the assumption that every team has a 50 percent chance of winning any given game. And that’s just not so: it’s plausible (as of this writing) to think that the final game will be Michigan State playing the University of Michigan. It’s just ridiculous to think that the final game will be SUNY/Albany (16th seeded) playing Wofford (15th).

The thing is that not all the matches are equally likely to be won by either team. The contest starts out with the number one seed playing the number 16, the number two seed playing the number 15, and so on. The seeding order roughly approximates the order of how good the teams are. It doesn’t take any great stretch to imagine the number ten seed beating the number nine seed; but, has a number 16 seed ever beaten the number one?

To really work out the probability of getting all the brackets right turns into a fairly involved problem. We can probably assume that the chance of, say, number-one seed Virginia beating number-16 seed Coastal Carolina is close to how frequently number-one seeds have beaten number-16 seeds in the past, and similarly that number-four seed Michigan State’s chances over number-13 Delaware is close to that historical average. But there are some 9,223,372,036,854,775,808 possible ways that the tournament could, in principle, go, and they’ve all got different probabilities of happening.

So there isn’t a unique answer to what is the chance that you’ve picked a perfect bracket set. It’s higher if you’ve picked a lot of higher-ranking seeds, certainly, at least assuming that this year’s tournament is much like previous years’, and that seeds do somewhat well reflect how likely teams are to win. At some point it starts to be easier to accept “one chance in 9,223,372,036,854,775,808” as close enough. Me, I’ll be gloating for the whole tournament thanks to my guess that Ohio State would lose to Dayton.

[Edit: first paragraph originally read “games in the match”, which doesn’t quite parse.]