And now let me close the week with some other evergreen articles. A couple years back I mixed the NCAA men’s basketball tournament with information theory to produce a series of essays that fit the title I’ve given this recap. They also sprawl out into (US) football and baseball. Let me link you to them:
That’s a relief. After the previous week’s suspicious silence Comic Strip Master Command sent a healthy number of mathematically-themed comics my way. They cover a pretty normal spread of topics. So this makes for a nice normal sort of roundup.
Mac King and Bill King’s Magic In A Minute for the 4th is an arithmetic-magic-trick. Like most arithmetic-magic it depends on some true but, to me, dull bit of mathematics. In this case, that 81,234,567 minus 12,345,678 is equal to something. As a kid this sort of trick never impressed me because, well, anyone can do subtraction. I didn’t appreciate that the fun of stage magic in presenting well the mundane.
Jerry Scott and Jim Borgman’s Zits for the 5th is an ordinary mathematics-is-hard joke. But it’s elevated by the artwork, which shows off the expressive and slightly surreal style that makes the comic so reliable and popular. The formulas look fair enough, the sorts of things someone might’ve been cramming before class. If they’re a bit jumbled up, well, Pierce hasn’t been well.
Mustard and Boloney popped back in on the 8th with a strip I don’t have in my archive at least. It’s your standard Pi Pun, though. If they’re smart they’ll rerun it in March. I like the coloring; it’s at least a pleasant panel to look at.
Percy Crosby’s Skippy from the 9th of July, 1929 was rerun the 6th of September. It seems like a simple kid-saying-silly-stuff strip: what is the difference between the phone numbers Clinton 2651 and Clinton 2741 when they add to the same number? (And if Central knows what the number is why do they waste Skippy’s time correcting him? And why, 87 years later, does the phone yell at me for not guessing correctly whether I need the area code for a local number and whether I need to dial 1 before that?) But then who cares what the digits in a telephone number add to? What could that tell us about anything?
As phone numbers historically developed, the sum can’t tell us anything at all. But if we had designed telephone numbers correctly we could have made it … not impossible to dial a wrong number, but at least made it harder. This insight comes to us from information theory, which, to be fair, we have because telephone companies spent decades trying to work out solutions to problems like people dialing numbers wrong or signals getting garbled in the transmission. We can allow for error detection by schemes as simple as passing along, besides the numbers, the sum of the numbers. This can allow for the detection of a single error: had Skippy called for number 2641 instead of 2741 the problem would be known. But it’s helpless against two errors, calling for 2541 instead of 2741. But we could detect a second error by calculating some second term based on the number we wanted, and sending that along too.
By adding some more information, other modified sums of the digits we want, we can even start correcting errors. We understand the logic of this intuitively. When we repeat a message twice after sending it, we are trusting that even if one copy of the message is garbled the recipient will take the version received twice as more likely what’s meant. We can design subtler schemes, ones that don’t require we repeat the number three times over. But that should convince you that we can do it.
The tradeoff is obvious. We have to say more digits of the number we want. It isn’t hard to reach the point we’re ending more error-detecting and error-correcting numbers than we are numbers we want. And what if we make a mistake in the error-correcting numbers? (If we used a smart enough scheme, we can work out the error was in the error-correcting number, and relax.) If it’s important that we get the message through, we shrug and accept this. If there’s no real harm done in getting the message wrong — if we can shrug off the problem of accidentally getting the wrong phone number — then we don’t worry about making a mistake.
And at this point we’re only a few days into the week. I have enough hundreds of words on the close of the week I’ll put off posting that a couple of days. It’s quite good having the comics back to normal.
I’m still curious about the information-theory content, the entropy, of sports scores. I haven’t found the statistics I need about baseball or soccer game outcomes that I need. I’d also like hockey score outcomes if I could get them. If anyone knows a reference I’d be glad to know of it.
But there’s still stuff I can talk about without knowing details of every game ever. One of them suggested itself when I looked at the Washington Post‘s graphic. I mean the one giving how many times each score came up in baseball’s history.
I had planned to write about this when one of my Twitter friends wrote —
@nebusj Cool, looks sort of like a beta distribution. Most movie grosses over time follow a similar curve.
By “distribution” mathematicians mean almost what you would imagine. Suppose we have something that might hold any of a range of values. This we call a “random variable”. How likely is it to hold any particular value? That’s what the distribution tells us. The higher the distribution, the more likely it is we’ll see that value. In baseball terms, that means we’re reasonably likely to see a game with a team scoring three runs. We’re not likely to see a game with a team scoring twenty runs.
There are many families of distributions. Feloni Mayhem suggested the baseball scores look like one called the Beta Distribution. I can’t quite agree, on technical grounds. Beta Distributions describe continuously-valued variables. They’re good for stuff like the time it takes to do something, or the height of a person, or the weight of a produced thing. They’re for measurements that can, in principle, go on forever after the decimal point. A baseball score isn’t like that. A team can score zero points, or one, or 46, but it can’t score four and two-thirds points. Baseball scores are “discrete” variables.
But there are good distributions for discrete variables. Almost everything you encounter taking an Intro to Probability class will be about discrete variables. So will most any recreational mathematics puzzle. The distribution of a tossed die’s outcomes is discrete. So is the number of times tails comes up in a set number of coin tosses. So are the birth dates of people in a room, or the number of cars passed on the side of the road during your ride, or the number of runs scored by a baseball team in a full game.
I suspected that, of the simpler distributions, the best model for baseball should be the Poisson distribution. It also seems good for any other low-scoring game, such as soccer or hockey. The Poisson distribution turns up whenever you have a large number of times that some discrete event can happen. But that event can happen only once each chance. And it has a constant chance of happening. That is, happening this chance doesn’t make it more likely or less likely it’ll happen next chance.
I have reasons to think baseball scoring should be well-modelled this way. There are hundreds of pitches in a game. Each of them is in principle a scoring opportunity. (Well, an intentional walk takes three pitches without offering any chance for scoring. And there’s probably some other odd case where a pitched ball can’t even in principle let someone score. But these are minor fallings-away from the ideal.) This is part of the appeal of baseball, at least for some: the chance is always there.
We only need one number to work out the Poisson distribution of something. That number is the mean, the arithmetic mean of all the possible values. Let me call the mean μ, which is the Greek version of m and so a good name for a mean. The probability that you’ll see the thing happen n times is . Here e is that base of the natural logarithm, that 2.71828 et cetera number. n! is the factorial. That’s n times (n – 1) times (n – 2) times (n – 3) and so on all the way down to times 2 times 1.
And here is the Poisson distribution for getting numbers from 0 through 20, if we take the mean to be 3.4. I can defend using the Poisson distribution much more than I can defend picking 3.4 as the mean. Why not 3.2, or 3.8? Mostly, I tried a couple means around the three-to-four runs range and picked one that looked about right. Given the lack of better data, what else can I do?
I don’t think it’s a bad fit. The shape looks about right, to me. But the Poisson distribution suggests fewer zero- and one-run games than the actual data offers. And there are more high-scoring games in the real data than in the Poisson distribution. Maybe there’s something that needs tweaking.
And there are several plausible causes for this. A Poisson distribution, for example, supposes that there are a lot of chances for a distinct event. That would be scoring on a pitch. But in an actual baseball game there might be up to four runs scored on one pitch. It’s less likely to score four runs than to score one, sure, but it does happen. This I imagine boosts the number of high-scoring games.
I suspect this could be salvaged by a model that’s kind of a chain of Poisson distributions. That is, have one distribution that represents the chance of scoring on any given pitch. Then use another distribution to say whether the scoring was one, two, three, or four runs.
Low-scoring games I have a harder time accounting for. My suspicion is that each pitch isn’t quite an independent event. Experience shows that pitchers lose control of their game the more they pitch. This results in the modern close watching of pitch counts. We see pitchers replaced at something like a hundred pitches even if they haven’t lost control of the game yet.
If I am right this says we might model games like soccer and hockey, with many chances to score a single run each, with a Poisson distribution. A game like baseball, or basketball, with many chances to score one or more points at once needs a more complicated model.
While researching for my post about the information content of baseball scores I found some tantalizing links. I had wanted to know how often each score came up. From this I could calculate the entropy, the amount of information in the score. That’s the sum, taken over every outcome, of minus one times the frequency of that score times the base-two logarithm of the frequency of the outcome. And I couldn’t find that.
An article in The Washington Post had a fine lead, though. It offers, per the title, “the score of every basketball, football, and baseball game in league history visualized”. And as promised it gives charts of how often each number of runs has turned up in a game. The most common single-team score in a game is 3, with 4 and 2 almost as common. I’m not sure the date range for these scores. The chart says it includes (and highlights) data from “a century ago”. And as the article was posted in December 2014 it can hardly use data from after that. I can’t imagine that the 2015 season has changed much, though. And whether they start their baseball statistics at either 1871, 1876, 1883, 1891, or 1901 (each a defensible choice) should only change details.
Which is fine. I can’t get precise frequency data from the chart. The chart offers how many thousands of times a particular score has come up. But there’s not the reference lines to say definitely whether a zero was scored closer to 21,000 or 22,000 times. I will accept a rough estimate, since I can’t do any better.
I made my best guess at the frequency, from the chart. And then made a second-best guess. My best guess gave the information content of a single team’s score as a touch more than 3.5 bits. My second-best guess gave the information content as a touch less than 3.5 bits. So I feel safe in saying a single team’s score is about three and a half bits of information.
So the score of a baseball game, with two teams scoring, is probably somewhere around twice that, or about seven bits of information.
I have to say “around”. This is because the two teams aren’t scoring runs independently of one another. Baseball doesn’t allow for tie games except in rare circumstances. (It would usually be a game interrupted for some reason, and then never finished because the season ended with neither team in a position where winning or losing could affect their standing. I’m not sure that would technically count as a “game” for Major League Baseball statistical purposes. But I could easily see a roster of game scores counting that.) So if one team’s scored three runs in a game, we have the information that the other team almost certainly didn’t score three runs.
This estimate, though, does fit within my range estimate from 3.76 to 9.25 bits. And as I expected, it’s closer to nine bits than to four bits. The entropy seems to be a bit less than (American) football scores — somewhere around 8.7 bits — and college basketball — probably somewhere around 10.8 bits — which is probably fair. There are a lot of numbers that make for plausible college basketball scores. There are slightly fewer pairs of numbers that make for plausible football scores. There are fewer still pairs of scores that make for plausible baseball scores. So there’s less information conveyed in knowing that the game’s score is.
The raw data is available. Retrosheet.org has logs of pretty much every baseball game, going back to the forming of major leagues in the 1870s. What they don’t have, as best I can figure, is a list of all the times each possible baseball score has turned up. That I could probably work out, when I feel up to writing the scripts necessary, but “work”? Ugh.
Some people have done the work, although they haven’t shared all the results. I don’t blame them; the full results make for a boring sort of page. “The Most Popular Scores In Baseball History”, at ValueOverReplacementGrit.com, reports the top ten most common scores from 1871 through 2010. The essay also mentions that as of then there were 611 unique final scores. And that lets me give some partial results, if we trust that blogger post from people I never heard of before are accurate and true. I will make that assumption over and over here.
There’s, in principle, no limit to how many scores are possible. Baseball contains many implied infinities, and it’s not impossible that a game could end, say, 580 to 578. But it seems likely that after 139 seasons of play there can’t be all that many more scores practically achievable.
Suppose then there are 611 possible baseball score outcomes, and that each of them is equally likely. Then the information-theory content of a score’s outcome is negative one times the logarithm, base two, of 1/611. That’s a number a little bit over nine and a quarter. You could deduce the score for a given game by asking usually nine, sometimes ten, yes-or-no questions from a source that knew the outcome. That’s a little higher than the 8.7 I worked out for football. And it’s a bit less than the 10.8 I estimate for college basketball.
(You may wonder how incompetent baseball players of the 1870s were that a game could get to 49-33. Not so bad as you imagine. But the equipment and conditions they were playing with were unspeakably bad by modern standards. Notably, the playing field couldn’t be counted on to be flat and level and well-mowed. There would be unexpected divots or irregularities. This makes even simple ground balls hard to field. The baseball, instead of being replaced with every batter, would stay in the game. It would get beaten until it was a little smashed shell of unpredictable dynamics and barely any structural integrity. People were playing without gloves. If a game ran long enough, they would play at dusk, without lights, with a muddy ball on a dusty field. And sometimes you just have four innings that get out of control.)
What’s needed is a guide to what are the common scores and what are the rare scores. And I haven’t found that, nor worked up the energy to make the list myself. But I found some promising partial results. In a September 2008 post on Baseball-Fever.com, user weskelton listed the 24 most common scores and their frequency. This was for games from 1993 to 2008. One might gripe that the list only covers fifteen years. True enough, but if the years are representative that’s fine. And the top scores for the fifteen-year survey look to be pretty much the same as the 139-year tally. The 24 most common scores add up to just over sixty percent of all baseball games, which leaves a lot of scores unaccounted for. I am amazed that about three in five games will have a score that’s one of these 24 choices though.
But that’s something. We can calculate the information content for the 25 outcomes, one each of the 24 particular scores and one for “other”. This will under-estimate the information content. That’s because “other” is any of 587 possible outcomes that we’re not distinguishing. But if we have a lower bound and an upper bound, then we’ve learned something about what the number we want can actually be. The upper bound is that 9.25, above.
The information content, the entropy, we calculate from the probability of each outcome. We don’t know what that is. Not really. But we can suppose that the frequency of each outcome is close to its probability. If there’ve been a lot of games played, then the frequency of a score and the probability of a score should be close. At least they’ll be close if games are independent, if the score of one game doesn’t affect another’s. I think that’s close to true. (Some games at the end of pennant races might affect each other: why try so hard to score if you’re already out for the year? But there’s few of them.)
The entropy then we find by calculating, for each outcome, a product. It’s minus one times the probability of that outcome times the base-two logarithm of the probability of that outcome. Then add up all those products. There’s good reasons for doing it this way and in the college-basketball link above I give some rough explanations of what the reasons are. Or you can just trust that I’m not lying or getting things wrong on purpose.
So let’s suppose I have calculated this right, using the 24 distinct outcomes and the one “other” outcome. That makes out the information content of a baseball score’s outcome to be a little over 3.76 bits.
As said, that’s a low estimate. Lumping about two-fifths of all games into the single category “other” drags the entropy down.
But that gives me a range, at least. A baseball game’s score seems to be somewhere between about 3.76 and 9.25 bits of information. I expect that it’s closer to nine bits than it is to four bits, but will have to do a little more work to make the case for it.
Last month, Sarcastic Goat asked me how interesting a soccer game was. This is “interesting” in the information theory sense. I describe what that is in a series of posts, linked to from above. That had been inspired by the NCAA “March Madness” basketball tournament. I’d been wondering about the information-theory content of knowing the outcome of the tournament, and of each game.
This measure, called the entropy, we can work out from knowing how likely all the possible outcomes of something — anything — are. If there’s two possible outcomes and they’re equally likely, the entropy is 1. If there’s two possible outcomes and one is a sure thing while the other can’t happen, the entropy is 0. If there’s four possible outcomes and they’re all equally likely, the entropy is 2. If there’s eight possible outcomes, all equally likely, the entropy is 3. If there’s eight possible outcomes and some are likely while some are long shots, the entropy is … smaller than 3, but bigger than 0. The entropy grows with the number of possible outcomes and shrinks with the number of unlikely outcomes.
But it’s easy to calculate. List all the possible outcomes. Find the probability of each of those possible outcomes happening. Then, calculate minus one times the probability of each outcome times the logarithm, base two, of that outcome. For each outcome, so yes, this might take a while. Then add up all those products.
I’d estimated the outcome of the 63-game basketball tournament was somewhere around 48 bits of information. There’s a fair number of foregone, or almost foregone, conclusions in the game, after all. And I guessed, based on a toy model of what kinds of scores often turn up in college basketball games, that the game’s score had an information content of a little under 11 bits of information.
Sarcastic Goat, as I say, asked about soccer scores. I don’t feel confident that I could make up a plausible model of soccer score distributions. So I went looking for historical data. Surely, a history of actual professional soccer scores over a couple decades would show all the possible, plausible, outcomes and how likely each was to turn out.
As you’d figure, there are a lot of freakish scores; only once in professional football history has the game ended 62-28. (Although it’s ended 62-14 twice, somehow.) There hasn’t been a 2-0 game since the second week of the 1938 season. Some scores turn up a lot; 248 games (as of this writing) have ended 20-17. That’s the most common score, in its records. 27-24 and 17-14 are the next most common scores. If I’m not making a dumb mistake, 7-0 is the 21st most common score. 93 games have ended with that tally. But it hasn’t actually been a game’s final score since the 14th week of the 1983 season, somehow. 98 games have ended 21-17; only ten have ended 21-18. Weird.
Anyway, there’s 1,026 recorded outcomes. That’s surely as close to “all the possible outcomes” as we can expect to get, at least until the Jets manage to lose 74-0 in their home opener. But if all 1,026 outcomes were equally likely then the information content of the game’s score would be a touch over 10 bits. But these outcomes aren’t all equally likely. It’s vastly more likely that a game ended 16-13 than it is likely it ended 16-8.
Let’s suppose I didn’t make any stupid mistakes in working out the frequency of all the possible outcomes. Then the information content of a football game’s outcome is a little over 8.72 bits.
Don’t be too hypnotized by the digits past the decimal. It’s approximate. But it suggests that if you were asking a source that would only answer ‘yes’ or ‘no’, then you could expect to get the score for any particular football game with about nine well-chosen questions.
I’m not surprised this is less than my estimated information content of a basketball game’s score. I think basketball games see a wider range of likely scores than football games do.
If someone has a reference for the outcomes of soccer games — or other sports — over a reasonably long time please let me know. I can run the same sort of calculation. We might even manage the completely pointless business of ranking all major sports by the information content of their scores.
The Kullback-Leibler Divergence comes to us from information theory. It’s also known as “information divergence” or “relative entropy”. Entropy is by now a familiar friend. We got to know it through, among other things, the “How interesting is a basketball tournament?” question. In this context, entropy is a measure of how surprising it would be to know which of several possible outcomes happens. A sure thing has an entropy of zero; there’s no potential surprise in it. If there are two equally likely outcomes, then the entropy is 1. If there are four equally likely outcomes, then the entropy is 2. If there are four possible outcomes, but one is very likely and the other three mediocre, the entropy might be low, say, 0.5 or so. It’s mostly but not perfectly predictable.
Suppose we have a set of possible outcomes for something. (Pick anything you like. It could be the outcomes of a basketball tournament. It could be how much a favored stock rises or falls over the day. It could be how long your ride into work takes. As long as there are different possible outcomes, we have something workable.) If we have a probability, a measure of how likely each of the different outcomes is, then we have a probability distribution. More likely things have probabilities closer to 1. Less likely things have probabilities closer to 0. No probability is less than zero or more than 1. All the probabilities added together sum up to 1. (These are the rules which make something a probability distribution, not just a bunch of numbers we had in the junk drawer.)
The Kullback-Leibler Divergence describes how similar two probability distributions are to one another. Let me call one of these probability distributions p. I’ll call the other one q. We have some number of possible outcomes, and we’ll use k as an index for them. pk is how likely, in distribution p, that outcome number k is. qk is how likely, in distribution q, that outcome number k is.
To calculate this divergence, we work out, for each k, the number pk times the logarithm of pk divided by qk. Here the logarithm is base two. Calculate all this for every one of the possible outcomes, and add it together. This will be some number that’s at least zero, but it might be larger.
The closer that distribution p and distribution q are to each other, the smaller this number is. If they’re exactly the same, this number will be zero. The less that distribution p and distribution q are like each other, the bigger this number is.
And that’s all good fun, but, why bother with it? And at least one answer I can give is that it lets us measure how good a model of something is.
Suppose we think we have an explanation for how something varies. We can say how likely it is we think there’ll be each of the possible different outcomes. This gives us a probability distribution which let’s call q. We can compare that to actual data. Watch whatever it is for a while, and measure how often each of the different possible outcomes actually does happen. This gives us a probability distribution which let’s call p.
If our model is a good one, then the Kullback-Leibler Divergence between p and q will be small. If our model’s a lousy one, then this divergence will be large. If we have a couple different models, we can see which ones make for smaller divergences and which ones make for larger divergences. Probably we’ll want smaller divergences.
Here you might ask: why do we need a model? Isn’t the actual data the best model we might have? It’s a fair question. But no, real data is kind of lousy. It’s all messy. It’s complicated. We get extraneous little bits of nonsense clogging it up. And the next batch of results is going to be different from the old ones anyway, because real data always varies.
Furthermore, one of the purposes of a model is to be simpler than reality. A model should do away with complications so that it is easier to analyze, easier to make predictions with, and easier to teach than the reality is. But a model mustn’t be so simple that it can’t represent important aspects of the thing we want to study.
The Kullback-Leibler Divergence is a tool that we can use to quantify how much better one model or another fits our data. It also lets us quantify how much of the grit of reality we lose in our model. And this is at least some of the use of this quantity.
The United States is about to spend a good bit of time worrying about the NCAA men’s basketball tournament. It’s a good distraction from the women’s basketball tournament and from the National Invitational Tournament. Last year I used this to write a couple essays that stepped into information theory. Nobody knowledgeable in information theory has sent me threatening letters since. So since the inspiration is back in season I’d like to bring them to your attention again:
But How Interesting Is A Real Basketball Tournament? Because I started out assuming that games were perfectly even match ups either team was likely to win. This isn’t so. If we grant that a number-16 seed is almost sure to lose to a number-1 seed, how does the information content change?
It looks like Comic Strip Master Command wanted to give me a nice, easy start of the year. The first group of mathematics-themed comic strips doesn’t get into deep waters and so could be written up with just a few moments. I foiled them by not having even a few moments to write things up, so that I’m behind on 2016 already. I’m sure I kind of win.
Dan Thompson’s Brevity for the 1st of January starts us off with icons of counting and computing. The abacus, of course, is one of the longest-used tools for computing. The calculator was a useful stopgap between the slide rule and the smart phone. The Count infects numerals with such contagious joy. And the whiteboard is where a lot of good mathematics work gets done. And yes, I noticed the sequence of numbers on the board. The prime numbers are often cited as the sort of message an alien entity would recognize. I suppose it’s likely an intelligence alert enough to pick up messages across space would be able to recognize prime numbers. Whether they’re certain to see them as important building blocks to the ways numbers work, the way we do? I don’t know. But I would expect someone to know the sequence, at least.
Ryan Pagelow’s Buni for New Year’s Day qualifies as the anthropomorphic-numerals joke for this essay.
Scott Hilburn’s The Argyle Sweater for the 2nd of January qualifies as the Roman numerals joke for this essay. It does prompt me to wonder whether about the way people who used Roman numerals as a their primary system thought, though. Obviously, “XCIX red balloons” should be pronounced as “ninety-nine red balloons”. But would someone scan it as “ninety-nine” or would it be read as the characters, “x-c-i-x” and then that converted to a number? I’m not sure I’m expressing the thing I wonder.
Steve Moore’s In The Bleachers for the 4th of January shows a basketball player overthinking the problem of getting a ball in the basket. The overthinking includes a bundle of equations which are all relevant to the problem, though. They’re the kinds of things you get in describing an object tossed up and falling without significant air resistance. I had thought I’d featured this strip — a rerun — before, but it seems not. Moore has used the same kind of joke a couple of other times, though, and he does like getting the equations right where possible.
Justin Boyd’s absurdist Invisible Bread for the 4th of January has Mom clean up a messy hard drive by putting all the 1’s together and all the 0’s together. And, yes, that’s not how data works. We say we represent data, on a computer, with 1’s and 0’s, but those are just names. We need to call them something. They’re in truth — oh, they’re positive or negative electric charges, or magnetic fields pointing one way or another, or they’re switches that are closed or open, or whatever. That’s for the person building the computer to worry about. Our description of what a computer does doesn’t care about the physical manifestation of our data. We could be as right if we say we’re representing data with A’s and purples, or with stop signs and empty cups of tea. What’s important is the pattern, and how likely it is that a 1 will follow a 0, or a 0 will follow a 1. If that sounds reminiscent of my information-theory talk about entropy, well, good: it is. Sweeping all the data into homogenous blocks of 1’s and of 0’s, alas, wipes out the interesting stuff. Information is hidden, somehow, in the ways we line up 1’s and 0’s, whatever we call them.
As anyone who tries to be funny knows, some words just are funnier than others. Or at least they sound funnier. (This brings us into the problem of whether something is actually funny or whether we just think it is.) Westbury, Shaoul, Moroschan, and Ramscar try testing whether a common measure of how unpredictable something is — the entropy, a cornerstone of information theory — can tell us how funny a word might be.
We’ve encountered entropy in these parts before. I used it in that series earlier this year about how interesting a basketball tournament was. Entropy, in this context, is low if something is predictable. It gets higher the more unpredictable the thing being studied is. You see this at work in auto-completion: if you have typed in ‘th’, it’s likely your next letter is going to be an ‘e’. This reflects the low entropy of ‘the’ as an english word. It’s rather unlikely the next letter will be ‘j’, because English has few contexts that need ‘thj’ to be written out. So it will suggest words that start ‘the’ (and ‘tha’, and maybe even ‘thi’), while giving ‘thj’ and ‘thq’ and ‘thd’ a pass.
Westbury, Shaoul, Moroschan, and Ramscar found that the entropy of a word, how unlikely that collection of letters is to appear in an English word, matches quite well how funny people unfamiliar with it find it. This fits well with one of the more respectable theories of comedy, Arthur Schopenhauer’s theory that humor comes about from violating expectations. That matches well with unpredictability.
Of course it isn’t just entropy that makes words funny. Anyone trying to be funny learns that soon enough, since a string of perfect nonsense is also boring. But this is one of the things that can be measured and that does have an influence.
(I doubt there is any one explanation for why things are funny. My sense is that there are many different kinds of humor, not all of them perfectly compatible. It would be bizarre if any one thing could explain them all. But explanations for pieces of them are plausible enough.)
In this information-theory context, an experiment is just anything that could have different outcomes. A team can win or can lose or can tie in a game; that makes the game an experiment. The outcomes are the team wins, or loses, or ties. A team can get a particular score in the game; that makes that game a different experiment. The possible outcomes are the team scores zero points, or one point, or two points, or so on up to whatever the greatest possible score is.
If you know the probability p of each of the different outcomes, and since this is a mathematics thing we suppose that you do, then we have what I was calling the information content of the outcome of the experiment. That’s a number, measured in bits, and given by the formula
The sigma summation symbol means to evaluate the expression to the right of it for every value of some index j. The pj means the probability of outcome number j. And the logarithm may be that of any base, although if we use base two then we have an information content measured in bits. Those are the same bits as are in the bytes that make up the megabytes and gigabytes in your computer. You can see this number as an estimate of how many well-chosen yes-or-no questions you’d have to ask to pick the actual result out of all the possible ones.
I’d called this the information content of the experiment’s outcome. That’s an idiosyncratic term, chosen because I wanted to hide what it’s normally called. The normal name for this is the “entropy”.
To be more precise, it’s known as the “Shannon entropy”, after Claude Shannon, pioneer of the modern theory of information. However, the equation defining it looks the same as one that defines the entropy of statistical mechanics, that thing everyone knows is always increasing and somehow connected with stuff breaking down. Well, almost the same. The statistical mechanics one multiplies the sum by a constant number called the Boltzmann constant, after Ludwig Boltzmann, who did so much to put statistical mechanics in its present and very useful form. We aren’t thrown by that. The statistical mechanics entropy describes energy that is in a system but that can’t be used. It’s almost background noise, present but nothing of interest.
Is this Shannon entropy the same entropy as in statistical mechanics? This gets into some abstract grounds. If two things are described by the same formula, are they the same kind of thing? Maybe they are, although it’s hard to see what kind of thing might be shared by “how interesting the score of a basketball game is” and “how much unavailable energy there is in an engine”.
The legend has it that when Shannon was working out his information theory he needed a name for this quantity. John von Neumann, the mathematician and pioneer of computer science, suggested, “You should call it entropy. In the first place, a mathematical development very much like yours already exists in Boltzmann’s statistical mechanics, and in the second place, no one understands entropy very well, so in any discussion you will be in a position of advantage.” There are variations of the quote, but they have the same structure and punch line. The anecdote appears to trace back to an April 1961 seminar at MIT given by one Myron Tribus, who claimed to have heard the story from Shannon. I am not sure whether it is literally true, but it does express a feeling about how people understand entropy that is true.
Well, these entropies have the same form. And they’re given the same name, give or take a modifier of “Shannon” or “statistical” or some other qualifier. They’re even often given the same symbol; normally a capital S or maybe an H is used as the quantity of entropy. (H tends to be more common for the Shannon entropy, but your equation would be understood either way.)
I’m not comfortable saying they’re the same thing, though. After all, we use the same formula to calculate a batting average and to work out the average time of a commute. But we don’t think those are the same thing, at least not more generally than “they’re both averages”. These entropies measure different kinds of things. They have different units that just can’t be sensibly converted from one to another. And the statistical mechanics entropy has many definitions that not just don’t have parallels for information, but wouldn’t even make sense for information. I would call these entropies siblings, with strikingly similar profiles, but not more than that.
But let me point out something about the Shannon entropy. It is low when an outcome is predictable. If the outcome is unpredictable, presumably knowing the outcome will be interesting, because there is no guessing what it might be. This is where the entropy is maximized. But an absolutely random outcome also has a high entropy. And that’s boring. There’s no reason for the outcome to be one option instead of another. Somehow, as looked at by the measure of entropy, the most interesting of outcomes and the most meaningless of outcomes blur together. There is something wondrous and strange in that.
I’d worked out an estimate of how much information content there is in a basketball score, by which I was careful to say the score that one team manages in a game. I wasn’t able to find out what the actual distribution of real-world scores was like, unfortunately, so I made up a plausible-sounding guess: that college basketball scores would be distributed among the imaginable numbers (whole numbers from zero through … well, infinitely large numbers, though in practice probably not more than 150) according to a very common distribution called the “Gaussian” or “normal” distribution, that the arithmetic mean score would be about 65, and that the standard deviation, a measure of how spread out the distribution of scores is, would be about 10.
If those assumptions are true, or are at least close enough to true, then there are something like 5.4 bits of information in a single team’s score. Put another way, if you were trying to divine the score by asking someone who knew it a series of carefully-chosen questions, like, “is the score less than 65?” or “is the score more than 39?”, with at each stage each question equally likely to be answered yes or no, you could expect to hit the exact score with usually five, sometimes six, such questions.
When I worked out how interesting, in an information-theory sense, a basketball game — and from that, a tournament — might be, I supposed there was only one thing that might be interesting about the game: who won? Or to be exact, “did (this team) win”? But that isn’t everything we might want to know about a game. For example, we might want to know what a team scored. People often do. So how to measure this?
When I wrote about how interesting the results of a basketball tournament were, and came to the conclusion that it was 63 (and filled in that I meant 63 bits of information), I was careful to say that the outcome of a basketball game between two evenly-matched opponents has an information content of 1 bit. If the game is a foregone conclusion, then the game hasn’t got so much information about it. If the game really is foregone, the information content is 0 bits; you already know what the result will be. If the game is an almost sure thing, there’s very little information to be had from actually seeing the game. An upset might be thrilling to watch, but you would hardly count on that, if you’re being rational. But most games aren’t sure things; we might expect the higher-seed to win, but it’s plausible they don’t. How does that affect how much information there is in the results of a tournament?
Last year, the NCAA College Men’s Basketball tournament inspired me to look up what the outcomes of various types of matches were, and which teams were more likely to win than others. If some person who wrote something for statistics.about.com is correct, based on 27 years of March Madness outcomes, the play between a number one and a number 16 seed is a foregone conclusion — the number one seed always wins — while number two versus number 15 is nearly sure. So while the first round of play will involve 32 games — four regions, each region having eight games — there’ll be something less than 32 bits of information in all these games, since many of them are so predictable.
So if the eight contests in a single region were all evenly matched, the information content of that region would be 8 bits. But there’s one sure and one nearly-sure game in there, and there’s only a couple games where the two teams are close to evenly matched. As a result, I make out the information content of a single region to be about 5.392 bits of information. Since there’s four regions, that means the first round of play — the first 32 games — have altogether about 21.567 bits of information.
Warning: I used three digits past the decimal point just because three is a nice comfortable number. Do not by hypnotized into thinking this is a more precise measure than it really is. I don’t know what the precise chance of, say, a number three seed beating a number fourteen seed is; all I know is that in a 27-year sample, it happened the higher-seed won 85 percent of the time, so the chance of the higher-seed winning is probably close to 85 percent. And I only know that if whoever it was wrote this article actually gathered and processed and reported the information correctly. I would not be at all surprised if the first round turned out to have only 21.565 bits of information, or as many as 21.568.
A statistical analysis of the tournaments which I dug up last year indicated that in the last three rounds — the Elite Eight, Final Four, and championship game — the higher- and lower-seeded teams are equally likely to win, and therefore those games have an information content of 1 bit per game. The last three rounds therefore have 7 bits of information total.
Unfortunately, experimental data seems to fall short for the second round — 16 games, where the 32 winners in the first round play, producing the Sweet Sixteen teams — and the third round — 8 games, producing the Elite Eight. If someone’s done a study of how often the higher-seeded team wins I haven’t run across it.
There are six of these games in each of the four regions, for 24 games total. Presumably the higher-seeded is more likely than the lower-seeded to win, but I don’t know how much more probable it is the higher-seed will win. I can come up with some bounds: the 24 games total in the second and third rounds can’t have an information content less than 0 bits, since they’re not all foregone conclusions. The higher-ranked seed won’t win all the time. And they can’t have an information content of more than 24 bits, since that’s how much there would be if the games were perfectly even matches.
So, then: the first round carries about 21.567 bits of information. The second and third rounds carry between 0 and 24 bits. The fourth through sixth rounds (the sixth round is the championship game) carry seven bits. Overall, the 63 games of the tournament carry between 28.567 and 52.567 bits of information. I would expect that many of the second-round and most of the third-round games are pretty close to even matches, so I would expect the higher end of that range to be closer to the true information content.
Let me make the assumption that in this second and third round the higher-seed has roughly a chance of 75 percent of beating the lower seed. That’s a number taken pretty arbitrarily as one that sounds like a plausible but not excessive advantage the higher-seeded teams might have. (It happens it’s close to the average you get of the higher-seed beating the lower-seed in the first round of play, something that I took as confirming my intuition about a plausible advantage the higher seed has.) If, in the second and third rounds, the higher-seed wins 75 percent of the time and the lower-seed 25 percent, then the outcome of each game is about 0.8113 bits of information. Since there are 24 games total in the second and third rounds, that suggests the second and third rounds carry about 19.471 bits of information.
Warning: Again, I went to three digits past the decimal just because three digits looks nice. Given that I do not actually know the chance a higher-seed beats a lower-seed in these rounds, and that I just made up a number that seems plausible you should not be surprised if the actual information content turns out to be 19.468 or even 19.472 bits of information.
Taking all these numbers, though — the first round with its something like 21.567 bits of information; the second and third rounds with something like 19.471 bits; the fourth through sixth rounds with 7 bits — the conclusion is that the win/loss results of the entire 63-game tournament are about 48 bits of information. It’s a bit higher the more unpredictable the games involving the final 32 and the Sweet 16 are; it’s a bit lower the more foregone those conclusions are. But 48 bits sounds like a plausible enough answer to me.
When I wrote last weekend’s piece about how interesting a basketball tournament was, I let some terms slide without definition, mostly so I could explain what ideas I wanted to use and how they should relate. My love, for example, read the article and looked up and asked what exactly I meant by “interesting”, in the attempt to measure how interesting a set of games might be, even if the reasoning that brought me to a 63-game tournament having an interest level of 63 seemed to satisfy.
When I spoke about something being interesting, what I had meant was that it’s something whose outcome I would like to know. In mathematical terms this “something whose outcome I would like to know” is often termed an `experiment’ to be performed or, even better, a `message’ that presumably I wil receive; and the outcome is the “information” of that experiment or message. And information is, in this context, something you do not know but would like to.
So the information content of a foregone conclusion is low, or at least very low, because you already know what the result is going to be, or are pretty close to knowing. The information content of something you can’t predict is high, because you would like to know it but there’s no (accurately) guessing what it might be.
This seems like a straightforward idea of what information should mean, and it’s a very fruitful one; the field of “information theory” and a great deal of modern communication theory is based on them. This doesn’t mean there aren’t some curious philosophical implications, though; for example, technically speaking, this seems to imply that anything you already know is by definition not information, and therefore learning something destroys the information it had. This seems impish, at least. Claude Shannon, who’s largely responsible for information theory as we now know it, was renowned for jokes; I recall a Time Life science-series book mentioning how he had built a complex-looking contraption which, turned on, would churn to life, make a hand poke out of its innards, and turn itself off, which makes me smile to imagine. Still, this definition of information is a useful one, so maybe I’m imagining a prank where there’s not one intended.
And something I hadn’t brought up, but which was hanging awkwardly loose, last time was: granted that the outcome of a single game might have an interest level, or an information content, of 1 unit, what’s the unit? If we have units of mass and length and temperature and spiciness of chili sauce, don’t we have a unit of how informative something is?
We have. If we measure how interesting something is — how much information there is in its result — using base-two logarithms the way we did last time, then the unit of information is a bit. That is the same bit that somehow goes into bytes, which go on your computer into kilobytes and megabytes and gigabytes, and onto your hard drive or USB stick as somehow slightly fewer gigabytes than the label on the box says. A bit is, in this sense, the amount of information it takes to distinguish between two equally likely outcomes. Whether that’s a piece of information in a computer’s memory, where a 0 or a 1 is a priori equally likely, or whether it’s the outcome of a basketball game between two evenly matched teams, it’s the same quantity of information to have.
So a March Madness-style tournament has an information content of 63 bits, if all you’re interested in is which teams win. You could communicate the outcome of the whole string of matches by indicating whether the “home” team wins or loses for each of the 63 distinct games. You could do it with 63 flashes of light, or a string of dots and dashes on a telegraph, or checked boxes on a largely empty piece of graphing paper, coins arranged tails-up or heads-up, or chunks of memory on a USB stick. We’re quantifying how much of the message is independent of the medium.
Yes, I can hear people snarking, “not even the tiniest bit”. These are people who think calling all athletic contests “sportsball” is still a fresh and witty insult. No matter; what I mean to talk about applies to anything where there are multiple possible outcomes. If you would rather talk about how interesting the results of some elections are, or whether the stock market rises or falls, whether your preferred web browser gains or loses market share, whatever, read it as that instead. The work is all the same.
To talk about quantifying how interesting the outcome of a game (election, trading day, whatever) means we have to think about what “interesting” qualitatively means. A sure thing, a result that’s bound to happen, is not at all interesting, since we know going in that it’s the result. A result that’s nearly sure but not guaranteed is at least a bit interesting, since after all, it might not happen. An extremely unlikely result would be extremely interesting, if it could happen.
While reading that biography of Donald Coxeter that brought up that lovely triangle theorem, I ran across some mentions of the sphere-packing problem. That’s the treatment of a problem anyone who’s had a stack of oranges or golf balls has independently discovered: how can you arrange balls, all the same size (oranges are near enough), so as to have the least amount of wasted space between balls? It’s a mathematics problem with a lot of applications, both the obvious ones of arranging orange or golf-ball shipments, and less obvious ones such as sending error-free messages. You can recast the problem of sending a message so it’s understood even despite errors in coding, transmitting, receiving, or decoding, as one of packing equal-size balls around one another.
The “packing density” is the term used to say how much of a volume of space can be filled with balls of equal size using some pattern or other. Patterns called the cubic close packing or the hexagonal close packing are the best that can be done with periodic packings, ones that repeat some base pattern over and over; they fill a touch over 74 percent of the available space with balls. If you don’t want to follow the Mathworld links before, just get a tub of balls, or crate of oranges, or some foam Mystery Science Theater 3000 logo balls as packing materials when you order the new DVD set, and play around with a while and you’ll likely rediscover them. If you’re willing to give up that repetition you can get up to nearly 78 percent. Finding these efficient packings is known as the Kepler conjecture, and yes, it’s that Kepler, and it did take a couple centuries to show that these were the most efficient packings.
While thinking about that I wondered: what’s the least efficient way to pack balls? The obvious answer is to start with a container the size of the universe, and then put no balls in it, for a packing fraction of zero percent. This seems to fall outside the spirit of the question, though; it’s at least implicit in wondering the least efficient way to pack balls to suppose that there’s at least one ball that exists.
As indicated by my last post, I’d really like to tie in philosophical contributions to mathematics to the rise of the computer. I’d like to jump from Leibniz to Boole, since Boole got the ball rolling to finally bring to fruition what Leibniz first speculated on the possibility.
In graduate school, I came across a series of lectures by a former head of the entire research and development division of IBM, which covered, in surprising level of detail, the philosophical origins of the computer industry. To be honest, it’s the sort of subject that really should be book in length. But I think it really is a great contemporary example of exactly what philosophy is supposed to be, discovering new methods of analysis that as they develop are spun out of philosophy and are given birth as a new independent (or semi-independent) field their philosophical origins. Theoretical linguistics is a…