I had been talking about how much information there is in the outcome of basketball games, or tournaments, or the like. I wanted to fill in at least one technical term, to match some of the others I’d given.

In this information-theory context, an experiment is just anything that could have different outcomes. A team can win or can lose or can tie in a game; that makes the game an experiment. The outcomes are the team wins, or loses, or ties. A team can get a particular score in the game; that makes that game a different experiment. The possible outcomes are the team scores zero points, or one point, or two points, or so on up to whatever the greatest possible score is.

If you know the probability p of each of the different outcomes, and since this is a mathematics thing we suppose that you do, then we have what I was calling the information content of the outcome of the experiment. That’s a number, measured in bits, and given by the formula

$\sum_{j} - p_j \cdot \log\left(p_j\right)$

The sigma summation symbol means to evaluate the expression to the right of it for every value of some index $j$, and to add up the results. The $p_j$ means the probability of outcome number $j$. And the logarithm may be that of any base, although if we use base two then we have an information content measured in bits. Those are the same bits as are in the bytes that make up the megabytes and gigabytes in your computer. You can see this number as an estimate of how many well-chosen yes-or-no questions you’d have to ask to pick the actual result out of all the possible ones.
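If it helps to see the formula as something you can run, here is a minimal sketch in Python. The function name `shannon_entropy` and the sample probabilities are my own choices for illustration, and I’m assuming the probabilities are handed over as a plain list that adds up to one.

```python
import math

def shannon_entropy(probabilities):
    """Return the entropy, in bits, of an experiment with the given outcome probabilities."""
    # An outcome with probability zero never happens, so it adds nothing to the sum;
    # skipping it also avoids asking for the logarithm of zero.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# A game between evenly matched teams: win or lose, each with probability 1/2.
print(shannon_entropy([0.5, 0.5]))  # 1.0

# Four equally likely outcomes take two yes-or-no questions to pin down.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0

# A lopsided game (made-up probabilities) carries less than one bit.
print(shannon_entropy([0.9, 0.1]))
```

Using `math.log2` is what makes the answer come out in bits; swapping in another logarithm base would only rescale the number.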

I’d called this the information content of the experiment’s outcome. That’s an idiosyncratic term, chosen because I wanted to hide what it’s normally called. The normal name for this is the “entropy”.

To be more precise, it’s known as the “Shannon entropy”, after Claude Shannon, pioneer of the modern theory of information. However, the equation defining it looks the same as one that defines the entropy of statistical mechanics, that thing everyone knows is always increasing and somehow connected with stuff breaking down. Well, almost the same. The statistical mechanics one multiplies the sum by a constant number called the Boltzmann constant, after Ludwig Boltzmann, who did so much to put statistical mechanics in its present and very useful form. We aren’t thrown by that. The statistical mechanics entropy describes energy that is in a system but that can’t be used. It’s almost background noise, present but nothing of interest.

Is this Shannon entropy the same entropy as in statistical mechanics? This gets into some abstract grounds. If two things are described by the same formula, are they the same kind of thing? Maybe they are, although it’s hard to see what kind of thing might be shared by “how interesting the score of a basketball game is” and “how much unavailable energy there is in an engine”.

The legend has it that when Shannon was working out his information theory he needed a name for this quantity. John von Neumann, the mathematician and pioneer of computer science, suggested, “You should call it entropy. In the first place, a mathematical development very much like yours already exists in Boltzmann’s statistical mechanics, and in the second place, no one understands entropy very well, so in any discussion you will be in a position of advantage.” There are variations of the quote, but they have the same structure and punch line. The anecdote appears to trace back to an April 1961 seminar at MIT given by one Myron Tribus, who claimed to have heard the story from Shannon. I am not sure whether it is literally true, but it does express a feeling about how people understand entropy that is true.

Well, these entropies have the same form. And they’re given the same name, give or take a modifier of “Shannon” or “statistical” or some other qualifier. They’re even often given the same symbol; normally a capital S or maybe an H is used as the quantity of entropy. (H tends to be more common for the Shannon entropy, but your equation would be understood either way.)

I’m not comfortable saying they’re the same thing, though. After all, we use the same formula to calculate a batting average and to work out the average time of a commute. But we don’t think those are the same thing, at least not more generally than “they’re both averages”. These entropies measure different kinds of things. They have different units that just can’t be sensibly converted from one to another. And the statistical mechanics entropy has many definitions that not just don’t have parallels for information, but wouldn’t even make sense for information. I would call these entropies siblings, with strikingly similar profiles, but not more than that.

But let me point out something about the Shannon entropy. It is low when an outcome is predictable. If the outcome is unpredictable, presumably knowing the outcome will be interesting, because there is no guessing what it might be. This is where the entropy is maximized. But an absolutely random outcome also has a high entropy. And that’s boring. There’s no reason for the outcome to be one option instead of another. Somehow, as looked at by the measure of entropy, the most interesting of outcomes and the most meaningless of outcomes blur together. There is something wondrous and strange in that.
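To make the “low when predictable, high when not” behavior concrete, here is a small sketch for an experiment with just two outcomes, such as a game that one team either wins or loses. The function name `binary_entropy` and the list of probabilities are hypothetical, picked only for illustration.

```python
import math

def binary_entropy(p):
    """Entropy in bits of an experiment whose two outcomes have probability p and 1 - p."""
    # Skip any outcome with probability zero; it contributes nothing to the sum.
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

# Walk from a coin-flip game toward a foregone conclusion.
for p in (0.5, 0.7, 0.9, 0.99, 1.0):
    print(f"favorite wins with probability {p:.2f}: {binary_entropy(p):.3f} bits")
```

The entropy peaks at exactly one bit for the coin-flip game and falls to zero as the favorite’s win becomes certain. The formula can’t tell a thrilling toss-up from a meaningless coin flip; both score the same.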

When I wrote last weekend’s piece about how interesting a basketball tournament was, I let some terms slide without definition, mostly so I could explain what ideas I wanted to use and how they should relate. My love, for example, read the article, looked up, and asked what exactly I meant by “interesting” when I tried to measure how interesting a set of games might be, even though the reasoning that brought me to a 63-game tournament having an interest level of 63 seemed to satisfy.

When I spoke about something being interesting, what I had meant was that it’s something whose outcome I would like to know. In mathematical terms this “something whose outcome I would like to know” is often termed an “experiment” to be performed or, even better, a “message” that presumably I will receive; and the outcome is the “information” of that experiment or message. And information is, in this context, something you do not know but would like to.

So the information content of a foregone conclusion is low, or at least very low, because you already know what the result is going to be, or are pretty close to knowing. The information content of something you can’t predict is high, because you would like to know it but there’s no (accurately) guessing what it might be.

This seems like a straightforward idea of what information should mean, and it’s a very fruitful one; the field of “information theory” and a great deal of modern communication theory are based on it. This doesn’t mean there aren’t some curious philosophical implications, though; for example, technically speaking, this seems to imply that anything you already know is by definition not information, and therefore learning something destroys the information it had. This seems impish, at least. Claude Shannon, who’s largely responsible for information theory as we now know it, was renowned for jokes; I recall a Time-Life science-series book mentioning how he had built a complex-looking contraption which, turned on, would churn to life, make a hand poke out of its innards, and turn itself off, which makes me smile to imagine. Still, this definition of information is a useful one, so maybe I’m imagining a prank where there’s not one intended.

And something I hadn’t brought up, but which was hanging awkwardly loose, last time was: granted that the outcome of a single game might have an interest level, or an information content, of 1 unit, what’s the unit? If we have units of mass and length and temperature and spiciness of chili sauce, don’t we have a unit of how informative something is?

We have. If we measure how interesting something is — how much information there is in its result — using base-two logarithms the way we did last time, then the unit of information is a bit. That is the same bit that somehow goes into bytes, which go on your computer into kilobytes and megabytes and gigabytes, and onto your hard drive or USB stick as somehow slightly fewer gigabytes than the label on the box says. A bit is, in this sense, the amount of information it takes to distinguish between two equally likely outcomes. Whether that’s a piece of information in a computer’s memory, where a 0 or a 1 is a priori equally likely, or whether it’s the outcome of a basketball game between two evenly matched teams, it’s the same quantity of information to have.

So a March Madness-style tournament has an information content of 63 bits, if all you’re interested in is which teams win. You could communicate the outcome of the whole string of matches by indicating whether the “home” team wins or loses for each of the 63 distinct games. You could do it with 63 flashes of light, or a string of dots and dashes on a telegraph, or checked boxes on a largely empty piece of graph paper, or coins arranged tails-up or heads-up, or chunks of memory on a USB stick. We’re quantifying how much of the message is independent of the medium.
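Here is one way the 63-bit message might look in practice, as a sketch rather than anything official. I’m assuming every game is a coin flip, and the names `encode` and `decode` are my own inventions for the illustration; the results themselves are random made-up data.

```python
import math
import random

GAMES = 63  # games in a 64-team single-elimination tournament

# If every game is a coin flip, all 2**63 possible sets of results are equally
# likely, so the information content is log2(2**63) bits.
print(math.log2(2 ** GAMES))  # 63.0

def encode(results):
    """Pack a list of True/False results (did the "home" team win?) into one integer."""
    message = 0
    for home_team_won in results:
        # Shift the earlier results left and put the new one in the lowest bit.
        message = (message << 1) | int(home_team_won)
    return message

def decode(message, games=GAMES):
    """Recover the list of results from the packed integer, first game first."""
    return [bool((message >> shift) & 1) for shift in range(games - 1, -1, -1)]

# Made-up results: each game decided by a fair coin.
results = [random.random() < 0.5 for _ in range(GAMES)]
message = encode(results)

print(decode(message) == results)         # True: nothing was lost
print(message.bit_length() <= GAMES)      # True: it never needs more than 63 bits
```

The integer is just one more medium, alongside the flashes of light and the coins; what stays fixed is the count of 63 yes-or-no answers.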