## How Did Compute!’s Gazette’s MLX Program Work?

This is, at least, a retrocomputing-adjacent piece. I’m looking back at the logic of a common and useful tool from the early-to-mid-80s and why it’s built that way. I hope you enjoy. It has to deal with some of the fussier points about how Commodore 64 computers worked. If you find a paragraph is too much technical fussing for you, I ask you to not give up, just zip on to the next paragraph. It’s interesting to know why something was written that way, but it’s all right to accept that it was and move to the next point.

When the world and I were young, in the 1980s, we still had computers. There were two ways to get software, though. One was trading cassette tapes or floppy disks with cracked programs on them. (The cracking was taking off the copy-protection.) The other was typing. You could type in your own programs, certainly, just like you can make your own web page just by typing. Or you could type in a program. We had many magazines and books that had programs ready for entry. Some were serious programs, spreadsheets and word processors and such. Some were fun, like games or fractal-generators or such. Some were in-between, programs to draw or compose music or the such. Some added graphics or sound commands that the built-in BASIC programming language lacked. All this was available for the $2.95 cover price, or ten cents a page at the library photocopier. I had a Commodore 64 for most of this era, moving to a Commodore 128 (which also ran Commodore 64 programs) in 1989 or so. So my impressions, and this article, default to the Commodore 64 experience. These programs all had the same weakness. You had to type them in. You can expect to make errors. If the program was written in BASIC you had a hope of spotting errors. The BASIC programming language uses common English words for its commands. Their grammar is not English, but it’s also very formulaic, and not hard to pick up. One has a chance of spotting mistakes if it’s 250 PIRNT "SUM; " S one typed. But many programs were distributed as machine language. That is, the actual specific numbers that correspond to microchip instructions. For the Commodore 64, and most of the eight-bit home computers of the era, this was the 6502 microchip. (The 64 used a variation, the 6510. The differences between the 6502 and 6510 don’t matter for this essay.) Machine language had advantages, making the programs run faster, and usually able to do more things than BASIC could. But a string of numbers is only barely human-readable. Oh, you might in time learn to recognize the valid microchip instructions. But it is much harder to spot the mistakes on entering 32 255 120. That last would be a valid command on any eight-bit Commodore computer. It would have the computer print something, if it weren’t for the transposition errors. ## What Was MLX and How Did You Use It? The magazines came up with tools to handle this. In the 398-page(!) December 1983 issue of Compute!, my favorite line of magazines introduced MLX. This was a program, written in BASIC, which let you enter machine language programs. Charles Brannon has the credit for writing the article which introduced it. I assume he also wrote the program, but could be mistaken. I’m open to better information. Other magazines had other programs to do the same work; I knew them less well. MLX formatted machine language programs to look like this: 49152 :169,002,141,178,002,169,149 49158 :000,141,179,002,141,180,137 49164 :002,141,181,002,169,001,252 49170 :141,183,002,169,003,141,145  This was the first few lines of code for a game called Turnabout, by Mark Tuttle and Kevin Mykytyn. It ran in issue #28 of Commodore-computers-spinoff magazine Compute!’s Gazette. You can see the original at archive.org and imagine the poor wrists of people typing all this in. What did all this mean, though? These were lines you would enter in while running MLX. Before the colon was a location in memory. The numbers after the colon — the entries, I’ll call them — are six machine language instructions, one number to go into each memory cell. So, the number 169 was destined to go into memory location 49152. The number 002 would go into memory location 49153. The number 141 would go into memory location 49154. And so on; 000 would go into memory location 49158, 141 into 49159, 179 into 49160. 002 would go into memory location 49164; 141 would go into memory location 49170. And so on. MLX would prompt you with the line number, the 49152 or 49158 or 49164 or so on. Machine language programs could go into almost any memory location. You had to tell it where to start. 49152 was a popular location for Commodore 64 programs. It was the start of a nice block of memory not easily accessed except by machine language programs. Then you would type in the entries, the numbers that follow. This was a reasonably efficient way to key this stuff in. MLX automatically advanced the location in memory and would handle things like saving the program to tape or disk when you were done. The alert reader notices, though, that there are seven entries after the colon in each line. That seventh number is the checksum. It’s the guard that Compute! and Compute!’s Gazette put against typos. This seventh number was a checksum. MLX did a calculation based on the memory location and the first six numbers of the line. If it was not the seventh number on the line, then there was an error somewhere. You had to re-enter the line to get it right. The thing I’d wondered, and finally got curious enough to explore, was how it calculated this. ## What Was The Checksum And How Did It Work? Happily, Compute! and Compute!’s Gazette published MLX in almost every issue, so it’s easy to find. You can see it, for example, on page 123 of the October 1985 issue of Compute!’s Gazette. And MLX was itself a BASIC program. There are quirks of the language, and its representation in magazine print, that take time to get used to. But one can parse it without needing much expertise. One important thing is that most Commodore BASIC commands didn’t need spaces after them. For an often-used program like this they’d skip the spaces. And the : symbol denoted the end of one command and start of another. So, for example, PRINTCHR$(20):IFN=CKSUMTHEN530 one learns means PRINT CHR\$(20); IF N = CKSUM THEN 530.

So how does it work? MLX is, as a program, convoluted. It’s well-described by the old term “spaghetti code”. But the actual calculation of the checksum is done in a single line of the program, albeit one with several instructions. I’ll print it, but with some spaces added in to make it easier to read.

500 CKSUM = AD - INT(AD/256)*256:
FOR I = 1 TO 6:
CKSUM = (CKSUM + A(I))AND 255:
NEXT


Most of this you have a chance of understanding even if you don’t program. CKSUM is the checksum number. AD is the memory address for the start of the line. A is an array of six numbers, the six numbers of that line of machine language. I is an index, a number that ranges from 1 to 6 here. Each A(I) happens to be a number between 0 and 255 inclusive, because that’s the range of integers you can represent with eight bits.

## What Did This Code Mean?

So to decipher all this. Starting off. CKSUM = AD - INT(AD/256)*256. INT means “calculate the largest integer not greater than whatever’s inside”. So, like, INT(50/256) would be 0; INT(300/256) would be 1; INT(600/256) would be 2. What we start with, then, is the checksum is “the remainder after dividing the line’s starting address by 256”. We’re familiar with this, mathematically, as “address modulo 256”.

In any modern programming language, we’d write this as CKSUM = MOD(AD, 256) or CKSUM = AD % 256. But Commodore 64 BASIC didn’t have a modulo command. This structure was the familiar and comfortable enough workaround. But, read on.

The next bit was a for/next loop. This would do the steps inside for every integer value of I, starting at 1 and increasing to 6. CKSUM + A(I) has an obvious enough intention. What is the AND 255 part doing, though?

AND, here, is a logic operator. For the Commodore 64, it works on numbers represented as two-byte integers. These have a memory representation of 11111111 11111111 for ‘true’, and 00000000 00000000 for ‘false’. The very leftmost bit, for integers, is a plus-or-minus-sign. If that leftmost bit is a 1, the number is negative; if that leftmost bit is a 0, the number is positive. Did you notice me palming that card, there? We’ll come back to that.

Ordinary whole numbers can be represented in binary too. Like, the number 26 has a binary representation of 00000000 00011010. The number, say, 14 has a binary representation of 00000000 00001110. 26 AND 14 is the number 00000000 00001010, the binary digit being a 1 only when both the first and second numbers have a 1 in that column. This bitwise and operation is also sometimes referred to as masking, as in masking tape. The zeroes in the binary digits of one number mask out the binary digits of the other. (Which does the masking is a matter of taste; 26 AND 14 is the same number as 14 AND 26.)

The binary 00000000 0001010 is the decimal number 10. So you can see that generally these bitwise and operations give you weird results. Taking the bitwise and for 255 is more predictable, though. The number 255 has a bit representation of 00000000 11111111. So what (CKSUM + A(I)) AND 255 does is … give the remainder after dividing (CKSUM + A(I)) by 256. That is, it’s (CKSUM + A(I)) modulo 256.

The formula’s not complicated. To write it in mathematical terms, the calculation is:

$ck = \left(addr + \sum_{i = 1}^6 a_i\right) mod 256$

## Why Write It Like That?

So we have a question. Why are we calculating a number modulo 256 by two different processes? And in the same line of the program?

We get an answer by looking at the binary representation of 49152, which is 11000000 00000000. Remember that card I just palmed? I had warned that if the leftmost digit there were a 1, the number was understood to be negative. 49152 is many things, none of them negative.

So now we know the reason behind the odd programming choice to do the same thing two different ways. As with many odd programming choices it amounts to technical details of how Commodore hardware worked. The Commodore 64’s logical operators — AND, OR, and NOT — work on variables stored as two-byte integers. Two-byte integers can represent numbers from -32,768 up to +32,767. But memory addresses on the Commodore 64 are indexed from 0 up to 65,535. We can’t use bit masking to do the modulo operation, not on memory locations.

I have a second question, though. Look at the work inside the FOR loop. It takes the current value of the checksum, adds one of the entries to it, and takes the bitwise AND of that with 255. Why? The value would be the same if we waited until the loop was done to take the bitwise AND. At least, it would be unless the checksum grew to larger than 32,767. The checksum will be the sum of at most seven numbers, none of them larger than 255, though, so that can’t be the contraint. It’s usually faster to do as little inside a loop as possible, so, why this extravagance?

My first observation is that this FOR loop does the commands inside it six times. And logical operations like AND are very fast. The speed difference could not possibly be perceived. There is a point where optimizing your code is just making life harder for yourself.

My second observation goes back to the quirks of the Commodore 64. You entered commands, like the lines of a BASIC program, on a “logical line” that allowed up to eighty tokens. For typing in commands this is the same as the number of characters. Can this line be rewritten so there’s no redundant code inside the for loop, and so it’s all under 80 characters long?

Yes. This line would have the same effect and it’s only 78 characters:

500 CKSUM=AD-INT(AD/256)*256:FORI=1TO6:CKSUM=CKSUM+A(I):NEXT:CKSUM=CKSUMAND255


Why not use that, then?

I don’t have a clear answer. I suspect it’s for the benefit of people typing in the MLX program. In typing that in I’d have trouble not putting in a space between FOR and I, or between CKSUM and AND. Also before and after the TO and before and after AND. This would make the line run over 80 characters and make it crash. The original line is 68 characters, short enough that anyone could add a space here and there and not mess up anything. In looking through MLX, and other programs, I find there are relatively few lines more than 70 characters long. I have found them as long as 76 characters, though. I can’t rule out there being 78- or 79-character lines. They would have to suppose anyone typing them in understands when the line is too long.

There’s an interesting bit of support for this. Compute! also published machine language programs for the Atari 400 and 800. A version of MLX came out for the Atari at the same time the Commodore 64’s came out. Atari BASIC allowed for 120 characters total. And the equivalent line in Atari MLX was:

500 CKSUM=ADDR-INT(ADDR/256)*256:FOR I=1 TO 6:CKSUM=CKSUM+A(I):CKSUM=CKSUM-256*(CKSUM>255):NEXT I


This has a longer name for the address variable. It uses a different way to ensure that CKSUM stays a number between 0 and 255. But the whole line is only 98 characters.

We could save more spaces on the Commodore 64 version, though. Commodore BASIC “really” used only the first two characters of a variable name. To write CKSUM is for the convenience of the programmer. To the computer it would be the same if we wrote CK. We could even truncate it to CK for this one line of code. The only penalty would be confusing the reader who doesn’t remember that CK and CKSUM are the same variable.

And there’s no reason that this couldn’t have been two lines. One line could add up the checksum and a second could do the bitwise AND. Maybe this is all a matter of the programmer’s tastes.

In a modern language this is all quite zippy to code. To write it in Octave or Matlab is something like:

function [checksOut] = oldmlx(oneline)
entries = oneline(2:7);
keyedChecksum = oneline(8);

checksum += sum(entries);
checksum = mod(checksum, 256);

checksOut = (keyedChecksum == checksum);
endfunction


This is a bit verbose. I want it to be easier to see what work is being done. We could make it this compact:

function [checksOut] = oldmlx(oneline)
checksOut = !(mod(sum(oneline(1:7))-oneline(8), 256));
endfunction


I don’t like compressing my thinking quite that much, though.

But that’s the checksum. Now the question: did it work?

Since Compute! and Compute!’s Gazette used it for years, the presumptive answer is that it did. The real question, then, is did it work well? “Well” means does it prevent the kinds of mistakes you’re likely to make without demanding too much extra work. We could, for example, eliminate nearly all errors by demanding every line be entered three times and accept only a number that’s entered the same at least two of three times. That’s an incredible typing load. Here? We have to enter one extra number for every six. Much lower load, but it allows more errors through. But the calculation is — effectively — simply “add together all the numbers we typed in, and see if that adds to the expected total”. If it stops the most likely errors, though, then it’s good. So let’s consider them.

The first and simplest error? Entering the wrong line. MLX advanced the memory location on its own. So if you intend to write the line for memory location 50268, and your eye slips and you start entering that for 50274 instead? Or even, reading left to right, going to line 50814 in the next column? Very easy to do. This checksum will detect that nicely, though. Entering one line too soon, or too late, will give a checksum that’s off by 6. If your eye skips two lines, the checksum will be off by 12. The only way to not have the checksum miss is to enter a line that’s some multiple of 256 memory locations away. And since each line is six memory locations, that means you have to jump 768 memory locations away. That is 128 lines away. You are not going to make that mistake. (Going from one column in the magazine to the next is a jump of 91 lines. The pages were 8½-by-11 pages, so were a bit easier to read than the image makes them look.)

How about other errors? You could mis-key, say, 169. But think of the plausible errors. Typing it in as 159 or 196 or 269 would be detected by the checksum. The only one that wouldn’t would be to enter a number that’s equal to 169, modulo 256. So, 425, say, or 681. There is nobody so careless as to read 169 and accidentally type 425, though. In any case, other code in MLX rejects any data that’s not between 0 and 255, so that’s caught before the checksum comes into play.

So it’s safe against the most obvious mistake. And against mis-keying a single entry. Yes, it’s possible that you typed in the whole line right but mis-keyed the checksum. If you did that you felt dumb but re-entered the line. If you even noticed and didn’t just accept the error report and start re-entering the line.

What about mis-keying double entries? And here we have trouble. Suppose that you’re supposed to enter 169, 062 and instead enter 159, 072. They’ll add to the same quantity, and the same checksum. All that’s protecting you is that it takes a bit of luck to make two errors that exactly balance each other. But, then, slipping and hitting an adjacent number on the keyboard is an easy mistake to make.

Worse is entry transposition. If you enter 062, 169 instead you have made no checksum errors. And you won’t even be typing any number “wrong”. At least with the mis-keying you might notice that 169 is a common number and 159 a rare one in machine language. (169 was the command “Load Accumulator”. That is, copy a number into the Central Processing Unit’s accumulator. This was one of three on-chip memory slots. 159 was no meaningful command. It would only appear as data.) Swapping two numbers is another easy error to make.

And they would happen. I can attest from experience. I’d had at least one program which, after typing, had one of these glitches. After all the time spent entering it, I ended up with a program that didn’t work. And I never had the heart to go back and track down the glitch or, more efficiently, retype the whole thing from scratch.

The irony is that the program with the critical typing errors was a machine language compiler. It’s something that would have let me write this sort of machine language code. Since I never reentered it, I never created anything but the most trivial of machine language programs for the 64.

So this MLX checksum was fair. It devoted one-seventh of the typing to error detection. It could catch line-swap errors, single-entry mis-keyings, and transpositions within one entry. It couldn’t catch transposing two entries. So that could have been better. I hope to address that soon.

## My 2019 Mathematics A To Z: Encryption schemes

Today’s A To Z term is encryption schemes. It’s another suggested by aajohannas. It’s a chance to dip into information theory.

Mr Wu, author of the Mathtuition88 blog, suggested the Extreme Value Theorem. I was tempted and then realized that I had written this in the 2018 A-to-Z, as the “X” letter. The end of the alphabet has a shortage of good mathematics words. Sometimes we have to work around problems.

# Encryption schemes.

Why encrypt anything?

The oldest reason to hide a message, at least from all but select recipients. Ancient encryption methods will substitute one letter for another, or will mix up the order of letters in a message. This won’t hide a message forever. But it will slow down a person trying to decrypt the message until they decide they don’t need to know what it says. Or decide to bludgeon the message-writer into revealing the secret.

Substituting one letter for another won’t stop an eavesdropper from working out the message. Not indefinitely, anyway. There are patterns in the language. Any language, but take English as an example. A single-letter word is either ‘I’ or ‘A’. A two-letter word has a great chance of being ‘in’, ‘on’, ‘by’, ‘of’, ‘an’, or a couple other choices. Solving this is a fun pastime, for people who like this. If you need it done fast, let a computer work it out.

To hide the message better requires being cleverer. For example, you could substitue letters according to a slightly different scheme for each letter in the original message. The Vignère cipher is an example of this. I remember some books from my childhood, written in the second person. They had programs that you-the-reader could type in to live the thrill of being a child secret agent computer programmer. This encryption scheme was one of the programs used for passing on messages. We can make the plans more complicated yet, but that won’t give us better insight yet.

The objective is to turn the message into something less predictable. An encryption which turns, say, ‘the’ into ‘rgw’ will slow the reader down. But if they pay attention and notice, oh, the text also has the words ‘rgwm’, ‘rgey’, and rgwb’ turn up a lot? It’s hard not to suspect these are ‘the’, ‘them’, ‘they’, and ‘then’. If a different three-letter code is used for every appearance of ‘the’, good. If there’s a way to conceal the spaces as something else, that’s even better, if we want it harder to decrypt the message.

So the messages hardest to decrypt should be the most random. We can give randomness a precise definition. We owe it to information theory, which is the study of how to encode and successfully transmit and decode messages. In this, the information content of a message is its entropy. Yes, the same word as used to describe broken eggs and cream stirred into coffee. The entropy measures how likely each possible message is. Encryption matches the message you really want with a message of higher entropy. That is, one that’s harder to predict. Decrypting reverses that matching.

So what goes into a message? We call them words, or codewords, so we have a clear noun to use. A codeword is a string of letters from an agreed-on alphabet. The terminology draws from common ordinary language. Cryptography grew out of sending sentences.

But anything can be the letters of the alphabet. Any string of them can be a codeword. An unavoidable song from my childhood told the story of a man asking his former lover to tie a yellow ribbon around an oak tree. This is a tiny alphabet, but it only had to convey two words, signalling whether she was open to resuming their relationship. Digital computers use an alphabet of two memory states. We label them ‘0’ and ‘1’, although we could as well label them +5 and -5, or A and B, or whatever. It’s not like actual symbols are scrawled very tight into the chips. Morse code uses dots and dashes and short and long pauses. Naval signal flags have a set of shapes and patterns to represent the letters of the alphabet, as well as common or urgent messages. There is not a single universally correct number of letters or length of words for encryption. It depends on what the code will be used for, and how.

Naval signal flags help me to my next point. There’s a single pattern which, if shown, communicates the message “I require a pilot”. Another, “I am on fire and have dangerous cargo”. Still another, “All persons should report on board as the vessel is about to set to sea”. These are whole sentences; they’re encrypted into a single letter.

And this is the second great use of encryption. English — any human language — has redundancy to it. Think of the sentence “No, I’d rather not go out this evening”. It’s polite, but is there anything in it not communicated by texting back “N”? An encrypted message is, often, shorter than the original. To send a message costs something. Time, if nothing else. To send it more briefly is typically better.

There are dangers to this. Strike out any word from “No, I’d rather not go out this evening”. Ask someone to guess what belongs there. Only the extroverts will have trouble. I guess if you strike out “evening” people might guess “time” or “weekend” or something. The sentiment of the sentence endures.

But strike out a letter from “N” and ask someone to guess what was meant. And this is a danger of encryption. The encrypted message has a higher entropy, a higher unpredictability. If some mistake happens in transmission, we’re lost.

We can fight this. It’s possible to build checks into an encryption. To carry a bit of extra information that lets one know that the message was garbled. These are “error-detecting codes”. It’s even possible to carry enough extra information to correct some errors. These are “error-correcting codes”. There are limits, of course. This kind of error-correcting takes calculation time and message space. We lose some economy but gain reliability. There is a general lesson in this.

And not everything can compress. There are (if I’m reading this right) 26 letter, 10 numeral, and four repeater flags used under the International Code of Symbols. So there are at most 40 signals that could be reduced to a single flag. If we need to communicate “I am on fire but have no dangerous cargo” we’re at a loss. We have to spell things out more. It’s a quick proof, by way of the pigeonhole principle, which tells us that not every message can compress. But this is all right. There are many messages we will never need to send. (“I am on fire and my cargo needs updates on Funky Winkerbean.”) If it’s mostly those that have no compressed version, who cares?

Encryption schemes are almost as flexible as language itself. There are families of kinds of schemes. This lets us fit schemes to needs: how many different messages do we need to be able to send? How sure do we need to be that errors are corrected? Or that errors are detected? How hard do we want it to be for eavesdroppers to decode the message? Are we able to set up information with the intended recipients separately? What we need, and what we are willing to do without, guide the scheme we use.

Thank you again for reading. All of Fall 2019 A To Z posts should be at this link. I hope to have a letter F piece on Thursday. All of the A To Z essays should be at this link and if I can sort out some trouble with the first two, they will be soon. And if you’d like to nominate topics for essays, I’m asking for the letters I through N at this link.

## Reading the Comics, May 25, 2019: Slighter Comics Edition.

It turned out to be Thursday. These things happen. The comics for the second half of last week were more marginal

Zach Weinersmith’s Saturday Morning Breakfast Cereal for the 20th is a joke about holographic cosmology, proving that there are such things as jokes about holographic cosmology. Cosmology is about the big picture stuff, like, why there is a universe and why it looks like that. It’s a rather mathematical field, owing to the difficulty of doing controlled experiments. Holograms are that same technology used back in the 80s to put shoddy three-dimensional-ish pictures of eagles on credit cards. (In the United States. I imagine they were other animals in other countries.) Holograms, at least when they’re well-made, encode the information needed to make a three-dimensional image in a two-dimensional surface. (Please pretend that anything made of matter is two-dimensional like that.)

Holographic cosmology is a mathematical model for the universe. It represents the things in a space with a description of information on the boundary of this space. This seems bizarre and it won’t surprise you that key inspiration was in the strange physics of black holes. Properties of everything which falls into a black hole manifest in the event horizon, the boundary between normal space and whatever’s going on inside the black hole. The black hole is this three-dimensional volume, but in some way everything there is to say about it is the two-dimensional edge.

Dr Leonard Susskind did much to give this precise mathematical form. You didn’t think the character name was just a bit of whimsy, did you? Susskind’s work showed how the information of a particle falling into a black hole — information here meaning stuff like its position and momentum — turn into oscillations in the event horizon. The holographic principle argues this can be extended to ordinary space, the whole of the regular universe. Is this so? It’s hard to say. It’s a corner of string theory. It’s difficult to run experiments that prove very much. And we are stuck with an epistemological problem. If all the things in the universe and their interactions are equally well described as a three-dimensional volume or as a two-dimensional surface, which is “real”? It may seem intuitively obvious that we experience a three-dimensional space. But that intuition is a way we organize our understanding of our experiences. That’s not the same thing as truth.

Gene Weingarten, Dan Weingarten, and David Clark’s Barney and Clyde for the 22nd is a joke about power, and how it can coerce someone out of truth. Arithmetic serves as an example of indisputable truth. It could be any deductive logic statement, or for that matter a definition. Arithmetic is great for the comic purpose needed here, though. Anyone can understand, at least the simpler statements, and work out their truth or falsity. And need very little word balloon space for it.

Bill Griffith’s Zippy the Pinhead for the 25th also features a quick mention of algebra as the height of rationality. Also as something difficult to understand. Most fields are hard to understand, when you truly try. But algebra works well for this writing purpose. Anyone who’d read Zippy the Pinhead has an idea of what understanding algebra would be like, the way they might not have an idea of holographic cosmology.

Teresa Logan’s Laughing Redhead Comics for the 25th is the Venn diagram joke for the week, this one with a celebrity theme. Your choice whether the logic of the joke makes sense. Ryan Reynolds and John Krasinski are among those celebrities that I keep thinking I don’t know, but that it turns out I do know. Ryan Gosling I’m still not sure about.

And then there are a couple strips too slight even to appear in this collection. Dean Young and John Marshall’s Blondie on the 22nd did a lottery joke, with discussion of probability along the way. (And I hadn’t had a tag for ‘Blondie’ before, so that’s an addition which someday will baffle me.) Bob Shannon’s Tough Town for the 23rd mentions mathematics teaching. It’s in service of a pun.

And now I’ve had the past week covered. The next Reading the Comics post should be at this link come Sunday.

## Let Me Tell You How Interesting March Madness Could Possibly Be

I read something alarming in the daily “Best of GoComics” e-mail this morning. It was a panel of Dave Whamond’s Reality Check. It’s a panel comic, although it stands out from the pack by having a squirrel character in the margins. And here’s the panel.

Certainly a solid enough pun to rate a mention. I don’t know of anyone actually doing a March Mathness bracket, but it’s not a bad idea. Rating mathematical terms for their importance or usefulness or just beauty might be fun. And might give a reason to talk about their meaning some. It’s a good angle to discuss what’s intersting about mathematical terms.

And that lets me segue into talking about a set of essays. The next few weeks see the NCAA college basketball tournament, March Madness. I’ve used that to write some stuff about information theory, as it applies to the question: is a basketball game interesting?

Along the way here I got to looking up actual scoring results from major sports. This let me estimate the information-theory content of the scores of soccer, (US) football, and baseball scores, to match my estimate of basketball scores’ information content.

• How Interesting Is A Football Score? Football scoring is a complicated thing. But I was able to find a trove of historical data to give me an estimate of the information theory content of a score.
• How Interesting Is A Baseball Score? Some Partial Results I found some summaries of actual historical baseball scores. Somehow I couldn’t find the detail I wanted for baseball, a sport that since 1845 has kept track of every possible bit of information, including how long the games ran, about every game ever. I made do, though.
• How Interesting Is A Baseball Score? Some Further Results Since I found some more detailed summaries and refined the estimate a little.
• How Interesting Is A Low-Scoring Game? And here, well, I start making up scores. It’s meant to represent low-scoring games such as soccer, hockey, or baseball to draw some conclusions. This includes the question: just because a distribution of small whole numbers is good for mathematicians, is that a good match for what sports scores are like?

## Is A Basketball Tournament Interesting? My Thoughts

It’s a good weekend to bring this back. I have some essays about information theory and sports contests and maybe you missed them earlier. Here goes.

And then for a follow-up I started looking into actual scoring results from major sports. This let me estimate the information-theory content of the scores of soccer, (US) football, and baseball scores, to match my estimate of basketball scores’ information content.

Don’t try to use this to pass your computer science quals. But I hope it gives you something interesting to talk about while sulking over your brackets, and maybe to read about after that.

## How Interesting Is March Madness?

And now let me close the week with some other evergreen articles. A couple years back I mixed the NCAA men’s basketball tournament with information theory to produce a series of essays that fit the title I’ve given this recap. They also sprawl out into (US) football and baseball. Let me link you to them:

## Reading the Comics, September 6, 2016: Oh Thank Goodness We’re Back Edition

That’s a relief. After the previous week’s suspicious silence Comic Strip Master Command sent a healthy number of mathematically-themed comics my way. They cover a pretty normal spread of topics. So this makes for a nice normal sort of roundup.

Mac King and Bill King’s Magic In A Minute for the 4th is an arithmetic-magic-trick. Like most arithmetic-magic it depends on some true but, to me, dull bit of mathematics. In this case, that 81,234,567 minus 12,345,678 is equal to something. As a kid this sort of trick never impressed me because, well, anyone can do subtraction. I didn’t appreciate that the fun of stage magic in presenting well the mundane.

Jerry Scott and Jim Borgman’s Zits for the 5th is an ordinary mathematics-is-hard joke. But it’s elevated by the artwork, which shows off the expressive and slightly surreal style that makes the comic so reliable and popular. The formulas look fair enough, the sorts of things someone might’ve been cramming before class. If they’re a bit jumbled up, well, Pierce hasn’t been well.

Jeffrey Caulfield and Alexandre Rouillard’s Mustard and Boloney for the 6th is an anthropomorphic-shapes joke and I feel like it’s been here before. Ah, yeah, there it is, from about this time last year. It’s a fair one to rerun.

Mustard and Boloney popped back in on the 8th with a strip I don’t have in my archive at least. It’s your standard Pi Pun, though. If they’re smart they’ll rerun it in March. I like the coloring; it’s at least a pleasant panel to look at.

Percy Crosby’s Skippy from the 9th of July, 1929 was rerun the 6th of September. It seems like a simple kid-saying-silly-stuff strip: what is the difference between the phone numbers Clinton 2651 and Clinton 2741 when they add to the same number? (And if Central knows what the number is why do they waste Skippy’s time correcting him? And why, 87 years later, does the phone yell at me for not guessing correctly whether I need the area code for a local number and whether I need to dial 1 before that?) But then who cares what the digits in a telephone number add to? What could that tell us about anything?

As phone numbers historically developed, the sum can’t tell us anything at all. But if we had designed telephone numbers correctly we could have made it … not impossible to dial a wrong number, but at least made it harder. This insight comes to us from information theory, which, to be fair, we have because telephone companies spent decades trying to work out solutions to problems like people dialing numbers wrong or signals getting garbled in the transmission. We can allow for error detection by schemes as simple as passing along, besides the numbers, the sum of the numbers. This can allow for the detection of a single error: had Skippy called for number 2641 instead of 2741 the problem would be known. But it’s helpless against two errors, calling for 2541 instead of 2741. But we could detect a second error by calculating some second term based on the number we wanted, and sending that along too.

By adding some more information, other modified sums of the digits we want, we can even start correcting errors. We understand the logic of this intuitively. When we repeat a message twice after sending it, we are trusting that even if one copy of the message is garbled the recipient will take the version received twice as more likely what’s meant. We can design subtler schemes, ones that don’t require we repeat the number three times over. But that should convince you that we can do it.

The tradeoff is obvious. We have to say more digits of the number we want. It isn’t hard to reach the point we’re ending more error-detecting and error-correcting numbers than we are numbers we want. And what if we make a mistake in the error-correcting numbers? (If we used a smart enough scheme, we can work out the error was in the error-correcting number, and relax.) If it’s important that we get the message through, we shrug and accept this. If there’s no real harm done in getting the message wrong — if we can shrug off the problem of accidentally getting the wrong phone number — then we don’t worry about making a mistake.

And at this point we’re only a few days into the week. I have enough hundreds of words on the close of the week I’ll put off posting that a couple of days. It’s quite good having the comics back to normal.

## How Interesting Is A Low-Scoring Game?

I’m still curious about the information-theory content, the entropy, of sports scores. I haven’t found the statistics I need about baseball or soccer game outcomes that I need. I’d also like hockey score outcomes if I could get them. If anyone knows a reference I’d be glad to know of it.

But there’s still stuff I can talk about without knowing details of every game ever. One of them suggested itself when I looked at the Washington Post‘s graphic. I mean the one giving how many times each score came up in baseball’s history.

By “distribution” mathematicians mean almost what you would imagine. Suppose we have something that might hold any of a range of values. This we call a “random variable”. How likely is it to hold any particular value? That’s what the distribution tells us. The higher the distribution, the more likely it is we’ll see that value. In baseball terms, that means we’re reasonably likely to see a game with a team scoring three runs. We’re not likely to see a game with a team scoring twenty runs.

There are many families of distributions. Feloni Mayhem suggested the baseball scores look like one called the Beta Distribution. I can’t quite agree, on technical grounds. Beta Distributions describe continuously-valued variables. They’re good for stuff like the time it takes to do something, or the height of a person, or the weight of a produced thing. They’re for measurements that can, in principle, go on forever after the decimal point. A baseball score isn’t like that. A team can score zero points, or one, or 46, but it can’t score four and two-thirds points. Baseball scores are “discrete” variables.

But there are good distributions for discrete variables. Almost everything you encounter taking an Intro to Probability class will be about discrete variables. So will most any recreational mathematics puzzle. The distribution of a tossed die’s outcomes is discrete. So is the number of times tails comes up in a set number of coin tosses. So are the birth dates of people in a room, or the number of cars passed on the side of the road during your ride, or the number of runs scored by a baseball team in a full game.

I suspected that, of the simpler distributions, the best model for baseball should be the Poisson distribution. It also seems good for any other low-scoring game, such as soccer or hockey. The Poisson distribution turns up whenever you have a large number of times that some discrete event can happen. But that event can happen only once each chance. And it has a constant chance of happening. That is, happening this chance doesn’t make it more likely or less likely it’ll happen next chance.

I have reasons to think baseball scoring should be well-modelled this way. There are hundreds of pitches in a game. Each of them is in principle a scoring opportunity. (Well, an intentional walk takes three pitches without offering any chance for scoring. And there’s probably some other odd case where a pitched ball can’t even in principle let someone score. But these are minor fallings-away from the ideal.) This is part of the appeal of baseball, at least for some: the chance is always there.

We only need one number to work out the Poisson distribution of something. That number is the mean, the arithmetic mean of all the possible values. Let me call the mean μ, which is the Greek version of m and so a good name for a mean. The probability that you’ll see the thing happen n times is $\mu^n e^{-\mu} \div (n!)$. Here e is that base of the natural logarithm, that 2.71828 et cetera number. n! is the factorial. That’s n times (n – 1) times (n – 2) times (n – 3) and so on all the way down to times 2 times 1.

And here is the Poisson distribution for getting numbers from 0 through 20, if we take the mean to be 3.4. I can defend using the Poisson distribution much more than I can defend picking 3.4 as the mean. Why not 3.2, or 3.8? Mostly, I tried a couple means around the three-to-four runs range and picked one that looked about right. Given the lack of better data, what else can I do?

I don’t think it’s a bad fit. The shape looks about right, to me. But the Poisson distribution suggests fewer zero- and one-run games than the actual data offers. And there are more high-scoring games in the real data than in the Poisson distribution. Maybe there’s something that needs tweaking.

And there are several plausible causes for this. A Poisson distribution, for example, supposes that there are a lot of chances for a distinct event. That would be scoring on a pitch. But in an actual baseball game there might be up to four runs scored on one pitch. It’s less likely to score four runs than to score one, sure, but it does happen. This I imagine boosts the number of high-scoring games.

I suspect this could be salvaged by a model that’s kind of a chain of Poisson distributions. That is, have one distribution that represents the chance of scoring on any given pitch. Then use another distribution to say whether the scoring was one, two, three, or four runs.

Low-scoring games I have a harder time accounting for. My suspicion is that each pitch isn’t quite an independent event. Experience shows that pitchers lose control of their game the more they pitch. This results in the modern close watching of pitch counts. We see pitchers replaced at something like a hundred pitches even if they haven’t lost control of the game yet.

If we ignore reasons to doubt this distribution, then, it suggests an entropy of about 2.9 for a single team’s score. That’s lower than the 3.5 bits I estimated last time, using score frequencies. I think that’s because of the multiple-runs problem. Scores are spread out across more values than the Poisson distribution suggests.

If I am right this says we might model games like soccer and hockey, with many chances to score a single run each, with a Poisson distribution. A game like baseball, or basketball, with many chances to score one or more points at once needs a more complicated model.

## How Interesting Is A Baseball Score? Some Further Results

While researching for my post about the information content of baseball scores I found some tantalizing links. I had wanted to know how often each score came up. From this I could calculate the entropy, the amount of information in the score. That’s the sum, taken over every outcome, of minus one times the frequency of that score times the base-two logarithm of the frequency of the outcome. And I couldn’t find that.

An article in The Washington Post had a fine lead, though. It offers, per the title, “the score of every basketball, football, and baseball game in league history visualized”. And as promised it gives charts of how often each number of runs has turned up in a game. The most common single-team score in a game is 3, with 4 and 2 almost as common. I’m not sure the date range for these scores. The chart says it includes (and highlights) data from “a century ago”. And as the article was posted in December 2014 it can hardly use data from after that. I can’t imagine that the 2015 season has changed much, though. And whether they start their baseball statistics at either 1871, 1876, 1883, 1891, or 1901 (each a defensible choice) should only change details.

Which is fine. I can’t get precise frequency data from the chart. The chart offers how many thousands of times a particular score has come up. But there’s not the reference lines to say definitely whether a zero was scored closer to 21,000 or 22,000 times. I will accept a rough estimate, since I can’t do any better.

I made my best guess at the frequency, from the chart. And then made a second-best guess. My best guess gave the information content of a single team’s score as a touch more than 3.5 bits. My second-best guess gave the information content as a touch less than 3.5 bits. So I feel safe in saying a single team’s score is about three and a half bits of information.

So the score of a baseball game, with two teams scoring, is probably somewhere around twice that, or about seven bits of information.

I have to say “around”. This is because the two teams aren’t scoring runs independently of one another. Baseball doesn’t allow for tie games except in rare circumstances. (It would usually be a game interrupted for some reason, and then never finished because the season ended with neither team in a position where winning or losing could affect their standing. I’m not sure that would technically count as a “game” for Major League Baseball statistical purposes. But I could easily see a roster of game scores counting that.) So if one team’s scored three runs in a game, we have the information that the other team almost certainly didn’t score three runs.

This estimate, though, does fit within my range estimate from 3.76 to 9.25 bits. And as I expected, it’s closer to nine bits than to four bits. The entropy seems to be a bit less than (American) football scores — somewhere around 8.7 bits — and college basketball — probably somewhere around 10.8 bits — which is probably fair. There are a lot of numbers that make for plausible college basketball scores. There are slightly fewer pairs of numbers that make for plausible football scores. There are fewer still pairs of scores that make for plausible baseball scores. So there’s less information conveyed in knowing that the game’s score is.

## How Interesting Is A Baseball Score? Some Partial Results

Meanwhile I have the slight ongoing quest to work out the information-theory content of sports scores. For college basketball scores I made up some plausible-looking score distributions and used that. For professional (American) football I found a record of all the score outcomes that’ve happened, and how often. I could use experimental results. And I’ve wanted to do other sports. Soccer was asked for. I haven’t been able to find the scoring data I need for that. Baseball, maybe the supreme example of sports as a way to generate statistics … has been frustrating.

The raw data is available. Retrosheet.org has logs of pretty much every baseball game, going back to the forming of major leagues in the 1870s. What they don’t have, as best I can figure, is a list of all the times each possible baseball score has turned up. That I could probably work out, when I feel up to writing the scripts necessary, but “work”? Ugh.

Some people have done the work, although they haven’t shared all the results. I don’t blame them; the full results make for a boring sort of page. “The Most Popular Scores In Baseball History”, at ValueOverReplacementGrit.com, reports the top ten most common scores from 1871 through 2010. The essay also mentions that as of then there were 611 unique final scores. And that lets me give some partial results, if we trust that blogger post from people I never heard of before are accurate and true. I will make that assumption over and over here.

There’s, in principle, no limit to how many scores are possible. Baseball contains many implied infinities, and it’s not impossible that a game could end, say, 580 to 578. But it seems likely that after 139 seasons of play there can’t be all that many more scores practically achievable.

Suppose then there are 611 possible baseball score outcomes, and that each of them is equally likely. Then the information-theory content of a score’s outcome is negative one times the logarithm, base two, of 1/611. That’s a number a little bit over nine and a quarter. You could deduce the score for a given game by asking usually nine, sometimes ten, yes-or-no questions from a source that knew the outcome. That’s a little higher than the 8.7 I worked out for football. And it’s a bit less than the 10.8 I estimate for college basketball.

And there’s obvious rubbish there. In no way are all 611 possible outcomes equally likely. “The Most Popular Scores In Baseball History” says that right there in the essay title. The most common outcome was a score of 3-2, with 4-3 barely less popular. Meanwhile it seems only once, on the 28th of June, 1871, has a baseball game ended with a score of 49-33. Some scores are so rare we can ignore them as possibilities.

(You may wonder how incompetent baseball players of the 1870s were that a game could get to 49-33. Not so bad as you imagine. But the equipment and conditions they were playing with were unspeakably bad by modern standards. Notably, the playing field couldn’t be counted on to be flat and level and well-mowed. There would be unexpected divots or irregularities. This makes even simple ground balls hard to field. The baseball, instead of being replaced with every batter, would stay in the game. It would get beaten until it was a little smashed shell of unpredictable dynamics and barely any structural integrity. People were playing without gloves. If a game ran long enough, they would play at dusk, without lights, with a muddy ball on a dusty field. And sometimes you just have four innings that get out of control.)

What’s needed is a guide to what are the common scores and what are the rare scores. And I haven’t found that, nor worked up the energy to make the list myself. But I found some promising partial results. In a September 2008 post on Baseball-Fever.com, user weskelton listed the 24 most common scores and their frequency. This was for games from 1993 to 2008. One might gripe that the list only covers fifteen years. True enough, but if the years are representative that’s fine. And the top scores for the fifteen-year survey look to be pretty much the same as the 139-year tally. The 24 most common scores add up to just over sixty percent of all baseball games, which leaves a lot of scores unaccounted for. I am amazed that about three in five games will have a score that’s one of these 24 choices though.

But that’s something. We can calculate the information content for the 25 outcomes, one each of the 24 particular scores and one for “other”. This will under-estimate the information content. That’s because “other” is any of 587 possible outcomes that we’re not distinguishing. But if we have a lower bound and an upper bound, then we’ve learned something about what the number we want can actually be. The upper bound is that 9.25, above.

The information content, the entropy, we calculate from the probability of each outcome. We don’t know what that is. Not really. But we can suppose that the frequency of each outcome is close to its probability. If there’ve been a lot of games played, then the frequency of a score and the probability of a score should be close. At least they’ll be close if games are independent, if the score of one game doesn’t affect another’s. I think that’s close to true. (Some games at the end of pennant races might affect each other: why try so hard to score if you’re already out for the year? But there’s few of them.)

The entropy then we find by calculating, for each outcome, a product. It’s minus one times the probability of that outcome times the base-two logarithm of the probability of that outcome. Then add up all those products. There’s good reasons for doing it this way and in the college-basketball link above I give some rough explanations of what the reasons are. Or you can just trust that I’m not lying or getting things wrong on purpose.

So let’s suppose I have calculated this right, using the 24 distinct outcomes and the one “other” outcome. That makes out the information content of a baseball score’s outcome to be a little over 3.76 bits.

As said, that’s a low estimate. Lumping about two-fifths of all games into the single category “other” drags the entropy down.

But that gives me a range, at least. A baseball game’s score seems to be somewhere between about 3.76 and 9.25 bits of information. I expect that it’s closer to nine bits than it is to four bits, but will have to do a little more work to make the case for it.

## How Interesting Is A Football Score?

Last month, Sarcastic Goat asked me how interesting a soccer game was. This is “interesting” in the information theory sense. I describe what that is in a series of posts, linked to from above. That had been inspired by the NCAA “March Madness” basketball tournament. I’d been wondering about the information-theory content of knowing the outcome of the tournament, and of each game.

This measure, called the entropy, we can work out from knowing how likely all the possible outcomes of something — anything — are. If there’s two possible outcomes and they’re equally likely, the entropy is 1. If there’s two possible outcomes and one is a sure thing while the other can’t happen, the entropy is 0. If there’s four possible outcomes and they’re all equally likely, the entropy is 2. If there’s eight possible outcomes, all equally likely, the entropy is 3. If there’s eight possible outcomes and some are likely while some are long shots, the entropy is … smaller than 3, but bigger than 0. The entropy grows with the number of possible outcomes and shrinks with the number of unlikely outcomes.

But it’s easy to calculate. List all the possible outcomes. Find the probability of each of those possible outcomes happening. Then, calculate minus one times the probability of each outcome times the logarithm, base two, of that outcome. For each outcome, so yes, this might take a while. Then add up all those products.

I’d estimated the outcome of the 63-game basketball tournament was somewhere around 48 bits of information. There’s a fair number of foregone, or almost foregone, conclusions in the game, after all. And I guessed, based on a toy model of what kinds of scores often turn up in college basketball games, that the game’s score had an information content of a little under 11 bits of information.

Sarcastic Goat, as I say, asked about soccer scores. I don’t feel confident that I could make up a plausible model of soccer score distributions. So I went looking for historical data. Surely, a history of actual professional soccer scores over a couple decades would show all the possible, plausible, outcomes and how likely each was to turn out.

I didn’t find one. My search for soccer scores kept getting contaminated with (American) football scores. But that turned up something interesting anyway. Sports Reference LLC has a table which purports to list the results of all professional football games played from 1920 through the start of 2016. There’ve been, apparently, some 1,026 different score outcomes, from 0-0 through to 73-0.

As you’d figure, there are a lot of freakish scores; only once in professional football history has the game ended 62-28. (Although it’s ended 62-14 twice, somehow.) There hasn’t been a 2-0 game since the second week of the 1938 season. Some scores turn up a lot; 248 games (as of this writing) have ended 20-17. That’s the most common score, in its records. 27-24 and 17-14 are the next most common scores. If I’m not making a dumb mistake, 7-0 is the 21st most common score. 93 games have ended with that tally. But it hasn’t actually been a game’s final score since the 14th week of the 1983 season, somehow. 98 games have ended 21-17; only ten have ended 21-18. Weird.

Anyway, there’s 1,026 recorded outcomes. That’s surely as close to “all the possible outcomes” as we can expect to get, at least until the Jets manage to lose 74-0 in their home opener. But if all 1,026 outcomes were equally likely then the information content of the game’s score would be a touch over 10 bits. But these outcomes aren’t all equally likely. It’s vastly more likely that a game ended 16-13 than it is likely it ended 16-8.

Let’s suppose I didn’t make any stupid mistakes in working out the frequency of all the possible outcomes. Then the information content of a football game’s outcome is a little over 8.72 bits.

Don’t be too hypnotized by the digits past the decimal. It’s approximate. But it suggests that if you were asking a source that would only answer ‘yes’ or ‘no’, then you could expect to get the score for any particular football game with about nine well-chosen questions.

I’m not surprised this is less than my estimated information content of a basketball game’s score. I think basketball games see a wider range of likely scores than football games do.

If someone has a reference for the outcomes of soccer games — or other sports — over a reasonably long time please let me know. I can run the same sort of calculation. We might even manage the completely pointless business of ranking all major sports by the information content of their scores.

## A Leap Day 2016 Mathematics A To Z: Kullbach-Leibler Divergence

Today’s mathematics glossary term is another one requested by Jacob Kanev. Kaven, I learned last time, has got a blog, “Some Unconsidered Trifles”, for those interested in having more things to read. Kanev’s request this time was a term new to me. But learning things I didn’t expect to consider is part of the fun of this dance.

## Kullback-Leibler Divergence.

The Kullback-Leibler Divergence comes to us from information theory. It’s also known as “information divergence” or “relative entropy”. Entropy is by now a familiar friend. We got to know it through, among other things, the “How interesting is a basketball tournament?” question. In this context, entropy is a measure of how surprising it would be to know which of several possible outcomes happens. A sure thing has an entropy of zero; there’s no potential surprise in it. If there are two equally likely outcomes, then the entropy is 1. If there are four equally likely outcomes, then the entropy is 2. If there are four possible outcomes, but one is very likely and the other three mediocre, the entropy might be low, say, 0.5 or so. It’s mostly but not perfectly predictable.

Suppose we have a set of possible outcomes for something. (Pick anything you like. It could be the outcomes of a basketball tournament. It could be how much a favored stock rises or falls over the day. It could be how long your ride into work takes. As long as there are different possible outcomes, we have something workable.) If we have a probability, a measure of how likely each of the different outcomes is, then we have a probability distribution. More likely things have probabilities closer to 1. Less likely things have probabilities closer to 0. No probability is less than zero or more than 1. All the probabilities added together sum up to 1. (These are the rules which make something a probability distribution, not just a bunch of numbers we had in the junk drawer.)

The Kullback-Leibler Divergence describes how similar two probability distributions are to one another. Let me call one of these probability distributions p. I’ll call the other one q. We have some number of possible outcomes, and we’ll use k as an index for them. pk is how likely, in distribution p, that outcome number k is. qk is how likely, in distribution q, that outcome number k is.

To calculate this divergence, we work out, for each k, the number pk times the logarithm of pk divided by qk. Here the logarithm is base two. Calculate all this for every one of the possible outcomes, and add it together. This will be some number that’s at least zero, but it might be larger.

The closer that distribution p and distribution q are to each other, the smaller this number is. If they’re exactly the same, this number will be zero. The less that distribution p and distribution q are like each other, the bigger this number is.

And that’s all good fun, but, why bother with it? And at least one answer I can give is that it lets us measure how good a model of something is.

Suppose we think we have an explanation for how something varies. We can say how likely it is we think there’ll be each of the possible different outcomes. This gives us a probability distribution which let’s call q. We can compare that to actual data. Watch whatever it is for a while, and measure how often each of the different possible outcomes actually does happen. This gives us a probability distribution which let’s call p.

If our model is a good one, then the Kullback-Leibler Divergence between p and q will be small. If our model’s a lousy one, then this divergence will be large. If we have a couple different models, we can see which ones make for smaller divergences and which ones make for larger divergences. Probably we’ll want smaller divergences.

Here you might ask: why do we need a model? Isn’t the actual data the best model we might have? It’s a fair question. But no, real data is kind of lousy. It’s all messy. It’s complicated. We get extraneous little bits of nonsense clogging it up. And the next batch of results is going to be different from the old ones anyway, because real data always varies.

Furthermore, one of the purposes of a model is to be simpler than reality. A model should do away with complications so that it is easier to analyze, easier to make predictions with, and easier to teach than the reality is. But a model mustn’t be so simple that it can’t represent important aspects of the thing we want to study.

The Kullback-Leibler Divergence is a tool that we can use to quantify how much better one model or another fits our data. It also lets us quantify how much of the grit of reality we lose in our model. And this is at least some of the use of this quantity.

## How Interesting Can A Basketball Tournament Be?

The United States is about to spend a good bit of time worrying about the NCAA men’s basketball tournament. It’s a good distraction from the women’s basketball tournament and from the National Invitational Tournament. Last year I used this to write a couple essays that stepped into information theory. Nobody knowledgeable in information theory has sent me threatening letters since. So since the inspiration is back in season I’d like to bring them to your attention again:

## Reading the Comics, January 4, 2015: An Easy New Year Edition

It looks like Comic Strip Master Command wanted to give me a nice, easy start of the year. The first group of mathematics-themed comic strips doesn’t get into deep waters and so could be written up with just a few moments. I foiled them by not having even a few moments to write things up, so that I’m behind on 2016 already. I’m sure I kind of win.

Dan Thompson’s Brevity for the 1st of January starts us off with icons of counting and computing. The abacus, of course, is one of the longest-used tools for computing. The calculator was a useful stopgap between the slide rule and the smart phone. The Count infects numerals with such contagious joy. And the whiteboard is where a lot of good mathematics work gets done. And yes, I noticed the sequence of numbers on the board. The prime numbers are often cited as the sort of message an alien entity would recognize. I suppose it’s likely an intelligence alert enough to pick up messages across space would be able to recognize prime numbers. Whether they’re certain to see them as important building blocks to the ways numbers work, the way we do? I don’t know. But I would expect someone to know the sequence, at least.

Ryan Pagelow’s Buni for New Year’s Day qualifies as the anthropomorphic-numerals joke for this essay.

Scott Hilburn’s The Argyle Sweater for the 2nd of January qualifies as the Roman numerals joke for this essay. It does prompt me to wonder whether about the way people who used Roman numerals as a their primary system thought, though. Obviously, “XCIX red balloons” should be pronounced as “ninety-nine red balloons”. But would someone scan it as “ninety-nine” or would it be read as the characters, “x-c-i-x” and then that converted to a number? I’m not sure I’m expressing the thing I wonder.

Steve Moore’s In The Bleachers for the 4th of January shows a basketball player overthinking the problem of getting a ball in the basket. The overthinking includes a bundle of equations which are all relevant to the problem, though. They’re the kinds of things you get in describing an object tossed up and falling without significant air resistance. I had thought I’d featured this strip — a rerun — before, but it seems not. Moore has used the same kind of joke a couple of other times, though, and he does like getting the equations right where possible.

Justin Boyd’s absurdist Invisible Bread for the 4th of January has Mom clean up a messy hard drive by putting all the 1’s together and all the 0’s together. And, yes, that’s not how data works. We say we represent data, on a computer, with 1’s and 0’s, but those are just names. We need to call them something. They’re in truth — oh, they’re positive or negative electric charges, or magnetic fields pointing one way or another, or they’re switches that are closed or open, or whatever. That’s for the person building the computer to worry about. Our description of what a computer does doesn’t care about the physical manifestation of our data. We could be as right if we say we’re representing data with A’s and purples, or with stop signs and empty cups of tea. What’s important is the pattern, and how likely it is that a 1 will follow a 0, or a 0 will follow a 1. If that sounds reminiscent of my information-theory talk about entropy, well, good: it is. Sweeping all the data into homogenous blocks of 1’s and of 0’s, alas, wipes out the interesting stuff. Information is hidden, somehow, in the ways we line up 1’s and 0’s, whatever we call them.

Steve Boreman’s Little Dog Lost for the 4th of January does a bit of comic wordplay with ones, zeroes, and twos. I like this sort of comic interplay.

And finally, John Deering and John Newcombe saw that Facebook meme about algebra just a few weeks ago, then drew the Zack Hill for the 4th of January.

## Making A Joke Of Entropy

This entered into my awareness a few weeks back. Of course I’ve lost where I got it from. But the headline is of natural interest to me. Kristy Condon’s “Researchers establish the world’s first mathematical theory of humor” describes the results of an interesting paper studying the phenomenon of funny words.

The original paper is by Chris Westbury, Cyrus Shaoul, Gail Moroschan, and Michael Ramscar, titled “Telling the world’s least funny jokes: On the quantification of humor as entropy”. It appeared in The Journal of Memory and Language. The thing studied was whether it’s possible to predict how funny people are likely to find a made-up non-word.

As anyone who tries to be funny knows, some words just are funnier than others. Or at least they sound funnier. (This brings us into the problem of whether something is actually funny or whether we just think it is.) Westbury, Shaoul, Moroschan, and Ramscar try testing whether a common measure of how unpredictable something is — the entropy, a cornerstone of information theory — can tell us how funny a word might be.

We’ve encountered entropy in these parts before. I used it in that series earlier this year about how interesting a basketball tournament was. Entropy, in this context, is low if something is predictable. It gets higher the more unpredictable the thing being studied is. You see this at work in auto-completion: if you have typed in ‘th’, it’s likely your next letter is going to be an ‘e’. This reflects the low entropy of ‘the’ as an english word. It’s rather unlikely the next letter will be ‘j’, because English has few contexts that need ‘thj’ to be written out. So it will suggest words that start ‘the’ (and ‘tha’, and maybe even ‘thi’), while giving ‘thj’ and ‘thq’ and ‘thd’ a pass.

Westbury, Shaoul, Moroschan, and Ramscar found that the entropy of a word, how unlikely that collection of letters is to appear in an English word, matches quite well how funny people unfamiliar with it find it. This fits well with one of the more respectable theories of comedy, Arthur Schopenhauer’s theory that humor comes about from violating expectations. That matches well with unpredictability.

Of course it isn’t just entropy that makes words funny. Anyone trying to be funny learns that soon enough, since a string of perfect nonsense is also boring. But this is one of the things that can be measured and that does have an influence.

(I doubt there is any one explanation for why things are funny. My sense is that there are many different kinds of humor, not all of them perfectly compatible. It would be bizarre if any one thing could explain them all. But explanations for pieces of them are plausible enough.)

Anyway, I recommend looking at the Kristy Condon report. It explains the paper and the research in some more detail. And if you feel up to reading an academic paper, try Westbury, Shaoul, Moroschan, and Ramscar’s report. I thought it readable, even though so much of it is outside my field. And if all else fails there’s a list of two hundred made-up words used in field tests for funniness. Some of them look pretty solid to me.

I had been talking about how much information there is in the outcome of basketball games, or tournaments, or the like. I wanted to fill in at least one technical term, to match some of the others I’d given.

In this information-theory context, an experiment is just anything that could have different outcomes. A team can win or can lose or can tie in a game; that makes the game an experiment. The outcomes are the team wins, or loses, or ties. A team can get a particular score in the game; that makes that game a different experiment. The possible outcomes are the team scores zero points, or one point, or two points, or so on up to whatever the greatest possible score is.

If you know the probability p of each of the different outcomes, and since this is a mathematics thing we suppose that you do, then we have what I was calling the information content of the outcome of the experiment. That’s a number, measured in bits, and given by the formula

$\sum_{j} - p_j \cdot \log\left(p_j\right)$

The sigma summation symbol means to evaluate the expression to the right of it for every value of some index j. The pj means the probability of outcome number j. And the logarithm may be that of any base, although if we use base two then we have an information content measured in bits. Those are the same bits as are in the bytes that make up the megabytes and gigabytes in your computer. You can see this number as an estimate of how many well-chosen yes-or-no questions you’d have to ask to pick the actual result out of all the possible ones.

I’d called this the information content of the experiment’s outcome. That’s an idiosyncratic term, chosen because I wanted to hide what it’s normally called. The normal name for this is the “entropy”.

To be more precise, it’s known as the “Shannon entropy”, after Claude Shannon, pioneer of the modern theory of information. However, the equation defining it looks the same as one that defines the entropy of statistical mechanics, that thing everyone knows is always increasing and somehow connected with stuff breaking down. Well, almost the same. The statistical mechanics one multiplies the sum by a constant number called the Boltzmann constant, after Ludwig Boltzmann, who did so much to put statistical mechanics in its present and very useful form. We aren’t thrown by that. The statistical mechanics entropy describes energy that is in a system but that can’t be used. It’s almost background noise, present but nothing of interest.

Is this Shannon entropy the same entropy as in statistical mechanics? This gets into some abstract grounds. If two things are described by the same formula, are they the same kind of thing? Maybe they are, although it’s hard to see what kind of thing might be shared by “how interesting the score of a basketball game is” and “how much unavailable energy there is in an engine”.

The legend has it that when Shannon was working out his information theory he needed a name for this quantity. John von Neumann, the mathematician and pioneer of computer science, suggested, “You should call it entropy. In the first place, a mathematical development very much like yours already exists in Boltzmann’s statistical mechanics, and in the second place, no one understands entropy very well, so in any discussion you will be in a position of advantage.” There are variations of the quote, but they have the same structure and punch line. The anecdote appears to trace back to an April 1961 seminar at MIT given by one Myron Tribus, who claimed to have heard the story from Shannon. I am not sure whether it is literally true, but it does express a feeling about how people understand entropy that is true.

Well, these entropies have the same form. And they’re given the same name, give or take a modifier of “Shannon” or “statistical” or some other qualifier. They’re even often given the same symbol; normally a capital S or maybe an H is used as the quantity of entropy. (H tends to be more common for the Shannon entropy, but your equation would be understood either way.)

I’m not comfortable saying they’re the same thing, though. After all, we use the same formula to calculate a batting average and to work out the average time of a commute. But we don’t think those are the same thing, at least not more generally than “they’re both averages”. These entropies measure different kinds of things. They have different units that just can’t be sensibly converted from one to another. And the statistical mechanics entropy has many definitions that not just don’t have parallels for information, but wouldn’t even make sense for information. I would call these entropies siblings, with strikingly similar profiles, but not more than that.

But let me point out something about the Shannon entropy. It is low when an outcome is predictable. If the outcome is unpredictable, presumably knowing the outcome will be interesting, because there is no guessing what it might be. This is where the entropy is maximized. But an absolutely random outcome also has a high entropy. And that’s boring. There’s no reason for the outcome to be one option instead of another. Somehow, as looked at by the measure of entropy, the most interesting of outcomes and the most meaningless of outcomes blur together. There is something wondrous and strange in that.

## Doesn’t The Other Team Count? How Much?

I’d worked out an estimate of how much information content there is in a basketball score, by which I was careful to say the score that one team manages in a game. I wasn’t able to find out what the actual distribution of real-world scores was like, unfortunately, so I made up a plausible-sounding guess: that college basketball scores would be distributed among the imaginable numbers (whole numbers from zero through … well, infinitely large numbers, though in practice probably not more than 150) according to a very common distribution called the “Gaussian” or “normal” distribution, that the arithmetic mean score would be about 65, and that the standard deviation, a measure of how spread out the distribution of scores is, would be about 10.

If those assumptions are true, or are at least close enough to true, then there are something like 5.4 bits of information in a single team’s score. Put another way, if you were trying to divine the score by asking someone who knew it a series of carefully-chosen questions, like, “is the score less than 65?” or “is the score more than 39?”, with at each stage each question equally likely to be answered yes or no, you could expect to hit the exact score with usually five, sometimes six, such questions.

## But How Interesting Is A Basketball Score?

When I worked out how interesting, in an information-theory sense, a basketball game — and from that, a tournament — might be, I supposed there was only one thing that might be interesting about the game: who won? Or to be exact, “did (this team) win”? But that isn’t everything we might want to know about a game. For example, we might want to know what a team scored. People often do. So how to measure this?

The answer was given, in embryo, in my first piece about how interesting a game might be. If you can list all the possible outcomes of something that has multiple outcomes, and how probable each of those outcomes is, then you can describe how much information there is in knowing the result. It’s the sum, for all of the possible results, of the quantity negative one times the probability of the result times the logarithm-base-two of the probability of the result. When we were interested in only whether a team won or lost there were just the two outcomes possible, which made for some fairly simple calculations, and indicates that the information content of a game can be as high as 1 — if the team is equally likely to win or to lose — or as low as 0 — if the team is sure to win, or sure to lose. And the units of this measure are bits, the same kind of thing we use to measure (in groups of bits called bytes) how big a computer file is.

## But How Interesting Is A Real Basketball Tournament?

When I wrote about how interesting the results of a basketball tournament were, and came to the conclusion that it was 63 (and filled in that I meant 63 bits of information), I was careful to say that the outcome of a basketball game between two evenly-matched opponents has an information content of 1 bit. If the game is a foregone conclusion, then the game hasn’t got so much information about it. If the game really is foregone, the information content is 0 bits; you already know what the result will be. If the game is an almost sure thing, there’s very little information to be had from actually seeing the game. An upset might be thrilling to watch, but you would hardly count on that, if you’re being rational. But most games aren’t sure things; we might expect the higher-seed to win, but it’s plausible they don’t. How does that affect how much information there is in the results of a tournament?

Last year, the NCAA College Men’s Basketball tournament inspired me to look up what the outcomes of various types of matches were, and which teams were more likely to win than others. If some person who wrote something for statistics.about.com is correct, based on 27 years of March Madness outcomes, the play between a number one and a number 16 seed is a foregone conclusion — the number one seed always wins — while number two versus number 15 is nearly sure. So while the first round of play will involve 32 games — four regions, each region having eight games — there’ll be something less than 32 bits of information in all these games, since many of them are so predictable.

If we take the results from that statistics.about.com page as accurate and reliable as a way of predicting the outcomes of various-seeded teams, then we can estimate the information content of the first round of play at least.

Here’s how I work it out, anyway:

Contest Probability the Higher Seed Wins Information Content of this Outcome
#1 seed vs #16 seed 100% 0 bits
#2 seed vs #15 seed 96% 0.2423 bits
#3 seed vs #14 seed 85% 0.6098 bits
#4 seed vs #13 seed 79% 0.7415 bits
#5 seed vs #12 seed 67% 0.9149 bits
#6 seed vs #11 seed 67% 0.9149 bits
#7 seed vs #10 seed 60% 0.9710 bits
#8 seed vs #9 seed 47% 0.9974 bits

So if the eight contests in a single region were all evenly matched, the information content of that region would be 8 bits. But there’s one sure and one nearly-sure game in there, and there’s only a couple games where the two teams are close to evenly matched. As a result, I make out the information content of a single region to be about 5.392 bits of information. Since there’s four regions, that means the first round of play — the first 32 games — have altogether about 21.567 bits of information.

Warning: I used three digits past the decimal point just because three is a nice comfortable number. Do not by hypnotized into thinking this is a more precise measure than it really is. I don’t know what the precise chance of, say, a number three seed beating a number fourteen seed is; all I know is that in a 27-year sample, it happened the higher-seed won 85 percent of the time, so the chance of the higher-seed winning is probably close to 85 percent. And I only know that if whoever it was wrote this article actually gathered and processed and reported the information correctly. I would not be at all surprised if the first round turned out to have only 21.565 bits of information, or as many as 21.568.

A statistical analysis of the tournaments which I dug up last year indicated that in the last three rounds — the Elite Eight, Final Four, and championship game — the higher- and lower-seeded teams are equally likely to win, and therefore those games have an information content of 1 bit per game. The last three rounds therefore have 7 bits of information total.

Unfortunately, experimental data seems to fall short for the second round — 16 games, where the 32 winners in the first round play, producing the Sweet Sixteen teams — and the third round — 8 games, producing the Elite Eight. If someone’s done a study of how often the higher-seeded team wins I haven’t run across it.

There are six of these games in each of the four regions, for 24 games total. Presumably the higher-seeded is more likely than the lower-seeded to win, but I don’t know how much more probable it is the higher-seed will win. I can come up with some bounds: the 24 games total in the second and third rounds can’t have an information content less than 0 bits, since they’re not all foregone conclusions. The higher-ranked seed won’t win all the time. And they can’t have an information content of more than 24 bits, since that’s how much there would be if the games were perfectly even matches.

So, then: the first round carries about 21.567 bits of information. The second and third rounds carry between 0 and 24 bits. The fourth through sixth rounds (the sixth round is the championship game) carry seven bits. Overall, the 63 games of the tournament carry between 28.567 and 52.567 bits of information. I would expect that many of the second-round and most of the third-round games are pretty close to even matches, so I would expect the higher end of that range to be closer to the true information content.

Let me make the assumption that in this second and third round the higher-seed has roughly a chance of 75 percent of beating the lower seed. That’s a number taken pretty arbitrarily as one that sounds like a plausible but not excessive advantage the higher-seeded teams might have. (It happens it’s close to the average you get of the higher-seed beating the lower-seed in the first round of play, something that I took as confirming my intuition about a plausible advantage the higher seed has.) If, in the second and third rounds, the higher-seed wins 75 percent of the time and the lower-seed 25 percent, then the outcome of each game is about 0.8113 bits of information. Since there are 24 games total in the second and third rounds, that suggests the second and third rounds carry about 19.471 bits of information.

Warning: Again, I went to three digits past the decimal just because three digits looks nice. Given that I do not actually know the chance a higher-seed beats a lower-seed in these rounds, and that I just made up a number that seems plausible you should not be surprised if the actual information content turns out to be 19.468 or even 19.472 bits of information.

Taking all these numbers, though — the first round with its something like 21.567 bits of information; the second and third rounds with something like 19.471 bits; the fourth through sixth rounds with 7 bits — the conclusion is that the win/loss results of the entire 63-game tournament are about 48 bits of information. It’s a bit higher the more unpredictable the games involving the final 32 and the Sweet 16 are; it’s a bit lower the more foregone those conclusions are. But 48 bits sounds like a plausible enough answer to me.

When I wrote last weekend’s piece about how interesting a basketball tournament was, I let some terms slide without definition, mostly so I could explain what ideas I wanted to use and how they should relate. My love, for example, read the article and looked up and asked what exactly I meant by “interesting”, in the attempt to measure how interesting a set of games might be, even if the reasoning that brought me to a 63-game tournament having an interest level of 63 seemed to satisfy.

When I spoke about something being interesting, what I had meant was that it’s something whose outcome I would like to know. In mathematical terms this “something whose outcome I would like to know” is often termed an experiment’ to be performed or, even better, a message’ that presumably I wil receive; and the outcome is the “information” of that experiment or message. And information is, in this context, something you do not know but would like to.

So the information content of a foregone conclusion is low, or at least very low, because you already know what the result is going to be, or are pretty close to knowing. The information content of something you can’t predict is high, because you would like to know it but there’s no (accurately) guessing what it might be.

This seems like a straightforward idea of what information should mean, and it’s a very fruitful one; the field of “information theory” and a great deal of modern communication theory is based on them. This doesn’t mean there aren’t some curious philosophical implications, though; for example, technically speaking, this seems to imply that anything you already know is by definition not information, and therefore learning something destroys the information it had. This seems impish, at least. Claude Shannon, who’s largely responsible for information theory as we now know it, was renowned for jokes; I recall a Time Life science-series book mentioning how he had built a complex-looking contraption which, turned on, would churn to life, make a hand poke out of its innards, and turn itself off, which makes me smile to imagine. Still, this definition of information is a useful one, so maybe I’m imagining a prank where there’s not one intended.

And something I hadn’t brought up, but which was hanging awkwardly loose, last time was: granted that the outcome of a single game might have an interest level, or an information content, of 1 unit, what’s the unit? If we have units of mass and length and temperature and spiciness of chili sauce, don’t we have a unit of how informative something is?

We have. If we measure how interesting something is — how much information there is in its result — using base-two logarithms the way we did last time, then the unit of information is a bit. That is the same bit that somehow goes into bytes, which go on your computer into kilobytes and megabytes and gigabytes, and onto your hard drive or USB stick as somehow slightly fewer gigabytes than the label on the box says. A bit is, in this sense, the amount of information it takes to distinguish between two equally likely outcomes. Whether that’s a piece of information in a computer’s memory, where a 0 or a 1 is a priori equally likely, or whether it’s the outcome of a basketball game between two evenly matched teams, it’s the same quantity of information to have.

So a March Madness-style tournament has an information content of 63 bits, if all you’re interested in is which teams win. You could communicate the outcome of the whole string of matches by indicating whether the “home” team wins or loses for each of the 63 distinct games. You could do it with 63 flashes of light, or a string of dots and dashes on a telegraph, or checked boxes on a largely empty piece of graphing paper, coins arranged tails-up or heads-up, or chunks of memory on a USB stick. We’re quantifying how much of the message is independent of the medium.

## How Interesting Is A Basketball Tournament?

Yes, I can hear people snarking, “not even the tiniest bit”. These are people who think calling all athletic contests “sportsball” is still a fresh and witty insult. No matter; what I mean to talk about applies to anything where there are multiple possible outcomes. If you would rather talk about how interesting the results of some elections are, or whether the stock market rises or falls, whether your preferred web browser gains or loses market share, whatever, read it as that instead. The work is all the same.

To talk about quantifying how interesting the outcome of a game (election, trading day, whatever) means we have to think about what “interesting” qualitatively means. A sure thing, a result that’s bound to happen, is not at all interesting, since we know going in that it’s the result. A result that’s nearly sure but not guaranteed is at least a bit interesting, since after all, it might not happen. An extremely unlikely result would be extremely interesting, if it could happen.

## What’s The Worst Way To Pack?

While reading that biography of Donald Coxeter that brought up that lovely triangle theorem, I ran across some mentions of the sphere-packing problem. That’s the treatment of a problem anyone who’s had a stack of oranges or golf balls has independently discovered: how can you arrange balls, all the same size (oranges are near enough), so as to have the least amount of wasted space between balls? It’s a mathematics problem with a lot of applications, both the obvious ones of arranging orange or golf-ball shipments, and less obvious ones such as sending error-free messages. You can recast the problem of sending a message so it’s understood even despite errors in coding, transmitting, receiving, or decoding, as one of packing equal-size balls around one another.

The “packing density” is the term used to say how much of a volume of space can be filled with balls of equal size using some pattern or other. Patterns called the cubic close packing or the hexagonal close packing are the best that can be done with periodic packings, ones that repeat some base pattern over and over; they fill a touch over 74 percent of the available space with balls. If you don’t want to follow the Mathworld links before, just get a tub of balls, or crate of oranges, or some foam Mystery Science Theater 3000 logo balls as packing materials when you order the new DVD set, and play around with a while and you’ll likely rediscover them. If you’re willing to give up that repetition you can get up to nearly 78 percent. Finding these efficient packings is known as the Kepler conjecture, and yes, it’s that Kepler, and it did take a couple centuries to show that these were the most efficient packings.

While thinking about that I wondered: what’s the least efficient way to pack balls? The obvious answer is to start with a container the size of the universe, and then put no balls in it, for a packing fraction of zero percent. This seems to fall outside the spirit of the question, though; it’s at least implicit in wondering the least efficient way to pack balls to suppose that there’s at least one ball that exists.

## Philosophical Origins of Computers

Scott Pellegrino here talks a bit about Boole’s laws, logic, set theory, and is building up into computation and information theory if his writing continues along the line promised here.

As indicated by my last post, I’d really like to tie in philosophical contributions to mathematics to the rise of the computer.  I’d like to jump from Leibniz to Boole, since Boole got the ball rolling to finally bring to fruition what Leibniz first speculated on the possibility.

In graduate school, I came across a series of lectures by a former head of the entire research and development division of IBM, which covered, in surprising level of detail, the philosophical origins of the computer industry. To be honest, it’s the sort of subject that really should be book in length.  But I think it really is a great contemporary example of exactly what philosophy is supposed to be, discovering new methods of analysis that as they develop are spun out of philosophy and are given birth as a new independent (or semi-independent) field their philosophical origins.  Theoretical linguistics is a…

View original post 771 more words