Today’s A To Z term is another from Iva Sallay, Find The Factors blog creator and, as with asymptote, friend of the blog. Thank you for it.
People can’t remember many things at once. This has effects. Some of them are obvious. Like, how a phone number, back in the days you might have to memorize them, wouldn’t be more than about seven or eight digits. Some are subtle, such as that we have descriptive statistics. We have descriptive statistics because we want to understand collections of a lot of data. But we can’t understand all the data. We have to simplify it. From this we get many numbers, based on data, that try to represent it. Means. Medians. Variance. Quartiles. All these.
And it’s not enough. We try to understand data further by visualization. Usually this is literal, making pictures that represent data. Now and then somebody visualizes data by something slick, like turning it into an audio recording. (Somewhere here I have an early-60s album turning 18 months of solar radio measurements into something music-like.) But that’s rare, and usually more of an artistic statement. Mostly it’s pictures. Sighted people learn much of the world from the experience of seeing it and moving around it. Visualization turns arithmetic into geometry. We can support our sense of number with our sense of space.
Many of the ways we visualize data came from the same person. William Playfair set out the rules for line charts and area charts and bar charts and pie charts and circle graphs. Florence Nightingale used many of them in her reports on medical care in the Crimean War. And this made them public and familiar enough that we still use them.
Box-and-whisker plots are not among them. I’m startled too. Playfair had a great talent for these sorts of visualizations. That he missed this is a reminder to us all. There are great, simple ideas still available for us to discover.
At least for the brilliant among us to discover. Box-and-whisker plots were introduced in 1969. I’m surprised it’s that recent. John Tukey developed them. Computer scientists remember Tukey’s name; he coined the term ‘bit’, as in the element of computer memory. They also remember he was an early user, if not the coiner, of the term ‘software’. Mathematicians know Tukey’s name too. He and James Cooley developed the Fast Fourier Transform. The Fast Fourier Transform appears on every list of the Most Important Algorithms of the 20th Century. Sometimes the Most Important Algorithms of All Time. The Fourier Transform is this great thing. It’s a way of finding patterns in messy, complicated data. It’s hard to calculate, though. Cooley and Tukey found that the calculations you have to do can be made simpler, and much quicker. (In certain conditions. Mostly depending on how the data’s gathered. Fortunately, computers encourage gathering data in ways that make the Fast Fourier Transform possible. And then go and calculate it nice and fast.)
Box-and-whisker plots are a way to visualize sets of data. Sets with too many data points to look at all at once, not without getting confused. They extract a couple bits of information about the distribution. Distributions say what ranges a data point, picked at random, is likely to be in, and is unlikely to be in. Distributions can be good things to look at. They let you know what typical experiences of a thing are likely to be. And they’re stable. A handful of weird fluke events don’t change them much. If you have a lot of fluke events, that changes the distribution. But if you have a lot of fluke events, they’re not flukes. They’re just events.
Box-and-whisker plots start from the median. This is the second of the three things commonly called “average”. It’s the data point that half the remaining data is less than, and half the remaining data is greater than. It’s a nice number to know. Start your box-and-whisker plot with a short line, horizontal or vertical as fits your worksheet, and labelled with that median.
Around this line we’ll draw a box. It’ll be as wide as the line you made for the median. But how tall should it be?
That is, normally, based on the first and third quartiles. These are data points defined much like the median. The first quartile has one-quarter the data points less than it, and three-quarters the data points more than it. The third quartile has three-quarters the data points less than it, and one-quarter the data points more than it. (And now you might ask if we can’t call the median the “second quartile”. We sure can. And will if we want to think about how the quartiles relate to each other.) Between the first and the third quartile are half of all the data points. The first and the third quartiles are the boundaries of your box. They’re where the edges of the rectangle are.
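Here is a minimal Python sketch of finding those three numbers. It assumes the "median of each half" convention for quartiles, where a data set with an odd count leaves its middle value out of both halves. Other conventions exist and can give slightly different numbers. The name `quartiles` is made up for this illustration.

```python
from statistics import median

def quartiles(data):
    """Return (first quartile, median, third quartile) of the data."""
    values = sorted(data)
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    half = n // 2
    lower = values[:half]           # the values below the middle
    upper = values[half + n % 2:]   # the values above it; skips the middle value when n is odd
    return median(lower), median(values), median(upper)

print(quartiles(range(1, 13)))  # (3.5, 6.5, 9.5)
```

For the twelve values 1 through 12, the box would run from 3.5 to 9.5, with the median line at 6.5.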
That’s the box. What are the whiskers?
Well, they’re vertical lines. Or horizontal lines. Whatever’s perpendicular to how you started. They start at the quartile lines. Should they go to the maximum or minimum data points?
Maybe. Maximum and minimum data are neat, yes. But they’re also suspect. They’re extremes. They’re not quite reliable. If you went back to the same source of data, and collected it again, you’d get about the same median, and the same first and third quartile. You’d get different minimums and maximums, though. Often crazily different. Still, if you want to understand the data you did get, it’s hard to ignore that this is the data you have. So one choice for representing these is to just use the maximum and minimum points. Draw the whiskers out to the maximum and minimum, and then add a little cross bar or a circle at the end. This makes clear you meant the line to end there, rather than that your ink ran out. (Making a figure safe against misprinting is one of the understated essentials of good visualization.)
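As a rough sketch of the drawing so far, here is a bit of Python that lays out a horizontal plot as one line of text, with the whiskers running out to the minimum and maximum. The function name `text_box_plot`, the characters used, and the default width are all my own choices for illustration, not any standard.

```python
def text_box_plot(minimum, q1, med, q3, maximum, width=41):
    """Draw a horizontal box-and-whisker plot as one line of text."""
    if maximum <= minimum:
        raise ValueError("maximum must be greater than minimum")

    def column(x):
        # Scale a data value to a character position from 0 to width - 1.
        return round((x - minimum) / (maximum - minimum) * (width - 1))

    row = ["-"] * width                  # whiskers span the whole line
    for i in range(column(q1), column(q3) + 1):
        row[i] = "="                     # the box covers the middle half
    row[0] = row[-1] = "|"               # cross bars mark the whisker ends
    row[column(med)] = "M"               # the median line
    return "".join(row)

print(text_box_plot(0, 10, 20, 30, 40))
# |---------==========M==========---------|
```

The cross bars at either end do the job described above: they show the whisker was meant to stop there.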
But again, the very highest and lowest data may be flukes. So we could look at other, more stable endpoints for the whiskers. The point of this is to show the range of what we believe most data points are. There are different ways to do this. There’s not one that’s always right. It’s important, when showing a box-and-whisker plot, to explain how far out the whiskers go.
Tukey’s original idea, for example, was to extend the whiskers based on the interquartile range. This is the difference between the third quartile and the first quartile. Like, just subtraction. Find a number that’s one-and-a-half times the interquartile range above the third quartile. The upper whisker goes to the data point that’s closest to that boundary without going over. This might well be the maximum already. The other number is the one that’s the first quartile minus one-and-a-half times the interquartile range. The lower whisker goes to the data point that’s closest to that boundary without falling underneath it. And this might be the minimum. It depends how the data’s distributed. The upper whisker and the lower whisker aren’t guaranteed to be the same length. If there are data outside these whisker ranges, mark them with dots or x’s or something else easy to spot. There’ll typically be only a few of these.
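A small Python sketch of that rule follows. It assumes you have already found the first and third quartiles and pass them in. The name `tukey_whiskers` and the parameter `k` (the one-and-a-half multiplier) are made up for this illustration.

```python
def tukey_whiskers(data, q1, q3, k=1.5):
    """Return (lower whisker end, upper whisker end, outliers)."""
    values = sorted(data)
    iqr = q3 - q1                    # the interquartile range: just subtraction
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # The whiskers stop at the most extreme data points still inside the bounds.
    inside = [v for v in values if lower_bound <= v <= upper_bound]
    # Anything beyond the bounds gets marked as its own dot.
    outliers = [v for v in values if v < lower_bound or v > upper_bound]
    return inside[0], inside[-1], outliers

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 30]
print(tukey_whiskers(data, q1=3.5, q3=9.5))  # (1, 11, [30])
```

Here the interquartile range is 6, so the boundaries sit at -5.5 and 18.5. The lower whisker reaches the minimum, 1. The upper whisker stops at 11, and the 30 is left as a lone dot.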
But you can use other rules too. Again as long as you are clear about what they represent. The whiskers might go out, for example, to particular percentiles. Or might reach out a certain number of standard deviations from the mean.
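Both alternatives can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a standard: the percentile uses the "nearest rank" convention, the standard deviation is the population one (`statistics.pstdev`), and the function names and default values are hypothetical.

```python
import math
from statistics import mean, pstdev

def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile: the value at position ceil(pct% of n)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def percentile_whiskers(data, low_pct=10, high_pct=90):
    """Whisker ends at two chosen percentiles of the data."""
    values = sorted(data)
    return nearest_rank(values, low_pct), nearest_rank(values, high_pct)

def sd_whiskers(data, k=2):
    """Whisker ends k standard deviations either side of the mean."""
    centre = mean(data)
    spread = pstdev(data)
    return centre - k * spread, centre + k * spread

print(percentile_whiskers(range(1, 21)))  # (2, 18)
```

Note that the standard-deviation version gives whisker ends that need not be actual data points, which is one more thing to say in the caption.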
The point of doing this box-and-whisker plot is to show where half the data are. That’s inside the box. And where the rest of the non-fluke data is. That’s the whiskers. And the flukes, those are the odd little dots left outside the whiskers. And it doesn’t take any deep calculations. You need to sort the data in ascending order. You need to count how many data points there are, to find the median and the first and third quartiles. (You might have to do addition and division. If you have, for example, twelve distinct data points, then the median is the arithmetic mean of the sixth and seventh values. The first quartile is the arithmetic mean of the third and fourth values. The third quartile is the arithmetic mean of the ninth and tenth values.) You (might) need to subtract, to find the interquartile range. And multiply that by one and a half, and add or subtract that from the quartiles.
These plots show you what are likely and what are improbable values. They give you a cruder picture than, say, the standard deviation and the coefficient of variation do. But they need no hard calculations. None of what you need for box-and-whisker plots is computationally intensive. Heck, none of what you need is hard. You knew everything you needed to find these numbers by fourth grade. And yet they tell you about the distribution. You can compare whether two sets of data are similar by eye. Telling whether sets of data are similar becomes telling whether two shapes look about the same. It’s brilliant to represent so much from such simple work.