Today’s A To Z term is another from Iva Sallay, Find The Factors blog creator and, as with asymptote, friend of the blog. Thank you for it.
People can’t remember many things at once. This has effects. Some of them are obvious. Like, how a phone number, back in the days you might have to memorize them, wouldn’t be more than about seven or eight digits. Some are subtle, such as that we have descriptive statistics. We have descriptive statistics because we want to understand collections of a lot of data. But we can’t understand all the data. We have to simplify it. From this we get many numbers, based on data, that try to represent it. Means. Medians. Variance. Quartiles. All these.
And it’s not enough. We try to understand data further by visualization. Usually this is literal, making pictures that represent data. Now and then somebody visualizes data by something slick, like turning it into an audio recording. (Somewhere here I have an early-60s album turning 18 months of solar radio measurements into something music-like.) But that’s rare, and usually more of an artistic statement. Mostly it’s pictures. Sighted people learn much of the world from the experience of seeing it and moving around it. Visualization turns arithmetic into geometry. We can support our sense of number with our sense of space.
Many of the ways we visualize data came from the same person. William Playfair set out the rules for line charts and area charts and bar charts and pie charts and circle graphs. Florence Nightingale used many of them in her reports on medical care in the Crimean War. And this made them public and familiar enough that we still use them.
Box-and-whisker plots are not among them. I’m startled too. Playfair had a great talent for these sorts of visualizations. That he missed this is a reminder to us all. There are great, simple ideas still available for us to discover.
At least for the brilliant among us to discover. Box-and-whisker plots were introduced in 1969. I’m surprised it’s that recent. John Tukey developed them. Computer scientists remember Tukey’s name; he coined the term ‘bit’, as in the element of computer memory. They also remember he was an early user, if not the coiner, of the term ‘software’. Mathematicians know Tukey’s name too. He and James Cooley developed the Fast Fourier Transform. The Fast Fourier Transform appears on every list of the Most Important Algorithms of the 20th Century. Sometimes the Most Important Algorithms of All Time. The Fourier Transform is this great thing. It’s a way of finding patterns in messy, complicated data. It’s hard to calculate, though. Cooley and Tukey found that the calculations you have to do can be made simpler, and much quicker. (In certain conditions. Mostly depending on how the data’s gathered. Fortunately, computers encourage gathering data in ways that make the Fast Fourier Transform possible. And then go and calculate it nice and fast.)
Box-and-whisker plots are a way to visualize sets of data. Too many data points to look at all at once, not without getting confused. They extract a couple bits of information about the distribution. Distributions say what ranges a data point, picked at random, is likely to be in, and is unlikely to be in. Distributions can be good things to look at. They let you know what typical experiences of a thing are likely to be. And they’re stable. A handful of weird fluke events don’t change them much. If you have a lot of fluke events, that changes the distribution. But if you have a lot of fluke events, they’re not flukes. They’re just events.
Box-and-whisker plots start from the median. This is the second of the three things commonly called “average”. It’s the data point that half the remaining data is less than, and half the remaining data is greater than. It’s a nice number to know. Start your box-and-whisker plot with a short line, horizontal or vertical as fits your worksheet, and labelled with that median.
Around this line we’ll draw a box. It’ll be as wide as the line you made for the median. But how tall should it be?
That is, normally, based on the first and third quartiles. These are data points defined the way the median is. The first quartile has one-quarter the data points less than it, and three-quarters the data points more than it. The third quartile has three-quarters the data points less than it, and one-quarter the data points more than it. (And now you might ask if we can’t call the median the “second quartile”. We sure can. And will if we want to think about how the quartiles relate to each other.) Between the first and the third quartile are half of all the data points. The first and the third quartiles are the boundaries of your box. They’re where the edges of the rectangle are.
That’s the box. What are the whiskers?
Well, they’re vertical lines. Or horizontal lines. Whatever’s perpendicular to how you started. They start at the quartile lines. Should they go to the maximum or minimum data points?
Maybe. Maximum and minimum data are neat, yes. But they’re also suspect. They’re extremes. They’re not quite reliable. If you went back to the same source of data, and collected it again, you’d get about the same median, and the same first and third quartile. You’d get different minimums and maximums, though. Often crazily different. Still, if you want to understand the data you did get, it’s hard to ignore that this is the data you have. So one choice for representing these is to just use the maximum and minimum points. Draw the whiskers out to the maximum and minimum, and then add a little cross bar or a circle at the end. This makes clear you meant the line to end there, rather than that your ink ran out. (Making a figure safe against misprinting is one of the understated essentials of good visualization.)
But again, the very highest and lowest data may be flukes. So we could look at other, more stable endpoints for the whiskers. The point of this is to show the range of what we believe most data points are. There are different ways to do this. There’s not one that’s always right. It’s important, when showing a box-and-whisker plot, to explain how far out the whiskers go.
Tukey’s original idea, for example, was to extend the whiskers based on the interquartile range. This is the difference between the third quartile and the first quartile. Like, just subtraction. Find a number that’s one-and-a-half times the interquartile range above the third quartile. The upper whisker goes to the data point that’s closest to that boundary without going over. This might well be the maximum already. The other number is the one that’s the first quartile minus one-and-a-half times the interquartile range. The lower whisker goes to the data point that’s closest to that boundary without falling underneath it. And this might be the minimum. It depends how the data’s distributed. The upper whisker and the lower whisker aren’t guaranteed to be the same lengths. If there are data outside these whisker ranges, mark them with dots or x’s or something else easy to spot. There’ll typically be only a few of these.
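For anyone who'd rather see the whisker rule as a few lines of Python, here's a minimal sketch. The function name `tukey_whiskers`, the parameter `k`, and the sample numbers are all made up for illustration. The quartiles are passed in rather than computed, and the two used here were worked out by hand as the means of the third and fourth, and of the ninth and tenth, sorted values.

```python
def tukey_whiskers(data, q1, q3, k=1.5):
    """Return (lower_whisker, upper_whisker, outliers) under the 1.5-IQR rule.

    q1 and q3 are the first and third quartiles, supplied by the caller.
    k is the multiplier on the interquartile range; 1.5 is the value
    described in the text.
    """
    iqr = q3 - q1                  # interquartile range: just subtraction
    low_bound = q1 - k * iqr       # boundary below the first quartile
    high_bound = q3 + k * iqr      # boundary above the third quartile

    # Whiskers stop at the most extreme data points still inside the bounds.
    inside = [x for x in data if low_bound <= x <= high_bound]
    # Anything outside gets drawn as a separate dot or x.
    outliers = sorted(x for x in data if x < low_bound or x > high_bound)
    return min(inside), max(inside), outliers


# Made-up data, already sorted; 40 is a planted fluke.
sample = [2, 4, 5, 7, 8, 9, 11, 12, 13, 15, 16, 40]
low, high, flukes = tukey_whiskers(sample, q1=6.0, q3=14.0)
print(low, high, flukes)  # 2 16 [40]
```

Here the interquartile range is 8, so the boundaries sit at -6 and 26. The lower whisker reaches the minimum, 2, while the upper whisker stops at 16 and leaves 40 out on its own.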
But you can use other rules too. Again as long as you are clear about what they represent. The whiskers might go out, for example, to particular percentiles. Or might reach out a certain number of standard deviations from the mean.
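Here's a rough Python sketch of those two alternatives. Everything specific in it is an assumption for illustration: the function names, the choice of the 10th and 90th percentiles, the choice of two standard deviations, and the "nearest rank" way of picking a percentile (there are several competing definitions). Whatever you pick, say so on the plot.

```python
from statistics import mean, stdev


def percentile_whiskers(data, low_pct=10, high_pct=90):
    """Whiskers at two percentiles, using the nearest-rank definition.

    low_pct and high_pct are whole-number percentiles (assumed values).
    """
    values = sorted(data)
    n = len(values)

    def nearest_rank(pct):
        # Smallest rank whose share of the data is at least pct percent,
        # computed with integer arithmetic: ceil(pct * n / 100).
        rank = max(1, -(-pct * n // 100))
        return values[rank - 1]

    return nearest_rank(low_pct), nearest_rank(high_pct)


def stdev_whiskers(data, k=2.0):
    """Whiskers at the extreme data points within k sample standard
    deviations of the mean (k = 2 is an assumed value)."""
    centre = mean(data)
    spread = stdev(data)
    inside = [x for x in data if centre - k * spread <= x <= centre + k * spread]
    return min(inside), max(inside)


# Made-up data; 40 is a planted fluke.
sample = [2, 4, 5, 7, 8, 9, 11, 12, 13, 15, 16, 40]
print(percentile_whiskers(sample))  # (4, 16)
print(stdev_whiskers(sample))       # (2, 16)
```

The two rules disagree about the lower whisker for the same data, which is the reason to label whichever one you use.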
The point of doing this box-and-whisker plot is to show where half the data are. That’s inside the box. And where the rest of the non-fluke data is. That’s the whiskers. And the flukes, those are the odd little dots left outside the whiskers. And it doesn’t take any deep calculations. You need to sort the data in ascending order. You need to count how many data points there are, to find the median and the first and third quartiles. (You might have to do addition and division. If you have, for example, twelve distinct data points, then the median is the arithmetic mean of the sixth and seventh values. The first quartile is the arithmetic mean of the third and fourth values. The third quartile is the arithmetic mean of the ninth and tenth values.) You (might) need to subtract, to find the interquartile range. And multiply that by one and a half, and add or subtract that from the quartiles.
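The whole recipe fits in a short Python sketch. It assumes the "median of each half" convention for quartiles, which is the one that gives the sixth-and-seventh, third-and-fourth, and ninth-and-tenth pairings for twelve points; other conventions exist and can give slightly different quartiles. The function name `box_summary` and the twelve numbers are made up for illustration.

```python
from statistics import median


def box_summary(data, k=1.5):
    """Sort, count, find the quartiles, then the 1.5-IQR boundaries.

    Assumes at least two data points. With an odd count the median itself
    is left out of both halves (one common convention among several).
    """
    values = sorted(data)              # sort in ascending order
    n = len(values)                    # count the data points
    half = n // 2
    lower = values[:half]
    upper = values[half:] if n % 2 == 0 else values[half + 1:]

    q1, q2, q3 = median(lower), median(values), median(upper)
    iqr = q3 - q1                      # subtract
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": iqr,
        "low_bound": q1 - k * iqr,     # multiply, then subtract
        "high_bound": q3 + k * iqr,    # multiply, then add
    }


# Twelve distinct, unsorted, made-up values.
twelve = [3, 1, 4, 15, 9, 2, 6, 5, 35, 8, 97, 93]
summary = box_summary(twelve)
print(summary["q1"], summary["median"], summary["q3"])  # 3.5 7.0 25.0
print(summary["low_bound"], summary["high_bound"])      # -28.75 57.25
```

Sorted, the data run 1, 2, 3, 4, 5, 6, 8, 9, 15, 35, 93, 97. The median is the mean of 6 and 8, the first quartile the mean of 3 and 4, and the third quartile the mean of 15 and 35. With the upper boundary at 57.25, the values 93 and 97 would be drawn as separate dots.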
Box-and-whisker plots show you which values are likely and which are improbable. They give you a cruder picture than, say, the standard deviation and the coefficient of variation do. But they need no hard calculations. None of what you need for box-and-whisker plots is computationally intensive. Heck, none of what you need is hard. You knew everything you needed to find these numbers by fourth grade. And yet they tell you about the distribution. You can compare whether two sets of data are similar by eye. Telling whether sets of data are similar becomes telling whether two shapes look about the same. It’s brilliant to represent so much from such simple work.
9 thoughts on “My 2018 Mathematics A To Z: Box-And-Whisker Plot”
This is a very informative and understandable explanation of box and whisker plots. I didn’t understand them before, but now I do! As a bonus, you included lots of other related or barely related mathematical information. I loved it. Thank you for writing this article. That box and whisker plots were invented in 1969 explains why I didn’t learn about them when I was a youngster in school, but it doesn’t explain why I didn’t learn about them in the two probability and statistics classes I took in college after that date but before 1975. The only thing I think you could have done to improve this article would be to include a labeled illustration of a box and whisker plot. I turned to google images to see some illustrated. Perhaps you could consider making a box and whisker plot one Saturday to display statistics of animals who have whiskers and live in some kind of box!
Aw, thank you, and I’m glad you enjoyed the result. … You’re right that it’s a bit curious you weren’t introduced to them in classes before 1975. But I don’t know how rapidly they spread into common use, and from then how long they took to get into statistics courses. It might be just the lag in realizing something could be used. Or it might have been choice of the author or instructor. I know that I’ve had descriptive statistics books that didn’t discuss them, or at least didn’t make much fuss about them.
You’re right that I ought to have included pictures. I haven’t got a drawing setup I’m really happy with right now — I think I’m probably best off sketching something on paper and then scanning it, like this was 1996 or something — but I’m certainly able to do that. I should add it to the post when I have the chance to do some illustrating.
… And yeah, I haven’t done a Statistics Saturday post that used box-and-whiskers. There must be something good waiting.
Glad to see a Box and Whisker Plot IS NOT a “Big Bang Theory” episode that bandies about Schrödinger’s cat to sound like they know what they’re talking about.
I must admit, you could say most anything was a Big Bang Theory episode and I would have to accept your testimony. All I know of the show is that sometimes the DVR catches the last few minutes of it before the start of Conan O’Brien, and it’s usually something like Bob Newhart looking tired.
Sorry to take up another slot but I forgot to say how much I liked your “Dern” good new A to Z banner, please kindly overlook my need to pun if you would. (I’m afraid the Bennett Cerf is great in this one, even if the puns aren’t)
Thanks kindly. The real credit goes to Thomas K Dye, though, who took the vague commission of ‘mathematics A to Z’ and turned it into something good.