Why is there statistics?
There are many reasons statistics got organized as a field of study, mostly in the late 19th and early 20th centuries. Chiefly they reflect wanting to be able to say something about big collections of data. People can only keep track of so much information at once. Even if we could keep track of more information, we’re usually interested in relationships between pieces of data. When there’s enough data there are so many possible relationships that we can’t see what’s interesting.
One of the things statistics gives us is a way of representing lots of data with fewer numbers. We trust there’ll be few enough numbers we can understand them all simultaneously, and so understand something about the whole data.
Quintiles are one of the tools we have. They’re a lesser tool, I admit, but that makes them sound more exotic. They’re descriptions of how the values of a set of data are distributed. Distributions are interesting. They tell us what kinds of values are likely and which are rare. They tell us also how variable the data is, or how reliably we are measuring data. These are things we often want to know: what is normal for the thing we’re measuring, and what’s a normal range?
We get quintiles from imagining the data set placed in ascending order. There’s some value that one-fifth of the data points are smaller than, and four-fifths are greater than. That’s your first quintile. Suppose we had the values 269, 444, 525, 745, and 1284 as our data set. Any number between 269 and 444 has one-fifth of the points below it and four-fifths above, so by convention we take the midpoint: the first quintile would be the arithmetic mean of 269 and 444, that is, 356.5.
The second quintile is some value that two-fifths of your data points are smaller than, and that three-fifths are greater than. With the data set we started with, that would be the mean of 444 and 525, or 484.5.
The third quintile is a value that three-fifths of the data set is less than, and two-fifths greater than; in this case, that’s the mean of 525 and 745, or 635.
And the fourth quintile is a value that four-fifths of the data set is less than, and one-fifth greater than. That’s the mean of 745 and 1284, or 1014.5.
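If you'd rather have a computer do the averaging, here is a minimal Python sketch of the calculation above. The function name quintiles is my own invention, and the sketch assumes the number of data points is a multiple of five, so that every cut falls cleanly between two neighboring values. It is one way to do it, not the only one.

```python
def quintiles(values):
    """Return the four quintiles as midpoints between neighboring sorted values.

    Assumption: len(values) is a non-zero multiple of five, so each cut
    falls exactly between two data points.
    """
    ordered = sorted(values)
    group_size, leftover = divmod(len(ordered), 5)
    if group_size == 0 or leftover != 0:
        raise ValueError("this sketch needs a non-empty multiple of five data points")
    cuts = []
    for k in range(1, 5):
        below = ordered[k * group_size - 1]  # largest value in the lower part
        above = ordered[k * group_size]      # smallest value in the upper part
        cuts.append((below + above) / 2)     # midpoint of the two neighbors
    return cuts


word_counts = [269, 444, 525, 745, 1284]
print(quintiles(word_counts))  # [356.5, 484.5, 635.0, 1014.5]
```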
From looking at the quintiles we can say … well, not much, because this is a silly made-up problem that demonstrates how quintiles are calculated rather than why we’d want to do anything with them. At least the numbers come from real data. They’re the word counts of my first five A-to-Z definitions. But the existence of the quintiles at 356.5, 484.5, 635, and 1014.5, along with the minimum and maximum data points at 269 and 1284, tells us something. Mostly that numbers are bunched up in the three and four hundreds, but there could be some weird high numbers. If we had a bigger data set the results would be less obvious.
If the calculating of quintiles sounds much like the way we work out the median, that’s because it is. The median is the value that half the data is less than, and half the data is greater than. There are other ways of breaking down distributions. The first quartile is the value one-quarter of the data is less than. The second quartile is a value two-quarters of the data is less than (so, yes, that’s the median all over again). The third quartile is a value three-quarters of the data is less than.
Percentiles are another variation on this. The (say) 26th percentile is a value that 26 percent, or 26 hundredths, of the data is less than. The 72nd percentile is a value greater than 72 percent of the data.
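Medians, quartiles, quintiles, and percentiles all come from the same recipe with a different number of parts, and a rough Python sketch can show that. The name cut_point and its parameters are hypothetical, made up for this illustration. The rule for a cut that lands on a data point rather than between two of them is my assumption, chosen because it matches the usual median of an odd-sized data set. Other conventions exist, and software packages do not all agree on which to use.

```python
def cut_point(values, k, parts):
    """Value that roughly k/parts of the data lies below.

    parts=2 gives the median, 4 quartiles, 5 quintiles, 100 percentiles.
    """
    if not 0 < k < parts:
        raise ValueError("k must be between 1 and parts - 1")
    ordered = sorted(values)
    whole, remainder = divmod(k * len(ordered), parts)
    if remainder == 0:
        # The cut falls exactly between two data points: take their midpoint.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Assumption: otherwise the cut lands on a single data point, so use it.
    return ordered[whole]


word_counts = [269, 444, 525, 745, 1284]
print(cut_point(word_counts, 1, 2))                         # median: 525
print([cut_point(word_counts, k, 4) for k in range(1, 4)])  # quartiles: [444, 525, 745]
print(cut_point(word_counts, 26, 100))                      # 26th percentile: 444
print(cut_point(word_counts, 72, 100))                      # 72nd percentile: 745
```

With only five data points the percentiles are crude: every percentile from the 21st through the 39th comes out as 444 under this rule. That is a limit of the tiny data set, not of the idea.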
Are quintiles useful? Well, that’s a loaded question. They are used less than quartiles are. And I’m not sure knowing them is better than looking at a spreadsheet’s plot of the data. A plot of the data with the quintiles, or quartiles if you prefer, drawn in is better than either separately. But these are among the tools we have to tell what data values are likely, and how tightly bunched-up they are.