The article goes into several ways that one can test whether a thing has an effect. These naturally get mathematical. Among the tests developed is one that someone who didn’t know mathematics might independently invent. This is called linear regression, or linear correlation. The idea is to run experiments. If you think something causes an effect, try doing a little of that something. Measure how big the effect is. Then try doing more of that something. How big is the effect now? Try a lot. How big is the effect? Do none of it. How big is the effect?
Through calculations that are tedious but not actually hard, you can find a line that “best fits” the data. And it will tell you whether, on average, increasing the something will increase the effect. Or decrease it. There are subsidiary tests that will tell you how strong the fit is. That is, whether the something and the effect match their variations very well, or whether there’s just a loose correspondence. It can easily be that random factors, or factors you aren’t looking at, are more important than the something you’re trying to vary, after all.
In principle, online advertising should be excellent at matching advertising to people. It’s quite easy to test different combinations of sales pitches and measure how much of whatever it is gets bought. In practice?
You have surely heard the aphorism that correlation does not prove causation, usually from someone trying to explain that we can’t really prove that some large industry is doing something murderous and awful. But there are also people who will say this in honest good faith. Showing that, say, placing advertisements in one source correlates with a healthy number of sales does not prove that the advertisements helped any. One needs to design experiments thoughtfully to tease that out. Part of Frederik and Martijn’s essay is about the search for those thoughtful experiments, and what they indicate. There is an old saw that in science what one does not measure one does not understand. But it is also true that measuring a thing does not mean one understands it.
(Linear regression is far from the only tool available, or discussed in the article. It’s one that’s easy to imagine and explain, both in goal and in calculation, however.)
Okay, so writing “this next essay right away” didn’t come to pass, because all sorts of other things got in the way. But to get back to where we had been: we hoped to figure out which of the players at the local pinball league had most improved over the season. The data I had available. But data is always imperfect. We try to learn anyway.
What data I had was this. Each league night we selected five pinball games. Each player there played those five tables. We recorded their scores. Each player’s standing was based on, for each table, how many other players they beat. If you beat everyone on a particular table, you got 100 points. If you beat all but three people, you got 97 points. If ten people beat you, you got 90 points. And so on. Add together the points earned for all five games of that night. We didn’t play the same games week to week. And not everyone played every single week. These are some of the limits of the data.
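Here’s a minimal sketch of that scoring rule for a single table, in Python. The function name, the player labels, and the raw scores are all made up for illustration, and it assumes nobody ties.

```python
def league_points(raw_scores):
    """Map each player's raw score on one table to league points.

    A player earns 100 minus the number of players who beat them.
    """
    points = {}
    for player, score in raw_scores.items():
        beaten_by = sum(1 for other in raw_scores.values() if other > score)
        points[player] = 100 - beaten_by
    return points


# Hypothetical raw scores on one table, invented for illustration.
table = {"A": 52_000_000, "B": 310_000_000, "C": 8_500_000, "D": 97_000_000}

for player, pts in league_points(table).items():
    print(player, pts)
```

A night’s total is then the sum of these points over the five tables played.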
My first approach was to look at a linear regression. That is, take a plot where the independent variable is the league night number and the dependent variable is a player’s nightly score. This will almost certainly not be a straight line. There’s an excellent chance it will never touch any of the data points. But there is some line that comes closer than any other line to touching all these data points. What is that line, and what is its slope? And that’s easy to calculate. Well, it’s tedious to calculate. But the formula for it is easy enough to make a computer do. And then it’s easy to look at the slope of the line approximating each player’s performance. The highest slope of their performance line obviously belongs to the most improved player.
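To make the slope calculation concrete, here’s a rough sketch in Python using the standard least-squares formulas. The players and their nightly totals are invented; none of this is the league’s real data.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


nights = list(range(1, 9))  # eight league nights

# Hypothetical nightly totals, invented for illustration only.
season = {
    "steady climber": [440, 445, 452, 455, 461, 466, 470, 476],
    "flat": [470, 468, 472, 469, 471, 470, 468, 472],
    "slow slide": [480, 478, 475, 476, 471, 470, 466, 465],
}

slopes = {name: fit_line(nights, ys)[0] for name, ys in season.items()}
for name, m in sorted(slopes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>15}: {m:+.2f} points per night")
```

Sorting by slope puts the steepest climber first, which is all this first approach does.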
And the answer gotten was that the most improved player — the one whose score increased most, week to week — was a player I’ll call T. The thing is T was already a good player. A great one, really. He’d just been unable to join the league until partway through. So nights that he didn’t play, and so was retroactively given a minimal score for, counted as “terrible early nights”. This made his play look like it was getting better than it was. It’s not just a problem of one person, either. I had missed a night, early on, and that weird outlier case made my league performance look, to this regression, like it was improving pretty well. If we removed the missed nights, my apparent improvement changed to a slight decline. If we pretend that my second-week absence happened on week eight instead, I had a calamitous fall over the season.
And that felt wrong, so I went back to re-think. This is dangerous stuff, by the way. You can fool yourself if you go back and change your methods because your answer looked wrong. But. An important part of finding answers is validating your answer. Getting a wrong-looking answer can be a warning that your method was wrong. This is especially so if you started out unsure how to find what you were looking for.
So what did that first answer, that I didn’t believe, tell me? It told me I needed some better way to handle noisy data. I should tell apart a person who’s steadily doing better week to week and a person who’s just had one lousy night. Or two lousy nights. Or someone who just had a lousy season, but enjoyed one outstanding night where they couldn’t be beaten. Is there a measure of consistency?
And there — well, there kind of is. I’m looking at Pearson’s Correlation Coefficient, also known as Pearson’s r, or r. Karl Pearson is a name you will know if you learn statistics, because he invented just about all of them except the Student T test. Or you will not know if you learn statistics, because we don’t talk much about the history of statistics. (A lot of the development of statistical ideas was done in the late 19th and early 20th century, often by people — like Pearson — who were eugenicists. When we talk about mathematics history we’re more likely to talk about, oh, this fellow published what he learned trying to do quality control at Guinness breweries. We move with embarrassed coughing past oh, this fellow was interested in showing which nationalities were dragging the average down.) I hope you’ll allow me to move on with just some embarrassed coughing about this.
Anyway, Pearson’s ‘r’ is a number between -1 and 1. It reflects how well a line actually describes your data. The closer this ‘r’ is to zero, the less like a line your data really is. And the square of this, r², has a great, easy physical interpretation. It tells you how much of the variations in your dependent variable (the rankings, here) can be explained by a linear function of the independent variable (the league night, here). The bigger r² is, the more line-like the original data is. The less its result depends on fluke events.
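A sketch of the calculation, again in Python with made-up scores. The function is the textbook formula for Pearson’s r; squaring it gives the share of the variation that a line accounts for.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


nights = list(range(1, 9))

# Hypothetical nightly totals, not anyone's real season.
steady = [440, 445, 452, 455, 461, 466, 470, 476]
one_bad_night = [470, 420, 474, 476, 468, 466, 472, 457]

for name, ys in (("steady", steady), ("one bad night", one_bad_night)):
    r = pearson_r(nights, ys)
    print(f"{name:>14}: r = {r:+.3f}, r squared = {r * r:.3f}")
```

The steady player’s r comes out close to 1. The player with one fluke night gets an r squared near zero, even though a fitted line through those scores still has a positive slope.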
This is another tedious calculation, yes. Computers. They do great things for statistical study. These told me something unsurprising: r² for our putative best player, T, was about 0.313. That is, about 31 percent of his score’s change could be attributed to improvement; 69 percent of it was noise, reflecting the missed nights. For me, r² was about 0.105. That is, 90 percent of the variation in my standing was noise. This suggests by the way that I was playing pretty consistently, week to week, which matched how I felt about my season. And yes, we did have one player whose r² was 0.000. So he was consistent and about all the change in his week-to-week score reflected noise. (I only looked at three digits past the decimal. That’s more precision than the data could support, though. I wouldn’t be willing to say whether he played more consistently than the person with r² of 0.005 or the one with 0.012.)
Now, looking at that: ah, here’s something much better. Here’s a player, L, with a Pearson’s r of 0.803. r² was about 0.645, the highest of anyone. The most nearly linear performance in the league. Only about 35 percent of L’s performance change could be attributed to random noise rather than to a linear change, week-to-week. And that change was the second-highest in the league, too. L’s standing improved by about 5.21 points per league night. Better than anyone but T.
This, then, was my nomination for the most improved player. L had a large positive slope, in looking at ranking-over-time. L also had a high correlation coefficient. This makes the argument that the improvement was consistent and due to something besides L getting luckier later in the season.
Yes, I am fortunate that I didn’t have to decide between someone with a high r² and mediocre slope versus someone with a mediocre r² and high slope. Maybe this season. I’ll let you know how it turns out.
Back before suddenly everything got complicated I was working on the question of who’s the most improved pinball player? This was specifically for our local league. The league meets, normally, twice a month for a four-month season. Everyone plays the same five pinball tables for the night. They get league points for each of the five tables. The points are based on how many of their fellow players their score on that table beat that night. (Most leagues don’t keep standings this way. It’s one that harmonizes well with the venue and the league’s history.) The highest score on a game earns its player 100 league points. Second-highest earns its scorer 99 league points. Third-highest earns 98, and so on. Setting the highest score to a 100 and counting down makes the race for the top less dependent on how many people show up each night. A fantastic night when 20 people attended is as good as a fantastic night when only 12 could make it out.
Last season had a large number of new players join the league. The natural question this inspired was, who was most improved? One answer is to use linear regression. That is, look at the scores each player had each of the eight nights of the season. This will be a bunch of points (eight, in this league’s case) with x-coordinates from 1 through 8 and y-coordinates between about 400 and 500. There is some straight line which comes the nearest to describing each player’s performance that a straight line possibly can. Finding that straight line is the “linear regression”.
A straight line has a slope. This describes stuff about the x- and y-coordinates that match points on the line. Particularly, if you start from a point on the line, and change the x-coordinate a tiny bit, how much does the y-coordinate change? A positive slope means the y-coordinate increases as the x-coordinate increases. So a positive slope implies that each successive league night (increase in the x-coordinate) we expect an increase in the nightly score (the y-coordinate).
For me, I had a slope of about 2.48. That’s a positive number, so apparently I was on average getting better all season. Good to know. And with the data on each player and their nightly scores on hand, it was easy to calculate the slopes of all their performances. This is because I did not do it. I had the computer do it. Finding the slopes of these linear regressions is not hard; it’s just tedious. It takes these multiplications and additions and divisions and you know? This is what we have computing machines for. Setting up the problem and interpreting the results is what we have people for.
And with that work done we found the most improved player in the league was … ah-huh. No, that’s not right. The person with the highest slope, T, finished the season a quite good player, yes. Thing is he started the season that way too. He’d been playing pinball for years. Playing competitively very well, too, at least when he could. Work often kept him away from chances. Now that he’s retired, he’s a plausible candidate to make the state championship contest, even if his winning would be rather a surprise. Still. It’s possible he improved over the course of our eight meetings. But more than everyone else in the league, including people who came in as complete novices and finished as competent players?
So what happened?
T joined the league late, is what happened. After the first week. So he was proleptically scored at the bottom of the league that first meeting. He also had to miss one of the league’s first several meetings after joining. The result is that he had two boat-anchor scores in the first half of the season, and then basically middle-to-good scores for the latter half. Numerically, yeah, T started the season lousy and ended great. That’s improvement. Improved the standings by about 6.79 points per league meeting, by this standard. That’s just not so.
This approach for measuring how a competitor improved is flawed. But then every scheme for measuring things is flawed. Anything actually interesting is complicated and multifaceted; measurements of it are, at least, a couple of discrete values. We hope that this tiny measurement can tell us something about a complicated system. To do that, we have to understand in what ways we know the measurements to be flawed.
So treating a missed night as a bottomed-out score is bad. Also the bottomed-out scores are a bit flaky. If you miss a night when ten people were at league, you get a score of 450. Miss a night when twenty people were at league, you get a score of 400. It’s daft to get fifty points for something that doesn’t reflect anything you did except spread false information about what day league was.
Still, this is something we can compensate for. We can re-run the linear regression, for example, taking out the scores that represent missed nights. This done, T’s slope drops to 2.57. Still quite the improvement. T was getting used to the games, apparently. But it’s no longer a slope that dominates the league while feeling illogical. I’m not happy with this decision, though, not least because the same change for me drops my slope to -0.50. That is, that I got appreciably worse over the season. But that’s sentiment. Someone looking at the plot of my scores, that anomalous second week aside, would probably say that yeah, my scores were probably dropping night-to-night. Ouch.
Or does it drop to -0.50? If we count league nights as the x-coordinate and league points as the y-coordinate, then yeah, omitting night two altogether gives me a slope of -0.50. What if the x-coordinate is instead the number of league nights I’ve been to, to get to that score? That is, if for night 2 I record, not a blank score, but the 472 points I got on league night number three? And for night 3 I record the 473 I got on league night number four? If I count by my improvement over the seven nights I played? … Then my slope is -0.68. I got worse even faster. I had a poor last night, and a lousy league night number six. They sank me.
And what if we pretend that for night two I got an average-for-me score? There are a couple kinds of averages, yes. The arithmetic mean for my other nights was a score of 468.57. The arithmetic mean is what normal people intend when they say average. Fill that in as a provisional night two score. My weekly decline in standing itself declines, to only -0.41. The other average that anyone might find convincing is my median score. For the rest of the season that was 472; I put in as many scores lower than that as I did higher. Using this average makes my decline worse again. Then my slope is -0.61.
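The different ways of handling a missed night can be laid side by side in a short Python sketch. The scores below are hypothetical (they are not my real season), and the 420-point penalty is an assumed bottomed-out score, so the slopes will not match the ones quoted above. What matters is how much the answer moves.

```python
from statistics import mean, median


def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical season: night 2 was missed (None).
nights = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [470, None, 474, 476, 468, 466, 472, 457]
PENALTY = 420  # assumed bottomed-out score for the missed night

played = [(x, y) for x, y in zip(nights, scores) if y is not None]
played_x = [x for x, _ in played]
played_y = [y for _, y in played]


def filled(value):
    """Scores with the missed night replaced by `value`."""
    return [value if y is None else y for y in scores]


results = {
    "penalty score": slope(nights, filled(PENALTY)),
    "drop the night": slope(played_x, played_y),
    "count nights attended": slope(list(range(1, len(played_y) + 1)), played_y),
    "fill with mean": slope(nights, filled(mean(played_y))),
    "fill with median": slope(nights, filled(median(played_y))),
}

for name, m in results.items():
    print(f"{name:>22}: {m:+.2f}")
```

With this made-up data the penalty version gives a positive slope and every other treatment gives a negative one, the same kind of sign flip described above.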
You see where I’m getting more dissatisfied. What was my performance like over the season? Depending on how you address how to handle a missed night, I either got noticeably better, with a slope of 2.48. Or I got noticeably worse, with a slope of -0.68. Or maybe -0.61. Or I got modestly worse, with a slope of -0.41.
There’s something unsatisfying with a study of some data if handling one or two bad entries throws our answers this far off. More thought is needed. I’ll come back to this, but I mean to write this next essay right away so that I actually do.
Could I say what a “most improved” pinball player looks like? Well, I can give a rough idea. A player’s improving if their rankings increase over the season. The most-improved person would show the biggest improvement. This definition might go awry; maybe there’s some important factor I overlooked. But it was a place to start looking.
So here’s the first problem. It’s the plot of my own data, my league scores over the season. Yes, league night 2 is dismal. I’d had to miss the night and so got the lowest score possible.
Is this getting better? Or worse? The obvious thing to do is to look for a curve that goes through these points. Then look at what that curve is doing. The thing is, it’s always possible to draw a curve through a bunch of data points. As long as there’s not something crazy like there’s four data points for the same league night. As long as there’s one data point for each measurement you can always connect those points to some curve. Worse, you can always fit more than one curve through those points. We need to think harder.
Here’s the thing about pinball league night results. Or any other data that comes from the real world. It’s got noise in it. There’s some amount of it that’s just random. We don’t need to look for a curve that matches every data point. Or any data point particularly. What if the actual data is “some easy-to-understand curve, plus some random noise”?
It’s a good thought. It’s a dangerous thought. You need to have an idea of what the “real” curve should be. There’s infinitely many possibilities. You can bias your answer by choosing what curve you think the data ought to represent. Or by not thinking before you make a choice. As ever, the hard part is not in doing a calculation. It’s choosing what calculation to do.
That said there’s a couple safe bets. One of them is straight lines. Why? … Well, they’re easy to work with. But we have deeper reasons. Lots of stuff, when it changes, looks like it’s changing in a straight line. Take any curve that hasn’t got a corner or a jump or a break in it. There’s a straight line that looks close enough to it. Maybe not for long, but at least for some stretch. In the absence of a better idea of what ought to be right, a line is at least a starting point. You might learn something even if a line doesn’t fit well, and get ideas for why to look at particular other shapes.
So there’s good, steady mathematics business to be found in doing “linear regression”. That is, find the line that best fits a set of data points. What do we mean by “best fits”?
The mathematical community has an answer. I agree with it, surely to the comfort of the mathematical community. Here’s the premise. You have a bunch of data points, with an independent variable ‘x’ and a dependent variable ‘y’. So the data points are a bunch of points, $(x_j, y_j)$, for a couple values of j. You want the line that “best” matches that. Fine. In my pinball league case here, j is the whole numbers from 1 to 8. $x_j$ is … just j again. All right, as happens, this is more mechanism than we need for this problem. But there’s problems where it would be useful anyway. And for $y_j$, well, here:
For the linear regression, propose a line described by the equation $y = m x + b$. No idea what ‘m’ and ‘b’ are just yet. But. Calculate for each of the $x_j$ values what the projection would be, that is, what $m x_j + b$ is. How far are those from the actual data?
Are there choices for ‘m’ and ‘b’ that make the difference smaller? It’s easy to convince yourself there are. Suppose we started out with ‘m’ equal to 0 and ‘b’ equal to 472. That’s an okay fit. Suppose we started out with ‘m’ equal to 100,000,000 and ‘b’ equal to -2,038. That’s a crazy bad fit. So there must be some ‘m’ and ‘b’ that make for better fits.
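One common way to put a number on “how far” is to add up the squared vertical gaps between the line and each data point. That choice is an assumption here (the text above doesn’t commit to a measure), and the scores are hypothetical. The two candidate lines are the ones just mentioned. A sketch in Python:

```python
def sum_squared_error(m, b, xs, ys):
    """Total squared vertical gap between the line y = m*x + b and the data."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


nights = list(range(1, 9))
scores = [470, 420, 474, 476, 468, 466, 472, 457]  # hypothetical nightly totals

okay_fit = sum_squared_error(0, 472, nights, scores)
crazy_fit = sum_squared_error(100_000_000, -2_038, nights, scores)

print(f"m = 0, b = 472:              {okay_fit:,}")
print(f"m = 100,000,000, b = -2,038: {crazy_fit:,}")
```

The smaller the total, the better the fit by this measure. The search for the best line is the search for the ‘m’ and ‘b’ that make it as small as possible.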
Is there a best fit? If you don’t think much about mathematics the answer is obvious: of course there’s a best fit. If there’s some poor, some decent, some good fits there must be a best. If you’re a bit better-learned and have thought more about mathematics you might grow suspicious. That term ‘best’ is dangerous. Maybe there’s several fits that are all different but equally good. Maybe there’s an endless series of ever-better fits but no one best. (If you’re not clear how this could work, ponder: what’s the largest negative real number?)
Good suspicions. If you learn a bit more mathematics you learn the calculus of variations. This is the study of how small changes in one quantity change something that depends on it; and it’s all about finding the maxima or minima of stuff. And that tells us that there is, indeed, a best choice for ‘m’ and ‘b’.
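For the usual least-squares reading of “best” (an assumption, since the measure is not spelled out above), the recipe can be sketched in a few lines. Here N is the number of data points, and $\bar{x}$ and $\bar{y}$ are the averages of the $x_j$ and $y_j$. These symbols are introduced for this sketch only.

```latex
% Total squared error of the line y = m x + b
S(m, b) = \sum_{j=1}^{N} \left( y_j - m x_j - b \right)^2

% Set both partial derivatives to zero
\frac{\partial S}{\partial m} = -2 \sum_{j=1}^{N} x_j \left( y_j - m x_j - b \right) = 0,
\qquad
\frac{\partial S}{\partial b} = -2 \sum_{j=1}^{N} \left( y_j - m x_j - b \right) = 0

% Solving the pair of equations for m and b
m = \frac{\sum_{j=1}^{N} (x_j - \bar{x})(y_j - \bar{y})}{\sum_{j=1}^{N} (x_j - \bar{x})^2},
\qquad
b = \bar{y} - m \bar{x}
```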
(Here I’m going to hedge. I’ve learned a bit more mathematics than that. I don’t think there’s some freaky set of data that will turn up multiple best-fit curves. But my gut won’t let me just declare that. There’s all kinds of crazy, intuition-busting stuff out there. But if there exists some data set that breaks linear regression you aren’t going to run into it by accident.)
So. How to find the best ‘m’ and ‘b’ for this? You’ve got choices. You can open up DuckDuckGo and search for ‘matlab linear regression’ and follow the instructions. Or ‘excel linear regression’, if you have an easier time entering data into spreadsheets. If you’re on the Mac, maybe ‘apple numbers linear regression’. Follow the directions on the second or third link returned. Oh, you can do the calculation yourself. It’s not hard. It’s just tedious. It’s a lot of multiplication and addition and you know what? We’ve already built tools that know how to do this. Use them. Not if your homework assignment is to do this by hand, but, for stuff you care about yes. (In Octave, an open-source clone of Matlab, you can do it by an admirably slick formula that might even be memorizable.)
If you suspect that some shape other than a line is best, okay. Then you’ll want to look up and understand the formulas for these linear regression coefficients. That’ll guide you to finding a best-fit for these other shapes. Or you can do a quick, dirty hack. Like, if you think it should be an exponential curve, then try fitting a line to x and the logarithm of y. And then don’t listen to those doubts about whether this would be the best-fit exponential curve. It’s a calculation, it’s done, isn’t that enough?
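A sketch of that quick, dirty hack in Python. The data is invented and doubles exactly at each step so the result is easy to check. With noisy data the line is the best fit to the logarithms, which need not be the best-fit exponential to the original values.

```python
from math import exp, log


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data that doubles at every step: y = 3 * 2**x
xs = [0, 1, 2, 3, 4]
ys = [3.0, 6.0, 12.0, 24.0, 48.0]

# Fit a line to (x, log y), then undo the logarithm: y is roughly a * exp(k * x)
k, log_a = fit_line(xs, [log(y) for y in ys])
a = exp(log_a)

print(f"y is roughly {a:.3f} * exp({k:.4f} * x)")
```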
Back to lines, back to my data. I’ll spare you the calculations and show you the results.
Done. For me, this season, I ended up with a slope ‘m’ of about 2.48 and a ‘b’ of about 451.3. That is, the slightly diagonal black line here. The red circles are what my scores would have been if my performance exactly matched the line.
That seems like a claim that I’m improving over the season. Maybe not a compelling case. That missed night certainly dragged me down. But everybody had some outlier bad night, surely. Why not find the line that best fits everyone’s season, and declare the most-improved person to be the one with the largest positive slope?
My love just completed a season as head of a competitive pinball league. People find this an enchanting fact. People find competitive pinball at all enchanting. Many didn’t know pinball was still around, much less big enough to have regular competitions.
Pinball’s in great shape compared to, say, the early 2000s. There’s one major manufacturer. There’s a couple of small manufacturers who are well-organized enough to make a string of games without (yet) collapsing from not knowing how to finance game-building. Many games go right to private collections. But the “barcade” model of a hipster bar with a bunch of pinball machines and, often, video games is working quite well right now. We’re fortunate to live in Michigan. All the major cities in the lower part of the state have pretty good venues and leagues in or near them. We’re especially fortunate to live in Lansing, so that most of these spots are within an hour’s drive, and all of them are within two hours’ drive.
Ah, but how do they work? Many ways, but there are a couple of popular ones. My love’s league uses a scheme that surely has a name. In this scheme everybody plays their own turn on a set of games. Then they get ranked for each game. So the person who puts up the highest score on the game Junkyard earns 100 league points. The person who puts up the second-highest score on Junkyard earns 99 league points. The person with the third-highest score on Junkyard earns 98 league points. And so on, like this. If 20 people showed up for the day, then the poor person who bottoms out earns a mere 81 league points for the game.
This is a relative ranking, yes. I don’t know any competitive-pinball scheme that uses more than one game that doesn’t rank players relative to each other. I’m not sure how an alternative could work. Different games have different scoring schemes. Some games try to dazzle with blazingly high numbers. Some hoard their points as if giving them away cost them anything. A score of 50 million points? If you had that on Attack From Mars you would earn sympathetic hugs and the promise that life will not always be like that. (I’m not sure it’s possible to get a score that low without tilting your game away.) 50 million points on Lord of the Rings would earn a bunch of nods that yeah, that’s doing respectably, but there’s other people yet to play. 50 million points on Scared Stiff would earn applause for the best game anyone had seen all year. 50 million points on The Wizard of Oz would get you named the Lord Mayor of Pinball, your every whim to be rapidly done.
And each individual manifestation of a table is different. It’s part of the fun of pinball. Each game is a real, physical thing, with its own idiosyncrasies. The flippers are a little different in strength. The rubber bands that guard most things are a little harder or softer. The table is a little more or less worn. The sensors are a little more or less sensitive. The tilt detector a little more forgiving, or a little more brutal. Really the least unfair way to rate play is comparing people to each other on a particular table played at approximately the same time.
It’s not perfectly fair. How could any real thing be? It’s maddening to put up the best game of your life on some table, and come in the middle of the pack because everybody else was having great games too. It’s some compensation that there’ll be times you have a mediocre game but everybody else has a lousy one so you’re third-place for the night.
Back to league. Players earn these points for every game played. So whoever has the highest score of all on, say, Attack From Mars gets 100 league points for that regardless of whatever they did on Junkyard. Whoever has the best score on Iron Maiden (a game so new we haven’t actually played it during league yet, and that somehow hasn’t got an entry on the Internet Pinball Database; give it time) gets their 100 points. And so on. A player’s standings for the night are based on all the league points earned on all the tables played. For us that’s usually five games. Five or six games seems about standard; that’s enough time playing and hanging out to feel worthwhile without seeming too long.
So each league night all the players earn between (about) 420 and 500 points. We have eight league nights. Add the scores up over those league nights and there we go. (Well, we drop the lowest nightly total for each player. This lets them miss a night for some responsibility, like work or travel or recovering from sickness or something, without penalizing them.)
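As a small sketch of that season total in Python (the nightly totals are hypothetical and the function name is made up for illustration):

```python
def season_total(nightly_totals):
    """Sum a player's nightly totals after dropping the single lowest one."""
    return sum(nightly_totals) - min(nightly_totals)


# Hypothetical season of eight nights, including one missed night scored at 420.
totals = [470, 420, 474, 476, 468, 466, 472, 457]
print(season_total(totals))
```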
As we got to the end of the season my love asked: is it possible to figure out which player showed the best improvement over time?
Well. I had everybody’s scores from every night played. And I’ve taken multiple classes in statistics. Why would I not be able to?