## Complex Experiments with Grading Mathematics

While I’ve never managed to attempt an experimental grading system like the one I enjoyed in Real Analysis, I have tried a few more modest experiments. The one chance I’ve had to really go wild and do something I’d never seen before, sadly, failed, but let me resurrect it enough to leave someone else, I hope, better informed.

The setting was a summer course, which the department routinely gave to graduate students to teach as a way of keeping them in the luxurious lifestyle to which grad students become accustomed. For five weeks and a couple of days I’d spend several hours explaining the elements of vector calculus to students who either didn’t get it the first time around or who wanted to avoid dealing with it during the normal term. (It’s the expansion of calculus to deal with integrals and differentials along curves, and across surfaces, and through solid bodies, and remarkably is not as impossibly complicated as this sounds. It’s probably easier to learn, once you know normal calculus, than it is to learn calculus to start. It’s essential, among other things, for working out physics problems in space, since it gives you the mathematical background to handle things like electric fields or the flow of fluids.)

What I thought was: the goal of the class is to get students to be proficient in a variety of techniques, so that they could recognize what they were supposed to do, set up a problem to use whatever technique was needed, and carry out the technique successfully. So why not divide the course up into all the things that I thought were different techniques, and challenge students to demonstrate proficiency in each of them? With experience behind me I understand at least one major objection to this, but even if that objection were dealt with, I’d still have blown it in the implementation.

Dividing a course into its atoms is a fine enough idea and probably useful for people who work to a much more detailed syllabus than I do. But here’s how I graded proficiency: students would get homework problems, and exam problems, and if they did great work on either homework or exam, or decent work on both together, that counted as proving their skills on that topic. Another objection should arise here, one which proved pretty important to the failure.

To get a scheme this loose to fit the A/B/C/D/F selection the Registrar wanted, I made the course grade the number of topics with demonstrated proficiency divided by the number of topics in the course. I think it worked out to something like 40 topics, so people who proved they were able to do problems in 36 topics or more got an A, 32 and up got a B, and so on down to miserable failure.
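For anyone who wants to see the scheme's mechanics spelled out, here is a minimal sketch in Python. The rule (great work on either homework or exam, or decent work on both) and the 36-of-40 and 32-of-40 cutoffs are as described above; the numeric thresholds for "great" and "decent", the C and D cutoffs, and all the names are assumptions made up for illustration, not what I actually wrote down at the time.

```python
from dataclasses import dataclass

# Assumed thresholds: the post only says "great" and "decent" work.
GREAT = 0.9
DECENT = 0.7


@dataclass
class TopicScores:
    """Fraction of credit (0.0 to 1.0) earned on one topic."""
    homework: float
    exam: float


def is_proficient(scores: TopicScores) -> bool:
    """Binary proficiency: great on either, or decent on both."""
    great_on_either = scores.homework >= GREAT or scores.exam >= GREAT
    decent_on_both = scores.homework >= DECENT and scores.exam >= DECENT
    return great_on_either or decent_on_both


def letter_grade(n_proficient: int, n_topics: int = 40) -> str:
    """Map the count of proficient topics to a letter grade.

    With 40 topics, 36 and up is an A and 32 and up is a B, as in the post.
    The C and D cutoffs (70% and 60%) are an assumed continuation.
    """
    # Compare in integer tenths to avoid floating-point edge cases.
    for tenths, letter in [(9, "A"), (8, "B"), (7, "C"), (6, "D")]:
        if 10 * n_proficient >= tenths * n_topics:
            return letter
    return "F"


if __name__ == "__main__":
    # Great homework and a blank exam page still counts as proficient.
    print(is_proficient(TopicScores(homework=0.95, exam=0.0)))
    print(letter_grade(36), letter_grade(32))
```

Notice that a student with great homework on a topic is marked proficient even with nothing at all on the exam, which is a property of the rule itself rather than of any particular threshold.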

Here’s how complete my failure was: nobody failed. Or got a D. Or a C, for that matter. As I remember it (I *think* accurately, but surely I would?), of the 13 students in the summer course, four got B’s and nine got A’s, the most inflated grade curve I’ve ever given out. Mercifully, the department didn’t ask me to explain this, possibly trusting that between the small class size and the abnormal population there would be weird grading artifacts, possibly trusting that however I screwed up the grading the students would recover.

So how did it get so screwed up? I think the core problem was the demonstration-of-proficiency standard. As it worked out, this was binary: the student had shown an ability to do it, or hadn’t. But proficiency is a more fluid concept than that. It requires more subtle gradations, which is one of the benefits of an exam (or homework) being out of 100 (or whatever) points: you can recognize the distinction between A-level work and C-level work.

One of the side effects of this mistake was that I had to divide the course into many, many little sections — the 40 I mentioned above, and I remember trying to think whether I could get it up to 50. After all, a true/false test can measure with pretty fine gradation how well someone knows a subject, if there are enough questions, covering the subject broadly enough. So the binary grading of each topic implied having many topics.
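To put a rough number on "fine gradation", here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each topic is an independent pass/fail trial with the same underlying chance of success, which real topics certainly are not; the function name is made up. Under that assumption the observed fraction of topics passed has the usual binomial standard error.

```python
import math


def standard_error(p: float, n_topics: int) -> float:
    """Standard error of the observed pass fraction over n_topics
    independent pass/fail topics, each passed with probability p.
    (Independence and a common p are simplifying assumptions.)"""
    return math.sqrt(p * (1.0 - p) / n_topics)


if __name__ == "__main__":
    for n in (10, 40, 50):
        se = standard_error(0.9, n)
        print(f"{n} topics: about {100 * se:.1f} percentage points")
```

For a student at the 90% level, 40 topics gives an uncertainty of a bit under five percentage points, roughly half a letter grade, and going to 50 topics only trims that slightly.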

Another consequence was that since I made the course depend on so many topics, I couldn’t ask too many questions about each of the topics. It would just produce a terrible workload, for myself as grader and for my students as people trying to figure out what the heck a contour integral *is*, to do otherwise. So I couldn’t say with certainty whether a student who proved proficiency *actually* was any good at it, the way I could have if they had worked out a bunch of problems covering similar themes, or had just got lucky. Or had done the homework with someone who was good at it.

This brings me to another way that this scheme failed: students could pretty easily game the system. Students could skip, for example, exam problems which covered topics they’d already proved proficiency in, and if they were managing their time wisely certainly should. The exams could be second chances for poorly done homework instead, and I believe that is how the students treated them. This was great for getting exams graded more rapidly, since students only worked on a couple of problems and the most blessed thing to encounter on an exam is a blank page (edging out a perfectly correct answer neatly written), but I don’t think that was a good trade-off to make.

Another of the core problems of this approach is that dividing the course into so many topics meant I had to divide up homework and exam problems into each topic, and — to be fair to the students — identify which problems covered which topics. I couldn’t do a multiple-part problem that tested different topics, lest I force students to redo topics that, by the rules, they didn’t have to.

A smaller problem, and one I probably could have managed, was that I couldn’t write exams the way I normally like. I like to have an exam be several long-form questions, a page or two of short-answer questions, and another page of true/false or multiple-choice questions. The short questions are a good way to check tiny bits of knowledge, certainly, and I admit they’re easy grading, but they also let me ask about points which I have to be sure a student knows but which aren’t worth a full question. But this was my limited thinking. I could’ve overcome that if, for example, I’d required answering a set of multiple-choice questions on a single topic for the proof of proficiency.

But now I think even that wouldn’t have fixed the deepest flaw in the way it constrained my exams. I implicitly promised to ask questions about every topic since the previous exam, and that promise took away part of what an exam is for. At least some of the value of exams has to be that students should consider what topics are likely to be on the test, and what aren’t, and how important each topic is, and evaluate what they should have learned and what they have to get better on, swiftly.

If I wanted to salvage this scheme — and I’m not sure it’s worth it — I’d have to fix the definition of “proficiency” so that it had a broader range than just demonstrated/not demonstrated, first of all. I think this would require dividing the course into fewer topics, as well, so that “proficiency” can be a more graduated measure, and so that I can have some of the demonstration be in homework and some in exams without having either be pointless or a last-chance makeup.

If someone else wants to salvage this scheme, you’re welcome to try. I’d be interested to know of ways to make it functional.
