And the tools we are given are so custom, so specific, that it feels like they must do something to tease out the nuanced metric we are trying so hard to measure. Built from constraints and categories meant to block out every opportunity for extraneous bias, these tools have one job to do, so they must be doing it, right? We hope the performance assessment tools we subject our colleagues to do something to, you know, assess performance. Placement exams for college that sort people into ready and not-quite-ready-and-could-benefit-from-some-help should, we hope, ultimately help people in both categories reach their greatest potential. And the Sorting Hat from Harry Potter should plant the incoming class with enough seeds of adolescent doubt and in-group/out-group dynamics to set the stage for a sufficiently exciting school year. Though that last one was sarcastic, it may be the only assessment tool on the list doing its described job.
One colleague of mine is co-organizing a conference, and part of her duties includes selecting speakers from the pool of submitted abstracts. The organizers created a scoring system and assigned multiple relevant experts as reviewers, taking care that each abstract landed in front of multiple sets of eyes. It all felt very purposeful and fair. Then the scores came in. On a 100-point scale, many of the abstracts showed gaps of 40-50 points between reviewers. Once they dug into the data, they found that reviewers fell into two main camps. There were the reviewers who started at 100 and only took away points as issues or concerns came up; their scores landed mostly in the 80s and above. Then there were the reviewers who started from a blank slate and made applicants earn every point from the bottom, ending up with batches of scores in the 60s. Within any individual reviewer, the tool was applied pretty consistently. But combine the scores across reviewers, and the totals quickly became meaningless. Last I checked, they were working on ways to scale the scores based on each individual reviewer's average.
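I don't know exactly which adjustment the organizers landed on, but a minimal sketch of one common approach is a per-reviewer z-score: re-center each reviewer's scores around their own average so that a lenient 88 and a strict 62 can be compared on the same footing. The reviewer names and scores below are invented for illustration.

```python
from statistics import mean, stdev

def rescale(scores_by_reviewer):
    """Convert each reviewer's raw scores into per-reviewer z-scores,
    then average the adjusted scores for each abstract."""
    adjusted = {}
    for reviewer, scores in scores_by_reviewer.items():
        m = mean(scores.values())
        s = stdev(scores.values()) or 1.0  # guard against a zero spread
        for abstract, raw in scores.items():
            adjusted.setdefault(abstract, []).append((raw - m) / s)
    return {abstract: mean(zs) for abstract, zs in adjusted.items()}

# A "start at 100" reviewer and a "start at 0" reviewer (made-up data):
scores = {
    "lenient": {"A": 95, "B": 85, "C": 88},
    "strict":  {"A": 70, "B": 58, "C": 62},
}
print(rescale(scores))  # A scores above average for both reviewers
```

The adjusted numbers say how far above or below a reviewer's own baseline each abstract landed, which is a more defensible thing to sum across reviewers than the raw totals.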
This is hardly unique. One study from the Journal of Applied Psychology found that as much as 62% of the variance in job performance ratings reflected idiosyncrasies of the rater rather than the performance being rated. The rater effect was stronger than the other measured factors combined. In their book “Nine Lies About Work,” authors Marcus Buckingham and Ashley Goodall compare the process to judges rating the performance of a figure skater. Even in that task, based on seemingly objective factors like the number of rotations and the edge of the blade at takeoff and landing, the variation between judges can be huge. And as opposed to the relatively concrete attributes of figure skating technique, the authors lament that job performance measures ask us to rate each other on nebulous concepts like “leadership” and “political savvy.” Suddenly, finding a scaling scheme for those conference abstract scores seems like the easy problem.
And it’s obvious that such sorting matters. It can decide who gets a raise, who gets promoted, and who has to take remedial classes before starting college. And the labels it applies often stick. A friend of mine works at a company that gives employees annual scores of 1-5 in a number of categories. For those employees, promotions and raises are contingent on the number of fours and fives they accumulate. But, because the company wants to limit raises and promotions, it puts caps on these ratings. Each department only gets so many fours and fives to go around. Giving a five to one person means it can’t go to someone else. And managers in the same department end up jockeying, trying to keep too many high ratings from going to a single unit. This, of course, has more to do with a perceived sense of fairness than with actual performance. Buckingham and Goodall cite similar systems in their book, observing that the tools no longer have anything to do with assessing performance and operate instead as a means of controlling the distribution of resources.
But for the employees, these ratings and labels become part of their records as supposed measures of their performance. My friend laments the hours his colleagues spend laboring to provide evidence of their own work in maddeningly abstract categories, spending lunches theorizing about what their manager “is really looking for.” The manager, on the other hand, is figuring out how to distribute an arbitrary and pre-defined set of numbers across units – those detailed performance reports probably long forgotten.
From the higher education side, entrance exams and assessments are coming under similar scrutiny. Tools meant to identify students who would benefit from remedial work should show some indication that students who take that coursework are ultimately more successful than similar students who don’t. But like the employees who aren’t hearing “it just wasn’t your year for a four” but rather “you lack potential,” these students seem to be hearing something else entirely. One study found that a third of students who were told they needed a remedial course never enrolled in any courses for the program at all. Only 60% even enrolled in the course that was recommended to them. What kind of measure of effectiveness can you even get when the students being told they need remedial work aren’t failing to complete programs because of coursework or withdrawing, but because they dropped out of the system without ever signing up? A separate 2016 report found that GPA was a better predictor of program performance than the exams designed to assess exactly that performance (SAT, ACT, ACCUPLACER). How can you tell if a placement tool is effective when its results stop students from even trying? What do you do with Sorting Hats that tell students and employees one thing, but ultimately deliver the message, “Maybe you should just go home”?