This year, an international diploma program used by thousands of high schools canceled exams that students rely on for college admissions, replacing them with a grade determined by an algorithm. The result was lower-than-expected grades for many students, who are now left to deal with the consequences. What went wrong and what can be done to prevent it from happening again?
A statistical model to predict each student’s results
Every year, high school students around the world enrolled in the International Baccalaureate program take an exam that determines their final grade and is used to apply to universities. The results affect whether a student is accepted to their chosen school and what scholarships they receive, and thus directly shape the student’s future.
This year, the in-person exams were canceled because of the COVID-19 pandemic. Rather than moving the tests online, the IB, the foundation behind the program, decided to use a statistical model to predict each student’s results. The predictions were supposed to be based on the student’s coursework, teacher predictions, and historical school data.
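The IB never fully disclosed how its model worked, but conceptually a prediction built from those three signals could resemble a simple weighted combination. The following Python sketch is entirely hypothetical - the weights, inputs, and formula are illustrative assumptions, not the IB's actual model:

```python
# Hypothetical sketch: NOT the IB's actual model, which was never fully
# disclosed. The weights and the formula are illustrative assumptions only.

def predict_grade(coursework: float, teacher_prediction: float,
                  school_historic_avg: float) -> int:
    """Combine the three signals into a predicted IB subject grade (1-7)."""
    # Assumed weights; a real model would fit these to past exam data.
    score = (0.4 * coursework
             + 0.4 * teacher_prediction
             + 0.2 * school_historic_avg)
    # IB subject grades are integers from 1 to 7.
    return max(1, min(7, round(score)))

# A top student (coursework 7, teacher predicts 7) at a school whose
# students historically averaged a 4:
print(predict_grade(coursework=7.0, teacher_prediction=7.0,
                    school_historic_avg=4.0))  # → 6
```

Even this toy version shows the kind of failure students reported: the school's historical average drags an individual top performer below the grade their own work and their teacher pointed to.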
An algorithm that changed the lives of thousands of students - for the worse
At first glance, this might sound like a decent workaround, especially during a global crisis: it streamlines the process and provides results when taking in-person exams is simply not possible. But the system misfired badly, leaving many teachers, students, and families dumbfounded. Some students had their college acceptances rescinded; others lost scholarships.
The model itself was flawed, and the way it worked was never fully disclosed.
But what if the model was better? Would it be okay to just drop it onto a group of people who were completely unprepared for it?
First, let’s state something that should be obvious, but amidst the machine-learning hype sometimes is not: this algorithm was not helping you pick a movie to watch. It was deciding the final grades of hundreds of thousands of high schoolers, “predicting” how they would have done on the exam and influencing which colleges they could attend. The system was making decisions with a massive impact on young people’s lives.
Done properly, the whole process would have had to take into account the results’ impact on everyone involved. It wasn’t just about getting the algorithm right. First and foremost, it was about proper change management and putting human checks on the system. Machine learning is by definition an approximation of real life, not real life itself, and thus can never be right 100% of the time.
The information would still be assessed by people, just with a boost from machine learning. That is not a revolutionary idea - it is how things already work in healthcare, where machine learning is integrated into workflows but actual humans check all of its results.
Furthermore, human graders integrated into the process could have assessed the quality of the system and tested it on a small group, one that would provide the additional context the machine was missing. Only after that phase, once most issues had been discovered and addressed, would the ML system be rolled out to a wider group. Finally, after a lot of careful monitoring, one could consider whether a fully automated grading system could be introduced without disastrous effects. And even then, the predictions would need to be manually reviewed, with a simple appeal process available to every student.
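One way to picture that human-in-the-loop gate: automatically release only the predictions the model is confident about and that agree with the teacher's estimate, and route everything else to a human grader. A minimal sketch, where the thresholds, field names, and confidence scores are all illustrative assumptions rather than any real grading system's design:

```python
# Illustrative human-review gate; thresholds and fields are assumptions,
# not any real grading system's design.

from dataclasses import dataclass

@dataclass
class Prediction:
    student_id: str
    model_grade: int      # grade the model produced (1-7)
    confidence: float     # model's self-reported confidence in [0, 1]
    teacher_grade: int    # the teacher's predicted grade (1-7)

def needs_human_review(p: Prediction,
                       min_confidence: float = 0.9,
                       max_disagreement: int = 1) -> bool:
    """Flag predictions a human grader must check before release."""
    if p.confidence < min_confidence:
        return True  # model is unsure of its own answer
    if abs(p.model_grade - p.teacher_grade) > max_disagreement:
        return True  # model and teacher disagree sharply
    return False

batch = [
    Prediction("s1", model_grade=6, confidence=0.95, teacher_grade=6),  # auto-release
    Prediction("s2", model_grade=4, confidence=0.95, teacher_grade=7),  # big disagreement
    Prediction("s3", model_grade=5, confidence=0.60, teacher_grade=5),  # low confidence
]
flagged = [p.student_id for p in batch if needs_human_review(p)]
print(flagged)  # ['s2', 's3']
```

The design choice is the point: the model never gets the final word on a contested or uncertain case, and the thresholds give administrators an explicit dial for how much they trust it.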
When introducing machine learning, especially in sensitive fields where it has a real social impact, we must not disregard the ethical implications.