Providing Padding Around ML/AI Models

Posted by Q McCallum on 2020-06-03

Just as I was wrapping up the previous post – on the what, how, and why of keeping an eye on your ML/AI models – I came across a real-world example: this Washington Post article describes how Facebook and YouTube are using models to spot music copyright infringement … but some legitimate performances are getting caught in the net.

What went wrong, from an ML/AI perspective?

The technology works as intended, but not as expected

You know how apps like SoundHound and Shazam can hear a few seconds of a song, and then tell you its name and performer? The models employed by Facebook and YouTube do the same thing, but with the intent to catch unauthorized use. “I’ve found Popular Song X, but it’s not on that band’s official channel, better flag it.”

Which works well. Except when it doesn’t.

A lot of classical music has fallen out of copyright, so just about anyone is free to perform those pieces. Also, as everyone is reading off the same sheet music, it’s expected that those performances will all sound alike. The model that catches illegal copies of Popular Contemporary Song X will also catch perfectly legitimate performances of Well-Known Classical Piece Y.
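
To make that failure mode concrete, here’s a rough sketch of the flagging rule in Python. Everything in it is invented for illustration (the fingerprints, the catalog, the channel names), and it’s certainly not Facebook’s or YouTube’s actual code. The point is what the rule doesn’t know: it has no concept of a public-domain work.

```python
# A toy illustration, not the real system: fingerprints are plain strings and
# the "catalog" is a dict mapping each known fingerprint to its official channel.
from dataclasses import dataclass

@dataclass
class Upload:
    channel: str       # channel that posted the clip
    fingerprint: str   # acoustic fingerprint computed from the audio

CATALOG = {
    "fp-popular-song-x": "official-band-x",
    "fp-classical-piece-y": "official-orchestra-y",   # public-domain piece, same treatment
}

def should_flag(upload: Upload) -> bool:
    """Flag any upload that matches the catalog but isn't on the official channel.
    Notice what's missing: no notion of copyright status, so a legitimate
    performance of a public-domain piece gets flagged like a pirated pop song."""
    official_channel = CATALOG.get(upload.fingerprint)
    if official_channel is None:
        return False   # no match in the catalog; nothing to flag
    return upload.channel != official_channel

# A community orchestra's upload of the classical piece gets caught in the net.
print(should_flag(Upload("community-orchestra", "fp-classical-piece-y")))   # True
```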

The models are doing exactly what they were built to do. They’re also doing a lot more than their designers expected. So the problem isn’t so much what the models are doing, but how much the models are trusted. We need more humans involved.

More humans, please

It’s not just YouTube and Facebook who have been bitten by this. To be fair, companies are generally pretty bad at deploying this kind of automation. From ML/AI models, all the way back to customer service voicemail prompts, their goal seems to be to completely eliminate the human in the loop (or anywhere near the loop) in order to save money. They underestimate the true cost of the automation when comparing it to human staffers doing the same work. Models may eliminate some human jobs, yes, but they redistribute others. And you have to factor that into your plans if you want to succeed.

Consider commercial air travel. A key component of a passenger jet is the autopilot, a sophisticated piece of equipment that is capable of keeping the aircraft aloft and moving in the right direction. While the autopilot is an impressive device, it’s only possible because of the amount of human interaction on either side of it. On one side, you have all of the people who designed, built, and thoroughly tested the system. On the other side, you have the pilot and copilot, who have undergone rigorous training and who are more than capable of flying the aircraft should the autopilot ever prove unfit.

All of this human investment is why a good portion of your flight is handled by automated systems. Through planning, training, and the ability to override decisions, humans create the padding around an ML/AI model (or, frankly, any other kind of automation) to make it safe. Skimping on that padding is how you find the model’s pointy bits. And it sounds like YouTube and Facebook found a few.

The “corner case” that’s really a blind spot

Every model is subject to a number of “ah, I didn’t consider that” situations, known as corner cases and blind spots. A corner case is a troublesome but highly unlikely situation that the model may encounter. A blind spot is also a situation its builders failed to consider, but one the model is far more likely to run into.

You want to catch both corner cases and blind spots as early as possible. You do this by asking a series of “what if …?” and “what about …?” questions, which is very similar to how you kick off a risk assessment. If your team brings a sufficient understanding of the model’s operating environment, relevant domain experience, and well-rounded cognitive diversity, such an exercise should uncover most of your blind spots and a number of corner cases long before the model begins operating in the wild.

Given the number of people who are typically involved in creating an ML/AI model – not just data scientists and data engineers, but also stakeholders (who come up with the idea), product folks (who shape the idea and shepherd it to market), and testers (who develop use cases to catch problems) – I’m having a hard time figuring out how no one noticed that the Facebook and YouTube models would have so much trouble with classical music.

Not enough humans in the (remediation) loop

The more automation you employ, the more important it is to have humans involved when things go awry. This is why you need to uncover blind spots and identify corner cases early on: doing so tells you how much human staffing you’ll need to compensate for the model’s limited world-view. (Usually, by overriding its decision.) And that, in turn, determines what communication paths you must define so that people affected by your misbehaving, narrow-sighted model can get out of its net. If you don’t have enough humans around to use the override switch, you effectively don’t have an override switch.

Credit card companies handle this very well. They leverage plenty of automation for fraud detection. They also establish 24x7 call centers, for those cases when the model flags a legitimate purchase as a fraud attempt (a “false positive,” in industry language). A customer can speak with a human being to override the model and allow the purchase to proceed. All this, within a matter of minutes. Based on what I read in the Washington Post article, it seems that Facebook and YouTube haven’t quite sorted this out.
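
Here’s a rough sketch of that review loop, again in Python and purely illustrative (no real fraud system is this simple). The idea is that the model never gets the last word: it only places items in a queue that staffed humans work through.

```python
# Illustrative only: the model enqueues flags, humans make the final call.
from dataclasses import dataclass
from queue import Queue

@dataclass
class Flag:
    item_id: str
    reason: str
    overridden: bool = False   # set to True when a human clears a false positive

review_queue = Queue()   # holds Flag objects awaiting human review

def model_flags(item_id: str, reason: str) -> None:
    """The model never blocks anything on its own; it only requests a review."""
    review_queue.put(Flag(item_id, reason))

def human_review(is_false_positive: bool) -> Flag:
    """A staffed reviewer works the queue. With too few reviewers, the queue
    backs up and the override switch might as well not exist."""
    flag = review_queue.get()
    flag.overridden = is_false_positive
    return flag

model_flags("purchase-1234", "spending pattern looks like fraud")
print(human_review(is_false_positive=True))   # the customer's phone call clears it
```

The important number isn’t in the code at all; it’s the staffing behind human_review(). If flags arrive faster than reviewers can clear them, the override switch is gone.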

Remember the padding

Keep in mind, an ML/AI model is effectively a piece of factory equipment. It will operate as it is trained, not necessarily as you intended. You want humans on both sides of the model – during development, so people can scan for corner cases and blind spots; and after deployment, so people can quickly address false positives – to provide the padding that keeps it out of trouble.