The always/never tradeoff in data collection

Posted by Q McCallum on 2023-05-30
[Image: a person walking on a tightrope across a canyon]

(Photo by Loic Leray on Unsplash)

Earlier this month it was reported that the Los Angeles Police Department (LAPD) had experienced a data leak. That’s not a euphemism for a data breach, in which a bad actor infiltrates a system and walks out with data. I mean an actual leak, an unintentional release of data. The LAPD was responding to a request for public information but accidentally included some sensitive, nonpublic information in that bundle – including the names and photographs of officers who were operating undercover.

There’s a lot to be said about this incident. Top of mind is the safety of the (no-longer-)undercover officers and the cases they were working on, as well as the LAPD’s policies and procedures for handling information requests.

There’s also a lesson here about data management, in the form of the always/never paradox.

Always/never explained

The “always/never” concept hails from nuclear safety and is detailed in Eric Schlosser’s book Command and Control. The idea is that you want a nuclear weapon that always launches and detonates when it’s supposed to, and never launches or detonates when it’s not supposed to. As nuclear weapons are intended to cause widespread damage, this blend of always and never forms a desirable state.

It’s also an impossible state.

Any nuclear missile that has a chance of functioning at all will also have a chance of functioning when you don’t want it to do so. Human error, confusing safety rules, sabotage, and technology flaws can all trigger an unintended launch. (This isn’t just speculation on my part. There have been at least two recorded incidents in which world superpowers, briefly misunderstanding the situation, almost launched missiles in what they thought would be a counterattack.) The mere act of having such a weapon means you bear the risk of starting World War III.

Zooming out, always/never is a particular framing of the risk/reward concept. If you want the reward of “being able to launch a nuke at your enemies” then you shoulder the risk of “that same nuke going off at the wrong time.” It’s a package deal.

Always/never in data collection and storage

The lesson about always/never applies beyond the world of nuclear weapons. The existence of cars means that there will be car accidents, for example. The act of eating food means that, at some point, there will be food poisoning. And so on.

This doesn’t mean that every driver will be involved in an accident every time they get behind the wheel. Nor will every person come down with food poisoning at each meal. But if we look at the big picture of driving or eating, we’ll see that some percentage of drivers and eaters wind up on the wrong side of the statistics and suffer an incident.

The LAPD leak represents two flavors of always/never for companies that collect and store data:

1/ If you have a process for releasing data, that process may lead to the release of too much data or the wrong data. This can be as innocuous as a client getting access to an additional day’s worth of market data, or as serious as mailing a customer someone else’s banking statement.

2/ If you collect any data at all, you bear the risk of that data getting out. This isn’t limited to bad actors conducting a data breach. There are numerous tales of large, well-known corporations accidentally setting the wrong permissions on files, thereby making them accessible via a public URL.

Addressing the issue

So … what do you do?

The only way to guarantee that you won’t accidentally release data is to not collect data at all. Some groups (notably, certain VPN providers) do this on purpose to protect their clients’ privacy. They can’t leak data that they don’t have.

For most other companies, however, this isn’t an option. Maybe you use this data to improve your internal operations, or it drives your ML models, or you analyze and resell it as a data product. Your very business model comes with a side of risk: you get the money, and you may experience a mishap.

You can’t completely eliminate the chances of experiencing a problem, but you can reduce them. (Similar to our driving example: just because data leaks happen, writ large, does not mean they have to happen to you. You can take some steps to nudge yourself to the favorable end of that statistic.)

The first step is to conduct a thorough risk assessment. This will give you a clearer idea of where a leak could come from. You can then decide on your risk mitigation approach: you either close off those sources of trouble, or you accept them and operate with your eyes wide open to the potential consequences.

Delete data you don’t need: As you review your data field-by-field, ask yourself whether you really need it. Any data that you’re not actively using, and that you don’t have a concrete plan to use, is a net negative: it’s not generating revenue (because you’re not using it) but it also represents a possible future cost (because it can leak, and then you have to handle the cleanup).
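To make the field-by-field review concrete, here is a minimal sketch in Python. It assumes you keep a simple inventory that notes, for each field, where it is used today and what concrete plans exist for it. The FieldRecord class, the field names, and the usage labels are all hypothetical; your own inventory will look different.

```python
from dataclasses import dataclass, field


@dataclass
class FieldRecord:
    """One entry in a (hypothetical) inventory of collected data fields."""
    name: str
    # Where the field is actively used today.
    current_uses: list[str] = field(default_factory=list)
    # Concrete, scheduled plans to use the field.
    planned_uses: list[str] = field(default_factory=list)


def deletion_candidates(inventory):
    """Return names of fields with no current use and no concrete plan."""
    return [
        record.name
        for record in inventory
        if not record.current_uses and not record.planned_uses
    ]


# Illustrative inventory; every name and label here is made up.
inventory = [
    FieldRecord("email", current_uses=["billing notifications"]),
    FieldRecord("date_of_birth"),
    FieldRecord("shipping_address", planned_uses=["delivery estimates"]),
    FieldRecord("device_fingerprint"),
]

print(deletion_candidates(inventory))
```

In this made-up inventory, date_of_birth and device_fingerprint come back as candidates. The list is a starting point for a conversation with the teams involved, not an automatic delete button.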

Develop procedures against internal mishaps and bad actors: Once you’ve settled on which data to keep, it’s time to sort out who in the company should have access to that data. If you insist on building a “data lake,” for example, be thoughtful about which people or teams get access to which datasets (and which fields in those datasets). Then establish controls around that data: track requests for data access, failed attempts at access, and successful accesses.
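As a rough illustration of field-level access plus tracking, the Python sketch below checks each request against a table of grants and records every attempt, whether it succeeds or fails. The GRANTS table, the team and field names, and the in-memory audit_log are assumptions for the sake of the example. A real setup would lean on your data platform's own permission and logging features.

```python
from datetime import datetime, timezone

# Hypothetical grants: team -> dataset -> fields that team may read.
GRANTS = {
    "analytics": {"orders": {"order_id", "total", "region"}},
    "support": {"orders": {"order_id", "customer_email"}},
}

# In-memory stand-in for a real audit trail.
audit_log = []


def request_fields(team, dataset, fields):
    """Return the requested fields if all are granted; log every attempt."""
    allowed = GRANTS.get(team, {}).get(dataset, set())
    denied = sorted(set(fields) - allowed)
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "team": team,
        "dataset": dataset,
        "fields": sorted(fields),
        "outcome": "denied" if denied else "granted",
        "denied_fields": denied,
    })
    if denied:
        raise PermissionError(f"{team} may not read {denied} in {dataset}")
    return sorted(fields)


# One successful access and one failed attempt; both land in the log.
request_fields("analytics", "orders", ["order_id", "total"])
try:
    request_fields("analytics", "orders", ["customer_email"])
except PermissionError as err:
    print(err)

print([entry["outcome"] for entry in audit_log])
```

The point of logging before raising the error is that failed attempts are recorded too. Those are often the more interesting entries when you review the log later.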

This is also an opportunity to document your data flows so you can remove sensitive records from a downstream ML model or data product. For extra points, rehearse your data exports and review the outputs. Do you have a leak baked into your procedures?
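A rehearsal review can be partly automated. The sketch below, in Python, scans a trial export for fields that should never leave the building and for records flagged as restricted. The field names, the "restricted" flag, and the sample records are all hypothetical, and a check like this complements a human review rather than replacing it.

```python
# Hypothetical names of fields that must never appear in a public export.
SENSITIVE_FIELDS = {"home_address", "photo_url"}


def review_export(rows, sensitive_fields=SENSITIVE_FIELDS,
                  restricted_flag="restricted"):
    """Return a list of problems found in a rehearsal export."""
    problems = []
    for index, row in enumerate(rows):
        # Fields present in this row that are on the sensitive list.
        leaked = sorted(sensitive_fields.intersection(row))
        if leaked:
            problems.append(f"row {index}: sensitive fields present: {leaked}")
        # Records carrying a truthy restricted flag should not be exported.
        if row.get(restricted_flag):
            problems.append(f"row {index}: record is marked restricted")
    return problems


# Made-up rehearsal data: one clean row and two that should be caught.
rehearsal = [
    {"name": "A. Example", "badge": "1001"},
    {"name": "B. Example", "badge": "1002", "photo_url": "b.jpg"},
    {"name": "C. Example", "badge": "1003", "restricted": True},
]

for problem in review_export(rehearsal):
    print(problem)
```

If the rehearsal turns up any problems, the export procedure itself needs fixing before it runs against a real request.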

Establish safeguards against external bad actors: What have you told your Chief Information Security Officer (CISO) about where all of your data lives and how it flows? They’ll need to know this in order to develop controls (such as firewalls) that keep sensitive systems away from the public internet. They may even require you to limit what datasets or fields exist on an internet-connected machine.

For bonus points, you can periodically audit those externally-facing systems to check for files or records that shouldn’t be there.
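One simple form of that audit is an allowlist check: walk the directory a public-facing system serves and flag anything that isn't an expected file type. The Python sketch below assumes the public files live in a single directory tree and that file suffixes are a reasonable proxy for "belongs here." Both are simplifications, and the function name and sample files are made up for illustration.

```python
import tempfile
from pathlib import Path


def unexpected_files(public_root, allowed_suffixes, allowed_names=frozenset()):
    """List files under public_root whose suffix and name are not allowlisted."""
    root = Path(public_root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
        and path.suffix.lower() not in allowed_suffixes
        and path.name not in allowed_names
    )


# Demo against a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "index.html").write_text("<html></html>")
    (demo_root / "exports").mkdir()
    (demo_root / "exports" / "customers.csv").write_text("id,email\n")
    # Only HTML and CSS are expected on this hypothetical web root.
    print(unexpected_files(demo_root, {".html", ".css"}))
```

Run on a schedule, a check like this can surface a stray export file before someone outside the company finds it.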

Review your incident response procedures: Your PR, IT, and legal teams should know in advance what they’ll do when things eventually go wrong. To do that, they must know what data you have, how you collected it, how you notified people of that collection, and so on.

Narrowing the scope of your “always”

The always/never paradox explains why the simple act of collecting data exposes you to the risk of a data leak. If you want the never, you have to give up the always.

Still, you can reduce your exposure by taking proactive measures to understand what data you have and how you’re protecting it. The mitigation steps I’ve listed above should take you a long way on that journey.