Not All Datasets Are Created Equal

Posted by Q McCallum on 2022-04-11

Not all of your datasets are created equal.

Some datasets are of higher quality than others, true. Maybe you’ve cleaned them up, or there were stronger checks in place to ensure consistency at the point of collection.

And sure, some datasets have proven more useful to you. Say, portions of your CRM that you’ve turned into successful marketing campaigns. Or data that has improved your retail site’s recommendation engine.

Some of these datasets may have even greater direct revenue potential, because you’ve been able to sell them, or data products derived from them.

But I’m talking about something different: those datasets that carry greater risk than others.

These are the datasets that are more likely to get you in trouble. A change in data privacy laws can suddenly put you on the wrong side of regulatory matters. Or lax security standards can lead to a leak of sensitive data, which will certainly cause a PR headache. Doubly so if you’d never disclosed that you were collecting that data in the first place.

These risky datasets are making money for you today. But they may cost you money down the line.

Looking for trouble

It may help to answer some questions as you review your datasets:

  • Who collected (or generated) this data? If you acquired this yourself – that is, this is first-party data – you know where it came from, which reduces (though does not eliminate) the chances that you’ll be compelled to delete it later. On the other hand, if you’ve purchased this from someone else, you’ll have to take their word on the data’s provenance. Do you believe your data provider’s claims that they collected it from knowing, willing participants? Should you?
  • Does it contain sensitive information? A dataset that doesn’t contain any specific details about people is far less likely to get you into trouble. The idea of “details about people” goes beyond the formal definition of personally identifiable information (PII); it also includes, say, a person’s contact info that you collected as they placed an order.
  • Was it collected appropriately? Don’t assume this is a concern just for third-party data. It can still be an issue for data you’ve collected yourself. Consider a case in which you’ve scraped another website, contrary to the terms of service (TOS). When the site owner finds out, you may be legally required to delete the data and any models built from it. And you may have to endure costly legal action to reach that point.
  • How can we protect this data? Even if you’ve collected the data appropriately, and you have legitimate reason to hold onto it, you still have a responsibility to secure it. That means controlling access for internal projects (is this team allowed to use this data in their work?), keeping out external bad actors (people trying to infiltrate your systems so they can steal it), and even guarding against opportunists (internal or external people who find the data sitting in a publicly accessible file share or website path).

Your answers to these questions will help you rate each dataset according to the risk it carries. From there, you can compare the risk of holding or using that data to the value or revenue it generates through related business processes and ML models. You may find that some datasets aren’t worth the trouble.
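To make that comparison concrete, here is a minimal sketch of how you might record your answers and rank datasets. Everything in it is an assumption for illustration: the `DatasetReview` class, its field names, the equal weighting of the four questions, the `annual_value` estimate, and the sample datasets are all hypothetical. Your own review will likely need finer-grained answers than yes/no.

```python
from dataclasses import dataclass


@dataclass
class DatasetReview:
    """Yes/no answers to the four review questions (hypothetical schema)."""
    name: str
    first_party: bool              # Did we collect or generate it ourselves?
    has_personal_details: bool     # Does it contain details about people?
    collected_appropriately: bool  # Was collection within the rules (TOS, consent)?
    access_controlled: bool        # Is access limited and the data secured?
    annual_value: float            # Rough estimate of value generated (assumed unit)

    def risk_score(self) -> int:
        """Count how many of the four questions raise a flag (0 to 4)."""
        flags = [
            not self.first_party,
            self.has_personal_details,
            not self.collected_appropriately,
            not self.access_controlled,
        ]
        return sum(flags)


def rank_by_risk(reviews):
    """Riskiest first; among equal risk, lowest value first."""
    return sorted(reviews, key=lambda r: (-r.risk_score(), r.annual_value))


# Made-up examples, not real data.
reviews = [
    DatasetReview("crm_contacts", True, True, True, True, 500_000.0),
    DatasetReview("scraped_listings", True, False, False, False, 20_000.0),
    DatasetReview("purchased_leads", False, True, False, False, 5_000.0),
]

ranked = rank_by_risk(reviews)
for r in ranked:
    print(f"{r.name}: risk {r.risk_score()}/4, value {r.annual_value:,.0f}")
```

Datasets that land at the top of this list, with high risk and low value, are the first candidates for “not worth the trouble.” Treating every question as equally important is a simplification; in practice you may weigh, say, inappropriate collection more heavily than the rest.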

Not all data is good data

When you’re building or acquiring a new dataset, be sure to rate it in terms of its risk/reward tradeoff. Data that requires extra protection, or that may cause you trouble in the long run, “costs” more (ergo, is worth less) than it may appear on the surface.

I’ll dig deeper into data valuation approaches in a future post.