This article is part of a series. In Part 1, I outlined the premise: ML/AI shops can borrow tips and best practices from how algorithmic trading (“algo trading”) shops operate. The rest of the articles explore those ideas in more detail.
In the previous post I explored the importance of data infrastructure. Closely related is the importance of data quality and data monitoring. Some ML/AI shops mistakenly treat the model as the center of the universe; well-run shops treat data as a first-class citizen. A big part of that involves monitoring their data feeds. Not just “are we still collecting data?” but “what’s coming through, and is that OK?”
The first source of problems is in the data itself: malformed records, out-of-bounds (or, worse yet, eerily consistent) values, and disappearing fields can all throw your ML/AI model training routines for a loop. To prevent this, perform regular – and, preferably, automated – data quality analysis 1 on your data feeds to detect anomalous conditions. You should do this whether the data comes from an in-house application or an external data vendor, because no source of data is perfect.
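The checks described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production monitor, and the field names, bounds, and sample records are all hypothetical:

```python
def check_feed(records, required_fields, bounds):
    """Return a list of human-readable issues found in a batch of records."""
    issues = []
    for i, rec in enumerate(records):
        # Disappearing fields: every record should carry the expected keys.
        missing = required_fields - set(rec)
        if missing:
            issues.append(f"record {i}: missing fields {sorted(missing)}")
        # Out-of-bounds values: sanity-check numeric ranges.
        for field, (lo, hi) in bounds.items():
            value = rec.get(field)
            if value is not None and not (lo <= value <= hi):
                issues.append(f"record {i}: {field}={value} outside [{lo}, {hi}]")
    # Eerily consistent values: a field that never varies across a batch
    # is often a sign of a stuck upstream feed.
    for field in sorted(required_fields):
        values = {rec.get(field) for rec in records if field in rec}
        if len(records) > 1 and len(values) == 1:
            issues.append(f"field {field!r}: identical value in every record")
    return issues

# Hypothetical batch from a price feed, with two deliberate defects.
feed = [
    {"price": 101.5, "volume": 900},
    {"price": -4.0, "volume": 900},   # out-of-bounds price
    {"volume": 900},                  # missing field
]
problems = check_feed(
    feed,
    required_fields={"price", "volume"},
    bounds={"price": (0.0, 10_000.0), "volume": (0, 1_000_000)},
)
for p in problems:
    print(p)
```

Running a check like this on every batch, and alerting when the issue list is non-empty, is the “automated” part: the goal is to catch a broken feed before it reaches model training.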
Another, more subtle problem occurs when the incoming data contains values that are authentic but abnormal. Spencer Burns describes such a case in the chapter he contributed to Bad Data Handbook 2: a stock ticker symbol – what should be a unique company identifier for tracking share prices – can change or even disappear overnight as a result of companies closing down or merging. Price data can also change by a wide margin when a stock splits, or when two companies merge and the familiar ticker symbol now reflects the combined entity. Wise traders keep an eye on the business news so these matters don’t catch them unawares and send their algos down the wrong path.
The lesson applies well beyond the stock market. When you feed your ML/AI models data related to real-world events, you need to monitor real-world activity. Ask anyone building models that examine news reports, shopping patterns, or potential fraud: election years, natural disasters, or even the current COVID-19 pandemic can dramatically redefine “normal” and confuse models.
I’ll talk about how to handle these real-yet-wildly-abnormal patches of data in the next post.
1. For a deeper look into data quality analysis, I recommend the chapter I co-wrote with Ken Gleason in Bad Data Handbook called “Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough.” Therein, we introduce our framework, “The Four Cs of Data Quality Analysis.”
2. Bad Data Handbook, chapter 9: “When Data and Reality Don’t Match.”