This article is part of a series. In Part 1, I outlined the premise: ML/AI shops can borrow tips and best practices from how algorithmic trading (“algo trading”) shops operate. The rest of the articles explore those ideas in more detail.
Data infrastructure is all about the intake, storage, and maintenance of data. A reliable data infrastructure is a pillar of running experiments, and – as I mentioned last time – running experiments is the key to success in the data world. Without a good data infrastructure, researchers will waste time hunting for data, puzzling over what various fields mean, and building their own ad-hoc tools to test models.
How you should build out your data infrastructure will vary based on the data you collect, how you need to control access (more on this closer to the end of this series), your performance requirements, and where your company finds itself on its data journey. That’s worth a book unto itself. That said, the high-level concept is straightforward: you collect data into a central location and make sure it’s always available when people need it.
The tough part is in the discipline. You’ll need to develop and maintain data catalogs (so people know what datasets are available and where to find them) as well as data dictionaries (so people know what’s actually in each dataset and what value ranges to expect). This is hardly fun work, but without it, your researchers can easily head down the wrong path.
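To make the idea of a data dictionary concrete, here is a minimal sketch in Python. The dataset name, fields, and ranges are all illustrative assumptions, not anything from a real system: the point is that each field carries a type, a human-readable description, and an expected value range that researchers (and automated checks) can rely on.

```python
# A minimal data-dictionary entry for a hypothetical "trades" dataset.
# Every field name, type, and range below is illustrative.
trades_dictionary = {
    "symbol":   {"type": str,   "description": "Ticker symbol, e.g. 'AAPL'"},
    "price":    {"type": float, "description": "Execution price in USD",
                 "range": (0.0, 1_000_000.0)},
    "quantity": {"type": int,   "description": "Shares traded",
                 "range": (1, 10_000_000)},
}

def check_record(record, dictionary):
    """Return a list of problems found in one record, per the dictionary."""
    problems = []
    for field, spec in dictionary.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if not isinstance(value, spec["type"]):
            problems.append(f"{field}: expected {spec['type'].__name__}")
        elif "range" in spec:
            lo, hi = spec["range"]
            if not (lo <= value <= hi):
                problems.append(f"{field}: {value} outside [{lo}, {hi}]")
    return problems
```

Even a sketch this small pays off: a record like `{"symbol": "AAPL", "price": -1.0, "quantity": 100}` gets flagged immediately instead of silently skewing a model downstream.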
Closely related to availability is performance. Work with your researchers (and any other customers of the data infrastructure) to calculate the volume of data you’ll store, note how it enters the system, and explore everyone’s processing use cases. Those use cases will determine the response times you need for searching the data and for retrieving results.
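The volume calculation is usually a back-of-envelope exercise. Here is one possible shape of it; every figure below is an assumed placeholder, to be replaced with your own ingest rates and retention policy:

```python
# Back-of-envelope storage sizing for a data feed.
# All figures are illustrative assumptions, not real numbers.
RECORDS_PER_DAY = 50_000_000   # assumed ingest rate
BYTES_PER_RECORD = 200         # assumed average record size
RETENTION_DAYS = 365 * 3       # assumed retention policy
REPLICATION_FACTOR = 3         # assumed storage replication

raw_bytes = RECORDS_PER_DAY * BYTES_PER_RECORD * RETENTION_DAYS
total_bytes = raw_bytes * REPLICATION_FACTOR
total_tb = total_bytes / 1e12

print(f"Raw data:   {raw_bytes / 1e12:.2f} TB")
print(f"Replicated: {total_tb:.2f} TB")
```

Running the numbers early, however rough, tells you whether you’re provisioning a single database or a distributed store, and whether the query latencies your researchers expect are even feasible.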
Being able to store, search, and retrieve data is of little value if that data fails quality checks. That’s a deeper topic that I’ll explore in the next post: “Monitor Your Data Feeds.”