The Importance of Data Infrastructure

Posted by Q McCallum on 2016-11-21

A successful data science shop requires more than just data scientists.

Many firms assume that the key to winning the data race is to hire several data scientists. This is not entirely wrong – you’ll certainly need people to analyze the data – but nor is it entirely correct. Having data scientists on your team is only one of several necessary ingredients.

By analogy, consider the world of medicine: surgery involves more than surgeons. Think of all of the people, procedures, and spaces that must be in place and running smoothly so that your surgeon can focus on your operation: scheduling, room preparation, patient preparation, and anesthesiology are just a few. In a well-run hospital, neither you nor the surgeon need be aware of most of what goes on behind the scenes. Everything just works.

You want the same for your firm’s data science practice. Effective use of data requires developing a culture of experimentation, in which people (even those who do not consider themselves to be data scientists) are able to develop and test hypotheses at will. In turn, that requires that people can quickly get to the data they need so they can analyze it. They don’t want to waste time fumbling around looking for the data, sussing out what each field means, or connecting their tools to it. They want things to just work so they can focus on the task at hand.

How do you make your data shop run like the hospital in our example? How can you enable your data scientists to skip the data access and data prep, so they can get straight to the analysis? You’ll need a solid, well-maintained data infrastructure based on the following:

  • data collection and data pipelines: Ingest data from internal and upstream sources, clean it up, and otherwise transform it so it is usable.
  • data repository: Figure out where and how you will store the data. Based on your volume of data and how you plan to use it, this can be a series of relational databases, a data warehouse, or a data lake.
  • policies and access controls: Determine who can see which datasets.
  • documentation of each data source: Define a directory that explains what data sources are available and where to find them, plus details on what each field means.
  • tools: People will use these to analyze the data. Make sure they are compatible with the underlying repository and can perform the analyses of interest.
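To make the "policies and access controls" item concrete, here is a minimal sketch in Python of a deny-by-default lookup that answers "who can see which datasets." The role names, dataset names, and the can_read helper are all hypothetical, invented for illustration; a real shop would more likely lean on the permission system built into its database or warehouse.

```python
# Hypothetical policy table: which roles may read which datasets.
# Every role and dataset name here is invented for illustration.
DATASET_READERS = {
    "web_clickstream": {"data_scientist", "analyst"},
    "customer_billing": {"finance_analyst"},
}


def can_read(role: str, dataset: str) -> bool:
    """Return True only if the role is listed as a reader of the dataset.

    Datasets missing from the policy table are denied by default.
    """
    return role in DATASET_READERS.get(dataset, set())


print(can_read("analyst", "web_clickstream"))   # True
print(can_read("analyst", "customer_billing"))  # False
```

The point of the sketch is the default: a dataset that nobody has written a policy for is visible to no one, which nudges people to define the policy before they load the data.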

In turn, a proper data infrastructure requires some key hires:

  • data engineers: These data-focused software developers will create and maintain your data pipelines.
  • tool admins: Whether you’re living on relational databases, Hadoop, or some specialized appliance, the tools’ end-users should not also be the maintainers.
  • IT operations: This crew is responsible for the underlying hardware and architecture. Yes, you’ll need experts here even (or, perhaps, “especially”) if your systems run in the cloud.

In an ideal world, you would have all of this infrastructure in place before your first data scientist joins. It’s tougher to implement all of this after your data efforts are already underway, but the longer you wait, the tougher it gets to establish the necessary policies and boundaries.

Furthermore, to develop and maintain a solid data infrastructure requires discipline because you’ll sometimes trade short-term gains for long-term stability. For example: it’s tempting to just throw data into the repository and start analyzing it. It takes extra time and effort to first update your data dictionary – to track the dataset’s source provider, location in the repository, and meaning of each field – and people are likely to resist because doing this extra work slows them down. You may require someone in a senior leadership role to enforce the rules so that short-term thinking doesn’t take over.
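One way to make that discipline harder to skip is to have the tooling enforce it. What follows is a minimal sketch, in Python, of a data dictionary that records the three things mentioned above (source provider, location in the repository, and the meaning of each field) and refuses to register a dataset with undocumented fields. The DatasetEntry and DataDictionary classes, and every name and value in the example, are hypothetical; treat this as an illustration of the idea, not a recommended product or schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One data dictionary record (hypothetical layout)."""
    name: str
    source_provider: str
    location: str
    # Maps each field (column) name to a plain-language description.
    fields: dict[str, str] = field(default_factory=dict)


class DataDictionary:
    """In-memory catalog that rejects datasets with undocumented fields."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry, columns: list[str]) -> None:
        """Add an entry, or raise ValueError if any column lacks a description."""
        undocumented = [c for c in columns if not entry.fields.get(c, "").strip()]
        if undocumented:
            raise ValueError(
                f"undocumented fields in {entry.name}: {undocumented}"
            )
        self._entries[entry.name] = entry

    def lookup(self, name: str) -> DatasetEntry:
        """Return the entry for a dataset; raises KeyError if it is unknown."""
        return self._entries[name]


# Example usage with invented values.
catalog = DataDictionary()
clicks = DatasetEntry(
    name="web_clickstream",
    source_provider="internal web servers",
    location="warehouse.events.clickstream",
    fields={
        "user_id": "anonymized visitor identifier",
        "ts": "event time in UTC",
    },
)
catalog.register(clicks, columns=["user_id", "ts"])
print(catalog.lookup("web_clickstream").location)
```

If a load job calls register before it writes anything, then "throw the data in and document it later" fails loudly instead of quietly becoming the norm. A check like this supports the rules but does not replace the senior leader who insists on them.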

In closing: a solid data infrastructure smooths the road for data science efforts. Data scientists (and analysts, and anyone else) do their best and most efficient job when they have stable, ready access to data that is up-to-date, fit for purpose, and well-documented. To invest in such a data infrastructure is to invest in the long-term success of your firm’s data science activities.

Do you want to lay the groundwork for your company’s data science efforts? Contact me to get started.