Building Data Science Infrastructure

Based on a presentation by Caitlin Hudon – Lead Data Scientist at OnlineMedEd. Watch the full presentation here.

At #DSSATX (overall event recap here), we loved the super-practical step-by-step advice on setting up the very first data science infrastructure and team in an organization! Caitlin Hudon joined educational startup OnlineMedEd as their very first data scientist; read on for her process, advice, and strategies.

HIRING THE FIRST ONE

If you're getting ready to hire your first data scientist, you need three things:

You need data—make sure that you have data for someone to work with.
You need business questions—things that you want to get out of your data.
You need a specialist—early in your org, someone could be answering basic data questions but eventually you'll hit a point where you really want someone who specializes in data to be focusing on those questions.

Caitlin’s on-boarding was basically these two words: ADD VALUE. They have become her guiding principles for the entire data science department.

INFRASTRUCTURE

Build an opinionated data science infrastructure. Make decisions about the way data is handled. For example, if your data set has missing values, choosing to code them as missing, as null, or as zeros is an opinionated way to deal with data. Then, you build those opinions into the pipelines and scripts, based on business knowledge outside of the raw data.
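As a minimal sketch of what one such opinion might look like in a script, the snippet below treats a missing activity count as zero but leaves every other missing field as unknown. The field names and the policy itself are assumptions made for illustration, not details from the talk.

```python
from typing import Optional

# Hypothetical policy: which fields treat a missing value as 0 versus as unknown.
# These field names are illustrative, not taken from the talk.
ZERO_WHEN_MISSING = {"logins_last_30_days"}


def clean_value(field: str, raw: Optional[str]) -> Optional[float]:
    """Apply one documented opinion about missing values per field."""
    if raw is None or raw.strip() == "":
        # Opinion: a missing activity count means "no activity", so store 0.
        # Anything else stays None (unknown) rather than becoming a fake 0.
        return 0.0 if field in ZERO_WHEN_MISSING else None
    return float(raw)


row = {"logins_last_30_days": "", "exam_score": "", "minutes_watched": "42"}
cleaned = {field: clean_value(field, raw) for field, raw in row.items()}
print(cleaned)
```

Keeping the rule in one function means the opinion is written down once and applied the same way everywhere.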

Data Plumbing

Consider the infrastructure in two different parts: the data plumbing—getting the data, putting the data where it needs to go, and then building tools to help analyze the data—and the data team/organization.

All of this is really important: we want to make sure that we're getting the right data and we're getting it in the right way, serving it up for the right problem at the right time. The data plumbing includes:

Data collection mechanisms
Data pipelines
Databases and warehouses
EDA, analysis, and ML tools

Foundational Data Team & Organization

Documentation & knowledge capture: making sure that you're capturing any of the institutional knowledge that exists and documenting it
Building a plan to deliver on goals
Team structure, culture, and roles
Policies, politics, communication: how to interact with the rest of the org (how they access data, how they work with the data science team, how to add value for stakeholders)

THE FIRST WEEK

Primary goals:

Learning the business model: figuring out how the company makes money
Getting all of the access that comes with any sort of technical role
Talking to different stakeholders and starting to learn about what they care about (what data they had and what data they wanted)
Starting to learn the data and its lineage
Afterwards: documenting it all
Internal education: explaining the types of problems data science solves, the ways that we can work together, and a little bit about data science as a whole.

BOOK RECOMMENDATION: The First 90 Days suggests focusing on building momentum while laying a foundation. You need to build foundations to do things correctly in the long term, but you can't spend the entire time just doing that. Figure out how to add value right away.

The Data Science Hierarchy of Needs

For folks who want to build out data science infrastructure, the hierarchy of needs is a really good place to start. It helps get everyone on the same page and get buy-in.

This data science hierarchy of needs does a really good job of explaining the pieces that you need in order to get to the cool AI and deep learning at the very top of the pyramid. You have to start with a really good foundation in data, really good infrastructure. Using the hierarchy to explain to internal stakeholders why getting the data right first is so important can be an easy way to frame that discussion.

THE FIRST MONTH

Set up shop: set up some stack tools (Confluence, JIRA, Slack), a data dictionary, a query library, and some sandboxes and templates.

A data dictionary is a place to record information about your data. It could cover lineage (which database a field comes from), transformations, or where the raw data comes from that is used to create a new feature you care about. Basically, it's a place to answer questions that might come up about the data.

Creating one while you're onboarding produces a long-lasting artifact, and it is also a really good way to learn your database and which pieces are important.
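A data dictionary can live in a wiki page or a spreadsheet, but a small structured version shows the idea. The sketch below is one possible shape; the entry fields, column names, and sources are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DictionaryEntry:
    """One data dictionary record; every field name here is an assumption."""
    name: str
    description: str
    source: str                      # lineage: database and table it comes from
    transformation: str = "none"     # how the raw data was changed, if at all
    known_issues: List[str] = field(default_factory=list)


DATA_DICTIONARY = {
    entry.name: entry
    for entry in [
        DictionaryEntry(
            name="subscription_start",
            description="UTC timestamp when a paid subscription began",
            source="billing_db.subscriptions",
            transformation="converted from local time to UTC",
        ),
        DictionaryEntry(
            name="is_active",
            description="True if the user has an unexpired subscription",
            source="derived from subscription_start and subscription_end",
        ),
    ]
}


def describe(column: str) -> str:
    """Answer the common question: where does this column come from?"""
    entry = DATA_DICTIONARY[column]
    return f"{entry.name}: {entry.description} (source: {entry.source})"


print(describe("subscription_start"))
```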

Caitlin has a couple of blog entries about data dictionaries and query libraries if you want to read more.

A query library is a place where you can keep commonly used SQL queries. If you've written a SQL query once, chances are someone's going to ask you for that data again. Putting it all into a single repository is really helpful so that you know what you delivered the last time someone asked you for it.
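One lightweight way to keep such a library is a set of named, documented queries in a shared repository. The sketch below assumes a hypothetical `subscriptions` table and uses an in-memory SQLite database with made-up rows purely so the example runs on its own.

```python
import sqlite3

# A tiny query library: named, documented SQL kept in one place.
# Table and column names are hypothetical.
QUERY_LIBRARY = {
    "active_subscribers_by_plan": {
        "description": "Count of active subscribers per plan",
        "sql": """
            SELECT plan, COUNT(*) AS subscribers
            FROM subscriptions
            WHERE status = 'active'
            GROUP BY plan
            ORDER BY plan
        """,
    },
}


def run_query(conn: sqlite3.Connection, name: str) -> list:
    """Look up a saved query by name and run it."""
    return conn.execute(QUERY_LIBRARY[name]["sql"]).fetchall()


# Demo against an in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscriptions (user_id INTEGER, plan TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO subscriptions VALUES (?, ?, ?)",
    [(1, "monthly", "active"), (2, "annual", "active"), (3, "monthly", "cancelled")],
)
result = run_query(conn, "active_subscribers_by_plan")
print(result)
```

Because the query has a name and a description, the next request for the same data can reuse exactly what was delivered before.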

Sandbox projects are a way to take the scaffolding thinking out of projects so you don't have to think about the way that you organize things. Settle on one way to organize all of your data science projects, and then you can just copy that layout each time. Everything is organized the same way. This is also really helpful when you start working with other people.
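A project template can be as simple as a script that stamps out the same folders every time. The layout below is an assumed example, not the structure used at OnlineMedEd.

```python
from pathlib import Path
import tempfile

# One agreed-upon layout for every project; the folder names are an assumption.
TEMPLATE_DIRS = ["data/raw", "data/processed", "notebooks", "src", "reports"]


def create_project(root: Path, name: str) -> Path:
    """Create a new project folder with the standard layout."""
    project = root / name
    for sub in TEMPLATE_DIRS:
        (project / sub).mkdir(parents=True, exist_ok=True)
    (project / "README.md").write_text(f"# {name}\n\nQuestion, data sources, owner.\n")
    return project


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp), "churn_analysis")
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))
    print(created)
```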

THE FIRST 6 MONTHS: DECISIONS

You’ll likely face a bunch of decisions to make about the ways to work with data and work with the rest of the business.

Data Pipeline

As we worked with other departments, we started to think about our data pipeline. Take the query library and turn it into something more robust and longer-lasting: the data pipeline. Some of the earlier work gets reused in the larger infrastructure to move data around in a good, opinionated way.

You don’t have to set up everything on your own: for data collection, OnlineMedEd decided on a tool called Segment. Segment can power a few things through connections to tools like HubSpot; combined with Mode Analytics, it becomes even more powerful. Mode Analytics lets you work in R, SQL, and Python all in one place and then share results quickly with many people.

Shape the data pipelines by focusing on data pain points. The company had shifted billing systems three times, and if you've ever worked with data across different billing systems, you know it gets really messy. Build a process to combine all of the data so there’s a clear picture looking forwards and historically at all of the data.
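To make the billing example concrete, here is a minimal sketch of mapping records from two billing systems into one shared shape. The record layouts, field names, and units are invented for illustration; real exports will differ.

```python
from datetime import date

# Made-up exports from two billing systems with different shapes.
legacy_rows = [{"cust": "17", "amt_cents": 4900, "paid": "2018-03-01"}]
current_rows = [{"customer_id": 17, "amount": 49.0, "paid_on": "2019-03-01"}]


def from_legacy(row: dict) -> dict:
    """Legacy system: string ids and amounts in cents (assumed)."""
    return {
        "customer_id": int(row["cust"]),
        "amount_usd": row["amt_cents"] / 100,
        "paid_on": date.fromisoformat(row["paid"]),
        "source": "legacy",
    }


def from_current(row: dict) -> dict:
    """Current system: integer ids and amounts in dollars (assumed)."""
    return {
        "customer_id": row["customer_id"],
        "amount_usd": float(row["amount"]),
        "paid_on": date.fromisoformat(row["paid_on"]),
        "source": "current",
    }


payments = [from_legacy(r) for r in legacy_rows] + [from_current(r) for r in current_rows]
payments.sort(key=lambda p: p["paid_on"])
total = sum(p["amount_usd"] for p in payments)
print(len(payments), total)
```

Keeping a `source` field on every row makes it possible to trace a number back to the system it came from when something looks off.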

Knowledge Documentation

Building out the knowledge documentation is an important part of the first six months. As an example, create a subscription explainer: the company sells subscriptions, but over time, the types of subscriptions change. A product explainer would serve the same purpose if you’ve sold various products at various points in time. Create a promotion code explainer and a time zone explainer too, and then build that knowledge into the data pipeline. Finally, document known data issues, as even new data can contain bugs.
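As one example of building an explainer into the pipeline, the sketch below encodes a time zone rule in code. It assumes, purely for illustration, that a legacy system logged naive timestamps at a fixed UTC-6 offset and that the pipeline stores everything in UTC.

```python
from datetime import datetime, timezone, timedelta

# Assumption for illustration: the legacy system logged naive timestamps at a
# fixed UTC-6 offset, and the pipeline stores everything in UTC.
LEGACY_OFFSET = timezone(timedelta(hours=-6))


def to_utc(naive_local: datetime) -> datetime:
    """Attach the documented source offset, then convert to UTC."""
    return naive_local.replace(tzinfo=LEGACY_OFFSET).astimezone(timezone.utc)


signup = to_utc(datetime(2019, 1, 15, 20, 30))
print(signup.isoformat())
```

A fixed offset ignores daylight saving time, so a real pipeline would likely need a named time zone instead; that limitation is exactly the kind of detail a time zone explainer should record.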

Culture

As a new team, celebrate the wins as you’re building the team. If roles are being cross trained from other roles within the company, education is really important. OnlineMedEd uses a book club to hold people accountable and motivate people.

Intake Form

An intake form can be a triage mechanism for a small data team in an org that's growing, where the thirst for data is growing too: as people see more, they want more. The form should be really opinionated; use conditional logic to make sure that the right requests go to the right people. If multiple people want the same piece of data or analysis, you can combine those requests before responding.
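The routing and de-duplication described above could be sketched roughly as follows. The request types, owners, and topics are hypothetical, and a real intake form would usually be built in a form tool rather than in code.

```python
from collections import defaultdict

# Hypothetical routing rules: request type -> who handles it.
ROUTES = {"dashboard": "analyst", "ad_hoc_query": "analyst", "model": "data_scientist"}

requests = [
    {"requester": "marketing", "type": "ad_hoc_query", "topic": "promo code usage"},
    {"requester": "finance", "type": "ad_hoc_query", "topic": "promo code usage"},
    {"requester": "product", "type": "model", "topic": "churn prediction"},
]


def triage(items: list) -> dict:
    """Group requests that ask for the same thing and route each group once."""
    grouped = defaultdict(list)
    for item in items:
        # Conditional logic: unknown request types go to a manual review queue.
        owner = ROUTES.get(item["type"], "needs_review")
        grouped[(owner, item["topic"])].append(item["requester"])
    return dict(grouped)


queue = triage(requests)
print(queue)
```

Here the marketing and finance requests collapse into a single work item with two requesters, so the analysis is done once and shared with both.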

Roadmapping

A lot of the work that gets done by data teams can be invisible until you roll something out. People don't really know what you're doing, so create a place that shows your roadmap, intake form and the links to the things the team is working on. Creating a ton of visibility into your work makes data science accessible to the whole organization.

Pick a good first project

Talk to stakeholders to figure out what a small win could be for your data org.
Figure out what data could be most helpful that isn't being collected; what problems could be solved pretty easily: LTV, churn, or acquisition are all problems that are common to a lot of businesses and a good place to start
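For a sense of scale, a first pass at churn and LTV can be very small. The sketch below uses a common simplified LTV approximation (average monthly revenue per customer divided by monthly churn rate); this formula and the numbers are assumptions for illustration, not the method described in the talk.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Share of customers at the start of a period who cancelled during it."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


def simple_ltv(avg_monthly_revenue: float, monthly_churn: float) -> float:
    """Rough LTV: monthly revenue per customer divided by monthly churn."""
    if monthly_churn <= 0:
        raise ValueError("monthly_churn must be positive")
    return avg_monthly_revenue / monthly_churn


# Made-up numbers for illustration only.
churn = churn_rate(customers_at_start=200, customers_lost=10)
ltv = simple_ltv(avg_monthly_revenue=30.0, monthly_churn=churn)
print(churn, ltv)
```

Even a rough version like this gives stakeholders a shared definition to react to, which is often the real value of a first project.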

Adding value

Continue to live and die by this. Make sure that anything the team is doing is adding value to the org, whether it's laying the foundations to be able to do work later on, or helping people out with small data requests as you’re building larger projects. Leverage data to get as much value for the company as you possibly can.

Asking for help

I've found Twitter to be really helpful, and we're all in a Slack group now [for DSS ATX attendees], so you can ask there.

Join us on Twitter @DataSciSalon; follow Caitlin Hudon @beeonaposy.
