What is the ML Stack?

This is part III of a series breaking down each phase of the Data Maturity Journey, a framework we created to help data teams architect a practical customer data stack for every phase of their company’s journey to data maturity.

Before we get to the hot topic of machine learning, let’s do a quick recap of where we are in the data maturity journey. In the Starter Stack, you solved the point-to-point integration problem by implementing a unified, event-based integration layer. In the Growth Stack, your data needs became a little more sophisticated—to enable downstream teams (and management) to answer harder questions and act on all of your customer data, you needed to centralize both clickstream data and relational data to build a full picture of the customer and their journey. To solve these challenges you implemented a cloud data warehouse as the single source of truth for all customer data, then used reverse ETL pipelines to activate that data.

At this point, life is good. All of your customer data is collected and delivered across your stack: it flows to cloud tools in real time via event streaming and lands in your data warehouse in batches (via both event stream batch loads and ETL pipelines for relational data). This means you can build a full view of your customer and push complete profiles, computed traits, and even audiences from your warehouse back to your stack via reverse ETL.
Yes, life is good. But could it be even better? Currently, all of the insights and activation your work has enabled are based on historical data—what users have done in the past. The more teams understand about the past, though, the more they think about how to mitigate problems in the future.
Customer churn is a great example. With the right data you can easily see which customers have churned and launch win-back emails, but wouldn’t it be better to engage those users before they churn? This type of magic is the next frontier of optimization, and it requires moving from historical analytics to predictive analytics. Predictive analytics enable you to determine the likelihood of future outcomes for users.
Predicting future behavior and acting on it can be extremely valuable for a business. Consider the example above. When you leverage behavioral and transactional data signals to identify customers who are likely to churn and proactively incentivize their next interaction, you’re more likely to prevent them from churning. The same is true for new users: if you can predict their potential lifetime value, you can customize offers accordingly.
This is a big step for your data team and your data stack because it introduces the need for predictive modeling, and predictive modeling requires additional tooling. The good news is that if your data is in order, which it should be if you implemented the Growth Stack, you already have a running start.


Machine learning, ML ops and keeping it simple

Depending on the factors you’re considering in your analysis and the types of data you’re using, the most scalable way to answer predictive questions is to use machine learning. Machine learning is a subset of artificial intelligence, sitting at the junction of data engineering and computer science, that aims to make predictions through the use of statistical methods.

It’s worth noting from the outset that not all predictive problems have to be solved with formal machine learning, though. A deterministic model (e.g., if a customer fits these characteristics, tag them as likely to churn) or multivariate linear regression analysis can do the job in many cases (hat tip to SQL and IF statements!). Said another way, machine learning isn’t a magic wand you can point at any problem to conjure up a game-changing answer. In fact, we would say that intentionally starting out with basic analyses and models is the best first step into predictive analytics—another good reminder of the KISS principle when building out your data stack and workflows.
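To make that concrete, here’s a minimal sketch of a deterministic churn rule in Python. The column names (days_since_last_login, support_tickets_30d) and the thresholds are hypothetical stand-ins for whatever signals matter in your business; the point is that a few IF-style conditions can produce a useful “likely to churn” tag before any machine learning is involved.

```python
# A minimal sketch of a deterministic "likely to churn" rule.
# The columns and thresholds below are hypothetical examples.
import pandas as pd

customers = pd.DataFrame({
    "user_id": [1, 2, 3],
    "days_since_last_login": [3, 45, 20],
    "support_tickets_30d": [0, 4, 1],
})

# Tag a customer as a churn risk if they have been inactive for 30+ days
# or have filed several support tickets in the last month.
customers["churn_risk"] = (
    (customers["days_since_last_login"] >= 30)
    | (customers["support_tickets_30d"] >= 3)
)

print(customers[["user_id", "churn_risk"]])
```

The same logic translates directly to a CASE statement in warehouse SQL, which is usually where a rule like this would live.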

If you are familiar with machine learning, you know that there is also an entire engineering discipline focused on the tools and workflows required to do the work (this is often called ML ops or ML engineering). ML engineering is different from data engineering, but the two disciplines must operate in tandem because the heart of good predictive analytics is good data. This is why many data scientists play the role of part-time data engineer.

One other important thing to note: machine learning is a wide and deep field. The most advanced companies build fully custom tooling to support their various data science workflows, but for many companies complicated data science workflows are overkill. The good news is that with modern tooling, you don’t have to implement a massive amount of ML-focused infrastructure to start realizing the value of predictive modeling.

To that end, this isn’t a post about ML ops, workflows, or various types of modeling. Our focus will be on showing you how many companies take their first steps into predictive analytics by making simple additions to their toolset and leveraging existing infrastructure (the Growth Stack) to operationalize results from models.

Without further ado, let’s dive into the stack itself.

What is the ML Stack?

The ML Stack introduces a data flow that enables teams to work on predictive analytics. For companies running the Growth Stack, there are two fundamental challenges:

Limitations of warehouse-based analysis: data teams hit ceilings in the warehouse when they 1) can’t perform their desired analysis in SQL and 2) need to work with unstructured data that can’t be stored in the warehouse (more on this below)

Operationalizing model outputs: when models do produce outputs, it can be technically difficult to deliver them to tools where they can be used to optimize the customer journey

The ML Stack solves these problems by:

Introducing a data lake for unstructured data

Introducing a modeling/analysis toolset

Leveraging existing warehouse and reverse ETL infrastructure to deliver model outputs to tools across the stack

As you can see in the architectural diagram below, the ML Stack feeds data (inputs) from the data warehouse and data lake into an analysis and modeling toolset. The model runs and produces outputs (or features), which are pushed to the warehouse in the form of a materialized table. That table can then be used in the standard reverse ETL flow to send those features to the rest of the stack.
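Here’s a rough sketch of that flow in Python, under some stated assumptions: SQLite stands in for your cloud data warehouse, scikit-learn stands in for the modeling toolset, and the table and column names (user_activity, user_churn_scores, sessions_30d, and so on) are hypothetical. The shape of the flow is what matters: read inputs from the warehouse, run a model, and materialize the outputs as a table your reverse ETL pipelines can sync.

```python
# A minimal sketch of the ML Stack data flow. SQLite stands in for the
# warehouse, and all table/column names here are hypothetical examples.
import pandas as pd
from sqlalchemy import create_engine
from sklearn.linear_model import LogisticRegression

warehouse = create_engine("sqlite:///:memory:")  # stand-in for Snowflake, BigQuery, etc.

# Seed a toy input table; in practice this data already lives in your warehouse.
pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "sessions_30d": [12, 1, 7, 0],
    "orders_90d": [3, 0, 1, 0],
    "churned": [0, 1, 0, 1],
}).to_sql("user_activity", warehouse, index=False)

# 1. Inputs: read training data from the warehouse.
inputs = pd.read_sql("SELECT * FROM user_activity", warehouse)
X, y = inputs[["sessions_30d", "orders_90d"]], inputs["churned"]

# 2. Modeling: fit a simple churn-propensity model.
model = LogisticRegression().fit(X, y)

# 3. Outputs: write predictions back as a materialized table for reverse ETL.
outputs = inputs[["user_id"]].copy()
outputs["churn_probability"] = model.predict_proba(X)[:, 1]
outputs.to_sql("user_churn_scores", warehouse, index=False, if_exists="replace")
```

From there, nothing new is required: the output table is just another warehouse table, so your existing reverse ETL pipelines can deliver the scores to email, ads, and support tools.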

When do you need to implement the ML Stack?

You’d be surprised at how often machine learning gets thrown at questions that can be answered with basic SQL—something we heard loud and clear from one of the people who helped build the data science practice at Airbnb. So, when should you actually use machine learning?

Symptoms that indicate you need the ML Stack

The best indication that you need to add ML tooling to your stack is the desire to make data-driven predictions about the future. Here are a few example symptoms:

Your teams have built muscle in understanding historical trends and events and why they happened, and now want to act proactively to influence those events before they occur.

This is impossible because teams running the Growth Stack can’t anticipate the likelihood of those events happening; they can only analyze them in retrospect.

You’ve hit the limits of SQL-on-warehouse analysis and are actively exploring how to apply statistical analysis to your data.

You want to leverage new kinds of data that can’t be managed in your cloud data warehouse. This is often due to new tooling or processes that have been implemented (e.g., standing up a call center; see the sketch below), or a desire to leverage existing data that hasn’t previously been used in analysis.
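As an illustration of that last symptom, here’s a minimal sketch of turning unstructured data into something a model can use. The call-center transcripts, field names, and keyword list are all hypothetical; in practice the raw objects would sit in a data lake (e.g., S3 or GCS) rather than in the warehouse.

```python
# A minimal sketch: derive a structured feature from hypothetical call-center
# transcripts that would live in a data lake rather than the warehouse.
import json

# Stand-ins for raw objects read from the data lake; the fields are hypothetical.
raw_transcripts = [
    '{"user_id": 1, "transcript": "I want to cancel my subscription"}',
    '{"user_id": 2, "transcript": "Thanks, the new feature works great"}',
]

CHURN_KEYWORDS = {"cancel", "refund", "frustrated"}

features = []
for blob in raw_transcripts:
    record = json.loads(blob)
    words = set(record["transcript"].lower().split())
    features.append({
        "user_id": record["user_id"],
        "churn_keyword_hits": len(words & CHURN_KEYWORDS),
    })

# These rows are now structured and can be loaded into the warehouse
# alongside your other model inputs.
print(features)
```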

What your company or team might look like

As with every other phase of the data maturity journey, your stack isn’t about the size of your company, but your data needs. That said, stepping into the world of machine learning generally requires both a minimum threshold of data as well as dedicated resources to work on predictive projects. This means it’s often larger mid-market and enterprise companies that implement the ML Stack. But smaller companies are increasingly collecting huge amounts of data, and the ML tooling itself is getting easier to use (more on this in the tooling section below).

If your company is on the smaller end of the spectrum:

You likely have a business that produces huge amounts of data (think eCommerce, web3, media, gaming, etc.) but can run with a smaller technical team

Your business model stands…
