What is a Machine Learning (ML) Pipeline?

Posted by Jim Range on April 3, 2020

[Figure: Machine Learning (ML) Pipeline]

A machine learning (ML) pipeline can be thought of as a way of breaking a machine learning solution down into a sequence of stages. The core stages are data access, data pre-processing, ML model development, ML model deployment, and ML model monitoring and support.

One important goal of the ML pipeline is to create manageable components that can be maintained separately, and to automate portions of the pipeline so that each stage is simpler to operate.

On the surface this may look like a linear process, but it is more of a system than a process. The intent of the ML pipeline is to document and automate each stage as much as possible so that the system can be refreshed with new data periodically in batches, or even continuously.

For example, over time the pipeline can periodically ingest new training data, process that data as defined by the pipeline, use it to retrain the model in an automated or interactive manner with various statistical and technical controls, and then validate, test, and deploy the new model.
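To make the idea concrete, here is a minimal sketch (not any particular framework, and purely illustrative) of a pipeline that can be re-run end to end whenever fresh data arrives. It assumes scikit-learn and pandas, a CSV data source, and features that are numeric after pre-processing; the function name and arguments are hypothetical.

```python
# A minimal, illustrative sketch of a re-runnable pipeline whose stage boundaries
# mirror the ones described above. Not a production implementation.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def run_pipeline(csv_path: str, label_column: str) -> None:
    # Data access: ingest the latest batch of raw data (path is a placeholder).
    raw = pd.read_csv(csv_path)

    # Data pre-processing and model development are chained so the same
    # transformations are applied every time the pipeline is refreshed.
    features = raw.drop(columns=[label_column])
    labels = raw[label_column]
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=42
    )

    model = Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(X_train, y_train)

    # Validation/testing gate before any deployment step would run.
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"Held-out accuracy: {accuracy:.3f}")
```

Because every stage lives in one re-runnable function (or, more realistically, one orchestrated job per stage), the whole thing can be scheduled to refresh on new data with the same controls each time.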

Business Case

Some may disagree that the business case definition is part of the pipeline, but I believe it should be, because it is a fundamental driver for the vision of why the solution is being created. Therefore, the first and arguably most important part of a machine learning pipeline is identifying a clearly defined reason why a machine learning solution would add value to the business.

It is no secret that immense value is hidden away in data. But if you are going to invest in a machine learning solution, it is essential to understand what problem you are trying to solve and to have a clear vision of how a machine learning solution can solve that problem better than the other available options.

This doesn't mean you need to have all of the answers up front. You can choose to do exploratory research via a proof-of-concept project that can give you a better idea of what value is hidden away in your data and how machine learning might help make use of it.

Data Access and Authorization

Before a machine learning solution can ingest data, it is necessary to review the contents of the data and ensure that accessing it will not violate any regulatory compliance laws or company policies. There may be hurdles to clear regarding how to gain access to the data, who is authorized to access it, and how you will protect it from a risk management perspective.

Decisions will need to be made regarding how to extract the data, who will have access to it, and, if you plan to store it, how and where it will be stored.

Data Extraction - Collecting data from various sources to ready it for a data lake or direct use in the next phase of a machine learning pipeline.

Methods of Data Extraction - Depending on the source, the data may be structured or unstructured. It may be accessible via database queries, text files, or a network API. Often custom extraction tools (such as log data agents or web scrapers) need to be created and supported for cases where new data will flow into the system over time.
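As a rough illustration, here is a hedged sketch of two common extraction paths: querying a relational database and pulling records from a network API, with the results landed unmodified for later use. The database path, table name, URL, and destination file are placeholders, not real systems.

```python
# Illustrative extraction helpers; connection details and endpoints are placeholders.
import json
import sqlite3

import requests


def extract_from_database(db_path: str) -> list[tuple]:
    """Pull structured rows with a SQL query."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute("SELECT * FROM events").fetchall()


def extract_from_api(url: str) -> list[dict]:
    """Pull semi-structured records from a network API."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()


def land_in_data_lake(records: list, destination: str) -> None:
    """Write extracted records, unmodified, to a landing area (e.g., a data lake path)."""
    with open(destination, "w") as fh:
        json.dump(records, fh, default=str)
```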

Data Lake - A location to store useful data from disparate sources to streamline data mining and exploration of the data by various stakeholders. In contrast to a data mart or data warehouse which is structured and pre-processed, a data lake may contain unprocessed data of various formats and structures, and can even be in the data's original format. The purpose of the data lake is to encourage the use of the data by making it more readily available to stakeholders where regulatory compliance and organization policy permits. A data lake can also lower costs for an organization where disconnected business units can share the cost of data extraction.

Data Warehouse - A structured and pre-processed data store, such as a SQL or NoSQL database, that is more organized than a data lake but less so than a data mart.

Data Mart - Structured and filtered data that serves a specific purpose and may even have web applications or business intelligence tools as a frontend to enable easy access for non-technical users. A data mart might, for example, contain content that provides insight into customer behavior, such as product browse, search, and purchase statistics. The term democratization of data is often associated with data marts because of the heavy filtering and cleaning of the data and the emphasis on improving ease of access to it. A data mart makes it easy to use and understand the data correctly and difficult to make mistakes when using it. This has also led to the term citizen data scientist: someone without formal data science training who can use some of the tools a data scientist might use to make better business decisions.

Data Pre-Processing

After gaining access to data, even structured data, you will often find it is messy, has missing values, is corrupt, or is not in the ideal format for use in a machine learning solution. That is why the data pre-processing stage of a machine learning pipeline is a critical component of a successful machine learning solution.
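As an illustration, here is a minimal pre-processing sketch using pandas. The column names (target, category, timestamp) and the specific clean-up choices are hypothetical and will differ for your data.

```python
# A minimal sketch of typical clean-up steps on a pandas DataFrame:
# dropping unusable rows, filling missing values, and normalizing formats.
import numpy as np
import pandas as pd


def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # Drop rows that are unusable, e.g., missing the label (column name is a placeholder).
    df = df.dropna(subset=["target"])

    # Fill remaining missing numeric values with the column median.
    numeric_cols = df.select_dtypes(include=np.number).columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

    # Normalize inconsistent text values and fix types (columns are placeholders).
    df["category"] = df["category"].str.strip().str.lower()
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")

    return df
```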

Machine Learning (ML) Model Development

After gaining access to properly pre-processed data it is possible to begin ML model development. The primary sub-stages here include model identification, training, validation and testing.

Model Identification

The first thing to do when developing a machine learning model is not to waste time on detailed plans, but instead to try out various types of models that might be suitable for your purpose. Start with simple and efficient options and do some quick testing to see what might work. Try several options and compare them to get a baseline for where you might like to proceed.
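For example, a quick baselining pass might look like the hedged scikit-learn sketch below, where X and y are assumed to be the already pre-processed features and labels.

```python
# Try a few simple model families with default-ish settings and compare
# cross-validated scores before committing to one. Purely illustrative.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier

candidates = {
    "baseline (majority class)": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100),
}

# X and y are assumed to be the pre-processed features and labels.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```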

Model Training

After selecting a model, you will begin the process of training it. For supervised models, it is important to separate out subsets of training, validation, and test data (split randomly, evenly over time, or in whatever way can be statistically justified for your model so that each subset comes from the same distribution). The model is then fit to the training data, trying out candidate hyper-parameter settings along the way.
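A minimal sketch of a random three-way split is shown below; the 60/20/20 proportions and the purely random split are illustrative assumptions, and X and y again stand for the pre-processed features and labels.

```python
# Split once into train vs. the rest, then split the rest into validation and test.
from sklearn.model_selection import train_test_split

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)
```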

Model Validation

After a good set of hyper-parameters has been identified for the model, the validation data can be used to determine whether the model is over- or under-fitting by observing its variance and bias. It is okay to iterate between the validation and training data somewhat, but too many iterations may justify obtaining additional validation data, separate from the training and test data, to avoid implicitly tuning the hyper-parameters to the validation set.
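One simple way to apply this, sketched below with purely illustrative thresholds, is to compare the model's score on the training data against its score on the validation data: a large gap suggests high variance (overfitting), while low scores on both suggest high bias (underfitting).

```python
# Assumes `model` has been fit on (X_train, y_train) and is a scikit-learn estimator.
train_score = model.score(X_train, y_train)
val_score = model.score(X_val, y_val)

print(f"train: {train_score:.3f}  validation: {val_score:.3f}")
if train_score - val_score > 0.10:  # illustrative gap threshold
    print("Large gap: likely overfitting; consider regularization or more data.")
elif val_score < 0.60:  # illustrative floor
    print("Both scores low: likely underfitting; consider a more expressive model.")
```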

Model Testing

After identifying hyper-parameters that you feel are a good fit for the model, the test data can be used to get a better idea of how well your model will perform on unseen data. Keep in mind that if, after using the test data, you iterate back to make changes to the model and then test again with the same data, you will be implicitly tuning the model to the test data, which means you may end up overfitting.

Machine Learning (ML) Model Deployment

Model deployment includes the actions necessary to make the model and its related application capabilities available for use at scale in what is referred to as a production environment.

Source Control and Deployment Automation

It is essential that a source code repository such as git is used throughout the development and deployment of the ML model. This is an important component of continuous integration and continuous delivery.

An effective way of deploying a model to production is to leverage a continuous integration and continuous delivery (CI/CD) solution that can automate validation and testing of the entire ML system being deployed. This can simplify the manual deployment process down to whatever your organization's change management policy dictates about deciding when to deploy to production, and in some cases that decision too can be systematized and automated.
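As an example of what such automation might check, the sketch below shows a pytest-style gate that a CI/CD job could run before promoting a model. The artifact path, holdout dataset, label column, and accuracy threshold are placeholders for whatever your change management policy requires.

```python
# A hedged sketch of an automated release gate a CI/CD job could execute.
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

MINIMUM_ACCURACY = 0.85  # illustrative release threshold


def test_candidate_model_meets_release_bar():
    model = joblib.load("artifacts/candidate_model.joblib")  # placeholder path
    holdout = pd.read_csv("data/holdout.csv")                # placeholder dataset
    predictions = model.predict(holdout.drop(columns=["label"]))
    assert accuracy_score(holdout["label"], predictions) >= MINIMUM_ACCURACY
```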

Configuration Management

Separate configuration management from source code repositories and ensure credentials are stored securely.
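One common way to follow this guidance, sketched below with placeholder variable names, is to read credentials and environment-specific settings from environment variables (or a secrets manager) at runtime rather than committing them to the repository.

```python
# Keep only the *names* of settings in code; the values are injected by the
# deployment environment. Variable names here are placeholders.
import os

DATABASE_URL = os.environ["DATABASE_URL"]          # required; supplied at deploy time
MODEL_BUCKET = os.environ.get("MODEL_BUCKET", "")  # optional, with a safe default
```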

Change Management

The deployment of the ML model to production should be done via a formally planned process that automates as much as possible and minimizes the risk of a negative impact on users and systems that depend on the ML solution.

Machine Learning (ML) Model Monitoring

After deploying an ML model to production, it is necessary to monitor its performance. This includes monitoring the accuracy of the model's output over time, detecting the rate of unexpected inputs to the model, and tracking various other custom statistics that can reliably indicate when the model may need to be updated.
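A hedged sketch of what such monitoring hooks might look like is below; the drift statistic, log event names, and thresholds are purely illustrative, and a real system would emit these signals to whatever metrics platform you already operate.

```python
# Illustrative runtime checks around a deployed model: flag inputs the model was
# not trained to handle and track a rough drift signal on the features.
import logging
import numpy as np

logger = logging.getLogger("model_monitoring")


def monitor_prediction(features: np.ndarray, training_mean: np.ndarray) -> None:
    # Rate of unexpected inputs: e.g., missing or non-finite feature values.
    if not np.isfinite(features).all():
        logger.warning("unexpected_input", extra={"reason": "non-finite feature value"})

    # Very rough drift signal: how far this input sits from the training mean.
    drift = float(np.abs(features - training_mean).mean())
    if drift > 3.0:  # illustrative threshold
        logger.warning("possible_input_drift", extra={"drift_score": drift})
```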