Machine Learning (ML) is being applied successfully across many industries and by all kinds of organizations, from start-ups to large corporations. However, many encounter unexpected problems when trying to apply ML to their use case. In this post, we are going to see what it takes to develop a successful ML product, and how Machine Learning Operations (MLOps) can help.
Machine Learning in real production scenarios
A quote widely circulated in ML circles goes, loosely: “More than [scary number] percent of ML projects never reach production”. In other words, an organization has invested money and time developing a proof-of-concept application of ML, but failed to exploit it in a real product or application. While many put that scary number as high as 80%, a well-known Gartner study estimates it at around 50%.
So what causes so many ML projects to fail? Several factors are involved, but - contrary to a common misconception - a lack of sophisticated modeling techniques is not usually one of them. These days it is relatively easy for data professionals to develop ML models using well-known techniques, for which training material and free, high-quality, open-source tools are readily available. In many cases (though not all), with a bit of coding and a few months of practice, anybody with a computer can learn to train basic ML models that can in principle provide real value. There are even several sophisticated and highly successful AutoML tools that do not require in-depth ML knowledge.
The barrier to entry, in other words, is not very high. This is a monumental achievement for which the ML community and the open-source movement deserve a lot of credit.
Then what is the issue? Training an ML model is only a small part of what it takes to build a real ML product or application - one that somebody can actually use to achieve something.
Factors for a successful ML project
Several objectives determine whether an ML project succeeds:
- Solve the right problem: you need to solve a problem that is worth solving, for which there is sustained interest, allowing your organization to extract value according to its mission. Often, pinpointing the right problem requires iteration: you start with a quick proof of value, then iterate based on the feedback.
- Collect enough good data: this is the most common pitfall. ML models are data-hungry, so you need enough data covering the domain of your application, or a clear path to collecting such data. For example, a labeled dataset for computer vision typically contains tens of thousands of examples (although sometimes you can get away with a thousand or fewer by using transfer learning).
- Treat ML as applied science: developing ML models is not as predictable as developing software or other products. Instead, ML models require controlled experimentation, just like applied science. Therefore, plan conservatively for the unexpected, but at the same time strive to maintain a good pace in clearing milestones. A solution that hits the requirements and lands in the expected time frame is much more valuable than a perfect solution that lands a year late. You don’t need to get it perfect before you can start providing value.
- Focus on production: a model that is 80% accurate and can be put in production is infinitely more valuable than a model that is 99% accurate but cannot be used in production. For example, you cannot use features that are not available at inference time in production. Or, you cannot use a technique that is too slow or too demanding for the computation budget you have in production.
- Assemble the right team: ML development is a team sport, as we will see more in detail later. You need a team that can work well together, with clear responsibilities and priorities.
- Plan for maintenance: a deployed ML model needs monitoring and maintenance, including in many circumstances periodic retraining. Develop with maintainability in mind. You need reproducibility of experiments, tracking of artifacts, and training automation so you can retrain a model at any time with the least amount of effort possible.
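The maintainability goals above - reproducibility, artifact tracking, training automation - can be sketched in a few lines. The tracker below is a hypothetical stand-in for a real tool such as MLflow or DVC: the training step is faked, but the pattern (fix the seed, record the exact config next to the metrics, derive a deterministic run id) is the core idea.

```python
import hashlib
import json
import random

def run_experiment(config: dict) -> dict:
    """Train with a fixed seed and record everything needed to reproduce the run.
    The 'model' here is a stand-in: we just draw a fake accuracy score."""
    random.seed(config["seed"])                      # reproducibility: fix all seeds
    accuracy = round(random.uniform(0.7, 0.9), 4)    # placeholder for real training
    # Track the exact configuration alongside the result as one artifact record.
    record = {"config": config, "metrics": {"accuracy": accuracy}}
    record["run_id"] = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()[:12]                               # deterministic id from the config
    return record

config = {"seed": 42, "learning_rate": 0.01, "epochs": 10}
first = run_experiment(config)
second = run_experiment(config)   # retraining later with the same config...
assert first == second            # ...yields exactly the same record
```

Because the config is stored with every run, "retrain model X from three months ago" becomes a lookup plus a function call rather than an archaeology project.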
What is MLOps
Machine Learning Operations (MLOps) is an ensemble of techniques, tools and best practices that help achieve the objectives delineated in the previous section. With MLOps you can lay out a development process for ML that is efficient, frictionless, transparent, and reproducible. MLOps cannot of course guarantee the success of an ML project - only you can - but it is an essential tool that reduces effort, development time and cost across the lifespan of the project, well beyond the first deployment.
In many ways, MLOps is for machine learning what DevOps is for software development. However, there are important differences. While DevOps deals only with software and its artifacts, MLOps must deal with three components: the software to train the model (version control, dependencies, containerization…), the data (data version control, feature management…), and the model itself (experiment tracking, model repository, deployment…).
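A concrete way to picture the three components is to version each one explicitly. This is a hypothetical sketch - the byte strings stand in for real artifacts, and in practice a git hash, a data-versioning tool and a model registry would supply the fingerprints - but it shows the record MLOps needs to keep:

```python
import hashlib
import json

def fingerprint(artifact: bytes) -> str:
    """Short content hash identifying one version of an artifact."""
    return hashlib.sha256(artifact).hexdigest()[:10]

# Hypothetical stand-ins for the three things MLOps must track.
training_code = b"def train(x): ..."      # in practice: the git commit hash
dataset = b"f1,f2,label\n0.1,0.2,1\n"     # in practice: a data-versioning tool
model_weights = b"\x00\x01\x02"           # in practice: a model registry entry

# One record ties the three versions together, so any deployed model
# can be traced back to the exact code and data that produced it.
release = {
    "code": fingerprint(training_code),
    "data": fingerprint(dataset),
    "model": fingerprint(model_weights),
}
print(json.dumps(release))
```

DevOps tooling covers the first entry out of the box; the other two are what make MLOps a distinct discipline.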
Moreover, while the development of software is ideally linear (see the figure above), the development of ML is iterative by nature. You experiment with different candidate solutions to the same problem and then select the best-performing one. Finally, to test ML models you need not only unit tests and integration tests, just like for software, but also data testing and validation, as well as model performance evaluation. Consequently, the MLOps process tends to be more complex than the DevOps process and typically involves more people and more tools.
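To make the data-testing idea concrete, here is a minimal, hand-rolled sketch of the kind of checks that dedicated tools (e.g. Great Expectations) automate. The column names and rules are illustrative, not from any particular project:

```python
def validate_rows(rows: list[dict]) -> list[str]:
    """Return a list of data-quality problems; an empty list means the batch passes."""
    problems = []
    required = {"age", "income", "label"}
    for i, row in enumerate(rows):
        missing = required - row.keys()
        if missing:                                # schema check
            problems.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        if not (0 <= row["age"] <= 120):           # range check
            problems.append(f"row {i}: age out of range ({row['age']})")
        if row["label"] not in (0, 1):             # label sanity check
            problems.append(f"row {i}: unexpected label {row['label']}")
    return problems

good_batch = [{"age": 34, "income": 52000, "label": 1}]
bad_batch = [{"age": -3, "income": 1000, "label": 2}]
assert validate_rows(good_batch) == []
assert len(validate_rows(bad_batch)) == 2
```

Run before every training job, a gate like this catches broken data ingestion early, before it silently degrades the model.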
Taming complexity is indeed generally the challenge with MLOps, especially at this stage where the landscape of tools and practices is still relatively immature. You also need a balanced MLOps solution that matches the development stage of your ML solution. If you are deploying your first ML model, you don’t need to invest in the same infrastructure that you would need if you already have 50 models in production. Instead, the infrastructure can grow with your company and ML use case.
Roles in MLOps
Given the complexity of processes and tools involved in an MLOps solution, it is evident that developing ML solutions in an MLOps workflow requires diverse profiles and skills. Depending on the organization, the names might change or be used in different ways, but the roles stay largely the same.
On the technical side, data scientists and ML engineers are involved. These are the people mainly responsible for developing the ML model and the training pipelines, as well as performing all the relevant analyses - on data quality, performance measurements, and so on.
There is also involvement from data engineers, who are responsible for the data ingestion pipelines, the quality of the data that the DS/MLE receive, as well as for provisioning the right data at inference time.
From the beginning, software engineers, also called platform engineers in some organizations, should also be involved. They are responsible for the production environment, both front-end and back-end. They are necessary partners in thinking about how a certain ML model could be deployed to production, and what the constraints are in terms of processing power, latency, throughput and so on. Ideally, your MLOps process should not require the software engineers to re-implement the ML models. Instead, the ML models developed by the DS/MLE should be usable out of the box, typically as API endpoints.
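As a sketch of the "usable out of the box" idea, a trained model can be wrapped behind a small HTTP endpoint that software engineers consume without touching the model internals. Everything here is illustrative (the `/predict`-style handler, the hand-written linear score standing in for a real model); a real deployment would use a serving framework:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features: dict) -> float:
    """Stand-in for a trained model: a hand-written linear score."""
    return round(0.3 * features.get("x1", 0.0) + 0.7 * features.get("x2", 0.0), 4)

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Parse the JSON body, score it with the model, return the prediction.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        score = predict(json.loads(body))
        payload = json.dumps({"score": score}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    # POST {"x1": 1.0, "x2": 1.0} to localhost:8080 to get a score back.
    HTTPServer(("localhost", 8080), PredictHandler).serve_forever()
```

The contract between teams is then just the JSON schema of the request and response, not the model's framework or feature engineering.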
Finally, MLOps engineers and DevOps engineers help in handling the infrastructure: training servers, the different MLOps tools, and any other infrastructure needed to train and deploy a model.
In this post on machine learning in production and MLOps, we have seen the key goals of a successful ML project, how MLOps can help achieve them, and which roles an MLOps workflow involves. Subscribe to our newsletter to stay tuned for more MLOps-related content coming soon!