Maintaining an inference server is distressing but necessary.
This article continues a series of articles tackling the most frightening idea in the world of production ML: putting the damn thing in production.
In previous stories, we saw two different approaches to designing a Machine Learning (ML) powered application. First, we examined why you'd want to keep your model within your web server, and why you ultimately should not.
Keeping your model side-by-side with your core business logic is a must as you experiment with different ideas and want quick feedback from a trusted circle of testers, but it falls short when deploying in production.
The second approach we discussed is a very simple technique that separates the model from our web server: the model-in-database solution. We took a closer look at the use cases that are strong candidates for this approach and weighed its pros and cons.
The model-in-database approach is straightforward to implement, but it has limited scope. It also leads to stale models and does not always provide a valuable service to our users.
So, how can we resolve the issues we face with the two previous approaches? This story examines the most common architectural design: how to place your model in its own inference server.
Creating and maintaining a separate web service for our models adds infrastructural complexity, but, as we will see, it is the sweet spot for most ML-powered products.
Learning Rate is a newsletter for those who are curious about the world of AI and MLOps. You’ll hear from me on the first Saturday of every month with updates and thoughts on the latest AI news and articles. Subscribe here!
The Model-as-a-Service (MaaS) architectural paradigm permits us to run the model online in its own inference service. This inference service is like any other backend, but designed specifically to run ML models. That means it is backed by the proper infrastructure: one or more machines capable of keeping one or more ML models performant.
Our backend (i.e., our web server) assumes the role of the client: it makes requests and receives responses. For example, when it receives a request from a client (e.g., our user) to identify the objects present in an image, it forwards the request to the inference server; the model makes its prediction and sends back a response; finally, the web server forwards that response to the client.
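To make the flow concrete, here is a minimal sketch of the web server's side of that round trip: wrap the user's image in a JSON payload, send it to the inference service, and unpack the predictions from the response. The endpoint URL and payload schema are assumptions for illustration, not the API of any specific serving framework:

```python
import base64
import json

# Hypothetical endpoint of the separate inference service (illustrative only).
INFERENCE_URL = "http://inference-service:8080/v1/models/detector:predict"

def build_inference_request(image_bytes: bytes) -> str:
    """Wrap a raw image in the JSON body our hypothetical server expects."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps({"instances": [{"image_b64": encoded}]})

def parse_inference_response(body: str) -> list:
    """Pull the list of detected objects out of the server's JSON response."""
    return json.loads(body)["predictions"]

# Inside the web server's request handler, the round trip would look
# roughly like this (using any HTTP client, e.g. requests):
#   resp = requests.post(INFERENCE_URL, data=build_inference_request(img))
#   objects = parse_inference_response(resp.text)
```

The key point is that the web server only speaks HTTP and JSON here; it knows nothing about the model's framework, weights, or hardware.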
As we said in the introduction, this approach seems to be the sweet spot for most ML-powered applications. But why is that? Let’s examine this approach in depth, when you should use it, and go through its advantages and disadvantages.
The MaaS approach is, arguably, the most common approach for serving ML models. And there is a good reason why this is the recommended approach for most use cases.
First, a bug in your ML model code won’t bring down your whole application. You will have a chance to handle errors gracefully and still manage to provide a reliable service to your users.
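A minimal sketch of that graceful degradation, assuming a hypothetical `call_inference` callable that performs the network round trip and may raise on timeouts or server errors:

```python
def detect_objects(image_bytes: bytes, call_inference) -> dict:
    """Ask the inference service for predictions, degrading gracefully.

    `call_inference` is a hypothetical callable that performs the network
    round trip to the inference server and may raise on failure.
    """
    try:
        return {"status": "ok", "objects": call_inference(image_bytes)}
    except Exception:
        # The model service is down or misbehaving, but the rest of the
        # application keeps serving users with a sensible fallback.
        return {"status": "degraded", "objects": []}
```

If the model crashes, only this code path returns a degraded result; the web server itself stays up and every other feature keeps working.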
Then, as we said before, running a separate inference service allows us to choose the optimal hardware for our ML model and scale it appropriately. Do you need a GPU accelerator? Do you need to scale the server vertically or horizontally? With this approach, you can scale your service any way you need without affecting the other parts of your application.
Finally, this approach provides flexibility. Your model has now become a distinct web service. Thus, you can reuse it across multiple applications, parts of your application, or even open it to the public. You can build an API around it and commercialize it, just like OpenAI does.
The most prominent disadvantage of the MaaS approach is the latency it adds to your system. Whenever your server interacts with your model, it makes a request over the network; you should measure that round-trip cost and its impact on your application.
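One low-tech way to quantify that cost is simply to time the round trip and compare it against your latency budget. A sketch, with the inference call passed in as a plain callable:

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run one inference round trip and report its wall-clock latency in ms."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```

Wrapping the real network call this way lets you log latency per request and decide whether the extra hop is acceptable for your use case.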
Additionally, this approach adds infrastructure complexity, as you now have to design, run, and maintain a separate service.
This story examined a third architectural paradigm that can help us design and implement an ML application. The Model-as-a-Service approach is the sweet spot for most ML-powered products, and there is good reason for that.
The disadvantages of the other approaches are just too significant. You need to be able to scale your models independently, and you need them to be flexible enough to handle any type of data in the request.
In the following story, we will see what it takes to run a genuine, production-ready inference service on top of a Kubernetes cluster using KServe.
My name is Dimitris Poulopoulos, and I’m a machine learning engineer working for Arrikto. I have designed and implemented AI and software solutions for major clients such as the European Commission, Eurostat, IMF, the European Central Bank, OECD, and IKEA.
Opinions expressed are solely my own and do not express the views or opinions of my employer.
The Unnerving Sweet Spot for ML-Powered Products. Republished from Towards Data Science: https://towardsdatascience.com/the-unnerving-sweet-spot-for-ml-powered-products-c34b54e17179