Machine Learning Inference on AWS Lambda Functions powered by AWS Graviton2 Processors
Machine learning has become a necessity for many companies, from the Fortune 500 to small startups. With all the frameworks and libraries now available, it has become much easier to start developing machine learning models. The new challenge is architecting a prediction pipeline in the cloud. In this article, I will cover:
- use cases for serverless inference
- how Graviton2 makes predictions on AWS Lambda both faster and cheaper
- benchmarks and a link to the repository with code and libraries, so you can try them with your own models
Serverless for machine learning
To deploy a model in production, we need to satisfy several requirements: time, cost, and scale. It is usually hard to satisfy all three, so you may need to prioritize based on context. For example, a GPU cluster will provide the fastest predictions, but it can be expensive because you pay for idle time, and it is hard to scale when you have peak loads.
Serverless infrastructure scales with the current load, but its main disadvantages are initialization time and CPU-only execution. With those in mind, there are several cases where you may want to use serverless inference:
- You want to deploy your model for a pet project.
In this case, you can take advantage of both the simple architecture and the AWS Free Tier, which provides a number of Lambda executions for free.
- You want to make a simple MVP for your startup/project.
In this case, you can start quickly thanks to the simple architecture and still scale predictably with your load, without spending more engineering resources on supporting the infrastructure.
- You have a simple model and this architecture will reduce cost.
In this case, initialization time will be minimal and this architecture will be able to minimize idle time while also reducing operational complexity.
- You have rare peak loads and it is hard to manage clusters.
In this case, serverless infrastructure will scale based on your load and will process rare or scattered peak loads without throttling the requests. (Keep in mind that you may need to raise your AWS account limits to run thousands of Lambda functions concurrently.)
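Initialization time is often dominated by loading the model, so a common pattern is to load it once per execution environment and reuse it across warm invocations. Here is a minimal sketch of that pattern. The `load_model` helper and the toy linear weights are hypothetical and stand in for deserializing a real model:

```python
import json

# Cached at module scope: Lambda reuses the execution environment between
# warm invocations, so this survives from one call to the next.
_MODEL = None


def load_model():
    """Hypothetical loader; in practice this would deserialize a trained model."""
    return {"weights": [0.5, -0.25], "bias": 0.1}  # made-up toy linear model


def get_model():
    """Load the model on first use only, then return the cached copy."""
    global _MODEL
    if _MODEL is None:
        _MODEL = load_model()
    return _MODEL


def handler(event, context):
    model = get_model()
    features = event["features"]
    score = sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]
    return {"statusCode": 200, "body": json.dumps({"prediction": int(score > 0)})}
```

Only the first request handled by a fresh environment pays the loading cost; later requests in that environment go straight to prediction.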
At the same time, I want to list the cases where using serverless infrastructure for ML predictions won’t work:
- You need the lowest possible response time
- You have a model which requires a GPU or a lot of RAM
In both of these cases, you would want to use a cluster or a CaaS offering for inference. Here are the links to the relevant blog posts which cover how you can do it using Amazon SageMaker:
Graviton2 AWS Lambda update
AWS Graviton processors are custom built by AWS and use 64-bit Arm Neoverse cores. According to AWS, Graviton2 delivers better price performance than comparable x86-based options. With the announcement of Graviton2 availability on AWS Lambda, more use cases will become serverless friendly as jobs become both faster and cheaper (up to 34% better price performance and a 20% lower price per GB-second).
Graviton2 AWS Lambda is cheaper than x86-based AWS Lambda for the same RAM, but it comes with the challenge of needing Arm-built libraries. The good news is that a lot of libraries are already built for Arm64, so you don’t have to build them yourself.
Graviton2 is a big step for machine learning use cases: predictions with complex models become faster and cheaper, which opens more machine learning use cases to serverless. Let’s compare ML model performance on x86-based and Graviton2 AWS Lambda for both training and inference.
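To make the per-GB-second difference concrete, here is a small sketch of the duration charge for a single invocation. The prices are assumptions (on-demand list prices per GB-second for us-east-1 as I understand them), so check the current AWS Lambda pricing page before relying on them. The per-request charge is left out:

```python
# Assumed on-demand prices in USD per GB-second (us-east-1).
# Verify against the current AWS Lambda pricing page.
PRICE_PER_GB_SECOND = {
    "x86_64": 0.0000166667,
    "arm64": 0.0000133334,
}


def invocation_cost(architecture, memory_mb, duration_ms):
    """Duration charge for one invocation, excluding the per-request fee."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND[architecture]


if __name__ == "__main__":
    for arch in ("x86_64", "arm64"):
        cost = invocation_cost(arch, memory_mb=10240, duration_ms=1000)
        print(f"{arch}: ${cost:.8f} for a 1-second invocation at 10 GB")
```

With these assumed prices, the Arm price is about 80% of the x86 price for the same memory and duration.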
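A quick way to confirm that a package works on the architecture you deployed to is to report the machine type and try importing the native libraries you depend on. This is a minimal sketch; `describe_runtime` and `can_import` are hypothetical helper names, and the module list is only an example:

```python
import importlib
import platform


def can_import(module_name):
    """Return True if the module imports cleanly on this architecture."""
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        # Also raised when a compiled extension was built for another architecture.
        return False


def describe_runtime(modules=("numpy", "sklearn")):
    """Report the CPU architecture and which of the given modules load."""
    return {
        "machine": platform.machine(),  # e.g. "aarch64" on Arm, "x86_64" on x86
        "imports": {name: can_import(name) for name in modules},
    }


def handler(event, context):
    return describe_runtime(tuple(event.get("modules", ("numpy", "sklearn"))))
```

Invoking it once per architecture shows whether the bundled libraries match the function's architecture before you run a real workload.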
Benchmarks and links to the repo
We will use the following settings for the Lambda ML benchmark:
- Framework: Scikit-Learn
- Data: Binary classification dataset generated synthetically
- Model: SVM classifier
- Train: 512 samples with a varying number of features
- Test: 1024 samples with a varying number of features
- Number of cycles: 40
- Lambda Configuration: 10GB RAM, 5-minute limit
Here is the handler code which was used. Libraries are available at the repo https://github.com/ryfeus/lambda-packs:
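As a rough illustration of the settings listed above, a handler along these lines could time training and inference. This is a minimal sketch and not necessarily the code from the repo. It assumes scikit-learn's `make_classification` and `SVC` with default parameters, and the `feature_counts` and `n_cycles` event keys and their default values are hypothetical:

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC


def run_benchmark(n_features, n_train=512, n_test=1024, n_cycles=40, seed=0):
    """Average fit and predict time in milliseconds for one feature count."""
    # With make_classification defaults, n_features must be at least 4.
    X, y = make_classification(
        n_samples=n_train + n_test, n_features=n_features, random_state=seed
    )
    X_train, y_train = X[:n_train], y[:n_train]
    X_test = X[n_train:]

    train_s = predict_s = 0.0
    for _ in range(n_cycles):
        model = SVC()
        t0 = time.perf_counter()
        model.fit(X_train, y_train)
        t1 = time.perf_counter()
        model.predict(X_test)
        t2 = time.perf_counter()
        train_s += t1 - t0
        predict_s += t2 - t1

    return {
        "n_features": n_features,
        "train_ms": 1000 * train_s / n_cycles,
        "predict_ms": 1000 * predict_s / n_cycles,
    }


def handler(event, context):
    # Hypothetical event keys; the defaults are illustrative, not the original values.
    feature_counts = event.get("feature_counts", [32, 128, 512])
    n_cycles = event.get("n_cycles", 40)
    return [run_benchmark(n, n_cycles=n_cycles) for n in feature_counts]
```

Deploying the same package to an x86-based and a Graviton2 function and invoking both with the same event gives directly comparable timings.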
Let’s take a look at the comparison table for Graviton2 and x86-based AWS Lambda.
We can see two main trends here:
- Training on average takes the same or less time on Graviton2 compared to x86-based Lambda
- Inference takes more time on Graviton2 Lambda when the number of features is small and becomes faster than x86-based Lambda when the number of features is large
There are two main takeaways from the benchmark:
- In this benchmark, Graviton2 handled the heavier workloads (more features) faster than x86-based Lambda, and at a lower price.
- Which architecture wins depends on your case, so the best option is to run your workflow on both and check which one performs better. To check Lambda performance for your case, feel free to use the libraries from the repo https://github.com/ryfeus/lambda-packs
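Once you have timings from both architectures, the comparison can be reduced to two ratios. This sketch assumes the same memory setting on both functions and uses the 20% lower per-GB-second price quoted earlier. The function name and the returned keys are hypothetical:

```python
from statistics import median

# Graviton2 price per GB-second relative to x86, from the "20% cheaper" figure.
ARM_PRICE_RATIO = 0.8


def compare(x86_ms, arm_ms, arm_price_ratio=ARM_PRICE_RATIO):
    """Compare measured durations (ms) from the two architectures."""
    speed_ratio = median(arm_ms) / median(x86_ms)  # below 1: Graviton2 is faster
    cost_ratio = speed_ratio * arm_price_ratio     # below 1: Graviton2 is cheaper
    return {
        "speed_ratio": speed_ratio,
        "cost_ratio": cost_ratio,
        "arm_is_faster": speed_ratio < 1,
        "arm_is_cheaper": cost_ratio < 1,
    }
```

Under that price assumption, Graviton2 stays cheaper per invocation as long as it is less than 25% slower than x86, so a workload can be slower on Graviton2 and still cost less.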
We’ve looked at the use cases for machine learning inference on serverless, at the Graviton2 update to AWS Lambda, and at how it can be used for machine learning. Finally, we compared x86-based and Graviton2 Lambda for training machine learning models and making predictions.
As a hobby, I port a lot of libraries to make them serverless friendly. You can look at them here. They all have an MIT license, so feel free to modify and use them for your project.
Also, I have a project dedicated to comparing deep learning inference across different AWS services (including AWS Inferentia instances and ONNX/TFLite frameworks on AWS Lambda). Feel free to check it out and compare different prediction frameworks for your project.
I’m excited to see how others are using serverless to empower their development. Feel free to drop me a line in the comments, and happy developing!