Scalable Cloud Inference Endpoint Using ONNX and AWS Fargate
As a Machine Learning Engineer, I often find myself deploying models to the cloud. It’s an essential part of getting our models out of the lab and into the hands of users, and it’s also one of the most challenging parts of our work. There is a plethora of tools and services available to help with this task, but today I want to talk about a specific combination that I’ve found particularly powerful: AWS Copilot, AWS Fargate, and the ONNX framework. AWS Copilot makes it very easy to deploy and update an inference cluster, and ONNX keeps the whole stack flexible.
This tutorial assumes you have a basic understanding of Docker and AWS.
What are AWS Copilot, AWS Fargate, and ONNX?
- AWS Copilot is a command line interface (CLI) that makes it easy to create, release and manage production-ready containerized applications on AWS App Runner, Amazon ECS, and AWS Fargate.
- AWS Fargate is a serverless compute engine for containers that works with both Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS). Fargate makes it easy for you to focus on building your applications without worrying about the underlying infrastructure.
- Open Neural Network Exchange (ONNX) is an open-source artificial intelligence (AI) model format that acts as a cross-platform for machine learning models. It allows models to be trained in one framework and then used in another, providing flexibility and aiding in the broader adoption and accessibility of powerful AI models.
For this tutorial, we’ll be deploying an inference endpoint using the ONNX model from the aws-inference-benchmark repository. The inference endpoint will run on an ECS cluster with AWS Fargate for compute and an Application Load Balancer (ALB) for load balancing.
Prerequisites
Before starting, you need to have the following tools installed on your machine:
- Python 3.6 or later
- Docker
- AWS CLI
- AWS Copilot
Please ensure that you have appropriate permissions for AWS services and that your AWS CLI is properly configured.
Step-by-Step Deployment
1. Clone the Repository
First, clone the GitHub repository to your local machine:
git clone https://github.com/ryfeus/aws-inference-benchmark.git
cd aws-inference-benchmark/copilot/cpu/aws-copilot-inference-service
This repository contains all the necessary files to deploy the deep learning model for image inference.
2. Code in the Repository
The main files in the repository are app.py with the business logic, the Dockerfile defining the Docker image, and manifest.yml, which configures the cloud infrastructure.
app.py — contains the code for ONNX inference, the healthcheck, and web request handling.
import aiohttp
from aiohttp import web
import onnxruntime as ort
import numpy as np
import io
import asyncio
import json
from PIL import Image
from concurrent.futures import ThreadPoolExecutor

sess = ort.InferenceSession('efficientnet-lite4-11.onnx')
with open("labels_map.txt") as f:
    label_map = json.load(f)

routes = web.RouteTableDef()
executor = ThreadPoolExecutor()

@routes.get('/')
async def hello(request):
    return web.Response()

def image_preprocess(image):
    image = image.resize((224, 224), resample=Image.BICUBIC)
    image = np.array(image).astype('float32')
    image = image[:, :, :3]
    image = np.expand_dims(image, axis=0)
    image -= [127.0, 127.0, 127.0]
    image /= [128.0, 128.0, 128.0]
    return image

# Set up the HTTP server
@routes.post('/predict')
async def predict(request):
    try:
        image_data = await request.content.read()
        image = Image.open(io.BytesIO(image_data))
        image = image_preprocess(image)
        input_name = sess.get_inputs()[0].name
        output_name = sess.get_outputs()[0].name
        loop = asyncio.get_event_loop()
        results = await loop.run_in_executor(executor, sess.run, [output_name], {input_name: image})
        result = reversed(results[0][0].argsort()[-5:])
        preds = []
        for r in result:
            preds.append({"label": label_map[str(r)], "score": str(results[0][0][r])})
        resp = {
            'preds': preds
        }
        return web.Response(text=json.dumps(resp))
    except Exception as e:
        return web.Response(text=str(e), status=500)

def create_app():
    app = web.Application()
    app.add_routes(routes)
    return app

async def create_gunicorn_app():
    return create_app()

if __name__ == '__main__':
    web.run_app(create_app(), port=8080)
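The image_preprocess function maps 8-bit pixel values into roughly [-1, 1], the input range the EfficientNet-Lite model expects. A standalone numpy check of that normalization (synthetic image, no model needed):

```python
import numpy as np

# Synthetic 224x224 RGB image; the first pixel holds the extreme values.
img = np.zeros((224, 224, 3), dtype='float32')
img[0, 0] = [0.0, 127.0, 255.0]
batch = np.expand_dims(img, axis=0)

# Same normalization as image_preprocess: (x - 127) / 128.
batch -= [127.0, 127.0, 127.0]
batch /= [128.0, 128.0, 128.0]

# 0 maps to -0.9921875, 127 maps to 0.0, 255 maps to 1.0.
print(batch[0, 0, 0])
```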
Dockerfile — used to build the Docker image that will run on the ECS cluster.
FROM python:3.9-slim-buster
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8080
CMD ["gunicorn", "-c", "gunicorn_conf.py", "app:create_gunicorn_app"]
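The CMD above points gunicorn at a gunicorn_conf.py that is not reproduced here. A minimal sketch of what such a config could look like for an aiohttp app (the values below are assumptions; check the repository for the actual file):

```python
# gunicorn_conf.py -- hypothetical sketch; see the repository for the real file.
bind = "0.0.0.0:8080"                       # must match the EXPOSEd port
workers = 1                                 # keep memory inside the 512 MiB task limit
worker_class = "aiohttp.GunicornWebWorker"  # aiohttp apps need aiohttp's worker
```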
Copilot manifest — a configuration file which defines parameters of the service such as CPU/memory allocation, the number of running tasks, and the routed paths.
name: demo
type: Load Balanced Web Service

http:
  path: '/'
  healthcheck: '/'

# Configuration for your containers and service.
image:
  # Docker build arguments. For additional overrides: https://aws.github.io/copilot-cli/docs/manifest/lb-web-service/#image-build
  build: cpu/aws-copilot-inference-service/Dockerfile
  # Port exposed through your container to route traffic to it.
  port: 8080

cpu: 256       # Number of CPU units for the task.
memory: 512    # Amount of memory in MiB used by the task.
platform: linux/x86_64  # See https://aws.github.io/copilot-cli/docs/manifest/lb-web-service/#platform
count: 1       # Number of tasks that should be running in your service.
exec: true     # Enable running commands in your container.

network:
  connect: true # Enable Service Connect for intra-environment traffic between services.
3. Initialize the Environment and Deploy the Application
Next, initialize your AWS Copilot environment:
copilot env init
This command sets up the environment where your service will reside.
Now, you can deploy the application using:
copilot deploy
This command will take care of building the Docker image, pushing it to Amazon ECR, and deploying it to Amazon ECS/Fargate.
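Once the deployment finishes, Copilot’s own subcommands are handy for inspecting the service (the service name demo comes from the manifest):

```shell
# Show the service's endpoint, resources, and configuration.
copilot svc show --name demo

# Check deployment and task status.
copilot svc status --name demo

# Tail the service logs.
copilot svc logs --name demo --follow
```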
4. Test the Endpoint
After successful deployment, you can make a single prediction by sending a POST request to your service. Replace <prefix> with the prefix of your endpoint:
curl -X POST -H "Content-Type: image/jpeg" --data-binary "@flower.png" http://<prefix>.us-east-1.elb.amazonaws.com/predict
The response should be:
{"preds": [{"label": "tray", "score": "0.43721294"}, {"label": "vase", "score": "0.41533998"}, {"label": "pot, flowerpot", "score": "0.08949976"}, {"label": "handkerchief, hankie, hanky, hankey", "score": "0.00976433"}, {"label": "greenhouse, nursery, glasshouse", "score": "0.0029122673"}]}
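Since the response is plain JSON, client code can consume it with nothing but the standard library. For example, extracting the top label from a response like the one above:

```python
import json

# Truncated copy of the response shown above.
response_text = ('{"preds": [{"label": "tray", "score": "0.43721294"}, '
                 '{"label": "vase", "score": "0.41533998"}, '
                 '{"label": "pot, flowerpot", "score": "0.08949976"}]}')

preds = json.loads(response_text)["preds"]
top = max(preds, key=lambda p: float(p["score"]))
print(top["label"])  # tray
```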
If you want to benchmark the service, you can use the ApacheBench (ab) tool; the report below came from a run of 100 requests at a concurrency of 2:
ab -n 100 -c 2 -p flower.png -T image/jpeg http://<prefix>.us-east-1.elb.amazonaws.com/predict
Here is what an example result looks like:
Server Software: Python/3.9
Server Hostname: <prefix>.us-east-1.elb.amazonaws.com
Server Port: 80
Document Path: /predict
Document Length: 291 bytes
Concurrency Level: 2
Time taken for tests: 26.986 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 46200 bytes
Total body sent: 6518900
HTML transferred: 29100 bytes
Requests per second: 3.71 [#/sec] (mean)
Time per request: 539.715 [ms] (mean)
Time per request: 269.858 [ms] (mean, across all concurrent requests)
Transfer rate: 1.67 [Kbytes/sec] received
235.91 kb/s sent
237.58 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 80 88 6.6 86 129
Processing: 305 443 183.5 395 1232
Waiting: 304 442 183.6 393 1231
Total: 394 531 183.4 485 1318
Percentage of the requests served within a certain time (ms)
50% 485
66% 502
75% 518
80% 562
90% 798
95% 908
98% 1254
99% 1318
100% 1318 (longest request)
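The headline figures in an ab report are related by simple arithmetic, which makes it easy to sanity-check a run (the tiny differences from the report come from ab’s higher-precision internal timing):

```python
# Numbers taken from the ab report above.
total_time_s = 26.986   # Time taken for tests
completed = 100         # Complete requests
concurrency = 2         # Concurrency Level

# Requests per second = completed requests / total time.
rps = completed / total_time_s
print(round(rps, 2))             # 3.71

# "Time per request (mean)": per-client latency = total * concurrency / requests.
per_request_ms = total_time_s * concurrency / completed * 1000
print(round(per_request_ms, 2))  # 539.72

# "(mean, across all concurrent requests)": throughput view = total / requests.
across_ms = total_time_s / completed * 1000
print(round(across_ms, 2))       # 269.86
```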
Running Locally
If you wish to run the service locally, you can do so using Docker.
1. Build the Docker Image
docker build -t image-inference .
2. Run the Docker Container
docker run --rm -p 8080:8080 image-inference
Now, you can make a prediction using the REST API:
curl -X POST -H "Content-Type: image/jpeg" --data-binary "@flower.png" http://localhost:8080/predict
Testing
To test the service, you first need to install the development dependencies:
pip install -r dev-requirements.txt
Then, you can run the tests:
pytest -v test_inference.py
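The repository’s test_inference.py exercises the real app; as a self-contained illustration of the approach, here is a hypothetical standalone sketch that starts an aiohttp app in-process with aiohttp’s test utilities and checks a healthcheck route like the one in app.py:

```python
import asyncio
from aiohttp import web
from aiohttp.test_utils import TestClient, TestServer

async def hello(request):
    # Mirrors the '/' healthcheck handler in app.py.
    return web.Response()

async def check_healthcheck():
    app = web.Application()
    app.router.add_get('/', hello)
    # TestServer binds the app to an ephemeral local port; TestClient talks to it.
    async with TestClient(TestServer(app)) as client:
        resp = await client.get('/')
        return resp.status

status = asyncio.run(check_healthcheck())
print(status)  # 200
```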
Conclusion
In this blog post, we showed how to deploy an inference endpoint using AWS Copilot on AWS Fargate with the ONNX framework. AWS Copilot simplifies the process of deploying your services, and AWS Fargate runs them in a serverless environment without infrastructure management. With the ONNX format, you can easily move models across different AI frameworks and platforms. We hope this tutorial helps you on your journey of deploying scalable and robust deep learning models.