Scalable Cloud inference endpoint using ONNX and AWS Fargate

Rustem Feyzkhanov · Published in AWS Tip · May 22, 2023

As a Machine Learning Engineer, I often find myself deploying models to the cloud. It’s an essential part of getting our models out of the lab and into the hands of users, and it’s also one of the most challenging parts of our work. There are plenty of tools and services available to help with this task, but today I want to talk about a specific combination that I’ve found particularly powerful: AWS Copilot, AWS Fargate, and the ONNX framework. AWS Copilot makes it very easy to deploy and update the inference cluster, and ONNX makes the whole stack very flexible.

This tutorial assumes you have a basic understanding of Docker and AWS.

What are AWS Copilot, AWS Fargate, and ONNX?

  • AWS Copilot is a command line interface (CLI) that makes it easy to create, release and manage production-ready containerized applications on AWS App Runner, Amazon ECS, and AWS Fargate.
  • AWS Fargate is a serverless compute engine for containers that works with both Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS). Fargate makes it easy for you to focus on building your applications without worrying about the underlying infrastructure.
  • Open Neural Network Exchange (ONNX) is an open-source artificial intelligence (AI) model format that acts as a cross-platform standard for machine learning models. It allows models to be trained in one framework and then used in another, providing flexibility and aiding in the broader adoption and accessibility of powerful AI models, as the short sketch after this list illustrates.
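
To make the “train in one framework, run in another” point concrete, here is a minimal sketch (not part of the repository) that exports a toy PyTorch model to ONNX and runs it with ONNX Runtime; the model and file names are placeholders.

import torch
import numpy as np
import onnxruntime as ort

# A toy PyTorch model stands in for a real trained network.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2)
).eval()
dummy_input = torch.randn(1, 4)

# Export the model to the framework-agnostic ONNX format.
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"])

# Run the exported model with ONNX Runtime - PyTorch is no longer needed here.
sess = ort.InferenceSession("model.onnx")
outputs = sess.run(None, {"input": np.random.rand(1, 4).astype(np.float32)})
print(outputs[0])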

For this tutorial, we’ll be deploying an inference endpoint using the ONNX model from the aws-inference-benchmark repository. The inference endpoint will use an ECS cluster with AWS Fargate for compute and an Application Load Balancer (ALB) for load balancing.

Architecture diagram from https://aws.github.io/copilot-cli/docs/concepts/services/

Prerequisites

Before starting, you need to have the following tools installed on your machine:

  • Python 3.6 or later
  • Docker
  • AWS CLI
  • AWS Copilot

Please ensure that you have appropriate permissions for AWS services and that your AWS CLI is properly configured.
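
If you want to double-check that your credentials are picked up before running any Copilot commands, a quick boto3 sketch (not required by the tutorial, just a convenience) looks like this:

# Optional sanity check (not part of the repo): confirm that the AWS
# credentials visible to boto3 are valid before deploying.
import boto3

sts = boto3.client("sts")
identity = sts.get_caller_identity()
print("Account:", identity["Account"])
print("Caller ARN:", identity["Arn"])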

Step-by-Step Deployment

1. Clone the Repository

First, clone the GitHub repository to your local machine:

git clone https://github.com/ryfeus/aws-inference-benchmark.git
cd aws-inference-benchmark/copilot/cpu/aws-copilot-inference-service

This repository contains all the necessary files to deploy the deep learning model for image inference.

2. Code in the repo

The main files in the repo are app.py with the business logic, the Dockerfile with the Docker image definition, and manifest.yml, which configures the cloud infrastructure.

app.py — contains the code for ONNX inference, the healthcheck, and web request handling.

from aiohttp import web
import onnxruntime as ort
import numpy as np
import io
import asyncio
import json
from PIL import Image
from concurrent.futures import ThreadPoolExecutor

# Load the ONNX model and the label map once at startup.
sess = ort.InferenceSession('efficientnet-lite4-11.onnx')
with open("labels_map.txt") as f:
    label_map = json.load(f)

routes = web.RouteTableDef()
executor = ThreadPoolExecutor()


# Healthcheck endpoint used by the load balancer.
@routes.get('/')
async def hello(request):
    return web.Response()


def image_preprocess(image):
    # Resize to the model's 224x224 input and normalize pixels to [-1, 1].
    image = image.resize((224, 224), resample=Image.BICUBIC)
    image = np.array(image).astype('float32')
    image = image[:, :, :3]
    image = np.expand_dims(image, axis=0)
    image -= [127.0, 127.0, 127.0]
    image /= [128.0, 128.0, 128.0]
    return image


# Prediction endpoint: accepts raw image bytes and returns the top-5 labels.
@routes.post('/predict')
async def predict(request):
    try:
        image_data = await request.content.read()
        image = Image.open(io.BytesIO(image_data))

        image = image_preprocess(image)

        input_name = sess.get_inputs()[0].name
        output_name = sess.get_outputs()[0].name

        # Run inference in a thread pool so the event loop is not blocked.
        loop = asyncio.get_event_loop()
        results = await loop.run_in_executor(executor, sess.run, [output_name], {input_name: image})

        result = reversed(results[0][0].argsort()[-5:])
        preds = []
        for r in result:
            preds.append({"label": label_map[str(r)], "score": str(results[0][0][r])})

        resp = {
            'preds': preds
        }

        return web.Response(text=json.dumps(resp))
    except Exception as e:
        return web.Response(text=str(e), status=500)


def create_app():
    app = web.Application()
    app.add_routes(routes)
    return app


# Async factory used by Gunicorn (see the Dockerfile CMD below).
async def create_gunicorn_app():
    return create_app()


if __name__ == '__main__':
    web.run_app(create_app(), port=8080)

Dockerfile — used for building the Docker image that will run on the ECS cluster.

FROM python:3.9-slim-buster
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8080
CMD ["gunicorn", "-c", "gunicorn_conf.py", "app:create_gunicorn_app"]

Copilot manifest (manifest.yml) — the configuration file that defines parameters of the service such as CPU/memory allocation, the number of tasks, and the routed paths.

name: demo
type: Load Balanced Web Service

http:
  path: '/'
  healthcheck: '/'

# Configuration for your containers and service.
image:
  # Docker build arguments. For additional overrides: https://aws.github.io/copilot-cli/docs/manifest/lb-web-service/#image-build
  build: cpu/aws-copilot-inference-service/Dockerfile
  # Port exposed through your container to route traffic to it.
  port: 8080

cpu: 256       # Number of CPU units for the task.
memory: 512    # Amount of memory in MiB used by the task.
platform: linux/x86_64  # See https://aws.github.io/copilot-cli/docs/manifest/lb-web-service/#platform
count: 1       # Number of tasks that should be running in your service.
exec: true     # Enable running commands in your container.

network:
  connect: true  # Enable Service Connect for intra-environment traffic between services.

3. Initialize the Environment and Deploy the Application

Next, initialize your AWS Copilot environment:

copilot env init

This command sets up the environment where your service will reside.

Now, you can deploy the application using:

copilot deploy

This command will take care of building the Docker image, pushing it to Amazon ECR, and deploying it to Amazon ECS/Fargate.
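
Once the deployment finishes, Copilot prints the service URL. If you want to confirm programmatically that the Fargate tasks are running, a small boto3 sketch can help; the cluster and service names below are placeholders (Copilot generates them from your application, environment, and service names).

# Hypothetical check (names are placeholders): verify that the ECS service
# behind the Copilot deployment has the expected number of running tasks.
import boto3

ecs = boto3.client("ecs")
response = ecs.describe_services(
    cluster="<app>-<env>-Cluster",   # placeholder: Copilot-generated cluster name
    services=["<app>-<env>-demo"],   # placeholder: Copilot-generated service name
)
service = response["services"][0]
print("Running tasks:", service["runningCount"], "/", service["desiredCount"])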

4. Test the Endpoint

After successful deployment, you can make a single prediction by sending a POST request to your service. Replace <prefix> with the prefix of your endpoint:

curl -X POST -H "Content-Type: image/jpeg" --data-binary "@flower.png" http://<prefix>.us-east-1.elb.amazonaws.com/predict

The response should be:

{"preds": [{"label": "tray", "score": "0.43721294"}, {"label": "vase", "score": "0.41533998"}, {"label": "pot, flowerpot", "score": "0.08949976"}, {"label": "handkerchief, hankie, hanky, hankey", "score": "0.00976433"}, {"label": "greenhouse, nursery, glasshouse", "score": "0.0029122673"}]}

If you want to benchmark the service, you can use the Apache Benchmark tool:

ab -n 100 -c 2 -p flower.png -T image/jpeg http://<prefix>.us-east-1.elb.amazonaws.com/predict

Here is what an example result looks like:

Server Software:        Python/3.9
Server Hostname:        <prefix>.us-east-1.elb.amazonaws.com
Server Port:            80

Document Path:          /predict
Document Length:        291 bytes

Concurrency Level:      2
Time taken for tests:   26.986 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      46200 bytes
Total body sent:        6518900
HTML transferred:       29100 bytes
Requests per second:    3.71 [#/sec] (mean)
Time per request:       539.715 [ms] (mean)
Time per request:       269.858 [ms] (mean, across all concurrent requests)
Transfer rate:          1.67 [Kbytes/sec] received
                        235.91 kb/s sent
                        237.58 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       80   88    6.6     86    129
Processing:   305  443  183.5    395   1232
Waiting:      304  442  183.6    393   1231
Total:        394  531  183.4    485   1318

Percentage of the requests served within a certain time (ms)
  50%    485
  66%    502
  75%    518
  80%    562
  90%    798
  95%    908
  98%   1254
  99%   1318
 100%   1318 (longest request)
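
If you would rather drive the benchmark from Python (for example, to record custom metrics), a rough asyncio sketch along the lines of the ab run above could look like this; it is not part of the repository, and the endpoint URL is a placeholder.

# Rough load-test sketch (not from the repo): send N concurrent POST requests
# to /predict with aiohttp and report the mean latency.
import asyncio
import time
import aiohttp

URL = "http://<prefix>.us-east-1.elb.amazonaws.com/predict"  # placeholder endpoint

async def one_request(session, payload):
    start = time.perf_counter()
    async with session.post(URL, data=payload,
                            headers={"Content-Type": "image/jpeg"}) as resp:
        await resp.read()
    return time.perf_counter() - start

async def main(n_requests=100, concurrency=2):
    with open("flower.png", "rb") as f:
        payload = f.read()
    connector = aiohttp.TCPConnector(limit=concurrency)
    async with aiohttp.ClientSession(connector=connector) as session:
        latencies = await asyncio.gather(
            *(one_request(session, payload) for _ in range(n_requests)))
    print(f"mean latency: {sum(latencies) / len(latencies) * 1000:.1f} ms")

asyncio.run(main())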

Running Locally

If you wish to run the service locally, you can do so using Docker.

1. Build the Docker Image

docker build -t image-inference .

2. Run the Docker Container

docker run --rm -p 8080:8080 image-inference

Now, you can make a prediction using the REST API:

curl -X POST -H "Content-Type: image/jpeg" --data-binary "@flower.png" http://localhost:8080/predict

Testing

To test the service, you first need to install the development dependencies:

pip install -r dev-requirements.txt

Then, you can run the tests:

pytest -v test_inference.py
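
test_inference.py ships with the repository; for reference, a minimal test along these lines could be written with pytest-aiohttp’s aiohttp_client fixture. The repository’s actual tests may be structured differently, and importing app requires the ONNX model and label files to be present locally.

# Hypothetical sketch of an endpoint test using pytest-aiohttp's aiohttp_client
# fixture; the repository's actual test_inference.py may differ.
import json

from app import create_app


async def test_healthcheck(aiohttp_client):
    client = await aiohttp_client(create_app())
    resp = await client.get("/")
    assert resp.status == 200


async def test_predict_returns_top5(aiohttp_client):
    client = await aiohttp_client(create_app())
    with open("flower.png", "rb") as f:
        resp = await client.post("/predict", data=f.read(),
                                 headers={"Content-Type": "image/jpeg"})
    assert resp.status == 200
    preds = json.loads(await resp.text())["preds"]
    assert len(preds) == 5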

Conclusion

In this blog post, we showed how to deploy an inference endpoint using AWS Copilot on AWS Fargate with the ONNX framework. AWS Copilot simplifies the process of deploying your services, and AWS Fargate ensures that they run smoothly in a serverless environment. With the ONNX framework, you can easily share models across different AI tools and platforms. We hope this tutorial helps you on your journey of deploying scalable and robust deep learning models.


I'm a staff machine learning engineer at Instrumental, where I work on analytical models for the manufacturing industry, and an AWS Machine Learning Hero.