12 MLOps breakout sessions I’m looking forward to at Re:Invent 2022

Rustem Feyzkhanov
4 min read · Nov 8, 2022


MLOps is a field of best practices for running ML workflows in production. It encompasses a large stack of tasks, from optimizing the pipeline for training and inference to observing the model in production. New tools and practices appear every year, so here are the top 12 AWS Re:Invent 2022 sessions to help you stay in the loop on MLOps in the AWS cloud. There are significantly more sessions at Re:Invent this year, so I will publish a separate blog post covering the ML use cases shared by different companies. Also, check out my blog posts about Re:Invent 2020 here and 2021 here.

This year Re:Invent offers both in-person and virtual experiences. The virtual experience is free: you can register for virtual attendance at the AWS Portal and get access to the keynotes, leadership sessions, and breakout sessions.

Here are the 12 MLOps breakout sessions I’m looking forward to. All of them will be available both in person and online on the Re:Invent platform.

Sessions about MLOps best practices and tools

These sessions cover best practices for MLOps and how they can be implemented using AWS infrastructure. They cover the whole end-to-end pipeline — from model training to deploying the model to production and monitoring it.

[AIM208] Idea to production on Amazon SageMaker, with Thomson Reuters

This session introduces Amazon SageMaker and is a good way to learn about the services under the SageMaker umbrella and how they can be used to label data and to build, train, and deploy machine learning models.
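For context, here is a minimal sketch of the build/train/deploy flow with the SageMaker Python SDK that sessions like this typically walk through; the training script, S3 path, and hyperparameters are hypothetical placeholders, not specifics from the session.

```python
# Minimal sketch of the build/train/deploy flow with the SageMaker Python SDK.
# The entry point script, S3 path, and hyperparameters are hypothetical.
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes this runs in a SageMaker notebook

# Train: SageMaker provisions the instance, runs train.py, and tears it down.
estimator = PyTorch(
    entry_point="train.py",          # hypothetical training script
    role=role,
    framework_version="1.12",
    py_version="py38",
    instance_type="ml.g4dn.xlarge",
    instance_count=1,
    hyperparameters={"epochs": 10},
)
estimator.fit({"training": "s3://my-bucket/train"})  # hypothetical S3 path

# Deploy: creates a managed real-time HTTPS endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```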

[AIM301] Train ML models at scale with Amazon SageMaker, featuring AI21 Labs

This session covers how to use SageMaker for model training: how to utilize the available high-performance compute infrastructure without worrying about scale, and how to use distributed training libraries.
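As a rough illustration of the distributed side, here is a sketch of enabling SageMaker’s distributed data parallel library on a PyTorch estimator; the script, instance choice, and S3 path are hypothetical.

```python
# Sketch: enabling SageMaker's distributed data parallel library on a
# PyTorch estimator. Script name and S3 path are hypothetical placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",              # hypothetical script using smdistributed
    role=role,                           # IAM role, e.g. from the previous sketch
    framework_version="1.12",
    py_version="py38",
    instance_type="ml.p4d.24xlarge",     # 8 GPUs per instance
    instance_count=2,                    # 16 GPUs in total
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit({"training": "s3://my-bucket/train"})
```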

[AIM321] Productionize ML workloads using Amazon SageMaker MLOps, feat. NatWest

This session covers most of the services under the SageMaker umbrella, from SageMaker Pipelines, SageMaker Projects, and SageMaker Experiments for experimentation and training to SageMaker Model Registry and SageMaker Model Monitor for deploying models to production.
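To make the pipeline side concrete, here is a minimal sketch of wiring a training step into a SageMaker Pipeline; the names are hypothetical, and the full Projects/Model Registry/Model Monitor setup is much richer than this.

```python
# Sketch: a one-step SageMaker Pipeline. Names and S3 path are hypothetical;
# `estimator` is any configured SageMaker estimator (e.g. from the earlier sketch).
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

step_train = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"training": TrainingInput(s3_data="s3://my-bucket/train")},
)

pipeline = Pipeline(name="my-mlops-pipeline", steps=[step_train])
pipeline.upsert(role_arn=role)   # create or update the pipeline definition
execution = pipeline.start()     # kick off a run
```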

[AIM302] Deploy ML models for inference at high performance & low cost, feat. AT&T

This session covers different ways of deploying models to production using SageMaker and compares them along different axes. Examples include real-time, serverless, asynchronous, and batch inference; single-model, multi-model, and multi-container endpoints; and autoscaling.
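As a sketch of two of those options, the snippet below deploys the same model as a real-time endpoint and as a serverless endpoint; the memory size and concurrency values are illustrative, not recommendations.

```python
# Sketch: deploying one model two ways. `model` is a trained sagemaker.model.Model
# (e.g. estimator.create_model() from the earlier sketch); values are illustrative.
from sagemaker.serverless import ServerlessInferenceConfig

# Real-time endpoint: always-on instance, lowest latency.
predictor_rt = model.deploy(initial_instance_count=1, instance_type="ml.m5.large")

# Serverless endpoint: scales to zero, billed per invocation.
predictor_sl = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=2048,
        max_concurrency=5,
    )
)
```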

[AIM343] Minimizing the production impact of ML model updates with shadow testing

One of the challenges with releasing a new model to production is making sure that it performs better than the old one, which makes the rollout an important step of the ML release process. This session covers how to implement it by deploying the new ML model as a shadow of the old one and then comparing the results.
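If I understand the feature correctly, SageMaker exposes shadow variants through the low-level endpoint-configuration API: a copy of live traffic goes to the shadow variant, but only the production variant’s responses are returned to callers. Below is a hedged boto3 sketch with hypothetical model names.

```python
# Sketch: an endpoint config with a production variant and a shadow variant.
# Model names are hypothetical; the shadow variant sees a copy of live traffic.
import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="prod-with-shadow",
    ProductionVariants=[{
        "VariantName": "production",
        "ModelName": "model-v1",          # hypothetical current model
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
    }],
    ShadowProductionVariants=[{
        "VariantName": "shadow",
        "ModelName": "model-v2",          # hypothetical candidate model
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
    }],
)
```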

Sessions about training large models

This year brought a lot of novel model architectures like Stable Diffusion and DALL-E, and these sessions cover best practices for training and deploying large models with novel architectures.

[CMP314] How Stable Diffusion was built: Tips and tricks to train large AI models

This session covers both the story behind Stable Diffusion (stability.ai CEO Emad Mostaque is one of the speakers) and how the team utilized AWS infrastructure to train a model that has become a standard in image generation.

[AIM404] Train and host foundation models with PyTorch on AWS

This session covers best practices for organizing training and inference for large models like GPT-3 and DALL-E: how to reduce training time and cost by optimizing compute, network communication, input/output, checkpointing, and offloading from GPUs to CPUs.
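As a taste of two of those techniques in plain PyTorch, here is a sketch of activation checkpointing and FSDP CPU offload on a toy model; the FSDP line is commented out because it requires an initialized torch.distributed process group.

```python
# Sketch: activation checkpointing (trade recompute for memory) and FSDP
# CPU offload (park sharded parameters in host RAM). Toy model for illustration.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Activations inside `block` are recomputed during backward instead of stored.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()

# Inside an initialized torch.distributed process group, FSDP can shard the
# block and offload parameters to CPU between uses:
# model = FSDP(block, cpu_offload=CPUOffload(offload_params=True))
```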

[CMP209] AI parallelism explained: How Amazon Search scales deep-learning training

This session covers transformer-based models, which have become incredibly popular for both image and text tasks, and dives into parallelization techniques for model training as well as inference options for large models.
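For intuition about one of those techniques, here is a pure-PyTorch toy showing tensor (model) parallelism on a single linear layer: the weight matrix is split column-wise, each shard computes a slice of the output, and the slices are concatenated. On real hardware each shard would live on its own GPU.

```python
# Toy sketch of tensor parallelism: split a weight matrix column-wise,
# compute partial outputs independently, then gather them back together.
import torch

x = torch.randn(8, 512)            # a batch of activations
w = torch.randn(512, 1024)         # the full weight matrix

w0, w1 = w.chunk(2, dim=1)         # in practice, each shard sits on its own GPU
y0 = x @ w0                        # partial output on "device 0"
y1 = x @ w1                        # partial output on "device 1"
y = torch.cat([y0, y1], dim=1)     # the all-gather step

assert torch.allclose(y, x @ w, atol=1e-5)
```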

Sessions about machine learning hardware

ML-optimized hardware is an emerging field, with new players entering the market and existing ones creating new tools for ML training. These sessions cover new hardware solutions for model training.

[CMP207] Choosing the right accelerator for training and inference

AWS provides a big variety of instance types that can be used for machine learning. Some are suited for training, some for inference, and they all have their pros and cons. This session covers these instances, benchmarks, and ideal use case guidelines for each of them.

[CMP313] Accelerate deep learning and innovate faster with AWS Trainium

Amazon EC2 Trn1 instances, powered by AWS Trainium chips, were announced during Re:Invent 2021 and recently became generally available. This session covers how they can be used for high-performance training and how they can help save on training costs.
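On Trn1, PyTorch training goes through the Neuron SDK’s torch-xla integration, so the training loop changes very little; here is a toy sketch, assuming a Trn1 instance with the Neuron SDK installed and a placeholder model standing in for a real one.

```python
# Toy sketch of a training step on Trainium via torch-xla.
# Assumes a Trn1 instance with the AWS Neuron SDK (torch-xla) installed.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()                       # the Trainium (XLA) device
model = nn.Linear(32, 2).to(device)            # toy placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(64, 32).to(device)
labels = torch.randint(0, 2, (64,)).to(device)

optimizer.zero_grad()
loss = loss_fn(model(inputs), labels)
loss.backward()
xm.optimizer_step(optimizer)                   # steps the optimizer and marks the XLA step
```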

[PRT280] Accelerate deep learning with Habana Gaudi–based Amazon EC2 DL1 instances (sponsored by Habana Labs, an Intel company)

Amazon EC2 DL1 instances are powered by Gaudi accelerators from Habana Labs (an Intel company). This session covers how they can be used for different deep learning applications.
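Training on DL1 follows a similar pattern through Habana’s SynapseAI PyTorch bridge; the sketch below is my rough understanding of the canonical loop, assuming a DL1 instance with the Habana software stack installed and a toy model standing in for a real one.

```python
# Rough sketch of a training step on Gaudi (HPU) via Habana's PyTorch bridge.
# Assumes the habana_frameworks package from the SynapseAI stack is installed.
import torch
import torch.nn as nn
import habana_frameworks.torch.core as htcore

device = torch.device("hpu")                   # the Gaudi accelerator
model = nn.Linear(32, 2).to(device)            # toy placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(64, 32).to(device)
labels = torch.randint(0, 2, (64,)).to(device)

optimizer.zero_grad()
loss = loss_fn(model(inputs), labels)
loss.backward()
htcore.mark_step()                             # flush the lazy-mode graph
optimizer.step()
htcore.mark_step()
```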

[PRT219] Deep learning on AWS with NVIDIA: From training to deployment (sponsored by NVIDIA)

NVIDIA is currently the leader in hardware for deep learning applications, both training and inference. This session covers efficient distributed training as well as streamlined model deployment using NVIDIA hardware on AWS.


Rustem Feyzkhanov

I'm a staff machine learning engineer at Instrumental, where I work on analytical models for the manufacturing industry, and an AWS Machine Learning Hero.