Domain 3: MLOps, Deployment & Orchestration

MLOps, Deployment & Orchestration

Exam Domain 3: Deployment and Orchestration of ML Workflows (22%)
Tasks: Select deployment infrastructure, script infrastructure, automate CI/CD

SageMaker Inference Options

Endpoint Types

┌─────────────────────────────────────────────────────────────┐
│              SageMaker Inference Options                     │
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │  Real-Time   │  │  Serverless  │  │    Async     │      │
│  │  Endpoints   │  │  Inference   │  │  Inference   │      │
│  ├──────────────┤  ├──────────────┤  ├──────────────┤      │
│  │ Persistent   │  │ Scale to 0   │  │ Queue-based  │      │
│  │ Always-on    │  │ Cold start   │  │ Long-running │      │
│  │ Low latency  │  │ Pay per use  │  │ Large payload│      │
│  │ ms response  │  │ Intermittent │  │ Up to 1GB    │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
│                                                             │
│  ┌──────────────┐  ┌──────────────┐                        │
│  │   Batch      │  │  Inference   │                        │
│  │  Transform   │  │  Pipelines   │                        │
│  ├──────────────┤  ├──────────────┤                        │
│  │ Offline      │  │ Chain models │                        │
│  │ Whole dataset│  │ Pre/post     │                        │
│  │ No endpoint  │  │ processing   │                        │
│  │ One-time job │  │ Serial steps │                        │
│  └──────────────┘  └──────────────┘                        │
└─────────────────────────────────────────────────────────────┘

Type	Latency	Payload	Scaling	Best For
Real-Time	ms	6 MB	Auto-scaling (always on)	Production APIs, interactive apps
Serverless	Cold start (~sec)	4 MB	Scale to 0	Intermittent traffic, cost-sensitive
Async	Minutes	1 GB	Auto-scaling + queue	Large payloads, long processing
Batch Transform	Hours	Unlimited	Managed job	Offline predictions, whole datasets
Inference Pipeline	ms	6 MB	Same as real-time	Multi-step: preprocess → predict → postprocess

Exam tip: Common scenario questions:
“Intermittent traffic, minimize cost” → Serverless
“Large images, processing takes minutes” → Async
“Score entire dataset nightly” → Batch Transform
“Sub-second response for API” → Real-Time

ELI5: The exam loves “which endpoint?” questions. The key is usage pattern: Real-Time = always on (like a running API server, costs money even when idle). Serverless = sleeps when idle (saves money, but has a cold-start delay). Async = drop a big file in a queue and come back later. Batch = score the whole database overnight as a one-time job. Match the pattern to the scenario and you’ll get it right every time.
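The matching rule above can be written down as a small decision function. This is only a study aid that encodes the payload limits and usage patterns from the table; the function name and its parameters are made up for illustration and are not part of any AWS SDK.

```python
def choose_inference_option(payload_mb: float, needs_immediate_response: bool,
                            traffic_is_intermittent: bool, whole_dataset: bool) -> str:
    """Map a usage pattern to an inference option using the limits in the table."""
    if whole_dataset:
        return "Batch Transform"      # offline scoring, no endpoint needed
    if payload_mb > 6 or not needs_immediate_response:
        return "Async"                # queued requests, payloads up to 1 GB
    if traffic_is_intermittent and payload_mb <= 4:
        return "Serverless"           # scales to zero, tolerates cold starts
    return "Real-Time"                # always-on, millisecond latency


print(choose_inference_option(0.5, True, True, False))    # intermittent, small payload
print(choose_inference_option(200, False, False, False))  # large payload, can wait
```

The order of the checks matters: the offline and large-payload cases are ruled out first, so Serverless is only suggested when the request would also fit its 4 MB limit.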

Production Variants & A/B Testing

Real-Time Endpoint with Production Variants:

                     ┌─────────────────┐
                     │  Endpoint       │
                     │                 │
  Traffic ──────────►│  Variant A: 90% │──► Model v1
                     │  Variant B: 10% │──► Model v2 (testing)
                     │                 │
                     └─────────────────┘

Use case: A/B test new model on small % of traffic before full rollout

ELI5: A/B testing for ML works just like A/B testing for websites. Send 90% of real traffic to the proven model, 10% to the new challenger. Measure actual performance metrics (accuracy, latency, business KPIs) on real users. If the new model wins, gradually shift traffic until it handles 100%. If it underperforms, roll back without anyone noticing. It’s like taste-testing a new recipe on a few tables before changing the whole menu.
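As a rough sketch of the 90/10 split, the dictionaries below follow the shape of the `ProductionVariants` list accepted by the SageMaker `create_endpoint_config` API. Traffic is divided in proportion to the variant weights, so weights of 9 and 1 give 90% and 10%. The variant and model names are placeholders, and the helper function is illustrative only.

```python
def traffic_share(variants):
    """Return each variant's share of traffic: its weight divided by the sum of weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


# Placeholder names; in practice these would be passed as ProductionVariants
# to create_endpoint_config.
production_variants = [
    {"VariantName": "VariantA", "ModelName": "model-v1", "InstanceType": "ml.m5.xlarge",
     "InitialInstanceCount": 2, "InitialVariantWeight": 9.0},
    {"VariantName": "VariantB", "ModelName": "model-v2", "InstanceType": "ml.m5.xlarge",
     "InitialInstanceCount": 1, "InitialVariantWeight": 1.0},
]

shares = traffic_share(production_variants)
print(shares)
```

Shifting traffic later does not require a new endpoint: the `update_endpoint_weights_and_capacities` API changes the weights of existing variants in place.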

Multi-Model Endpoints (MME)

Host multiple models on a single endpoint
Models loaded on-demand, shared instance
Cost-efficient when you have many models with sporadic traffic
Use case: per-customer models, per-region models
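With a multi-model endpoint, the caller picks the model per request through the `TargetModel` parameter of `InvokeEndpoint`. The sketch below only builds the request arguments; the endpoint name, the `customer-<id>.tar.gz` naming scheme, and the JSON body layout are assumptions made for illustration.

```python
import json


def build_mme_request(endpoint_name: str, customer_id: str, features: list) -> dict:
    """Build keyword arguments for an InvokeEndpoint call against a multi-model endpoint."""
    return {
        "EndpointName": endpoint_name,
        # TargetModel is the artifact path relative to the endpoint's S3 model prefix.
        "TargetModel": f"customer-{customer_id}.tar.gz",   # hypothetical naming scheme
        "ContentType": "application/json",
        "Body": json.dumps({"features": features}),
    }


request = build_mme_request("per-customer-models", "42", [1.0, 2.5])
print(request["TargetModel"])
```

These arguments would typically be passed to `boto3.client("sagemaker-runtime").invoke_endpoint(**request)`; the first call for a given model is slower because the artifact is loaded on demand.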

Multi-Container Endpoints

Run multiple containers on a single endpoint
Direct invocation (specify which container) or serial pipeline
Use case: ensemble models, different frameworks on same endpoint

Deployment Strategies

Blue/Green Deployment

Before:   [Blue: Model v1 ← 100% traffic]

During:   [Blue: Model v1 ← 100%]  [Green: Model v2 ← 0%]
          (Green is provisioned and tested)

Switch:   [Blue: Model v1 ← 0%]   [Green: Model v2 ← 100%]
          (Instant cutover)

Rollback: Switch traffic back to Blue if issues detected

ELI5: Blue/Green is one of the lowest-risk deployment strategies. You build a complete copy of your environment (Green) with the new model, test it thoroughly while Blue still serves all real traffic, then shift all traffic to Green in a single step. If anything goes wrong, one more switch sends traffic back to Blue. There is no downtime and rollback is fast, because a fully working environment is always available to fall back to. The trade-off is paying for two fleets while both are running.

Canary Deployment

Step 1:  [Model v1 ← 95%]  [Model v2 ← 5%]    (small canary)
Step 2:  [Model v1 ← 80%]  [Model v2 ← 20%]   (increase if healthy)
Step 3:  [Model v1 ← 0%]   [Model v2 ← 100%]  (full rollout)

Alarms: CloudWatch metrics trigger auto-rollback if errors spike

Linear (Rolling) Deployment

Step 1:  [v1 ← 90%]  [v2 ← 10%]
Step 2:  [v1 ← 80%]  [v2 ← 20%]
Step 3:  [v1 ← 70%]  [v2 ← 30%]
  ...gradually shift all traffic over fixed intervals

Shadow Testing

[Model v1 ← 100% (serves responses)]
      │
      └─── Copy of traffic ───► [Model v2 (log only, don't serve)]

Compare v1 vs v2 predictions offline — zero risk to users

ELI5: Shadow testing is the ultimate safe evaluation. The new model receives a copy of every real request and generates predictions, but those predictions are thrown away — users always see Model v1’s response. You just log Model v2’s outputs and compare them to v1 offline. Zero risk to users, but you get a complete picture of how the new model would perform on real production traffic before you ever expose it.
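The offline comparison step can be as simple as joining the two prediction logs on a request ID. The sketch below assumes each log has already been loaded into a dictionary keyed by request ID; the log format and function name are hypothetical.

```python
def compare_shadow_logs(live: dict, shadow: dict) -> dict:
    """Compare predictions that v1 (live) and v2 (shadow) made for the same requests."""
    shared = live.keys() & shadow.keys()          # only requests seen by both models
    disagreements = sorted(rid for rid in shared if live[rid] != shadow[rid])
    agreement = (len(shared) - len(disagreements)) / len(shared) if shared else 0.0
    return {"compared": len(shared), "agreement_rate": agreement,
            "disagreements": disagreements}


live_log = {"r1": "fraud", "r2": "ok", "r3": "ok", "r4": "ok"}
shadow_log = {"r1": "fraud", "r2": "fraud", "r3": "ok", "r4": "ok"}

report = compare_shadow_logs(live_log, shadow_log)
print(report)
```

Agreement alone does not say which model is right; the disagreeing request IDs are the ones worth checking against ground truth labels.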

SageMaker Deployment Guardrails

Auto-rollback: CloudWatch alarms trigger automatic rollback
Built-in support for canary, linear, and all-at-once strategies
Configure bake time (wait period before progressing)
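These guardrails are configured through the `DeploymentConfig` argument of the SageMaker `UpdateEndpoint` API. The sketch below builds that structure for the three traffic-shifting modes; the helper name, the 300-second termination wait, and the alarm name are assumptions for illustration.

```python
def build_deployment_config(strategy: str, step_percent: int,
                            bake_seconds: int, alarm_names: list) -> dict:
    """Build a DeploymentConfig dict for CANARY, LINEAR, or ALL_AT_ONCE traffic shifting."""
    routing = {"Type": strategy, "WaitIntervalInSeconds": bake_seconds}  # bake time
    size = {"Type": "CAPACITY_PERCENT", "Value": step_percent}
    if strategy == "CANARY":
        routing["CanarySize"] = size          # first slice of traffic sent to the new fleet
    elif strategy == "LINEAR":
        routing["LinearStepSize"] = size      # size of each equal traffic step
    elif strategy != "ALL_AT_ONCE":
        raise ValueError(f"unknown strategy: {strategy}")
    return {
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": routing,
            "TerminationWaitInSeconds": 300,  # keep the old fleet briefly for rollback
        },
        # CloudWatch alarms that trigger automatic rollback when they fire.
        "AutoRollbackConfiguration": {"Alarms": [{"AlarmName": n} for n in alarm_names]},
    }


canary = build_deployment_config("CANARY", 10, 300, ["endpoint-5xx-errors"])
print(canary["BlueGreenUpdatePolicy"]["TrafficRoutingConfiguration"])
```

The result would be passed as `DeploymentConfig=canary` to `boto3.client("sagemaker").update_endpoint(...)` together with the endpoint name and the new endpoint configuration name.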

Auto-Scaling for Endpoints

SageMaker Auto-Scaling:

  Scaling Policy → CloudWatch Metric → Add/Remove Instances

Target Tracking Policy (recommended):
  "Keep InvocationsPerInstance at 70"
  → SM automatically adjusts instance count

Step Scaling Policy:
  If metric > 80 for 3 minutes → add 2 instances
  If metric > 95 for 1 minute  → add 5 instances
  If metric < 30 for 10 minutes → remove 1 instance

Metric	What It Tracks
`InvocationsPerInstance`	Requests per instance (most common)
`CPUUtilization`	CPU usage
`GPUUtilization`	GPU usage
`ModelLatency`	Time to generate prediction
`OverheadLatency`	SageMaker overhead time

Min instances: 0 (serverless), 1+ (real-time)
Cool-down period: Wait before next scaling action (prevent thrashing)
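Endpoint auto-scaling is configured through Application Auto Scaling rather than SageMaker itself. The sketch below builds the arguments for `register_scalable_target` and `put_scaling_policy` for a target tracking policy, and adds a rough estimate of the instance count such a policy converges to. The endpoint and variant names, policy name, and cool-down values are illustrative, and the estimate is a simplification of what the service actually does.

```python
import math


def target_tracking_policy(endpoint: str, variant: str, target: float,
                           min_instances: int, max_instances: int):
    """Build arguments for register_scalable_target and put_scaling_policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    scalable_target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{variant}-invocations-target",      # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"},
            "ScaleOutCooldown": 60,    # react quickly to load
            "ScaleInCooldown": 300,    # remove capacity slowly to avoid thrashing
        },
    }
    return scalable_target, policy


def estimate_instances(total_invocations: float, target_per_instance: float,
                       min_instances: int, max_instances: int) -> int:
    """Rough instance count that keeps per-instance load at or below the target."""
    needed = math.ceil(total_invocations / target_per_instance)
    return max(min_instances, min(max_instances, needed))


scalable_target, policy = target_tracking_policy("my-endpoint", "VariantA", 70, 1, 4)
print(estimate_instances(140, 70, 1, 4))
print(estimate_instances(350, 70, 1, 4))
```

The second estimate is capped by the maximum capacity, which is a reminder that a target tracking policy cannot hold its target once the variant hits `MaxCapacity`.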

Edge Deployment

SageMaker Neo

Trained Model → [Neo Compiler] → Optimized Model → Deploy to Edge
                  ↓
              Compiles for target hardware:
              ARM, Intel, NVIDIA, Qualcomm, etc.
              Up to 2x performance improvement

Train once, run anywhere: the same trained model is compiled separately for each target hardware platform
Supports: TensorFlow, PyTorch, MXNet, XGBoost, ONNX
Targets: EC2, IoT Greengrass, mobile devices, embedded systems

ELI5: Neo is a compiler for ML models. Just like a C++ compiler converts source code into CPU-specific machine instructions that run faster than interpreted code, Neo converts ML models into hardware-optimized binary formats. The same PyTorch model compiled for an ARM chip runs up to 2x faster than the generic version — with no change to accuracy.
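A Neo compilation is started with the SageMaker `CreateCompilationJob` API. The sketch below builds its arguments for a PyTorch model; the job name, role ARN, S3 locations, input name `input0`, and input shape are placeholders, and the 900-second runtime cap is an arbitrary choice.

```python
import json


def build_compilation_job(job_name: str, role_arn: str, model_s3_uri: str,
                          output_s3_uri: str, target_device: str, input_shape: list) -> dict:
    """Build keyword arguments for a SageMaker Neo CreateCompilationJob call."""
    return {
        "CompilationJobName": job_name,
        "RoleArn": role_arn,
        "InputConfig": {
            "S3Uri": model_s3_uri,
            # Neo needs the input tensor name and shape as a JSON string.
            "DataInputConfig": json.dumps({"input0": input_shape}),
            "Framework": "PYTORCH",
        },
        "OutputConfig": {
            "S3OutputLocation": output_s3_uri,
            "TargetDevice": target_device,    # the hardware to optimize for
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 900},
    }


job = build_compilation_job(
    "resnet-jetson", "arn:aws:iam::123456789012:role/NeoRole",   # placeholder ARN
    "s3://my-bucket/model.tar.gz", "s3://my-bucket/compiled/",
    "jetson_nano", [1, 3, 224, 224],
)
print(job["OutputConfig"]["TargetDevice"])
```

Compiling the same model for a second device means a second job with a different `TargetDevice`; the trained artifact in S3 stays the same.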

AWS IoT Greengrass

Run ML inference locally on edge devices
Models trained in cloud → deployed to edge
Operates offline (disconnected environments)
Use case: factory floor defect detection, autonomous vehicles

Containerization & Compute

Docker in ML

SageMaker uses Docker containers for everything:

  ┌────────────────────────────────────────────┐
  │  Docker Container                          │
  │  ├─ /opt/ml/model/        (model artifacts)│
  │  ├─ /opt/ml/input/data/   (training data)  │
  │  ├─ /opt/ml/output/       (results)        │
  │  └─ Your code + dependencies               │
  └────────────────────────────────────────────┘

Two modes:
  Training container:  reads data → trains → writes model
  Inference container: loads model → serves predictions via HTTP

ELI5: Everything in SageMaker runs inside Docker containers. Think of containers like standardized shipping containers: no matter what’s inside (Python code, R scripts, TensorFlow, PyTorch), the outer box is always the same shape. SageMaker can pick it up and run it on any machine without worrying about dependency conflicts. Your model, its libraries, and your code all travel together as one portable unit.
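For inference containers, "serves predictions via HTTP" means a specific contract: SageMaker sends health checks to `GET /ping` and prediction requests to `POST /invocations` on port 8080. The sketch below shows that contract with the standard library only. The `predict` function and the JSON request layout are stand-ins for a real model loaded from `/opt/ml/model`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in for a real model (hypothetical logic: sum the features)."""
    return sum(features)


def handle(method: str, path: str, body: bytes):
    """Route a request according to the SageMaker hosting contract."""
    if method == "GET" and path == "/ping":
        return 200, b""                                   # 200 means the container is healthy
    if method == "POST" and path == "/invocations":
        features = json.loads(body)["features"]
        return 200, json.dumps({"prediction": predict(features)}).encode()
    return 404, b""


class Handler(BaseHTTPRequestHandler):
    def _respond(self, body: bytes = b""):
        status, payload = handle(self.command, self.path, body)
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self):
        self._respond()

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._respond(self.rfile.read(length))


def serve(port: int = 8080):
    """Start the server; call this from the container's entry point."""
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()


print(handle("GET", "/ping", b""))
print(handle("POST", "/invocations", b'{"features": [1, 2, 3]}'))
```

Keeping the routing in a plain function makes the contract easy to unit test without starting a server; production containers usually put a real web server in front of the same two routes.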

Amazon ECR (Elastic Container Registry)

Private Docker registry for storing container images
SageMaker pulls custom training/inference containers from ECR
Integrates with ECS/EKS for deployment

Amazon ECS (Elastic Container Service)

Feature	Details
Launch types	Fargate (serverless) or EC2 (self-managed)
Task Definition	Blueprint for containers (image, CPU, memory, ports)
Service	Maintains desired count of running tasks
Use for ML	Custom inference services, preprocessing pipelines

Amazon EKS (Elastic Kubernetes Service)

Feature	Details
Kubernetes	Open-source container orchestration
Node types	EC2, Fargate, or both
Use for ML	Complex multi-service ML systems, Kubeflow
SageMaker + K8s	SageMaker Operators for Kubernetes

AWS Batch

Managed batch processing
Automatically provisions optimal compute (EC2 or Fargate)
Use case: large-scale offline data processing, batch predictions

ECS vs EKS vs Batch:
  ECS:   Simple container orchestration (AWS-native)
  EKS:   Kubernetes (portable, complex)
  Batch: Fire-and-forget batch jobs

Infrastructure as Code

AWS CloudFormation

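# Define ML infrastructure as YAML/JSON templates
Resources:
  SageMakerEndpoint:
    Type: AWS::SageMaker::Endpoint
    Properties:
      # Ref on a SageMaker resource returns its ARN, so read the name attribute instead
      EndpointConfigName: !GetAtt EndpointConfig.EndpointConfigName

  EndpointConfig:
    Type: AWS::SageMaker::EndpointConfig
    Properties:
      ProductionVariants:
        - VariantName: AllTraffic        # required; this name is only an example
          # Model is an AWS::SageMaker::Model resource defined elsewhere in the template
          ModelName: !GetAtt Model.ModelName
          InstanceType: ml.m5.xlarge
          InitialInstanceCount: 2
          InitialVariantWeight: 1.0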

Templates: YAML/JSON describing infrastructure
Stacks: Running instance of a template
Change Sets: Preview changes before applying
Drift Detection: Detect manual changes to infrastructure
Use case: reproducible ML environments, version-controlled infra

AWS CDK (Cloud Development Kit)

Define infrastructure using programming languages (Python, TypeScript, Java, etc.)
Synthesizes to CloudFormation templates
Higher-level constructs than raw CloudFormation
Use case: complex infrastructure with logic, loops, conditionals

CloudFormation vs CDK:
  CloudFormation: Declarative (YAML/JSON) — what you want
  CDK:            Imperative (code) — how to build it
  CDK compiles → CloudFormation template → AWS deploys
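The "code in, template out" idea can be illustrated without the CDK library. The sketch below is plain Python, not the CDK API: it uses a loop to generate one `AWS::SageMaker::EndpointConfig` resource per model and emits a CloudFormation-style template, which is the kind of work `cdk synth` does on your behalf. The model names and logical IDs are made up.

```python
import json


def synthesize_template(model_names: list, instance_type: str = "ml.m5.xlarge") -> dict:
    """Generate a CloudFormation-style template with one endpoint config per model."""
    resources = {}
    for index, model_name in enumerate(model_names):
        resources[f"EndpointConfig{index}"] = {        # logical ID generated in a loop
            "Type": "AWS::SageMaker::EndpointConfig",
            "Properties": {
                "ProductionVariants": [{
                    "VariantName": "AllTraffic",
                    "ModelName": model_name,
                    "InstanceType": instance_type,
                    "InitialInstanceCount": 1,
                    "InitialVariantWeight": 1.0,
                }]
            },
        }
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}


template = synthesize_template(["model-eu", "model-us"])   # hypothetical model names
print(json.dumps(template, indent=2)[:200])
```

Writing the same two resources by hand in YAML is easy; the programmatic approach pays off when the list of models grows or depends on conditions.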

CI/CD for ML

AWS CodePipeline

ML CI/CD Pipeline

Source (CodeCommit/GitHub)
    ↓
Build (CodeBuild)
    ↓
Test (CodeBuild — run tests)
    ↓
Deploy (CodeDeploy / SageMaker)

┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│  Source   │→  │  Build   │→  │  Test    │→  │  Deploy  │
│ CodeCommit│   │CodeBuild │   │CodeBuild │   │CodeDeploy│
│ GitHub    │   │(build    │   │(run unit │   │(deploy   │
│ S3       │   │ container│   │  tests)  │   │ model)   │
└──────────┘   └──────────┘   └──────────┘   └──────────┘

Service	Purpose
CodeCommit	Git repository (AWS managed)
CodeBuild	Build and test (compile, run tests, create container)
CodeDeploy	Deploy to EC2, ECS, Lambda
CodePipeline	Orchestrate the full CI/CD pipeline

SageMaker Pipelines

The native ML CI/CD system — purpose-built for ML workflows.

SageMaker Pipeline Example:

  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
  │ Process  │→  │  Train   │→  │ Evaluate │→  │ Register │
  │ Data     │   │  Model   │   │  Model   │   │  Model   │
  └──────────┘   └──────────┘   └──────────┘   └──────────┘
                                      │
                                      ↓
                                ┌──────────┐
                                │Condition: │
                                │accuracy   │
                                │> 0.90?    │
                                └─────┬─────┘
                              Yes ↙      ↘ No
                        ┌──────────┐  ┌──────────┐
                        │ Register │  │  Fail /  │
                        │ + Deploy │  │  Retrain │
                        └──────────┘  └──────────┘

Step Type	What It Does
Processing	Data processing with SageMaker Processing
Training	Model training
Tuning	Hyperparameter tuning
Transform	Batch inference
Condition	Branch based on metric values
RegisterModel	Register in Model Registry
CreateModel	Create deployable model
Lambda	Run custom Lambda function
Fail	Mark pipeline as failed

Exam tip: SageMaker Pipelines is the answer when the question asks about ML-specific CI/CD or automated retraining workflows.

ELI5: SageMaker Pipelines is like a factory assembly line for ML: raw data goes in one end, a validated deployed model comes out the other. Each step (processing, training, evaluation, registration) is automated and tracked. If the model doesn’t hit an accuracy threshold, the pipeline stops and can trigger retraining. The whole thing is reproducible — run it again next month with new data and get consistent results.
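The control flow of the accuracy gate can be simulated in a few lines. This is a plain-Python illustration of the step order, not the SageMaker SDK; the training and evaluation functions are stubs supplied by the caller, and the 0.90 threshold is taken from the diagram.

```python
def run_pipeline(rows, train_fn, evaluate_fn, threshold=0.90):
    """Simulate Process -> Train -> Evaluate -> Condition -> Register or Fail."""
    executed = ["Process"]
    clean_rows = [row for row in rows if row is not None]   # stand-in for data processing
    executed.append("Train")
    model = train_fn(clean_rows)
    executed.append("Evaluate")
    accuracy = evaluate_fn(model, clean_rows)
    if accuracy > threshold:                                # the condition gate
        executed.append("Register")
        return "Succeeded", executed
    executed.append("Fail")
    return "Failed", executed


good = run_pipeline([1, None, 2], lambda rows: "model", lambda model, rows: 0.93)
bad = run_pipeline([1, None, 2], lambda rows: "model", lambda model, rows: 0.85)
print(good)
print(bad)
```

In a real pipeline each entry in `executed` corresponds to a step type from the table above, and the branch is a Condition step rather than an `if` statement.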

Workflow Orchestration

Amazon EventBridge

Event bus — react to events from AWS services and custom apps
Schedule rules (cron) or event-pattern matching
Use case: trigger retraining when new data arrives in S3

S3 "Object Created" event (bucket has EventBridge notifications enabled)
    → EventBridge rule (pattern: "new training data in s3://ml-data/")
    → Target: Step Functions (start retraining pipeline)
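An EventBridge rule for this flow is defined by an event pattern. The pattern below uses the `source`, `detail-type`, and `detail.bucket.name` fields of S3 "Object Created" events, with the bucket name taken from the example above. The `matches` function is a heavily simplified stand-in for EventBridge's matching logic, written only to show how the pattern filters events.

```python
EVENT_PATTERN = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["ml-data"]}},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: lists mean 'any of these values', dicts are matched recursively."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


new_data_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {"bucket": {"name": "ml-data"}, "object": {"key": "train.csv"}},
}
other_bucket_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {"bucket": {"name": "logs"}, "object": {"key": "app.log"}},
}

print(matches(EVENT_PATTERN, new_data_event))
print(matches(EVENT_PATTERN, other_bucket_event))
```

Fields that the pattern does not mention, such as the object key here, are ignored, so a pattern only needs to list what it wants to filter on.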

AWS Step Functions

State machine orchestration — coordinate multiple AWS services
Visual workflow designer
Built-in error handling, retries, parallel execution
Integrates with SageMaker, Lambda, Glue, ECS, and more

Step Functions States:
  Task      →  Call an AWS service or Lambda
  Choice    →  Branch based on condition
  Parallel  →  Run branches concurrently
  Map       →  Iterate over an array
  Wait      →  Pause for a set duration or until a timestamp
  Succeed   →  End successfully
  Fail      →  End with error
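State machines are written in Amazon States Language, a JSON format. The sketch below defines a small retraining workflow that uses the Task, Choice, Succeed, and Fail states, plus a helper that checks every transition points at a defined state. The Lambda function name `evaluate-model`, the `$.accuracy` field, and the abbreviated training parameters are assumptions for illustration; a real training task needs the full job specification.

```python
import json

definition = {
    "StartAt": "TrainModel",
    "States": {
        "TrainModel": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "Parameters": {"TrainingJobName.$": "$.job_name"},   # abbreviated
            "Retry": [{"ErrorEquals": ["States.TaskFailed"],
                       "IntervalSeconds": 60, "MaxAttempts": 2}],
            "ResultPath": "$.training",
            "Next": "EvaluateModel",
        },
        "EvaluateModel": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {"FunctionName": "evaluate-model", "Payload.$": "$"},
            "OutputPath": "$.Payload",        # keep only the function's return value
            "Next": "CheckAccuracy",
        },
        "CheckAccuracy": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.accuracy", "NumericGreaterThan": 0.9,
                         "Next": "Approved"}],
            "Default": "Rejected",
        },
        "Approved": {"Type": "Succeed"},
        "Rejected": {"Type": "Fail", "Error": "AccuracyBelowThreshold"},
    },
}


def undefined_targets(machine: dict) -> list:
    """Return transition targets that do not exist as states."""
    states = machine["States"]
    targets = {machine["StartAt"]}
    for state in states.values():
        targets.update(t for t in (state.get("Next"), state.get("Default")) if t)
        targets.update(choice["Next"] for choice in state.get("Choices", []))
    return sorted(targets - set(states))


print(undefined_targets(definition))
print(len(json.dumps(definition)) > 0)
```

The `Retry` block on the training task is what the "built-in error handling" bullet refers to: the service re-runs the task without any custom code.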

ELI5: Step Functions is the general-purpose workflow engine — it can orchestrate any AWS service in any sequence. SageMaker Pipelines is ML-specific and knows about training jobs, model registries, and evaluation steps natively. MWAA (Airflow) is for teams that already have Python-based DAG workflows they want to keep. For the exam: ML workflow with training/evaluation gates = SageMaker Pipelines; everything else (ETL, cross-service automation) = Step Functions.

Amazon MWAA (Managed Apache Airflow)

Managed Apache Airflow — Python-based workflow orchestration
DAGs (Directed Acyclic Graphs) define workflows
Rich ecosystem of operators and hooks
Use case: complex multi-step ML pipelines already defined in Airflow

Git Workflow for ML

Gitflow for ML:

  main ─────────●─────────────────●──────────
                │                 ↑
  develop ──────┼──●──●──●───────┤
                │  │     │       │
  feature ──────┼──┘     │       │
  branch        │        │       │
                │        └───────┘
                         (merge when model approved)

GitHub Flow (simpler):
  main ──────●──────●──────●──────
             │      ↑      │
  feature ───┘──────┘      │
  branch                   │

AWS Lake Formation

Centralized data governance for data lakes.

Feature	Purpose
Centralized permissions	Single place to manage access to S3 data lake
Fine-grained access	Column-level and row-level security
Data Catalog	Built on Glue Data Catalog
Cross-account sharing	Share data across AWS accounts
Governed tables	ACID transactions on S3
Data filters	Cell-level security (row + column combined)

Lake Formation Governance

Lake Formation sits on top of S3 + Glue:

  Users / Services
       ↓
  Lake Formation (permissions)
       ↓
  Glue Data Catalog (metadata)
       ↓
  S3 Data Lake (storage)
       ↓
  Accessed by: Athena, Redshift, EMR, Glue ETL

Exam tip: When the question asks about fine-grained access control for a data lake → Lake Formation.

Quick Reference: When to Use What

Scenario	Service
ML-specific CI/CD pipeline	SageMaker Pipelines
General CI/CD	CodePipeline + CodeBuild + CodeDeploy
React to AWS events	EventBridge
Visual workflow with AWS services	Step Functions
Python DAG-based workflows	MWAA (Airflow)
Define infra as YAML	CloudFormation
Define infra as code (Python/TS)	CDK
Store Docker images	ECR
Run containers (simple)	ECS + Fargate
Run containers (Kubernetes)	EKS
Batch processing jobs	AWS Batch
Deploy to edge devices	SageMaker Neo + IoT Greengrass
Data lake governance	Lake Formation
A/B test models	Production Variants
Gradual traffic shift	Canary / Linear deployment
Zero-risk model comparison	Shadow testing

Additional Services to Know

AWS Lambda Limits

Limit	Value
Timeout	15 minutes max
Memory	128 MB – 10 GB
Package size	50 MB zipped, 250 MB unzipped
Concurrency	1000 default (can increase)
Payload	6 MB (sync), 256 KB (async)

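Exam trap: “Process 2GB file in Lambda” is almost always the wrong answer. A 2 GB file cannot be passed as an invocation payload (6 MB sync limit), and heavy processing risks the 15-minute timeout. Use ECS/Fargate or SageMaker Processing.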
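The limits in the table lend themselves to a quick pre-flight check. The sketch below encodes the timeout, memory, and synchronous payload limits listed above and reports which ones a workload would exceed; the function and key names are made up for illustration.

```python
LAMBDA_LIMITS = {
    "timeout_s": 15 * 60,       # 15 minutes
    "memory_mb": 10 * 1024,     # 10 GB
    "sync_payload_mb": 6,       # synchronous invocation payload
}


def lambda_blockers(runtime_s: float, memory_mb: float, payload_mb: float) -> list:
    """Return the names of the Lambda limits that a workload would exceed."""
    blockers = []
    if runtime_s > LAMBDA_LIMITS["timeout_s"]:
        blockers.append("timeout")
    if memory_mb > LAMBDA_LIMITS["memory_mb"]:
        blockers.append("memory")
    if payload_mb > LAMBDA_LIMITS["sync_payload_mb"]:
        blockers.append("payload")
    return blockers


print(lambda_blockers(120, 512, 2048))   # 2 GB sent as a payload
print(lambda_blockers(3600, 512, 1))     # one-hour job
print(lambda_blockers(60, 512, 1))       # fits within all three limits
```

An empty list means none of these three limits rules Lambda out; it does not by itself make Lambda the best choice.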

Amazon Redshift

Feature	Details
Redshift ML	Run ML predictions using SQL (`CREATE MODEL` — uses Autopilot under the hood)
Redshift Spectrum	Query S3 data directly without loading into warehouse
Firehose target	Kinesis Firehose can deliver directly to Redshift

Use Redshift when data is already in a warehouse and needs ML predictions.

Glue Job Bookmarks

Run 1: Process files A, B, C → bookmark saves position
Run 2: Only process files D, E (new since bookmark)
        Files A, B, C are SKIPPED

Without bookmarks: Reprocesses everything every run (wasteful)
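The two runs above can be reproduced with a toy model of a bookmark. This sketch remembers processed file names in a set, which is a simplification: the real Glue feature persists its own state between job runs, and the function name here is hypothetical.

```python
def run_job(available_files: list, bookmark: set):
    """Process only files not yet recorded in the bookmark, then advance the bookmark."""
    new_files = [name for name in available_files if name not in bookmark]
    return new_files, bookmark | set(new_files)


bookmark = set()
run1, bookmark = run_job(["A", "B", "C"], bookmark)
run2, bookmark = run_job(["A", "B", "C", "D", "E"], bookmark)

print(run1)
print(run2)
```

Passing an empty bookmark on every call models the "without bookmarks" case: each run would return the full file list again.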

Glue DataBrew vs Glue ETL vs Data Wrangler

	Glue DataBrew	Glue ETL	Data Wrangler
User	Data analyst (no code)	Data engineer (PySpark)	Data scientist (ML focus)
Scale	Moderate	Petabyte (Spark)	Moderate
ML features	None	FindMatches (ML-based record matching)	Target leakage detection
Output	S3, Glue Catalog	S3, JDBC	SageMaker Pipeline step code

SageMaker Pipelines — Step Caching

Run 1: ProcessingStep(data_v1) → TrainingStep(hp_v1) → Eval → Register
       All steps execute. Outputs cached.

Run 2 (only hyperparams changed):
       ProcessingStep(data_v1) → CACHED! Skip.
       TrainingStep(hp_v2) → Executes.
       Eval → Executes.

Saves hours of compute when only part of pipeline changes.
Configure: CacheConfig(enable_caching=True, expire_after="P30D")
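The caching behavior can be pictured as a lookup keyed on a step's name and arguments. The sketch below is a plain-Python simulation of that idea, not the SageMaker implementation; the class, the hashing scheme, and the stub step functions are assumptions, and cache expiry is left out.

```python
import hashlib
import json


class StepCache:
    """Toy cache: a step is skipped when its name and arguments were seen before."""

    def __init__(self):
        self._store = {}

    def run(self, step_name: str, arguments: dict, fn):
        raw = json.dumps([step_name, arguments], sort_keys=True).encode()
        key = hashlib.sha256(raw).hexdigest()
        if key in self._store:
            return self._store[key], "cached"       # same arguments: reuse the output
        result = fn(**arguments)
        self._store[key] = result
        return result, "executed"


def process(data):
    return f"features({data})"


def train(hp):
    return f"model({hp})"


cache = StepCache()
statuses = [
    cache.run("Processing", {"data": "data_v1"}, process)[1],   # run 1
    cache.run("Training", {"hp": "hp_v1"}, train)[1],           # run 1
    cache.run("Processing", {"data": "data_v1"}, process)[1],   # run 2: unchanged
    cache.run("Training", {"hp": "hp_v2"}, train)[1],           # run 2: new hyperparameters
]
print(statuses)
```

Only the step whose arguments changed runs again in the second pass, which is the saving the example above describes.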

Pipeline Step Types (Complete)

Step Type	Purpose
ProcessingStep	Data prep, evaluation scripts
TrainingStep	Model training
TuningStep	Hyperparameter tuning
TransformStep	Batch predictions
CreateModelStep	Prepare model for deployment
RegisterModel	Add to Model Registry
ConditionStep	Branch if metric > threshold
CallbackStep	Wait for human approval / external event
FailStep	Fail pipeline explicitly
LambdaStep	Run Lambda function
QualityCheckStep	Automated data/model quality gates
ClarifyCheckStep	Automated bias/explainability gates
EMRStep	Run EMR job
