Domain 3: MLOps, Deployment & Orchestration
Exam Domain 3: Deployment and Orchestration of ML Workflows (22%)
Tasks: Select deployment infrastructure, script infrastructure, automate CI/CD
SageMaker Inference Options
Endpoint Types

┌─────────────────────────────────────────────────────────────┐
│ SageMaker Inference Options │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Real-Time │ │ Serverless │ │ Async │ │
│ │ Endpoints │ │ Inference │ │ Inference │ │
│ ├──────────────┤ ├──────────────┤ ├──────────────┤ │
│ │ Persistent │ │ Scale to 0 │ │ Queue-based │ │
│ │ Always-on │ │ Cold start │ │ Long-running │ │
│ │ Low latency │ │ Pay per use │ │ Large payload│ │
│ │ ms response │ │ Intermittent │ │ Up to 1GB │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Batch │ │ Inference │ │
│ │ Transform │ │ Pipelines │ │
│ ├──────────────┤ ├──────────────┤ │
│ │ Offline │ │ Chain models │ │
│ │ Whole dataset│ │ Pre/post │ │
│ │ No endpoint │ │ processing │ │
│ │ One-time job │ │ Serial steps │ │
│ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
| Type | Latency | Payload | Scaling | Best For |
|---|---|---|---|---|
| Real-Time | ms | 6 MB | Auto-scaling (always on) | Production APIs, interactive apps |
| Serverless | Cold start (~sec) | 4 MB | Scale to 0 | Intermittent traffic, cost-sensitive |
| Async | Minutes | 1 GB | Auto-scaling + queue | Large payloads, long processing |
| Batch Transform | Hours | No dataset size limit | Managed job | Offline predictions, whole datasets |
| Inference Pipeline | ms | 6 MB | Same as real-time | Multi-step: preprocess → predict → postprocess |
Exam tip: Common scenario questions:
- “Intermittent traffic, minimize cost” → Serverless
- “Large images, processing takes minutes” → Async
- “Score entire dataset nightly” → Batch Transform
- “Sub-second response for API” → Real-Time
ELI5: The exam loves “which endpoint?” questions. The key is usage pattern: Real-Time = always on (like a running API server, costs money even when idle). Serverless = sleeps when idle (saves money, but has a cold-start delay). Async = drop a big file in a queue and come back later. Batch = score the whole database overnight as a one-time job. Match the pattern to the scenario and you’ll get it right every time.
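The selection rules above can be sketched as a small decision function. This is an illustrative helper, not an AWS API: `Workload` and `choose_inference_option` are hypothetical names, and the 6 MB and 4 MB thresholds are simply the payload limits from the table above.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    payload_mb: float     # largest single request, in MB
    long_running: bool    # True if one prediction takes minutes
    intermittent: bool    # True if traffic has long idle gaps
    whole_dataset: bool   # True if scoring a stored dataset offline


def choose_inference_option(w: Workload) -> str:
    """Map a usage pattern to an inference option (limits from the table above)."""
    if w.whole_dataset:
        return "Batch Transform"   # offline job, no endpoint needed
    if w.long_running or w.payload_mb > 6:
        return "Async"             # queue-based, payloads up to 1 GB
    if w.intermittent and w.payload_mb <= 4:
        return "Serverless"        # scales to zero, accepts a cold start
    return "Real-Time"             # always-on, millisecond latency


nightly = Workload(payload_mb=1, long_running=False, intermittent=False, whole_dataset=True)
print(choose_inference_option(nightly))
```

The order of the checks matters: the offline and large-payload cases are ruled out first, so Real-Time is the fallback rather than the default guess.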
Production Variants & A/B Testing
Real-Time Endpoint with Production Variants:
┌─────────────────┐
│ Endpoint │
│ │
Traffic ──────────►│ Variant A: 90% │──► Model v1
│ Variant B: 10% │──► Model v2 (testing)
│ │
└─────────────────┘
Use case: A/B test new model on small % of traffic before full rollout
ELI5: A/B testing for ML works just like A/B testing for websites. Send 90% of real traffic to the proven model, 10% to the new challenger. Measure actual performance metrics (accuracy, latency, business KPIs) on real users. If the new model wins, gradually shift traffic until it handles 100%. If it underperforms, roll back without anyone noticing. It’s like taste-testing a new recipe on a few tables before changing the whole menu.
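A minimal sketch of the 90/10 split as a production-variant list. The field names follow the `ProductionVariants` structure of the SageMaker `CreateEndpointConfig` API. The model names, variant names, and instance type are placeholders, and no AWS call is made here.

```python
def build_production_variants(champion: str, challenger: str) -> list[dict]:
    """Two variants on one endpoint; weights are relative, not percentages."""
    common = {"InstanceType": "ml.m5.xlarge", "InitialInstanceCount": 1}
    return [
        {"VariantName": "VariantA", "ModelName": champion,
         "InitialVariantWeight": 9, **common},
        {"VariantName": "VariantB", "ModelName": challenger,
         "InitialVariantWeight": 1, **common},
    ]


def traffic_share(variants: list[dict]) -> dict[str, float]:
    """Each variant receives weight / sum(weights) of the requests."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


variants = build_production_variants("model-v1", "model-v2")  # hypothetical model names
print(traffic_share(variants))
```

With boto3, a list like this would typically be passed as `ProductionVariants=variants` to `create_endpoint_config`, and the weights can later be changed on a live endpoint with `update_endpoint_weights_and_capacities`.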
Multi-Model Endpoints (MME)
- Host multiple models on a single endpoint
- Models loaded on-demand, shared instance
- Cost-efficient when you have many models with sporadic traffic
- Use case: per-customer models, per-region models
Multi-Container Endpoints
- Run multiple containers on a single endpoint
- Direct invocation (specify which container) or serial pipeline
- Use case: ensemble models, different frameworks on same endpoint
Deployment Strategies

Blue/Green Deployment
Before: [Blue: Model v1 ← 100% traffic]
During: [Blue: Model v1 ← 100%] [Green: Model v2 ← 0%]
(Green is provisioned and tested)
Switch: [Blue: Model v1 ← 0%] [Green: Model v2 ← 100%]
(Instant cutover)
Rollback: Switch traffic back to Blue if issues detected
ELI5: Blue/Green is one of the safest deployment strategies. You build a complete copy of your environment (Green) with the new model, test it thoroughly while Blue still serves all real traffic, then switch all traffic over in one step. If anything goes wrong, switching back returns traffic to Blue. There is no downtime and the risk is low, because a fully working environment is always available to fall back to.
Canary Deployment
Step 1: [Model v1 ← 95%] [Model v2 ← 5%] (small canary)
Step 2: [Model v1 ← 80%] [Model v2 ← 20%] (increase if healthy)
Step 3: [Model v1 ← 0%] [Model v2 ← 100%] (full rollout)
Alarms: CloudWatch metrics trigger auto-rollback if errors spike
Linear (Rolling) Deployment
Step 1: [v1 ← 90%] [v2 ← 10%]
Step 2: [v1 ← 80%] [v2 ← 20%]
Step 3: [v1 ← 70%] [v2 ← 30%]
...gradually shift all traffic over fixed intervals
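The linear schedule above can be generated with a few lines of arithmetic. This is a plain-Python illustration of the traffic percentages only; `linear_schedule` is a hypothetical helper and does not shift any real traffic.

```python
def linear_schedule(step_percent: int) -> list[tuple[int, int]]:
    """Return (v1_percent, v2_percent) after each step of a linear shift."""
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be in 1..100")
    schedule = []
    shifted = 0
    while shifted < 100:
        shifted = min(100, shifted + step_percent)  # the last step may be smaller
        schedule.append((100 - shifted, shifted))
    return schedule


for v1, v2 in linear_schedule(10)[:3]:
    print(f"[v1 <- {v1}%] [v2 <- {v2}%]")
```

A canary deployment is the same idea with two unequal steps: a small first shift, then everything that remains.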
Shadow Testing
[Model v1 ← 100% (serves responses)]
│
└─── Copy of traffic ───► [Model v2 (log only, don't serve)]
Compare v1 vs v2 predictions offline — zero risk to users
ELI5: Shadow testing is the safest way to evaluate a new model. The new model receives a copy of every real request and generates predictions, but those predictions are never returned to users, who always see Model v1’s response. You log Model v2’s outputs and compare them to v1 offline. There is zero risk to users, yet you get a complete picture of how the new model would perform on real production traffic before you ever expose it.
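The shadow pattern can be sketched in plain Python. This is a conceptual illustration rather than the SageMaker implementation: `serve_with_shadow` and the two stand-in models are hypothetical.

```python
from typing import Callable

Model = Callable[[dict], float]


def serve_with_shadow(request: dict, live: Model, shadow: Model, log: list) -> float:
    """Return the live model's prediction; record the shadow's for offline comparison."""
    live_pred = live(request)
    try:
        shadow_pred = shadow(request)   # same request, copied to the shadow model
    except Exception as exc:            # a shadow failure must never reach users
        log.append({"request": request, "live": live_pred, "shadow_error": repr(exc)})
    else:
        log.append({"request": request, "live": live_pred, "shadow": shadow_pred})
    return live_pred                    # users only ever see the live model


def model_v1(req: dict) -> float:       # hypothetical stand-in for Model v1
    return req["x"] * 2


def model_v2(req: dict) -> float:       # hypothetical stand-in for Model v2
    return req["x"] * 3


comparison_log: list = []
response = serve_with_shadow({"x": 4}, model_v1, model_v2, comparison_log)
```

The `try`/`except` is the important design choice: the shadow model is untested, so its errors are logged and never allowed to affect the live response.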
SageMaker Deployment Guardrails
- Auto-rollback: CloudWatch alarms trigger automatic rollback
- Built-in support for canary, linear, and all-at-once strategies
- Configure bake time (wait period before progressing)
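A sketch of how these guardrail settings map onto the `DeploymentConfig` structure accepted by the SageMaker `UpdateEndpoint` API. The alarm name and the numeric defaults are placeholder assumptions, and nothing is sent to AWS.

```python
def canary_deployment_config(alarm_name: str, canary_percent: int = 10,
                             bake_seconds: int = 300) -> dict:
    """Blue/green update that shifts a canary first, then the remaining traffic."""
    return {
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",  # other modes: "LINEAR", "ALL_AT_ONCE"
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": canary_percent},
                "WaitIntervalInSeconds": bake_seconds,  # bake time before the next shift
            },
            "TerminationWaitInSeconds": 120,  # keep the old fleet briefly after cutover
        },
        "AutoRollbackConfiguration": {
            # If this CloudWatch alarm fires during the update, traffic rolls back
            "Alarms": [{"AlarmName": alarm_name}],
        },
    }


config = canary_deployment_config("endpoint-5xx-errors")  # hypothetical alarm name
```

With boto3 this dictionary would be passed as `DeploymentConfig=config` in an `update_endpoint` call, alongside the endpoint name and the new endpoint config name.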
Auto-Scaling for Endpoints
SageMaker Auto-Scaling:
Scaling Policy → CloudWatch Metric → Add/Remove Instances
Target Tracking Policy (recommended):
"Keep InvocationsPerInstance at 70"
→ SageMaker automatically adjusts instance count
Step Scaling Policy:
If metric > 80 for 3 minutes → add 2 instances
If metric > 95 for 1 minute → add 5 instances
If metric < 30 for 10 minutes → remove 1 instance
| Metric | What It Tracks |
|---|---|
| InvocationsPerInstance | Requests per instance (most common) |
| CPUUtilization | CPU usage |
| GPUUtilization | GPU usage |
| ModelLatency | Time to generate prediction |
| OverheadLatency | SageMaker overhead time |
- Min instances: 0 (serverless), 1+ (real-time)
- Cool-down period: Wait before next scaling action (prevent thrashing)
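A sketch of the two request payloads behind a target-tracking policy, using the field names of the Application Auto Scaling `RegisterScalableTarget` and `PutScalingPolicy` APIs. The endpoint name, variant name, policy name, capacity bounds, and cool-down values are assumptions for illustration.

```python
def target_tracking_requests(endpoint: str, variant: str,
                             min_instances: int = 1,
                             max_instances: int = 4) -> tuple[dict, dict]:
    """Build the scalable-target and scaling-policy requests for one variant."""
    resource_id = f"endpoint/{endpoint}/variant/{variant}"
    dimension = "sagemaker:variant:DesiredInstanceCount"
    scalable_target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    scaling_policy = {
        "PolicyName": "invocations-target-tracking",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": 70.0,  # "keep invocations per instance at 70"
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleOutCooldown": 60,   # seconds to wait after adding instances
            "ScaleInCooldown": 300,   # longer wait before removing, to avoid thrashing
        },
    }
    return scalable_target, scaling_policy


target, policy = target_tracking_requests("churn-endpoint", "VariantA")
```

With boto3, these would typically be sent through the `application-autoscaling` client as `register_scalable_target(**target)` followed by `put_scaling_policy(**policy)`.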
Edge Deployment
SageMaker Neo
Trained Model → [Neo Compiler] → Optimized Model → Deploy to Edge
↓
Compiles for target hardware:
ARM, Intel, NVIDIA, Qualcomm, etc.
Up to 2x performance improvement
- Train once, run anywhere: Neo compiles the trained model separately for each target hardware
- Supports: TensorFlow, PyTorch, MXNet, XGBoost, ONNX
- Targets: EC2, IoT Greengrass, mobile devices, embedded systems
ELI5: Neo is a compiler for ML models. Just like a C++ compiler converts source code into CPU-specific machine instructions that run faster than interpreted code, Neo converts ML models into hardware-optimized binary formats. The same PyTorch model compiled for an ARM chip runs up to 2x faster than the generic version — with no change to accuracy.
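A sketch of a Neo compilation request, using the field names of the SageMaker `CreateCompilationJob` API. The S3 URIs, role ARN, job names, input shape, and target devices are placeholder assumptions. The point is that the same trained artifact is compiled once per hardware target.

```python
import json


def neo_compilation_request(job_name: str, role_arn: str, model_s3_uri: str,
                            output_s3_uri: str, target_device: str) -> dict:
    """Request body for compiling one PyTorch model for one hardware target."""
    return {
        "CompilationJobName": job_name,
        "RoleArn": role_arn,
        "InputConfig": {
            "S3Uri": model_s3_uri,
            "Framework": "PYTORCH",
            # Input tensor name and shape; assumed image model with 224x224 RGB input
            "DataInputConfig": json.dumps({"input0": [1, 3, 224, 224]}),
        },
        "OutputConfig": {
            "S3OutputLocation": output_s3_uri,
            "TargetDevice": target_device,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 900},
    }


# One compilation job per target: same model artifact, different hardware
requests = [
    neo_compilation_request(
        job_name=f"resnet-{device.replace('_', '-')}",
        role_arn="arn:aws:iam::123456789012:role/ExampleNeoRole",  # placeholder
        model_s3_uri="s3://example-bucket/model.tar.gz",           # placeholder
        output_s3_uri=f"s3://example-bucket/compiled/{device}/",   # placeholder
        target_device=device,
    )
    for device in ("jetson_nano", "ml_c5")
]
```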
AWS IoT Greengrass
- Run ML inference locally on edge devices
- Models trained in cloud → deployed to edge
- Operates offline (disconnected environments)
- Use case: factory floor defect detection, autonomous vehicles
Containerization & Compute
Docker in ML
SageMaker uses Docker containers for everything:
┌────────────────────────────────────────────┐
│ Docker Container │
│ ├─ /opt/ml/model/ (model artifacts)│
│ ├─ /opt/ml/input/data/ (training data) │
│ ├─ /opt/ml/output/ (results) │
│ └─ Your code + dependencies │
└────────────────────────────────────────────┘
Two modes:
Training container: reads data → trains → writes model
Inference container: loads model → serves predictions via HTTP
ELI5: Everything in SageMaker runs inside Docker containers. Think of containers like standardized shipping containers: no matter what’s inside (Python code, R scripts, TensorFlow, PyTorch), the outer box is always the same shape. SageMaker can pick it up and run it on any machine without worrying about dependency conflicts. Your model, its libraries, and your code all travel together as one portable unit.
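A minimal sketch of what an inference container runs, using only the Python standard library. SageMaker expects the container to answer `GET /ping` (health check) and `POST /invocations` (predictions) on port 8080. The `predict` function and the JSON request shape are hypothetical stand-ins for a real model loaded from `/opt/ml/model/`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features: list[float]) -> float:
    """Hypothetical stand-in for a model loaded from /opt/ml/model/."""
    return sum(features)


def route(method: str, path: str, body: bytes) -> tuple[int, bytes]:
    """Routing logic kept separate from the socket code so it is easy to test."""
    if method == "GET" and path == "/ping":
        return 200, b""  # health check: an empty 200 means "ready"
    if method == "POST" and path == "/invocations":
        features = json.loads(body)["features"]
        return 200, json.dumps({"prediction": predict(features)}).encode()
    return 404, b""


class Handler(BaseHTTPRequestHandler):
    def _reply(self, method: str) -> None:
        length = int(self.headers.get("Content-Length", 0))
        status, payload = route(method, self.path, self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self) -> None:
        self._reply("GET")

    def do_POST(self) -> None:
        self._reply("POST")


def serve(port: int = 8080) -> None:
    """Blocks forever; call this from the container's entrypoint."""
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Production containers usually put a proper web server in front of the model. The sketch only shows the two routes the contract requires.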
Amazon ECR (Elastic Container Registry)
- Private Docker registry for storing container images
- SageMaker pulls custom training/inference containers from ECR
- Integrates with ECS/EKS for deployment
Amazon ECS (Elastic Container Service)
| Feature | Details |
|---|---|
| Launch types | Fargate (serverless) or EC2 (self-managed) |
| Task Definition | Blueprint for containers (image, CPU, memory, ports) |
| Service | Maintains desired count of running tasks |
| Use for ML | Custom inference services, preprocessing pipelines |
Amazon EKS (Elastic Kubernetes Service)
| Feature | Details |
|---|---|
| Kubernetes | Open-source container orchestration |
| Node types | EC2, Fargate, or both |
| Use for ML | Complex multi-service ML systems, Kubeflow |
| SageMaker + K8s | SageMaker Operators for Kubernetes |
AWS Batch
- Managed batch processing
- Automatically provisions optimal compute (EC2 or Fargate)
- Use case: large-scale offline data processing, batch predictions
ECS vs EKS vs Batch:
ECS: Simple container orchestration (AWS-native)
EKS: Kubernetes (portable, complex)
Batch: Fire-and-forget batch jobs
Infrastructure as Code
AWS CloudFormation
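# Define ML infrastructure as YAML/JSON templates
# (fragment: the Model resource referenced below is defined elsewhere in the template)
Resources:
  SageMakerEndpoint:
    Type: AWS::SageMaker::Endpoint
    Properties:
      # The endpoint needs the config's name, exposed as a GetAtt attribute
      EndpointConfigName: !GetAtt EndpointConfig.EndpointConfigName
  EndpointConfig:
    Type: AWS::SageMaker::EndpointConfig
    Properties:
      ProductionVariants:
        - VariantName: AllTraffic          # label for this variant
          ModelName: !GetAtt Model.ModelName
          InstanceType: ml.m5.xlarge
          InitialInstanceCount: 2
          InitialVariantWeight: 1.0        # single variant receives all traffic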
- Templates: YAML/JSON describing infrastructure
- Stacks: Running instance of a template
- Change Sets: Preview changes before applying
- Drift Detection: Detect manual changes to infrastructure
- Use case: reproducible ML environments, version-controlled infra
AWS CDK (Cloud Development Kit)
- Define infrastructure using programming languages (Python, TypeScript, Java, etc.)
- Synthesizes to CloudFormation templates
- Higher-level constructs than raw CloudFormation
- Use case: complex infrastructure with logic, loops, conditionals
CloudFormation vs CDK:
CloudFormation: Declarative (YAML/JSON) — what you want
CDK: Imperative (code) — how to build it
CDK compiles → CloudFormation template → AWS deploys
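To make the "code that compiles to a template" idea concrete, here is a hand-rolled illustration in plain Python. It is not the CDK API. `synthesize_template` is a hypothetical function that uses an ordinary loop to emit CloudFormation JSON, and it assumes the named SageMaker models already exist.

```python
import json


def synthesize_template(model_names: list[str],
                        instance_type: str = "ml.m5.xlarge") -> dict:
    """Generate one EndpointConfig + Endpoint pair per model using a loop."""
    resources = {}
    for i, model_name in enumerate(model_names):
        config_id, endpoint_id = f"EndpointConfig{i}", f"Endpoint{i}"
        resources[config_id] = {
            "Type": "AWS::SageMaker::EndpointConfig",
            "Properties": {"ProductionVariants": [{
                "VariantName": "AllTraffic",
                "ModelName": model_name,  # name of an existing SageMaker model
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
                "InitialVariantWeight": 1.0,
            }]},
        }
        resources[endpoint_id] = {
            "Type": "AWS::SageMaker::Endpoint",
            "Properties": {
                "EndpointConfigName": {"Fn::GetAtt": [config_id, "EndpointConfigName"]},
            },
        }
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}


template = synthesize_template(["churn-model", "fraud-model"])  # hypothetical models
print(json.dumps(template, indent=2))
```

Adding a third endpoint is a one-word change to the list, where a raw template would need two more hand-written resource blocks. CDK offers the same kind of leverage, with typed constructs in place of raw dictionaries.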
CI/CD for ML
AWS CodePipeline

Source (CodeCommit/GitHub)
↓
Build (CodeBuild)
↓
Test (CodeBuild — run tests)
↓
Deploy (CodeDeploy / SageMaker)
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Source │→ │ Build │→ │ Test │→ │ Deploy │
│ CodeCommit│ │CodeBuild │ │CodeBuild │ │CodeDeploy│
│ GitHub │ │(build │ │(run unit │ │(deploy │
│ S3 │ │ container│ │ tests) │ │ model) │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
| Service | Purpose |
|---|---|
| CodeCommit | Git repository (AWS managed) |
| CodeBuild | Build and test (compile, run tests, create container) |
| CodeDeploy | Deploy to EC2, ECS, Lambda |
| CodePipeline | Orchestrate the full CI/CD pipeline |
SageMaker Pipelines
The native ML CI/CD system — purpose-built for ML workflows.
SageMaker Pipeline Example:
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Process │→ │ Train │→ │ Evaluate │→ │ Register │
│ Data │ │ Model │ │ Model │ │ Model │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
│
↓
┌──────────┐
│Condition: │
│accuracy │
│> 0.90? │
└─────┬─────┘
Yes ↙ ↘ No
┌──────────┐ ┌──────────┐
│ Register │ │ Fail / │
│ + Deploy │ │ Retrain │
└──────────┘ └──────────┘
| Step Type | What It Does |
|---|---|
| Processing | Data processing with SageMaker Processing |
| Training | Model training |
| Tuning | Hyperparameter tuning |
| Transform | Batch inference |
| Condition | Branch based on metric values |
| RegisterModel | Register in Model Registry |
| CreateModel | Create deployable model |
| Lambda | Run custom Lambda function |
| Fail | Mark pipeline as failed |
Exam tip: SageMaker Pipelines is the answer when the question asks about ML-specific CI/CD or automated retraining workflows.
ELI5: SageMaker Pipelines is like a factory assembly line for ML: raw data goes in one end, a validated deployed model comes out the other. Each step (processing, training, evaluation, registration) is automated and tracked. If the model doesn’t hit an accuracy threshold, the pipeline stops and can trigger retraining. The whole thing is reproducible — run it again next month with new data and get consistent results.
Workflow Orchestration
Amazon EventBridge
- Event bus — react to events from AWS services and custom apps
- Schedule rules (cron) or event-pattern matching
- Use case: trigger retraining when new data arrives in S3
S3 PutObject event
→ EventBridge rule (pattern: "new training data in s3://ml-data/")
→ Target: Step Functions (start retraining pipeline)
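A sketch of the event pattern for the rule above, together with a deliberately simplified matcher that shows how pattern matching works. The pattern follows the EventBridge format for S3 "Object Created" events, and the bucket usually needs EventBridge notifications turned on for these events to arrive. `matches` is a hypothetical helper that handles only exact-value lists and nested objects, whereas the real service supports more operators.

```python
EVENT_PATTERN = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["ml-data"]}},
}


def matches(pattern: dict, event: dict) -> bool:
    """True if every field in the pattern is present in the event with an allowed value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):          # nested object: recurse
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:            # leaf: value must be in the allowed list
            return False
    return True


sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {"bucket": {"name": "ml-data"},
               "object": {"key": "train/part-0001.csv"}},  # hypothetical object key
}
print(matches(EVENT_PATTERN, sample_event))
```

Fields that the pattern does not mention, such as the object key here, are ignored. A pattern only constrains the fields it names.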
AWS Step Functions
- State machine orchestration — coordinate multiple AWS services
- Visual workflow designer
- Built-in error handling, retries, parallel execution
- Integrates with SageMaker, Lambda, Glue, ECS, and more
Step Functions States:
Task → Call an AWS service or Lambda
Choice → Branch based on condition
Parallel → Run branches concurrently
Map → Iterate over an array
Wait → Pause for time or signal
Succeed → End successfully
Fail → End with error
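A sketch of how several of these state types combine in an Amazon States Language definition, written as a Python dictionary. The state names, the `$.accuracy` input field, and the 0.90 threshold are assumptions. A real `Task` state would also need `Parameters` describing the training job, which are left out here. `undefined_targets` is a hypothetical sanity check, not part of any AWS SDK.

```python
import json

STATE_MACHINE = {
    "Comment": "Retraining workflow sketch",
    "StartAt": "TrainModel",
    "States": {
        "TrainModel": {
            "Type": "Task",  # calls an AWS service and waits for it to finish
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "Next": "CheckAccuracy",
        },
        "CheckAccuracy": {
            "Type": "Choice",  # branches on a value in the state input
            "Choices": [{"Variable": "$.accuracy",
                         "NumericGreaterThan": 0.90,
                         "Next": "Approved"}],
            "Default": "Rejected",
        },
        "Approved": {"Type": "Succeed"},
        "Rejected": {"Type": "Fail", "Error": "AccuracyBelowThreshold"},
    },
}


def undefined_targets(definition: dict) -> set[str]:
    """Return transition targets that do not name a defined state."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        targets.update(t for t in (state.get("Next"), state.get("Default")) if t)
        targets.update(choice["Next"] for choice in state.get("Choices", []))
    return targets - set(states)


definition_json = json.dumps(STATE_MACHINE, indent=2)  # what gets uploaded as the definition
print(undefined_targets(STATE_MACHINE))
```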
ELI5: Step Functions is the general-purpose workflow engine — it can orchestrate any AWS service in any sequence. SageMaker Pipelines is ML-specific and knows about training jobs, model registries, and evaluation steps natively. MWAA (Airflow) is for teams that already have Python-based DAG workflows they want to keep. For the exam: ML workflow with training/evaluation gates = SageMaker Pipelines; everything else (ETL, cross-service automation) = Step Functions.
Amazon MWAA (Managed Apache Airflow)
- Managed Apache Airflow — Python-based workflow orchestration
- DAGs (Directed Acyclic Graphs) define workflows
- Rich ecosystem of operators and hooks
- Use case: complex multi-step ML pipelines already defined in Airflow
Git Workflow for ML
Gitflow for ML:
main ─────────●─────────────────●──────────
│ ↑
develop ──────┼──●──●──●───────┤
│ │ │ │
feature ──────┼──┘ │ │
branch │ │ │
│ └───────┘
(merge when model approved)
GitHub Flow (simpler):
main ──────●──────●──────●──────
│ ↑ │
feature ───┘──────┘ │
branch │
AWS Lake Formation
Centralized data governance for data lakes.
| Feature | Purpose |
|---|---|
| Centralized permissions | Single place to manage access to S3 data lake |
| Fine-grained access | Column-level and row-level security |
| Data Catalog | Built on Glue Data Catalog |
| Cross-account sharing | Share data across AWS accounts |
| Governed tables | ACID transactions on S3 |
| Data filters | Cell-level security (row + column combined) |

Lake Formation sits on top of S3 + Glue:
Users / Services
↓
Lake Formation (permissions)
↓
Glue Data Catalog (metadata)
↓
S3 Data Lake (storage)
↓
Accessed by: Athena, Redshift, EMR, Glue ETL
Exam tip: When the question asks about fine-grained access control for a data lake → Lake Formation.
Quick Reference: When to Use What
| Scenario | Service |
|---|---|
| ML-specific CI/CD pipeline | SageMaker Pipelines |
| General CI/CD | CodePipeline + CodeBuild + CodeDeploy |
| React to AWS events | EventBridge |
| Visual workflow with AWS services | Step Functions |
| Python DAG-based workflows | MWAA (Airflow) |
| Define infra as YAML | CloudFormation |
| Define infra as code (Python/TS) | CDK |
| Store Docker images | ECR |
| Run containers (simple) | ECS + Fargate |
| Run containers (Kubernetes) | EKS |
| Batch processing jobs | AWS Batch |
| Deploy to edge devices | SageMaker Neo + IoT Greengrass |
| Data lake governance | Lake Formation |
| A/B test models | Production Variants |
| Gradual traffic shift | Canary / Linear deployment |
| Zero-risk model comparison | Shadow testing |
Additional Services to Know
AWS Lambda Limits
| Limit | Value |
|---|---|
| Timeout | 15 minutes max |
| Memory | 128 MB – 10 GB |
| Package size | 50 MB zipped, 250 MB unzipped |
| Concurrency | 1000 default (can increase) |
| Payload | 6 MB (sync), 256 KB (async) |
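Exam trap: “Process a 2 GB file in Lambda” is almost always the wrong answer. The 6 MB payload limit means the file cannot be passed in the request, and the 15-minute timeout and 10 GB memory ceiling make large processing jobs a poor fit. Use ECS/Fargate or SageMaker Processing.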
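The limits in the table above can be turned into a quick fit check. `lambda_blockers` is a hypothetical helper for reasoning about scenarios, not an AWS API, and it only looks at timeout, memory, and synchronous payload size.

```python
LAMBDA_LIMITS = {"timeout_min": 15, "memory_gb": 10, "sync_payload_mb": 6}


def lambda_blockers(runtime_min: float, memory_gb: float, payload_mb: float) -> list[str]:
    """Return the limits a job would exceed; an empty list means Lambda could fit."""
    blockers = []
    if runtime_min > LAMBDA_LIMITS["timeout_min"]:
        blockers.append("timeout")
    if memory_gb > LAMBDA_LIMITS["memory_gb"]:
        blockers.append("memory")
    if payload_mb > LAMBDA_LIMITS["sync_payload_mb"]:
        blockers.append("payload")
    return blockers


# A 2 GB file sent in the request, needing about 40 minutes of processing
print(lambda_blockers(runtime_min=40, memory_gb=4, payload_mb=2048))
```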
Amazon Redshift
| Feature | Details |
|---|---|
| Redshift ML | Run ML predictions using SQL (CREATE MODEL — uses Autopilot under the hood) |
| Redshift Spectrum | Query S3 data directly without loading into warehouse |
| Firehose target | Kinesis Firehose can deliver directly to Redshift |
Use Redshift when data is already in a warehouse and needs ML predictions.
Glue Job Bookmarks
Run 1: Process files A, B, C → bookmark saves position
Run 2: Only process files D, E (new since bookmark)
Files A, B, C are SKIPPED
Without bookmarks: Reprocesses everything every run (wasteful)
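The bookmark behaviour can be simulated in a few lines. This is a conceptual model only: `run_job` is a hypothetical function, and the real service persists its own state per job rather than a set of file names. In Glue itself, bookmarks are switched on with the job parameter `--job-bookmark-option` set to `job-bookmark-enable`.

```python
def run_job(available_files: list[str], bookmark: set[str]) -> list[str]:
    """Process only files not yet recorded in the bookmark, then advance it."""
    new_files = [f for f in available_files if f not in bookmark]
    # ... the actual transformation of new_files would happen here ...
    bookmark.update(new_files)  # remember what has been processed
    return new_files


bookmark: set[str] = set()
run1 = run_job(["A", "B", "C"], bookmark)            # first run sees everything
run2 = run_job(["A", "B", "C", "D", "E"], bookmark)  # second run sees only new files
print(run1, run2)
```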
Glue DataBrew vs Glue ETL vs Data Wrangler
| | Glue DataBrew | Glue ETL | Data Wrangler |
|---|---|---|---|
| User | Data analyst (no code) | Data engineer (PySpark) | Data scientist (ML focus) |
| Scale | Moderate | Petabyte (Spark) | Moderate |
| ML features | None | None | Target leakage detection |
| Output | S3, Glue Catalog | S3, JDBC | SageMaker Pipeline step code |
SageMaker Pipelines — Step Caching
Run 1: ProcessingStep(data_v1) → TrainingStep(hp_v1) → Eval → Register
All steps execute. Outputs cached.
Run 2 (only hyperparams changed):
ProcessingStep(data_v1) → CACHED! Skip.
TrainingStep(hp_v2) → Executes.
Eval → Executes.
Saves hours of compute when only part of pipeline changes.
Configure: CacheConfig(enable_caching=True, expire_after="P30D")
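The caching behaviour above can be modelled as a lookup keyed on a step's name and arguments. This is a conceptual sketch, not the SageMaker SDK: `StepCache` and `pipeline` are hypothetical, and cache expiry (`expire_after`) is not modelled.

```python
import hashlib
import json


class StepCache:
    """Re-run a step only when its name or arguments change."""

    def __init__(self) -> None:
        self._store: dict[str, object] = {}
        self.executed: list[str] = []   # names of steps that actually ran

    def run(self, name: str, args: dict, fn):
        key = hashlib.sha256(
            json.dumps([name, args], sort_keys=True).encode()
        ).hexdigest()
        if key not in self._store:      # cache miss: execute and remember the output
            self._store[key] = fn(**args)
            self.executed.append(name)
        return self._store[key]


def pipeline(cache: StepCache, data_version: str, learning_rate: float) -> str:
    features = cache.run("Processing", {"data": data_version},
                         lambda data: f"features({data})")
    return cache.run("Training", {"features": features, "lr": learning_rate},
                     lambda features, lr: f"model({features}, lr={lr})")


cache = StepCache()
pipeline(cache, "data_v1", 0.1)    # run 1: both steps execute
pipeline(cache, "data_v1", 0.01)   # run 2: only the hyperparameter changed
print(cache.executed)
```

Processing appears once in `executed` and Training twice, which is the Run 1 / Run 2 pattern described above.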
Pipeline Step Types (Complete)
| Step Type | Purpose |
|---|---|
| ProcessingStep | Data prep, evaluation scripts |
| TrainingStep | Model training |
| TuningStep | Hyperparameter tuning |
| TransformStep | Batch predictions |
| CreateModelStep | Prepare model for deployment |
| RegisterModel | Add to Model Registry |
| ConditionStep | Branch if metric > threshold |
| CallbackStep | Wait for human approval / external event |
| FailStep | Fail pipeline explicitly |
| LambdaStep | Run Lambda function |
| QualityCheckStep | Automated data/model quality gates |
| ClarifyCheckStep | Automated bias/explainability gates |
| EMRStep | Run EMR job |