Domain 4B: SageMaker Deployment & MLOps
Table of Contents
- SageMaker Deployment & MLOps
- Deployment Options — The Complete Picture
- Real-Time Inference (Endpoints)
- Serverless Inference
- Batch Transform
- Asynchronous Inference
- Deployment Options Comparison Table
- Multi-Model Endpoints (MME)
- Multi-Container Endpoints (MCE)
- Inference Pipeline
- SageMaker Model Registry
- SageMaker Pipelines — ML CI/CD
- Docker Containers for Custom Algorithms
- SageMaker Neo
- AWS Inferentia
- Edge Deployment
- MLOps Best Practices
- SageMaker Projects
- Quick Reference
SageMaker Deployment & MLOps
Exam Domain: 4, ML Implementation and Operations (20%)
Task: Deploy models, build MLOps pipelines, and select appropriate inference strategies
Deployment Options — The Complete Picture
┌─────────────────────────────────────────────────────────────────────┐
│ SAGEMAKER DEPLOYMENT OPTIONS │
├──────────────────┬──────────────────┬──────────────┬───────────────┤
│ Real-Time │ Serverless │ Batch │ Async │
│ Inference │ Inference │ Transform │ Inference │
├──────────────────┼──────────────────┼──────────────┼───────────────┤
│ Persistent │ No instance │ One-time │ Queue-based │
│ instance always │ (cold start) │ job, S3→S3 │ processing │
│ running │ │ │ │
├──────────────────┼──────────────────┼──────────────┼───────────────┤
│ ms latency │ seconds (cold) │ minutes-hrs │ minutes │
├──────────────────┼──────────────────┼──────────────┼───────────────┤
│ 24/7 cost │ Pay per request │ Pay per job │ Pay per job │
├──────────────────┼──────────────────┼──────────────┼───────────────┤
│ Steady traffic │ Intermittent │ Offline │ Large payload │
│ Low latency SLA │ Dev/test │ batch score │ Long inference│
└──────────────────┴──────────────────┴──────────────┴───────────────┘
Why this matters for the exam: Every deployment scenario question maps to one of these four options. Know the trigger words: “low latency” → Real-Time, “intermittent traffic” → Serverless, “large dataset offline” → Batch Transform, “payload > 6 MB or inference > 60 s” → Async.
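The trigger words above can be read as a small decision rule. The sketch below is only an illustration of that rule, using the 6 MB and 60 s limits quoted in this guide; the `Workload` class and `choose_deployment` function are hypothetical names, not part of any AWS SDK.

```python
from dataclasses import dataclass

# Limits as quoted in this guide for real-time endpoints.
REALTIME_MAX_PAYLOAD_MB = 6
REALTIME_MAX_SECONDS = 60


@dataclass
class Workload:
    payload_mb: float              # size of a single request
    inference_seconds: float       # time the model needs per request
    offline_dataset: bool          # whole dataset in S3, no caller waiting
    traffic_is_intermittent: bool  # long idle gaps between requests


def choose_deployment(w: Workload) -> str:
    """Map a workload to one of the four options using the exam trigger words."""
    if w.offline_dataset:
        return "Batch Transform"
    if w.payload_mb > REALTIME_MAX_PAYLOAD_MB or w.inference_seconds > REALTIME_MAX_SECONDS:
        return "Asynchronous Inference"
    if w.traffic_is_intermittent:
        return "Serverless Inference"
    return "Real-Time Inference"
```

For example, a 500 MB video that takes two minutes to process lands on Asynchronous Inference because it breaks both real-time limits.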
Real-Time Inference (Endpoints)
How It Works
┌──────────────┐     HTTPS      ┌─────────────────────────────────┐
│    Client    │ ─────────────► │       SageMaker Endpoint        │
│  (app/API)   │ ◄───────────── │  ┌──────────┐  ┌──────────┐     │
└──────────────┘   prediction   │  │ Instance │  │ Instance │     │
                                │  │ + model  │  │ + model  │     │
                                │  └──────────┘  └──────────┘     │
                                │          Load Balancer          │
                                └─────────────────────────────────┘
- Model loaded into memory on persistent EC2 instances
- Requests routed via internal load balancer
- Predictions returned synchronously (sub-second)
- Multiple instances for high availability and throughput
When to Use
- Real-time user-facing applications (fraud detection at checkout, product recommendations)
- Sub-second latency is required
- Steady, predictable traffic patterns
- Payload < 6 MB per request
Instance Type Selection for Inference
| Instance Family | Hardware | Best For |
|---|---|---|
| ml.m5, ml.m4 | CPU, general | Lightweight models, tabular ML |
| ml.c5, ml.c4 | CPU, compute-optimized | CPU-intensive inference, NLP |
| ml.g4dn | GPU (T4) | Deep learning, computer vision |
| ml.p3 | GPU (V100) | Large model inference, NLP transformers |
| ml.inf1 | AWS Inferentia (custom chip) | High-throughput deep learning, cost-optimized |
| ml.inf2 | AWS Inferentia2 | Large language models, cost-optimized |
| ml.trn1 | AWS Trainium | Training (not inference) |
Exam tip: For cost-optimized deep learning inference, the answer is ml.inf1 or ml.inf2 (Inferentia). For maximum GPU performance, ml.p3. For standard tabular ML, ml.m5 or ml.c5.
Auto-Scaling Endpoints
ELI5: Auto-scaling is like a restaurant that adds more tables during the lunch rush and removes them at 3pm. SageMaker watches how many requests per instance are coming in and automatically adds or removes instances to match demand.
Scaling policies:
| Policy Type | How It Works | Best For |
|---|---|---|
| Target Tracking | Maintain a target metric value (e.g., 100 invocations/instance) | Most use cases |
| Step Scaling | Scale by fixed amounts at defined thresholds | Precise control |
| Scheduled Scaling | Scale at specific times | Known traffic patterns |
Key metrics for scaling:
- InvocationsPerInstance: most common trigger (target: your desired requests/instance)
- CPUUtilization: scale on CPU pressure
- MemoryUtilization: scale on memory pressure
- GPUUtilization: scale on GPU utilization
Scaling configuration:
- Minimum capacity: floor for scale-in (don’t go below X instances)
- Maximum capacity: ceiling for scale-out (cost control)
- Cooldown: wait period after scaling action before next action
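As a rough illustration of a target tracking setup, the sketch below builds the two request payloads that Application Auto Scaling expects for an endpoint variant. The helper names, the policy name, and the endpoint name in the usage note are assumptions for this example; the payloads would be passed to `register_scalable_target` and `put_scaling_policy` on a boto3 `application-autoscaling` client, which is not called here.

```python
def scalable_target(endpoint_name, variant_name, min_instances, max_instances):
    """Payload describing which variant scales and its min/max capacity."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_instances,   # floor for scale-in
        "MaxCapacity": max_instances,   # ceiling for scale-out
    }


def target_tracking_policy(target, invocations_per_instance, cooldown_seconds=300):
    """Payload for a target tracking policy on invocations per instance."""
    return {
        "PolicyName": "invocations-target-tracking",  # hypothetical name
        "ServiceNamespace": target["ServiceNamespace"],
        "ResourceId": target["ResourceId"],
        "ScalableDimension": target["ScalableDimension"],
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": cooldown_seconds,
            "ScaleOutCooldown": cooldown_seconds,
        },
    }
```

Usage, assuming an endpoint called `fraud-endpoint`: build `target = scalable_target("fraud-endpoint", "AllTraffic", 1, 4)`, then `policy = target_tracking_policy(target, 100)`, and pass each dictionary as keyword arguments to the corresponding client call.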
Serverless Inference
How It Works
Request arrives ──► SageMaker checks for warm container
                              │
                   ┌──────────┴──────────┐
                   │   Warm container?   │
                  YES                   NO
                   │                     │
           Route immediately   Cold start (~seconds)
             (ms latency)     Load model, then serve
- No persistent instance — model container spins up on demand
- AWS manages all infrastructure
- Cold start: first request after idle period takes a few seconds
- Warm container: subsequent requests are fast
Configuration Parameters
- Memory size: 1024 MB to 6144 MB — determines CPU allocation too (more memory = more CPU)
- Max concurrent invocations: limits simultaneous requests (cost control, prevents runaway spend)
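A minimal sketch of how these two parameters might appear in an endpoint configuration follows. It assumes the memory sizes come in 1 GB steps within the range above; the function name and the `AllTraffic` variant name are choices made for this example. The returned dictionary has the shape of one `ProductionVariants` entry for `create_endpoint_config`.

```python
# Memory sizes within the 1024-6144 MB range, assuming 1 GB increments.
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def serverless_variant(model_name, memory_mb, max_concurrency):
    """Build one production variant that uses serverless inference."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {ALLOWED_MEMORY_MB}")
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        # No InstanceType or InitialInstanceCount: SageMaker manages capacity.
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }
```

Validating the values locally gives a clearer error than waiting for the API call to reject the configuration.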
When to Use
- Intermittent, unpredictable traffic (dev/test, low-volume APIs)
- Can tolerate occasional seconds of cold start latency
- Cost optimization: pay only for compute used, not idle time
- Model size < 1 GB (for reasonable cold start times)
ELI5: Serverless inference is like a taxi vs. owning a car. The taxi (serverless) takes a minute to arrive but costs nothing when you’re not using it. Your car (real-time endpoint) is always in the driveway, always ready, but the insurance and payments run 24/7. If you only make a few trips a week, take the taxi.
Provisioned Concurrency: Pre-warm a specified number of instances to eliminate cold starts. Adds cost but removes latency variability. Use when you need fast response times but traffic is still intermittent.
Batch Transform
How It Works
 S3 Input         SageMaker           S3 Output
┌────────┐   ┌──────────────────┐   ┌────────────┐
│input/  │──►│ Batch Transform  │──►│ output/    │
│data.csv│   │ (auto-spins up   │   │ results.csv│
└────────┘   │  instances,      │   └────────────┘
             │  processes,      │
             │  terminates)     │
             └──────────────────┘
- You specify: model, instance type/count, input S3 path, output S3 path
- SageMaker spins up instances, processes all data, writes results, then terminates instances
- No persistent endpoint — only pay during the job
Advanced Features
- Data splitting: how to split input data for batching (Line, RecordIO, TFRecord, None)
- Join source: merge input data with predictions in output (useful for audit trails)
- Filtering: include/exclude fields from output
- Assemble with: how to combine predictions back (Line, None)
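To make the advanced features concrete, here is a sketch of a `create_transform_job` request that splits CSV input by line, drops an ID column before inference, and joins each prediction back to its input row. The function name, default instance type, and the assumption that column 0 holds a record ID are illustrative choices, not taken from this guide.

```python
def batch_transform_request(job_name, model_name, input_s3_uri, output_s3_uri,
                            instance_type="ml.m5.xlarge", instance_count=1):
    """Build keyword arguments for the SageMaker create_transform_job call."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",
            "SplitType": "Line",          # data splitting: one record per line
        },
        "TransformOutput": {
            "S3OutputPath": output_s3_uri,
            "AssembleWith": "Line",       # one prediction per output line
            "Accept": "text/csv",
        },
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
        "DataProcessing": {
            "InputFilter": "$[1:]",       # assumed: column 0 is an ID the model must not see
            "JoinSource": "Input",        # join source: append prediction to the input row
            "OutputFilter": "$[0,-1]",    # filtering: keep only the ID and the prediction
        },
    }
```

With boto3 this dictionary would be passed as `sagemaker_client.create_transform_job(**request)`; the job then runs to completion and releases its instances.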
When to Use
- Offline scoring: generate predictions for entire dataset (millions of records)
- Periodic batch jobs: nightly scoring, weekly recommendations refresh
- No hard latency requirement
- Large datasets that would timeout a real-time endpoint
- Cost-conscious: no 24/7 endpoint cost
Asynchronous Inference
How It Works
   Client       POST request            SageMaker
┌───────────┐                  ┌──────────────────────────┐
│  Client   │──── Request ────►│  Input S3 location       │
│           │◄─── Location ────│  (auto-queued)           │
└───────────┘   (immediate)    │         │                │
                               │         ▼                │
                               │  ┌──────────────┐        │
                               │  │ ML Instance  │        │
                               │  │ (processes   │        │
                               │  │  the job)    │        │
                               │  └──────┬───────┘        │
                               │         │                │
                               │         ▼                │
                               │  Output to S3            │
                               │  SNS notification sent   │
                               └──────────────────────────┘
- Client sends request → gets back a location (URL), not a result
- SageMaker queues the request and processes asynchronously
- When done: result in S3, SNS notification sent to client
- Endpoint can auto-scale to zero when queue is empty (major cost saving)
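The request and response flow above can be sketched as two small helpers: one builds the `AsyncInferenceConfig` block for an endpoint configuration, the other splits the S3 location that the asynchronous invoke call returns. Function names, the default concurrency, and all bucket and topic values are placeholders for this example.

```python
from urllib.parse import urlparse


def async_inference_config(output_s3_uri, success_topic_arn, error_topic_arn,
                           max_concurrent_per_instance=4):
    """Build the AsyncInferenceConfig section of an endpoint configuration."""
    return {
        "OutputConfig": {
            "S3OutputPath": output_s3_uri,
            "NotificationConfig": {
                "SuccessTopic": success_topic_arn,  # SNS topic for completed requests
                "ErrorTopic": error_topic_arn,      # SNS topic for failed requests
            },
        },
        "ClientConfig": {
            "MaxConcurrentInvocationsPerInstance": max_concurrent_per_instance
        },
    }


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key) so the result can be fetched."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")
```

In practice the client would call `invoke_endpoint_async` on the SageMaker runtime client with an `InputLocation` in S3, read `OutputLocation` from the response, and pass it to `split_s3_uri` once the SNS notification arrives.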
When to Use
- Payload > 6 MB (real-time limit) — video files, large images, long documents
- Inference time > 60 seconds — complex models, large batch within a request
- Can tolerate latency of minutes (not milliseconds)
- Cost-sensitive: want instances to scale to zero when idle
Why this matters for the exam: The payload size limit (6 MB for real-time) and inference time limit (60s) are key discriminators. If either is exceeded, answer is Async Inference.
Deployment Options Comparison Table
| Feature | Real-Time | Serverless | Batch Transform | Async |
|---|---|---|---|---|
| Latency | ms | s (cold start) | minutes-hours | minutes |
| Max payload | 6 MB | 4 MB | Unlimited (S3) | 1 GB |
| Max inference time | 60 s | 60 s | Unlimited | 1 hour |
| Scaling | Auto-scaling | Automatic | N/A (one-time) | Auto (to zero) |
| Cost when idle | Continuous | Zero | Zero | Near-zero |
| Persistent endpoint | Yes | Yes | No | Yes |
| Response mechanism | Synchronous | Synchronous | S3 poll | S3 + SNS |
Multi-Model Endpoints (MME)
What and Why
Host thousands of models on a single endpoint — models are loaded and unloaded dynamically.
ELI5: Instead of renting 1,000 apartments for 1,000 tenants who rarely visit, rent one hotel. Guests check in (model loaded into memory) when they arrive, check out (model evicted from memory) when idle. The hotel is always available; you just pay for one building, not 1,000.
How It Works
┌───────────────────────────────────────────────┐
│ Multi-Model Endpoint │
│ │
│ Memory: [Model A] [Model C] [empty...] │
│ Disk: model_A.tar.gz, model_B.tar.gz, │
│ model_C.tar.gz ... (thousands) │
│ │
│ Request for Model B: │
│ 1. Check memory → not there │
│ 2. Load from disk/S3 → into memory │
│ 3. Serve prediction │
│ 4. Keep in memory (evict LRU if full) │
└───────────────────────────────────────────────┘
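The load-on-demand and evict-LRU behavior in the diagram can be simulated in a few lines. This is a toy model of the idea only, not SageMaker's actual implementation; `ModelCache` is a hypothetical class and capacity is counted in models rather than bytes.

```python
from collections import OrderedDict


class ModelCache:
    """Toy simulation of multi-model endpoint memory with LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity     # how many models fit in memory
        self.loaded = OrderedDict()  # least recently used model comes first
        self.cold_loads = 0

    def invoke(self, model_name):
        if model_name in self.loaded:
            # Already in memory: mark as most recently used.
            self.loaded.move_to_end(model_name)
            return "warm"
        if len(self.loaded) >= self.capacity:
            # Memory full: evict the least recently used model.
            self.loaded.popitem(last=False)
        # Stand-in for fetching the artifact from disk or S3.
        self.loaded[model_name] = object()
        self.cold_loads += 1
        return "cold"
```

With room for two models, invoking A, C, B in turn evicts A when B arrives, which is why the first call to a rarely used model pays a loading penalty.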
When to Use
- Per-customer models (1 model per customer, most customers have low traffic)
- Per-region or per-segment models with sparse traffic
- Cost optimization: one endpoint instead of thousands
Limitations
- Higher latency on first call to a model (loading time)
- Not suitable when many models need to be in memory simultaneously
- Models must use same framework (e.g., all XGBoost, or all TensorFlow)
Multi-Container Endpoints (MCE)
Host multiple different containers on one endpoint.
Two invocation modes:
| Mode | How It Works | Use Case |
|---|---|---|
| Direct | Client chooses which container to invoke by name | A/B testing different frameworks |
| Serial Pipeline | Output of container A becomes input to container B | Preprocessing → Model → Postprocessing |
Serial inference pipeline example:
Input text ──► [SparkML preprocessing] ──► [XGBoost model] ──► [custom postprocessor] ──► Output
Container 1 Container 2 Container 3
Inference Pipeline
Chain multiple containers into a single SageMaker endpoint — used when you need the same transformations at inference time as at training time.
Why it matters:
- Training applies feature transformations (scaling, encoding)
- Those SAME transformations must be applied at inference
- Inference Pipeline keeps them together, preventing training-serving skew
Architecture:
┌──────────────────────────────────────────────────────┐
│ Inference Pipeline │
│ │
│ Raw Input ──► [Preprocessing] ──► [Model] ──► Output│
│ (Scikit-learn (XGBoost or │
│ or SparkML) TensorFlow) │
└──────────────────────────────────────────────────────┘
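A small pure-Python sketch shows why bundling preprocessing with the model prevents skew: the scaler is fitted once on training data, and the same fitted object runs in front of the model at inference. `Standardizer`, `threshold_model`, and `make_pipeline` are stand-ins invented for this example, not SageMaker classes.

```python
from statistics import mean, pstdev


class Standardizer:
    """Preprocessing stage whose parameters are learned from training data."""

    def fit(self, values):
        self.mean = mean(values)
        self.std = pstdev(values) or 1.0  # avoid division by zero
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]


def threshold_model(scaled_values):
    """Stand-in model: predicts 1 for above-average inputs."""
    return [1 if z > 0 else 0 for z in scaled_values]


def make_pipeline(*stages):
    """Chain stages so the output of one becomes the input of the next."""
    def run(payload):
        for stage in stages:
            payload = stage(payload)
        return payload
    return run


# Fit on training data once, then serve with the same fitted scaler.
scaler = Standardizer().fit([10, 20, 30])
pipeline = make_pipeline(scaler.transform, threshold_model)
predictions = pipeline([10, 30])
```

If the serving code re-implemented scaling with different statistics, the model would see inputs on a different scale than it was trained on, which is the skew the pipeline avoids.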
SageMaker Model Registry
Version, track, and govern ML models across their lifecycle.
Model versioning:
- Each model you register becomes a new model version within its model group
- Metadata: training metrics, training job ARN, dataset version, owner
- Compare versions side by side
Approval workflow:
Training Complete ──► Pending Review ──► [Human Approves/Rejects]
                                                    │
                                         ┌──────────┴──────────┐
                                     Approved              Rejected
                                         │                     │
                                  Deploy to prod            Archive
Cross-account deployment:
- Model registered in Account A (dev)
- Approved model promoted to Account B (prod)
- Permissions via IAM resource-based policies
Integration with SageMaker Pipelines:
- Pipeline step: RegisterModel → triggers approval workflow
- CI/CD: on approval → automated deployment
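The approval step can be pictured as a status change on a model version that downstream automation reacts to. The sketch below builds the payload for the SageMaker `update_model_package` call and gates deployment on the status; the helper names and the ARN used in the test are hypothetical.

```python
VALID_STATUSES = {"PendingManualApproval", "Approved", "Rejected"}


def approval_update(model_package_arn, status, note):
    """Build keyword arguments for update_model_package."""
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {sorted(VALID_STATUSES)}")
    return {
        "ModelPackageArn": model_package_arn,
        "ModelApprovalStatus": status,
        "ApprovalDescription": note,
    }


def should_deploy(status):
    """Only approved versions move on to the deployment stage."""
    return status == "Approved"
```

A common pattern, assumed here rather than stated in this guide, is to let an event rule watch for the status change and start the deployment pipeline when `should_deploy` would return true.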
SageMaker Pipelines — ML CI/CD
What It Is
Native ML workflow orchestration — define, run, and track ML workflows as DAGs.
ELI5: Pipelines is a recipe card for your ML workflow. Each step (preprocess data, train model, evaluate accuracy, register if good enough) is defined once. You can run the recipe any time, track every run, and reuse unchanged steps from cache.
Pipeline Anatomy
┌─────────────────────────────────────────────────────────────────────┐
│ SAGEMAKER PIPELINE (DAG) │
│ │
│ Parameters: │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ train_size=0.8, epochs=10, instance_type=ml.m5.xlarge │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Steps: │
│ [Processing] ──► [Training] ──► [Evaluation] ──► [Condition] │
│ │ │ │ │ │
│ Prepare data Train model Compute metrics If accuracy │
│ Split train/ on training on test set > 0.85: │
│ test sets data ├─YES─► [Register]│
│ └─NO──► [Fail] │
└─────────────────────────────────────────────────────────────────────┘
Pipeline Steps
| Step Type | Purpose |
|---|---|
| Processing | Run SageMaker Processing job (data prep, eval) |
| Training | Run SageMaker Training job |
| Tuning | Run Hyperparameter tuning job |
| Transform | Run Batch Transform job |
| Model | Create SageMaker Model |
| RegisterModel | Register model in Model Registry |
| Condition | Branch based on metric threshold |
| Callback | Call external system, wait for response |
| Lambda | Run Lambda function inline |
| ClarifyCheck | Check for bias or explainability |
| QualityCheck | Baseline data or model quality |
Key Features
- Parameters: runtime configuration — change instance types, dataset paths, thresholds without editing pipeline code
- Caching: skip steps that haven’t changed since last run (reuse training output if data/code unchanged)
- Lineage: automatic tracking of inputs, outputs, parameters for every run
- Retry policies: automatic retry on transient failures
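To show how parameters, caching, and a condition step interact, here is a toy pipeline runner in plain Python. It does not use the SageMaker SDK; the step bodies, the `StepCache` class, and the accuracy numbers are invented for illustration, and only the control flow is meant to carry over.

```python
import hashlib
import json


class StepCache:
    """Reuses a step's output when its name and inputs are unchanged."""

    def __init__(self):
        self.store = {}
        self.hits = 0

    def run(self, name, fn, inputs):
        key = hashlib.sha256(
            json.dumps([name, inputs], sort_keys=True).encode()
        ).hexdigest()
        if key in self.store:
            self.hits += 1          # cache hit: skip the step
            return self.store[key]
        self.store[key] = fn(**inputs)
        return self.store[key]


# Stand-in step bodies; real steps would launch SageMaker jobs.
def process(train_size):
    return {"n_train": round(1000 * train_size)}


def train(n_train):
    return {"quality": n_train / 1000}


def evaluate(model):
    return {"accuracy": min(0.99, model["quality"] + 0.1)}


def run_pipeline(cache, train_size=0.8, threshold=0.85):
    """Processing -> Training -> Evaluation -> Condition."""
    data = cache.run("process", process, {"train_size": train_size})
    model = cache.run("train", train, {"n_train": data["n_train"]})
    metrics = cache.run("evaluate", evaluate, {"model": model})
    return "Register" if metrics["accuracy"] > threshold else "Fail"
```

Re-running with only a different `threshold` reuses all three cached steps, because the threshold affects the condition and not the inputs of any earlier step.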
Orchestration Comparison
| Feature | SageMaker Pipelines | Step Functions | Apache Airflow (MWAA) |
|---|---|---|---|
| ML-native | Yes (built for ML) | No (general) | No (general) |
| SageMaker integration | Native | Via SDK | Via operators |
| Complexity | Low-medium | Medium | High |
| Non-ML tasks | Limited | Full | Full |
| Best for | ML workflows | Multi-service workflows | Complex ETL + ML |
Exam tip: If the question is about ML workflow orchestration and all steps are ML steps — use SageMaker Pipelines. If the workflow includes non-ML steps (DynamoDB writes, Lambda, SNS) alongside ML — consider Step Functions. If the company already uses Airflow heavily — MWAA.
Docker Containers for Custom Algorithms
Why Docker
- Reproducible environments: same container in dev, test, prod
- Any framework, any language (not just AWS-supported frameworks)
- Isolation: dependencies don’t conflict
- Portability: run on SageMaker, ECS, on-premise, or locally
SageMaker Container Contract
ELI5: SageMaker talks to your container through a strict contract — specific directory paths for data and model files, and specific entry points for training and serving. As long as your container respects this contract, SageMaker can train and deploy it anywhere.
┌──────────────────────────────────────────────────┐
│  SageMaker Container Directory Layout            │
│                                                  │
│  /opt/ml/                                        │
│  ├── input/                                      │
│  │   ├── config/                                 │
│  │   │   ├── hyperparameters.json  ◄─ SageMaker  │
│  │   │   └── resourceConfig.json      injects    │
│  │   └── data/                                   │
│  │       └── training/  ◄── YOUR training data   │
│  ├── model/   ◄── Save trained model HERE        │
│  ├── output/                                     │
│  │   └── failure  ◄── Write failure messages     │
│  └── code/    ◄── Your script (script mode)      │
│                                                  │
│  Entry points (commands SageMaker runs):         │
│  - docker run <image> train  (training)          │
│  - docker run <image> serve  (inference)         │
└──────────────────────────────────────────────────┘
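A training entry point that respects this layout might look like the sketch below. It is a minimal illustration, not a full training program: the "model" is just a JSON summary, the `epochs` hyperparameter is an assumed example, and the base directory is a parameter so the function can be exercised outside a container.

```python
import json
from pathlib import Path


def train(base_dir="/opt/ml"):
    """Read inputs from the contract paths and write a model artifact."""
    base = Path(base_dir)
    try:
        hp_file = base / "input/config/hyperparameters.json"
        hyperparameters = json.loads(hp_file.read_text())
        # SageMaker passes hyperparameter values as strings.
        epochs = int(hyperparameters.get("epochs", "1"))

        training_dir = base / "input/data/training"
        files_seen = sum(1 for p in training_dir.iterdir() if p.is_file())

        # Stand-in for real training: record what the job saw.
        artifact = {"epochs": epochs, "files_seen": files_seen}

        model_dir = base / "model"
        model_dir.mkdir(parents=True, exist_ok=True)
        (model_dir / "model.json").write_text(json.dumps(artifact))
        return artifact
    except Exception as exc:
        # Write the reason to output/failure before re-raising.
        output_dir = base / "output"
        output_dir.mkdir(parents=True, exist_ok=True)
        (output_dir / "failure").write_text(str(exc))
        raise
```

Everything written under `model/` is what SageMaker packages as the model artifact once the training process exits successfully.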
Container Customization Options
Least Custom ◄──────────────────────────────────► Most Custom
│                      │                      │
Built-in Algorithm     Script Mode            Bring Your Own
(no Docker)            (your code +           Container (BYOC)
                       AWS container)         (full custom Docker)
│                      │                      │
Easiest                Most Common            Maximum Control
No flexibility         Best trade-off         Maximum effort
Script Mode (most common for custom code):
- Use AWS pre-built container for your framework (TensorFlow, PyTorch, sklearn)
- Provide YOUR training/inference script
- AWS handles all environment setup
- Easy to iterate — just change your script
BYOC (Bring Your Own Container):
- Build custom Docker image
- Full control: custom frameworks, languages, dependencies
- Push to Amazon ECR
- SageMaker pulls and runs it
- Use when: proprietary framework, unusual dependencies, compiled code
Pre-built Containers (Script Mode ready)
| Framework | Container Available | Script Mode |
|---|---|---|
| TensorFlow | Yes | Yes |
| PyTorch | Yes | Yes |
| MXNet | Yes | Yes |
| Hugging Face Transformers | Yes | Yes |
| Scikit-learn | Yes | Yes |
| XGBoost | Yes | Yes |
| SparkML | Yes | Yes |
SageMaker Neo
Compile ML models for optimized inference across platforms.
What it does:
- Takes a trained model (TensorFlow SavedModel, PyTorch .pt, ONNX, XGBoost, etc.)
- Compiles it for a specific target hardware (CPU architecture, GPU type, edge chip)
- Output: optimized binary that runs faster with less memory
Why it matters:
- 2x-10x faster inference on the same hardware
- Smaller model footprint (important for edge devices)
- Compiled model runs with AWS Deep Learning Runtime (no framework installation needed)
Supported target platforms:
- SageMaker ML instances (cloud)
- ARM, x86, NVIDIA GPU (edge servers)
- Raspberry Pi, Jetson Nano (IoT edge)
- Qualcomm, Intel, NXP chips
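As an illustration of what a compilation request carries, the sketch below builds the payload for the SageMaker `create_compilation_job` call. The function name, the 900-second runtime cap, and every value in the usage note (bucket paths, role ARN, input shape, target device) are placeholders chosen for this example.

```python
import json


def neo_compilation_request(job_name, role_arn, model_s3_uri, output_s3_uri,
                            framework, input_shape, target_device):
    """Build keyword arguments for create_compilation_job."""
    return {
        "CompilationJobName": job_name,
        "RoleArn": role_arn,
        "InputConfig": {
            "S3Uri": model_s3_uri,
            # Input tensor names and shapes, serialized as a JSON string.
            "DataInputConfig": json.dumps(input_shape),
            "Framework": framework,
        },
        "OutputConfig": {
            "S3OutputLocation": output_s3_uri,
            "TargetDevice": target_device,  # the hardware to compile for
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 900},
    }
```

Usage might look like `neo_compilation_request("resnet-jetson", role_arn, "s3://bucket/model.tar.gz", "s3://bucket/compiled/", "PYTORCH", {"input0": [1, 3, 224, 224]}, "jetson_nano")`; the same trained model can be compiled again with a different target device.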
AWS Inferentia
AWS’s custom ML inference chip — built for high-throughput, cost-optimized deep learning inference.
┌────────────────────────────────────────────────────────────┐
│ INFERENCE CHIP COMPARISON │
├────────────────────┬───────────────────────────────────────┤
│ GPU (ml.g4dn/p3) │ Inferentia (ml.inf1/inf2) │
├────────────────────┼───────────────────────────────────────┤
│ General compute │ Purpose-built for ML inference │
│ High cost/hr │ Up to 70% lower cost vs GPU │
│ Great for training │ Great for inference only │
│ Flexible │ Requires Neuron SDK compilation │
│ Any framework │ TensorFlow, PyTorch, MXNet │
└────────────────────┴───────────────────────────────────────┘
Neuron SDK: Compiles models to run on Inferentia chips. Similar to Neo but specifically for Inferentia hardware.
Use case: High-volume, cost-sensitive deep learning inference (NLP, computer vision at scale).
Edge Deployment
When to Deploy at the Edge
| Requirement | Deploy at Edge |
|---|---|
| Ultra-low latency (< 5ms) | Yes |
| Offline capability (no internet) | Yes |
| Data privacy (data can’t leave device) | Yes |
| High bandwidth cost to cloud | Yes |
| Real-time video analysis | Yes |
Edge ML Stack
┌─────────────────────────────────────────────────────────┐
│                 EDGE DEPLOYMENT OPTIONS                 │
│                                                         │
│  Train in Cloud (SageMaker)                             │
│          │                                              │
│          ▼  Compile with Neo                            │
│  Optimized model artifact                               │
│          │                                              │
│     ┌────┴────────────────────────────┐                 │
│     │                                 │                 │
│     ▼                                 ▼                 │
│  IoT Greengrass              SageMaker Edge Manager     │
│  (IoT device runtime)        (fleet management)         │
│     │                                 │                 │
│     └─────────────┬───────────────────┘                 │
│                   ▼                                     │
│              Edge Device                                │
│     (Raspberry Pi, Jetson, etc.)                        │
└─────────────────────────────────────────────────────────┘
SageMaker Edge Manager:
- Deploy, monitor, and update models on a fleet of edge devices
- Data capture from edge for model retraining
- Model versioning and rollback
AWS IoT Greengrass:
- General IoT device runtime (not ML-specific, but supports ML inference)
- Lambda-based or component-based ML inference at the edge
- Integrates with AWS IoT Core for device management
MLOps Best Practices
What Makes ML Different from Software CI/CD
Traditional Software CI/CD:
Code change ──► Build ──► Test ──► Deploy
ML CI/CD (three axes of change):
Code change ──► Retrain ──► Evaluate ──► Deploy
Data change ──► Retrain ──► Evaluate ──► Deploy
Model drift ──► Retrain ──► Evaluate ──► Deploy
MLOps Maturity Levels
| Level | Description | What You Have |
|---|---|---|
| 0 | Manual | Scripts on laptop, no automation |
| 1 | Automated Training | Training pipeline automated, manual deployment |
| 2 | Full CI/CD | Code + data triggers automatic retrain + deploy |
A/B Testing and Safe Deployment
ELI5: A/B testing in ML is like testing a new recipe at a restaurant. You serve the new dish to 10% of customers (model B) while 90% still get the proven dish (model A). You measure which dish gets better reviews (metrics) before switching everyone to the new recipe.
SageMaker Production Variants:
# Example endpoint configuration with A/B split
EndpointConfig:
  ProductionVariants:
    - ModelName: model-v1
      VariantName: ModelA
      InitialVariantWeight: 90   # 90% traffic
    - ModelName: model-v2
      VariantName: ModelB
      InitialVariantWeight: 10   # 10% traffic
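Variant weights are relative, so each variant's share of traffic is its weight divided by the sum of all weights. The short sketch below computes those shares; `traffic_shares` is a hypothetical helper written for this example.

```python
def traffic_shares(variants):
    """Return each variant's fraction of traffic from its relative weight."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {
        v["VariantName"]: v["InitialVariantWeight"] / total
        for v in variants
    }


variants = [
    {"ModelName": "model-v1", "VariantName": "ModelA", "InitialVariantWeight": 90},
    {"ModelName": "model-v2", "VariantName": "ModelB", "InitialVariantWeight": 10},
]
shares = traffic_shares(variants)
```

Weights of 9 and 1 would give the same split as 90 and 10. Shifting traffic later without redeploying is done through the `UpdateEndpointWeightsAndCapacities` API.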
Deployment strategies:
| Strategy | Traffic Split | Risk | Rollback Speed |
|---|---|---|---|
| Blue/Green | 100% instant cutover | High | Fast (old environment is kept for rollback) |
| Canary | 5% → 25% → 50% → 100% | Low | Fast |
| Linear | Incremental % over time | Low | Fast |
| Shadow | 100% to both, serve from A only | Zero | N/A |
Shadow mode:
- Route all traffic to BOTH old and new model
- Only return predictions from OLD model to users
- Compare new model’s predictions in background
- Zero user impact, full validation before cutover
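The shadow pattern described above can be sketched in a few lines: both models see every request, only the production answer is returned, and differences are logged for later review. The function and the two stand-in models are hypothetical; a real setup would invoke endpoints rather than local callables.

```python
def shadow_invoke(request, production_model, shadow_model, comparison_log):
    """Serve from production; run the shadow model only for comparison."""
    production_result = production_model(request)
    try:
        shadow_result = shadow_model(request)
        comparison_log.append({
            "request": request,
            "production": production_result,
            "shadow": shadow_result,
            "agree": production_result == shadow_result,
        })
    except Exception as exc:
        # A failing shadow model must never affect the user-facing response.
        comparison_log.append({
            "request": request,
            "production": production_result,
            "shadow_error": repr(exc),
        })
    return production_result


# Stand-in models: flag transactions above a threshold amount.
old_model = lambda amount: amount > 1000
new_model = lambda amount: amount > 800

log = []
answers = [shadow_invoke(a, old_model, new_model, log) for a in (500, 900, 1500)]
```

Here the caller only ever sees the old model's answers, while the log shows that the two models disagree on the 900 transaction.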
Model Lineage Tracking
Automatically tracked by SageMaker:
Dataset (S3) ──► Training Job ──► Model ──► Endpoint
│ │ │ │
└── Artifacts ───┘ └── Actions ┘
└─── Contexts (grouping related objects) ───┘
- Artifacts: data and model files (URIs + metadata)
- Actions: what happened (training, deployment)
- Contexts: logical groupings (experiment, pipeline run)
- Query lineage: “What data was used to train the model serving endpoint X?”
Infrastructure as Code for ML
- CloudFormation: define SageMaker resources as YAML/JSON templates
- AWS CDK: define SageMaker resources in Python/TypeScript code
- Terraform: third-party IaC (HCL), popular in multi-cloud environments
SageMaker Projects
Pre-built MLOps templates with integrated CI/CD.
What you get:
SageMaker Project
├── CodeCommit repos (model build + model deploy)
├── CodePipeline (automated workflow)
├── CodeBuild (build and test)
└── SageMaker Pipelines (ML workflow)
Built-in templates:
- Model build, train, deploy pipeline
- Model monitoring with retraining trigger
- Multi-account deployment (dev → staging → prod)
Service Catalog integration: Projects are defined as Service Catalog products — admins define approved project templates, teams instantiate them.
Quick Reference
Deployment Option → Use Case
| Scenario | Deployment Option |
|---|---|
| Fraud detection at payment time | Real-Time Endpoint |
| Internal dev/test API, low traffic | Serverless Inference |
| Score 10 million records overnight | Batch Transform |
| Process 500MB video files | Async Inference |
| 1000 customer-specific models, low per-customer traffic | Multi-Model Endpoint |
| Preprocessing + model as one unit | Inference Pipeline |
| Computer vision on factory cameras | Edge (Panorama/Greengrass) |
MLOps Component → AWS Service
| MLOps Need | AWS Service |
|---|---|
| Workflow orchestration | SageMaker Pipelines |
| Model versioning and approval | SageMaker Model Registry |
| CI/CD for ML projects | SageMaker Projects |
| Data and model quality monitoring | SageMaker Model Monitor |
| Bias and explainability | SageMaker Clarify |
| Model lineage tracking | SageMaker ML Lineage Tracking |
| Experiment tracking | SageMaker Experiments |
| Custom containers | Amazon ECR + Docker |
| Edge deployment fleet management | SageMaker Edge Manager |
| Cost-optimized DL inference chip | AWS Inferentia (ml.inf1/inf2) |
| Model optimization for any hardware | SageMaker Neo |