Domain 4B: SageMaker Deployment & MLOps

SageMaker Deployment & MLOps

Exam Domain: 4, ML Implementation and Operations (20%)
Task: Deploy models, build MLOps pipelines, and select appropriate inference strategies

Deployment Options — The Complete Picture

┌─────────────────────────────────────────────────────────────────────┐
│               SAGEMAKER DEPLOYMENT OPTIONS                         │
├──────────────────┬──────────────────┬──────────────┬───────────────┤
│  Real-Time       │  Serverless      │  Batch       │  Async        │
│  Inference       │  Inference       │  Transform   │  Inference    │
├──────────────────┼──────────────────┼──────────────┼───────────────┤
│ Persistent       │ No instance      │ One-time     │ Queue-based   │
│ instance always  │ (cold start)     │ job, S3→S3   │ processing    │
│ running          │                  │              │               │
├──────────────────┼──────────────────┼──────────────┼───────────────┤
│ ms latency       │ seconds (cold)   │ minutes-hrs  │ minutes       │
├──────────────────┼──────────────────┼──────────────┼───────────────┤
│ 24/7 cost        │ Pay per request  │ Pay per job  │ Pay per job   │
├──────────────────┼──────────────────┼──────────────┼───────────────┤
│ Steady traffic   │ Intermittent     │ Offline      │ Large payload │
│ Low latency SLA  │ Dev/test         │ batch score  │ Long inference│
└──────────────────┴──────────────────┴──────────────┴───────────────┘

Why this matters for the exam: Deployment scenario questions usually map to one of these four options. Know the trigger words: “low latency” → Real-Time, “intermittent traffic” → Serverless, “large dataset offline” → Batch Transform, “payload > 6 MB or inference > 60 s” → Async.
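The trigger words can be read as a small decision rule. The following is a minimal sketch, not an AWS API: the helper `choose_deployment_option` and its parameter names are hypothetical, and the limits are the ones quoted in this guide (6 MB and 60 s for real-time, 4 MB for serverless).

```python
def choose_deployment_option(payload_mb, inference_seconds,
                             offline_batch=False, intermittent_traffic=False):
    """Map the exam trigger words to a deployment option (illustrative only)."""
    # Offline scoring of a whole dataset: no endpoint is needed.
    if offline_batch:
        return "Batch Transform"
    # Either real-time limit exceeded: queue the request instead.
    if payload_mb > 6 or inference_seconds > 60:
        return "Async Inference"
    # Rare or spiky traffic that tolerates a cold start and fits in 4 MB.
    if intermittent_traffic and payload_mb <= 4:
        return "Serverless Inference"
    # Default: steady traffic with a low-latency requirement.
    return "Real-Time Inference"


print(choose_deployment_option(payload_mb=500, inference_seconds=30))
```

The order of the checks matters: the offline and limit checks come first because they rule out a synchronous endpoint regardless of traffic shape.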

Real-Time Inference (Endpoints)

How It Works

┌──────────────┐     HTTPS      ┌─────────────────────────────────┐
│   Client     │ ─────────────► │  SageMaker Endpoint             │
│  (app/API)   │ ◄───────────── │  ┌──────────┐  ┌──────────┐    │
└──────────────┘   prediction   │  │ Instance │  │ Instance │    │
                                │  │  + model │  │  + model │    │
                                │  └──────────┘  └──────────┘    │
                                │         Load Balancer          │
                                └─────────────────────────────────┘

Model loaded into memory on persistent EC2 instances
Requests routed via internal load balancer
Predictions returned synchronously (sub-second)
Multiple instances for high availability and throughput

When to Use

Real-time user-facing applications (fraud detection at checkout, product recommendations)
Sub-second latency is required
Steady, predictable traffic patterns
Payload < 6 MB per request

Instance Type Selection for Inference

Instance Family	Hardware	Best For
ml.m5, ml.m4	CPU, general	Lightweight models, tabular ML
ml.c5, ml.c4	CPU, compute-optimized	CPU-intensive inference, NLP
ml.g4dn	GPU (T4)	Deep learning, computer vision
ml.p3	GPU (V100)	Large model inference, NLP transformers
ml.inf1	AWS Inferentia (custom chip)	High-throughput deep learning, cost-optimized
ml.inf2	AWS Inferentia2	Large language models, cost-optimized
ml.trn1	AWS Trainium	Training (not inference)

Exam tip: For cost-optimized deep learning inference, the answer is ml.inf1 or ml.inf2 (Inferentia). For maximum GPU performance, ml.p3. For standard tabular ML, ml.m5 or ml.c5.

Auto-Scaling Endpoints

ELI5: Auto-scaling is like a restaurant that adds more tables during the lunch rush and removes them at 3pm. SageMaker watches how many requests per instance are coming in and automatically adds or removes instances to match demand.

Scaling policies:

Policy Type	How It Works	Best For
Target Tracking	Maintain a target metric value (e.g., 100 invocations/instance)	Most use cases
Step Scaling	Scale by fixed amounts at defined thresholds	Precise control
Scheduled Scaling	Scale at specific times	Known traffic patterns

Key metrics for scaling:

InvocationsPerInstance — most common trigger (target: your desired requests/instance)
CPUUtilization — scale on CPU pressure
MemoryUtilization — scale on memory pressure
GPUUtilization — scale on GPU utilization

Scaling configuration:

Minimum capacity: floor for scale-in (don’t go below X instances)
Maximum capacity: ceiling for scale-out (cost control)
Cooldown: wait period after scaling action before next action
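As a rough illustration of target tracking, the sketch below shows the shape of a policy as it would be passed to Application Auto Scaling, plus the arithmetic the policy converges toward. The endpoint name, variant name, and numbers are placeholders, and `desired_instances` is a hypothetical helper that ignores cooldowns and metric smoothing.

```python
import math

# Shape of a target-tracking policy for Application Auto Scaling.
# Endpoint/variant names and all numbers are placeholders.
scaling_policy = {
    "PolicyName": "invocations-target-tracking",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 100.0,  # invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,  # seconds to wait after removing capacity
        "ScaleOutCooldown": 60,  # seconds to wait after adding capacity
    },
}


def desired_instances(total_invocations, target_per_instance, min_cap, max_cap):
    """Approximate instance count that keeps the metric at its target."""
    needed = math.ceil(total_invocations / target_per_instance)
    # Clamp to the configured floor and ceiling.
    return max(min_cap, min(max_cap, needed))


print(desired_instances(450, 100, min_cap=1, max_cap=10))
```

A shorter scale-out cooldown than scale-in cooldown is a common choice: add capacity quickly, remove it cautiously.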

Serverless Inference

How It Works

Request arrives ──► SageMaker checks for warm container
                         │
              ┌──────────┴──────────┐
              │ Warm container?     │
            YES                    NO
              │                    │
         Route immediately    Cold start (~seconds)
         (ms latency)         Load model, then serve

No persistent instance — model container spins up on demand
AWS manages all infrastructure
Cold start: first request after idle period takes a few seconds
Warm container: subsequent requests are fast

Configuration Parameters

Memory size: 1024 MB to 6144 MB — determines CPU allocation too (more memory = more CPU)
Max concurrent invocations: limits simultaneous requests (cost control, prevents runaway spend)
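The two parameters above correspond to the `ServerlessConfig` block of a production variant. A minimal sketch, assuming memory is chosen in 1 GB steps within the stated range; the helper `serverless_variant` and the variant name are hypothetical, and nothing here calls AWS.

```python
VALID_MEMORY_MB = {1024, 2048, 3072, 4096, 5120, 6144}


def serverless_variant(model_name, memory_mb, max_concurrency):
    """Build a ProductionVariants entry that uses ServerlessConfig."""
    if memory_mb not in VALID_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {sorted(VALID_MEMORY_MB)}")
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    return {
        "VariantName": "AllTraffic",  # placeholder name
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }


print(serverless_variant("my-model", memory_mb=2048, max_concurrency=5))
```

Note that a serverless variant carries no instance type or instance count; those fields belong to instance-backed variants only.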

When to Use

Intermittent, unpredictable traffic (dev/test, low-volume APIs)
Can tolerate occasional seconds of cold start latency
Cost optimization: pay only for compute used, not idle time
Model size < 1 GB (for reasonable cold start times)

ELI5: Serverless inference is like a taxi vs. owning a car. The taxi (serverless) takes a minute to arrive but costs nothing when you’re not using it. Your car (real-time endpoint) is always in the driveway, always ready, but the insurance and payments run 24/7. If you only make a few trips a week, take the taxi.

Provisioned Concurrency: Keep a specified number of concurrent execution environments warm to eliminate cold starts. Adds cost but removes latency variability. Use when you need fast response times but traffic is still intermittent.

Batch Transform

How It Works

S3 Input          SageMaker             S3 Output
┌────────┐   ┌──────────────────┐   ┌────────────┐
│input/  │──►│  Batch Transform │──►│  output/   │
│data.csv│   │  (auto-spins up  │   │ results.csv│
└────────┘   │   instances,     │   └────────────┘
             │   processes,     │
             │   terminates)    │
             └──────────────────┘

You specify: model, instance type/count, input S3 path, output S3 path
SageMaker spins up instances, processes all data, writes results, then terminates instances
No persistent endpoint — only pay during the job

Advanced Features

Data splitting: how to split input data for batching (Line, RecordIO, TFRecord, None)
Join source: merge input data with predictions in output (useful for audit trails)
Filtering: include/exclude fields from output
Assemble with: how to combine predictions back (Line, None)
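The features above appear as fields in a transform job request. A minimal sketch of such a request as a plain dictionary, assuming CSV input whose first column is a record ID; the helper name, instance type, and filter expressions are illustrative choices, and nothing here calls AWS.

```python
def transform_job_request(job_name, model_name, input_uri, output_uri):
    """Build a Batch Transform request that joins predictions to the input."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_uri}
            },
            "ContentType": "text/csv",
            "SplitType": "Line",  # data splitting: one record per line
        },
        "TransformOutput": {
            "S3OutputPath": output_uri,
            "Accept": "text/csv",
            "AssembleWith": "Line",  # assemble with: one result per line
        },
        "DataProcessing": {
            "InputFilter": "$[1:]",     # drop the ID column before inference
            "JoinSource": "Input",      # join source: append prediction to input row
            "OutputFilter": "$[0,-1]",  # filtering: keep only ID and prediction
        },
        "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
    }


request = transform_job_request(
    "nightly-scoring", "my-model", "s3://my-bucket/input/", "s3://my-bucket/output/"
)
print(request["DataProcessing"])
```

Joining on the input and then filtering to ID plus prediction yields an output file that can be audited row by row against the source data.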

When to Use

Offline scoring: generate predictions for entire dataset (millions of records)
Periodic batch jobs: nightly scoring, weekly recommendations refresh
No hard latency requirement
Large datasets that would time out a real-time endpoint
Cost-conscious: no 24/7 endpoint cost

Asynchronous Inference

How It Works

Client POST request                SageMaker
┌───────────┐                  ┌──────────────────────────┐
│  Client   │──── Request ────►│  Input S3 location       │
│           │◄─── Location ────│  (auto-queued)           │
└───────────┘   (immediate)    │          │               │
                               │          ▼               │
                               │  ┌──────────────┐       │
                               │  │ ML Instance  │       │
                               │  │ (processes   │       │
                               │  │  the job)    │       │
                               │  └──────┬───────┘       │
                               │         │               │
                               │         ▼               │
                               │  Output to S3           │
                               │  SNS notification sent  │
                               └──────────────────────────┘

Client sends request → immediately gets back an S3 output location, not the result itself
SageMaker queues the request and processes it asynchronously
When done: result is written to S3 and, if configured, an SNS notification is published
Endpoint can auto-scale to zero when queue is empty (major cost saving)
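The queue-and-notify behavior is configured through the `AsyncInferenceConfig` section of an endpoint configuration. A minimal sketch that builds that section as a dictionary; the helper name, the default concurrency, and the ARNs in the usage line are placeholders.

```python
def async_inference_config(output_uri, success_topic_arn=None,
                           error_topic_arn=None, max_concurrent=4):
    """Build the AsyncInferenceConfig section of an endpoint configuration."""
    output_config = {"S3OutputPath": output_uri}
    notifications = {}
    if success_topic_arn:
        notifications["SuccessTopic"] = success_topic_arn
    if error_topic_arn:
        notifications["ErrorTopic"] = error_topic_arn
    if notifications:  # SNS topics are optional
        output_config["NotificationConfig"] = notifications
    return {
        "OutputConfig": output_config,
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": max_concurrent},
    }


print(async_inference_config(
    "s3://my-bucket/async-output/",
    success_topic_arn="arn:aws:sns:us-east-1:111122223333:inference-done",
))
```

Without SNS topics, the client has to poll the returned S3 location to learn when the result is ready.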

When to Use

Payload > 6 MB (real-time limit) — video files, large images, long documents
Inference time > 60 seconds — complex models, large batch within a request
Can tolerate latency of minutes (not milliseconds)
Cost-sensitive: want instances to scale to zero when idle

Why this matters for the exam: The payload size limit (6 MB for real-time) and inference time limit (60 s) are key discriminators. If either is exceeded, the answer is Async Inference.

Deployment Options Comparison Table

Feature	Real-Time	Serverless	Batch Transform	Async
Latency	ms	s (cold start)	minutes-hours	minutes
Max payload	6 MB	4 MB	Dataset size unlimited (S3); up to 100 MB per request	1 GB
Max inference time	60 s	60 s	Unlimited	15 min by default (configurable up to 1 hour)
Scaling	Auto-scaling	Automatic	N/A (one-time)	Auto (to zero)
Cost when idle	Continuous	Zero	Zero	Near-zero
Persistent endpoint	Yes	Yes	No	Yes
Response mechanism	Synchronous	Synchronous	S3 poll	S3 + SNS

Multi-Model Endpoints (MME)

What and Why

Host thousands of models on a single endpoint — models are loaded and unloaded dynamically.

ELI5: Instead of renting 1,000 apartments for 1,000 tenants who rarely visit, rent one hotel. Guests check in (model loaded into memory) when they arrive, check out (model evicted from memory) when idle. The hotel is always available; you just pay for one building, not 1,000.

How It Works

┌───────────────────────────────────────────────┐
│         Multi-Model Endpoint                  │
│                                               │
│  Memory:  [Model A] [Model C] [empty...]      │
│  Disk:    model_A.tar.gz, model_B.tar.gz,     │
│           model_C.tar.gz ... (thousands)      │
│                                               │
│  Request for Model B:                         │
│  1. Check memory → not there                  │
│  2. Load from disk/S3 → into memory           │
│  3. Serve prediction                          │
│  4. Keep in memory (evict LRU if full)        │
└───────────────────────────────────────────────┘
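The load-and-evict behavior in the diagram is a least-recently-used (LRU) cache. The toy class below is a sketch of that idea only: `ModelCache` is hypothetical, "loading" is a string stand-in, and capacity is counted in models rather than bytes of memory.

```python
from collections import OrderedDict


class ModelCache:
    """Toy LRU cache standing in for the memory of one MME instance."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.loaded = OrderedDict()  # model name -> loaded model object

    def get(self, name):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return self.loaded[name], "warm"
        if len(self.loaded) >= self.capacity:
            self.loaded.popitem(last=False)  # evict least recently used
        self.loaded[name] = f"<{name} weights>"  # stand-in for loading from S3/disk
        return self.loaded[name], "cold"


cache = ModelCache(capacity=2)
for name in ["A", "C", "B", "C", "A"]:
    print(name, cache.get(name)[1])
```

Every "cold" result corresponds to the extra first-call latency listed under Limitations; a working set larger than the capacity makes cold loads frequent.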

When to Use

Per-customer models (1 model per customer, most customers have low traffic)
Per-region or per-segment models with sparse traffic
Cost optimization: one endpoint instead of thousands

Limitations

Higher latency on first call to a model (loading time)
Not suitable when many models need to be in memory simultaneously
Models must use same framework (e.g., all XGBoost, or all TensorFlow)

Multi-Container Endpoints (MCE)

Host multiple different containers on one endpoint.

Two invocation modes:

Mode	How It Works	Use Case
Direct	Client chooses which container to invoke by name	A/B testing different frameworks
Serial Pipeline	Output of container A becomes input to container B	Preprocessing → Model → Postprocessing

Serial inference pipeline example:

Input text ──► [SparkML preprocessing] ──► [XGBoost model] ──► [custom postprocessor] ──► Output
              Container 1                  Container 2          Container 3

Inference Pipeline

Chain multiple containers into a single SageMaker endpoint — used when you need the same transformations at inference time as at training time.

Why it matters:

Training applies feature transformations (scaling, encoding)
Those SAME transformations must be applied at inference
Inference Pipeline keeps them together, preventing training-serving skew

Architecture:

┌──────────────────────────────────────────────────────┐
│                 Inference Pipeline                   │
│                                                      │
│  Raw Input ──► [Preprocessing] ──► [Model] ──► Output│
│               (Scikit-learn     (XGBoost or          │
│                or SparkML)       TensorFlow)         │
└──────────────────────────────────────────────────────┘
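To see why bundling matters, consider a sketch in plain Python where the scaling statistics are learned once at training time and then travel with the model. The function names and the toy threshold model are hypothetical; a real pipeline would run each stage in its own container.

```python
def fit_scaler(values):
    """Learn mean and standard deviation once, at training time."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return {"mean": mean, "std": variance ** 0.5 or 1.0}  # avoid divide-by-zero


def make_pipeline(scaler, model):
    """Chain preprocessing and model so both are always deployed together."""
    def predict(raw_value):
        scaled = (raw_value - scaler["mean"]) / scaler["std"]
        return model(scaled)
    return predict


scaler = fit_scaler([10.0, 20.0, 30.0])
pipeline = make_pipeline(scaler, model=lambda x: 1 if x > 0 else 0)
print(pipeline(25.0), pipeline(15.0))
```

If the client applied its own scaling with different statistics, the same raw input could land on the other side of the threshold, which is exactly the training-serving skew the pipeline prevents.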

SageMaker Model Registry

Version, track, and govern ML models across their lifecycle.

Model versioning:

Each model registered to a model group becomes a new model version
Metadata: training metrics, training job ARN, dataset version, owner
Compare versions side by side

Approval workflow:

Training Complete ──► Pending Review ──► [Human Approves/Rejects]
                                               │
                                    ┌──────────┴──────────┐
                                  Approved              Rejected
                                    │                    │
                               Deploy to prod        Archive
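The approval flow can be sketched as a small status handler. The three status strings match the Model Registry approval statuses; the function name, the dictionary layout, and the returned action labels are hypothetical.

```python
ALLOWED_STATUSES = {"PendingManualApproval", "Approved", "Rejected"}


def on_status_change(model_version, new_status):
    """Return the updated version record and the follow-up action."""
    if new_status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {new_status}")
    updated = dict(model_version, status=new_status)  # do not mutate the input
    action = {"Approved": "deploy", "Rejected": "archive"}.get(new_status, "wait")
    return updated, action


version = {"group": "churn-model", "version": 3, "status": "PendingManualApproval"}
print(on_status_change(version, "Approved"))
```

In practice the "deploy" action would be triggered by an event on the status change rather than by a direct function call.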

Cross-account deployment:

Model registered in Account A (dev)
Approved model promoted to Account B (prod)
Permissions via IAM resource-based policies

Integration with SageMaker Pipelines:

Pipeline step: RegisterModel → triggers approval workflow
CI/CD: on approval → automated deployment

SageMaker Pipelines — ML CI/CD

What It Is

Native ML workflow orchestration — define, run, and track ML workflows as DAGs.

ELI5: Pipelines is a recipe card for your ML workflow. Each step (preprocess data, train model, evaluate accuracy, register if good enough) is defined once. You can run the recipe any time, track every run, and reuse unchanged steps from cache.

Pipeline Anatomy

┌─────────────────────────────────────────────────────────────────────┐
│                    SAGEMAKER PIPELINE (DAG)                        │
│                                                                     │
│  Parameters:                                                        │
│  ┌──────────────────────────────────────────────────────────┐      │
│  │ train_size=0.8, epochs=10, instance_type=ml.m5.xlarge    │      │
│  └──────────────────────────────────────────────────────────┘      │
│                                                                     │
│  Steps:                                                             │
│  [Processing] ──► [Training] ──► [Evaluation] ──► [Condition]      │
│      │               │               │                │            │
│  Prepare data    Train model    Compute metrics   If accuracy      │
│  Split train/    on training    on test set       > 0.85:          │
│  test sets       data                             ├─YES─► [Register]│
│                                                   └─NO──► [Fail]   │
└─────────────────────────────────────────────────────────────────────┘
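The control flow in the diagram can be sketched in plain Python. This is not the SageMaker Pipelines SDK: `run_pipeline`, the step callables, and the toy dataset are stand-ins that only show the step order and the condition branch.

```python
def run_pipeline(params, train_step, eval_step, threshold=0.85):
    """Run Processing, Training, Evaluation, then branch on the metric."""
    executed = []

    # Processing: split a toy dataset according to the train_size parameter.
    data = list(range(10))
    cut = int(len(data) * params["train_size"])
    train_set, test_set = data[:cut], data[cut:]
    executed.append("Processing")

    model = train_step(train_set)
    executed.append("Training")

    accuracy = eval_step(model, test_set)
    executed.append("Evaluation")

    # Condition: register only if the metric clears the threshold.
    executed.append("Condition")
    executed.append("Register" if accuracy > threshold else "Fail")
    return executed


steps = run_pipeline(
    {"train_size": 0.8},
    train_step=lambda rows: {"n_rows": len(rows)},  # stand-in trainer
    eval_step=lambda model, rows: 0.9,              # stand-in metric
)
print(steps)
```

Because `train_size` and the threshold are parameters, the same definition can be rerun with different values without editing the step logic.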

Pipeline Steps

Step Type	Purpose
Processing	Run SageMaker Processing job (data prep, eval)
Training	Run SageMaker Training job
Tuning	Run Hyperparameter tuning job
Transform	Run Batch Transform job
Model	Create SageMaker Model
RegisterModel	Register model in Model Registry
Condition	Branch based on metric threshold
Callback	Call external system, wait for response
Lambda	Run Lambda function inline
ClarifyCheck	Check for bias or explainability
QualityCheck	Baseline data or model quality

Key Features

Parameters: runtime configuration — change instance types, dataset paths, thresholds without editing pipeline code
Caching: skip steps that haven’t changed since last run (reuse training output if data/code unchanged)
Lineage: automatic tracking of inputs, outputs, parameters for every run
Retry policies: automatic retry on transient failures

Orchestration Comparison

Feature	SageMaker Pipelines	Step Functions	Apache Airflow (MWAA)
ML-native	Yes (built for ML)	No (general)	No (general)
SageMaker integration	Native	Via SDK	Via operators
Complexity	Low-medium	Medium	High
Non-ML tasks	Limited	Full	Full
Best for	ML workflows	Multi-service workflows	Complex ETL + ML

Exam tip: If the question is about ML workflow orchestration and all steps are ML steps — use SageMaker Pipelines. If the workflow includes non-ML steps (DynamoDB writes, Lambda, SNS) alongside ML — consider Step Functions. If the company already uses Airflow heavily — MWAA.

Docker Containers for Custom Algorithms

Why Docker

Reproducible environments: same container in dev, test, prod
Any framework, any language (not just AWS-supported frameworks)
Isolation: dependencies don’t conflict
Portability: run on SageMaker, ECS, on-premises, or locally

SageMaker Container Contract

ELI5: SageMaker talks to your container through a strict contract — specific directory paths for data and model files, and specific entry points for training and serving. As long as your container respects this contract, SageMaker can train and deploy it anywhere.

┌──────────────────────────────────────────────────┐
│         SageMaker Container Directory Layout     │
│                                                  │
│  /opt/ml/                                        │
│  ├── input/                                      │
│  │   ├── config/                                 │
│  │   │   ├── hyperparameters.json  ◄─ SageMaker  │
│  │   │   └── resourceConfig.json      injects    │
│  │   └── data/                                   │
│  │       └── training/  ◄── YOUR training data   │
│  ├── model/  ◄── Save trained model HERE         │
│  ├── output/                                     │
│  │   └── failure  ◄── Write failure messages     │
│  └── code/  ◄── Your script (script mode)        │
│                                                  │
│  Entry points (executables on the PATH):         │
│  - train  (run as: docker run <image> train)     │
│  - serve  (run as: docker run <image> serve)     │
└──────────────────────────────────────────────────┘
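A training program that respects this layout can be quite small. The sketch below is a hypothetical `train` function, not AWS-provided code: the "training" is a stand-in that writes a JSON file, and `base_dir` is a parameter only so the function can be tried outside a container.

```python
import json
import os


def train(base_dir="/opt/ml"):
    """Minimal training entry point that follows the directory contract."""
    config_path = os.path.join(base_dir, "input", "config", "hyperparameters.json")
    model_dir = os.path.join(base_dir, "model")
    output_dir = os.path.join(base_dir, "output")
    try:
        with open(config_path) as f:
            hyperparameters = json.load(f)
        # Hyperparameter values arrive as strings, so cast explicitly.
        epochs = int(hyperparameters.get("epochs", "1"))

        # Stand-in for real training: record what would have been trained.
        os.makedirs(model_dir, exist_ok=True)
        with open(os.path.join(model_dir, "model.json"), "w") as f:
            json.dump({"epochs_trained": epochs}, f)
    except Exception as exc:
        # Write the reason to the failure file, then fail the job.
        os.makedirs(output_dir, exist_ok=True)
        with open(os.path.join(output_dir, "failure"), "w") as f:
            f.write(str(exc))
        raise
```

Everything written under the model directory is what SageMaker packages as the model artifact after the job finishes.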

Container Customization Options

Least Custom ◄──────────────────────────────────► Most Custom
     │                   │                              │
Built-in Algorithm    Script Mode              Bring Your Own
  (no Docker)      (your code +             Container (BYOC)
                  AWS container)         (full custom Docker)
     │                   │                              │
  Easiest           Most Common              Maximum Control
  No flexibility    Best trade-off           Maximum effort

Script Mode (most common for custom code):

Use AWS pre-built container for your framework (TensorFlow, PyTorch, sklearn)
Provide YOUR training/inference script
AWS handles all environment setup
Easy to iterate — just change your script

BYOC (Bring Your Own Container):

Build custom Docker image
Full control: custom frameworks, languages, dependencies
Push to Amazon ECR
SageMaker pulls and runs it
Use when: proprietary framework, unusual dependencies, compiled code

Pre-built Containers (Script Mode ready)

Framework	Container Available	Script Mode
TensorFlow	Yes	Yes
PyTorch	Yes	Yes
MXNet	Yes	Yes
Hugging Face Transformers	Yes	Yes
Scikit-learn	Yes	Yes
XGBoost	Yes	Yes
SparkML	Yes	Yes

SageMaker Neo

Compile ML models for optimized inference across platforms.

What it does:

Takes a trained model (TensorFlow SavedModel, PyTorch .pt, ONNX, XGBoost, etc.)
Compiles it for a specific target hardware (CPU architecture, GPU type, edge chip)
Output: optimized binary that runs faster with less memory

Why it matters:

Faster inference on the same hardware (the speedup varies widely by model and target)
Smaller model footprint (important for edge devices)
Compiled model runs with AWS Deep Learning Runtime (no framework installation needed)

Supported target platforms:

SageMaker ML instances (cloud)
ARM, x86, NVIDIA GPU (edge servers)
Raspberry Pi, Jetson Nano (IoT edge)
Qualcomm, Intel, NXP chips

AWS Inferentia

AWS’s custom ML inference chip — built for high-throughput, cost-optimized deep learning inference.

┌────────────────────────────────────────────────────────────┐
│               INFERENCE CHIP COMPARISON                   │
├────────────────────┬───────────────────────────────────────┤
│ GPU (ml.g4dn/p3)   │  Inferentia (ml.inf1/inf2)           │
├────────────────────┼───────────────────────────────────────┤
│ General compute    │  Purpose-built for ML inference       │
│ High cost/hr       │  Up to 70% lower cost vs GPU         │
│ Great for training │  Great for inference only             │
│ Flexible           │  Requires Neuron SDK compilation      │
│ Any framework      │  TensorFlow, PyTorch, MXNet           │
└────────────────────┴───────────────────────────────────────┘

Neuron SDK: Compiles models to run on Inferentia chips. Similar to Neo but specifically for Inferentia hardware.

Use case: High-volume, cost-sensitive deep learning inference (NLP, computer vision at scale).

Edge Deployment

When to Deploy at the Edge

Requirement	Deploy at Edge
Ultra-low latency (< 5ms)	Yes
Offline capability (no internet)	Yes
Data privacy (data can’t leave device)	Yes
High bandwidth cost to cloud	Yes
Real-time video analysis	Yes

Edge ML Stack

┌─────────────────────────────────────────────────────────┐
│              EDGE DEPLOYMENT OPTIONS                    │
│                                                         │
│  Train in Cloud (SageMaker)                            │
│          │                                              │
│          ▼ Compile with Neo                             │
│  Optimized model artifact                              │
│          │                                              │
│     ┌────┴────────────────────────────┐               │
│     │                                 │               │
│     ▼                                 ▼               │
│  IoT Greengrass              SageMaker Edge Manager   │
│  (IoT device runtime)        (fleet management)       │
│     │                                 │               │
│     └─────────────┬───────────────────┘               │
│                   ▼                                    │
│            Edge Device                                 │
│      (Raspberry Pi, Jetson, etc.)                      │
└─────────────────────────────────────────────────────────┘

SageMaker Edge Manager:

Deploy, monitor, and update models on a fleet of edge devices
Data capture from edge for model retraining
Model versioning and rollback

AWS IoT Greengrass:

General IoT device runtime (not ML-specific, but supports ML inference)
Lambda-based or component-based ML inference at the edge
Integrates with AWS IoT Core for device management

MLOps Best Practices

What Makes ML Different from Software CI/CD

Traditional Software CI/CD:
  Code change ──► Build ──► Test ──► Deploy

ML CI/CD (three axes of change):
  Code change   ──► Retrain ──► Evaluate ──► Deploy
  Data change   ──► Retrain ──► Evaluate ──► Deploy
  Model drift   ──►           Retrain ──► Deploy

MLOps Maturity Levels

Level	Description	What You Have
0	Manual	Scripts on laptop, no automation
1	Automated Training	Training pipeline automated, manual deployment
2	Full CI/CD	Code + data triggers automatic retrain + deploy

A/B Testing and Safe Deployment

ELI5: A/B testing in ML is like testing a new recipe at a restaurant. You serve the new dish to 10% of customers (model B) while 90% still get the proven dish (model A). You measure which dish gets better reviews (metrics) before switching everyone to the new recipe.

SageMaker Production Variants:

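# Example endpoint configuration with A/B split
# (instance type and count are illustrative; instance-backed variants need both)
EndpointConfig:
  ProductionVariants:
    - ModelName: model-v1
      VariantName: ModelA
      InstanceType: ml.m5.large
      InitialInstanceCount: 1
      InitialVariantWeight: 90    # 90 / (90 + 10) = 90% of traffic
    - ModelName: model-v2
      VariantName: ModelB
      InstanceType: ml.m5.large
      InitialInstanceCount: 1
      InitialVariantWeight: 10    # 10 / (90 + 10) = 10% of traffic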
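Variant weights are relative, not percentages: each variant receives its weight divided by the sum of all weights. A minimal sketch of that arithmetic, with a simulated router; the helper names are hypothetical and the routing is a local simulation, not a call to an endpoint.

```python
import random

variants = [
    {"VariantName": "ModelA", "InitialVariantWeight": 90},
    {"VariantName": "ModelB", "InitialVariantWeight": 10},
]


def traffic_shares(variants):
    """Fraction of traffic per variant: weight / sum of all weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


def pick_variant(variants, rng=random):
    """Simulate routing one request according to the weights."""
    names = [v["VariantName"] for v in variants]
    weights = [v["InitialVariantWeight"] for v in variants]
    return rng.choices(names, weights=weights, k=1)[0]


print(traffic_shares(variants))
```

Weights of 9 and 1 would give the same 90/10 split, which is why shifting traffic only requires changing the ratio.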

Deployment strategies:

Strategy	Traffic Split	Risk	Rollback Speed
Blue/Green	100% instant cutover	High	Fast while the old fleet is kept, slow once it is terminated
Canary	5% → 25% → 50% → 100%	Low	Fast
Linear	Incremental % over time	Low	Fast
Shadow	100% to both, serve from A only	Zero	N/A

Shadow mode:

Route all traffic to BOTH old and new model
Only return predictions from OLD model to users
Compare new model’s predictions in background
Zero user impact, full validation before cutover
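The four points above can be sketched as a wrapper around two models. This is an illustration of the pattern, not a SageMaker API: `shadow_predict`, the callable models, and the in-memory log are hypothetical.

```python
def shadow_predict(request, production_model, shadow_model, shadow_log):
    """Serve from the production model; record the shadow model's answer."""
    served = production_model(request)
    entry = {"request": request, "production": served}
    try:
        entry["shadow"] = shadow_model(request)
    except Exception as exc:
        # A failing shadow model must never affect the user-facing response.
        entry["shadow_error"] = repr(exc)
    shadow_log.append(entry)
    return served


log = []
result = shadow_predict(
    {"amount": 120},
    production_model=lambda r: "approve",
    shadow_model=lambda r: "review",
    shadow_log=log,
)
print(result, log)
```

Comparing the "production" and "shadow" fields across the log gives the disagreement rate to review before cutover.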

Model Lineage Tracking

Automatically tracked by SageMaker:

Dataset (S3) ──► Training Job ──► Model ──► Endpoint
     │                │             │           │
     └── Artifacts ───┘             └── Actions ┘
            └─── Contexts (grouping related objects) ───┘

Artifacts: data and model files (URIs + metadata)
Actions: what happened (training, deployment)
Contexts: logical groupings (experiment, pipeline run)
Query lineage: “What data was used to train the model serving endpoint X?”
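A lineage query of this kind is a walk upstream through a graph. The sketch below uses a hand-written dictionary of edges with made-up entity names; it is not the SageMaker lineage API.

```python
# Each key points to the entities it was derived from (hypothetical names).
DERIVED_FROM = {
    "endpoint-x": ["model-1"],
    "model-1": ["training-job-1"],
    "training-job-1": ["s3://my-bucket/train.csv"],
}


def upstream(entity, edges):
    """Return every ancestor of an entity, nearest first."""
    found, stack = [], [entity]
    while stack:
        node = stack.pop()
        for parent in edges.get(node, []):
            if parent not in found:  # guard against revisiting shared ancestors
                found.append(parent)
                stack.append(parent)
    return found


print(upstream("endpoint-x", DERIVED_FROM))
```

The last item returned for `endpoint-x` is the dataset, which answers the example question above.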

Infrastructure as Code for ML

CloudFormation: define SageMaker resources as YAML/JSON templates
AWS CDK: define SageMaker resources in Python/TypeScript code
Terraform: third-party IaC (HCL), popular in multi-cloud environments

SageMaker Projects

Pre-built MLOps templates with integrated CI/CD.

What you get:

SageMaker Project
├── CodeCommit repos (model build + model deploy)
├── CodePipeline (automated workflow)
├── CodeBuild (build and test)
└── SageMaker Pipelines (ML workflow)

Built-in templates:

Model build, train, deploy pipeline
Model monitoring with retraining trigger
Multi-account deployment (dev → staging → prod)

Service Catalog integration: Projects are defined as Service Catalog products — admins define approved project templates, teams instantiate them.

Quick Reference

Deployment Option → Use Case

Scenario	Deployment Option
Fraud detection at payment time	Real-Time Endpoint
Internal dev/test API, low traffic	Serverless Inference
Score 10 million records overnight	Batch Transform
Process 500MB video files	Async Inference
1000 customer-specific models, low per-customer traffic	Multi-Model Endpoint
Preprocessing + model as one unit	Inference Pipeline
Computer vision on factory cameras	Edge (Panorama/Greengrass)

MLOps Component → AWS Service

MLOps Need	AWS Service
Workflow orchestration	SageMaker Pipelines
Model versioning and approval	SageMaker Model Registry
CI/CD for ML projects	SageMaker Projects
Data and model quality monitoring	SageMaker Model Monitor
Bias and explainability	SageMaker Clarify
Model lineage tracking	SageMaker ML Lineage
Experiment tracking	SageMaker Experiments
Custom containers	Amazon ECR + Docker
Edge deployment fleet management	SageMaker Edge Manager
Cost-optimized DL inference chip	AWS Inferentia (ml.inf1/inf2)
Model optimization for any hardware	SageMaker Neo
