Domain 4B: SageMaker Deployment & MLOps

Exam Domain 4: ML Implementation and Operations (20%)
Task: Deploy models, build MLOps pipelines, and select appropriate inference strategies


Deployment Options — The Complete Picture

┌─────────────────────────────────────────────────────────────────────┐
│               SAGEMAKER DEPLOYMENT OPTIONS                         │
├──────────────────┬──────────────────┬──────────────┬───────────────┤
│  Real-Time       │  Serverless      │  Batch       │  Async        │
│  Inference       │  Inference       │  Transform   │  Inference    │
├──────────────────┼──────────────────┼──────────────┼───────────────┤
│ Persistent       │ No instance      │ One-time     │ Queue-based   │
│ instance always  │ (cold start)     │ job, S3→S3   │ processing    │
│ running          │                  │              │               │
├──────────────────┼──────────────────┼──────────────┼───────────────┤
│ ms latency       │ seconds (cold)   │ minutes-hrs  │ minutes       │
├──────────────────┼──────────────────┼──────────────┼───────────────┤
│ 24/7 cost        │ Pay per request  │ Pay per job  │ Instance time │
├──────────────────┼──────────────────┼──────────────┼───────────────┤
│ Steady traffic   │ Intermittent     │ Offline      │ Large payload │
│ Low latency SLA  │ Dev/test         │ batch score  │ Long inference│
└──────────────────┴──────────────────┴──────────────┴───────────────┘

Why this matters for the exam: Every deployment scenario question maps to one of these four options. Know the trigger words: “low latency” → Real-Time, “intermittent traffic” → Serverless, “large dataset offline” → Batch Transform, “payload > 6MB or inference > 60s” → Async.
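The trigger words above can be read as a small decision procedure. The sketch below is only an illustration of that reasoning: the function name and its parameters are hypothetical, and the 6 MB and 60 s thresholds are the limits quoted in this guide.

```python
# Limits for real-time endpoints as quoted in this guide.
REALTIME_MAX_PAYLOAD_MB = 6
REALTIME_MAX_SECONDS = 60


def choose_deployment_option(offline_dataset, payload_mb,
                             inference_seconds, steady_traffic):
    """Map a scenario to one of the four deployment options (hypothetical helper)."""
    # "Large dataset offline": nobody waits on an individual response.
    if offline_dataset:
        return "Batch Transform"
    # Either real-time limit exceeded: queue the request instead.
    if payload_mb > REALTIME_MAX_PAYLOAD_MB or inference_seconds > REALTIME_MAX_SECONDS:
        return "Async Inference"
    # Within the limits: steady traffic justifies an always-on instance.
    if steady_traffic:
        return "Real-Time Inference"
    # Within the limits but intermittent traffic.
    return "Serverless Inference"


print(choose_deployment_option(False, 500, 30, False))
```

Real questions usually mix several signals, so the order of the checks is itself a judgment call rather than an official rule.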


Real-Time Inference (Endpoints)

How It Works

┌──────────────┐     HTTPS      ┌─────────────────────────────────┐
│   Client     │ ─────────────► │  SageMaker Endpoint             │
│  (app/API)   │ ◄───────────── │  ┌──────────┐  ┌──────────┐    │
└──────────────┘   prediction   │  │ Instance │  │ Instance │    │
                                │  │  + model │  │  + model │    │
                                │  └──────────┘  └──────────┘    │
                                │         Load Balancer          │
                                └─────────────────────────────────┘
  • Model loaded into memory on persistent EC2 instances
  • Requests routed via internal load balancer
  • Predictions returned synchronously (sub-second)
  • Multiple instances for high availability and throughput

When to Use

  • Real-time user-facing applications (fraud detection at checkout, product recommendations)
  • Sub-second latency is required
  • Steady, predictable traffic patterns
  • Payload < 6 MB per request

Instance Type Selection for Inference

Instance Family | Hardware                     | Best For
ml.m5, ml.m4    | CPU, general                 | Lightweight models, tabular ML
ml.c5, ml.c4    | CPU, compute-optimized       | CPU-intensive inference, NLP
ml.g4dn         | GPU (T4)                     | Deep learning, computer vision
ml.p3           | GPU (V100)                   | Large model inference, NLP transformers
ml.inf1         | AWS Inferentia (custom chip) | High-throughput deep learning, cost-optimized
ml.inf2         | AWS Inferentia2              | Large language models, cost-optimized
ml.trn1         | AWS Trainium                 | Training (not inference)

Exam tip: For cost-optimized deep learning inference, the answer is ml.inf1 or ml.inf2 (Inferentia). For maximum GPU performance, ml.p3. For standard tabular ML, ml.m5 or ml.c5.

Auto-Scaling Endpoints

ELI5: Auto-scaling is like a restaurant that adds more tables during the lunch rush and removes them at 3pm. SageMaker watches how many requests per instance are coming in and automatically adds or removes instances to match demand.

Scaling policies:

Policy Type       | How It Works                                                    | Best For
Target Tracking   | Maintain a target metric value (e.g., 100 invocations/instance) | Most use cases
Step Scaling      | Scale by fixed amounts at defined thresholds                    | Precise control
Scheduled Scaling | Scale at specific times                                         | Known traffic patterns

Key metrics for scaling:

  • InvocationsPerInstance — most common trigger (target: your desired requests/instance)
  • CPUUtilization — scale on CPU pressure
  • MemoryUtilization — scale on memory pressure
  • GPUUtilization — scale on GPU utilization

Scaling configuration:

  • Minimum capacity: floor for scale-in (don’t go below X instances)
  • Maximum capacity: ceiling for scale-out (cost control)
  • Cooldown: wait period after scaling action before next action
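
As a rough sketch, a target tracking setup can be expressed as two request dictionaries shaped like the keyword arguments of the Application Auto Scaling calls `register_scalable_target` and `put_scaling_policy`. The helper name, the policy name, and the default cooldown are assumptions made for illustration; nothing here calls AWS.

```python
def build_target_tracking_requests(endpoint_name, variant_name,
                                   min_instances, max_instances,
                                   target_invocations, cooldown_seconds=300):
    """Build auto-scaling request dicts for one endpoint variant (hypothetical helper)."""
    # This sketch assumes a real-time endpoint, which keeps at least one instance.
    if not 1 <= min_instances <= max_instances:
        raise ValueError("need 1 <= min_instances <= max_instances")
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    # Minimum and maximum capacity: the floor and ceiling for scaling.
    scalable_target = {**common,
                       "MinCapacity": min_instances,
                       "MaxCapacity": max_instances}
    # Target tracking on invocations per instance, with cooldowns.
    scaling_policy = {
        **common,
        "PolicyName": f"{variant_name}-invocations-target",  # assumed name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleInCooldown": cooldown_seconds,
            "ScaleOutCooldown": cooldown_seconds,
        },
    }
    return scalable_target, scaling_policy


target, policy = build_target_tracking_requests("fraud-endpoint", "AllTraffic", 2, 10, 100)
print(target["ResourceId"])
```

With boto3 installed and credentials configured, the two dictionaries would be unpacked into the corresponding `application-autoscaling` client calls.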

Serverless Inference

How It Works

Request arrives ──► SageMaker checks for warm container
                         │
              ┌──────────┴──────────┐
              │ Warm container?     │
            YES                    NO
              │                    │
         Route immediately    Cold start (~seconds)
         (ms latency)         Load model, then serve
  • No persistent instance — model container spins up on demand
  • AWS manages all infrastructure
  • Cold start: first request after idle period takes a few seconds
  • Warm container: subsequent requests are fast

Configuration Parameters

  • Memory size: 1024 MB to 6144 MB — determines CPU allocation too (more memory = more CPU)
  • Max concurrent invocations: limits simultaneous requests (cost control, prevents runaway spend)
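
A minimal sketch of how these two parameters might appear in a production variant of an endpoint configuration. The helper and the default variant name are hypothetical; the sketch also assumes memory is chosen in 1 GB steps within the range above.

```python
# Memory sizes assumed selectable in 1 GB steps within the 1024-6144 MB range.
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def build_serverless_variant(model_name, memory_mb, max_concurrency,
                             variant_name="AllTraffic"):
    """Build one serverless production variant (hypothetical helper)."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {ALLOWED_MEMORY_MB}")
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    return {
        "ModelName": model_name,
        "VariantName": variant_name,
        # No InstanceType or InitialInstanceCount: there is no instance to size.
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }


print(build_serverless_variant("churn-model", 2048, 5))
```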

When to Use

  • Intermittent, unpredictable traffic (dev/test, low-volume APIs)
  • Can tolerate occasional seconds of cold start latency
  • Cost optimization: pay only for compute used, not idle time
  • Model size < 1 GB (for reasonable cold start times)

ELI5: Serverless inference is like a taxi vs. owning a car. The taxi (serverless) takes a minute to arrive but costs nothing when you’re not using it. Your car (real-time endpoint) is always in the driveway, always ready, but the insurance and payments run 24/7. If you only make a few trips a week, take the taxi.

Provisioned Concurrency: Pre-warm a specified number of instances to eliminate cold starts. Adds cost but removes latency variability. Use when you need fast response times but traffic is still intermittent.


Batch Transform

How It Works

S3 Input          SageMaker             S3 Output
┌────────┐   ┌──────────────────┐   ┌────────────┐
│input/  │──►│  Batch Transform │──►│  output/   │
│data.csv│   │  (auto-spins up  │   │ results.csv│
└────────┘   │   instances,     │   └────────────┘
             │   processes,     │
             │   terminates)    │
             └──────────────────┘
  • You specify: model, instance type/count, input S3 path, output S3 path
  • SageMaker spins up instances, processes all data, writes results, then terminates instances
  • No persistent endpoint — only pay during the job

Advanced Features

  • Data splitting: how to split input data for batching (Line, RecordIO, TFRecord, None)
  • Join source: merge input data with predictions in output (useful for audit trails)
  • Filtering: include/exclude fields from output
  • Assemble with: how to combine predictions back (Line, None)
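
To make these options concrete, here is a sketch of a request dictionary shaped like the arguments of the SageMaker `create_transform_job` API. The helper name, the CSV content type, and the specific filter expressions are assumptions chosen for illustration (drop an ID column before inference, then keep the ID and the prediction in the output).

```python
def build_transform_job_request(job_name, model_name, input_s3_uri, output_s3_uri,
                                instance_type="ml.m5.xlarge", instance_count=1):
    """Build a Batch Transform request for CSV input (hypothetical helper)."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                            "S3Uri": input_s3_uri}},
            "ContentType": "text/csv",
            "SplitType": "Line",            # data splitting: one record per line
        },
        "TransformOutput": {
            "S3OutputPath": output_s3_uri,
            "Accept": "text/csv",
            "AssembleWith": "Line",         # one prediction per output line
        },
        "DataProcessing": {
            "InputFilter": "$[1:]",         # filtering: skip column 0 (assumed ID)
            "JoinSource": "Input",          # join source: append prediction to input
            "OutputFilter": "$[0,-1]",      # keep the ID and the prediction only
        },
        "TransformResources": {"InstanceType": instance_type,
                               "InstanceCount": instance_count},
    }


request = build_transform_job_request(
    "nightly-scoring", "churn-model",
    "s3://example-bucket/input/", "s3://example-bucket/output/")
print(request["DataProcessing"])
```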

When to Use

  • Offline scoring: generate predictions for entire dataset (millions of records)
  • Periodic batch jobs: nightly scoring, weekly recommendations refresh
  • No hard latency requirement
  • Large datasets that would timeout a real-time endpoint
  • Cost-conscious: no 24/7 endpoint cost

Asynchronous Inference

How It Works

Client POST request                SageMaker
┌───────────┐                  ┌──────────────────────────┐
│  Client   │──── Request ────►│  Input S3 location       │
│           │◄─── Location ────│  (auto-queued)           │
└───────────┘   (immediate)    │          │               │
                               │          ▼               │
                               │  ┌──────────────┐       │
                               │  │ ML Instance  │       │
                               │  │ (processes   │       │
                               │  │  the job)    │       │
                               │  └──────┬───────┘       │
                               │         │               │
                               │         ▼               │
                               │  Output to S3           │
                               │  SNS notification sent  │
                               └──────────────────────────┘
  • Client sends request → gets back a location (URL), not a result
  • SageMaker queues the request and processes asynchronously
  • When done: the result is written to S3 and, if configured, an SNS notification is published
  • Endpoint can auto-scale to zero when queue is empty (major cost saving)
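
The S3 output location and the notification topics are set on the endpoint configuration. The sketch below builds a dictionary shaped like the `AsyncInferenceConfig` argument of `create_endpoint_config`; the helper name and the default concurrency are assumptions, and the topic ARNs are optional.

```python
def build_async_inference_config(output_s3_uri, success_topic_arn=None,
                                 error_topic_arn=None,
                                 max_concurrent_per_instance=4):
    """Build an async inference configuration (hypothetical helper)."""
    output_config = {"S3OutputPath": output_s3_uri}
    # SNS notifications are optional; include only the topics that were given.
    notification = {}
    if success_topic_arn:
        notification["SuccessTopic"] = success_topic_arn
    if error_topic_arn:
        notification["ErrorTopic"] = error_topic_arn
    if notification:
        output_config["NotificationConfig"] = notification
    return {
        "OutputConfig": output_config,
        "ClientConfig": {
            "MaxConcurrentInvocationsPerInstance": max_concurrent_per_instance,
        },
    }


config = build_async_inference_config(
    "s3://example-bucket/async-results/",
    success_topic_arn="arn:aws:sns:us-east-1:111122223333:example-success")
print(config)
```

On the client side, the request is then submitted with `invoke_endpoint_async`, passing an `InputLocation` in S3 and receiving an `OutputLocation` in the response.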

When to Use

  • Payload > 6 MB (real-time limit) — video files, large images, long documents
  • Inference time > 60 seconds — complex models, large batch within a request
  • Can tolerate latency of minutes (not milliseconds)
  • Cost-sensitive: want instances to scale to zero when idle

Why this matters for the exam: The payload size limit (6 MB for real-time) and inference time limit (60s) are key discriminators. If either is exceeded, answer is Async Inference.


Deployment Options Comparison Table

Feature             | Real-Time    | Serverless     | Batch Transform       | Async
Latency             | ms           | s (cold start) | minutes-hours         | minutes
Max payload         | 6 MB         | 4 MB           | Unlimited (S3)        | 1 GB
Max inference time  | 60 s         | 60 s           | Unlimited             | Up to 1 hour
Scaling             | Auto-scaling | Automatic      | N/A (one-time)        | Auto (to zero)
Cost when idle      | Continuous   | Zero           | Zero                  | Near-zero
Persistent endpoint | Yes          | Yes            | No                    | Yes
Response mechanism  | Synchronous  | Synchronous    | Results written to S3 | S3 + SNS

Multi-Model Endpoints (MME)

What and Why

Host thousands of models on a single endpoint — models are loaded and unloaded dynamically.

ELI5: Instead of renting 1,000 apartments for 1,000 tenants who rarely visit, rent one hotel. Guests check in (model loaded into memory) when they arrive, check out (model evicted from memory) when idle. The hotel is always available; you just pay for one building, not 1,000.

How It Works

┌───────────────────────────────────────────────┐
│         Multi-Model Endpoint                  │
│                                               │
│  Memory:  [Model A] [Model C] [empty...]      │
│  Disk:    model_A.tar.gz, model_B.tar.gz,     │
│           model_C.tar.gz ... (thousands)      │
│                                               │
│  Request for Model B:                         │
│  1. Check memory → not there                  │
│  2. Load from disk/S3 → into memory           │
│  3. Serve prediction                          │
│  4. Keep in memory (evict LRU if full)        │
└───────────────────────────────────────────────┘
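
The load-and-evict behavior in the diagram can be simulated with a small least-recently-used cache. This is a toy model of the idea, not SageMaker's actual implementation; the class name and the capacity of two models are assumptions.

```python
from collections import OrderedDict


class ModelCache:
    """Toy LRU cache standing in for a multi-model endpoint's memory."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._loaded = OrderedDict()   # model name -> loaded model stand-in
        self.cold_loads = 0

    def invoke(self, model_name):
        # 1. Check memory.
        if model_name in self._loaded:
            self._loaded.move_to_end(model_name)   # mark as most recently used
            return "warm"
        # 4. Evict the least recently used model if memory is full.
        if len(self._loaded) >= self.capacity:
            self._loaded.popitem(last=False)
        # 2. "Load" the model from disk/S3 into memory.
        self._loaded[model_name] = object()
        self.cold_loads += 1
        return "cold"

    def in_memory(self):
        return list(self._loaded)


cache = ModelCache(capacity=2)
for name in ["A", "C", "B", "C", "A"]:
    print(name, cache.invoke(name))
```

Only the second request for model C is served warm in this run, which is the latency trade-off listed under Limitations.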

When to Use

  • Per-customer models (1 model per customer, most customers have low traffic)
  • Per-region or per-segment models with sparse traffic
  • Cost optimization: one endpoint instead of thousands

Limitations

  • Higher latency on first call to a model (loading time)
  • Not suitable when many models need to be in memory simultaneously
  • Models must use same framework (e.g., all XGBoost, or all TensorFlow)

Multi-Container Endpoints (MCE)

Host multiple different containers on one endpoint.

Two invocation modes:

Mode            | How It Works                                       | Use Case
Direct          | Client chooses which container to invoke by name   | A/B testing different frameworks
Serial Pipeline | Output of container A becomes input to container B | Preprocessing → Model → Postprocessing

Serial inference pipeline example:

Input text ──► [SparkML preprocessing] ──► [XGBoost model] ──► [custom postprocessor] ──► Output
              Container 1                  Container 2          Container 3

Inference Pipeline

Chain multiple containers into a single SageMaker endpoint — used when you need the same transformations at inference time as at training time.

Why it matters:

  • Training applies feature transformations (scaling, encoding)
  • Those SAME transformations must be applied at inference
  • Inference Pipeline keeps them together, preventing training-serving skew

Architecture:

┌──────────────────────────────────────────────────────┐
│                 Inference Pipeline                   │
│                                                      │
│  Raw Input ──► [Preprocessing] ──► [Model] ──► Output│
│               (Scikit-learn     (XGBoost or          │
│                or SparkML)       TensorFlow)         │
└──────────────────────────────────────────────────────┘
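
A toy illustration of why the preprocessing step travels with the model. The scaler, the threshold "model", and the numbers are all made up for this sketch; the point is only that skipping the fitted transformation at inference changes the prediction.

```python
def fit_scaler(values):
    """Learn mean and standard deviation at training time."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return {"mean": mean, "std": variance ** 0.5 or 1.0}


def preprocess(raw_value, scaler):
    """Container 1: apply the SAME transformation that training used."""
    return (raw_value - scaler["mean"]) / scaler["std"]


def model(scaled_value):
    """Container 2: a stand-in model trained on scaled inputs."""
    return 1 if scaled_value > 0 else 0


def inference_pipeline(raw_value, scaler):
    """Chain the containers so callers can only send raw input."""
    return model(preprocess(raw_value, scaler))


scaler = fit_scaler([10, 20, 30])          # fitted once, during training
print(inference_pipeline(15, scaler))      # below the training mean
print(model(15))                           # skew: raw value fed straight to the model
```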

SageMaker Model Registry

Version, track, and govern ML models across their lifecycle.

Model versioning:

  • Each model registered to a model group becomes a new model version
  • Metadata: training metrics, training job ARN, dataset version, owner
  • Compare versions side by side

Approval workflow:

Training Complete ──► Pending Review ──► [Human Approves/Rejects]
                                               │
                                    ┌──────────┴──────────┐
                                  Approved              Rejected
                                    │                    │
                               Deploy to prod        Archive
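
The approval flow can be sketched as a small state check. The status strings match the registry's approval statuses (`PendingManualApproval`, `Approved`, `Rejected`); the dictionary layout and function names are hypothetical.

```python
VALID_STATUSES = ("PendingManualApproval", "Approved", "Rejected")


def register(model_name, version):
    """A newly registered version waits for a human decision."""
    return {"model": model_name, "version": version,
            "status": "PendingManualApproval"}


def review(model_version, approve):
    """Record the reviewer's decision and return the updated record."""
    updated = dict(model_version)
    updated["status"] = "Approved" if approve else "Rejected"
    return updated


def next_action(model_version):
    """Decide what automation should do for a given status."""
    status = model_version["status"]
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    return {"PendingManualApproval": "wait for review",
            "Approved": "deploy to prod",
            "Rejected": "archive"}[status]


candidate = register("churn-model", 3)
print(next_action(candidate))
print(next_action(review(candidate, approve=True)))
```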

Cross-account deployment:

  • Model registered in Account A (dev)
  • Approved model promoted to Account B (prod)
  • Permissions via IAM resource-based policies

Integration with SageMaker Pipelines:

  • Pipeline step: RegisterModel → triggers approval workflow
  • CI/CD: on approval → automated deployment

SageMaker Pipelines — ML CI/CD

What It Is

Native ML workflow orchestration — define, run, and track ML workflows as DAGs.

ELI5: Pipelines is a recipe card for your ML workflow. Each step (preprocess data, train model, evaluate accuracy, register if good enough) is defined once. You can run the recipe any time, track every run, and reuse unchanged steps from cache.

Pipeline Anatomy

┌─────────────────────────────────────────────────────────────────────┐
│                    SAGEMAKER PIPELINE (DAG)                        │
│                                                                     │
│  Parameters:                                                        │
│  ┌──────────────────────────────────────────────────────────┐      │
│  │ train_size=0.8, epochs=10, instance_type=ml.m5.xlarge    │      │
│  └──────────────────────────────────────────────────────────┘      │
│                                                                     │
│  Steps:                                                             │
│  [Processing] ──► [Training] ──► [Evaluation] ──► [Condition]      │
│      │               │               │                │            │
│  Prepare data    Train model    Compute metrics   If accuracy      │
│  Split train/    on training    on test set       > 0.85:          │
│  test sets       data                             ├─YES─► [Register]│
│                                                   └─NO──► [Fail]   │
└─────────────────────────────────────────────────────────────────────┘
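
The same flow, reduced to plain Python so the control logic is visible. This is not the SageMaker Pipelines SDK: the majority-class "model" and the function name are stand-ins, while the `train_size` parameter and the 0.85 threshold come from the diagram.

```python
def run_pipeline(records, train_size=0.8, accuracy_threshold=0.85):
    """Toy pipeline: Processing -> Training -> Evaluation -> Condition."""
    # Processing: split (feature, label) records into train and test sets.
    split = int(len(records) * train_size)
    train, test = records[:split], records[split:]
    if not train or not test:
        raise ValueError("both splits must be non-empty")

    # Training: a stand-in model that always predicts the majority label.
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)

    # Evaluation: accuracy on the held-out test set.
    correct = sum(1 for _, label in test if label == majority)
    accuracy = correct / len(test)

    # Condition: register only if the metric clears the threshold.
    status = "Registered" if accuracy > accuracy_threshold else "Failed"
    return {"accuracy": accuracy, "status": status}


print(run_pipeline([(i, 1) for i in range(10)]))
```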

Pipeline Steps

Step Type     | Purpose
Processing    | Run SageMaker Processing job (data prep, eval)
Training      | Run SageMaker Training job
Tuning        | Run Hyperparameter tuning job
Transform     | Run Batch Transform job
Model         | Create SageMaker Model
RegisterModel | Register model in Model Registry
Condition     | Branch based on metric threshold
Callback      | Call external system, wait for response
Lambda        | Run Lambda function inline
ClarifyCheck  | Check for bias or explainability
QualityCheck  | Baseline data or model quality

Key Features

  • Parameters: runtime configuration — change instance types, dataset paths, thresholds without editing pipeline code
  • Caching: skip steps that haven’t changed since last run (reuse training output if data/code unchanged)
  • Lineage: automatic tracking of inputs, outputs, parameters for every run
  • Retry policies: automatic retry on transient failures
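
Step caching can be pictured as a lookup keyed on the step and its arguments. The sketch below is a simplified model of that idea with a hypothetical `StepCache` class; how SageMaker actually decides that a step is unchanged is not shown here.

```python
import hashlib
import json


class StepCache:
    """Toy cache: rerun a step only when its name or arguments change."""

    def __init__(self):
        self._results = {}
        self.executions = 0

    def run(self, step_name, arguments, step_function):
        # Stable key derived from the step name and its JSON-serializable arguments.
        payload = json.dumps([step_name, arguments], sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in self._results:
            self._results[key] = step_function(**arguments)
            self.executions += 1
        return self._results[key]


def train(epochs, instance_type):
    return f"model trained for {epochs} epochs on {instance_type}"


cache = StepCache()
cache.run("Training", {"epochs": 10, "instance_type": "ml.m5.xlarge"}, train)
cache.run("Training", {"epochs": 10, "instance_type": "ml.m5.xlarge"}, train)  # cache hit
cache.run("Training", {"epochs": 20, "instance_type": "ml.m5.xlarge"}, train)  # reruns
print(cache.executions)
```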

Orchestration Comparison

Feature               | SageMaker Pipelines | Step Functions          | Apache Airflow (MWAA)
ML-native             | Yes (built for ML)  | No (general)            | No (general)
SageMaker integration | Native              | Via SDK                 | Via operators
Complexity            | Low-medium          | Medium                  | High
Non-ML tasks          | Limited             | Full                    | Full
Best for              | ML workflows        | Multi-service workflows | Complex ETL + ML

Exam tip: If the question is about ML workflow orchestration and all steps are ML steps — use SageMaker Pipelines. If the workflow includes non-ML steps (DynamoDB writes, Lambda, SNS) alongside ML — consider Step Functions. If the company already uses Airflow heavily — MWAA.


Docker Containers for Custom Algorithms

Why Docker

  • Reproducible environments: same container in dev, test, prod
  • Any framework, any language (not just AWS-supported frameworks)
  • Isolation: dependencies don’t conflict
  • Portability: run on SageMaker, ECS, on-premise, or locally

SageMaker Container Contract

ELI5: SageMaker talks to your container through a strict contract — specific directory paths for data and model files, and specific entry points for training and serving. As long as your container respects this contract, SageMaker can train and deploy it anywhere.

┌──────────────────────────────────────────────────┐
│         SageMaker Container Directory Layout     │
│                                                  │
│  /opt/ml/                                        │
│  ├── input/                                      │
│  │   ├── config/                                 │
│  │   │   ├── hyperparameters.json  ◄─ SageMaker  │
│  │   │   └── resourceConfig.json      injects    │
│  │   └── data/                                   │
│  │       └── training/  ◄── YOUR training data   │
│  ├── model/  ◄── Save trained model HERE         │
│  ├── output/                                     │
│  │   └── failure  ◄── Write failure messages     │
│  └── code/  ◄── Your script (script mode)        │
│                                                  │
│  Entry points (executables on the PATH):         │
│  - train  (docker run <image> train)             │
│  - serve  (docker run <image> serve)             │
└──────────────────────────────────────────────────┘
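
A minimal sketch of a training entry point that respects this layout. The "model" it writes is just a JSON summary, and the `epochs` hyperparameter, the `model.json` file name, and the `base_dir` argument (added so the function can be tried outside a container) are assumptions for illustration.

```python
import json
from pathlib import Path


def train(base_dir="/opt/ml"):
    """Read config and data from the contract paths, then save a model artifact."""
    base = Path(base_dir)
    try:
        config_path = base / "input/config/hyperparameters.json"
        hyperparameters = json.loads(config_path.read_text())
        # Hyperparameter values arrive as strings and need explicit casting.
        epochs = int(hyperparameters.get("epochs", "1"))

        # Count the rows in every file of the "training" channel.
        channel = base / "input/data/training"
        rows = sum(len(path.read_text().splitlines())
                   for path in sorted(channel.iterdir()) if path.is_file())

        # Anything saved under model/ is packaged as the model artifact.
        model_dir = base / "model"
        model_dir.mkdir(parents=True, exist_ok=True)
        (model_dir / "model.json").write_text(
            json.dumps({"epochs": epochs, "rows_seen": rows}))
        return 0
    except Exception as error:
        # Report the reason for the failure where SageMaker looks for it.
        output_dir = base / "output"
        output_dir.mkdir(parents=True, exist_ok=True)
        (output_dir / "failure").write_text(str(error))
        return 1
```

Inside a container, the return value would typically be used as the process exit code so that a non-zero status marks the training job as failed.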

Container Customization Options

Least Custom ◄──────────────────────────────────► Most Custom
     │                   │                              │
Built-in Algorithm    Script Mode              Bring Your Own
  (no Docker)      (your code +             Container (BYOC)
                  AWS container)         (full custom Docker)
     │                   │                              │
  Easiest           Most Common              Maximum Control
  No flexibility    Best trade-off           Maximum effort

Script Mode (most common for custom code):

  • Use AWS pre-built container for your framework (TensorFlow, PyTorch, sklearn)
  • Provide YOUR training/inference script
  • AWS handles all environment setup
  • Easy to iterate — just change your script

BYOC (Bring Your Own Container):

  • Build custom Docker image
  • Full control: custom frameworks, languages, dependencies
  • Push to Amazon ECR
  • SageMaker pulls and runs it
  • Use when: proprietary framework, unusual dependencies, compiled code

Pre-built Containers (Script Mode ready)

Framework                 | Container Available | Script Mode
TensorFlow                | Yes                 | Yes
PyTorch                   | Yes                 | Yes
MXNet                     | Yes                 | Yes
Hugging Face Transformers | Yes                 | Yes
Scikit-learn              | Yes                 | Yes
XGBoost                   | Yes                 | Yes
SparkML                   | Yes                 | Yes

SageMaker Neo

Compile ML models for optimized inference across platforms.

What it does:

  • Takes a trained model (TensorFlow SavedModel, PyTorch .pt, ONNX, XGBoost, etc.)
  • Compiles it for a specific target hardware (CPU architecture, GPU type, edge chip)
  • Output: optimized binary that runs faster with less memory

Why it matters:

  • 2x-10x faster inference on the same hardware
  • Smaller model footprint (important for edge devices)
  • Compiled model runs with AWS Deep Learning Runtime (no framework installation needed)

Supported target platforms:

  • SageMaker ML instances (cloud)
  • ARM, x86, NVIDIA GPU (edge servers)
  • Raspberry Pi, Jetson Nano (IoT edge)
  • Qualcomm, Intel, NXP chips

AWS Inferentia

AWS’s custom ML inference chip — built for high-throughput, cost-optimized deep learning inference.

┌────────────────────────────────────────────────────────────┐
│               INFERENCE CHIP COMPARISON                   │
├────────────────────┬───────────────────────────────────────┤
│ GPU (ml.g4dn/p3)   │  Inferentia (ml.inf1/inf2)           │
├────────────────────┼───────────────────────────────────────┤
│ General compute    │  Purpose-built for ML inference       │
│ High cost/hr       │  Up to 70% lower cost vs GPU         │
│ Great for training │  Great for inference only             │
│ Flexible           │  Requires Neuron SDK compilation      │
│ Any framework      │  TensorFlow, PyTorch, MXNet           │
└────────────────────┴───────────────────────────────────────┘

Neuron SDK: Compiles models to run on Inferentia chips. Similar to Neo but specifically for Inferentia hardware.

Use case: High-volume, cost-sensitive deep learning inference (NLP, computer vision at scale).


Edge Deployment

When to Deploy at the Edge

Requirement                            | Deploy at Edge
Ultra-low latency (< 5ms)              | Yes
Offline capability (no internet)       | Yes
Data privacy (data can’t leave device) | Yes
High bandwidth cost to cloud           | Yes
Real-time video analysis               | Yes

Edge ML Stack

┌─────────────────────────────────────────────────────────┐
│              EDGE DEPLOYMENT OPTIONS                    │
│                                                         │
│  Train in Cloud (SageMaker)                            │
│          │                                              │
│          ▼ Compile with Neo                             │
│  Optimized model artifact                              │
│          │                                              │
│     ┌────┴────────────────────────────┐               │
│     │                                 │               │
│     ▼                                 ▼               │
│  IoT Greengrass              SageMaker Edge Manager   │
│  (IoT device runtime)        (fleet management)       │
│     │                                 │               │
│     └─────────────┬───────────────────┘               │
│                   ▼                                    │
│            Edge Device                                 │
│      (Raspberry Pi, Jetson, etc.)                      │
└─────────────────────────────────────────────────────────┘

SageMaker Edge Manager:

  • Deploy, monitor, and update models on a fleet of edge devices
  • Data capture from edge for model retraining
  • Model versioning and rollback

AWS IoT Greengrass:

  • General IoT device runtime (not ML-specific, but supports ML inference)
  • Lambda-based or component-based ML inference at the edge
  • Integrates with AWS IoT Core for device management

MLOps Best Practices

What Makes ML Different from Software CI/CD

Traditional Software CI/CD:
  Code change ──► Build ──► Test ──► Deploy

ML CI/CD (three axes of change):
  Code change   ──► Retrain ──► Evaluate ──► Deploy
  Data change   ──► Retrain ──► Evaluate ──► Deploy
  Model drift   ──► Retrain ──► Evaluate ──► Deploy

MLOps Maturity Levels

Level | Description        | What You Have
0     | Manual             | Scripts on laptop, no automation
1     | Automated Training | Training pipeline automated, manual deployment
2     | Full CI/CD         | Code + data triggers automatic retrain + deploy

A/B Testing and Safe Deployment

ELI5: A/B testing in ML is like testing a new recipe at a restaurant. You serve the new dish to 10% of customers (model B) while 90% still get the proven dish (model A). You measure which dish gets better reviews (metrics) before switching everyone to the new recipe.

SageMaker Production Variants:

# Example endpoint configuration with A/B split
EndpointConfig:
  ProductionVariants:
    - ModelName: model-v1
      VariantName: ModelA
      InitialVariantWeight: 90    # 90% traffic
    - ModelName: model-v2
      VariantName: ModelB
      InitialVariantWeight: 10    # 10% traffic
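
Variant weights are relative: each variant receives its weight divided by the sum of all weights, so 90/10 and 9/1 give the same split. A short sketch of that arithmetic, with a hypothetical helper name:

```python
def traffic_shares(production_variants):
    """Convert variant weights into the fraction of traffic each variant receives."""
    total = sum(v["InitialVariantWeight"] for v in production_variants)
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {v["VariantName"]: v["InitialVariantWeight"] / total
            for v in production_variants}


variants = [
    {"ModelName": "model-v1", "VariantName": "ModelA", "InitialVariantWeight": 90},
    {"ModelName": "model-v2", "VariantName": "ModelB", "InitialVariantWeight": 10},
]
print(traffic_shares(variants))
```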

Deployment strategies:

Strategy   | Traffic Split                   | Risk | Rollback Speed
Blue/Green | 100% instant cutover            | High | Fast while the old fleet is still running
Canary     | 5% → 25% → 50% → 100%           | Low  | Fast
Linear     | Incremental % over time         | Low  | Fast
Shadow     | 100% to both, serve from A only | Zero | N/A

Shadow mode:

  • Route all traffic to BOTH old and new model
  • Only return predictions from OLD model to users
  • Compare new model’s predictions in background
  • Zero user impact, full validation before cutover
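
The routing logic of shadow mode can be sketched in a few lines. This is a conceptual simulation with made-up model functions, not the SageMaker shadow testing feature itself.

```python
def serve_with_shadow(requests, production_model, shadow_model):
    """Answer every request from production while logging the shadow's output."""
    responses, comparison_log = [], []
    for request in requests:
        live = production_model(request)
        try:
            candidate = shadow_model(request)
        except Exception as error:
            # A failing shadow model must never affect the user-facing response.
            candidate = f"error: {error}"
        comparison_log.append({"request": request, "live": live,
                               "shadow": candidate, "agree": live == candidate})
        responses.append(live)      # users only ever see the OLD model's answer
    return responses, comparison_log


def old_model(amount):
    return "fraud" if amount > 1000 else "ok"


def new_model(amount):
    return "fraud" if amount > 800 else "ok"


responses, log = serve_with_shadow([500, 900, 1500], old_model, new_model)
print(responses)
print([entry["agree"] for entry in log])
```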

Model Lineage Tracking

Automatically tracked by SageMaker:

Dataset (S3) ──► Training Job ──► Model ──► Endpoint
     │                │             │           │
     └── Artifacts ───┘             └── Actions ┘
            └─── Contexts (grouping related objects) ───┘
  • Artifacts: data and model files (URIs + metadata)
  • Actions: what happened (training, deployment)
  • Contexts: logical groupings (experiment, pipeline run)
  • Query lineage: “What data was used to train the model serving endpoint X?”
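
A lineage query is a walk backwards through a graph of associations. The sketch below uses a hand-written edge list with invented entity names to answer the question above; it does not call the SageMaker lineage API.

```python
# Hypothetical lineage edges: (source entity, destination entity).
LINEAGE_EDGES = [
    ("s3://example-bucket/train.csv", "training-job-1"),
    ("training-job-1", "model-1"),
    ("model-1", "endpoint-x"),
]


def upstream(entity, edges):
    """Return everything that contributed to `entity`, nearest ancestors first."""
    parents = {}
    for source, destination in edges:
        parents.setdefault(destination, []).append(source)

    found, to_visit = [], [entity]
    while to_visit:
        node = to_visit.pop()
        for parent in parents.get(node, []):
            if parent not in found:
                found.append(parent)
                to_visit.append(parent)
    return found


print(upstream("endpoint-x", LINEAGE_EDGES))
```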

Infrastructure as Code for ML

  • CloudFormation: define SageMaker resources as YAML/JSON templates
  • AWS CDK: define SageMaker resources in Python/TypeScript code
  • Terraform: third-party IaC (HCL), popular in multi-cloud environments

SageMaker Projects

Pre-built MLOps templates with integrated CI/CD.

What you get:

SageMaker Project
├── CodeCommit repos (model build + model deploy)
├── CodePipeline (automated workflow)
├── CodeBuild (build and test)
└── SageMaker Pipelines (ML workflow)

Built-in templates:

  • Model build, train, deploy pipeline
  • Model monitoring with retraining trigger
  • Multi-account deployment (dev → staging → prod)

Service Catalog integration: Projects are defined as Service Catalog products — admins define approved project templates, teams instantiate them.


Quick Reference

Deployment Option → Use Case

Scenario                                                | Deployment Option
Fraud detection at payment time                         | Real-Time Endpoint
Internal dev/test API, low traffic                      | Serverless Inference
Score 10 million records overnight                      | Batch Transform
Process 500MB video files                               | Async Inference
1000 customer-specific models, low per-customer traffic | Multi-Model Endpoint
Preprocessing + model as one unit                       | Inference Pipeline
Computer vision on factory cameras                      | Edge (Panorama/Greengrass)

MLOps Component → AWS Service

MLOps Need                          | AWS Service
Workflow orchestration              | SageMaker Pipelines
Model versioning and approval       | SageMaker Model Registry
CI/CD for ML projects               | SageMaker Projects
Data and model quality monitoring   | SageMaker Model Monitor
Bias and explainability             | SageMaker Clarify
Model lineage tracking              | SageMaker ML Lineage
Experiment tracking                 | SageMaker Experiments
Custom containers                   | Amazon ECR + Docker
Edge deployment fleet management    | SageMaker Edge Manager
Cost-optimized DL inference chip    | AWS Inferentia (ml.inf1/inf2)
Model optimization for any hardware | SageMaker Neo