AWS MLA-C01 (ML Engineer Associate)

Domain 3: MLOps, Deployment & Orchestration

Exam Domain: 3, Deployment and Orchestration of ML Workflows (22%)
Tasks: Select deployment infrastructure, script infrastructure, automate CI/CD


SageMaker Inference Options

Endpoint Types

┌─────────────────────────────────────────────────────────────┐
│              SageMaker Inference Options                     │
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │  Real-Time   │  │  Serverless  │  │    Async     │      │
│  │  Endpoints   │  │  Inference   │  │  Inference   │      │
│  ├──────────────┤  ├──────────────┤  ├──────────────┤      │
│  │ Persistent   │  │ Scale to 0   │  │ Queue-based  │      │
│  │ Always-on    │  │ Cold start   │  │ Long-running │      │
│  │ Low latency  │  │ Pay per use  │  │ Large payload│      │
│  │ ms response  │  │ Intermittent │  │ Up to 1GB    │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
│                                                             │
│  ┌──────────────┐  ┌──────────────┐                        │
│  │   Batch      │  │  Inference   │                        │
│  │  Transform   │  │  Pipelines   │                        │
│  ├──────────────┤  ├──────────────┤                        │
│  │ Offline      │  │ Chain models │                        │
│  │ Whole dataset│  │ Pre/post     │                        │
│  │ No endpoint  │  │ processing   │                        │
│  │ One-time job │  │ Serial steps │                        │
│  └──────────────┘  └──────────────┘                        │
└─────────────────────────────────────────────────────────────┘

Type                Latency             Payload                  Scaling                    Best For
------------------  ------------------  -----------------------  -------------------------  ---------------------------------------------
Real-Time           ms                  6 MB                     Auto-scaling (always on)   Production APIs, interactive apps
Serverless          Cold start (~sec)   4 MB                     Scale to 0                 Intermittent traffic, cost-sensitive
Async               Minutes             1 GB                     Auto-scaling + queue       Large payloads, long processing
Batch Transform     Hours               No dataset-size limit    Managed job                Offline predictions, whole datasets
Inference Pipeline  ms                  6 MB                     Same as real-time          Multi-step: preprocess → predict → postprocess

Exam tip: Common scenario questions:

  • “Intermittent traffic, minimize cost” → Serverless
  • “Large images, processing takes minutes” → Async
  • “Score entire dataset nightly” → Batch Transform
  • “Sub-second response for API” → Real-Time

ELI5: The exam loves “which endpoint?” questions. The key is usage pattern: Real-Time = always on (like a running API server, costs money even when idle). Serverless = sleeps when idle (saves money, but has a cold-start delay). Async = drop a big file in a queue and come back later. Batch = score the whole database overnight as a one-time job. Match the pattern to the scenario and you’ll get it right every time.
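
The pattern matching described above can be sketched as a small decision helper. The function name and its parameters are hypothetical, and the thresholds simply mirror the comparison table in this guide (6 MB real-time payload, 4 MB serverless payload, a 60-second synchronous response window); it is a study aid, not an official AWS decision tree.

```python
def choose_inference_option(payload_mb, processing_seconds, intermittent,
                            whole_dataset=False):
    """Map a usage pattern to a SageMaker inference option (study sketch)."""
    if whole_dataset:
        # Scoring a full dataset offline needs no persistent endpoint.
        return "Batch Transform"
    if payload_mb > 6 or processing_seconds > 60:
        # Too large or too slow for a synchronous request: queue it.
        return "Async"
    if intermittent and payload_mb <= 4:
        # Idle periods make scale-to-zero worth the cold start.
        return "Serverless"
    return "Real-Time"


for scenario in [
    dict(payload_mb=1, processing_seconds=0.2, intermittent=True),
    dict(payload_mb=500, processing_seconds=180, intermittent=False),
    dict(payload_mb=1, processing_seconds=0.1, intermittent=False, whole_dataset=True),
    dict(payload_mb=1, processing_seconds=0.1, intermittent=False),
]:
    print(scenario, "->", choose_inference_option(**scenario))
```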

Production Variants & A/B Testing

Real-Time Endpoint with Production Variants:

                     ┌─────────────────┐
                     │  Endpoint       │
                     │                 │
  Traffic ──────────►│  Variant A: 90% │──► Model v1
                     │  Variant B: 10% │──► Model v2 (testing)
                     │                 │
                     └─────────────────┘

Use case: A/B test new model on small % of traffic before full rollout

ELI5: A/B testing for ML works just like A/B testing for websites. Send 90% of real traffic to the proven model, 10% to the new challenger. Measure actual performance metrics (accuracy, latency, business KPIs) on real users. If the new model wins, gradually shift traffic until it handles 100%. If it underperforms, roll back without anyone noticing. It’s like taste-testing a new recipe on a few tables before changing the whole menu.
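
As a rough sketch of the 90/10 split, the snippet below builds the `ProductionVariants` list that an endpoint configuration expects. The helper names, model names, and instance type are placeholders chosen for illustration; the dictionary keys follow the SageMaker `CreateEndpointConfig` request shape.

```python
def production_variants(champion_model, challenger_model,
                        challenger_share=0.10, instance_type="ml.m5.xlarge"):
    """Build a two-variant A/B configuration (names are illustrative)."""
    def variant(name, model, weight):
        return {
            "VariantName": name,
            "ModelName": model,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": weight,
        }
    return [
        variant("VariantA", champion_model, 1.0 - challenger_share),
        variant("VariantB", challenger_model, challenger_share),
    ]


def traffic_shares(variants):
    """Traffic is split in proportion to each weight over the sum of weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


variants = production_variants("model-v1", "model-v2")
print(traffic_shares(variants))
```

With boto3, a list like this would typically be passed as `ProductionVariants=variants` to `create_endpoint_config`, and the weights can be shifted later with `update_endpoint_weights_and_capacities`.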

Multi-Model Endpoints (MME)

  • Host multiple models on a single endpoint
  • Models loaded on-demand, shared instance
  • Cost-efficient when you have many models with sporadic traffic
  • Use case: per-customer models, per-region models

Multi-Container Endpoints

  • Run multiple containers on a single endpoint
  • Direct invocation (specify which container) or serial pipeline
  • Use case: ensemble models, different frameworks on same endpoint
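
The difference between the two endpoint types shows up in how a request is addressed. The sketch below only assembles keyword arguments for the SageMaker runtime `invoke_endpoint` call; the helper name, endpoint names, and artifact names are hypothetical, while `TargetModel` (multi-model) and `TargetContainerHostname` (multi-container, direct invocation) are the request parameters that select the target.

```python
def invoke_args(endpoint_name, body, target_model=None, target_container=None):
    """Assemble invoke_endpoint keyword arguments (illustrative helper)."""
    args = {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": body,
    }
    if target_model is not None:
        # Multi-model endpoint: pick the model artifact to load and run.
        args["TargetModel"] = target_model
    if target_container is not None:
        # Multi-container endpoint: call one container directly.
        args["TargetContainerHostname"] = target_container
    return args


# Per-customer model on a multi-model endpoint
print(invoke_args("mme-endpoint", b'{"x": 1}', target_model="customer-42.tar.gz"))
# Direct invocation of one container on a multi-container endpoint
print(invoke_args("mce-endpoint", b'{"x": 1}', target_container="xgboost-container"))
```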

Deployment Strategies

Deployment Strategies

Blue/Green Deployment

Before:   [Blue: Model v1 ← 100% traffic]

During:   [Blue: Model v1 ← 100%]  [Green: Model v2 ← 0%]
          (Green is provisioned and tested)

Switch:   [Blue: Model v1 ← 0%]   [Green: Model v2 ← 100%]
          (Instant cutover)

Rollback: Switch traffic back to Blue if issues detected

ELI5: Blue/Green is one of the safest deployment strategies. You build a complete copy of your environment (Green) with the new model, test it thoroughly while Blue still serves all real traffic, then switch traffic over to Green in one step. If anything goes wrong, one more switch sends traffic back to Blue. There is no downtime and very little risk, because you always have a fully working environment to fall back to. The cost is running two full environments during the cutover.

Canary Deployment

Step 1:  [Model v1 ← 95%]  [Model v2 ← 5%]    (small canary)
Step 2:  [Model v1 ← 80%]  [Model v2 ← 20%]   (increase if healthy)
Step 3:  [Model v1 ← 0%]   [Model v2 ← 100%]  (full rollout)

Alarms: CloudWatch metrics trigger auto-rollback if errors spike

Linear Deployment

Step 1:  [v1 ← 90%]  [v2 ← 10%]
Step 2:  [v1 ← 80%]  [v2 ← 20%]
Step 3:  [v1 ← 70%]  [v2 ← 30%]
  ...gradually shift all traffic over fixed intervals
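
The fixed-step shift above can be written as a short generator of traffic splits. This is a plain-Python sketch with a hypothetical function name; it models only the percentages, not the waiting interval between steps.

```python
def linear_schedule(step_percent):
    """Return (v1_percent, v2_percent) pairs for a linear traffic shift."""
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be in (0, 100]")
    schedule = []
    shifted = 0
    while shifted < 100:
        # Move one more step to v2, never exceeding 100%.
        shifted = min(100, shifted + step_percent)
        schedule.append((100 - shifted, shifted))
    return schedule


for v1, v2 in linear_schedule(10):
    print(f"[v1 <- {v1}%]  [v2 <- {v2}%]")
```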

Shadow Testing

[Model v1 ← 100% (serves responses)]
      │
      └─── Copy of traffic ───► [Model v2 (log only, don't serve)]

Compare v1 vs v2 predictions offline — zero risk to users

ELI5: Shadow testing is the ultimate safe evaluation. The new model receives a copy of every real request and generates predictions, but those predictions are thrown away — users always see Model v1’s response. You just log Model v2’s outputs and compare them to v1 offline. Zero risk to users, but you get a complete picture of how the new model would perform on real production traffic before you ever expose it.
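
The core idea can be sketched in a few lines of plain Python: the production model's answer is always returned, and the shadow model's answer is only logged. The function and the two stand-in models are hypothetical. SageMaker also offers managed shadow variants, which this sketch does not use.

```python
def serve_with_shadow(request, production_model, shadow_model, shadow_log):
    """Return the production response; record the shadow response for later."""
    response = production_model(request)
    entry = {"request": request, "production": response}
    try:
        entry["shadow"] = shadow_model(request)
    except Exception as exc:
        # A failing shadow model must never affect what the user receives.
        entry["shadow_error"] = repr(exc)
    shadow_log.append(entry)
    return response


def model_v1(x):
    return x * 2          # stand-in for the live model


def model_v2(x):
    return x * 2 + 1      # stand-in for the candidate model


log = []
answers = [serve_with_shadow(x, model_v1, model_v2, log) for x in (1, 2, 3)]
disagreements = sum(1 for e in log if e.get("shadow") != e["production"])
print(answers, disagreements)
```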

SageMaker Deployment Guardrails

  • Auto-rollback: CloudWatch alarms trigger automatic rollback
  • Built-in support for blue/green traffic shifting (all-at-once, canary, linear) and for rolling deployments
  • Configure bake time (wait period before progressing)
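
A guardrail setup along these lines might look like the dictionary below, which follows the `DeploymentConfig` shape accepted by the SageMaker `UpdateEndpoint` API. The helper name, alarm name, and numeric defaults are assumptions for illustration only.

```python
def canary_deployment_config(alarm_names, canary_percent=10, bake_seconds=300):
    """Build a canary DeploymentConfig with auto-rollback (illustrative values)."""
    return {
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": canary_percent},
                # Bake time: how long the canary runs before the full shift.
                "WaitIntervalInSeconds": bake_seconds,
            },
            # Keep the old fleet briefly after the shift in case of rollback.
            "TerminationWaitInSeconds": 300,
            "MaximumExecutionTimeoutInSeconds": 3600,
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": name} for name in alarm_names]
        },
    }


config = canary_deployment_config(["endpoint-5xx-errors"])
print(config["BlueGreenUpdatePolicy"]["TrafficRoutingConfiguration"])
```

With boto3, this would typically be passed as `DeploymentConfig=config` to `update_endpoint` together with the new endpoint configuration name.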

Auto-Scaling for Endpoints

SageMaker Auto-Scaling:

  Scaling Policy → CloudWatch Metric → Add/Remove Instances

Target Tracking Policy (recommended):
  "Keep InvocationsPerInstance at 70"
  → SM automatically adjusts instance count

Step Scaling Policy:
  If metric > 80 for 3 minutes → add 2 instances
  If metric > 95 for 1 minute  → add 5 instances
  If metric < 30 for 10 minutes → remove 1 instance

Metric                  What It Tracks
----------------------  -----------------------------------
InvocationsPerInstance  Requests per instance (most common)
CPUUtilization          CPU usage
GPUUtilization          GPU usage
ModelLatency            Time to generate prediction
OverheadLatency         SageMaker overhead time

  • Min instances: 0 (serverless), 1+ (real-time)
  • Cool-down period: Wait before next scaling action (prevent thrashing)
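
The target tracking policy above ("keep invocations per instance at 70") maps onto two Application Auto Scaling requests. The sketch below only builds the request dictionaries; the helper name, endpoint name, capacity bounds, and cool-down values are assumptions.

```python
def target_tracking_requests(endpoint_name, variant_name, target_invocations=70.0,
                             min_instances=1, max_instances=4):
    """Build register_scalable_target and put_scaling_policy arguments."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    register = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            # Cool-downs prevent thrashing between scale-out and scale-in.
            "ScaleOutCooldown": 60,
            "ScaleInCooldown": 300,
        },
    }
    return register, policy


register, policy = target_tracking_requests("my-endpoint", "VariantA")
print(register["ResourceId"], policy["PolicyType"])
```

With boto3, these would be passed to the `application-autoscaling` client as `register_scalable_target(**register)` followed by `put_scaling_policy(**policy)`.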

Edge Deployment

SageMaker Neo

Trained Model → [Neo Compiler] → Optimized Model → Deploy to Edge
                  ↓
              Compiles for target hardware:
              ARM, Intel, NVIDIA, Qualcomm, etc.
              Up to 2x performance improvement

  • Train once, run anywhere: Neo compiles the trained model separately for each target hardware platform
  • Supports: TensorFlow, PyTorch, MXNet, XGBoost, ONNX
  • Targets: EC2, IoT Greengrass, mobile devices, embedded systems

ELI5: Neo is a compiler for ML models. Just like a C++ compiler converts source code into CPU-specific machine instructions that run faster than interpreted code, Neo converts ML models into hardware-optimized binary formats. The same PyTorch model compiled for an ARM chip runs up to 2x faster than the generic version — with no change to accuracy.
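
A compilation request might be assembled as follows. The dictionary keys follow the SageMaker `CreateCompilationJob` request shape; the helper name, job name, role ARN, S3 paths, input shape, and the choice of `rasp3b` (an ARM target) are placeholders.

```python
import json


def neo_compilation_job(job_name, role_arn, model_s3_uri, output_s3_uri,
                        target_device="rasp3b", input_shape=(1, 3, 224, 224)):
    """Build create_compilation_job arguments for a PyTorch model (sketch)."""
    return {
        "CompilationJobName": job_name,
        "RoleArn": role_arn,
        "InputConfig": {
            "S3Uri": model_s3_uri,
            # Neo needs the input tensor name and shape as a JSON string.
            "DataInputConfig": json.dumps({"input0": list(input_shape)}),
            "Framework": "PYTORCH",
        },
        "OutputConfig": {
            "S3OutputLocation": output_s3_uri,
            "TargetDevice": target_device,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 900},
    }


job = neo_compilation_job(
    "resnet-neo-demo",
    "arn:aws:iam::123456789012:role/ExampleSageMakerRole",  # placeholder ARN
    "s3://example-bucket/model/model.tar.gz",                # placeholder path
    "s3://example-bucket/compiled/",                         # placeholder path
)
print(job["InputConfig"]["DataInputConfig"])
```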

AWS IoT Greengrass

  • Run ML inference locally on edge devices
  • Models trained in cloud → deployed to edge
  • Operates offline (disconnected environments)
  • Use case: factory floor defect detection, autonomous vehicles

Containerization & Compute

Docker in ML

SageMaker uses Docker containers for everything:

  ┌────────────────────────────────────────────┐
  │  Docker Container                          │
  │  ├─ /opt/ml/model/        (model artifacts)│
  │  ├─ /opt/ml/input/data/   (training data)  │
  │  ├─ /opt/ml/output/       (results)        │
  │  └─ Your code + dependencies               │
  └────────────────────────────────────────────┘

Two modes:
  Training container:  reads data → trains → writes model
  Inference container: loads model → serves predictions via HTTP

ELI5: Everything in SageMaker runs inside Docker containers. Think of containers like standardized shipping containers: no matter what’s inside (Python code, R scripts, TensorFlow, PyTorch), the outer box is always the same shape. SageMaker can pick it up and run it on any machine without worrying about dependency conflicts. Your model, its libraries, and your code all travel together as one portable unit.
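
To make the "serves predictions via HTTP" part concrete, here is a minimal sketch of an inference container's web server using only the standard library. SageMaker expects a custom inference container to listen on port 8080 and answer `GET /ping` (health check) and `POST /invocations` (prediction). The `predict` function, the `MODEL_DIR` environment variable, and the request format (`{"features": [...]}`) are assumptions made for this example.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Where SageMaker unpacks model artifacts; overridable for local experiments.
MODEL_DIR = os.environ.get("MODEL_DIR", "/opt/ml/model")


def predict(features):
    """Placeholder model; a real container would load artifacts from MODEL_DIR."""
    return {"prediction": sum(features)}


class InferenceHandler(BaseHTTPRequestHandler):
    def _reply(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check used to decide whether the container is ready.
        if self.path == "/ping":
            self._reply(200, {"status": "ok"})
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):
        if self.path != "/invocations":
            self._reply(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))["features"]
        self._reply(200, predict(features))

    def log_message(self, *args):
        pass  # keep the example quiet


def make_server(port=8080):
    return HTTPServer(("0.0.0.0", port), InferenceHandler)

# Container entrypoint would call: make_server().serve_forever()
```

Production containers usually put a proper application server in front of the model, but the two routes stay the same.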

Amazon ECR (Elastic Container Registry)

  • Private Docker registry for storing container images
  • SageMaker pulls custom training/inference containers from ECR
  • Integrates with ECS/EKS for deployment

Amazon ECS (Elastic Container Service)

Feature          Details
---------------  ---------------------------------------------------
Launch types     Fargate (serverless) or EC2 (self-managed)
Task Definition  Blueprint for containers (image, CPU, memory, ports)
Service          Maintains desired count of running tasks
Use for ML       Custom inference services, preprocessing pipelines

Amazon EKS (Elastic Kubernetes Service)

Feature          Details
---------------  -------------------------------------------
Kubernetes       Open-source container orchestration
Node types       EC2, Fargate, or both
Use for ML       Complex multi-service ML systems, Kubeflow
SageMaker + K8s  SageMaker Operators for Kubernetes

AWS Batch

  • Managed batch processing
  • Automatically provisions optimal compute (EC2 or Fargate)
  • Use case: large-scale offline data processing, batch predictions
ECS vs EKS vs Batch:
  ECS:   Simple container orchestration (AWS-native)
  EKS:   Kubernetes (portable, complex)
  Batch: Fire-and-forget batch jobs

Infrastructure as Code

AWS CloudFormation

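# ML infrastructure defined as a YAML template.
# The Model resource is assumed to be declared elsewhere in the same template.
Resources:
  SageMakerEndpoint:
    Type: AWS::SageMaker::Endpoint
    Properties:
      # !GetAtt returns the config name (!Ref would return its ARN)
      EndpointConfigName: !GetAtt EndpointConfig.EndpointConfigName

  EndpointConfig:
    Type: AWS::SageMaker::EndpointConfig
    Properties:
      ProductionVariants:
        - VariantName: AllTraffic          # required; any unique label
          ModelName: !GetAtt Model.ModelName
          InstanceType: ml.m5.xlarge
          InitialInstanceCount: 2
          InitialVariantWeight: 1.0
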
  • Templates: YAML/JSON describing infrastructure
  • Stacks: Running instance of a template
  • Change Sets: Preview changes before applying
  • Drift Detection: Detect manual changes to infrastructure
  • Use case: reproducible ML environments, version-controlled infra

AWS CDK (Cloud Development Kit)

  • Define infrastructure using programming languages (Python, TypeScript, Java, etc.)
  • Synthesizes to CloudFormation templates
  • Higher-level constructs than raw CloudFormation
  • Use case: complex infrastructure with logic, loops, conditionals

CloudFormation vs CDK:
  CloudFormation: Declarative template (YAML/JSON) that states the resources you want
  CDK:            General-purpose code (loops, conditionals) that generates that template
  CDK synthesizes → CloudFormation template → AWS deploys
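
To illustrate the idea of code that generates a declarative template, here is a sketch in plain Python. It deliberately does not use the CDK library; it only mimics the synthesis step by looping over a list and emitting CloudFormation JSON. The function name, variant names, and model names are hypothetical.

```python
import json


def synthesize_endpoint_template(variants, instance_type="ml.m5.xlarge"):
    """Generate a CloudFormation template dict from (name, model, weight) tuples."""
    production_variants = [
        {
            "VariantName": name,
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": weight,
        }
        for name, model_name, weight in variants  # the loop CDK-style code allows
    ]
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "EndpointConfig": {
                "Type": "AWS::SageMaker::EndpointConfig",
                "Properties": {"ProductionVariants": production_variants},
            },
            "SageMakerEndpoint": {
                "Type": "AWS::SageMaker::Endpoint",
                "Properties": {
                    "EndpointConfigName": {
                        "Fn::GetAtt": ["EndpointConfig", "EndpointConfigName"]
                    }
                },
            },
        },
    }


template = synthesize_endpoint_template(
    [("VariantA", "model-v1", 0.9), ("VariantB", "model-v2", 0.1)]
)
print(json.dumps(template, indent=2))
```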

CI/CD for ML

AWS CodePipeline

ML CI/CD Pipeline

Source (CodeCommit/GitHub)
    ↓
Build (CodeBuild)
    ↓
Test (CodeBuild — run tests)
    ↓
Deploy (CodeDeploy / SageMaker)

┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│  Source   │→  │  Build   │→  │  Test    │→  │  Deploy  │
│ CodeCommit│   │CodeBuild │   │CodeBuild │   │CodeDeploy│
│ GitHub    │   │(build    │   │(run unit │   │(deploy   │
│ S3       │   │ container│   │  tests)  │   │ model)   │
└──────────┘   └──────────┘   └──────────┘   └──────────┘

Service       Purpose
------------  ----------------------------------------------------
CodeCommit    Git repository (AWS managed)
CodeBuild     Build and test (compile, run tests, create container)
CodeDeploy    Deploy to EC2, ECS, Lambda
CodePipeline  Orchestrate the full CI/CD pipeline

SageMaker Pipelines

The native ML CI/CD system — purpose-built for ML workflows.

SageMaker Pipeline Example:

  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
  │ Process  │→  │  Train   │→  │ Evaluate │→  │ Register │
  │ Data     │   │  Model   │   │  Model   │   │  Model   │
  └──────────┘   └──────────┘   └──────────┘   └──────────┘
                                      │
                                      ↓
                                ┌──────────┐
                                │Condition: │
                                │accuracy   │
                                │> 0.90?    │
                                └─────┬─────┘
                              Yes ↙      ↘ No
                        ┌──────────┐  ┌──────────┐
                        │ Register │  │  Fail /  │
                        │ + Deploy │  │  Retrain │
                        └──────────┘  └──────────┘

Step Type      What It Does
-------------  ----------------------------------------
Processing     Data processing with SageMaker Processing
Training       Model training
Tuning         Hyperparameter tuning
Transform      Batch inference
Condition      Branch based on metric values
RegisterModel  Register in Model Registry
CreateModel    Create deployable model
Lambda         Run custom Lambda function
Fail           Mark pipeline as failed

Exam tip: SageMaker Pipelines is the answer when the question asks about ML-specific CI/CD or automated retraining workflows.

ELI5: SageMaker Pipelines is like a factory assembly line for ML: raw data goes in one end, a validated deployed model comes out the other. Each step (processing, training, evaluation, registration) is automated and tracked. If the model doesn’t hit an accuracy threshold, the pipeline stops and can trigger retraining. The whole thing is reproducible — run it again next month with new data and get consistent results.
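
The accuracy gate can be illustrated with a toy pipeline in plain Python. This is not the SageMaker SDK; the function names and stub steps are hypothetical and only show how a condition step decides between registering a model and failing the run.

```python
def run_pipeline(process, train, evaluate, register, accuracy_threshold=0.90):
    """Run process -> train -> evaluate, then register only if the gate passes."""
    data = process()
    model = train(data)
    accuracy = evaluate(model, data)
    if accuracy > accuracy_threshold:        # the Condition step
        return {"status": "Registered", "accuracy": accuracy,
                "model": register(model)}
    return {"status": "Failed", "accuracy": accuracy, "model": None}


# Stub steps standing in for Processing, Training, and Registration jobs
def process():
    return [1, 2, 3]


def train(data):
    return {"weights": sum(data)}


def register(model):
    return {"registered": model}


good = run_pipeline(process, train, lambda m, d: 0.93, register)
bad = run_pipeline(process, train, lambda m, d: 0.85, register)
print(good["status"], bad["status"])
```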


Workflow Orchestration

Amazon EventBridge

  • Event bus — react to events from AWS services and custom apps
  • Schedule rules (cron) or event-pattern matching
  • Use case: trigger retraining when new data arrives in S3
S3 PutObject event
    → EventBridge rule (pattern: "new training data in s3://ml-data/")
    → Target: Step Functions (start retraining pipeline)
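
The rule in this flow could use an event pattern like the one built below. The `ml-data` bucket comes from the example above; the helper name and the `training/` key prefix are assumptions. Note that S3 only sends these events once EventBridge notifications are enabled on the bucket.

```python
import json


def s3_new_data_pattern(bucket="ml-data", key_prefix="training/"):
    """Build an EventBridge event pattern for new objects under a prefix."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            "object": {"key": [{"prefix": key_prefix}]},
        },
    }


pattern = s3_new_data_pattern()
print(json.dumps(pattern, indent=2))
```

With boto3, the JSON string would typically be passed as `EventPattern` to the EventBridge `put_rule` call, and the state machine added as a target with `put_targets`.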

AWS Step Functions

  • State machine orchestration — coordinate multiple AWS services
  • Visual workflow designer
  • Built-in error handling, retries, parallel execution
  • Integrates with SageMaker, Lambda, Glue, ECS, and more
Step Functions States:
  Task      →  Call an AWS service or Lambda
  Choice    →  Branch based on condition
  Parallel  →  Run branches concurrently
  Map       →  Iterate over an array
  Wait      →  Pause for time or signal
  Succeed   →  End successfully
  Fail      →  End with error
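
Several of these state types can be combined into a small state machine definition. The sketch below builds an Amazon States Language document as a Python dictionary; the state names, the Lambda ARN, and the `$.accuracy` field are hypothetical.

```python
import json


def retraining_state_machine(train_lambda_arn, accuracy_threshold=0.90):
    """Build an ASL definition: Task -> Choice -> Succeed or Fail."""
    return {
        "Comment": "Train, then gate on accuracy (illustrative)",
        "StartAt": "TrainAndEvaluate",
        "States": {
            "TrainAndEvaluate": {
                "Type": "Task",
                "Resource": train_lambda_arn,
                # Built-in retry with exponential backoff
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }],
                "Next": "CheckAccuracy",
            },
            "CheckAccuracy": {
                "Type": "Choice",
                "Choices": [{
                    "Variable": "$.accuracy",
                    "NumericGreaterThan": accuracy_threshold,
                    "Next": "Approved",
                }],
                "Default": "Rejected",
            },
            "Approved": {"Type": "Succeed"},
            "Rejected": {
                "Type": "Fail",
                "Error": "AccuracyBelowThreshold",
                "Cause": "Model did not meet the accuracy threshold",
            },
        },
    }


definition = retraining_state_machine(
    "arn:aws:lambda:us-east-1:123456789012:function:train-and-evaluate"  # placeholder
)
print(json.dumps(definition, indent=2))
```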

ELI5: Step Functions is the general-purpose workflow engine — it can orchestrate any AWS service in any sequence. SageMaker Pipelines is ML-specific and knows about training jobs, model registries, and evaluation steps natively. MWAA (Airflow) is for teams that already have Python-based DAG workflows they want to keep. For the exam: ML workflow with training/evaluation gates = SageMaker Pipelines; everything else (ETL, cross-service automation) = Step Functions.

Amazon MWAA (Managed Apache Airflow)

  • Managed Apache Airflow — Python-based workflow orchestration
  • DAGs (Directed Acyclic Graphs) define workflows
  • Rich ecosystem of operators and hooks
  • Use case: complex multi-step ML pipelines already defined in Airflow

Git Workflow for ML

Gitflow for ML:

  main ─────────●─────────────────●──────────
                │                 ↑
  develop ──────┼──●──●──●───────┤
                │  │     │       │
  feature ──────┼──┘     │       │
  branch        │        │       │
                │        └───────┘
                         (merge when model approved)

GitHub Flow (simpler):
  main ──────●──────●──────●──────
             │      ↑      │
  feature ───┘──────┘      │
  branch                   │

AWS Lake Formation

Centralized data governance for data lakes.

Feature                  Purpose
-----------------------  --------------------------------------------
Centralized permissions  Single place to manage access to S3 data lake
Fine-grained access      Column-level and row-level security
Data Catalog             Built on Glue Data Catalog
Cross-account sharing    Share data across AWS accounts
Governed tables          ACID transactions on S3
Data filters             Cell-level security (row + column combined)

Lake Formation Governance

Lake Formation sits on top of S3 + Glue:

  Users / Services
       ↓
  Lake Formation (permissions)
       ↓
  Glue Data Catalog (metadata)
       ↓
  S3 Data Lake (storage)
       ↓
  Accessed by: Athena, Redshift, EMR, Glue ETL

Exam tip: When the question asks about fine-grained access control for a data lake → Lake Formation.


Quick Reference: When to Use What

Scenario                           Service
---------------------------------  -------------------------------------
ML-specific CI/CD pipeline         SageMaker Pipelines
General CI/CD                      CodePipeline + CodeBuild + CodeDeploy
React to AWS events                EventBridge
Visual workflow with AWS services  Step Functions
Python DAG-based workflows         MWAA (Airflow)
Define infra as YAML               CloudFormation
Define infra as code (Python/TS)   CDK
Store Docker images                ECR
Run containers (simple)            ECS + Fargate
Run containers (Kubernetes)        EKS
Batch processing jobs              AWS Batch
Deploy to edge devices             SageMaker Neo + IoT Greengrass
Data lake governance               Lake Formation
A/B test models                    Production Variants
Gradual traffic shift              Canary / Linear deployment
Zero-risk model comparison         Shadow testing

Additional Services to Know

AWS Lambda Limits

Limit         Value
------------  ------------------------------
Timeout       15 minutes max
Memory        128 MB – 10 GB
Package size  50 MB zipped, 250 MB unzipped
Concurrency   1000 default (can increase)
Payload       6 MB (sync), 256 KB (async)

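Exam trap: “Process a 2 GB file in Lambda” is almost always the wrong answer. The invocation payload is capped at 6 MB, and heavy processing of a file that size risks hitting the 15-minute timeout. Use ECS/Fargate or SageMaker Processing instead.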
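
The limits table can be turned into a quick feasibility check. The helper below is a hypothetical study aid that compares a workload against the timeout, memory, and synchronous payload values listed above.

```python
LAMBDA_LIMITS = {
    "timeout_s": 15 * 60,     # 15 minutes
    "memory_mb": 10 * 1024,   # 10 GB
    "sync_payload_mb": 6,
}


def lambda_blockers(runtime_s, memory_mb, payload_mb):
    """Return the list of Lambda limits a workload would exceed."""
    blockers = []
    if runtime_s > LAMBDA_LIMITS["timeout_s"]:
        blockers.append("timeout")
    if memory_mb > LAMBDA_LIMITS["memory_mb"]:
        blockers.append("memory")
    if payload_mb > LAMBDA_LIMITS["sync_payload_mb"]:
        blockers.append("payload")
    return blockers


# A 2 GB file sent as the request payload, needing 40 minutes of processing
print(lambda_blockers(runtime_s=40 * 60, memory_mb=4096, payload_mb=2048))
# A small, fast inference call
print(lambda_blockers(runtime_s=2, memory_mb=512, payload_mb=0.5))
```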

Amazon Redshift

Feature            Details
-----------------  -----------------------------------------------------------------------------------
Redshift ML        Run ML predictions using SQL (CREATE MODEL; uses SageMaker Autopilot under the hood)
Redshift Spectrum  Query S3 data directly without loading into warehouse
Firehose target    Kinesis Firehose can deliver directly to Redshift

Use Redshift when data is already in a warehouse and needs ML predictions.

Glue Job Bookmarks

Run 1: Process files A, B, C → bookmark saves position
Run 2: Only process files D, E (new since bookmark)
        Files A, B, C are SKIPPED

Without bookmarks: Reprocesses everything every run (wasteful)
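
The bookmark behavior can be modeled with a few lines of plain Python. This toy class is only an analogy for how a bookmark remembers what was already processed; it does not reflect how Glue stores bookmark state internally.

```python
class BookmarkedJob:
    """Toy job that remembers which files earlier runs already handled."""

    def __init__(self):
        self.processed = set()   # the "bookmark"

    def run(self, available_files):
        # Only files not seen in a previous run are processed.
        new_files = [f for f in available_files if f not in self.processed]
        self.processed.update(new_files)
        return new_files


job = BookmarkedJob()
run1 = job.run(["A", "B", "C"])
run2 = job.run(["A", "B", "C", "D", "E"])
print(run1, run2)
```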

Glue DataBrew vs Glue ETL vs Data Wrangler

Aspect       Glue DataBrew           Glue ETL                 Data Wrangler
-----------  ----------------------  -----------------------  ----------------------------
User         Data analyst (no code)  Data engineer (PySpark)  Data scientist (ML focus)
Scale        Moderate                Petabyte (Spark)         Moderate
ML features  None                    None                     Target leakage detection
Output       S3, Glue Catalog        S3, JDBC                 SageMaker Pipeline step code

SageMaker Pipelines — Step Caching

Run 1: ProcessingStep(data_v1) → TrainingStep(hp_v1) → Eval → Register
       All steps execute. Outputs cached.

Run 2 (only hyperparams changed):
       ProcessingStep(data_v1) → CACHED! Skip.
       TrainingStep(hp_v2) → Executes.
       Eval → Executes.

Saves hours of compute when only part of pipeline changes.
Configure: CacheConfig(enable_caching=True, expire_after="P30D")
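
The effect of step caching can be imitated with a small cache keyed on the step name and its arguments. This is a plain-Python toy, not the SageMaker SDK, and the class and step functions are hypothetical; it only shows why an unchanged step is skipped while a changed one reruns.

```python
import hashlib
import json


class StepCache:
    """Toy cache: identical step name + arguments means reuse the output."""

    def __init__(self):
        self._outputs = {}

    def run(self, step_name, arguments, fn):
        key = hashlib.sha256(
            json.dumps([step_name, arguments], sort_keys=True).encode("utf-8")
        ).hexdigest()
        if key in self._outputs:
            return self._outputs[key], True       # cache hit, step skipped
        output = fn(**arguments)
        self._outputs[key] = output
        return output, False                      # step executed


def process(data):
    return f"features({data})"


def train(features, lr):
    return f"model({features}, lr={lr})"


cache = StepCache()
# Run 1: everything executes
feats, hit_p1 = cache.run("Processing", {"data": "data_v1"}, process)
_, hit_t1 = cache.run("Training", {"features": feats, "lr": 0.1}, train)
# Run 2: only the hyperparameter changed
feats, hit_p2 = cache.run("Processing", {"data": "data_v1"}, process)
_, hit_t2 = cache.run("Training", {"features": feats, "lr": 0.01}, train)
print(hit_p1, hit_t1, hit_p2, hit_t2)
```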

Pipeline Step Types (Complete)

Step Type         Purpose
----------------  ---------------------------------------
ProcessingStep    Data prep, evaluation scripts
TrainingStep      Model training
TuningStep        Hyperparameter tuning
TransformStep     Batch predictions
CreateModelStep   Prepare model for deployment
RegisterModel     Add to Model Registry (step collection)
ConditionStep     Branch if metric > threshold
CallbackStep      Wait for human approval / external event
FailStep          Fail pipeline explicitly
LambdaStep        Run Lambda function
QualityCheckStep  Automated data/model quality gates
ClarifyCheckStep  Automated bias/explainability gates
EMRStep           Run EMR job