AWS MLA-C01: ML Engineer Associate

Domain 4: Monitoring, Maintenance & Security

Exam Domain: 4 (ML Solution Monitoring, Maintenance, and Security), 24% of the exam
Tasks: Monitor inference, optimize infrastructure/costs, secure resources


SageMaker Model Monitor

The primary tool for monitoring ML models in production.

Architecture

┌──────────────────────────────────────────────────────────┐
│            SageMaker Model Monitor                        │
│                                                          │
│  Endpoint Traffic                                        │
│       │                                                  │
│       ↓                                                  │
│  ┌──────────────┐     ┌────────────────┐                 │
│  │ Data Capture │────→│ S3 (captured   │                 │
│  │ (sample %)   │     │  requests +    │                 │
│  └──────────────┘     │  responses)    │                 │
│                       └───────┬────────┘                 │
│                               ↓                          │
│                  ┌────────────────────────┐               │
│                  │  Monitoring Schedule   │               │
│                  │  (hourly/daily)        │               │
│                  │                        │               │
│                  │  Compare current data  │               │
│                  │  against baseline      │               │
│                  └───────────┬────────────┘               │
│                              ↓                           │
│                  ┌────────────────────────┐               │
│                  │  Violations Report     │               │
│                  │  → CloudWatch Alarm    │               │
│                  │  → SNS Notification    │               │
│                  └────────────────────────┘               │
└──────────────────────────────────────────────────────────┘

Four Types of Monitoring

Type                        What It Detects                                       Baseline
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Data Quality                Schema changes, missing values, data type changes     Statistics from training data
Model Quality               Accuracy/precision/recall degradation                 Metrics from validation set
Bias Drift                  Bias metrics changing over time                       Clarify bias baseline
Feature Attribution Drift   SHAP values changing (feature importance shifting)    Clarify explainability baseline

Model Quality Monitor requires ground truth labels:
  Predictions (from endpoint) + Actual labels (delayed feedback)
       ↓                              ↓
  Merge on record ID → Compare → Compute metrics → Alert if degraded

ELI5: Data Quality monitoring is easy — it just checks whether incoming data “looks right” compared to training data (same schema, same value ranges, no missing fields). No labels needed. Model Quality monitoring is harder — it requires the actual correct answers (ground truth labels) to arrive later so you can check if the model was right. This distinction is a common exam trap: “monitor without labels” = Data Quality; “detect accuracy drop” = Model Quality (needs labels).
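
The merge-and-compare flow for Model Quality can be sketched in plain Python. This is only an illustration of the idea, not how Model Monitor is implemented: the record IDs, labels, and the 0.90 baseline are made-up values.

```python
from typing import Dict


def model_quality_accuracy(predictions: Dict[str, int],
                           ground_truth: Dict[str, int]) -> float:
    """Accuracy over records that have both a prediction and a label."""
    # Only records whose ground truth has already arrived can be scored.
    matched_ids = predictions.keys() & ground_truth.keys()
    if not matched_ids:
        raise ValueError("no records with both a prediction and a label")
    correct = sum(predictions[i] == ground_truth[i] for i in matched_ids)
    return correct / len(matched_ids)


# Hypothetical captured predictions, keyed by record ID.
predictions = {"r1": 1, "r2": 0, "r3": 1, "r4": 0}
# Labels arrive later; r4 has no label yet, so it is skipped.
ground_truth = {"r1": 1, "r2": 1, "r3": 1}

accuracy = model_quality_accuracy(predictions, ground_truth)
BASELINE_ACCURACY = 0.90  # assumed baseline metric from the validation set
degraded = accuracy < BASELINE_ACCURACY
print(f"accuracy={accuracy:.2f} degraded={degraded}")
```

Note that unlabeled records are simply left out, which is why delayed ground truth limits how quickly a quality drop can be detected.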

Data Capture

# Enable data capture on endpoint config
DataCaptureConfig:
    EnableCapture: true
    InitialSamplingPercentage: 100    # or lower %
    CaptureOptions:
      - CaptureMode: Input            # capture requests
      - CaptureMode: Output           # capture responses
    DestinationS3Uri: s3://bucket/capture/
  • Captures request/response payloads
  • Configurable sampling percentage (save costs)
  • Stored in S3 as JSON lines
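
The same settings can be assembled as a Python dictionary. A minimal sketch, assuming the key names of the `DataCaptureConfig` structure accepted by the `CreateEndpointConfig` API; the helper name and bucket path are hypothetical, and no AWS call is made here.

```python
from typing import Any, Dict


def build_data_capture_config(s3_uri: str,
                              sampling_percentage: int = 100) -> Dict[str, Any]:
    """Build a DataCaptureConfig dictionary (no AWS call is made)."""
    if not 0 <= sampling_percentage <= 100:
        raise ValueError("sampling_percentage must be between 0 and 100")
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with s3://")
    return {
        "EnableCapture": True,
        "InitialSamplingPercentage": sampling_percentage,
        "DestinationS3Uri": s3_uri,
        # Capture both requests and responses.
        "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
    }


# Sampling 20% of traffic trades monitoring coverage for lower storage cost.
config = build_data_capture_config("s3://bucket/capture/", sampling_percentage=20)
print(config)
```

In practice a dictionary like this would be passed as the `DataCaptureConfig` argument when creating the endpoint configuration.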

Drift Detection

Types of Drift

┌─────────────────────────────────────────────────────────┐
│                  Types of Drift                          │
│                                                         │
│  Data Drift (Covariate Shift)                           │
│  └─ Input data distribution changes                     │
│     e.g., customer demographics shift over time         │
│                                                         │
│  Concept Drift                                          │
│  └─ Relationship between input and output changes       │
│     e.g., what defines "spam" evolves                   │
│                                                         │
│  Prediction Drift                                       │
│  └─ Model output distribution changes                   │
│     e.g., model predicts more "fraud" than before       │
│                                                         │
│  Label Drift                                            │
│  └─ Ground truth distribution changes                   │
│     e.g., fraud rate actually increased                 │
└─────────────────────────────────────────────────────────┘

ELI5: Drift is the silent killer of ML models. Your model was trained on last year’s reality — if reality shifts, your model becomes outdated without anyone touching it. Data drift means your customers changed (younger demographic, new device types). Concept drift means the rules of the game changed (what counted as “spam” in 2020 is different from today). It’s like using a 2019 map in a city that’s been rebuilt — the model itself is fine, the world it learned just no longer matches the world it’s operating in.
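
One common way to put a number on data drift is the Population Stability Index (PSI), which compares how a feature is distributed across bins in the baseline versus current data. This is a generic illustration and not necessarily the statistic Model Monitor computes; the ages, bin edges, and the 0.2 rule-of-thumb threshold are assumptions.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin [edges[i], edges[i+1]); the last bin is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return [c / len(values) for c in counts]


def psi(baseline: Sequence[float], current: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index; 0 means identical binned distributions."""
    score = 0.0
    for b, c in zip(bin_fractions(baseline, edges), bin_fractions(current, edges)):
        b, c = max(b, eps), max(c, eps)  # avoid log(0) for empty bins
        score += (c - b) * math.log(c / b)
    return score


# Hypothetical customer ages: training-time baseline vs. a younger current population.
baseline_ages = [22, 25, 31, 38, 45, 52, 58, 63]
current_ages = [19, 21, 22, 24, 26, 28, 33, 41]
edges = [18, 30, 45, 65]

drift_score = psi(baseline_ages, current_ages, edges)
print("drift detected" if drift_score > 0.2 else "stable")
```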

When to Retrain

Monitoring Signal          →  Action
─────────────────────────────────────────────
Data quality violation     →  Investigate data pipeline
Model quality drop         →  Retrain with recent data
Bias drift detected        →  Retrain + investigate cause
Feature importance shift   →  Review features, possible retrain
Concept drift              →  Retrain (often with sliding window)

Automated Retraining Pipeline

Model Monitor detects drift
    → CloudWatch Alarm
    → EventBridge Rule
    → Step Functions / SageMaker Pipeline
    → Retrain → Evaluate → If better → Register → Deploy
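
Two pieces of this pipeline are easy to sketch in Python: the EventBridge event pattern that matches a CloudWatch alarm entering the ALARM state, and the "If better" gate before registering a retrained model. The alarm name, metric values, and `min_gain` parameter are hypothetical; the pattern fields follow the CloudWatch alarm state change event format as I understand it.

```python
import json
from typing import Any, Dict


def alarm_event_pattern(alarm_name: str) -> Dict[str, Any]:
    """Event pattern matching one CloudWatch alarm transitioning to ALARM."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": [alarm_name],
            "state": {"value": ["ALARM"]},
        },
    }


def should_register(candidate_metric: float, production_metric: float,
                    min_gain: float = 0.0) -> bool:
    """Register the retrained model only if it beats production by min_gain."""
    return candidate_metric > production_metric + min_gain


pattern = alarm_event_pattern("model-quality-drift")  # hypothetical alarm name
print(json.dumps(pattern, indent=2))
print(should_register(candidate_metric=0.91, production_metric=0.88, min_gain=0.01))
```

Requiring a minimum gain is a design choice that avoids redeploying for improvements that are within noise.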

Infrastructure Monitoring

Amazon CloudWatch

The central monitoring service for all AWS resources.

Feature                Purpose
─────────────────────────────────────────────────────────────────────────
Metrics                Numerical data points (CPU, memory, invocations)
Alarms                 Alert when metric crosses threshold
Logs                   Centralized log storage and search
Dashboards             Visual monitoring displays
Events / EventBridge   React to state changes
Contributor Insights   Top-N contributors to metrics
Anomaly Detection      ML-based anomaly detection on metrics

Key SageMaker CloudWatch Metrics

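Metric                 What It Tracks
─────────────────────────────────────────────────────────────────────────────
Invocations            Number of requests to endpoint
Invocation4XXErrors    Failed requests caused by client errors (4xx)
Invocation5XXErrors    Failed requests caused by server errors (5xx)
ModelLatency           Time for model to generate prediction (microseconds)
OverheadLatency        SageMaker infrastructure overhead (microseconds)
CPUUtilization         CPU usage on endpoint instances
GPUUtilization         GPU usage
MemoryUtilization      Memory usage
DiskUtilization        Disk usage on instances
GPUMemoryUtilization   GPU memory usage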

ELI5: ModelLatency is the model’s own thinking time — how long the actual inference computation takes. OverheadLatency is SageMaker’s bureaucracy time — routing the request, serializing/deserializing payloads, health checks. If ModelLatency is high, your model is too slow or under-provisioned (try a bigger instance or Neo compilation). If OverheadLatency is high, your infrastructure is overloaded (scale out with more instances). Total latency the user sees = ModelLatency + OverheadLatency.

Latency Troubleshooting

Total Latency = OverheadLatency + ModelLatency

High ModelLatency?
  → Model too complex → Simplify model or use faster instance
  → GPU not utilized → Check if model uses GPU
  → Batch too large → Reduce batch size

High OverheadLatency?
  → Instance overloaded → Scale up (bigger instance) or out (more instances)
  → Cold start (serverless) → Use provisioned concurrency
  → Network issues → Check VPC configuration
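
The decision tree above can be condensed into a small triage helper. A minimal sketch: the function name, the 100 ms budget, and the sample readings are hypothetical, and it assumes both metrics are given in microseconds as CloudWatch reports them.

```python
def diagnose_latency(model_latency_us: float, overhead_latency_us: float,
                     budget_ms: float) -> str:
    """Classify an endpoint as ok, model-bound, or overhead-bound."""
    total_ms = (model_latency_us + overhead_latency_us) / 1000.0
    if total_ms <= budget_ms:
        return "ok"
    # Whichever component dominates tells you which branch of the tree to follow.
    if model_latency_us >= overhead_latency_us:
        return "model-bound"      # simplify model, faster instance, smaller batch
    return "overhead-bound"       # scale up/out, check cold starts and VPC config


# Hypothetical readings against a 100 ms latency budget.
print(diagnose_latency(180_000, 20_000, budget_ms=100))
print(diagnose_latency(30_000, 150_000, budget_ms=100))
print(diagnose_latency(40_000, 10_000, budget_ms=100))
```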

Cost Optimization

SageMaker Cost Strategies

Strategy                    Savings                Trade-off
────────────────────────────────────────────────────────────────────────────────────────────
Spot Instances (Training)   Up to 90%              Can be interrupted; needs checkpointing
Serverless Inference        Pay per request        Cold start latency
Auto-Scaling                Match demand           Configuration complexity
Multi-Model Endpoints       Share instances        Higher latency for cold models
SageMaker Savings Plans     Up to 64%              1- or 3-year commitment
Inference Recommender       Right-size             Takes time to benchmark
AWS Inferentia/Trainium     Up to 70%/50%          Limited framework support
Neo Compilation             Up to 2x throughput    Compilation step required

ELI5: The #1 cost trick for the exam: Spot Instances for training. You can save up to 90% on training costs because AWS can reclaim the instance at any time with 2 minutes’ notice. The catch: if your job gets interrupted, you lose all progress unless you’ve enabled checkpointing (saving model state to S3 periodically). It’s like buying last-minute standby airline tickets — massively cheaper, but you might get bumped mid-flight. Checkpointing means you can board the next flight and pick up from where you left off.
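
The resume-from-checkpoint idea can be sketched without any AWS dependency. This is a toy training loop, not SageMaker code: the `stop_after` parameter simulates a Spot interruption, and a temporary directory stands in for the local checkpoint directory that SageMaker would sync to S3.

```python
import json
import os
import tempfile
from typing import Optional


def train(checkpoint_dir: str, total_epochs: int,
          stop_after: Optional[int] = None) -> int:
    """Run (or resume) training; return the number of epochs run in this call."""
    path = os.path.join(checkpoint_dir, "checkpoint.json")
    start_epoch = 0
    if os.path.exists(path):
        # A previous run was interrupted: pick up where it left off.
        with open(path) as f:
            start_epoch = json.load(f)["epoch"]
    executed = 0
    for epoch in range(start_epoch, total_epochs):
        if stop_after is not None and executed == stop_after:
            return executed  # simulated Spot interruption
        # ... one epoch of real training would go here ...
        executed += 1
        with open(path, "w") as f:
            json.dump({"epoch": epoch + 1}, f)  # save progress after each epoch
    return executed


with tempfile.TemporaryDirectory() as ckpt_dir:
    first_run = train(ckpt_dir, total_epochs=10, stop_after=4)   # interrupted
    second_run = train(ckpt_dir, total_epochs=10)                # resumes
print(first_run, second_run)
```

Without the checkpoint file, the second run would start again from epoch 0 and repeat the work already paid for.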

SageMaker Inference Recommender

Input: Model artifact + sample payload
    ↓
Inference Recommender runs benchmarks across instance types
    ↓
Output: Recommended instance type + cost/performance comparison

                Performance (throughput)
                    ↑
                    │      ● ml.p3.2xlarge
                    │   ● ml.g4dn.xlarge
                    │ ● ml.c5.xlarge     ← best value
                    │● ml.m5.large
                    └──────────────────→ Cost

AWS Cost Management Tools

Tool                Purpose
──────────────────────────────────────────────────────────────────
Cost Explorer       Visualize and analyze spending over time
Budgets             Set spending alerts and limits
Trusted Advisor     Recommendations for cost, security, performance
Compute Optimizer   Right-sizing recommendations for instances
Billing Dashboard   Current month charges breakdown

Security — IAM

IAM Fundamentals

┌─────────────────────────────────────────────┐
│               AWS IAM                        │
│                                             │
│  Users ──── belong to ──── Groups           │
│    │                         │              │
│    └─── attached to ─── Policies            │
│                              │              │
│  Roles ── assumed by ─── Services / Users   │
│    │                                        │
│    └─── attached to ─── Policies            │
│                                             │
│  Policy = JSON document defining permissions│
└─────────────────────────────────────────────┘

ELI5: IAM is the bouncer at every door in AWS. Users are real people with permanent credentials. Roles are temporary badges assumed by services or users for a limited time — when SageMaker needs to read your S3 bucket, it assumes an Execution Role, gets a temporary credential, and uses it. Policies are the list on the bouncer’s clipboard: “this badge can read S3 but cannot delete EC2 instances.” The SageMaker Execution Role is what SageMaker wears to access your data — always scope it to the minimum permissions needed.

Principle of Least Privilege

  • Grant minimum permissions needed to perform the task
  • Use IAM policies to restrict access to specific resources
  • Prefer IAM Roles over long-lived access keys

SageMaker IAM

Role                       Purpose
──────────────────────────────────────────────────────────────────────────────────────
SageMaker Execution Role   Assumed by SageMaker to access S3, ECR, CloudWatch, etc.
SageMaker Role Manager     Create least-privilege roles for ML personas

Example: SageMaker execution role policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-ml-bucket/*"
    }
  ]
}
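
To tighten the scope further, the resource can be limited to a single prefix inside the bucket. A minimal sketch that only generates the policy document; the helper name and the `training-data/` prefix are hypothetical.

```python
import json
from typing import Any, Dict


def s3_read_write_policy(bucket: str, prefix: str = "") -> Dict[str, Any]:
    """IAM policy allowing object reads/writes under one bucket prefix only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                # Narrower than the whole bucket: only keys under the prefix.
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            }
        ],
    }


policy = s3_read_write_policy("my-ml-bucket", prefix="training-data/")
print(json.dumps(policy, indent=2))
```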

Multi-Factor Authentication (MFA)

  • Required for: root account, privileged IAM users
  • Types: virtual MFA (authenticator app), FIDO security key (formerly U2F), hardware TOTP token

Encryption

Encryption at Rest vs in Transit

At Rest:                              In Transit:
  S3 → SSE-S3, SSE-KMS, SSE-C         HTTPS/TLS between services
  EBS → EBS encryption (AES-256)       SageMaker endpoints use TLS
  SageMaker → notebook, training,      Inter-node training uses TLS
              model artifacts           (optional, adds overhead)

AWS KMS (Key Management Service)

Feature                       Details
──────────────────────────────────────────────────────────────────────────────────────────────────────
AWS Managed Keys              AWS creates and manages (aws/s3, aws/ebs, etc.)
Customer Managed Keys (CMK)   You create, you control rotation/policies
Imported Keys                 Bring your own key material
Key Rotation                  Optional automatic rotation for customer managed keys (yearly by default)
Key Policies                  Who can use/manage the key
Envelope Encryption           Data key encrypts data, KMS key encrypts data key

Envelope Encryption:
  KMS Master Key
       │
       └─ encrypts → Data Key (plaintext)
                          │
                          └─ encrypts → Your Data

  Stored: Encrypted Data Key + Encrypted Data
  KMS never stores your data — only the master key

ELI5: Envelope encryption is like putting a letter in a locked box, then locking the key to that box inside a second box that only KMS can open. Your actual data is encrypted by a “data key.” That data key is itself encrypted by a KMS master key. AWS stores the encrypted data key alongside your encrypted data — so even if someone steals both, they still can’t read your data without going through KMS. AWS never sees your data directly; they only guard the master key.
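
The two-layer structure can be shown in a few lines of Python. This is a toy sketch of the envelope pattern only: the XOR keystream below is NOT secure and stands in for a real cipher such as AES, and `MASTER_KEY` is a local stand-in for a KMS key, which in reality never leaves KMS. All function names are hypothetical.

```python
import hashlib
import secrets
from typing import Tuple


def _keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy cipher: XOR data with a SHA-256 counter keystream. NOT secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


MASTER_KEY = secrets.token_bytes(32)  # stand-in for the KMS key


def envelope_encrypt(plaintext: bytes) -> Tuple[bytes, bytes]:
    data_key = secrets.token_bytes(32)                        # fresh data key
    ciphertext = _keystream_xor(data_key, plaintext)          # data key encrypts data
    encrypted_data_key = _keystream_xor(MASTER_KEY, data_key)  # master key encrypts data key
    return encrypted_data_key, ciphertext                     # both are stored together


def envelope_decrypt(encrypted_data_key: bytes, ciphertext: bytes) -> bytes:
    data_key = _keystream_xor(MASTER_KEY, encrypted_data_key)  # needs the master key
    return _keystream_xor(data_key, ciphertext)


message = b"patient record: blood pressure 120/80, sampled hourly"
stored_key, stored_data = envelope_encrypt(message)
print(envelope_decrypt(stored_key, stored_data) == message)
```

Only the encrypted data key and the encrypted data are stored; recovering the plaintext requires a call that uses the master key.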

SageMaker Encryption

What                        How
──────────────────────────────────────────────────────────────
Training data on S3         SSE-S3 or SSE-KMS
Training instance volumes   KMS encryption
Model artifacts on S3       SSE-KMS
Notebook instance storage   KMS encryption
Inter-container traffic     Optional TLS (adds ~30% overhead)
Endpoint traffic            TLS enforced

Network Security

VPC (Virtual Private Cloud)

SageMaker VPC Security

┌─────────────────── VPC ──────────────────────┐
│                                              │
│  ┌────── Public Subnet ──────┐               │
│  │  ┌──────────┐             │               │
│  │  │ NAT GW   │←── Internet │               │
│  │  └──────────┘   Gateway   │               │
│  └───────────────────────────┘               │
│           │                                  │
│  ┌────── Private Subnet ─────┐               │
│  │  ┌──────────────────┐     │               │
│  │  │ SageMaker        │     │               │
│  │  │ Training/Endpoint│     │               │
│  │  └──────────────────┘     │               │
│  └───────────────────────────┘               │
│                                              │
│  Security Group: Instance-level firewall     │
│  NACL: Subnet-level firewall                 │
└──────────────────────────────────────────────┘
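
Component         Scope            Stateful?                           Rules
───────────────────────────────────────────────────────────────────────────────────────────
Security Groups   Instance-level   Yes (return traffic auto-allowed)   Allow only (no deny)
NACLs             Subnet-level     No (must define both directions)    Allow and Deny
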
Without VPC Endpoint:
  SageMaker in VPC → Internet Gateway → S3 (public)

With VPC Endpoint:
  SageMaker in VPC → VPC Endpoint → S3 (private, never leaves AWS)
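
Type                 For                                                 Notes
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Gateway Endpoint     S3, DynamoDB                                        Free, route-table based
Interface Endpoint   Most other services (S3 also supports this type)    ENI in subnet, billed hourly plus per GB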

ELI5: VPC Endpoints are like digging a private tunnel between SageMaker and S3 inside AWS’s own network, so your training data never travels over the public internet. Without a VPC Endpoint, your data leaves your VPC, hits the internet gateway, and reaches S3 via the public AWS endpoint — technically still on AWS infrastructure, but exposed to more network hops. For the exam: any scenario that says “keep data private” or “prevent internet traffic” requires VPC Endpoints plus Private Subnets.

SageMaker VPC Configuration

  • Training in VPC: Isolate training instances, access data via VPC endpoints
  • Inference in VPC: Endpoint instances in private subnet
  • Network Isolation: Block all internet access (fully air-gapped)
  • Inter-container encryption: TLS between distributed training containers

Network Isolation Mode

enable_network_isolation=True

Effect:
  • Container has NO outbound internet access
  • Cannot pip install, make API calls, etc.
  • Can ONLY access S3 via training input channels
  • All dependencies must be baked into container image

When to use:
  • Regulatory: data cannot leave the network
  • Prevent training containers from exfiltrating data
  • HIPAA, PCI-DSS compliance

ELI5: Network Isolation is the nuclear option for data security. The training container is completely cut off — no internet, no external API calls, no package downloads mid-run. Everything (code, libraries, dependencies) must already be inside the Docker image before training starts. It’s used when the data is so sensitive (medical records, financial data) that the risk of any outbound connection — even accidental — is unacceptable. The tradeoff: you must pre-bake all dependencies, which makes image management more complex.
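
At the API level, these protections are switches on the training job request. A minimal sketch that only builds the relevant fragment, assuming the `CreateTrainingJob` field names `EnableNetworkIsolation`, `EnableInterContainerTrafficEncryption`, and `VpcConfig`; the helper name and the subnet and security group IDs are placeholders.

```python
from typing import Any, Dict, Sequence


def secure_training_settings(subnet_ids: Sequence[str],
                             security_group_ids: Sequence[str]) -> Dict[str, Any]:
    """Security-related fields for a training job request (no AWS call is made)."""
    if not subnet_ids or not security_group_ids:
        raise ValueError("at least one subnet and one security group are required")
    return {
        "EnableNetworkIsolation": True,                 # no outbound network access
        "EnableInterContainerTrafficEncryption": True,  # encrypt traffic between nodes
        "VpcConfig": {
            "Subnets": list(subnet_ids),                # private subnets
            "SecurityGroupIds": list(security_group_ids),
        },
    }


# Placeholder IDs for illustration only.
settings = secure_training_settings(["subnet-aaaa1111"], ["sg-bbbb2222"])
print(settings)
```

These fields would be merged into the rest of the training job request (algorithm, role, input channels, output path).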

HIPAA Compliance Checklist

Expect a HIPAA scenario to call for all of these (item 6 only when the scenario demands it):
  1. KMS encryption at rest (S3 + EBS)
  2. TLS in transit
  3. Inter-container encryption (distributed training)
  4. VPC with private subnets
  5. CloudTrail audit logging
  6. Network isolation (if required)

Common VPC Troubleshooting

Error                                Cause                         Fix
──────────────────────────────────────────────────────────────────────────────────────────────────────────
“Unable to download data from S3”    Missing S3 VPC endpoint       Add S3 Gateway Endpoint
“Cannot pull Docker container”       Missing ECR VPC endpoints     Add ecr.api + ecr.dkr Interface Endpoints
“Cannot write logs”                  Missing CloudWatch endpoint   Add logs Interface Endpoint
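
The fixes in the table above amount to a checklist of endpoint service names. A small sketch that reports which ones a VPC is still missing; the function name, region, and the list of existing endpoints are hypothetical inputs, and nothing is queried from AWS.

```python
from typing import Dict, Iterable

# Endpoint suffix -> endpoint type, taken from the troubleshooting table.
REQUIRED_ENDPOINTS = {
    "s3": "Gateway",
    "ecr.api": "Interface",
    "ecr.dkr": "Interface",
    "logs": "Interface",
}


def missing_endpoints(region: str, existing: Iterable[str]) -> Dict[str, str]:
    """Return required endpoint service names that are not present."""
    present = set(existing)
    missing = {}
    for suffix, kind in REQUIRED_ENDPOINTS.items():
        service_name = f"com.amazonaws.{region}.{suffix}"
        if service_name not in present:
            missing[service_name] = kind
    return missing


existing = ["com.amazonaws.us-east-1.s3", "com.amazonaws.us-east-1.logs"]
for name, kind in missing_endpoints("us-east-1", existing).items():
    print(f"add {kind} endpoint: {name}")
```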

Exam tip: “Prevent data from traversing the internet” → VPC Endpoints + Private Subnets


Sensitive Data Protection

Amazon Macie

  • ML-powered service to discover and protect sensitive data in S3
  • Automatically detects PII, financial data, credentials
  • Generates findings with severity ratings
  • Integrates with EventBridge for automated remediation

AWS Secrets Manager

  • Store and auto-rotate database credentials, API keys, tokens
  • Fine-grained IAM access control
  • Automatic rotation with Lambda functions
  • Use case: ML pipeline accessing databases, external APIs

Data Masking & Anonymization

Technique        Description                                       Reversible?
────────────────────────────────────────────────────────────────────────────────
Masking          Replace with fixed value (e.g., XXX-XX-1234)      No
Tokenization     Replace with random token, keep mapping           Yes
Hashing          One-way hash (SHA-256)                            No
Encryption       Encrypt with key                                  Yes (with key)
Generalization   Reduce precision (age 27 → “20-30”)               No
Perturbation     Add noise to numerical data                       No
Synthetic data   Generate fake but statistically similar data      N/A
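
Four of these techniques fit in a short Python sketch that shows why some are reversible and others are not. The helper names, the sample SSN, and the salt are made up for illustration; a real tokenization service would keep its mapping in a secured store rather than in memory.

```python
import hashlib
import secrets
from typing import Dict


def mask_ssn(ssn: str) -> str:
    """Masking: keep the last four digits, replace the rest. Not reversible."""
    return "XXX-XX-" + ssn[-4:]


def hash_value(value: str, salt: str) -> str:
    """Hashing: one-way SHA-256 digest. Not reversible."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def generalize_age(age: int) -> str:
    """Generalization: reduce precision to a ten-year band. Not reversible."""
    low = (age // 10) * 10
    return f"{low}-{low + 10}"


class Tokenizer:
    """Tokenization: random token plus a stored mapping, so it is reversible."""

    def __init__(self) -> None:
        self._token_to_value: Dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        token = secrets.token_hex(8)
        self._token_to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]


tokenizer = Tokenizer()
token = tokenizer.tokenize("123-45-1234")
print(mask_ssn("123-45-1234"))
print(generalize_age(27))
print(tokenizer.detokenize(token))
```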

Web Application Security

AWS WAF (Web Application Firewall)

  • Protect API endpoints from web attacks
  • Filter by: IP, rate limiting, geographic, SQL injection, XSS patterns
  • Use case: protect SageMaker public endpoints or API Gateway

AWS Shield

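Tier       Protection                                                               Cost
──────────────────────────────────────────────────────────────────────────────────────────────────────
Standard   L3/L4 DDoS protection                                                    Free (automatic)
Advanced   L3/L4/L7 + 24/7 Shield Response Team (formerly DRT) + cost protection    $3,000/month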

Compliance & Auditing

AWS CloudTrail

  • Records API activity in your AWS account (management events by default; data events such as S3 object access are opt-in)
  • Who did what, when, from where
  • Essential for security auditing and compliance
  • Store logs in S3, analyze with Athena

VPC Flow Logs

  • Capture IP traffic going to/from network interfaces in VPC
  • Troubleshoot connectivity issues
  • Monitor security (detect unusual traffic patterns)

Quick Reference: Security Checklist for ML

□ SageMaker execution role with least-privilege IAM policies
□ S3 data encrypted with SSE-KMS
□ Training instances in private VPC subnet
□ VPC endpoints for S3 and other services (no internet transit)
□ Network isolation enabled (if required)
□ Inter-container encryption for distributed training (if sensitive)
□ Model artifacts encrypted in S3
□ Notebook instance encryption enabled
□ CloudTrail logging enabled
□ Model Monitor + CloudWatch alarms configured
□ Macie scanning S3 for PII
□ Secrets Manager for credentials (not hardcoded)
□ MFA enabled for console users
□ SageMaker Role Manager for persona-based access

Quick Reference: When to Use What

Scenario                                   Service
──────────────────────────────────────────────────────────────────────────────────────────────────────
Monitor model predictions in production    SageMaker Model Monitor
Detect bias drift over time                Model Monitor + Clarify
Dashboard for endpoint metrics             CloudWatch Dashboards
Alert on metric threshold                  CloudWatch Alarms
Right-size inference instances             Inference Recommender
Reduce training cost                       Spot Instances with checkpointing
Reduce inference cost                      Serverless / Multi-Model Endpoints / Inferentia / Savings Plans
Manage encryption keys                     KMS
Keep data private in AWS                   VPC Endpoints + Private Subnets
Discover PII in S3                         Macie
Store secrets                              Secrets Manager
Audit API calls                            CloudTrail
Protect APIs from attacks                  WAF
DDoS protection                            Shield
Fine-grained data lake access              Lake Formation
Automated retraining on drift              Model Monitor → EventBridge → SageMaker Pipelines