Domain 4: Monitoring, Maintenance & Security

Monitoring, Maintenance & Security

Exam Domain 4: ML Solution Monitoring, Maintenance, and Security (24% of the exam)
Tasks: Monitor inference, optimize infrastructure/costs, secure resources

SageMaker Model Monitor

The primary tool for monitoring ML models in production.

Architecture

┌──────────────────────────────────────────────────────────┐
│            SageMaker Model Monitor                        │
│                                                          │
│  Endpoint Traffic                                        │
│       │                                                  │
│       ↓                                                  │
│  ┌──────────────┐     ┌────────────────┐                 │
│  │ Data Capture │────→│ S3 (captured   │                 │
│  │ (sample %)   │     │  requests +    │                 │
│  └──────────────┘     │  responses)    │                 │
│                       └───────┬────────┘                 │
│                               ↓                          │
│                  ┌────────────────────────┐               │
│                  │  Monitoring Schedule   │               │
│                  │  (hourly/daily)        │               │
│                  │                        │               │
│                  │  Compare current data  │               │
│                  │  against baseline      │               │
│                  └───────────┬────────────┘               │
│                              ↓                           │
│                  ┌────────────────────────┐               │
│                  │  Violations Report     │               │
│                  │  → CloudWatch Alarm    │               │
│                  │  → SNS Notification    │               │
│                  └────────────────────────┘               │
└──────────────────────────────────────────────────────────┘

Four Types of Monitoring

Type	What It Detects	Baseline
Data Quality	Schema changes, missing values, data type changes, statistical drift in feature values	Statistics and constraints from training data
Model Quality	Accuracy/precision/recall degradation	Metrics from validation set
Bias Drift	Bias metrics changing over time	Clarify bias baseline
Feature Attribution Drift	SHAP values changing (feature importance shifting)	Clarify explainability baseline

Model Quality Monitor requires ground truth labels:
  Predictions (from endpoint) + Actual labels (delayed feedback)
       ↓                              ↓
  Merge on record ID → Compare → Compute metrics → Alert if degraded

ELI5: Data Quality monitoring is easy — it just checks whether incoming data “looks right” compared to training data (same schema, same value ranges, no missing fields). No labels needed. Model Quality monitoring is harder — it requires the actual correct answers (ground truth labels) to arrive later so you can check if the model was right. This distinction is a common exam trap: “monitor without labels” = Data Quality; “detect accuracy drop” = Model Quality (needs labels).
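The merge-and-compare step for Model Quality can be sketched in a few lines. This is an illustration of the idea only, not what Model Monitor runs internally: the record IDs, the `QualityReport` class, and the 0.05 tolerance are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    matched: int
    accuracy: float
    precision: float
    recall: float


def evaluate_model_quality(predictions, ground_truth, positive=1):
    """Join predictions with late-arriving labels on record ID and score them."""
    # Only records that have BOTH a prediction and a label can be scored.
    shared_ids = predictions.keys() & ground_truth.keys()
    tp = fp = fn = tn = 0
    for record_id in shared_ids:
        predicted, actual = predictions[record_id], ground_truth[record_id]
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    matched = tp + fp + fn + tn
    return QualityReport(
        matched=matched,
        accuracy=(tp + tn) / matched if matched else 0.0,
        precision=tp / (tp + fp) if (tp + fp) else 0.0,
        recall=tp / (tp + fn) if (tp + fn) else 0.0,
    )


def is_degraded(report, baseline_accuracy, tolerance=0.05):
    """Flag a violation when accuracy falls more than `tolerance` below baseline."""
    return report.accuracy < baseline_accuracy - tolerance


# Hypothetical data: r5 was predicted but its label has not arrived yet.
predictions = {"r1": 1, "r2": 0, "r3": 1, "r4": 0, "r5": 1}
ground_truth = {"r1": 1, "r2": 0, "r3": 0, "r4": 1}
report = evaluate_model_quality(predictions, ground_truth)
```

Note that unlabeled records are skipped rather than counted as errors, which is why delayed feedback delays the metric.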

Data Capture

# Enable data capture on endpoint config
DataCaptureConfig:
    EnableCapture: true
    InitialSamplingPercentage: 100    # or lower %
    CaptureOptions:
      - CaptureMode: Input            # capture requests
      - CaptureMode: Output           # capture responses
    DestinationS3Uri: s3://bucket/capture/

Captures request/response payloads
Configurable sampling percentage (save costs)
Stored in S3 as JSON lines
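The same settings can be assembled in Python before they are passed to the SageMaker API. A minimal sketch: the helper name `build_data_capture_config` is hypothetical, and the dictionary is meant to follow the `DataCaptureConfig` shape accepted by boto3's `create_endpoint_config`. Nothing here calls AWS.

```python
def build_data_capture_config(s3_uri, sampling_percentage=100):
    """Return a DataCaptureConfig-shaped dict that captures requests and responses."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("destination must be an s3:// URI")
    if not 0 <= sampling_percentage <= 100:
        raise ValueError("sampling percentage must be between 0 and 100")
    return {
        "EnableCapture": True,
        "InitialSamplingPercentage": sampling_percentage,  # lower it to save cost
        "DestinationS3Uri": s3_uri,
        "CaptureOptions": [
            {"CaptureMode": "Input"},   # request payloads
            {"CaptureMode": "Output"},  # response payloads
        ],
    }


capture_config = build_data_capture_config("s3://bucket/capture/", sampling_percentage=25)
```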

Drift Detection

Types of Drift

┌─────────────────────────────────────────────────────────┐
│                  Types of Drift                          │
│                                                         │
│  Data Drift (Covariate Shift)                           │
│  └─ Input data distribution changes                     │
│     e.g., customer demographics shift over time         │
│                                                         │
│  Concept Drift                                          │
│  └─ Relationship between input and output changes       │
│     e.g., what defines "spam" evolves                   │
│                                                         │
│  Prediction Drift                                       │
│  └─ Model output distribution changes                   │
│     e.g., model predicts more "fraud" than before       │
│                                                         │
│  Label Drift                                            │
│  └─ Ground truth distribution changes                   │
│     e.g., fraud rate actually increased                 │
└─────────────────────────────────────────────────────────┘

ELI5: Drift is the silent killer of ML models. Your model was trained on last year’s reality — if reality shifts, your model becomes outdated without anyone touching it. Data drift means your customers changed (younger demographic, new device types). Concept drift means the rules of the game changed (what counted as “spam” in 2020 is different from today). It’s like using a 2019 map in a city that’s been rebuilt — the model itself is fine, the world it learned just no longer matches the world it’s operating in.
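To make data drift concrete, here is a small sketch that scores how far a feature's current distribution has moved from its training baseline. It uses the Population Stability Index, a common industry statistic chosen for illustration; it is not presented as the check Model Monitor performs, and the 0.25 cut-off is a rule of thumb, not an AWS threshold.

```python
import math


def population_stability_index(baseline, current, bins=10):
    """PSI between two numeric samples, bucketed on the baseline's range."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid a zero-width bucket

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor keeps log() defined for empty buckets.
        return [max(c / len(values), 1e-6) for c in counts]

    base_shares = bucket_shares(baseline)
    curr_shares = bucket_shares(current)
    return sum(
        (c - b) * math.log(c / b) for b, c in zip(base_shares, curr_shares)
    )


baseline_ages = list(range(100))               # training-time feature values
shifted_ages = [v + 30 for v in baseline_ages]  # production values after a shift

psi_same = population_stability_index(baseline_ages, baseline_ages)
psi_shifted = population_stability_index(baseline_ages, shifted_ages)
drift_detected = psi_shifted > 0.25
```

Identical samples score zero; the shifted sample scores well above the rule-of-thumb cut-off, which is the signal that would prompt an investigation or retraining.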

When to Retrain

Monitoring Signal          →  Action
─────────────────────────────────────────────
Data quality violation     →  Investigate data pipeline
Model quality drop         →  Retrain with recent data
Bias drift detected        →  Retrain + investigate cause
Feature importance shift   →  Review features, possible retrain
Concept drift              →  Retrain (often with sliding window)

Automated Retraining Pipeline:

Model Monitor detects drift
    → CloudWatch Alarm
    → EventBridge Rule
    → Step Functions / SageMaker Pipeline
    → Retrain → Evaluate → If better → Register → Deploy
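The "If better" gate at the end of the pipeline is the part most worth getting right. A minimal sketch of that decision, with `train` and `evaluate` as hypothetical callables standing in for the real pipeline steps:

```python
def retraining_step(drift_alarm, train, evaluate, production_metric, min_gain=0.0):
    """Retrain on drift, but only promote a candidate that beats production."""
    if not drift_alarm:
        return "no-op"
    candidate = train()
    candidate_metric = evaluate(candidate)
    if candidate_metric > production_metric + min_gain:
        return "register-and-deploy"
    return "keep-production"  # the retrained model was not an improvement


# Hypothetical stand-ins: the "model" is just a label, the metric a fixed score.
outcome = retraining_step(
    drift_alarm=True,
    train=lambda: "candidate-model",
    evaluate=lambda model: 0.91,
    production_metric=0.88,
    min_gain=0.01,
)
```

Requiring a minimum gain avoids churning deployments over differences that are within noise.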

Infrastructure Monitoring

Amazon CloudWatch

The central monitoring service for all AWS resources.

Feature	Purpose
Metrics	Numerical data points (CPU, memory, invocations)
Alarms	Alert when metric crosses threshold
Logs	Centralized log storage and search
Dashboards	Visual monitoring displays
Events / EventBridge	React to state changes
Contributor Insights	Top-N contributors to metrics
Anomaly Detection	ML-based anomaly detection on metrics

Key SageMaker CloudWatch Metrics

Metric	What It Tracks
`Invocations`	Number of requests to endpoint
`Invocation4XXErrors` / `Invocation5XXErrors`	Failed requests (client-side vs. server-side errors)
`ModelLatency`	Time for model to generate prediction
`OverheadLatency`	SageMaker infrastructure overhead
`CPUUtilization`	CPU usage on endpoint instances
`GPUUtilization`	GPU usage
`MemoryUtilization`	Memory usage
`DiskUtilization`	Disk usage on instances
`GPUMemoryUtilization`	GPU memory usage

ELI5: ModelLatency is the model’s own thinking time: how long the container takes to respond once SageMaker hands it the request. OverheadLatency is SageMaker’s bureaucracy time: routing and authenticating the request and moving payloads to and from the container. If ModelLatency is high, your model is too slow or under-provisioned (try a bigger instance or Neo compilation). If OverheadLatency is high, your infrastructure is overloaded (scale out with more instances). Latency measured at the endpoint = ModelLatency + OverheadLatency; the client also pays network time on top of that.

Latency Troubleshooting

Total Latency = OverheadLatency + ModelLatency

High ModelLatency?
  → Model too complex → Simplify model or use faster instance
  → GPU not utilized → Check if model uses GPU
  → Batch too large → Reduce batch size

High OverheadLatency?
  → Instance overloaded → Scale up (bigger instance) or out (more instances)
  → Cold start (serverless) → Use provisioned concurrency
  → Network issues → Check VPC configuration
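The decision tree above can be sketched as a small triage function. CloudWatch reports both latency metrics in microseconds, so the sketch converts to milliseconds first; the function name, the latency budget, and the wording of the hints are assumptions for illustration.

```python
def diagnose_latency(model_latency_us, overhead_latency_us, budget_ms):
    """Return (total latency in ms, hint) from the two endpoint latency metrics."""
    total_ms = (model_latency_us + overhead_latency_us) / 1000  # microseconds to ms
    if total_ms <= budget_ms:
        return total_ms, "within budget"
    if model_latency_us >= overhead_latency_us:
        return total_ms, "model-bound: simplify model, check GPU use, or use a faster instance"
    return total_ms, "overhead-bound: scale out, check cold starts and VPC configuration"


total_ms, hint = diagnose_latency(
    model_latency_us=180_000, overhead_latency_us=20_000, budget_ms=100
)
```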

Cost Optimization

SageMaker Cost Strategies

Strategy	Savings	Trade-off
Spot Instances (Training)	Up to 90%	Can be interrupted — need checkpointing
Serverless Inference	Pay per request	Cold start latency
Auto-Scaling	Match demand	Configuration complexity
Multi-Model Endpoints	Share instances	Higher latency for cold models
SageMaker Savings Plans	Up to 64%	1- or 3-year commitment
Inference Recommender	Right-size	Takes time to benchmark
AWS Inferentia/Trainium	Up to 70%/50%	Limited framework support
Neo Compilation	Up to 2x throughput	Compilation step required

ELI5: The #1 cost trick for the exam: Spot Instances for training. You can save up to 90% on training costs because AWS can reclaim the instance at any time with 2 minutes’ notice. The catch: if your job gets interrupted, you lose all progress unless you’ve enabled checkpointing (saving model state to S3 periodically). It’s like buying last-minute standby airline tickets — massively cheaper, but you might get bumped mid-flight. Checkpointing means you can board the next flight and pick up from where you left off.
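What "pick up from where you left off" means in code: the training loop writes its state after every epoch and reads it back on start. A toy sketch, assuming a local directory stands in for the checkpoint path that SageMaker syncs to S3 (by default `/opt/ml/checkpoints`); the halving "loss" is a placeholder for real training work.

```python
import json
import tempfile
from pathlib import Path


def train(checkpoint_dir, total_epochs, interrupt_after=None):
    """Toy training loop that resumes from the last checkpoint if one exists."""
    checkpoint = Path(checkpoint_dir) / "state.json"
    state = {"epoch": 0, "loss": 1.0}
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())  # resume instead of restarting
    start_epoch = state["epoch"]
    for epoch in range(start_epoch, total_epochs):
        state = {"epoch": epoch + 1, "loss": state["loss"] * 0.5}  # placeholder work
        checkpoint.write_text(json.dumps(state))  # persist after every epoch
        done_this_run = epoch + 1 - start_epoch
        if (
            interrupt_after is not None
            and done_this_run >= interrupt_after
            and epoch + 1 < total_epochs
        ):
            return state, "interrupted"  # simulated Spot reclaim
    return state, "completed"


with tempfile.TemporaryDirectory() as ckpt_dir:
    first_state, first_status = train(ckpt_dir, total_epochs=5, interrupt_after=2)
    second_state, second_status = train(ckpt_dir, total_epochs=5)
```

The second run starts at epoch 2 rather than epoch 0, so the interruption costs only the restart, not the completed epochs.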

SageMaker Inference Recommender

Input: Model artifact + sample payload
    ↓
Inference Recommender runs benchmarks across instance types
    ↓
Output: Recommended instance type + cost/performance comparison

                Performance (throughput)
                    ↑
                    │      ● ml.p3.2xlarge
                    │   ● ml.g4dn.xlarge
                    │ ● ml.c5.xlarge     ← best value
                    │● ml.m5.large
                    └──────────────────→ Cost
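Reading the chart amounts to picking the highest throughput per dollar among instances that meet the latency target. A sketch of that selection; every number below is invented for illustration and is neither real pricing nor real Inference Recommender output.

```python
def best_value(benchmarks, max_latency_ms):
    """Pick the instance with the best throughput per dollar within a latency limit."""
    eligible = [b for b in benchmarks if b["p95_latency_ms"] <= max_latency_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda b: b["throughput_rps"] / b["hourly_cost_usd"])


# Hypothetical benchmark results (made-up costs, throughput, and latency).
benchmarks = [
    {"instance": "ml.m5.large", "hourly_cost_usd": 0.12, "throughput_rps": 40, "p95_latency_ms": 90},
    {"instance": "ml.c5.xlarge", "hourly_cost_usd": 0.20, "throughput_rps": 120, "p95_latency_ms": 45},
    {"instance": "ml.g4dn.xlarge", "hourly_cost_usd": 0.74, "throughput_rps": 300, "p95_latency_ms": 20},
    {"instance": "ml.p3.2xlarge", "hourly_cost_usd": 3.83, "throughput_rps": 500, "p95_latency_ms": 12},
]

relaxed_choice = best_value(benchmarks, max_latency_ms=60)
strict_choice = best_value(benchmarks, max_latency_ms=30)
```

A tighter latency target changes the answer, which is why the recommendation depends on the workload and not only on price.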

AWS Cost Management Tools

Tool	Purpose
Cost Explorer	Visualize and analyze spending over time
Budgets	Set spending alerts and limits
Trusted Advisor	Recommendations for cost, security, performance
Compute Optimizer	Right-sizing recommendations for instances
Billing Dashboard	Current month charges breakdown

Security — IAM

IAM Fundamentals

┌─────────────────────────────────────────────┐
│               AWS IAM                        │
│                                             │
│  Users ──── belong to ──── Groups           │
│    │                         │              │
│    └─── attached to ─── Policies            │
│                              │              │
│  Roles ── assumed by ─── Services / Users   │
│    │                                        │
│    └─── attached to ─── Policies            │
│                                             │
│  Policy = JSON document defining permissions│
└─────────────────────────────────────────────┘

ELI5: IAM is the bouncer at every door in AWS. Users are real people with permanent credentials. Roles are temporary badges assumed by services or users for a limited time — when SageMaker needs to read your S3 bucket, it assumes an Execution Role, gets a temporary credential, and uses it. Policies are the list on the bouncer’s clipboard: “this badge can read S3 but cannot delete EC2 instances.” The SageMaker Execution Role is what SageMaker wears to access your data — always scope it to the minimum permissions needed.

Principle of Least Privilege

Grant minimum permissions needed to perform the task
Use IAM policies to restrict access to specific resources
Prefer IAM Roles over long-lived access keys

SageMaker IAM

Role	Purpose
SageMaker Execution Role	Assumed by SageMaker to access S3, ECR, CloudWatch, etc.
SageMaker Role Manager	Create least-privilege roles for ML personas

Example: SageMaker execution role policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-ml-bucket/*"
    }
  ]
}
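Least privilege is easier to enforce when policies are generated and checked rather than hand-edited. A small sketch that builds the S3 statement above for a given bucket and flags wildcard actions; the helper names are hypothetical and nothing here calls IAM.

```python
import json


def s3_read_write_policy(bucket, prefix=""):
    """Build a policy document limited to object reads/writes under one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            }
        ],
    }


def wildcard_actions(policy):
    """List actions that grant more than a single named operation."""
    found = []
    for statement in policy["Statement"]:
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]  # IAM allows a single string or a list
        found.extend(a for a in actions if "*" in a)
    return found


policy = s3_read_write_policy("my-ml-bucket", prefix="training/")
policy_json = json.dumps(policy, indent=2)
too_broad = wildcard_actions({"Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]})
```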

Multi-Factor Authentication (MFA)

Enable for: root account, privileged IAM users
Types: virtual MFA (authenticator app), FIDO security key (formerly U2F), hardware TOTP token

Encryption

Encryption at Rest vs in Transit

At Rest:                              In Transit:
  S3 → SSE-S3, SSE-KMS, SSE-C         HTTPS/TLS between services
  EBS → EBS encryption (AES-256)       SageMaker endpoints use TLS
  SageMaker → notebook, training,      Inter-node training uses TLS
              model artifacts           (optional, adds overhead)

AWS KMS (Key Management Service)

Feature	Details
AWS Managed Keys	AWS creates and manages (aws/s3, aws/ebs, etc.)
Customer Managed Keys (CMK)	You create, you control rotation/policies
Imported Keys	Bring your own key material
Key Rotation	Optional automatic rotation for customer managed keys (yearly by default); AWS managed keys rotate yearly
Key Policies	Who can use/manage the key
Envelope Encryption	Data key encrypts data, KMS key encrypts data key

Envelope Encryption:
  KMS Master Key
       │
       └─ encrypts → Data Key (plaintext)
                          │
                          └─ encrypts → Your Data

  Stored: Encrypted Data Key + Encrypted Data
  KMS never stores your data — only the master key

ELI5: Envelope encryption is like putting a letter in a locked box, then locking the key to that box inside a second box that only KMS can open. Your actual data is encrypted by a “data key.” That data key is itself encrypted by a KMS master key. The encrypted data key is stored alongside your encrypted data, so even if someone steals both, they still can’t read your data without going through KMS. KMS never handles the data itself; it only guards the master key.
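The two-layer structure can be shown with a toy model. Loud caveats: `ToyKms` is a hypothetical stand-in for KMS, and the XOR keystream below is NOT a secure cipher; it exists only so the example runs with the standard library. Real code would request a data key from KMS (the `GenerateDataKey` operation) and use an authenticated cipher such as AES-GCM.

```python
import hashlib
import secrets


def _keystream_xor(key, data):
    """Toy stream cipher (SHA-256 in counter mode). NOT secure; illustration only."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


class ToyKms:
    """Stand-in for KMS: the master key never leaves this object."""

    def __init__(self):
        self._master_key = secrets.token_bytes(32)

    def generate_data_key(self):
        plaintext_key = secrets.token_bytes(32)
        return plaintext_key, _keystream_xor(self._master_key, plaintext_key)

    def decrypt(self, encrypted_key):
        return _keystream_xor(self._master_key, encrypted_key)


def envelope_encrypt(kms, data):
    plaintext_key, encrypted_key = kms.generate_data_key()
    ciphertext = _keystream_xor(plaintext_key, data)
    # The plaintext data key is discarded; only its encrypted copy is stored.
    return {"encrypted_key": encrypted_key, "ciphertext": ciphertext}


def envelope_decrypt(kms, envelope):
    data_key = kms.decrypt(envelope["encrypted_key"])  # requires access to KMS
    return _keystream_xor(data_key, envelope["ciphertext"])


kms = ToyKms()
message = b"patient-id=123; diagnosis=confidential"
envelope = envelope_encrypt(kms, message)
recovered = envelope_decrypt(kms, envelope)
```

The stored envelope holds only the encrypted data key and the ciphertext, so a second `ToyKms` with a different master key cannot recover the message.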

SageMaker Encryption

What	How
Training data on S3	SSE-S3 or SSE-KMS
Training instance volumes	KMS encryption
Model artifacts on S3	SSE-KMS
Notebook instance storage	KMS encryption
Inter-container traffic	Optional TLS (adds ~30% overhead)
Endpoint traffic	TLS enforced

Network Security

VPC (Virtual Private Cloud)

SageMaker VPC Security

┌─────────────────── VPC ──────────────────────┐
│                                              │
│  ┌────── Public Subnet ──────┐               │
│  │  ┌──────────┐             │               │
│  │  │ NAT GW   │←── Internet │               │
│  │  └──────────┘   Gateway   │               │
│  └───────────────────────────┘               │
│           │                                  │
│  ┌────── Private Subnet ─────┐               │
│  │  ┌──────────────────┐     │               │
│  │  │ SageMaker        │     │               │
│  │  │ Training/Endpoint │     │               │
│  │  └──────────────────┘     │               │
│  └───────────────────────────┘               │
│                                              │
│  Security Group: Instance-level firewall     │
│  NACL: Subnet-level firewall                 │
└──────────────────────────────────────────────┘

Component	Scope	Stateful?	Rules
Security Groups	Instance-level	Yes (return traffic auto-allowed)	Allow only (no deny)
NACLs	Subnet-level	No (must define both directions)	Allow and Deny

VPC Endpoints (PrivateLink)

Without VPC Endpoint:
  SageMaker in VPC → Internet Gateway → S3 (public)

With VPC Endpoint:
  SageMaker in VPC → VPC Endpoint → S3 (private, never leaves AWS)

Type	Services	Notes
Gateway Endpoint	S3, DynamoDB	Free, route-table based
Interface Endpoint	Most other services (S3 also supports it)	ENI in subnet, billed hourly plus per GB

ELI5: VPC Endpoints are like digging a private tunnel between SageMaker and S3 inside AWS’s own network, so your training data never travels over the public internet. Without a VPC Endpoint, your data leaves your VPC, hits the internet gateway, and reaches S3 via the public AWS endpoint — technically still on AWS infrastructure, but exposed to more network hops. For the exam: any scenario that says “keep data private” or “prevent internet traffic” requires VPC Endpoints plus Private Subnets.

SageMaker VPC Configuration

Training in VPC: Isolate training instances, access data via VPC endpoints
Inference in VPC: Endpoint instances in private subnet
Network Isolation: Block all internet access (fully air-gapped)
Inter-container encryption: TLS between distributed training containers

Network Isolation Mode

enable_network_isolation=True

Effect:
  • Container has NO outbound internet access
  • Cannot pip install, make API calls, etc.
  • Can ONLY access S3 via training input channels
  • All dependencies must be baked into container image

When to use:
  • Regulatory: data cannot leave the network
  • Prevent training containers from exfiltrating data
  • HIPAA, PCI-DSS compliance

ELI5: Network Isolation is the nuclear option for data security. The training container is completely cut off — no internet, no external API calls, no package downloads mid-run. Everything (code, libraries, dependencies) must already be inside the Docker image before training starts. It’s used when the data is so sensitive (medical records, financial data) that the risk of any outbound connection — even accidental — is unacceptable. The tradeoff: you must pre-bake all dependencies, which makes image management more complex.

HIPAA Compliance Checklist

Exam scenarios about HIPAA workloads typically expect all of these:
  1. KMS encryption at rest (S3 + EBS)
  2. TLS in transit
  3. Inter-container encryption (distributed training)
  4. VPC with private subnets
  5. CloudTrail audit logging
  6. Network isolation (when the scenario forbids any outbound access)
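Several of these controls map to training-job settings, so a job configuration can be checked before launch. A sketch under assumptions: the keys are believed to mirror parameter names of the SageMaker Python SDK `Estimator` constructor, the ARNs and IDs are placeholders, and the check covers only job-level settings (CloudTrail and S3 encryption are configured elsewhere).

```python
# Estimator-style settings mapped to the control each one satisfies.
REQUIRED_SETTINGS = {
    "volume_kms_key": "KMS encryption for training volumes",
    "output_kms_key": "KMS encryption for model artifacts",
    "subnets": "private VPC subnets",
    "security_group_ids": "VPC security groups",
    "encrypt_inter_container_traffic": "inter-container encryption",
    "enable_network_isolation": "network isolation",
}


def missing_controls(estimator_kwargs):
    """Return the controls that are absent, empty, or switched off."""
    return [
        description
        for name, description in REQUIRED_SETTINGS.items()
        if not estimator_kwargs.get(name)
    ]


# Placeholder values; none of these identifiers refer to real resources.
job_settings = {
    "volume_kms_key": "arn:aws:kms:us-east-1:111122223333:key/example",
    "output_kms_key": "arn:aws:kms:us-east-1:111122223333:key/example",
    "subnets": ["subnet-0example"],
    "security_group_ids": ["sg-0example"],
    "encrypt_inter_container_traffic": True,
    "enable_network_isolation": False,
}
gaps = missing_controls(job_settings)
```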

Common VPC Troubleshooting

Error	Cause	Fix
“Unable to download data from S3”	Missing S3 VPC endpoint	Add S3 Gateway Endpoint
“Cannot pull Docker container”	Missing ECR VPC endpoints	Add ecr.api + ecr.dkr Interface Endpoints
“Cannot write logs”	Missing CloudWatch endpoint	Add logs Interface Endpoint
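The table can be turned into a pre-flight check: given the endpoint service names already present in a VPC, list what a training job in a private subnet would still be missing. A sketch that works on plain strings (it does not query AWS); the function name is hypothetical, while the `com.amazonaws.<region>.<service>` naming follows the standard VPC endpoint service-name format.

```python
# Endpoint services from the troubleshooting table, keyed by short name.
REQUIRED_ENDPOINTS = {
    "s3": "Gateway",
    "ecr.api": "Interface",
    "ecr.dkr": "Interface",
    "logs": "Interface",
}


def missing_endpoints(existing_service_names, region):
    """Return {service: endpoint type} for required endpoints not yet created."""
    existing = set(existing_service_names)
    return {
        service: kind
        for service, kind in REQUIRED_ENDPOINTS.items()
        if f"com.amazonaws.{region}.{service}" not in existing
    }


existing = ["com.amazonaws.us-east-1.s3", "com.amazonaws.us-east-1.logs"]
endpoint_gaps = missing_endpoints(existing, "us-east-1")
```

With only S3 and CloudWatch Logs in place, the job would fail at the "Cannot pull Docker container" stage.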

Exam tip: “Prevent data from traversing the internet” → VPC Endpoints + Private Subnets

Sensitive Data Protection

Amazon Macie

ML-powered service to discover and protect sensitive data in S3
Automatically detects PII, financial data, credentials
Generates findings with severity ratings
Integrates with EventBridge for automated remediation

AWS Secrets Manager

Store and auto-rotate database credentials, API keys, tokens
Fine-grained IAM access control
Automatic rotation with Lambda functions
Use case: ML pipeline accessing databases, external APIs

Data Masking & Anonymization

Technique	Description	Reversible?
Masking	Replace with fixed value (e.g., XXX-XX-1234)	No
Tokenization	Replace with random token, keep mapping	Yes
Hashing	One-way hash (SHA-256)	No
Encryption	Encrypt with key	Yes (with key)
Generalization	Reduce precision (age 27 → “20-30”)	No
Perturbation	Add noise to numerical data	No
Synthetic data	Generate fake but statistically similar data	N/A
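A few of these techniques side by side, to make the reversible/irreversible distinction concrete. A minimal sketch with hypothetical helper names; the in-memory token vault stands in for what would be a secured mapping store in practice.

```python
import hashlib
import secrets


def mask_ssn(ssn):
    """Masking: keep the last four digits, replace the rest. Not reversible."""
    return "XXX-XX-" + ssn[-4:]


def hash_value(value, salt):
    """Hashing: one-way SHA-256 digest. Not reversible."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def generalize_age(age, width=10):
    """Generalization: reduce precision to a range. Not reversible."""
    low = (age // width) * width
    return f"{low}-{low + width}"


class Tokenizer:
    """Tokenization: random token plus a stored mapping. Reversible via the vault."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value):
        if value not in self._token_by_value:
            token = "tok_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def detokenize(self, token):
        return self._value_by_token[token]


tokenizer = Tokenizer()
token = tokenizer.tokenize("123-45-1234")
masked = mask_ssn("123-45-1234")
age_band = generalize_age(27)
digest = hash_value("123-45-1234", salt="per-dataset-salt")
```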

Web Application Security

AWS WAF (Web Application Firewall)

Protect API endpoints from web attacks
Filter by: IP, rate limiting, geographic, SQL injection, XSS patterns
Use case: protect the API Gateway (or ALB/CloudFront) front end that exposes a SageMaker endpoint; WAF does not attach to the endpoint itself

AWS Shield

Tier	Protection	Cost
Standard	L3/L4 DDoS protection	Free (automatic)
Advanced	L3/L4/L7 + 24/7 Shield Response Team (SRT, formerly DRT) + cost protection	$3,000/month

Compliance & Auditing

AWS CloudTrail

Records API activity in your AWS account (management events by default; data events such as S3 object access are opt-in)
Who did what, when, from where
Essential for security auditing and compliance
Store logs in S3, analyze with Athena

VPC Flow Logs

Capture IP traffic going to/from network interfaces in VPC
Troubleshoot connectivity issues
Monitor security (detect unusual traffic patterns)

Quick Reference: Security Checklist for ML

□ SageMaker execution role with least-privilege IAM policies
□ S3 data encrypted with SSE-KMS
□ Training instances in private VPC subnet
□ VPC endpoints for S3 and other services (no internet transit)
□ Network isolation enabled (if required)
□ Inter-container encryption for distributed training (if sensitive)
□ Model artifacts encrypted in S3
□ Notebook instance encryption enabled
□ CloudTrail logging enabled
□ Model Monitor + CloudWatch alarms configured
□ Macie scanning S3 for PII
□ Secrets Manager for credentials (not hardcoded)
□ MFA enabled for console users
□ SageMaker Role Manager for persona-based access

Quick Reference: When to Use What

Scenario	Service
Monitor model predictions in production	SageMaker Model Monitor
Detect bias drift over time	Model Monitor + Clarify
Dashboard for endpoint metrics	CloudWatch Dashboards
Alert on metric threshold	CloudWatch Alarms
Right-size inference instances	Inference Recommender
Reduce inference cost	Serverless / Multi-Model Endpoints / Inferentia / Savings Plans
Reduce training cost	Spot Instances (with checkpointing)
Manage encryption keys	KMS
Keep data private in AWS	VPC Endpoints + Private Subnets
Discover PII in S3	Macie
Store secrets	Secrets Manager
Audit API calls	CloudTrail
Protect APIs from attacks	WAF
DDoS protection	Shield
Fine-grained data lake access	Lake Formation
Automated retraining on drift	Model Monitor → EventBridge → SageMaker Pipelines
