Domain 4: Monitoring, Maintenance & Security
Monitoring, Maintenance & Security
Exam Domain 4: ML Solution Monitoring, Maintenance, and Security (24%)
Tasks: Monitor inference, optimize infrastructure/costs, secure resources
SageMaker Model Monitor
The primary tool for monitoring ML models in production.
Architecture
┌──────────────────────────────────────────────────────────┐
│ SageMaker Model Monitor │
│ │
│ Endpoint Traffic │
│ │ │
│ ↓ │
│ ┌──────────────┐ ┌────────────────┐ │
│ │ Data Capture │────→│ S3 (captured │ │
│ │ (sample %) │ │ requests + │ │
│ └──────────────┘ │ responses) │ │
│ └───────┬────────┘ │
│ ↓ │
│ ┌────────────────────────┐ │
│ │ Monitoring Schedule │ │
│ │ (hourly/daily) │ │
│ │ │ │
│ │ Compare current data │ │
│ │ against baseline │ │
│ └───────────┬────────────┘ │
│ ↓ │
│ ┌────────────────────────┐ │
│ │ Violations Report │ │
│ │ → CloudWatch Alarm │ │
│ │ → SNS Notification │ │
│ └────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
Four Types of Monitoring
| Type | What It Detects | Baseline |
|---|---|---|
| Data Quality | Schema changes, missing values, data type changes, shifts in feature distributions | Statistics and constraints computed from training data |
| Model Quality | Accuracy/precision/recall degradation | Metrics from validation set |
| Bias Drift | Bias metrics changing over time | Clarify bias baseline |
| Feature Attribution Drift | SHAP values changing (feature importance shifting) | Clarify explainability baseline |
Model Quality Monitor requires ground truth labels:
Predictions (from endpoint) + Actual labels (delayed feedback)
↓ ↓
Merge on record ID → Compare → Compute metrics → Alert if degraded
ELI5: Data Quality monitoring is easy — it just checks whether incoming data “looks right” compared to training data (same schema, same value ranges, no missing fields). No labels needed. Model Quality monitoring is harder — it requires the actual correct answers (ground truth labels) to arrive later so you can check if the model was right. This distinction is a common exam trap: “monitor without labels” = Data Quality; “detect accuracy drop” = Model Quality (needs labels).
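The merge-then-score flow above can be sketched in plain Python. This is a minimal illustration for binary labels, not Model Monitor's own implementation; the function names, record IDs, and the 0.05 tolerance are assumptions made up for the example.

```python
from typing import Dict


def model_quality(predictions: Dict[str, int], labels: Dict[str, int]) -> Dict[str, float]:
    """Join predictions and ground truth on record ID, then score binary metrics."""
    # Only records whose ground truth label has already arrived can be scored.
    joined = [(predictions[rid], labels[rid]) for rid in predictions if rid in labels]
    if not joined:
        raise ValueError("no ground truth labels have arrived yet")
    tp = sum(1 for p, y in joined if p == 1 and y == 1)
    fp = sum(1 for p, y in joined if p == 1 and y == 0)
    fn = sum(1 for p, y in joined if p == 0 and y == 1)
    correct = sum(1 for p, y in joined if p == y)
    return {
        "matched": len(joined),
        "accuracy": correct / len(joined),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def degraded(metrics: Dict[str, float], baseline_accuracy: float, tolerance: float = 0.05) -> bool:
    """Alert when accuracy falls more than `tolerance` below the baseline (hypothetical rule)."""
    return metrics["accuracy"] < baseline_accuracy - tolerance


preds = {"r1": 1, "r2": 0, "r3": 1, "r4": 1}
truth = {"r1": 1, "r2": 0, "r3": 0}  # r4's label has not arrived yet
metrics = model_quality(preds, truth)
print(metrics, degraded(metrics, baseline_accuracy=0.90))
```

Note that `r4` is silently excluded: without its label there is nothing to compare against, which is exactly why Model Quality monitoring lags behind Data Quality monitoring.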
Data Capture
# Enable data capture on endpoint config
DataCaptureConfig:
  EnableCapture: true
  InitialSamplingPercentage: 100  # or lower %
  CaptureOptions:
    - CaptureMode: Input   # capture requests
    - CaptureMode: Output  # capture responses
  DestinationS3Uri: s3://bucket/capture/
- Captures request/response payloads
- Configurable sampling percentage (save costs)
- Stored in S3 as JSON lines
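The same configuration can be built as a Python dictionary in the shape the `create_endpoint_config` API expects. This is a sketch: the helper name and the 20% sampling rate are assumptions, and the boto3 call is left as a comment because it needs real credentials and a real model.

```python
def data_capture_config(s3_uri: str, sampling_percentage: int = 100) -> dict:
    """Build a DataCaptureConfig dict that captures both requests and responses."""
    if not 0 <= sampling_percentage <= 100:
        raise ValueError("sampling_percentage must be between 0 and 100")
    if not s3_uri.startswith("s3://"):
        raise ValueError("destination must be an s3:// URI")
    return {
        "EnableCapture": True,
        "InitialSamplingPercentage": sampling_percentage,
        "DestinationS3Uri": s3_uri,
        "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
    }


# Sample 20% of traffic to keep S3 storage costs down (illustrative value).
config = data_capture_config("s3://bucket/capture/", sampling_percentage=20)
print(config)

# With boto3 installed and credentials configured, the dict would be passed as:
#   boto3.client("sagemaker").create_endpoint_config(
#       EndpointConfigName=..., ProductionVariants=[...], DataCaptureConfig=config)
```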
Drift Detection
Types of Drift
┌─────────────────────────────────────────────────────────┐
│ Types of Drift │
│ │
│ Data Drift (Covariate Shift) │
│ └─ Input data distribution changes │
│ e.g., customer demographics shift over time │
│ │
│ Concept Drift │
│ └─ Relationship between input and output changes │
│ e.g., what defines "spam" evolves │
│ │
│ Prediction Drift │
│ └─ Model output distribution changes │
│ e.g., model predicts more "fraud" than before │
│ │
│ Label Drift │
│ └─ Ground truth distribution changes │
│ e.g., fraud rate actually increased │
└─────────────────────────────────────────────────────────┘
ELI5: Drift is the silent killer of ML models. Your model was trained on last year’s reality — if reality shifts, your model becomes outdated without anyone touching it. Data drift means your customers changed (younger demographic, new device types). Concept drift means the rules of the game changed (what counted as “spam” in 2020 is different from today). It’s like using a 2019 map in a city that’s been rebuilt — the model itself is fine, the world it learned just no longer matches the world it’s operating in.
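One common way to put a number on data drift is the Population Stability Index (PSI), which compares how a feature is distributed across bins in the baseline versus in current traffic. The sketch below is a swapped-in illustration, not the distance measure Model Monitor itself uses; the bin edges, ages, and the rule-of-thumb 0.25 threshold are assumptions.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin; values outside the edges fall into the end bins."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        # Count how many interior edges v has reached; that count is its bin index.
        idx = sum(1 for e in edges[1:-1] if v >= e)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def psi(expected: Sequence[float], actual: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    e = [max(x, eps) for x in bin_fractions(expected, edges)]  # floor avoids log(0)
    a = [max(x, eps) for x in bin_fractions(actual, edges)]
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


edges = [0, 30, 45, 60, 120]                      # hypothetical age buckets
baseline_ages = [22, 30, 35, 41, 48, 55, 63, 70]  # training-time customers
current_ages = [19, 21, 23, 24, 26, 28, 33, 52]   # a younger demographic arrives

score = psi(baseline_ages, current_ages, edges)
print("drift detected" if score > 0.25 else "stable", round(score, 3))
```

PSI only looks at inputs, so it can flag data drift without labels. It cannot detect concept drift, where the inputs look the same but the correct answers have changed.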
When to Retrain
Monitoring Signal → Action
─────────────────────────────────────────────
Data quality violation → Investigate data pipeline
Model quality drop → Retrain with recent data
Bias drift detected → Retrain + investigate cause
Feature importance shift → Review features, possible retrain
Concept drift → Retrain (often with sliding window)
Automated Retraining Pipeline:

Model Monitor detects drift
→ CloudWatch Alarm
→ EventBridge Rule
→ Step Functions / SageMaker Pipeline
→ Retrain → Evaluate → If better → Register → Deploy
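Two pieces of that pipeline are easy to sketch: the EventBridge event pattern that matches a CloudWatch alarm entering the ALARM state, and the "if better" gate before registering a new model. The alarm name, scores, and `min_gain` parameter are hypothetical; the pattern follows the standard "CloudWatch Alarm State Change" event shape.

```python
import json


def alarm_event_pattern(alarm_name: str) -> dict:
    """EventBridge pattern that fires when the named alarm transitions to ALARM."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {"alarmName": [alarm_name], "state": {"value": ["ALARM"]}},
    }


def should_deploy(candidate_score: float, production_score: float, min_gain: float = 0.0) -> bool:
    """Promote the retrained model only if it beats production by more than min_gain."""
    return candidate_score > production_score + min_gain


pattern = alarm_event_pattern("model-quality-drift")  # hypothetical alarm name
print(json.dumps(pattern, indent=2))
print(should_deploy(candidate_score=0.91, production_score=0.88))
```

The gate matters because retraining on drifted data does not guarantee a better model; without it, an automated pipeline can deploy a regression.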
Infrastructure Monitoring
Amazon CloudWatch
The central monitoring service for all AWS resources.
| Feature | Purpose |
|---|---|
| Metrics | Numerical data points (CPU, memory, invocations) |
| Alarms | Alert when metric crosses threshold |
| Logs | Centralized log storage and search |
| Dashboards | Visual monitoring displays |
| Events / EventBridge | React to state changes |
| Contributor Insights | Top-N contributors to metrics |
| Anomaly Detection | ML-based anomaly detection on metrics |
Key SageMaker CloudWatch Metrics
| Metric | What It Tracks |
|---|---|
| Invocations | Number of requests to endpoint |
| Invocation4XXErrors / Invocation5XXErrors | Failed requests (client errors / server errors) |
| ModelLatency | Time for model to generate prediction (microseconds) |
| OverheadLatency | SageMaker infrastructure overhead (microseconds) |
| CPUUtilization | CPU usage on endpoint instances |
| GPUUtilization | GPU usage |
| MemoryUtilization | Memory usage |
| DiskUtilization | Disk usage on instances |
| GPUMemoryUtilization | GPU memory usage |
ELI5: ModelLatency is the model’s own thinking time: how long the container takes to respond to the inference request. OverheadLatency is SageMaker’s bureaucracy time: everything SageMaker adds around the container call, such as authenticating and routing the request and handling the payload. CloudWatch reports both in microseconds. If ModelLatency is high, your model is too slow or under-provisioned (try a bigger instance or Neo compilation). If OverheadLatency is high, your infrastructure is overloaded (scale out with more instances). Latency inside SageMaker = ModelLatency + OverheadLatency; the caller also sees network time on top of that.
Latency Troubleshooting
Total Latency = OverheadLatency + ModelLatency
High ModelLatency?
→ Model too complex → Simplify model or use faster instance
→ GPU not utilized → Check if model uses GPU
→ Batch too large → Reduce batch size
High OverheadLatency?
→ Instance overloaded → Scale up (bigger instance) or out (more instances)
→ Cold start (serverless) → Use provisioned concurrency
→ Network issues → Check VPC configuration
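The decision tree above can be written as a small triage function. This is a sketch under stated assumptions: the function name, the latency budget, and the sample numbers are made up; the only AWS-specific fact relied on is that CloudWatch reports both metrics in microseconds.

```python
def triage_latency(model_latency_us: float, overhead_latency_us: float, budget_ms: float) -> str:
    """Classify an endpoint as within budget, model-bound, or overhead-bound."""
    total_ms = (model_latency_us + overhead_latency_us) / 1000.0  # microseconds to ms
    if total_ms <= budget_ms:
        return "within budget"
    if model_latency_us >= overhead_latency_us:
        return "model-bound: simplify the model, use a faster instance, check GPU use, reduce batch size"
    return "overhead-bound: scale up or out, check cold starts and VPC configuration"


print(triage_latency(180_000, 20_000, budget_ms=100))
print(triage_latency(30_000, 150_000, budget_ms=100))
print(triage_latency(40_000, 10_000, budget_ms=100))
```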
Cost Optimization
SageMaker Cost Strategies
| Strategy | Savings | Trade-off |
|---|---|---|
| Spot Instances (Training) | Up to 90% | Can be interrupted — need checkpointing |
| Serverless Inference | Pay per request | Cold start latency |
| Auto-Scaling | Match demand | Configuration complexity |
| Multi-Model Endpoints | Share instances | Higher latency for cold models |
| SageMaker Savings Plans | Up to 64% | 1- or 3-year commitment |
| Inference Recommender | Right-size | Takes time to benchmark |
| AWS Inferentia/Trainium | Up to 70%/50% | Limited framework support |
| Neo Compilation | Up to 2x throughput | Compilation step required |
ELI5: The #1 cost trick for the exam: Spot Instances for training. You can save up to 90% on training costs because AWS can reclaim the instance at any time with 2 minutes’ notice. The catch: if your job gets interrupted, you lose all progress unless you’ve enabled checkpointing (saving model state to S3 periodically). It’s like buying last-minute standby airline tickets — massively cheaper, but you might get bumped mid-flight. Checkpointing means you can board the next flight and pick up from where you left off.
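A short sketch of the two things to remember about managed spot training: the settings that turn it on, and how the savings figure is derived from training seconds versus billable seconds. The keyword names follow the SageMaker Python SDK `Estimator` as far as I know, so verify them against your SDK version; the helper names, bucket path, and timings are assumptions.

```python
def spot_estimator_kwargs(checkpoint_s3_uri: str, max_run: int = 3600, max_wait: int = 7200) -> dict:
    """Keyword arguments that enable managed spot training with checkpointing."""
    if max_wait < max_run:
        # max_wait covers training time plus time spent waiting for spot capacity.
        raise ValueError("max_wait must be at least max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run,                      # seconds of actual training allowed
        "max_wait": max_wait,                    # seconds including waiting for capacity
        "checkpoint_s3_uri": checkpoint_s3_uri,  # where progress is saved and resumed from
    }


def spot_savings_percent(training_seconds: float, billable_seconds: float) -> float:
    """Savings = (1 - billable / training) * 100."""
    return (1 - billable_seconds / training_seconds) * 100


kwargs = spot_estimator_kwargs("s3://my-ml-bucket/checkpoints/")  # hypothetical path
print(kwargs)
print(spot_savings_percent(training_seconds=1000, billable_seconds=250))
```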
SageMaker Inference Recommender
Input: Model artifact + sample payload
↓
Inference Recommender runs benchmarks across instance types
↓
Output: Recommended instance type + cost/performance comparison
Performance (throughput)
↑
│ ● ml.p3.2xlarge
│ ● ml.g4dn.xlarge
│ ● ml.c5.xlarge ← best value
│● ml.m5.large
└──────────────────→ Cost
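"Best value" in the chart above means the lowest cost per unit of work, not the highest throughput. The sketch below ranks benchmark results by cost per million invocations. Every price and throughput number here is invented for illustration and is not real AWS pricing or a real Inference Recommender result.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Benchmark:
    instance_type: str
    cost_per_hour: float               # hypothetical USD per hour
    max_invocations_per_minute: float  # hypothetical sustained throughput


def cost_per_million(b: Benchmark) -> float:
    """Cost of serving one million invocations at full utilization."""
    invocations_per_hour = b.max_invocations_per_minute * 60
    return b.cost_per_hour / invocations_per_hour * 1_000_000


def best_value(benchmarks: List[Benchmark], min_throughput: float = 0) -> Benchmark:
    """Cheapest per invocation among instances that meet the throughput floor."""
    eligible = [b for b in benchmarks if b.max_invocations_per_minute >= min_throughput]
    return min(eligible, key=cost_per_million)


results = [
    Benchmark("ml.m5.large", 0.10, 600),
    Benchmark("ml.c5.xlarge", 0.20, 2400),
    Benchmark("ml.g4dn.xlarge", 0.70, 4800),
    Benchmark("ml.p3.2xlarge", 3.80, 9000),
]
print(best_value(results).instance_type)
print(best_value(results, min_throughput=4000).instance_type)
```

Adding a throughput floor changes the answer, which mirrors how exam scenarios pair a cost goal with a latency or throughput requirement.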
AWS Cost Management Tools
| Tool | Purpose |
|---|---|
| Cost Explorer | Visualize and analyze spending over time |
| Budgets | Set spending alerts and limits |
| Trusted Advisor | Recommendations for cost, security, performance |
| Compute Optimizer | Right-sizing recommendations for instances |
| Billing Dashboard | Current month charges breakdown |
Security — IAM
IAM Fundamentals
┌─────────────────────────────────────────────┐
│ AWS IAM │
│ │
│ Users ──── belong to ──── Groups │
│ │ │ │
│ └─── attached to ─── Policies │
│ │ │
│ Roles ── assumed by ─── Services / Users │
│ │ │
│ └─── attached to ─── Policies │
│ │
│ Policy = JSON document defining permissions│
└─────────────────────────────────────────────┘
ELI5: IAM is the bouncer at every door in AWS. Users are real people with permanent credentials. Roles are temporary badges assumed by services or users for a limited time — when SageMaker needs to read your S3 bucket, it assumes an Execution Role, gets a temporary credential, and uses it. Policies are the list on the bouncer’s clipboard: “this badge can read S3 but cannot delete EC2 instances.” The SageMaker Execution Role is what SageMaker wears to access your data — always scope it to the minimum permissions needed.
Principle of Least Privilege
- Grant minimum permissions needed to perform the task
- Use IAM policies to restrict access to specific resources
- Prefer IAM Roles over long-lived access keys
SageMaker IAM
| Role | Purpose |
|---|---|
| SageMaker Execution Role | Assumed by SageMaker to access S3, ECR, CloudWatch, etc. |
| SageMaker Role Manager | Create least-privilege roles for ML personas |
Example: SageMaker execution role policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::my-ml-bucket/*"
    }
  ]
}
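Least privilege is easier to keep when policies are generated and checked in code. The sketch below builds a scoped S3 policy document and flags wildcard actions or resources. Both helper names are made up, and the check is a toy lint, not a substitute for IAM Access Analyzer.

```python
import json
from typing import List


def s3_read_write_policy(bucket: str, prefix: str = "") -> dict:
    """Policy document allowing object reads and writes under one bucket prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
        }],
    }


def overly_broad(policy: dict) -> List[str]:
    """Return findings for Allow statements that use wildcard actions or resources."""
    findings = []
    for stmt in policy["Statement"]:
        if stmt.get("Effect") != "Allow":
            continue
        # Action and Resource may each be a single string or a list of strings.
        actions = stmt["Action"] if isinstance(stmt["Action"], list) else [stmt["Action"]]
        resources = stmt["Resource"] if isinstance(stmt["Resource"], list) else [stmt["Resource"]]
        findings += [f"wildcard action: {a}" for a in actions if a == "*" or a.endswith(":*")]
        findings += [f"wildcard resource: {r}" for r in resources if r == "*"]
    return findings


scoped = s3_read_write_policy("my-ml-bucket", prefix="training/")
broad = {"Version": "2012-10-17",
         "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]}
print(json.dumps(scoped, indent=2))
print(overly_broad(scoped), overly_broad(broad))
```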
Multi-Factor Authentication (MFA)
- Required for: root account, privileged IAM users
- Types: virtual MFA (authenticator app), FIDO security key or passkey (successor to U2F), hardware TOTP token
Encryption
Encryption at Rest vs in Transit
| At Rest | In Transit |
|---|---|
| S3: SSE-S3, SSE-KMS, SSE-C | HTTPS/TLS between services |
| EBS: EBS encryption (AES-256) | SageMaker endpoints use TLS |
| SageMaker: notebook, training, model artifacts | Inter-node training uses TLS (optional, adds overhead) |
AWS KMS (Key Management Service)
| Feature | Details |
|---|---|
| AWS Managed Keys | AWS creates and manages (aws/s3, aws/ebs, etc.) |
| Customer Managed Keys (CMK) | You create, you control rotation/policies |
| Imported Keys | Bring your own key material |
| Key Rotation | Optional automatic rotation for customer managed keys (yearly by default); AWS managed keys rotate automatically every year |
| Key Policies | Who can use/manage the key |
| Envelope Encryption | Data key encrypts data, KMS key encrypts data key |
Envelope Encryption:
KMS Master Key
│
└─ encrypts → Data Key (plaintext)
│
└─ encrypts → Your Data
Stored: Encrypted Data Key + Encrypted Data
KMS never stores your data — only the master key
ELI5: Envelope encryption is like putting a letter in a locked box, then locking the key to that box inside a second box that only KMS can open. Your actual data is encrypted by a “data key.” That data key is itself encrypted by a KMS key (the “master key”). The encrypted data key is stored alongside your encrypted data, so even if someone steals both, they still can’t read your data without asking KMS to decrypt the data key. KMS never sees your data directly; it only guards the master key.
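The two-layer structure can be walked through in a few lines of Python. Important assumption: the cipher below is a toy XOR keystream used only so the example runs with the standard library. It is NOT secure and stands in for AES; in AWS the data key would come from the KMS `GenerateDataKey` operation and be unwrapped with `Decrypt`.

```python
import hashlib
import os
from typing import Tuple


def _toy_cipher(key: bytes, data: bytes) -> bytes:
    """TOY cipher (NOT secure): XOR data with a SHA-256 counter keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)  # XOR is symmetric, so the same call also decrypts


def envelope_encrypt(kms_key: bytes, plaintext: bytes) -> Tuple[bytes, bytes]:
    data_key = os.urandom(32)                            # fresh data key per object
    ciphertext = _toy_cipher(data_key, plaintext)        # data key encrypts the data
    encrypted_data_key = _toy_cipher(kms_key, data_key)  # KMS key encrypts the data key
    return encrypted_data_key, ciphertext                # plaintext data key is discarded


def envelope_decrypt(kms_key: bytes, encrypted_data_key: bytes, ciphertext: bytes) -> bytes:
    data_key = _toy_cipher(kms_key, encrypted_data_key)  # unwrap the data key first
    return _toy_cipher(data_key, ciphertext)


kms_key = os.urandom(32)  # stands in for a key that never leaves KMS
message = b"patient record 42"
encrypted_key, blob = envelope_encrypt(kms_key, message)
print(envelope_decrypt(kms_key, encrypted_key, blob))
```

What gets stored is `encrypted_key` plus `blob`. Neither is useful without the KMS key, and the large payload never has to be sent to KMS.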
SageMaker Encryption
| What | How |
|---|---|
| Training data on S3 | SSE-S3 or SSE-KMS |
| Training instance volumes | KMS encryption |
| Model artifacts on S3 | SSE-KMS |
| Notebook instance storage | KMS encryption |
| Inter-container traffic | Optional TLS (can noticeably increase training time for distributed jobs) |
| Endpoint traffic | TLS enforced |
Network Security
VPC (Virtual Private Cloud)

┌─────────────────── VPC ──────────────────────┐
│ │
│ ┌────── Public Subnet ──────┐ │
│ │ ┌──────────┐ │ │
│ │ │ NAT GW │←── Internet │ │
│ │ └──────────┘ Gateway │ │
│ └───────────────────────────┘ │
│ │ │
│ ┌────── Private Subnet ─────┐ │
│ │ ┌──────────────────┐ │ │
│ │ │ SageMaker │ │ │
│ │ │ Training/Endpoint │ │ │
│ │ └──────────────────┘ │ │
│ └───────────────────────────┘ │
│ │
│ Security Group: Instance-level firewall │
│ NACL: Subnet-level firewall │
└──────────────────────────────────────────────┘
| Component | Scope | Stateful? | Rules |
|---|---|---|---|
| Security Groups | Instance-level | Yes (return traffic auto-allowed) | Allow only (no deny) |
| NACLs | Subnet-level | No (must define both directions) | Allow and Deny |
VPC Endpoints (PrivateLink)
Without VPC Endpoint:
SageMaker in VPC → Internet Gateway → S3 (public)
With VPC Endpoint:
SageMaker in VPC → VPC Endpoint → S3 (private, never leaves AWS)
| Type | For | Characteristics |
|---|---|---|
| Gateway Endpoint | S3, DynamoDB | No additional charge, route-table based |
| Interface Endpoint | Most other services (S3 also supports it) | ENI in subnet, costs $ |
ELI5: VPC Endpoints are like digging a private tunnel between SageMaker and S3 inside AWS’s own network, so your training data never travels over the public internet. Without a VPC Endpoint, your data leaves your VPC, hits the internet gateway, and reaches S3 via the public AWS endpoint — technically still on AWS infrastructure, but exposed to more network hops. For the exam: any scenario that says “keep data private” or “prevent internet traffic” requires VPC Endpoints plus Private Subnets.
SageMaker VPC Configuration
- Training in VPC: Isolate training instances, access data via VPC endpoints
- Inference in VPC: Endpoint instances in private subnet
- Network Isolation: Block all outbound network access from the container (no internet, no AWS API calls)
- Inter-container encryption: TLS between distributed training containers
Network Isolation Mode
enable_network_isolation=True
Effect:
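• Container has NO outbound network access (no internet, no calls to AWS APIs)
• Cannot pip install, make API calls, etc.
• Training data arrives ONLY via S3 input channels; SageMaker downloads inputs and uploads artifacts on the container’s behalf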
• All dependencies must be baked into container image
When to use:
• Regulatory: data cannot leave the network
• Prevent training containers from exfiltrating data
• HIPAA, PCI-DSS compliance
ELI5: Network Isolation is the nuclear option for data security. The training container is completely cut off — no internet, no external API calls, no package downloads mid-run. Everything (code, libraries, dependencies) must already be inside the Docker image before training starts. It’s used when the data is so sensitive (medical records, financial data) that the risk of any outbound connection — even accidental — is unacceptable. The tradeoff: you must pre-bake all dependencies, which makes image management more complex.
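The VPC, isolation, and encryption settings from this section map onto a handful of training-job arguments. The sketch below collects them in one place. The keyword names follow the SageMaker Python SDK `Estimator` as far as I know, so check them against your SDK version; the helper name, subnet ID, security group ID, and key alias are hypothetical.

```python
from typing import Dict, List


def secure_training_kwargs(subnets: List[str], security_group_ids: List[str],
                           kms_key_id: str, distributed: bool = False) -> Dict[str, object]:
    """Keyword arguments for a locked-down training job."""
    if not subnets or not security_group_ids:
        raise ValueError("VPC training needs at least one subnet and one security group")
    return {
        "subnets": list(subnets),                        # private subnets only
        "security_group_ids": list(security_group_ids),
        "enable_network_isolation": True,                # no outbound calls from the container
        "encrypt_inter_container_traffic": distributed,  # only matters with multiple instances
        "volume_kms_key": kms_key_id,                    # encrypts the training volume
        "output_kms_key": kms_key_id,                    # encrypts model artifacts in S3
    }


kwargs = secure_training_kwargs(["subnet-0abc"], ["sg-0abc"], "alias/ml-training", distributed=True)
print(kwargs)

# With the SageMaker SDK installed, these would be splatted into the estimator:
#   Estimator(image_uri=..., role=..., instance_count=..., instance_type=..., **kwargs)
```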
HIPAA Compliance Checklist
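Controls typically expected for HIPAA workloads (items 1-5 always, item 6 when the scenario calls for it):
1. KMS encryption at rest (S3 + EBS)
2. TLS in transit
3. Inter-container encryption (distributed training)
4. VPC with private subnets
5. CloudTrail audit logging
6. Network isolation (when containers must not make any outbound calls)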
Common VPC Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| “Unable to download data from S3” | Missing S3 VPC endpoint | Add S3 Gateway Endpoint |
| “Cannot pull Docker container” | Missing ECR VPC endpoints | Add ecr.api + ecr.dkr Interface Endpoints |
| “Cannot write logs” | Missing CloudWatch endpoint | Add logs Interface Endpoint |
Exam tip: “Prevent data from traversing the internet” → VPC Endpoints + Private Subnets
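The troubleshooting table boils down to a checklist: for each service the job needs, is there a matching VPC endpoint? A minimal sketch, assuming the standard `com.amazonaws.<region>.<service>` endpoint naming; the function name and the configured set are made up.

```python
from typing import Iterable, List, Tuple

# service suffix -> (endpoint type, symptom when the endpoint is missing)
REQUIRED_ENDPOINTS = {
    "s3": ("Gateway", "Unable to download data from S3"),
    "ecr.api": ("Interface", "Cannot pull Docker container"),
    "ecr.dkr": ("Interface", "Cannot pull Docker container"),
    "logs": ("Interface", "Cannot write logs"),
}


def missing_endpoints(region: str, configured: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (service name, endpoint type, likely symptom) for each missing endpoint."""
    have = set(configured)
    missing = []
    for suffix, (kind, symptom) in REQUIRED_ENDPOINTS.items():
        name = f"com.amazonaws.{region}.{suffix}"
        if name not in have:
            missing.append((name, kind, symptom))
    return missing


gaps = missing_endpoints("us-east-1", {"com.amazonaws.us-east-1.s3"})
for name, kind, symptom in gaps:
    print(f"add {kind} endpoint {name} (fixes: {symptom})")
```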
Sensitive Data Protection
Amazon Macie
- ML-powered service to discover and protect sensitive data in S3
- Automatically detects PII, financial data, credentials
- Generates findings with severity ratings
- Integrates with EventBridge for automated remediation
AWS Secrets Manager
- Store and auto-rotate database credentials, API keys, tokens
- Fine-grained IAM access control
- Automatic rotation with Lambda functions
- Use case: ML pipeline accessing databases, external APIs
Data Masking & Anonymization
| Technique | Description | Reversible? |
|---|---|---|
| Masking | Replace with fixed value (e.g., XXX-XX-1234) | No |
| Tokenization | Replace with random token, keep mapping | Yes |
| Hashing | One-way hash (SHA-256) | No |
| Encryption | Encrypt with key | Yes (with key) |
| Generalization | Reduce precision (age 27 → “20-30”) | No |
| Perturbation | Add noise to numerical data | No |
| Synthetic data | Generate fake but statistically similar data | N/A |
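Four of the techniques in the table fit in a few lines each, which makes the reversible versus irreversible split concrete. This is an illustrative sketch: the function names, salt, and sample values are made up, and a real tokenization vault would live in a secured datastore rather than in memory.

```python
import hashlib
import secrets


def mask_ssn(ssn: str) -> str:
    """Masking: keep the last four digits, replace the rest. Not reversible."""
    return "XXX-XX-" + ssn[-4:]


def hash_value(value: str, salt: str) -> str:
    """Hashing: one-way SHA-256 digest. The salt hinders lookup-table attacks."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def generalize_age(age: int, width: int = 10) -> str:
    """Generalization: reduce precision to a bucket such as 20-30."""
    low = (age // width) * width
    return f"{low}-{low + width}"


class Tokenizer:
    """Tokenization: random token plus a stored mapping, so it is reversible."""

    def __init__(self) -> None:
        self._forward = {}
        self._reverse = {}

    def tokenize(self, value: str) -> str:
        if value not in self._forward:
            token = secrets.token_hex(8)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def detokenize(self, token: str) -> str:
        return self._reverse[token]


vault = Tokenizer()
token = vault.tokenize("alice@example.com")
print(mask_ssn("123-45-6789"), generalize_age(27), vault.detokenize(token))
```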
Web Application Security
AWS WAF (Web Application Firewall)
- Protect API endpoints from web attacks
- Filter by: IP, rate limiting, geographic, SQL injection, XSS patterns
- Use case: protect an API Gateway, ALB, or CloudFront distribution that fronts a SageMaker endpoint (WAF does not attach to SageMaker endpoints directly)
AWS Shield
| Tier | Protection | Cost |
|---|---|---|
| Standard | L3/L4 DDoS protection | Free (automatic) |
| Advanced | L3/L4/L7 + 24/7 Shield Response Team (SRT) + cost protection | $3,000/month (plus usage fees) |
Compliance & Auditing
AWS CloudTrail
- Records API calls in your AWS account (management events by default; data events such as S3 object reads must be enabled)
- Who did what, when, from where
- Essential for security auditing and compliance
- Store logs in S3, analyze with Athena
VPC Flow Logs
- Capture IP traffic going to/from network interfaces in VPC
- Troubleshoot connectivity issues
- Monitor security (detect unusual traffic patterns)
Quick Reference: Security Checklist for ML
□ SageMaker execution role with least-privilege IAM policies
□ S3 data encrypted with SSE-KMS
□ Training instances in private VPC subnet
□ VPC endpoints for S3 and other services (no internet transit)
□ Network isolation enabled (if required)
□ Inter-container encryption for distributed training (if sensitive)
□ Model artifacts encrypted in S3
□ Notebook instance encryption enabled
□ CloudTrail logging enabled
□ Model Monitor + CloudWatch alarms configured
□ Macie scanning S3 for PII
□ Secrets Manager for credentials (not hardcoded)
□ MFA enabled for console users
□ SageMaker Role Manager for persona-based access
Quick Reference: When to Use What
| Scenario | Service |
|---|---|
| Monitor model predictions in production | SageMaker Model Monitor |
| Detect bias drift over time | Model Monitor + Clarify |
| Dashboard for endpoint metrics | CloudWatch Dashboards |
| Alert on metric threshold | CloudWatch Alarms |
| Right-size inference instances | Inference Recommender |
| Reduce inference cost | Serverless / Multi-Model Endpoints / Inferentia / Savings Plans |
| Reduce training cost | Managed Spot Training (with checkpointing) |
| Manage encryption keys | KMS |
| Keep data private in AWS | VPC Endpoints + Private Subnets |
| Discover PII in S3 | Macie |
| Store secrets | Secrets Manager |
| Audit API calls | CloudTrail |
| Protect APIs from attacks | WAF |
| DDoS protection | Shield |
| Fine-grained data lake access | Lake Formation |
| Automated retraining on drift | Model Monitor → EventBridge → SageMaker Pipelines |