Lab: Domain 4 — Monitoring & Security Hands-On
Table of Contents
- Lab: Domain 4 — Monitoring & Security Hands-On
- Lab 1: SageMaker Model Monitor — The Complete Lifecycle
- What Is This Lab About?
- The Four Types of Monitoring — When Each Fires
- The Complete Setup — 3 Steps Before Monitoring Works
- What’s ACTUALLY Running Behind Model Monitor?
- Step 1: Deploy Model with Data Capture
- Step 2: Generate Traffic
- Step 3: Create Baseline from Training Data
- Step 4: Examine the Baseline
- Step 5: Schedule Monitoring
- Step 6: View Violations
- Lab 2: Model Quality Monitor — When You Have Ground Truth
- Lab 3: Automated Retraining — The Full Loop
- Lab 4: Securing the ML Pipeline
- Lab 5: Cost Optimization Patterns
- Domain 4 Lab Summary
Models degrade in production. Data drifts, user behavior changes, bias emerges. This domain (24% of the exam) is about catching problems before they hurt the business — and securing the entire ML pipeline.
Lab 1: SageMaker Model Monitor — The Complete Lifecycle

What Is This Lab About?
You deployed a customer churn model 3 months ago. It worked great — 95% accuracy. But slowly, customer demographics shifted: average income rose, new product categories appeared, a competitor entered the market. Your model never saw this new reality. Accuracy silently dropped to 72%, and nobody noticed until the VP asked why retention campaigns stopped working.
Model Monitor catches this automatically. It compares production data against your training baseline and alerts you when things drift.
The Four Types of Monitoring — When Each Fires
┌─────────────────────────────────────────────────────────────┐
│ The Four Monitors and What They Catch │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ 1. DATA QUALITY MONITOR │ │
│ │ "Has the INPUT data changed?" │ │
│ │ │ │
│ │ Catches: │ │
│ │ • Feature "income" mean shifted $50K → $80K │ │
│ │ • New category "crypto" in "payment_method" │ │
│ │ • "email" column is now 15% null (was 0%) │ │
│ │ • Data type changed (integer → string) │ │
│ │ │ │
│ │ Needs ground truth? NO — just compares input stats │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ 2. MODEL QUALITY MONITOR │ │
│ │ "Is the model getting WRONG answers?" │ │
│ │ │ │
│ │ Catches: │ │
│ │ • Accuracy dropped from 95% to 72% │ │
│ │ • F1 score degraded below threshold │ │
│ │ • AUC-ROC declined significantly │ │
│ │ │ │
│ │ Needs ground truth? YES — must know actual outcomes│ │
│ │ (this is the KEY exam distinction) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ 3. BIAS DRIFT MONITOR │ │
│ │ "Is the model becoming UNFAIR?" │ │
│ │ │ │
│ │ Catches: │ │
│ │ • Approval rate for women dropped from 75% → 55% │ │
│ │ • Disparate Impact fell below 0.8 (legal threshold)│ │
│ │ • New demographic group underrepresented │ │
│ │ │ │
│ │ Uses: SageMaker Clarify bias baseline │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ 4. EXPLAINABILITY DRIFT MONITOR │ │
│ │ "Are DIFFERENT features driving predictions now?" │ │
│ │ │ │
│ │ Catches: │ │
│ │ • "credit_score" was #1 feature, now "zip_code" is │ │
│ │ • SHAP values shifted significantly │ │
│ │ • Model logic has changed even if accuracy hasn't │ │
│ │ │ │
│ │ Uses: SageMaker Clarify SHAP baseline │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
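The 0.8 threshold in the bias drift box refers to the disparate impact ratio: the favorable-outcome rate for one group divided by the rate for a reference group. A minimal sketch of that arithmetic follows. The function name and the 75% reference-group rate are assumptions for illustration, not part of SageMaker Clarify's API.

```python
def disparate_impact(rate_group: float, rate_reference: float) -> float:
    """Ratio of favorable-outcome rates between a group and a reference group."""
    if rate_reference <= 0:
        raise ValueError("reference rate must be positive")
    return rate_group / rate_reference

REFERENCE_RATE = 0.75  # assumed approval rate for the reference group

di_before = disparate_impact(0.75, REFERENCE_RATE)  # approval rate for women at baseline
di_after = disparate_impact(0.55, REFERENCE_RATE)   # approval rate for women after drift
below_threshold = di_after < 0.8                    # True means the 0.8 rule is breached
```

With these assumed numbers the ratio falls from 1.0 to roughly 0.73, which is the kind of change a bias drift monitor is meant to flag.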
The Complete Setup — 3 Steps Before Monitoring Works
STEP 1                 STEP 2                   STEP 3
Enable Data            Create Baseline          Schedule
Capture                (from training data)     Monitoring
    │                      │                        │
    ▼                      ▼                        ▼
Every request    ──▶   "What does NORMAL        Compare captured
and response           data look like?"         data against
logged to S3           Generate statistics:     baseline every
                       mean, std, min, max,     hour/day.
                       data types,              Flag violations.
                       completeness             Alert via
                                                CloudWatch.
Exam trap: “Model Monitor detects no drift” — check if Data Capture is enabled. Without it, there’s literally no data to monitor. This is tested frequently.
What’s ACTUALLY Running Behind Model Monitor?
Model Monitor looks simple on the surface, but there’s real infrastructure doing the work:
Data Capture — it’s an S3 logging layer on the endpoint:
- When you enable Data Capture, the SageMaker endpoint runtime starts writing JSON Lines files to your specified S3 path
- Each file contains captured requests and responses, serialized with metadata (event ID, timestamp, content type)
- Files are organized by: s3://bucket/prefix/endpoint-name/variant/yyyy/mm/dd/hh/
- The sampling percentage controls how many requests are logged (100% = everything, 10% = sample)
- There is NO separate service — it’s built into the endpoint runtime itself
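To make the hourly folder layout concrete, here is a small sketch that builds the prefix for a given capture time. The helper name is hypothetical and "AllTraffic" is an assumed variant name; only the path shape comes from the list above.

```python
from datetime import datetime, timezone

def capture_prefix(base_uri: str, endpoint: str, variant: str, when: datetime) -> str:
    """Build the hourly S3 prefix under which capture files for `when` would land."""
    return f"{base_uri.rstrip('/')}/{endpoint}/{variant}/{when:%Y/%m/%d/%H}/"

example_prefix = capture_prefix(
    "s3://bucket/prefix",
    "monitored-churn-endpoint",
    "AllTraffic",  # assumed variant name
    datetime(2026, 4, 27, 10, 15, tzinfo=timezone.utc),
)
```

Knowing the prefix for a given hour is handy when you want to confirm that captured files are actually arriving before you schedule any monitoring.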
Baseline Job — it’s a SageMaker Processing Job:
- When you call suggest_baseline(), SageMaker launches a Processing Job using a special Model Monitor container (maintained by AWS)
- This container reads your training CSV from S3, computes statistics (mean, std, distribution per feature), and generates two JSON files:
  - statistics.json: numerical profile of every feature
  - constraints.json: rules like “age must be Integral, completeness must be > 0.95”
- These files are your “reference” — what NORMAL looks like
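The kind of per-feature profile a baseline captures can be approximated in a few lines of plain Python. This is a simplified sketch with a hypothetical helper, not the Model Monitor container's actual algorithm or output schema.

```python
import statistics

def profile_feature(values):
    """Summarize one column: inferred type, completeness, and basic numeric stats."""
    present = [v for v in values if v is not None]
    completeness = len(present) / len(values) if values else 0.0
    if present and all(isinstance(v, int) for v in present):
        inferred = "Integral"
    elif present and all(isinstance(v, (int, float)) for v in present):
        inferred = "Fractional"
    else:
        inferred = "String"
    profile = {"inferred_type": inferred, "completeness": completeness}
    if inferred != "String":
        profile.update(
            mean=statistics.fmean(present),
            std=statistics.stdev(present) if len(present) > 1 else 0.0,
            min=min(present),
            max=max(present),
        )
    return profile

ages = [34, 51, None, 28, 45]  # one missing value out of five
age_profile = profile_feature(ages)
```

The numeric fields correspond to what goes into a statistics file, while the type and completeness fields are the raw material for constraints.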
Scheduled Monitoring — it’s a recurring SageMaker Processing Job:
- create_monitoring_schedule() creates a CloudWatch Events rule (cron expression)
- Every hour/day (per your schedule), it triggers a Processing Job using the same Model Monitor container
- This job reads the captured data from S3, computes the same statistics, and compares against your baseline
- If any constraint is violated (e.g., mean shifted >20%, new data type appeared), it writes a violations report to S3
- It also publishes metrics to CloudWatch — one metric per feature per check type
- These CloudWatch metrics are what trigger alarms for automated remediation
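The comparison step can be pictured as follows. This sketch mirrors only the "mean shifted >20%" example from the list; the managed container uses its own drift measures, and the function name and output fields here are assumptions.

```python
def mean_shift_violations(baseline, current, threshold=0.20):
    """Return features whose mean moved by more than `threshold` relative to baseline."""
    violations = []
    for name, base_mean in baseline.items():
        if name not in current or base_mean == 0:
            continue  # this sketch skips missing features and zero baselines
        shift = abs(current[name] - base_mean) / abs(base_mean)
        if shift > threshold:
            violations.append({"feature_name": name, "relative_shift": round(shift, 3)})
    return violations

baseline_means = {"income": 67500, "age": 34.2}
current_means = {"income": 82000, "age": 35.0}
found = mean_shift_violations(baseline_means, current_means)
```

Here income moved by about 21.5% and is flagged, while age moved by roughly 2% and is not.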
The infrastructure chain:
Endpoint Runtime (captures data)
→ S3 (stores captured JSON Lines)
→ CloudWatch Events (triggers on schedule)
→ SageMaker Processing Job (runs Model Monitor container)
→ S3 (writes violation reports)
→ CloudWatch Metrics (publishes drift metrics)
→ CloudWatch Alarm (fires if threshold exceeded)
→ EventBridge (routes alarm event)
→ Lambda (starts retraining pipeline)
Every piece is a real AWS service you’re already familiar with. Model Monitor just orchestrates them.
Step 1: Deploy Model with Data Capture
Data Capture logs every request and response to S3 as JSON Lines files. You choose what to capture and what percentage to sample.
import sagemaker
import boto3
from sagemaker.model_monitor import DataCaptureConfig

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()
prefix = "model-monitor-lab"

data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,  # capture everything (use lower % in production)
    destination_s3_uri=f"s3://{bucket}/{prefix}/datacapture",
    capture_options=["REQUEST", "RESPONSE"],  # log both request and response
    csv_content_types=["text/csv"],
)

# `model` is assumed to be an existing sagemaker Model object (for example, from a prior training job)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="monitored-churn-endpoint",
    data_capture_config=data_capture_config,
)
What a captured record looks like in S3:
{
  "captureData": {
    "endpointInput": {
      "data": "34,75000,720,45",
      "encoding": "CSV"
    },
    "endpointOutput": {
      "data": "0.73",
      "encoding": "CSV"
    }
  },
  "eventMetadata": {
    "eventId": "abcd-1234",
    "inferenceTime": "2026-04-27T10:15:30Z"
  }
}
The eventId is crucial — it’s how Model Quality Monitor later joins predictions with ground truth labels.
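As a rough illustration of working with these records, the sketch below parses one JSON Lines entry shaped like the example above. The helper name is hypothetical, and it assumes CSV encoding with purely numeric features.

```python
import json

def parse_capture_line(line: str) -> dict:
    """Extract the event ID, input features, and prediction from one JSON Lines record."""
    record = json.loads(line)
    capture = record["captureData"]
    return {
        "event_id": record["eventMetadata"]["eventId"],
        "features": [float(x) for x in capture["endpointInput"]["data"].split(",")],
        "prediction": float(capture["endpointOutput"]["data"]),
    }

sample_line = json.dumps({
    "captureData": {
        "endpointInput": {"data": "34,75000,720,45", "encoding": "CSV"},
        "endpointOutput": {"data": "0.73", "encoding": "CSV"},
    },
    "eventMetadata": {"eventId": "abcd-1234", "inferenceTime": "2026-04-27T10:15:30Z"},
})
parsed = parse_capture_line(sample_line)
```

The extracted event_id is the key you would later use to attach a ground truth label to this prediction.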
Step 2: Generate Traffic
Send enough requests so there’s data to analyze. In production, this happens naturally from real users.
from sagemaker.serializers import CSVSerializer
import time

predictor.serializer = CSVSerializer()

with open("test_data/sample_requests.csv", "r") as f:
    for i, row in enumerate(f):
        predictor.predict(row.strip())
        if i % 50 == 0:
            print(f"Sent {i} requests...")
        time.sleep(0.5)  # pace the requests

print("Done. Waiting 60s for S3 sync...")
time.sleep(60)
Step 3: Create Baseline from Training Data
The baseline answers: “what does NORMAL data look like?” SageMaker computes statistics (mean, std, min, max, distribution) and constraints (data types, completeness, allowed ranges) from your training dataset.
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

monitor.suggest_baseline(
    baseline_dataset=f"s3://{bucket}/{prefix}/training-data-with-header.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=f"s3://{bucket}/{prefix}/baseline",
    wait=True,
)
Step 4: Examine the Baseline
Understanding what the baseline produces helps you understand what violations mean.
import pandas as pd

baseline_job = monitor.latest_baselining_job

# STATISTICS: numerical profile of each feature
stats = baseline_job.baseline_statistics()
stats_df = pd.json_normalize(stats.body_dict["features"])
print(stats_df[["name", "numerical_statistics.mean",
                "numerical_statistics.std",
                "numerical_statistics.min",
                "numerical_statistics.max"]])
Example output:
name             mean    std     min     max
age              34.2    12.1    18      72
income           67500   28000   15000   250000
credit_score     710     45      520     850
total_purchases  42      31      0       312
# CONSTRAINTS: rules that production data must follow
constraints = baseline_job.suggested_constraints()
constraints_df = pd.json_normalize(constraints.body_dict["features"])
print(constraints_df[["name", "inferred_type", "completeness"]])
Example output:
name             inferred_type  completeness
age              Integral       1.00
income           Fractional     1.00
credit_score     Integral       0.98
total_purchases  Integral       1.00
This means: age should always be an integer, always present (100% completeness). If production data suddenly has age as a string, or 15% null — that’s a violation.
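A simplified sketch of how type and completeness rules like these might be applied to a batch of production records. The function and data layout are assumptions for illustration; only the check type names come from the violation reports shown in this lab.

```python
def check_constraints(rows, constraints):
    """Compare a batch of records (dicts) against per-feature type and completeness rules."""
    type_map = {"Integral": int, "Fractional": (int, float)}
    violations = []
    for name, rule in constraints.items():
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        completeness = len(present) / len(values) if values else 0.0
        if completeness < rule["completeness"]:
            violations.append((name, "completeness_check"))
        expected = type_map[rule["inferred_type"]]
        if any(not isinstance(v, expected) for v in present):
            violations.append((name, "data_type_check"))
    return violations

constraints = {"age": {"inferred_type": "Integral", "completeness": 1.0}}
batch = [{"age": 34}, {"age": "41"}, {"age": None}, {"age": 29}]
result = check_constraints(batch, constraints)
```

In this batch, one null breaks the completeness rule and one string value breaks the type rule, so both checks fire for age.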
Step 5: Schedule Monitoring
from sagemaker.model_monitor import CronExpressionGenerator

monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-data-quality-schedule",
    endpoint_input=predictor.endpoint_name,
    output_s3_uri=f"s3://{bucket}/{prefix}/monitoring-reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,  # publish metrics for alarms
)
Each hour, SageMaker:
- Reads captured data from S3
- Computes current statistics
- Compares against baseline
- Publishes violations report + CloudWatch metrics
Step 6: View Violations
import time

executions = monitor.list_executions()
while len(executions) == 0:
    print("Waiting for first monitoring execution...")
    time.sleep(120)
    executions = monitor.list_executions()

latest = executions[-1]
latest.wait(logs=False)
print(f"Status: {latest.describe()['ProcessingJobStatus']}")

# Check for violations
violations = monitor.latest_monitoring_constraint_violations()
if violations and violations.body_dict.get("violations"):
    vdf = pd.json_normalize(violations.body_dict["violations"])
    print(vdf[["feature_name", "constraint_check_type", "description"]])
else:
    print("No violations: production data matches training baseline")
Example violations:
feature_name    constraint_check_type     description
income          baseline_drift_check      Mean shifted from 67500 to 82000 (>20%)
payment_method  categorical_values_check  Found new category "crypto" not in baseline
email           completeness_check        Completeness dropped from 1.0 to 0.85
Each of these tells you something specific went wrong:
- Income drift: Customers are wealthier now — model trained on different population
- New category: A payment type the model never saw — predictions are unreliable for these
- Missing data: 15% of emails are null — if email is a feature, predictions are degraded
Lab 2: Model Quality Monitor — When You Have Ground Truth
What Is This Lab About?
Data Quality catches input drift (features changed). But what if features look fine, yet the model is still giving wrong answers? Maybe the relationship between features and outcome changed (concept drift). For this, you need Model Quality Monitor — and it requires ground truth labels.
The Ground Truth Problem
TIME T=0 (prediction):
Customer 1234 → Model predicts: 73% churn probability → "Will churn"
You know the prediction. You DON'T know if it's right.
TIME T=30 days (outcome):
Customer 1234 → Actually churned: YES (ground truth = 1)
NOW you can evaluate: prediction was correct.
TIME T=60 days (monitoring):
Model Quality Monitor merges:
prediction (from Data Capture) + actual outcome (from you)
→ Computes accuracy, precision, recall, F1, AUC
→ Compares against training baseline
→ Flags if metrics degraded
The key insight: ground truth arrives LATER (days, weeks, months after the prediction). You must upload it yourself. SageMaker cannot infer it.
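The merge-then-score idea can be sketched in plain Python. This is an illustration of the join on event ID and the metric arithmetic, with hypothetical names and a 0.5 cutoff chosen for the example; it is not how the managed monitor is implemented.

```python
def merge_and_score(predictions, ground_truth, threshold=0.5):
    """Join predictions to labels on event ID, then compute binary classification metrics."""
    tp = fp = fn = tn = 0
    for event_id, probability in predictions.items():
        if event_id not in ground_truth:
            continue  # label has not arrived yet
        predicted = 1 if probability >= threshold else 0
        actual = ground_truth[event_id]
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "matched": total,
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical data: four predictions, but only three labels have arrived so far
predictions = {"abcd-1234": 0.73, "efgh-5678": 0.20, "ijkl-9012": 0.65, "mnop-3456": 0.90}
ground_truth = {"abcd-1234": 1, "efgh-5678": 0, "ijkl-9012": 0}
metrics = merge_and_score(predictions, ground_truth)
```

Predictions without a label are simply left out of the metrics, which is why delayed ground truth limits how quickly quality problems can be detected.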
Upload Ground Truth
Match each prediction using the eventId from Data Capture.
import json
import boto3

# One JSON object per line; eventId must match the captured prediction
ground_truth_records = [
    {"groundTruthData": {"data": "1", "encoding": "CSV"},
     "eventMetadata": {"eventId": "abcd-1234"},
     "eventVersion": "0"},
    {"groundTruthData": {"data": "0", "encoding": "CSV"},
     "eventMetadata": {"eventId": "efgh-5678"},
     "eventVersion": "0"},
]
gt_body = "\n".join(json.dumps(r) for r in ground_truth_records)

s3 = boto3.client("s3")
s3.put_object(
    Bucket=bucket,
    Key=f"{prefix}/ground-truth/2026/04/27/10/gt.jsonl",
    Body=gt_body.encode("utf-8"),
)
Schedule Model Quality Monitoring
from sagemaker.model_monitor import (
    CronExpressionGenerator,
    EndpointInput,
    ModelQualityMonitor,
)

mq_monitor = ModelQualityMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    sagemaker_session=session,
)

mq_monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-model-quality-schedule",
    endpoint_input=EndpointInput(
        endpoint_name="monitored-churn-endpoint",
        destination="/opt/ml/processing/input_data",  # local path inside the monitoring container
        probability_attribute="0",  # column index of the predicted probability
        probability_threshold_attribute=0.5,  # assumed cutoff for the positive class
    ),
    ground_truth_input=f"s3://{bucket}/{prefix}/ground-truth",
    problem_type="BinaryClassification",
    output_s3_uri=f"s3://{bucket}/{prefix}/model-quality-reports",
    schedule_cron_expression=CronExpressionGenerator.daily(),
)
Exam Decision: Which Monitor When?
| I Notice… | Monitor Type | Needs Ground Truth? |
|---|---|---|
| Feature distributions shifted | Data Quality | NO |
| Model accuracy dropped | Model Quality | YES |
| Predictions unfair to a group | Bias Drift | Depends |
| Different features driving predictions | Explainability Drift | NO |
| No data appearing in monitor | Check Data Capture is enabled | N/A |
Lab 3: Automated Retraining — The Full Loop
What Is This Lab About?
Detecting drift is pointless if nobody acts on it. The real value is an automated pipeline that detects drift, retrains the model, validates it, and deploys — with human approval as the only manual step.
The Complete Automated Flow
┌──────────────────────────────────────────────────────────────┐
│ │
│ PRODUCTION ENDPOINT │
│ │ │
│ ▼ │
│ Model Monitor (hourly) │
│ "Income mean shifted 20%" │
│ │ │
│ ▼ │
│ CloudWatch Metric published │
│ sagemaker/Endpoints/data-metrics/feature_baseline_drift │
│ │ │
│ ▼ │
│ CloudWatch ALARM fires (threshold > 0.2) │
│ │ │
│ ├──────▶ SNS → Email: "Data drift detected on income" │
│ │ │
│ ▼ │
│ EventBridge RULE matches alarm state change │
│ │ │
│ ▼ │
│ LAMBDA FUNCTION triggered │
│ → Calls sagemaker.start_pipeline_execution("MLOpsPipeline") │
│ │ │
│ ▼ │
│ SageMaker PIPELINE runs: │
│ Preprocess new data │
│ → Train new model │
│ → Evaluate on test set │
│ → IF better: Register in Model Registry │
│ (status: PendingManualApproval) │
│ │ │
│ ▼ │
│ DATA SCIENTIST reviews model metrics │
│ → Approves in Model Registry │
│ │ │
│ ▼ │
│ CI/CD DEPLOYS approved model to endpoint │
│ (canary deployment: 5% → monitor → 100%) │
│ │
└──────────────────────────────────────────────────────────────┘
Step 1: Create CloudWatch Alarm
cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="income-drift-alarm",
    Namespace="aws/sagemaker/Endpoints/data-metrics",
    MetricName="feature_baseline_drift_income",
    Dimensions=[
        {"Name": "Endpoint", "Value": "monitored-churn-endpoint"},
        {"Name": "MonitoringSchedule", "Value": "churn-data-quality-schedule"},
    ],
    Statistic="Average",
    Period=3600,  # check every hour
    EvaluationPeriods=1,  # fire after 1 breach
    Threshold=0.2,  # 20% drift threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456:drift-alerts"],
)
Step 2: EventBridge Rule → Lambda
import json

events = boto3.client("events")

events.put_rule(
    Name="drift-triggers-retraining",
    EventPattern=json.dumps({
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": ["income-drift-alarm"],
            "state": {"value": ["ALARM"]},
        },
    }),
    State="ENABLED",
)

# Note: the Lambda function also needs a resource-based permission
# that lets EventBridge (events.amazonaws.com) invoke it.
events.put_targets(
    Rule="drift-triggers-retraining",
    Targets=[{
        "Id": "retrain-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456:function:trigger-retraining",
    }],
)
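To see why the rule fires only on the ALARM transition, here is a deliberately simplified matcher for exact-value patterns. Real EventBridge patterns support more operators (prefix, numeric, anything-but, and so on); the function below is a hypothetical illustration and the sample events contain only the fields the pattern inspects.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Exact-value matching: every pattern key must exist and match in the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:  # leaf patterns are lists of allowed values
            return False
    return True

pattern = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"alarmName": ["income-drift-alarm"], "state": {"value": ["ALARM"]}},
}
alarm_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "income-drift-alarm", "state": {"value": "ALARM"}},
}
# Same alarm returning to OK: should not trigger retraining
ok_event = {**alarm_event, "detail": {"alarmName": "income-drift-alarm", "state": {"value": "OK"}}}

fires_on_alarm = matches(pattern, alarm_event)
fires_on_ok = matches(pattern, ok_event)
```

Filtering on the state value keeps the retraining Lambda from running again when the alarm recovers.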
Step 3: Lambda Starts the Pipeline
# Lambda function: trigger-retraining
import boto3

def lambda_handler(event, context):
    sm = boto3.client("sagemaker")
    response = sm.start_pipeline_execution(
        PipelineName="MLOpsPipeline",
        PipelineParameters=[
            {"Name": "ModelApprovalStatus", "Value": "PendingManualApproval"},
        ],
        PipelineExecutionDescription="Auto-triggered by income drift alarm",
    )
    return {"executionArn": response["PipelineExecutionArn"]}
The pipeline handles everything else: preprocess, train, evaluate, register. The data scientist only needs to approve in Model Registry.
Lab 4: Securing the ML Pipeline

What Is This Lab About?
ML pipelines handle sensitive data (customer PII, financial records, health information). Securing them isn’t optional — it’s a compliance requirement (HIPAA, PCI-DSS, GDPR). This lab covers the three pillars: encryption, access control, and network isolation.
The Three Pillars of ML Security
┌─────────────────────────────────────────────────────────────┐
│ ML Security Architecture │
│ │
│ PILLAR 1: ENCRYPTION │
│ ┌──────────────────────────────────────────────────┐ │
│ │ At Rest: │ │
│ │ S3 training data → SSE-KMS │ │
│ │ EBS instance volumes → KMS │ │
│ │ Model artifacts → KMS │ │
│ │ Feature Store → KMS │ │
│ │ │ │
│ │ In Transit: │ │
│ │ Client → Endpoint → TLS 1.2 (automatic) │ │
│ │ Instance ↔ Instance → Inter-container TLS │ │
│ │ (opt-in, +10% overhead) │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ PILLAR 2: ACCESS CONTROL │
│ ┌──────────────────────────────────────────────────┐ │
│ │ IAM Roles: │ │
│ │ SageMaker Execution Role → minimum S3/ECR/KMS │ │
│ │ Data Scientist Role → can train, NOT deploy│ │
│ │ MLOps Engineer Role → can deploy, NOT train│ │
│ │ │ │
│ │ Principle of Least Privilege: │ │
│ │ "Grant ONLY the permissions needed for the task"│ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ PILLAR 3: NETWORK ISOLATION │
│ ┌──────────────────────────────────────────────────┐ │
│ │ VPC + Private Subnets: │ │
│ │ Training instances → no public IP │ │
│ │ Endpoints → no public IP │ │
│ │ │ │
│ │ VPC Endpoints (PrivateLink): │ │
│ │ S3, ECR, SageMaker API, CloudWatch Logs │ │
│ │ → Traffic stays within AWS, never hits internet│ │
│ │ │ │
│ │ Network Isolation Mode: │ │
│ │ enable_network_isolation=True │ │
│ │ → Container has ZERO internet access │ │
│ │ → Cannot pip install, call APIs, exfiltrate │ │
│ └──────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Encryption: KMS for Everything
KMS (Key Management Service) manages encryption keys. Customer Managed Keys (CMK) give you full control: you set rotation policy, who can use the key, and audit all usage via CloudTrail.
# Training job with full encryption
from sagemaker.estimator import Estimator

# `image_uri` is assumed to point at your training container image in ECR
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/{prefix}/output",
    # Encryption
    output_kms_key="arn:aws:kms:us-east-1:123456:key/my-key",  # model artifacts
    volume_kms_key="arn:aws:kms:us-east-1:123456:key/my-key",  # instance disk
    encrypt_inter_container_traffic=True,  # TLS between distributed training nodes
    sagemaker_session=session,
)
HIPAA compliance requires ALL of these:
- KMS encryption at rest (S3 + EBS)
- TLS in transit (automatic for endpoints)
- Inter-container encryption (must opt-in)
- VPC with private subnets
- CloudTrail audit logging
Access Control: Least-Privilege IAM
Data Scientist Role — can train but NOT deploy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateTrainingJob",
        "sagemaker:CreateProcessingJob",
        "sagemaker:CreateHyperParameterTuningJob",
        "sagemaker:Describe*",
        "sagemaker:List*"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Deny",
      "Action": [
        "sagemaker:CreateEndpoint",
        "sagemaker:UpdateEndpoint",
        "sagemaker:DeleteEndpoint"
      ],
      "Resource": "*"
    }
  ]
}
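The policy relies on two IAM evaluation rules: an explicit Deny always wins, and anything not explicitly allowed is denied by default. A heavily simplified sketch of that logic follows. It looks only at actions with wildcard matching and ignores resources, conditions, and every other policy type, so treat it as a teaching aid rather than a policy simulator.

```python
from fnmatch import fnmatchcase

def is_allowed(policy: dict, action: str) -> bool:
    """Explicit Deny wins; otherwise an Allow is required; the default is deny."""
    allowed = False
    for statement in policy["Statement"]:
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        # IAM action names are case-insensitive, so compare in lowercase
        if any(fnmatchcase(action.lower(), pat.lower()) for pat in actions):
            if statement["Effect"] == "Deny":
                return False
            allowed = True
    return allowed

# Abbreviated copy of the data scientist policy, for illustration
data_scientist_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["sagemaker:CreateTrainingJob", "sagemaker:Describe*", "sagemaker:List*"],
         "Resource": "*"},
        {"Effect": "Deny",
         "Action": ["sagemaker:CreateEndpoint", "sagemaker:UpdateEndpoint"],
         "Resource": "*"},
    ],
}

can_train = is_allowed(data_scientist_policy, "sagemaker:CreateTrainingJob")
can_deploy = is_allowed(data_scientist_policy, "sagemaker:CreateEndpoint")
can_create_model = is_allowed(data_scientist_policy, "sagemaker:CreateModel")
```

Note that CreateModel is denied here even though no Deny statement names it: it simply has no matching Allow.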
SageMaker Execution Role — minimum permissions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::my-ml-bucket", "arn:aws:s3:::my-ml-bucket/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["ecr:GetDownloadUrlForLayer", "ecr:BatchGetImage", "ecr:GetAuthorizationToken"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogStream", "logs:PutLogEvents", "logs:CreateLogGroup"],
      "Resource": "arn:aws:logs:*:*:log-group:/aws/sagemaker/*"
    },
    {
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
      "Resource": "arn:aws:kms:us-east-1:123456:key/my-key"
    }
  ]
}
Network Isolation: VPC + Endpoints
The problem: SageMaker instances in a VPC need to reach S3 (for data), ECR (for containers), and CloudWatch (for logging). Without VPC Endpoints, they’d need a NAT Gateway (expensive, traffic goes through the internet).
The solution: VPC Endpoints create private connections from your VPC to AWS services. Traffic never leaves the AWS network.
┌────────────────────── VPC ──────────────────────────┐
│ │
│ ┌─── Private Subnet ────────────────────────────┐ │
│ │ │ │
│ │ SageMaker Training / Endpoint instances │ │
│ │ (no public IP, no internet access) │ │
│ │ │ │
│ └────────────┬───────────────────────────────────┘ │
│ │ │
│ ┌────────────▼───────────────────────────────────┐ │
│ │ VPC Endpoints (PrivateLink) │ │
│ │ │ │
│ │ ● S3 Gateway Endpoint (free) │ │
│ │ → Training reads/writes data │ │
│ │ │ │
│ │ ● SageMaker API Interface Endpoint │ │
│ │ → API calls to create jobs │ │
│ │ │ │
│ │ ● SageMaker Runtime Interface Endpoint │ │
│ │ → Invoke inference endpoint │ │
│ │ │ │
│ │ ● ECR Interface Endpoints (ecr.api + ecr.dkr) │ │
│ │ → Pull training/inference containers │ │
│ │ │ │
│ │ ● CloudWatch Logs Interface Endpoint │ │
│ │ → Write training/inference logs │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ NO NAT Gateway needed. NO internet traffic. │
└──────────────────────────────────────────────────────┘
Common VPC troubleshooting (heavily tested):
| Error Message | Missing Endpoint | Fix |
|---|---|---|
| “Unable to download data from S3” | S3 Gateway | Add S3 Gateway Endpoint to route table |
| “Cannot pull Docker container” | ECR | Add BOTH ecr.api and ecr.dkr Interface Endpoints |
| “Cannot write logs” | CloudWatch Logs | Add logs Interface Endpoint |
| “Training job hangs at Downloading” | S3 or ECR | Check Security Group allows outbound to endpoints |
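When troubleshooting, it can help to compare the endpoints a VPC actually has against the set a private training job typically needs. The sketch below works on a plain set of service names (in the com.amazonaws.REGION.SERVICE form); the helper and the sample data are hypothetical, and fetching the real list from your account is left out.

```python
REQUIRED_SERVICES = {
    "s3": "Gateway endpoint: training data and model artifacts",
    "ecr.api": "Interface endpoint: ECR API calls",
    "ecr.dkr": "Interface endpoint: Docker image layers",
    "logs": "Interface endpoint: CloudWatch Logs",
}

def missing_endpoints(region: str, existing_service_names: set) -> list:
    """List required services with no matching VPC endpoint service name."""
    return sorted(
        short for short in REQUIRED_SERVICES
        if f"com.amazonaws.{region}.{short}" not in existing_service_names
    )

# Hypothetical VPC that has S3 and only one of the two ECR endpoints
existing = {"com.amazonaws.us-east-1.s3", "com.amazonaws.us-east-1.ecr.api"}
gaps = missing_endpoints("us-east-1", existing)
```

In this example the result points at the missing ecr.dkr and logs endpoints, which lines up with the "Cannot pull Docker container" and "Cannot write logs" rows in the table.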
Network Isolation — maximum lockdown:
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    subnets=["subnet-abc123"],
    security_group_ids=["sg-0123456789"],
    enable_network_isolation=True,  # ZERO internet access
    sagemaker_session=session,
)
With enable_network_isolation=True, the container:
- Cannot pip install anything
- Cannot make HTTP calls to any API
- Cannot exfiltrate data anywhere
- Can ONLY access S3 through the defined input channels
- All dependencies MUST be baked into the container image
Use this for: HIPAA, PCI-DSS, handling classified data, preventing data exfiltration.
Lab 5: Cost Optimization Patterns
What Is This Lab About?
ML workloads are expensive — GPU instances, always-on endpoints, large training runs. But most teams overspend by 40-70% because they don’t use the right cost optimization techniques. The exam tests your ability to recommend the cheapest option that still meets requirements.
The Cost Optimization Decision Tree
TRAINING COSTS:
"How to reduce training cost?"
│
├── Can tolerate interruptions? ── YES ── Spot Training (up to 90% savings)
│ + Checkpointing required
│
├── Running same pipeline daily? ── YES ── Step Caching
│ (skip unchanged steps)
│
├── Using GPU for XGBoost? ── YES ── Switch to CPU (ml.m5)
│ XGBoost often trains well on CPU; GPU mainly helps on large datasets
│
└── Training frequently? ── YES ── Warm Pools (reduce startup time)
or SageMaker Savings Plans (up to 64%)
INFERENCE COSTS:
"How to reduce inference cost?"
│
├── Sporadic traffic (<100/day)? ── YES ── Serverless (pay per call)
│
├── Irregular + large payloads? ── YES ── Async (scale to 0 + GPU)
│
├── Many small models? ── YES ── Multi-Model Endpoint (share instance)
│
├── Scheduled bulk scoring? ── YES ── Batch Transform (no endpoint)
│
├── Steady, predictable load? ── YES ── Savings Plans (up to 64%)
│ or Inferentia chips (up to 70%)
│
└── Any deployed model? ── YES ── SageMaker Neo (compile for hardware)
Up to 2x faster → use smaller instance
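The inference branch of the tree can be read as a first-match rule list. The sketch below encodes it that way; the function name, parameter names, and returned labels are assumptions made for illustration, and the only deviation from the tree's order is that large payloads rule out Serverless.

```python
def recommend_inference_option(
    requests_per_day: int,
    irregular_large_payloads: bool = False,
    many_small_models: bool = False,
    scheduled_bulk: bool = False,
    steady_load: bool = False,
) -> str:
    """Walk the inference branches in order and return the first match."""
    if requests_per_day < 100 and not irregular_large_payloads:
        return "Serverless Inference"
    if irregular_large_payloads:
        return "Asynchronous Inference"
    if many_small_models:
        return "Multi-Model Endpoint"
    if scheduled_bulk:
        return "Batch Transform"
    if steady_load:
        return "Savings Plans or Inferentia"
    return "SageMaker Neo"  # fallback that applies to any deployed model

sporadic = recommend_inference_option(50)
big_payloads = recommend_inference_option(50, irregular_large_payloads=True)
steady = recommend_inference_option(10_000, steady_load=True)
default = recommend_inference_option(10_000)
```

Exam questions usually give one dominant clue (traffic pattern, payload size, number of models), so matching on the first clue that applies is a reasonable way to practice.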
Key Numbers to Remember
| Optimization | Savings | Trade-off |
|---|---|---|
| Spot Training | Up to 90% | Interruptions — need checkpointing |
| Serverless Inference | Pay per call | Cold starts, CPU only, 4 MB max |
| Async (scale to 0) | Pay per use | Minutes latency, queue-based |
| Multi-Model Endpoint | ~80% vs individual endpoints | Cold model load time |
| Batch Transform | Pay per job only | No real-time, hours for results |
| Savings Plans | Up to 64% | 1 or 3-year commitment |
| Inferentia (ml.inf2) | Up to 70% vs GPU | Limited framework support |
| Trainium (ml.trn1) | Up to 50% vs GPU | Training only |
| SageMaker Neo | Up to 2x throughput | Compilation step required |
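The percentages in the table are upper bounds, so it helps to turn them into best-case dollar figures. The sketch below does that arithmetic for a 100-hour training workload; the hourly rate is a hypothetical number, not a real AWS price.

```python
def discounted_cost(hours: float, hourly_rate: float, discount: float) -> float:
    """Cost after applying a fractional discount (0.0 to 1.0) to an on-demand rate."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return hours * hourly_rate * (1.0 - discount)

HOURLY_RATE = 4.00  # hypothetical on-demand price in USD per instance-hour

on_demand = discounted_cost(100, HOURLY_RATE, 0.0)
spot_best_case = discounted_cost(100, HOURLY_RATE, 0.90)          # "up to 90%"
savings_plan_best_case = discounted_cost(100, HOURLY_RATE, 0.64)  # "up to 64%"
```

Under these assumptions the bill drops from 400 USD to roughly 40 USD with Spot or 144 USD with a Savings Plan, before accounting for interruptions or the commitment term.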
Domain 4 Lab Summary
| Lab | Service | You Learned |
|---|---|---|
| 1 | Model Monitor | Data Capture → Baseline → Schedule → Violations — the complete lifecycle |
| 2 | Model Quality | Ground truth merging, why it NEEDS actual outcomes, delayed feedback problem |
| 3 | Automated Retraining | Monitor → CloudWatch → EventBridge → Lambda → Pipeline — full automation loop |
| 4 | Security | KMS encryption, IAM least privilege, VPC Endpoints, Network Isolation, HIPAA checklist |
| 5 | Cost Optimization | Decision trees for training and inference cost reduction, key savings numbers |