Lab: Domain 4 — Monitoring & Security Hands-On

Models degrade in production. Data drifts, user behavior changes, bias emerges. This domain (24% of the exam) is about catching problems before they hurt the business — and securing the entire ML pipeline.

Lab 1: SageMaker Model Monitor — The Complete Lifecycle

Model Monitor Pipeline

What Is This Lab About?

You deployed a customer churn model 3 months ago. It worked great — 95% accuracy. But slowly, customer demographics shifted: average income rose, new product categories appeared, a competitor entered the market. Your model never saw this new reality. Accuracy silently dropped to 72%, and nobody noticed until the VP asked why retention campaigns stopped working.

Model Monitor catches this automatically. It compares production data against your training baseline and alerts you when things drift.

The Four Types of Monitoring — When Each Fires

┌─────────────────────────────────────────────────────────────┐
│           The Four Monitors and What They Catch              │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ 1. DATA QUALITY MONITOR                              │   │
│  │    "Has the INPUT data changed?"                      │   │
│  │                                                      │   │
│  │    Catches:                                          │   │
│  │    • Feature "income" mean shifted $50K → $80K        │   │
│  │    • New category "crypto" in "payment_method"        │   │
│  │    • "email" column is now 15% null (was 0%)          │   │
│  │    • Data type changed (integer → string)             │   │
│  │                                                      │   │
│  │    Needs ground truth? NO — just compares input stats │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ 2. MODEL QUALITY MONITOR                             │   │
│  │    "Is the model getting WRONG answers?"              │   │
│  │                                                      │   │
│  │    Catches:                                          │   │
│  │    • Accuracy dropped from 95% to 72%                 │   │
│  │    • F1 score degraded below threshold                │   │
│  │    • AUC-ROC declined significantly                   │   │
│  │                                                      │   │
│  │    Needs ground truth? YES — must know actual outcomes│   │
│  │    (this is the KEY exam distinction)                 │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ 3. BIAS DRIFT MONITOR                                │   │
│  │    "Is the model becoming UNFAIR?"                    │   │
│  │                                                      │   │
│  │    Catches:                                          │   │
│  │    • Approval rate for women dropped from 75% → 55%   │   │
│  │    • Disparate Impact fell below 0.8 (legal threshold)│   │
│  │    • New demographic group underrepresented           │   │
│  │                                                      │   │
│  │    Uses: SageMaker Clarify bias baseline              │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ 4. EXPLAINABILITY DRIFT MONITOR                      │   │
│  │    "Are DIFFERENT features driving predictions now?"   │   │
│  │                                                      │   │
│  │    Catches:                                          │   │
│  │    • "credit_score" was #1 feature, now "zip_code" is │   │
│  │    • SHAP values shifted significantly                │   │
│  │    • Model logic has changed even if accuracy hasn't  │   │
│  │                                                      │   │
│  │    Uses: SageMaker Clarify SHAP baseline              │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

The Complete Setup — 3 Steps Before Monitoring Works

STEP 1               STEP 2                   STEP 3
Enable Data          Create Baseline           Schedule
Capture              (from training data)      Monitoring
    │                     │                        │
    ▼                     ▼                        ▼
Every request ──▶  "What does NORMAL      Compare captured
and response       data look like?"        data against
logged to S3       Generate statistics:    baseline every
                   mean, std, min, max,    hour/day.
                   data types,             Flag violations.
                   completeness            Alert via
                                          CloudWatch.

Exam trap: “Model Monitor detects no drift” — check if Data Capture is enabled. Without it, there’s literally no data to monitor. This is tested frequently.

What’s ACTUALLY Running Behind Model Monitor?

Model Monitor looks simple on the surface, but there’s real infrastructure doing the work:

Data Capture — it’s an S3 logging layer on the endpoint:

When you enable Data Capture, the SageMaker endpoint runtime starts writing JSON Lines files to your specified S3 path
Each file contains captured requests and responses, serialized with metadata (event ID, timestamp, content type)
Files are organized by: s3://bucket/prefix/endpoint-name/variant/yyyy/mm/dd/hh/
The sampling percentage controls how many requests are logged (100% = everything, 10% = sample)
There is NO separate service — it’s built into the endpoint runtime itself

Baseline Job — it’s a SageMaker Processing Job:

When you call suggest_baseline(), SageMaker launches a Processing Job using a special Model Monitor container (maintained by AWS)
This container reads your training CSV from S3, computes statistics (mean, std, distribution per feature), and generates two JSON files:
- statistics.json — numerical profile of every feature
- constraints.json — rules like “age must be Integral, completeness must be > 0.95”
These files are your “reference” — what NORMAL looks like

Scheduled Monitoring — it’s a recurring SageMaker Processing Job:

create_monitoring_schedule() registers a SageMaker monitoring schedule driven by a cron expression; SageMaker manages the scheduling itself, so there is no separate rule for you to maintain
Every hour/day (per your schedule), it triggers a Processing Job using the same Model Monitor container
This job reads the captured data from S3, computes the same statistics, and compares against your baseline
If any constraint is violated (e.g., a feature's distribution drifted past the configured threshold, or a new data type appeared), it writes a violations report to S3
It also publishes metrics to CloudWatch: one metric per feature per check type
These CloudWatch metrics are what trigger alarms for automated remediation

The infrastructure chain:

Endpoint Runtime (captures data)
    → S3 (stores captured JSON Lines)
    → Monitoring Schedule (SageMaker-managed cron trigger)
    → SageMaker Processing Job (runs Model Monitor container)
    → S3 (writes violation reports)
    → CloudWatch Metrics (publishes drift metrics)
    → CloudWatch Alarm (fires if threshold exceeded)
    → EventBridge (routes alarm event)
    → Lambda (starts retraining pipeline)

Every piece is a real AWS service you’re already familiar with. Model Monitor just orchestrates them.

Step 1: Deploy Model with Data Capture

Data Capture logs every request and response to S3 as JSON Lines files. You choose what to capture and what percentage to sample.

import sagemaker
import boto3
from sagemaker.model_monitor import DataCaptureConfig

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()
prefix = "model-monitor-lab"

data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,             # capture everything (use lower % in production)
    destination_s3_uri=f"s3://{bucket}/{prefix}/datacapture",
    capture_options=["Input", "Output"], # log both request and response
    csv_content_types=["text/csv"],
)

# `model` is the SageMaker Model object created earlier in the lab (not defined in this snippet)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="monitored-churn-endpoint",
    data_capture_config=data_capture_config,
)

What a captured record looks like in S3:

{
  "captureData": {
    "endpointInput": {
      "data": "34,75000,720,45",
      "encoding": "CSV"
    },
    "endpointOutput": {
      "data": "0.73",
      "encoding": "CSV"
    }
  },
  "eventMetadata": {
    "eventId": "abcd-1234",
    "inferenceTime": "2026-04-27T10:15:30Z"
  }
}

The eventId is crucial — it’s how Model Quality Monitor later joins predictions with ground truth labels.
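To make the record layout concrete, here is a minimal sketch that parses one captured line and pulls out the three things later steps rely on: the eventId, the input features, and the prediction. It assumes the trimmed record shape shown above (real capture files carry extra metadata fields), and parse_capture_record is a hypothetical helper name, not part of the SageMaker SDK.

```python
import json

# One line of a Data Capture file, trimmed to the fields shown above.
captured_line = json.dumps({
    "captureData": {
        "endpointInput": {"data": "34,75000,720,45", "encoding": "CSV"},
        "endpointOutput": {"data": "0.73", "encoding": "CSV"},
    },
    "eventMetadata": {
        "eventId": "abcd-1234",
        "inferenceTime": "2026-04-27T10:15:30Z",
    },
})


def parse_capture_record(line):
    """Return (event_id, features, prediction) from one JSON Lines record."""
    record = json.loads(line)
    capture = record["captureData"]
    # Assumes CSV-encoded numeric features, as in the example record.
    features = [float(v) for v in capture["endpointInput"]["data"].split(",")]
    prediction = float(capture["endpointOutput"]["data"])
    return record["eventMetadata"]["eventId"], features, prediction


event_id, features, prediction = parse_capture_record(captured_line)
print(event_id, features, prediction)
```

Keeping the eventId next to each prediction is what makes the later ground truth join possible.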

Step 2: Generate Traffic

Send enough requests so there’s data to analyze. In production, this happens naturally from real users.

from sagemaker.serializers import CSVSerializer
import time

predictor.serializer = CSVSerializer()

with open("test_data/sample_requests.csv", "r") as f:
    for i, row in enumerate(f):
        predictor.predict(row.strip())
        if i % 50 == 0:
            print(f"Sent {i} requests...")
        time.sleep(0.5)

print("Done. Waiting 60s for S3 sync...")
time.sleep(60)

Step 3: Create Baseline from Training Data

The baseline answers: “what does NORMAL data look like?” SageMaker computes statistics (mean, std, min, max, distribution) and constraints (data types, completeness, allowed ranges) from your training dataset.

from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

monitor.suggest_baseline(
    baseline_dataset=f"s3://{bucket}/{prefix}/training-data-with-header.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=f"s3://{bucket}/{prefix}/baseline",
    wait=True,
)

Step 4: Examine the Baseline

Understanding what the baseline produces helps you understand what violations mean.

import pandas as pd

baseline_job = monitor.latest_baselining_job

# STATISTICS: numerical profile of each feature
stats = baseline_job.baseline_statistics()
stats_df = pd.json_normalize(stats.body_dict["features"])
print(stats_df[["name", "numerical_statistics.mean",
                 "numerical_statistics.std",
                 "numerical_statistics.min",
                 "numerical_statistics.max"]])

Example output:

name              mean      std       min      max
age               34.2      12.1      18       72
income            67500     28000     15000    250000
credit_score      710       45        520      850
total_purchases   42        31        0        312

# CONSTRAINTS: rules that production data must follow
constraints = baseline_job.suggested_constraints()
constraints_df = pd.json_normalize(constraints.body_dict["features"])
print(constraints_df[["name", "inferred_type", "completeness"]])

Example output:

name              inferred_type    completeness
age               Integral         1.00
income            Fractional       1.00
credit_score      Integral         0.98
total_purchases   Integral         1.00

This means: age should always be an integer, always present (100% completeness). If production data suddenly has age as a string, or 15% null — that’s a violation.
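As a rough illustration of what "checking against constraints" means, the sketch below applies the two rule types from the example output (inferred type and completeness) to a small batch of records. This is not the Model Monitor container's actual logic; check_constraints and the sample rows are hypothetical and only mirror the idea.

```python
def check_constraints(rows, constraints):
    """Return (feature, check_type, detail) tuples for broken constraints."""
    violations = []
    for name, rule in constraints.items():
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        completeness = len(present) / len(values)
        if completeness < rule["completeness"]:
            detail = f"{completeness:.2f} < {rule['completeness']:.2f}"
            violations.append((name, "completeness_check", detail))
        # Only the Integral type is checked here, to keep the sketch short.
        if rule["inferred_type"] == "Integral" and any(
            not isinstance(v, int) for v in present
        ):
            violations.append((name, "data_type_check", "non-integer value found"))
    return violations


# Constraints copied from the example baseline output above.
constraints = {
    "age": {"inferred_type": "Integral", "completeness": 1.00},
    "credit_score": {"inferred_type": "Integral", "completeness": 0.98},
}

# Hypothetical production batch.
rows = [
    {"age": 34, "credit_score": 720},
    {"age": "41", "credit_score": 690},  # age arrived as a string
    {"age": 29, "credit_score": None},   # credit_score is missing
    {"age": 52, "credit_score": 745},
]

violations = check_constraints(rows, constraints)
for violation in violations:
    print(violation)
```

With this batch, age fails the type check and credit_score fails the completeness check, which matches the kinds of violations the monitoring report lists.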

Step 5: Schedule Monitoring

from sagemaker.model_monitor import CronExpressionGenerator

monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-data-quality-schedule",
    endpoint_input=predictor.endpoint_name,
    output_s3_uri=f"s3://{bucket}/{prefix}/monitoring-reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,   # publish metrics for alarms
)

Each hour, SageMaker:

Reads captured data from S3
Computes current statistics
Compares against baseline
Publishes violations report + CloudWatch metrics

Step 6: View Violations

import time

executions = monitor.list_executions()
while len(executions) == 0:
    print("Waiting for first monitoring execution...")
    time.sleep(120)
    executions = monitor.list_executions()

latest = executions[-1]
latest.wait(logs=False)
print(f"Status: {latest.describe()['ProcessingJobStatus']}")

# Check for violations
violations = monitor.latest_monitoring_constraint_violations()
if violations and violations.body_dict.get("violations"):
    vdf = pd.json_normalize(violations.body_dict["violations"])
    print(vdf[["feature_name", "constraint_check_type", "description"]])
else:
    print("No violations — production data matches training baseline")

Example violations:

feature_name   constraint_check_type    description
income         baseline_drift_check     Mean shifted from 67500 to 82000 (>20%)
payment_type   data_type_check          Found new category "crypto" not in baseline
email          completeness_check       Completeness dropped from 1.0 to 0.85

Each of these tells you something specific went wrong:

Income drift: Customers are wealthier now — model trained on different population
New category: A payment type the model never saw — predictions are unreliable for these
Missing data: 15% of emails are null — if email is a feature, predictions are degraded
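The arithmetic behind the income row can be checked by hand. The sketch below computes the relative shift in the mean for the example numbers; treat it as a simplification, since the managed container's drift check is based on a distribution distance rather than a raw mean comparison. relative_mean_shift is a hypothetical helper.

```python
def relative_mean_shift(baseline_mean, current_mean):
    """Relative change of the mean versus the baseline (0.2 means 20%)."""
    return abs(current_mean - baseline_mean) / abs(baseline_mean)


shift = relative_mean_shift(67500, 82000)
exceeds_threshold = shift > 0.2

print(f"{shift:.1%}")      # 21.5%
print(exceeds_threshold)   # True
```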

Lab 2: Model Quality Monitor — When You Have Ground Truth

What Is This Lab About?

Data Quality catches input drift (features changed). But what if features look fine, yet the model is still giving wrong answers? Maybe the relationship between features and outcome changed (concept drift). For this, you need Model Quality Monitor — and it requires ground truth labels.

The Ground Truth Problem

TIME T=0 (prediction):
  Customer 1234 → Model predicts: 73% churn probability → "Will churn"
  
  You know the prediction. You DON'T know if it's right.

TIME T=30 days (outcome):
  Customer 1234 → Actually churned: YES (ground truth = 1)
  
  NOW you can evaluate: prediction was correct.
  
TIME T=60 days (monitoring):
  Model Quality Monitor merges:
    prediction (from Data Capture) + actual outcome (from you)
    → Computes accuracy, precision, recall, F1, AUC
    → Compares against training baseline
    → Flags if metrics degraded

The key insight: ground truth arrives LATER (days, weeks, months after the prediction). You must upload it yourself. SageMaker cannot infer it.
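The merge step can be pictured as a join on eventId followed by standard classification metrics. Below is a minimal pure-Python sketch of that idea, not the monitor's implementation. The 0.5 threshold and the event IDs other than abcd-1234 and efgh-5678 are assumptions made for illustration.

```python
def merge_and_score(predictions, ground_truth, threshold=0.5):
    """Join predictions with labels on eventId and compute binary metrics."""
    tp = fp = tn = fn = 0
    for event_id, probability in predictions.items():
        if event_id not in ground_truth:
            continue  # outcome has not arrived yet, so skip this prediction
        predicted = 1 if probability >= threshold else 0
        actual = ground_truth[event_id]
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    matched = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "matched": matched,
        "accuracy": (tp + tn) / matched if matched else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Churn probabilities keyed by eventId (from Data Capture).
predictions = {"abcd-1234": 0.73, "efgh-5678": 0.20,
               "ijkl-9012": 0.65, "mnop-3456": 0.90}
# Outcomes uploaded later; mnop-3456 has no label yet.
ground_truth = {"abcd-1234": 1, "efgh-5678": 0, "ijkl-9012": 0}

metrics = merge_and_score(predictions, ground_truth)
print(metrics)
```

Predictions without a label are simply left out of the metrics, which is why delayed ground truth limits how fresh the model quality numbers can be.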

Upload Ground Truth

Match each prediction using the eventId from Data Capture.

import json

ground_truth_records = [
    {"groundTruthData": {"data": "1", "encoding": "CSV"},
     "eventMetadata": {"eventId": "abcd-1234"}},
    {"groundTruthData": {"data": "0", "encoding": "CSV"},
     "eventMetadata": {"eventId": "efgh-5678"}},
]

gt_body = "\n".join(json.dumps(r) for r in ground_truth_records)
s3 = boto3.client("s3")
s3.put_object(
    Bucket=bucket,
    Key=f"{prefix}/ground-truth/2026/04/27/10/gt.jsonl",
    Body=gt_body,
)
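In practice the labels come from a database or event log rather than a hand-written list. The sketch below builds the same JSON Lines body and an hour-partitioned S3 key from a mapping of eventId to label. ground_truth_line and ground_truth_key are hypothetical helpers, and the key layout simply mirrors the yyyy/mm/dd/hh path used above.

```python
import json
from datetime import datetime, timezone


def ground_truth_line(event_id, label):
    """Serialize one label in the record shape used above."""
    return json.dumps({
        "groundTruthData": {"data": str(label), "encoding": "CSV"},
        "eventMetadata": {"eventId": event_id},
    })


def ground_truth_key(prefix, collected_at):
    """Build the hour-partitioned object key for a batch of labels."""
    return f"{prefix}/ground-truth/{collected_at:%Y/%m/%d/%H}/gt.jsonl"


labels = {"abcd-1234": 1, "efgh-5678": 0}
body = "\n".join(ground_truth_line(e, y) for e, y in labels.items())
key = ground_truth_key(
    "model-monitor-lab", datetime(2026, 4, 27, 10, tzinfo=timezone.utc)
)

print(key)
print(body)
# With AWS credentials configured, the upload itself would be:
#   boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
```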

Schedule Model Quality Monitoring

from sagemaker.model_monitor import (
    CronExpressionGenerator,
    EndpointInput,
    ModelQualityMonitor,
)

mq_monitor = ModelQualityMonitor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge",
    volume_size_in_gb=20, sagemaker_session=session,
)

mq_monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-model-quality-schedule",
    endpoint_input=EndpointInput(
        endpoint_name="monitored-churn-endpoint",
        destination="/opt/ml/processing/input_data",  # where captured data is mounted in the job
        probability_attribute="0",            # column index of the predicted probability
        probability_threshold_attribute=0.5,  # cutoff for the positive class; tune for your model
    ),
    ground_truth_input=f"s3://{bucket}/{prefix}/ground-truth",
    problem_type="BinaryClassification",
    output_s3_uri=f"s3://{bucket}/{prefix}/model-quality-reports",
    schedule_cron_expression=CronExpressionGenerator.daily(),
)

Exam Decision: Which Monitor When?

I Notice…	Monitor Type	Needs Ground Truth?
Feature distributions shifted	Data Quality	NO
Model accuracy dropped	Model Quality	YES
Predictions unfair to a group	Bias Drift	YES (the bias job merges predictions with ground truth labels)
Different features driving predictions	Explainability Drift	NO
No data appearing in monitor	Check Data Capture is enabled	N/A

Lab 3: Automated Retraining — The Full Loop

What Is This Lab About?

Detecting drift is pointless if nobody acts on it. The real value is an automated pipeline that detects drift, retrains the model, validates it, and deploys — with human approval as the only manual step.

The Complete Automated Flow

┌──────────────────────────────────────────────────────────────┐
│                                                              │
│  PRODUCTION ENDPOINT                                         │
│       │                                                      │
│       ▼                                                      │
│  Model Monitor (hourly)                                      │
│  "Income mean shifted 20%"                                   │
│       │                                                      │
│       ▼                                                      │
│  CloudWatch Metric published                                 │
│  sagemaker/Endpoints/data-metrics/feature_baseline_drift     │
│       │                                                      │
│       ▼                                                      │
│  CloudWatch ALARM fires (threshold > 0.2)                    │
│       │                                                      │
│       ├──────▶ SNS → Email: "Data drift detected on income"  │
│       │                                                      │
│       ▼                                                      │
│  EventBridge RULE matches alarm state change                 │
│       │                                                      │
│       ▼                                                      │
│  LAMBDA FUNCTION triggered                                   │
│  → Calls sagemaker.start_pipeline_execution("MLOpsPipeline") │
│       │                                                      │
│       ▼                                                      │
│  SageMaker PIPELINE runs:                                    │
│    Preprocess new data                                       │
│    → Train new model                                         │
│    → Evaluate on test set                                    │
│    → IF better: Register in Model Registry                   │
│                 (status: PendingManualApproval)               │
│       │                                                      │
│       ▼                                                      │
│  DATA SCIENTIST reviews model metrics                        │
│  → Approves in Model Registry                                │
│       │                                                      │
│       ▼                                                      │
│  CI/CD DEPLOYS approved model to endpoint                    │
│  (canary deployment: 5% → monitor → 100%)                    │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Step 1: Create CloudWatch Alarm

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="income-drift-alarm",
    Namespace="aws/sagemaker/Endpoints/data-metrics",
    MetricName="feature_baseline_drift_income",
    Dimensions=[
        {"Name": "Endpoint", "Value": "monitored-churn-endpoint"},
        {"Name": "MonitoringSchedule", "Value": "churn-data-quality-schedule"},
    ],
    Statistic="Average",
    Period=3600,           # check every hour
    EvaluationPeriods=1,   # fire after 1 breach
    Threshold=0.2,         # 20% drift threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456:drift-alerts"],
)
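A real model has many features, so it can help to generate the alarm definition rather than copy it per feature. The sketch below is one way to do that under the same namespace, metric naming, and dimensions used above. drift_alarm_params is a hypothetical helper, and the age and credit_score alarms are illustrative additions.

```python
def drift_alarm_params(feature, endpoint, schedule, topic_arn, threshold=0.2):
    """Build put_metric_alarm keyword arguments for one feature's drift metric."""
    return {
        "AlarmName": f"{feature}-drift-alarm",
        "Namespace": "aws/sagemaker/Endpoints/data-metrics",
        "MetricName": f"feature_baseline_drift_{feature}",
        "Dimensions": [
            {"Name": "Endpoint", "Value": endpoint},
            {"Name": "MonitoringSchedule", "Value": schedule},
        ],
        "Statistic": "Average",
        "Period": 3600,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


alarms = [
    drift_alarm_params(
        feature,
        endpoint="monitored-churn-endpoint",
        schedule="churn-data-quality-schedule",
        topic_arn="arn:aws:sns:us-east-1:123456:drift-alerts",
    )
    for feature in ["income", "age", "credit_score"]
]

for params in alarms:
    print(params["AlarmName"], params["MetricName"])
    # With credentials configured: cloudwatch.put_metric_alarm(**params)
```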

Step 2: EventBridge Rule → Lambda

events = boto3.client("events")

events.put_rule(
    Name="drift-triggers-retraining",
    EventPattern=json.dumps({
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": ["income-drift-alarm"],
            "state": {"value": ["ALARM"]},
        },
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="drift-triggers-retraining",
    Targets=[{
        "Id": "retrain-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456:function:trigger-retraining",
    }],
)
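One step that is easy to miss: adding the Lambda as a target does not by itself let EventBridge invoke it. The function also needs a resource-based permission for the events.amazonaws.com principal. The sketch below builds the arguments for the Lambda add_permission call; the StatementId and the rule ARN are assumptions (put_rule returns the real ARN in its RuleArn field), and invoke_permission_params is a hypothetical helper.

```python
def invoke_permission_params(function_name, rule_arn):
    """Build add_permission arguments that let one EventBridge rule invoke a Lambda."""
    return {
        "FunctionName": function_name,
        "StatementId": "allow-eventbridge-drift-rule",  # any unique id works
        "Action": "lambda:InvokeFunction",
        "Principal": "events.amazonaws.com",
        "SourceArn": rule_arn,  # restricts the grant to this rule only
    }


params = invoke_permission_params(
    "trigger-retraining",
    "arn:aws:events:us-east-1:123456:rule/drift-triggers-retraining",
)
print(params)
# With credentials configured:
#   boto3.client("lambda").add_permission(**params)
```

Scoping SourceArn to the single rule keeps the grant in line with the least-privilege approach covered in Lab 4.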

Step 3: Lambda Starts the Pipeline

# Lambda function: trigger-retraining
import boto3

def lambda_handler(event, context):
    sm = boto3.client("sagemaker")
    response = sm.start_pipeline_execution(
        PipelineName="MLOpsPipeline",
        PipelineParameters=[
            {"Name": "ModelApprovalStatus", "Value": "PendingManualApproval"},
        ],
        PipelineExecutionDescription="Auto-triggered by income drift alarm",
    )
    return {"executionArn": response["PipelineExecutionArn"]}

The pipeline handles everything else: preprocess, train, evaluate, register. The data scientist only needs to approve in Model Registry.
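The handler above starts a pipeline run for any event it receives. A small guard that inspects the event first makes the function safer to test and to reuse. This is a sketch assuming the event fields match the EventBridge pattern from Step 2; should_retrain is a hypothetical helper.

```python
def should_retrain(event, expected_alarm="income-drift-alarm"):
    """Return True only for the drift alarm entering the ALARM state."""
    detail = event.get("detail", {})
    return (
        event.get("detail-type") == "CloudWatch Alarm State Change"
        and detail.get("alarmName") == expected_alarm
        and detail.get("state", {}).get("value") == "ALARM"
    )


# Sample events shaped like the rule's pattern (fields trimmed).
alarm_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "income-drift-alarm", "state": {"value": "ALARM"}},
}
recovered_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "income-drift-alarm", "state": {"value": "OK"}},
}

print(should_retrain(alarm_event))      # True
print(should_retrain(recovered_event))  # False
```

Inside lambda_handler, calling should_retrain(event) before start_pipeline_execution would skip anything that is not the expected alarm, such as a manual test invocation.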

Lab 4: Securing the ML Pipeline

VPC Security Architecture

What Is This Lab About?

ML pipelines handle sensitive data (customer PII, financial records, health information). Securing them isn’t optional — it’s a compliance requirement (HIPAA, PCI-DSS, GDPR). This lab covers the three pillars: encryption, access control, and network isolation.

The Three Pillars of ML Security

┌─────────────────────────────────────────────────────────────┐
│                  ML Security Architecture                    │
│                                                             │
│  PILLAR 1: ENCRYPTION                                       │
│  ┌──────────────────────────────────────────────────┐       │
│  │ At Rest:                                         │       │
│  │   S3 training data    → SSE-KMS                  │       │
│  │   EBS instance volumes → KMS                     │       │
│  │   Model artifacts     → KMS                      │       │
│  │   Feature Store       → KMS                      │       │
│  │                                                  │       │
│  │ In Transit:                                      │       │
│  │   Client → Endpoint   → TLS 1.2 (automatic)     │       │
│  │   Instance ↔ Instance → Inter-container TLS      │       │
│  │                         (opt-in, +10% overhead)  │       │
│  └──────────────────────────────────────────────────┘       │
│                                                             │
│  PILLAR 2: ACCESS CONTROL                                   │
│  ┌──────────────────────────────────────────────────┐       │
│  │ IAM Roles:                                       │       │
│  │   SageMaker Execution Role → minimum S3/ECR/KMS  │       │
│  │   Data Scientist Role     → can train, NOT deploy│       │
│  │   MLOps Engineer Role     → can deploy, NOT train│       │
│  │                                                  │       │
│  │ Principle of Least Privilege:                     │       │
│  │   "Grant ONLY the permissions needed for the task"│       │
│  └──────────────────────────────────────────────────┘       │
│                                                             │
│  PILLAR 3: NETWORK ISOLATION                                │
│  ┌──────────────────────────────────────────────────┐       │
│  │ VPC + Private Subnets:                           │       │
│  │   Training instances → no public IP              │       │
│  │   Endpoints          → no public IP              │       │
│  │                                                  │       │
│  │ VPC Endpoints (PrivateLink):                     │       │
│  │   S3, ECR, SageMaker API, CloudWatch Logs        │       │
│  │   → Traffic stays within AWS, never hits internet│       │
│  │                                                  │       │
│  │ Network Isolation Mode:                          │       │
│  │   enable_network_isolation=True                  │       │
│  │   → Container has ZERO internet access           │       │
│  │   → Cannot pip install, call APIs, exfiltrate    │       │
│  └──────────────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────────┘

Encryption: KMS for Everything

KMS (Key Management Service) manages encryption keys. Customer Managed Keys (CMK) give you full control: you set rotation policy, who can use the key, and audit all usage via CloudTrail.

# Training job with full encryption
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/{prefix}/output",
    
    # Encryption
    output_kms_key="arn:aws:kms:us-east-1:123456:key/my-key",   # model artifacts
    volume_kms_key="arn:aws:kms:us-east-1:123456:key/my-key",   # instance disk
    encrypt_inter_container_traffic=True,   # TLS between distributed training nodes
    
    sagemaker_session=session,
)

A HIPAA-aligned SageMaker setup typically includes ALL of these:

KMS encryption at rest (S3 + EBS)
TLS in transit (automatic for endpoints)
Inter-container encryption (must opt-in)
VPC with private subnets
CloudTrail audit logging

Access Control: Least-Privilege IAM

Data Scientist Role — can train but NOT deploy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateTrainingJob",
        "sagemaker:CreateProcessingJob",
        "sagemaker:CreateHyperParameterTuningJob",
        "sagemaker:Describe*",
        "sagemaker:List*"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Deny",
      "Action": [
        "sagemaker:CreateEndpoint",
        "sagemaker:UpdateEndpoint",
        "sagemaker:DeleteEndpoint"
      ],
      "Resource": "*"
    }
  ]
}
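Why include the explicit Deny when CreateEndpoint is not in the Allow list anyway? Because an explicit Deny wins even if another attached policy grants the action. The sketch below is a heavily simplified model of that evaluation order, not the real IAM engine: it ignores resources, conditions, and case-insensitive matching. broad_team_policy is a hypothetical second policy.

```python
from fnmatch import fnmatchcase

data_scientist_policy = [
    {"Effect": "Allow", "Action": [
        "sagemaker:CreateTrainingJob",
        "sagemaker:CreateProcessingJob",
        "sagemaker:CreateHyperParameterTuningJob",
        "sagemaker:Describe*",
        "sagemaker:List*",
    ]},
    {"Effect": "Deny", "Action": [
        "sagemaker:CreateEndpoint",
        "sagemaker:UpdateEndpoint",
        "sagemaker:DeleteEndpoint",
    ]},
]

# Hypothetical broad policy attached to the same role by another team.
broad_team_policy = [{"Effect": "Allow", "Action": ["sagemaker:*"]}]


def is_allowed(action, statements):
    """Explicit Deny wins; otherwise at least one Allow must match."""
    allowed = False
    for statement in statements:
        matches = any(fnmatchcase(action, p) for p in statement["Action"])
        if matches and statement["Effect"] == "Deny":
            return False
        if matches:
            allowed = True
    return allowed


combined = data_scientist_policy + broad_team_policy
print(is_allowed("sagemaker:CreateTrainingJob", data_scientist_policy))  # True
print(is_allowed("sagemaker:CreateEndpoint", combined))                  # False
```

Even with sagemaker:* granted elsewhere, the endpoint actions stay blocked, which is the point of the Deny statement.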

SageMaker Execution Role — minimum permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::my-ml-bucket", "arn:aws:s3:::my-ml-bucket/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["ecr:GetDownloadUrlForLayer", "ecr:BatchGetImage", "ecr:BatchCheckLayerAvailability", "ecr:GetAuthorizationToken"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogStream", "logs:PutLogEvents", "logs:CreateLogGroup"],
      "Resource": "arn:aws:logs:*:*:log-group:/aws/sagemaker/*"
    },
    {
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
      "Resource": "arn:aws:kms:us-east-1:123456:key/my-key"
    }
  ]
}

Network Isolation: VPC + Endpoints

The problem: SageMaker instances in a VPC need to reach S3 (for data), ECR (for containers), and CloudWatch (for logging). Without VPC Endpoints, they’d need a NAT Gateway (expensive, traffic goes through the internet).

The solution: VPC Endpoints create private connections from your VPC to AWS services. Traffic never leaves the AWS network.

┌────────────────────── VPC ──────────────────────────┐
│                                                      │
│  ┌─── Private Subnet ────────────────────────────┐  │
│  │                                                │  │
│  │   SageMaker Training / Endpoint instances      │  │
│  │   (no public IP, no internet access)           │  │
│  │                                                │  │
│  └────────────┬───────────────────────────────────┘  │
│               │                                      │
│  ┌────────────▼───────────────────────────────────┐  │
│  │  VPC Endpoints (PrivateLink)                   │  │
│  │                                                │  │
│  │  ● S3 Gateway Endpoint (free)                  │  │
│  │    → Training reads/writes data                │  │
│  │                                                │  │
│  │  ● SageMaker API Interface Endpoint            │  │
│  │    → API calls to create jobs                  │  │
│  │                                                │  │
│  │  ● SageMaker Runtime Interface Endpoint        │  │
│  │    → Invoke inference endpoint                 │  │
│  │                                                │  │
│  │  ● ECR Interface Endpoints (ecr.api + ecr.dkr) │  │
│  │    → Pull training/inference containers        │  │
│  │                                                │  │
│  │  ● CloudWatch Logs Interface Endpoint          │  │
│  │    → Write training/inference logs              │  │
│  └────────────────────────────────────────────────┘  │
│                                                      │
│  NO NAT Gateway needed. NO internet traffic.         │
└──────────────────────────────────────────────────────┘

Common VPC troubleshooting (heavily tested):

Error Message	Missing Endpoint	Fix
“Unable to download data from S3”	S3 Gateway	Add S3 Gateway Endpoint to route table
“Cannot pull Docker container”	ECR	Add BOTH `ecr.api` and `ecr.dkr` Interface Endpoints
“Cannot write logs”	CloudWatch Logs	Add `logs` Interface Endpoint
“Training job hangs at Downloading”	S3 or ECR	Check Security Group allows outbound to endpoints
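When working through this table it helps to have the full list of endpoint service names for your region in one place. The sketch below generates them using the standard com.amazonaws.<region>.<service> naming; required_vpc_endpoints is a hypothetical helper, and the region is only an example.

```python
def required_vpc_endpoints(region):
    """List the VPC endpoint service names from the diagram, grouped by type."""
    interface_services = (
        "sagemaker.api", "sagemaker.runtime", "ecr.api", "ecr.dkr", "logs",
    )
    return {
        "Gateway": [f"com.amazonaws.{region}.s3"],
        "Interface": [
            f"com.amazonaws.{region}.{service}" for service in interface_services
        ],
    }


endpoints = required_vpc_endpoints("us-east-1")
for endpoint_type, names in endpoints.items():
    for name in names:
        print(endpoint_type, name)
```

Comparing this list against the endpoints that actually exist in the VPC is a quick first check when a job fails with one of the errors above.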

Network Isolation — maximum lockdown:

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    subnets=["subnet-abc123"],
    security_group_ids=["sg-0123456789"],
    enable_network_isolation=True,   # ZERO internet access
    sagemaker_session=session,
)

With enable_network_isolation=True, the container:

Cannot pip install anything
Cannot make HTTP calls to any API
Cannot exfiltrate data anywhere
Can ONLY access S3 through the defined input channels
All dependencies MUST be baked into the container image

Use this for: HIPAA, PCI-DSS, handling classified data, preventing data exfiltration.

Lab 5: Cost Optimization Patterns

What Is This Lab About?

ML workloads are expensive — GPU instances, always-on endpoints, large training runs. But most teams overspend by 40-70% because they don’t use the right cost optimization techniques. The exam tests your ability to recommend the cheapest option that still meets requirements.

The Cost Optimization Decision Tree

TRAINING COSTS:
  "How to reduce training cost?"
      │
      ├── Can tolerate interruptions? ── YES ── Spot Training (up to 90% savings)
      │                                          + Checkpointing required
      │
      ├── Running same pipeline daily? ── YES ── Step Caching
      │                                          (skip unchanged steps)
      │
      ├── Using GPU for XGBoost? ── YES ── Try CPU first (ml.m5)
      │                                    Small tabular jobs often gain little from GPU
      │
      └── Training frequently? ── YES ── Warm Pools (reduce startup time)
                                         or SageMaker Savings Plans (up to 64%)

INFERENCE COSTS:
  "How to reduce inference cost?"
      │
      ├── Sporadic traffic (<100/day)? ── YES ── Serverless (pay per call)
      │
      ├── Irregular + large payloads? ── YES ── Async (scale to 0 + GPU)
      │
      ├── Many small models? ── YES ── Multi-Model Endpoint (share instance)
      │
      ├── Scheduled bulk scoring? ── YES ── Batch Transform (no endpoint)
      │
      ├── Steady, predictable load? ── YES ── Savings Plans (up to 64%)
      │                                       or Inferentia chips (up to 70%)
      │
      └── Any deployed model? ── YES ── SageMaker Neo (compile for hardware)
                                        Up to 2x faster → use smaller instance

Key Numbers to Remember

Optimization	Savings	Trade-off
Spot Training	Up to 90%	Interruptions — need checkpointing
Serverless Inference	Pay per call	Cold starts, CPU only, 4 MB max
Async (scale to 0)	Pay per use	Minutes latency, queue-based
Multi-Model Endpoint	~80% vs individual endpoints	Cold model load time
Batch Transform	Pay per job only	No real-time, hours for results
Savings Plans	Up to 64%	1 or 3-year commitment
Inferentia (ml.inf2)	Up to 70% vs GPU	Limited framework support
Trainium (ml.trn1)	Up to 50% vs GPU	Training only
SageMaker Neo	Up to 2x throughput	Compilation step required
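"Up to 90%" is a ceiling, so it is worth computing what a given job actually saved. For managed spot training, the saving is derived from the two durations reported for the job, TrainingTimeInSeconds and BillableTimeInSeconds. The sketch below applies that formula to made-up numbers; managed_spot_savings is a hypothetical helper.

```python
def managed_spot_savings(training_seconds, billable_seconds):
    """Percent saved: (1 - billable time / training time) * 100."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


# Hypothetical job: ran for one hour, billed for 12 minutes.
savings = managed_spot_savings(training_seconds=3600, billable_seconds=720)
print(f"{savings:.0f}% saved")  # 80% saved
```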

Domain 4 Lab Summary

Lab	Service	You Learned
1	Model Monitor	Data Capture → Baseline → Schedule → Violations — the complete lifecycle
2	Model Quality	Ground truth merging, why it NEEDS actual outcomes, delayed feedback problem
3	Automated Retraining	Monitor → CloudWatch → EventBridge → Lambda → Pipeline — full automation loop
4	Security	KMS encryption, IAM least privilege, VPC Endpoints, Network Isolation, HIPAA checklist
5	Cost Optimization	Decision trees for training and inference cost reduction, key savings numbers
