Lab: Domain 4 — Monitoring & Security Hands-On

Models degrade in production. Data drifts, user behavior changes, bias emerges. This domain (24% of the exam) is about catching problems before they hurt the business — and securing the entire ML pipeline.


Lab 1: SageMaker Model Monitor — The Complete Lifecycle

[Figure: Model Monitor Pipeline]

What Is This Lab About?

You deployed a customer churn model 3 months ago. It worked great — 95% accuracy. But slowly, customer demographics shifted: average income rose, new product categories appeared, a competitor entered the market. Your model never saw this new reality. Accuracy silently dropped to 72%, and nobody noticed until the VP asked why retention campaigns stopped working.

Model Monitor catches this automatically. It compares production data against your training baseline and alerts you when things drift.

The Four Types of Monitoring — When Each Fires

┌─────────────────────────────────────────────────────────────┐
│           The Four Monitors and What They Catch              │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ 1. DATA QUALITY MONITOR                              │   │
│  │    "Has the INPUT data changed?"                      │   │
│  │                                                      │   │
│  │    Catches:                                          │   │
│  │    • Feature "income" mean shifted $50K → $80K        │   │
│  │    • New category "crypto" in "payment_method"        │   │
│  │    • "email" column is now 15% null (was 0%)          │   │
│  │    • Data type changed (integer → string)             │   │
│  │                                                      │   │
│  │    Needs ground truth? NO — just compares input stats │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ 2. MODEL QUALITY MONITOR                             │   │
│  │    "Is the model getting WRONG answers?"              │   │
│  │                                                      │   │
│  │    Catches:                                          │   │
│  │    • Accuracy dropped from 95% to 72%                 │   │
│  │    • F1 score degraded below threshold                │   │
│  │    • AUC-ROC declined significantly                   │   │
│  │                                                      │   │
│  │    Needs ground truth? YES — must know actual outcomes│   │
│  │    (this is the KEY exam distinction)                 │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ 3. BIAS DRIFT MONITOR                                │   │
│  │    "Is the model becoming UNFAIR?"                    │   │
│  │                                                      │   │
│  │    Catches:                                          │   │
│  │    • Approval rate for women dropped from 75% → 55%   │   │
│  │    • Disparate Impact fell below 0.8 (legal threshold)│   │
│  │    • New demographic group underrepresented           │   │
│  │                                                      │   │
│  │    Uses: SageMaker Clarify bias baseline              │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ 4. EXPLAINABILITY DRIFT MONITOR                      │   │
│  │    "Are DIFFERENT features driving predictions now?"   │   │
│  │                                                      │   │
│  │    Catches:                                          │   │
│  │    • "credit_score" was #1 feature, now "zip_code" is │   │
│  │    • SHAP values shifted significantly                │   │
│  │    • Model logic has changed even if accuracy hasn't  │   │
│  │                                                      │   │
│  │    Uses: SageMaker Clarify SHAP baseline              │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

The Complete Setup — 3 Steps Before Monitoring Works

STEP 1               STEP 2                   STEP 3
Enable Data          Create Baseline           Schedule
Capture              (from training data)      Monitoring
    │                     │                        │
    ▼                     ▼                        ▼
Every request ──▶  "What does NORMAL      Compare captured
and response       data look like?"        data against
logged to S3       Generate statistics:    baseline every
                   mean, std, min, max,    hour/day.
                   data types,             Flag violations.
                   completeness            Alert via
                                          CloudWatch.

Exam trap: “Model Monitor detects no drift” — check if Data Capture is enabled. Without it, there’s literally no data to monitor. This is tested frequently.

What’s ACTUALLY Running Behind Model Monitor?

Model Monitor looks simple on the surface, but there’s real infrastructure doing the work:

Data Capture — it’s an S3 logging layer on the endpoint:

  • When you enable Data Capture, the SageMaker endpoint runtime starts writing JSON Lines files to your specified S3 path
  • Each file contains captured requests and responses, serialized with metadata (event ID, timestamp, content type)
  • Files are organized by: s3://bucket/prefix/endpoint-name/variant/yyyy/mm/dd/hh/
  • The sampling percentage controls how many requests are logged (100% = everything, 10% = sample)
  • There is NO separate service — it’s built into the endpoint runtime itself
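
To make the folder layout above concrete, here is a minimal sketch that builds the S3 prefix for one hour of captured traffic. The bucket name is hypothetical, and the variant name `AllTraffic` is assumed to be the default that the SageMaker Python SDK assigns when you do not name the variant yourself.

```python
from datetime import datetime, timezone


def capture_prefix(base_uri: str, endpoint: str, variant: str, hour: datetime) -> str:
    """Build the S3 prefix that holds one hour of captured requests."""
    # Layout described above: <base>/<endpoint-name>/<variant>/yyyy/mm/dd/hh/
    return f"{base_uri.rstrip('/')}/{endpoint}/{variant}/{hour:%Y/%m/%d/%H}/"


example = capture_prefix(
    "s3://my-bucket/model-monitor-lab/datacapture",  # hypothetical bucket and prefix
    "monitored-churn-endpoint",
    "AllTraffic",  # assumed default variant name
    datetime(2026, 4, 27, 10, tzinfo=timezone.utc),
)
print(example)
```

Listing that prefix is a quick way to confirm that Data Capture is actually writing files before you schedule any monitoring.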

Baseline Job — it’s a SageMaker Processing Job:

  • When you call suggest_baseline(), SageMaker launches a Processing Job using a special Model Monitor container (maintained by AWS)
  • This container reads your training CSV from S3, computes statistics (mean, std, distribution per feature), and generates two JSON files:
    • statistics.json — numerical profile of every feature
    • constraints.json — rules like “age must be Integral, completeness must be > 0.95”
  • These files are your “reference” — what NORMAL looks like

Scheduled Monitoring — it’s a recurring SageMaker Processing Job:

  • create_monitoring_schedule() registers a monitoring schedule with SageMaker, defined by a cron expression; SageMaker runs the schedule for you
  • Every hour/day (per your schedule), it triggers a Processing Job using the same Model Monitor container
  • This job reads the captured data from S3, computes the same statistics, and compares against your baseline
  • If any constraint is violated (e.g., mean shifted >20%, new data type appeared), it writes a violations report to S3
  • It also publishes metrics to CloudWatch — one metric per feature per check type
  • These CloudWatch metrics are what trigger alarms for automated remediation

The infrastructure chain:

Endpoint Runtime (captures data)
    → S3 (stores captured JSON Lines)
    → Monitoring schedule (cron expression, managed by SageMaker)
    → SageMaker Processing Job (runs Model Monitor container)
    → S3 (writes violation reports)
    → CloudWatch Metrics (publishes drift metrics)
    → CloudWatch Alarm (fires if threshold exceeded)
    → EventBridge (routes alarm event)
    → Lambda (starts retraining pipeline)

Every piece is a real AWS service you’re already familiar with. Model Monitor just orchestrates them.

Step 1: Deploy Model with Data Capture

Data Capture logs requests and responses to S3 as JSON Lines files. You choose what to capture and what percentage of traffic to sample.

import sagemaker
import boto3
from sagemaker.model_monitor import DataCaptureConfig

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = session.default_bucket()
prefix = "model-monitor-lab"

data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,             # capture everything (use lower % in production)
    destination_s3_uri=f"s3://{bucket}/{prefix}/datacapture",
    capture_options=["Input", "Output"], # log both request and response
    csv_content_types=["text/csv"],
)

# `model` is assumed to be a SageMaker Model (or trained estimator) created earlier
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="monitored-churn-endpoint",
    data_capture_config=data_capture_config,
)

What a captured record looks like in S3:

{
  "captureData": {
    "endpointInput": {
      "data": "34,75000,720,45",
      "encoding": "CSV"
    },
    "endpointOutput": {
      "data": "0.73",
      "encoding": "CSV"
    }
  },
  "eventMetadata": {
    "eventId": "abcd-1234",
    "inferenceTime": "2026-04-27T10:15:30Z"
  }
}

The eventId is crucial — it’s how Model Quality Monitor later joins predictions with ground truth labels.
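
As a rough illustration of how you might read these files yourself, the sketch below parses one JSON Lines record shaped like the simplified example above. Real capture files carry additional metadata fields, so treat the field access here as an assumption based only on the sample record.

```python
import json

# One line from a capture file, using the simplified record shown above.
line = (
    '{"captureData": {"endpointInput": {"data": "34,75000,720,45", "encoding": "CSV"}, '
    '"endpointOutput": {"data": "0.73", "encoding": "CSV"}}, '
    '"eventMetadata": {"eventId": "abcd-1234", "inferenceTime": "2026-04-27T10:15:30Z"}}'
)


def parse_capture_line(raw: str) -> dict:
    """Extract the join key, input features, and prediction from one record."""
    record = json.loads(raw)
    capture = record["captureData"]
    return {
        "event_id": record["eventMetadata"]["eventId"],
        "features": [float(v) for v in capture["endpointInput"]["data"].split(",")],
        "prediction": float(capture["endpointOutput"]["data"]),
    }


parsed = parse_capture_line(line)
print(parsed)
```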

Step 2: Generate Traffic

Send enough requests so there’s data to analyze. In production, this happens naturally from real users.

from sagemaker.serializers import CSVSerializer
import time

predictor.serializer = CSVSerializer()

with open("test_data/sample_requests.csv", "r") as f:
    for i, row in enumerate(f):
        predictor.predict(row.strip())
        if i % 50 == 0:
            print(f"Sent {i} requests...")
        time.sleep(0.5)

print("Done. Waiting 60s for S3 sync...")
time.sleep(60)

Step 3: Create Baseline from Training Data

The baseline answers: “what does NORMAL data look like?” SageMaker computes statistics (mean, std, min, max, distribution) and constraints (data types, completeness, allowed ranges) from your training dataset.

from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

monitor.suggest_baseline(
    baseline_dataset=f"s3://{bucket}/{prefix}/training-data-with-header.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=f"s3://{bucket}/{prefix}/baseline",
    wait=True,
)

Step 4: Examine the Baseline

Understanding what the baseline produces helps you understand what violations mean.

import pandas as pd

baseline_job = monitor.latest_baselining_job

# STATISTICS: numerical profile of each feature
stats = baseline_job.baseline_statistics()
stats_df = pd.json_normalize(stats.body_dict["features"])
# Note: the standard deviation key in statistics.json is "std_dev"
print(stats_df[["name", "numerical_statistics.mean",
                 "numerical_statistics.std_dev",
                 "numerical_statistics.min",
                 "numerical_statistics.max"]])

Example output:

name              mean      std       min      max
age               34.2      12.1      18       72
income            67500     28000     15000    250000
credit_score      710       45        520      850
total_purchases   42        31        0        312

# CONSTRAINTS: rules that production data must follow
constraints = baseline_job.suggested_constraints()
constraints_df = pd.json_normalize(constraints.body_dict["features"])
print(constraints_df[["name", "inferred_type", "completeness"]])

Example output:

name              inferred_type    completeness
age               Integral         1.00
income            Fractional       1.00
credit_score      Integral         0.98
total_purchases   Integral         1.00

This means: age should always be an integer, always present (100% completeness). If production data suddenly has age as a string, or 15% null — that’s a violation.
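
To show what "violating a constraint" means in practice, here is a simplified sketch that checks a small batch of rows against completeness and type constraints. It is not the logic of the Model Monitor container; the helper name `check_constraints` and the sample rows are made up for illustration.

```python
def check_constraints(rows, constraints):
    """Return (feature, check_type, detail) tuples for each violated constraint."""
    violations = []
    for constraint in constraints:
        name = constraint["name"]
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        completeness = len(present) / len(values) if values else 0.0
        if completeness < constraint["completeness"]:
            detail = f"{completeness:.2f} < {constraint['completeness']:.2f}"
            violations.append((name, "completeness_check", detail))
        if constraint["inferred_type"] == "Integral":
            # bool is a subclass of int in Python, so exclude it explicitly
            if not all(isinstance(v, int) and not isinstance(v, bool) for v in present):
                violations.append((name, "data_type_check", "non-integer value found"))
    return violations


constraints = [
    {"name": "age", "inferred_type": "Integral", "completeness": 1.0},
    {"name": "credit_score", "inferred_type": "Integral", "completeness": 0.98},
]
rows = [
    {"age": 34, "credit_score": 720},
    {"age": "41", "credit_score": 698},  # age arrived as a string
    {"age": 29, "credit_score": None},   # credit_score is missing
    {"age": 52, "credit_score": 705},
]
found = check_constraints(rows, constraints)
for violation in found:
    print(violation)
```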

Step 5: Schedule Monitoring

from sagemaker.model_monitor import CronExpressionGenerator

monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-data-quality-schedule",
    endpoint_input=predictor.endpoint_name,
    output_s3_uri=f"s3://{bucket}/{prefix}/monitoring-reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
    enable_cloudwatch_metrics=True,   # publish metrics for alarms
)

Each hour, SageMaker:

  1. Reads captured data from S3
  2. Computes current statistics
  3. Compares against baseline
  4. Publishes violations report + CloudWatch metrics

Step 6: View Violations

import time

executions = monitor.list_executions()
while len(executions) == 0:
    print("Waiting for first monitoring execution...")
    time.sleep(120)
    executions = monitor.list_executions()

latest = executions[-1]
latest.wait(logs=False)
print(f"Status: {latest.describe()['ProcessingJobStatus']}")

# Check for violations
violations = monitor.latest_monitoring_constraint_violations()
if violations and violations.body_dict.get("violations"):
    vdf = pd.json_normalize(violations.body_dict["violations"])
    print(vdf[["feature_name", "constraint_check_type", "description"]])
else:
    print("No violations — production data matches training baseline")

Example violations:

feature_name   constraint_check_type    description
income         baseline_drift_check     Mean shifted from 67500 to 82000 (>20%)
payment_method categorical_values_check Found new category "crypto" not in baseline
email          completeness_check       Completeness dropped from 1.0 to 0.85

Each of these tells you something specific went wrong:

  • Income drift: Customers are wealthier now — model trained on different population
  • New category: A payment type the model never saw — predictions are unreliable for these
  • Missing data: 15% of emails are null — if email is a feature, predictions are degraded
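
The income example can be checked by hand. The sketch below only illustrates the arithmetic behind "shifted more than 20%"; it is not the drift statistic that the Model Monitor container computes internally.

```python
def relative_mean_shift(baseline_mean: float, current_mean: float) -> float:
    """Relative change of the mean compared with the baseline."""
    return abs(current_mean - baseline_mean) / abs(baseline_mean)


shift = relative_mean_shift(67500, 82000)
print(f"{shift:.1%}")  # prints 21.5%
print("violation" if shift > 0.20 else "ok")
```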

Lab 2: Model Quality Monitor — When You Have Ground Truth

What Is This Lab About?

Data Quality catches input drift (features changed). But what if features look fine, yet the model is still giving wrong answers? Maybe the relationship between features and outcome changed (concept drift). For this, you need Model Quality Monitor — and it requires ground truth labels.

The Ground Truth Problem

TIME T=0 (prediction):
  Customer 1234 → Model predicts: 73% churn probability → "Will churn"
  
  You know the prediction. You DON'T know if it's right.

TIME T=30 days (outcome):
  Customer 1234 → Actually churned: YES (ground truth = 1)
  
  NOW you can evaluate: prediction was correct.
  
TIME T=60 days (monitoring):
  Model Quality Monitor merges:
    prediction (from Data Capture) + actual outcome (from you)
    → Computes accuracy, precision, recall, F1, AUC
    → Compares against training baseline
    → Flags if metrics degraded

The key insight: ground truth arrives LATER (days, weeks, months after the prediction). You must upload it yourself. SageMaker cannot infer it.
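
The merge step can be pictured as a join on the event ID followed by ordinary classification metrics. The following sketch uses made-up predictions and labels and a hypothetical helper name; it shows the idea, not the monitor's actual implementation.

```python
def binary_metrics(predictions, ground_truth, threshold=0.5):
    """Join predictions with labels on event ID, then compute basic metrics."""
    tp = fp = tn = fn = 0
    for event_id, score in predictions.items():
        if event_id not in ground_truth:
            continue  # the label has not arrived yet, so skip this prediction
        predicted = 1 if score >= threshold else 0
        actual = ground_truth[event_id]
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    return {
        "matched": total,
        "accuracy": (tp + tn) / total if total else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }


predictions = {"abcd-1234": 0.73, "efgh-5678": 0.64, "ijkl-9012": 0.12, "mnop-3456": 0.91}
ground_truth = {"abcd-1234": 1, "efgh-5678": 0, "ijkl-9012": 0}  # one label still missing
metrics = binary_metrics(predictions, ground_truth)
print(metrics)
```

Note that the fourth prediction is ignored until its label shows up, which is why metrics computed shortly after inference can be based on very few records.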

Upload Ground Truth

Match each prediction using the eventId from Data Capture.

import json

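# Each record must reference the eventId of the captured prediction it labels
ground_truth_records = [
    {"groundTruthData": {"data": "1", "encoding": "CSV"},
     "eventMetadata": {"eventId": "abcd-1234"},
     "eventVersion": "0"},
    {"groundTruthData": {"data": "0", "encoding": "CSV"},
     "eventMetadata": {"eventId": "efgh-5678"},
     "eventVersion": "0"},
]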

gt_body = "\n".join(json.dumps(r) for r in ground_truth_records)
s3 = boto3.client("s3")
s3.put_object(
    Bucket=bucket,
    Key=f"{prefix}/ground-truth/2026/04/27/10/gt.jsonl",
    Body=gt_body,
)

Schedule Model Quality Monitoring

from sagemaker.model_monitor import (
    CronExpressionGenerator,
    EndpointInput,
    ModelQualityMonitor,
)

mq_monitor = ModelQualityMonitor(
    role=role, instance_count=1, instance_type="ml.m5.xlarge",
    volume_size_in_gb=20, sagemaker_session=session,
)

mq_monitor.create_monitoring_schedule(
    monitor_schedule_name="churn-model-quality-schedule",
    endpoint_input=EndpointInput(
        endpoint_name="monitored-churn-endpoint",
        destination="/opt/ml/processing/input_data",  # required: mount path inside the job container
        probability_attribute="0",             # column index of the predicted probability
        probability_threshold_attribute=0.5,   # threshold that turns the probability into a class
    ),
    ground_truth_input=f"s3://{bucket}/{prefix}/ground-truth",
    problem_type="BinaryClassification",
    output_s3_uri=f"s3://{bucket}/{prefix}/model-quality-reports",
    schedule_cron_expression=CronExpressionGenerator.daily(),
)

Exam Decision: Which Monitor When?

I Notice…                                 Monitor Type                    Needs Ground Truth?
Feature distributions shifted             Data Quality                    NO
Model accuracy dropped                    Model Quality                   YES
Predictions unfair to a group             Bias Drift                      Depends
Different features driving predictions    Explainability Drift            NO
No data appearing in monitor              Check Data Capture is enabled   N/A

Lab 3: Automated Retraining — The Full Loop

What Is This Lab About?

Detecting drift is pointless if nobody acts on it. The real value is an automated pipeline that detects drift, retrains the model, validates it, and deploys — with human approval as the only manual step.

The Complete Automated Flow

┌──────────────────────────────────────────────────────────────┐
│                                                              │
│  PRODUCTION ENDPOINT                                         │
│       │                                                      │
│       ▼                                                      │
│  Model Monitor (hourly)                                      │
│  "Income mean shifted 20%"                                   │
│       │                                                      │
│       ▼                                                      │
│  CloudWatch Metric published                                 │
│  aws/sagemaker/Endpoints/data-metrics/feature_baseline_drift │
│       │                                                      │
│       ▼                                                      │
│  CloudWatch ALARM fires (threshold > 0.2)                    │
│       │                                                      │
│       ├──────▶ SNS → Email: "Data drift detected on income"  │
│       │                                                      │
│       ▼                                                      │
│  EventBridge RULE matches alarm state change                 │
│       │                                                      │
│       ▼                                                      │
│  LAMBDA FUNCTION triggered                                   │
│  → Calls sagemaker.start_pipeline_execution("MLOpsPipeline") │
│       │                                                      │
│       ▼                                                      │
│  SageMaker PIPELINE runs:                                    │
│    Preprocess new data                                       │
│    → Train new model                                         │
│    → Evaluate on test set                                    │
│    → IF better: Register in Model Registry                   │
│                 (status: PendingManualApproval)               │
│       │                                                      │
│       ▼                                                      │
│  DATA SCIENTIST reviews model metrics                        │
│  → Approves in Model Registry                                │
│       │                                                      │
│       ▼                                                      │
│  CI/CD DEPLOYS approved model to endpoint                    │
│  (canary deployment: 5% → monitor → 100%)                    │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Step 1: Create CloudWatch Alarm

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="income-drift-alarm",
    Namespace="aws/sagemaker/Endpoints/data-metrics",
    MetricName="feature_baseline_drift_income",
    Dimensions=[
        {"Name": "Endpoint", "Value": "monitored-churn-endpoint"},
        {"Name": "MonitoringSchedule", "Value": "churn-data-quality-schedule"},
    ],
    Statistic="Average",
    Period=3600,           # check every hour
    EvaluationPeriods=1,   # fire after 1 breach
    Threshold=0.2,         # 20% drift threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456:drift-alerts"],
)
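
To reason about when this alarm fires, it can help to model the evaluation locally. The sketch below is a deliberately simplified approximation of a `GreaterThanThreshold` alarm: it ignores missing-data handling and the "M out of N" option, and the function name is hypothetical.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Approximate a GreaterThanThreshold alarm over the most recent periods."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    # The alarm fires only if every evaluated period breaches the threshold.
    return "ALARM" if all(value > threshold for value in recent) else "OK"


hourly_drift = [0.05, 0.12, 0.25]  # one datapoint per hourly monitoring run
print(alarm_state(hourly_drift, threshold=0.2, evaluation_periods=1))
```

With `EvaluationPeriods=1`, a single noisy hour is enough to trigger retraining; raising it to 2 or 3 trades reaction time for fewer false alarms.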

Step 2: EventBridge Rule → Lambda

events = boto3.client("events")

events.put_rule(
    Name="drift-triggers-retraining",
    EventPattern=json.dumps({
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": ["income-drift-alarm"],
            "state": {"value": ["ALARM"]},
        },
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="drift-triggers-retraining",
    Targets=[{
        "Id": "retrain-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456:function:trigger-retraining",
    }],
)
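
Before deploying the rule, you can sanity-check the event pattern locally. The matcher below implements only the small subset of EventBridge matching used here (exact values listed in arrays, nested objects); the sample event is a trimmed, assumed shape of a CloudWatch alarm state change event.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Return True if the event satisfies this simplified event pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the matching part of the event
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of accepted values
            return False
    return True


pattern = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "alarmName": ["income-drift-alarm"],
        "state": {"value": ["ALARM"]},
    },
}
sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {
        "alarmName": "income-drift-alarm",
        "state": {"value": "ALARM"},
        "previousState": {"value": "OK"},
    },
}
print(matches(pattern, sample_event))
```

Separately, the Lambda function needs a resource-based permission that lets events.amazonaws.com invoke it (for example via the Lambda AddPermission API). The snippet in this step does not show that permission, and without it the rule matches but the function is never called.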

Step 3: Lambda Starts the Pipeline

# Lambda function: trigger-retraining
import boto3

def lambda_handler(event, context):
    sm = boto3.client("sagemaker")
    response = sm.start_pipeline_execution(
        PipelineName="MLOpsPipeline",
        PipelineParameters=[
            {"Name": "ModelApprovalStatus", "Value": "PendingManualApproval"},
        ],
        PipelineExecutionDescription="Auto-triggered by income drift alarm",
    )
    return {"executionArn": response["PipelineExecutionArn"]}

The pipeline handles everything else: preprocess, train, evaluate, register. The data scientist only needs to approve in Model Registry.
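
One possible refinement, sketched here under assumptions, is to guard the handler so it only starts a pipeline on a fresh transition into ALARM, and to pass the client in so the logic can be tested without AWS. `handle_alarm_event` and `FakeSageMaker` are hypothetical names, and the ARN is a placeholder.

```python
def handle_alarm_event(event, sagemaker_client, pipeline_name="MLOpsPipeline"):
    """Start the retraining pipeline only on a new transition into ALARM."""
    detail = event.get("detail", {})
    state = detail.get("state", {}).get("value")
    previous = detail.get("previousState", {}).get("value")
    if state != "ALARM" or previous == "ALARM":
        return {"started": False, "reason": f"state={state}, previous={previous}"}
    response = sagemaker_client.start_pipeline_execution(
        PipelineName=pipeline_name,
        PipelineExecutionDescription=f"Auto-triggered by {detail.get('alarmName', 'unknown alarm')}",
    )
    return {"started": True, "executionArn": response["PipelineExecutionArn"]}


class FakeSageMaker:
    """Stand-in for boto3.client('sagemaker') that records calls (testing only)."""

    def __init__(self):
        self.calls = []

    def start_pipeline_execution(self, **kwargs):
        self.calls.append(kwargs)
        return {"PipelineExecutionArn": "arn:aws:sagemaker:us-east-1:111122223333:pipeline/placeholder"}


fake = FakeSageMaker()
alarm_event = {"detail": {"alarmName": "income-drift-alarm",
                          "state": {"value": "ALARM"},
                          "previousState": {"value": "OK"}}}
result = handle_alarm_event(alarm_event, fake)
print(result["started"], len(fake.calls))
```

In the real Lambda you would pass `boto3.client("sagemaker")` instead of the fake client.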


Lab 4: Securing the ML Pipeline

[Figure: VPC Security Architecture]

What Is This Lab About?

ML pipelines handle sensitive data (customer PII, financial records, health information). Securing them isn’t optional — it’s a compliance requirement (HIPAA, PCI-DSS, GDPR). This lab covers the three pillars: encryption, access control, and network isolation.

The Three Pillars of ML Security

┌─────────────────────────────────────────────────────────────┐
│                  ML Security Architecture                    │
│                                                             │
│  PILLAR 1: ENCRYPTION                                       │
│  ┌──────────────────────────────────────────────────┐       │
│  │ At Rest:                                         │       │
│  │   S3 training data    → SSE-KMS                  │       │
│  │   EBS instance volumes → KMS                     │       │
│  │   Model artifacts     → KMS                      │       │
│  │   Feature Store       → KMS                      │       │
│  │                                                  │       │
│  │ In Transit:                                      │       │
│  │   Client → Endpoint   → TLS 1.2 (automatic)     │       │
│  │   Instance ↔ Instance → Inter-container TLS      │       │
│  │                         (opt-in, +10% overhead)  │       │
│  └──────────────────────────────────────────────────┘       │
│                                                             │
│  PILLAR 2: ACCESS CONTROL                                   │
│  ┌──────────────────────────────────────────────────┐       │
│  │ IAM Roles:                                       │       │
│  │   SageMaker Execution Role → minimum S3/ECR/KMS  │       │
│  │   Data Scientist Role     → can train, NOT deploy│       │
│  │   MLOps Engineer Role     → can deploy, NOT train│       │
│  │                                                  │       │
│  │ Principle of Least Privilege:                     │       │
│  │   "Grant ONLY the permissions needed for the task"│       │
│  └──────────────────────────────────────────────────┘       │
│                                                             │
│  PILLAR 3: NETWORK ISOLATION                                │
│  ┌──────────────────────────────────────────────────┐       │
│  │ VPC + Private Subnets:                           │       │
│  │   Training instances → no public IP              │       │
│  │   Endpoints          → no public IP              │       │
│  │                                                  │       │
│  │ VPC Endpoints (PrivateLink):                     │       │
│  │   S3, ECR, SageMaker API, CloudWatch Logs        │       │
│  │   → Traffic stays within AWS, never hits internet│       │
│  │                                                  │       │
│  │ Network Isolation Mode:                          │       │
│  │   enable_network_isolation=True                  │       │
│  │   → Container has ZERO internet access           │       │
│  │   → Cannot pip install, call APIs, exfiltrate    │       │
│  └──────────────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────────────┘

Encryption: KMS for Everything

KMS (Key Management Service) manages encryption keys. Customer Managed Keys (CMK) give you full control: you set rotation policy, who can use the key, and audit all usage via CloudTrail.

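from sagemaker.estimator import Estimator

# Training job with full encryption
# (image_uri is assumed to point at your training container image)
estimator = Estimator(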
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/{prefix}/output",
    
    # Encryption
    output_kms_key="arn:aws:kms:us-east-1:123456:key/my-key",   # model artifacts
    volume_kms_key="arn:aws:kms:us-east-1:123456:key/my-key",   # instance disk
    encrypt_inter_container_traffic=True,   # TLS between distributed training nodes
    
    sagemaker_session=session,
)

For HIPAA workloads, expect the correct exam answer to include ALL of these:

  1. KMS encryption at rest (S3 + EBS)
  2. TLS in transit (automatic for endpoints)
  3. Inter-container encryption (must opt-in)
  4. VPC with private subnets
  5. CloudTrail audit logging
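
Part of this checklist can be turned into a pre-flight check on the keyword arguments you pass to `Estimator`. The sketch below covers only the settings that live on the estimator (KMS keys, inter-container encryption, VPC placement); CloudTrail and endpoint TLS are configured elsewhere. The helper name and the sample ARNs are hypothetical.

```python
REQUIRED_SETTINGS = {
    "output_kms_key": "KMS key for model artifacts written to S3",
    "volume_kms_key": "KMS key for the training instance volume",
    "encrypt_inter_container_traffic": "encryption between training nodes",
    "subnets": "private subnets for the training job",
    "security_group_ids": "security groups for the training job",
}


def missing_security_settings(estimator_kwargs: dict) -> list:
    """List required settings that are absent, empty, or set to False."""
    return [name for name in REQUIRED_SETTINGS if not estimator_kwargs.get(name)]


config = {
    "output_kms_key": "arn:aws:kms:us-east-1:111122223333:key/placeholder",
    "volume_kms_key": "arn:aws:kms:us-east-1:111122223333:key/placeholder",
    "encrypt_inter_container_traffic": False,  # forgot to opt in
    "subnets": ["subnet-abc123"],
}
missing = missing_security_settings(config)
for name in missing:
    print(f"missing: {name} ({REQUIRED_SETTINGS[name]})")
```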

Access Control: Least-Privilege IAM

Data Scientist Role — can train but NOT deploy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateTrainingJob",
        "sagemaker:CreateProcessingJob",
        "sagemaker:CreateHyperParameterTuningJob",
        "sagemaker:Describe*",
        "sagemaker:List*"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Deny",
      "Action": [
        "sagemaker:CreateEndpoint",
        "sagemaker:UpdateEndpoint",
        "sagemaker:DeleteEndpoint"
      ],
      "Resource": "*"
    }
  ]
}
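
To see why this policy behaves as intended, recall that an explicit Deny always wins and anything not explicitly allowed is denied. The sketch below is a heavily simplified model of that evaluation: it looks only at `Action` wildcards and ignores `Resource`, `Condition`, and every other policy type.

```python
from fnmatch import fnmatchcase


def is_allowed(policy: dict, action: str) -> bool:
    """Simplified evaluation: explicit Deny wins, otherwise an Allow is required."""
    allowed = False
    for statement in policy["Statement"]:
        # IAM action names are case-insensitive, so compare in lowercase
        hit = any(fnmatchcase(action.lower(), p.lower()) for p in statement["Action"])
        if not hit:
            continue
        if statement["Effect"] == "Deny":
            return False
        allowed = True
    return allowed


data_scientist_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["sagemaker:CreateTrainingJob", "sagemaker:CreateProcessingJob",
                    "sagemaker:CreateHyperParameterTuningJob",
                    "sagemaker:Describe*", "sagemaker:List*"],
         "Resource": "*"},
        {"Effect": "Deny",
         "Action": ["sagemaker:CreateEndpoint", "sagemaker:UpdateEndpoint",
                    "sagemaker:DeleteEndpoint"],
         "Resource": "*"},
    ],
}

for action in ["sagemaker:CreateTrainingJob", "sagemaker:DescribeEndpoint",
               "sagemaker:CreateEndpoint", "sagemaker:CreateModel"]:
    print(action, is_allowed(data_scientist_policy, action))
```

`sagemaker:CreateModel` is rejected even though no statement mentions it, which is the implicit deny at work.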

SageMaker Execution Role — minimum permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::my-ml-bucket", "arn:aws:s3:::my-ml-bucket/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["ecr:GetDownloadUrlForLayer", "ecr:BatchGetImage", "ecr:BatchCheckLayerAvailability", "ecr:GetAuthorizationToken"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogStream", "logs:PutLogEvents", "logs:CreateLogGroup"],
      "Resource": "arn:aws:logs:*:*:log-group:/aws/sagemaker/*"
    },
    {
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
      "Resource": "arn:aws:kms:us-east-1:123456:key/my-key"
    }
  ]
}

Network Isolation: VPC + Endpoints

The problem: SageMaker instances in a VPC need to reach S3 (for data), ECR (for containers), and CloudWatch (for logging). Without VPC Endpoints, they’d need a NAT Gateway (expensive, traffic goes through the internet).

The solution: VPC Endpoints create private connections from your VPC to AWS services. Traffic never leaves the AWS network.

┌────────────────────── VPC ──────────────────────────┐
│                                                      │
│  ┌─── Private Subnet ────────────────────────────┐  │
│  │                                                │  │
│  │   SageMaker Training / Endpoint instances      │  │
│  │   (no public IP, no internet access)           │  │
│  │                                                │  │
│  └────────────┬───────────────────────────────────┘  │
│               │                                      │
│  ┌────────────▼───────────────────────────────────┐  │
│  │  VPC Endpoints (PrivateLink)                   │  │
│  │                                                │  │
│  │  ● S3 Gateway Endpoint (free)                  │  │
│  │    → Training reads/writes data                │  │
│  │                                                │  │
│  │  ● SageMaker API Interface Endpoint            │  │
│  │    → API calls to create jobs                  │  │
│  │                                                │  │
│  │  ● SageMaker Runtime Interface Endpoint        │  │
│  │    → Invoke inference endpoint                 │  │
│  │                                                │  │
│  │  ● ECR Interface Endpoints (ecr.api + ecr.dkr) │  │
│  │    → Pull training/inference containers        │  │
│  │                                                │  │
│  │  ● CloudWatch Logs Interface Endpoint          │  │
│  │    → Write training/inference logs              │  │
│  └────────────────────────────────────────────────┘  │
│                                                      │
│  NO NAT Gateway needed. NO internet traffic.         │
└──────────────────────────────────────────────────────┘

Common VPC troubleshooting (heavily tested):

Error Message                          Missing Endpoint    Fix
“Unable to download data from S3”      S3 Gateway          Add S3 Gateway Endpoint to route table
“Cannot pull Docker container”         ECR                 Add BOTH ecr.api and ecr.dkr Interface Endpoints
“Cannot write logs”                    CloudWatch Logs     Add logs Interface Endpoint
“Training job hangs at Downloading”    S3 or ECR           Check Security Group allows outbound to endpoints
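
A quick way to catch these problems before a job fails is to compare the endpoints in your VPC with the set this lab relies on. The sketch below generates the standard service names for a region and reports the ones that are absent. The `existing` list is made-up sample data; in practice you would fill it from your own VPC inventory.

```python
def required_endpoint_services(region: str) -> list:
    """Service names for the VPC endpoints used in this lab."""
    suffixes = ["s3", "sagemaker.api", "sagemaker.runtime", "ecr.api", "ecr.dkr", "logs"]
    return [f"com.amazonaws.{region}.{suffix}" for suffix in suffixes]


def missing_endpoint_services(region: str, existing) -> list:
    """Return required service names that have no endpoint yet."""
    have = set(existing)
    return [name for name in required_endpoint_services(region) if name not in have]


existing = [
    "com.amazonaws.us-east-1.s3",
    "com.amazonaws.us-east-1.sagemaker.api",
    "com.amazonaws.us-east-1.ecr.api",  # ecr.dkr was forgotten
    "com.amazonaws.us-east-1.logs",
]
gaps = missing_endpoint_services("us-east-1", existing)
print(gaps)
```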

Network Isolation — maximum lockdown:

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    subnets=["subnet-abc123"],
    security_group_ids=["sg-0123456789"],
    enable_network_isolation=True,   # ZERO internet access
    sagemaker_session=session,
)

With enable_network_isolation=True, the container:

  • Cannot pip install anything
  • Cannot make HTTP calls to any API
  • Cannot exfiltrate data anywhere
  • Can ONLY access S3 through the defined input channels
  • All dependencies MUST be baked into the container image

Use this for: HIPAA, PCI-DSS, handling classified data, preventing data exfiltration.


Lab 5: Cost Optimization Patterns

What Is This Lab About?

ML workloads are expensive — GPU instances, always-on endpoints, large training runs. But most teams overspend by 40-70% because they don’t use the right cost optimization techniques. The exam tests your ability to recommend the cheapest option that still meets requirements.

The Cost Optimization Decision Tree

TRAINING COSTS:
  "How to reduce training cost?"
      │
      ├── Can tolerate interruptions? ── YES ── Spot Training (up to 90% savings)
      │                                          + Checkpointing required
      │
      ├── Running same pipeline daily? ── YES ── Step Caching
      │                                          (skip unchanged steps)
      │
      ├── Using GPU for XGBoost? ── YES ── Try CPU first (ml.m5)
      │                                    XGBoost is usually memory-bound, so GPU rarely pays off
      │
      └── Training frequently? ── YES ── Warm Pools (reduce startup time)
                                         or SageMaker Savings Plans (up to 64%)

INFERENCE COSTS:
  "How to reduce inference cost?"
      │
      ├── Sporadic traffic (<100/day)? ── YES ── Serverless (pay per call)
      │
      ├── Irregular + large payloads? ── YES ── Async (scale to 0 + GPU)
      │
      ├── Many small models? ── YES ── Multi-Model Endpoint (share instance)
      │
      ├── Scheduled bulk scoring? ── YES ── Batch Transform (no endpoint)
      │
      ├── Steady, predictable load? ── YES ── Savings Plans (up to 64%)
      │                                       or Inferentia chips (up to 70%)
      │
      └── Any deployed model? ── YES ── SageMaker Neo (compile for hardware)
                                        Up to 2x faster → use smaller instance
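
The inference branch of the tree can be written as a small function, which makes the order of the questions explicit. This is a study aid that mirrors the tree above, not an official sizing tool; the function and argument names are hypothetical.

```python
def recommend_inference_option(*, sporadic=False, large_payloads=False,
                               many_small_models=False, scheduled_bulk=False,
                               steady_load=False) -> str:
    """Walk the inference cost tree from top to bottom and return the first match."""
    if sporadic:
        return "Serverless Inference"
    if large_payloads:
        return "Asynchronous Inference"
    if many_small_models:
        return "Multi-Model Endpoint"
    if scheduled_bulk:
        return "Batch Transform"
    if steady_load:
        return "Real-time endpoint + Savings Plan"
    return "Real-time endpoint"


print(recommend_inference_option(sporadic=True))
print(recommend_inference_option(scheduled_bulk=True))
```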

Key Numbers to Remember

Optimization            Savings                         Trade-off
Spot Training           Up to 90%                       Interruptions, need checkpointing
Serverless Inference    Pay per call                    Cold starts, CPU only, 4 MB max
Async (scale to 0)      Pay per use                     Minutes latency, queue-based
Multi-Model Endpoint    ~80% vs individual endpoints    Cold model load time
Batch Transform         Pay per job only                No real-time, hours for results
Savings Plans           Up to 64%                       1 or 3-year commitment
Inferentia (ml.inf2)    Up to 70% vs GPU                Limited framework support
Trainium (ml.trn1)      Up to 50% vs GPU                Training only
SageMaker Neo           Up to 2x throughput             Compilation step required
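
For Spot Training, the savings percentage comes from comparing billable seconds with total training seconds, both of which are reported on a training job's description. The worked example below uses made-up durations.

```python
def spot_savings_percent(training_seconds: int, billable_seconds: int) -> float:
    """Savings from Managed Spot Training: (1 - billable / training) * 100."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


# Hypothetical job: ran for one hour, billed for 12 minutes
savings = spot_savings_percent(training_seconds=3600, billable_seconds=720)
print(f"{savings:.0f}% saved")
```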

Domain 4 Lab Summary

Lab   Service                You Learned
1     Model Monitor          Data Capture → Baseline → Schedule → Violations: the complete lifecycle
2     Model Quality          Ground truth merging, why it NEEDS actual outcomes, delayed feedback problem
3     Automated Retraining   Monitor → CloudWatch → EventBridge → Lambda → Pipeline: full automation loop
4     Security               KMS encryption, IAM least privilege, VPC Endpoints, Network Isolation, HIPAA checklist
5     Cost Optimization      Decision trees for training and inference cost reduction, key savings numbers