Domain 4C: Security, Compliance & Cost

Exam Domain: 4 (ML Implementation and Operations, 20%)
Task: Apply security controls, compliance requirements, and cost optimization strategies to ML workloads


Security in ML — First Principles

ML security sits at the intersection of data security, infrastructure security, and model security. Each layer has distinct threats and controls.

┌─────────────────────────────────────────────────────────────────┐
│                   ML ATTACK SURFACE                            │
│                                                                 │
│  Training Data ──► Model Training ──► Model ──► Inference       │
│       │                  │              │           │           │
│  Data poisoning     Credential       Model       Adversarial   │
│  Privacy leakage    theft            stealing    inputs        │
│  Bias injection     Compute theft    IP theft    Model bypass  │
│                                                                 │
│  Controls:                                                      │
│  Encryption        IAM + VPC       Access ctrl  Input valid.   │
│  Access control    Logging         Encryption   Rate limiting  │
└─────────────────────────────────────────────────────────────────┘

ELI5: ML security protects three things: your data (secrets — training data, customer PII), your model (intellectual property — the result of expensive training), and your predictions (business logic — decisions that drive revenue or safety). Each needs different protections, just like a house needs locks on doors, a safe for valuables, and insurance for everything else.

Why this matters for the exam: Security questions appear frequently in Domain 4. Know which service handles which concern: KMS for encryption, IAM for access, VPC for network isolation, CloudTrail for audit, Model Monitor for operational drift.


Data Protection

Encryption at Rest

S3 encryption options:

Option      | Key Management                    | Use When
SSE-S3      | AWS manages (AES-256)             | Default, no compliance requirement
SSE-KMS     | AWS KMS (you choose key)          | Audit trail needed, customer managed key
SSE-C       | Customer provides key per request | Full key control outside AWS
Client-side | Encrypt before upload             | Data must never be plaintext in AWS

SageMaker encryption:

  • Notebooks: encrypt EBS volume with KMS key
  • Training jobs: encrypt EBS attached to training instances, encrypt S3 output with KMS
  • Endpoints: encrypt EBS on inference instances
  • Feature Store: encrypt the online store (a managed low-latency store) and the offline store (S3) separately; each takes its own KMS key setting
  • Default: SageMaker uses AWS-managed KMS key if you don’t specify one

Exam tip: If a question says “customer managed key” or “you control the key” — that’s SSE-KMS with a Customer Managed Key (CMK). If it says “AWS manages the key” — SSE-S3 or AWS-managed KMS key.

Encryption in Transit

  • All SageMaker API calls use TLS 1.2+
  • Inter-container traffic for distributed training: not encrypted by default
    • Enable with EnableInterContainerTrafficEncryption: true in training job config
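    • Can increase training time, especially for distributed deep learning; the often-quoted ~10-20% overhead is a rough estimate that depends on how much the instances communicate
    • Commonly required for regulated workloads: HIPAA, PCI-DSS, financial compliance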
  • VPC endpoints (PrivateLink): traffic never leaves AWS network
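
The at-rest and in-transit settings above map onto a few fields of a CreateTrainingJob request. The following is a minimal sketch, not a complete request: the helper name, key ARN, bucket, and volume size are hypothetical, while the field names (OutputDataConfig.KmsKeyId, ResourceConfig.VolumeKmsKeyId, EnableInterContainerTrafficEncryption) are the ones the SageMaker API uses.

```python
def encrypted_training_settings(kms_key_arn, output_s3_uri,
                                instance_type="ml.m5.xlarge",
                                instance_count=1):
    """Return only the encryption-related part of a CreateTrainingJob request.

    Hypothetical helper for illustration; a real request also needs
    AlgorithmSpecification, RoleArn, StoppingCondition, and so on.
    """
    return {
        "OutputDataConfig": {
            "S3OutputPath": output_s3_uri,
            "KmsKeyId": kms_key_arn,          # encrypts model artifacts in S3
        },
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 50,             # assumed size, adjust to your data
            "VolumeKmsKeyId": kms_key_arn,    # encrypts the attached ML storage volume
        },
        # Inter-container traffic only exists when more than one instance runs.
        "EnableInterContainerTrafficEncryption": instance_count > 1,
    }


settings = encrypted_training_settings(
    "arn:aws:kms:us-east-1:111122223333:key/example-key-id",  # hypothetical ARN
    "s3://example-bucket/model-artifacts/",                   # hypothetical bucket
    instance_count=2,
)
```

Using the same customer managed key for output and volume keeps the audit trail in one place; separate keys are equally valid if teams need different key policies.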

AWS KMS Deep Dive

How KMS Works

ELI5: KMS is a locked vault for your encryption keys. You put a key in the vault and give it a name. When you want to encrypt data, you ask the vault to encrypt it — the key NEVER leaves the vault. If someone steals your encrypted data, they still can’t read it because they don’t have the key, and the vault never hands the key out directly.

Key Types

Key Type                   | Who Creates It | Who Manages It | Use Case
AWS Managed Key            | AWS            | AWS            | Default for most services (no monthly key fee)
Customer Managed Key (CMK) | You            | You            | Compliance, key rotation control
Customer Provided Key      | You            | You (external) | SSE-C, external HSM
AWS Owned Key              | AWS            | AWS            | Service-internal, you can’t see it

Envelope Encryption

┌────────────────────────────────────────────────────────────────┐
│                  ENVELOPE ENCRYPTION                          │
│                                                                │
│  1. SageMaker requests a DATA KEY from KMS                    │
│                                                                │
│  KMS ──► generates Data Key (plaintext + encrypted copy)      │
│                                                                │
│  2. SageMaker uses plaintext Data Key to encrypt your data    │
│     Your data ──► [AES-256 encrypt with Data Key] ──► ciphertext│
│                                                                │
│  3. SageMaker stores encrypted Data Key alongside ciphertext  │
│     (plaintext Data Key is discarded from memory)             │
│                                                                │
│  4. To decrypt: KMS decrypts the Data Key ──► decrypt data    │
│                                                                │
│  Result: KMS key never touches your data directly.            │
│  Audit trail: every KMS call logged in CloudTrail.            │
└────────────────────────────────────────────────────────────────┘
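
The four steps in the diagram can be walked through in code. This is a teaching sketch only: ToyKms is a hypothetical in-memory stand-in for KMS, and the XOR keystream built from SHA-256 is a toy stand-in for AES-256 that must not be used to protect real data. What it does show accurately is the flow: the master key stays inside the "vault", and only the encrypted data key is stored next to the ciphertext.

```python
import hashlib
import secrets


def _toy_cipher(key, data):
    """XOR data with a SHA-256 counter keystream. NOT secure; stands in for AES-256."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + counter).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


class ToyKms:
    """Hypothetical stand-in for KMS: the master key never leaves this object."""

    def __init__(self):
        self._master_key = secrets.token_bytes(32)

    def generate_data_key(self):
        plaintext_key = secrets.token_bytes(32)
        encrypted_key = _toy_cipher(self._master_key, plaintext_key)
        return plaintext_key, encrypted_key

    def decrypt(self, encrypted_key):
        # XOR is its own inverse, so the same routine decrypts.
        return _toy_cipher(self._master_key, encrypted_key)


def envelope_encrypt(kms, data):
    plaintext_key, encrypted_key = kms.generate_data_key()  # step 1
    ciphertext = _toy_cipher(plaintext_key, data)           # step 2
    del plaintext_key  # step 3: drop the local reference to the plaintext key
    return {"ciphertext": ciphertext, "encrypted_data_key": encrypted_key}


def envelope_decrypt(kms, envelope):
    data_key = kms.decrypt(envelope["encrypted_data_key"])  # step 4
    return _toy_cipher(data_key, envelope["ciphertext"])


kms = ToyKms()
envelope = envelope_encrypt(kms, b"customer PII used for training")
restored = envelope_decrypt(kms, envelope)
```

Anyone holding only the envelope has ciphertext plus an encrypted key; without a successful call to the key service, neither is usable.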

Why KMS Matters for Compliance

  • Every Encrypt, Decrypt, GenerateDataKey call logged in CloudTrail
  • Key policies control EXACTLY who can use each key
  • Optional automatic key rotation for customer managed keys (yearly by default once enabled); AWS managed keys rotate automatically
  • Cross-account access: share encrypted data across accounts via key policy
  • Key deletion has mandatory 7-30 day waiting period (prevents accidental loss)

Network Security

VPC Configuration for SageMaker

ELI5: Putting SageMaker in a VPC is like moving your ML workload from a public office building into a private secure facility. The work still gets done, but now there are walls, security guards (security groups), and controlled entry points (VPC endpoints) — no random internet traffic gets in.

┌─────────────────────────────────────────────────────────────────┐
│                    VPC ARCHITECTURE FOR ML                     │
│                                                                 │
│  ┌────────────────── VPC ──────────────────────────────────┐   │
│  │                                                          │   │
│  │  ┌─── Private Subnet ───────────────────────────────┐   │   │
│  │  │                                                   │   │   │
│  │  │  SageMaker Training    SageMaker Endpoint         │   │   │
│  │  │  Instances             Instances                  │   │   │
│  │  │  (no public IP)        (no public IP)             │   │   │
│  │  │                                                   │   │   │
│  │  └───────────────────────────────────────────────────┘   │   │
│  │             │                        │                   │   │
│  │    NAT Gateway              VPC Endpoints                │   │
│  │    (for internet:           (for AWS services:           │   │
│  │     pip install)             S3, ECR, KMS, etc.)         │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Without VPC Endpoints: traffic goes Internet ──► AWS service  │
│  With VPC Endpoints:    traffic stays inside AWS network       │
└─────────────────────────────────────────────────────────────────┘

Network Controls

Control                   | Scope            | What It Does
Security Groups           | Instance-level   | Stateful firewall: allow rules only, by port, protocol, source
Network ACLs              | Subnet-level     | Stateless firewall: allow/deny by IP range and port
VPC Endpoints (Interface) | Service access   | Private connectivity to AWS services via PrivateLink
VPC Endpoints (Gateway)   | S3/DynamoDB      | Route table entry for private S3/DynamoDB access
NAT Gateway               | Outbound internet | Allow private subnet instances to reach internet

Why VPC matters for ML:

  • Training data often contains PII or trade secrets — can’t traverse public internet
  • Regulatory requirements (HIPAA, GDPR) often mandate private network paths
  • Prevents data exfiltration: training instance can only talk to approved S3 buckets

SageMaker-Specific Network Flags

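# Training job network settings (top-level CreateTrainingJob parameters;
# Processing jobs nest the same three settings under NetworkConfig)
EnableNetworkIsolation: true                  # Container cannot make outbound network calls
EnableInterContainerTrafficEncryption: true   # Encrypt distributed training traffic
VpcConfig:
  SecurityGroupIds: [sg-xxx]
  Subnets: [subnet-xxx]

  • EnableNetworkIsolation: true is the most restrictive setting: the training container cannot make any outbound network calls, not even to S3 or other AWS services. SageMaker still downloads the input data and uploads the model artifacts on the container's behalf.
    • Use when: regulatory requirement for air-gapped training
    • Limitation: can’t download pip packages during training (must bake into container)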
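
A pre-submission check can catch jobs that skip these flags. The sketch below assumes the CreateTrainingJob request shape, where EnableNetworkIsolation, EnableInterContainerTrafficEncryption, and VpcConfig are top-level fields; the function name and the wording of the findings are hypothetical.

```python
def network_findings(request):
    """Return a list of network-hardening gaps in a training job request dict."""
    findings = []

    vpc = request.get("VpcConfig") or {}
    if not vpc.get("Subnets") or not vpc.get("SecurityGroupIds"):
        findings.append("VpcConfig with Subnets and SecurityGroupIds is missing")

    if not request.get("EnableNetworkIsolation", False):
        findings.append("EnableNetworkIsolation is not enabled")

    # Inter-container encryption only matters for multi-instance jobs.
    instance_count = request.get("ResourceConfig", {}).get("InstanceCount", 1)
    if instance_count > 1 and not request.get(
        "EnableInterContainerTrafficEncryption", False
    ):
        findings.append("multi-instance job without inter-container encryption")

    return findings


hardened = {
    "EnableNetworkIsolation": True,
    "EnableInterContainerTrafficEncryption": True,
    "VpcConfig": {"SecurityGroupIds": ["sg-xxx"], "Subnets": ["subnet-xxx"]},
    "ResourceConfig": {"InstanceCount": 4},
}
open_job = {"ResourceConfig": {"InstanceCount": 4}}
```

Here network_findings(hardened) returns an empty list, while network_findings(open_job) reports all three gaps.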

Identity & Access Management

IAM for SageMaker

Two IAM principals to understand:

Principal                | What They Are                    | Key Permissions Needed
SageMaker Execution Role | Role assumed BY SageMaker service | S3 read/write, ECR pull, CloudWatch logs, KMS
User/Team IAM Role       | Role assumed BY humans/apps      | Create/manage SageMaker resources

Principle of Least Privilege in ML:

Training Job needs:
  ✓ s3:GetObject on training data bucket
  ✓ s3:PutObject on model artifact bucket
  ✓ ecr:GetDownloadUrlForLayer (pull container)
  ✓ cloudwatch:PutMetricData (log metrics)
  ✗ s3:DeleteObject (NOT needed)
  ✗ iam:CreateRole (NOT needed)
  ✗ ec2:TerminateInstances (NOT needed)
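
The checklist above translates into an IAM policy document along these lines. This is a sketch with hypothetical bucket names; a real execution role usually needs a few more actions (for example s3:ListBucket and CloudWatch Logs permissions), and the extra ECR actions shown are the ones normally required alongside ecr:GetDownloadUrlForLayer to pull an image.

```python
import json


def training_role_policy(data_bucket, artifact_bucket):
    """Build a least-privilege policy document for a training job role (illustrative)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadTrainingData",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{data_bucket}/*",
            },
            {
                "Sid": "WriteModelArtifacts",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{artifact_bucket}/*",
            },
            {
                "Sid": "PullContainerImage",
                "Effect": "Allow",
                "Action": [
                    "ecr:GetAuthorizationToken",
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage",
                ],
                "Resource": "*",
            },
            {
                "Sid": "PublishMetrics",
                "Effect": "Allow",
                "Action": ["cloudwatch:PutMetricData"],
                "Resource": "*",
            },
        ],
    }


policy = training_role_policy("example-training-data", "example-model-artifacts")
granted = {a for s in policy["Statement"] for a in s["Action"]}
policy_json = json.dumps(policy, indent=2)  # ready to attach to a role
```

Nothing in granted allows deleting objects, creating roles, or terminating instances, which is the point of the ✗ rows.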

Key IAM condition keys for SageMaker:

Condition Key                             | Restricts
sagemaker:VpcSecurityGroupIds             | Require VPC security group
sagemaker:VpcSubnets                      | Require specific subnets
sagemaker:RootAccess                      | Control root access in notebooks
sagemaker:InterContainerTrafficEncryption | Require encryption
aws:RequestedRegion                       | Restrict to specific regions
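
Condition keys are typically used in Deny statements so that a request missing the required setting is rejected. A minimal sketch, assuming you want to block training jobs that have no VPC subnets or that leave inter-container encryption off; the function name and Sid values are hypothetical.

```python
def require_vpc_and_encryption_policy():
    """Deny CreateTrainingJob unless VPC subnets and traffic encryption are set."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyTrainingOutsideVpc",
                "Effect": "Deny",
                "Action": "sagemaker:CreateTrainingJob",
                "Resource": "*",
                # "Null": "true" matches requests where the key is absent.
                "Condition": {"Null": {"sagemaker:VpcSubnets": "true"}},
            },
            {
                "Sid": "DenyUnencryptedInterContainerTraffic",
                "Effect": "Deny",
                "Action": "sagemaker:CreateTrainingJob",
                "Resource": "*",
                "Condition": {
                    "Bool": {"sagemaker:InterContainerTrafficEncryption": "false"}
                },
            },
        ],
    }


guardrail = require_vpc_and_encryption_policy()
```

The same statements can be attached as an identity policy in one account or, with care, reused as the body of an organization-wide guardrail.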

Service Control Policies (SCPs):

  • Organization-level guardrails applied to the member accounts in an AWS Organization (the management account is not restricted by SCPs)
  • Cannot be overridden by account-level IAM policies
  • Use case: “No one in any account can launch a SageMaker endpoint without KMS encryption”

Cross-Account ML Access Patterns

┌─── Dev Account ───┐         ┌─── Prod Account ───┐
│                   │         │                     │
│  Train model      │         │  Deploy endpoint    │
│  Register in      │────────►│  Pull approved      │
│  Model Registry   │         │  model from registry│
│                   │         │                     │
└───────────────────┘         └─────────────────────┘

Requirements:
- Model Registry resource policy: allow prod account to read
- Prod account IAM role: allow SageMaker to use cross-account model
- KMS key policy: allow prod account to decrypt model artifacts
- S3 bucket policy: allow prod account to read model artifacts
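
Two of these requirements are resource policies on the dev side. The sketch below builds the corresponding statements, assuming a hypothetical prod account ID and bucket name; granting to the account root principal delegates the decision to IAM in the prod account, which is one common choice rather than the only one.

```python
def cross_account_statements(prod_account_id, artifact_bucket):
    """Return (KMS key policy statement, S3 bucket policy statement) for prod read access."""
    prod_principal = {"AWS": f"arn:aws:iam::{prod_account_id}:root"}

    kms_statement = {
        "Sid": "AllowProdToDecryptModelArtifacts",
        "Effect": "Allow",
        "Principal": prod_principal,
        "Action": ["kms:Decrypt", "kms:DescribeKey"],
        "Resource": "*",  # in a key policy, "*" means the key the policy is attached to
    }

    s3_statement = {
        "Sid": "AllowProdToReadModelArtifacts",
        "Effect": "Allow",
        "Principal": prod_principal,
        "Action": ["s3:GetObject"],
        "Resource": f"arn:aws:s3:::{artifact_bucket}/*",
    }
    return kms_statement, s3_statement


kms_stmt, s3_stmt = cross_account_statements("444455556666", "example-model-artifacts")
```

Both statements are needed together: S3 read access without kms:Decrypt returns an access-denied error on SSE-KMS objects.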

Monitoring & Logging

Amazon CloudWatch for ML

Training metrics:

  • Custom metrics logged from the training script, for example with Run.log_metric() from SageMaker Experiments
  • Built-in metrics: CPU/GPU utilization, memory, disk I/O
  • Regex-based metric extraction from training logs

Endpoint metrics (built-in):

Metric                 | What It Measures
Invocations            | Total number of requests
InvocationsPerInstance | Requests per instance (auto-scaling trigger)
ModelLatency           | Time spent in model (microseconds)
OverheadLatency        | SageMaker overhead, not model time (microseconds)
Invocation4XXErrors    | Client errors (bad requests)
Invocation5XXErrors    | Server errors (model failures)
ModelSetupTime         | Serverless endpoints: time to launch compute and load the model (cold start)

Alarms: Trigger actions (SNS, Auto Scaling, Lambda) when metrics cross thresholds.
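
Because ModelLatency is reported in microseconds, alarm thresholds written in milliseconds are a common mistake. A small sketch of the arguments one might pass to CloudWatch's put_metric_alarm; the helper name, alarm name pattern, period, and evaluation settings are assumptions to adapt.

```python
def model_latency_alarm(endpoint_name, variant_name, threshold_ms, sns_topic_arn):
    """Build put_metric_alarm keyword arguments for a ModelLatency alarm."""
    return {
        "AlarmName": f"{endpoint_name}-model-latency-high",
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        "Period": 60,                        # seconds per datapoint (assumed)
        "EvaluationPeriods": 5,              # 5 consecutive breaching minutes (assumed)
        "Threshold": threshold_ms * 1000.0,  # convert milliseconds to microseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


alarm = model_latency_alarm(
    "fraud-endpoint", "AllTraffic", threshold_ms=200,
    sns_topic_arn="arn:aws:sns:us-east-1:111122223333:ml-alerts",  # hypothetical
)
```

With boto3 available, these arguments would be passed as cloudwatch_client.put_metric_alarm(**alarm).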


AWS CloudTrail for ML Governance

  • Records SageMaker control-plane API calls: who, what, when, from where (InvokeEndpoint inference requests are not logged)
  • Logged calls: CreateTrainingJob, CreateEndpoint, DeleteModel, UpdateEndpoint, etc.
  • Includes: caller identity (IAM user/role), source IP, request parameters
  • Stored in S3, queryable with Athena
  • Use for: compliance audits, forensic investigation, change tracking

Exam tip: CloudTrail is the audit trail. If a question says “who deleted the model?” or “prove that no unauthorized user accessed training data”, CloudTrail is the answer. Note that object-level S3 access is recorded only when S3 data events are enabled on the trail.
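
Answering “who deleted the model?” comes down to filtering CloudTrail records by event name. The sketch below works on records already loaded as dictionaries; the field names (eventSource, eventName, userIdentity.arn, sourceIPAddress, eventTime) follow the CloudTrail record format, while the two sample records are made up for illustration.

```python
def who_called(records, event_name):
    """Return caller, time, and source IP for each matching SageMaker API call."""
    matches = []
    for record in records:
        if record.get("eventSource") != "sagemaker.amazonaws.com":
            continue
        if record.get("eventName") != event_name:
            continue
        matches.append({
            "who": record.get("userIdentity", {}).get("arn"),
            "when": record.get("eventTime"),
            "from": record.get("sourceIPAddress"),
        })
    return matches


# Made-up sample records, not real log output.
SAMPLE_RECORDS = [
    {
        "eventSource": "sagemaker.amazonaws.com",
        "eventName": "CreateTrainingJob",
        "eventTime": "2025-01-10T09:00:00Z",
        "sourceIPAddress": "203.0.113.10",
        "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"},
    },
    {
        "eventSource": "sagemaker.amazonaws.com",
        "eventName": "DeleteModel",
        "eventTime": "2025-01-11T17:30:00Z",
        "sourceIPAddress": "203.0.113.25",
        "userIdentity": {"arn": "arn:aws:iam::111122223333:user/bob"},
    },
]

deletions = who_called(SAMPLE_RECORDS, "DeleteModel")
```

At scale the same filter is usually written as an Athena query over the trail's S3 bucket rather than as a Python loop.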


SageMaker Model Monitor — Deep Dive

ELI5: Model Monitor is a smoke detector for your ML model. It constantly checks two things: (1) Has the data feeding your model changed significantly since training? (data drift) and (2) Are the model’s predictions getting worse over time? (model quality drift). If either triggers, it raises an alarm so you can retrain.

Four monitoring types:

Monitor Type              | What It Detects                      | How
Data Quality              | Input data drift (schema, statistics) | Compare incoming data to training baseline
Model Quality             | Prediction accuracy degradation      | Compare predictions to ground truth labels
Bias Drift                | Changes in model fairness            | Track bias metrics over time
Feature Attribution Drift | Changes in feature importance (SHAP) | Compare SHAP values to baseline

How it works:

Step 1: Create Baseline
  Training data ──► Baseline job ──► Statistics + Constraints JSON
  (runs once)       (SageMaker        (stored in S3)
                    Processing)

Step 2: Enable Data Capture
  Endpoint ──► [Model Monitor captures] ──► Inference requests + responses to S3

Step 3: Schedule Monitoring
  Every hour/day ──► Monitoring job runs:
  Captured data vs. Baseline ──► Violations report ──► CloudWatch metrics

Step 4: Alert
  CloudWatch metric exceeds threshold ──► Alarm ──► SNS ──► Re-train trigger
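
The baseline-then-compare idea in Steps 1 to 3 can be illustrated with a single numeric feature. This is a deliberately simplified sketch, not the statistics or constraint format Model Monitor actually produces: the function names and the three-standard-deviation tolerance are assumptions.

```python
import statistics


def build_baseline(training_values, tolerance_std=3.0):
    """Step 1: summarize the training data into statistics plus a constraint range."""
    mean = statistics.fmean(training_values)
    std = statistics.pstdev(training_values)
    return {
        "mean": mean,
        "std": std,
        "lower": mean - tolerance_std * std,
        "upper": mean + tolerance_std * std,
    }


def check_captured_batch(baseline, captured_values):
    """Step 3: compare captured inference inputs against the baseline constraint."""
    batch_mean = statistics.fmean(captured_values)
    violated = not (baseline["lower"] <= batch_mean <= baseline["upper"])
    return {"batch_mean": batch_mean, "violated": violated}


baseline = build_baseline([10, 11, 9, 10, 10, 11, 9, 10])
normal_hour = check_captured_batch(baseline, [10, 9, 11])
drifted_hour = check_captured_batch(baseline, [15, 16, 14])
```

In this toy data the second batch has shifted well outside the baseline range, so it is flagged; in the real service that flag becomes a violations report and a CloudWatch metric that an alarm can act on.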

Why drift matters:

  • Model trained on 2023 data may fail on 2025 data (distribution shift)
  • User behavior changes (e.g., pandemic changed shopping patterns)
  • Data pipeline upstream changes silently alter feature distributions
  • Regulatory requirement: show model is still performing as designed

Integration:

  • CloudWatch metrics per feature: feature.mean, feature.std_dev, completeness
  • Violation report: JSON listing which constraints were violated
  • SageMaker Pipelines: trigger retraining when violations exceed threshold

Compliance & Governance

SageMaker Clarify — Bias and Explainability

Pre-training bias analysis:

  • Analyze training dataset BEFORE training
  • Detect: class imbalance, label correlation with sensitive attributes
  • Metrics: Class Imbalance (CI), Difference in Proportions of Labels (DPL), KL Divergence

Post-training bias analysis:

  • Analyze model predictions for fairness
  • Detect: disparate impact across demographic groups
  • Metrics: Difference in Positive Proportions in Predicted Labels (DPPL), Disparate Impact (DI)
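
A few of these metrics are simple ratios and can be computed by hand. The sketch below follows the definitions Clarify documents for class imbalance, difference in proportions of labels, and disparate impact, where facet a is the advantaged group and facet d the disadvantaged group; the counts are invented for the example.

```python
def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d); 0 means the facets are equally represented."""
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions(positives_a, n_a, positives_d, n_d):
    """DPL (on labels) or DPPL (on predictions): q_a - q_d."""
    return positives_a / n_a - positives_d / n_d


def disparate_impact(pred_positives_a, n_a, pred_positives_d, n_d):
    """DI = q'_d / q'_a using predicted positive rates; 1 means parity."""
    return (pred_positives_d / n_d) / (pred_positives_a / n_a)


# Invented dataset: 800 rows in facet a, 200 rows in facet d.
ci = class_imbalance(800, 200)
dpl = difference_in_proportions(400, 800, 60, 200)   # observed positive labels
di = disparate_impact(400, 800, 60, 200)             # predicted positive labels
```

Here facet d is underrepresented (CI of 0.6), receives positive labels less often (DPL of 0.2), and is predicted positive at 60% of the rate of facet a (DI of 0.6).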

Explainability (SHAP values):

  • Per-prediction feature attribution: “Why did this model return THAT prediction?”
  • SHAP (SHapley Additive exPlanations): mathematically sound attribution
  • Global explanations: which features matter most overall
  • Local explanations: which features drove THIS specific prediction
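
For a handful of features, Shapley values can be computed exactly by averaging each feature's marginal contribution over every ordering in which features are switched from a baseline to the actual input. This brute-force sketch is for intuition only; it grows factorially with the feature count, which is why production tools rely on approximations such as Kernel SHAP. The toy model and inputs are hypothetical.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley attribution of predict(instance) - predict(baseline)."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for feature in order:
            current[feature] = instance[feature]   # switch this feature on
            value = predict(current)
            totals[feature] += value - previous    # its marginal contribution
            previous = value
    return [total / len(orderings) for total in totals]


def toy_model(x):
    """Hypothetical model: one additive term plus one interaction term."""
    return 2 * x[0] + x[1] * x[2]


attributions = shapley_values(toy_model, instance=[1, 2, 3], baseline=[0, 0, 0])
```

The attributions add up to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two features that share it.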

When required:

  • Financial services: credit decisions (ECOA, FCRA compliance)
  • Healthcare: clinical decision support (FDA AI/ML guidelines)
  • HR: hiring/promotion decisions (EEOC compliance)
  • Insurance: risk pricing (state insurance regulations)

Exam tip: Clarify is the answer when a question mentions “explain why the model made this decision” or “detect bias in model predictions” or “comply with fairness regulations.”


ML Lineage Tracking

SageMaker ML Lineage (automatic):

Dataset ──► Processing Job ──► Processed Dataset ──► Training Job ──► Model ──► Endpoint
   │             │                    │                   │              │          │
Artifact      Action              Artifact             Action         Artifact   Action
                                                                        │
                                                                   Model Registry
                                                                   (approval)

Use cases:

  • Auditor asks: “What training data was used for the production model?” → query lineage
  • Incident response: “Which endpoints use the affected model version?” → query lineage
  • Reproducibility: recreate exact training conditions for a model version

Model Cards

Document ML model details for transparency and accountability.

  • Intended use, out-of-scope use cases
  • Training data description, evaluation results
  • Limitations, biases, performance across demographic groups
  • Contact information, version history

Cost Optimization

The Three Cost Buckets

ELI5: ML costs come from three buckets: training (renting powerful computers to learn), inference (renting computers to make predictions 24/7), and storage (storing your data and models in S3). The strategies to optimize each bucket are completely different, just as saving money on rent, groceries, and utilities each requires a different approach.

┌─────────────────────────────────────────────────────────────────┐
│                    ML COST BREAKDOWN                           │
├──────────────────┬──────────────────┬──────────────────────────┤
│   TRAINING       │   INFERENCE      │   STORAGE                │
├──────────────────┼──────────────────┼──────────────────────────┤
│ Compute-heavy    │ Always-on cost   │ Scales with data volume  │
│ Periodic spikes  │ Predictable      │ Low $/GB but accumulates │
│ GPU-intensive    │ Latency matters  │                          │
├──────────────────┼──────────────────┼──────────────────────────┤
│ Spot Instances   │ Auto-scaling     │ S3 Lifecycle policies    │
│ Right-sizing     │ Serverless       │ Intelligent-Tiering      │
│ Warm Pools       │ Multi-Model EP   │ Delete stale artifacts   │
│ Savings Plans    │ Inferentia       │                          │
└──────────────────┴──────────────────┴──────────────────────────┘

Training Cost Optimization

Spot Instances (most impactful):

  • Up to 90% cheaper than On-Demand instances
  • Risk: instance can be interrupted with 2-minute warning
  • Mitigation: checkpointing — save model state to S3 periodically, resume on new instance
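  • SageMaker syncs the local checkpoint directory to the checkpoint_s3_uri location and restores it when the job restarts; the training script must still write checkpoints and load the latest one on startup
  • Best for: long-running training jobs where interruption is acceptable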
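
Managed Spot Training is switched on through a few Estimator arguments in the SageMaker Python SDK. The sketch below only assembles and validates those arguments (use_spot_instances, max_run, max_wait, checkpoint_s3_uri) and computes the savings figure from billable versus total training seconds; the helper names and the S3 URI are hypothetical.

```python
def spot_training_kwargs(checkpoint_s3_uri, max_run_seconds, max_wait_seconds):
    """Build Estimator keyword arguments for Managed Spot Training."""
    # max_wait covers waiting for Spot capacity plus running, so it cannot be shorter.
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run_seconds,
        "max_wait": max_wait_seconds,
        "checkpoint_s3_uri": checkpoint_s3_uri,
    }


def spot_savings_percent(training_seconds, billable_seconds):
    """Savings as a percentage: (1 - billable / total) * 100."""
    return (1 - billable_seconds / training_seconds) * 100


spot_kwargs = spot_training_kwargs(
    "s3://example-bucket/checkpoints/job-1/",  # hypothetical location
    max_run_seconds=4 * 3600,
    max_wait_seconds=6 * 3600,
)
savings = spot_savings_percent(training_seconds=3600, billable_seconds=1080)
```

These keyword arguments would be merged into the rest of an Estimator definition; a job that trained for an hour but was billed for 18 minutes shows a saving of about 70%.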

Right-sizing instances:

Mistake                                    | Fix
Using ml.p3.16xlarge for scikit-learn      | Use ml.m5.xlarge
Using ml.m5.large for transformer training | Use ml.p3.2xlarge
Multi-GPU when single GPU fits the model   | Profile GPU utilization first
100GB EBS when model needs 10GB            | Size EBS to actual data needs

Managed Warm Pools:

  • Keep instances warm between training jobs (avoid instance startup overhead)
  • You pay for instances while they sit idle in the warm pool; this is worthwhile when jobs run often enough that the saved startup time outweighs the idle cost
  • Good for: iterative development with frequent short training runs

SageMaker Savings Plans:

  • Commit to a $/hour spend for 1 or 3 years
  • Get up to 64% discount vs. On-Demand
  • Flexible: applies to SageMaker usage such as training, processing, batch transform, and real-time inference, across instance types and Regions
  • vs. Reserved Instances: Savings Plans are more flexible (no specific instance commitment)

Distributed training cost vs. speed:

Adding more instances has diminishing returns:
  1 instance:  100% efficiency
  2 instances: ~90% efficiency (communication overhead)
  4 instances: ~80% efficiency
  8 instances: ~70% efficiency (gradient synchronization cost)

Rule: don't add more instances just because they're available.
Profile first: is the job GPU-bound, CPU-bound, or I/O-bound?
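
The efficiency figures above imply that total cost rises even as wall-clock time falls. A short worked calculation, assuming a job that takes 8 hours on one instance and a hypothetical rate of $4 per instance-hour; the efficiency table is copied from the list above.

```python
# Scaling efficiency by instance count, taken from the figures above.
EFFICIENCY = {1: 1.00, 2: 0.90, 4: 0.80, 8: 0.70}


def scaling_tradeoff(single_instance_hours, hourly_rate):
    """Return (instances, wall-clock hours, total cost) for each cluster size."""
    rows = []
    for instances, efficiency in EFFICIENCY.items():
        # Effective speedup is instances * efficiency, not instances.
        wall_clock = single_instance_hours / (instances * efficiency)
        cost = wall_clock * instances * hourly_rate
        rows.append((instances, round(wall_clock, 2), round(cost, 2)))
    return rows


table = scaling_tradeoff(single_instance_hours=8, hourly_rate=4.0)  # assumed inputs
```

Under these assumptions, going from 1 to 8 instances cuts the run from 8 hours to roughly 1.4 hours but raises the bill from $32 to about $46, so extra instances buy time, not savings.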

Inference Cost Optimization

Auto-scaling:

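  • Scale in to the minimum instance count during low traffic periods (nights, weekends); standard real-time endpoints keep at least one instance running, while asynchronous inference endpoints can scale to zero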
  • Scale out during peak traffic
  • Target tracking on InvocationsPerInstance is most reliable
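
Endpoint auto-scaling is configured through Application Auto Scaling in two calls: register the variant as a scalable target, then attach a target tracking policy. The sketch builds the arguments for both; the resource ID format, scalable dimension, and predefined metric name are the ones the service uses, while the endpoint name, capacities, target value, and cooldowns are assumptions.

```python
def target_tracking_config(endpoint_name, variant_name,
                           min_instances, max_instances,
                           invocations_per_instance):
    """Return (register_scalable_target kwargs, put_scaling_policy kwargs)."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"

    scalable_target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }
    scaling_policy = {
        "PolicyName": f"{endpoint_name}-invocations-target",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # seconds; scale in slowly (assumed)
            "ScaleOutCooldown": 60,   # seconds; scale out quickly (assumed)
        },
    }
    return scalable_target, scaling_policy


target, policy_args = target_tracking_config("fraud-endpoint", "AllTraffic", 1, 8, 500)
```

A longer scale-in cooldown than scale-out cooldown is a common choice because adding capacity late hurts latency more than removing it late hurts cost.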

Serverless Inference:

  • Pay per request — zero cost when idle
  • Best for: low-volume APIs, dev/test, intermittent traffic
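  • Break-even: roughly 20% endpoint utilization as a rule of thumb (below that, serverless is usually cheaper); the exact point depends on instance type, memory size, and request duration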

Multi-Model Endpoints:

  • Host thousands of models on one endpoint
  • Amortize instance cost across all models
  • Best for: per-customer models with sparse per-customer traffic

Inferentia/Graviton instances:

Instance family       | Savings (approximate)  | Use Case
ml.inf1 (Inferentia1) | Up to 70% vs ml.g4dn   | Standard DL inference
ml.inf2 (Inferentia2) | Up to 40% vs ml.p3     | Large model inference
ml.c7g (Graviton3)    | ~20% vs ml.c6i (CPU)   | CPU-based inference (tabular ML)

Batch Transform instead of endpoint:

  • If predictions don’t need to be real-time → use Batch Transform
  • No persistent endpoint = no 24/7 compute cost
  • Run nightly/weekly scoring jobs instead

Model compilation (Neo):

  • Faster inference = fewer instances needed = lower cost
  • Example: if compilation doubles throughput, half the instances can handle the same traffic (actual gains vary by model and target hardware)

Storage Cost Optimization

S3 storage tiers:

Tier                         | Storage cost vs. Standard (approx.) | Retrieval        | Use For
S3 Standard                  | Highest                             | Instant          | Active training data, recent models
S3 Standard-IA               | ~46% less                           | Instant          | Infrequently accessed datasets
S3 Glacier Instant Retrieval | ~83% less                           | Instant          | Archived datasets, compliance
S3 Glacier Flexible          | ~84% less                           | Minutes to hours | Long-term archive, regulatory
S3 Glacier Deep Archive      | ~95% less                           | 12-48 hrs        | Compliance archive, rarely needed

S3 Intelligent-Tiering:

  • Automatically moves objects between tiers based on access patterns
  • No retrieval fees, small monitoring fee
  • Best for: data with unknown or changing access patterns

Lifecycle rules for ML:

Training data (active) ──► 30 days ──► Standard-IA ──► 90 days ──► Glacier
Model artifacts ──────────► 60 days ──► Standard-IA ──► keep latest N versions only
Experiment logs ──────────► 30 days ──► Standard-IA ──► delete after 1 year
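
Two of the rules above can be written as an S3 lifecycle configuration. The sketch assumes hypothetical prefixes and reads the day counts as days since object creation, which is how S3 lifecycle rules count them. The "keep latest N versions" rule for model artifacts is left out because it depends on how model versions are laid out in the bucket.

```python
def ml_lifecycle_configuration():
    """Build a LifecycleConfiguration dict for put_bucket_lifecycle_configuration."""
    return {
        "Rules": [
            {
                "ID": "training-data-tiering",
                "Filter": {"Prefix": "training-data/"},   # hypothetical prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            },
            {
                "ID": "experiment-logs-expiry",
                "Filter": {"Prefix": "experiment-logs/"},  # hypothetical prefix
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                "Expiration": {"Days": 365},               # delete after 1 year
            },
        ]
    }


lifecycle = ml_lifecycle_configuration()
```

With boto3 this dictionary would be passed as the LifecycleConfiguration argument when applying the rules to a bucket.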

What to delete:

  • Failed experiment model artifacts (keep only the best run per experiment)
  • Intermediate processing outputs (temp data)
  • Old Docker image layers in ECR (use lifecycle policies)
  • CloudWatch logs older than retention period (set log group retention)

Cost Monitoring and Governance

AWS Cost Explorer:

  • Tag all SageMaker resources: project, team, environment, model-name
  • Filter costs by tag to see per-project ML spend
  • Identify cost spikes (unexpected training jobs, endpoints left running)

AWS Budgets:

  • Set monthly budget per tag (e.g., “$5,000/month for project X”)
  • Alert at 80% and 100% of budget
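  • Budget actions: automatically apply an IAM policy or SCP, or stop targeted EC2/RDS instances, when a threshold is crossed; stopping SageMaker resources requires custom automation (for example SNS → Lambda)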

Resource tagging strategy for ML:

Required tags:
  Environment: dev | staging | prod
  Team: data-science | ml-platform | product-ml
  Project: recommendation-engine | fraud-detection
  CostCenter: [business unit code]
  AutoShutdown: true | false  (custom: for cleanup automation)
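
A tagging strategy only helps cost reporting if it is enforced. The sketch below checks a resource's tags, in the Key/Value list format SageMaker APIs use, against the required set above; the allowed values for Environment and AutoShutdown come from the list, and the function name is hypothetical.

```python
# None means any non-empty value is accepted for that tag.
REQUIRED_TAGS = {
    "Environment": {"dev", "staging", "prod"},
    "Team": None,
    "Project": None,
    "CostCenter": None,
    "AutoShutdown": {"true", "false"},
}


def tag_problems(tags):
    """Return a list of missing or invalid required tags."""
    present = {tag["Key"]: tag["Value"] for tag in tags}
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        if key not in present or present[key] == "":
            problems.append(f"missing tag: {key}")
        elif allowed is not None and present[key] not in allowed:
            problems.append(f"invalid value for {key}: {present[key]}")
    return problems


good_tags = [
    {"Key": "Environment", "Value": "dev"},
    {"Key": "Team", "Value": "data-science"},
    {"Key": "Project", "Value": "fraud-detection"},
    {"Key": "CostCenter", "Value": "BU-042"},
    {"Key": "AutoShutdown", "Value": "true"},
]
bad_tags = [{"Key": "Environment", "Value": "production"}]
```

A check like this can run in a deployment pipeline before resources are created, so untagged endpoints never reach Cost Explorer as unattributed spend.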

Security Requirements → Service Mapping

Requirement                                     | AWS Service/Feature
Encrypt training data at rest                   | KMS (SSE-KMS on S3)
Encrypt model artifacts                         | KMS encryption on SageMaker training job
Audit who accessed the model                    | CloudTrail
Keep training data off public internet          | VPC + VPC Endpoints (PrivateLink)
Encrypt traffic between training containers     | EnableInterContainerTrafficEncryption
Isolate training from all network access        | EnableNetworkIsolation: true
Review low-confidence ML predictions            | Amazon A2I
Detect bias in model predictions                | SageMaker Clarify
Monitor for model accuracy degradation          | SageMaker Model Monitor (Model Quality)
Monitor for input data distribution shift       | SageMaker Model Monitor (Data Quality)
Prove which data trained production model       | SageMaker ML Lineage
Restrict which SageMaker actions users can take | IAM policies with condition keys
Org-wide encryption enforcement                 | Service Control Policies (SCPs)
HIPAA-compliant ML workloads                    | SageMaker in VPC + KMS + CloudTrail + BAA

Cost Strategy → When to Use

Strategy                | When to Apply
Spot Instances          | Training jobs > 30 min, can checkpoint, not latency-sensitive
Savings Plans           | Predictable training workload, committed spend for 1-3 years
Serverless Inference    | Fewer than a few hundred requests/day, can tolerate cold start
Multi-Model Endpoint    | > 10 models with sparse per-model traffic
Batch Transform         | Predictions don’t need real-time latency
Inferentia instances    | High-volume DL inference, willing to compile model
Graviton instances      | CPU-based inference (XGBoost, sklearn), ~20% cheaper than x86
Managed Warm Pools      | Rapid iteration during development (many short training jobs)
S3 Lifecycle policies   | Any S3 data older than 30 days that’s not accessed daily
Auto-scaling (scale-in) | Endpoints with predictable low-traffic windows; asynchronous endpoints can scale to zero
Model compilation (Neo) | Deployed model running at scale, want throughput improvement

Exam tip: The exam loves to ask “what is the MOST cost-effective approach?” For training: Spot + checkpointing. For inference with variable traffic: Serverless (low volume) or Auto-scaling (medium-high volume). For offline scoring: Batch Transform (no persistent endpoint).