Domain 4C: Security, Compliance & Cost

Exam Domain: 4 (ML Implementation and Operations, 20%)
Task: Apply security controls, compliance requirements, and cost optimization strategies to ML workloads

Security in ML — First Principles

ML security sits at the intersection of data security, infrastructure security, and model security. Each layer has distinct threats and controls.

┌─────────────────────────────────────────────────────────────────┐
│                   ML ATTACK SURFACE                            │
│                                                                 │
│  Training Data ──► Model Training ──► Model ──► Inference       │
│       │                  │              │           │           │
│  Data poisoning     Credential       Model       Adversarial   │
│  Privacy leakage    theft            stealing    inputs        │
│  Bias injection     Compute theft    IP theft    Model bypass  │
│                                                                 │
│  Controls:                                                      │
│  Encryption        IAM + VPC       Access ctrl  Input valid.   │
│  Access control    Logging         Encryption   Rate limiting  │
└─────────────────────────────────────────────────────────────────┘

ELI5: ML security protects three things: your data (secrets — training data, customer PII), your model (intellectual property — the result of expensive training), and your predictions (business logic — decisions that drive revenue or safety). Each needs different protections, just like a house needs locks on doors, a safe for valuables, and insurance for everything else.

Why this matters for the exam: Security questions appear frequently in Domain 4. Know which service handles which concern: KMS for encryption, IAM for access, VPC for network isolation, CloudTrail for audit, Model Monitor for operational drift.

Data Protection

Encryption at Rest

S3 encryption options:

Option	Key Management	Use When
SSE-S3	AWS manages (AES-256)	Default, no compliance requirement
SSE-KMS	AWS KMS (you choose key)	Audit trail needed, customer managed key
SSE-C	Customer provides key per request	Full key control outside AWS
Client-side	Encrypt before upload	Data must never be plaintext in AWS

SageMaker encryption:

Notebooks: encrypt EBS volume with KMS key
Training jobs: encrypt EBS attached to training instances, encrypt S3 output with KMS
Endpoints: encrypt EBS on inference instances
Feature Store: encrypt the online store and the offline store (S3) separately; each has its own KMS key setting
Default: SageMaker uses AWS-managed KMS key if you don’t specify one

Exam tip: If a question says “customer managed key” or “you control the key” — that’s SSE-KMS with a Customer Managed Key (CMK). If it says “AWS manages the key” — SSE-S3 or AWS-managed KMS key.
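As a minimal sketch of the SSE-KMS row above, the function below builds the arguments for an S3 `PutObject` call that requests KMS encryption with a customer managed key. The bucket name, object key, and key ARN are made-up placeholders, and the call itself is left as a comment so the snippet runs without AWS credentials.

```python
def sse_kms_put_args(bucket: str, key: str, body: bytes, kms_key_id: str) -> dict:
    """Build keyword arguments for S3 PutObject with SSE-KMS."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",  # SSE-KMS; "AES256" would select SSE-S3
        "SSEKMSKeyId": kms_key_id,          # customer managed key ARN or ID
    }


# Placeholder names: none of these resources exist.
args = sse_kms_put_args(
    "example-training-data",
    "train/part-0000.csv",
    b"a,b\n1,2\n",
    "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
)
# With credentials configured: boto3.client("s3").put_object(**args)
```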

Encryption in Transit

All SageMaker API calls use TLS 1.2+
Inter-container traffic for distributed training: not encrypted by default
- Enable with EnableInterContainerTrafficEncryption: true in training job config
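- Can lengthen training, especially for distributed deep learning (the ~10-20% figure is a rule of thumb; measure on your own job)
- Typically expected where HIPAA, PCI-DSS, or financial regulations call for encryption of data in transit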
VPC endpoints (PrivateLink): traffic never leaves AWS network

AWS KMS Deep Dive

How KMS Works

ELI5: KMS is a locked vault for your encryption keys. You put a key in the vault and give it a name. When you want to encrypt data, you ask the vault to encrypt it — the key NEVER leaves the vault. If someone steals your encrypted data, they still can’t read it because they don’t have the key, and the vault never hands the key out directly.

Key Types

Key Type	Who Creates It	Who Manages It	Use Case
AWS Managed Key	AWS	AWS	Default for most services (free)
Customer Managed Key (CMK)	You	You	Compliance, key rotation control
Customer Provided Key	You	You (external)	SSE-C, external HSM
AWS Owned Key	AWS	AWS	Service-internal, you can’t see it

Envelope Encryption

┌────────────────────────────────────────────────────────────────┐
│                  ENVELOPE ENCRYPTION                          │
│                                                                │
│  1. SageMaker requests a DATA KEY from KMS                    │
│                                                                │
│  KMS ──► generates Data Key (plaintext + encrypted copy)      │
│                                                                │
│  2. SageMaker uses plaintext Data Key to encrypt your data    │
│     Your data ──► [AES-256 encrypt with Data Key] ──► ciphertext│
│                                                                │
│  3. SageMaker stores encrypted Data Key alongside ciphertext  │
│     (plaintext Data Key is discarded from memory)             │
│                                                                │
│  4. To decrypt: KMS decrypts the Data Key ──► decrypt data    │
│                                                                │
│  Result: KMS key never touches your data directly.            │
│  Audit trail: every KMS call logged in CloudTrail.            │
└────────────────────────────────────────────────────────────────┘
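The four steps in the diagram can be walked through with a toy simulation. Everything here is a stand-in: `FakeKms` is a hypothetical in-memory class, not the KMS API, and the cipher is a simple SHA-256 keystream used in place of AES-256 so the example needs only the standard library. It is not secure and only illustrates the flow of keys.

```python
import hashlib
import os


def _keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode). Stand-in for AES-256; NOT secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


class FakeKms:
    """Hypothetical stand-in for KMS: the root key never leaves this object."""

    def __init__(self) -> None:
        self._root_key = os.urandom(32)

    def generate_data_key(self) -> dict:
        plaintext = os.urandom(32)
        return {"Plaintext": plaintext,
                "CiphertextBlob": _keystream_xor(self._root_key, plaintext)}

    def decrypt(self, ciphertext_blob: bytes) -> bytes:
        return _keystream_xor(self._root_key, ciphertext_blob)


def envelope_encrypt(kms: FakeKms, data: bytes) -> dict:
    data_key = kms.generate_data_key()                        # step 1
    ciphertext = _keystream_xor(data_key["Plaintext"], data)  # step 2
    # Step 3: only the encrypted data key is stored next to the ciphertext.
    return {"ciphertext": ciphertext, "encrypted_key": data_key["CiphertextBlob"]}


def envelope_decrypt(kms: FakeKms, envelope: dict) -> bytes:
    plaintext_key = kms.decrypt(envelope["encrypted_key"])    # step 4
    return _keystream_xor(plaintext_key, envelope["ciphertext"])


kms = FakeKms()
envelope = envelope_encrypt(kms, b"patient_id,diagnosis\n42,flu\n")
restored = envelope_decrypt(kms, envelope)
```

The stored envelope is useless without a call back to the key service, which is what makes every decryption auditable.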

Why KMS Matters for Compliance

Every Encrypt, Decrypt, GenerateDataKey call logged in CloudTrail
Key policies control EXACTLY who can use each key
Automatic key rotation: optional for customer managed keys (yearly by default), always on for AWS managed keys
Cross-account access: share encrypted data across accounts via key policy
Key deletion has mandatory 7-30 day waiting period (prevents accidental loss)

Network Security

VPC Configuration for SageMaker

ELI5: Putting SageMaker in a VPC is like moving your ML workload from a public office building into a private secure facility. The work still gets done, but now there are walls, security guards (security groups), and controlled entry points (VPC endpoints) — no random internet traffic gets in.

┌─────────────────────────────────────────────────────────────────┐
│                    VPC ARCHITECTURE FOR ML                     │
│                                                                 │
│  ┌────────────────── VPC ──────────────────────────────────┐   │
│  │                                                          │   │
│  │  ┌─── Private Subnet ───────────────────────────────┐   │   │
│  │  │                                                   │   │   │
│  │  │  SageMaker Training    SageMaker Endpoint         │   │   │
│  │  │  Instances             Instances                  │   │   │
│  │  │  (no public IP)        (no public IP)             │   │   │
│  │  │                                                   │   │   │
│  │  └───────────────────────────────────────────────────┘   │   │
│  │             │                        │                   │   │
│  │    NAT Gateway              VPC Endpoints                │   │
│  │    (for internet:           (for AWS services:           │   │
│  │     pip install)             S3, ECR, KMS, etc.)         │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Without VPC Endpoints: traffic goes Internet ──► AWS service  │
│  With VPC Endpoints:    traffic stays inside AWS network       │
└─────────────────────────────────────────────────────────────────┘

Network Controls

Control	Scope	What It Does
Security Groups	Instance-level (ENI)	Stateful firewall: allow rules only, by port, protocol, and source; anything not explicitly allowed is denied
Network ACLs	Subnet-level	Stateless firewall: allow/deny by IP range and port
VPC Endpoints (Interface)	Service access	Private connectivity to AWS services via PrivateLink
VPC Endpoints (Gateway)	S3/DynamoDB	Route table entry for private S3/DynamoDB access
NAT Gateway	Outbound internet	Allow private subnet instances to reach internet

Why VPC matters for ML:

Training data often contains PII or trade secrets — can’t traverse public internet
Regulatory requirements (HIPAA, GDPR) often mandate private network paths
Prevents data exfiltration: training instance can only talk to approved S3 buckets

SageMaker-Specific Network Flags

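# Network settings. Processing jobs nest these under NetworkConfig;
# CreateTrainingJob takes the same three fields at the top level of the request.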
NetworkConfig:
  EnableNetworkIsolation: true          # No internet access AT ALL
  EnableInterContainerTrafficEncryption: true  # Encrypt distributed training traffic
  VpcConfig:
    SecurityGroupIds: [sg-xxx]
    Subnets: [subnet-xxx]

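EnableNetworkIsolation: true is the most restrictive setting: the container can make no outbound network calls at all, not even to S3 or other AWS services (SageMaker itself still copies input data in and model artifacts out)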
- Use when: regulatory requirement for air-gapped training
- Limitation: can’t download pip packages during training (must bake into container)
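For a training job, the same flags can be assembled as a request fragment. This is a sketch under the assumption that you call the `CreateTrainingJob` API directly, where these three fields sit at the top level of the request; the subnet and security group IDs are placeholders.

```python
def network_settings(subnets: list[str], security_groups: list[str],
                     isolate: bool = True, encrypt_traffic: bool = True) -> dict:
    """Network-related fields of a CreateTrainingJob request."""
    if not subnets or not security_groups:
        raise ValueError("VpcConfig needs at least one subnet and one security group")
    return {
        "EnableNetworkIsolation": isolate,
        "EnableInterContainerTrafficEncryption": encrypt_traffic,
        "VpcConfig": {"Subnets": subnets, "SecurityGroupIds": security_groups},
    }


# Placeholder IDs; merge the result into the rest of the job definition, e.g.
# boto3.client("sagemaker").create_training_job(**job_definition, **settings)
settings = network_settings(["subnet-0example"], ["sg-0example"])
```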

Identity & Access Management

IAM for SageMaker

Two IAM principals to understand:

Principal	What They Are	Key Permissions Needed
SageMaker Execution Role	Role assumed BY SageMaker service	S3 read/write, ECR pull, CloudWatch logs, KMS
User/Team IAM Role	Role assumed BY humans/apps	Create/manage SageMaker resources

Principle of Least Privilege in ML:

Training Job needs:
  ✓ s3:GetObject on training data bucket
  ✓ s3:PutObject on model artifact bucket
  ✓ ecr:GetDownloadUrlForLayer, ecr:BatchGetImage (pull container)
  ✓ cloudwatch:PutMetricData (log metrics)
  ✗ s3:DeleteObject (NOT needed)
  ✗ iam:CreateRole (NOT needed)
  ✗ ec2:TerminateInstances (NOT needed)
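The checklist translates into a policy document along these lines. This sketch covers only the S3, ECR layer download, and CloudWatch metric actions named above; a working execution role usually needs a few more (for example `s3:ListBucket` and CloudWatch Logs permissions). The bucket names are hypothetical.

```python
def training_role_policy(data_bucket: str, artifact_bucket: str) -> dict:
    """Least-privilege policy sketch for a training job's execution role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "ReadTrainingData", "Effect": "Allow",
             "Action": ["s3:GetObject"],
             "Resource": f"arn:aws:s3:::{data_bucket}/*"},
            {"Sid": "WriteModelArtifacts", "Effect": "Allow",
             "Action": ["s3:PutObject"],
             "Resource": f"arn:aws:s3:::{artifact_bucket}/*"},
            {"Sid": "PullContainerAndEmitMetrics", "Effect": "Allow",
             "Action": ["ecr:GetDownloadUrlForLayer", "cloudwatch:PutMetricData"],
             "Resource": "*"},
        ],
    }


policy = training_role_policy("example-train-data", "example-model-artifacts")
allowed_actions = [a for stmt in policy["Statement"] for a in stmt["Action"]]
```

Scoping `s3:GetObject` and `s3:PutObject` to different buckets means a compromised job can read its inputs and write its outputs, but cannot overwrite or delete the training data.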

Key IAM condition keys for SageMaker:

Condition Key	Restricts
`sagemaker:VpcSecurityGroupIds`	Require VPC security group
`sagemaker:VpcSubnets`	Require specific subnets
`sagemaker:RootAccess`	Control root access in notebooks
`sagemaker:InterContainerTrafficEncryption`	Require encryption
`aws:RequestedRegion`	Restrict to specific regions

Service Control Policies (SCPs):

Organization-level guardrails applied to member accounts in an AWS Organization (the management account itself is not restricted by SCPs)
Cannot be overridden by account-level IAM policies
Use case: “No one in any account can launch a SageMaker endpoint without KMS encryption”
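A guardrail of this kind might look like the sketch below, which uses two of the condition keys from the table above to deny training jobs that skip the VPC or leave inter-container traffic unencrypted. The statement IDs are arbitrary, and a real SCP would likely cover more actions than `CreateTrainingJob`.

```python
def sagemaker_guardrail_scp() -> dict:
    """SCP sketch: deny training jobs that lack a VPC or traffic encryption."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "DenyTrainingOutsideVpc",
             "Effect": "Deny",
             "Action": "sagemaker:CreateTrainingJob",
             "Resource": "*",
             # "Null": "true" matches requests where the key is absent.
             "Condition": {"Null": {"sagemaker:VpcSubnets": "true"}}},
            {"Sid": "DenyUnencryptedContainerTraffic",
             "Effect": "Deny",
             "Action": "sagemaker:CreateTrainingJob",
             "Resource": "*",
             "Condition": {"Bool": {"sagemaker:InterContainerTrafficEncryption": "false"}}},
        ],
    }


scp = sagemaker_guardrail_scp()
```

Because both statements are explicit denies, no Allow in an account-level IAM policy can override them.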

Cross-Account ML Access Patterns

┌─── Dev Account ───┐         ┌─── Prod Account ───┐
│                   │         │                     │
│  Train model      │         │  Deploy endpoint    │
│  Register in      │────────►│  Pull approved      │
│  Model Registry   │         │  model from registry│
│                   │         │                     │
└───────────────────┘         └─────────────────────┘

Requirements:
- Model Registry resource policy: allow prod account to read
- Prod account IAM role: allow SageMaker to use cross-account model
- KMS key policy: allow prod account to decrypt model artifacts
- S3 bucket policy: allow prod account to read model artifacts
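Two of these requirements, the KMS key policy and the S3 bucket policy, are statements attached to resources in the dev account. A minimal sketch follows, assuming the whole prod account is trusted as principal; the account ID, bucket, and prefix are placeholders, and in practice you would likely narrow the principal to a specific role.

```python
def cross_account_statements(prod_account_id: str, bucket: str, prefix: str) -> dict:
    """Resource-policy statements that let a prod account read encrypted artifacts."""
    principal = {"AWS": f"arn:aws:iam::{prod_account_id}:root"}
    kms_statement = {
        "Sid": "AllowProdDecrypt",
        "Effect": "Allow",
        "Principal": principal,
        "Action": ["kms:Decrypt", "kms:DescribeKey"],
        "Resource": "*",  # in a key policy, "*" refers to the key the policy is attached to
    }
    s3_statement = {
        "Sid": "AllowProdReadArtifacts",
        "Effect": "Allow",
        "Principal": principal,
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
    }
    return {"kms": kms_statement, "s3": s3_statement}


statements = cross_account_statements("444455556666", "example-model-artifacts", "approved")
```

Both statements are needed together: S3 read access alone fails at decryption time if the key policy does not also trust the prod account.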

Monitoring & Logging

Amazon CloudWatch for ML

Training metrics:

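Custom metrics: print values to the training log and declare regex metric definitions on the job; SageMaker extracts the matches and publishes them to CloudWatch
Built-in instance metrics: CPU/GPU utilization, memory, GPU memory, disk utilization
SageMaker Experiments: log metrics from code with Run.log_metric() (tracked in Experiments rather than extracted from logs)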

Endpoint metrics (built-in):

Metric	What It Measures
Invocations	Total number of requests
InvocationsPerInstance	Requests per instance (auto-scaling trigger)
ModelLatency	Time spent in model (microseconds)
OverheadLatency	SageMaker overhead (not model time)
Invocation4XXErrors	Client errors (bad requests)
Invocation5XXErrors	Server errors (model failures)
ModelSetupTime	Time to launch compute and load the model for a serverless endpoint (cold start)

Alarms: Trigger actions (SNS, Auto Scaling, Lambda) when metrics cross thresholds.
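As a sketch of such an alarm, the function below builds the arguments for CloudWatch `PutMetricAlarm` on `ModelLatency`. The endpoint name, variant name, SNS topic ARN, and the 500 ms threshold are illustrative assumptions; note the conversion to microseconds, the unit listed in the table above.

```python
def latency_alarm_args(endpoint: str, variant: str,
                       threshold_ms: float, sns_topic_arn: str) -> dict:
    """Keyword arguments for CloudWatch PutMetricAlarm on endpoint ModelLatency."""
    return {
        "AlarmName": f"{endpoint}-{variant}-model-latency",
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint},
            {"Name": "VariantName", "Value": variant},
        ],
        "Statistic": "Average",
        "Period": 60,                      # seconds per datapoint
        "EvaluationPeriods": 3,            # three consecutive breaches
        "Threshold": threshold_ms * 1000,  # ModelLatency is reported in microseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


alarm = latency_alarm_args(
    "fraud-endpoint", "AllTraffic", 500,
    "arn:aws:sns:us-east-1:111122223333:ml-alerts",
)
# With credentials configured: boto3.client("cloudwatch").put_metric_alarm(**alarm)
```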

AWS CloudTrail for ML Governance

Records SageMaker API calls (who, what, when, from where); the runtime InvokeEndpoint call is the documented exception
Logged calls: CreateTrainingJob, CreateEndpoint, DeleteModel, UpdateEndpoint, etc.
Includes: caller identity (IAM user/role), source IP, request parameters
Stored in S3, queryable with Athena
Use for: compliance audits, forensic investigation, change tracking

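Exam tip: CloudTrail is the audit trail. If a question asks “who deleted the model?”, CloudTrail is the answer. To “prove that no unauthorized user accessed training data” in S3, CloudTrail still applies, but object-level reads are data events that must be enabled on the trail.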
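Answering "who deleted the model?" comes down to filtering trail records by event name. The sketch below does this over a few hand-written records; the field names follow the CloudTrail record format, but the events, users, and IP addresses are invented for illustration.

```python
# Invented sample records using CloudTrail field names.
SAMPLE_RECORDS = [
    {"eventTime": "2025-01-10T09:15:00Z", "eventSource": "sagemaker.amazonaws.com",
     "eventName": "CreateTrainingJob", "sourceIPAddress": "203.0.113.10",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"}},
    {"eventTime": "2025-01-11T17:42:00Z", "eventSource": "sagemaker.amazonaws.com",
     "eventName": "DeleteModel", "sourceIPAddress": "198.51.100.7",
     "userIdentity": {"arn": "arn:aws:sts::111122223333:assumed-role/MLOps/bob"}},
    {"eventTime": "2025-01-11T18:00:00Z", "eventSource": "s3.amazonaws.com",
     "eventName": "DeleteObject", "sourceIPAddress": "198.51.100.7",
     "userIdentity": {"arn": "arn:aws:sts::111122223333:assumed-role/MLOps/bob"}},
]


def who_called(records: list[dict], event_name: str) -> list[tuple[str, str, str]]:
    """Return (time, caller ARN, source IP) for matching SageMaker API calls."""
    return [
        (r["eventTime"], r["userIdentity"]["arn"], r["sourceIPAddress"])
        for r in records
        if r.get("eventSource") == "sagemaker.amazonaws.com"
        and r.get("eventName") == event_name
    ]


deletions = who_called(SAMPLE_RECORDS, "DeleteModel")
```

At scale the same filter is usually expressed as an Athena query over the trail's S3 logs rather than in application code.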

SageMaker Model Monitor — Deep Dive

ELI5: Model Monitor is a smoke detector for your ML model. It constantly checks two things: (1) Has the data feeding your model changed significantly since training? (data drift) and (2) Are the model’s predictions getting worse over time? (model quality drift). If either triggers, it raises an alarm so you can retrain.

Four monitoring types:

Monitor Type	What It Detects	How
Data Quality	Input data drift (schema, statistics)	Compare incoming data to training baseline
Model Quality	Prediction accuracy degradation	Compare predictions to ground truth labels
Bias Drift	Changes in model fairness	Track bias metrics over time
Feature Attribution Drift	Changes in feature importance (SHAP)	Compare SHAP values to baseline

How it works:

Step 1: Create Baseline
  Training data ──► Baseline job ──► Statistics + Constraints JSON
  (runs once)       (SageMaker        (stored in S3)
                    Processing)

Step 2: Enable Data Capture
  Endpoint ──► [Model Monitor captures] ──► Inference requests + responses to S3

Step 3: Schedule Monitoring
  Every hour/day ──► Monitoring job runs:
  Captured data vs. Baseline ──► Violations report ──► CloudWatch metrics

Step 4: Alert
  CloudWatch metric exceeds threshold ──► Alarm ──► SNS ──► Re-train trigger
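The baseline-then-compare idea behind steps 1 to 3 can be illustrated with a deliberately simplified check on a single numeric feature. This is not Model Monitor's actual algorithm or constraint format; the statistic (mean shift measured in baseline standard deviations) and the threshold of 2.0 are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def build_baseline(training_values: list[float]) -> dict:
    """Step 1 (simplified): summarize the training data for one feature."""
    return {"mean": mean(training_values), "std": pstdev(training_values)}


def check_drift(baseline: dict, captured_values: list[float],
                max_std_shift: float = 2.0) -> dict:
    """Step 3 (simplified): flag a violation if the captured mean moved too far."""
    shift = abs(mean(captured_values) - baseline["mean"])
    limit = max_std_shift * baseline["std"]
    return {"shift": shift, "limit": limit, "violation": shift > limit}


baseline = build_baseline([10, 12, 11, 13, 9, 11])
stable = check_drift(baseline, [11, 10, 12])
drifted = check_drift(baseline, [15, 16, 14, 17])
```

Here the `stable` batch has the same mean as the baseline, while the `drifted` batch has moved by 4.5 against a limit of roughly 2.6, so only the second would raise an alert.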

Why drift matters:

Model trained on 2023 data may fail on 2025 data (distribution shift)
User behavior changes (e.g., pandemic changed shopping patterns)
Data pipeline upstream changes silently alter feature distributions
Regulatory requirement: show model is still performing as designed

Integration:

CloudWatch metrics per feature: feature.mean, feature.std_dev, completeness
Violation report: JSON listing which constraints were violated
SageMaker Pipelines: trigger retraining when violations exceed threshold

Compliance & Governance

SageMaker Clarify — Bias and Explainability

Pre-training bias analysis:

Analyze training dataset BEFORE training
Detect: class imbalance, label correlation with sensitive attributes
Metrics: Class Imbalance (CI), Difference in Proportions of Labels (DPL), KL Divergence

Post-training bias analysis:

Analyze model predictions for fairness
Detect: disparate impact across demographic groups
Metrics: Difference in Positive Proportions in Predicted Labels (DPPL), Disparate Impact (DI)
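To make the metrics concrete, here are three of them computed by hand, following the definitions in the Clarify documentation: Class Imbalance and Difference in Proportions of Labels before training, and Disparate Impact on predictions. Group `a` is the advantaged facet and `d` the disadvantaged one; the counts are made up.

```python
def class_imbalance(n_a: int, n_d: int) -> float:
    """CI = (n_a - n_d) / (n_a + n_d); 0 means both facets are equally represented."""
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions_of_labels(pos_a: int, n_a: int,
                                        pos_d: int, n_d: int) -> float:
    """DPL = q_a - q_d, where q is the share of positive labels in each facet."""
    return pos_a / n_a - pos_d / n_d


def disparate_impact(pred_pos_a: int, n_a: int, pred_pos_d: int, n_d: int) -> float:
    """DI = q'_d / q'_a, the ratio of predicted-positive rates; 1.0 means parity."""
    return (pred_pos_d / n_d) / (pred_pos_a / n_a)


# Made-up dataset: 800 rows in facet a, 200 rows in facet d.
ci = class_imbalance(800, 200)
dpl = difference_in_proportions_of_labels(400, 800, 60, 200)
di = disparate_impact(440, 800, 66, 200)
```

With these counts CI is 0.6, DPL is about 0.2 (50% versus 30% positive labels), and DI is about 0.6, meaning the disadvantaged facet receives positive predictions at roughly 60% of the advantaged facet's rate.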

Explainability (SHAP values):

Per-prediction feature attribution: “Why did this model return THAT prediction?”
SHAP (SHapley Additive exPlanations): mathematically sound attribution
Global explanations: which features matter most overall
Local explanations: which features drove THIS specific prediction

When required:

Financial services: credit decisions (ECOA, FCRA compliance)
Healthcare: clinical decision support (FDA AI/ML guidelines)
HR: hiring/promotion decisions (EEOC compliance)
Insurance: risk pricing (state insurance regulations)

Exam tip: Clarify is the answer when a question mentions “explain why the model made this decision” or “detect bias in model predictions” or “comply with fairness regulations.”

ML Lineage Tracking

SageMaker ML Lineage (automatic):

Dataset ──► Processing Job ──► Processed Dataset ──► Training Job ──► Model ──► Endpoint
   │             │                    │                   │              │          │
Artifact      Action              Artifact             Action         Artifact   Action
                                                                        │
                                                                   Model Registry
                                                                   (approval)

Use cases:

Auditor asks: “What training data was used for the production model?” → query lineage
Incident response: “Which endpoints use the affected model version?” → query lineage
Reproducibility: recreate exact training conditions for a model version
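Both audit questions are graph traversals, one walking upstream from the model and one walking downstream. The toy sketch below uses a hand-built dictionary with invented entity names in place of SageMaker's lineage store; against a real account the lineage query API would supply the graph.

```python
from collections import deque

# Invented lineage graph: each key points to the entities derived from it.
LINEAGE = {
    "dataset-raw": ["processing-job"],
    "processing-job": ["dataset-processed"],
    "dataset-processed": ["training-job"],
    "training-job": ["model-v3"],
    "model-v3": ["endpoint-prod", "endpoint-canary"],
}


def reachable(edges: dict[str, list[str]], start: str) -> list[str]:
    """Breadth-first walk returning every entity reachable from start."""
    seen: list[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen


def invert(edges: dict[str, list[str]]) -> dict[str, list[str]]:
    """Reverse every edge so the same walk answers upstream questions."""
    inverted: dict[str, list[str]] = {}
    for src, dsts in edges.items():
        for dst in dsts:
            inverted.setdefault(dst, []).append(src)
    return inverted


upstream_of_model = reachable(invert(LINEAGE), "model-v3")   # what produced it?
downstream_of_model = reachable(LINEAGE, "model-v3")         # what depends on it?
```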

Model Cards

Document ML model details for transparency and accountability.

Intended use, out-of-scope use cases
Training data description, evaluation results
Limitations, biases, performance across demographic groups
Contact information, version history

Cost Optimization

The Three Cost Buckets

ELI5: ML costs come from three buckets: training (renting powerful computers to learn), inference (renting computers to make predictions 24/7), and storage (storing your data and models in S3). The strategies to optimize each bucket are completely different — like saving money on rent, groceries, and utilities require different approaches.

┌─────────────────────────────────────────────────────────────────┐
│                    ML COST BREAKDOWN                           │
├──────────────────┬──────────────────┬──────────────────────────┤
│   TRAINING       │   INFERENCE      │   STORAGE                │
├──────────────────┼──────────────────┼──────────────────────────┤
│ Compute-heavy    │ Always-on cost   │ Scales with data volume  │
│ Periodic spikes  │ Predictable      │ Low $/GB but accumulates │
│ GPU-intensive    │ Latency matters  │                          │
├──────────────────┼──────────────────┼──────────────────────────┤
│ Spot Instances   │ Auto-scaling     │ S3 Lifecycle policies    │
│ Right-sizing     │ Serverless       │ Intelligent-Tiering      │
│ Warm Pools       │ Multi-Model EP   │ Delete stale artifacts   │
│ Savings Plans    │ Inferentia       │                          │
└──────────────────┴──────────────────┴──────────────────────────┘

Training Cost Optimization

Spot Instances (most impactful):

Up to 90% cheaper than On-Demand instances
Risk: instance can be interrupted with 2-minute warning
Mitigation: checkpointing — save model state to S3 periodically, resume on new instance
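SageMaker syncs the local checkpoint directory to checkpoint_s3_uri and restores it when the job restarts; the training script must still write checkpoints and load the latest one on startup (built-in algorithms that support checkpointing handle this themselves)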
Best for: training jobs > 1 hour where interruption is acceptable
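In terms of the `CreateTrainingJob` request, Spot training with checkpointing comes down to three fields, sketched below. The S3 URI and time limits are placeholders. In the SageMaker Python SDK the corresponding Estimator arguments are `use_spot_instances`, `max_run`, `max_wait`, and `checkpoint_s3_uri`.

```python
def spot_training_settings(checkpoint_s3_uri: str,
                           max_run_seconds: int, max_wait_seconds: int) -> dict:
    """Spot-related fields of a CreateTrainingJob request."""
    if max_wait_seconds < max_run_seconds:
        # The wait limit includes time spent waiting for Spot capacity,
        # so it cannot be shorter than the runtime limit.
        raise ValueError("max_wait_seconds must be >= max_run_seconds")
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_run_seconds,
            "MaxWaitTimeInSeconds": max_wait_seconds,
        },
        "CheckpointConfig": {
            "S3Uri": checkpoint_s3_uri,
            "LocalPath": "/opt/ml/checkpoints",  # directory the script writes to
        },
    }


spot = spot_training_settings("s3://example-bucket/checkpoints/run-01", 3600, 7200)
```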

Right-sizing instances:

Mistake	Fix
Using ml.p3.16xlarge for scikit-learn	Use ml.m5.xlarge
Using ml.m5.large for transformer training	Use ml.p3.2xlarge
Multi-GPU when single GPU fits the model	Profile GPU utilization first
100GB EBS when model needs 10GB	Size EBS to actual data needs

Managed Warm Pools:

Keep instances warm between training jobs (avoid instance startup overhead)
You pay for the instances while they sit idle in the warm pool (up to the keep-alive period you set), so it pays off only when jobs run back to back
Good for: iterative development with frequent short training runs

SageMaker Savings Plans:

Commit to a $/hour spend for 1 or 3 years
Get up to 64% discount vs. On-Demand
Flexible: applies to training, processing, transform jobs across instance types
vs. Reserved Instances: Savings Plans are more flexible (no specific instance commitment)

Distributed training cost vs. speed:

Adding more instances has diminishing returns:
  1 instance:  100% efficiency
  2 instances: ~90% efficiency (communication overhead)
  4 instances: ~80% efficiency
  8 instances: ~70% efficiency (gradient synchronization cost)

Rule: don't add more instances just because they're available.
Profile first: is the job GPU-bound, CPU-bound, or I/O-bound?
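The efficiency figures above imply a simple cost model: wall-clock time falls as instances are added, but total cost rises, because cost is proportional to 1/efficiency. The sketch below works this through for a hypothetical 8-hour single-instance job at a made-up price of $3 per instance-hour.

```python
# Scaling efficiency per instance count, taken from the figures above.
EFFICIENCY = {1: 1.0, 2: 0.9, 4: 0.8, 8: 0.7}


def run_estimate(instances: int, single_instance_hours: float,
                 hourly_price: float) -> tuple[float, float]:
    """Return (wall-clock hours, total cost) for a distributed run."""
    effective_speedup = instances * EFFICIENCY[instances]
    hours = single_instance_hours / effective_speedup
    cost = hours * instances * hourly_price
    return hours, cost


# Hypothetical job: 8 hours on one instance at $3 per instance-hour.
estimates = {n: run_estimate(n, 8.0, 3.0) for n in EFFICIENCY}
```

Under these assumptions one instance takes 8 hours for $24, while four instances finish in about 2.5 hours for about $30: a 3.2x speedup bought with a 25% higher bill. Whether that trade is worth it depends on how much the faster turnaround is worth.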

Inference Cost Optimization

Auto-scaling:

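Scale in to the minimum instance count during low-traffic periods (nights, weekends); standard real-time endpoints keep at least one instance, while asynchronous endpoints can scale to zero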
Scale out during peak traffic
Target tracking on InvocationsPerInstance is most reliable
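Target tracking is configured through Application Auto Scaling in two calls: register the variant as a scalable target, then attach a policy. A sketch of both request payloads follows; the endpoint and variant names, capacity limits, target of 100 invocations per instance, and cooldowns are illustrative values, not recommendations.

```python
def scaling_requests(endpoint: str, variant: str, min_instances: int,
                     max_instances: int, target_invocations: float) -> tuple[dict, dict]:
    """Build RegisterScalableTarget and PutScalingPolicy arguments for a variant."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint}/variant/{variant}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{endpoint}-invocations-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleInCooldown": 300,   # seconds to wait before removing capacity
            "ScaleOutCooldown": 60,   # seconds to wait before adding more
        },
    }
    return target, policy


target, policy = scaling_requests("fraud-endpoint", "AllTraffic", 1, 4, 100)
# client = boto3.client("application-autoscaling")
# client.register_scalable_target(**target); client.put_scaling_policy(**policy)
```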

Serverless Inference:

Pay per request — zero cost when idle
Best for: low-volume APIs, dev/test, intermittent traffic
Break-even: ~20% endpoint utilization (below that, serverless is cheaper)

Multi-Model Endpoints:

Host thousands of models on one endpoint
Amortize instance cost across all models
Best for: per-customer models with sparse per-customer traffic

Inferentia/Graviton instances:

Chip	Savings vs. GPU	Use Case
ml.inf1 (Inferentia1)	Up to 70% vs ml.g4dn	Standard DL inference
ml.inf2 (Inferentia2)	Up to 40% vs ml.p3	Large model inference
ml.c7g (Graviton3)	~20% vs ml.c6i	CPU-based inference (tabular ML)

Batch Transform instead of endpoint:

If predictions don’t need to be real-time → use Batch Transform
No persistent endpoint = no 24/7 compute cost
Run nightly/weekly scoring jobs instead

Model compilation (Neo):

Faster inference = fewer instances needed = lower cost
2x throughput = can handle same traffic with half the instances

Storage Cost Optimization

S3 storage tiers:

Tier	Storage cost vs. Standard (approx.)	Retrieval	Use For
S3 Standard	Baseline (highest)	Instant	Active training data, recent models
S3 Standard-IA	~46% less	Instant (per-GB retrieval fee)	Infrequently accessed datasets
S3 Glacier Instant Retrieval	~83% less (about 68% less than Standard-IA)	Instant (per-GB retrieval fee)	Archived datasets, compliance
S3 Glacier Flexible	~84% less	Minutes to hours	Long-term archive, regulatory
S3 Glacier Deep Archive	~95% less	12-48 hrs	Compliance archive, rarely needed

S3 Intelligent-Tiering:

Automatically moves objects between tiers based on access patterns
No retrieval fees, small monitoring fee
Best for: data with unknown or changing access patterns

Lifecycle rules for ML:

Training data (active) ──► 30 days ──► Standard-IA ──► 90 days ──► Glacier
Model artifacts ──────────► 60 days ──► Standard-IA ──► keep latest N versions only
Experiment logs ──────────► 30 days ──► Standard-IA ──► delete after 1 year
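Two of these rules are sketched below as an S3 lifecycle configuration. The prefixes are hypothetical, and the day counts are read as object age since creation (so the Glacier transition happens at day 90, not 90 days after the move to Standard-IA), which is an assumption about the intent of the diagram.

```python
def ml_lifecycle_configuration() -> dict:
    """Lifecycle rules for training data and experiment logs."""
    return {
        "Rules": [
            {
                "ID": "training-data-tiering",
                "Filter": {"Prefix": "training-data/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            },
            {
                "ID": "experiment-logs-expiry",
                "Filter": {"Prefix": "experiment-logs/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                "Expiration": {"Days": 365},  # delete after one year
            },
        ]
    }


lifecycle = ml_lifecycle_configuration()
# With credentials configured:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-ml-bucket", LifecycleConfiguration=lifecycle)
```

The "keep latest N versions only" rule for model artifacts is left out here, since it depends on whether the bucket is versioned.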

What to delete:

Failed experiment model artifacts (keep only the best run per experiment)
Intermediate processing outputs (temp data)
Old Docker image layers in ECR (use lifecycle policies)
CloudWatch logs older than retention period (set log group retention)

Cost Monitoring and Governance

AWS Cost Explorer:

Tag all SageMaker resources (project, team, environment, model-name) and activate those tags as cost allocation tags in Billing so Cost Explorer can filter on them
Filter costs by tag to see per-project ML spend
Identify cost spikes (unexpected training jobs, endpoints left running)

AWS Budgets:

Set monthly budget per tag (e.g., “$5,000/month for project X”)
Alert at 80% and 100% of budget
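Budget actions: automatically apply an IAM policy or SCP, or stop specific EC2/RDS instances, when a threshold is crossed; stopping SageMaker endpoints needs custom automation (for example, a budget alert to SNS that invokes a Lambda function)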

Resource tagging strategy for ML:

Required tags:
  Environment: dev | staging | prod
  Team: data-science | ml-platform | product-ml
  Project: recommendation-engine | fraud-detection
  CostCenter: [business unit code]
  AutoShutdown: true | false  (custom: for cleanup automation)
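A tagging strategy only helps if it is enforced, so a small validator is a natural companion. The sketch below checks a resource's tags against the list above; it treats `Project` and `CostCenter` as free-form values, which is an assumption, since the text gives only examples for them.

```python
# Allowed values per required tag; None means any non-missing value is accepted.
REQUIRED_TAGS = {
    "Environment": {"dev", "staging", "prod"},
    "Team": {"data-science", "ml-platform", "product-ml"},
    "Project": None,
    "CostCenter": None,
    "AutoShutdown": {"true", "false"},
}


def tag_problems(tags: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; empty means the tags comply."""
    problems = []
    for name, allowed in REQUIRED_TAGS.items():
        value = tags.get(name)
        if value is None:
            problems.append(f"missing tag: {name}")
        elif allowed is not None and value not in allowed:
            problems.append(f"invalid value for {name}: {value}")
    return problems


compliant = tag_problems({
    "Environment": "prod", "Team": "ml-platform", "Project": "fraud-detection",
    "CostCenter": "BU-042", "AutoShutdown": "false",
})
non_compliant = tag_problems({"Environment": "qa", "Team": "data-science"})
```

A check like this could run in a CI step or a scheduled cleanup job before resources with `AutoShutdown: true` are stopped.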

Security Requirements → Service Mapping

Requirement	AWS Service/Feature
Encrypt training data at rest	KMS (SSE-KMS on S3)
Encrypt model artifacts	KMS encryption on SageMaker training job
Audit who accessed the model	CloudTrail
Keep training data off public internet	VPC + VPC Endpoints (PrivateLink)
Encrypt traffic between training containers	`EnableInterContainerTrafficEncryption`
Isolate training from all network access	`EnableNetworkIsolation: true`
Review low-confidence ML predictions	Amazon A2I
Detect bias in model predictions	SageMaker Clarify
Monitor for model accuracy degradation	SageMaker Model Monitor (Model Quality)
Monitor for input data distribution shift	SageMaker Model Monitor (Data Quality)
Prove which data trained production model	SageMaker ML Lineage
Restrict which SageMaker actions users can take	IAM policies with condition keys
Org-wide encryption enforcement	Service Control Policies (SCPs)
HIPAA-compliant ML workloads	SageMaker in VPC + KMS + CloudTrail + BAA

Cost Strategy → When to Use

Strategy	When to Apply
Spot Instances	Training jobs > 30 min, can checkpoint, not latency-sensitive
Savings Plans	Predictable training workload, committed spend for 1-3 years
Serverless Inference	< few hundred requests/day, can tolerate cold start
Multi-Model Endpoint	> 10 models with sparse per-model traffic
Batch Transform	Predictions don’t need real-time latency
Inferentia instances	High-volume DL inference, willing to compile model
Graviton instances	CPU-based inference (XGBoost, sklearn), ~20% cheaper than x86
Managed Warm Pools	Rapid iteration during development (many short training jobs)
S3 Lifecycle policies	Any S3 data older than 30 days that’s not accessed daily
Auto-scaling to zero	Asynchronous endpoints (or other endpoint types that support zero instances) with predictable idle windows
Model compilation (Neo)	Deployed model running at scale, want throughput improvement

Exam tip: The exam loves to ask “what is the MOST cost-effective approach?” For training: Spot + checkpointing. For inference with variable traffic: Serverless (low volume) or Auto-scaling (medium-high volume). For offline scoring: Batch Transform (no persistent endpoint).
