Domain 4C: Security, Compliance & Cost
Exam Domain: 4 - ML Implementation and Operations (20%)
Task: Apply security controls, compliance requirements, and cost optimization strategies to ML workloads
Security in ML — First Principles
ML security sits at the intersection of data security, infrastructure security, and model security. Each layer has distinct threats and controls.
┌─────────────────────────────────────────────────────────────────┐
│ ML ATTACK SURFACE │
│ │
│ Training Data ──► Model Training ──► Model ──► Inference │
│ │ │ │ │ │
│ Data poisoning Credential Model Adversarial │
│ Privacy leakage theft stealing inputs │
│ Bias injection Compute theft IP theft Model bypass │
│ │
│ Controls: │
│ Encryption IAM + VPC Access ctrl Input valid. │
│ Access control Logging Encryption Rate limiting │
└─────────────────────────────────────────────────────────────────┘
ELI5: ML security protects three things: your data (secrets — training data, customer PII), your model (intellectual property — the result of expensive training), and your predictions (business logic — decisions that drive revenue or safety). Each needs different protections, just like a house needs locks on doors, a safe for valuables, and insurance for everything else.
Why this matters for the exam: Security questions appear frequently in Domain 4. Know which service handles which concern: KMS for encryption, IAM for access, VPC for network isolation, CloudTrail for audit, Model Monitor for operational drift.
Data Protection
Encryption at Rest
S3 encryption options:
| Option | Key Management | Use When |
|---|---|---|
| SSE-S3 | AWS manages (AES-256) | Default, no compliance requirement |
| SSE-KMS | AWS KMS (you choose key) | Audit trail needed, customer managed key |
| SSE-C | Customer provides key per request | Full key control outside AWS |
| Client-side | Encrypt before upload | Data must never be plaintext in AWS |
SageMaker encryption:
- Notebooks: encrypt EBS volume with KMS key
- Training jobs: encrypt EBS attached to training instances, encrypt S3 output with KMS
- Endpoints: encrypt EBS on inference instances
- Feature Store: encrypt the online store (managed low-latency store) and the offline store (S3) separately; each takes its own KMS key setting
- Default: SageMaker uses AWS-managed KMS key if you don’t specify one
Exam tip: If a question says “customer managed key” or “you control the key” — that’s SSE-KMS with a Customer Managed Key (CMK). If it says “AWS manages the key” — SSE-S3 or AWS-managed KMS key.
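As a rough illustration of the SSE-KMS row in the table above, the sketch below builds the arguments for an S3 upload that asks for encryption under a customer managed key. The field names follow the S3 PutObject API as exposed by boto3's `put_object`; the bucket name, object key, and key ARN are placeholders, and nothing is sent to AWS.

```python
def sse_kms_put_kwargs(bucket: str, key: str, body: bytes, kms_key_arn: str) -> dict:
    """Build keyword arguments for an S3 PutObject call that uses SSE-KMS."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # "aws:kms" selects SSE-KMS; "AES256" would select SSE-S3 instead.
        "ServerSideEncryption": "aws:kms",
        # The customer managed key that should protect this object.
        "SSEKMSKeyId": kms_key_arn,
    }


# Placeholder values, for illustration only.
put_kwargs = sse_kms_put_kwargs(
    bucket="example-training-data",
    key="datasets/train.csv",
    body=b"feature_a,feature_b,label\n",
    kms_key_arn="arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
)
```

With boto3 installed and credentials configured, these arguments could be passed as `s3_client.put_object(**put_kwargs)`.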
Encryption in Transit
- All SageMaker API calls use TLS 1.2+
- Inter-container traffic for distributed training: not encrypted by default
  - Enable with EnableInterContainerTrafficEncryption: true in training job config
  - Adds ~10-20% performance overhead (encryption cost)
  - Required for: HIPAA, PCI-DSS, financial compliance
- VPC endpoints (PrivateLink): traffic never leaves AWS network
AWS KMS Deep Dive
How KMS Works
ELI5: KMS is a locked vault for your encryption keys. You put a key in the vault and give it a name. When you want to encrypt data, you ask the vault to encrypt it — the key NEVER leaves the vault. If someone steals your encrypted data, they still can’t read it because they don’t have the key, and the vault never hands the key out directly.
Key Types
| Key Type | Who Creates It | Who Manages It | Use Case |
|---|---|---|---|
| AWS Managed Key | AWS | AWS | Default for most services (free) |
| Customer Managed Key (CMK) | You | You | Compliance, key rotation control |
| Customer Provided Key | You | You (external) | SSE-C, external HSM |
| AWS Owned Key | AWS | AWS | Service-internal, you can’t see it |
Envelope Encryption
┌────────────────────────────────────────────────────────────────┐
│ ENVELOPE ENCRYPTION │
│ │
│ 1. SageMaker requests a DATA KEY from KMS │
│ │
│ KMS ──► generates Data Key (plaintext + encrypted copy) │
│ │
│ 2. SageMaker uses plaintext Data Key to encrypt your data │
│ Your data ──► [AES-256 encrypt with Data Key] ──► ciphertext│
│ │
│ 3. SageMaker stores encrypted Data Key alongside ciphertext │
│ (plaintext Data Key is discarded from memory) │
│ │
│ 4. To decrypt: KMS decrypts the Data Key ──► decrypt data │
│ │
│ Result: KMS key never touches your data directly. │
│ Audit trail: every KMS call logged in CloudTrail. │
└────────────────────────────────────────────────────────────────┘
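The four steps in the diagram can be walked through with a toy model. This is only a sketch: `ToyKms` is a hypothetical in-memory stand-in for KMS, and the XOR keystream is a deliberately simple placeholder for AES-256 that must not be used for real data. The point is the flow of keys, not the cipher.

```python
import hashlib
import secrets


def _toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256-derived keystream. Toy stand-in for AES-256; NOT secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


class ToyKms:
    """Hypothetical stand-in for KMS: the master key never leaves this object."""

    def __init__(self) -> None:
        self._master_key = secrets.token_bytes(32)

    def generate_data_key(self) -> tuple:
        # Step 1: return the data key in plaintext plus a copy encrypted under the master key.
        plaintext_key = secrets.token_bytes(32)
        return plaintext_key, _toy_cipher(self._master_key, plaintext_key)

    def decrypt(self, encrypted_key: bytes) -> bytes:
        # Step 4 (first half): recover the plaintext data key.
        return _toy_cipher(self._master_key, encrypted_key)


def envelope_encrypt(kms: ToyKms, data: bytes) -> dict:
    plaintext_key, encrypted_key = kms.generate_data_key()
    ciphertext = _toy_cipher(plaintext_key, data)  # Step 2: encrypt with the data key
    del plaintext_key                              # Step 3: drop the plaintext key
    return {"ciphertext": ciphertext, "encrypted_key": encrypted_key}


def envelope_decrypt(kms: ToyKms, envelope: dict) -> bytes:
    data_key = kms.decrypt(envelope["encrypted_key"])
    return _toy_cipher(data_key, envelope["ciphertext"])  # Step 4 (second half)


kms = ToyKms()
original = b"customer PII used for training"
envelope = envelope_encrypt(kms, original)
recovered = envelope_decrypt(kms, envelope)
```

Only the encrypted data key is stored next to the ciphertext, so reading the data always requires a call back to the key service.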
Why KMS Matters for Compliance
- Every Encrypt, Decrypt, GenerateDataKey call logged in CloudTrail
- Key policies control EXACTLY who can use each key
- Optional automatic key rotation for customer managed keys (annual by default)
- Cross-account access: share encrypted data across accounts via key policy
- Key deletion has mandatory 7-30 day waiting period (prevents accidental loss)
Network Security
VPC Configuration for SageMaker
ELI5: Putting SageMaker in a VPC is like moving your ML workload from a public office building into a private secure facility. The work still gets done, but now there are walls, security guards (security groups), and controlled entry points (VPC endpoints) — no random internet traffic gets in.
┌─────────────────────────────────────────────────────────────────┐
│ VPC ARCHITECTURE FOR ML │
│ │
│ ┌────────────────── VPC ──────────────────────────────────┐ │
│ │ │ │
│ │ ┌─── Private Subnet ───────────────────────────────┐ │ │
│ │ │ │ │ │
│ │ │ SageMaker Training SageMaker Endpoint │ │ │
│ │ │ Instances Instances │ │ │
│ │ │ (no public IP) (no public IP) │ │ │
│ │ │ │ │ │
│ │ └───────────────────────────────────────────────────┘ │ │
│ │ │ │ │ │
│ │ NAT Gateway VPC Endpoints │ │
│ │ (for internet: (for AWS services: │ │
│ │ pip install) S3, ECR, KMS, etc.) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ Without VPC Endpoints: traffic goes Internet ──► AWS service │
│ With VPC Endpoints: traffic stays inside AWS network │
└─────────────────────────────────────────────────────────────────┘
Network Controls
| Control | Scope | What It Does |
|---|---|---|
| Security Groups | Instance-level | Stateful firewall: allow rules only (no explicit deny), by port, protocol, source |
| Network ACLs | Subnet-level | Stateless firewall: allow/deny by IP range and port |
| VPC Endpoints (Interface) | Service access | Private connectivity to AWS services via PrivateLink |
| VPC Endpoints (Gateway) | S3/DynamoDB | Route table entry for private S3/DynamoDB access |
| NAT Gateway | Outbound internet | Allow private subnet instances to reach internet |
Why VPC matters for ML:
- Training data often contains PII or trade secrets — can’t traverse public internet
- Regulatory requirements (HIPAA, GDPR) often mandate private network paths
- Prevents data exfiltration: training instance can only talk to approved S3 buckets
SageMaker-Specific Network Flags
# Training job network isolation
NetworkConfig:
  EnableNetworkIsolation: true                  # No internet access AT ALL
  EnableInterContainerTrafficEncryption: true   # Encrypt distributed training traffic
  VpcConfig:
    SecurityGroupIds: [sg-xxx]
    Subnets: [subnet-xxx]

- EnableNetworkIsolation: true is the most restrictive setting: no outbound internet, no VPC communication
  - Use when: regulatory requirement for air-gapped training
  - Limitation: can’t download pip packages during training (must bake into container)
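For a concrete picture of where these flags go, here is a minimal sketch that assembles them as a request fragment. It assumes the CreateTrainingJob request shape, where the two flags and VpcConfig are top-level fields (the nested NetworkConfig form is the shape processing jobs use). The helper name and the IDs are placeholders.

```python
def training_network_settings(subnets, security_group_ids,
                              isolate: bool = True,
                              encrypt_traffic: bool = True) -> dict:
    """Return the network-related fields of a CreateTrainingJob request."""
    if not subnets or not security_group_ids:
        raise ValueError("a VPC configuration needs at least one subnet and one security group")
    return {
        "EnableNetworkIsolation": isolate,
        "EnableInterContainerTrafficEncryption": encrypt_traffic,
        "VpcConfig": {
            "SecurityGroupIds": list(security_group_ids),
            "Subnets": list(subnets),
        },
    }


# Placeholder IDs; a real request would also carry AlgorithmSpecification, RoleArn, and so on.
network_fields = training_network_settings(
    subnets=["subnet-xxx"],
    security_group_ids=["sg-xxx"],
)
```

The fragment would be merged into the full request dictionary before calling the API.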
Identity & Access Management
IAM for SageMaker
Two IAM principals to understand:
| Principal | What They Are | Key Permissions Needed |
|---|---|---|
| SageMaker Execution Role | Role assumed BY SageMaker service | S3 read/write, ECR pull, CloudWatch logs, KMS |
| User/Team IAM Role | Role assumed BY humans/apps | Create/manage SageMaker resources |
Principle of Least Privilege in ML:
Training Job needs:
✓ s3:GetObject on training data bucket
✓ s3:PutObject on model artifact bucket
✓ ecr:GetDownloadUrlForLayer (pull container)
✓ cloudwatch:PutMetricData (log metrics)
✗ s3:DeleteObject (NOT needed)
✗ iam:CreateRole (NOT needed)
✗ ec2:TerminateInstances (NOT needed)
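The checklist above translates fairly directly into an IAM policy document. The sketch below is illustrative only: bucket names are placeholders, and a working execution role usually needs a few more actions than the four listed (for example `ecr:BatchGetImage`, `ecr:GetAuthorizationToken`, and CloudWatch Logs permissions).

```python
import json


def training_role_policy(data_bucket: str, artifact_bucket: str) -> dict:
    """Build a least-privilege policy covering only the four allowed actions in the checklist."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Sid": "ReadTrainingData", "Effect": "Allow",
             "Action": ["s3:GetObject"],
             "Resource": f"arn:aws:s3:::{data_bucket}/*"},
            {"Sid": "WriteModelArtifacts", "Effect": "Allow",
             "Action": ["s3:PutObject"],
             "Resource": f"arn:aws:s3:::{artifact_bucket}/*"},
            {"Sid": "PullContainer", "Effect": "Allow",
             "Action": ["ecr:GetDownloadUrlForLayer"],
             "Resource": "*"},
            {"Sid": "PublishMetrics", "Effect": "Allow",
             "Action": ["cloudwatch:PutMetricData"],
             "Resource": "*"},
        ],
    }


policy = training_role_policy("example-training-data", "example-model-artifacts")
allowed_actions = {a for s in policy["Statement"] for a in s["Action"]}
policy_json = json.dumps(policy, indent=2)
```

Nothing in the document grants `s3:DeleteObject`, `iam:CreateRole`, or `ec2:TerminateInstances`, which is the point of the exercise.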
Key IAM condition keys for SageMaker:
| Condition Key | Restricts |
|---|---|
| sagemaker:VpcSecurityGroupIds | Require VPC security group |
| sagemaker:VpcSubnets | Require specific subnets |
| sagemaker:RootAccess | Control root access in notebooks |
| sagemaker:InterContainerTrafficEncryption | Require encryption |
| aws:RequestedRegion | Restrict to specific regions |
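One common way to use these condition keys is a Deny statement that fires when the required setting is absent. The following is a sketch of that pattern using two keys from the table; the statement IDs and the choice of `sagemaker:CreateTrainingJob` as the guarded action are illustrative assumptions.

```python
def guardrail_statements() -> list:
    """Deny training jobs that skip the VPC or leave inter-container traffic unencrypted."""
    return [
        {
            "Sid": "DenyTrainingOutsideVpc",
            "Effect": "Deny",
            "Action": ["sagemaker:CreateTrainingJob"],
            "Resource": "*",
            # "Null": "true" matches requests where the key is missing, i.e. no subnets given.
            "Condition": {"Null": {"sagemaker:VpcSubnets": "true"}},
        },
        {
            "Sid": "DenyUnencryptedContainerTraffic",
            "Effect": "Deny",
            "Action": ["sagemaker:CreateTrainingJob"],
            "Resource": "*",
            "Condition": {"Bool": {"sagemaker:InterContainerTrafficEncryption": "false"}},
        },
    ]


guardrail_policy = {"Version": "2012-10-17", "Statement": guardrail_statements()}
```

Because an explicit Deny wins over any Allow, statements like these work as guardrails regardless of what other policies grant.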
Service Control Policies (SCPs):
- Organization-level guardrails applied to ALL accounts in AWS Organization
- Cannot be overridden by account-level IAM policies
- Use case: “No one in any account can launch a SageMaker endpoint without KMS encryption”
Cross-Account ML Access Patterns
┌─── Dev Account ───┐ ┌─── Prod Account ───┐
│ │ │ │
│ Train model │ │ Deploy endpoint │
│ Register in │────────►│ Pull approved │
│ Model Registry │ │ model from registry│
│ │ │ │
└───────────────────┘ └─────────────────────┘
Requirements:
- Model Registry resource policy: allow prod account to read
- Prod account IAM role: allow SageMaker to use cross-account model
- KMS key policy: allow prod account to decrypt model artifacts
- S3 bucket policy: allow prod account to read model artifacts
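Two of the requirements above, the KMS key policy and the S3 bucket policy, are resource policies written in the dev account. A minimal sketch of the statements they might contain follows; the account ID, role name, and bucket name are hypothetical placeholders.

```python
def cross_account_statements(prod_account_id: str, prod_role_name: str,
                             artifact_bucket: str) -> dict:
    """Build the key-policy and bucket-policy statements that let a prod role read artifacts."""
    if not (prod_account_id.isdigit() and len(prod_account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    principal = {"AWS": f"arn:aws:iam::{prod_account_id}:role/{prod_role_name}"}
    return {
        # Goes into the KMS key policy in the dev account.
        "kms_key_policy_statement": {
            "Sid": "AllowProdDecrypt",
            "Effect": "Allow",
            "Principal": principal,
            "Action": ["kms:Decrypt", "kms:DescribeKey"],
            "Resource": "*",
        },
        # Goes into the S3 bucket policy on the artifact bucket.
        "s3_bucket_policy_statement": {
            "Sid": "AllowProdReadArtifacts",
            "Effect": "Allow",
            "Principal": principal,
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{artifact_bucket}/*",
        },
    }


statements = cross_account_statements("111122223333", "ProdSageMakerRole",
                                      "example-model-artifacts")
```

The prod role also needs matching Allow statements in its own identity policy; cross-account access requires both sides to agree.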
Monitoring & Logging
Amazon CloudWatch for ML
Training metrics:
- Custom metrics published via sagemaker.log_metric() during training
- Built-in metrics: CPU/GPU utilization, memory, disk I/O
- Regex-based metric extraction from training logs
Endpoint metrics (built-in):
| Metric | What It Measures |
|---|---|
| Invocations | Total number of requests |
| InvocationsPerInstance | Requests per instance (auto-scaling trigger) |
| ModelLatency | Time spent in model (microseconds) |
| OverheadLatency | SageMaker overhead (not model time) |
| Invocation4XXErrors | Client errors (bad requests) |
| Invocation5XXErrors | Server errors (model failures) |
| ModelSetupTime | Time to load model on container start |
Alarms: Trigger actions (SNS, Auto Scaling, Lambda) when metrics cross thresholds.
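As an example of wiring one of these metrics to an alarm, the sketch below builds the arguments for a CloudWatch `PutMetricAlarm` call on ModelLatency. Field names follow that API; the endpoint name, variant name, SNS topic ARN, period, and evaluation count are placeholder choices. Note the unit conversion, since ModelLatency is reported in microseconds.

```python
def model_latency_alarm(endpoint_name: str, variant_name: str,
                        threshold_ms: float, sns_topic_arn: str) -> dict:
    """Build PutMetricAlarm arguments that fire when average model latency is too high."""
    return {
        "AlarmName": f"{endpoint_name}-{variant_name}-model-latency",
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        "Period": 60,               # seconds per datapoint
        "EvaluationPeriods": 5,     # consecutive breaching datapoints before alarming
        "Threshold": threshold_ms * 1000.0,  # milliseconds -> microseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


alarm_kwargs = model_latency_alarm(
    endpoint_name="fraud-detector",
    variant_name="AllTraffic",
    threshold_ms=200,
    sns_topic_arn="arn:aws:sns:us-east-1:111122223333:ml-alerts",
)
```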
AWS CloudTrail for ML Governance
- Records SageMaker API calls: who, what, when, from where (runtime inference calls such as InvokeEndpoint are the exception)
- Logged calls: CreateTrainingJob, CreateEndpoint, DeleteModel, UpdateEndpoint, etc.
- Includes: caller identity (IAM user/role), source IP, request parameters
- Stored in S3, queryable with Athena
- Use for: compliance audits, forensic investigation, change tracking
Exam tip: CloudTrail is the audit trail. If a question says “who deleted the model?” or “prove that no unauthorized user accessed training data” — CloudTrail is the answer.
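To make the "who deleted the model?" question concrete, here is a small sketch that filters CloudTrail-style records already loaded into memory. The field names (`eventSource`, `eventName`, `userIdentity.arn`, `sourceIPAddress`, `eventTime`) follow the CloudTrail record format; the sample records themselves are invented for illustration.

```python
def who_called(records: list, event_name: str,
               event_source: str = "sagemaker.amazonaws.com") -> list:
    """Return time-ordered caller details for every matching API call."""
    hits = []
    for record in records:
        if record.get("eventSource") == event_source and record.get("eventName") == event_name:
            hits.append({
                "when": record.get("eventTime"),
                "who": record.get("userIdentity", {}).get("arn"),
                "from": record.get("sourceIPAddress"),
            })
    # ISO 8601 UTC timestamps sort correctly as plain strings.
    return sorted(hits, key=lambda hit: hit["when"] or "")


# Invented sample records, not real log data.
sample_records = [
    {"eventSource": "sagemaker.amazonaws.com", "eventName": "CreateEndpoint",
     "eventTime": "2025-01-10T09:00:00Z", "sourceIPAddress": "198.51.100.7",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"}},
    {"eventSource": "sagemaker.amazonaws.com", "eventName": "DeleteModel",
     "eventTime": "2025-01-11T14:30:00Z", "sourceIPAddress": "203.0.113.9",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/bob"}},
    {"eventSource": "s3.amazonaws.com", "eventName": "DeleteModel",
     "eventTime": "2025-01-12T08:00:00Z", "sourceIPAddress": "203.0.113.9",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/carol"}},
]

deletions = who_called(sample_records, "DeleteModel")
```

At scale the same question is usually answered with an Athena query over the trail's S3 logs rather than in-memory filtering.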
SageMaker Model Monitor — Deep Dive
ELI5: Model Monitor is a smoke detector for your ML model. It constantly checks two things: (1) Has the data feeding your model changed significantly since training? (data drift) and (2) Are the model’s predictions getting worse over time? (model quality drift). If either triggers, it raises an alarm so you can retrain.
Four monitoring types:
| Monitor Type | What It Detects | How |
|---|---|---|
| Data Quality | Input data drift (schema, statistics) | Compare incoming data to training baseline |
| Model Quality | Prediction accuracy degradation | Compare predictions to ground truth labels |
| Bias Drift | Changes in model fairness | Track bias metrics over time |
| Feature Attribution Drift | Changes in feature importance (SHAP) | Compare SHAP values to baseline |
How it works:
Step 1: Create Baseline
Training data ──► Baseline job ──► Statistics + Constraints JSON
(runs once) (SageMaker (stored in S3)
Processing)
Step 2: Enable Data Capture
Endpoint ──► [Model Monitor captures] ──► Inference requests + responses to S3
Step 3: Schedule Monitoring
Every hour/day ──► Monitoring job runs:
Captured data vs. Baseline ──► Violations report ──► CloudWatch metrics
Step 4: Alert
CloudWatch metric exceeds threshold ──► Alarm ──► SNS ──► Re-train trigger
Why drift matters:
- Model trained on 2023 data may fail on 2025 data (distribution shift)
- User behavior changes (e.g., pandemic changed shopping patterns)
- Data pipeline upstream changes silently alter feature distributions
- Regulatory requirement: show model is still performing as designed
Integration:
- CloudWatch metrics per feature: feature.mean, feature.std_dev, completeness
- Violation report: JSON listing which constraints were violated
- SageMaker Pipelines: trigger retraining when violations exceed threshold
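The baseline-versus-captured comparison can be pictured with a toy check. This sketch is not Model Monitor's actual algorithm or file format: the statistics layout, the tolerance values, and the check names are all assumptions chosen to show the idea of turning a comparison into a violations list.

```python
def find_violations(baseline: dict, current: dict,
                    mean_tolerance: float = 0.10,
                    min_completeness: float = 0.99) -> list:
    """Compare per-feature statistics to a baseline and list simple constraint violations."""
    violations = []
    for feature, base in baseline.items():
        cur = current.get(feature)
        if cur is None:
            violations.append({"feature": feature, "check": "missing_column"})
            continue
        # Relative shift of the mean; fall back to scale 1.0 when the baseline mean is zero.
        scale = abs(base["mean"]) or 1.0
        if abs(cur["mean"] - base["mean"]) / scale > mean_tolerance:
            violations.append({"feature": feature, "check": "mean_drift"})
        if cur["completeness"] < min_completeness:
            violations.append({"feature": feature, "check": "completeness"})
    return violations


# Invented statistics for illustration.
baseline_stats = {
    "age": {"mean": 40.0, "completeness": 1.0},
    "income": {"mean": 55000.0, "completeness": 1.0},
    "tenure": {"mean": 5.0, "completeness": 1.0},
}
captured_stats = {
    "age": {"mean": 50.0, "completeness": 1.0},      # mean shifted by 25%
    "tenure": {"mean": 5.1, "completeness": 0.90},   # 10% of values missing
}

report = find_violations(baseline_stats, captured_stats)
```

A retraining trigger could then be as simple as checking whether the report is non-empty, or whether it exceeds some agreed count.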
Compliance & Governance
SageMaker Clarify — Bias and Explainability
Pre-training bias analysis:
- Analyze training dataset BEFORE training
- Detect: class imbalance, label correlation with sensitive attributes
- Metrics: Class Imbalance (CI), Difference in Proportions of Labels (DPL), KL Divergence
Post-training bias analysis:
- Analyze model predictions for fairness
- Detect: disparate impact across demographic groups
- Metrics: Difference in Positive Proportions in Predicted Labels (DPPL), Disparate Impact (DI)
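Two of these metrics are simple enough to compute by hand, which helps when reading a Clarify report. The sketch below uses the usual definitions: Class Imbalance compares group sizes in the dataset, and Disparate Impact is the ratio of predicted positive rates between the disadvantaged and advantaged groups. The group counts are invented.

```python
def class_imbalance(n_advantaged: int, n_disadvantaged: int) -> float:
    """CI = (n_a - n_d) / (n_a + n_d); 0 means balanced, values near +1 or -1 mean skew."""
    return (n_advantaged - n_disadvantaged) / (n_advantaged + n_disadvantaged)


def disparate_impact(pos_advantaged: int, n_advantaged: int,
                     pos_disadvantaged: int, n_disadvantaged: int) -> float:
    """DI = (positive rate for disadvantaged group) / (positive rate for advantaged group)."""
    rate_advantaged = pos_advantaged / n_advantaged
    rate_disadvantaged = pos_disadvantaged / n_disadvantaged
    return rate_disadvantaged / rate_advantaged


# Invented example: 800 applicants in group A (400 approved), 200 in group D (60 approved).
ci = class_imbalance(n_advantaged=800, n_disadvantaged=200)
di = disparate_impact(pos_advantaged=400, n_advantaged=800,
                      pos_disadvantaged=60, n_disadvantaged=200)
```

A DI of 1.0 would mean both groups receive positive predictions at the same rate; the further below 1.0, the less often the disadvantaged group does.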
Explainability (SHAP values):
- Per-prediction feature attribution: “Why did this model return THAT prediction?”
- SHAP (SHapley Additive exPlanations): mathematically sound attribution
- Global explanations: which features matter most overall
- Local explanations: which features drove THIS specific prediction
When required:
- Financial services: credit decisions (ECOA, FCRA compliance)
- Healthcare: clinical decision support (FDA AI/ML guidelines)
- HR: hiring/promotion decisions (EEOC compliance)
- Insurance: risk pricing (state insurance regulations)
Exam tip: Clarify is the answer when a question mentions “explain why the model made this decision” or “detect bias in model predictions” or “comply with fairness regulations.”
ML Lineage Tracking
SageMaker ML Lineage (automatic):
Dataset ──► Processing Job ──► Processed Dataset ──► Training Job ──► Model ──► Endpoint
│ │ │ │ │ │
Artifact Action Artifact Action Artifact Action
│
Model Registry
(approval)
Use cases:
- Auditor asks: “What training data was used for the production model?” → query lineage
- Incident response: “Which endpoints use the affected model version?” → query lineage
- Reproducibility: recreate exact training conditions for a model version
Model Cards
Document ML model details for transparency and accountability.
- Intended use, out-of-scope use cases
- Training data description, evaluation results
- Limitations, biases, performance across demographic groups
- Contact information, version history
Cost Optimization
The Three Cost Buckets
ELI5: ML costs come from three buckets: training (renting powerful computers to learn), inference (renting computers to make predictions 24/7), and storage (storing your data and models in S3). The strategies to optimize each bucket are completely different — like saving money on rent, groceries, and utilities require different approaches.
┌─────────────────────────────────────────────────────────────────┐
│ ML COST BREAKDOWN │
├──────────────────┬──────────────────┬──────────────────────────┤
│ TRAINING │ INFERENCE │ STORAGE │
├──────────────────┼──────────────────┼──────────────────────────┤
│ Compute-heavy │ Always-on cost │ Scales with data volume │
│ Periodic spikes │ Predictable │ Low $/GB but accumulates │
│ GPU-intensive │ Latency matters │ │
├──────────────────┼──────────────────┼──────────────────────────┤
│ Spot Instances │ Auto-scaling │ S3 Lifecycle policies │
│ Right-sizing │ Serverless │ Intelligent-Tiering │
│ Warm Pools │ Multi-Model EP │ Delete stale artifacts │
│ Savings Plans │ Inferentia │ │
└──────────────────┴──────────────────┴──────────────────────────┘
Training Cost Optimization
Spot Instances (most impactful):
- Up to 90% cheaper than On-Demand instances
- Risk: instance can be interrupted with 2-minute warning
- Mitigation: checkpointing — save model state to S3 periodically, resume on new instance
- SageMaker syncs checkpoints to S3 and restores them on restart when checkpoint_s3_uri is set; the training script still has to write and reload the checkpoint files
- Best for: training jobs > 1 hour where interruption is acceptable
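A minimal sketch of the Spot-related pieces follows. The field names match the CreateTrainingJob request (the SDK's `checkpoint_s3_uri` maps to `CheckpointConfig.S3Uri`); the S3 URI and time limits are placeholders. The savings formula is the one SageMaker reports for managed Spot training, comparing billable seconds to total training seconds.

```python
def spot_training_settings(checkpoint_s3_uri: str,
                           max_runtime_s: int, max_wait_s: int) -> dict:
    """Return the Spot and checkpoint fields of a CreateTrainingJob request."""
    if max_wait_s < max_runtime_s:
        # The wait limit covers runtime plus time spent waiting for Spot capacity.
        raise ValueError("MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds")
    return {
        "EnableManagedSpotTraining": True,
        "CheckpointConfig": {
            "S3Uri": checkpoint_s3_uri,
            "LocalPath": "/opt/ml/checkpoints",  # directory the script writes checkpoints to
        },
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_runtime_s,
            "MaxWaitTimeInSeconds": max_wait_s,
        },
    }


def spot_savings_percent(training_seconds: float, billable_seconds: float) -> float:
    """Savings = (1 - billable / training) * 100."""
    return (1 - billable_seconds / training_seconds) * 100


spot_fields = spot_training_settings("s3://example-bucket/checkpoints/", 3600, 7200)
savings = spot_savings_percent(training_seconds=3600, billable_seconds=1080)
```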
Right-sizing instances:
| Mistake | Fix |
|---|---|
| Using ml.p3.16xlarge for scikit-learn | Use ml.m5.xlarge |
| Using ml.m5.large for transformer training | Use ml.p3.2xlarge |
| Multi-GPU when single GPU fits the model | Profile GPU utilization first |
| 100GB EBS when model needs 10GB | Size EBS to actual data needs |
Managed Warm Pools:
- Keep instances warm between training jobs (avoid instance startup overhead)
- You are billed for warm pool idle time, but that can cost less than repeated cold starts when jobs run frequently
- Good for: iterative development with frequent short training runs
SageMaker Savings Plans:
- Commit to a $/hour spend for 1 or 3 years
- Get up to 64% discount vs. On-Demand
- Flexible: applies to training, processing, transform jobs across instance types
- vs. Reserved Instances: Savings Plans are more flexible (no specific instance commitment)
Distributed training cost vs. speed:
Adding more instances has diminishing returns:
1 instance: 100% efficiency
2 instances: ~90% efficiency (communication overhead)
4 instances: ~80% efficiency
8 instances: ~70% efficiency (gradient synchronization cost)
Rule: don't add more instances just because they're available.
Profile first: is the job GPU-bound, CPU-bound, or I/O-bound?
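The efficiency figures above imply a simple trade-off: wall-clock time drops, but total cost rises by a factor of 1/efficiency. The sketch below works that arithmetic through; the 10-hour baseline and the $1/hour rate are made-up numbers used only to make the ratio visible.

```python
def scaling_tradeoff(instances: int, efficiency: float,
                     single_instance_hours: float, hourly_rate: float) -> dict:
    """Estimate speedup, wall-clock time, and total cost for a distributed training run."""
    speedup = instances * efficiency
    wall_clock_hours = single_instance_hours / speedup
    total_cost = wall_clock_hours * instances * hourly_rate
    baseline_cost = single_instance_hours * hourly_rate
    return {
        "speedup": speedup,
        "wall_clock_hours": wall_clock_hours,
        "total_cost": total_cost,
        "cost_vs_single_instance": total_cost / baseline_cost,  # equals 1 / efficiency
    }


# Hypothetical job: 10 hours on one instance at $1/hour, scaled to 8 instances at 70% efficiency.
eight_way = scaling_tradeoff(instances=8, efficiency=0.70,
                             single_instance_hours=10.0, hourly_rate=1.0)
```

Under these assumptions the 8-instance run finishes in under two hours but costs roughly 43% more than the single-instance run, which is worth paying only when time matters more than spend.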
Inference Cost Optimization
Auto-scaling:
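- Scale in to the minimum instance count during low traffic periods (nights, weekends); true scale-to-zero depends on the endpoint type (asynchronous endpoints support it, for example)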
- Scale out during peak traffic
- Target tracking on InvocationsPerInstance is most reliable
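Target tracking on invocations per instance is configured through Application Auto Scaling. The sketch below builds the two request payloads involved, one to register the endpoint variant as a scalable target and one for the policy. Field names and the predefined metric type follow that API; the endpoint name, capacities, target value, and cooldowns are placeholder choices.

```python
def target_tracking_requests(endpoint_name: str, variant_name: str,
                             target_invocations: float,
                             min_capacity: int, max_capacity: int) -> tuple:
    """Build RegisterScalableTarget and PutScalingPolicy arguments for an endpoint variant."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    register = {**common, "MinCapacity": min_capacity, "MaxCapacity": max_capacity}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleInCooldown": 300,   # seconds to wait before removing instances
            "ScaleOutCooldown": 60,   # seconds to wait before adding more
        },
    }
    return register, policy


register_kwargs, policy_kwargs = target_tracking_requests(
    "fraud-detector", "AllTraffic", target_invocations=100, min_capacity=1, max_capacity=4,
)
```

A longer scale-in cooldown than scale-out cooldown is a common choice, since removing capacity too eagerly hurts latency more than adding it too eagerly hurts cost.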
Serverless Inference:
- Pay per request — zero cost when idle
- Best for: low-volume APIs, dev/test, intermittent traffic
- Break-even: ~20% endpoint utilization (below that, serverless is cheaper)
Multi-Model Endpoints:
- Host thousands of models on one endpoint
- Amortize instance cost across all models
- Best for: per-customer models with sparse per-customer traffic
Inferentia/Graviton instances:
| Chip | Savings vs. GPU | Use Case |
|---|---|---|
| ml.inf1 (Inferentia1) | Up to 70% vs ml.g4dn | Standard DL inference |
| ml.inf2 (Inferentia2) | Up to 40% vs ml.p3 | Large model inference |
| ml.c7g (Graviton3) | ~20% vs ml.c6i | CPU-based inference (tabular ML) |
Batch Transform instead of endpoint:
- If predictions don’t need to be real-time → use Batch Transform
- No persistent endpoint = no 24/7 compute cost
- Run nightly/weekly scoring jobs instead
Model compilation (Neo):
- Faster inference = fewer instances needed = lower cost
- 2x throughput = can handle same traffic with half the instances
Storage Cost Optimization
S3 storage tiers:
| Tier | Cost | Retrieval | Use For |
|---|---|---|---|
| S3 Standard | Highest | Instant | Active training data, recent models |
| S3 Standard-IA | ~46% less | Instant | Infrequently accessed datasets |
| S3 Glacier Instant Retrieval | ~68% less | Instant | Archived datasets, compliance |
| S3 Glacier Flexible | ~82% less | Hours | Long-term archive, regulatory |
| S3 Glacier Deep Archive | ~95% less | 12-48 hrs | Compliance archive, rarely needed |
S3 Intelligent-Tiering:
- Automatically moves objects between tiers based on access patterns
- No retrieval fees, small monitoring fee
- Best for: data with unknown or changing access patterns
Lifecycle rules for ML:
Training data (active) ──► 30 days ──► Standard-IA ──► 90 days ──► Glacier
Model artifacts ──────────► 60 days ──► Standard-IA ──► keep latest N versions only
Experiment logs ──────────► 30 days ──► Standard-IA ──► delete after 1 year
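The transitions above can be expressed as an S3 lifecycle configuration. This sketch makes several assumptions: the prefixes are hypothetical, `Days` is read as object age since creation, and "Glacier" is mapped to the `GLACIER` storage class (Glacier Flexible Retrieval). The "keep latest N versions" part is left out because it depends on how artifacts are versioned.

```python
def ml_lifecycle_configuration() -> dict:
    """Build a lifecycle configuration mirroring the three rules sketched above."""
    return {
        "Rules": [
            {
                "ID": "training-data-tiering",
                "Filter": {"Prefix": "training-data/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            },
            {
                "ID": "model-artifacts-tiering",
                "Filter": {"Prefix": "model-artifacts/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 60, "StorageClass": "STANDARD_IA"}],
            },
            {
                "ID": "experiment-logs-tiering-and-expiry",
                "Filter": {"Prefix": "experiment-logs/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                "Expiration": {"Days": 365},  # delete after 1 year
            },
        ]
    }


lifecycle = ml_lifecycle_configuration()
rules_by_id = {rule["ID"]: rule for rule in lifecycle["Rules"]}
```

With boto3, a dictionary of this shape is what `put_bucket_lifecycle_configuration` takes as its `LifecycleConfiguration` argument.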
What to delete:
- Failed experiment model artifacts (keep only the best run per experiment)
- Intermediate processing outputs (temp data)
- Old Docker image layers in ECR (use lifecycle policies)
- CloudWatch logs older than retention period (set log group retention)
Cost Monitoring and Governance
AWS Cost Explorer:
- Tag all SageMaker resources: project, team, environment, model-name
- Filter costs by tag to see per-project ML spend
- Identify cost spikes (unexpected training jobs, endpoints left running)
AWS Budgets:
- Set monthly budget per tag (e.g., “$5,000/month for project X”)
- Alert at 80% and 100% of budget
- Action-based budgets: auto-stop resources when budget exceeded
Resource tagging strategy for ML:
Required tags:
Environment: dev | staging | prod
Team: data-science | ml-platform | product-ml
Project: recommendation-engine | fraud-detection
CostCenter: [business unit code]
AutoShutdown: true | false (custom: for cleanup automation)
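A tagging strategy only helps if it is enforced. As a small sketch, the function below checks a resource's tags against the required set listed above; the tag list uses the `Key`/`Value` shape that AWS APIs return, and the function name and messages are illustrative.

```python
# Required tag keys; a set restricts allowed values, None accepts any non-empty value.
REQUIRED_TAGS = {
    "Environment": {"dev", "staging", "prod"},
    "Team": None,
    "Project": None,
    "CostCenter": None,
    "AutoShutdown": {"true", "false"},
}


def tag_problems(tags: list) -> list:
    """Return human-readable problems for missing or invalid required tags."""
    present = {tag["Key"]: tag["Value"] for tag in tags}
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        value = present.get(key)
        if not value:
            problems.append(f"missing tag: {key}")
        elif allowed is not None and value not in allowed:
            problems.append(f"invalid value for {key}: {value}")
    return problems


# Invented example resource tags.
endpoint_tags = [
    {"Key": "Environment", "Value": "production"},   # not one of dev | staging | prod
    {"Key": "Team", "Value": "data-science"},
    {"Key": "Project", "Value": "fraud-detection"},
    {"Key": "AutoShutdown", "Value": "true"},
]

problems = tag_problems(endpoint_tags)
```

A check like this could run in a scheduled cleanup job or a CI step that validates infrastructure definitions before deployment.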
Security Requirements → Service Mapping
| Requirement | AWS Service/Feature |
|---|---|
| Encrypt training data at rest | KMS (SSE-KMS on S3) |
| Encrypt model artifacts | KMS encryption on SageMaker training job |
| Audit who accessed the model | CloudTrail |
| Keep training data off public internet | VPC + VPC Endpoints (PrivateLink) |
| Encrypt traffic between training containers | EnableInterContainerTrafficEncryption |
| Isolate training from all network access | EnableNetworkIsolation: true |
| Review low-confidence ML predictions | Amazon A2I |
| Detect bias in model predictions | SageMaker Clarify |
| Monitor for model accuracy degradation | SageMaker Model Monitor (Model Quality) |
| Monitor for input data distribution shift | SageMaker Model Monitor (Data Quality) |
| Prove which data trained production model | SageMaker ML Lineage |
| Restrict which SageMaker actions users can take | IAM policies with condition keys |
| Org-wide encryption enforcement | Service Control Policies (SCPs) |
| HIPAA-compliant ML workloads | SageMaker in VPC + KMS + CloudTrail + BAA |
Cost Strategy → When to Use
| Strategy | When to Apply |
|---|---|
| Spot Instances | Training jobs > 30 min, can checkpoint, not latency-sensitive |
| Savings Plans | Predictable training workload, committed spend for 1-3 years |
| Serverless Inference | < few hundred requests/day, can tolerate cold start |
| Multi-Model Endpoint | > 10 models with sparse per-model traffic |
| Batch Transform | Predictions don’t need real-time latency |
| Inferentia instances | High-volume DL inference, willing to compile model |
| Graviton instances | CPU-based inference (XGBoost, sklearn), ~20% cheaper than x86 |
| Managed Warm Pools | Rapid iteration during development (many short training jobs) |
| S3 Lifecycle policies | Any S3 data older than 30 days that’s not accessed daily |
| Auto-scaling to zero | Endpoints with predictable low-traffic windows |
| Model compilation (Neo) | Deployed model running at scale, want throughput improvement |
Exam tip: The exam loves to ask “what is the MOST cost-effective approach?” For training: Spot + checkpointing. For inference with variable traffic: Serverless (low volume) or Auto-scaling (medium-high volume). For offline scoring: Batch Transform (no persistent endpoint).