MLA-C01 Exam Cheat Sheet

Rapid-fire reference for the exam. Print this and review the night before.

The Golden Rules

SageMaker is almost always the answer for ML workflow questions
Prefer managed services over custom solutions
Bedrock for GenAI, SageMaker for custom ML
Least privilege for every IAM question
VPC Endpoints when “data must not traverse the internet”
KMS for every encryption question
XGBoost is the default for tabular/structured data
RecordIO + Pipe mode for fastest training ingestion

Service → Purpose (One-Line Each)

Storage & Ingestion

Service	One-Liner
S3	The data lake — all ML data lives here
EFS	Shared file system across instances
FSx for Lustre	High-throughput training I/O, backed by S3
Kinesis Data Streams	Real-time streaming with custom consumers
Data Firehose	Managed delivery to S3/Redshift (near real-time)
MSK	Managed Kafka
Managed Service for Apache Flink	Real-time stream processing (SQL, Java, Python)

Data Transformation

Service	One-Liner
EMR + Spark	Large-scale distributed data processing
Glue ETL	Serverless ETL jobs (PySpark)
Glue Crawlers	Auto-discover schemas → Data Catalog
Glue DataBrew	Visual data prep + PII handling
Data Wrangler	Visual data prep inside SageMaker Studio
Ground Truth	Managed data labeling (humans + auto-label)
Feature Store	Centralized features (online + offline)
Athena	SQL queries on S3 (serverless)

ML Development

Service	One-Liner
SageMaker Studio	Unified IDE for all ML tasks
SageMaker Canvas	No-code ML for business users
SageMaker Autopilot	AutoML: automated algorithm selection + hyperparameter tuning
SageMaker Clarify	Bias detection + SHAP explainability
SageMaker Debugger	Real-time training monitoring + profiling
SageMaker Experiments	Track/compare training runs
Model Registry	Version control + approval workflows for models
JumpStart	Pre-trained model hub (400+ models)

GenAI / Bedrock

Service	One-Liner
Bedrock	Managed API access to foundation models
Bedrock Knowledge Bases	Managed RAG (S3 → chunk → embed → vector store)
Bedrock Agents	Multi-step LLM tasks with Lambda actions
Bedrock Guardrails	Content filters + PII + topic blocking

Managed AI Services

Service	One-Liner
Comprehend	NLP: sentiment, entities, PII, topics
Translate	Neural machine translation
Transcribe	Speech-to-text
Polly	Text-to-speech
Rekognition	Image/video analysis + custom labels
Textract	OCR for documents, forms, tables
Lex	Build chatbots (intents + slots + Lambda)
Personalize	Real-time recommendation engine
Kendra	Enterprise semantic search
A2I	Human review workflows for ML output
Q Business	Enterprise AI assistant (40+ data connectors)
Q Developer	AI coding assistant in IDE
Macie	Discover PII in S3
Fraud Detector	Managed fraud detection

Deployment & Ops

Service	One-Liner
SageMaker Pipelines	ML-native CI/CD workflows
CodePipeline	General CI/CD orchestration
CodeBuild	Build + test step
CodeDeploy	Deploy to compute targets
Step Functions	State machine workflow orchestration
EventBridge	Event-driven automation (triggers)
MWAA	Managed Apache Airflow (DAGs)
CloudFormation	Infrastructure as YAML/JSON
CDK	Infrastructure as code (Python/TS)
ECR	Docker container registry
ECS + Fargate	Run containers serverlessly
EKS	Managed Kubernetes
Neo	Compile model for edge hardware
IoT Greengrass	Edge ML inference

Monitoring & Security

Service	One-Liner
Model Monitor	Production drift detection (4 types)
CloudWatch	Metrics, alarms, logs, dashboards
CloudTrail	Audit log of all API calls
IAM	Users, roles, policies, permissions
KMS	Encryption key management
Secrets Manager	Store/rotate credentials
WAF	Web application firewall
Shield	DDoS protection
Lake Formation	Data lake access governance
VPC Endpoints	Keep traffic within AWS network

Algorithm Quick-Match

I Have…	I Want To…	Use
Tabular data	Classify / Predict	XGBoost
Simple linear data	Classify / Predict	Linear Learner
Multiple time series	Forecast future values	DeepAR
Text	Classify sentiment/topic	BlazingText
Text corpus	Find topics	NTM
Images	Classify	Image Classification
Images	Find + locate objects	Object Detection
Images	Label every pixel	Semantic Segmentation
Numeric data	Find anomalies	Random Cut Forest
IP addresses	Detect suspicious usage	IP Insights
Sparse click/purchase data	Recommend items	Factorization Machines
High dimensions	Reduce features	PCA
Unlabeled data	Group into clusters	K-Means
Sequences	Translate / Summarize	Seq2Seq

Endpoint Types: Which One?

Signal in Question	Answer
“Low latency”, “real-time API”, “interactive”	Real-Time Endpoint
“Intermittent traffic”, “minimize cost”, “pay per use”	Serverless Inference
“Large payload”, “minutes to process”, “>6MB”	Async Inference
“Score entire dataset”, “nightly batch”, “no endpoint needed”	Batch Transform
“A/B test two models”	Production Variants
“Many models, sporadic traffic”	Multi-Model Endpoint
“Chain preprocessing + prediction”	Inference Pipeline
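
The payload-driven part of the table above can be sketched as a small decision helper. This is only an illustration: `pick_inference_option` and its parameters are hypothetical names, and the size limits are the ones listed in this cheat sheet, so check them against current AWS documentation before relying on them.

```python
MB = 1024 * 1024
GB = 1024 * MB

# Payload limits as listed in this cheat sheet (assumed, not queried from AWS).
REALTIME_MAX = 6 * MB
SERVERLESS_MAX = 4 * MB
ASYNC_MAX = 1 * GB


def pick_inference_option(payload_bytes, intermittent_traffic, whole_dataset=False):
    """Hypothetical helper: map question signals to a SageMaker inference option."""
    if whole_dataset:
        # Scoring a full dataset offline needs no persistent endpoint.
        return "Batch Transform"
    if payload_bytes > ASYNC_MAX:
        raise ValueError("payload exceeds the async limit; split the input")
    if payload_bytes > REALTIME_MAX:
        # Too large for a synchronous request, so queue it.
        return "Async Inference"
    if intermittent_traffic and payload_bytes <= SERVERLESS_MAX:
        # Pay per use, at the cost of possible cold starts.
        return "Serverless Inference"
    return "Real-Time Endpoint"


print(pick_inference_option(100 * MB, intermittent_traffic=False))
```

Production Variants, Multi-Model Endpoints, and Inference Pipelines are left out on purpose: they describe how an endpoint is arranged, not which invocation mode to choose.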

Deployment: Which Strategy?

Signal in Question	Answer
“Instant rollback”, “zero downtime swap”	Blue/Green
“Test on small % first”, “gradual rollout”	Canary
“Shift traffic linearly over time”	Linear
“Compare models with zero user risk”	Shadow Testing
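
To make the difference between canary and linear shifting concrete, here is a minimal sketch that lists the cumulative share of traffic on the new fleet after each step. The function names are hypothetical and the step sizes are illustrative; real deployments also wait a baking period between steps, which is not modeled here.

```python
def canary_shift_steps(canary_percent):
    """Canary: a small slice first, then everything else in one move."""
    if not 0 < canary_percent < 100:
        raise ValueError("canary_percent must be between 0 and 100 (exclusive)")
    return [canary_percent, 100]


def linear_shift_steps(step_percent):
    """Linear: equal-sized increments until all traffic has moved."""
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be in (0, 100]")
    steps = []
    current = 0
    while current < 100:
        # Cap the final step so the total never exceeds 100%.
        current = min(100, current + step_percent)
        steps.append(current)
    return steps


print(canary_shift_steps(10))
print(linear_shift_steps(25))
```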

Data Format Decision

Signal in Question	Answer
“Fastest SageMaker training”	RecordIO + Pipe Mode
“Query with Athena”, “analytics”	Parquet (columnar)
“Schema may evolve”, “streaming”	Avro
“Hive / EMR workloads”	ORC
“Simple, human-readable”	CSV or JSON

Metric Selection

Scenario	Metric
“Don’t miss cancer” (minimize FN)	Recall
“Don’t flag good email as spam” (minimize FP)	Precision
“Balance precision and recall”	F1 Score
“Overall classifier quality”	AUC-ROC
“Regression — same units as target”	RMSE
“Regression — % variance explained”	R²
“Cluster quality”	Silhouette Score
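
As a quick refresher on what these metrics measure, the following sketch computes several of them from scratch with the standard textbook formulas. The function names and sample numbers are made up for illustration; in practice a library such as scikit-learn would normally be used.

```python
import math


def classification_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # penalizes false positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # penalizes false negatives
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return precision, recall, f1


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Fraction of target variance explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


precision, recall, f1 = classification_metrics(tp=8, fp=2, fn=8)
print(precision, recall, round(f1, 3))
```

With 8 true positives, 2 false positives, and 8 false negatives, precision is high but recall is low, which is the pattern to avoid in the "don't miss cancer" scenario.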

Bias & Explainability

When	Tool
Before training (data bias)	Clarify — pre-training bias metrics
After training (model bias)	Clarify — post-training bias metrics
In production (drift)	Model Monitor + Clarify
Explain individual predictions	SHAP values
Feature effect on overall model	Partial Dependence Plots
Balance imbalanced datasets	Data Wrangler — SMOTE / resampling

Security Patterns

Question Pattern	Answer
“Encrypt data at rest”	SSE-KMS (S3) / KMS (EBS, SageMaker)
“Encrypt in transit”	TLS/SSL (automatic for endpoints)
“Data must not leave AWS network”	VPC Endpoints (PrivateLink)
“Isolate training from internet”	VPC private subnet + network isolation
“Who accessed what resource”	CloudTrail
“Find PII in S3”	Macie
“Handle PII in ETL”	Glue DataBrew
“Rotate database passwords”	Secrets Manager
“Fine-grained data lake permissions”	Lake Formation
“Least-privilege ML roles”	SageMaker Role Manager
“Protect API from DDoS”	Shield + WAF

Cost Optimization Patterns

Question Pattern	Answer
“Reduce training costs”	Spot Instances + Checkpointing
“Reduce inference costs (variable traffic)”	Serverless Inference or Auto-Scaling
“Reduce inference costs (steady traffic)”	Savings Plans or Inferentia instances
“Many models, low traffic each”	Multi-Model Endpoints
“Right-size inference instances”	Inference Recommender
“Reduce DL training costs”	AWS Trainium (ml.trn1)
“Reduce DL inference costs”	AWS Inferentia (ml.inf2)
“Optimize model for target hardware”	SageMaker Neo
“Reduce data scanned in queries”	Parquet + Partitioning (Athena)
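
For the Spot training row, the savings figure can be derived from two durations: total training time and billable time. The sketch below assumes those two numbers are already known (SageMaker exposes them on a completed training job as `TrainingTimeInSeconds` and `BillableTimeInSeconds`); the function name and sample values are hypothetical.

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Percent saved by managed Spot training relative to on-demand."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    # Savings = share of the training time that was not billed.
    return (1 - billable_seconds / training_seconds) * 100


# Example: a 1,000-second job billed for 300 seconds.
print(round(spot_savings_percent(1000, 300), 1))
```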

Automation Patterns

Trigger	→	Action	Service Chain
New data in S3	→	Start retraining	S3 Event → EventBridge → SageMaker Pipeline
Model drift detected	→	Alert + retrain	Model Monitor → CloudWatch → EventBridge → Pipeline
Model approved	→	Deploy to endpoint	Model Registry → Pipeline → Endpoint
Schedule (daily/weekly)	→	Run batch predictions	EventBridge Schedule → Batch Transform
Code pushed	→	Build + deploy	CodeCommit → CodePipeline → CodeBuild → Deploy
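
The first chain (new data in S3 triggers retraining) hinges on an EventBridge rule that matches S3 "Object Created" events. Below is a minimal sketch of such an event pattern built as a Python dictionary. The bucket name and prefix are placeholders, the helper name is hypothetical, and the bucket is assumed to have EventBridge notifications enabled; wiring the rule to a SageMaker Pipeline target is not shown.

```python
import json


def s3_object_created_pattern(bucket, prefix):
    """Build an EventBridge event pattern for new objects under a prefix."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            # Prefix matching limits the rule to the training-data folder.
            "object": {"key": [{"prefix": prefix}]},
        },
    }


# Placeholder names for illustration only.
pattern = s3_object_created_pattern("example-training-bucket", "incoming/")
print(json.dumps(pattern, indent=2))
```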

Common Trap Questions

Trap	Correct Answer
“Use EMR for simple ETL”	Glue (EMR is for large-scale distributed)
“Build custom recommendation engine”	Personalize (managed service first)
“Train a chatbot from scratch”	Lex (managed) or Bedrock (LLM-based)
“Store features in S3 manually”	Feature Store (built for this)
“Use EC2 for inference”	SageMaker Endpoint (managed, auto-scaling)
“Use Lambda for ML inference”	Only if the model fits Lambda limits (up to 10 GB container image/memory, 15-minute timeout); otherwise SageMaker
“Monitor model with custom scripts”	Model Monitor (built-in, integrates with Clarify)
“CloudFormation OR CDK?”	CDK if code-based infra needed, but both valid — CDK compiles to CFN
“Kinesis OR Firehose?”	Kinesis = real-time custom processing; Firehose = managed delivery
“ECS OR EKS?”	ECS = simpler, AWS-native; EKS = Kubernetes, portable

Numbers to Remember

Fact	Value
Exam questions	65 (50 scored + 15 unscored)
Exam time	130 minutes
Passing score	720 / 1000
S3 PUT throughput per prefix	3,500 req/s
S3 GET throughput per prefix	5,500 req/s
Kinesis shard write	1 MB/s or 1,000 records/s
Kinesis shard read	2 MB/s shared across all consumers, or 2 MB/s per consumer with enhanced fan-out
SageMaker real-time payload limit	6 MB
SageMaker async payload limit	1 GB
SageMaker serverless payload limit	4 MB
Rekognition Custom Labels min images	10 per label
Firehose min buffer	60 seconds (classic minimum; newer zero-buffering option can go lower)
Spot Instance savings	Up to 90%
SageMaker Savings Plans	Up to 64%
Inferentia inference savings	Up to 70%
Trainium training savings	Up to 50%
Ground Truth auto-labeling savings	Up to 70%
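
The Kinesis shard limits above lend themselves to a quick sizing calculation. This sketch assumes provisioned mode with shared-throughput consumers and uses the per-shard figures from the table (1 MB/s or 1,000 records/s in, 2 MB/s out); the function name is hypothetical.

```python
import math

WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_per_s, records_per_s, read_mb_per_s=0.0):
    """Smallest shard count that satisfies every per-shard limit."""
    by_write = math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD)
    by_records = math.ceil(records_per_s / WRITE_RECORDS_PER_SHARD)
    by_read = math.ceil(read_mb_per_s / READ_MB_PER_SHARD)
    # The tightest constraint wins; a stream needs at least one shard.
    return max(1, by_write, by_records, by_read)


# Many small records: the record-count limit dominates, not bandwidth.
print(shards_needed(write_mb_per_s=0.5, records_per_s=4500))
```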