MLA-C01 Exam Cheat Sheet

Rapid-fire reference for the exam. Print this and review the night before.


The Golden Rules

  1. SageMaker is almost always the answer for ML workflow questions
  2. Prefer managed services over custom solutions
  3. Bedrock for GenAI, SageMaker for custom ML
  4. Least privilege for every IAM question
  5. VPC Endpoints when “data must not traverse the internet”
  6. KMS for every encryption question
  7. XGBoost is the default for tabular/structured data
  8. RecordIO + Pipe mode for fastest training ingestion

Service → Purpose (One-Line Each)

Storage & Ingestion

Service | One-Liner
S3 | The data lake: all ML data lives here
EFS | Shared file system across instances
FSx for Lustre | High-throughput training I/O, backed by S3
Kinesis Data Streams | Real-time streaming with custom consumers
Data Firehose | Managed delivery to S3/Redshift (near real-time)
MSK | Managed Kafka
Managed Service for Apache Flink | Real-time stream processing with SQL

Data Transformation

Service | One-Liner
EMR + Spark | Large-scale distributed data processing
Glue ETL | Serverless ETL jobs (PySpark)
Glue Crawlers | Auto-discover schemas → Data Catalog
Glue DataBrew | Visual data prep + PII handling
Data Wrangler | Visual data prep inside SageMaker Studio
Ground Truth | Managed data labeling (humans + auto-label)
Feature Store | Centralized features (online + offline)
Athena | SQL queries on S3 (serverless)

ML Development

Service | One-Liner
SageMaker Studio | Unified IDE for all ML tasks
SageMaker Canvas | No-code ML for business users
SageMaker Autopilot | AutoML: automated algorithm selection + tuning
SageMaker Clarify | Bias detection + SHAP explainability
SageMaker Debugger | Real-time training monitoring + profiling
SageMaker Experiments | Track/compare training runs
Model Registry | Version control + approval workflows for models
JumpStart | Pre-trained model hub (400+ models)

GenAI / Bedrock

Service | One-Liner
Bedrock | Managed API access to foundation models
Bedrock Knowledge Bases | Managed RAG (S3 → chunk → embed → vector store)
Bedrock Agents | Multi-step LLM tasks with Lambda actions
Bedrock Guardrails | Content filters + PII + topic blocking

Managed AI Services

Service | One-Liner
Comprehend | NLP: sentiment, entities, PII, topics
Translate | Neural machine translation
Transcribe | Speech-to-text
Polly | Text-to-speech
Rekognition | Image/video analysis + custom labels
Textract | OCR for documents, forms, tables
Lex | Build chatbots (intents + slots + Lambda)
Personalize | Real-time recommendation engine
Kendra | Enterprise semantic search
A2I | Human review workflows for ML output
Q Business | Enterprise AI assistant (40+ data connectors)
Q Developer | AI coding assistant in IDE
Macie | Discover PII in S3
Fraud Detector | Managed fraud detection

Deployment & Ops

Service | One-Liner
SageMaker Pipelines | ML-native CI/CD workflows
CodePipeline | General CI/CD orchestration
CodeBuild | Build + test step
CodeDeploy | Deploy to compute targets
Step Functions | State machine workflow orchestration
EventBridge | Event-driven automation (triggers)
MWAA | Managed Apache Airflow (DAGs)
CloudFormation | Infrastructure as YAML/JSON
CDK | Infrastructure as code (Python/TS)
ECR | Docker container registry
ECS + Fargate | Run containers serverlessly
EKS | Managed Kubernetes
Neo | Compile model for edge hardware
IoT Greengrass | Edge ML inference

Monitoring & Security

Service | One-Liner
Model Monitor | Production drift detection (4 types)
CloudWatch | Metrics, alarms, logs, dashboards
CloudTrail | Audit log of all API calls
IAM | Users, roles, policies, permissions
KMS | Encryption key management
Secrets Manager | Store/rotate credentials
WAF | Web application firewall
Shield | DDoS protection
Lake Formation | Data lake access governance
VPC Endpoints | Keep traffic within AWS network

Algorithm Quick-Match

I Have… | I Want To… | Use
Tabular data | Classify / Predict | XGBoost
Simple linear data | Classify / Predict | Linear Learner
Multiple time series | Forecast future values | DeepAR
Text | Classify sentiment/topic | BlazingText
Text corpus | Find topics | NTM
Images | Classify | Image Classification
Images | Find + locate objects | Object Detection
Images | Label every pixel | Semantic Segmentation
Numeric data | Find anomalies | Random Cut Forest
IP addresses | Detect suspicious usage | IP Insights
Sparse click/purchase data | Recommend items | Factorization Machines
High dimensions | Reduce features | PCA
Unlabeled data | Group into clusters | K-Means
Sequences | Translate / Summarize | Seq2Seq

Endpoint Types: Which One?

Signal in Question | Answer
“Low latency”, “real-time API”, “interactive” | Real-Time Endpoint
“Intermittent traffic”, “minimize cost”, “pay per use” | Serverless Inference
“Large payload”, “minutes to process”, “>6 MB” | Async Inference
“Score entire dataset”, “nightly batch”, “no endpoint needed” | Batch Transform
“A/B test two models” | Production Variants
“Many models, sporadic traffic” | Multi-Model Endpoint
“Chain preprocessing + prediction” | Inference Pipeline
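
The first four rows of the endpoint table can be read as a small decision procedure. The sketch below is only a study aid, not an AWS API: the function name and its parameters are hypothetical, and the thresholds are the payload limits quoted in this sheet (6 MB real-time, 4 MB serverless, 1 GB async).

```python
# Payload limits as listed in this cheat sheet (in MB).
REALTIME_LIMIT_MB = 6
SERVERLESS_LIMIT_MB = 4
ASYNC_LIMIT_MB = 1024


def pick_inference_option(payload_mb, interactive,
                          intermittent=False, whole_dataset=False):
    """Heuristic mapping from workload traits to an inference option.

    All parameter names are illustrative assumptions, not SageMaker terms.
    """
    # Scoring a full dataset offline needs no persistent endpoint.
    if whole_dataset:
        return "Batch Transform"
    if payload_mb > ASYNC_LIMIT_MB:
        raise ValueError("payload exceeds every limit listed in this sheet")
    # Large payloads or long-running requests go to the queue-based option.
    if payload_mb > REALTIME_LIMIT_MB or not interactive:
        return "Async Inference"
    # Spiky, pay-per-use traffic with small payloads fits serverless.
    if intermittent and payload_mb <= SERVERLESS_LIMIT_MB:
        return "Serverless Inference"
    return "Real-Time Endpoint"
```

For example, `pick_inference_option(50, interactive=True)` falls through to Async Inference because the payload is above the real-time limit.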

Deployment: Which Strategy?

Signal in Question | Answer
“Instant rollback”, “zero downtime swap” | Blue/Green
“Test on small % first”, “gradual rollout” | Canary
“Shift traffic linearly over time” | Linear
“Compare models with zero user risk” | Shadow Testing
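
To make the difference between canary and linear shifting concrete, the sketch below lists the cumulative share of traffic on the new version after each step. The function names and the percentage parameters are hypothetical and chosen for illustration; real deployments configure this through the deployment service rather than code like this.

```python
def canary_schedule(canary_percent):
    """Two steps: a small slice first, then everything else at once."""
    if not 0 < canary_percent < 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return [canary_percent, 100]


def linear_schedule(step_percent):
    """Equal-sized steps until all traffic is on the new version."""
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be in (0, 100]")
    steps = list(range(step_percent, 100, step_percent))
    steps.append(100)  # the final step always completes the shift
    return steps


print(canary_schedule(10))   # [10, 100]
print(linear_schedule(25))   # [25, 50, 75, 100]
```

Blue/green is the degenerate case of a single step to 100%, and shadow testing sends no live responses from the new model at all.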

Data Format Decision

Signal in Question | Answer
“Fastest SageMaker training” | RecordIO + Pipe Mode
“Query with Athena”, “analytics” | Parquet (columnar)
“Schema may evolve”, “streaming” | Avro
“Hive / EMR workloads” | ORC
“Simple, human-readable” | CSV or JSON

Metric Selection

Scenario | Metric
“Don’t miss cancer” (minimize FN) | Recall
“Don’t flag good email as spam” (minimize FP) | Precision
“Balance precision and recall” | F1 Score
“Overall classifier quality” | AUC-ROC
“Regression, same units as target” | RMSE
“Regression, % variance explained” | R² (R-squared)
“Cluster quality” | Silhouette Score
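
As a reminder of what the classification and regression metrics above actually compute, here is a minimal pure-Python sketch using the standard textbook definitions. The helper names are illustrative, and the zero-denominator handling (returning 0.0) is an assumption made for convenience.

```python
import math


def classification_metrics(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# 8 true positives, 2 false positives, 8 false negatives:
# precision is high (0.8) while recall is only 0.5.
print(classification_metrics(tp=8, fp=2, fn=8))
```

Note how false negatives only hurt recall and false positives only hurt precision, which is why the "don't miss cancer" scenario maps to recall.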

Bias & Explainability

When | Tool
Before training (data bias) | Clarify (pre-training bias metrics)
After training (model bias) | Clarify (post-training bias metrics)
In production (drift) | Model Monitor + Clarify
Explain individual predictions | SHAP values
Feature effect on overall model | Partial Dependence Plots
Balance imbalanced datasets | Data Wrangler (SMOTE / resampling)

Security Patterns

Question Pattern | Answer
“Encrypt data at rest” | SSE-KMS (S3) / KMS (EBS, SageMaker)
“Encrypt in transit” | TLS/SSL (automatic for endpoints)
“Data must not leave AWS network” | VPC Endpoints (PrivateLink)
“Isolate training from internet” | VPC private subnet + network isolation
“Who accessed what resource” | CloudTrail
“Find PII in S3” | Macie
“Handle PII in ETL” | Glue DataBrew
“Rotate database passwords” | Secrets Manager
“Fine-grained data lake permissions” | Lake Formation
“Least-privilege ML roles” | SageMaker Role Manager
“Protect API from DDoS” | Shield + WAF
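
One common way the "data must not leave AWS network" pattern shows up in practice is an S3 bucket policy that denies requests not arriving through a specific VPC endpoint, using the `aws:SourceVpce` condition key. The sketch below only builds the policy document as a Python dict; the bucket name and endpoint ID are hypothetical placeholders.

```python
import json


def vpce_only_bucket_policy(bucket, vpce_id):
    """Deny all S3 actions on the bucket unless they come via vpce_id."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnlessFromVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {
                    "StringNotEquals": {"aws:SourceVpce": vpce_id}
                },
            }
        ],
    }


# Hypothetical names, for illustration only.
policy = vpce_only_bucket_policy("example-ml-data", "vpce-0123456789abcdef0")
print(json.dumps(policy, indent=2))
```

A blanket deny like this also blocks console and other non-endpoint access to the bucket, so real policies usually carve out exceptions for administrative roles.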

Cost Optimization Patterns

Question Pattern | Answer
“Reduce training costs” | Spot Instances + Checkpointing
“Reduce inference costs (variable traffic)” | Serverless Inference or Auto-Scaling
“Reduce inference costs (steady traffic)” | Savings Plans or Inferentia instances
“Many models, low traffic each” | Multi-Model Endpoints
“Right-size inference instances” | Inference Recommender
“Reduce DL training costs” | AWS Trainium (ml.trn1)
“Reduce DL inference costs” | AWS Inferentia (ml.inf2)
“Optimize model for target hardware” | SageMaker Neo
“Reduce data scanned in queries” | Parquet + Partitioning (Athena)
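
For the Spot Instances row, the savings SageMaker reports for a managed spot training job follow from two job fields, training time and billable time: savings = (1 - billable / training) × 100. The helper below is a small sketch of that arithmetic; the function name is an assumption.

```python
def managed_spot_savings_percent(training_seconds, billable_seconds):
    """Percent saved when billable time is lower than wall-clock training time."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    # Algebraically the same as (1 - billable / training) * 100.
    return 100 * (training_seconds - billable_seconds) / training_seconds


# A one-hour job billed for 18 minutes of compute.
print(managed_spot_savings_percent(3600, 1080))
```

Checkpointing matters here because an interrupted spot job that restarts from scratch accumulates training time without making progress.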

Automation Patterns

Trigger | Action | Service Chain
New data in S3 | Start retraining | S3 Event → EventBridge → SageMaker Pipeline
Model drift detected | Alert + retrain | Model Monitor → CloudWatch → EventBridge → Pipeline
Model approved | Deploy to endpoint | Model Registry → Pipeline → Endpoint
Schedule (daily/weekly) | Run batch predictions | EventBridge Schedule → Batch Transform
Code pushed | Build + deploy | CodeCommit → CodePipeline → CodeBuild → Deploy
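
The first chain above (S3 Event → EventBridge → SageMaker Pipeline) hinges on an EventBridge rule whose event pattern matches new objects. The sketch below builds such a pattern as a plain dict; the bucket name and key prefix are hypothetical, and it assumes EventBridge notifications are enabled on the bucket.

```python
import json


def s3_object_created_pattern(bucket, key_prefix):
    """EventBridge event pattern for new objects under a prefix."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            # Prefix matching limits the rule to the training-data folder.
            "object": {"key": [{"prefix": key_prefix}]},
        },
    }


# Hypothetical bucket and prefix, for illustration only.
pattern = s3_object_created_pattern("example-ml-data", "incoming/")
print(json.dumps(pattern, indent=2))
```

The rule's target would then be the SageMaker pipeline, with an IAM role that allows EventBridge to start pipeline executions.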

Common Trap Questions

Trap | Correct Answer
“Use EMR for simple ETL” | Glue (EMR is for large-scale distributed)
“Build custom recommendation engine” | Personalize (managed service first)
“Train a chatbot from scratch” | Lex (managed) or Bedrock (LLM-based)
“Store features in S3 manually” | Feature Store (built for this)
“Use EC2 for inference” | SageMaker Endpoint (managed, auto-scaling)
“Use Lambda for ML inference” | Only if the model fits Lambda limits (10 GB memory, 15-minute timeout); otherwise SageMaker
“Monitor model with custom scripts” | Model Monitor (built-in, integrates with Clarify)
“CloudFormation OR CDK?” | CDK if code-based infra is needed, but both are valid; CDK synthesizes to CloudFormation
“Kinesis OR Firehose?” | Kinesis Data Streams = real-time custom processing; Firehose = managed delivery
“ECS OR EKS?” | ECS = simpler, AWS-native; EKS = Kubernetes, portable

Numbers to Remember

Fact | Value
Exam questions | 65 (50 scored + 15 unscored)
Exam time | 130 minutes
Passing score | 720 / 1000
S3 PUT throughput per prefix | 3,500 req/s
S3 GET throughput per prefix | 5,500 req/s
Kinesis shard write | 1 MB/s or 1,000 records/s
Kinesis shard read | 2 MB/s shared across all consumers, or 2 MB/s per consumer with enhanced fan-out
SageMaker real-time payload limit | 6 MB
SageMaker async payload limit | 1 GB
SageMaker serverless payload limit | 4 MB
Rekognition Custom Labels min images | 10 per label
Firehose min buffer | 60 seconds
Spot Instance savings | Up to 90%
SageMaker Savings Plans | Up to 64%
Inferentia inference savings | Up to 70%
Trainium training savings | Up to 50%
Ground Truth auto-labeling savings | Up to 70%
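
The Kinesis shard limits above turn into a simple sizing calculation: the stream needs enough shards to satisfy whichever of the write-bandwidth, record-rate, or shared-read-bandwidth limits is tightest. The sketch below assumes shared (non-enhanced) consumers, and the function name is illustrative.

```python
import math

# Per-shard limits as listed in this cheat sheet.
WRITE_MB_PER_SHARD = 1.0
RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_per_s, records_per_s, read_mb_per_s):
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,  # a stream always has at least one shard
        math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD),
        math.ceil(records_per_s / RECORDS_PER_SHARD),
        math.ceil(read_mb_per_s / READ_MB_PER_SHARD),
    )


# 5 MB/s in, 3,000 records/s, 8 MB/s total read: write bandwidth dominates.
print(shards_needed(5, 3000, 8))  # 5
```

Many small records can make the record-rate limit the binding one: `shards_needed(0.5, 4500, 1)` also needs 5 shards even though the bandwidth is tiny.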