MLS-C01 Exam Scenarios & Trap Questions

The exam always gives you 2 obviously wrong options and 2 that look right. This guide teaches you to break the tie.


How MLS-C01 Questions Work

Question Anatomy

┌─────────────────────────────────────────────────────────────┐
│  SCENARIO     A company stores 10TB of clickstream data...  │
│  CONSTRAINT   ...with minimal operational overhead...       │
│  QUESTION     Which service should they use?                │
│                                                             │
│  A) Amazon EMR with Spark        ← Too complex             │
│  B) AWS Lambda                   ← Wrong scale             │
│  C) AWS Glue                     ← CORRECT                 │
│  D) Custom EC2 Spark cluster     ← Never right             │
└─────────────────────────────────────────────────────────────┘

The Tie-Breaking Framework

When 2 answers look correct, check these constraints in order:

  1. Latency requirement — real-time vs batch changes everything
  2. Payload/data size — some services have hard limits
  3. Operational overhead — “minimal overhead” → managed/serverless wins
  4. Cost — “cost-effective” → Spot, serverless, right-sized
  5. Compliance/security — “must not traverse internet” → VPC endpoint
  6. Expertise — “no ML expertise” → managed AI services (Rekognition, Comprehend)
  7. Scale — does the option actually support the stated volume?

Universal Rules

  • Managed service beats custom when constraints allow
  • SageMaker built-in algorithm beats custom code when possible
  • Serverless beats always-on for intermittent workloads
  • RecordIO + Pipe mode for large SageMaker training data
  • XGBoost for tabular, unless a specific constraint eliminates it
  • Transfer learning for small image datasets — never train from scratch

Domain 1 Scenarios: Data Engineering

Scenario 1 — Daily Log ETL

Scenario: A company ingests 5 TB of application logs daily into S3. They need to clean, filter, and transform this data before ML training. The team has limited Spark expertise.

Trap: The scale (5 TB) might make you think EMR is required.

Correct answer: AWS Glue

Why: 5 TB daily is well within Glue’s serverless capacity. The constraint “limited Spark expertise” + implied operational overhead concern → Glue wins. EMR requires cluster management.

Key discriminator: If the question says “complex custom transformations” or “existing Spark codebase” → EMR. If it says “serverless” or “minimal overhead” → Glue.


Scenario 2 — Real-Time vs Delivery

Scenario: A mobile app sends user events. A downstream ML model must process these within 200ms for personalization. Which streaming service?

Trap: Kinesis Data Firehose sounds simpler and is often brought up for streaming.

Correct answer: Kinesis Data Streams (not Firehose)

Why: Firehose is a delivery service that buffers records before writing them out (historically a minimum buffer interval of 60 seconds, which is the figure the exam assumes). For 200 ms processing, you need Kinesis Data Streams with a Lambda or Flink consumer.

Key discriminator:

  • Need sub-second processing or custom consumer → Kinesis Data Streams
  • Need to deliver data to S3/Redshift/OpenSearch (formerly Elasticsearch Service, "ES") with no code → Kinesis Firehose

Scenario 3 — Minimize Training Time Data Format

Scenario: A team trains a neural network on a 500 GB tabular dataset stored in S3. Training takes 12 hours. They want to reduce training time without changing the model.

Trap: Answers may include “upgrade to larger GPU instance” or “use EFS for shared storage.”

Correct answer: Convert data to RecordIO format and use Pipe mode

Why: File mode copies all 500 GB to the instance disk before training starts. Pipe mode streams data directly from S3, eliminating the wait. RecordIO is SageMaker’s optimized binary format.

Key discriminator: Question asks to reduce training time without model changes → always consider data ingestion format first.
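
As a rough illustration of the Pipe mode setting, the fragment below shows the fields of a CreateTrainingJob request that control data ingestion, written as a plain Python dict. It is only a sketch: the bucket, prefix, and channel name are placeholders, and a real request needs many more fields (image or algorithm name, role, output location, resources).

```python
# Partial CreateTrainingJob request, shown as a plain dict.
# Bucket, prefix, and channel name are hypothetical placeholders.
training_job_fragment = {
    "AlgorithmSpecification": {
        # "File" copies the dataset to local disk first; "Pipe" streams it from S3.
        "TrainingInputMode": "Pipe",
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            # Assumes the data was already converted to RecordIO-protobuf.
            "ContentType": "application/x-recordio-protobuf",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://example-bucket/train/",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }
    ],
}

print(training_job_fragment["AlgorithmSpecification"]["TrainingInputMode"])
```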


Scenario 4 — Kafka vs Kinesis

Scenario: A company is migrating an on-premises event processing system that uses Apache Kafka to AWS. They want minimal code changes.

Trap: Kinesis Data Streams might seem like the obvious AWS answer.

Correct answer: Amazon MSK (Managed Streaming for Apache Kafka)

Why: MSK is fully managed Kafka — existing Kafka clients, producers, and consumers work unchanged. Migrating to Kinesis requires rewriting producers/consumers to use Kinesis API.

Key discriminator: “Existing Kafka ecosystem” or “Kafka-compatible” → MSK. “New streaming workload on AWS” → Kinesis Data Streams.


Scenario 5 — Query S3 Data Without Loading

Scenario: A data scientist needs to run ad-hoc SQL queries on 2 TB of Parquet files in S3 to explore data before training. The queries run once or twice a week.

Trap: Loading data into Redshift sounds powerful and is a valid option.

Correct answer: Amazon Athena

Why: Loading 2 TB into Redshift takes time and costs money for a DW cluster that runs only twice weekly. Athena is serverless, pay-per-query, and queries Parquet on S3 natively.

Key discriminator: “Ad-hoc queries,” “infrequent,” “no loading” → Athena. “Frequent complex joins,” “data warehouse” → Redshift.


Scenario 6 — Feature Store Offline vs Online

Scenario: A company wants to serve features to a real-time ML model with < 10ms latency, and also train new models on historical features.

Correct answer: SageMaker Feature Store with both online and offline store enabled

Why: Online store (backed by DynamoDB-like store) handles low-latency serving. Offline store (backed by S3) handles training. Same feature definitions, no duplication.


Domain 2 Scenarios: EDA & Feature Engineering

Scenario 7 — Class Imbalance

Scenario: A fraud detection model has 99% “normal” and 1% “fraud” transactions. Trained model achieves 99% accuracy but misses almost all fraud.

Trap: “99% accuracy — great model!” The accuracy paradox.

Correct answer: Accuracy is misleading here. Use AUC-ROC or F1. Also apply SMOTE (oversampling), class weights, or undersample majority class.

Key discriminator: When one class is > 85% of data, accuracy is useless. The question will hint at this with “the model never predicts class X” or “all predictions are class Y.”
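
The accuracy paradox is easy to reproduce with a confusion matrix. The sketch below uses made-up counts (10,000 transactions, 1% fraud) for a model that never predicts fraud; the function name is illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: 100 fraud cases out of 10,000, and the model flags none of them.
always_normal = classification_metrics(tp=0, fp=0, fn=100, tn=9900)
print(always_normal)
```

Accuracy comes out at 0.99 while recall and F1 are both 0, which is exactly the pattern the question is testing.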


Scenario 8 — Feature Scaling

Scenario: A dataset has features: age (18-90), annual income (30,000-500,000), number of purchases (1-50). Which preprocessing step is critical?

Trap: “The model will learn the right scale automatically.”

Correct answer: Feature scaling (StandardScaler or MinMaxScaler) for linear/neural models; not required for tree-based models.

Key discriminator:

  • Linear Learner, KNN, Neural Networks → must scale
  • XGBoost, Random Forest, Decision Trees → scaling not needed
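
To make the two scalers concrete, here is a minimal pure-Python sketch of min-max scaling and standardization (population standard deviation). The sample values are invented, and the functions assume a non-constant column.

```python
def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

# Invented sample rows spanning the ranges in the scenario.
ages = [18, 35, 52, 90]
incomes = [30_000, 80_000, 150_000, 500_000]

scaled_ages = min_max_scale(ages)
scaled_incomes = standardize(incomes)
print(scaled_ages)
print(scaled_incomes)
```

After scaling, income no longer dominates distance or gradient computations simply because its raw numbers are larger.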

Scenario 9 — Missing Values — High Proportion

Scenario: A dataset has one column with 70% missing values. It is a feature for an ML model. What should you do?

Trap: “Use KNN imputation to fill in the missing values.”

Correct answer: Drop the column (or create a binary indicator “was_missing” + impute).

Why: 70% missing means the feature has very little signal. Imputing 70% of values introduces more noise than signal. Better to drop unless domain knowledge says missingness itself is predictive.

Key discriminator: > 60% missing → drop the column. 5-30% missing → impute.
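
If domain knowledge suggests missingness is informative, the "indicator plus impute" alternative can be sketched as follows. The column values are invented, and median imputation is just one reasonable choice.

```python
from statistics import median

def add_missing_indicator(values):
    """Return (imputed values, was_missing flags, fraction missing) for one column."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    was_missing = [1 if v is None else 0 for v in values]
    imputed = [fill if v is None else v for v in values]
    return imputed, was_missing, sum(was_missing) / len(values)

# Hypothetical column with 7 of 10 values missing.
column = [None, 4.0, None, None, 6.0, None, None, 5.0, None, None]
imputed, was_missing, missing_fraction = add_missing_indicator(column)
print(missing_fraction, imputed, was_missing)
```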


Scenario 10 — Curse of Dimensionality

Scenario: A model with 20 features achieves 92% accuracy. After feature engineering adds 1,500 new polynomial features, accuracy drops to 75%.

Trap: “Add more training data to fix this.”

Correct answer: Dimensionality reduction — PCA, or feature selection (L1 regularization, feature importance). The model is suffering from the curse of dimensionality.

Why: Too many features relative to samples → sparse feature space → model cannot generalize. PCA reduces dimensionality while preserving variance.


Scenario 11 — Multicollinearity

Scenario: Two features in a regression dataset have Pearson correlation r=0.97. The model is a linear regression.

Trap: “Keep both features — more information is better.”

Correct answer: Drop one feature, or apply PCA to collapse correlated features into uncorrelated components.

Why: Multicollinearity causes unstable coefficient estimates in linear models. The correlated feature provides near-zero marginal information.

Key discriminator: Tree models (XGBoost, RF) handle multicollinearity better than linear models.
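
One way to quantify the problem is the variance inflation factor (VIF). With only two correlated predictors, regressing one on the other gives R² = r², so VIF = 1 / (1 - r²). The sketch below is illustrative; the helper names are not from any library.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def vif_two_features(r):
    """VIF for one of two predictors whose pairwise correlation is r."""
    return 1.0 / (1.0 - r ** 2)

vif = vif_two_features(0.97)
print(round(vif, 2))
```

For r = 0.97 the VIF is roughly 17, meaning the variance of each coefficient estimate is inflated about seventeen-fold compared with uncorrelated features.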


Scenario 12 — Time Series Splitting

Scenario: A data scientist randomly shuffles and splits a time series dataset into 80% train, 20% test before training a DeepAR model.

Trap: “Random split is standard practice for most ML problems.”

Correct answer: This is wrong. Random split causes data leakage for time series — future timestamps appear in training data.

Fix: Use forward chaining (time-based split): train on data before date D, test on data after date D.
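
A time-based split can be sketched in a few lines. The rows and cutoff date below are invented; the point is that every training timestamp precedes every test timestamp.

```python
from datetime import date

def time_based_split(rows, cutoff):
    """Split (timestamp, value) rows: train strictly before cutoff, test on or after it."""
    ordered = sorted(rows, key=lambda row: row[0])
    train = [row for row in ordered if row[0] < cutoff]
    test = [row for row in ordered if row[0] >= cutoff]
    return train, test

# Hypothetical daily series for 1-10 January 2024.
rows = [(date(2024, 1, day), float(day)) for day in range(1, 11)]
train, test = time_based_split(rows, cutoff=date(2024, 1, 9))
print(len(train), len(test))
```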


Scenario 13 — Text Encoding Choice

Scenario: You need to train a logistic regression model to classify 100,000 customer support emails into 5 categories.

Correct answer: TF-IDF vectorization → Logistic Regression (or BlazingText in supervised mode)

Why: TF-IDF works well with linear classifiers for text. Simple, fast, interpretable. Deep learning is overkill for a clean text classification problem with a linear model.


Domain 3 Scenarios: Modeling

Scenario 14 — Tabular Data Default Choice

Scenario: A company has a tabular dataset with 500 features and 1 million rows to predict customer churn. No domain expertise on feature relationships.

Trap: “Deep learning handles complex patterns — use a neural network.”

Correct answer: XGBoost

Why: Gradient-boosted trees such as XGBoost frequently match or outperform deep learning on tabular data, with less preprocessing, tuning, and compute. For tabular exam questions, start from XGBoost unless a constraint rules it out.

Key discriminator: Image/text/audio → deep learning. Tabular structured data → XGBoost.


Scenario 15 — Small Image Dataset

Scenario: A medical imaging startup has 500 labeled chest X-ray images. They need to classify pneumonia vs normal.

Trap: “Train a CNN from scratch — full control over architecture.”

Correct answer: Transfer learning — use pretrained ResNet/VGG (pretrained on ImageNet), fine-tune on the 500 images.

Why: 500 images is far too small to train a CNN from scratch. Pretrained features (edges, textures, shapes) transfer even across domains.

Key discriminator: “Small dataset” + images → transfer learning. “Millions of images” → can train from scratch.


Scenario 16 — Streaming Anomaly Detection

Scenario: A manufacturing company needs to detect equipment anomalies from sensor streams in real time. No labeled anomaly data is available.

Trap: “Use Isolation Forest — it’s the standard anomaly detection algorithm.”

Correct answer: SageMaker Random Cut Forest (RCF)

Why: RCF is SageMaker’s built-in algorithm designed for real-time streaming anomaly detection. Isolation Forest is not a SageMaker built-in; RCF integrates natively with Kinesis.

Key discriminator: “Real-time streaming,” “SageMaker built-in,” “no labels” → RCF.


Scenario 17 — Forecasting Many Related Time Series

Scenario: A retailer needs to forecast demand for 10,000 SKUs across 500 stores. Historical sales data is available for all SKUs.

Trap: “Train one ARIMA/Prophet model per SKU.”

Correct answer: SageMaker DeepAR

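Why: DeepAR trains one global model across all the time series, so it learns inter-series patterns and can handle cold-start items. Training a separate ARIMA model per SKU-store combination would mean up to 10,000 × 500 = 5,000,000 models, which is impractical.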

Key discriminator: “Many related time series,” “cold start,” “new items” → DeepAR. Single isolated time series → ARIMA/Prophet/classical.


Scenario 18 — Text with Millions of Documents

Scenario: A company needs to classify news articles into 20 categories. They have 50 million labeled articles.

Trap: “Use BERT fine-tuning for best accuracy.”

Correct answer: SageMaker BlazingText in supervised mode

Why: BlazingText is built for scale: AWS describes it as training on more than a billion words in a couple of minutes on a multi-core CPU or a GPU. Fine-tuning BERT on 50M documents requires massive GPU compute, and BlazingText is often accurate enough for straightforward classification.

Key discriminator: “Millions of documents,” “fast,” “text classification” → BlazingText. “State-of-the-art accuracy,” “semantic understanding” → BERT/Transformer.


Scenario 19 — Overfitting Diagnosis

Scenario: A neural network achieves 99% training accuracy and 62% validation accuracy after 100 epochs.

Trap: “Train for more epochs to improve validation accuracy.”

Correct answer: The model is overfitting. Fixes in order of preference:

  1. Collect more training data
  2. Add dropout layers
  3. Add L2 regularization (weight decay)
  4. Early stopping (stop before epoch 100)
  5. Reduce model complexity (fewer layers/units)
  6. Data augmentation (if images)
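
Early stopping (fix 4) is simple to express in code. The sketch below is a generic patience-based rule, not the behaviour of any particular framework; the validation losses are invented.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once `patience` epochs have passed without the
    validation loss improving on the best value by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improving, then rising as the model overfits.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.61, 0.65, 0.70, 0.80]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

In practice the weights saved at `best_epoch` are the ones to keep, since later epochs only fit the training set more tightly.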

Scenario 20 — Underfitting Diagnosis

Scenario: A model achieves 55% training accuracy and 53% validation accuracy on a binary classification task with balanced classes (baseline = 50%).

Trap: “Add regularization — the model is too complex.”

Correct answer: The model is underfitting (high bias). Fixes:

  1. Use a more complex model (more layers, more estimators)
  2. Add more features / feature engineering
  3. Reduce regularization
  4. Train longer (more epochs)
  5. Try a different algorithm

Key discriminator: Train and val both low → underfitting. Train high, val low → overfitting.


Scenario 21 — Learning Rate Diagnosis

Scenario: During training, the loss curve oscillates wildly and does not decrease consistently.

Correct answer: Learning rate is too high. Reduce it. Consider using a learning rate scheduler (cosine annealing, step decay).

Scenario: Loss curve decreases very slowly and plateaus early without reaching good performance.

Correct answer: Learning rate is too low. Increase it. Try cyclical learning rates or warm-up.
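
Both failure modes show up even on a toy problem. The sketch below runs plain gradient descent on f(w) = w² (gradient 2w) with three illustrative learning rates; the specific values are arbitrary and chosen only to make the contrast obvious.

```python
def gradient_descent_losses(lr, steps=50, w=1.0):
    """Minimise f(w) = w**2 with fixed learning rate lr; return the loss before each step."""
    losses = []
    for _ in range(steps):
        losses.append(w * w)
        w -= lr * 2 * w  # gradient of w**2 is 2w
    return losses

too_high = gradient_descent_losses(lr=1.1)    # overshoots the minimum every step, loss grows
too_low = gradient_descent_losses(lr=0.001)   # moves toward the minimum, but barely
reasonable = gradient_descent_losses(lr=0.1)  # converges quickly

print(too_high[-1], too_low[-1], reasonable[-1])
```

With lr = 1.1 each update multiplies w by -1.2, so the weight flips sign and moves further from the minimum on every step; with lr = 0.001 the loss is still above 0.8 after 50 steps.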


Scenario 22 — Wrong Metric Trap

Scenario: A model classifies credit card transactions as fraud or not. 99.5% are not fraud. The model reports 99.6% accuracy. The business is satisfied.

Trap: “99.6% accuracy — excellent model!”

Correct answer: Accuracy is misleading on imbalanced data. A dummy model predicting “not fraud” for everything achieves 99.5%. Use F1-score, AUC-ROC, Precision-Recall AUC, or confusion matrix.


Scenario 23 — Precision vs Recall Choice

Scenario                Correct Metric         Reason
----------------------  ---------------------  ---------------------------------------------------
Cancer detection        Recall                 Missing cancer (FN) = patient dies
Spam filter             Precision              False alarm (FP) = lose important email
Fraud detection         Recall then Precision  Missing fraud is costly; investigate precision next
Content moderation      Precision              Over-blocking (FP) hurts user experience
Drug interaction alert  Recall                 Missing an interaction (FN) is dangerous

Scenario 24 — Recommendation System Choice

Scenario: An e-commerce site has 10 million users and 500,000 products. They want personalized product recommendations. Historical purchase data is available.

Trap: “Build a collaborative filtering model from scratch.”

Correct answer: Amazon Personalize — managed recommendation service. Alternatively, SageMaker Factorization Machines for custom approach.

Key discriminator: “No ML expertise needed,” “managed” → Personalize. “Custom algorithm,” “full control” → SageMaker Factorization Machines.


Scenario 25 — Ground Truth for Labeling

Scenario: A company needs to label 500,000 images for object detection. They need high quality labels with low cost.

Correct answer: SageMaker Ground Truth with automated data labeling (active learning) + Mechanical Turk for uncertain examples.

Why: Active learning automatically labels high-confidence examples, sends only uncertain ones to humans. Reduces labeling cost by up to 70%.


Domain 4 Scenarios: MLOps & Security

Scenario 26 — Deployment Mode Selection

Requirement                               Correct Deployment                    Why
----------------------------------------  ------------------------------------  -----------------------------------
1000 req/sec during day, 0 at night       Serverless or auto-scaling real-time  Scale to zero
Predict for 10M records monthly, offline  Batch Transform                       No persistent endpoint needed
100ms P99 latency SLA                     Real-time endpoint                    Consistent low latency
Process 500MB video files for inference   Async inference                       Payload > 6MB limit
50 rarely-used models                     Multi-model endpoint                  1 endpoint, N models, cost-efficient

Scenario 27 — Keeping Training Data Off Internet

Scenario: A healthcare company trains a model on patient data. The data must not traverse the public internet.

Trap: “Encrypt the data during transfer.”

Correct answer: Configure SageMaker training job within a VPC, use S3 VPC Gateway Endpoint to access S3 without internet.

The full setup:

SageMaker Training Job
├── VPC: private subnet only
├── No public IP
├── S3 Gateway Endpoint: routes S3 traffic within AWS network
└── SageMaker Interface Endpoint: routes API calls within AWS network
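
On the training-job side, the setup above maps to a few request fields. The fragment below is a sketch of the relevant part of a CreateTrainingJob request as a plain dict; the subnet and security group IDs are placeholders, and the VPC endpoints themselves are created separately.

```python
# Partial CreateTrainingJob request, shown as a plain dict.
# Subnet and security group IDs are hypothetical placeholders.
vpc_training_fragment = {
    "VpcConfig": {
        "Subnets": ["subnet-EXAMPLE"],        # private subnet with an S3 gateway endpoint route
        "SecurityGroupIds": ["sg-EXAMPLE"],
    },
    # Optional hardening: block outbound network calls from the training container.
    "EnableNetworkIsolation": True,
}

print(sorted(vpc_training_fragment["VpcConfig"]))
```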

Scenario 28 — Audit Trail

Scenario: A financial services company needs to audit who accessed the ML model and when, for compliance.

Trap: “Enable CloudWatch logging on the endpoint.”

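Correct answer: AWS CloudTrail, which records SageMaker API calls such as CreateTrainingJob and CreateEndpoint along with the caller identity and timestamp. (InvokeEndpoint is not captured as a management event by default, so check how invocation logging is configured if the audit must cover inference calls.) CloudWatch Logs record application output, not API access.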

Key discriminator: “Who called what API” → CloudTrail. “Application logs, metrics” → CloudWatch.


Scenario 29 — Model Drift Detection

Scenario: A model was deployed 6 months ago. Business suspects the model quality has degraded but has no ground truth labels readily available.

Trap: “Retrain the model on new data immediately.”

Correct answer: Enable SageMaker Model Monitor data quality monitoring. Compare current input data distribution against training baseline. Data drift indicates need for investigation/retraining.

If ground truth is available: Enable model quality monitoring to track accuracy, F1, etc. over time.
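
To give a feel for what "compare current input distribution against a baseline" means, the sketch below computes the Population Stability Index (PSI) over binned counts. PSI is a stand-in chosen for illustration; it is not claimed to be the statistic Model Monitor uses internally, and the bin counts are invented.

```python
import math

def psi(baseline_counts, current_counts, eps=1e-6):
    """Population Stability Index between two histograms that share the same bins."""
    base_total = sum(baseline_counts)
    cur_total = sum(current_counts)
    score = 0.0
    for base, cur in zip(baseline_counts, current_counts):
        base_pct = max(base / base_total, eps)  # eps avoids log(0) for empty bins
        cur_pct = max(cur / cur_total, eps)
        score += (cur_pct - base_pct) * math.log(cur_pct / base_pct)
    return score

# Hypothetical histograms of one feature: training baseline vs. recent traffic.
baseline = [250, 250, 250, 250]
unchanged = [25, 25, 25, 25]
shifted = [100, 200, 300, 400]

psi_unchanged = psi(baseline, unchanged)
psi_shifted = psi(baseline, shifted)
print(psi_unchanged, round(psi_shifted, 3))
```

A score of zero means the binned distributions match; larger values mean the live inputs have moved further from what the model saw in training.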


Scenario 30 — Cost-Effective Long Training

Scenario: A team needs to train a large neural network that takes 5 days. They want to minimize training cost.

Trap: “Use On-Demand instances for reliability.”

Correct answer: Managed Spot Training with checkpointing enabled. SageMaker automatically uses Spot instances, saves checkpoints to S3, and resumes if interrupted.

Key requirement: Must enable checkpoints. Without checkpoints, an interruption loses all progress.

Savings: Up to 90% vs On-Demand.
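
The Spot and checkpoint settings correspond to a handful of CreateTrainingJob fields. The fragment below is a sketch as a plain dict; the S3 URI is a placeholder and the wait budget of 7 days is an arbitrary assumption.

```python
# Partial CreateTrainingJob request, shown as a plain dict.
# The bucket name and the 7-day wait budget are hypothetical.
FIVE_DAYS = 5 * 24 * 3600
SEVEN_DAYS = 7 * 24 * 3600

spot_training_fragment = {
    "EnableManagedSpotTraining": True,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": FIVE_DAYS,    # actual training time allowed
        "MaxWaitTimeInSeconds": SEVEN_DAYS,  # training time plus time spent waiting for Spot capacity
    },
    "CheckpointConfig": {
        "S3Uri": "s3://example-bucket/checkpoints/",
        "LocalPath": "/opt/ml/checkpoints",  # directory the training script writes checkpoints to
    },
}

print(spot_training_fragment["StoppingCondition"])
```

The wait time must be at least as large as the runtime, because it includes the runtime itself.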


Scenario 31 — Multi-Model Endpoint

Scenario: A SaaS company has 2,000 customer-specific ML models. Each customer’s model is rarely used. They need to minimize inference cost.

Trap: “Deploy each model to its own endpoint.”

Correct answer: SageMaker Multi-Model Endpoint — loads models on demand, evicts unused models from memory. One endpoint serves thousands of models.

Key discriminator: N models, rarely used, cost concern → Multi-model endpoint. Single model, high traffic → dedicated real-time endpoint.


Scenario 32 — Encryption at Rest

Scenario: A company stores training data in S3 and model artifacts in S3. Compliance requires all data to be encrypted with keys the company controls.

Correct answer: Use SSE-KMS with a Customer Managed Key (CMK) in AWS KMS. The company controls key rotation, access policies, and auditing.

Key discriminator:

  • Keys managed by AWS → SSE-S3 (AES-256, no control)
  • Keys managed by customer in KMS → SSE-KMS (control + audit)
  • Keys managed by customer, provided per-request → SSE-C (most control, most complexity)
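
For SSE-KMS, the encryption choice is made per object (or as a bucket default). The sketch below shows the arguments one would pass to an S3 PutObject call, as a plain dict; the bucket, object key, and KMS key ARN are placeholders.

```python
# Arguments for an S3 PutObject request using SSE-KMS with a customer managed key.
# Bucket, object key, and KMS key ARN are hypothetical placeholders.
put_object_kwargs = {
    "Bucket": "example-bucket",
    "Key": "training/data.csv",
    "Body": b"feature_1,feature_2\n",
    "ServerSideEncryption": "aws:kms",  # "AES256" would select SSE-S3 instead
    "SSEKMSKeyId": "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
}

print(put_object_kwargs["ServerSideEncryption"])
```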

Scenario 33 — Step Functions vs Glue Workflows

Scenario: A team wants to orchestrate a multi-step ML pipeline: data validation → preprocessing → training → evaluation → conditional deployment.

Correct answer: AWS Step Functions (or SageMaker Pipelines)

Why: Step Functions natively supports conditional branching (the Choice state), error handling, and retries, and integrates with AWS services including SageMaker. In SageMaker Pipelines, the equivalent branching construct is ConditionStep.

Key discriminator: “Conditional logic,” “branching,” “AWS-native” → Step Functions. “Complex Python DAGs,” “Airflow background” → MWAA (Managed Workflows for Apache Airflow).
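
In Step Functions, the "conditional deployment" step is a Choice state in the state machine definition. The sketch below shows one such state as a Python dict plus a tiny local evaluator to illustrate the branching; the state names, JSON path, and 0.90 threshold are invented.

```python
import json

# A Choice state in Amazon States Language, written as a Python dict.
# State names, the JSON path, and the threshold are hypothetical.
deployment_gate = {
    "Type": "Choice",
    "Choices": [
        {
            "Variable": "$.evaluation.accuracy",
            "NumericGreaterThanEquals": 0.90,
            "Next": "DeployModel",
        }
    ],
    "Default": "RejectModel",
}

def next_state(choice_state, accuracy):
    """Local illustration of how the single rule above would route an execution."""
    rule = choice_state["Choices"][0]
    if accuracy >= rule["NumericGreaterThanEquals"]:
        return rule["Next"]
    return choice_state["Default"]

print(json.dumps(deployment_gate, indent=2))
print(next_state(deployment_gate, 0.93), next_state(deployment_gate, 0.80))
```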


The “Which Service” Decision Framework

Step 1: What type of problem is this?
  ├─ Data storage/retrieval?     → S3, DynamoDB, RDS, Redshift
  ├─ Data processing/ETL?        → Glue, EMR, Athena
  ├─ Streaming?                  → Kinesis, MSK, Firehose
  ├─ ML Training?                → SageMaker
  ├─ Inference/serving?          → SageMaker endpoints
  └─ No custom model needed?     → Managed AI Services (Rekognition, Comprehend...)

Step 2: What are the constraints?
  ├─ "Minimal operational overhead" → managed/serverless wins
  ├─ "No ML expertise"           → managed AI service wins
  ├─ "Low latency"               → real-time endpoint, DynamoDB
  ├─ "Cost-effective"            → Spot, serverless, right-sized
  ├─ "Must not traverse internet" → VPC endpoint
  ├─ "Compliance/audit"          → CloudTrail, KMS, VPC
  └─ "Large payload"             → async inference, batch transform

Step 3: Is there a SageMaker built-in algorithm?
  ├─ Yes → use it (managed, optimized, less code)
  └─ No  → custom container or script mode

Step 4: Choose the simplest option that meets ALL constraints
  → Simpler is almost always correct on this exam

Common Wrong Answer Patterns

The “Over-Engineered” Trap

Question: simple ETL on 100 GB of data
Wrong: “Set up an EMR cluster with Spark, configure HDFS, tune memory”
Right: “AWS Glue serverless job”

Signal: Answer is technically possible but adds unnecessary complexity. The exam rewards knowing when NOT to use powerful tools.


The “Technically Possible” Trap

Question: “…with no ML expertise required”
Wrong: Custom SageMaker training with custom container
Right: Amazon Rekognition / Comprehend / managed AI service

Signal: Answer works but violates an explicit constraint. Read every constraint word.


The “Old/Suboptimal Service” Trap

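Old/Wrong                          Modern/Correct                  Why
---------------------------------  ------------------------------  ----------------------------------------------------
AWS Data Pipeline                  AWS Step Functions              Data Pipeline is legacy, limited
Amazon ML                          SageMaker                       Amazon ML deprecated
Kinesis Analytics (v1)             Kinesis Data Analytics (Flink)  Newer, more powerful
Rekognition for custom categories  SageMaker Image Classification  Base Rekognition labels are fixed (Custom Labels is a separate feature)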

The “Wrong Metric for the Problem” Trap

Question Context          Wrong                   Right
------------------------  ----------------------  -------------
Imbalanced binary class   Accuracy                F1 or AUC-ROC
Cancer/disease detection  Precision               Recall
Spam filter               Recall                  Precision
Probabilistic outputs     RMSE                    Log loss
Time series ordering      Classification metrics  MAPE, RMSE

The “Data Leakage” Trap

Leakage occurs when:

  • Test data used during training (wrong split)
  • Future data appears in training for time series (random split)
  • Target-derived features included (target encoding without CV)
  • Scaler fit on entire dataset before split

Symptom: Suspiciously high accuracy that doesn’t hold in production.


Red Flags in Answer Options

Option Contains                      Likely  Reasoning
-----------------------------------  ------  --------------------------------------------------
“Custom EC2 cluster”                 Wrong   Use managed service instead
“Copy all data to EBS”               Wrong   EBS is single-instance block storage; S3 is correct
“Train from scratch on 500 images”   Wrong   Transfer learning required
“Use accuracy for fraud/medical”     Wrong   Imbalanced data, accuracy is misleading
“Random split for time series”       Wrong   Must use chronological split
“Same data for validation and test”  Wrong   Leakage: need separate holdout
“Deploy to 2000 separate endpoints”  Wrong   Multi-model endpoint
“Use Isolation Forest in SageMaker”  Wrong   RCF is the SageMaker built-in
“t-SNE in production pipeline”       Wrong   t-SNE cannot transform new points
“Normalize features for XGBoost”     Wrong   Tree models don’t need scaling

Scenario Patterns by Domain Weight

Domain 3 (36%) — Most Common Patterns

Pattern 1: Algorithm Selection
  Tabular → XGBoost
  Images → Image Classification / Transfer Learning
  Text classification → BlazingText
  Time series multi → DeepAR
  Anomaly stream → RCF
  Recommendation sparse → FM or Personalize
  Dimensionality reduction → PCA
  Topic modeling → LDA or NTM
  Embeddings custom → Object2Vec

Pattern 2: Training Problem
  High train / low val → overfitting → regularize
  Low train / low val → underfitting → complex model
  Oscillating loss → LR too high
  Slow plateau → LR too low
  Good val / bad prod → distribution shift → Model Monitor

Pattern 3: Evaluation
  Imbalanced → F1 / AUC
  Cost of FN high → recall
  Cost of FP high → precision
  Probabilistic → log loss / AUC
  Precision and recall weighted unequally → F-beta
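
The F-beta entry in Pattern 3 refers to the weighted harmonic mean of precision and recall, F_beta = (1 + beta²)·P·R / (beta²·P + R). A minimal sketch, with invented precision and recall values:

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 weights precision more heavily.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical model with precision 0.5 and recall 0.8.
f_half = f_beta(0.5, 0.8, beta=0.5)
f_one = f_beta(0.5, 0.8, beta=1.0)
f_two = f_beta(0.5, 0.8, beta=2.0)
print(round(f_half, 4), round(f_one, 4), round(f_two, 4))
```

Because this model's recall is higher than its precision, the recall-weighted F2 scores it highest and the precision-weighted F0.5 scores it lowest.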

Last-Minute Reminders — Summary Card

Question Topic                   Almost Always Correct
-------------------------------  ---------------------------------------------
Tabular data, no other info      XGBoost
Small image dataset              Transfer learning
Real-time streaming anomaly      Random Cut Forest (RCF)
Fastest training data ingestion  RecordIO + Pipe mode
Cost-effective long training     Spot instances + checkpointing
No ML expertise required         Rekognition, Comprehend, Forecast, Personalize
Drift detection over time        SageMaker Model Monitor
Bias + explainability            SageMaker Clarify
Security: no internet traversal  VPC + S3 VPC Gateway Endpoint
Audit API access                 CloudTrail
Encrypt with customer control    SSE-KMS with CMK
Many rare models, one endpoint   Multi-model endpoint
Large payload (>6MB) inference   Async inference
Offline bulk predictions         Batch Transform
Multiple related time series     DeepAR
Fast large-scale text classify   BlazingText
Serverless ETL                   AWS Glue
Kafka on AWS                     Amazon MSK
Ad-hoc SQL on S3                 Amazon Athena
Imbalanced data metric           F1-score or AUC-ROC
Time series split                Always chronological, never random

Algorithm Elimination Guide

Use this to quickly eliminate wrong answers in algorithm questions:

If data is tabular: Eliminate Image Classification, Object Detection, Seq2Seq, BlazingText

If data is images: Eliminate XGBoost, LDA, BlazingText, DeepAR, RCF

If no labels (unsupervised): Eliminate XGBoost, Linear Learner, BlazingText (supervised), Image Classification

If streaming real-time: Eliminate Batch Transform; favor RCF for anomaly

If text classification: BlazingText > LDA (LDA is topic modeling, not classification)

If “multiple time series”: DeepAR > Linear Learner > ARIMA

If “find similar items/users”: Object2Vec or FM > KNN for large sparse datasets


Timing Strategy

  • 65 questions in 180 minutes ≈ 2.8 minutes per question
  • Flag and skip any question taking > 2.5 minutes
  • Domain 3 questions are hardest — expect to spend more time
  • Re-read constraints in the last 30 seconds before answering
  • Two obviously wrong options: eliminate immediately, then apply tie-breaking framework
  • When in doubt between two answers: pick the one with LESS operational overhead