Exam Scenarios & Trap Questions
Table of Contents
- MLS-C01 Exam Scenarios & Trap Questions
- How MLS-C01 Questions Work
- Domain 1 Scenarios: Data Engineering
- Domain 2 Scenarios: EDA & Feature Engineering
- Domain 3 Scenarios: Modeling
- Scenario 14 — Tabular Data Default Choice
- Scenario 15 — Small Image Dataset
- Scenario 16 — Streaming Anomaly Detection
- Scenario 17 — Multiple Related Time Series
- Scenario 18 — Text with Millions of Documents
- Scenario 19 — Overfitting Diagnosis
- Scenario 20 — Underfitting Diagnosis
- Scenario 21 — Learning Rate Diagnosis
- Scenario 22 — Wrong Metric Trap
- Scenario 23 — Precision vs Recall Choice
- Scenario 24 — Recommendation System Choice
- Scenario 25 — Ground Truth for Labeling
- Domain 4 Scenarios: MLOps & Security
- Scenario 26 — Deployment Mode Selection
- Scenario 27 — Keeping Training Data Off Internet
- Scenario 28 — Audit Trail
- Scenario 29 — Model Drift Detection
- Scenario 30 — Cost-Effective Long Training
- Scenario 31 — Multi-Model Endpoint
- Scenario 32 — Encryption at Rest
- Scenario 33 — Step Functions vs Glue Workflows
- The “Which Service” Decision Framework
- Common Wrong Answer Patterns
- Red Flags in Answer Options
- Scenario Patterns by Domain Weight
- Last-Minute Reminders — Summary Card
- Algorithm Elimination Guide
- Timing Strategy
MLS-C01 Exam Scenarios & Trap Questions
Most questions give you 2 obviously wrong options and 2 that look right. This guide teaches you to break the tie.
How MLS-C01 Questions Work
Question Anatomy
┌─────────────────────────────────────────────────────────────┐
│ SCENARIO A company stores 10TB of clickstream data... │
│ CONSTRAINT ...with minimal operational overhead... │
│ QUESTION Which service should they use? │
│ │
│ A) Amazon EMR with Spark ← Too complex │
│ B) AWS Lambda ← Wrong scale │
│ C) AWS Glue ← CORRECT │
│ D) Custom EC2 Spark cluster ← Never right │
└─────────────────────────────────────────────────────────────┘
The Tie-Breaking Framework
When 2 answers look correct, check these constraints in order:
- Latency requirement — real-time vs batch changes everything
- Payload/data size — some services have hard limits
- Operational overhead — “minimal overhead” → managed/serverless wins
- Cost — “cost-effective” → Spot, serverless, right-sized
- Compliance/security — “must not traverse internet” → VPC endpoint
- Expertise — “no ML expertise” → managed AI services (Rekognition, Comprehend)
- Scale — does the option actually support the stated volume?
Universal Rules
- Managed service beats custom when constraints allow
- SageMaker built-in algorithm beats custom code when possible
- Serverless beats always-on for intermittent workloads
- RecordIO + Pipe mode for large SageMaker training data
- XGBoost for tabular, unless a specific constraint eliminates it
- Transfer learning for small image datasets — never train from scratch
Domain 1 Scenarios: Data Engineering
Scenario 1 — Daily Log ETL
Scenario: A company ingests 5 TB of application logs daily into S3. They need to clean, filter, and transform this data before ML training. The team has limited Spark expertise.
Trap: The scale (5 TB) might make you think EMR is required.
Correct answer: AWS Glue
Why: 5 TB daily is well within Glue’s serverless capacity. The constraint “limited Spark expertise” + implied operational overhead concern → Glue wins. EMR requires cluster management.
Key discriminator: If the question says “complex custom transformations” or “existing Spark codebase” → EMR. If it says “serverless” or “minimal overhead” → Glue.
Scenario 2 — Real-Time vs Delivery
Scenario: A mobile app sends user events. A downstream ML model must process these within 200ms for personalization. Which streaming service?
Trap: Kinesis Data Firehose sounds simpler and is often the first service that comes to mind for streaming.
Correct answer: Kinesis Data Streams (not Firehose)
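Why: Firehose is a delivery service, not a processing service. It buffers records by size or time interval before delivery (historically a 60-second minimum), so it is near-real-time at best. For 200ms processing, you need Kinesis Data Streams with a Lambda or Flink consumer.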
Key discriminator:
- Need sub-second processing or custom consumer → Kinesis Data Streams
- Need to deliver data to S3/Redshift/OpenSearch (formerly Elasticsearch) with no code → Kinesis Data Firehose
Scenario 3 — Minimize Training Time Data Format
Scenario: A team trains a neural network on a 500 GB tabular dataset stored in S3. Training takes 12 hours. They want to reduce training time without changing the model.
Trap: Answers may include “upgrade to larger GPU instance” or “use EFS for shared storage.”
Correct answer: Convert data to RecordIO format and use Pipe mode
Why: File mode copies all 500 GB to the instance disk before training starts. Pipe mode streams data directly from S3, eliminating the wait. RecordIO is SageMaker’s optimized binary format.
Key discriminator: Question asks to reduce training time without model changes → always consider data ingestion format first.
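As a rough sketch of what this change looks like in a training job request, the channel definition below is written as a plain Python dictionary so it runs without the AWS SDK. The field names follow the SageMaker CreateTrainingJob API; the bucket path and channel name are hypothetical placeholders.

```python
import json


def pipe_mode_channel(s3_uri: str, channel_name: str = "train") -> dict:
    """Build one InputDataConfig channel that streams RecordIO-protobuf data."""
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        "ContentType": "application/x-recordio-protobuf",
        # Pipe streams from S3 instead of copying the dataset to disk first.
        "InputMode": "Pipe",
    }


# Hypothetical bucket and prefix, for illustration only.
channel = pipe_mode_channel("s3://example-bucket/train-recordio/")
print(json.dumps(channel, indent=2))
```

The only lines that differ from a File mode channel are `ContentType` and `InputMode`, which is why the exam treats this as a change that leaves the model untouched.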
Scenario 4 — Kafka vs Kinesis
Scenario: A company is migrating an on-premises event processing system that uses Apache Kafka to AWS. They want minimal code changes.
Trap: Kinesis Data Streams might seem like the obvious AWS answer.
Correct answer: Amazon MSK (Managed Streaming for Apache Kafka)
Why: MSK is fully managed Kafka — existing Kafka clients, producers, and consumers work unchanged. Migrating to Kinesis requires rewriting producers/consumers to use Kinesis API.
Key discriminator: “Existing Kafka ecosystem” or “Kafka-compatible” → MSK. “New streaming workload on AWS” → Kinesis Data Streams.
Scenario 5 — Query S3 Data Without Loading
Scenario: A data scientist needs to run ad-hoc SQL queries on 2 TB of Parquet files in S3 to explore data before training. The queries run once or twice a week.
Trap: Loading data into Redshift sounds powerful and is a valid option.
Correct answer: Amazon Athena
Why: Loading 2 TB into Redshift takes time, and you pay for a data warehouse cluster that is queried only once or twice a week. Athena is serverless, pay-per-query, and queries Parquet on S3 natively.
Key discriminator: “Ad-hoc queries,” “infrequent,” “no loading” → Athena. “Frequent complex joins,” “data warehouse” → Redshift.
Scenario 6 — Feature Store Offline vs Online
Scenario: A company wants to serve features to a real-time ML model with < 10ms latency, and also train new models on historical features.
Correct answer: SageMaker Feature Store with both online and offline store enabled
Why: The online store (a low-latency key-value store) handles real-time serving. The offline store (backed by S3) holds feature history for training. Both use the same feature definitions, so nothing is duplicated.
Domain 2 Scenarios: EDA & Feature Engineering
Scenario 7 — Class Imbalance
Scenario: A fraud detection model has 99% “normal” and 1% “fraud” transactions. Trained model achieves 99% accuracy but misses almost all fraud.
Trap: “99% accuracy — great model!” The accuracy paradox.
Correct answer: Accuracy is misleading here. Use AUC-ROC or F1. Also apply SMOTE (oversampling), class weights, or undersample majority class.
Key discriminator: As a rule of thumb, when one class is > 85% of the data, accuracy tells you almost nothing. The question will hint at this with “the model never predicts class X” or “all predictions are class Y.”
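The accuracy paradox is easy to reproduce. The sketch below uses made-up labels (10 fraud cases in 1,000 transactions) and a dummy model that always predicts “normal”; the helper names are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 (0.0 when undefined)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up data: 1% fraud (label 1), 99% normal (label 0).
y_true = [1] * 10 + [0] * 990
always_normal = [0] * 1000  # dummy model that never predicts fraud

result = metrics(y_true, always_normal)
print(result)
```

Accuracy comes out at 99% while recall and F1 are both zero, which is exactly the gap the question is testing.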
Scenario 8 — Feature Scaling
Scenario: A dataset has features: age (18-90), annual income (30,000-500,000), number of purchases (1-50). Which preprocessing step is critical?
Trap: “The model will learn the right scale automatically.”
Correct answer: Feature scaling (StandardScaler or MinMaxScaler) for linear/neural models; not required for tree-based models.
Key discriminator:
- Linear Learner, KNN, Neural Networks → must scale
- XGBoost, Random Forest, Decision Trees → scaling not needed
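A minimal standardization sketch, using only the standard library and made-up income values. The scaler statistics are computed on the training split only and then reused on the test split, which also avoids the leakage that comes from fitting a scaler on the full dataset.

```python
from statistics import mean, pstdev


def fit_standard_scaler(column):
    """Learn mean and population standard deviation from training data only."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        sigma = 1.0  # constant column: avoid division by zero
    return mu, sigma


def transform(column, mu, sigma):
    """Apply z-score scaling with previously fitted statistics."""
    return [(x - mu) / sigma for x in column]


# Made-up annual incomes, in the range the scenario describes.
train_income = [30_000, 60_000, 90_000, 500_000]
test_income = [45_000, 120_000]

mu, sigma = fit_standard_scaler(train_income)
train_scaled = transform(train_income, mu, sigma)
test_scaled = transform(test_income, mu, sigma)  # reuse train statistics

print(train_scaled)
print(test_scaled)
```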
Scenario 9 — Missing Values — High Proportion
Scenario: A dataset has one column with 70% missing values. It is a feature for an ML model. What should you do?
Trap: “Use KNN imputation to fill in the missing values.”
Correct answer: Drop the column (or create a binary indicator “was_missing” + impute).
Why: 70% missing means the feature has very little signal. Imputing 70% of values introduces more noise than signal. Better to drop unless domain knowledge says missingness itself is predictive.
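Key discriminator (rules of thumb): > 60% missing → drop the column. 5-30% missing → impute. In between, decide from domain knowledge and whether missingness itself is predictive.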
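The “indicator plus impute” alternative can be sketched in a few lines. The column values below are made up, `None` stands for a missing entry, and median imputation is one reasonable choice rather than the only one.

```python
from statistics import median


def missing_fraction(column):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in column) / len(column)


def indicator_and_impute(column):
    """Return a was_missing flag column and a median-imputed copy."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    was_missing = [1 if v is None else 0 for v in column]
    imputed = [fill if v is None else v for v in column]
    return was_missing, imputed


# Made-up column with 7 of 10 values missing.
column = [None, 4.0, None, None, 6.0, None, None, 5.0, None, None]

frac = missing_fraction(column)
was_missing, imputed = indicator_and_impute(column)
print(frac, was_missing, imputed)
```

If the `was_missing` flag turns out to carry signal, keep it and the imputed column; otherwise dropping the feature is the simpler answer.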
Scenario 10 — Curse of Dimensionality
Scenario: A model with 20 features achieves 92% accuracy. After feature engineering adds 1,500 new polynomial features, accuracy drops to 75%.
Trap: “Add more training data to fix this.”
Correct answer: Dimensionality reduction — PCA, or feature selection (L1 regularization, feature importance). The model is suffering from the curse of dimensionality.
Why: Too many features relative to samples → sparse feature space → model cannot generalize. PCA reduces dimensionality while preserving variance.
Scenario 11 — Multicollinearity
Scenario: Two features in a regression dataset have Pearson correlation r=0.97. The model is a linear regression.
Trap: “Keep both features — more information is better.”
Correct answer: Drop one feature, or apply PCA to collapse correlated features into uncorrelated components.
Why: Multicollinearity causes unstable coefficient estimates in linear models. The correlated feature provides near-zero marginal information.
Key discriminator: Tree models (XGBoost, RF) handle multicollinearity better than linear models.
Scenario 12 — Time Series Splitting
Scenario: A data scientist randomly shuffles and splits a time series dataset into 80% train, 20% test before training a DeepAR model.
Trap: “Random split is standard practice for most ML problems.”
Correct answer: This is wrong. Random split causes data leakage for time series — future timestamps appear in training data.
Fix: Use forward chaining (time-based split): train on data before date D, test on data after date D.
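A time-based split is short enough to write by hand. This sketch assumes each row is a `(date, value)` tuple and uses a made-up cutoff date D; rows before D go to training, rows on or after D go to test.

```python
from datetime import date


def chronological_split(rows, cutoff):
    """Split (date, value) rows so that all training dates precede the cutoff."""
    ordered = sorted(rows, key=lambda r: r[0])
    train = [r for r in ordered if r[0] < cutoff]
    test = [r for r in ordered if r[0] >= cutoff]
    return train, test


# Ten made-up daily observations.
rows = [(date(2024, 1, d), float(d)) for d in range(1, 11)]

train, test = chronological_split(rows, cutoff=date(2024, 1, 9))
print(len(train), len(test))
```

Because nothing is shuffled, the latest training timestamp is always earlier than the earliest test timestamp.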
Scenario 13 — Text Encoding Choice
Scenario: You need to train a logistic regression model to classify 100,000 customer support emails into 5 categories.
Correct answer: TF-IDF vectorization → Logistic Regression (or BlazingText in supervised mode)
Why: TF-IDF works well with linear classifiers for text. It is simple, fast, and interpretable. Deep learning is overkill when the task is clean text classification and the model is already fixed as linear.
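To make the weighting concrete, here is a textbook TF-IDF computation in plain Python on three made-up emails. It uses term frequency times `log(N / document_frequency)`; library implementations such as scikit-learn add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up support emails.
emails = ["refund my order", "order arrived late", "reset my password"]
vectors = tf_idf(emails)
print(vectors[0])
```

Terms that appear in only one email (“refund”) get a higher weight than terms shared across emails (“order”, “my”), which is what lets a linear classifier separate the categories.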
Domain 3 Scenarios: Modeling
Scenario 14 — Tabular Data Default Choice
Scenario: A company has a tabular dataset with 500 features and 1 million rows to predict customer churn. No domain expertise on feature relationships.
Trap: “Deep learning handles complex patterns — use a neural network.”
Correct answer: XGBoost
Why: Gradient-boosted trees such as XGBoost usually match or beat deep learning on tabular data, with less preprocessing, tuning, and compute. For tabular data on this exam: XGBoost first, unless a constraint rules it out.
Key discriminator: Image/text/audio → deep learning. Tabular structured data → XGBoost.
Scenario 15 — Small Image Dataset
Scenario: A medical imaging startup has 500 labeled chest X-ray images. They need to classify pneumonia vs normal.
Trap: “Train a CNN from scratch — full control over architecture.”
Correct answer: Transfer learning — use pretrained ResNet/VGG (pretrained on ImageNet), fine-tune on the 500 images.
Why: 500 images is far too small to train a CNN from scratch. Pretrained features (edges, textures, shapes) transfer even across domains.
Key discriminator: “Small dataset” + images → transfer learning. “Millions of images” → can train from scratch.
Scenario 16 — Streaming Anomaly Detection
Scenario: A manufacturing company needs to detect equipment anomalies from sensor streams in real time. No labeled anomaly data is available.
Trap: “Use Isolation Forest — it’s the standard anomaly detection algorithm.”
Correct answer: SageMaker Random Cut Forest (RCF)
Why: RCF is SageMaker’s built-in unsupervised algorithm for anomaly detection and works on streaming data. Isolation Forest is not a SageMaker built-in. RCF is also exposed as the RANDOM_CUT_FOREST function in Kinesis Data Analytics for SQL.
Key discriminator: “Real-time streaming,” “SageMaker built-in,” “no labels” → RCF.
Scenario 17 — Multiple Related Time Series
Scenario: A retailer needs to forecast demand for 10,000 SKUs across 500 stores. Historical sales data is available for all SKUs.
Trap: “Train one ARIMA/Prophet model per SKU.”
Correct answer: SageMaker DeepAR
Why: DeepAR learns across all time series jointly, so it picks up inter-series patterns and can handle cold starts. Training 5,000,000 separate ARIMA models (10,000 SKUs × 500 stores) is infeasible.
Key discriminator: “Many related time series,” “cold start,” “new items” → DeepAR. Single isolated time series → ARIMA/Prophet/classical.
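For scale, 10,000 SKUs across 500 stores works out to 5,000,000 SKU-store series. The sketch below shows how such series can be written in the JSON Lines layout DeepAR accepts (`start`, `target`, and optional integer `cat` features). The sales numbers and ID encoding are made up for illustration.

```python
import json


def deepar_record(start, target, sku_id, store_id):
    """Serialize one time series as a DeepAR JSON Lines record."""
    return json.dumps({
        "start": start,              # timestamp of the first observation
        "target": target,            # observed values, oldest first
        "cat": [sku_id, store_id],   # integer-encoded categorical features
    })


n_skus, n_stores = 10_000, 500
n_series = n_skus * n_stores  # one series per SKU-store pair

# Two made-up series; a real dataset would have one line per pair.
lines = [
    deepar_record("2024-01-01 00:00:00", [12, 9, 15], sku_id=0, store_id=0),
    deepar_record("2024-01-01 00:00:00", [3, 4, 2], sku_id=0, store_id=1),
]
print(n_series)
print("\n".join(lines))
```

One model trains on all lines together, which is the design point that separates DeepAR from per-series methods.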
Scenario 18 — Text with Millions of Documents
Scenario: A company needs to classify news articles into 20 categories. They have 50 million labeled articles.
Trap: “Use BERT fine-tuning for best accuracy.”
Correct answer: SageMaker BlazingText in supervised mode
Why: BlazingText is extremely fast at scale. AWS describes it as training on more than a billion words in a couple of minutes on a multi-core CPU or a GPU. BERT fine-tuning on 50M documents requires massive GPU compute. BlazingText often comes close to BERT accuracy on straightforward classification tasks.
Key discriminator: “Millions of documents,” “fast,” “text classification” → BlazingText. “State-of-the-art accuracy,” “semantic understanding” → BERT/Transformer.
Scenario 19 — Overfitting Diagnosis
Scenario: A neural network achieves 99% training accuracy and 62% validation accuracy after 100 epochs.
Trap: “Train for more epochs to improve validation accuracy.”
Correct answer: The model is overfitting. Fixes in order of preference:
- Collect more training data
- Add dropout layers
- Add L2 regularization (weight decay)
- Early stopping (stop before epoch 100)
- Reduce model complexity (fewer layers/units)
- Data augmentation (if images)
Scenario 20 — Underfitting Diagnosis
Scenario: A model achieves 55% training accuracy and 53% validation accuracy on a binary classification task with balanced classes (baseline = 50%).
Trap: “Add regularization — the model is too complex.”
Correct answer: The model is underfitting (high bias). Fixes:
- Use a more complex model (more layers, more estimators)
- Add more features / feature engineering
- Reduce regularization
- Train longer (more epochs)
- Try a different algorithm
Key discriminator: Train and val both low → underfitting. Train high, val low → overfitting.
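The train/validation rule can be written as a small helper. The function name and both tolerance values are assumptions chosen for illustration, not standard thresholds.

```python
def diagnose_fit(train_acc, val_acc, baseline, gap_tol=0.05, lift_tol=0.10):
    """Classify a training result from train/validation accuracy.

    gap_tol and lift_tol are illustrative thresholds, not standard values.
    """
    if train_acc - baseline < lift_tol:
        # Training accuracy barely beats the baseline: high bias.
        return "underfitting"
    if train_acc - val_acc > gap_tol:
        # Training accuracy is far above validation accuracy: high variance.
        return "overfitting"
    return "reasonable fit"


# Scenario 19 numbers: 99% train, 62% validation.
print(diagnose_fit(0.99, 0.62, baseline=0.50))
# Scenario 20 numbers: 55% train, 53% validation.
print(diagnose_fit(0.55, 0.53, baseline=0.50))
```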
Scenario 21 — Learning Rate Diagnosis
Scenario: During training, the loss curve oscillates wildly and does not decrease consistently.
Correct answer: Learning rate is too high. Reduce it. Consider using a learning rate scheduler (cosine annealing, step decay).
Scenario: Loss curve decreases very slowly and plateaus early without reaching good performance.
Correct answer: Learning rate is too low. Increase it. Try cyclical learning rates or warm-up.
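Both failure modes show up even on a toy problem. This sketch runs plain gradient descent on f(x) = x², with three learning rates picked purely for illustration.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimize f(x) = x**2 (gradient 2x); return the loss after each step."""
    x = x0
    losses = []
    for _ in range(steps):
        x -= lr * 2 * x
        losses.append(x * x)
    return losses


too_high = gradient_descent(lr=1.1)     # overshoots: x flips sign and grows
too_low = gradient_descent(lr=0.001)    # barely moves in 20 steps
reasonable = gradient_descent(lr=0.1)   # converges steadily

print(too_high[-1], too_low[-1], reasonable[-1])
```

With the high rate each update jumps past the minimum, so the loss grows instead of shrinking. With the low rate the loss is still close to its starting value after 20 steps. On a real, noisy loss surface the first case appears as the oscillation described above.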
Scenario 22 — Wrong Metric Trap
Scenario: A model classifies credit card transactions as fraud or not. 99.5% are not fraud. The model reports 99.6% accuracy. The business is satisfied.
Trap: “99.6% accuracy — excellent model!”
Correct answer: Accuracy is misleading on imbalanced data. A dummy model predicting “not fraud” for everything achieves 99.5%. Use F1-score, AUC-ROC, Precision-Recall AUC, or confusion matrix.
Scenario 23 — Precision vs Recall Choice
| Scenario | Correct Metric | Reason |
|---|---|---|
| Cancer detection | Recall | Missing cancer (FN) = patient dies |
| Spam filter | Precision | False alarm (FP) = lose important email |
| Fraud detection | Recall then Precision | Missing fraud is costly; investigate precision next |
| Content moderation | Precision | Over-blocking (FP) hurts user experience |
| Drug interaction alert | Recall | Missing an interaction (FN) is dangerous |
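When a question asks for one number that leans toward recall or toward precision, F-beta is the usual tool: beta > 1 weights recall more, beta < 1 weights precision more. A minimal sketch, with made-up precision and recall values:

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up model with high recall and modest precision.
precision, recall = 0.5, 0.9

f1 = f_beta(precision, recall, beta=1)
f2 = f_beta(precision, recall, beta=2)      # recall-weighted (e.g. cancer detection)
f05 = f_beta(precision, recall, beta=0.5)   # precision-weighted (e.g. spam filter)

print(round(f1, 3), round(f2, 3), round(f05, 3))
```

The same model scores best under F2 and worst under F0.5, because its strength is recall.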
Scenario 24 — Recommendation System Choice
Scenario: An e-commerce site has 10 million users and 500,000 products. They want personalized product recommendations. Historical purchase data is available.
Trap: “Build a collaborative filtering model from scratch.”
Correct answer: Amazon Personalize — managed recommendation service. Alternatively, SageMaker Factorization Machines for custom approach.
Key discriminator: “No ML expertise needed,” “managed” → Personalize. “Custom algorithm,” “full control” → SageMaker Factorization Machines.
Scenario 25 — Ground Truth for Labeling
Scenario: A company needs to label 500,000 images for object detection. They need high quality labels with low cost.
Correct answer: SageMaker Ground Truth with automated data labeling (active learning) + Mechanical Turk for uncertain examples.
Why: Active learning automatically labels high-confidence examples, sends only uncertain ones to humans. Reduces labeling cost by up to 70%.
Domain 4 Scenarios: MLOps & Security
Scenario 26 — Deployment Mode Selection
| Requirement | Correct Deployment | Why |
|---|---|---|
| 1000 req/sec during day, 0 at night | Serverless or auto-scaling real-time | Serverless scales to zero; auto scaling shrinks the fleet off-peak |
| Predict for 10M records monthly, offline | Batch Transform | No persistent endpoint needed |
| 100ms P99 latency SLA | Real-time endpoint | Consistent low latency |
| Process 500MB video files for inference | Async inference | Payload > 6MB real-time limit; async accepts up to 1 GB |
| 50 rarely-used models | Multi-model endpoint | 1 endpoint, N models, cost-efficient |
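The table can be read as an ordered set of checks. The helper below is a hypothetical study aid that mirrors those rows; its name, parameters, and check order are assumptions, and the 6 MB threshold is the real-time payload limit used in this guide.

```python
def choose_deployment(*, offline=False, payload_mb=0.0, n_models=1,
                      intermittent=False):
    """Map question constraints to a SageMaker deployment option."""
    if offline:
        return "Batch Transform"          # no persistent endpoint needed
    if payload_mb > 6:
        return "Asynchronous Inference"   # payload exceeds real-time limit
    if n_models > 1:
        return "Multi-Model Endpoint"     # many models behind one endpoint
    if intermittent:
        return "Serverless Inference"     # idle periods, scale to zero
    return "Real-time endpoint"           # steady traffic, strict latency


print(choose_deployment(offline=True))
print(choose_deployment(payload_mb=500))
print(choose_deployment(n_models=50))
print(choose_deployment(intermittent=True))
print(choose_deployment())
```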
Scenario 27 — Keeping Training Data Off Internet
Scenario: A healthcare company trains a model on patient data. The data must not traverse the public internet.
Trap: “Encrypt the data during transfer.”
Correct answer: Configure SageMaker training job within a VPC, use S3 VPC Gateway Endpoint to access S3 without internet.
The full setup:
SageMaker Training Job
├── VPC: private subnet only
├── No public IP
├── S3 Gateway Endpoint: routes S3 traffic within AWS network
└── SageMaker Interface Endpoint: routes API calls within AWS network
Scenario 28 — Audit Trail
Scenario: A financial services company needs to audit who accessed the ML model and when, for compliance.
Trap: “Enable CloudWatch logging on the endpoint.”
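Correct answer: AWS CloudTrail. It records SageMaker API calls such as CreateTrainingJob and CreateEndpoint, with caller identity and timestamp. InvokeEndpoint is the notable exception: it is not captured as a management event by default. CloudWatch Logs record application output, not API access.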
Key discriminator: “Who called what API” → CloudTrail. “Application logs, metrics” → CloudWatch.
Scenario 29 — Model Drift Detection
Scenario: A model was deployed 6 months ago. Business suspects the model quality has degraded but has no ground truth labels readily available.
Trap: “Retrain the model on new data immediately.”
Correct answer: Enable SageMaker Model Monitor data quality monitoring. Compare current input data distribution against training baseline. Data drift indicates need for investigation/retraining.
If ground truth is available: Enable model quality monitoring to track accuracy, F1, etc. over time.
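The idea of comparing live inputs against a training baseline can be illustrated with the Population Stability Index (PSI). This is a generic drift statistic used here as a stand-in, not the computation Model Monitor itself performs; the feature values and bin edges are made up.

```python
import math


def psi(baseline, current, edges):
    """Population Stability Index between two samples over fixed bin edges."""
    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges the value is at or above.
            counts[sum(v >= e for e in edges)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    b, c = fractions(baseline), fractions(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))


# Made-up feature values (e.g. customer age) at training time and in production.
baseline = [20, 25, 30, 35, 40, 45, 50, 55]
unchanged = list(baseline)
shifted = [v + 20 for v in baseline]
edges = [30, 40, 50]

print(psi(baseline, unchanged, edges))
print(psi(baseline, shifted, edges))
```

A PSI of zero means the binned distributions match. A commonly cited rule of thumb treats values above roughly 0.25 as a significant shift worth investigating.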
Scenario 30 — Cost-Effective Long Training
Scenario: A team needs to train a large neural network that takes 5 days. They want to minimize training cost.
Trap: “Use On-Demand instances for reliability.”
Correct answer: Managed Spot Training with checkpointing enabled. SageMaker runs the job on Spot capacity, syncs the checkpoints your training code writes to S3, and resumes from the latest checkpoint after an interruption.
Key requirement: Must enable checkpoints. Without checkpoints, an interruption loses all progress.
Savings: Up to 90% vs On-Demand.
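The relevant CreateTrainingJob fields are sketched below as a plain dictionary, so the snippet runs without the AWS SDK. The S3 URI is a hypothetical placeholder, and the one-day wait buffer is an arbitrary choice; the API requires `MaxWaitTimeInSeconds` to be at least `MaxRuntimeInSeconds`.

```python
def spot_training_settings(max_runtime_s, max_wait_s, checkpoint_s3_uri):
    """Build the Spot-related fragment of a CreateTrainingJob request."""
    if max_wait_s < max_runtime_s:
        raise ValueError("MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds")
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_runtime_s,
            # Total budget including time spent waiting for Spot capacity.
            "MaxWaitTimeInSeconds": max_wait_s,
        },
        "CheckpointConfig": {
            "S3Uri": checkpoint_s3_uri,
            "LocalPath": "/opt/ml/checkpoints",
        },
    }


five_days = 5 * 24 * 3600
one_day = 24 * 3600

# Hypothetical bucket, for illustration only.
settings = spot_training_settings(
    five_days, five_days + one_day, "s3://example-bucket/checkpoints/"
)
print(settings)
```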
Scenario 31 — Multi-Model Endpoint
Scenario: A SaaS company has 2,000 customer-specific ML models. Each customer’s model is rarely used. They need to minimize inference cost.
Trap: “Deploy each model to its own endpoint.”
Correct answer: SageMaker Multi-Model Endpoint — loads models on demand, evicts unused models from memory. One endpoint serves thousands of models.
Key discriminator: N models, rarely used, cost concern → Multi-model endpoint. Single model, high traffic → dedicated real-time endpoint.
Scenario 32 — Encryption at Rest
Scenario: A company stores training data in S3 and model artifacts in S3. Compliance requires all data to be encrypted with keys the company controls.
Correct answer: Use SSE-KMS with a Customer Managed Key (CMK) in AWS KMS. The company controls key rotation, access policies, and auditing.
Key discriminator:
- Keys managed by AWS → SSE-S3 (AES-256, no control)
- Keys managed by customer in KMS → SSE-KMS (control + audit)
- Keys managed by customer, provided per-request → SSE-C (most control, most complexity)
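In request terms, SSE-KMS comes down to a few parameters. The sketch below builds them as plain dictionaries: the first mirrors the S3 PutObject parameters, the second the KMS-related fields of a SageMaker CreateTrainingJob request. The key ARN, bucket, and object key are hypothetical placeholders, and both dictionaries are fragments rather than complete requests.

```python
def sse_kms_put_args(bucket, key, kms_key_id):
    """S3 PutObject parameters that request SSE-KMS with a specific key."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_id,
    }


def training_encryption_settings(kms_key_id, output_s3_uri):
    """KMS-related fragment of a CreateTrainingJob request."""
    return {
        # Encrypts model artifacts written to S3.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri, "KmsKeyId": kms_key_id},
        # Encrypts the storage volume attached to the training instances.
        "ResourceConfig": {"VolumeKmsKeyId": kms_key_id},
    }


# Hypothetical customer managed key ARN.
cmk = "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID"

put_args = sse_kms_put_args("example-bucket", "train/data.csv", cmk)
job_fragment = training_encryption_settings(cmk, "s3://example-bucket/models/")
print(put_args)
print(job_fragment)
```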
Scenario 33 — Step Functions vs Glue Workflows
Scenario: A team wants to orchestrate a multi-step ML pipeline: data validation → preprocessing → training → evaluation → conditional deployment.
Correct answer: AWS Step Functions (or SageMaker Pipelines)
Why: Step Functions natively supports conditional branching (the Choice state), error handling, and retries, and integrates with AWS services including SageMaker. SageMaker Pipelines offers the equivalent ConditionStep.
Key discriminator: “Conditional logic,” “branching,” “AWS-native” → Step Functions. “Complex Python DAGs,” “Airflow background” → MWAA (Managed Workflows for Apache Airflow).
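In Amazon States Language, the “conditional deployment” step is a Choice state. The sketch below builds one as a Python dictionary and prints it as JSON. The state names, the JSON path, and the 0.90 threshold are hypothetical.

```python
import json


def conditional_deploy_state(threshold):
    """Choice state: deploy only if evaluation accuracy meets the threshold."""
    return {
        "Type": "Choice",
        "Choices": [
            {
                "Variable": "$.evaluation.accuracy",
                "NumericGreaterThanEquals": threshold,
                "Next": "DeployModel",
            }
        ],
        # Taken when no rule in Choices matches.
        "Default": "SkipDeployment",
    }


state = conditional_deploy_state(0.90)
print(json.dumps(state, indent=2))
```

The evaluation step is assumed to write its metric to `$.evaluation.accuracy` in the state input; any other path works as long as the Choice rule reads the same one.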
The “Which Service” Decision Framework
Step 1: What type of problem is this?
├─ Data storage/retrieval? → S3, DynamoDB, RDS, Redshift
├─ Data processing/ETL? → Glue, EMR, Athena
├─ Streaming? → Kinesis, MSK, Firehose
├─ ML Training? → SageMaker
├─ Inference/serving? → SageMaker endpoints
└─ No ML needed? → Managed AI Services (Rekognition, Comprehend...)
Step 2: What are the constraints?
├─ "Minimal operational overhead" → managed/serverless wins
├─ "No ML expertise" → managed AI service wins
├─ "Low latency" → real-time endpoint, DynamoDB
├─ "Cost-effective" → Spot, serverless, right-sized
├─ "Must not traverse internet" → VPC endpoint
├─ "Compliance/audit" → CloudTrail, KMS, VPC
└─ "Large payload" → async inference, batch transform
Step 3: Is there a SageMaker built-in algorithm?
├─ Yes → use it (managed, optimized, less code)
└─ No → custom container or script mode
Step 4: Choose the simplest option that meets ALL constraints
→ Simpler is almost always correct on this exam
Common Wrong Answer Patterns
The “Over-Engineered” Trap
Question: simple ETL on 100 GB of data
Wrong: “Set up an EMR cluster with Spark, configure HDFS, tune memory”
Right: “AWS Glue serverless job”
Signal: Answer is technically possible but adds unnecessary complexity. The exam rewards knowing when NOT to use powerful tools.
The “Technically Possible” Trap
Question: “…with no ML expertise required”
Wrong: Custom SageMaker training with custom container
Right: Amazon Rekognition / Comprehend / managed AI service
Signal: Answer works but violates an explicit constraint. Read every constraint word.
The “Old/Suboptimal Service” Trap
| Old/Wrong | Modern/Correct | Why |
|---|---|---|
| AWS Data Pipeline | AWS Step Functions | Data Pipeline is legacy, limited |
| Amazon ML | SageMaker | Amazon ML deprecated |
| Kinesis Analytics (v1) | Kinesis Data Analytics (Flink) | Newer, more powerful |
| Rekognition label detection for custom categories | SageMaker Image Classification (or Rekognition Custom Labels) | Base Rekognition labels are a fixed set |
The “Wrong Metric for the Problem” Trap
| Question Context | Wrong | Right |
|---|---|---|
| Imbalanced binary class | Accuracy | F1 or AUC-ROC |
| Cancer/disease detection | Precision | Recall |
| Spam filter | Recall | Precision |
| Probabilistic outputs | RMSE | Log loss |
| Time series forecasting | Classification metrics | MAPE, RMSE |
The “Data Leakage” Trap
Leakage occurs when:
- Test data used during training (wrong split)
- Future data appears in training for time series (random split)
- Target-derived features included (target encoding without CV)
- Scaler fit on entire dataset before split
Symptom: Suspiciously high accuracy that doesn’t hold in production.
Red Flags in Answer Options
| Option Contains | Likely | Reasoning |
|---|---|---|
| “Custom EC2 cluster” | Wrong | Use managed service instead |
| “Copy all data to EBS” | Wrong | EBS is single-instance block storage; S3 is correct |
| “Train from scratch on 500 images” | Wrong | Transfer learning required |
| “Use accuracy for fraud/medical” | Wrong | Imbalanced data, accuracy is misleading |
| “Random split for time series” | Wrong | Must use chronological split |
| “Same data for validation and test” | Wrong | Leakage — need separate holdout |
| “Deploy to 2000 separate endpoints” | Wrong | Multi-model endpoint |
| “Use Isolation Forest in SageMaker” | Wrong | RCF is the SageMaker built-in |
| “t-SNE in production pipeline” | Wrong | t-SNE cannot transform new points |
| “Normalize features for XGBoost” | Wrong | Tree models don’t need scaling |
Scenario Patterns by Domain Weight
Domain 3 (36%) — Most Common Patterns
Pattern 1: Algorithm Selection
Tabular → XGBoost
Images → Image Classification / Transfer Learning
Text classification → BlazingText
Time series multi → DeepAR
Anomaly stream → RCF
Recommendation sparse → FM or Personalize
Dimensionality reduction → PCA
Topic modeling → LDA or NTM
Embeddings custom → Object2Vec
Pattern 2: Training Problem
High train / low val → overfitting → regularize
Low train / low val → underfitting → complex model
Oscillating loss → LR too high
Slow plateau → LR too low
Good val / bad prod → distribution shift → Model Monitor
Pattern 3: Evaluation
Imbalanced → F1 / AUC
Cost of FN high → recall
Cost of FP high → precision
Probabilistic → log loss / AUC
Precision and recall weighted unequally → F-beta
Last-Minute Reminders — Summary Card
| Question Topic | Almost Always Correct |
|---|---|
| Tabular data, no other info | XGBoost |
| Small image dataset | Transfer learning |
| Real-time streaming anomaly | Random Cut Forest (RCF) |
| Fastest training data ingestion | RecordIO + Pipe mode |
| Cost-effective long training | Spot instances + checkpointing |
| No ML expertise required | Rekognition, Comprehend, Forecast, Personalize |
| Drift detection over time | SageMaker Model Monitor |
| Bias + explainability | SageMaker Clarify |
| Security: no internet traversal | VPC + S3 VPC Gateway Endpoint |
| Audit API access | CloudTrail |
| Encrypt with customer control | SSE-KMS with CMK |
| Many rare models, one endpoint | Multi-model endpoint |
| Large payload (>6MB) inference | Async inference |
| Offline bulk predictions | Batch Transform |
| Multiple related time series | DeepAR |
| Fast large-scale text classify | BlazingText |
| Serverless ETL | AWS Glue |
| Kafka on AWS | Amazon MSK |
| Ad-hoc SQL on S3 | Amazon Athena |
| Imbalanced data metric | F1-score or AUC-ROC |
| Time series split | Always chronological, never random |
Algorithm Elimination Guide
Use this to quickly eliminate wrong answers in algorithm questions:
If data is tabular: Eliminate Image Classification, Object Detection, Seq2Seq, BlazingText
If data is images: Eliminate XGBoost, LDA, BlazingText, DeepAR, RCF
If no labels (unsupervised): Eliminate XGBoost, Linear Learner, BlazingText (supervised), Image Classification
If streaming real-time: Eliminate Batch Transform; favor RCF for anomaly
If text classification: BlazingText > LDA (LDA is topic modeling, not classification)
If “multiple time series”: DeepAR > Linear Learner > ARIMA
If “find similar items/users”: Object2Vec or FM > KNN for large sparse datasets
Timing Strategy
- 65 questions in 180 minutes ≈ 2.8 minutes per question
- Flag and skip any question taking > 2.5 minutes
- Domain 3 questions are hardest — expect to spend more time
- Re-read constraints in the last 30 seconds before answering
- Two obviously wrong options: eliminate immediately, then apply tie-breaking framework
- When in doubt between two answers: pick the one with LESS operational overhead