MLS-C01 Exam Cheat Sheet

Last thing to read before the exam. Dense tables, no prose.

Algorithm Quick Reference — All 17+ Built-In

Algorithm	Problem	Input	Compute	Key Hyperparams	Use When
XGBoost	Class/Reg/Rank	tabular CSV/libsvm	CPU (GPU opt)	num_round, max_depth, eta, subsample, alpha, lambda	Tabular data, default first choice
Linear Learner	Class/Reg	tabular RecordIO/CSV	CPU	learning_rate, l1, wd, mini_batch_size	Large tabular, linear relationships, fast baseline
KNN	Class/Reg	tabular RecordIO/CSV	CPU	k, sample_size, feature_dim	Fast inference, low latency, small-medium data
Factorization Machines	Class/Reg	sparse RecordIO-protobuf (float32)	CPU	num_factors, lr, epoch	Sparse data, click prediction, recommendations
DeepAR	Forecasting	JSON Lines time series	GPU/CPU	context_length, prediction_length, num_layers	Multiple related time series, cold start
LDA	Topic modeling	Word-count vectors (RecordIO/CSV)	CPU	num_topics, alpha0	Text corpus, discover hidden topics
NTM	Topic modeling	Count vectors	GPU/CPU	num_topics, encoder_layers	Similar to LDA, neural approach
BlazingText	Text class / Word2Vec	Text file (space-delimited)	GPU/CPU	mode (supervised/cbow/skipgram), vector_dim	Fast text classification, word embeddings, millions of docs
Object2Vec	Embeddings	Pair of tokens/sequences	GPU	enc_dim, num_layers	Entity relationships, recommendation, similarity
Semantic Segmentation	Image segmentation	JPG images + PNG annotation masks / augmented manifest	GPU	backbone, algorithm (FCN/PSP/DeepLab)	Pixel-level classification of images
Image Classification	Multi-class images	RecordIO / augmented manifest	GPU	num_classes, num_layers, lr, epochs	Classify whole images, transfer learning available
Object Detection	Bounding boxes	RecordIO / augmented manifest	GPU	num_classes, base_network (VGG/ResNet)	Locate + classify objects in images
Seq2Seq	Sequence translation	Tokenized RecordIO	GPU	num_layers, num_embed, rnn_num_hidden	Machine translation, summarization
IP Insights	Anomaly (IP)	CSV (entity, IP)	GPU/CPU	num_entity_vectors, vector_dim	Detect suspicious IP usage patterns
Random Cut Forest (RCF)	Anomaly detection	RecordIO/CSV	CPU	num_trees, num_samples_per_tree	Real-time streaming anomaly detection
PCA	Dimensionality reduction	RecordIO/CSV	CPU	num_components, algorithm_mode (regular/randomized)	Reduce features, handle multicollinearity
K-Means	Clustering	RecordIO/CSV	GPU/CPU	k, init_method, epochs	Unsupervised grouping, customer segmentation

Exam tip — “tabular data with no info on structure” → XGBoost first. Images → Image Classification or Object Detection. Streaming anomaly → RCF.

Metric Quick Reference

Classification Metrics

Metric	Formula	Use When
Accuracy	$\frac{TP+TN}{TP+TN+FP+FN}$	Balanced classes, equal cost of errors
Precision	$\frac{TP}{TP+FP}$	Cost of FP is high (spam filter, fraud alert)
Recall	$\frac{TP}{TP+FN}$	Cost of FN is high (cancer detection, fraud missed)
F1	$\frac{2 \cdot P \cdot R}{P+R}$	Imbalanced classes, balance P and R
AUC-ROC	Area under ROC curve	Rank-ordering, imbalanced, threshold-agnostic
Log Loss	$-\frac{1}{N}\sum\left[y\log\hat{p} + (1-y)\log(1-\hat{p})\right]$	Probabilistic classifier quality
PR-AUC	Area under P-R curve	Highly imbalanced data (better than ROC)

Accuracy is USELESS for imbalanced data. Always flag this on exam.
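
A minimal sketch of why accuracy misleads on imbalanced data, using the formulas from the table above. The fraud counts are hypothetical and chosen only for illustration; `classification_metrics` is an illustrative helper, not a library function.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical dataset: 1,000 transactions, 10 of them fraudulent.
# A model that always predicts "not fraud" yields TP=0, FP=0, FN=10, TN=990.
always_negative = classification_metrics(tp=0, fp=0, fn=10, tn=990)
print(always_negative)
```

The always-negative model scores 99% accuracy while catching no fraud at all (recall and F1 are both 0), which is why recall, F1, or PR-AUC are the safer choices here.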

Regression Metrics

Metric	Formula	Use When
MSE	$\frac{1}{n}\sum(y-\hat{y})^2$	Penalize large errors heavily
RMSE	$\sqrt{MSE}$	Same scale as target, sensitive to outliers
MAE	$\frac{1}{n}\sum|y-\hat{y}|$	Robust to outliers, interpretable
R²	$1 - \frac{SS_{res}}{SS_{tot}}$	Proportion of variance explained (at most 1; negative if worse than predicting the mean)
MAPE	$\frac{100}{n}\sum\frac{|y-\hat{y}|}{|y|}$	Percentage error, scale-independent; undefined when $y = 0$

Other Metrics

Problem	Metric	Notes
Clustering	Silhouette score	Ranges -1 to 1, higher = better clusters
Clustering	Inertia (SSE)	Within-cluster sum of squares, lower = better
Ranking	NDCG	Normalized Discounted Cumulative Gain
Object Detection	mAP	Mean Average Precision across classes
Segmentation	IoU / mIoU	Intersection over Union

Key Formulas

Classification

$$\text{Precision} = \frac{TP}{TP+FP} \qquad \text{Recall} = \frac{TP}{TP+FN}$$

$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP+FP+FN}$$

$$F_\beta = (1+\beta^2)\frac{P \cdot R}{\beta^2 P + R}$$
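
A small sketch of the $F_\beta$ formula above, showing how $\beta$ shifts the weight between precision and recall. The precision and recall values are hypothetical.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


p, r = 0.5, 0.8  # hypothetical precision and recall
f1 = f_beta(p, r)                # balanced
f2 = f_beta(p, r, beta=2)        # recall-weighted
f_half = f_beta(p, r, beta=0.5)  # precision-weighted
print(round(f1, 3), round(f2, 3), round(f_half, 3))
```

Because recall is the higher of the two values here, the recall-weighted F2 comes out above F1, and the precision-weighted F0.5 comes out below it.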

Regression

$$MSE = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 \qquad RMSE = \sqrt{MSE}$$

$$MAE = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i| \qquad R^2 = 1 - \frac{\sum(y_i-\hat{y}_i)^2}{\sum(y_i-\bar{y})^2}$$
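
The regression formulas above can be checked by hand with a few lines of standard-library Python. This is a sketch on made-up numbers; `regression_metrics` is an illustrative helper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R^2 exactly as defined in the formulas above."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in residuals)
    mse = ss_res / n
    mae = sum(abs(e) for e in residuals) / n
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


# Hypothetical targets and predictions; the residuals are 1, 0, -2, -1.
metrics = regression_metrics([3, 5, 2, 7], [2, 5, 4, 8])
print(metrics)
```

The single residual of size 2 contributes 4 of the 6 units of squared error, so RMSE ends up above MAE. That gap is the "penalize large errors" effect noted in the table.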

Gradient Descent

$$\theta := \theta - \alpha \nabla_\theta J(\theta)$$
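
A minimal sketch of the update rule on a toy objective, assuming $J(\theta) = (\theta - 3)^2$ (chosen only because its minimum at $\theta = 3$ is obvious). It also shows what a learning rate that is too high does.

```python
def gradient_descent(grad, theta0, alpha, steps):
    """Apply theta := theta - alpha * grad(theta) for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta


def grad_j(theta):
    # Gradient of the toy objective J(theta) = (theta - 3)^2.
    return 2 * (theta - 3)


converged = gradient_descent(grad_j, theta0=0.0, alpha=0.1, steps=100)
diverged = gradient_descent(grad_j, theta0=0.0, alpha=1.1, steps=100)
print(converged, diverged)
```

With `alpha=0.1` each step shrinks the distance to the minimum by a factor of 0.8. With `alpha=1.1` each step multiplies it by -1.2, so the iterate oscillates and blows up.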

Regularization

$$L1 \text{ (Lasso)}: J + \lambda\sum|\theta_j| \qquad L2 \text{ (Ridge)}: J + \lambda\sum\theta_j^2$$

$$\text{Elastic Net}: J + \lambda_1\sum|\theta_j| + \lambda_2\sum\theta_j^2$$

Decision Trees

$$\text{Gini} = 1 - \sum_{i=1}^c p_i^2 \qquad \text{Entropy} = -\sum_{i=1}^c p_i \log_2 p_i$$

$$\text{Information Gain} = H(\text{parent}) - \sum_k \frac{|S_k|}{|S|} H(S_k)$$
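
The impurity formulas above, sketched in plain Python on a hypothetical binary split. The helper names and the example labels are illustrative only.

```python
import math
from collections import Counter


def class_probs(labels):
    """Relative frequency of each class in a node."""
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]


def gini(labels):
    return 1 - sum(p * p for p in class_probs(labels))


def entropy(labels):
    return -sum(p * math.log2(p) for p in class_probs(labels))


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    weighted = sum(len(ch) / len(parent) * entropy(ch) for ch in children)
    return entropy(parent) - weighted


# Hypothetical node with 5 "yes" and 5 "no", split into two children.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]
right = ["yes"] + ["no"] * 4
print(gini(parent), entropy(parent), information_gain(parent, [left, right]))
```

A 50/50 node has the maximum binary impurity (Gini 0.5, entropy 1 bit). The split into 4:1 and 1:4 children recovers roughly 0.28 bits.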

Probability

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$
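
A worked instance of Bayes' rule, assuming a hypothetical screening test: 1% prevalence, 90% sensitivity, and a 5% false positive rate. The numbers are invented for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) via Bayes' rule, expanding P(B) with the law of total probability."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# P(disease) = 0.01, P(positive | disease) = 0.90, P(positive | healthy) = 0.05
p_disease_given_positive = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
print(round(p_disease_given_positive, 4))
```

Even with a fairly accurate test, the posterior is only about 15% because the condition is rare. This is the same base-rate effect that makes precision low on highly imbalanced classes.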

Service Quick Reference — “When to Use What”

Data Storage

Service	Use Case	Key Detail
S3	Everything — always default	Objects, scalable, SageMaker native
EFS	Shared filesystem across instances	NFS, multi-AZ, training clusters
FSx for Lustre	HPC/ML high-throughput	Links to S3, fastest I/O for training
EBS	SageMaker notebook persistence	Block storage, single instance
Redshift	Data warehouse, SQL analytics	Columnar, petabyte scale
DynamoDB	Low-latency K-V lookups	Millisecond, managed NoSQL

Streaming & Ingestion

Service	Use Case	Key Detail
Kinesis Data Streams	Custom consumers, replay, low latency	Shard-based, 24h default retention (extendable up to 365 days)
Kinesis Data Firehose	Delivery to S3/Redshift/OpenSearch	No code, auto-scale, no replay
Kinesis Data Analytics	Real-time SQL/Flink on streams	Windowed queries, anomaly detection
MSK (Kafka)	Kafka-compatible streaming	Bring existing Kafka ecosystem
SQS	Decoupled queue, async processing	At-least-once, no ordering (std)

ETL & Processing

Service	Use Case	Key Detail
AWS Glue	Serverless ETL, PySpark	Crawlers, Data Catalog, job bookmarks
EMR	Full Hadoop/Spark control	EC2-based, complex transforms, cost
Athena	Ad-hoc SQL on S3	No server, pay per query, quick
Redshift Spectrum	Query S3 from Redshift	Extend DW without loading
AWS Batch	Large batch compute	Non-ML batch jobs
Step Functions	Workflow orchestration	ML pipelines, state machines

AI/ML Managed Services

Service	Problem	Key Detail
Rekognition	Image/video analysis	Object detection, face, celebrity, moderation
Comprehend	NLP — text analysis	Sentiment, entities, key phrases, topics
Transcribe	Speech → text	ASR, medical version available
Polly	Text → speech	TTS, multiple voices/languages
Translate	Language translation	Neural MT, 75+ languages
Textract	OCR + document parsing	Tables, forms, beyond simple OCR
Forecast	Time series forecasting	AutoML, no ML expertise needed
Personalize	Recommendation engine	User-item interactions, real-time
Lex	Chatbot / conversational AI	Same technology as Alexa, ASR + NLU
Kendra	Intelligent search	Enterprise search, FAQ extraction
Fraud Detector	Fraud detection	Online fraud, account takeover
Lookout for Metrics	Anomaly in business metrics	No ML required, time series

SageMaker Deployment Modes

Mode	Latency	Payload	When to Use
Real-time endpoint	Low (ms)	< 6 MB	Interactive apps, low-latency APIs
Serverless inference	Variable	< 4 MB	Intermittent traffic, unpredictable spikes
Asynchronous inference	Minutes	Up to 1 GB	Large payloads, video/audio, long inference
Batch transform	Offline	Unlimited dataset size (up to 100 MB per request)	Bulk offline prediction, no persistent endpoint

SageMaker Architecture

┌─────────────────────────────────────────────────────────────┐
│                    SAGEMAKER ECOSYSTEM                       │
│                                                             │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐               │
│  │   Data   │──▶│ Feature  │──▶│ Training │               │
│  │  Sources │   │  Store   │   │   Job    │               │
│  └──────────┘   └──────────┘   └──────────┘               │
│       │                              │                      │
│       │         ┌────────────────────┘                      │
│       │         ▼                                           │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐               │
│  │  S3 /    │   │  Model   │──▶│  Model   │               │
│  │  Ground  │   │ Registry │   │ Monitor  │               │
│  │  Truth   │   └──────────┘   └──────────┘               │
│  └──────────┘        │                                      │
│                       ▼                                      │
│              ┌─────────────────────────────┐               │
│              │     DEPLOYMENT OPTIONS       │               │
│              │ ┌──────┐ ┌──────┐ ┌──────┐ │               │
│              │ │ RT   │ │Batch │ │Async │ │               │
│              │ │Endpt │ │Trans │ │Endpt │ │               │
│              │ └──────┘ └──────┘ └──────┘ │               │
│              └─────────────────────────────┘               │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐  │
│  │    TOOLS: Studio | Pipelines | Experiments | Clarify │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Instance Type Selection

Workload	Instance Family	Notes
Training deep learning	ml.p3, ml.p4, ml.g4dn, ml.g5	GPU required for CV/NLP
Training tree models	ml.m5, ml.c5	XGBoost CPU-optimized
Inference high-throughput	ml.c5, ml.c6g	CPU inference is cheaper
Inference GPU required	ml.g4dn	Real-time DL inference
Notebook exploration	ml.t3, ml.m5	Notebooks don’t need GPU
Large model training	ml.p4d.24xlarge	Multi-GPU, NVLink

/opt/ml/ Directory (Custom Containers)

/opt/ml/
├── input/
│   ├── config/          ← hyperparameters.json, resourceConfig.json
│   └── data/
│       └── {channel}/   ← training data from S3
├── model/               ← save model artifacts HERE
├── output/
│   └── failure          ← write failure reason here on error
└── code/                ← your training script (if using Script Mode)
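
A minimal sketch of a custom-container training entry point that follows the directory layout above. Everything specific here is an assumption for illustration: the function name `run_training`, the `epochs` hyperparameter, the `training` channel name, and the `model.json` artifact. A real script would fit and serialize an actual model.

```python
import json
import traceback
from pathlib import Path


def run_training(base_dir="/opt/ml"):
    """Read config and data from the SageMaker layout, write artifacts or a failure file."""
    base = Path(base_dir)
    try:
        # Hyperparameter values arrive as strings in this JSON file.
        config_path = base / "input" / "config" / "hyperparameters.json"
        params = json.loads(config_path.read_text())
        epochs = int(params.get("epochs", "1"))  # hypothetical hyperparameter

        # Each input channel is a directory; "training" is an assumed channel name.
        train_files = sorted((base / "input" / "data" / "training").glob("*"))

        # Placeholder artifact standing in for a serialized model.
        model_dir = base / "model"
        model_dir.mkdir(parents=True, exist_ok=True)
        artifact = {"epochs": epochs, "num_files": len(train_files)}
        (model_dir / "model.json").write_text(json.dumps(artifact))
        return True
    except Exception:
        # On error, record the reason in output/failure as shown in the layout.
        output_dir = base / "output"
        output_dir.mkdir(parents=True, exist_ok=True)
        (output_dir / "failure").write_text(traceback.format_exc())
        return False
```

Taking `base_dir` as a parameter is a design choice that lets the same function run against a temporary directory in local tests.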

Data Format Decision Tree

What is your data?
│
├─ Tabular data
│   ├─ Need fast SageMaker training?  → RecordIO (protobuf) + Pipe mode
│   ├─ XGBoost?                       → CSV or libsvm
│   └─ General purpose?               → CSV (simplest)
│
├─ Image data
│   ├─ SageMaker built-in algo?       → RecordIO or augmented manifest
│   └─ Custom model?                  → Raw files in S3
│
├─ Text data
│   ├─ BlazingText?                   → One sentence per line, space-delimited
│   ├─ Seq2Seq?                       → Tokenized integer sequences
│   └─ General NLP?                   → JSON Lines / CSV
│
└─ Time series
    ├─ DeepAR?                         → JSON Lines (start + target)
    └─ Forecast service?               → CSV or Parquet

RecordIO + Pipe mode = fastest SageMaker training (streams from S3, no copy to disk)

Overfitting vs Underfitting Quick Card

Symptom	Diagnosis	Fix
Train acc high, test acc low	Overfitting	Add dropout, regularization (L1/L2), more data, early stopping, simpler model
Train acc low, test acc low	Underfitting	More complex model, more features, more epochs, less regularization, larger network
Train acc = test acc, both low	High bias	Feature engineering, polynomial features, different algorithm
Train acc oscillates	LR too high	Reduce learning rate, use LR scheduler
Train loss never decreases	LR too low / bad init	Increase LR, check data normalization
Good train/val, bad production	Data leakage or distribution shift	Fix data split, monitor drift
Model works on seen, fails on unseen	Poor generalization	Cross-validation, augmentation, regularization

Feature Engineering Quick Card

Numerical Features

Scale: StandardScaler (mean=0, std=1) for linear/neural models; MinMaxScaler (0-1) for bounded
Log transform: $\log(x+1)$ for skewed/long-tail distributions
Binning: Convert continuous → ordinal buckets (age groups)
Polynomial: $x^2, x_1 x_2$ to capture non-linear relationships
Clipping: Remove/cap extreme outliers before scaling
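
The numerical transforms above, sketched with the standard library only. The helper names and the income values are hypothetical. `standardize` uses the population standard deviation, which is what scikit-learn's `StandardScaler` does as well.

```python
import math
import statistics


def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max(values):
    """Rescale linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def clip(values, lower, upper):
    """Cap values outside [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def log_transform(values):
    """log(x + 1), defined at x = 0 and compressing long right tails."""
    return [math.log1p(v) for v in values]


incomes = [20, 25, 30, 35, 40, 1000]  # hypothetical, one extreme outlier
clipped = clip(incomes, 0, 100)
print(min_max(incomes))
print(min_max(clipped))
```

Without clipping, min-max scaling squeezes the five ordinary values into roughly the bottom 2% of the range. Clipping first spreads them out again, which is why the outlier step comes before scaling.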

Categorical Features

Label encoding: ordinal categories (low/med/high → 0/1/2)
One-hot encoding: nominal categories, low cardinality (< ~20 values)
Target encoding: high cardinality, replaces category with mean(target) — risk of leakage
Embedding: high cardinality in deep learning (entity embeddings)

Text Features

BoW / TF-IDF: sparse bag of words, good for linear models
Word2Vec / FastText: dense embeddings, captures semantics
BERT / Transformers: contextual embeddings, state-of-the-art
n-grams: capture phrases, increase vocabulary size

Time Series Features

Lag features: $x_{t-1}, x_{t-7}$ (yesterday, last week)
Rolling statistics: rolling mean, rolling std, rolling max
Decomposition: trend + seasonality + residual
Differencing: $x_t - x_{t-1}$ to remove trend, stationarity
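
A sketch of lag, rolling, and differencing features on a hypothetical daily sales series. Missing history is marked with `None`; the helper names are illustrative.

```python
def lag(series, k):
    """Value from k steps earlier; the first k positions have no history."""
    return [None] * k + series[:-k]


def rolling_mean(series, window):
    """Mean of the last `window` values, ending at the current point."""
    out = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(series[i + 1 - window : i + 1]) / window)
    return out


def difference(series):
    """First difference x_t - x_(t-1); removes a linear trend."""
    return [None] + [b - a for a, b in zip(series, series[1:])]


sales = [10, 12, 14, 16, 18, 20]  # hypothetical series with a linear trend
print(lag(sales, 1))
print(rolling_mean(sales, 3))
print(difference(sales))
```

The rolling window here includes the current value, so it should be shifted by one step (for example with `lag`) before being used as a feature to predict $x_t$. Otherwise the target leaks into its own feature.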

Missing Data Strategies

Strategy	When
Mean/median imputation	MCAR (missing completely at random), numerical
Mode imputation	Categorical, low proportion missing
KNN imputation	MCAR/MAR, preserve relationships
Model-based imputation	Complex dependencies
Indicator variable	When missingness itself is informative
Drop column	> 60-70% missing, no signal
Drop row	Very few missing, MCAR
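
A small sketch combining two rows of the table above: median imputation plus an indicator variable. The `ages` column is hypothetical, and `None` stands in for a missing value.

```python
import statistics


def impute_median_with_indicator(values):
    """Fill missing entries with the median and flag which entries were filled."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing


ages = [25, None, 40, 31, None, 58]  # hypothetical column with two gaps
filled, was_missing = impute_median_with_indicator(ages)
print(filled)
print(was_missing)
```

In a real pipeline the median would be computed on the training split only and then reused for validation and test data, so that no statistics leak across the split.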

Security Quick Card

Encryption

Layer	Mechanism	Service
At rest (S3)	SSE-S3, SSE-KMS, SSE-C	KMS, S3
At rest (EBS/EFS)	KMS CMK	KMS
In transit	TLS 1.2+	ACM
Inter-container	Enable inter-container encryption	SageMaker training config
Model artifacts	KMS key on S3 bucket	KMS + S3

Network

Component	Purpose
VPC + private subnet	Isolate training/inference from internet
S3 VPC Gateway Endpoint	Access S3 without internet traversal
SageMaker VPC Interface Endpoint	Access SageMaker API without internet
Security Groups	Control inter-instance traffic
NAT Gateway	Outbound internet from private subnet

Access Control

Layer	Mechanism
SageMaker execution role	IAM role attached to notebook/training job
Bucket policy	S3 access from specific role/VPC only
Resource-based policy	Cross-account access
KMS key policy	Who can use encryption keys
CloudTrail	Audit all API calls

Monitoring

Tool	Purpose
CloudWatch Metrics	Resource utilization, invocation counts
CloudWatch Logs	Training/inference logs
CloudTrail	API call audit trail
SageMaker Model Monitor	Data drift, model quality, bias, explainability
SageMaker Clarify	Bias detection, feature importance (SHAP)

Cost Optimization Quick Card

Training

Technique	Savings	Notes
Spot instances	Up to 90%	Requires checkpointing for recovery
Right-size instances	20-50%	Profile before choosing large instance
Warm pools	Reduce startup time	Keep instances ready between jobs
Managed spot training	Built-in	SageMaker handles interruption
Compile with Neo	Speed → cost	Optimize model for inference

Inference

Technique	Savings	Notes
Auto-scaling	Pay only for load	Scale to zero with serverless
Serverless inference	No idle cost	Cold start tradeoff
Multi-model endpoint	1 endpoint N models	Share resources across models
Elastic Inference	GPU fraction	Attach fraction of GPU to CPU instance
Inferentia (Inf1)	Lower cost than GPU	Custom ML chip, high throughput
Graviton2 (ml.c6g)	Lower CPU cost	ARM-based, cheaper CPU inference

Storage

Technique	Savings	Notes
S3 Intelligent-Tiering	Auto-tier	For unknown access patterns
S3 Lifecycle rules	Move to Glacier	Archive old model artifacts
Delete unused endpoints	Immediate	Endpoints bill while running
Shared feature store	Avoid recompute	Feature Store offline store on S3

The “X vs Y” Quick Reference

Pair	X	Y	Key Discriminator
Kinesis Data Streams vs Firehose	Custom processing, replay, sub-second	Delivery to S3/Redshift, no code	Need replay or sub-second? → Streams
Glue vs EMR	Serverless, PySpark, catalog	Full Hadoop/Spark ecosystem, EC2	Operational overhead concern? → Glue
Athena vs Redshift	Ad-hoc SQL on S3, no loading	DW with complex joins, frequent queries	One-off queries on S3? → Athena
Rekognition vs SM Image Class	No ML expertise, managed	Custom training, specific domain	Need customization? → SageMaker
Random Forest vs XGBoost	Bagging, parallel, robust	Boosting, sequential, often better	XGBoost generally better, RF more robust to overfit
L1 vs L2	Sparse, feature selection (zero weights)	Prevents large weights, smooth	Want to zero out features? → L1
PCA vs t-SNE	Linear, preserve variance, production	Non-linear, visualization only	Production pipeline? → PCA
LSTM vs GRU	More params, better for long sequences	Fewer params, faster, similar perf	Limited compute? → GRU
Batch Norm vs Dropout	Normalize activations, training speed	Randomly drop neurons, regularize	Both can be used together
Precision vs Recall	Minimize FP (cost of false alarm)	Minimize FN (cost of missing)	What error is worse?
Bagging vs Boosting	Parallel, reduces variance, Random Forest	Sequential, reduces bias, XGBoost/AdaBoost	High variance? → Bagging. High bias? → Boosting
File Mode vs Pipe Mode	Data copied to instance disk	Streams directly from S3, faster	Large dataset + RecordIO? → Pipe
Data Parallelism vs Model Parallelism	Replicate model, split data across GPUs	Split model layers across GPUs	Model too large for 1 GPU? → Model parallel
Real-time vs Serverless	Always-on, consistent latency	Intermittent, variable latency, no idle cost	Need < 100ms guaranteed? → Real-time
Batch vs Async	Offline, no endpoint, huge scale	Large payload, async with notification	Payload > 6MB or very long inference? → Async
SageMaker Clarify vs Model Monitor	Bias detection + SHAP explainability	Data drift + model quality over time	Drift detection? → Monitor. Bias? → Clarify
Step Functions vs Airflow (MWAA)	AWS-native, serverless, simple orchestration	Complex DAGs, Python ecosystem	Complex Python DAG? → MWAA

Data Pipeline Architecture Patterns

Batch ML Pipeline:
┌──────┐    ┌──────┐    ┌──────┐    ┌──────┐    ┌──────┐
│  S3  │───▶│ Glue │───▶│  S3  │───▶│  SM  │───▶│  S3  │
│ Raw  │    │ ETL  │    │Clean │    │Train │    │Model │
└──────┘    └──────┘    └──────┘    └──────┘    └──────┘

Streaming ML Pipeline:
┌──────┐    ┌───────┐    ┌─────────┐    ┌──────┐
│Kafka/│───▶│Kinesis│───▶│Analytics│───▶│Lambda│
│Source│    │Streams│    │(Flink)  │    │/SM   │
└──────┘    └───────┘    └─────────┘    └──────┘

Real-time Inference:
┌──────┐    ┌──────┐    ┌──────────┐    ┌──────┐
│Client│───▶│ API  │───▶│SageMaker │───▶│Result│
│      │    │  GW  │    │Real-time │    │      │
└──────┘    └──────┘    └──────────┘    └──────┘

SageMaker Pipelines Key Components

Component	Purpose
Pipeline	DAG of steps
ProcessingStep	Data prep/validation using SageMaker Processing
TrainingStep	Train a model
TuningStep	Hyperparameter tuning (HPO)
Evaluation (ProcessingStep)	Evaluate metrics in a processing job; there is no dedicated EvaluationStep class
ModelStep	Create model from artifacts
ConditionStep	Branch on condition (accuracy > threshold)
RegisterModel	Push to Model Registry
TransformStep	Batch inference

Hyperparameter Tuning Quick Reference

HPO Strategy

Strategy	Description	Use When
Grid search	Try all combinations	Small parameter space
Random search	Random samples	Medium space, better than grid
Bayesian optimization	Model-guided search	Expensive training, large space
Hyperband	Early stopping of bad trials	Large space, long training

SageMaker Automatic Model Tuning

Objective metric: specify training job metric to maximize/minimize
Parameter ranges: continuous, integer, categorical
Max jobs: total trials budget
Max parallel jobs: concurrent trials (reduces wall time, but Bayesian search learns from fewer completed trials)
Warm starting: continue previous tuning job, transfer learning for HPO

Common Hyperparameter Effects

Hyperparameter	Effect when increased
Learning rate	Faster convergence, risk instability
Batch size	More stable gradients, less noise, more memory use
Num epochs	Risk overfitting, better convergence
Regularization (lambda)	More regularization, lower variance
Max depth (trees)	More complex model, risk overfitting
Num estimators	Better ensemble, diminishing returns
Dropout rate	More regularization
Hidden units	More capacity, risk overfitting

Model Monitoring Quick Reference

Monitor Type	What it Detects	Baseline
Data Quality	Feature drift, schema violations	Baseline statistics from training data
Model Quality	Prediction drift, accuracy degradation	Ground truth labels needed
Bias Drift	Fairness metric changes over time	Clarify baseline
Feature Attribution Drift	SHAP value changes	Clarify baseline

Key: All Model Monitor types need a baseline job first, then schedule monitoring job.

Mnemonics & Memory Aids

CART vs ID3: CART uses Gini, ID3 uses Entropy (C = Continuous splits, I = Information)

Precision vs Recall with “Spam”:

Spam filter: Precision matters — FP means missing real email
Cancer detection: Recall matters — FN means missed cancer

L1 vs L2 = “Lasso vs Ridge”:

L1 Lasso = “L-one = L-ose features” (zeros out weights)
L2 Ridge = “Ridge = gentle slope, no zeros”

Bagging vs Boosting:

Bagging = Bootstrap samples + Broad parallel trees
Boosting = Build on mistakes sequentially

SageMaker endpoint types — “RSBA”:

Real-time (low latency, online)
Serverless (intermittent, no idle cost)
Batch transform (offline, no endpoint)
Asynchronous (large payload, long running)

DeepAR vs classical forecasting:

DeepAR = Deep learning for All related series at once Recurrently
Use when you have many related time series (products, stores)

RCF for anomaly: Random Cut Forest = Real-time streaming anomalies Cut trees Find outliers

BlazingText modes:

Supervised = text classification
cbow / skipgram = word embeddings (like Word2Vec)

PCA vs t-SNE rule: PCA in production, t-SNE only for visualization (t-SNE is non-parametric, cannot transform new points)

Kinesis shard capacity:

Ingestion: 1 MB/s or 1000 records/s per shard
Consumption: 2 MB/s per shard
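
A back-of-the-envelope shard calculator built from the per-shard limits above. It is a sketch that assumes shared-throughput consumers and ignores enhanced fan-out and traffic bursts. The function name and the workload numbers are hypothetical.

```python
import math


def shards_needed(ingest_mb_per_s, records_per_s, consume_mb_per_s):
    """Smallest shard count that satisfies all three per-shard limits."""
    by_ingest_bytes = ingest_mb_per_s / 1.0   # 1 MB/s write per shard
    by_ingest_records = records_per_s / 1000  # 1000 records/s write per shard
    by_consumption = consume_mb_per_s / 2.0   # 2 MB/s read per shard
    return max(1, math.ceil(max(by_ingest_bytes, by_ingest_records, by_consumption)))


# Hypothetical workload: 5 MB/s in, 8,000 records/s, 12 MB/s read in total.
shards = shards_needed(ingest_mb_per_s=5, records_per_s=8000, consume_mb_per_s=12)
print(shards)
```

Here the record rate is the binding constraint (8 shards), even though the byte rate alone would need only 5. Many small records can push the shard count higher than the byte volume suggests.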

SageMaker training job data (Script Mode): read input from os.environ['SM_CHANNEL_TRAINING'] (set when the input channel is named "training") and save the model to os.environ['SM_MODEL_DIR']
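
A common way to wire these environment variables into a script is through `argparse` defaults, so the script also runs locally. This is a sketch: the `--train` and `--model-dir` flag names and the fallback paths are assumptions, not requirements.

```python
import argparse
import os


def parse_args(argv=None):
    """Resolve data and model directories from SageMaker env vars, with local fallbacks."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--train",
        default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"),
    )
    parser.add_argument(
        "--model-dir",
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    return parser.parse_args(argv)


# Passing an empty list ignores the real command line, which is handy for a quick check.
args = parse_args([])
print(args.train, args.model_dir)
```

Explicit flags override the environment, so a local run can point `--train` at any directory without touching the environment.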

Spot training key requirement: MUST enable checkpointing — job can be interrupted and resumed

Ground Truth labeling: annotation consolidation combines multiple workers' labels to improve accuracy; active learning (automated data labeling) sends only uncertain examples to humans and reduces cost

Clarify = Bias + Explainability, Monitor = Drift detection over time

Cross-Validation Patterns

Pattern	Use When
k-fold CV	Standard, general purpose
Stratified k-fold	Imbalanced classes
Time series split	Sequential data — NEVER random split
Leave-one-out (LOO)	Very small datasets
Group k-fold	Prevent data leakage from grouped data

Time series: ALWAYS split by time, never random. Future data must never appear in training.
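
A minimal sketch of an expanding-window time series split, in which every training index precedes every test index. The function name and the sizes are illustrative. scikit-learn's `TimeSeriesSplit` offers a similar scheme.

```python
def time_series_splits(n_samples, n_splits, test_size):
    """Return (train_indices, test_indices) pairs ordered in time, with no shuffling."""
    splits = []
    for i in range(n_splits):
        # Test windows sit at the end of the series; earlier folds end earlier.
        test_end = n_samples - (n_splits - 1 - i) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        train = list(range(0, test_start))
        test = list(range(test_start, test_end))
        splits.append((train, test))
    return splits


for train, test in time_series_splits(n_samples=10, n_splits=3, test_size=2):
    print(train, test)
```

Each fold trains on everything before its test window, so the training set grows from fold to fold and no future observation is ever used to predict the past.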

SageMaker Ground Truth

Labeling Workforce Options:
┌──────────────────────────────────────────┐
│  Amazon Mechanical Turk  │  Crowd         │
│  (public, cheap, fast)   │  (public)      │
├──────────────────────────────────────────┤
│  Vendor Managed          │  Private       │
│  (professional labels)   │  (your team)   │
└──────────────────────────────────────────┘

Active Learning Loop:
  Human labels sample → Train model → Auto-label easy examples
  → Send uncertain examples back to humans → Repeat

Ensemble Methods Quick Card

Method	Algorithm	Technique	Reduces
Bagging	Random Forest	Bootstrap + average	Variance
Boosting	XGBoost, AdaBoost, GBM	Sequential correction	Bias
Stacking	Meta-learner	Train on predictions of base models	Both
Voting	Multiple models	Majority / soft vote	Variance
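
A sketch of the two voting schemes from the last row of the table, on hypothetical outputs from three binary classifiers. The helper names are illustrative.

```python
from collections import Counter


def hard_vote(predictions):
    """Majority vote over predicted class labels."""
    return Counter(predictions).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average the per-class probabilities, then pick the most probable class."""
    n_classes = len(probabilities[0])
    n_models = len(probabilities)
    avg = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)


# Hypothetical [P(class 0), P(class 1)] from three models for one example.
probs = [[0.45, 0.55], [0.40, 0.60], [0.95, 0.05]]
labels = [max(range(2), key=p.__getitem__) for p in probs]  # each model's own prediction

print(hard_vote(labels), soft_vote(probs))
```

Two weakly confident models outvote one very confident model under hard voting, while soft voting lets the confident model win. The two schemes can disagree on the same inputs.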

Transfer Learning Decision

Do you have labeled data?
│
├─ Lots of data (similar domain)   → Fine-tune all layers
├─ Lots of data (different domain) → Train from scratch
├─ Little data (similar domain)    → Freeze base, train head only
└─ Little data (different domain)  → Risky; train a classifier on early-layer features, or collect more data

AWS Data Processing Quick Sizing

Data Volume	Cadence	Recommended Service
< 1 GB	Infrequent	Lambda + S3
1 GB - 1 TB	Daily batch	Glue (serverless)
1 TB - 10 TB	Daily batch	Glue or EMR
> 10 TB	Complex jobs	EMR (full control)
Real-time	Streaming	Kinesis + Analytics

Quick Recall: Domain Weights

Domain	Weight	Focus
1: Data Engineering	20%	S3, Kinesis, Glue, ETL, formats
2: EDA & Feature Engineering	24%	Stats, imbalance, scaling, imputation
3: Modeling	36%	Algorithms, training, evaluation (HEAVIEST)
4: ML Implementation & Ops	20%	SageMaker, security, deployment, monitoring
