Exam Cheat Sheet
Table of Contents
- MLS-C01 Exam Cheat Sheet
- Algorithm Quick Reference — All 17+ Built-In
- Metric Quick Reference
- Key Formulas
- Service Quick Reference — “When to Use What”
- SageMaker Architecture
- Data Format Decision Tree
- Overfitting vs Underfitting Quick Card
- Feature Engineering Quick Card
- Security Quick Card
- Cost Optimization Quick Card
- The “X vs Y” Quick Reference
- Data Pipeline Architecture Patterns
- SageMaker Pipelines Key Components
- Hyperparameter Tuning Quick Reference
- Model Monitoring Quick Reference
- Mnemonics & Memory Aids
- Cross-Validation Patterns
- SageMaker Ground Truth
- Ensemble Methods Quick Card
- Transfer Learning Decision
- AWS Data Processing Quick Sizing
- Quick Recall: Domain Weights
MLS-C01 Exam Cheat Sheet
Last thing to read before the exam. Dense tables, no prose.
Algorithm Quick Reference — All 17+ Built-In
| Algorithm | Problem | Input | Compute | Key Hyperparams | Use When |
|---|---|---|---|---|---|
| XGBoost | Class/Reg/Rank | tabular CSV/libsvm | CPU (GPU opt) | num_round, max_depth, eta, subsample, alpha, lambda | Tabular data, default first choice |
| Linear Learner | Class/Reg | tabular RecordIO/CSV | CPU | learning_rate, l1, wd, mini_batch_size | Large tabular, linear relationships, fast baseline |
| KNN | Class/Reg | tabular RecordIO/CSV | CPU | k, sample_size, feature_dim | Similarity-based prediction, non-parametric baseline, small-medium data |
| Factorization Machines | Class/Reg | sparse RecordIO-protobuf (float32) | CPU | num_factors, lr, epochs | Sparse data, click prediction, recommendations |
| DeepAR | Forecasting | JSON Lines time series | GPU/CPU | context_length, prediction_length, num_layers | Multiple related time series, cold start |
| LDA | Topic modeling | Word-count vectors (RecordIO/CSV) | CPU | num_topics, alpha0 | Text corpus, discover hidden topics |
| NTM | Topic modeling | Count vectors | GPU/CPU | num_topics, encoder_layers | Similar to LDA, neural approach |
| BlazingText | Text class / Word2Vec | Text file (space-delimited) | GPU/CPU | mode (supervised/cbow/skipgram/batch_skipgram), vector_dim | Fast text classification, word embeddings, millions of docs |
| Object2Vec | Embeddings | Pair of tokens/sequences | GPU | enc_dim, num_layers | Entity relationships, recommendation, similarity |
| Semantic Segmentation | Image segmentation | Image files + PNG annotation masks / augmented manifest | GPU | backbone, algorithm (fcn/psp/deeplab) | Pixel-level classification of images |
| Image Classification | Multi-class images | RecordIO / augmented manifest | GPU | num_classes, num_layers, lr, epochs | Classify whole images, transfer learning available |
| Object Detection | Bounding boxes | RecordIO / augmented manifest | GPU | num_classes, base_network (VGG/ResNet) | Locate + classify objects in images |
| Seq2Seq | Sequence translation | Tokenized RecordIO | GPU | num_layers, num_embed, rnn_num_hidden | Machine translation, summarization |
| IP Insights | Anomaly (IP) | CSV (entity, IP) | GPU/CPU | num_entity_vectors, vector_dim | Detect suspicious IP usage patterns |
| Random Cut Forest (RCF) | Anomaly detection | RecordIO/CSV | CPU | num_trees, num_samples_per_tree | Real-time streaming anomaly detection |
| PCA | Dimensionality reduction | RecordIO/CSV | CPU | num_components, algorithm_mode (regular/randomized) | Reduce features, handle multicollinearity |
| K-Means | Clustering | RecordIO/CSV | GPU/CPU | k, init_method, max_iter | Unsupervised grouping, customer segmentation |
Exam tip — “tabular data with no info on structure” → XGBoost first. Images → Image Classification or Object Detection. Streaming anomaly → RCF.
Metric Quick Reference
Classification Metrics
| Metric | Formula | Use When |
|---|---|---|
| Accuracy | $\frac{TP+TN}{TP+TN+FP+FN}$ | Balanced classes, equal cost of errors |
| Precision | $\frac{TP}{TP+FP}$ | Cost of FP is high (spam filter, fraud alert) |
| Recall | $\frac{TP}{TP+FN}$ | Cost of FN is high (cancer detection, fraud missed) |
| F1 | $\frac{2 \cdot P \cdot R}{P+R}$ | Imbalanced classes, balance P and R |
| AUC-ROC | Area under ROC curve | Rank-ordering, imbalanced, threshold-agnostic |
| Log Loss | $-\frac{1}{N}\sum\left[y\log\hat{p} + (1-y)\log(1-\hat{p})\right]$ | Probabilistic classifier quality (binary form shown) |
| PR-AUC | Area under P-R curve | Highly imbalanced data (better than ROC) |
Accuracy is USELESS for imbalanced data. Always flag this on exam.
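A minimal sketch of why accuracy misleads on imbalanced data. The counts are made up for illustration (990 negatives, 10 positives), and the "model" simply predicts negative for everything:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical dataset: 10 positives, 990 negatives, always-negative model.
always_negative = classification_metrics(tp=0, tn=990, fp=0, fn=10)
print(always_negative)
```

Accuracy comes out at 99% while recall and F1 are both 0, which is the pattern exam questions on imbalanced classes tend to probe.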
Regression Metrics
| Metric | Formula | Use When |
|---|---|---|
| MSE | $\frac{1}{n}\sum(y-\hat{y})^2$ | Penalize large errors heavily |
| RMSE | $\sqrt{MSE}$ | Same scale as target, sensitive to outliers |
| MAE | $\frac{1}{n}\sum\lvert y-\hat{y}\rvert$ | Robust to outliers, interpretable |
| R² | $1 - \frac{SS_{res}}{SS_{tot}}$ | Proportion of variance explained (max 1; negative if worse than predicting the mean) |
| MAPE | $\frac{100}{n}\sum\left\lvert\frac{y-\hat{y}}{y}\right\rvert$ | Percentage error, scale-independent; undefined when $y = 0$ |
Other Metrics
| Problem | Metric | Notes |
|---|---|---|
| Clustering | Silhouette score | Ranges -1 to 1, higher = better clusters |
| Clustering | Inertia (SSE) | Within-cluster sum of squares, lower = better |
| Ranking | NDCG | Normalized Discounted Cumulative Gain |
| Object Detection | mAP | Mean Average Precision across classes |
| Segmentation | IoU / mIoU | Intersection over Union |
Key Formulas
Classification
$$\text{Precision} = \frac{TP}{TP+FP} \qquad \text{Recall} = \frac{TP}{TP+FN}$$
$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP+FP+FN}$$
$$F_\beta = (1+\beta^2)\frac{P \cdot R}{\beta^2 P + R}$$
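The formulas above can be checked numerically. This is a small sketch with hypothetical counts (TP=80, FP=20, FN=40); it confirms that the two F1 forms agree and shows how $\beta$ shifts the weight between precision and recall:

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical confusion-matrix counts.
tp, fp, fn = 80, 20, 40
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1 = f_beta(precision, recall)                 # harmonic mean of P and R
f1_from_counts = 2 * tp / (2 * tp + fp + fn)   # count-based form of F1
f2 = f_beta(precision, recall, beta=2.0)       # recall-weighted
f_half = f_beta(precision, recall, beta=0.5)   # precision-weighted
print(f1, f1_from_counts, f2, f_half)
```

Because recall is lower than precision in this example, F2 falls below F1 and F0.5 rises above it.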
Regression
$$MSE = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 \qquad RMSE = \sqrt{MSE}$$
$$MAE = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i| \qquad R^2 = 1 - \frac{\sum(y_i-\hat{y}_i)^2}{\sum(y_i-\bar{y})^2}$$
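A short sketch of the regression formulas on made-up values, written in plain Python so each term maps directly to the equations above:

```python
import math


def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE and R^2 for two equal-length lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # sum of squared residuals
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical targets and predictions; the last prediction is a large miss.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 12.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

The single large error pulls RMSE well above MAE, which is the outlier sensitivity noted in the metric table.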
Gradient Descent
$$\theta := \theta - \alpha \nabla_\theta J(\theta)$$
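To make the update rule concrete, here is a sketch on a toy objective $J(\theta) = (\theta - 3)^2$, chosen only for illustration. It also shows the "LR too high" failure mode: the same loop diverges once $\alpha$ is large enough.

```python
def gradient_descent(grad, theta0, alpha, steps):
    """Repeatedly apply theta := theta - alpha * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta


def grad_j(theta):
    """Gradient of the toy objective J(theta) = (theta - 3)^2."""
    return 2 * (theta - 3)


# A small learning rate converges toward the minimum at theta = 3.
converged = gradient_descent(grad_j, theta0=0.0, alpha=0.1, steps=100)
# A learning rate above 1.0 overshoots further on every step for this J.
diverged = gradient_descent(grad_j, theta0=0.0, alpha=1.1, steps=100)
print(converged, diverged)
```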
Regularization
$$L1 \text{ (Lasso)}: J + \lambda\sum|\theta_j| \qquad L2 \text{ (Ridge)}: J + \lambda\sum\theta_j^2$$
$$\text{Elastic Net}: J + \lambda_1\sum|\theta_j| + \lambda_2\sum\theta_j^2$$
Decision Trees
$$\text{Gini} = 1 - \sum_{i=1}^c p_i^2 \qquad \text{Entropy} = -\sum_{i=1}^c p_i \log_2 p_i$$
$$\text{Information Gain} = H(\text{parent}) - \sum_k \frac{|S_k|}{|S|} H(S_k)$$
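A small sketch of the impurity formulas, using hypothetical class counts (a 5/5 parent split into 4/1 and 1/4 children):

```python
import math


def gini(counts):
    """Gini impurity from a list of per-class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy in bits; empty classes contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = [5, 5]                         # hypothetical: 5 of each class
gain_partial = information_gain(parent, [[4, 1], [1, 4]])
gain_perfect = information_gain(parent, [[5, 0], [0, 5]])
print(gini(parent), entropy(parent), gain_partial, gain_perfect)
```

A perfectly separating split recovers the full parent entropy (1 bit here); the 4/1 split recovers only part of it.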
Probability
$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$
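A worked example of Bayes' rule with illustrative numbers (1% prevalence, 95% sensitivity, 5% false-positive rate; none of these come from a real test):

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) where P(B) = P(B|A)P(A) + P(B|not A)P(not A)."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical screening test: A = has condition, B = test is positive.
p_condition_given_positive = posterior(
    prior=0.01, likelihood=0.95, false_positive_rate=0.05
)
print(round(p_condition_given_positive, 4))
```

Even with a 95% sensitive test, the posterior stays near 16% because the condition is rare, the same base-rate effect that makes precision collapse on imbalanced data.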
Service Quick Reference — “When to Use What”
Data Storage
| Service | Use Case | Key Detail |
|---|---|---|
| S3 | Everything — always default | Objects, scalable, SageMaker native |
| EFS | Shared filesystem across instances | NFS, multi-AZ, training clusters |
| FSx for Lustre | HPC/ML high-throughput | Links to S3, fastest I/O for training |
| EBS | SageMaker notebook persistence | Block storage, single instance |
| Redshift | Data warehouse, SQL analytics | Columnar, petabyte scale |
| DynamoDB | Low-latency K-V lookups | Millisecond, managed NoSQL |
Streaming & Ingestion
| Service | Use Case | Key Detail |
|---|---|---|
| Kinesis Data Streams | Custom consumers, replay, low latency | Shard-based, 24h default retention, extendable to 365d |
| Kinesis Data Firehose | Delivery to S3/Redshift/OpenSearch | No code, auto-scale, no replay |
| Kinesis Data Analytics | Real-time SQL/Flink on streams | Windowed queries, anomaly detection |
| MSK (Kafka) | Kafka-compatible streaming | Bring existing Kafka ecosystem |
| SQS | Decoupled queue, async processing | At-least-once, no ordering (std) |
ETL & Processing
| Service | Use Case | Key Detail |
|---|---|---|
| AWS Glue | Serverless ETL, PySpark | Crawlers, Data Catalog, job bookmarks |
| EMR | Full Hadoop/Spark control | EC2-based, complex transforms, cost |
| Athena | Ad-hoc SQL on S3 | No server, pay per query, quick |
| Redshift Spectrum | Query S3 from Redshift | Extend DW without loading |
| AWS Batch | Large batch compute | Non-ML batch jobs |
| Step Functions | Workflow orchestration | ML pipelines, state machines |
AI/ML Managed Services
| Service | Problem | Key Detail |
|---|---|---|
| Rekognition | Image/video analysis | Object detection, face, celebrity, moderation |
| Comprehend | NLP — text analysis | Sentiment, entities, key phrases, topics |
| Transcribe | Speech → text | ASR, medical version available |
| Polly | Text → speech | TTS, multiple voices/languages |
| Translate | Language translation | Neural MT, 75+ languages |
| Textract | OCR + document parsing | Tables, forms, beyond simple OCR |
| Forecast | Time series forecasting | AutoML, no ML expertise needed |
| Personalize | Recommendation engine | User-item interactions, real-time |
| Lex | Chatbot / conversational AI | Same technology as Alexa, ASR + NLU |
| Kendra | Intelligent search | Enterprise search, FAQ extraction |
| Fraud Detector | Fraud detection | Online fraud, account takeover |
| Lookout for Metrics | Anomaly in business metrics | No ML required, time series |
SageMaker Deployment Modes
| Mode | Latency | Payload | When to Use |
|---|---|---|---|
| Real-time endpoint | Low (ms) | < 6 MB | Interactive apps, low-latency APIs |
| Serverless inference | Variable (cold starts) | < 4 MB | Intermittent traffic, unpredictable spikes |
| Asynchronous inference | Minutes | Up to 1 GB | Large payloads, video/audio, long inference |
| Batch transform | Offline | Unlimited | Bulk offline prediction, no persistent endpoint |
SageMaker Architecture
┌─────────────────────────────────────────────────────────────┐
│ SAGEMAKER ECOSYSTEM │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Data │──▶│ Feature │──▶│ Training │ │
│ │ Sources │ │ Store │ │ Job │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │
│ │ ┌────────────────────┘ │
│ │ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ S3 / │ │ Model │──▶│ Model │ │
│ │ Ground │ │ Registry │ │ Monitor │ │
│ │ Truth │ └──────────┘ └──────────┘ │
│ └──────────┘ │ │
│ ▼ │
│ ┌─────────────────────────────┐ │
│ │ DEPLOYMENT OPTIONS │ │
│ │ ┌──────┐ ┌──────┐ ┌──────┐ │ │
│ │ │ RT │ │Batch │ │Async │ │ │
│ │ │Endpt │ │Trans │ │Endpt │ │ │
│ │ └──────┘ └──────┘ └──────┘ │ │
│ └─────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ TOOLS: Studio | Pipelines | Experiments | Clarify │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Instance Type Selection
| Workload | Instance Family | Notes |
|---|---|---|
| Training deep learning | ml.p3, ml.p4, ml.g4dn, ml.g5 | GPU required for CV/NLP |
| Training tree models | ml.m5, ml.c5 | XGBoost CPU-optimized |
| Inference high-throughput | ml.c5, ml.c6g | CPU inference is cheaper |
| Inference GPU required | ml.g4dn | Real-time DL inference |
| Notebook exploration | ml.t3, ml.m5 | Notebooks don’t need GPU |
| Large model training | ml.p4d.24xlarge | Multi-GPU, NVLink |
/opt/ml/ Directory (Custom Containers)
/opt/ml/
├── input/
│ ├── config/ ← hyperparameters.json, resourceconfig.json
│ └── data/
│ └── {channel}/ ← training data from S3
├── model/ ← save model artifacts HERE
├── output/
│ └── failure ← write failure reason here on error
└── code/ ← your training script (if using Script Mode)
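A minimal sketch of a training entry point that follows the directory layout above. The channel name `training`, the `epochs` hyperparameter, and the `model.json` artifact are assumptions made for illustration; the demo runs against a temporary directory instead of the real `/opt/ml`.

```python
import json
import tempfile
from pathlib import Path


def run_training(prefix="/opt/ml"):
    """Read config, 'train', save an artifact; write failure file on error."""
    prefix = Path(prefix)
    try:
        hp_path = prefix / "input" / "config" / "hyperparameters.json"
        hyperparams = json.loads(hp_path.read_text())
        # Hyperparameter values arrive as strings, so cast explicitly.
        epochs = int(hyperparams.get("epochs", "1"))   # hypothetical name
        channel = prefix / "input" / "data" / "training"  # assumed channel
        files = sorted(p.name for p in channel.iterdir())
        model_dir = prefix / "model"
        model_dir.mkdir(parents=True, exist_ok=True)
        # Placeholder artifact standing in for real model weights.
        (model_dir / "model.json").write_text(
            json.dumps({"epochs": epochs, "files": files})
        )
        return True
    except Exception as exc:
        failure = prefix / "output" / "failure"
        failure.parent.mkdir(parents=True, exist_ok=True)
        failure.write_text(str(exc))
        return False


# Demo against a throwaway directory that mimics the container layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "input" / "config").mkdir(parents=True)
    (root / "input" / "data" / "training").mkdir(parents=True)
    (root / "input" / "config" / "hyperparameters.json").write_text(
        json.dumps({"epochs": "3"})
    )
    (root / "input" / "data" / "training" / "train.csv").write_text("1,2,3\n")
    ok = run_training(root)
    saved = json.loads((root / "model" / "model.json").read_text())
    print(ok, saved)
```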
Data Format Decision Tree
What is your data?
│
├─ Tabular data
│ ├─ Need fast SageMaker training? → RecordIO (protobuf) + Pipe mode
│ ├─ XGBoost? → CSV or libsvm
│ └─ General purpose? → CSV (simplest)
│
├─ Image data
│ ├─ SageMaker built-in algo? → RecordIO or augmented manifest
│ └─ Custom model? → Raw files in S3
│
├─ Text data
│ ├─ BlazingText? → One sentence per line, space-delimited
│ ├─ Seq2Seq? → Tokenized integer sequences
│ └─ General NLP? → JSON Lines / CSV
│
└─ Time series
├─ DeepAR? → JSON Lines (start timestamp + target array)
└─ Forecast service? → CSV or Parquet
RecordIO + Pipe mode = fastest SageMaker training (streams from S3, no copy to disk)
Overfitting vs Underfitting Quick Card
| Symptom | Diagnosis | Fix |
|---|---|---|
| Train acc high, test acc low | Overfitting | Add dropout, regularization (L1/L2), more data, early stopping, simpler model |
| Train acc low, test acc low | Underfitting | More complex model, more features, more epochs, less regularization, larger network |
| Train acc = test acc, both low | High bias | Feature engineering, polynomial features, different algorithm |
| Train acc oscillates | LR too high | Reduce learning rate, use LR scheduler |
| Train loss never decreases | LR too low / bad init | Increase LR, check data normalization |
| Good train/val, bad production | Data leakage or distribution shift | Fix data split, monitor drift |
| Model works on seen, fails on unseen | Poor generalization | Cross-validation, augmentation, regularization |
Feature Engineering Quick Card
Numerical Features
- Scale: StandardScaler (mean=0, std=1) for linear/neural models; MinMaxScaler (0-1) for bounded
- Log transform: $\log(x+1)$ for skewed/long-tail distributions
- Binning: Convert continuous → ordinal buckets (age groups)
- Polynomial: $x^2, x_1 x_2$ to capture non-linear relationships
- Clipping: Remove/cap extreme outliers before scaling
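The numerical transforms above, sketched in plain Python on a made-up skewed column (one extreme value of 100). The clip bounds are arbitrary choices for the example:

```python
import math


def standardize(values):
    """Shift to mean 0 and scale to (population) std 1."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def min_max(values):
    """Rescale linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def clip(values, low, high):
    """Cap values to [low, high] before scaling."""
    return [min(max(v, low), high) for v in values]


values = [1.0, 2.0, 3.0, 4.0, 100.0]       # hypothetical long-tailed feature
z_scores = standardize(values)
scaled = min_max(values)
logged = [math.log1p(v) for v in values]   # log(x + 1)
clipped = clip(values, 0.0, 10.0)
print(scaled, logged, clipped)
```

Without clipping or a log transform, min-max scaling squeezes the first four values into the bottom 4% of the range.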
Categorical Features
- Label encoding: ordinal categories (low/med/high → 0/1/2)
- One-hot encoding: nominal categories, low cardinality (< ~20 values)
- Target encoding: high cardinality, replaces category with mean(target) — risk of leakage
- Embedding: high cardinality in deep learning (entity embeddings)
Text Features
- BoW / TF-IDF: sparse bag of words, good for linear models
- Word2Vec / FastText: dense embeddings, captures semantics
- BERT / Transformers: contextual embeddings, state-of-the-art
- n-grams: capture phrases, increase vocabulary size
Time Series Features
- Lag features: $x_{t-1}, x_{t-7}$ (yesterday, last week)
- Rolling statistics: rolling mean, rolling std, rolling max
- Decomposition: trend + seasonality + residual
- Differencing: $x_t - x_{t-1}$ to remove trend, stationarity
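A short sketch of lag, rolling, and differencing features on a toy series. Positions without enough history are filled with `None`, which is one possible convention rather than a required one:

```python
def lag(series, k):
    """Value from k steps earlier (k >= 1); first k entries have no history."""
    return [None] * k + series[:-k]


def rolling_mean(series, window):
    """Mean of the current and previous (window - 1) values."""
    return [
        None if i + 1 < window
        else sum(series[i + 1 - window:i + 1]) / window
        for i in range(len(series))
    ]


def difference(series):
    """First difference x_t - x_{t-1}, used to remove a trend."""
    return [None] + [b - a for a, b in zip(series, series[1:])]


series = [10, 12, 15, 19, 24]              # hypothetical daily values
print(lag(series, 1))
print(rolling_mean(series, 3))
print(difference(series))
```

Each feature only looks backward in time, so none of them leaks future values into the row being predicted.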
Missing Data Strategies
| Strategy | When |
|---|---|
| Mean/median imputation | MCAR (missing completely at random), numerical |
| Mode imputation | Categorical, low proportion missing |
| KNN imputation | MCAR/MAR, preserve relationships |
| Model-based imputation | Complex dependencies |
| Indicator variable | When missingness itself is informative |
| Drop column | > 60-70% missing, no signal |
| Drop row | Very few missing, MCAR |
Security Quick Card
Encryption
| Layer | Mechanism | Service |
|---|---|---|
| At rest (S3) | SSE-S3, SSE-KMS, SSE-C | KMS, S3 |
| At rest (EBS/EFS) | KMS CMK | KMS |
| In transit | TLS 1.2+ | ACM |
| Inter-container | Enable inter-container encryption | SageMaker training config |
| Model artifacts | KMS key on S3 bucket | KMS + S3 |
Network
| Component | Purpose |
|---|---|
| VPC + private subnet | Isolate training/inference from internet |
| S3 VPC Gateway Endpoint | Access S3 without internet traversal |
| SageMaker VPC Interface Endpoint | Access SageMaker API without internet |
| Security Groups | Control inter-instance traffic |
| NAT Gateway | Outbound internet from private subnet |
Access Control
| Layer | Mechanism |
|---|---|
| SageMaker execution role | IAM role attached to notebook/training job |
| Bucket policy | S3 access from specific role/VPC only |
| Resource-based policy | Cross-account access |
| KMS key policy | Who can use encryption keys |
| CloudTrail | Audit all API calls |
Monitoring
| Tool | Purpose |
|---|---|
| CloudWatch Metrics | Resource utilization, invocation counts |
| CloudWatch Logs | Training/inference logs |
| CloudTrail | API call audit trail |
| SageMaker Model Monitor | Data drift, model quality, bias, explainability |
| SageMaker Clarify | Bias detection, feature importance (SHAP) |
Cost Optimization Quick Card
Training
| Technique | Savings | Notes |
|---|---|---|
| Spot instances | Up to 90% | Requires checkpointing for recovery |
| Right-size instances | 20-50% | Profile before choosing large instance |
| Warm pools | Reduce startup time | Keep instances ready between jobs |
| Managed spot training | Built-in | SageMaker handles interruption |
| Compile with Neo | Speed → cost | Optimize model for inference |
Inference
| Technique | Savings | Notes |
|---|---|---|
| Auto-scaling | Pay only for load | Scale to zero with serverless |
| Serverless inference | No idle cost | Cold start tradeoff |
| Multi-model endpoint | 1 endpoint N models | Share resources across models |
| Elastic Inference | GPU fraction | Attach fraction of GPU to CPU instance |
| Inferentia (Inf1) | GPU cost | Custom ML chip, high throughput |
| Graviton2 (ml.c6g) | CPU cost | ARM-based, cheaper CPU inference |
Storage
| Technique | Savings | Notes |
|---|---|---|
| S3 Intelligent-Tiering | Auto-tier | For unknown access patterns |
| S3 Lifecycle rules | Move to Glacier | Archive old model artifacts |
| Delete unused endpoints | Immediate | Endpoints bill while running |
| Shared feature store | Avoid recompute | Feature Store offline store on S3 |
The “X vs Y” Quick Reference
| Pair | X | Y | Key Discriminator |
|---|---|---|---|
| Kinesis vs Firehose | Custom processing, replay, sub-second | Delivery to S3/Redshift, no code | Need replay or sub-second? → Streams |
| Glue vs EMR | Serverless, PySpark, catalog | Full Hadoop/Spark ecosystem, EC2 | Operational overhead concern? → Glue |
| Athena vs Redshift | Ad-hoc SQL on S3, no loading | DW with complex joins, frequent queries | One-off queries on S3? → Athena |
| Rekognition vs SM Image Class | No ML expertise, managed | Custom training, specific domain | Need customization? → SageMaker |
| Random Forest vs XGBoost | Bagging, parallel, robust | Boosting, sequential, often better | XGBoost generally better, RF more robust to overfit |
| L1 vs L2 | Sparse, feature selection (zero weights) | Prevents large weights, smooth | Want to zero out features? → L1 |
| PCA vs t-SNE | Linear, preserve variance, production | Non-linear, visualization only | Production pipeline? → PCA |
| LSTM vs GRU | More params, better for long sequences | Fewer params, faster, similar perf | Limited compute? → GRU |
| Batch Norm vs Dropout | Normalize activations, training speed | Randomly drop neurons, regularize | Both can be used together |
| Precision vs Recall | Minimize FP (cost of false alarm) | Minimize FN (cost of missing) | What error is worse? |
| Bagging vs Boosting | Parallel, reduces variance, Random Forest | Sequential, reduces bias, XGBoost/AdaBoost | High variance? → Bagging. High bias? → Boosting |
| File Mode vs Pipe Mode | Data copied to instance disk | Streams directly from S3, faster | Large dataset + RecordIO? → Pipe |
| Data Parallelism vs Model Parallelism | Replicate model, split data across GPUs | Split model layers across GPUs | Model too large for 1 GPU? → Model parallel |
| Real-time vs Serverless | Always-on, consistent latency | Intermittent, variable latency, no idle cost | Need < 100ms guaranteed? → Real-time |
| Batch vs Async | Offline, no endpoint, huge scale | Large payload, async with notification | Payload > 6MB or very long inference? → Async |
| SageMaker Clarify vs Model Monitor | Bias detection + SHAP explainability | Data drift + model quality over time | Drift detection? → Monitor. Bias? → Clarify |
| Step Functions vs Airflow (MWAA) | AWS-native, serverless, simple orchestration | Complex DAGs, Python ecosystem | Complex Python DAG? → MWAA |
Data Pipeline Architecture Patterns
Batch ML Pipeline:
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│ S3 │───▶│ Glue │───▶│ S3 │───▶│ SM │───▶│ S3 │
│ Raw │ │ ETL │ │Clean │ │Train │ │Model │
└──────┘ └──────┘ └──────┘ └──────┘ └──────┘
Streaming ML Pipeline:
┌──────┐ ┌───────┐ ┌─────────┐ ┌──────┐
│Kafka/│───▶│Kinesis│───▶│Analytics│───▶│Lambda│
│Source│ │Streams│ │(Flink) │ │/SM │
└──────┘ └───────┘ └─────────┘ └──────┘
Real-time Inference:
┌──────┐ ┌──────┐ ┌──────────┐ ┌──────┐
│Client│───▶│ API │───▶│SageMaker │───▶│Result│
│ │ │ GW │ │Real-time │ │ │
└──────┘ └──────┘ └──────────┘ └──────┘
SageMaker Pipelines Key Components
| Component | Purpose |
|---|---|
| Pipeline | DAG of steps |
| ProcessingStep | Data prep/validation using SageMaker Processing |
| TrainingStep | Train a model |
| TuningStep | Hyperparameter tuning (HPO) |
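| ProcessingStep (evaluation) | Evaluate metrics; Pipelines has no dedicated evaluation step, so run an evaluation script in a ProcessingStep and expose results through a PropertyFile |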
| ModelStep | Create model from artifacts |
| ConditionStep | Branch on condition (accuracy > threshold) |
| RegisterModel | Push to Model Registry |
| TransformStep | Batch inference |
Hyperparameter Tuning Quick Reference
HPO Strategy
| Strategy | Description | Use When |
|---|---|---|
| Grid search | Try all combinations | Small parameter space |
| Random search | Random samples | Medium space, better than grid |
| Bayesian optimization | Model-guided search | Expensive training, large space |
| Hyperband | Early stopping of bad trials | Large space, long training |
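A toy sketch contrasting grid and random search with the same budget of nine trials. The objective function, parameter names, and ranges are invented for illustration and stand in for a real training job's validation metric:

```python
import itertools
import random


def objective(lr, depth):
    """Hypothetical validation score, best at lr=0.1 and depth=6."""
    return -((lr - 0.1) ** 2) - 0.01 * (depth - 6) ** 2


# Grid search: every combination of a few hand-picked values.
grid = list(itertools.product([0.01, 0.1, 1.0], [2, 6, 10]))
best_grid = max(grid, key=lambda params: objective(*params))

# Random search: same budget, but lr is sampled log-uniformly in [1e-3, 1]
# and depth uniformly in [2, 10]. Seeded so the run is repeatable.
rng = random.Random(0)
random_trials = [
    (10 ** rng.uniform(-3, 0), rng.randint(2, 10)) for _ in range(len(grid))
]
best_random = max(random_trials, key=lambda params: objective(*params))

print(best_grid, best_random)
```

Grid search only ever tests three distinct learning rates here, while random search tests nine, which is the usual argument for preferring it in medium-sized spaces.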
SageMaker Automatic Model Tuning
- Objective metric: specify training job metric to maximize/minimize
- Parameter ranges: continuous, integer, categorical
- Max jobs: total trials budget
- Max parallel jobs: concurrent trials (reduces wall time, but Bayesian search learns from fewer completed trials, so it is less sample-efficient)
- Warm starting: continue previous tuning job, transfer learning for HPO
Common Hyperparameter Effects
| Hyperparameter | Effect when increased |
|---|---|
| Learning rate | Faster convergence, risk instability |
| Batch size | More stable gradients, less noise, more memory per step |
| Num epochs | Risk overfitting, better convergence |
| Regularization (lambda) | More regularization, lower variance |
| Max depth (trees) | More complex model, risk overfitting |
| Num estimators | Better ensemble, diminishing returns |
| Dropout rate | More regularization |
| Hidden units | More capacity, risk overfitting |
Model Monitoring Quick Reference
| Monitor Type | What it Detects | Baseline |
|---|---|---|
| Data Quality | Feature drift, schema violations | Baseline statistics from training data |
| Model Quality | Prediction drift, accuracy degradation | Ground truth labels needed |
| Bias Drift | Fairness metric changes over time | Clarify baseline |
| Feature Attribution Drift | SHAP value changes | Clarify baseline |
Key: All Model Monitor types need a baseline job first, then schedule monitoring job.
Mnemonics & Memory Aids
CART vs ID3: CART uses Gini, ID3 uses Entropy (C = Continuous splits, I = Information)
Precision vs Recall with “Spam”:
- Spam filter: Precision matters — FP means missing real email
- Cancer detection: Recall matters — FN means missed cancer
L1 vs L2 = “Lasso vs Ridge”:
- L1 Lasso = “L-one = L-ose features” (zeros out weights)
- L2 Ridge = “Ridge = gentle slope, no zeros”
Bagging vs Boosting:
- Bagging = Bootstrap samples + Broad parallel trees
- Boosting = Build on mistakes sequentially
SageMaker endpoint types — “RSBA”:
- Real-time (low latency, online)
- Serverless (intermittent, no idle cost)
- Batch transform (offline, no endpoint)
- Asynchronous (large payload, long running)
DeepAR vs classical forecasting:
- DeepAR = Deep learning for All related series at once Recurrently
- Use when you have many related time series (products, stores)
RCF for anomaly: Random Cut Forest = Real-time streaming anomalies Cut trees Find outliers
BlazingText modes:
- Supervised = text classification
- cbow / skipgram = word embeddings (like Word2Vec)
PCA vs t-SNE rule: PCA in production, t-SNE only for visualization (t-SNE is non-parametric, cannot transform new points)
Kinesis shard capacity:
- Ingestion: 1 MB/s or 1000 records/s per shard
- Consumption: 2 MB/s per shard
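The per-shard limits above translate into a simple sizing rule: take whichever limit demands the most shards. A sketch with hypothetical workload numbers:

```python
import math

# Per-shard limits taken from the list above.
INGEST_MB_PER_S = 1.0
INGEST_RECORDS_PER_S = 1000
CONSUME_MB_PER_S = 2.0


def shards_needed(ingest_mb_s, records_s, consume_mb_s):
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(ingest_mb_s / INGEST_MB_PER_S),
        math.ceil(records_s / INGEST_RECORDS_PER_S),
        math.ceil(consume_mb_s / CONSUME_MB_PER_S),
    )


# Hypothetical workload: 4.5 MB/s in, 3000 records/s, 12 MB/s total reads.
print(shards_needed(ingest_mb_s=4.5, records_s=3000, consume_mb_s=12))
```

In this example the read side is the bottleneck (6 shards), even though ingestion alone would need only 5.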
SageMaker Training job data: read each channel from os.environ['SM_CHANNEL_<NAME>'] in the script (SM_CHANNEL_TRAINING for a channel named "training"), save model to os.environ['SM_MODEL_DIR']
Spot training key requirement: MUST enable checkpointing — job can be interrupted and resumed
Ground Truth labeling: Use annotation consolidation → higher label quality from multiple workers; use automated data labeling (active learning) → lower cost, humans label only uncertain examples
Clarify = Bias + Explainability, Monitor = Drift detection over time
Cross-Validation Patterns
| Pattern | Use When |
|---|---|
| k-fold CV | Standard, general purpose |
| Stratified k-fold | Imbalanced classes |
| Time series split | Sequential data — NEVER random split |
| Leave-one-out (LOO) | Very small datasets |
| Group k-fold | Prevent data leakage from grouped data |
Time series: ALWAYS split by time, never random. Future data must never appear in training.
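A hand-rolled sketch of an expanding-window time series split. This is an illustrative implementation written for this note, not a library API; it assumes samples are already sorted by time:

```python
def time_series_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs where test always follows train."""
    fold = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = fold * i
        test_end = train_end + fold
        yield list(range(train_end)), list(range(train_end, test_end))


# Hypothetical dataset of 10 time-ordered rows, 4 splits.
splits = list(time_series_splits(n_samples=10, n_splits=4))
for train_idx, test_idx in splits:
    print(train_idx, "->", test_idx)
```

The training window grows with each split and every test index is later than every training index, so no future data reaches training.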
SageMaker Ground Truth
Labeling Workforce Options:
┌──────────────────────────────────────────┐
│ Amazon Mechanical Turk │ Crowd │
│ (public, cheap, fast) │ (public) │
├──────────────────────────────────────────┤
│ Vendor Managed │ Private │
│ (professional labels) │ (your team) │
└──────────────────────────────────────────┘
Active Learning Loop:
Human labels sample → Train model → Auto-label easy examples
→ Send uncertain examples back to humans → Repeat
Ensemble Methods Quick Card
| Method | Algorithm | Technique | Reduces |
|---|---|---|---|
| Bagging | Random Forest | Bootstrap + average | Variance |
| Boosting | XGBoost, AdaBoost, GBM | Sequential correction | Bias |
| Stacking | Meta-learner | Train on predictions of base models | Both |
| Voting | Multiple models | Majority / soft vote | Variance |
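A small sketch of hard versus soft voting. The three sets of class probabilities are made up to show a case where the two schemes disagree:

```python
from collections import Counter


def hard_vote(labels):
    """Majority vote over predicted class labels."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average class probabilities across models, then take the argmax."""
    n_models = len(probabilities)
    averaged = [sum(col) / n_models for col in zip(*probabilities)]
    return max(range(len(averaged)), key=averaged.__getitem__)


# Hypothetical [P(class 0), P(class 1)] outputs from three base models.
probabilities = [[0.90, 0.10], [0.40, 0.60], [0.45, 0.55]]
labels = [max(range(2), key=p.__getitem__) for p in probabilities]

print(hard_vote(labels), soft_vote(probabilities))
```

Two models lean weakly toward class 1, so the majority vote picks it, while the one confident model tips the averaged probabilities toward class 0.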
Transfer Learning Decision
Do you have labeled data?
│
├─ Lots of data (similar domain) → Fine-tune all layers
├─ Lots of data (different domain) → Train from scratch
├─ Little data (similar domain) → Freeze base, train head only
└─ Little data (different domain) → Risky; train a new head on features from earlier pretrained layers, or collect more data
AWS Data Processing Quick Sizing
| Data Volume | Daily ETL | Recommended Service |
|---|---|---|
| < 1 GB | Infrequent | Lambda + S3 |
| 1 GB - 1 TB | Daily batch | Glue (serverless) |
| 1 TB - 10 TB | Daily batch | Glue or EMR |
| > 10 TB | Complex jobs | EMR (full control) |
| Real-time | Streaming | Kinesis + Analytics |
Quick Recall: Domain Weights
| Domain | Weight | Focus |
|---|---|---|
| 1: Data Engineering | 20% | S3, Kinesis, Glue, ETL, formats |
| 2: EDA & Feature Engineering | 24% | Stats, imbalance, scaling, imputation |
| 3: Modeling | 36% | Algorithms, training, evaluation (HEAVIEST) |
| 4: ML Implementation & Ops | 20% | SageMaker, security, deployment, monitoring |