Domain 2B: Model Training, Tuning & Evaluation
Table of Contents
- Model Training, Tuning & Evaluation
- Deep Learning Foundations
- Training Concepts
- Evaluation Metrics
- Ensemble Methods
- SageMaker Training & Tuning Tools
- Distributed Training at Scale
- SageMaker Clarify — Bias & Explainability
- SageMaker Processing
- Bring Your Own Container (BYOC)
- Spot Training — Full Configuration
- HPO Warm Start
- Transfer Learning vs Incremental Learning
- Cross-Validation
- Loss Functions
- F-Beta Scores
- Key Formulas
Model Training, Tuning & Evaluation
Exam Domain: 2, ML Model Development (26%)
Task: Train, refine, and analyze model performance
Deep Learning Foundations
Neural Network Basics
Input Layer Hidden Layers Output Layer
(features) (learned) (prediction)
x₁ ─────┐ ┌─── h₁ ───┐
├───┤ ├─── h₄ ───── ŷ
x₂ ─────┤ └─── h₂ ───┤
│ h₃ ────┘
x₃ ─────┘
Each connection has a weight (w).
Each node applies: output = activation(Σ(wᵢ × xᵢ) + bias)
ELI5: A neural network is like a factory assembly line. Raw materials (your input data) go in at one end. They pass through a series of processing stations (layers), where each station’s workers (neurons) each look at the incoming material and decide how much of their own signal to pass forward, based on weights they learned during training. At the end of the line, a finished product comes out (the prediction). Training is just running millions of products through the line, comparing the output to the correct answer, and slightly adjusting each worker’s decision rules to reduce mistakes.
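The per-node formula above can be sketched in a few lines. This is a minimal illustration, not a framework implementation; the input values, weights, and bias are made up for the example.

```python
import math

def sigmoid(z):
    """Squash any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Compute activation(sum(w_i * x_i) + bias) for a single neuron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values only (assumptions, not from any trained model).
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
b = 0.2

out = neuron_output(x, w, b, sigmoid)
print(round(out, 4))
```

A full layer is just this computation repeated once per neuron, each with its own weights and bias.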
Activation Functions
| Function | Formula | Range | Use Case |
|---|---|---|---|
| Sigmoid | $\sigma(x) = \frac{1}{1+e^{-x}}$ | (0, 1) | Binary classification output |
| Tanh | $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ | (-1, 1) | Hidden layers (zero-centered) |
| ReLU | $f(x) = \max(0, x)$ | [0, ∞) | Default for hidden layers |
| Leaky ReLU | $f(x) = \max(0.01x, x)$ | (-∞, ∞) | Fixes “dying ReLU” problem |
| Softmax | $\frac{e^{x_i}}{\sum e^{x_j}}$ | (0, 1), sums to 1 | Multi-class output |
When to use which activation:
Hidden layers → ReLU (or Leaky ReLU)
Binary output → Sigmoid
Multi-class → Softmax
RNN hidden → Tanh
ELI5: Activation functions are decision gates that control how much signal each neuron passes forward. Without them, stacking layers would be pointless: a network of linear operations is still just one linear operation, no matter how many layers deep. ReLU is the default because it’s simple (just “pass positive numbers through, block negatives”) and is far less prone to the vanishing gradient problem that plagues Sigmoid and Tanh in deep networks, since its gradient is exactly 1 for positive inputs. The trade-off is that a neuron stuck on negative inputs gets zero gradient and stops learning (the “dying ReLU” problem that Leaky ReLU addresses). Think of it as: Sigmoid squashes everything to a probability, Softmax turns a list of scores into probabilities that sum to 1, ReLU just lets the positive signal through unchanged.
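The formulas in the activation table translate almost directly into code. The sketch below uses plain Python scalars and lists; the max-subtraction in `softmax` is a common numerical-stability trick added here, not something the table requires.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # For negative x, slope * x is larger than x, so a small signal leaks through.
    return max(slope * x, x)

def softmax(scores):
    # Subtracting the max does not change the result but avoids overflow in exp().
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
```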
Convolutional Neural Networks (CNNs)
For image data — learn spatial hierarchies of features.
Input Image → [Conv + ReLU] → [Pooling] → [Conv + ReLU] → [Pooling] → [Flatten] → [Dense] → Output
extract edges downsample extract shapes downsample 1D vector classify
| Layer | Purpose |
|---|---|
| Convolution | Apply filters to detect features (edges, textures, shapes) |
| Pooling (Max/Avg) | Downsample — reduce spatial size, keep important info |
| Flatten | Convert 2D feature maps to 1D vector |
| Dense (FC) | Final classification layers |
Key concepts:
- Filters/Kernels: small matrices that slide across the image
- Stride: how many pixels the filter moves each step
- Padding: add zeros around edges to preserve spatial dimensions
- Transfer learning: use pre-trained networks (ResNet, VGG) and fine-tune
ELI5: CNNs process images in a hierarchy of increasing complexity, loosely inspired by how the visual system is thought to work. The first layer learns to detect simple edges (a horizontal line, a diagonal). The second layer combines edges into shapes (a corner, a curve). Deeper layers combine shapes into recognizable objects (an ear, a wheel). By the time you reach the final layer, the network has learned to recognize “cat” or “car” from the combinations of low-level features it detected. Transfer learning exploits this: a network pre-trained on millions of images already knows how to detect edges and shapes, so you just need to teach it the last step for your specific task.
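To make filters, stride, padding, and pooling concrete, here is a small pure-Python sketch. It assumes square images and square kernels, and the 4x4 image and edge-detecting kernel are made-up illustrations. The output-size rule used is the standard floor((n + 2p - k) / s) + 1.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output size along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

def conv2d(image, kernel, stride=1):
    """Unpadded 2D convolution (cross-correlation) on a square image."""
    k = len(kernel)
    size = conv_output_size(len(image), k, stride)
    out = []
    for i in range(size):
        row = []
        for j in range(size):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    acc += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; leftover rows/columns are dropped."""
    n = len(fmap) // size
    return [[max(fmap[i * size + di][j * size + dj]
                 for di in range(size) for dj in range(size))
             for j in range(n)] for i in range(n)]

# Dark left half, bright right half: a vertical edge in the middle.
image = [[0, 0, 1, 1]] * 4
vertical_edge = [[-1, 1], [-1, 1]]  # responds where brightness rises left to right

features = conv2d(image, vertical_edge)
pooled = max_pool2d(features)
print(features)
print(pooled)
```

The filter responds only in the column where the edge sits, and pooling keeps that strongest response while shrinking the map.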
Recurrent Neural Networks (RNNs)
For sequential data — text, time series, speech.
Standard RNN:
x₁ → [h₁] → x₂ → [h₂] → x₃ → [h₃] → output
↓ ↓ ↓
hidden state flows forward through time
| Variant | Solves | How |
|---|---|---|
| Vanilla RNN | — | Simple recurrence, short memory |
| LSTM | Vanishing gradient | Gates (forget, input, output) control memory |
| GRU | Vanishing gradient | Simplified LSTM (reset + update gates), faster |
| Bidirectional | Context from future | Processes sequence forwards AND backwards |
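Vanishing Gradient Problem: In deep networks and long sequences, backpropagation multiplies many small derivative terms together, so gradients shrink toward 0 and the network cannot learn long-range dependencies. LSTM and GRU mitigate this with gating mechanisms that let information and gradients pass through many time steps largely unchanged.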
ELI5: RNNs have memory — they pass a “hidden state” from one time step to the next, like a person reading a sentence who remembers what they’ve read so far. The problem is plain RNNs have terrible long-term memory: by the time they reach word 50 of a sentence, they’ve nearly forgotten word 1. LSTM fixes this with explicit memory gates — a “forget gate” that decides what to erase, an “input gate” that decides what to store, and an “output gate” that decides what to share. Think of LSTM as a person with a notepad: they actively decide what to write down, what to cross out, and what to read back when needed.
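The vanishing gradient effect can be seen numerically with a one-unit vanilla RNN. The sketch below assumes a scalar hidden state and made-up weights; it is a toy, not an LSTM or a framework layer. The gradient of the last hidden state with respect to the first is a product of one factor per time step, and each factor here is smaller than 1.

```python
import math

def rnn_forward(xs, w_h, w_x, b, h0=0.0):
    """Scalar vanilla RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t + b)."""
    hs = []
    h = h0
    for x in xs:
        h = math.tanh(w_h * h + w_x * x + b)
        hs.append(h)
    return hs

def grad_last_wrt_first(hs, w_h):
    """d h_T / d h_1 = product over t = 2..T of w_h * (1 - h_t^2)."""
    g = 1.0
    for h in hs[1:]:
        g *= w_h * (1.0 - h * h)
    return g

# Illustrative weights (assumptions): recurrent weight 0.5, input weight 1.0.
hs = rnn_forward([1.0] * 50, w_h=0.5, w_x=1.0, b=0.0)

g_short = grad_last_wrt_first(hs[:5], w_h=0.5)   # 5-step sequence
g_long = grad_last_wrt_first(hs, w_h=0.5)        # 50-step sequence
print(g_short, g_long)
```

After 50 steps the gradient is effectively zero, which is why step 1 has almost no influence on what the network learns at step 50.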
Training Concepts
Key Training Parameters
| Parameter | What It Controls |
|---|---|
| Epochs | Number of complete passes through training data |
| Batch size | Number of samples processed before weight update |
| Learning rate | Step size for weight updates (most important!) |
| Mini-batch | Compromise between batch (all data) and stochastic (1 sample) |
Learning Rate Impact:
Too high: ╱╲╱╲╱╲ oscillates, may diverge
Just right: ╲ loss converges smoothly
Too low: ╲___________________ very slow convergence
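The three learning-rate regimes are easy to reproduce on a toy problem. This sketch minimizes f(w) = w² with plain gradient descent; the specific learning rates are chosen only to show each regime.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w^2 (gradient 2w); return the loss after each step."""
    w = w0
    losses = []
    for _ in range(steps):
        w = w - lr * 2.0 * w
        losses.append(w * w)
    return losses

too_high = gradient_descent(lr=1.1)     # overshoots further every step
just_right = gradient_descent(lr=0.1)   # converges smoothly
too_low = gradient_descent(lr=0.001)    # barely moves in 20 steps

for name, losses in [("too high", too_high), ("just right", just_right), ("too low", too_low)]:
    print(f"{name:>10}: final loss = {losses[-1]:.6f}")
```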
Overfitting vs Underfitting
| Fit | Training Error | Validation Error | Meaning |
|---|---|---|---|
| Underfitting | High | High | Model too simple, can't learn patterns |
| Good fit | Low | Low | Model generalizes well |
| Overfitting | Very low | High | Model memorized training data, fails on new data |
Figure: error vs. model complexity. Training error keeps falling as complexity increases. Validation error falls at first, bottoms out at the sweet spot, then rises again. Left of the sweet spot the model underfits; right of it the model overfits.
ELI5: Overfitting is like a student who memorized every question and answer from last year’s exam verbatim. They ace the practice test (low training error) but fail the real exam when the questions are slightly different (high validation error) — because they memorized, not understood. Underfitting is the student who barely studied and can’t answer even the practice questions (high error on both). The goal is a student who genuinely learned the concepts and can handle questions they’ve never seen before.
Regularization Techniques
| Technique | How It Works | When to Use |
|---|---|---|
| L1 (Lasso) | Adds $\lambda \sum \lvert w_i \rvert$ to loss | Feature selection (drives weights to 0) |
| L2 (Ridge) | Adds $\lambda \sum w_i^2$ to loss | Reduce all weights (prevents large weights) |
| Elastic Net | L1 + L2 combined | Best of both |
| Dropout | Randomly zero out neurons during training | Deep networks, prevents co-adaptation |
| Early Stopping | Stop training when validation error rises | Universal, simple |
| Data Augmentation | Create more training data (flip, rotate, crop) | Image models |
| Batch Normalization | Normalize layer inputs | Faster training, some regularization |
L1 vs L2 Regularization:
| L1 (Lasso) | L2 (Ridge) |
|---|---|
| Sparse weights | Small weights |
| Some weights → 0 | All weights shrink |
| Feature selection | Weight decay |
| Diamond constraint | Circle constraint |
ELI5: Regularization is like putting constraints on a student’s study habits: “you can study, but you can’t just memorize individual answers.” L1 is strict — it forces the model to pick the most important features and set irrelevant ones to exactly zero (like telling the student they can only keep 10 key facts). L2 is gentler — it keeps all features but shrinks large weights down so no single feature dominates (like telling the student they can keep all their notes but must write them smaller). Dropout is like randomly covering some of the student’s notes during practice so they can’t rely on any one shortcut.
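The different behaviour of L1 and L2 shows up even in a single update step. The sketch below computes both penalties and applies one shrink step for each penalty in isolation (ignoring the data loss). The L1 step uses soft-thresholding, which is one common way to optimize an L1 penalty and is an assumption here, not something the table prescribes.

```python
def l1_penalty(weights, lam):
    """lambda * sum(|w_i|)"""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """lambda * sum(w_i^2)"""
    return lam * sum(w * w for w in weights)

def l2_shrink(w, lr, lam):
    """Gradient step on the L2 penalty alone: scales w toward 0, never exactly 0."""
    return w - lr * 2.0 * lam * w

def l1_shrink(w, lr, lam):
    """Soft-threshold step on the L1 penalty alone: small weights become exactly 0."""
    step = lr * lam
    if w > step:
        return w - step
    if w < -step:
        return w + step
    return 0.0

weights = [0.05, -2.0, 0.5]  # illustrative values
after_l1 = [l1_shrink(w, lr=0.1, lam=1.0) for w in weights]
after_l2 = [l2_shrink(w, lr=0.1, lam=1.0) for w in weights]
print(after_l1)  # the smallest weight is zeroed out
print(after_l2)  # every weight shrinks by the same factor
```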
Evaluation Metrics
Classification Metrics
Confusion Matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
TP = True Positive (correctly predicted positive)
FP = False Positive (incorrectly predicted positive) → "Type I error"
FN = False Negative (incorrectly predicted negative) → "Type II error"
TN = True Negative (correctly predicted negative)
ELI5: The confusion matrix has an unfortunately scary name but it’s actually just a tally sheet. Make two columns (Predicted Positive, Predicted Negative) and two rows (Actually Positive, Actually Negative), then count how many predictions fell in each box. The diagonal (TP and TN) is where your model got things right. The off-diagonal (FP and FN) is where it was wrong. Every other classification metric — precision, recall, F1, accuracy — is just a different arithmetic combination of these four numbers, chosen based on which type of mistake is more costly.
| Metric | Formula | When to Prioritize |
|---|---|---|
| Accuracy | $\frac{TP+TN}{TP+TN+FP+FN}$ | Balanced classes |
| Precision | $\frac{TP}{TP+FP}$ | Cost of false positives is high (spam filter) |
| Recall (Sensitivity) | $\frac{TP}{TP+FN}$ | Cost of false negatives is high (cancer detection) |
| F1 Score | $2 \times \frac{Precision \times Recall}{Precision + Recall}$ | Balance precision & recall |
| Specificity | $\frac{TN}{TN+FP}$ | True negative rate |
| AUC-ROC | Area under ROC curve | Overall classifier quality, threshold-independent |
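As a worked example of the formulas in the table, the sketch below computes each count-based metric from the four confusion-matrix cells. The counts are made up, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the count-based metrics from the four confusion-matrix cells."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Illustrative counts only.
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(f"{name:>11}: {value:.4f}")
```

Here precision (0.80) is noticeably higher than recall (0.67): the model is rarely wrong when it says "positive" but misses a third of the actual positives.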
ROC Curve:
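Figure: ROC curve, with True Positive Rate (Recall) on the y-axis and False Positive Rate (FPR) on the x-axis, both from 0.0 to 1.0. A perfect classifier reaches the top-left corner (AUC=1.0), a good classifier bows above the diagonal (AUC=0.8-0.9), and random guessing follows the diagonal (AUC=0.5).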
AUC = 1.0 → perfect classifier
AUC = 0.5 → random guessing
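One way to build intuition for AUC: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it by brute-force pair comparison, which is fine for small examples but not how libraries implement it for large datasets.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # illustrative model scores
auc = roc_auc(labels, scores)
print(auc)
```

Because only the ordering of scores matters, AUC does not depend on any particular classification threshold.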
Exam tip: “Which metric?” questions are common.
- Medical diagnosis → Recall (don’t miss diseases)
- Spam filter → Precision (don’t block real emails)
- Balanced → F1 Score
- Overall ranking → AUC-ROC
Regression Metrics
| Metric | Formula | Interpretation |
|---|---|---|
| MSE | $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$ | Penalizes large errors more |
| RMSE | $\sqrt{MSE}$ | Same unit as target variable |
| MAE | $\frac{1}{n}\sum \lvert y_i - \hat{y}_i \rvert$ | Average absolute error |
| R² | $1 - \frac{SS_{res}}{SS_{tot}}$ | % of variance explained (1.0 = perfect) |
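The four regression metrics can be computed together from the same residuals. A minimal sketch with made-up values, assuming the targets are not all identical (otherwise SS_tot is zero and R² is undefined):

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R^2 from paired true/predicted values."""
    n = len(y_true)
    errors = [y - p for y, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_y = sum(y_true) / n
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }

# Illustrative values only.
r = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
for name, value in r.items():
    print(f"{name:>4}: {value:.4f}")
```

The single error of 1.0 contributes 1.0 of the 1.5 total squared error but only half of the total absolute error, which is the "penalizes large errors more" effect of MSE.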
Clustering Metrics
| Metric | What It Measures |
|---|---|
| Silhouette Score | How similar an object is to its own cluster vs others (-1 to 1) |
| Davies-Bouldin Index | Average similarity ratio between clusters (lower = better) |
| Inertia / Distortion | Sum of squared distances to centroids (for elbow method) |
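To see how these metrics react to good and bad cluster assignments, here is a small pure-Python sketch of inertia and the silhouette score. It assumes at least two clusters and Euclidean distance, and scores singleton clusters as 0 by convention; the four 1-D points are made up.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def group(points, labels):
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    return clusters

def inertia(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for members in group(points, labels).values():
        centroid = [sum(c) / len(members) for c in zip(*members)]
        total += sum(dist(p, centroid) ** 2 for p in members)
    return total

def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b): a = mean distance to own cluster,
    b = mean distance to the nearest other cluster."""
    clusters = group(points, labels)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # mixes far-apart points
print(round(good, 3), round(bad, 3), inertia(points, [0, 0, 1, 1]))
```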
Ensemble Methods
Bagging (Bootstrap Aggregating)
Training Data
│
├─ Bootstrap Sample 1 → Model 1 ─┐
├─ Bootstrap Sample 2 → Model 2 ─┤→ Average / Vote → Final Prediction
└─ Bootstrap Sample 3 → Model 3 ─┘
- Reduces variance (overfitting)
- Example: Random Forest (bagging of decision trees)
Boosting
Training Data
│
└─ Model 1 → errors → Model 2 → errors → Model 3
(focus on (focus on
mistakes) remaining mistakes)
↓
Weighted sum → Final Prediction
- Reduces bias (underfitting)
- Examples: XGBoost, AdaBoost, LightGBM, CatBoost
Bagging = parallel models, reduce variance. Boosting = sequential models, reduce bias.
ELI5: Bagging is like asking 100 random people on the street the same question and taking the average answer — no one person has to be an expert, but the crowd wisdom is surprisingly accurate. Boosting is more like a relay tutoring session: tutor 1 teaches you everything they can, then tutor 2 focuses specifically on what tutor 1 couldn’t explain, then tutor 3 handles what’s still confusing. Each round targets the remaining weaknesses, so the final combined knowledge is much stronger than any single tutor.
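The "each model focuses on the remaining mistakes" idea behind boosting can be sketched with one-split regression stumps fit to residuals. This is a simplified illustration of gradient boosting for squared error, not the XGBoost or AdaBoost algorithm; the data, learning rate, and helper names are assumptions, and it needs at least two distinct x values.

```python
def fit_stump(xs, residuals):
    """Pick the threshold on x that best splits the residuals into two means."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds, lr=0.5):
    """Sequentially add stumps, each trained on what the ensemble still gets wrong."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + lr * sum(s(x) for s in stumps)

    history = []  # training MSE after each round
    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
        history.append(sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs))
    return predict, history

xs = [float(x) for x in range(10)]
ys = [x * x for x in xs]  # a curve no single stump can fit
predict, history = boost(xs, ys, rounds=10)
print([round(h, 1) for h in history])
```

Training error falls round after round, which is the bias reduction described above. Bagging would instead train the stumps independently on bootstrap samples and average them.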
SageMaker Training & Tuning Tools
Automatic Model Tuning (AMT)
Finds optimal hyperparameters automatically.
| Strategy | How It Works | When to Use |
|---|---|---|
| Random Search | Random combinations | Quick exploration, many params |
| Bayesian Optimization | Uses past results to guide next trial | Default, most efficient |
| Hyperband | Early-stop poor configs, allocate resources to promising ones | Large search spaces |
| Grid Search | Try all combinations | Small search spaces |
Bayesian Optimization:
Trial 1: lr=0.01, depth=5 → score=0.72
Trial 2: lr=0.05, depth=3 → score=0.78 ← model suggests nearby region
Trial 3: lr=0.04, depth=4 → score=0.81 ← converging on optimum
...
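Grid search and random search are simple enough to sketch directly. The example below uses a made-up objective that stands in for "train a model and return its validation score"; it does not call SageMaker and does not implement Bayesian optimization or Hyperband.

```python
import itertools
import random

def objective(lr, depth):
    """Hypothetical validation score that peaks at lr=0.04, depth=4."""
    return 1.0 - 50.0 * (lr - 0.04) ** 2 - 0.01 * (depth - 4) ** 2

def grid_search(lrs, depths):
    """Try every combination; cost grows multiplicatively with each parameter."""
    return max(itertools.product(lrs, depths), key=lambda p: objective(*p))

def random_search(n_trials, seed=0):
    """Sample combinations at random from the ranges; cost is fixed by n_trials."""
    rng = random.Random(seed)
    trials = [(rng.uniform(0.001, 0.1), rng.randint(2, 8)) for _ in range(n_trials)]
    return max(trials, key=lambda p: objective(*p))

best_grid = grid_search([0.01, 0.04, 0.1], [3, 4, 5])
best_random = random_search(50)
print(best_grid, objective(*best_grid))
print(best_random, round(objective(*best_random), 4))
```

Grid search only finds the optimum if it happens to lie on the grid; random search covers continuous ranges but needs enough trials to land near the best region.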
SageMaker Autopilot (AutoML)
- Automatically explores algorithms, preprocessing, and hyperparameters
- Generates candidate notebooks showing what it tried
- Supports: classification, regression, time series forecasting
- Outputs interpretable leaderboard
SageMaker Experiments
- Track and compare training runs
- Log metrics, parameters, artifacts per trial
- Visualize comparisons across experiments
SageMaker Debugger
- Monitors training in real-time
- Detects: vanishing gradients, exploding gradients, overfitting, class imbalance
- Built-in rules: trigger alerts or stop training automatically
- Profiling: CPU/GPU utilization, memory, I/O bottlenecks
SageMaker Model Registry
Model Registry:
┌────────────────────────────────────────────┐
│ Model Group: "fraud-detection" │
│ │
│ Version 1: accuracy=0.85 [Rejected] │
│ Version 2: accuracy=0.91 [Approved] ← ─ │─── Deploy
│ Version 3: accuracy=0.89 [Pending] │
│ │
│ Metadata: metrics, lineage, approval │
└────────────────────────────────────────────┘
- Version control for trained models
- Approval workflows (Pending → Approved or Rejected)
- Model lineage tracking (what data/code produced this model)
- Integrates with SageMaker Pipelines for CI/CD
Distributed Training at Scale
SageMaker Distributed Training
| Strategy | What It Distributes | When to Use |
|---|---|---|
| Data Parallelism | Data split across GPUs, each has full model | Model fits in one GPU, data is huge |
| Model Parallelism | Model split across GPUs | Model too large for one GPU (LLMs) |
Data Parallelism:
GPU 1: Full Model + Data Batch 1 ─┐
GPU 2: Full Model + Data Batch 2 ─┤→ Sync gradients → Update model
GPU 3: Full Model + Data Batch 3 ─┘
Model Parallelism:
GPU 1: Layers 1-10 ──→ GPU 2: Layers 11-20 ──→ GPU 3: Layers 21-30
(pipeline parallelism: overlap computation)
ELI5: Data parallelism is giving each worker a different piece of a jigsaw puzzle to solve simultaneously — everyone has the same picture on the box lid (the full model), just different puzzle pieces (data batches). They all work in parallel and then compare notes to agree on the solution. Model parallelism is used when the puzzle box itself is too big for one worker to hold — so you split the box between workers, each holding a different section of it. Data parallelism is the default; model parallelism is needed only for models too large to fit in a single GPU’s memory, like large language models.
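The "sync gradients" step in data parallelism can be simulated without any GPUs. The sketch below uses a one-parameter model y = w * x and equal-sized shards (both assumptions); with equal shards, averaging the per-worker gradients gives the same update as computing the gradient on all the data at once.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_step(w, shards, lr):
    """Each worker computes a gradient on its shard; the average is applied once."""
    grads = [gradient(w, shard) for shard in shards]  # would run in parallel
    avg_grad = sum(grads) / len(grads)                # the gradient sync
    return w - lr * avg_grad

# Illustrative data for the true relationship y = 3x, split across 3 "workers".
data = [(float(x), 3.0 * x) for x in range(1, 7)]
shards = [data[0:2], data[2:4], data[4:6]]

w_parallel = data_parallel_step(0.0, shards, lr=0.01)
w_single = 0.0 - 0.01 * gradient(0.0, data)
print(w_parallel, w_single)
```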
Training Infrastructure
| Feature | Purpose |
|---|---|
| SageMaker Training Compiler | Optimize DL model compilation for faster training (up to 50%) |
| Warm Pools | Keep instances alive between training jobs (reduce startup time) |
| Checkpointing | Save model state periodically (resume on failure) |
| Managed Spot Training | Use Spot Instances (up to 90% cost savings, needs checkpointing) |
| Elastic Fabric Adapter (EFA) | High-bandwidth, low-latency networking for distributed training |
| SageMaker HyperPod | Managed infrastructure for foundation model training, auto-healing |
SageMaker Instance Types for Training
| Prefix | Type | Use Case |
|---|---|---|
| ml.m5 | General purpose | Small models, preprocessing |
| ml.c5 | Compute optimized | CPU-intensive training |
| ml.p3/p4d | GPU (NVIDIA V100/A100) | Deep learning training |
| ml.g4dn/g5 | GPU (T4/A10G) | Inference, smaller training |
| ml.trn1 | AWS Trainium | Cost-effective DL training |
| ml.inf1/inf2 | AWS Inferentia | High-throughput inference |
SageMaker Clarify — Bias & Explainability
Post-Training Bias Metrics
| Metric | What It Measures |
|---|---|
| Disparate Impact (DI) | Ratio of positive outcomes between groups |
| Difference in Positive Proportions (DPPL) | Difference in positive prediction rates |
| Accuracy Difference | Accuracy gap between groups |
| Treatment Equality | Difference between groups in the ratio of false negatives to false positives |
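The first two metrics in the table can be computed from nothing more than each group's predictions. In this sketch the group names, the choice of group B as the reference group, and the sign convention for DPPL are assumptions for illustration; check the Clarify documentation for its exact definitions.

```python
def positive_rate(predictions):
    """Share of predictions that are positive (1)."""
    return sum(predictions) / len(predictions)

def disparate_impact(preds_a, preds_b):
    """Ratio of positive prediction rates, group A over reference group B."""
    return positive_rate(preds_a) / positive_rate(preds_b)

def dppl(preds_a, preds_b):
    """Difference in positive prediction rates, reference group B minus group A."""
    return positive_rate(preds_b) - positive_rate(preds_a)

# Illustrative predictions: 3 of 10 approved in group A, 5 of 10 in group B.
group_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
group_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

di = disparate_impact(group_a, group_b)
gap = dppl(group_a, group_b)
print(round(di, 2), round(gap, 2), "flag" if di < 0.8 else "ok")
```

A ratio of 0.6 falls below the commonly used 0.8 threshold, so this model would be flagged for review.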
Explainability
| Method | What It Does |
|---|---|
| SHAP values | Contribution of each feature to individual predictions |
| Partial Dependence Plots (PDPs) | Effect of one feature on predictions (marginal effect) |
| Feature importance | Global ranking of feature contributions |
SHAP Example (loan approval prediction):
Base prediction: 0.5 (50% chance)
+ Income: high → +0.25
+ Credit score: ok → +0.10
- Debt ratio: high → -0.15
= Final: 0.70 (70% chance approved)
ELI5: SHAP tells you WHY the model made a specific decision, not just what the decision was. For every individual prediction, it assigns a score to each input feature saying “this feature pushed the prediction up by X” or “this feature pulled it down by Y.” It’s like asking the model to show its work on an exam — you can see that for this particular loan application, income was the biggest positive factor and debt ratio was the biggest negative factor. This is essential for regulated industries (lending, healthcare) where you must be able to explain why a decision was made.
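The loan example above can be reproduced with an exact Shapley computation on a tiny model. The sketch below averages each feature's marginal contribution over every feature ordering, which is only feasible for a handful of features (real tools approximate this). The linear "loan model" and the 0/1 feature encoding are hypothetical, chosen so the numbers match the example.

```python
import itertools
import math

def shapley_values(model, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(x)
    totals = [0.0] * n
    for order in itertools.permutations(range(n)):
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = x[i]          # switch feature i from baseline to actual value
            new = model(current)
            totals[i] += new - prev    # credit the change to feature i
            prev = new
    return [t / math.factorial(n) for t in totals]

# Hypothetical model: features are [high_income, ok_credit_score, high_debt_ratio].
def loan_model(f):
    return 0.5 + 0.25 * f[0] + 0.10 * f[1] - 0.15 * f[2]

applicant = [1, 1, 1]
baseline = [0, 0, 0]   # the "average" applicant the base prediction refers to

values = shapley_values(loan_model, applicant, baseline)
base = loan_model(baseline)
print([round(v, 2) for v in values], round(base + sum(values), 2))
```

The contributions always add up to the gap between the base prediction and the actual prediction, which is what makes the "show your work" breakdown consistent.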
Exam tip: Clarify integrates with Model Monitor for continuous bias monitoring in production. Alerts via CloudWatch if bias thresholds are exceeded.
Model Cards
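SageMaker Model Cards are a governance feature separate from Clarify. A card documents a model in one place: model overview, intended use, training data, evaluation results, bias metrics, limitations. Some training details can be auto-populated for models trained in SageMaker, and Clarify reports can supply the bias metrics. Used to support regulatory compliance (GDPR, EU AI Act) and internal governance.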
SageMaker Processing
Separate from Training — runs data preprocessing, postprocessing, and evaluation scripts.
| Feature | Details |
|---|---|
| Purpose | Data transformation BEFORE training, model evaluation AFTER training |
| Frameworks | SKLearnProcessor, PySparkProcessor, custom containers |
| Timeout | Max 5 days |
| Not Training | Processing ≠ Training (common exam trap) |
Bring Your Own Container (BYOC)
SageMaker container directory structure:
/opt/ml/
├── input/
│ ├── config/
│ │ ├── hyperparameters.json ← your hyperparams
│ │ └── resourceConfig.json ← instance info (distributed)
│ └── data/
│ └── <channel_name>/ ← training data ("train", "validation")
├── model/ ← save model here → uploaded to S3 as model.tar.gz
└── output/
└── failure ← write error messages here
Training container MUST:
1. Read data from /opt/ml/input/data/
2. Read hyperparams from /opt/ml/input/config/hyperparameters.json
3. Write model to /opt/ml/model/
4. Exit 0 on success, non-zero on failure
Inference container MUST:
1. Implement /ping (health check → 200)
2. Implement /invocations (POST → predictions)
3. Load model from /opt/ml/model/
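A training entrypoint that honours the contract above might look like the following sketch. The directory layout and file names under `/opt/ml` come from the structure shown; the `learning_rate` hyperparameter, the `model.json` artifact, and the placeholder "training" are hypothetical. The base directory is a parameter so the script can be dry-run locally, as the last lines do.

```python
import json
import os
import tempfile

def train(base_dir):
    """Read hyperparameters and data, then write a model artifact."""
    hp_path = os.path.join(base_dir, "input", "config", "hyperparameters.json")
    with open(hp_path) as f:
        hyperparams = json.load(f)  # SageMaker passes every value as a string
    lr = float(hyperparams.get("learning_rate", "0.1"))  # hypothetical hyperparameter

    train_dir = os.path.join(base_dir, "input", "data", "train")
    files = sorted(os.listdir(train_dir))

    # Placeholder: a real script would fit a model on the files here.
    model = {"learning_rate": lr, "trained_on": files}

    model_dir = os.path.join(base_dir, "model")
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.json"), "w") as f:
        json.dump(model, f)

def main(base_dir="/opt/ml"):
    """Return 0 on success; on failure write the error to output/failure and return 1."""
    try:
        train(base_dir)
        return 0
    except Exception as exc:
        out_dir = os.path.join(base_dir, "output")
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, "failure"), "w") as f:
            f.write(str(exc))
        return 1

# Local dry run against a temporary directory that mimics the container layout.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "input", "config"))
    os.makedirs(os.path.join(tmp, "input", "data", "train"))
    with open(os.path.join(tmp, "input", "config", "hyperparameters.json"), "w") as f:
        json.dump({"learning_rate": "0.05"}, f)
    exit_code = main(tmp)
    model_written = os.path.exists(os.path.join(tmp, "model", "model.json"))
print(exit_code, model_written)
```

Inside a real container the script would end with `sys.exit(main())` so the process exit code tells SageMaker whether the job succeeded.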
Script Mode vs BYOC vs Built-in
Custom code needed?
├── Standard framework (TF, PyTorch, sklearn)?
│ └── Script Mode (provide train.py, SM provides container)
├── Non-standard library or custom serving?
│ └── BYOC (build Docker → push to ECR)
└── No custom code?
└── Built-in algorithm
Spot Training — Full Configuration
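Requirements:
1. use_spot_instances = True
2. max_wait ≥ max_run (max_wait covers training time plus time spent waiting for Spot capacity)
3. checkpoint_s3_uri = S3 path (strongly recommended; without it an interrupted job restarts from scratch)
4. Training code or algorithm must write checkpoints and reload them on restart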
If interrupted:
→ Checkpoints already synced to checkpoint_s3_uri are kept (work since the last checkpoint is lost)
→ SM waits for new Spot capacity
→ Resumes from last checkpoint
→ You only pay for actual training time (not waiting)
If max_wait expires before completion:
→ Job ends as Stopped with secondary status "MaxWaitTimeExceeded" (training did not complete)
→ Increase max_wait, or switch to On-Demand
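A quick sanity check on a Spot configuration can catch mistakes before a job is submitted. The helper below is hypothetical (it is not part of the SageMaker SDK); the keys mirror the parameter names listed above, and it assumes the API rule that max_wait may equal max_run but may not be smaller. The bucket name is a placeholder.

```python
def check_spot_config(cfg):
    """Return a list of problems with a Managed Spot Training configuration."""
    problems = []
    if not cfg.get("use_spot_instances"):
        problems.append("use_spot_instances must be True")

    max_run, max_wait = cfg.get("max_run"), cfg.get("max_wait")
    if max_run is None or max_wait is None:
        problems.append("set both max_run and max_wait (in seconds)")
    elif max_wait < max_run:
        problems.append("max_wait must be at least max_run")

    if not str(cfg.get("checkpoint_s3_uri", "")).startswith("s3://"):
        problems.append("checkpoint_s3_uri should be an s3:// path so progress survives interruption")
    return problems

good = {
    "use_spot_instances": True,
    "max_run": 3600,        # up to 1 hour of actual training
    "max_wait": 7200,       # up to 2 hours including waiting for capacity
    "checkpoint_s3_uri": "s3://example-bucket/checkpoints/",  # placeholder bucket
}
bad = dict(good, max_wait=1800, checkpoint_s3_uri=None)

print(check_spot_config(good))
print(check_spot_config(bad))
```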
HPO Warm Start
Resume from previous tuning job’s learned knowledge:
| Type | When |
|---|---|
| IDENTICAL_DATA_AND_ALGORITHM | Same problem — transfer all knowledge |
| TRANSFER_LEARNING | Similar problem — partial knowledge transfer |
Transfer Learning vs Incremental Learning
| | Transfer Learning | Incremental Learning |
|---|---|---|
| Task | New/different task | Same task |
| Data | Small target dataset | New batch for existing task |
| Source | Pre-trained model from different domain | Your own previous model |
| Example | ImageNet → medical X-ray classifier | Fraud model Jan → update with Feb data |
| Risk | Negative transfer | Catastrophic forgetting |
Is the TASK changing?
├── YES → Transfer Learning (fine-tune pre-trained model)
└── NO, same task but new data?
├── Small batch, keep existing → Incremental Learning
│ (pass model.tar.gz as input channel)
└── Distribution shifted significantly → Full Retrain
Cross-Validation
| Technique | When to Use |
|---|---|
| K-Fold (K=5 or 10) | Standard — most problems |
| Stratified K-Fold | Imbalanced classification (preserves class ratios) |
| Time-Series Split | Temporal data (train on past, test on future) |
| LOOCV | Very small datasets (<1000 samples) |
Exam key: NEVER use random K-Fold on time series (trains on future, tests on past = leakage). Use walk-forward validation.
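The difference between K-Fold and walk-forward validation is clearest when you look at the indices each one produces. The sketch below generates both kinds of split in plain Python; the function names are made up, and the K-Fold version uses contiguous folds (shuffle the indices first for non-temporal data).

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

def walk_forward_indices(n, n_splits, test_size):
    """Yield expanding-window splits: train on the past, test on the block right after it."""
    for i in range(n_splits):
        test_end = n - (n_splits - 1 - i) * test_size
        test_start = test_end - test_size
        yield list(range(0, test_start)), list(range(test_start, test_end))

folds = list(k_fold_indices(10, 3))
walks = list(walk_forward_indices(10, n_splits=3, test_size=2))

for train, test in folds:
    print("k-fold       train", train, "test", test)
for train, test in walks:
    print("walk-forward train", train, "test", test)
```

In the K-Fold output, the first fold trains on indices that come after its test block, which is exactly the leakage to avoid on time series. Every walk-forward split keeps all training indices before the test indices.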
Loss Functions
| Function | Task | Notes |
|---|---|---|
| Binary Cross-Entropy | Binary classification | Log loss between probability and label |
| Categorical Cross-Entropy | Multi-class classification | Log loss across classes |
| MSE | Regression | Penalizes large errors |
| MAE | Regression | Robust to outliers |
| Huber Loss | Regression | MSE for small errors, MAE for large (best of both) |
| Focal Loss | Imbalanced classification | Down-weights easy examples, focuses on hard ones |
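Three of the losses in the table are sketched below for a single example. The clipping constant `eps`, the Huber `delta` of 1.0, and the focal `gamma` of 2.0 are common defaults used here as assumptions; the focal loss omits the optional class-weighting term for brevity.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """Log loss for one example: y is 0 or 1, p is the predicted P(y=1)."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def huber(y, y_hat, delta=1.0):
    """Quadratic (MSE-like) for small errors, linear (MAE-like) beyond delta."""
    err = abs(y - y_hat)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)

def focal_loss(y, p, gamma=2.0, eps=1e-12):
    """Cross-entropy scaled by (1 - p_t)^gamma, so confident correct predictions count less."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p   # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy_bce, easy_focal = binary_cross_entropy(1, 0.9), focal_loss(1, 0.9)
hard_bce, hard_focal = binary_cross_entropy(1, 0.1), focal_loss(1, 0.1)
print(round(easy_bce, 4), round(easy_focal, 4))  # easy example: focal is ~100x smaller
print(round(hard_bce, 4), round(hard_focal, 4))  # hard example: focal stays close to BCE
print(huber(0.0, 0.5), huber(0.0, 3.0))
```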
F-Beta Scores
| Score | Emphasis | Use When |
|---|---|---|
| F0.5 | Weights Precision higher | FP is costly (spam filter) |
| F1 | Equal weight | Balance precision and recall |
| F2 | Weights Recall higher | FN is costly (cancer detection) |
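All three rows are instances of one formula, F_β = (1 + β²) × Precision × Recall / (β² × Precision + Recall). A minimal sketch with made-up precision and recall values:

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

# Illustrative model with perfect recall but only 50% precision.
p, r = 0.5, 1.0
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta}: {f_beta(p, r, beta):.4f}")
```

Because this model's weakness is precision, F0.5 punishes it most (0.56) and F2 is most forgiving (0.83).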
Key Formulas
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
Specificity = TN / (TN + FP)
FPR = 1 - Specificity
RMSE = √(mean((y - ŷ)²))
R² = 1 - (SS_res / SS_tot)
Disparate Impact = rate_group_A / rate_group_B (flag if < 0.8)
TF-IDF(t,d) = TF(t,d) × log(N / DF(t))
Gradient descent update: w_new = w_old - lr × gradient(Loss)