Domain 2B: Model Training, Tuning & Evaluation

Model Training, Tuning & Evaluation

Exam Domain: 2 (ML Model Development, 26%)
Task: Train, refine, and analyze model performance

Deep Learning Foundations

Neural Network Basics

Input Layer      Hidden Layers       Output Layer
  (features)      (learned)          (prediction)

  x₁ ─────┐   ┌─── h₁ ───┐
           ├───┤           ├─── h₄ ───── ŷ
  x₂ ─────┤   └─── h₂ ───┤
           │       h₃ ────┘
  x₃ ─────┘

Each connection has a weight (w).
Each node applies: output = activation(Σ(wᵢ × xᵢ) + bias)

ELI5: A neural network is like a factory assembly line. Raw materials (your input data) go in at one end. They pass through a series of processing stations (layers), where each station’s workers (neurons) each look at the incoming material and decide how much of their own signal to pass forward, based on weights they learned during training. At the end of the line, a finished product comes out (the prediction). Training is just running millions of products through the line, comparing the output to the correct answer, and slightly adjusting each worker’s decision rules to reduce mistakes.
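As a minimal sketch of the per-node formula above, the following computes one neuron's output. The input values, weights, and bias are hypothetical numbers chosen only for illustration, and ReLU is used as the activation.

```python
def relu(z):
    """ReLU activation: pass positive values, block negatives."""
    return max(0.0, z)

def neuron_output(inputs, weights, bias, activation):
    """Compute activation(sum(w_i * x_i) + bias) for a single neuron."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

# Hypothetical values: three input features feeding one hidden neuron.
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
b = 0.2

h = neuron_output(x, w, b, relu)
print(h)  # approximately 0.5 (0.5 - 0.5 + 0.3 + 0.2)
```

A full layer repeats this calculation once per neuron, each with its own weights and bias.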

Activation Functions

Function	Formula	Range	Use Case
Sigmoid	$\sigma(x) = \frac{1}{1+e^{-x}}$	(0, 1)	Binary classification output
Tanh	$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$	(-1, 1)	Hidden layers (zero-centered)
ReLU	$f(x) = \max(0, x)$	[0, ∞)	Default for hidden layers
Leaky ReLU	$f(x) = \max(0.01x, x)$	(-∞, ∞)	Fixes “dying ReLU” problem
Softmax	$\frac{e^{x_i}}{\sum e^{x_j}}$	(0, 1), sums to 1	Multi-class output

When to use which activation:
  Hidden layers   → ReLU (or Leaky ReLU)
  Binary output   → Sigmoid
  Multi-class     → Softmax
  RNN hidden      → Tanh

ELI5: Activation functions are decision gates that control how much signal each neuron passes forward. Without them, stacking layers would be pointless: a network of linear operations is still just one linear operation, no matter how many layers deep. ReLU is the default because it is simple (pass positive numbers through, block negatives) and its gradient is exactly 1 for positive inputs, so it is far less prone to the vanishing gradient problem that plagues Sigmoid and Tanh in deep networks. Think of it as: Sigmoid squashes everything to a probability, Softmax turns a list of scores into probabilities that sum to 1, and ReLU lets the positive signal through unchanged.
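The formulas in the table can be written out directly. This is a plain-Python sketch for scalar inputs (frameworks provide vectorized versions); the max-subtraction inside softmax is a common numerical-stability trick and does not change the result.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Equivalent to max(slope * x, x) for 0 < slope < 1.
    return x if x > 0 else slope * x

def softmax(scores):
    # Subtract the largest score so exp() cannot overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))              # 0.5
print(math.tanh(1.0))            # tanh is available in the standard library
print(relu(-2.0), relu(3.0))     # 0.0 3.0
print(leaky_relu(-2.0))          # -0.02
print(softmax([2.0, 1.0, 0.1]))  # three probabilities that sum to 1
```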

Convolutional Neural Networks (CNNs)

For image data — learn spatial hierarchies of features.

Input Image → [Conv + ReLU] → [Pooling] → [Conv + ReLU] → [Pooling] → [Flatten] → [Dense] → Output
              extract edges    downsample   extract shapes  downsample   1D vector    classify

Layer	Purpose
Convolution	Apply filters to detect features (edges, textures, shapes)
Pooling (Max/Avg)	Downsample — reduce spatial size, keep important info
Flatten	Convert 2D feature maps to 1D vector
Dense (FC)	Final classification layers

Key concepts:

Filters/Kernels: small matrices that slide across the image
Stride: how many pixels the filter moves each step
Padding: add zeros around edges to preserve spatial dimensions
Transfer learning: use pre-trained networks (ResNet, VGG) and fine-tune
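Kernel size, stride, and padding together determine the spatial size of a layer's output. The sketch below uses the standard formula floor((n + 2p - k) / s) + 1 for one dimension; the 32-pixel input is a hypothetical example.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension of a conv or pooling layer."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# 32-pixel-wide input, 3x3 filter, stride 1.
print(conv_output_size(32, 3))                   # 30: no padding shrinks the map
print(conv_output_size(32, 3, padding=1))        # 32: padding preserves the size
# 2x2 max pooling with stride 2 halves the size.
print(conv_output_size(32, 2, stride=2))         # 16
```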

ELI5: CNNs process images in a hierarchy of increasing complexity, loosely like the visual system. The first layer learns to detect simple edges (a horizontal line, a diagonal). The second layer combines edges into shapes (a corner, a curve). Deeper layers combine shapes into recognizable object parts (an ear, a wheel). By the final layer, the network can recognize “cat” or “car” from the combinations of low-level features it detected. Transfer learning exploits this: a network pre-trained on millions of images already knows how to detect edges and shapes, so you only need to teach it the last step for your specific task.

Recurrent Neural Networks (RNNs)

For sequential data — text, time series, speech.

Standard RNN:
  x₁ → [h₁] → x₂ → [h₂] → x₃ → [h₃] → output
         ↓              ↓              ↓
    hidden state flows forward through time

Variant	Solves	How
Vanilla RNN	—	Simple recurrence, short memory
LSTM	Vanishing gradient	Gates (forget, input, output) control memory
GRU	Vanishing gradient	Simplified LSTM (reset + update gates), faster
Bidirectional	Context from future	Processes sequence forwards AND backwards

Vanishing Gradient Problem: In deep networks / long sequences, gradients shrink to ~0 during backpropagation, preventing learning of long-range dependencies. LSTM/GRU solve this with gating mechanisms.

ELI5: RNNs have memory — they pass a “hidden state” from one time step to the next, like a person reading a sentence who remembers what they’ve read so far. The problem is plain RNNs have terrible long-term memory: by the time they reach word 50 of a sentence, they’ve nearly forgotten word 1. LSTM fixes this with explicit memory gates — a “forget gate” that decides what to erase, an “input gate” that decides what to store, and an “output gate” that decides what to share. Think of LSTM as a person with a notepad: they actively decide what to write down, what to cross out, and what to read back when needed.
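To make the vanishing gradient concrete, here is a toy one-unit vanilla RNN, h_t = tanh(w_hidden * h_(t-1) + w_input * x_t). The weights and inputs are hypothetical. By the chain rule, the gradient of the last hidden state with respect to the initial state is a product of one factor per time step, and each factor here is at most 0.5 in magnitude.

```python
import math

def rnn_forward(inputs, w_hidden, w_input, h0=0.0):
    """Run a one-unit vanilla RNN and return the hidden state at every step."""
    states = []
    h = h0
    for x in inputs:
        h = math.tanh(w_hidden * h + w_input * x)
        states.append(h)
    return states

def gradient_wrt_initial_state(states, w_hidden):
    """d h_T / d h_0 = product over t of w_hidden * (1 - h_t**2)."""
    grad = 1.0
    for h in states:
        grad *= w_hidden * (1.0 - h * h)
    return grad

w_hidden, w_input = 0.5, 1.0  # hypothetical weights
short = gradient_wrt_initial_state(rnn_forward([1.0] * 5, w_hidden, w_input), w_hidden)
long = gradient_wrt_initial_state(rnn_forward([1.0] * 50, w_hidden, w_input), w_hidden)
print(short, long)  # the 50-step gradient is many orders of magnitude smaller
```

LSTM and GRU gates give the gradient a more direct path through time, which is why they retain long-range information better than this plain recurrence.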

Training Concepts

Key Training Parameters

Parameter	What It Controls
Epochs	Number of complete passes through training data
Batch size	Number of samples processed before weight update
Learning rate	Step size for weight updates (usually the most sensitive hyperparameter)
Mini-batch	Compromise between batch (all data) and stochastic (1 sample)

Learning Rate Impact:
  Too high:    ╱╲╱╲╱╲  oscillates, may diverge
  Just right:  ╲  loss converges smoothly
  Too low:     ╲___________________  very slow convergence
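The three regimes can be reproduced on a toy problem. This sketch minimizes loss(w) = w², whose gradient is 2w; the specific learning rates are illustrative choices for this loss only, not recommendations.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimize loss(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w

too_high = gradient_descent(lr=1.1)     # each step multiplies w by -1.2: oscillates and diverges
just_right = gradient_descent(lr=0.3)   # each step multiplies w by 0.4: converges quickly
too_low = gradient_descent(lr=0.001)    # each step multiplies w by 0.998: barely moves

print(too_high, just_right, too_low)
```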

Overfitting vs Underfitting

                    Training Error    Validation Error
                    ──────────────    ────────────────
Underfitting:       HIGH              HIGH
  → Model too simple, can't learn patterns

Good fit:           LOW               LOW
  → Model generalizes well

Overfitting:        VERY LOW          HIGH
  → Model memorized training data, fails on new data

Error
  │
  │  ╲  validation error
  │   ╲         ╱────────
  │    ╲───────╱
  │     ╲  ╱   ← sweet spot
  │      ╲╱
  │       ╲
  │        ╲ training error
  │         ╲___________
  └──────────────────────── Model Complexity
       underfit   │   overfit

ELI5: Overfitting is like a student who memorized every question and answer from last year’s exam verbatim. They ace the practice test (low training error) but fail the real exam when the questions are slightly different (high validation error) — because they memorized, not understood. Underfitting is the student who barely studied and can’t answer even the practice questions (high error on both). The goal is a student who genuinely learned the concepts and can handle questions they’ve never seen before.

Regularization Techniques

Technique	How It Works	When to Use
L1 (Lasso)	Adds $\lambda \sum |w_i|$ to loss	Feature selection (drives weights to 0)
L2 (Ridge)	Adds $ \lambda \sum w_i^2 $ to loss	Reduce all weights (prevents large weights)
Elastic Net	L1 + L2 combined	Best of both
Dropout	Randomly zero out neurons during training	Deep networks, prevents co-adaptation
Early Stopping	Stop training when validation error rises	Universal, simple
Data Augmentation	Create more training data (flip, rotate, crop)	Image models
Batch Normalization	Normalize layer inputs	Faster training, some regularization

L1 vs L2 Regularization:

L1 (Lasso):          L2 (Ridge):
  Sparse weights       Small weights
  Some weights → 0     All weights shrink
  Feature selection    Weight decay
  Diamond constraint   Circle constraint

ELI5: Regularization is like putting constraints on a student’s study habits: “you can study, but you can’t just memorize individual answers.” L1 is strict — it forces the model to pick the most important features and set irrelevant ones to exactly zero (like telling the student they can only keep 10 key facts). L2 is gentler — it keeps all features but shrinks large weights down so no single feature dominates (like telling the student they can keep all their notes but must write them smaller). Dropout is like randomly covering some of the student’s notes during practice so they can’t rely on any one shortcut.
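The difference between sparse and small weights shows up in a short sketch. The penalty functions follow the table's formulas. The two shrink functions are the standard proximal steps for each penalty, used here as an illustrative assumption about how an optimizer might apply them; the weights are hypothetical.

```python
def l1_penalty(weights, lam):
    """L1 term added to the loss: lam * sum(|w_i|)."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 term added to the loss: lam * sum(w_i**2)."""
    return lam * sum(w * w for w in weights)

def l1_shrink(w, t):
    """Soft-thresholding: weights within t of zero become exactly zero."""
    if abs(w) <= t:
        return 0.0
    return w - t if w > 0 else w + t

def l2_shrink(w, t):
    """Proximal step for t * w**2: every weight is scaled toward zero."""
    return w / (1.0 + 2.0 * t)

weights = [0.8, -0.05, 0.3, 0.01]
sparse = [l1_shrink(w, 0.1) for w in weights]
small = [l2_shrink(w, 0.1) for w in weights]
print(sparse)  # the two tiny weights are exactly 0.0
print(small)   # all weights shrink, none reaches zero
```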

Evaluation Metrics

Classification Metrics

Confusion Matrix:
                    Predicted
                 Positive  Negative
Actual Positive [  TP    |   FN   ]
Actual Negative [  FP    |   TN   ]

TP = True Positive   (correctly predicted positive)
FP = False Positive  (incorrectly predicted positive) → "Type I error"
FN = False Negative  (incorrectly predicted negative) → "Type II error"
TN = True Negative   (correctly predicted negative)

ELI5: The confusion matrix has an unfortunately scary name but it’s actually just a tally sheet. Make two columns (Predicted Positive, Predicted Negative) and two rows (Actually Positive, Actually Negative), then count how many predictions fell in each box. The diagonal (TP and TN) is where your model got things right. The off-diagonal (FP and FN) is where it was wrong. Every other classification metric — precision, recall, F1, accuracy — is just a different arithmetic combination of these four numbers, chosen based on which type of mistake is more costly.

Metric	Formula	When to Prioritize
Accuracy	$\frac{TP+TN}{TP+TN+FP+FN}$	Balanced classes
Precision	$\frac{TP}{TP+FP}$	Cost of false positives is high (spam filter)
Recall (Sensitivity)	$\frac{TP}{TP+FN}$	Cost of false negatives is high (cancer detection)
F1 Score	$2 \times \frac{Precision \times Recall}{Precision + Recall}$	Balance precision & recall
Specificity	$\frac{TN}{TN+FP}$	True negative rate
AUC-ROC	Area under ROC curve	Overall classifier quality, threshold-independent
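All of the count-based metrics in the table follow from the four confusion-matrix cells. A small sketch with hypothetical counts (it assumes no denominator is zero):

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive the standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }

# Hypothetical imbalanced dataset: 100 positives, 900 negatives.
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts accuracy is 0.97 even though one in five positives is missed (recall 0.80), which is why accuracy alone is a poor guide on imbalanced classes.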

ROC Curve:
  True Positive Rate (Recall)
  1.0 ┌─────────────────┐
      │         ╱───────│  Perfect (AUC=1.0)
      │       ╱         │
      │     ╱           │  Good (AUC=0.8-0.9)
      │   ╱   ╱─────── │
      │  ╱  ╱           │  Random (AUC=0.5)
      │╱ ╱              │
  0.0 └─────────────────┘
      0.0    FPR    1.0

AUC = 1.0 → perfect classifier
AUC = 0.5 → random guessing

Exam tip: “Which metric?” questions are common.
Medical diagnosis → Recall (don’t miss diseases)
Spam filter → Precision (don’t block real emails)
Balanced → F1 Score
Overall ranking → AUC-ROC

Regression Metrics

Metric	Formula	Interpretation
MSE	$\frac{1}{n}\sum(y_i - \hat{y}_i)^2$	Penalizes large errors more
RMSE	$\sqrt{MSE}$	Same unit as target variable
MAE	$\frac{1}{n}\sum|y_i - \hat{y}_i|$	Average absolute error
R²	$1 - \frac{SS_{res}}{SS_{tot}}$	Fraction of variance explained (1.0 = perfect)
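The four regression metrics can be computed in a few lines. The target and prediction values below are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [y - p for y, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)             # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)  # total sum of squares
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": 1.0 - ss_res / ss_tot}

result = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 8.0, 9.5])
print(result)  # mse 0.375, mae 0.5, r2 0.925
```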

Clustering Metrics

Metric	What It Measures
Silhouette Score	How similar an object is to its own cluster vs others (-1 to 1)
Davies-Bouldin Index	Average similarity ratio between clusters (lower = better)
Inertia / Distortion	Sum of squared distances to centroids (for elbow method)

Ensemble Methods

Bagging (Bootstrap Aggregating)

Training Data
    │
    ├─ Bootstrap Sample 1 → Model 1 ─┐
    ├─ Bootstrap Sample 2 → Model 2 ─┤→ Average / Vote → Final Prediction
    └─ Bootstrap Sample 3 → Model 3 ─┘

Reduces variance (overfitting)
Example: Random Forest (bagging of decision trees)

Boosting

Training Data
    │
    └─ Model 1 → errors → Model 2 → errors → Model 3
                (focus on          (focus on
                 mistakes)          remaining mistakes)
                         ↓
                  Weighted sum → Final Prediction

Reduces bias (underfitting)
Examples: XGBoost, AdaBoost, LightGBM, CatBoost

Bagging = parallel models, reduce variance. Boosting = sequential models, reduce bias.

ELI5: Bagging is like asking 100 random people on the street the same question and taking the average answer — no one person has to be an expert, but the crowd wisdom is surprisingly accurate. Boosting is more like a relay tutoring session: tutor 1 teaches you everything they can, then tutor 2 focuses specifically on what tutor 1 couldn’t explain, then tutor 3 handles what’s still confusing. Each round targets the remaining weaknesses, so the final combined knowledge is much stronger than any single tutor.
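Bagging can be sketched with a deliberately weak base model. Here each model is a one-feature threshold "stump" fitted on a bootstrap sample, and the ensemble takes a majority vote. The dataset, the stump, and the assumption that class 1 has the larger feature values are all hypothetical simplifications.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Threshold classifier: midpoint between the two class means."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:
        # Degenerate bootstrap sample with one class: always predict that class.
        only = 0 if zeros else 1
        return lambda x: only
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0
    return lambda x: 1 if x > threshold else 0

def bagging_fit(data, n_models, rng):
    models = []
    for _ in range(n_models):
        bootstrap = rng.choices(data, k=len(data))  # sample with replacement
        models.append(fit_stump(bootstrap))
    return models

def bagging_predict(models, x):
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]  # majority vote

data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0), (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
models = bagging_fit(data, n_models=25, rng=random.Random(0))
print(bagging_predict(models, 2.5), bagging_predict(models, 8.5))
```

Boosting differs in that the models would be trained one after another, each on data reweighted toward the previous models' mistakes, rather than independently on bootstrap samples.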

SageMaker Training & Tuning Tools

Automatic Model Tuning (AMT)

Finds optimal hyperparameters automatically.

Strategy	How It Works	When to Use
Random Search	Random combinations	Quick exploration, many params
Bayesian Optimization	Uses past results to guide next trial	Default, most efficient
Hyperband	Early-stop poor configs, allocate resources to promising ones	Large search spaces
Grid Search	Try all combinations	Small search spaces

Bayesian Optimization:
  Trial 1: lr=0.01, depth=5  → score=0.72
  Trial 2: lr=0.05, depth=3  → score=0.78   ← model suggests nearby region
  Trial 3: lr=0.04, depth=4  → score=0.81   ← converging on optimum
  ...
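The two simplest strategies in the table are easy to sketch. The score function below is a hypothetical stand-in for "train a model and return its validation metric", and this toy code is not the SageMaker API. Bayesian optimization would replace the random sampling with a model that proposes the next trial from past results.

```python
import itertools
import random

def score(lr, depth):
    """Hypothetical validation score that peaks at lr=0.04, depth=4."""
    return 1.0 - 50.0 * (lr - 0.04) ** 2 - 0.01 * (depth - 4) ** 2

def grid_search(lrs, depths):
    """Try every combination and keep the best."""
    return max(itertools.product(lrs, depths), key=lambda p: score(*p))

def random_search(n_trials, rng):
    """Sample combinations at random from the ranges and keep the best."""
    trials = [(rng.uniform(0.001, 0.1), rng.randint(2, 8)) for _ in range(n_trials)]
    return max(trials, key=lambda p: score(*p))

best_grid = grid_search([0.01, 0.04, 0.1], [3, 4, 5])
best_random = random_search(20, random.Random(0))
print(best_grid, score(*best_grid))
print(best_random, score(*best_random))
```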

SageMaker Autopilot (AutoML)

Automatically explores algorithms, preprocessing, and hyperparameters
Generates candidate notebooks showing what it tried
Supports: classification, regression, time series forecasting
Outputs interpretable leaderboard

SageMaker Experiments

Track and compare training runs
Log metrics, parameters, artifacts per trial
Visualize comparisons across experiments

SageMaker Debugger

Monitors training in real-time
Detects: vanishing gradients, exploding gradients, overfitting, class imbalance
Built-in rules: trigger alerts or stop training automatically
Profiling: CPU/GPU utilization, memory, I/O bottlenecks

SageMaker Model Registry

Model Registry:
┌────────────────────────────────────────────┐
│  Model Group: "fraud-detection"            │
│                                            │
│  Version 1: accuracy=0.85  [Rejected]      │
│  Version 2: accuracy=0.91  [Approved] ← ─ │─── Deploy
│  Version 3: accuracy=0.89  [Pending]       │
│                                            │
│  Metadata: metrics, lineage, approval      │
└────────────────────────────────────────────┘

Version control for trained models
Approval workflows (Pending → Approved or Rejected)
Model lineage tracking (what data/code produced this model)
Integrates with SageMaker Pipelines for CI/CD

Distributed Training at Scale

SageMaker Distributed Training

Strategy	What It Distributes	When to Use
Data Parallelism	Data split across GPUs, each has full model	Model fits in one GPU, data is huge
Model Parallelism	Model split across GPUs	Model too large for one GPU (LLMs)

Data Parallelism:
  GPU 1: Full Model + Data Batch 1  ─┐
  GPU 2: Full Model + Data Batch 2  ─┤→ Sync gradients → Update model
  GPU 3: Full Model + Data Batch 3  ─┘

Model Parallelism:
  GPU 1: Layers 1-10   ──→  GPU 2: Layers 11-20  ──→  GPU 3: Layers 21-30
  (pipeline parallelism: overlap computation)

ELI5: Data parallelism is giving each worker a different piece of a jigsaw puzzle to solve simultaneously — everyone has the same picture on the box lid (the full model), just different puzzle pieces (data batches). They all work in parallel and then compare notes to agree on the solution. Model parallelism is used when the puzzle box itself is too big for one worker to hold — so you split the box between workers, each holding a different section of it. Data parallelism is the default; model parallelism is needed only for models too large to fit in a single GPU’s memory, like large language models.
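The "sync gradients" step in data parallelism amounts to averaging per-worker gradients. The sketch below simulates three workers on a toy linear model y = w·x with hypothetical data; with equal-sized shards, the averaged gradient matches the gradient over the full batch.

```python
def mse_gradient(w, batch):
    """Gradient of mean((w*x - y)**2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_step(w, shards, lr):
    local_grads = [mse_gradient(w, shard) for shard in shards]  # one gradient per worker
    synced = sum(local_grads) / len(local_grads)                # all-reduce: average
    return w - lr * synced                                      # same update on every replica

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0), (6.0, 12.0)]
shards = [data[0:2], data[2:4], data[4:6]]  # each simulated GPU gets a different slice

w_parallel = data_parallel_step(0.0, shards, lr=0.01)
w_single = 0.0 - 0.01 * mse_gradient(0.0, data)
print(w_parallel, w_single)
```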

Training Infrastructure

Feature	Purpose
SageMaker Training Compiler	Optimize DL model compilation for faster training (up to 50%)
Warm Pools	Keep instances alive between training jobs (reduce startup time)
Checkpointing	Save model state periodically (resume on failure)
Managed Spot Training	Use Spot Instances (up to 90% cost savings; pair with checkpointing to survive interruptions)
Elastic Fabric Adapter (EFA)	High-bandwidth, low-latency networking for distributed training
SageMaker HyperPod	Managed infrastructure for foundation model training, auto-healing

SageMaker Instance Types for Training

Prefix	Type	Use Case
ml.m5	General purpose	Small models, preprocessing
ml.c5	Compute optimized	CPU-intensive training
ml.p3/p4d	GPU (NVIDIA V100/A100)	Deep learning training
ml.g4dn/g5	GPU (T4/A10G)	Inference, smaller training
ml.trn1	AWS Trainium	Cost-effective DL training
ml.inf1/inf2	AWS Inferentia	High-throughput inference

SageMaker Clarify — Bias & Explainability

Post-Training Bias Metrics

Metric	What It Measures
Disparate Impact (DI)	Ratio of positive outcomes between groups
Difference in Positive Proportions in Predicted Labels (DPPL)	Difference in positive prediction rates
Accuracy Difference	Accuracy gap between groups
Treatment Equality	Ratio of FP to FN across groups
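The first two metrics depend only on each group's rate of positive predictions. A minimal sketch with hypothetical predictions for two groups; which group goes in the numerator and the sign of the difference are assumptions here, so check the convention of the tool you use.

```python
def positive_rate(predictions):
    """Share of predictions that are positive (1)."""
    return sum(predictions) / len(predictions)

def disparate_impact(preds_a, preds_b):
    """Ratio of positive prediction rates: group A over group B."""
    return positive_rate(preds_a) / positive_rate(preds_b)

def rate_difference(preds_a, preds_b):
    """Difference in positive prediction rates: group A minus group B."""
    return positive_rate(preds_a) - positive_rate(preds_b)

group_a = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # 20% predicted positive
group_b = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]  # 50% predicted positive

di = disparate_impact(group_a, group_b)
print(di, rate_difference(group_a, group_b))
print("flag" if di < 0.8 else "ok")  # 0.8 is the common four-fifths threshold
```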

Explainability

Method	What It Does
SHAP values	Contribution of each feature to individual predictions
Partial Dependence Plots (PDPs)	Effect of one feature on predictions (marginal effect)
Feature importance	Global ranking of feature contributions

SHAP Example (loan approval prediction):
  Base prediction: 0.5 (50% chance)
  + Income: high     → +0.25
  + Credit score: ok → +0.10
  - Debt ratio: high → -0.15
  = Final: 0.70 (70% chance approved)

ELI5: SHAP tells you WHY the model made a specific decision, not just what the decision was. For every individual prediction, it assigns a score to each input feature saying “this feature pushed the prediction up by X” or “this feature pulled it down by Y.” It’s like asking the model to show its work on an exam — you can see that for this particular loan application, income was the biggest positive factor and debt ratio was the biggest negative factor. This is essential for regulated industries (lending, healthcare) where you must be able to explain why a decision was made.
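Under the hood, SHAP values are Shapley values: each feature's marginal contribution averaged over every order in which features could be added. The sketch below computes them exactly for a toy additive model that reuses the loan example's numbers. The value function is hypothetical, and real tools approximate this calculation because the exact version grows factorially with the number of features.

```python
from itertools import permutations

def shapley_values(features, value_fn):
    """Average each feature's marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            after = value_fn(frozenset(present))
            totals[f] += after - before
    return {f: total / len(orderings) for f, total in totals.items()}

BASE = 0.5
EFFECTS = {"income": 0.25, "credit_score": 0.10, "debt_ratio": -0.15}

def toy_model(present_features):
    """Hypothetical model output when only these features are known."""
    return BASE + sum(EFFECTS[f] for f in present_features)

phi = shapley_values(list(EFFECTS), toy_model)
prediction = BASE + sum(phi.values())
print(phi)
print(round(prediction, 2))  # 0.7: base value plus all contributions
```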

Exam tip: Clarify integrates with Model Monitor for continuous bias monitoring in production. Alerts via CloudWatch if bias thresholds are exceeded.

Model Cards

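SageMaker Model Cards are a separate governance feature (not generated by Clarify) that document a model in one place: model overview, intended use, training data, evaluation results, bias metrics, limitations. Evaluation outputs such as Clarify bias reports can be recorded in a card. Used for regulatory compliance (GDPR, EU AI Act) and internal governance.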

SageMaker Processing

Separate from Training — runs data preprocessing, postprocessing, and evaluation scripts.

Feature	Details
Purpose	Data transformation BEFORE training, model evaluation AFTER training
Frameworks	SKLearnProcessor, PySparkProcessor, custom containers
Timeout	Max 5 days
Not Training	Processing ≠ Training (common exam trap)

Bring Your Own Container (BYOC)

SageMaker container directory structure:

/opt/ml/
├── input/
│   ├── config/
│   │   ├── hyperparameters.json    ← your hyperparams
│   │   └── resourceConfig.json     ← instance info (distributed)
│   └── data/
│       └── <channel_name>/         ← training data ("train", "validation")
├── model/                          ← save model here → uploaded to S3 as model.tar.gz
└── output/
    └── failure                     ← write error messages here

Training container MUST:
  1. Read data from /opt/ml/input/data/
  2. Read hyperparams from /opt/ml/input/config/hyperparameters.json
  3. Write model to /opt/ml/model/
  4. Exit 0 on success, non-zero on failure

Inference container MUST:
  1. Implement /ping (health check → 200)
  2. Implement /invocations (POST → predictions)
  3. Load model from /opt/ml/model/
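The training-container contract can be sketched as a small entrypoint. The paths follow the directory layout above; the "epochs" hyperparameter, the "train" channel, and the placeholder JSON "model" are hypothetical. The base directory is a parameter only so the sketch can be exercised outside a container.

```python
import json
import os
import traceback

def train(base_dir="/opt/ml"):
    """Read config and data, then write a model artifact."""
    with open(os.path.join(base_dir, "input/config/hyperparameters.json")) as f:
        hyperparams = json.load(f)
    # Hyperparameter values arrive as strings, so cast explicitly.
    epochs = int(hyperparams.get("epochs", "1"))  # "epochs" is a hypothetical name

    train_dir = os.path.join(base_dir, "input/data/train")
    files = sorted(os.listdir(train_dir))

    # Placeholder: real code would fit a model on the files here.
    model = {"epochs": epochs, "trained_on": files}
    model_dir = os.path.join(base_dir, "model")
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.json"), "w") as f:
        json.dump(model, f)
    return model

def main(base_dir="/opt/ml"):
    """Return 0 on success; on error write the failure file and return 1."""
    try:
        train(base_dir)
        return 0
    except Exception:
        out_dir = os.path.join(base_dir, "output")
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, "failure"), "w") as f:
            f.write(traceback.format_exc())
        return 1

# In the container entrypoint, the process would end with: sys.exit(main())
```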

Script Mode vs BYOC vs Built-in

Custom code needed?
├── Standard framework (TF, PyTorch, sklearn)?
│   └── Script Mode (provide train.py, SM provides container)
├── Non-standard library or custom serving?
│   └── BYOC (build Docker → push to ECR)
└── No custom code?
    └── Built-in algorithm

Spot Training — Full Configuration

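Settings:
  1. use_spot_instances = True
  2. max_wait ≥ max_run (max_wait covers waiting time plus training time)
  3. checkpoint_s3_uri = S3 path (optional but strongly recommended; without it an interrupted job restarts from scratch)
  4. Algorithm or training script must save and reload checkpoints for resume to work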

If interrupted:
  → Current checkpoint saved to S3
  → SM waits for new Spot capacity
  → Resumes from last checkpoint
  → You only pay for actual training time (not waiting)

If max_wait expires before completion:
  → Job is stopped with secondary status "MaxWaitTimeExceeded"
  → Increase max_wait, or switch to On-Demand
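A quick pre-flight check of these settings might look like the sketch below. The dictionary keys mirror the SageMaker Python SDK estimator arguments named above; the validator itself, the bucket name, and the time values are hypothetical.

```python
def validate_spot_config(cfg):
    """Return a list of problems with a Spot training configuration."""
    problems = []
    if not cfg.get("use_spot_instances"):
        problems.append("use_spot_instances must be True")
    if cfg.get("max_wait", 0) < cfg.get("max_run", 0):
        problems.append("max_wait must be at least max_run")
    if not cfg.get("checkpoint_s3_uri"):
        problems.append("no checkpoint_s3_uri: an interruption loses all progress")
    return problems

spot_config = {
    "use_spot_instances": True,
    "max_run": 3600,    # seconds of training allowed
    "max_wait": 7200,   # seconds of training plus waiting for Spot capacity
    "checkpoint_s3_uri": "s3://example-bucket/checkpoints/",  # hypothetical bucket
}

print(validate_spot_config(spot_config))                          # []
print(validate_spot_config({**spot_config, "max_wait": 1800}))    # one problem
```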

HPO Warm Start

Resume from previous tuning job’s learned knowledge:

Type	When
IDENTICAL_DATA_AND_ALGORITHM	Same problem — transfer all knowledge
TRANSFER_LEARNING	Similar problem — partial knowledge transfer

Transfer Learning vs Incremental Learning

	Transfer Learning	Incremental Learning
Task	New/different task	Same task
Data	Small target dataset	New batch for existing task
Source	Pre-trained model from different domain	Your own previous model
Example	ImageNet → medical X-ray classifier	Fraud model Jan → update with Feb data
Risk	Negative transfer	Catastrophic forgetting

Is the TASK changing?
├── YES → Transfer Learning (fine-tune pre-trained model)
└── NO, same task but new data?
    ├── Small batch, keep existing → Incremental Learning
    │   (pass model.tar.gz as input channel)
    └── Distribution shifted significantly → Full Retrain

Cross-Validation

Technique	When to Use
K-Fold (K=5 or 10)	Standard — most problems
Stratified K-Fold	Imbalanced classification (preserves class ratios)
Time-Series Split	Temporal data (train on past, test on future)
LOOCV	Very small datasets (<1000 samples)

Exam key: NEVER use random K-Fold on time series (trains on future, tests on past = leakage). Use walk-forward validation.
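Walk-forward validation can be sketched as an index generator in which every training window ends before its test window begins. The function name and parameters are hypothetical; libraries such as scikit-learn offer an equivalent splitter.

```python
def walk_forward_splits(n_samples, n_splits, test_size):
    """Expanding-window splits: train on the past, test on the next block."""
    splits = []
    for i in range(n_splits):
        test_end = n_samples - (n_splits - 1 - i) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_end))
        splits.append((train_idx, test_idx))
    return splits

splits = walk_forward_splits(n_samples=10, n_splits=3, test_size=2)
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
# [0, 1, 2, 3] [4, 5]
# [0, 1, 2, 3, 4, 5] [6, 7]
# [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```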

Loss Functions

Function	Task	Notes
Binary Cross-Entropy	Binary classification	Log loss between probability and label
Categorical Cross-Entropy	Multi-class classification	Log loss across classes
MSE	Regression	Penalizes large errors
MAE	Regression	Robust to outliers
Huber Loss	Regression	MSE for small errors, MAE for large (best of both)
Focal Loss	Imbalanced classification	Down-weights easy examples, focuses on hard ones
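Two of these losses are worth seeing in code. Binary cross-entropy is the average negative log-likelihood of the true labels, and Huber loss switches from quadratic to linear once the error exceeds delta. The clipping constant eps and the default delta are conventional choices, not values from the table.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def huber(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    if abs(error) <= delta:
        return 0.5 * error * error
    return delta * (abs(error) - 0.5 * delta)

print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # confident and correct: small loss
print(binary_cross_entropy([1, 0], [0.1, 0.9]))  # confident and wrong: large loss
print(huber(0.5), huber(3.0))                    # 0.125 2.5
```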

F-Beta Scores

Score	Emphasis	Use When
F0.5	Weights Precision higher	FP is costly (spam filter)
F1	Equal weight	Balance precision and recall
F2	Weights Recall higher	FN is costly (cancer detection)
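All three scores come from one formula, F_β = (1 + β²) · P · R / (β² · P + R). A short sketch with hypothetical precision and recall values:

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.9, 0.6  # hypothetical model: precise but misses positives
f_half = f_beta(precision, recall, 0.5)
f_one = f_beta(precision, recall, 1.0)
f_two = f_beta(precision, recall, 2.0)
print(round(f_half, 3), round(f_one, 3), round(f_two, 3))  # 0.818 0.72 0.643
```

Because this model's precision exceeds its recall, the precision-weighted F0.5 is the highest of the three and the recall-weighted F2 the lowest.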

Key Formulas

Precision         = TP / (TP + FP)
Recall            = TP / (TP + FN)
F1                = 2 × Precision × Recall / (Precision + Recall)
Specificity       = TN / (TN + FP)
FPR               = 1 - Specificity
RMSE              = √(mean((y - ŷ)²))
R²                = 1 - (SS_res / SS_tot)
Disparate Impact  = rate_group_A / rate_group_B  (flag if < 0.8)
TF-IDF(t,d)       = TF(t,d) × log(N / DF(t))
w_new             = w_old - lr × gradient(Loss)   (gradient descent update)
