Domain 3G: Model Evaluation & Validation

Exam Domain: 3 (ML Model Development)
Task: Select the right metrics, detect evaluation pitfalls, and use SageMaker tools to validate and monitor models

Why Evaluation Matters (First Principles)

ELI5: Measuring the wrong thing is worse than not measuring at all. A weather app that always says “sunny” has 90% accuracy in Arizona — but it’s completely useless. The metric must match the business problem.

A model is only as good as its evaluation. Two failure modes:

Wrong metric — optimizing something that doesn’t capture what you care about
Contaminated evaluation — test data leaked into training → metrics are lies

The purpose of evaluation is to estimate future performance on unseen data, not to describe training performance.

Why this matters for the exam: Many exam questions present a scenario with a metric problem — imbalanced data, business cost asymmetry, or a leakage situation. Recognizing the correct metric for each context is heavily tested.

Classification Metrics

Confusion Matrix

For binary classification, every prediction falls into one of four cells:

                    PREDICTED
                  Positive  Negative
                ┌─────────┬─────────┐
ACTUAL Positive │   TP    │   FN    │ ← Actual positives
                │(correct)│(missed) │
                ├─────────┼─────────┤
       Negative │   FP    │   TN    │ ← Actual negatives
                │(false   │(correct)│
                │ alarm)  │         │
                └─────────┴─────────┘

Medical screening example:

Cell	Meaning	Consequence
TP (True Positive)	Sick person correctly diagnosed	Correct — patient gets treatment
FP (False Positive) — Type I error	Healthy person wrongly flagged	Unnecessary treatment, cost, anxiety
FN (False Negative) — Type II error	Sick person missed	Untreated illness — potentially fatal
TN (True Negative)	Healthy person correctly cleared	Correct — patient goes home

Exam tip: Type I error = FP (cry wolf). Type II error = FN (miss the wolf). In high-stakes domains (cancer, fraud), FN is usually the costlier error.

Accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

When it works: Balanced class distribution.

When it fails: Imbalanced data. Classic trap: 99% of transactions are legitimate. A model that always predicts “not fraud” achieves 99% accuracy but catches zero fraud cases.

Exam tip: If you see “imbalanced dataset” + “accuracy,” the answer is almost certainly “use a different metric” — F1, AUC-ROC, or precision/recall.
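The accuracy trap above can be reproduced in a few lines. This is a minimal sketch on made-up data (1,000 transactions, 10 fraudulent); the helper names `confusion_counts`, `accuracy`, and `recall` are illustrative, not from any library.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp, fn):
    # Reported as 0.0 when there are no actual positives, to avoid dividing by zero.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up data: 1,000 transactions, 10 of them fraudulent (label 1).
y_true = [1] * 10 + [0] * 990
y_always_negative = [0] * 1000  # a "model" that never flags fraud

tp, fp, fn, tn = confusion_counts(y_true, y_always_negative)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy(tp, fp, fn, tn):.2f}  recall={recall(tp, fn):.2f}")
```

Accuracy comes out at 0.99 while recall is 0.00, which is why accuracy alone says little on imbalanced data.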

Precision

$$\text{Precision} = \frac{TP}{TP + FP}$$

“Of everything I predicted positive, how many were actually positive?”

ELI5: Precision is like a picky chef — they only serve dishes they’re confident about. Few dishes, but every dish is good. You might miss some great dishes, but you never serve a bad one.

Prioritize precision when: The cost of false positives is high.

Spam filter: you don’t want to delete legitimate emails
Search results: irrelevant results frustrate users
Drug recommendation: don’t suggest a harmful drug based on weak signal

Recall (Sensitivity, True Positive Rate)

$$\text{Recall} = \frac{TP}{TP + FN}$$

“Of everything that IS positive, how many did I catch?”

ELI5: Recall is like a worried parent — they take their child to the doctor for every cough. They might overcall, causing unnecessary visits (FP), but they never miss a real illness (FN).

Prioritize recall when: The cost of false negatives is high.

Cancer screening: missing a real cancer is far worse than extra biopsies
Fraud detection: missing a fraud case causes direct financial loss
Security intrusion detection: missing an attack is catastrophic

Precision vs Recall Tradeoff

As you raise the classification threshold (require more confidence to predict positive):

Fewer predictions are positive → fewer FP → precision increases
More positive cases are missed → more FN → recall decreases

Threshold = 0.3 (lenient):
   Precision = 0.60, Recall = 0.95  ← catches almost everything, many false alarms

Threshold = 0.5 (default):
   Precision = 0.75, Recall = 0.80  ← balanced

Threshold = 0.9 (strict):
   Precision = 0.95, Recall = 0.40  ← very confident predictions only, misses many real positives

Precision
1.0 │╲
    │  ╲
0.8 │    ╲___
    │        ╲___
0.6 │            ╲___
    │                ╲___
0.4 │                    ╲___
    └────────────────────────── Recall
    0.4  0.6  0.8  1.0
         ↑ Inverse relationship

Business decision table:

Scenario	Prioritize	Rationale
Email spam filter	Precision	False positives delete legitimate mail
Cancer screening	Recall	False negatives miss treatable cancer
Content moderation (ban accounts)	Precision	Wrongly banning innocent users is bad
Financial fraud detection	Recall	Letting fraud through costs money
Search engine results	Precision	Irrelevant results reduce trust
Disease contact tracing	Recall	Missing a contact spreads the disease
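The threshold tradeoff can be checked directly by sweeping a cutoff over model scores. The sketch below uses ten invented (label, score) pairs, so the numbers differ from the illustrative figures quoted earlier; `precision_recall_at` is a hypothetical helper.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when score >= threshold is predicted positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Convention (an assumption): precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented data: five positives, five negatives, with model scores.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.55, 0.35, 0.60, 0.45, 0.40, 0.20, 0.10]

results = {t: precision_recall_at(y_true, scores, t) for t in (0.3, 0.5, 0.9)}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold:.1f}  precision={precision:.3f}  recall={recall:.3f}")
```

On this toy data, raising the threshold from 0.3 to 0.9 moves precision from 0.625 to 1.0 and recall from 1.0 to 0.2.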

F1 Score

The harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

ELI5: F1 is the compromise between the picky chef (precision) and the worried parent (recall). It rewards balance — a model with P=0.9, R=0.1 gets a terrible F1, not a good one.

Why harmonic mean (not arithmetic)?

Arithmetic mean of P=1.0 and R=0.0 would be 0.5 — misleadingly high for a useless model. Harmonic mean: $\frac{2 \cdot 1.0 \cdot 0.0}{1.0 + 0.0} = 0$ — correctly reflects the failure.

F-beta score — when you want to weight precision vs recall differently:

$$F_\beta = (1+\beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

$\beta$ value	Meaning
$\beta < 1$ (e.g., 0.5)	Favors precision ($\beta^2$ shrinks recall’s weight)
$\beta = 1$	Equal weight (standard F1)
$\beta > 1$ (e.g., 2)	Favors recall (in the weighted harmonic mean, recall’s weight is $\beta^2$ times precision’s)

Exam tip: “We care more about missing real positives than false alarms” → use $F_2$ (beta=2, favors recall).
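A short sketch of the F-beta formula above, useful for checking the examples in this section by hand. The function name `f_beta` and the sample precision/recall values are illustrative.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta from precision and recall; returns 0.0 when the denominator is zero."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# The lopsided model from the F1 discussion: high precision, tiny recall.
lopsided_f1 = f_beta(0.9, 0.1)

# A useless model with P=1.0, R=0.0 scores 0, unlike the arithmetic mean (0.5).
useless_f1 = f_beta(1.0, 0.0)

# A recall-heavy model (P=0.5, R=0.9) scored with three different betas.
f_half = f_beta(0.5, 0.9, beta=0.5)
f_one = f_beta(0.5, 0.9, beta=1.0)
f_two = f_beta(0.5, 0.9, beta=2.0)

print(f"F1(P=0.9, R=0.1) = {lopsided_f1:.2f}")
print(f"F1(P=1.0, R=0.0) = {useless_f1:.2f}")
print(f"P=0.5, R=0.9: F0.5={f_half:.3f}  F1={f_one:.3f}  F2={f_two:.3f}")
```

Because recall is the stronger of the two numbers in the last example, the score rises as beta grows: F0.5 < F1 < F2.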

Specificity (True Negative Rate)

$$\text{Specificity} = \frac{TN}{TN + FP}$$

“Of everything that IS negative, how many did I correctly identify?”

Complement: False Positive Rate (FPR) = $1 - \text{Specificity} = \frac{FP}{TN + FP}$
Used in ROC curve (x-axis = FPR, y-axis = TPR/Recall)

ROC Curve & AUC

ROC (Receiver Operating Characteristic): Plot TPR vs FPR at every possible threshold.

TPR (Recall)
1.0 │          ╱─────── Perfect classifier (AUC=1.0)
    │        ╱─╱
0.8 │      ╱  ╱ ← Good classifier
    │    ╱  ╱
0.6 │  ╱  ╱
    │ ╱  ╱ ─ ─ ─ Random (AUC=0.5, diagonal)
0.4 │╱  ╱
    │  ╱
0.2 │ ╱
    │╱
    └────────────────── FPR (1 - Specificity)
    0.0  0.2  0.4  0.6  0.8  1.0

AUC (Area Under the ROC Curve):

AUC Value	Interpretation
1.0	Perfect classifier
0.9–1.0	Excellent
0.8–0.9	Good
0.7–0.8	Fair
0.6–0.7	Poor
0.5	Random (no better than coin flip)
< 0.5	Worse than random — invert the predictions!

ELI5: AUC answers: “If I pick a random positive example and a random negative example, what’s the probability my model scores the positive one higher?” AUC = 0.9 means 90% probability — the model almost always ranks positives above negatives.

When to use AUC:

Comparing models independent of threshold choice
Moderately imbalanced datasets (more robust than accuracy; for very rare positives, prefer the PR curve)
When you don’t know the operating threshold yet
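The ranking interpretation of AUC can be computed literally by comparing every positive with every negative. This brute-force sketch is quadratic in the number of examples and meant only to illustrate the definition; `auc_by_pairs` is a hypothetical name and the data is invented.

```python
def auc_by_pairs(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, matching the usual AUC convention.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.55, 0.35, 0.60, 0.45, 0.40, 0.20, 0.10]

auc = auc_by_pairs(y_true, scores)
flipped = auc_by_pairs(y_true, [-s for s in scores])  # inverted ranking
print(f"AUC = {auc:.2f}, AUC with inverted scores = {flipped:.2f}")
```

Inverting the scores turns an AUC of 0.84 into 0.16, which is the reasoning behind "invert the predictions" for models scoring below 0.5.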

Precision-Recall Curve & Average Precision

For highly imbalanced data (rare positives), the ROC curve can look deceptively good because TN dominates FPR. The PR curve exposes performance on the minority class:

Precision
1.0 │╲ ← Perfect
    │  ╲___
0.8 │       ╲___  ← Good model
    │            ╲___
0.6 │     ─ ─ ─ ─ ─ ─ ╲── Baseline (random)
    │                        ╲
0.4 │
    └──────────────────────── Recall
    0.0  0.2  0.4  0.6  0.8  1.0

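Average Precision (AP): Area under the PR curve. Higher is better. A random classifier scores roughly the positive-class prevalence, so that (not 0.5) is the baseline to beat.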

Use PR-AUC instead of ROC-AUC when: Positive class is rare (fraud detection, disease screening, anomaly detection).

Exam tip: “1% fraud rate, need to evaluate model” → use Precision-Recall curve / AP, not ROC-AUC.
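To see how ROC-AUC and average precision can disagree on rare positives, the sketch below scores one invented ranking both ways. It assumes no tied scores and uses the step-wise definition of AP (mean of the precision values at each positive hit); `average_precision` and `auc_by_pairs` are illustrative helpers, not library functions.

```python
def average_precision(y_true, scores):
    """Mean of precision@rank taken at the rank of each positive (no tied scores assumed)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def auc_by_pairs(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly (no tied scores assumed)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    return sum(1 for p in pos for n in neg if p > n) / (len(pos) * len(neg))


# Invented ranking: 20 items, 2 positives sitting at ranks 2 and 5.
scores = [1.0 - 0.05 * i for i in range(20)]
y_true = [0] * 20
y_true[1] = 1
y_true[4] = 1

ap = average_precision(y_true, scores)
auc = auc_by_pairs(y_true, scores)
print(f"ROC-AUC = {auc:.3f}, AP = {ap:.3f}, prevalence = {sum(y_true) / len(y_true):.2f}")
```

Here ROC-AUC is about 0.89 while AP is 0.45: the many easy negatives flatter the ROC view, whereas AP reflects that most of the top-ranked items are false alarms.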

Log Loss (Cross-Entropy Loss)

$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i) \right]$$

Evaluates the quality of probability calibration, not just class labels
Heavily penalizes confident wrong predictions: predicting $\hat{p} = 0.99$ for a negative example → large loss
Lower is better; it reaches 0 only when every prediction is both correct and fully confident
Used as the training objective for logistic regression and most neural networks
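A small sketch of the log loss formula shows the penalty for confident mistakes. Both invented models below get three of four examples right at a 0.5 threshold, but one is badly overconfident on its miss. The clipping constant `eps` is an assumption added to keep `log(0)` from occurring.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Binary cross-entropy; probabilities are clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


y_true = [1, 0, 1, 0]
hedged = [0.60, 0.40, 0.60, 0.60]         # last example wrong, but only mildly
overconfident = [0.99, 0.01, 0.99, 0.99]  # last example wrong with 99% confidence

loss_hedged = log_loss(y_true, hedged)
loss_overconfident = log_loss(y_true, overconfident)
print(f"hedged: {loss_hedged:.3f}  overconfident: {loss_overconfident:.3f}")
```

The overconfident model ends up with roughly twice the loss despite identical accuracy, because the single 0.99 prediction on a negative example contributes about 4.6 on its own.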

Multi-Class Classification

For K classes, extend binary metrics:

Strategy	How	When
Macro-average	Compute metric per class, take unweighted mean	Classes equally important, even if imbalanced
Micro-average	Pool all TP/FP/FN across classes, compute once	Each individual prediction equally important
Weighted-average	Weight each class by its support (sample count)	Score should reflect class frequency (frequent classes count more)

Exam tip: Imbalanced multi-class + care about all classes equally → macro F1. Imbalanced + care about total correct predictions → weighted F1.
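The three averaging strategies can give noticeably different numbers on the same predictions. The sketch below uses an invented three-class example where the rare class "b" is never predicted; all helper names are illustrative.

```python
def per_class_counts(y_true, y_pred, labels):
    """Map each class to its one-vs-rest (tp, fp, fn) counts."""
    counts = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def averaged_f1(y_true, y_pred, labels):
    counts = per_class_counts(y_true, y_pred, labels)
    per_class = {c: f1_from_counts(*counts[c]) for c in labels}
    support = {c: sum(1 for t in y_true if t == c) for c in labels}
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    # Micro: pool the counts across classes, then compute F1 once.
    micro = f1_from_counts(*(sum(counts[c][i] for c in labels) for i in range(3)))
    return macro, micro, weighted


# Invented data: class "a" dominates; the single "b" example is misclassified as "a".
y_true = ["a"] * 8 + ["b", "c"]
y_pred = ["a"] * 8 + ["a", "c"]

macro, micro, weighted = averaged_f1(y_true, y_pred, labels=["a", "b", "c"])
print(f"macro={macro:.3f}  micro={micro:.3f}  weighted={weighted:.3f}")
```

Macro F1 (about 0.65) is pulled down by the zero score on class "b", while micro (0.90) and weighted (about 0.85) barely register the failure.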

Regression Metrics

MSE (Mean Squared Error)

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Penalizes large errors quadratically — a $2\times$ larger error contributes $4\times$ more to the metric
Units: squared units of target (if target is dollars, MSE is dollars²)
Differentiable everywhere — preferred training objective

RMSE (Root Mean Squared Error)

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

Same units as target → interpretable: “typical error is about $X$ units” (RMSE is always at least as large as MAE)
Among the most commonly reported regression metrics
Still penalizes large errors heavily (via the underlying MSE)

MAE (Mean Absolute Error)

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

Treats all error magnitudes linearly — a $2\times$ larger error contributes exactly $2\times$ more
Robust to outliers (a single extreme error doesn’t dominate)
Less smooth at zero (not differentiable) — some optimizers handle this less well

RMSE vs MAE

ELI5: RMSE is a strict teacher who penalizes big mistakes harshly: one catastrophic error tanks your grade. MAE is an even-handed teacher who docks points in proportion to each mistake’s size, so you’re graded on your typical everyday performance.

Situation	Use
Large errors are especially costly (financial forecasting, safety-critical systems)	RMSE
Errors should count in proportion to their size; outliers are present but are noise rather than signal	MAE
Want interpretable units	Both (same units as target)
Comparing across datasets with different scales	MAPE or R²
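The difference between the two metrics shows up when the same total error is spread evenly versus concentrated in one prediction. A minimal sketch with invented forecasts:

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


y_true = [100, 100, 100, 100, 100]
steady = [98, 102, 98, 102, 98]         # off by 2 on every prediction
one_big_miss = [100, 100, 100, 100, 90]  # perfect four times, off by 10 once

for name, y_pred in (("steady", steady), ("one_big_miss", one_big_miss)):
    print(f"{name:>12}: MAE={mae(y_true, y_pred):.2f}  RMSE={rmse(y_true, y_pred):.2f}")
```

Both forecasts have an MAE of 2.00, but RMSE rises from 2.00 to about 4.47 for the forecast with the single large miss.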

R² (Coefficient of Determination)

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

ELI5: R² answers: “How much better is my model than just guessing the average every time?” R²=0.80 means my model explains 80% of the variance — the average-guesser explains 0%.

R² Value	Interpretation
1.0	Perfect predictions
0.8–1.0	Strong model
0.5–0.8	Moderate
0.0	Same as predicting the mean — no value
< 0.0	Worse than predicting the mean (terrible model or wrong setup)

Adjusted R²: Penalizes adding features that don’t help:

$$\bar{R}^2 = 1 - (1-R^2)\frac{n-1}{n-p-1}$$

where $n$ = samples, $p$ = number of features. Use adjusted R² when comparing models with different feature counts.
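Both formulas are short enough to check by hand. The sketch below computes R² for an invented set of five predictions and then applies the adjustment for different feature counts; the function names are illustrative.

```python
def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adjusted R² for n samples and p features (requires n > p + 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.5, 5.5, 7.0, 9.5, 10.5]

r2 = r_squared(y_true, y_pred)
r2_mean_guess = r_squared(y_true, [7.0] * 5)  # always predicting the mean
print(f"R2 = {r2:.3f}, R2 of mean-guesser = {r2_mean_guess:.3f}")
print(f"adjusted, p=1: {adjusted_r_squared(r2, n=5, p=1):.3f}")
print(f"adjusted, p=3: {adjusted_r_squared(r2, n=5, p=3):.3f}")
```

With the same raw R² of 0.975, the adjusted value drops from about 0.967 to 0.900 as the feature count goes from 1 to 3, reflecting the penalty for extra features on a small sample.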

MAPE (Mean Absolute Percentage Error)

$$MAPE = \frac{1}{n}\sum_{i=1}^{n}\frac{|y_i - \hat{y}_i|}{|y_i|} \times 100\%$$

Scale-independent — “average 8% error” is interpretable regardless of target magnitude
Good for comparing models across different products/targets
Problem: Undefined (division by zero) when $y_i = 0$ and unstable when $y_i$ is near zero; also asymmetric: for non-negative forecasts an under-prediction can cost at most 100%, while an over-prediction has no upper bound
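The properties listed above (scale independence, the zero problem, and the asymmetry) can each be demonstrated in one line. This is a minimal sketch; raising `ValueError` on zero actuals is one possible design choice, not a standard.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if any(y == 0 for y in y_true):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * sum(abs(y - f) / abs(y) for y, f in zip(y_true, y_pred)) / len(y_true)


# Scale independence: the same relative error on very different magnitudes.
small_scale = mape([10, 20], [11, 22])
large_scale = mape([1000, 2000], [1100, 2200])

# Asymmetry: forecasting 0 for an actual of 100 costs 100%,
# while forecasting 300 for the same actual costs 200%.
under = mape([100], [0])
over = mape([100], [300])

print(f"small scale: {small_scale:.1f}%  large scale: {large_scale:.1f}%")
print(f"under-forecast: {under:.0f}%  over-forecast: {over:.0f}%")
```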

Regression vs Classification Metrics Summary

Metric	Type	Formula Sketch	Best Used When	Outlier Sensitivity
Accuracy	Class.	(TP+TN)/all	Balanced classes	N/A
Precision	Class.	TP/(TP+FP)	FP cost high	N/A
Recall	Class.	TP/(TP+FN)	FN cost high	N/A
F1	Class.	Harmonic mean of P and R	Imbalanced, both errors matter	N/A
AUC-ROC	Class.	Area under ROC	Threshold-independent comparison	N/A
AP (PR-AUC)	Class.	Area under PR	Rare positive class	N/A
Log Loss	Class.	Cross-entropy	Probability calibration quality	High
MSE	Reg.	Mean squared error	Large errors very costly	High
RMSE	Reg.	$\sqrt{MSE}$	Interpretable, penalize large errors	High
MAE	Reg.	Mean absolute error	Robust evaluation, outliers present	Low
R²	Reg.	Variance explained	Relative model quality	Moderate
MAPE	Reg.	% error	Cross-scale comparison	Low (but div/0 risk)

Model Selection & Comparison

Always Start with a Baseline

Never report model performance without comparing to a dumb baseline:

Problem Type	Baseline
Binary/multi-class	Predict majority class always
Regression	Predict mean of training labels always
Time series	Persistence model: $\hat{y}_{t+1} = y_t$
NLP ranking	BM25 or TF-IDF

If your model doesn’t beat the baseline, you have a problem.

Cross-Validation

For small datasets where a single train/val split is unreliable:

5-fold Cross-Validation:
┌────┬────┬────┬────┬────┐
│ V  │ T  │ T  │ T  │ T  │ Fold 1: Val=fold1, Train=rest
├────┼────┼────┼────┼────┤
│ T  │ V  │ T  │ T  │ T  │ Fold 2: Val=fold2, Train=rest
├────┼────┼────┼────┼────┤
│ T  │ T  │ V  │ T  │ T  │ Fold 3: ...
├────┼────┼────┼────┼────┤
│ T  │ T  │ T  │ V  │ T  │ Fold 4: ...
├────┼────┼────┼────┼────┤
│ T  │ T  │ T  │ T  │ V  │ Fold 5: Val=fold5, Train=rest
└────┴────┴────┴────┴────┘
Final score = mean ± std of 5 validation scores

StratifiedKFold: Preserves class balance in each fold — always use for classification
TimeSeriesSplit: Respects temporal order — use for time series
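The fold rotation in the diagram can be sketched as an index generator. This version uses contiguous, unshuffled folds and a deliberately trivial "model" (predict the training mean) so that only the cross-validation mechanics are on display; in practice you would shuffle or stratify first, for example with the StratifiedKFold splitter mentioned above. All names and data here are illustrative.

```python
from statistics import mean, stdev


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs using contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def mean_baseline_mae(y, train_idx, val_idx):
    """Toy model: predict the training-fold mean, score MAE on the validation fold."""
    prediction = mean(y[i] for i in train_idx)
    return mean(abs(y[i] - prediction) for i in val_idx)


# Invented targets.
y = [3.0, 4.5, 5.0, 6.5, 7.0, 8.5, 9.0, 10.5, 11.0, 12.5]

fold_scores = [mean_baseline_mae(y, tr, va) for tr, va in k_fold_indices(len(y), 5)]
print(f"MAE = {mean(fold_scores):.2f} +/- {stdev(fold_scores):.2f} over {len(fold_scores)} folds")
```

Reporting the spread alongside the mean matters: a wide standard deviation across folds suggests the estimate depends heavily on which samples land in validation.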

Statistical Significance

Is model A actually better than model B, or is it noise?

Paired t-test on cross-validation scores: Compare fold-by-fold scores between two models
McNemar’s test: For comparing two classifiers on the same test set (compares disagreements)
Rule of thumb: If confidence intervals overlap substantially, the difference is unlikely to be significant; confirm with a formal test before concluding either way
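McNemar's test looks only at the examples where the two classifiers disagree. The sketch below implements the exact (binomial) form of the test, which is one common variant; the chi-squared approximation is another. The data is invented and `mcnemar_exact` is a hypothetical helper name.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test.

    Returns (b, c, p_value), where b counts examples only model A gets right
    and c counts examples only model B gets right.
    """
    b = sum(1 for t, a, bb in zip(y_true, pred_a, pred_b) if a == t and bb != t)
    c = sum(1 for t, a, bb in zip(y_true, pred_a, pred_b) if a != t and bb == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree
    # Under the null hypothesis, each disagreement favors A or B with probability 0.5.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Invented test set of 20 positives: A alone is right once, B alone is right 8 times.
y_true = [1] * 20
pred_a = [1] * 1 + [0] * 8 + [1] * 11
pred_b = [0] * 1 + [1] * 8 + [1] * 11

b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(f"A-only correct: {b}, B-only correct: {c}, p = {p_value:.4f}")
```

The eleven examples both models get right do not enter the calculation at all; only the 1-versus-8 split among disagreements drives the p-value (about 0.039 here).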

A/B Testing for ML Models

ELI5: A/B testing is like a taste test — don’t just look at the recipe (offline metrics), let real customers try both dishes (production traffic) and see which they prefer in real life. The kitchen test and the restaurant test often give different results.

Why offline metrics aren’t enough:

Production data distribution may differ from evaluation data
Business KPIs (revenue, engagement, retention) may not correlate perfectly with ML metrics
User behavior is only observable in production

Traffic (100%)
      │
      ├──── 50% ──▶ Model A (current)
      │                  │
      └──── 50% ──▶ Model B (challenger)
                         │
                   Compare: CTR, conversion,
                   revenue, churn — not just accuracy

SageMaker Production Variants:

# Split traffic between two model versions
endpoint_config = {
    'ProductionVariants': [
        {'VariantName': 'ModelA', 'InitialVariantWeight': 0.9, ...},
        {'VariantName': 'ModelB', 'InitialVariantWeight': 0.1, ...},
    ]
}

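How long to run: Until you reach statistical significance (sufficient sample size). Use power analysis to size the test in advance: $n \approx \frac{2\sigma^2(z_{\alpha/2} + z_\beta)^2}{\Delta^2}$ samples per variant, where $\sigma$ is the standard deviation of the metric, $\Delta$ is the minimum detectable effect, $\alpha$ is the significance level, and $1-\beta$ is the desired power.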
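The sample-size formula can be evaluated with the standard library alone. This sketch assumes a two-sided test on a roughly normal metric and treats the result as the count needed for each variant; the defaults of 5% significance and 80% power are conventional choices, not values from this guide.

```python
from math import ceil
from statistics import NormalDist


def samples_per_variant(sigma, delta, alpha=0.05, power=0.80):
    """Approximate sample size per variant to detect a difference of `delta`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return ceil(2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)


n_base = samples_per_variant(sigma=1.0, delta=0.1)
n_half_effect = samples_per_variant(sigma=1.0, delta=0.05)
print(f"delta=0.10 -> {n_base} per variant")
print(f"delta=0.05 -> {n_half_effect} per variant")
```

Because $\Delta$ is squared in the denominator, halving the effect you want to detect roughly quadruples the traffic required.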

Exam tip: SageMaker supports weight-based traffic splitting across production variants. Start with 10% on the challenger (ModelB), monitor, then shift traffic if results are positive.

SageMaker Model Evaluation Tools

SageMaker Clarify

Two capabilities: Bias Detection + Explainability

Bias Detection — Pre-training (data bias):

Metric	What It Measures
CI (Class Imbalance)	Imbalance between facet groups in the dataset
DPL (Difference in Proportions of Labels)	Difference in positive outcome rate between groups
KL (KL Divergence)	Distribution difference of labels between groups
KS (Kolmogorov-Smirnov)	Max distance between label distributions

Bias Detection — Post-training (model bias):

Metric	What It Measures
DPPL (Difference in Positive Proportions in Predicted Labels)	Model predicts positive more for one group
DI (Disparate Impact)	Ratio of positive prediction rates between groups
AD (Accuracy Difference)	Accuracy differs between demographic groups

Explainability — SHAP Values:

ELI5: SHAP tells you exactly how much each feature pushed the prediction up or down for a specific individual example. “For this loan application, the high income pushed approval probability up by +0.25, but the short credit history pulled it down by -0.15.”

Feature contributions for one prediction:
Base rate:        0.50
+ income:        +0.25
+ employment:    +0.10
- credit_length: -0.15
- debt_ratio:    -0.08
              ────────
Final score:      0.62  → predicted "approve"

Local explanation: Why did the model predict X for THIS specific example?
Global explanation: Which features are most important overall?
Works with any model (model-agnostic via KernelSHAP, faster with TreeSHAP for trees)
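The additive breakdown shown above is a property of Shapley values in general. The sketch below computes exact Shapley values for a tiny invented scoring function by averaging each feature's marginal contribution over every feature ordering, with a single baseline input standing in for "feature absent". This is not how SageMaker Clarify computes attributions (KernelSHAP approximates this with sampling); the model, feature values, and baseline are all hypothetical.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values relative to a single baseline input.

    Features not yet "revealed" in an ordering keep their baseline value.
    Cost grows factorially with the number of features.
    """
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = x[name]
            score = predict(current)
            totals[name] += score - previous
            previous = score
    return {name: totals[name] / len(orderings) for name in names}


def predict(f):
    # Hypothetical scoring function with one interaction term.
    return (0.2 + 0.3 * f["income"] + 0.1 * f["employment"]
            - 0.2 * f["debt_ratio"] + 0.1 * f["income"] * f["employment"])


baseline = {"income": 0.0, "employment": 0.0, "debt_ratio": 0.0}
x = {"income": 1.0, "employment": 1.0, "debt_ratio": 0.5}

values = shapley_values(predict, x, baseline)
print(f"base score: {predict(baseline):.2f}")
for name, value in values.items():
    print(f"  {name:>11}: {value:+.2f}")
print(f"final score: {predict(x):.2f}")
```

The contributions sum exactly to the gap between the final score and the base score, and the interaction term is split evenly between the two features involved.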

SageMaker Experiments

Track, compare, and organize training runs:

Log metrics, parameters, artifacts automatically
Compare runs side-by-side in Studio
Group related runs into experiments
Reproducibility: every run’s config is recorded

SageMaker Model Monitor

Detect drift after deployment — catch when production data diverges from training data:

Monitor Type	Detects
Data Quality Monitor	Feature distribution drift (input data statistics change)
Model Quality Monitor	Prediction quality degradation (when ground truth labels arrive)
Bias Drift Monitor	Fairness metrics degrading over time
Feature Attribution Drift	SHAP values for features changing — model “reasoning” drifts

Training Data Baseline
      │
      ▼
  Statistics           Production Data (hourly/daily)
  (mean, std,               │
   distributions)           ▼
      │              Compute statistics
      └──────────── Compare ──────────── Alert if drift detected
                   (KS test,                (CloudWatch, SNS)
                    PSI score)

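Population Stability Index (PSI): Commonly used for drift detection. Bin a feature, then compare the share of records in each bin $i$ between the baseline (expected) and production (actual) data:

$$PSI = \sum_i (P_{expected,i} - P_{actual,i}) \ln\frac{P_{expected,i}}{P_{actual,i}}$$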

PSI Value	Interpretation
< 0.1	No significant drift
0.1–0.2	Slight drift — monitor closely
> 0.2	Significant drift — retrain model
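A minimal sketch of the PSI calculation, taking already-binned proportions as input. The floor `eps` for empty bins is an assumption added to avoid `log(0)`; the bin shares are invented.

```python
import math


def psi(expected_props, actual_props, eps=1e-6):
    """Population Stability Index over matching bins of proportions."""
    total = 0.0
    for e, a in zip(expected_props, actual_props):
        e, a = max(e, eps), max(a, eps)  # floor empty bins so the log is defined
        total += (e - a) * math.log(e / a)
    return total


baseline_bins = [0.25, 0.25, 0.25, 0.25]    # training data: evenly spread
unchanged_bins = [0.25, 0.25, 0.25, 0.25]   # production looks the same
shifted_bins = [0.10, 0.20, 0.30, 0.40]     # production skews toward high bins

psi_unchanged = psi(baseline_bins, unchanged_bins)
psi_shifted = psi(baseline_bins, shifted_bins)
print(f"unchanged: {psi_unchanged:.3f}  shifted: {psi_shifted:.3f}")
```

The shifted distribution scores about 0.23, which falls in the "significant drift" band of the table above.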

Common Evaluation Mistakes (Exam Traps)

1. Data Leakage

Target leakage: A feature in your dataset is computed using knowledge of the label (e.g., “diagnosis_code” for predicting “has_disease” — the code IS the diagnosis).

Temporal leakage: Using information that only becomes available after the moment of prediction.

Example: Predicting loan default using the customer’s current account balance (which reflects whether they defaulted)

Preprocessing leakage: Fitting preprocessing (scaler, imputer, encoder) on the full dataset including test data, then evaluating on test.

Always fit preprocessing on train only, transform test separately

WRONG:
fit_scaler(all_data) → transform(train, test) → train → evaluate
                            ↑ test statistics leaked into scaler

CORRECT:
fit_scaler(train_only) → transform(train) → train
                       → transform(test)  → evaluate
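The WRONG and CORRECT flows differ only in which rows the scaler sees when it is fitted. A stdlib-only sketch with a hand-rolled standard scaler (the `fit_scaler` and `transform` helpers and the data are illustrative; the sketch assumes the fitted standard deviation is non-zero):

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and (population) standard deviation from the given rows only."""
    return mean(values), pstdev(values)


def transform(values, params):
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [100.0]  # an extreme value that appears only in the test split

# CORRECT: statistics come from the training split alone.
params = fit_scaler(train)
train_scaled = transform(train, params)
test_scaled = transform(test, params)

# WRONG: the test row shifts the mean and inflates the standard deviation.
leaky_params = fit_scaler(train + test)

print(f"train-only fit: mean={params[0]:.2f}  std={params[1]:.2f}")
print(f"leaky fit:      mean={leaky_params[0]:.2f}  std={leaky_params[1]:.2f}")
```

With the leaky fit, the training features themselves are scaled using knowledge of the test row, so the model is trained on inputs that could not exist at deployment time.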

2. Accuracy on Imbalanced Data

99% negative class → 99% accuracy by always predicting negative
Fix: use F1, AUC-ROC, or PR-AUC instead

3. Hyperparameter Selection Bias

If you tune hyperparameters by evaluating on the test set multiple times, the test set is now a de facto validation set. Your final reported test performance is optimistic.

Fix: use a separate validation set for all tuning decisions; only touch test set once at the very end.

4. No Stratification on Imbalanced Data

A random train/test split of 100 samples with 5% positive might put 0 positives in the test set entirely. Always use stratified splits for classification.
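A stratified split samples each class separately so both splits keep the original class ratio. This is a minimal sketch under two assumptions: every class gets at least one test sample (so a class with a single example would leave none for training), and a fixed seed is used for repeatability. `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) with each class split at the same fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# The scenario above: 100 samples, 5% positive.
labels = [1] * 5 + [0] * 95
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

test_positives = sum(labels[i] for i in test_idx)
print(f"test size: {len(test_idx)}, positives in test: {test_positives}")
```

The 20-sample test set is guaranteed to contain one positive, which a purely random split cannot promise.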

5. Random Split for Time Series

Using train_test_split(shuffle=True) on time series data leaks future information into training. Always use chronological split.
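A chronological split sorts by time and cuts once, so every training record precedes every test record. A minimal sketch, assuming each record is a `(timestamp, value)` tuple with made-up integer timestamps:

```python
def chronological_split(records, test_fraction=0.2):
    """Sort by timestamp and hold out the most recent records for testing."""
    ordered = sorted(records, key=lambda record: record[0])
    n_test = round(len(ordered) * test_fraction)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Invented daily observations, deliberately out of order.
records = [(5, 50.0), (1, 10.0), (9, 90.0), (3, 30.0), (7, 70.0),
           (2, 20.0), (10, 100.0), (4, 40.0), (8, 80.0), (6, 60.0)]

train, test = chronological_split(records, test_fraction=0.2)
print(f"train covers days {train[0][0]}..{train[-1][0]}")
print(f"test covers days  {test[0][0]}..{test[-1][0]}")
```

The model is then evaluated only on days that come after everything it was trained on, which mirrors how it will be used in production.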

Why this matters for the exam: The exam frequently presents scenarios involving these mistakes and asks you to identify the problem or the fix. Leakage and wrong metric choice are the two most common trap scenarios.

Quick Reference: Problem → Metric

Problem Type	Primary Metric	Secondary Metric	Notes
Binary classification, balanced	AUC-ROC	F1	Compare models with AUC; set threshold with F1
Binary classification, imbalanced	PR-AUC / Average Precision	F1	ROC misleads on rare positives
Binary, FN is very costly	Recall ($F_2$)	AUC-ROC	Cancer, fraud, security
Binary, FP is very costly	Precision ($F_{0.5}$)	AUC-ROC	Spam, legal actions
Multi-class, balanced	Accuracy	F1-macro	Accuracy OK when balanced
Multi-class, imbalanced	F1-macro	Per-class recall	Macro treats all classes equally
Regression, outliers penalized	RMSE	R²	Large errors cost a lot
Regression, robust to outliers	MAE	R²	Outliers in data are noise
Regression, relative performance	R²	Adjusted R²	Compare across datasets
Regression, percentage error	MAPE	RMSE	Cross-scale comparison
Ranking	NDCG	MAP	Search, recommendation
Anomaly detection	PR-AUC	Recall@K	Rare anomaly = imbalanced
Probability calibration	Log Loss	Brier Score	When probabilities matter, not just ranks
Time series forecast	RMSE, MAE, MAPE	sMAPE	Depends on scale and outlier presence

Evaluation Decision Tree

What type of output?
│
├─ CLASS LABEL or PROBABILITY
│   │
│   ├─ Binary classification?
│   │   ├─ Balanced classes?    → Accuracy, AUC-ROC
│   │   ├─ Imbalanced classes?  → PR-AUC, F1
│   │   ├─ FN costs more?       → Recall, F2
│   │   └─ FP costs more?       → Precision, F0.5
│   │
│   └─ Multi-class?
│       ├─ Balanced?            → Accuracy, F1-macro
│       └─ Imbalanced?          → F1-macro, per-class recall
│
└─ NUMERIC VALUE
    ├─ Large errors catastrophic? → RMSE
    ├─ Outliers in data (noise)?  → MAE
    ├─ Want % error?              → MAPE
    └─ Want relative quality?     → R²
