Domain 3G: Model Evaluation & Validation
Table of Contents
- Model Evaluation & Validation
- Why Evaluation Matters (First Principles)
- Classification Metrics
- Multi-Class Classification
- Regression Metrics
- Regression vs Classification Metrics Summary
- Model Selection & Comparison
- A/B Testing for ML Models
- SageMaker Model Evaluation Tools
- Common Evaluation Mistakes (Exam Traps)
- Quick Reference: Problem → Metric
- Evaluation Decision Tree
Model Evaluation & Validation
Exam Domain: 3 (ML Model Development)
Task: Select the right metrics, detect evaluation pitfalls, and use SageMaker tools to validate and monitor models
Why Evaluation Matters (First Principles)
ELI5: Measuring the wrong thing is worse than not measuring at all. A weather app that always says “sunny” has 90% accuracy in Arizona — but it’s completely useless. The metric must match the business problem.
A model is only as good as its evaluation. Two failure modes:
- Wrong metric — optimizing something that doesn’t capture what you care about
- Contaminated evaluation — test data leaked into training → metrics are lies
The purpose of evaluation is to estimate future performance on unseen data, not to describe training performance.
Why this matters for the exam: Many exam questions present a scenario with a metric problem — imbalanced data, business cost asymmetry, or a leakage situation. Recognizing the correct metric for each context is heavily tested.
Classification Metrics
Confusion Matrix
For binary classification, every prediction falls into one of four cells:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP (correct) | FN (missed) |
| Actual Negative | FP (false alarm) | TN (correct) |
Medical screening example:
| Cell | Meaning | Consequence |
|---|---|---|
| TP (True Positive) | Sick person correctly diagnosed | Correct — patient gets treatment |
| FP (False Positive) — Type I error | Healthy person wrongly flagged | Unnecessary treatment, cost, anxiety |
| FN (False Negative) — Type II error | Sick person missed | Untreated illness — potentially fatal |
| TN (True Negative) | Healthy person correctly cleared | Correct — patient goes home |
Exam tip: Type I error = FP (cry wolf). Type II error = FN (miss the wolf). In high-stakes domains (cancer, fraud), FN is usually the costlier error.
Accuracy
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
When it works: Balanced class distribution.
When it fails: Imbalanced data. Classic trap: 99% of transactions are legitimate. A model that always predicts “not fraud” achieves 99% accuracy but catches zero fraud cases.
Exam tip: If you see “imbalanced dataset” + “accuracy,” the answer is almost certainly “use a different metric” — F1, AUC-ROC, or precision/recall.
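A quick way to see the trap is to count the four confusion-matrix cells for a model that always predicts the negative class. The sketch below uses made-up data (10,000 transactions, 1% fraud) and a hypothetical helper named `confusion_counts`; it is an illustration, not a library API.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 = positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical data: 100 fraud cases among 10,000 transactions (1% positive).
y_true = [1] * 100 + [0] * 9900
# A "model" that always predicts "not fraud".
y_pred = [0] * len(y_true)

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Accuracy comes out at 0.99 while recall is 0, which is why accuracy alone says nothing useful here.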
Precision
$$\text{Precision} = \frac{TP}{TP + FP}$$
“Of everything I predicted positive, how many were actually positive?”
ELI5: Precision is like a picky chef — they only serve dishes they’re confident about. Few dishes, but every dish is good. You might miss some great dishes, but you never serve a bad one.
Prioritize precision when: The cost of false positives is high.
- Spam filter: you don’t want to delete legitimate emails
- Search results: irrelevant results frustrate users
- Drug recommendation: don’t suggest a harmful drug based on weak signal
Recall (Sensitivity, True Positive Rate)
$$\text{Recall} = \frac{TP}{TP + FN}$$
“Of everything that IS positive, how many did I catch?”
ELI5: Recall is like a worried parent — they take their child to the doctor for every cough. They might overcall, causing unnecessary visits (FP), but they never miss a real illness (FN).
Prioritize recall when: The cost of false negatives is high.
- Cancer screening: missing a real cancer is far worse than extra biopsies
- Fraud detection: missing a fraud case causes direct financial loss
- Security intrusion detection: missing an attack is catastrophic
Precision vs Recall Tradeoff
As you raise the classification threshold (require more confidence to predict positive):
- Fewer predictions are positive → fewer FP → precision increases
- More positive cases are missed → more FN → recall decreases
Threshold = 0.3 (lenient):
Precision = 0.60, Recall = 0.95 ← catches almost everything, many false alarms
Threshold = 0.5 (default):
Precision = 0.75, Recall = 0.80 ← balanced
Threshold = 0.9 (strict):
Precision = 0.95, Recall = 0.40 ← very confident predictions only, misses many real positives
Precision
1.0 │╲
│ ╲
0.8 │ ╲___
│ ╲___
0.6 │ ╲___
│ ╲___
0.4 │ ╲___
└────────────────────────── Recall
0.4 0.6 0.8 1.0
↑ Inverse relationship
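The threshold effect can be reproduced on a toy example. The labels and scores below are invented for illustration, and `precision_recall_at` is a hypothetical helper. On real data precision usually rises with the threshold but is not guaranteed to do so at every step.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    # Convention assumed here: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and model scores (5 positives, 5 negatives).
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]

for threshold in (0.3, 0.5, 0.9):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.3f}, recall={r:.3f}")
```

In this toy data, moving the threshold from 0.3 to 0.9 raises precision from 0.625 to 1.0 and drops recall from 1.0 to 0.2.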
Business decision table:
| Scenario | Prioritize | Rationale |
|---|---|---|
| Email spam filter | Precision | False positives delete legitimate mail |
| Cancer screening | Recall | False negatives miss treatable cancer |
| Content moderation (ban accounts) | Precision | Wrongly banning innocent users is bad |
| Financial fraud detection | Recall | Letting fraud through costs money |
| Search engine results | Precision | Irrelevant results reduce trust |
| Disease contact tracing | Recall | Missing a contact spreads the disease |
F1 Score
The harmonic mean of precision and recall:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
ELI5: F1 is the compromise between the picky chef (precision) and the worried parent (recall). It rewards balance: a model with P=0.9, R=0.1 gets F1 = 0.18, well below the arithmetic mean of 0.5.
Why harmonic mean (not arithmetic)?
Arithmetic mean of P=1.0 and R=0.0 would be 0.5 — misleadingly high for a useless model. Harmonic mean: $\frac{2 \cdot 1.0 \cdot 0.0}{1.0 + 0.0} = 0$ — correctly reflects the failure.
F-beta score — when you want to weight precision vs recall differently:
$$F_\beta = (1+\beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
| $\beta$ value | Meaning |
|---|---|
| $\beta < 1$ (e.g., 0.5) | Favors precision ($\beta^2$ shrinks recall’s weight) |
| $\beta = 1$ | Equal weight (standard F1) |
| $\beta > 1$ (e.g., 2) | Favors recall (recall is treated as $\beta$ times as important as precision; $\beta^2$ is the weight that appears in the formula) |
Exam tip: “We care more about missing real positives than false alarms” → use $F_2$ (beta=2, favors recall).
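The F-beta formula is short enough to check by hand. The helper `f_beta` below is a minimal sketch written for this guide (not a library function), and the precision/recall pairs are invented.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the standard F1."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # both precision and recall are zero
    return (1 + b2) * precision * recall / denom


# Lopsided model: high precision, very low recall.
print(round(f_beta(0.9, 0.1), 3))            # F1

# Two hypothetical models with mirrored precision/recall.
recall_heavy = (0.5, 0.9)      # (precision, recall)
precision_heavy = (0.9, 0.5)

for beta in (0.5, 1.0, 2.0):
    a = f_beta(*recall_heavy, beta=beta)
    b = f_beta(*precision_heavy, beta=beta)
    print(f"beta={beta}: recall-heavy={a:.3f}, precision-heavy={b:.3f}")
```

With beta=2 the recall-heavy model scores higher, with beta=0.5 the precision-heavy model does, and with beta=1 the two tie.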
Specificity (True Negative Rate)
$$\text{Specificity} = \frac{TN}{TN + FP}$$
“Of everything that IS negative, how many did I correctly identify?”
- Complement: False Positive Rate (FPR) = $1 - \text{Specificity} = \frac{FP}{TN + FP}$
- Used in ROC curve (x-axis = FPR, y-axis = TPR/Recall)
ROC Curve & AUC
ROC (Receiver Operating Characteristic): Plot TPR vs FPR at every possible threshold.
TPR (Recall)
1.0 │ ╱─────── Perfect classifier (AUC=1.0)
│ ╱─╱
0.8 │ ╱ ╱ ← Good classifier
│ ╱ ╱
0.6 │ ╱ ╱
│ ╱ ╱ ─ ─ ─ Random (AUC=0.5, diagonal)
0.4 │╱ ╱
│ ╱
0.2 │ ╱
│╱
└────────────────── FPR (1 - Specificity)
0.0 0.2 0.4 0.6 0.8 1.0
AUC (Area Under the ROC Curve):
| AUC Value | Interpretation |
|---|---|
| 1.0 | Perfect classifier |
| 0.9–1.0 | Excellent |
| 0.8–0.9 | Good |
| 0.7–0.8 | Fair |
| 0.6–0.7 | Poor |
| 0.5 | Random (no better than coin flip) |
| < 0.5 | Worse than random — invert the predictions! |
ELI5: AUC answers: “If I pick a random positive example and a random negative example, what’s the probability my model scores the positive one higher?” AUC = 0.9 means 90% probability — the model almost always ranks positives above negatives.
When to use AUC:
- Comparing models independent of threshold choice
- Imbalanced datasets (more robust than accuracy)
- When you don’t know the operating threshold yet
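The ranking interpretation of AUC can be computed directly by comparing every positive/negative pair. This is a brute-force sketch with invented scores and a hypothetical function name; it is quadratic in the number of examples, so it is suited to illustration rather than large datasets.

```python
def auc_by_pairs(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, which matches the usual AUC definition.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores: one negative (0.7) outranks one positive (0.6).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = auc_by_pairs(y_true, scores)
print(f"AUC = {auc:.3f}")  # 8 of 9 pairs are ranked correctly
```

Negating the scores reverses every pairwise comparison, so the result becomes `1 - auc`; that is the sense in which an AUC below 0.5 can be rescued by inverting the predictions.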
Precision-Recall Curve & Average Precision
For highly imbalanced data (rare positives), the ROC curve can look deceptively good: true negatives are so numerous that FPR stays small even when false positives outnumber true positives. The PR curve ignores TN and so exposes performance on the minority class:
Precision
1.0 │╲ ← Perfect
│ ╲___
0.8 │ ╲___ ← Good model
│ ╲___
0.6 │ ─ ─ ─ ─ ─ ─ ╲── Baseline (random)
│ ╲
0.4 │
└──────────────────────── Recall
0.0 0.2 0.4 0.6 0.8 1.0
Average Precision (AP): Area under the PR curve. Higher is better.
Use PR-AUC instead of ROC-AUC when: Positive class is rare (fraud detection, disease screening, anomaly detection).
Exam tip: “1% fraud rate, need to evaluate model” → use Precision-Recall curve / AP, not ROC-AUC.
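To make the ROC-versus-PR gap concrete, the sketch below builds an invented ranking with 5 positives among 500 examples (1% positive) and computes both metrics. The function names are hypothetical, ties in scores are assumed absent, and average precision is computed as the mean of the precision values at each positive's rank.

```python
def average_precision(y_true, scores):
    """Mean of precision@rank taken at the rank of each positive example."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def roc_auc_by_pairs(y_true, scores):
    """Fraction of positive/negative pairs ranked correctly (no ties assumed)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1 for p in pos for n in neg if p > n)
    return wins / (len(pos) * len(neg))


# Hypothetical ranking: positives sit at ranks 3, 6, 9, 12 and 15 out of 500.
ranks = range(1, 501)
y_true = [1 if (r % 3 == 0 and r <= 15) else 0 for r in ranks]
scores = [1.0 - r / 1000 for r in ranks]  # strictly decreasing with rank

print(f"ROC-AUC = {roc_auc_by_pairs(y_true, scores):.3f}")
print(f"AP      = {average_precision(y_true, scores):.3f}")
```

ROC-AUC is about 0.988 because nearly all negatives rank below the positives, yet AP is only about 0.333: two out of every three alerts at the top of the list are false alarms.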
Log Loss (Cross-Entropy Loss)
$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i) \right]$$
- Evaluates the quality of probability calibration, not just class labels
- Heavily penalizes confident wrong predictions: predicting $\hat{p} = 0.99$ for a negative example → large loss
- Lower is better; the loss reaches 0 only when every example's true class is predicted with probability 1
- Used as the training objective for logistic regression and most neural networks
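A small sketch of the formula shows how sharply confident mistakes are punished. The function name `log_loss` and the clipping constant `eps` are choices made for this example (clipping avoids `log(0)`); the probabilities are invented.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Binary cross-entropy averaged over examples; probs are P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# One negative example (y = 0) scored with increasing, misplaced confidence.
for p_hat in (0.10, 0.60, 0.99):
    print(f"p_hat={p_hat:.2f} -> loss={log_loss([0], [p_hat]):.3f}")
```

The loss for the negative example grows from about 0.105 at 0.10 to about 4.605 at 0.99, even though 0.60 and 0.99 both produce the same wrong label at a 0.5 threshold.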
Multi-Class Classification
For K classes, extend binary metrics:
| Strategy | How | When |
|---|---|---|
| Macro-average | Compute metric per class, take unweighted mean | Classes equally important, even if imbalanced |
| Micro-average | Pool all TP/FP/FN across classes, compute once | Each individual prediction equally important |
| Weighted-average | Weight each class by its support (sample count) | The score should reflect class frequency, so frequent classes count more |
Exam tip: Imbalanced multi-class + care about all classes equally → macro F1. Imbalanced + care about total correct predictions → weighted F1.
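The three averaging strategies can give quite different numbers on the same predictions. The per-class counts below are invented (class A dominates, class C is rare and poorly predicted), and the helper `f1` is a sketch rather than a library call.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; equals 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-class counts as (TP, FP, FN). Support = TP + FN.
counts = {"A": (90, 6, 10), "B": (8, 6, 2), "C": (1, 4, 4)}

per_class = {c: f1(*v) for c, v in counts.items()}
support = {c: tp + fn for c, (tp, fp, fn) in counts.items()}
n = sum(support.values())

# Macro: unweighted mean of per-class F1.
macro = sum(per_class.values()) / len(per_class)
# Micro: pool the counts across classes, then compute F1 once.
micro = f1(*(sum(v[i] for v in counts.values()) for i in range(3)))
# Weighted: per-class F1 weighted by support.
weighted = sum(per_class[c] * support[c] for c in counts) / n

print(f"macro={macro:.3f}, micro={micro:.3f}, weighted={weighted:.3f}")
```

Macro F1 (about 0.595) is pulled down by the rare class C, while micro and weighted F1 (both about 0.86) are dominated by class A.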
Regression Metrics
MSE (Mean Squared Error)
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
- Penalizes large errors quadratically — a $2\times$ larger error contributes $4\times$ more to the metric
- Units: squared units of target (if target is dollars, MSE is dollars²)
- Differentiable everywhere — preferred training objective
RMSE (Root Mean Squared Error)
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
- Same units as target → interpretable: “typical error is about $X$ units” (RMSE is always at least as large as MAE, so it is not a plain average of the errors)
- Most commonly reported regression metric
- Still penalizes large errors heavily (via the underlying MSE)
MAE (Mean Absolute Error)
$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$
- Treats all error magnitudes linearly — a $2\times$ larger error contributes exactly $2\times$ more
- Robust to outliers (a single extreme error doesn’t dominate)
- Less smooth at zero (not differentiable) — some optimizers handle this less well
RMSE vs MAE
ELI5: RMSE is a strict teacher who penalizes big mistakes harshly: one catastrophic error tanks your grade. MAE is a fair teacher who penalizes every mistake in proportion to its size, so you’re graded on your average everyday performance.
| Situation | Use |
|---|---|
| Large errors are especially costly (financial forecasting, safety-critical systems) | RMSE |
| Errors should count in proportion to their size; the data contains outliers that are noise rather than signal | MAE |
| Want interpretable units | Either (both are in the target’s units) |
| Comparing across datasets with different scales | MAPE or R² |
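The difference between the two metrics shows up when the same total absolute error is spread evenly versus concentrated in one prediction. The data and function names below are invented for illustration.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


y_true = [10.0, 10.0, 10.0, 10.0]
spread_out = [11.0, 9.0, 11.0, 9.0]     # four errors of size 1
one_big_miss = [10.0, 10.0, 10.0, 14.0]  # one error of size 4

for name, pred in (("spread_out", spread_out), ("one_big_miss", one_big_miss)):
    print(f"{name}: MAE={mae(y_true, pred):.1f}, RMSE={rmse(y_true, pred):.1f}")
```

Both prediction sets have MAE = 1.0, but RMSE doubles from 1.0 to 2.0 when the error is concentrated in a single large miss.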
R² (Coefficient of Determination)
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$
ELI5: R² answers: “How much better is my model than just guessing the average every time?” R²=0.80 means my model explains 80% of the variance — the average-guesser explains 0%.
| R² Value | Interpretation |
|---|---|
| 1.0 | Perfect predictions |
| 0.8–1.0 | Strong model |
| 0.5–0.8 | Moderate |
| 0.0 | Same as predicting the mean — no value |
| < 0.0 | Worse than predicting the mean (terrible model or wrong setup) |
Adjusted R²: Penalizes adding features that don’t help:
$$\bar{R}^2 = 1 - (1-R^2)\frac{n-1}{n-p-1}$$
where $n$ = samples, $p$ = number of features. Use adjusted R² when comparing models with different feature counts.
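Both formulas translate directly into code. The values below are invented and the function names are hypothetical; the point is to show that predicting the mean gives R² = 0 and that adjusted R² drops as the feature count rises for the same fit.

```python
def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adjusted R² for n samples and p features (requires n > p + 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
mean_only = [3.0] * 5  # baseline: always predict the mean

r2 = r_squared(y_true, y_pred)
print(f"R2 (model)      = {r2:.3f}")
print(f"R2 (mean only)  = {r_squared(y_true, mean_only):.3f}")
print(f"adjusted, p=1   = {adjusted_r_squared(r2, n=5, p=1):.3f}")
print(f"adjusted, p=3   = {adjusted_r_squared(r2, n=5, p=3):.3f}")
```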
MAPE (Mean Absolute Percentage Error)
$$MAPE = \frac{1}{n}\sum_{i=1}^{n}\frac{|y_i - \hat{y}_i|}{|y_i|} \times 100\%$$
- Scale-independent — “average 8% error” is interpretable regardless of target magnitude
- Good for comparing models across different products/targets
- Problem: Undefined (division by zero) when $y_i = 0$ and unstable when $y_i$ is near zero; also asymmetric: for non-negative targets and forecasts, an under-prediction can cost at most 100% while an over-prediction is unbounded
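Both caveats are easy to demonstrate. The sketch below uses invented values; raising an error on zero actuals is one possible design choice for this example, not the only way to handle them.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if any(y == 0 for y in y_true):
        raise ValueError("MAPE is undefined when an actual value is 0")
    return 100 * sum(abs(y - p) / abs(y) for y, p in zip(y_true, y_pred)) / len(y_true)


# Asymmetry: with an actual of 100, the worst non-negative under-forecast (0)
# costs 100%, while an over-forecast of 300 costs 200%.
print(mape([100.0], [0.0]))
print(mape([100.0], [300.0]))

# Division-by-zero case.
try:
    mape([0.0, 50.0], [1.0, 55.0])
except ValueError as err:
    print(f"error: {err}")
```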
Regression vs Classification Metrics Summary
| Metric | Type | Formula Sketch | Best Used When | Outlier Sensitivity |
|---|---|---|---|---|
| Accuracy | Class. | (TP+TN)/all | Balanced classes | N/A |
| Precision | Class. | TP/(TP+FP) | FP cost high | N/A |
| Recall | Class. | TP/(TP+FN) | FN cost high | N/A |
| F1 | Class. | Harmonic mean of P and R | Imbalanced, both errors matter | N/A |
| AUC-ROC | Class. | Area under ROC | Threshold-independent comparison | N/A |
| AP (PR-AUC) | Class. | Area under PR | Rare positive class | N/A |
| Log Loss | Class. | Cross-entropy | Probability calibration quality | High |
| MSE | Reg. | Mean squared error | Large errors very costly | High |
| RMSE | Reg. | $\sqrt{MSE}$ | Interpretable, penalize large errors | High |
| MAE | Reg. | Mean absolute error | Robust evaluation, outliers present | Low |
| R² | Reg. | Variance explained | Relative model quality | Moderate |
| MAPE | Reg. | % error | Cross-scale comparison | Low (but div/0 risk) |
Model Selection & Comparison
Always Start with a Baseline
Never report model performance without comparing to a dumb baseline:
| Problem Type | Baseline |
|---|---|
| Binary/multi-class | Predict majority class always |
| Regression | Predict mean of training labels always |
| Time series | Persistence model: $\hat{y}_{t+1} = y_t$ |
| NLP ranking | BM25 or TF-IDF |
If your model doesn’t beat the baseline, you have a problem.
Cross-Validation
For small datasets where a single train/val split is unreliable:
5-fold Cross-Validation:
┌────┬────┬────┬────┬────┐
│ V │ T │ T │ T │ T │ Fold 1: Val=fold1, Train=rest
├────┼────┼────┼────┼────┤
│ T │ V │ T │ T │ T │ Fold 2: Val=fold2, Train=rest
├────┼────┼────┼────┼────┤
│ T │ T │ V │ T │ T │ Fold 3: ...
├────┼────┼────┼────┼────┤
│ T │ T │ T │ V │ T │ Fold 4: ...
├────┼────┼────┼────┼────┤
│ T │ T │ T │ T │ V │ Fold 5: Val=fold5, Train=rest
└────┴────┴────┴────┴────┘
Final score = mean ± std of 5 validation scores
- StratifiedKFold: Preserves class balance in each fold — always use for classification
- TimeSeriesSplit: Respects temporal order — use for time series
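The idea behind stratified folds can be sketched in plain Python: group the indices by class, then deal each class out round-robin across the folds. This is a simplified illustration with no shuffling and a hypothetical function name; in practice a library implementation such as scikit-learn's `StratifiedKFold` is the usual choice.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds so each fold keeps the class ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal this class's samples out one fold at a time.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical dataset: 10 positives and 90 negatives.
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, k=5)

for i, fold in enumerate(folds, start=1):
    positives = sum(labels[idx] for idx in fold)
    print(f"fold {i}: size={len(fold)}, positives={positives}")
```

Each fold ends up with 20 samples and exactly 2 positives, so every validation fold sees the same 10% positive rate as the full dataset.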
Statistical Significance
Is model A actually better than model B, or is it noise?
- Paired t-test on cross-validation scores: Compare fold-by-fold scores between two models
- McNemar’s test: For comparing two classifiers on the same test set (compares disagreements)
- Rule of thumb: If the confidence intervals overlap substantially, the difference is probably not significant; confirm with one of the tests above
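As a rough sketch of the paired t-test, the statistic is the mean of the fold-by-fold differences divided by its standard error. The fold scores below are invented, the function name is hypothetical, and the comparison uses the standard two-sided 5% critical value for 4 degrees of freedom (2.776). Cross-validation folds share training data, so the result is best read as indicative rather than exact.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples (assumes the differences are not all equal)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err


# Hypothetical 5-fold validation scores for two models on the same folds.
model_a = [0.82, 0.85, 0.81, 0.84, 0.83]
model_b = [0.80, 0.82, 0.80, 0.81, 0.81]

t_stat = paired_t_statistic(model_a, model_b)
T_CRITICAL_DF4 = 2.776  # two-sided, alpha = 0.05, df = 5 - 1

print(f"t = {t_stat:.2f}, significant at 5%: {abs(t_stat) > T_CRITICAL_DF4}")
```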
A/B Testing for ML Models
ELI5: A/B testing is like a taste test — don’t just look at the recipe (offline metrics), let real customers try both dishes (production traffic) and see which they prefer in real life. The kitchen test and the restaurant test often give different results.
Why offline metrics aren’t enough:
- Production data distribution may differ from evaluation data
- Business KPIs (revenue, engagement, retention) may not correlate perfectly with ML metrics
- User behavior is only observable in production
Traffic (100%)
│
├──── 50% ──▶ Model A (current)
│ │
└──── 50% ──▶ Model B (challenger)
│
Compare: CTR, conversion,
revenue, churn — not just accuracy
SageMaker Production Variants:
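# Split traffic between two model versions.
# Weights are relative: each variant receives weight / sum(weights) of the traffic.
# The "..." marks the remaining variant fields, which are elided here.
endpoint_config = {
    'ProductionVariants': [
        {'VariantName': 'ModelA', 'InitialVariantWeight': 0.9, ...},  # current model, 90%
        {'VariantName': 'ModelB', 'InitialVariantWeight': 0.1, ...},  # challenger, 10%
    ]
}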
How long to run: Until you reach statistical significance (sufficient sample size). Use power analysis: $n \approx \frac{2\sigma^2(z_{\alpha/2} + z_\beta)^2}{\Delta^2}$ where $\Delta$ is the minimum detectable effect.
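The power-analysis formula can be evaluated with the standard library. In the sketch below the result is read as the sample size per variant, which is the usual interpretation of this two-group formula; the function name, the defaults (alpha = 0.05, power = 0.80), and the example numbers are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(sigma, delta, alpha=0.05, power=0.80):
    """n ~= 2 * sigma^2 * (z_{alpha/2} + z_beta)^2 / delta^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    n = 2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2
    return math.ceil(n)


# Metric with standard deviation 1.0, detecting a shift of 0.1.
print(sample_size_per_variant(sigma=1.0, delta=0.1))

# Halving the minimum detectable effect roughly quadruples the sample size.
print(sample_size_per_variant(sigma=1.0, delta=0.05))
```

Because $\Delta$ is squared in the denominator, small effects are expensive to detect, which is why A/B tests on subtle model improvements often need to run for a long time.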
Exam tip: SageMaker supports weight-based traffic splitting across production variants. Start with 10% on the challenger (ModelB), monitor, then shift traffic if results are positive.
SageMaker Model Evaluation Tools
SageMaker Clarify
Two capabilities: Bias Detection + Explainability
Bias Detection — Pre-training (data bias):
| Metric | What It Measures |
|---|---|
| CI (Class Imbalance) | Imbalance between facet groups in the dataset |
| DPL (Difference in Proportions of Labels) | Difference in positive outcome rate between groups |
| KL (Kullback-Leibler Divergence) | Distribution difference of labels between groups |
| KS (Kolmogorov-Smirnov) | Max distance between label distributions |
Bias Detection — Post-training (model bias):
| Metric | What It Measures |
|---|---|
| DPPL (Difference in Positive Proportions in Predicted Labels) | Model predicts positive more for one group |
| DI (Disparate Impact) | Ratio of positive prediction rates between groups |
| AD (Accuracy Difference) | Accuracy differs between demographic groups |
Explainability — SHAP Values:
ELI5: SHAP tells you exactly how much each feature pushed the prediction up or down for a specific individual example. “For this loan application, the high income pushed approval probability up by +0.25, but the short credit history pulled it down by -0.15.”
Feature contributions for one prediction:
Base rate: 0.50
+ income: +0.25
+ employment: +0.10
- credit_length: -0.15
- debt_ratio: -0.08
────────
Final score: 0.62 → predicted "approve"
- Local explanation: Why did the model predict X for THIS specific example?
- Global explanation: Which features are most important overall?
- Works with any model (model-agnostic via KernelSHAP, faster with TreeSHAP for trees)
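The additive property behind a local explanation (base value plus per-feature contributions equals the prediction) can be checked with the numbers from the loan example above. The 0.5 decision threshold is an assumption made for this sketch, and the contributions are simply the document's illustrative values, not output from a SHAP library.

```python
# Illustrative values taken from the loan example; not computed by a SHAP library.
base_rate = 0.50
contributions = {
    "income": 0.25,
    "employment": 0.10,
    "credit_length": -0.15,
    "debt_ratio": -0.08,
}

# Local explanation: contributions add up to the gap between base and prediction.
final_score = base_rate + sum(contributions.values())
decision = "approve" if final_score >= 0.5 else "reject"  # assumed threshold

# Rank features by how strongly they moved this one prediction.
ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)

print(f"final score = {final_score:.2f} -> {decision}")
for feature, value in ranked:
    print(f"{feature:>14}: {value:+.2f}")
```

Averaging the absolute contributions of each feature over many predictions is one common way to move from local explanations like this one to a global importance ranking.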
SageMaker Experiments
Track, compare, and organize training runs:
- Log metrics, parameters, artifacts automatically
- Compare runs side-by-side in Studio
- Group related runs into experiments
- Reproducibility: every run’s config is recorded
SageMaker Model Monitor
Detect drift after deployment — catch when production data diverges from training data:
| Monitor Type | Detects |
|---|---|
| Data Quality Monitor | Feature distribution drift (input data statistics change) |
| Model Quality Monitor | Prediction quality degradation (when ground truth labels arrive) |
| Bias Drift Monitor | Fairness metrics degrading over time |
| Feature Attribution Drift | SHAP values for features changing — model “reasoning” drifts |
Training Data Baseline
│
▼
Statistics Production Data (hourly/daily)
(mean, std, │
distributions) ▼
│ Compute statistics
└──────────── Compare ──────────── Alert if drift detected
(KS test, (CloudWatch, SNS)
PSI score)
Population Stability Index (PSI): Commonly used for drift detection.
$$PSI = \sum_i \left(P_{\text{expected},i} - P_{\text{actual},i}\right) \ln\frac{P_{\text{expected},i}}{P_{\text{actual},i}}$$
| PSI Value | Interpretation |
|---|---|
| < 0.1 | No significant drift |
| 0.1–0.2 | Slight drift — monitor closely |
| > 0.2 | Significant drift — retrain model |
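A minimal sketch of the PSI calculation over pre-binned proportions follows. The bin proportions are invented, the function name is hypothetical, and the small floor `eps` is an assumption added to avoid dividing by zero or taking the log of zero for empty bins.

```python
import math


def psi(expected, actual, eps=1e-6):
    """Population Stability Index over matching bins of proportions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)  # floor empty bins so the log stays finite
        a = max(a, eps)
        total += (e - a) * math.log(e / a)
    return total


# Share of samples per bin at training time vs. in production (hypothetical).
baseline = [0.25, 0.25, 0.25, 0.25]
unchanged = [0.25, 0.25, 0.25, 0.25]
shifted = [0.10, 0.20, 0.30, 0.40]

print(f"PSI unchanged = {psi(baseline, unchanged):.3f}")
print(f"PSI shifted   = {psi(baseline, shifted):.3f}")
```

The shifted distribution gives a PSI of about 0.228, which falls in the "significant drift" band of the table above.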
Common Evaluation Mistakes (Exam Traps)
1. Data Leakage
Target leakage: A feature in your dataset is computed using knowledge of the label (e.g., “diagnosis_code” for predicting “has_disease” — the code IS the diagnosis).
Temporal leakage: Using information that only becomes available after the prediction time (the future) to predict an earlier outcome.
- Example: Predicting loan default using the customer’s current account balance (which reflects whether they defaulted)
Preprocessing leakage: Fitting preprocessing (scaler, imputer, encoder) on the full dataset including test data, then evaluating on test.
- Always fit preprocessing on train only, transform test separately
WRONG:
fit_scaler(all_data) → transform(train, test) → train → evaluate
↑ test statistics leaked into scaler
CORRECT:
fit_scaler(train_only) → transform(train) → train
→ transform(test) → evaluate
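The WRONG and CORRECT flows above can be compared with a hand-rolled standard scaler. The data is invented (one extreme test value makes the leak obvious), and `fit_scaler` and `transform` are hypothetical helpers that stand in for a library scaler.

```python
import statistics


def fit_scaler(values):
    """Learn the mean and (population) standard deviation."""
    return statistics.mean(values), statistics.pstdev(values)


def transform(values, mean, std):
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [100.0]  # hypothetical extreme value that only appears in test

# WRONG: statistics computed on train + test leak the test value into training.
leaky_mean, leaky_std = fit_scaler(train + test)

# CORRECT: fit on train only, then reuse those statistics for test.
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

print(f"leaky mean={leaky_mean:.2f}, train-only mean={mean:.2f}")
print(f"test value scaled with train statistics: {test_scaled[0]:.2f}")
```

With the leaky fit, the training features themselves change (the mean moves from 3.0 to about 19.17), so the model has indirectly seen the test set before evaluation.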
2. Accuracy on Imbalanced Data
- 99% negative class → 99% accuracy by always predicting negative
- Fix: use F1, AUC-ROC, or PR-AUC instead
3. Hyperparameter Selection Bias
If you tune hyperparameters by evaluating on the test set multiple times, the test set is now a de facto validation set. Your final reported test performance is optimistic.
Fix: use a separate validation set for all tuning decisions; only touch test set once at the very end.
4. No Stratification on Imbalanced Data
A random train/test split of 100 samples with 5% positive might put 0 positives in the test set entirely. Always use stratified splits for classification.
5. Random Split for Time Series
Using train_test_split(shuffle=True) on time series data leaks future information into training. Always use chronological split.
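A chronological split only needs a sort and a cut point. The sketch below assumes each row is a `(timestamp, value)` tuple and uses a hypothetical function name and an 80/20 cut; both are illustrative choices.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split (timestamp, value) rows so that all training rows precede all test rows."""
    ordered = sorted(rows, key=lambda row: row[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical daily observations, deliberately stored out of order.
rows = [(day, day * 1.5) for day in (7, 2, 9, 1, 5, 3, 10, 4, 8, 6)]

train, test = chronological_split(rows)

print("train days:", [day for day, _ in train])
print("test days: ", [day for day, _ in test])
```

Every training timestamp is earlier than every test timestamp, so the model is evaluated only on data from its own future.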
Why this matters for the exam: The exam frequently presents scenarios involving these mistakes and asks you to identify the problem or the fix. Leakage and wrong metric choice are the two most common trap scenarios.
Quick Reference: Problem → Metric
| Problem Type | Primary Metric | Secondary Metric | Notes |
|---|---|---|---|
| Binary classification, balanced | AUC-ROC | F1 | Compare models with AUC; set threshold with F1 |
| Binary classification, imbalanced | PR-AUC / Average Precision | F1 | ROC misleads on rare positives |
| Binary, FN is very costly | Recall ($F_2$) | AUC-ROC | Cancer, fraud, security |
| Binary, FP is very costly | Precision ($F_{0.5}$) | AUC-ROC | Spam, legal actions |
| Multi-class, balanced | Accuracy | F1-macro | Accuracy OK when balanced |
| Multi-class, imbalanced | F1-macro | Per-class recall | Macro treats all classes equally |
| Regression, outliers penalized | RMSE | R² | Large errors cost a lot |
| Regression, robust to outliers | MAE | R² | Outliers in data are noise |
| Regression, relative performance | R² | Adjusted R² | Compare across datasets |
| Regression, percentage error | MAPE | RMSE | Cross-scale comparison |
| Ranking | NDCG | MAP | Search, recommendation |
| Anomaly detection | PR-AUC | Recall@K | Rare anomaly = imbalanced |
| Probability calibration | Log Loss | Brier Score | When probabilities matter, not just ranks |
| Time series forecast | RMSE, MAE, MAPE | sMAPE | Depends on scale and outlier presence |
Evaluation Decision Tree
What type of output?
│
├─ CLASS LABEL or PROBABILITY
│ │
│ ├─ Binary classification?
│ │ ├─ Balanced classes? → Accuracy, AUC-ROC
│ │ ├─ Imbalanced classes? → PR-AUC, F1
│ │ ├─ FN costs more? → Recall, F2
│ │ └─ FP costs more? → Precision, F0.5
│ │
│ └─ Multi-class?
│ ├─ Balanced? → Accuracy, F1-macro
│ └─ Imbalanced? → F1-macro, per-class recall
│
└─ NUMERIC VALUE
├─ Large errors catastrophic? → RMSE
├─ Outliers in data (noise)? → MAE
├─ Want % error? → MAPE
└─ Want relative quality? → R²