Domain 3G: Model Evaluation & Validation

Exam Domain: 3 (ML Model Development)

Task: Select the right metrics, detect evaluation pitfalls, and use SageMaker tools to validate and monitor models


Why Evaluation Matters (First Principles)

ELI5: Measuring the wrong thing is worse than not measuring at all. A weather app that always says “sunny” has 90% accuracy in Arizona — but it’s completely useless. The metric must match the business problem.

A model is only as good as its evaluation. Two failure modes:

  1. Wrong metric — optimizing something that doesn’t capture what you care about
  2. Contaminated evaluation — test data leaked into training → metrics are lies

The purpose of evaluation is to estimate future performance on unseen data, not to describe training performance.

Why this matters for the exam: Many exam questions present a scenario with a metric problem — imbalanced data, business cost asymmetry, or a leakage situation. Recognizing the correct metric for each context is heavily tested.


Classification Metrics

Confusion Matrix

For binary classification, every prediction falls into one of four cells:

                    PREDICTED
                  Positive  Negative
                ┌─────────┬─────────┐
ACTUAL Positive │   TP    │   FN    │ ← Actual positives
                │(correct)│(missed) │
                ├─────────┼─────────┤
       Negative │   FP    │   TN    │ ← Actual negatives
                │(false   │(correct)│
                │ alarm)  │         │
                └─────────┴─────────┘

Medical screening example:

| Cell | Meaning | Consequence |
|---|---|---|
| TP (True Positive) | Sick person correctly diagnosed | Correct: patient gets treatment |
| FP (False Positive), Type I error | Healthy person wrongly flagged | Unnecessary treatment, cost, anxiety |
| FN (False Negative), Type II error | Sick person missed | Untreated illness, potentially fatal |
| TN (True Negative) | Healthy person correctly cleared | Correct: patient goes home |

Exam tip: Type I error = FP (cry wolf). Type II error = FN (miss the wolf). In high-stakes domains (cancer, fraud), FN is usually the costlier error.


Accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

When it works: Balanced class distribution.

When it fails: Imbalanced data. Classic trap: 99% of transactions are legitimate. A model that always predicts “not fraud” achieves 99% accuracy but catches zero fraud cases.

Exam tip: If you see “imbalanced dataset” + “accuracy,” the answer is almost certainly “use a different metric” — F1, AUC-ROC, or precision/recall.
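The fraud trap above can be checked by hand. The following is a minimal sketch using made-up data (1,000 transactions with a 1% fraud rate, both assumptions for illustration) and a hypothetical helper `confusion_counts`; it shows that an "always not fraud" model scores 99% accuracy while its recall is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical data: 10 fraud cases among 1,000 transactions.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a model that always predicts "not fraud"

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The accuracy looks excellent only because true negatives dominate the numerator; recall exposes that no fraud case was caught.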


Precision

$$\text{Precision} = \frac{TP}{TP + FP}$$

“Of everything I predicted positive, how many were actually positive?”

ELI5: Precision is like a picky chef — they only serve dishes they’re confident about. Few dishes, but every dish is good. You might miss some great dishes, but you never serve a bad one.

Prioritize precision when: The cost of false positives is high.

  • Spam filter: you don’t want to delete legitimate emails
  • Search results: irrelevant results frustrate users
  • Drug recommendation: don’t suggest a harmful drug based on weak signal

Recall (Sensitivity, True Positive Rate)

$$\text{Recall} = \frac{TP}{TP + FN}$$

“Of everything that IS positive, how many did I catch?”

ELI5: Recall is like a worried parent — they take their child to the doctor for every cough. They might overcall, causing unnecessary visits (FP), but they never miss a real illness (FN).

Prioritize recall when: The cost of false negatives is high.

  • Cancer screening: missing a real cancer is far worse than extra biopsies
  • Fraud detection: missing a fraud case causes direct financial loss
  • Security intrusion detection: missing an attack is catastrophic

Precision vs Recall Tradeoff

As you raise the classification threshold (require more confidence to predict positive):

  • Fewer predictions are positive → fewer FP → precision increases
  • More positive cases are missed → more FN → recall decreases
Threshold = 0.3 (lenient):
   Precision = 0.60, Recall = 0.95  ← catches almost everything, many false alarms

Threshold = 0.5 (default):
   Precision = 0.75, Recall = 0.80  ← balanced

Threshold = 0.9 (strict):
   Precision = 0.95, Recall = 0.40  ← very confident predictions only, misses many real positives
Precision
1.0 │╲
    │  ╲
0.8 │    ╲___
    │        ╲___
0.6 │            ╲___
    │                ╲___
0.4 │                    ╲___
    └────────────────────────── Recall
    0.4  0.6  0.8  1.0
         ↑ Inverse relationship
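The tradeoff can be reproduced on a toy example. This sketch assumes ten made-up scored examples and a hypothetical helper `precision_recall_at`; the numbers differ from the illustrative ones above, but the direction of the effect is the same.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    # Convention (an assumption): precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical labels and model scores, sorted by score for readability.
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.92, 0.85, 0.70, 0.65, 0.45, 0.40, 0.35, 0.20, 0.10]

for threshold in (0.3, 0.5, 0.9):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.3f}, recall={r:.3f}")
```

Raising the threshold from 0.3 to 0.9 moves this toy model from (precision 0.625, recall 1.0) to (precision 1.0, recall 0.4).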

Business decision table:

| Scenario | Prioritize | Rationale |
|---|---|---|
| Email spam filter | Precision | False positives delete legitimate mail |
| Cancer screening | Recall | False negatives miss treatable cancer |
| Content moderation (ban accounts) | Precision | Wrongly banning innocent users is bad |
| Financial fraud detection | Recall | Letting fraud through costs money |
| Search engine results | Precision | Irrelevant results reduce trust |
| Disease contact tracing | Recall | Missing a contact spreads the disease |

F1 Score

The harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

ELI5: F1 is the compromise between the picky chef (precision) and the worried parent (recall). It rewards balance — a model with P=0.9, R=0.1 gets a terrible F1, not a good one.

Why harmonic mean (not arithmetic)?

Arithmetic mean of P=1.0 and R=0.0 would be 0.5 — misleadingly high for a useless model. Harmonic mean: $\frac{2 \cdot 1.0 \cdot 0.0}{1.0 + 0.0} = 0$ — correctly reflects the failure.

F-beta score — when you want to weight precision vs recall differently:

$$F_\beta = (1+\beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

| $\beta$ value | Meaning |
|---|---|
| $\beta < 1$ (e.g., 0.5) | Favors precision ($\beta^2$ shrinks recall’s weight) |
| $\beta = 1$ | Equal weight (standard F1) |
| $\beta > 1$ (e.g., 2) | Favors recall (recall is treated as $\beta$ times as important as precision) |

Exam tip: “We care more about missing real positives than false alarms” → use $F_2$ (beta=2, favors recall).
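As a quick numerical check of the formulas above, here is a small sketch with a hypothetical helper `f_beta`. It reuses the P=0.9, R=0.1 example from the F1 discussion and shows how $\beta$ shifts the score.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; returns 0.0 when precision and recall are both zero."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.9, 0.1  # high precision, very low recall
print(f"F0.5 = {f_beta(p, r, beta=0.5):.3f}")  # leans toward precision
print(f"F1   = {f_beta(p, r, beta=1.0):.3f}")  # balanced
print(f"F2   = {f_beta(p, r, beta=2.0):.3f}")  # leans toward recall
```

With recall this poor, the score falls as $\beta$ grows: roughly 0.346 for $F_{0.5}$, 0.18 for $F_1$, and 0.122 for $F_2$.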


Specificity (True Negative Rate)

$$\text{Specificity} = \frac{TN}{TN + FP}$$

“Of everything that IS negative, how many did I correctly identify?”

  • Complement: False Positive Rate (FPR) = $1 - \text{Specificity} = \frac{FP}{TN + FP}$
  • Used in ROC curve (x-axis = FPR, y-axis = TPR/Recall)

ROC Curve & AUC

ROC (Receiver Operating Characteristic): Plot TPR vs FPR at every possible threshold.

TPR (Recall)
1.0 │          ╱─────── Perfect classifier (AUC=1.0)
    │        ╱─╱
0.8 │      ╱  ╱ ← Good classifier
    │    ╱  ╱
0.6 │  ╱  ╱
    │ ╱  ╱ ─ ─ ─ Random (AUC=0.5, diagonal)
0.4 │╱  ╱
    │  ╱
0.2 │ ╱
    │╱
    └────────────────── FPR (1 - Specificity)
    0.0  0.2  0.4  0.6  0.8  1.0

AUC (Area Under the ROC Curve):

| AUC Value | Interpretation |
|---|---|
| 1.0 | Perfect classifier |
| 0.9–1.0 | Excellent |
| 0.8–0.9 | Good |
| 0.7–0.8 | Fair |
| 0.6–0.7 | Poor |
| 0.5 | Random (no better than coin flip) |
| < 0.5 | Worse than random: invert the predictions! |

ELI5: AUC answers: “If I pick a random positive example and a random negative example, what’s the probability my model scores the positive one higher?” AUC = 0.9 means 90% probability — the model almost always ranks positives above negatives.

When to use AUC:

  • Comparing models independent of threshold choice
  • Imbalanced datasets (more robust than accuracy)
  • When you don’t know the operating threshold yet
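The ranking interpretation of AUC can be computed directly by comparing every positive with every negative. This is a brute-force sketch on made-up data with a hypothetical function name `auc_by_ranking`; it is quadratic in the number of examples, so it is meant for intuition rather than production use.

```python
def auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win, which matches the usual AUC convention.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores: one positive (0.45) ranks below one negative (0.65).
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.92, 0.85, 0.70, 0.65, 0.45, 0.40, 0.35, 0.20, 0.10]

auc = auc_by_ranking(y_true, scores)
print(f"AUC = {auc:.2f}")
```

Of the 25 positive/negative pairs, 24 are ranked correctly, giving an AUC of 0.96. No threshold appears anywhere in the calculation, which is why AUC is threshold-independent.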

Precision-Recall Curve & Average Precision

For highly imbalanced data (rare positives), the ROC curve can look deceptively good: the large pool of true negatives keeps FPR low even when false positives outnumber true positives. The PR curve exposes performance on the minority class:

Precision
1.0 │╲ ← Perfect
    │  ╲___
0.8 │       ╲___  ← Good model
    │            ╲___
0.6 │     ─ ─ ─ ─ ─ ─ ╲── Baseline (random)
    │                        ╲
0.4 │
    └──────────────────────── Recall
    0.0  0.2  0.4  0.6  0.8  1.0

Average Precision (AP): Area under the PR curve. Higher is better.

Use PR-AUC instead of ROC-AUC when: Positive class is rare (fraud detection, disease screening, anomaly detection).

Exam tip: “1% fraud rate, need to evaluate model” → use Precision-Recall curve / AP, not ROC-AUC.
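Average Precision can be sketched as the mean of the precision values measured at each rank where a true positive appears. The example below assumes made-up data with 2 positives among 20 examples and a hypothetical helper `average_precision`; it ignores tied scores for simplicity.

```python
def average_precision(y_true, scores):
    """Mean of precision@rank taken at each rank that holds a true positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Hypothetical data: 20 examples with distinct, descending scores.
scores = [1.0 - 0.05 * i for i in range(20)]
y_true = [0, 1, 0, 0, 1] + [0] * 15  # positives sit at ranks 2 and 5

ap = average_precision(y_true, scores)
prevalence = sum(y_true) / len(y_true)  # AP of a random ranker is about this value
print(f"AP = {ap:.2f}, random baseline = {prevalence:.2f}")
```

Here AP is (1/2 + 2/5) / 2 = 0.45 against a baseline of 0.10. The baseline for a PR curve is the positive-class prevalence, not 0.5 as with ROC-AUC.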


Log Loss (Cross-Entropy Loss)

$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i) \right]$$

  • Evaluates the quality of the predicted probabilities, not just the class labels
  • Heavily penalizes confident wrong predictions: predicting $\hat{p} = 0.99$ for a negative example → large loss
  • Lower is better; it reaches 0 only when every example is assigned probability 1 for its true class
  • Used as the training objective for logistic regression and most neural network classifiers
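To make the penalty concrete, here is a minimal sketch of binary log loss with a hypothetical helper `log_loss`. The clipping constant `eps` is an assumption added to avoid `log(0)`; it is not part of the formula above.

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Binary cross-entropy averaged over examples; probs are P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# One negative example scored with increasing (wrong) confidence.
for p_hat in (0.6, 0.9, 0.99):
    print(f"p_hat={p_hat}: loss={log_loss([0], [p_hat]):.3f}")
```

The loss for a single negative example grows from about 0.92 at $\hat{p} = 0.6$ to about 4.61 at $\hat{p} = 0.99$, which is the "confident and wrong" penalty described in the list above.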

Multi-Class Classification

For K classes, extend binary metrics:

| Strategy | How | When |
|---|---|---|
| Macro-average | Compute metric per class, take unweighted mean | Classes equally important, even if imbalanced |
| Micro-average | Pool all TP/FP/FN across classes, compute once | Each individual prediction equally important |
| Weighted-average | Weight each class by its support (sample count) | Balanced representation by class frequency |

Exam tip: Imbalanced multi-class + care about all classes equally → macro F1. Imbalanced + care about total correct predictions → weighted F1.
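The three averaging strategies can give quite different numbers on the same predictions. The sketch below uses made-up labels for three classes (6 A, 3 B, 1 C) and hypothetical helper names; it uses the identity $F_1 = 2TP / (2TP + FP + FN)$, which follows from the precision and recall definitions.

```python
def per_class_counts(y_true, y_pred, label):
    """One-vs-rest (TP, FP, FN) for a single class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical imbalanced data; the rare class C is never predicted.
y_true = ["A"] * 6 + ["B"] * 3 + ["C"]
y_pred = ["A", "A", "A", "A", "A", "B", "B", "B", "A", "A"]
labels = ["A", "B", "C"]

counts = {lab: per_class_counts(y_true, y_pred, lab) for lab in labels}
f1 = {lab: f1_from_counts(*counts[lab]) for lab in labels}

macro = sum(f1.values()) / len(labels)
weighted = sum(f1[lab] * y_true.count(lab) for lab in labels) / len(y_true)
micro = f1_from_counts(*(sum(c[i] for c in counts.values()) for i in range(3)))

print(f"macro={macro:.3f}, weighted={weighted:.3f}, micro={micro:.3f}")
```

Macro F1 (about 0.479) is pulled down sharply by the missed rare class, while weighted (about 0.662) and micro (0.700) F1 mostly reflect the majority class.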


Regression Metrics

MSE (Mean Squared Error)

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

  • Penalizes large errors quadratically — a $2\times$ larger error contributes $4\times$ more to the metric
  • Units: squared units of target (if target is dollars, MSE is dollars²)
  • Differentiable everywhere — preferred training objective

RMSE (Root Mean Squared Error)

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

  • Same units as target → interpretable: “typical error is about $X$ units” (the literal average error is MAE, not RMSE)
  • Most commonly reported regression metric
  • Still penalizes large errors heavily (via the underlying MSE)

MAE (Mean Absolute Error)

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

  • Treats all error magnitudes linearly — a $2\times$ larger error contributes exactly $2\times$ more
  • Robust to outliers (a single extreme error doesn’t dominate)
  • Less smooth at zero (not differentiable) — some optimizers handle this less well

RMSE vs MAE

ELI5: RMSE is a strict teacher who penalizes big mistakes harshly: one catastrophic error tanks your grade. MAE is a fair teacher who weighs every mistake in proportion to its size, so you’re graded on your average everyday performance.

| Situation | Use |
|---|---|
| Large errors are especially costly (financial forecasting, safety-critical systems) | RMSE |
| All error magnitudes matter equally; the data contains outliers that are noise rather than signal | MAE |
| Want interpretable units | Either (both are in the same units as the target) |
| Comparing across datasets with different scales | MAPE or R² |
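The difference in outlier sensitivity is easy to see numerically. This sketch uses made-up predictions and hypothetical helpers `mae` and `rmse`; the only change between the two runs is one prediction that is off by 10 instead of 1.

```python
import math

def mae(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

y_true = [10, 10, 10, 10, 10]
clean = [11, 9, 11, 9, 11]          # every error has magnitude 1
with_outlier = [11, 9, 11, 9, 20]   # last error has magnitude 10

for name, preds in (("clean", clean), ("with outlier", with_outlier)):
    print(f"{name}: MAE={mae(y_true, preds):.2f}, RMSE={rmse(y_true, preds):.2f}")
```

On the clean predictions both metrics equal 1.0. With the single large error, MAE rises to 2.8 while RMSE rises to about 4.56, because the squared term lets one error dominate.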

R² (Coefficient of Determination)

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

ELI5: R² answers: “How much better is my model than just guessing the average every time?” R²=0.80 means my model explains 80% of the variance — the average-guesser explains 0%.

| R² Value | Interpretation |
|---|---|
| 1.0 | Perfect predictions |
| 0.8–1.0 | Strong model |
| 0.5–0.8 | Moderate |
| 0.0 | Same as predicting the mean: no value |
| < 0.0 | Worse than predicting the mean (terrible model or wrong setup) |

Adjusted R²: Penalizes adding features that don’t help:

$$\bar{R}^2 = 1 - (1-R^2)\frac{n-1}{n-p-1}$$

where $n$ = samples, $p$ = number of features. Use adjusted R² when comparing models with different feature counts.
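Both formulas can be checked with a few lines of arithmetic. The sketch below uses made-up values and hypothetical helper names, and shows how the adjusted score drops as the assumed feature count $p$ grows while $R^2$ stays fixed.

```python
def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """n = number of samples, p = number of features (requires n > p + 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y_true = [3, 5, 7, 9, 11]
y_pred = [2.8, 5.3, 6.9, 9.2, 10.9]  # hypothetical model output

r2 = r_squared(y_true, y_pred)
print(f"R2 = {r2:.5f}")
for p in (1, 3):
    print(f"adjusted R2 with p={p}: {adjusted_r_squared(r2, len(y_true), p):.5f}")

# Predicting the mean for every sample gives R2 = 0 by construction.
print(r_squared(y_true, [7] * 5))
```

With five samples, the same fit scores about 0.9937 adjusted for one feature but about 0.9810 adjusted for three, which is the penalty for extra features that the formula encodes.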

MAPE (Mean Absolute Percentage Error)

$$MAPE = \frac{1}{n}\sum_{i=1}^{n}\frac{|y_i - \hat{y}_i|}{|y_i|} \times 100\%$$

  • Scale-independent — “average 8% error” is interpretable regardless of target magnitude
  • Good for comparing models across different products/targets
  • Problem: Undefined (division by zero) when $y_i = 0$; also asymmetric (over-predictions treated differently from under-predictions)
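Both caveats in the list above show up in a short sketch. The helper name `mape` and the choice to raise an error on zero targets are assumptions made here for illustration; other implementations skip or clip such rows instead.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    if any(y == 0 for y in y_true):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100 * sum(abs(y - p) / abs(y) for y, p in zip(y_true, y_pred)) / len(y_true)

# Scale independence: both rows are 10% off despite different magnitudes.
print(mape([100, 200], [110, 180]))

# Asymmetry: the same absolute miss of 50 yields different percentages.
print(mape([100], [150]))  # over-prediction
print(mape([150], [100]))  # under-prediction
```

The same 50-unit miss counts as 50% when the actual value is 100 but about 33.3% when the actual value is 150, so a model tuned on MAPE can be biased toward under-predicting.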

Regression vs Classification Metrics Summary

| Metric | Type | Formula Sketch | Best Used When | Outlier Sensitivity |
|---|---|---|---|---|
| Accuracy | Class. | (TP+TN)/all | Balanced classes | N/A |
| Precision | Class. | TP/(TP+FP) | FP cost high | N/A |
| Recall | Class. | TP/(TP+FN) | FN cost high | N/A |
| F1 | Class. | Harmonic mean of P and R | Imbalanced, both errors matter | N/A |
| AUC-ROC | Class. | Area under ROC | Threshold-independent comparison | N/A |
| AP (PR-AUC) | Class. | Area under PR | Rare positive class | N/A |
| Log Loss | Class. | Cross-entropy | Probability calibration quality | High |
| MSE | Reg. | Mean squared error | Large errors very costly | High |
| RMSE | Reg. | $\sqrt{MSE}$ | Interpretable, penalize large errors | High |
| MAE | Reg. | Mean absolute error | Robust evaluation, outliers present | Low |
| R² | Reg. | Variance explained | Relative model quality | Moderate |
| MAPE | Reg. | % error | Cross-scale comparison | Low (but div/0 risk) |

Model Selection & Comparison

Always Start with a Baseline

Never report model performance without comparing to a dumb baseline:

| Problem Type | Baseline |
|---|---|
| Binary/multi-class | Predict majority class always |
| Regression | Predict mean of training labels always |
| Time series | Persistence model: $\hat{y}_{t+1} = y_t$ |
| NLP ranking | BM25 or TF-IDF |

If your model doesn’t beat the baseline, you have a problem.

Cross-Validation

For small datasets where a single train/val split is unreliable:

5-fold Cross-Validation:
┌────┬────┬────┬────┬────┐
│ V  │ T  │ T  │ T  │ T  │ Fold 1: Val=fold1, Train=rest
├────┼────┼────┼────┼────┤
│ T  │ V  │ T  │ T  │ T  │ Fold 2: Val=fold2, Train=rest
├────┼────┼────┼────┼────┤
│ T  │ T  │ V  │ T  │ T  │ Fold 3: ...
├────┼────┼────┼────┼────┤
│ T  │ T  │ T  │ V  │ T  │ Fold 4: ...
├────┼────┼────┼────┼────┤
│ T  │ T  │ T  │ T  │ V  │ Fold 5: Val=fold5, Train=rest
└────┴────┴────┴────┴────┘
Final score = mean ± std of 5 validation scores
  • StratifiedKFold: Preserves class balance in each fold — always use for classification
  • TimeSeriesSplit: Respects temporal order — use for time series

Statistical Significance

Is model A actually better than model B, or is it noise?

  • Paired t-test on cross-validation scores: Compare fold-by-fold scores between two models
  • McNemar’s test: For comparing two classifiers on the same test set (compares disagreements)
  • Rule of thumb: If confidence intervals overlap substantially, results are not significant
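The paired t-test from the list above reduces to a one-sample t-test on the fold-by-fold differences. The sketch below uses made-up scores for two models on the same five folds and a hypothetical helper `paired_t_statistic`. It compares against 2.776, the two-sided 5% critical value for 4 degrees of freedom. One caveat: cross-validation folds share training data, so the scores are not fully independent and this test tends to be optimistic.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic for the mean per-fold difference (degrees of freedom = k - 1)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err

# Hypothetical validation scores on the same 5 folds.
model_a = [0.82, 0.80, 0.85, 0.81, 0.83]
model_b = [0.80, 0.79, 0.82, 0.80, 0.80]

t_stat = paired_t_statistic(model_a, model_b)
T_CRITICAL_DF4 = 2.776  # two-sided, alpha = 0.05, 4 degrees of freedom
print(f"t = {t_stat:.3f}, significant = {abs(t_stat) > T_CRITICAL_DF4}")
```

Pairing matters here: model A wins on every fold by a small, consistent margin, so the difference is detectable even though the two score ranges overlap.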

A/B Testing for ML Models

ELI5: A/B testing is like a taste test — don’t just look at the recipe (offline metrics), let real customers try both dishes (production traffic) and see which they prefer in real life. The kitchen test and the restaurant test often give different results.

Why offline metrics aren’t enough:

  • Production data distribution may differ from evaluation data
  • Business KPIs (revenue, engagement, retention) may not correlate perfectly with ML metrics
  • User behavior is only observable in production
Traffic (100%)
      │
      ├──── 50% ──▶ Model A (current)
      │                  │
      └──── 50% ──▶ Model B (challenger)
                         │
                   Compare: CTR, conversion,
                   revenue, churn — not just accuracy

SageMaker Production Variants:

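# Split traffic between two model versions.
# Weights are relative: 0.9 and 0.1 send roughly 90% and 10% of requests.
# Other variant fields (model name, instance type, instance count) are elided here.
endpoint_config = {
    'ProductionVariants': [
        {'VariantName': 'ModelA', 'InitialVariantWeight': 0.9},  # current model
        {'VariantName': 'ModelB', 'InitialVariantWeight': 0.1},  # challenger
    ]
}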

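How long to run: Until you have collected the sample size needed to detect the effect you care about, not just until the numbers first look good. Use power analysis: $n \approx \frac{2\sigma^2(z_{\alpha/2} + z_\beta)^2}{\Delta^2}$ samples per variant, where $\sigma^2$ is the variance of the outcome, $z_{\alpha/2}$ and $z_\beta$ are the normal quantiles for the chosen significance level and power, and $\Delta$ is the minimum detectable effect.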
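The power-analysis formula can be evaluated with the standard library. This sketch makes several assumptions for illustration: a conversion-style metric with a 10% baseline rate (so $\sigma^2 \approx p(1-p)$), a minimum detectable effect of one percentage point, a 5% significance level, and 80% power. The function name `samples_per_variant` is hypothetical.

```python
import math
from statistics import NormalDist

def samples_per_variant(sigma_sq, delta, alpha=0.05, power=0.80):
    """Approximate sample size per variant for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return math.ceil(2 * sigma_sq * (z_alpha + z_beta) ** 2 / delta ** 2)

baseline_rate = 0.10  # assumed conversion rate of the current model
n = samples_per_variant(sigma_sq=baseline_rate * (1 - baseline_rate), delta=0.01)
print(f"about {n} requests per variant")
```

Because $\Delta$ is squared in the denominator, halving the effect you want to detect roughly quadruples the traffic required. With a 90/10 split, the challenger variant reaches its sample size far more slowly than the current model.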

Exam tip: SageMaker supports weight-based traffic splitting across production variants. Start with 10% on the challenger (ModelB), monitor, then shift traffic if results are positive.


SageMaker Model Evaluation Tools

SageMaker Clarify

Two capabilities: Bias Detection + Explainability

Bias Detection — Pre-training (data bias):

| Metric | What It Measures |
|---|---|
| CI (Class Imbalance) | Imbalance between facet groups in the dataset |
| DPL (Difference in Proportions of Labels) | Difference in positive outcome rate between groups |
| KL (Kullback-Leibler Divergence) | Distribution difference of labels between groups |
| KS (Kolmogorov-Smirnov) | Max distance between label distributions |

Bias Detection — Post-training (model bias):

| Metric | What It Measures |
|---|---|
| DPPL (Difference in Positive Proportions in Predicted Labels) | Model predicts positive more for one group |
| DI (Disparate Impact) | Ratio of positive prediction rates between groups |
| AD (Accuracy Difference) | Accuracy differs between demographic groups |

Explainability — SHAP Values:

ELI5: SHAP tells you exactly how much each feature pushed the prediction up or down for a specific individual example. “For this loan application, the high income pushed approval probability up by +0.25, but the short credit history pulled it down by -0.15.”

Feature contributions for one prediction:
Base rate:        0.50
+ income:        +0.25
+ employment:    +0.10
- credit_length: -0.15
- debt_ratio:    -0.08
              ────────
Final score:      0.62  → predicted "approve"
  • Local explanation: Why did the model predict X for THIS specific example?
  • Global explanation: Which features are most important overall?
  • Works with any model (model-agnostic via KernelSHAP, faster with TreeSHAP for trees)

SageMaker Experiments

Track, compare, and organize training runs:

  • Log metrics, parameters, artifacts automatically
  • Compare runs side-by-side in Studio
  • Group related runs into experiments
  • Reproducibility: every run’s config is recorded

SageMaker Model Monitor

Detect drift after deployment — catch when production data diverges from training data:

| Monitor Type | Detects |
|---|---|
| Data Quality Monitor | Feature distribution drift (input data statistics change) |
| Model Quality Monitor | Prediction quality degradation (when ground truth labels arrive) |
| Bias Drift Monitor | Fairness metrics degrading over time |
| Feature Attribution Drift | SHAP values for features changing: model “reasoning” drifts |

Training Data Baseline
      │
      ▼
  Statistics           Production Data (hourly/daily)
  (mean, std,               │
   distributions)           ▼
      │              Compute statistics
      └──────────── Compare ──────────── Alert if drift detected
                   (KS test,                (CloudWatch, SNS)
                    PSI score)

Population Stability Index (PSI): Commonly used for drift detection.

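$$PSI = \sum_i (P_{expected,i} - P_{actual,i}) \ln\frac{P_{expected,i}}{P_{actual,i}}$$

where $i$ indexes the bins of the feature’s distribution and each $P$ is the share of samples falling in that bin.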

| PSI Value | Interpretation |
|---|---|
| < 0.1 | No significant drift |
| 0.1–0.2 | Slight drift: monitor closely |
| > 0.2 | Significant drift: retrain model |
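A small sketch shows how PSI is computed from binned proportions. The bin shares below are made up, the helper name `psi` is hypothetical, and the `eps` floor for empty bins is an assumption added to avoid division by zero; it is not part of the formula.

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index over matching bins of two distributions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (e - a) * math.log(e / a)
    return total

# Hypothetical share of samples per bin for one feature.
training_bins = [0.25, 0.25, 0.25, 0.25]
production_bins = [0.10, 0.20, 0.30, 0.40]

score = psi(training_bins, production_bins)
print(f"PSI = {score:.3f}")
```

This shift yields a PSI of about 0.228, which falls in the "significant drift" band of the table. Each term is non-negative, so identical distributions give exactly 0.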

Common Evaluation Mistakes (Exam Traps)

1. Data Leakage

Target leakage: A feature in your dataset is computed using knowledge of the label (e.g., “diagnosis_code” for predicting “has_disease” — the code IS the diagnosis).

Temporal leakage: Using future information to predict the past.

  • Example: Predicting loan default using the customer’s current account balance (which reflects whether they defaulted)

Preprocessing leakage: Fitting preprocessing (scaler, imputer, encoder) on the full dataset including test data, then evaluating on test.

  • Always fit preprocessing on train only, transform test separately
WRONG:
fit_scaler(all_data) → transform(train, test) → train → evaluate
                            ↑ test statistics leaked into scaler

CORRECT:
fit_scaler(train_only) → transform(train) → train
                       → transform(test)  → evaluate
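The correct flow can be sketched with a hand-rolled standard scaler. The data and the helper names `fit_scaler` and `transform` are made up for illustration; the point is only that the statistics come from the training split and are then reused unchanged on the test split.

```python
import statistics

def fit_scaler(values):
    """Learn the mean and population standard deviation from the given values."""
    return statistics.mean(values), statistics.pstdev(values)

def transform(values, mean, std):
    return [(v - mean) / std for v in values]

train = [10.0, 12.0, 11.0, 13.0, 14.0]
test = [30.0, 32.0]  # hypothetical test set with a shifted distribution

# WRONG: statistics computed on train + test leak test information.
leaky_mean, leaky_std = fit_scaler(train + test)

# CORRECT: fit on train only, then apply the same statistics to both splits.
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

print(f"train-only mean={mean:.2f}, leaky mean={leaky_mean:.2f}")
```

The leaky mean (about 17.43) is pulled toward the test values, so a model trained on data scaled that way has already seen a summary of the test set.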

2. Accuracy on Imbalanced Data

  • 99% negative class → 99% accuracy by always predicting negative
  • Fix: use F1, AUC-ROC, or PR-AUC instead

3. Hyperparameter Selection Bias

If you tune hyperparameters by evaluating on the test set multiple times, the test set is now a de facto validation set. Your final reported test performance is optimistic.

Fix: use a separate validation set for all tuning decisions; only touch test set once at the very end.

4. No Stratification on Imbalanced Data

A random train/test split of 100 samples with 5% positive might put 0 positives in the test set entirely. Always use stratified splits for classification.

5. Random Split for Time Series

Using train_test_split(shuffle=True) on time series data leaks future information into training. Always use chronological split.
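A chronological split can be as simple as sorting by time and cutting once. The sketch below assumes made-up daily rows and a hypothetical helper `chronological_split`; the 80/20 cut is an arbitrary choice for illustration.

```python
def chronological_split(rows, time_key, train_fraction=0.8):
    """Train on the earliest rows and test on the latest, with no shuffling."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Hypothetical daily records arriving out of order.
rows = [{"day": d, "sales": 100 + d} for d in (5, 1, 9, 3, 7, 2, 10, 4, 8, 6)]

train, test = chronological_split(rows, "day")
print("train days:", [r["day"] for r in train])
print("test days: ", [r["day"] for r in test])
```

Every training row precedes every test row, which mirrors how the model will be used in production: trained on the past, scored on the future.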

Why this matters for the exam: The exam frequently presents scenarios involving these mistakes and asks you to identify the problem or the fix. Leakage and wrong metric choice are the two most common trap scenarios.


Quick Reference: Problem → Metric

| Problem Type | Primary Metric | Secondary Metric | Notes |
|---|---|---|---|
| Binary classification, balanced | AUC-ROC | F1 | Compare models with AUC; set threshold with F1 |
| Binary classification, imbalanced | PR-AUC / Average Precision | F1 | ROC misleads on rare positives |
| Binary, FN is very costly | Recall ($F_2$) | AUC-ROC | Cancer, fraud, security |
| Binary, FP is very costly | Precision ($F_{0.5}$) | AUC-ROC | Spam, legal actions |
| Multi-class, balanced | Accuracy | F1-macro | Accuracy OK when balanced |
| Multi-class, imbalanced | F1-macro | Per-class recall | Macro treats all classes equally |
| Regression, outliers penalized | RMSE | | Large errors cost a lot |
| Regression, robust to outliers | MAE | | Outliers in data are noise |
| Regression, relative performance | R² | Adjusted R² | Compare across datasets |
| Regression, percentage error | MAPE | RMSE | Cross-scale comparison |
| Ranking | NDCG | MAP | Search, recommendation |
| Anomaly detection | PR-AUC | Recall@K | Rare anomaly = imbalanced |
| Probability calibration | Log Loss | Brier Score | When probabilities matter, not just ranks |
| Time series forecast | RMSE, MAE, MAPE | sMAPE | Depends on scale and outlier presence |

Evaluation Decision Tree

What type of output?
│
├─ CLASS LABEL or PROBABILITY
│   │
│   ├─ Binary classification?
│   │   ├─ Balanced classes?    → Accuracy, AUC-ROC
│   │   ├─ Imbalanced classes?  → PR-AUC, F1
│   │   ├─ FN costs more?       → Recall, F2
│   │   └─ FP costs more?       → Precision, F0.5
│   │
│   └─ Multi-class?
│       ├─ Balanced?            → Accuracy, F1-macro
│       └─ Imbalanced?          → F1-macro, per-class recall
│
└─ NUMERIC VALUE
    ├─ Large errors catastrophic? → RMSE
    ├─ Outliers in data (noise)?  → MAE
    ├─ Want % error?              → MAPE
    └─ Want relative quality?     → R²