Domain 2B: Model Training, Tuning & Evaluation

Exam Domain: 2 - ML Model Development (26%)
Task: Train, refine, and analyze model performance


Deep Learning Foundations

Neural Network Basics

Input Layer      Hidden Layers       Output Layer
  (features)      (learned)          (prediction)

  x₁ ─────┐   ┌─── h₁ ───┐
           ├───┤           ├─── h₄ ───── ŷ
  x₂ ─────┤   └─── h₂ ───┤
           │       h₃ ────┘
  x₃ ─────┘

Each connection has a weight (w).
Each node applies: output = activation(Σ(wᵢ × xᵢ) + bias)

ELI5: A neural network is like a factory assembly line. Raw materials (your input data) go in at one end. They pass through a series of processing stations (layers), where each station’s workers (neurons) each look at the incoming material and decide how much of their own signal to pass forward, based on weights they learned during training. At the end of the line, a finished product comes out (the prediction). Training is just running millions of products through the line, comparing the output to the correct answer, and slightly adjusting each worker’s decision rules to reduce mistakes.
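As a rough illustration of the per-node formula above, here is a minimal Python sketch of a single neuron. The input, weight, and bias values are made up for the example, and sigmoid is just one possible activation.

```python
import math

def sigmoid(z):
    """Squash any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation=sigmoid):
    """output = activation(sum(w_i * x_i) + bias)"""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values only: the weighted sum is 0.5 - 0.5 + 0.0 = 0.0
y = neuron_output([1.0, 2.0, 3.0], [0.5, -0.25, 0.0], bias=0.0)
print(y)  # sigmoid(0.0) = 0.5
```

A full layer is just many of these neurons sharing the same inputs, each with its own weights and bias.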

Activation Functions

| Function | Formula | Range | Use Case |
|---|---|---|---|
| Sigmoid | $\sigma(x) = \frac{1}{1+e^{-x}}$ | (0, 1) | Binary classification output |
| Tanh | $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ | (-1, 1) | Hidden layers (zero-centered) |
| ReLU | $f(x) = \max(0, x)$ | [0, ∞) | Default for hidden layers |
| Leaky ReLU | $f(x) = \max(0.01x, x)$ | (-∞, ∞) | Fixes “dying ReLU” problem |
| Softmax | $\frac{e^{x_i}}{\sum_j e^{x_j}}$ | (0, 1), sums to 1 | Multi-class output |

When to use which activation:
  Hidden layers   → ReLU (or Leaky ReLU)
  Binary output   → Sigmoid
  Multi-class     → Softmax
  RNN hidden      → Tanh

ELI5: Activation functions are decision gates that control how much signal each neuron passes forward. Without them, stacking layers would be pointless: a network of linear operations is still just one linear operation, no matter how many layers deep. ReLU is the default because it’s simple (just “pass positive numbers through, block negatives”) and its gradient is 1 for positive inputs, so it is far less prone to the vanishing gradient problem that plagues Sigmoid and Tanh in deep networks. Think of it as: Sigmoid squashes everything to a probability, Softmax turns a list of scores into probabilities that sum to 1, ReLU just lets the positive signal through unchanged.
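The formulas in the table translate almost directly into code. The sketch below uses only the Python standard library; the 0.01 slope for Leaky ReLU matches the table, and subtracting the maximum score inside softmax is a common numerical-stability trick rather than part of the definition.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    # Same as max(slope * x, x) for 0 < slope < 1
    return x if x > 0 else slope * x

def softmax(scores):
    # Shift by the max so exp() cannot overflow; the result is unchanged
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # three probabilities that sum to 1
```

Tanh is available directly as `math.tanh`.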


Convolutional Neural Networks (CNNs)

For image data — learn spatial hierarchies of features.

Input Image → [Conv + ReLU] → [Pooling] → [Conv + ReLU] → [Pooling] → [Flatten] → [Dense] → Output
              extract edges    downsample   extract shapes  downsample   1D vector    classify

| Layer | Purpose |
|---|---|
| Convolution | Apply filters to detect features (edges, textures, shapes) |
| Pooling (Max/Avg) | Downsample: reduce spatial size, keep important info |
| Flatten | Convert 2D feature maps to 1D vector |
| Dense (FC) | Final classification layers |

Key concepts:

  • Filters/Kernels: small matrices that slide across the image
  • Stride: how many pixels the filter moves each step
  • Padding: add zeros around edges to preserve spatial dimensions
  • Transfer learning: use pre-trained networks (ResNet, VGG) and fine-tune
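Kernel size, stride, and padding from the list above together determine the output size of a convolution or pooling layer along each spatial dimension. The standard formula is floor((n + 2p - k) / s) + 1; the helper name below is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

same = conv_output_size(32, kernel_size=3, stride=1, padding=1)    # padding preserves size
shrunk = conv_output_size(32, kernel_size=3)                       # no padding: size drops
pooled = conv_output_size(32, kernel_size=2, stride=2)             # 2x2 pooling halves size
```

With a 3x3 kernel and stride 1, one pixel of zero padding keeps a 32-pixel side at 32, which is what "preserve spatial dimensions" means in practice.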

ELI5: CNNs process images the same way your brain does — in a hierarchy of increasing complexity. The first layer learns to detect simple edges (a horizontal line, a diagonal). The second layer combines edges into shapes (a corner, a curve). Deeper layers combine shapes into recognizable objects (an ear, a wheel). By the time you reach the final layer, the network has learned to recognize “cat” or “car” from the combinations of low-level features it detected. Transfer learning exploits this: a network pre-trained on millions of images already knows how to detect edges and shapes — you just need to teach it the last step for your specific task.

Recurrent Neural Networks (RNNs)

For sequential data — text, time series, speech.

Standard RNN:
  x₁ → [h₁] → x₂ → [h₂] → x₃ → [h₃] → output
         ↓              ↓              ↓
    hidden state flows forward through time

| Variant | Solves | How |
|---|---|---|
| Vanilla RNN | - | Simple recurrence, short memory |
| LSTM | Vanishing gradient | Gates (forget, input, output) control memory |
| GRU | Vanishing gradient | Simplified LSTM (reset + update gates), faster |
| Bidirectional | Context from future | Processes sequence forwards AND backwards |

Vanishing Gradient Problem: In deep networks / long sequences, gradients shrink to ~0 during backpropagation, preventing learning of long-range dependencies. LSTM/GRU solve this with gating mechanisms.

ELI5: RNNs have memory — they pass a “hidden state” from one time step to the next, like a person reading a sentence who remembers what they’ve read so far. The problem is plain RNNs have terrible long-term memory: by the time they reach word 50 of a sentence, they’ve nearly forgotten word 1. LSTM fixes this with explicit memory gates — a “forget gate” that decides what to erase, an “input gate” that decides what to store, and an “output gate” that decides what to share. Think of LSTM as a person with a notepad: they actively decide what to write down, what to cross out, and what to read back when needed.
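To make the "hidden state" idea concrete, here is a toy scalar RNN, assuming a tanh activation and hand-picked weights (real RNNs use weight matrices learned during training). It feeds a single impulse at the first time step and shows how quickly its trace in the hidden state fades.

```python
import math

def rnn_states(inputs, w_x, w_h, b, h0=0.0):
    """Return the hidden state after each time step of a scalar RNN."""
    h = h0
    states = []
    for x in inputs:
        # New state mixes the current input with the previous state
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Only the first step carries a signal; later steps see zeros
states = rnn_states([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
```

Each later state is smaller than the one before it, which is the short-memory behaviour that LSTM and GRU gates are designed to counter.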


Training Concepts

Key Training Parameters

| Parameter | What It Controls |
|---|---|
| Epochs | Number of complete passes through training data |
| Batch size | Number of samples processed before weight update |
| Learning rate | Step size for weight updates (most important!) |
| Mini-batch | Compromise between batch (all data) and stochastic (1 sample) |

Learning Rate Impact:
  Too high:    ╱╲╱╲╱╲  oscillates, may diverge
  Just right:  ╲  loss converges smoothly
  Too low:     ╲___________________  very slow convergence
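The three learning-rate regimes can be reproduced on a toy problem. This sketch minimizes f(w) = w², whose gradient is 2w, with plain gradient descent; the function and the specific learning rates are illustrative choices, not values from any real training job.

```python
def descend(lr, steps=20, w=1.0):
    """Run gradient descent on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2
        w = w - lr * grad     # w_new = w_old - lr * gradient
    return w

too_high = descend(lr=1.1)     # overshoots each step, magnitude grows
just_right = descend(lr=0.1)   # shrinks steadily toward the minimum at 0
too_low = descend(lr=0.001)    # barely moves in 20 steps
```

The minimum is at w = 0, so the distance of each result from zero shows how well that learning rate worked.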

Overfitting vs Underfitting

                    Training Error    Validation Error
                    ──────────────    ────────────────
Underfitting:       HIGH              HIGH
  → Model too simple, can't learn patterns

Good fit:           LOW               LOW
  → Model generalizes well

Overfitting:        VERY LOW          HIGH
  → Model memorized training data, fails on new data
Error
  │
  │  ╲  validation error
  │   ╲         ╱────────
  │    ╲───────╱
  │     ╲  ╱   ← sweet spot
  │      ╲╱
  │       ╲
  │        ╲ training error
  │         ╲___________
  └──────────────────────── Model Complexity
       underfit   │   overfit

ELI5: Overfitting is like a student who memorized every question and answer from last year’s exam verbatim. They ace the practice test (low training error) but fail the real exam when the questions are slightly different (high validation error) — because they memorized, not understood. Underfitting is the student who barely studied and can’t answer even the practice questions (high error on both). The goal is a student who genuinely learned the concepts and can handle questions they’ve never seen before.

Regularization Techniques

| Technique | How It Works | When to Use |
|---|---|---|
| L1 (Lasso) | Adds $\lambda \sum \lvert w_i \rvert$ to loss | Feature selection (drives weights to 0) |
| L2 (Ridge) | Adds $\lambda \sum w_i^2$ to loss | Reduce all weights (prevents large weights) |
| Elastic Net | L1 + L2 combined | Best of both |
| Dropout | Randomly zero out neurons during training | Deep networks, prevents co-adaptation |
| Early Stopping | Stop training when validation error rises | Universal, simple |
| Data Augmentation | Create more training data (flip, rotate, crop) | Image models |
| Batch Normalization | Normalize layer inputs | Faster training, some regularization |

L1 vs L2 Regularization:

L1 (Lasso):          L2 (Ridge):
  Sparse weights       Small weights
  Some weights → 0     All weights shrink
  Feature selection    Weight decay
  Diamond constraint   Circle constraint

ELI5: Regularization is like putting constraints on a student’s study habits: “you can study, but you can’t just memorize individual answers.” L1 is strict — it forces the model to pick the most important features and set irrelevant ones to exactly zero (like telling the student they can only keep 10 key facts). L2 is gentler — it keeps all features but shrinks large weights down so no single feature dominates (like telling the student they can keep all their notes but must write them smaller). Dropout is like randomly covering some of the student’s notes during practice so they can’t rely on any one shortcut.
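The L1 and L2 penalty terms from the table are simple to compute by hand. The sketch below assumes a small made-up weight vector and λ = 0.1, and shows how the penalty is added to the data loss.

```python
def l1_penalty(weights, lam):
    """lambda * sum(|w_i|)"""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """lambda * sum(w_i ** 2)"""
    return lam * sum(w * w for w in weights)

weights = [3.0, -0.5, 0.0]   # illustrative values
data_loss = 1.0              # placeholder for the unregularized loss

loss_l1 = data_loss + l1_penalty(weights, lam=0.1)  # penalty 0.1 * 3.5
loss_l2 = data_loss + l2_penalty(weights, lam=0.1)  # penalty 0.1 * 9.25
```

Note how the large weight (3.0) dominates the L2 penalty because it is squared, while L1 charges every unit of weight the same price, which is why L1 is willing to push small weights all the way to zero.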


Evaluation Metrics

Classification Metrics

Confusion Matrix:
                    Predicted
                 Positive  Negative
Actual Positive [  TP    |   FN   ]
Actual Negative [  FP    |   TN   ]

TP = True Positive   (correctly predicted positive)
FP = False Positive  (incorrectly predicted positive) → "Type I error"
FN = False Negative  (incorrectly predicted negative) → "Type II error"
TN = True Negative   (correctly predicted negative)

ELI5: The confusion matrix has an unfortunately scary name but it’s actually just a tally sheet. Make two columns (Predicted Positive, Predicted Negative) and two rows (Actually Positive, Actually Negative), then count how many predictions fell in each box. The diagonal (TP and TN) is where your model got things right. The off-diagonal (FP and FN) is where it was wrong. Every other classification metric — precision, recall, F1, accuracy — is just a different arithmetic combination of these four numbers, chosen based on which type of mistake is more costly.

| Metric | Formula | When to Prioritize |
|---|---|---|
| Accuracy | $\frac{TP+TN}{TP+TN+FP+FN}$ | Balanced classes |
| Precision | $\frac{TP}{TP+FP}$ | Cost of false positives is high (spam filter) |
| Recall (Sensitivity) | $\frac{TP}{TP+FN}$ | Cost of false negatives is high (cancer detection) |
| F1 Score | $2 \times \frac{Precision \times Recall}{Precision + Recall}$ | Balance precision & recall |
| Specificity | $\frac{TN}{TN+FP}$ | True negative rate |
| AUC-ROC | Area under ROC curve | Overall classifier quality, threshold-independent |

ROC Curve:
  True Positive Rate (Recall)
  1.0 ┌─────────────────┐
      │         ╱───────│  Perfect (AUC=1.0)
      │       ╱         │
      │     ╱           │  Good (AUC=0.8-0.9)
      │   ╱   ╱─────── │
      │  ╱  ╱           │  Random (AUC=0.5)
      │╱ ╱              │
  0.0 └─────────────────┘
      0.0    FPR    1.0

AUC = 1.0 → perfect classifier
AUC = 0.5 → random guessing

Exam tip: “Which metric?” questions are common.

  • Medical diagnosis → Recall (don’t miss diseases)
  • Spam filter → Precision (don’t block real emails)
  • Balanced → F1 Score
  • Overall ranking → AUC-ROC
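Since every metric above is arithmetic on the four confusion-matrix counts, a short helper makes the relationships easy to check. The counts in the example are invented to show an imbalanced dataset (100 positives, 900 negatives).

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }

# Illustrative counts: 100 actual positives, 900 actual negatives
m = classification_metrics(tp=80, fp=10, fn=20, tn=890)
```

Here accuracy is 0.97 even though one in five positives is missed (recall 0.80), which is why accuracy alone is a poor guide on imbalanced classes.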

Regression Metrics

| Metric | Formula | Interpretation |
|---|---|---|
| MSE | $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$ | Penalizes large errors more |
| RMSE | $\sqrt{MSE}$ | Same unit as target variable |
| MAE | $\frac{1}{n}\sum\lvert y_i - \hat{y}_i \rvert$ | Average absolute error |
| R² | $1 - \frac{SS_{res}}{SS_{tot}}$ | Proportion of variance explained (1.0 = perfect) |
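A small worked example of the four regression metrics, using only the standard library. The target and prediction values are made up so the arithmetic is easy to follow by hand.

```python
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [y - p for y, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)     # total sum of squares
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae,
            "r2": 1 - ss_res / ss_tot}

# Errors are 1, 0, -1, 0, so SS_res = 2; mean(y) = 6, so SS_tot = 20
r = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
```

This gives MSE = 0.5, MAE = 0.5, RMSE = √0.5, and R² = 1 - 2/20 = 0.9.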

Clustering Metrics

| Metric | What It Measures |
|---|---|
| Silhouette Score | How similar an object is to its own cluster vs others (-1 to 1) |
| Davies-Bouldin Index | Average similarity ratio between clusters (lower = better) |
| Inertia / Distortion | Sum of squared distances to centroids (for elbow method) |

Ensemble Methods

Bagging (Bootstrap Aggregating)

Training Data
    │
    ├─ Bootstrap Sample 1 → Model 1 ─┐
    ├─ Bootstrap Sample 2 → Model 2 ─┤→ Average / Vote → Final Prediction
    └─ Bootstrap Sample 3 → Model 3 ─┘
  • Reduces variance (overfitting)
  • Example: Random Forest (bagging of decision trees)

Boosting

Training Data
    │
    └─ Model 1 → errors → Model 2 → errors → Model 3
                (focus on          (focus on
                 mistakes)          remaining mistakes)
                         ↓
                  Weighted sum → Final Prediction
  • Reduces bias (underfitting)
  • Examples: XGBoost, AdaBoost, LightGBM, CatBoost

Bagging = parallel models, reduce variance. Boosting = sequential models, reduce bias.

ELI5: Bagging is like asking 100 random people on the street the same question and taking the average answer — no one person has to be an expert, but the crowd wisdom is surprisingly accurate. Boosting is more like a relay tutoring session: tutor 1 teaches you everything they can, then tutor 2 focuses specifically on what tutor 1 couldn’t explain, then tutor 3 handles what’s still confusing. Each round targets the remaining weaknesses, so the final combined knowledge is much stronger than any single tutor.
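The two building blocks of bagging, bootstrap sampling and vote aggregation, fit in a few lines. This is only a sketch of those two steps: it does not train any models, and the helper names and labels are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (some repeat, some are left out)."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Return the label predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
data = list(range(10))
samples = [bootstrap_sample(data, rng) for _ in range(3)]  # one per model

# Pretend three models trained on those samples voted like this
vote = majority_vote(["fraud", "ok", "fraud"])
```

For regression, the vote is replaced by an average of the models' numeric predictions.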


SageMaker Training & Tuning Tools

Automatic Model Tuning (AMT)

Finds optimal hyperparameters automatically.

| Strategy | How It Works | When to Use |
|---|---|---|
| Random Search | Random combinations | Quick exploration, many params |
| Bayesian Optimization | Uses past results to guide next trial | Default, most efficient |
| Hyperband | Early-stop poor configs, allocate resources to promising ones | Large search spaces |
| Grid Search | Try all combinations | Small search spaces |

Bayesian Optimization:
  Trial 1: lr=0.01, depth=5  → score=0.72
  Trial 2: lr=0.05, depth=3  → score=0.78   ← model suggests nearby region
  Trial 3: lr=0.04, depth=4  → score=0.81   ← converging on optimum
  ...

SageMaker Autopilot (AutoML)

  • Automatically explores algorithms, preprocessing, and hyperparameters
  • Generates candidate notebooks showing what it tried
  • Supports: classification, regression, time series forecasting
  • Outputs interpretable leaderboard

SageMaker Experiments

  • Track and compare training runs
  • Log metrics, parameters, artifacts per trial
  • Visualize comparisons across experiments

SageMaker Debugger

  • Monitors training in real-time
  • Detects: vanishing gradients, exploding gradients, overfitting, class imbalance
  • Built-in rules: trigger alerts or stop training automatically
  • Profiling: CPU/GPU utilization, memory, I/O bottlenecks

SageMaker Model Registry

Model Registry:
┌────────────────────────────────────────────┐
│  Model Group: "fraud-detection"            │
│                                            │
│  Version 1: accuracy=0.85  [Rejected]      │
│  Version 2: accuracy=0.91  [Approved] ← ─ │─── Deploy
│  Version 3: accuracy=0.89  [Pending]       │
│                                            │
│  Metadata: metrics, lineage, approval      │
└────────────────────────────────────────────┘
  • Version control for trained models
  • Approval workflows (PendingManualApproval → Approved or Rejected)
  • Model lineage tracking (what data/code produced this model)
  • Integrates with SageMaker Pipelines for CI/CD

Distributed Training at Scale

SageMaker Distributed Training

| Strategy | What It Distributes | When to Use |
|---|---|---|
| Data Parallelism | Data split across GPUs, each has full model | Model fits in one GPU, data is huge |
| Model Parallelism | Model split across GPUs | Model too large for one GPU (LLMs) |

Data Parallelism:
  GPU 1: Full Model + Data Batch 1  ─┐
  GPU 2: Full Model + Data Batch 2  ─┤→ Sync gradients → Update model
  GPU 3: Full Model + Data Batch 3  ─┘

Model Parallelism:
  GPU 1: Layers 1-10   ──→  GPU 2: Layers 11-20  ──→  GPU 3: Layers 21-30
  (pipeline parallelism: overlap computation)

ELI5: Data parallelism is giving each worker a different piece of a jigsaw puzzle to solve simultaneously — everyone has the same picture on the box lid (the full model), just different puzzle pieces (data batches). They all work in parallel and then compare notes to agree on the solution. Model parallelism is used when the puzzle box itself is too big for one worker to hold — so you split the box between workers, each holding a different section of it. Data parallelism is the default; model parallelism is needed only for models too large to fit in a single GPU’s memory, like large language models.
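The "sync gradients" step in data parallelism amounts to averaging the gradients each worker computed on its own batch, then applying one shared update. The sketch below simulates this with plain lists; the gradient values are made up, and real frameworks do the averaging with an all-reduce across devices.

```python
def average_gradients(worker_grads):
    """Element-wise mean of the gradient vectors from all workers."""
    n = len(worker_grads)
    return [sum(g) / n for g in zip(*worker_grads)]

def sgd_step(weights, grads, lr):
    """Apply the same update on every replica so the models stay identical."""
    return [w - lr * g for w, g in zip(weights, grads)]

# One gradient vector per GPU, each from a different data batch
worker_grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
avg = average_gradients(worker_grads)
new_weights = sgd_step([1.0, 1.0], avg, lr=0.1)
```

Because every replica starts from the same weights and applies the same averaged gradient, all copies of the model remain in sync after each step.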

Training Infrastructure

| Feature | Purpose |
|---|---|
| SageMaker Training Compiler | Optimize DL model compilation for faster training (up to 50%) |
| Warm Pools | Keep instances alive between training jobs (reduce startup time) |
| Checkpointing | Save model state periodically (resume on failure) |
| Managed Spot Training | Use Spot Instances (up to 90% cost savings, needs checkpointing) |
| Elastic Fabric Adapter (EFA) | High-bandwidth, low-latency networking for distributed training |
| SageMaker HyperPod | Managed infrastructure for foundation model training, auto-healing |

SageMaker Instance Types for Training

| Prefix | Type | Use Case |
|---|---|---|
| ml.m5 | General purpose | Small models, preprocessing |
| ml.c5 | Compute optimized | CPU-intensive training |
| ml.p3/p4d | GPU (NVIDIA V100/A100) | Deep learning training |
| ml.g4dn/g5 | GPU (T4/A10G) | Inference, smaller training |
| ml.trn1 | AWS Trainium | Cost-effective DL training |
| ml.inf1/inf2 | AWS Inferentia | High-throughput inference |

SageMaker Clarify — Bias & Explainability

Post-Training Bias Metrics

| Metric | What It Measures |
|---|---|
| Disparate Impact (DI) | Ratio of positive outcomes between groups |
| Difference in Positive Proportions in Predicted Labels (DPPL) | Difference in positive prediction rates |
| Accuracy Difference | Accuracy gap between groups |
| Treatment Equality | Difference between groups in the ratio of false negatives to false positives |
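Two of these metrics can be computed directly from per-group predictions. The sketch below assumes binary predictions (1 = positive outcome), puts the disadvantaged group in the numerator of the ratio, and uses the 0.8 flagging threshold listed under Key Formulas; the group data and function names are invented for illustration.

```python
def positive_rate(predictions):
    """Share of predictions that are the positive outcome."""
    return sum(predictions) / len(predictions)

def disparate_impact(preds_d, preds_a):
    """Ratio of positive rates: disadvantaged group over advantaged group."""
    return positive_rate(preds_d) / positive_rate(preds_a)

def dppl(preds_d, preds_a):
    """Difference of positive rates: advantaged minus disadvantaged."""
    return positive_rate(preds_a) - positive_rate(preds_d)

group_a = [1] * 50 + [0] * 50   # 50% predicted positive
group_d = [1] * 30 + [0] * 70   # 30% predicted positive

di = disparate_impact(group_d, group_a)   # 0.3 / 0.5
gap = dppl(group_d, group_a)              # 0.5 - 0.3
flagged = di < 0.8
```

A ratio of 1.0 and a difference of 0.0 would mean both groups receive positive predictions at the same rate.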

Explainability

| Method | What It Does |
|---|---|
| SHAP values | Contribution of each feature to individual predictions |
| Partial Dependence Plots (PDPs) | Effect of one feature on predictions (marginal effect) |
| Feature importance | Global ranking of feature contributions |

SHAP Example (loan approval prediction):
  Base prediction: 0.5 (50% chance)
  + Income: high     → +0.25
  + Credit score: ok → +0.10
  - Debt ratio: high → -0.15
  = Final: 0.70 (70% chance approved)

ELI5: SHAP tells you WHY the model made a specific decision, not just what the decision was. For every individual prediction, it assigns a score to each input feature saying “this feature pushed the prediction up by X” or “this feature pulled it down by Y.” It’s like asking the model to show its work on an exam — you can see that for this particular loan application, income was the biggest positive factor and debt ratio was the biggest negative factor. This is essential for regulated industries (lending, healthcare) where you must be able to explain why a decision was made.
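The key property behind the loan example is additivity: the base value plus all per-feature contributions equals the model's output for that one prediction. The snippet below only reproduces that bookkeeping with the numbers from the example; it does not compute SHAP values, which requires the model itself.

```python
base_value = 0.50   # average model output before looking at any feature
contributions = {
    "income": +0.25,
    "credit_score": +0.10,
    "debt_ratio": -0.15,
}

# Additivity: base value + sum of contributions = this prediction
prediction = base_value + sum(contributions.values())

top_positive = max(contributions, key=contributions.get)
top_negative = min(contributions, key=contributions.get)
```

Sorting the contributions by absolute size gives the per-prediction explanation a reviewer would read first.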

Exam tip: Clarify integrates with Model Monitor for continuous bias monitoring in production. Alerts via CloudWatch if bias thresholds are exceeded.

Model Cards

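SageMaker Model Cards are a governance feature, separate from Clarify, for documenting a model in one place: model overview, intended use, training data, evaluation results, bias metrics, and limitations. Evaluation and bias results, including Clarify reports, can be recorded in the card. Used for regulatory compliance (GDPR, EU AI Act) and internal governance.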


SageMaker Processing

Separate from Training — runs data preprocessing, postprocessing, and evaluation scripts.

| Feature | Details |
|---|---|
| Purpose | Data transformation BEFORE training, model evaluation AFTER training |
| Frameworks | SKLearnProcessor, PySparkProcessor, custom containers |
| Timeout | Max 5 days |
| Not Training | Processing ≠ Training (common exam trap) |

Bring Your Own Container (BYOC)

SageMaker container directory structure:

/opt/ml/
├── input/
│   ├── config/
│   │   ├── hyperparameters.json    ← your hyperparams
│   │   └── resourceConfig.json     ← instance info (distributed)
│   └── data/
│       └── <channel_name>/         ← training data ("train", "validation")
├── model/                          ← save model here → uploaded to S3 as model.tar.gz
└── output/
    └── failure                     ← write error messages here

Training container MUST:
  1. Read data from /opt/ml/input/data/
  2. Read hyperparams from /opt/ml/input/config/hyperparameters.json
  3. Write model to /opt/ml/model/
  4. Exit 0 on success, non-zero on failure

Inference container MUST:
  1. Implement /ping (health check → 200)
  2. Implement /invocations (POST → predictions)
  3. Load model from /opt/ml/model/
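The training-container contract above can be sketched as a small entry script. This is a minimal illustration, not a real training program: the `epochs` hyperparameter, the `model.json` file name, and the placeholder "model" are all hypothetical, and the `prefix` argument exists only so the script can be tried outside a container. One detail worth knowing is that hyperparameter values arrive in `hyperparameters.json` as strings.

```python
import json
import os

def train(prefix="/opt/ml"):
    # 1. Hyperparameters are delivered as a JSON object of strings
    with open(os.path.join(prefix, "input/config/hyperparameters.json")) as f:
        hyperparams = json.load(f)
    epochs = int(hyperparams.get("epochs", "1"))   # hypothetical hyperparameter

    # 2. Each input channel appears as a directory under input/data/
    train_dir = os.path.join(prefix, "input/data/train")
    files = sorted(os.listdir(train_dir))

    # 3. Placeholder "model"; a real script would fit a model here
    model = {"epochs": epochs, "n_training_files": len(files)}
    model_dir = os.path.join(prefix, "model")
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.json"), "w") as f:
        json.dump(model, f)
    return model

def main(prefix="/opt/ml"):
    """Return 0 on success; on error write the message to output/failure."""
    try:
        train(prefix)
        return 0
    except Exception as exc:
        os.makedirs(os.path.join(prefix, "output"), exist_ok=True)
        with open(os.path.join(prefix, "output", "failure"), "w") as f:
            f.write(str(exc))
        return 1

# In a container entrypoint this would end with: sys.exit(main())
```

Everything written under the model directory is what gets packaged and uploaded when the job finishes.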

Script Mode vs BYOC vs Built-in

Custom code needed?
├── Standard framework (TF, PyTorch, sklearn)?
│   └── Script Mode (provide train.py, SM provides container)
├── Non-standard library or custom serving?
│   └── BYOC (build Docker → push to ECR)
└── No custom code?
    └── Built-in algorithm

Spot Training — Full Configuration

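Requirements:
  1. use_spot_instances = True
  2. max_wait >= max_run (max_wait covers waiting time plus training time)
  3. checkpoint_s3_uri = S3 path (optional in the API, but needed to resume after an interruption)
  4. Algorithm or training script must support checkpointing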

If interrupted:
  → Current checkpoint saved to S3
  → SM waits for new Spot capacity
  → Resumes from last checkpoint
  → You only pay for actual training time (not waiting)

If max_wait expires before completion:
  → Job stops with secondary status "MaxWaitTimeExceeded"
  → Increase max_wait, or switch to On-Demand
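A quick sanity check of these settings can be written as a plain function. The keys mirror the estimator parameter names listed above (`use_spot_instances`, `max_run`, `max_wait`, `checkpoint_s3_uri`); the helper itself and the bucket name are hypothetical, and the check only validates a dictionary, it does not call SageMaker.

```python
def check_spot_settings(settings):
    """Return a list of problems with a Spot training configuration."""
    problems = []
    if not settings.get("use_spot_instances"):
        problems.append("use_spot_instances must be True")
    max_run = settings.get("max_run")
    max_wait = settings.get("max_wait")
    if max_run is None or max_wait is None or max_wait < max_run:
        problems.append("max_wait must be set and at least max_run")
    if not str(settings.get("checkpoint_s3_uri", "")).startswith("s3://"):
        problems.append("checkpoint_s3_uri should be an S3 path")
    return problems

good = {
    "use_spot_instances": True,
    "max_run": 3600,      # seconds of training allowed
    "max_wait": 7200,     # seconds of waiting + training allowed
    "checkpoint_s3_uri": "s3://example-bucket/checkpoints/",  # hypothetical bucket
}
bad = dict(good, max_wait=1800)   # shorter than max_run
```

`check_spot_settings(good)` returns an empty list, while `bad` is reported because its wait budget is smaller than its run budget.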

HPO Warm Start

Resume from previous tuning job’s learned knowledge:

| Type | When |
|---|---|
| IDENTICAL_DATA_AND_ALGORITHM | Same problem: transfer all knowledge |
| TRANSFER_LEARNING | Similar problem: partial knowledge transfer |

Transfer Learning vs Incremental Learning

| Aspect | Transfer Learning | Incremental Learning |
|---|---|---|
| Task | New/different task | Same task |
| Data | Small target dataset | New batch for existing task |
| Source | Pre-trained model from different domain | Your own previous model |
| Example | ImageNet → medical X-ray classifier | Fraud model Jan → update with Feb data |
| Risk | Negative transfer | Catastrophic forgetting |

Is the TASK changing?
├── YES → Transfer Learning (fine-tune pre-trained model)
└── NO, same task but new data?
    ├── Small batch, keep existing → Incremental Learning
    │   (pass model.tar.gz as input channel)
    └── Distribution shifted significantly → Full Retrain

Cross-Validation

| Technique | When to Use |
|---|---|
| K-Fold (K=5 or 10) | Standard choice for most problems |
| Stratified K-Fold | Imbalanced classification (preserves class ratios) |
| Time-Series Split | Temporal data (train on past, test on future) |
| LOOCV | Very small datasets (<1000 samples) |

Exam key: NEVER use random K-Fold on time series (trains on future, tests on past = leakage). Use walk-forward validation.
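Walk-forward validation can be sketched as an index generator in which every training window ends before its test window begins. The function name and the expanding-window design are illustrative choices; libraries such as scikit-learn offer an equivalent splitter.

```python
def walk_forward_splits(n_samples, n_splits, test_size):
    """Return (train_indices, test_indices) pairs in time order."""
    splits = []
    for i in range(n_splits):
        test_end = n_samples - (n_splits - 1 - i) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for this many splits")
        train = list(range(test_start))           # everything before the test window
        test = list(range(test_start, test_end))  # the next block in time
        splits.append((train, test))
    return splits

splits = walk_forward_splits(n_samples=10, n_splits=3, test_size=2)
# train [0..3] -> test [4, 5]; train [0..5] -> test [6, 7]; train [0..7] -> test [8, 9]
```

No training index is ever later than a test index, so the model never sees the future.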


Loss Functions

| Function | Task | Notes |
|---|---|---|
| Binary Cross-Entropy | Binary classification | Log loss between probability and label |
| Categorical Cross-Entropy | Multi-class classification | Log loss across classes |
| MSE | Regression | Penalizes large errors |
| MAE | Regression | Robust to outliers |
| Huber Loss | Regression | MSE for small errors, MAE for large (best of both) |
| Focal Loss | Imbalanced classification | Down-weights easy examples, focuses on hard ones |
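Two of these losses are worth seeing in code: binary cross-entropy and Huber loss. The sketch uses the usual textbook definitions for a single example; the `eps` clipping and the default `delta = 1.0` are conventional choices, not values from this guide.

```python
import math

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Log loss for one example; p is the predicted probability of class 1."""
    p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def huber(error, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    a = abs(error)
    if a <= delta:
        return 0.5 * error ** 2
    return delta * (a - 0.5 * delta)

bce = binary_cross_entropy(1, 0.5)   # ln(2): the model was undecided
small = huber(0.5)                   # quadratic branch
large = huber(3.0)                   # linear branch, less than 0.5 * 3.0**2
```

The large error costs 2.5 under Huber versus 4.5 under the squared form, which is what makes it less sensitive to outliers.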

F-Beta Scores

| Score | Emphasis | Use When |
|---|---|---|
| F0.5 | Weights Precision higher | FP is costly (spam filter) |
| F1 | Equal weight | Balance precision and recall |
| F2 | Weights Recall higher | FN is costly (cancer detection) |
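All three scores come from the general F-beta formula, F_β = (1 + β²) · P · R / (β² · P + R), where β > 1 favours recall and β < 1 favours precision. The example values (precision 0.5, recall 1.0) are made up to show the effect.

```python
def fbeta(precision, recall, beta):
    """General F-beta score; beta = 1 gives the usual F1."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A model with perfect recall but mediocre precision
p, r = 0.5, 1.0
f_half = fbeta(p, r, beta=0.5)   # punishes the weak precision
f_one = fbeta(p, r, beta=1.0)
f_two = fbeta(p, r, beta=2.0)    # rewards the strong recall
```

For this model the scores rise from F0.5 to F1 to F2, because each step gives more weight to the metric it is good at.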

Key Formulas

Precision         = TP / (TP + FP)
Recall            = TP / (TP + FN)
F1                = 2 × Precision × Recall / (Precision + Recall)
Specificity       = TN / (TN + FP)
FPR               = 1 - Specificity
RMSE              = √(mean((y - ŷ)²))
R²                = 1 - (SS_res / SS_tot)
Disparate Impact  = rate_group_A / rate_group_B  (flag if < 0.8)
TF-IDF(t,d)       = TF(t,d) × log(N / DF(t))
Gradient descent:   w_new = w_old - lr × gradient(Loss)