Domain 3F: Model Training & Hyperparameter Tuning

Model Training & Hyperparameter Tuning

Exam Domain: 3 (ML Model Development)
Task: Select and apply correct training strategies, manage hyperparameters, and use SageMaker training features effectively

The Training Process (First Principles)

ELI5: Training a model is like teaching a student through repeated quizzes. The student makes a guess, you tell them how wrong they were, they adjust their thinking, and repeat — thousands of times until they get it right.

At its core, training is an optimization loop:

┌──────────────────────────────────────────────────────────┐
│                   TRAINING LOOP                          │
│                                                          │
│  ┌─────────┐    Forward     ┌──────────┐                 │
│  │  Input  │─── Pass ──────▶│   Loss   │                 │
│  │  Batch  │                │ Function │                 │
│  └─────────┘                └────┬─────┘                 │
│       ▲                          │                       │
│       │                    Backward Pass                 │
│  ┌────┴──────┐             (Backprop)                    │
│  │  Update   │                   │                       │
│  │  Weights  │◀──────────────────┘                       │
│  │  (SGD/    │   Gradients tell each weight               │
│  │   Adam)   │   which direction to move                 │
│  └───────────┘                                           │
└──────────────────────────────────────────────────────────┘

Four mechanical steps per iteration:

Forward pass — input flows through the network, produces a prediction
Loss computation — measure how wrong the prediction is (cross-entropy, MSE, etc.)
Backward pass (backpropagation) — compute gradient of loss w.r.t. every weight
Weight update — move weights in the direction that reduces loss: $w \leftarrow w - \eta \nabla L$
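The four steps can be sketched in plain Python. This is a minimal illustration, not a framework implementation: the model is a hypothetical one-feature linear regression, the gradients are written out by hand rather than computed by automatic differentiation, and `train_linear`, the data, and the default values are all assumptions made for the example.

```python
import random

def train_linear(xs, ys, lr=0.05, epochs=300, batch_size=4, seed=0):
    """Fit y ≈ w*x + b with mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    history = []  # full-dataset loss after each epoch
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # 1. Forward pass: predictions for this batch
            preds = [w * xs[i] + b for i in batch]
            # 2. Loss: per-example errors (MSE is the mean of their squares)
            errors = [p - ys[i] for p, i in zip(preds, batch)]
            # 3. Backward pass: gradient of MSE w.r.t. w and b
            grad_w = sum(2 * e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = sum(2 * e for e in errors) / len(batch)
            # 4. Weight update: w <- w - lr * gradient
            w -= lr * grad_w
            b -= lr * grad_b
        history.append(sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs))
    return w, b, history

xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [3.0 * x + 1.0 for x in xs]  # noiseless target: w=3, b=1
w, b, history = train_linear(xs, ys)
```

On this toy data the recorded loss should fall toward zero as `w` and `b` approach 3 and 1.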

Epochs, Batches, Iterations

Term	Meaning	Example
Dataset	All training examples	50,000 images
Batch (mini-batch)	Subset processed at once	32 images
Iteration (step)	One forward+backward+update	Process 1 batch
Epoch	One full pass through all data	All 50,000 images seen once

Relationship:

$$\text{total iterations} = \text{epochs} \times \frac{\text{dataset size}}{\text{batch size}}$$

Concrete example: 50,000 images, batch size 100, 10 epochs:

Batches per epoch = 50,000 / 100 = 500
Total iterations = 10 × 500 = 5,000
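The same arithmetic as a small helper. One assumption goes beyond the formula: when the dataset size is not a multiple of the batch size, the final partial batch is either kept (rounding up) or dropped (rounding down). The function name and the `drop_last` flag are illustrative.

```python
import math

def total_iterations(dataset_size, batch_size, epochs, drop_last=False):
    """Number of weight updates for a full training run."""
    if drop_last:
        per_epoch = dataset_size // batch_size            # discard the partial batch
    else:
        per_epoch = math.ceil(dataset_size / batch_size)  # keep the partial batch
    return epochs * per_epoch

example = total_iterations(50_000, 100, 10)
```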

Exam tip: Larger batch size → fewer iterations per epoch, but each iteration uses more memory. Smaller batch size → noisier gradients, which often act as a mild regularizer and can improve generalization.

Learning Rate: The Most Important Hyperparameter

ELI5: Learning rate is the size of steps you take walking downhill toward the valley (minimum loss). Too big and you leap over the valley entirely. Too small and you’re still walking at midnight when everyone else has arrived.

$$w \leftarrow w - \underbrace{\eta}_{\text{learning rate}} \cdot \nabla L$$

What happens at different learning rates:

Loss
│
│  Too HIGH (η=1.0): diverges, bounces wildly
│     ×         ×
│  ×    ×    ×    ×    ×   ← never converges
│
│  JUST RIGHT (η=0.01): converges smoothly
│  ○
│    ○
│      ○
│        ○○○○○○ ← converges to minimum
│
│  Too LOW (η=0.0001): converges but painfully slowly
│  ·
│   ·
│    ·           ← still moving after 10,000 steps
│
└─────────────────────────────────── Iterations
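The three regimes can be reproduced on a toy loss. The sketch below assumes $L(w) = w^2$ (gradient $2w$); the thresholds are specific to that curve, so the η values in the chart should be read as illustrative rather than universal. `descend` is a hypothetical helper.

```python
def descend(lr, steps, w0=1.0):
    """Run gradient descent on L(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

diverging = descend(lr=1.1, steps=50)       # each step multiplies w by -1.2
bouncing = descend(lr=1.0, steps=50)        # each step flips the sign of w
converged = descend(lr=0.01, steps=1000)    # each step multiplies w by 0.98
crawling = descend(lr=0.0001, steps=1000)   # each step multiplies w by 0.9998
```

On this curve each update multiplies `w` by `1 - 2*lr`, so any learning rate above 1.0 grows the weight, exactly 1.0 bounces between two points forever, and very small values shrink it only slightly per step.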

Learning Rate Schedules

Static learning rate is rarely optimal. Schedules adapt $\eta$ over training:

Schedule	Formula	When to Use
Step Decay	$\eta_t = \eta_0 \cdot \gamma^{\lfloor t/k \rfloor}$	Drop by factor $\gamma$ every $k$ steps; simple, effective
Exponential Decay	$\eta_t = \eta_0 \cdot e^{-\lambda t}$	Smooth continuous decay
Cosine Annealing	$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max}-\eta_{min})(1+\cos(\frac{t\pi}{T}))$	Best for fine-tuning, smooth
Warm-up + Decay	Linear increase for first $k$ steps, then decay	Transformers, large batch training

Warm-up intuition: Early in training, gradients are chaotic. Starting with a small learning rate lets the model stabilize before taking large steps.
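The schedules in the table translate directly into functions of the step `t`. This is a sketch under stated assumptions: the function names are made up for illustration, and the warm-up variant pairs linear warm-up with cosine decay, which is one common choice rather than the only one.

```python
import math

def step_decay(t, eta0, gamma, k):
    """Multiply the rate by gamma once every k steps."""
    return eta0 * gamma ** (t // k)

def exponential_decay(t, eta0, lam):
    """Smooth decay: eta0 * exp(-lam * t)."""
    return eta0 * math.exp(-lam * t)

def cosine_annealing(t, eta_min, eta_max, T):
    """Move from eta_max at t=0 to eta_min at t=T along a half cosine."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

def warmup_then_cosine(t, eta_max, warmup_steps, T, eta_min=0.0):
    """Linear warm-up for warmup_steps, then cosine decay until step T."""
    if t < warmup_steps:
        return eta_max * (t + 1) / warmup_steps
    return cosine_annealing(t - warmup_steps, eta_min, eta_max, T - warmup_steps)
```

For example, `warmup_then_cosine(t, eta_max=0.1, warmup_steps=10, T=100)` climbs linearly to 0.1 over the first ten steps and then decays to 0 by step 100.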

Why this matters for the exam: SageMaker’s built-in algorithms expose learning rate as a key hyperparameter, and Automatic Model Tuning can search for a well-performing value. Recognizing divergence (oscillating or increasing loss) vs. slow convergence (loss barely moving) tells you whether to lower or raise it.

Training Data Management

Train / Validation / Test Split: Why Three Sets?

┌───────────────────────────────────────────────────────┐
│                 FULL DATASET                          │
├──────────────────┬───────────────┬────────────────────┤
│   TRAIN (70%)    │   VAL (15%)   │    TEST (15%)      │
│                  │               │                    │
│  Model learns    │  Tune hyper-  │  Final honest      │
│  weights here    │  parameters   │  evaluation        │
│                  │  here         │  (touch ONCE)      │
└──────────────────┴───────────────┴────────────────────┘

Why not just train + test? If you tune hyperparameters based on test performance, the test set is effectively used for training decisions. Your reported test accuracy is optimistically biased — information leakage. The validation set is the “practice exam,” the test set is the “real exam you can only take once.”

Stratified Splitting for Imbalanced Data

For a dataset with 95% class A and 5% class B: a random split might put all class B samples in train and none in test. Stratified splitting preserves the class ratio in each split.

Always use stratified splits for classification with imbalanced classes
In scikit-learn: sklearn.model_selection.StratifiedKFold for cross-validation, or train_test_split(..., stratify=y) for a single split
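To show what stratification does mechanically, here is a dependency-free sketch that splits each class separately. `stratified_split` is a hypothetical helper written for this example; in practice the scikit-learn utilities are the better choice.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) preserving per-class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one test sample per class; a one-sample class ends up test-only.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["A"] * 95 + ["B"] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_b = sum(1 for i in test_idx if labels[i] == "B")
train_b = sum(1 for i in train_idx if labels[i] == "B")
```

With the 95/5 example, the test split receives 19 class A samples and 1 class B sample, so the 5% minority ratio is preserved on both sides.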

Time Series: Never Random Split!

Random splitting a time series causes temporal leakage — future data leaks into training.

WRONG (Random Split):
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│T1│V │T2│T3│V │T4│V │T5│T6│T7│  ← validation points
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘    scattered randomly — future leaks into past!

CORRECT (Chronological Split):
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│  TRAIN (past)  │ VAL │ TEST │  ← strict time boundary
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
     T=1 → T=7    T=8   T=9-10

Walk-forward validation (time series CV): train on T1-T3, validate T4; train T1-T4, validate T5; etc.
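Walk-forward validation with an expanding training window can be sketched as a generator of index splits. The helper name and parameters are assumptions; scikit-learn's `TimeSeriesSplit` offers a comparable, more complete implementation.

```python
def walk_forward_splits(n_samples, initial_train_size, horizon=1):
    """Yield (train_indices, validation_indices) in strict time order."""
    train_end = initial_train_size
    while train_end + horizon <= n_samples:
        train = list(range(train_end))                         # everything up to now
        validation = list(range(train_end, train_end + horizon))  # the next block
        yield train, validation
        train_end += horizon

# Six time steps (indices 0-5 stand for T1-T6), first training window of three.
splits = list(walk_forward_splits(n_samples=6, initial_train_size=3))
```

Every validation index is later than every training index in the same split, which is exactly the property a random split breaks.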

Data Augmentation

ELI5: If you only have 100 photos of cats, flip them, rotate them, zoom in — now you have 800 photos that teach the model “cats look like this regardless of angle.” You haven’t collected new cats, you’ve just looked at your existing cats differently.

Augmentation is both a data strategy AND a regularization technique (prevents memorizing exact training examples).

Data Type	Techniques
Image	Random flip (H/V), rotation, crop, color jitter, blur, Mixup (blend two images), Cutout (erase random patches), CutMix
Text	Synonym replacement, back-translation (EN→FR→EN), random insertion/deletion, EDA (Easy Data Augmentation)
Tabular	SMOTE (synthetic minority oversampling), Gaussian noise injection, random feature masking
Audio	Time stretch, pitch shift, noise addition, SpecAugment (mask frequency bands)

Exam tip: Augmentation applies ONLY to training data, never to validation or test data. Applying it to test data would change what you’re evaluating.
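A minimal sketch of the train-only rule, assuming images are plain nested lists and using a single augmentation (horizontal flip). `random_hflip` and `load_batch` are hypothetical names; real pipelines would use a library transform instead.

```python
import random

def random_hflip(image, rng, p=0.5):
    """Flip a 2-D image (list of rows) left-to-right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image

def load_batch(images, training, rng):
    """Augment only when training; validation and test data pass through unchanged."""
    if training:
        return [random_hflip(img, rng) for img in images]
    return images

rng = random.Random(0)
images = [[[1, 2, 3], [4, 5, 6]]]
val_batch = load_batch(images, training=False, rng=rng)
train_batch = load_batch(images, training=True, rng=rng)
```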

SageMaker Training Deep Dive

Training Job Lifecycle

┌─────────┐    ┌──────────────────────────────────────────┐    ┌─────────┐
│   S3    │    │          TRAINING INSTANCE               │    │   S3    │
│         │    │                                          │    │         │
│Training │───▶│  /opt/ml/input/data/<channel>/           │    │  Model  │
│  Data   │    │  /opt/ml/input/config/hyperparameters    │    │Artifacts│
│         │    │  /opt/ml/model/          (save here) ───▶│───▶│         │
│  Code   │───▶│  /opt/ml/output/data/   (logs/metrics)  │    │  Output │
│(ECR     │    │  /opt/ml/output/failure (error info)    │    │  Data   │
│ Image)  │    │                                          │    │         │
└─────────┘    └──────────────────────────────────────────┘    └─────────┘

Key directory structure:

Path	Purpose
`/opt/ml/input/data/<channel>/`	Training data (one subdir per channel)
`/opt/ml/input/config/hyperparameters.json`	Your hyperparameter values
`/opt/ml/model/`	Save model artifacts here — SageMaker uploads to S3 after training
`/opt/ml/output/data/`	Additional output files (metrics, logs)
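A custom training script typically reads from and writes to these paths. The sketch below is an assumption-laden skeleton rather than a real trainer: `run_training`, the `learning_rate` and `epochs` keys, and the `model.json` file name are invented for illustration. It casts hyperparameters explicitly because SageMaker writes the values in `hyperparameters.json` as strings. The root directory is a parameter so the script can be dry-run locally.

```python
import json
import os
import tempfile

def run_training(ml_root="/opt/ml", channel="train"):
    """Read hyperparameters and data from the SageMaker layout, then save a model."""
    hp_path = os.path.join(ml_root, "input", "config", "hyperparameters.json")
    with open(hp_path) as f:
        raw = json.load(f)
    # Values arrive as strings, so cast them explicitly.
    learning_rate = float(raw.get("learning_rate", "0.01"))
    epochs = int(raw.get("epochs", "1"))

    data_dir = os.path.join(ml_root, "input", "data", channel)
    data_files = sorted(os.listdir(data_dir))

    # Stand-in for real training: record what would have been used.
    model = {"learning_rate": learning_rate, "epochs": epochs, "files": data_files}

    model_dir = os.path.join(ml_root, "model")
    os.makedirs(model_dir, exist_ok=True)
    model_path = os.path.join(model_dir, "model.json")
    with open(model_path, "w") as f:
        json.dump(model, f)
    return model_path

# Local dry run against a temporary directory that mimics /opt/ml.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "input", "config"))
os.makedirs(os.path.join(root, "input", "data", "train"))
with open(os.path.join(root, "input", "config", "hyperparameters.json"), "w") as f:
    json.dump({"learning_rate": "0.05", "epochs": "3"}, f)
with open(os.path.join(root, "input", "data", "train", "part-0.csv"), "w") as f:
    f.write("x,y\n1,2\n")
saved_path = run_training(ml_root=root)
with open(saved_path) as f:
    saved_model = json.load(f)
```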

File Mode vs Pipe Mode

Feature	File Mode	Pipe Mode
How it works	Download all data to disk first	Stream data directly from S3
Startup time	Slow (wait for full download)	Fast (start training immediately)
Storage needed	Full dataset on instance	Minimal (one batch at a time)
Reuse across epochs	Yes (on disk)	Requires re-streaming each epoch
Best for	Small-medium datasets, multiple epochs	Large datasets, fewer epochs, faster iteration

ELI5 Pipe Mode: Instead of downloading the entire library before reading, you stream books one at a time directly from the library. You start reading immediately, but can’t go back without re-requesting.

Exam tip: Pipe mode = faster startup + lower storage cost. File mode = simpler, can re-read data. Choose Pipe mode when dataset is large and startup latency matters.

Instance Types for Training

Instance Family	Hardware	Best For
`ml.m5`	General-purpose CPU	Small models, preprocessing, simple algorithms
`ml.c5`	Compute-optimized CPU	CPU-heavy algorithms (XGBoost, classical ML)
`ml.p3`	NVIDIA V100 GPU	Deep learning (training), computer vision, NLP
`ml.p4d`	NVIDIA A100 GPU	Large models, LLMs, highest throughput
`ml.g4dn`	NVIDIA T4 GPU	Cost-effective GPU, inference, smaller DL models
`ml.trn1`	AWS Trainium chip	Deep learning training, cost-optimized
`ml.inf1`	AWS Inferentia chip	Inference only (not training)

Decision rule:

Algorithm uses backpropagation / neural networks?
├─ YES → GPU instance (p3/p4/g4dn) or Trainium (trn1)
└─ NO  → CPU instance
         ├─ XGBoost, LightGBM, tree methods → ml.m5 or ml.c5
         └─ Large tabular, classical ML      → ml.m5

Distributed Training

For datasets or models too large for one instance:

Data Parallelism — split data, replicate model:

┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│  Worker 1    │   │  Worker 2    │   │  Worker 3    │
│  Batch 1-100 │   │ Batch 101-200│   │ Batch 201-300│
│              │   │              │   │              │
│  Same Model  │   │  Same Model  │   │  Same Model  │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                   Aggregate Gradients
                   (AllReduce / Parameter Server)
                   Update shared model

Model Parallelism — split model across workers (for models too large for one GPU):

┌──────────────┐   ┌──────────────┐
│  Worker 1    │   │  Worker 2    │
│  Layers 1-50 │──▶│ Layers 51-100│
│              │   │              │
└──────────────┘   └──────────────┘

Strategy	When to Use
SageMaker Distributed Data Parallel (SMDDP)	Model fits on one GPU, want faster training with multiple GPUs
SageMaker Model Parallel (SMP)	Model too large for one GPU (LLMs, very deep networks)
Horovod	Open-source alternative to SMDDP, community ecosystem
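The core idea of synchronous data parallelism is that averaging per-worker gradients over equal-sized shards reproduces the full-batch gradient. The sketch below simulates that in a single process; `all_reduce_mean` is a hypothetical stand-in for a real AllReduce, and the one-weight model is an assumption chosen to keep the arithmetic visible.

```python
def gradient(w, xs, ys):
    """Gradient of mean squared error for y ≈ w*x on one shard of data."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(values):
    """Stand-in for AllReduce: every worker ends up with the mean gradient."""
    return sum(values) / len(values)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]
w = 0.5  # every worker holds the same copy of the model

# Three workers, two examples each.
shards = [(xs[i:i + 2], ys[i:i + 2]) for i in range(0, len(xs), 2)]
local_grads = [gradient(w, sx, sy) for sx, sy in shards]
synced_grad = all_reduce_mean(local_grads)
full_grad = gradient(w, xs, ys)
```

Because all workers apply the same averaged gradient, their model copies stay identical after each step.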

Managed Spot Training

ELI5: Like writing your essay on Google Docs instead of Notepad — if your laptop gets taken away (spot interruption), you pick up exactly where you left off because your progress was saved to the cloud.

┌──────────────────────────────────────────────────────┐
│                SPOT TRAINING FLOW                    │
│                                                      │
│  Start Job → Train → Checkpoint to S3 → Interrupted │
│                ↑                              │       │
│                └──── New Spot Instance ←──────┘       │
│                      Restore checkpoint               │
│                      Continue training               │
└──────────────────────────────────────────────────────┘

Cost savings: Up to 90% vs On-Demand instances
Requirement: Must implement checkpointing (save model state to S3 periodically)
SageMaker handles: Requesting spot capacity, detecting interruptions, restarting
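You provide: use_spot_instances=True and checkpoint_s3_uri in the training job config, plus checkpoint save/restore logic in the training script
Max wait time: Set max_wait ≥ max_run; max_wait covers time spent waiting for Spot capacity plus the training run itself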
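The checkpoint logic in the training script follows a save-then-resume pattern. The sketch below simulates it locally under several assumptions: `train_with_checkpoints` and its `interrupt_after` switch are hypothetical, the "training" is a counter, and the checkpoint is a JSON file. On SageMaker, the script would write to the local checkpoint directory (by default `/opt/ml/checkpoints`), which is synced with `checkpoint_s3_uri`.

```python
import json
import os
import tempfile

def train_with_checkpoints(checkpoint_dir, total_epochs, interrupt_after=None):
    """Resume from the latest checkpoint if present; save one after every epoch."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    ckpt_path = os.path.join(checkpoint_dir, "checkpoint.json")
    state = {"epoch": 0, "weight": 0.0}
    if os.path.exists(ckpt_path):          # restore after an interruption
        with open(ckpt_path) as f:
            state = json.load(f)
    epochs_run = 0
    while state["epoch"] < total_epochs:
        if interrupt_after is not None and epochs_run == interrupt_after:
            return state, False            # simulated Spot interruption
        state["weight"] += 1.0             # stand-in for one epoch of training
        state["epoch"] += 1
        epochs_run += 1
        tmp_path = ckpt_path + ".tmp"      # write then rename to avoid partial files
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, ckpt_path)
    return state, True

ckpt_dir = tempfile.mkdtemp()
first_state, first_done = train_with_checkpoints(ckpt_dir, 10, interrupt_after=4)
final_state, final_done = train_with_checkpoints(ckpt_dir, 10)
```

The second call picks up at epoch 4 instead of restarting from zero, which is what makes an interruption cheap.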

Exam tip: If a question asks how to reduce training costs by up to 90%, the answer is Managed Spot Training with checkpointing.

Incremental Training

Resume training from a previously trained model:

Provide a pre-trained model artifact as input to a new training job
Useful for: continuing interrupted training, fine-tuning with new data, iterative improvement
Supported by: Object Detection, Image Classification, Semantic Segmentation built-in algorithms

SageMaker Debugger

Real-time monitoring of training jobs without modifying training code:

Training Job
     │
     ├─▶ Debugger Hook ──▶ S3 (tensors, metrics)
     │                          │
     │                     Debugger Rules
     │                     (run as separate
     │                      processing job)
     │                          │
     │                     CloudWatch Alarms
     │                     SNS Notifications
     └─▶ Training continues (non-blocking)

Built-in Rules:

Rule	Detects
`VanishingGradient`	Gradients approaching zero — deep networks stop learning
`ExplodingTensor`	Tensors containing NaN or infinite values (weights/gradients growing uncontrollably)
`LossNotDecreasing`	Training loss plateaued — possible learning rate or data issue
`Overfit`	Val loss increasing while train loss decreasing
`Overtraining`	Model performance degrading with more training
`PoorWeightInitialization`	Bad initialization causing slow learning

SageMaker Profiler (part of Debugger):

Identifies hardware bottlenecks: CPU/GPU utilization, I/O wait, network
Flags common inefficiencies: DataLoader bottlenecks, small batch sizes, underutilized GPUs
Outputs a profiling report with actionable recommendations

Hyperparameters vs Parameters

ELI5: Parameters are what the student learns during class (knowledge, skills). Hyperparameters are how you set up the classroom — class size (batch size), teaching speed (learning rate), how many classes (epochs). The student discovers parameters; you decide hyperparameters.

Type	Examples	Who Sets It	When
Parameters (model weights)	Neural network weights, tree split values, SVM support vectors	Algorithm (learned from data)	During training
Hyperparameters	Learning rate, max_depth, num_layers, dropout rate, batch size	You (the practitioner)	Before training

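Key insight: Hyperparameters generally cannot be learned with backpropagation. Many are discrete (number of layers, batch size), and for the continuous ones the gradient of validation loss w.r.t. the hyperparameter would require differentiating through the entire training run. This is why hyperparameter optimization is a separate outer loop.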

Hyperparameter Optimization (HPO)

Manual Tuning

Try values based on domain knowledge/intuition. Does not scale. Only viable for 1-2 hyperparameters with clear prior knowledge.

Grid Search

Search every combination of a predefined grid:

learning_rate: [0.001, 0.01, 0.1]
batch_size:    [32, 64, 128]
max_depth:     [3, 5, 7]
dropout:       [0.2, 0.3, 0.5]

Total combinations: 3 × 3 × 3 × 3 = 81 training jobs

Pros: Exhaustive, finds the best combination within the grid
Cons: Curse of dimensionality. Adding one more hyperparameter with 5 values multiplies total jobs by 5; 6 hyperparameters with 5 values each require $5^6 = 15{,}625$ jobs.
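The grid above can be enumerated with `itertools.product` to confirm the job count. This is a sketch of the enumeration only; each dictionary would correspond to one training job.

```python
import itertools

grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
    "max_depth": [3, 5, 7],
    "dropout": [0.2, 0.3, 0.5],
}

names = list(grid)
# One dict per combination, e.g. {"learning_rate": 0.001, "batch_size": 32, ...}
combinations = [dict(zip(names, values)) for values in itertools.product(*grid.values())]
n_jobs = len(combinations)
```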

Random Search

Randomly sample from the hyperparameter distributions:

Grid Search coverage:       Random Search coverage:
┌─────────────────┐         ┌─────────────────┐
│ × × × × × × × │         │   ·   ·  · ·    │
│ × × × × × × × │         │ ·   ·    ·   ·  │
│ × × × × × × × │         │   · ·  ·    ·   │
│ × × × × × × × │         │ ·    ·   · ·    │
│ × × × × × × × │         │   ·   · ·    ·  │
└─────────────────┘         └─────────────────┘
Uniform grid — wastes        Covers the space more
budget on redundant          efficiently; likely to
columns when only one        find better values with
hyperparameter matters       the same budget

Why random is better (Bergstra & Bengio 2012): Most hyperparameters have regions where they matter and regions where they don’t. Grid search wastes budget on uninformative combinations. Random search covers the important dimensions more efficiently.
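A minimal random search loop looks like the following. Everything here is illustrative: `toy_objective` is a made-up score that depends on learning rate and dropout but ignores batch size, and the sampling ranges are assumptions.

```python
import math
import random

def sample_config(rng):
    """Draw one configuration; learning rate is sampled log-uniformly."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),   # 0.0001 to 0.1
        "batch_size": rng.choice([32, 64, 128]),
        "dropout": rng.uniform(0.2, 0.5),
    }

def toy_objective(cfg):
    """Hypothetical score: peaks at lr=0.01, dropout=0.3; batch_size is irrelevant."""
    return -abs(math.log10(cfg["learning_rate"]) + 2) - abs(cfg["dropout"] - 0.3)

def random_search(objective, n_trials, seed=0):
    """Evaluate n_trials independent random configurations and keep the best."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

best_cfg, best_score = random_search(toy_objective, n_trials=50)
```

Every trial tries a fresh learning rate and dropout value, whereas a 3-value grid would only ever test three distinct learning rates no matter how many jobs it ran.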

Bayesian Optimization

ELI5: Instead of blindly trying random combinations, Bayesian optimization learns from previous experiments: “learning rate 0.1 was good, 0.5 was terrible — so let’s try something near 0.1 next.” It builds a map of the performance landscape and chooses the next experiment intelligently.

Iteration 1: Try (lr=0.1, depth=3) → val_acc = 0.82
Iteration 2: Try (lr=0.5, depth=3) → val_acc = 0.61
Iteration 3: Surrogate model says: lr near 0.1 is promising
             Try (lr=0.08, depth=5) → val_acc = 0.87
Iteration 4: Surrogate model updated, tries (lr=0.09, depth=6) → 0.89
...

Mechanics:

Fit a surrogate model (Gaussian Process, Random Forest) on observed (hyperparams → performance) pairs
Use an acquisition function to decide the next point to evaluate:
- Expected Improvement (EI): choose point with highest expected gain over current best
- Upper Confidence Bound (UCB): balance exploration vs exploitation
Evaluate the actual training job at that point, update the surrogate, repeat

HPO Method	Jobs Needed	Best For
Grid Search	Exponential in #params	≤ 2 hyperparameters, small grid
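Random Search	Any fixed budget; does not grow with #params	Quick baseline, many hyperparameters
Bayesian	Usually fewer than random search for similar quality, but jobs run mostly sequentially	When jobs are expensive, want efficiency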
Hyperband	Sub-linear + early stopping	When early performance predicts final performance

Hyperband

ELI5: Like American Idol auditions — everyone sings 30 seconds, cut the worst half, give survivors 2 minutes, cut again, give finalists the full song. Same budget, more winners identified early.

Successive Halving:

Round 1: 16 candidates × 1 epoch each   → keep best 8
Round 2: 8 candidates  × 2 epochs each  → keep best 4
Round 3: 4 candidates  × 4 epochs each  → keep best 2
Round 4: 2 candidates  × 8 epochs each  → keep best 1
                                           ↑ final winner
Total epochs used: 16×1 + 8×2 + 4×4 + 2×8 = 64
vs. 16 full-run = 16×8 = 128 epochs

Assumption: Early performance correlates with final performance. Works well for deep learning, less well for methods that are slow to differentiate.
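Successive halving, the building block of Hyperband, can be sketched in a few lines. The sketch follows the accounting of the example above (each round's epochs are counted separately) and uses a hypothetical `evaluate` function in place of real training; full Hyperband additionally runs several such brackets with different starting budgets.

```python
import math

def successive_halving(candidates, evaluate, min_epochs=1, eta=2):
    """Keep the top 1/eta of candidates each round while multiplying the budget by eta."""
    survivors = list(candidates)
    epochs = min_epochs
    total_epochs = 0
    while len(survivors) > 1:
        scored = [(evaluate(c, epochs), c) for c in survivors]
        total_epochs += epochs * len(survivors)
        scored.sort(key=lambda pair: pair[0], reverse=True)   # higher score is better
        keep = max(1, len(survivors) // eta)
        survivors = [c for _, c in scored[:keep]]
        epochs *= eta
    return survivors[0], total_epochs

def evaluate(learning_rate, epochs):
    """Toy score: best near lr=0.01, improving slightly with more epochs."""
    return -abs(math.log10(learning_rate) + 2) - 1.0 / epochs

# 16 learning rates spaced evenly on a log scale from 1e-4 to 1e-1.
candidates = [10 ** (-4 + 3 * i / 15) for i in range(16)]
winner, total_epochs = successive_halving(candidates, evaluate)
```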

SageMaker Automatic Model Tuning (AMT)

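# Example configuration (SageMaker Python SDK)
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# `estimator` is assumed to be an already-configured SageMaker Estimator
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='validation:accuracy',
    objective_type='Maximize',          # or 'Minimize'
    hyperparameter_ranges={
        'learning_rate': ContinuousParameter(0.0001, 0.1, scaling_type='Logarithmic'),
        'num_layers':    IntegerParameter(1, 10),
        'optimizer':     CategoricalParameter(['adam', 'sgd', 'rmsprop']),
    },
    max_jobs=50,                       # total training-job budget
    max_parallel_jobs=5,               # keep well below max_jobs for Bayesian search
    strategy='Bayesian',               # 'Random', 'Bayesian', 'Hyperband', 'Grid'
    early_stopping_type='Auto',
)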

Key concepts:

Setting	Meaning
`objective_metric_name`	Metric your training script emits (must match exactly)
`max_jobs`	Total training jobs budget
`max_parallel_jobs`	Concurrent jobs (more parallel = less info for Bayesian guidance)
`strategy = 'Bayesian'`	Default; most efficient for expensive jobs
`early_stopping_type = 'Auto'`	Stop jobs early if not improving

Warm start: Resume a previous tuning job. Two types:

IDENTICAL_DATA_AND_ALGORITHM — same setup, add more jobs
TRANSFER_LEARNING — changed some ranges/algorithm, use prior as guidance

Logarithmic scaling: Use for parameters spanning orders of magnitude (e.g., learning rate from 0.0001 to 0.1). Ensures equal sampling density at each scale.
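A quick calculation shows why the scaling type matters. The helper below is a hypothetical illustration that computes, for a uniform draw on either a linear or a logarithmic scale, what fraction of samples falls below a threshold.

```python
import math

def fraction_below(threshold, low, high, scaling):
    """Share of uniformly drawn samples in [low, high] that fall below threshold."""
    if scaling == "Linear":
        return (threshold - low) / (high - low)
    if scaling == "Logarithmic":
        return (math.log(threshold) - math.log(low)) / (math.log(high) - math.log(low))
    raise ValueError(f"unknown scaling: {scaling}")

# How often would a sampler try learning rates between 0.0001 and 0.001?
linear_share = fraction_below(0.001, 0.0001, 0.1, "Linear")
log_share = fraction_below(0.001, 0.0001, 0.1, "Logarithmic")
```

With linear scaling under 1% of samples land in the lowest decade, while logarithmic scaling gives each of the three decades one third of the samples.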

Exam tip: For AMT, max_parallel_jobs should be much less than max_jobs. If max_parallel_jobs = max_jobs, all jobs run simultaneously and Bayesian optimization can’t use feedback from earlier jobs — it degrades to random search.

Regularization Techniques

Dropout

ELI5: During practice, randomly mute different teammates each game. Everyone learns to do their job independently instead of relying on specific partners. On game day (inference), everyone plays — and they perform better because of the independent practice.

Training:                         Inference:
  Input                             Input
    │                                 │
  [h1][h2][h3][h4]                 [h1][h2][h3][h4]
    │   ✗   │   ✗   ← 50% dropped     │   │   │   │  ← all active
  [h5][  ][h6][  ]                 [h5][h6][h7][h8]
    │       │                         │   │   │   │
  Output                            Output (scale by keep_prob)

Typical rates: 0.2–0.5 (drop 20–50% of neurons)
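Applied only during training. At inference, all neurons are active. Classic dropout scales outputs by the keep probability at inference; most modern frameworks instead use inverted dropout (scale kept activations by 1/keep_prob during training), so inference needs no rescaling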
Effect: Prevents co-adaptation, acts as implicit ensemble of $2^n$ networks
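A minimal sketch of the inverted-dropout variant that most frameworks implement: kept activations are scaled up by 1/keep_prob during training, so the expected activation is unchanged and inference is a plain pass-through. The function is a plain-Python illustration, not a framework API.

```python
import random

def dropout(activations, drop_rate, training, rng):
    """Inverted dropout: zero out units at random and rescale the survivors."""
    if not training or drop_rate == 0.0:
        return list(activations)          # inference: all units active, no scaling
    keep_prob = 1.0 - drop_rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
train_out = dropout([1.0] * 10000, drop_rate=0.5, training=True, rng=rng)
eval_out = dropout([1.0, 2.0, 3.0], drop_rate=0.5, training=False, rng=rng)
train_mean = sum(train_out) / len(train_out)
```

With a 0.5 drop rate each training output is either 0.0 or 2.0, and the mean stays close to the original activation of 1.0.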

Batch Normalization

For each mini-batch, normalize layer activations using the batch mean $\mu_B$ and variance $\sigma_B^2$: $\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, then scale and shift with learned parameters: $y = \gamma\hat{x} + \beta$

ELI5: Each layer keeps re-centering its inputs so the next layer always receives well-behaved data — not too big, not too small, centered around zero. Like a translator who converts everything to a standard format before passing it along.

Benefits: Smoother loss landscape, enables higher learning rates, reduces sensitivity to initialization
Placement: Usually before activation function, sometimes after
Training vs inference: Uses batch statistics during training, running mean/variance during inference
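The training/inference distinction can be made concrete with a single-feature sketch. Assumptions: `BatchNorm1D` is a hypothetical class, $\gamma$ and $\beta$ are fixed at 1 and 0 rather than learned, and running statistics use an exponential moving average with momentum 0.1.

```python
import math

class BatchNorm1D:
    """Batch normalization for one scalar feature (gamma and beta not trained here)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps

    def __call__(self, batch, training):
        if training:
            # Use statistics of the current mini-batch and update the running averages.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference: rely on statistics accumulated during training.
            mean, var = self.running_mean, self.running_var
        scale = math.sqrt(var + self.eps)
        return [self.gamma * (x - mean) / scale + self.beta for x in batch]

bn = BatchNorm1D()
train_out = bn([2.0, 4.0, 6.0, 8.0], training=True)
infer_out = bn([5.0], training=False)
```

Inference works on a single example because it no longer depends on the other members of the batch.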

Early Stopping

Monitor validation loss; stop training when it starts increasing:

Loss
│
│  Training loss ────────────────────────────────▶ (keeps decreasing)
│  ╲
│   ╲
│    ╲____________
│                 ╲_______
│  Validation     ↑       ╲___  ← best point      ╲ starts
│  loss           │            ╲____________________╲  increasing
│                 │                                    ← overfitting
│                 └─── EARLY STOP HERE (patience=5)
└──────────────────────────────────────────────────────── Epochs

Patience: Number of consecutive epochs without validation improvement to tolerate before stopping
Restore best weights: Save model checkpoint at the best validation epoch, restore at the end
Benefit: Nearly free regularization; the only extra setting to choose is the patience
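Patience-based early stopping reduces to a small counter. The class below is a sketch with hypothetical names; a real training loop would also save a checkpoint wherever `best_epoch` is updated, so the best weights can be restored.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch      # a checkpoint would be saved here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.57, 0.60, 0.65]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break
```

Here the best loss occurs at epoch 3, three non-improving epochs follow, and training stops at epoch 6 with the epoch-3 weights as the ones to keep.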

Weight Decay (L2 Regularization in Optimizers)

Add a term to the optimizer step that shrinks weights toward zero each step:

$$w \leftarrow w(1 - \eta\lambda) - \eta\nabla L$$

Equivalent to L2 regularization for plain SGD, but not for adaptive optimizers: Adam rescales an L2 gradient term by its per-parameter adaptive step size, whereas AdamW applies the decay directly to the weights (decoupled weight decay).
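The SGD equivalence is easy to check numerically. The sketch assumes the L2 penalty is written as $\frac{\lambda}{2}w^2$, so it contributes $\lambda w$ to the gradient; both function names are illustrative.

```python
def sgd_l2_step(w, grad, lr, lam):
    """SGD on loss + (lam/2)*w**2: the penalty adds lam*w to the gradient."""
    return w - lr * (grad + lam * w)

def sgd_weight_decay_step(w, grad, lr, lam):
    """Decoupled form: shrink the weight, then take the plain gradient step."""
    return w * (1 - lr * lam) - lr * grad

w, grad, lr, lam = 0.8, 0.3, 0.1, 0.01
via_l2 = sgd_l2_step(w, grad, lr, lam)
via_decay = sgd_weight_decay_step(w, grad, lr, lam)
```

Both forms give the same updated weight under plain SGD. Under Adam the `lam * w` term would pass through the adaptive scaling together with the gradient, which is where the two approaches diverge.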

Common Training Problems & Fixes

Symptom	Diagnosis	Fix
Training loss not decreasing	Learning rate too low, vanishing gradient, bug	Increase LR; check gradients; debug code
Training loss oscillating wildly	Learning rate too high	Reduce LR by 10×; use LR schedule
Training loss → NaN	Exploding gradients, bad data	Gradient clipping; check for inf/nan in data
Train loss low, val loss high	Overfitting	More data, dropout, L2, early stopping, less capacity
Both losses high	Underfitting	Bigger model, more features, fewer regularization constraints
GPU utilization < 50%	Data loading bottleneck	More DataLoader workers; Pipe mode; prefetching
Loss decreasing very slowly	Learning rate too low; poor LR schedule	Increase LR; use warm-up followed by decay
Validation loss better than train loss	Train augmentation/dropout active, val not	Expected and fine — model generalizes well

Why this matters for the exam: SageMaker Debugger automates detecting many of these — vanishing gradient, exploding tensor, loss not decreasing. Questions will test whether you can read symptoms and diagnose causes.

Transfer Learning

ELI5: Instead of training a doctor from kindergarten, hire a general doctor and spend 6 months teaching them your specialty. Most of their general medical knowledge transfers directly — you’re only teaching the specialized part.

Pre-trained Model (e.g., ResNet-50 on ImageNet):
┌──────────────────────────────────────────────────┐
│  Conv layers (edge/shape/texture detectors)      │
│  ┌────┐ ┌────┐ ┌────┐ ┌────┐  [FROZEN]          │
│  │ C1 │→│ C2 │→│ C3 │→│ C4 │  keep these weights│
│  └────┘ └────┘ └────┘ └────┘                    │
│                                                  │
│  Final classifier [REPLACED + RETRAINED]         │
│  ┌───────────────────┐                           │
│  │  FC → Softmax     │  your 5 categories        │
│  └───────────────────┘                           │
└──────────────────────────────────────────────────┘

Two Strategies

Strategy	Approach	When to Use
Feature Extraction	Freeze ALL pre-trained layers; only train new head	Very small dataset (< 1000 samples); similar domain
Fine-tuning	Unfreeze some/all layers; train with small LR (1e-4 to 1e-5)	Medium dataset; possibly different domain

Fine-tuning rules:

Unfreeze from the top (last layers) downward
Use a much smaller learning rate than original training (avoid destroying learned representations)
Earlier layers = more generic features (usually safe to keep frozen)
Later layers = more task-specific features (safe to fine-tune)

Transfer Learning in SageMaker

Image Classification, Object Detection, Semantic Segmentation algorithms support incremental training
JumpStart provides pre-trained model hubs for fine-tuning
Fine-tuning is a first-class feature: provide model_id, your dataset, config

Exam tip: “Small dataset + image/NLP task” → Transfer learning. “Similar domain” → Feature extraction (freeze more). “Different domain” → Fine-tuning (unfreeze more layers, use even smaller LR).

Quick Reference: Training Decisions

Want to reduce training cost?
├─ Up to 90% → Managed Spot Training + Checkpointing
└─ Smaller instance → Optimize batch size + Pipe Mode

Large dataset, slow startup?
└─ Use Pipe Mode instead of File Mode

Model too large for one GPU?
├─ Model Parallelism (SageMaker SMP)
└─ Need more throughput → Data Parallelism (SMDDP / Horovod)

Overfitting?
├─ Add Dropout (0.2–0.5)
├─ L2 / Weight decay
├─ Early Stopping
├─ More training data / Data Augmentation
└─ Reduce model capacity

Training loss not moving?
├─ Learning rate too low → increase
├─ Learning rate too high → reduce
└─ Check data pipeline (NaN values, wrong labels)

Want to find best hyperparameters efficiently?
├─ < 3 hyperparameters, cheap jobs → Grid or Random Search
├─ Expensive jobs, any count → Bayesian (SageMaker AMT)
└─ Very expensive jobs → Hyperband (AMT with Hyperband strategy)
