Domain 3F: Model Training & Hyperparameter Tuning
Table of Contents
- Model Training & Hyperparameter Tuning
- The Training Process (First Principles)
- Learning Rate: The Most Important Hyperparameter
- Training Data Management
- SageMaker Training Deep Dive
- Hyperparameters vs Parameters
- Hyperparameter Optimization (HPO)
- Regularization Techniques
- Common Training Problems & Fixes
- Transfer Learning
- Quick Reference: Training Decisions
Model Training & Hyperparameter Tuning
Exam Domain: 3 (ML Model Development)
Task: Select and apply correct training strategies, manage hyperparameters, and use SageMaker training features effectively
The Training Process (First Principles)
ELI5: Training a model is like teaching a student through repeated quizzes. The student makes a guess, you tell them how wrong they were, they adjust their thinking, and repeat — thousands of times until they get it right.
At its core, training is an optimization loop:
┌──────────────────────────────────────────────────────────┐
│ TRAINING LOOP │
│ │
│ ┌─────────┐ Forward ┌──────────┐ │
│ │ Input │─── Pass ──────▶│ Loss │ │
│ │ Batch │ │ Function │ │
│ └─────────┘ └────┬─────┘ │
│ ▲ │ │
│ │ Backward Pass │
│ ┌────┴──────┐ (Backprop) │
│ │ Update │ │ │
│ │ Weights │◀──────────────────┘ │
│ │ (SGD/ │ Gradients tell each weight │
│ │ Adam) │ which direction to move │
│ └───────────┘ │
└──────────────────────────────────────────────────────────┘
Four mechanical steps per iteration:
- Forward pass — input flows through the network, produces a prediction
- Loss computation — measure how wrong the prediction is (cross-entropy, MSE, etc.)
- Backward pass (backpropagation) — compute gradient of loss w.r.t. every weight
- Weight update — move weights in the direction that reduces loss: $w \leftarrow w - \eta \nabla L$
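The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration on a one-feature linear model with mean squared error, not how a deep learning framework implements backpropagation; the function name `train_linear` and the toy data are assumptions made for the example.

```python
def train_linear(xs, ys, lr=0.05, epochs=2000):
    """Fit y ~ w*x + b with plain gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    loss = None
    for _ in range(epochs):
        # 1. Forward pass: compute predictions with the current weights
        preds = [w * x + b for x in xs]
        # 2. Loss computation: mean squared error
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        # 3. Backward pass: gradient of the loss w.r.t. each weight
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # 4. Weight update: w <- w - lr * gradient
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, loss


# Toy data generated from y = 2x + 1 (assumed for illustration)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b, final_loss = train_linear(xs, ys)
```

Here the whole dataset is used as a single batch, so one loop pass is both one iteration and one epoch.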
Epochs, Batches, Iterations
| Term | Meaning | Example |
|---|---|---|
| Dataset | All training examples | 50,000 images |
| Batch (mini-batch) | Subset processed at once | 32 images |
| Iteration (step) | One forward+backward+update | Process 1 batch |
| Epoch | One full pass through all data | All 50,000 images seen once |
Relationship:
$$\text{total\_iterations} = \text{epochs} \times \frac{\text{dataset\_size}}{\text{batch\_size}}$$
Concrete example: 50,000 images, batch size 100, 10 epochs:
- Batches per epoch = 50,000 / 100 = 500
- Total iterations = 10 × 500 = 5,000
Exam tip: Larger batch size → fewer iterations per epoch, but each iteration uses more memory. Smaller batch size → noisier gradients, which often improves generalization (a regularization effect).
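The arithmetic above can be wrapped in a small helper. This is a sketch: the `drop_last` flag is an assumption that mirrors the common choice between keeping or discarding a final partial batch when the dataset size is not a multiple of the batch size.

```python
import math


def total_iterations(dataset_size, batch_size, epochs, drop_last=False):
    """Number of weight updates for a full training run."""
    if drop_last:
        per_epoch = dataset_size // batch_size  # discard the partial batch
    else:
        per_epoch = math.ceil(dataset_size / batch_size)  # keep it
    return epochs * per_epoch


even_split = total_iterations(50_000, 100, 10)   # 500 batches per epoch
uneven_split = total_iterations(50_000, 32, 10)  # 1562.5 rounds up to 1563
uneven_dropped = total_iterations(50_000, 32, 10, drop_last=True)
```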
Learning Rate: The Most Important Hyperparameter
ELI5: Learning rate is the size of steps you take walking downhill toward the valley (minimum loss). Too big and you leap over the valley entirely. Too small and you’re still walking at midnight when everyone else has arrived.
$$w \leftarrow w - \underbrace{\eta}_{\text{learning rate}} \cdot \nabla L$$
What happens at different learning rates:
Loss
│
│ Too HIGH (η=1.0): diverges, bounces wildly
│ × ×
│ × × × × × ← never converges
│
│ JUST RIGHT (η=0.01): converges smoothly
│ ○
│ ○
│ ○
│ ○○○○○○ ← converges to minimum
│
│ Too LOW (η=0.0001): converges but painfully slowly
│ ·
│ ·
│ · ← still moving after 10,000 steps
│
└─────────────────────────────────── Iterations
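The three regimes can be reproduced on a toy problem. The sketch below minimizes $L(w) = w^2$, whose gradient is $2w$, so each step multiplies $w$ by $(1 - 2\eta)$. The specific learning rates are chosen for this toy function and differ from the illustrative values in the diagram; the thresholds for "too high" depend on the curvature of the real loss.

```python
def run_gd(lr, steps=50, w0=1.0):
    """Minimize L(w) = w**2 (gradient 2*w) starting from w0."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w


too_high = run_gd(lr=1.1)     # factor -1.2 per step: magnitude grows, sign flips
just_right = run_gd(lr=0.1)   # factor 0.8 per step: shrinks toward the minimum at 0
too_low = run_gd(lr=0.0001)   # factor 0.9998 per step: barely moves in 50 steps
```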
Learning Rate Schedules
Static learning rate is rarely optimal. Schedules adapt $\eta$ over training:
| Schedule | Formula | When to Use |
|---|---|---|
| Step Decay | $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/k \rfloor}$ | Drop by factor $\gamma$ every $k$ steps; simple, effective |
| Exponential Decay | $\eta_t = \eta_0 \cdot e^{-\lambda t}$ | Smooth continuous decay |
| Cosine Annealing | $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max}-\eta_{min})(1+\cos(\frac{t\pi}{T}))$ | Best for fine-tuning, smooth |
| Warm-up + Decay | Linear increase for first $k$ steps, then decay | Transformers, large batch training |
Warm-up intuition: Early in training, gradients are chaotic. Starting with a small learning rate lets the model stabilize before taking large steps.
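The schedules in the table translate directly into code. The following is a minimal sketch of each formula as a function of the step `t`; the function names are made up for this example, and the warm-up variant pairs linear warm-up with cosine decay as one possible choice of decay.

```python
import math


def step_decay(t, lr0, gamma, k):
    """Multiply the rate by gamma every k steps."""
    return lr0 * gamma ** (t // k)


def exponential_decay(t, lr0, lam):
    """Smooth continuous decay with rate lam."""
    return lr0 * math.exp(-lam * t)


def cosine_annealing(t, lr_min, lr_max, T):
    """Follow half a cosine wave from lr_max (t=0) down to lr_min (t=T)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))


def warmup_then_cosine(t, lr_max, warmup_steps, total_steps, lr_min=0.0):
    """Linear increase for the first warmup_steps, then cosine decay."""
    if t < warmup_steps:
        return lr_max * (t + 1) / warmup_steps
    return cosine_annealing(
        t - warmup_steps, lr_min, lr_max, total_steps - warmup_steps
    )
```

For example, `step_decay(25, 0.1, 0.5, 10)` has passed two decay boundaries, so it returns the initial rate multiplied by 0.25.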
Why this matters for the exam: SageMaker’s built-in algorithms expose learning rate as a key hyperparameter. Automatic Model Tuning searches for the optimal value. Recognizing divergence (oscillating/increasing loss) vs. slow convergence (loss barely moving) guides the fix.
Training Data Management
Train / Validation / Test Split: Why Three Sets?
┌───────────────────────────────────────────────────────┐
│ FULL DATASET │
├──────────────────┬───────────────┬────────────────────┤
│ TRAIN (70%) │ VAL (15%) │ TEST (15%) │
│ │ │ │
│ Model learns │ Tune hyper- │ Final honest │
│ weights here │ parameters │ evaluation │
│ │ here │ (touch ONCE) │
└──────────────────┴───────────────┴────────────────────┘
Why not just train + test? If you tune hyperparameters based on test performance, the test set is effectively used for training decisions. Your reported test accuracy is optimistically biased — information leakage. The validation set is the “practice exam,” the test set is the “real exam you can only take once.”
Stratified Splitting for Imbalanced Data
For a dataset with 95% class A and 5% class B: a random split might put all class B samples in train and none in test. Stratified splitting preserves the class ratio in each split.
- Always use stratified splits for classification with imbalanced classes
- In scikit-learn: `sklearn.model_selection.StratifiedKFold`
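To make the idea concrete, here is a dependency-free sketch of a stratified split on the 95/5 example. It is an illustration only; in practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used. The helper name `stratified_split` and the rule of keeping at least one sample per class in the test set are assumptions of this sketch.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Take the same fraction from every class (at least one sample)
        n_test = max(1, round(len(idxs) * test_fraction))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = ["A"] * 95 + ["B"] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_b_count = sum(labels[i] == "B" for i in test_idx)  # class B share is preserved
```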
Time Series: Never Random Split!
Random splitting a time series causes temporal leakage — future data leaks into training.
WRONG (Random Split):
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│T1│V │T2│T3│V │T4│V │T5│T6│T7│ ← validation points
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ scattered randomly — future leaks into past!
CORRECT (Chronological Split):
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│ TRAIN (past) │ VAL │ TEST │ ← strict time boundary
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
T=1 → T=7 T=8 T=9-10
Walk-forward validation (time series CV): train on T1-T3, validate T4; train T1-T4, validate T5; etc.
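Walk-forward validation with an expanding training window can be sketched as a generator of index splits. The function name and the `horizon` parameter (how many time steps each validation fold covers) are assumptions for illustration; indices are 0-based, so T1 corresponds to index 0.

```python
def walk_forward_splits(n_samples, initial_train, horizon=1):
    """Yield (train_indices, val_indices) in strict chronological order."""
    end = initial_train
    while end + horizon <= n_samples:
        # Train on everything before `end`, validate on the next `horizon` steps
        yield list(range(end)), list(range(end, end + horizon))
        end += horizon


splits = list(walk_forward_splits(n_samples=6, initial_train=3))
# Fold 1: train [0, 1, 2], validate [3]
# Fold 2: train [0, 1, 2, 3], validate [4]
# Fold 3: train [0, 1, 2, 3, 4], validate [5]
```

Every validation index is later than every training index in its fold, which is the property a random split violates.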
Data Augmentation
ELI5: If you only have 100 photos of cats, flip them, rotate them, zoom in — now you have 800 photos that teach the model “cats look like this regardless of angle.” You haven’t collected new cats, you’ve just looked at your existing cats differently.
Augmentation is both a data strategy AND a regularization technique (prevents memorizing exact training examples).
| Data Type | Techniques |
|---|---|
| Image | Random flip (H/V), rotation, crop, color jitter, blur, Mixup (blend two images), Cutout (erase random patches), CutMix |
| Text | Synonym replacement, back-translation (EN→FR→EN), random insertion/deletion, EDA (Easy Data Augmentation) |
| Tabular | SMOTE (synthetic minority oversampling), Gaussian noise injection, random feature masking |
| Audio | Time stretch, pitch shift, noise addition, SpecAugment (mask frequency bands) |
Exam tip: Augmentation applies ONLY to training data, never to validation or test data. Applying it to test data would change what you’re evaluating.
SageMaker Training Deep Dive
Training Job Lifecycle
┌─────────┐ ┌──────────────────────────────────────────┐ ┌─────────┐
│ S3 │ │ TRAINING INSTANCE │ │ S3 │
│ │ │ │ │ │
│Training │───▶│ /opt/ml/input/data/<channel>/ │ │ Model │
│ Data │ │ /opt/ml/input/config/hyperparameters │ │Artifacts│
│ │ │ /opt/ml/model/ (save here) ───▶│───▶│ │
│ Code │───▶│ /opt/ml/output/data/ (logs/metrics) │ │ Output │
│(ECR │ │ /opt/ml/output/failure (error info) │ │ Data │
│ Image) │ │ │ │ │
└─────────┘ └──────────────────────────────────────────┘ └─────────┘
Key directory structure:
| Path | Purpose |
|---|---|
| `/opt/ml/input/data/<channel>/` | Training data (one subdir per channel) |
| `/opt/ml/input/config/hyperparameters.json` | Your hyperparameter values |
| `/opt/ml/model/` | Save model artifacts here; SageMaker uploads them to S3 after training |
| `/opt/ml/output/data/` | Additional output files (metrics, logs) |
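A custom training script typically interacts with these paths as sketched below. This is a minimal illustration, assuming a script that stores its model as JSON; the helper names and the `model.json` file name are hypothetical. The `base_dir` argument exists only so the sketch can be dry-run locally against a scratch directory instead of the real `/opt/ml`.

```python
import json
import os
import tempfile


def load_hyperparameters(base_dir="/opt/ml"):
    """Read the hyperparameters file; SageMaker stores every value as a string."""
    path = os.path.join(base_dir, "input", "config", "hyperparameters.json")
    with open(path) as f:
        return json.load(f)


def save_model(model_state, base_dir="/opt/ml"):
    """Write artifacts under model/ so they are uploaded to S3 after training."""
    model_dir = os.path.join(base_dir, "model")
    os.makedirs(model_dir, exist_ok=True)
    out_path = os.path.join(model_dir, "model.json")  # hypothetical file name
    with open(out_path, "w") as f:
        json.dump(model_state, f)
    return out_path


# Local dry run against a scratch directory that mimics the container layout
with tempfile.TemporaryDirectory() as scratch:
    config_dir = os.path.join(scratch, "input", "config")
    os.makedirs(config_dir)
    with open(os.path.join(config_dir, "hyperparameters.json"), "w") as f:
        json.dump({"learning_rate": "0.01", "epochs": "10"}, f)

    hp = load_hyperparameters(base_dir=scratch)
    lr = float(hp["learning_rate"])  # cast strings to the types you need
    epochs = int(hp["epochs"])
    saved_path = save_model({"weights": [0.1, 0.2]}, base_dir=scratch)
    saved_ok = os.path.exists(saved_path)
```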
File Mode vs Pipe Mode
| Feature | File Mode | Pipe Mode |
|---|---|---|
| How it works | Download all data to disk first | Stream data directly from S3 |
| Startup time | Slow (wait for full download) | Fast (start training immediately) |
| Storage needed | Full dataset on instance | Minimal (one batch at a time) |
| Reuse across epochs | Yes (on disk) | Requires re-streaming each epoch |
| Best for | Small-medium datasets, multiple epochs | Large datasets, fewer epochs, faster iteration |
ELI5 Pipe Mode: Instead of downloading the entire library before reading, you stream books one at a time directly from the library. You start reading immediately, but can’t go back without re-requesting.
Exam tip: Pipe mode = faster startup + lower storage cost. File mode = simpler, can re-read data. Choose Pipe mode when dataset is large and startup latency matters.
Instance Types for Training
| Instance Family | Hardware | Best For |
|---|---|---|
| `ml.m5` | General-purpose CPU | Small models, preprocessing, simple algorithms |
| `ml.c5` | Compute-optimized CPU | CPU-heavy algorithms (XGBoost, classical ML) |
| `ml.p3` | NVIDIA V100 GPU | Deep learning (training), computer vision, NLP |
| `ml.p4` | NVIDIA A100 GPU | Large models, LLMs, highest throughput |
| `ml.g4dn` | NVIDIA T4 GPU | Cost-effective GPU, inference, smaller DL models |
| `ml.trn1` | AWS Trainium chip | Deep learning training, cost-optimized |
| `ml.inf1` | AWS Inferentia chip | Inference only (not training) |
Decision rule:
Algorithm uses backpropagation / neural networks?
├─ YES → GPU instance (p3/p4/g4dn) or Trainium (trn1)
└─ NO → CPU instance
├─ XGBoost, LightGBM, tree methods → ml.m5 or ml.c5
└─ Large tabular, classical ML → ml.m5
Distributed Training
For datasets or models too large for one instance:
Data Parallelism — split data, replicate model:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Worker 1 │ │ Worker 2 │ │ Worker 3 │
│ Batch 1-100 │ │ Batch 101-200│ │ Batch 201-300│
│ │ │ │ │ │
│ Same Model │ │ Same Model │ │ Same Model │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└──────────────────┼──────────────────┘
Aggregate Gradients
(AllReduce / Parameter Server)
Update shared model
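The gradient aggregation step can be simulated without any cluster. In this sketch each "worker" computes a gradient on its own shard, and `all_reduce_mean` is a stand-in for the AllReduce collective; both names and the toy data are assumptions. With equally sized shards, the averaged gradient matches the gradient computed on the full batch.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the model y ~ w*x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for AllReduce: every worker ends up with the average."""
    return sum(values) / len(values)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
w = 0.5  # the same model replica on every worker

# Split the batch into 3 equal shards, one per worker
shards = [(xs[i:i + 2], ys[i:i + 2]) for i in range(0, len(xs), 2)]
local_grads = [mse_grad(w, sx, sy) for sx, sy in shards]

synced_grad = all_reduce_mean(local_grads)  # what each worker applies
full_grad = mse_grad(w, xs, ys)             # single-machine reference
```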
Model Parallelism — split model across workers (for models too large for one GPU):
┌──────────────┐ ┌──────────────┐
│ Worker 1 │ │ Worker 2 │
│ Layers 1-50 │──▶│ Layers 51-100│
│ │ │ │
└──────────────┘ └──────────────┘
| Strategy | When to Use |
|---|---|
| SageMaker Distributed Data Parallel (SMDDP) | Model fits on one GPU, want faster training with multiple GPUs |
| SageMaker Model Parallel (SMP) | Model too large for one GPU (LLMs, very deep networks) |
| Horovod | Open-source alternative to SMDDP, community ecosystem |
Managed Spot Training
ELI5: Like writing your essay on Google Docs instead of Notepad — if your laptop gets taken away (spot interruption), you pick up exactly where you left off because your progress was saved to the cloud.
┌──────────────────────────────────────────────────────┐
│ SPOT TRAINING FLOW │
│ │
│ Start Job → Train → Checkpoint to S3 → Interrupted │
│ ↑ │ │
│ └──── New Spot Instance ←──────┘ │
│ Restore checkpoint │
│ Continue training │
└──────────────────────────────────────────────────────┘
- Cost savings: Up to 90% vs On-Demand instances
- Requirement: Must implement checkpointing (save model state to S3 periodically)
- SageMaker handles: Requesting spot capacity, detecting interruptions, restarting
- You provide: `checkpoint_s3_uri` in the training job config, plus checkpoint logic in the training script
- Max wait time: Set `max_wait` > `max_run` to allow time for spot availability
Exam tip: If a question asks how to reduce training costs by up to 90%, the answer is Managed Spot Training with checkpointing.
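The checkpoint logic on the script side follows a save-then-resume pattern, sketched below in plain Python. The file naming scheme, the `stop_after` argument (used here only to simulate an interruption), and the placeholder "training" step are assumptions. In a real job the script would write to the local checkpoint directory that SageMaker syncs to `checkpoint_s3_uri` (by default `/opt/ml/checkpoints`).

```python
import json
import os
import tempfile


def latest_checkpoint(ckpt_dir):
    """Return the path of the checkpoint with the highest epoch, or None."""
    files = [
        f for f in os.listdir(ckpt_dir)
        if f.startswith("epoch_") and f.endswith(".json")
    ]
    if not files:
        return None
    newest = max(files, key=lambda f: int(f[len("epoch_"):-len(".json")]))
    return os.path.join(ckpt_dir, newest)


def train(ckpt_dir, total_epochs, stop_after=None):
    """Resume from the newest checkpoint if one exists, else start fresh."""
    os.makedirs(ckpt_dir, exist_ok=True)
    state = {"epoch": 0, "weight": 0.0}
    path = latest_checkpoint(ckpt_dir)
    if path is not None:
        with open(path) as f:
            state = json.load(f)

    while state["epoch"] < total_epochs:
        state["weight"] += 1.0  # placeholder for one epoch of real training
        state["epoch"] += 1
        ckpt_path = os.path.join(ckpt_dir, f"epoch_{state['epoch']}.json")
        with open(ckpt_path, "w") as f:
            json.dump(state, f)
        if stop_after is not None and state["epoch"] >= stop_after:
            break  # simulate a spot interruption
    return state


with tempfile.TemporaryDirectory() as ckpt_dir:
    interrupted = dict(train(ckpt_dir, total_epochs=5, stop_after=2))
    resumed = train(ckpt_dir, total_epochs=5)  # picks up at epoch 2
```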
Incremental Training
Resume training from a previously trained model:
- Provide a pre-trained model artifact as input to a new training job
- Useful for: continuing interrupted training, fine-tuning with new data, iterative improvement
- Supported by: Object Detection, Image Classification, Semantic Segmentation built-in algorithms
SageMaker Debugger
Real-time monitoring of training jobs without modifying training code:
Training Job
│
├─▶ Debugger Hook ──▶ S3 (tensors, metrics)
│ │
│ Debugger Rules
│ (run as separate
│ processing job)
│ │
│ CloudWatch Alarms
│ SNS Notifications
└─▶ Training continues (non-blocking)
Built-in Rules:
| Rule | Detects |
|---|---|
| `VanishingGradient` | Gradients approaching zero; deep networks stop learning |
| `ExplodingTensor` | Weights/gradients growing uncontrollably (NaN) |
| `LossNotDecreasing` | Training loss plateaued; possible learning rate or data issue |
| `Overfit` | Val loss increasing while train loss decreasing |
| `Overtraining` | Model performance degrading with more training |
| `PoorWeightInitialization` | Bad initialization causing slow learning |
SageMaker Profiler (part of Debugger):
- Identifies hardware bottlenecks: CPU/GPU utilization, I/O wait, network
- Flags common inefficiencies: DataLoader bottlenecks, small batch sizes, underutilized GPUs
- Outputs a profiling report with actionable recommendations
Hyperparameters vs Parameters
ELI5: Parameters are what the student learns during class (knowledge, skills). Hyperparameters are how you set up the classroom — class size (batch size), teaching speed (learning rate), how many classes (epochs). The student discovers parameters; you decide hyperparameters.
| Type | Examples | Who Sets It | When |
|---|---|---|---|
| Parameters (model weights) | Neural network weights, tree split values, SVM support vectors | Algorithm (learned from data) | During training |
| Hyperparameters | Learning rate, max_depth, num_layers, dropout rate, batch size | You (the practitioner) | Before training |
Key insight: You cannot learn hyperparameters with backpropagation because the gradient of validation loss w.r.t. a hyperparameter is not generally computable. This is why hyperparameter optimization is a separate outer loop.
Hyperparameter Optimization (HPO)
Manual Tuning
Try values based on domain knowledge/intuition. Does not scale. Only viable for 1-2 hyperparameters with clear prior knowledge.
Grid Search
Search every combination of a predefined grid:
learning_rate: [0.001, 0.01, 0.1]
batch_size: [32, 64, 128]
max_depth: [3, 5, 7]
dropout: [0.2, 0.3, 0.5]
Total combinations: 3 × 3 × 3 × 3 = 81 training jobs
- Pros: Exhaustive, finds the best combination within the grid
- Cons: Curse of dimensionality. Adding one more hyperparameter with 5 values multiplies the total number of jobs by 5; 6 hyperparameters with 5 values each require $5^6 = 15{,}625$ jobs.
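Enumerating the grid from the example makes the job count explicit. This sketch only builds the list of configurations; launching a training job per configuration is left out.

```python
from itertools import product

grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
    "max_depth": [3, 5, 7],
    "dropout": [0.2, 0.3, 0.5],
}

# One dict per combination: the Cartesian product of all value lists
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
n_jobs = len(combos)  # 3 * 3 * 3 * 3

# Six hyperparameters with five values each
n_jobs_large = 5 ** 6
```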
Random Search
Randomly sample from the hyperparameter distributions:
Grid Search coverage: Random Search coverage:
┌─────────────────┐ ┌─────────────────┐
│ × × × × × × × │ │ · · · · │
│ × × × × × × × │ │ · · · · │
│ × × × × × × × │ │ · · · · │
│ × × × × × × × │ │ · · · · │
│ × × × × × × × │ │ · · · · │
└─────────────────┘ └─────────────────┘
Uniform grid — wastes Covers the space more
budget on redundant efficiently; likely to
columns when only one find better values with
hyperparameter matters the same budget
Why random is better (Bergstra & Bengio 2012): Most hyperparameters have regions where they matter and regions where they don’t. Grid search wastes budget on uninformative combinations. Random search covers the important dimensions more efficiently.
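A random search loop is short to write. In the sketch below the search space, the helper names, and `toy_score` (a made-up objective that peaks at a learning rate of 0.01) are all assumptions; a real `evaluate` function would run a training job and return a validation metric. The learning rate is sampled uniformly in log space so that each order of magnitude is equally likely.

```python
import math
import random


def sample_config(rng):
    """Draw one configuration from the (assumed) search space."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform in [1e-4, 1e-1]
        "batch_size": rng.choice([32, 64, 128]),
        "dropout": rng.uniform(0.2, 0.5),
    }


def random_search(evaluate, n_trials, seed=0):
    """Sample n_trials configurations and return the best-scoring one."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(n_trials)]
    return max(trials, key=evaluate)


def toy_score(cfg):
    """Hypothetical objective: best when learning_rate is 0.01."""
    return -abs(math.log10(cfg["learning_rate"]) + 2)


best = random_search(toy_score, n_trials=20)
```

The budget is fixed by `n_trials` regardless of how many hyperparameters are sampled, which is why the cost grows linearly rather than exponentially.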
Bayesian Optimization
ELI5: Instead of blindly trying random combinations, Bayesian optimization learns from previous experiments: “learning rate 0.1 was good, 0.5 was terrible — so let’s try something near 0.1 next.” It builds a map of the performance landscape and chooses the next experiment intelligently.
Iteration 1: Try (lr=0.1, depth=3) → val_acc = 0.82
Iteration 2: Try (lr=0.5, depth=3) → val_acc = 0.61
Iteration 3: Surrogate model says: lr near 0.1 is promising
Try (lr=0.08, depth=5) → val_acc = 0.87
Iteration 4: Surrogate model updated, tries (lr=0.09, depth=6) → 0.89
...
Mechanics:
- Fit a surrogate model (Gaussian Process, Random Forest) on observed (hyperparams → performance) pairs
- Use an acquisition function to decide the next point to evaluate:
- Expected Improvement (EI): choose point with highest expected gain over current best
- Upper Confidence Bound (UCB): balance exploration vs exploitation
- Evaluate the actual training job at that point, update the surrogate, repeat
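The two acquisition functions have simple closed forms once the surrogate supplies a predicted mean and standard deviation at a candidate point. The sketch below assumes a Gaussian predictive distribution and a maximization objective; the function names and the default `kappa` are choices made for this example, and the numbers passed in are hypothetical surrogate outputs.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def expected_improvement(mu, sigma, best_so_far):
    """Expected gain over the current best (maximization)."""
    if sigma == 0:
        return max(mu - best_so_far, 0.0)
    z = (mu - best_so_far) / sigma
    return (mu - best_so_far) * normal_cdf(z) + sigma * normal_pdf(z)


def upper_confidence_bound(mu, sigma, kappa=2.0):
    """Optimistic estimate: larger kappa favors exploration."""
    return mu + kappa * sigma


# Hypothetical surrogate predictions at two candidate points, current best 0.89
ei_confident = expected_improvement(mu=0.88, sigma=0.01, best_so_far=0.89)
ei_uncertain = expected_improvement(mu=0.85, sigma=0.10, best_so_far=0.89)
```

In this example the uncertain candidate has the higher expected improvement even though its predicted mean is lower, which is how the acquisition function trades exploitation for exploration.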
| HPO Method | Jobs Needed | Best For |
|---|---|---|
| Grid Search | Exponential in #params | ≤ 2 hyperparameters, small grid |
| Random Search | Linear | Quick baseline, many hyperparameters |
| Bayesian | Sub-linear (smart) | When jobs are expensive, want efficiency |
| Hyperband | Sub-linear + early stopping | When early performance predicts final performance |
Hyperband
ELI5: Like American Idol auditions — everyone sings 30 seconds, cut the worst half, give survivors 2 minutes, cut again, give finalists the full song. Same budget, more winners identified early.
Successive Halving:
Round 1: 16 candidates × 1 epoch each → keep best 8
Round 2: 8 candidates × 2 epochs each → keep best 4
Round 3: 4 candidates × 4 epochs each → keep best 2
Round 4: 2 candidates × 8 epochs each → keep best 1
↑ final winner
Total epochs used: 16×1 + 8×2 + 4×4 + 2×8 = 64
vs. 16 full-run = 16×8 = 128 epochs
Assumption: Early performance correlates with final performance. Works well for deep learning, less well for methods that are slow to differentiate.
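The successive-halving tally above can be reproduced with a short simulation. This sketch counts epochs the same way as the example (each round charges its full per-candidate budget); `toy_score` is a hypothetical stand-in for "validation metric after training this candidate for a given number of epochs".

```python
def successive_halving(candidates, score, min_epochs=1):
    """Keep the best half each round while doubling the per-candidate budget."""
    survivors = list(candidates)
    epochs, total_epochs = min_epochs, 0
    while len(survivors) > 1:
        total_epochs += len(survivors) * epochs
        ranked = sorted(survivors, key=lambda c: score(c, epochs), reverse=True)
        survivors = ranked[: len(ranked) // 2]
        epochs *= 2
    return survivors[0], total_epochs


def toy_score(candidate, epochs):
    """Hypothetical metric: better candidates score higher at any budget."""
    return candidate * (1 - 0.5 ** epochs)


winner, epochs_used = successive_halving(range(16), toy_score)
full_run_cost = 16 * 8  # every candidate trained for the full 8 epochs
```

Because `toy_score` ranks candidates consistently at every budget, it satisfies the stated assumption that early performance predicts final performance.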
SageMaker Automatic Model Tuning (AMT)
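# Example configuration (assumes `estimator` is an already-configured SageMaker Estimator)
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='validation:accuracy',
    objective_type='Maximize',  # or 'Minimize'
    hyperparameter_ranges={
        # Log scaling: the range spans several orders of magnitude
        'learning_rate': ContinuousParameter(0.0001, 0.1, scaling_type='Logarithmic'),
        'num_layers': IntegerParameter(1, 10),
        'optimizer': CategoricalParameter(['adam', 'sgd', 'rmsprop']),
    },
    max_jobs=50,           # total training jobs budget
    max_parallel_jobs=5,   # concurrent jobs
    strategy='Bayesian',   # 'Random', 'Bayesian', 'Hyperband', 'Grid'
    early_stopping_type='Auto',
)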
Key concepts:
| Setting | Meaning |
|---|---|
| `objective_metric_name` | Metric your training script emits (must match exactly) |
| `max_jobs` | Total training jobs budget |
| `max_parallel_jobs` | Concurrent jobs (more parallel = less info for Bayesian guidance) |
| `strategy = 'Bayesian'` | Default; most efficient for expensive jobs |
| `early_stopping_type = 'Auto'` | Stop jobs early if not improving |
Warm start: Resume a previous tuning job. Two types:
- `IDENTICAL_DATA_AND_ALGORITHM`: same setup, add more jobs
- `TRANSFER_LEARNING`: changed some ranges/algorithm, use prior as guidance
Logarithmic scaling: Use for parameters spanning orders of magnitude (e.g., learning rate from 0.0001 to 0.1). Ensures equal sampling density at each scale.
Exam tip: For AMT, max_parallel_jobs should be much less than max_jobs. If max_parallel_jobs = max_jobs, all jobs run simultaneously and Bayesian optimization can’t use feedback from earlier jobs — it degrades to random search.
Regularization Techniques
Dropout
ELI5: During practice, randomly mute different teammates each game. Everyone learns to do their job independently instead of relying on specific partners. On game day (inference), everyone plays — and they perform better because of the independent practice.
Training: Inference:
Input Input
│ │
[h1][h2][h3][h4] [h1][h2][h3][h4]
│ ✗ │ ✗ ← 50% dropped │ │ │ │ ← all active
[h5][ ][h6][ ] [h5][h6][h7][h8]
│ │ │ │ │ │
Output Output (scale by keep_prob)
- Typical rates: 0.2–0.5 (drop 20–50% of neurons)
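- Applied only during training. At inference all neurons are active; in the original formulation outputs are scaled by the keep probability, while most frameworks use inverted dropout (scale by 1/keep probability during training) so inference needs no scaling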
- Effect: Prevents co-adaptation, acts as implicit ensemble of $2^n$ networks
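The train/inference asymmetry can be sketched as follows. Note that this example uses inverted dropout, the variant common in modern frameworks: surviving activations are scaled up by 1/keep probability during training, so inference returns activations unchanged. That differs from scaling at inference time but yields the same expected activation. The function signature is an assumption for illustration.

```python
import random


def dropout(activations, drop_rate, training, rng=random):
    """Inverted dropout: zero out units at random and rescale the survivors."""
    if not training or drop_rate == 0.0:
        return list(activations)  # inference: all neurons active, no scaling
    keep_prob = 1.0 - drop_rate
    return [
        a / keep_prob if rng.random() < keep_prob else 0.0
        for a in activations
    ]


rng = random.Random(0)
h = [1.0] * 10_000
train_out = dropout(h, drop_rate=0.5, training=True, rng=rng)
eval_out = dropout(h, drop_rate=0.5, training=False)

# The expected activation is preserved despite dropping about half the units
train_mean = sum(train_out) / len(train_out)
```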
Batch Normalization
For each mini-batch, normalize layer activations: $\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, then scale and shift: $y = \gamma\hat{x} + \beta$
ELI5: Each layer keeps re-centering its inputs so the next layer always receives well-behaved data — not too big, not too small, centered around zero. Like a translator who converts everything to a standard format before passing it along.
- Benefits: Smoother loss landscape, enables higher learning rates, reduces sensitivity to initialization
- Placement: Usually before activation function, sometimes after
- Training vs inference: Uses batch statistics during training, running mean/variance during inference
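The training-mode computation for a single feature can be sketched in a few lines. This is an illustration only: it omits the running mean/variance that would be tracked for inference, and it treats $\gamma$ and $\beta$ as fixed arguments rather than learned parameters.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased batch variance
    normalized = [
        gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch
    ]
    return normalized, mean, var


normalized, batch_mean, batch_var = batch_norm([1.0, 2.0, 3.0, 4.0, 5.0])
out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
```

With the default `gamma=1.0` and `beta=0.0`, the output has mean close to 0 and variance close to 1.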
Early Stopping
Monitor validation loss; stop training when it starts increasing:
Loss
│
│ Training loss ────────────────────────────────▶ (keeps decreasing)
│ ╲
│ ╲
│ ╲____________
│ ╲_______
│ Validation ↑ ╲___ ← best point ╲ starts
│ loss │ ╲____________________╲ increasing
│ │ ← overfitting
│ └─── EARLY STOP HERE (patience=5)
└──────────────────────────────────────────────────────── Epochs
- Patience: Number of consecutive epochs without improvement in validation loss to wait before stopping
- Restore best weights: Save model checkpoint at the best validation epoch, restore at the end
- Benefit: Free regularization — no extra hyperparameter tuning needed
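The patience rule can be expressed as a small function over a history of validation losses. This is a sketch with made-up loss values; in a real loop the check would run after each epoch and the best checkpoint would be saved whenever the loss improves.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 0-indexed."""
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch  # no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch  # ran to the end without stopping


# Hypothetical validation losses: best at epoch 3, then rising
val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.57, 0.60, 0.62, 0.65]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=5)
```

Training stops five epochs after the best one, and the weights from `best_epoch` are the ones to restore.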
Weight Decay (L2 Regularization in Optimizers)
Add a term to the optimizer step that shrinks weights toward zero each step:
$$w \leftarrow w(1 - \eta\lambda) - \eta\nabla L$$
Equivalent to L2 regularization for plain SGD, but not for adaptive optimizers: AdamW applies the decay directly to the weights (decoupled weight decay), whereas Adam with an L2 penalty passes the penalty through its adaptive gradient scaling.
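The SGD equivalence is easy to check numerically. In this sketch the L2 penalty is assumed to be $\frac{\lambda}{2} w^2$, which contributes $\lambda w$ to the gradient; the function names and the sample numbers are made up for the example.

```python
def sgd_weight_decay_step(w, grad, lr, wd):
    """Shrink the weight first, then take the gradient step."""
    return w * (1 - lr * wd) - lr * grad


def sgd_l2_step(w, grad, lr, wd):
    """Plain SGD on loss + (wd / 2) * w**2: the penalty adds wd * w to the gradient."""
    return w - lr * (grad + wd * w)


w, grad, lr, wd = 0.8, 0.3, 0.1, 0.01
decayed = sgd_weight_decay_step(w, grad, lr, wd)
penalized = sgd_l2_step(w, grad, lr, wd)
```

Both updates give the same result for plain SGD. The equality breaks once the gradient is rescaled per weight, as adaptive optimizers do, because the penalty term then gets rescaled too.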
Common Training Problems & Fixes
| Symptom | Diagnosis | Fix |
|---|---|---|
| Training loss not decreasing | Learning rate too low, vanishing gradient, bug | Increase LR; check gradients; debug code |
| Training loss oscillating wildly | Learning rate too high | Reduce LR by 10×; use LR schedule |
| Training loss → NaN | Exploding gradients, bad data | Gradient clipping; check for inf/nan in data |
| Train loss low, val loss high | Overfitting | More data, dropout, L2, early stopping, less capacity |
| Both losses high | Underfitting | Bigger model, more features, fewer regularization constraints |
| GPU utilization < 50% | Data loading bottleneck | More DataLoader workers; Pipe mode; prefetching |
| Loss decreasing very slowly | Batch size too small; bad LR schedule | Increase batch size; use LR warm-up |
| Validation loss better than train loss | Train augmentation/dropout active, val not | Expected and fine — model generalizes well |
Why this matters for the exam: SageMaker Debugger automates detecting many of these — vanishing gradient, exploding tensor, loss not decreasing. Questions will test whether you can read symptoms and diagnose causes.
Transfer Learning
ELI5: Instead of training a doctor from kindergarten, hire a general doctor and spend 6 months teaching them your specialty. Most of their general medical knowledge transfers directly — you’re only teaching the specialized part.
Pre-trained Model (e.g., ResNet-50 on ImageNet):
┌──────────────────────────────────────────────────┐
│ Conv layers (edge/shape/texture detectors) │
│ ┌────┐ ┌────┐ ┌────┐ ┌────┐ [FROZEN] │
│ │ C1 │→│ C2 │→│ C3 │→│ C4 │ keep these weights│
│ └────┘ └────┘ └────┘ └────┘ │
│ │
│ Final classifier [REPLACED + RETRAINED] │
│ ┌───────────────────┐ │
│ │ FC → Softmax │ your 5 categories │
│ └───────────────────┘ │
└──────────────────────────────────────────────────┘
Two Strategies
| Strategy | Approach | When to Use |
|---|---|---|
| Feature Extraction | Freeze ALL pre-trained layers; only train new head | Very small dataset (< 1000 samples); similar domain |
| Fine-tuning | Unfreeze some/all layers; train with small LR (1e-4 to 1e-5) | Medium dataset; possibly different domain |
Fine-tuning rules:
- Unfreeze from the top (last layers) downward
- Use a much smaller learning rate than original training (avoid destroying learned representations)
- Earlier layers = more generic features (usually safe to keep frozen)
- Later layers = more task-specific features (safe to fine-tune)
Transfer Learning in SageMaker
- Image Classification, Object Detection, Semantic Segmentation algorithms support incremental training
- JumpStart provides pre-trained model hubs for fine-tuning
- Fine-tuning is a first-class feature: provide a `model_id`, your dataset, and a config
Exam tip: “Small dataset + image/NLP task” → Transfer learning. “Similar domain” → Feature extraction (freeze more). “Different domain” → Fine-tuning (unfreeze more layers, use even smaller LR).
Quick Reference: Training Decisions
Want to reduce training cost?
├─ Up to 90% → Managed Spot Training + Checkpointing
└─ Smaller instance → Optimize batch size + Pipe Mode
Large dataset, slow startup?
└─ Use Pipe Mode instead of File Mode
Model too large for one GPU?
├─ Model Parallelism (SageMaker SMP)
└─ Need more throughput → Data Parallelism (SMDDP / Horovod)
Overfitting?
├─ Add Dropout (0.2–0.5)
├─ L2 / Weight decay
├─ Early Stopping
├─ More training data / Data Augmentation
└─ Reduce model capacity
Training loss not moving?
├─ Learning rate too low → increase
├─ Learning rate too high → reduce
└─ Check data pipeline (NaN values, wrong labels)
Want to find best hyperparameters efficiently?
├─ < 3 hyperparameters, cheap jobs → Grid Search (small grid) or Random Search
├─ Expensive jobs, any count → Bayesian (SageMaker AMT)
└─ Very expensive jobs → Hyperband (AMT with Hyperband strategy)