Domain 3F: Model Training & Hyperparameter Tuning

Exam Domain: 3, ML Model Development

Task: Select and apply correct training strategies, manage hyperparameters, and use SageMaker training features effectively


The Training Process (First Principles)

ELI5: Training a model is like teaching a student through repeated quizzes. The student makes a guess, you tell them how wrong they were, they adjust their thinking, and repeat — thousands of times until they get it right.

At its core, training is an optimization loop:

┌──────────────────────────────────────────────────────────┐
│                   TRAINING LOOP                          │
│                                                          │
│  ┌─────────┐    Forward     ┌──────────┐                 │
│  │  Input  │─── Pass ──────▶│   Loss   │                 │
│  │  Batch  │                │ Function │                 │
│  └─────────┘                └────┬─────┘                 │
│       ▲                          │                       │
│       │                    Backward Pass                 │
│  ┌────┴──────┐             (Backprop)                    │
│  │  Update   │                   │                       │
│  │  Weights  │◀──────────────────┘                       │
│  │  (SGD/    │   Gradients tell each weight               │
│  │   Adam)   │   which direction to move                 │
│  └───────────┘                                           │
└──────────────────────────────────────────────────────────┘

Four mechanical steps per iteration:

  1. Forward pass — input flows through the network, produces a prediction
  2. Loss computation — measure how wrong the prediction is (cross-entropy, MSE, etc.)
  3. Backward pass (backpropagation) — compute gradient of loss w.r.t. every weight
  4. Weight update — move weights in the direction that reduces loss: $w \leftarrow w - \eta \nabla L$
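The four steps above can be sketched in plain Python for a one-feature linear model. This is a toy illustration, not SageMaker code: the function name `train_linear`, the synthetic data, and the learning rate, epoch count, and batch size are all assumptions chosen so the example converges quickly.

```python
import random

def train_linear(xs, ys, lr=0.2, epochs=1000, batch_size=5, seed=0):
    """Mini-batch SGD for y ≈ w*x + b with mean squared error (toy example)."""
    rng = random.Random(seed)
    w, b, loss = 0.0, 0.0, float("inf")
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # 1. Forward pass: predictions for this batch
            preds = [w * x + b for x, _ in batch]
            # 2. Loss computation: mean squared error
            errors = [p - y for p, (_, y) in zip(preds, batch)]
            loss = sum(e * e for e in errors) / len(batch)
            # 3. Backward pass: gradients of the loss w.r.t. w and b
            grad_w = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = sum(2 * e for e in errors) / len(batch)
            # 4. Weight update: w <- w - lr * gradient
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b, loss

xs = [i / 10 for i in range(10)]
ys = [2 * x + 1 for x in xs]          # true relationship: w = 2, b = 1
w, b, loss = train_linear(xs, ys)
print(round(w, 3), round(b, 3))
```

Deep learning frameworks automate step 3 with automatic differentiation, but the loop structure is the same.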

Epochs, Batches, Iterations

| Term | Meaning | Example |
|---|---|---|
| Dataset | All training examples | 50,000 images |
| Batch (mini-batch) | Subset processed at once | 32 images |
| Iteration (step) | One forward + backward + update | Process 1 batch |
| Epoch | One full pass through all data | All 50,000 images seen once |

Relationship:

$$\text{total iterations} = \text{epochs} \times \left\lceil \frac{\text{dataset size}}{\text{batch size}} \right\rceil$$

Concrete example: 50,000 images, batch size 100, 10 epochs:

  • Batches per epoch = 50,000 / 100 = 500
  • Total iterations = 10 × 500 = 5,000
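The same arithmetic as a small helper. The function name is an assumption for illustration, and it rounds up so that a final partial batch still counts as one iteration.

```python
import math

def total_iterations(dataset_size, batch_size, epochs):
    """Number of weight updates when every batch (including a partial last one) is used."""
    steps_per_epoch = math.ceil(dataset_size / batch_size)
    return epochs * steps_per_epoch

print(total_iterations(50_000, 100, 10))  # 5000
print(total_iterations(50_000, 32, 10))   # 15630 (1563 steps per epoch)
```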

Exam tip: Larger batch size → fewer iterations per epoch, but each iteration uses more memory. Smaller batch size → noisier gradients, which often act as a mild regularizer and can improve generalization.


Learning Rate: The Most Important Hyperparameter

ELI5: Learning rate is the size of steps you take walking downhill toward the valley (minimum loss). Too big and you leap over the valley entirely. Too small and you’re still walking at midnight when everyone else has arrived.

$$w \leftarrow w - \underbrace{\eta}_{\text{learning rate}} \cdot \nabla L$$

What happens at different learning rates:

Loss
│
│  Too HIGH (η=1.0): diverges, bounces wildly
│     ×         ×
│  ×    ×    ×    ×    ×   ← never converges
│
│  JUST RIGHT (η=0.01): converges smoothly
│  ○
│    ○
│      ○
│        ○○○○○○ ← converges to minimum
│
│  Too LOW (η=0.0001): converges but painfully slowly
│  ·
│   ·
│    ·           ← still moving after 10,000 steps
│
└─────────────────────────────────── Iterations
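A quick way to see all three regimes is gradient descent on the toy loss $L(w) = w^2$. This is a sketch under stated assumptions: the loss function and the three rates below are chosen for this toy problem only, and the thresholds for "too high" or "too low" differ for every real model.

```python
def descend(lr, steps=50, w0=1.0):
    """Run gradient descent on L(w) = w**2 (gradient 2*w) and return the final w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

for lr in (1.1, 0.1, 0.0001):
    print(f"lr={lr}: |w| after 50 steps = {abs(descend(lr)):.6g}")
# lr=1.1 overshoots further on every step (diverges),
# lr=0.1 shrinks w toward the minimum at 0,
# lr=0.0001 has barely moved from the starting point.
```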

Learning Rate Schedules

Static learning rate is rarely optimal. Schedules adapt $\eta$ over training:

| Schedule | Formula | When to Use |
|---|---|---|
| Step Decay | $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/k \rfloor}$ | Drop by factor $\gamma$ every $k$ steps; simple, effective |
| Exponential Decay | $\eta_t = \eta_0 \cdot e^{-\lambda t}$ | Smooth continuous decay |
| Cosine Annealing | $\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max}-\eta_{min})(1+\cos(\frac{t\pi}{T}))$ | Best for fine-tuning; smooth decay |
| Warm-up + Decay | Linear increase for first $k$ steps, then decay | Transformers, large batch training |

Warm-up intuition: Early in training, gradients are chaotic. Starting with a small learning rate lets the model stabilize before taking large steps.
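The schedules in the table translate directly into small functions. This is a minimal sketch: the function names are assumptions, and the warm-up variant pairs linear warm-up with cosine decay as one possible combination, not the only one.

```python
import math

def step_decay(t, eta0, gamma, k):
    """eta_t = eta0 * gamma ** floor(t / k)."""
    return eta0 * gamma ** (t // k)

def exponential_decay(t, eta0, lam):
    """eta_t = eta0 * exp(-lam * t)."""
    return eta0 * math.exp(-lam * t)

def cosine_annealing(t, eta_min, eta_max, T):
    """eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T))."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

def warmup_then_cosine(t, eta_max, warmup_steps, total_steps, eta_min=0.0):
    """Linear warm-up to eta_max, then cosine decay over the remaining steps."""
    if t < warmup_steps:
        return eta_max * (t + 1) / warmup_steps
    return cosine_annealing(t - warmup_steps, eta_min, eta_max,
                            total_steps - warmup_steps)

for t in (0, 5, 10, 50, 100):
    print(t, round(warmup_then_cosine(t, eta_max=0.1, warmup_steps=10, total_steps=100), 5))
```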

Why this matters for the exam: SageMaker’s built-in algorithms expose learning rate as a key hyperparameter. Automatic Model Tuning searches for the optimal value. Recognizing divergence (oscillating/increasing loss) vs. slow convergence (loss barely moving) guides the fix.


Training Data Management

Train / Validation / Test Split: Why Three Sets?

┌───────────────────────────────────────────────────────┐
│                 FULL DATASET                          │
├──────────────────┬───────────────┬────────────────────┤
│   TRAIN (70%)    │   VAL (15%)   │    TEST (15%)      │
│                  │               │                    │
│  Model learns    │  Tune hyper-  │  Final honest      │
│  weights here    │  parameters   │  evaluation        │
│                  │  here         │  (touch ONCE)      │
└──────────────────┴───────────────┴────────────────────┘

Why not just train + test? If you tune hyperparameters based on test performance, the test set is effectively used for training decisions. Your reported test accuracy is optimistically biased — information leakage. The validation set is the “practice exam,” the test set is the “real exam you can only take once.”

Stratified Splitting for Imbalanced Data

For a dataset with 95% class A and 5% class B: a random split might put all class B samples in train and none in test. Stratified splitting preserves the class ratio in each split.

  • Always use stratified splits for classification with imbalanced classes
  • sklearn.model_selection.StratifiedKFold
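To make the idea concrete, here is a dependency-free sketch of a stratified hold-out split on the 95/5 example. The helper `stratified_split` is hypothetical and written for illustration; in practice scikit-learn's `train_test_split(..., stratify=y)` or `StratifiedKFold` would normally be used.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly the same ratio in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["A"] * 95 + ["B"] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(Counter(labels[i] for i in train_idx))  # 76 of class A, 4 of class B
print(Counter(labels[i] for i in test_idx))   # 19 of class A, 1 of class B
```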

Time Series: Never Random Split!

Random splitting a time series causes temporal leakage — future data leaks into training.

WRONG (Random Split):
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│T1│V │T2│T3│V │T4│V │T5│T6│T7│  ← validation points
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘    scattered randomly — future leaks into past!

CORRECT (Chronological Split):
┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│  TRAIN (past)  │ VAL │ TEST │  ← strict time boundary
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
     T=1 → T=7    T=8   T=9-10

Walk-forward validation (time series CV): train on T1-T3, validate T4; train T1-T4, validate T5; etc.
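Walk-forward validation with an expanding window can be sketched as a small generator. The function name and parameters are assumptions for illustration; indices are zero-based, so index 0 corresponds to T1. scikit-learn's `TimeSeriesSplit` offers a ready-made version of the same idea.

```python
def walk_forward_splits(n_samples, min_train, horizon=1):
    """Yield (train_indices, val_indices); training always ends before validation starts."""
    end = min_train
    while end + horizon <= n_samples:
        yield list(range(end)), list(range(end, end + horizon))
        end += horizon

for train, val in walk_forward_splits(n_samples=6, min_train=3):
    print(train, val)
# [0, 1, 2] [3]
# [0, 1, 2, 3] [4]
# [0, 1, 2, 3, 4] [5]
```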

Data Augmentation

ELI5: If you only have 100 photos of cats, flip them, rotate them, zoom in — now you have 800 photos that teach the model “cats look like this regardless of angle.” You haven’t collected new cats, you’ve just looked at your existing cats differently.

Augmentation is both a data strategy AND a regularization technique (prevents memorizing exact training examples).

| Data Type | Techniques |
|---|---|
| Image | Random flip (H/V), rotation, crop, color jitter, blur, Mixup (blend two images), Cutout (erase random patches), CutMix |
| Text | Synonym replacement, back-translation (EN→FR→EN), random insertion/deletion, EDA (Easy Data Augmentation) |
| Tabular | SMOTE (synthetic minority oversampling), Gaussian noise injection, random feature masking |
| Audio | Time stretch, pitch shift, noise addition, SpecAugment (mask frequency bands) |

Exam tip: Augmentation applies ONLY to training data, never to validation or test data. Applying it to test data would change what you’re evaluating.


SageMaker Training Deep Dive

Training Job Lifecycle

┌─────────┐    ┌──────────────────────────────────────────┐    ┌─────────┐
│   S3    │    │          TRAINING INSTANCE               │    │   S3    │
│         │    │                                          │    │         │
│Training │───▶│  /opt/ml/input/data/<channel>/           │    │  Model  │
│  Data   │    │  /opt/ml/input/config/hyperparameters    │    │Artifacts│
│         │    │  /opt/ml/model/          (save here) ───▶│───▶│         │
│  Code   │───▶│  /opt/ml/output/data/   (logs/metrics)  │    │  Output │
│(ECR     │    │  /opt/ml/output/failure (error info)    │    │  Data   │
│ Image)  │    │                                          │    │         │
└─────────┘    └──────────────────────────────────────────┘    └─────────┘

Key directory structure:

| Path | Purpose |
|---|---|
| /opt/ml/input/data/<channel>/ | Training data (one subdir per channel) |
| /opt/ml/input/config/hyperparameters.json | Your hyperparameter values |
| /opt/ml/model/ | Save model artifacts here; SageMaker uploads them to S3 after training |
| /opt/ml/output/data/ | Additional output files (metrics, logs) |
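A custom training script typically reads its configuration from these paths and writes the model back. The sketch below is illustrative only: the function names, the hyperparameter keys `learning_rate` and `epochs`, and the `model.json` file name are assumptions. Note that SageMaker writes every hyperparameter value as a string, so the script has to cast them.

```python
import json
from pathlib import Path

def load_hyperparameters(base_dir="/opt/ml"):
    """Read hyperparameters.json and cast the string values to usable types."""
    path = Path(base_dir) / "input" / "config" / "hyperparameters.json"
    raw = json.loads(path.read_text())
    return {
        "learning_rate": float(raw.get("learning_rate", "0.01")),  # hypothetical key
        "epochs": int(raw.get("epochs", "10")),                    # hypothetical key
    }

def save_model(weights, base_dir="/opt/ml"):
    """Write model artifacts under <base_dir>/model so they are uploaded after training."""
    model_dir = Path(base_dir) / "model"
    model_dir.mkdir(parents=True, exist_ok=True)
    out = model_dir / "model.json"
    out.write_text(json.dumps(weights))
    return out
```

The `base_dir` parameter exists only so the functions can be exercised locally against a temporary directory; inside a training container the default `/opt/ml` applies.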

File Mode vs Pipe Mode

| Feature | File Mode | Pipe Mode |
|---|---|---|
| How it works | Download all data to disk first | Stream data directly from S3 |
| Startup time | Slow (wait for full download) | Fast (start training immediately) |
| Storage needed | Full dataset on instance | Minimal (one batch at a time) |
| Reuse across epochs | Yes (on disk) | Requires re-streaming each epoch |
| Best for | Small-medium datasets, multiple epochs | Large datasets, fewer epochs, faster iteration |

ELI5 Pipe Mode: Instead of downloading the entire library before reading, you stream books one at a time directly from the library. You start reading immediately, but can’t go back without re-requesting.

Exam tip: Pipe mode = faster startup + lower storage cost. File mode = simpler, can re-read data. Choose Pipe mode when dataset is large and startup latency matters.

Instance Types for Training

| Instance Family | Hardware | Best For |
|---|---|---|
| ml.m5 | General-purpose CPU | Small models, preprocessing, simple algorithms |
| ml.c5 | Compute-optimized CPU | CPU-heavy algorithms (XGBoost, classical ML) |
| ml.p3 | NVIDIA V100 GPU | Deep learning (training), computer vision, NLP |
| ml.p4 | NVIDIA A100 GPU | Large models, LLMs, highest throughput |
| ml.g4dn | NVIDIA T4 GPU | Cost-effective GPU, inference, smaller DL models |
| ml.trn1 | AWS Trainium chip | Deep learning training, cost-optimized |
| ml.inf1 | AWS Inferentia chip | Inference only (not training) |

Decision rule:

Algorithm uses backpropagation / neural networks?
├─ YES → GPU instance (p3/p4/g4dn) or Trainium (trn1)
└─ NO  → CPU instance
         ├─ XGBoost, LightGBM, tree methods → ml.m5 or ml.c5
         └─ Large tabular, classical ML      → ml.m5

Distributed Training

For datasets or models too large for one instance:

Data Parallelism — split data, replicate model:

┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│  Worker 1    │   │  Worker 2    │   │  Worker 3    │
│  Batch 1-100 │   │ Batch 101-200│   │ Batch 201-300│
│              │   │              │   │              │
│  Same Model  │   │  Same Model  │   │  Same Model  │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                   Aggregate Gradients
                   (AllReduce / Parameter Server)
                   Update shared model
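
The gradient aggregation step can be checked numerically: when every worker holds an equal-sized shard, averaging the per-worker gradients gives the same result as computing the gradient over the full batch on one machine. The sketch below simulates three workers in a single process; the function names and the toy data are assumptions, and real AllReduce runs across devices.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error for the model y ≈ w*x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    """Stand-in for AllReduce: every worker ends up with the average gradient."""
    return sum(values) / len(values)

data = [(float(x), 3.0 * x) for x in range(1, 13)]
w = 0.5
shards = [data[i::3] for i in range(3)]              # 3 workers, 4 samples each
local_grads = [mse_gradient(w, shard) for shard in shards]
global_grad = all_reduce_mean(local_grads)
full_grad = mse_gradient(w, data)
print(abs(global_grad - full_grad) < 1e-9)           # True
```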

Model Parallelism — split model across workers (for models too large for one GPU):

┌──────────────┐   ┌──────────────┐
│  Worker 1    │   │  Worker 2    │
│  Layers 1-50 │──▶│ Layers 51-100│
│              │   │              │
└──────────────┘   └──────────────┘

| Strategy | When to Use |
|---|---|
| SageMaker Distributed Data Parallel (SMDDP) | Model fits on one GPU, want faster training with multiple GPUs |
| SageMaker Model Parallel (SMP) | Model too large for one GPU (LLMs, very deep networks) |
| Horovod | Open-source alternative to SMDDP, community ecosystem |

Managed Spot Training

ELI5: Like writing your essay on Google Docs instead of Notepad — if your laptop gets taken away (spot interruption), you pick up exactly where you left off because your progress was saved to the cloud.

┌──────────────────────────────────────────────────────┐
│                SPOT TRAINING FLOW                    │
│                                                      │
│  Start Job → Train → Checkpoint to S3 → Interrupted │
│                ↑                              │       │
│                └──── New Spot Instance ←──────┘       │
│                      Restore checkpoint               │
│                      Continue training               │
└──────────────────────────────────────────────────────┘

  • Cost savings: Up to 90% vs On-Demand instances
  • Checkpointing: Needed to resume after an interruption (save model state to S3 periodically); without it, an interrupted job restarts from the beginning
  • SageMaker handles: Requesting spot capacity, detecting interruptions, restarting
  • You provide: use_spot_instances=True and checkpoint_s3_uri in training job config, checkpoint save/restore logic in training script
  • Max wait time: Set max_wait ≥ max_run; max_wait covers the training time plus the time spent waiting for spot capacity
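
The checkpoint logic in the training script is the part you write yourself. Below is a minimal sketch of the save-and-resume pattern, assuming a JSON file as the checkpoint format and a placeholder in place of real training. The function names and the `stop_after` parameter (used only to simulate an interruption) are hypothetical. `/opt/ml/checkpoints` is the default local directory that SageMaker syncs with `checkpoint_s3_uri`.

```python
import json
from pathlib import Path

def load_checkpoint(ckpt_dir):
    """Return the saved state if a checkpoint exists, otherwise a fresh state."""
    path = Path(ckpt_dir) / "checkpoint.json"
    if path.exists():
        return json.loads(path.read_text())
    return {"epoch": 0, "weights": [0.0]}

def save_checkpoint(ckpt_dir, state):
    """Write to a temp file first, then rename, so a kill mid-write leaves no partial file."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    tmp = ckpt_dir / "checkpoint.json.tmp"
    tmp.write_text(json.dumps(state))
    tmp.replace(ckpt_dir / "checkpoint.json")

def train(ckpt_dir="/opt/ml/checkpoints", total_epochs=10, stop_after=None):
    state = load_checkpoint(ckpt_dir)                 # resume if a checkpoint exists
    for epoch in range(state["epoch"], total_epochs):
        if stop_after is not None and epoch >= stop_after:
            return state                              # simulated spot interruption
        state["weights"] = [w + 1.0 for w in state["weights"]]  # placeholder for one epoch of training
        state["epoch"] = epoch + 1
        save_checkpoint(ckpt_dir, state)
    return state
```

Calling `train` a second time with the same directory continues from the last saved epoch instead of starting over, which is the behavior a restarted spot job relies on.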

Exam tip: If a question asks how to reduce training costs by up to 90%, the answer is Managed Spot Training with checkpointing.

Incremental Training

Resume training from a previously trained model:

  • Provide a pre-trained model artifact as input to a new training job
  • Useful for: continuing interrupted training, fine-tuning with new data, iterative improvement
  • Supported by: Object Detection, Image Classification, Semantic Segmentation built-in algorithms

SageMaker Debugger

Real-time monitoring of training jobs without modifying training code:

Training Job
     │
     ├─▶ Debugger Hook ──▶ S3 (tensors, metrics)
     │                          │
     │                     Debugger Rules
     │                     (run as separate
     │                      processing job)
     │                          │
     │                     CloudWatch Alarms
     │                     SNS Notifications
     └─▶ Training continues (non-blocking)

Built-in Rules:

| Rule | Detects |
|---|---|
| VanishingGradient | Gradients approaching zero; deep networks stop learning |
| ExplodingTensor | Weights/gradients growing uncontrollably (NaN) |
| LossNotDecreasing | Training loss plateaued; possible learning rate or data issue |
| Overfit | Val loss increasing while train loss decreasing |
| Overtraining | Model performance degrading with more training |
| PoorWeightInitialization | Bad initialization causing slow learning |

SageMaker Profiler (part of Debugger):

  • Identifies hardware bottlenecks: CPU/GPU utilization, I/O wait, network
  • Flags common inefficiencies: DataLoader bottlenecks, small batch sizes, underutilized GPUs
  • Outputs a profiling report with actionable recommendations

Hyperparameters vs Parameters

ELI5: Parameters are what the student learns during class (knowledge, skills). Hyperparameters are how you set up the classroom — class size (batch size), teaching speed (learning rate), how many classes (epochs). The student discovers parameters; you decide hyperparameters.

| Type | Examples | Who Sets It | When |
|---|---|---|---|
| Parameters (model weights) | Neural network weights, tree split values, SVM support vectors | Algorithm (learned from data) | During training |
| Hyperparameters | Learning rate, max_depth, num_layers, dropout rate, batch size | You (the practitioner) | Before training |

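Key insight: Hyperparameters are generally not learned with backpropagation. Many are discrete (number of layers, tree depth), and even for continuous ones the gradient of validation loss w.r.t. the hyperparameter is not readily computable because it depends on the entire training run. This is why hyperparameter optimization is a separate outer loop.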


Hyperparameter Optimization (HPO)

Manual Tuning

Try values based on domain knowledge/intuition. Does not scale. Only viable for 1-2 hyperparameters with clear prior knowledge.

Grid Search

Search every combination of a predefined grid:

learning_rate: [0.001, 0.01, 0.1]
batch_size:    [32, 64, 128]
max_depth:     [3, 5, 7]
dropout:       [0.2, 0.3, 0.5]

Total combinations: 3 × 3 × 3 × 3 = 81 training jobs

  • Pros: Exhaustive, finds the best combination within the grid
  • Cons: Curse of dimensionality. Adding one more hyperparameter with 5 values multiplies total jobs by 5; 6 hyperparameters with 5 values each require $5^6 = 15,625$ jobs.
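
Enumerating the grid above takes a few lines with `itertools.product`. This sketch only lists the configurations; each one would still need its own training job.

```python
from itertools import product

grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
    "max_depth": [3, 5, 7],
    "dropout": [0.2, 0.3, 0.5],
}

# One dict per combination, in the order the grid keys were defined.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))   # 81
print(combos[0])     # {'learning_rate': 0.001, 'batch_size': 32, 'max_depth': 3, 'dropout': 0.2}
```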

Random Search

Randomly sample from the hyperparameter distributions:

Grid Search coverage:       Random Search coverage:
┌─────────────────┐         ┌─────────────────┐
│ × × × × × × × │         │   ·   ·  · ·    │
│ × × × × × × × │         │ ·   ·    ·   ·  │
│ × × × × × × × │         │   · ·  ·    ·   │
│ × × × × × × × │         │ ·    ·   · ·    │
│ × × × × × × × │         │   ·   · ·    ·  │
└─────────────────┘         └─────────────────┘
Uniform grid — wastes        Covers the space more
budget on redundant          efficiently; likely to
columns when only one        find better values with
hyperparameter matters       the same budget

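Why random is better (Bergstra & Bengio 2012): For most problems only a few hyperparameters strongly affect performance. A grid tests only a handful of distinct values for each of them and spends the rest of the budget repeating those values across unimportant dimensions. Random search draws a fresh value for every hyperparameter in every trial, so the important dimensions are covered more densely with the same budget.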
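
A minimal random-search sampler, assuming the same four hyperparameters as the grid example and a log-uniform distribution for the learning rate. The ranges and the function name are illustrative choices, not values prescribed by any library.

```python
import random

def sample_config(rng):
    """Draw one configuration; every call gives new values for all hyperparameters."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform between 1e-4 and 1e-1
        "batch_size": rng.choice([32, 64, 128]),
        "max_depth": rng.randint(3, 7),                # inclusive on both ends
        "dropout": rng.uniform(0.2, 0.5),
    }

rng = random.Random(42)
trials = [sample_config(rng) for _ in range(20)]
distinct_lrs = len({t["learning_rate"] for t in trials})
print(distinct_lrs)   # one distinct learning rate per trial
```

With 20 trials, random search tries 20 different learning rates, whereas the 81-job grid above only ever tries 3.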

Bayesian Optimization

ELI5: Instead of blindly trying random combinations, Bayesian optimization learns from previous experiments: “learning rate 0.1 was good, 0.5 was terrible — so let’s try something near 0.1 next.” It builds a map of the performance landscape and chooses the next experiment intelligently.

Iteration 1: Try (lr=0.1, depth=3) → val_acc = 0.82
Iteration 2: Try (lr=0.5, depth=3) → val_acc = 0.61
Iteration 3: Surrogate model says: lr near 0.1 is promising
             Try (lr=0.08, depth=5) → val_acc = 0.87
Iteration 4: Surrogate model updated, tries (lr=0.09, depth=6) → 0.89
...

Mechanics:

  1. Fit a surrogate model (Gaussian Process, Random Forest) on observed (hyperparams → performance) pairs
  2. Use an acquisition function to decide the next point to evaluate:
    • Expected Improvement (EI): choose point with highest expected gain over current best
    • Upper Confidence Bound (UCB): balance exploration vs exploitation
  3. Evaluate the actual training job at that point, update the surrogate, repeat
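
For reference, the two acquisition functions named above are commonly written as follows for a maximization objective. The symbols are introduced here for illustration and do not appear elsewhere in these notes: $\mu(x)$ and $\sigma(x)$ are the surrogate's predicted mean and standard deviation at hyperparameters $x$ (with $\sigma(x) > 0$), $f^*$ is the best objective value observed so far, $\Phi$ and $\phi$ are the standard normal CDF and PDF, and $\kappa \ge 0$ is an exploration weight you choose.

$$\mathrm{EI}(x) = (\mu(x) - f^*)\,\Phi(z) + \sigma(x)\,\phi(z), \qquad z = \frac{\mu(x) - f^*}{\sigma(x)}$$

$$\mathrm{UCB}(x) = \mu(x) + \kappa\,\sigma(x)$$

In both cases a large $\sigma(x)$ (an unexplored region) raises the score, which is how the search balances exploration against exploiting points near the current best.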

| HPO Method | Jobs Needed | Best For |
|---|---|---|
| Grid Search | Exponential in #params | ≤ 2 hyperparameters, small grid |
| Random Search | Linear | Quick baseline, many hyperparameters |
| Bayesian | Sub-linear (smart) | When jobs are expensive, want efficiency |
| Hyperband | Sub-linear + early stopping | When early performance predicts final performance |

Hyperband

ELI5: Like American Idol auditions — everyone sings 30 seconds, cut the worst half, give survivors 2 minutes, cut again, give finalists the full song. Same budget, more winners identified early.

Successive Halving:

Round 1: 16 candidates × 1 epoch each   → keep best 8
Round 2: 8 candidates  × 2 epochs each  → keep best 4
Round 3: 4 candidates  × 4 epochs each  → keep best 2
Round 4: 2 candidates  × 8 epochs each  → keep best 1
                                           ↑ final winner
Total epochs used: 16×1 + 8×2 + 4×4 + 2×8 = 64
vs. 16 full-run = 16×8 = 128 epochs

Assumption: Early performance correlates with final performance. Works well for deep learning, less well when good configurations only pull ahead late in training, because they may be eliminated in an early round.
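
The budget arithmetic above can be reproduced with a short helper. The function name is an assumption, and the accounting follows the example as written: each round's epochs are counted separately.

```python
def successive_halving_schedule(n_candidates, min_epochs=1):
    """Return [(candidates, epochs_each), ...], halving candidates and doubling epochs."""
    rounds = []
    n, epochs = n_candidates, min_epochs
    while n > 1:
        rounds.append((n, epochs))
        n //= 2
        epochs *= 2
    return rounds

rounds = successive_halving_schedule(16)
budget = sum(n * e for n, e in rounds)
full_budget = 16 * rounds[-1][1]      # every candidate trained for the longest run
print(rounds)        # [(16, 1), (8, 2), (4, 4), (2, 8)]
print(budget)        # 64
print(full_budget)   # 128
```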

SageMaker Automatic Model Tuning (AMT)

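# Example configuration.
# Assumes `estimator` is an already-configured SageMaker Estimator. For a custom
# training script, also pass metric_definitions so AMT can parse the objective
# metric from the training logs.
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='validation:accuracy',
    objective_type='Maximize',          # or 'Minimize'
    hyperparameter_ranges={
        'learning_rate': ContinuousParameter(0.0001, 0.1, scaling_type='Logarithmic'),
        'num_layers':    IntegerParameter(1, 10),
        'optimizer':     CategoricalParameter(['adam', 'sgd', 'rmsprop']),
    },
    max_jobs=50,                        # total training jobs budget
    max_parallel_jobs=5,                # concurrent jobs
    strategy='Bayesian',                # 'Random', 'Bayesian', 'Hyperband', 'Grid'
    early_stopping_type='Auto',         # stop unpromising jobs early
)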

Key concepts:

| Setting | Meaning |
|---|---|
| objective_metric_name | Metric your training script emits (must match exactly) |
| max_jobs | Total training jobs budget |
| max_parallel_jobs | Concurrent jobs (more parallel = less info for Bayesian guidance) |
| strategy = 'Bayesian' | Default; most efficient for expensive jobs |
| early_stopping_type = 'Auto' | Stop jobs early if not improving |

Warm start: Start a new tuning job that reuses the results of one or more previous tuning jobs. Two types:

  • IDENTICAL_DATA_AND_ALGORITHM — same setup, add more jobs
  • TRANSFER_LEARNING — changed some ranges/algorithm, use prior as guidance

Logarithmic scaling: Use for parameters spanning orders of magnitude (e.g., learning rate from 0.0001 to 0.1). Ensures equal sampling density at each scale.

Exam tip: For AMT, max_parallel_jobs should be much less than max_jobs. If max_parallel_jobs = max_jobs, all jobs run simultaneously and Bayesian optimization can’t use feedback from earlier jobs — it degrades to random search.


Regularization Techniques

Dropout

ELI5: During practice, randomly mute different teammates each game. Everyone learns to do their job independently instead of relying on specific partners. On game day (inference), everyone plays — and they perform better because of the independent practice.

Training:                         Inference:
  Input                             Input
    │                                 │
  [h1][h2][h3][h4]                 [h1][h2][h3][h4]
    │   ✗   │   ✗   ← 50% dropped     │   │   │   │  ← all active
  [h5][  ][h6][  ]                 [h5][h6][h7][h8]
    │       │                         │   │   │   │
  Output                            Output (scale by keep_prob)
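
  • Typical rates: 0.2–0.5 (drop 20–50% of neurons)
  • Applied only during training. At inference all neurons are active; classic dropout then scales outputs by the keep probability, while most frameworks use inverted dropout (scale the kept activations by 1/keep_prob during training, so inference needs no scaling)
  • Effect: Prevents co-adaptation, acts as implicit ensemble of $2^n$ networks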
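
A minimal sketch of inverted dropout, the variant most deep learning frameworks implement: kept activations are scaled up by 1/keep_prob during training so that inference needs no scaling at all. The function name and signature are assumptions for illustration.

```python
import random

def dropout(activations, drop_rate, training, rng=random):
    """Inverted dropout: zero each activation with probability drop_rate while training."""
    if not training or drop_rate == 0.0:
        return list(activations)          # inference: all neurons active, no scaling
    keep_prob = 1.0 - drop_rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
train_out = dropout([1.0] * 10_000, drop_rate=0.5, training=True, rng=rng)
infer_out = dropout([1.0, 2.0, 3.0], drop_rate=0.5, training=False)
print(sum(train_out) / len(train_out))    # close to 1.0: the expected activation is preserved
print(infer_out)                          # [1.0, 2.0, 3.0]
```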

Batch Normalization

For each mini-batch, normalize layer activations using the batch mean $\mu_B$ and batch variance $\sigma_B^2$: $\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$, then scale and shift: $y = \gamma\hat{x} + \beta$

ELI5: Each layer keeps re-centering its inputs so the next layer always receives well-behaved data — not too big, not too small, centered around zero. Like a translator who converts everything to a standard format before passing it along.

  • Benefits: Smoother loss landscape, enables higher learning rates, reduces sensitivity to initialization
  • Placement: Usually before activation function, sometimes after
  • Training vs inference: Uses batch statistics during training, running mean/variance during inference
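
The training-time computation for a single feature can be sketched in a few lines. It uses the commonly implemented form that divides by $\sqrt{\sigma_B^2 + \epsilon}$; the function name and default values are assumptions, and the running statistics used at inference are omitted.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale by gamma and shift by beta."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)   # batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

out = batch_norm([10.0, 12.0, 14.0, 16.0])
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print(round(out_mean, 6), round(out_var, 4))   # approximately 0 and 1
```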

Early Stopping

Monitor validation loss; stop training when it stops improving:

Loss
│
│  Training loss ────────────────────────────────▶ (keeps decreasing)
│  ╲
│   ╲
│    ╲____________
│                 ╲_______
│  Validation     ↑       ╲___  ← best point      ╲ starts
│  loss           │            ╲____________________╲  increasing
│                 │                                    ← overfitting
│                 └─── EARLY STOP HERE (patience=5)
└──────────────────────────────────────────────────────── Epochs

  • Patience: Number of consecutive epochs without validation improvement to tolerate before stopping
  • Restore best weights: Save model checkpoint at the best validation epoch, restore at the end
  • Benefit: Nearly free regularization; patience is the only extra setting to choose
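
Patience-based early stopping can be sketched as a small helper class. The class name, the `min_delta` parameter, and the validation losses below are made up for illustration; in a real loop, the line that records `best_epoch` is where the model checkpoint would be saved.

```python
class EarlyStopping:
    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record this epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch       # save the best checkpoint here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

val_losses = [0.90, 0.70, 0.60, 0.55, 0.57, 0.58, 0.60, 0.63]
stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        break
print(epoch, stopper.best_epoch)   # 6 3: stopped at epoch 6, best weights from epoch 3
```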

Weight Decay (L2 Regularization in Optimizers)

Add a term to the optimizer step that shrinks weights toward zero each step:

$$w \leftarrow w(1 - \eta\lambda) - \eta\nabla L$$

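Equivalent to L2 regularization for plain SGD, but not for adaptive optimizers: Adam with an L2 penalty passes the penalty gradient through its per-parameter adaptive scaling, whereas AdamW applies the decay directly to the weights, decoupled from the gradient update.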
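
The SGD equivalence is easy to verify numerically for one step: adding the penalty $\frac{\lambda}{2}w^2$ to the loss contributes $\lambda w$ to the gradient, which rearranges into the decay form. The function names and numbers below are illustrative assumptions.

```python
def sgd_with_l2(w, grad, lr, lam):
    """SGD on loss + (lam / 2) * w**2: the penalty adds lam * w to the gradient."""
    return w - lr * (grad + lam * w)

def sgd_with_weight_decay(w, grad, lr, lam):
    """Decay form: shrink the weight by (1 - lr * lam), then take the gradient step."""
    return w * (1 - lr * lam) - lr * grad

w, grad, lr, lam = 2.0, 0.3, 0.1, 0.01
a = sgd_with_l2(w, grad, lr, lam)
b = sgd_with_weight_decay(w, grad, lr, lam)
print(abs(a - b) < 1e-12)   # True: the two updates coincide for plain SGD
```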


Common Training Problems & Fixes

| Symptom | Diagnosis | Fix |
|---|---|---|
| Training loss not decreasing | Learning rate too low, vanishing gradient, bug | Increase LR; check gradients; debug code |
| Training loss oscillating wildly | Learning rate too high | Reduce LR by 10×; use LR schedule |
| Training loss → NaN | Exploding gradients, bad data | Gradient clipping; check for inf/nan in data |
| Train loss low, val loss high | Overfitting | More data, dropout, L2, early stopping, less capacity |
| Both losses high | Underfitting | Bigger model, more features, fewer regularization constraints |
| GPU utilization < 50% | Data loading bottleneck | More DataLoader workers; Pipe mode; prefetching |
| Loss decreasing very slowly | Batch size too small; bad LR schedule | Increase batch size; use LR warm-up |
| Validation loss better than train loss | Train augmentation/dropout active, val not | Expected and fine; model generalizes well |

Why this matters for the exam: SageMaker Debugger automates detecting many of these — vanishing gradient, exploding tensor, loss not decreasing. Questions will test whether you can read symptoms and diagnose causes.


Transfer Learning

ELI5: Instead of training a doctor from kindergarten, hire a general doctor and spend 6 months teaching them your specialty. Most of their general medical knowledge transfers directly — you’re only teaching the specialized part.

Pre-trained Model (e.g., ResNet-50 on ImageNet):
┌──────────────────────────────────────────────────┐
│  Conv layers (edge/shape/texture detectors)      │
│  ┌────┐ ┌────┐ ┌────┐ ┌────┐  [FROZEN]          │
│  │ C1 │→│ C2 │→│ C3 │→│ C4 │  keep these weights│
│  └────┘ └────┘ └────┘ └────┘                    │
│                                                  │
│  Final classifier [REPLACED + RETRAINED]         │
│  ┌───────────────────┐                           │
│  │  FC → Softmax     │  your 5 categories        │
│  └───────────────────┘                           │
└──────────────────────────────────────────────────┘

Two Strategies

| Strategy | Approach | When to Use |
|---|---|---|
| Feature Extraction | Freeze ALL pre-trained layers; only train new head | Very small dataset (< 1000 samples); similar domain |
| Fine-tuning | Unfreeze some/all layers; train with small LR (1e-4 to 1e-5) | Medium dataset; possibly different domain |

Fine-tuning rules:

  • Unfreeze from the top (last layers) downward
  • Use a much smaller learning rate than original training (avoid destroying learned representations)
  • Earlier layers = more generic features (usually safe to keep frozen)
  • Later layers = more task-specific features (safe to fine-tune)

Transfer Learning in SageMaker

  • Image Classification, Object Detection, Semantic Segmentation algorithms support incremental training
  • JumpStart provides pre-trained model hubs for fine-tuning
  • Fine-tuning is a first-class feature: provide model_id, your dataset, config

Exam tip: “Small dataset + image/NLP task” → Transfer learning. “Similar domain” → Feature extraction (freeze more). “Different domain” → Fine-tuning (unfreeze more layers, keep the LR small).


Quick Reference: Training Decisions

Want to reduce training cost?
├─ Up to 90% → Managed Spot Training + Checkpointing
└─ Smaller instance → Optimize batch size + Pipe Mode

Large dataset, slow startup?
└─ Use Pipe Mode instead of File Mode

Model too large for one GPU?
├─ Model Parallelism (SageMaker SMP)
└─ Need more throughput → Data Parallelism (SMDDP / Horovod)

Overfitting?
├─ Add Dropout (0.2–0.5)
├─ L2 / Weight decay
├─ Early Stopping
├─ More training data / Data Augmentation
└─ Reduce model capacity

Training loss not moving?
├─ Learning rate too low → increase
├─ Learning rate too high → reduce
└─ Check data pipeline (NaN values, wrong labels)

Want to find best hyperparameters efficiently?
├─ < 3 hyperparameters, cheap jobs → Grid or Random Search
├─ Expensive jobs, any count → Bayesian (SageMaker AMT)
└─ Long jobs where early performance predicts final → Hyperband (AMT with Hyperband strategy)