Domain 3A: ML Fundamentals from First Principles

Exam Domain: 3 (Modeling, 36%)
Focus: The foundations every other ML concept builds on


What is Machine Learning? (First Principles)

The Fundamental Idea

Traditional programming: you write the rules, and the computer applies them to data. Machine learning: you give the computer data + answers, and it discovers the rules.

Traditional Programming:
  Rules + Data → Program → Output

Machine Learning:
  Data + Output → Training → Rules (Model)
  New Data + Model → Prediction

ELI5: Imagine teaching a child to recognize cats. Traditional programming means writing an encyclopedia of rules: “has fur, has four legs, has pointy ears, meows…” — and it breaks the moment you see a hairless cat. Machine learning means showing the child 10,000 photos labeled “cat” or “not cat” and letting them figure out the patterns themselves. The child’s brain becomes the model.

Types of Learning — The Taxonomy

Machine Learning
├── Supervised Learning          (has labels)
│   ├── Classification           → discrete output (cat/dog, spam/not-spam)
│   └── Regression               → continuous output (house price, temperature)
│
├── Unsupervised Learning        (no labels)
│   ├── Clustering               → group similar data points
│   ├── Dimensionality Reduction → compress features
│   └── Anomaly Detection        → find unusual points
│
├── Semi-Supervised Learning     (few labels + many unlabeled)
│   └── Self-training, label propagation
│
└── Reinforcement Learning       (reward signal, no direct labels)
    └── Agent learns by trial + error to maximize reward

| Type | Labels? | Learns | Example Use Case |
| --- | --- | --- | --- |
| Supervised | Yes, all data | Input → Output mapping | Email spam detection |
| Unsupervised | No | Hidden structure | Customer segmentation |
| Semi-supervised | Few | Leverage unlabeled data | Document classification (10 labeled, 10K unlabeled) |
| Reinforcement | Reward only | Optimal actions | Game playing, robot control |

When to use each:

  • Supervised: You have labeled examples and want to predict a known target
  • Unsupervised: You want to discover patterns without knowing what you’re looking for
  • Semi-supervised: Labeling is expensive but unlabeled data is abundant (most real-world text/image problems)
  • Reinforcement: The goal is sequential decision-making with delayed rewards (robotics, games, resource optimization)

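Exam tip: “Discriminative vs generative” is a distinction that usually comes up for classifiers. Discriminative models (logistic regression, SVM, neural networks) learn the conditional P(y|x), or the decision boundary directly. Generative models (Naive Bayes, GMM) model how the data is generated: they learn P(x|y) together with the class prior P(y), which gives the joint P(x, y), and then classify via Bayes’ rule.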


The Learning Process — What “Training” Actually Means

The Mechanical View

Step 1: Initialize weights randomly
         w = [0.1, -0.3, 0.7, ...]

Step 2: Forward pass — make a prediction
         ŷ = model(x, w)

Step 3: Compute loss — measure how wrong we are
         L = loss(y, ŷ)

Step 4: Backward pass — compute gradients
         ∂L/∂w = how much each weight contributed to the error

Step 5: Update weights — reduce the error
         w_new = w_old - α × (∂L/∂w)

Step 6: Repeat steps 2-5 for all training data
         (one pass through all data = one epoch)

Training is just this loop, repeated thousands or millions of times, until the loss stops decreasing.
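
The six steps above can be sketched for the simplest possible model, a one-feature linear regression trained with MSE. This is only an illustrative sketch: the toy data, the learning rate of 0.1, and the 500 epochs are assumptions chosen for the example, not values from the exam guide.

```python
import random

random.seed(0)

# Toy data (assumed for illustration): y = 2x + 1 plus a little noise.
xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 + random.gauss(0, 0.05) for x in xs]

# Step 1: initialize weights randomly.
w, b = random.uniform(-1, 1), random.uniform(-1, 1)
alpha = 0.1  # learning rate (assumed value)

losses = []
n = len(xs)
for epoch in range(500):
    # Step 2: forward pass.
    preds = [w * x + b for x in xs]
    # Step 3: compute loss (MSE).
    loss = sum((y - p) ** 2 for y, p in zip(ys, preds)) / n
    losses.append(loss)
    # Step 4: backward pass, gradients of the MSE with respect to w and b.
    grad_w = sum(-2 * (y - p) * x for x, y, p in zip(xs, ys, preds)) / n
    grad_b = sum(-2 * (y - p) for y, p in zip(ys, preds)) / n
    # Step 5: update weights.
    w -= alpha * grad_w
    b -= alpha * grad_b
    # Step 6: the loop repeats; each pass over all the data is one epoch.

print(f"w={w:.2f}, b={b:.2f}, first loss={losses[0]:.4f}, last loss={losses[-1]:.4f}")
```

The learned `w` and `b` should land close to the 2 and 1 used to generate the data, and the recorded losses show the "loss stops decreasing" behaviour described above.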

Loss / Cost Functions — Measuring Wrongness

A loss function takes the model’s prediction and the true answer and returns a single number: how wrong is the model?

The model’s job is to minimize this number.

ELI5: A loss function is like a score in a game where lower is better. If you’re playing golf, every swing that misses costs you points. The model is the golfer. Training is practice. Each round of practice, you slightly adjust your swing (weights) based on what caused the most penalty. After enough rounds, you’ve learned to minimize your score.

Mean Squared Error (Regression)

$$L = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

  • Penalizes large errors disproportionately (squaring amplifies big mistakes)
  • Sensitive to outliers
  • Units: squared (if target is dollars, loss is dollars²)

Root Mean Squared Error

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

  • Same units as the target variable (interpretable)
  • Still sensitive to outliers, but more readable

Cross-Entropy Loss (Classification)

$$L = -\sum_{i} y_i \log(\hat{y}_i)$$

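  • Here i runs over the classes of a single example and y is one-hot, so only the true class’s term survives: L = -log(ŷ) for the true class
  • Penalizes confident wrong predictions severely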
  • If truth is class 1 and model says P(class 1) = 0.01 → massive penalty
  • If truth is class 1 and model says P(class 1) = 0.99 → tiny penalty

Binary cross-entropy (for binary classification): $$L = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$$
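
To put numbers on the "massive penalty" versus "tiny penalty" bullets, here is a small sketch that evaluates the binary cross-entropy formula for the two cases mentioned. It assumes the natural logarithm, which is the usual convention in ML libraries.

```python
import math

def binary_cross_entropy(y_true, p_pred):
    """Loss for one example: y_true is 0 or 1, p_pred is the predicted P(class 1)."""
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))

confident_wrong = binary_cross_entropy(1, 0.01)  # truth is class 1, model says 1%
confident_right = binary_cross_entropy(1, 0.99)  # truth is class 1, model says 99%

print(f"confident and wrong: {confident_wrong:.3f}")  # 4.605
print(f"confident and right: {confident_right:.3f}")  # 0.010
```

The confident mistake costs several hundred times more than the confident correct prediction, which is the asymmetry the bullets describe.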

Why Different Loss Functions?

| Loss Function | Task | Key Property |
| --- | --- | --- |
| MSE | Regression | Penalizes large errors heavily |
| MAE (Mean Absolute Error) | Regression | Robust to outliers (linear penalty) |
| Huber Loss | Regression | MSE for small errors, MAE for large (best of both) |
| Cross-Entropy | Classification | Probabilistic interpretation, confident mistakes punished hard |
| Hinge Loss | SVM/Classification | Maximizes margin, 0 loss when correctly classified with confidence |
| Focal Loss | Imbalanced classification | Down-weights easy examples, focuses learning on hard cases |

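Exam tip: Huber loss is the answer when you want regression that is robust to outliers but still has smooth gradients near zero error. Cross-entropy is almost always the right choice for classification. MSE is a poor fit there: applied to sigmoid or softmax outputs, its gradient becomes very flat when the model is confidently wrong, so learning stalls exactly where correction is needed most.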
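
As a rough illustration of the outlier behaviour in the table, the sketch below compares MSE, MAE, and Huber loss on the same set of errors with and without one outlier. The error values and the Huber threshold `delta=1.0` are assumptions picked for the example.

```python
def mse(errors):
    return sum(e ** 2 for e in errors) / len(errors)

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def huber(errors, delta=1.0):
    """Quadratic for |e| <= delta, linear beyond it."""
    total = 0.0
    for e in errors:
        if abs(e) <= delta:
            total += 0.5 * e ** 2
        else:
            total += delta * (abs(e) - 0.5 * delta)
    return total / len(errors)

clean = [0.5, -0.5, 0.5, -0.5]          # prediction errors, no outlier
with_outlier = [0.5, -0.5, 0.5, 10.0]   # same, but one error is an outlier

for name, fn in [("MSE", mse), ("MAE", mae), ("Huber", huber)]:
    print(f"{name:5s} clean={fn(clean):.4f}  with outlier={fn(with_outlier):.4f}")
```

A single outlier moves MSE from 0.25 to about 25.19, while MAE and Huber stay below 3 because their penalty grows only linearly for large errors.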


Gradient Descent — The Engine of ML

The Intuition

Imagine you’re lost in a hilly landscape covered in fog. Your goal is to reach the lowest valley. You can’t see far, but you can feel the slope under your feet. What do you do?

  1. Feel which direction goes downhill (compute gradient)
  2. Take a step in that direction (update weights)
  3. Repeat until you can’t go any lower (convergence)

This is gradient descent.

Loss
  │
  │    ╲        Current position
  │     ╲  ← ●
  │      ╲      ╲
  │       ╲      ╲
  │        ╲      ●  ← after one step
  │         ╲      ╲
  │          ╲      ╲
  │           ╲      ●  ← converging
  │            ╲____●___
  │                  ↑ minimum
  └──────────────────────── Weights (w)

Gradient = slope at current position (∂L/∂w)
Step size = learning rate (α)

The Math

$$w_{new} = w_{old} - \alpha \cdot \frac{\partial L}{\partial w}$$

  • $\alpha$ is the learning rate — how big each step is
  • $\frac{\partial L}{\partial w}$ is the gradient — which direction and how steeply uphill
  • We subtract the gradient because we want to go downhill

Learning Rate: The Most Critical Hyperparameter

α too HIGH:   Loss
              │  ╱╲  ╱╲  ╱╲
              │╱    ╲╱   ╲╱   → oscillates, diverges
              │
              Steps overshoot the minimum

α just right: Loss
              │╲
              │ ╲___  → smooth convergence
              │
              Steps land near the minimum

α too LOW:    Loss
              │╲
              │ ╲__________  → converges, but takes forever
              │
              Tiny steps, takes thousands of extra epochs
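
The three regimes in the diagram can be reproduced on a toy loss. This sketch minimizes L(w) = w², whose gradient is 2w; the loss function, starting point, and the three learning rates are assumptions chosen to make each regime visible.

```python
def run_gradient_descent(alpha, steps=50, w0=1.0):
    """Minimize L(w) = w**2 (gradient 2w) and return the final loss."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w ** 2

too_high = run_gradient_descent(alpha=1.1)    # each step overshoots further: diverges
just_right = run_gradient_descent(alpha=0.3)  # shrinks quickly toward the minimum
too_low = run_gradient_descent(alpha=0.001)   # barely moves in 50 steps

print(f"alpha=1.1   final loss={too_high:.3e}")
print(f"alpha=0.3   final loss={just_right:.3e}")
print(f"alpha=0.001 final loss={too_low:.3f}")
```

With the starting loss at 1.0, the high rate ends far above it, the moderate rate ends essentially at zero, and the low rate has only dropped to roughly 0.82.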

Learning rate strategies:

  • Learning rate schedule: start high, decay over time (step decay, cosine annealing)
  • Warm-up: start low, ramp up, then decay — prevents instability at beginning of training
  • Cyclical LR: oscillate between bounds — helps escape local minima

Gradient Descent Variants

Batch Gradient Descent:
  ┌─────────────────────────────┐
  │ for each epoch:             │
  │   compute gradient on ALL   │
  │   N training examples       │
  │   update weights once       │
  └─────────────────────────────┘
  Pro: stable, accurate gradient
  Con: SLOW — one update per full dataset pass

Stochastic Gradient Descent (SGD):
  ┌─────────────────────────────┐
  │ for each epoch:             │
  │   for each sample:          │
  │     compute gradient on     │
  │     SINGLE sample           │
  │     update weights          │
  └─────────────────────────────┘
  Pro: fast updates, online learning
  Con: noisy gradient, may never fully converge

Mini-batch Gradient Descent:
  ┌─────────────────────────────┐
  │ for each epoch:             │
  │   for each mini-batch       │
  │   of size B (32, 64, 128):  │
  │     compute gradient on B   │
  │     update weights          │
  └─────────────────────────────┘
  Pro: balance of speed + stability
  Con: batch size is a hyperparameter
  → THIS IS THE STANDARD IN DEEP LEARNING

| Variant | Batch Size | Update Frequency | Gradient Noise | Memory |
| --- | --- | --- | --- | --- |
| Batch GD | All N | Once per epoch | Low (accurate) | High |
| SGD | 1 | N times per epoch | High (noisy) | Low |
| Mini-batch GD | 32-512 | N/B times per epoch | Medium | Medium |
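
The "Update Frequency" column follows directly from how the data is sliced. Below is a minimal sketch of a mini-batch iterator; the helper name `iterate_minibatches` and the dataset size of 1000 are hypothetical, used only to count weight updates per epoch for each variant.

```python
import random

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield lists of sample indices that together cover the dataset once (one epoch)."""
    indices = list(range(n_samples))
    rng.shuffle(indices)  # reshuffle each epoch so batches differ
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

rng = random.Random(0)
n = 1000  # assumed dataset size
updates_per_epoch = {
    "batch GD (B=N)": sum(1 for _ in iterate_minibatches(n, n, rng)),
    "SGD (B=1)": sum(1 for _ in iterate_minibatches(n, 1, rng)),
    "mini-batch (B=32)": sum(1 for _ in iterate_minibatches(n, 32, rng)),
}
print(updates_per_epoch)
```

With N = 1000 this gives 1, 1000, and 32 updates per epoch respectively; the last mini-batch is smaller than 32 because 1000 is not a multiple of 32.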

Advanced Optimizers

Plain SGD has problems: equal learning rate for all weights, slow in ravines. Advanced optimizers fix this.

Optimizer Family Tree:
SGD
 ├── SGD + Momentum       (adds velocity, accelerates in consistent directions)
 ├── Adagrad              (adaptive LR: frequent params get smaller LR)
 │    └── RMSprop         (Adagrad + exponential moving average of squared gradients)
 └── Adam                 (RMSprop + Momentum combined)
      └── AdamW           (Adam + weight decay decoupled from gradient)

ELI5 on Momentum: Plain SGD is like a ball rolling down a bumpy hill that stops dead every time it hits a small bump. Momentum gives the ball mass — it builds up speed going downhill, which helps it roll through small bumps and get out of shallow dips. “Instead of recalculating direction from scratch each step, you keep some of the previous velocity.”

Adam (Adaptive Moment Estimation) is the default for deep learning because:

  1. Maintains per-parameter adaptive learning rates (like RMSprop)
  2. Tracks momentum in the gradient direction (like SGD with momentum)
  3. Works well with little tuning on most problems
  4. Handles sparse gradients (NLP) and dense gradients (vision) well

| Optimizer | Best For | Key Hyperparams |
| --- | --- | --- |
| SGD + Momentum | Computer vision (CNNs), when Adam overfits | lr, momentum |
| Adam | Default for most deep learning | lr, β₁=0.9, β₂=0.999, ε |
| AdamW | Transformers, NLP | lr, weight_decay |
| RMSprop | RNNs | lr, rho |
| Adagrad | Sparse data, NLP | lr |
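
To make the update rules concrete, here is a single-weight sketch of SGD with momentum and of Adam. The toy loss L(w) = (w - 3)², the learning rates, and the step counts are assumptions for illustration; β₁ = 0.9 and β₂ = 0.999 are the defaults listed in the table. Real frameworks apply the same arithmetic per parameter tensor.

```python
import math

def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)**2."""
    return 2 * (w - 3)

def sgd_momentum(w, steps=200, lr=0.1, momentum=0.9):
    velocity = 0.0
    for _ in range(steps):
        # Keep part of the previous velocity, then add the new gradient step.
        velocity = momentum * velocity - lr * grad(w)
        w = w + velocity
    return w

def adam(w, steps=2000, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v = 0.0, 0.0  # running averages of the gradient and the squared gradient
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # momentum-like term
        v = beta2 * v + (1 - beta2) * g * g    # RMSprop-like term
        m_hat = m / (1 - beta1 ** t)           # bias correction for zero initialization
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_momentum = sgd_momentum(0.0)
w_adam = adam(0.0)
print(f"momentum: w={w_momentum:.4f}, adam: w={w_adam:.4f}")  # both approach 3
```

Dividing by the square root of `v_hat` is what gives Adam its per-parameter step size: weights with consistently large gradients take proportionally smaller steps.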

Local Minima vs Global Minima

1D Loss Surface (simple, illustrative):
  Loss
  │     local     global
  │      min       min
  │  ╲  /╲  ╲  /╲  ╲
  │   ╲/  ╲  ╲/  ╲  ╲____
  └──────────────────────── w
       ↑               ↑
  can get trapped     want this

In high dimensions (millions of weights), this is much less of a problem than it looks in 1D. Most critical points in high-dimensional loss landscapes are saddle points (flat in some directions, curved in others), not true local minima. The practical issue is getting stuck on plateaus (very flat regions where gradients are near zero).

Convergence — when to stop:

  • Validation loss stops decreasing (most common)
  • Gradient norm falls below a threshold
  • Fixed number of epochs reached

Bias-Variance Tradeoff — The Central Tension of ML

What is Bias?

Bias = systematic error — the model is consistently wrong in the same direction because it’s too simple to capture the true pattern.

True function (wiggly)        High-bias model (straight line)
     ╭──╮                           /
  ───╯   ╰───╮                     /
              ╰──╮               /
                 ╰───           /  ← misses all the curves
                                    (underfitting)

ELI5: Bias is like wearing the wrong prescription glasses — everything you see is consistently blurry in the same way. You might look at a horse and always see a slightly fuzzy shape. The error is systematic and predictable. You could look at a million horses and still be consistently wrong because the problem is your glasses, not the horse.

High bias symptoms:

  • High error on training data AND test data
  • Model is “too simple” — a linear model fitting a cubic relationship
  • Adding more data doesn’t help (the model can’t learn the complexity anyway)

What is Variance?

Variance = sensitivity to training data — the model memorizes noise and fluctuations, not the true signal.

True function (simple)         High-variance model (over-wiggly)
                                     ╭╮    ╭╮
      /                           ───╯╰────╯╰───
     /  (clean linear trend)
    /

ELI5: Variance is like a hypochondriac who thinks every new symptom is a different rare disease — over-reacting to every data point. The model is so sensitive to the specific training examples it saw that if you replaced just one training point, you’d get a completely different model.

High variance symptoms:

  • Very low error on training data, high error on test data
  • Model is “too complex” — a degree-50 polynomial fitting 20 data points
  • Getting more training data usually helps

The Tradeoff

Total Error = Bias² + Variance + Irreducible Error

Error
  │
  │  Bias²       Total Error
  │    ╲            ╱╲
  │     ╲          ╱  ╲
  │      ╲        ╱    ╲
  │       ╲──────╱  ╲   ╲
  │              ╱   Variance
  │             ╱
  │____________╱___________
  └──────────────────────── Model Complexity
   simple              complex
   (underfit)    ↑    (overfit)
             sweet spot
  • Irreducible error: noise in the data itself — can never be eliminated
  • As complexity increases: bias decreases, variance increases
  • The art of ML is finding the sweet spot
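
The two failure modes can be seen side by side with two deliberately extreme models. In this sketch, the sine target, the noise level, and the choice of models (a constant predictor for high bias, a 1-nearest-neighbor lookup for high variance) are all assumptions made for illustration.

```python
import math
import random

rng = random.Random(42)

def sample(n):
    """Noisy samples of sin(x) on [0, 6] (assumed toy problem)."""
    xs = [rng.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = sample(50)
test_x, test_y = sample(200)

# High bias: ignore x entirely and always predict the training mean.
mean_y = sum(train_y) / len(train_y)

def constant_model(x):
    return mean_y

# High variance: 1-nearest-neighbor memorizes every training point, noise included.
def nearest_neighbor_model(x):
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

for name, model in [("constant", constant_model), ("1-NN", nearest_neighbor_model)]:
    print(f"{name:8s} train MSE={mse(model, train_x, train_y):.3f}  "
          f"test MSE={mse(model, test_x, test_y):.3f}")
```

The constant model shows the high-bias pattern (similar, high error on both sets), while 1-NN shows the high-variance pattern (zero training error, clearly higher test error).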

The Train / Validation / Test Split — Why Three Sets

                        All available data
                              │
           ┌──────────────────┼──────────────────┐
           │                  │                  │
      Training Set       Validation Set        Test Set
      (60-80%)             (10-20%)            (10-20%)
           │                  │                  │
     Learn patterns      Tune model          Final honest
     from this            hyperparams         evaluation
           │              (compare             (NEVER touch
     Repeat many          models,              until done!)
     times over           early stopping)

ELI5: Training data is homework — you practice on it repeatedly. Validation data is practice tests — you check your progress and adjust your study approach. Test data is the final exam — you only look at it once, at the very end, and you never use it to guide your studying. If you peek at the final exam questions while studying, your grade is meaningless as a measure of whether you actually learned anything.

Why not just train/test? Because you’ll implicitly optimize for the test set through repeated model selection. Every time you try a new model and compare test performance, you’re “using” the test set to make decisions. The validation set absorbs this overfitting, protecting the test set’s integrity.

Common splits:

  • 60/20/20 (train/val/test): classic, good for medium datasets
  • 80/10/10: when you have plenty of data and need more training
  • 90/5/5: large datasets (millions of examples) — 5% is still huge
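
A three-way split is only a shuffle followed by two cuts. The sketch below shows one way to do it with the standard library; the function name and defaults are hypothetical, and the shuffle assumes the rows are independent (it would not be appropriate for time series).

```python
import random

def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into three non-overlapping parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(1000))
print(len(train), len(val), len(test))  # 600 200 200
```

Fixing the seed keeps the test set identical across experiments, which matters because a test set that changes between runs cannot serve as a stable final benchmark.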

Underfitting vs Overfitting — Diagnosis and Fixes

Diagnosing

                     Train Error    Val Error    Diagnosis
────────────────────────────────────────────────────────
Underfitting           HIGH          HIGH        Model too simple
Good fit               LOW           LOW         Generalizes well
Slight overfit         LOW        SLIGHTLY HIGH  Normal, acceptable
Overfitting          VERY LOW        HIGH        Model too complex

Fixing Underfitting (High Bias)

  • Add more features / engineered features
  • Use a more complex model (more layers, more trees, higher degree polynomial)
  • Reduce regularization strength (lower λ)
  • Train longer (more epochs)
  • Remove features that are too noisy (paradoxically, fewer bad features can help)

Fixing Overfitting (High Variance)

  • Get more training data (most effective)
  • Use a simpler model
  • Increase regularization (higher λ)
  • Dropout (neural networks)
  • Early stopping
  • Data augmentation
  • Ensemble methods (averaging reduces variance)
  • Feature selection (remove noise features)

Regularization — Preventing Overfitting

What It Is

Regularization adds a complexity penalty to the loss function. The model must balance fitting the training data against keeping the weights small (simple).

$$L_{regularized} = L_{original} + \lambda \cdot \text{penalty}$$

L1 Regularization (Lasso)

$$L = L_{original} + \lambda \sum_{i} |w_i|$$

  • Penalty is the sum of absolute values of weights
  • Effect: drives some weights exactly to zero → automatic feature selection
  • The solution is sparse (many zero weights)

ELI5: L1 is like Marie Kondo — if a feature doesn’t “spark joy” (contribute meaningfully to predictions), it gets thrown out entirely. After L1 regularization, only the truly useful features have non-zero weights. You end up with a leaner, more interpretable model.

Why L1 creates sparsity: The L1 penalty has a “kink” at zero — the gradient is constant (±1) everywhere except at 0. This constant pull toward zero is strong enough to push small weights all the way to exactly zero. L2 has a shrinking gradient as weights approach zero — it gets weaker and weaker as a weight gets small, never quite reaching zero.

L2 Regularization (Ridge)

$$L = L_{original} + \lambda \sum_{i} w_i^2$$

  • Penalty is the sum of squared weights
  • Effect: shrinks all weights toward zero, but never exactly zero
  • All features retained, but with smaller influence

ELI5: L2 is like a dimmer switch — it turns every feature’s influence down, but nothing gets turned off completely. Every feature stays in the picture, just quieter. This is better when you believe all features are relevant but you want to prevent any single feature from dominating.

Why L2 doesn’t produce sparsity: As a weight approaches zero, the L2 penalty gradient (2λw) approaches zero too — it loses its “push” before reaching zero.
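
The difference between the two penalties shows up even with no data term at all. The sketch below applies only the penalty update to a single weight. Two assumptions to note: the L1 step uses soft-thresholding (the standard proximal update for an absolute-value penalty) rather than a raw subgradient step, and the values of `lr` and `lam` are arbitrary.

```python
def l2_step(w, lr, lam):
    """Gradient step on the penalty lam * w**2 alone; its gradient is 2 * lam * w."""
    return w - lr * 2 * lam * w

def l1_step(w, lr, lam):
    """Soft-thresholding step for the penalty lam * |w|: move toward zero by
    lr * lam, and stop at zero instead of overshooting past it."""
    shrink = lr * lam
    if w > shrink:
        return w - shrink
    if w < -shrink:
        return w + shrink
    return 0.0

w_l1 = w_l2 = 0.5
for _ in range(100):
    w_l1 = l1_step(w_l1, lr=0.1, lam=0.1)
    w_l2 = l2_step(w_l2, lr=0.1, lam=0.1)

print(w_l1)  # 0.0
print(w_l2)  # small but still positive
```

The L1 weight loses a fixed amount per step and hits exactly zero, while the L2 weight loses a fixed fraction per step and so only decays toward zero.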

Elastic Net

$$L = L_{original} + \lambda_1 \sum_{i} |w_i| + \lambda_2 \sum_{i} w_i^2$$

Combines L1 and L2. Gets sparse solutions (L1) while handling correlated features gracefully (L2). Best when you have many features that are correlated.

Lambda (λ) — The Regularization Strength

λ = 0:     No regularization → pure overfitting risk
λ too low: Small penalty → model still overfits
λ just right: Sweet spot → generalizes well
λ too high: Large penalty → all weights near 0 → underfitting

Comparison Table

| | L1 (Lasso) | L2 (Ridge) | Elastic Net |
| --- | --- | --- | --- |
| Penalty | $\sum \lvert w_i \rvert$ | $\sum w_i^2$ | L1 + L2 |
| Weights | Many → exactly 0 | All shrink, none → 0 | Sparse + shrunk |
| Feature selection | Yes, automatic | No | Partial |
| Best when | Few features matter | All features matter (correlated) | Many features, some correlated |
| Robustness | Less stable (non-unique) | More stable (unique solution) | Stable |
| Interpretability | High (sparse) | Medium | Medium |

Dropout (Neural Networks)

Randomly disable a fraction of neurons during each training forward pass. At inference time dropout is switched off and every neuron is used.

Normal forward pass:
  x → [n1] → [n2] → [n3] → [n4] → output

With 50% dropout:
  x → [n1] →  ✗   → [n3] →  ✗   → output
  (n2 and n4 disabled this batch — different neurons next batch)

Effect: prevents neurons from co-adapting (relying on specific other neurons). Forces the network to learn redundant representations. Acts like training an ensemble of many sub-networks.
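
A minimal sketch of the mechanism, assuming the "inverted dropout" convention used by most modern frameworks: surviving activations are scaled up during training so that nothing needs to change at inference time. The function name and the layer of all-ones activations are hypothetical.

```python
import random

def dropout(activations, drop_prob, training, rng):
    """Inverted dropout: during training, zero each activation with probability
    drop_prob and scale survivors by 1 / (1 - drop_prob); identity at inference."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
layer_output = [1.0] * 10_000  # assumed activations, all ones for easy checking

train_out = dropout(layer_output, drop_prob=0.5, training=True, rng=rng)
eval_out = dropout(layer_output, drop_prob=0.5, training=False, rng=rng)

print(sum(train_out) / len(train_out))  # close to 1.0: expected activation is preserved
print(eval_out == layer_output)         # True: dropout does nothing at inference
```

The scaling keeps the expected value of each activation the same in training and inference, so the next layer sees inputs on a consistent scale.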

Early Stopping

Validation
  Loss
  │╲
  │ ╲___                              ╱
  │     ╲___                       __╱
  │         ╲__                 __╱
  │            ╲__           __╱
  │               ╲_________╱  ← overfitting begins as the curve turns up
  │                    ↑ minimum: STOP HERE
  └──────────────────────────────── Epochs

Stop training once validation loss has stopped improving for a set number of epochs (the patience), and keep the weights from the epoch with the lowest validation loss. That checkpoint is the best model.
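
In practice validation loss is noisy, so implementations usually wait a few epochs before giving up. The sketch below assumes a `patience` parameter (a common convention, not something specified above) and a made-up validation-loss curve.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch): the epoch with the lowest validation loss,
    and the epoch at which training would halt."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # a real loop would save a checkpoint here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Hypothetical validation-loss curve: improves, then starts to overfit.
val_losses = [0.90, 0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.49, 0.52]
best, stopped = early_stopping_epoch(val_losses, patience=3)
print(best, stopped)  # 4 7
```

Training halts at epoch 7, but the model that is kept is the checkpoint from epoch 4, where validation loss was lowest.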

Data Augmentation

Artificially increase training data diversity:

  • Images: random flips, rotations, crops, color jitter, cutout, Mixup
  • Text: back-translation, synonym replacement, random insertion/deletion
  • Tabular: SMOTE (synthetic minority oversampling for imbalanced classes)

Cross-Validation

Why Cross-Validation?

A single train/val split gives a noisy estimate of model performance — you might get lucky or unlucky in how the split fell. Cross-validation averages over multiple splits for a more reliable estimate.

K-Fold Cross-Validation

Dataset: [1][2][3][4][5][6][7][8][9][10]  (10 samples, K=5 folds of 2 samples each)

Fold 1:  [ VAL ][TRAIN][TRAIN][TRAIN][TRAIN]  → score₁
Fold 2:  [TRAIN][ VAL ][TRAIN][TRAIN][TRAIN]  → score₂
Fold 3:  [TRAIN][TRAIN][ VAL ][TRAIN][TRAIN]  → score₃
Fold 4:  [TRAIN][TRAIN][TRAIN][ VAL ][TRAIN]  → score₄
Fold 5:  [TRAIN][TRAIN][TRAIN][TRAIN][ VAL ]  → score₅

Final score = mean(score₁, score₂, score₃, score₄, score₅)
              ± std (measure of estimate reliability)
  • Each sample is used for validation exactly once
  • Each sample is used for training K-1 times
  • K=5 or K=10 are standard choices
  • More reliable than a single split
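
The fold bookkeeping can be sketched in a few lines. The generator below is a hypothetical helper, not a library API; it leaves the order untouched, so for independent rows you would normally shuffle the indices first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, val_idx in folds:
    print("val:", val_idx, "train:", train_idx)
```

With 10 samples and K=5, every index shows up in exactly one validation fold and in four training folds, matching the bullets above.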

Stratified K-Fold

Same as K-Fold, but ensures each fold has the same class proportion as the full dataset.

Critical for imbalanced data: if your dataset is 90% class A and 10% class B, purely random folds can leave a fold with very few (or, on small datasets, zero) class B examples, which makes that fold’s score unreliable.

Leave-One-Out (LOO) Cross-Validation

K = N (one sample per fold). Every sample gets to be the validation set exactly once.

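  • Pro: nearly unbiased estimate of model performance (almost all of the data is used for training each time), though the estimate itself can have high variance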
  • Con: computationally expensive (N training runs)
  • Use when: dataset is very small (< 100 samples)

Time Series Cross-Validation — The Special Case

Regular K-Fold on time series (WRONG!):
  Train:  [Jan][Mar][May][Aug]   Test: [Feb][Apr][Jun]
  Problem: model sees future data (August) to predict past (February)
                                       ↑ DATA LEAKAGE

Time Series Walk-Forward Validation (CORRECT):
  Fold 1:  [Jan──Mar]  → test [Apr]
  Fold 2:  [Jan──Apr]  → test [May]
  Fold 3:  [Jan──May]  → test [Jun]
  Fold 4:  [Jan──Jun]  → test [Jul]
           (expanding window)

Exam tip: NEVER use random K-Fold on time series data. The rule is simple: you can only train on data from before your test period. This mimics how the model will actually be used in production — you always predict the future from the past.
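
The expanding-window scheme from the diagram can be generated mechanically. The helper below is a hypothetical sketch, with an initial training window of three periods and a one-period test window chosen to match the January to July example.

```python
def walk_forward_splits(n_periods, initial_train, test_size=1):
    """Yield (train_idx, test_idx) with an expanding training window.
    Training indices always come before test indices in time."""
    train_end = initial_train
    while train_end + test_size <= n_periods:
        yield list(range(train_end)), list(range(train_end, train_end + test_size))
        train_end += test_size

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul"]
splits = list(walk_forward_splits(len(months), initial_train=3))
for train_idx, test_idx in splits:
    print([months[i] for i in train_idx], "->", [months[i] for i in test_idx])
```

This produces the four folds shown above (test months Apr, May, Jun, Jul), and in every fold the latest training month precedes the test month.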


Ensemble Methods — The Power of Combining Models

Why Ensembles Work

If you have many models that are:

  1. Reasonably accurate (better than random)
  2. Uncorrelated in their errors (they make different mistakes)

Then their average (or majority vote) will usually be more accurate than any single model, because uncorrelated errors tend to cancel out.

Model A correct on:  [1][1][1][0][0][0][1][0][1][1]  (60% accuracy)
Model B correct on:  [1][0][1][1][0][1][0][1][1][0]  (60% accuracy)
Model C correct on:  [0][1][1][1][1][0][1][1][0][1]  (70% accuracy)

Majority vote:       [1][1][1][1][0][0][1][1][1][1]  (80% accuracy!)
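
The vote row can be checked by hand or with a few lines of code. This sketch copies the three rows above and recomputes every accuracy from the data; it assumes a two-class problem, so the vote is correct whenever at least two of the three models are correct.

```python
# 1 = model was correct on that example, 0 = model was wrong (rows copied from above).
model_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
model_b = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
model_c = [0, 1, 1, 1, 1, 0, 1, 1, 0, 1]

def accuracy(correct_flags):
    return sum(correct_flags) / len(correct_flags)

# With two classes, the majority is right whenever at least 2 of the 3 models are right.
majority = [1 if a + b + c >= 2 else 0 for a, b, c in zip(model_a, model_b, model_c)]

for name, flags in [("A", model_a), ("B", model_b), ("C", model_c), ("vote", majority)]:
    print(f"{name:4s} accuracy = {accuracy(flags):.0%}")
```

The vote beats every individual model here because the models rarely fail on the same example; the two examples where the vote is wrong are the ones where two models fail together.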

Bagging (Bootstrap Aggregating)

Original Data (N samples)
       │
  ┌────┴────┐
  │Bootstrap│ ← sample with replacement
  │Sampling │
  └────┬────┘
       │
  ┌────┴────┬─────────┬─────────┐
  ↓         ↓         ↓         ↓
Model 1   Model 2   Model 3   Model 4   (trained independently)
  │         │         │         │
  └─────────┴─────────┴─────────┘
                 │
            Aggregate
         (average/vote)
                 │
           Final Prediction

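Each bootstrap sample consists of N draws with replacement from the original data. On average it contains about 63.2% of the distinct original samples; the remaining draws are repeats, and the roughly 36.8% of samples that are never drawn are the “out-of-bag” samples.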
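
The 63.2% figure comes from a short probability argument: a given sample is missed by one draw with probability 1 - 1/N, so it is missed by all N draws with probability (1 - 1/N)^N, which tends to 1/e ≈ 0.368. The sketch below checks the formula against one simulated bootstrap sample; the size N = 10,000 is an arbitrary assumption.

```python
import math
import random

def expected_unique_fraction(n):
    """Probability that a given sample appears at least once in n draws
    with replacement from n items: 1 - (1 - 1/n)**n."""
    return 1 - (1 - 1 / n) ** n

rng = random.Random(0)
n = 10_000
bootstrap = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
observed = len(set(bootstrap)) / n

print(f"formula, n={n}: {expected_unique_fraction(n):.4f}")
print(f"limit 1 - 1/e:  {1 - 1 / math.e:.4f}")  # 0.6321
print(f"one simulated bootstrap sample: {observed:.4f}")
```

The samples that never get drawn are what Random Forest implementations use for out-of-bag error estimates.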

Random Forest = bagging on decision trees + random feature subset at each split:

  • Random feature selection at each split decorrelates the trees (if they all see the same features, they’ll all focus on the most predictive feature)
  • Reduces variance with little or no increase in bias

ELI5: Bagging is like asking 100 slightly-confused doctors to each independently diagnose the same patient, then taking the majority vote. No single doctor needs to be brilliant — you just need them to be somewhat accurate and make different mistakes. The crowd’s collective wisdom is more reliable than any individual.

Bagging reduces variance because individual model variance averages out. Best for high-variance models (decision trees).

Boosting

Training Data
      │
      ▼
   Model 1      ← train on original data
   (weak)
      │
   Errors         ← identify where Model 1 was wrong
      │
      ▼
   Model 2      ← train with HIGHER WEIGHT on Model 1's mistakes
   (weak)
      │
   Errors
      │
      ▼
   Model 3      ← train with HIGHER WEIGHT on remaining mistakes
   (weak)
      │
      ▼
   Weighted Sum → Final Prediction

ELI5: Boosting is like a relay tutoring session. Tutor 1 teaches you everything they can. Tutor 2 comes in and focuses specifically on the topics tutor 1 couldn’t explain clearly. Tutor 3 handles whatever is still confusing. Each round targets the remaining weaknesses. The final combined knowledge is much stronger than any single tutor because the team specializes in each other’s gaps.

Boosting reduces bias because each model corrects the previous one’s errors. Best for high-bias models (shallow trees = weak learners).

Types:

  • AdaBoost: reweights training samples (higher weight to misclassified samples)
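  • Gradient Boosting (GBM): fits each new model to the residuals (errors) of the previous ensemble; more generally, to the negative gradient of the loss, which for squared error is exactly the residual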
  • XGBoost: GBM + L1/L2 regularization + second-order gradients + parallel tree construction
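
Residual fitting is easier to see in code than in a diagram. Below is a minimal sketch of gradient boosting for squared error using one-split decision stumps on a single feature. Everything specific here (the stump learner, 50 rounds, learning rate 0.1, the y = x² toy data) is an assumption for illustration and is not how XGBoost is implemented internally.

```python
def fit_stump(xs, targets):
    """Best single-threshold split on a 1-D feature, minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)  # round 0: predict the mean
    stumps, history = [], []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still unexplained
        stump = fit_stump(xs, residuals)                # weak learner targets the residuals
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history

xs = [i / 10 for i in range(30)]
ys = [x ** 2 for x in xs]
predict, history = gradient_boost(xs, ys)
print(f"training MSE after round 1: {history[0]:.3f}, after round 50: {history[-1]:.3f}")
```

Each stump on its own is a very high-bias model, yet the training error keeps falling round after round, which is the "boosting reduces bias" behaviour described above.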

Stacking

Use the predictions of multiple diverse models as input features for a meta-model (also called a blender).

Level 0 (base models):
  x → [Logistic Regression] → pred₁
  x → [Random Forest]       → pred₂
  x → [XGBoost]             → pred₃
  x → [Neural Network]      → pred₄

Level 1 (meta-model):
  [pred₁, pred₂, pred₃, pred₄] → [Meta-Model] → Final Prediction

The meta-model learns when to trust each base model. Computationally expensive, but often the best-performing approach in ML competitions.

Bagging vs Boosting — Summary

| | Bagging | Boosting |
| --- | --- | --- |
| Training | Parallel (independent) | Sequential (dependent) |
| Each model sees | Random subset of data | Weighted/resampled data |
| Reduces | Variance | Bias |
| Combiner | Average / majority vote | Weighted sum |
| Best for | High-variance models (deep trees) | High-bias models (shallow trees) |
| Risk | Less prone to overfitting | Can overfit if too many rounds |
| Examples | Random Forest | XGBoost, AdaBoost, LightGBM |
| Speed | Parallelizable | Sequential (harder to parallelize) |

Exam tip: “Bagging = parallel, reduces variance. Boosting = sequential, reduces bias.” This distinction appears on the MLS-C01 exam. Also know: XGBoost is boosting. Random Forest is bagging.


Key Formulas Cheat Sheet

Loss functions:
  MSE         = (1/n) × Σ(yᵢ - ŷᵢ)²
  RMSE        = √MSE
  MAE         = (1/n) × Σ|yᵢ - ŷᵢ|
  Cross-Entropy = -Σ yᵢ × log(ŷᵢ)
  Binary CE   = -[y×log(ŷ) + (1-y)×log(1-ŷ)]

Regularization:
  L1 (Lasso)  = L + λ × Σ|wᵢ|
  L2 (Ridge)  = L + λ × Σwᵢ²
  Elastic Net = L + λ₁×Σ|wᵢ| + λ₂×Σwᵢ²

Gradient descent:
  w_new = w_old - α × ∂L/∂w

Bias-Variance:
  Total Error = Bias² + Variance + Irreducible Error

Why This Matters for the Exam

The MLS-C01 exam tests conceptual understanding, not mathematical derivations. You need to know:

  • Which algorithm to choose for a given problem type (supervised/unsupervised)
  • How gradient descent hyperparameters (learning rate, batch size) affect training
  • How to diagnose overfitting vs underfitting from training/validation curves
  • When to use L1 vs L2 vs Elastic Net regularization
  • Why time series CV is different from standard K-Fold
  • The difference between bagging (Random Forest) and boosting (XGBoost) — parallel vs sequential, variance vs bias