Domain 3A: ML Fundamentals from First Principles

Exam Domain: 3, Modeling (36%)
Focus: The foundations every other ML concept builds on

What is Machine Learning? (First Principles)

The Fundamental Idea

Traditional programming: you write the rules, the computer applies them to data. Machine learning: you give it data + answers, the computer discovers the rules.

Traditional Programming:
  Rules + Data → Program → Output

Machine Learning:
  Data + Output → Training → Rules (Model)
  New Data + Model → Prediction

ELI5: Imagine teaching a child to recognize cats. Traditional programming means writing an encyclopedia of rules: “has fur, has four legs, has pointy ears, meows…” — and it breaks the moment you see a hairless cat. Machine learning means showing the child 10,000 photos labeled “cat” or “not cat” and letting them figure out the patterns themselves. The child’s brain becomes the model.

Types of Learning — The Taxonomy

Machine Learning
├── Supervised Learning          (has labels)
│   ├── Classification           → discrete output (cat/dog, spam/not-spam)
│   └── Regression               → continuous output (house price, temperature)
│
├── Unsupervised Learning        (no labels)
│   ├── Clustering               → group similar data points
│   ├── Dimensionality Reduction → compress features
│   └── Anomaly Detection        → find unusual points
│
├── Semi-Supervised Learning     (few labels + many unlabeled)
│   └── Self-training, label propagation
│
└── Reinforcement Learning       (reward signal, no direct labels)
    └── Agent learns by trial + error to maximize reward

Type	Labels?	Learns	Example Use Case
Supervised	Yes, all data	Input → Output mapping	Email spam detection
Unsupervised	No	Hidden structure	Customer segmentation
Semi-supervised	Few	Leverage unlabeled data	Document classification (10 labeled, 10K unlabeled)
Reinforcement	Reward only	Optimal actions	Game playing, robot control

When to use each:

Supervised: You have labeled examples and want to predict a known target
Unsupervised: You want to discover patterns without knowing what you’re looking for
Semi-supervised: Labeling is expensive but unlabeled data is abundant (most real-world text/image problems)
Reinforcement: The goal is sequential decision-making with delayed rewards (robotics, games, resource optimization)

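Exam tip: “Discriminative vs generative” is a distinction between model families, most often tested in the context of classification. Discriminative models (logistic regression, SVM, neural networks) learn the conditional P(y|x), or the decision boundary directly. Generative models (Naive Bayes, Gaussian mixture models) learn how the data itself is distributed, P(x|y) together with P(y), and classify by applying Bayes’ rule.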

The Learning Process — What “Training” Actually Means

The Mechanical View

Step 1: Initialize weights randomly
         w = [0.1, -0.3, 0.7, ...]

Step 2: Forward pass — make a prediction
         ŷ = model(x, w)

Step 3: Compute loss — measure how wrong we are
         L = loss(y, ŷ)

Step 4: Backward pass — compute gradients
         ∂L/∂w = how much each weight contributed to the error

Step 5: Update weights — reduce the error
         w_new = w_old - α × (∂L/∂w)

Step 6: Repeat steps 2-5 for all training data
         (one pass through all data = one epoch)

Training is just this loop, repeated thousands or millions of times, until the loss stops decreasing.
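The six steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production trainer: it assumes a one-feature linear model ŷ = w·x + b, MSE as the loss, and full-batch updates. The function name `train_linear`, the toy data, and the hyperparameter values are invented for the example.

```python
import random

def train_linear(xs, ys, lr=0.05, epochs=1000, seed=0):
    """Fit y ≈ w*x + b with full-batch gradient descent on MSE."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)            # Step 1: random init
    n = len(xs)
    history = []
    for _ in range(epochs):                                   # Step 6: repeat
        preds = [w * x + b for x in xs]                       # Step 2: forward pass
        loss = sum((y - p) ** 2 for y, p in zip(ys, preds)) / n   # Step 3: loss
        # Step 4: gradients of the MSE with respect to w and b
        grad_w = sum(-2 * (y - p) * x for x, y, p in zip(xs, ys, preds)) / n
        grad_b = sum(-2 * (y - p) for y, p in zip(ys, preds)) / n
        w -= lr * grad_w                                      # Step 5: update
        b -= lr * grad_b
        history.append(loss)
    return w, b, history

# Toy data generated from y = 2x + 1, so training should recover w≈2, b≈1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b, history = train_linear(xs, ys)
```

Because each loop iteration here uses the whole dataset, one iteration equals one epoch; with mini-batches an epoch would contain several updates.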

Loss / Cost Functions — Measuring Wrongness

A loss function takes the model’s prediction and the true answer and returns a single number: how wrong is the model?

The model’s job is to minimize this number.

ELI5: A loss function is like a score in a game where lower is better. If you’re playing golf, every swing that misses costs you points. The model is the golfer. Training is practice. Each round of practice, you slightly adjust your swing (weights) based on what caused the most penalty. After enough rounds, you’ve learned to minimize your score.

Mean Squared Error (Regression)

$$L = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Penalizes large errors disproportionately (squaring amplifies big mistakes)
Sensitive to outliers
Units: squared (if target is dollars, loss is dollars²)

Root Mean Squared Error

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

Same units as the target variable (interpretable)
Still sensitive to outliers, but more readable
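A small numeric check can make the outlier sensitivity visible. The sketch below uses made-up target values (in dollars, say) and also computes MAE, which the comparison table later in this section lists, purely as a point of reference.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true    = [100.0, 200.0, 300.0, 400.0]
y_clean   = [110.0, 190.0, 310.0, 390.0]   # every prediction is off by 10
y_outlier = [110.0, 190.0, 310.0, 300.0]   # one prediction is off by 100

print(mse(y_true, y_clean), rmse(y_true, y_clean))   # 100.0 10.0
print(mse(y_true, y_outlier))                        # 2575.0
print(mae(y_true, y_outlier))                        # 32.5
```

One bad prediction multiplies MSE by more than 25, while MAE only roughly triples. RMSE brings the number back into the target's units but inherits the same sensitivity.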

Cross-Entropy Loss (Classification)

$$L = -\sum_{i} y_i \log(\hat{y}_i)$$

Penalizes confident wrong predictions severely
If truth is class 1 and model says P(class 1) = 0.01 → massive penalty
If truth is class 1 and model says P(class 1) = 0.99 → tiny penalty

Binary cross-entropy (for binary classification): $$L = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$$
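To put numbers on "massive" versus "tiny", here is a minimal sketch of binary cross-entropy. It assumes the natural logarithm (the formulas above do not fix a base) and clips probabilities with a small `eps`, an illustrative safeguard against log(0).

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """Loss for one example: y is the true label (0 or 1), p is P(class 1)."""
    p = min(max(p, eps), 1 - eps)   # keep log() finite at p = 0 or p = 1
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

confident_wrong = binary_cross_entropy(1, 0.01)   # about 4.605
confident_right = binary_cross_entropy(1, 0.99)   # about 0.010
```

The confidently wrong prediction costs several hundred times more than the confidently right one, which is the asymmetry the bullets above describe.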

Why Different Loss Functions?

Loss Function	Task	Key Property
MSE	Regression	Penalizes large errors heavily
MAE (Mean Absolute Error)	Regression	Robust to outliers (linear penalty)
Huber Loss	Regression	MSE for small errors, MAE for large (best of both)
Cross-Entropy	Classification	Probabilistic interpretation, confident mistakes punished hard
Hinge Loss	SVM/Classification	Maximizes margin, 0 loss when correctly classified with confidence
Focal Loss	Imbalanced classification	Down-weights easy examples, focuses learning on hard cases

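Exam tip: Huber loss is the answer when you want regression with outlier robustness but still want smooth gradients near zero. Cross-entropy is almost always the right choice for classification. MSE is a poor fit there: applied to sigmoid or softmax outputs, it produces very small gradients when the model is confidently wrong, so learning stalls exactly where correction is most needed.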
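The "MSE for small errors, MAE for large" behavior of Huber loss can be sketched as a piecewise function. The threshold `delta` (where the switch happens) is a hyperparameter; the value 1.0 below is only an illustrative default.

```python
def huber(residual, delta=1.0):
    """Huber loss for one residual (y - ŷ)."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual ** 2        # quadratic zone: smooth near zero
    return delta * (a - 0.5 * delta)      # linear zone: outliers grow linearly

print(huber(0.5))    # 0.125  (same as 0.5 * 0.5**2)
print(huber(10.0))   # 9.5    (the quadratic 0.5 * 10**2 would give 50.0)
```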

Gradient Descent — The Engine of ML

The Intuition

Imagine you’re lost in a hilly landscape covered in fog. Your goal is to reach the lowest valley. You can’t see far, but you can feel the slope under your feet. What do you do?

Feel which direction goes downhill (compute gradient)
Take a step in that direction (update weights)
Repeat until you can’t go any lower (convergence)

This is gradient descent.

Loss
  │
  │    ╲        Current position
  │     ╲  ← ●
  │      ╲      ╲
  │       ╲      ╲
  │        ╲      ●  ← after one step
  │         ╲      ╲
  │          ╲      ╲
  │           ╲      ●  ← converging
  │            ╲____●___
  │                  ↑ minimum
  └──────────────────────── Weights (w)

Gradient = slope at current position (∂L/∂w)
Step size = learning rate (α)

The Math

$$w_{new} = w_{old} - \alpha \cdot \frac{\partial L}{\partial w}$$

$\alpha$ is the learning rate — how big each step is
$\frac{\partial L}{\partial w}$ is the gradient — which direction and how steeply uphill
We subtract the gradient because we want to go downhill

Learning Rate: The Most Critical Hyperparameter

α too HIGH:   Loss
              │  ╱╲  ╱╲  ╱╲
              │╱    ╲╱   ╲╱   → oscillates, diverges
              │
              Steps overshoot the minimum

α just right: Loss
              │╲
              │ ╲___  → smooth convergence
              │
              Steps land near the minimum

α too LOW:    Loss
              │╲
              │ ╲__________  → converges, but takes forever
              │
              Tiny steps, takes thousands of extra epochs
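The three regimes can be reproduced on a toy problem. The sketch below assumes the one-dimensional loss L(w) = (w − 3)², whose gradient is 2(w − 3) and whose minimum is at w = 3; the loss and the three α values are chosen only for illustration.

```python
def descend(lr, steps=50, w=0.0):
    """Run plain gradient descent on L(w) = (w - 3)**2 and return the final w."""
    for _ in range(steps):
        grad = 2 * (w - 3.0)
        w -= lr * grad
    return w

too_high   = descend(lr=1.1)     # each step overshoots further: diverges
just_right = descend(lr=0.1)     # lands very close to 3
too_low    = descend(lr=0.001)   # still far from 3 after 50 steps
```

On this loss each step multiplies the distance to the minimum by (1 − 2α), so α = 1.1 gives a factor of −1.2 (growing oscillation), α = 0.1 gives 0.8 (steady convergence), and α = 0.001 gives 0.998 (barely moving).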

Learning rate strategies:

Learning rate schedule: start high, decay over time (step decay, cosine annealing)
Warm-up: start low, ramp up, then decay — prevents instability at beginning of training
Cyclical LR: oscillate between bounds, which can help the optimizer move past saddle points and plateaus
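As one possible way to combine warm-up with decay, the sketch below uses a linear warm-up followed by cosine annealing. The function name `lr_at` and all default values are assumptions made for the example, not settings taken from any particular framework.

```python
import math

def lr_at(step, total_steps, base_lr=0.1, warmup_steps=10, min_lr=0.0):
    """Learning rate for a given step: linear warm-up, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=100) for s in range(101)]
```

With these defaults the rate climbs to 0.1 over the first 10 steps, passes 0.05 halfway through the decay phase, and reaches 0 at step 100.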

Gradient Descent Variants

Batch Gradient Descent:
  ┌─────────────────────────────┐
  │ for each epoch:             │
  │   compute gradient on ALL   │
  │   N training examples       │
  │   update weights once       │
  └─────────────────────────────┘
  Pro: stable, accurate gradient
  Con: SLOW — one update per full dataset pass

Stochastic Gradient Descent (SGD):
  ┌─────────────────────────────┐
  │ for each epoch:             │
  │   for each sample:          │
  │     compute gradient on     │
  │     SINGLE sample           │
  │     update weights          │
  └─────────────────────────────┘
  Pro: fast updates, online learning
  Con: noisy gradient, may never fully converge

Mini-batch Gradient Descent:
  ┌─────────────────────────────┐
  │ for each epoch:             │
  │   for each mini-batch       │
  │   of size B (32, 64, 128):  │
  │     compute gradient on B   │
  │     update weights          │
  └─────────────────────────────┘
  Pro: balance of speed + stability
  Con: batch size is a hyperparameter
  → THIS IS THE STANDARD IN DEEP LEARNING

Variant	Batch Size	Update Frequency	Gradient Noise	Memory
Batch GD	All N	Once per epoch	Low (accurate)	High
SGD	1	N times per epoch	High (noisy)	Low
Mini-batch GD	32-512	N/B times per epoch	Medium	Medium
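The three variants differ only in how many samples feed each update, so a single batching helper covers all of them. This is a sketch with a hypothetical `minibatches` generator and a stand-in dataset of 1,000 integers.

```python
import random

def minibatches(data, batch_size, seed=0):
    """Yield shuffled batches; the last batch may be smaller than batch_size."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    for start in range(0, len(idx), batch_size):
        yield [data[i] for i in idx[start:start + batch_size]]

data = list(range(1000))   # N = 1000 training examples (placeholders)
updates_per_epoch = {
    "batch":      len(list(minibatches(data, len(data)))),   # 1 update
    "sgd":        len(list(minibatches(data, 1))),           # 1000 updates
    "mini-batch": len(list(minibatches(data, 32))),          # 32 updates
}
```

With B = 32 the count is 32 rather than 31.25 because the final, partial batch still triggers an update.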

Advanced Optimizers

Plain SGD has problems: equal learning rate for all weights, slow in ravines. Advanced optimizers fix this.

Optimizer Family Tree:
SGD
 ├── SGD + Momentum       (adds velocity, accelerates in consistent directions)
 ├── Adagrad              (adaptive LR: frequent params get smaller LR)
 │    └── RMSprop         (Adagrad + exponential moving average of squared gradients)
 └── Adam                 (RMSprop + Momentum combined)
      └── AdamW           (Adam + weight decay decoupled from gradient)

ELI5 on Momentum: Plain SGD is like a ball rolling down a bumpy hill that stops dead every time it hits a small bump. Momentum gives the ball mass — it builds up speed going downhill, which helps it roll through small bumps and get out of shallow dips. “Instead of recalculating direction from scratch each step, you keep some of the previous velocity.”

Adam (Adaptive Moment Estimation) is the default for deep learning because:

Maintains per-parameter adaptive learning rates (like RMSprop)
Tracks momentum in the gradient direction (like SGD with momentum)
Works well with little tuning on most problems
Handles sparse gradients (NLP) and dense gradients (vision) well

Optimizer	Best For	Key Hyperparams
SGD + Momentum	Computer vision (CNNs), when Adam overfits	lr, momentum
Adam	Default for most deep learning	lr, β₁=0.9, β₂=0.999, ε
AdamW	Transformers, NLP	lr, weight_decay
RMSprop	RNNs	lr, rho
Adagrad	Sparse data, NLP	lr
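To show how Adam combines a momentum term with a per-parameter scale, here is a minimal single-parameter sketch of the standard Adam update with bias correction. The toy loss L(w) = (w − 3)², the learning rate of 0.1, and the step count are assumptions for the example; real frameworks apply the same update to every weight.

```python
import math

def adam_minimize(grad_fn, w=0.0, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a 1-D function given its gradient, using the Adam update."""
    m = 0.0   # first moment: running average of gradients (momentum)
    v = 0.0   # second moment: running average of squared gradients (scale)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Gradient of L(w) = (w - 3)**2; the minimum is at w = 3.
w_final = adam_minimize(lambda w: 2 * (w - 3.0))
```

Dividing by the square root of `v_hat` is what gives each parameter its own effective step size: weights with consistently large gradients take proportionally smaller steps.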

Local Minima vs Global Minima

1D Loss Surface (simple, illustrative):
  Loss
  │     local     global
  │      min       min
  │  ╲  /╲  ╲  /╲  ╲
  │   ╲/  ╲  ╲/  ╲  ╲____
  └──────────────────────── w
       ↑               ↑
  can get trapped     want this

In high dimensions (millions of weights), this is much less of a problem than it looks in 1D. Most critical points in high-dimensional loss landscapes are saddle points (flat in some directions, curved in others), not true local minima. The practical issue is getting stuck on plateaus (very flat regions where gradients are near zero).

Convergence — when to stop:

Validation loss stops decreasing (most common)
Gradient norm falls below a threshold
Fixed number of epochs reached

Bias-Variance Tradeoff — The Central Tension of ML

What is Bias?

Bias = systematic error — the model is consistently wrong in the same direction because it’s too simple to capture the true pattern.

True function (wiggly)        High-bias model (straight line)
     ╭──╮                           /
  ───╯   ╰───╮                     /
              ╰──╮               /
                 ╰───           /  ← misses all the curves
                                    (underfitting)

ELI5: Bias is like wearing the wrong prescription glasses — everything you see is consistently blurry in the same way. You might look at a horse and always see a slightly fuzzy shape. The error is systematic and predictable. You could look at a million horses and still be consistently wrong because the problem is your glasses, not the horse.

High bias symptoms:

High error on training data AND test data
Model is “too simple” — a linear model fitting a cubic relationship
Adding more data doesn’t help (the model can’t learn the complexity anyway)

What is Variance?

Variance = sensitivity to training data — the model memorizes noise and fluctuations, not the true signal.

True function (simple)         High-variance model (over-wiggly)
                                     ╭╮    ╭╮
      /                           ───╯╰────╯╰───
     /  (clean linear trend)
    /

ELI5: Variance is like a hypochondriac who thinks every new symptom is a different rare disease — over-reacting to every data point. The model is so sensitive to the specific training examples it saw that if you replaced just one training point, you’d get a completely different model.

High variance symptoms:

Very low error on training data, high error on test data
Model is “too complex” — a degree-50 polynomial fitting 20 data points
Getting more training data usually helps

The Tradeoff

Total Error = Bias² + Variance + Irreducible Error

Error
  │
  │  Bias²       Total Error
  │    ╲            ╱╲
  │     ╲          ╱  ╲
  │      ╲        ╱    ╲
  │       ╲──────╱  ╲   ╲
  │              ╱   Variance
  │             ╱
  │____________╱___________
  └──────────────────────── Model Complexity
   simple              complex
   (underfit)    ↑    (overfit)
             sweet spot

Irreducible error: noise in the data itself — can never be eliminated
As complexity increases: bias decreases, variance increases
The art of ML is finding the sweet spot
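The decomposition can be written out term by term. The setup below is the standard textbook one and is an assumption here, since this guide does not define it: the target is $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and $\hat{f}$ is the model obtained from a randomly drawn training set. The expected squared error at a point $x$ is then:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}$$

The expectations are taken over training sets (and the noise), which is why variance measures how much the fitted model would change if you retrained on different data.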

The Train / Validation / Test Split — Why Three Sets

                        All available data
                              │
           ┌──────────────────┼──────────────────┐
           │                  │                  │
      Training Set       Validation Set        Test Set
      (60-80%)             (10-20%)            (10-20%)
           │                  │                  │
     Learn patterns      Tune model          Final honest
     from this            hyperparams         evaluation
           │              (compare             (NEVER touch
     Repeat many          models,              until done!)
     times over           early stopping)

ELI5: Training data is homework — you practice on it repeatedly. Validation data is practice tests — you check your progress and adjust your study approach. Test data is the final exam — you only look at it once, at the very end, and you never use it to guide your studying. If you peek at the final exam questions while studying, your grade is meaningless as a measure of whether you actually learned anything.

Why not just train/test? Because you’ll implicitly optimize for the test set through repeated model selection. Every time you try a new model and compare test performance, you’re “using” the test set to make decisions. The validation set absorbs this overfitting, protecting the test set’s integrity.

Common splits:

60/20/20 (train/val/test): classic, good for medium datasets
80/10/10: when you have plenty of data and need more training
90/5/5: large datasets (millions of examples) — 5% is still huge
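A three-way split can be sketched with a single shuffle followed by slicing. The helper name `train_val_test_split` and the fixed seed are illustrative assumptions; the fractions default to the 60/20/20 split listed above.

```python
import random

def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle once, then carve off test and validation; the rest is training."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))   # 60 20 20
```

A random shuffle like this is only appropriate when samples are independent; ordered data such as time series needs a chronological split instead.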

Underfitting vs Overfitting — Diagnosis and Fixes

Diagnosing

                     Train Error    Val Error    Diagnosis
────────────────────────────────────────────────────────
Underfitting           HIGH          HIGH        Model too simple
Good fit               LOW           LOW         
Slight overfit         LOW        SLIGHTLY HIGH  Normal, acceptable
Overfitting          VERY LOW        HIGH        Model too complex

Fixing Underfitting (High Bias)

Add more features / engineered features
Use a more complex model (more layers, more trees, higher degree polynomial)
Reduce regularization strength (lower λ)
Train longer (more epochs)
Remove features that are too noisy (paradoxically, fewer bad features can help)

Fixing Overfitting (High Variance)

Get more training data (most effective)
Use a simpler model
Increase regularization (higher λ)
Dropout (neural networks)
Early stopping
Data augmentation
Ensemble methods (averaging reduces variance)
Feature selection (remove noise features)

Regularization — Preventing Overfitting

What It Is

Regularization adds a complexity penalty to the loss function. The model must balance fitting the training data against keeping the weights small (simple).

$$L_{regularized} = L_{original} + \lambda \cdot \text{penalty}$$

L1 Regularization (Lasso)

$$L = L_{original} + \lambda \sum_{i} |w_i|$$

Penalty is the sum of absolute values of weights
Effect: drives some weights exactly to zero → automatic feature selection
The solution is sparse (many zero weights)

ELI5: L1 is like Marie Kondo — if a feature doesn’t “spark joy” (contribute meaningfully to predictions), it gets thrown out entirely. After L1 regularization, only the truly useful features have non-zero weights. You end up with a leaner, more interpretable model.

Why L1 creates sparsity: The L1 penalty has a “kink” at zero — the gradient is constant (±1) everywhere except at 0. This constant pull toward zero is strong enough to push small weights all the way to exactly zero. L2 has a shrinking gradient as weights approach zero — it gets weaker and weaker as a weight gets small, never quite reaching zero.

L2 Regularization (Ridge)

$$L = L_{original} + \lambda \sum_{i} w_i^2$$

Penalty is the sum of squared weights
Effect: shrinks all weights toward zero, but never exactly zero
All features retained, but with smaller influence

ELI5: L2 is like a dimmer switch — it turns every feature’s influence down, but nothing gets turned off completely. Every feature stays in the picture, just quieter. This is better when you believe all features are relevant but you want to prevent any single feature from dominating.

Why L2 doesn’t produce sparsity: As a weight approaches zero, the L2 penalty gradient (2λw) approaches zero too — it loses its “push” before reaching zero.
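The difference in "push" can be simulated by applying only the penalty, with no data loss, to a single weight. Two assumptions to flag: the L1 update uses soft-thresholding (a proximal step, a common way to handle the kink at zero, rather than a raw gradient step), and the values of `lr`, `lam`, and the starting weight are arbitrary.

```python
def l2_step(w, lr, lam):
    """Gradient step on lam * w**2: the pull 2*lam*w fades as w shrinks."""
    return w - lr * 2 * lam * w

def l1_prox_step(w, lr, lam):
    """Soft-thresholding step for lam * |w|: a constant pull that stops at 0."""
    shrink = lr * lam
    if w > shrink:
        return w - shrink
    if w < -shrink:
        return w + shrink
    return 0.0   # close enough to zero: snap to exactly zero

w_l1 = w_l2 = 0.5
for _ in range(100):
    w_l1 = l1_prox_step(w_l1, lr=0.1, lam=0.1)
    w_l2 = l2_step(w_l2, lr=0.1, lam=0.1)
```

After 100 steps `w_l1` is exactly 0.0, while `w_l2` has been multiplied by 0.98 a hundred times and sits near 0.066: small, but never zero.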

Elastic Net

$$L = L_{original} + \lambda_1 \sum_{i} |w_i| + \lambda_2 \sum_{i} w_i^2$$

Combines L1 and L2. Gets sparse solutions (L1) while handling correlated features gracefully (L2). Best when you have many features that are correlated.

Lambda (λ) — The Regularization Strength

λ = 0:     No regularization → pure overfitting risk
λ too low: Small penalty → model still overfits
λ just right: Sweet spot → generalizes well
λ too high: Large penalty → all weights near 0 → underfitting

Comparison Table

	L1 (Lasso)	L2 (Ridge)	Elastic Net
Penalty	$\sum |w_i|$	$\sum w_i^2$	L1 + L2
Weights	Many → exactly 0	All shrink, none → 0	Sparse + shrunk
Feature selection	Yes, automatic	No	Partial
Best when	Few features matter	All features matter (correlated)	Many features, some correlated
Robustness	Less stable (non-unique)	More stable (unique solution)	Stable
Interpretability	High (sparse)	Medium	Medium

Dropout (Neural Networks)

Randomly disable a fraction of neurons during each training forward pass.

Normal forward pass:
  x → [n1] → [n2] → [n3] → [n4] → output

With 50% dropout:
  x → [n1] →  ✗   → [n3] →  ✗   → output
  (n2 and n4 disabled this batch — different neurons next batch)

Effect: prevents neurons from co-adapting (relying on specific other neurons). Forces the network to learn redundant representations. Acts like training an ensemble of many sub-networks.
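A minimal sketch of dropout on a list of activations follows. It uses the "inverted dropout" convention, which is an assumption beyond what is described above: surviving activations are scaled by 1/(1 − p) during training so that nothing needs rescaling at inference time, when dropout is switched off.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Zero each activation with probability p; scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)   # inference: pass everything through unchanged
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)             # seeded generator so the run is repeatable
acts = [1.0] * 10000
dropped = dropout(acts, p=0.5, rng=rng)
mean_after = sum(dropped) / len(dropped)   # stays close to 1.0 on average
```

Each call draws a fresh random mask, which is what makes a different sub-network active on every batch.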

Early Stopping

Validation
  Loss
  │╲
  │ ╲___
  │     ╲___
  │         ╲__
  │            ╲__
  │               ╲___ ← minimum — STOP HERE
  │                   ╲──────────────────
  │                            ↑ overfitting begins
  └──────────────────────────────── Epochs

Stop training when validation loss starts increasing. The model at the minimum validation loss is the best model.
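In practice validation loss is noisy, so a single uptick is usually not enough evidence to stop. The sketch below adds a `patience` counter, a common refinement that is an assumption here rather than something stated above; the loss values are invented.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = stopped_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        stopped_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # new best: checkpoint would go here
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                              # no improvement for `patience` epochs
    return best_epoch, stopped_epoch

val_losses = [0.90, 0.70, 0.55, 0.48, 0.45, 0.47, 0.50, 0.56, 0.60]
best_epoch, stopped_epoch = early_stopping(val_losses)   # (4, 7)
```

Training halts at epoch 7, but the model to keep is the one saved at epoch 4, where validation loss was lowest.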

Data Augmentation

Artificially increase training data diversity:

Images: random flips, rotations, crops, color jitter, cutout, Mixup
Text: back-translation, synonym replacement, random insertion/deletion
Tabular: SMOTE (synthetic minority oversampling for imbalanced classes)

Cross-Validation

Why Cross-Validation?

A single train/val split gives a noisy estimate of model performance — you might get lucky or unlucky in how the split fell. Cross-validation averages over multiple splits for a more reliable estimate.

K-Fold Cross-Validation

Dataset: [1][2][3][4][5][6][7][8][9][10]  (10 samples, K=5 folds, 2 samples per fold)

Fold 1:  [ VAL ][TRAIN][TRAIN][TRAIN][TRAIN]  → score₁
Fold 2:  [TRAIN][ VAL ][TRAIN][TRAIN][TRAIN]  → score₂
Fold 3:  [TRAIN][TRAIN][ VAL ][TRAIN][TRAIN]  → score₃
Fold 4:  [TRAIN][TRAIN][TRAIN][ VAL ][TRAIN]  → score₄
Fold 5:  [TRAIN][TRAIN][TRAIN][TRAIN][ VAL ]  → score₅

Final score = mean(score₁, score₂, score₃, score₄, score₅)
              ± std (measure of estimate reliability)

Each sample is used for validation exactly once
Each sample is used for training K-1 times
K=5 or K=10 are standard choices
More reliable than a single split
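The fold bookkeeping can be sketched as an index generator. This version splits contiguous blocks without shuffling, which is a simplification; the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(k_fold_indices(10, 5))
# folds[0] == ([2, 3, 4, 5, 6, 7, 8, 9], [0, 1])
```

Train one model per fold, score it on that fold's validation indices, then report the mean and standard deviation of the K scores.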

Stratified K-Fold

Same as K-Fold, but ensures each fold has the same class proportion as the full dataset.

Critical for imbalanced data: if your dataset is 90% class A and 10% class B, random folds might end up with no class B examples in a fold, making evaluation meaningless.

Leave-One-Out (LOO) Cross-Validation

K = N (one sample per fold). Every sample gets to be the validation set exactly once.

Pro: nearly unbiased estimate of model performance (each model trains on N-1 of the N samples)
Con: computationally expensive (N training runs), and the estimate itself can have high variance
Use when: dataset is very small (< 100 samples)

Time Series Cross-Validation — The Special Case

Regular K-Fold on time series (WRONG!):
  Train:  [Jan][Mar][May][Aug]   Test: [Feb][Apr][Jun]
  Problem: model sees future data (August) to predict past (February)
                                       ↑ DATA LEAKAGE

Time Series Walk-Forward Validation (CORRECT):
  Fold 1:  [Jan──Mar]  → test [Apr]
  Fold 2:  [Jan──Apr]  → test [May]
  Fold 3:  [Jan──May]  → test [Jun]
  Fold 4:  [Jan──Jun]  → test [Jul]
           (expanding window)

Exam tip: NEVER use random K-Fold on time series data. The rule is simple: you can only train on data from before your test period. This mimics how the model will actually be used in production — you always predict the future from the past.
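The expanding-window scheme in the diagram can be sketched as a short generator. The name `walk_forward_splits` and the `min_train` parameter are assumptions for the example; the input is assumed to be already sorted chronologically.

```python
def walk_forward_splits(periods, min_train=3):
    """Yield (train_periods, test_period) pairs with an expanding training window."""
    for end in range(min_train, len(periods)):
        yield periods[:end], periods[end]   # train on everything before the test period

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul"]
splits = list(walk_forward_splits(months))
for train, test in splits:
    print(f"train {train[0]}..{train[-1]} -> test {test}")
# train Jan..Mar -> test Apr
# train Jan..Apr -> test May
# train Jan..May -> test Jun
# train Jan..Jun -> test Jul
```

Every test period lies strictly after its training window, so no future information can leak into training.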

Ensemble Methods — The Power of Combining Models

Why Ensembles Work

If you have many models that are:

Reasonably accurate (better than random)
Uncorrelated in their errors (they make different mistakes)

Then their average will be more accurate than any single model. The errors cancel out.

Model A correct on:  [1][1][1][0][0][0][1][0][1][1]  (60% accuracy)
Model B correct on:  [1][0][1][1][0][1][0][1][1][0]  (60% accuracy)
Model C correct on:  [0][1][1][1][1][0][1][1][0][1]  (70% accuracy)

Majority vote:       [1][1][1][1][0][0][1][1][1][1]  (80% accuracy!)
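Tallying the rows above in code is a quick way to verify the vote. The sketch assumes a binary task, so the ensemble is correct on an example whenever at least two of the three models are correct on it.

```python
# 1 = the model got that example right, 0 = it got it wrong.
model_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
model_b = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
model_c = [0, 1, 1, 1, 1, 0, 1, 1, 0, 1]

def accuracy(correct):
    return sum(correct) / len(correct)

# The majority is right whenever at least 2 of the 3 models are right.
majority = [1 if a + b + c >= 2 else 0
            for a, b, c in zip(model_a, model_b, model_c)]

individual = [accuracy(m) for m in (model_a, model_b, model_c)]
ensemble = accuracy(majority)   # 0.8
```

The ensemble beats every individual model here because the models rarely fail on the same examples; if their errors were identical, voting would gain nothing.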

Bagging (Bootstrap Aggregating)

Original Data (N samples)
       │
  ┌────┴────┐
  │Bootstrap│ ← sample with replacement
  │Sampling │
  └────┬────┘
       │
  ┌────┴────┬─────────┬─────────┐
  ↓         ↓         ↓         ↓
Model 1   Model 2   Model 3   Model 4   (trained independently)
  │         │         │         │
  └─────────┴─────────┴─────────┘
                 │
            Aggregate
         (average/vote)
                 │
           Final Prediction

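Each bootstrap sample has N draws with replacement from the original data. On average about 63.2% of the original samples appear at least once; the remaining 36.8% are left out of that sample (the “out-of-bag” samples), and their slots are filled by repeats.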
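The 63.2% figure comes from the chance that a given sample is never picked in N draws, (1 − 1/N)^N, which approaches 1/e ≈ 0.368 as N grows. The sketch below compares that formula with one simulated bootstrap sample; the sample size and seed are arbitrary.

```python
import random

def unique_fraction(n, seed=0):
    """Draw one bootstrap sample of size n and return the share of distinct items."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]   # n draws with replacement
    return len(set(sample)) / n

def expected_unique_fraction(n):
    """Probability that a given item appears at least once in n draws."""
    return 1 - (1 - 1 / n) ** n

simulated = unique_fraction(100_000)
expected = expected_unique_fraction(100_000)   # about 0.632
```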

Random Forest = bagging on decision trees + random feature subset at each split:

Random feature selection at each split decorrelates the trees (if they all see the same features, they’ll all focus on the most predictive feature)
Reduces variance without increasing bias

ELI5: Bagging is like asking 100 slightly-confused doctors to each independently diagnose the same patient, then taking the majority vote. No single doctor needs to be brilliant — you just need them to be somewhat accurate and make different mistakes. The crowd’s collective wisdom is more reliable than any individual.

Bagging reduces variance because individual model variance averages out. Best for high-variance models (decision trees).

Boosting

Training Data
      │
      ▼
   Model 1      ← train on original data
   (weak)
      │
   Errors         ← identify where Model 1 was wrong
      │
      ▼
   Model 2      ← train with HIGHER WEIGHT on Model 1's mistakes
   (weak)
      │
   Errors
      │
      ▼
   Model 3      ← train with HIGHER WEIGHT on remaining mistakes
   (weak)
      │
      ▼
   Weighted Sum → Final Prediction

ELI5: Boosting is like a relay tutoring session. Tutor 1 teaches you everything they can. Tutor 2 comes in and focuses specifically on the topics tutor 1 couldn’t explain clearly. Tutor 3 handles whatever is still confusing. Each round targets the remaining weaknesses. The final combined knowledge is much stronger than any single tutor because the team specializes in each other’s gaps.

Boosting reduces bias because each model corrects the previous one’s errors. Best for high-bias models (shallow trees = weak learners).

Types:

AdaBoost: reweights training samples (higher weight to misclassified samples)
Gradient Boosting (GBM): fits each new model to the residuals (errors) of the previous ensemble
XGBoost: GBM + L1/L2 regularization + second-order gradients + parallel tree construction
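The "fit each new model to the residuals" idea can be sketched with one-split regression trees (stumps) on a single feature. This is a toy illustration of gradient boosting for squared error under simplifying assumptions (one feature, a fixed learning rate, no regularization), not how XGBoost or any library implements it; all names and values are invented.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm

def gradient_boost(xs, ys, rounds=200, lr=0.1):
    base = sum(ys) / len(ys)                 # start from the mean prediction
    stumps, preds = [], [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # what is still unexplained
        stump = fit_stump(xs, residuals)                 # weak learner on the residuals
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [x * x for x in xs]                     # a curve no single stump can fit
model = gradient_boost(xs, ys)

mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

Each stump alone is a high-bias model, yet the sum of many small corrections drives the training error far below the mean-only baseline, which is the bias reduction described above.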

Stacking

Use the predictions of multiple diverse models as input features for a meta-model (also called a blender).

Level 0 (base models):
  x → [Logistic Regression] → pred₁
  x → [Random Forest]       → pred₂
  x → [XGBoost]             → pred₃
  x → [Neural Network]      → pred₄

Level 1 (meta-model):
  [pred₁, pred₂, pred₃, pred₄] → [Meta-Model] → Final Prediction

The meta-model learns when to trust each base model. Computationally expensive, but often the best-performing approach in ML competitions.

Bagging vs Boosting — Summary

	Bagging	Boosting
Training	Parallel (independent)	Sequential (dependent)
Each model sees	Random subset of data	Weighted/resampled data
Reduces	Variance	Bias
Combiner	Average / majority vote	Weighted sum
Best for	High-variance models (deep trees)	High-bias models (shallow trees)
Risk	Less prone to overfitting	Can overfit if too many rounds
Examples	Random Forest	XGBoost, AdaBoost, LightGBM
Speed	Parallelizable	Sequential (harder to parallelize)

Exam tip: “Bagging = parallel, reduces variance. Boosting = sequential, reduces bias.” This distinction appears on the MLS-C01 exam. Also know: XGBoost is boosting. Random Forest is bagging.

Key Formulas Cheat Sheet

Loss functions:
  MSE         = (1/n) × Σ(yᵢ - ŷᵢ)²
  RMSE        = √MSE
  MAE         = (1/n) × Σ|yᵢ - ŷᵢ|
  Cross-Entropy = -Σ yᵢ × log(ŷᵢ)
  Binary CE   = -[y×log(ŷ) + (1-y)×log(1-ŷ)]

Regularization:
  L1 (Lasso)  = L + λ × Σ|wᵢ|
  L2 (Ridge)  = L + λ × Σwᵢ²
  Elastic Net = L + λ₁×Σ|wᵢ| + λ₂×Σwᵢ²

Gradient descent:
  w_new = w_old - α × ∂L/∂w

Bias-Variance:
  Total Error = Bias² + Variance + Irreducible Error

Why This Matters for the Exam

The MLS-C01 exam tests conceptual understanding, not mathematical derivations. You need to know:
Which algorithm to choose for a given problem type (supervised/unsupervised)
How gradient descent hyperparameters (learning rate, batch size) affect training
How to diagnose overfitting vs underfitting from training/validation curves
When to use L1 vs L2 vs Elastic Net regularization
Why time series CV is different from standard K-Fold
The difference between bagging (Random Forest) and boosting (XGBoost) — parallel vs sequential, variance vs bias
