Domain 3A: ML Fundamentals from First Principles
Table of Contents
- Domain 3A: ML Fundamentals from First Principles
- What is Machine Learning? (First Principles)
- The Learning Process — What “Training” Actually Means
- Gradient Descent — The Engine of ML
- Bias-Variance Tradeoff — The Central Tension of ML
- Underfitting vs Overfitting — Diagnosis and Fixes
- Regularization — Preventing Overfitting
- Cross-Validation
- Ensemble Methods — The Power of Combining Models
- Key Formulas Cheat Sheet
- Why This Matters for the Exam
Exam Domain: 3 (Modeling, 36%)
Focus: The foundations every other ML concept builds on
What is Machine Learning? (First Principles)
The Fundamental Idea
Traditional programming: you write the rules, and the computer applies them to data. Machine learning: you give the computer data and answers, and it discovers the rules.
Traditional Programming:
Rules + Data → Program → Output
Machine Learning:
Data + Output → Training → Rules (Model)
New Data + Model → Prediction
ELI5: Imagine teaching a child to recognize cats. Traditional programming means writing an encyclopedia of rules: “has fur, has four legs, has pointy ears, meows…” — and it breaks the moment you see a hairless cat. Machine learning means showing the child 10,000 photos labeled “cat” or “not cat” and letting them figure out the patterns themselves. The child’s brain becomes the model.
Types of Learning — The Taxonomy
Machine Learning
├── Supervised Learning (has labels)
│ ├── Classification → discrete output (cat/dog, spam/not-spam)
│ └── Regression → continuous output (house price, temperature)
│
├── Unsupervised Learning (no labels)
│ ├── Clustering → group similar data points
│ ├── Dimensionality Reduction → compress features
│ └── Anomaly Detection → find unusual points
│
├── Semi-Supervised Learning (few labels + many unlabeled)
│ └── Self-training, label propagation
│
└── Reinforcement Learning (reward signal, no direct labels)
└── Agent learns by trial + error to maximize reward
| Type | Labels? | Learns | Example Use Case |
|---|---|---|---|
| Supervised | Yes, all data | Input → Output mapping | Email spam detection |
| Unsupervised | No | Hidden structure | Customer segmentation |
| Semi-supervised | Few | Leverage unlabeled data | Document classification (10 labeled, 10K unlabeled) |
| Reinforcement | Reward only | Optimal actions | Game playing, robot control |
When to use each:
- Supervised: You have labeled examples and want to predict a known target
- Unsupervised: You want to discover patterns without knowing what you’re looking for
- Semi-supervised: Labeling is expensive but unlabeled data is abundant (most real-world text/image problems)
- Reinforcement: The goal is sequential decision-making with delayed rewards (robotics, games, resource optimization)
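Exam tip: “Discriminative vs generative” is a distinction you will mostly meet within supervised learning. Discriminative models (logistic regression, SVM, most neural network classifiers) model P(y|x), or the decision boundary directly. Generative models (Naive Bayes, GMM) model how the data is generated: P(x|y) together with the prior P(y), which is equivalent to the joint distribution P(x, y). They then classify by applying Bayes’ rule.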
The Learning Process — What “Training” Actually Means
The Mechanical View
Step 1: Initialize weights randomly
w = [0.1, -0.3, 0.7, ...]
Step 2: Forward pass — make a prediction
ŷ = model(x, w)
Step 3: Compute loss — measure how wrong we are
L = loss(y, ŷ)
Step 4: Backward pass — compute gradients
∂L/∂w = how much each weight contributed to the error
Step 5: Update weights — reduce the error
w_new = w_old - α × (∂L/∂w)
Step 6: Repeat steps 2-5 for all training data
(one pass through all data = one epoch)
Training is just this loop, repeated thousands or millions of times, until the loss stops decreasing.
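The six steps above can be sketched as a small loop. This is a minimal illustration that assumes a one-feature linear model, MSE loss, and full-batch updates. The toy data, the learning rate of 0.1, and the 500-epoch budget are arbitrary choices for the example, not values from this guide.

```python
import random

# Toy data drawn from y = 3x + 2 plus small noise (illustrative only).
random.seed(0)
xs = [i / 10 for i in range(20)]
ys = [3 * x + 2 + random.gauss(0, 0.1) for x in xs]

w, b = random.uniform(-1, 1), 0.0   # Step 1: initialize weights
alpha = 0.1                         # learning rate (assumed value)
history = []

for epoch in range(500):            # Step 6: repeat
    preds = [w * x + b for x in xs]                  # Step 2: forward pass
    errors = [p - y for p, y in zip(preds, ys)]
    loss = sum(e * e for e in errors) / len(xs)      # Step 3: MSE loss
    # Step 4: gradients of the MSE with respect to w and b
    grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = 2 * sum(errors) / len(xs)
    w -= alpha * grad_w                              # Step 5: update weights
    b -= alpha * grad_b
    history.append(loss)

print(f"w={w:.2f}, b={b:.2f}, first loss={history[0]:.3f}, last loss={history[-1]:.4f}")
```

The learned `w` and `b` should land close to the 3 and 2 used to generate the data, and `history` records the loss falling across epochs.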
Loss / Cost Functions — Measuring Wrongness
A loss function takes the model’s prediction and the true answer and returns a single number: how wrong is the model?
The model’s job is to minimize this number.
ELI5: A loss function is like a score in a game where lower is better. If you’re playing golf, every swing that misses costs you points. The model is the golfer. Training is practice. Each round of practice, you slightly adjust your swing (weights) based on what caused the most penalty. After enough rounds, you’ve learned to minimize your score.
Mean Squared Error (Regression)
$$L = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
- Penalizes large errors disproportionately (squaring amplifies big mistakes)
- Sensitive to outliers
- Units: squared (if target is dollars, loss is dollars²)
Root Mean Squared Error
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
- Same units as the target variable (interpretable)
- Still sensitive to outliers, but more readable
Cross-Entropy Loss (Classification)
$$L = -\sum_{i} y_i \log(\hat{y}_i)$$
- Penalizes confident wrong predictions severely
- If truth is class 1 and model says P(class 1) = 0.01 → massive penalty
- If truth is class 1 and model says P(class 1) = 0.99 → tiny penalty
Binary cross-entropy (for binary classification): $$L = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$$
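To make the "confident wrong predictions" point concrete, the sketch below evaluates binary cross-entropy for the two cases listed above. The function name and the `eps` clipping guard are illustrative additions, and the natural logarithm is assumed.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy for a single example; eps guards against log(0)."""
    p = min(max(p_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

confident_wrong = binary_cross_entropy(1, 0.01)  # truth is class 1, model says 1%
confident_right = binary_cross_entropy(1, 0.99)  # truth is class 1, model says 99%

print(f"confident and wrong: {confident_wrong:.3f}")
print(f"confident and right: {confident_right:.3f}")
```

The wrong-but-confident case costs roughly 4.6, several hundred times the roughly 0.01 of the right-and-confident case.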
Why Different Loss Functions?
| Loss Function | Task | Key Property |
|---|---|---|
| MSE | Regression | Penalizes large errors heavily |
| MAE (Mean Absolute Error) | Regression | Robust to outliers (linear penalty) |
| Huber Loss | Regression | MSE for small errors, MAE for large (best of both) |
| Cross-Entropy | Classification | Probabilistic interpretation, confident mistakes punished hard |
| Hinge Loss | SVM/Classification | Maximizes margin, 0 loss when correctly classified with confidence |
| Focal Loss | Imbalanced classification | Down-weights easy examples, focuses learning on hard cases |
Exam tip: Huber loss is the answer when you want regression with outlier robustness but still want smooth gradients near zero. Cross-entropy is almost always the right choice for classification. MSE on sigmoid or softmax outputs gives vanishingly small gradients when the model is confidently wrong, so learning stalls exactly where it should be fastest.
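A quick way to see the outlier behaviour from the table is to score the same set of errors with MSE, MAE, and Huber loss. The sketch assumes the standard Huber form with threshold `delta = 1.0`; the error values are made up for illustration.

```python
def mse(errors):
    return sum(e * e for e in errors) / len(errors)

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def huber(errors, delta=1.0):
    """Quadratic for |e| <= delta, linear beyond it (standard Huber form)."""
    total = 0.0
    for e in errors:
        if abs(e) <= delta:
            total += 0.5 * e * e
        else:
            total += delta * (abs(e) - 0.5 * delta)
    return total / len(errors)

errors = [0.5, -0.5, 10.0]  # the last value plays the role of an outlier

print(f"MSE={mse(errors):.2f}  MAE={mae(errors):.2f}  Huber={huber(errors):.2f}")
```

The single outlier dominates the MSE, while MAE and Huber grow only linearly with it.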
Gradient Descent — The Engine of ML
The Intuition
Imagine you’re lost in a hilly landscape covered in fog. Your goal is to reach the lowest valley. You can’t see far, but you can feel the slope under your feet. What do you do?
- Feel which direction goes downhill (compute gradient)
- Take a step in that direction (update weights)
- Repeat until you can’t go any lower (convergence)
This is gradient descent.
Loss
│
│ ╲ Current position
│ ╲ ← ●
│ ╲ ╲
│ ╲ ╲
│ ╲ ● ← after one step
│ ╲ ╲
│ ╲ ╲
│ ╲ ● ← converging
│ ╲____●___
│ ↑ minimum
└──────────────────────── Weights (w)
Gradient = slope at current position (∂L/∂w)
Step size = learning rate (α)
The Math
$$w_{new} = w_{old} - \alpha \cdot \frac{\partial L}{\partial w}$$
- $\alpha$ is the learning rate — how big each step is
- $\frac{\partial L}{\partial w}$ is the gradient — which direction and how steeply uphill
- We subtract the gradient because we want to go downhill
Learning Rate: The Most Critical Hyperparameter
α too HIGH: Loss
│ ╱╲ ╱╲ ╱╲
│╱ ╲╱ ╲╱ → oscillates, diverges
│
Steps overshoot the minimum
α just right: Loss
│╲
│ ╲___ → smooth convergence
│
Steps land near the minimum
α too LOW: Loss
│╲
│ ╲__________ → converges, but takes forever
│
Tiny steps, takes thousands of extra epochs
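The three regimes can be reproduced on the simplest possible loss, L(w) = w², whose gradient is 2w. The specific learning rates below are assumptions chosen to trigger each behaviour on this particular function; the thresholds differ for other losses.

```python
def run_gd(alpha, steps=20, w0=1.0):
    """Run gradient descent on L(w) = w**2 and return the final w."""
    w = w0
    for _ in range(steps):
        w -= alpha * 2 * w   # gradient of w**2 is 2*w
    return w

too_high = run_gd(1.1)     # each step multiplies w by -1.2: oscillates and diverges
just_right = run_gd(0.3)   # each step multiplies w by 0.4: fast convergence
too_low = run_gd(0.001)    # each step multiplies w by 0.998: barely moves

print(too_high, just_right, too_low)
```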
Learning rate strategies:
- Learning rate schedule: start high, decay over time (step decay, cosine annealing)
- Warm-up: start low, ramp up, then decay — prevents instability at beginning of training
- Cyclical LR: oscillate between bounds — helps escape local minima
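As one possible combination of these strategies, the sketch below implements linear warm-up followed by cosine annealing. The function name `lr_at`, the base rate, and the warm-up length are hypothetical choices for illustration.

```python
import math

def lr_at(step, total_steps, base_lr=0.1, warmup_steps=10):
    """Learning rate at a given step: linear warm-up, then cosine decay to 0."""
    if step < warmup_steps:
        # Ramp up linearly from base_lr / warmup_steps to base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

schedule = [lr_at(s, total_steps=100) for s in range(101)]
print(schedule[0], schedule[10], schedule[55], schedule[100])
```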
Gradient Descent Variants
Batch Gradient Descent:
┌─────────────────────────────┐
│ for each epoch: │
│ compute gradient on ALL │
│ N training examples │
│ update weights once │
└─────────────────────────────┘
Pro: stable, accurate gradient
Con: SLOW — one update per full dataset pass
Stochastic Gradient Descent (SGD):
┌─────────────────────────────┐
│ for each epoch: │
│ for each sample: │
│ compute gradient on │
│ SINGLE sample │
│ update weights │
└─────────────────────────────┘
Pro: fast updates, online learning
Con: noisy gradient, may never fully converge
Mini-batch Gradient Descent:
┌─────────────────────────────┐
│ for each epoch: │
│ for each mini-batch │
│ of size B (32, 64, 128): │
│ compute gradient on B │
│ update weights │
└─────────────────────────────┘
Pro: balance of speed + stability
Con: batch size is a hyperparameter
→ THIS IS THE STANDARD IN DEEP LEARNING
| Variant | Batch Size | Update Frequency | Gradient Noise | Memory |
|---|---|---|---|---|
| Batch GD | All N | Once per epoch | Low (accurate) | High |
| SGD | 1 | N times per epoch | High (noisy) | Low |
| Mini-batch GD | 32-512 | N/B times per epoch | Medium | Medium |
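The "N/B updates per epoch" entry can be checked with a small batching helper. This is a sketch: `minibatches` is a hypothetical name, and shuffling once per epoch is a common convention assumed here.

```python
import random

def minibatches(n_samples, batch_size, seed=0):
    """Yield lists of shuffled sample indices, one list per mini-batch."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    for start in range(0, n_samples, batch_size):
        yield idx[start:start + batch_size]

batches = list(minibatches(n_samples=100, batch_size=32))
print([len(b) for b in batches])  # the last batch holds the remainder
```

With N = 100 and B = 32 there are four weight updates per epoch, the last on a smaller batch of 4 samples.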
Advanced Optimizers
Plain SGD has problems: equal learning rate for all weights, slow in ravines. Advanced optimizers fix this.
Optimizer Family Tree:
SGD
├── SGD + Momentum (adds velocity, accelerates in consistent directions)
├── Adagrad (adaptive LR: frequent params get smaller LR)
│ └── RMSprop (Adagrad + exponential moving average of squared gradients)
└── Adam (RMSprop + Momentum combined)
└── AdamW (Adam + weight decay decoupled from gradient)
ELI5 on Momentum: Plain SGD is like a ball rolling down a bumpy hill that stops dead every time it hits a small bump. Momentum gives the ball mass — it builds up speed going downhill, which helps it roll through small bumps and get out of shallow dips. “Instead of recalculating direction from scratch each step, you keep some of the previous velocity.”
Adam (Adaptive Moment Estimation) is the default for deep learning because:
- Maintains per-parameter adaptive learning rates (like RMSprop)
- Tracks momentum in the gradient direction (like SGD with momentum)
- Works well with little tuning on most problems
- Handles sparse gradients (NLP) and dense gradients (vision) well
| Optimizer | Best For | Key Hyperparams |
|---|---|---|
| SGD + Momentum | Computer vision (CNNs), when Adam overfits | lr, momentum |
| Adam | Default for most deep learning | lr, β₁=0.9, β₂=0.999, ε |
| AdamW | Transformers, NLP | lr, weight_decay |
| RMSprop | RNNs | lr, rho |
| Adagrad | Sparse data, NLP | lr |
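To show how Adam combines momentum with a per-parameter adaptive step, here is a minimal single-parameter sketch using the default β₁, β₂, and ε from the table. The function name, the learning rate of 0.1, the step count, and the quadratic test loss are assumptions for illustration.

```python
import math

def adam_minimize(grad_fn, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a 1-D function with Adam, given its gradient function."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # momentum: moving average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # moving average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Loss L(w) = (w - 3)**2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_final = adam_minimize(lambda w: 2 * (w - 3), w=0.0)
print(w_final)
```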
Local Minima vs Global Minima
1D Loss Surface (simple, illustrative):
Loss
│ local global
│ min min
│ ╲ /╲ ╲ /╲ ╲
│ ╲/ ╲ ╲/ ╲ ╲____
└──────────────────────── w
↑ ↑
can get trapped want this
In high dimensions (millions of weights), this is much less of a problem than it looks in 1D. Most critical points in high-dimensional loss landscapes are saddle points (flat in some directions, curved in others), not true local minima. The practical issue is getting stuck on plateaus (very flat regions where gradients are near zero).
Convergence — when to stop:
- Validation loss stops decreasing (most common)
- Gradient norm falls below a threshold
- Fixed number of epochs reached
Bias-Variance Tradeoff — The Central Tension of ML
What is Bias?
Bias = systematic error — the model is consistently wrong in the same direction because it’s too simple to capture the true pattern.
True function (wiggly) High-bias model (straight line)
╭──╮ /
───╯ ╰───╮ /
╰──╮ /
╰─── / ← misses all the curves
(underfitting)
ELI5: Bias is like wearing the wrong prescription glasses — everything you see is consistently blurry in the same way. You might look at a horse and always see a slightly fuzzy shape. The error is systematic and predictable. You could look at a million horses and still be consistently wrong because the problem is your glasses, not the horse.
High bias symptoms:
- High error on training data AND test data
- Model is “too simple” — a linear model fitting a cubic relationship
- Adding more data doesn’t help (the model can’t learn the complexity anyway)
What is Variance?
Variance = sensitivity to training data — the model memorizes noise and fluctuations, not the true signal.
True function (simple) High-variance model (over-wiggly)
╭╮ ╭╮
/ ───╯╰────╯╰───
/ (clean linear trend)
/
ELI5: Variance is like a hypochondriac who thinks every new symptom is a different rare disease — over-reacting to every data point. The model is so sensitive to the specific training examples it saw that if you replaced just one training point, you’d get a completely different model.
High variance symptoms:
- Very low error on training data, high error on test data
- Model is “too complex” — a degree-50 polynomial fitting 20 data points
- Getting more training data usually helps
The Tradeoff
Total Error = Bias² + Variance + Irreducible Error
Error
│
│ Bias² Total Error
│ ╲ ╱╲
│ ╲ ╱ ╲
│ ╲ ╱ ╲
│ ╲──────╱ ╲ ╲
│ ╱ Variance
│ ╱
│____________╱___________
└──────────────────────── Model Complexity
simple complex
(underfit) ↑ (overfit)
sweet spot
- Irreducible error: noise in the data itself — can never be eliminated
- As complexity increases: bias decreases, variance increases
- The art of ML is finding the sweet spot
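The tradeoff can be observed with any model family that has a complexity knob. The sketch below swaps in k-nearest-neighbours regression, which this guide does not otherwise cover, because it needs no libraries: k = 1 is the most complex setting (memorizes the training set), and k = N is the simplest (always predicts the mean). The sine target, noise level, and sample sizes are assumptions.

```python
import math
import random

random.seed(1)

def make_data(n):
    xs = [random.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + random.gauss(0, 0.2) for x in xs]
    return xs, ys

def knn_predict(x, train_x, train_y, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(xs, ys, train_x, train_y, k):
    sq = [(knn_predict(x, train_x, train_y, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(sq) / len(sq)

train_x, train_y = make_data(40)
test_x, test_y = make_data(200)

results = {}
for k in (1, 5, 40):
    results[k] = (mse(train_x, train_y, train_x, train_y, k),
                  mse(test_x, test_y, train_x, train_y, k))
    print(f"k={k:2d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With k = 1 the training error is zero while the test error is not (the high-variance signature). With k = 40 both errors are high (the high-bias signature).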
The Train / Validation / Test Split — Why Three Sets
All available data
│
┌──────────────────┼──────────────────┐
│ │ │
Training Set Validation Set Test Set
(60-80%) (10-20%) (10-20%)
│ │ │
Learn patterns Tune model Final honest
from this hyperparams evaluation
│ (compare (NEVER touch
Repeat many models, until done!)
times over early stopping)
ELI5: Training data is homework — you practice on it repeatedly. Validation data is practice tests — you check your progress and adjust your study approach. Test data is the final exam — you only look at it once, at the very end, and you never use it to guide your studying. If you peek at the final exam questions while studying, your grade is meaningless as a measure of whether you actually learned anything.
Why not just train/test? Because you’ll implicitly optimize for the test set through repeated model selection. Every time you try a new model and compare test performance, you’re “using” the test set to make decisions. The validation set absorbs this overfitting, protecting the test set’s integrity.
Common splits:
- 60/20/20 (train/val/test): classic, good for medium datasets
- 80/10/10: when you have plenty of data and need more training
- 90/5/5: large datasets (millions of examples) — 5% is still huge
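A three-way split takes only a few lines. The sketch below assumes a random shuffle is appropriate, which it is not for time series data, and uses a hypothetical helper name and seed.

```python
import random

def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle once, then carve off the test and validation sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60/20/20 split
```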
Underfitting vs Overfitting — Diagnosis and Fixes
Diagnosing
| | Train Error | Val Error | Diagnosis |
|---|---|---|---|
| Underfitting | HIGH | HIGH | Model too simple |
| Good fit | LOW | LOW | |
| Slight overfit | LOW | SLIGHTLY HIGH | Normal, acceptable |
| Overfitting | VERY LOW | HIGH | Model too complex |
Fixing Underfitting (High Bias)
- Add more features / engineered features
- Use a more complex model (more layers, more trees, higher degree polynomial)
- Reduce regularization strength (lower λ)
- Train longer (more epochs)
- Remove features that are too noisy (paradoxically, fewer bad features can help)
Fixing Overfitting (High Variance)
- Get more training data (most effective)
- Use a simpler model
- Increase regularization (higher λ)
- Dropout (neural networks)
- Early stopping
- Data augmentation
- Ensemble methods (averaging reduces variance)
- Feature selection (remove noise features)
Regularization — Preventing Overfitting
What It Is
Regularization adds a complexity penalty to the loss function. The model must balance fitting the training data against keeping the weights small (simple).
$$L_{regularized} = L_{original} + \lambda \cdot \text{penalty}$$
L1 Regularization (Lasso)
$$L = L_{original} + \lambda \sum_{i} |w_i|$$
- Penalty is the sum of absolute values of weights
- Effect: drives some weights exactly to zero → automatic feature selection
- The solution is sparse (many zero weights)
ELI5: L1 is like Marie Kondo — if a feature doesn’t “spark joy” (contribute meaningfully to predictions), it gets thrown out entirely. After L1 regularization, only the truly useful features have non-zero weights. You end up with a leaner, more interpretable model.
Why L1 creates sparsity: The L1 penalty has a “kink” at zero. Its gradient has constant magnitude (±λ) everywhere except at 0, so the pull toward zero does not weaken as a weight shrinks, and small weights get pushed all the way to exactly zero. The L2 gradient, by contrast, shrinks in proportion to the weight: the pull gets weaker and weaker as a weight gets small, so the weight never quite reaches zero.
L2 Regularization (Ridge)
$$L = L_{original} + \lambda \sum_{i} w_i^2$$
- Penalty is the sum of squared weights
- Effect: shrinks all weights toward zero, but never exactly zero
- All features retained, but with smaller influence
ELI5: L2 is like a dimmer switch — it turns every feature’s influence down, but nothing gets turned off completely. Every feature stays in the picture, just quieter. This is better when you believe all features are relevant but you want to prevent any single feature from dominating.
Why L2 doesn’t produce sparsity: As a weight approaches zero, the L2 penalty gradient (2λw) approaches zero too — it loses its “push” before reaching zero.
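The contrast shows up clearly in the one-dimensional closed-form solutions. For a single weight with unregularized optimum z, minimizing ½(w − z)² + λ|w| gives soft-thresholding, while minimizing ½(w − z)² + λw² gives proportional shrinkage. This simplified setting is an assumption made for illustration; real models couple the weights through the data.

```python
def l1_shrink(z, lam):
    """Soft-thresholding: minimizer of 0.5*(w - z)**2 + lam*|w|."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*w**2."""
    return z / (1 + 2 * lam)

weights = [3.0, 0.4, -0.2, 0.05]
lam = 0.5

l1_result = [l1_shrink(z, lam) for z in weights]
l2_result = [l2_shrink(z, lam) for z in weights]
print("L1:", l1_result)  # small weights become exactly 0
print("L2:", l2_result)  # every weight is halved, none becomes 0
```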
Elastic Net
$$L = L_{original} + \lambda_1 \sum_{i} |w_i| + \lambda_2 \sum_{i} w_i^2$$
Combines L1 and L2. Gets sparse solutions (L1) while handling correlated features gracefully (L2). Best when you have many features that are correlated.
Lambda (λ) — The Regularization Strength
λ = 0: No regularization → pure overfitting risk
λ too low: Small penalty → model still overfits
λ just right: Sweet spot → generalizes well
λ too high: Large penalty → all weights near 0 → underfitting
Comparison Table
| | L1 (Lasso) | L2 (Ridge) | Elastic Net |
|---|---|---|---|
| Penalty | $\sum \lvert w_i \rvert$ | $\sum w_i^2$ | L1 + L2 |
| Weights | Many → exactly 0 | All shrink, none → 0 | Sparse + shrunk |
| Feature selection | Yes, automatic | No | Partial |
| Best when | Few features matter | All features matter (correlated) | Many features, some correlated |
| Robustness | Less stable (non-unique) | More stable (unique solution) | Stable |
| Interpretability | High (sparse) | Medium | Medium |
Dropout (Neural Networks)
Randomly disable a fraction of neurons during each training forward pass.
Normal forward pass:
x → [n1] → [n2] → [n3] → [n4] → output
With 50% dropout:
x → [n1] → ✗ → [n3] → ✗ → output
(n2 and n4 disabled this batch — different neurons next batch)
Effect: prevents neurons from co-adapting (relying on specific other neurons). Forces the network to learn redundant representations. Acts like training an ensemble of many sub-networks.
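A sketch of how this is commonly implemented, known as "inverted dropout": surviving activations are scaled by 1/(1 − p) during training so that the expected activation is unchanged and nothing needs rescaling at inference. The scaling detail and the function signature are assumptions beyond what the text above describes.

```python
import random

def dropout(activations, p_drop=0.5, training=True, rng=random):
    """Zero each activation with probability p_drop; scale survivors by 1/(1-p_drop)."""
    if not training or p_drop == 0:
        return list(activations)          # inference: pass through unchanged
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

demo = dropout([1.0, 2.0, 3.0, 4.0], p_drop=0.5, rng=random.Random(0))
print(demo)  # each entry is either 0.0 or twice its input

# Averaged over many units, the expected activation is preserved.
big = dropout([1.0] * 10_000, p_drop=0.5, rng=random.Random(0))
mean_activation = sum(big) / len(big)
print(mean_activation)
```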
Early Stopping
Validation
Loss
│╲
│ ╲___
│ ╲___
│ ╲__
│ ╲__
│ ╲___ ← minimum — STOP HERE
│ ╲──────────────────
│ ↑ overfitting begins
└──────────────────────────────── Epochs
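Stop training once validation loss has stopped improving, typically after it fails to improve for a set number of epochs (the patience), and keep the weights from the epoch with the lowest validation loss. That checkpoint is the best model.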
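In practice, early stopping is usually implemented with a patience counter so that one noisy epoch does not end training prematurely. The sketch below works on a pre-recorded list of validation losses; the function name, the patience of 3, and the loss values are illustrative.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, best_loss), stopping after `patience` epochs without improvement."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # new best: checkpoint here
        elif epoch - best_epoch >= patience:
            break                                  # no improvement for `patience` epochs
    return best_epoch, best_loss

losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60]
best_epoch, best_loss = early_stopping_epoch(losses)
print(best_epoch, best_loss)
```

Here the loop stops at epoch 6, three epochs after the best validation loss at epoch 3, and the epoch-3 checkpoint is the one to keep.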
Data Augmentation
Artificially increase training data diversity:
- Images: random flips, rotations, crops, color jitter, cutout, Mixup
- Text: back-translation, synonym replacement, random insertion/deletion
- Tabular: SMOTE (synthetic minority oversampling for imbalanced classes)
Cross-Validation
Why Cross-Validation?
A single train/val split gives a noisy estimate of model performance — you might get lucky or unlucky in how the split fell. Cross-validation averages over multiple splits for a more reliable estimate.
K-Fold Cross-Validation
Dataset: [1][2][3][4][5][6][7][8][9][10] (10 samples, K=5 folds)
Fold 1: [ VAL ][TRAIN][TRAIN][TRAIN][TRAIN] → score₁
Fold 2: [TRAIN][ VAL ][TRAIN][TRAIN][TRAIN] → score₂
Fold 3: [TRAIN][TRAIN][ VAL ][TRAIN][TRAIN] → score₃
Fold 4: [TRAIN][TRAIN][TRAIN][ VAL ][TRAIN] → score₄
Fold 5: [TRAIN][TRAIN][TRAIN][TRAIN][ VAL ] → score₅
Final score = mean(score₁, score₂, score₃, score₄, score₅)
± std (measure of estimate reliability)
- Each sample is used for validation exactly once
- Each sample is used for training K-1 times
- K=5 or K=10 are standard choices
- More reliable than a single split
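The fold bookkeeping can be sketched as an index generator. This version does not shuffle; real pipelines usually shuffle first, or use a library implementation. `k_fold_indices` is a hypothetical name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train, val) in enumerate(folds, start=1):
    print(f"Fold {i}: val={val} train={train}")
```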
Stratified K-Fold
Same as K-Fold, but ensures each fold has the same class proportion as the full dataset.
Critical for imbalanced data: if your dataset is 90% class A and 10% class B, random folds might end up with no class B examples in a fold, making evaluation meaningless.
Leave-One-Out (LOO) Cross-Validation
K = N (one sample per fold). Every sample gets to be the validation set exactly once.
- Pro: nearly unbiased estimate of model performance, since each model trains on N-1 samples (the estimate itself can have high variance)
- Con: computationally expensive (N training runs)
- Use when: dataset is very small (< 100 samples)
Time Series Cross-Validation — The Special Case
Regular K-Fold on time series (WRONG!):
Train: [Jan][Mar][May][Aug] Test: [Feb][Apr][Jun]
Problem: model sees future data (August) to predict past (February)
↑ DATA LEAKAGE
Time Series Walk-Forward Validation (CORRECT):
Fold 1: [Jan──Mar] → test [Apr]
Fold 2: [Jan──Apr] → test [May]
Fold 3: [Jan──May] → test [Jun]
Fold 4: [Jan──Jun] → test [Jul]
(expanding window)
Exam tip: NEVER use random K-Fold on time series data. The rule is simple: you can only train on data from before your test period. This mimics how the model will actually be used in production — you always predict the future from the past.
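The expanding-window scheme shown above can be generated as follows. The function name and its parameters are illustrative; the month labels mirror the diagram.

```python
def expanding_window_splits(n_periods, initial_train, horizon=1):
    """Return (train_indices, test_indices) pairs where training always precedes testing."""
    splits = []
    train_end = initial_train
    while train_end + horizon <= n_periods:
        train = list(range(train_end))                          # everything before the test window
        test = list(range(train_end, train_end + horizon))      # the next `horizon` periods
        splits.append((train, test))
        train_end += horizon
    return splits

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul"]
splits = expanding_window_splits(len(months), initial_train=3)
for i, (train, test) in enumerate(splits, start=1):
    print(f"Fold {i}: train {months[train[0]]}..{months[train[-1]]} -> test {months[test[0]]}")
```

Every training index is smaller than every test index in the same fold, which is the property that rules out leakage from the future.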
Ensemble Methods — The Power of Combining Models
Why Ensembles Work
If you have many models that are:
- Reasonably accurate (better than random)
- Uncorrelated in their errors (they make different mistakes)
Then their average will be more accurate than any single model. The errors cancel out.
Model A correct on: [1][1][1][0][0][0][1][0][1][1] (60% accuracy)
Model B correct on: [1][0][1][1][0][1][0][1][1][0] (60% accuracy)
Model C correct on: [0][1][1][1][1][0][1][1][0][1] (70% accuracy)
Majority vote: [1][1][1][1][0][0][1][1][1][1] (80% accuracy!)
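The vote above can be verified directly. The sketch treats each list as per-sample correctness flags and assumes binary classification, where a sample is voted correctly exactly when at least two of the three models are correct on it.

```python
model_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
model_b = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
model_c = [0, 1, 1, 1, 1, 0, 1, 1, 0, 1]

def accuracy(flags):
    return sum(flags) / len(flags)

# A sample is right under majority vote when at least 2 of 3 models are right.
majority = [1 if a + b + c >= 2 else 0
            for a, b, c in zip(model_a, model_b, model_c)]

for name, flags in [("A", model_a), ("B", model_b), ("C", model_c), ("vote", majority)]:
    print(f"{name}: {accuracy(flags):.0%}")
```

The ensemble reaches 80% because the models' mistakes mostly fall on different samples. It still fails on samples 5 and 6, where two models are wrong at once.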
Bagging (Bootstrap Aggregating)
Original Data (N samples)
│
┌────┴────┐
│Bootstrap│ ← sample with replacement
│Sampling │
└────┬────┘
│
┌────┴────┬─────────┬─────────┐
↓ ↓ ↓ ↓
Model 1 Model 2 Model 3 Model 4 (trained independently)
│ │ │ │
└─────────┴─────────┴─────────┘
│
Aggregate
(average/vote)
│
Final Prediction
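Each bootstrap sample consists of N draws with replacement from the original data. On average, about 63.2% of the original samples appear at least once. The other 36.8% of the draws are duplicates, and the original samples they displace are left out of that model’s training set (the out-of-bag samples).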
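The 63.2% figure comes from 1 − (1 − 1/N)ᴺ, which approaches 1 − 1/e as N grows. A quick simulation, with an arbitrary N and seed, confirms it:

```python
import math
import random

def unique_fraction(n, seed=0):
    """Draw n indices with replacement and return the fraction of distinct ones."""
    rng = random.Random(seed)
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n

frac = unique_fraction(10_000)
print(f"simulated: {frac:.3f}   theory: {1 - 1 / math.e:.3f}")
```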
Random Forest = bagging on decision trees + random feature subset at each split:
- Random feature selection at each split decorrelates the trees (if they all see the same features, they’ll all focus on the most predictive feature)
- Reduces variance with little or no increase in bias
ELI5: Bagging is like asking 100 slightly-confused doctors to each independently diagnose the same patient, then taking the majority vote. No single doctor needs to be brilliant — you just need them to be somewhat accurate and make different mistakes. The crowd’s collective wisdom is more reliable than any individual.
Bagging reduces variance because individual model variance averages out. Best for high-variance models (decision trees).
Boosting
Training Data
│
▼
Model 1 ← train on original data
(weak)
│
Errors ← identify where Model 1 was wrong
│
▼
Model 2 ← train with HIGHER WEIGHT on Model 1's mistakes
(weak)
│
Errors
│
▼
Model 3 ← train with HIGHER WEIGHT on remaining mistakes
(weak)
│
▼
Weighted Sum → Final Prediction
ELI5: Boosting is like a relay tutoring session. Tutor 1 teaches you everything they can. Tutor 2 comes in and focuses specifically on the topics tutor 1 couldn’t explain clearly. Tutor 3 handles whatever is still confusing. Each round targets the remaining weaknesses. The final combined knowledge is much stronger than any single tutor because the team specializes in each other’s gaps.
Boosting reduces bias because each model corrects the previous one’s errors. Best for high-bias models (shallow trees = weak learners).
Types:
- AdaBoost: reweights training samples (higher weight to misclassified samples)
- Gradient Boosting (GBM): fits each new model to the residuals (errors) of the previous ensemble
- XGBoost: GBM + L1/L2 regularization + second-order gradients + parallel tree construction
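The residual-fitting idea behind gradient boosting can be sketched with depth-1 trees (stumps) on a single feature. This is a simplified illustration for squared-error loss, where the residuals equal the negative gradient. It is not how XGBoost or any specific library is implemented, and all names and hyperparameters are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)                 # start from the mean prediction
    preds = [base] * len(xs)
    stumps, losses = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # what is still unexplained
        stump = fit_stump(xs, residuals)                  # weak learner fits the residuals
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    predict = lambda x: base + learning_rate * sum(s(x) for s in stumps)
    return predict, losses

xs = [i / 10 for i in range(30)]
ys = [x * x for x in xs]
predict, losses = gradient_boost(xs, ys)
print(f"MSE after round 1: {losses[0]:.3f}, after round 50: {losses[-1]:.3f}")
```

Each round lowers the training error a little. That is the bias reduction boosting is known for, and it is also why too many rounds can overfit.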
Stacking
Use the predictions of multiple diverse models as input features for a meta-model (also called a blender).
Level 0 (base models):
x → [Logistic Regression] → pred₁
x → [Random Forest] → pred₂
x → [XGBoost] → pred₃
x → [Neural Network] → pred₄
Level 1 (meta-model):
[pred₁, pred₂, pred₃, pred₄] → [Meta-Model] → Final Prediction
The meta-model learns when to trust each base model. Computationally expensive, but often the best-performing approach in ML competitions.
Bagging vs Boosting — Summary
| | Bagging | Boosting |
|---|---|---|
| Training | Parallel (independent) | Sequential (dependent) |
| Each model sees | Random subset of data | Weighted/resampled data |
| Reduces | Variance | Bias |
| Combiner | Average / majority vote | Weighted sum |
| Best for | High-variance models (deep trees) | High-bias models (shallow trees) |
| Risk | Less prone to overfitting | Can overfit if too many rounds |
| Examples | Random Forest | XGBoost, AdaBoost, LightGBM |
| Speed | Parallelizable | Sequential (harder to parallelize) |
Exam tip: “Bagging = parallel, reduces variance. Boosting = sequential, reduces bias.” This distinction appears on the MLS-C01 exam. Also know: XGBoost is boosting. Random Forest is bagging.
Key Formulas Cheat Sheet
Loss functions:
MSE = (1/n) × Σ(yᵢ - ŷᵢ)²
RMSE = √MSE
MAE = (1/n) × Σ|yᵢ - ŷᵢ|
Cross-Entropy = -Σ yᵢ × log(ŷᵢ)
Binary CE = -[y×log(ŷ) + (1-y)×log(1-ŷ)]
Regularization:
L1 (Lasso) = L + λ × Σ|wᵢ|
L2 (Ridge) = L + λ × Σwᵢ²
Elastic Net = L + λ₁×Σ|wᵢ| + λ₂×Σwᵢ²
Gradient descent:
w_new = w_old - α × ∂L/∂w
Bias-Variance:
Total Error = Bias² + Variance + Irreducible Error
Why This Matters for the Exam
The MLS-C01 exam tests conceptual understanding, not mathematical derivations. You need to know:
- Which algorithm to choose for a given problem type (supervised/unsupervised)
- How gradient descent hyperparameters (learning rate, batch size) affect training
- How to diagnose overfitting vs underfitting from training/validation curves
- When to use L1 vs L2 vs Elastic Net regularization
- Why time series CV is different from standard K-Fold
- The difference between bagging (Random Forest) and boosting (XGBoost) — parallel vs sequential, variance vs bias