Domain 3B: Supervised Learning Deep Dive

18 min read 3804 words

Table of Contents

Domain 3B: Supervised Learning Deep Dive

Domain 3B: Supervised Learning Deep Dive

Exam Domain: 3 — Modeling (36%) Focus: Core supervised algorithms — how they work, when to use them, how to tune them

Regression Algorithms

Linear Regression

First Principles

The simplest ML model: find the line (or hyperplane in multiple dimensions) that best fits the data by minimizing the sum of squared distances between predictions and true values.

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n$$

Where $w_0$ is the bias (intercept) and $w_1 \ldots w_n$ are the learned weights (coefficients).

ELI5: Draw a scatter plot of house sizes vs prices. Linear regression draws the best straight line through those dots. “Best” means the total squared vertical distance from each dot to the line is minimized. That’s it. For multiple features, replace “line” with “flat plane through higher-dimensional space” — the concept is identical.

Solving Linear Regression: Two Methods

Normal Equation (closed-form solution): $$\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

Computes the exact answer in one step
Requires inverting the matrix $\mathbf{X}^T \mathbf{X}$ (O(n³) operation)
Use when: n_features < ~10,000 (matrix inversion is feasible)
Don’t use when: millions of features (inversion is prohibitively slow)

Gradient Descent:

Iteratively finds the solution by following the gradient
Use when: large feature space, online learning, regularization needed
Scales better to large problems

Assumptions (and What Breaks Them)

Assumption	Meaning	What Breaks It	Fix
Linearity	Relationship is actually linear	Non-linear patterns	Add polynomial features, use a non-linear model
Independence	Observations are independent	Time series, repeated measures	Use time-aware models, mixed effects
Homoscedasticity	Error variance is constant	Residuals fan out as x increases	Log-transform target, weighted regression
Normality of residuals	Errors are normally distributed	Heavy-tailed distributions	Robust regression, transformations
No multicollinearity	Features aren’t highly correlated	Correlated features → unstable weights	PCA, Ridge regression, drop correlated features

Polynomial Regression

When the relationship is non-linear, add polynomial features: $x_1^2, x_1 \cdot x_2, x_1^3$, etc.

Linear fit on curved data:      Polynomial fit:
        ●                               ●
  ●         ●                    ●    ╱   ●
        ╱               vs           ╱
   ●  ╱                        ●  ╱
     ╱  ← poor fit                ← better fit

Warning: high-degree polynomials overfit easily. Use regularization (Ridge/Lasso) or cross-validation to select degree.

Logistic Regression (for Classification!)

Despite the name, logistic regression is a classification algorithm. It predicts the probability of a binary outcome.

The Core Idea

Take a linear combination of features (like linear regression), then squash the output to the range (0, 1) using the sigmoid function:

$$\hat{p} = \sigma(z) = \frac{1}{1 + e^{-z}}, \quad z = w_0 + w_1 x_1 + \ldots + w_n x_n$$

Sigmoid Function:
  Output
  1.0 │                         ──────────
      │                    ────╱
  0.5 │─────────────── ───╱ ─────────────
      │           ────╱
  0.0 │──────────
      └────────────────────────────────── z (linear combo)
           -∞           0           +∞

At z=0: output = 0.5 (decision boundary)
z >> 0: output → 1.0
z << 0: output → 0.0

ELI5: Linear regression predicts a number that can be anything (−∞ to +∞). Logistic regression predicts a probability (0 to 1). The sigmoid function is the translator — it squashes any real number to a valid probability. If the probability is > 0.5, we predict class 1; otherwise, class 0.

Decision Boundary

The decision boundary is where $\hat{p} = 0.5$, i.e., where $z = 0$.

Because z is a linear combination of features, the decision boundary is a straight line (or hyperplane in higher dimensions). This makes logistic regression a linear classifier.

Two features (x₁, x₂):
  x₂
  │         ●  ●  ●  (class 1)
  │      ● ╱ ●
  │       ╱   ●
  │      ╱ ← decision boundary (straight line)
  │   ● ╱
  │  ● ╱   ○  (class 0)
  │   ╱  ○  ○
  └──────────────── x₁

Multi-class Logistic Regression

One-vs-Rest (OvR): train N binary classifiers, each predicting “is this class k vs all others?” Take the class with highest predicted probability.
Softmax (Multinomial): directly outputs a probability for each class, all summing to 1:

$$P(y=k | \mathbf{x}) = \frac{e^{z_k}}{\sum_{j} e^{z_j}}$$

Decision Trees

First Principles: Divide and Conquer

A decision tree asks a sequence of yes/no questions about features to classify a data point.

                    ┌────────────────────┐
                    │  Age < 30?         │
                    └──────┬─────────────┘
                           │
             YES ──────────┴──────────── NO
              │                           │
    ┌─────────────────┐      ┌─────────────────────┐
    │ Income > 50K?   │      │  Credit score > 700? │
    └────────┬────────┘      └──────────┬──────────┘
             │                          │
    YES ─────┴───── NO         YES ─────┴───── NO
     │               │          │               │
  Approve          Deny       Approve          Deny

The tree is built by repeatedly finding the best feature split at each node.

Split Criteria: How to Pick the Best Split?

Gini Impurity: probability of misclassifying a randomly chosen sample if we label it randomly according to class distribution in the node.

$$G = 1 - \sum_{k} p_k^2$$

Where $p_k$ is the proportion of class k in the node.

$G = 0$: pure node (all one class) — perfect split
$G = 0.5$: maximally impure (50/50 split) — worst split

Entropy / Information Gain:

$$H = -\sum_{k} p_k \log_2 p_k$$

Information Gain = entropy before split − weighted average entropy after split.

ELI5 comparison: Both measure “how mixed up are the classes in this bucket?” Gini says “if I randomly pick a sample and randomly assign it a label, how often am I wrong?” Entropy says “how many bits of information do I need to describe the class distribution?” In practice, they give almost identical results. Gini is slightly faster to compute (no logarithm).

Criterion	Formula	Behavior
Gini	$1 - \sum p_k^2$	Slightly favors larger partitions
Entropy	$-\sum p_k \log_2 p_k$	More balanced splits
Variance reduction	For regression trees	Minimize variance within child nodes

Decision Tree Pros and Cons

Pros:

Highly interpretable — can visualize and explain
No feature scaling needed (splits are threshold comparisons, not distance-based)
Handles mixed feature types (numerical + categorical)
Captures non-linear relationships
Fast inference

Cons:

High variance — small changes in training data → very different trees
Overfits easily without constraints
Unstable (a single outlier can change the whole tree structure)
Biased toward features with more possible split points

Controlling Tree Complexity (Pruning)

Unpruned (overfits):          Pruned (generalizes):
      Root                          Root
     /    \                        /    \
   /        \                    /        \
 Split     Split               Split     Split
 /  \      /  \                /  \
S1  S2   S3   S4             S1  S2   (stopped early)
|   |    |    |
...many tiny leaves...

Pre-pruning (stop during training):

max_depth: limit tree depth
min_samples_split: require minimum N samples to split a node
min_samples_leaf: require minimum N samples in each leaf
max_leaf_nodes: cap total number of leaves

Post-pruning (prune after training):

Cost-complexity pruning (alpha parameter): prune subtrees that don’t improve validation performance enough per leaf added

Random Forest

What It Is

Random Forest = Bagging on decision trees + random feature subsets at each split.

For each tree:

Bootstrap sample (sample N points with replacement from training data)
At each node, consider only $\sqrt{p}$ (classification) or $p/3$ (regression) randomly selected features
Grow the tree fully (or to max_depth)

Final prediction: majority vote (classification) or average (regression) across all trees.

Why Randomness Helps

If all trees had access to the same features, they would all make the same split at the top (the most informative feature), producing highly correlated trees. Correlated errors don’t cancel out when averaged.

Random feature subsets force different trees to use different features, decorrelating them. Uncorrelated errors cancel out. Variance drops.

ELI5: If you ask 100 people the same question while giving them access to the same information, they’ll all give similar answers — no diversity means no averaging benefit. But if each person can only see a random subset of the information, they’ll form different opinions based on different perspectives. When you average those diverse opinions, errors cancel out and accuracy improves.

Feature Importance

Two ways to measure which features matter most:

Gini Importance (impurity-based): sum how much each feature reduces impurity across all splits, across all trees. Fast but biased toward high-cardinality features.
Permutation Importance: shuffle a single feature’s values, measure how much model accuracy drops. Drop = importance. Slower but unbiased.

Key Hyperparameters

Parameter	Default	Effect
`n_estimators`	100	More trees = more stable (never hurts, just slower)
`max_depth`	None (full)	Limit depth to reduce variance
`max_features`	`sqrt(p)` for clf	Smaller = more diverse trees, but each tree is weaker
`min_samples_split`	2	Higher = more conservative splits
`min_samples_leaf`	1	Higher = smoother predictions
`bootstrap`	True	Set False for pasting (no replacement)

Random Forest vs Single Decision Tree

	Single Decision Tree	Random Forest
Variance	Very high	Low
Bias	Low (deep trees)	Slightly higher
Interpretability	High (visual)	Low (black box)
Overfitting	Prone	Resistant
Training speed	Fast	Slower (N trees)
Prediction speed	Fast	Moderate (N trees)
Feature importance	Basic	Robust

Gradient Boosted Trees (XGBoost, LightGBM, CatBoost)

The Boosting Concept, Revisited

Boosting builds trees sequentially. Each new tree fits the residuals (errors) of the current ensemble.

Iteration 0: Initial prediction = mean(y) for all samples
             Residuals = y - mean(y)

Iteration 1: Tree 1 fits the residuals
             New predictions = mean(y) + α × Tree1(x)
             Residuals = y - new predictions

Iteration 2: Tree 2 fits the NEW residuals
             New predictions = prev + α × Tree2(x)

...
Final prediction = mean(y) + α × Tree1(x) + α × Tree2(x) + ...

Each tree is a weak learner (shallow, limited splits). The ensemble gradually reduces both bias and variance.

XGBoost Deep Dive

XGBoost adds several innovations over vanilla gradient boosting:

Regularized objective: $$Obj = \sum_{i} L(y_i, \hat{y}i) + \sum{k} \Omega(f_k)$$

Where $\Omega(f_k) = \gamma T + \frac{1}{2}\lambda \sum_{j} w_j^2$ penalizes tree complexity (T = number of leaves, $w_j$ = leaf weights).

Second-order gradient approximation (Newton boosting): Uses both first-order (gradient) and second-order (Hessian) information to find better leaf weights. Like using curvature information to take smarter steps, not just slope.

Sparsity-aware split finding: handles missing values natively by learning the best default direction for missing values during training.

Column (feature) subsampling: like Random Forest, sample features at each tree/level to reduce overfitting and speed up training.

ELI5: XGBoost is the Swiss Army knife of ML. If your data fits in a table with labeled rows, XGBoost will probably get you a top-performing model. It’s won more Kaggle competitions than any other single algorithm. The key insight is that it’s not just “boosting with trees” — it adds explicit regularization and uses smarter math (second-order gradients) to find better solutions faster.

XGBoost Key Hyperparameters

Parameter	Effect	Tuning Direction
`n_estimators`	Number of trees	More trees: better fit, risk of overfit. Use with early stopping.
`max_depth`	Tree depth (3-6 typical)	Lower = less overfit
`learning_rate` (eta)	Shrinks each tree’s contribution	Lower + more trees = better but slower
`subsample`	Row sampling fraction	0.5-0.9: reduces overfit
`colsample_bytree`	Feature sampling fraction	0.5-0.9: reduces overfit, like RF
`lambda` (reg_lambda)	L2 regularization on leaf weights	Higher = less overfit
`alpha` (reg_alpha)	L1 regularization on leaf weights	Higher = sparser model
`min_child_weight`	Min sum of Hessian per leaf	Higher = more conservative
`gamma`	Min loss reduction to split	Higher = fewer splits

Golden tuning rule: set learning_rate low (0.01-0.1), n_estimators high, use early stopping — let training stop when validation score stops improving.

LightGBM vs XGBoost vs CatBoost

XGBoost tree growth:          LightGBM tree growth:
Level-wise (breadth-first)    Leaf-wise (best-first)

     Root                          Root
    /    \                        /    \
  L1      R1                   L1       R1
  / \    / \                   / \
L2  R2  L3  R3               L2   R2   ← grows the leaf with
(all nodes at each level)         ↑         highest gain first
                           continues expanding best leaf

LightGBM’s leaf-wise growth can find better splits but risks overfitting on small datasets (deep unbalanced trees). Set num_leaves to control.

CatBoost’s key feature: ordered boosting — uses a permutation-based approach to avoid target leakage when computing statistics on categorical features. This makes it especially strong on categorical-heavy data without manual preprocessing.

	XGBoost	LightGBM	CatBoost
Tree growth	Level-wise	Leaf-wise	Symmetric (oblivious)
Speed	Moderate	Fastest (large data)	Moderate
Memory	Moderate	Low	Moderate
Categorical features	Manual encoding needed	Native (label encoding)	Native (ordered target encoding)
Best for	Tabular, general	Large datasets, speed	Categorical-heavy data
Overfitting risk	Low-moderate	Higher (small data)	Low
Interpretability	Good	Good	Good
Tuning effort	Moderate	Moderate	Low (good defaults)

Exam tip: SageMaker has a built-in XGBoost container. It’s the most commonly tested tree-based algorithm on the MLS-C01 exam.

Support Vector Machines (SVM)

First Principles: Maximize the Margin

For a binary classification problem, instead of just finding any decision boundary that separates classes, SVM finds the boundary that maximizes the margin — the distance between the boundary and the nearest training points (support vectors).

      ●  ●                    ●  ●
   ●        ○              ●     │  ○
●     ●    ○  ○           ●  ●  │    ○  ○
   ●      ○               ●     │margin ○
      ●  ○                      │
                          support vectors
                                ↑
   Any line separates    SVM finds the widest street

Why maximizing the margin matters: the wider the margin, the more “buffer room” between classes. New test points are less likely to fall on the wrong side. This leads to better generalization (less overfitting).

Support vectors are the training points closest to the boundary — they’re the only ones that define the boundary. Move any other training point and the boundary stays the same.

The C Parameter (Soft Margin)

Real data is rarely perfectly separable. The C parameter controls the trade-off between maximizing margin width and minimizing classification errors:

Small C (soft margin):           Large C (hard margin):
  ●  ●                             ●  ●
●    ●  │  ○  ○              ●   ●│    ○  ○
    ●   │    ○                  ●  │  ○
●       │  ○                      │
        ↑ wider margin,           ↑ narrow margin,
          allows some              no misclassification
          misclassification        (may overfit)

Small C: wider margin, allows more margin violations → lower variance, higher bias
Large C: narrow margin, tries to classify all points correctly → higher variance, lower bias

The Kernel Trick — Handling Non-Linear Data

What if the classes aren’t linearly separable?

ELI5: Imagine cats and dogs scattered in a room, but they’re mixed together in a circle pattern — the cats are in the middle, dogs form a ring around them. You can’t draw a straight line to separate them. But if you could lift the cats up into 3D (their height represents some new feature), suddenly a flat horizontal plane separates them from the dogs below. The kernel trick does exactly this lifting — mathematically, without actually computing the high-dimensional coordinates. It’s like taking a shortcut through the calculation.

2D (non-separable):           After kernel transformation (3D):
  ○ ○ ○                              ●
○  ●●●  ○           →              ●   ●       ← lifted cats
○  ●●●  ○                        ──────────    ← separating plane
  ○ ○ ○                          ○   ○   ○     ← dogs stay flat
(circle pattern)

The kernel function $K(x_i, x_j)$ computes the dot product in the high-dimensional space without actually going there — the “kernel trick.”

Kernel	Formula	Use When
Linear	$x_i \cdot x_j$	Data is linearly separable, high-dimensional (text)
Polynomial	$(x_i \cdot x_j + r)^d$	Moderate non-linearity, known polynomial relationship
RBF / Gaussian	$e^{-\gamma \|x_i - x_j\|^2}$	Default — most non-linear problems
Sigmoid	$\tanh(\kappa x_i \cdot x_j + c)$	Rare, similar to 2-layer neural network

RBF kernel parameter γ (gamma):

High γ: each training point has small influence radius → complex decision boundary → overfitting
Low γ: each training point has large influence radius → smoother boundary → underfitting

SVM Practical Guide

Use SVM when:

Small-to-medium dataset (< ~100K samples — training is O(n²) to O(n³))
High-dimensional data where n_features > n_samples (text classification, genomics)
Clear margin of separation exists
You need a well-principled probabilistic model is NOT a priority (SVM doesn’t natively output probabilities — needs Platt scaling)

Don’t use SVM when:

Large datasets (slow training and prediction)
You need calibrated probabilities directly
Lots of missing data (SVM requires complete feature vectors)
Lots of noise in the data (C needs to be very small → similar to linear models)

K-Nearest Neighbors (KNN)

First Principles: Learning by Analogy

KNN makes no assumptions about the data distribution. It stores all training data and at inference time, looks at the K most similar training examples.

ELI5: To decide if a new movie is action or romance, look at the K most similar movies in your database (using some similarity measure like director, actors, runtime, rating). If 7 out of 10 most similar movies are action, predict action. There’s no “model” — just a lookup table of all your past experience.

K=1 (overfit):              K=5 (smoother):
  ●●●  ○○○                  ●●●  ○○○
  ●○●  ○●○                  ●●●  ○○○
  ●●●  ○○○                  ●●●  ○○○

(every training point        (smooth, rounded
 is its own class island)     decision boundary)

The Algorithm

Store all training examples
For a new point x: a. Compute distance to every training point b. Find K nearest neighbors c. Return majority vote (classification) or mean (regression)

No training phase! KNN is a lazy learner — all computation happens at inference time. This means:

Training: O(1) — just store the data
Inference: O(n × d) — compute n distances in d dimensions

Distance Metrics

Metric	Formula	Use When
Euclidean	$\sqrt{\sum (x_i - y_i)^2}$	Continuous features, same scale
Manhattan	$\sum \|x_i - y_i\|$	Grid-like data, more robust to outliers
Minkowski	$(\sum \|x_i - y_i\|^p)^{1/p}$	Generalizes both (p=2: Euclidean, p=1: Manhattan)
Hamming	Count of different bits	Categorical or binary features
Cosine	$1 - \frac{x \cdot y}{\|x\|\|y\|}$	Text/document similarity (direction matters, not magnitude)

CRITICAL: KNN requires feature scaling (StandardScaler or MinMaxScaler). If one feature is in thousands (salary) and another in single digits (years), distance will be dominated by salary.

Choosing K

K=1:   Very low bias, very high variance
       (jagged decision boundary, sensitive to noise)

K=N:   Very high bias, very low variance
       (predicts the majority class for everything)

Sweet spot: usually K = √N (N = number of training samples)
            Use cross-validation to find optimal K

The Curse of Dimensionality

In high dimensions, KNN breaks down because:

All distances become approximately equal (data becomes equidistant)
Volume of space grows exponentially — data becomes extremely sparse
“Nearest neighbor” in 1000 dimensions may be no more similar than a random point

1D: Cover 10% of range → need 10% of data
2D: Cover 10% in each dim → need 10% × 10% = 1% of data... wait
    Actually to cover 10% of volume in 2D, need 10%^(1/2) = 31.6% per side
nD: To cover 10% of n-dimensional volume → need 10%^(1/n) of range per dimension
    n=10: need 79% of range per dimension to cover 10% of volume!
    → Must look nearly everywhere to find neighbors

Use KNN when:

Small dataset (< 10K samples — inference is fast enough)
Low-to-moderate dimensions (< 50 features, ideally < 20)
Interpretability needed (“these 5 similar examples predicted X”)
Non-linear patterns exist

Don’t use KNN when:

Large datasets (O(n) inference per prediction)
High-dimensional data (curse of dimensionality)
Features need to be on same scale but can’t be (mixed types)

Classification-Specific Concepts

Binary vs Multi-class vs Multi-label

Binary:      One class from {0, 1}          → "Is this spam? Yes/No"
Multi-class: One class from {A, B, C, D}    → "Which digit? 0-9"
Multi-label: Any subset of {A, B, C, D}     → "Which topics? [tech, finance, AI]"

Multi-label requires different evaluation (Hamming loss, micro/macro F1) and different model outputs (sigmoid per class, not softmax).

Threshold Tuning: Why 0.5 Isn’t Always Optimal

The default decision threshold of 0.5 only makes sense when:

False positives and false negatives have equal cost
The class prior probabilities are balanced

Real-world examples where 0.5 is wrong:

Fraud detection: missing a fraud (FN) costs far more than a false alarm (FP) → lower threshold (0.2-0.3)
Medical screening: missing a disease (FN) costs more than unnecessary follow-up (FP) → lower threshold
Spam filter: incorrectly blocking real email (FP) is very annoying → higher threshold (0.7-0.8)

Use the ROC curve (threshold vs TPR/FPR) or precision-recall curve to choose the optimal threshold for your cost function.

Class Imbalance: When the Problem is Lopsided

Balanced (50/50):              Imbalanced (99/1):
  ●●●●●●●●●●                    ●●●●●●●●●●●●●●●●●●●●●●●●●●
  ○○○○○○○○○○                    ○   (minority class = 1%)
  Accuracy is meaningful        Accuracy = 99% just by predicting all ●

Techniques for imbalanced data:

Technique	How	When
Oversampling (SMOTE)	Generate synthetic minority samples	Training data too small
Undersampling	Remove majority samples randomly	Majority class is redundant
Class weights	Penalize misclassifying minority harder	Model-level (easy to implement)
Threshold tuning	Lower threshold for minority class	Post-training adjustment
Ensemble (BalancedBagging)	Balance each bootstrap sample	Combines sampling + ensembling

Evaluation on imbalanced data: use F1-score, precision-recall curve, AUC-PR (not accuracy or AUC-ROC which can be misleading).

Algorithm Selection Guide

Decision Flowchart

What is your problem?
│
├── REGRESSION (continuous target)?
│   ├── Linear relationship, interpretability needed?
│   │   └── Linear Regression (+ Ridge/Lasso for regularization)
│   ├── Non-linear, medium data, best accuracy?
│   │   └── XGBoost / LightGBM
│   └── Many correlated features?
│       └── Ridge or Elastic Net
│
└── CLASSIFICATION (discrete target)?
    ├── Need interpretability (rules)?
    │   └── Decision Tree (shallow) or Logistic Regression
    ├── Robust baseline, feature importance?
    │   └── Random Forest
    ├── Best accuracy on tabular data?
    │   └── XGBoost / LightGBM / CatBoost
    ├── Small dataset, high-dimensional (text)?
    │   └── SVM with RBF/Linear kernel
    ├── Small dataset, need instance-based reasoning?
    │   └── KNN
    └── Complex patterns (images, text, sequences)?
        └── Neural Networks (see Domain 3D)

Master Comparison Table

Algorithm	Speed (Train)	Speed (Predict)	Interpretability	Scaling Needed	Large Data	Non-linear	Missing Values
Linear/Logistic Regression	Fast	Fast	High	Yes	Yes	No (poly features)	No
Decision Tree	Fast	Fast	High	No	Yes	Yes	No
Random Forest	Moderate	Moderate	Low	No	Yes	Yes	No (imputation)
XGBoost	Moderate	Fast	Low-Med	No	Yes	Yes	Yes (native)
LightGBM	Fast	Fast	Low-Med	No	Yes	Yes	Yes (native)
SVM (RBF)	Slow (O(n²))	Slow (O(n))	Low	Yes	No	Yes	No
KNN	Instant	Slow (O(n))	High	Yes	No	Yes	No

When to Use Each (Exam Quick Reference)

Scenario	Recommended Algorithm
Tabular data, need best accuracy, no interpretability needed	XGBoost / LightGBM
Need to explain each prediction (regulatory)	Logistic Regression, Decision Tree
Text classification, high-dimensional sparse features	SVM (Linear kernel) or Logistic Regression
Many correlated features, prevent one from dominating	Ridge Regression
Want automatic feature selection	Lasso (L1)
Categorical-heavy tabular data	CatBoost
Large dataset, need fast training	LightGBM
Small dataset, similarity-based reasoning	KNN
Robust to outliers in features	Tree-based (no scaling needed)
Robust to outliers in target	MAE loss, Huber loss, or quantile regression
Need probability calibration	Logistic Regression (inherently calibrated)

Key Formulas Cheat Sheet

Linear Regression:
  ŷ = w₀ + w₁x₁ + ... + wₙxₙ
  Normal Equation: w = (XᵀX)⁻¹ Xᵀy

Logistic Regression:
  σ(z) = 1 / (1 + e^(-z))
  Predict class 1 if σ(z) > threshold (default 0.5)

Decision Tree splits:
  Gini = 1 - Σ pₖ²
  Entropy = -Σ pₖ log₂(pₖ)

KNN:
  Euclidean distance = √Σ(xᵢ - yᵢ)²
  Predict: majority class of K nearest neighbors

SVM:
  Maximize margin = 2 / ‖w‖
  Soft margin: minimize ½‖w‖² + C × Σ ξᵢ

Why this matters for the exam: The MLS-C01 exam does not require you to derive these algorithms mathematically. It DOES require you to know:
Which algorithm handles which problem type (regression vs classification vs clustering)
Key hyperparameters and their effects (e.g., C in SVM → bias-variance tradeoff)
When an algorithm will fail (KNN in high dimensions, SVM on large datasets)
How tree-based algorithms handle feature types and missing values
The difference between bagging (Random Forest) and boosting (XGBoost)
Which SageMaker built-in algorithms correspond to which concepts (XGBoost, Linear Learner, KNN)

Domain 3B: Supervised Learning Deep Dive#

Regression Algorithms#

Linear Regression#

First Principles#

Solving Linear Regression: Two Methods#

Assumptions (and What Breaks Them)#

Polynomial Regression#

Logistic Regression (for Classification!)#

The Core Idea#

Decision Boundary#

Multi-class Logistic Regression#

Decision Trees#

First Principles: Divide and Conquer#

Split Criteria: How to Pick the Best Split?#

Decision Tree Pros and Cons#

Controlling Tree Complexity (Pruning)#

Random Forest#

What It Is#

Why Randomness Helps#

Feature Importance#

Key Hyperparameters#

Random Forest vs Single Decision Tree#

Gradient Boosted Trees (XGBoost, LightGBM, CatBoost)#

The Boosting Concept, Revisited#

XGBoost Deep Dive#

XGBoost Key Hyperparameters#

LightGBM vs XGBoost vs CatBoost#

Support Vector Machines (SVM)#

First Principles: Maximize the Margin#

The C Parameter (Soft Margin)#

The Kernel Trick — Handling Non-Linear Data#

SVM Practical Guide#

K-Nearest Neighbors (KNN)#

First Principles: Learning by Analogy#

The Algorithm#

Distance Metrics#

Choosing K#

The Curse of Dimensionality#

Classification-Specific Concepts#

Binary vs Multi-class vs Multi-label#

Threshold Tuning: Why 0.5 Isn’t Always Optimal#

Class Imbalance: When the Problem is Lopsided#

Algorithm Selection Guide#

Decision Flowchart#

Master Comparison Table#

When to Use Each (Exam Quick Reference)#

Key Formulas Cheat Sheet#

Domain 3B: Supervised Learning Deep Dive

Regression Algorithms

Linear Regression

First Principles

Solving Linear Regression: Two Methods

Assumptions (and What Breaks Them)

Polynomial Regression

Logistic Regression (for Classification!)

The Core Idea

Decision Boundary

Multi-class Logistic Regression

Decision Trees

First Principles: Divide and Conquer

Split Criteria: How to Pick the Best Split?

Decision Tree Pros and Cons

Controlling Tree Complexity (Pruning)

Random Forest

What It Is

Why Randomness Helps

Feature Importance

Key Hyperparameters

Random Forest vs Single Decision Tree

Gradient Boosted Trees (XGBoost, LightGBM, CatBoost)

The Boosting Concept, Revisited

XGBoost Deep Dive

XGBoost Key Hyperparameters

LightGBM vs XGBoost vs CatBoost

Support Vector Machines (SVM)

First Principles: Maximize the Margin

The C Parameter (Soft Margin)

The Kernel Trick — Handling Non-Linear Data

SVM Practical Guide

K-Nearest Neighbors (KNN)

First Principles: Learning by Analogy

The Algorithm

Distance Metrics

Choosing K

The Curse of Dimensionality

Classification-Specific Concepts

Binary vs Multi-class vs Multi-label

Threshold Tuning: Why 0.5 Isn’t Always Optimal

Class Imbalance: When the Problem is Lopsided

Algorithm Selection Guide

Decision Flowchart

Master Comparison Table

When to Use Each (Exam Quick Reference)

Key Formulas Cheat Sheet