Domain 3B: Supervised Learning Deep Dive
Table of Contents
- Domain 3B: Supervised Learning Deep Dive
Domain 3B: Supervised Learning Deep Dive
Exam Domain: 3 — Modeling (36%) Focus: Core supervised algorithms — how they work, when to use them, how to tune them
Regression Algorithms
Linear Regression
First Principles
The simplest ML model: find the line (or hyperplane in multiple dimensions) that best fits the data by minimizing the sum of squared distances between predictions and true values.
$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n$$
Where $w_0$ is the bias (intercept) and $w_1 \ldots w_n$ are the learned weights (coefficients).
ELI5: Draw a scatter plot of house sizes vs prices. Linear regression draws the best straight line through those dots. “Best” means the total squared vertical distance from each dot to the line is minimized. That’s it. For multiple features, replace “line” with “flat plane through higher-dimensional space” — the concept is identical.
Solving Linear Regression: Two Methods
Normal Equation (closed-form solution): $$\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
- Computes the exact answer in one step
- Requires inverting the matrix $\mathbf{X}^T \mathbf{X}$ (O(n³) operation)
- Use when: n_features < ~10,000 (matrix inversion is feasible)
- Don’t use when: millions of features (inversion is prohibitively slow)
Gradient Descent:
- Iteratively finds the solution by following the gradient
- Use when: large feature space, online learning, regularization needed
- Scales better to large problems
Assumptions (and What Breaks Them)
| Assumption | Meaning | What Breaks It | Fix |
|---|---|---|---|
| Linearity | Relationship is actually linear | Non-linear patterns | Add polynomial features, use a non-linear model |
| Independence | Observations are independent | Time series, repeated measures | Use time-aware models, mixed effects |
| Homoscedasticity | Error variance is constant | Residuals fan out as x increases | Log-transform target, weighted regression |
| Normality of residuals | Errors are normally distributed | Heavy-tailed distributions | Robust regression, transformations |
| No multicollinearity | Features aren’t highly correlated | Correlated features → unstable weights | PCA, Ridge regression, drop correlated features |
Polynomial Regression
When the relationship is non-linear, add polynomial features: $x_1^2, x_1 \cdot x_2, x_1^3$, etc.
Linear fit on curved data: Polynomial fit:
● ●
● ● ● ╱ ●
╱ vs ╱
● ╱ ● ╱
╱ ← poor fit ← better fit
Warning: high-degree polynomials overfit easily. Use regularization (Ridge/Lasso) or cross-validation to select degree.
Logistic Regression (for Classification!)
Despite the name, logistic regression is a classification algorithm. It predicts the probability of a binary outcome.
The Core Idea
Take a linear combination of features (like linear regression), then squash the output to the range (0, 1) using the sigmoid function:
$$\hat{p} = \sigma(z) = \frac{1}{1 + e^{-z}}, \quad z = w_0 + w_1 x_1 + \ldots + w_n x_n$$
Sigmoid Function:
Output
1.0 │ ──────────
│ ────╱
0.5 │─────────────── ───╱ ─────────────
│ ────╱
0.0 │──────────
└────────────────────────────────── z (linear combo)
-∞ 0 +∞
At z=0: output = 0.5 (decision boundary)
z >> 0: output → 1.0
z << 0: output → 0.0
ELI5: Linear regression predicts a number that can be anything (−∞ to +∞). Logistic regression predicts a probability (0 to 1). The sigmoid function is the translator — it squashes any real number to a valid probability. If the probability is > 0.5, we predict class 1; otherwise, class 0.
Decision Boundary
The decision boundary is where $\hat{p} = 0.5$, i.e., where $z = 0$.
Because z is a linear combination of features, the decision boundary is a straight line (or hyperplane in higher dimensions). This makes logistic regression a linear classifier.
Two features (x₁, x₂):
x₂
│ ● ● ● (class 1)
│ ● ╱ ●
│ ╱ ●
│ ╱ ← decision boundary (straight line)
│ ● ╱
│ ● ╱ ○ (class 0)
│ ╱ ○ ○
└──────────────── x₁
Multi-class Logistic Regression
- One-vs-Rest (OvR): train N binary classifiers, each predicting “is this class k vs all others?” Take the class with highest predicted probability.
- Softmax (Multinomial): directly outputs a probability for each class, all summing to 1:
$$P(y=k | \mathbf{x}) = \frac{e^{z_k}}{\sum_{j} e^{z_j}}$$
Decision Trees
First Principles: Divide and Conquer
A decision tree asks a sequence of yes/no questions about features to classify a data point.
┌────────────────────┐
│ Age < 30? │
└──────┬─────────────┘
│
YES ──────────┴──────────── NO
│ │
┌─────────────────┐ ┌─────────────────────┐
│ Income > 50K? │ │ Credit score > 700? │
└────────┬────────┘ └──────────┬──────────┘
│ │
YES ─────┴───── NO YES ─────┴───── NO
│ │ │ │
Approve Deny Approve Deny
The tree is built by repeatedly finding the best feature split at each node.
Split Criteria: How to Pick the Best Split?
Gini Impurity: probability of misclassifying a randomly chosen sample if we label it randomly according to class distribution in the node.
$$G = 1 - \sum_{k} p_k^2$$
Where $p_k$ is the proportion of class k in the node.
- $G = 0$: pure node (all one class) — perfect split
- $G = 0.5$: maximally impure (50/50 split) — worst split
Entropy / Information Gain:
$$H = -\sum_{k} p_k \log_2 p_k$$
Information Gain = entropy before split − weighted average entropy after split.
ELI5 comparison: Both measure “how mixed up are the classes in this bucket?” Gini says “if I randomly pick a sample and randomly assign it a label, how often am I wrong?” Entropy says “how many bits of information do I need to describe the class distribution?” In practice, they give almost identical results. Gini is slightly faster to compute (no logarithm).
| Criterion | Formula | Behavior |
|---|---|---|
| Gini | $1 - \sum p_k^2$ | Slightly favors larger partitions |
| Entropy | $-\sum p_k \log_2 p_k$ | More balanced splits |
| Variance reduction | For regression trees | Minimize variance within child nodes |
Decision Tree Pros and Cons
Pros:
- Highly interpretable — can visualize and explain
- No feature scaling needed (splits are threshold comparisons, not distance-based)
- Handles mixed feature types (numerical + categorical)
- Captures non-linear relationships
- Fast inference
Cons:
- High variance — small changes in training data → very different trees
- Overfits easily without constraints
- Unstable (a single outlier can change the whole tree structure)
- Biased toward features with more possible split points
Controlling Tree Complexity (Pruning)
Unpruned (overfits): Pruned (generalizes):
Root Root
/ \ / \
/ \ / \
Split Split Split Split
/ \ / \ / \
S1 S2 S3 S4 S1 S2 (stopped early)
| | | |
...many tiny leaves...
Pre-pruning (stop during training):
max_depth: limit tree depthmin_samples_split: require minimum N samples to split a nodemin_samples_leaf: require minimum N samples in each leafmax_leaf_nodes: cap total number of leaves
Post-pruning (prune after training):
- Cost-complexity pruning (alpha parameter): prune subtrees that don’t improve validation performance enough per leaf added
Random Forest
What It Is
Random Forest = Bagging on decision trees + random feature subsets at each split.
For each tree:
- Bootstrap sample (sample N points with replacement from training data)
- At each node, consider only $\sqrt{p}$ (classification) or $p/3$ (regression) randomly selected features
- Grow the tree fully (or to max_depth)
Final prediction: majority vote (classification) or average (regression) across all trees.
Why Randomness Helps
If all trees had access to the same features, they would all make the same split at the top (the most informative feature), producing highly correlated trees. Correlated errors don’t cancel out when averaged.
Random feature subsets force different trees to use different features, decorrelating them. Uncorrelated errors cancel out. Variance drops.
ELI5: If you ask 100 people the same question while giving them access to the same information, they’ll all give similar answers — no diversity means no averaging benefit. But if each person can only see a random subset of the information, they’ll form different opinions based on different perspectives. When you average those diverse opinions, errors cancel out and accuracy improves.
Feature Importance
Two ways to measure which features matter most:
- Gini Importance (impurity-based): sum how much each feature reduces impurity across all splits, across all trees. Fast but biased toward high-cardinality features.
- Permutation Importance: shuffle a single feature’s values, measure how much model accuracy drops. Drop = importance. Slower but unbiased.
Key Hyperparameters
| Parameter | Default | Effect |
|---|---|---|
n_estimators | 100 | More trees = more stable (never hurts, just slower) |
max_depth | None (full) | Limit depth to reduce variance |
max_features | sqrt(p) for clf | Smaller = more diverse trees, but each tree is weaker |
min_samples_split | 2 | Higher = more conservative splits |
min_samples_leaf | 1 | Higher = smoother predictions |
bootstrap | True | Set False for pasting (no replacement) |
Random Forest vs Single Decision Tree
| Single Decision Tree | Random Forest | |
|---|---|---|
| Variance | Very high | Low |
| Bias | Low (deep trees) | Slightly higher |
| Interpretability | High (visual) | Low (black box) |
| Overfitting | Prone | Resistant |
| Training speed | Fast | Slower (N trees) |
| Prediction speed | Fast | Moderate (N trees) |
| Feature importance | Basic | Robust |
Gradient Boosted Trees (XGBoost, LightGBM, CatBoost)
The Boosting Concept, Revisited
Boosting builds trees sequentially. Each new tree fits the residuals (errors) of the current ensemble.
Iteration 0: Initial prediction = mean(y) for all samples
Residuals = y - mean(y)
Iteration 1: Tree 1 fits the residuals
New predictions = mean(y) + α × Tree1(x)
Residuals = y - new predictions
Iteration 2: Tree 2 fits the NEW residuals
New predictions = prev + α × Tree2(x)
...
Final prediction = mean(y) + α × Tree1(x) + α × Tree2(x) + ...
Each tree is a weak learner (shallow, limited splits). The ensemble gradually reduces both bias and variance.
XGBoost Deep Dive
XGBoost adds several innovations over vanilla gradient boosting:
Regularized objective: $$Obj = \sum_{i} L(y_i, \hat{y}i) + \sum{k} \Omega(f_k)$$
Where $\Omega(f_k) = \gamma T + \frac{1}{2}\lambda \sum_{j} w_j^2$ penalizes tree complexity (T = number of leaves, $w_j$ = leaf weights).
Second-order gradient approximation (Newton boosting): Uses both first-order (gradient) and second-order (Hessian) information to find better leaf weights. Like using curvature information to take smarter steps, not just slope.
Sparsity-aware split finding: handles missing values natively by learning the best default direction for missing values during training.
Column (feature) subsampling: like Random Forest, sample features at each tree/level to reduce overfitting and speed up training.
ELI5: XGBoost is the Swiss Army knife of ML. If your data fits in a table with labeled rows, XGBoost will probably get you a top-performing model. It’s won more Kaggle competitions than any other single algorithm. The key insight is that it’s not just “boosting with trees” — it adds explicit regularization and uses smarter math (second-order gradients) to find better solutions faster.
XGBoost Key Hyperparameters
| Parameter | Effect | Tuning Direction |
|---|---|---|
n_estimators | Number of trees | More trees: better fit, risk of overfit. Use with early stopping. |
max_depth | Tree depth (3-6 typical) | Lower = less overfit |
learning_rate (eta) | Shrinks each tree’s contribution | Lower + more trees = better but slower |
subsample | Row sampling fraction | 0.5-0.9: reduces overfit |
colsample_bytree | Feature sampling fraction | 0.5-0.9: reduces overfit, like RF |
lambda (reg_lambda) | L2 regularization on leaf weights | Higher = less overfit |
alpha (reg_alpha) | L1 regularization on leaf weights | Higher = sparser model |
min_child_weight | Min sum of Hessian per leaf | Higher = more conservative |
gamma | Min loss reduction to split | Higher = fewer splits |
Golden tuning rule: set learning_rate low (0.01-0.1), n_estimators high, use early stopping — let training stop when validation score stops improving.
LightGBM vs XGBoost vs CatBoost
XGBoost tree growth: LightGBM tree growth:
Level-wise (breadth-first) Leaf-wise (best-first)
Root Root
/ \ / \
L1 R1 L1 R1
/ \ / \ / \
L2 R2 L3 R3 L2 R2 ← grows the leaf with
(all nodes at each level) ↑ highest gain first
continues expanding best leaf
LightGBM’s leaf-wise growth can find better splits but risks overfitting on small datasets (deep unbalanced trees). Set num_leaves to control.
CatBoost’s key feature: ordered boosting — uses a permutation-based approach to avoid target leakage when computing statistics on categorical features. This makes it especially strong on categorical-heavy data without manual preprocessing.
| XGBoost | LightGBM | CatBoost | |
|---|---|---|---|
| Tree growth | Level-wise | Leaf-wise | Symmetric (oblivious) |
| Speed | Moderate | Fastest (large data) | Moderate |
| Memory | Moderate | Low | Moderate |
| Categorical features | Manual encoding needed | Native (label encoding) | Native (ordered target encoding) |
| Best for | Tabular, general | Large datasets, speed | Categorical-heavy data |
| Overfitting risk | Low-moderate | Higher (small data) | Low |
| Interpretability | Good | Good | Good |
| Tuning effort | Moderate | Moderate | Low (good defaults) |
Exam tip: SageMaker has a built-in XGBoost container. It’s the most commonly tested tree-based algorithm on the MLS-C01 exam.
Support Vector Machines (SVM)
First Principles: Maximize the Margin
For a binary classification problem, instead of just finding any decision boundary that separates classes, SVM finds the boundary that maximizes the margin — the distance between the boundary and the nearest training points (support vectors).
● ● ● ●
● ○ ● │ ○
● ● ○ ○ ● ● │ ○ ○
● ○ ● │margin ○
● ○ │
support vectors
↑
Any line separates SVM finds the widest street
Why maximizing the margin matters: the wider the margin, the more “buffer room” between classes. New test points are less likely to fall on the wrong side. This leads to better generalization (less overfitting).
Support vectors are the training points closest to the boundary — they’re the only ones that define the boundary. Move any other training point and the boundary stays the same.
The C Parameter (Soft Margin)
Real data is rarely perfectly separable. The C parameter controls the trade-off between maximizing margin width and minimizing classification errors:
Small C (soft margin): Large C (hard margin):
● ● ● ●
● ● │ ○ ○ ● ●│ ○ ○
● │ ○ ● │ ○
● │ ○ │
↑ wider margin, ↑ narrow margin,
allows some no misclassification
misclassification (may overfit)
- Small C: wider margin, allows more margin violations → lower variance, higher bias
- Large C: narrow margin, tries to classify all points correctly → higher variance, lower bias
The Kernel Trick — Handling Non-Linear Data
What if the classes aren’t linearly separable?
ELI5: Imagine cats and dogs scattered in a room, but they’re mixed together in a circle pattern — the cats are in the middle, dogs form a ring around them. You can’t draw a straight line to separate them. But if you could lift the cats up into 3D (their height represents some new feature), suddenly a flat horizontal plane separates them from the dogs below. The kernel trick does exactly this lifting — mathematically, without actually computing the high-dimensional coordinates. It’s like taking a shortcut through the calculation.
2D (non-separable): After kernel transformation (3D):
○ ○ ○ ●
○ ●●● ○ → ● ● ← lifted cats
○ ●●● ○ ────────── ← separating plane
○ ○ ○ ○ ○ ○ ← dogs stay flat
(circle pattern)
The kernel function $K(x_i, x_j)$ computes the dot product in the high-dimensional space without actually going there — the “kernel trick.”
| Kernel | Formula | Use When |
|---|---|---|
| Linear | $x_i \cdot x_j$ | Data is linearly separable, high-dimensional (text) |
| Polynomial | $(x_i \cdot x_j + r)^d$ | Moderate non-linearity, known polynomial relationship |
| RBF / Gaussian | $e^{-\gamma |x_i - x_j|^2}$ | Default — most non-linear problems |
| Sigmoid | $\tanh(\kappa x_i \cdot x_j + c)$ | Rare, similar to 2-layer neural network |
RBF kernel parameter γ (gamma):
- High γ: each training point has small influence radius → complex decision boundary → overfitting
- Low γ: each training point has large influence radius → smoother boundary → underfitting
SVM Practical Guide
Use SVM when:
- Small-to-medium dataset (< ~100K samples — training is O(n²) to O(n³))
- High-dimensional data where n_features > n_samples (text classification, genomics)
- Clear margin of separation exists
- You need a well-principled probabilistic model is NOT a priority (SVM doesn’t natively output probabilities — needs Platt scaling)
Don’t use SVM when:
- Large datasets (slow training and prediction)
- You need calibrated probabilities directly
- Lots of missing data (SVM requires complete feature vectors)
- Lots of noise in the data (C needs to be very small → similar to linear models)
K-Nearest Neighbors (KNN)
First Principles: Learning by Analogy
KNN makes no assumptions about the data distribution. It stores all training data and at inference time, looks at the K most similar training examples.
ELI5: To decide if a new movie is action or romance, look at the K most similar movies in your database (using some similarity measure like director, actors, runtime, rating). If 7 out of 10 most similar movies are action, predict action. There’s no “model” — just a lookup table of all your past experience.
K=1 (overfit): K=5 (smoother):
●●● ○○○ ●●● ○○○
●○● ○●○ ●●● ○○○
●●● ○○○ ●●● ○○○
(every training point (smooth, rounded
is its own class island) decision boundary)
The Algorithm
- Store all training examples
- For a new point x: a. Compute distance to every training point b. Find K nearest neighbors c. Return majority vote (classification) or mean (regression)
No training phase! KNN is a lazy learner — all computation happens at inference time. This means:
- Training: O(1) — just store the data
- Inference: O(n × d) — compute n distances in d dimensions
Distance Metrics
| Metric | Formula | Use When |
|---|---|---|
| Euclidean | $\sqrt{\sum (x_i - y_i)^2}$ | Continuous features, same scale |
| Manhattan | $\sum |x_i - y_i|$ | Grid-like data, more robust to outliers |
| Minkowski | $(\sum |x_i - y_i|^p)^{1/p}$ | Generalizes both (p=2: Euclidean, p=1: Manhattan) |
| Hamming | Count of different bits | Categorical or binary features |
| Cosine | $1 - \frac{x \cdot y}{|x||y|}$ | Text/document similarity (direction matters, not magnitude) |
CRITICAL: KNN requires feature scaling (StandardScaler or MinMaxScaler). If one feature is in thousands (salary) and another in single digits (years), distance will be dominated by salary.
Choosing K
K=1: Very low bias, very high variance
(jagged decision boundary, sensitive to noise)
K=N: Very high bias, very low variance
(predicts the majority class for everything)
Sweet spot: usually K = √N (N = number of training samples)
Use cross-validation to find optimal K
The Curse of Dimensionality
In high dimensions, KNN breaks down because:
- All distances become approximately equal (data becomes equidistant)
- Volume of space grows exponentially — data becomes extremely sparse
- “Nearest neighbor” in 1000 dimensions may be no more similar than a random point
1D: Cover 10% of range → need 10% of data
2D: Cover 10% in each dim → need 10% × 10% = 1% of data... wait
Actually to cover 10% of volume in 2D, need 10%^(1/2) = 31.6% per side
nD: To cover 10% of n-dimensional volume → need 10%^(1/n) of range per dimension
n=10: need 79% of range per dimension to cover 10% of volume!
→ Must look nearly everywhere to find neighbors
Use KNN when:
- Small dataset (< 10K samples — inference is fast enough)
- Low-to-moderate dimensions (< 50 features, ideally < 20)
- Interpretability needed (“these 5 similar examples predicted X”)
- Non-linear patterns exist
Don’t use KNN when:
- Large datasets (O(n) inference per prediction)
- High-dimensional data (curse of dimensionality)
- Features need to be on same scale but can’t be (mixed types)
Classification-Specific Concepts
Binary vs Multi-class vs Multi-label
Binary: One class from {0, 1} → "Is this spam? Yes/No"
Multi-class: One class from {A, B, C, D} → "Which digit? 0-9"
Multi-label: Any subset of {A, B, C, D} → "Which topics? [tech, finance, AI]"
Multi-label requires different evaluation (Hamming loss, micro/macro F1) and different model outputs (sigmoid per class, not softmax).
Threshold Tuning: Why 0.5 Isn’t Always Optimal
The default decision threshold of 0.5 only makes sense when:
- False positives and false negatives have equal cost
- The class prior probabilities are balanced
Real-world examples where 0.5 is wrong:
- Fraud detection: missing a fraud (FN) costs far more than a false alarm (FP) → lower threshold (0.2-0.3)
- Medical screening: missing a disease (FN) costs more than unnecessary follow-up (FP) → lower threshold
- Spam filter: incorrectly blocking real email (FP) is very annoying → higher threshold (0.7-0.8)
Use the ROC curve (threshold vs TPR/FPR) or precision-recall curve to choose the optimal threshold for your cost function.
Class Imbalance: When the Problem is Lopsided
Balanced (50/50): Imbalanced (99/1):
●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●
○○○○○○○○○○ ○ (minority class = 1%)
Accuracy is meaningful Accuracy = 99% just by predicting all ●
Techniques for imbalanced data:
| Technique | How | When |
|---|---|---|
| Oversampling (SMOTE) | Generate synthetic minority samples | Training data too small |
| Undersampling | Remove majority samples randomly | Majority class is redundant |
| Class weights | Penalize misclassifying minority harder | Model-level (easy to implement) |
| Threshold tuning | Lower threshold for minority class | Post-training adjustment |
| Ensemble (BalancedBagging) | Balance each bootstrap sample | Combines sampling + ensembling |
Evaluation on imbalanced data: use F1-score, precision-recall curve, AUC-PR (not accuracy or AUC-ROC which can be misleading).
Algorithm Selection Guide
Decision Flowchart
What is your problem?
│
├── REGRESSION (continuous target)?
│ ├── Linear relationship, interpretability needed?
│ │ └── Linear Regression (+ Ridge/Lasso for regularization)
│ ├── Non-linear, medium data, best accuracy?
│ │ └── XGBoost / LightGBM
│ └── Many correlated features?
│ └── Ridge or Elastic Net
│
└── CLASSIFICATION (discrete target)?
├── Need interpretability (rules)?
│ └── Decision Tree (shallow) or Logistic Regression
├── Robust baseline, feature importance?
│ └── Random Forest
├── Best accuracy on tabular data?
│ └── XGBoost / LightGBM / CatBoost
├── Small dataset, high-dimensional (text)?
│ └── SVM with RBF/Linear kernel
├── Small dataset, need instance-based reasoning?
│ └── KNN
└── Complex patterns (images, text, sequences)?
└── Neural Networks (see Domain 3D)
Master Comparison Table
| Algorithm | Speed (Train) | Speed (Predict) | Interpretability | Scaling Needed | Large Data | Non-linear | Missing Values |
|---|---|---|---|---|---|---|---|
| Linear/Logistic Regression | Fast | Fast | High | Yes | Yes | No (poly features) | No |
| Decision Tree | Fast | Fast | High | No | Yes | Yes | No |
| Random Forest | Moderate | Moderate | Low | No | Yes | Yes | No (imputation) |
| XGBoost | Moderate | Fast | Low-Med | No | Yes | Yes | Yes (native) |
| LightGBM | Fast | Fast | Low-Med | No | Yes | Yes | Yes (native) |
| SVM (RBF) | Slow (O(n²)) | Slow (O(n)) | Low | Yes | No | Yes | No |
| KNN | Instant | Slow (O(n)) | High | Yes | No | Yes | No |
When to Use Each (Exam Quick Reference)
| Scenario | Recommended Algorithm |
|---|---|
| Tabular data, need best accuracy, no interpretability needed | XGBoost / LightGBM |
| Need to explain each prediction (regulatory) | Logistic Regression, Decision Tree |
| Text classification, high-dimensional sparse features | SVM (Linear kernel) or Logistic Regression |
| Many correlated features, prevent one from dominating | Ridge Regression |
| Want automatic feature selection | Lasso (L1) |
| Categorical-heavy tabular data | CatBoost |
| Large dataset, need fast training | LightGBM |
| Small dataset, similarity-based reasoning | KNN |
| Robust to outliers in features | Tree-based (no scaling needed) |
| Robust to outliers in target | MAE loss, Huber loss, or quantile regression |
| Need probability calibration | Logistic Regression (inherently calibrated) |
Key Formulas Cheat Sheet
Linear Regression:
ŷ = w₀ + w₁x₁ + ... + wₙxₙ
Normal Equation: w = (XᵀX)⁻¹ Xᵀy
Logistic Regression:
σ(z) = 1 / (1 + e^(-z))
Predict class 1 if σ(z) > threshold (default 0.5)
Decision Tree splits:
Gini = 1 - Σ pₖ²
Entropy = -Σ pₖ log₂(pₖ)
KNN:
Euclidean distance = √Σ(xᵢ - yᵢ)²
Predict: majority class of K nearest neighbors
SVM:
Maximize margin = 2 / ‖w‖
Soft margin: minimize ½‖w‖² + C × Σ ξᵢ
Why this matters for the exam: The MLS-C01 exam does not require you to derive these algorithms mathematically. It DOES require you to know:
- Which algorithm handles which problem type (regression vs classification vs clustering)
- Key hyperparameters and their effects (e.g., C in SVM → bias-variance tradeoff)
- When an algorithm will fail (KNN in high dimensions, SVM on large datasets)
- How tree-based algorithms handle feature types and missing values
- The difference between bagging (Random Forest) and boosting (XGBoost)
- Which SageMaker built-in algorithms correspond to which concepts (XGBoost, Linear Learner, KNN)