← AWS MLS-C01 — ML Specialty

Domain 3B: Supervised Learning Deep Dive

Domain 3B: Supervised Learning Deep Dive

Exam Domain: 3 — Modeling (36%) Focus: Core supervised algorithms — how they work, when to use them, how to tune them


Regression Algorithms

Linear Regression

First Principles

The simplest ML model: find the line (or hyperplane in multiple dimensions) that best fits the data by minimizing the sum of squared distances between predictions and true values.

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n$$

Where $w_0$ is the bias (intercept) and $w_1 \ldots w_n$ are the learned weights (coefficients).

ELI5: Draw a scatter plot of house sizes vs prices. Linear regression draws the best straight line through those dots. “Best” means the total squared vertical distance from each dot to the line is minimized. That’s it. For multiple features, replace “line” with “flat plane through higher-dimensional space” — the concept is identical.

Solving Linear Regression: Two Methods

Normal Equation (closed-form solution): $$\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

  • Computes the exact answer in one step
  • Requires inverting the matrix $\mathbf{X}^T \mathbf{X}$ (O(n³) operation)
  • Use when: n_features < ~10,000 (matrix inversion is feasible)
  • Don’t use when: millions of features (inversion is prohibitively slow)

Gradient Descent:

  • Iteratively finds the solution by following the gradient
  • Use when: large feature space, online learning, regularization needed
  • Scales better to large problems

Assumptions (and What Breaks Them)

AssumptionMeaningWhat Breaks ItFix
LinearityRelationship is actually linearNon-linear patternsAdd polynomial features, use a non-linear model
IndependenceObservations are independentTime series, repeated measuresUse time-aware models, mixed effects
HomoscedasticityError variance is constantResiduals fan out as x increasesLog-transform target, weighted regression
Normality of residualsErrors are normally distributedHeavy-tailed distributionsRobust regression, transformations
No multicollinearityFeatures aren’t highly correlatedCorrelated features → unstable weightsPCA, Ridge regression, drop correlated features

Polynomial Regression

When the relationship is non-linear, add polynomial features: $x_1^2, x_1 \cdot x_2, x_1^3$, etc.

Linear fit on curved data:      Polynomial fit:
        ●                               ●
  ●         ●                    ●    ╱   ●
        ╱               vs           ╱
   ●  ╱                        ●  ╱
     ╱  ← poor fit                ← better fit

Warning: high-degree polynomials overfit easily. Use regularization (Ridge/Lasso) or cross-validation to select degree.


Logistic Regression (for Classification!)

Despite the name, logistic regression is a classification algorithm. It predicts the probability of a binary outcome.

The Core Idea

Take a linear combination of features (like linear regression), then squash the output to the range (0, 1) using the sigmoid function:

$$\hat{p} = \sigma(z) = \frac{1}{1 + e^{-z}}, \quad z = w_0 + w_1 x_1 + \ldots + w_n x_n$$

Sigmoid Function:
  Output
  1.0 │                         ──────────
      │                    ────╱
  0.5 │─────────────── ───╱ ─────────────
      │           ────╱
  0.0 │──────────
      └────────────────────────────────── z (linear combo)
           -∞           0           +∞

At z=0: output = 0.5 (decision boundary)
z >> 0: output → 1.0
z << 0: output → 0.0

ELI5: Linear regression predicts a number that can be anything (−∞ to +∞). Logistic regression predicts a probability (0 to 1). The sigmoid function is the translator — it squashes any real number to a valid probability. If the probability is > 0.5, we predict class 1; otherwise, class 0.

Decision Boundary

The decision boundary is where $\hat{p} = 0.5$, i.e., where $z = 0$.

Because z is a linear combination of features, the decision boundary is a straight line (or hyperplane in higher dimensions). This makes logistic regression a linear classifier.

Two features (x₁, x₂):
  x₂
  │         ●  ●  ●  (class 1)
  │      ● ╱ ●
  │       ╱   ●
  │      ╱ ← decision boundary (straight line)
  │   ● ╱
  │  ● ╱   ○  (class 0)
  │   ╱  ○  ○
  └──────────────── x₁

Multi-class Logistic Regression

  • One-vs-Rest (OvR): train N binary classifiers, each predicting “is this class k vs all others?” Take the class with highest predicted probability.
  • Softmax (Multinomial): directly outputs a probability for each class, all summing to 1:

$$P(y=k | \mathbf{x}) = \frac{e^{z_k}}{\sum_{j} e^{z_j}}$$


Decision Trees

First Principles: Divide and Conquer

A decision tree asks a sequence of yes/no questions about features to classify a data point.

                    ┌────────────────────┐
                    │  Age < 30?         │
                    └──────┬─────────────┘
                           │
             YES ──────────┴──────────── NO
              │                           │
    ┌─────────────────┐      ┌─────────────────────┐
    │ Income > 50K?   │      │  Credit score > 700? │
    └────────┬────────┘      └──────────┬──────────┘
             │                          │
    YES ─────┴───── NO         YES ─────┴───── NO
     │               │          │               │
  Approve          Deny       Approve          Deny

The tree is built by repeatedly finding the best feature split at each node.

Split Criteria: How to Pick the Best Split?

Gini Impurity: probability of misclassifying a randomly chosen sample if we label it randomly according to class distribution in the node.

$$G = 1 - \sum_{k} p_k^2$$

Where $p_k$ is the proportion of class k in the node.

  • $G = 0$: pure node (all one class) — perfect split
  • $G = 0.5$: maximally impure (50/50 split) — worst split

Entropy / Information Gain:

$$H = -\sum_{k} p_k \log_2 p_k$$

Information Gain = entropy before split − weighted average entropy after split.

ELI5 comparison: Both measure “how mixed up are the classes in this bucket?” Gini says “if I randomly pick a sample and randomly assign it a label, how often am I wrong?” Entropy says “how many bits of information do I need to describe the class distribution?” In practice, they give almost identical results. Gini is slightly faster to compute (no logarithm).

CriterionFormulaBehavior
Gini$1 - \sum p_k^2$Slightly favors larger partitions
Entropy$-\sum p_k \log_2 p_k$More balanced splits
Variance reductionFor regression treesMinimize variance within child nodes

Decision Tree Pros and Cons

Pros:

  • Highly interpretable — can visualize and explain
  • No feature scaling needed (splits are threshold comparisons, not distance-based)
  • Handles mixed feature types (numerical + categorical)
  • Captures non-linear relationships
  • Fast inference

Cons:

  • High variance — small changes in training data → very different trees
  • Overfits easily without constraints
  • Unstable (a single outlier can change the whole tree structure)
  • Biased toward features with more possible split points

Controlling Tree Complexity (Pruning)

Unpruned (overfits):          Pruned (generalizes):
      Root                          Root
     /    \                        /    \
   /        \                    /        \
 Split     Split               Split     Split
 /  \      /  \                /  \
S1  S2   S3   S4             S1  S2   (stopped early)
|   |    |    |
...many tiny leaves...

Pre-pruning (stop during training):

  • max_depth: limit tree depth
  • min_samples_split: require minimum N samples to split a node
  • min_samples_leaf: require minimum N samples in each leaf
  • max_leaf_nodes: cap total number of leaves

Post-pruning (prune after training):

  • Cost-complexity pruning (alpha parameter): prune subtrees that don’t improve validation performance enough per leaf added

Random Forest

What It Is

Random Forest = Bagging on decision trees + random feature subsets at each split.

For each tree:

  1. Bootstrap sample (sample N points with replacement from training data)
  2. At each node, consider only $\sqrt{p}$ (classification) or $p/3$ (regression) randomly selected features
  3. Grow the tree fully (or to max_depth)

Final prediction: majority vote (classification) or average (regression) across all trees.

Why Randomness Helps

If all trees had access to the same features, they would all make the same split at the top (the most informative feature), producing highly correlated trees. Correlated errors don’t cancel out when averaged.

Random feature subsets force different trees to use different features, decorrelating them. Uncorrelated errors cancel out. Variance drops.

ELI5: If you ask 100 people the same question while giving them access to the same information, they’ll all give similar answers — no diversity means no averaging benefit. But if each person can only see a random subset of the information, they’ll form different opinions based on different perspectives. When you average those diverse opinions, errors cancel out and accuracy improves.

Feature Importance

Two ways to measure which features matter most:

  • Gini Importance (impurity-based): sum how much each feature reduces impurity across all splits, across all trees. Fast but biased toward high-cardinality features.
  • Permutation Importance: shuffle a single feature’s values, measure how much model accuracy drops. Drop = importance. Slower but unbiased.

Key Hyperparameters

ParameterDefaultEffect
n_estimators100More trees = more stable (never hurts, just slower)
max_depthNone (full)Limit depth to reduce variance
max_featuressqrt(p) for clfSmaller = more diverse trees, but each tree is weaker
min_samples_split2Higher = more conservative splits
min_samples_leaf1Higher = smoother predictions
bootstrapTrueSet False for pasting (no replacement)

Random Forest vs Single Decision Tree

Single Decision TreeRandom Forest
VarianceVery highLow
BiasLow (deep trees)Slightly higher
InterpretabilityHigh (visual)Low (black box)
OverfittingProneResistant
Training speedFastSlower (N trees)
Prediction speedFastModerate (N trees)
Feature importanceBasicRobust

Gradient Boosted Trees (XGBoost, LightGBM, CatBoost)

The Boosting Concept, Revisited

Boosting builds trees sequentially. Each new tree fits the residuals (errors) of the current ensemble.

Iteration 0: Initial prediction = mean(y) for all samples
             Residuals = y - mean(y)

Iteration 1: Tree 1 fits the residuals
             New predictions = mean(y) + α × Tree1(x)
             Residuals = y - new predictions

Iteration 2: Tree 2 fits the NEW residuals
             New predictions = prev + α × Tree2(x)

...
Final prediction = mean(y) + α × Tree1(x) + α × Tree2(x) + ...

Each tree is a weak learner (shallow, limited splits). The ensemble gradually reduces both bias and variance.

XGBoost Deep Dive

XGBoost adds several innovations over vanilla gradient boosting:

Regularized objective: $$Obj = \sum_{i} L(y_i, \hat{y}i) + \sum{k} \Omega(f_k)$$

Where $\Omega(f_k) = \gamma T + \frac{1}{2}\lambda \sum_{j} w_j^2$ penalizes tree complexity (T = number of leaves, $w_j$ = leaf weights).

Second-order gradient approximation (Newton boosting): Uses both first-order (gradient) and second-order (Hessian) information to find better leaf weights. Like using curvature information to take smarter steps, not just slope.

Sparsity-aware split finding: handles missing values natively by learning the best default direction for missing values during training.

Column (feature) subsampling: like Random Forest, sample features at each tree/level to reduce overfitting and speed up training.

ELI5: XGBoost is the Swiss Army knife of ML. If your data fits in a table with labeled rows, XGBoost will probably get you a top-performing model. It’s won more Kaggle competitions than any other single algorithm. The key insight is that it’s not just “boosting with trees” — it adds explicit regularization and uses smarter math (second-order gradients) to find better solutions faster.

XGBoost Key Hyperparameters

ParameterEffectTuning Direction
n_estimatorsNumber of treesMore trees: better fit, risk of overfit. Use with early stopping.
max_depthTree depth (3-6 typical)Lower = less overfit
learning_rate (eta)Shrinks each tree’s contributionLower + more trees = better but slower
subsampleRow sampling fraction0.5-0.9: reduces overfit
colsample_bytreeFeature sampling fraction0.5-0.9: reduces overfit, like RF
lambda (reg_lambda)L2 regularization on leaf weightsHigher = less overfit
alpha (reg_alpha)L1 regularization on leaf weightsHigher = sparser model
min_child_weightMin sum of Hessian per leafHigher = more conservative
gammaMin loss reduction to splitHigher = fewer splits

Golden tuning rule: set learning_rate low (0.01-0.1), n_estimators high, use early stopping — let training stop when validation score stops improving.

LightGBM vs XGBoost vs CatBoost

XGBoost tree growth:          LightGBM tree growth:
Level-wise (breadth-first)    Leaf-wise (best-first)

     Root                          Root
    /    \                        /    \
  L1      R1                   L1       R1
  / \    / \                   / \
L2  R2  L3  R3               L2   R2   ← grows the leaf with
(all nodes at each level)         ↑         highest gain first
                           continues expanding best leaf

LightGBM’s leaf-wise growth can find better splits but risks overfitting on small datasets (deep unbalanced trees). Set num_leaves to control.

CatBoost’s key feature: ordered boosting — uses a permutation-based approach to avoid target leakage when computing statistics on categorical features. This makes it especially strong on categorical-heavy data without manual preprocessing.

XGBoostLightGBMCatBoost
Tree growthLevel-wiseLeaf-wiseSymmetric (oblivious)
SpeedModerateFastest (large data)Moderate
MemoryModerateLowModerate
Categorical featuresManual encoding neededNative (label encoding)Native (ordered target encoding)
Best forTabular, generalLarge datasets, speedCategorical-heavy data
Overfitting riskLow-moderateHigher (small data)Low
InterpretabilityGoodGoodGood
Tuning effortModerateModerateLow (good defaults)

Exam tip: SageMaker has a built-in XGBoost container. It’s the most commonly tested tree-based algorithm on the MLS-C01 exam.


Support Vector Machines (SVM)

First Principles: Maximize the Margin

For a binary classification problem, instead of just finding any decision boundary that separates classes, SVM finds the boundary that maximizes the margin — the distance between the boundary and the nearest training points (support vectors).

      ●  ●                    ●  ●
   ●        ○              ●     │  ○
●     ●    ○  ○           ●  ●  │    ○  ○
   ●      ○               ●     │margin ○
      ●  ○                      │
                          support vectors
                                ↑
   Any line separates    SVM finds the widest street

Why maximizing the margin matters: the wider the margin, the more “buffer room” between classes. New test points are less likely to fall on the wrong side. This leads to better generalization (less overfitting).

Support vectors are the training points closest to the boundary — they’re the only ones that define the boundary. Move any other training point and the boundary stays the same.

The C Parameter (Soft Margin)

Real data is rarely perfectly separable. The C parameter controls the trade-off between maximizing margin width and minimizing classification errors:

Small C (soft margin):           Large C (hard margin):
  ●  ●                             ●  ●
●    ●  │  ○  ○              ●   ●│    ○  ○
    ●   │    ○                  ●  │  ○
●       │  ○                      │
        ↑ wider margin,           ↑ narrow margin,
          allows some              no misclassification
          misclassification        (may overfit)
  • Small C: wider margin, allows more margin violations → lower variance, higher bias
  • Large C: narrow margin, tries to classify all points correctly → higher variance, lower bias

The Kernel Trick — Handling Non-Linear Data

What if the classes aren’t linearly separable?

ELI5: Imagine cats and dogs scattered in a room, but they’re mixed together in a circle pattern — the cats are in the middle, dogs form a ring around them. You can’t draw a straight line to separate them. But if you could lift the cats up into 3D (their height represents some new feature), suddenly a flat horizontal plane separates them from the dogs below. The kernel trick does exactly this lifting — mathematically, without actually computing the high-dimensional coordinates. It’s like taking a shortcut through the calculation.

2D (non-separable):           After kernel transformation (3D):
  ○ ○ ○                              ●
○  ●●●  ○           →              ●   ●       ← lifted cats
○  ●●●  ○                        ──────────    ← separating plane
  ○ ○ ○                          ○   ○   ○     ← dogs stay flat
(circle pattern)

The kernel function $K(x_i, x_j)$ computes the dot product in the high-dimensional space without actually going there — the “kernel trick.”

KernelFormulaUse When
Linear$x_i \cdot x_j$Data is linearly separable, high-dimensional (text)
Polynomial$(x_i \cdot x_j + r)^d$Moderate non-linearity, known polynomial relationship
RBF / Gaussian$e^{-\gamma |x_i - x_j|^2}$Default — most non-linear problems
Sigmoid$\tanh(\kappa x_i \cdot x_j + c)$Rare, similar to 2-layer neural network

RBF kernel parameter γ (gamma):

  • High γ: each training point has small influence radius → complex decision boundary → overfitting
  • Low γ: each training point has large influence radius → smoother boundary → underfitting

SVM Practical Guide

Use SVM when:

  • Small-to-medium dataset (< ~100K samples — training is O(n²) to O(n³))
  • High-dimensional data where n_features > n_samples (text classification, genomics)
  • Clear margin of separation exists
  • You need a well-principled probabilistic model is NOT a priority (SVM doesn’t natively output probabilities — needs Platt scaling)

Don’t use SVM when:

  • Large datasets (slow training and prediction)
  • You need calibrated probabilities directly
  • Lots of missing data (SVM requires complete feature vectors)
  • Lots of noise in the data (C needs to be very small → similar to linear models)

K-Nearest Neighbors (KNN)

First Principles: Learning by Analogy

KNN makes no assumptions about the data distribution. It stores all training data and at inference time, looks at the K most similar training examples.

ELI5: To decide if a new movie is action or romance, look at the K most similar movies in your database (using some similarity measure like director, actors, runtime, rating). If 7 out of 10 most similar movies are action, predict action. There’s no “model” — just a lookup table of all your past experience.

K=1 (overfit):              K=5 (smoother):
  ●●●  ○○○                  ●●●  ○○○
  ●○●  ○●○                  ●●●  ○○○
  ●●●  ○○○                  ●●●  ○○○

(every training point        (smooth, rounded
 is its own class island)     decision boundary)

The Algorithm

  1. Store all training examples
  2. For a new point x: a. Compute distance to every training point b. Find K nearest neighbors c. Return majority vote (classification) or mean (regression)

No training phase! KNN is a lazy learner — all computation happens at inference time. This means:

  • Training: O(1) — just store the data
  • Inference: O(n × d) — compute n distances in d dimensions

Distance Metrics

MetricFormulaUse When
Euclidean$\sqrt{\sum (x_i - y_i)^2}$Continuous features, same scale
Manhattan$\sum |x_i - y_i|$Grid-like data, more robust to outliers
Minkowski$(\sum |x_i - y_i|^p)^{1/p}$Generalizes both (p=2: Euclidean, p=1: Manhattan)
HammingCount of different bitsCategorical or binary features
Cosine$1 - \frac{x \cdot y}{|x||y|}$Text/document similarity (direction matters, not magnitude)

CRITICAL: KNN requires feature scaling (StandardScaler or MinMaxScaler). If one feature is in thousands (salary) and another in single digits (years), distance will be dominated by salary.

Choosing K

K=1:   Very low bias, very high variance
       (jagged decision boundary, sensitive to noise)

K=N:   Very high bias, very low variance
       (predicts the majority class for everything)

Sweet spot: usually K = √N (N = number of training samples)
            Use cross-validation to find optimal K

The Curse of Dimensionality

In high dimensions, KNN breaks down because:

  1. All distances become approximately equal (data becomes equidistant)
  2. Volume of space grows exponentially — data becomes extremely sparse
  3. “Nearest neighbor” in 1000 dimensions may be no more similar than a random point
1D: Cover 10% of range → need 10% of data
2D: Cover 10% in each dim → need 10% × 10% = 1% of data... wait
    Actually to cover 10% of volume in 2D, need 10%^(1/2) = 31.6% per side
nD: To cover 10% of n-dimensional volume → need 10%^(1/n) of range per dimension
    n=10: need 79% of range per dimension to cover 10% of volume!
    → Must look nearly everywhere to find neighbors

Use KNN when:

  • Small dataset (< 10K samples — inference is fast enough)
  • Low-to-moderate dimensions (< 50 features, ideally < 20)
  • Interpretability needed (“these 5 similar examples predicted X”)
  • Non-linear patterns exist

Don’t use KNN when:

  • Large datasets (O(n) inference per prediction)
  • High-dimensional data (curse of dimensionality)
  • Features need to be on same scale but can’t be (mixed types)

Classification-Specific Concepts

Binary vs Multi-class vs Multi-label

Binary:      One class from {0, 1}          → "Is this spam? Yes/No"
Multi-class: One class from {A, B, C, D}    → "Which digit? 0-9"
Multi-label: Any subset of {A, B, C, D}     → "Which topics? [tech, finance, AI]"

Multi-label requires different evaluation (Hamming loss, micro/macro F1) and different model outputs (sigmoid per class, not softmax).

Threshold Tuning: Why 0.5 Isn’t Always Optimal

The default decision threshold of 0.5 only makes sense when:

  • False positives and false negatives have equal cost
  • The class prior probabilities are balanced

Real-world examples where 0.5 is wrong:

  • Fraud detection: missing a fraud (FN) costs far more than a false alarm (FP) → lower threshold (0.2-0.3)
  • Medical screening: missing a disease (FN) costs more than unnecessary follow-up (FP) → lower threshold
  • Spam filter: incorrectly blocking real email (FP) is very annoying → higher threshold (0.7-0.8)

Use the ROC curve (threshold vs TPR/FPR) or precision-recall curve to choose the optimal threshold for your cost function.

Class Imbalance: When the Problem is Lopsided

Balanced (50/50):              Imbalanced (99/1):
  ●●●●●●●●●●                    ●●●●●●●●●●●●●●●●●●●●●●●●●●
  ○○○○○○○○○○                    ○   (minority class = 1%)
  Accuracy is meaningful        Accuracy = 99% just by predicting all ●

Techniques for imbalanced data:

TechniqueHowWhen
Oversampling (SMOTE)Generate synthetic minority samplesTraining data too small
UndersamplingRemove majority samples randomlyMajority class is redundant
Class weightsPenalize misclassifying minority harderModel-level (easy to implement)
Threshold tuningLower threshold for minority classPost-training adjustment
Ensemble (BalancedBagging)Balance each bootstrap sampleCombines sampling + ensembling

Evaluation on imbalanced data: use F1-score, precision-recall curve, AUC-PR (not accuracy or AUC-ROC which can be misleading).


Algorithm Selection Guide

Decision Flowchart

What is your problem?
│
├── REGRESSION (continuous target)?
│   ├── Linear relationship, interpretability needed?
│   │   └── Linear Regression (+ Ridge/Lasso for regularization)
│   ├── Non-linear, medium data, best accuracy?
│   │   └── XGBoost / LightGBM
│   └── Many correlated features?
│       └── Ridge or Elastic Net
│
└── CLASSIFICATION (discrete target)?
    ├── Need interpretability (rules)?
    │   └── Decision Tree (shallow) or Logistic Regression
    ├── Robust baseline, feature importance?
    │   └── Random Forest
    ├── Best accuracy on tabular data?
    │   └── XGBoost / LightGBM / CatBoost
    ├── Small dataset, high-dimensional (text)?
    │   └── SVM with RBF/Linear kernel
    ├── Small dataset, need instance-based reasoning?
    │   └── KNN
    └── Complex patterns (images, text, sequences)?
        └── Neural Networks (see Domain 3D)

Master Comparison Table

AlgorithmSpeed (Train)Speed (Predict)InterpretabilityScaling NeededLarge DataNon-linearMissing Values
Linear/Logistic RegressionFastFastHighYesYesNo (poly features)No
Decision TreeFastFastHighNoYesYesNo
Random ForestModerateModerateLowNoYesYesNo (imputation)
XGBoostModerateFastLow-MedNoYesYesYes (native)
LightGBMFastFastLow-MedNoYesYesYes (native)
SVM (RBF)Slow (O(n²))Slow (O(n))LowYesNoYesNo
KNNInstantSlow (O(n))HighYesNoYesNo

When to Use Each (Exam Quick Reference)

ScenarioRecommended Algorithm
Tabular data, need best accuracy, no interpretability neededXGBoost / LightGBM
Need to explain each prediction (regulatory)Logistic Regression, Decision Tree
Text classification, high-dimensional sparse featuresSVM (Linear kernel) or Logistic Regression
Many correlated features, prevent one from dominatingRidge Regression
Want automatic feature selectionLasso (L1)
Categorical-heavy tabular dataCatBoost
Large dataset, need fast trainingLightGBM
Small dataset, similarity-based reasoningKNN
Robust to outliers in featuresTree-based (no scaling needed)
Robust to outliers in targetMAE loss, Huber loss, or quantile regression
Need probability calibrationLogistic Regression (inherently calibrated)

Key Formulas Cheat Sheet

Linear Regression:
  ŷ = w₀ + w₁x₁ + ... + wₙxₙ
  Normal Equation: w = (XᵀX)⁻¹ Xᵀy

Logistic Regression:
  σ(z) = 1 / (1 + e^(-z))
  Predict class 1 if σ(z) > threshold (default 0.5)

Decision Tree splits:
  Gini = 1 - Σ pₖ²
  Entropy = -Σ pₖ log₂(pₖ)

KNN:
  Euclidean distance = √Σ(xᵢ - yᵢ)²
  Predict: majority class of K nearest neighbors

SVM:
  Maximize margin = 2 / ‖w‖
  Soft margin: minimize ½‖w‖² + C × Σ ξᵢ

Why this matters for the exam: The MLS-C01 exam does not require you to derive these algorithms mathematically. It DOES require you to know:

  • Which algorithm handles which problem type (regression vs classification vs clustering)
  • Key hyperparameters and their effects (e.g., C in SVM → bias-variance tradeoff)
  • When an algorithm will fail (KNN in high dimensions, SVM on large datasets)
  • How tree-based algorithms handle feature types and missing values
  • The difference between bagging (Random Forest) and boosting (XGBoost)
  • Which SageMaker built-in algorithms correspond to which concepts (XGBoost, Linear Learner, KNN)