Domain 3C: Unsupervised Learning Deep Dive

19 min read 3901 words

Table of Contents

Domain 3C: Unsupervised Learning Deep Dive

Domain 3C: Unsupervised Learning Deep Dive

Exam Domain: 3 — Modeling (36%) Focus: Clustering, dimensionality reduction, anomaly detection — finding structure without labels

What is Unsupervised Learning? (First Principles)

In supervised learning, every training example comes with a label: this email is spam, this tumor is malignant, this transaction is fraud. The model learns to map inputs to those known outputs.

Unsupervised learning removes the labels entirely. The model must discover hidden structure, patterns, and groupings in the data on its own.

ELI5: Supervised learning is studying with an answer key — you know which photos are cats and which are dogs, and you learn the difference. Unsupervised learning is sorting a huge pile of unlabeled vacation photos into groups — you discover that there seem to be “beach photos,” “mountain photos,” and “city photos” groups, but nobody told you those categories existed. You invented them from the patterns you saw.

The Three Main Tasks

Unsupervised Learning
│
├── Clustering              → "Which things are similar?"
│   ├── K-Means             (centroid-based)
│   ├── Hierarchical        (tree of merges)
│   ├── DBSCAN              (density-based)
│   └── GMM                 (probabilistic)
│
├── Dimensionality Reduction → "How can I represent this more compactly?"
│   ├── PCA                 (linear, global structure)
│   ├── t-SNE               (non-linear, local structure, visualization only)
│   └── UMAP                (non-linear, faster than t-SNE)
│
└── Anomaly Detection       → "What's unusual?"
    ├── Isolation Forest    (path length)
    ├── Random Cut Forest   (SageMaker built-in)
    ├── LOF                 (density-based)
    └── Statistical methods (Z-score, Mahalanobis)

Clustering Algorithms

K-Means

The Algorithm Step by Step

K-Means finds K cluster centers (centroids) such that each point is assigned to its nearest centroid and the total within-cluster variance is minimized.

Step 0: Choose K=3, initialize 3 centroids randomly
        ┌─────────────────────────┐
        │  ●  ●    ×₁             │
        │     ●  ●    ×₂          │
        │  ●      ×₃  ●  ●        │
        └─────────────────────────┘
        (× = centroid, ● = data point)

Step 1: Assign each point to nearest centroid
        ┌─────────────────────────┐
        │  ●  ●    ×₁             │  ← blue cluster
        │     ●  ●    ×₂          │  ← green cluster
        │  ●      ×₃  ●  ●        │  ← red cluster
        └─────────────────────────┘

Step 2: Recalculate centroids (move to cluster mean)
        ┌─────────────────────────┐
        │  ● ×₁●                  │  ← centroid moved
        │     ●  ● ×₂             │  ← centroid moved
        │  ● ×₃    ●  ●           │  ← centroid moved
        └─────────────────────────┘

Step 3: Re-assign points to new centroids
Step 4: Repeat steps 2-3 until centroids stop moving

Convergence is guaranteed (the objective — total within-cluster sum of squares — decreases every iteration and is bounded below by 0). But convergence to a global minimum is not guaranteed.

Objective function (minimize):

$$\text{Inertia} = \sum_{k=1}^{K} \sum_{x \in C_k} |x - \mu_k|^2$$

Where $\mu_k$ is the centroid of cluster k.

ELI5: K-Means is like placing K magnets in a room full of iron balls. Each ball snaps to its nearest magnet. Then you pick up each magnet and move it to the center of gravity of the balls stuck to it. The balls re-snap. You repeat until the magnets stop moving. The final positions are the cluster centers.

Choosing K: The Elbow Method

Inertia (within-cluster sum of squares)
  │
  │╲
  │ ╲
  │  ╲
  │   ╲___
  │       ╲───
  │           ╲────────────────
  └────────────────────────────── K (number of clusters)
  K=1  K=2  K=3  K=4  K=5  K=6
              ↑
          "elbow" — adding more clusters stops helping much
          → choose K=3

The elbow is where adding more clusters gives diminishing returns. Note: the elbow isn’t always obvious — use silhouette score as a complement.

Silhouette score for a single point: $$s = \frac{b - a}{\max(a, b)}$$

Where:

$a$ = mean distance to points in same cluster (cohesion)
$b$ = mean distance to points in nearest other cluster (separation)
Score range: −1 to +1 (higher = better fit)

Average silhouette score across all points measures overall cluster quality. Best K = highest average silhouette score.

K-Means Limitations

K-Means works well:          K-Means fails:
  ●●●    ○○○                  ●●●●●●●●●●●●●●●●
  ●●●    ○○○                   (ring of ●s)
  ●●●    ○○○                  ●●●●●●●●●●
  (spherical, similar size,    (circle inside circle —
   well-separated)              K-Means draws straight lines)

Assumes spherical clusters (Euclidean distance-based)
Sensitive to initialization (different random starts → different results)
Must specify K in advance
Sensitive to outliers (outliers become their own “cluster” or pull centroids)
Struggles with clusters of very different sizes or densities

K-Means++ Initialization

Standard K-Means picks initial centroids randomly — bad luck means slow convergence or bad local minima.

K-Means++ seeds centroids intelligently:

Pick first centroid uniformly at random
For each subsequent centroid, pick a point with probability proportional to its squared distance from the nearest existing centroid
Far-apart points are more likely to be chosen → better spread of initial centroids

Effect: faster convergence, better final solutions, almost always used in practice (sklearn default).

Mini-Batch K-Means

For very large datasets, compute centroid updates on random mini-batches instead of the full dataset. Faster training with slightly lower quality. Use when dataset has millions of points.

Hierarchical Clustering

The Concept: Building a Tree

Instead of assuming K clusters upfront, hierarchical clustering builds a tree of all possible merges (or splits) and lets you choose the level of granularity.

Agglomerative (bottom-up) — most common:

Start: each point is its own cluster (N clusters)
Find the two closest clusters
Merge them into one
Repeat until 1 cluster remains
Cut the dendrogram at the desired level to get K clusters

ASCII Dendrogram:
                     ┌──────────────────────────┐
                     │                          │
             ┌───────┴───────┐        ┌─────────┴─────────┐
             │               │        │                   │
         ┌───┴───┐       ┌───┴───┐  ┌─┴─┐           ┌────┴────┐
         │       │       │       │  │   │           │         │
        [A]     [B]     [C]     [D][E] [F]         [G]       [H]

Cut here ─────────────────────────────────────────────────── → 2 clusters
Cut here ────────────────────────────────────  → 3 clusters
Cut here ──────────────────── → 4 clusters

The height of each merge in the dendrogram represents the distance between the merged clusters. Cuts at different heights give different numbers of clusters.

Linkage Methods: How to Measure Distance Between Clusters?

Linkage	Distance Between Clusters	Effect	Use When
Single	Nearest pair across clusters	Chaining (elongated clusters)	Detecting connected structure
Complete	Furthest pair across clusters	Compact, roughly equal-size clusters	Balanced clusters needed
Average	Mean of all cross-cluster distances	Balance between single and complete	General purpose
Ward’s	Minimize increase in total variance	Compact, equal-size clusters	Most common default

When to Use Hierarchical vs K-Means

	K-Means	Hierarchical
K specification	Required upfront	Choose after seeing dendrogram
Speed	Fast O(nKt)	Slow O(n² log n)
Large datasets	Good	Poor (n > 10K gets slow)
Interpretability	Medium	High (visual dendrogram)
Cluster shape	Spherical only	More flexible
Multiple granularities	Retrain for each K	Single run, cut anywhere

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

The Core Idea: Clusters Are Dense Regions

DBSCAN defines clusters as dense regions of points separated by sparse regions. It doesn’t try to assign every point to a cluster — sparse points are noise.

DBSCAN classification of points:

    ● ● ●              ● ●
   ●●●●●●●              ●●●
    ● ● ●    ·    ·      ●●
              ·       ● ●

Dense region    Noise   Dense region
= Cluster 1     (·)    = Cluster 2

Parameters

eps (ε): the radius of a point’s neighborhood — how far to look for neighbors
min_samples: minimum number of neighbors within eps to be a “core point”

Point Types

Core point:    has ≥ min_samples neighbors within eps radius
               ● ● ●         ← all core points (assuming min_samples=3)
                ●●●

Border point:  has < min_samples neighbors, but is within eps of a core point
               ○ ← border point (close to core points but sparse itself)

Noise point:   not a core point, not within eps of any core point
               · ← noise/outlier (isolated)

Algorithm:

Find all core points
Connect core points that are within eps of each other into clusters
Assign border points to the nearest cluster they touch
Label remaining points as noise (−1)

ELI5: DBSCAN looks for crowds. A cluster is a crowd of people standing close together. If you’re standing in a crowd (core point), you belong to that cluster. If you’re on the edge of a crowd (border point), you belong to the nearest crowd. If you’re standing alone in an empty field far from any crowd (noise point), you don’t belong to any cluster — you’re an outlier.

Strengths and Weaknesses

Strengths:

Discovers arbitrarily shaped clusters (not just spherical)
Automatically identifies outliers (no forced assignment)
Does not need K specified in advance
Robust to outliers (they become noise, not distort clusters)

Weaknesses:

Sensitive to eps — hard to choose (use k-distance plot)
Struggles with clusters of varying density (a global eps can’t fit all clusters)
Doesn’t scale well to high dimensions (distances become meaningless — curse of dimensionality)
For very large datasets, use HDBSCAN (hierarchical DBSCAN) instead

Comparison: K-Means vs DBSCAN vs Hierarchical

	K-Means	DBSCAN	Hierarchical
K required?	Yes	No	No (choose from dendrogram)
Handles noise?	No	Yes	No
Cluster shape	Spherical	Arbitrary	Depends on linkage
Speed	Fast	Moderate	Slow
Large data	Yes	Yes (with index)	No
Varying density	No	No	Depends
Interpretability	Low	Medium	High

Gaussian Mixture Models (GMM)

Soft Clustering with Probabilities

K-Means gives hard assignments — each point belongs to exactly one cluster. GMM gives soft assignments — each point has a probability of belonging to each cluster.

GMM models the data as a mixture of K Gaussian distributions. Each cluster has its own mean (μ), covariance (Σ), and mixture weight (π).

$$P(x) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(x | \mu_k, \Sigma_k)$$

K-Means (hard):             GMM (soft):
  Point A → Cluster 1       Point A → [C1: 70%, C2: 25%, C3: 5%]
  Point B → Cluster 2       Point B → [C1: 10%, C2: 85%, C3: 5%]
  (binary assignment)       (probabilistic assignment)

Expectation-Maximization (EM) Algorithm

GMM is fitted with EM:

E-step (Expectation): For each point, compute the probability it came from each cluster (given current parameters)
M-step (Maximization): Update cluster parameters (mean, covariance, weight) to maximize likelihood given those probabilities
Repeat until convergence

When to Use GMM vs K-Means

	K-Means	GMM
Assignment	Hard	Soft (probabilistic)
Cluster shape	Spherical	Elliptical (full covariance)
Overlapping clusters	No	Yes
Speed	Faster	Slower
Need probabilities	No	Yes
Interpretation	Simpler	Richer

Use GMM when: clusters overlap, you need probability of cluster membership, clusters are elliptical rather than spherical.

Dimensionality Reduction

Why Reduce Dimensions?

Problems with high-dimensional data:
  ┌──────────────────────────────────────────────────┐
  │ Curse of dimensionality → sparse data, distances  │
  │ Computation: O(n × d) for many operations         │
  │ Visualization: impossible beyond 3D               │
  │ Noise features hurt model performance             │
  │ Storage and memory cost                           │
  └──────────────────────────────────────────────────┘

Goals of dimensionality reduction:
  ┌──────────────────────────────────────────────────┐
  │ Visualization (2D/3D for human inspection)        │
  │ Noise removal (compress away noise dimensions)    │
  │ Speed up downstream algorithms                    │
  │ Remove multicollinearity for linear models        │
  │ Feature extraction (create new meaningful dims)   │
  └──────────────────────────────────────────────────┘

Principal Component Analysis (PCA) — Deep Dive

First Principles: Find the Directions of Maximum Variance

PCA finds a new coordinate system (principal components) aligned with the directions of maximum variance in the data.

Original 2D data:            After PCA:
  x₂                          PC2 (low variance)
  │    ●                       ↗
  │  ● ● ●               ● ● ●●●●●●●●●→ PC1 (high variance)
  │ ●●●●●●●         →
  │  ● ● ●
  │    ●
  └──────── x₁
            (correlated)      (PC1 captures most variance,
                               PC2 captures the rest)

The first principal component (PC1) is the direction along which the data varies the most. The second (PC2) is the direction of second-highest variance, constrained to be perpendicular to PC1. And so on.

The Math (Intuition Level)

Center the data: subtract the mean from each feature → zero-centered
Compute covariance matrix: $C = \frac{1}{n-1} X^T X$ — captures how features vary together
Eigendecomposition: $C = V \Lambda V^T$
- Eigenvectors V = directions of the principal components
- Eigenvalues Λ = variance explained by each component
Project: multiply centered data by top-K eigenvectors → new K-dimensional representation

You don’t need to memorize the math for the exam. Understand the concepts:

PCA finds directions of maximum variance
Principal components are uncorrelated (orthogonal)
Eigenvalue = variance explained by that component
Projecting onto fewer components = compression (lossy)

Explained Variance: How Many Components?

Explained Variance by Component:
  │
  │ 40%
  │ ████
  │ ████  25%
  │ ████  ████
  │ ████  ████  15%
  │ ████  ████  ████  10%  5%   3%   2%
  └──────────────────────────────────── Component
         PC1   PC2   PC3   PC4   PC5   PC6   PC7

Cumulative: 40% → 65% → 80% → 90% → 95% → 98% → 100%
                              ↑
                    Keep 4 components for 90% variance retained

Common rule: retain enough components to explain 95% of total variance. Use the elbow in the scree plot as an alternative guide.

ELI5: PCA finds the best “camera angle” that shows the most interesting variation in your data. If you’re photographing a long, flat building, the most informative view shows the length (PC1). The next best view shows the height (PC2). The depth (PC3) adds little because the building is flat. You keep only the views that add meaningful information, discarding dimensions that are mostly noise.

PCA Use Cases

Preprocessing for algorithms sensitive to high dimensions: reduce dimensions before feeding into logistic regression, KNN, SVM
Visualization: project to 2-3 dimensions to scatter-plot and explore clusters
Noise reduction: keep high-variance components, discard low-variance (noise) components
Removing multicollinearity: PCA components are orthogonal by construction — no correlation between them

PCA Limitations

Linear only: can’t capture non-linear structure (use t-SNE or UMAP for non-linear)
Components aren’t interpretable: PC1 is a weighted sum of all original features — you lose the ability to say “this is the ‘age’ dimension”
Sensitive to feature scaling: must standardize features before PCA (otherwise high-magnitude features dominate)
Information loss: projection is lossy — some original variance is discarded

t-SNE — For Visualization Only

What It Does

t-distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction technique designed primarily for 2D/3D visualization of high-dimensional data.

Where PCA preserves global structure (the overall geometry), t-SNE preserves local structure — nearby points in high dimensions stay nearby in 2D.

How It Works (Intuition)

In high-dimensional space: model pairwise similarities as probabilities (nearby points = high probability of being “neighbors”)
In 2D space: place points randomly, then move them so pairwise similarities match
Uses a t-distribution (heavier tails) in 2D to avoid crowding — clusters can be separated

High-D space:              t-SNE 2D projection:
  ●●●  ●●●  ●●●             ●●●   ●●●
  ●●● (mixed)●●●     →       ●●●   ●●●    ●●●
  ●●●    ●●●                        ●●●
  (hard to see clusters)    (clusters pop out visually)

The Perplexity Parameter

Perplexity controls the balance between local and global structure (roughly: how many neighbors each point considers). Typical range: 5-50.

Low perplexity: tight local clusters, ignores global structure
High perplexity: broader view, may merge distinct clusters
Rule of thumb: perplexity ≈ √N or 30 for most datasets

Critical Limitations — The Exam Traps

Exam tip: t-SNE is for visualization only. Do NOT use it as a preprocessing step for machine learning models. Here’s why:

Limitation	Explanation
Non-deterministic	Different random seeds → different layouts. Not reproducible without fixing seed.
No out-of-sample extension	Cannot transform new points without re-running on full dataset
Distances not meaningful	Cluster sizes and inter-cluster distances in t-SNE plots are meaningless
Global structure distorted	Two clusters being far apart in t-SNE doesn’t mean they’re different
Slow	O(n² log n), impractical for > 100K points (use UMAP instead)

t-SNE vs PCA

	PCA	t-SNE
Type	Linear	Non-linear
Structure preserved	Global (overall geometry)	Local (neighborhoods)
Use for preprocessing	Yes	No
Deterministic	Yes	No (random initialization)
Out-of-sample	Yes (apply same projection)	No
Speed	Fast O(nd²)	Slow O(n² log n)
Interpretability	Components have meaning	No meaning
Best for	Preprocessing, noise reduction	Visualization only

ELI5: PCA takes the big picture view — it captures the overall shape and spread of your data, like a map from above. t-SNE zooms into the neighborhoods — it makes sure that things that were close together in high dimensions are still close together on the 2D plot. The catch is that t-SNE distorts the global picture to preserve the local one. It’s better for “seeing” clusters, but you can’t trust the distances between them.

Anomaly Detection

What is Anomaly Detection?

Find the data points that don’t fit the pattern — outliers, unusual events, rare failures.

Normal data:              Anomalies:
  ●●●●●●●●●                ●●●●●●●●●
  ●●●●●●●●●                ●●●●●●●●●
  ●●●●●●●●●                ●●●●●●●●●
  (dense cluster)           (sparse outliers → ○)
                           ●●● ○ ●●●●
                           ●●●●●● ○ ●

Use cases: fraud detection, network intrusion, manufacturing defects, server anomalies, medical anomalies.

Isolation Forest

The Core Idea

Normal data points require many random splits to isolate (they’re surrounded by many neighbors — you need lots of cuts to get them alone).

Anomalies are easy to isolate (they’re in sparse regions — just a few random cuts and they’re separated from everyone else).

Isolating a normal point:            Isolating an anomaly:
  ●●●●●●●●●●●●                         ●●●●●●●●●●●●○
  ──────────────  cut 1                 ─────────────── cut 1
  ●●●●●●●●●●●●
  ──────────────  cut 2
  ●●●●●●●●●●                           (anomaly isolated in 1-2 cuts)
  ──────────────  cut 3
  ●●●●●                                (○ already alone)
  ──────────────  cut 4
  ●●●
  (still many points, need more cuts)

Anomaly score = average path length across many isolation trees. Short path = anomaly.

$$\text{score}(x) = 2^{-\frac{E[h(x)]}{c(n)}}$$

Where $E[h(x)]$ is the expected path length and $c(n)$ is a normalization factor. Score near 1 = anomaly, near 0.5 = normal.

ELI5: Imagine a game of “20 questions” with binary yes/no questions. To identify a normal person in a room, you need many specific questions to narrow down who they are. To identify someone wildly unusual (a clown in a library), one or two questions is enough: “Are you wearing face paint? Yes → anomaly!” Normal data needs many splits to isolate; anomalies isolate quickly.

Key parameters:

n_estimators: number of isolation trees (100-200 typical)
contamination: expected fraction of anomalies in data (used to set threshold)
max_samples: subsample size per tree (256 default — works well)

Works well for: high-dimensional data, no assumption about data distribution, fast O(n log n).

Random Cut Forest (RCF) — SageMaker Built-in

What Makes RCF Different

RCF is Amazon’s proprietary anomaly detection algorithm, available as a SageMaker built-in algorithm. It’s conceptually similar to Isolation Forest but with key differences suited for AWS production use:

Isolation Forest:                Random Cut Forest:
  ─ Uses axis-aligned cuts        ─ Uses random-direction cuts
  ─ Batch only                    ─ Streaming capable (online learning)
  ─ Static model                  ─ Model updates as new data arrives
  ─ Open-source (sklearn)         ─ AWS built-in, optimized for SageMaker

Streaming capability is the key differentiator: RCF can update its anomaly model as new data arrives, making it ideal for:

Real-time fraud detection
IoT sensor monitoring
Live server metric analysis
Time series anomaly detection (traffic spikes, unusual patterns)

Use RCF when:

On AWS SageMaker and need managed anomaly detection
Data arrives as a stream (not batch)
Time series with seasonal patterns (RCF handles this well)

Exam tip: When the scenario says “real-time streaming anomaly detection on AWS,” the answer is almost always RCF (SageMaker built-in).

Statistical Methods for Anomaly Detection

Z-Score Method

$$z = \frac{x - \mu}{\sigma}$$

Flag points where $|z| > 3$ (more than 3 standard deviations from mean).

Simple, interpretable, fast
Assumes normal (Gaussian) distribution
Fails when distribution is skewed or multi-modal
Only considers one feature at a time (ignores correlations)

Mahalanobis Distance

Generalization of Z-score that accounts for correlations between features.

$$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

Where Σ is the covariance matrix of the data.

Considers multivariate structure
Points far from the center in any direction are flagged
More robust than Z-score for correlated features
Assumes elliptical distribution (normal-like)

Local Outlier Factor (LOF)

Measures the local density of a point relative to its neighbors.

Core idea: a point is an outlier if its local density is much lower than its neighbors’
Assigns a score based on ratio of local densities
Score = 1: normal (similar density to neighbors)
Score » 1: outlier (much less dense region than neighbors)

Key advantage: detects anomalies relative to local context, not global distribution. A point in a sparse cluster isn’t flagged if its neighbors are also sparse.

Method	Assumption	Best For
Z-Score	Normal distribution, 1D	Simple, quick screening
Mahalanobis	Multivariate normal	Correlated features
LOF	Local density	Non-uniform density, local anomalies
Isolation Forest	None	High-dimensional, general purpose
RCF	None	Streaming, time series, AWS

Association Rule Learning

Market Basket Analysis

Find items that frequently appear together: “people who buy X also buy Y.”

Classic example: retail transaction analysis, recommendation systems, cross-selling.

Transactions:
  T1: {milk, bread, butter}
  T2: {beer, diapers}
  T3: {milk, diapers, beer}
  T4: {milk, bread, diapers}
  T5: {bread, butter}

Question: Which items appear together frequently?
Answer:   {milk, bread}, {diapers, beer} are frequent itemsets
Rule:     {diapers} → {beer} is a strong association rule

Key Metrics

Support: how often does this itemset appear? $$\text{Support}(X) = \frac{\text{count}(X)}{N}$$

Confidence: if someone buys X, how likely are they to buy Y? $$\text{Confidence}(X \to Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}$$

Lift: how much more likely is Y given X, compared to Y alone? $$\text{Lift}(X \to Y) = \frac{\text{Confidence}(X \to Y)}{\text{Support}(Y)}$$

Lift = 1: X and Y are independent (no association)
Lift > 1: X and Y are positively correlated (buy together more than chance)
Lift < 1: X and Y are negatively correlated (rarely bought together)

ELI5: If 80% of customers who buy diapers also buy beer, that’s high confidence. But if 80% of ALL customers buy beer, then the diaper-beer association has lift = 1 (no real insight). Lift > 1 means “diapers make beer purchases MORE likely than baseline.” That’s the actionable insight — stock them near each other, or cross-promote them.

Apriori Algorithm

Efficiently finds frequent itemsets by exploiting the anti-monotone property: if an itemset is infrequent, no superset of it can be frequent.

Step 1: Find frequent 1-itemsets (items meeting min_support)
  {milk: 3/5=0.6}, {bread: 3/5=0.6}, {diapers: 3/5=0.6}, {beer: 2/5=0.4}

Step 2: Combine frequent 1-itemsets to form candidate 2-itemsets
  Check: {milk,bread}, {milk,diapers}, {bread,diapers}, {beer,diapers}...

Step 3: Filter by min_support again
  Keep only frequent 2-itemsets → form 3-itemsets from those
  (Prune early: if {milk,butter} is infrequent, skip {milk,butter,anything})

The pruning makes Apriori much faster than brute force.

Quick Reference: Algorithm Selection Table

Task	Algorithm	When to Use	SageMaker Equivalent
Clustering (spherical)	K-Means	Large data, spherical clusters, need speed	K-Means (built-in)
Clustering (arbitrary shape)	DBSCAN	Unknown K, want noise detection, irregular shapes	— (use sklearn)
Clustering (hierarchy)	Hierarchical	Need dendrogram, small data, explore granularity	— (use sklearn)
Clustering (probabilistic)	GMM	Overlapping clusters, need probabilities	— (use sklearn)
Dim reduction (preprocessing)	PCA	Remove noise, fix multicollinearity, speed up training	PCA (built-in)
Dim reduction (visualization)	t-SNE	2D/3D visualization of clusters only	— (use sklearn)
Anomaly detection (batch)	Isolation Forest	General purpose, no distribution assumption	— (use sklearn)
Anomaly detection (streaming)	Random Cut Forest	Real-time, time series, AWS native	RCF (built-in)
Topic modeling	LDA	Discover topics in text documents	— (use sklearn/gensim)
Association rules	Apriori / FP-Growth	Market basket, cross-sell recommendations	— (use mlxtend)

SageMaker Built-in Unsupervised Algorithms

Algorithm	Type	Key Use Case
K-Means	Clustering	Customer segmentation, document clustering
PCA	Dimensionality reduction	Feature reduction before supervised learning
Random Cut Forest (RCF)	Anomaly detection	Real-time streaming anomaly detection
IP Insights	Anomaly detection	Detect unusual (user, IP) account behavior
Object2Vec	Embedding	Learn embeddings for any paired objects
LDA (Latent Dirichlet Allocation)	Topic modeling	Find topics in text corpus
Neural Topic Model (NTM)	Topic modeling	Alternative to LDA with neural network

Why This Matters for the Exam

Core MLS-C01 unsupervised learning exam topics:
K-Means vs DBSCAN: K-Means needs K upfront and assumes spherical clusters. DBSCAN detects noise and arbitrary shapes without K.
Elbow method + silhouette score: the two ways to choose K for K-Means.
PCA: must standardize features first. Keeps high-variance components. Use for preprocessing, not visualization of cluster structure.
t-SNE: visualization ONLY. Non-deterministic. Cannot use for feature engineering or out-of-sample prediction.
Random Cut Forest: SageMaker built-in for streaming anomaly detection. Know when to use it over Isolation Forest (streaming + AWS integration).
Association rules: support, confidence, lift definitions. Lift > 1 means genuine association.
Time series clustering: use dynamic time warping (DTW) distance, not Euclidean (sequences of different length/phase).

Domain 3C: Unsupervised Learning Deep Dive#

What is Unsupervised Learning? (First Principles)#

The Three Main Tasks#

Clustering Algorithms#

K-Means#

The Algorithm Step by Step#

Choosing K: The Elbow Method#

K-Means Limitations#

K-Means++ Initialization#

Mini-Batch K-Means#

Hierarchical Clustering#

The Concept: Building a Tree#

Linkage Methods: How to Measure Distance Between Clusters?#

When to Use Hierarchical vs K-Means#

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)#

The Core Idea: Clusters Are Dense Regions#

Parameters#

Point Types#

Strengths and Weaknesses#

Comparison: K-Means vs DBSCAN vs Hierarchical#

Gaussian Mixture Models (GMM)#

Soft Clustering with Probabilities#

Expectation-Maximization (EM) Algorithm#

When to Use GMM vs K-Means#

Dimensionality Reduction#

Why Reduce Dimensions?#

Principal Component Analysis (PCA) — Deep Dive#

First Principles: Find the Directions of Maximum Variance#

The Math (Intuition Level)#

Explained Variance: How Many Components?#

PCA Use Cases#

PCA Limitations#

t-SNE — For Visualization Only#

What It Does#

How It Works (Intuition)#

The Perplexity Parameter#

Critical Limitations — The Exam Traps#

t-SNE vs PCA#

Anomaly Detection#

What is Anomaly Detection?#

Isolation Forest#

The Core Idea#

Random Cut Forest (RCF) — SageMaker Built-in#

What Makes RCF Different#

Statistical Methods for Anomaly Detection#

Z-Score Method#

Mahalanobis Distance#

Local Outlier Factor (LOF)#

Association Rule Learning#

Market Basket Analysis#

Key Metrics#

Apriori Algorithm#

Quick Reference: Algorithm Selection Table#

SageMaker Built-in Unsupervised Algorithms#

Why This Matters for the Exam#

Domain 3C: Unsupervised Learning Deep Dive

What is Unsupervised Learning? (First Principles)

The Three Main Tasks

Clustering Algorithms

K-Means

The Algorithm Step by Step

Choosing K: The Elbow Method

K-Means Limitations

K-Means++ Initialization

Mini-Batch K-Means

Hierarchical Clustering

The Concept: Building a Tree

Linkage Methods: How to Measure Distance Between Clusters?

When to Use Hierarchical vs K-Means

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

The Core Idea: Clusters Are Dense Regions

Parameters

Point Types

Strengths and Weaknesses

Comparison: K-Means vs DBSCAN vs Hierarchical

Gaussian Mixture Models (GMM)

Soft Clustering with Probabilities

Expectation-Maximization (EM) Algorithm

When to Use GMM vs K-Means

Dimensionality Reduction

Why Reduce Dimensions?

Principal Component Analysis (PCA) — Deep Dive

First Principles: Find the Directions of Maximum Variance

The Math (Intuition Level)

Explained Variance: How Many Components?

PCA Use Cases

PCA Limitations

t-SNE — For Visualization Only

What It Does

How It Works (Intuition)

The Perplexity Parameter

Critical Limitations — The Exam Traps

t-SNE vs PCA

Anomaly Detection

What is Anomaly Detection?

Isolation Forest

The Core Idea

Random Cut Forest (RCF) — SageMaker Built-in

What Makes RCF Different

Statistical Methods for Anomaly Detection

Z-Score Method

Mahalanobis Distance

Local Outlier Factor (LOF)

Association Rule Learning

Market Basket Analysis

Key Metrics

Apriori Algorithm

Quick Reference: Algorithm Selection Table

SageMaker Built-in Unsupervised Algorithms

Why This Matters for the Exam