Domain 3C: Unsupervised Learning Deep Dive
Table of Contents
- Domain 3C: Unsupervised Learning Deep Dive
- What is Unsupervised Learning? (First Principles)
- Clustering Algorithms
- Dimensionality Reduction
- Anomaly Detection
- Association Rule Learning
- Quick Reference: Algorithm Selection Table
- Why This Matters for the Exam
Domain 3C: Unsupervised Learning Deep Dive
Exam Domain: 3 — Modeling (36%) Focus: Clustering, dimensionality reduction, anomaly detection — finding structure without labels
What is Unsupervised Learning? (First Principles)
In supervised learning, every training example comes with a label: this email is spam, this tumor is malignant, this transaction is fraud. The model learns to map inputs to those known outputs.
Unsupervised learning removes the labels entirely. The model must discover hidden structure, patterns, and groupings in the data on its own.
ELI5: Supervised learning is studying with an answer key — you know which photos are cats and which are dogs, and you learn the difference. Unsupervised learning is sorting a huge pile of unlabeled vacation photos into groups — you discover that there seem to be “beach photos,” “mountain photos,” and “city photos” groups, but nobody told you those categories existed. You invented them from the patterns you saw.
The Three Main Tasks
Unsupervised Learning
│
├── Clustering → "Which things are similar?"
│ ├── K-Means (centroid-based)
│ ├── Hierarchical (tree of merges)
│ ├── DBSCAN (density-based)
│ └── GMM (probabilistic)
│
├── Dimensionality Reduction → "How can I represent this more compactly?"
│ ├── PCA (linear, global structure)
│ ├── t-SNE (non-linear, local structure, visualization only)
│ └── UMAP (non-linear, faster than t-SNE)
│
└── Anomaly Detection → "What's unusual?"
├── Isolation Forest (path length)
├── Random Cut Forest (SageMaker built-in)
├── LOF (density-based)
└── Statistical methods (Z-score, Mahalanobis)
Clustering Algorithms
K-Means
The Algorithm Step by Step
K-Means finds K cluster centers (centroids) such that each point is assigned to its nearest centroid and the total within-cluster variance is minimized.
Step 0: Choose K=3, initialize 3 centroids randomly
┌─────────────────────────┐
│ ● ● ×₁ │
│ ● ● ×₂ │
│ ● ×₃ ● ● │
└─────────────────────────┘
(× = centroid, ● = data point)
Step 1: Assign each point to nearest centroid
┌─────────────────────────┐
│ ● ● ×₁ │ ← blue cluster
│ ● ● ×₂ │ ← green cluster
│ ● ×₃ ● ● │ ← red cluster
└─────────────────────────┘
Step 2: Recalculate centroids (move to cluster mean)
┌─────────────────────────┐
│ ● ×₁● │ ← centroid moved
│ ● ● ×₂ │ ← centroid moved
│ ● ×₃ ● ● │ ← centroid moved
└─────────────────────────┘
Step 3: Re-assign points to new centroids
Step 4: Repeat steps 2-3 until centroids stop moving
Convergence is guaranteed (the objective — total within-cluster sum of squares — decreases every iteration and is bounded below by 0). But convergence to a global minimum is not guaranteed.
Objective function (minimize):
$$\text{Inertia} = \sum_{k=1}^{K} \sum_{x \in C_k} |x - \mu_k|^2$$
Where $\mu_k$ is the centroid of cluster k.
ELI5: K-Means is like placing K magnets in a room full of iron balls. Each ball snaps to its nearest magnet. Then you pick up each magnet and move it to the center of gravity of the balls stuck to it. The balls re-snap. You repeat until the magnets stop moving. The final positions are the cluster centers.
Choosing K: The Elbow Method
Inertia (within-cluster sum of squares)
│
│╲
│ ╲
│ ╲
│ ╲___
│ ╲───
│ ╲────────────────
└────────────────────────────── K (number of clusters)
K=1 K=2 K=3 K=4 K=5 K=6
↑
"elbow" — adding more clusters stops helping much
→ choose K=3
The elbow is where adding more clusters gives diminishing returns. Note: the elbow isn’t always obvious — use silhouette score as a complement.
Silhouette score for a single point: $$s = \frac{b - a}{\max(a, b)}$$
Where:
- $a$ = mean distance to points in same cluster (cohesion)
- $b$ = mean distance to points in nearest other cluster (separation)
- Score range: −1 to +1 (higher = better fit)
Average silhouette score across all points measures overall cluster quality. Best K = highest average silhouette score.
K-Means Limitations
K-Means works well: K-Means fails:
●●● ○○○ ●●●●●●●●●●●●●●●●
●●● ○○○ (ring of ●s)
●●● ○○○ ●●●●●●●●●●
(spherical, similar size, (circle inside circle —
well-separated) K-Means draws straight lines)
- Assumes spherical clusters (Euclidean distance-based)
- Sensitive to initialization (different random starts → different results)
- Must specify K in advance
- Sensitive to outliers (outliers become their own “cluster” or pull centroids)
- Struggles with clusters of very different sizes or densities
K-Means++ Initialization
Standard K-Means picks initial centroids randomly — bad luck means slow convergence or bad local minima.
K-Means++ seeds centroids intelligently:
- Pick first centroid uniformly at random
- For each subsequent centroid, pick a point with probability proportional to its squared distance from the nearest existing centroid
- Far-apart points are more likely to be chosen → better spread of initial centroids
Effect: faster convergence, better final solutions, almost always used in practice (sklearn default).
Mini-Batch K-Means
For very large datasets, compute centroid updates on random mini-batches instead of the full dataset. Faster training with slightly lower quality. Use when dataset has millions of points.
Hierarchical Clustering
The Concept: Building a Tree
Instead of assuming K clusters upfront, hierarchical clustering builds a tree of all possible merges (or splits) and lets you choose the level of granularity.
Agglomerative (bottom-up) — most common:
- Start: each point is its own cluster (N clusters)
- Find the two closest clusters
- Merge them into one
- Repeat until 1 cluster remains
- Cut the dendrogram at the desired level to get K clusters
ASCII Dendrogram:
┌──────────────────────────┐
│ │
┌───────┴───────┐ ┌─────────┴─────────┐
│ │ │ │
┌───┴───┐ ┌───┴───┐ ┌─┴─┐ ┌────┴────┐
│ │ │ │ │ │ │ │
[A] [B] [C] [D][E] [F] [G] [H]
Cut here ─────────────────────────────────────────────────── → 2 clusters
Cut here ──────────────────────────────────── → 3 clusters
Cut here ──────────────────── → 4 clusters
The height of each merge in the dendrogram represents the distance between the merged clusters. Cuts at different heights give different numbers of clusters.
Linkage Methods: How to Measure Distance Between Clusters?
| Linkage | Distance Between Clusters | Effect | Use When |
|---|---|---|---|
| Single | Nearest pair across clusters | Chaining (elongated clusters) | Detecting connected structure |
| Complete | Furthest pair across clusters | Compact, roughly equal-size clusters | Balanced clusters needed |
| Average | Mean of all cross-cluster distances | Balance between single and complete | General purpose |
| Ward’s | Minimize increase in total variance | Compact, equal-size clusters | Most common default |
When to Use Hierarchical vs K-Means
| K-Means | Hierarchical | |
|---|---|---|
| K specification | Required upfront | Choose after seeing dendrogram |
| Speed | Fast O(nKt) | Slow O(n² log n) |
| Large datasets | Good | Poor (n > 10K gets slow) |
| Interpretability | Medium | High (visual dendrogram) |
| Cluster shape | Spherical only | More flexible |
| Multiple granularities | Retrain for each K | Single run, cut anywhere |
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
The Core Idea: Clusters Are Dense Regions
DBSCAN defines clusters as dense regions of points separated by sparse regions. It doesn’t try to assign every point to a cluster — sparse points are noise.
DBSCAN classification of points:
● ● ● ● ●
●●●●●●● ●●●
● ● ● · · ●●
· ● ●
Dense region Noise Dense region
= Cluster 1 (·) = Cluster 2
Parameters
- eps (ε): the radius of a point’s neighborhood — how far to look for neighbors
- min_samples: minimum number of neighbors within eps to be a “core point”
Point Types
Core point: has ≥ min_samples neighbors within eps radius
● ● ● ← all core points (assuming min_samples=3)
●●●
Border point: has < min_samples neighbors, but is within eps of a core point
○ ← border point (close to core points but sparse itself)
Noise point: not a core point, not within eps of any core point
· ← noise/outlier (isolated)
Algorithm:
- Find all core points
- Connect core points that are within eps of each other into clusters
- Assign border points to the nearest cluster they touch
- Label remaining points as noise (−1)
ELI5: DBSCAN looks for crowds. A cluster is a crowd of people standing close together. If you’re standing in a crowd (core point), you belong to that cluster. If you’re on the edge of a crowd (border point), you belong to the nearest crowd. If you’re standing alone in an empty field far from any crowd (noise point), you don’t belong to any cluster — you’re an outlier.
Strengths and Weaknesses
Strengths:
- Discovers arbitrarily shaped clusters (not just spherical)
- Automatically identifies outliers (no forced assignment)
- Does not need K specified in advance
- Robust to outliers (they become noise, not distort clusters)
Weaknesses:
- Sensitive to eps — hard to choose (use k-distance plot)
- Struggles with clusters of varying density (a global eps can’t fit all clusters)
- Doesn’t scale well to high dimensions (distances become meaningless — curse of dimensionality)
- For very large datasets, use HDBSCAN (hierarchical DBSCAN) instead
Comparison: K-Means vs DBSCAN vs Hierarchical
| K-Means | DBSCAN | Hierarchical | |
|---|---|---|---|
| K required? | Yes | No | No (choose from dendrogram) |
| Handles noise? | No | Yes | No |
| Cluster shape | Spherical | Arbitrary | Depends on linkage |
| Speed | Fast | Moderate | Slow |
| Large data | Yes | Yes (with index) | No |
| Varying density | No | No | Depends |
| Interpretability | Low | Medium | High |
Gaussian Mixture Models (GMM)
Soft Clustering with Probabilities
K-Means gives hard assignments — each point belongs to exactly one cluster. GMM gives soft assignments — each point has a probability of belonging to each cluster.
GMM models the data as a mixture of K Gaussian distributions. Each cluster has its own mean (μ), covariance (Σ), and mixture weight (π).
$$P(x) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(x | \mu_k, \Sigma_k)$$
K-Means (hard): GMM (soft):
Point A → Cluster 1 Point A → [C1: 70%, C2: 25%, C3: 5%]
Point B → Cluster 2 Point B → [C1: 10%, C2: 85%, C3: 5%]
(binary assignment) (probabilistic assignment)
Expectation-Maximization (EM) Algorithm
GMM is fitted with EM:
- E-step (Expectation): For each point, compute the probability it came from each cluster (given current parameters)
- M-step (Maximization): Update cluster parameters (mean, covariance, weight) to maximize likelihood given those probabilities
- Repeat until convergence
When to Use GMM vs K-Means
| K-Means | GMM | |
|---|---|---|
| Assignment | Hard | Soft (probabilistic) |
| Cluster shape | Spherical | Elliptical (full covariance) |
| Overlapping clusters | No | Yes |
| Speed | Faster | Slower |
| Need probabilities | No | Yes |
| Interpretation | Simpler | Richer |
Use GMM when: clusters overlap, you need probability of cluster membership, clusters are elliptical rather than spherical.
Dimensionality Reduction
Why Reduce Dimensions?
Problems with high-dimensional data:
┌──────────────────────────────────────────────────┐
│ Curse of dimensionality → sparse data, distances │
│ Computation: O(n × d) for many operations │
│ Visualization: impossible beyond 3D │
│ Noise features hurt model performance │
│ Storage and memory cost │
└──────────────────────────────────────────────────┘
Goals of dimensionality reduction:
┌──────────────────────────────────────────────────┐
│ Visualization (2D/3D for human inspection) │
│ Noise removal (compress away noise dimensions) │
│ Speed up downstream algorithms │
│ Remove multicollinearity for linear models │
│ Feature extraction (create new meaningful dims) │
└──────────────────────────────────────────────────┘
Principal Component Analysis (PCA) — Deep Dive
First Principles: Find the Directions of Maximum Variance
PCA finds a new coordinate system (principal components) aligned with the directions of maximum variance in the data.
Original 2D data: After PCA:
x₂ PC2 (low variance)
│ ● ↗
│ ● ● ● ● ● ●●●●●●●●●→ PC1 (high variance)
│ ●●●●●●● →
│ ● ● ●
│ ●
└──────── x₁
(correlated) (PC1 captures most variance,
PC2 captures the rest)
The first principal component (PC1) is the direction along which the data varies the most. The second (PC2) is the direction of second-highest variance, constrained to be perpendicular to PC1. And so on.
The Math (Intuition Level)
- Center the data: subtract the mean from each feature → zero-centered
- Compute covariance matrix: $C = \frac{1}{n-1} X^T X$ — captures how features vary together
- Eigendecomposition: $C = V \Lambda V^T$
- Eigenvectors V = directions of the principal components
- Eigenvalues Λ = variance explained by each component
- Project: multiply centered data by top-K eigenvectors → new K-dimensional representation
You don’t need to memorize the math for the exam. Understand the concepts:
- PCA finds directions of maximum variance
- Principal components are uncorrelated (orthogonal)
- Eigenvalue = variance explained by that component
- Projecting onto fewer components = compression (lossy)
Explained Variance: How Many Components?
Explained Variance by Component:
│
│ 40%
│ ████
│ ████ 25%
│ ████ ████
│ ████ ████ 15%
│ ████ ████ ████ 10% 5% 3% 2%
└──────────────────────────────────── Component
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Cumulative: 40% → 65% → 80% → 90% → 95% → 98% → 100%
↑
Keep 4 components for 90% variance retained
Common rule: retain enough components to explain 95% of total variance. Use the elbow in the scree plot as an alternative guide.
ELI5: PCA finds the best “camera angle” that shows the most interesting variation in your data. If you’re photographing a long, flat building, the most informative view shows the length (PC1). The next best view shows the height (PC2). The depth (PC3) adds little because the building is flat. You keep only the views that add meaningful information, discarding dimensions that are mostly noise.
PCA Use Cases
- Preprocessing for algorithms sensitive to high dimensions: reduce dimensions before feeding into logistic regression, KNN, SVM
- Visualization: project to 2-3 dimensions to scatter-plot and explore clusters
- Noise reduction: keep high-variance components, discard low-variance (noise) components
- Removing multicollinearity: PCA components are orthogonal by construction — no correlation between them
PCA Limitations
- Linear only: can’t capture non-linear structure (use t-SNE or UMAP for non-linear)
- Components aren’t interpretable: PC1 is a weighted sum of all original features — you lose the ability to say “this is the ‘age’ dimension”
- Sensitive to feature scaling: must standardize features before PCA (otherwise high-magnitude features dominate)
- Information loss: projection is lossy — some original variance is discarded
t-SNE — For Visualization Only
What It Does
t-distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction technique designed primarily for 2D/3D visualization of high-dimensional data.
Where PCA preserves global structure (the overall geometry), t-SNE preserves local structure — nearby points in high dimensions stay nearby in 2D.
How It Works (Intuition)
- In high-dimensional space: model pairwise similarities as probabilities (nearby points = high probability of being “neighbors”)
- In 2D space: place points randomly, then move them so pairwise similarities match
- Uses a t-distribution (heavier tails) in 2D to avoid crowding — clusters can be separated
High-D space: t-SNE 2D projection:
●●● ●●● ●●● ●●● ●●●
●●● (mixed)●●● → ●●● ●●● ●●●
●●● ●●● ●●●
(hard to see clusters) (clusters pop out visually)
The Perplexity Parameter
Perplexity controls the balance between local and global structure (roughly: how many neighbors each point considers). Typical range: 5-50.
- Low perplexity: tight local clusters, ignores global structure
- High perplexity: broader view, may merge distinct clusters
- Rule of thumb: perplexity ≈ √N or 30 for most datasets
Critical Limitations — The Exam Traps
Exam tip: t-SNE is for visualization only. Do NOT use it as a preprocessing step for machine learning models. Here’s why:
| Limitation | Explanation |
|---|---|
| Non-deterministic | Different random seeds → different layouts. Not reproducible without fixing seed. |
| No out-of-sample extension | Cannot transform new points without re-running on full dataset |
| Distances not meaningful | Cluster sizes and inter-cluster distances in t-SNE plots are meaningless |
| Global structure distorted | Two clusters being far apart in t-SNE doesn’t mean they’re different |
| Slow | O(n² log n), impractical for > 100K points (use UMAP instead) |
t-SNE vs PCA
| PCA | t-SNE | |
|---|---|---|
| Type | Linear | Non-linear |
| Structure preserved | Global (overall geometry) | Local (neighborhoods) |
| Use for preprocessing | Yes | No |
| Deterministic | Yes | No (random initialization) |
| Out-of-sample | Yes (apply same projection) | No |
| Speed | Fast O(nd²) | Slow O(n² log n) |
| Interpretability | Components have meaning | No meaning |
| Best for | Preprocessing, noise reduction | Visualization only |
ELI5: PCA takes the big picture view — it captures the overall shape and spread of your data, like a map from above. t-SNE zooms into the neighborhoods — it makes sure that things that were close together in high dimensions are still close together on the 2D plot. The catch is that t-SNE distorts the global picture to preserve the local one. It’s better for “seeing” clusters, but you can’t trust the distances between them.
Anomaly Detection
What is Anomaly Detection?
Find the data points that don’t fit the pattern — outliers, unusual events, rare failures.
Normal data: Anomalies:
●●●●●●●●● ●●●●●●●●●
●●●●●●●●● ●●●●●●●●●
●●●●●●●●● ●●●●●●●●●
(dense cluster) (sparse outliers → ○)
●●● ○ ●●●●
●●●●●● ○ ●
Use cases: fraud detection, network intrusion, manufacturing defects, server anomalies, medical anomalies.
Isolation Forest
The Core Idea
Normal data points require many random splits to isolate (they’re surrounded by many neighbors — you need lots of cuts to get them alone).
Anomalies are easy to isolate (they’re in sparse regions — just a few random cuts and they’re separated from everyone else).
Isolating a normal point: Isolating an anomaly:
●●●●●●●●●●●● ●●●●●●●●●●●●○
────────────── cut 1 ─────────────── cut 1
●●●●●●●●●●●●
────────────── cut 2
●●●●●●●●●● (anomaly isolated in 1-2 cuts)
────────────── cut 3
●●●●● (○ already alone)
────────────── cut 4
●●●
(still many points, need more cuts)
Anomaly score = average path length across many isolation trees. Short path = anomaly.
$$\text{score}(x) = 2^{-\frac{E[h(x)]}{c(n)}}$$
Where $E[h(x)]$ is the expected path length and $c(n)$ is a normalization factor. Score near 1 = anomaly, near 0.5 = normal.
ELI5: Imagine a game of “20 questions” with binary yes/no questions. To identify a normal person in a room, you need many specific questions to narrow down who they are. To identify someone wildly unusual (a clown in a library), one or two questions is enough: “Are you wearing face paint? Yes → anomaly!” Normal data needs many splits to isolate; anomalies isolate quickly.
Key parameters:
n_estimators: number of isolation trees (100-200 typical)contamination: expected fraction of anomalies in data (used to set threshold)max_samples: subsample size per tree (256 default — works well)
Works well for: high-dimensional data, no assumption about data distribution, fast O(n log n).
Random Cut Forest (RCF) — SageMaker Built-in
What Makes RCF Different
RCF is Amazon’s proprietary anomaly detection algorithm, available as a SageMaker built-in algorithm. It’s conceptually similar to Isolation Forest but with key differences suited for AWS production use:
Isolation Forest: Random Cut Forest:
─ Uses axis-aligned cuts ─ Uses random-direction cuts
─ Batch only ─ Streaming capable (online learning)
─ Static model ─ Model updates as new data arrives
─ Open-source (sklearn) ─ AWS built-in, optimized for SageMaker
Streaming capability is the key differentiator: RCF can update its anomaly model as new data arrives, making it ideal for:
- Real-time fraud detection
- IoT sensor monitoring
- Live server metric analysis
- Time series anomaly detection (traffic spikes, unusual patterns)
Use RCF when:
- On AWS SageMaker and need managed anomaly detection
- Data arrives as a stream (not batch)
- Time series with seasonal patterns (RCF handles this well)
Exam tip: When the scenario says “real-time streaming anomaly detection on AWS,” the answer is almost always RCF (SageMaker built-in).
Statistical Methods for Anomaly Detection
Z-Score Method
$$z = \frac{x - \mu}{\sigma}$$
Flag points where $|z| > 3$ (more than 3 standard deviations from mean).
- Simple, interpretable, fast
- Assumes normal (Gaussian) distribution
- Fails when distribution is skewed or multi-modal
- Only considers one feature at a time (ignores correlations)
Mahalanobis Distance
Generalization of Z-score that accounts for correlations between features.
$$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$
Where Σ is the covariance matrix of the data.
- Considers multivariate structure
- Points far from the center in any direction are flagged
- More robust than Z-score for correlated features
- Assumes elliptical distribution (normal-like)
Local Outlier Factor (LOF)
Measures the local density of a point relative to its neighbors.
- Core idea: a point is an outlier if its local density is much lower than its neighbors’
- Assigns a score based on ratio of local densities
- Score = 1: normal (similar density to neighbors)
- Score » 1: outlier (much less dense region than neighbors)
Key advantage: detects anomalies relative to local context, not global distribution. A point in a sparse cluster isn’t flagged if its neighbors are also sparse.
| Method | Assumption | Best For |
|---|---|---|
| Z-Score | Normal distribution, 1D | Simple, quick screening |
| Mahalanobis | Multivariate normal | Correlated features |
| LOF | Local density | Non-uniform density, local anomalies |
| Isolation Forest | None | High-dimensional, general purpose |
| RCF | None | Streaming, time series, AWS |
Association Rule Learning
Market Basket Analysis
Find items that frequently appear together: “people who buy X also buy Y.”
Classic example: retail transaction analysis, recommendation systems, cross-selling.
Transactions:
T1: {milk, bread, butter}
T2: {beer, diapers}
T3: {milk, diapers, beer}
T4: {milk, bread, diapers}
T5: {bread, butter}
Question: Which items appear together frequently?
Answer: {milk, bread}, {diapers, beer} are frequent itemsets
Rule: {diapers} → {beer} is a strong association rule
Key Metrics
Support: how often does this itemset appear? $$\text{Support}(X) = \frac{\text{count}(X)}{N}$$
Confidence: if someone buys X, how likely are they to buy Y? $$\text{Confidence}(X \to Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}$$
Lift: how much more likely is Y given X, compared to Y alone? $$\text{Lift}(X \to Y) = \frac{\text{Confidence}(X \to Y)}{\text{Support}(Y)}$$
- Lift = 1: X and Y are independent (no association)
- Lift > 1: X and Y are positively correlated (buy together more than chance)
- Lift < 1: X and Y are negatively correlated (rarely bought together)
ELI5: If 80% of customers who buy diapers also buy beer, that’s high confidence. But if 80% of ALL customers buy beer, then the diaper-beer association has lift = 1 (no real insight). Lift > 1 means “diapers make beer purchases MORE likely than baseline.” That’s the actionable insight — stock them near each other, or cross-promote them.
Apriori Algorithm
Efficiently finds frequent itemsets by exploiting the anti-monotone property: if an itemset is infrequent, no superset of it can be frequent.
Step 1: Find frequent 1-itemsets (items meeting min_support)
{milk: 3/5=0.6}, {bread: 3/5=0.6}, {diapers: 3/5=0.6}, {beer: 2/5=0.4}
Step 2: Combine frequent 1-itemsets to form candidate 2-itemsets
Check: {milk,bread}, {milk,diapers}, {bread,diapers}, {beer,diapers}...
Step 3: Filter by min_support again
Keep only frequent 2-itemsets → form 3-itemsets from those
(Prune early: if {milk,butter} is infrequent, skip {milk,butter,anything})
The pruning makes Apriori much faster than brute force.
Quick Reference: Algorithm Selection Table
| Task | Algorithm | When to Use | SageMaker Equivalent |
|---|---|---|---|
| Clustering (spherical) | K-Means | Large data, spherical clusters, need speed | K-Means (built-in) |
| Clustering (arbitrary shape) | DBSCAN | Unknown K, want noise detection, irregular shapes | — (use sklearn) |
| Clustering (hierarchy) | Hierarchical | Need dendrogram, small data, explore granularity | — (use sklearn) |
| Clustering (probabilistic) | GMM | Overlapping clusters, need probabilities | — (use sklearn) |
| Dim reduction (preprocessing) | PCA | Remove noise, fix multicollinearity, speed up training | PCA (built-in) |
| Dim reduction (visualization) | t-SNE | 2D/3D visualization of clusters only | — (use sklearn) |
| Anomaly detection (batch) | Isolation Forest | General purpose, no distribution assumption | — (use sklearn) |
| Anomaly detection (streaming) | Random Cut Forest | Real-time, time series, AWS native | RCF (built-in) |
| Topic modeling | LDA | Discover topics in text documents | — (use sklearn/gensim) |
| Association rules | Apriori / FP-Growth | Market basket, cross-sell recommendations | — (use mlxtend) |
SageMaker Built-in Unsupervised Algorithms
| Algorithm | Type | Key Use Case |
|---|---|---|
| K-Means | Clustering | Customer segmentation, document clustering |
| PCA | Dimensionality reduction | Feature reduction before supervised learning |
| Random Cut Forest (RCF) | Anomaly detection | Real-time streaming anomaly detection |
| IP Insights | Anomaly detection | Detect unusual (user, IP) account behavior |
| Object2Vec | Embedding | Learn embeddings for any paired objects |
| LDA (Latent Dirichlet Allocation) | Topic modeling | Find topics in text corpus |
| Neural Topic Model (NTM) | Topic modeling | Alternative to LDA with neural network |
Why This Matters for the Exam
Core MLS-C01 unsupervised learning exam topics:
- K-Means vs DBSCAN: K-Means needs K upfront and assumes spherical clusters. DBSCAN detects noise and arbitrary shapes without K.
- Elbow method + silhouette score: the two ways to choose K for K-Means.
- PCA: must standardize features first. Keeps high-variance components. Use for preprocessing, not visualization of cluster structure.
- t-SNE: visualization ONLY. Non-deterministic. Cannot use for feature engineering or out-of-sample prediction.
- Random Cut Forest: SageMaker built-in for streaming anomaly detection. Know when to use it over Isolation Forest (streaming + AWS integration).
- Association rules: support, confidence, lift definitions. Lift > 1 means genuine association.
- Time series clustering: use dynamic time warping (DTW) distance, not Euclidean (sequences of different length/phase).