← AWS MLS-C01 — ML Specialty

Domain 3C: Unsupervised Learning Deep Dive

Table of Contents

Domain 3C: Unsupervised Learning Deep Dive

Exam Domain: 3 — Modeling (36%) Focus: Clustering, dimensionality reduction, anomaly detection — finding structure without labels


What is Unsupervised Learning? (First Principles)

In supervised learning, every training example comes with a label: this email is spam, this tumor is malignant, this transaction is fraud. The model learns to map inputs to those known outputs.

Unsupervised learning removes the labels entirely. The model must discover hidden structure, patterns, and groupings in the data on its own.

ELI5: Supervised learning is studying with an answer key — you know which photos are cats and which are dogs, and you learn the difference. Unsupervised learning is sorting a huge pile of unlabeled vacation photos into groups — you discover that there seem to be “beach photos,” “mountain photos,” and “city photos” groups, but nobody told you those categories existed. You invented them from the patterns you saw.

The Three Main Tasks

Unsupervised Learning
│
├── Clustering              → "Which things are similar?"
│   ├── K-Means             (centroid-based)
│   ├── Hierarchical        (tree of merges)
│   ├── DBSCAN              (density-based)
│   └── GMM                 (probabilistic)
│
├── Dimensionality Reduction → "How can I represent this more compactly?"
│   ├── PCA                 (linear, global structure)
│   ├── t-SNE               (non-linear, local structure, visualization only)
│   └── UMAP                (non-linear, faster than t-SNE)
│
└── Anomaly Detection       → "What's unusual?"
    ├── Isolation Forest    (path length)
    ├── Random Cut Forest   (SageMaker built-in)
    ├── LOF                 (density-based)
    └── Statistical methods (Z-score, Mahalanobis)

Clustering Algorithms

K-Means

The Algorithm Step by Step

K-Means finds K cluster centers (centroids) such that each point is assigned to its nearest centroid and the total within-cluster variance is minimized.

Step 0: Choose K=3, initialize 3 centroids randomly
        ┌─────────────────────────┐
        │  ●  ●    ×₁             │
        │     ●  ●    ×₂          │
        │  ●      ×₃  ●  ●        │
        └─────────────────────────┘
        (× = centroid, ● = data point)

Step 1: Assign each point to nearest centroid
        ┌─────────────────────────┐
        │  ●  ●    ×₁             │  ← blue cluster
        │     ●  ●    ×₂          │  ← green cluster
        │  ●      ×₃  ●  ●        │  ← red cluster
        └─────────────────────────┘

Step 2: Recalculate centroids (move to cluster mean)
        ┌─────────────────────────┐
        │  ● ×₁●                  │  ← centroid moved
        │     ●  ● ×₂             │  ← centroid moved
        │  ● ×₃    ●  ●           │  ← centroid moved
        └─────────────────────────┘

Step 3: Re-assign points to new centroids
Step 4: Repeat steps 2-3 until centroids stop moving

Convergence is guaranteed (the objective — total within-cluster sum of squares — decreases every iteration and is bounded below by 0). But convergence to a global minimum is not guaranteed.

Objective function (minimize):

$$\text{Inertia} = \sum_{k=1}^{K} \sum_{x \in C_k} |x - \mu_k|^2$$

Where $\mu_k$ is the centroid of cluster k.

ELI5: K-Means is like placing K magnets in a room full of iron balls. Each ball snaps to its nearest magnet. Then you pick up each magnet and move it to the center of gravity of the balls stuck to it. The balls re-snap. You repeat until the magnets stop moving. The final positions are the cluster centers.

Choosing K: The Elbow Method

Inertia (within-cluster sum of squares)
  │
  │╲
  │ ╲
  │  ╲
  │   ╲___
  │       ╲───
  │           ╲────────────────
  └────────────────────────────── K (number of clusters)
  K=1  K=2  K=3  K=4  K=5  K=6
              ↑
          "elbow" — adding more clusters stops helping much
          → choose K=3

The elbow is where adding more clusters gives diminishing returns. Note: the elbow isn’t always obvious — use silhouette score as a complement.

Silhouette score for a single point: $$s = \frac{b - a}{\max(a, b)}$$

Where:

  • $a$ = mean distance to points in same cluster (cohesion)
  • $b$ = mean distance to points in nearest other cluster (separation)
  • Score range: −1 to +1 (higher = better fit)

Average silhouette score across all points measures overall cluster quality. Best K = highest average silhouette score.

K-Means Limitations

K-Means works well:          K-Means fails:
  ●●●    ○○○                  ●●●●●●●●●●●●●●●●
  ●●●    ○○○                   (ring of ●s)
  ●●●    ○○○                  ●●●●●●●●●●
  (spherical, similar size,    (circle inside circle —
   well-separated)              K-Means draws straight lines)
  • Assumes spherical clusters (Euclidean distance-based)
  • Sensitive to initialization (different random starts → different results)
  • Must specify K in advance
  • Sensitive to outliers (outliers become their own “cluster” or pull centroids)
  • Struggles with clusters of very different sizes or densities

K-Means++ Initialization

Standard K-Means picks initial centroids randomly — bad luck means slow convergence or bad local minima.

K-Means++ seeds centroids intelligently:

  1. Pick first centroid uniformly at random
  2. For each subsequent centroid, pick a point with probability proportional to its squared distance from the nearest existing centroid
  3. Far-apart points are more likely to be chosen → better spread of initial centroids

Effect: faster convergence, better final solutions, almost always used in practice (sklearn default).

Mini-Batch K-Means

For very large datasets, compute centroid updates on random mini-batches instead of the full dataset. Faster training with slightly lower quality. Use when dataset has millions of points.


Hierarchical Clustering

The Concept: Building a Tree

Instead of assuming K clusters upfront, hierarchical clustering builds a tree of all possible merges (or splits) and lets you choose the level of granularity.

Agglomerative (bottom-up) — most common:

  1. Start: each point is its own cluster (N clusters)
  2. Find the two closest clusters
  3. Merge them into one
  4. Repeat until 1 cluster remains
  5. Cut the dendrogram at the desired level to get K clusters
ASCII Dendrogram:
                     ┌──────────────────────────┐
                     │                          │
             ┌───────┴───────┐        ┌─────────┴─────────┐
             │               │        │                   │
         ┌───┴───┐       ┌───┴───┐  ┌─┴─┐           ┌────┴────┐
         │       │       │       │  │   │           │         │
        [A]     [B]     [C]     [D][E] [F]         [G]       [H]

Cut here ─────────────────────────────────────────────────── → 2 clusters
Cut here ────────────────────────────────────  → 3 clusters
Cut here ──────────────────── → 4 clusters

The height of each merge in the dendrogram represents the distance between the merged clusters. Cuts at different heights give different numbers of clusters.

Linkage Methods: How to Measure Distance Between Clusters?

LinkageDistance Between ClustersEffectUse When
SingleNearest pair across clustersChaining (elongated clusters)Detecting connected structure
CompleteFurthest pair across clustersCompact, roughly equal-size clustersBalanced clusters needed
AverageMean of all cross-cluster distancesBalance between single and completeGeneral purpose
Ward’sMinimize increase in total varianceCompact, equal-size clustersMost common default

When to Use Hierarchical vs K-Means

K-MeansHierarchical
K specificationRequired upfrontChoose after seeing dendrogram
SpeedFast O(nKt)Slow O(n² log n)
Large datasetsGoodPoor (n > 10K gets slow)
InterpretabilityMediumHigh (visual dendrogram)
Cluster shapeSpherical onlyMore flexible
Multiple granularitiesRetrain for each KSingle run, cut anywhere

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

The Core Idea: Clusters Are Dense Regions

DBSCAN defines clusters as dense regions of points separated by sparse regions. It doesn’t try to assign every point to a cluster — sparse points are noise.

DBSCAN classification of points:

    ● ● ●              ● ●
   ●●●●●●●              ●●●
    ● ● ●    ·    ·      ●●
              ·       ● ●

Dense region    Noise   Dense region
= Cluster 1     (·)    = Cluster 2

Parameters

  • eps (ε): the radius of a point’s neighborhood — how far to look for neighbors
  • min_samples: minimum number of neighbors within eps to be a “core point”

Point Types

Core point:    has ≥ min_samples neighbors within eps radius
               ● ● ●         ← all core points (assuming min_samples=3)
                ●●●

Border point:  has < min_samples neighbors, but is within eps of a core point
               ○ ← border point (close to core points but sparse itself)

Noise point:   not a core point, not within eps of any core point
               · ← noise/outlier (isolated)

Algorithm:

  1. Find all core points
  2. Connect core points that are within eps of each other into clusters
  3. Assign border points to the nearest cluster they touch
  4. Label remaining points as noise (−1)

ELI5: DBSCAN looks for crowds. A cluster is a crowd of people standing close together. If you’re standing in a crowd (core point), you belong to that cluster. If you’re on the edge of a crowd (border point), you belong to the nearest crowd. If you’re standing alone in an empty field far from any crowd (noise point), you don’t belong to any cluster — you’re an outlier.

Strengths and Weaknesses

Strengths:

  • Discovers arbitrarily shaped clusters (not just spherical)
  • Automatically identifies outliers (no forced assignment)
  • Does not need K specified in advance
  • Robust to outliers (they become noise, not distort clusters)

Weaknesses:

  • Sensitive to eps — hard to choose (use k-distance plot)
  • Struggles with clusters of varying density (a global eps can’t fit all clusters)
  • Doesn’t scale well to high dimensions (distances become meaningless — curse of dimensionality)
  • For very large datasets, use HDBSCAN (hierarchical DBSCAN) instead

Comparison: K-Means vs DBSCAN vs Hierarchical

K-MeansDBSCANHierarchical
K required?YesNoNo (choose from dendrogram)
Handles noise?NoYesNo
Cluster shapeSphericalArbitraryDepends on linkage
SpeedFastModerateSlow
Large dataYesYes (with index)No
Varying densityNoNoDepends
InterpretabilityLowMediumHigh

Gaussian Mixture Models (GMM)

Soft Clustering with Probabilities

K-Means gives hard assignments — each point belongs to exactly one cluster. GMM gives soft assignments — each point has a probability of belonging to each cluster.

GMM models the data as a mixture of K Gaussian distributions. Each cluster has its own mean (μ), covariance (Σ), and mixture weight (π).

$$P(x) = \sum_{k=1}^{K} \pi_k \cdot \mathcal{N}(x | \mu_k, \Sigma_k)$$

K-Means (hard):             GMM (soft):
  Point A → Cluster 1       Point A → [C1: 70%, C2: 25%, C3: 5%]
  Point B → Cluster 2       Point B → [C1: 10%, C2: 85%, C3: 5%]
  (binary assignment)       (probabilistic assignment)

Expectation-Maximization (EM) Algorithm

GMM is fitted with EM:

  • E-step (Expectation): For each point, compute the probability it came from each cluster (given current parameters)
  • M-step (Maximization): Update cluster parameters (mean, covariance, weight) to maximize likelihood given those probabilities
  • Repeat until convergence

When to Use GMM vs K-Means

K-MeansGMM
AssignmentHardSoft (probabilistic)
Cluster shapeSphericalElliptical (full covariance)
Overlapping clustersNoYes
SpeedFasterSlower
Need probabilitiesNoYes
InterpretationSimplerRicher

Use GMM when: clusters overlap, you need probability of cluster membership, clusters are elliptical rather than spherical.


Dimensionality Reduction

Why Reduce Dimensions?

Problems with high-dimensional data:
  ┌──────────────────────────────────────────────────┐
  │ Curse of dimensionality → sparse data, distances  │
  │ Computation: O(n × d) for many operations         │
  │ Visualization: impossible beyond 3D               │
  │ Noise features hurt model performance             │
  │ Storage and memory cost                           │
  └──────────────────────────────────────────────────┘

Goals of dimensionality reduction:
  ┌──────────────────────────────────────────────────┐
  │ Visualization (2D/3D for human inspection)        │
  │ Noise removal (compress away noise dimensions)    │
  │ Speed up downstream algorithms                    │
  │ Remove multicollinearity for linear models        │
  │ Feature extraction (create new meaningful dims)   │
  └──────────────────────────────────────────────────┘

Principal Component Analysis (PCA) — Deep Dive

First Principles: Find the Directions of Maximum Variance

PCA finds a new coordinate system (principal components) aligned with the directions of maximum variance in the data.

Original 2D data:            After PCA:
  x₂                          PC2 (low variance)
  │    ●                       ↗
  │  ● ● ●               ● ● ●●●●●●●●●→ PC1 (high variance)
  │ ●●●●●●●         →
  │  ● ● ●
  │    ●
  └──────── x₁
            (correlated)      (PC1 captures most variance,
                               PC2 captures the rest)

The first principal component (PC1) is the direction along which the data varies the most. The second (PC2) is the direction of second-highest variance, constrained to be perpendicular to PC1. And so on.

The Math (Intuition Level)

  1. Center the data: subtract the mean from each feature → zero-centered
  2. Compute covariance matrix: $C = \frac{1}{n-1} X^T X$ — captures how features vary together
  3. Eigendecomposition: $C = V \Lambda V^T$
    • Eigenvectors V = directions of the principal components
    • Eigenvalues Λ = variance explained by each component
  4. Project: multiply centered data by top-K eigenvectors → new K-dimensional representation

You don’t need to memorize the math for the exam. Understand the concepts:

  • PCA finds directions of maximum variance
  • Principal components are uncorrelated (orthogonal)
  • Eigenvalue = variance explained by that component
  • Projecting onto fewer components = compression (lossy)

Explained Variance: How Many Components?

Explained Variance by Component:
  │
  │ 40%
  │ ████
  │ ████  25%
  │ ████  ████
  │ ████  ████  15%
  │ ████  ████  ████  10%  5%   3%   2%
  └──────────────────────────────────── Component
         PC1   PC2   PC3   PC4   PC5   PC6   PC7

Cumulative: 40% → 65% → 80% → 90% → 95% → 98% → 100%
                              ↑
                    Keep 4 components for 90% variance retained

Common rule: retain enough components to explain 95% of total variance. Use the elbow in the scree plot as an alternative guide.

ELI5: PCA finds the best “camera angle” that shows the most interesting variation in your data. If you’re photographing a long, flat building, the most informative view shows the length (PC1). The next best view shows the height (PC2). The depth (PC3) adds little because the building is flat. You keep only the views that add meaningful information, discarding dimensions that are mostly noise.

PCA Use Cases

  • Preprocessing for algorithms sensitive to high dimensions: reduce dimensions before feeding into logistic regression, KNN, SVM
  • Visualization: project to 2-3 dimensions to scatter-plot and explore clusters
  • Noise reduction: keep high-variance components, discard low-variance (noise) components
  • Removing multicollinearity: PCA components are orthogonal by construction — no correlation between them

PCA Limitations

  • Linear only: can’t capture non-linear structure (use t-SNE or UMAP for non-linear)
  • Components aren’t interpretable: PC1 is a weighted sum of all original features — you lose the ability to say “this is the ‘age’ dimension”
  • Sensitive to feature scaling: must standardize features before PCA (otherwise high-magnitude features dominate)
  • Information loss: projection is lossy — some original variance is discarded

t-SNE — For Visualization Only

What It Does

t-distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction technique designed primarily for 2D/3D visualization of high-dimensional data.

Where PCA preserves global structure (the overall geometry), t-SNE preserves local structure — nearby points in high dimensions stay nearby in 2D.

How It Works (Intuition)

  1. In high-dimensional space: model pairwise similarities as probabilities (nearby points = high probability of being “neighbors”)
  2. In 2D space: place points randomly, then move them so pairwise similarities match
  3. Uses a t-distribution (heavier tails) in 2D to avoid crowding — clusters can be separated
High-D space:              t-SNE 2D projection:
  ●●●  ●●●  ●●●             ●●●   ●●●
  ●●● (mixed)●●●     →       ●●●   ●●●    ●●●
  ●●●    ●●●                        ●●●
  (hard to see clusters)    (clusters pop out visually)

The Perplexity Parameter

Perplexity controls the balance between local and global structure (roughly: how many neighbors each point considers). Typical range: 5-50.

  • Low perplexity: tight local clusters, ignores global structure
  • High perplexity: broader view, may merge distinct clusters
  • Rule of thumb: perplexity ≈ √N or 30 for most datasets

Critical Limitations — The Exam Traps

Exam tip: t-SNE is for visualization only. Do NOT use it as a preprocessing step for machine learning models. Here’s why:

LimitationExplanation
Non-deterministicDifferent random seeds → different layouts. Not reproducible without fixing seed.
No out-of-sample extensionCannot transform new points without re-running on full dataset
Distances not meaningfulCluster sizes and inter-cluster distances in t-SNE plots are meaningless
Global structure distortedTwo clusters being far apart in t-SNE doesn’t mean they’re different
SlowO(n² log n), impractical for > 100K points (use UMAP instead)

t-SNE vs PCA

PCAt-SNE
TypeLinearNon-linear
Structure preservedGlobal (overall geometry)Local (neighborhoods)
Use for preprocessingYesNo
DeterministicYesNo (random initialization)
Out-of-sampleYes (apply same projection)No
SpeedFast O(nd²)Slow O(n² log n)
InterpretabilityComponents have meaningNo meaning
Best forPreprocessing, noise reductionVisualization only

ELI5: PCA takes the big picture view — it captures the overall shape and spread of your data, like a map from above. t-SNE zooms into the neighborhoods — it makes sure that things that were close together in high dimensions are still close together on the 2D plot. The catch is that t-SNE distorts the global picture to preserve the local one. It’s better for “seeing” clusters, but you can’t trust the distances between them.


Anomaly Detection

What is Anomaly Detection?

Find the data points that don’t fit the pattern — outliers, unusual events, rare failures.

Normal data:              Anomalies:
  ●●●●●●●●●                ●●●●●●●●●
  ●●●●●●●●●                ●●●●●●●●●
  ●●●●●●●●●                ●●●●●●●●●
  (dense cluster)           (sparse outliers → ○)
                           ●●● ○ ●●●●
                           ●●●●●● ○ ●

Use cases: fraud detection, network intrusion, manufacturing defects, server anomalies, medical anomalies.


Isolation Forest

The Core Idea

Normal data points require many random splits to isolate (they’re surrounded by many neighbors — you need lots of cuts to get them alone).

Anomalies are easy to isolate (they’re in sparse regions — just a few random cuts and they’re separated from everyone else).

Isolating a normal point:            Isolating an anomaly:
  ●●●●●●●●●●●●                         ●●●●●●●●●●●●○
  ──────────────  cut 1                 ─────────────── cut 1
  ●●●●●●●●●●●●
  ──────────────  cut 2
  ●●●●●●●●●●                           (anomaly isolated in 1-2 cuts)
  ──────────────  cut 3
  ●●●●●                                (○ already alone)
  ──────────────  cut 4
  ●●●
  (still many points, need more cuts)

Anomaly score = average path length across many isolation trees. Short path = anomaly.

$$\text{score}(x) = 2^{-\frac{E[h(x)]}{c(n)}}$$

Where $E[h(x)]$ is the expected path length and $c(n)$ is a normalization factor. Score near 1 = anomaly, near 0.5 = normal.

ELI5: Imagine a game of “20 questions” with binary yes/no questions. To identify a normal person in a room, you need many specific questions to narrow down who they are. To identify someone wildly unusual (a clown in a library), one or two questions is enough: “Are you wearing face paint? Yes → anomaly!” Normal data needs many splits to isolate; anomalies isolate quickly.

Key parameters:

  • n_estimators: number of isolation trees (100-200 typical)
  • contamination: expected fraction of anomalies in data (used to set threshold)
  • max_samples: subsample size per tree (256 default — works well)

Works well for: high-dimensional data, no assumption about data distribution, fast O(n log n).


Random Cut Forest (RCF) — SageMaker Built-in

What Makes RCF Different

RCF is Amazon’s proprietary anomaly detection algorithm, available as a SageMaker built-in algorithm. It’s conceptually similar to Isolation Forest but with key differences suited for AWS production use:

Isolation Forest:                Random Cut Forest:
  ─ Uses axis-aligned cuts        ─ Uses random-direction cuts
  ─ Batch only                    ─ Streaming capable (online learning)
  ─ Static model                  ─ Model updates as new data arrives
  ─ Open-source (sklearn)         ─ AWS built-in, optimized for SageMaker

Streaming capability is the key differentiator: RCF can update its anomaly model as new data arrives, making it ideal for:

  • Real-time fraud detection
  • IoT sensor monitoring
  • Live server metric analysis
  • Time series anomaly detection (traffic spikes, unusual patterns)

Use RCF when:

  • On AWS SageMaker and need managed anomaly detection
  • Data arrives as a stream (not batch)
  • Time series with seasonal patterns (RCF handles this well)

Exam tip: When the scenario says “real-time streaming anomaly detection on AWS,” the answer is almost always RCF (SageMaker built-in).


Statistical Methods for Anomaly Detection

Z-Score Method

$$z = \frac{x - \mu}{\sigma}$$

Flag points where $|z| > 3$ (more than 3 standard deviations from mean).

  • Simple, interpretable, fast
  • Assumes normal (Gaussian) distribution
  • Fails when distribution is skewed or multi-modal
  • Only considers one feature at a time (ignores correlations)

Mahalanobis Distance

Generalization of Z-score that accounts for correlations between features.

$$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

Where Σ is the covariance matrix of the data.

  • Considers multivariate structure
  • Points far from the center in any direction are flagged
  • More robust than Z-score for correlated features
  • Assumes elliptical distribution (normal-like)

Local Outlier Factor (LOF)

Measures the local density of a point relative to its neighbors.

  • Core idea: a point is an outlier if its local density is much lower than its neighbors’
  • Assigns a score based on ratio of local densities
  • Score = 1: normal (similar density to neighbors)
  • Score » 1: outlier (much less dense region than neighbors)

Key advantage: detects anomalies relative to local context, not global distribution. A point in a sparse cluster isn’t flagged if its neighbors are also sparse.

MethodAssumptionBest For
Z-ScoreNormal distribution, 1DSimple, quick screening
MahalanobisMultivariate normalCorrelated features
LOFLocal densityNon-uniform density, local anomalies
Isolation ForestNoneHigh-dimensional, general purpose
RCFNoneStreaming, time series, AWS

Association Rule Learning

Market Basket Analysis

Find items that frequently appear together: “people who buy X also buy Y.”

Classic example: retail transaction analysis, recommendation systems, cross-selling.

Transactions:
  T1: {milk, bread, butter}
  T2: {beer, diapers}
  T3: {milk, diapers, beer}
  T4: {milk, bread, diapers}
  T5: {bread, butter}

Question: Which items appear together frequently?
Answer:   {milk, bread}, {diapers, beer} are frequent itemsets
Rule:     {diapers} → {beer} is a strong association rule

Key Metrics

Support: how often does this itemset appear? $$\text{Support}(X) = \frac{\text{count}(X)}{N}$$

Confidence: if someone buys X, how likely are they to buy Y? $$\text{Confidence}(X \to Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}$$

Lift: how much more likely is Y given X, compared to Y alone? $$\text{Lift}(X \to Y) = \frac{\text{Confidence}(X \to Y)}{\text{Support}(Y)}$$

  • Lift = 1: X and Y are independent (no association)
  • Lift > 1: X and Y are positively correlated (buy together more than chance)
  • Lift < 1: X and Y are negatively correlated (rarely bought together)

ELI5: If 80% of customers who buy diapers also buy beer, that’s high confidence. But if 80% of ALL customers buy beer, then the diaper-beer association has lift = 1 (no real insight). Lift > 1 means “diapers make beer purchases MORE likely than baseline.” That’s the actionable insight — stock them near each other, or cross-promote them.

Apriori Algorithm

Efficiently finds frequent itemsets by exploiting the anti-monotone property: if an itemset is infrequent, no superset of it can be frequent.

Step 1: Find frequent 1-itemsets (items meeting min_support)
  {milk: 3/5=0.6}, {bread: 3/5=0.6}, {diapers: 3/5=0.6}, {beer: 2/5=0.4}

Step 2: Combine frequent 1-itemsets to form candidate 2-itemsets
  Check: {milk,bread}, {milk,diapers}, {bread,diapers}, {beer,diapers}...

Step 3: Filter by min_support again
  Keep only frequent 2-itemsets → form 3-itemsets from those
  (Prune early: if {milk,butter} is infrequent, skip {milk,butter,anything})

The pruning makes Apriori much faster than brute force.


Quick Reference: Algorithm Selection Table

TaskAlgorithmWhen to UseSageMaker Equivalent
Clustering (spherical)K-MeansLarge data, spherical clusters, need speedK-Means (built-in)
Clustering (arbitrary shape)DBSCANUnknown K, want noise detection, irregular shapes— (use sklearn)
Clustering (hierarchy)HierarchicalNeed dendrogram, small data, explore granularity— (use sklearn)
Clustering (probabilistic)GMMOverlapping clusters, need probabilities— (use sklearn)
Dim reduction (preprocessing)PCARemove noise, fix multicollinearity, speed up trainingPCA (built-in)
Dim reduction (visualization)t-SNE2D/3D visualization of clusters only— (use sklearn)
Anomaly detection (batch)Isolation ForestGeneral purpose, no distribution assumption— (use sklearn)
Anomaly detection (streaming)Random Cut ForestReal-time, time series, AWS nativeRCF (built-in)
Topic modelingLDADiscover topics in text documents— (use sklearn/gensim)
Association rulesApriori / FP-GrowthMarket basket, cross-sell recommendations— (use mlxtend)

SageMaker Built-in Unsupervised Algorithms

AlgorithmTypeKey Use Case
K-MeansClusteringCustomer segmentation, document clustering
PCADimensionality reductionFeature reduction before supervised learning
Random Cut Forest (RCF)Anomaly detectionReal-time streaming anomaly detection
IP InsightsAnomaly detectionDetect unusual (user, IP) account behavior
Object2VecEmbeddingLearn embeddings for any paired objects
LDA (Latent Dirichlet Allocation)Topic modelingFind topics in text corpus
Neural Topic Model (NTM)Topic modelingAlternative to LDA with neural network

Why This Matters for the Exam

Core MLS-C01 unsupervised learning exam topics:

  • K-Means vs DBSCAN: K-Means needs K upfront and assumes spherical clusters. DBSCAN detects noise and arbitrary shapes without K.
  • Elbow method + silhouette score: the two ways to choose K for K-Means.
  • PCA: must standardize features first. Keeps high-variance components. Use for preprocessing, not visualization of cluster structure.
  • t-SNE: visualization ONLY. Non-deterministic. Cannot use for feature engineering or out-of-sample prediction.
  • Random Cut Forest: SageMaker built-in for streaming anomaly detection. Know when to use it over Isolation Forest (streaming + AWS integration).
  • Association rules: support, confidence, lift definitions. Lift > 1 means genuine association.
  • Time series clustering: use dynamic time warping (DTW) distance, not Euclidean (sequences of different length/phase).