Domain 3E: SageMaker Built-In Algorithms

SageMaker Built-In Algorithms

Exam Domain: 3 (ML Model Development)
Task: Choose the right SageMaker built-in algorithm for a given scenario

Algorithm Selection Flowchart

What type of problem?
│
├─ Predict a NUMBER? ──────────────────────────────── REGRESSION
│   ├─ Linear relationship, tabular data?            → Linear Learner
│   ├─ Complex patterns, tabular data?               → XGBoost
│   ├─ Time series, many related series?             → DeepAR
│   └─ Sparse feature interactions?                  → Factorization Machines
│
├─ Predict a CATEGORY? ────────────────────────────── CLASSIFICATION
│   ├─ Tabular, binary/multi-class?                  → XGBoost
│   ├─ Tabular, simple/linear?                       → Linear Learner
│   ├─ High-dimensional sparse (clicks)?             → Factorization Machines
│   ├─ Text, multi-class?                            → BlazingText (supervised)
│   ├─ Image, whole image label?                     → Image Classification
│   ├─ Image, find objects with boxes?               → Object Detection
│   └─ Image, every pixel labeled?                   → Semantic Segmentation
│
├─ FORECAST future values? ────────────────────────── TIME SERIES
│   └─ Multiple related time series?                 → DeepAR
│
├─ Find GROUPS in data? ───────────────────────────── CLUSTERING
│   └─ K groups of similar items?                    → K-Means
│
├─ REDUCE dimensions? ─────────────────────────────── DIM REDUCTION
│   └─ Linear projection, variance-preserving?       → PCA
│
├─ Detect ANOMALIES? ──────────────────────────────── ANOMALY DETECTION
│   ├─ Numeric / streaming time series?              → Random Cut Forest
│   └─ IP address / login patterns?                  → IP Insights
│
├─ NLP / TEXT tasks? ──────────────────────────────── NLP
│   ├─ Word embeddings (semantic similarity)?        → BlazingText (Word2Vec)
│   ├─ Text classification (categories)?             → BlazingText (supervised)
│   ├─ Topic discovery in corpus?                    → NTM (neural) or LDA
│   ├─ Sequence translation / summarization?         → Seq2Seq
│   └─ Pair embeddings (similarity)?                 → Object2Vec
│
└─ RECOMMENDATIONS? ───────────────────────────────── RECOMMENDATIONS
    ├─ Sparse interactions (user-item clicks)?       → Factorization Machines
    └─ Learn embeddings for object pairs?            → Object2Vec

Why this matters for the exam: Always identify the problem type first, then narrow by data characteristics. The most common trap is jumping to an algorithm (e.g., “it’s a neural network problem so use deep learning”) before classifying the task. Classify the task first — the algorithm often follows naturally.
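The "classify the task first" habit can be sketched as a two-step lookup. This is only an illustration of a few flowchart branches: the key names (`"tabular-complex"`, `"ip-logins"`, and so on) and the function `choose_algorithm` are hypothetical labels invented for this example, not SageMaker identifiers.

```python
# Hypothetical lookup mirroring a few branches of the flowchart above.
# Keys are (problem type, data trait); both vocabularies are made up here.
ALGORITHM_BY_TASK = {
    ("regression", "tabular-linear"): "Linear Learner",
    ("regression", "tabular-complex"): "XGBoost",
    ("classification", "tabular"): "XGBoost",
    ("classification", "text"): "BlazingText (supervised)",
    ("classification", "image-label"): "Image Classification",
    ("classification", "image-boxes"): "Object Detection",
    ("classification", "image-pixels"): "Semantic Segmentation",
    ("forecasting", "related-series"): "DeepAR",
    ("clustering", "k-groups"): "K-Means",
    ("dimensionality-reduction", "linear"): "PCA",
    ("anomaly-detection", "numeric"): "Random Cut Forest",
    ("anomaly-detection", "ip-logins"): "IP Insights",
    ("recommendation", "sparse-interactions"): "Factorization Machines",
}


def choose_algorithm(problem_type: str, data_trait: str) -> str:
    """Step 1: name the problem type. Step 2: narrow by data characteristics."""
    key = (problem_type, data_trait)
    if key not in ALGORITHM_BY_TASK:
        raise KeyError(f"No flowchart branch for {key!r}")
    return ALGORITHM_BY_TASK[key]
```

Forcing the problem type to be the first argument is the point: the algorithm name only appears after the task has been classified.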

Complete Algorithm Reference

Supervised — Regression / Classification

Linear Learner

Aspect	Detail
Type	Supervised
Problem	Binary/multi-class classification, regression
How it works	Trains a linear model (logistic/linear regression) with stochastic gradient descent; automatically normalizes data; trains multiple models with varied hyperparameters in parallel, selects best
Input format	RecordIO-protobuf (float32) or CSV
Key hyperparameters	`predictor_type` (regressor / binary_classifier / multiclass_classifier), `learning_rate`, `l1`, `wd` (L2), `mini_batch_size`
Instance type	CPU (ml.m5) recommended; supports multi-instance

ELI5: Linear Learner draws the best straight line (or flat plane) through your data to make predictions. If you’re predicting house prices, it finds the line that best fits “price goes up as square footage goes up.” Simple, fast, and surprisingly effective when your data actually has a roughly linear relationship. Always use it as your baseline — if a fancier model can’t beat it, the fancy model isn’t worth the complexity.
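To make "draws the best straight line with stochastic gradient descent" concrete, here is a minimal single-feature sketch. It is not the Linear Learner implementation (which trains many models in parallel and supports regularization); `fit_line_sgd` and its defaults are assumptions chosen for the example. The feature is normalized before training, echoing the automatic normalization mentioned in the table.

```python
def fit_line_sgd(xs, ys, learning_rate=0.05, epochs=500):
    """Fit y ≈ slope * x + intercept by per-sample gradient steps on squared error."""
    # Normalize the feature so one learning rate works regardless of its scale.
    mean_x = sum(xs) / len(xs)
    std_x = (sum((x - mean_x) ** 2 for x in xs) / len(xs)) ** 0.5 or 1.0
    zs = [(x - mean_x) / std_x for x in xs]

    w, b = 0.0, 0.0
    for _ in range(epochs):
        for z, y in zip(zs, ys):
            error = (w * z + b) - y        # prediction minus target
            w -= learning_rate * error * z  # gradient of 0.5 * error**2 w.r.t. w
            b -= learning_rate * error      # gradient of 0.5 * error**2 w.r.t. b

    # Map the weights back to the original (un-normalized) feature scale.
    slope = w / std_x
    intercept = b - slope * mean_x
    return slope, intercept
```

On noise-free data generated from y = 2x + 1, the fitted slope and intercept approach 2 and 1.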

Use when: simple/linear relationships, fast baseline, interpretability needed, large datasets where simplicity is a feature.

Do NOT use when: complex non-linear interactions, high-dimensional sparse data (use Factorization Machines), image/text data.

XGBoost

Aspect	Detail
Type	Supervised
Problem	Classification, regression, ranking
How it works	Gradient-boosted decision trees: trains trees sequentially, each correcting residuals of the previous ensemble; includes L1/L2 regularization to prevent overfitting; handles missing values natively by learning the best default direction
Input format	CSV, LibSVM, RecordIO-protobuf, Parquet
Key hyperparameters	`num_round` (# trees), `max_depth` (tree depth), `eta` (learning rate), `subsample` (row fraction), `colsample_bytree` (feature fraction), `objective`, `min_child_weight`
Instance type	CPU (ml.m5) recommended; GPU possible with `tree_method=gpu_hist`

XGBoost Boosting Process:

  Round 1: Tree₁ ─────────────────────────► η·Tree₁
                  ↓ compute residuals
  Round 2: Tree₂ trains on residuals ──────► η·Tree₁ + η·Tree₂
                  ↓ compute new residuals
  Round 3: Tree₃ trains on residuals ──────► η·Tree₁ + η·Tree₂ + η·Tree₃
  ...
  Round N: Final = Σ η·Treeᵢ

  η = learning rate (eta) controls each tree's contribution
  max_depth controls how complex each tree is

Key hyperparameters deep dive:

num_round: more rounds → lower training error, but risk of overfitting. Tune with early stopping.
max_depth: shallower trees (3-6) → more regularization. Deeper trees → more complex patterns.
eta: smaller = slower learning but more robust. Default 0.3, typical range 0.01-0.3.
subsample: < 1.0 → random row sampling each tree → stochastic boosting → regularization.
objective: reg:squarederror for regression, binary:logistic for binary, multi:softmax for multi-class.
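The roles of `num_round` and `eta` are easier to see in a toy version of the boosting loop. The sketch below is a deliberate simplification and not XGBoost itself: it uses depth-1 trees on a single feature, fits plain residuals (which matches squared-error loss only), and has no regularization or subsampling. The helper names `fit_stump` and `boost` are invented for this example.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree on one numeric feature (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, num_round=50, eta=0.3):
    """Each round fits a stump to the current residuals and adds eta times its output."""
    trees = []
    predictions = [0.0] * len(xs)
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + eta * tree(x) for p, x in zip(predictions, xs)]
    # Final model: the sum of eta-scaled trees.
    return lambda x: sum(eta * tree(x) for tree in trees)
```

With `eta=0.3` each round removes 30% of the remaining residual on this kind of data, which is why a smaller `eta` needs a larger `num_round` to reach the same training error.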

ELI5: XGBoost is like a team of students who each learn from the mistakes of the one before them. Student 1 tries to predict house prices, gets some wrong. Student 2 focuses specifically on fixing Student 1’s mistakes. Student 3 fixes what Student 2 still got wrong. After hundreds of rounds, their combined predictions are remarkably accurate. It’s the Swiss Army knife of ML — handles missing values, resists overfitting, works on almost any tabular problem.

Use when: any structured/tabular data, default first choice for classification/regression, when you need feature importance.

Do NOT use when: image data, raw text, very sparse high-dimensional data (use Factorization Machines), when you need probabilistic time series forecasts (use DeepAR).

Exam tip: XGBoost is the most frequently tested algorithm. When the exam says “tabular data with complex patterns,” the answer is almost always XGBoost.

K-Nearest Neighbors (KNN)

Aspect	Detail
Type	Supervised (also used for unsupervised similarity search)
Problem	Classification, regression, similarity search
How it works	At inference time, finds the K most similar training points to the query and returns their majority class (classification) or mean value (regression); SageMaker implementation uses random sampling for large datasets, then builds an index (exact or approximate) for fast lookup
Input format	RecordIO-protobuf or CSV
Key hyperparameters	`k` (number of neighbors), `predictor_type`, `dimension_reduction_type`, `dimension_reduction_target`
Instance type	CPU (ml.m5); GPU supported for index building

ELI5: KNN is the “ask your neighbors” algorithm. To predict the price of a new house, find the K most similar houses (by square footage, bedrooms, etc.) in the training data and look at what their prices were. The new house’s price = the average of those K neighbors. Classic KNN is lazy: it fits no parametric model, so “training” only stores the data (SageMaker additionally samples it and builds an index), and the real work happens at prediction time. Fast to “train,” slow to predict on large datasets.
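The neighbor lookup described above fits in a few lines. This is a brute-force sketch with hypothetical helper names (`k_nearest`, `knn_classify`, `knn_regress`); it scans every training row and does none of the sampling, dimension reduction, or indexing that the SageMaker implementation performs.

```python
import math
from collections import Counter


def k_nearest(train, query, k):
    """Return the k (features, label) rows closest to `query` by Euclidean distance."""
    return sorted(train, key=lambda row: math.dist(row[0], query))[:k]


def knn_classify(train, query, k=3):
    """Majority class among the k nearest neighbors."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Mean target value among the k nearest neighbors."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / len(values)
```

Note that square footage dominates the distance in a house-price example unless features are scaled first, which is one reason preprocessing matters for KNN.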

Use when: simple non-parametric baseline, anomaly detection via distance, similarity search.

Do NOT use when: very large datasets (inference is slow), high-dimensional data without preprocessing (curse of dimensionality), need for feature importance.

Supervised — Time Series

DeepAR

Aspect	Detail
Type	Supervised
Problem	Probabilistic time series forecasting
How it works	Autoregressive RNN (LSTM) trained simultaneously on MULTIPLE related time series; at each time step, the model predicts a probability distribution over future values; learns shared patterns (seasonality, trends) across all series
Input format	JSON Lines (each line: `{"start": "YYYY-MM-DD", "target": [v1, v2, ...]}`)
Key hyperparameters	`prediction_length`, `context_length`, `epochs`, `num_cells`, `num_layers`, `dropout_rate`, `likelihood`
Instance type	GPU (ml.p3) for large datasets; CPU for small
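As a small illustration of the JSON Lines input format in the table, the sketch below writes one JSON object per series using only the `start` and `target` fields shown there. The function name `to_deepar_jsonl` and the choice to give every series the same start date are assumptions made for the example.

```python
import json


def to_deepar_jsonl(series_by_name, start):
    """Serialize each series as one JSON object per line: {"start": ..., "target": [...]}."""
    lines = []
    for name in sorted(series_by_name):  # sorted only to make the output order deterministic
        record = {"start": start, "target": series_by_name[name]}
        lines.append(json.dumps(record))
    return "\n".join(lines)
```

Each line is an independent JSON document, so a file with 1000 products is simply 1000 lines.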

DeepAR trains across multiple series simultaneously:

  Product A: 100 120 115 ──────┐
  Product B: 800 820 790 ──────┤  Shared LSTM  →  Probabilistic forecasts
  Product C:   5   6   5 ──────┤  (learns      →  with uncertainty bands
  ...1000 products...          ┘  shared        →  (P10, P50, P90 quantiles)
                                  patterns)

Why DeepAR beats classical methods:

Cold start: a new product with 2 weeks of data benefits from patterns learned across 1000 similar products
Shared patterns: weekly/monthly seasonality learned once and applied everywhere
Probabilistic outputs: returns confidence intervals, not just point predictions (critical for inventory planning)

ELI5: Traditional forecasting tools are like tutors who can only teach one student at a time. DeepAR is a classroom — it teaches 1000 students (time series) simultaneously and lets them learn from each other. A new student with no history still benefits from the general knowledge in the room. Plus, instead of saying “you’ll sell exactly 47 units,” DeepAR says “there’s a 90% chance you’ll sell between 40 and 55 units” — much more useful for decision-making.

Use when: multiple related time series, cold-start series (little history), probabilistic forecasts needed, demand forecasting across product catalog.

Do NOT use when: single isolated time series with long history (classical methods may suffice), very irregular/sparse series.

Unsupervised — Clustering

K-Means

Aspect	Detail
Type	Unsupervised
Problem	Clustering — group N points into K clusters
How it works	SageMaker uses web-scale K-Means with mini-batches: initializes K × `extra_center_factor` centers using random init or K-Means++, then iteratively assigns points to nearest centroid and updates centroids, finally reduces to K clusters
Input format	RecordIO-protobuf or CSV
Key hyperparameters	`k` (number of clusters), `extra_center_factor` (oversampling), `init_method` (random or kmeans++), `epochs`, `mini_batch_size`
Instance type	CPU (ml.m5); supports multi-instance

Choosing K — the Elbow Method:

Distortion vs K:

  Distortion
  (within-cluster
   sum of squares)
     │
  ──►│ \.
     │   \.
     │    \..
     │       \___
     │           ‾‾‾‾‾‾‾────────
     └──────────────────────────► K
            1  2  3  4  5  6  7

         ↑ Pick the "elbow" — the point where adding
           more clusters gives diminishing returns.

ELI5: K-Means is like sorting mail into K bins. Start with random bin assignments. Look at each bin and find its center. Move every letter to whichever bin center it’s closest to. Recompute centers. Repeat until nothing moves. The result: K groups where items within each group are similar and items between groups are different.
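The assign / recompute loop from the ELI5, plus the distortion plotted in the elbow chart, can be sketched as follows. This is the textbook batch (Lloyd's) algorithm with a naive "first k points" initialization, not the mini-batch, web-scale variant SageMaker uses; the names `assign`, `kmeans`, `distortion`, and `max_rounds` are hypothetical.

```python
import math


def assign(points, centers):
    """Group points by the index of their nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, k, max_rounds=100):
    """Repeat assign + recompute until the centers stop moving."""
    centers = list(points[:k])  # naive init; real implementations use random or k-means++
    for _ in range(max_rounds):
        clusters = assign(points, centers)
        new_centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


def distortion(centers, clusters):
    """Within-cluster sum of squared distances (the y-axis of the elbow plot)."""
    return sum(math.dist(p, c) ** 2 for c, cluster in zip(centers, clusters) for p in cluster)
```

Running `kmeans` for K = 1, 2, 3, ... and plotting `distortion` against K gives the elbow curve.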

Use when: customer segmentation, document clustering, color quantization, finding natural groupings.

Do NOT use when: K is unknown and the elbow method doesn’t give a clear signal, data has non-spherical clusters (K-Means assumes spherical/Euclidean), very high-dimensional data without preprocessing.

Unsupervised — Dimensionality Reduction

PCA (Principal Component Analysis)

Aspect	Detail
Type	Unsupervised
Problem	Reduce number of features while preserving variance
How it works	Finds orthogonal directions (principal components) of maximum variance in the data; projects data onto the top N components; SageMaker supports two modes: Regular (exact, computes covariance matrix) and Randomized (approximate, scalable to large datasets)
Input format	RecordIO-protobuf or CSV
Key hyperparameters	`num_components` (target dimensions), `algorithm_mode` (regular / randomized), `subtract_mean`
Instance type	CPU (ml.m5)

Regular vs Randomized PCA:

Regular mode: computes exact covariance matrix → precise but $O(d^2)$ memory where $d$ = feature count → use when $d$ is moderate (< 10,000 features)
Randomized mode: uses randomized SVD approximation → much more memory-efficient → use when $d$ is very large (10,000+ features) or dataset is very large

ELI5: PCA is like finding the best angle to photograph a 3D sculpture so the 2D photo captures as much detail as possible. The “principal components” are those best viewing angles — the directions in high-dimensional space where the data varies the most. Projecting onto the top 2-3 components lets you visualize or compress data while keeping the most important information.
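To show what "direction of maximum variance" means numerically, the sketch below builds the exact covariance matrix (in the spirit of Regular mode) and finds its leading eigenvector by power iteration. Power iteration is a technique chosen here for brevity, not something the source attributes to SageMaker, and `top_principal_component` and `project` are hypothetical names. It assumes the data has non-zero variance.

```python
def top_principal_component(rows, iterations=200):
    """Return (unit vector of maximum variance, per-feature means)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]  # subtract the mean
    # d x d sample covariance matrix: this is the O(d^2) memory cost of the exact approach.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)] for i in range(d)]

    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]  # repeated multiplication converges to the top eigenvector
    return v, means


def project(row, component, means):
    """Coordinate of one row along the principal component."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))
```

For points lying on the line y = x/2, the recovered component points along (2, 1), and projecting onto it turns two correlated features into a single coordinate.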

Use when: too many features (curse of dimensionality), feature redundancy, visualization, pre-processing before clustering.

Do NOT use when: non-linear structure (use autoencoders), features are categorical (PCA needs numeric), you need interpretability of features.

Unsupervised — Anomaly Detection

Random Cut Forest (RCF)

Aspect	Detail
Type	Unsupervised
Problem	Anomaly detection in numeric data and time series
How it works	Builds an ensemble of random binary trees (Random Cut Trees) by recursively bisecting the feature space with random cuts; an anomaly is a point that gets isolated with fewer cuts (it’s far from other points); the anomaly score = average depth in the trees (inverted)
Input format	RecordIO-protobuf or CSV
Key hyperparameters	`num_samples_per_tree`, `num_trees`, `eval_metrics`
Instance type	CPU (ml.m5); also available as built-in function in Kinesis Data Analytics

How RCF assigns anomaly scores:

  Normal point (deep in a cluster):   Anomaly (isolated):
  ┌──────────────────────┐            ┌──────────────────────┐
  │  ·  · · ·            │            │  ·  · · ·            │
  │  · [·] · ·           │            │  · · · ·             │
  │  ·  · · ·            │            │            [★]       │
  │                      │            │                      │
  └──────────────────────┘            └──────────────────────┘
  Many cuts needed to isolate it      Isolated in 2-3 cuts
  → Low anomaly score                 → High anomaly score

How it differs from Isolation Forest: conceptually similar (both isolate anomalies with random partitioning), but RCF picks the dimension for each cut with probability proportional to that feature’s range (Isolation Forest picks the dimension uniformly at random), so cuts concentrate on the dimensions where the data is most spread out. RCF also supports streaming/online updates.
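The "how many random cuts to isolate a point" idea can be simulated directly. The sketch below is a simplified stand-in, not the RCF algorithm as shipped: it re-cuts the full dataset for every tree instead of sampling `num_samples_per_tree` points, and it turns average depth into a score with an arbitrary `1 / (1 + depth)` transform. The function names and that transform are assumptions for illustration. The cut dimension is drawn with probability proportional to each feature's range.

```python
import random


def isolation_depth(points, target, rng, depth=0):
    """Count random cuts until `target` is alone in its partition."""
    if len(points) <= 1:
        return depth
    dims = len(target)
    ranges = [max(p[d] for p in points) - min(p[d] for p in points) for d in range(dims)]
    if sum(ranges) == 0:
        return depth  # every remaining point is identical to the target
    # Pick the cut dimension with probability proportional to its range.
    dim = rng.choices(range(dims), weights=ranges)[0]
    low = min(p[dim] for p in points)
    cut = rng.uniform(low, low + ranges[dim])
    # Keep only the points that fall on the same side of the cut as the target.
    same_side = [p for p in points if (p[dim] <= cut) == (target[dim] <= cut)]
    return isolation_depth(same_side, target, rng, depth + 1)


def anomaly_score(points, target, num_trees=100, seed=0):
    """Shallower average isolation depth gives a higher score."""
    rng = random.Random(seed)
    depths = [isolation_depth(points, target, rng) for _ in range(num_trees)]
    return 1.0 / (1.0 + sum(depths) / num_trees)
```

A point far from a dense 5 × 5 grid is usually separated by the very first cut, while a point in the middle of the grid needs several.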

Real-time use case: RCF in Kinesis Data Analytics enables millisecond-latency anomaly detection on streaming data without any model deployment — just SQL-like function calls.

ELI5: RCF is like a bouncer at a club deciding how “weird” each person is. The bouncer builds a mental model of what “normal” looks like from the crowd. Someone who fits the usual pattern blends in and gets a low weirdness score. Someone wildly out of place — a spike in server traffic at 3am, a transaction ten times larger than usual — stands out immediately and gets a high anomaly score. RCF formalizes this by asking: “How many random cuts does it take to isolate this data point?” Normal points are surrounded by neighbors and take many cuts. Anomalies are isolated in very few.

Use when: streaming anomaly detection (fraud, IoT sensors, server metrics), time series with spikes/level shifts, no labeled anomaly examples available.

Do NOT use when: you have labeled anomaly examples (use supervised classification instead), you need anomaly explanations (RCF gives scores, not reasons).

Exam tip: RCF = anomaly detection. Kinesis + RCF = real-time streaming anomaly detection. This combination appears frequently in exam scenarios.

IP Insights

Aspect	Detail
Type	Unsupervised
Problem	Learn normal patterns of IP address usage per entity
How it works	Neural network learns embeddings for (entity, IP address) pairs from historical normal access logs; at inference, it scores new (entity, IP) pairs, and pairings that do not fit the learned patterns are flagged as unusual/anomalous
Input format	CSV (entity identifier, IPv4 address)
Key hyperparameters	`num_entity_vectors`, `vector_dim`, `epochs`, `batch_size`
Instance type	GPU (ml.p3) recommended

ELI5: IP Insights builds a “reputation map” for which users connect from which IP addresses. Alice always logs in from New York and Chicago. Suddenly Alice’s credentials are used from Bucharest — that’s a high anomaly score. It doesn’t need labels saying “this was fraud”; it just learns normal patterns and flags deviations.

Use when: account takeover detection, fraud detection via login patterns, detecting compromised credentials.

Do NOT use when: general numeric anomaly detection (use RCF), you need scoring inside a streaming SQL application the way RCF runs in Kinesis Data Analytics (IP Insights is served from a SageMaker endpoint or batch transform).

NLP Algorithms

BlazingText

Aspect	Detail
Type	Unsupervised (Word2Vec) or Supervised (text classification)
Problem	Word embeddings OR text classification
How it works	Word2Vec: learns word embeddings by training a shallow NN to predict context from word (Skip-gram) or word from context (CBOW); Text Classification: multi-class text classifier using fastText architecture (bag of n-gram features) — extremely fast
Input format	Word2Vec: one sentence per line (plain text); Text Classification: `__label__<class> text` format
Key hyperparameters	`mode` (batch_skipgram / skipgram / cbow / supervised), `vector_dim`, `epochs`, `learning_rate`, `word_ngrams`
Instance type	CPU or GPU depending on mode
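The supervised input format is simple enough to generate with one helper. In this sketch, `to_blazingtext_line` is a hypothetical name, and lowercasing plus whitespace tokenization are preprocessing assumptions of the example rather than requirements stated in the table.

```python
def to_blazingtext_line(label, text):
    """Format one training example as '__label__<class> token token ...'."""
    tokens = text.lower().split()  # assumed preprocessing: lowercase, split on whitespace
    return f"__label__{label} " + " ".join(tokens)
```

A training file is then one such line per document.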

Word2Vec Architectures:

  CBOW (Continuous Bag of Words) — Faster, good for frequent words:

  "The" "cat" [___] "on" "the"   →   predict center word "sat"
      context words in                         ↑
      sliding window                     predicted word

  Skip-gram — Slower, better for rare words:

  [sat]   →   predict "The", "cat", "on", "the"
  center word         ↑
                 surrounding context words

  Mathematical result: words used in similar contexts → close vectors
  king − man + woman ≈ queen  (vector arithmetic captures meaning)

ELI5 — Word2Vec: CBOW: fill in the blank using surrounding words. Skip-gram: given one word, predict what words surround it. Both tasks force the network to learn that words appearing in similar contexts have similar meanings. The learned word vectors encode semantic relationships as geometry — “Paris is to France as Berlin is to Germany” becomes a consistent directional vector in the embedding space.
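The two training tasks differ only in which side of the window is the input. The sketch below generates the training examples each architecture would see for the sentence in the diagram; it stops short of training any network, and the function names and the `window` parameter are illustrative choices.

```python
def _context(tokens, i, window):
    """Words within `window` positions of index i, excluding the word at i."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]


def cbow_examples(tokens, window=2):
    """CBOW: (context words) -> center word."""
    return [(_context(tokens, i, window), center) for i, center in enumerate(tokens)]


def skipgram_pairs(tokens, window=2):
    """Skip-gram: center word -> each context word, one pair at a time."""
    return [(center, ctx) for i, center in enumerate(tokens)
            for ctx in _context(tokens, i, window)]
```

Skip-gram produces one example per (center, context) pair, so a rare center word still yields several training examples; that is one intuition for why it is described as better for rare words.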

ELI5 (Text Classification mode): BlazingText’s supervised mode (fastText) represents each document as the average of its word/n-gram vectors, then classifies. No convolutions, no recurrence, just fast averaging. It trains in minutes on very large document collections and is often competitive with deep learning models for text classification.

Use when (Word2Vec): you need to train word embeddings on your own corpus for downstream NLP tasks, semantic similarity, analogy completion.

Use when (Text Classification): fast text categorization, sentiment analysis, spam detection, large document collections.

Do NOT use when: you need contextual embeddings (use BERT/transformers for tasks where “bank” means different things in different sentences), long-form generation.

Exam tip: BlazingText is the SageMaker algorithm for both word embeddings (Word2Vec) AND text classification. Two very different tasks, same algorithm — know both modes.

Seq2Seq (Sequence-to-Sequence)

Aspect	Detail
Type	Supervised
Problem	Sequence-to-sequence translation
How it works	Encoder RNN reads input sequence into a context vector; decoder RNN generates output sequence conditioned on context vector and previous output tokens; attention mechanism allows decoder to focus on different parts of the input at each step
Input format	RecordIO-protobuf (integer token sequences, pre-tokenized)
Key hyperparameters	`encoder_type` (rnn / cnn), `num_layers_encoder`, `num_layers_decoder`, `num_embed_source`, `num_embed_target`, `rnn_attention_type`
Instance type	GPU ONLY (ml.p3); trains on a single instance (multi-GPU supported)

Use when: machine translation, text summarization, question answering, speech-to-text.

Do NOT use when: you need a pre-trained large language model (use SageMaker JumpStart / Bedrock instead), your task is text classification (use BlazingText).

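Exam tip: Seq2Seq MUST use GPU for training, as do the three computer vision algorithms (Image Classification, Object Detection, Semantic Segmentation). This is a common exam fact. The remaining built-in algorithms covered here can train on CPU.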

Neural Topic Model (NTM)

Aspect	Detail
Type	Unsupervised
Problem	Discover latent topics in a document corpus
How it works	Variational autoencoder architecture: encoder maps bag-of-words document representation to a distribution over topics (latent space); decoder reconstructs the document from topic mixture; training optimizes both reconstruction and KL divergence
Input format	RecordIO-protobuf or CSV (bag-of-words count vectors)
Key hyperparameters	`num_topics`, `encoder_layers`, `epochs`, `learning_rate`, `optimizer`
Instance type	GPU (ml.p3) or CPU

NTM vs LDA:

NTM uses neural variational inference → more accurate, better on small corpora
LDA uses statistical Bayesian inference → faster, more interpretable, classic choice
In practice: try LDA first (faster, interpretable), use NTM if quality is insufficient

Latent Dirichlet Allocation (LDA)

Aspect	Detail
Type	Unsupervised
Problem	Statistical topic modeling
How it works	Generative probabilistic model: assumes each document is generated by choosing a mixture of topics, then for each word, choosing a topic from that mixture and sampling a word from that topic’s word distribution; inference reverses this process to discover topic distributions
Input format	RecordIO-protobuf or CSV (bag-of-words)
Key hyperparameters	`num_topics`, `alpha0` (document-topic prior), `max_restarts`, `max_iterations`
Instance type	CPU (ml.m5) only

ELI5: Imagine you have 10,000 news articles and no labels. LDA says: “These documents were secretly generated by mixing a small set of topics, maybe 20 topics like ‘sports,’ ‘politics,’ ‘technology.’ Each article is a different blend: ‘70% politics, 20% economics, 10% sports.’ Each topic is a different distribution of words: ‘politics’ has high probability for ‘election,’ ‘vote,’ ‘congress.’ LDA reverse-engineers these blends from the raw word counts.” The output: each document gets a topic mixture vector, each topic gets a word distribution.
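The generative story that LDA inverts can be written as a short sampler. This sketch only runs the forward direction (mixture → topic → word); it performs no inference, and the function name, the dictionary layout, and the toy probabilities in the usage note are all made up for illustration.

```python
import random


def generate_document(topic_mixture, topic_word_dists, num_words, seed=0):
    """Sample words the way LDA assumes documents are produced.

    topic_mixture:    {topic: probability} for this document
    topic_word_dists: {topic: {word: probability}}
    """
    rng = random.Random(seed)
    topics = list(topic_mixture)
    words = []
    for _ in range(num_words):
        # 1. Choose a topic from the document's mixture.
        topic = rng.choices(topics, weights=[topic_mixture[t] for t in topics])[0]
        # 2. Sample a word from that topic's word distribution.
        vocab = list(topic_word_dists[topic])
        word = rng.choices(vocab, weights=[topic_word_dists[topic][w] for w in vocab])[0]
        words.append(word)
    return words
```

For example, a mixture of `{"politics": 0.7, "sports": 0.3}` yields a bag of words dominated by the politics vocabulary; LDA starts from many such bags and works backward to the mixtures and word distributions.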

Use when: content discovery, document organization, building a topic-based search system, exploring large text corpora.

Do NOT use when: you need high accuracy on small corpora (use NTM), your documents have labeled categories (use BlazingText supervised mode).

Object2Vec

Aspect	Detail
Type	Supervised
Problem	Learn embeddings for pairs of objects
How it works	Two encoder towers (same or different architecture) encode each object in a pair into a fixed-dimension vector; a comparator combines the two vectors (element-wise product, concatenation, and/or absolute difference) and passes the result through a small feed-forward network that outputs a score for the pair; trained on labeled pairs (similar/dissimilar, ratings, etc.)
Input format	JSON Lines (pairs: `{"label": float, "in0": [...], "in1": [...]}`)
Key hyperparameters	`enc_dim`, `mlp_layers`, `mlp_dim`, `epochs`, `optimizer`, `comparator_list`
Instance type	GPU (ml.p3) recommended

ELI5: Object2Vec learns to represent pairs of things as vectors such that similar pairs have similar vectors. Give it thousands of (user, movie, rating) triples and it learns that users who liked Star Wars also like similar movies. Give it (sentence1, sentence2, similarity_score) pairs and it learns what makes sentences semantically similar. It’s a general-purpose embedding learner for any pair relationship.

Use when: collaborative filtering recommendations, sentence/document similarity, duplicate detection, knowledge graph embeddings.

Do NOT use when: you only have one object (not pairs), you need word-level embeddings (use BlazingText Word2Vec).

Computer Vision

Image Classification

Aspect	Detail
Type	Supervised
Problem	Assign one or more labels to an entire image
How it works	CNN based on ResNet architecture; supports two training modes: full training from random initialization OR transfer learning from pre-trained ImageNet weights (last layer replaced, rest fine-tuned)
Input format	RecordIO or image files (JPG/PNG) with annotation file; also supports augmented manifest
Key hyperparameters	`num_classes`, `num_training_samples`, `epochs`, `learning_rate`, `mini_batch_size`, `use_pretrained_model` (0/1)
Instance type	GPU ONLY (ml.p3); supports multi-GPU, multi-instance

Training modes:

use_pretrained_model=1: transfer learning — fast, works with small datasets (100+ images per class)
use_pretrained_model=0: full training from scratch — needs large dataset (10,000+ images per class)

Use when: classify whole image (e.g., “is this a hotdog?”, “what disease is in this X-ray?”).

Do NOT use when: you need to locate objects (use Object Detection), pixel-level segmentation (use Semantic Segmentation).

Object Detection

Aspect	Detail
Type	Supervised
Problem	Detect and localize objects in images — output bounding boxes + class labels + confidence scores
How it works	Single Shot MultiBox Detector (SSD) on top of a VGG-16 or ResNet-50 base network (MXNet backend); predicts bounding boxes and class probabilities at multiple scales simultaneously in one forward pass; NMS (non-maximum suppression) removes duplicate boxes
Input format	RecordIO or image files with JSON annotation (bounding box coordinates + labels)
Key hyperparameters	`num_classes`, `base_network` (vgg-16 / resnet-50), `mini_batch_size`, `learning_rate`, `lr_scheduler_step`, `overlap_threshold`
Instance type	GPU ONLY (ml.p3); supports multi-GPU
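The NMS step mentioned in the table is a greedy filter on box overlap. The sketch below is a generic, class-agnostic version (production detectors typically apply it per class); `iou`, `non_max_suppression`, and the parameter name `iou_threshold` are hypothetical and are not SageMaker hyperparameters.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat.

    detections: list of (box, score) tuples.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```

Two near-identical boxes around the same cat collapse to the higher-scoring one, while a box around a dog elsewhere in the image is untouched.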

Object Detection vs Image Classification:

  Image Classification:
  ┌─────────────────────┐
  │  🐱                 │  →  "cat" (confidence: 0.97)
  └─────────────────────┘
  One label for the whole image

  Object Detection:
  ┌─────────────────────┐
  │  ┌────┐    ┌──────┐ │  →  "cat" at (10,20,80,90): 0.95
  │  │ 🐱 │    │ 🐶   │ │     "dog" at (120,30,200,110): 0.89
  │  └────┘    └──────┘ │
  └─────────────────────┘
  Multiple boxes per image, each with class + confidence

Use when: count objects in an image, locate where objects are (autonomous vehicles, security cameras, quality control on assembly lines).

Semantic Segmentation

Aspect	Detail
Type	Supervised
Problem	Assign a class label to EVERY pixel in an image
How it works	Fully Convolutional Network (FCN) with backbone (ResNet-50/101); encoder extracts features, decoder upsamples back to original resolution; also supports PSPNet (Pyramid Pooling) and DeepLab-v3 architectures
Input format	Image files (JPG/PNG) + label images (one pixel = one class label)
Key hyperparameters	`num_classes`, `backbone` (resnet-50 / resnet-101), `algorithm` (fcn / psp / deeplab), `epochs`, `learning_rate`, `use_pretrained_model`
Instance type	GPU ONLY (ml.p3); single-instance only (no multi-instance)

Computer Vision Task Hierarchy:

  Image Classification:  "There is a cat in this image"
                         → 1 label per image

  Object Detection:      "Cat at pixels (10,20)-(80,90)"
                         → N bounding boxes per image

  Semantic Segmentation: "Pixel (25,30) is cat, pixel (100,50) is sky"
                         → 1 label per PIXEL

  More precise output → More labeling effort → More compute

Use when: autonomous driving (road/pedestrian/car pixel masks), medical imaging (tumor boundary delineation), satellite imagery (land use classification).

Exam tip: Semantic segmentation = single GPU instance only. Cannot use multi-instance training, so training time cannot be cut by adding machines, only by moving to a larger (for example multi-GPU) instance.

Recommendations

Factorization Machines

Aspect	Detail
Type	Supervised
Problem	Classification or regression on high-dimensional sparse data
How it works	Extends linear model by efficiently computing all pairwise feature interactions via factored representation: $\hat{y}(x) = w_0 + \sum_i w_i x_i + \sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$ where $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$ are learned latent factor dot products
Input format	RecordIO-protobuf with sparse tensor support
Key hyperparameters	`feature_dim`, `num_factors` (latent dimensions), `predictor_type`, `epochs`, `mini_batch_size`, `factors_lr`
Instance type	CPU (ml.c5)

Why it works for sparse data: In a user-item matrix with 1M users × 500K items, 99.9% of cells are empty. A standard linear model can’t learn user-item interactions (not enough data per pair). FM factorizes each feature into a latent vector — user vectors and item vectors interact via dot products, so even unobserved user-item pairs get predicted via shared latent structure.

$$\hat{y}(x) = w_0 + \sum_i w_i x_i + \sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$$
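The equation above can be evaluated two ways: literally, by looping over all pairs, or with the standard algebraic identity $\sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \tfrac{1}{2} \sum_f \big[ (\sum_i v_{i,f} x_i)^2 - \sum_i v_{i,f}^2 x_i^2 \big]$, which is what "efficiently computing all pairwise interactions" refers to. The sketch below implements both for comparison; the function names and the toy weights in the usage note are invented, and this covers prediction only, not training.

```python
def fm_predict_naive(x, w0, w, v):
    """Direct evaluation: loop over every feature pair i < j."""
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = sum(
        sum(a * b for a, b in zip(v[i], v[j])) * x[i] * x[j]
        for i in range(n) for j in range(i + 1, n)
    )
    return w0 + linear + pairwise


def fm_predict(x, w0, w, v):
    """Same value using one pass per latent factor instead of one per feature pair."""
    n, k = len(x), len(v[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(n))            # (sum_i v_if x_i)
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))  # sum_i (v_if x_i)^2
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise
```

With a one-hot user in position 0 and a one-hot item in position 2, the only surviving interaction term is the dot product of those two latent vectors, which is exactly how an unobserved user-item pair still gets a prediction.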

ELI5: In a massive spreadsheet of users × products, most cells are empty — you only know what each user actually bought. Factorization Machines figures out the hidden patterns in what IS filled in. It learns that User A (who liked sci-fi movies) has a similar latent profile to User B (who also liked sci-fi), so it can predict that A would also like movies B liked, even if A has never seen them. It’s efficiently learning all pairwise interactions between features without needing every pair to be observed.

Use when: click-through rate prediction, recommendation systems with sparse features, any problem where features are user/item IDs (one-hot encoded → sparse).

Do NOT use when: dense feature matrices (XGBoost or Linear Learner will be simpler), image/text data.

Reinforcement Learning on SageMaker

SageMaker RL enables training reinforcement learning agents using:

Frameworks: Ray RLlib (most common), Intel Coach
Environments: AWS RoboMaker (robotics), OpenAI Gym (classic control), custom environments via Docker
Training topology: Separate training instance (learns policy) + environment simulation instances (generate experience)

Use cases: robotics path planning, game-playing agents, autonomous vehicle control, resource scheduling optimization, portfolio management.

Input Format Reference

Algorithm	RecordIO	CSV	JSON Lines	Image files	Pipe Mode
Linear Learner	Yes (float32)	Yes	No	No	Yes
XGBoost	Yes	Yes	No	No	Yes
KNN	Yes	Yes	No	No	Yes
DeepAR	No	No	Yes	No	No
K-Means	Yes	Yes	No	No	Yes
PCA	Yes	Yes	No	No	Yes
RCF	Yes	Yes	No	No	Yes
BlazingText	No	Text (supervised)	No	No	No
Seq2Seq	Yes (integers)	No	No	No	No
NTM	Yes	Yes	No	No	Yes
LDA	Yes	Yes	No	No	No
Object2Vec	No	No	Yes	No	No
Factorization Machines	Yes (sparse)	No	No	No	Yes
Image Classification	Yes	No	No	Yes	Yes
Object Detection	Yes	No	No	Yes	No
Semantic Segmentation	No	No	No	Yes	No

Exam tip: When the question asks about streaming large datasets efficiently → Pipe mode + RecordIO. DeepAR uses JSON Lines, not RecordIO. Computer vision algorithms take image files (JPG/PNG) or RecordIO of image data.

Instance Type Reference

Algorithm	CPU	GPU	Recommended	Multi-Instance	Notes
XGBoost	Yes	Yes	CPU (ml.m5)	Yes	GPU only with gpu_hist
Linear Learner	Yes	Yes	CPU (ml.m5)	Yes	—
KNN	Yes	Yes	CPU (ml.m5)	No	GPU helps index building
DeepAR	Yes	Yes	GPU (ml.p3) large	Yes	Single or multi-machine
K-Means	Yes	Yes	CPU (ml.m5)	Yes	Uses only one GPU per instance
PCA	Yes	Yes	CPU (ml.m5)	No	—
RCF	Yes	No	CPU (ml.m5)	No	Also in Kinesis Analytics
IP Insights	Yes	Yes	GPU (ml.p3)	No	—
BlazingText	Yes	Yes	Mode-dependent	No	Text class: CPU; W2V: GPU
Seq2Seq	No	Yes only	GPU (ml.p3)	No	GPU REQUIRED, single instance (multi-GPU OK)
NTM	Yes	Yes	GPU (ml.p3) large	Yes	—
LDA	Yes	No	CPU (ml.m5)	No	CPU only
Object2Vec	Yes	Yes	GPU (ml.p3)	No	—
Factorization Machines	Yes	No	CPU (ml.c5)	Yes	—
Image Classification	No	Yes only	GPU (ml.p3)	Yes	GPU REQUIRED
Object Detection	No	Yes only	GPU (ml.p3)	Yes	GPU REQUIRED
Semantic Segmentation	No	Yes only	GPU (ml.p3)	No	GPU REQUIRED, no multi-instance

Exam rules to memorize:
Vision algorithms (Image Classification, Object Detection, Semantic Segmentation) + Seq2Seq = GPU ONLY for training
Tree-based + classical ML = CPU first (cheaper)
Semantic Segmentation = GPU + single instance only (no multi-instance)
“Cost-effective training” → choose CPU-capable algorithm on CPU instance

Automatic Model Tuning (Hyperparameter Optimization)

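SageMaker Automatic Model Tuning (HPO) searches for good hyperparameters by running many training jobs and comparing their objective metric. The three strategies compared here are Bayesian optimization, random search, and Hyperband (grid search is also available):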

HPO Strategy Comparison:

  Bayesian Optimization (recommended):
  ┌──────────────────────────────────────────────────────┐
  │  Job 1: eta=0.3, max_depth=6  → AUC=0.82            │
  │  Job 2: eta=0.1, max_depth=8  → AUC=0.85  ← better  │
  │  Job 3: explore near job 2... → AUC=0.87             │
  │  ...intelligently guided by a surrogate model...     │
  └──────────────────────────────────────────────────────┘
  Learns from past results → focuses on promising regions

  Random Search:
  ┌──────────────────────────────────────────────────────┐
  │  Job 1-N: sample hyperparameters at random           │
  │  All jobs run in parallel → fast exploration         │
  │  No learning across jobs                             │
  └──────────────────────────────────────────────────────┘
  Good for parallelism, no prior knowledge assumed

  Hyperband:
  ┌──────────────────────────────────────────────────────┐
  │  Start many jobs with few resources                  │
  │  Eliminate bottom-performing jobs early              │
  │  Give more resources to survivors                    │
  │  → Very efficient for expensive training jobs        │
  └──────────────────────────────────────────────────────┘
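
The elimination idea behind Hyperband can be sketched with plain successive halving. This is a toy illustration, not SageMaker's implementation: the function names, the reduction factor of 2, and the fake scoring function are all assumptions chosen to keep the example small.

```python
def successive_halving(configs, evaluate, min_budget=1, reduction_factor=2):
    """Toy successive halving, the elimination loop at the core of Hyperband.

    configs:  list of candidate hyperparameter settings
    evaluate: function (config, budget) -> score, higher is better
    Returns the winning config and a history of (budget, n_configs) rounds.
    """
    survivors = list(configs)
    budget = min_budget
    history = []
    while len(survivors) > 1:
        history.append((budget, len(survivors)))
        # Score every survivor at the current (small) budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        # Keep the top 1/reduction_factor fraction, at least one config.
        keep = max(1, len(ranked) // reduction_factor)
        survivors = ranked[:keep]
        # Survivors earn a larger budget in the next round.
        budget *= reduction_factor
    return survivors[0], history


# Hypothetical stand-in for a training job: the score peaks at eta=0.1
# and improves slightly with more budget (for example, more epochs).
def fake_train(config, budget):
    return 1.0 - abs(config["eta"] - 0.1) + 0.01 * budget


candidates = [{"eta": e} for e in (0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9)]
best, history = successive_halving(candidates, fake_train)
print(best, history)
```

Each round halves the number of candidates and doubles the budget, so most of the compute goes to configurations that already look promising.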

Key configuration parameters:

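Objective metric: the metric to optimize (e.g., validation:auc, train:loss). Built-in algorithms emit their metrics automatically; custom training code must print the metric to its logs (captured in CloudWatch) and supply a regex metric definition so SageMaker can extract it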
Parameter ranges: continuous (e.g., learning_rate: 0.001–0.3), integer (e.g., max_depth: 3–10), categorical (e.g., optimizer: [adam, sgd])
Max training jobs: total budget of jobs to run
Max parallel jobs: how many jobs run simultaneously (more parallel → less efficient Bayesian learning)
Warm start: initialize a new tuning job from results of a previous one → no wasted computation on already-explored regions
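
As a rough sketch, the parameters above map onto a tuning job configuration like the one below. The dict is shaped after the HyperParameterTuningJobConfig argument of the CreateHyperParameterTuningJob API as I understand it; the helper name build_tuning_config and the chosen ranges are illustrative assumptions, so check field names against the API reference before relying on them.

```python
def build_tuning_config(strategy="Bayesian", max_jobs=20, max_parallel=2):
    """Return a dict shaped like a HyperParameterTuningJobConfig request.

    This only builds the configuration; it does not call AWS.
    """
    if max_parallel > max_jobs:
        raise ValueError("max_parallel cannot exceed max_jobs")
    return {
        "Strategy": strategy,  # e.g. "Bayesian", "Random", "Hyperband"
        "HyperParameterTuningJobObjective": {
            "Type": "Maximize",
            "MetricName": "validation:auc",
        },
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": max_jobs,
            "MaxParallelTrainingJobs": max_parallel,
        },
        "ParameterRanges": {
            # Range bounds are passed as strings in this request shape.
            "ContinuousParameterRanges": [
                {"Name": "eta", "MinValue": "0.001", "MaxValue": "0.3"},
            ],
            "IntegerParameterRanges": [
                {"Name": "max_depth", "MinValue": "3", "MaxValue": "10"},
            ],
            # Categorical ranges would be listed here as
            # {"Name": ..., "Values": [...]} entries.
        },
    }


config = build_tuning_config()
print(config["ResourceLimits"])
```

Keeping max_parallel small relative to max_jobs leaves the Bayesian strategy more completed jobs to learn from before it picks the next candidates.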

ELI5: Instead of manually trying 100 hyperparameter combinations, let SageMaker’s optimizer intelligently explore the space — it learns which regions are promising and focuses there. Bayesian optimization is like a scientist who reads all past experiment results before deciding what to try next. Random search is like throwing darts blindfolded. Hyperband is like a tournament that eliminates weak contestants early to spend time on the strong ones.

Exam tip: Bayesian optimization uses sequential jobs (each job informs the next), so max_parallel_jobs should be low (2-3) for Bayesian. Random search benefits from high parallelism. Hyperband is best when training is expensive and you can afford to kill early-underperforming jobs.

Bring Your Own Algorithm

When to Use Each Approach

Decision Tree: Built-in vs Script Mode vs Custom Container

  Does SageMaker have a built-in algorithm for your problem?
  ├─ Yes, and it fits exactly → Use built-in algorithm
  └─ No, or needs modification
      │
      Does your code fit in a supported framework
      (TensorFlow, PyTorch, Scikit-learn, MXNet, HuggingFace)?
      ├─ Yes → Script Mode (your code + SageMaker's container)
      └─ No, or you need special system dependencies
          │
          Custom Docker Container (full control)

Script Mode

Upload your training script (e.g., train.py) and SageMaker runs it inside its pre-built framework container. Your script reads from /opt/ml/input/data/ and writes the model to /opt/ml/model/.

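# SageMaker script mode: the training script reads its configuration
# from command-line arguments and SM_* environment variables.
import argparse
import os

parser = argparse.ArgumentParser()
# SageMaker passes hyperparameters as command-line arguments
parser.add_argument('--learning-rate', type=float, default=0.01)
# SageMaker exposes data and model paths as environment variables;
# the fallbacks are the standard container locations
parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train'))
# parse_known_args ignores any extra arguments the container adds
args, _ = parser.parse_known_args()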

Custom Docker Container

Required /opt/ml/ directory structure in your container:

  /opt/ml/
  ├── input/
  │   ├── config/
  │   │   ├── hyperparameters.json   ← SageMaker writes the job's hyperparameters here (values are strings)
  │   │   └── resourceconfig.json    ← instance info (current host, list of hosts)
  │   └── data/
  │       └── <channel-name>/        ← training data mounted here
  │           └── (your data files)
  ├── model/                         ← write your trained model artifacts here
  │   └── (model files)
  └── output/
      └── failure                    ← write failure reason here if training fails

  Your container must respond to:
    docker run image train    ← SageMaker calls this to start training
    docker run image serve    ← SageMaker calls this to start inference server
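
A minimal sketch of what the train side of such a container might do with this layout. The paths come from the directory structure above; the "train" channel name, the "learning-rate" key, the model.json artifact, and the base_dir parameter (added so the function can be exercised outside a container) are assumptions for illustration.

```python
import json
import sys
import traceback
from pathlib import Path


def train(base_dir="/opt/ml"):
    """Run a placeholder training job against the /opt/ml layout.

    Returns the process exit code: 0 on success, 1 on failure.
    """
    base = Path(base_dir)
    try:
        # Hyperparameter values arrive as strings, so convert explicitly.
        hp_path = base / "input" / "config" / "hyperparameters.json"
        hyperparameters = json.loads(hp_path.read_text())
        learning_rate = float(hyperparameters.get("learning-rate", "0.01"))

        # Each channel is a directory under input/data/.
        train_dir = base / "input" / "data" / "train"
        n_files = sum(1 for p in train_dir.iterdir() if p.is_file())

        # Placeholder "model": a real script would fit and serialize one.
        model_dir = base / "model"
        model_dir.mkdir(parents=True, exist_ok=True)
        (model_dir / "model.json").write_text(
            json.dumps({"learning_rate": learning_rate, "n_files": n_files})
        )
        return 0
    except Exception:
        # Record the failure reason, then signal it via the exit code.
        out_dir = base / "output"
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / "failure").write_text(traceback.format_exc())
        return 1


# Dispatch on the command SageMaker passes to the container.
if __name__ == "__main__" and len(sys.argv) > 1 and sys.argv[1] == "train":
    sys.exit(train())
```

Everything written under model/ is what SageMaker packages as the model artifact, so the script should write only files that inference needs there.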

Container requirements:

Accept train command → execute training → write model to /opt/ml/model/
Accept serve command → start an HTTP server on port 8080 → respond to /invocations and /ping
Exit with code 0 on success, non-zero on failure
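
The serve requirement can be sketched with the standard library alone. This is an illustrative stand-in, not a production server: the handler and make_server names, the JSON payload with a "features" list, and the summing "prediction" are assumptions; only the port and the /ping and /invocations routes come from the requirements above.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    def _reply(self, status, body=b"", content_type="application/json"):
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check: a 200 response signals the container is ready.
        if self.path == "/ping":
            self._reply(200)
        else:
            self._reply(404)

    def do_POST(self):
        if self.path != "/invocations":
            self._reply(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Placeholder prediction: sum the "features" list.
        result = {"prediction": sum(payload.get("features", []))}
        self._reply(200, json.dumps(result).encode("utf-8"))

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port=8080):
    """Build the HTTP server; call .serve_forever() on it to start serving."""
    return HTTPServer(("0.0.0.0", port), InferenceHandler)

# Inside the container's serve command: make_server().serve_forever()
```

A real container would load the artifacts from /opt/ml/model/ once at startup and typically sit behind a production-grade web server rather than http.server.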

ECR: Push your custom container to Amazon ECR (Elastic Container Registry) — SageMaker pulls it from there for training and inference.

Master Comparison Table

Algorithm	Problem Type	Data Type	GPU Required	Key Hyperparameters	Multi-Instance
Linear Learner	Regression, Classification	Tabular	No	predictor_type, learning_rate, l1	Yes
XGBoost	Regression, Classification	Tabular	No (GPU optional)	num_round, max_depth, eta, objective	Yes
KNN	Classification, Regression	Any numeric	No	k, predictor_type	No
DeepAR	Time Series Forecast	Multiple time series	Recommended	prediction_length, context_length	Yes
K-Means	Clustering	Numeric	No	k, extra_center_factor	Yes
PCA	Dim Reduction	Numeric	No	num_components, algorithm_mode	No
RCF	Anomaly Detection	Numeric/streaming	No	num_trees, num_samples_per_tree	No
IP Insights	Anomaly Detection	IP addresses	Recommended	num_entity_vectors, vector_dim	No
BlazingText	Text Classification, Embeddings	Text	Mode-dep	mode, vector_dim, word_ngrams	No
Seq2Seq	Sequence Translation	Token sequences	YES	encoder_type, num_layers	No (single machine only)
NTM	Topic Modeling	Text (BoW)	Recommended	num_topics, encoder_layers	Yes
LDA	Topic Modeling	Text (BoW)	No	num_topics, alpha0	No
Object2Vec	Embeddings, Similarity	Object pairs	Recommended	enc_dim, num_layers, comparator_list	No
Factorization Machines	Classification, Regression	Sparse/high-dim	No	feature_dim, num_factors	Yes
Image Classification	Image Classification	Images	YES	num_classes, use_pretrained_model	Yes
Object Detection	Object Localization	Images	YES	num_classes, base_network	Yes
Semantic Segmentation	Pixel Classification	Images	YES	num_classes, algorithm, backbone	No (single only)
