Domain 3E: SageMaker Built-In Algorithms
Table of Contents
- SageMaker Built-In Algorithms
- Algorithm Selection Flowchart
- Complete Algorithm Reference
- Input Format Reference
- Instance Type Reference
- Automatic Model Tuning (Hyperparameter Optimization)
- Bring Your Own Algorithm
- Master Comparison Table
SageMaker Built-In Algorithms
Exam Domain: 3 (ML Model Development)
Task: Choose the right SageMaker built-in algorithm for a given scenario
Algorithm Selection Flowchart
What type of problem?
│
├─ Predict a NUMBER? ──────────────────────────────── REGRESSION
│ ├─ Linear relationship, tabular data? → Linear Learner
│ ├─ Complex patterns, tabular data? → XGBoost
│ ├─ Time series forecast? → DeepAR
│ └─ Sparse feature interactions? → Factorization Machines
│
├─ Predict a CATEGORY? ────────────────────────────── CLASSIFICATION
│ ├─ Tabular, binary/multi-class? → XGBoost
│ ├─ Tabular, simple/linear? → Linear Learner
│ ├─ High-dimensional sparse (clicks)? → Factorization Machines
│ ├─ Text, multi-class? → BlazingText (supervised)
│ ├─ Image, whole image label? → Image Classification
│ ├─ Image, find objects with boxes? → Object Detection
│ └─ Image, every pixel labeled? → Semantic Segmentation
│
├─ FORECAST future values? ────────────────────────── TIME SERIES
│ └─ Multiple related time series? → DeepAR
│
├─ Find GROUPS in data? ───────────────────────────── CLUSTERING
│ └─ K groups of similar items? → K-Means
│
├─ REDUCE dimensions? ─────────────────────────────── DIM REDUCTION
│ └─ Linear projection, variance-preserving? → PCA
│
├─ Detect ANOMALIES? ──────────────────────────────── ANOMALY DETECTION
│ ├─ Numeric / streaming time series? → Random Cut Forest
│ └─ IP address / login patterns? → IP Insights
│
├─ NLP / TEXT tasks? ──────────────────────────────── NLP
│ ├─ Word embeddings (semantic similarity)? → BlazingText (Word2Vec)
│ ├─ Text classification (categories)? → BlazingText (supervised)
│ ├─ Topic discovery in corpus? → NTM (neural) or LDA
│ ├─ Sequence translation / summarization? → Seq2Seq
│ └─ Pair embeddings (similarity)? → Object2Vec
│
└─ RECOMMENDATIONS? ───────────────────────────────── RECOMMENDATIONS
├─ Sparse interactions (user-item clicks)? → Factorization Machines
└─ Learn embeddings for object pairs? → Object2Vec
Why this matters for the exam: Always identify the problem type first, then narrow by data characteristics. The most common trap is jumping to an algorithm (e.g., “it’s a neural network problem so use deep learning”) before classifying the task. Classify the task first — the algorithm often follows naturally.
Complete Algorithm Reference
Supervised — Regression / Classification
Linear Learner
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Binary/multi-class classification, regression |
| How it works | Trains a linear model (logistic/linear regression) with stochastic gradient descent; automatically normalizes data; trains multiple models with varied hyperparameters in parallel, selects best |
| Input format | RecordIO-protobuf (float32) or CSV |
| Key hyperparameters | predictor_type (regressor / binary_classifier / multiclass_classifier), learning_rate, l1, wd (L2), mini_batch_size |
| Instance type | CPU (ml.m5) recommended; supports multi-instance |
ELI5: Linear Learner draws the best straight line (or flat plane) through your data to make predictions. If you’re predicting house prices, it finds the line that best fits “price goes up as square footage goes up.” Simple, fast, and surprisingly effective when your data actually has a roughly linear relationship. Always use it as your baseline — if a fancier model can’t beat it, the fancy model isn’t worth the complexity.
Use when: simple/linear relationships, fast baseline, interpretability needed, large datasets where simplicity is a feature.
Do NOT use when: complex non-linear interactions, high-dimensional sparse data (use Factorization Machines), image/text data.
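To make the "linear model trained with stochastic gradient descent" idea concrete, here is a minimal pure-Python sketch. It is an illustration only, not SageMaker's implementation: the function name, the toy data, and the default values are assumptions, and it skips the automatic normalization and parallel model selection described in the table.

```python
import random

def train_linear_sgd(xs, ys, learning_rate=0.05, epochs=500, wd=0.0, seed=0):
    """Fit y ~ w*x + b by stochastic gradient descent on squared error.

    wd is an L2 penalty on the weight, echoing the wd hyperparameter above.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)                    # visit samples in a new order each epoch
        for x, y in data:
            error = (w * x + b) - y          # prediction minus target
            w -= learning_rate * (error * x + wd * w)
            b -= learning_rate * error
    return w, b

# Toy data lying exactly on price = 2 * size + 1 (hypothetical units).
sizes = [0.0, 1.0, 2.0, 3.0, 4.0]
prices = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = train_linear_sgd(sizes, prices)
print(f"slope={w:.3f}, intercept={b:.3f}")
```

Because the toy data is noise-free, the fitted slope and intercept should land very close to 2 and 1. On real data, expect to tune learning_rate and wd.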
XGBoost
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Classification, regression, ranking |
| How it works | Gradient-boosted decision trees: trains trees sequentially, each correcting residuals of the previous ensemble; includes L1/L2 regularization to prevent overfitting; handles missing values natively by learning the best default direction |
| Input format | CSV, LibSVM, RecordIO-protobuf, Parquet |
| Key hyperparameters | num_round (# trees), max_depth (tree depth), eta (learning rate), subsample (row fraction), colsample_bytree (feature fraction), objective, min_child_weight |
| Instance type | CPU (ml.m5) recommended; GPU possible with tree_method=gpu_hist |
XGBoost Boosting Process:
Round 1: Tree₁ ─────────────────────────► prediction₁
↓ compute residuals
Round 2: Tree₂ trains on residuals ──────► prediction₁ + η·Tree₂
↓ compute new residuals
Round 3: Tree₃ trains on residuals ──────► prediction₁ + η·Tree₂ + η·Tree₃
...
Round N: Final = Σ η·Treeᵢ
η = learning rate (eta) controls each tree's contribution
max_depth controls how complex each tree is
Key hyperparameters deep dive:
- num_round: more rounds → lower training error, but risk of overfitting. Tune with early stopping.
- max_depth: shallower trees (3-6) → more regularization. Deeper trees → more complex patterns.
- eta: smaller = slower learning but more robust. Default 0.3, typical range 0.01-0.3.
- subsample: < 1.0 → random row sampling each tree → stochastic boosting → regularization.
- objective: reg:squarederror for regression, binary:logistic for binary, multi:softmax for multi-class.
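The boosting loop above can be sketched in a few lines of pure Python. This is plain gradient boosting on squared error with depth-1 trees (stumps), not XGBoost itself: it omits regularization, second-order gradients, subsampling, and missing-value handling. The helper names and the toy data are assumptions for illustration; num_round and eta mirror the hyperparameters discussed above.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on x for the residuals (squared error).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, num_round=50, eta=0.3):
    """Each round fits a stump to the current residuals and adds eta * stump."""
    base = sum(ys) / len(ys)                 # start from the mean prediction
    trees, preds, history = [], [base] * len(xs), []
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + eta * tree(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + eta * sum(t(x) for t in trees)

    return predict, history

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x * x for x in xs]                     # a non-linear target
predict, history = boost(xs, ys)
print(f"training MSE after round 1: {history[0]:.2f}, after round 50: {history[-1]:.2f}")
```

Training error never increases from one round to the next here, which is exactly why num_round needs early stopping against a validation set rather than the training set.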
ELI5: XGBoost is like a team of students who each learn from the mistakes of the one before them. Student 1 tries to predict house prices, gets some wrong. Student 2 focuses specifically on fixing Student 1’s mistakes. Student 3 fixes what Student 2 still got wrong. After hundreds of rounds, their combined predictions are remarkably accurate. It’s the Swiss Army knife of ML — handles missing values, resists overfitting, works on almost any tabular problem.
Use when: any structured/tabular data, default first choice for classification/regression, when you need feature importance.
Do NOT use when: image data, raw text, very sparse high-dimensional data (use Factorization Machines), when you need probabilistic time series forecasts (use DeepAR).
Exam tip: XGBoost is the most frequently tested algorithm. When the exam says “tabular data with complex patterns,” the answer is almost always XGBoost.
K-Nearest Neighbors (KNN)
| Aspect | Detail |
|---|---|
| Type | Supervised (also used for unsupervised similarity search) |
| Problem | Classification, regression, similarity search |
| How it works | At inference time, finds the K most similar training points to the query and returns their majority class (classification) or mean value (regression); SageMaker implementation uses random sampling for large datasets, then builds an index (exact or approximate) for fast lookup |
| Input format | RecordIO-protobuf or CSV |
| Key hyperparameters | k (number of neighbors), predictor_type, dimension_reduction_type, dimension_reduction_target |
| Instance type | CPU (ml.m5); GPU supported for index building |
ELI5: KNN is the “ask your neighbors” algorithm. To estimate the price of a new house, find the K most similar houses (by square footage, bedrooms, etc.) in the training data and look at what their prices were. The new house’s price = the average of those K neighbors. It’s lazy: no predictive model is fitted during training (SageMaker only samples the data and builds a lookup index), so the real work happens at prediction time. Fast to “train,” slow to predict on large datasets.
Use when: simple non-parametric baseline, anomaly detection via distance, similarity search.
Do NOT use when: very large datasets (inference is slow), high-dimensional data without preprocessing (curse of dimensionality), need for feature importance.
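The majority-vote and mean-of-neighbors rules can be sketched as a brute-force search. This is a minimal illustration, assuming Euclidean distance and a tiny hypothetical housing dataset; it does not include the sampling, dimension reduction, or index building that the SageMaker implementation performs.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k training items closest to the query (Euclidean distance)."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]

def knn_classify(train, query, k=3):
    """Majority class among the k nearest neighbors."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Mean target value among the k nearest neighbors."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / len(values)

# Features: (square footage in thousands, bedrooms). All values are made up.
features = [(1.0, 2), (1.1, 2), (1.2, 3), (3.0, 5), (3.2, 5)]
prices = [200.0, 210.0, 230.0, 600.0, 640.0]
sizes = ["small", "small", "small", "large", "large"]

query = (1.05, 2)
estimated_price = knn_regress(list(zip(features, prices)), query, k=3)
predicted_size = knn_classify(list(zip(features, sizes)), query, k=3)
print(estimated_price, predicted_size)
```

Every prediction scans the whole training set, which is the cost the "slow to predict" warning refers to. In practice, features should also be scaled so that one column does not dominate the distance.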
Supervised — Time Series
DeepAR
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Probabilistic time series forecasting |
| How it works | Autoregressive RNN (LSTM) trained simultaneously on MULTIPLE related time series; at each time step, the model predicts a probability distribution over future values; learns shared patterns (seasonality, trends) across all series |
| Input format | JSON Lines (each line: {"start": "YYYY-MM-DD", "target": [v1, v2, ...]}) |
| Key hyperparameters | prediction_length, context_length, epochs, num_cells, num_layers, dropout_rate, likelihood |
| Instance type | GPU (ml.p3) for large datasets; CPU for small |
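As a quick illustration of the JSON Lines input format in the table, the sketch below serializes a few series into one JSON object per line. The helper name, start dates, and values are hypothetical; only the "start" and "target" keys come from the format described above.

```python
import json

def to_deepar_jsonl(series):
    """Serialize (start, target) pairs as JSON Lines: one time series per line."""
    return "\n".join(
        json.dumps({"start": start, "target": target}) for start, target in series
    )

# Three hypothetical product demand series.
series = [
    ("2024-01-01", [100, 120, 115]),
    ("2024-01-01", [800, 820, 790]),
    ("2024-01-01", [5, 6, 5]),
]
jsonl = to_deepar_jsonl(series)
print(jsonl)
```

Each line is an independent series, which is what lets the model train across many related series at once.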
DeepAR trains across multiple series simultaneously:
Product A: 100 120 115 ──────┐
Product B: 800 820 790 ──────┤ Shared LSTM → Probabilistic forecasts
Product C: 5 6 5 ──────┤ (learns → with uncertainty bands
...1000 products... ┘ shared → (P10, P50, P90 quantiles)
patterns)
Why DeepAR beats classical methods:
- Cold start: a new product with 2 weeks of data benefits from patterns learned across 1000 similar products
- Shared patterns: weekly/monthly seasonality learned once and applied everywhere
- Probabilistic outputs: returns confidence intervals, not just point predictions (critical for inventory planning)
ELI5: Traditional forecasting tools are like tutors who can only teach one student at a time. DeepAR is a classroom — it teaches 1000 students (time series) simultaneously and lets them learn from each other. A new student with no history still benefits from the general knowledge in the room. Plus, instead of saying “you’ll sell exactly 47 units,” DeepAR says “there’s a 90% chance you’ll sell between 40 and 55 units” — much more useful for decision-making.
Use when: multiple related time series, cold-start series (little history), probabilistic forecasts needed, demand forecasting across product catalog.
Do NOT use when: single isolated time series with long history (classical methods may suffice), very irregular/sparse series.
Unsupervised — Clustering
K-Means
| Aspect | Detail |
|---|---|
| Type | Unsupervised |
| Problem | Clustering — group N points into K clusters |
| How it works | SageMaker uses web-scale K-Means with mini-batches: initializes K × extra_center_factor centers using random init or K-Means++, then iteratively assigns points to nearest centroid and updates centroids, finally reduces to K clusters |
| Input format | RecordIO-protobuf or CSV |
| Key hyperparameters | k (number of clusters), extra_center_factor (oversampling), init_method (random or k-means++), max_iterations |
| Instance type | CPU (ml.m5); supports multi-instance |
Choosing K — the Elbow Method:
Distortion vs K:
Distortion
(within-cluster
sum of squares)
│
──►│ \.
│ \.
│ \..
│ \___
│ ‾‾‾‾‾‾‾────────
└──────────────────────────► K
1 2 3 4 5 6 7
↑ Pick the "elbow" — the point where adding
more clusters gives diminishing returns.
ELI5: K-Means is like sorting mail into K bins. Start with random bin assignments. Look at each bin and find its center. Move every letter to whichever bin center it’s closest to. Recompute centers. Repeat until nothing moves. The result: K groups where items within each group are similar and items between groups are different.
Use when: customer segmentation, document clustering, color quantization, finding natural groupings.
Do NOT use when: K is unknown and the elbow method doesn’t give a clear signal, data has non-spherical clusters (K-Means assumes spherical/Euclidean), very high-dimensional data without preprocessing.
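The assign-then-update loop and the elbow method can be sketched as follows. This is classic Lloyd's K-Means with a deterministic farthest-point initialization chosen for reproducibility; it is an assumption for illustration, not SageMaker's web-scale mini-batch variant, and the toy points are made up.

```python
import math

def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Farthest-point init: pick the point farthest from the centroids chosen so far.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = None
    for _ in range(max_iterations):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        if new_labels == labels:             # nothing moved: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

def distortion(points, centroids, labels):
    """Within-cluster sum of squared distances (the y-axis of the elbow plot)."""
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)

distortions = {}
for k in (1, 2, 3):
    c, l = kmeans(points, k)
    distortions[k] = distortion(points, c, l)
print(labels, distortions)
```

On this toy data the distortion drops sharply from K=1 to K=2 and only slightly from K=2 to K=3, which is the elbow pattern to look for.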
Unsupervised — Dimensionality Reduction
PCA (Principal Component Analysis)
| Aspect | Detail |
|---|---|
| Type | Unsupervised |
| Problem | Reduce number of features while preserving variance |
| How it works | Finds orthogonal directions (principal components) of maximum variance in the data; projects data onto the top N components; SageMaker supports two modes: Regular (exact, computes covariance matrix) and Randomized (approximate, scalable to large datasets) |
| Input format | RecordIO-protobuf or CSV |
| Key hyperparameters | num_components (target dimensions), algorithm_mode (regular / randomized), subtract_mean |
| Instance type | CPU (ml.m5) |
Regular vs Randomized PCA:
- Regular mode: computes exact covariance matrix → precise but $O(d^2)$ memory where $d$ = feature count → use when $d$ is moderate (< 10,000 features)
- Randomized mode: uses randomized SVD approximation → much more memory-efficient → use when $d$ is very large (10,000+ features) or dataset is very large
ELI5: PCA is like finding the best angle to photograph a 3D sculpture so the 2D photo captures as much detail as possible. The “principal components” are those best viewing angles — the directions in high-dimensional space where the data varies the most. Projecting onto the top 2-3 components lets you visualize or compress data while keeping the most important information.
Use when: too many features (curse of dimensionality), feature redundancy, visualization, pre-processing before clustering.
Do NOT use when: non-linear structure (use autoencoders), features are categorical (PCA needs numeric), you need interpretability of features.
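A small sketch of what "direction of maximum variance" means: compute the covariance matrix exactly (in the spirit of regular mode) and find its leading eigenvector by power iteration. Power iteration and the toy data are choices made for this illustration; this is not how SageMaker PCA is implemented.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (feature means, unit vector of the first principal component)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]   # subtract the mean
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Repeatedly multiplying by the covariance matrix pulls v toward the top eigenvector.
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(row, means, component):
    """Coordinate of one row along the component (a 2-D point becomes 1 number)."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))

# Two strongly correlated features: almost all variance lies along the diagonal.
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
means, component = first_principal_component(rows)
scores = [project(r, means, component) for r in rows]
print(component, scores)
```

The recovered component points roughly along the diagonal (both entries near 0.707), so a single number per row keeps most of the information in the two original features.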
Unsupervised — Anomaly Detection
Random Cut Forest (RCF)
| Aspect | Detail |
|---|---|
| Type | Unsupervised |
| Problem | Anomaly detection in numeric data and time series |
| How it works | Builds an ensemble of random binary trees (Random Cut Trees) by recursively bisecting the feature space with random cuts; an anomaly is a point that gets isolated with fewer cuts (it’s far from other points); the anomaly score = average depth in the trees (inverted) |
| Input format | RecordIO-protobuf or CSV |
| Key hyperparameters | num_samples_per_tree, num_trees, eval_metrics |
| Instance type | CPU (ml.m5); also available as built-in function in Kinesis Data Analytics |
How RCF assigns anomaly scores:
Normal point (deep in a cluster): Anomaly (isolated):
┌──────────────────────┐ ┌──────────────────────┐
│ · · · · │ │ · · · · │
│ · [·] · · │ │ · · · · │
│ · · · · │ │ [★] │
│ │ │ │
└──────────────────────┘ └──────────────────────┘
Many cuts needed to isolate it Isolated in 2-3 cuts
→ Low anomaly score → High anomaly score
How it differs from Isolation Forest: conceptually similar (both isolate anomalies with random partitioning), but RCF chooses the dimension for each cut with probability proportional to that feature's range, rather than picking a dimension uniformly at random, so cuts concentrate where the data is most spread out. RCF also supports streaming/online updates.
Real-time use case: RCF in Kinesis Data Analytics enables millisecond-latency anomaly detection on streaming data without any model deployment — just SQL-like function calls.
ELI5: RCF is like a bouncer at a club deciding how “weird” each person is. The bouncer builds a mental model of what “normal” looks like from the crowd. Someone who fits the usual pattern blends in and gets a low weirdness score. Someone wildly out of place — a spike in server traffic at 3am, a transaction ten times larger than usual — stands out immediately and gets a high anomaly score. RCF formalizes this by asking: “How many random cuts does it take to isolate this data point?” Normal points are surrounded by neighbors and take many cuts. Anomalies are isolated in very few.
Use when: streaming anomaly detection (fraud, IoT sensors, server metrics), time series with spikes/level shifts, no labeled anomaly examples available.
Do NOT use when: you have labeled anomaly examples (use supervised classification instead), you need anomaly explanations (RCF gives scores, not reasons).
Exam tip: RCF = anomaly detection. Kinesis + RCF = real-time streaming anomaly detection. This combination appears frequently in exam scenarios.
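The "how many random cuts to isolate this point?" intuition can be simulated directly. The sketch below is a simplified isolation-style scorer written for illustration: it picks the cut dimension in proportion to feature range, as described above, but it is an assumption-laden toy and does not reproduce SageMaker RCF's sampling or its actual scoring rule. The score formula and all names are hypothetical.

```python
import random

def isolation_depth(points, target, rng, max_depth=50):
    """Count random cuts until `target` is alone in its partition."""
    current = list(points)
    depth = 0
    dims = len(target)
    while len(current) > 1 and depth < max_depth:
        ranges = [max(p[d] for p in current) - min(p[d] for p in current)
                  for d in range(dims)]
        if sum(ranges) == 0:
            break                            # remaining points are identical
        # Choose the dimension with probability proportional to its range.
        dim = rng.choices(range(dims), weights=ranges)[0]
        low = min(p[dim] for p in current)
        cut = rng.uniform(low, low + ranges[dim])
        side = target[dim] <= cut
        current = [p for p in current if (p[dim] <= cut) == side]
        depth += 1
    return depth

def anomaly_score(points, target, num_trees=200, seed=0):
    """Fewer cuts on average gives a higher score (toy inversion of mean depth)."""
    rng = random.Random(seed)
    mean_depth = sum(isolation_depth(points, target, rng) for _ in range(num_trees)) / num_trees
    return 1.0 / (1.0 + mean_depth)

# A 4x4 grid of "normal" points in the unit square plus one far-away outlier.
normal_points = [(i / 3, j / 3) for i in range(4) for j in range(4)]
outlier = (10.0, 10.0)
data = normal_points + [outlier]

inlier_score = anomaly_score(data, normal_points[5])
outlier_score = anomaly_score(data, outlier)
print(f"inlier={inlier_score:.3f}, outlier={outlier_score:.3f}")
```

The outlier is usually separated by the very first cut, while a point inside the grid needs several, so the outlier ends up with the higher score.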
IP Insights
| Aspect | Detail |
|---|---|
| Type | Unsupervised |
| Problem | Learn normal patterns of IP address usage per entity |
| How it works | Neural network learns embeddings for (entity, IP address) pairs from historical normal access logs; at inference, scores new (entity, IP) pairs; a high score means the pairing is unlikely given the entity's history, i.e. unusual/anomalous access |
| Input format | CSV (entity identifier, IPv4 address) |
| Key hyperparameters | num_entity_vectors, vector_dim, epochs, batch_size |
| Instance type | GPU (ml.p3) recommended |
ELI5: IP Insights builds a “reputation map” for which users connect from which IP addresses. Alice always logs in from New York and Chicago. Suddenly Alice’s credentials are used from Bucharest — that’s a high anomaly score. It doesn’t need labels saying “this was fraud”; it just learns normal patterns and flags deviations.
Use when: account takeover detection, fraud detection via login patterns, detecting compromised credentials.
Do NOT use when: general numeric anomaly detection (use RCF), you need real-time streaming (IP Insights requires batch/endpoint inference).
NLP Algorithms
BlazingText
| Aspect | Detail |
|---|---|
| Type | Unsupervised (Word2Vec) or Supervised (text classification) |
| Problem | Word embeddings OR text classification |
| How it works | Word2Vec: learns word embeddings by training a shallow NN to predict context from word (Skip-gram) or word from context (CBOW); Text Classification: multi-class text classifier using fastText architecture (bag of n-gram features) — extremely fast |
| Input format | Word2Vec: one sentence per line (plain text); Text Classification: __label__<class> text format |
| Key hyperparameters | mode (batch_skipgram / skipgram / cbow / supervised), vector_dim, epochs, learning_rate, word_ngrams |
| Instance type | CPU or GPU depending on mode |
Word2Vec Architectures:
CBOW (Continuous Bag of Words) — Faster, good for frequent words:
"The" "cat" [___] "on" "the" → predict center word "sat"
context words in ↑
sliding window predicted word
Skip-gram — Slower, better for rare words:
[sat] → predict "The", "cat", "on", "the"
center word ↑
surrounding context words
Mathematical result: words used in similar contexts → close vectors
king − man + woman ≈ queen (vector arithmetic captures meaning)
ELI5 — Word2Vec: CBOW: fill in the blank using surrounding words. Skip-gram: given one word, predict what words surround it. Both tasks force the network to learn that words appearing in similar contexts have similar meanings. The learned word vectors encode semantic relationships as geometry — “Paris is to France as Berlin is to Germany” becomes a consistent directional vector in the embedding space.
ELI5 — Text Classification mode: BlazingText’s supervised mode (fastText) represents each document as the average of its word/n-gram vectors, then classifies. No convolutions, no recurrence — just fast averaging. Trains in seconds on millions of documents, often matches or beats deep learning for text classification.
Use when (Word2Vec): you need pre-trained word embeddings for downstream NLP tasks, semantic similarity, analogy completion.
Use when (Text Classification): fast text categorization, sentiment analysis, spam detection, large document collections.
Do NOT use when: you need contextual embeddings (use BERT/transformers for tasks where “bank” means different things in different sentences), long-form generation.
Exam tip: BlazingText is the SageMaker algorithm for both word embeddings (Word2Vec) AND text classification. Two very different tasks, same algorithm — know both modes.
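To see what the two Word2Vec objectives actually train on, and what a supervised-mode input line looks like, here is a short sketch. The function names and the window size are assumptions, and lowercasing is just one common preprocessing choice; only the `__label__<class> text` layout comes from the input format described above.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram examples: (center word, one surrounding context word)."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW examples: (list of context words, center word to predict)."""
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, center))
    return examples

def to_supervised_line(label, text):
    """One training line for text classification: __label__<class> followed by the text."""
    return f"__label__{label} {text.lower()}"

tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(cbow_examples(tokens)[2])     # context around "sat"
print(skipgram_pairs(tokens)[:3])   # first few (center, context) pairs
print(to_supervised_line("spam", "Win a FREE prize"))
```

Skip-gram produces one training example per (center, context) pair, so rare words get more updates; CBOW produces one example per position, which is why it trains faster.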
Seq2Seq (Sequence-to-Sequence)
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Sequence-to-sequence translation |
| How it works | Encoder RNN reads input sequence into a context vector; decoder RNN generates output sequence conditioned on context vector and previous output tokens; attention mechanism allows decoder to focus on different parts of the input at each step |
| Input format | RecordIO-protobuf (integer token sequences, pre-tokenized) |
| Key hyperparameters | encoder_type (rnn / cnn), num_layers_encoder, num_layers_decoder, num_embed_source, num_embed_target, rnn_attention_type |
| Instance type | GPU ONLY (ml.p3); single machine only, multiple GPUs on that machine are supported |
Use when: machine translation, text summarization, question answering, speech-to-text.
Do NOT use when: you need a pre-trained large language model (use SageMaker JumpStart / Bedrock instead), your task is text classification (use BlazingText).
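Exam tip: Seq2Seq trains on GPU instances only and on a single machine (multiple GPUs on that machine are supported). It shares the GPU-only training requirement with the vision algorithms (Image Classification, Object Detection, Semantic Segmentation); the remaining built-in algorithms can train on CPU.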
Neural Topic Model (NTM)
| Aspect | Detail |
|---|---|
| Type | Unsupervised |
| Problem | Discover latent topics in a document corpus |
| How it works | Variational autoencoder architecture: encoder maps bag-of-words document representation to a distribution over topics (latent space); decoder reconstructs the document from topic mixture; training optimizes both reconstruction and KL divergence |
| Input format | RecordIO-protobuf or CSV (bag-of-words count vectors) |
| Key hyperparameters | num_topics, encoder_layers, epochs, learning_rate, optimizer |
| Instance type | GPU (ml.p3) or CPU |
NTM vs LDA:
- NTM uses neural variational inference → more accurate, better on small corpora
- LDA uses statistical Bayesian inference → faster, more interpretable, classic choice
- In practice: try LDA first (faster, interpretable), use NTM if quality is insufficient
Latent Dirichlet Allocation (LDA)
| Aspect | Detail |
|---|---|
| Type | Unsupervised |
| Problem | Statistical topic modeling |
| How it works | Generative probabilistic model: assumes each document is generated by choosing a mixture of topics, then for each word, choosing a topic from that mixture and sampling a word from that topic’s word distribution; inference reverses this process to discover topic distributions |
| Input format | RecordIO-protobuf or CSV (bag-of-words) |
| Key hyperparameters | num_topics, alpha0 (document-topic prior), max_restarts, max_iterations |
| Instance type | CPU (ml.m5) only |
ELI5: Imagine you have 10,000 news articles and no labels. LDA says: “These documents were secretly generated by mixing a small set of topics — maybe 20 topics like ‘sports,’ ‘politics,’ ’technology.’ Each article is a different blend: ‘70% politics, 20% economics, 10% sports.’ Each topic is a different distribution of words: ‘politics’ has high probability for ’election,’ ‘vote,’ ‘congress.’ LDA reverse-engineers these blends from the raw word counts.” The output: each document gets a topic mixture vector, each topic gets a word distribution.
Use when: content discovery, document organization, building a topic-based search system, exploring large text corpora.
Do NOT use when: you need high accuracy on small corpora (use NTM), your documents have labeled categories (use BlazingText supervised mode).
Object2Vec
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Learn embeddings for pairs of objects |
| How it works | Two encoder towers (same or different architecture) encode each object in a pair into a fixed-dimension vector; a comparator (cosine similarity, dot product, or NN) computes a score for the pair; trained on labeled pairs (similar/dissimilar, ratings, etc.) |
| Input format | JSON Lines (pairs: {"label": float, "in0": [...], "in1": [...]}) |
| Key hyperparameters | enc_dim, num_layers, mlp_dim, epochs, optimizer, comparator_list |
| Instance type | GPU (ml.p3) recommended |
ELI5: Object2Vec learns to represent pairs of things as vectors such that similar pairs have similar vectors. Give it thousands of (user, movie, rating) triples and it learns that users who liked Star Wars also like similar movies. Give it (sentence1, sentence2, similarity_score) pairs and it learns what makes sentences semantically similar. It’s a general-purpose embedding learner for any pair relationship.
Use when: collaborative filtering recommendations, sentence/document similarity, duplicate detection, knowledge graph embeddings.
Do NOT use when: you only have one object (not pairs), you need word-level embeddings (use BlazingText Word2Vec).
Computer Vision
Image Classification
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Assign one or more labels to an entire image |
| How it works | CNN based on ResNet architecture; supports two training modes: full training from random initialization OR transfer learning from pre-trained ImageNet weights (last layer replaced, rest fine-tuned) |
| Input format | RecordIO or image files (JPG/PNG) with annotation file; also supports augmented manifest |
| Key hyperparameters | num_classes, num_training_samples, epochs, learning_rate, mini_batch_size, use_pretrained_model (0/1) |
| Instance type | GPU ONLY (ml.p3); supports multi-GPU, multi-instance |
Training modes:
- use_pretrained_model=1: transfer learning; fast, works with small datasets (100+ images per class)
- use_pretrained_model=0: full training from scratch; needs large dataset (10,000+ images per class)
Use when: classify whole image (e.g., “is this a hotdog?”, “what disease is in this X-ray?”).
Do NOT use when: you need to locate objects (use Object Detection), pixel-level segmentation (use Semantic Segmentation).
Object Detection
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Detect and localize objects in images — output bounding boxes + class labels + confidence scores |
| How it works | SSD (Single Shot MultiBox Detector) framework on a VGG-16 or ResNet-50 base network, with MXNet backend; predicts bounding boxes and class probabilities at multiple scales simultaneously in one forward pass; NMS (non-maximum suppression) removes duplicate boxes |
| Input format | RecordIO or image files with JSON annotation (bounding box coordinates + labels) |
| Key hyperparameters | num_classes, base_network (vgg-16 / resnet-50), mini_batch_size, learning_rate, lr_scheduler_step, overlap_threshold |
| Instance type | GPU ONLY (ml.p3); supports multi-GPU |
Object Detection vs Image Classification:
Image Classification:
┌─────────────────────┐
│ 🐱 │ → "cat" (confidence: 0.97)
└─────────────────────┘
One label for the whole image
Object Detection:
┌─────────────────────┐
│ ┌────┐ ┌──────┐ │ → "cat" at (10,20,80,90): 0.95
│ │ 🐱 │ │ 🐶 │ │ "dog" at (120,30,200,110): 0.89
│ └────┘ └──────┘ │
└─────────────────────┘
Multiple boxes per image, each with class + confidence
Use when: count objects in an image, locate where objects are (autonomous vehicles, security cameras, quality control on assembly lines).
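The duplicate-removal step (NMS) mentioned in the table relies on intersection over union (IoU) between boxes. Below is a minimal sketch, assuming boxes are (x_min, y_min, x_max, y_max) tuples and that suppression is applied per class; the function names, the iou_threshold parameter, and the extra overlapping "cat" box are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the highest-scoring box and drop same-class boxes that overlap it too much.

    detections: list of (label, score, box).
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        overlaps_kept = any(
            det[0] == k[0] and iou(det[2], k[2]) > iou_threshold for k in kept
        )
        if not overlaps_kept:
            kept.append(det)
    return kept

detections = [
    ("cat", 0.95, (10, 20, 80, 90)),
    ("cat", 0.80, (12, 22, 82, 92)),      # near-duplicate of the first cat box
    ("dog", 0.89, (120, 30, 200, 110)),
]
kept = non_max_suppression(detections)
print(kept)
```

The two cat boxes overlap with an IoU of about 0.89, so only the higher-scoring one survives, while the dog box is untouched.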
Semantic Segmentation
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Assign a class label to EVERY pixel in an image |
| How it works | Fully Convolutional Network (FCN) with backbone (ResNet-50/101); encoder extracts features, decoder upsamples back to original resolution; also supports PSPNet (Pyramid Pooling) and DeepLab-v3 architectures |
| Input format | Image files (JPG/PNG) + label images (one pixel = one class label) |
| Key hyperparameters | num_classes, backbone (resnet-50 / resnet-101), algorithm (fcn / psp / deeplab), epochs, learning_rate, use_pretrained_model |
| Instance type | GPU ONLY (ml.p3); single-instance only (no multi-instance) |
Computer Vision Task Hierarchy:
Image Classification: "There is a cat in this image"
→ 1 label per image
Object Detection: "Cat at pixels (10,20)-(80,90)"
→ N bounding boxes per image
Semantic Segmentation: "Pixel (25,30) is cat, pixel (100,50) is sky"
→ 1 label per PIXEL
More precise → More training data → More compute → More useful
Use when: autonomous driving (road/pedestrian/car pixel masks), medical imaging (tumor boundary delineation), satellite imagery (land use classification).
Exam tip: Semantic segmentation = single GPU instance only. Cannot use multi-instance training, so training time cannot be reduced by adding machines; the only lever is a larger GPU instance.
Recommendations
Factorization Machines
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Classification or regression on high-dimensional sparse data |
| How it works | Extends linear model by efficiently computing all pairwise feature interactions via factored representation: $\hat{y} = w_0 + \sum_i w_i x_i + \sum_{i<j} \langle v_i, v_j \rangle x_i x_j$ where $\langle v_i, v_j \rangle$ are learned latent factor dot products |
| Input format | RecordIO-protobuf with sparse tensor support |
| Key hyperparameters | feature_dim, num_factors (latent dimensions), predictor_type, epochs, learning_rate |
| Instance type | CPU (ml.c5) |
Why it works for sparse data: In a user-item matrix with 1M users × 500K items, 99.9% of cells are empty. A standard linear model can’t learn user-item interactions (not enough data per pair). FM factorizes each feature into a latent vector — user vectors and item vectors interact via dot products, so even unobserved user-item pairs get predicted via shared latent structure.
$$\hat{y}(x) = w_0 + \sum_i w_i x_i + \sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$$
ELI5: In a massive spreadsheet of users × products, most cells are empty — you only know what each user actually bought. Factorization Machines figures out the hidden patterns in what IS filled in. It learns that User A (who liked sci-fi movies) has a similar latent profile to User B (who also liked sci-fi), so it can predict that A would also like movies B liked, even if A has never seen them. It’s efficiently learning all pairwise interactions between features without needing every pair to be observed.
Use when: click-through rate prediction, recommendation systems with sparse features, any problem where features are user/item IDs (one-hot encoded → sparse).
Do NOT use when: dense feature matrices (XGBoost or Linear Learner will be simpler), image/text data.
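The "efficiently computing all pairwise feature interactions" claim rests on a standard algebraic identity: the pairwise sum equals half of (square of the sum minus sum of the squares), computed per latent factor. The sketch below evaluates the formula both ways to show they agree. The weights and latent vectors are made-up numbers, not trained values.

```python
def fm_predict_naive(x, w0, w, v):
    """Direct evaluation of the FM formula: loops over every feature pair, O(n^2 * k)."""
    n = len(x)
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(v[i], v[j]))   # <v_i, v_j>
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise

def fm_predict_fast(x, w0, w, v):
    """Same prediction in O(n * k) using 0.5 * ((sum)^2 - sum of squares) per factor."""
    n, k = len(x), len(v[0])
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(k):
        total = sum(v[i][f] * x[i] for i in range(n))
        total_of_squares = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (total * total - total_of_squares)
    return w0 + linear + pairwise

# One-hot style input: user 0 active, item 1 active (features 0-1 users, 2-4 items).
x = [1.0, 0.0, 0.0, 1.0, 0.0]
w0 = 0.1
w = [0.2, -0.1, 0.05, 0.3, -0.2]
v = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2], [0.6, 0.1], [0.0, -0.5]]  # num_factors = 2

naive = fm_predict_naive(x, w0, w, v)
fast = fm_predict_fast(x, w0, w, v)
print(naive, fast)
```

With sparse inputs, the sums only need to visit the non-zero features, which is what keeps Factorization Machines practical on very wide one-hot encoded data.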
Reinforcement Learning on SageMaker
SageMaker RL enables training reinforcement learning agents using:
- Frameworks: Ray RLlib (most common), Intel Coach
- Environments: AWS RoboMaker (robotics), OpenAI Gym (classic control), custom environments via Docker
- Training topology: Separate training instance (learns policy) + environment simulation instances (generate experience)
Use cases: robotics path planning, game-playing agents, autonomous vehicle control, resource scheduling optimization, portfolio management.
Input Format Reference
| Algorithm | RecordIO | CSV | JSON Lines | Image files | Pipe Mode |
|---|---|---|---|---|---|
| Linear Learner | Yes (float32) | Yes | No | No | Yes |
| XGBoost | Yes | Yes | No | No | Yes |
| KNN | Yes | Yes | No | No | Yes |
| DeepAR | No | No | Yes | No | No |
| K-Means | Yes | Yes | No | No | Yes |
| PCA | Yes | Yes | No | No | Yes |
| RCF | Yes | Yes | No | No | Yes |
| BlazingText | No | Text (supervised) | No | No | No |
| Seq2Seq | Yes (integers) | No | No | No | No |
| NTM | Yes | Yes | No | No | Yes |
| LDA | Yes | Yes | No | No | No |
| Object2Vec | No | No | Yes | No | No |
| Factorization Machines | Yes (sparse) | No | No | No | Yes |
| Image Classification | Yes | No | No | Yes | Yes |
| Object Detection | Yes | No | No | Yes | No |
| Semantic Segmentation | No | No | No | Yes | No |
Exam tip: When the question asks about streaming large datasets efficiently → Pipe mode + RecordIO. DeepAR uses JSON Lines, not RecordIO. Computer vision algorithms take image files (JPG/PNG) or RecordIO of image data.
Instance Type Reference
| Algorithm | CPU | GPU | Recommended | Multi-Instance | Notes |
|---|---|---|---|---|---|
| XGBoost | Yes | Yes | CPU (ml.m5) | Yes | GPU training requires tree_method=gpu_hist |
| Linear Learner | Yes | Yes | CPU (ml.m5) | Yes | — |
| KNN | Yes | Yes | CPU (ml.m5) | No | GPU helps index building |
| DeepAR | Yes | Yes | GPU (ml.p3) large | No | — |
| K-Means | Yes | No | CPU (ml.m5) | Yes | — |
| PCA | Yes | Yes | CPU (ml.m5) | No | — |
| RCF | Yes | No | CPU (ml.m5) | No | Also in Kinesis Analytics |
| IP Insights | Yes | Yes | GPU (ml.p3) | No | — |
| BlazingText | Yes | Yes | Mode-dependent | No | Text class: CPU; W2V: GPU |
| Seq2Seq | No | Yes only | GPU (ml.p3) | No | GPU REQUIRED; single machine, multi-GPU supported |
| NTM | Yes | Yes | GPU (ml.p3) large | Yes | — |
| LDA | Yes | No | CPU (ml.m5) | No | CPU only |
| Object2Vec | Yes | Yes | GPU (ml.p3) | No | — |
| Factorization Machines | Yes | No | CPU (ml.c5) | Yes | — |
| Image Classification | No | Yes only | GPU (ml.p3) | Yes | GPU REQUIRED |
| Object Detection | No | Yes only | GPU (ml.p3) | Yes | GPU REQUIRED |
| Semantic Segmentation | No | Yes only | GPU (ml.p3) | No | GPU REQUIRED, no multi-instance |
Exam rules to memorize:
- Vision algorithms (Image Classification, Object Detection, Semantic Segmentation) + Seq2Seq = GPU ONLY, always
- Tree-based + classical ML = CPU first (cheaper)
- Semantic Segmentation and Seq2Seq = GPU + single instance only (no multi-instance)
- “Cost-effective training” → choose CPU-capable algorithm on CPU instance
Automatic Model Tuning (Hyperparameter Optimization)
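SageMaker HPO automatically searches for good hyperparameter values. The three strategies compared here are Bayesian optimization, random search, and Hyperband (grid search is also available):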
HPO Strategy Comparison:
Bayesian Optimization (recommended):
┌──────────────────────────────────────────────────────┐
│ Job 1: eta=0.3, max_depth=6 → AUC=0.82 │
│ Job 2: eta=0.1, max_depth=8 → AUC=0.85 ← better │
│ Job 3: explore near job 2... → AUC=0.87 │
│ ...intelligently guided by a surrogate model... │
└──────────────────────────────────────────────────────┘
Learns from past results → focuses on promising regions
Random Search:
┌──────────────────────────────────────────────────────┐
│ Job 1-N: sample hyperparameters at random │
│ All jobs run in parallel → fast exploration │
│ No learning across jobs │
└──────────────────────────────────────────────────────┘
Good for parallelism, no prior knowledge assumed
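A minimal sketch of random search in plain Python, to show why the jobs are independent of each other. This is not SageMaker's implementation; the parameter ranges, `toy_objective`, and all function names are illustrative assumptions.

```python
import math
import random


def sample_config(rng):
    """Draw one hyperparameter configuration at random."""
    # Learning rates are usually sampled on a log scale.
    log_low, log_high = math.log(0.001), math.log(0.3)
    return {
        "learning_rate": math.exp(rng.uniform(log_low, log_high)),
        "max_depth": rng.randint(3, 10),
        "optimizer": rng.choice(["adam", "sgd"]),
    }


def random_search(evaluate, num_jobs, seed=0):
    """Score num_jobs independent random configs and return (best_score, best_config)."""
    rng = random.Random(seed)
    configs = [sample_config(rng) for _ in range(num_jobs)]
    # No config depends on another, so these evaluations could all run in parallel.
    scored = [(evaluate(config), config) for config in configs]
    return max(scored, key=lambda pair: pair[0])


def toy_objective(config):
    """Hypothetical stand-in for a training job's validation metric (higher is better)."""
    lr_penalty = abs(math.log10(config["learning_rate"]) + 1.5)
    depth_penalty = 0.05 * abs(config["max_depth"] - 6)
    return -(lr_penalty + depth_penalty)


best_score, best_config = random_search(toy_objective, num_jobs=20, seed=0)
```

Because nothing is learned between jobs, raising the parallelism costs no search quality, unlike Bayesian optimization.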
Hyperband:
┌──────────────────────────────────────────────────────┐
│ Start many jobs with few resources │
│ Eliminate bottom-performing jobs early │
│ Give more resources to survivors │
│ → Very efficient for expensive training jobs │
└──────────────────────────────────────────────────────┘
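The elimination idea in the box can be sketched as successive halving, the building block that Hyperband repeats with different starting budgets. This is a simplified illustration, not SageMaker's implementation; `toy_train`, the `quality` field, and `reduction_factor` are hypothetical names.

```python
def successive_halving(configs, train_for, min_budget=1, reduction_factor=2):
    """Repeatedly keep the best 1/reduction_factor of configs while growing the budget."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Train every survivor with the current budget and rank by score.
        ranked = sorted(survivors, key=lambda c: train_for(c, budget), reverse=True)
        # Drop the bottom performers; always keep at least one.
        survivors = ranked[: max(1, len(ranked) // reduction_factor)]
        # Survivors earn a larger budget in the next round.
        budget *= reduction_factor
    return survivors[0]


def toy_train(config, budget):
    """Hypothetical learning curve: the score approaches config['quality'] as budget grows."""
    return config["quality"] * (1 - 0.5 ** budget)


qualities = [0.70, 0.85, 0.60, 0.91, 0.78, 0.66, 0.88, 0.73]
candidates = [{"name": f"job-{i}", "quality": q} for i, q in enumerate(qualities)]
winner = successive_halving(candidates, toy_train)
```

With eight candidates the rounds go 8, 4, 2, 1, so most of the total budget is spent on the few configurations that survive.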
Key configuration parameters:
- Objective metric: the metric to optimize (e.g., validation:auc, train:loss); must be emitted to CloudWatch logs
- Parameter ranges: continuous (e.g., learning_rate: 0.001–0.3), integer (e.g., max_depth: 3–10), categorical (e.g., optimizer: [adam, sgd])
- Max training jobs: total budget of jobs to run
- Max parallel jobs: how many jobs run simultaneously (more parallel → less efficient Bayesian learning)
- Warm start: initialize a new tuning job from results of a previous one → no wasted computation on already-explored regions
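As a rough sketch, the parameters above map onto a tuning-job configuration like the one below. The key names follow my understanding of the `HyperParameterTuningJobConfig` structure used by boto3's `create_hyper_parameter_tuning_job`; treat them as assumptions and verify against the current API reference. The helper `build_tuning_config` is hypothetical.

```python
def build_tuning_config(strategy="Bayesian", max_jobs=20, max_parallel=2):
    """Return a dict shaped like a tuning-job config (structure assumed, not verified here)."""
    if max_parallel > max_jobs:
        raise ValueError("max_parallel cannot exceed max_jobs")
    return {
        "Strategy": strategy,
        # Objective metric: what to optimize and in which direction.
        "HyperParameterTuningJobObjective": {
            "Type": "Maximize",
            "MetricName": "validation:auc",
        },
        # Total job budget and how many jobs run at once.
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": max_jobs,
            "MaxParallelTrainingJobs": max_parallel,
        },
        # Range bounds are passed as strings in this structure.
        "ParameterRanges": {
            "ContinuousParameterRanges": [
                {"Name": "learning_rate", "MinValue": "0.001", "MaxValue": "0.3",
                 "ScalingType": "Logarithmic"},
            ],
            "IntegerParameterRanges": [
                {"Name": "max_depth", "MinValue": "3", "MaxValue": "10"},
            ],
            "CategoricalParameterRanges": [
                {"Name": "optimizer", "Values": ["adam", "sgd"]},
            ],
        },
    }


# Low parallelism for Bayesian, high parallelism for random search.
bayesian_config = build_tuning_config("Bayesian", max_jobs=20, max_parallel=2)
random_config = build_tuning_config("Random", max_jobs=20, max_parallel=10)
```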
ELI5: Instead of manually trying 100 hyperparameter combinations, let SageMaker’s optimizer intelligently explore the space — it learns which regions are promising and focuses there. Bayesian optimization is like a scientist who reads all past experiment results before deciding what to try next. Random search is like throwing darts blindfolded. Hyperband is like a tournament that eliminates weak contestants early to spend time on the strong ones.
Exam tip: Bayesian optimization uses sequential jobs (each job informs the next), so max_parallel_jobs should be low (2-3) for Bayesian. Random search benefits from high parallelism. Hyperband is best when training is expensive and you can afford to stop underperforming jobs early.
Bring Your Own Algorithm
When to Use Each Approach
Decision Tree: Built-in vs Script Mode vs Custom Container
Does SageMaker have a built-in algorithm for your problem?
├─ Yes, and it fits exactly → Use built-in algorithm
└─ No, or needs modification
│
Does your code fit in a supported framework
(TensorFlow, PyTorch, Scikit-learn, MXNet, HuggingFace)?
├─ Yes → Script Mode (your code + SageMaker's container)
└─ No, or you need special system dependencies
│
Custom Docker Container (full control)
Script Mode
Upload your training script (e.g., train.py) and SageMaker runs it inside its pre-built framework container. Your script reads from /opt/ml/input/data/ and writes the model to /opt/ml/model/.
# SageMaker script mode: the training script reads hyperparameters from
# command-line arguments and paths from environment variables
import argparse
import os

parser = argparse.ArgumentParser()
# SageMaker passes hyperparameters as command-line arguments
parser.add_argument('--learning-rate', type=float, default=0.01)
# SageMaker exposes the model and data paths as environment variables
parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
# Ignore any extra arguments instead of failing on them
args, _ = parser.parse_known_args()
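To check how the parser above behaves before launching a job, it can be exercised locally with simulated inputs. This is a sketch under stated assumptions: `parse_training_args` and `fake_env` are hypothetical names, and the two paths are the values SageMaker normally sets for `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN`.

```python
import argparse


def parse_training_args(argv, environ):
    """Build the same parser as the training script, with injectable argv and environment."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning-rate", type=float, default=0.01)
    parser.add_argument("--model-dir", type=str, default=environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=environ.get("SM_CHANNEL_TRAIN"))
    # Unknown arguments are ignored rather than raising an error.
    args, _ = parser.parse_known_args(argv)
    return args


# Simulated container environment and hyperparameter arguments.
fake_env = {
    "SM_MODEL_DIR": "/opt/ml/model",
    "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train",
}
args = parse_training_args(["--learning-rate", "0.05"], fake_env)
```

Note that argparse turns `--learning-rate` into the attribute `args.learning_rate`.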
Custom Docker Container
Required /opt/ml/ directory structure in your container:
/opt/ml/
├── input/
│ ├── config/
│ │ ├── hyperparameters.json ← SageMaker writes your hyperparameters here (values arrive as strings)
│ │ └── resourceConfig.json ← instance info
│ └── data/
│ └── <channel-name>/ ← training data mounted here
│ └── (your data files)
├── model/ ← write your trained model artifacts here
│ └── (model files)
└── output/
└── failure ← write failure reason here if training fails
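A minimal sketch of a training entry point that follows this layout. The base directory is a parameter so the function can be dry-run outside a container; the `model.json` file name and the `learning_rate` key are hypothetical placeholders for real training logic.

```python
import json
import tempfile
import traceback
from pathlib import Path


def run_training(base_dir="/opt/ml"):
    """Read hyperparameters, write a model artifact, and return a process exit code."""
    base = Path(base_dir)
    try:
        hp_path = base / "input" / "config" / "hyperparameters.json"
        # Hyperparameter values arrive as strings, so convert explicitly.
        hyperparameters = json.loads(hp_path.read_text())
        learning_rate = float(hyperparameters.get("learning_rate", "0.01"))

        # Placeholder for real training: persist a trivial "model".
        model_dir = base / "model"
        model_dir.mkdir(parents=True, exist_ok=True)
        (model_dir / "model.json").write_text(json.dumps({"learning_rate": learning_rate}))
        return 0
    except Exception:
        # On failure, record the reason where SageMaker looks for it.
        failure_file = base / "output" / "failure"
        failure_file.parent.mkdir(parents=True, exist_ok=True)
        failure_file.write_text(traceback.format_exc())
        return 1


# Local dry run in a temporary directory instead of /opt/ml.
with tempfile.TemporaryDirectory() as tmp:
    config_dir = Path(tmp) / "input" / "config"
    config_dir.mkdir(parents=True)
    (config_dir / "hyperparameters.json").write_text(json.dumps({"learning_rate": "0.05"}))
    exit_code = run_training(tmp)
    saved_model = json.loads((Path(tmp) / "model" / "model.json").read_text())

# A run with no config file should fail cleanly and write the failure file.
with tempfile.TemporaryDirectory() as empty:
    failed_code = run_training(empty)
    failure_written = (Path(empty) / "output" / "failure").exists()
```

In a real container the script would end with `sys.exit(run_training())` so the exit code reaches SageMaker.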
Your container must respond to:
docker run image train ← SageMaker calls this to start training
docker run image serve ← SageMaker calls this to start inference server
Container requirements:
- Accept train command → execute training → write model to /opt/ml/model/
- Accept serve command → start an HTTP server on port 8080 → respond to /invocations and /ping
- Exit with code 0 on success, non-zero on failure
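The serve requirement can be sketched with only the Python standard library. This assumes /ping is called with GET and /invocations with POST; the handler name, the JSON payload shape, and the "sum the features" model are hypothetical. Production containers typically use a proper web server instead.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    def _reply(self, status, body):
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self):
        # Health check: answer 200 when the container is ready.
        if self.path == "/ping":
            self._reply(200, {"status": "ok"})
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):
        if self.path == "/invocations":
            length = int(self.headers.get("Content-Length", 0))
            request = json.loads(self.rfile.read(length) or b"{}")
            # Placeholder "model": sum the input features.
            self._reply(200, {"prediction": sum(request.get("features", []))})
        else:
            self._reply(404, {"error": "not found"})

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_server(port=8080):
    """Start the server in a background thread and return it."""
    server = HTTPServer(("0.0.0.0", port), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call(port, method, path, body=None):
    """Send one request to the local server and return (status, parsed JSON)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    data = json.dumps(body).encode("utf-8") if body is not None else None
    conn.request(method, path, body=data, headers={"Content-Type": "application/json"})
    response = conn.getresponse()
    result = (response.status, json.loads(response.read()))
    conn.close()
    return result


# Local check on a free port (port 0); a real container would use port 8080.
server = start_server(port=0)
ping_status, _ = call(server.server_port, "GET", "/ping")
invoke_status, prediction = call(server.server_port, "POST", "/invocations", {"features": [1, 2, 3]})
server.shutdown()
server.server_close()
```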
ECR: Push your custom container to Amazon ECR (Elastic Container Registry) — SageMaker pulls it from there for training and inference.
Master Comparison Table
| Algorithm | Problem Type | Data Type | GPU Required | Key Hyperparameters | Multi-Instance |
|---|---|---|---|---|---|
| Linear Learner | Regression, Classification | Tabular | No | predictor_type, learning_rate, l1 | Yes |
| XGBoost | Regression, Classification | Tabular | No (GPU optional) | num_round, max_depth, eta, objective | Yes |
| KNN | Classification, Regression | Any numeric | No | k, predictor_type | No |
| DeepAR | Time Series Forecast | Multiple time series | Recommended | prediction_length, context_length | Yes |
| K-Means | Clustering | Numeric | No | k, extra_center_factor | Yes |
| PCA | Dim Reduction | Numeric | No | num_components, algorithm_mode | No |
| RCF | Anomaly Detection | Numeric/streaming | No | num_trees, num_samples_per_tree | No |
| IP Insights | Anomaly Detection | IP addresses | Recommended | num_entity_vectors, vector_dim | No |
| BlazingText | Text Classification, Embeddings | Text | Mode-dep | mode, vector_dim, word_ngrams | No |
| Seq2Seq | Sequence Translation | Token sequences | YES | encoder_type, num_layers | No (single only) |
| NTM | Topic Modeling | Text (BoW) | Recommended | num_topics, encoder_layers | Yes |
| LDA | Topic Modeling | Text (BoW) | No | num_topics, alpha0 | No |
| Object2Vec | Embeddings, Similarity | Object pairs | Recommended | enc_dim, num_layers, comparator_list | No |
| Factorization Machines | Classification, Regression | Sparse/high-dim | No | feature_dim, num_factors | Yes |
| Image Classification | Image Classification | Images | YES | num_classes, use_pretrained_model | Yes |
| Object Detection | Object Localization | Images | YES | num_classes, base_network | Yes |
| Semantic Segmentation | Pixel Classification | Images | YES | num_classes, algorithm, backbone | No (single only) |