Domain 2A: SageMaker Built-In Algorithms

SageMaker Built-In Algorithms

Exam Domain: 2 (ML Model Development, 26%)
Task: Choose the right modeling approach

Algorithm Selection Flowchart

What type of problem?
│
├─ Predict a number? ──────────── REGRESSION
│   ├─ Linear relationship?      → Linear Learner
│   ├─ Complex / tabular?        → XGBoost / LightGBM
│   └─ Time series?              → DeepAR
│
├─ Predict a category? ────────── CLASSIFICATION
│   ├─ Linearly separable?       → Linear Learner (binary or multi-class)
│   ├─ Complex / tabular?        → XGBoost / LightGBM
│   ├─ Image?                    → Image Classification
│   ├─ Object in image?          → Object Detection
│   ├─ Pixel-level?              → Semantic Segmentation
│   └─ Text?                     → BlazingText
│
├─ Group similar items? ───────── CLUSTERING
│   └─ K groups?                 → K-Means
│
├─ Reduce dimensions? ─────────── DIMENSIONALITY REDUCTION
│   └─ Linear?                   → PCA
│
├─ Detect anomalies? ──────────── ANOMALY DETECTION
│   ├─ Numeric / time series?    → Random Cut Forest (RCF)
│   └─ IP address patterns?      → IP Insights
│
├─ Text / NLP? ────────────────── NLP
│   ├─ Word embeddings?          → BlazingText (Word2Vec)
│   ├─ Text classification?      → BlazingText (supervised)
│   ├─ Topic discovery?          → NTM or LDA
│   └─ Sequence translation?     → Seq2Seq
│
├─ Recommendations? ───────────── RECOMMENDATIONS
│   ├─ Sparse data (clicks)?     → Factorization Machines
│   └─ Pair embeddings?          → Object2Vec
│
└─ Forecasting? ───────────────── TIME SERIES
    └─ Multiple related series?  → DeepAR

Why this matters for the exam: This flowchart is the most important mental model to internalize. Always start with the problem type (regression, classification, clustering, anomaly detection), not with the algorithm name. Exam questions describe a business scenario: classify the problem first, and the algorithm choice often becomes obvious. A common way to get these questions wrong is to jump to an algorithm before identifying the problem type.
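
As a rough sketch of the "problem type first" habit, the flowchart can be read as a two-step lookup. The name `suggest_algorithm`, the `FLOWCHART` dictionary, and its keys are illustrative assumptions, not part of any SageMaker API; the mappings are copied from the flowchart above.

```python
# Hypothetical lookup that mirrors the flowchart: problem type first, data hint second.
# Nothing here calls SageMaker; the names and keys are illustrative only.
FLOWCHART = {
    "regression": {
        "linear": "Linear Learner",
        "tabular": "XGBoost / LightGBM",
        "time series": "DeepAR",
    },
    "classification": {
        "linear": "Linear Learner",
        "tabular": "XGBoost / LightGBM",
        "image": "Image Classification",
        "object in image": "Object Detection",
        "pixel-level": "Semantic Segmentation",
        "text": "BlazingText",
    },
    "clustering": {"k groups": "K-Means"},
    "dimensionality reduction": {"linear": "PCA"},
    "anomaly detection": {
        "numeric": "Random Cut Forest (RCF)",
        "ip addresses": "IP Insights",
    },
    "recommendations": {
        "sparse data": "Factorization Machines",
        "pair embeddings": "Object2Vec",
    },
}


def suggest_algorithm(problem_type: str, hint: str) -> str:
    """Step 1: resolve the problem type. Step 2: narrow by the data hint."""
    branch = FLOWCHART.get(problem_type.lower())
    if branch is None:
        raise ValueError(f"Classify the problem first; unknown type {problem_type!r}")
    if hint.lower() not in branch:
        raise ValueError(f"No {hint!r} branch under {problem_type!r}")
    return branch[hint.lower()]


print(suggest_algorithm("regression", "time series"))
print(suggest_algorithm("Anomaly Detection", "ip addresses"))
```

The function refuses to answer until the problem type resolves, which is the same discipline the exam rewards.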

Complete Algorithm Reference

Supervised — Regression / Classification

Linear Learner

Aspect	Detail
Type	Supervised
Problem	Binary/multi-class classification, regression
How it works	Linear model (logistic regression / linear regression)
Input	RecordIO (float32) or CSV
Key feature	Built-in normalization, can train multiple models in parallel
Hyperparameters	`predictor_type` (regressor/binary_classifier/multiclass_classifier), `learning_rate`, `l1`, `mini_batch_size`
When to use	Simple linear relationships, baseline model, fast training

Trains many models with different hyperparameter settings in parallel and keeps the one that scores best on the validation set. Also supports SageMaker automatic model tuning.

ELI5: Linear Learner draws the best straight line (or flat plane) through your data to make predictions. If you’re predicting house prices, it finds the line that best fits “price goes up as square footage goes up.” Simple, fast, and surprisingly effective when your data actually has a roughly linear relationship. Always use it as your baseline — if a fancier model can’t beat it, the fancy model isn’t worth the complexity.
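
To make the "best straight line" idea concrete, here is a minimal sketch of a one-feature least-squares fit. It uses the closed-form formula rather than the stochastic gradient descent that a production trainer would typically use, and the house-price numbers are made up for illustration; `fit_line` is a hypothetical helper, not a SageMaker function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that follows price = 150 * square_feet + 50,000 exactly.
square_feet = [1000.0, 1500.0, 2000.0, 2500.0]
prices = [200_000.0, 275_000.0, 350_000.0, 425_000.0]

slope, intercept = fit_line(square_feet, prices)
predicted = slope * 1800.0 + intercept  # price estimate for an 1,800 sq ft house
print(slope, intercept, predicted)
```

If a more complex model cannot clearly beat the error of a fit like this on held-out data, the baseline is the better choice.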

XGBoost

Aspect	Detail
Type	Supervised
Problem	Classification, regression, ranking
How it works	Gradient-boosted decision trees (ensemble)
Input	CSV, LibSVM, RecordIO, Parquet
Key feature	Handles missing values natively, regularization built-in
Hyperparameters	`num_round`, `max_depth`, `eta` (learning rate), `subsample`, `objective`
When to use	Tabular data — often the best default choice

XGBoost Ensemble:
  Tree 1 (weak)  →  residuals  →  Tree 2  →  residuals  →  Tree 3  →  ...
       +                              +                         +
       └──────────────── Sum of predictions ─────────────────────┘
                         = Final prediction

Exam tip: XGBoost is the most commonly tested algorithm. If the question mentions tabular/structured data with complex patterns, XGBoost is likely the answer.

ELI5: XGBoost is like a team of weak students who each learn from the mistakes of the one before them. Student 1 tries to predict house prices and gets some wrong. Student 2 focuses specifically on fixing Student 1’s mistakes. Student 3 fixes what Student 2 still got wrong. After hundreds of rounds, their combined predictions are remarkably accurate. It’s the Swiss Army knife of ML: it handles missing values, has built-in regularization to limit overfitting, works on almost any tabular problem, and is a frequent winner of Kaggle competitions on tabular data. When in doubt on the exam, pick XGBoost.
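
The "each student fixes the previous one's mistakes" loop can be sketched with one-split trees (stumps) fitted to residuals. This is a simplified illustration of gradient boosting with squared error, not XGBoost itself, which additionally uses second-order gradients and regularization. The parameter names `num_round` and `eta` are borrowed from the hyperparameter table above; `fit_stump`, `boost`, and `mse` are hypothetical helpers.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, num_round, eta):
    """Start from the mean, then add eta * (stump fitted to the current residuals) each round."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + eta * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + eta * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [float(i) for i in range(1, 9)]
ys = [x * x for x in xs]  # a non-linear target a single stump cannot fit
for rounds in (1, 5, 50):
    print(rounds, mse(boost(xs, ys, num_round=rounds, eta=0.3), xs, ys))
```

Training error shrinks as rounds are added; on real data the same property is what makes a validation set and early stopping necessary.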

LightGBM

Aspect	Detail
Type	Supervised
Problem	Classification, regression
How it works	Gradient boosting with leaf-wise tree growth
Key feature	Faster than XGBoost on large datasets, lower memory
When to use	Large datasets where XGBoost is too slow

XGBoost vs LightGBM:

XGBoost: level-wise growth (balanced trees)
LightGBM: leaf-wise growth (deeper, potentially better accuracy, risk of overfitting)

Supervised — Time Series

DeepAR

Aspect	Detail
Type	Supervised
Problem	Time series forecasting
How it works	Autoregressive RNN (LSTM/GRU)
Input	JSON lines (start time + target values)
Key feature	Trains on multiple related time series simultaneously
Hyperparameters	`context_length`, `prediction_length`, `epochs`, `num_cells`
When to use	Forecasting across many related series (e.g., product demand)

DeepAR learns patterns across multiple series:
  Series A: ──────┐
  Series B: ──────┤  → Shared RNN model → Probabilistic forecasts
  Series C: ──────┘                        (with uncertainty bands)

Outputs probabilistic forecasts (quantiles), not just point estimates.

ELI5: Traditional forecasting tools look at one product’s sales history in isolation. DeepAR looks at hundreds or thousands of related products simultaneously and learns shared patterns such as “sales of all products spike in December” or “products in category X have a weekly cycle.” A product with only 2 weeks of history benefits from patterns learned across 500 similar products. This is why DeepAR tends to beat classical single-series methods such as ARIMA or ETS when you have many related time series to forecast at once.
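
The JSON Lines input noted in the table (a start time plus target values per series) can be sketched as follows. The `start` and `target` field names follow the DeepAR input format; the helper `to_deepar_jsonl` and the demand numbers are hypothetical and only illustrate the shape of the file.

```python
import json


def to_deepar_jsonl(series):
    """Serialize each time series as one JSON object per line."""
    lines = []
    for item in series:
        record = {"start": item["start"], "target": item["target"]}
        lines.append(json.dumps(record))
    return "\n".join(lines)


# Two made-up product-demand series; they may start at different times
# and have different lengths.
demand = [
    {"start": "2024-01-01 00:00:00", "target": [12, 15, 11, 18]},
    {"start": "2024-01-03 00:00:00", "target": [5, 7]},
]

payload = to_deepar_jsonl(demand)
print(payload)
```

Each line is an independent series, which is what lets a short history sit next to long ones in the same training file.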

Supervised — Sequence / Text

BlazingText

Aspect	Detail
Type	Supervised (text classification) / Unsupervised (Word2Vec)
Problem	Text classification, word embeddings
Modes	Word2Vec (Skip-gram, CBOW) and Text Classification (supervised)
Key feature	Optimized for multi-core + GPU, 10-20x faster than standard
When to use	Sentiment analysis, word embeddings, text categorization

Word2Vec Architectures:
  CBOW:       context words → predict center word (faster)
  Skip-gram:  center word → predict context words (better for rare words)

ELI5: BlazingText turns words into points in a mathematical space where meaning is encoded as distance. Words used in similar contexts end up close together: “king” is near “queen,” “Paris” is near “London.” The famous example: “King” − “Man” + “Woman” ≈ “Queen” — that’s not magic, it’s geometry in the embedding space. These numeric representations (embeddings) let ML models understand language by turning fuzzy human meaning into precise numbers that algorithms can compute with.
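
The "geometry in the embedding space" claim can be checked on a toy example. The 2-D vectors below are hand-made so that one axis loosely means "royalty" and the other "gender"; they are not learned embeddings, and `cosine` and `analogy` are hypothetical helpers rather than BlazingText functions.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made vectors: axis 0 ~ "royalty", axis 1 ~ "gender". Not learned.
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "paris": [-0.7, 0.1],
}


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


result = analogy("king", "man", "woman")
print(result)
```

Real Word2Vec embeddings have hundreds of dimensions and the analogy only holds approximately, but the arithmetic is the same.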

Seq2Seq (Sequence-to-Sequence)

Aspect	Detail
Type	Supervised
Problem	Sequence translation
How it works	Encoder-decoder RNN with attention
Input	Tokenized integer sequences (RecordIO)
When to use	Machine translation, text summarization, speech-to-text

Input must be tokenized into integer IDs with matching vocabulary files; pre-trained word embeddings are an optional way to initialize the model.
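
"Tokenized integer sequences" means every word is replaced by its index in a vocabulary before training. The sketch below shows the idea only; whitespace splitting, the `<pad>` and `<unk>` special tokens, and the helper names are assumptions for illustration and do not reproduce the preprocessing that SageMaker provides.

```python
def build_vocab(sentences, specials=("<pad>", "<unk>")):
    """Assign each distinct word an integer ID, reserving the first IDs for special tokens."""
    vocab = {token: index for index, token in enumerate(specials)}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Map words to IDs; words not seen during vocabulary building become <unk>."""
    unknown = vocab["<unk>"]
    return [vocab.get(word, unknown) for word in sentence.lower().split()]


corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocab(corpus)
ids = encode("the bird sat", vocab)  # "bird" is out of vocabulary
print(vocab)
print(ids)
```

The source and target languages each need their own vocabulary, since the encoder and decoder work over different word sets.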

Supervised — Computer Vision

Image Classification

Aspect	Detail
Type	Supervised
Problem	Classify entire image into a category
How it works	CNN (ResNet architecture)
Input	RecordIO or image files (JPG, PNG)
Key feature	Transfer learning — use pre-trained ImageNet weights
When to use	“What is this image?” (single label or multi-label)

Object Detection

Aspect	Detail
Type	Supervised
Problem	Detect and locate objects in images
How it works	CNN (Single Shot MultiBox Detector, SSD, with a VGG or ResNet base network)
Output	Bounding boxes + class labels + confidence scores
When to use	“Where are the objects and what are they?”

Semantic Segmentation

Aspect	Detail
Type	Supervised
Problem	Classify every pixel in an image
How it works	Fully convolutional encoder-decoder (FCN, PSP, or DeepLabV3) on a ResNet backbone
Output	Segmentation mask (label per pixel)
When to use	Self-driving cars, medical imaging, satellite analysis

Computer Vision Task Comparison:

Image Classification:  "This is a cat"         → 1 label per image
Object Detection:      "Cat at (x1,y1,x2,y2)" → bounding boxes
Semantic Segmentation: "These pixels are cat"  → pixel-level mask

ELI5: The three vision tasks answer three different questions about the same photo. Classification: “What’s in this photo?” — one answer for the whole image. Object Detection: “Where exactly is it?” — draw a rectangle around each object and label it. Semantic Segmentation: “Color every pixel by what it is” — every pixel in the cat gets labeled “cat,” every pixel of road gets labeled “road.” More precise = more compute = more labeled training data required.
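
One way to see how the three outputs relate is to start from the most detailed one and derive the others. The sketch below uses a hand-made 4x4 mask (not model output) and two hypothetical helpers to collapse a pixel mask into a bounding box and then into a single label.

```python
# Toy 4x4 segmentation mask: 1 = cat pixel, 0 = background (hand-made, not model output).
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]


def mask_to_box(mask, class_id):
    """Smallest (x1, y1, x2, y2) rectangle covering every pixel of class_id, or None."""
    coords = [(x, y) for y, row in enumerate(mask) for x, value in enumerate(row) if value == class_id]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


def mask_to_label(mask, names):
    """Image-level label: the non-background class that covers the most pixels."""
    counts = {}
    for row in mask:
        for value in row:
            if value != 0:
                counts[value] = counts.get(value, 0) + 1
    return names[max(counts, key=counts.get)] if counts else "background"


box = mask_to_box(mask, 1)                # detection-style output
label = mask_to_label(mask, {1: "cat"})   # classification-style output
print(box, label)
```

The conversion only works downward: a label cannot be turned back into a box, nor a box into a mask, which is why the finer tasks need richer annotations.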

Unsupervised — Clustering & Dimensionality

K-Means

Aspect	Detail
Type	Unsupervised
Problem	Clustering
How it works	Iteratively assigns points to nearest centroid
Hyperparameters	`k` (number of clusters), `init_method`
SageMaker variant	Uses web-scale k-means (mini-batch for large data)
When to use	Customer segmentation, grouping similar items

Choose k using the Elbow method: plot distortion vs k, pick the “elbow” point.
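
A minimal sketch of the Elbow method on 1-D data follows. It uses plain Lloyd's algorithm with a deterministic initialization chosen for reproducibility (an assumption; SageMaker's web-scale variant initializes and updates differently), and the data is made up to contain two obvious groups.

```python
def kmeans_1d(points, k, iterations=20):
    """Lloyd's algorithm on 1-D data; returns (centroids, distortion)."""
    ordered = sorted(points)
    # Deterministic init (assumption, for reproducibility): k evenly spaced data points.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Recompute each centroid as its cluster mean; keep the old one if the cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    # Distortion: total squared distance from each point to its nearest centroid.
    distortion = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, distortion


# Made-up data with two obvious groups, around 1 and around 5.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1]
distortions = {k: kmeans_1d(data, k)[1] for k in range(1, 5)}
for k, d in distortions.items():
    print(k, d)
```

Distortion collapses when k goes from 1 to 2 and barely improves afterwards, so the elbow sits at k = 2.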

PCA (Principal Component Analysis)

Aspect	Detail
Type	Unsupervised
Problem	Dimensionality reduction
How it works	Finds orthogonal axes of maximum variance
Modes	Regular (covariance matrix) or Randomized (approximation for large data)
When to use	Too many features, reduce noise, visualize high-dimensional data
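
"Orthogonal axes of maximum variance" can be made concrete in two dimensions, where the covariance matrix is 2x2 and its eigenvalues have a closed form. This is a sketch of the regular (covariance) mode on made-up points; `pca_2d` is a hypothetical helper and assumes the points are not all identical.

```python
import math


def pca_2d(points):
    """Return (first_component, explained_variance_ratio) for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    largest, smallest = half_trace + root, half_trace - root
    # Eigenvector belonging to the largest eigenvalue.
    if sxy != 0:
        vx, vy = largest - syy, sxy
    elif sxx >= syy:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    length = math.hypot(vx, vy)
    return (vx / length, vy / length), largest / (largest + smallest)


# Made-up points scattered closely around the line y = x.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
component, ratio = pca_2d(points)
print(component, ratio)
```

Almost all of the variance lies along a single diagonal direction, so the two original features could be replaced by one projected feature with little loss.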

Unsupervised — Anomaly Detection

Random Cut Forest (RCF)

Aspect	Detail
Type	Unsupervised
Problem	Anomaly detection
How it works	Builds ensemble of random trees; anomalies are isolated quickly
Output	Anomaly score (higher = more anomalous)
Key feature	Works on streaming data too (Kinesis Analytics has built-in RCF)
When to use	Detect spikes, outliers, unusual patterns in time series

ELI5: RCF is like a bouncer at a club deciding how “weird” each person is. The bouncer builds a mental model of what “normal” looks like from the crowd. Someone who fits the usual pattern blends in and gets a low weirdness score. Someone wildly out of place — a spike in server traffic at 3am, a transaction ten times larger than usual — stands out and gets a high anomaly score. RCF formalizes this by asking: “How many random cuts does it take to isolate this data point?” Normal points are surrounded by neighbors and take many cuts. Anomalies are isolated quickly — hence a high score.
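
The "how many random cuts does it take to isolate this point?" question can be sketched in one dimension. This is a deliberately simplified illustration, not SageMaker's RCF implementation, which builds trees over random samples in many dimensions; the traffic numbers and helper names are made up.

```python
import random


def cuts_to_isolate(points, target, rng):
    """Count random cuts until `target` is alone on its side (1-D case)."""
    remaining = list(points)
    cuts = 0
    while len(remaining) > 1:
        low, high = min(remaining), max(remaining)
        if low == high:  # only duplicates of the target are left
            break
        cut = rng.uniform(low, high)
        cuts += 1
        # Keep only the side of the cut that still contains the target.
        if target <= cut:
            remaining = [p for p in remaining if p <= cut]
        else:
            remaining = [p for p in remaining if p > cut]
    return cuts


def average_cuts(points, target, trials=200, seed=0):
    """Average over many random trials, like averaging over the trees of a forest."""
    rng = random.Random(seed)
    return sum(cuts_to_isolate(points, target, rng) for _ in range(trials)) / trials


# Made-up requests-per-minute values with one obvious spike.
requests_per_minute = [98, 101, 99, 102, 100, 97, 103, 100.5, 950]
normal_cuts = average_cuts(requests_per_minute, 100)
spike_cuts = average_cuts(requests_per_minute, 950)
print(normal_cuts, spike_cuts)
```

Fewer cuts means easier to isolate, so a scoring function would map the spike's low cut count to a high anomaly score.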

IP Insights

Aspect	Detail
Type	Unsupervised
Problem	Learn IP address usage patterns
How it works	Neural network learns entity-IP associations
When to use	Detect anomalous logins, fraudulent IP usage

Unsupervised — Topic Modeling

Neural Topic Model (NTM)

Aspect	Detail
Type	Unsupervised
Problem	Discover topics in text corpus
How it works	Neural variational inference
When to use	Organize documents, discover themes, more accurate than LDA

Latent Dirichlet Allocation (LDA)

Aspect	Detail
Type	Unsupervised
Problem	Topic modeling
How it works	Probabilistic generative model
Difference from NTM	Classical statistical approach, faster, less accurate

NTM vs LDA: NTM is neural-network based (more accurate), LDA is classical (faster, more interpretable).

Recommendations

Factorization Machines

Aspect	Detail
Type	Supervised
Problem	Classification, regression on sparse data
How it works	Captures pairwise feature interactions efficiently
Key feature	Handles high-dimensional sparse data (click-through, recommendations)
When to use	Recommender systems, click prediction

Object2Vec

Aspect	Detail
Type	Supervised
Problem	Learn embeddings for paired objects
How it works	Encoder network learns low-dimensional representations
When to use	Document similarity, recommendation, relationship learning

SageMaker Input Modes

┌─────────────┐  ┌─────────────┐  ┌──────────────┐
│  File Mode  │  │  Pipe Mode  │  │ FastFile Mode│
├─────────────┤  ├─────────────┤  ├──────────────┤
│ Downloads   │  │ Streams from│  │ POSIX-like   │
│ entire      │  │ S3 directly │  │ access to S3 │
│ dataset to  │  │ (no disk    │  │ (lazy load)  │
│ local disk  │  │  needed)    │  │              │
│             │  │             │  │              │
│ Simplest    │  │ Fastest     │  │ Best of both │
│ Any format  │  │ RecordIO    │  │ Any format   │
│ Needs disk  │  │ required    │  │ No download  │
│ Default     │  │             │  │              │
└─────────────┘  └─────────────┘  └──────────────┘

Mode	Speed	Disk Required	Format
File	Slowest (download first)	Yes	Any
Pipe	Fastest	No	RecordIO / TFRecord
FastFile	Fast (stream on demand)	No	Any

Exam tip: Pipe mode + RecordIO = maximum training throughput. File mode is the default; FastFile mode is the low-effort way to stream data in any format.

Why this matters for the exam: Input mode questions come up frequently. The mental model: File mode (the default) downloads everything before training starts, so your instance idles while it waits for the download. Pipe mode streams data directly from S3 as training runs, with no waiting and no disk needed, and built-in algorithms typically expect RecordIO for it. FastFile also streams from S3 on demand but exposes the data as ordinary files, so it works with any format and needs no code changes. When the exam says “minimize training startup time” or “largest dataset, fastest training”, Pipe + RecordIO wins.
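
The decision can be sketched as a small helper. The returned strings are the input mode values that SageMaker training jobs accept (`File`, `Pipe`, `FastFile`); the function `choose_input_mode` and its two parameters are hypothetical and encode only the exam heuristics from the table above, not an official recommendation.

```python
def choose_input_mode(data_format: str, wants_streaming: bool) -> str:
    """Return "File", "Pipe", or "FastFile" following the exam heuristics."""
    if not wants_streaming:
        return "File"      # download everything first; simplest option
    if data_format.lower() in {"recordio", "tfrecord"}:
        return "Pipe"      # streaming formats listed in the table
    return "FastFile"      # stream on demand, any format


print(choose_input_mode("csv", wants_streaming=False))
print(choose_input_mode("RecordIO", wants_streaming=True))
print(choose_input_mode("parquet", wants_streaming=True))
```

In the SageMaker Python SDK the chosen value is normally passed as the estimator's `input_mode` argument.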

Additional Algorithms (AutoML)

AutoGluon-Tabular

Aspect	Detail
Type	AutoML (ensemble stacking)
Problem	Classification, regression on tabular data
How it works	Automatically trains multiple models (XGBoost, LightGBM, CatBoost, NN, etc.) and stacks them
Key feature	Strong accuracy with little or no manual tuning: just point it at the data
When to use	When you want highest accuracy without manual algorithm selection

CatBoost

Aspect	Detail
Type	Supervised (gradient boosting)
Problem	Classification, regression
Key feature	Native categorical feature support (no encoding needed)
When to use	Data with many categorical features

Compute Type per Algorithm

Algorithm	CPU	GPU	Recommended	Multi-Instance
XGBoost	Yes	Yes (gpu_hist)	CPU (ml.m5)	Yes
Linear Learner	Yes	Yes	CPU (ml.m5)	Yes
Factorization Machines	Yes	No	CPU (ml.c5)	Yes
KNN	Yes	Yes (inference)	CPU (ml.m5)	No
K-Means	Yes	No	CPU (ml.m5)	Yes
PCA	Yes	Yes	CPU (ml.m5)	No
Random Cut Forest	Yes	No	CPU (ml.m5)	No
IP Insights	Yes	Yes	GPU (ml.p3)	No
BlazingText	Yes	Yes	Mode-dependent	No
Seq2Seq	No	Yes only	GPU (ml.p3)	No (single instance, multi-GPU)
DeepAR	Yes	Yes	GPU for large	Yes
Object2Vec	Yes	Yes	GPU (ml.p3)	No
Image Classification	No	Yes only	GPU (ml.p3)	Yes
Object Detection	No	Yes only	GPU (ml.p3)	Yes
Semantic Segmentation	No	Yes only	GPU (ml.p3)	No
NTM	Yes	Yes	GPU for large	Yes
LDA	Yes	No	CPU (ml.m5)	No

Exam rules:
Vision algorithms + Seq2Seq = GPU only for training, always
Tree-based + classical ML = CPU first
“Cost-effective” + CPU-capable algorithm = pick CPU instance
XGBoost on GPU? Possible (tree_method=gpu_hist) but NOT cost-effective — exam answer is CPU

Quick Decision Matrix

Your Data	Your Problem	Algorithm
Tabular, structured	Classification/Regression	XGBoost
Tabular, simple linear	Classification/Regression	Linear Learner
Multiple time series	Forecasting	DeepAR
Text documents	Classification	BlazingText
Text corpus	Topic discovery	NTM (or LDA)
Images	Classify whole image	Image Classification
Images	Find objects	Object Detection
Images	Label every pixel	Semantic Segmentation
Numeric data	Find anomalies	Random Cut Forest
IP addresses	Detect fraud	IP Insights
Sparse interaction data	Recommendations	Factorization Machines
High-dimensional	Reduce features	PCA
Any data	Group into clusters	K-Means
Sequences	Translate/summarize	Seq2Seq
Word representations	Embeddings	BlazingText (Word2Vec)
Object pairs	Similarity/embeddings	Object2Vec
