SageMaker Built-In Algorithms#
Exam Domain: 2 — ML Model Development (26%)
Task: Choose the right modeling approach
Algorithm Selection Flowchart#
What type of problem?
│
├─ Predict a number? ──────────── REGRESSION
│ ├─ Linear relationship? → Linear Learner
│ ├─ Complex / tabular? → XGBoost / LightGBM
│ └─ Time series? → DeepAR
│
├─ Predict a category? ────────── CLASSIFICATION
│ ├─ Linear decision boundary? → Linear Learner (binary or multi-class)
│ ├─ Complex / tabular? → XGBoost / LightGBM
│ ├─ Image? → Image Classification
│ ├─ Object in image? → Object Detection
│ ├─ Pixel-level? → Semantic Segmentation
│ └─ Text? → BlazingText
│
├─ Group similar items? ───────── CLUSTERING
│ └─ K groups? → K-Means
│
├─ Reduce dimensions? ─────────── DIMENSIONALITY REDUCTION
│ └─ Linear? → PCA
│
├─ Detect anomalies? ──────────── ANOMALY DETECTION
│ ├─ Numeric / time series? → Random Cut Forest (RCF)
│ └─ IP address patterns? → IP Insights
│
├─ Text / NLP? ────────────────── NLP
│ ├─ Word embeddings? → BlazingText (Word2Vec)
│ ├─ Text classification? → BlazingText (supervised)
│ ├─ Topic discovery? → NTM or LDA
│ └─ Sequence translation? → Seq2Seq
│
├─ Recommendations? ───────────── RECOMMENDATIONS
│ ├─ Sparse data (clicks)? → Factorization Machines
│ └─ Pair embeddings? → Object2Vec
│
└─ Forecasting? ───────────────── TIME SERIES
└─ Multiple related series? → DeepAR
Why this matters for the exam: This flowchart is the most important mental model to internalize. The key insight: always start with your problem type (regression, classification, clustering, anomaly detection), not with the algorithm name. Exam questions describe a business scenario — your job is to classify the problem first, then the algorithm choice often becomes obvious. Most wrong answers fail because test-takers jump to an algorithm before identifying the problem type.
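The "problem type first, algorithm second" habit can be sketched as a two-step lookup. This is only an illustration of the flowchart above: the `FLOWCHART` table and the `choose_algorithm` helper are hypothetical names introduced here, and the table covers only a subset of the branches.

```python
# Hypothetical lookup mirroring part of the flowchart:
# classify the problem first, then narrow by the shape of the data.
FLOWCHART = {
    ("regression", "linear"): "Linear Learner",
    ("regression", "tabular"): "XGBoost / LightGBM",
    ("regression", "time series"): "DeepAR",
    ("classification", "tabular"): "XGBoost / LightGBM",
    ("classification", "image"): "Image Classification",
    ("classification", "text"): "BlazingText",
    ("clustering", "any"): "K-Means",
    ("dimensionality reduction", "linear"): "PCA",
    ("anomaly detection", "numeric"): "Random Cut Forest (RCF)",
    ("anomaly detection", "ip address"): "IP Insights",
    ("recommendations", "sparse"): "Factorization Machines",
}


def choose_algorithm(problem_type: str, data_shape: str) -> str:
    """Return the flowchart's suggestion for a (problem type, data shape) pair."""
    key = (problem_type.strip().lower(), data_shape.strip().lower())
    if key not in FLOWCHART:
        raise KeyError(f"No flowchart branch for {key}")
    return FLOWCHART[key]


print(choose_algorithm("Regression", "time series"))
print(choose_algorithm("Anomaly detection", "IP address"))
```

Note that the function cannot be called without naming the problem type, which is the same discipline the exam scenarios reward.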
Complete Algorithm Reference#
Supervised — Regression / Classification#
Linear Learner#
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Binary/multi-class classification, regression |
| How it works | Linear model (logistic regression / linear regression) |
| Input | RecordIO (float32) or CSV |
| Key feature | Built-in normalization, can train multiple models in parallel |
| Hyperparameters | predictor_type (regressor/binary_classifier/multiclass_classifier), learning_rate, l1, mini_batch_size |
| When to use | Simple linear relationships, baseline model, fast training |
Trains many models with different hyperparameters in parallel and keeps the best one; it also works with SageMaker automatic model tuning.
ELI5: Linear Learner draws the best straight line (or flat plane) through your data to make predictions. If you’re predicting house prices, it finds the line that best fits “price goes up as square footage goes up.” Simple, fast, and surprisingly effective when your data actually has a roughly linear relationship. Always use it as your baseline — if a fancier model can’t beat it, the fancy model isn’t worth the complexity.
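To make the "best straight line" idea concrete, here is a minimal sketch using closed-form ordinary least squares on made-up house data. This is a swapped-in technique for illustration: it is not how the built-in Linear Learner is trained, and `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that follows price = 200 * square footage exactly.
sqft = [1000, 1500, 2000, 2500]
price = [200_000, 300_000, 400_000, 500_000]

slope, intercept = fit_line(sqft, price)
predicted = slope * 1800 + intercept
print(slope, intercept, predicted)
```

On data this clean the line recovers the underlying rule, which is why a linear baseline is a useful first check before reaching for a more complex model.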
XGBoost#
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Classification, regression, ranking |
| How it works | Gradient-boosted decision trees (ensemble) |
| Input | CSV, LibSVM, RecordIO, Parquet |
| Key feature | Handles missing values natively, regularization built-in |
| Hyperparameters | num_round, max_depth, eta (learning rate), subsample, objective |
| When to use | Tabular data — often the best default choice |
XGBoost Ensemble:
Tree 1 (weak) → residuals → Tree 2 → residuals → Tree 3 → ...
+ + +
└──────────────── Sum of predictions ─────────────────────┘
= Final prediction
Exam tip: XGBoost is the most commonly tested algorithm. If the question mentions tabular/structured data with complex patterns, XGBoost is likely the answer.
ELI5: XGBoost is like a team of weak students who each learn from the mistakes of the one before them. Student 1 tries to predict house prices and gets some wrong. Student 2 focuses specifically on fixing Student 1’s mistakes. Student 3 fixes what Student 2 still got wrong. After hundreds of rounds, their combined predictions are remarkably accurate. It’s the Swiss Army knife of ML: it handles missing values, includes regularization to limit overfitting, works on almost any tabular problem, and has won many Kaggle competitions on structured data. When in doubt on the exam, XGBoost.
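The "each student fixes the previous one's mistakes" loop can be sketched as plain gradient boosting with squared loss and one-split trees (stumps). This is a simplified illustration, not XGBoost itself, which adds second-order gradients, regularization, and much more. The names `num_round` and `eta` are borrowed from the hyperparameter row above; `fit_stump`, `boost`, and the data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on 1-D inputs (least squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, num_round=50, eta=0.3):
    """Each round fits a stump to the current residuals and adds a shrunken copy."""
    base = sum(ys) / len(ys)
    trees = []
    preds = [base] * len(xs)
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + eta * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + eta * sum(tree(x) for tree in trees)


# Made-up data with a step between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model = boost(xs, ys)
mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(baseline_mse, mse)
```

The training error of the boosted ensemble ends up far below that of the constant baseline, even though every individual stump is a very weak model.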
LightGBM#
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Classification, regression |
| How it works | Gradient boosting with leaf-wise tree growth |
| Key feature | Faster than XGBoost on large datasets, lower memory |
| When to use | Large datasets where XGBoost is too slow |
XGBoost vs LightGBM:
- XGBoost: level-wise growth (balanced trees)
- LightGBM: leaf-wise growth (deeper, potentially better accuracy, risk of overfitting)
Supervised — Time Series#
DeepAR#
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Time series forecasting |
| How it works | Autoregressive RNN (LSTM/GRU) |
| Input | JSON lines (start time + target values) |
| Key feature | Trains on multiple related time series simultaneously |
| Hyperparameters | context_length, prediction_length, epochs, num_cells |
| When to use | Forecasting across many related series (e.g., product demand) |
DeepAR learns patterns across multiple series:
Series A: ──────┐
Series B: ──────┤ → Shared RNN model → Probabilistic forecasts
Series C: ──────┘ (with uncertainty bands)
Outputs probabilistic forecasts (quantiles), not just point estimates.
ELI5: Traditional forecasting tools look at one product’s sales history in isolation. DeepAR looks at hundreds or thousands of related products simultaneously and learns shared patterns — “sales of all products spike in December,” “products in category X have a weekly cycle.” A product with only 2 weeks of history benefits from patterns learned across 500 similar products. This makes DeepAR dramatically better than classical methods when you have many related time series to forecast at once.
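As a rough sketch of the JSON Lines input mentioned in the table, each series becomes one JSON object with a `start` timestamp and a `target` array. The helper `to_deepar_jsonl` and the demand numbers are hypothetical; optional fields and the exact timestamp conventions should be checked against the DeepAR documentation.

```python
import json


def to_deepar_jsonl(series):
    """Serialize (start_timestamp, target_values) pairs, one JSON object per line."""
    lines = []
    for start, target in series:
        lines.append(json.dumps({"start": start, "target": list(target)}))
    return "\n".join(lines)


# Made-up daily demand for two hypothetical products.
demand = [
    ("2024-01-01 00:00:00", [12, 15, 9, 20]),
    ("2024-01-03 00:00:00", [3, 4]),  # a short history still contributes to the shared model
]

jsonl = to_deepar_jsonl(demand)
print(jsonl)
```

Series of different lengths and start dates can sit side by side in the same file, which is what lets one model learn across all of them.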
Supervised — Sequence / Text#
BlazingText#
| Aspect | Detail |
|---|---|
| Type | Supervised (text classification) / Unsupervised (Word2Vec) |
| Problem | Text classification, word embeddings |
| Modes | Word2Vec (Skip-gram, CBOW) and Text Classification (supervised) |
| Key feature | Optimized for multi-core + GPU, 10-20x faster than standard |
| When to use | Sentiment analysis, word embeddings, text categorization |
Word2Vec Architectures:
CBOW: context words → predict center word (faster)
Skip-gram: center word → predict context words (better for rare words)
ELI5: BlazingText turns words into points in a mathematical space where meaning is encoded as distance. Words used in similar contexts end up close together: “king” is near “queen,” “Paris” is near “London.” The famous example: “King” − “Man” + “Woman” ≈ “Queen” — that’s not magic, it’s geometry in the embedding space. These numeric representations (embeddings) let ML models understand language by turning fuzzy human meaning into precise numbers that algorithms can compute with.
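The "geometry in the embedding space" claim can be illustrated with a toy example. The 2-D vectors below are hand-made for illustration (real embeddings are learned and have many more dimensions), and `EMBEDDINGS`, `cosine`, and `analogy` are hypothetical names.

```python
import math

# Hand-made 2-D vectors: first axis loosely "royalty", second loosely "gender".
EMBEDDINGS = {
    "king": (0.9, 0.8),
    "queen": (0.9, -0.8),
    "man": (0.1, 0.8),
    "woman": (0.1, -0.8),
    "apple": (-0.7, 0.1),
}


def cosine(u, v):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three input words."""
    target = tuple(
        x - y + z
        for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    )
    candidates = {w: v for w, v in EMBEDDINGS.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


print(analogy("king", "man", "woman"))
```

Subtracting "man" and adding "woman" flips the second coordinate while leaving the first alone, so the nearest remaining word is "queen".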
Seq2Seq (Sequence-to-Sequence)#
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Sequence translation |
| How it works | Encoder-decoder RNN with attention |
| Input | Tokenized integer sequences (RecordIO) |
| When to use | Machine translation, text summarization, speech-to-text |
Must be initialized with pre-trained word embeddings or use provided tokenization.
Supervised — Computer Vision#
Image Classification#
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Classify entire image into a category |
| How it works | CNN (ResNet architecture) |
| Input | RecordIO or image files (JPG, PNG) |
| Key feature | Transfer learning — use pre-trained ImageNet weights |
| When to use | “What is this image?” (single label or multi-label) |
Object Detection#
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Detect and locate objects in images |
| How it works | CNN (Single Shot MultiBox Detector, SSD, with a VGG or ResNet base network) |
| Output | Bounding boxes + class labels + confidence scores |
| When to use | “Where are the objects and what are they?” |
Semantic Segmentation#
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Classify every pixel in an image |
| How it works | FCN (Fully Convolutional Network), PSP (Pyramid Scene Parsing), or DeepLabV3 |
| Output | Segmentation mask (label per pixel) |
| When to use | Self-driving cars, medical imaging, satellite analysis |
Computer Vision Task Comparison:
Image Classification: "This is a cat" → 1 label per image
Object Detection: "Cat at (x1,y1,x2,y2)" → bounding boxes
Semantic Segmentation: "These pixels are cat" → pixel-level mask
ELI5: The three vision tasks answer three different questions about the same photo. Classification: “What’s in this photo?” — one answer for the whole image. Object Detection: “Where exactly is it?” — draw a rectangle around each object and label it. Semantic Segmentation: “Color every pixel by what it is” — every pixel in the cat gets labeled “cat,” every pixel of road gets labeled “road.” More precise = more compute = more labeled training data required.
Unsupervised — Clustering & Dimensionality#
K-Means#
| Aspect | Detail |
|---|---|
| Type | Unsupervised |
| Problem | Clustering |
| How it works | Iteratively assigns points to nearest centroid |
| Hyperparameters | k (number of clusters), init_method |
| SageMaker variant | Uses web-scale k-means (mini-batch for large data) |
| When to use | Customer segmentation, grouping similar items |
Choose k using the Elbow method: plot distortion vs k, pick the “elbow” point.
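A minimal sketch of the elbow method, assuming plain Lloyd's k-means on 1-D data with a deterministic quantile-based initialization (chosen here only for reproducibility). This is not SageMaker's web-scale variant, and `kmeans_1d` and the data are hypothetical.

```python
def kmeans_1d(points, k, iterations=20):
    """Plain Lloyd's algorithm on 1-D data; returns (centroids, distortion)."""
    pts = sorted(points)
    # Assumed initialization: k evenly spaced quantiles of the sorted data.
    centroids = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    # Distortion: sum of squared distances to the nearest centroid.
    distortion = sum(min((p - c) ** 2 for c in centroids) for p in pts)
    return centroids, distortion


# Made-up data with three obvious groups near 1, 5, and 9.
data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]

distortions = {}
for k in range(1, 6):
    _, distortions[k] = kmeans_1d(data, k)
    print(k, round(distortions[k], 3))
```

Distortion falls sharply up to k = 3 and barely moves afterward; that bend is the "elbow", and it matches the three groups built into the data.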
PCA (Principal Component Analysis)#
| Aspect | Detail |
|---|---|
| Type | Unsupervised |
| Problem | Dimensionality reduction |
| How it works | Finds orthogonal axes of maximum variance |
| Modes | Regular (covariance matrix) or Randomized (approximation for large data) |
| When to use | Too many features, reduce noise, visualize high-dimensional data |
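To illustrate "axes of maximum variance", the sketch below builds the covariance matrix (the "regular" mode idea from the table) and finds its dominant direction with power iteration. Power iteration is a choice made here for a dependency-free example, not a statement about SageMaker's implementation; `first_principal_component` and the data are hypothetical.

```python
def first_principal_component(rows, iterations=100):
    """Return the unit vector along the direction of greatest variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration: repeatedly multiply by the covariance matrix and renormalize.
    # Assumes the data is not constant, so the norm below is never zero.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Made-up 2-D points lying exactly on the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
component = first_principal_component(rows)
print(component)
```

Because all the variance lies along y = 2x, the first component points in the direction (1, 2), and projecting onto it would reduce two features to one without losing information.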
Unsupervised — Anomaly Detection#
Random Cut Forest (RCF)#
| Aspect | Detail |
|---|---|
| Type | Unsupervised |
| Problem | Anomaly detection |
| How it works | Builds ensemble of random trees; anomalies are isolated quickly |
| Output | Anomaly score (higher = more anomalous) |
| Key feature | Works on streaming data too (Kinesis Analytics has built-in RCF) |
| When to use | Detect spikes, outliers, unusual patterns in time series |
ELI5: RCF is like a bouncer at a club deciding how “weird” each person is. The bouncer builds a mental model of what “normal” looks like from the crowd. Someone who fits the usual pattern blends in and gets a low weirdness score. Someone wildly out of place — a spike in server traffic at 3am, a transaction ten times larger than usual — stands out and gets a high anomaly score. RCF formalizes this by asking: “How many random cuts does it take to isolate this data point?” Normal points are surrounded by neighbors and take many cuts. Anomalies are isolated quickly — hence a high score.
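The "how many random cuts to isolate this point?" question can be simulated directly. The sketch below is a 1-D illustration of the isolation intuition only, not the actual RCF scoring; `cuts_to_isolate`, the seed, and the data are assumptions, and the points must be distinct.

```python
import random


def cuts_to_isolate(points, target, rng):
    """Count random cuts until `target` is the only point left on its side."""
    current = list(points)
    cuts = 0
    while len(current) > 1:
        lo, hi = min(current), max(current)
        cut = rng.uniform(lo, hi)
        # Keep only the side of the cut that still contains the target.
        if target <= cut:
            current = [p for p in current if p <= cut]
        else:
            current = [p for p in current if p > cut]
        cuts += 1
    return cuts


# Made-up data: 20 tightly packed "normal" values and one far-away spike.
normal = [10.0 + 0.1 * i for i in range(20)]
anomaly = 100.0
points = normal + [anomaly]

rng = random.Random(0)  # fixed seed so repeated runs behave the same
trials = 200


def average_cuts(target):
    return sum(cuts_to_isolate(points, target, rng) for _ in range(trials)) / trials


anomaly_avg = average_cuts(anomaly)
normal_avg = average_cuts(normal[10])
print(anomaly_avg, normal_avg)
```

The spike is usually separated by the very first cut, while a point in the middle of the crowd always needs at least one cut on each side. Fewer cuts corresponds to a higher anomaly score.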
IP Insights#
| Aspect | Detail |
|---|---|
| Type | Unsupervised |
| Problem | Learn IP address usage patterns |
| How it works | Neural network learns entity-IP associations |
| When to use | Detect anomalous logins, fraudulent IP usage |
Unsupervised — Topic Modeling#
Neural Topic Model (NTM)#
| Aspect | Detail |
|---|---|
| Type | Unsupervised |
| Problem | Discover topics in text corpus |
| How it works | Neural variational inference |
| When to use | Organize documents, discover themes, more accurate than LDA |
Latent Dirichlet Allocation (LDA)#
| Aspect | Detail |
|---|---|
| Type | Unsupervised |
| Problem | Topic modeling |
| How it works | Probabilistic generative model |
| Difference from NTM | Classical statistical approach, faster, less accurate |
NTM vs LDA: NTM is neural-network based (more accurate), LDA is classical (faster, more interpretable).
Recommendations#
Factorization Machines#
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Classification, regression on sparse data |
| How it works | Captures pairwise feature interactions efficiently |
| Key feature | Handles high-dimensional sparse data (click-through, recommendations) |
| When to use | Recommender systems, click prediction |
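For reference, the "pairwise feature interactions" row corresponds to the standard second-order factorization machine model. The notation below is introduced here and is not taken from the text above: $w_0$ is a bias, $w_i$ are linear weights, $v_i$ are low-dimensional factor vectors, and $n$ is the number of features.

```latex
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i
           + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j
```

Each interaction weight is the dot product of two factor vectors rather than a free parameter, so the model can estimate interactions even for feature pairs that rarely appear together, which is what makes it suitable for sparse click and recommendation data.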
Object2Vec#
| Aspect | Detail |
|---|---|
| Type | Supervised |
| Problem | Learn embeddings for paired objects |
| How it works | Encoder network learns low-dimensional representations |
| When to use | Document similarity, recommendation, relationship learning |
Training Input Modes#
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ File Mode │ │ Pipe Mode │ │ FastFile Mode│
├─────────────┤ ├─────────────┤ ├─────────────┤
│ Downloads │ │ Streams from│ │ POSIX-like │
│ entire │ │ S3 directly │ │ access to S3 │
│ dataset to │ │ (no disk │ │ (lazy load) │
│ local disk │ │ needed) │ │ │
│ │ │ │ │ │
│ Simplest │ │ Fastest │ │ Best of both │
│ Any format │ │ RecordIO │ │ Any format │
│ Needs disk │ │ required │ │ Newer option │
└─────────────┘ └─────────────┘ └─────────────┘
| Mode | Speed | Disk Required | Format |
|---|---|---|---|
| File | Slowest (download first) | Yes | Any |
| Pipe | Fastest | No | RecordIO / TFRecord |
| FastFile | Fast (stream on demand) | No | Any |
Exam tip: Pipe mode + RecordIO = maximum training throughput. FastFile mode streams on demand with any format; File mode remains the default when no input mode is specified.
Why this matters for the exam: Input mode questions come up frequently. The mental model: File mode downloads everything before training starts, so your instance idles while it waits for the download. Pipe mode streams data directly from S3 as training runs: no waiting and no disk needed, but the built-in algorithms expect RecordIO format. FastFile also streams from S3 on demand, works with any format, and exposes the data as ordinary files, so it needs no code changes; File mode is still the default if you set nothing. When the exam says “minimize training startup time” or “largest dataset, fastest training”, Pipe + RecordIO wins.
Additional Algorithms (AutoML)#
AutoGluon-Tabular#
| Aspect | Detail |
|---|---|
| Type | AutoML (ensemble stacking) |
| Problem | Classification, regression on tabular data |
| How it works | Automatically trains multiple models (XGBoost, LightGBM, CatBoost, NN, etc.) and stacks them |
| Key feature | Best accuracy with zero tuning — just point at data |
| When to use | When you want highest accuracy without manual algorithm selection |
CatBoost#
| Aspect | Detail |
|---|---|
| Type | Supervised (gradient boosting) |
| Problem | Classification, regression |
| Key feature | Native categorical feature support (no encoding needed) |
| When to use | Data with many categorical features |
Compute Type per Algorithm#
| Algorithm | CPU | GPU | Recommended | Multi-Instance |
|---|---|---|---|---|
| XGBoost | Yes | Yes (gpu_hist) | CPU (ml.m5) | Yes |
| Linear Learner | Yes | Yes | CPU (ml.m5) | Yes |
| Factorization Machines | Yes | No | CPU (ml.c5) | Yes |
| KNN | Yes | Yes (inference) | CPU (ml.m5) | No |
| K-Means | Yes | No | CPU (ml.m5) | Yes |
| PCA | Yes | Yes | CPU (ml.m5) | No |
| Random Cut Forest | Yes | No | CPU (ml.m5) | No |
| IP Insights | Yes | Yes | GPU (ml.p3) | No |
| BlazingText | Yes | Yes | Mode-dependent | No |
| Seq2Seq | No | Yes only | GPU (ml.p3) | No (single instance, multi-GPU) |
| DeepAR | Yes | Yes | GPU for large | No |
| Object2Vec | Yes | Yes | GPU (ml.p3) | No |
| Image Classification | No | Yes only | GPU (ml.p3) | Yes |
| Object Detection | No | Yes only | GPU (ml.p3) | Yes |
| Semantic Segmentation | No | Yes only | GPU (ml.p3) | No |
| NTM | Yes | Yes | GPU for large | Yes |
| LDA | Yes | No | CPU (ml.m5) | No |
Exam rules:
- Vision algorithms + Seq2Seq = GPU only, always
- Tree-based + classical ML = CPU first
- “Cost-effective” + CPU-capable algorithm = pick CPU instance
- XGBoost on GPU? Possible (tree_method=gpu_hist) but NOT cost-effective; exam answer is CPU
Quick Decision Matrix#
| Your Data | Your Problem | Algorithm |
|---|---|---|
| Tabular, structured | Classification/Regression | XGBoost |
| Tabular, simple linear | Classification/Regression | Linear Learner |
| Multiple time series | Forecasting | DeepAR |
| Text documents | Classification | BlazingText |
| Text corpus | Topic discovery | NTM (or LDA) |
| Images | Classify whole image | Image Classification |
| Images | Find objects | Object Detection |
| Images | Label every pixel | Semantic Segmentation |
| Numeric data | Find anomalies | Random Cut Forest |
| IP addresses | Detect fraud | IP Insights |
| Sparse interaction data | Recommendations | Factorization Machines |
| High-dimensional | Reduce features | PCA |
| Any data | Group into clusters | K-Means |
| Sequences | Translate/summarize | Seq2Seq |
| Word representations | Embeddings | BlazingText (Word2Vec) |
| Object pairs | Similarity/embeddings | Object2Vec |