
Domain 3E: SageMaker Built-In Algorithms

SageMaker Built-In Algorithms

Exam Domain: 3 - ML Model Development
Task: Choose the right SageMaker built-in algorithm for a given scenario


Algorithm Selection Flowchart

What type of problem?
│
├─ Predict a NUMBER? ──────────────────────────────── REGRESSION
│   ├─ Linear relationship, tabular data?            → Linear Learner
│   ├─ Complex patterns, tabular data?               → XGBoost
│   ├─ Time series, many related series?             → DeepAR
│   └─ Sparse feature interactions?                  → Factorization Machines
│
├─ Predict a CATEGORY? ────────────────────────────── CLASSIFICATION
│   ├─ Tabular, binary/multi-class?                  → XGBoost
│   ├─ Tabular, simple/linear?                       → Linear Learner
│   ├─ High-dimensional sparse (clicks)?             → Factorization Machines
│   ├─ Text, multi-class?                            → BlazingText (supervised)
│   ├─ Image, whole image label?                     → Image Classification
│   ├─ Image, find objects with boxes?               → Object Detection
│   └─ Image, every pixel labeled?                   → Semantic Segmentation
│
├─ FORECAST future values? ────────────────────────── TIME SERIES
│   └─ Multiple related time series?                 → DeepAR
│
├─ Find GROUPS in data? ───────────────────────────── CLUSTERING
│   └─ K groups of similar items?                    → K-Means
│
├─ REDUCE dimensions? ─────────────────────────────── DIM REDUCTION
│   └─ Linear projection, variance-preserving?       → PCA
│
├─ Detect ANOMALIES? ──────────────────────────────── ANOMALY DETECTION
│   ├─ Numeric / streaming time series?              → Random Cut Forest
│   └─ IP address / login patterns?                  → IP Insights
│
├─ NLP / TEXT tasks? ──────────────────────────────── NLP
│   ├─ Word embeddings (semantic similarity)?        → BlazingText (Word2Vec)
│   ├─ Text classification (categories)?             → BlazingText (supervised)
│   ├─ Topic discovery in corpus?                    → NTM (neural) or LDA
│   ├─ Sequence translation / summarization?         → Seq2Seq
│   └─ Pair embeddings (similarity)?                 → Object2Vec
│
└─ RECOMMENDATIONS? ───────────────────────────────── RECOMMENDATIONS
    ├─ Sparse interactions (user-item clicks)?       → Factorization Machines
    └─ Learn embeddings for object pairs?            → Object2Vec

Why this matters for the exam: Always identify the problem type first, then narrow by data characteristics. The most common trap is jumping to an algorithm (e.g., “it’s a neural network problem so use deep learning”) before classifying the task. Classify the task first — the algorithm often follows naturally.
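The "classify the task first, then narrow by data characteristics" habit can be sketched as a two-step lookup. This is only an illustration: the `FLOWCHART` table, the trait names, and `choose_algorithm` are hypothetical helpers that mirror a few branches of the flowchart above, not part of any SageMaker API.

```python
# Hypothetical lookup that mirrors a few branches of the flowchart above.
# Keys are (task, data trait) pairs; the trait names are made up for this sketch.
FLOWCHART = {
    ("regression", "tabular-linear"): "Linear Learner",
    ("regression", "tabular-complex"): "XGBoost",
    ("classification", "tabular"): "XGBoost",
    ("classification", "sparse"): "Factorization Machines",
    ("classification", "text"): "BlazingText (supervised)",
    ("forecasting", "related-series"): "DeepAR",
    ("clustering", "k-groups"): "K-Means",
    ("dimensionality-reduction", "linear"): "PCA",
    ("anomaly-detection", "numeric"): "Random Cut Forest",
    ("anomaly-detection", "ip-addresses"): "IP Insights",
}


def choose_algorithm(task, trait):
    """Step 1: validate the task type. Step 2: narrow by data trait."""
    known_tasks = {t for t, _ in FLOWCHART}
    if task not in known_tasks:
        raise ValueError(f"classify the task first; unknown task {task!r}")
    if (task, trait) not in FLOWCHART:
        raise ValueError(f"no branch for trait {trait!r} under task {task!r}")
    return FLOWCHART[(task, trait)]


print(choose_algorithm("anomaly-detection", "ip-addresses"))
```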


Complete Algorithm Reference

Supervised — Regression / Classification

Linear Learner

Aspect | Detail
Type | Supervised
Problem | Binary/multi-class classification, regression
How it works | Trains a linear model (logistic/linear regression) with stochastic gradient descent; automatically normalizes data; trains multiple models with varied hyperparameters in parallel and selects the best
Input format | RecordIO-protobuf (float32) or CSV
Key hyperparameters | predictor_type (regressor / binary_classifier / multiclass_classifier), learning_rate, l1, wd (L2), mini_batch_size
Instance type | CPU (ml.m5) recommended; supports multi-instance

ELI5: Linear Learner draws the best straight line (or flat plane) through your data to make predictions. If you’re predicting house prices, it finds the line that best fits “price goes up as square footage goes up.” Simple, fast, and surprisingly effective when your data actually has a roughly linear relationship. Always use it as your baseline — if a fancier model can’t beat it, the fancy model isn’t worth the complexity.
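To make "fit a straight line with stochastic gradient descent" concrete, here is a minimal sketch in plain Python. It illustrates the idea only: `fit_line_sgd`, its defaults, and the toy data are assumptions for this example, and the real Linear Learner adds normalization, regularization, and parallel model selection on top.

```python
def fit_line_sgd(xs, ys, learning_rate=0.05, epochs=500):
    """Fit y ~ w*x + b by per-sample gradient steps on squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = (w * x + b) - y          # prediction minus target
            w -= learning_rate * error * x   # gradient of 0.5*error^2 w.r.t. w
            b -= learning_rate * error       # gradient of 0.5*error^2 w.r.t. b
    return w, b


# Toy data that lies exactly on y = 2x + 1 (hypothetical values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line_sgd(xs, ys)
print(f"slope={w:.4f} intercept={b:.4f}")
```

On noiseless data like this the recovered slope and intercept should be very close to 2 and 1; on real data the fit is only as good as the linear assumption.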

Use when: simple/linear relationships, fast baseline, interpretability needed, large datasets where simplicity is a feature.

Do NOT use when: complex non-linear interactions, high-dimensional sparse data (use Factorization Machines), image/text data.


XGBoost

Aspect | Detail
Type | Supervised
Problem | Classification, regression, ranking
How it works | Gradient-boosted decision trees: trains trees sequentially, each correcting residuals of the previous ensemble; includes L1/L2 regularization to prevent overfitting; handles missing values natively by learning the best default direction
Input format | CSV, LibSVM, RecordIO-protobuf, Parquet
Key hyperparameters | num_round (# trees), max_depth (tree depth), eta (learning rate), subsample (row fraction), colsample_bytree (feature fraction), objective, min_child_weight
Instance type | CPU (ml.m5) recommended; GPU possible with tree_method=gpu_hist

XGBoost Boosting Process:

  Round 1: Tree₁ fits the targets ─────────► F₁ = η·Tree₁
                  ↓ compute residuals
  Round 2: Tree₂ trains on residuals ──────► F₂ = η·Tree₁ + η·Tree₂
                  ↓ compute new residuals
  Round 3: Tree₃ trains on residuals ──────► F₃ = η·Tree₁ + η·Tree₂ + η·Tree₃
  ...
  Round N: Final = Σ η·Treeᵢ  (i = 1..N)

  η = learning rate (eta) controls each tree's contribution
  max_depth controls how complex each tree is

Key hyperparameters deep dive:

  • num_round: more rounds → lower training error, but risk of overfitting. Tune with early stopping.
  • max_depth: shallower trees (3-6) → more regularization. Deeper trees → more complex patterns.
  • eta: smaller = slower learning but more robust. Default 0.3, typical range 0.01-0.3.
  • subsample: < 1.0 → random row sampling each tree → stochastic boosting → regularization.
  • objective: reg:squarederror for regression, binary:logistic for binary, multi:softmax for multi-class.
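The roles of num_round and eta can be seen in a toy version of boosting. The sketch below uses one-split "stumps" on a single feature and plain squared error; it is a simplified illustration of residual fitting, not XGBoost itself (no regularization, no second-order gradients), and the function names and data are made up for the example.

```python
def fit_stump(xs, residuals):
    """Pick the single threshold on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, num_round=50, eta=0.3):
    """Each round fits a stump to the current residuals and adds eta * stump."""
    trees = []
    predictions = [0.0] * len(xs)
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + eta * tree(x) for x, p in zip(xs, predictions)]
    return lambda x: sum(eta * tree(x) for tree in trees)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x * x for x in xs]          # a non-linear target
few = boost(xs, ys, num_round=5)
many = boost(xs, ys, num_round=100)
print(mse(few, xs, ys), mse(many, xs, ys))
```

More rounds drive the training error down, which is exactly why num_round is usually tuned against a validation set with early stopping rather than on training error.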

ELI5: XGBoost is like a team of students who each learn from the mistakes of the one before them. Student 1 tries to predict house prices, gets some wrong. Student 2 focuses specifically on fixing Student 1’s mistakes. Student 3 fixes what Student 2 still got wrong. After hundreds of rounds, their combined predictions are remarkably accurate. It’s the Swiss Army knife of ML — handles missing values, resists overfitting, works on almost any tabular problem.

Use when: any structured/tabular data, default first choice for classification/regression, when you need feature importance.

Do NOT use when: image data, raw text, very sparse high-dimensional data (use Factorization Machines), when you need probabilistic time series forecasts (use DeepAR).

Exam tip: XGBoost is the most frequently tested algorithm. When the exam says “tabular data with complex patterns,” the answer is almost always XGBoost.


K-Nearest Neighbors (KNN)

Aspect | Detail
Type | Supervised (also used for unsupervised similarity search)
Problem | Classification, regression, similarity search
How it works | At inference time, finds the K most similar training points to the query and returns their majority class (classification) or mean value (regression); SageMaker implementation uses random sampling for large datasets, then builds an index (exact or approximate) for fast lookup
Input format | RecordIO-protobuf or CSV
Key hyperparameters | k (number of neighbors), predictor_type, dimension_reduction_type, dimension_reduction_target
Instance type | CPU (ml.m5); GPU supported for index building

ELI5: KNN is the “ask your neighbors” algorithm. To predict the price of a new house, find the K most similar houses (by square footage, bedrooms, etc.) in the training data and look at what their prices were. The new house’s price = the average of those K neighbors. It’s lazy: training only stores the data (in SageMaker, it samples and indexes it), and the real computation happens at prediction time. Fast to “train,” slow to predict on large datasets.
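A brute-force version of the neighbor lookup fits in a few lines. This is a sketch under simple assumptions (Euclidean distance, no index, no feature scaling); the house data and helper names are invented for illustration and do not reflect SageMaker's sampled, indexed implementation.

```python
import math
from collections import Counter


def k_nearest(train, query, k):
    """train is a list of (features, target); return the k closest rows."""
    return sorted(train, key=lambda row: math.dist(row[0], query))[:k]


def knn_regress(train, query, k):
    """Regression: mean target of the k nearest neighbors."""
    neighbors = k_nearest(train, query, k)
    return sum(target for _, target in neighbors) / len(neighbors)


def knn_classify(train, query, k):
    """Classification: majority label among the k nearest neighbors."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


# Hypothetical houses: (square feet, bedrooms) -> price in thousands.
prices = [((1000, 2), 200.0), ((1100, 2), 220.0), ((1200, 3), 240.0),
          ((3000, 5), 600.0), ((3200, 5), 640.0)]
sizes = [(features, "small" if price < 400 else "large") for features, price in prices]

estimated_price = knn_regress(prices, (1050, 2), k=3)
size_label = knn_classify(sizes, (3100, 5), k=3)
print(estimated_price, size_label)
```

Note that every prediction scans the whole training set here, which is the cost the ELI5 above refers to; in practice features would also be scaled so square footage does not dominate the distance.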

Use when: simple non-parametric baseline, anomaly detection via distance, similarity search.

Do NOT use when: very large datasets (inference is slow), high-dimensional data without preprocessing (curse of dimensionality), need for feature importance.


Supervised — Time Series

DeepAR

Aspect | Detail
Type | Supervised
Problem | Probabilistic time series forecasting
How it works | Autoregressive RNN (LSTM) trained simultaneously on MULTIPLE related time series; at each time step, the model predicts a probability distribution over future values; learns shared patterns (seasonality, trends) across all series
Input format | JSON Lines (each line: {"start": "YYYY-MM-DD HH:MM:SS", "target": [v1, v2, ...]})
Key hyperparameters | prediction_length, context_length, epochs, num_cells, num_layers, dropout_rate, likelihood
Instance type | GPU (ml.p3) for large datasets; CPU for small

DeepAR trains across multiple series simultaneously:

  Product A: 100 120 115 ──────┐
  Product B: 800 820 790 ──────┤  Shared LSTM  →  Probabilistic forecasts
  Product C:   5   6   5 ──────┤  (learns      →  with uncertainty bands
  ...1000 products...          ┘  shared        →  (P10, P50, P90 quantiles)
                                  patterns)

Why DeepAR beats classical methods:

  • Cold start: a new product with 2 weeks of data benefits from patterns learned across 1000 similar products
  • Shared patterns: weekly/monthly seasonality learned once and applied everywhere
  • Probabilistic outputs: returns confidence intervals, not just point predictions (critical for inventory planning)

ELI5: Traditional forecasting tools are like tutors who can only teach one student at a time. DeepAR is a classroom — it teaches 1000 students (time series) simultaneously and lets them learn from each other. A new student with no history still benefits from the general knowledge in the room. Plus, instead of saying “you’ll sell exactly 47 units,” DeepAR says “there’s a 90% chance you’ll sell between 40 and 55 units” — much more useful for decision-making.

Use when: multiple related time series, cold-start series (little history), probabilistic forecasts needed, demand forecasting across product catalog.

Do NOT use when: single isolated time series with long history (classical methods may suffice), very irregular/sparse series.
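Because DeepAR expects one JSON object per line rather than a CSV, preparing the training file is often the first practical step. The sketch below serializes the three products from the diagram above; `to_deepar_jsonl` is a hypothetical helper, and the start timestamp is an assumed value chosen only for illustration.

```python
import json


def to_deepar_jsonl(series_by_name, start):
    """Serialize each time series as one JSON object per line."""
    return "\n".join(
        json.dumps({"start": start, "target": list(values)})
        for values in series_by_name.values()
    )


# Values taken from the diagram above; the start timestamp is an assumption.
catalog = {
    "product_a": [100, 120, 115],
    "product_b": [800, 820, 790],
    "product_c": [5, 6, 5],
}
jsonl = to_deepar_jsonl(catalog, start="2024-01-01 00:00:00")
print(jsonl)
```

Each line is an independent series, which is what lets the model train on all of them together.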


Unsupervised — Clustering

K-Means

Aspect | Detail
Type | Unsupervised
Problem | Clustering: group N points into K clusters
How it works | SageMaker uses web-scale K-Means with mini-batches: initializes K × extra_center_factor centers using random init or K-Means++, then iteratively assigns points to nearest centroid and updates centroids, finally reduces to K clusters
Input format | RecordIO-protobuf or CSV
Key hyperparameters | k (number of clusters), extra_center_factor (oversampling), init_method (random or kmeans++), local_lloyd_max_iter (iteration cap for the final reduction step)
Instance type | CPU (ml.m5); supports multi-instance

Choosing K — the Elbow Method:

Distortion vs K:

  Distortion
  (within-cluster
   sum of squares)
     │
  ──►│ \.
     │   \.
     │    \..
     │       \___
     │           ‾‾‾‾‾‾‾────────
     └──────────────────────────► K
            1  2  3  4  5  6  7

         ↑ Pick the "elbow" — the point where adding
           more clusters gives diminishing returns.

ELI5: K-Means is like sorting mail into K bins. Start with random bin assignments. Look at each bin and find its center. Move every letter to whichever bin center it’s closest to. Recompute centers. Repeat until nothing moves. The result: K groups where items within each group are similar and items between groups are different.
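The assign-then-recompute loop, and the distortion value plotted in the elbow chart, can be sketched as follows. This is plain Lloyd's algorithm with caller-supplied starting centroids, not SageMaker's mini-batch variant with extra centers; the points and function names are illustrative assumptions.

```python
import math


def kmeans(points, centroids, max_iterations=100):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:   # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, clusters


def distortion(centroids, clusters):
    """Within-cluster sum of squared distances (the y-axis of the elbow plot)."""
    return sum(math.dist(p, c) ** 2 for c, cluster in zip(centroids, clusters) for p in cluster)


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
c1, groups1 = kmeans(points, [(1, 1)])
c2, groups2 = kmeans(points, [(1, 1), (8, 8)])
print(distortion(c1, groups1), distortion(c2, groups2))
```

Running this for K = 1, 2, 3, ... and plotting the distortion values is one way to produce the elbow curve described above.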

Use when: customer segmentation, document clustering, color quantization, finding natural groupings.

Do NOT use when: K is unknown and the elbow method doesn’t give a clear signal, data has non-spherical clusters (K-Means assumes spherical/Euclidean), very high-dimensional data without preprocessing.


Unsupervised — Dimensionality Reduction

PCA (Principal Component Analysis)

Aspect | Detail
Type | Unsupervised
Problem | Reduce number of features while preserving variance
How it works | Finds orthogonal directions (principal components) of maximum variance in the data; projects data onto the top N components; SageMaker supports two modes: Regular (exact, computes covariance matrix) and Randomized (approximate, scalable to large datasets)
Input format | RecordIO-protobuf or CSV
Key hyperparameters | num_components (target dimensions), algorithm_mode (regular / randomized), subtract_mean
Instance type | CPU (ml.m5)

Regular vs Randomized PCA:

  • Regular mode: computes exact covariance matrix → precise but $O(d^2)$ memory where $d$ = feature count → use when $d$ is moderate (< 10,000 features)
  • Randomized mode: uses randomized SVD approximation → much more memory-efficient → use when $d$ is very large (10,000+ features) or dataset is very large

ELI5: PCA is like finding the best angle to photograph a 3D sculpture so the 2D photo captures as much detail as possible. The “principal components” are those best viewing angles — the directions in high-dimensional space where the data varies the most. Projecting onto the top 2-3 components lets you visualize or compress data while keeping the most important information.
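Finding the "best viewing angle" amounts to finding the top eigenvector of the covariance matrix. The sketch below does this with power iteration in plain Python, which is a stand-in chosen for readability rather than the exact or randomized solvers SageMaker uses; the data and function names are assumptions for the example.

```python
import math


def top_principal_component(rows, iterations=100):
    """Return (feature means, unit vector of maximum variance) via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]   # subtract the mean
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d   # starting guess; assumed not orthogonal to the top component
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def project(rows, means, component):
    """Coordinate of each row along the component (a 1-D compressed representation)."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(component)))
            for r in rows]


# Two perfectly correlated features: the second is always twice the first.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
means, component = top_principal_component(rows)
scores = project(rows, means, component)
print(component, scores)
```

Here a single component captures all of the variance, so the two redundant features compress to one number per row without losing information.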

Use when: too many features (curse of dimensionality), feature redundancy, visualization, pre-processing before clustering.

Do NOT use when: non-linear structure (use autoencoders), features are categorical (PCA needs numeric), you need interpretability of features.


Unsupervised — Anomaly Detection

Random Cut Forest (RCF)

Aspect | Detail
Type | Unsupervised
Problem | Anomaly detection in numeric data and time series
How it works | Builds an ensemble of random binary trees (Random Cut Trees) by recursively bisecting the feature space with random cuts; an anomaly is a point that gets isolated with fewer cuts (it’s far from other points); the anomaly score = average depth in the trees (inverted)
Input format | RecordIO-protobuf or CSV
Key hyperparameters | num_samples_per_tree, num_trees, eval_metrics
Instance type | CPU (ml.m5); also available as built-in function in Kinesis Data Analytics

How RCF assigns anomaly scores:

  Normal point (deep in a cluster):   Anomaly (isolated):
  ┌──────────────────────┐            ┌──────────────────────┐
  │  ·  · · ·            │            │  ·  · · ·            │
  │  · [·] · ·           │            │  · · · ·             │
  │  ·  · · ·            │            │            [★]       │
  │                      │            │                      │
  └──────────────────────┘            └──────────────────────┘
  Many cuts needed to isolate it      Isolated in 2-3 cuts
  → Low anomaly score                 → High anomaly score

How it differs from Isolation Forest: conceptually similar (both isolate anomalies with random partitioning), but RCF picks the dimension to cut with probability proportional to that feature’s range instead of choosing it uniformly at random, so cuts concentrate on the dimensions where the data is most spread out. RCF also supports streaming/online updates.

Real-time use case: RCF in Kinesis Data Analytics enables millisecond-latency anomaly detection on streaming data without any model deployment — just SQL-like function calls.

ELI5: RCF is like a bouncer at a club deciding how “weird” each person is. The bouncer builds a mental model of what “normal” looks like from the crowd. Someone who fits the usual pattern blends in and gets a low weirdness score. Someone wildly out of place — a spike in server traffic at 3am, a transaction ten times larger than usual — stands out immediately and gets a high anomaly score. RCF formalizes this by asking: “How many random cuts does it take to isolate this data point?” Normal points are surrounded by neighbors and take many cuts. Anomalies are isolated in very few.
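The "how many random cuts to isolate this point?" question can be simulated directly. The sketch below is a simplified isolation experiment, not the actual Random Cut Forest algorithm (there is no tree ensemble, sampling, or score normalization); the data, seed, and function names are assumptions made for illustration.

```python
import random


def cuts_to_isolate(points, target, rng):
    """Count random cuts needed until `target` is alone on its side of every cut."""
    current = list(points)
    dims = range(len(target))
    cuts = 0
    while len(current) > 1:
        lows = [min(p[d] for p in current) for d in dims]
        highs = [max(p[d] for p in current) for d in dims]
        ranges = [hi - lo for lo, hi in zip(lows, highs)]
        if sum(ranges) == 0:          # only duplicates remain; they cannot be separated
            break
        # Choose the cut dimension with probability proportional to its range.
        d = rng.choices(list(dims), weights=ranges)[0]
        cut = rng.uniform(lows[d], highs[d])
        target_side = target[d] <= cut
        current = [p for p in current if (p[d] <= cut) == target_side]
        cuts += 1
    return cuts


def average_cuts(points, target, trials=200, seed=0):
    rng = random.Random(seed)
    return sum(cuts_to_isolate(points, target, rng) for _ in range(trials)) / trials


cluster = [(i * 0.1, j * 0.1) for i in range(5) for j in range(5)]   # tight 5x5 grid
outlier = (10.0, 10.0)
points = cluster + [outlier]
inlier = cluster[12]                                                 # center of the grid

print("outlier:", average_cuts(points, outlier))
print("inlier: ", average_cuts(points, inlier))
```

The outlier is typically separated by the very first cut, while the inlier needs several; inverting that depth is what turns it into an anomaly score.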

Use when: streaming anomaly detection (fraud, IoT sensors, server metrics), time series with spikes/level shifts, no labeled anomaly examples available.

Do NOT use when: you have labeled anomaly examples (use supervised classification instead), you need anomaly explanations (RCF gives scores, not reasons).

Exam tip: RCF = anomaly detection. Kinesis + RCF = real-time streaming anomaly detection. This combination appears frequently in exam scenarios.


IP Insights

Aspect | Detail
Type | Unsupervised
Problem | Learn normal patterns of IP address usage per entity
How it works | Neural network learns embeddings for (entity, IP address) pairs from historical normal access logs; at inference, scores new (entity, IP) pairs, and high scores indicate unusual/anomalous access
Input format | CSV (entity identifier, IPv4 address)
Key hyperparameters | num_entity_vectors, vector_dim, epochs, mini_batch_size
Instance type | GPU (ml.p3) recommended

ELI5: IP Insights builds a “reputation map” for which users connect from which IP addresses. Alice always logs in from New York and Chicago. Suddenly Alice’s credentials are used from Bucharest — that’s a high anomaly score. It doesn’t need labels saying “this was fraud”; it just learns normal patterns and flags deviations.

Use when: account takeover detection, fraud detection via login patterns, detecting compromised credentials.

Do NOT use when: general numeric anomaly detection (use RCF), you need real-time streaming (IP Insights requires batch/endpoint inference).


NLP Algorithms

BlazingText

Aspect | Detail
Type | Unsupervised (Word2Vec) or Supervised (text classification)
Problem | Word embeddings OR text classification
How it works | Word2Vec: learns word embeddings by training a shallow NN to predict context from word (Skip-gram) or word from context (CBOW); Text Classification: multi-class text classifier using fastText architecture (bag of n-gram features), which makes it extremely fast
Input format | Word2Vec: one sentence per line (plain text); Text Classification: __label__<class> text format
Key hyperparameters | mode (batch_skipgram / skipgram / cbow / supervised), vector_dim, epochs, learning_rate, word_ngrams
Instance type | CPU or GPU depending on mode

Word2Vec Architectures:

  CBOW (Continuous Bag of Words) — Faster, good for frequent words:

  "The" "cat" [___] "on" "the"   →   predict center word "sat"
      context words in                         ↑
      sliding window                     predicted word

  Skip-gram — Slower, better for rare words:

  [sat]   →   predict "The", "cat", "on", "the"
  center word         ↑
                 surrounding context words

  Mathematical result: words used in similar contexts → close vectors
  king − man + woman ≈ queen  (vector arithmetic captures meaning)

ELI5 — Word2Vec: CBOW: fill in the blank using surrounding words. Skip-gram: given one word, predict what words surround it. Both tasks force the network to learn that words appearing in similar contexts have similar meanings. The learned word vectors encode semantic relationships as geometry — “Paris is to France as Berlin is to Germany” becomes a consistent directional vector in the embedding space.
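The difference between the two objectives is easiest to see in the training examples each one generates from the same sentence. The helpers below are a small illustrative sketch (names and window size are assumptions); they only build the examples and do not train any embeddings.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center word -> context word) pair per neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: one (list of context words -> center word) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = ["the", "cat", "sat", "on", "the"]
print(cbow_examples(tokens)[2])
print([pair for pair in skipgram_pairs(tokens) if pair[0] == "sat"])
```

For the center word "sat", CBOW produces a single example with four context words as input, while skip-gram produces four separate pairs, which is one reason skip-gram is slower but gives rare words more training signal.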

ELI5 — Text Classification mode: BlazingText’s supervised mode (fastText) represents each document as the average of its word/n-gram vectors, then classifies. No convolutions, no recurrence — just fast averaging. Trains in seconds on millions of documents, often matches or beats deep learning for text classification.

Use when (Word2Vec): you need to train word embeddings on your own corpus for downstream NLP tasks, semantic similarity, analogy completion.

Use when (Text Classification): fast text categorization, sentiment analysis, spam detection, large document collections.
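For supervised mode, the training file uses the `__label__<class>` prefix format listed in the table above. The helper below is a hypothetical sketch of producing such lines; the lowercasing and whitespace tokenization are assumptions about preprocessing, not requirements stated in this guide.

```python
def to_blazingtext_line(labels, text):
    """Build one training line: label prefixes followed by space-separated tokens."""
    tokens = text.lower().split()          # assumed preprocessing: lowercase + whitespace split
    prefix = " ".join(f"__label__{label}" for label in labels)
    return f"{prefix} {' '.join(tokens)}"


examples = [
    (["spam"], "Win a FREE prize now"),
    (["ham"], "Meeting moved to 3pm"),
]
for labels, text in examples:
    print(to_blazingtext_line(labels, text))
```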

Do NOT use when: you need contextual embeddings (use BERT/transformers for tasks where “bank” means different things in different sentences), long-form generation.

Exam tip: BlazingText is the SageMaker algorithm for both word embeddings (Word2Vec) AND text classification. Two very different tasks, same algorithm — know both modes.


Seq2Seq (Sequence-to-Sequence)

Aspect | Detail
Type | Supervised
Problem | Sequence-to-sequence translation
How it works | Encoder RNN reads input sequence into a context vector; decoder RNN generates output sequence conditioned on context vector and previous output tokens; attention mechanism allows decoder to focus on different parts of the input at each step
Input format | RecordIO-protobuf (integer token sequences, pre-tokenized)
Key hyperparameters | encoder_type (rnn / cnn), num_layers_encoder, num_layers_decoder, num_embed_source, num_embed_target, rnn_attention_type
Instance type | GPU ONLY (ml.p3); like the three vision algorithms, it cannot train on CPU

Use when: machine translation, text summarization, question answering, speech-to-text.

Do NOT use when: you need a pre-trained large language model (use SageMaker JumpStart / Bedrock instead), your task is text classification (use BlazingText).

Exam tip: Seq2Seq MUST train on GPU. It is the only non-vision built-in algorithm with that requirement (Image Classification, Object Detection, and Semantic Segmentation are also GPU-only). This is a common exam fact.


Neural Topic Model (NTM)

Aspect | Detail
Type | Unsupervised
Problem | Discover latent topics in a document corpus
How it works | Variational autoencoder architecture: encoder maps bag-of-words document representation to a distribution over topics (latent space); decoder reconstructs the document from topic mixture; training optimizes both reconstruction and KL divergence
Input format | RecordIO-protobuf or CSV (bag-of-words count vectors)
Key hyperparameters | num_topics, encoder_layers, epochs, learning_rate, optimizer
Instance type | GPU (ml.p3) or CPU

NTM vs LDA:

  • NTM uses neural variational inference → more accurate, better on small corpora
  • LDA uses statistical Bayesian inference → faster, more interpretable, classic choice
  • In practice: try LDA first (faster, interpretable), use NTM if quality is insufficient

Latent Dirichlet Allocation (LDA)

Aspect | Detail
Type | Unsupervised
Problem | Statistical topic modeling
How it works | Generative probabilistic model: assumes each document is generated by choosing a mixture of topics, then for each word, choosing a topic from that mixture and sampling a word from that topic’s word distribution; inference reverses this process to discover topic distributions
Input format | RecordIO-protobuf or CSV (bag-of-words)
Key hyperparameters | num_topics, alpha0 (document-topic prior), max_restarts, max_iterations
Instance type | CPU (ml.m5) only

ELI5: Imagine you have 10,000 news articles and no labels. LDA says: “These documents were secretly generated by mixing a small set of topics, maybe 20 topics like ‘sports,’ ‘politics,’ ‘technology.’ Each article is a different blend: ‘70% politics, 20% economics, 10% sports.’ Each topic is a different distribution of words: ‘politics’ has high probability for ‘election,’ ‘vote,’ ‘congress.’ LDA reverse-engineers these blends from the raw word counts.” The output: each document gets a topic mixture vector, each topic gets a word distribution.
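The generative story LDA assumes (pick a topic from the document's mixture, then pick a word from that topic) can be simulated in a few lines. This sketch only runs the forward process that LDA inference tries to reverse; the word probabilities, the non-politics vocabularies, and the function name are invented for illustration.

```python
import random
from collections import Counter

# Hypothetical topic -> word distributions (each sums to 1).
TOPIC_WORDS = {
    "politics": {"election": 0.5, "vote": 0.3, "congress": 0.2},
    "economics": {"market": 0.6, "inflation": 0.4},
    "sports": {"goal": 0.5, "match": 0.5},
}


def generate_document(topic_mixture, length, rng):
    """For each word: sample a topic from the mixture, then a word from that topic."""
    topics = list(topic_mixture)
    weights = [topic_mixture[t] for t in topics]
    words = []
    for _ in range(length):
        topic = rng.choices(topics, weights=weights)[0]
        vocab = TOPIC_WORDS[topic]
        words.append(rng.choices(list(vocab), weights=list(vocab.values()))[0])
    return words


rng = random.Random(0)
doc = generate_document({"politics": 0.7, "economics": 0.2, "sports": 0.1}, 1000, rng)
counts = Counter(doc)
print(counts.most_common(3))
```

LDA sees only the resulting bag-of-words counts and works backwards to estimate both the per-document mixture and the per-topic word distributions.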

Use when: content discovery, document organization, building a topic-based search system, exploring large text corpora.

Do NOT use when: you need high accuracy on small corpora (use NTM), your documents have labeled categories (use BlazingText supervised mode).


Object2Vec

Aspect | Detail
Type | Supervised
Problem | Learn embeddings for pairs of objects
How it works | Two encoder towers (same or different architecture) encode each object in a pair into a fixed-dimension vector; a comparator (cosine similarity, dot product, or NN) computes a score for the pair; trained on labeled pairs (similar/dissimilar, ratings, etc.)
Input format | JSON Lines (pairs: {"label": float, "in0": [...], "in1": [...]})
Key hyperparameters | enc_dim, num_layers, mlp_dim, epochs, optimizer, comparator_list
Instance type | GPU (ml.p3) recommended

ELI5: Object2Vec learns to represent pairs of things as vectors such that similar pairs have similar vectors. Give it thousands of (user, movie, rating) triples and it learns that users who liked Star Wars also like similar movies. Give it (sentence1, sentence2, similarity_score) pairs and it learns what makes sentences semantically similar. It’s a general-purpose embedding learner for any pair relationship.

Use when: collaborative filtering recommendations, sentence/document similarity, duplicate detection, knowledge graph embeddings.

Do NOT use when: you only have one object (not pairs), you need word-level embeddings (use BlazingText Word2Vec).


Computer Vision

Image Classification

Aspect | Detail
Type | Supervised
Problem | Assign one or more labels to an entire image
How it works | CNN based on ResNet architecture; supports two training modes: full training from random initialization OR transfer learning from pre-trained ImageNet weights (last layer replaced, rest fine-tuned)
Input format | RecordIO or image files (JPG/PNG) with annotation file; also supports augmented manifest
Key hyperparameters | num_classes, num_training_samples, epochs, learning_rate, mini_batch_size, use_pretrained_model (0/1)
Instance type | GPU ONLY (ml.p3); supports multi-GPU, multi-instance

Training modes:

  • use_pretrained_model=1: transfer learning — fast, works with small datasets (100+ images per class)
  • use_pretrained_model=0: full training from scratch — needs large dataset (10,000+ images per class)

Use when: classify whole image (e.g., “is this a hotdog?”, “what disease is in this X-ray?”).

Do NOT use when: you need to locate objects (use Object Detection), pixel-level segmentation (use Semantic Segmentation).


Object Detection

Aspect | Detail
Type | Supervised
Problem | Detect and localize objects in images; outputs bounding boxes + class labels + confidence scores
How it works | Single Shot MultiBox Detector (SSD) with a VGG-16 or ResNet-50 base network, implemented on MXNet; predicts bounding boxes and class probabilities at multiple scales simultaneously in one forward pass; NMS (non-maximum suppression) removes duplicate boxes
Input format | RecordIO or image files with JSON annotation (bounding box coordinates + labels)
Key hyperparameters | num_classes, base_network (vgg-16 / resnet-50), mini_batch_size, learning_rate, lr_scheduler_step, overlap_threshold
Instance type | GPU ONLY (ml.p3); supports multi-GPU

Object Detection vs Image Classification:

  Image Classification:
  ┌─────────────────────┐
  │  🐱                 │  →  "cat" (confidence: 0.97)
  └─────────────────────┘
  One label for the whole image

  Object Detection:
  ┌─────────────────────┐
  │  ┌────┐    ┌──────┐ │  →  "cat" at (10,20,80,90): 0.95
  │  │ 🐱 │    │ 🐶   │ │     "dog" at (120,30,200,110): 0.89
  │  └────┘    └──────┘ │
  └─────────────────────┘
  Multiple boxes per image, each with class + confidence

Use when: count objects in an image, locate where objects are (autonomous vehicles, security cameras, quality control on assembly lines).
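Non-maximum suppression, the step that removes duplicate boxes, is simple enough to sketch directly. The version below is a minimal, class-agnostic illustration (real detectors typically apply it per class); the function names are assumptions, and the duplicate "cat" box is a made-up detection added next to the two boxes from the diagram above.

```python
def iou(a, b):
    """Intersection over union for boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(detections, overlap_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[0], d[0]) <= overlap_threshold]
    return kept


detections = [
    ((10, 20, 80, 90), 0.95),     # cat
    ((12, 22, 82, 92), 0.80),     # near-duplicate of the cat box (hypothetical)
    ((120, 30, 200, 110), 0.89),  # dog
]
kept = non_max_suppression(detections)
print(kept)
```

The near-duplicate box overlaps the best cat box far above the threshold and is discarded, leaving one box per object.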


Semantic Segmentation

Aspect | Detail
Type | Supervised
Problem | Assign a class label to EVERY pixel in an image
How it works | Fully Convolutional Network (FCN) with backbone (ResNet-50/101); encoder extracts features, decoder upsamples back to original resolution; also supports PSPNet (Pyramid Pooling) and DeepLab-v3 architectures
Input format | Image files (JPG/PNG) + label images (one pixel = one class label)
Key hyperparameters | num_classes, backbone (resnet-50 / resnet-101), algorithm (fcn / psp / deeplab), epochs, learning_rate, use_pretrained_model
Instance type | GPU ONLY (ml.p3); single-instance only (no multi-instance)

Computer Vision Task Hierarchy:

  Image Classification:  "There is a cat in this image"
                         → 1 label per image

  Object Detection:      "Cat at pixels (10,20)-(80,90)"
                         → N bounding boxes per image

  Semantic Segmentation: "Pixel (25,30) is cat, pixel (100,50) is sky"
                         → 1 label per PIXEL

  More precise output → More labeling effort → More compute

Use when: autonomous driving (road/pedestrian/car pixel masks), medical imaging (tumor boundary delineation), satellite imagery (land use classification).

Exam tip: Semantic segmentation = single GPU instance only. Cannot use multi-instance training. Because training cannot be spread across instances, it is often the slowest and most expensive of the three vision algorithms to train.


Recommendations

Factorization Machines

Aspect | Detail
Type | Supervised
Problem | Binary classification or regression on high-dimensional sparse data
How it works | Extends linear model by efficiently computing all pairwise feature interactions via factored representation: $\hat{y} = w_0 + \sum_i w_i x_i + \sum_{i<j} \langle v_i, v_j \rangle x_i x_j$ where $\langle v_i, v_j \rangle$ are learned latent factor dot products
Input format | RecordIO-protobuf with sparse tensor support
Key hyperparameters | feature_dim, num_factors (latent dimensions), predictor_type, epochs, learning_rate
Instance type | CPU (ml.c5)

Why it works for sparse data: In a user-item matrix with 1M users × 500K items, 99.9% of cells are empty. A standard linear model can’t learn user-item interactions (not enough data per pair). FM factorizes each feature into a latent vector — user vectors and item vectors interact via dot products, so even unobserved user-item pairs get predicted via shared latent structure.

$$\hat{y}(x) = w_0 + \sum_i w_i x_i + \sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$$
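The "efficiently" in the description refers to a standard rewrite of the pairwise term: $\sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \tfrac{1}{2} \sum_f \big[ (\sum_i v_{i,f} x_i)^2 - \sum_i v_{i,f}^2 x_i^2 \big]$, which avoids looping over every feature pair. The sketch below checks that identity numerically; the weights, latent vectors, and function names are made-up values for illustration.

```python
def fm_predict_naive(x, w0, w, v):
    """Direct evaluation of the formula: loops over every feature pair i < j."""
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = sum(
        sum(a * b for a, b in zip(v[i], v[j])) * x[i] * x[j]
        for i in range(n) for j in range(i + 1, n)
    )
    return w0 + linear + pairwise


def fm_predict_fast(x, w0, w, v):
    """Same prediction using the per-factor sum-of-squares identity."""
    n, k = len(x), len(v[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        total = sum(v[i][f] * x[i] for i in range(n))
        total_of_squares = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (total * total - total_of_squares)
    return w0 + linear + pairwise


# One-hot style input: feature 0 = "user A", feature 2 = "item X" (hypothetical).
x = [1.0, 0.0, 1.0, 0.0]
w0 = 0.1
w = [0.2, 0.0, -0.1, 0.3]
v = [[0.5, 0.1], [0.3, 0.2], [0.4, -0.2], [0.0, 0.6]]   # num_factors = 2

print(fm_predict_naive(x, w0, w, v), fm_predict_fast(x, w0, w, v))
```

Only the user and item features are active, so the interaction term reduces to the dot product of their two latent vectors, which is how an unobserved user-item pair still gets a prediction.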

ELI5: In a massive spreadsheet of users × products, most cells are empty — you only know what each user actually bought. Factorization Machines figures out the hidden patterns in what IS filled in. It learns that User A (who liked sci-fi movies) has a similar latent profile to User B (who also liked sci-fi), so it can predict that A would also like movies B liked, even if A has never seen them. It’s efficiently learning all pairwise interactions between features without needing every pair to be observed.

Use when: click-through rate prediction, recommendation systems with sparse features, any problem where features are user/item IDs (one-hot encoded → sparse).

Do NOT use when: dense feature matrices (XGBoost or Linear Learner will be simpler), image/text data.


Reinforcement Learning on SageMaker

SageMaker RL enables training reinforcement learning agents using:

  • Frameworks: Ray RLlib (most common), Intel Coach
  • Environments: AWS RoboMaker (robotics), OpenAI Gym (classic control), custom environments via Docker
  • Training topology: Separate training instance (learns policy) + environment simulation instances (generate experience)

Use cases: robotics path planning, game-playing agents, autonomous vehicle control, resource scheduling optimization, portfolio management.


Input Format Reference

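Algorithm | RecordIO | CSV | JSON Lines | Image files | Pipe Mode
Linear Learner | Yes (float32) | Yes | No | No | Yes
XGBoost | Yes | Yes | No | No | Yes
KNN | Yes | Yes | No | No | Yes
DeepAR | No | No | Yes | No | No
K-Means | Yes | Yes | No | No | Yes
PCA | Yes | Yes | No | No | Yes
RCF | Yes | Yes | No | No | Yes
BlazingText | No | Text (supervised) | No | No | No
Seq2Seq | Yes (integers) | No | No | No | No
NTM | Yes | Yes | No | No | Yes
LDA | Yes | Yes | No | No | No
Object2Vec | No | No | Yes | No | No
Factorization Machines | Yes (sparse) | No | No | No | Yes
Image Classification | Yes | No | No | Yes | Yes
Object Detection | Yes | No | No | Yes | No
Semantic Segmentation | No | No | No | Yes | No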

Exam tip: When the question asks about streaming large datasets efficiently → Pipe mode + RecordIO. DeepAR uses JSON Lines, not RecordIO. Computer vision algorithms take image files (JPG/PNG) or RecordIO of image data.


Instance Type Reference

Algorithm | CPU | GPU | Recommended | Multi-Instance | Notes
XGBoost | Yes | Yes | CPU (ml.m5) | No | GPU only with gpu_hist
Linear Learner | Yes | Yes | CPU (ml.m5) | Yes |
KNN | Yes | Yes | CPU (ml.m5) | No | GPU helps index building
DeepAR | Yes | Yes | GPU (ml.p3) for large datasets | No |
K-Means | Yes | No | CPU (ml.m5) | Yes |
PCA | Yes | Yes | CPU (ml.m5) | No |
RCF | Yes | No | CPU (ml.m5) | No | Also in Kinesis Analytics
IP Insights | Yes | Yes | GPU (ml.p3) | No |
BlazingText | Yes | Yes | Mode-dependent | No | Text class: CPU; W2V: GPU
Seq2Seq | No | Yes only | GPU (ml.p3) | Yes | GPU REQUIRED
NTM | Yes | Yes | GPU (ml.p3) for large datasets | Yes |
LDA | Yes | No | CPU (ml.m5) | No | CPU only
Object2Vec | Yes | Yes | GPU (ml.p3) | No |
Factorization Machines | Yes | No | CPU (ml.c5) | Yes |
Image Classification | No | Yes only | GPU (ml.p3) | Yes | GPU REQUIRED
Object Detection | No | Yes only | GPU (ml.p3) | Yes | GPU REQUIRED
Semantic Segmentation | No | Yes only | GPU (ml.p3) | No | GPU REQUIRED, no multi-instance

Exam rules to memorize:

  • Vision algorithms (Image Classification, Object Detection, Semantic Segmentation) + Seq2Seq = GPU ONLY, always
  • Tree-based + classical ML = CPU first (cheaper)
  • Semantic Segmentation = GPU + single instance only (no multi-instance)
  • “Cost-effective training” → choose CPU-capable algorithm on CPU instance

Automatic Model Tuning (Hyperparameter Optimization)

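SageMaker HPO automatically searches for good hyperparameter values. The three strategies compared here are Bayesian optimization, random search, and Hyperband (grid search is also available):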

HPO Strategy Comparison:

  Bayesian Optimization (recommended):
  ┌──────────────────────────────────────────────────────┐
  │  Job 1: eta=0.3, max_depth=6  → AUC=0.82            │
  │  Job 2: eta=0.1, max_depth=8  → AUC=0.85  ← better  │
  │  Job 3: explore near job 2... → AUC=0.87             │
  │  ...intelligently guided by a surrogate model...     │
  └──────────────────────────────────────────────────────┘
  Learns from past results → focuses on promising regions

  Random Search:
  ┌──────────────────────────────────────────────────────┐
  │  Job 1-N: sample hyperparameters at random           │
  │  All jobs run in parallel → fast exploration         │
  │  No learning across jobs                             │
  └──────────────────────────────────────────────────────┘
  Good for parallelism, no prior knowledge assumed

  Hyperband:
  ┌──────────────────────────────────────────────────────┐
  │  Start many jobs with few resources                  │
  │  Eliminate bottom-performing jobs early              │
  │  Give more resources to survivors                    │
  │  → Very efficient for expensive training jobs        │
  └──────────────────────────────────────────────────────┘
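
The elimination idea in the Hyperband box can be illustrated with successive halving, the simpler procedure that Hyperband builds on. The sketch below is a toy, not SageMaker's implementation: `toy_score` is a hypothetical stand-in for "train for `budget` epochs and report a validation metric", and the candidate `eta` values are arbitrary.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Each round, keep the top 1/eta of configs and give survivors eta times more budget."""
    survivors = list(configs)
    budget = min_budget
    history = []  # (number of configs evaluated, budget per config) for each round
    while len(survivors) > 1:
        history.append((len(survivors), budget))
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0], history


def toy_score(config, budget):
    # Hypothetical metric: best near eta=0.1, improving as the budget grows.
    quality = 1.0 - abs(config["eta"] - 0.1)
    return quality * (1.0 - 1.0 / (budget + 1))


candidates = [{"eta": e} for e in (0.01, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 0.9)]
best, history = successive_halving(candidates, toy_score)
print(best, history)
```

Most of the budget ends up spent on the few configurations that survive the early rounds, which is why this family of methods suits expensive training jobs.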

Key configuration parameters:

  • Objective metric: the metric to optimize (e.g., validation:auc, train:loss). It must appear in the training job's logs (CloudWatch); built-in algorithms emit their metrics automatically, while custom algorithms need a metric definition (name plus regex)
  • Parameter ranges: continuous (e.g., learning_rate: 0.001–0.3), integer (e.g., max_depth: 3–10), categorical (e.g., optimizer: [adam, sgd])
  • Max training jobs: total budget of jobs to run
  • Max parallel jobs: how many jobs run simultaneously (higher parallelism finishes sooner, but Bayesian optimization then has fewer completed results to learn from when picking each new job)
  • Warm start: initialize a new tuning job from the results of one or more previous tuning jobs → no computation wasted re-exploring the same regions

ELI5: Instead of manually trying 100 hyperparameter combinations, let SageMaker’s optimizer intelligently explore the space — it learns which regions are promising and focuses there. Bayesian optimization is like a scientist who reads all past experiment results before deciding what to try next. Random search is like throwing darts blindfolded. Hyperband is like a tournament that eliminates weak contestants early to spend time on the strong ones.

Exam tip: Bayesian optimization chooses each new job from the results of completed ones, so it works best with low parallelism (for example, max_parallel_jobs of 2-3). Random search has no dependence between jobs, so it can run with high parallelism. Hyperband suits expensive, iterative training where underperforming jobs can be stopped early and their budget given to the stronger ones.
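
As a rough illustration of how these settings fit together, the sketch below builds a dictionary shaped like the `HyperParameterTuningJobConfig` argument of the boto3 SageMaker client's `create_hyper_parameter_tuning_job` call. The field names are written from memory of that API and should be checked against the reference before use. The helper name `build_tuning_config`, the job counts, and the logarithmic scaling choice are illustrative assumptions. No AWS call is made here.

```python
def build_tuning_config(strategy="Bayesian", max_jobs=20, max_parallel=2):
    """Return a dict shaped like HyperParameterTuningJobConfig (values are illustrative)."""
    if max_parallel > max_jobs:
        raise ValueError("max_parallel cannot exceed max_jobs")
    return {
        "Strategy": strategy,  # e.g. "Bayesian", "Random", "Hyperband"
        "HyperParameterTuningJobObjective": {
            "Type": "Maximize",
            "MetricName": "validation:auc",
        },
        "ResourceLimits": {
            "MaxNumberOfTrainingJobs": max_jobs,
            "MaxParallelTrainingJobs": max_parallel,
        },
        "ParameterRanges": {
            # Range bounds are passed as strings in this API.
            "ContinuousParameterRanges": [
                {"Name": "eta", "MinValue": "0.001", "MaxValue": "0.3",
                 "ScalingType": "Logarithmic"},
            ],
            "IntegerParameterRanges": [
                {"Name": "max_depth", "MinValue": "3", "MaxValue": "10",
                 "ScalingType": "Linear"},
            ],
            # Categorical ranges are omitted in this sketch.
        },
    }


config = build_tuning_config()
print(config["ResourceLimits"])
```

Keeping `max_parallel` small relative to `max_jobs` matches the exam tip: the Bayesian strategy sees more finished jobs before it proposes the next batch.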


Bring Your Own Algorithm

When to Use Each Approach

Decision Tree: Built-in vs Script Mode vs Custom Container

  Does SageMaker have a built-in algorithm for your problem?
  ├─ Yes, and it fits exactly → Use built-in algorithm
  └─ No, or needs modification
      │
      Does your code fit in a supported framework
      (TensorFlow, PyTorch, Scikit-learn, MXNet, HuggingFace)?
      ├─ Yes → Script Mode (your code + SageMaker's container)
      └─ No, or you need special system dependencies
          │
          Custom Docker Container (full control)

Script Mode

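Upload your training script (e.g., train.py) and SageMaker runs it inside one of its pre-built framework containers. Inside the container, the script reads each input channel from /opt/ml/input/data/<channel-name>/ and writes the trained model to /opt/ml/model/, which SageMaker packages and uploads to S3 when the job ends.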

# SageMaker script mode: hyperparameters arrive as CLI arguments, paths as env vars
import os
import argparse

parser = argparse.ArgumentParser()
# SageMaker passes hyperparameters as command-line arguments
parser.add_argument('--learning-rate', type=float, default=0.01)
# SageMaker exposes the model and data paths as environment variables
parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
args = parser.parse_args()
# args.train is the training data directory; save the fitted model under args.model_dir
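
For a fuller picture, here is a hedged end-to-end sketch of such a script that can be dry-run locally. The "model" is deliberately trivial (the mean of the last CSV column), and the file name `model.json`, the function names, and the temporary directories standing in for the SageMaker paths are all illustrative assumptions.

```python
import argparse
import csv
import json
import os
import tempfile
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning-rate", type=float, default=0.01)
    # Fall back to the standard container paths when the env vars are not set.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    return parser.parse_args(argv)


def train(args):
    # Toy "model": the mean of the last column across every CSV in the train channel.
    values = []
    for path in sorted(Path(args.train).glob("*.csv")):
        with path.open(newline="") as f:
            values.extend(float(row[-1]) for row in csv.reader(f) if row)
    model = {"mean_target": sum(values) / len(values), "learning_rate": args.learning_rate}
    model_dir = Path(args.model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "model.json").write_text(json.dumps(model))
    return model


# Local dry run: temporary directories stand in for the SageMaker-provided paths.
train_dir = tempfile.mkdtemp()
model_dir = tempfile.mkdtemp()
Path(train_dir, "part-0.csv").write_text("1.0,1.0\n2.0,3.0\n")
model = train(parse_args(["--train", train_dir, "--model-dir", model_dir]))
print(model)
```

Taking an explicit `argv` in `parse_args` keeps the script testable outside SageMaker while leaving the command-line behaviour unchanged inside the container.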

Custom Docker Container

Required /opt/ml/ directory structure in your container:

  /opt/ml/
  ├── input/
  │   ├── config/
  │   │   ├── hyperparameters.json   ← SageMaker writes the job's hyperparameters here (values arrive as strings)
  │   │   └── resourceConfig.json    ← instance info
  │   └── data/
  │       └── <channel-name>/        ← training data mounted here
  │           └── (your data files)
  ├── model/                         ← write your trained model artifacts here
  │   └── (model files)
  └── output/
      └── failure                    ← write failure reason here if training fails

  Your container must respond to:
    docker run image train    ← SageMaker calls this to start training
    docker run image serve    ← SageMaker calls this to start inference server
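
A minimal sketch of the training side of this contract follows: read hyperparameters.json, write artifacts under model/, and record the error in output/failure if anything goes wrong. The names `run_training`, `toy_fit`, and `broken_fit` are hypothetical, and a temporary directory stands in for /opt/ml so the example runs anywhere.

```python
import json
import tempfile
import traceback
from pathlib import Path


def run_training(prefix, fit):
    """Run fit(hyperparameters, model_dir); return 0 on success, 1 on failure."""
    prefix = Path(prefix)
    try:
        hp_path = prefix / "input" / "config" / "hyperparameters.json"
        hyperparameters = json.loads(hp_path.read_text()) if hp_path.exists() else {}
        model_dir = prefix / "model"
        model_dir.mkdir(parents=True, exist_ok=True)
        fit(hyperparameters, model_dir)
        return 0
    except Exception:
        # Record the failure reason where the directory layout above expects it.
        output_dir = prefix / "output"
        output_dir.mkdir(parents=True, exist_ok=True)
        (output_dir / "failure").write_text(traceback.format_exc())
        return 1


def toy_fit(hyperparameters, model_dir):
    depth = int(hyperparameters["max_depth"])  # hyperparameter values arrive as strings
    (model_dir / "model.json").write_text(json.dumps({"max_depth": depth}))


def broken_fit(hyperparameters, model_dir):
    raise ValueError("training data missing")


# Local dry run with a temporary directory in place of /opt/ml.
root = Path(tempfile.mkdtemp())
config_dir = root / "input" / "config"
config_dir.mkdir(parents=True)
(config_dir / "hyperparameters.json").write_text(json.dumps({"max_depth": "5"}))

ok_code = run_training(root, toy_fit)
fail_code = run_training(root, broken_fit)
failure_text = (root / "output" / "failure").read_text()
print(ok_code, fail_code)
```

In a real container, the entrypoint would check that its first argument is `train` and then exit with the value returned by `run_training("/opt/ml", fit)`.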

Container requirements:

  1. Accept train command → execute training → write model to /opt/ml/model/
  2. Accept serve command → start an HTTP server on port 8080 → answer POST /invocations (predictions) and GET /ping (health check)
  3. Exit with code 0 on success, non-zero on failure
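
The serving half of these requirements can be sketched with the standard library alone. This is an assumption-laden toy, not a production server: the `{"features": [...]}` request shape and the "sum the features" model are invented for illustration, and the smoke test binds an ephemeral local port, whereas a real container would listen on port 8080.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    def _send(self, status, body=b""):
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check: answer 200 on /ping once the container is ready.
        self._send(200 if self.path == "/ping" else 404)

    def do_POST(self):
        if self.path != "/invocations":
            self._send(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        # Placeholder "model": sum the input features (hypothetical request shape).
        result = {"prediction": sum(payload["features"])}
        self._send(200, json.dumps(result).encode())

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Local smoke test on an ephemeral port.
server = HTTPServer(("127.0.0.1", 0), InferenceHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # bypass any proxy
ping_status = opener.open(f"http://127.0.0.1:{port}/ping").status
request = urllib.request.Request(
    f"http://127.0.0.1:{port}/invocations",
    data=json.dumps({"features": [1, 2, 3]}).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
prediction = json.loads(opener.open(request).read())
server.shutdown()
server.server_close()
print(ping_status, prediction)
```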

ECR: Push your custom container to Amazon ECR (Elastic Container Registry) — SageMaker pulls it from there for training and inference.


Master Comparison Table

| Algorithm              | Problem Type                    | Data Type            | GPU Required      | Key Hyperparameters                  | Multi-Instance   |
|------------------------|---------------------------------|----------------------|-------------------|--------------------------------------|------------------|
| Linear Learner         | Regression, Classification      | Tabular              | No                | predictor_type, learning_rate, l1    | Yes              |
| XGBoost                | Regression, Classification      | Tabular              | No (GPU optional) | num_round, max_depth, eta, objective | No               |
| KNN                    | Classification, Regression      | Any numeric          | No                | k, predictor_type                    | No               |
| DeepAR                 | Time Series Forecast            | Multiple time series | Recommended       | prediction_length, context_length    | No               |
| K-Means                | Clustering                      | Numeric              | No                | k, extra_center_factor               | Yes              |
| PCA                    | Dim Reduction                   | Numeric              | No                | num_components, algorithm_mode       | No               |
| RCF                    | Anomaly Detection               | Numeric/streaming    | No                | num_trees, num_samples_per_tree      | No               |
| IP Insights            | Anomaly Detection               | IP addresses         | Recommended       | num_entity_vectors, vector_dim       | No               |
| BlazingText            | Text Classification, Embeddings | Text                 | Mode-dep          | mode, vector_dim, word_ngrams        | No               |
| Seq2Seq                | Sequence Translation            | Token sequences      | YES               | encoder_type, num_layers             | Yes              |
| NTM                    | Topic Modeling                  | Text (BoW)           | Recommended       | num_topics, encoder_layers           | Yes              |
| LDA                    | Topic Modeling                  | Text (BoW)           | No                | num_topics, alpha0                   | No               |
| Object2Vec             | Embeddings, Similarity          | Object pairs         | Recommended       | enc_dim, num_layers, comparator_list | No               |
| Factorization Machines | Classification, Regression      | Sparse/high-dim      | No                | feature_dim, num_factors             | Yes              |
| Image Classification   | Image Classification            | Images               | YES               | num_classes, use_pretrained_model    | Yes              |
| Object Detection       | Object Localization             | Images               | YES               | num_classes, base_network            | Yes              |
| Semantic Segmentation  | Pixel Classification            | Images               | YES               | num_classes, algorithm, backbone     | No (single only) |