AWS MLA-C01: ML Engineer Associate

Domain 2A: SageMaker Built-In Algorithms

Exam Domain: 2, ML Model Development (26%)
Task: Choose the right modeling approach


Algorithm Selection Flowchart

What type of problem?
│
├─ Predict a number? ──────────── REGRESSION
│   ├─ Linear relationship?      → Linear Learner
│   ├─ Complex / tabular?        → XGBoost / LightGBM
│   └─ Time series?              → DeepAR
│
├─ Predict a category? ────────── CLASSIFICATION
│   ├─ Linear relationship?      → Linear Learner (binary or multi-class)
│   ├─ Complex / tabular?        → XGBoost / LightGBM
│   ├─ Image?                    → Image Classification
│   ├─ Object in image?          → Object Detection
│   ├─ Pixel-level?              → Semantic Segmentation
│   └─ Text?                     → BlazingText
│
├─ Group similar items? ───────── CLUSTERING
│   └─ K groups?                 → K-Means
│
├─ Reduce dimensions? ─────────── DIMENSIONALITY REDUCTION
│   └─ Linear?                   → PCA
│
├─ Detect anomalies? ──────────── ANOMALY DETECTION
│   ├─ Numeric / time series?    → Random Cut Forest (RCF)
│   └─ IP address patterns?      → IP Insights
│
├─ Text / NLP? ────────────────── NLP
│   ├─ Word embeddings?          → BlazingText (Word2Vec)
│   ├─ Text classification?      → BlazingText (supervised)
│   ├─ Topic discovery?          → NTM or LDA
│   └─ Sequence translation?     → Seq2Seq
│
├─ Recommendations? ───────────── RECOMMENDATIONS
│   ├─ Sparse data (clicks)?     → Factorization Machines
│   └─ Pair embeddings?          → Object2Vec
│
└─ Forecasting? ───────────────── TIME SERIES
    └─ Multiple related series?  → DeepAR

Why this matters for the exam: This flowchart is the most important mental model to internalize. The key insight: always start with your problem type (regression, classification, clustering, anomaly detection), not with the algorithm name. Exam questions describe a business scenario — your job is to classify the problem first, then the algorithm choice often becomes obvious. Most wrong answers fail because test-takers jump to an algorithm before identifying the problem type.
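
As a rough illustration of "classify the problem first, then pick the algorithm", the flowchart can be written as a lookup table. This is only a study aid: the `FLOWCHART` dictionary, the `choose_algorithm` function, and the key strings are hypothetical names invented for this sketch, not part of any SageMaker API.

```python
# Hypothetical lookup table mirroring part of the flowchart above.
# Keys are (problem type, data hint); values are the built-in algorithm to consider.
FLOWCHART = {
    ("regression", "linear"): "Linear Learner",
    ("regression", "tabular"): "XGBoost / LightGBM",
    ("regression", "time series"): "DeepAR",
    ("classification", "linear"): "Linear Learner",
    ("classification", "tabular"): "XGBoost / LightGBM",
    ("classification", "image"): "Image Classification",
    ("classification", "text"): "BlazingText",
    ("clustering", "k groups"): "K-Means",
    ("dimensionality reduction", "linear"): "PCA",
    ("anomaly detection", "numeric"): "Random Cut Forest (RCF)",
    ("anomaly detection", "ip addresses"): "IP Insights",
    ("recommendations", "sparse"): "Factorization Machines",
}


def choose_algorithm(problem_type, data_hint):
    """Return the flowchart's suggestion for a problem type and data hint."""
    key = (problem_type.strip().lower(), data_hint.strip().lower())
    if key not in FLOWCHART:
        raise KeyError(f"No flowchart entry for {key}")
    return FLOWCHART[key]


# Scenario: "predict next month's revenue from structured customer records"
# Step 1: the target is a number, so this is regression. Step 2: the data is tabular.
print(choose_algorithm("Regression", "Tabular"))
```

The point of the two-step call is the order of the arguments: the problem type comes first, and the algorithm falls out of it.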


Complete Algorithm Reference

Supervised — Regression / Classification

Linear Learner

Aspect | Detail
Type | Supervised
Problem | Binary/multi-class classification, regression
How it works | Linear model (logistic regression / linear regression)
Input | RecordIO (float32) or CSV
Key feature | Built-in normalization, can train multiple models in parallel
Hyperparameters | predictor_type (regressor/binary_classifier/multiclass_classifier), learning_rate, l1, mini_batch_size
When to use | Simple linear relationships, baseline model, fast training

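Trains several models in parallel, each with different hyperparameter settings (num_models), and keeps the one that scores best on the validation set. It can also be combined with SageMaker automatic model tuning.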

ELI5: Linear Learner draws the best straight line (or flat plane) through your data to make predictions. If you’re predicting house prices, it finds the line that best fits “price goes up as square footage goes up.” Simple, fast, and surprisingly effective when your data actually has a roughly linear relationship. Always use it as your baseline — if a fancier model can’t beat it, the fancy model isn’t worth the complexity.
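
To make the "best straight line" idea concrete, here is a minimal sketch that fits a line by ordinary least squares in plain Python. It is an assumption-laden illustration: Linear Learner itself trains with stochastic gradient descent on much larger data, and the `fit_line` helper and the house-price numbers are made up for this example.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data: price rises by a fixed amount per square foot.
square_feet = [1000, 1500, 2000, 2500]
prices = [200000, 300000, 400000, 500000]

slope, intercept = fit_line(square_feet, prices)
predicted = slope * 1800 + intercept
print(f"price per sq ft: {slope:.2f}, predicted price for 1800 sq ft: {predicted:.0f}")
```

A baseline like this is cheap to train, which is why it makes a useful yardstick before trying a more complex model.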

XGBoost

Aspect | Detail
Type | Supervised
Problem | Classification, regression, ranking
How it works | Gradient-boosted decision trees (ensemble)
Input | CSV, LibSVM, RecordIO, Parquet
Key feature | Handles missing values natively, regularization built-in
Hyperparameters | num_round, max_depth, eta (learning rate), subsample, objective
When to use | Tabular data; often the best default choice

XGBoost Ensemble:
  Tree 1 (weak)  →  residuals  →  Tree 2  →  residuals  →  Tree 3  →  ...
       +                              +                         +
       └──────────────── Sum of predictions ─────────────────────┘
                         = Final prediction

Exam tip: XGBoost is the most commonly tested algorithm. If the question mentions tabular/structured data with complex patterns, XGBoost is likely the answer.

ELI5: XGBoost is like a team of weak students who each learn from the mistakes of the one before them. Student 1 tries to predict house prices and gets some wrong. Student 2 focuses specifically on fixing Student 1’s mistakes. Student 3 fixes what Student 2 still got wrong. After hundreds of rounds, their combined predictions are remarkably accurate. It’s the Swiss Army knife of ML: it handles missing values, has built-in regularization to limit overfitting, works on almost any tabular problem, and has won many Kaggle competitions. When in doubt on a tabular-data exam question, pick XGBoost.
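
The "each student fixes the previous one's mistakes" loop can be sketched with one-split trees (stumps) fitted to residuals. This is a simplified gradient-boosting sketch for squared error only, not XGBoost's actual implementation, which also uses second-order gradients and regularization. The helper names are hypothetical; `num_round` and `eta` are borrowed from the hyperparameter names in the table above.

```python
def fit_stump(xs, residuals):
    """Find the single threshold on x that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, num_round=20, eta=0.3):
    """Start from the mean, then add eta-scaled stumps fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(num_round):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + eta * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, predictions


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 8.0, 8.3]
base, stumps, fitted = boost(xs, ys)
baseline_error = mse(ys, [base] * len(ys))
boosted_error = mse(ys, fitted)
print(f"MSE predicting the mean: {baseline_error:.3f}, after boosting: {boosted_error:.3f}")
```

A smaller `eta` makes each tree contribute less, so more rounds are needed; that trade-off is the usual reason `eta` and `num_round` are tuned together.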

LightGBM

Aspect | Detail
Type | Supervised
Problem | Classification, regression
How it works | Gradient boosting with leaf-wise tree growth
Key feature | Faster than XGBoost on large datasets, lower memory
When to use | Large datasets where XGBoost is too slow

XGBoost vs LightGBM:

  • XGBoost: level-wise growth (balanced trees)
  • LightGBM: leaf-wise growth (deeper, potentially better accuracy, risk of overfitting)

Supervised — Time Series

DeepAR

Aspect | Detail
Type | Supervised
Problem | Time series forecasting
How it works | Autoregressive RNN (LSTM/GRU)
Input | JSON lines (start time + target values)
Key feature | Trains on multiple related time series simultaneously
Hyperparameters | context_length, prediction_length, epochs, num_cells
When to use | Forecasting across many related series (e.g., product demand)

DeepAR learns patterns across multiple series:
  Series A: ──────┐
  Series B: ──────┤  → Shared RNN model → Probabilistic forecasts
  Series C: ──────┘                        (with uncertainty bands)

Outputs probabilistic forecasts (quantiles), not just point estimates.

ELI5: Traditional forecasting tools look at one product’s sales history in isolation. DeepAR looks at hundreds or thousands of related products simultaneously and learns shared patterns such as “sales of all products spike in December” or “products in category X have a weekly cycle.” A product with only 2 weeks of history benefits from patterns learned across 500 similar products. This is why DeepAR often outperforms classical single-series methods such as ARIMA or ETS when you have many related time series to forecast at once.
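
The JSON Lines input mentioned in the table puts one time series per line, each with a `start` timestamp and a `target` array. The sketch below builds such lines with the standard library. The product names and sales numbers are invented, `to_deepar_record` is a hypothetical helper, and the optional `cat` field (a category index per series) is included on the assumption that you want the model to know which series are related.

```python
import json


def to_deepar_record(start, target, cat=None):
    """Serialize one time series as a single JSON line."""
    record = {"start": start, "target": list(target)}
    if cat is not None:
        record["cat"] = list(cat)  # optional categorical features for the series
    return json.dumps(record)


# Made-up daily sales for three related products.
daily_sales = {
    "product_a": [12, 15, 14, 20, 22],
    "product_b": [3, 4, 4, 6, 7],
    "product_c": [30, 28, 35, 40, 42],
}

lines = [
    to_deepar_record("2024-01-01 00:00:00", values, cat=[index])
    for index, values in enumerate(daily_sales.values())
]
jsonl = "\n".join(lines)
print(jsonl)
```

Each line is an independent JSON object, so series of different lengths and start dates can sit in the same file.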


Supervised — Sequence / Text

BlazingText

Aspect | Detail
Type | Supervised (text classification) / Unsupervised (Word2Vec)
Problem | Text classification, word embeddings
Modes | Word2Vec (Skip-gram, CBOW) and Text Classification (supervised)
Key feature | Optimized for multi-core CPU and GPU, 10-20x faster than standard implementations
When to use | Sentiment analysis, word embeddings, text categorization

Word2Vec Architectures:
  CBOW:       context words → predict center word (faster)
  Skip-gram:  center word → predict context words (better for rare words)

ELI5: BlazingText turns words into points in a mathematical space where meaning is encoded as distance. Words used in similar contexts end up close together: “king” is near “queen,” “Paris” is near “London.” The famous example: “King” − “Man” + “Woman” ≈ “Queen” — that’s not magic, it’s geometry in the embedding space. These numeric representations (embeddings) let ML models understand language by turning fuzzy human meaning into precise numbers that algorithms can compute with.
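
The difference between the two Word2Vec architectures is easiest to see in the training examples each one generates from the same sentence. The following is a minimal sketch of that pair generation only (no neural network, no embeddings); the function names and the `window` parameter are chosen for illustration.

```python
def skipgram_pairs(tokens, window=1):
    """Skip-gram: one (center word, context word) pair per neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=1):
    """CBOW: one (all context words, center word) example per position."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            pairs.append((context, center))
    return pairs


tokens = "the king rules".split()
print(skipgram_pairs(tokens))
print(cbow_pairs(tokens))
```

Skip-gram produces more, finer-grained examples per sentence, which is one intuition for why it tends to do better on rare words, while CBOW averages the context into a single example and trains faster.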

Seq2Seq (Sequence-to-Sequence)

Aspect | Detail
Type | Supervised
Problem | Sequence translation
How it works | Encoder-decoder RNN with attention
Input | Tokenized integer sequences (RecordIO)
When to use | Machine translation, text summarization, speech-to-text

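Input must already be tokenized into integers, with vocabulary files that map words to those integers. Pre-trained word embeddings can be supplied to initialize the model but are optional.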


Supervised — Computer Vision

Image Classification

Aspect | Detail
Type | Supervised
Problem | Classify entire image into a category
How it works | CNN (ResNet architecture)
Input | RecordIO or image files (JPG, PNG)
Key feature | Transfer learning: use pre-trained ImageNet weights
When to use | “What is this image?” (single label or multi-label)

Object Detection

Aspect | Detail
Type | Supervised
Problem | Detect and locate objects in images
How it works | CNN using Single Shot MultiBox Detector (SSD) with a VGG or ResNet base network
Output | Bounding boxes + class labels + confidence scores
When to use | “Where are the objects and what are they?”

Semantic Segmentation

Aspect | Detail
Type | Supervised
Problem | Classify every pixel in an image
How it works | FCN (Fully Convolutional Network), PSP, or DeepLabV3 on a ResNet backbone
Output | Segmentation mask (label per pixel)
When to use | Self-driving cars, medical imaging, satellite analysis

Computer Vision Task Comparison:

Image Classification:  "This is a cat"         → 1 label per image
Object Detection:      "Cat at (x1,y1,x2,y2)" → bounding boxes
Semantic Segmentation: "These pixels are cat"  → pixel-level mask

ELI5: The three vision tasks answer three different questions about the same photo. Classification: “What’s in this photo?” — one answer for the whole image. Object Detection: “Where exactly is it?” — draw a rectangle around each object and label it. Semantic Segmentation: “Color every pixel by what it is” — every pixel in the cat gets labeled “cat,” every pixel of road gets labeled “road.” More precise = more compute = more labeled training data required.


Unsupervised — Clustering & Dimensionality

K-Means

Aspect | Detail
Type | Unsupervised
Problem | Clustering
How it works | Iteratively assigns points to nearest centroid
Hyperparameters | k (number of clusters), init_method
SageMaker variant | Uses web-scale k-means (mini-batch for large data)
When to use | Customer segmentation, grouping similar items

Choose k using the Elbow method: plot distortion vs k, pick the “elbow” point.
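
Here is a small sketch of the Elbow method on one-dimensional toy data. It uses plain Lloyd's k-means with a deterministic starting point rather than SageMaker's web-scale variant, and the `kmeans_1d` helper and the data are invented for this example.

```python
def kmeans_1d(points, k, iterations=20):
    """Plain k-means on 1-D data; returns (centroids, distortion)."""
    ordered = sorted(points)
    # Deterministic init: spread the starting centroids across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    # Distortion: sum of squared distances to the nearest centroid.
    distortion = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, distortion


# Three visually obvious groups around 1, 5, and 9.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]

distortions = {k: kmeans_1d(points, k)[1] for k in range(1, 5)}
for k, d in distortions.items():
    print(f"k={k}: distortion={d:.2f}")
```

Distortion falls steeply up to k=3 and barely moves afterwards, so the elbow sits at k=3, matching the three groups in the data.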

PCA (Principal Component Analysis)

Aspect | Detail
Type | Unsupervised
Problem | Dimensionality reduction
How it works | Finds orthogonal axes of maximum variance
Modes | Regular (covariance matrix) or Randomized (approximation for large data)
When to use | Too many features, reduce noise, visualize high-dimensional data
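
"Finds orthogonal axes of maximum variance" can be illustrated by computing the first principal component of a tiny dataset. The sketch below builds the covariance matrix and applies power iteration in plain Python. It is a teaching sketch under simplifying assumptions: it finds only the first component, and it assumes the all-ones starting vector is not orthogonal to that component. The function name and data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return the unit vector along the direction of maximum variance."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Sample covariance matrix (dims x dims).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    vector = [1.0] * dims
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize to unit length.
        product = [sum(cov[i][j] * vector[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in product))
        vector = [v / norm for v in product]
    return vector


# Two features that move together, so most variance lies along the diagonal.
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
component = first_principal_component(rows)

# Project each row onto the component: two features become one.
reduced = [sum(value * weight for value, weight in zip(row, component)) for row in rows]
print(component)
print(reduced)
```

Projecting onto the component keeps the ordering and spacing of the points while dropping one dimension, which is the essence of using PCA to reduce features.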

Unsupervised — Anomaly Detection

Random Cut Forest (RCF)

Aspect | Detail
Type | Unsupervised
Problem | Anomaly detection
How it works | Builds ensemble of random trees; anomalies are isolated quickly
Output | Anomaly score (higher = more anomalous)
Key feature | Works on streaming data too (Kinesis Analytics has built-in RCF)
When to use | Detect spikes, outliers, unusual patterns in time series

ELI5: RCF is like a bouncer at a club deciding how “weird” each person is. The bouncer builds a mental model of what “normal” looks like from the crowd. Someone who fits the usual pattern blends in and gets a low weirdness score. Someone wildly out of place — a spike in server traffic at 3am, a transaction ten times larger than usual — stands out and gets a high anomaly score. RCF formalizes this by asking: “How many random cuts does it take to isolate this data point?” Normal points are surrounded by neighbors and take many cuts. Anomalies are isolated quickly — hence a high score.
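
The "how many random cuts to isolate this point" question can be simulated directly. The sketch below is a deliberately simplified, one-dimensional stand-in for the idea and not the scoring that Random Cut Forest actually uses. The helper names and traffic numbers are invented, and note that here fewer cuts means more anomalous, whereas the real algorithm reports a score where higher means more anomalous.

```python
import random


def cuts_to_isolate(points, target, rng):
    """Count random cuts until `target` is the only point left on its side."""
    remaining = list(points)
    cuts = 0
    while len(remaining) > 1:
        lo, hi = min(remaining), max(remaining)
        if lo == hi:  # only duplicates left, nothing more to cut
            break
        cut = rng.uniform(lo, hi)
        # Keep only the side of the cut that still contains the target.
        if target <= cut:
            remaining = [p for p in remaining if p <= cut]
        else:
            remaining = [p for p in remaining if p > cut]
        cuts += 1
    return cuts


def average_cuts(points, target, trials=200, seed=0):
    """Average the cut count over many random trials (seeded for repeatability)."""
    rng = random.Random(seed)
    return sum(cuts_to_isolate(points, target, rng) for _ in range(trials)) / trials


# Requests per minute: nine ordinary readings and one spike.
traffic = [100, 102, 98, 101, 99, 103, 97, 100.5, 99.5, 500]

normal_cuts = average_cuts(traffic, 100)
spike_cuts = average_cuts(traffic, 500)
print(f"average cuts to isolate 100: {normal_cuts:.2f}")
print(f"average cuts to isolate 500: {spike_cuts:.2f}")
```

The spike sits far from everything else, so almost any cut separates it immediately, while the ordinary reading needs several cuts to be peeled away from its neighbors.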

IP Insights

Aspect | Detail
Type | Unsupervised
Problem | Learn IP address usage patterns
How it works | Neural network learns entity-IP associations
When to use | Detect anomalous logins, fraudulent IP usage

Unsupervised — Topic Modeling

Neural Topic Model (NTM)

Aspect | Detail
Type | Unsupervised
Problem | Discover topics in text corpus
How it works | Neural variational inference
When to use | Organize documents, discover themes; often more accurate than LDA

Latent Dirichlet Allocation (LDA)

Aspect | Detail
Type | Unsupervised
Problem | Topic modeling
How it works | Probabilistic generative model
Difference from NTM | Classical statistical approach, faster, often less accurate

NTM vs LDA: NTM is neural-network based (often more accurate), LDA is classical (faster, more interpretable).


Recommendations

Factorization Machines

Aspect | Detail
Type | Supervised
Problem | Classification, regression on sparse data
How it works | Captures pairwise feature interactions efficiently
Key feature | Handles high-dimensional sparse data (click-through, recommendations)
When to use | Recommender systems, click prediction
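
"Captures pairwise feature interactions efficiently" refers to the standard factorization machine model: a bias, a linear term, and a pairwise term where each interaction weight is the dot product of two learned factor vectors. The sketch below computes that prediction two ways, with a naive double loop and with the well-known linear-time rearrangement. All weights and the one-hot input are made-up numbers, and the function names are hypothetical.

```python
def fm_predict_naive(x, w0, w, v):
    """Bias + linear term + explicit sum over all feature pairs (quadratic in features)."""
    n, k = len(x), len(v[0])
    total = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(v[i][f] * v[j][f] for f in range(k))
            total += dot * x[i] * x[j]
    return total


def fm_predict_fast(x, w0, w, v):
    """Same value via 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2] (linear in features)."""
    n, k = len(x), len(v[0])
    total = w0 + sum(w[i] * x[i] for i in range(n))
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(n))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        total += 0.5 * (s * s - s_sq)
    return total


# Sparse one-hot input: [user_1, user_2, item_1, item_2] -> user_1 clicked item_1.
x = [1, 0, 1, 0]
w0 = 0.1
w = [0.2, -0.1, 0.3, 0.05]
v = [[0.5, 0.1], [0.2, 0.4], [0.3, 0.7], [0.6, 0.2]]  # one 2-d factor vector per feature

naive = fm_predict_naive(x, w0, w, v)
fast = fm_predict_fast(x, w0, w, v)
print(naive, fast)
```

Because the interaction weight for any user-item pair comes from two factor vectors, the model can score pairs it has never seen together, which is what makes it a fit for sparse click data.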

Object2Vec

Aspect | Detail
Type | Supervised
Problem | Learn embeddings for paired objects
How it works | Encoder network learns low-dimensional representations
When to use | Document similarity, recommendation, relationship learning

SageMaker Input Modes

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  File Mode   │  │  Pipe Mode   │  │ FastFile Mode│
├──────────────┤  ├──────────────┤  ├──────────────┤
│ Downloads    │  │ Streams from │  │ POSIX-like   │
│ entire       │  │ S3 directly  │  │ access to S3 │
│ dataset to   │  │ (no disk     │  │ (lazy load)  │
│ local disk   │  │  needed)     │  │              │
│              │  │              │  │              │
│ Simplest     │  │ Fastest      │  │ Best of both │
│ Any format   │  │ RecordIO /   │  │ Any format   │
│ Needs disk   │  │ TFRecord     │  │ Low startup  │
└──────────────┘  └──────────────┘  └──────────────┘

Mode | Speed | Disk Required | Format
File | Slowest (download first) | Yes | Any
Pipe | Fastest | No | RecordIO / TFRecord
FastFile | Fast (stream on demand) | No | Any

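Exam tip: Pipe mode + RecordIO = maximum training throughput. File mode is still the default when no input mode is specified; FastFile is the usual recommendation when you want streaming access without converting the data format.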

Why this matters for the exam: Input mode questions come up frequently. The mental model: File mode downloads everything before training starts, so your instance idles while it waits for the download. Pipe mode streams data directly from S3 as training runs, with no waiting and no dataset-sized disk, but the built-in algorithms generally expect RecordIO input when using it. FastFile gives you streaming startup similar to Pipe mode with any format, exposed through ordinary file paths, and is often the recommended choice even though File remains the default setting. When the exam says “minimize training startup time” or “largest dataset, fastest training”, Pipe + RecordIO wins.
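
In practice the input mode is set per data channel when a training job is defined. The sketch below builds a channel as a plain dictionary shaped like an `InputDataConfig` entry of the CreateTrainingJob API, as I understand that structure. Nothing is sent to AWS, `make_channel` is a hypothetical helper, and the bucket name is a placeholder.

```python
VALID_INPUT_MODES = {"File", "Pipe", "FastFile"}


def make_channel(name, s3_uri, input_mode, content_type=None):
    """Build one training data channel with an explicit input mode."""
    if input_mode not in VALID_INPUT_MODES:
        raise ValueError(f"input_mode must be one of {sorted(VALID_INPUT_MODES)}")
    channel = {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        "InputMode": input_mode,
    }
    if content_type is not None:
        channel["ContentType"] = content_type
    return channel


# Placeholder bucket and prefixes, for illustration only.
streaming_channel = make_channel(
    "train", "s3://example-bucket/train-recordio/", "Pipe",
    content_type="application/x-recordio-protobuf",
)
lazy_channel = make_channel("validation", "s3://example-bucket/validation-csv/", "FastFile")
print(streaming_channel["InputMode"], lazy_channel["InputMode"])
```

Setting the mode per channel means a single job can, for example, stream a large training set while lazily reading a small validation set.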


Additional Algorithms (AutoML)

AutoGluon-Tabular

Aspect | Detail
Type | AutoML (ensemble stacking)
Problem | Classification, regression on tabular data
How it works | Automatically trains multiple models (XGBoost, LightGBM, CatBoost, NN, etc.) and stacks them
Key feature | Strong accuracy with little or no manual tuning; just point it at the data
When to use | When you want highest accuracy without manual algorithm selection

CatBoost

Aspect | Detail
Type | Supervised (gradient boosting)
Problem | Classification, regression
Key feature | Native categorical feature support (no encoding needed)
When to use | Data with many categorical features

Compute Type per Algorithm

Algorithm | CPU | GPU | Recommended | Multi-Instance
XGBoost | Yes | Yes (gpu_hist) | CPU (ml.m5) | Yes
Linear Learner | Yes | Yes | CPU (ml.m5) | Yes
Factorization Machines | Yes | No | CPU (ml.c5) | Yes
KNN | Yes | Yes (inference) | CPU (ml.m5) | Yes
K-Means | Yes | No | CPU (ml.m5) | Yes
PCA | Yes | Yes | CPU (ml.m5) | Yes
Random Cut Forest | Yes | No | CPU (ml.m5) | Yes
IP Insights | Yes | Yes | GPU (ml.p3) | Yes
BlazingText | Yes | Yes | Mode-dependent | No
Seq2Seq | No | Yes only | GPU (ml.p3) | No
DeepAR | Yes | Yes | GPU for large | Yes
Object2Vec | Yes | Yes | GPU (ml.p3) | No
Image Classification | No | Yes only | GPU (ml.p3) | Yes
Object Detection | No | Yes only | GPU (ml.p3) | Yes
Semantic Segmentation | No | Yes only | GPU (ml.p3) | No
NTM | Yes | Yes | GPU for large | Yes
LDA | Yes | No | CPU (ml.m5) | No

Exam rules:

  • Vision algorithms + Seq2Seq = GPU only for training, always
  • Tree-based + classical ML = CPU first
  • “Cost-effective” + CPU-capable algorithm = pick CPU instance
  • XGBoost on GPU? Possible (tree_method=gpu_hist) but NOT cost-effective — exam answer is CPU
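
The exam rules above reduce to a short decision function. This is a mnemonic sketch only: the set contents are taken from the compute table, and `pick_training_hardware` and its `cost_effective` flag are hypothetical names, not an AWS API.

```python
# Algorithms the table lists as GPU-only for training.
GPU_ONLY_TRAINING = {
    "Image Classification", "Object Detection", "Semantic Segmentation", "Seq2Seq",
}
# Algorithms the table lists with no GPU option.
CPU_ONLY = {"Random Cut Forest", "LDA", "K-Means", "Factorization Machines"}


def pick_training_hardware(algorithm, cost_effective=True):
    """Apply the exam rules: GPU-only first, then CPU unless GPU is explicitly worth it."""
    if algorithm in GPU_ONLY_TRAINING:
        return "GPU"
    if algorithm in CPU_ONLY or cost_effective:
        return "CPU"
    return "GPU"


for name in ["Seq2Seq", "XGBoost", "LDA", "DeepAR"]:
    print(name, "->", pick_training_hardware(name))
```

The order of the checks matters: the GPU-only rule wins even when the question stresses cost, because those algorithms cannot train on CPU at all.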

Quick Decision Matrix

Your Data | Your Problem | Algorithm
Tabular, structured | Classification/Regression | XGBoost
Tabular, simple linear | Classification/Regression | Linear Learner
Multiple time series | Forecasting | DeepAR
Text documents | Classification | BlazingText
Text corpus | Topic discovery | NTM (or LDA)
Images | Classify whole image | Image Classification
Images | Find objects | Object Detection
Images | Label every pixel | Semantic Segmentation
Numeric data | Find anomalies | Random Cut Forest
IP addresses | Detect fraud | IP Insights
Sparse interaction data | Recommendations | Factorization Machines
High-dimensional | Reduce features | PCA
Any data | Group into clusters | K-Means
Sequences | Translate/summarize | Seq2Seq
Word representations | Embeddings | BlazingText (Word2Vec)
Object pairs | Similarity/embeddings | Object2Vec