Domain 2B: Feature Engineering & Data Preparation
Table of Contents
- Feature Engineering & Data Preparation
- Why Feature Engineering is the Most Important Skill
- Numerical Feature Transformations
- Categorical Feature Encoding
- Text Feature Engineering
- Time Series Feature Engineering
- Handling Class Imbalance
- Dimensionality Reduction as Feature Engineering
- SageMaker Feature Store
- Amazon SageMaker Ground Truth
- Quick Reference: Feature Type → Transformation
- Behind the Scenes: Why These Transformations Work
Feature Engineering & Data Preparation
Exam Domain: 2, Exploratory Data Analysis (24%)
Task: Transform raw data into features that algorithms can learn from
Why Feature Engineering is the Most Important Skill
ELI5: Features are the language your model speaks. Raw data is like speaking French to an English speaker. A model can’t understand that “2026-01-15” is a winter weekday, or that “Dr. John Smith III” is a high-income individual, unless you translate those signals. Feature engineering does that translation.
The hierarchy of impact in ML projects:
Features > Algorithm choice > Hyperparameter tuning
A mediocre algorithm with great features beats
a great algorithm with mediocre features.
Every time.
“Garbage in, garbage out” understates the problem. Even clean raw data often produces garbage output because the model lacks the right representation. Raw sensor readings, raw timestamps, and raw text contain signal — but it’s buried in a form the model can’t decode.
Why this matters for the exam: MLS-C01 questions frequently ask “which preprocessing step should you apply given this problem?” or “why is model performance poor despite good data?” The answer is almost always a feature engineering issue: wrong encoding, lack of scaling, unhandled skew, or class imbalance.
Numerical Feature Transformations
Scaling and Normalization
ELI5: Imagine mixing flour measured in grams (0–500) with water measured in liters (0–5). Without scaling, the model thinks flour varies 100x more than water, just because the numbers are bigger — not because flour is 100x more important. Scaling puts everything on the same playing field.
Min-Max Scaling (Normalization)
$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$
- Output range: [0, 1]
- Preserves the shape of the original distribution
- Sensitive to outliers (one extreme value compresses all others near zero)
Standardization (Z-score Scaling)
$$x' = \frac{x - \mu}{\sigma}$$
- Output: mean = 0, std = 1 (no fixed range)
- Less sensitive to outliers than Min-Max, though the mean and std are still pulled by extreme values
- Required when algorithm assumes zero-mean inputs (PCA, SVM)
Robust Scaling
$$x' = \frac{x - \text{median}}{IQR}$$
- Uses median and IQR instead of mean and std
- Most outlier-resistant scaling method
- Best choice when data has significant outliers
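The three formulas above can be compared side by side on a small sample. The sketch below uses only the Python standard library; the data values and function names are made up for illustration, and the quartiles use the "inclusive" interpolation method, which is one of several conventions.

```python
import statistics

def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)  # population standard deviation
    return [(x - mu) / sigma for x in xs]

def robust_scale(xs):
    # Quartiles with inclusive interpolation; other conventions give slightly different IQRs.
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]

values = [10, 12, 14, 16, 18, 1000]  # hypothetical data; 1000 is an outlier

print([round(v, 3) for v in min_max(values)])       # [0.0, 0.002, 0.004, 0.006, 0.008, 1.0]
print([round(v, 3) for v in standardize(values)])
print([round(v, 3) for v in robust_scale(values)])  # [-1.0, -0.6, -0.2, 0.2, 0.6, 197.0]
```

With Min-Max, the single outlier squeezes the five ordinary values below 0.01, while robust scaling keeps them spread between -1 and 0.6. In practice, scikit-learn's `MinMaxScaler`, `StandardScaler`, and `RobustScaler` cover the same ground.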
When to Use Which
| Scaling Method | Use When | Avoid When |
|---|---|---|
| Min-Max | Need values in [0,1]; neural networks with sigmoid output | Outliers present |
| Standardization | PCA, SVM, neural networks; Gaussian assumption needed | Heavy-tailed data |
| Robust Scaling | Outliers present and can’t be removed | All values must be in [0,1] |
| No scaling | Tree-based algorithms (XGBoost, Random Forest, Decision Trees) | — |
WHY algorithms need scaling:
Gradient descent:
Features with large values → large gradients → unstable convergence
Features with small values → tiny gradients → slow learning
Scaling → gradients are comparable → faster, stable convergence
Distance-based (KNN, SVM, K-Means):
Without scaling: distance dominated by largest-range feature
Height (150-200 cm, spread 50) vs Age (0-100 yrs, spread 100) → age dominates KNN
Scaling → all features contribute equally to distance
Regularization (Ridge, Lasso):
Penalty applied equally to all coefficients
Without scaling: coefficient size depends on feature units, so small-range features (which need larger coefficients) are penalized more heavily
Which algorithms need scaling:
| Need Scaling | Don’t Need Scaling |
|---|---|
| SVM | Decision Trees |
| KNN | Random Forest |
| K-Means | XGBoost / GBM |
| Neural Networks | LightGBM |
| PCA | CatBoost |
| Linear/Logistic Regression (with regularization) | Naive Bayes |
Log Transform
$$x' = \log(x + 1)$$
(+1 to handle zeros)
- Compresses large values, expands small values
- Converts right-skewed → approximately normal
- Essential for: income, population, website traffic, any “hockey stick” data
Before log transform: right-skewed. After log transform: more symmetric.
Behind the scenes: Log transform works because many natural phenomena are multiplicative, not additive. A 10% raise means different absolute dollars depending on your salary. Log transform converts “multiply by 1.1” → “add 0.095” — turning multiplicative relationships linear.
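A quick numeric check of the multiplicative-to-additive claim, as a minimal sketch with made-up salary figures (each one double the previous):

```python
import math

salaries = [30_000, 60_000, 120_000, 240_000]  # hypothetical values

raw_gaps = [b - a for a, b in zip(salaries, salaries[1:])]
log_values = [math.log1p(s) for s in salaries]  # log(x + 1), defined at x = 0
log_gaps = [b - a for a, b in zip(log_values, log_values[1:])]

print(raw_gaps)                          # [30000, 60000, 120000]: gaps keep growing
print([round(g, 3) for g in log_gaps])   # [0.693, 0.693, 0.693]: about log(2) each time
print(round(math.log(1.1), 3))           # 0.095: a 10% raise is a fixed step in log space
```

Each doubling becomes the same additive step after the transform, which is the behavior a linear model can represent directly.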
Box-Cox Transform
Generalization of the log transform — finds the optimal power $\lambda$ automatically:
$$x' = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \lambda \neq 0 \\ \ln(x) & \lambda = 0 \end{cases}$$
- When $\lambda = 0$: reduces to log transform
- When $\lambda = 1$: no transform
- When $\lambda = 0.5$: square root transform
- Only works for positive values (use Yeo-Johnson for zero/negative values)
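The special cases listed above can be checked with a direct implementation of the formula. This sketch takes $\lambda$ as an input rather than estimating it; choosing the optimal $\lambda$ is usually left to a library routine such as `scipy.stats.boxcox`.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of one positive value for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox is defined only for positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

x = 9.0
print(box_cox(x, 1))      # 8.0, i.e. x - 1: a shift only, shape unchanged
print(box_cox(x, 0.5))    # 4.0, i.e. 2 * (sqrt(9) - 1): a rescaled square root
print(box_cox(x, 0))      # log(9)
print(box_cox(x, 1e-6))   # very close to log(9): the lambda -> 0 limit
```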
Binning / Discretization
Converting continuous values into categorical bins.
| Method | How | When |
|---|---|---|
| Equal-width | Divide range into equal intervals | Uniform distribution |
| Equal-frequency | Each bin has same count | Skewed data |
| Domain-based | Expert-defined boundaries | Age: child/teen/adult/senior |
When binning helps:
- Reduces noise in a continuous feature that has noisy measurements
- Captures non-linear relationship (income → bracket → behavior)
- When domain knowledge defines natural thresholds (temperature → cold/warm/hot)
When binning hurts:
- Loses ordinal information within bins
- Creates arbitrary boundary effects (age 29 vs 30 treated very differently)
- Information loss — generally reduces model performance unless relationship is truly categorical
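To make the difference between equal-width and equal-frequency binning concrete, here is a small sketch on a skewed sample. The income values and helper names are hypothetical, and ties in the equal-frequency version are simply broken by sort order.

```python
def equal_width_bins(xs, k):
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k
    # The maximum value is clamped into the last bin.
    return [min(int((x - lo) / width), k - 1) for x in xs]

def equal_frequency_bins(xs, k):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    bins = [0] * len(xs)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(xs)
    return bins

incomes = [20, 22, 25, 27, 30, 35, 40, 500]  # hypothetical, in thousands; right-skewed

print(equal_width_bins(incomes, 4))      # [0, 0, 0, 0, 0, 0, 0, 3]
print(equal_frequency_bins(incomes, 4))  # [0, 0, 1, 1, 2, 2, 3, 3]
```

On skewed data, equal-width bins put almost everything into one bin, while equal-frequency bins keep the counts balanced.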
Polynomial Features
Create new features by combining existing ones:
$$x_1, x_2 \rightarrow x_1, x_2, x_1^2, x_2^2, x_1 x_2$$
- Enables linear models to capture non-linear relationships
- Degree 2 expansion of $n$ original features yields $\frac{n(n+1)}{2} + n$ features in total (originals, squares, and pairwise products; bias term excluded)
Warning — feature explosion:
10 features, degree 2 → 65 features
100 features, degree 2 → 5,150 features
100 features, degree 3 → 176,850 features
This triggers the curse of dimensionality. Use sparingly and with strong domain justification.
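The growth can be reproduced with a counting formula: the number of monomials of total degree at most $d$ in $n$ variables is $\binom{n+d}{d}$, including the constant term. The sketch below excludes that constant by default and also expands one row to degree 2; the function names are illustrative.

```python
from itertools import combinations_with_replacement
from math import comb

def poly_feature_count(n, degree, include_bias=False):
    """Number of polynomial features of total degree <= `degree` for n inputs."""
    total = comb(n + degree, degree)
    return total if include_bias else total - 1

def degree2_features(row):
    """Original values followed by all squares and pairwise products."""
    products = [a * b for a, b in combinations_with_replacement(row, 2)]
    return list(row) + products

print(poly_feature_count(10, 2))    # 65
print(poly_feature_count(100, 2))   # 5150
print(degree2_features([2, 3]))     # [2, 3, 4, 6, 9]
```

scikit-learn's `PolynomialFeatures` performs the same expansion and includes the constant column unless `include_bias=False` is set.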
Categorical Feature Encoding
ELI5: Telling a model that cat=1, dog=2, fish=3 implies dog is “between” cat and fish, and fish is “three times” something cat is. That’s nonsense. One-hot encoding says “these are completely separate things with no mathematical relationship” — it creates a binary column for each category.
Label Encoding
Categories mapped to integers: {low=0, medium=1, high=2}
- DANGER for nominal categories: implies ordering that doesn’t exist
- Safe for ordinal categories: size (S < M < L < XL), rating (1-5 stars)
- Safe for tree-based algorithms: trees split on thresholds, so arbitrary numbers don’t imply false order in practice
One-Hot Encoding (OHE)
Creates one binary column per category value.
Color: [Red, Blue, Green]
↓
color_Red color_Blue color_Green
1 0 0 (was Red)
0 1 0 (was Blue)
0 0 1 (was Green)
Dummy variable trap: With k categories and k columns, the last column is perfectly predictable from the other k-1 columns → multicollinearity. Fix: drop one column (use k-1 columns).
When OHE fails:
- High-cardinality features (city names, user IDs, product SKUs) → thousands of sparse columns
- Memory-intensive, can overfit, slows training dramatically
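A minimal one-hot sketch, including the drop-one-column fix for the dummy variable trap. The `prefix` argument and the alphabetical column order are choices made for this illustration only.

```python
def one_hot(values, prefix, drop_first=False):
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]  # the dropped category becomes the all-zeros row
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[int(v == c) for c in categories] for v in values]
    return columns, rows

colors = ["Red", "Blue", "Green"]

print(one_hot(colors, "color"))
# (['color_Blue', 'color_Green', 'color_Red'], [[0, 0, 1], [1, 0, 0], [0, 1, 0]])
print(one_hot(colors, "color", drop_first=True))
# (['color_Green', 'color_Red'], [[0, 1], [0, 0], [1, 0]])
```

With `drop_first=True`, "Blue" is encoded as all zeros, so k categories need only k-1 columns.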
Target Encoding
Replace each category with the mean of the target variable for that category:
City | Mean_House_Price → encoding
NYC | $1.2M → 1,200,000
LA | $800K → 800,000
Houston | $350K → 350,000
- Handles high cardinality elegantly
- Risk: target leakage — you’re using the target to create features
- Fix: use cross-validated target encoding — encode fold $k$ using mean from all other folds
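The cross-validated fix can be sketched as follows. Each row is encoded with the category mean computed only from rows in other folds, so a row's own target never leaks into its feature. The round-robin fold assignment, the global-mean fallback, and the data are simplifying assumptions for this example.

```python
from statistics import fmean

def out_of_fold_target_encode(categories, targets, n_folds=3):
    """Encode each row with its category's target mean from the other folds."""
    global_mean = fmean(targets)
    encoded = []
    for i, cat in enumerate(categories):
        fold = i % n_folds  # simple deterministic fold assignment (illustration only)
        others = [t for j, (c, t) in enumerate(zip(categories, targets))
                  if c == cat and j % n_folds != fold]
        # Fall back to the global mean when the category is absent from the other folds.
        encoded.append(fmean(others) if others else global_mean)
    return encoded

cities = ["NYC", "NYC", "NYC", "LA", "LA", "Houston"]  # hypothetical rows
prices = [1.0, 1.2, 1.4, 0.7, 0.9, 0.35]               # house price in $M

print([round(v, 3) for v in out_of_fold_target_encode(cities, prices)])
# [1.3, 1.2, 1.1, 0.9, 0.7, 0.925]
```

The single Houston row has no other rows to learn from, so it receives the global mean. Rare categories usually need this kind of fallback or smoothing.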
Binary Encoding
Compromise between label and one-hot:
- Assign each category an integer
- Convert integer to binary
- Each bit becomes a column
Category: 5 categories
One-hot encoding: 5 columns → Binary: only 3 columns (bits for 0-4)
For 1000 categories: OHE = 1000 cols, Binary = 10 cols
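The column counts above follow from the number of bits needed to write the largest category index. A minimal sketch, assuming categories have already been mapped to integers starting at 0:

```python
def binary_encode(index, n_categories):
    """Turn an integer category index into a fixed-width list of bits."""
    width = max(1, (n_categories - 1).bit_length())
    return [(index >> bit) & 1 for bit in reversed(range(width))]

print(binary_encode(4, 5))             # [1, 0, 0]: 3 columns cover indices 0-4
print(len(binary_encode(999, 1000)))   # 10 columns instead of 1000
```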
Embeddings
Learned dense vector representations for high-cardinality categorical features.
- Neural network learns a dense representation (e.g., 50-dimensional vector per city)
- Captures semantic similarity: categories that behave alike (e.g., “Paris” and “London”) tend to end up with similar embeddings
- Essential for: user IDs, product IDs, text in deep learning
- SageMaker: Entity embedding layers in custom neural networks
Encoding Decision Guide
| Feature Type | Cardinality | Recommended Encoding |
|---|---|---|
| Ordinal (ordered) | Any | Label encoding |
| Nominal | Low (< 10) | One-hot encoding |
| Nominal | Medium (10–50) | Binary or Target encoding |
| Nominal | High (> 50) | Target encoding or Embeddings |
| Free-form text | — | TF-IDF, Word embeddings |
| Tree-based algorithms | Any | Label encoding is fine |
Text Feature Engineering
Bag of Words (BoW)
Represent text as a vector of word counts, ignoring order.
Sentence 1: "the cat sat on the mat" → {the:2, cat:1, sat:1, on:1, mat:1}
Sentence 2: "the dog sat on the rug" → {the:2, dog:1, sat:1, on:1, rug:1}
- Simple, interpretable
- Loses word order and context (“not good” ≠ “good”)
- Creates very sparse, high-dimensional vectors
TF-IDF
Term Frequency × Inverse Document Frequency
$$\text{TF-IDF}(t, d) = \underbrace{\frac{\text{count}(t,d)}{\text{total words in }d}}_{\text{TF}} \times \underbrace{\log\frac{N}{\text{docs containing }t}}_{\text{IDF}}$$
ELI5: “The” appears in every document — it’s useless for classification, so IDF gives it a very low weight. “Quantum” appears in only 3 out of 10,000 documents — it’s highly discriminative, so IDF gives it a high weight. TF-IDF automatically downweights common words and upweights rare, informative words.
- TF: how often does a term appear in this document?
- IDF: how rare is this term across all documents?
- High TF-IDF: term appears often in this doc but rarely elsewhere → highly characteristic
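The formula can be applied directly to the two example sentences from the Bag of Words section. This is a plain-Python sketch of the unsmoothed definition; libraries often use smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using the unsmoothed formula."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat on the mat", "the dog sat on the rug"]
for doc_weights in tf_idf(docs):
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

Words shared by both sentences ("the", "sat", "on") get weight 0 because their IDF is log(2/2) = 0, while "cat", "mat", "dog", and "rug" each get (1/6) × log(2) ≈ 0.116.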
Word Embeddings
Dense vector representations learned from large text corpora.
Traditional (BoW/TF-IDF):
"king" → [0, 0, 1, 0, 0, ..., 0] (sparse, 50,000 dims)
Word2Vec/GloVe:
"king" → [0.2, -0.4, 0.8, 0.1, ...] (dense, 100-300 dims)
"queen" → [0.1, -0.5, 0.7, 0.3, ...] (similar to king)
king - man + woman ≈ queen (vector arithmetic captures semantics)
Why embeddings beat BoW:
- Capture semantic similarity (king/queen are neighbors in embedding space)
- Dense → much smaller than sparse one-hot
- Pre-trained embeddings transfer knowledge from billions of words
N-grams
Capture sequences of N consecutive words:
Unigram (N=1): "not", "good"
Bigram (N=2): "not good" ← captures negation
Trigram (N=3): "not very good"
Critical for sentiment analysis where word order changes meaning.
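Extracting n-grams takes only a sliding window over the token list. A minimal sketch using whitespace tokenization (real pipelines typically use a proper tokenizer):

```python
def ngrams(text, n):
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "not very good"
print(ngrams(sentence, 1))  # ['not', 'very', 'good']
print(ngrams(sentence, 2))  # ['not very', 'very good']
print(ngrams(sentence, 3))  # ['not very good']
```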
Text Preprocessing Pipeline
Raw text
↓ Lowercasing
↓ Remove punctuation / HTML
↓ Tokenization (split into words/subwords)
↓ Stop word removal ("the", "is", "at")
↓ Stemming (running → run) OR Lemmatization (better → good)
↓ Vectorization (BoW / TF-IDF / Embeddings)
↓ ML-ready features
Stemming vs Lemmatization:
- Stemming: crude chopping (studies → studi) — fast, imprecise
- Lemmatization: proper linguistic reduction (better → good) — slower, more accurate
BlazingText in SageMaker
SageMaker’s built-in algorithm for:
- Word2Vec training at high speed (uses GPU)
- Text classification (supervised mode)
Exam tip: BlazingText is the fastest way to train Word2Vec or text classifiers on AWS. If a question involves word embeddings or text classification at scale, BlazingText is the answer.
Time Series Feature Engineering
ELI5: A model sees numbers, not dates. When you feed it “2026-01-15,” it just sees a number. It doesn’t know that’s a Thursday in winter, or that Thursdays tend to have higher sales, or that January is the slow season. Feature engineering unpacks the date into signals the model can actually learn from.
Lag Features
Use previous values as features:
$$\text{features}: y_{t-1}, y_{t-2}, y_{t-7}, y_{t-28}$$
- $y_{t-1}$: yesterday’s value (captures autocorrelation)
- $y_{t-7}$: same day last week (captures weekly seasonality)
- $y_{t-28}$: same period last month
Exam tip: Lag features are how you convert a time series problem into a standard supervised learning problem. The “labels” are $y_t$, the features include lagged $y$ values.
Rolling Statistics
Rolling mean (window=7): average of last 7 days
Rolling std (window=7): volatility of last 7 days
Rolling min/max: recent range
Exponential weighted mean: recent values weighted more heavily
Captures local trends and volatility that single lag features miss.
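Lag and rolling features can be sketched in a few lines. The rolling mean below deliberately uses only values strictly before each position, which is one way to avoid leaking $y_t$ into its own features; the sales numbers are invented.

```python
from statistics import fmean

def lag(series, k):
    """Value k steps earlier; None where the history is too short."""
    return [None] * k + series[:-k] if k > 0 else list(series)

def rolling_mean(series, window):
    """Mean of the `window` values strictly before each position."""
    return [fmean(series[i - window:i]) if i >= window else None
            for i in range(len(series))]

sales = [10, 12, 11, 13, 15, 14, 16, 18]  # hypothetical daily values

print(lag(sales, 1))           # [None, 10, 12, 11, 13, 15, 14, 16]
print(rolling_mean(sales, 3))  # [None, None, None, 11.0, 12.0, 13.0, 14.0, 15.0]
```

Rows with `None` have no usable history and are usually dropped or imputed before training.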
Date Decomposition Features
Raw: "2026-01-15 14:35:00"
↓
year: 2026
month: 1
day_of_month: 15
day_of_week: 3 (Thursday)
hour: 14
is_weekend: 0
is_holiday: 0
quarter: 1
week_of_year: 3
These features let the model discover seasonal patterns without knowing what “Thursday” or “January” means.
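The decomposition above maps directly onto Python's `datetime` module. In this sketch `is_holiday` is left out because it needs an external holiday calendar, and the week number follows the ISO 8601 convention.

```python
from datetime import datetime

def date_features(timestamp):
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "year": ts.year,
        "month": ts.month,
        "day_of_month": ts.day,
        "day_of_week": ts.weekday(),          # Monday = 0, so Thursday = 3
        "hour": ts.hour,
        "is_weekend": int(ts.weekday() >= 5),
        "quarter": (ts.month - 1) // 3 + 1,
        "week_of_year": ts.isocalendar()[1],  # ISO 8601 week number
    }

print(date_features("2026-01-15 14:35:00"))
```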
Seasonality and Differencing
Seasonality extraction: Use Fourier terms (sine/cosine) to capture cyclical patterns:
$$\sin\left(\frac{2\pi t}{P}\right), \cos\left(\frac{2\pi t}{P}\right)$$
where $P$ is the period (e.g., 7 for weekly, 365 for yearly).
Differencing for stationarity:
$$y'_t = y_t - y_{t-1}$$
First-order differencing removes a linear trend. Seasonal differencing ($y_t - y_{t-P}$) removes seasonality. ARMA-type models assume a stationary series; ARIMA and SARIMA apply this differencing internally through their $d$ and $D$ orders.
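Both ideas are easy to verify on toy series. The sketch below differences a made-up linear trend and a made-up weekly pattern, then checks that the Fourier terms repeat after one period.

```python
import math

def difference(series, lag=1):
    """y_t - y_{t-lag}; the first `lag` points are dropped."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def fourier_terms(t, period):
    angle = 2 * math.pi * t / period
    return math.sin(angle), math.cos(angle)

trend = [5 + 3 * t for t in range(6)]      # hypothetical linear trend
print(difference(trend))                   # [3, 3, 3, 3, 3]: the trend is gone

weekly = [10, 20, 30, 40, 50, 60, 70] * 2  # pattern repeating every 7 steps
print(difference(weekly, lag=7))           # [0, 0, 0, 0, 0, 0, 0]

s1, c1 = fourier_terms(3, 7)
s2, c2 = fourier_terms(10, 7)              # one full period later
print(math.isclose(s1, s2, abs_tol=1e-9), math.isclose(c1, c2, abs_tol=1e-9))  # True True
```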
Handling Class Imbalance
Why Imbalance is a Problem
Dataset: 99% "not fraud," 1% "fraud"
Naive model: predict "not fraud" for everything
Accuracy: 99% ← looks great!
Recall for fraud: 0% ← completely useless
Accuracy is a misleading metric on imbalanced datasets. Use precision, recall, F1, AUC-ROC instead.
Oversampling
Increase the minority class to balance the dataset.
SMOTE (Synthetic Minority Over-sampling Technique):
- For each minority sample, find its k nearest minority neighbors
- Create synthetic samples by interpolating along the line between them
$$x_{new} = x_i + \lambda \times (x_{neighbor} - x_i), \quad \lambda \in [0,1]$$
ELI5: SMOTE creates new fake fraud examples by “blending” real fraud examples — like a photo-morphing app that creates a new face halfway between two existing faces. Instead of just duplicating fraud examples (which overfits), it creates plausible new variations.
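The interpolation step can be sketched in plain Python. This is a simplified illustration of the idea, not a reference implementation: it picks a random minority point, one of its `k` nearest minority neighbors, and a random $\lambda$. The sample points, function name, and defaults are hypothetical. For real work, the imbalanced-learn package provides a `SMOTE` class.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x_i = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(p, x_i))[:k]
        x_nb = rng.choice(neighbors)
        lam = rng.random()  # lambda drawn from [0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x_i, x_nb)))
    return synthetic

fraud = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]  # hypothetical minority samples
new_points = smote(fraud, n_new=5)
print(len(new_points))  # 5
```

Every synthetic point lies on a segment between two real minority samples, so it stays inside the region the minority class already occupies.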
Undersampling
Remove majority class samples to balance the dataset.
- Random undersampling: randomly remove majority class examples
- Tomek links: remove majority samples that are too close to minority class (cleaner decision boundary)
- Best when majority class is very large and losing some data is acceptable
Class Weights
Tell the model that minority class mistakes cost more:
$$\text{class weight} = \frac{n_{\text{samples}}}{n_{\text{classes}} \times n_{\text{samples in class}}}$$
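In scikit-learn estimators, pass the class_weight parameter. Among SageMaker built-in algorithms, Linear Learner exposes positive_example_weight_mult (binary) and balance_multiclass_weights (multiclass), and XGBoost uses scale_pos_weight. In neural networks, use weighted loss functions.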
Behind the scenes: Weighted loss penalizes misclassifying the minority class more heavily. The model learns that getting a fraud wrong is worse than getting a non-fraud wrong — without changing the data at all.
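Applying the class weight formula to the 99% / 1% fraud example gives a feel for the magnitudes involved. A minimal sketch with hypothetical labels:

```python
from collections import Counter

def balanced_class_weights(labels):
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

labels = ["not_fraud"] * 990 + ["fraud"] * 10  # hypothetical 99% / 1% split
weights = balanced_class_weights(labels)

print(round(weights["not_fraud"], 3))  # 0.505
print(weights["fraud"])                # 50.0
print(990 / 10)                        # 99.0: negatives / positives, a common starting point for scale_pos_weight
```

A fraud example therefore counts about 99 times as much as a non-fraud example in the loss, which matches the class ratio.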
Threshold Adjustment
Default classification threshold is 0.5. For imbalanced problems, lower threshold → catch more positives (higher recall, lower precision).
Choose threshold by optimizing the metric that matters for your use case (F1, precision at fixed recall, etc.) using the precision-recall curve.
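The precision/recall trade-off can be seen by scoring the same predictions at two thresholds. The labels and scores below are invented for illustration.

```python
def precision_recall(y_true, scores, threshold):
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, y_true) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]  # hypothetical labels
scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.8]

print(precision_recall(y_true, scores, 0.5))  # precision 1.0, recall 1/3
print(precision_recall(y_true, scores, 0.3))  # precision 0.6, recall 1.0
```

Lowering the threshold from 0.5 to 0.3 finds all three positives at the cost of two false alarms.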
Technique Selection Guide
| Situation | Recommended Technique |
|---|---|
| Enough data, minority class rare (< 5%) | SMOTE + class weights |
| Very large dataset, minority rare | Undersampling + class weights |
| Small dataset | SMOTE only (no undersampling) |
| Tree-based algorithm | Class weights (built-in support) |
| Neural network | Weighted loss function |
| Can’t modify training process | Threshold adjustment at inference |
Dimensionality Reduction as Feature Engineering
Curse of Dimensionality
ELI5: In 2D, you can cover a grid with 100 dots. In 3D, you need 1,000. In 10D, you need 10 billion. More dimensions = more empty space = every data point is “far” from every other data point = distance-based algorithms break down. This is the curse of dimensionality.
As dimensions increase:
- Volume of feature space grows exponentially
- Data becomes sparse — even large datasets are insufficient
- Distance metrics become less meaningful
- Models require exponentially more data to generalize
PCA: Principal Component Analysis
First principles: Find the directions in feature space where the data varies the most. Project data onto these directions. Discard the low-variance directions (noise).
Original 2D data → PCA projection to 1D: points are projected onto PC1, the direction of maximum variance, which is preserved.
How it works (behind the scenes):
- Standardize the data (mean=0, std=1)
- Compute the covariance matrix $\Sigma$
- Compute eigenvectors and eigenvalues of $\Sigma$
- Sort eigenvectors by eigenvalue (descending)
- Project data onto top $k$ eigenvectors
Each principal component is a linear combination of original features:
$$PC_1 = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n$$
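For two features, the steps above can be carried out without a linear algebra library, because a symmetric 2×2 matrix has closed-form eigenvalues. The sketch below assumes the two features are correlated (a zero off-diagonal term would make the eigenvector formula degenerate), and the height/weight values are made up.

```python
import math
from statistics import fmean, pstdev

def standardize(xs):
    mu, sigma = fmean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def pca_2d(x1, x2):
    z1, z2 = standardize(x1), standardize(x2)          # step 1: standardize
    n = len(z1)
    a = sum(v * v for v in z1) / n                     # step 2: covariance [[a, b], [b, c]]
    c = sum(v * v for v in z2) / n
    b = sum(u * v for u, v in zip(z1, z2)) / n
    mid, half = (a + c) / 2, math.hypot((a - c) / 2, b)
    eigenvalues = [mid + half, mid - half]             # steps 3-4: eigenvalues, descending
    w = (b, eigenvalues[0] - a)                        # eigenvector of the largest one (needs b != 0)
    norm = math.hypot(*w)
    w = (w[0] / norm, w[1] / norm)
    pc1 = [w[0] * u + w[1] * v for u, v in zip(z1, z2)]  # step 5: project onto PC1
    return eigenvalues, w, pc1

heights = [150, 160, 170, 180, 190]  # hypothetical, cm
weights = [50, 58, 69, 81, 92]       # hypothetical, kg

eigenvalues, w, pc1 = pca_2d(heights, weights)
print([round(v, 4) for v in eigenvalues])
print([round(v, 4) for v in w])
print(round(eigenvalues[0] / sum(eigenvalues), 4))  # share of variance explained by PC1
```

Because the two features are almost perfectly correlated, PC1 weights them nearly equally and captures almost all of the variance.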
Choosing the number of components:
Explained Variance Ratio:
Component 1: 45%
Component 2: 30%
Component 3: 15% ← elbow
Component 4: 5%
Component 5: 3%
...
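Keep enough components to reach the target cumulative variance. For a 95% target that means components 1-4 here (45 + 30 + 15 + 5 = 95%); stopping at the elbow (components 1-3) retains 90%.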
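Summing the ratios in the listing above gives the cumulative explained variance, and the component count follows from the chosen target. A small sketch, using integer percentages to avoid floating-point edge cases at the threshold:

```python
from itertools import accumulate

def components_for_target(explained_pct, target_pct):
    """Smallest k whose cumulative explained variance reaches target_pct."""
    for k, cumulative in enumerate(accumulate(explained_pct), start=1):
        if cumulative >= target_pct:
            return k
    return None  # target not reachable with the listed components

ratios = [45, 30, 15, 5, 3]  # percentages from the listing above

print(list(accumulate(ratios)))           # [45, 75, 90, 95, 98]
print(components_for_target(ratios, 90))  # 3
print(components_for_target(ratios, 95))  # 4
```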
When to use PCA:
| Scenario | Use PCA? |
|---|---|
| Before KNN or SVM (curse of dimensionality) | Yes |
| Visualization (reduce to 2D/3D) | Yes |
| Remove noise from correlated features | Yes |
| Tree-based algorithms | Generally no (trees handle high-dim fine) |
| Features must be interpretable | No (PCs are linear combinations, hard to explain) |
| Before sparse regression (Lasso) | No (Lasso does its own selection) |
Exam tip: PCA requires standardization first. If a question shows PCA applied to unstandardized data, that’s the bug. Large-range features will dominate the principal components.
t-SNE
t-distributed Stochastic Neighbor Embedding — non-linear dimensionality reduction for visualization only.
- Projects high-dimensional data to 2D or 3D
- Preserves local structure (nearby points stay nearby)
- Does NOT preserve global structure or distances
- Not suitable as a preprocessing step for ML — only for visual exploration
Feature Selection vs Feature Extraction
| Feature Selection | Feature Extraction |
|---|---|
| Select a subset of original features as-is | Create new features from originals |
| Original features preserved | New features (like PCs) created |
| Interpretable | May not be interpretable (PCA, autoencoders) |
Feature selection methods:
| Category | Method | How |
|---|---|---|
| Filter | Correlation, Chi-squared, Mutual Information | Score features independently |
| Wrapper | Recursive Feature Elimination (RFE) | Train model, remove weakest features, repeat |
| Embedded | L1 regularization (Lasso), tree feature importance | Feature selection built into training |
ELI5 on Lasso (L1): Lasso adds a penalty proportional to the absolute value of each coefficient. The optimizer finds that setting some coefficients exactly to zero reduces total cost. Those features are effectively removed. It’s automatic feature selection baked into the model.
SageMaker Feature Store
What it is: A fully managed, centralized repository for ML features — both for training and real-time inference.
Feature Engineering Pipeline
↓
┌──────────────────────────┐
│ Feature Store │
│ │
│ ┌───────────────────┐ │
│ │ Online Store │ │ ← DynamoDB-backed
│ │ (low latency) │ │ millisecond reads
│ └───────────────────┘ │
│ ┌───────────────────┐ │
│ │ Offline Store │ │ ← S3-backed
│ │ (S3 + Glue) │ │ for training
│ └───────────────────┘ │
└──────────────────────────┘
↓ ↓
Training Real-time
Jobs Inference
Why Feature Store Exists
ELI5: Without Feature Store, your training pipeline computes customer age as of January 1st (training date), but your serving pipeline computes it as of today. The model was trained on “age at training time” but served with “current age” — a subtle mismatch that’s nearly impossible to debug in production. Feature Store timestamps features and ensures both pipelines use the same computation.
This is called training-serving skew — the silent killer of ML systems in production.
Feature Store solves:
- Training-serving skew (same features for training and inference)
- Feature reuse (compute once, use in many models)
- Feature discovery (catalog of available features for the team)
- Point-in-time correctness (get features as they existed at any past timestamp)
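Point-in-time correctness is easiest to understand with a toy model. The class below is not the Feature Store API; it is a hypothetical in-memory stand-in (class, method, and feature names are invented) that keeps every version of a record with its event time and answers "what did this feature look like as of time T?".

```python
from bisect import bisect_right

class TinyFeatureHistory:
    """Toy stand-in for a feature history: (event_time, value) per record id."""

    def __init__(self):
        self._history = {}  # record_id -> list of (event_time, value), sorted by time

    def put(self, record_id, event_time, value):
        rows = self._history.setdefault(record_id, [])
        rows.append((event_time, value))
        rows.sort(key=lambda row: row[0])

    def as_of(self, record_id, timestamp):
        """Latest value with event_time <= timestamp, or None if none existed yet."""
        rows = self._history.get(record_id, [])
        times = [t for t, _ in rows]
        idx = bisect_right(times, timestamp)
        return rows[idx - 1][1] if idx else None

store = TinyFeatureHistory()
store.put("customer_42", "2026-01-01", {"orders_30d": 3})
store.put("customer_42", "2026-02-01", {"orders_30d": 7})

print(store.as_of("customer_42", "2026-01-15"))  # {'orders_30d': 3}
print(store.as_of("customer_42", "2026-03-01"))  # {'orders_30d': 7}
print(store.as_of("customer_42", "2025-12-01"))  # None
```

Building a training set with `as_of` lookups at each label's timestamp keeps future feature values out of past training rows.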
Key Concepts
| Concept | Meaning |
|---|---|
| Feature Group | Logical collection of related features (e.g., “customer_demographics”) |
| Record identifier | Primary key (e.g., customer_id) |
| Event time | When the feature values were computed |
| Online Store | Low-latency retrieval for real-time inference (<10ms) |
| Offline Store | S3-backed, queryable via Athena, for training jobs |
Exam tip: Online store = real-time inference, low latency. Offline store = batch training. When an exam question asks about preventing training-serving skew or centralizing feature management, the answer is SageMaker Feature Store.
Amazon SageMaker Ground Truth
What it is: A fully managed data labeling service for creating high-quality training datasets.
Raw Unlabeled Data (S3)
↓
Ground Truth Job
↓
┌─────────────────────────────┐
│ Active Learning Loop │
│ │
│ ML model pre-labels data │
│ ↓ │
│ High confidence → Accept │
│ Low confidence → Human │
│ ↓ │
│ Human labels verified │
│ ↓ │
│ Model retrained │
└─────────────────────────────┘
↓
Labeled Dataset (S3)
Labeling Workforce Options
| Option | Use When | Cost |
|---|---|---|
| Amazon Mechanical Turk | Large volume, non-sensitive, general tasks | Lowest |
| Private workforce | Sensitive data, domain experts needed | Depends on employees |
| AWS Marketplace vendors | Professional labelers, specialized domains | Medium-high |
Active Learning (Human-in-the-Loop)
- Small set of human-labeled examples trains initial model
- Model labels remaining data with confidence scores
- High-confidence labels → accepted automatically
- Low-confidence labels → sent to humans for review
- Human-verified labels → retrain model
- Repeat until quality threshold reached
Cost reduction: AWS states that automated data labeling can reduce labeling costs by up to 70% compared with labeling everything manually.
Ground Truth Task Types
| Task Type | Examples |
|---|---|
| Image classification | Is this a cat or dog? |
| Image bounding boxes | Draw boxes around cars |
| Image segmentation | Pixel-level object boundaries |
| Text classification | Sentiment: positive/negative |
| Named entity recognition | Tag person/place/org in text |
| Video classification | Action recognition |
| 3D point cloud | LiDAR for autonomous vehicles |
Exam tip: Ground Truth + active learning = cost-efficient labeling at scale. Ground Truth Plus = a white-glove managed service where AWS handles the entire labeling workflow.
Why Labeling Quality Matters
A model is bounded by the quality of its labels. Systematic labeling errors (e.g., labelers disagree on edge cases) create noisy targets that no amount of architecture tuning can overcome. Ground Truth uses annotation consolidation (multiple labelers + majority vote) to reduce individual labeler noise.
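Majority-vote consolidation can be sketched in a few lines. This is a simplified illustration of the idea described above; the service's own consolidation logic may be more sophisticated, and the function name is hypothetical.

```python
from collections import Counter

def consolidate(annotations):
    """Return (label, agreement) for one item given several labelers' answers."""
    label, votes = Counter(annotations).most_common(1)[0]
    return label, votes / len(annotations)

print(consolidate(["cat", "cat", "dog"]))  # 'cat' with agreement 2/3
print(consolidate(["dog", "dog", "dog"]))  # ('dog', 1.0)
```

The agreement score is a cheap signal for finding ambiguous items: low agreement usually points to an edge case or unclear labeling instructions.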
Quick Reference: Feature Type → Transformation
| Feature Type | Raw Form | Transformation | Why |
|---|---|---|---|
| Right-skewed numerical | Income, prices | Log transform | Compress range, normalize |
| Normally distributed numerical | Age, height | Standardization | Zero mean, unit variance |
| Bounded numerical (0-100) | Score, % | Min-Max or none | Scale to [0,1] if needed |
| Ordinal categorical | Low/Med/High | Label encoding | Preserves order |
| Nominal categorical (low cardinality) | Color, country (small set) | One-hot encoding | No false ordering |
| Nominal categorical (high cardinality) | City, product ID | Target or Binary encoding | Avoid column explosion |
| Free text | Reviews, descriptions | TF-IDF or embeddings | Extract semantic signal |
| Timestamp | 2026-01-15 | Date decomposition + lags | Give model temporal hooks |
| Imbalanced target | 99% class A | SMOTE + class weights | Prevent majority-class bias |
| Correlated features (many) | Sensor readings | PCA | Reduce redundancy, compress |
| Sparse features | Document term counts | Dimensionality reduction | Handle sparsity |
Behind the Scenes: Why These Transformations Work
The fundamental principle:
Raw data → Model = model learns the representation
(slow, may fail)
Raw data → Feature engineering → Model = engineer provides the representation
(faster, often better)
Feature engineering = domain knowledge encoded into the data
so the model doesn't have to discover it from scratch.
The best features are ones that make the relationship
between input and output as simple as possible.
If the true relationship is $y \propto \log(x)$ and you feed the model $x$, it must learn the logarithm. If you feed it $\log(x)$, the relationship becomes linear and trivially easy to learn. This is the fundamental insight behind all feature engineering.