Domain 2B: Feature Engineering & Data Preparation

Feature Engineering & Data Preparation

Exam Domain: 2, Exploratory Data Analysis (24%)

Task: Transform raw data into features that algorithms can learn from

Why Feature Engineering is the Most Important Skill

ELI5: Features are the language your model speaks. Raw data is like speaking French to an English speaker. A model can’t understand that “2026-01-15” is a winter weekday, or that “Dr. John Smith III” is a high-income individual, unless you translate those signals. Feature engineering does that translation.

The hierarchy of impact in ML projects:

Features > Algorithm choice > Hyperparameter tuning

A mediocre algorithm with great features beats
a great algorithm with mediocre features.
Every time.

“Garbage in, garbage out” understates the problem. Even clean raw data often produces garbage output because the model lacks the right representation. Raw sensor readings, raw timestamps, and raw text contain signal — but it’s buried in a form the model can’t decode.

Why this matters for the exam: MLS-C01 questions frequently ask “which preprocessing step should you apply given this problem?” or “why is model performance poor despite good data?” The answer is almost always a feature engineering issue: wrong encoding, lack of scaling, unhandled skew, or class imbalance.

Numerical Feature Transformations

Scaling and Normalization

ELI5: Imagine mixing flour measured in grams (0–500) with water measured in liters (0–5). Without scaling, the model thinks flour varies 100x more than water, just because the numbers are bigger — not because flour is 100x more important. Scaling puts everything on the same playing field.

Min-Max Scaling (Normalization)

$$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$$

Output range: [0, 1]
Preserves the shape of the original distribution
Sensitive to outliers (one extreme value compresses all others near zero)

Standardization (Z-score Scaling)

$$x' = \frac{x - \mu}{\sigma}$$

Output: mean = 0, std = 1 (no fixed range)
Less sensitive to outliers than Min-Max, though outliers still shift the mean and inflate the std
Required when algorithm assumes zero-mean inputs (PCA, SVM)

Robust Scaling

$$x' = \frac{x - \text{median}}{IQR}$$

Uses median and IQR instead of mean and std
Most outlier-resistant scaling method
Best choice when data has significant outliers
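The three scalers above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function names and the `incomes` list are made up, and the quartile convention (`method="inclusive"`) is an assumption, since libraries differ on how they compute quartiles.

```python
import statistics

def min_max_scale(values):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def robust_scale(values):
    """Subtract the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

# Hypothetical feature with one extreme outlier (the last value).
incomes = [30, 35, 40, 45, 50, 1000]

mm = min_max_scale(incomes)
zs = standardize(incomes)
rb = robust_scale(incomes)
```

With this data, Min-Max squeezes the five ordinary values into the bottom 3% of the range, while robust scaling keeps them spread between -1 and 0.6.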

When to Use Which

Scaling Method	Use When	Avoid When
Min-Max	Need values in [0,1]; neural networks with sigmoid output	Outliers present
Standardization	PCA, SVM, neural networks; Gaussian assumption needed	Heavy-tailed data
Robust Scaling	Outliers present and can’t be removed	All values must be in [0,1]
No scaling	Tree-based algorithms (XGBoost, Random Forest, Decision Trees)	—

WHY algorithms need scaling:

Gradient descent:
  Features with large values → large gradients → unstable convergence
  Features with small values → tiny gradients → slow learning
  Scaling → gradients are comparable → faster, stable convergence

Distance-based (KNN, SVM, K-Means):
  Without scaling: distance dominated by largest-range feature
  Income (20,000-200,000 USD) vs Age (0-100 yrs) → income dominates KNN
  Scaling → all features contribute equally to distance

Regularization (Ridge, Lasso):
  Penalty applied equally to all coefficients
  Without scaling: coefficient size depends on feature units, so small-range features (which need large coefficients) are over-penalized
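To make the distance-based case concrete, here is a small sketch with hypothetical (income, age) rows. The data and helper names are assumptions chosen for illustration; the point is only that the nearest neighbor flips once both columns are put on the same scale.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_columns(rows):
    """Min-max scale each column of a list of equal-length rows."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [[(v - lo) / span for v, lo, span in zip(row, lows, spans)]
            for row in rows]

# Hypothetical rows: (annual income in dollars, age in years).
query = (50_000, 25)
income_diff = (52_000, 25)   # same age, 2,000 more income
age_diff = (50_000, 65)      # same income, 40 years older
extremes = [(20_000, 18), (200_000, 90)]  # only here to fix the column ranges

# Raw distances: the income column dominates purely because of its units.
raw_d_income = euclidean(query, income_diff)
raw_d_age = euclidean(query, age_diff)

scaled = min_max_columns([query, income_diff, age_diff, *extremes])
scaled_d_income = euclidean(scaled[0], scaled[1])
scaled_d_age = euclidean(scaled[0], scaled[2])
```

On raw values the person 40 years older looks like the closer neighbor; after scaling, the person with a small income difference is closer.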

Which algorithms need scaling:

Need Scaling	Don’t Need Scaling
SVM	Decision Trees
KNN	Random Forest
K-Means	XGBoost / GBM
Neural Networks	LightGBM
PCA	CatBoost
Linear/Logistic Regression (with regularization)	Naive Bayes

Log Transform

$$x' = \log(x + 1)$$

(+1 to handle zeros)

Compresses large values, expands small values
Converts right-skewed → approximately normal
Essential for: income, population, website traffic, any “hockey stick” data

Before log transform:          After log transform:
  *                               *
  *                             ***
  **                           *****
  ***                         *******
  *************               *********
───────────────              ──────────────
Right-skewed                 More symmetric

Behind the scenes: Log transform works because many natural phenomena are multiplicative, not additive. A 10% raise means different absolute dollars depending on your salary. Log transform converts “multiply by 1.1” → “add 0.095” — turning multiplicative relationships linear.
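The multiplicative-to-additive claim is easy to check numerically. The sketch below uses made-up salaries; only `math.log` and `math.log1p` from the standard library are involved.

```python
import math

salaries = [40_000, 80_000, 160_000]        # hypothetical values
raised = [s * 1.1 for s in salaries]        # everyone gets a 10% raise

# In raw dollars the raise is different for every salary...
absolute_gain = [r - s for r, s in zip(raised, salaries)]
# ...but in log space it is the same constant, log(1.1), for all of them.
log_gain = [math.log(r) - math.log(s) for r, s in zip(raised, salaries)]

# math.log1p(x) computes log(1 + x), which is defined at x = 0.
counts = [0, 9, 99, 999]
log_counts = [math.log1p(c) for c in counts]
```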

Box-Cox Transform

Generalization of the log transform — finds the optimal power $\lambda$ automatically:

$$x' = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \lambda \neq 0 \\ \ln(x) & \lambda = 0 \end{cases}$$

When $\lambda = 0$: reduces to log transform
When $\lambda = 1$: linear (only shifts by 1), so the distribution shape is unchanged
When $\lambda = 0.5$: square root transform
Only works for positive values (use Yeo-Johnson for zero/negative values)
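A minimal sketch of the Box-Cox formula for a given $\lambda$. It only applies the transform; it does not estimate $\lambda$, which is typically done by maximum likelihood (for example, `scipy.stats.boxcox` returns a fitted value when no lambda is passed). The function name is illustrative.

```python
import math

def box_cox(x, lam):
    """Apply the Box-Cox transform to one strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive inputs")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

shift_only = box_cox(5, 1)        # lambda = 1: x - 1
sqrt_like = box_cox(9, 0.5)       # lambda = 0.5: 2 * (sqrt(x) - 1)
log_case = box_cox(math.e, 0)     # lambda = 0: ln(x)
near_zero = box_cox(5, 1e-6)      # small lambda approaches ln(x)
```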

Binning / Discretization

Converting continuous values into categorical bins.

Method	How	When
Equal-width	Divide range into equal intervals	Uniform distribution
Equal-frequency	Each bin has same count	Skewed data
Domain-based	Expert-defined boundaries	Age: child/teen/adult/senior
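The difference between equal-width and equal-frequency binning shows up clearly on skewed data. The sketch below is a plain-Python illustration with a made-up `skewed` list; note that this simple equal-frequency version can split tied values across bins.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bin indices so each bin holds about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

skewed = [1, 2, 3, 4, 5, 6, 7, 100]   # hypothetical right-skewed feature
width_bins = equal_width_bins(skewed, 4)
freq_bins = equal_frequency_bins(skewed, 4)
```

Equal-width puts seven of the eight values in bin 0 and leaves two bins empty, while equal-frequency spreads them two per bin.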

When binning helps:

Reduces noise in a continuous feature that has noisy measurements
Captures non-linear relationship (income → bracket → behavior)
When domain knowledge defines natural thresholds (temperature → cold/warm/hot)

When binning hurts:

Loses ordinal information within bins
Creates arbitrary boundary effects (age 29 vs 30 treated very differently)
Information loss — generally reduces model performance unless relationship is truly categorical

Polynomial Features

Create new features by combining existing ones:

$$x_1, x_2 \rightarrow x_1, x_2, x_1^2, x_2^2, x_1 x_2$$

Enables linear models to capture non-linear relationships
Degree 2 expansion of $n$ original features yields $\frac{n(n+1)}{2} + n$ total features (originals, squares, and pairwise products, excluding the bias term)

Warning: feature explosion (counts exclude the bias term):

10 features, degree 2 → 65 features
100 features, degree 2 → 5,150 features
100 features, degree 3 → 176,850 features

This triggers the curse of dimensionality. Use sparingly and with strong domain justification.
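The growth can be computed directly. The sketch below counts monomials of degree 1 through `degree` (so the constant bias column is excluded; including it adds one to each count) and builds the degree-2 expansion for a single row. Function names are illustrative.

```python
from itertools import combinations_with_replacement
from math import comb

def n_poly_features(n, degree):
    """Number of monomials of degree 1..degree in n variables (no bias term)."""
    return comb(n + degree, degree) - 1

def degree2_features(row):
    """Return the original values followed by all squares and pairwise products."""
    pairs = combinations_with_replacement(range(len(row)), 2)
    return list(row) + [row[i] * row[j] for i, j in pairs]

expanded = degree2_features([2, 3])   # x1, x2, x1^2, x1*x2, x2^2
```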

Categorical Feature Encoding

ELI5: Telling a model that cat=1, dog=2, fish=3 implies dog is “between” cat and fish, and fish is “three times” something cat is. That’s nonsense. One-hot encoding says “these are completely separate things with no mathematical relationship” — it creates a binary column for each category.

Label Encoding

Categories mapped to integers: {low=0, medium=1, high=2}

DANGER for nominal categories: implies ordering that doesn’t exist
Safe for ordinal categories: size (S < M < L < XL), rating (1-5 stars)
Safe for tree-based algorithms: trees split on thresholds, so arbitrary numbers don’t imply false order in practice

One-Hot Encoding (OHE)

Creates one binary column per category value.

Color: [Red, Blue, Green]
↓
color_Red  color_Blue  color_Green
    1          0            0       (was Red)
    0          1            0       (was Blue)
    0          0            1       (was Green)

Dummy variable trap: With k categories and k columns, the last column is perfectly predictable from the other k-1 columns → multicollinearity. Fix: drop one column (use k-1 columns).
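A small sketch of one-hot encoding with and without dropping a column. The `one_hot` helper and its `drop_first` flag are hypothetical names for this illustration; categories are sorted alphabetically, so the dropped one is simply the first in that order.

```python
def one_hot(values, prefix, drop_first=False):
    """Return (column_names, rows) for a one-hot encoding of `values`."""
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]  # the dropped category becomes the all-zeros row
    names = [f"{prefix}_{c}" for c in categories]
    rows = [[int(v == c) for c in categories] for v in values]
    return names, rows

colors = ["Red", "Blue", "Green"]
full_names, full_rows = one_hot(colors, "color")
k1_names, k1_rows = one_hot(colors, "color", drop_first=True)
```

In the full encoding every row sums to 1, which is exactly the linear dependence behind the dummy variable trap. With k-1 columns, "Blue" is represented by `[0, 0]`.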

When OHE fails:

High-cardinality features (city names, user IDs, product SKUs) → thousands of sparse columns
Memory-intensive, can overfit, slows training dramatically

Target Encoding

Replace each category with the mean of the target variable for that category:

City | Mean_House_Price → encoding
NYC  |    $1.2M        → 1,200,000
LA   |    $800K        → 800,000
Houston | $350K        → 350,000

Handles high cardinality elegantly
Risk: target leakage — you’re using the target to create features
Fix: use cross-validated target encoding — encode fold $k$ using mean from all other folds
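Cross-validated (out-of-fold) target encoding can be sketched as follows. This is a simplified illustration: folds are assigned by row index rather than shuffled, there is no smoothing, and the fallback for unseen categories (the mean of the other folds) is one reasonable choice among several. The data is made up.

```python
from collections import defaultdict

def oof_target_encode(categories, targets, n_folds):
    """Encode each row with the category mean computed from the other folds only."""
    n = len(categories)
    fold_of = [i % n_folds for i in range(n)]   # simple deterministic fold assignment
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        total, total_n = 0.0, 0
        for i in range(n):
            if fold_of[i] != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                total_n += 1
        for i in range(n):
            if fold_of[i] == fold:
                c = categories[i]
                # Categories unseen in the other folds fall back to their overall mean.
                encoded[i] = sums[c] / counts[c] if counts[c] else total / total_n
    return encoded

cities = ["NYC", "NYC", "LA", "LA", "NYC", "LA"]
prices = [1200, 1300, 800, 700, 1100, 900]   # hypothetical prices in $K
encoded = oof_target_encode(cities, prices, n_folds=3)
```

No row's encoding uses its own target: the fourth row (LA, 700) is encoded as 850, the mean of the LA rows in the other folds, rather than the naive LA mean of 800.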

Binary Encoding

Compromise between label and one-hot:

Assign each category an integer
Convert integer to binary
Each bit becomes a column

Example: 5 categories
One-hot: 5 columns → Binary: only 3 columns (bits for codes 0-4)
For 1000 categories: OHE = 1000 cols, Binary = 10 cols
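The three steps of binary encoding can be sketched like this. The helper names are illustrative, and integer codes are assigned in sorted order starting at 0, which is an assumption (some implementations start at 1).

```python
def n_binary_columns(n_categories):
    """Bits needed to represent the codes 0 .. n_categories - 1."""
    return max(1, (n_categories - 1).bit_length())

def binary_encode(values):
    """Map each category to an integer, then spread its bits across columns."""
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    n_bits = n_binary_columns(len(index))
    # Most significant bit first.
    return [[(index[v] >> bit) & 1 for bit in reversed(range(n_bits))]
            for v in values]

encoded = binary_encode(["a", "b", "c", "d", "e"])
```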

Embeddings

Learned dense vector representations for high-cardinality categorical features.

Neural network learns a dense representation (e.g., 50-dimensional vector per city)
Captures semantic similarity: “Paris” and “London” will have similar embeddings
Essential for: user IDs, product IDs, text in deep learning
SageMaker: Entity embedding layers in custom neural networks

Encoding Decision Guide

Feature Type	Cardinality	Recommended Encoding
Ordinal (ordered)	Any	Label encoding
Nominal	Low (< 10)	One-hot encoding
Nominal	Medium (10–50)	Binary or Target encoding
Nominal	High (> 50)	Target encoding or Embeddings
Free-form text	—	TF-IDF, Word embeddings
Tree-based algorithms	Any	Label encoding is fine

Text Feature Engineering

Bag of Words (BoW)

Represent text as a vector of word counts, ignoring order.

Sentence 1: "the cat sat on the mat"  → {the:2, cat:1, sat:1, on:1, mat:1}
Sentence 2: "the dog sat on the rug"  → {the:2, dog:1, sat:1, on:1, rug:1}

Simple, interpretable
Loses word order and context (“dog bites man” and “man bites dog” get identical vectors)
Creates very sparse, high-dimensional vectors

TF-IDF

Term Frequency × Inverse Document Frequency

$$\text{TF-IDF}(t, d) = \underbrace{\frac{\text{count}(t,d)}{\text{total words in }d}}_{\text{TF}} \times \underbrace{\log\frac{N}{\text{docs containing }t}}_{\text{IDF}}$$

ELI5: “The” appears in every document — it’s useless for classification, so IDF gives it a very low weight. “Quantum” appears in only 3 out of 10,000 documents — it’s highly discriminative, so IDF gives it a high weight. TF-IDF automatically downweights common words and upweights rare, informative words.

TF: how often does a term appear in this document?
IDF: how rare is this term across all documents?
High TF-IDF: term appears often in this doc but rarely elsewhere → highly characteristic
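The formula can be implemented literally in a few lines. This sketch uses whitespace tokenization and the unsmoothed IDF exactly as written above; library implementations (for example scikit-learn's `TfidfVectorizer`) use a smoothed variant by default, so their numbers will differ. The third document is made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with TF = count / length
    and IDF = log(N / number of documents containing the term)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat on the mat", "the dog sat on the rug", "the quantum cat"]
weights = tf_idf(docs)
```

"the" appears in every document, so its IDF is log(1) = 0 and its weight vanishes; "quantum" appears in one document and gets the largest weight there.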

Word Embeddings

Dense vector representations learned from large text corpora.

Traditional (BoW/TF-IDF):
  "king"  → [0, 0, 1, 0, 0, ..., 0]   (sparse, 50,000 dims)

Word2Vec/GloVe:
  "king"  → [0.2, -0.4, 0.8, 0.1, ...]  (dense, 100-300 dims)
  "queen" → [0.1, -0.5, 0.7, 0.3, ...]  (similar to king)

king - man + woman ≈ queen  (vector arithmetic captures semantics)

Why embeddings beat BoW:

Capture semantic similarity (king/queen are neighbors in embedding space)
Dense → much smaller than sparse one-hot
Pre-trained embeddings transfer knowledge from billions of words

N-grams

Capture sequences of N consecutive words:

Unigram (N=1): "not", "good"
Bigram (N=2):  "not good"   ← captures negation
Trigram (N=3): "not very good"

Critical for sentiment analysis where word order changes meaning.
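Extracting n-grams from a token list is a one-line sliding window. A minimal sketch, with a made-up sentence:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the movie was not very good".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```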

Text Preprocessing Pipeline

Raw text
   ↓ Lowercasing
   ↓ Remove punctuation / HTML
   ↓ Tokenization (split into words/subwords)
   ↓ Stop word removal ("the", "is", "at")
   ↓ Stemming (running → run) OR Lemmatization (better → good)
   ↓ Vectorization (BoW / TF-IDF / Embeddings)
   ↓ ML-ready features

Stemming vs Lemmatization:

Stemming: crude chopping (studies → studi) — fast, imprecise
Lemmatization: proper linguistic reduction (better → good) — slower, more accurate

BlazingText in SageMaker

SageMaker’s built-in algorithm for:

Word2Vec training at high speed (supports GPU acceleration)
Text classification (supervised mode)

Exam tip: BlazingText is the fastest way to train Word2Vec or text classifiers on AWS. If a question involves word embeddings or text classification at scale, BlazingText is the answer.

Time Series Feature Engineering

ELI5: A model sees numbers, not dates. When you feed it “2026-01-15,” it just sees a number. It doesn’t know that’s a Thursday in winter, or that Thursdays tend to have higher sales, or that January is the slow season. Feature engineering unpacks the date into signals the model can actually learn from.

Lag Features

Use previous values as features:

$$\text{features}: y_{t-1}, y_{t-2}, y_{t-7}, y_{t-28}$$

$y_{t-1}$: yesterday’s value (captures autocorrelation)
$y_{t-7}$: same day last week (captures weekly seasonality)
$y_{t-28}$: same weekday four weeks ago (approximates monthly seasonality)

Exam tip: Lag features are how you convert a time series problem into a standard supervised learning problem. The “labels” are $y_t$, the features include lagged $y$ values.

Rolling Statistics

Rolling mean (window=7):  average of last 7 days
Rolling std  (window=7):  volatility of last 7 days
Rolling min/max:          recent range
Exponential weighted mean: recent values weighted more heavily

Captures local trends and volatility that single lag features miss.
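Lag and rolling features can be built together into a supervised-learning table. The sketch below is illustrative (the function name, row layout, and `sales` data are assumptions); the rolling window deliberately covers only values strictly before time t, so the target never leaks into its own features.

```python
import statistics

def lag_and_rolling_rows(series, lags, window):
    """Build one feature row per time step that has a full history available."""
    start = max(max(lags), window)
    rows = []
    for t in range(start, len(series)):
        history = series[t - window:t]          # strictly before t, so no leakage
        row = {f"lag_{k}": series[t - k] for k in lags}
        row["roll_mean"] = statistics.fmean(history)
        row["roll_std"] = statistics.pstdev(history)
        row["target"] = series[t]
        rows.append(row)
    return rows

sales = [10, 12, 11, 13, 15, 14, 16, 18]      # hypothetical daily sales
rows = lag_and_rolling_rows(sales, lags=[1, 2], window=3)
```

The first three time steps are dropped because they lack a full window of history, which is the usual cost of lag features.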

Date Decomposition Features

Raw: "2026-01-15 14:35:00"
↓
year:         2026
month:        1
day_of_month: 15
day_of_week:  3  (Thursday)
hour:         14
is_weekend:   0
is_holiday:   0
quarter:      1
week_of_year: 3

These features let the model discover seasonal patterns without knowing what “Thursday” or “January” means.
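The decomposition above maps directly onto the standard library's `datetime`. In this sketch the `holidays` argument is a hypothetical set of dates supplied by the caller, weekdays are numbered with Monday = 0, and the week number follows ISO 8601.

```python
from datetime import datetime

def date_features(timestamp, holidays=frozenset()):
    """Unpack a 'YYYY-MM-DD HH:MM:SS' string into calendar features."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "year": dt.year,
        "month": dt.month,
        "day_of_month": dt.day,
        "day_of_week": dt.weekday(),            # Monday = 0 ... Sunday = 6
        "hour": dt.hour,
        "is_weekend": int(dt.weekday() >= 5),
        "is_holiday": int(dt.date() in holidays),
        "quarter": (dt.month - 1) // 3 + 1,
        "week_of_year": dt.isocalendar()[1],    # ISO 8601 week number
    }

features = date_features("2026-01-15 14:35:00")
```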

Seasonality and Differencing

Seasonality extraction: Use Fourier terms (sine/cosine) to capture cyclical patterns:

$$\sin\left(\frac{2\pi t}{P}\right), \cos\left(\frac{2\pi t}{P}\right)$$

where $P$ is the period (e.g., 7 for weekly, 365 for yearly).

Differencing for stationarity:

$$y'_t = y_t - y_{t-1}$$

First-order differencing removes a linear trend. Seasonal differencing ($y_t - y_{t-P}$) removes seasonality. ARIMA-family models assume a stationary series; the integrated ("I") part of ARIMA applies this differencing for you through the order $d$.
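Both ideas fit in a few lines. The sketch below uses synthetic series (a pure linear trend and a pattern repeating every 7 steps) to show what each kind of differencing removes; the function names are illustrative.

```python
import math

def fourier_features(t, period):
    """Sine/cosine pair that repeats every `period` time steps."""
    angle = 2 * math.pi * t / period
    return math.sin(angle), math.cos(angle)

def difference(series, lag=1):
    """Return y_t - y_{t-lag} for every t that has a value lag steps back."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

trend = [3 + 2 * t for t in range(6)]     # linear trend: 3, 5, 7, ...
weekly = [t % 7 for t in range(21)]       # pattern that repeats every 7 steps

detrended = difference(trend)             # constant slope remains
deseasonalized = difference(weekly, lag=7)
```

Unlike a raw day-of-week integer, the sine/cosine pair places day 6 next to day 0, so the model sees the cycle as continuous.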

Handling Class Imbalance

Why Imbalance is a Problem

Dataset: 99% "not fraud," 1% "fraud"

Naive model: predict "not fraud" for everything
Accuracy: 99%   ← looks great!
Recall for fraud: 0%  ← completely useless

Accuracy is a misleading metric on imbalanced datasets. Use precision, recall, F1, AUC-ROC instead.

Oversampling

Increase the minority class to balance the dataset.

SMOTE (Synthetic Minority Over-sampling Technique):

For each minority sample, find its k nearest minority neighbors
Create synthetic samples by interpolating along the line between them

$$x_{new} = x_i + \lambda \times (x_{neighbor} - x_i), \quad \lambda \in [0,1]$$

ELI5: SMOTE creates new fake fraud examples by “blending” real fraud examples — like a photo-morphing app that creates a new face halfway between two existing faces. Instead of just duplicating fraud examples (which overfits), it creates plausible new variations.
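The interpolation step can be sketched in plain Python. This is a simplified illustration of the idea, not a full SMOTE implementation (libraries such as imbalanced-learn handle sampling ratios and edge cases); the `fraud` points and the fixed random seed are assumptions for the example.

```python
import math
import random

def smote_sample(minority, k, rng):
    """Create one synthetic point between a random minority sample and
    one of its k nearest minority neighbors."""
    x_i = rng.choice(minority)
    others = [p for p in minority if p is not x_i]
    neighbors = sorted(others, key=lambda p: math.dist(x_i, p))[:k]
    x_nb = rng.choice(neighbors)
    lam = rng.random()                     # lambda drawn from [0, 1)
    return [a + lam * (b - a) for a, b in zip(x_i, x_nb)]

rng = random.Random(0)
fraud = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]   # hypothetical minority points
synthetic = [smote_sample(fraud, k=2, rng=rng) for _ in range(5)]
```

Every synthetic point lies on a segment between two real minority points, so it stays inside the region the minority class already occupies.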

Undersampling

Remove majority class samples to balance the dataset.

Random undersampling: randomly remove majority class examples
Tomek links: remove majority samples that are too close to minority class (cleaner decision boundary)
Best when majority class is very large and losing some data is acceptable

Class Weights

Tell the model that minority class mistakes cost more:

$$\text{class weight} = \frac{n_{\text{samples}}}{n_{\text{classes}} \times n_{\text{samples in class}}}$$

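In scikit-learn estimators, pass the class_weight parameter. In SageMaker's built-in Linear Learner, use positive_example_weight_mult (binary) or balance_multiclass_weights (multiclass). In XGBoost, use scale_pos_weight. In neural networks, use weighted loss functions.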

Behind the scenes: Weighted loss penalizes misclassifying the minority class more heavily. The model learns that getting a fraud wrong is worse than getting a non-fraud wrong — without changing the data at all.
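Applying the class-weight formula to the 99%/1% fraud example gives a feel for the numbers. The helper name is illustrative; the negative-to-positive ratio at the end is the value commonly suggested as a starting point for XGBoost's `scale_pos_weight`.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight = n_samples / (n_classes * n_samples_in_class) for each class."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

labels = ["not_fraud"] * 99 + ["fraud"] * 1
weights = balanced_class_weights(labels)

# Ratio of negatives to positives.
neg_pos_ratio = labels.count("not_fraud") / labels.count("fraud")
```

With these weights each class contributes the same total weight to the loss (99 × 0.505 ≈ 1 × 50).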

Threshold Adjustment

Default classification threshold is 0.5. For imbalanced problems, lower threshold → catch more positives (higher recall, lower precision).

Choose threshold by optimizing the metric that matters for your use case (F1, precision at fixed recall, etc.) using the precision-recall curve.
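A minimal sketch of choosing a threshold by maximizing F1. The labels and scores are made up, and in practice the search should run on a held-out validation set rather than the training data.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Classify as positive when score >= threshold, then compute the metrics."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, y_true) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def best_f1_threshold(y_true, scores):
    """Try every observed score as a threshold and keep the best F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda th: precision_recall_f1(y_true, scores, th)[2])

y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.80]

at_default = precision_recall_f1(y_true, scores, 0.5)
best = best_f1_threshold(y_true, scores)
at_best = precision_recall_f1(y_true, scores, best)
```

Here the default 0.5 threshold finds only one of three positives; lowering it to 0.35 recovers all three at the cost of one false positive.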

Technique Selection Guide

Situation	Recommended Technique
Enough data, minority class rare (< 5%)	SMOTE + class weights
Very large dataset, minority rare	Undersampling + class weights
Small dataset	SMOTE only (no undersampling)
Tree-based algorithm	Class weights (built-in support)
Neural network	Weighted loss function
Can’t modify training process	Threshold adjustment at inference

Dimensionality Reduction as Feature Engineering

Curse of Dimensionality

ELI5: In 2D, you can cover a grid with 100 dots. In 3D, you need 1,000. In 10D, you need 10 billion. More dimensions = more empty space = every data point is “far” from every other data point = distance-based algorithms break down. This is the curse of dimensionality.

As dimensions increase:

Volume of feature space grows exponentially
Data becomes sparse — even large datasets are insufficient
Distance metrics become less meaningful
Models require exponentially more data to generalize

PCA: Principal Component Analysis

First principles: Find the directions in feature space where the data varies the most. Project data onto these directions. Discard the low-variance directions (noise).

Original 2D data:                PCA projection to 1D:
                                         PC1
   *  *                         * * * * * * * * * *
  * * *  *          →           ─────────────────────
   *  * *                       Direction of maximum
  *   *                         variance preserved

How it works (behind the scenes):

Standardize the data (mean=0, std=1)
Compute the covariance matrix $\Sigma$
Compute eigenvectors and eigenvalues of $\Sigma$
Sort eigenvectors by eigenvalue (descending)
Project data onto top $k$ eigenvectors

Each principal component is a linear combination of original features:

$$PC_1 = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n$$
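For exactly two features the steps above have a closed form, which makes a compact sketch possible without a linear-algebra library. After standardization the covariance matrix is [[1, r], [r, 1]], whose eigenvalues are 1 + r and 1 - r. The function names and the height/weight data are assumptions; real workloads would use a general eigen-solver or a PCA library.

```python
import math
import statistics

def standardize(values):
    mu, sigma = statistics.fmean(values), statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def pca_2d(x1, x2):
    """PCA for two features via the eigen-decomposition of [[1, r], [r, 1]]."""
    z1, z2 = standardize(x1), standardize(x2)
    # Covariance of standardized features equals their correlation r.
    r = sum(a * b for a, b in zip(z1, z2)) / len(z1)
    eigenvalues = sorted([1 + r, 1 - r], reverse=True)
    # Top eigenvector: (1, 1)/sqrt(2) if r >= 0, else (1, -1)/sqrt(2).
    sign = 1.0 if r >= 0 else -1.0
    w = (1 / math.sqrt(2), sign / math.sqrt(2))
    pc1 = [w[0] * a + w[1] * b for a, b in zip(z1, z2)]
    explained = [ev / sum(eigenvalues) for ev in eigenvalues]
    return pc1, explained

height_cm = [150, 160, 170, 180, 190]
weight_kg = [52, 58, 69, 77, 88]      # hypothetical, strongly correlated with height
pc1, explained = pca_2d(height_cm, weight_kg)
```

Because the two features are almost perfectly correlated, the first component alone explains nearly all of the variance, so the second can be dropped with little loss.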

Choosing the number of components:

Explained Variance Ratio:
Component 1: 45%
Component 2: 30%
Component 3: 15%  ← elbow
Component 4: 5%
Component 5: 3%
...
Keep enough components to explain 95% of variance (components 1-4 here: 45 + 30 + 15 + 5 = 95%; stopping at the elbow keeps 3 components and 90%)

When to use PCA:

Scenario	Use PCA?
Before KNN or SVM (curse of dimensionality)	Yes
Visualization (reduce to 2D/3D)	Yes
Remove noise from correlated features	Yes
Tree-based algorithms	Generally no (trees handle high-dim fine)
Features must be interpretable	No (PCs are linear combinations, hard to explain)
Before sparse regression (Lasso)	No (Lasso does its own selection)

Exam tip: PCA requires standardization first. If a question shows PCA applied to unstandardized data, that’s the bug. Large-range features will dominate the principal components.

t-SNE

t-distributed Stochastic Neighbor Embedding — non-linear dimensionality reduction for visualization only.

Projects high-dimensional data to 2D or 3D
Preserves local structure (nearby points stay nearby)
Does NOT preserve global structure or distances
Not suitable as a preprocessing step for ML — only for visual exploration

Feature Selection vs Feature Extraction

Feature Selection:                Feature Extraction:
Select a subset of original       Create new features from originals
features as-is

Original features preserved       New features (like PCs) created
Interpretable                     May not be interpretable
                                  (PCA, autoencoders)

Feature selection methods:

Category	Method	How
Filter	Correlation, Chi-squared, Mutual Information	Score features independently
Wrapper	Recursive Feature Elimination (RFE)	Train model, remove weakest features, repeat
Embedded	L1 regularization (Lasso), tree feature importance	Feature selection built into training

ELI5 on Lasso (L1): Lasso adds a penalty proportional to the absolute value of each coefficient. The optimizer finds that setting some coefficients exactly to zero reduces total cost. Those features are effectively removed. It’s automatic feature selection baked into the model.
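Why L1 produces exact zeros can be seen in a simplified setting. Assuming orthonormal features, the Lasso solution reduces to soft-thresholding each least-squares coefficient, while an L2 (ridge) penalty only rescales them; the exact threshold and scale depend on how the penalty is parameterized. The coefficients and penalty strength below are made up.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

ols_coefficients = [3.0, -0.4, 0.1, -2.5]     # hypothetical unpenalized estimates
lam = 0.5                                     # hypothetical penalty strength

lasso_like = [soft_threshold(b, lam) for b in ols_coefficients]
ridge_like = [b / (1 + lam) for b in ols_coefficients]
```

The two small coefficients are set exactly to zero under L1, whereas the L2 version shrinks every coefficient but removes none.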

SageMaker Feature Store

What it is: A fully managed, centralized repository for ML features — both for training and real-time inference.

Feature Engineering Pipeline
           ↓
    ┌──────────────────────────┐
    │     Feature Store        │
    │                          │
    │  ┌───────────────────┐   │
    │  │   Online Store    │   │ ← DynamoDB-backed
    │  │  (low latency)    │   │   millisecond reads
    │  └───────────────────┘   │
    │  ┌───────────────────┐   │
    │  │  Offline Store    │   │ ← S3-backed
    │  │  (S3 + Glue)      │   │   for training
    │  └───────────────────┘   │
    └──────────────────────────┘
           ↓            ↓
      Training      Real-time
       Jobs         Inference

Why Feature Store Exists

ELI5: Without Feature Store, your training pipeline computes customer age as of January 1st (training date), but your serving pipeline computes it as of today. The model was trained on “age at training time” but served with “current age” — a subtle mismatch that’s nearly impossible to debug in production. Feature Store timestamps features and ensures both pipelines use the same computation.

This is called training-serving skew — the silent killer of ML systems in production.

Feature Store solves:

Training-serving skew (same features for training and inference)
Feature reuse (compute once, use in many models)
Feature discovery (catalog of available features for the team)
Point-in-time correctness (get features as they existed at any past timestamp)
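Point-in-time correctness is easier to see in code. The class below is a toy in-memory stand-in written for this illustration; it is not the SageMaker Feature Store API, and every name in it is hypothetical. It keeps an append-only history per record, so a training job can ask for features as of a past timestamp while inference reads the latest value.

```python
from bisect import bisect_right

class TinyFeatureLog:
    """Toy append-only history of (event_time, value) per record identifier."""

    def __init__(self):
        self._history = {}   # record_id -> list of (event_time, value), kept sorted

    def put(self, record_id, event_time, value):
        rows = self._history.setdefault(record_id, [])
        rows.append((event_time, value))
        rows.sort(key=lambda row: row[0])

    def as_of(self, record_id, timestamp):
        """Latest value with event_time <= timestamp, or None (offline-style read)."""
        rows = self._history.get(record_id, [])
        times = [t for t, _ in rows]
        i = bisect_right(times, timestamp)
        return rows[i - 1][1] if i else None

    def latest(self, record_id):
        """Most recent value (online-style read)."""
        rows = self._history.get(record_id, [])
        return rows[-1][1] if rows else None

log = TinyFeatureLog()
# ISO-8601 date strings sort chronologically, so plain string comparison works.
log.put("customer_42", "2025-01-01", {"orders_30d": 2})
log.put("customer_42", "2025-06-01", {"orders_30d": 9})

training_view = log.as_of("customer_42", "2025-03-15")
serving_view = log.latest("customer_42")
```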

Key Concepts

Concept	Meaning
Feature Group	Logical collection of related features (e.g., “customer_demographics”)
Record identifier	Primary key (e.g., customer_id)
Event time	When the feature values were computed
Online Store	Low-latency retrieval for real-time inference (<10ms)
Offline Store	S3-backed, queryable via Athena, for training jobs

Exam tip: Online store = real-time inference, low latency. Offline store = batch training. When an exam question asks about preventing training-serving skew or centralizing feature management, the answer is SageMaker Feature Store.

Amazon SageMaker Ground Truth

What it is: A fully managed data labeling service for creating high-quality training datasets.

Raw Unlabeled Data (S3)
         ↓
   Ground Truth Job
         ↓
   ┌─────────────────────────────┐
   │   Active Learning Loop      │
   │                             │
   │  ML model pre-labels data   │
   │         ↓                   │
   │  High confidence → Accept   │
   │  Low confidence  → Human    │
   │         ↓                   │
   │  Human labels verified      │
   │         ↓                   │
   │  Model retrained            │
   └─────────────────────────────┘
         ↓
   Labeled Dataset (S3)

Labeling Workforce Options

Option	Use When	Cost
Amazon Mechanical Turk	Large volume, non-sensitive, general tasks	Lowest
Private workforce	Sensitive data, domain experts needed	Depends on employees
AWS Marketplace vendors	Professional labelers, specialized domains	Medium-high

Active Learning (Human-in-the-Loop)

Small set of human-labeled examples trains initial model
Model labels remaining data with confidence scores
High-confidence labels → accepted automatically
Low-confidence labels → sent to humans for review
Human-verified labels → retrain model
Repeat until quality threshold reached

Cost reduction: AWS states that automated labeling with active learning can reduce labeling costs by up to 70% compared with labeling everything manually.

Ground Truth Task Types

Task Type	Examples
Image classification	Is this a cat or dog?
Image bounding boxes	Draw boxes around cars
Image segmentation	Pixel-level object boundaries
Text classification	Sentiment: positive/negative
Named entity recognition	Tag person/place/org in text
Video classification	Action recognition
3D point cloud	LiDAR for autonomous vehicles

Exam tip: Ground Truth + active learning = cost-efficient labeling at scale. Ground Truth Plus = a white-glove managed service where AWS handles the entire labeling workflow.

Why Labeling Quality Matters

A model is bounded by the quality of its labels. Systematic labeling errors (e.g., labelers disagree on edge cases) create noisy targets that no amount of architecture tuning can overcome. Ground Truth uses annotation consolidation (multiple labelers + majority vote) to reduce individual labeler noise.
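Majority-vote consolidation, as described above, can be sketched in a few lines. This is plain majority voting for illustration only, not Ground Truth's own consolidation logic; the function name and labels are made up. The agreement ratio is a simple signal for which items deserve another look.

```python
from collections import Counter

def consolidate(annotations):
    """Return (winning_label, agreement) for one item's list of worker labels.
    Ties go to the label that appears first in the list."""
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)

clear_case = consolidate(["cat", "cat", "cat"])
edge_case = consolidate(["cat", "dog", "cat"])
```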

Quick Reference: Feature Type → Transformation

Feature Type	Raw Form	Transformation	Why
Right-skewed numerical	Income, prices	Log transform	Compress range, normalize
Normally distributed numerical	Age, height	Standardization	Zero mean, unit variance
Bounded numerical (0-100)	Score, %	Min-Max or none	Scale to [0,1] if needed
Ordinal categorical	Low/Med/High	Label encoding	Preserves order
Nominal categorical (low cardinality)	Color, country (small set)	One-hot encoding	No false ordering
Nominal categorical (high cardinality)	City, product ID	Target or Binary encoding	Avoid column explosion
Free text	Reviews, descriptions	TF-IDF or embeddings	Extract semantic signal
Timestamp	2026-01-15	Date decomposition + lags	Give model temporal hooks
Imbalanced target	99% class A	SMOTE + class weights	Prevent majority-class bias
Correlated features (many)	Sensor readings	PCA	Reduce redundancy, compress
Sparse features	Document term counts	Dimensionality reduction	Handle sparsity

Behind the Scenes: Why These Transformations Work

The fundamental principle:

Raw data → Model = model learns the representation
              (slow, may fail)

Raw data → Feature engineering → Model = engineer provides the representation
                                  (faster, often better)

Feature engineering = domain knowledge encoded into the data
so the model doesn't have to discover it from scratch.

The best features are ones that make the relationship
between input and output as simple as possible.

If the true relationship is $y = a\log(x) + b$ and you feed the model $x$, it must learn the logarithm. If you feed it $\log(x)$, the relationship becomes linear and trivially easy to learn. (For $y \propto e^x$, the matching trick is to log-transform the target instead, since $\log y$ is linear in $x$.) This is the fundamental insight behind all feature engineering.
