Domain 1B: Data Transformation & Feature Engineering
Exam Domain 1: Data Preparation for ML (28%)
Task: Transform data, perform feature engineering, ensure data integrity
Data Processing at Scale
Amazon EMR (Elastic MapReduce)
Managed Hadoop/Spark cluster for large-scale data processing.

EMR Architecture:
┌─────────────────────────────────────────────┐
│ EMR Cluster │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Master │ │ Core │ │ Task │ │
│ │ Node │ │ Nodes │ │ Nodes │ │
│ │ (manages) │ │ (HDFS + │ │ (compute │ │
│ │ │ │ compute)│ │ only) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ Frameworks: Spark, Hive, Presto, HBase │
└─────────────────────────────────────────────┘
↕
Amazon S3 (EMRFS)
| Feature | Details |
|---|---|
| EMR on EC2 | Traditional clusters, full control |
| EMR Serverless | No cluster management, pay per job |
| EMR on EKS | Run Spark on Kubernetes |
| EMRFS | Read/write S3 as if it were HDFS |
Apache Spark on EMR
- Distributed, in-memory data processing (up to 100x faster than MapReduce for workloads that fit in memory)
- PySpark for Python-based ML data prep
- Spark MLlib for distributed ML algorithms
- Use case: TF-IDF, large-scale feature engineering, ETL
Exam tip: When you see “large-scale data processing” or “distributed transformation” — think EMR + Spark.
Feature Engineering Techniques
The Curse of Dimensionality
More features ≠ better model. Too many features leads to:
- Overfitting
- Increased training time
- Sparse data in high-dimensional space
Solution: Dimensionality reduction (PCA, feature selection)
ELI5: Imagine trying to find a specific person. In a single room (1 dimension: just their height), easy. In a building (2 dimensions: height and weight), harder but manageable. In a city described by 1,000 characteristics, nearly impossible: your data points are so spread out that “nearby” becomes meaningless. Every feature you add multiplies the space your data has to fill, so you need exponentially more data to cover it. This is why adding 50 random features to your model usually makes it worse, not better.
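A rough way to see the exponential growth described above: if every feature is cut into a fixed number of bins, the number of cells the data has to populate is the bin count raised to the number of features. The helper names and the choice of 10 bins are illustrative assumptions, not part of any AWS service.

```python
def cells_to_cover(n_features: int, bins_per_feature: int = 10) -> int:
    """Number of grid cells when every feature is cut into equal bins."""
    return bins_per_feature ** n_features


def samples_per_cell(n_samples: int, n_features: int, bins_per_feature: int = 10) -> float:
    """Average samples per cell; values far below 1 mean the data is sparse."""
    return n_samples / cells_to_cover(n_features, bins_per_feature)


if __name__ == "__main__":
    # With 100,000 samples, density collapses as features are added.
    for d in (1, 2, 3, 6):
        print(d, cells_to_cover(d), samples_per_cell(100_000, d))
```

With 100,000 samples, three features still leave about 100 samples per cell, while six features leave about one sample for every ten cells.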
Handling Missing Data
| Technique | When to Use |
|---|---|
| Drop rows | Small % missing, random pattern |
| Drop columns | Most values missing in that feature |
| Mean/Median imputation | Numerical data: mean for roughly symmetric distributions, median for skewed data or outliers |
| Mode imputation | Categorical data |
| KNN imputation | Similar samples should have similar values |
| Forward/Backward fill | Time series data |
| Indicator variable | Missingness itself is informative |
ELI5: Missing data is like a survey where some people skipped questions. You have three choices: throw away the whole survey (drop rows) — wasteful if most answers are there. Guess the missing answer based on what similar people answered (imputation) — reasonable but introduces assumptions. Add a checkbox that says “this person skipped this question” (indicator variable) — useful when the fact that someone skipped is itself meaningful, like patients who skip a pain question might be the ones in the most pain.
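The imputation, indicator-variable, and forward-fill ideas above can be sketched in plain Python. This is a minimal illustration that assumes missing values are represented as `None`; the function names are hypothetical.

```python
from statistics import median


def impute_with_indicator(values):
    """Fill None entries with the median of the observed values.

    Returns (filled_values, missing_flags); a flag of 1 marks an imputed entry.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return filled, flags


def forward_fill(values):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled
```

The median is used here rather than the mean so that a single extreme value (such as 100.0 in a column of small numbers) does not drag the fill value upward.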
Handling Imbalanced Data
Imbalanced Dataset (e.g., fraud detection):
Class 0 (normal): 99,000 samples
Class 1 (fraud): 1,000 samples ← minority class
Techniques:
┌─────────────────────────────────────────────┐
│ Oversampling: Duplicate/synthesize minority │
│ ├─ Random oversampling (duplicate) │
│ └─ SMOTE (synthesize new samples) │
│ │
│ Undersampling: Reduce majority class │
│ └─ Random undersampling │
│ │
│ Algorithm-level: │
│ ├─ Class weights (cost-sensitive learning) │
│ └─ Anomaly detection approach │
└─────────────────────────────────────────────┘
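As one concrete take on the class-weight option in the box above, a common heuristic weights each class by `n_samples / (n_classes * class_count)`, so the rare class counts for more in the loss. The sketch below applies that heuristic to the 99,000 / 1,000 split from the example; the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# The fraud example: 99,000 normal samples and 1,000 fraud samples.
labels = [0] * 99_000 + [1] * 1_000
weights = balanced_class_weights(labels)
```

Here each fraud sample gets a weight of 50 and each normal sample a weight of roughly 0.5, so both classes contribute about equally in total.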
SMOTE (Synthetic Minority Oversampling Technique):
- Creates synthetic samples between existing minority samples
- Available in SageMaker Data Wrangler as a built-in operator
ELI5: SMOTE is like creating imaginary friends that look like a blend of your existing friends, to balance out a lopsided group photo. If you only have 10 fraud examples and 10,000 normal ones, SMOTE looks at each fraud example, finds its nearest fraud neighbors, and creates new synthetic fraud examples that fall somewhere between them — not exact copies, but plausible variations. This gives your model more examples of the rare class to learn from without just repeating the same 10 examples over and over.
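The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration, not the implementation used by Data Wrangler or any library: it assumes numeric feature tuples, at least two minority samples, and Euclidean distance, and the function name and parameters are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding the base itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the line segment from base to neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic
```

Because every synthetic point lies on a segment between two real minority samples, it stays inside the region the minority class already occupies. Oversampling should be applied to the training split only, so that synthetic points never leak into validation or test data.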
Handling Outliers
| Method | Description |
|---|---|
| Z-score | Remove if the absolute z-score exceeds 3 (assumes normal distribution) |
| IQR method | Remove if < Q1 - 1.5×IQR or > Q3 + 1.5×IQR |
| Winsorization | Cap outliers at a percentile threshold |
| Log transform | Reduce skew, compress range |
| Keep them | If outliers are real signal (e.g., fraud) |
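The IQR rule and capping from the table can be sketched with the standard library. Note that `statistics.quantiles` uses one particular quartile convention, so the exact fences may differ slightly from other tools; the function names are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, k=1.5):
    """Values that fall outside the IQR fences."""
    lo, hi = iqr_bounds(values, k)
    return [v for v in values if v < lo or v > hi]


def winsorize(values, lo, hi):
    """Cap values at the given lower and upper thresholds instead of dropping them."""
    return [min(max(v, lo), hi) for v in values]
```

For example, in `[1, 2, 3, 4, 5, 6, 7, 8, 9, 100]` only `100` falls outside the fences, and `winsorize` would pull it back to whatever upper threshold is chosen.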
Feature Transformation Techniques
| Technique | Purpose | Formula/Example |
|---|---|---|
| Normalization (Min-Max) | Scale to [0,1] | $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$ |
| Standardization (Z-score) | Mean=0, StdDev=1 | $x' = \frac{x - \mu}{\sigma}$ |
| Log Transform | Reduce right skew | $x' = \log(x + 1)$ |
| Binning | Convert continuous to categorical | Age → [0-18, 19-35, 36-50, 51+] |
| One-Hot Encoding | Categorical to binary columns | Color → [is_red, is_blue, is_green] |
| Label Encoding | Categorical to integers | [small, medium, large] → [0, 1, 2] |
| Tokenization | Text to tokens | “Hello world” → [“Hello”, “world”] |
| TF-IDF | Text importance scoring | Term frequency × inverse document frequency |
| Shuffling | Randomize order | Prevent ordering bias in training |
ELI5: Normalization and standardization both rescale your numbers, but differently. Normalization squishes everything into a 0–1 range, which is useful when you know the min and max and want everything proportional within that window. Standardization centers everything around the average (mean=0) and stretches or squishes based on spread (std=1), which is useful when you don’t know the range or when a few extreme values would crush a min-max scale. Rule of thumb: use normalization when inputs must sit in a bounded range (for example, image pixels fed to a neural network), and standardization for algorithms that are sensitive to feature scale, such as regularized linear regression, SVM, and PCA. Neither technique requires or produces a normal distribution.
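The two scaling formulas from the table can be written directly in Python. This minimal sketch assumes a non-constant list of numbers and uses the population standard deviation; the function names are hypothetical.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Normalization: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Standardization: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]
```

In practice the min, max, mean, and standard deviation are computed on the training split only and then reused for validation and inference data, so that no information leaks from held-out data.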
One-Hot Encoding Example:
Color → is_red is_blue is_green
───── ────── ─────── ────────
red 1 0 0
blue 0 1 0
green 0 0 1
red 1 0 0
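The same encoding can be produced with a short helper. This is an illustrative sketch in which columns follow the order each category first appears; the function name is hypothetical.

```python
def one_hot(values):
    """Return (column_names, rows) with one binary column per distinct category."""
    categories = list(dict.fromkeys(values))  # unique values in first-seen order
    columns = [f"is_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


columns, rows = one_hot(["red", "blue", "green", "red"])
```

Each row contains exactly one 1, which is why one-hot encoding suits categories with no natural order, while label encoding suits ordered ones such as small, medium, large.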
AWS Data Transformation Services
SageMaker Data Wrangler
Visual, low-code data preparation within SageMaker Studio.
| Feature | Details |
|---|---|
| 300+ built-in transforms | No code needed |
| Import from | S3, Athena, Redshift, Snowflake, Databricks |
| Built-in visualizations | Histograms, scatter plots, correlation |
| Bias detection | Pre-training bias analysis |
| Export to | SageMaker Pipelines, Processing jobs, Feature Store |
| Balancing operators | Random oversample, undersample, SMOTE |
ELI5: Data Wrangler is like Excel on steroids for ML data prep, but instead of writing formulas, you pick from 300+ pre-built transforms in a visual interface. You can join datasets, handle missing values, detect bias, and visualize distributions without writing code. When you’re done, you can export the flow as code or send it to SageMaker Pipelines, a Processing job, or Feature Store.
SageMaker Ground Truth
Managed data labeling service.

Labeling Workflow:
Raw Data → [Labeling Job] → Labeled Dataset
↓
┌───────────────────┐
│ Human Labelers: │
│ • Mechanical Turk │
│ • Private team │
│ • Third-party │
├───────────────────┤
│ Auto-Labeling: │
│ ML model labels │
│ high-confidence │
│ samples (up to │
│ 70% cost savings) │
└───────────────────┘
- Active learning: automatically labels easy samples, sends hard ones to humans
- Annotation consolidation: combines multiple labeler outputs for accuracy
- Supports: bounding boxes, image classification, text classification, semantic segmentation
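Annotation consolidation and the easy/hard routing used in active learning can be illustrated with a toy example. Ground Truth's own consolidation logic is more sophisticated than this; the sketch below uses a plain majority vote, and the 0.9 confidence threshold and function names are hypothetical.

```python
from collections import Counter


def consolidate(labels):
    """Majority vote over one item's worker labels.

    Returns (winning_label, agreement), where agreement is the share of votes won.
    Ties go to the label that appears first in the input.
    """
    winner, votes = Counter(labels).most_common(1)[0]
    return winner, votes / len(labels)


def route(model_confidence, threshold=0.9):
    """Send confident predictions to auto-labeling and the rest to humans."""
    return "auto-label" if model_confidence >= threshold else "human"
```

A low agreement score is a useful signal in its own right: it flags items that may be ambiguous or that need clearer labeling instructions.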
SageMaker Feature Store
Centralized repository for ML features — serves both training and inference.
┌─────────────────────────────────────────────┐
│ SageMaker Feature Store │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Online Store │ │ Offline Store│ │
│ │ (low-latency │ │ (S3-backed │ │
│ │ reads for │ │ for batch │ │
│ │ inference) │ │ training) │ │
│ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │
│ └───────┬───────────┘ │
│ │ │
│ Feature Groups │
│ (tables of features with │
│ record identifier + timestamp) │
└─────────────────────────────────────────────┘
| Store | Latency | Use Case |
|---|---|---|
| Online | Single-digit ms | Real-time inference |
| Offline | Minutes | Batch training, historical queries |
ELI5: Imagine five different teams at a company all need to know a customer’s age. Without Feature Store, each team calculates it differently (one uses signup date, another uses the profile field, a third uses a derived column) and they all get slightly different answers. Feature Store is the single source of truth: one team computes “customer_age” correctly once, stores it, and every other team, whether training a model or serving predictions, reads the same stored value. It also helps prevent training-serving skew, because the feature your model trained on comes from the same definition as the feature used at inference time.
Exam tip: Feature Store = the answer when the question asks about reusable features shared across teams or consistent features between training and inference.
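The online/offline split can be pictured with a toy in-memory model. This is not the SageMaker API: the class and method names are hypothetical stand-ins, and the sketch only shows the idea that the online store holds the latest record per identifier while the offline store keeps the full history.

```python
class ToyFeatureGroup:
    """In-memory stand-in for a feature group; illustrative only."""

    def __init__(self):
        self.online = {}    # record_id -> latest record (fast lookups for inference)
        self.offline = []   # append-only history (batch training, point-in-time queries)

    def put_record(self, record_id, event_time, features):
        record = {"record_id": record_id, "event_time": event_time, **features}
        self.offline.append(record)
        current = self.online.get(record_id)
        # Only overwrite the online copy when the incoming record is at least as recent
        if current is None or event_time >= current["event_time"]:
            self.online[record_id] = record

    def get_record(self, record_id):
        """Latest features for one entity, as an inference service would read them."""
        return self.online.get(record_id)

    def history(self, record_id):
        """All stored versions for one entity, as a training job would read them."""
        return [r for r in self.offline if r["record_id"] == record_id]
```

Usage: writing `customer_age` twice for the same customer leaves one record in `online` (the most recent) and two in `offline`, which is what lets training jobs reconstruct feature values as of a past point in time.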
AWS Glue Ecosystem
AWS Glue — Serverless ETL

┌──────────────────────────────────────────────┐
│ AWS Glue │
│ │
│ ┌────────────┐ ┌────────────┐ │
│ │ Crawlers │→ │ Data │ │
│ │ (discover │ │ Catalog │ │
│ │ schemas) │ │ (metadata) │ │
│ └────────────┘ └─────┬──────┘ │
│ │ │
│ ┌─────▼──────┐ │
│ │ ETL Jobs │ │
│ │ (PySpark / │ │
│ │ Python) │ │
│ └────────────┘ │
│ │
│ ┌────────────┐ ┌────────────┐ │
│ │ Glue │ │ Glue Data │ │
│ │ DataBrew │ │ Quality │ │
│ │ (visual) │ │ (rules) │ │
│ └────────────┘ └────────────┘ │
└──────────────────────────────────────────────┘
| Component | Purpose |
|---|---|
| Crawlers | Auto-discover schema from S3/databases → populate Data Catalog |
| Data Catalog | Central metadata repository (Hive-compatible) |
| ETL Jobs | Serverless Spark jobs for data transformation |
| Glue Studio | Visual ETL pipeline builder |
| Glue DataBrew | Visual data prep with 250+ transforms, PII detection/handling |
| Glue Data Quality | Define rules to validate data quality |
AWS Glue DataBrew — PII Handling
DataBrew can automatically detect and handle PII:
- Redact: Replace with placeholder
- Hash: One-way hash
- Encrypt: Reversible encryption
- Substitute: Replace with fake data
- Delete: Remove entirely
Exam tip: When a question mentions “PII handling in data preparation” → Glue DataBrew or S3 Object Lambda (for on-read transformation).
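To make the difference between redacting and hashing concrete, here is a small standard-library sketch. It does not use DataBrew; the email pattern is deliberately simplistic, and the function names and `[REDACTED]` placeholder are hypothetical.

```python
import hashlib
import re

# Simplistic email pattern for illustration; real PII detection is much broader.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_emails(text, placeholder="[REDACTED]"):
    """Redact: replace each detected email address with a placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)


def hash_value(value, salt=""):
    """Hash: one-way SHA-256 digest; equal inputs always give equal outputs."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
```

Hashing keeps the column usable as a join key because equal inputs map to equal digests, whereas redaction discards that information entirely.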
Amazon Athena — Serverless Analytics
- SQL queries directly on S3 data (no ETL needed)
- Built on Presto/Trino (Athena engine version 3 is based on Trino)
- Pay per query (about $5 per TB scanned; pricing varies by Region)
Athena Performance Optimization
| Technique | Impact |
|---|---|
| Use columnar formats (Parquet, ORC) | 30-90% less data scanned |
| Partition data | Skip irrelevant partitions |
| Compress data (Snappy, GZIP) | Less data to scan |
| Use CTAS (CREATE TABLE AS SELECT) | Convert formats, partition |
| Bucket data | Further optimize joins |
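Because Athena bills by data scanned, the effect of these optimizations is simple arithmetic. The sketch below assumes the $5 per TB figure quoted in this section and a binary terabyte; both constants are assumptions, so check current pricing for your Region.

```python
PRICE_PER_TB_USD = 5.0      # assumption: the per-TB price quoted in this section
BYTES_PER_TB = 1024 ** 4    # assumption: binary terabyte


def query_cost_usd(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Cost of one query given the bytes it scans."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


def scanned_after_reduction(bytes_scanned, reduction):
    """Bytes scanned after an optimization cuts the scan by `reduction` (0 to 1)."""
    return bytes_scanned * (1.0 - reduction)
```

Under these assumptions, a query that scans 1 TB of CSV costs about $5.00, and the same query at the upper end of the 30-90% columnar saving costs about $0.50.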
Athena + Glue Integration
S3 (raw data)
→ Glue Crawler (discover schema)
→ Glue Data Catalog (store metadata)
→ Athena (query with SQL)
- ACID Transactions: Athena supports Apache Iceberg tables for ACID compliance
- Fine-Grained Access: Use Lake Formation for column/row-level security
Data Integrity & Bias Detection
Pre-Training Bias Metrics (SageMaker Clarify)
Detects bias before model training by analyzing the dataset.
| Metric | What It Measures |
|---|---|
| Class Imbalance (CI) | Difference in number of samples per class |
| Difference in Proportions of Labels (DPL) | Difference in the proportion of positive labels between groups |
| KL Divergence | How far one group's label distribution diverges from the other's (asymmetric) |
| Jensen-Shannon Divergence | Symmetric version of KL divergence |
| Lp-norm | p-norm distance between label distributions |
| Total Variation Distance | Half the L1 distance between label distributions |
| Kolmogorov-Smirnov | Max difference in cumulative distributions |
| Conditional Demographic Disparity (CDD) | Disparity conditioned on other attributes |
ELI5: Pre-training bias detection is like checking whether your deck of cards is stacked before you start the game. If your training dataset has 90% approvals for one demographic and 40% for another, any model trained on it will learn that skewed pattern as “truth.” Clarify’s pre-training metrics quantify exactly how tilted the deck is — so you can fix the data (resample, reweight) before spending hours training a model that’s biased from the start.
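A few of these metrics are simple enough to compute by hand. The sketch below assumes the commonly documented definitions, with facet `a` as the advantaged group and facet `d` as the disadvantaged group, and it is not Clarify's own code; the function names are hypothetical.

```python
import math


def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d); 0 means the facets are equally represented."""
    return (n_a - n_d) / (n_a + n_d)


def dpl(pos_a, n_a, pos_d, n_d):
    """DPL = share of positive labels in facet a minus the share in facet d."""
    return pos_a / n_a - pos_d / n_d


def kl_divergence(p, q):
    """KL(p || q) for two discrete label distributions (q must be non-zero where p is)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_variation_distance(p, q):
    """TVD = half the L1 distance between two label distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
```

For the 90% versus 40% approval example above, DPL is 0.5 and the total variation distance between the label distributions `[0.9, 0.1]` and `[0.4, 0.6]` is also 0.5.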
Bias Mitigation Techniques
Pre-Processing (before training):
• Resampling (oversample minority / undersample majority)
• SMOTE (synthetic data for minority)
• Data augmentation
In-Processing (during training):
• Fairness constraints in loss function
• Adversarial debiasing
Post-Processing (after training):
• Threshold adjustment per group
• Reject option classification
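Of the techniques listed, per-group threshold adjustment is the easiest to show in code. This is a toy sketch with made-up scores and thresholds; the function names and group labels are hypothetical, and choosing thresholds in practice involves fairness and legal considerations beyond this example.

```python
def predict_with_group_thresholds(score, group, thresholds, default=0.5):
    """Return 1 if the score clears the threshold assigned to the sample's group."""
    return 1 if score >= thresholds.get(group, default) else 0


def positive_rate(samples, thresholds):
    """Share of positive predictions per group for (group, score) pairs."""
    totals, positives = {}, {}
    for group, score in samples:
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + predict_with_group_thresholds(
            score, group, thresholds
        )
    return {g: positives[g] / totals[g] for g in totals}


# Made-up model scores for two groups
samples = [("a", 0.7), ("a", 0.4), ("b", 0.45), ("b", 0.3)]
single = positive_rate(samples, {})                      # one 0.5 threshold for everyone
adjusted = positive_rate(samples, {"a": 0.5, "b": 0.4})  # lower threshold for group b
```

With a single 0.5 threshold, group `b` receives no positive predictions; lowering its threshold to 0.4 brings both groups to the same positive rate without retraining the model.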
Data Security & Compliance
| Requirement | Solution |
|---|---|
| Encrypt data at rest | S3 SSE-KMS, EBS encryption |
| Encrypt in transit | TLS/SSL |
| Anonymize PII | Glue DataBrew, Macie for detection |
| Access control | Lake Formation, S3 policies, IAM |
| Audit trail | CloudTrail, S3 access logs |
| HIPAA / PHI | Transcribe Medical, Comprehend Medical |
Quick Reference: When to Use What
| Scenario | Service |
|---|---|
| Large-scale distributed ETL | EMR + Spark |
| Serverless ETL | AWS Glue ETL Jobs |
| Visual data prep (no code) | Glue DataBrew |
| SageMaker-integrated visual prep | Data Wrangler |
| Label training data | SageMaker Ground Truth |
| Store reusable features | SageMaker Feature Store |
| Query S3 with SQL | Athena |
| Detect PII in S3 | Amazon Macie |
| Handle PII in transformations | Glue DataBrew |
| Detect pre-training bias | SageMaker Clarify |
| Balance imbalanced datasets | Data Wrangler (SMOTE) |
| Schema discovery | Glue Crawlers |
| Data catalog / metadata | Glue Data Catalog |