Domain 1B: Data Transformation & Feature Engineering



Exam Domain: 1, Data Preparation for ML (28%)
Task: Transform data, perform feature engineering, ensure data integrity

Data Processing at Scale

Amazon EMR (Elastic MapReduce)

Managed Hadoop/Spark cluster for large-scale data processing.

EMR Cluster Architecture

EMR Architecture:
┌─────────────────────────────────────────────┐
│  EMR Cluster                                │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │  Master   │  │  Core    │  │  Task    │  │
│  │  Node     │  │  Nodes   │  │  Nodes   │  │
│  │ (manages) │  │ (HDFS +  │  │ (compute │  │
│  │           │  │  compute)│  │  only)   │  │
│  └──────────┘  └──────────┘  └──────────┘  │
│                                             │
│  Frameworks: Spark, Hive, Presto, HBase     │
└─────────────────────────────────────────────┘
          ↕
     Amazon S3 (EMRFS)

Feature	Details
EMR on EC2	Traditional clusters, full control
EMR Serverless	No cluster management, pay per job
EMR on EKS	Run Spark on Kubernetes
EMRFS	Read/write S3 as if it were HDFS

Apache Spark on EMR

Distributed, in-memory data processing (up to 100x faster than Hadoop MapReduce on some workloads)
PySpark for Python-based ML data prep
Spark MLlib for distributed ML algorithms
Use case: TF-IDF, large-scale feature engineering, ETL

Exam tip: When you see “large-scale data processing” or “distributed transformation” — think EMR + Spark.
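TF-IDF, the use case named above, can be sketched in plain Python to show which steps a cluster would distribute. This is a single-process illustration, not PySpark or Spark MLlib code; the function name `tf_idf` and the sample documents are made up for the example, and whitespace tokenization is a simplifying assumption.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]

    # "Map"-style step: term counts per document. Each document is
    # independent, which is what makes this step easy to distribute.
    term_counts = [Counter(tokens) for tokens in tokenized]

    # "Reduce"-style step: document frequency aggregated over all documents.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    scores = []
    for counts, tokens in zip(term_counts, tokenized):
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = ["spark makes etl fast", "spark runs on emr", "glue runs etl jobs"]
scores = tf_idf(docs)
```

In the first document, "makes" scores higher than "spark" because "spark" also appears in another document. A term that appears in every document gets a score of zero.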

Feature Engineering Techniques

The Curse of Dimensionality

More features ≠ better model. Too many features lead to:

Overfitting
Increased training time
Sparse data in high-dimensional space

Solution: Dimensionality reduction (PCA, feature selection)

ELI5: Imagine trying to find a specific person. In a single room (1 dimension, just their height), it is easy. In a building (2 dimensions, height and weight), it is harder but manageable. In a city described by 1,000 characteristics, it is nearly impossible: your data points are so spread out that “nearby” becomes meaningless. Every feature you add multiplies the volume your data has to fill, so you need exponentially more data to cover it. This is why adding 50 random features to your model usually makes it worse, not better.
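A small calculation makes the "nearby becomes meaningless" point concrete. The sketch below assumes data spread uniformly over a unit hypercube and asks how long the edge of a sub-cube must be to capture 10% of the data. The helper name `edge_needed` is invented for this example.

```python
def edge_needed(fraction, dims):
    """Edge length of a sub-cube holding `fraction` of a unit hypercube's volume."""
    return fraction ** (1.0 / dims)


for dims in (1, 2, 10, 100):
    print(f"{dims:>3} dims: edge = {edge_needed(0.10, dims):.3f}")

# Prints:
#   1 dims: edge = 0.100
#   2 dims: edge = 0.316
#  10 dims: edge = 0.794
# 100 dims: edge = 0.977
```

With 100 features, a "local" neighborhood holding 10% of the data has to span almost the full range of every feature, so it is no longer local in any useful sense.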

Handling Missing Data

Technique	When to Use
Drop rows	Small % missing, random pattern
Drop columns	Most values missing in that feature
Mean/Median imputation	Numerical data; mean for roughly symmetric distributions, median for skewed data or outliers
Mode imputation	Categorical data
KNN imputation	Similar samples should have similar values
Forward/Backward fill	Time series data
Indicator variable	Missingness itself is informative

ELI5: Missing data is like a survey where some people skipped questions. You have three choices: throw away the whole survey (drop rows) — wasteful if most answers are there. Guess the missing answer based on what similar people answered (imputation) — reasonable but introduces assumptions. Add a checkbox that says “this person skipped this question” (indicator variable) — useful when the fact that someone skipped is itself meaningful, like patients who skip a pain question might be the ones in the most pain.
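Two of the options above, imputation and an indicator variable, are often combined. A minimal sketch using only the standard library follows; the function name and the `ages` sample are illustrative, and missing values are assumed to be represented as `None`.

```python
from statistics import median


def impute_with_indicator(values):
    """Fill None with the median of observed values and return a 0/1 missing flag."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing


ages = [34, None, 29, 41, None, 38]
filled, was_missing = impute_with_indicator(ages)
# filled      -> [34, 36.0, 29, 41, 36.0, 38]
# was_missing -> [0, 1, 0, 0, 1, 0]
```

Keeping the flag column lets a model learn from the missingness pattern even after the gaps have been filled.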

Handling Unbalanced Data

Imbalanced Dataset (e.g., fraud detection):
  Class 0 (normal):    99,000 samples
  Class 1 (fraud):      1,000 samples   ← minority class

Techniques:
  ┌─────────────────────────────────────────────┐
  │  Oversampling: Duplicate/synthesize minority │
  │  ├─ Random oversampling (duplicate)          │
  │  └─ SMOTE (synthesize new samples)           │
  │                                              │
  │  Undersampling: Reduce majority class        │
  │  └─ Random undersampling                     │
  │                                              │
  │  Algorithm-level:                            │
  │  ├─ Class weights (cost-sensitive learning)  │
  │  └─ Anomaly detection approach               │
  └─────────────────────────────────────────────┘
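For the class-weight option, one common heuristic is to weight each class inversely to its frequency, so that mistakes on the rare class cost more. The sketch below applies that heuristic to the 99,000 / 1,000 split shown above. The formula is an assumption chosen for illustration, not something a specific AWS service prescribes.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight = n_samples / (n_classes * class_count), so rare classes weigh more."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [0] * 99_000 + [1] * 1_000
weights = balanced_class_weights(labels)
# weights[1] is 50.0 and weights[0] is about 0.505
```

With these weights, each fraud sample contributes roughly 99 times as much to the loss as a normal sample, which offsets the 99:1 imbalance.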

SMOTE (Synthetic Minority Oversampling Technique):

Creates synthetic samples between existing minority samples
Available in SageMaker Data Wrangler as a built-in operator

ELI5: SMOTE is like creating imaginary friends that look like a blend of your existing friends, to balance out a lopsided group photo. If you only have 10 fraud examples and 10,000 normal ones, SMOTE looks at each fraud example, finds its nearest fraud neighbors, and creates new synthetic fraud examples that fall somewhere between them — not exact copies, but plausible variations. This gives your model more examples of the rare class to learn from without just repeating the same 10 examples over and over.
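The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration rather than the Data Wrangler operator or a library implementation: `smote_sketch`, its parameters, and the sample points are hypothetical, and it assumes numeric features and at least two minority samples.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point, excluding itself.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


fraud = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(fraud, n_new=6)
```

Every synthetic point lies on a line segment between two real minority samples, so it stays inside the region the minority class already occupies.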

Handling Outliers

Method	Description
Z-score	Remove if |z| > 3 (assumes normal distribution)
IQR method	Remove if < Q1 - 1.5×IQR or > Q3 + 1.5×IQR
Winsorization	Cap outliers at a percentile threshold
Log transform	Reduce skew, compress range
Keep them	If outliers are real signal (e.g., fraud)
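The IQR rule from the table can be sketched with the standard library. Quartile definitions differ slightly between tools, and this example uses the default method of `statistics.quantiles`. The sample `amounts` and the helper names are illustrative. Capping at the IQR fences is shown as a simple stand-in for percentile-based winsorization.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_to_bounds(values, low, high):
    """Cap values at the fences instead of dropping them."""
    return [min(max(v, low), high) for v in values]


amounts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
low, high = iqr_bounds(amounts)
outliers = [v for v in amounts if v < low or v > high]  # [100]
capped = clip_to_bounds(amounts, low, high)
```

Whether to drop, cap, or keep the flagged value depends on the last row of the table: in fraud detection, the value 100 may be exactly the signal the model needs.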

Feature Transformation Techniques

Technique	Purpose	Formula/Example
Normalization (Min-Max)	Scale to [0,1]	$x' = \frac{x - x_{min}}{x_{max} - x_{min}}$
Standardization (Z-score)	Mean=0, StdDev=1	$x' = \frac{x - \mu}{\sigma}$
Log Transform	Reduce right skew	$x' = \log(x + 1)$
Binning	Convert continuous to categorical	Age → [0-18, 19-35, 36-50, 51+]
One-Hot Encoding	Categorical to binary columns	Color → [is_red, is_blue, is_green]
Label Encoding	Categorical to integers	[small, medium, large] → [0, 1, 2]
Tokenization	Text to tokens	“Hello world” → [“Hello”, “world”]
TF-IDF	Text importance scoring	Term frequency × inverse document frequency
Shuffling	Randomize order	Prevent ordering bias in training

ELI5: Normalization and standardization both rescale your numbers, but differently. Normalization squishes everything into a 0 to 1 range, which is useful when you know the min and max and want everything proportional within that window. Standardization centers everything around the average (mean=0) and stretches or squishes based on spread (std=1), which is useful when you don’t know the range and is less distorted by outliers than min-max scaling. Rule of thumb: use normalization for neural networks and image data, and standardization for algorithms that are sensitive to feature scale (like SVM, regularized linear or logistic regression, and PCA).
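The three numeric formulas from the table translate directly into code. The sketch below uses only the standard library and assumes a non-constant column (otherwise the denominators are zero) and non-negative inputs for the log transform. The `incomes` sample is made up to show how one extreme value affects min-max scaling.

```python
import math
from statistics import mean, pstdev


def min_max(values):
    """x' = (x - x_min) / (x_max - x_min)"""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """x' = (x - mu) / sigma, using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def log_transform(values):
    """x' = log(x + 1); log1p is accurate for small x."""
    return [math.log1p(v) for v in values]


incomes = [20, 30, 40, 50, 1000]
scaled = min_max(incomes)        # first four values all land below 0.05
z_scores = standardize(incomes)
logged = log_transform(incomes)
```

In practice the statistics (min, max, mean, standard deviation) should be computed on the training split only and then reused for validation and inference data.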

One-Hot Encoding Example:

  Color       →    is_red   is_blue   is_green
  ─────            ──────   ───────   ────────
  red              1        0         0
  blue             0        1         0
  green            0        0         1
  red              1        0         0
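The table above can be reproduced with a short function. This is a plain-Python sketch with an invented helper name; it orders the columns by first appearance so that the output matches the example.

```python
def one_hot(values):
    """Return (column_names, rows) with one binary column per category."""
    # dict.fromkeys keeps first-seen order and removes duplicates.
    categories = list(dict.fromkeys(values))
    columns = [f"is_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


columns, rows = one_hot(["red", "blue", "green", "red"])
# columns -> ['is_red', 'is_blue', 'is_green']
# rows    -> [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Each row has exactly one 1, which is why high-cardinality categorical columns produce very wide, sparse matrices.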

AWS Data Transformation Services

SageMaker Data Wrangler

Visual, low-code data preparation within SageMaker Studio.

Feature	Details
300+ built-in transforms	No code needed
Import from	S3, Athena, Redshift, Snowflake, Databricks
Built-in visualizations	Histograms, scatter plots, correlation
Bias detection	Pre-training bias analysis
Export to	SageMaker Pipelines, Processing jobs, Feature Store
Balancing operators	Random oversample, undersample, SMOTE

ELI5: Data Wrangler is like Excel on steroids for ML data prep, but instead of writing formulas, you pick from 300+ pre-built transforms in a visual interface. You can join datasets, handle missing values, detect bias, and visualize distributions, mostly without writing code. When you’re done, you can export the flow as Python code or run it as a step in a SageMaker Pipeline.

SageMaker Ground Truth

Managed data labeling service.

Ground Truth Workflow

Labeling Workflow:
  Raw Data → [Labeling Job] → Labeled Dataset
                  ↓
         ┌───────────────────┐
         │ Human Labelers:   │
         │ • Mechanical Turk │
         │ • Private team    │
         │ • Third-party     │
         ├───────────────────┤
         │ Auto-Labeling:    │
         │ ML model labels   │
         │ high-confidence   │
         │ samples (up to    │
         │ 70% cost savings) │
         └───────────────────┘

Active learning: automatically labels easy samples, sends hard ones to humans
Annotation consolidation: combines multiple labeler outputs for accuracy
Supports: bounding boxes, image classification, text classification, semantic segmentation
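The two ideas above, annotation consolidation and active-learning routing, can be pictured with a toy example. This is not how Ground Truth is implemented internally. Majority voting and the 0.9 confidence threshold are simplifying assumptions, and both function names are hypothetical.

```python
from collections import Counter


def consolidate(annotations):
    """Majority vote over labels from several workers; returns (label, agreement)."""
    label, votes = Counter(annotations).most_common(1)[0]
    return label, votes / len(annotations)


def route(model_confidence, threshold=0.9):
    """Auto-label confident predictions and send the rest to human labelers."""
    return "auto-label" if model_confidence >= threshold else "human"


label, agreement = consolidate(["cat", "cat", "dog"])  # ("cat", 2/3)
decisions = [route(c) for c in (0.97, 0.55, 0.91)]
# decisions -> ['auto-label', 'human', 'auto-label']
```

The cost savings come from the routing step: every sample the model labels with high confidence is one fewer sample sent to paid human labelers.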

SageMaker Feature Store

Centralized repository for ML features — serves both training and inference.

┌─────────────────────────────────────────────┐
│           SageMaker Feature Store            │
│                                             │
│  ┌──────────────┐    ┌──────────────┐       │
│  │ Online Store │    │ Offline Store│       │
│  │ (low-latency │    │ (S3-backed   │       │
│  │  reads for   │    │  for batch   │       │
│  │  inference)  │    │  training)   │       │
│  └──────┬───────┘    └──────┬───────┘       │
│         │                   │               │
│         └───────┬───────────┘               │
│                 │                           │
│         Feature Groups                      │
│         (tables of features with            │
│          record identifier + timestamp)     │
└─────────────────────────────────────────────┘

Store	Latency	Use Case
Online	Single-digit ms	Real-time inference
Offline	Minutes	Batch training, historical queries
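The split between the two stores can be pictured with a toy in-memory model. It is not the SageMaker API. The class `ToyFeatureGroup` and its methods are invented here, on the assumption that the online store keeps the latest record per identifier while the offline store keeps the full history.

```python
class ToyFeatureGroup:
    """Toy in-memory model of a feature group (illustration only)."""

    def __init__(self):
        self.online = {}    # record id -> latest record, for fast key lookups
        self.offline = []   # append-only history, for building training sets

    def put_record(self, record_id, event_time, features):
        record = {"id": record_id, "event_time": event_time, **features}
        self.offline.append(record)
        current = self.online.get(record_id)
        if current is None or event_time >= current["event_time"]:
            self.online[record_id] = record

    def get_record(self, record_id):
        """Online read: the most recent values, as used at inference time."""
        return self.online[record_id]

    def as_of(self, record_id, event_time):
        """Offline read: the latest record at or before event_time."""
        candidates = [
            r for r in self.offline
            if r["id"] == record_id and r["event_time"] <= event_time
        ]
        return max(candidates, key=lambda r: r["event_time"]) if candidates else None


group = ToyFeatureGroup()
group.put_record("c1", event_time=1, features={"customer_age": 30})
group.put_record("c1", event_time=5, features={"customer_age": 31})
latest = group.get_record("c1")["customer_age"]            # 31
historical = group.as_of("c1", event_time=3)["customer_age"]  # 30
```

The record identifier and event timestamp are what make both reads possible: the identifier selects the entity, and the timestamp decides which version of its features is returned.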

ELI5: Imagine five different teams at a company all need to know a customer’s age. Without Feature Store, each team calculates it differently (one uses signup date, another uses the profile field, a third uses a derived column) and they all get slightly different answers. Feature Store is the single source of truth: one team computes “customer_age” correctly once, stores it, and every other team, whether training a model or serving predictions, reads the exact same value. It also addresses training-serving skew: training and inference both read the feature from the same definition and pipeline, rather than from two separately maintained implementations.

Exam tip: Feature Store = the answer when the question asks about reusable features shared across teams or consistent features between training and inference.

AWS Glue Ecosystem

AWS Glue — Serverless ETL

Glue ETL Ecosystem

┌──────────────────────────────────────────────┐
│                  AWS Glue                     │
│                                              │
│  ┌────────────┐  ┌────────────┐              │
│  │  Crawlers  │→ │  Data      │              │
│  │ (discover  │  │  Catalog   │              │
│  │  schemas)  │  │ (metadata) │              │
│  └────────────┘  └─────┬──────┘              │
│                        │                     │
│                  ┌─────▼──────┐               │
│                  │  ETL Jobs  │               │
│                  │ (PySpark / │               │
│                  │  Python)   │               │
│                  └────────────┘               │
│                                              │
│  ┌────────────┐  ┌────────────┐              │
│  │  Glue      │  │  Glue Data │              │
│  │  DataBrew  │  │  Quality   │              │
│  │ (visual)   │  │ (rules)    │              │
│  └────────────┘  └────────────┘              │
└──────────────────────────────────────────────┘

Component	Purpose
Crawlers	Auto-discover schema from S3/databases → populate Data Catalog
Data Catalog	Central metadata repository (Hive-compatible)
ETL Jobs	Serverless Spark jobs for data transformation
Glue Studio	Visual ETL pipeline builder
Glue DataBrew	Visual data prep with 250+ transforms, PII detection/handling
Glue Data Quality	Define rules to validate data quality

AWS Glue DataBrew — PII Handling

DataBrew can automatically detect and handle PII:

Redact: Replace with placeholder
Hash: One-way hash
Encrypt: Reversible encryption
Substitute: Replace with fake data
Delete: Remove entirely
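To make the difference between redacting and hashing concrete, here is a plain-Python sketch applied to email addresses. It does not use DataBrew. The regular expression is deliberately simple, and the salt value and function names are placeholders for illustration.

```python
import hashlib
import re

# Simplified email pattern for the example; real PII detection is broader.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text, placeholder="[REDACTED]"):
    """Redact: replace each match with a fixed placeholder."""
    return EMAIL.sub(placeholder, text)


def hash_value(value, salt="demo-salt"):
    """Hash: one-way SHA-256 digest; the same input always gives the same output."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def hash_emails(text):
    """Replace each email with its digest so records can still be joined."""
    return EMAIL.sub(lambda m: hash_value(m.group(0)), text)


line = "Contact jane@example.com today"
redacted = redact(line)   # 'Contact [REDACTED] today'
hashed = hash_emails(line)
```

Redaction destroys the value entirely, while hashing keeps a stable token that still lets you count or join on the field without exposing the original.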

Exam tip: When a question mentions “PII handling in data preparation” → Glue DataBrew or S3 Object Lambda (for on-read transformation).

Amazon Athena — Serverless Analytics

SQL queries directly on S3 data (no ETL needed)
Built on the open-source Presto and Trino query engines
Pay per query ($5 per TB scanned)

Athena Performance Optimization

Technique	Impact
Use columnar formats (Parquet, ORC)	30-90% less data scanned
Partition data	Skip irrelevant partitions
Compress data (Snappy, GZIP)	Less data to scan
Use CTAS (CREATE TABLE AS SELECT)	Convert formats, partition
Bucket data	Further optimize joins
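Because Athena bills by data scanned, the techniques in the table turn directly into cost. The arithmetic below uses the $5 per TB figure quoted above. The table size and the two reduction fractions are hypothetical values chosen for illustration, not measurements.

```python
PRICE_PER_TB = 5.00  # USD per TB scanned, the figure quoted in this section


def query_cost(tb_scanned, price_per_tb=PRICE_PER_TB):
    """Cost of one query, given how much data it scans."""
    return tb_scanned * price_per_tb


raw_tb = 2.0                 # hypothetical: size of the raw CSV table
columnar_fraction = 0.25     # assumed: bytes read after Parquet + compression
partition_fraction = 1 / 30  # assumed: query touches 1 of 30 daily partitions

full_scan = query_cost(raw_tb)
optimized = query_cost(raw_tb * columnar_fraction * partition_fraction)
print(f"full scan: ${full_scan:.2f}, optimized: ${optimized:.2f}")
# Prints: full scan: $10.00, optimized: $0.08
```

The savings multiply: columnar storage cuts the bytes read per partition, and partitioning cuts the number of partitions read.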

Athena + Glue Integration

S3 (raw data)
    → Glue Crawler (discover schema)
    → Glue Data Catalog (store metadata)
    → Athena (query with SQL)

ACID Transactions: Athena supports Apache Iceberg tables for ACID compliance
Fine-Grained Access: Use Lake Formation for column/row-level security

Data Integrity & Bias Detection

Pre-Training Bias Metrics (SageMaker Clarify)

Detects bias before model training by analyzing the dataset.

Metric	What It Measures
Class Imbalance (CI)	Imbalance in the number of samples between groups (facet values)
Difference in Proportions of Labels (DPL)	Difference in the proportion of positive labels between groups
KL Divergence	Distribution difference between groups
Jensen-Shannon Divergence	Symmetric version of KL divergence
Lp-norm	p-norm distance between the label distributions of groups
Total Variation Distance	Half the L1 distance between the label distributions of groups
Kolmogorov-Smirnov	Max difference in cumulative distributions
Conditional Demographic Disparity (CDD)	Disparity conditioned on other attributes

ELI5: Pre-training bias detection is like checking whether your deck of cards is stacked before you start the game. If your training dataset has 90% approvals for one demographic and 40% for another, any model trained on it will learn that skewed pattern as “truth.” Clarify’s pre-training metrics quantify exactly how tilted the deck is — so you can fix the data (resample, reweight) before spending hours training a model that’s biased from the start.
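Several of these metrics are simple enough to compute by hand. The sketch below applies them to the 90% versus 40% approval example, assuming two groups labeled a and d and binary labels. The group sizes (800 and 200) are made up, and the functions follow the standard textbook definitions rather than calling SageMaker Clarify.

```python
import math


def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d); 0 means equally sized groups."""
    return (n_a - n_d) / (n_a + n_d)


def dpl(pos_a, n_a, pos_d, n_d):
    """DPL = difference in the proportion of positive labels."""
    return pos_a / n_a - pos_d / n_d


def kl_divergence(p, q):
    """KL(p || q) over label distributions; not symmetric."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Symmetric variant: average KL of p and q against their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def total_variation(p, q):
    """TVD = half the L1 distance between the distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical counts: group a has 800 samples (720 approved),
# group d has 200 samples (80 approved).
p_a, p_d = [0.9, 0.1], [0.4, 0.6]
ci = class_imbalance(800, 200)   # 0.6
gap = dpl(720, 800, 80, 200)     # about 0.5
tvd = total_variation(p_a, p_d)  # about 0.5
```

CI looks only at how many samples each group has, while DPL, KL, JS, and TVD compare how labels are distributed within each group, so a dataset can pass one check and fail another.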

Bias Mitigation Techniques

Pre-Processing (before training):
  • Resampling (oversample minority / undersample majority)
  • SMOTE (synthetic data for minority)
  • Data augmentation

In-Processing (during training):
  • Fairness constraints in loss function
  • Adversarial debiasing

Post-Processing (after training):
  • Threshold adjustment per group
  • Reject option classification
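Of the techniques listed, per-group threshold adjustment is the easiest to sketch. The scores, group labels, and thresholds below are invented for illustration. The example equalizes the positive prediction rate across two groups, which is only one of several possible fairness targets.

```python
def predict_with_group_thresholds(scores, groups, thresholds):
    """Apply a separate decision threshold to each group."""
    return [1 if s >= thresholds[g] else 0 for s, g in zip(scores, groups)]


def positive_rate(preds, groups, group):
    """Share of positive predictions within one group."""
    selected = [p for p, g in zip(preds, groups) if g == group]
    return sum(selected) / len(selected)


scores = [0.9, 0.7, 0.6, 0.4, 0.65, 0.55, 0.45, 0.3]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

# One shared threshold: group a gets 75% positives, group b only 25%.
shared = predict_with_group_thresholds(scores, groups, {"a": 0.6, "b": 0.6})

# Lowering the threshold for group b brings both groups to 75%.
adjusted = predict_with_group_thresholds(scores, groups, {"a": 0.6, "b": 0.45})
```

This changes decisions without retraining the model, which is why it is classed as post-processing.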

Data Security & Compliance

Requirement	Solution
Encrypt data at rest	S3 SSE-KMS, EBS encryption
Encrypt in transit	TLS/SSL
Anonymize PII	Glue DataBrew, Macie for detection
Access control	Lake Formation, S3 policies, IAM
Audit trail	CloudTrail, S3 access logs
HIPAA / PHI	Transcribe Medical, Comprehend Medical

Quick Reference: When to Use What

Scenario	Service
Large-scale distributed ETL	EMR + Spark
Serverless ETL	AWS Glue ETL Jobs
Visual data prep (no code)	Glue DataBrew
SageMaker-integrated visual prep	Data Wrangler
Label training data	SageMaker Ground Truth
Store reusable features	SageMaker Feature Store
Query S3 with SQL	Athena
Detect PII in S3	Amazon Macie
Handle PII in transformations	Glue DataBrew
Detect pre-training bias	SageMaker Clarify
Balance imbalanced datasets	Data Wrangler (SMOTE)
Schema discovery	Glue Crawlers
Data catalog / metadata	Glue Data Catalog
