Domain 1B: Data Transformation & ETL

15 min read 3142 words

Table of Contents

Data Transformation & ETL

Data Transformation & ETL

Exam Domain: 1 — Data Engineering (20%) Task: Transform and prepare data for ML workloads

ETL vs ELT — First Principles

Why transform data at all?

ELI5: Raw data is like ingredients from the grocery store — you can’t serve them directly. Tomatoes have stems, meat has bones, flour is unmeasured. ML algorithms are picky chefs: they only accept clean, numeric, normalized “meals.” Transformation is the cooking step that turns raw ingredients into something the model can consume.

Raw problems in real data:
  ✗ Missing values (NULL, NaN, empty string)
  ✗ Wrong types (date stored as string "01/02/24")
  ✗ Inconsistent units (km vs miles, USD vs EUR)
  ✗ Duplicates (same customer row 3 times)
  ✗ Outliers (age = 999, salary = -1)
  ✗ Categorical text (must become numbers for most algorithms)
  ✗ Different scales (income 0–200,000 vs age 0–100)

ETL vs ELT

┌─────────────────────────────────────────────────────────────┐
│  ETL — Extract, Transform, Load                             │
│                                                             │
│  Source → [Extract] → [Transform on separate engine] →     │
│           [Load clean data into target]                     │
│                                                             │
│  Transform happens BEFORE landing in destination            │
│  Good when: destination is a rigid warehouse (Redshift)     │
│  Example: Glue ETL writes clean Parquet to S3               │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│  ELT — Extract, Load, Transform                             │
│                                                             │
│  Source → [Extract] → [Load raw into data lake] →          │
│           [Transform inside destination with SQL]           │
│                                                             │
│  Transform happens INSIDE the destination                   │
│  Good when: destination is powerful enough (Redshift, EMR)  │
│  Example: raw JSON lands in S3, Athena + dbt transforms it  │
└─────────────────────────────────────────────────────────────┘

Factor	ETL	ELT
When to transform	Before load	After load
Raw data stored?	No	Yes
Schema flexibility	Lower	Higher
Destination power needed	Low	High
Reprocessing raw?	Hard	Easy (raw still there)
AWS example	Glue Job → S3	S3 raw → Athena SQL

When ETL wins: Strict compliance — raw data must never land in the destination. Destination is a rigid system (relational DB, Redshift). Clean data needed before any downstream access.

When ELT wins: You want to keep raw data and re-derive as understanding evolves. Destination is powerful (Redshift Spectrum, Athena, Spark). Data scientists need access to raw data for exploration.

AWS Glue — Deep Dive

What Glue Actually Is

ELI5: AWS Glue is a managed Apache Spark service wrapped in a friendly AWS console. When you write a Glue ETL job, you’re writing Spark code. AWS provisions the Spark cluster, runs your job, charges you per DPU-second, and tears the cluster down. You get all the power of distributed Spark without managing a cluster.

GLUE COMPONENTS MAP:

  Data Sources ──► [Crawlers] ──► [Data Catalog] ──► [ETL Jobs]
                    discover       central metadata   transform data
                    schema         repository
                         │
                         └──► [Triggers] ──► [Workflows]
                               schedule       multi-job pipelines

Component 1: Glue Crawlers

How Crawlers work (behind the scenes):
  1. Crawler runs against data source (S3, RDS, Redshift, DynamoDB)
  2. Samples files → applies classifiers (built-in: CSV, JSON, Parquet, etc.)
  3. Detects delimiter, column names, data types
  4. Detects partitions → reads path structure s3://bucket/year=2024/month=01/
     → creates partition keys automatically in catalog
  5. Writes table definition to Glue Data Catalog

Update behavior:
  - New columns?     → ADD_NEW_COLUMNS (default) or UPDATE_IN_DATABASE
  - Incompatible?    → LOG (don't break catalog)
  - Deleted columns? → DELETE_FROM_DATABASE (careful!)

Exam tip: Crawlers detect partitions from S3 path structure automatically. If data is partitioned by date, the crawler will create partition keys for year/month/day in the catalog. This is critical for Athena partition pruning.

Component 2: Glue Data Catalog

ELI5: The Glue Data Catalog is the card catalog for your data lake. Just as a library card catalog tells you which shelf holds a book without you having to search every shelf, the Data Catalog tells Athena, Glue, and EMR exactly where your data lives, what format it’s in, and what columns it has — without those services having to scan your S3 bucket to find out.

Structure:
  Database → Tables → Partitions
                          │
                          └── Location: s3://bucket/prefix/
                              Schema: {col1: string, col2: int, ...}
                              Format: Parquet / CSV / JSON
                              SerDe: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Shared by:
  ├── Athena     (queries the catalog to know S3 locations)
  ├── EMR        (Hive Metastore compatibility)
  ├── Redshift   (Spectrum uses catalog for external tables)
  └── Glue ETL   (reads table definitions for source/target)

Component 3: Glue ETL — DynamicFrame vs DataFrame

DataFrame (Apache Spark standard):
  - Requires fixed schema upfront
  - Error if column type is inconsistent across rows
  - Best for: clean, uniform data

DynamicFrame (Glue extension):
  - Schema per record — each row can have different types/columns
  - resolveChoice(): handle ambiguous types (cast, project, make_struct)
  - relationalize(): flatten nested JSON into flat tables
  - Best for: messy real-world data with schema inconsistencies

Example:
  CSV has 'age' column — some rows have integers, some have "N/A" strings
  DataFrame: FAILS on type mismatch
  DynamicFrame: Creates choice type, resolveChoice() to handle

Why this matters for the exam: DynamicFrame exists because real-world data is messy. When a question describes inconsistent schemas or mixed types in source data, the answer involves DynamicFrame + resolveChoice.

Component 4: Glue Job Types

Job Type	Runtime	Use Case
Spark	Apache Spark	Large-scale batch ETL, terabytes
Spark Streaming	Spark Structured Streaming	Micro-batch on Kinesis/Kafka
Python Shell	Python (no Spark)	Lightweight scripts, small data, < 1GB
Ray	Distributed Python (Ray)	Parallel Python, ML preprocessing

Component 5: Job Bookmarks — Incremental Processing

Problem: You run a Glue job daily on S3.
         Without bookmarks: reprocesses ALL files every run (expensive)
         With bookmarks: tracks which files were processed last run
                         → only processes NEW files since last checkpoint

How it works:
  Glue stores a bookmark in AWS (not S3) with:
    - Last run's S3 object keys and modification timestamps
  Next run:
    - Only reads files with modification time AFTER last bookmark
    - Appends to target, doesn't re-write existing data

ML use case:
  Daily feature engineering pipeline: new training events → feature store
  Don't reprocess last month's data → use bookmarks

Exam tip: Job bookmarks = incremental ETL on S3. When a question mentions “only process new data since last run” or “avoid reprocessing,” the answer is Glue job bookmarks.

Component 6: Glue DataBrew

Visual, no-code data preparation
250+ built-in transformations (normalize, pivot, parse dates, etc.)
Profile data (statistics, missing values, outliers) before transforming
Publishes transformation “recipe” that can be re-applied
Target users: data analysts and data scientists (no Spark code needed)

Discriminative example: Data engineer needs complex custom Spark transformations → Glue ETL. Data analyst wants to clean CSV visually without code → Glue DataBrew.

Component 7: FindMatches ML Transform

Problem: Fuzzy deduplication — "John Smith" vs "J. Smith" vs "Jon Smyth"
         Standard SQL JOIN can't match these without exact string equality

How FindMatches works:
  1. You label sample pairs (same entity? yes/no)
  2. Glue trains an ML model on your labels
  3. Model predicts matches across full dataset
  4. Returns deduplicated table with match confidence scores

Use cases:
  - Customer deduplication across CRM + ecommerce DB
  - Product catalog matching across different supplier feeds
  - Patient record matching in healthcare

Exam tip: FindMatches is the only Glue component that uses ML itself. If a question mentions “fuzzy matching,” “entity resolution,” or “deduplication with uncertain string matches,” the answer is FindMatches ML Transform.

Glue Cost Model — DPUs Explained

DPU = Data Processing Unit
  = 4 vCPUs + 16 GB RAM
  Billed per DPU-second

Standard Spark job: min 2 DPUs, default 10 DPUs
Python Shell job:   0.0625 DPU (1/16th DPU)

Cost optimization:
  - Use Python Shell for small data (huge cost savings)
  - Right-size DPU count for job size
  - Enable job bookmarks to avoid reprocessing
  - Use Parquet output (less data = fewer DPU-seconds)

Amazon EMR — Deep Dive

What EMR Actually Is

ELI5: EMR is the “bring your own cluster” option for big data. AWS launches EC2 instances and installs the open-source big data stack (Hadoop, Spark, Hive, HBase, Presto, Flink) for you. You get full control over the software, configuration, and cluster lifecycle. It’s more powerful and more complex than Glue.

EMR CLUSTER ANATOMY:
  ┌────────────────────────────────────────────┐
  │  Master Node (1x)                          │
  │  → Coordinates jobs, runs YARN, Hive Meta  │
  │  → m5.xlarge minimum                       │
  ├────────────────────────────────────────────┤
  │  Core Nodes (1+)                           │
  │  → Store HDFS data + run tasks             │
  │  → Scale carefully (holds HDFS)            │
  ├────────────────────────────────────────────┤
  │  Task Nodes (0+)                           │
  │  → Run tasks only, no HDFS                 │
  │  → Can use Spot instances (easily replaced)│
  └────────────────────────────────────────────┘

EMR vs Glue — Discriminative Comparison

┌────────────────────────┬────────────────────────────────────┐
│       AWS GLUE         │          AMAZON EMR                │
├────────────────────────┼────────────────────────────────────┤
│ Serverless (no cluster)│ Managed cluster (you configure)    │
│ Simple ETL pipelines   │ Complex multi-framework pipelines  │
│ DPU-second billing     │ EC2 instance billing               │
│ DynamicFrame API       │ Full Spark / Hive / Presto API     │
│ Limited customization  │ Full software stack control        │
│ No GPU support         │ GPU instances supported            │
│ Built-in catalog       │ Hive Metastore (or use Glue cat.)  │
│ Beginner-friendly      │ Requires big data expertise        │
└────────────────────────┴────────────────────────────────────┘

Use Glue when:

Serverless ETL with no cluster management overhead
Standard transformations on S3 data
Team doesn’t have Spark expertise
Variable or unpredictable job frequency

Use EMR when:

Complex Spark jobs requiring custom configuration
Long-running clusters with mixed workloads
ML training with Spark MLlib or TensorFlow
Full control over Hadoop ecosystem components
GPU-accelerated ML preprocessing (EMR supports GPU instances)
Cost-sensitive large-scale jobs (Spot instances on task nodes)

EMR Deployment Options

Option	Description	Use When
EMR on EC2	Traditional cluster on EC2	Full control, custom config, cost-optimized
EMR on EKS	Spark jobs on Kubernetes	Existing EKS infrastructure, multi-tenant
EMR Serverless	No cluster to manage	Intermittent jobs, simplicity like Glue but Spark

Behind the Scenes: HDFS vs EMRFS

HDFS (Hadoop Distributed File System):
  - Data stored ON cluster nodes (local disks)
  - Fast: compute is co-located with data
  - Problem: cluster must stay running to preserve data
  - Problem: cluster termination = data loss

EMRFS (EMR File System — backed by S3):
  - Data stored in S3, cluster reads/writes via S3 API
  - Cluster can be terminated after job — data persists
  - Decouples compute (EMR) from storage (S3)
  - Small overhead: S3 latency vs local disk
  - Still fast for large batch: sequential reads, multipart

WHY EMRFS WINS for most ML workloads:
  1. Cluster is ephemeral → spin up for job, terminate → no idle cost
  2. Multiple clusters can read same S3 dataset simultaneously
  3. S3 is the data lake — data already there, no HDFS import needed
  4. HDFS only needed when random I/O or iterative algorithms need speed

Spark MLlib on EMR

Distributed ML algorithms: classification, regression, clustering, recommendation
Scales across cluster nodes — trains on datasets too large for single machine
Pipeline API: chain Transformers and Estimators
Integration: train on EMR, deploy model to SageMaker for inference

Amazon Redshift for ML

Redshift ML — Run ML in Your Warehouse

CREATE MODEL predict_churn
FROM (SELECT features... FROM customer_data)
TARGET churn
FUNCTION predict_churn_fn
IAM_ROLE 'arn:aws:iam::...'
SETTINGS (S3_BUCKET 's3://my-bucket');

-- Under the hood:
-- 1. Redshift exports training data to S3
-- 2. Calls SageMaker Autopilot to train best model
-- 3. Registers model as a SQL function in Redshift
-- 4. You call it like any SQL function: SELECT predict_churn_fn(features...)

Why this matters: Redshift ML lets data analysts run ML predictions in SQL without leaving the warehouse. The SageMaker complexity is hidden. Exam tests when this pattern applies: existing data in Redshift, SQL-native team, no need to move data.

Redshift Spectrum — Query S3 from Redshift

Architecture:
  Redshift cluster → Spectrum layer → S3 data lake
                    (thousands of
                     Spectrum nodes,
                     scales automatically)

Use case: JOIN warehouse tables (hot) with historical S3 data (cold)
  SELECT w.customer_id, s.revenue
  FROM warehouse_customers w
  JOIN spectrum_schema.historical_revenue s  -- this table is in S3!
  ON w.id = s.customer_id

Redshift vs Athena vs Glue — When to Use Which

┌────────────────────────────────────────────────────────────┐
│  QUESTION: Do you need a persistent data warehouse?        │
│  YES → Redshift (structured, aggregations, BI queries)     │
│  NO  ↓                                                     │
├────────────────────────────────────────────────────────────┤
│  QUESTION: Is data already in S3, no transforms needed?    │
│  YES → Athena (serverless SQL, pay per scan)               │
│  NO  ↓                                                     │
├────────────────────────────────────────────────────────────┤
│  QUESTION: Need to transform data before storing?          │
│  YES → Glue ETL (batch transform, catalog integration)     │
└────────────────────────────────────────────────────────────┘

Amazon Athena

What Athena Is and Why It Matters

ELI5: Athena is like having a SQL interface to your S3 bucket. You don’t need a database server, you don’t load data anywhere — you just write SQL and Athena goes and reads your S3 files. You pay only for the data you scan. For ML data exploration, this is huge: a data scientist can run ad-hoc SQL on terabytes of raw data without provisioning anything.

Under the hood: Presto/Trino query engine
  → Distributed SQL execution across many nodes
  → Reads S3 files in parallel (CSV, JSON, Parquet, ORC, Avro)
  → Uses Glue Data Catalog for table definitions
  → Results returned in seconds to minutes

Cost: Format Matters Enormously

Query: SELECT avg(revenue) FROM sales WHERE year = 2024

File format: CSV (uncompressed, row-oriented)
  → Scan 500 GB full table
  → Cost: 500 GB × $5/TB = $2.50

File format: Parquet (compressed, columnar, partitioned)
  → Partition pruning: only read year=2024/ prefix (100 GB)
  → Column pruning: only read 'revenue' column (2 GB)
  → Cost: 2 GB × $5/TB = $0.01

Savings: 99.6% cost reduction — same query result

Why this matters for the exam: The combination of Parquet + partitioning + Athena is a canonical exam answer for “cost-effective querying of large datasets in S3.” Know this pattern cold.

Partition Projection — Advanced Optimization

Standard partitions: Athena queries Glue catalog to list all partitions
  → With 10,000 partitions, catalog lookups are slow
  → Must run MSCK REPAIR TABLE to add new partitions

Partition projection: Athena computes partitions from rules, no catalog lookup
  Example config:
    projection.date.type=date
    projection.date.range=2020-01-01,NOW
    projection.date.format=yyyy-MM-dd

  Result: no MSCK REPAIR needed, instant partition discovery
  Use when: partitioned by date with thousands of partitions

Athena Federated Query

Query non-S3 sources: RDS, DynamoDB, Redshift, on-prem via Lambda connector
Join data across sources in single SQL
Use case: JOIN S3 feature data with live RDS customer records

AWS Step Functions

Orchestrating ML Workflows

ML Pipeline as State Machine:
  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Glue    │ →  │SageMaker │ →  │  Model   │ →  │ Notify   │
  │  ETL Job │    │ Training │    │Evaluation│    │  Team    │
  └──────────┘    └──────────┘    └──────────┘    └──────────┘
        │                               │
        └── FAIL: Alert + Stop          └── IF accuracy < 0.8: retrain

Key features for ML pipelines:

Retry logic with exponential backoff (transient failures)
Parallel branches (train multiple model variants simultaneously)
Wait states (poll for SageMaker training job completion)
Error handling (catch failures, route to notification)
Visual workflow editor

Step Functions vs SageMaker Pipelines vs Airflow

Factor	Step Functions	SageMaker Pipelines	MWAA (Airflow)
Scope	General AWS orchestration	ML-specific (SM only)	General workflow DAGs
ML awareness	Low (call any AWS API)	High (native SM steps)	Low (operators needed)
Complexity	Medium	Low (SM-focused)	High (Airflow setup)
Lineage/tracking	None built-in	Full ML lineage	None built-in
Best for	Multi-service ML pipelines	Pure SageMaker MLOps	Existing Airflow teams

Discriminative example: ML team needs lineage tracking, experiment comparison, and model registry → SageMaker Pipelines. ML pipeline calls Glue + SageMaker + Lambda + SNS → Step Functions (spans multiple services). Team already uses Airflow → MWAA.

AWS Data Pipeline (Legacy)

Older managed ETL and data movement service
Still appears on some exam questions
Moves data between AWS storage services (S3 ↔ RDS ↔ Redshift ↔ EMR)
Replaced in practice by: Glue Workflows, Step Functions, EventBridge Pipes

Exam tip: If AWS Data Pipeline appears, the question usually tests whether you know it’s for data movement/orchestration (not transformation). Modern replacement is Step Functions + Glue.

Data Quality & Validation

Why Data Quality is Critical for ML

Garbage In → Garbage Out

Model trained on dirty data:
  - Missing features → model learns wrong patterns
  - Label errors → model learns wrong targets
  - Distribution drift → model degrades silently in production
  - Duplicates → model overfits to repeated examples

PyDeequ — Data Quality on Glue/EMR

Apache Deequ (from Amazon, open-sourced)
PyDeequ = Python wrapper for Glue and EMR

Key concepts:
  Analyzer:   compute metrics (completeness, mean, stddev, correlation)
  Check:      define constraints (completeness > 0.9, values in range)
  Verification: run checks, get pass/fail report
  Anomaly Detection: flag if today's metrics diverge from historical

Example constraint:
  check.hasCompleteness("age", _ >= 0.95)    # 95%+ non-null
  check.isNonNegative("revenue")             # no negative revenue
  check.isContainedIn("country", valid_list) # valid country codes

Why this matters for the exam: Data quality gates are a best practice in ML pipelines. If a question mentions “ensure training data meets quality standards before model training,” the pattern is Glue ETL + PyDeequ checks + fail pipeline if checks don’t pass.

Key Integration Patterns

Pattern 1: Batch ML Pipeline

S3 (raw data)
  → EventBridge (scheduled: daily 2am)
      → Step Functions
          ├── Glue Crawler (update catalog)
          ├── Glue ETL Job (transform → Parquet)
          ├── PyDeequ quality check (gate)
          └── SageMaker Training Job
                → S3 (model artifacts)
                → Model Registry

Pattern 2: Streaming ML Pipeline

IoT Sensors / App events
  → Kinesis Data Streams
      ├── Managed Apache Flink (real-time feature engineering)
      │     → Kinesis Data Streams (enriched events)
      │         → SageMaker real-time endpoint (inference)
      └── Firehose (buffer, convert JSON→Parquet)
            → S3 (raw archive + training data)

Pattern 3: Event-Driven ETL

File arrives in S3
  → S3 Event Notification
      → EventBridge rule (e.g., suffix = .csv, prefix = uploads/)
          → Step Functions workflow
              ├── Glue ETL Job (validate + transform)
              ├── PyDeequ check
              └── On success: move to processed/ prefix
                  On failure: move to quarantine/ prefix + SNS alert

Pattern 4: Lakehouse with Query Layers

Raw data → S3 raw layer (any format)
  → Glue Crawler → Data Catalog
      ├── Athena (ad-hoc exploration, pay-per-query)
      ├── Redshift Spectrum (join with warehouse data)
      └── Glue ETL (transform to curated layer)
            → S3 curated layer (Parquet, partitioned)
                ├── Athena (cost-optimized queries)
                └── SageMaker Training (ML-ready data)

Quick Reference: Scenario → Service

Scenario	Service
Discover schema of S3 data automatically	Glue Crawler
Central metadata repository for data lake	Glue Data Catalog
Batch ETL, serverless, Spark-based	Glue ETL Job
Visual no-code data preparation	Glue DataBrew
Fuzzy deduplication / entity matching	Glue FindMatches
Only process new S3 data since last run	Glue Job Bookmarks
Complex Spark, full cluster control	EMR
GPU-accelerated distributed ML preprocessing	EMR (GPU instances)
Serverless Spark (no Glue DynamicFrame needed)	EMR Serverless
SQL on S3, pay per scan	Athena
Reduce Athena cost on large dataset	Parquet + partitioning
Join S3 data with Redshift tables	Redshift Spectrum
ML predictions in SQL without leaving warehouse	Redshift ML
Orchestrate multi-service ML pipeline	Step Functions
ML pipeline with lineage, experiment tracking	SageMaker Pipelines
Existing Airflow team migrating to AWS	MWAA
Data quality gates before model training	PyDeequ on Glue
Legacy data movement between AWS services	AWS Data Pipeline
Inconsistent/mixed-type source data	Glue DynamicFrame
Daily incremental feature pipeline	Glue + Job Bookmarks
Real-time stream transformation before S3	Firehose + Lambda

Data Transformation & ETL#

ETL vs ELT — First Principles#

ETL vs ELT#

AWS Glue — Deep Dive#

What Glue Actually Is#

Component 1: Glue Crawlers#

Component 2: Glue Data Catalog#

Component 3: Glue ETL — DynamicFrame vs DataFrame#

Component 4: Glue Job Types#

Component 5: Job Bookmarks — Incremental Processing#

Component 6: Glue DataBrew#

Component 7: FindMatches ML Transform#

Glue Cost Model — DPUs Explained#

Amazon EMR — Deep Dive#

What EMR Actually Is#

EMR vs Glue — Discriminative Comparison#

EMR Deployment Options#

Behind the Scenes: HDFS vs EMRFS#

Spark MLlib on EMR#

Amazon Redshift for ML#

Redshift ML — Run ML in Your Warehouse#

Redshift Spectrum — Query S3 from Redshift#

Redshift vs Athena vs Glue — When to Use Which#

Amazon Athena#

What Athena Is and Why It Matters#

Cost: Format Matters Enormously#

Partition Projection — Advanced Optimization#

Athena Federated Query#

AWS Step Functions#

Orchestrating ML Workflows#

Step Functions vs SageMaker Pipelines vs Airflow#

AWS Data Pipeline (Legacy)#

Data Quality & Validation#

Why Data Quality is Critical for ML#

PyDeequ — Data Quality on Glue/EMR#

Key Integration Patterns#

Pattern 1: Batch ML Pipeline#

Pattern 2: Streaming ML Pipeline#

Pattern 3: Event-Driven ETL#

Pattern 4: Lakehouse with Query Layers#

Quick Reference: Scenario → Service#