Domain 1B: Data Transformation & ETL
Data Transformation & ETL
Exam Domain: 1 — Data Engineering (20%) Task: Transform and prepare data for ML workloads
ETL vs ELT — First Principles
Why transform data at all?
ELI5: Raw data is like ingredients from the grocery store — you can’t serve them directly. Tomatoes have stems, meat has bones, flour is unmeasured. ML algorithms are picky chefs: they only accept clean, numeric, normalized “meals.” Transformation is the cooking step that turns raw ingredients into something the model can consume.
Raw problems in real data:
✗ Missing values (NULL, NaN, empty string)
✗ Wrong types (date stored as string "01/02/24")
✗ Inconsistent units (km vs miles, USD vs EUR)
✗ Duplicates (same customer row 3 times)
✗ Outliers (age = 999, salary = -1)
✗ Categorical text (must become numbers for most algorithms)
✗ Different scales (income 0–200,000 vs age 0–100)
ETL vs ELT
┌─────────────────────────────────────────────────────────────┐
│ ETL — Extract, Transform, Load │
│ │
│ Source → [Extract] → [Transform on separate engine] → │
│ [Load clean data into target] │
│ │
│ Transform happens BEFORE landing in destination │
│ Good when: destination is a rigid warehouse (Redshift) │
│ Example: Glue ETL writes clean Parquet to S3 │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ ELT — Extract, Load, Transform │
│ │
│ Source → [Extract] → [Load raw into data lake] → │
│ [Transform inside destination with SQL] │
│ │
│ Transform happens INSIDE the destination │
│ Good when: destination is powerful enough (Redshift, EMR) │
│ Example: raw JSON lands in S3, Athena + dbt transforms it │
└─────────────────────────────────────────────────────────────┘
| Factor | ETL | ELT |
|---|---|---|
| When to transform | Before load | After load |
| Raw data stored? | No | Yes |
| Schema flexibility | Lower | Higher |
| Destination power needed | Low | High |
| Reprocessing raw? | Hard | Easy (raw still there) |
| AWS example | Glue Job → S3 | S3 raw → Athena SQL |
When ETL wins: Strict compliance — raw data must never land in the destination. Destination is a rigid system (relational DB, Redshift). Clean data needed before any downstream access.
When ELT wins: You want to keep raw data and re-derive as understanding evolves. Destination is powerful (Redshift Spectrum, Athena, Spark). Data scientists need access to raw data for exploration.
AWS Glue — Deep Dive
What Glue Actually Is
ELI5: AWS Glue is a managed Apache Spark service wrapped in a friendly AWS console. When you write a Glue ETL job, you’re writing Spark code. AWS provisions the Spark cluster, runs your job, charges you per DPU-second, and tears the cluster down. You get all the power of distributed Spark without managing a cluster.
GLUE COMPONENTS MAP:
Data Sources ──► [Crawlers] ──► [Data Catalog] ──► [ETL Jobs]
discover central metadata transform data
schema repository
│
└──► [Triggers] ──► [Workflows]
schedule multi-job pipelines
Component 1: Glue Crawlers
How Crawlers work (behind the scenes):
1. Crawler runs against data source (S3, RDS, Redshift, DynamoDB)
2. Samples files → applies classifiers (built-in: CSV, JSON, Parquet, etc.)
3. Detects delimiter, column names, data types
4. Detects partitions → reads path structure s3://bucket/year=2024/month=01/
→ creates partition keys automatically in catalog
5. Writes table definition to Glue Data Catalog
Update behavior:
- New columns? → ADD_NEW_COLUMNS (default) or UPDATE_IN_DATABASE
- Incompatible? → LOG (don't break catalog)
- Deleted columns? → DELETE_FROM_DATABASE (careful!)
Exam tip: Crawlers detect partitions from S3 path structure automatically. If data is partitioned by date, the crawler will create partition keys for year/month/day in the catalog. This is critical for Athena partition pruning.
Component 2: Glue Data Catalog
ELI5: The Glue Data Catalog is the card catalog for your data lake. Just as a library card catalog tells you which shelf holds a book without you having to search every shelf, the Data Catalog tells Athena, Glue, and EMR exactly where your data lives, what format it’s in, and what columns it has — without those services having to scan your S3 bucket to find out.
Structure:
Database → Tables → Partitions
│
└── Location: s3://bucket/prefix/
Schema: {col1: string, col2: int, ...}
Format: Parquet / CSV / JSON
SerDe: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Shared by:
├── Athena (queries the catalog to know S3 locations)
├── EMR (Hive Metastore compatibility)
├── Redshift (Spectrum uses catalog for external tables)
└── Glue ETL (reads table definitions for source/target)
Component 3: Glue ETL — DynamicFrame vs DataFrame
DataFrame (Apache Spark standard):
- Requires fixed schema upfront
- Error if column type is inconsistent across rows
- Best for: clean, uniform data
DynamicFrame (Glue extension):
- Schema per record — each row can have different types/columns
- resolveChoice(): handle ambiguous types (cast, project, make_struct)
- relationalize(): flatten nested JSON into flat tables
- Best for: messy real-world data with schema inconsistencies
Example:
CSV has 'age' column — some rows have integers, some have "N/A" strings
DataFrame: FAILS on type mismatch
DynamicFrame: Creates choice type, resolveChoice() to handle
Why this matters for the exam: DynamicFrame exists because real-world data is messy. When a question describes inconsistent schemas or mixed types in source data, the answer involves DynamicFrame + resolveChoice.
Component 4: Glue Job Types
| Job Type | Runtime | Use Case |
|---|---|---|
| Spark | Apache Spark | Large-scale batch ETL, terabytes |
| Spark Streaming | Spark Structured Streaming | Micro-batch on Kinesis/Kafka |
| Python Shell | Python (no Spark) | Lightweight scripts, small data, < 1GB |
| Ray | Distributed Python (Ray) | Parallel Python, ML preprocessing |
Component 5: Job Bookmarks — Incremental Processing
Problem: You run a Glue job daily on S3.
Without bookmarks: reprocesses ALL files every run (expensive)
With bookmarks: tracks which files were processed last run
→ only processes NEW files since last checkpoint
How it works:
Glue stores a bookmark in AWS (not S3) with:
- Last run's S3 object keys and modification timestamps
Next run:
- Only reads files with modification time AFTER last bookmark
- Appends to target, doesn't re-write existing data
ML use case:
Daily feature engineering pipeline: new training events → feature store
Don't reprocess last month's data → use bookmarks
Exam tip: Job bookmarks = incremental ETL on S3. When a question mentions “only process new data since last run” or “avoid reprocessing,” the answer is Glue job bookmarks.
Component 6: Glue DataBrew
- Visual, no-code data preparation
- 250+ built-in transformations (normalize, pivot, parse dates, etc.)
- Profile data (statistics, missing values, outliers) before transforming
- Publishes transformation “recipe” that can be re-applied
- Target users: data analysts and data scientists (no Spark code needed)
Discriminative example: Data engineer needs complex custom Spark transformations → Glue ETL. Data analyst wants to clean CSV visually without code → Glue DataBrew.
Component 7: FindMatches ML Transform
Problem: Fuzzy deduplication — "John Smith" vs "J. Smith" vs "Jon Smyth"
Standard SQL JOIN can't match these without exact string equality
How FindMatches works:
1. You label sample pairs (same entity? yes/no)
2. Glue trains an ML model on your labels
3. Model predicts matches across full dataset
4. Returns deduplicated table with match confidence scores
Use cases:
- Customer deduplication across CRM + ecommerce DB
- Product catalog matching across different supplier feeds
- Patient record matching in healthcare
Exam tip: FindMatches is the only Glue component that uses ML itself. If a question mentions “fuzzy matching,” “entity resolution,” or “deduplication with uncertain string matches,” the answer is FindMatches ML Transform.
Glue Cost Model — DPUs Explained
DPU = Data Processing Unit
= 4 vCPUs + 16 GB RAM
Billed per DPU-second
Standard Spark job: min 2 DPUs, default 10 DPUs
Python Shell job: 0.0625 DPU (1/16th DPU)
Cost optimization:
- Use Python Shell for small data (huge cost savings)
- Right-size DPU count for job size
- Enable job bookmarks to avoid reprocessing
- Use Parquet output (less data = fewer DPU-seconds)
Amazon EMR — Deep Dive
What EMR Actually Is
ELI5: EMR is the “bring your own cluster” option for big data. AWS launches EC2 instances and installs the open-source big data stack (Hadoop, Spark, Hive, HBase, Presto, Flink) for you. You get full control over the software, configuration, and cluster lifecycle. It’s more powerful and more complex than Glue.
EMR CLUSTER ANATOMY:
┌────────────────────────────────────────────┐
│ Master Node (1x) │
│ → Coordinates jobs, runs YARN, Hive Meta │
│ → m5.xlarge minimum │
├────────────────────────────────────────────┤
│ Core Nodes (1+) │
│ → Store HDFS data + run tasks │
│ → Scale carefully (holds HDFS) │
├────────────────────────────────────────────┤
│ Task Nodes (0+) │
│ → Run tasks only, no HDFS │
│ → Can use Spot instances (easily replaced)│
└────────────────────────────────────────────┘
EMR vs Glue — Discriminative Comparison
┌────────────────────────┬────────────────────────────────────┐
│ AWS GLUE │ AMAZON EMR │
├────────────────────────┼────────────────────────────────────┤
│ Serverless (no cluster)│ Managed cluster (you configure) │
│ Simple ETL pipelines │ Complex multi-framework pipelines │
│ DPU-second billing │ EC2 instance billing │
│ DynamicFrame API │ Full Spark / Hive / Presto API │
│ Limited customization │ Full software stack control │
│ No GPU support │ GPU instances supported │
│ Built-in catalog │ Hive Metastore (or use Glue cat.) │
│ Beginner-friendly │ Requires big data expertise │
└────────────────────────┴────────────────────────────────────┘
Use Glue when:
- Serverless ETL with no cluster management overhead
- Standard transformations on S3 data
- Team doesn’t have Spark expertise
- Variable or unpredictable job frequency
Use EMR when:
- Complex Spark jobs requiring custom configuration
- Long-running clusters with mixed workloads
- ML training with Spark MLlib or TensorFlow
- Full control over Hadoop ecosystem components
- GPU-accelerated ML preprocessing (EMR supports GPU instances)
- Cost-sensitive large-scale jobs (Spot instances on task nodes)
EMR Deployment Options
| Option | Description | Use When |
|---|---|---|
| EMR on EC2 | Traditional cluster on EC2 | Full control, custom config, cost-optimized |
| EMR on EKS | Spark jobs on Kubernetes | Existing EKS infrastructure, multi-tenant |
| EMR Serverless | No cluster to manage | Intermittent jobs, simplicity like Glue but Spark |
Behind the Scenes: HDFS vs EMRFS
HDFS (Hadoop Distributed File System):
- Data stored ON cluster nodes (local disks)
- Fast: compute is co-located with data
- Problem: cluster must stay running to preserve data
- Problem: cluster termination = data loss
EMRFS (EMR File System — backed by S3):
- Data stored in S3, cluster reads/writes via S3 API
- Cluster can be terminated after job — data persists
- Decouples compute (EMR) from storage (S3)
- Small overhead: S3 latency vs local disk
- Still fast for large batch: sequential reads, multipart
WHY EMRFS WINS for most ML workloads:
1. Cluster is ephemeral → spin up for job, terminate → no idle cost
2. Multiple clusters can read same S3 dataset simultaneously
3. S3 is the data lake — data already there, no HDFS import needed
4. HDFS only needed when random I/O or iterative algorithms need speed
Spark MLlib on EMR
- Distributed ML algorithms: classification, regression, clustering, recommendation
- Scales across cluster nodes — trains on datasets too large for single machine
- Pipeline API: chain Transformers and Estimators
- Integration: train on EMR, deploy model to SageMaker for inference
Amazon Redshift for ML
Redshift ML — Run ML in Your Warehouse
CREATE MODEL predict_churn
FROM (SELECT features... FROM customer_data)
TARGET churn
FUNCTION predict_churn_fn
IAM_ROLE 'arn:aws:iam::...'
SETTINGS (S3_BUCKET 's3://my-bucket');
-- Under the hood:
-- 1. Redshift exports training data to S3
-- 2. Calls SageMaker Autopilot to train best model
-- 3. Registers model as a SQL function in Redshift
-- 4. You call it like any SQL function: SELECT predict_churn_fn(features...)
Why this matters: Redshift ML lets data analysts run ML predictions in SQL without leaving the warehouse. The SageMaker complexity is hidden. Exam tests when this pattern applies: existing data in Redshift, SQL-native team, no need to move data.
Redshift Spectrum — Query S3 from Redshift
Architecture:
Redshift cluster → Spectrum layer → S3 data lake
(thousands of
Spectrum nodes,
scales automatically)
Use case: JOIN warehouse tables (hot) with historical S3 data (cold)
SELECT w.customer_id, s.revenue
FROM warehouse_customers w
JOIN spectrum_schema.historical_revenue s -- this table is in S3!
ON w.id = s.customer_id
Redshift vs Athena vs Glue — When to Use Which
┌────────────────────────────────────────────────────────────┐
│ QUESTION: Do you need a persistent data warehouse? │
│ YES → Redshift (structured, aggregations, BI queries) │
│ NO ↓ │
├────────────────────────────────────────────────────────────┤
│ QUESTION: Is data already in S3, no transforms needed? │
│ YES → Athena (serverless SQL, pay per scan) │
│ NO ↓ │
├────────────────────────────────────────────────────────────┤
│ QUESTION: Need to transform data before storing? │
│ YES → Glue ETL (batch transform, catalog integration) │
└────────────────────────────────────────────────────────────┘
Amazon Athena
What Athena Is and Why It Matters
ELI5: Athena is like having a SQL interface to your S3 bucket. You don’t need a database server, you don’t load data anywhere — you just write SQL and Athena goes and reads your S3 files. You pay only for the data you scan. For ML data exploration, this is huge: a data scientist can run ad-hoc SQL on terabytes of raw data without provisioning anything.
Under the hood: Presto/Trino query engine
→ Distributed SQL execution across many nodes
→ Reads S3 files in parallel (CSV, JSON, Parquet, ORC, Avro)
→ Uses Glue Data Catalog for table definitions
→ Results returned in seconds to minutes
Cost: Format Matters Enormously
Query: SELECT avg(revenue) FROM sales WHERE year = 2024
File format: CSV (uncompressed, row-oriented)
→ Scan 500 GB full table
→ Cost: 500 GB × $5/TB = $2.50
File format: Parquet (compressed, columnar, partitioned)
→ Partition pruning: only read year=2024/ prefix (100 GB)
→ Column pruning: only read 'revenue' column (2 GB)
→ Cost: 2 GB × $5/TB = $0.01
Savings: 99.6% cost reduction — same query result
Why this matters for the exam: The combination of Parquet + partitioning + Athena is a canonical exam answer for “cost-effective querying of large datasets in S3.” Know this pattern cold.
Partition Projection — Advanced Optimization
Standard partitions: Athena queries Glue catalog to list all partitions
→ With 10,000 partitions, catalog lookups are slow
→ Must run MSCK REPAIR TABLE to add new partitions
Partition projection: Athena computes partitions from rules, no catalog lookup
Example config:
projection.date.type=date
projection.date.range=2020-01-01,NOW
projection.date.format=yyyy-MM-dd
Result: no MSCK REPAIR needed, instant partition discovery
Use when: partitioned by date with thousands of partitions
Athena Federated Query
- Query non-S3 sources: RDS, DynamoDB, Redshift, on-prem via Lambda connector
- Join data across sources in single SQL
- Use case: JOIN S3 feature data with live RDS customer records
AWS Step Functions
Orchestrating ML Workflows
ML Pipeline as State Machine:
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Glue │ → │SageMaker │ → │ Model │ → │ Notify │
│ ETL Job │ │ Training │ │Evaluation│ │ Team │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
│ │
└── FAIL: Alert + Stop └── IF accuracy < 0.8: retrain
Key features for ML pipelines:
- Retry logic with exponential backoff (transient failures)
- Parallel branches (train multiple model variants simultaneously)
- Wait states (poll for SageMaker training job completion)
- Error handling (catch failures, route to notification)
- Visual workflow editor
Step Functions vs SageMaker Pipelines vs Airflow
| Factor | Step Functions | SageMaker Pipelines | MWAA (Airflow) |
|---|---|---|---|
| Scope | General AWS orchestration | ML-specific (SM only) | General workflow DAGs |
| ML awareness | Low (call any AWS API) | High (native SM steps) | Low (operators needed) |
| Complexity | Medium | Low (SM-focused) | High (Airflow setup) |
| Lineage/tracking | None built-in | Full ML lineage | None built-in |
| Best for | Multi-service ML pipelines | Pure SageMaker MLOps | Existing Airflow teams |
Discriminative example: ML team needs lineage tracking, experiment comparison, and model registry → SageMaker Pipelines. ML pipeline calls Glue + SageMaker + Lambda + SNS → Step Functions (spans multiple services). Team already uses Airflow → MWAA.
AWS Data Pipeline (Legacy)
- Older managed ETL and data movement service
- Still appears on some exam questions
- Moves data between AWS storage services (S3 ↔ RDS ↔ Redshift ↔ EMR)
- Replaced in practice by: Glue Workflows, Step Functions, EventBridge Pipes
Exam tip: If AWS Data Pipeline appears, the question usually tests whether you know it’s for data movement/orchestration (not transformation). Modern replacement is Step Functions + Glue.
Data Quality & Validation
Why Data Quality is Critical for ML
Garbage In → Garbage Out
Model trained on dirty data:
- Missing features → model learns wrong patterns
- Label errors → model learns wrong targets
- Distribution drift → model degrades silently in production
- Duplicates → model overfits to repeated examples
PyDeequ — Data Quality on Glue/EMR
Apache Deequ (from Amazon, open-sourced)
PyDeequ = Python wrapper for Glue and EMR
Key concepts:
Analyzer: compute metrics (completeness, mean, stddev, correlation)
Check: define constraints (completeness > 0.9, values in range)
Verification: run checks, get pass/fail report
Anomaly Detection: flag if today's metrics diverge from historical
Example constraint:
check.hasCompleteness("age", _ >= 0.95) # 95%+ non-null
check.isNonNegative("revenue") # no negative revenue
check.isContainedIn("country", valid_list) # valid country codes
Why this matters for the exam: Data quality gates are a best practice in ML pipelines. If a question mentions “ensure training data meets quality standards before model training,” the pattern is Glue ETL + PyDeequ checks + fail pipeline if checks don’t pass.
Key Integration Patterns
Pattern 1: Batch ML Pipeline
S3 (raw data)
→ EventBridge (scheduled: daily 2am)
→ Step Functions
├── Glue Crawler (update catalog)
├── Glue ETL Job (transform → Parquet)
├── PyDeequ quality check (gate)
└── SageMaker Training Job
→ S3 (model artifacts)
→ Model Registry
Pattern 2: Streaming ML Pipeline
IoT Sensors / App events
→ Kinesis Data Streams
├── Managed Apache Flink (real-time feature engineering)
│ → Kinesis Data Streams (enriched events)
│ → SageMaker real-time endpoint (inference)
└── Firehose (buffer, convert JSON→Parquet)
→ S3 (raw archive + training data)
Pattern 3: Event-Driven ETL
File arrives in S3
→ S3 Event Notification
→ EventBridge rule (e.g., suffix = .csv, prefix = uploads/)
→ Step Functions workflow
├── Glue ETL Job (validate + transform)
├── PyDeequ check
└── On success: move to processed/ prefix
On failure: move to quarantine/ prefix + SNS alert
Pattern 4: Lakehouse with Query Layers
Raw data → S3 raw layer (any format)
→ Glue Crawler → Data Catalog
├── Athena (ad-hoc exploration, pay-per-query)
├── Redshift Spectrum (join with warehouse data)
└── Glue ETL (transform to curated layer)
→ S3 curated layer (Parquet, partitioned)
├── Athena (cost-optimized queries)
└── SageMaker Training (ML-ready data)
Quick Reference: Scenario → Service
| Scenario | Service |
|---|---|
| Discover schema of S3 data automatically | Glue Crawler |
| Central metadata repository for data lake | Glue Data Catalog |
| Batch ETL, serverless, Spark-based | Glue ETL Job |
| Visual no-code data preparation | Glue DataBrew |
| Fuzzy deduplication / entity matching | Glue FindMatches |
| Only process new S3 data since last run | Glue Job Bookmarks |
| Complex Spark, full cluster control | EMR |
| GPU-accelerated distributed ML preprocessing | EMR (GPU instances) |
| Serverless Spark (no Glue DynamicFrame needed) | EMR Serverless |
| SQL on S3, pay per scan | Athena |
| Reduce Athena cost on large dataset | Parquet + partitioning |
| Join S3 data with Redshift tables | Redshift Spectrum |
| ML predictions in SQL without leaving warehouse | Redshift ML |
| Orchestrate multi-service ML pipeline | Step Functions |
| ML pipeline with lineage, experiment tracking | SageMaker Pipelines |
| Existing Airflow team migrating to AWS | MWAA |
| Data quality gates before model training | PyDeequ on Glue |
| Legacy data movement between AWS services | AWS Data Pipeline |
| Inconsistent/mixed-type source data | Glue DynamicFrame |
| Daily incremental feature pipeline | Glue + Job Bookmarks |
| Real-time stream transformation before S3 | Firehose + Lambda |