← AWS MLS-C01 — ML Specialty

Domain 1B: Data Transformation & ETL

Data Transformation & ETL

Exam Domain: 1 — Data Engineering (20%) Task: Transform and prepare data for ML workloads


ETL vs ELT — First Principles

Why transform data at all?

ELI5: Raw data is like ingredients from the grocery store — you can’t serve them directly. Tomatoes have stems, meat has bones, flour is unmeasured. ML algorithms are picky chefs: they only accept clean, numeric, normalized “meals.” Transformation is the cooking step that turns raw ingredients into something the model can consume.

Raw problems in real data:
  ✗ Missing values (NULL, NaN, empty string)
  ✗ Wrong types (date stored as string "01/02/24")
  ✗ Inconsistent units (km vs miles, USD vs EUR)
  ✗ Duplicates (same customer row 3 times)
  ✗ Outliers (age = 999, salary = -1)
  ✗ Categorical text (must become numbers for most algorithms)
  ✗ Different scales (income 0–200,000 vs age 0–100)

ETL vs ELT

┌─────────────────────────────────────────────────────────────┐
│  ETL — Extract, Transform, Load                             │
│                                                             │
│  Source → [Extract] → [Transform on separate engine] →     │
│           [Load clean data into target]                     │
│                                                             │
│  Transform happens BEFORE landing in destination            │
│  Good when: destination is a rigid warehouse (Redshift)     │
│  Example: Glue ETL writes clean Parquet to S3               │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│  ELT — Extract, Load, Transform                             │
│                                                             │
│  Source → [Extract] → [Load raw into data lake] →          │
│           [Transform inside destination with SQL]           │
│                                                             │
│  Transform happens INSIDE the destination                   │
│  Good when: destination is powerful enough (Redshift, EMR)  │
│  Example: raw JSON lands in S3, Athena + dbt transforms it  │
└─────────────────────────────────────────────────────────────┘
FactorETLELT
When to transformBefore loadAfter load
Raw data stored?NoYes
Schema flexibilityLowerHigher
Destination power neededLowHigh
Reprocessing raw?HardEasy (raw still there)
AWS exampleGlue Job → S3S3 raw → Athena SQL

When ETL wins: Strict compliance — raw data must never land in the destination. Destination is a rigid system (relational DB, Redshift). Clean data needed before any downstream access.

When ELT wins: You want to keep raw data and re-derive as understanding evolves. Destination is powerful (Redshift Spectrum, Athena, Spark). Data scientists need access to raw data for exploration.


AWS Glue — Deep Dive

What Glue Actually Is

ELI5: AWS Glue is a managed Apache Spark service wrapped in a friendly AWS console. When you write a Glue ETL job, you’re writing Spark code. AWS provisions the Spark cluster, runs your job, charges you per DPU-second, and tears the cluster down. You get all the power of distributed Spark without managing a cluster.

GLUE COMPONENTS MAP:

  Data Sources ──► [Crawlers] ──► [Data Catalog] ──► [ETL Jobs]
                    discover       central metadata   transform data
                    schema         repository
                         │
                         └──► [Triggers] ──► [Workflows]
                               schedule       multi-job pipelines

Component 1: Glue Crawlers

How Crawlers work (behind the scenes):
  1. Crawler runs against data source (S3, RDS, Redshift, DynamoDB)
  2. Samples files → applies classifiers (built-in: CSV, JSON, Parquet, etc.)
  3. Detects delimiter, column names, data types
  4. Detects partitions → reads path structure s3://bucket/year=2024/month=01/
     → creates partition keys automatically in catalog
  5. Writes table definition to Glue Data Catalog

Update behavior:
  - New columns?     → ADD_NEW_COLUMNS (default) or UPDATE_IN_DATABASE
  - Incompatible?    → LOG (don't break catalog)
  - Deleted columns? → DELETE_FROM_DATABASE (careful!)

Exam tip: Crawlers detect partitions from S3 path structure automatically. If data is partitioned by date, the crawler will create partition keys for year/month/day in the catalog. This is critical for Athena partition pruning.

Component 2: Glue Data Catalog

ELI5: The Glue Data Catalog is the card catalog for your data lake. Just as a library card catalog tells you which shelf holds a book without you having to search every shelf, the Data Catalog tells Athena, Glue, and EMR exactly where your data lives, what format it’s in, and what columns it has — without those services having to scan your S3 bucket to find out.

Structure:
  Database → Tables → Partitions
                          │
                          └── Location: s3://bucket/prefix/
                              Schema: {col1: string, col2: int, ...}
                              Format: Parquet / CSV / JSON
                              SerDe: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Shared by:
  ├── Athena     (queries the catalog to know S3 locations)
  ├── EMR        (Hive Metastore compatibility)
  ├── Redshift   (Spectrum uses catalog for external tables)
  └── Glue ETL   (reads table definitions for source/target)

Component 3: Glue ETL — DynamicFrame vs DataFrame

DataFrame (Apache Spark standard):
  - Requires fixed schema upfront
  - Error if column type is inconsistent across rows
  - Best for: clean, uniform data

DynamicFrame (Glue extension):
  - Schema per record — each row can have different types/columns
  - resolveChoice(): handle ambiguous types (cast, project, make_struct)
  - relationalize(): flatten nested JSON into flat tables
  - Best for: messy real-world data with schema inconsistencies

Example:
  CSV has 'age' column — some rows have integers, some have "N/A" strings
  DataFrame: FAILS on type mismatch
  DynamicFrame: Creates choice type, resolveChoice() to handle

Why this matters for the exam: DynamicFrame exists because real-world data is messy. When a question describes inconsistent schemas or mixed types in source data, the answer involves DynamicFrame + resolveChoice.

Component 4: Glue Job Types

Job TypeRuntimeUse Case
SparkApache SparkLarge-scale batch ETL, terabytes
Spark StreamingSpark Structured StreamingMicro-batch on Kinesis/Kafka
Python ShellPython (no Spark)Lightweight scripts, small data, < 1GB
RayDistributed Python (Ray)Parallel Python, ML preprocessing

Component 5: Job Bookmarks — Incremental Processing

Problem: You run a Glue job daily on S3.
         Without bookmarks: reprocesses ALL files every run (expensive)
         With bookmarks: tracks which files were processed last run
                         → only processes NEW files since last checkpoint

How it works:
  Glue stores a bookmark in AWS (not S3) with:
    - Last run's S3 object keys and modification timestamps
  Next run:
    - Only reads files with modification time AFTER last bookmark
    - Appends to target, doesn't re-write existing data

ML use case:
  Daily feature engineering pipeline: new training events → feature store
  Don't reprocess last month's data → use bookmarks

Exam tip: Job bookmarks = incremental ETL on S3. When a question mentions “only process new data since last run” or “avoid reprocessing,” the answer is Glue job bookmarks.

Component 6: Glue DataBrew

  • Visual, no-code data preparation
  • 250+ built-in transformations (normalize, pivot, parse dates, etc.)
  • Profile data (statistics, missing values, outliers) before transforming
  • Publishes transformation “recipe” that can be re-applied
  • Target users: data analysts and data scientists (no Spark code needed)

Discriminative example: Data engineer needs complex custom Spark transformations → Glue ETL. Data analyst wants to clean CSV visually without code → Glue DataBrew.

Component 7: FindMatches ML Transform

Problem: Fuzzy deduplication — "John Smith" vs "J. Smith" vs "Jon Smyth"
         Standard SQL JOIN can't match these without exact string equality

How FindMatches works:
  1. You label sample pairs (same entity? yes/no)
  2. Glue trains an ML model on your labels
  3. Model predicts matches across full dataset
  4. Returns deduplicated table with match confidence scores

Use cases:
  - Customer deduplication across CRM + ecommerce DB
  - Product catalog matching across different supplier feeds
  - Patient record matching in healthcare

Exam tip: FindMatches is the only Glue component that uses ML itself. If a question mentions “fuzzy matching,” “entity resolution,” or “deduplication with uncertain string matches,” the answer is FindMatches ML Transform.

Glue Cost Model — DPUs Explained

DPU = Data Processing Unit
  = 4 vCPUs + 16 GB RAM
  Billed per DPU-second

Standard Spark job: min 2 DPUs, default 10 DPUs
Python Shell job:   0.0625 DPU (1/16th DPU)

Cost optimization:
  - Use Python Shell for small data (huge cost savings)
  - Right-size DPU count for job size
  - Enable job bookmarks to avoid reprocessing
  - Use Parquet output (less data = fewer DPU-seconds)

Amazon EMR — Deep Dive

What EMR Actually Is

ELI5: EMR is the “bring your own cluster” option for big data. AWS launches EC2 instances and installs the open-source big data stack (Hadoop, Spark, Hive, HBase, Presto, Flink) for you. You get full control over the software, configuration, and cluster lifecycle. It’s more powerful and more complex than Glue.

EMR CLUSTER ANATOMY:
  ┌────────────────────────────────────────────┐
  │  Master Node (1x)                          │
  │  → Coordinates jobs, runs YARN, Hive Meta  │
  │  → m5.xlarge minimum                       │
  ├────────────────────────────────────────────┤
  │  Core Nodes (1+)                           │
  │  → Store HDFS data + run tasks             │
  │  → Scale carefully (holds HDFS)            │
  ├────────────────────────────────────────────┤
  │  Task Nodes (0+)                           │
  │  → Run tasks only, no HDFS                 │
  │  → Can use Spot instances (easily replaced)│
  └────────────────────────────────────────────┘

EMR vs Glue — Discriminative Comparison

┌────────────────────────┬────────────────────────────────────┐
│       AWS GLUE         │          AMAZON EMR                │
├────────────────────────┼────────────────────────────────────┤
│ Serverless (no cluster)│ Managed cluster (you configure)    │
│ Simple ETL pipelines   │ Complex multi-framework pipelines  │
│ DPU-second billing     │ EC2 instance billing               │
│ DynamicFrame API       │ Full Spark / Hive / Presto API     │
│ Limited customization  │ Full software stack control        │
│ No GPU support         │ GPU instances supported            │
│ Built-in catalog       │ Hive Metastore (or use Glue cat.)  │
│ Beginner-friendly      │ Requires big data expertise        │
└────────────────────────┴────────────────────────────────────┘

Use Glue when:

  • Serverless ETL with no cluster management overhead
  • Standard transformations on S3 data
  • Team doesn’t have Spark expertise
  • Variable or unpredictable job frequency

Use EMR when:

  • Complex Spark jobs requiring custom configuration
  • Long-running clusters with mixed workloads
  • ML training with Spark MLlib or TensorFlow
  • Full control over Hadoop ecosystem components
  • GPU-accelerated ML preprocessing (EMR supports GPU instances)
  • Cost-sensitive large-scale jobs (Spot instances on task nodes)

EMR Deployment Options

OptionDescriptionUse When
EMR on EC2Traditional cluster on EC2Full control, custom config, cost-optimized
EMR on EKSSpark jobs on KubernetesExisting EKS infrastructure, multi-tenant
EMR ServerlessNo cluster to manageIntermittent jobs, simplicity like Glue but Spark

Behind the Scenes: HDFS vs EMRFS

HDFS (Hadoop Distributed File System):
  - Data stored ON cluster nodes (local disks)
  - Fast: compute is co-located with data
  - Problem: cluster must stay running to preserve data
  - Problem: cluster termination = data loss

EMRFS (EMR File System — backed by S3):
  - Data stored in S3, cluster reads/writes via S3 API
  - Cluster can be terminated after job — data persists
  - Decouples compute (EMR) from storage (S3)
  - Small overhead: S3 latency vs local disk
  - Still fast for large batch: sequential reads, multipart

WHY EMRFS WINS for most ML workloads:
  1. Cluster is ephemeral → spin up for job, terminate → no idle cost
  2. Multiple clusters can read same S3 dataset simultaneously
  3. S3 is the data lake — data already there, no HDFS import needed
  4. HDFS only needed when random I/O or iterative algorithms need speed

Spark MLlib on EMR

  • Distributed ML algorithms: classification, regression, clustering, recommendation
  • Scales across cluster nodes — trains on datasets too large for single machine
  • Pipeline API: chain Transformers and Estimators
  • Integration: train on EMR, deploy model to SageMaker for inference

Amazon Redshift for ML

Redshift ML — Run ML in Your Warehouse

CREATE MODEL predict_churn
FROM (SELECT features... FROM customer_data)
TARGET churn
FUNCTION predict_churn_fn
IAM_ROLE 'arn:aws:iam::...'
SETTINGS (S3_BUCKET 's3://my-bucket');

-- Under the hood:
-- 1. Redshift exports training data to S3
-- 2. Calls SageMaker Autopilot to train best model
-- 3. Registers model as a SQL function in Redshift
-- 4. You call it like any SQL function: SELECT predict_churn_fn(features...)

Why this matters: Redshift ML lets data analysts run ML predictions in SQL without leaving the warehouse. The SageMaker complexity is hidden. Exam tests when this pattern applies: existing data in Redshift, SQL-native team, no need to move data.

Redshift Spectrum — Query S3 from Redshift

Architecture:
  Redshift cluster → Spectrum layer → S3 data lake
                    (thousands of
                     Spectrum nodes,
                     scales automatically)

Use case: JOIN warehouse tables (hot) with historical S3 data (cold)
  SELECT w.customer_id, s.revenue
  FROM warehouse_customers w
  JOIN spectrum_schema.historical_revenue s  -- this table is in S3!
  ON w.id = s.customer_id

Redshift vs Athena vs Glue — When to Use Which

┌────────────────────────────────────────────────────────────┐
│  QUESTION: Do you need a persistent data warehouse?        │
│  YES → Redshift (structured, aggregations, BI queries)     │
│  NO  ↓                                                     │
├────────────────────────────────────────────────────────────┤
│  QUESTION: Is data already in S3, no transforms needed?    │
│  YES → Athena (serverless SQL, pay per scan)               │
│  NO  ↓                                                     │
├────────────────────────────────────────────────────────────┤
│  QUESTION: Need to transform data before storing?          │
│  YES → Glue ETL (batch transform, catalog integration)     │
└────────────────────────────────────────────────────────────┘

Amazon Athena

What Athena Is and Why It Matters

ELI5: Athena is like having a SQL interface to your S3 bucket. You don’t need a database server, you don’t load data anywhere — you just write SQL and Athena goes and reads your S3 files. You pay only for the data you scan. For ML data exploration, this is huge: a data scientist can run ad-hoc SQL on terabytes of raw data without provisioning anything.

Under the hood: Presto/Trino query engine
  → Distributed SQL execution across many nodes
  → Reads S3 files in parallel (CSV, JSON, Parquet, ORC, Avro)
  → Uses Glue Data Catalog for table definitions
  → Results returned in seconds to minutes

Cost: Format Matters Enormously

Query: SELECT avg(revenue) FROM sales WHERE year = 2024

File format: CSV (uncompressed, row-oriented)
  → Scan 500 GB full table
  → Cost: 500 GB × $5/TB = $2.50

File format: Parquet (compressed, columnar, partitioned)
  → Partition pruning: only read year=2024/ prefix (100 GB)
  → Column pruning: only read 'revenue' column (2 GB)
  → Cost: 2 GB × $5/TB = $0.01

Savings: 99.6% cost reduction — same query result

Why this matters for the exam: The combination of Parquet + partitioning + Athena is a canonical exam answer for “cost-effective querying of large datasets in S3.” Know this pattern cold.

Partition Projection — Advanced Optimization

Standard partitions: Athena queries Glue catalog to list all partitions
  → With 10,000 partitions, catalog lookups are slow
  → Must run MSCK REPAIR TABLE to add new partitions

Partition projection: Athena computes partitions from rules, no catalog lookup
  Example config:
    projection.date.type=date
    projection.date.range=2020-01-01,NOW
    projection.date.format=yyyy-MM-dd

  Result: no MSCK REPAIR needed, instant partition discovery
  Use when: partitioned by date with thousands of partitions

Athena Federated Query

  • Query non-S3 sources: RDS, DynamoDB, Redshift, on-prem via Lambda connector
  • Join data across sources in single SQL
  • Use case: JOIN S3 feature data with live RDS customer records

AWS Step Functions

Orchestrating ML Workflows

ML Pipeline as State Machine:
  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
  │  Glue    │ →  │SageMaker │ →  │  Model   │ →  │ Notify   │
  │  ETL Job │    │ Training │    │Evaluation│    │  Team    │
  └──────────┘    └──────────┘    └──────────┘    └──────────┘
        │                               │
        └── FAIL: Alert + Stop          └── IF accuracy < 0.8: retrain

Key features for ML pipelines:

  • Retry logic with exponential backoff (transient failures)
  • Parallel branches (train multiple model variants simultaneously)
  • Wait states (poll for SageMaker training job completion)
  • Error handling (catch failures, route to notification)
  • Visual workflow editor

Step Functions vs SageMaker Pipelines vs Airflow

FactorStep FunctionsSageMaker PipelinesMWAA (Airflow)
ScopeGeneral AWS orchestrationML-specific (SM only)General workflow DAGs
ML awarenessLow (call any AWS API)High (native SM steps)Low (operators needed)
ComplexityMediumLow (SM-focused)High (Airflow setup)
Lineage/trackingNone built-inFull ML lineageNone built-in
Best forMulti-service ML pipelinesPure SageMaker MLOpsExisting Airflow teams

Discriminative example: ML team needs lineage tracking, experiment comparison, and model registry → SageMaker Pipelines. ML pipeline calls Glue + SageMaker + Lambda + SNS → Step Functions (spans multiple services). Team already uses Airflow → MWAA.


AWS Data Pipeline (Legacy)

  • Older managed ETL and data movement service
  • Still appears on some exam questions
  • Moves data between AWS storage services (S3 ↔ RDS ↔ Redshift ↔ EMR)
  • Replaced in practice by: Glue Workflows, Step Functions, EventBridge Pipes

Exam tip: If AWS Data Pipeline appears, the question usually tests whether you know it’s for data movement/orchestration (not transformation). Modern replacement is Step Functions + Glue.


Data Quality & Validation

Why Data Quality is Critical for ML

Garbage In → Garbage Out

Model trained on dirty data:
  - Missing features → model learns wrong patterns
  - Label errors → model learns wrong targets
  - Distribution drift → model degrades silently in production
  - Duplicates → model overfits to repeated examples

PyDeequ — Data Quality on Glue/EMR

Apache Deequ (from Amazon, open-sourced)
PyDeequ = Python wrapper for Glue and EMR

Key concepts:
  Analyzer:   compute metrics (completeness, mean, stddev, correlation)
  Check:      define constraints (completeness > 0.9, values in range)
  Verification: run checks, get pass/fail report
  Anomaly Detection: flag if today's metrics diverge from historical

Example constraint:
  check.hasCompleteness("age", _ >= 0.95)    # 95%+ non-null
  check.isNonNegative("revenue")             # no negative revenue
  check.isContainedIn("country", valid_list) # valid country codes

Why this matters for the exam: Data quality gates are a best practice in ML pipelines. If a question mentions “ensure training data meets quality standards before model training,” the pattern is Glue ETL + PyDeequ checks + fail pipeline if checks don’t pass.


Key Integration Patterns

Pattern 1: Batch ML Pipeline

S3 (raw data)
  → EventBridge (scheduled: daily 2am)
      → Step Functions
          ├── Glue Crawler (update catalog)
          ├── Glue ETL Job (transform → Parquet)
          ├── PyDeequ quality check (gate)
          └── SageMaker Training Job
                → S3 (model artifacts)
                → Model Registry

Pattern 2: Streaming ML Pipeline

IoT Sensors / App events
  → Kinesis Data Streams
      ├── Managed Apache Flink (real-time feature engineering)
      │     → Kinesis Data Streams (enriched events)
      │         → SageMaker real-time endpoint (inference)
      └── Firehose (buffer, convert JSON→Parquet)
            → S3 (raw archive + training data)

Pattern 3: Event-Driven ETL

File arrives in S3
  → S3 Event Notification
      → EventBridge rule (e.g., suffix = .csv, prefix = uploads/)
          → Step Functions workflow
              ├── Glue ETL Job (validate + transform)
              ├── PyDeequ check
              └── On success: move to processed/ prefix
                  On failure: move to quarantine/ prefix + SNS alert

Pattern 4: Lakehouse with Query Layers

Raw data → S3 raw layer (any format)
  → Glue Crawler → Data Catalog
      ├── Athena (ad-hoc exploration, pay-per-query)
      ├── Redshift Spectrum (join with warehouse data)
      └── Glue ETL (transform to curated layer)
            → S3 curated layer (Parquet, partitioned)
                ├── Athena (cost-optimized queries)
                └── SageMaker Training (ML-ready data)

Quick Reference: Scenario → Service

ScenarioService
Discover schema of S3 data automaticallyGlue Crawler
Central metadata repository for data lakeGlue Data Catalog
Batch ETL, serverless, Spark-basedGlue ETL Job
Visual no-code data preparationGlue DataBrew
Fuzzy deduplication / entity matchingGlue FindMatches
Only process new S3 data since last runGlue Job Bookmarks
Complex Spark, full cluster controlEMR
GPU-accelerated distributed ML preprocessingEMR (GPU instances)
Serverless Spark (no Glue DynamicFrame needed)EMR Serverless
SQL on S3, pay per scanAthena
Reduce Athena cost on large datasetParquet + partitioning
Join S3 data with Redshift tablesRedshift Spectrum
ML predictions in SQL without leaving warehouseRedshift ML
Orchestrate multi-service ML pipelineStep Functions
ML pipeline with lineage, experiment trackingSageMaker Pipelines
Existing Airflow team migrating to AWSMWAA
Data quality gates before model trainingPyDeequ on Glue
Legacy data movement between AWS servicesAWS Data Pipeline
Inconsistent/mixed-type source dataGlue DynamicFrame
Daily incremental feature pipelineGlue + Job Bookmarks
Real-time stream transformation before S3Firehose + Lambda