Domain 1A: Data Storage & Ingestion

15 min read 3193 words

Table of Contents

Data Storage & Ingestion

Data Storage & Ingestion

Exam Domain: 1 — Data Engineering (20%) Task: Design data ingestion and storage solutions for ML workloads

Data Fundamentals

Types of Data

Type	Description	Examples	ML Use Cases
Structured	Fixed schema, rows & columns	RDS tables, CSV, spreadsheets	Tabular ML, classification
Semi-structured	Self-describing, flexible schema	JSON, XML, Parquet, Avro	NLP preprocessing, logs
Unstructured	No predefined schema	Images, audio, video, raw text	Computer vision, NLP

ELI5: Think of a library. Structured data is the card catalog — every entry has the same fields (title, author, date) in the same columns. Semi-structured is a set of labeled boxes — each box has its own labels, but it tells you what’s inside (like a JSON object). Unstructured is a pile of random stuff on the floor — photos, sticky notes, voice memos — useful, but you have to figure out what it is yourself before you can learn from it.

The Three V’s — Mapped to AWS Services

┌─────────────────────────────────────────────────────────────┐
│  VOLUME — How much data (GB → PB scale)                     │
│    AWS: S3 (unlimited), Redshift (PB warehouse),            │
│         EMR (distributed processing)                        │
├─────────────────────────────────────────────────────────────┤
│  VELOCITY — How fast data arrives                           │
│    Batch:      Glue, EMR, AWS Batch                         │
│    Real-time:  Kinesis Data Streams, MSK, Flink             │
│    Near-RT:    Data Firehose (60s buffer)                   │
├─────────────────────────────────────────────────────────────┤
│  VARIETY — How many formats and sources                     │
│    Discovery:  Glue Crawlers, Data Catalog                  │
│    Query any:  Athena (schema-on-read)                      │
│    Transform:  Glue ETL, EMR, Data Wrangler                 │
└─────────────────────────────────────────────────────────────┘

Why this matters for the exam: When a question describes a scenario, map the V’s first — the service answer becomes obvious. “10 PB of images” → S3 + EMR (Volume). “IoT sensors every 10ms” → Kinesis (Velocity). “CSV + JSON + XML from 15 sources” → Glue Crawlers + Data Catalog (Variety).

Data Formats for ML

Format Comparison Table

Format	Type	Columnar?	Compression	Schema Evolution	Best For
CSV	Text	No	None	Poor (no types)	Small datasets, human-readable, quick prototyping
JSON	Text	No	None	Good (flexible)	Semi-structured, API data, nested records
Parquet	Binary	Yes	Snappy/GZIP	Good	Analytics, Athena, Spark, large datasets
ORC	Binary	Yes	Zlib/Snappy	Good	Hive/EMR-heavy workloads, Hive ORC predicates
Avro	Binary	No	Deflate/Snappy	Excellent	Schema evolution, Kafka streams, event sourcing
RecordIO	Binary	No	LZ4	Poor	SageMaker Pipe mode (fastest training input)
Protobuf	Binary	No	None (structure)	Good	gRPC APIs, TFRecord conversion, microservices

Why this matters for the exam: Format choice has direct cost and performance implications. Parquet on Athena costs 30–90% less than CSV because Athena charges per TB scanned — and columnar format + compression dramatically shrinks what gets scanned. RecordIO + Pipe mode is the fastest SageMaker training path: data streams directly from S3 into the algorithm without materializing on disk.

ELI5: Why Columnar Formats Matter

ELI5: Imagine you have a spreadsheet with 1000 columns and 1 million rows. You only need to sum one column — “revenue.” Row-oriented storage (CSV) reads every single cell of every row to get that one column — 1 billion cell reads. Columnar storage (Parquet/ORC) stores all revenue values together — it jumps straight to that column and reads only 1 million values. For analytics and ML feature extraction, columnar wins every time. This is called column pruning.

Behind the Scenes: Predicate Pushdown

SQL:  SELECT revenue FROM sales WHERE region = 'US'

Row format (CSV):
  Read ALL rows → filter in memory → extract revenue column
  I/O cost: 100% of data

Columnar (Parquet):
  1. Read "region" column footer metadata (min/max per row group)
  2. Skip row groups where region can't be 'US' (predicate pushdown)
  3. Read only matching row groups from "revenue" column
  I/O cost: often 1-5% of data

Why RecordIO + Pipe Mode is Fastest for SageMaker

File Mode (default):
  S3 → Download ALL files to /opt/ml/input/data/ → Algorithm reads files
  Problem: Large datasets = long download = container sits idle

Pipe Mode + RecordIO:
  S3 → Stream data as named pipe → Algorithm reads continuously
  Result: Training starts immediately, no disk I/O bottleneck
  Speed: 2-10x faster for datasets > 50GB

Exam tip: If the question mentions reducing SageMaker training startup time or improving I/O throughput for large training sets, the answer involves RecordIO format + Pipe mode.

Data Architecture Patterns

Data Warehouse vs Data Lake vs Lakehouse

┌──────────────────┐   ┌──────────────────┐   ┌───────────────────┐
│  DATA WAREHOUSE  │   │    DATA LAKE     │   │    LAKEHOUSE      │
├──────────────────┤   ├──────────────────┤   ├───────────────────┤
│ Structured only  │   │ Any format       │   │ Any format        │
│ Schema-on-WRITE  │   │ Schema-on-READ   │   │ Schema-on-read    │
│ Pre-transformed  │   │ Raw + processed  │   │ + ACID txns       │
│ BI & reporting   │   │ ML + data sci    │   │ + governance      │
│ Expensive/fast   │   │ Cheap/flexible   │   │ + versioning      │
│ e.g. Redshift    │   │ e.g. S3          │   │ e.g. S3 + Iceberg │
└──────────────────┘   └──────────────────┘   └───────────────────┘

ELI5: A data warehouse is a neatly organized closet — everything is folded, labeled, in its exact place, but you did all that work upfront before anything could go in. A data lake is a massive garage where you throw everything raw: boxes, bikes, old furniture — cheap to store, but messy to search. A lakehouse is the dream upgrade: the garage’s storage capacity plus enough organization that you can find, update, and version your things reliably.

Schema-on-Write vs Schema-on-Read (First Principles)

Schema-on-Write (Warehouse):
  Ingest time: "Is this column a number? Does the foreign key exist?"
  Query time:  Fast — data is already clean and indexed
  Trade-off:   Inflexible. Changing schema = ETL rewrite + backfill

Schema-on-Read (Data Lake):
  Ingest time: Dump it — worry about structure later
  Query time:  Slower — must parse and interpret on every read
  Trade-off:   Flexible. Add new columns, change types without touching raw data

When each breaks down:

Data Warehouse anti-pattern: Storing raw logs, images, or IoT sensor data — too expensive, too rigid
Data Lake anti-pattern: Needing ACID transactions (concurrent writes/deletes) — S3 is eventually consistent at the object level; without a table format (Iceberg, Delta), you get “data swamps”

Data Mesh Concept

Decentralized: each domain team owns, serves, and governs their own data
Self-serve infrastructure: teams don’t wait for central data engineering
Data as a product: each domain publishes discoverable, trustworthy datasets
AWS implementation: Lake Formation for governance + separate S3 prefixes per domain + Glue Data Catalog shared across accounts

Amazon S3 — The ML Data Lake

S3 is the center of gravity for ML on AWS. Nearly every ML pipeline starts and ends here.

ELI5: S3 is the shared whiteboard that every AWS service can read and write. SageMaker trains from S3, Glue transforms data in S3, Athena queries S3, Kinesis Firehose delivers to S3, Lambda reads from S3. It’s not the fastest storage, but it’s universally accessible, infinitely scalable, and cheap enough that you never have to delete anything. Every ML architecture diagram has S3 at the center.

Storage Classes — Decision Table

Class	Min Storage	Retrieval	Retrieval Fee	Best For ML
Standard	None	Instant	None	Active training datasets
Standard-IA	30 days	Instant	Yes/GB	Datasets accessed < monthly
One Zone-IA	30 days	Instant	Yes/GB	Reproducible data, no DR needed
Glacier Instant	90 days	Instant	Yes/GB	Experiment results, rare re-runs
Glacier Flexible	90 days	1min–12hr	Yes/GB	Long-term model checkpoints
Glacier Deep Archive	180 days	12–48hr	Yes/GB	Compliance archives, raw originals
Intelligent-Tiering	None	Instant (freq)	Monitoring fee	Unknown / changing access patterns

ELI5: Storage classes are storage units at different distances from your home. Standard is a shelf in your living room — always there, premium price. Standard-IA is a storage unit across town — low monthly rent, but you pay a trip fee each time you retrieve something. Glacier Deep Archive is a warehouse in another city — incredibly cheap, but it takes 12–48 hours to get your stuff. Intelligent-Tiering is a self-driving delivery robot that watches what you access and automatically moves things to the right distance.

ML Decision Tree:

Active training data?        → Standard
Monthly batch reprocessing?  → Standard-IA
Reproducible intermediate?   → One Zone-IA
Final model artifacts?       → Glacier Instant (need it sometimes)
Raw original dataset backup? → Glacier Deep Archive
Don't know access pattern?   → Intelligent-Tiering

S3 Performance Optimization

UPLOADS:
  Multipart upload  → For files > 100MB (required for > 5GB)
                      Parallel part uploads, retry individual parts
  Transfer Accel    → Uses CloudFront edge locations globally
                      Good for cross-region uploads from on-prem

DOWNLOADS:
  Byte-range fetch  → Parallel reads of different parts of one file
                      Critical for large Parquet file reads
  S3 Select         → Filter CSV/JSON/Parquet server-side before download
                      Returns only matching rows — saves transfer cost

THROUGHPUT LIMITS:
  3,500 PUT/COPY/DELETE requests/sec per prefix
  5,500 GET/HEAD requests/sec per prefix
  Strategy: spread across prefixes  →  date=2024-01-01/ vs date=2024-01-02/
                                        (each prefix gets its own limit)

S3 Select: Save Money on ML Preprocessing

Without S3 Select:
  S3 → Download 50GB CSV → EC2/Lambda filters rows → Use 500MB result
  Cost: Transfer 50GB + compute to process 50GB

With S3 Select:
  S3 → SQL predicate runs SERVER-SIDE → Transfer only 500MB result
  Cost: Transfer 500MB + small S3 Select request fee
  Savings: 99% less data transferred

Exam tip: S3 Select is frequently tested. It’s useful when you only need a subset of a large file and want to reduce data transfer costs before ML preprocessing. Works on CSV, JSON, and Parquet in S3.

S3 Encryption Options

Method	Who manages keys	Audit trail	Use When
SSE-S3	AWS (S3-managed)	None	Default — simplest, no key management
SSE-KMS	AWS KMS	Full CloudTrail audit	Compliance, key rotation, cross-account access
SSE-C	You (per-request)	None	Full key control, no AWS key storage
Client-side	You (before upload)	None	Maximum security, regulated data

Exam tip: SSE-KMS is the most commonly tested encryption option for ML scenarios involving compliance, auditing, or cross-account access. Every KMS decrypt/encrypt call shows in CloudTrail — use this when auditors need to know who accessed what training data.

Versioning and Lifecycle for ML Dataset Management

VERSIONING:
  Enable on bucket → every PUT creates a new version ID
  ML use: dataset v1, v2, v3 — roll back if model degrades
  Cost: every version stored = separate storage charge
  MFA Delete: require MFA to permanently delete versions

LIFECYCLE RULES:
  Day 0:   training-data/ → S3 Standard    (active training)
  Day 30:  transition     → Standard-IA    (done training)
  Day 90:  transition     → Glacier        (archival)
  Day 365: expire         → Delete         (or move to Deep Archive)

Event Notifications — Trigger ML Pipelines

New file lands in S3
  → S3 Event Notification
      ├── SQS queue     → batch consumer polls and processes
      ├── SNS topic     → fan-out to multiple subscribers
      ├── Lambda        → lightweight immediate transformation
      └── EventBridge   → rule-based routing to Step Functions / Glue

Why this matters for the exam: Event-driven ML pipeline architecture is heavily tested. Know that S3 → EventBridge → Step Functions is the most flexible pattern for orchestrating multi-step ML workflows triggered by new data.

Block & File Storage for ML

EBS vs EFS vs FSx for Lustre

┌──────────────────┬──────────────────┬───────────────────────────┐
│       EBS        │       EFS        │      FSx for Lustre       │
├──────────────────┼──────────────────┼───────────────────────────┤
│ Block storage    │ NFS file system  │ Parallel file system      │
│ 1 EC2 instance   │ Many instances   │ Many instances (HPC)      │
│ Single AZ        │ Multi-AZ         │ Single AZ (or Multi-AZ)   │
│ ~1 GB/s max      │ ~10 GB/s burst   │ Hundreds of GB/s          │
│ No S3 link       │ No S3 link       │ S3-backed (lazy load)     │
│ SageMaker EBS    │ Shared training  │ Distributed GPU training  │
│ volumes          │ datasets         │ at scale                  │
└──────────────────┴──────────────────┴───────────────────────────┘

ELI5: EBS is a USB drive — fast, dedicated to one machine. EFS is a shared Google Drive folder — multiple machines can all read and write simultaneously. FSx for Lustre is a high-speed race track for data — built specifically so that hundreds of GPUs can all slam through terabytes of training data in parallel without waiting on each other.

Why FSx for Lustre Exists: The HPC Training Bottleneck

Problem:  100 GPU nodes training on S3 → each GPU makes GET requests
          → S3 per-prefix throughput limit hit → GPUs wait for data
          → GPU utilization drops to 30-40%

Solution: FSx for Lustre
          → Single high-throughput file system shared by all nodes
          → Lustre protocol designed for parallel I/O (HPC origins)
          → All 100 GPUs see same POSIX filesystem
          → Hundreds of GB/s aggregate throughput

Behind the Scenes: FSx Lustre Lazy Loading from S3

1. Create FSx filesystem linked to s3://my-bucket/training/
2. Mount on all training nodes (no data on FSx yet)
3. First read of file.parquet → cache miss → fetches from S3
4. File now cached on FSx Lustre SSD
5. All subsequent reads → served from local NVMe (microseconds)
6. After training, sync back to S3 with HSM commands
   or let FSx auto-export changed files to S3

Exam tip: When a question asks about training a model on very large datasets (hundreds of GB to TB) with multiple GPU instances and mentions I/O bottlenecks or GPU utilization problems, the answer is FSx for Lustre backed by S3. When the question asks about multiple SageMaker instances sharing feature data, the answer is EFS.

Storage Decision Guide

Use Case                              → Service
─────────────────────────────────────────────────
Single notebook, personal experiments → EBS (gp3)
Share preprocessed features, 2+ jobs → EFS
Multi-GPU distributed training, large → FSx for Lustre
datasets, HPC workloads
Everything else (ML data lake)        → S3

Real-Time Streaming

Amazon Kinesis Data Streams — Deep Dive

                    ┌─────────────┐
Producers           │  Stream     │             Consumers
                    ├─────────────┤
[App logs]  ──────► │  Shard 1    │ ──────────► [Lambda function]
[IoT sensor] ──────► │  Shard 2    │ ──────────► [KCL application]
[Clickstream]──────► │  Shard N    │ ──────────► [Kinesis Flink]
                    └─────────────┘ ──────────► [Firehose]
Per shard: 1 MB/s in, 1000 records/s in
           2 MB/s out (shared across consumers)
Enhanced Fan-Out: 2 MB/s out per consumer per shard (dedicated HTTP/2)

Key Concepts:

Concept	Detail
Shard	Unit of capacity. More shards = more throughput
Partition key	Determines which shard a record goes to (hashed)
Sequence number	Unique ID per record within shard, strictly ordered
Retention	Default 24hr, up to 365 days (Extended Data Retention)
Enhanced Fan-Out	Dedicated 2MB/s per consumer — no sharing

Behind the Scenes: Shards, Ordering, and Partition Keys

Partition key → MD5 hash → maps to a shard

Example: user_id as partition key
  user_123 → hash(user_123) → Shard 1
  user_456 → hash(user_456) → Shard 3

ORDERING GUARANTEE:
  Within a shard: strict order (sequence numbers always increase)
  Across shards:  NO ordering guarantee

CONSEQUENCE FOR ML:
  Time-series per sensor? → Use sensor_id as partition key
    → All events for sensor_X go to same shard → ordered
    → Can reconstruct time series for that sensor correctly

ELI5 on partition keys: Imagine Kinesis is a post office with multiple sorting lanes (shards). The partition key is the zip code — it always routes the same zip code to the same lane. This means all mail from the same zip code stays in order (within its lane), but packages across different lanes don’t have a global order. For ML, if you need all events from the same user/device/sensor in order, use that identifier as the partition key.

Exam tip: Hot shards occur when partition keys are not evenly distributed (e.g., using a boolean value that puts 99% of records on 2 shards). Fix with better key selection or shard splitting.

Amazon Data Firehose — Fully Managed Delivery

Sources:                    Transforms:           Destinations:
Kinesis Data Streams  ──►   Lambda (optional) ──► S3
Direct PUT API        ──►   Format conversion ──► Redshift (via S3 COPY)
MSK                   ──►   (JSON→Parquet,    ──► OpenSearch Service
                            JSON→ORC)         ──► Splunk
                                              ──► HTTP endpoint
                                              ──► Datadog / New Relic

Key Properties:

Buffers data (size: 1–128 MB, or time: 60–900 seconds — whichever hits first)
No consumer management — fully serverless
Built-in format conversion (JSON → Parquet / ORC) without separate ETL
Error records automatically go to separate S3 prefix

Kinesis Data Streams vs Data Firehose

┌────────────────────────┬────────────────────────────────┐
│   DATA STREAMS         │   DATA FIREHOSE                │
├────────────────────────┼────────────────────────────────┤
│ Real-time (~200ms)     │ Near real-time (60s+ buffer)   │
│ You manage shards      │ Fully managed, auto-scales     │
│ Multiple consumers     │ Single delivery destination    │
│ Data replay supported  │ No replay                      │
│ Custom processing      │ Lambda transform only          │
│ KCL / Lambda / Flink   │ Fixed destinations             │
│ Pay per shard-hour     │ Pay per GB ingested            │
└────────────────────────┴────────────────────────────────┘

ELI5: Data Streams is laying your own plumbing — you control every pipe and valve, and multiple consumers can tap in simultaneously. Firehose is calling a plumber who handles everything: you say “water goes to S3” and they do it. If you need custom processing, multiple consumers, or data replay, use Data Streams. If you just need reliable delivery to a storage destination, use Firehose.

Decision Guide:

Need < 1 second latency?                   → Data Streams
Need custom consumer (Lambda/KCL/Flink)?   → Data Streams
Need to replay data?                       → Data Streams
Just delivering data to S3/Redshift?       → Firehose
Need built-in JSON→Parquet conversion?     → Firehose
Want zero operational overhead?            → Firehose

Amazon Managed Service for Apache Flink

Real-time stream processing using SQL, Java, Scala, or Python (Apache Flink API)
Sources: Kinesis Data Streams, MSK (Kafka)
Fully managed: provisions Flink cluster, scales automatically
Use cases:
- Real-time feature engineering (sliding window aggregations)
- Anomaly detection before data lands in storage
- Filtering, enrichment, joining streams

RANDOM_CUT_FOREST (RCF):

Built-in Flink SQL function for streaming anomaly detection
How it works:
  - Maintains a forest of random decision trees on recent data
  - New point gets an anomaly score based on depth in trees
  - Shallow depth (few cuts needed) → high anomaly score
  - Updates incrementally — no full retrain needed

Use case on exam: "Detect anomalies in real-time IoT sensor stream"
Answer: Managed Apache Flink with RANDOM_CUT_FOREST function

Amazon MSK (Managed Streaming for Apache Kafka)

Fully managed Kafka brokers (you still configure topics, partitions, retention)
MSK Connect: run Kafka Connect connectors (Debezium CDC, S3 Sink, etc.)
MSK Serverless: truly serverless — no broker management, auto-scales

Kinesis vs MSK — When to Use Each

Factor	Kinesis	MSK
Protocol	AWS proprietary	Apache Kafka
Team expertise	AWS-native	Existing Kafka users
Message size	1 MB max	1 MB default, configurable higher
Retention	Up to 365 days	Unlimited (tiered storage)
Ecosystem	AWS services only	Kafka ecosystem (Kafka Streams, ksqlDB)
Complexity	Low	Higher (broker config, partition management)
Cost model	Per shard-hour	Per broker-hour

Discriminative example: Your team already uses Kafka on-prem and is migrating → MSK (no rewrite, familiar operations). Greenfield AWS-native project, no Kafka expertise → Kinesis (simpler, less overhead). Existing Kafka, want zero management → MSK Serverless.

Quick Reference: Scenario → Service

Scenario	Service
Store ML training datasets at scale	S3
Fastest SageMaker training I/O	RecordIO + Pipe mode
Multi-GPU training I/O bottleneck	FSx for Lustre
Share feature data across training instances	EFS
Single notebook instance local storage	EBS gp3
Real-time IoT ingestion with custom processing	Kinesis Data Streams
Deliver streaming data to S3/Redshift	Data Firehose
JSON → Parquet conversion in stream	Data Firehose (built-in conversion)
Real-time anomaly detection on stream	Managed Apache Flink + RCF
Kafka-based team migrating to AWS	MSK
Greenfield real-time stream, AWS-native	Kinesis Data Streams
Archive old datasets cheaply	S3 Glacier Deep Archive
Unknown/changing dataset access patterns	S3 Intelligent-Tiering
Query subset of large S3 file (cost)	S3 Select
Columnar format for Athena cost savings	Parquet
Schema evolution in Kafka pipeline	Avro
Trigger ML pipeline on new S3 data	S3 Event → EventBridge → Step Functions

Data Storage & Ingestion#

Data Fundamentals#

Types of Data#

The Three V’s — Mapped to AWS Services#

Data Formats for ML#

Format Comparison Table#

ELI5: Why Columnar Formats Matter#

Behind the Scenes: Predicate Pushdown#

Why RecordIO + Pipe Mode is Fastest for SageMaker#

Data Architecture Patterns#

Data Warehouse vs Data Lake vs Lakehouse#

Schema-on-Write vs Schema-on-Read (First Principles)#

Data Mesh Concept#

Amazon S3 — The ML Data Lake#

Storage Classes — Decision Table#

S3 Performance Optimization#

S3 Select: Save Money on ML Preprocessing#

S3 Encryption Options#

Versioning and Lifecycle for ML Dataset Management#

Event Notifications — Trigger ML Pipelines#

Block & File Storage for ML#

EBS vs EFS vs FSx for Lustre#

Why FSx for Lustre Exists: The HPC Training Bottleneck#

Behind the Scenes: FSx Lustre Lazy Loading from S3#

Storage Decision Guide#

Real-Time Streaming#

Amazon Kinesis Data Streams — Deep Dive#

Behind the Scenes: Shards, Ordering, and Partition Keys#

Amazon Data Firehose — Fully Managed Delivery#

Kinesis Data Streams vs Data Firehose#

Amazon Managed Service for Apache Flink#

Amazon MSK (Managed Streaming for Apache Kafka)#

Kinesis vs MSK — When to Use Each#

Quick Reference: Scenario → Service#