Domain 1A: Data Storage & Ingestion
Data Storage & Ingestion
Exam Domain: 1 — Data Engineering (20%) Task: Design data ingestion and storage solutions for ML workloads
Data Fundamentals
Types of Data
| Type | Description | Examples | ML Use Cases |
|---|---|---|---|
| Structured | Fixed schema, rows & columns | RDS tables, CSV, spreadsheets | Tabular ML, classification |
| Semi-structured | Self-describing, flexible schema | JSON, XML, Parquet, Avro | NLP preprocessing, logs |
| Unstructured | No predefined schema | Images, audio, video, raw text | Computer vision, NLP |
ELI5: Think of a library. Structured data is the card catalog — every entry has the same fields (title, author, date) in the same columns. Semi-structured is a set of labeled boxes — each box has its own labels, but it tells you what’s inside (like a JSON object). Unstructured is a pile of random stuff on the floor — photos, sticky notes, voice memos — useful, but you have to figure out what it is yourself before you can learn from it.
The Three V’s — Mapped to AWS Services
┌─────────────────────────────────────────────────────────────┐
│ VOLUME — How much data (GB → PB scale) │
│ AWS: S3 (unlimited), Redshift (PB warehouse), │
│ EMR (distributed processing) │
├─────────────────────────────────────────────────────────────┤
│ VELOCITY — How fast data arrives │
│ Batch: Glue, EMR, AWS Batch │
│ Real-time: Kinesis Data Streams, MSK, Flink │
│ Near-RT: Data Firehose (60s buffer) │
├─────────────────────────────────────────────────────────────┤
│ VARIETY — How many formats and sources │
│ Discovery: Glue Crawlers, Data Catalog │
│ Query any: Athena (schema-on-read) │
│ Transform: Glue ETL, EMR, Data Wrangler │
└─────────────────────────────────────────────────────────────┘
Why this matters for the exam: When a question describes a scenario, map the V’s first — the service answer becomes obvious. “10 PB of images” → S3 + EMR (Volume). “IoT sensors every 10ms” → Kinesis (Velocity). “CSV + JSON + XML from 15 sources” → Glue Crawlers + Data Catalog (Variety).
Data Formats for ML
Format Comparison Table
| Format | Type | Columnar? | Compression | Schema Evolution | Best For |
|---|---|---|---|---|---|
| CSV | Text | No | None | Poor (no types) | Small datasets, human-readable, quick prototyping |
| JSON | Text | No | None | Good (flexible) | Semi-structured, API data, nested records |
| Parquet | Binary | Yes | Snappy/GZIP | Good | Analytics, Athena, Spark, large datasets |
| ORC | Binary | Yes | Zlib/Snappy | Good | Hive/EMR-heavy workloads, Hive ORC predicates |
| Avro | Binary | No | Deflate/Snappy | Excellent | Schema evolution, Kafka streams, event sourcing |
| RecordIO | Binary | No | LZ4 | Poor | SageMaker Pipe mode (fastest training input) |
| Protobuf | Binary | No | None (structure) | Good | gRPC APIs, TFRecord conversion, microservices |
Why this matters for the exam: Format choice has direct cost and performance implications. Parquet on Athena costs 30–90% less than CSV because Athena charges per TB scanned — and columnar format + compression dramatically shrinks what gets scanned. RecordIO + Pipe mode is the fastest SageMaker training path: data streams directly from S3 into the algorithm without materializing on disk.
ELI5: Why Columnar Formats Matter
ELI5: Imagine you have a spreadsheet with 1000 columns and 1 million rows. You only need to sum one column — “revenue.” Row-oriented storage (CSV) reads every single cell of every row to get that one column — 1 billion cell reads. Columnar storage (Parquet/ORC) stores all revenue values together — it jumps straight to that column and reads only 1 million values. For analytics and ML feature extraction, columnar wins every time. This is called column pruning.
Behind the Scenes: Predicate Pushdown
SQL: SELECT revenue FROM sales WHERE region = 'US'
Row format (CSV):
Read ALL rows → filter in memory → extract revenue column
I/O cost: 100% of data
Columnar (Parquet):
1. Read "region" column footer metadata (min/max per row group)
2. Skip row groups where region can't be 'US' (predicate pushdown)
3. Read only matching row groups from "revenue" column
I/O cost: often 1-5% of data
Why RecordIO + Pipe Mode is Fastest for SageMaker
File Mode (default):
S3 → Download ALL files to /opt/ml/input/data/ → Algorithm reads files
Problem: Large datasets = long download = container sits idle
Pipe Mode + RecordIO:
S3 → Stream data as named pipe → Algorithm reads continuously
Result: Training starts immediately, no disk I/O bottleneck
Speed: 2-10x faster for datasets > 50GB
Exam tip: If the question mentions reducing SageMaker training startup time or improving I/O throughput for large training sets, the answer involves RecordIO format + Pipe mode.
Data Architecture Patterns
Data Warehouse vs Data Lake vs Lakehouse
┌──────────────────┐ ┌──────────────────┐ ┌───────────────────┐
│ DATA WAREHOUSE │ │ DATA LAKE │ │ LAKEHOUSE │
├──────────────────┤ ├──────────────────┤ ├───────────────────┤
│ Structured only │ │ Any format │ │ Any format │
│ Schema-on-WRITE │ │ Schema-on-READ │ │ Schema-on-read │
│ Pre-transformed │ │ Raw + processed │ │ + ACID txns │
│ BI & reporting │ │ ML + data sci │ │ + governance │
│ Expensive/fast │ │ Cheap/flexible │ │ + versioning │
│ e.g. Redshift │ │ e.g. S3 │ │ e.g. S3 + Iceberg │
└──────────────────┘ └──────────────────┘ └───────────────────┘
ELI5: A data warehouse is a neatly organized closet — everything is folded, labeled, in its exact place, but you did all that work upfront before anything could go in. A data lake is a massive garage where you throw everything raw: boxes, bikes, old furniture — cheap to store, but messy to search. A lakehouse is the dream upgrade: the garage’s storage capacity plus enough organization that you can find, update, and version your things reliably.
Schema-on-Write vs Schema-on-Read (First Principles)
Schema-on-Write (Warehouse):
Ingest time: "Is this column a number? Does the foreign key exist?"
Query time: Fast — data is already clean and indexed
Trade-off: Inflexible. Changing schema = ETL rewrite + backfill
Schema-on-Read (Data Lake):
Ingest time: Dump it — worry about structure later
Query time: Slower — must parse and interpret on every read
Trade-off: Flexible. Add new columns, change types without touching raw data
When each breaks down:
- Data Warehouse anti-pattern: Storing raw logs, images, or IoT sensor data — too expensive, too rigid
- Data Lake anti-pattern: Needing ACID transactions (concurrent writes/deletes) — S3 is eventually consistent at the object level; without a table format (Iceberg, Delta), you get “data swamps”
Data Mesh Concept
- Decentralized: each domain team owns, serves, and governs their own data
- Self-serve infrastructure: teams don’t wait for central data engineering
- Data as a product: each domain publishes discoverable, trustworthy datasets
- AWS implementation: Lake Formation for governance + separate S3 prefixes per domain + Glue Data Catalog shared across accounts
Amazon S3 — The ML Data Lake
S3 is the center of gravity for ML on AWS. Nearly every ML pipeline starts and ends here.
ELI5: S3 is the shared whiteboard that every AWS service can read and write. SageMaker trains from S3, Glue transforms data in S3, Athena queries S3, Kinesis Firehose delivers to S3, Lambda reads from S3. It’s not the fastest storage, but it’s universally accessible, infinitely scalable, and cheap enough that you never have to delete anything. Every ML architecture diagram has S3 at the center.
Storage Classes — Decision Table
| Class | Min Storage | Retrieval | Retrieval Fee | Best For ML |
|---|---|---|---|---|
| Standard | None | Instant | None | Active training datasets |
| Standard-IA | 30 days | Instant | Yes/GB | Datasets accessed < monthly |
| One Zone-IA | 30 days | Instant | Yes/GB | Reproducible data, no DR needed |
| Glacier Instant | 90 days | Instant | Yes/GB | Experiment results, rare re-runs |
| Glacier Flexible | 90 days | 1min–12hr | Yes/GB | Long-term model checkpoints |
| Glacier Deep Archive | 180 days | 12–48hr | Yes/GB | Compliance archives, raw originals |
| Intelligent-Tiering | None | Instant (freq) | Monitoring fee | Unknown / changing access patterns |
ELI5: Storage classes are storage units at different distances from your home. Standard is a shelf in your living room — always there, premium price. Standard-IA is a storage unit across town — low monthly rent, but you pay a trip fee each time you retrieve something. Glacier Deep Archive is a warehouse in another city — incredibly cheap, but it takes 12–48 hours to get your stuff. Intelligent-Tiering is a self-driving delivery robot that watches what you access and automatically moves things to the right distance.
ML Decision Tree:
Active training data? → Standard
Monthly batch reprocessing? → Standard-IA
Reproducible intermediate? → One Zone-IA
Final model artifacts? → Glacier Instant (need it sometimes)
Raw original dataset backup? → Glacier Deep Archive
Don't know access pattern? → Intelligent-Tiering
S3 Performance Optimization
UPLOADS:
Multipart upload → For files > 100MB (required for > 5GB)
Parallel part uploads, retry individual parts
Transfer Accel → Uses CloudFront edge locations globally
Good for cross-region uploads from on-prem
DOWNLOADS:
Byte-range fetch → Parallel reads of different parts of one file
Critical for large Parquet file reads
S3 Select → Filter CSV/JSON/Parquet server-side before download
Returns only matching rows — saves transfer cost
THROUGHPUT LIMITS:
3,500 PUT/COPY/DELETE requests/sec per prefix
5,500 GET/HEAD requests/sec per prefix
Strategy: spread across prefixes → date=2024-01-01/ vs date=2024-01-02/
(each prefix gets its own limit)
S3 Select: Save Money on ML Preprocessing
Without S3 Select:
S3 → Download 50GB CSV → EC2/Lambda filters rows → Use 500MB result
Cost: Transfer 50GB + compute to process 50GB
With S3 Select:
S3 → SQL predicate runs SERVER-SIDE → Transfer only 500MB result
Cost: Transfer 500MB + small S3 Select request fee
Savings: 99% less data transferred
Exam tip: S3 Select is frequently tested. It’s useful when you only need a subset of a large file and want to reduce data transfer costs before ML preprocessing. Works on CSV, JSON, and Parquet in S3.
S3 Encryption Options
| Method | Who manages keys | Audit trail | Use When |
|---|---|---|---|
| SSE-S3 | AWS (S3-managed) | None | Default — simplest, no key management |
| SSE-KMS | AWS KMS | Full CloudTrail audit | Compliance, key rotation, cross-account access |
| SSE-C | You (per-request) | None | Full key control, no AWS key storage |
| Client-side | You (before upload) | None | Maximum security, regulated data |
Exam tip: SSE-KMS is the most commonly tested encryption option for ML scenarios involving compliance, auditing, or cross-account access. Every KMS decrypt/encrypt call shows in CloudTrail — use this when auditors need to know who accessed what training data.
Versioning and Lifecycle for ML Dataset Management
VERSIONING:
Enable on bucket → every PUT creates a new version ID
ML use: dataset v1, v2, v3 — roll back if model degrades
Cost: every version stored = separate storage charge
MFA Delete: require MFA to permanently delete versions
LIFECYCLE RULES:
Day 0: training-data/ → S3 Standard (active training)
Day 30: transition → Standard-IA (done training)
Day 90: transition → Glacier (archival)
Day 365: expire → Delete (or move to Deep Archive)
Event Notifications — Trigger ML Pipelines
New file lands in S3
→ S3 Event Notification
├── SQS queue → batch consumer polls and processes
├── SNS topic → fan-out to multiple subscribers
├── Lambda → lightweight immediate transformation
└── EventBridge → rule-based routing to Step Functions / Glue
Why this matters for the exam: Event-driven ML pipeline architecture is heavily tested. Know that S3 → EventBridge → Step Functions is the most flexible pattern for orchestrating multi-step ML workflows triggered by new data.
Block & File Storage for ML
EBS vs EFS vs FSx for Lustre
┌──────────────────┬──────────────────┬───────────────────────────┐
│ EBS │ EFS │ FSx for Lustre │
├──────────────────┼──────────────────┼───────────────────────────┤
│ Block storage │ NFS file system │ Parallel file system │
│ 1 EC2 instance │ Many instances │ Many instances (HPC) │
│ Single AZ │ Multi-AZ │ Single AZ (or Multi-AZ) │
│ ~1 GB/s max │ ~10 GB/s burst │ Hundreds of GB/s │
│ No S3 link │ No S3 link │ S3-backed (lazy load) │
│ SageMaker EBS │ Shared training │ Distributed GPU training │
│ volumes │ datasets │ at scale │
└──────────────────┴──────────────────┴───────────────────────────┘
ELI5: EBS is a USB drive — fast, dedicated to one machine. EFS is a shared Google Drive folder — multiple machines can all read and write simultaneously. FSx for Lustre is a high-speed race track for data — built specifically so that hundreds of GPUs can all slam through terabytes of training data in parallel without waiting on each other.
Why FSx for Lustre Exists: The HPC Training Bottleneck
Problem: 100 GPU nodes training on S3 → each GPU makes GET requests
→ S3 per-prefix throughput limit hit → GPUs wait for data
→ GPU utilization drops to 30-40%
Solution: FSx for Lustre
→ Single high-throughput file system shared by all nodes
→ Lustre protocol designed for parallel I/O (HPC origins)
→ All 100 GPUs see same POSIX filesystem
→ Hundreds of GB/s aggregate throughput
Behind the Scenes: FSx Lustre Lazy Loading from S3
1. Create FSx filesystem linked to s3://my-bucket/training/
2. Mount on all training nodes (no data on FSx yet)
3. First read of file.parquet → cache miss → fetches from S3
4. File now cached on FSx Lustre SSD
5. All subsequent reads → served from local NVMe (microseconds)
6. After training, sync back to S3 with HSM commands
or let FSx auto-export changed files to S3
Exam tip: When a question asks about training a model on very large datasets (hundreds of GB to TB) with multiple GPU instances and mentions I/O bottlenecks or GPU utilization problems, the answer is FSx for Lustre backed by S3. When the question asks about multiple SageMaker instances sharing feature data, the answer is EFS.
Storage Decision Guide
Use Case → Service
─────────────────────────────────────────────────
Single notebook, personal experiments → EBS (gp3)
Share preprocessed features, 2+ jobs → EFS
Multi-GPU distributed training, large → FSx for Lustre
datasets, HPC workloads
Everything else (ML data lake) → S3
Real-Time Streaming
Amazon Kinesis Data Streams — Deep Dive
┌─────────────┐
Producers │ Stream │ Consumers
├─────────────┤
[App logs] ──────► │ Shard 1 │ ──────────► [Lambda function]
[IoT sensor] ──────► │ Shard 2 │ ──────────► [KCL application]
[Clickstream]──────► │ Shard N │ ──────────► [Kinesis Flink]
└─────────────┘ ──────────► [Firehose]
Per shard: 1 MB/s in, 1000 records/s in
2 MB/s out (shared across consumers)
Enhanced Fan-Out: 2 MB/s out per consumer per shard (dedicated HTTP/2)
Key Concepts:
| Concept | Detail |
|---|---|
| Shard | Unit of capacity. More shards = more throughput |
| Partition key | Determines which shard a record goes to (hashed) |
| Sequence number | Unique ID per record within shard, strictly ordered |
| Retention | Default 24hr, up to 365 days (Extended Data Retention) |
| Enhanced Fan-Out | Dedicated 2MB/s per consumer — no sharing |
Behind the Scenes: Shards, Ordering, and Partition Keys
Partition key → MD5 hash → maps to a shard
Example: user_id as partition key
user_123 → hash(user_123) → Shard 1
user_456 → hash(user_456) → Shard 3
ORDERING GUARANTEE:
Within a shard: strict order (sequence numbers always increase)
Across shards: NO ordering guarantee
CONSEQUENCE FOR ML:
Time-series per sensor? → Use sensor_id as partition key
→ All events for sensor_X go to same shard → ordered
→ Can reconstruct time series for that sensor correctly
ELI5 on partition keys: Imagine Kinesis is a post office with multiple sorting lanes (shards). The partition key is the zip code — it always routes the same zip code to the same lane. This means all mail from the same zip code stays in order (within its lane), but packages across different lanes don’t have a global order. For ML, if you need all events from the same user/device/sensor in order, use that identifier as the partition key.
Exam tip: Hot shards occur when partition keys are not evenly distributed (e.g., using a boolean value that puts 99% of records on 2 shards). Fix with better key selection or shard splitting.
Amazon Data Firehose — Fully Managed Delivery
Sources: Transforms: Destinations:
Kinesis Data Streams ──► Lambda (optional) ──► S3
Direct PUT API ──► Format conversion ──► Redshift (via S3 COPY)
MSK ──► (JSON→Parquet, ──► OpenSearch Service
JSON→ORC) ──► Splunk
──► HTTP endpoint
──► Datadog / New Relic
Key Properties:
- Buffers data (size: 1–128 MB, or time: 60–900 seconds — whichever hits first)
- No consumer management — fully serverless
- Built-in format conversion (JSON → Parquet / ORC) without separate ETL
- Error records automatically go to separate S3 prefix
Kinesis Data Streams vs Data Firehose
┌────────────────────────┬────────────────────────────────┐
│ DATA STREAMS │ DATA FIREHOSE │
├────────────────────────┼────────────────────────────────┤
│ Real-time (~200ms) │ Near real-time (60s+ buffer) │
│ You manage shards │ Fully managed, auto-scales │
│ Multiple consumers │ Single delivery destination │
│ Data replay supported │ No replay │
│ Custom processing │ Lambda transform only │
│ KCL / Lambda / Flink │ Fixed destinations │
│ Pay per shard-hour │ Pay per GB ingested │
└────────────────────────┴────────────────────────────────┘
ELI5: Data Streams is laying your own plumbing — you control every pipe and valve, and multiple consumers can tap in simultaneously. Firehose is calling a plumber who handles everything: you say “water goes to S3” and they do it. If you need custom processing, multiple consumers, or data replay, use Data Streams. If you just need reliable delivery to a storage destination, use Firehose.
Decision Guide:
Need < 1 second latency? → Data Streams
Need custom consumer (Lambda/KCL/Flink)? → Data Streams
Need to replay data? → Data Streams
Just delivering data to S3/Redshift? → Firehose
Need built-in JSON→Parquet conversion? → Firehose
Want zero operational overhead? → Firehose
Amazon Managed Service for Apache Flink
- Real-time stream processing using SQL, Java, Scala, or Python (Apache Flink API)
- Sources: Kinesis Data Streams, MSK (Kafka)
- Fully managed: provisions Flink cluster, scales automatically
- Use cases:
- Real-time feature engineering (sliding window aggregations)
- Anomaly detection before data lands in storage
- Filtering, enrichment, joining streams
RANDOM_CUT_FOREST (RCF):
Built-in Flink SQL function for streaming anomaly detection
How it works:
- Maintains a forest of random decision trees on recent data
- New point gets an anomaly score based on depth in trees
- Shallow depth (few cuts needed) → high anomaly score
- Updates incrementally — no full retrain needed
Use case on exam: "Detect anomalies in real-time IoT sensor stream"
Answer: Managed Apache Flink with RANDOM_CUT_FOREST function
Amazon MSK (Managed Streaming for Apache Kafka)
- Fully managed Kafka brokers (you still configure topics, partitions, retention)
- MSK Connect: run Kafka Connect connectors (Debezium CDC, S3 Sink, etc.)
- MSK Serverless: truly serverless — no broker management, auto-scales
Kinesis vs MSK — When to Use Each
| Factor | Kinesis | MSK |
|---|---|---|
| Protocol | AWS proprietary | Apache Kafka |
| Team expertise | AWS-native | Existing Kafka users |
| Message size | 1 MB max | 1 MB default, configurable higher |
| Retention | Up to 365 days | Unlimited (tiered storage) |
| Ecosystem | AWS services only | Kafka ecosystem (Kafka Streams, ksqlDB) |
| Complexity | Low | Higher (broker config, partition management) |
| Cost model | Per shard-hour | Per broker-hour |
Discriminative example: Your team already uses Kafka on-prem and is migrating → MSK (no rewrite, familiar operations). Greenfield AWS-native project, no Kafka expertise → Kinesis (simpler, less overhead). Existing Kafka, want zero management → MSK Serverless.
Quick Reference: Scenario → Service
| Scenario | Service |
|---|---|
| Store ML training datasets at scale | S3 |
| Fastest SageMaker training I/O | RecordIO + Pipe mode |
| Multi-GPU training I/O bottleneck | FSx for Lustre |
| Share feature data across training instances | EFS |
| Single notebook instance local storage | EBS gp3 |
| Real-time IoT ingestion with custom processing | Kinesis Data Streams |
| Deliver streaming data to S3/Redshift | Data Firehose |
| JSON → Parquet conversion in stream | Data Firehose (built-in conversion) |
| Real-time anomaly detection on stream | Managed Apache Flink + RCF |
| Kafka-based team migrating to AWS | MSK |
| Greenfield real-time stream, AWS-native | Kinesis Data Streams |
| Archive old datasets cheaply | S3 Glacier Deep Archive |
| Unknown/changing dataset access patterns | S3 Intelligent-Tiering |
| Query subset of large S3 file (cost) | S3 Select |
| Columnar format for Athena cost savings | Parquet |
| Schema evolution in Kafka pipeline | Avro |
| Trigger ML pipeline on new S3 data | S3 Event → EventBridge → Step Functions |