Domain 1A: Data Ingestion & Storage
Exam Domain 1: Data Preparation for ML (28%)
Task: Ingest and store data for ML workloads
Data Fundamentals
Types of Data
| Type | Description | Examples |
|---|---|---|
| Structured | Fixed schema, rows & columns | RDS and Redshift tables, CSV |
| Semi-structured | Flexible schema, self-describing | JSON, XML, Parquet, Avro, DynamoDB items |
| Unstructured | No predefined schema | Images, audio, video, text |
ELI5: Think of a library. Structured data is the card catalog — every entry has the same fields (title, author, date) in the same columns. Semi-structured data is a set of labeled boxes — each box has its own labels, but at least it tells you what’s inside (like a JSON object). Unstructured data is a pile of random stuff on the floor — photos, sticky notes, voice memos — useful, but you have to figure out what it is yourself.
The Three V’s of Data
Volume → How much data (GB → PB scale)
Velocity → How fast data arrives (batch vs real-time)
Variety → How many formats/sources
Why this matters for the exam: The Three V’s determine which AWS services you reach for. High Volume → S3, EMR, Redshift. High Velocity → Kinesis, Firehose. High Variety → Glue (schema discovery), Athena (query anything). When an exam question describes a scenario, map the V’s first and the service answer usually becomes obvious.
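The mapping above can be sketched as a small lookup. This is only an illustration of the exam heuristic: the function name `suggest_services` and its boolean flags are hypothetical, and the service lists are copied from the paragraph above rather than from any AWS API.

```python
# Service shortlists taken from the paragraph above; the structure is illustrative.
V_TO_SERVICES = {
    "volume": ["S3", "EMR", "Redshift"],
    "velocity": ["Kinesis", "Firehose"],
    "variety": ["Glue", "Athena"],
}


def suggest_services(high_volume=False, high_velocity=False, high_variety=False):
    """Return candidate AWS services for whichever V's dominate a scenario."""
    flags = {
        "volume": high_volume,
        "velocity": high_velocity,
        "variety": high_variety,
    }
    suggestions = []
    for v, is_high in flags.items():
        if is_high:
            suggestions.extend(V_TO_SERVICES[v])
    return suggestions


if __name__ == "__main__":
    # Clickstream arriving continuously in many formats: velocity + variety.
    print(suggest_services(high_velocity=True, high_variety=True))
```

A real exam scenario usually needs a second pass (cost, latency, operational overhead) to pick one service from the shortlist.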
Data Formats for ML
| Format | Type | Compression | Columnar | Best For |
|---|---|---|---|---|
| CSV | Text | No | No | Small datasets, simple tabular |
| JSON | Text | No | No | Semi-structured, API data |
| Parquet | Binary | Yes (Snappy) | Yes | Analytics, large datasets, Athena |
| ORC | Binary | Yes | Yes | Hive/EMR workloads |
| Avro | Binary | Yes | No | Schema evolution, streaming |
| RecordIO | Binary | Yes | No | SageMaker Pipe mode (fastest) |
Exam tip: RecordIO + Pipe mode = fastest SageMaker training data ingestion.
ELI5: Choosing a format is like storing your movie collection. CSV is an uncompressed video: simple and readable, but huge. Parquet is a compressed, chapter-indexed file: it takes up far less space, and when you only want the “action scenes” (specific columns), a query engine jumps right to them without reading the whole file. On large datasets queried through a pay-per-scan engine such as Athena, converting CSV to Parquet typically cuts the data scanned, and therefore the query cost, by roughly 30-90%.
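To see why a columnar layout reads less data, here is a toy sketch in plain Python. It is not real Parquet (no compression, no file I/O); the sample records and both helper functions are made up for illustration and only count how many values each layout has to touch.

```python
# Toy illustration of row-oriented vs column-oriented layouts (not real Parquet).
rows = [
    {"user_id": 1, "country": "DE", "age": 34, "clicked": 1},
    {"user_id": 2, "country": "US", "age": 27, "clicked": 0},
    {"user_id": 3, "country": "US", "age": 45, "clicked": 1},
]

# Column-oriented copy of the same data: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def values_read_row_layout(rows, wanted):
    """A row store touches every field of every row, whatever columns are wanted."""
    return sum(len(row) for row in rows)


def values_read_column_layout(columns, wanted):
    """A column store reads only the columns the query asks for."""
    return sum(len(columns[name]) for name in wanted)


if __name__ == "__main__":
    wanted = ["age"]
    print("row layout:", values_read_row_layout(rows, wanted))
    print("column layout:", values_read_column_layout(columns, wanted))
```

With four columns and a query on one of them, the columnar layout touches a quarter of the values; real formats add compression and row-group statistics on top of this.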
Data Architecture Patterns

Data Warehouse vs Data Lake vs Lakehouse
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ DATA WAREHOUSE │ │ DATA LAKE │ │ LAKEHOUSE │
├──────────────────┤ ├──────────────────┤ ├──────────────────┤
│ Structured only │ │ All data types │ │ All data types │
│ Schema-on-write │ │ Schema-on-read │ │ Schema-on-read │
│ Pre-processed │ │ Raw + processed │ │ + ACID txns │
│ SQL analytics │ │ ML + analytics │ │ + governance │
│ e.g. Redshift │ │ e.g. S3 │ │ e.g. S3 + Iceberg│
└──────────────────┘ └──────────────────┘ └──────────────────┘
ELI5: A data warehouse is a neatly organized closet — everything is folded, labeled, and in its exact place, but you did all that work upfront before anything could go in. A data lake is a massive garage where you throw everything raw: boxes, bikes, old furniture — cheap to store, but messy to search. A lakehouse tries to be both: the garage’s storage capacity with enough organization that you can actually find and update things reliably.
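The schema-on-write versus schema-on-read distinction can be sketched in a few lines. Everything here is hypothetical (the `SCHEMA`, the in-memory lists standing in for a warehouse table and a lake, the function names); the point is only where validation happens.

```python
import json

SCHEMA = {"user_id": int, "amount": float}  # hypothetical warehouse table schema


def write_to_warehouse(table, record):
    """Schema-on-write: reject records that do not match the schema up front."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected_type in SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
    table.append(record)


def write_to_lake(lake, raw_line):
    """Schema-on-read: store the raw text as-is, with no validation."""
    lake.append(raw_line)


def read_from_lake(lake):
    """Apply the schema only at query time, skipping lines that do not fit."""
    for raw_line in lake:
        try:
            record = json.loads(raw_line)
            yield {"user_id": int(record["user_id"]), "amount": float(record["amount"])}
        except (ValueError, KeyError, TypeError):
            continue  # malformed data surfaces at read time, not write time
```

The warehouse path fails fast on bad input, while the lake path accepts anything and pushes the cleanup cost to every reader. Lakehouse table formats aim to keep the cheap writes while adding transactional guarantees.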
Data Mesh
- Decentralized, domain-oriented data ownership
- Each team owns and serves their data as a product
- Federated governance with self-serve infrastructure
Amazon S3 — The ML Data Lake
S3 is the primary storage for ML data on AWS. Almost every ML pipeline starts and ends with S3.
S3 Key Features for ML
| Feature | Purpose |
|---|---|
| Versioning | Track dataset versions, rollback |
| Lifecycle Rules | Auto-transition old data to cheaper tiers |
| Replication | Cross-region for DR, same-region for compliance |
| Event Notifications | Trigger Lambda/SQS/SNS on new data arrival |
| Access Points | Simplified access for different teams |
| Object Lambda | Transform data on read (e.g., redact PII) |
| S3 Select | Query inside objects (CSV/JSON/Parquet) without full download |
ELI5: S3 is the center of gravity in AWS ML because it’s the one service every other service can read from and write to. SageMaker trains from S3, Glue transforms data in S3, Athena queries S3, and Data Firehose delivers to S3. Think of S3 as the shared whiteboard that every team in the building can read and write: it’s not the fastest storage, but it’s universally accessible, scales to virtually any volume, and is cheap per GB.
Storage Classes
| Class | Use Case | Retrieval |
|---|---|---|
| S3 Standard | Frequently accessed training data | Instant |
| S3 Standard-IA | Infrequent access, older datasets | Instant, retrieval fee |
| S3 One Zone-IA | Reproducible data, non-critical | Instant, single AZ |
| S3 Glacier Instant Retrieval | Archive with instant access | Instant |
| S3 Glacier Flexible Retrieval | Long-term archive | Minutes to hours |
| S3 Glacier Deep Archive | Cheapest archive | 12-48 hours |
| S3 Intelligent-Tiering | Unknown access patterns | Auto-tiered |
ELI5: Storage classes are like storage units at different distances from your home. S3 Standard is stuff in your living room — instantly accessible, but you pay premium rent. Standard-IA is a storage unit across town — cheap monthly fee, but there’s a cost every time you go retrieve something. Glacier Deep Archive is a warehouse in another city — extremely cheap, but it takes 12-48 hours to get anything back. Intelligent-Tiering watches your access patterns and automatically moves things to the right “distance” for you.
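Moving data between these "distances" is usually automated with a lifecycle rule. The sketch below builds a rule as a plain dictionary in the shape S3 lifecycle configurations use (for example, the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`). The rule ID, the `datasets/raw/` prefix, and the day thresholds are assumptions chosen for illustration, and no AWS call is made.

```python
import json

# Hypothetical prefix and day thresholds; tune these to your own access patterns.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-down-old-training-data",
            "Filter": {"Prefix": "datasets/raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},  # Glacier Flexible Retrieval
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

If access patterns are genuinely unknown, Intelligent-Tiering avoids having to guess these thresholds at all.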
S3 Performance Optimization
Upload:
- Multipart upload for files > 100MB (required > 5GB)
- S3 Transfer Acceleration (uses CloudFront edge locations)
Download:
- Byte-range fetches for parallel reads
- S3 Select / Glacier Select to filter data server-side
Throughput:
- 3,500 PUT/COPY/POST/DELETE per prefix per second
- 5,500 GET/HEAD per prefix per second
- Spread across prefixes for higher throughput
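The numbers above lend themselves to quick arithmetic. The following sketch uses hypothetical helper names and only encodes the thresholds listed in this section; it makes no S3 calls.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def upload_strategy(size_bytes):
    """Apply the thresholds above: >5 GB must be multipart, >100 MB should be."""
    if size_bytes > 5 * GB:
        return "multipart (required)"
    if size_bytes > 100 * MB:
        return "multipart (recommended)"
    return "single PUT"


def byte_ranges(size_bytes, chunk_bytes):
    """Split an object into inclusive (start, end) ranges for parallel GETs."""
    return [
        (start, min(start + chunk_bytes, size_bytes) - 1)
        for start in range(0, size_bytes, chunk_bytes)
    ]


def prefixes_needed(target_gets_per_second, per_prefix_limit=5500):
    """Minimum number of prefixes to spread reads across for a target GET rate."""
    return math.ceil(target_gets_per_second / per_prefix_limit)


if __name__ == "__main__":
    print(upload_strategy(200 * MB))
    print(byte_ranges(10, 4))       # each pair maps to a "Range: bytes=start-end" header
    print(prefixes_needed(20_000))
```

For example, sustaining 20,000 GET requests per second needs the keys spread over at least four prefixes, since a single prefix tops out at 5,500.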
S3 Encryption
| Method | Key Management | Use Case |
|---|---|---|
| SSE-S3 | AWS manages keys | Default, simplest |
| SSE-KMS | AWS KMS manages keys | Audit trail, key rotation |
| SSE-C | Customer provides keys | Full control |
| Client-side | Encrypt before upload | Maximum security |
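As a rough guide to how the three server-side options differ at request time, the sketch below builds the extra arguments for each method. The parameter names mirror those accepted by boto3's `put_object`; the helper `encryption_args` itself is hypothetical, and client-side encryption is left out because it happens before any S3 request is made.

```python
import os


def encryption_args(method, kms_key_id=None, customer_key=None):
    """Build the extra request arguments for each server-side encryption method."""
    if method == "SSE-S3":
        return {"ServerSideEncryption": "AES256"}
    if method == "SSE-KMS":
        args = {"ServerSideEncryption": "aws:kms"}
        if kms_key_id:  # omit to fall back to the AWS managed key for S3
            args["SSEKMSKeyId"] = kms_key_id
        return args
    if method == "SSE-C":
        if customer_key is None or len(customer_key) != 32:
            raise ValueError("SSE-C needs a 256-bit (32-byte) key")
        return {"SSECustomerAlgorithm": "AES256", "SSECustomerKey": customer_key}
    raise ValueError(f"unknown method: {method}")


if __name__ == "__main__":
    print(encryption_args("SSE-S3"))
    print(encryption_args("SSE-KMS", kms_key_id="alias/my-ml-data-key"))  # hypothetical alias
    print(sorted(encryption_args("SSE-C", customer_key=os.urandom(32))))
```

With SSE-C the same key has to be supplied again on every read, which is the practical cost of "full control".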
Block & File Storage
Amazon EBS (Elastic Block Store)
- Attached to single EC2 instance (same AZ)
- Types: gp3 (general), io2 (high IOPS), st1 (throughput), sc1 (cold)
- Elastic Volumes: resize, change type, adjust IOPS without detaching
- Use case: SageMaker notebook instance storage
Amazon EFS (Elastic File System)
- Shared NFS file system across multiple instances and AZs
- Auto-scales, pay per use
- Use case: shared training data across multiple SageMaker instances
Amazon FSx
| Variant | Use Case |
|---|---|
| FSx for Lustre | High-performance ML training, integrates with S3 |
| FSx for Windows File Server | Windows-based workloads |
| FSx for NetApp ONTAP | Multi-protocol, hybrid |
| FSx for OpenZFS | Linux workloads, snapshots |
Exam tip: FSx for Lustre is the go-to for high-throughput, low-latency training when S3 alone isn’t fast enough. It can be backed by S3.
EBS vs EFS vs FSx
| Attribute | EBS | EFS | FSx for Lustre |
|---|---|---|---|
| Attach to | Single EC2 | Multiple EC2 | Multiple EC2 |
| Across AZs | No | Yes | Yes |
| Max throughput | ~1 GB/s | ~10 GB/s | 100s of GB/s |
| S3 integration | No | No | Yes (lazy load) |
| Best for ML | Notebooks | Shared data | HPC training |
ELI5: EBS is a USB drive — fast, dedicated to one machine, goes with it everywhere. EFS is a shared Google Drive folder — multiple people (EC2 instances) can all read and write the same files simultaneously. FSx for Lustre is a high-speed race track for data — built specifically so that hundreds of GPUs can all slam through terabytes of training data in parallel without waiting on each other.
Real-Time Streaming
Amazon Kinesis Data Streams

Producers → [Shard 1] [Shard 2] ... [Shard N] → Consumers
Write limit per shard: 1 MB/s or 1,000 records/s
Read limit per shard: 2 MB/s (shared across consumers, or dedicated per consumer with enhanced fan-out)
- Shards determine throughput — more shards = more capacity
- Retention: 24 hours default, up to 365 days
- Ordering: Per shard, using partition key
- Consumers: Lambda, KCL apps, Managed Service for Apache Flink (formerly Kinesis Data Analytics), Firehose
- Enhanced Fan-Out: 2MB/s per shard per consumer (dedicated)
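Shard sizing and key routing can be worked through with a short sketch. The per-shard limits are the ones listed above; the function names are hypothetical. Kinesis routes a record by taking the MD5 hash of its partition key as a 128-bit integer and finding the shard whose hash-key range contains it. The sketch assumes the shards divide that range evenly, which may not hold after shards have been split or merged.

```python
import hashlib
import math


def shards_needed(write_mb_per_s, records_per_s, read_mb_per_s):
    """Smallest shard count that satisfies all three per-shard limits above."""
    return max(
        math.ceil(write_mb_per_s / 1.0),    # 1 MB/s in per shard
        math.ceil(records_per_s / 1000.0),  # 1,000 records/s in per shard
        math.ceil(read_mb_per_s / 2.0),     # 2 MB/s out per shard (shared)
        1,
    )


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // shard_count
    return min(hash_value // range_size, shard_count - 1)


if __name__ == "__main__":
    print(shards_needed(write_mb_per_s=5, records_per_s=3000, read_mb_per_s=8))
    # The same key always lands on the same shard, which is what preserves ordering.
    print(shard_for_key("device-42", 4) == shard_for_key("device-42", 4))
```

Note that many small records can exhaust the 1,000 records/s limit long before the 1 MB/s limit, so both inputs matter when sizing.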
Amazon Data Firehose (formerly Kinesis Data Firehose)
- Fully managed delivery stream — no shards to manage
- Near-real-time (60-second buffer minimum)
- Destinations: S3, Redshift, OpenSearch, Splunk, HTTP endpoints
- Can transform data with Lambda before delivery
- Auto-scales, no capacity management
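The Lambda transformation step follows a fixed contract: Firehose passes a batch of base64-encoded records, and the function returns one entry per `recordId` with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. Below is a minimal sketch of such a handler. The payload fields (`user_id`, `event_type`) and the transformation itself are assumptions made up for the example.

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose data-transformation handler: decode, reshape, re-encode."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical transform: keep two fields and normalise the event type.
            transformed = {
                "user_id": payload["user_id"],
                "event_type": payload["event_type"].lower(),
            }
            # Newline-delimit records so downstream readers can split them.
            data = base64.b64encode((json.dumps(transformed) + "\n").encode("utf-8"))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": data.decode("utf-8"),
            })
        except (ValueError, KeyError, TypeError, AttributeError):
            # Hand back the original payload and mark the record as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every incoming `recordId` must appear exactly once in the response, which is why failures are reported per record rather than by raising an exception for the whole batch.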
Kinesis Data Streams vs Firehose:
┌──────────────────────┬────────────────────────┐
│ Data Streams │ Firehose │
├──────────────────────┼────────────────────────┤
│ Real-time (200ms) │ Near real-time (60s+) │
│ Manual scaling │ Auto-scaling │
│ Custom consumers │ Fixed destinations │
│ Data replay │ No replay │
│ You manage shards │ Fully managed │
└──────────────────────┴────────────────────────┘
ELI5: Kinesis Data Streams is like laying your own plumbing — you control every pipe, valve, and flow rate, and you can tap into the stream from multiple places simultaneously. Firehose is like calling a plumber who does it all for you — you tell them where the water should end up (S3, Redshift, etc.) and they handle everything, but you get less control. If you need custom processing or multiple consumers reading the same stream, use Data Streams. If you just need data reliably delivered somewhere, use Firehose.
Amazon Managed Service for Apache Flink
- Real-time stream processing with SQL or Java/Scala/Python
- Source: Kinesis Data Streams, MSK
- Use case: real-time analytics, anomaly detection, aggregations
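- RANDOM_CUT_FOREST: built-in SQL function for anomaly detection on streaming data; it comes from the service's predecessor, Kinesis Data Analytics for SQL, rather than from Apache Flink itself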
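To make "aggregations" and "anomaly detection" concrete, here is a plain-Python stand-in for what a streaming job computes. It is not the Flink API and not the RANDOM_CUT_FOREST function; it only shows a tumbling (fixed, non-overlapping) window count followed by a deliberately naive threshold rule. Function names and the sample events are hypothetical.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    `events` is an iterable of (epoch_seconds, key) pairs. Returns a dict
    mapping (window_start, key) -> count.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


def flag_spikes(counts, threshold):
    """Naive anomaly rule: flag any window whose count exceeds a fixed threshold."""
    return sorted(k for k, count in counts.items() if count > threshold)


if __name__ == "__main__":
    events = [(0, "a"), (5, "a"), (59, "a"), (60, "a"), (61, "b")]
    counts = tumbling_window_counts(events, window_seconds=60)
    print(counts)
    print(flag_spikes(counts, threshold=2))
```

A real stream processor also has to handle late and out-of-order events and keep state across restarts, which this batch-style sketch ignores.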
Amazon MSK (Managed Streaming for Apache Kafka)
- Fully managed Apache Kafka
- MSK Connect: managed Kafka Connect connectors
- MSK Serverless: auto-scaling, no cluster management
- Use case: teams already using Kafka ecosystem
Kinesis vs MSK
| Feature | Kinesis | MSK |
|---|---|---|
| Protocol | AWS proprietary | Apache Kafka |
| Message size | 1 MB max | 1 MB default (configurable higher) |
| Retention | Up to 365 days | Unlimited (with tiered storage) |
| Ecosystem | AWS native | Kafka ecosystem |
| Scaling | Shard splitting/merging | Add brokers/partitions |
ETL & Pipeline Orchestration
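Data Sources → [Ingestion] → [Storage] → [ETL/Transform] → [ML-Ready Data]
| Stage | Services |
|---|---|
| Ingestion | Kinesis, MSK, Firehose |
| Storage | S3, EFS, FSx |
| ETL/Transform | Glue, EMR, Data Wrangler |
| ML-Ready Data | Feature Store |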
Key Integration Pattern
S3 Event Notification
→ Lambda (lightweight transform)
→ or SQS → batch consumer
→ or EventBridge → Step Functions → Glue ETL → S3 (processed)
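The first branch of this pattern (S3 event → Lambda) can be sketched as below. The event shape follows the S3 event notification format, in which object keys arrive URL-encoded; the handler only extracts bucket and key, and the actual transform is left as a placeholder because it depends on the pipeline.

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """Collect the bucket and key of every object named in an S3 event."""
    new_objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in the event payload.
        key = unquote_plus(record["s3"]["object"]["key"])
        new_objects.append({"bucket": bucket, "key": key})
    # A real handler would transform the object or start a downstream job here.
    return {"processed": new_objects}
```

Lambda suits lightweight, per-object work. Heavier transforms are better routed through the SQS or EventBridge → Step Functions → Glue branches of the pattern.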
Quick Reference: When to Use What
| Scenario | Service |
|---|---|
| Store training datasets | S3 |
| High-throughput training I/O | FSx for Lustre + S3 |
| Shared data across instances | EFS |
| Real-time ingestion (custom processing) | Kinesis Data Streams |
| Real-time ingestion (deliver to S3/Redshift) | Data Firehose |
| Stream processing with SQL | Managed Service for Apache Flink |
| Kafka-based streaming | MSK |
| Fastest SageMaker training input | RecordIO + Pipe mode |
| Archive old datasets cheaply | S3 Glacier Deep Archive |
| Auto-tier based on access | S3 Intelligent-Tiering |