Domain 1A: Data Ingestion & Storage

Exam Domain: 1 - Data Preparation for ML (28%)
Task: Ingest and store data for ML workloads

Data Fundamentals

Types of Data

Type	Description	Examples
Structured	Fixed schema, rows & columns	RDS, DynamoDB, CSV
Semi-structured	Flexible schema, self-describing	JSON, XML, Parquet, Avro
Unstructured	No predefined schema	Images, audio, video, text

ELI5: Think of a library. Structured data is the card catalog — every entry has the same fields (title, author, date) in the same columns. Semi-structured data is a set of labeled boxes — each box has its own labels, but at least it tells you what’s inside (like a JSON object). Unstructured data is a pile of random stuff on the floor — photos, sticky notes, voice memos — useful, but you have to figure out what it is yourself.

The Three V’s of Data

Volume    →  How much data (GB → PB scale)
Velocity  →  How fast data arrives (batch vs real-time)
Variety   →  How many formats/sources

Why this matters for the exam: The Three V’s determine which AWS services you reach for. High Volume → S3, EMR, Redshift. High Velocity → Kinesis, Firehose. High Variety → Glue (schema discovery), Athena (query anything). When an exam question describes a scenario, map the V’s first and the service answer usually becomes obvious.
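A minimal sketch of the "map the V's first" habit, written as a lookup table. The service lists are taken from the paragraph above; the dictionary and function names are hypothetical and only for illustration.

```python
# Illustrative lookup only: services come from the paragraph above,
# and the names SERVICES_BY_V / candidate_services are made up for this sketch.
SERVICES_BY_V = {
    "volume": ["S3", "EMR", "Redshift"],
    "velocity": ["Kinesis", "Firehose"],
    "variety": ["Glue", "Athena"],
}

def candidate_services(high_vs):
    """Return candidate services for the V's a scenario stresses, in order, without duplicates."""
    seen = []
    for v in high_vs:
        for service in SERVICES_BY_V[v.lower()]:
            if service not in seen:
                seen.append(service)
    return seen

# A scenario with fast-arriving data in many formats:
print(candidate_services(["Velocity", "Variety"]))
```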

Data Formats for ML

Format	Type	Compression	Columnar	Best For
CSV	Text	No	No	Small datasets, simple tabular
JSON	Text	No	No	Semi-structured, API data
Parquet	Binary	Yes (Snappy)	Yes	Analytics, large datasets, Athena
ORC	Binary	Yes	Yes	Hive/EMR workloads
Avro	Binary	Yes	No	Schema evolution, streaming
RecordIO	Binary	Yes	No	SageMaker Pipe mode (fastest)

Exam tip: RecordIO + Pipe mode = fastest SageMaker training data ingestion.

ELI5: Choosing a format is like storing your movie collection. CSV is an uncompressed video: simple and readable, but huge. Parquet is a compressed, chapter-indexed file: it takes up far less space, and when you only want “action scenes” (specific columns), it jumps right there without reading the whole file. Because Athena bills by the amount of data scanned, converting large datasets from CSV to Parquet usually cuts query costs substantially (savings of 30-90% are commonly quoted).
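To make the "only read the columns you need" point concrete, here is a toy comparison of bytes scanned in a row layout versus a column layout. It uses only the standard library and is not how Parquet is actually encoded (real Parquet adds row groups, encodings, compression, and metadata); the sample data is invented.

```python
import csv
import io

# Invented sample data: a small ID, a short code, and a long free-text field.
rows = [
    {"user_id": str(i), "country": "US", "review_text": "great product " * 20}
    for i in range(1000)
]

# Row-oriented (CSV-like): reading one column still means reading every full row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "country", "review_text"])
writer.writeheader()
writer.writerows(rows)
row_bytes_scanned = len(buf.getvalue().encode("utf-8"))

# Column-oriented (Parquet-like): each column is stored contiguously,
# so a query on "country" touches only that column's bytes.
columns = {name: "\n".join(r[name] for r in rows) for name in rows[0]}
col_bytes_scanned = len(columns["country"].encode("utf-8"))

print(f"row layout scans    {row_bytes_scanned} bytes")
print(f"column layout scans {col_bytes_scanned} bytes")
```

The gap grows with the number and width of the columns a query does not need, which is why wide tables benefit most.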

Data Architecture Patterns

[Figure: Data Ingestion Pipeline]

Data Warehouse vs Data Lake vs Lakehouse

┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│  DATA WAREHOUSE  │  │    DATA LAKE     │  │   LAKEHOUSE      │
├──────────────────┤  ├──────────────────┤  ├──────────────────┤
│ Structured only  │  │ All data types   │  │ All data types   │
│ Schema-on-write  │  │ Schema-on-read   │  │ Schema-on-read   │
│ Pre-processed    │  │ Raw + processed  │  │ + ACID txns      │
│ SQL analytics    │  │ ML + analytics   │  │ + governance     │
│ e.g. Redshift    │  │ e.g. S3          │  │ e.g. S3 + Iceberg│
└──────────────────┘  └──────────────────┘  └──────────────────┘

ELI5: A data warehouse is a neatly organized closet — everything is folded, labeled, and in its exact place, but you did all that work upfront before anything could go in. A data lake is a massive garage where you throw everything raw: boxes, bikes, old furniture — cheap to store, but messy to search. A lakehouse tries to be both: the garage’s storage capacity with enough organization that you can actually find and update things reliably.
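The schema-on-write versus schema-on-read distinction from the diagram can be sketched in a few lines. This is a conceptual model under stated assumptions, not Redshift or S3 behavior: the schema, field names, and functions are hypothetical.

```python
import json

SCHEMA = {"user_id": int, "amount": float}  # hypothetical schema

def write_to_warehouse(table, record):
    """Schema-on-write: reject records that do not match the schema before storing."""
    for field, expected in SCHEMA.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"bad or missing field: {field}")
    table.append(record)

def write_to_lake(objects, raw_line):
    """Schema-on-read: store the raw text as-is, with no validation."""
    objects.append(raw_line)

def read_from_lake(objects):
    """Apply the schema only when reading; skip records that do not fit."""
    for raw in objects:
        try:
            rec = json.loads(raw)
            yield {"user_id": int(rec["user_id"]), "amount": float(rec["amount"])}
        except (ValueError, KeyError, TypeError):
            continue

warehouse, lake = [], []
write_to_warehouse(warehouse, {"user_id": 1, "amount": 9.5})
write_to_lake(lake, '{"user_id": "2", "amount": "3.25"}')
write_to_lake(lake, "not json at all")  # accepted now, filtered at read time
print(list(read_from_lake(lake)))
```

The warehouse pays the validation cost up front; the lake accepts anything and defers the cost (and the surprises) to whoever reads it.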

Data Mesh

Decentralized, domain-oriented data ownership
Each team owns and serves their data as a product
Federated governance with self-serve infrastructure

Amazon S3 — The ML Data Lake

S3 is the primary storage for ML data on AWS. Almost every ML pipeline starts and ends with S3.

S3 Key Features for ML

Feature	Purpose
Versioning	Track dataset versions, rollback
Lifecycle Rules	Auto-transition old data to cheaper tiers
Replication	Cross-region for DR, same-region for compliance
Event Notifications	Trigger Lambda/SQS/SNS on new data arrival
Access Points	Simplified access for different teams
Object Lambda	Transform data on read (e.g., redact PII)
S3 Select	Query inside objects (CSV/JSON/Parquet) without full download

ELI5: S3 is the center of gravity in AWS ML because it’s the one service every other service can read from and write to. SageMaker trains from S3, Glue transforms data in S3, Athena queries S3, Data Firehose delivers to S3. Think of S3 as the shared whiteboard that every team in the building can read and write. It’s not the fastest storage, but it’s universally accessible, scales almost without limit, and is dirt cheap.

Storage Classes

Class	Use Case	Retrieval
S3 Standard	Frequently accessed training data	Instant
S3 Standard-IA	Infrequent access, older datasets	Instant, retrieval fee
S3 One Zone-IA	Reproducible data, non-critical	Instant, single AZ
S3 Glacier Instant Retrieval	Archive with instant access	Instant (milliseconds)
S3 Glacier Flexible Retrieval	Long-term archive	Minutes to hours
S3 Glacier Deep Archive	Cheapest archive	12-48 hours
S3 Intelligent-Tiering	Unknown access patterns	Auto-tiered

ELI5: Storage classes are like storage units at different distances from your home. S3 Standard is stuff in your living room — instantly accessible, but you pay premium rent. Standard-IA is a storage unit across town — cheap monthly fee, but there’s a cost every time you go retrieve something. Glacier Deep Archive is a warehouse in another city — extremely cheap, but it takes 12-48 hours to get anything back. Intelligent-Tiering watches your access patterns and automatically moves things to the right “distance” for you.
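Lifecycle Rules are what move data between these classes automatically. The sketch below builds a rule in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` accepts; the prefix, day thresholds, rule ID format, and bucket name are assumptions chosen for illustration, and the API call itself is left as a comment so the block runs without AWS credentials.

```python
import json

def lifecycle_rule(prefix, ia_days=30, glacier_days=90, deep_archive_days=180):
    """Build one lifecycle rule that moves objects under `prefix` to colder tiers over time."""
    return {
        "ID": f"tier-down-{prefix.strip('/').replace('/', '-')}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},  # Glacier Flexible Retrieval
            {"Days": deep_archive_days, "StorageClass": "DEEP_ARCHIVE"},
        ],
    }

config = {"Rules": [lifecycle_rule("datasets/raw/")]}
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, this dict could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-ml-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=config,
#   )
```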

S3 Performance Optimization

Upload:
  - Multipart upload for files > 100MB (required > 5GB)
  - S3 Transfer Acceleration (uses CloudFront edge locations)

Download:
  - Byte-range fetches for parallel reads
  - S3 Select / Glacier Select to filter data server-side

Throughput:
  - 3,500 PUT/COPY/POST/DELETE per prefix per second
  - 5,500 GET/HEAD per prefix per second
  - Spread across prefixes for higher throughput
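Two of the items above reduce to simple arithmetic. The first helper splits an object into `Range` header values for parallel byte-range fetches; the second multiplies the per-prefix GET limit by the number of prefixes. Both function names are hypothetical, and the throughput helper assumes load is spread perfectly evenly, which real workloads rarely achieve.

```python
def byte_ranges(object_size, part_size):
    """Split an object into HTTP Range header values for parallel byte-range GETs."""
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + part_size, object_size) - 1  # Range ends are inclusive
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges

def max_get_rate(prefix_count, per_prefix=5500):
    """Aggregate GET/HEAD requests per second if load is spread evenly across prefixes."""
    return prefix_count * per_prefix

print(byte_ranges(250, 100))  # ['bytes=0-99', 'bytes=100-199', 'bytes=200-249']
print(max_get_rate(10))       # 55000
```

Each range string can be passed as the `Range` parameter of a separate GET request, and the parts are concatenated in order on the client.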

S3 Encryption

Method	Key Management	Use Case
SSE-S3	AWS manages keys	Default, simplest
SSE-KMS	AWS KMS manages keys	Audit trail, key rotation
SSE-C	Customer provides keys	Full control
Client-side	Encrypt before upload	Maximum security
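For the two most common server-side options, the choice shows up as extra arguments on the upload call. This sketch returns the keyword arguments that boto3's `put_object` takes for SSE-S3 and SSE-KMS; the helper name and the key ARN are placeholders, and SSE-C and client-side encryption are left out because they require handling key material yourself.

```python
def encryption_args(method, kms_key_id=None):
    """Return the extra put_object keyword arguments for a server-side encryption method."""
    if method == "SSE-S3":
        return {"ServerSideEncryption": "AES256"}
    if method == "SSE-KMS":
        args = {"ServerSideEncryption": "aws:kms"}
        if kms_key_id:  # omit to fall back to the AWS managed key for S3
            args["SSEKMSKeyId"] = kms_key_id
        return args
    raise ValueError(f"unsupported method: {method}")

# Placeholder key ARN, for illustration only.
print(encryption_args("SSE-KMS", "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE"))

# With boto3 installed and credentials configured:
#   s3.put_object(Bucket="my-ml-bucket", Key="train.csv", Body=data,
#                 **encryption_args("SSE-S3"))
```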

Block & File Storage

Amazon EBS (Elastic Block Store)

Attached to single EC2 instance (same AZ)
Types: gp3 (general), io2 (high IOPS), st1 (throughput), sc1 (cold)
Elastic Volumes: resize, change type, adjust IOPS without detaching
Use case: SageMaker notebook instance storage

Amazon EFS (Elastic File System)

Shared NFS file system across multiple instances and AZs
Auto-scales, pay per use
Use case: shared training data across multiple SageMaker instances

Amazon FSx

Variant	Use Case
FSx for Lustre	High-performance ML training, integrates with S3
FSx for Windows	Windows-based workloads
FSx for NetApp ONTAP	Multi-protocol, hybrid
FSx for OpenZFS	Linux workloads, snapshots

Exam tip: FSx for Lustre is the go-to for high-throughput, low-latency training when S3 alone isn’t fast enough. It can be backed by S3.

EBS vs EFS vs FSx

                    EBS              EFS              FSx Lustre
Attach to:          Single EC2       Multiple EC2     Multiple EC2
Across AZs:         No               Yes              No (single AZ)
Max throughput:     ~1 GB/s          ~10 GB/s         100s GB/s
S3 integration:     No               No               Yes (lazy load)
Best for ML:        Notebooks        Shared data      HPC training

ELI5: EBS is a USB drive — fast, dedicated to one machine, goes with it everywhere. EFS is a shared Google Drive folder — multiple people (EC2 instances) can all read and write the same files simultaneously. FSx for Lustre is a high-speed race track for data — built specifically so that hundreds of GPUs can all slam through terabytes of training data in parallel without waiting on each other.
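A rough worked example of why throughput matters for training: the best-case time to stream a dataset once at each tier's approximate peak rate. The figures are the rounded values from the comparison table, with "100s GB/s" represented by 100 as a conservative stand-in; real jobs are also limited by instance network bandwidth, file sizes, and the data loader.

```python
# Approximate peak throughput from the comparison table, in GB/s.
# 100 is a stand-in for "100s GB/s"; treat all three as order-of-magnitude values.
THROUGHPUT_GB_PER_S = {"EBS": 1, "EFS": 10, "FSx for Lustre": 100}

def seconds_per_epoch(dataset_gb, store):
    """Best-case time to stream the whole dataset once, ignoring all other bottlenecks."""
    return dataset_gb / THROUGHPUT_GB_PER_S[store]

for store in THROUGHPUT_GB_PER_S:
    print(f"{store}: {seconds_per_epoch(10_000, store):,.0f} s for a 10 TB dataset")
```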

Real-Time Streaming

Amazon Kinesis Data Streams

[Figure: Kinesis Streaming Architecture]

Producers → [Shard 1] [Shard 2] [Shard N] → Consumers
             1MB/s in   1MB/s in             2MB/s out per shard
             1000 rec/s  1000 rec/s           (shared or enhanced fan-out)

Shards determine throughput — more shards = more capacity
Retention: 24 hours default, up to 365 days
Ordering: Per shard, using partition key
Consumers: Lambda, KCL apps, Managed Service for Apache Flink (formerly Kinesis Data Analytics), Firehose
Enhanced Fan-Out: 2MB/s per shard per consumer (dedicated)
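Shard sizing follows directly from the per-shard limits above: take whichever limit needs the most shards. The second helper illustrates how a partition key lands on a shard. Kinesis hashes the key with MD5 into a 128-bit hash key space; the assumption here is that the shards split that space into equal ranges, which holds for a freshly created stream but not necessarily after uneven splits or merges. Function names are hypothetical.

```python
import hashlib
import math

def shards_needed(in_mb_per_s, records_per_s, out_mb_per_s):
    """Minimum shards under the per-shard limits (shared-throughput consumers)."""
    return max(
        math.ceil(in_mb_per_s / 1),       # 1 MB/s ingest per shard
        math.ceil(records_per_s / 1000),  # 1,000 records/s ingest per shard
        math.ceil(out_mb_per_s / 2),      # 2 MB/s egress per shard
        1,
    )

def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming equal-sized hash key ranges."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return hash_key * shard_count // 2**128

print(shards_needed(in_mb_per_s=6, records_per_s=4500, out_mb_per_s=18))  # 9
print(shard_for_key("sensor-42", 4))
```

In the example, egress is the binding constraint: 18 MB/s of reads at 2 MB/s per shard needs 9 shards, even though ingest alone would need only 6. Because the same key always hashes to the same shard, ordering is preserved per key.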

Amazon Data Firehose (formerly Kinesis Data Firehose)

Fully managed delivery stream — no shards to manage
Near-real-time: buffers records by size or time before delivery (60-second minimum interval in the classic configuration)
Destinations: S3, Redshift, OpenSearch, Splunk, HTTP endpoints
Can transform data with Lambda before delivery
Auto-scales, no capacity management
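The "near-real-time" label comes from buffering: records accumulate until a size or time threshold is reached, then go out as one batch. The class below is a toy model of that idea, not the Firehose API. One simplification to note: it checks the time threshold only when a record arrives, whereas a real delivery service flushes on a timer.

```python
class DeliveryBuffer:
    """Toy model of size-or-interval buffering; not the Firehose API."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.opened_at = None
        self.delivered = []  # each flush becomes one batch (for example, one S3 object)

    def put(self, record, now):
        """Add a record at time `now` (seconds); flush if either threshold is hit."""
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        if self.size >= self.max_bytes or now - self.opened_at >= self.max_seconds:
            self.flush()

    def flush(self):
        if self.records:
            self.delivered.append(self.records)
        self.records, self.size, self.opened_at = [], 0, None

buf = DeliveryBuffer(max_bytes=10, max_seconds=60)
buf.put(b"aaaa", now=0)
buf.put(b"bbbb", now=5)
buf.put(b"cccc", now=6)    # 12 bytes >= 10: the size threshold flushes
buf.put(b"dd", now=10)
buf.put(b"ee", now=75)     # 65 s since the buffer opened: the interval threshold flushes
print(len(buf.delivered))  # 2
```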

Kinesis Data Streams vs Firehose:
┌──────────────────────┬────────────────────────┐
│   Data Streams       │   Firehose             │
├──────────────────────┼────────────────────────┤
│ Real-time (200ms)    │ Near real-time (60s+)  │
│ Manual scaling       │ Auto-scaling           │
│ Custom consumers     │ Fixed destinations     │
│ Data replay          │ No replay              │
│ You manage shards    │ Fully managed          │
└──────────────────────┴────────────────────────┘

ELI5: Kinesis Data Streams is like laying your own plumbing — you control every pipe, valve, and flow rate, and you can tap into the stream from multiple places simultaneously. Firehose is like calling a plumber who does it all for you — you tell them where the water should end up (S3, Redshift, etc.) and they handle everything, but you get less control. If you need custom processing or multiple consumers reading the same stream, use Data Streams. If you just need data reliably delivered somewhere, use Firehose.

Amazon Managed Service for Apache Flink

Real-time stream processing with SQL or Java/Scala/Python
Source: Kinesis Data Streams, MSK
Use case: real-time analytics, anomaly detection, aggregations
RANDOM_CUT_FOREST function: built-in anomaly detection in streaming data (a SQL function that dates from the service’s earlier Kinesis Data Analytics for SQL form)
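The "aggregations" use case usually means windowed aggregation. The sketch below shows a tumbling (fixed, non-overlapping) window average in plain Python to illustrate the concept; it is not Flink code, and the event tuple layout and function name are assumptions.

```python
from collections import defaultdict

def tumbling_window_avg(events, window_seconds):
    """Average `value` per (window_start, key) over fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_seconds, key, value) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for ts, key, value in events:
        window_start = (ts // window_seconds) * window_seconds
        bucket = sums[(window_start, key)]
        bucket[0] += value
        bucket[1] += 1
    return {k: total / count for k, (total, count) in sorted(sums.items())}

events = [(1, "sensor-a", 10.0), (30, "sensor-a", 20.0), (61, "sensor-a", 90.0)]
print(tumbling_window_avg(events, 60))
# {(0, 'sensor-a'): 15.0, (60, 'sensor-a'): 90.0}
```

A stream processor does the same bookkeeping continuously and emits each window's result once the window closes, which is what makes a sudden jump like the 90.0 reading visible within about a minute.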

Amazon MSK (Managed Streaming for Apache Kafka)

Fully managed Apache Kafka
MSK Connect: managed Kafka Connect connectors
MSK Serverless: auto-scaling, no cluster management
Use case: teams already using Kafka ecosystem

Kinesis vs MSK

Feature	Kinesis	MSK
Protocol	AWS proprietary	Apache Kafka
Message size	1 MB max	1 MB default (configurable higher)
Retention	Up to 365 days	Unlimited (with tiered storage)
Ecosystem	AWS native	Kafka ecosystem
Scaling	Shard splitting/merging	Add brokers/partitions

ETL & Pipeline Orchestration

Data Sources → [Ingestion] → [Storage] → [ETL/Transform] → [ML-Ready Data]
                Kinesis        S3           Glue               Feature Store
                MSK            EFS          EMR
                Firehose       FSx          Data Wrangler

Key Integration Pattern

S3 Event Notification
    → Lambda (lightweight transform)
    → or SQS → batch consumer
    → or EventBridge → Step Functions → Glue ETL → S3 (processed)
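The first branch of this pattern, S3 Event Notification to Lambda, might look like the handler below. It only extracts the bucket and key of each newly created object, which is the usual first step before any transform. The record fields follow the S3 event notification structure; the bucket name, key, and sample event are invented for the example.

```python
from urllib.parse import unquote_plus

def handler(event, context=None):
    """Lambda entry point: list the (bucket, key) pairs in an S3 event notification."""
    new_objects = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue  # ignore deletes and other event types
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        new_objects.append((bucket, key))
    return new_objects

sample_event = {
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "my-ml-bucket"},             # hypothetical bucket
            "object": {"key": "raw/2024/user+events.csv"},  # '+' encodes a space
        },
    }]
}
print(handler(sample_event))  # [('my-ml-bucket', 'raw/2024/user events.csv')]
```

Decoding the key matters in practice: a key containing spaces or special characters will not match the real object name unless it is unquoted first.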

Quick Reference: When to Use What

Scenario	Service
Store training datasets	S3
High-throughput training I/O	FSx for Lustre + S3
Shared data across instances	EFS
Real-time ingestion (custom processing)	Kinesis Data Streams
Near-real-time ingestion (deliver to S3/Redshift)	Data Firehose
Stream processing with SQL	Managed Service for Apache Flink
Kafka-based streaming	MSK
Fastest SageMaker training input	RecordIO + Pipe mode
Archive old datasets cheaply	S3 Glacier Deep Archive
Auto-tier based on access	S3 Intelligent-Tiering
