← AWS MLS-C01 — ML Specialty

Domain 1A: Data Storage & Ingestion

Data Storage & Ingestion

Exam Domain: 1 — Data Engineering (20%) Task: Design data ingestion and storage solutions for ML workloads


Data Fundamentals

Types of Data

TypeDescriptionExamplesML Use Cases
StructuredFixed schema, rows & columnsRDS tables, CSV, spreadsheetsTabular ML, classification
Semi-structuredSelf-describing, flexible schemaJSON, XML, Parquet, AvroNLP preprocessing, logs
UnstructuredNo predefined schemaImages, audio, video, raw textComputer vision, NLP

ELI5: Think of a library. Structured data is the card catalog — every entry has the same fields (title, author, date) in the same columns. Semi-structured is a set of labeled boxes — each box has its own labels, but it tells you what’s inside (like a JSON object). Unstructured is a pile of random stuff on the floor — photos, sticky notes, voice memos — useful, but you have to figure out what it is yourself before you can learn from it.

The Three V’s — Mapped to AWS Services

┌─────────────────────────────────────────────────────────────┐
│  VOLUME — How much data (GB → PB scale)                     │
│    AWS: S3 (unlimited), Redshift (PB warehouse),            │
│         EMR (distributed processing)                        │
├─────────────────────────────────────────────────────────────┤
│  VELOCITY — How fast data arrives                           │
│    Batch:      Glue, EMR, AWS Batch                         │
│    Real-time:  Kinesis Data Streams, MSK, Flink             │
│    Near-RT:    Data Firehose (60s buffer)                   │
├─────────────────────────────────────────────────────────────┤
│  VARIETY — How many formats and sources                     │
│    Discovery:  Glue Crawlers, Data Catalog                  │
│    Query any:  Athena (schema-on-read)                      │
│    Transform:  Glue ETL, EMR, Data Wrangler                 │
└─────────────────────────────────────────────────────────────┘

Why this matters for the exam: When a question describes a scenario, map the V’s first — the service answer becomes obvious. “10 PB of images” → S3 + EMR (Volume). “IoT sensors every 10ms” → Kinesis (Velocity). “CSV + JSON + XML from 15 sources” → Glue Crawlers + Data Catalog (Variety).


Data Formats for ML

Format Comparison Table

FormatTypeColumnar?CompressionSchema EvolutionBest For
CSVTextNoNonePoor (no types)Small datasets, human-readable, quick prototyping
JSONTextNoNoneGood (flexible)Semi-structured, API data, nested records
ParquetBinaryYesSnappy/GZIPGoodAnalytics, Athena, Spark, large datasets
ORCBinaryYesZlib/SnappyGoodHive/EMR-heavy workloads, Hive ORC predicates
AvroBinaryNoDeflate/SnappyExcellentSchema evolution, Kafka streams, event sourcing
RecordIOBinaryNoLZ4PoorSageMaker Pipe mode (fastest training input)
ProtobufBinaryNoNone (structure)GoodgRPC APIs, TFRecord conversion, microservices

Why this matters for the exam: Format choice has direct cost and performance implications. Parquet on Athena costs 30–90% less than CSV because Athena charges per TB scanned — and columnar format + compression dramatically shrinks what gets scanned. RecordIO + Pipe mode is the fastest SageMaker training path: data streams directly from S3 into the algorithm without materializing on disk.

ELI5: Why Columnar Formats Matter

ELI5: Imagine you have a spreadsheet with 1000 columns and 1 million rows. You only need to sum one column — “revenue.” Row-oriented storage (CSV) reads every single cell of every row to get that one column — 1 billion cell reads. Columnar storage (Parquet/ORC) stores all revenue values together — it jumps straight to that column and reads only 1 million values. For analytics and ML feature extraction, columnar wins every time. This is called column pruning.

Behind the Scenes: Predicate Pushdown

SQL:  SELECT revenue FROM sales WHERE region = 'US'

Row format (CSV):
  Read ALL rows → filter in memory → extract revenue column
  I/O cost: 100% of data

Columnar (Parquet):
  1. Read "region" column footer metadata (min/max per row group)
  2. Skip row groups where region can't be 'US' (predicate pushdown)
  3. Read only matching row groups from "revenue" column
  I/O cost: often 1-5% of data

Why RecordIO + Pipe Mode is Fastest for SageMaker

File Mode (default):
  S3 → Download ALL files to /opt/ml/input/data/ → Algorithm reads files
  Problem: Large datasets = long download = container sits idle

Pipe Mode + RecordIO:
  S3 → Stream data as named pipe → Algorithm reads continuously
  Result: Training starts immediately, no disk I/O bottleneck
  Speed: 2-10x faster for datasets > 50GB

Exam tip: If the question mentions reducing SageMaker training startup time or improving I/O throughput for large training sets, the answer involves RecordIO format + Pipe mode.


Data Architecture Patterns

Data Warehouse vs Data Lake vs Lakehouse

┌──────────────────┐   ┌──────────────────┐   ┌───────────────────┐
│  DATA WAREHOUSE  │   │    DATA LAKE     │   │    LAKEHOUSE      │
├──────────────────┤   ├──────────────────┤   ├───────────────────┤
│ Structured only  │   │ Any format       │   │ Any format        │
│ Schema-on-WRITE  │   │ Schema-on-READ   │   │ Schema-on-read    │
│ Pre-transformed  │   │ Raw + processed  │   │ + ACID txns       │
│ BI & reporting   │   │ ML + data sci    │   │ + governance      │
│ Expensive/fast   │   │ Cheap/flexible   │   │ + versioning      │
│ e.g. Redshift    │   │ e.g. S3          │   │ e.g. S3 + Iceberg │
└──────────────────┘   └──────────────────┘   └───────────────────┘

ELI5: A data warehouse is a neatly organized closet — everything is folded, labeled, in its exact place, but you did all that work upfront before anything could go in. A data lake is a massive garage where you throw everything raw: boxes, bikes, old furniture — cheap to store, but messy to search. A lakehouse is the dream upgrade: the garage’s storage capacity plus enough organization that you can find, update, and version your things reliably.

Schema-on-Write vs Schema-on-Read (First Principles)

Schema-on-Write (Warehouse):
  Ingest time: "Is this column a number? Does the foreign key exist?"
  Query time:  Fast — data is already clean and indexed
  Trade-off:   Inflexible. Changing schema = ETL rewrite + backfill

Schema-on-Read (Data Lake):
  Ingest time: Dump it — worry about structure later
  Query time:  Slower — must parse and interpret on every read
  Trade-off:   Flexible. Add new columns, change types without touching raw data

When each breaks down:

  • Data Warehouse anti-pattern: Storing raw logs, images, or IoT sensor data — too expensive, too rigid
  • Data Lake anti-pattern: Needing ACID transactions (concurrent writes/deletes) — S3 is eventually consistent at the object level; without a table format (Iceberg, Delta), you get “data swamps”

Data Mesh Concept

  • Decentralized: each domain team owns, serves, and governs their own data
  • Self-serve infrastructure: teams don’t wait for central data engineering
  • Data as a product: each domain publishes discoverable, trustworthy datasets
  • AWS implementation: Lake Formation for governance + separate S3 prefixes per domain + Glue Data Catalog shared across accounts

Amazon S3 — The ML Data Lake

S3 is the center of gravity for ML on AWS. Nearly every ML pipeline starts and ends here.

ELI5: S3 is the shared whiteboard that every AWS service can read and write. SageMaker trains from S3, Glue transforms data in S3, Athena queries S3, Kinesis Firehose delivers to S3, Lambda reads from S3. It’s not the fastest storage, but it’s universally accessible, infinitely scalable, and cheap enough that you never have to delete anything. Every ML architecture diagram has S3 at the center.

Storage Classes — Decision Table

ClassMin StorageRetrievalRetrieval FeeBest For ML
StandardNoneInstantNoneActive training datasets
Standard-IA30 daysInstantYes/GBDatasets accessed < monthly
One Zone-IA30 daysInstantYes/GBReproducible data, no DR needed
Glacier Instant90 daysInstantYes/GBExperiment results, rare re-runs
Glacier Flexible90 days1min–12hrYes/GBLong-term model checkpoints
Glacier Deep Archive180 days12–48hrYes/GBCompliance archives, raw originals
Intelligent-TieringNoneInstant (freq)Monitoring feeUnknown / changing access patterns

ELI5: Storage classes are storage units at different distances from your home. Standard is a shelf in your living room — always there, premium price. Standard-IA is a storage unit across town — low monthly rent, but you pay a trip fee each time you retrieve something. Glacier Deep Archive is a warehouse in another city — incredibly cheap, but it takes 12–48 hours to get your stuff. Intelligent-Tiering is a self-driving delivery robot that watches what you access and automatically moves things to the right distance.

ML Decision Tree:

Active training data?        → Standard
Monthly batch reprocessing?  → Standard-IA
Reproducible intermediate?   → One Zone-IA
Final model artifacts?       → Glacier Instant (need it sometimes)
Raw original dataset backup? → Glacier Deep Archive
Don't know access pattern?   → Intelligent-Tiering

S3 Performance Optimization

UPLOADS:
  Multipart upload  → For files > 100MB (required for > 5GB)
                      Parallel part uploads, retry individual parts
  Transfer Accel    → Uses CloudFront edge locations globally
                      Good for cross-region uploads from on-prem

DOWNLOADS:
  Byte-range fetch  → Parallel reads of different parts of one file
                      Critical for large Parquet file reads
  S3 Select         → Filter CSV/JSON/Parquet server-side before download
                      Returns only matching rows — saves transfer cost

THROUGHPUT LIMITS:
  3,500 PUT/COPY/DELETE requests/sec per prefix
  5,500 GET/HEAD requests/sec per prefix
  Strategy: spread across prefixes  →  date=2024-01-01/ vs date=2024-01-02/
                                        (each prefix gets its own limit)

S3 Select: Save Money on ML Preprocessing

Without S3 Select:
  S3 → Download 50GB CSV → EC2/Lambda filters rows → Use 500MB result
  Cost: Transfer 50GB + compute to process 50GB

With S3 Select:
  S3 → SQL predicate runs SERVER-SIDE → Transfer only 500MB result
  Cost: Transfer 500MB + small S3 Select request fee
  Savings: 99% less data transferred

Exam tip: S3 Select is frequently tested. It’s useful when you only need a subset of a large file and want to reduce data transfer costs before ML preprocessing. Works on CSV, JSON, and Parquet in S3.

S3 Encryption Options

MethodWho manages keysAudit trailUse When
SSE-S3AWS (S3-managed)NoneDefault — simplest, no key management
SSE-KMSAWS KMSFull CloudTrail auditCompliance, key rotation, cross-account access
SSE-CYou (per-request)NoneFull key control, no AWS key storage
Client-sideYou (before upload)NoneMaximum security, regulated data

Exam tip: SSE-KMS is the most commonly tested encryption option for ML scenarios involving compliance, auditing, or cross-account access. Every KMS decrypt/encrypt call shows in CloudTrail — use this when auditors need to know who accessed what training data.

Versioning and Lifecycle for ML Dataset Management

VERSIONING:
  Enable on bucket → every PUT creates a new version ID
  ML use: dataset v1, v2, v3 — roll back if model degrades
  Cost: every version stored = separate storage charge
  MFA Delete: require MFA to permanently delete versions

LIFECYCLE RULES:
  Day 0:   training-data/ → S3 Standard    (active training)
  Day 30:  transition     → Standard-IA    (done training)
  Day 90:  transition     → Glacier        (archival)
  Day 365: expire         → Delete         (or move to Deep Archive)

Event Notifications — Trigger ML Pipelines

New file lands in S3
  → S3 Event Notification
      ├── SQS queue     → batch consumer polls and processes
      ├── SNS topic     → fan-out to multiple subscribers
      ├── Lambda        → lightweight immediate transformation
      └── EventBridge   → rule-based routing to Step Functions / Glue

Why this matters for the exam: Event-driven ML pipeline architecture is heavily tested. Know that S3 → EventBridge → Step Functions is the most flexible pattern for orchestrating multi-step ML workflows triggered by new data.


Block & File Storage for ML

EBS vs EFS vs FSx for Lustre

┌──────────────────┬──────────────────┬───────────────────────────┐
│       EBS        │       EFS        │      FSx for Lustre       │
├──────────────────┼──────────────────┼───────────────────────────┤
│ Block storage    │ NFS file system  │ Parallel file system      │
│ 1 EC2 instance   │ Many instances   │ Many instances (HPC)      │
│ Single AZ        │ Multi-AZ         │ Single AZ (or Multi-AZ)   │
│ ~1 GB/s max      │ ~10 GB/s burst   │ Hundreds of GB/s          │
│ No S3 link       │ No S3 link       │ S3-backed (lazy load)     │
│ SageMaker EBS    │ Shared training  │ Distributed GPU training  │
│ volumes          │ datasets         │ at scale                  │
└──────────────────┴──────────────────┴───────────────────────────┘

ELI5: EBS is a USB drive — fast, dedicated to one machine. EFS is a shared Google Drive folder — multiple machines can all read and write simultaneously. FSx for Lustre is a high-speed race track for data — built specifically so that hundreds of GPUs can all slam through terabytes of training data in parallel without waiting on each other.

Why FSx for Lustre Exists: The HPC Training Bottleneck

Problem:  100 GPU nodes training on S3 → each GPU makes GET requests
          → S3 per-prefix throughput limit hit → GPUs wait for data
          → GPU utilization drops to 30-40%

Solution: FSx for Lustre
          → Single high-throughput file system shared by all nodes
          → Lustre protocol designed for parallel I/O (HPC origins)
          → All 100 GPUs see same POSIX filesystem
          → Hundreds of GB/s aggregate throughput

Behind the Scenes: FSx Lustre Lazy Loading from S3

1. Create FSx filesystem linked to s3://my-bucket/training/
2. Mount on all training nodes (no data on FSx yet)
3. First read of file.parquet → cache miss → fetches from S3
4. File now cached on FSx Lustre SSD
5. All subsequent reads → served from local NVMe (microseconds)
6. After training, sync back to S3 with HSM commands
   or let FSx auto-export changed files to S3

Exam tip: When a question asks about training a model on very large datasets (hundreds of GB to TB) with multiple GPU instances and mentions I/O bottlenecks or GPU utilization problems, the answer is FSx for Lustre backed by S3. When the question asks about multiple SageMaker instances sharing feature data, the answer is EFS.

Storage Decision Guide

Use Case                              → Service
─────────────────────────────────────────────────
Single notebook, personal experiments → EBS (gp3)
Share preprocessed features, 2+ jobs → EFS
Multi-GPU distributed training, large → FSx for Lustre
datasets, HPC workloads
Everything else (ML data lake)        → S3

Real-Time Streaming

Amazon Kinesis Data Streams — Deep Dive

                    ┌─────────────┐
Producers           │  Stream     │             Consumers
                    ├─────────────┤
[App logs]  ──────► │  Shard 1    │ ──────────► [Lambda function]
[IoT sensor] ──────► │  Shard 2    │ ──────────► [KCL application]
[Clickstream]──────► │  Shard N    │ ──────────► [Kinesis Flink]
                    └─────────────┘ ──────────► [Firehose]
Per shard: 1 MB/s in, 1000 records/s in
           2 MB/s out (shared across consumers)
Enhanced Fan-Out: 2 MB/s out per consumer per shard (dedicated HTTP/2)

Key Concepts:

ConceptDetail
ShardUnit of capacity. More shards = more throughput
Partition keyDetermines which shard a record goes to (hashed)
Sequence numberUnique ID per record within shard, strictly ordered
RetentionDefault 24hr, up to 365 days (Extended Data Retention)
Enhanced Fan-OutDedicated 2MB/s per consumer — no sharing

Behind the Scenes: Shards, Ordering, and Partition Keys

Partition key → MD5 hash → maps to a shard

Example: user_id as partition key
  user_123 → hash(user_123) → Shard 1
  user_456 → hash(user_456) → Shard 3

ORDERING GUARANTEE:
  Within a shard: strict order (sequence numbers always increase)
  Across shards:  NO ordering guarantee

CONSEQUENCE FOR ML:
  Time-series per sensor? → Use sensor_id as partition key
    → All events for sensor_X go to same shard → ordered
    → Can reconstruct time series for that sensor correctly

ELI5 on partition keys: Imagine Kinesis is a post office with multiple sorting lanes (shards). The partition key is the zip code — it always routes the same zip code to the same lane. This means all mail from the same zip code stays in order (within its lane), but packages across different lanes don’t have a global order. For ML, if you need all events from the same user/device/sensor in order, use that identifier as the partition key.

Exam tip: Hot shards occur when partition keys are not evenly distributed (e.g., using a boolean value that puts 99% of records on 2 shards). Fix with better key selection or shard splitting.

Amazon Data Firehose — Fully Managed Delivery

Sources:                    Transforms:           Destinations:
Kinesis Data Streams  ──►   Lambda (optional) ──► S3
Direct PUT API        ──►   Format conversion ──► Redshift (via S3 COPY)
MSK                   ──►   (JSON→Parquet,    ──► OpenSearch Service
                            JSON→ORC)         ──► Splunk
                                              ──► HTTP endpoint
                                              ──► Datadog / New Relic

Key Properties:

  • Buffers data (size: 1–128 MB, or time: 60–900 seconds — whichever hits first)
  • No consumer management — fully serverless
  • Built-in format conversion (JSON → Parquet / ORC) without separate ETL
  • Error records automatically go to separate S3 prefix

Kinesis Data Streams vs Data Firehose

┌────────────────────────┬────────────────────────────────┐
│   DATA STREAMS         │   DATA FIREHOSE                │
├────────────────────────┼────────────────────────────────┤
│ Real-time (~200ms)     │ Near real-time (60s+ buffer)   │
│ You manage shards      │ Fully managed, auto-scales     │
│ Multiple consumers     │ Single delivery destination    │
│ Data replay supported  │ No replay                      │
│ Custom processing      │ Lambda transform only          │
│ KCL / Lambda / Flink   │ Fixed destinations             │
│ Pay per shard-hour     │ Pay per GB ingested            │
└────────────────────────┴────────────────────────────────┘

ELI5: Data Streams is laying your own plumbing — you control every pipe and valve, and multiple consumers can tap in simultaneously. Firehose is calling a plumber who handles everything: you say “water goes to S3” and they do it. If you need custom processing, multiple consumers, or data replay, use Data Streams. If you just need reliable delivery to a storage destination, use Firehose.

Decision Guide:

Need < 1 second latency?                   → Data Streams
Need custom consumer (Lambda/KCL/Flink)?   → Data Streams
Need to replay data?                       → Data Streams
Just delivering data to S3/Redshift?       → Firehose
Need built-in JSON→Parquet conversion?     → Firehose
Want zero operational overhead?            → Firehose
  • Real-time stream processing using SQL, Java, Scala, or Python (Apache Flink API)
  • Sources: Kinesis Data Streams, MSK (Kafka)
  • Fully managed: provisions Flink cluster, scales automatically
  • Use cases:
    • Real-time feature engineering (sliding window aggregations)
    • Anomaly detection before data lands in storage
    • Filtering, enrichment, joining streams

RANDOM_CUT_FOREST (RCF):

Built-in Flink SQL function for streaming anomaly detection
How it works:
  - Maintains a forest of random decision trees on recent data
  - New point gets an anomaly score based on depth in trees
  - Shallow depth (few cuts needed) → high anomaly score
  - Updates incrementally — no full retrain needed

Use case on exam: "Detect anomalies in real-time IoT sensor stream"
Answer: Managed Apache Flink with RANDOM_CUT_FOREST function

Amazon MSK (Managed Streaming for Apache Kafka)

  • Fully managed Kafka brokers (you still configure topics, partitions, retention)
  • MSK Connect: run Kafka Connect connectors (Debezium CDC, S3 Sink, etc.)
  • MSK Serverless: truly serverless — no broker management, auto-scales

Kinesis vs MSK — When to Use Each

FactorKinesisMSK
ProtocolAWS proprietaryApache Kafka
Team expertiseAWS-nativeExisting Kafka users
Message size1 MB max1 MB default, configurable higher
RetentionUp to 365 daysUnlimited (tiered storage)
EcosystemAWS services onlyKafka ecosystem (Kafka Streams, ksqlDB)
ComplexityLowHigher (broker config, partition management)
Cost modelPer shard-hourPer broker-hour

Discriminative example: Your team already uses Kafka on-prem and is migrating → MSK (no rewrite, familiar operations). Greenfield AWS-native project, no Kafka expertise → Kinesis (simpler, less overhead). Existing Kafka, want zero management → MSK Serverless.


Quick Reference: Scenario → Service

ScenarioService
Store ML training datasets at scaleS3
Fastest SageMaker training I/ORecordIO + Pipe mode
Multi-GPU training I/O bottleneckFSx for Lustre
Share feature data across training instancesEFS
Single notebook instance local storageEBS gp3
Real-time IoT ingestion with custom processingKinesis Data Streams
Deliver streaming data to S3/RedshiftData Firehose
JSON → Parquet conversion in streamData Firehose (built-in conversion)
Real-time anomaly detection on streamManaged Apache Flink + RCF
Kafka-based team migrating to AWSMSK
Greenfield real-time stream, AWS-nativeKinesis Data Streams
Archive old datasets cheaplyS3 Glacier Deep Archive
Unknown/changing dataset access patternsS3 Intelligent-Tiering
Query subset of large S3 file (cost)S3 Select
Columnar format for Athena cost savingsParquet
Schema evolution in Kafka pipelineAvro
Trigger ML pipeline on new S3 dataS3 Event → EventBridge → Step Functions