AWS MLA-C01: ML Engineer Associate

Domain 1A: Data Ingestion & Storage

Exam Domain: 1, Data Preparation for ML (28%)
Task: Ingest and store data for ML workloads


Data Fundamentals

Types of Data

Type             Description                        Examples
Structured       Fixed schema, rows & columns       RDS, DynamoDB, CSV
Semi-structured  Flexible schema, self-describing   JSON, XML, Parquet, Avro
Unstructured     No predefined schema               Images, audio, video, text

ELI5: Think of a library. Structured data is the card catalog — every entry has the same fields (title, author, date) in the same columns. Semi-structured data is a set of labeled boxes — each box has its own labels, but at least it tells you what’s inside (like a JSON object). Unstructured data is a pile of random stuff on the floor — photos, sticky notes, voice memos — useful, but you have to figure out what it is yourself.

The Three V’s of Data

Volume    →  How much data (GB → PB scale)
Velocity  →  How fast data arrives (batch vs real-time)
Variety   →  How many formats/sources

Why this matters for the exam: The Three V’s determine which AWS services you reach for. High Volume → S3, EMR, Redshift. High Velocity → Kinesis, Firehose. High Variety → Glue (schema discovery), Athena (query anything). When an exam question describes a scenario, map the V’s first and the service answer usually becomes obvious.
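
The "map the V's first" habit can be sketched as a small lookup. This is an illustrative helper, not an AWS API: the function name suggest_services and its boolean flags are assumptions made for the example, and the service lists are simply the ones named above.

```python
# Illustrative only: maps the Three V's to the services named in this guide.
V_TO_SERVICES = {
    "volume": ["S3", "EMR", "Redshift"],
    "velocity": ["Kinesis", "Firehose"],
    "variety": ["Glue", "Athena"],
}


def suggest_services(high_volume=False, high_velocity=False, high_variety=False):
    """Return candidate services for whichever V's the scenario stresses."""
    flags = {
        "volume": high_volume,
        "velocity": high_velocity,
        "variety": high_variety,
    }
    services = []
    for v, is_high in flags.items():
        if is_high:
            services.extend(V_TO_SERVICES[v])
    return services


if __name__ == "__main__":
    # Scenario: clickstream events arriving continuously in mixed formats.
    print(suggest_services(high_velocity=True, high_variety=True))
```

A real exam scenario usually adds constraints (cost, latency, operational overhead) that narrow the candidates further, so treat the result as a shortlist rather than an answer.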

Data Formats for ML

Format    Type    Compression   Columnar  Best For
CSV       Text    No            No        Small datasets, simple tabular
JSON      Text    No            No        Semi-structured, API data
Parquet   Binary  Yes (Snappy)  Yes       Analytics, large datasets, Athena
ORC       Binary  Yes           Yes       Hive/EMR workloads
Avro      Binary  Yes           No        Schema evolution, streaming
RecordIO  Binary  Yes           No        SageMaker Pipe mode (fastest)

Exam tip: RecordIO + Pipe mode = fastest SageMaker training data ingestion.

ELI5: Choosing a format is like storing your movie collection. CSV is an uncompressed video: simple and readable, but huge. Parquet is a compressed, chapter-indexed file: it takes up far less space, and when you only want “action scenes” (specific columns), it jumps right there without reading the whole file. Because Athena bills by the amount of data scanned, converting large datasets from CSV to Parquet typically cuts query costs by 30-90%.
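
Why a columnar layout scans less data can be shown with a toy model in plain Python. This is a sketch under loud assumptions: it uses JSON bytes as a stand-in for both layouts, ignores compression and encoding entirely, and does not use the real Parquet format. The dataset and function names are made up for the example.

```python
import json

# A made-up dataset: three small columns and one wide text column.
rows = [
    {"user_id": i, "country": "US", "age": 20 + i % 50, "bio": "x" * 200}
    for i in range(1000)
]

# Row-oriented layout (like CSV or JSON lines): each record is stored whole,
# so reading one field still means scanning every record.
row_layout = [json.dumps(r).encode() for r in rows]

# Column-oriented layout (the idea behind Parquet/ORC): the values of each
# column are stored together, so one column can be read on its own.
column_layout = {
    name: json.dumps([r[name] for r in rows]).encode()
    for name in rows[0]
}


def bytes_scanned_row_layout(column):
    # Every record must be read in full, whatever column is requested.
    return sum(len(record) for record in row_layout)


def bytes_scanned_column_layout(column):
    # Only the requested column's block is read.
    return len(column_layout[column])


if __name__ == "__main__":
    full = bytes_scanned_row_layout("age")
    pruned = bytes_scanned_column_layout("age")
    print(f"row layout: {full} bytes, column layout: {pruned} bytes")
    print(f"scan reduced by {100 * (1 - pruned / full):.0f}%")
```

The saving depends on how wide the unread columns are, which is why the benefit is largest for queries that touch a few columns of a wide table.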


Data Architecture Patterns

[Figure: Data Ingestion Pipeline]

Data Warehouse vs Data Lake vs Lakehouse

┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│  DATA WAREHOUSE  │  │    DATA LAKE     │  │   LAKEHOUSE      │
├──────────────────┤  ├──────────────────┤  ├──────────────────┤
│ Structured only  │  │ All data types   │  │ All data types   │
│ Schema-on-write  │  │ Schema-on-read   │  │ Schema-on-read   │
│ Pre-processed    │  │ Raw + processed  │  │ + ACID txns      │
│ SQL analytics    │  │ ML + analytics   │  │ + governance     │
│ e.g. Redshift    │  │ e.g. S3          │  │ e.g. S3 + Iceberg│
└──────────────────┘  └──────────────────┘  └──────────────────┘

ELI5: A data warehouse is a neatly organized closet — everything is folded, labeled, and in its exact place, but you did all that work upfront before anything could go in. A data lake is a massive garage where you throw everything raw: boxes, bikes, old furniture — cheap to store, but messy to search. A lakehouse tries to be both: the garage’s storage capacity with enough organization that you can actually find and update things reliably.
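
The schema-on-write vs schema-on-read distinction can be sketched in a few lines of plain Python. The Warehouse and Lake classes, the SCHEMA dictionary, and the validate helper are hypothetical names for illustration; they do not correspond to Redshift or S3 APIs.

```python
import json

# Hypothetical two-field schema used by both stores.
SCHEMA = {"user_id": int, "amount": float}


def validate(record):
    """Raise ValueError unless the record matches SCHEMA exactly."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    return record


class Warehouse:
    """Schema-on-write: bad records are rejected at load time."""

    def __init__(self):
        self.rows = []

    def load(self, record):
        self.rows.append(validate(record))


class Lake:
    """Schema-on-read: store raw bytes now, interpret them at query time."""

    def __init__(self):
        self.objects = []

    def put(self, raw):
        self.objects.append(raw)  # no checks on write

    def read(self):
        valid = []
        for raw in self.objects:
            try:
                valid.append(validate(json.loads(raw)))
            except (ValueError, TypeError):
                continue  # malformed objects only surface at read time
        return valid
```

The warehouse pays the validation cost once, up front; the lake accepts anything cheaply and pushes the cost (and the surprises) to every reader. A lakehouse table format adds transactional guarantees on top of the lake side of this trade-off.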

Data Mesh

  • Decentralized, domain-oriented data ownership
  • Each team owns and serves their data as a product
  • Federated governance with self-serve infrastructure

Amazon S3 — The ML Data Lake

S3 is the primary storage for ML data on AWS. Almost every ML pipeline starts and ends with S3.

S3 Key Features for ML

Feature              Purpose
Versioning           Track dataset versions, rollback
Lifecycle Rules      Auto-transition old data to cheaper tiers
Replication          Cross-region for DR, same-region for compliance
Event Notifications  Trigger Lambda/SQS/SNS on new data arrival
Access Points        Simplified access for different teams
Object Lambda        Transform data on read (e.g., redact PII)
S3 Select            Query inside objects (CSV/JSON/Parquet) without full download

ELI5: S3 is the center of gravity in AWS ML because it’s the one service every other service can read from and write to. SageMaker trains from S3, Glue transforms data in S3, Athena queries S3, and Data Firehose delivers to S3. Think of S3 as the shared whiteboard that every team in the building can read and write: it’s not the fastest storage, but it’s universally accessible, virtually unlimited in capacity, and cheap.

Storage Classes

Class                          Use Case                            Retrieval
S3 Standard                    Frequently accessed training data   Instant
S3 Standard-IA                 Infrequent access, older datasets   Instant, retrieval fee
S3 One Zone-IA                 Reproducible data, non-critical     Instant, single AZ
S3 Glacier Instant Retrieval   Archive with instant access         Instant
S3 Glacier Flexible Retrieval  Long-term archive                   Minutes to hours
S3 Glacier Deep Archive        Cheapest archive                    12-48 hours
S3 Intelligent-Tiering         Unknown access patterns             Auto-tiered

ELI5: Storage classes are like storage units at different distances from your home. S3 Standard is stuff in your living room — instantly accessible, but you pay premium rent. Standard-IA is a storage unit across town — cheap monthly fee, but there’s a cost every time you go retrieve something. Glacier Deep Archive is a warehouse in another city — extremely cheap, but it takes 12-48 hours to get anything back. Intelligent-Tiering watches your access patterns and automatically moves things to the right “distance” for you.
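
Moving data between these "distances" is normally automated with a lifecycle rule. The sketch below only builds the rule dictionary in the shape that boto3's put_bucket_lifecycle_configuration expects; the helper name build_lifecycle_rule, the prefix, the day counts, and the bucket name in the comment are all assumptions for illustration.

```python
def build_lifecycle_rule(prefix, ia_days=30, deep_archive_days=365):
    """Build one lifecycle rule: Standard -> Standard-IA -> Deep Archive."""
    if ia_days < 30:
        # S3 requires objects to be at least 30 days old before a
        # lifecycle transition to Standard-IA.
        raise ValueError("ia_days must be >= 30")
    if deep_archive_days <= ia_days:
        raise ValueError("deep_archive_days must be later than ia_days")
    return {
        "ID": f"tier-{prefix.strip('/').replace('/', '-')}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": deep_archive_days, "StorageClass": "DEEP_ARCHIVE"},
        ],
    }


if __name__ == "__main__":
    config = {"Rules": [build_lifecycle_rule("datasets/raw/", 30, 180)]}
    print(config)
    # To apply it (needs boto3, credentials, and a real bucket):
    #   import boto3
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="my-ml-bucket",  # hypothetical bucket name
    #       LifecycleConfiguration=config,
    #   )
```

If access patterns are genuinely unknown, Intelligent-Tiering avoids having to guess these day counts at all.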

S3 Performance Optimization

Upload:
  - Multipart upload for files > 100MB (required > 5GB)
  - S3 Transfer Acceleration (uses CloudFront edge locations)

Download:
  - Byte-range fetches for parallel reads
  - S3 Select / Glacier Select to filter data server-side

Throughput:
  - 3,500 PUT/COPY/POST/DELETE per prefix per second
  - 5,500 GET/HEAD per prefix per second
  - Spread across prefixes for higher throughput
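
The arithmetic behind these limits can be sketched without touching AWS. The helper names below (plan_parts, range_header, prefixes_needed) are made up for the example; the constants are the multipart limits (5 MiB minimum part size, 10,000 parts per upload) and the per-prefix GET rate listed above.

```python
import math

MIB = 1024 ** 2
MIN_PART = 5 * MIB        # S3 minimum part size (except the last part)
MAX_PARTS = 10_000        # S3 maximum number of parts per upload


def plan_parts(object_size, part_size=100 * MIB):
    """Return inclusive (start, end) byte offsets for each part."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    part_size = max(part_size, MIN_PART)
    # Grow the part size if the object would otherwise need too many parts.
    part_size = max(part_size, math.ceil(object_size / MAX_PARTS))
    return [
        (start, min(start + part_size, object_size) - 1)
        for start in range(0, object_size, part_size)
    ]


def range_header(start, end):
    """HTTP Range header value for a byte-range GET of one part."""
    return f"bytes={start}-{end}"


def prefixes_needed(target_gets_per_second, per_prefix_limit=5_500):
    """Prefixes to spread keys over to reach a target GET rate."""
    return math.ceil(target_gets_per_second / per_prefix_limit)


if __name__ == "__main__":
    for start, end in plan_parts(250 * MIB):
        print(range_header(start, end))
    print(prefixes_needed(20_000), "prefixes for 20,000 GET/s")
```

The same (start, end) plan serves both directions: each pair is one part of a multipart upload, or one parallel byte-range fetch on download.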

S3 Encryption

Method       Key Management          Use Case
SSE-S3       AWS manages keys        Default, simplest
SSE-KMS      AWS KMS manages keys    Audit trail, key rotation
SSE-C        Customer provides keys  Full control
Client-side  Encrypt before upload   Maximum security
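
For the two AWS-managed options, the choice shows up as a pair of request parameters on upload. The sketch below builds those parameters as a dictionary suitable for the ExtraArgs argument of boto3's upload_file; the helper name encryption_args and the file, bucket, and key names in the comment are hypothetical. SSE-C and client-side encryption need key material handled by the caller and are left out.

```python
def encryption_args(method, kms_key_id=None):
    """Return server-side encryption arguments for an S3 upload."""
    if method == "SSE-S3":
        return {"ServerSideEncryption": "AES256"}
    if method == "SSE-KMS":
        args = {"ServerSideEncryption": "aws:kms"}
        if kms_key_id:
            # Omit the key ID to use the AWS managed key for S3.
            args["SSEKMSKeyId"] = kms_key_id
        return args
    raise ValueError(f"unsupported method: {method}")


if __name__ == "__main__":
    print(encryption_args("SSE-S3"))
    print(encryption_args("SSE-KMS", "<your-kms-key-id>"))
    # To use it (needs boto3, credentials, and a real bucket):
    #   import boto3
    #   boto3.client("s3").upload_file(
    #       "train.csv", "my-ml-bucket", "datasets/train.csv",  # hypothetical
    #       ExtraArgs=encryption_args("SSE-KMS", "<your-kms-key-id>"),
    #   )
```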

Block & File Storage

Amazon EBS (Elastic Block Store)

  • Attached to single EC2 instance (same AZ)
  • Types: gp3 (general), io2 (high IOPS), st1 (throughput), sc1 (cold)
  • Elastic Volumes: resize, change type, adjust IOPS without detaching
  • Use case: SageMaker notebook instance storage

Amazon EFS (Elastic File System)

  • Shared NFS file system across multiple instances and AZs
  • Auto-scales, pay per use
  • Use case: shared training data across multiple SageMaker instances

Amazon FSx

Variant                      Use Case
FSx for Lustre               High-performance ML training, integrates with S3
FSx for Windows File Server  Windows-based workloads
FSx for NetApp ONTAP         Multi-protocol, hybrid
FSx for OpenZFS              Linux workloads, snapshots

Exam tip: FSx for Lustre is the go-to for high-throughput, low-latency training when S3 alone isn’t fast enough. It can be backed by S3.

EBS vs EFS vs FSx

                    EBS              EFS              FSx Lustre
Attach to:          Single EC2       Multiple EC2     Multiple EC2
Across AZs:         No               Yes              Yes
Max throughput:     ~1 GB/s          ~10 GB/s         100s GB/s
S3 integration:     No               No               Yes (lazy load)
Best for ML:        Notebooks        Shared data      HPC training

ELI5: EBS is a USB drive — fast, dedicated to one machine, goes with it everywhere. EFS is a shared Google Drive folder — multiple people (EC2 instances) can all read and write the same files simultaneously. FSx for Lustre is a high-speed race track for data — built specifically so that hundreds of GPUs can all slam through terabytes of training data in parallel without waiting on each other.


Real-Time Streaming

Amazon Kinesis Data Streams

Kinesis Streaming Architecture

Producers → [Shard 1] [Shard 2] [Shard N] → Consumers
             1MB/s in   1MB/s in             2MB/s out per shard
             1000 rec/s  1000 rec/s           (shared or enhanced fan-out)

  • Shards determine throughput: more shards = more capacity
  • Retention: 24 hours default, up to 365 days
  • Ordering: guaranteed within a shard; the partition key determines the shard
  • Consumers: Lambda, KCL apps, Managed Service for Apache Flink (formerly Kinesis Data Analytics), Firehose
  • Enhanced Fan-Out: 2MB/s per shard per consumer (dedicated)
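
Sizing a provisioned stream comes down to the per-shard limits above. A minimal sketch, assuming shared-throughput consumers (so all readers split the 2 MB/s per shard) and a hypothetical helper name shards_needed:

```python
import math


def shards_needed(ingest_mb_per_s, records_per_s, egress_mb_per_s):
    """Minimum shards for provisioned mode with shared-throughput consumers."""
    by_ingest_bytes = math.ceil(ingest_mb_per_s / 1.0)   # 1 MB/s in per shard
    by_ingest_count = math.ceil(records_per_s / 1000)    # 1,000 records/s in
    by_egress_bytes = math.ceil(egress_mb_per_s / 2.0)   # 2 MB/s out per shard
    return max(1, by_ingest_bytes, by_ingest_count, by_egress_bytes)


if __name__ == "__main__":
    # 4.5 MB/s in, 3,000 records/s, two consumers each reading everything
    # (2 x 4.5 = 9 MB/s out in total).
    print(shards_needed(4.5, 3000, 9.0))
```

Whichever limit is hit first sets the shard count. With Enhanced Fan-Out each consumer gets its own 2 MB/s per shard, so the egress term stops growing with the number of consumers.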

Amazon Data Firehose (formerly Kinesis Data Firehose)

  • Fully managed delivery stream — no shards to manage
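  • Near-real-time: buffers records by size or time interval before delivery (historically a 60-second minimum interval; zero buffering is now configurable)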
  • Destinations: S3, Redshift, OpenSearch, Splunk, HTTP endpoints
  • Can transform data with Lambda before delivery
  • Auto-scales, no capacity management

Kinesis Data Streams vs Firehose:
┌──────────────────────┬────────────────────────┐
│   Data Streams       │   Firehose             │
├──────────────────────┼────────────────────────┤
│ Real-time (200ms)    │ Near real-time (60s+)  │
│ Manual scaling       │ Auto-scaling           │
│ Custom consumers     │ Fixed destinations     │
│ Data replay          │ No replay              │
│ You manage shards    │ Fully managed          │
└──────────────────────┴────────────────────────┘

ELI5: Kinesis Data Streams is like laying your own plumbing — you control every pipe, valve, and flow rate, and you can tap into the stream from multiple places simultaneously. Firehose is like calling a plumber who does it all for you — you tell them where the water should end up (S3, Redshift, etc.) and they handle everything, but you get less control. If you need custom processing or multiple consumers reading the same stream, use Data Streams. If you just need data reliably delivered somewhere, use Firehose.

Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics)

  • Real-time stream processing with SQL or Java/Scala/Python
  • Source: Kinesis Data Streams, MSK
  • Use case: real-time analytics, anomaly detection, aggregations
  • RANDOM_CUT_FOREST function: built-in anomaly detection in streaming data
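
The "aggregations" use case usually means windowed aggregation. The sketch below shows the idea of a tumbling (fixed, non-overlapping) window in plain Python. It is not the Flink or Kinesis SQL API, and the function name and event shape are assumptions; a real stream processor also handles late and out-of-order events, state, and checkpointing.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds=60):
    """Count events per (window_start, key) over fixed, non-overlapping windows.

    `events` is an iterable of (epoch_seconds, key) pairs.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


if __name__ == "__main__":
    clicks = [(0, "a"), (59, "a"), (60, "a"), (61, "b"), (125, "a")]
    for (start, key), n in sorted(tumbling_window_counts(clicks).items()):
        print(f"window starting at {start}s, key {key}: {n} events")
```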

Amazon MSK (Managed Streaming for Apache Kafka)

  • Fully managed Apache Kafka
  • MSK Connect: managed Kafka Connect connectors
  • MSK Serverless: auto-scaling, no cluster management
  • Use case: teams already using Kafka ecosystem

Kinesis vs MSK

Feature       Kinesis                  MSK
Protocol      AWS proprietary          Apache Kafka
Message size  1 MB max                 1 MB default (configurable higher)
Retention     Up to 365 days           Unlimited (with tiered storage)
Ecosystem     AWS native               Kafka ecosystem
Scaling       Shard splitting/merging  Add brokers/partitions

ETL & Pipeline Orchestration

Data Sources → [Ingestion] → [Storage] → [ETL/Transform] → [ML-Ready Data]
                Kinesis        S3           Glue               Feature Store
                MSK            EFS          EMR
                Firehose       FSx          Data Wrangler

Key Integration Pattern

S3 Event Notification
    → Lambda (lightweight transform)
    → or SQS → batch consumer
    → or EventBridge → Step Functions → Glue ETL → S3 (processed)
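
The first branch of this pattern (S3 event notification to Lambda) can be sketched as a handler that pulls the bucket and key out of each event record. The function names are illustrative and the "lightweight transform" itself is left as a print; the record layout (Records, s3.bucket.name, s3.object.key) is the standard S3 notification payload.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (for example, spaces become "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    """Lambda entry point: list newly arrived objects for downstream work."""
    new_objects = extract_s3_objects(event)
    for bucket, key in new_objects:
        print(f"new object: s3://{bucket}/{key}")
    return {"processed": len(new_objects)}
```

Decoding the key matters in practice: passing the still-encoded key to a later GetObject call fails for any object name containing spaces or special characters.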

Quick Reference: When to Use What

Scenario                                            Service
Store training datasets                             S3
High-throughput training I/O                        FSx for Lustre + S3
Shared data across instances                        EFS
Real-time ingestion (custom processing)             Kinesis Data Streams
Near-real-time ingestion (deliver to S3/Redshift)   Data Firehose
Stream processing with SQL                          Managed Service for Apache Flink
Kafka-based streaming                               MSK
Fastest SageMaker training input                    RecordIO + Pipe mode
Archive old datasets cheaply                        S3 Glacier Deep Archive
Auto-tier based on access                           S3 Intelligent-Tiering