AWS MLA-C01: ML Engineer Associate

Domain 1B: Data Transformation & Feature Engineering

Exam Domain: 1, Data Preparation for ML (28%)
Task: Transform data, perform feature engineering, ensure data integrity


Data Processing at Scale

Amazon EMR (Elastic MapReduce)

Managed Hadoop/Spark cluster for large-scale data processing.

EMR Cluster Architecture

EMR Architecture:
┌─────────────────────────────────────────────┐
│  EMR Cluster                                │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │  Master   │  │  Core    │  │  Task    │  │
│  │  Node     │  │  Nodes   │  │  Nodes   │  │
│  │ (manages) │  │ (HDFS +  │  │ (compute │  │
│  │           │  │  compute)│  │  only)   │  │
│  └──────────┘  └──────────┘  └──────────┘  │
│                                             │
│  Frameworks: Spark, Hive, Presto, HBase     │
└─────────────────────────────────────────────┘
          ↕
     Amazon S3 (EMRFS)

Feature           Details
───────           ───────
EMR on EC2        Traditional clusters, full control
EMR Serverless    No cluster management, pay per job
EMR on EKS        Run Spark on Kubernetes
EMRFS             Read/write S3 as if it were HDFS

Apache Spark on EMR

  • Distributed, in-memory data processing (up to 100x faster than Hadoop MapReduce for workloads that fit in memory)
  • PySpark for Python-based ML data prep
  • Spark MLlib for distributed ML algorithms
  • Use case: TF-IDF, large-scale feature engineering, ETL

Exam tip: When you see “large-scale data processing” or “distributed transformation” — think EMR + Spark.


Feature Engineering Techniques

The Curse of Dimensionality

More features ≠ better model. Too many features lead to:

  • Overfitting
  • Increased training time
  • Sparse data in high-dimensional space

Solution: Dimensionality reduction (PCA, feature selection)

ELI5: Imagine trying to find a specific person. In a single room (1 dimension: just their height), easy. In a building (2D: height + weight), harder but manageable. In a city described by 1,000 characteristics, nearly impossible: your data points are so spread out that “nearby” becomes meaningless. Every feature you add multiplies the space your data has to fill, so you need exponentially more data to cover it. This is why adding 50 random features to your model usually makes it worse, not better.
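The exponential growth described above can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes each feature is split into 10 bins and that the dataset has one million rows (both numbers are hypothetical), then counts how many grid cells the feature space contains and what fraction of them the data could possibly touch.

```python
# Hypothetical illustration: split every feature into 10 bins and count
# how many cells the feature space contains as features are added.
BINS_PER_FEATURE = 10      # assumed resolution per feature
N_SAMPLES = 1_000_000      # assumed dataset size


def cell_count(n_features: int, bins: int = BINS_PER_FEATURE) -> int:
    """Number of grid cells in an n_features-dimensional space."""
    return bins ** n_features


def max_coverage(n_features: int, n_samples: int = N_SAMPLES) -> float:
    """Upper bound on the fraction of cells holding at least one sample."""
    return min(1.0, n_samples / cell_count(n_features))


for d in (1, 2, 3, 6, 10):
    print(f"{d:>2} features: {cell_count(d):>14,} cells, "
          f"coverage <= {max_coverage(d):.4%}")
```

With these assumed numbers, six features can still be fully covered, but at ten features the same million rows can reach at most 0.01% of the cells.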

Handling Missing Data

Technique                When to Use
─────────                ───────────
Drop rows                Small % missing, random pattern
Drop columns             Most values missing in that feature
Mean imputation          Numerical, roughly symmetric distribution
Median imputation        Numerical, skewed distribution or outliers present
Mode imputation          Categorical data
KNN imputation           Similar samples should have similar values
Forward/Backward fill    Time series data
Indicator variable       Missingness itself is informative

ELI5: Missing data is like a survey where some people skipped questions. You have three choices: throw away the whole survey (drop rows) — wasteful if most answers are there. Guess the missing answer based on what similar people answered (imputation) — reasonable but introduces assumptions. Add a checkbox that says “this person skipped this question” (indicator variable) — useful when the fact that someone skipped is itself meaningful, like patients who skip a pain question might be the ones in the most pain.
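Two of the options above, imputation and the indicator variable, are often combined. A minimal sketch, assuming a hypothetical `ages` column where `None` marks a skipped answer; the helper name `impute_with_indicator` is made up for illustration.

```python
from statistics import median

# Hypothetical survey column; None marks a skipped answer.
ages = [34, None, 29, 41, None, 38]


def impute_with_indicator(values):
    """Median-impute a numeric column and return (filled, was_missing)."""
    observed = [v for v in values if v is not None]
    fill = median(observed)                      # computed from observed values only
    filled = [fill if v is None else v for v in values]
    was_missing = [int(v is None) for v in values]
    return filled, was_missing


filled, was_missing = impute_with_indicator(ages)
print(filled)        # [34, 36.0, 29, 41, 36.0, 38]
print(was_missing)   # [0, 1, 0, 0, 1, 0]
```

In a real pipeline the fill value would normally be computed on the training split only and reused for validation and test data, so that no information leaks across splits.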

Handling Unbalanced Data

Imbalanced Dataset (e.g., fraud detection):
  Class 0 (normal):    99,000 samples
  Class 1 (fraud):      1,000 samples   ← minority class

Techniques:
  ┌─────────────────────────────────────────────┐
  │  Oversampling: Duplicate/synthesize minority │
  │  ├─ Random oversampling (duplicate)          │
  │  └─ SMOTE (synthesize new samples)           │
  │                                              │
  │  Undersampling: Reduce majority class        │
  │  └─ Random undersampling                     │
  │                                              │
  │  Algorithm-level:                            │
  │  ├─ Class weights (cost-sensitive learning)  │
  │  └─ Anomaly detection approach               │
  └─────────────────────────────────────────────┘
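For the class-weights option, one common heuristic (the "balanced" scheme used by several ML libraries) weights each class by n_samples / (n_classes × class_count). The sketch below applies it to the fraud counts above; the function name is hypothetical, and other weighting schemes are equally valid.

```python
# Class counts taken from the fraud example above.
counts = {0: 99_000, 1: 1_000}


def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {c: n_samples / (n_classes * n) for c, n in class_counts.items()}


weights = balanced_class_weights(counts)
print(weights)   # class 1 gets weight 50.0, class 0 roughly 0.505
```

Under this scheme each class contributes the same total weight to the loss, so a misclassified fraud sample costs about 99 times more than a misclassified normal one.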

SMOTE (Synthetic Minority Oversampling Technique):

  • Creates synthetic samples between existing minority samples
  • Available in SageMaker Data Wrangler as a built-in operator

ELI5: SMOTE is like creating imaginary friends that look like a blend of your existing friends, to balance out a lopsided group photo. If you only have 10 fraud examples and 10,000 normal ones, SMOTE looks at each fraud example, finds its nearest fraud neighbors, and creates new synthetic fraud examples that fall somewhere between them — not exact copies, but plausible variations. This gives your model more examples of the rare class to learn from without just repeating the same 10 examples over and over.
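The interpolation idea can be sketched in a few lines. This is a simplified illustration of the core SMOTE step, not the Data Wrangler operator or any library's implementation; the sample points, `k`, and the function name `smote_sketch` are all assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of base among the other minority samples
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical 2-D fraud samples (e.g. scaled amount, scaled hour of day)
fraud = [(0.90, 0.10), (0.80, 0.20), (0.95, 0.15), (0.85, 0.05)]
new_points = smote_sketch(fraud, n_new=6)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority samples, which is why SMOTE adds variety without leaving the region the minority class already occupies. Oversampling of any kind should be applied to the training split only.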

Handling Outliers

Method           Description
──────           ───────────
Z-score          Remove if |z| > 3 (assumes normal distribution)
IQR method       Remove if < Q1 - 1.5×IQR or > Q3 + 1.5×IQR
Winsorization    Cap outliers at a percentile threshold
Log transform    Reduce skew, compress range
Keep them        If outliers are real signal (e.g., fraud)
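A minimal sketch of the IQR rule from the table, using hypothetical values. Quartile conventions differ between tools; this example assumes Python's `statistics.quantiles` with `method="inclusive"`. The capping step clips at the IQR fences, a close cousin of percentile-based winsorization rather than the exact technique.

```python
from statistics import quantiles

# Hypothetical response times with one extreme value
values = [10, 12, 11, 13, 12, 11, 10, 95]


def iqr_bounds(data, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = iqr_bounds(values)
outliers = [v for v in values if v < lower or v > upper]
# Capping variant: clip at the fences instead of dropping rows
capped = [min(max(v, lower), upper) for v in values]

print(lower, upper)   # 8.5 14.5
print(outliers)       # [95]
print(capped)
```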

Feature Transformation Techniques

Technique                    Purpose                              Formula/Example
─────────                    ───────                              ───────────────
Normalization (Min-Max)      Scale to [0,1]                       $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$
Standardization (Z-score)    Mean=0, StdDev=1                     $x' = \frac{x - \mu}{\sigma}$
Log Transform                Reduce right skew                    $x' = \log(x + 1)$
Binning                      Convert continuous to categorical    Age → [0-18, 19-35, 36-50, 51+]
One-Hot Encoding             Categorical to binary columns        Color → [is_red, is_blue, is_green]
Label Encoding               Categorical to integers              [small, medium, large] → [0, 1, 2]
Tokenization                 Text to tokens                       “Hello world” → [“Hello”, “world”]
TF-IDF                       Text importance scoring              Term frequency × inverse document frequency
Shuffling                    Randomize order                      Prevent ordering bias in training
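The tokenization and TF-IDF rows can be combined into one small example. This sketch assumes whitespace tokenization and the plain variant tf = count / document length, idf = ln(N / document frequency); libraries typically add smoothing and normalization, so their numbers will differ. The corpus is hypothetical.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, already lowercased
docs = [
    "the card was declined",
    "the card was stolen",
    "the weather was sunny",
]


def tf_idf(corpus):
    """Return one {term: score} dict per document, using
    tf = count / doc_length and idf = ln(N / document_frequency)."""
    tokenized = [doc.split() for doc in corpus]   # whitespace tokenization
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


scores = tf_idf(docs)
print(scores[1])
```

Words that appear in every document ("the", "was") score zero, while a word unique to one document ("stolen") scores highest, which is the behavior TF-IDF is meant to produce.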

ELI5: Normalization and standardization both rescale your numbers, but differently. Normalization squishes everything into a 0–1 range, which is useful when the min and max are known and stable and you want everything proportional within that window. Standardization centers everything around the average (mean=0) and stretches or squishes based on spread (std=1), which is useful when you don’t know the range; it is also less distorted by a few extreme values than min-max scaling, though not immune to them. Rule of thumb: use normalization for bounded inputs such as image pixels fed to neural networks, and standardization for scale-sensitive algorithms such as SVM, regularized linear or logistic regression, and PCA.
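The two formulas are short enough to write out directly. A minimal sketch with hypothetical feature values; it uses the population standard deviation and assumes the column is not constant (otherwise both functions would divide by zero).

```python
from statistics import mean, pstdev


def min_max(values):
    """Normalization: rescale to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardization: shift to mean 0, scale to population std 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


data = [2, 4, 6, 8, 10]                       # hypothetical feature values
print(min_max(data))                          # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 3) for z in z_score(data)])

# One extreme value squeezes the min-max output of every other point toward 0
with_outlier = data + [1000]
print([round(v, 3) for v in min_max(with_outlier)])
```

The last line shows why min-max scaling is fragile with outliers: after adding 1000, all of the original values land below 0.01.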

One-Hot Encoding Example:

  Color       →    is_red   is_blue   is_green
  ─────            ──────   ───────   ────────
  red              1        0         0
  blue             0        1         0
  green            0        0         1
  red              1        0         0
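The table above can be reproduced with a few lines of plain Python. This is a sketch for illustration; the `one_hot` helper is hypothetical, and in practice a library encoder fitted on the training data would be used so that unseen categories are handled consistently.

```python
def one_hot(values, categories=None):
    """Return (column_names, rows) with one binary column per category."""
    if categories is None:
        categories = list(dict.fromkeys(values))   # unique, first-seen order
    columns = [f"is_{c}" for c in categories]
    rows = [[int(v == c) for c in categories] for v in values]
    return columns, rows


colors = ["red", "blue", "green", "red"]
columns, rows = one_hot(colors)
print(columns)                     # ['is_red', 'is_blue', 'is_green']
for color, row in zip(colors, rows):
    print(color, row)
```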

AWS Data Transformation Services

SageMaker Data Wrangler

Visual, low-code data preparation within SageMaker Studio.

Feature                     Details
───────                     ───────
300+ built-in transforms    No code needed
Import from                 S3, Athena, Redshift, Snowflake, Databricks
Built-in visualizations     Histograms, scatter plots, correlation
Bias detection              Pre-training bias analysis
Export to                   SageMaker Pipelines, Processing jobs, Feature Store
Balancing operators         Random oversample, undersample, SMOTE

ELI5: Data Wrangler is like Excel on steroids for ML data prep, but instead of writing formulas, you pick from 300+ pre-built transforms in a visual interface. You can join datasets, handle missing values, detect bias, and visualize distributions with little or no code. When you’re done, you can export the flow as code or send it to SageMaker Pipelines, a Processing job, or Feature Store.

SageMaker Ground Truth

Managed data labeling service.

Ground Truth Workflow

Labeling Workflow:
  Raw Data → [Labeling Job] → Labeled Dataset
                  ↓
         ┌───────────────────┐
         │ Human Labelers:   │
         │ • Mechanical Turk │
         │ • Private team    │
         │ • Third-party     │
         ├───────────────────┤
         │ Auto-Labeling:    │
         │ ML model labels   │
         │ high-confidence   │
         │ samples (up to    │
         │ 70% cost savings) │
         └───────────────────┘

  • Active learning: automatically labels easy samples, sends hard ones to humans
  • Annotation consolidation: combines multiple labeler outputs for accuracy
  • Supports: bounding boxes, image classification, text classification, semantic segmentation
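The routing and consolidation ideas can be illustrated with a toy example. This is a deliberate simplification, not how Ground Truth is implemented: it assumes a fixed confidence cutoff (0.9 is an arbitrary choice, not a service default) and uses simple majority voting, whereas the service's consolidation logic is more sophisticated. All names and data are hypothetical.

```python
from collections import Counter

CONFIDENCE_THRESHOLD = 0.9   # assumed cutoff, not a Ground Truth default


def route(model_predictions, threshold=CONFIDENCE_THRESHOLD):
    """Split items into auto-labeled and sent-to-humans by model confidence."""
    auto, humans = {}, []
    for item_id, (label, confidence) in model_predictions.items():
        if confidence >= threshold:
            auto[item_id] = label
        else:
            humans.append(item_id)
    return auto, humans


def consolidate(annotations):
    """Majority vote over the labels several workers gave one item."""
    return Counter(annotations).most_common(1)[0][0]


# Hypothetical model output: item -> (predicted label, confidence)
predictions = {"img1": ("cat", 0.98), "img2": ("dog", 0.55), "img3": ("cat", 0.93)}
auto_labeled, needs_humans = route(predictions)

# Hypothetical worker answers for the hard item
human_labels = {"img2": ["dog", "dog", "cat"]}
final = {**auto_labeled, **{i: consolidate(human_labels[i]) for i in needs_humans}}
print(final)
```

Only the low-confidence item is sent to human labelers, which is where the cost savings of auto-labeling come from.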

SageMaker Feature Store

Centralized repository for ML features — serves both training and inference.

┌─────────────────────────────────────────────┐
│           SageMaker Feature Store            │
│                                             │
│  ┌──────────────┐    ┌──────────────┐       │
│  │ Online Store │    │ Offline Store│       │
│  │ (low-latency │    │ (S3-backed   │       │
│  │  reads for   │    │  for batch   │       │
│  │  inference)  │    │  training)   │       │
│  └──────┬───────┘    └──────┬───────┘       │
│         │                   │               │
│         └───────┬───────────┘               │
│                 │                           │
│         Feature Groups                      │
│         (tables of features with            │
│          record identifier + timestamp)     │
└─────────────────────────────────────────────┘

Store      Latency            Use Case
─────      ───────            ────────
Online     Single-digit ms    Real-time inference
Offline    Minutes            Batch training, historical queries

ELI5: Imagine five different teams at a company all need to know a customer’s age. Without Feature Store, each team calculates it differently (one uses signup date, another uses the profile field, a third uses a derived column), and they all get slightly different answers. Feature Store is the single source of truth: one team computes “customer_age” correctly once, stores it, and every other team, whether training a model or serving predictions, reads the exact same value. It also reduces training-serving skew: training and inference read the same feature definition from the same store, as long as both pipelines actually use it.
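The online/offline split can be pictured with a small in-memory model. This is a conceptual sketch only and does not use the SageMaker API: `ToyFeatureGroup` and its method names are invented for illustration, and event times are plain integers.

```python
from collections import defaultdict


class ToyFeatureGroup:
    """In-memory stand-in for a feature group keyed by record identifier
    and event time. This is not the SageMaker API."""

    def __init__(self):
        self.online = {}                    # record_id -> latest record
        self.offline = defaultdict(list)    # record_id -> full history

    def put_record(self, record_id, event_time, features):
        record = {"event_time": event_time, **features}
        self.offline[record_id].append(record)        # append-only history
        latest = self.online.get(record_id)
        if latest is None or event_time >= latest["event_time"]:
            self.online[record_id] = record            # keep newest only

    def get_online(self, record_id):
        """Latest value, as read at inference time."""
        return self.online[record_id]

    def get_as_of(self, record_id, timestamp):
        """Historical value as of a timestamp, for building training sets."""
        past = [r for r in self.offline[record_id] if r["event_time"] <= timestamp]
        return max(past, key=lambda r: r["event_time"]) if past else None


group = ToyFeatureGroup()
group.put_record("cust-1", 100, {"customer_age": 41})
group.put_record("cust-1", 200, {"customer_age": 42})
print(group.get_online("cust-1"))        # latest record (age 42)
print(group.get_as_of("cust-1", 150))    # record valid at time 150 (age 41)
```

The online view answers "what is the value now?", while the history lets a training job ask "what was the value when this label was recorded?", which avoids leaking future information into training data.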

Exam tip: Feature Store = the answer when the question asks about reusable features shared across teams or consistent features between training and inference.


AWS Glue Ecosystem

AWS Glue — Serverless ETL

Glue ETL Ecosystem

┌──────────────────────────────────────────────┐
│                  AWS Glue                     │
│                                              │
│  ┌────────────┐  ┌────────────┐              │
│  │  Crawlers  │→ │  Data      │              │
│  │ (discover  │  │  Catalog   │              │
│  │  schemas)  │  │ (metadata) │              │
│  └────────────┘  └─────┬──────┘              │
│                        │                     │
│                  ┌─────▼──────┐               │
│                  │  ETL Jobs  │               │
│                  │ (PySpark / │               │
│                  │  Python)   │               │
│                  └────────────┘               │
│                                              │
│  ┌────────────┐  ┌────────────┐              │
│  │  Glue      │  │  Glue Data │              │
│  │  DataBrew  │  │  Quality   │              │
│  │ (visual)   │  │ (rules)    │              │
│  └────────────┘  └────────────┘              │
└──────────────────────────────────────────────┘

Component            Purpose
─────────            ───────
Crawlers             Auto-discover schema from S3/databases → populate Data Catalog
Data Catalog         Central metadata repository (Hive-compatible)
ETL Jobs             Serverless Spark jobs for data transformation
Glue Studio          Visual ETL pipeline builder
Glue DataBrew        Visual data prep with 250+ transforms, PII detection/handling
Glue Data Quality    Define rules to validate data quality

AWS Glue DataBrew — PII Handling

DataBrew can automatically detect and handle PII:

  • Redact: Replace with placeholder
  • Hash: One-way hash
  • Encrypt: Reversible encryption
  • Substitute: Replace with fake data
  • Delete: Remove entirely
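To make the difference between these options concrete, here is a plain-Python sketch of redact, hash, and delete applied to an email address. It does not use DataBrew; the regular expression is a simplified email pattern, and the salt, function names, and sample record are hypothetical.

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")   # simplified email pattern


def redact(text, placeholder="[REDACTED]"):
    """Replace every detected email with a fixed placeholder."""
    return EMAIL.sub(placeholder, text)


def hash_pii(text, salt="example-salt"):
    """Replace every detected email with a truncated salted SHA-256 digest."""
    def _digest(match):
        return hashlib.sha256((salt + match.group()).encode()).hexdigest()[:16]
    return EMAIL.sub(_digest, text)


def delete(record, pii_fields):
    """Drop PII columns from a record entirely."""
    return {k: v for k, v in record.items() if k not in pii_fields}


row = {"id": 7, "note": "Contact jane.doe@example.com for details"}
print(redact(row["note"]))      # Contact [REDACTED] for details
print(hash_pii(row["note"]))
print(delete(row, {"note"}))    # {'id': 7}
```

Hashing keeps the value usable as a join key (the same input always yields the same digest) while hiding the original, whereas redaction and deletion discard it completely.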

Exam tip: When a question mentions “PII handling in data preparation” → Glue DataBrew or S3 Object Lambda (for on-read transformation).


Amazon Athena — Serverless Analytics

  • SQL queries directly on S3 data (no ETL needed)
  • Built on the open-source Trino and Presto engines
  • Pay per query (about $5 per TB scanned; varies by region)

Athena Performance Optimization

Technique                              Impact
─────────                              ──────
Use columnar formats (Parquet, ORC)    Only referenced columns are read; often 30-90% less data scanned
Partition data                         Skip irrelevant partitions
Compress data (Snappy, GZIP)           Less data to scan
Use CTAS (CREATE TABLE AS SELECT)      Convert formats, partition
Bucket data                            Further optimize joins
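Because Athena bills by data scanned, the effect of these optimizations is easy to estimate. The sketch below uses the $5 per TB price quoted above; the table size, compression ratio, column count, and partition filter are all hypothetical assumptions chosen to show the arithmetic.

```python
PRICE_PER_TB = 5.00   # USD per TB scanned, as quoted above


def query_cost(tb_scanned, price_per_tb=PRICE_PER_TB):
    """Estimated Athena cost for one query."""
    return tb_scanned * price_per_tb


# Hypothetical table: 2 TB as raw CSV, 20 equally sized columns, 100 daily partitions
csv_full_scan = query_cost(2.0)

# Assumptions: Parquet + compression shrinks the data to 25% of the CSV size,
# the query reads 2 of 20 columns, and a partition filter keeps 10 of 100 days.
parquet_tb = 2.0 * 0.25
optimized = query_cost(parquet_tb * (2 / 20) * (10 / 100))

print(f"CSV full scan:  ${csv_full_scan:.2f}")
print(f"Optimized scan: ${optimized:.4f}")
```

Under these assumptions the same query drops from $10.00 to about $0.025, because each optimization multiplies down the bytes that have to be read.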

Athena + Glue Integration

S3 (raw data)
    → Glue Crawler (discover schema)
    → Glue Data Catalog (store metadata)
    → Athena (query with SQL)

  • ACID Transactions: Athena supports Apache Iceberg tables for ACID compliance
  • Fine-Grained Access: Use Lake Formation for column/row-level security

Data Integrity & Bias Detection

Pre-Training Bias Metrics (SageMaker Clarify)

Detects bias before model training by analyzing the dataset.

Metric                                       What It Measures
──────                                       ────────────────
Class Imbalance (CI)                         Difference in number of samples between facets (groups)
Difference in Proportions of Labels (DPL)    Difference in positive-label rates between facets
KL Divergence                                How far one facet's label distribution diverges from the other's (asymmetric)
Jensen-Shannon Divergence                    Symmetric version of KL divergence
Lp-norm                                      p-norm distance between label distributions
Total Variation Distance                     Half the L1 distance between label distributions
Kolmogorov-Smirnov                           Max difference in cumulative distributions
Conditional Demographic Disparity (CDD)      Disparity conditioned on other attributes

ELI5: Pre-training bias detection is like checking whether your deck of cards is stacked before you start the game. If your training dataset has 90% approvals for one demographic and 40% for another, a model trained on it will tend to learn that skewed pattern as “truth.” Clarify’s pre-training metrics quantify how tilted the deck is, so you can fix the data (resample, reweight) before spending hours training a model that’s biased from the start.
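Several of these metrics are simple enough to compute by hand for the 90% versus 40% example. The sketch below follows the standard definitions for binary labels (CI as the normalized difference in facet sizes, DPL as the difference in positive-label rates, TVD as half the L1 distance); the facet sizes of 800 and 200 rows are hypothetical, and this is not a call into Clarify itself.

```python
import math


def label_dist(n_positive, n_total):
    """Binary label distribution [P(y=0), P(y=1)] for one facet."""
    q = n_positive / n_total
    return [1 - q, q]


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: symmetric, built from two KL terms."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical loan dataset: facet a has 800 rows (90% approved),
# facet d has 200 rows (40% approved).
n_a, n_d = 800, 200
p_a, p_d = label_dist(720, n_a), label_dist(80, n_d)

ci = (n_a - n_d) / (n_a + n_d)                          # class imbalance
dpl = p_a[1] - p_d[1]                                   # difference in positive-label rates
tvd = 0.5 * sum(abs(x - y) for x, y in zip(p_a, p_d))   # total variation distance

print(f"CI={ci:.2f} DPL={dpl:.2f} TVD={tvd:.2f}")
print(f"KL={kl(p_a, p_d):.3f} JS={js(p_a, p_d):.3f}")
```

For binary labels DPL and TVD coincide in magnitude (0.5 here), and swapping the two facets changes the KL value but not the JS value, which is what "symmetric version" means in the table.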

Bias Mitigation Techniques

Pre-Processing (before training):
  • Resampling (oversample minority / undersample majority)
  • SMOTE (synthetic data for minority)
  • Data augmentation

In-Processing (during training):
  • Fairness constraints in loss function
  • Adversarial debiasing

Post-Processing (after training):
  • Threshold adjustment per group
  • Reject option classification

Data Security & Compliance

Requirement             Solution
───────────             ────────
Encrypt data at rest    S3 SSE-KMS, EBS encryption
Encrypt in transit      TLS/SSL
Anonymize PII           Glue DataBrew, Macie for detection
Access control          Lake Formation, S3 policies, IAM
Audit trail             CloudTrail, S3 access logs
HIPAA / PHI             Transcribe Medical, Comprehend Medical

Quick Reference: When to Use What

Scenario                            Service
────────                            ───────
Large-scale distributed ETL         EMR + Spark
Serverless ETL                      AWS Glue ETL Jobs
Visual data prep (no code)          Glue DataBrew
SageMaker-integrated visual prep    Data Wrangler
Label training data                 SageMaker Ground Truth
Store reusable features             SageMaker Feature Store
Query S3 with SQL                   Athena
Detect PII in S3                    Amazon Macie
Handle PII in transformations       Glue DataBrew
Detect pre-training bias            SageMaker Clarify
Balance imbalanced datasets         Data Wrangler (SMOTE)
Schema discovery                    Glue Crawlers
Data catalog / metadata             Glue Data Catalog