AWS MLS-C01: ML Specialty

Domain 2A: EDA, Statistics & Visualization

Exam Domain: 2, Exploratory Data Analysis (24%)
Task: Understand your data before building models


Why EDA Matters (First Principles)

ELI5: Imagine you’re a chef who just received a mystery box of ingredients. Would you start cooking immediately, or would you first smell, taste, and identify what you have? EDA is that “what do I have?” step. Skipping it is the cardinal sin of machine learning — you might spend weeks training a model on corrupted data, or discover too late that your target variable has 90% missing values.

EDA answers the fundamental questions before you commit to any modeling decisions:

  • Distribution — Is my data normally distributed? Skewed? Bimodal?
  • Relationships — Which features correlate with the target? With each other?
  • Quality — Are there nulls, duplicates, impossible values, encoding errors?
  • Outliers — Are there extreme values that will distort my model?
  • Patterns — Are there trends, seasonality, clusters?

Why this matters for the exam: Exam questions often describe a scenario where a model performs poorly and ask you to diagnose the problem. The answer is almost always rooted in EDA findings: skewed features that weren’t transformed, outliers that weren’t removed, class imbalance that wasn’t addressed. Know your EDA tools and when to apply them.


Descriptive Statistics

Measures of Central Tendency

| Measure | Formula | Best When |
|---|---|---|
| Mean | $\mu = \frac{1}{n}\sum x_i$ | Symmetric distributions, no extreme outliers |
| Median | Middle value when sorted | Skewed data, outliers present |
| Mode | Most frequent value | Categorical data, finding typical value |

When mean lies — the skewed data problem:

Symmetric (mean ≈ median):        Right-skewed (mean > median):
    *                                       *
   ***                                     **
  *****                                   ***
 *******                                 *****
*********                               *********
──────────                             ──────────
    ↑                                   ↑   ↑
  mean=median                        median mean
                                    (outliers pull mean right)

ELI5: Nine workers earn $30,000/year, one CEO earns $1,000,000/year. Mean salary = $127,000 — nobody actually earns that. Median salary = $30,000 — that’s the truth. Any time you see income data, house prices, or anything with a long tail, trust the median over the mean.
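
The arithmetic in this example can be checked in a few lines. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import mean, median

# Salaries from the example: nine workers and one CEO.
salaries = [30_000] * 9 + [1_000_000]

mean_salary = mean(salaries)      # pulled upward by the single CEO salary
median_salary = median(salaries)  # middle of the sorted values, ignores the tail

print(f"mean   = {mean_salary:,.0f}")
print(f"median = {median_salary:,.0f}")
```

Removing the CEO from the list leaves the median unchanged but drops the mean to $30,000, which is the sense in which the median is robust.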

Measures of Spread

Variance and Standard Deviation:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$$

ELI5: Standard deviation tells you the typical distance from the average (strictly, the root-mean-square distance). If test scores have std=5 and are roughly bell-shaped, about two-thirds of students scored within 5 points of the mean. If std=25, scores are wildly spread out. It answers: “How much should I trust the mean as a representative value?”

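| Measure | Formula | Robust to Outliers? |
|---|---|---|
| Range | max - min | No (one outlier destroys it) |
| Variance | $\sigma^2$ | No |
| Std Deviation | $\sigma$ | No |
| IQR | Q3 - Q1 | Yes |
| MAD | $\text{median}(\lvert x - \text{median}(x) \rvert)$ | Yes |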

Percentiles and Quartiles

Min    Q1     Q2(median)   Q3    Max
 |──────|─────────|──────────|──────|
        ↑                   ↑
   25th pct             75th pct

IQR = Q3 - Q1   (the "middle 50%" of data)
  • Q1 (25th percentile): 25% of data falls below this value
  • Q2 (50th percentile): the median
  • Q3 (75th percentile): 75% of data falls below this value
  • IQR: measures spread of the central 50%, immune to extreme values
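
To see the difference between robust and non-robust spread measures, the sketch below computes range, standard deviation, IQR, and MAD on a small made-up sample, then again after one value is replaced by an extreme one. The data and helper names are hypothetical; `pstdev` is the population formula (divide by n) used in this section, and the quartiles use the standard library's "inclusive" interpolation, which is one of several common conventions.

```python
from statistics import median, pstdev, quantiles

def iqr(values):
    """Interquartile range Q3 - Q1 (quartiles via the 'inclusive' method)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

def mad(values):
    """Median absolute deviation: median(|x - median(x)|)."""
    center = median(values)
    return median(abs(x - center) for x in values)

# Hypothetical sample, then the same sample with its largest value corrupted.
clean = [48, 50, 51, 52, 53, 55, 56, 58, 60]
dirty = clean[:-1] + [600]

for name, data in [("clean", clean), ("with outlier", dirty)]:
    print(f"{name:>12}: range={max(data) - min(data):6.1f} "
          f"std={pstdev(data):6.1f} IQR={iqr(data):4.1f} MAD={mad(data):4.1f}")
```

Range and standard deviation jump by more than an order of magnitude, while IQR and MAD do not move at all for this sample.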

Skewness

Left-skewed (negative):    Symmetric:        Right-skewed (positive):
      *                       *                           *
     ***                     ***                         **
    *****                   *****                       ***
   *******                 *******                     ****
──────────                ─────────                ──────────
  mean < median           mean = median           mean > median

Examples: capped scores,  Height, IQ           Income, house prices,
exam grades (easy test)                        website traffic

Skewness formula interpretation:

  • Skewness = 0: symmetric
  • Skewness > 0: right tail is longer (positive skew)
  • Skewness < 0: left tail is longer (negative skew)
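  • |Skewness| > 1: substantial skew; for right skew consider a log or square-root transform (for left skew, reflect the variable first, since a log makes left skew worse)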
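
As a rough illustration of these thresholds, the sketch below computes moment-based (population) skewness and shows a log transform removing right skew. The data are synthetic and chosen so that their logarithms are exactly symmetric; real data will rarely be this tidy, and the helper name is illustrative.

```python
import math
from statistics import fmean, pstdev

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu, sigma = fmean(values), pstdev(values)
    return fmean(((x - mu) / sigma) ** 3 for x in values)

# Synthetic right-skewed data: exponentials of evenly spaced numbers.
raw = [math.exp(k / 2) for k in range(-4, 5)]
logged = [math.log(x) for x in raw]

print(f"skewness before log: {skewness(raw):.2f}")
print(f"skewness after log:  {skewness(logged):.2f}")
```

The log transform only applies to strictly positive values; for data containing zeros, `log(1 + x)` is a common substitute.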

Kurtosis

Kurtosis measures “tail heaviness” — how extreme the outliers tend to be.

| Type | Kurtosis | Shape | Practical Meaning |
|---|---|---|---|
| Leptokurtic | > 3 | Tall peak, fat tails | More outliers than normal (financial returns) |
| Mesokurtic | = 3 | Normal distribution | Reference baseline |
| Platykurtic | < 3 | Flat peak, thin tails | Fewer extreme outliers (uniform-like) |

Exam tip: High kurtosis = fat tails = more outliers. This matters when you’re assuming normality for statistical tests or when building anomaly detectors.
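
A small sketch of how the number is computed, assuming the population (moment-based) definition where a normal distribution scores 3. Note that pandas and SciPy report excess kurtosis (kurtosis minus 3) by default, so a value near 0 from those libraries corresponds to "= 3" in the table. The two samples are made up for illustration.

```python
from statistics import fmean, pstdev

def kurtosis(values):
    """Population kurtosis: the mean of z-scores raised to the 4th power."""
    mu, sigma = fmean(values), pstdev(values)
    return fmean(((x - mu) / sigma) ** 4 for x in values)

def excess_kurtosis(values):
    """Kurtosis relative to the normal distribution's value of 3."""
    return kurtosis(values) - 3.0

flat = list(range(1, 101))       # evenly spread values, uniform-like
heavy = [0] * 98 + [-10, 10]     # mostly identical values plus two extremes

print(f"flat:  kurtosis={kurtosis(flat):.2f}, excess={excess_kurtosis(flat):.2f}")
print(f"heavy: kurtosis={kurtosis(heavy):.2f}, excess={excess_kurtosis(heavy):.2f}")
```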


Probability Distributions

Normal (Gaussian) Distribution

The 68-95-99.7 Rule (Empirical Rule):

                      68%
                  ┌─────────┐
                      95%
             ┌───────────────────┐
                     99.7%
        ┌─────────────────────────────┐
        │    │    │    │    │    │    │
  ──────┼────┼────┼────┼────┼────┼────┼──────
       -3σ  -2σ  -1σ   μ   +1σ  +2σ  +3σ

68% of data within 1σ of mean
95% of data within 2σ of mean
99.7% of data within 3σ of mean

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
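
The 68-95-99.7 figures follow directly from the normal CDF. A minimal check using the standard library's `NormalDist` (the helper name `mass_within` is illustrative):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def mass_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {mass_within(k):.4f}")
```

The same function gives the basis for the $|z| > 3$ outlier rule: only about 0.3% of normally distributed values fall outside three standard deviations.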

ELI5: The normal distribution appears everywhere in nature because of the Central Limit Theorem: add up enough independent random things, and the sum tends toward a bell curve for almost any original distribution (the technical condition is finite variance). Heights are the sum of thousands of genetic and nutritional factors. Test scores are the sum of many knowledge gaps and preparation factors. This is why the normal curve is the “default assumption” in statistics.

Central Limit Theorem (CLT): The sampling distribution of the mean of independent, identically distributed observations approaches normal as sample size $n \to \infty$, regardless of the shape of the underlying distribution, provided its variance is finite. This is why many ML algorithms that assume normality still work reasonably well in practice.
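
The CLT is easy to observe by simulation. The sketch below draws samples from an exponential distribution (strongly right-skewed, skewness 2) and looks at how the distribution of the sample mean changes with sample size. The seed, the number of trials, and the choice of an exponential source are all arbitrary assumptions made for illustration.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu, sigma = fmean(values), pstdev(values)
    return fmean(((x - mu) / sigma) ** 3 for x in values)

def sample_means(n, trials=5000):
    """Means of `trials` samples, each of size n, from an Exponential(1) source."""
    return [fmean(rng.expovariate(1.0) for _ in range(n)) for _ in range(trials)]

skew_by_n, std_by_n = {}, {}
for n in (1, 5, 50):
    means = sample_means(n)
    skew_by_n[n] = skewness(means)
    std_by_n[n] = pstdev(means)
    print(f"n={n:>2}: skewness of sample means={skew_by_n[n]:.2f}, std={std_by_n[n]:.3f}")
```

As n grows, the skewness of the sample means shrinks toward 0 (more bell-shaped) and their standard deviation shrinks roughly as $\sigma/\sqrt{n}$.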

Why this matters for ML:

  • Inference for linear regression (confidence intervals, p-values) assumes normally distributed residuals
  • Gaussian Naive Bayes assumes each feature is normally distributed within each class
  • Many statistical tests (t-test, ANOVA) assume normality
  • Z-score anomaly detection requires approximate normality

Distribution Comparison Table

| Distribution | Type | When You See It | ML Implication |
|---|---|---|---|
| Normal | Continuous | Heights, measurement errors, CLT sums | Many algorithms assume it; check residuals |
| Binomial | Discrete | n coin flips, A/B test conversions | Modeling success counts over a fixed n trials |
| Poisson | Discrete | Rare events: website hits/hr, defects/unit | Anomaly detection (event rate modeling) |
| Uniform | Continuous or discrete | Random sampling (continuous), dice rolls (discrete) | Baseline; uniform random number generation |
| Exponential | Continuous | Time between events, hardware failures | Survival analysis, time-to-event modeling |
| Power law | Continuous | Word frequencies, wealth, social networks | Recommendation systems, long-tail problem |

Binomial — n independent trials, probability p of success: $$P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}$$

Poisson — rare events, rate $\lambda$ per unit time: $$P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

ELI5 on Poisson: If a server gets 100 requests/hour on average, Poisson tells you the probability of getting 200 in one hour — that’s the anomaly detection hook. Requests deviating significantly from $\lambda$ signal something unusual.
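
The server example can be made concrete. This is a sketch under the assumption that hourly request counts really are Poisson with $\lambda = 100$; the function names are illustrative, and the upper tail is summed over a fixed number of terms, which is adequate here because the terms shrink rapidly once k exceeds $\lambda$.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for Poisson(lam), computed in log space to avoid overflow."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_tail(k, lam, terms=500):
    """Approximate P(X >= k) by summing the pmf over k, k+1, ..., k+terms-1."""
    return sum(poisson_pmf(i, lam) for i in range(k, k + terms))

lam = 100  # average requests per hour, from the example
for k in (110, 130, 200):
    print(f"P(X >= {k}) = {poisson_tail(k, lam):.3e}")
```

A count of 110 is unremarkable, 130 is already rare, and 200 is so improbable under this model that it is reasonable to treat it as an anomaly.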

Power law / long-tail distributions are critical for recommendation systems: roughly 20% of products generate roughly 80% of sales (Pareto principle). A model trained only on popular items tends to perform poorly on the long tail.


Correlation & Relationships

Pearson Correlation Coefficient

$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}$$

Range: [-1, +1]

  • $r = +1$: perfect positive linear relationship
  • $r = -1$: perfect negative linear relationship
  • $r = 0$: no linear relationship (may still have non-linear relationship)

Limitations of Pearson:

  • Only captures linear relationships
  • Sensitive to outliers (one outlier can change r dramatically)
  • Assumes both variables are continuous

Correlation ≠ causation: Ice cream sales and drowning deaths are highly correlated (both increase in summer). The confounder is hot weather. Never use correlation alone to establish causation.

Spearman Rank Correlation

Spearman’s $\rho$ is Pearson’s correlation computed on the ranks of the data rather than on the values themselves.

$$\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$$

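where $d_i$ is the difference between the two ranks of observation $i$ and $n$ is the number of observations. This shortcut formula is exact only when there are no tied ranks.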
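
The difference between the two coefficients shows up clearly on a relationship that is monotonic but not linear. The sketch below implements both from the formulas in this section on made-up data; the ranking helper assumes there are no ties (tied values would need averaged ranks).

```python
from statistics import fmean

def pearson(xs, ys):
    """Pearson r from the definition: co-deviation over the product of spreads."""
    mx, my = fmean(xs), fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def ranks(values):
    """Ranks 1..n in ascending order; assumes all values are distinct."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman rho: Pearson r applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

xs = list(range(1, 11))
ys = [2 ** x for x in xs]  # always increasing, but far from a straight line

print(f"Pearson  r   = {pearson(xs, ys):.3f}")
print(f"Spearman rho = {spearman(xs, ys):.3f}")
```

Spearman reports a perfect relationship because the ordering is preserved exactly, while Pearson is noticeably below 1 because the points do not lie on a line.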

Pearson vs Spearman — Decision Guide

| Situation | Use |
|---|---|
| Linear relationship expected | Pearson |
| Non-linear monotonic relationship | Spearman |
| Outliers present | Spearman (rank-based, robust) |
| Ordinal variables | Spearman |
| Normal distribution assumed | Pearson |
| Small sample with non-normality | Spearman |

Covariance vs Correlation

  • Covariance measures joint variation but is unbounded — its magnitude depends on units
  • Correlation = standardized covariance, always in [-1, +1], unit-free

$$\text{Cov}(X,Y) = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{n}$$

$$r = \frac{\text{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y}$$
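
A quick way to see the unit dependence: express the same heights in metres and in centimetres. The sketch below uses hypothetical height and weight values and the population formulas from this section.

```python
from statistics import fmean, pstdev

def covariance(xs, ys):
    """Population covariance (divides by n)."""
    mx, my = fmean(xs), fmean(ys)
    return fmean((x - mx) * (y - my) for x, y in zip(xs, ys))

def correlation(xs, ys):
    """Covariance standardized by both standard deviations."""
    return covariance(xs, ys) / (pstdev(xs) * pstdev(ys))

# Hypothetical measurements for six people.
height_m = [1.60, 1.65, 1.70, 1.75, 1.80, 1.85]
weight_kg = [55.0, 61.0, 66.0, 70.0, 77.0, 82.0]
height_cm = [h * 100 for h in height_m]

print(f"cov (m):  {covariance(height_m, weight_kg):.4f}")
print(f"cov (cm): {covariance(height_cm, weight_kg):.4f}")
print(f"r (m):    {correlation(height_m, weight_kg):.4f}")
print(f"r (cm):   {correlation(height_cm, weight_kg):.4f}")
```

Changing the unit multiplies the covariance by 100 but leaves the correlation untouched, which is why correlation is the quantity to compare across feature pairs.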

Multicollinearity

What it is: Two or more features are highly correlated with each other.

ELI5: If two features tell the model the same thing, it gets confused about which one matters. Imagine predicting house price using both “square footage” and “number of rooms” — they’re measuring the same underlying thing (size). The model will arbitrarily split the credit between them, making coefficients unstable and uninterpretable.

Why it kills regression:

  • Coefficient estimates become unstable (high variance)
  • Standard errors inflate — hard to assess feature importance
  • Does NOT necessarily hurt predictions, but kills interpretability

Detection with VIF (Variance Inflation Factor):

$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the R² from regressing feature $j$ on all other features.

  • VIF = 1: no correlation
  • VIF = 1–5: acceptable
  • VIF > 5: concerning
  • VIF > 10: severe multicollinearity — remove or combine features
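
With only two features, regressing one on the other gives $R_j^2 = r^2$ (the squared Pearson correlation), so VIF can be computed by hand. The sketch below covers only that two-feature special case; with more features, $R_j^2$ comes from a multiple regression instead. The house data are invented for illustration.

```python
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = fmean(xs), fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

def vif_two_features(xs, ys):
    """VIF = 1 / (1 - R^2), where R^2 = r^2 in the two-feature case."""
    r = pearson(xs, ys)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical houses: size and room count move together, age does not.
sqft = [800, 950, 1100, 1300, 1500, 1800, 2100, 2500]
rooms = [2, 3, 3, 4, 4, 5, 6, 7]
age_years = [30, 5, 42, 18, 25, 9, 37, 14]

print(f"VIF(sqft vs rooms): {vif_two_features(sqft, rooms):.1f}")
print(f"VIF(sqft vs age):   {vif_two_features(sqft, age_years):.2f}")
```

The first pair lands well above the 10 threshold, while the second stays close to 1.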

Exam tip: If a question mentions unstable regression coefficients, high p-values despite obvious relationships, or asks how to detect multicollinearity — the answer is VIF.


Data Visualization for ML

Which Visualization for Which Question

| Question | Visualization |
|---|---|
| What is the distribution of one variable? | Histogram |
| How spread out is data, any outliers? | Box plot |
| Is there a relationship between two variables? | Scatter plot |
| How do many variables correlate with each other? | Heatmap (correlation matrix) |
| What are all pairwise relationships? | Pair plot |
| Is there a trend over time? | Time series plot |
| Compare categories (counts) | Bar chart |
| Compare distributions across groups | Grouped box plots |

Histogram

Shows the distribution shape by dividing values into bins and counting occurrences.

Bin selection matters:

  • Too few bins: lose detail, distributions look flat
  • Too many bins: noise looks like signal
  • Sturges’ rule: $k = 1 + \log_2 n$ bins
  • Scott’s rule: bin width = $3.5\sigma / n^{1/3}$
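
Both rules are one-liners. The sketch below rounds Sturges' count up to a whole number of bins (a common convention, not stated in the rule itself) and uses the population standard deviation for Scott's rule; the sample is a made-up evenly spread set of values.

```python
import math
from statistics import pstdev

def sturges_bins(n):
    """Sturges' rule: k = 1 + log2(n), rounded up to a whole number of bins."""
    return math.ceil(1 + math.log2(n))

def scott_bin_width(values):
    """Scott's rule: bin width = 3.5 * sigma / n^(1/3)."""
    return 3.5 * pstdev(values) / len(values) ** (1 / 3)

values = list(range(1000))  # hypothetical sample of 1,000 evenly spread values
width = scott_bin_width(values)
n_bins_scott = math.ceil((max(values) - min(values)) / width)

print(f"Sturges: {sturges_bins(len(values))} bins")
print(f"Scott:   width={width:.1f}, about {n_bins_scott} bins")
```

Sturges depends only on the sample size, while Scott also adapts to the spread of the data, so the two can disagree considerably on large or heavy-tailed samples.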

Box Plot (Box-and-Whisker)

          *          ← Outlier (beyond the upper fence)

        ──┬──        ← Upper whisker: largest value ≤ Q3 + 1.5×IQR
          │
      ┌───┴───┐      ← Q3 (75th percentile)
      │       │
      ├───────┤      ← Q2, the median (50th percentile)
      │       │
      └───┬───┘      ← Q1 (25th percentile)
          │
        ──┴──        ← Lower whisker: smallest value ≥ Q1 - 1.5×IQR

          *          ← Outlier (beyond the lower fence)

  Box height (Q3 - Q1) = IQR

Box plots are excellent for comparing distributions across groups — e.g., salary by department, accuracy by model type.

Heatmap

Visualizes the correlation matrix: each cell shows the Pearson r between two features, colored on a diverging scale from -1 to +1 (for example, dark blue through white to dark red).

  • Diagonal is always 1.0 (feature correlated with itself)
  • Look for strongly colored off-diagonal cells between features: these flag highly correlated pairs and possible multicollinearity
  • Look for features strongly correlated with the target: these are potentially useful predictors

Pair Plot

Plots every feature against every other feature in a grid. The diagonal shows each feature’s distribution (histogram or KDE).

  • Scatter in upper/lower triangle: relationship between feature pairs
  • Time-consuming for many features (O(n²) plots) — use on selected features only

Time Series Visualization

Components to look for:

  • Trend: long-term increase or decrease
  • Seasonality: regular periodic patterns (weekly, monthly, yearly)
  • Cycles: irregular fluctuations (business cycles)
  • Stationarity: does mean/variance change over time?

Exam tip: Non-stationary time series must be made stationary (via differencing or detrending) before applying many time series models.
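
Differencing is simple enough to show directly. The sketch below builds a synthetic series with a linear trend plus a repeating four-step seasonal pattern (both invented for illustration) and applies a first difference and a seasonal difference.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Synthetic series: level 10, trend of +2 per step, season of period 4.
season = [0, 3, 0, -3]
series = [10 + 2 * t + season[t % 4] for t in range(16)]

first_diff = difference(series)            # removes the linear trend
seasonal_diff = difference(series, lag=4)  # removes the period-4 seasonality

print("original:     ", series)
print("lag-1 diff:   ", first_diff)
print("lag-4 diff:   ", seasonal_diff)
```

After the lag-1 difference the values no longer drift upward but still repeat every four steps; the lag-4 difference is constant, meaning both the seasonal pattern and the changing level are gone.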


Outlier Detection

What is an Outlier? (First Principles)

An outlier is a data point that is far from the rest of the distribution. Not all outliers are errors — in fraud detection, outliers ARE the signal. Context determines whether to remove or keep.

ELI5: In a dataset of human heights, a value of 300cm is clearly an error — remove it. In a dataset of financial transactions, a $50,000 transaction from an account that normally does $50 transactions is an outlier worth keeping and investigating. The question is always: “Is this outlier signal or noise?”

IQR Method

$$\text{Lower fence} = Q1 - 1.5 \times IQR$$ $$\text{Upper fence} = Q3 + 1.5 \times IQR$$

Points outside these fences are flagged as outliers.

  • Robust: not affected by the outliers themselves (uses Q1/Q3, not mean)
  • Assumption-free: does not assume normality
  • Limitation: may flag too many points in heavy-tailed distributions
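
A minimal sketch of the IQR rule, using the standard library's "inclusive" quartile method (other quartile conventions give slightly different fences). The response-time values are hypothetical.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

def iqr_outliers(values, k=1.5):
    """Values that fall outside the fences."""
    low, high = iqr_fences(values, k)
    return [x for x in values if x < low or x > high]

# Hypothetical response times in milliseconds, with one extreme value.
response_ms = [120, 125, 130, 128, 122, 127, 131, 126, 124, 900]

print("fences:  ", iqr_fences(response_ms))
print("outliers:", iqr_outliers(response_ms))
```

Raising `k` to 3.0 is a common way to flag only the most extreme points when the data are heavy-tailed.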

Z-Score Method

$$z = \frac{x - \mu}{\sigma}$$

Flag as outlier if $|z| > 3$ (outside 3 standard deviations).

  • Assumes normality — only valid when distribution is approximately normal
  • Not robust: mean and std are themselves pulled by outliers
  • Simple and fast for normally distributed data
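
The "not robust" caveat can be surprisingly severe on small samples: a large enough outlier inflates the standard deviation so much that its own z-score stays under 3. The sketch below shows this on made-up readings and contrasts it with a median/MAD-based "modified z-score" (a commonly used robust variant with a 3.5 cutoff, included here as an illustration and not part of the method described above).

```python
from statistics import fmean, median, pstdev

def z_score_outliers(values, threshold=3.0):
    """Flag values whose |z| exceeds the threshold, using mean and std."""
    mu, sigma = fmean(values), pstdev(values)
    return [x for x in values if abs((x - mu) / sigma) > threshold]

def modified_z_outliers(values, threshold=3.5):
    """Robust variant: 0.6745 * (x - median) / MAD. Assumes MAD is not zero."""
    center = median(values)
    mad = median(abs(x - center) for x in values)
    return [x for x in values if abs(0.6745 * (x - center) / mad) > threshold]

readings = [10, 11, 12, 10, 11, 1000]  # hypothetical, one obvious bad value

print("z-score flags:         ", z_score_outliers(readings))
print("modified z-score flags:", modified_z_outliers(readings))
```

With six points, the plain z-score of the value 1000 is only about 2.2, so nothing is flagged, while the robust version catches it.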

Isolation Forest

Algorithm-based anomaly detection:

  1. Randomly select a feature and a random split point
  2. Repeat recursively — outliers require fewer splits to isolate
  3. Anomaly score is derived from the average path length across all trees: the shorter the average path, the more anomalous the point

ELI5: Imagine finding someone in a crowd by asking yes/no questions. “Are you taller than 6ft? Are you wearing a hat?” Finding an unusual person takes fewer questions than finding a typical person. Isolation Forest does exactly this — unusual data points are “isolated” faster.

  • Does NOT assume a specific distribution
  • Works well in high dimensions
  • SageMaker does not ship Isolation Forest itself; its built-in counterpart is Random Cut Forest (RCF), a related tree-based anomaly detection algorithm
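
The isolation idea can be sketched in a few lines. This is a deliberately simplified toy, not the full algorithm: it works on one-dimensional data, does not subsample, and follows a single point down freshly randomized splits instead of building reusable trees. All names and data are hypothetical.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=12):
    """Count random splits needed until `point` is alone (or max_depth is hit)."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    low, high = min(data), max(data)
    if low == high:
        return depth
    split = rng.uniform(low, high)
    # Keep only the values on the same side of the split as the point.
    same_side = [x for x in data if (x < split) == (point < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)

def average_path_length(point, data, n_trees=200, seed=0):
    """Average the path length over many independent random partitionings."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

# Hypothetical data: a tight cluster between 50 and 70, plus one far-away value.
data = [50 + 0.5 * i for i in range(40)] + [500.0]

typical = average_path_length(60.0, data)
unusual = average_path_length(500.0, data)
print(f"average path length, typical point (60):  {typical:.2f}")
print(f"average path length, unusual point (500): {unusual:.2f}")
```

The far-away value is usually separated by the very first split, so its average path is close to 1, while a point inside the cluster needs several splits.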

When to Remove vs When to Keep Outliers

| Scenario | Decision | Reason |
|---|---|---|
| House price prediction | Remove extreme values | Data entry errors, hurt regression |
| Fraud detection | Keep (they’re the target) | Outliers ARE the signal |
| Sensor data with known equipment failures | Remove or flag | Known noise source |
| Network intrusion detection | Keep | Attacks are “outliers” |
| Image pixel values outside 0-255 | Remove | Physically impossible |
| Income outliers in a wealth study | Keep | Billionaires are real |

Missing Data

Types of Missingness

ELI5: Think of missing data as cards falling out of a deck. MCAR is cards falling randomly — any card could fall out. MAR is aces falling out more often because they’re on top — the missingness relates to other variables you can see. MNAR is people hiding their cards on purpose — the missingness is directly related to the value that’s missing (people with low income refuse to report their income).

| Type | Full Name | Definition | Example | Implication |
|---|---|---|---|---|
| MCAR | Missing Completely At Random | Probability of missing is unrelated to any variable | Random sensor dropout | Safest: deletion is an option |
| MAR | Missing At Random | Probability of missing depends on observed variables | Income missing more often for younger people | Imputation using other features works |
| MNAR | Missing Not At Random | Probability of missing depends on the missing value itself | Sick patients skip health surveys | Hardest: imputation may introduce bias |

Strategies for Handling Missing Data

1. Deletion:

  • Listwise deletion: remove entire row if any value is missing
    • OK when MCAR and < 5% missing
    • Dangerous when many rows have at least one missing value
  • Pairwise deletion: use available data per analysis
    • Only for statistical analyses, not ML pipelines

2. Imputation:

| Method | When to Use | Limitation |
|---|---|---|
| Mean/Median | MCAR, simple baseline | Reduces variance, distorts distributions |
| Mode | Categorical features | Same as above |
| KNN imputation | MAR, features are correlated | Slow on large datasets |
| MICE (Multiple Imputation by Chained Equations) | MAR, best general method | Computationally expensive |
| Regression imputation | MAR, strong predictors exist | Can introduce circularity |
| Forward/backward fill | Time series | Only valid for time-ordered data |

3. Indicator variable: Add a binary feature_was_missing column alongside the imputed value. This lets the model learn that the pattern of missingness itself is informative (important for MNAR data).
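
A minimal sketch of the indicator-variable pattern, combining median imputation with a was-missing flag. It represents missing values as `None` in a plain list; the column values and function name are hypothetical.

```python
from statistics import median

def impute_with_indicator(values):
    """Fill missing entries with the observed median and record where they were."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [int(v is None) for v in values]
    return imputed, was_missing

income = [42_000, None, 55_000, 38_000, None, 61_000]
income_imputed, income_was_missing = impute_with_indicator(income)

print("imputed:    ", income_imputed)
print("was_missing:", income_was_missing)
```

In a real pipeline the fill value should be computed on the training split only and reused for validation and production data, so that no information leaks from the evaluation set.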

Missing Data Decision Tree

Is missingness < 5%?
  Yes → Listwise deletion is usually acceptable (safest when MCAR)
  No  → Continue

Is it MCAR?
  Yes → Mean/median imputation or deletion
  No  → Continue

Is it MAR (missingness relates to observed variables)?
  Yes → KNN or MICE imputation
  No (MNAR) → Add indicator variable + impute,
               consider domain-specific strategy

Why the strategy matters: Using mean imputation on MNAR data is actively harmful — you’re pretending the “missing” people are average, when they’re systematically different. A model trained this way will fail on production data where the missingness pattern continues.


Amazon QuickSight

What it is: Serverless, fully managed business intelligence (BI) and visualization service.

Data Sources → QuickSight → Dashboards & Reports
  S3                         Interactive visualizations
  Athena                     Scheduled email reports
  Redshift                   Embedded analytics
  RDS / Aurora               Mobile-accessible

SPICE Engine

Super-fast, Parallel, In-memory Calculation Engine

  • In-memory columnar storage, not querying source data each time
  • Enables sub-second response times even on large datasets
  • Data is imported into SPICE once and refreshed on schedule
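  • Per-dataset capacity limits apply and depend on the QuickSight edition; 500 million rows is a commonly cited Enterprise figure, but AWS has raised limits over time, so check the current quotas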

Exam tip: SPICE is QuickSight’s secret weapon for speed. If a question asks about fast, interactive BI dashboards for non-technical users, QuickSight + SPICE is the answer.

ML Insights in QuickSight

| Feature | What It Does |
|---|---|
| Anomaly Detection | Uses Random Cut Forest to find anomalies in metrics |
| Forecasting | Built-in time series forecasting (no ML knowledge needed) |
| Auto-narratives | Natural language summaries of chart insights |
| Suggested Insights | ML-recommended analyses for your dataset |

When to Use QuickSight vs Custom Visualization

| Use QuickSight when | Use custom (e.g., SageMaker Studio plots) when |
|---|---|
| Business users need dashboards | Data scientists doing deep EDA |
| No code / low code required | Need full matplotlib/seaborn control |
| Sharing across organization | Ad-hoc research plots |
| Need scheduled reports | Integrating visualizations into code pipeline |
| Embedded analytics in apps | Complex custom chart types |

Amazon SageMaker Data Wrangler

What it is: Visual, no-code/low-code data preparation and EDA tool within SageMaker Studio.

Data Sources → Data Wrangler → Transforms → Export
  S3                            300+ built-in    SageMaker Pipelines
  Athena                        Custom Python    Feature Store
  Redshift                      Data Quality     Training Job
  Lake Formation                Check

Key capabilities:

  • Visual data flow interface (drag-and-drop)
  • 300+ built-in transformations (imputation, encoding, scaling, etc.)
  • Data quality reports: missing values, outliers, distributions
  • Quick model: train a baseline model in minutes to validate feature usefulness
  • Bias detection (built-in Clarify integration)
  • Auto-generates pipeline code for export

Data Wrangler vs Glue DataBrew

| Feature | Data Wrangler | Glue DataBrew |
|---|---|---|
| Primary use | ML data prep | General data transformation |
| Audience | Data scientists | Data engineers / analysts |
| Integration | SageMaker ecosystem | AWS Glue + broader AWS |
| ML-specific features | Bias, quick model | No |
| Code export | SageMaker Pipelines | Glue jobs |
| Cost model | Studio compute | Per-node processing |

Exam tip: Data Wrangler = ML-focused visual prep. Glue DataBrew = general data transformation for broader data engineering use cases. If the question mentions “data scientists” and “ML pipeline,” choose Data Wrangler.


Amazon Athena for EDA

What it is: Serverless, interactive SQL query engine for data in S3.

S3 Data (CSV, Parquet, JSON, ORC)
         ↓
     AWS Glue Data Catalog (schema)
         ↓
       Athena (SQL queries)
         ↓
  Results → S3 | QuickSight | SageMaker

Why Athena is great for initial EDA:

  • No infrastructure to provision — query immediately
  • Pay per query ($5 per TB scanned in most regions; columnar formats such as Parquet cut cost sharply because only the referenced columns are read)
  • Standard SQL — accessible to anyone who knows SQL
  • Query data in place — no ETL needed to start exploring

Common EDA queries with Athena:

-- Distribution of a column
SELECT income_bucket, COUNT(*) as cnt
FROM customers
GROUP BY income_bucket ORDER BY income_bucket;

-- Null count check
SELECT
  COUNT(*) as total_rows,
  SUM(CASE WHEN age IS NULL THEN 1 ELSE 0 END) as missing_age,
  SUM(CASE WHEN salary IS NULL THEN 1 ELSE 0 END) as missing_salary
FROM customers;

-- Percentile analysis
SELECT
  approx_percentile(salary, 0.25) as q1,
  approx_percentile(salary, 0.50) as median,
  approx_percentile(salary, 0.75) as q3
FROM customers;

Exam tip: Athena + Parquet + Glue Catalog is the canonical “explore S3 data without infrastructure” pattern. If a question asks how to do ad-hoc SQL exploration on data sitting in S3 cheaply, this is the answer.


Quick Reference: EDA Toolkit

| Task | Tool / Method |
|---|---|
| Initial data exploration in SQL | Amazon Athena |
| Visual no-code data prep + ML | SageMaker Data Wrangler |
| Business dashboards | Amazon QuickSight |
| Distribution shape | Histogram, box plot |
| Outlier detection (rule-based) | IQR method, Z-score |
| Outlier detection (ML-based) | Isolation Forest, Random Cut Forest |
| Correlation between features | Pearson r, heatmap |
| Non-linear (monotonic) correlation | Spearman rank |
| Multicollinearity check | VIF |
| All pairwise relationships | Pair plot |
| Missing data best practice | MICE or KNN imputation + indicator variable |
| Time series patterns | Decomposition plot (trend + seasonality + residual) |