Domain 2A: EDA, Statistics & Visualization

Exam Domain: 2, Exploratory Data Analysis (24%)
Task: Understand your data before building models

Why EDA Matters (First Principles)

ELI5: Imagine you’re a chef who just received a mystery box of ingredients. Would you start cooking immediately, or would you first smell, taste, and identify what you have? EDA is that “what do I have?” step. Skipping it is the cardinal sin of machine learning — you might spend weeks training a model on corrupted data, or discover too late that your target variable has 90% missing values.

EDA answers the fundamental questions before you commit to any modeling decisions:

Distribution — Is my data normally distributed? Skewed? Bimodal?
Relationships — Which features correlate with the target? With each other?
Quality — Are there nulls, duplicates, impossible values, encoding errors?
Outliers — Are there extreme values that will distort my model?
Patterns — Are there trends, seasonality, clusters?

Why this matters for the exam: Exam questions often describe a scenario where a model performs poorly and ask you to diagnose the problem. The answer is almost always rooted in EDA findings: skewed features that weren’t transformed, outliers that weren’t removed, class imbalance that wasn’t addressed. Know your EDA tools and when to apply them.

Descriptive Statistics

Measures of Central Tendency

Measure	Formula	Best When
Mean	$\mu = \frac{1}{n}\sum x_i$	Symmetric distributions, no extreme outliers
Median	Middle value when sorted	Skewed data, outliers present
Mode	Most frequent value	Categorical data, finding typical value

When mean lies — the skewed data problem:

Symmetric (mean ≈ median):        Right-skewed (mean > median):
    *                                       *
   ***                                     **
  *****                                   ***
 *******                                 *****
*********                               *********
──────────                             ──────────
    ↑                                   ↑   ↑
  mean=median                        median mean
                                    (outliers pull mean right)

ELI5: Nine workers earn $30,000/year, one CEO earns $1,000,000/year. Mean salary = $127,000 — nobody actually earns that. Median salary = $30,000 — that’s the truth. Any time you see income data, house prices, or anything with a long tail, trust the median over the mean.
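
To make the salary example concrete, here is a minimal Python sketch using only the standard library. The list `salaries` simply holds the ten values from the example above; the variable names are illustrative.

```python
from statistics import mean, median

# Nine workers at $30,000 and one CEO at $1,000,000 (values from the example above)
salaries = [30_000] * 9 + [1_000_000]

mean_salary = mean(salaries)      # pulled upward by the single extreme value
median_salary = median(salaries)  # middle of the sorted values

print(f"mean   = {mean_salary:,.0f}")
print(f"median = {median_salary:,.0f}")
```

The single extreme value moves the mean from $30,000 to $127,000, while the median does not move at all.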

Measures of Spread

Variance and Standard Deviation:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$$

ELI5: Standard deviation tells you the “typical distance from average” (strictly, the root-mean-square distance). If test scores are roughly bell-shaped with std=5, about two-thirds of students scored within 5 points of the mean. If std=25, scores are widely spread out. It answers: “How much should I trust the mean as a representative value?”

Measure	Formula	Robust to Outliers?
Range	max - min	No (one outlier destroys it)
Variance	$\sigma^2$	No
Std Deviation	$\sigma$	No
IQR	Q3 - Q1	Yes
MAD	median(|x - median(x)|)	Yes
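
The robustness column can be checked with a small sketch. It assumes a made-up list of measurements (`clean`) plus one hypothetical data-entry error, and uses `statistics.pstdev`, which is the population (1/n) form matching the formulas above. The helper names `mad` and `spread_summary` are illustrative.

```python
from statistics import median, pstdev

def mad(values):
    """Median absolute deviation: median(|x - median(x)|)."""
    m = median(values)
    return median(abs(x - m) for x in values)

def spread_summary(values):
    """Collect three spread measures for one list of numbers."""
    return {
        "range": max(values) - min(values),
        "std": pstdev(values),  # population standard deviation (divides by n)
        "mad": mad(values),
    }

clean = [48, 49, 50, 50, 51, 52]   # hypothetical measurements
with_outlier = clean + [500]        # same data plus one hypothetical entry error

clean_summary = spread_summary(clean)
outlier_summary = spread_summary(with_outlier)
print(clean_summary)
print(outlier_summary)
```

In this toy data the range and standard deviation blow up once the outlier is added, while the MAD is unchanged.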

Percentiles and Quartiles

Min    Q1     Q2(median)   Q3    Max
 |──────|─────────|──────────|──────|
        ↑                   ↑
   25th pct             75th pct

IQR = Q3 - Q1   (the "middle 50%" of data)

Q1 (25th percentile): 25% of data falls below this value
Q2 (50th percentile): the median
Q3 (75th percentile): 75% of data falls below this value
IQR: measures spread of the central 50%, largely unaffected by extreme values
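
As a quick illustration, the sketch below computes quartiles and the IQR with `statistics.quantiles` from the Python standard library. The data is invented, and `method="inclusive"` is one of several quantile conventions, so other tools may return slightly different quartile values for the same data.

```python
from statistics import quantiles

def quartiles_and_iqr(values):
    """Return (Q1, Q2, Q3, IQR) using the 'inclusive' quantile convention."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q2, q3, q3 - q1

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical values
stretched = data[:-1] + [900]        # replace the maximum with an extreme value

print(quartiles_and_iqr(data))
print(quartiles_and_iqr(stretched))
```

Both calls report the same quartiles and IQR, because changing the maximum does not touch the middle 50% of the data.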

Skewness

Left-skewed (negative):    Symmetric:        Right-skewed (positive):
      *                       *                           *
     ***                     ***                         **
    *****                   *****                       ***
   *******                 *******                     ****
──────────                ─────────                ──────────
  mean < median           mean = median           mean > median

Examples: exam grades     Height, IQ           Income, house prices,
on an easy test                                website traffic

Skewness formula interpretation:

Skewness = 0: symmetric
Skewness > 0: right tail is longer (positive skew)
Skewness < 0: left tail is longer (negative skew)
|Skewness| > 1: substantial skew; for right-skewed, strictly positive data, consider a log transform
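
A minimal sketch of these rules, assuming the simple moment-based (population) definition of skewness and an invented list of incomes in thousands. Libraries may apply small-sample corrections, so their numbers can differ slightly.

```python
import math
from statistics import fmean

def skewness(values):
    """Moment-based skewness: third central moment / (second central moment)^1.5."""
    mu = fmean(values)
    m2 = fmean((x - mu) ** 2 for x in values)
    m3 = fmean((x - mu) ** 3 for x in values)
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
incomes = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]  # hypothetical, in thousands
log_incomes = [math.log(x) for x in incomes]

print(f"symmetric:   {skewness(symmetric):.2f}")
print(f"incomes:     {skewness(incomes):.2f}")
print(f"log incomes: {skewness(log_incomes):.2f}")
```

On this toy data the log transform reduces the skew but does not remove it, which is typical: it compresses the right tail rather than guaranteeing symmetry.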

Kurtosis

Kurtosis measures “tail heaviness” — how extreme the outliers tend to be.

Type	Kurtosis	Shape	Practical Meaning
Leptokurtic	> 3	Tall peak, fat tails	More outliers than normal (financial returns)
Mesokurtic	= 3	Normal distribution	Reference baseline
Platykurtic	< 3	Flat peak, thin tails	Fewer extreme outliers (uniform-like)

Exam tip: High kurtosis = fat tails = more outliers. This matters when you’re assuming normality for statistical tests or when building anomaly detectors.
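
The table uses the convention in which a normal distribution has kurtosis 3. The sketch below computes that quantity from central moments on two invented datasets. Note that some libraries (for example SciPy and pandas) report excess kurtosis by default, which is this value minus 3.

```python
from statistics import fmean

def kurtosis(values):
    """Non-excess kurtosis: fourth central moment / (second central moment)^2."""
    mu = fmean(values)
    m2 = fmean((x - mu) ** 2 for x in values)
    m4 = fmean((x - mu) ** 4 for x in values)
    return m4 / m2 ** 2

uniform_like = list(range(1, 101))    # evenly spread values, thin tails
heavy_tailed = [0] * 98 + [-10, 10]   # mostly identical values plus two extremes

print(f"uniform-like: {kurtosis(uniform_like):.2f}")
print(f"heavy-tailed: {kurtosis(heavy_tailed):.2f}")
```

The evenly spread data lands below 3 (platykurtic), while the dataset dominated by two extreme points lands far above 3 (leptokurtic).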

Probability Distributions

Normal (Gaussian) Distribution

The 68-95-99.7 Rule (Empirical Rule):

                   68%
              ┌───────────┐
                   95%
        ┌───────────────────────┐
                  99.7%
  ┌───────────────────────────────────┐
──┼─────┼─────┼─────┼─────┼─────┼─────┼──
 -3σ   -2σ   -1σ    μ    +1σ   +2σ   +3σ

68% of data within 1σ of mean
95% of data within 2σ of mean
99.7% of data within 3σ of mean
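
The three percentages can be reproduced from the normal CDF. This sketch uses `statistics.NormalDist` from the Python standard library; the helper name `mass_within` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def mass_within(k):
    """Probability that a standard normal value falls within k standard deviations."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {mass_within(k):.4f}")
```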

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

ELI5: The normal distribution appears everywhere in nature because of the Central Limit Theorem: add up enough independent random things (each with finite variance), and the sum tends toward a bell curve, largely regardless of what the original distributions were. Heights are the sum of thousands of genetic and nutritional factors. Test scores are the sum of thousands of knowledge gaps and preparation factors. This is why the normal curve is the “default assumption” in statistics.

Central Limit Theorem (CLT): For independent samples from a distribution with finite variance, the sampling distribution of the mean approaches a normal distribution as sample size $n \to \infty$, regardless of the shape of the underlying distribution. This is one reason methods that assume normality often still work reasonably well in practice.
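
A small simulation can make the CLT tangible. The sketch below is an illustration under stated assumptions: it draws from an exponential distribution (strongly right-skewed, with mean 1 and standard deviation 1), and the sample size, number of repetitions, and seed are arbitrary choices.

```python
import random
from statistics import fmean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

n = 50               # size of each sample
num_samples = 5_000  # how many sample means to collect

# Each entry is the mean of n draws from Exponential(rate=1)
sample_means = [
    fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

center = fmean(sample_means)
spread = pstdev(sample_means)
expected_spread = 1.0 / n ** 0.5  # sigma / sqrt(n), with sigma = 1

# Share of sample means within one standard deviation of their center
within_one = sum(abs(m - center) <= spread for m in sample_means) / num_samples

print(f"center of sample means: {center:.3f}")
print(f"spread of sample means: {spread:.3f} (theory: {expected_spread:.3f})")
print(f"share within 1 std:     {within_one:.3f}")
```

Even though individual draws are heavily skewed, the sample means cluster around 1 with spread close to $\sigma/\sqrt{n}$, and roughly 68% fall within one standard deviation, as a bell curve would predict.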

Why this matters for ML:

Linear regression inference (confidence intervals, p-values) assumes normally distributed residuals
Gaussian Naive Bayes assumes each feature is normally distributed within each class
Many statistical tests (t-test, ANOVA) assume normality
Z-score anomaly detection requires approximate normality

Distribution Comparison Table

Distribution	Type	When You See It	ML Implication
Normal	Continuous	Heights, measurement errors, CLT sums	Many algorithms assume it; check residuals
Binomial	Discrete	n coin flips, A/B test conversions	Classification with fixed n trials
Poisson	Discrete	Rare events: website hits/hr, defects/unit	Anomaly detection (event rate modeling)
Uniform	Continuous or discrete	Random number generators; fair dice rolls (discrete case)	Baseline; uniform random number generation
Exponential	Continuous	Time between events, hardware failures	Survival analysis, time-to-event modeling
Power law	Continuous	Word frequencies, wealth, social networks	Recommendation systems, long-tail problem

Binomial — n independent trials, probability p of success: $$P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}$$

Poisson — rare events, rate $\lambda$ per unit time: $$P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

ELI5 on Poisson: If a server gets 100 requests/hour on average, Poisson tells you the probability of getting 200 in one hour — that’s the anomaly detection hook. Requests deviating significantly from $\lambda$ signal something unusual.
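
The server example can be worked through numerically. This is a sketch that assumes request counts really are Poisson with $\lambda = 100$; the helper names and the 130-request comparison point are illustrative. The PMF is evaluated in log space because $\lambda^k$ and $k!$ overflow floating point for large $k$.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for Poisson(lam), computed in log space to avoid overflow."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_tail(k, lam, terms=2000):
    """Approximate P(X >= k) by summing the PMF over k .. k + terms - 1."""
    return sum(poisson_pmf(i, lam) for i in range(k, k + terms))

lam = 100  # average requests per hour, from the example above

p_130 = poisson_tail(130, lam)  # a busy hour
p_200 = poisson_tail(200, lam)  # double the usual rate

print(f"P(X >= 130) = {p_130:.2e}")
print(f"P(X >= 200) = {p_200:.2e}")
```

Under this model an hour with 130 requests is unusual but plausible, while 200 requests is so improbable that it is better explained by something having changed (an attack, a bug, a traffic spike) than by chance.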

Power law / long-tail distributions are critical for recommendation systems: roughly 20% of products generate 80% of sales (Pareto principle). A model trained mostly on popular items tends to perform poorly on the long tail.

Correlation & Relationships

Pearson Correlation Coefficient

$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}$$

Range: [-1, +1]

$r = +1$: perfect positive linear relationship
$r = -1$: perfect negative linear relationship
$r = 0$: no linear relationship (may still have non-linear relationship)

Limitations of Pearson:

Only captures linear relationships
Sensitive to outliers (one outlier can change r dramatically)
Assumes both variables are continuous
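
The first two limitations are easy to demonstrate. The sketch below implements the formula above directly; the datasets are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation, following the formula above term by term."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # perfectly linear
parabola = [x * x for x in xs]     # perfectly dependent, but not linear

# Perfect negative relationship, then the same data with one extreme point added
down_x, down_y = [1, 2, 3, 4, 5], [5, 4, 3, 2, 1]
r_before = pearson_r(down_x, down_y)
r_after = pearson_r(down_x + [50], down_y + [50])

print(f"linear:   r = {pearson_r(xs, linear):.2f}")
print(f"parabola: r = {pearson_r(xs, parabola):.2f}")
print(f"one outlier flips r from {r_before:.2f} to {r_after:.2f}")
```

The parabola yields r = 0 even though y is fully determined by x, and a single added point turns a perfect negative correlation into a strongly positive one.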

Correlation ≠ causation: Ice cream sales and drowning deaths are highly correlated (both increase in summer). The confounder is hot weather. Never use correlation alone to establish causation.

Spearman Rank Correlation

Spearman's $\rho$ is the Pearson correlation computed on the ranks of the data rather than on the values themselves.

$$\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$$

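where $d_i$ is the difference between the two ranks of observation $i$. This shortcut formula is exact only when there are no tied ranks.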
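
A minimal sketch of the rank-difference formula, assuming data with no tied values (the helper `ranks` does not handle ties). A Pearson helper is included only for comparison; the data is invented.

```python
import math

def ranks(values):
    """Rank each value from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman's rho via 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def pearson_r(xs, ys):
    """Pearson correlation on the raw values, for comparison."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

xs = [1, 2, 3, 4, 5, 6, 7, 8]
cubed = [x ** 3 for x in xs]  # monotonic but clearly non-linear

print(f"Spearman: {spearman_rho(xs, cubed):.3f}")
print(f"Pearson:  {pearson_r(xs, cubed):.3f}")
```

Because the ranks agree perfectly, Spearman reports 1.0 for the cubic relationship, while Pearson is noticeably lower since the relationship is not a straight line.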

Pearson vs Spearman — Decision Guide

Situation	Use
Linear relationship expected	Pearson
Non-linear monotonic relationship	Spearman
Outliers present	Spearman (rank-based, robust)
Ordinal variables	Spearman
Normal distribution assumed	Pearson
Small sample with non-normality	Spearman

Covariance vs Correlation

Covariance measures joint variation but is unbounded — its magnitude depends on units
Correlation = standardized covariance, always in [-1, +1], unit-free

$$\text{Cov}(X,Y) = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{n}$$

$$r = \frac{\text{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y}$$
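
The unit dependence is easy to see by re-expressing one variable in different units. The sketch below uses invented height and weight values and the population (1/n) covariance defined above.

```python
import math

def covariance(xs, ys):
    """Population covariance (divides by n), matching the formula above."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

height_m = [1.50, 1.60, 1.70, 1.80, 1.90]   # hypothetical heights in meters
weight_kg = [52, 58, 69, 77, 90]            # hypothetical weights
height_cm = [h * 100 for h in height_m]     # same heights, different unit

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)
corr_m = correlation(height_m, weight_kg)
corr_cm = correlation(height_cm, weight_kg)

print(f"covariance:  {cov_m:.3f} (m) vs {cov_cm:.3f} (cm)")
print(f"correlation: {corr_m:.4f} (m) vs {corr_cm:.4f} (cm)")
```

Switching from meters to centimeters multiplies the covariance by 100 but leaves the correlation unchanged.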

Multicollinearity

What it is: Two or more features are highly correlated with each other.

ELI5: If two features tell the model the same thing, it gets confused about which one matters. Imagine predicting house price using both “square footage” and “number of rooms” — they’re measuring the same underlying thing (size). The model will arbitrarily split the credit between them, making coefficients unstable and uninterpretable.

Why it kills regression:

Coefficient estimates become unstable (high variance)
Standard errors inflate — hard to assess feature importance
Does NOT necessarily hurt predictions, but kills interpretability

Detection with VIF (Variance Inflation Factor):

$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the R² from regressing feature $j$ on all other features.

VIF = 1: no correlation
VIF = 1–5: acceptable
VIF > 5: concerning
VIF > 10: severe multicollinearity — remove or combine features
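
In general, computing VIF requires fitting a regression of each feature on all the others. The sketch below covers only the simplest case as an illustration: with exactly two predictors, $R_j^2$ equals the squared Pearson correlation between them, so the VIF follows directly. The feature values are invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

def vif_two_features(x1, x2):
    """VIF when the model has exactly two predictors (R_j^2 equals r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

sqft = [800, 950, 1100, 1300, 1500, 1800, 2100, 2500]
rooms = [2, 3, 3, 4, 4, 5, 6, 6]         # hypothetical, closely tracks sqft
age = [30, 5, 42, 12, 25, 8, 35, 18]     # hypothetical, unrelated to sqft

vif_size_pair = vif_two_features(sqft, rooms)
vif_unrelated_pair = vif_two_features(sqft, age)

print(f"sqft + rooms: VIF = {vif_size_pair:.1f}")
print(f"sqft + age:   VIF = {vif_unrelated_pair:.2f}")
```

The square footage and room count pair lands above the VIF > 10 threshold, while the unrelated pair stays close to 1.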

Exam tip: If a question mentions unstable regression coefficients, high p-values despite obvious relationships, or asks how to detect multicollinearity — the answer is VIF.

Data Visualization for ML

Which Visualization for Which Question

Question	Visualization
What is the distribution of one variable?	Histogram
How spread out is data, any outliers?	Box plot
Is there a relationship between two variables?	Scatter plot
How do many variables correlate with each other?	Heatmap (correlation matrix)
What are all pairwise relationships?	Pair plot
Is there a trend over time?	Time series plot
Compare categories (counts)	Bar chart
Compare distributions across groups	Grouped box plots

Histogram

Shows the distribution shape by dividing values into bins and counting occurrences.

Bin selection matters:

Too few bins: lose detail, distributions look flat
Too many bins: noise looks like signal
Sturges’ rule: $k = 1 + \log_2 n$ bins
Scott’s rule: bin width = $3.5\sigma / n^{1/3}$
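
Both rules are one-liners. In this sketch, rounding Sturges' result up to a whole number of bins is an assumption (the formula above does not say how to round), and `pstdev` is used for $\sigma$. The sample data is invented.

```python
import math
from statistics import pstdev

def sturges_bins(n):
    """Sturges' rule: 1 + log2(n), rounded up to a whole number of bins."""
    return math.ceil(1 + math.log2(n))

def scott_bin_width(values):
    """Scott's rule: 3.5 * sigma / n^(1/3)."""
    return 3.5 * pstdev(values) / len(values) ** (1 / 3)

values = list(range(1000))  # hypothetical data: 1,000 evenly spread values

bins = sturges_bins(len(values))
width = scott_bin_width(values)

print(f"Sturges: {bins} bins")
print(f"Scott:   bin width of about {width:.1f}")
```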

Box Plot (Box-and-Whisker)

       Outlier
          *
          |
  ┌───────┤  ← Upper fence: Q3 + 1.5×IQR
  │       │
  │  Q3   ┤─────────────────── 75th percentile
  │───────│
  │  Q2   │─────────────────── Median (50th pct)
  │───────│
  │  Q1   ┤─────────────────── 25th percentile
  │       │
  └───────┤  ← Lower fence: Q1 - 1.5×IQR
          |
          *
       Outlier

 ↑─────↑
  IQR box

Box plots are excellent for comparing distributions across groups — e.g., salary by department, accuracy by model type.

Heatmap

Visualizes the correlation matrix: each cell shows the Pearson r between two features, colored on a diverging scale from -1 to +1 (for example, blue through white to red).

Diagonal is always 1.0 (feature correlated with itself)
Look for strongly colored off-diagonal cells (|r| close to 1): these indicate highly correlated feature pairs and possible multicollinearity
Look for features strongly correlated with the target: potentially useful predictors

Pair Plot

Plots every feature against every other feature in a grid. The diagonal shows each feature’s distribution (histogram or KDE).

Scatter in upper/lower triangle: relationship between feature pairs
Slow to render and hard to read with many features (p features produce p² panels): use on selected features only

Time Series Visualization

Components to look for:

Trend: long-term increase or decrease
Seasonality: regular periodic patterns (weekly, monthly, yearly)
Cycles: irregular fluctuations (business cycles)
Stationarity: does mean/variance change over time?

Exam tip: Non-stationary time series must be made stationary (via differencing or detrending) before applying many time series models.
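
Differencing is simple enough to sketch by hand. The example below uses two invented series: one with a pure linear trend and one with a pure repeating pattern of period 4. The function name `difference` and its `lag` parameter are illustrative.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

trend = [10 + 3 * t for t in range(10)]   # linear upward trend
seasonal = [5, 9, 2, 7] * 3               # pattern repeating every 4 steps

detrended = difference(trend)                 # first difference removes the trend
deseasonalized = difference(seasonal, lag=4)  # seasonal difference removes the cycle

print(detrended)
print(deseasonalized)
```

After differencing, the trending series becomes a constant and the seasonal series becomes all zeros, so the mean no longer changes over time. Real series usually need a combination of both and leave noise behind.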

Outlier Detection

What is an Outlier? (First Principles)

An outlier is a data point that is far from the rest of the distribution. Not all outliers are errors — in fraud detection, outliers ARE the signal. Context determines whether to remove or keep.

ELI5: In a dataset of human heights, a value of 300cm is clearly an error — remove it. In a dataset of financial transactions, a $50,000 transaction from an account that normally does $50 transactions is an outlier worth keeping and investigating. The question is always: “Is this outlier signal or noise?”

IQR Method

$$\text{Lower fence} = Q1 - 1.5 \times IQR$$ $$\text{Upper fence} = Q3 + 1.5 \times IQR$$

Points outside these fences are flagged as outliers.

Robust: not affected by the outliers themselves (uses Q1/Q3, not mean)
Assumption-free: does not assume normality
Limitation: may flag too many points in heavy-tailed distributions
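
A minimal sketch of the fence calculation, using `statistics.quantiles` with the "inclusive" convention (other quantile conventions shift the fences slightly). The data and the function name `iqr_outliers` are invented for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return (outliers, (lower_fence, upper_fence)) using the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [x for x in values if x < lower or x > upper]
    return flagged, (lower, upper)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical values with one extreme

flagged, fences = iqr_outliers(data)
print(f"fences:   {fences}")
print(f"outliers: {flagged}")
```

The extreme value barely influences Q1 and Q3, so the fences stay tight and the value is flagged.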

Z-Score Method

$$z = \frac{x - \mu}{\sigma}$$

Flag as outlier if $|z| > 3$ (outside 3 standard deviations).

Assumes normality — only valid when distribution is approximately normal
Not robust: mean and std are themselves pulled by outliers
Simple and fast for normally distributed data
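
The "not robust" point deserves a demonstration. In the sketch below (invented data, population standard deviation), a value of 100 among the digits 1 to 9 inflates the standard deviation so much that its own z-score stays just under 3. With only ten points a population z-score can never exceed 3 in magnitude, so the rule needs a reasonably large sample to be useful.

```python
from statistics import fmean, pstdev

def z_scores(values):
    """Z-score of every value, using the population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

def z_outliers(values, threshold=3.0):
    """Values whose |z| exceeds the threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]

small = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # extreme value masks itself
large = list(range(1, 51)) + [500]          # enough data for the rule to work

print(f"max |z| in small sample: {max(abs(z) for z in z_scores(small)):.2f}")
print(f"flagged in small sample: {z_outliers(small)}")
print(f"flagged in large sample: {z_outliers(large)}")
```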

Isolation Forest

Algorithm-based anomaly detection:

Randomly select a feature and a random split point
Repeat recursively — outliers require fewer splits to isolate
Anomaly score is derived from the average path length across all trees: the shorter the average path, the more anomalous the point

ELI5: Imagine finding someone in a crowd by asking yes/no questions. “Are you taller than 6ft? Are you wearing a hat?” Finding an unusual person takes fewer questions than finding a typical person. Isolation Forest does exactly this — unusual data points are “isolated” faster.

Does NOT assume a specific distribution
Works well in high dimensions
SageMaker's built-in counterpart is Random Cut Forest (RCF), a related tree-based anomaly detection algorithm (not Isolation Forest itself)
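
The isolation idea can be sketched in pure Python. This is a deliberately simplified illustration, not the published algorithm or any library's implementation: it follows a single point down lazily generated random splits, skips the path-length normalization used to produce a final anomaly score, and stops early if the chosen feature has no spread. All names, the data, and the parameters (`n_trees`, `sample_size`, `max_depth`) are assumptions.

```python
import random

def isolation_depth(point, rows, rng, depth=0, max_depth=12):
    """Count random splits until the partition containing `point` holds at most one row."""
    if len(rows) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:
        return depth  # simplification: stop instead of trying another feature
    split = rng.uniform(lo, hi)
    # Keep only the rows on the same side of the split as the point
    if point[feature] < split:
        kept = [r for r in rows if r[feature] < split]
    else:
        kept = [r for r in rows if r[feature] >= split]
    return isolation_depth(point, kept, rng, depth + 1, max_depth)

def average_depth(point, rows, n_trees=200, sample_size=64, seed=0):
    """Average isolation depth of `point` over many random subsamples."""
    rng = random.Random(seed)
    depths = []
    for _ in range(n_trees):
        sample = rng.sample(rows, min(sample_size, len(rows)))
        depths.append(isolation_depth(point, sample, rng))
    return sum(depths) / n_trees

# Hypothetical data: a 2-D Gaussian cluster plus one far-away point
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
rows = cluster + [outlier]

outlier_depth = average_depth(outlier, rows)
typical_depth = average_depth((0.0, 0.0), rows)

print(f"average depth, outlier:       {outlier_depth:.1f}")
print(f"average depth, typical point: {typical_depth:.1f}")
```

The far-away point is separated after noticeably fewer splits than a point in the middle of the cluster, which is the signal a full implementation converts into an anomaly score.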

When to Remove vs When to Keep Outliers

Scenario	Decision	Reason
House price prediction	Remove extreme values	Data entry errors, hurt regression
Fraud detection	Keep (they’re the target)	Outliers ARE the signal
Sensor data with known equipment failures	Remove or flag	Known noise source
Network intrusion detection	Keep	Attacks are “outliers”
Image pixel values outside 0-255	Remove	Physically impossible
Income outliers in a wealth study	Keep	Billionaires are real

Missing Data

Types of Missingness

ELI5: Think of missing data as cards falling out of a deck. MCAR is cards falling randomly — any card could fall out. MAR is aces falling out more often because they’re on top — the missingness relates to other variables you can see. MNAR is people hiding their cards on purpose — the missingness is directly related to the value that’s missing (people with low income refuse to report their income).

Type	Full Name	Definition	Example	Implication
MCAR	Missing Completely At Random	Probability of missing is unrelated to any variable	Random sensor dropout	Safest — can use deletion
MAR	Missing At Random	Probability of missing depends on observed variables	Income missing more often for younger people	Imputation using other features works
MNAR	Missing Not At Random	Probability of missing depends on the missing value itself	Sick patients skip health surveys	Hardest — imputation may introduce bias

Strategies for Handling Missing Data

1. Deletion:

Listwise deletion: remove entire row if any value is missing
- OK when MCAR and < 5% missing
- Dangerous when many rows have at least one missing value
Pairwise deletion: use available data per analysis
- Only for statistical analyses, not ML pipelines

2. Imputation:

Method	When to Use	Limitation
Mean/Median	MCAR, simple baseline	Reduces variance, distorts distributions
Mode	Categorical features	Same as above
KNN imputation	MAR, features are correlated	Slow on large datasets
MICE (Multiple Imputation by Chained Equations)	MAR, best general method	Computationally expensive
Regression imputation	MAR, strong predictors exist	Can introduce circularity
Forward/backward fill	Time series	Only valid for time-ordered data

3. Indicator variable: Add a binary feature_was_missing column alongside the imputed value. This lets the model learn that the pattern of missingness itself is informative (important for MNAR data).
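
A minimal sketch of imputation combined with an indicator column, using plain Python lists with `None` marking missing entries. The income values and the function name are invented. In a real pipeline the fill value would be computed on the training split only and reused for validation and production data.

```python
from statistics import median

def impute_with_indicator(values):
    """Fill missing entries (None) with the observed median and flag where they were."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing

income = [42_000, None, 58_000, 61_000, None, 39_000]  # hypothetical column

income_imputed, income_was_missing = impute_with_indicator(income)
print(income_imputed)
print(income_was_missing)
```

The model then sees both a complete numeric column and a binary column recording which rows were originally missing.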

Missing Data Decision Tree

Is missingness < 5%?
  Yes → Listwise deletion is acceptable
  No  → Continue

Is it MCAR?
  Yes → Mean/median imputation or deletion
  No  → Continue

Is it MAR (missingness relates to observed variables)?
  Yes → KNN or MICE imputation
  No (MNAR) → Add indicator variable + impute,
               consider domain-specific strategy
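
The decision tree above can be written as a small function. This is only a restatement of the rules of thumb in this section; the function name, the string labels for the mechanisms, and the returned descriptions are hypothetical.

```python
def choose_missing_data_strategy(missing_fraction, mechanism):
    """Map the share of missing values and the assumed mechanism to a strategy.

    mechanism is one of "MCAR", "MAR", or "MNAR" (labels chosen for this sketch).
    """
    if missing_fraction < 0.05:
        return "listwise deletion"
    if mechanism == "MCAR":
        return "mean/median imputation or deletion"
    if mechanism == "MAR":
        return "KNN or MICE imputation"
    return "indicator variable + imputation, plus a domain-specific strategy"

for fraction, mechanism in [(0.02, "MAR"), (0.20, "MCAR"), (0.20, "MAR"), (0.20, "MNAR")]:
    print(f"{fraction:.0%} missing, {mechanism}: {choose_missing_data_strategy(fraction, mechanism)}")
```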

Why the strategy matters: Using mean imputation on MNAR data is actively harmful — you’re pretending the “missing” people are average, when they’re systematically different. A model trained this way will fail on production data where the missingness pattern continues.

Amazon QuickSight

What it is: Serverless, fully managed business intelligence (BI) and visualization service.

Data Sources → QuickSight → Dashboards & Reports
  S3                         Interactive visualizations
  Athena                     Scheduled email reports
  Redshift                   Embedded analytics
  RDS / Aurora               Mobile-accessible

SPICE Engine

Super-fast Parallel In-memory Calculation Engine

In-memory columnar storage, not querying source data each time
Enables sub-second response times even on large datasets
Data is imported into SPICE once and refreshed on schedule
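Per-dataset row limits apply and depend on the QuickSight edition (500 million rows is a commonly cited Enterprise figure; check current quotas)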

Exam tip: SPICE is QuickSight’s secret weapon for speed. If a question asks about fast, interactive BI dashboards for non-technical users, QuickSight + SPICE is the answer.

ML Insights in QuickSight

Feature	What It Does
Anomaly Detection	Uses Random Cut Forest to find anomalies in metrics
Forecasting	Built-in time series forecasting (no ML knowledge needed)
Auto-narratives	Natural language summaries of chart insights
Suggested Insights	ML-recommended analyses for your dataset

When to Use QuickSight vs Custom Visualization

Use QuickSight when	Use custom (e.g., SageMaker Studio plots) when
Business users need dashboards	Data scientists doing deep EDA
No code / low code required	Need full matplotlib/seaborn control
Sharing across organization	Ad-hoc research plots
Need scheduled reports	Integrating visualizations into code pipeline
Embedded analytics in apps	Complex custom chart types

Amazon SageMaker Data Wrangler

What it is: Visual, no-code/low-code data preparation and EDA tool within SageMaker Studio.

Data Sources → Data Wrangler → Transforms → Export
  S3                            300+ built-in    SageMaker Pipelines
  Athena                        Custom Python    Feature Store
  Redshift                      Data Quality     Training Job
  Lake Formation                Check

Key capabilities:

Visual data flow interface (drag-and-drop)
300+ built-in transformations (imputation, encoding, scaling, etc.)
Data quality reports: missing values, outliers, distributions
Quick model: train a baseline model in minutes to validate feature usefulness
Bias detection (built-in Clarify integration)
Auto-generates pipeline code for export

Data Wrangler vs Glue DataBrew

Feature	Data Wrangler	Glue DataBrew
Primary use	ML data prep	General data transformation
Audience	Data scientists	Data engineers / analysts
Integration	SageMaker ecosystem	AWS Glue + broader AWS
ML-specific features	Bias, quick model	No
Code export	SageMaker Pipelines	Glue jobs
Cost model	Studio compute	Per-node processing

Exam tip: Data Wrangler = ML-focused visual prep. Glue DataBrew = general data transformation for broader data engineering use cases. If the question mentions “data scientists” and “ML pipeline,” choose Data Wrangler.

Amazon Athena for EDA

What it is: Serverless, interactive SQL query engine for data in S3.

S3 Data (CSV, Parquet, JSON, ORC)
         ↓
     AWS Glue Data Catalog (schema)
         ↓
       Athena (SQL queries)
         ↓
  Results → S3 | QuickSight | SageMaker

Why Athena is great for initial EDA:

No infrastructure to provision — query immediately
Pay per query ($5 per TB scanned — Parquet reduces cost dramatically)
Standard SQL — accessible to anyone who knows SQL
Query data in place — no ETL needed to start exploring

Common EDA queries with Athena:

-- Distribution of a column
SELECT income_bucket, COUNT(*) as cnt
FROM customers
GROUP BY income_bucket ORDER BY income_bucket;

-- Null count check
SELECT
  COUNT(*) as total_rows,
  SUM(CASE WHEN age IS NULL THEN 1 ELSE 0 END) as missing_age,
  SUM(CASE WHEN salary IS NULL THEN 1 ELSE 0 END) as missing_salary
FROM customers;

-- Percentile analysis
SELECT
  approx_percentile(salary, 0.25) as q1,
  approx_percentile(salary, 0.50) as median,
  approx_percentile(salary, 0.75) as q3
FROM customers;

Exam tip: Athena + Parquet + Glue Catalog is the canonical “explore S3 data without infrastructure” pattern. If a question asks how to do ad-hoc SQL exploration on data sitting in S3 cheaply, this is the answer.

Quick Reference: EDA Toolkit

Task	Tool / Method
Initial data exploration in SQL	Amazon Athena
Visual no-code data prep + ML	SageMaker Data Wrangler
Business dashboards	Amazon QuickSight
Distribution shape	Histogram, box plot
Outlier detection (rule-based)	IQR method, Z-score
Outlier detection (ML-based)	Isolation Forest, Random Cut Forest
Correlation between features	Pearson r, heatmap
Monotonic non-linear correlation	Spearman rank
Multicollinearity check	VIF
All pairwise relationships	Pair plot
Missing data best practice	MICE or KNN imputation + indicator variable
Time series patterns	Decomposition plot (trend + seasonality + residual)
