Domain 2A: EDA, Statistics & Visualization
Exam Domain: 2, Exploratory Data Analysis (24%)
Task: Understand your data before building models
Why EDA Matters (First Principles)
ELI5: Imagine you’re a chef who just received a mystery box of ingredients. Would you start cooking immediately, or would you first smell, taste, and identify what you have? EDA is that “what do I have?” step. Skipping it is the cardinal sin of machine learning — you might spend weeks training a model on corrupted data, or discover too late that your target variable has 90% missing values.
EDA answers the fundamental questions before you commit to any modeling decisions:
- Distribution — Is my data normally distributed? Skewed? Bimodal?
- Relationships — Which features correlate with the target? With each other?
- Quality — Are there nulls, duplicates, impossible values, encoding errors?
- Outliers — Are there extreme values that will distort my model?
- Patterns — Are there trends, seasonality, clusters?
Why this matters for the exam: Exam questions often describe a scenario where a model performs poorly and ask you to diagnose the problem. The answer is almost always rooted in EDA findings: skewed features that weren’t transformed, outliers that weren’t removed, class imbalance that wasn’t addressed. Know your EDA tools and when to apply them.
Descriptive Statistics
Measures of Central Tendency
| Measure | Formula | Best When |
|---|---|---|
| Mean | $\mu = \frac{1}{n}\sum x_i$ | Symmetric distributions, no extreme outliers |
| Median | Middle value when sorted | Skewed data, outliers present |
| Mode | Most frequent value | Categorical data, finding typical value |
When mean lies — the skewed data problem:
Symmetric distribution: mean ≈ median, both at the center of the peak.
Right-skewed distribution: mean > median, because the outliers in the long right tail pull the mean to the right.
ELI5: Nine workers earn $30,000/year, one CEO earns $1,000,000/year. Mean salary = $127,000 — nobody actually earns that. Median salary = $30,000 — that’s the truth. Any time you see income data, house prices, or anything with a long tail, trust the median over the mean.
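The salary example can be checked in a few lines. This is a minimal sketch using only the numbers quoted above; the variable names are illustrative.

```python
from statistics import mean, median

# Nine workers at $30,000 and one CEO at $1,000,000 (figures from the example above).
salaries = [30_000] * 9 + [1_000_000]

mean_salary = mean(salaries)      # pulled upward by the single extreme value
median_salary = median(salaries)  # middle of the sorted values, unaffected by the CEO

print(f"mean   = {mean_salary:,.0f}")
print(f"median = {median_salary:,.0f}")
```

One extreme value moves the mean by almost $100,000, while the median does not move at all.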
Measures of Spread
Variance and Standard Deviation:
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$$
ELI5: Standard deviation tells you the “average distance from average” (strictly, the root-mean-square distance). If test scores have std=5 and are roughly bell-shaped, about two-thirds of students scored within 5 points of the mean. If std=25, scores are wildly spread out. It answers: “How much should I trust the mean as a representative value?”
| Measure | Formula | Robust to Outliers? |
|---|---|---|
| Range | max - min | No (one outlier destroys it) |
| Variance | $\sigma^2$ | No |
| Std Deviation | $\sigma$ | No |
| IQR | Q3 - Q1 | Yes |
| MAD | $\text{median}(\lvert x - \text{median}(x) \rvert)$ | Yes |
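To see the "Robust to Outliers?" column in action, the sketch below compares range, standard deviation, and MAD on a small made-up sample with and without one extreme value. The data and the helper name `mad` are assumptions for illustration only.

```python
from statistics import median, pstdev

def mad(values):
    """Median absolute deviation: median(|x - median(x)|)."""
    center = median(values)
    return median(abs(v - center) for v in values)

clean = [10, 12, 13, 14, 15, 16, 18]   # hypothetical measurements
with_outlier = clean + [200]           # same data plus one extreme value

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{name:>12}: range={max(data) - min(data)}, "
          f"std={pstdev(data):.1f}, MAD={mad(data)}")
```

Range and standard deviation blow up when the outlier is added, while MAD stays at 2.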
Percentiles and Quartiles
Min → Q1 (25th percentile) → Q2 (median) → Q3 (75th percentile) → Max
IQR = Q3 - Q1 (the "middle 50%" of data)
- Q1 (25th percentile): 25% of data falls below this value
- Q2 (50th percentile): the median
- Q3 (75th percentile): 75% of data falls below this value
- IQR: measures spread of the central 50%, immune to extreme values
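A short sketch of computing quartiles and the IQR with the standard library. Quartile conventions differ slightly between tools; this uses the `inclusive` method of `statistics.quantiles`, and the data is made up.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
data_with_extreme = [1, 2, 3, 4, 5, 6, 7, 8, 1000]  # last value replaced by an extreme

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

q1_x, _, q3_x = quantiles(data_with_extreme, n=4, method="inclusive")
iqr_extreme = q3_x - q1_x

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"IQR with the extreme value: {iqr_extreme}")
```

The IQR is the same for both lists, because the extreme value sits outside the middle 50%.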
Skewness
| Shape | Mean vs median | Examples |
|---|---|---|
| Left-skewed (negative) | mean < median | Scores that pile up near the maximum, exam grades on an easy test |
| Symmetric | mean = median | Height, IQ |
| Right-skewed (positive) | mean > median | Income, house prices, website traffic |
Skewness formula interpretation:
- Skewness = 0: symmetric
- Skewness > 0: right tail is longer (positive skew)
- Skewness < 0: left tail is longer (negative skew)
- |Skewness| > 1: substantial skew; consider a transform (a log transform is the usual choice for right-skewed, strictly positive data)
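The interpretation rules above can be tried on a toy example. This sketch computes the moment-based (population) skewness by hand and shows the effect of a log transform; the values are hypothetical and chosen so that each one doubles the previous.

```python
import math

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return sum(((v - mu) / sigma) ** 3 for v in values) / n

# Hypothetical right-skewed feature: each value doubles the previous one.
raw = [1, 2, 4, 8, 16, 32, 64, 128, 256]
logged = [math.log(v) for v in raw]

raw_skew = skewness(raw)
log_skew = skewness(logged)

print(f"skewness before log transform: {raw_skew:.2f}")
print(f"skewness after log transform:  {log_skew:.2f}")
```

The raw values have skewness above 1. After the log transform they are evenly spaced, so the skewness is essentially zero.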
Kurtosis
Kurtosis measures “tail heaviness” — how extreme the outliers tend to be.
| Type | Kurtosis | Shape | Practical Meaning |
|---|---|---|---|
| Leptokurtic | > 3 | Tall peak, fat tails | More outliers than normal (financial returns) |
| Mesokurtic | = 3 | Normal distribution | Reference baseline |
| Platykurtic | < 3 | Flat peak, thin tails | Fewer extreme outliers (uniform-like) |
Exam tip: High kurtosis = fat tails = more outliers. This matters when you’re assuming normality for statistical tests or when building anomaly detectors.
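A hand-rolled sketch of the kurtosis values in the table, using the convention where a normal distribution scores 3. Libraries such as pandas and SciPy report excess kurtosis (this value minus 3) by default, so compare against 0 there. The two samples are made up.

```python
def kurtosis(values):
    """Pearson kurtosis (fourth standardized moment); a normal distribution gives 3."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    fourth = sum((v - mu) ** 4 for v in values) / n
    return fourth / var ** 2

flat = list(range(1, 101))        # evenly spread, uniform-like sample
heavy = [0] * 98 + [-50, 50]      # mostly identical values plus two extremes

flat_kurt = kurtosis(flat)
heavy_kurt = kurtosis(heavy)

print(f"uniform-like sample: {flat_kurt:.2f} (platykurtic, below 3)")
print(f"heavy-tailed sample: {heavy_kurt:.2f} (leptokurtic, above 3)")
```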
Probability Distributions
Normal (Gaussian) Distribution
The 68-95-99.7 Rule (Empirical Rule):
- 68% of data lies within 1σ of the mean (μ ± 1σ)
- 95% of data lies within 2σ of the mean (μ ± 2σ)
- 99.7% of data lies within 3σ of the mean (μ ± 3σ)
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
ELI5: The normal distribution appears everywhere in nature because of the Central Limit Theorem: add up enough independent random things (each with finite variance), and the sum tends toward a bell curve, whatever the original distribution was. Heights are the sum of thousands of genetic and nutritional factors. Test scores are the sum of thousands of knowledge gaps and preparation factors. This is why the normal curve is the “default assumption” in statistics.
Central Limit Theorem (CLT): For independent, identically distributed samples with finite variance, the sampling distribution of the mean approaches a normal distribution as sample size $n \to \infty$, regardless of the shape of the underlying distribution. This is one reason methods that assume normality often still work reasonably well in practice on large samples.
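The CLT is easy to see in a simulation. This sketch draws sample means from a strongly skewed exponential distribution and checks them against what the theorem predicts. The sample size, repetition count, and seed are arbitrary choices for illustration.

```python
import random
from statistics import pstdev

random.seed(0)  # fixed seed so the sketch is reproducible

def sample_mean(n):
    """Mean of n draws from an exponential distribution (rate 1, so mean 1 and std 1)."""
    return sum(random.expovariate(1.0) for _ in range(n)) / n

n = 50
means = [sample_mean(n) for _ in range(2000)]

center = sum(means) / len(means)
spread = pstdev(means)
within_one = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"mean of sample means: {center:.3f} (CLT predicts about 1.0)")
print(f"std of sample means:  {spread:.3f} (CLT predicts about {1 / n ** 0.5:.3f})")
print(f"share within 1 std:   {within_one:.1%} (normal curve: about 68%)")
```

The individual draws are far from normal, yet the sample means cluster around 1 with a spread close to $\sigma/\sqrt{n}$.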
Why this matters for ML:
- Linear regression assumes normally distributed residuals
- Gaussian Naive Bayes assumes each feature is normally distributed within each class
- Many statistical tests (t-test, ANOVA) assume normality
- Z-score anomaly detection requires approximate normality
Distribution Comparison Table
| Distribution | Type | When You See It | ML Implication |
|---|---|---|---|
| Normal | Continuous | Heights, measurement errors, CLT sums | Many algorithms assume it; check residuals |
| Binomial | Discrete | n coin flips, A/B test conversions | Classification with fixed n trials |
| Poisson | Discrete | Rare events: website hits/hr, defects/unit | Anomaly detection (event rate modeling) |
| Uniform | Continuous or discrete | Random sampling, dice rolls (discrete case) | Baseline; uniform random number generation |
| Exponential | Continuous | Time between events, hardware failures | Survival analysis, time-to-event modeling |
| Power law | Continuous | Word frequencies, wealth, social networks | Recommendation systems, long-tail problem |
Binomial — n independent trials, probability p of success: $$P(X=k) = \binom{n}{k}p^k(1-p)^{n-k}$$
Poisson — rare events, rate $\lambda$ per unit time: $$P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
ELI5 on Poisson: If a server gets 100 requests/hour on average, Poisson tells you the probability of getting 200 in one hour — that’s the anomaly detection hook. Requests deviating significantly from $\lambda$ signal something unusual.
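The server example can be worked through with the Poisson formula above. This is a sketch under the assumption that requests really are Poisson with $\lambda = 100$; the helper names and the 130-request comparison point are illustrative.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson(lam) variable, computed in log space to avoid overflow."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_tail(k, lam, extra_terms=1000):
    """P(X >= k), summing the upper tail directly so tiny values are not lost to rounding."""
    return sum(poisson_pmf(i, lam) for i in range(k, k + extra_terms))

lam = 100  # average requests per hour, from the example above

p_130 = poisson_tail(130, lam)
p_200 = poisson_tail(200, lam)

print(f"P(at least 130 requests in an hour) = {p_130:.4f}")
print(f"P(at least 200 requests in an hour) = {p_200:.2e}")
```

An hour with 130 requests is unusual but plausible. An hour with 200 is so unlikely under this model that it is better explained by something having changed.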
Power law / long-tail distributions are critical for recommendation systems: a small share of products often generates most of the sales (the Pareto principle's rough “20% of products, 80% of sales”). A model trained only on popular items tends to perform poorly on the long tail.
Correlation & Relationships
Pearson Correlation Coefficient
$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}$$
Range: [-1, +1]
- $r = +1$: perfect positive linear relationship
- $r = -1$: perfect negative linear relationship
- $r = 0$: no linear relationship (may still have non-linear relationship)
Limitations of Pearson:
- Only captures linear relationships
- Sensitive to outliers (one outlier can change r dramatically)
- Assumes both variables are continuous
Correlation ≠ causation: Ice cream sales and drowning deaths are highly correlated (both increase in summer). The confounder is hot weather. Never use correlation alone to establish causation.
Spearman Rank Correlation
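Spearman's $\rho$ is the Pearson correlation computed on the ranks of the data rather than on the values themselves. When there are no tied ranks it simplifies to:
$$\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$$
where $d_i$ is the difference between the two ranks of observation $i$ and $n$ is the number of observations.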
Pearson vs Spearman — Decision Guide
| Situation | Use |
|---|---|
| Linear relationship expected | Pearson |
| Non-linear monotonic relationship | Spearman |
| Outliers present | Spearman (rank-based, robust) |
| Ordinal variables | Spearman |
| Normal distribution assumed | Pearson |
| Small sample with non-normality | Spearman |
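The "non-linear monotonic" row of the guide can be illustrated with a small sketch. Both coefficients are implemented by hand here. The ranking helper assumes no ties to stay short, and the data is synthetic.

```python
import math

def pearson(x, y):
    """Pearson r, following the formula given earlier in this section."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def ranks(values):
    """Ranks from 1 to n; assumes no tied values to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman rho: Pearson r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

x = list(range(1, 11))
y = [math.exp(v) for v in x]  # always increasing, but far from a straight line

r = pearson(x, y)
rho = spearman(x, y)
print(f"Pearson r    = {r:.2f}")
print(f"Spearman rho = {rho:.2f}")
```

Spearman reports a perfect monotonic relationship, while Pearson understates it because the relationship is not linear.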
Covariance vs Correlation
- Covariance measures joint variation but is unbounded — its magnitude depends on units
- Correlation = standardized covariance, always in [-1, +1], unit-free
$$\text{Cov}(X,Y) = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{n}$$
$$r = \frac{\text{Cov}(X,Y)}{\sigma_X \cdot \sigma_Y}$$
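The unit dependence of covariance is easiest to see by changing units. The sketch below uses made-up height and weight values and the population formulas above.

```python
import math

def covariance(x, y):
    """Population covariance (divides by n, matching the formula above)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

height_m = [1.60, 1.70, 1.75, 1.80, 1.90]   # hypothetical heights in metres
weight_kg = [55, 65, 72, 80, 90]            # hypothetical weights
height_cm = [h * 100 for h in height_m]     # same heights, different unit

cov_m, cov_cm = covariance(height_m, weight_kg), covariance(height_cm, weight_kg)
corr_m, corr_cm = correlation(height_m, weight_kg), correlation(height_cm, weight_kg)

print(f"covariance:  {cov_m:.2f} (metres) vs {cov_cm:.2f} (centimetres)")
print(f"correlation: {corr_m:.4f} (metres) vs {corr_cm:.4f} (centimetres)")
```

Switching from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged.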
Multicollinearity
What it is: Two or more features are highly correlated with each other.
ELI5: If two features tell the model the same thing, it gets confused about which one matters. Imagine predicting house price using both “square footage” and “number of rooms” — they’re measuring the same underlying thing (size). The model will arbitrarily split the credit between them, making coefficients unstable and uninterpretable.
Why it kills regression:
- Coefficient estimates become unstable (high variance)
- Standard errors inflate — hard to assess feature importance
- Does NOT necessarily hurt predictions, but kills interpretability
Detection with VIF (Variance Inflation Factor):
$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the R² from regressing feature $j$ on all other features.
- VIF = 1: no correlation
- VIF = 1–5: acceptable
- VIF > 5: concerning
- VIF > 10: severe multicollinearity — remove or combine features
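A minimal VIF sketch for the special case of exactly two features, where $R_j^2$ is simply the squared Pearson correlation between them. With more features, $R_j^2$ comes from a full regression of feature $j$ on the others, which this sketch does not implement. The housing numbers are hypothetical.

```python
import math

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def vif_two_features(x1, x2):
    """VIF = 1 / (1 - R^2); with only two features, R^2 equals r(x1, x2)^2."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)

sqft = [800, 950, 1100, 1400, 1600, 2000, 2400, 3000]  # hypothetical houses
rooms = [2, 3, 3, 4, 4, 5, 6, 7]                        # rises with square footage
age = [30, 5, 22, 12, 40, 8, 25, 15]                    # roughly unrelated to size

vif_rooms = vif_two_features(sqft, rooms)
vif_age = vif_two_features(sqft, age)

print(f"VIF for sqft paired with rooms: {vif_rooms:.1f}")
print(f"VIF for sqft paired with age:   {vif_age:.2f}")
```

The sqft/rooms pair lands far above the VIF > 10 threshold, while sqft/age stays close to 1.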
Exam tip: If a question mentions unstable regression coefficients, high p-values despite obvious relationships, or asks how to detect multicollinearity — the answer is VIF.
Data Visualization for ML
Which Visualization for Which Question
| Question | Visualization |
|---|---|
| What is the distribution of one variable? | Histogram |
| How spread out is data, any outliers? | Box plot |
| Is there a relationship between two variables? | Scatter plot |
| How do many variables correlate with each other? | Heatmap (correlation matrix) |
| What are all pairwise relationships? | Pair plot |
| Is there a trend over time? | Time series plot |
| Compare categories (counts) | Bar chart |
| Compare distributions across groups | Grouped box plots |
Histogram
Shows the distribution shape by dividing values into bins and counting occurrences.
Bin selection matters:
- Too few bins: lose detail, distributions look flat
- Too many bins: noise looks like signal
- Sturges’ rule: $k = 1 + \log_2 n$ bins
- Scott’s rule: bin width = $3.5\sigma / n^{1/3}$
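Both rules are one-liners. In this sketch, rounding Sturges' result up to a whole number of bins is an assumption (the rule itself does not say how to round), and σ is taken as the population standard deviation to match the formulas earlier in this section.

```python
import math
from statistics import pstdev

def sturges_bins(n):
    """Sturges' rule: k = 1 + log2(n), rounded up to a whole number of bins."""
    return math.ceil(1 + math.log2(n))

def scott_bin_width(values):
    """Scott's rule: bin width = 3.5 * sigma / n^(1/3)."""
    return 3.5 * pstdev(values) / len(values) ** (1 / 3)

values = list(range(1000))  # hypothetical feature with 1,000 evenly spread values

k = sturges_bins(len(values))
width = scott_bin_width(values)

print(f"Sturges: {k} bins")
print(f"Scott:   bin width of about {width:.0f}")
```

Sturges' rule depends only on the sample size, while Scott's rule also adapts to how spread out the data is.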
Box Plot (Box-and-Whisker)
Reading a box plot from top to bottom:
- Outliers: individual points plotted above the upper fence
- Upper fence: Q3 + 1.5×IQR
- Q3: 75th percentile (top of the box)
- Q2: median, 50th percentile (line inside the box)
- Q1: 25th percentile (bottom of the box)
- Lower fence: Q1 - 1.5×IQR
- Outliers: individual points plotted below the lower fence
The box itself spans the IQR (Q1 to Q3).
Box plots are excellent for comparing distributions across groups — e.g., salary by department, accuracy by model type.
Heatmap
Visualizes the correlation matrix: each cell shows the Pearson r between two features, colored on a diverging scale from -1 to +1 (for example, dark blue through white to dark red).
- Diagonal is always 1.0 (feature correlated with itself)
- Look for dark off-diagonal cells — these indicate multicollinearity
- Look for cells highly correlated with target — potential useful features
Pair Plot
Plots every feature against every other feature in a grid. The diagonal shows each feature’s distribution (histogram or KDE).
- Scatter in upper/lower triangle: relationship between feature pairs
- Time-consuming for many features (O(n²) plots) — use on selected features only
Time Series Visualization
Components to look for:
- Trend: long-term increase or decrease
- Seasonality: regular periodic patterns (weekly, monthly, yearly)
- Cycles: irregular fluctuations (business cycles)
- Stationarity: does mean/variance change over time?
Exam tip: Non-stationary time series must be made stationary (via differencing or detrending) before applying many time series models.
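Differencing, as mentioned in the tip above, replaces each value with its change from an earlier value. The sketch below uses two synthetic series and a hypothetical helper named `difference`.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# Linear trend: the mean keeps rising, so the series is non-stationary.
trend = [10 + 3 * t for t in range(8)]
trend_diffed = difference(trend)

# Repeating pattern with period 3: differencing at the seasonal lag removes it.
seasonal = [5, 9, 2, 5, 9, 2, 5, 9, 2]
seasonal_diffed = difference(seasonal, lag=3)

print("trend after first differencing:    ", trend_diffed)
print("seasonal after lag-3 differencing: ", seasonal_diffed)
```

First differencing turns the linear trend into a constant series, and differencing at the seasonal lag flattens the repeating pattern.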
Outlier Detection
What is an Outlier? (First Principles)
An outlier is a data point that is far from the rest of the distribution. Not all outliers are errors — in fraud detection, outliers ARE the signal. Context determines whether to remove or keep.
ELI5: In a dataset of human heights, a value of 300cm is clearly an error — remove it. In a dataset of financial transactions, a $50,000 transaction from an account that normally does $50 transactions is an outlier worth keeping and investigating. The question is always: “Is this outlier signal or noise?”
IQR Method
$$\text{Lower fence} = Q1 - 1.5 \times IQR$$ $$\text{Upper fence} = Q3 + 1.5 \times IQR$$
Points outside these fences are flagged as outliers.
- Robust: not affected by the outliers themselves (uses Q1/Q3, not mean)
- Assumption-free: does not assume normality
- Limitation: may flag too many points in heavy-tailed distributions
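A minimal sketch of the IQR method using the standard library. The data is made up, and the quartiles use the `inclusive` convention of `statistics.quantiles`; other tools may place the fences slightly differently.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return (outliers, (lower_fence, upper_fence)) using the Q1/Q3 fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)

data = [12, 13, 14, 15, 15, 16, 17, 18, 19, 95]  # hypothetical sample with one extreme

outliers, fences = iqr_outliers(data)
print(f"fences:   {fences}")
print(f"outliers: {outliers}")
```

The value 95 barely affects Q1 and Q3, so the fences stay tight around the bulk of the data and the extreme value is flagged.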
Z-Score Method
$$z = \frac{x - \mu}{\sigma}$$
Flag as outlier if $|z| > 3$ (outside 3 standard deviations).
- Assumes normality — only valid when distribution is approximately normal
- Not robust: mean and std are themselves pulled by outliers
- Simple and fast for normally distributed data
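The "not robust" point deserves a concrete example. In the sketch below, a single extreme value in a ten-point sample inflates σ so much that its own z-score stays just under 3. The data and helper name are illustrative.

```python
from statistics import mean, pstdev

def z_outliers(values, threshold=3.0):
    """Flag values whose |z| exceeds the threshold, using the sample's own mean and std."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

small = [12, 13, 14, 15, 15, 16, 17, 18, 19, 95]               # 95 is the obvious extreme
larger = [12, 13, 14, 15, 15, 16, 17, 18, 19] * 5 + [95]       # same extreme, more typical points

z_of_95 = abs(95 - mean(small)) / pstdev(small)
small_flags = z_outliers(small)
larger_flags = z_outliers(larger)

print(f"z-score of 95 in the small sample: {z_of_95:.2f}")
print(f"flagged in small sample:  {small_flags}")
print(f"flagged in larger sample: {larger_flags}")
```

In the small sample the extreme value escapes detection because it drags the mean and standard deviation along with it. With more typical points its influence shrinks and it is flagged.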
Isolation Forest
Algorithm-based anomaly detection:
- Randomly select a feature and a random split point
- Repeat recursively — outliers require fewer splits to isolate
- Anomaly score is derived from the average path length across all trees: the shorter the average path, the more anomalous the point
ELI5: Imagine finding someone in a crowd by asking yes/no questions. “Are you taller than 6ft? Are you wearing a hat?” Finding an unusual person takes fewer questions than finding a typical person. Isolation Forest does exactly this — unusual data points are “isolated” faster.
- Does NOT assume a specific distribution
- Works well in high dimensions
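- Amazon SageMaker does not ship Isolation Forest as a built-in algorithm; its built-in counterpart is Random Cut Forest (RCF), a related tree-based anomaly detector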
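The isolation idea can be sketched in one dimension. This is a deliberately simplified toy, not the full algorithm: a real Isolation Forest subsamples the data, builds whole trees over many features, and normalizes path lengths into a score. All names and data here are hypothetical.

```python
import random

def path_length(point, sample, rng, depth=0, max_depth=10):
    """Count the random splits needed until `point` is alone on its side."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(sample), max(sample)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the values that fall on the same side of the split as the point.
    same_side = [v for v in sample if (v < split) == (point < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)

def average_path_length(point, data, n_trees=200, seed=0):
    """Average the path length over many independent random split sequences."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

data = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 9.6, 10.5, 25.0]

outlier_depth = average_path_length(25.0, data)
typical_depth = average_path_length(10.0, data)

print(f"average splits to isolate 25.0: {outlier_depth:.2f}")
print(f"average splits to isolate 10.0: {typical_depth:.2f}")
```

The far-away value is usually cut off by the very first split, while a value in the middle of the cluster needs several.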
When to Remove vs When to Keep Outliers
| Scenario | Decision | Reason |
|---|---|---|
| House price prediction | Remove extreme values | Data entry errors, hurt regression |
| Fraud detection | Keep (they’re the target) | Outliers ARE the signal |
| Sensor data with known equipment failures | Remove or flag | Known noise source |
| Network intrusion detection | Keep | Attacks are “outliers” |
| Image pixel values outside 0-255 | Remove | Physically impossible |
| Income outliers in a wealth study | Keep | Billionaires are real |
Missing Data
Types of Missingness
ELI5: Think of missing data as cards falling out of a deck. MCAR is cards falling randomly — any card could fall out. MAR is aces falling out more often because they’re on top — the missingness relates to other variables you can see. MNAR is people hiding their cards on purpose — the missingness is directly related to the value that’s missing (people with low income refuse to report their income).
| Type | Full Name | Definition | Example | Implication |
|---|---|---|---|---|
| MCAR | Missing Completely At Random | Probability of missing is unrelated to any variable | Random sensor dropout | Safest — can use deletion |
| MAR | Missing At Random | Probability of missing depends on observed variables | Income missing more often for younger people | Imputation using other features works |
| MNAR | Missing Not At Random | Probability of missing depends on the missing value itself | Sick patients skip health surveys | Hardest — imputation may introduce bias |
Strategies for Handling Missing Data
1. Deletion:
- Listwise deletion: remove entire row if any value is missing
  - OK when MCAR and < 5% missing
  - Dangerous when many rows have at least one missing value
- Pairwise deletion: use available data per analysis
  - Only for statistical analyses, not ML pipelines
2. Imputation:
| Method | When to Use | Limitation |
|---|---|---|
| Mean/Median | MCAR, simple baseline | Reduces variance, distorts distributions |
| Mode | Categorical features | Same as above |
| KNN imputation | MAR, features are correlated | Slow on large datasets |
| MICE (Multiple Imputation by Chained Equations) | MAR, best general method | Computationally expensive |
| Regression imputation | MAR, strong predictors exist | Can introduce circularity |
| Forward/backward fill | Time series | Only valid for time-ordered data |
3. Indicator variable:
Add a binary feature_was_missing column alongside the imputed value. This lets the model learn that the pattern of missingness itself is informative (important for MNAR data).
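A minimal sketch of median imputation combined with an indicator column, using plain lists. In practice this would usually be done with a dataframe library; the column values and helper name are hypothetical.

```python
from statistics import median

def impute_with_indicator(values):
    """Median-impute None entries and return (imputed_values, was_missing_flags)."""
    observed = [v for v in values if v is not None]
    fill = median(observed)  # computed from observed values only
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing

income = [42_000, None, 58_000, 61_000, None, 39_000]  # hypothetical column

income_imputed, income_was_missing = impute_with_indicator(income)

print("income:            ", income_imputed)
print("income_was_missing:", income_was_missing)
```

The model sees a complete `income` column and also keeps the information about which rows were originally blank. In a real pipeline, compute the fill value on the training split only and reuse it for validation and production data.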
Missing Data Decision Tree
Is missingness < 5% and plausibly MCAR?
Yes → Listwise deletion is acceptable
No → Continue
Is it MCAR?
Yes → Mean/median imputation or deletion
No → Continue
Is it MAR (missingness relates to observed variables)?
Yes → KNN or MICE imputation
No (MNAR) → Add indicator variable + impute,
consider domain-specific strategy
Why the strategy matters: Using mean imputation on MNAR data is actively harmful — you’re pretending the “missing” people are average, when they’re systematically different. A model trained this way will fail on production data where the missingness pattern continues.
Amazon QuickSight
What it is: Serverless, fully managed business intelligence (BI) and visualization service.
Data Sources → QuickSight → Dashboards & Reports
- Data sources: S3, Athena, Redshift, RDS / Aurora
- Outputs: interactive visualizations, scheduled email reports, embedded analytics, mobile-accessible dashboards
SPICE Engine
Super-fast Parallel In-memory Calculation Engine
- In-memory columnar storage, not querying source data each time
- Enables sub-second response times even on large datasets
- Data is imported into SPICE once and refreshed on schedule
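- Up to 500 million rows per dataset (SPICE quotas depend on edition and have changed over time, so verify against the current QuickSight quotas)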
Exam tip: SPICE is QuickSight’s secret weapon for speed. If a question asks about fast, interactive BI dashboards for non-technical users, QuickSight + SPICE is the answer.
ML Insights in QuickSight
| Feature | What It Does |
|---|---|
| Anomaly Detection | Uses Random Cut Forest to find anomalies in metrics |
| Forecasting | Built-in time series forecasting (no ML knowledge needed) |
| Auto-narratives | Natural language summaries of chart insights |
| Suggested Insights | ML-recommended analyses for your dataset |
When to Use QuickSight vs Custom Visualization
| Use QuickSight when | Use custom (e.g., SageMaker Studio plots) when |
|---|---|
| Business users need dashboards | Data scientists doing deep EDA |
| No code / low code required | Need full matplotlib/seaborn control |
| Sharing across organization | Ad-hoc research plots |
| Need scheduled reports | Integrating visualizations into code pipeline |
| Embedded analytics in apps | Complex custom chart types |
Amazon SageMaker Data Wrangler
What it is: Visual, no-code/low-code data preparation and EDA tool within SageMaker Studio.
Data Sources → Data Wrangler → Transforms → Export
- Data sources: S3, Athena, Redshift, Lake Formation
- Transforms: 300+ built-in transforms, custom Python, data quality check
- Export: SageMaker Pipelines, Feature Store, training job
Key capabilities:
- Visual data flow interface (drag-and-drop)
- 300+ built-in transformations (imputation, encoding, scaling, etc.)
- Data quality reports: missing values, outliers, distributions
- Quick model: train a baseline model in minutes to validate feature usefulness
- Bias detection (built-in Clarify integration)
- Auto-generates pipeline code for export
Data Wrangler vs Glue DataBrew
| Feature | Data Wrangler | Glue DataBrew |
|---|---|---|
| Primary use | ML data prep | General data transformation |
| Audience | Data scientists | Data engineers / analysts |
| Integration | SageMaker ecosystem | AWS Glue + broader AWS |
| ML-specific features | Bias, quick model | No |
| Code export | SageMaker Pipelines | Glue jobs |
| Cost model | Studio compute | Per-node processing |
Exam tip: Data Wrangler = ML-focused visual prep. Glue DataBrew = general data transformation for broader data engineering use cases. If the question mentions “data scientists” and “ML pipeline,” choose Data Wrangler.
Amazon Athena for EDA
What it is: Serverless, interactive SQL query engine for data in S3.
S3 data (CSV, Parquet, JSON, ORC) → AWS Glue Data Catalog (schema) → Athena (SQL queries) → Results to S3, QuickSight, or SageMaker
Why Athena is great for initial EDA:
- No infrastructure to provision — query immediately
- Pay per query ($5 per TB scanned — Parquet reduces cost dramatically)
- Standard SQL — accessible to anyone who knows SQL
- Query data in place — no ETL needed to start exploring
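The pay-per-scan pricing is why file format matters. The back-of-the-envelope sketch below uses the $5 per TB figure quoted above; every data size in it is a made-up assumption, and it treats all columns as equally large for simplicity.

```python
PRICE_PER_TB_USD = 5.0  # figure quoted above; actual pricing can vary by region

def scan_cost(tb_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Query cost is proportional to the amount of data scanned."""
    return tb_scanned * price_per_tb

csv_tb = 2.0        # hypothetical: a CSV table must be read in full
parquet_tb = 0.5    # hypothetical: the same data as compressed Parquet
columns_read, columns_total = 2, 20

# Columnar format: only the referenced columns are read (assumes equal-sized columns).
parquet_scanned_tb = parquet_tb * columns_read / columns_total

csv_cost = scan_cost(csv_tb)
parquet_cost = scan_cost(parquet_scanned_tb)

print(f"CSV query:     ${csv_cost:.2f}")
print(f"Parquet query: ${parquet_cost:.2f}")
```

Under these assumptions the same query drops from $10.00 to $0.25, from the combined effect of compression and column pruning.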
Common EDA queries with Athena:
-- Distribution of a column
SELECT income_bucket, COUNT(*) AS cnt
FROM customers
GROUP BY income_bucket
ORDER BY income_bucket;

-- Null count check
SELECT
  COUNT(*) AS total_rows,
  SUM(CASE WHEN age IS NULL THEN 1 ELSE 0 END) AS missing_age,
  SUM(CASE WHEN salary IS NULL THEN 1 ELSE 0 END) AS missing_salary
FROM customers;

-- Percentile analysis (approx_percentile returns approximate values)
SELECT
  approx_percentile(salary, 0.25) AS q1,
  approx_percentile(salary, 0.50) AS median,
  approx_percentile(salary, 0.75) AS q3
FROM customers;
Exam tip: Athena + Parquet + Glue Catalog is the canonical “explore S3 data without infrastructure” pattern. If a question asks how to do ad-hoc SQL exploration on data sitting in S3 cheaply, this is the answer.
Quick Reference: EDA Toolkit
| Task | Tool / Method |
|---|---|
| Initial data exploration in SQL | Amazon Athena |
| Visual no-code data prep + ML | SageMaker Data Wrangler |
| Business dashboards | Amazon QuickSight |
| Distribution shape | Histogram, box plot |
| Outlier detection (rule-based) | IQR method, Z-score |
| Outlier detection (ML-based) | Isolation Forest, Random Cut Forest |
| Correlation between features | Pearson r, heatmap |
| Non-linear but monotonic correlation | Spearman rank |
| Multicollinearity check | VIF |
| All pairwise relationships | Pair plot |
| Missing data best practice | MICE or KNN imputation + indicator variable |
| Time series patterns | Decomposition plot (trend + seasonality + residual) |