Pandas & NumPy
CHAPTER 22 Beginner

Exploratory Data Analysis (EDA)

Updated: May 18, 2026

1. Chapter Introduction

EDA is the systematic process of investigating a dataset to understand its structure, distributions, relationships, and anomalies before any modeling begins. Thorough EDA prevents incorrect assumptions and guides better analytical decisions.

2. EDA Workflow

text
EDA Process:
1. Load & Overview     → shape, dtypes, head, info
2. Data Quality        → missing values, duplicates, dtypes
3. Univariate Analysis → distributions of individual variables
4. Bivariate Analysis  → relationships between pairs
5. Multivariate        → correlations, pair plots
6. Outlier Detection   → IQR, Z-score, visualizations
7. Key Findings        → document insights

3. Data Profiling

python
import pandas as pd
import numpy as np

# The real Titanic dataset can be loaded with pd.read_csv(url):
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
# To stay runnable offline, this chapter simulates a dataset with similar columns instead:
np.random.seed(42)
n = 891

df = pd.DataFrame({
    'PassengerId': range(1, n+1),
    'Survived': np.random.choice([0,1], n, p=[0.62, 0.38]),
    'Pclass': np.random.choice([1,2,3], n, p=[0.24, 0.21, 0.55]),
    'Name': [f'Person_{i}' for i in range(n)],
    'Sex': np.random.choice(['male','female'], n, p=[0.65, 0.35]),
    'Age': np.where(np.random.random(n) < 0.2, np.nan,
                    np.random.normal(29.7, 14.5, n).clip(0.5, 80)),
    'SibSp': np.random.choice([0,1,2,3,4], n, p=[0.68, 0.23, 0.06, 0.02, 0.01]),
    'Parch': np.random.choice([0,1,2,3], n, p=[0.76, 0.13, 0.09, 0.02]),
    'Fare': np.random.exponential(32, n).clip(0, 512),
    # Probabilities must sum to 1, otherwise np.random.choice raises a ValueError
    'Embarked': np.random.choice(['S','C','Q', None], n, p=[0.70, 0.19, 0.09, 0.02])
})

print("=" * 50)
print("STEP 1: DATASET OVERVIEW")
print("=" * 50)
print(f"Shape: {df.shape}")
print(f"Total cells: {df.size:,}")
print(f"\nColumn Types:\n{df.dtypes}")
print(f"\nFirst 5 rows:\n{df.head()}")
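
The workflow in section 2 also lists `info` as part of the overview, which the block above does not call. A minimal sketch of that step follows, using a hypothetical four-row frame named `sample` (not part of the chapter's data) so that it runs on its own. It captures the `info()` report as text and counts distinct values per column.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the chapter's df.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# info() prints to stdout by default; passing buf captures the report as text.
buffer = io.StringIO()
sample.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# Distinct non-missing values per column: a quick hint about which columns
# are identifiers, categories, or continuous measurements.
unique_counts = sample.nunique()
print(unique_counts)
```

A column whose unique count equals the number of rows (here `PassengerId`) is usually an identifier and rarely useful as a feature.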

4. Data Quality Assessment

python
print("\n" + "=" * 50)
print("STEP 2: DATA QUALITY")
print("=" * 50)

# Missing values
missing = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df) * 100).round(1)
missing_report = pd.DataFrame({'Count': missing, 'Percentage': missing_pct})
missing_report = missing_report[missing_report['Count'] > 0].sort_values('Percentage', ascending=False)
print("Missing Values:")
print(missing_report)

# Duplicates
print(f"\nDuplicate rows: {df.duplicated().sum()}")

# Statistical summary
print(f"\nNumerical Summary:")
print(df[['Age', 'Fare', 'SibSp', 'Parch']].describe().round(2))
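
The missing-value report above can be wrapped in a reusable helper, and the duplicate check can be narrowed to a key column. The sketch below assumes a hypothetical helper name, `build_missing_report`, and a tiny illustrative frame. Neither comes from the chapter.

```python
import numpy as np
import pandas as pd


def build_missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    count = frame.isna().sum()
    report = pd.DataFrame({
        "Count": count,
        "Percentage": (count / len(frame) * 100).round(1),
    })
    return report[report["Count"] > 0].sort_values("Percentage", ascending=False)


# Hypothetical data with one repeated id and a few gaps.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 2, 4],
    "Age": [22.0, np.nan, np.nan, 35.0],
    "Embarked": ["S", "C", "C", None],
})

report = build_missing_report(sample)
print(report)

# duplicated() compares whole rows by default; subset= restricts the check
# to a key column that is expected to be unique.
duplicate_ids = sample.duplicated(subset=["PassengerId"]).sum()
print(f"Duplicate PassengerId values: {duplicate_ids}")
```

Checking a key column separately is useful because two rows can share an id while differing elsewhere, which a whole-row comparison would not report.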

5. Univariate Analysis

python
print("\n" + "=" * 50)
print("STEP 3: UNIVARIATE ANALYSIS")
print("=" * 50)

# Categorical distributions
for col in ['Survived', 'Pclass', 'Sex', 'Embarked']:
    counts = df[col].value_counts()
    pcts = df[col].value_counts(normalize=True).mul(100).round(1)
    print(f"\n{col}:")
    for val in counts.index:
        print(f"  {val}: {counts[val]} ({pcts[val]}%)")

# Numerical distributions
print("\nAge Distribution:")
print(f"  Mean: {df['Age'].mean():.1f}")
print(f"  Median: {df['Age'].median():.1f}")
print(f"  Std: {df['Age'].std():.1f}")
print(f"  Skewness: {df['Age'].skew():.3f}")

print("\nFare Distribution:")
print(f"  Mean: ${df['Fare'].mean():.2f}")
print(f"  Median: ${df['Fare'].median():.2f}")
print(f"  Skewness: {df['Fare'].skew():.3f}")  # Heavily right-skewed
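
Summary statistics compress a distribution into a few numbers. Binning a numeric column shows its shape without any plotting. The sketch below uses `pd.cut` on a hypothetical list of ages. The bin edges and labels are illustrative choices, not values from the chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
ages = pd.Series([2, 15, 22, 24, 29, 31, 38, 45, 61, np.nan], name="Age")

# Illustrative bin edges and labels; right=False makes each bin [low, high).
edges = [0, 18, 40, 60, 100]
labels = ["child", "young adult", "middle-aged", "senior"]
age_groups = pd.cut(ages, bins=edges, labels=labels, right=False)

# sort=False keeps the bins in their natural order instead of by frequency.
# Missing ages fall into no bin and are left out of the counts.
group_counts = age_groups.value_counts(sort=False)
print(group_counts)
```

Because missing values are dropped silently here, it is worth comparing the total of the counts with the number of rows.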

6. Bivariate Analysis and Correlation

python
print("\n" + "=" * 50)
print("STEP 4: BIVARIATE & CORRELATION")
print("=" * 50)

# Survival by category
print("Survival Rate by Sex:")
print(df.groupby('Sex')['Survived'].mean().round(3))

print("\nSurvival Rate by Passenger Class:")
print(df.groupby('Pclass')['Survived'].mean().round(3))

print("\nAverage Fare by Class:")
print(df.groupby('Pclass')['Fare'].mean().round(2))

# Correlation matrix
numeric_cols = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
corr_matrix = df[numeric_cols].corr()
print("\nCorrelation Matrix:")
print(corr_matrix.round(3))

# Top correlations with Survived
survived_corr = corr_matrix['Survived'].drop('Survived').sort_values(key=abs, ascending=False)
print("\nTop Correlations with Survived:")
print(survived_corr.round(3))
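
The group-by summaries above look at one category at a time. A two-way table shows how two categories interact. This is a sketch on a hypothetical eight-row frame, using `pivot_table` for the rates and `pd.crosstab` for the group sizes.

```python
import pandas as pd

# Hypothetical passengers, not taken from the chapter's simulated data.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 2, 3, 2, 1],
    "Survived": [0, 1, 1, 0, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the survival rate for each Sex x Pclass cell.
rate_table = sample.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)
print(rate_table.round(2))

# Group sizes matter: a rate computed from one or two rows is not reliable.
size_table = pd.crosstab(sample["Sex"], sample["Pclass"])
print(size_table)
```

Reading the two tables side by side helps avoid over-interpreting a rate that comes from a very small cell.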

7. Outlier Detection

python
print("\n" + "=" * 50)
print("STEP 5: OUTLIER DETECTION")
print("=" * 50)

def detect_outliers_iqr(series):
    """IQR method — standard for data analysis."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = series[(series < lower) | (series > upper)]
    return lower, upper, len(outliers)

def detect_outliers_zscore(series, threshold=3):
    """Z-score method — for normally distributed data."""
    z_scores = np.abs((series - series.mean()) / series.std())
    return (z_scores > threshold).sum()

for col in ['Age', 'Fare']:
    clean = df[col].dropna()
    lower, upper, n_iqr = detect_outliers_iqr(clean)
    n_z = detect_outliers_zscore(clean)
    print(f"\n{col}:")
    print(f"  IQR bounds: [{lower:.1f}, {upper:.1f}]")
    print(f"  IQR outliers: {n_iqr} ({n_iqr/len(clean)*100:.1f}%)")
    print(f"  Z-score outliers (>3σ): {n_z}")
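
To see how the two methods can disagree, here is a small worked example on nine hypothetical fares with one extreme value. It repeats the same IQR and Z-score arithmetic as the functions above, written inline so it runs on its own.

```python
import pandas as pd

# Hypothetical fares: eight ordinary values and one extreme value.
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0])

# IQR method: bounds sit 1.5 * IQR outside the quartiles.
q1 = fares.quantile(0.25)   # 8.0
q3 = fares.quantile(0.75)   # 26.0
iqr = q3 - q1               # 18.0
lower = q1 - 1.5 * iqr      # -19.0
upper = q3 + 1.5 * iqr      # 53.0
n_iqr = int(((fares < lower) | (fares > upper)).sum())

# Z-score method: distance from the mean in units of standard deviation.
z_scores = ((fares - fares.mean()) / fares.std()).abs()
n_z = int((z_scores > 3).sum())

print(f"IQR bounds: [{lower:.1f}, {upper:.1f}], outliers: {n_iqr}")
print(f"Largest z-score: {z_scores.max():.2f}, outliers beyond 3 sigma: {n_z}")
```

The IQR rule flags the 512.0 fare, while the Z-score rule flags nothing: the extreme value inflates the standard deviation it is measured against. With only nine values and the sample standard deviation, no point can reach a z-score of 3, so the Z-score method is better suited to larger, roughly normal samples.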

8. Common Mistakes

  • Skipping EDA before modeling: Models trained on dirty or misunderstood data produce unreliable predictions. Always perform EDA first.
  • Removing all outliers blindly: Some outliers are legitimate extreme values (for example, genuinely high fares). Investigate before dropping.
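
One way to investigate before dropping is to flag outliers in a new column and inspect the flagged rows alongside their other attributes. The sketch below assumes a hypothetical nine-row frame and a hypothetical column name, `fare_outlier`.

```python
import pandas as pd

# Hypothetical data: the highest fare belongs to a first-class passenger.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "Pclass": [3, 3, 3, 3, 2, 2, 1, 1, 1],
    "Fare": [7.0, 8.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0],
})

q1, q3 = sample["Fare"].quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag instead of dropping: every original row stays available for inspection.
sample["fare_outlier"] = ~sample["Fare"].between(lower, upper)
flagged = sample[sample["fare_outlier"]]
print(flagged)
```

Here the flagged fare sits in first class, which suggests a plausible expensive ticket and not a data-entry error. That is a judgment the analyst makes from context, and the code only surfaces the rows to look at.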

9. MCQs

Question 1

EDA stands for?

Question 2

IQR method outlier bounds?

Question 3

df.describe() for numeric columns shows?

Question 4

Positive skewness means?

Question 5

Correlation range?

Question 6

Z-score method detects outliers beyond?

Question 7

value_counts(normalize=True) returns?

Question 8

High skewness in Fare suggests?

Question 9

df.skew() measures?

Question 10

Bivariate analysis studies?

10. Interview Questions

  • Q: What steps do you follow in an Exploratory Data Analysis?
  • Q: How do you detect outliers in a dataset?

11. Summary

EDA follows a structured workflow: overview → quality → univariate → bivariate → correlation → outliers → findings. describe(), value_counts(), correlation matrices, and IQR/Z-score outlier detection are the core toolkit. EDA is the most critical step before any modeling.

12. Next Chapter Recommendation

In Chapter 23: Statistical Analysis with Pandas & NumPy, we apply formal statistics — hypothesis testing, confidence intervals, and distribution fitting.
