CHAPTER 22 Beginner

Exploratory Data Analysis (EDA)

Updated: May 18, 2026

5 min read

1. Chapter Introduction

Exploratory Data Analysis (EDA) is the systematic process of investigating a dataset to understand its structure, distributions, relationships, and anomalies before any modeling begins. Thorough EDA prevents incorrect assumptions and guides better analytical decisions.

2. EDA Workflow

text

EDA Process:
1. Load & Overview     → shape, dtypes, head, info
2. Data Quality        → missing values, duplicates, dtypes
3. Univariate Analysis → distributions of individual variables
4. Bivariate Analysis  → relationships between pairs
5. Multivariate        → correlations, pair plots
6. Outlier Detection   → IQR, Z-score, visualizations
7. Key Findings        → document insights

3. Data Profiling

python

import pandas as pd
import numpy as np

# Real Titanic dataset; with network access, load it via pd.read_csv(url)
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
# Offline simulation with Titanic-like columns. Each column is drawn
# independently, so cross-column relationships of the real data are not reproduced.
np.random.seed(42)
n = 891

df = pd.DataFrame({
    'PassengerId': range(1, n+1),
    'Survived': np.random.choice([0,1], n, p=[0.62, 0.38]),
    'Pclass': np.random.choice([1,2,3], n, p=[0.24, 0.21, 0.55]),
    'Name': [f'Person_{i}' for i in range(n)],
    'Sex': np.random.choice(['male','female'], n, p=[0.65, 0.35]),
    # Roughly 20% of ages are set to NaN to simulate missing values
    'Age': np.where(np.random.random(n) < 0.2, np.nan,
                    np.random.normal(29.7, 14.5, n).clip(0.5, 80)),
    'SibSp': np.random.choice([0,1,2,3,4], n, p=[0.68, 0.23, 0.06, 0.02, 0.01]),
    'Parch': np.random.choice([0,1,2,3], n, p=[0.76, 0.13, 0.09, 0.02]),
    'Fare': np.random.exponential(32, n).clip(0, 512),
    # Probabilities must sum to 1, otherwise np.random.choice raises ValueError
    'Embarked': np.random.choice(['S','C','Q', None], n, p=[0.70, 0.19, 0.09, 0.02])
})

print("=" * 50)
print("STEP 1: DATASET OVERVIEW")
print("=" * 50)
print(f"Shape: {df.shape}")
print(f"Total cells: {df.size:,}")
print(f"\nColumn Types:\n{df.dtypes}")
print(f"\nFirst 5 rows:\n{df.head()}")
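The workflow also lists `info` under the overview step. The following is a minimal sketch of how that might look, assuming a small stand-in DataFrame (`sample` and its values are made up for illustration and are not part of the chapter's dataset). It captures the output of `info()` as text and builds a compact per-column profile.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in data; replace `sample` with your own DataFrame.
sample = pd.DataFrame({
    "Pclass": [1, 3, 3, 2],
    "Sex": ["female", "male", "male", "female"],
    "Age": [38.0, np.nan, 22.0, 27.0],
    "Fare": [71.28, 7.25, 8.05, 13.00],
})

# info() prints to stdout by default; passing buf captures it as a string.
buf = io.StringIO()
sample.info(buf=buf)
info_text = buf.getvalue()
print(info_text)

# One row per column: dtype, non-null count, and number of distinct values.
profile = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "non_null": sample.count(),
    "n_unique": sample.nunique(),
})
print(profile)
```

A low `non_null` count points to missing data, and a very low `n_unique` suggests the column is categorical even when its dtype is numeric.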

4. Data Quality Assessment

python

print("\n" + "=" * 50)
print("STEP 2: DATA QUALITY")
print("=" * 50)

# Missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(1)
missing_report = pd.DataFrame({'Count': missing, 'Percentage': missing_pct})
missing_report = missing_report[missing_report['Count'] > 0].sort_values('Percentage', ascending=False)
print("Missing Values:")
print(missing_report)

# Duplicates
print(f"\nDuplicate rows: {df.duplicated().sum()}")

# Statistical summary
print("\nNumerical Summary:")
print(df[['Age', 'Fare', 'SibSp', 'Parch']].describe().round(2))
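To make the quality checks easier to verify by eye, here is a small sketch on toy data where the gaps are known in advance. The helper name `quality_report` and the `toy` DataFrame are hypothetical, introduced only for this illustration.

```python
import numpy as np
import pandas as pd


def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing count and percentage, worst columns first."""
    missing = frame.isnull().sum()
    report = pd.DataFrame({
        "Count": missing,
        "Percentage": (missing / len(frame) * 100).round(1),
    })
    return report[report["Count"] > 0].sort_values("Percentage", ascending=False)


# Hypothetical toy data: two missing ages, one missing port, one repeated row.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 22.0],
    "Fare": [7.25, 71.28, 8.05, 13.00, 7.25],
    "Embarked": ["S", "C", None, "S", "S"],
})

report = quality_report(toy)
print(report)

# duplicated() marks the second and later copies of a fully identical row.
n_duplicates = int(toy.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")
```

Here `Age` is reported as 40% missing, `Embarked` as 20% missing, and the last row is counted as one duplicate of the first.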

5. Univariate Analysis

python

print("\n" + "=" * 50)
print("STEP 3: UNIVARIATE ANALYSIS")
print("=" * 50)

# Categorical distributions
for col in ['Survived', 'Pclass', 'Sex', 'Embarked']:
    counts = df[col].value_counts()
    pcts = df[col].value_counts(normalize=True).mul(100).round(1)
    print(f"\n{col}:")
    for val in counts.index:
        print(f"  {val}: {counts[val]} ({pcts[val]}%)")

# Numerical distributions
print("\nAge Distribution:")
print(f"  Mean: {df['Age'].mean():.1f}")
print(f"  Median: {df['Age'].median():.1f}")
print(f"  Std: {df['Age'].std():.1f}")
print(f"  Skewness: {df['Age'].skew():.3f}")

print("\nFare Distribution:")
print(f"  Mean: ${df['Fare'].mean():.2f}")
print(f"  Median: ${df['Fare'].median():.2f}")
print(f"  Skewness: {df['Fare'].skew():.3f}")  # Heavily right-skewed
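To see what a skewness value says about shape, the sketch below compares a roughly symmetric sample with a right-skewed one. Both samples are synthetic, and the seed, sizes, and parameters are arbitrary choices for illustration. In the right-skewed sample the long tail pulls the mean above the median.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Two synthetic samples: one roughly symmetric, one with a long right tail.
symmetric = pd.Series(rng.normal(loc=30, scale=10, size=10_000))
right_skewed = pd.Series(rng.exponential(scale=32, size=10_000))

for name, s in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    print(f"{name}: mean={s.mean():.2f}, median={s.median():.2f}, skew={s.skew():.3f}")

# Equal-width bins give a text-only view of the distribution's shape.
binned = pd.cut(right_skewed, bins=5).value_counts().sort_index()
print(binned)
```

Most of the right-skewed observations fall into the first bin, with only a handful in the last. That is the pattern a large positive skewness describes.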

6. Bivariate Analysis and Correlation

python

print("\n" + "=" * 50)
print("STEPS 4-5: BIVARIATE & MULTIVARIATE (CORRELATION)")
print("=" * 50)

# Survival by category
print("Survival Rate by Sex:")
print(df.groupby('Sex')['Survived'].mean().round(3))

print("\nSurvival Rate by Passenger Class:")
print(df.groupby('Pclass')['Survived'].mean().round(3))

print("\nAverage Fare by Class:")
print(df.groupby('Pclass')['Fare'].mean().round(2))

# Correlation matrix
numeric_cols = ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
corr_matrix = df[numeric_cols].corr()
print("\nCorrelation Matrix:")
print(corr_matrix.round(3))

# Top correlations with Survived
survived_corr = corr_matrix['Survived'].drop('Survived').sort_values(key=abs, ascending=False)
print("\nTop Correlations with Survived:")
print(survived_corr.round(3))
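Group means cover one category at a time. For two categorical variables at once, a cross-tabulation is a common alternative. The sketch below uses a handful of made-up records (`toy` is hypothetical and unrelated to the simulated passengers above), so the rates can be checked by hand.

```python
import pandas as pd

# Hypothetical toy records, small enough to verify manually.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "male"],
    "Pclass": [1, 1, 3, 1, 3, 3, 3, 2],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Row-normalized counts: each row sums to 1, so cells are within-group rates.
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index").round(2)
print(rates)

# The mean of a 0/1 column per (Sex, Pclass) cell is that cell's survival rate.
by_sex_class = toy.pivot_table(index="Sex", columns="Pclass",
                               values="Survived", aggfunc="mean")
print(by_sex_class)
```

Empty combinations, such as female passengers in class 2 here, show up as `NaN` in the pivot table. This is a reminder to check group sizes before reading much into a rate.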

7. Outlier Detection

python

print("\n" + "=" * 50)
print("STEP 6: OUTLIER DETECTION")
print("=" * 50)

def detect_outliers_iqr(series):
    """IQR method — standard for data analysis."""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = series[(series < lower) | (series > upper)]
    return lower, upper, len(outliers)

def detect_outliers_zscore(series, threshold=3):
    """Z-score method — for normally distributed data."""
    z_scores = np.abs((series - series.mean()) / series.std())
    return (z_scores > threshold).sum()

for col in ['Age', 'Fare']:
    clean = df[col].dropna()
    lower, upper, n_iqr = detect_outliers_iqr(clean)
    n_z = detect_outliers_zscore(clean)
    print(f"\n{col}:")
    print(f"  IQR bounds: [{lower:.1f}, {upper:.1f}]")
    print(f"  IQR outliers: {n_iqr} ({n_iqr/len(clean)*100:.1f}%)")
    print(f"  Z-score outliers (>3σ): {n_z}")
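The two methods can disagree, especially on small samples. The worked example below applies the same IQR and Z-score rules to ten hand-picked values (an illustrative series, not data from the chapter) where the arithmetic is easy to follow.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)  # 3.25 and 7.75
iqr = q3 - q1                                          # 4.5
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score rule: flag points more than 3 sample standard deviations from the mean.
z = (values - values.mean()).abs() / values.std()
z_outliers = values[z > 3]

print(f"IQR bounds: [{lower}, {upper}]")           # [-3.5, 14.5]
print(f"IQR outliers: {iqr_outliers.tolist()}")   # [100.0]
print(f"Max |z|: {z.max():.2f}")                   # 2.84
print(f"Z-score outliers: {z_outliers.tolist()}") # []
```

The value 100 is flagged by the IQR rule but not by the Z-score rule, because the extreme point inflates the mean and standard deviation it is measured against. This is one reason the Z-score method is usually reserved for larger, roughly normal samples.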

8. Common Mistakes

Skipping EDA before modeling: Models trained on dirty or misunderstood data produce unreliable predictions. Always run EDA first.

Removing all outliers blindly: Some outliers are legitimate extreme values, such as genuinely high fares. Investigate before dropping them.
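One way to investigate before dropping is to flag outliers in a separate column and keep every row. The sketch below assumes a tiny made-up fare table. The column name `fare_outlier` and the values are hypothetical, and only the upper IQR bound is checked because fares cannot be negative.

```python
import pandas as pd

# Hypothetical fares; 512.33 stands in for a legitimate premium ticket.
fares = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Pclass": [3, 3, 2, 3, 2, 1],
    "Fare": [7.25, 8.05, 13.00, 7.90, 26.00, 512.33],
})

q1, q3 = fares["Fare"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)

# Keep every row; record the IQR verdict in a new column for later review.
fares["fare_outlier"] = fares["Fare"] > upper

# Inspect flagged rows alongside their context before deciding what to do.
print(fares[fares["fare_outlier"]])
```

The flagged row is a first-class ticket, which makes a high fare plausible. Seeing that context next to the flag is what argues against deleting the row.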

9. MCQs

Question 1

EDA stands for?

Question 2

IQR method outlier bounds?

Question 3

`df.describe()` for numeric columns shows?

Question 4

Positive skewness means?

Question 5

Correlation range?

Question 6

Z-score method detects outliers beyond?

Question 7

`value_counts(normalize=True)` returns?

Question 8

High skewness in Fare suggests?

Question 9

`df.skew()` measures?

Question 10

Bivariate analysis studies?

10. Interview Questions

Q: What steps do you follow in an Exploratory Data Analysis?

Q: How do you detect outliers in a dataset?

11. Summary
EDA follows a structured workflow: overview → quality → univariate → bivariate → correlation → outliers → findings. describe(), value_counts(), correlation matrices, and IQR/Z-score outlier detection are the core toolkit. EDA is the most critical step before any modeling.

12. Next Chapter Recommendation

In Chapter 23: Statistical Analysis with Pandas & NumPy, we apply formal statistics — hypothesis testing, confidence intervals, and distribution fitting.
