Python for Data Science
CHAPTER 20 Beginner

Exploratory Data Analysis (EDA)

Updated: May 18, 2026
5 min read

1. Chapter Introduction

You now have the tools: Pandas for manipulation, and Seaborn for visualization. Exploratory Data Analysis (EDA) is the art of combining these tools. It is the phase where you act as a detective, investigating a raw dataset to find patterns, spot anomalies, and form hypotheses *before* you ever attempt to train a machine learning model. This chapter walks through a complete EDA workflow.

2. The Standard EDA Workflow

Regardless of the dataset, a professional EDA process follows a strict chronological order:

  1. Data Ingestion: Load the data.
  2. Data Profiling: Check shapes, types, and summary statistics.
  3. Data Cleaning: Handle NaNs and duplicates.
  4. Univariate Analysis: Analyze one variable at a time (Distributions).
  5. Bivariate Analysis: Analyze two variables together (Relationships and outliers).
  6. Correlation Analysis: Summarize pairwise relationships between numeric variables (Heatmap).

3. Steps 1 & 2: Ingestion and Profiling

We will use the famous "Titanic" dataset (Passenger survival data).

python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Ingestion
df = pd.read_csv('titanic.csv')

# 2. Profiling
print(df.head())
df.info()  # Prints its own report (wrapping it in print() would also print "None"); shows 'Age', 'Cabin', and 'Embarked' have missing values
print(df.describe())  # Shows the mean passenger age was just under 30
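
By default, describe() summarizes only the numeric columns, so text columns such as 'Sex' need a separate look during profiling. The sketch below uses a small made-up DataFrame (the name `sample` and its values are assumptions for illustration, not rows from titanic.csv) to show two calls that cover that gap.

```python
import pandas as pd

# Hypothetical mini-sample standing in for titanic.csv (values are made up).
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", None, "Q"],
    "Age": [22.0, 38.0, None, 35.0, 54.0],
})

# Frequency of each category in a text column.
category_counts = sample["Sex"].value_counts()
print(category_counts)

# Number of distinct values per column (missing values are not counted).
distinct_counts = sample.nunique()
print(distinct_counts)
```

A column with very few distinct values is usually categorical, while one with nearly as many distinct values as rows behaves more like an identifier.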

4. Step 3: Data Cleaning

We cannot visualize data accurately if it is riddled with missing values.

python
# Check exact null counts
print(df.isna().sum())

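# 'Cabin' is mostly missing (roughly 77% of rows), so we drop the entire column.
df = df.drop(columns=['Cabin'])

# 'Age' is about 20% missing. We fill the gaps with the median age.
# Assigning the result back avoids calling fillna(inplace=True) on a selected
# column, which newer pandas versions warn about and may not apply to df.
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)

# 'Embarked' is missing 2 values. We drop those 2 rows.
df = df.dropna(subset=['Embarked'])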

print("Data is now clean!")
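
The drop-versus-fill decisions above depend on how much of each column is missing. One way to read that directly is to take the mean of isna(), which gives the missing fraction per column. The sketch below runs on a made-up frame; the name `raw`, its values, and the 0.5 cut-off `DROP_THRESHOLD` are assumptions for illustration, not a rule from this chapter.

```python
import pandas as pd

# Hypothetical data with gaps (not the real Titanic rows).
raw = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0, None],
    "Cabin": [None, "C85", None, None, None],
    "Embarked": ["S", "C", "S", None, "Q"],
})

# isna() yields True/False per cell; the mean of booleans is the missing fraction.
missing_share = raw.isna().mean().sort_values(ascending=False)
print(missing_share)

# Assumed cut-off: drop any column that is more than half empty.
DROP_THRESHOLD = 0.5
cols_to_drop = missing_share[missing_share > DROP_THRESHOLD].index.tolist()
cleaned = raw.drop(columns=cols_to_drop)
print(cleaned.columns.tolist())
```

The right threshold depends on the dataset and on how important the column is, so treat the cut-off as a starting point rather than a fixed rule.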

5. Step 4: Univariate Analysis

Univariate means analyzing *one* variable. We want to understand the demographics of the ship.

python
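# 1. Categorical Variable: Who was on the ship? (Countplot)
# Recent seaborn releases (0.13+) expect `hue` to be set whenever `palette` is used.
sns.countplot(data=df, x='Sex', hue='Sex', palette='Set1', legend=False)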
plt.title("Passenger Count by Gender")
plt.show()
# Finding: There were roughly twice as many men as women.

# 2. Numerical Variable: What was the age distribution? (Histplot)
sns.histplot(data=df, x='Age', bins=30, kde=True)
plt.title("Age Distribution")
plt.show()
# Finding: A tall spike at the median age of 28 (an artifact of our median fill), plus a smaller bump for young children.
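
The spike at the median is a side effect of the cleaning step, not a property of the passengers. A minimal sketch with made-up ages (the Series `ages` is an assumption for illustration) shows how median filling piles values onto one point and shrinks the spread:

```python
import pandas as pd

# Hypothetical ages: seven known values and three gaps.
ages = pd.Series([2.0, 19.0, 24.0, 28.0, 31.0, 45.0, 60.0, None, None, None],
                 name="Age")

median_age = ages.median()        # missing values are skipped
filled = ages.fillna(median_age)

# How many entries sit exactly on the median before and after filling.
count_before = (ages == median_age).sum()
count_after = (filled == median_age).sum()

# The standard deviation drops because the filled values add no variation.
std_before, std_after = ages.std(), filled.std()

print(median_age, count_before, count_after)
print(round(std_before, 2), round(std_after, 2))
```

When reading the histogram, it helps to keep this in mind: the bar at the median overstates how common that age really was.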

6. Step 5: Bivariate Analysis & Outliers

Bivariate means analyzing *two* variables to find relationships. The ultimate question: What influenced survival?

python
# 1. Did Gender affect Survival? (Barplot)
sns.barplot(data=df, x='Sex', y='Survived')
plt.title("Survival Rate by Gender")
plt.show()
# Finding: ~74% of females survived, compared to ~19% of males.
# This is consistent with the "women first" half of "women and children first";
# the "children" half would need a look at Age.

# 2. How does Age differ by Class and Survival? (Boxplot)
# Note: this shows age distributions per group, not survival rates.
sns.boxplot(data=df, x='Pclass', y='Age', hue='Survived')
plt.title("Age by Class and Survival")
plt.show()

# 3. Outlier Detection
# Boxplots easily reveal outliers. We can see a few passengers 
# over age 70 (dots above the whiskers).
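
The boxplot above shows how age is distributed within each class and survival group; it does not show survival rates. Because 'Survived' is coded as 0/1, the mean of that column within a group is the group's survival rate. The sketch below uses a tiny made-up frame (the name `toy` and all its values are assumptions, not real passenger records) to show the idea.

```python
import pandas as pd

# Hypothetical passengers (not real Titanic rows).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "female", "male",
                 "male", "male", "female", "male"],
    "Survived": [1, 1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group = survival rate per group.
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_class)

# The same idea split by two variables at once.
rate_table = toy.pivot_table(index="Pclass", columns="Sex",
                             values="Survived", aggfunc="mean")
print(rate_table)
```

On the real data, the same groupby call on df would give the numbers behind a barplot of 'Pclass' against 'Survived'.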

7. Step 6: Correlation Matrix

Finally, we use a correlation heatmap to quantify the linear relationships between the numeric columns and check them against what the plots suggested.

python
# Select only numeric columns (correlation is undefined for text)
numeric_df = df.select_dtypes(include='number')

# Generate heatmap
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title("Titanic Correlation Heatmap")
plt.show()

# Finding: 'Pclass' has a moderate negative correlation with 'Survived' (roughly -0.3).
# Lower class number (1st class) = higher survival.
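
A heatmap can be hard to scan when only one target matters. One option is to pull out the target's column from the correlation matrix and sort it. The sketch below does this on a made-up frame (the name `toy` and its values are assumptions for illustration, so the coefficients it prints are not the Titanic figures).

```python
import pandas as pd

# Hypothetical passengers (not real Titanic rows).
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 1],
    "Fare":     [80.0, 72.5, 13.0, 26.0, 7.9, 8.1, 7.2, 53.1],
    "Sex":      ["female", "female", "male", "female",
                 "male", "male", "male", "male"],
})

# numeric_only=True skips text columns such as 'Sex' (pandas 1.5 or later).
corr_matrix = toy.corr(numeric_only=True)

# Correlation of every numeric column with the target, most negative first.
target_corr = corr_matrix["Survived"].drop("Survived").sort_values()
print(target_corr)
```

The default here is the Pearson coefficient, which measures linear association only, so a value near zero does not rule out a non-linear relationship.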

8. Common Mistakes

  • Jumping to conclusions without EDA: If you train a model on this data without doing EDA, many algorithms will raise an error on the NaNs in the 'Cabin' column, or the model may make poor predictions because you never noticed that most of that column was missing.
  • Ignoring Outliers: If a CSV has an error and lists a passenger's age as 900, a boxplot will immediately show this outlier. If you skip EDA, that single value will badly skew your average age calculation.
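
The outlier point can be checked numerically. A minimal sketch follows, using the common 1.5 × IQR rule (the same rule seaborn and matplotlib boxplots use for their whiskers by default). The ages are made up for illustration, including the erroneous 900.

```python
import pandas as pd

# Hypothetical ages with one data-entry error (900).
ages = pd.Series([22, 38, 26, 35, 54, 2, 27, 900], name="Age")

# Interquartile range and the 1.5 * IQR fences.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier.
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())

# One bad value drags the mean far more than the median.
mean_with = ages.mean()
mean_without = ages.drop(outliers.index).mean()
print(mean_with, round(mean_without, 2), ages.median())
```

Flagged values still need a judgment call: an age of 900 is clearly an error, while a passenger in their seventies is unusual but real.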

9. MCQs

Question 1

What does EDA stand for?

Question 2

What is the primary purpose of EDA?

Question 3

What is Univariate Analysis?

Question 4

What is Bivariate Analysis?

Question 5

Which chart is excellent for quickly identifying Outliers in a numerical column?

Question 6

If a column is missing 85% of its data, what is usually the best course of action?

Question 7

Which Pandas function is crucial for the "Data Profiling" phase to quickly see counts, means, and max values?

Question 8

Which Seaborn chart acts essentially like a Bar Chart but specifically counts the occurrences of categories (like counting Males vs Females)?

Question 9

Why do we filter for numerical columns (select_dtypes) before running a correlation heatmap?

Question 10

Should you handle missing data (NaNs) before or after creating your visualizations?

10. Interview Questions

  • Q: Walk me through your standard Exploratory Data Analysis workflow when handed a brand new, unknown dataset.
  • Q: How do you detect outliers in a dataset, and what are two ways you might handle them?

11. Summary

EDA is one of the most important phases of a data science project. You must ingest the data, profile it (.info(), .describe()), and clean it (.dropna(), .fillna()). Then, proceed in order from Univariate analysis (Distributions, Countplots) to Bivariate analysis (Barplots, Boxplots) and a correlation heatmap to uncover the story hidden in the data. Only once the story is understood should you proceed to Machine Learning.

12. Next Chapter Recommendation

In Chapter 21: Introduction to Machine Learning, we transition from analyzing the past to predicting the future. You will learn the difference between Supervised and Unsupervised learning and the Scikit-Learn workflow.
