CHAPTER 15 Beginner

Exploratory Data Analysis for Cleaning

Updated: May 18, 2026

5 min read

# CHAPTER 15

Exploratory Data Analysis for Cleaning

1. Chapter Introduction

You cannot clean what you cannot see. While df.isnull().sum() is helpful, it won't tell you if a "Sales" column has a massive spike of $0 values, or if a "Gender" column contains 15 different misspellings of "Female". Exploratory Data Analysis (EDA) uses descriptive statistics and data visualization to expose the true shape and dirtiness of a dataset.

2. Descriptive Statistics

The first step in EDA is generating summary statistics. This instantly reveals impossible minimums, maximums, and strange distributions.

python

1234567891011121314151617181920

import pandas as pd
import numpy as np

# Sample dataset
df = pd.DataFrame({
    &#039;age': [25, 30, 35, 28, 999, 32, -5, 29], # 999 and -5 are anomalies
    &#039;salary': [50k, 60k, 55k, 62k, 0, 58k, 65k, 1M], # 0 and 1M are anomalies
    &#039;category': ['A', 'B', 'A', 'a', 'C', 'B', 'A ', 'A'] # 'a' and 'A ' are issues
}) # Pseudo-code representations for readability

# 1. Summarize Numeric Data
print("=== NUMERIC SUMMARY ===")
print(df.describe())
# Look at the min and max!
# If age min is -5 and max is 999, you instantly know you have dirty data.

# 2. Summarize Categorical Data
print("\n=== CATEGORICAL SUMMARY ===")
print(df.describe(include=[&#039;object']))
# Shows count, unique values, top value, and frequency.

3. Data Profiling (Value Counts)

For categorical columns, .value_counts() is the ultimate dirt-detector.

python

123456789101112

# Checking for inconsistent text entries
print("=== VALUE COUNTS FOR CATEGORY ===")
print(df[&#039;category'].value_counts(dropna=False))
# Output might look like:
# A     3
# B     2
# a     1
# C     1
# A     1  <-- Notice the trailing space making it a separate category!

# Pro-tip: Normalize immediately after seeing this
df[&#039;category'] = df['category'].str.strip().str.upper()

4. Visualizing Data Quality

Visualizations highlight outliers and strange distributions instantly.

python

1234567891011121314151617

import matplotlib.pyplot as plt
import seaborn as sns

# 1. Boxplots for Outlier Detection
plt.figure(figsize=(8, 4))
sns.boxplot(x=df[&#039;salary'])
plt.title("Salary Boxplot: Spotting Outliers")
plt.show()
# Boxplots visually plot the IQR. Dots outside the "whiskers" are outliers.

# 2. Histograms for Distribution and Zero-Spikes
plt.figure(figsize=(8, 4))
sns.histplot(df[&#039;age'], bins=20)
plt.title("Age Distribution")
plt.show()
# If you see a massive spike at 0 or 99, it usually means 
# missing values were hard-coded as 0 or 99 by the system.

5. Correlation Matrices for Logical Checks

Sometimes, columns must logically correlate. For example, 'Years of Experience' and 'Age' should have a positive correlation. If they are negatively correlated, the data might be swapped or corrupted.

python

1234567

# Calculate correlation matrix
corr = df[[&#039;age', 'salary', 'experience']].corr()

# Visualize with a heatmap
sns.heatmap(corr, annot=True, cmap=&#039;coolwarm')
plt.title("Correlation Heatmap")
plt.show()

6. Automated EDA Tools

Writing describe(), value_counts(), and plotting histograms for 50 columns takes hours. Modern Python developers use automated profiling libraries.

YData Profiling (formerly Pandas Profiling): *(Note: Requires installation: pip install ydata-profiling)*

python

1234567

from ydata_profiling import ProfileReport

# Generate a complete HTML report with 1 line of code
profile = ProfileReport(df, title="Data Quality Report")

# Save to an HTML file to view in your browser
profile.to_file("data_quality_report.html")

This generates a beautiful interactive report showing missing value matrices, histograms, correlation heatmaps, and explicit warnings (e.g., "Column X has 15% zeros", "Column Y has high cardinality").

7. Mini Project: Sales Dataset Analysis

Scenario: You receive a raw sales dataset. Before writing any cleaning code, you perform EDA.

1. You run df.describe(). You notice minimumquantity is -10. (Insight: Returns are recorded as negative quantities. Do you want to include returns in your sales total? Action: Separate returns from sales).

2. You run df['state'].valuecounts(). You see 52 unique states. (Insight: There are only 50 US states. Action: Check for typos like 'TX' vs 'Texas' or territories like 'PR').

3. You plot a histogram of saledate. You see a massive 3-month gap in the middle of 2023. (Insight: The tracking pixel broke on the website. Action: Do not impute this. Inform stakeholders the data is missing).

8. Common Mistakes

Skipping EDA: Jumping straight to dropna() and building a model without actually looking at the distributions. You will build a model on garbage.

Ignoring the "Zeros": A histogram with a huge spike at zero often indicates that missing values were imputed with 0 at the database level before you even got the CSV.

9. MCQs

Question 1

Which Pandas function provides count, mean, min, and max for numeric columns?

Question 2

To check for typos and case inconsistencies in a categorical column like "City", you should use?

Question 3

Which visualization is best for explicitly showing outliers using the IQR method?

Question 4

What does a massive spike at 0 or 999 in a histogram of user ages usually indicate?

Question 5

How can you get summary statistics for text (object) columns?

Question 6

What is YData Profiling (Pandas Profiling)?

Question 7

Why use a correlation heatmap during data cleaning?

Question 8

If df['State'].nunique() returns 55 for US States, what is the likely problem?

Question 9

Which visualization is best for seeing the shape/distribution of a single numerical column?

Question 10

EDA should be performed:

10. Interview Questions

Q: Walk me through the first 5 commands you run when handed a brand new, undocumented CSV file.

Q: You run a histogram on a "Time on Page" metric and see a massive spike at exactly 300 seconds. What data quality issue might this indicate?

11. Summary
EDA is the diagnostic phase of data cleaning. Use df.describe() to find impossible mins/maxs. Use df['col'].valuecounts() to expose spelling inconsistencies in categories. Visualize numeric distributions with Seaborn histplot (to find artificial zero-spikes) and boxplot (to find outliers). For speed, leverage automated tools like ydata-profiling to generate full diagnostic reports instantly.

12. Next Chapter Recommendation

In Chapter 16: Automating Data Cleaning Pipelines, we take everything we've learned and wrap it into robust, automated Python scripts and ETL functions that can run on a schedule without human intervention.

Browse All 21+ Subject Areas

Quick Links

Visual Algorithm Labs

Frontend Dev

The Future of Web Architecture in 2026

Practice Quizzes

Exploratory Data Analysis for Cleaning #

1. Chapter Introduction #

2. Descriptive Statistics #

3. Data Profiling (Value Counts) #

4. Visualizing Data Quality #

5. Correlation Matrices for Logical Checks #

6. Automated EDA Tools #

7. Mini Project: Sales Dataset Analysis #

8. Common Mistakes #

9. MCQs #

Which Pandas function provides count, mean, min, and max for numeric columns?

To check for typos and case inconsistencies in a categorical column like "City", you should use?

Which visualization is best for explicitly showing outliers using the IQR method?

What does a massive spike at 0 or 999 in a histogram of user ages usually indicate?

How can you get summary statistics for text (object) columns?

What is YData Profiling (Pandas Profiling)?

Why use a correlation heatmap during data cleaning?

If df['State'].nunique() returns 55 for US States, what is the likely problem?

Which visualization is best for seeing the shape/distribution of a single numerical column?

EDA should be performed:

10. Interview Questions #

11. Summary #

12. Next Chapter Recommendation #

Finish this Chapter

Discussion

Explore More

📖 Related Tutorials 5

🧪 Related Labs 1

🗺️ Related Roadmaps 1

Send Feedback / Bug

Feedback Submitted!

Exploratory Data Analysis for Cleaning

1. Chapter Introduction

2. Descriptive Statistics

3. Data Profiling (Value Counts)

4. Visualizing Data Quality

5. Correlation Matrices for Logical Checks

6. Automated EDA Tools

7. Mini Project: Sales Dataset Analysis

8. Common Mistakes

9. MCQs

If `df['State'].nunique()` returns 55 for US States, what is the likely problem?

10. Interview Questions

11. Summary

12. Next Chapter Recommendation