Skip to main content
Data Cleaning
CHAPTER 15 Beginner

Exploratory Data Analysis for Cleaning

Updated: May 18, 2026
5 min read

# CHAPTER 15

Exploratory Data Analysis for Cleaning

1. Chapter Introduction

You cannot clean what you cannot see. While df.isnull().sum() is helpful, it won't tell you if a "Sales" column has a massive spike of $0 values, or if a "Gender" column contains 15 different misspellings of "Female". Exploratory Data Analysis (EDA) uses descriptive statistics and data visualization to expose the true shape and dirtiness of a dataset.

2. Descriptive Statistics

The first step in EDA is generating summary statistics. This instantly reveals impossible minimums, maximums, and strange distributions.

python
1234567891011121314151617181920
import pandas as pd
import numpy as np

# Sample dataset
df = pd.DataFrame({
    'age': [25, 30, 35, 28, 999, 32, -5, 29], # 999 and -5 are anomalies
    'salary': [50k, 60k, 55k, 62k, 0, 58k, 65k, 1M], # 0 and 1M are anomalies
    'category': ['A', 'B', 'A', 'a', 'C', 'B', 'A ', 'A'] # 'a' and 'A ' are issues
}) # Pseudo-code representations for readability

# 1. Summarize Numeric Data
print("=== NUMERIC SUMMARY ===")
print(df.describe())
# Look at the min and max!
# If age min is -5 and max is 999, you instantly know you have dirty data.

# 2. Summarize Categorical Data
print("\n=== CATEGORICAL SUMMARY ===")
print(df.describe(include=['object']))
# Shows count, unique values, top value, and frequency.

3. Data Profiling (Value Counts)

For categorical columns, .value_counts() is the ultimate dirt-detector.

python
123456789101112
# Checking for inconsistent text entries
print("=== VALUE COUNTS FOR CATEGORY ===")
print(df['category'].value_counts(dropna=False))
# Output might look like:
# A     3
# B     2
# a     1
# C     1
# A     1  <-- Notice the trailing space making it a separate category!

# Pro-tip: Normalize immediately after seeing this
df[&#039;category'] = df['category'].str.strip().str.upper()

4. Visualizing Data Quality

Visualizations highlight outliers and strange distributions instantly.

python
1234567891011121314151617
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Boxplots for Outlier Detection
plt.figure(figsize=(8, 4))
sns.boxplot(x=df[&#039;salary'])
plt.title("Salary Boxplot: Spotting Outliers")
plt.show()
# Boxplots visually plot the IQR. Dots outside the "whiskers" are outliers.

# 2. Histograms for Distribution and Zero-Spikes
plt.figure(figsize=(8, 4))
sns.histplot(df[&#039;age'], bins=20)
plt.title("Age Distribution")
plt.show()
# If you see a massive spike at 0 or 99, it usually means 
# missing values were hard-coded as 0 or 99 by the system.

5. Correlation Matrices for Logical Checks

Sometimes, columns must logically correlate. For example, 'Years of Experience' and 'Age' should have a positive correlation. If they are negatively correlated, the data might be swapped or corrupted.

python
1234567
# Calculate correlation matrix
corr = df[[&#039;age', 'salary', 'experience']].corr()

# Visualize with a heatmap
sns.heatmap(corr, annot=True, cmap=&#039;coolwarm')
plt.title("Correlation Heatmap")
plt.show()

6. Automated EDA Tools

Writing describe(), value_counts(), and plotting histograms for 50 columns takes hours. Modern Python developers use automated profiling libraries.

YData Profiling (formerly Pandas Profiling): *(Note: Requires installation: pip install ydata-profiling)*

python
1234567
from ydata_profiling import ProfileReport

# Generate a complete HTML report with 1 line of code
profile = ProfileReport(df, title="Data Quality Report")

# Save to an HTML file to view in your browser
profile.to_file("data_quality_report.html")

This generates a beautiful interactive report showing missing value matrices, histograms, correlation heatmaps, and explicit warnings (e.g., "Column X has 15% zeros", "Column Y has high cardinality").

7. Mini Project: Sales Dataset Analysis

Scenario: You receive a raw sales dataset. Before writing any cleaning code, you perform EDA.

  1. 1. You run df.describe(). You notice minimumquantity is -10. (Insight: Returns are recorded as negative quantities. Do you want to include returns in your sales total? Action: Separate returns from sales).
  1. 2. You run df['state'].valuecounts(). You see 52 unique states. (Insight: There are only 50 US states. Action: Check for typos like 'TX' vs 'Texas' or territories like 'PR').
  1. 3. You plot a histogram of saledate. You see a massive 3-month gap in the middle of 2023. (Insight: The tracking pixel broke on the website. Action: Do not impute this. Inform stakeholders the data is missing).

8. Common Mistakes

  • Skipping EDA: Jumping straight to dropna() and building a model without actually looking at the distributions. You will build a model on garbage.
  • Ignoring the "Zeros": A histogram with a huge spike at zero often indicates that missing values were imputed with 0 at the database level before you even got the CSV.

9. MCQs

Question 1

Which Pandas function provides count, mean, min, and max for numeric columns?

Question 2

To check for typos and case inconsistencies in a categorical column like "City", you should use?

Question 3

Which visualization is best for explicitly showing outliers using the IQR method?

Question 4

What does a massive spike at 0 or 999 in a histogram of user ages usually indicate?

Question 5

How can you get summary statistics for text (object) columns?

Question 6

What is YData Profiling (Pandas Profiling)?

Question 7

Why use a correlation heatmap during data cleaning?

Question 8

If df['State'].nunique() returns 55 for US States, what is the likely problem?

Question 9

Which visualization is best for seeing the shape/distribution of a single numerical column?

Question 10

EDA should be performed:

10. Interview Questions

  • Q: Walk me through the first 5 commands you run when handed a brand new, undocumented CSV file.
  • Q: You run a histogram on a "Time on Page" metric and see a massive spike at exactly 300 seconds. What data quality issue might this indicate?

11. Summary

EDA is the diagnostic phase of data cleaning. Use df.describe() to find impossible mins/maxs. Use df['col'].value
counts() to expose spelling inconsistencies in categories. Visualize numeric distributions with Seaborn histplot (to find artificial zero-spikes) and boxplot (to find outliers). For speed, leverage automated tools like ydata-profiling to generate full diagnostic reports instantly.

12. Next Chapter Recommendation

In Chapter 16: Automating Data Cleaning Pipelines, we take everything we've learned and wrap it into robust, automated Python scripts and ETL functions that can run on a schedule without human intervention.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·