Exploratory Data Analysis for Cleaning
# CHAPTER 15
Exploratory Data Analysis for Cleaning
1. Chapter Introduction
You cannot clean what you cannot see. Whiledf.isnull().sum() is helpful, it won't tell you if a "Sales" column has a massive spike of $0 values, or if a "Gender" column contains 15 different misspellings of "Female". Exploratory Data Analysis (EDA) uses descriptive statistics and data visualization to expose the true shape and dirtiness of a dataset.
2. Descriptive Statistics
The first step in EDA is generating summary statistics. This instantly reveals impossible minimums, maximums, and strange distributions.
3. Data Profiling (Value Counts)
For categorical columns, .value_counts() is the ultimate dirt-detector.
4. Visualizing Data Quality
Visualizations highlight outliers and strange distributions instantly.
5. Correlation Matrices for Logical Checks
Sometimes, columns must logically correlate. For example, 'Years of Experience' and 'Age' should have a positive correlation. If they are negatively correlated, the data might be swapped or corrupted.
6. Automated EDA Tools
Writing describe(), value_counts(), and plotting histograms for 50 columns takes hours. Modern Python developers use automated profiling libraries.
YData Profiling (formerly Pandas Profiling):
*(Note: Requires installation: pip install ydata-profiling)*
This generates a beautiful interactive report showing missing value matrices, histograms, correlation heatmaps, and explicit warnings (e.g., "Column X has 15% zeros", "Column Y has high cardinality").
7. Mini Project: Sales Dataset Analysis
Scenario: You receive a raw sales dataset. Before writing any cleaning code, you perform EDA.
-
1.
You run
df.describe(). You noticeminimumquantityis -10. (Insight: Returns are recorded as negative quantities. Do you want to include returns in your sales total? Action: Separate returns from sales).
-
2.
You run
df['state'].valuecounts(). You see 52 unique states. (Insight: There are only 50 US states. Action: Check for typos like 'TX' vs 'Texas' or territories like 'PR').
-
3.
You plot a histogram of
saledate. You see a massive 3-month gap in the middle of 2023. (Insight: The tracking pixel broke on the website. Action: Do not impute this. Inform stakeholders the data is missing).
8. Common Mistakes
-
Skipping EDA: Jumping straight to
dropna()and building a model without actually looking at the distributions. You will build a model on garbage.
- Ignoring the "Zeros": A histogram with a huge spike at zero often indicates that missing values were imputed with 0 at the database level before you even got the CSV.
9. MCQs
Which Pandas function provides count, mean, min, and max for numeric columns?
To check for typos and case inconsistencies in a categorical column like "City", you should use?
Which visualization is best for explicitly showing outliers using the IQR method?
What does a massive spike at 0 or 999 in a histogram of user ages usually indicate?
How can you get summary statistics for text (object) columns?
What is YData Profiling (Pandas Profiling)?
Why use a correlation heatmap during data cleaning?
If df['State'].nunique() returns 55 for US States, what is the likely problem?
Which visualization is best for seeing the shape/distribution of a single numerical column?
EDA should be performed:
10. Interview Questions
- Q: Walk me through the first 5 commands you run when handed a brand new, undocumented CSV file.
- Q: You run a histogram on a "Time on Page" metric and see a massive spike at exactly 300 seconds. What data quality issue might this indicate?
11. Summary
EDA is the diagnostic phase of data cleaning. Usedf.describe() to find impossible mins/maxs. Use df['col'].valuecounts() to expose spelling inconsistencies in categories. Visualize numeric distributions with Seaborn histplot (to find artificial zero-spikes) and boxplot (to find outliers). For speed, leverage automated tools like ydata-profiling to generate full diagnostic reports instantly.