Exploratory Data Analysis (EDA)
# CHAPTER 20
1. Chapter Introduction
You now have the tools: Pandas for manipulation, and Seaborn for visualization. Exploratory Data Analysis (EDA) is the art of combining these tools. It is the phase where you act as a detective, investigating a raw dataset to find patterns, spot anomalies, and form hypotheses *before* you ever attempt to train a machine learning model. This chapter walks through a complete EDA workflow.
2. The Standard EDA Workflow
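Regardless of the dataset, a professional EDA process typically follows the same sequence of steps, although in practice you will often loop back to an earlier step when a later one reveals a problem:
- 1. Data Ingestion: Load the data.
- 2. Data Profiling: Check shapes, types, and summary statistics.
- 3. Data Cleaning: Handle NaNs and duplicates.
- 4. Univariate Analysis: Analyze one variable at a time (Distributions).
- 5. Bivariate Analysis: Analyze two variables together (Correlations).
- 6. Correlation Matrix: Summarize pairwise correlations between the numerical columns in a heatmap.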
3. Steps 1 & 2: Ingestion and Profiling
We will use the well-known Titanic dataset, which records details about each passenger and whether that passenger survived.
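As a minimal sketch of ingestion and profiling, the example below reads a tiny, made-up sample instead of the real file, so it runs without a download. The rows are illustrative only, the column names follow the Kaggle-style Titanic layout (an assumption, since the chapter does not show its file), and `RAW_CSV` is a hypothetical stand-in for a CSV on disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Titanic CSV: a few made-up rows with
# Kaggle-style column names. With a real file you would pass its path
# to pd.read_csv instead of a StringIO buffer.
RAW_CSV = """PassengerId,Survived,Pclass,Sex,Age,Fare,Cabin
1,0,3,male,22,7.25,
2,1,1,female,38,71.28,C85
3,1,3,female,26,7.93,
4,1,1,female,35,53.10,C123
5,0,3,male,,8.05,
6,0,3,male,,8.46,
"""

# Step 1: ingestion
df = pd.read_csv(io.StringIO(RAW_CSV))

# Step 2: profiling
print(df.shape)        # (rows, columns)
df.info()              # dtypes and non-null counts per column
print(df.describe())   # count, mean, std, min, quartiles, max (numeric columns)
print(df.head())       # first rows, as a quick sanity check

# Share of missing values per column, highest first
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting the missing-value share puts the most incomplete columns at the top, which is usually the first thing worth knowing before cleaning.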
4. Step 3: Data Cleaning
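We cannot visualize data accurately if it is riddled with missing values. NaNs and duplicate rows distort both plots and summary statistics, so we handle them before charting anything.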
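One possible cleaning pass is sketched below on made-up, Titanic-style rows. The 60% cutoff stored in `MISSING_THRESHOLD` is an assumption chosen for illustration, not a rule from this chapter, and filling `Age` with the median is only one of several reasonable imputation choices.

```python
import numpy as np
import pandas as pd

# Made-up rows with Titanic-style columns (illustrative values only).
# The last two rows are exact duplicates of each other.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 0],
    "Sex": ["male", "female", "female", "female", "male", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 8.46],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan, np.nan, np.nan],
})

# Assumed cutoff: drop any column missing more than 60% of its values
MISSING_THRESHOLD = 0.6

# 1. Remove exact duplicate rows (the first occurrence is kept)
clean = df.drop_duplicates()

# 2. Drop columns that are mostly empty
missing_share = clean.isna().mean()
mostly_missing = missing_share[missing_share > MISSING_THRESHOLD].index
clean = clean.drop(columns=mostly_missing)

# 3. Fill the remaining gaps in Age with the median, which is less
#    sensitive to extreme values than the mean
clean["Age"] = clean["Age"].fillna(clean["Age"].median())

print(clean)
print(clean.isna().sum())
```

Dropping duplicates first matters here: the missing-value shares and the median are computed on the deduplicated rows, so repeated records do not weigh on either.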
5. Step 4: Univariate Analysis
Univariate means analyzing *one* variable. We want to understand the demographics of the ship.
6. Step 5: Bivariate Analysis & Outliers
Bivariate means analyzing *two* variables to find relationships. The ultimate question: What influenced survival?
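The sketch below pairs `Survived` with one other column at a time, first as grouped numbers and then as plots. The rows are made up and the column names are assumed, so the printed rates illustrate the technique and say nothing about the real ship.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows with Titanic-style columns (illustrative values only)
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08],
})

# Categorical vs. target: the mean of a 0/1 column is the survival rate
survival_by_sex = df.groupby("Sex")["Survived"].mean()
print(survival_by_sex)

# Two categorical columns: row-normalised contingency table
print(pd.crosstab(df["Pclass"], df["Survived"], normalize="index"))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Bar height is the mean of Survived within each group
sns.barplot(data=df, x="Sex", y="Survived", ax=axes[0])
axes[0].set_title("Survival rate by sex")

# Boxplot: compares fare distributions and exposes outliers per group
sns.boxplot(data=df, x="Survived", y="Fare", ax=axes[1])
axes[1].set_title("Fare by survival")

fig.tight_layout()
fig.savefig("bivariate.png")  # hypothetical output filename
```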
7. Step 6: Correlation Matrix
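Finally, we plot a heatmap of the correlation matrix to put numbers on the relationships we spotted visually. Correlation only measures linear association between numerical columns, so treat it as supporting evidence rather than proof.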
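A correlation heatmap could be produced along the lines below. Text columns such as `Sex` cannot be correlated directly, so the sketch keeps only numerical columns with `select_dtypes` first. The data is made up, and the colour map and fixed colour range are stylistic assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows with Titanic-style columns (illustrative values only)
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3],
    "Sex": ["male", "female", "female", "female",
            "male", "male", "male", "female"],
    "Age": [22, 38, 26, 35, 54, 2, 27, 14],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08],
})

# Keep numerical columns only, then compute pairwise Pearson correlations
numeric = df.select_dtypes(include="number")
corr = numeric.corr()
print(corr.round(2))

fig, ax = plt.subplots(figsize=(6, 5))
# Fixing the colour range to [-1, 1] keeps zero correlation at the midpoint
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix")

fig.tight_layout()
fig.savefig("correlation.png")  # hypothetical output filename
```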
8. Common Mistakes
- Jumping to conclusions without EDA: If you train a model on this data without doing EDA, training may fail outright because of the NaNs in the 'Cabin' column, or the model may make poor predictions because nobody noticed that more than three-quarters of that column was missing.
- Ignoring Outliers: If a CSV contains an error that lists a passenger's age as 900, a boxplot will show this outlier immediately. If you skip EDA, that single value will pull the mean age far above its true level.
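To make the outlier point concrete, the sketch below flags the bad age with the 1.5 × IQR rule, the same rule a boxplot's whiskers use by default. The ages are made up, with 900 standing in for the data-entry error described above.

```python
import pandas as pd

# Made-up ages; 900 plays the role of the CSV error
ages = pd.Series([22, 38, 26, 35, 54, 2, 27, 14, 900],
                 name="Age", dtype="float64")

# Interquartile range and the usual 1.5 * IQR fences
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (ages < lower) | (ages > upper)
outliers = ages[is_outlier]
print(outliers)

# Effect of the outlier on the mean versus the median
print("mean with outlier:   ", ages.mean())
print("mean without outlier:", ages[~is_outlier].mean())
print("median with outlier: ", ages.median())
```

The median barely moves when the bad value is present, which is one reason it is often preferred over the mean for summarising or imputing skewed columns.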
9. MCQs
What does EDA stand for?
What is the primary purpose of EDA?
What is Univariate Analysis?
What is Bivariate Analysis?
Which chart is excellent for quickly identifying Outliers in a numerical column?
If a column is missing 85% of its data, what is usually the best course of action?
Which Pandas function is crucial for the "Data Profiling" phase to quickly see counts, means, and max values?
Which Seaborn chart acts essentially like a Bar Chart but specifically counts the occurrences of categories (like counting Males vs Females)?
Why do we filter for numerical columns (select_dtypes) before running a correlation heatmap?
Should you handle missing data (NaNs) before or after creating your visualizations?
10. Interview Questions
- Q: Walk me through your standard Exploratory Data Analysis workflow when handed a brand new, unknown dataset.
- Q: How do you detect outliers in a dataset, and what are two ways you might handle them?
11. Summary
EDA is one of the most important phases of a data science project. Ingest the data, profile it (.info(), .describe()), and clean it (.dropna(), .fillna()). Then move in order from univariate analysis (distributions, countplots) to bivariate analysis (boxplots, scatterplots), and finish with a correlation heatmap, to uncover the story hidden in the data. Only once you understand that story should you proceed to machine learning.