Exploratory Data Analysis (EDA)

Updated: May 18, 2026

# CHAPTER 22

Exploratory Data Analysis (EDA) in R

1. Chapter Introduction

EDA is the critical first step of any data science project: systematically exploring data to understand its structure, distributions, relationships, and anomalies before modeling. This chapter builds a complete EDA framework and applies it to a simulated Titanic-style dataset.

2. EDA Framework

text
EDA SYSTEMATIC PROCESS:

STEP 1: DATA OVERVIEW
  → Shape, types, missing values, duplicates

STEP 2: UNIVARIATE ANALYSIS
  → Distribution of each variable individually
  → Numeric: histogram, boxplot, summary stats
  → Categorical: bar chart, frequency table

STEP 3: BIVARIATE ANALYSIS
  → Relationship between two variables
  → Numeric vs Numeric: scatter, correlation
  → Categorical vs Numeric: boxplot, violin
  → Categorical vs Categorical: mosaic, grouped bar

STEP 4: MULTIVARIATE ANALYSIS
  → Pairs plot (pairs() in base R, GGally::ggpairs())
  → Correlation matrix
  → Faceted charts

STEP 5: ANOMALY DETECTION
  → Outliers, impossible values, inconsistencies
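Step 4 can be sketched with base R alone. The example below is a minimal illustration, assuming the built-in mtcars dataset and an arbitrary choice of four numeric columns (neither is part of this chapter's project):

```r
# Illustrative only: mtcars and the selected columns are assumptions
num_vars <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Correlation matrix: pairwise Pearson correlations, rounded for readability
cor_mat <- round(cor(num_vars, use = "pairwise.complete.obs"), 2)
print(cor_mat)

# Pairs plot: one scatter plot for every pair of variables
pairs(num_vars, main = "Pairs plot of selected mtcars variables")
```

The correlation matrix gives a compact numeric summary, while the pairs plot helps reveal non-linear patterns that a single coefficient can hide.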

3. Mini Project: Titanic Dataset Analysis

r
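library(dplyr); library(ggplot2)

# Built-in Titanic data (datasets package): an aggregated 4-way table of
# counts by Class, Sex, Age, and Survived, not passenger-level records
data(Titanic)
titanic <- as.data.frame(Titanic)

# Passenger-level version from the titanic package
# install.packages("titanic"); library(titanic)
# df <- titanic_train

# Simulate a Titanic-style dataset with realistic marginal proportions.
# Each column is sampled independently, so differences between groups
# (e.g. survival by sex or class) reflect sampling noise only.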
set.seed(42)
n <- 891
df <- data.frame(
  survived  = sample(c(0,1), n, replace=TRUE, prob=c(0.618, 0.382)),
  pclass    = sample(1:3, n, replace=TRUE, prob=c(0.24, 0.21, 0.55)),
  sex       = sample(c("male","female"), n, replace=TRUE, prob=c(0.647, 0.353)),
  age       = c(round(runif(800, 1, 70)), rep(NA, 91)),
  fare      = round(rexp(n, 0.02) * 10, 2),
  embarked  = sample(c("S","C","Q"), n, replace=TRUE, prob=c(0.724, 0.187, 0.089)),
  stringsAsFactors = FALSE
)

# ─── STEP 1: DATA OVERVIEW ───────────────────────────
cat("=== DATA OVERVIEW ===\n")
cat("Shape:", nrow(df), "rows ×", ncol(df), "columns\n")
cat("Missing Values:\n")
colSums(is.na(df)) %>% print()
str(df)

# ─── STEP 2: UNIVARIATE ANALYSIS ──────────────────────
# Survival rate
survival_rate <- mean(df$survived) * 100
cat(sprintf("\nOverall Survival Rate: %.1f%%\n", survival_rate))

# Age distribution
p1 <- ggplot(df %>% filter(!is.na(age)), aes(x=age)) +
  geom_histogram(bins=30, fill="#1565C0", alpha=0.8, color="white") +
  geom_vline(xintercept=mean(df$age, na.rm=TRUE), color="red", linetype="dashed") +
  labs(title="Age Distribution", subtitle=sprintf("Mean: %.1f, Median: %.1f",
        mean(df$age, na.rm=TRUE), median(df$age, na.rm=TRUE))) +
  theme_minimal()

# Fare distribution (log scale for skewed data)
p2 <- ggplot(df, aes(x=log1p(fare))) +
  geom_histogram(bins=30, fill="#2E7D32", alpha=0.8, color="white") +
  labs(title="Fare Distribution (Log Scale)", x="log(Fare+1)") +
  theme_minimal()

# ─── STEP 3: BIVARIATE ANALYSIS ──────────────────────
# Survival by gender
cat("\nSurvival by Gender:\n")
df %>% group_by(sex) %>%
  summarise(survival_rate = round(mean(survived)*100,1), n=n(), .groups="drop") %>%
  print()

# Survival by class
cat("\nSurvival by Class:\n")
df %>% group_by(pclass) %>%
  summarise(survival_rate = round(mean(survived)*100,1), n=n(), .groups="drop") %>%
  print()

# Age vs Survival
p3 <- ggplot(df %>% filter(!is.na(age)),
              aes(x=factor(survived, labels=c("Died","Survived")), y=age, fill=factor(survived))) +
  geom_boxplot(alpha=0.7) +
  scale_fill_manual(values=c("0"="#F44336","1"="#4CAF50")) +
  labs(title="Age by Survival Status", x="Status", y="Age") +
  theme_minimal() + theme(legend.position="none")

# ─── STEP 4: MULTIVARIATE ANALYSIS ───────────────────
# Survival by Class and Gender
p4 <- df %>%
  group_by(pclass, sex) %>%
  summarise(rate=round(mean(survived)*100,1), .groups="drop") %>%
  ggplot(aes(x=factor(pclass), y=rate, fill=sex)) +
  geom_col(position="dodge") +
  scale_fill_manual(values=c(female="#E91E63", male="#1565C0")) +
  labs(title="Survival Rate by Class & Gender",
       x="Passenger Class", y="Survival Rate (%)") +
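  theme_minimal()

# Display the plots (ggplot objects render only when printed)
print(p1); print(p2); print(p3); print(p4)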

# ─── STEP 5: OUTLIER DETECTION ───────────────────────
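# IQR fence: values above Q3 + 1.5 * IQR are flagged as high outliers
# (fares are non-negative and right-skewed, so only the upper fence is checked)
Q1 <- quantile(df$fare, 0.25); Q3 <- quantile(df$fare, 0.75)
upper_fence <- Q3 + 1.5 * (Q3 - Q1)
outlier_fares <- df$fare[df$fare > upper_fence]
cat(sprintf("\nFare Outliers: %d passengers paid > $%.0f\n",
             length(outlier_fares), upper_fence))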

# Summary insights
cat("\n=== KEY INSIGHTS ===\n")
cat(sprintf("1. Survival rate: %.1f%%\n", survival_rate))
cat(sprintf("2. Female survival: %.1f%% vs Male: %.1f%%\n",
             mean(df$survived[df$sex=="female"])*100,
             mean(df$survived[df$sex=="male"])*100))
cat(sprintf("3. 1st class survival: %.1f%% vs 3rd class: %.1f%%\n",
             mean(df$survived[df$pclass==1])*100,
             mean(df$survived[df$pclass==3])*100))
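The framework also lists frequency tables and mosaic plots for two categorical variables, which the project above does not show. A minimal sketch, assuming the built-in Titanic table from the datasets package (aggregated counts, not the simulated df used above):

```r
# Collapse the 4-way Titanic table to Sex x Survived counts
# (dimension 2 is Sex, dimension 4 is Survived)
sex_surv <- margin.table(Titanic, c(2, 4))
print(sex_surv)

# Row percentages: share of each sex that did or did not survive
row_pct <- round(prop.table(sex_surv, margin = 1) * 100, 1)
print(row_pct)

# Mosaic plot: tile area is proportional to the cell count
mosaicplot(sex_surv, main = "Survival by Sex (built-in Titanic table)",
           color = TRUE)
```

On passenger-level data, table(df$sex, df$survived) produces the same kind of contingency table directly from the two columns.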

4. Common Mistakes

  • Skipping EDA before modeling: Models built on data that has not been cleaned or understood produce unreliable results. EDA always comes first.
  • Ignoring missing data patterns: Are NAs missing completely at random (MCAR), related to observed variables (MAR), or related to the unobserved values themselves (MNAR)? The pattern determines the imputation strategy.
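One way to start probing the missingness pattern is to compare how often a variable is missing across levels of the observed variables. The sketch below assumes the built-in airquality dataset (chosen only because it ships with R and contains NAs). It can hint at MCAR versus MAR, but MNAR cannot be confirmed from the observed data alone.

```r
# Illustrative only: airquality is an assumption, not this chapter's data
aq <- airquality
aq$ozone_missing <- is.na(aq$Ozone)

# Share of missing Ozone readings per month: large differences suggest
# the missingness depends on an observed variable (not MCAR)
miss_by_month <- tapply(aq$ozone_missing, aq$Month, mean)
print(round(miss_by_month, 2))

# Compare an observed variable between rows with and without Ozone
temp_by_missing <- tapply(aq$Temp, aq$ozone_missing, mean)
print(round(temp_by_missing, 1))
```

If the missing rate is roughly constant across groups, simple imputation or deletion is easier to defend. If it varies strongly, an imputation method that conditions on the related variables is usually a safer choice.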

5. MCQs

Question 1

EDA stands for?

Question 2

Univariate analysis examines?

Question 3

log1p(x) is preferred over log(x) because?

Question 4

Correlation matrix shows?

Question 5

Boxplot outliers are defined by?

Question 6

Bivariate analysis for two categoricals?

Question 7

Right-skewed distribution means?

Question 8

table(df$sex, df$survived) creates?

Question 9

Faceted charts are useful in EDA for?

Question 10

Key EDA output for stakeholders?

6. Interview Questions

  • Q: What are the key steps in an EDA process?
  • Q: How do you detect outliers in a dataset?
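For the outlier question, one common answer is the 1.5 × IQR rule applied on both sides of the distribution. The helper below is a hypothetical sketch (the name iqr_outliers and the sample vector are illustrative, not from this chapter):

```r
# Hypothetical helper: flag values outside [Q1 - k*IQR, Q3 + k*IQR]
iqr_outliers <- function(x, k = 1.5) {
  x <- x[!is.na(x)]                                # ignore missing values
  q <- quantile(x, c(0.25, 0.75), names = FALSE)   # Q1 and Q3
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  list(lower = lower, upper = upper,
       values = x[x < lower | x > upper])
}

fares <- c(1, 2, 3, 4, 5, 100)   # made-up sample values
res <- iqr_outliers(fares)
print(res)

# Base R's boxplot rule, for comparison; it uses hinges rather than
# quantile(), so the fences can differ slightly on small samples
print(boxplot.stats(fares)$out)
```

Other approaches worth mentioning in an interview include z-scores for roughly normal data and domain checks for impossible values such as negative ages.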

7. Summary

The EDA framework has five steps: overview, univariate, bivariate, multivariate, and anomaly detection. Overview: str(), summary(), colSums(is.na()). Univariate: histograms, boxplots, frequency tables. Bivariate: scatter plots (numeric vs numeric), boxplots (categorical vs numeric), grouped bar charts (categorical vs categorical). Multivariate: pairs plots, correlation matrices, faceted charts. Outliers: the 1.5 × IQR fence. Always document key insights as business findings.

8. Next Chapter Recommendation

In Chapter 23: Machine Learning Basics in R, we set up the ML workflow — training, testing, cross-validation, and model evaluation.
