Skip to main content
R Programming
CHAPTER 25 Beginner

Working with Real-World Datasets

Updated: May 18, 2026
5 min read

# CHAPTER 25

Working with Real-World Datasets in R

1. Chapter Introduction

Real-world datasets are messy, large, and ambiguous. This chapter simulates a production-grade Kaggle-style analysis workflow — from raw data to professional business report — covering data cleaning, EDA, and actionable insights.

2. The Complete Analysis Workflow

text
1234567891011121314151617181920212223
REAL-WORLD DATA SCIENCE WORKFLOW:

1. UNDERSTAND THE BUSINESS PROBLEM
   → What decision will this analysis support?
   → Who is the audience (technical/executive)?

2. DATA ACQUISITION
   → Kaggle API, database query, CSV export

3. DATA ASSESSMENT (30 minutes)
   → Shape, types, missing values, sample records

4. DATA CLEANING (often 60-70% of total time)
   → Duplicates, NAs, types, outliers, formats

5. EDA (analysis)
   → Univariate → Bivariate → Multivariate

6. INSIGHTS AND MODELING
   → Statistical tests, regression, ML

7. PROFESSIONAL REPORT
   → R Markdown → HTML/PDF report

3. E-Commerce Dataset Analysis

r
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192939495
library(dplyr); library(ggplot2); library(lubridate); library(scales)
library(readr); library(tidyr); library(stringr)

# Simulate realistic e-commerce transactions dataset
set.seed(42)
n <- 5000
products <- c("Laptop","Smartphone","Tablet","Headphones","Monitor","Keyboard","Mouse","Webcam")
categories <- c("Computers","Electronics","Computers","Audio","Computers","Accessories","Accessories","Electronics")
prod_cat <- setNames(categories, products)
prices_base <- c(Laptop=900, Smartphone=700, Tablet=450, Headphones=150,
                  Monitor=350, Keyboard=80, Mouse=40, Webcam=120)

transactions <- data.frame(
  order_id   = paste0("ORD", sprintf("%06d", 1:n)),
  date       = sample(seq(as.Date("2023-01-01"), as.Date("2024-12-31"), by="day"), n, TRUE),
  customer_id= paste0("CUST", sprintf("%05d", sample(1:2000, n, replace=TRUE))),
  product    = sample(names(prices_base), n, replace=TRUE,
                       prob=c(0.15,0.20,0.12,0.15,0.10,0.10,0.08,0.10)),
  quantity   = sample(1:5, n, replace=TRUE, prob=c(0.5,0.25,0.12,0.08,0.05)),
  discount_pct = sample(c(0,5,10,15,20), n, replace=TRUE, prob=c(0.4,0.2,0.2,0.1,0.1)),
  region     = sample(c("North","South","East","West"), n, replace=TRUE)
) %>%
  mutate(
    category     = prod_cat[product],
    base_price   = prices_base[product],
    unit_price   = base_price * (1 - discount_pct/100),
    revenue      = unit_price * quantity,
    cost         = base_price * 0.6 * quantity,
    profit       = revenue - cost,
    margin_pct   = round(profit/revenue*100, 1),
    year         = year(date),
    month        = month(date, label=TRUE),
    quarter      = paste0("Q", quarter(date))
  )

# ─── DATA QUALITY REPORT ─────────────────────────────
cat("=== E-COMMERCE DATA QUALITY ===\n")
cat("Orders:", nrow(transactions), "\n")
cat("Date Range:", as.character(min(transactions$date)), "to",
    as.character(max(transactions$date)), "\n")
cat("Unique Customers:", n_distinct(transactions$customer_id), "\n")
cat("Missing Values:", sum(is.na(transactions)), "\n\n")

# ─── EXECUTIVE KPI SUMMARY ───────────────────────────
cat("=== EXECUTIVE KPI SUMMARY ===\n")
kpis <- transactions %>%
  summarise(
    total_revenue = sum(revenue),
    total_profit  = sum(profit),
    avg_margin    = round(mean(margin_pct), 1),
    total_orders  = n(),
    unique_customers = n_distinct(customer_id),
    avg_order_value  = round(mean(revenue), 2),
    total_units      = sum(quantity)
  )
cat(sprintf("Total Revenue:     $%s\n", format(kpis$total_revenue, big.mark=",")))
cat(sprintf("Total Profit:      $%s\n", format(round(kpis$total_profit), big.mark=",")))
cat(sprintf("Avg Margin:        %.1f%%\n", kpis$avg_margin))
cat(sprintf("Total Orders:      %s\n", format(kpis$total_orders, big.mark=",")))
cat(sprintf("Unique Customers:  %s\n", format(kpis$unique_customers, big.mark=",")))
cat(sprintf("Avg Order Value:   $%.2f\n", kpis$avg_order_value))

# ─── TOP PRODUCTS ─────────────────────────────────────
cat("\n=== TOP PRODUCTS BY REVENUE ===\n")
transactions %>%
  group_by(product, category) %>%
  summarise(revenue=sum(revenue), profit=sum(profit),
            orders=n(), avg_margin=round(mean(margin_pct),1), .groups="drop") %>%
  arrange(desc(revenue)) %>%
  mutate(revenue=paste0("$",format(round(revenue), big.mark=",")),
         profit=paste0("$",format(round(profit), big.mark=","))) %>%
  print()

# ─── MONTHLY TREND ───────────────────────────────────
monthly <- transactions %>%
  group_by(year, month) %>%
  summarise(revenue=sum(revenue), orders=n(), .groups="drop") %>%
  mutate(date=as.Date(paste(year, as.integer(month), "01", sep="-")))

ggplot(monthly, aes(date, revenue, color=factor(year))) +
  geom_line(size=1.5) + geom_point(size=2) +
  scale_y_continuous(labels=dollar_format(scale=0.001, suffix="K")) +
  scale_x_date(date_breaks="2 months", date_labels="%b %Y") +
  scale_color_manual(values=c("2023"="#90CAF9","2024"="#1565C0"), name="Year") +
  labs(title="Monthly Revenue Trend", x="Month", y="Revenue") +
  theme_minimal() + theme(axis.text.x=element_text(angle=30, hjust=1))

# ─── REGIONAL ANALYSIS ────────────────────────────────
cat("\n=== REGIONAL PERFORMANCE ===\n")
transactions %>%
  group_by(region) %>%
  summarise(revenue=sum(revenue), profit=sum(profit),
            orders=n(), margin=round(mean(margin_pct),1), .groups="drop") %>%
  mutate(rev_share=round(revenue/sum(revenue)*100,1)) %>%
  arrange(desc(revenue)) %>% print()

4. Common Mistakes

  • Analyzing without a business question: "What's interesting in this data?" is not a valid analysis objective. Always start with a specific, decision-driving question.
  • Not validating data with domain knowledge: A median laptop price of $15 signals a data entry error. Always cross-check numeric ranges against real-world expectations.

5. MCQs

Question 1

ndistinct(customerid) counts?

Question 2

quarter(date) from lubridate returns?

Question 3

dollarformat(scale=0.001, suffix="K") formats?

Question 4

Real-world data cleaning typically takes?

Question 5

as.Date(paste(year,month,"01",sep="-")) creates?

Question 6

Business KPI report should prioritize?

Question 7

ndistinct() is preferred over length(unique()) because?

Question 8

Pivot table equivalent in R (dplyr)?

Question 9

format(x, big.mark=",") formats numbers with?

Question 10

Executive summary should be?

6. Interview Questions

  • Q: Walk me through your typical real-world data analysis workflow.
  • Q: How do you validate data quality in a large dataset?

7. Summary

Real-world workflow: business question first, then data assessment, cleaning (60-70% of time), EDA, insights, reporting. Key validation: domain range checks, duplicate customer analysis, date consistency. Use lubridate for date manipulation, scales::dollar
format() for formatted charts, n_distinct() for unique counts. Always end with executive summary — clear, actionable insights.

8. Next Chapter Recommendation

In Chapter 26: R Shiny Basics, we build interactive web dashboards directly in R — no web development experience needed.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·