Skip to main content
R Programming
CHAPTER 15 Beginner

Data Manipulation with dplyr

Updated: May 18, 2026
5 min read

# CHAPTER 15

Data Manipulation with dplyr

1. Chapter Introduction

dplyr is R's data manipulation grammar — a consistent set of verbs that make data transformation readable, chainable, and fast. This chapter masters all core dplyr verbs with real business data analysis.

2. Core dplyr Verbs

r
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657
library(dplyr)
library(tidyr)

# Sample sales data
sales <- data.frame(
  date     = rep(seq(as.Date("2024-01-01"), by="month", length.out=12), each=5),
  rep_name = rep(c("Alice","Bob","Carol","David","Eve"), 12),
  region   = rep(c("North","South","North","East","West"), 12),
  product  = sample(c("Laptop","Phone","Tablet","Monitor"), 60, replace=TRUE),
  revenue  = round(runif(60, 5000, 50000), -2),
  units    = sample(1:30, 60, replace=TRUE),
  stringsAsFactors = FALSE
)

# ── filter() — Keep rows matching conditions ──────────
high_sales  <- sales %>% filter(revenue > 30000)
alice_north <- sales %>% filter(rep_name == "Alice", region == "North")
good_months <- sales %>% filter(format(date, "%Y-%m") %in% c("2024-06","2024-07","2024-08"))

# ── select() — Keep/drop columns ─────────────────────
sales %>% select(rep_name, revenue, units)        # Keep 3 cols
sales %>% select(-date)                            # Drop date
sales %>% select(starts_with("re"))                # Columns starting with "re"
sales %>% select(where(is.numeric))                # Only numeric cols
sales %>% select(rep_name, everything())           # Put rep_name first

# ── mutate() — Create/modify columns ──────────────────
sales <- sales %>%
  mutate(
    avg_price   = revenue / units,
    month       = format(date, "%B"),
    quarter     = paste0("Q", ceiling(as.numeric(format(date, "%m")) / 3)),
    performance = case_when(
      revenue >= 40000 ~ "Excellent",
      revenue >= 25000 ~ "Good",
      revenue >= 10000 ~ "Average",
      TRUE             ~ "Below Target"
    )
  )

# ── arrange() — Sort rows ────────────────────────────
sales %>% arrange(desc(revenue))           # Descending revenue
sales %>% arrange(region, desc(revenue))  # By region, then rev desc

# ── group_by() + summarise() — Aggregation ────────────
rep_summary <- sales %>%
  group_by(rep_name) %>%
  summarise(
    total_revenue = sum(revenue),
    avg_revenue   = round(mean(revenue), 0),
    total_units   = sum(units),
    deals         = n(),
    .groups       = "drop"
  ) %>%
  arrange(desc(total_revenue))

print(rep_summary)

3. Advanced dplyr Operations

r
123456789101112131415161718192021222324252627282930313233343536
# ── Joins ─────────────────────────────────────────────
reps <- data.frame(
  rep_name = c("Alice","Bob","Carol","David","Eve"),
  manager  = c("Smith","Jones","Smith","Brown","Jones"),
  base_pay = c(60000, 52000, 58000, 65000, 55000)
)

# Inner join: only matching rows
merged <- sales %>% inner_join(reps, by="rep_name")

# Left join: all sales, add rep info
merged <- sales %>% left_join(reps, by="rep_name")

# ── Window functions ──────────────────────────────────
sales %>%
  group_by(region) %>%
  mutate(
    region_rank  = rank(desc(revenue)),
    running_total = cumsum(revenue),
    pct_of_region = revenue / sum(revenue) * 100
  ) %>%
  ungroup()

# ── Reshaping with tidyr ─────────────────────────────
# Wide → Long
quarterly_wide <- data.frame(
  rep_name = c("Alice","Bob","Carol"),
  Q1=c(45000,38000,52000), Q2=c(51000,42000,48000),
  Q3=c(48000,45000,55000), Q4=c(60000,52000,61000)
)
quarterly_long <- quarterly_wide %>%
  pivot_longer(cols=Q1:Q4, names_to="Quarter", values_to="Revenue")

# Long → Wide
quarterly_wide2 <- quarterly_long %>%
  pivot_wider(names_from=Quarter, values_from=Revenue)

4. Mini Project: Sales Analytics Dashboard

r
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950
library(dplyr); library(tidyr)

# ─── SALES ANALYTICS DASHBOARD ───────────────────────
set.seed(42)
n <- 200
transactions <- data.frame(
  date     = sample(seq(as.Date("2024-01-01"), as.Date("2024-12-31"), by="day"), n, TRUE),
  salesperson = sample(c("Alice","Bob","Carol","David","Eve"), n, TRUE),
  region   = sample(c("North","South","East","West"), n, TRUE),
  product  = sample(c("Laptop","Phone","Tablet","Monitor","Headphones"), n, TRUE),
  revenue  = round(runif(n, 500, 5000), -2),
  cost     = round(runif(n, 200, 2000), -2)
) %>%
  mutate(
    profit  = revenue - cost,
    margin  = round(profit / revenue * 100, 1),
    quarter = paste0("Q", ceiling(as.numeric(format(date, "%m")) / 3))
  )

# 1: Top salespeople
cat("=== TOP SALESPEOPLE ===\n")
transactions %>%
  group_by(salesperson) %>%
  summarise(revenue=sum(revenue), profit=sum(profit),
            deals=n(), avg_margin=round(mean(margin),1), .groups="drop") %>%
  arrange(desc(revenue)) %>% print()

# 2: Regional performance
cat("\n=== REGIONAL PERFORMANCE ===\n")
transactions %>%
  group_by(region) %>%
  summarise(revenue=sum(revenue), deals=n(), avg_deal=round(mean(revenue)),
            .groups="drop") %>%
  mutate(revenue_pct=round(revenue/sum(revenue)*100,1)) %>%
  arrange(desc(revenue)) %>% print()

# 3: Product profitability
cat("\n=== PRODUCT PROFITABILITY ===\n")
transactions %>%
  group_by(product) %>%
  summarise(revenue=sum(revenue), profit=sum(profit),
            avg_margin=round(mean(margin),1), .groups="drop") %>%
  arrange(desc(profit)) %>% print()

# 4: Quarterly trend
cat("\n=== QUARTERLY TREND ===\n")
transactions %>%
  group_by(quarter) %>%
  summarise(revenue=sum(revenue), profit=sum(profit), .groups="drop") %>%
  mutate(growth = c(NA, diff(revenue)/lag(revenue)[-1]*100)) %>% print()

5. Common Mistakes

  • Forgetting .groups="drop" in summarise: In dplyr 1.0+, summarise() retains grouping. Forgetting .groups="drop" leaves the result grouped, causing unexpected behavior in downstream operations.
  • Using filter() with | vs %in%: filter(col == "A" | col == "B") is verbose. Use filter(col %in% c("A", "B")) for cleaner code.

6. MCQs

Question 1

filter(revenue > 5000, region == "North") keeps rows where?

Question 2

select(startswith("re")) selects columns?

Question 3

mutate() vs summarise()?

Question 4

arrange(desc(revenue)) sorts?

Question 5

leftjoin(a, b, by="id") keeps?

Question 6

pivotlonger(cols=Q1:Q4) creates?

Question 7

casewhen() is equivalent to?

Question 8

n() inside summarise() counts?

Question 9

ungroup() removes?

Question 10

cumsum(x) computes?

7. Interview Questions

  • Q: What is the difference between mutate() and summarise() in dplyr?
  • Q: How do you perform a left join in dplyr?

8. Summary

dplyr verbs: filter() (rows), select() (columns), mutate() (new columns), arrange() (sort), groupby() + summarise() (aggregate), leftjoin()/innerjoin() (combine tables). Pipe %>% chains operations. casewhen() for complex conditionals. pivotlonger()/pivotwider() for reshaping. Always .groups="drop" in summarise.

9. Next Chapter Recommendation

In Chapter 16: Data Visualization with ggplot2, we build beautiful, publication-quality charts using R's Grammar of Graphics.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·