Skip to main content
Pandas & NumPy
CHAPTER 16 Beginner

GroupBy and Aggregation

Updated: May 18, 2026
5 min read

# CHAPTER 16

GroupBy and Aggregation

1. Chapter Introduction

GroupBy is Pandas' equivalent of SQL GROUP BY — splitting data into groups, applying aggregate functions, and combining results. It transforms raw transaction data into meaningful summaries and KPIs.

2. Basic GroupBy

python
12345678910111213141516171819
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Employee': ['Alice','Bob','Carol','David','Eve','Frank','Grace'],
    'Dept': ['Eng','Mkt','Eng','Sales','HR','Eng','Mkt'],
    'Salary': [85000,62000,91000,55000,58000,78000,67000],
    'Experience': [5,3,8,2,4,6,4],
    'Rating': [4.5,3.8,4.9,3.5,4.1,4.3,4.0]
})

# Basic aggregation
print(df.groupby('Dept')['Salary'].mean())
print(df.groupby('Dept')['Salary'].sum())
print(df.groupby('Dept')['Salary'].count())
print(df.groupby('Dept')['Salary'].max())

# Group by multiple columns
print(df.groupby(['Dept'])['Salary'].describe())

3. Multiple Aggregations with agg()

python
123456789101112131415161718
# Single aggregation per group
summary = df.groupby('Dept').agg({
    'Salary': ['mean', 'sum', 'min', 'max', 'count'],
    'Rating': ['mean', 'max'],
    'Experience': 'mean'
})
print(summary)

# Named aggregations (clean column names)
clean_summary = df.groupby('Dept').agg(
    Headcount=('Employee', 'count'),
    Avg_Salary=('Salary', 'mean'),
    Max_Salary=('Salary', 'max'),
    Total_Payroll=('Salary', 'sum'),
    Avg_Rating=('Rating', 'mean'),
    Avg_Experience=('Experience', 'mean')
).round(2)
print(clean_summary)

4. GroupBy with Transform

python
12345678910111213141516
# transform(): return a Series with same length as original DataFrame
# Great for creating group-level features alongside original data

# Add group mean as new column
df['Dept_Avg_Salary'] = df.groupby('Dept')['Salary'].transform('mean')
df['Salary_vs_Dept_Avg'] = df['Salary'] - df['Dept_Avg_Salary']

# Rank within each group
df['Dept_Salary_Rank'] = df.groupby('Dept')['Salary'].rank(ascending=False, method='min').astype(int)

# Normalize within group (z-score)
df['Salary_Z'] = df.groupby('Dept')['Salary'].transform(
    lambda x: (x - x.mean()) / x.std()
).round(3)

print(df[['Employee', 'Dept', 'Salary', 'Dept_Avg_Salary', 'Salary_vs_Dept_Avg', 'Dept_Salary_Rank']])

5. pivot_table — Excel Pivot Table Style

python
123456789101112131415161718192021222324252627282930
sales_df = pd.DataFrame({
    'Month': ['Jan','Jan','Jan','Feb','Feb','Feb','Mar','Mar','Mar'],
    'Product': ['Laptop','Phone','Desk','Laptop','Phone','Desk','Laptop','Phone','Desk'],
    'Region': ['North','South','North','South','North','South','North','South','North'],
    'Revenue': [12000,4500,2100,14000,5200,1800,13500,4800,2400],
    'Units': [10,15,7,12,18,6,11,16,8]
})

# pivot_table — like Excel PivotTable
pivot = sales_df.pivot_table(
    values='Revenue',
    index='Month',
    columns='Product',
    aggfunc='sum',
    fill_value=0,
    margins=True,   # Add row/column totals
    margins_name='Total'
)
print("Revenue by Month & Product:")
print(pivot)

# Multi-value pivot
pivot2 = sales_df.pivot_table(
    values=['Revenue', 'Units'],
    index='Region',
    columns='Product',
    aggfunc={'Revenue': 'sum', 'Units': 'mean'}
)
print("\nMulti-value pivot:")
print(pivot2)

6. Mini Project: Sales Analytics Dashboard

python
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152
import pandas as pd
import numpy as np

np.random.seed(42)
n = 200

df = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=n, freq='D'),
    'Salesperson': np.random.choice(['Alice','Bob','Carol','David','Eve'], n),
    'Product': np.random.choice(['Laptop','Phone','Monitor','Desk','Chair'], n),
    'Category': np.nan,
    'Region': np.random.choice(['North','South','East','West'], n),
    'Units': np.random.randint(1, 20, n),
    'Unit_Price': np.random.choice([1200, 500, 300, 450, 200], n)
})
df['Revenue'] = df['Units'] * df['Unit_Price']
df['Quarter'] = df['Date'].dt.quarter.map({1:'Q1',2:'Q2',3:'Q3',4:'Q4'})

# Category mapping
cat_map = {'Laptop': 'Electronics', 'Phone': 'Electronics',
           'Monitor': 'Electronics', 'Desk': 'Furniture', 'Chair': 'Furniture'}
df['Category'] = df['Product'].map(cat_map)

print("=" * 55)
print("SALES ANALYTICS DASHBOARD")
print("=" * 55)

# 1. Revenue by salesperson
print("\n1. Revenue by Salesperson:")
print(df.groupby('Salesperson').agg(
    Revenue=('Revenue','sum'), Orders=('Revenue','count')
).sort_values('Revenue', ascending=False))

# 2. Product performance
print("\n2. Product Performance:")
print(df.groupby('Product').agg(
    Total_Revenue=('Revenue','sum'),
    Total_Units=('Units','sum'),
    Avg_Order=('Revenue','mean')
).sort_values('Total_Revenue', ascending=False).round(0))

# 3. Revenue by Quarter
print("\n3. Quarterly Revenue:")
print(df.groupby('Quarter')['Revenue'].sum().sort_index())

# 4. Regional breakdown
print("\n4. Revenue by Region:")
print(df.groupby('Region')['Revenue'].sum().sort_values(ascending=False))

# 5. Top 5 transactions
print("\n5. Top 5 Transactions:")
print(df.nlargest(5, 'Revenue')[['Date','Salesperson','Product','Units','Revenue']])

7. Common Mistakes

  • Using apply instead of agg: groupby().apply(lambda x: x.mean()) is much slower than groupby().mean(). Always prefer native aggregation functions.
  • Forgetting resetindex(): After groupby().agg(), the group columns become the index. Use .resetindex() to make them regular columns.

8. MCQs

Question 1

groupby('Dept')['Salary'].mean() computes?

Question 2

Named aggregation syntax Headcount=('col', 'count') is in?

Question 3

transform('mean') vs agg('mean')?

Question 4

pivottable(margins=True) adds?

Question 5

groupby(['A','B']) groups by?

Question 6

After groupby().agg(), group column is?

Question 7

fillvalue=0 in pivottable?

Question 8

df.nlargest(5, 'Revenue') returns?

Question 9

groupby().transform() output length equals?

Question 10

SQL equivalent of groupby().sum() is?

9. Interview Questions

  • Q: What is the difference between groupby().agg() and groupby().transform()?
  • Q: How do you create a pivot table in Pandas?

10. Summary

GroupBy split-apply-combine pattern is Pandas' most powerful analysis tool. agg() with named aggregations creates clean summaries. transform() adds group statistics to the original DataFrame. pivot
table creates Excel-style cross-tabulations. Always reset_index() after aggregation for clean DataFrames.

11. Next Chapter Recommendation

In Chapter 17: Merging and Joining DataFrames, we combine data from multiple sources with merge, join, and concat — the Pandas equivalents of SQL joins.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·