CHAPTER 16 Beginner

Data Analysis and Aggregation

Updated: May 18, 2026

5 min read

# CHAPTER 16

Data Analysis and Aggregation

1. Chapter Introduction

Once your data is clean, it is time to find the answers. If you have a database of 10,000 sales transactions across 5 cities, how do you find the total revenue for *each* city? You need to group the data together and aggregate it. This chapter introduces the most powerful analytical function in Pandas: groupby(), as well as Pivot Tables.

2. The Mechanics of `groupby()`

The groupby() function implements a "Split-Apply-Combine" workflow:

1. Split: Split the dataset into groups based on a category (e.g., City).

2. Apply: Apply a mathematical function to each group (e.g., Sum the Revenue).

3. Combine: Combine the results into a new summary DataFrame.

python

1234567891011121314151617181920212223

import pandas as pd

# Sales Data
data = {
    "City": ["NY", "LA", "NY", "SF", "LA", "NY"],
    "Salesman": ["Alice", "Bob", "Charlie", "Dave", "Eve", "Frank"],
    "Revenue": [500, 300, 400, 700, 250, 600]
}
df = pd.DataFrame(data)

# Goal: What is the total revenue for each city?
# 1. Group by the 'City' column
# 2. Select the 'Revenue' column to do math on
# 3. Apply the .sum() aggregation function
city_revenue = df.groupby(&#039;City')['Revenue'].sum()

print(city_revenue)
# Output:
# City
# LA     550
# NY    1500
# SF     700
# Name: Revenue, dtype: int64

3. Multiple Aggregations at Once

What if you want to know the Total Revenue, the Average Revenue, AND the Number of sales for each city simultaneously? You use the .agg() method.

python

12345678910

# Pass a list of aggregation functions
summary = df.groupby(&#039;City')['Revenue'].agg(['sum', 'mean', 'count'])

print(summary)
# Output:
#        sum        mean  count
# City                         
# LA     550  275.000000      2
# NY    1500  500.000000      3
# SF     700  700.000000      1

4. Grouping by Multiple Columns

You can split the data into sub-groups by passing a list to groupby().

python

123456

# Add a 'Year' column
df[&#039;Year'] = [2022, 2022, 2023, 2023, 2023, 2022]

# Group by City first, then by Year
yearly_city_rev = df.groupby([&#039;City', 'Year'])['Revenue'].sum()
print(yearly_city_rev)

5. Pivot Tables in Pandas

If you are coming from Excel, you probably love Pivot Tables. Pandas has a dedicated function for them which is often more intuitive than complex groupby chains.

python

12345678910111213141516171819

# Values: The number to aggregate (Revenue)
# Index: The rows (City)
# Columns: The columns (Year)
# aggfunc: The math to apply (sum)

pivot = pd.pivot_table(df, 
                       values=&#039;Revenue', 
                       index=&#039;City', 
                       columns=&#039;Year', 
                       aggfunc=&#039;sum',
                       fill_value=0) # Fills missing combos with 0

print(pivot)
# Output:
# Year  2022  2023
# City            
# LA     300   250
# NY    1100   400
# SF       0   700

6. Mini Project: HR Analytics

Find out which department has the highest average salary.

python

1234567891011121314151617

hr_data = pd.DataFrame({
    "Dept": ["IT", "Sales", "IT", "HR", "Sales", "HR"],
    "Role": ["Dev", "Rep", "Dev", "Manager", "Manager", "Recruiter"],
    "Salary": [90000, 50000, 95000, 80000, 120000, 60000]
})

# 1. Group by Dept and calculate average salary
avg_salaries = hr_data.groupby(&#039;Dept')['Salary'].mean()

# 2. Sort to find the highest
sorted_salaries = avg_salaries.sort_values(ascending=False)

print("Average Salaries by Department:")
print(sorted_salaries)

# 3. Identify the highest paying department
print(f"\nHighest paying dept: {sorted_salaries.index[0]}")

7. Common Mistakes

Forgetting the Aggregation Function: If you type df.groupby('City') and hit enter, it won't show you data. It will just return a <pandas.core.groupby.DataFrameGroupBy object>. You MUST tell it what math to do (e.g., .sum(), .mean(), .count()).

Grouping on unique identifiers: Do not groupby('CustomerID') if every customer is unique. Grouping only works on categorical columns where items repeat (like City, Department, Gender, Month).

8. MCQs

Question 1

What is the fundamental Pandas function used to split data into categories for analysis?

Question 2

What are the three steps in a GroupBy workflow?

Question 3

If you run `df.groupby('City')` without adding `.sum()` or `.mean()`, what happens?

Question 4

How do you calculate the total (sum) of the 'Revenue' column grouped by 'Region'?

Question 5

Which method allows you to apply MULTIPLE aggregation functions (like sum and mean) at the same time?

Q6. Can you group by more than one column (e.g., Region AND Year)? a) Yes, by passing a list: df.groupby(['Region', 'Year']) b) No, only one column is allowed — Answer: a

Question 7

Which Pandas function provides an Excel-like way to restructure data with indices and columns?

Question 8

In `pd.pivottable(df, values='Sales', index='Store', aggfunc='sum')`, what does `aggfunc` do?

Question 9

What happens if a combination in a Pivot Table has no data (e.g., Store A had no sales in 2022)?

Question 10

You should only use `groupby()` on what type of columns?

9. Interview Questions

Q: Explain the "Split-Apply-Combine" strategy behind the .groupby() function.

Q: In what scenario would you prefer using pd.pivottable() over standard .groupby()?

10. Summary
Analytical questions like "What is the average X for each Y?" are solved using .groupby(). Always remember to specify the column you want to do math on, followed by the aggregation function (sum(), mean(), count()). Use .agg() to apply multiple calculations at once. For more complex, multi-dimensional reporting, pd.pivottable() provides an Excel-like interface to summarize data.

11. Next Chapter Recommendation

In Chapter 17: Working with CSV, Excel, and JSON Data, we will step back to look at the advanced data ingestion parameters, learning how to handle massive files, parse dates automatically, and export our analytical summaries.

Explore More

Browse All 21+ Subject Areas

Quick Links

Visual Algorithm Labs

Frontend Dev

The Future of Web Architecture in 2026

Practice Quizzes

Data Analysis and Aggregation #

1. Chapter Introduction #

2. The Mechanics of groupby() #

3. Multiple Aggregations at Once #

4. Grouping by Multiple Columns #

5. Pivot Tables in Pandas #

6. Mini Project: HR Analytics #

7. Common Mistakes #

8. MCQs #

What is the fundamental Pandas function used to split data into categories for analysis?

What are the three steps in a GroupBy workflow?

If you run df.groupby('City') without adding .sum() or .mean(), what happens?

How do you calculate the total (sum) of the 'Revenue' column grouped by 'Region'?

Which method allows you to apply MULTIPLE aggregation functions (like sum and mean) at the same time?

Which Pandas function provides an Excel-like way to restructure data with indices and columns?

In pd.pivottable(df, values='Sales', index='Store', aggfunc='sum'), what does aggfunc do?

What happens if a combination in a Pivot Table has no data (e.g., Store A had no sales in 2022)?

You should only use groupby() on what type of columns?

9. Interview Questions #

10. Summary #

11. Next Chapter Recommendation #

Finish this Chapter

Discussion

Explore More

📖 Related Tutorials 5

🧪 Related Labs 1

🎥 Related Videos 1

🗺️ Related Roadmaps 1

Send Feedback / Bug

Feedback Submitted!

Data Analysis and Aggregation

1. Chapter Introduction

2. The Mechanics of `groupby()`

3. Multiple Aggregations at Once

4. Grouping by Multiple Columns

5. Pivot Tables in Pandas

6. Mini Project: HR Analytics

7. Common Mistakes

8. MCQs

If you run `df.groupby('City')` without adding `.sum()` or `.mean()`, what happens?

In `pd.pivottable(df, values='Sales', index='Store', aggfunc='sum')`, what does `aggfunc` do?

You should only use `groupby()` on what type of columns?

9. Interview Questions

10. Summary

11. Next Chapter Recommendation