Skip to main content
Python for Data Science
CHAPTER 16 Beginner

Data Analysis and Aggregation

Updated: May 18, 2026
5 min read

# CHAPTER 16

Data Analysis and Aggregation

1. Chapter Introduction

Once your data is clean, it is time to find the answers. If you have a database of 10,000 sales transactions across 5 cities, how do you find the total revenue for *each* city? You need to group the data together and aggregate it. This chapter introduces the most powerful analytical function in Pandas: groupby(), as well as Pivot Tables.

2. The Mechanics of groupby()

The groupby() function implements a "Split-Apply-Combine" workflow:

  1. 1. Split: Split the dataset into groups based on a category (e.g., City).
  1. 2. Apply: Apply a mathematical function to each group (e.g., Sum the Revenue).
  1. 3. Combine: Combine the results into a new summary DataFrame.

python
1234567891011121314151617181920212223
import pandas as pd

# Sales Data
data = {
    "City": ["NY", "LA", "NY", "SF", "LA", "NY"],
    "Salesman": ["Alice", "Bob", "Charlie", "Dave", "Eve", "Frank"],
    "Revenue": [500, 300, 400, 700, 250, 600]
}
df = pd.DataFrame(data)

# Goal: What is the total revenue for each city?
# 1. Group by the 'City' column
# 2. Select the 'Revenue' column to do math on
# 3. Apply the .sum() aggregation function
city_revenue = df.groupby('City')['Revenue'].sum()

print(city_revenue)
# Output:
# City
# LA     550
# NY    1500
# SF     700
# Name: Revenue, dtype: int64

3. Multiple Aggregations at Once

What if you want to know the Total Revenue, the Average Revenue, AND the Number of sales for each city simultaneously? You use the .agg() method.

python
12345678910
# Pass a list of aggregation functions
summary = df.groupby('City')['Revenue'].agg(['sum', 'mean', 'count'])

print(summary)
# Output:
#        sum        mean  count
# City                         
# LA     550  275.000000      2
# NY    1500  500.000000      3
# SF     700  700.000000      1

4. Grouping by Multiple Columns

You can split the data into sub-groups by passing a list to groupby().

python
123456
# Add a 'Year' column
df['Year'] = [2022, 2022, 2023, 2023, 2023, 2022]

# Group by City first, then by Year
yearly_city_rev = df.groupby(['City', 'Year'])['Revenue'].sum()
print(yearly_city_rev)

5. Pivot Tables in Pandas

If you are coming from Excel, you probably love Pivot Tables. Pandas has a dedicated function for them which is often more intuitive than complex groupby chains.

python
12345678910111213141516171819
# Values: The number to aggregate (Revenue)
# Index: The rows (City)
# Columns: The columns (Year)
# aggfunc: The math to apply (sum)

pivot = pd.pivot_table(df, 
                       values='Revenue', 
                       index='City', 
                       columns='Year', 
                       aggfunc='sum',
                       fill_value=0) # Fills missing combos with 0

print(pivot)
# Output:
# Year  2022  2023
# City            
# LA     300   250
# NY    1100   400
# SF       0   700

6. Mini Project: HR Analytics

Find out which department has the highest average salary.

python
1234567891011121314151617
hr_data = pd.DataFrame({
    "Dept": ["IT", "Sales", "IT", "HR", "Sales", "HR"],
    "Role": ["Dev", "Rep", "Dev", "Manager", "Manager", "Recruiter"],
    "Salary": [90000, 50000, 95000, 80000, 120000, 60000]
})

# 1. Group by Dept and calculate average salary
avg_salaries = hr_data.groupby('Dept')['Salary'].mean()

# 2. Sort to find the highest
sorted_salaries = avg_salaries.sort_values(ascending=False)

print("Average Salaries by Department:")
print(sorted_salaries)

# 3. Identify the highest paying department
print(f"\nHighest paying dept: {sorted_salaries.index[0]}")

7. Common Mistakes

  • Forgetting the Aggregation Function: If you type df.groupby('City') and hit enter, it won't show you data. It will just return a <pandas.core.groupby.DataFrameGroupBy object>. You MUST tell it what math to do (e.g., .sum(), .mean(), .count()).
  • Grouping on unique identifiers: Do not groupby('CustomerID') if every customer is unique. Grouping only works on categorical columns where items repeat (like City, Department, Gender, Month).

8. MCQs

Question 1

What is the fundamental Pandas function used to split data into categories for analysis?

Question 2

What are the three steps in a GroupBy workflow?

Question 3

If you run df.groupby('City') without adding .sum() or .mean(), what happens?

Question 4

How do you calculate the total (sum) of the 'Revenue' column grouped by 'Region'?

Question 5

Which method allows you to apply MULTIPLE aggregation functions (like sum and mean) at the same time?

Q6. Can you group by more than one column (e.g., Region AND Year)? a) Yes, by passing a list: df.groupby(['Region', 'Year']) b) No, only one column is allowed — Answer: a
Question 7

Which Pandas function provides an Excel-like way to restructure data with indices and columns?

Question 8

In pd.pivottable(df, values='Sales', index='Store', aggfunc='sum'), what does aggfunc do?

Question 9

What happens if a combination in a Pivot Table has no data (e.g., Store A had no sales in 2022)?

Question 10

You should only use groupby() on what type of columns?

9. Interview Questions

  • Q: Explain the "Split-Apply-Combine" strategy behind the .groupby() function.
  • Q: In what scenario would you prefer using pd.pivottable() over standard .groupby()?

10. Summary

Analytical questions like "What is the average X for each Y?" are solved using .groupby(). Always remember to specify the column you want to do math on, followed by the aggregation function (sum(), mean(), count()). Use .agg() to apply multiple calculations at once. For more complex, multi-dimensional reporting, pd.pivot
table() provides an Excel-like interface to summarize data.

11. Next Chapter Recommendation

In Chapter 17: Working with CSV, Excel, and JSON Data, we will step back to look at the advanced data ingestion parameters, learning how to handle massive files, parse dates automatically, and export our analytical summaries.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·