Data Analysis and Aggregation
# CHAPTER 16
Data Analysis and Aggregation
1. Chapter Introduction
Once your data is clean, it is time to find the answers. If you have a database of 10,000 sales transactions across 5 cities, how do you find the total revenue for *each* city? You need to group the data together and aggregate it. This chapter introduces the most powerful analytical function in Pandas:groupby(), as well as Pivot Tables.
2. The Mechanics of groupby()
The groupby() function implements a "Split-Apply-Combine" workflow:
- 1. Split: Split the dataset into groups based on a category (e.g., City).
- 2. Apply: Apply a mathematical function to each group (e.g., Sum the Revenue).
- 3. Combine: Combine the results into a new summary DataFrame.
3. Multiple Aggregations at Once
What if you want to know the Total Revenue, the Average Revenue, AND the Number of sales for each city simultaneously? You use the .agg() method.
4. Grouping by Multiple Columns
You can split the data into sub-groups by passing a list to groupby().
5. Pivot Tables in Pandas
If you are coming from Excel, you probably love Pivot Tables. Pandas has a dedicated function for them which is often more intuitive than complex groupby chains.
6. Mini Project: HR Analytics
Find out which department has the highest average salary.
7. Common Mistakes
-
Forgetting the Aggregation Function: If you type
df.groupby('City')and hit enter, it won't show you data. It will just return a<pandas.core.groupby.DataFrameGroupBy object>. You MUST tell it what math to do (e.g.,.sum(),.mean(),.count()).
-
Grouping on unique identifiers: Do not
groupby('CustomerID')if every customer is unique. Grouping only works on categorical columns where items repeat (like City, Department, Gender, Month).
8. MCQs
What is the fundamental Pandas function used to split data into categories for analysis?
What are the three steps in a GroupBy workflow?
If you run df.groupby('City') without adding .sum() or .mean(), what happens?
How do you calculate the total (sum) of the 'Revenue' column grouped by 'Region'?
Which method allows you to apply MULTIPLE aggregation functions (like sum and mean) at the same time?
df.groupby(['Region', 'Year']) b) No, only one column is allowed — Answer: a
Which Pandas function provides an Excel-like way to restructure data with indices and columns?
In pd.pivottable(df, values='Sales', index='Store', aggfunc='sum'), what does aggfunc do?
What happens if a combination in a Pivot Table has no data (e.g., Store A had no sales in 2022)?
You should only use groupby() on what type of columns?
9. Interview Questions
-
Q: Explain the "Split-Apply-Combine" strategy behind the
.groupby()function.
-
Q: In what scenario would you prefer using
pd.pivottable()over standard.groupby()?
10. Summary
Analytical questions like "What is the average X for each Y?" are solved using.groupby(). Always remember to specify the column you want to do math on, followed by the aggregation function (sum(), mean(), count()). Use .agg() to apply multiple calculations at once. For more complex, multi-dimensional reporting, pd.pivottable() provides an Excel-like interface to summarize data.