Skip to main content
Pandas & NumPy
CHAPTER 21 Beginner

Advanced Pandas Operations

Updated: May 18, 2026
5 min read

# CHAPTER 21

Advanced Pandas Operations

1. Chapter Introduction

Pandas' advanced features handle hierarchical data, time series windows, memory-efficient categoricals, and complex aggregation patterns used in production analytics pipelines.

2. MultiIndex — Hierarchical Indexing

python
1234567891011121314151617181920212223242526272829
import pandas as pd
import numpy as np

# Create MultiIndex DataFrame
data = {
    'Revenue': [12000, 8000, 15000, 9500, 14000, 7800, 13200, 8900],
    'Units':   [10, 8, 12, 9, 11, 7, 13, 8]
}
index = pd.MultiIndex.from_tuples([
    ('2024 Q1', 'Laptop'), ('2024 Q1', 'Phone'),
    ('2024 Q2', 'Laptop'), ('2024 Q2', 'Phone'),
    ('2024 Q3', 'Laptop'), ('2024 Q3', 'Phone'),
    ('2024 Q4', 'Laptop'), ('2024 Q4', 'Phone')
], names=['Quarter', 'Product'])

df = pd.DataFrame(data, index=index)
print(df)

# Access levels
print(df.loc['2024 Q1'])              # All Q1 data
print(df.loc[('2024 Q1', 'Laptop')]) # Specific: Q1 Laptop
print(df.loc['2024 Q1':'2024 Q2'])   # Q1 and Q2 slice

# Stack/Unstack — pivot between index and columns
print(df['Revenue'].unstack())       # Products as columns
print(df.unstack().stack())          # Round trip

# Reset to flat index
df_flat = df.reset_index()

3. Window Functions

python
123456789101112131415161718192021222324
np.random.seed(42)
df_ts = pd.DataFrame({
    'Date': pd.date_range('2024-01', periods=60, freq='D'),
    'Sales': np.random.randint(500, 2000, 60) + np.linspace(0, 500, 60)
}).set_index('Date')

# Rolling windows
df_ts['MA_7']   = df_ts['Sales'].rolling(7).mean()    # 7-day moving average
df_ts['MA_30']  = df_ts['Sales'].rolling(30).mean()   # 30-day moving average
df_ts['STD_7']  = df_ts['Sales'].rolling(7).std()     # 7-day volatility
df_ts['MAX_7']  = df_ts['Sales'].rolling(7).max()     # 7-day rolling max

# Bollinger Bands (financial analysis)
df_ts['Upper_Band'] = df_ts['MA_7'] + 2 * df_ts['STD_7']
df_ts['Lower_Band'] = df_ts['MA_7'] - 2 * df_ts['STD_7']

# Expanding windows (cumulative from start)
df_ts['Cum_Mean'] = df_ts['Sales'].expanding().mean()
df_ts['Cum_Max']  = df_ts['Sales'].expanding().max()

# EWM — Exponentially Weighted Moving Average (recent data weighted more)
df_ts['EWM_7'] = df_ts['Sales'].ewm(span=7, adjust=False).mean()

print(df_ts[['Sales', 'MA_7', 'MA_30', 'EWM_7']].head(15))

4. Categorical Data

python
12345678910111213141516171819202122232425
# Categorical dtype — significant memory and speed benefits for repeated strings
df = pd.DataFrame({
    'Dept': ['Engineering', 'Marketing', 'Engineering', 'Sales', 'HR',
             'Engineering', 'Marketing', 'Sales', 'HR', 'Engineering'] * 1000,
    'Salary': np.random.randint(50000, 100000, 10000)
})

# Compare memory
print(f"String:   {df['Dept'].memory_usage(deep=True):,} bytes")
df['Dept'] = df['Dept'].astype('category')
print(f"Category: {df['Dept'].memory_usage(deep=True):,} bytes")  # Much smaller!

# Ordered categories (for comparison)
satisfaction = pd.Categorical(
    ['High', 'Low', 'Medium', 'High', 'Medium'],
    categories=['Low', 'Medium', 'High'],
    ordered=True
)
s = pd.Series(satisfaction)
print(s > 'Low')   # [True, False, True, True, True]
print(s.max())     # 'High'

# Use with groupby — faster than string groupby
summary = df.groupby('Dept', observed=True)['Salary'].mean()
print(summary)

5. crosstab and cut/qcut

python
12345678910111213141516171819202122232425262728
df_survey = pd.DataFrame({
    'Gender': np.random.choice(['Male', 'Female'], 200),
    'Dept': np.random.choice(['Eng', 'Mkt', 'Sales'], 200),
    'Salary': np.random.normal(70000, 20000, 200).astype(int),
    'Rating': np.random.choice(['A', 'B', 'C'], 200)
})

# Cross tabulation
ct = pd.crosstab(df_survey['Dept'], df_survey['Gender'])
print("Count by Dept and Gender:")
print(ct)

ct_pct = pd.crosstab(df_survey['Dept'], df_survey['Gender'], normalize='index') * 100
print("\nPercentage:")
print(ct_pct.round(1))

# cut — bin continuous data
df_survey['Salary_Band'] = pd.cut(
    df_survey['Salary'],
    bins=[0, 50000, 70000, 90000, float('inf')],
    labels=['Low', 'Mid', 'High', 'Very High']
)

# qcut — equal-frequency bins (quantile-based)
df_survey['Salary_Quartile'] = pd.qcut(df_survey['Salary'], q=4,
                                        labels=['Q1', 'Q2', 'Q3', 'Q4'])
print("\nSalary Band distribution:")
print(df_survey['Salary_Band'].value_counts().sort_index())

6. Common Mistakes

  • Forgetting observed=True in groupby with categories: Pandas shows all category combinations by default (including zero-count groups). Pass observed=True to show only observed combinations.
  • Using string dtype for repeated values: Categorical dtype uses 95%+ less memory for columns with few unique values repeated many times.

7. MCQs

Question 1

MultiIndex allows?

Question 2

df.unstack() converts?

Question 3

rolling(7).mean() uses?

Question 4

Categorical dtype saves memory because?

Question 5

Ordered categories enable?

Question 6

pd.crosstab() creates?

Question 7

pd.cut() vs pd.qcut()?

Question 8

EWM gives?

Question 9

expanding().mean() computes?

Question 10

df.loc['2024 Q1':'2024 Q2'] on MultiIndex selects?

8. Interview Questions

  • Q: When would you use categorical dtype in Pandas?
  • Q: What is the difference between rolling() and expanding()?

9. Summary

Advanced Pandas: MultiIndex for hierarchical data, rolling/ewm for time series smoothing, categorical dtype for 90%+ memory savings on repeated strings, crosstab for contingency tables, and cut/qcut for binning continuous variables. These tools separate professional from beginner Pandas users.

10. Next Chapter Recommendation

In Chapter 22: Exploratory Data Analysis (EDA), we systematically investigate datasets — profiling, correlations, outliers, and the Titanic dataset analysis project.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·