CHAPTER 12 Beginner

Cleaning Data with Pandas

Updated: May 18, 2026

1. Chapter Introduction

You now know how to handle missing values, duplicates, bad types, and messy strings individually. However, professional data scientists don't run 50 separate, disjointed lines of code to clean a dataset. They write elegant, efficient, and reproducible data pipelines. This chapter synthesizes previous concepts into cohesive Pandas workflows, focusing on the powerful concept of "Method Chaining."

2. The Problem with Step-by-Step Cleaning

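Beginners often write one statement per operation, reassigning the same variable each time. The repetition makes the script harder to read and invites copy-paste errors. When intermediate results are kept under separate names, it also clutters the namespace and holds memory longer than necessary.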

The Beginner (Imperative) Way:

python
import pandas as pd

# Bad Practice: Hard to read, prone to copy-paste errors
df = pd.read_csv('data.csv')
df = df.drop_duplicates()
df['name'] = df['name'].str.lower()
df['name'] = df['name'].str.strip()
# regex=False treats '$' as a literal character, not the regex end-of-string anchor
df['price'] = df['price'].str.replace('$', '', regex=False)
df['price'] = pd.to_numeric(df['price'])
df = df.dropna(subset=['price'])

3. The Professional Way: Method Chaining

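Most Pandas methods return a new DataFrame (or Series) instead of modifying the original in place. This means you can chain methods together end-to-end, with each call operating on the result of the previous one. Wrap the entire chain in parentheses () so Python lets you split it across multiple lines without backslashes.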

The Professional (Functional) Way:

python
import pandas as pd
import numpy as np

# Sample messy data
raw_data = pd.DataFrame({
    '  Cust_ID  ': [1, 2, 2, 4, 5],
    'Name': ['Alice ', 'BOB', 'BOB', 'David', np.nan],
    'Price': ['$10.50', 'Free', 'Free', '$22.00', '$15.00']
})

# Method Chaining Pipeline
clean_df = (
    raw_data
    # 1. Standardize column names (using rename with a lambda function)
    .rename(columns=lambda x: x.strip().lower())
    
    # 2. Drop duplicates
    .drop_duplicates()
    
    # 3. Clean strings using assign() to create/modify columns within a chain
    .assign(
        name = lambda df_: df_['name'].str.title().str.strip(),
        price = lambda df_: df_['price'].str.replace(r'[^\d.]', '', regex=True)
    )
    
    # 4. Convert types (Free becomes NaN, then we fill it)
    .assign(
        price = lambda df_: pd.to_numeric(df_['price'], errors='coerce').fillna(0)
    )
    
    # 5. Filter rows (Drop missing names)
    .dropna(subset=['name'])
)

print("=== CLEANED PIPELINE OUTPUT ===")
print(clean_df)

4. Why Use assign() and lambda?

In a method chain, the DataFrame changes at every step. If you write df['price'].str.replace() inside the chain, df refers to whatever the variable df held *before* the chain started, not the intermediate DataFrame currently moving through the chain. After a rename or a filter, that original may have different column names or rows.

Using .assign(new_col = lambda current_df: current_df['old_col'] * 2) ensures you are working with the data exactly as it exists at that step in the pipeline, because Pandas calls the lambda with the intermediate DataFrame.
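
The difference is easiest to see when an earlier step renames a column. The sketch below uses made-up data and illustrative variable names (raw, fixed, error_name); it is one way to show the behavior, not the only one.

```python
import pandas as pd

# Made-up data: the column is called 'Price' before the chain runs
raw = pd.DataFrame({"Price": ["$10.50", "$22.00"]})

# Direct reference: raw['price'] is evaluated against the ORIGINAL frame,
# which has no lowercase 'price' column, so this raises KeyError.
error_name = None
try:
    raw.rename(columns=str.lower).assign(
        price=raw["price"].str.replace("$", "", regex=False)
    )
except KeyError as exc:
    error_name = type(exc).__name__

# Lambda: Pandas passes the intermediate (already renamed) frame to the lambda.
fixed = raw.rename(columns=str.lower).assign(
    price=lambda d: d["price"].str.replace("$", "", regex=False).astype(float)
)

print(error_name)
print(fixed)
```

The first attempt fails before assign() even runs, because Python evaluates raw["price"] while building the arguments. The lambda defers that lookup until the renamed frame exists.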

5. Using pipe() for Custom Functions

If you have complex cleaning logic that doesn't fit in an .assign(), write a custom function and insert it into the chain using .pipe().

python
def remove_outliers(df, col):
    """Custom function to remove IQR outliers"""
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower, upper = Q1 - 1.5*IQR, Q3 + 1.5*IQR
    return df[(df[col] >= lower) & (df[col] <= upper)]

# Insert custom function into the chain
final_df = (
    clean_df
    .pipe(remove_outliers, col='price')  # Passes the moving dataframe into the function
    .reset_index(drop=True)
)
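
To try the same idea on its own, here is a self-contained sketch with made-up prices. It repeats the remove_outliers logic from this section and adds log_shape, a hypothetical pass-through helper (not part of Pandas) that prints the row count so you can see what each step did.

```python
import pandas as pd

def remove_outliers(df, col):
    """Keep rows whose `col` value lies within 1.5 * IQR of the quartiles."""
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[(df[col] >= lower) & (df[col] <= upper)]

def log_shape(df, label):
    """Hypothetical helper: print the shape, then return df unchanged."""
    print(f"{label}: {df.shape[0]} rows x {df.shape[1]} columns")
    return df

# Made-up prices with one obvious outlier (100.0)
prices = pd.DataFrame({"price": [10.0, 12.0, 11.0, 13.0, 12.0, 100.0]})

trimmed = (
    prices
    .pipe(log_shape, label="before")
    .pipe(remove_outliers, col="price")  # calls remove_outliers(<current frame>, col="price")
    .pipe(log_shape, label="after")
    .reset_index(drop=True)
)

print(trimmed)
```

With this data the quartiles are 11.25 and 12.75, so the accepted range is 9.0 to 15.0 and only the 100.0 row is dropped. Because log_shape returns its input untouched, it can be added to or removed from a chain without changing the result.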

6. Mini Project: Ecommerce Data Cleaner

Combine everything into a single reusable function for an e-commerce company.

python
import pandas as pd

def clean_ecommerce_data(filepath):
    print("Initializing cleaning pipeline...")

    # Load (simulated here; in practice use pd.read_csv(filepath))
    raw = pd.DataFrame({
        'Order Date': ['01/15/2024', '02-10-2024'],
        'Status': ['Shipped', 'PENDING '],
        'Total_Amt': ['$150.00', '$2,450.99']
    })

    # Clean
    cleaned = (
        raw
        .rename(columns=lambda c: c.lower().replace(' ', '_'))
        .assign(
            # Unify the separators first so one explicit format parses both dates
            # (assumes every date is month-first)
            order_date = lambda d: pd.to_datetime(
                d['order_date'].str.replace('-', '/', regex=False),
                format='%m/%d/%Y'
            ),
            status = lambda d: d['status'].str.strip().str.title(),
            # Strip everything except digits and the decimal point, then convert
            total_amt = lambda d: pd.to_numeric(
                d['total_amt'].str.replace(r'[^\d.]', '', regex=True),
                errors='coerce'
            )
        )
    )

    # Validate
    assert cleaned['total_amt'].isnull().sum() == 0, "Missing values in Total Amount!"

    print("Pipeline complete.")
    return cleaned

result = clean_ecommerce_data('dummy.csv')
print("\n", result)
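
One caveat about the validation step: Python skips assert statements when it runs with the -O flag, so a check written as an assert can silently disappear. A possible alternative, sketched below, is a small validation function that raises an exception and can be dropped into a chain with .pipe(). The name require_no_missing and the sample frames are hypothetical, chosen only for illustration.

```python
import pandas as pd

def require_no_missing(df, col):
    """Hypothetical helper: raise if `col` has missing values, else pass df through."""
    n_missing = int(df[col].isna().sum())
    if n_missing:
        raise ValueError(f"{n_missing} missing value(s) in column '{col}'")
    return df

# Made-up frames: one clean, one with a missing amount
good = pd.DataFrame({"total_amt": [150.0, 2450.99]})
bad = pd.DataFrame({"total_amt": [150.0, float("nan")]})

# Passes through unchanged when the column is complete
checked = good.pipe(require_no_missing, col="total_amt")

# Raises ValueError when a value is missing
message = None
try:
    bad.pipe(require_no_missing, col="total_amt")
except ValueError as exc:
    message = str(exc)

print(message)
```

Because the helper returns the DataFrame it received, it can sit anywhere in a chain as a checkpoint without altering the data.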

7. Common Mistakes

  • SettingWithCopyWarning: A notorious Pandas warning (not an error) that appears when you filter a dataframe, e.g. sub_df = df[df['A'] > 5], and then try to modify the result, e.g. sub_df['B'] = 1. Pandas cannot tell whether you meant to change the original or a copy. Method chaining sidesteps this pattern because each step explicitly returns a new dataframe instead of modifying a filtered slice in place.
  • Making chains too long: If a chain exceeds about 10 steps, it becomes hard to debug. Break it into two logical chains (e.g., df_clean_text = (...) then df_final = (df_clean_text...)).
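
The filter-then-modify pattern from the first point, and two ways around it, can be sketched as follows. The data is made up, and whether the risky lines actually emit the warning depends on your Pandas version, so they are left commented out.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 6, 9], "B": [0, 0, 0]})

# Risky pattern (may trigger SettingWithCopyWarning, depending on the Pandas version):
# sub_df = df[df["A"] > 5]
# sub_df["B"] = 1

# Option 1: take an explicit copy before modifying it
sub_copy = df[df["A"] > 5].copy()
sub_copy["B"] = 1

# Option 2: filter and modify in one chain; assign() returns a new DataFrame
sub_chain = df.loc[lambda d: d["A"] > 5].assign(B=1)

print(sub_chain)
print(df)  # the original is untouched in both options
```

Both options make the intent explicit: the result is a new object, and df keeps its original values.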

8. MCQs

Question 1

What is Pandas method chaining?

Question 2

To format a method chain across multiple lines for readability, you should wrap the entire expression in?

Question 3

Inside a method chain, how do you create or modify a column?

Question 4

Why use a lambda function inside .assign() in a method chain?

Question 5

How can you apply a custom Python function inside a Pandas method chain?

Question 6

Method chaining helps avoid which common Pandas warning?

Question 7

What does .rename(columns=lambda x: x.lower()) do?

Question 8

What is a disadvantage of method chaining?

Question 9

Which method is used to reset row numbers after filtering rows in a chain?

Question 10

Imperative coding assigns variables at every step. Method chaining represents what programming paradigm?

9. Interview Questions

  • Q: Explain the concept of Method Chaining in Pandas. What are its pros and cons?
  • Q: How do you resolve a SettingWithCopyWarning in Pandas?

10. Summary

Method chaining transforms messy, imperative scripts into elegant, functional data pipelines. Wrap the chain in parentheses (). Use .rename() for headers, .drop_duplicates() for rows, and .assign(col_name = lambda d: d['col_name'].operation()) for column transformations. If you need complex logic, write a custom function and pass it into the chain using .pipe().

11. Next Chapter Recommendation

In Chapter 13: Cleaning Data with SQL, we step away from Python to learn how to clean data directly in relational databases using SQL queries, UPDATE statements, and string functions.
