CHAPTER 12 Beginner

Cleaning Data with Pandas

Updated: May 18, 2026

1. Chapter Introduction

You now know how to handle missing values, duplicates, bad types, and messy strings individually. However, professional data scientists don't run 50 separate, disjointed lines of code to clean a dataset. They write elegant, efficient, and reproducible data pipelines. This chapter synthesizes previous concepts into cohesive Pandas workflows, focusing on the powerful concept of "Method Chaining."

2. The Problem with Step-by-Step Cleaning

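Beginners often write code that reassigns variables at every step. This repeats the DataFrame name on each line, invites copy-paste errors, and obscures the overall flow. When each step gets its own variable name, it also clutters the namespace and keeps intermediate copies in memory.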

The Beginner (Imperative) Way:

python

# Bad Practice: Hard to read, prone to copy-paste errors
import pandas as pd

df = pd.read_csv('data.csv')
df = df.drop_duplicates()
df['name'] = df['name'].str.lower()
df['name'] = df['name'].str.strip()
# regex=False treats '$' as a literal character, not the regex end-of-string anchor
df['price'] = df['price'].str.replace('$', '', regex=False)
df['price'] = pd.to_numeric(df['price'])
df = df.dropna(subset=['price'])

3. The Professional Way: Method Chaining

Pandas methods generally return a new DataFrame. This means you can chain methods together end-to-end. Wrap the entire chain in parentheses () to format it nicely across multiple lines.
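As a minimal sketch of why chaining works, the snippet below uses a small made-up DataFrame (the column name and values are assumptions for illustration) to show that a method call returns a new object and leaves the original untouched.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration
df = pd.DataFrame({"name": ["Alice ", "BOB", "BOB"]})

# drop_duplicates() returns a new DataFrame; df itself is not modified
deduped = df.drop_duplicates()

# Because each call returns a DataFrame, the next call can follow directly
chained = (
    df
    .drop_duplicates()
    .reset_index(drop=True)
)

print(len(df), len(deduped), len(chained))
```

The parentheses are what allow each method to sit on its own line without backslash continuations.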

The Professional (Functional) Way:

python

import pandas as pd
import numpy as np

# Sample messy data
raw_data = pd.DataFrame({
    '  Cust_ID  ': [1, 2, 2, 4, 5],
    'Name': ['Alice ', 'BOB', 'BOB', 'David', np.nan],
    'Price': ['$10.50', 'Free', 'Free', '$22.00', '$15.00']
})

# Method Chaining Pipeline
clean_df = (
    raw_data
    # 1. Standardize column names (using rename with a lambda function)
    .rename(columns=lambda x: x.strip().lower())

    # 2. Drop duplicates
    .drop_duplicates()

    # 3. Clean strings using assign() to create/modify columns within a chain
    .assign(
        name = lambda df_: df_['name'].str.title().str.strip(),
        price = lambda df_: df_['price'].str.replace(r'[^\d.]', '', regex=True)
    )

    # 4. Convert types (Free becomes NaN, then we fill it)
    .assign(
        price = lambda df_: pd.to_numeric(df_['price'], errors='coerce').fillna(0)
    )

    # 5. Filter rows (Drop missing names)
    .dropna(subset=['name'])
)

print("=== CLEANED PIPELINE OUTPUT ===")
print(clean_df)

4. Why Use `assign()` and `lambda`?

In a method chain, the DataFrame is constantly changing. If you try to do df['price'].str.replace() inside the chain, df refers to the *original* dataframe, not the intermediate one currently moving through the chain.

Using .assign(new_col = lambda current_df: current_df['old_col'] * 2) ensures you are working with the data as it exists at that step in the pipeline.
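The difference can be sketched with a small made-up example (the data and the `df_` parameter name are assumptions for illustration). After the rename step, the original variable still has the old column name, so only the lambda sees the renamed column.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"Price": ["$10.50", "$22.00"]})

result = (
    df
    .rename(columns=str.lower)
    # Writing df["price"] here would raise a KeyError: df still has "Price".
    # The lambda receives the renamed, intermediate DataFrame instead.
    .assign(
        price=lambda df_: pd.to_numeric(
            df_["price"].str.replace("$", "", regex=False)
        )
    )
)

print(result)
```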

5. Using `pipe()` for Custom Functions

If you have complex cleaning logic that doesn't fit in an .assign(), write a custom function and insert it into the chain using .pipe().

python

def remove_outliers(df, col):
    """Custom function to remove IQR outliers"""
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower, upper = Q1 - 1.5*IQR, Q3 + 1.5*IQR
    return df[(df[col] >= lower) & (df[col] <= upper)]

# Insert custom function into the chain
final_df = (
    clean_df
    .pipe(remove_outliers, col='price') # Passes the moving dataframe into the function
    .reset_index(drop=True)
)
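To make the IQR arithmetic concrete, here is a small worked sketch on made-up prices (the values are assumptions chosen so the quartiles are easy to verify by hand). It repeats the function so the snippet runs on its own.

```python
import pandas as pd

def remove_outliers(df, col):
    """Keep rows whose value lies within 1.5 * IQR of the quartiles."""
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[(df[col] >= lower) & (df[col] <= upper)]

# Hypothetical prices: Q1 = 11, Q3 = 13, IQR = 2, accepted range [8, 16]
prices = pd.DataFrame({"price": [10.0, 11.0, 12.0, 13.0, 100.0]})

trimmed = prices.pipe(remove_outliers, col="price").reset_index(drop=True)

# .pipe(f, col="price") is equivalent to calling f(prices, col="price")
same = remove_outliers(prices, col="price").reset_index(drop=True)

print(trimmed)
```

Only the 100.0 row falls outside the accepted range, so four rows remain.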

6. Mini Project: Ecommerce Data Cleaner

Combine everything into a reusable function for an e-commerce company.

python

import pandas as pd

def clean_ecommerce_data(filepath):
    print("Initializing cleaning pipeline...")

    # Load
    raw = pd.DataFrame({
        'Order Date': ['01/15/2024', '02-10-2024'],
        'Status': ['Shipped', 'PENDING '],
        'Total_Amt': ['$150.00', '$2,450.99']
    }) # Simulating pd.read_csv(filepath)

    # Clean
    cleaned = (
        raw
        .rename(columns=lambda c: c.lower().replace(' ', '_'))
        .assign(
            # Unify the separators first: pandas 2.0+ infers one format
            # from the first value and raises on values that do not match it
            order_date = lambda d: pd.to_datetime(
                d['order_date'].str.replace('-', '/', regex=False),
                format='%m/%d/%Y'
            ),
            status = lambda d: d['status'].str.strip().str.title(),
            total_amt = lambda d: pd.to_numeric(
                d['total_amt'].str.replace(r'[^\d.]', '', regex=True),
                errors='coerce'
            )
        )
    )

    # Validate
    assert cleaned['total_amt'].isnull().sum() == 0, "Missing values in Total Amount!"

    print("Pipeline complete.")
    return cleaned

result = clean_ecommerce_data('dummy.csv')
print("\n", result)
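The function above simulates the load step with a hard-coded DataFrame. As a hedged sketch of how the same chain might start from `pd.read_csv`, the snippet below reads from an in-memory `io.StringIO` buffer; the CSV text is made up for illustration, and a real file path would take its place.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = io.StringIO(
    'Order Date,Status,Total_Amt\n'
    '01/15/2024,Shipped,$150.00\n'
    '02/10/2024,PENDING ,"$2,450.99"\n'
)

orders = (
    pd.read_csv(csv_text)
    .rename(columns=lambda c: c.lower().replace(" ", "_"))
    .assign(
        order_date=lambda d: pd.to_datetime(d["order_date"], format="%m/%d/%Y"),
        status=lambda d: d["status"].str.strip().str.title(),
        total_amt=lambda d: pd.to_numeric(
            d["total_amt"].str.replace(r"[^\d.]", "", regex=True),
            errors="coerce",
        ),
    )
)

print(orders.dtypes)
```

Starting the chain at `pd.read_csv` keeps loading and cleaning in one expression, so no half-cleaned intermediate variable is left behind.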

7. Common Mistakes

SettingWithCopyWarning: A notorious Pandas warning (not an error) that appears when you filter a dataframe, e.g. sub_df = df[df['A'] > 5], and then try to modify the result, e.g. sub_df['B'] = 1. Method chaining largely avoids this because each step explicitly generates a new dataframe instead of modifying a filtered subset in place.
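Two common ways around the warning can be sketched as follows, using a made-up three-row DataFrame (the data and variable names are assumptions for illustration): take an explicit copy before modifying, or filter and add the column inside one chain.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"A": [3, 6, 9]})

# Option 1: take an explicit copy before modifying the filtered subset
sub_df = df[df["A"] > 5].copy()
sub_df["B"] = 1

# Option 2: filter and add the column in one chain; nothing is modified in place
chained = (
    df
    .loc[lambda d: d["A"] > 5]
    .assign(B=1)
)

print(chained)
```

Both options produce the same result and leave `df` unchanged.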

Making chains too long: Once a chain grows past roughly 10 steps, it becomes hard to debug. Break it into two logical chains (e.g., df_clean_text = (...) then df_final = (df_clean_text...)).
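One way to keep long pipelines debuggable is sketched below: the work is split into two logical chains, and a small pass-through helper reports the shape at chosen steps. The helper `log_shape`, the sample data, and the variable names are all hypothetical, not part of Pandas.

```python
import pandas as pd

def log_shape(df, label):
    """Hypothetical debugging helper: print the shape, then pass the data on."""
    print(f"{label}: {df.shape}")
    return df

# Hypothetical sample data
raw = pd.DataFrame({
    "name": [" alice", "BOB", "BOB", None],
    "price": ["$5", "$7", "$7", "$9"],
})

# First logical chain: row and text cleanup
df_clean_text = (
    raw
    .drop_duplicates()
    .pipe(log_shape, label="after drop_duplicates")
    .dropna(subset=["name"])
    .assign(name=lambda d: d["name"].str.strip().str.title())
)

# Second logical chain: type conversion
df_final = (
    df_clean_text
    .assign(price=lambda d: pd.to_numeric(d["price"].str.replace("$", "", regex=False)))
    .pipe(log_shape, label="after price conversion")
    .reset_index(drop=True)
)
```

Because the helper returns its input unchanged, it can be inserted or removed at any step without affecting the result.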

8. MCQs

Question 1

What is Pandas method chaining?

Question 2

To format a method chain across multiple lines for readability, you should wrap the entire expression in?

Question 3

Inside a method chain, how do you create or modify a column?

Question 4

Why use a lambda function inside .assign() in a method chain?

Question 5

How can you apply a custom Python function inside a Pandas method chain?

Question 6

Method chaining helps avoid which common Pandas warning?

Question 7

What does .rename(columns=lambda x: x.lower()) do?

Question 8

What is a disadvantage of method chaining?

Question 9

Which method is used to reset row numbers after filtering rows in a chain?

Question 10

Imperative coding assigns variables at every step. Method chaining represents what programming paradigm?

9. Interview Questions

Q: Explain the concept of Method Chaining in Pandas. What are its pros and cons?

Q: How do you resolve a SettingWithCopyWarning in Pandas?

10. Summary

Method chaining transforms messy, imperative scripts into elegant, functional data pipelines. Wrap the chain in parentheses (). Use .rename() for headers, .drop_duplicates() for rows, and .assign(col_name = lambda d: d['col_name'].operation()) for column transformations. If you need complex logic, write a custom function and pass it into the chain using .pipe().

11. Next Chapter Recommendation

In Chapter 13: Cleaning Data with SQL, we step away from Python to learn how to clean data directly in relational databases using SQL queries, UPDATE statements, and string functions.
