CHAPTER 09 Beginner

String Cleaning and Text Processing

Updated: May 18, 2026

5 min read

# CHAPTER 9

String Cleaning and Text Processing

1. Chapter Introduction

User-generated text is notoriously messy. A state column might contain "NY", "N.Y.", "New York", and " new york ". If you group by this column, you'll get four different rows instead of one. This chapter teaches you how to normalize text, strip hidden whitespace, and use the sheer power of Regular Expressions (Regex) to extract clean data from messy strings.

2. The `.str` Accessor in Pandas

Pandas provides a .str accessor that lets you apply Python string methods to an entire column at once.

python

12345678910

import pandas as pd

df = pd.DataFrame({
    &#039;name': ['  Alice Smith  ', 'BOB JONES', 'charlie brown', 'David  Lee'],
    &#039;state': ['NY', 'n.y.', 'New York', ' ny '],
    &#039;phone': ['(555) 123-4567', '555.123.4568', '555-123-4569', '1234567890']
})

print("=== RAW DATA ===")
print(df)

3. Normalization: Case and Whitespace

The first step in string cleaning is making everything look uniform.

python

123456789101112

# 1. Strip leading and trailing whitespace
df[&#039;name_clean'] = df['name'].str.strip()

# 2. Standardize case (title case for names, upper for state codes)
df[&#039;name_clean'] = df['name_clean'].str.title()

# Fix internal multiple spaces (David  Lee -> David Lee)
# replace '  ' (two spaces) with ' ' (one space)
df[&#039;name_clean'] = df['name_clean'].str.replace(r'\s+', ' ', regex=True)

print("\n=== CLEANED NAMES ===")
print(df[[&#039;name', 'name_clean']])

4. Replacing and Standardizing Strings

Often you need to map messy variations to a single standard format.

python

123456789101112131415

# Standardize state to uppercase and strip spaces
df[&#039;state_clean'] = df['state'].str.strip().str.upper()

# Remove punctuation (like periods in N.Y.)
df[&#039;state_clean'] = df['state_clean'].str.replace('.', '', regex=False)

# Map variations to a standard code
state_map = {
    &#039;NEW YORK': 'NY'
}
# Replace specific strings using a dictionary
df[&#039;state_clean'] = df['state_clean'].replace(state_map)

print("\n=== CLEANED STATES ===")
print(df[[&#039;state', 'state_clean']])

5. Regex: The Ultimate Text Tool

Regular Expressions (Regex) define a search pattern. They are complex but incredibly powerful for extracting or replacing data.

*Basic Regex cheat sheet:*

\d: Any digit (0-9)

\D: Any non-digit

\s: Any whitespace

\w: Any word character (a-z, 0-9, _)

^: Start of string

$: End of string

python

123456

# Clean Phone Numbers: We only want the 10 digits
# \D matches anything that is NOT a digit. We replace it with nothing ('')
df[&#039;phone_clean'] = df['phone'].str.replace(r'\D', '', regex=True)

print("\n=== CLEANED PHONES ===")
print(df[[&#039;phone', 'phone_clean']])

6. Extracting Data with Regex

Sometimes data is trapped inside a longer string (e.g., extracting an invoice number from a description).

python

1234567891011121314

invoices = pd.DataFrame({
    &#039;desc': [
        &#039;Paid INV-1045 on Tuesday',
        &#039;Invoice INV-3390 processed',
        &#039;Check for INV-0021'
    ]
})

# Extract pattern: "INV-" followed by exactly 4 digits
# The parenthesis () capture the specific part we want to keep
invoices[&#039;invoice_num'] = invoices['desc'].str.extract(r'(INV-\d{4})')

print("\n=== EXTRACTED DATA ===")
print(invoices)

7. Splitting Strings

You often need to split a single column into multiple columns.

python

1234567891011121314

names = pd.DataFrame({&#039;full_name': ['Smith, John', 'Doe, Jane', 'Bond, James']})

# Split on the comma and space
# expand=True returns a DataFrame with multiple columns instead of a list
split_names = names[&#039;full_name'].str.split(', ', expand=True)

# Rename the new columns
split_names.columns = [&#039;last_name', 'first_name']

# Combine with original
names = pd.concat([names, split_names], axis=1)

print("\n=== SPLIT NAMES ===")
print(names)

8. Common Mistakes

Forgetting regex=True (or False): In Pandas, str.replace('.', '') might replace EVERY character if it assumes '.' is a regex wildcard. Use regex=False for literal characters, or regex=True when using patterns like \d.

Not standardizing case before grouping: If you group sales by "Apple" and "apple", they are treated as two different companies. Always run .str.lower() or .str.upper() before doing aggregations.

9. MCQs

Question 1

Which accessor allows you to use string methods on a Pandas Series?

Question 2

What does `str.strip()` do?

Question 3

To fix "BOB JONES" to "Bob Jones", which method is best?

Question 4

What is a Regular Expression (Regex)?

Question 5

In regex, what does `\d` match?

Question 6

How do you remove all non-numeric characters (like dashes and parentheses) from a phone number?

Question 7

What does str.split(',', expand=True) do?

Question 8

If you want to replace the exact literal string ".", what should you ensure?

Question 9

To extract specific data from a string using str.extract(), your regex must contain:

Question 10

Why is string normalization important before analysis?

10. Interview Questions

Q: A column contains product IDs like "PROD1234ABC", "PROD56_XY". Write a Pandas command to extract just the numeric part.

Q: How do you handle a column where city names are typed inconsistently (e.g., "la", "L.A.", "Los Angeles")?

11. Summary

Text data requires rigorous normalization. Use .str.strip() to remove invisible whitespace, .str.upper()/lower() to standardize casing, and .replace() with dictionaries for exact matches. For complex extractions and cleaning (like formatting phone numbers or isolating invoice IDs), Regular Expressions (\d, \D, str.extract) are essential tools in your data cleaning arsenal.

12. Next Chapter Recommendation

In Chapter 10: Date and Time Cleaning, we tackle the most frustrating data type of all: timestamps. We'll learn to parse mixed date formats and handle timezone inconsistencies.

Browse All 21+ Subject Areas

Quick Links

Visual Algorithm Labs

Frontend Dev

The Future of Web Architecture in 2026

Practice Quizzes

String Cleaning and Text Processing #

1. Chapter Introduction #

2. The .str Accessor in Pandas #

3. Normalization: Case and Whitespace #

4. Replacing and Standardizing Strings #

5. Regex: The Ultimate Text Tool #

6. Extracting Data with Regex #

7. Splitting Strings #

8. Common Mistakes #

9. MCQs #

Which accessor allows you to use string methods on a Pandas Series?

What does str.strip() do?

To fix "BOB JONES" to "Bob Jones", which method is best?

What is a Regular Expression (Regex)?

In regex, what does \d match?

How do you remove all non-numeric characters (like dashes and parentheses) from a phone number?

What does str.split(',', expand=True) do?

If you want to replace the exact literal string ".", what should you ensure?

To extract specific data from a string using str.extract(), your regex must contain:

Why is string normalization important before analysis?

10. Interview Questions #

11. Summary #

12. Next Chapter Recommendation #

Finish this Chapter

Discussion

Explore More

📖 Related Tutorials 5

🧪 Related Labs 1

🗺️ Related Roadmaps 1

Send Feedback / Bug

Feedback Submitted!

String Cleaning and Text Processing

1. Chapter Introduction

2. The `.str` Accessor in Pandas

3. Normalization: Case and Whitespace

4. Replacing and Standardizing Strings

5. Regex: The Ultimate Text Tool

6. Extracting Data with Regex

7. Splitting Strings

8. Common Mistakes

9. MCQs

What does `str.strip()` do?

In regex, what does `\d` match?

What does `str.split(',', expand=True)` do?

To extract specific data from a string using `str.extract()`, your regex must contain:

10. Interview Questions

11. Summary

12. Next Chapter Recommendation