Skip to main content
Data Cleaning
CHAPTER 09 Beginner

String Cleaning and Text Processing

Updated: May 18, 2026
5 min read

# CHAPTER 9

String Cleaning and Text Processing

1. Chapter Introduction

User-generated text is notoriously messy. A state column might contain "NY", "N.Y.", "New York", and " new york ". If you group by this column, you'll get four different rows instead of one. This chapter teaches you how to normalize text, strip hidden whitespace, and use the sheer power of Regular Expressions (Regex) to extract clean data from messy strings.

2. The .str Accessor in Pandas

Pandas provides a .str accessor that lets you apply Python string methods to an entire column at once.

python
12345678910
import pandas as pd

df = pd.DataFrame({
    'name': ['  Alice Smith  ', 'BOB JONES', 'charlie brown', 'David  Lee'],
    'state': ['NY', 'n.y.', 'New York', ' ny '],
    'phone': ['(555) 123-4567', '555.123.4568', '555-123-4569', '1234567890']
})

print("=== RAW DATA ===")
print(df)

3. Normalization: Case and Whitespace

The first step in string cleaning is making everything look uniform.

python
123456789101112
# 1. Strip leading and trailing whitespace
df['name_clean'] = df['name'].str.strip()

# 2. Standardize case (title case for names, upper for state codes)
df['name_clean'] = df['name_clean'].str.title()

# Fix internal multiple spaces (David  Lee -> David Lee)
# replace '  ' (two spaces) with ' ' (one space)
df['name_clean'] = df['name_clean'].str.replace(r'\s+', ' ', regex=True)

print("\n=== CLEANED NAMES ===")
print(df[['name', 'name_clean']])

4. Replacing and Standardizing Strings

Often you need to map messy variations to a single standard format.

python
123456789101112131415
# Standardize state to uppercase and strip spaces
df['state_clean'] = df['state'].str.strip().str.upper()

# Remove punctuation (like periods in N.Y.)
df['state_clean'] = df['state_clean'].str.replace('.', '', regex=False)

# Map variations to a standard code
state_map = {
    'NEW YORK': 'NY'
}
# Replace specific strings using a dictionary
df['state_clean'] = df['state_clean'].replace(state_map)

print("\n=== CLEANED STATES ===")
print(df[['state', 'state_clean']])

5. Regex: The Ultimate Text Tool

Regular Expressions (Regex) define a search pattern. They are complex but incredibly powerful for extracting or replacing data.

*Basic Regex cheat sheet:*

  • \d: Any digit (0-9)
  • \D: Any non-digit
  • \s: Any whitespace
  • \w: Any word character (a-z, 0-9, _)
  • ^: Start of string
  • $: End of string

python
123456
# Clean Phone Numbers: We only want the 10 digits
# \D matches anything that is NOT a digit. We replace it with nothing ('')
df['phone_clean'] = df['phone'].str.replace(r'\D', '', regex=True)

print("\n=== CLEANED PHONES ===")
print(df[['phone', 'phone_clean']])

6. Extracting Data with Regex

Sometimes data is trapped inside a longer string (e.g., extracting an invoice number from a description).

python
1234567891011121314
invoices = pd.DataFrame({
    'desc': [
        'Paid INV-1045 on Tuesday',
        'Invoice INV-3390 processed',
        'Check for INV-0021'
    ]
})

# Extract pattern: "INV-" followed by exactly 4 digits
# The parenthesis () capture the specific part we want to keep
invoices['invoice_num'] = invoices['desc'].str.extract(r'(INV-\d{4})')

print("\n=== EXTRACTED DATA ===")
print(invoices)

7. Splitting Strings

You often need to split a single column into multiple columns.

python
1234567891011121314
names = pd.DataFrame({'full_name': ['Smith, John', 'Doe, Jane', 'Bond, James']})

# Split on the comma and space
# expand=True returns a DataFrame with multiple columns instead of a list
split_names = names['full_name'].str.split(', ', expand=True)

# Rename the new columns
split_names.columns = ['last_name', 'first_name']

# Combine with original
names = pd.concat([names, split_names], axis=1)

print("\n=== SPLIT NAMES ===")
print(names)

8. Common Mistakes

  • Forgetting regex=True (or False): In Pandas, str.replace('.', '') might replace EVERY character if it assumes '.' is a regex wildcard. Use regex=False for literal characters, or regex=True when using patterns like \d.
  • Not standardizing case before grouping: If you group sales by "Apple" and "apple", they are treated as two different companies. Always run .str.lower() or .str.upper() before doing aggregations.

9. MCQs

Question 1

Which accessor allows you to use string methods on a Pandas Series?

Question 2

What does str.strip() do?

Question 3

To fix "BOB JONES" to "Bob Jones", which method is best?

Question 4

What is a Regular Expression (Regex)?

Question 5

In regex, what does \d match?

Question 6

How do you remove all non-numeric characters (like dashes and parentheses) from a phone number?

Question 7

What does str.split(',', expand=True) do?

Question 8

If you want to replace the exact literal string ".", what should you ensure?

Question 9

To extract specific data from a string using str.extract(), your regex must contain:

Question 10

Why is string normalization important before analysis?

10. Interview Questions

  • Q: A column contains product IDs like "PROD1234ABC", "PROD56_XY". Write a Pandas command to extract just the numeric part.
  • Q: How do you handle a column where city names are typed inconsistently (e.g., "la", "L.A.", "Los Angeles")?

11. Summary

Text data requires rigorous normalization. Use .str.strip() to remove invisible whitespace, .str.upper()/lower() to standardize casing, and .replace() with dictionaries for exact matches. For complex extractions and cleaning (like formatting phone numbers or isolating invoice IDs), Regular Expressions (\d, \D, str.extract) are essential tools in your data cleaning arsenal.

12. Next Chapter Recommendation

In Chapter 10: Date and Time Cleaning, we tackle the most frustrating data type of all: timestamps. We'll learn to parse mixed date formats and handle timezone inconsistencies.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·