String Cleaning and Text Processing
# CHAPTER 9
String Cleaning and Text Processing
1. Chapter Introduction
User-generated text is notoriously messy. A state column might contain "NY", "N.Y.", "New York", and " new york ". If you group by this column, you'll get four different rows instead of one. This chapter teaches you how to normalize text, strip hidden whitespace, and use the sheer power of Regular Expressions (Regex) to extract clean data from messy strings.
2. The .str Accessor in Pandas
Pandas provides a .str accessor that lets you apply Python string methods to an entire column at once.
3. Normalization: Case and Whitespace
The first step in string cleaning is making everything look uniform.
4. Replacing and Standardizing Strings
Often you need to map messy variations to a single standard format.
5. Regex: The Ultimate Text Tool
Regular Expressions (Regex) define a search pattern. They are complex but incredibly powerful for extracting or replacing data.
*Basic Regex cheat sheet:*
-
\d: Any digit (0-9)
-
\D: Any non-digit
-
\s: Any whitespace
-
\w: Any word character (a-z, 0-9, _)
-
^: Start of string
-
$: End of string
6. Extracting Data with Regex
Sometimes data is trapped inside a longer string (e.g., extracting an invoice number from a description).
7. Splitting Strings
You often need to split a single column into multiple columns.
8. Common Mistakes
-
Forgetting
regex=True(or False): In Pandas,str.replace('.', '')might replace EVERY character if it assumes '.' is a regex wildcard. Useregex=Falsefor literal characters, orregex=Truewhen using patterns like\d.
-
Not standardizing case before grouping: If you group sales by "Apple" and "apple", they are treated as two different companies. Always run
.str.lower()or.str.upper()before doing aggregations.
9. MCQs
Which accessor allows you to use string methods on a Pandas Series?
What does str.strip() do?
To fix "BOB JONES" to "Bob Jones", which method is best?
What is a Regular Expression (Regex)?
In regex, what does \d match?
How do you remove all non-numeric characters (like dashes and parentheses) from a phone number?
What does str.split(',', expand=True) do?
If you want to replace the exact literal string ".", what should you ensure?
To extract specific data from a string using str.extract(), your regex must contain:
Why is string normalization important before analysis?
10. Interview Questions
- Q: A column contains product IDs like "PROD1234ABC", "PROD56_XY". Write a Pandas command to extract just the numeric part.
- Q: How do you handle a column where city names are typed inconsistently (e.g., "la", "L.A.", "Los Angeles")?
11. Summary
Text data requires rigorous normalization. Use.str.strip() to remove invisible whitespace, .str.upper()/lower() to standardize casing, and .replace() with dictionaries for exact matches. For complex extractions and cleaning (like formatting phone numbers or isolating invoice IDs), Regular Expressions (\d, \D, str.extract) are essential tools in your data cleaning arsenal.