Removing Duplicate Data
# CHAPTER 7
Removing Duplicate Data
1. Chapter Introduction
Duplicate data ruins analytics. It inflates revenue calculations, sends multiple marketing emails to the same customer, and biases machine learning models. Duplicates occur during database merges, user retries on web forms, and API scraping. This chapter teaches you how to detect exact and partial duplicates and remove them safely.2. Detecting Exact Duplicates
An exact duplicate means every single column in Row A matches every single column in Row B.
3. Removing Exact Duplicates
Removing exact duplicates is straightforward using drop_duplicates().
4. Subset Duplicates (Partial Duplicates)
Sometimes a row isn't an *exact* match, but it represents the same logical entity. For example, a customer might click "Submit" twice on an order form resulting in two different transaction IDs, but the same customer, amount, and timestamp.
5. Controlling Which Duplicate to Keep
When using drop_duplicates(), Pandas keeps the *first* occurrence and drops the rest by default. You can change this behavior.
6. Fuzzy Duplicates (String Matching)
What if the data entry is slightly different? "Jon Doe" vs "John Doe". Pandas duplicated() won't catch this. You need string normalization first.
7. Common Mistakes
-
Forgetting to sort before keeping 'last': If you want to keep the most recent updated record using
keep='last', you MUST ensure your DataFrame is sorted by the date column first usingdf.sortvalues('updatedat').
-
Dropping based on Name alone: Dropping duplicates based on
subset=['Name']is dangerous. Two different people can be named "John Smith". Always use unique identifiers (Email, User ID) or a combination of columns (Name + DOB + Zip Code).
8. MCQs
Which Pandas function identifies duplicate rows?
By default, df.duplicated() returns True for?
How do you permanently remove duplicate rows in Pandas?
To check for duplicates based ONLY on the 'email' column, use:
If a customer updates their profile 3 times and you want to keep their latest info (assuming chronological order), use:
What does keep=False do in dropduplicates()?
Why might exact duplicate detection miss duplicates?
What happens to the DataFrame index after dropping duplicates?
How do you fix the index gaps after dropping duplicates?
Dropping subset duplicates based on 'First Name' and 'Last Name' is risky because?
9. Interview Questions
- Q: You have a dataset of support tickets. Users sometimes double-click the submit button, creating two tickets a second apart. How would you clean this?
-
Q: Explain the difference between
keep='first',keep='last', andkeep=False. When would you usekeep=False?
10. Summary
Duplicate records silently distort analytical results. Useduplicated().sum() to gauge the problem. Use dropduplicates() for exact matches. For partial duplicates (same entity, different timestamps), use subset=[...] to define what constitutes a duplicate, and sort_values() combined with keep='last' to retain the most recent update. Always normalize text (lowercase, strip whitespace) before checking for string duplicates.