Skip to main content
Data Cleaning
CHAPTER 07 Beginner

Removing Duplicate Data

Updated: May 18, 2026
5 min read

# CHAPTER 7

Removing Duplicate Data

1. Chapter Introduction

Duplicate data ruins analytics. It inflates revenue calculations, sends multiple marketing emails to the same customer, and biases machine learning models. Duplicates occur during database merges, user retries on web forms, and API scraping. This chapter teaches you how to detect exact and partial duplicates and remove them safely.

2. Detecting Exact Duplicates

An exact duplicate means every single column in Row A matches every single column in Row B.

python
1234567891011121314151617181920212223
import pandas as pd

# Dataset with duplicates
df = pd.DataFrame({
    'tx_id': [101, 102, 103, 102, 104, 104],
    'customer': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'David'],
    'amount': [50.0, 75.5, 120.0, 75.5, 200.0, 200.0],
    'date': ['2024-01-01', '2024-01-02', '2024-01-02', '2024-01-02', '2024-01-03', '2024-01-03']
})

print("=== RAW DATA ===")
print(df)

# duplicated() returns a Boolean Series: True if the row is a duplicate of a PREVIOUS row
print("\n=== BOOLEAN DUPLICATE MASK ===")
print(df.duplicated())

# Count exact duplicates
print(f"\nTotal exact duplicate rows: {df.duplicated().sum()}")

# View the duplicate rows
print("\n=== VIEW THE DUPLICATES ===")
print(df[df.duplicated()])

3. Removing Exact Duplicates

Removing exact duplicates is straightforward using drop_duplicates().

python
12345
# Remove exact duplicates and reset index
df_clean = df.drop_duplicates().reset_index(drop=True)

print("\n=== CLEANED EXACT DUPLICATES ===")
print(df_clean)

4. Subset Duplicates (Partial Duplicates)

Sometimes a row isn't an *exact* match, but it represents the same logical entity. For example, a customer might click "Submit" twice on an order form resulting in two different transaction IDs, but the same customer, amount, and timestamp.

python
123456789101112131415
df_partial = pd.DataFrame({
    'tx_id': [101, 102, 103, 104],  # Notice IDs are all unique
    'customer': ['Alice', 'Bob', 'Bob', 'Charlie'],
    'amount': [50.0, 75.5, 75.5, 120.0],
    'time': ['10:00:01', '10:05:00', '10:05:02', '10:30:00'] # Time is slightly different
})

# If we check exact duplicates, it finds 0
print(f"\nExact duplicates: {df_partial.duplicated().sum()}") 

# Check for duplicates based ONLY on specific columns
# E.g., same customer and same amount is suspicious
print("\n=== SUBSET DUPLICATES ===")
subset_dupes = df_partial.duplicated(subset=['customer', 'amount'])
print(df_partial[subset_dupes])

5. Controlling Which Duplicate to Keep

When using drop_duplicates(), Pandas keeps the *first* occurrence and drops the rest by default. You can change this behavior.

python
1234567891011121314151617
users = pd.DataFrame({
    'user_id': [1, 1, 1],
    'email': ['old@mail.com', 'new@mail.com', 'final@mail.com'],
    'updated_at': ['2023-01', '2023-06', '2024-01']
})

# Keep the FIRST occurrence (default)
print("\n=== KEEP FIRST ===")
print(users.drop_duplicates(subset=['user_id'], keep='first'))

# Keep the LAST occurrence (useful if data is sorted chronologically)
print("\n=== KEEP LAST ===")
print(users.drop_duplicates(subset=['user_id'], keep='last'))

# Drop ALL duplicates (keep only records that appear exactly once)
print("\n=== KEEP FALSE (DROP ALL) ===")
print(users.drop_duplicates(subset=['user_id'], keep=False))

6. Fuzzy Duplicates (String Matching)

What if the data entry is slightly different? "Jon Doe" vs "John Doe". Pandas duplicated() won't catch this. You need string normalization first.

python
1234567
dirty_names = pd.DataFrame({'name': ['Alice Smith', 'alice smith', ' ALICE SMITH ']})

# Normalize strings before checking duplicates
dirty_names['clean_name'] = dirty_names['name'].str.strip().str.lower()
print("\n=== AFTER NORMALIZATION ===")
print(dirty_names)
print(f"Duplicates found after cleaning: {dirty_names.duplicated(subset=['clean_name']).sum()}")

7. Common Mistakes

  • Forgetting to sort before keeping 'last': If you want to keep the most recent updated record using keep='last', you MUST ensure your DataFrame is sorted by the date column first using df.sortvalues('updatedat').
  • Dropping based on Name alone: Dropping duplicates based on subset=['Name'] is dangerous. Two different people can be named "John Smith". Always use unique identifiers (Email, User ID) or a combination of columns (Name + DOB + Zip Code).

8. MCQs

Question 1

Which Pandas function identifies duplicate rows?

Question 2

By default, df.duplicated() returns True for?

Question 3

How do you permanently remove duplicate rows in Pandas?

Question 4

To check for duplicates based ONLY on the 'email' column, use:

Question 5

If a customer updates their profile 3 times and you want to keep their latest info (assuming chronological order), use:

Question 6

What does keep=False do in dropduplicates()?

Question 7

Why might exact duplicate detection miss duplicates?

Question 8

What happens to the DataFrame index after dropping duplicates?

Question 9

How do you fix the index gaps after dropping duplicates?

Question 10

Dropping subset duplicates based on 'First Name' and 'Last Name' is risky because?

9. Interview Questions

  • Q: You have a dataset of support tickets. Users sometimes double-click the submit button, creating two tickets a second apart. How would you clean this?
  • Q: Explain the difference between keep='first', keep='last', and keep=False. When would you use keep=False?

10. Summary

Duplicate records silently distort analytical results. Use duplicated().sum() to gauge the problem. Use drop
duplicates() for exact matches. For partial duplicates (same entity, different timestamps), use subset=[...] to define what constitutes a duplicate, and sort_values() combined with keep='last' to retain the most recent update. Always normalize text (lowercase, strip whitespace) before checking for string duplicates.

11. Next Chapter Recommendation

In Chapter 8: Detecting and Handling Outliers, we explore how to identify extreme values that break machine learning models and skew averages, using statistical techniques like IQR and Z-scores.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·