CHAPTER 07 Beginner

Removing Duplicate Data

Updated: May 18, 2026

5 min read

# CHAPTER 7

Removing Duplicate Data

1. Chapter Introduction

Duplicate data ruins analytics. It inflates revenue calculations, sends multiple marketing emails to the same customer, and biases machine learning models. Duplicates occur during database merges, user retries on web forms, and API scraping. This chapter teaches you how to detect exact and partial duplicates and remove them safely.

2. Detecting Exact Duplicates

An exact duplicate means every single column in Row A matches every single column in Row B.

python

1234567891011121314151617181920212223

import pandas as pd

# Dataset with duplicates
df = pd.DataFrame({
    &#039;tx_id': [101, 102, 103, 102, 104, 104],
    &#039;customer': ['Alice', 'Bob', 'Charlie', 'Bob', 'David', 'David'],
    &#039;amount': [50.0, 75.5, 120.0, 75.5, 200.0, 200.0],
    &#039;date': ['2024-01-01', '2024-01-02', '2024-01-02', '2024-01-02', '2024-01-03', '2024-01-03']
})

print("=== RAW DATA ===")
print(df)

# duplicated() returns a Boolean Series: True if the row is a duplicate of a PREVIOUS row
print("\n=== BOOLEAN DUPLICATE MASK ===")
print(df.duplicated())

# Count exact duplicates
print(f"\nTotal exact duplicate rows: {df.duplicated().sum()}")

# View the duplicate rows
print("\n=== VIEW THE DUPLICATES ===")
print(df[df.duplicated()])

3. Removing Exact Duplicates

Removing exact duplicates is straightforward using drop_duplicates().

python

12345

# Remove exact duplicates and reset index
df_clean = df.drop_duplicates().reset_index(drop=True)

print("\n=== CLEANED EXACT DUPLICATES ===")
print(df_clean)

4. Subset Duplicates (Partial Duplicates)

Sometimes a row isn't an *exact* match, but it represents the same logical entity. For example, a customer might click "Submit" twice on an order form resulting in two different transaction IDs, but the same customer, amount, and timestamp.

python

123456789101112131415

df_partial = pd.DataFrame({
    &#039;tx_id': [101, 102, 103, 104],  # Notice IDs are all unique
    &#039;customer': ['Alice', 'Bob', 'Bob', 'Charlie'],
    &#039;amount': [50.0, 75.5, 75.5, 120.0],
    &#039;time': ['10:00:01', '10:05:00', '10:05:02', '10:30:00'] # Time is slightly different
})

# If we check exact duplicates, it finds 0
print(f"\nExact duplicates: {df_partial.duplicated().sum()}") 

# Check for duplicates based ONLY on specific columns
# E.g., same customer and same amount is suspicious
print("\n=== SUBSET DUPLICATES ===")
subset_dupes = df_partial.duplicated(subset=[&#039;customer', 'amount'])
print(df_partial[subset_dupes])

5. Controlling Which Duplicate to Keep

When using drop_duplicates(), Pandas keeps the *first* occurrence and drops the rest by default. You can change this behavior.

python

1234567891011121314151617

users = pd.DataFrame({
    &#039;user_id': [1, 1, 1],
    &#039;email': ['old@mail.com', 'new@mail.com', 'final@mail.com'],
    &#039;updated_at': ['2023-01', '2023-06', '2024-01']
})

# Keep the FIRST occurrence (default)
print("\n=== KEEP FIRST ===")
print(users.drop_duplicates(subset=[&#039;user_id'], keep='first'))

# Keep the LAST occurrence (useful if data is sorted chronologically)
print("\n=== KEEP LAST ===")
print(users.drop_duplicates(subset=[&#039;user_id'], keep='last'))

# Drop ALL duplicates (keep only records that appear exactly once)
print("\n=== KEEP FALSE (DROP ALL) ===")
print(users.drop_duplicates(subset=[&#039;user_id'], keep=False))

6. Fuzzy Duplicates (String Matching)

What if the data entry is slightly different? "Jon Doe" vs "John Doe". Pandas duplicated() won't catch this. You need string normalization first.

python

1234567

dirty_names = pd.DataFrame({&#039;name': ['Alice Smith', 'alice smith', ' ALICE SMITH ']})

# Normalize strings before checking duplicates
dirty_names[&#039;clean_name'] = dirty_names['name'].str.strip().str.lower()
print("\n=== AFTER NORMALIZATION ===")
print(dirty_names)
print(f"Duplicates found after cleaning: {dirty_names.duplicated(subset=[&#039;clean_name']).sum()}")

7. Common Mistakes

Forgetting to sort before keeping 'last': If you want to keep the most recent updated record using keep='last', you MUST ensure your DataFrame is sorted by the date column first using df.sortvalues('updatedat').

Dropping based on Name alone: Dropping duplicates based on subset=['Name'] is dangerous. Two different people can be named "John Smith". Always use unique identifiers (Email, User ID) or a combination of columns (Name + DOB + Zip Code).

8. MCQs

Question 1

Which Pandas function identifies duplicate rows?

Question 2

By default, `df.duplicated()` returns True for?

Question 3

How do you permanently remove duplicate rows in Pandas?

Question 4

To check for duplicates based ONLY on the 'email' column, use:

Question 5

If a customer updates their profile 3 times and you want to keep their latest info (assuming chronological order), use:

Question 6

What does `keep=False` do in `dropduplicates()`?

Question 7

Why might exact duplicate detection miss duplicates?

Question 8

What happens to the DataFrame index after dropping duplicates?

Question 9

How do you fix the index gaps after dropping duplicates?

Question 10

Dropping subset duplicates based on 'First Name' and 'Last Name' is risky because?

9. Interview Questions

Q: You have a dataset of support tickets. Users sometimes double-click the submit button, creating two tickets a second apart. How would you clean this?

Q: Explain the difference between keep='first', keep='last', and keep=False. When would you use keep=False?

10. Summary
Duplicate records silently distort analytical results. Use duplicated().sum() to gauge the problem. Use dropduplicates() for exact matches. For partial duplicates (same entity, different timestamps), use subset=[...] to define what constitutes a duplicate, and sort_values() combined with keep='last' to retain the most recent update. Always normalize text (lowercase, strip whitespace) before checking for string duplicates.

11. Next Chapter Recommendation

In Chapter 8: Detecting and Handling Outliers, we explore how to identify extreme values that break machine learning models and skew averages, using statistical techniques like IQR and Z-scores.

Browse All 21+ Subject Areas

Quick Links

Visual Algorithm Labs

Frontend Dev

The Future of Web Architecture in 2026

Practice Quizzes

Removing Duplicate Data #

1. Chapter Introduction #

2. Detecting Exact Duplicates #

3. Removing Exact Duplicates #

4. Subset Duplicates (Partial Duplicates) #

5. Controlling Which Duplicate to Keep #

6. Fuzzy Duplicates (String Matching) #

7. Common Mistakes #

8. MCQs #

Which Pandas function identifies duplicate rows?

By default, df.duplicated() returns True for?

How do you permanently remove duplicate rows in Pandas?

To check for duplicates based ONLY on the 'email' column, use:

If a customer updates their profile 3 times and you want to keep their latest info (assuming chronological order), use:

What does keep=False do in dropduplicates()?

Why might exact duplicate detection miss duplicates?

What happens to the DataFrame index after dropping duplicates?

How do you fix the index gaps after dropping duplicates?

Dropping subset duplicates based on 'First Name' and 'Last Name' is risky because?

9. Interview Questions #

10. Summary #

11. Next Chapter Recommendation #

Finish this Chapter

Discussion

Explore More

📖 Related Tutorials 5

🧪 Related Labs 1

🗺️ Related Roadmaps 1

Send Feedback / Bug

Feedback Submitted!

Removing Duplicate Data

1. Chapter Introduction

2. Detecting Exact Duplicates

3. Removing Exact Duplicates

4. Subset Duplicates (Partial Duplicates)

5. Controlling Which Duplicate to Keep

6. Fuzzy Duplicates (String Matching)

7. Common Mistakes

8. MCQs

By default, `df.duplicated()` returns True for?

What does `keep=False` do in `dropduplicates()`?

9. Interview Questions

10. Summary

11. Next Chapter Recommendation