Skip to main content
Data Cleaning
CHAPTER 17 Beginner

Real-World Data Cleaning Projects

Updated: May 18, 2026
5 min read

# CHAPTER 17

Real-World Data Cleaning Projects

1. Chapter Introduction

Theory and syntax only take you so far. Real-world data is inherently complex, unstructured, and domain-specific. The way you clean financial transactions is entirely different from how you clean patient medical records. This chapter outlines the architecture and common challenges of 5 real-world data cleaning projects you can build for your portfolio.

2. Project 1: CRM Customer Records Cleanup

The Scenario: A company merged with a competitor. They combined their Salesforce exports, resulting in massive duplication and inconsistent formatting.

The Workflow:

  1. 1. Normalization: Convert all Email addresses to lowercase and .strip() whitespace.
  1. 2. Standardization: Map 'USA', 'US', 'United States' to standard ISO codes using a dictionary replace.
  1. 3. Regex Extraction: Extract clean 10-digit phone numbers from (555) 123-4567, 555.123.4567, etc.
  1. 4. Fuzzy Matching: Use algorithms (like Levenshtein distance) or normalize names (e.g., 'Jon Doe' vs 'Jonathan Doe') to find near-duplicates.
  1. 5. Deduplication: Use .dropduplicates(subset=['email'], keep='last') assuming the most recently updated record is the most accurate.

3. Project 2: Sales Data Preprocessing for Forecasting

The Scenario: You need to forecast next month's sales, but the historical POS (Point of Sale) data is full of errors, refunds, and missing dates.

The Workflow:

  1. 1. Datetime Conversion: Use pd.todatetime() on the transaction date. Create new features: Year, Month, DayOfWeek using the .dt accessor.
  1. 2. Handling Refunds: Identify rows where Quantity or Price is negative. Separate these into a returnsdf and keep only positive sales in the salesdf.
  1. 3. Imputation: If ProductCategory is missing, impute it by mapping the ProductID to a known catalog, or use the mode (most frequent category) for that specific store.
  1. 4. Aggregation: Group by Date and Store to get daily totals.
  1. 5. Time Series Gaps: If a store was closed on a Sunday, the date might be missing entirely. Reindex the dataframe to include all calendar days and .fillna(0) for closed days.

4. Project 3: Healthcare Patient Data Formatting

The Scenario: Predicting patient readmission rates based on Electronic Health Records (EHR). Healthcare data is highly sensitive and prone to missing clinical test results.

The Workflow:

  1. 1. Validation (Sanity Checks): Use assertions to ensure AdmissionDate is <= DischargeDate.
  1. 2. Outlier Treatment: Medical data has true outliers (e.g., very high blood pressure). Do NOT delete them. Use clinical thresholds (e.g., Blood Sugar > 600) to cap values (Winsorization) or flag them for review.
  1. 3. MNAR Handling (Missing Not At Random): If a patient didn't get an MRI, the MRIResult is missing. Don't impute this with the mean! Fill it with a category like "TestNotAdministered", as the *absence* of the test is a clinical indicator itself.
  1. 4. Encoding: One-Hot Encode categorical diagnoses codes (ICD-10) for the machine learning model.

5. Project 4: Financial Transaction Cleaning (Anti-Money Laundering)

The Scenario: Processing millions of bank transactions to detect fraudulent or structured money transfers.

The Workflow:

  1. 1. Type Casting: Ensure all Amount strings (like "$10,000.00") are stripped of currency symbols and cast strictly to float64.
  1. 2. Outlier Detection: Use the IQR method to flag transactions that are statistically anomalous for a specific AccountID.
  1. 3. Categorization: Use Regex to search the TransactionDescription text (e.g., "UBER EATS", "AMZN MKT") and map them to standard categories ("Food", "Shopping").
  1. 4. Handling Zeros: A transaction amount of 0.00 might be a ping to verify an account. Filter these out of volume calculations.

6. Project 5: HR Employee Dataset Cleaning

The Scenario: Analyzing employee churn (turnover). The data comes from manual HR Excel sheets, meaning lots of human data entry errors.

The Workflow:

  1. 1. Data Types: Convert HireDate and TerminationDate to datetime.
  1. 2. Logic Calculation: Create a TenureDays column (TerminationDate - HireDate). For current employees, TerminationDate will be NaN. Fill the NaN with today's date before calculating tenure.
  1. 3. Categorical Standardization: Standardize Department names (e.g., "Dev", "Software", "Engineering" all map to "Engineering").
  1. 4. Boolean Conversion: Convert IsManager from "Yes"/"No" strings to True/False booleans.

7. Common Mistakes in Projects

  • Cleaning without a goal: You don't need to clean every single column perfectly. Focus heavily on the features required for your specific analysis or machine learning model.
  • Ignoring Data Leakage in ML Projects: If you use the entire dataset to calculate the mean for imputation, and then split into Train/Test, your Training data has learned from the Test data. Always split *before* imputing with statistical measures.

8. MCQs

Question 1

In a CRM cleanup, what is the best strategy for handling names like "jon doe" and "JON DOE"?

Question 2

When preparing sales data for daily forecasting, what must you do if a day had zero sales and is missing from the dataset?

Question 3

In healthcare data, replacing a missing "Cancer Test Result" with the average test result is?

Question 4

To extract standard categories ("Food") from messy bank descriptions ("UBER EATS PENDING"), what tool is best?

Question 5

How do you calculate employee tenure for people who still work at the company (Termination Date is NULL)?

Question 6

What does "Winsorization" mean in the context of healthcare outliers?

Question 7

If you are cleaning data to train a Machine Learning model, when should you calculate the Mean for imputation?

Question 8

In financial data, a string like "-$1,500.00" requires what step before doing math?

Question 9

Merging two different company databases usually requires heavy focus on?

Question 10

Why is calculating DaysBetween = EndDate - Start_Date a good validation step?

9. Interview Questions

  • Q: Walk me through how you would clean a dataset containing millions of credit card transactions to prepare it for a fraud detection model.
  • Q: You are given an HR dataset where 20% of employees are missing their 'Department'. How do you handle this?

10. Summary

Real-world data cleaning is domain-specific. CRM data requires heavy string normalization and deduplication. Sales/Time-series data requires date parsing and filling temporal gaps with zeros. Healthcare requires extreme caution with outliers and MNAR (Missing Not At Random) values. Financial data requires rigorous type casting. Always clean with the final analytical goal in mind.

11. Next Chapter Recommendation

In Chapter 18: Performance Optimization for Large Datasets, we will learn how to handle files that are too big for your computer's RAM, utilizing chunking, memory optimization, and parallel processing.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·