CHAPTER 17 Beginner

Real-World Data Cleaning Projects

Updated: May 18, 2026

5 min read

# CHAPTER 17

Real-World Data Cleaning Projects

1. Chapter Introduction

Theory and syntax only take you so far. Real-world data is inherently complex, unstructured, and domain-specific. The way you clean financial transactions is entirely different from how you clean patient medical records. This chapter outlines the architecture and common challenges of 5 real-world data cleaning projects you can build for your portfolio.

2. Project 1: CRM Customer Records Cleanup

The Scenario: A company merged with a competitor. They combined their Salesforce exports, resulting in massive duplication and inconsistent formatting.

The Workflow:

1. Normalization: Convert all Email addresses to lowercase and .strip() whitespace.

2. Standardization: Map 'USA', 'US', 'United States' to standard ISO codes using a dictionary replace.

3. Regex Extraction: Extract clean 10-digit phone numbers from (555) 123-4567, 555.123.4567, etc.

4. Fuzzy Matching: Use algorithms (like Levenshtein distance) or normalize names (e.g., 'Jon Doe' vs 'Jonathan Doe') to find near-duplicates.

5. Deduplication: Use .dropduplicates(subset=['email'], keep='last') assuming the most recently updated record is the most accurate.

3. Project 2: Sales Data Preprocessing for Forecasting

The Scenario: You need to forecast next month's sales, but the historical POS (Point of Sale) data is full of errors, refunds, and missing dates.

The Workflow:

1. Datetime Conversion: Use pd.todatetime() on the transaction date. Create new features: Year, Month, DayOfWeek using the .dt accessor.

2. Handling Refunds: Identify rows where Quantity or Price is negative. Separate these into a returnsdf and keep only positive sales in the salesdf.

3. Imputation: If ProductCategory is missing, impute it by mapping the ProductID to a known catalog, or use the mode (most frequent category) for that specific store.

4. Aggregation: Group by Date and Store to get daily totals.

5. Time Series Gaps: If a store was closed on a Sunday, the date might be missing entirely. Reindex the dataframe to include all calendar days and .fillna(0) for closed days.

4. Project 3: Healthcare Patient Data Formatting

The Scenario: Predicting patient readmission rates based on Electronic Health Records (EHR). Healthcare data is highly sensitive and prone to missing clinical test results.

The Workflow:

1. Validation (Sanity Checks): Use assertions to ensure AdmissionDate is <= DischargeDate.

2. Outlier Treatment: Medical data has true outliers (e.g., very high blood pressure). Do NOT delete them. Use clinical thresholds (e.g., Blood Sugar > 600) to cap values (Winsorization) or flag them for review.

3. MNAR Handling (Missing Not At Random): If a patient didn't get an MRI, the MRIResult is missing. Don't impute this with the mean! Fill it with a category like "TestNotAdministered", as the *absence* of the test is a clinical indicator itself.

4. Encoding: One-Hot Encode categorical diagnoses codes (ICD-10) for the machine learning model.

5. Project 4: Financial Transaction Cleaning (Anti-Money Laundering)

The Scenario: Processing millions of bank transactions to detect fraudulent or structured money transfers.

The Workflow:
1. Type Casting: Ensure all Amount strings (like "$10,000.00") are stripped of currency symbols and cast strictly to float64.

2. Outlier Detection: Use the IQR method to flag transactions that are statistically anomalous for a specific AccountID.

3. Categorization: Use Regex to search the TransactionDescription text (e.g., "UBER EATS", "AMZN MKT") and map them to standard categories ("Food", "Shopping").

4. Handling Zeros: A transaction amount of 0.00 might be a ping to verify an account. Filter these out of volume calculations.

6. Project 5: HR Employee Dataset Cleaning

The Scenario: Analyzing employee churn (turnover). The data comes from manual HR Excel sheets, meaning lots of human data entry errors.

The Workflow:

1. Data Types: Convert HireDate and TerminationDate to datetime.

2. Logic Calculation: Create a TenureDays column (TerminationDate - HireDate). For current employees, TerminationDate will be NaN. Fill the NaN with today's date before calculating tenure.

3. Categorical Standardization: Standardize Department names (e.g., "Dev", "Software", "Engineering" all map to "Engineering").

4. Boolean Conversion: Convert IsManager from "Yes"/"No" strings to True/False booleans.

7. Common Mistakes in Projects

Cleaning without a goal: You don't need to clean every single column perfectly. Focus heavily on the features required for your specific analysis or machine learning model.

Ignoring Data Leakage in ML Projects: If you use the entire dataset to calculate the mean for imputation, and then split into Train/Test, your Training data has learned from the Test data. Always split *before* imputing with statistical measures.

8. MCQs

Question 1

In a CRM cleanup, what is the best strategy for handling names like "jon doe" and "JON DOE"?

Question 2

When preparing sales data for daily forecasting, what must you do if a day had zero sales and is missing from the dataset?

Question 3

In healthcare data, replacing a missing "Cancer Test Result" with the average test result is?

Question 4

To extract standard categories ("Food") from messy bank descriptions ("UBER EATS PENDING"), what tool is best?

Question 5

How do you calculate employee tenure for people who still work at the company (Termination Date is NULL)?

Question 6

What does "Winsorization" mean in the context of healthcare outliers?

Question 7

If you are cleaning data to train a Machine Learning model, when should you calculate the Mean for imputation?

Question 8

In financial data, a string like "-$1,500.00" requires what step before doing math?

Question 9

Merging two different company databases usually requires heavy focus on?

Question 10

Why is calculating `DaysBetween = EndDate - Start_Date` a good validation step?

9. Interview Questions

Q: Walk me through how you would clean a dataset containing millions of credit card transactions to prepare it for a fraud detection model.

Q: You are given an HR dataset where 20% of employees are missing their 'Department'. How do you handle this?

10. Summary

Real-world data cleaning is domain-specific. CRM data requires heavy string normalization and deduplication. Sales/Time-series data requires date parsing and filling temporal gaps with zeros. Healthcare requires extreme caution with outliers and MNAR (Missing Not At Random) values. Financial data requires rigorous type casting. Always clean with the final analytical goal in mind.

11. Next Chapter Recommendation

In Chapter 18: Performance Optimization for Large Datasets, we will learn how to handle files that are too big for your computer's RAM, utilizing chunking, memory optimization, and parallel processing.

Browse All 21+ Subject Areas

Quick Links

Visual Algorithm Labs

Frontend Dev

The Future of Web Architecture in 2026

Practice Quizzes

Real-World Data Cleaning Projects #

1. Chapter Introduction #

2. Project 1: CRM Customer Records Cleanup #

3. Project 2: Sales Data Preprocessing for Forecasting #

4. Project 3: Healthcare Patient Data Formatting #

5. Project 4: Financial Transaction Cleaning (Anti-Money Laundering) #

6. Project 5: HR Employee Dataset Cleaning #

7. Common Mistakes in Projects #

8. MCQs #

In a CRM cleanup, what is the best strategy for handling names like "jon doe" and "JON DOE"?

When preparing sales data for daily forecasting, what must you do if a day had zero sales and is missing from the dataset?

In healthcare data, replacing a missing "Cancer Test Result" with the average test result is?

To extract standard categories ("Food") from messy bank descriptions ("UBER EATS PENDING"), what tool is best?

How do you calculate employee tenure for people who still work at the company (Termination Date is NULL)?

What does "Winsorization" mean in the context of healthcare outliers?

If you are cleaning data to train a Machine Learning model, when should you calculate the Mean for imputation?

In financial data, a string like "-$1,500.00" requires what step before doing math?

Merging two different company databases usually requires heavy focus on?

Why is calculating DaysBetween = EndDate - Start_Date a good validation step?

9. Interview Questions #

10. Summary #

11. Next Chapter Recommendation #

Finish this Chapter

Discussion

Explore More

📖 Related Tutorials 5

🧪 Related Labs 1

🗺️ Related Roadmaps 1

Send Feedback / Bug

Feedback Submitted!

Real-World Data Cleaning Projects

1. Chapter Introduction

2. Project 1: CRM Customer Records Cleanup

3. Project 2: Sales Data Preprocessing for Forecasting

4. Project 3: Healthcare Patient Data Formatting

5. Project 4: Financial Transaction Cleaning (Anti-Money Laundering)

6. Project 5: HR Employee Dataset Cleaning

7. Common Mistakes in Projects

8. MCQs

Why is calculating `DaysBetween = EndDate - Start_Date` a good validation step?

9. Interview Questions

10. Summary

11. Next Chapter Recommendation