Real-World Data Cleaning Projects
# CHAPTER 17
Real-World Data Cleaning Projects
1. Chapter Introduction
Theory and syntax only take you so far. Real-world data is inherently complex, unstructured, and domain-specific. The way you clean financial transactions is entirely different from how you clean patient medical records. This chapter outlines the architecture and common challenges of 5 real-world data cleaning projects you can build for your portfolio.2. Project 1: CRM Customer Records Cleanup
The Scenario: A company merged with a competitor. They combined their Salesforce exports, resulting in massive duplication and inconsistent formatting.
The Workflow:
-
1.
Normalization: Convert all
Emailaddresses to lowercase and.strip()whitespace.
- 2. Standardization: Map 'USA', 'US', 'United States' to standard ISO codes using a dictionary replace.
-
3.
Regex Extraction: Extract clean 10-digit phone numbers from
(555) 123-4567,555.123.4567, etc.
- 4. Fuzzy Matching: Use algorithms (like Levenshtein distance) or normalize names (e.g., 'Jon Doe' vs 'Jonathan Doe') to find near-duplicates.
-
5.
Deduplication: Use
.dropduplicates(subset=['email'], keep='last')assuming the most recently updated record is the most accurate.
3. Project 2: Sales Data Preprocessing for Forecasting
The Scenario: You need to forecast next month's sales, but the historical POS (Point of Sale) data is full of errors, refunds, and missing dates.
The Workflow:
-
1.
Datetime Conversion: Use
pd.todatetime()on the transaction date. Create new features:Year,Month,DayOfWeekusing the.dtaccessor.
-
2.
Handling Refunds: Identify rows where
QuantityorPriceis negative. Separate these into areturnsdfand keep only positive sales in thesalesdf.
-
3.
Imputation: If
ProductCategoryis missing, impute it by mapping theProductIDto a known catalog, or use the mode (most frequent category) for that specific store.
-
4.
Aggregation: Group by
DateandStoreto get daily totals.
-
5.
Time Series Gaps: If a store was closed on a Sunday, the date might be missing entirely. Reindex the dataframe to include all calendar days and
.fillna(0)for closed days.
4. Project 3: Healthcare Patient Data Formatting
The Scenario: Predicting patient readmission rates based on Electronic Health Records (EHR). Healthcare data is highly sensitive and prone to missing clinical test results.
The Workflow:
-
1.
Validation (Sanity Checks): Use assertions to ensure
AdmissionDateis <=DischargeDate.
- 2. Outlier Treatment: Medical data has true outliers (e.g., very high blood pressure). Do NOT delete them. Use clinical thresholds (e.g., Blood Sugar > 600) to cap values (Winsorization) or flag them for review.
-
3.
MNAR Handling (Missing Not At Random): If a patient didn't get an MRI, the
MRIResultis missing. Don't impute this with the mean! Fill it with a category like "TestNotAdministered", as the *absence* of the test is a clinical indicator itself.
- 4. Encoding: One-Hot Encode categorical diagnoses codes (ICD-10) for the machine learning model.
5. Project 4: Financial Transaction Cleaning (Anti-Money Laundering)
The Scenario: Processing millions of bank transactions to detect fraudulent or structured money transfers.
The Workflow:
-
1.
Type Casting: Ensure all
Amountstrings (like"$10,000.00") are stripped of currency symbols and cast strictly tofloat64.
-
2.
Outlier Detection: Use the IQR method to flag transactions that are statistically anomalous for a specific
AccountID.
-
3.
Categorization: Use Regex to search the
TransactionDescriptiontext (e.g., "UBER EATS", "AMZN MKT") and map them to standard categories ("Food", "Shopping").
- 4. Handling Zeros: A transaction amount of 0.00 might be a ping to verify an account. Filter these out of volume calculations.
6. Project 5: HR Employee Dataset Cleaning
The Scenario: Analyzing employee churn (turnover). The data comes from manual HR Excel sheets, meaning lots of human data entry errors.
The Workflow:
-
1.
Data Types: Convert
HireDateandTerminationDateto datetime.
-
2.
Logic Calculation: Create a
TenureDayscolumn (TerminationDate-HireDate). For current employees,TerminationDatewill beNaN. Fill theNaNwith today's date before calculating tenure.
-
3.
Categorical Standardization: Standardize
Departmentnames (e.g., "Dev", "Software", "Engineering" all map to "Engineering").
-
4.
Boolean Conversion: Convert
IsManagerfrom "Yes"/"No" strings toTrue/Falsebooleans.
7. Common Mistakes in Projects
- Cleaning without a goal: You don't need to clean every single column perfectly. Focus heavily on the features required for your specific analysis or machine learning model.
- Ignoring Data Leakage in ML Projects: If you use the entire dataset to calculate the mean for imputation, and then split into Train/Test, your Training data has learned from the Test data. Always split *before* imputing with statistical measures.
8. MCQs
In a CRM cleanup, what is the best strategy for handling names like "jon doe" and "JON DOE"?
When preparing sales data for daily forecasting, what must you do if a day had zero sales and is missing from the dataset?
In healthcare data, replacing a missing "Cancer Test Result" with the average test result is?
To extract standard categories ("Food") from messy bank descriptions ("UBER EATS PENDING"), what tool is best?
How do you calculate employee tenure for people who still work at the company (Termination Date is NULL)?
What does "Winsorization" mean in the context of healthcare outliers?
If you are cleaning data to train a Machine Learning model, when should you calculate the Mean for imputation?
In financial data, a string like "-$1,500.00" requires what step before doing math?
Merging two different company databases usually requires heavy focus on?
Why is calculating DaysBetween = EndDate - Start_Date a good validation step?
9. Interview Questions
- Q: Walk me through how you would clean a dataset containing millions of credit card transactions to prepare it for a fraud detection model.
- Q: You are given an HR dataset where 20% of employees are missing their 'Department'. How do you handle this?