Data Cleaning Interview Preparation
# CHAPTER 19
Data Cleaning Interview Preparation
1. Chapter Introduction
Interviews for Data Analyst, Data Scientist, and Data Engineering roles will rigorously test your data cleaning skills. Interviewers don't just want to know if you can write code; they want to know *how you think* about data anomalies. This chapter provides a comprehensive guide to technical screening questions, Pandas coding challenges, and SQL scenarios.2. Conceptual Questions
1. What is the difference between data cleaning and data transformation? *Answer:* Data cleaning fixes errors (removing duplicates, filling missing values, fixing typos). Data transformation changes the shape or scale of clean data to make it suitable for modeling (e.g., Min-Max scaling, One-Hot Encoding).
2. Explain the difference between MCAR, MAR, and MNAR. *Answer:*
- MCAR (Missing Completely At Random): The missingness has no relationship with any data (e.g., a sensor glitch). Safe to drop or impute.
- MAR (Missing At Random): Missingness is related to *other* observed variables (e.g., men are less likely to disclose their salary).
- MNAR (Missing Not At Random): Missingness is related to the *value itself* (e.g., patients with severe depression drop out of a depression study). Cannot safely impute with average values.
3. When would you use Mean imputation vs Median imputation? *Answer:* Mean is appropriate for normally distributed data. Median is required for skewed data (like salaries or housing prices) because the median is robust to extreme outliers, whereas the mean will be pulled artificially high/low.
4. How do you handle outliers? *Answer:* First, investigate the source. If it's a data entry error (e.g., age=999), drop or cap it. If it's a valid extreme value (e.g., a $1M transaction in fraud detection), leave it, or apply a log-transformation to reduce its leverage on the model without deleting the information.
3. Pandas Coding Challenges
Challenge 1: Remove duplicates based on subset and keep the latest.
*Scenario:* A user logs in multiple times. You have a dataframe with userid, logintime, and ip_address. Keep only their most recent login.
Challenge 2: Convert a messy currency string to float.
*Scenario:* Convert "$1,250.50" to 1250.50. If the string is "Free", make it 0.0.
Challenge 3: Handle the Dummy Variable Trap.
*Scenario:* Write the code to One-Hot Encode a color column containing Red, Green, Blue, while avoiding multicollinearity for a regression model.
Challenge 4: Filter out IQR outliers.
*Scenario:* Write a function to remove rows where the salary column is an outlier based on the IQR method.
4. SQL Cleaning Challenges
Challenge 1: Find duplicates in SQL.
*Scenario:* Write a query to find all email addresses that appear more than once in the users table.
Challenge 2: Fill NULLs in a SELECT statement.
*Scenario:* Select first_name and phone. If phone is NULL, output 'N/A'.
Challenge 3: Update and standardize text.
*Scenario:* Update the state column in address_table to be fully uppercase and stripped of leading/trailing spaces.
5. Scenario-Based Debugging
Scenario: You run a machine learning model to predict loan default, and it gets 99.9% accuracy. You investigate the data and find a column called collectionagencyfee.
*Question:* What data cleaning/preprocessing principle did you violate?
*Answer:* Data Leakage. A collection agency fee is only applied *after* someone defaults on a loan. By including this column, you leaked the future target variable into the training data. The model just learned "If fee > 0, then default". The column must be dropped during preprocessing.
Scenario: You have a 50GB CSV file and 16GB of RAM. You need to calculate the average transaction amount for the state of New York.
*Question:* How do you write this in Python?
*Answer:* I would use pd.readcsv(chunksize=100000). Inside a loop, I would filter each chunk for State == 'NY', calculate the sum of the amounts and the count of the rows for that chunk, add them to a running total, and then divide the grand total sum by the grand total count at the very end.
6. MCQs
In a technical interview, if asked how to handle missing data, what should your first question be?
What is Data Leakage?
To keep the most recent record when dropping duplicates in Pandas, you must first do what?
COALESCE in SQL is equivalent to which Pandas function?
Why do interviewers care about the Dummy Variable Trap?
If an interviewer asks you to scale data for a Neural Network, which scaler is generally preferred?
What is the SQL keyword used to filter results *after* a GROUP BY?
If you have a column of user comments, what preprocessing step is usually required before analysis?
What is the benefit of pd.getdummies(dropfirst=True)?
How do you describe the IQR to a non-technical stakeholder?
7. Interview Questions
- Q: (Behavioral) Tell me about a time you found a critical error in a dataset. How did you diagnose it, fix it, and communicate it to your stakeholders?
- Q: (Technical) Write a Pandas method chain that reads a CSV, renames all columns to lowercase, drops duplicates, and fills missing salaries with the department median.
8. Summary
Interviewers test your technical syntax (Pandas, SQL) and your analytical judgment. Always ask *why* data is missing before suggesting a fix. Know the difference between Mean/Median imputation, and when to use Z-score vs Min-Max scaling. Masterdrop_duplicates(keep='last'), method chaining, and SQL COALESCE and HAVING. Understand Data Leakage and the Dummy Variable Trap.