Skip to main content
Data Cleaning
CHAPTER 19 Beginner

Data Cleaning Interview Preparation

Updated: May 18, 2026
5 min read

# CHAPTER 19

Data Cleaning Interview Preparation

1. Chapter Introduction

Interviews for Data Analyst, Data Scientist, and Data Engineering roles will rigorously test your data cleaning skills. Interviewers don't just want to know if you can write code; they want to know *how you think* about data anomalies. This chapter provides a comprehensive guide to technical screening questions, Pandas coding challenges, and SQL scenarios.

2. Conceptual Questions

1. What is the difference between data cleaning and data transformation? *Answer:* Data cleaning fixes errors (removing duplicates, filling missing values, fixing typos). Data transformation changes the shape or scale of clean data to make it suitable for modeling (e.g., Min-Max scaling, One-Hot Encoding).

2. Explain the difference between MCAR, MAR, and MNAR. *Answer:*

  • MCAR (Missing Completely At Random): The missingness has no relationship with any data (e.g., a sensor glitch). Safe to drop or impute.
  • MAR (Missing At Random): Missingness is related to *other* observed variables (e.g., men are less likely to disclose their salary).
  • MNAR (Missing Not At Random): Missingness is related to the *value itself* (e.g., patients with severe depression drop out of a depression study). Cannot safely impute with average values.

3. When would you use Mean imputation vs Median imputation? *Answer:* Mean is appropriate for normally distributed data. Median is required for skewed data (like salaries or housing prices) because the median is robust to extreme outliers, whereas the mean will be pulled artificially high/low.

4. How do you handle outliers? *Answer:* First, investigate the source. If it's a data entry error (e.g., age=999), drop or cap it. If it's a valid extreme value (e.g., a $1M transaction in fraud detection), leave it, or apply a log-transformation to reduce its leverage on the model without deleting the information.

3. Pandas Coding Challenges

Challenge 1: Remove duplicates based on subset and keep the latest. *Scenario:* A user logs in multiple times. You have a dataframe with userid, logintime, and ip_address. Keep only their most recent login.

python
1234
# Answer:
clean_df = (df
            .sort_values('login_time')
            .drop_duplicates(subset=['user_id'], keep='last'))

Challenge 2: Convert a messy currency string to float. *Scenario:* Convert "$1,250.50" to 1250.50. If the string is "Free", make it 0.0.

python
123
# Answer:
df['price'] = df['price'].str.replace(r'[^\d.]', '', regex=True)
df['price'] = pd.to_numeric(df['price'], errors='coerce').fillna(0.0)

Challenge 3: Handle the Dummy Variable Trap. *Scenario:* Write the code to One-Hot Encode a color column containing Red, Green, Blue, while avoiding multicollinearity for a regression model.

python
12
# Answer:
encoded_df = pd.get_dummies(df, columns=['color'], drop_first=True)

Challenge 4: Filter out IQR outliers. *Scenario:* Write a function to remove rows where the salary column is an outlier based on the IQR method.

python
123456
# Answer:
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5*IQR, Q3 + 1.5*IQR
clean_df = df[(df[&#039;salary'] >= lower) & (df['salary'] <= upper)]

4. SQL Cleaning Challenges

Challenge 1: Find duplicates in SQL. *Scenario:* Write a query to find all email addresses that appear more than once in the users table.

sql
12345
-- Answer:
SELECT email, COUNT(*) as occurrence
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

Challenge 2: Fill NULLs in a SELECT statement. *Scenario:* Select first_name and phone. If phone is NULL, output 'N/A'.

sql
123
-- Answer:
SELECT first_name, COALESCE(phone, &#039;N/A') AS phone
FROM users;

Challenge 3: Update and standardize text. *Scenario:* Update the state column in address_table to be fully uppercase and stripped of leading/trailing spaces.

sql
123
-- Answer:
UPDATE address_table
SET state = UPPER(TRIM(state));

5. Scenario-Based Debugging

Scenario: You run a machine learning model to predict loan default, and it gets 99.9% accuracy. You investigate the data and find a column called collectionagencyfee. *Question:* What data cleaning/preprocessing principle did you violate? *Answer:* Data Leakage. A collection agency fee is only applied *after* someone defaults on a loan. By including this column, you leaked the future target variable into the training data. The model just learned "If fee > 0, then default". The column must be dropped during preprocessing.

Scenario: You have a 50GB CSV file and 16GB of RAM. You need to calculate the average transaction amount for the state of New York. *Question:* How do you write this in Python? *Answer:* I would use pd.readcsv(chunksize=100000). Inside a loop, I would filter each chunk for State == 'NY', calculate the sum of the amounts and the count of the rows for that chunk, add them to a running total, and then divide the grand total sum by the grand total count at the very end.

6. MCQs

Question 1

In a technical interview, if asked how to handle missing data, what should your first question be?

Question 2

What is Data Leakage?

Question 3

To keep the most recent record when dropping duplicates in Pandas, you must first do what?

Question 4

COALESCE in SQL is equivalent to which Pandas function?

Question 5

Why do interviewers care about the Dummy Variable Trap?

Question 6

If an interviewer asks you to scale data for a Neural Network, which scaler is generally preferred?

Question 7

What is the SQL keyword used to filter results *after* a GROUP BY?

Question 8

If you have a column of user comments, what preprocessing step is usually required before analysis?

Question 9

What is the benefit of pd.getdummies(dropfirst=True)?

Question 10

How do you describe the IQR to a non-technical stakeholder?

7. Interview Questions

  • Q: (Behavioral) Tell me about a time you found a critical error in a dataset. How did you diagnose it, fix it, and communicate it to your stakeholders?
  • Q: (Technical) Write a Pandas method chain that reads a CSV, renames all columns to lowercase, drops duplicates, and fills missing salaries with the department median.

8. Summary

Interviewers test your technical syntax (Pandas, SQL) and your analytical judgment. Always ask *why* data is missing before suggesting a fix. Know the difference between Mean/Median imputation, and when to use Z-score vs Min-Max scaling. Master drop_duplicates(keep='last'), method chaining, and SQL COALESCE and HAVING. Understand Data Leakage and the Dummy Variable Trap.

9. Next Chapter Recommendation

In Chapter 20: Final Projects and Real-World Applications, we conclude the course by outlining complete, end-to-end architectures for building data cleaning pipelines across diverse industries.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·