Cleaning Data with SQL
# CHAPTER 13
Cleaning Data with SQL
1. Chapter Introduction
While Pandas is excellent for data cleaning in memory, often the data lives in a relational database. Pulling 50 million rows into Python just to capitalize names is wildly inefficient. This chapter teaches you how to perform data cleaning *in-database* using SQLUPDATE operations, string functions, and DELETE queries to handle duplicates and NULLs.
2. Handling Missing Values (NULLs) in SQL
In SQL, missing data is represented as NULL.
*Rule #1 of SQL:* You cannot use = to find NULLs. You must use IS NULL.
3. String Formatting and Cleaning
SQL provides built-in functions to handle case, whitespace, and string extraction.
5. Type Casting and Formatting
Often, dates are stored as strings (VARCHAR), or numbers are stored as text.
6. Removing Duplicates in SQL
Removing exact duplicates in SQL without deleting *all* copies requires identifying rows by their unique physical location (like CTID in Postgres or ROWID in Oracle) or using Window Functions (ROW_NUMBER()).
*How it works:* We partition (group) the data by email. We assign a row number to each record in that group, sorted by the newest first. The newest is row 1. We delete everything where row_num > 1.
7. Mini Project: SQL Customer Database Cleaner
8. Common Mistakes
-
Forgetting the
WHEREclause in anUPDATEstatement: RunningUPDATE users SET status = 'Inactive';without a WHERE clause will mark *every single user* in the database as Inactive. Always test yourWHERElogic with aSELECTstatement first.
-
Using
=with NULL: WritingWHERE age = NULLreturns 0 rows. You must useWHERE age IS NULL.
9. MCQs
How do you find rows with missing data in SQL?
Which SQL function returns the first non-null value in a list?
To permanently change data in an existing SQL table, you use?
What happens if you run an UPDATE statement without a WHERE clause?
Which function removes leading and trailing spaces in SQL?
To find duplicate email addresses, which clauses do you combine?
What is CAST() used for in SQL?
A safe way to execute multiple cleaning queries is to wrap them in?
Which window function assigns a sequential integer to rows within a partition (used heavily for deduping)?
How do you replace all dashes '-' in a phone number with nothing?
10. Interview Questions
-
Q: Write a SQL query to delete duplicate rows based on
email, keeping only the record with the most recentcreatedatdate.
-
Q: Explain the difference between
COALESCE()and writing anUPDATEstatement to handle NULLs.
11. Summary
SQL is powerful for in-place data cleaning, avoiding the overhead of pulling data into Python. UseUPDATE ... SET ... WHERE to fix formatting and fill missing values. Use TRIM(), UPPER(), and REPLACE() for string normalization. Use COALESCE() for handling NULLs on-the-fly in SELECT queries. To remove duplicates, combine CTEs (WITH clauses) and ROW_NUMBER() window functions to isolate and DELETE the extras.