Skip to main content
Data Cleaning
CHAPTER 13 Beginner

Cleaning Data with SQL

Updated: May 18, 2026
5 min read

# CHAPTER 13

Cleaning Data with SQL

1. Chapter Introduction

While Pandas is excellent for data cleaning in memory, often the data lives in a relational database. Pulling 50 million rows into Python just to capitalize names is wildly inefficient. This chapter teaches you how to perform data cleaning *in-database* using SQL UPDATE operations, string functions, and DELETE queries to handle duplicates and NULLs.

2. Handling Missing Values (NULLs) in SQL

In SQL, missing data is represented as NULL. *Rule #1 of SQL:* You cannot use = to find NULLs. You must use IS NULL.

sql
123456789101112131415161718192021222324
-- 1. DETECTING NULLS
-- Find all customers missing an email address
SELECT * FROM customers
WHERE email IS NULL;

-- 2. FILLING NULLS (SELECT)
-- COALESCE returns the first non-null value
-- This doesn't change the database, it just formats the output
SELECT 
    id, 
    first_name, 
    COALESCE(phone, 'No Phone on File') AS phone_status
FROM customers;

-- 3. UPDATING NULLS (Permanent Fix)
-- Permanently replace NULL countries with 'Unknown'
UPDATE customers
SET country = 'Unknown'
WHERE country IS NULL;

-- 4. DELETING NULLS
-- Delete records that are completely useless (e.g., missing ID or email)
DELETE FROM customers
WHERE email IS NULL;

3. String Formatting and Cleaning

SQL provides built-in functions to handle case, whitespace, and string extraction.

sql
12345678910111213141516171819202122
-- 1. WHITESPACE CLEANING
-- TRIM() removes leading and trailing spaces
UPDATE employees
SET last_name = TRIM(last_name);

-- 2. CASE STANDARDIZATION
-- Make names Title Case (varies slightly by SQL dialect, this is PostgreSQL/MySQL approach)
UPDATE employees
SET first_name = CONCAT(UPPER(SUBSTRING(first_name, 1, 1)), LOWER(SUBSTRING(first_name, 2)));
-- Simple UPPER/LOWER:
-- SET state = UPPER(state);

-- 3. STRING REPLACEMENT
-- Remove hyphens from phone numbers
UPDATE contacts
SET phone_number = REPLACE(phone_number, '-', '');
-- '555-123-4567' becomes '5551234567'

-- 4. REGEX IN SQL (PostgreSQL example)
-- Remove all non-numeric characters from a phone string
UPDATE contacts
SET phone_number = REGEXP_REPLACE(phone_number, '\D', '', 'g');

5. Type Casting and Formatting

Often, dates are stored as strings (VARCHAR), or numbers are stored as text.

sql
12345678910111213
-- Convert string '2024-01-15' to an actual DATE type in output
SELECT CAST(signup_date AS DATE) FROM users;
-- OR (PostgreSQL syntax)
SELECT signup_date::DATE FROM users;

-- Standardize boolean flags
-- If active status is stored as 'Y', 'N', 'Yes', '1', '0'
UPDATE users
SET is_active = CASE 
    WHEN status IN ('Y', 'Yes', '1') THEN 1
    WHEN status IN ('N', 'No', '0') THEN 0
    ELSE 0
END;

6. Removing Duplicates in SQL

Removing exact duplicates in SQL without deleting *all* copies requires identifying rows by their unique physical location (like CTID in Postgres or ROWID in Oracle) or using Window Functions (ROW_NUMBER()).

sql
123456789101112131415161718192021222324
-- Identifying duplicates (Using Group By)
-- Find emails that appear more than once
SELECT email, COUNT(*) as count
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Deleting duplicates (Keeping the latest one based on updated_at)
-- This uses a Common Table Expression (CTE) and Window Function
WITH RankedDuplicates AS (
    SELECT 
        id,
        ROW_NUMBER() OVER(
            PARTITION BY email 
            ORDER BY updated_at DESC
        ) as row_num
    FROM users
)
DELETE FROM users
WHERE id IN (
    SELECT id 
    FROM RankedDuplicates 
    WHERE row_num > 1
);

*How it works:* We partition (group) the data by email. We assign a row number to each record in that group, sorted by the newest first. The newest is row 1. We delete everything where row_num > 1.

7. Mini Project: SQL Customer Database Cleaner

sql
12345678910111213141516171819202122232425262728293031
-- Execute a multi-step cleaning script inside a transaction
-- This ensures that if one step fails, the database rolls back to the original state

BEGIN TRANSACTION;

-- 1. Remove garbage rows (No name AND no email)
DELETE FROM customer_leads 
WHERE name IS NULL AND email IS NULL;

-- 2. Standardize Emails
UPDATE customer_leads
SET email = LOWER(TRIM(email));

-- 3. Handle NULL phone numbers
UPDATE customer_leads
SET phone = 'N/A'
WHERE phone IS NULL;

-- 4. Standardize State abbreviations
UPDATE customer_leads
SET state = 'NY' WHERE state IN ('New York', 'ny', 'N.Y.');

-- 5. Remove Duplicates (keep the lowest ID / oldest record)
WITH Dupes AS (
    SELECT id, ROW_NUMBER() OVER(PARTITION BY email ORDER BY id ASC) as rn
    FROM customer_leads
)
DELETE FROM customer_leads 
WHERE id IN (SELECT id FROM Dupes WHERE rn > 1);

COMMIT;

8. Common Mistakes

  • Forgetting the WHERE clause in an UPDATE statement: Running UPDATE users SET status = 'Inactive'; without a WHERE clause will mark *every single user* in the database as Inactive. Always test your WHERE logic with a SELECT statement first.
  • Using = with NULL: Writing WHERE age = NULL returns 0 rows. You must use WHERE age IS NULL.

9. MCQs

Question 1

How do you find rows with missing data in SQL?

Question 2

Which SQL function returns the first non-null value in a list?

Question 3

To permanently change data in an existing SQL table, you use?

Question 4

What happens if you run an UPDATE statement without a WHERE clause?

Question 5

Which function removes leading and trailing spaces in SQL?

Question 6

To find duplicate email addresses, which clauses do you combine?

Question 7

What is CAST() used for in SQL?

Question 8

A safe way to execute multiple cleaning queries is to wrap them in?

Question 9

Which window function assigns a sequential integer to rows within a partition (used heavily for deduping)?

Question 10

How do you replace all dashes '-' in a phone number with nothing?

10. Interview Questions

  • Q: Write a SQL query to delete duplicate rows based on email, keeping only the record with the most recent createdat date.
  • Q: Explain the difference between COALESCE() and writing an UPDATE statement to handle NULLs.

11. Summary

SQL is powerful for in-place data cleaning, avoiding the overhead of pulling data into Python. Use UPDATE ... SET ... WHERE to fix formatting and fill missing values. Use TRIM(), UPPER(), and REPLACE() for string normalization. Use COALESCE() for handling NULLs on-the-fly in SELECT queries. To remove duplicates, combine CTEs (WITH clauses) and ROW_NUMBER() window functions to isolate and DELETE the extras.

12. Next Chapter Recommendation

In Chapter 14: Data Transformation and Standardization, we return to Python to learn how to scale numerical data and encode categorical data so it can be fed into machine learning models.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·