CHAPTER 11 Beginner

Data Validation Techniques

Updated: May 18, 2026
5 min read

1. Chapter Introduction

Cleaning data is reactive; you are fixing problems that already exist. Data Validation is proactive. It involves writing logical rules (assertions) to ensure the data entering your pipeline meets strict business requirements. If data fails validation, the pipeline throws an error before corrupted data can poison a database or machine learning model.

2. What is Data Validation?

Data validation checks whether data conforms to defined rules. Common rule types include:

  • Range Validation: Is Age between 0 and 120?
  • Uniqueness Validation: Are all Customer IDs distinct?
  • Set Validation: Is the Status column strictly limited to ['Active', 'Pending', 'Closed']?
  • Cross-Column Validation: Is the EndDate logically after the StartDate?
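
As a rough sketch, each of the four rule types above can be written as a boolean mask in pandas. The DataFrame and its column names (CustomerID, Age, Status, StartDate, EndDate) are hypothetical sample data made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; column names follow the rule list above.
df = pd.DataFrame({
    "CustomerID": [1, 2, 2],
    "Age": [34, 150, 28],
    "Status": ["Active", "Pending", "Archived"],
    "StartDate": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-10"]),
    "EndDate": pd.to_datetime(["2024-01-31", "2024-01-15", "2024-03-20"]),
})

# Each rule is a boolean Series: True where the row satisfies the rule.
checks = {
    "range": df["Age"].between(0, 120),
    "uniqueness": ~df["CustomerID"].duplicated(keep=False),
    "set": df["Status"].isin(["Active", "Pending", "Closed"]),
    "cross_column": df["EndDate"] > df["StartDate"],
}

# Count the failing rows for each rule.
failures = {name: int((~mask).sum()) for name, mask in checks.items()}
print(failures)
```

Keeping each rule as a mask makes it easy to inspect the offending rows later with `df[~mask]`.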

3. Implementing Basic Validations with Assertions

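Python's assert statement is the simplest way to validate data. If the condition is True, nothing happens. If it is False, Python raises an AssertionError, which stops the program unless the error is caught. Note that assert statements are skipped entirely when Python runs with the -O flag, so critical production checks are safer as explicit raise statements.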

python
import pandas as pd

df = pd.DataFrame({
    'emp_id': [101, 102, 103, 104], # Unique
    'age': [25, 45, 17, 60],        # One under 18
    'status': ['Active', 'Active', 'Inactive', 'Fired'] # 'Fired' is not allowed
})

# 1. Uniqueness Validation
# Assert that the number of unique IDs equals the total number of rows
assert df['emp_id'].nunique() == len(df), "ERROR: Duplicate Employee IDs found!"
print("Uniqueness validation passed.")

# 2. Range Validation
# We want to ensure all employees are >= 18
try:
    assert df['age'].min() >= 18, f"ERROR: Found employees under 18! Min age is {df['age'].min()}"
except AssertionError as e:
    print(e)

# 3. Set Membership Validation
allowed_statuses = ['Active', 'Inactive']
# isin() returns True/False. all() checks if ALL rows are True.
try:
    assert df['status'].isin(allowed_statuses).all(), "ERROR: Invalid status codes found!"
except AssertionError as e:
    print(e)
    # View the invalid rows
    invalid_rows = df[~df['status'].isin(allowed_statuses)]
    print("Invalid rows:\n", invalid_rows)
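
The checks above stop at the first failed rule unless each one is wrapped in its own try block. One possible alternative, sketched here, is to collect every violation into a list and report them together. The function name `validate_employees` and its parameters are hypothetical; the sample data mirrors the DataFrame used above.

```python
import pandas as pd

def validate_employees(df, allowed_statuses=("Active", "Inactive"), min_age=18):
    """Return a list of rule violations; an empty list means the frame is valid."""
    errors = []
    if not df["emp_id"].is_unique:
        errors.append("Duplicate employee IDs found")
    if (df["age"] < min_age).any():
        errors.append(f"Employees under {min_age}: min age is {df['age'].min()}")
    bad_status = ~df["status"].isin(list(allowed_statuses))
    if bad_status.any():
        bad_values = sorted(df.loc[bad_status, "status"].unique())
        errors.append(f"Invalid status values: {bad_values}")
    return errors

df = pd.DataFrame({
    "emp_id": [101, 102, 103, 104],
    "age": [25, 45, 17, 60],
    "status": ["Active", "Active", "Inactive", "Fired"],
})

errors = validate_employees(df)
for message in errors:
    print(message)
```

In a real pipeline, a non-empty list would typically be turned into a raised exception so that execution halts with the full report.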

4. Cross-Column Logic Validation

Often, business rules involve multiple columns.

python
orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'order_qty': [10, 5, 2],
    'returned_qty': [2, 0, 5]  # Error: returned 5, but only ordered 2!
})

# Validate that returned quantity cannot exceed ordered quantity
valid_returns = orders['returned_qty'] <= orders['order_qty']

if not valid_returns.all():
    print("\n=== CROSS-COLUMN VALIDATION FAILED ===")
    print(orders[~valid_returns])
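
The check above only prints the failing rows. If the rule is critical, a small helper can raise instead and name the offending orders. This is a sketch under that assumption; `assert_returns_within_orders` is a hypothetical name, and the data repeats the orders example.

```python
import pandas as pd

def assert_returns_within_orders(orders):
    """Raise ValueError naming the orders whose returned_qty exceeds order_qty."""
    valid = orders["returned_qty"] <= orders["order_qty"]
    invalid_ids = orders.loc[~valid, "order_id"]
    if not invalid_ids.empty:
        raise ValueError(
            f"returned_qty exceeds order_qty for order_id(s): {invalid_ids.tolist()}"
        )

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_qty": [10, 5, 2],
    "returned_qty": [2, 0, 5],
})

try:
    assert_returns_within_orders(orders)
    message = "ok"
except ValueError as exc:
    message = str(exc)
print(message)
```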

5. Writing a Defensive Data Pipeline

In production, you wrap your cleaning logic in functions that validate the data *before* and *after* cleaning.

python
def process_customer_data(df):
    """A robust data pipeline with validation checks."""
    
    # Pre-validation: all required columns must be present
    expected_cols = ['id', 'email', 'age']
    if not all(col in df.columns for col in expected_cols):
        raise ValueError(f"Missing required columns. Expected: {expected_cols}")
        
    print("Pre-validation passed. Proceeding with cleaning...")
    
    # --- Cleaning Steps ---
    df_clean = df.copy()
    df_clean['email'] = df_clean['email'].str.lower()
    df_clean = df_clean.dropna(subset=['id', 'email'])
    
    # --- Post-validation ---
    assert df_clean['id'].isnull().sum() == 0, "Null IDs still exist!"
    assert df_clean['email'].duplicated().sum() == 0, "Duplicate emails exist!"
    
    print("Post-validation passed. Data is clean.")
    return df_clean

# Example Usage
raw_data = pd.DataFrame({'id': [1, 2, 2], 'email': ['A@x.com', 'b@x.com', 'b@x.com'], 'age': [20, 30, 30]})
try:
    clean_data = process_customer_data(raw_data)
except (ValueError, AssertionError) as e:
    # Pre-validation raises ValueError; post-validation raises AssertionError
    print(f"PIPELINE HALTED: {e}")
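
The pipeline above halts on the sample data because no cleaning step removes the duplicate email. A possible variation, sketched below, adds a deduplication step so that post-validation can pass. The function name `process_with_dedup` and the choice to keep the first occurrence of each email are assumptions, not part of the original pipeline.

```python
import pandas as pd

def process_with_dedup(df):
    """Variation of the pipeline with an added deduplication step (an assumption)."""
    required = ["id", "email", "age"]
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    # --- Cleaning steps ---
    df_clean = df.copy()
    df_clean["email"] = df_clean["email"].str.lower()
    df_clean = df_clean.dropna(subset=["id", "email"])
    # Assumed business rule: keep the first row seen for each email.
    df_clean = df_clean.drop_duplicates(subset="email", keep="first")

    # --- Post-validation ---
    assert df_clean["id"].notna().all(), "Null IDs still exist!"
    assert not df_clean["email"].duplicated().any(), "Duplicate emails exist!"
    return df_clean

raw_data = pd.DataFrame({
    "id": [1, 2, 2],
    "email": ["A@x.com", "b@x.com", "b@x.com"],
    "age": [20, 30, 30],
})
clean_data = process_with_dedup(raw_data)
print(clean_data)
```

Whether duplicates should be dropped or should halt the pipeline is a business decision; the post-validation asserts stay in place either way to confirm the cleaning step did its job.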

6. Introduction to Schema Validation Libraries

For advanced enterprise pipelines, relying on assert statements gets messy. Libraries like Pandera or Great Expectations define rigid schemas.

*(Note: Concept overview, does not require installation for this course)*

python
# Pandera conceptual example (requires `pip install pandera`):
# import pandera as pa
#
# schema = pa.DataFrameSchema({
#     "age": pa.Column(int, checks=pa.Check.ge(18)),
#     "email": pa.Column(str, checks=pa.Check.str_matches(r"^[\w\.-]+@[\w\.-]+\.\w+$"))
# })
# schema.validate(df)  # raises a SchemaError if any check fails

These libraries automatically handle type checking, range checking, and missing value verification in one step.
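
To get a feel for the declarative idea without installing anything, here is a minimal hand-rolled sketch in plain pandas. The `SCHEMA` dictionary format and the `validate_schema` function are hypothetical and invented for illustration; they are not Pandera's or Great Expectations' API, and real libraries do considerably more.

```python
import pandas as pd

# Hypothetical schema format: one dtype test and one value check per column.
SCHEMA = {
    "age": {
        "dtype_ok": pd.api.types.is_integer_dtype,
        "check": lambda s: s >= 18,
    },
    "email": {
        "dtype_ok": pd.api.types.is_string_dtype,
        "check": lambda s: s.str.match(r"^[\w\.-]+@[\w\.-]+\.\w+$"),
    },
}

def validate_schema(df, schema):
    """Check presence, missing values, dtype, and values for each column."""
    errors = []
    for col, rule in schema.items():
        if col not in df.columns:
            errors.append(f"{col}: column is missing")
            continue
        series = df[col]
        if series.isna().any():
            errors.append(f"{col}: contains missing values")
            continue
        if not rule["dtype_ok"](series):
            errors.append(f"{col}: unexpected dtype {series.dtype}")
            continue
        passed = rule["check"](series).astype(bool)
        if not passed.all():
            errors.append(f"{col}: {int((~passed).sum())} value(s) failed the check")
    return errors

df = pd.DataFrame({
    "age": [25, 17, 40],
    "email": ["a@x.com", "b@x.com", "not-an-email"],
})
errors = validate_schema(df, SCHEMA)
print(errors)
```

The point of the design is that the rules live in one data structure, separate from the code that applies them, which is the same separation the schema libraries provide.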

7. Common Mistakes

  • Failing silently: If you detect invalid data and just print a warning, the bad data continues down the pipeline into your database. Use assertions or raise exceptions to halt execution when critical rules are broken.
  • Validating too late: Validate the raw data *immediately* after reading it from the CSV/Database. Don't wait until the end of a 500-line script to realize a required column is missing.
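
Both mistakes can be avoided with a loader that validates as soon as the data is read and raises instead of printing. The sketch below assumes a hypothetical required column list and uses `io.StringIO` as a stand-in for a real CSV file path.

```python
import io
import pandas as pd

REQUIRED_COLUMNS = ["id", "email", "age"]  # hypothetical required schema

def load_and_validate(source):
    """Read a CSV and fail fast if a required column is missing."""
    df = pd.read_csv(source)
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        # Raise rather than print, so bad data cannot continue downstream.
        raise ValueError(f"Missing required columns: {missing}")
    return df

# In-memory CSV text stands in for files on disk.
good_csv = io.StringIO("id,email,age\n1,a@x.com,20\n")
bad_csv = io.StringIO("id,email\n1,a@x.com\n")

df = load_and_validate(good_csv)
try:
    load_and_validate(bad_csv)
    outcome = "loaded"
except ValueError as exc:
    outcome = str(exc)
print(outcome)
```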

8. MCQs

Question 1

What is the main difference between data cleaning and data validation?

Question 2

Which Python statement is commonly used to throw an error if a condition is False?

Question 3

To check if a column only contains values from a specific list, use:

Question 4

df['id'].nunique() == len(df) checks for what?

Question 5

A rule stating "Delivery Date must be >= Order Date" is an example of?

Question 6

What happens when an assert statement fails?

Question 7

Defensive programming in data pipelines means?

Question 8

Which of the following is a popular Python library specifically for DataFrame schema validation?

Question 9

What does the .all() method do when appended to a boolean series?

Question 10

Why should you validate data *before* cleaning?

9. Interview Questions

  • Q: How would you implement a check in an automated pipeline to ensure a dataset doesn't suddenly drop 50% of its rows compared to yesterday's data?
  • Q: What is the difference between writing an assert statement and dropping invalid rows?
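
One possible starting point for the first question is a row-count guard that compares today's count with the previous run. This is only a sketch: the function name `check_row_count`, the `max_drop` parameter, and the idea that yesterday's count was saved somewhere by the previous run are all assumptions.

```python
def check_row_count(current_rows, previous_rows, max_drop=0.5):
    """Raise ValueError if the row count fell by max_drop (a fraction) or more."""
    if previous_rows <= 0:
        raise ValueError("previous_rows must be positive to compute a drop")
    drop = (previous_rows - current_rows) / previous_rows
    if drop >= max_drop:
        raise ValueError(
            f"Row count dropped {drop:.0%} ({previous_rows} -> {current_rows})"
        )
    return drop

# Hypothetical counts: yesterday's value would come from a stored run log.
small_drop = check_row_count(950, 1000)

try:
    check_row_count(400, 1000)
    result = "ok"
except ValueError as exc:
    result = str(exc)
print(result)
```

In practice `current_rows` would be `len(df)` for the freshly loaded data.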

10. Summary

Data validation ensures data integrity. Use Python's assert statement to enforce uniqueness (nunique == len), set membership (isin), range bounds (min/max), and cross-column logic. Build defensive data pipelines by sandwiching your cleaning logic between pre-validation (checking raw schema) and post-validation (verifying cleaning success). When rules break, halt the pipeline rather than failing silently.

11. Next Chapter Recommendation

In Chapter 12: Cleaning Data with Pandas, we will synthesize everything learned so far into efficient, chainable Pandas workflows for complete DataFrame manipulation.
