CHAPTER 11 Beginner

Data Validation Techniques

Updated: May 18, 2026


1. Chapter Introduction

Cleaning data is reactive; you are fixing problems that already exist. Data Validation is proactive. It involves writing logical rules (assertions) to ensure the data entering your pipeline meets strict business requirements. If data fails validation, the pipeline throws an error before corrupted data can poison a database or machine learning model.

2. What is Data Validation?

Data validation checks if data conforms to defined rules.

Range Validation: Is Age between 0 and 120?

Uniqueness Validation: Are all Customer IDs distinct?

Set Validation: Is the Status column strictly limited to ['Active', 'Pending', 'Closed']?

Cross-Column Validation: Is the EndDate logically after the StartDate?
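
The four rule types above can each be written as a single boolean check in pandas. The sketch below is only an illustration: the sample data and the exact column names (`CustomerID`, `Age`, `Status`, `StartDate`, `EndDate`) are assumptions modelled on the list, not a dataset from this course.

```python
import pandas as pd

# Hypothetical sample data; column names follow the examples in the list above.
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Age": [34, 130, 52],  # 130 is outside the 0-120 range
    "Status": ["Active", "Pending", "Closed"],
    "StartDate": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    # The second row ends before it starts
    "EndDate": pd.to_datetime(["2024-06-01", "2024-01-15", "2024-09-01"]),
})

# One boolean result per rule type: True means the whole column passes.
rule_results = {
    "range": bool(customers["Age"].between(0, 120).all()),
    "uniqueness": bool(customers["CustomerID"].is_unique),
    "set": bool(customers["Status"].isin(["Active", "Pending", "Closed"]).all()),
    "cross_column": bool((customers["EndDate"] > customers["StartDate"]).all()),
}
print(rule_results)
```

Each entry reduces a whole column (or pair of columns) to one pass/fail value, which is the form an assertion or exception check needs.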

3. Implementing Basic Validations with Assertions

Python's assert statement is the simplest way to validate data. If the condition is True, nothing happens. If it is False, Python raises an AssertionError, which stops the program unless the error is caught.

python

import pandas as pd

df = pd.DataFrame({
    'emp_id': [101, 102, 103, 104], # Unique
    'age': [25, 45, 17, 60],        # One under 18
    'status': ['Active', 'Active', 'Inactive', 'Fired'] # 'Fired' is not allowed
})

# 1. Uniqueness Validation
# Assert that the number of unique IDs equals the total number of rows
assert df['emp_id'].nunique() == len(df), "ERROR: Duplicate Employee IDs found!"
print("Uniqueness validation passed.")

# 2. Range Validation
# We want to ensure all employees are >= 18
try:
    assert df['age'].min() >= 18, f"ERROR: Found employees under 18! Min age is {df['age'].min()}"
except AssertionError as e:
    print(e)

# 3. Set Membership Validation
allowed_statuses = ['Active', 'Inactive']
# isin() returns True/False. all() checks if ALL rows are True.
try:
    assert df['status'].isin(allowed_statuses).all(), "ERROR: Invalid status codes found!"
except AssertionError as e:
    print(e)
    # View the invalid rows
    invalid_rows = df[~df['status'].isin(allowed_statuses)]
    print("Invalid rows:\n", invalid_rows)
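
One caveat with assert: Python skips assert statements entirely when it is run with the -O flag, so a check written as an assertion can silently disappear in an optimized run. A possible alternative is a small helper that raises an explicit exception. The helper name `require` and the sample data below are hypothetical, shown only as a minimal sketch.

```python
import pandas as pd

def require(mask: pd.Series, message: str) -> None:
    """Raise ValueError if any row fails the boolean mask (hypothetical helper)."""
    failed = ~mask
    if failed.any():
        # Collect the index labels of the rows that broke the rule
        bad_labels = failed[failed].index.tolist()
        raise ValueError(f"{message} (failing row labels: {bad_labels})")

employees = pd.DataFrame({"emp_id": [101, 102, 103], "age": [25, 17, 40]})

try:
    require(employees["age"] >= 18, "Found employees under 18")
except ValueError as exc:
    error_text = str(exc)
    print(error_text)
```

Unlike assert, the raise inside `require` runs regardless of interpreter flags, and the message points at the offending rows.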

4. Cross-Column Logic Validation

Often, business rules involve multiple columns.

python

import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'order_qty': [10, 5, 2],
    'returned_qty': [2, 0, 5]  # Error: returned 5, but only ordered 2!
})

# Validate that returned quantity cannot exceed ordered quantity
valid_returns = orders['returned_qty'] <= orders['order_qty']

if not valid_returns.all():
    print("\n=== CROSS-COLUMN VALIDATION FAILED ===")
    print(orders[~valid_returns])
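
When a table carries several cross-column rules, it can help to name each rule and record which rows break it. The following sketch assumes two extra columns, `unit_price` and `line_total`, which are hypothetical and not part of the example above; the second rule is likewise an invented illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_qty": [10, 5, 2],
    "returned_qty": [2, 0, 5],
    "unit_price": [4.0, 3.0, 7.5],      # hypothetical column
    "line_total": [40.0, 15.0, 14.0],   # hypothetical column; order 3 should be 15.0
})

# Each rule maps a readable name to a boolean Series (True = row passes).
expected_total = orders["order_qty"] * orders["unit_price"]
rules = {
    "returned_qty <= order_qty": orders["returned_qty"] <= orders["order_qty"],
    # Compare floats with a small tolerance instead of strict equality
    "line_total == order_qty * unit_price": (orders["line_total"] - expected_total).abs() < 0.01,
}

# For every rule, list the order_id values that violate it.
violations = {
    name: orders.loc[~passed, "order_id"].tolist()
    for name, passed in rules.items()
}
print(violations)
```

The caller can then decide whether any non-empty list should halt the pipeline or only be logged for review.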

5. Writing a Defensive Data Pipeline

In production, you wrap your cleaning logic in functions that validate the data *before* and *after* cleaning.

python

import pandas as pd

def process_customer_data(df):
    """A robust data pipeline with validation checks."""

    # Pre-validation: every required column must be present
    expected_cols = ['id', 'email', 'age']
    if not all(col in df.columns for col in expected_cols):
        raise ValueError(f"Missing required columns. Expected: {expected_cols}")

    print("Pre-validation passed. Proceeding with cleaning...")

    # --- Cleaning Steps ---
    df_clean = df.copy()
    df_clean['email'] = df_clean['email'].str.lower()
    df_clean = df_clean.dropna(subset=['id', 'email'])

    # --- Post-validation ---
    assert df_clean['id'].isnull().sum() == 0, "Null IDs still exist!"
    assert df_clean['email'].duplicated().sum() == 0, "Duplicate emails exist!"

    print("Post-validation passed. Data is clean.")
    return df_clean

# Example usage: the duplicate email 'b@x.com' trips the post-validation check
raw_data = pd.DataFrame({'id': [1, 2, 2], 'email': ['A@x.com', 'b@x.com', 'b@x.com'], 'age':[20, 30, 30]})
try:
    clean_data = process_customer_data(raw_data)
except (ValueError, AssertionError) as e:
    # Catch both the pre-validation ValueError and the post-validation AssertionError
    print(f"PIPELINE HALTED: {e}")
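
Post-validation can also guard against cleaning that removes far more data than expected. The sketch below is one possible approach, not part of the pipeline above: the function name `check_row_retention` and the 50% default threshold are assumptions chosen for illustration.

```python
import pandas as pd

def check_row_retention(n_before: int, n_after: int, min_ratio: float = 0.5) -> float:
    """Raise ValueError if too few rows survive cleaning (hypothetical helper)."""
    if n_before == 0:
        raise ValueError("Input has no rows to validate")
    ratio = n_after / n_before
    if ratio < min_ratio:
        raise ValueError(
            f"Only {ratio:.0%} of rows survived cleaning (minimum {min_ratio:.0%})"
        )
    return ratio

# Three of the four rows have a missing id or email
raw = pd.DataFrame({
    "id": [1, None, None, None],
    "email": ["a@x.com", "b@x.com", None, "d@x.com"],
})
cleaned = raw.dropna(subset=["id", "email"])

try:
    check_row_retention(len(raw), len(cleaned))
except ValueError as exc:
    retention_error = str(exc)
    print(f"PIPELINE HALTED: {retention_error}")
```

The same comparison works against a row count saved from a previous run, as long as that earlier count is stored somewhere the pipeline can read it.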

6. Introduction to Schema Validation Libraries

For larger pipelines, dozens of scattered assert statements become hard to read and maintain. Libraries such as Pandera and Great Expectations let you declare the expected schema once and validate a DataFrame against it.

*(Note: Concept overview, does not require installation for this course)*

python

# Pandera conceptual example:
# import pandera as pa
#
# schema = pa.DataFrameSchema({
#     "age": pa.Column(int, checks=pa.Check.ge(18)),
#     "email": pa.Column(str, checks=pa.Check.str_matches(r"^[\w\.-]+@[\w\.-]+\.\w+$"))
# })
# schema.validate(df)

These libraries automatically handle type checking, range checking, and missing value verification in one step.
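
To make the idea concrete without installing anything, here is a hand-rolled sketch of what a schema check bundles together. The rule format (`MINI_SCHEMA`) and the function `validate_schema` are invented for this illustration; they are not the Pandera or Great Expectations API, and real libraries cover far more cases.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_string_dtype

# Hypothetical rule format invented for this sketch; it is not the Pandera
# or Great Expectations API.
MINI_SCHEMA = {
    "age": {"dtype_check": is_integer_dtype, "min": 18},
    "email": {"dtype_check": is_string_dtype},
}

def validate_schema(df: pd.DataFrame, schema: dict) -> list:
    """Return a list of human-readable schema violations (empty if none)."""
    errors = []
    for column, rule in schema.items():
        if column not in df.columns:
            errors.append(f"{column}: column is missing")
            continue
        series = df[column]
        if series.isna().any():
            errors.append(f"{column}: contains missing values")
        if not rule["dtype_check"](series):
            errors.append(f"{column}: unexpected dtype {series.dtype}")
        elif "min" in rule and (series < rule["min"]).any():
            # Range check only runs when the dtype is the expected one
            errors.append(f"{column}: values below {rule['min']}")
    return errors

people = pd.DataFrame({
    "age": [25, 17, 40],
    "email": ["a@x.com", "b@x.com", "c@x.com"],
})
schema_errors = validate_schema(people, MINI_SCHEMA)
print(schema_errors)
```

Collecting every violation in one list, instead of stopping at the first failure, mirrors the single-call style of the schema libraries.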

7. Common Mistakes

Failing silently: If you detect invalid data and just print a warning, the bad data continues down the pipeline into your database. Use assertions or raise exceptions to halt execution when critical rules are broken.

Validating too late: Validate the raw data *immediately* after reading it from the CSV/Database. Don't wait until the end of a 500-line script to realize a required column is missing.
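
Both mistakes can be avoided by checking the schema at load time and raising instead of printing. The sketch below assumes a hypothetical loader `load_customers` and reuses the required columns from the pipeline example in section 5; an in-memory string stands in for a real CSV file.

```python
import io
import pandas as pd

# Required columns taken from the pipeline example in section 5
REQUIRED_COLUMNS = {"id", "email", "age"}

def load_customers(source) -> pd.DataFrame:
    """Read a CSV and validate its columns immediately (hypothetical loader)."""
    df = pd.read_csv(source)
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        # Raise rather than print, so bad input cannot continue downstream
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    return df

# Stand-in for a CSV file that lacks the 'age' column
csv_text = "id,email\n1,a@x.com\n2,b@x.com\n"

try:
    load_customers(io.StringIO(csv_text))
except ValueError as exc:
    load_error = str(exc)
    print(f"PIPELINE HALTED: {load_error}")
```

Because the check sits inside the loader, every later step can assume the required columns exist.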

8. MCQs

Question 1

What is the main difference between data cleaning and data validation?

Question 2

Which Python statement is commonly used to throw an error if a condition is False?

Question 3

To check if a column only contains values from a specific list, use:

Question 4

`df['id'].nunique() == len(df)` checks for what?

Question 5

A rule stating "Delivery Date must be >= Order Date" is an example of?

Question 6

What happens when an `assert` statement fails?

Question 7

Defensive programming in data pipelines means?

Question 8

Which of the following is a popular Python library specifically for DataFrame schema validation?

Question 9

What does the `.all()` method do when appended to a boolean series?

Question 10

Why should you validate data before cleaning?

9. Interview Questions

Q: How would you implement a check in an automated pipeline to ensure a dataset doesn't suddenly drop 50% of its rows compared to yesterday's data?

Q: What is the difference between writing an assert statement and dropping invalid rows?

10. Summary

Data validation ensures data integrity. Use Python's assert statement to enforce uniqueness (nunique == len), set membership (isin), range bounds (min/max), and cross-column logic. Build defensive data pipelines by sandwiching your cleaning logic between pre-validation (checking raw schema) and post-validation (verifying cleaning success). When rules break, halt the pipeline rather than failing silently.

11. Next Chapter Recommendation

In Chapter 12: Cleaning Data with Pandas, we will synthesize everything learned so far into efficient, chainable Pandas workflows for complete DataFrame manipulation.
