Data Validation Techniques
# CHAPTER 11
1. Chapter Introduction
Cleaning data is reactive; you are fixing problems that already exist. Data Validation is proactive. It involves writing logical rules (assertions) to ensure the data entering your pipeline meets strict business requirements. If data fails validation, the pipeline throws an error before corrupted data can poison a database or machine learning model.
2. What is Data Validation?
Data validation checks if data conforms to defined rules.
- Range Validation: Is Age between 0 and 120?
- Uniqueness Validation: Are all Customer IDs distinct?
- Set Validation: Is the Status column strictly limited to ['Active', 'Pending', 'Closed']?
- Cross-Column Validation: Is the EndDate logically after the StartDate?
3. Implementing Basic Validations with Assertions
Python's assert statement is the simplest way to validate data. If the condition is True, nothing happens. If it is False, Python raises an AssertionError, which stops the program unless the error is caught.
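The three single-column checks from the previous section can be sketched with assert as follows. This is a minimal illustration, assuming pandas is available; the sample DataFrame, its column names, and the helper name validate_customers are hypothetical and not part of the original chapter.

```python
import pandas as pd

# Hypothetical sample data; column names are assumptions for illustration.
df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "age": [34, 58, 21],
    "status": ["Active", "Pending", "Closed"],
})

def validate_customers(frame):
    # Range validation: every age must lie between 0 and 120 (inclusive).
    assert frame["age"].between(0, 120).all(), "age outside 0-120"

    # Uniqueness validation: the number of distinct IDs must equal the row count.
    assert frame["customer_id"].nunique() == len(frame), "duplicate customer_id"

    # Set validation: status may only take one of the allowed values.
    allowed = ["Active", "Pending", "Closed"]
    assert frame["status"].isin(allowed).all(), "unexpected status value"

# Passes silently because every rule holds for the sample data.
validate_customers(df)
```

Each comparison produces a boolean Series, and .all() collapses it to a single True only if every row passes. The message after the comma is shown in the AssertionError, which makes a failure easier to trace.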
4. Cross-Column Logic Validation
Often, business rules involve multiple columns.
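A cross-column rule compares two columns row by row. The sketch below checks the StartDate/EndDate rule mentioned earlier, assuming pandas and a small made-up DataFrame; the helper name assert_dates_ordered is hypothetical.

```python
import pandas as pd

# Hypothetical sample data: the second row breaks the rule on purpose.
df = pd.DataFrame({
    "StartDate": pd.to_datetime(["2024-01-01", "2024-02-10", "2024-03-05"]),
    "EndDate": pd.to_datetime(["2024-01-15", "2024-02-09", "2024-03-20"]),
})

# Boolean mask of rows where EndDate is NOT strictly after StartDate.
invalid = df["EndDate"] <= df["StartDate"]

# Inspect the offending rows before deciding how to handle them.
violations = df[invalid]
print(violations)

def assert_dates_ordered(frame):
    """Halt if any row has EndDate on or before StartDate."""
    bad = frame["EndDate"] <= frame["StartDate"]
    assert not bad.any(), f"{int(bad.sum())} row(s) have EndDate on or before StartDate"
```

Building the mask first and asserting second is a deliberate choice here: the mask tells you which rows failed, not only that something failed. Note that comparisons involving missing dates evaluate to False, so a separate missing-value check is still needed.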
5. Writing a Defensive Data Pipeline
In production, you wrap your cleaning logic in functions that validate the data *before* and *after* cleaning.
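One way to structure such a pipeline is sketched below. It assumes pandas, and the function names, the required columns, and the cleaning steps are all hypothetical choices made for illustration rather than a prescribed design.

```python
import pandas as pd

REQUIRED_COLUMNS = {"customer_id", "age"}  # hypothetical raw schema

def validate_raw(df):
    """Pre-validation: check the raw schema before any cleaning runs."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    assert not missing, f"missing required columns: {sorted(missing)}"
    assert len(df) > 0, "input is empty"

def clean(df):
    """Cleaning: drop duplicate IDs and rows whose age is not numeric."""
    out = df.drop_duplicates(subset="customer_id").copy()
    out["age"] = pd.to_numeric(out["age"], errors="coerce")
    return out.dropna(subset=["age"])

def validate_clean(df):
    """Post-validation: verify that cleaning actually succeeded."""
    assert df["customer_id"].nunique() == len(df), "duplicate customer_id"
    assert df["age"].between(0, 120).all(), "age outside 0-120"

def run_pipeline(raw):
    validate_raw(raw)        # fail fast on a broken input
    cleaned = clean(raw)
    validate_clean(cleaned)  # fail fast on a broken cleaning step
    return cleaned

raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age": ["34", "58", "58", "unknown"],
})
result = run_pipeline(raw)
print(result)
```

In this sample, the duplicate row for customer 2 and the row with a non-numeric age are removed during cleaning, and the post-validation confirms that what remains satisfies the rules.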
6. Introduction to Schema Validation Libraries
For advanced enterprise pipelines, relying on assert statements gets messy. Libraries like Pandera or Great Expectations define rigid schemas.
*(Note: Concept overview, does not require installation for this course)*
These libraries automatically handle type checking, range checking, and missing value verification in one step.
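To make the idea concrete without installing anything, the sketch below hand-rolls a tiny declarative schema in plain pandas. It is not the Pandera or Great Expectations API; the SCHEMA layout, the rule names, and the validate_schema function are hypothetical and only illustrate the concept of declaring the rules once and checking them in one step.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Hypothetical schema format: column name -> rules. Not a real library's API.
SCHEMA = {
    "customer_id": {"type_check": is_integer_dtype, "nullable": False, "unique": True},
    "age": {"type_check": is_integer_dtype, "nullable": False, "min": 0, "max": 120},
    "status": {"nullable": False, "allowed": ["Active", "Pending", "Closed"]},
}

def validate_schema(df, schema):
    """Return a list of rule violations; an empty list means the data passed."""
    errors = []
    for col, rules in schema.items():
        if col not in df.columns:
            errors.append(f"{col}: column is missing")
            continue
        s = df[col]
        type_check = rules.get("type_check")
        if type_check is not None and not type_check(s):
            errors.append(f"{col}: unexpected dtype {s.dtype}")
            continue  # skip value checks that assume the right type
        if not rules.get("nullable", True) and s.isna().any():
            errors.append(f"{col}: contains missing values")
        if rules.get("unique", False) and s.duplicated().any():
            errors.append(f"{col}: contains duplicate values")
        if "min" in rules and (s < rules["min"]).any():
            errors.append(f"{col}: value below {rules['min']}")
        if "max" in rules and (s > rules["max"]).any():
            errors.append(f"{col}: value above {rules['max']}")
        if "allowed" in rules and not s.dropna().isin(rules["allowed"]).all():
            errors.append(f"{col}: value outside allowed set")
    return errors

good = pd.DataFrame({"customer_id": [1, 2], "age": [30, 45],
                     "status": ["Active", "Closed"]})
bad = pd.DataFrame({"customer_id": [1, 1], "age": [30, 150],
                    "status": ["Active", "Archived"]})

print(validate_schema(good, SCHEMA))
print(validate_schema(bad, SCHEMA))
```

Collecting every violation into a list, instead of stopping at the first one, lets the caller report all problems at once and then decide whether to raise. Dedicated libraries offer the same pattern with far more features and better error reporting.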
7. Common Mistakes
- Failing silently: If you detect invalid data and just print a warning, the bad data continues down the pipeline into your database. Use assertions or raise exceptions to halt execution when critical rules are broken.
- Validating too late: Validate the raw data *immediately* after reading it from the CSV/Database. Don't wait until the end of a 500-line script to realize a required column is missing.
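Both mistakes can be avoided with a loader that validates immediately after reading and raises instead of printing. The sketch below assumes pandas; the DataValidationError class, the load_orders function, and the order_id/amount columns are hypothetical. It uses an explicit raise rather than assert because Python skips assert statements when run with the -O flag.

```python
import io
import pandas as pd

class DataValidationError(ValueError):
    """Hypothetical exception type for broken critical rules."""

def load_orders(source):
    df = pd.read_csv(source)

    # Validate right after reading, before any other logic runs.
    required = ["order_id", "amount"]
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise DataValidationError(f"missing required columns: {missing}")

    negative = df["amount"] < 0
    if negative.any():
        # A print() here would let the bad rows continue downstream;
        # raising halts the pipeline instead.
        raise DataValidationError(
            f"{int(negative.sum())} row(s) have a negative amount"
        )
    return df

# In-memory CSV text stands in for a real file in this example.
good_csv = io.StringIO("order_id,amount\n1,19.99\n2,5.00\n")
orders = load_orders(good_csv)
print(len(orders))
```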
8. MCQs
What is the main difference between data cleaning and data validation?
Which Python statement is commonly used to throw an error if a condition is False?
To check if a column only contains values from a specific list, use:
df['id'].nunique() == len(df) checks for what?
A rule stating "Delivery Date must be >= Order Date" is an example of?
What happens when an assert statement fails?
Defensive programming in data pipelines means?
Which of the following is a popular Python library specifically for DataFrame schema validation?
What does the .all() method do when appended to a boolean series?
Why should you validate data *before* cleaning?
9. Interview Questions
- Q: How would you implement a check in an automated pipeline to ensure a dataset doesn't suddenly drop 50% of its rows compared to yesterday's data?
- Q: What is the difference between writing an assert statement and dropping invalid rows?
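For the row-count question, one possible answer is a volume check that compares today's row count with the previous run's count. The sketch below is only one approach; the check_row_count function and the max_drop parameter are hypothetical, and it assumes yesterday's count was saved somewhere the pipeline can read it.

```python
def check_row_count(today_rows, yesterday_rows, max_drop=0.5):
    """Raise if today's row count fell by max_drop (as a fraction) or more."""
    if yesterday_rows <= 0:
        raise ValueError("yesterday's row count must be positive to compare")

    # Fraction of rows lost relative to yesterday; negative means growth.
    drop = (yesterday_rows - today_rows) / yesterday_rows
    if drop >= max_drop:
        raise ValueError(
            f"row count fell {drop:.0%} (from {yesterday_rows} to {today_rows})"
        )
    return drop

# A 10% drop is within tolerance, so the check returns without raising.
check_row_count(today_rows=900, yesterday_rows=1000)
```

In practice, today_rows would typically come from len(df) right after loading, and the threshold would be tuned to how much the dataset normally fluctuates.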
10. Summary
Data validation ensures data integrity. Use Python's assert statement to enforce uniqueness (nunique == len), set membership (isin), range bounds (min/max), and cross-column logic. Build defensive data pipelines by sandwiching your cleaning logic between pre-validation (checking raw schema) and post-validation (verifying cleaning success). When rules break, halt the pipeline rather than failing silently.