Date and Time Cleaning
# CHAPTER 10
1. Chapter Introduction
Dates are universally problematic. Americans write MM/DD/YYYY, Europeans write DD/MM/YYYY, databases store ISO YYYY-MM-DD, and Excel often stores them as raw serial numbers. If Pandas reads a date as a string (object), you cannot reliably filter by year or calculate the days between two events. This chapter shows you how to wrangle messy date strings into standardized datetime objects.
2. Converting Strings to Datetime
The primary tool for this is pd.to_datetime().
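As a minimal sketch of the conversion, the example below builds a small DataFrame with a hypothetical `order_date` column (the column name and values are made up for illustration), converts it with `pd.to_datetime()`, and then filters on the result.

```python
import pandas as pd

# Hypothetical sample data: ISO-formatted dates stored as plain strings.
df = pd.DataFrame({"order_date": ["2024-01-15", "2024-02-20", "2024-03-05"]})

# Before conversion the column holds strings, not dates.
was_datetime = pd.api.types.is_datetime64_any_dtype(df["order_date"])

# Convert the string column to a datetime column.
df["order_date"] = pd.to_datetime(df["order_date"])
is_datetime = pd.api.types.is_datetime64_any_dtype(df["order_date"])

# Date comparisons now behave chronologically.
recent = df[df["order_date"] >= "2024-02-01"]
print(was_datetime, is_datetime, len(recent))
```

After the conversion, comparisons and arithmetic treat the values as points in time rather than as text.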
3. Handling Mixed and Ambiguous Formats
What happens when 01/02/2024 appears? Is it Jan 2nd or Feb 1st?
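By default, Pandas assumes the US format (month first). For day-first data, pass dayfirst=True or, better still, an explicit format string so nothing is left to guesswork.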
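The ambiguity can be illustrated with a short sketch. The sample strings are invented; the point is that the same text yields different dates depending on the `format` (or `dayfirst`) you supply.

```python
import pandas as pd

# Hypothetical ambiguous strings: each could be read month-first or day-first.
raw = pd.Series(["01/02/2024", "03/04/2024"])

# Explicit US interpretation: month/day/year.
us = pd.to_datetime(raw, format="%m/%d/%Y")

# Explicit European interpretation: day/month/year.
eu = pd.to_datetime(raw, format="%d/%m/%Y")

# dayfirst=True asks Pandas to prefer the day-first reading when inferring.
eu_alt = pd.to_datetime(raw, dayfirst=True)

print(us.iloc[0].date(), eu.iloc[0].date())
```

An explicit `format` is usually the safer choice, because a row that does not match raises an error instead of being silently reinterpreted.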
4. Handling Invalid Dates (Out of Bounds)
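What if a user typed "2024-13-45" (invalid month and day) or "Not a date"? By default, pd.to_datetime() raises an error on such values; passing errors='coerce' converts them to NaT (Not a Time) instead.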
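A minimal sketch of this behavior, using made-up input values: unparseable entries become `NaT`, and the original strings can be kept aside for inspection.

```python
import pandas as pd

# Hypothetical raw input: one valid date, one impossible date, one junk string.
raw = pd.Series(["2024-01-15", "2024-13-45", "Not a date"])

# errors="coerce" replaces anything that cannot be parsed with NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Keep the offending original values so they can be reviewed or fixed.
bad_rows = raw[parsed.isna()]
print(bad_rows.tolist())
```

Reviewing `bad_rows` before dropping anything helps distinguish genuine garbage from dates that simply use a different format.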
5. Extracting Date Components
Once a column is a proper datetime object, you unlock the .dt accessor. This lets you extract components such as the year, month, day, and weekday with a single attribute, which is vital for machine learning features and reporting.
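The sketch below shows a few common `.dt` extractions on a hypothetical `ts` column; the derived column names are illustrative only.

```python
import pandas as pd

# Hypothetical timestamps, already converted to datetime.
df = pd.DataFrame({"ts": pd.to_datetime(["2024-01-01", "2024-02-29", "2024-12-25"])})

df["year"] = df["ts"].dt.year              # integer year
df["month"] = df["ts"].dt.month            # 1 to 12
df["day"] = df["ts"].dt.day                # day of the month
df["weekday"] = df["ts"].dt.day_name()     # weekday name, e.g. "Monday"
df["is_weekend"] = df["ts"].dt.dayofweek >= 5  # Monday=0, so 5 and 6 are the weekend

print(df)
```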
6. Mini Project: Event Scheduling Cleaner
Calculate the duration between two events and identify logically impossible dates (e.g., end date is before start date).
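One possible way to approach this project is sketched below. The `events` table, its column names, and its values are all invented for illustration; a real dataset would need its own format string and validation rules.

```python
import pandas as pd

# Hypothetical event schedule with one reversed range and one unparseable start.
events = pd.DataFrame({
    "event": ["Workshop", "Conference", "Webinar", "Meetup"],
    "start": ["2024-03-01", "2024-04-10", "2024-05-20", "not scheduled"],
    "end":   ["2024-03-03", "2024-04-08", "2024-05-20", "2024-06-01"],
})

# Convert both columns; unparseable values become NaT.
for col in ["start", "end"]:
    events[col] = pd.to_datetime(events[col], format="%Y-%m-%d", errors="coerce")

# Subtracting two datetime columns yields a timedelta column.
events["duration"] = events["end"] - events["start"]

# .dt.days extracts the whole number of days (NaN where a date is missing).
events["duration_days"] = events["duration"].dt.days

# Flag logically impossible rows: the end falls before the start.
events["impossible"] = events["end"] < events["start"]

# Flag rows where either date could not be parsed.
events["missing_date"] = events[["start", "end"]].isna().any(axis=1)

print(events[["event", "duration_days", "impossible", "missing_date"]])
```

Comparisons involving `NaT` evaluate to False, so missing dates are tracked in their own column rather than being lumped in with the impossible ones.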
7. Common Mistakes
- Relying on Pandas to guess the format on huge files: pd.to_datetime() without a format parameter can be very slow because Pandas has to infer the format itself, and older versions may fall back to parsing and guessing row by row. If you know the format is YYYY-MM-DD, always pass format='%Y-%m-%d'. On large files this can be an order of magnitude faster.
- Ignoring timezones: 2024-01-01 10:00:00 means nothing globally without a timezone. If you are merging data from servers in New York and London, you must first attach the correct zone with .dt.tz_localize() and then bring both to a common zone with .dt.tz_convert().
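The timezone point can be sketched as follows. The two single-row series stand in for hypothetical server logs; the zone names are standard IANA identifiers, and converting to UTC is one common choice rather than the only one.

```python
import pandas as pd

# Hypothetical naive timestamps: the same wall-clock time logged on two servers.
ny = pd.Series(pd.to_datetime(["2024-01-01 10:00:00"]))
london = pd.Series(pd.to_datetime(["2024-01-01 10:00:00"]))

# Step 1: tz_localize declares which zone each naive timestamp belongs to.
# Step 2: tz_convert moves both onto a common zone (UTC here).
ny_utc = ny.dt.tz_localize("America/New_York").dt.tz_convert("UTC")
london_utc = london.dt.tz_localize("Europe/London").dt.tz_convert("UTC")

# The identical wall-clock readings are actually hours apart.
gap = ny_utc - london_utc
print(gap.iloc[0])
```

Only after both columns share a timezone do comparisons, merges, and subtractions reflect real elapsed time.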
8. MCQs
Which Pandas function converts string columns to dates?
What data type does Pandas use for dates and times?
If a European dataset has dates formatted as DD/MM/YYYY, what parameter should you use?
What happens when pd.to_datetime() encounters "Feb 30th" with errors='coerce'?
Which accessor is required to extract the year from a datetime column?
How do you extract the day of the week (e.g., Monday)?
Why is it recommended to explicitly provide the format='%Y-%m-%d' parameter?
What represents a missing date value in Pandas?
If you subtract two datetime columns df['end'] - df['start'], what is the resulting data type?
How do you extract just the integer number of days from a timedelta calculation?
9. Interview Questions
- Q: A CSV contains a mix of US (MM/DD/YYYY) and European (DD/MM/YYYY) dates in the same column. How would you clean this?
- Q: How do you calculate the exact number of months between a user's signup_date and churn_date in Pandas?
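For the months question, one possible approach (not the only valid answer) is to combine the year and month differences from the `.dt` accessor and subtract one when the final month is incomplete. The `users` table and its values are hypothetical, and the column names `signup_date` and `churn_date` are assumed spellings.

```python
import pandas as pd

# Hypothetical user data with assumed column names.
users = pd.DataFrame({
    "signup_date": pd.to_datetime(["2023-01-31", "2023-06-15"]),
    "churn_date": pd.to_datetime(["2023-03-01", "2024-06-14"]),
})

signup = users["signup_date"]
churn = users["churn_date"]

# Calendar-month difference: 12 per year plus the month offset.
months = (churn.dt.year - signup.dt.year) * 12 + (churn.dt.month - signup.dt.month)

# Subtract one month where the churn day falls before the signup day,
# meaning the last month was not fully completed.
incomplete = (churn.dt.day < signup.dt.day).astype(int)
users["months_active"] = months - incomplete

print(users)
```

Dividing a day count by 30 is a common shortcut, but it drifts because months differ in length; calendar arithmetic like this avoids that, at the cost of having to define what a partial month means.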
10. Summary
Dates stored as strings limit your analytical power. Use pd.to_datetime() to cast strings into datetime64. Use errors='coerce' to handle unparseable garbage text (turning it into NaT). Once converted, the .dt accessor lets you easily extract years, months, days, and weekdays, enabling powerful time-series analysis and cohort building.