Skip to main content
Data Cleaning
CHAPTER 10 Beginner

Date and Time Cleaning

Updated: May 18, 2026
5 min read

# CHAPTER 10

Date and Time Cleaning

1. Chapter Introduction

Dates are universally problematic. Americans write MM/DD/YYYY, Europeans write DD/MM/YYYY, databases store ISO YYYY-MM-DD, and Excel often stores them as raw numbers. If Pandas reads a date as a string (object), you cannot filter by year or calculate the days between two events. This chapter shows you how to wrangle messy date strings into standardized datetime objects.

2. Converting Strings to Datetime

The primary tool for this is pd.to_datetime().

python
123456789101112131415161718192021
import pandas as pd

df = pd.DataFrame({
    'date_str': [
        '2024-01-15',      # ISO standard (YYYY-MM-DD)
        '01/25/2024',      # US format (MM/DD/YYYY)
        'Jan 5, 2024',     # Text format
        '20240130'         # YYYYMMDD continuous
    ]
})

print("=== RAW DTYPES ===")
print(df.dtypes)

# Pandas is remarkably good at guessing the format automatically
df['clean_date'] = pd.to_datetime(df['date_str'])

print("\n=== CONVERTED DATETIME ===")
print(df)
print("\nNew Dtypes:")
print(df.dtypes) # Note it is now datetime64[ns]

3. Handling Mixed and Ambiguous Formats

What happens when 01/02/2024 appears? Is it Jan 2nd or Feb 1st? By default, Pandas assumes US format (Month first).

python
12345678910
ambiguous = pd.Series(['01/05/2024', '15/05/2024']) # 15 must be day, so it's May 15th

# If data is European (Day first), TELL pandas explicitly:
euro_dates = pd.to_datetime(ambiguous, dayfirst=True)
print("\n=== DAY FIRST PARSING ===")
print(euro_dates)

# Explicitly defining the format (FASTEST and SAFEST method)
# %Y = 4 digit year, %m = 2 digit month, %d = 2 digit day
df['exact_date'] = pd.to_datetime(df['date_str'], format='%Y-%m-%d', errors='ignore')

4. Handling Invalid Dates (Out of Bounds)

What if a user typed "2024-13-45" (Invalid month/day) or "Not a date"?

python
1234567891011
bad_dates = pd.DataFrame({'dob': ['1990-05-12', '2020-02-30', 'Unknown', '1985-11-22']})

# default: errors='raise' (will crash the script)
# errors='coerce': forces bad data to become NaT (Not a Time)

bad_dates['clean_dob'] = pd.to_datetime(bad_dates['dob'], errors='coerce')

print("\n=== HANDLING BAD DATES ===")
print(bad_dates)
# 2020-02-30 (Feb 30th doesn't exist) becomes NaT
# 'Unknown' becomes NaT

5. Extracting Date Components

Once a column is a proper datetime object, you unlock the .dt accessor. This allows you to extract features effortlessly—vital for machine learning and reporting.

python
1234567891011
df = pd.DataFrame({'timestamp': pd.to_datetime(['2024-01-15 14:30:00', '2024-07-22 09:15:00'])})

# Extract components using .dt
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['day_name'] = df['timestamp'].dt.day_name()
df['hour'] = df['timestamp'].dt.hour
df['is_weekend'] = df['timestamp'].dt.dayofweek >= 5 # 5 is Saturday, 6 is Sunday

print("\n=== EXTRACTED COMPONENTS ===")
print(df)

6. Mini Project: Event Scheduling Cleaner

Calculate the duration between two events and identify logically impossible dates (e.g., end date is before start date).

python
12345678910111213141516171819202122
events = pd.DataFrame({
    'event': ['A', 'B', 'C'],
    'start': ['2024-01-01', '2024-05-10', '2024-10-01'],
    'end':   ['2024-01-05', '2024-05-01', '2024-10-15'] # Event B ends before it starts!
})

# 1. Convert to datetime
events['start_dt'] = pd.to_datetime(events['start'])
events['end_dt'] = pd.to_datetime(events['end'])

# 2. Calculate duration
events['duration_days'] = (events['end_dt'] - events['start_dt']).dt.days

# 3. Data Validation: Find logically impossible rows
invalid_events = events[events[&#039;duration_days'] < 0]

print("\n=== EVENT VALIDATION ===")
print("Invalid events found:")
print(invalid_events)

# Clean it by forcing invalid durations to NaN
events.loc[events[&#039;duration_days'] < 0, 'end_dt'] = pd.NaT

7. Common Mistakes

  • Relying on Pandas to guess the format on huge files: pd.todatetime() without a format parameter is very slow because Pandas has to parse and guess every single row. If you know the format is 'YYYY-MM-DD', always use format='%Y-%m-%d'. It is 10x faster.
  • Ignoring Timezones: 2024-01-01 10:00:00 means nothing globally without a timezone. If merging data from servers in NY and London, you must localize them first using .dt.tzlocalize() and .dt.tzconvert().

8. MCQs

Question 1

Which Pandas function converts string columns to dates?

Question 2

What data type does Pandas use for dates and times?

Question 3

If a European dataset has dates formatted as DD/MM/YYYY, what parameter should you use?

Question 4

What happens when pd.todatetime() encounters "Feb 30th" with errors='coerce'?

Question 5

Which accessor is required to extract the year from a datetime column?

Question 6

How do you extract the day of the week (e.g., Monday)?

Question 7

Why is it recommended to explicitly provide the format='%Y-%m-%d' parameter?

Question 8

What represents a missing date value in Pandas?

Question 9

If you subtract two datetime columns df['end'] - df['start'], what is the resulting data type?

Question 10

How do you extract just the integer number of days from a timedelta calculation?

9. Interview Questions

  • Q: A CSV contains a mix of US (MM/DD/YYYY) and European (DD/MM/YYYY) dates in the same column. How would you clean this?
  • Q: How do you calculate the exact number of months between a user's signupdate and churndate in Pandas?

10. Summary

Dates stored as strings limit your analytical power. Use pd.to_datetime() to cast strings into datetime64. Use errors='coerce' to handle unparseable garbage text (turning them into NaT). Once converted, the .dt accessor unlocks the ability to easily extract years, months, days, and weekdays, enabling powerful time-series analysis and cohort building.

11. Next Chapter Recommendation

In Chapter 11: Data Validation Techniques, we elevate our skills from just fixing data to writing rules that verify if our data actually makes logical business sense.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·