Skip to main content
Data Cleaning
CHAPTER 04 Beginner

Working with CSV, Excel, and JSON Files

Updated: May 18, 2026
5 min read

# CHAPTER 4

Working with CSV, Excel, and JSON Files

1. Chapter Introduction

Before you can clean data, you must be able to load it. Data rarely comes in perfectly formatted packages; it lives in messy CSVs with weird encodings, multi-sheet Excel files, and deeply nested JSON APIs. This chapter covers how to ingest and export these common formats using Pandas.

2. Reading CSV Files

CSVs (Comma Separated Values) are the most common format for tabular data.

python
12345678910111213141516171819202122232425262728
import pandas as pd

# Basic CSV read
# Assuming file: data.csv
# name,email,age
# John,john@example.com,25
df_csv = pd.read_csv('data.csv')

# Common CSV Problems and Solutions:

# 1. Different delimiter (e.g., TSV - tab separated)
df_tsv = pd.read_csv('data.tsv', sep='\t')

# 2. No header row in the file
df_no_head = pd.read_csv('data.csv', header=None, names=['Name', 'Email', 'Age'])

# 3. Encoding errors (e.g., UnicodeDecodeError)
# Use 'utf-8', 'latin1', 'iso-8859-1', or 'cp1252'
df_enc = pd.read_csv('data.csv', encoding='latin1')

# 4. Large files (read in chunks)
chunk_iter = pd.read_csv('huge_data.csv', chunksize=10000)
for chunk in chunk_iter:
    # process chunk
    pass

# 5. Skipping bad lines (rows with too many commas)
df_bad = pd.read_csv('messy.csv', on_bad_lines='skip')

3. Reading Excel Files

Excel files (.xlsx, .xls) often contain multiple sheets, merged cells, and formatted headers.

python
1234567891011121314151617
# Note: Requires openpyxl installed (pip install openpyxl)

# Basic Excel read (reads first sheet by default)
df_xl = pd.read_excel('report.xlsx')

# 1. Read a specific sheet by name
df_sheet = pd.read_excel('report.xlsx', sheet_name='Sales_2023')

# 2. Read all sheets into a dictionary of DataFrames
all_sheets = pd.read_excel('report.xlsx', sheet_name=None)
print(all_sheets.keys()) # Shows sheet names

# 3. Skip rows (e.g., title and empty rows at the top)
df_skip = pd.read_excel('report.xlsx', skiprows=3)

# 4. Specify NA values (e.g., Excel uses "N/A", "Missing", "-")
df_na = pd.read_excel('report.xlsx', na_values=['N/A', 'Missing', '-', '#DIV/0!'])

4. Reading JSON Files

JSON (JavaScript Object Notation) is heavily used in web APIs. It can be flat or deeply nested.

python
123456789101112131415161718192021
# Basic JSON read (flat list of objects)
df_json = pd.read_json('data.json')

# Reading nested JSON (requires json_normalize)
import json
from pandas import json_normalize

nested_json = {
    "users": [
        {"id": 1, "info": {"name": "Alice", "age": 25}},
        {"id": 2, "info": {"name": "Bob", "age": 30}}
    ]
}

# Flatten the nested dictionaries
df_flat = json_normalize(nested_json['users'])
print(df_flat)
# Output:
#    id info.name  info.age
# 0   1     Alice        25
# 1   2       Bob        30

5. Exporting Cleaned Data

After cleaning, you need to save the data.

python
12345678910
# Export to CSV
# index=False prevents pandas from writing the row numbers to the file
df.to_csv('cleaned_data.csv', index=False)

# Export to Excel
df.to_excel('cleaned_data.xlsx', sheet_name='Cleaned', index=False)

# Export to JSON
# orient='records' creates a standard list of objects
df.to_json('cleaned_data.json', orient='records', indent=4)

6. Mini Project: CSV Cleanup Utility

python
12345678910111213141516171819202122232425262728293031323334
import pandas as pd

def clean_csv_file(input_path, output_path):
    print(f"Loading {input_path}...")
    try:
        # Try UTF-8 first, fallback to latin1 if it fails
        try:
            df = pd.read_csv(input_path, encoding='utf-8')
        except UnicodeDecodeError:
            print("UTF-8 failed, trying latin1 encoding...")
            df = pd.read_csv(input_path, encoding='latin1')
        
        print(f"Original shape: {df.shape}")
        
        # 1. Standardize column names (lowercase, replace spaces with underscores)
        df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
        
        # 2. Drop completely empty rows
        df = df.dropna(how='all')
        
        # 3. Drop exact duplicate rows
        df = df.drop_duplicates()
        
        print(f"Cleaned shape: {df.shape}")
        
        # Export
        df.to_csv(output_path, index=False)
        print(f"Success! Saved to {output_path}")
        
    except Exception as e:
        print(f"Error processing file: {e}")

# Usage:
# clean_csv_file('messy_input.csv', 'clean_output.csv')

7. Common Mistakes

  • Forgetting index=False when saving CSVs: Pandas will write the row numbers (0, 1, 2...) as a new column every time you save. If you load and save the file 3 times, you'll have 3 useless "Unnamed: 0" columns.
  • Ignoring encoding errors: If a CSV contains accents (é, ñ) and won't load, beginners often delete the bad rows manually. Use encoding='latin1' or encoding='utf-8-sig' instead.

8. MCQs

Question 1

Which pandas function is used to read a CSV file?

Question 2

If your CSV file uses tabs instead of commas, what parameter do you use?

Question 3

How do you skip the first 2 rows when reading an Excel file?

Question 4

What is jsonnormalize used for?

Question 5

Why is index=False important in tocsv()?

Question 6

What does onbadlines='skip' do in readcsv?

Question 7

How do you read ALL sheets from an Excel file at once?

Question 8

Which parameter allows you to treat strings like "Missing" as NaNs when loading?

Question 9

What is a common fix for a UnicodeDecodeError when reading a CSV?

Question 10

How do you output JSON as a clean array of objects?

9. Interview Questions

  • Q: You are loading a 50GB CSV file that doesn't fit in RAM. How do you handle it using Pandas?
  • Q: Explain how you would flatten a JSON API response where one column contains lists of dictionaries.

10. Summary

Pandas provides robust tools for loading data. read
csv() handles text files (use sep, encoding, onbadlines). readexcel() handles spreadsheets (use sheetname, skiprows). readjson() and jsonnormalize() handle nested web data. When exporting, always use index=False to keep files clean.

11. Next Chapter Recommendation

In Chapter 5: Data Types and Data Formatting, we will learn how to convert strings to numbers, format dates, and fix type inconsistencies.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·