CHAPTER 04 Beginner

Working with CSV, Excel, and JSON Files

Updated: May 18, 2026

5 min read

# CHAPTER 4

Working with CSV, Excel, and JSON Files

1. Chapter Introduction

Before you can clean data, you must be able to load it. Data rarely comes in perfectly formatted packages; it lives in messy CSVs with weird encodings, multi-sheet Excel files, and deeply nested JSON APIs. This chapter covers how to ingest and export these common formats using Pandas.

2. Reading CSV Files

CSVs (Comma Separated Values) are the most common format for tabular data.

python

12345678910111213141516171819202122232425262728

import pandas as pd

# Basic CSV read
# Assuming file: data.csv
# name,email,age
# John,john@example.com,25
df_csv = pd.read_csv(&#039;data.csv')

# Common CSV Problems and Solutions:

# 1. Different delimiter (e.g., TSV - tab separated)
df_tsv = pd.read_csv(&#039;data.tsv', sep='\t')

# 2. No header row in the file
df_no_head = pd.read_csv(&#039;data.csv', header=None, names=['Name', 'Email', 'Age'])

# 3. Encoding errors (e.g., UnicodeDecodeError)
# Use 'utf-8', 'latin1', 'iso-8859-1', or 'cp1252'
df_enc = pd.read_csv(&#039;data.csv', encoding='latin1')

# 4. Large files (read in chunks)
chunk_iter = pd.read_csv(&#039;huge_data.csv', chunksize=10000)
for chunk in chunk_iter:
    # process chunk
    pass

# 5. Skipping bad lines (rows with too many commas)
df_bad = pd.read_csv(&#039;messy.csv', on_bad_lines='skip')

3. Reading Excel Files

Excel files (.xlsx, .xls) often contain multiple sheets, merged cells, and formatted headers.

python

1234567891011121314151617

# Note: Requires openpyxl installed (pip install openpyxl)

# Basic Excel read (reads first sheet by default)
df_xl = pd.read_excel(&#039;report.xlsx')

# 1. Read a specific sheet by name
df_sheet = pd.read_excel(&#039;report.xlsx', sheet_name='Sales_2023')

# 2. Read all sheets into a dictionary of DataFrames
all_sheets = pd.read_excel(&#039;report.xlsx', sheet_name=None)
print(all_sheets.keys()) # Shows sheet names

# 3. Skip rows (e.g., title and empty rows at the top)
df_skip = pd.read_excel(&#039;report.xlsx', skiprows=3)

# 4. Specify NA values (e.g., Excel uses "N/A", "Missing", "-")
df_na = pd.read_excel(&#039;report.xlsx', na_values=['N/A', 'Missing', '-', '#DIV/0!'])

4. Reading JSON Files

JSON (JavaScript Object Notation) is heavily used in web APIs. It can be flat or deeply nested.

python

123456789101112131415161718192021

# Basic JSON read (flat list of objects)
df_json = pd.read_json(&#039;data.json')

# Reading nested JSON (requires json_normalize)
import json
from pandas import json_normalize

nested_json = {
    "users": [
        {"id": 1, "info": {"name": "Alice", "age": 25}},
        {"id": 2, "info": {"name": "Bob", "age": 30}}
    ]
}

# Flatten the nested dictionaries
df_flat = json_normalize(nested_json[&#039;users'])
print(df_flat)
# Output:
#    id info.name  info.age
# 0   1     Alice        25
# 1   2       Bob        30

5. Exporting Cleaned Data

After cleaning, you need to save the data.

python

12345678910

# Export to CSV
# index=False prevents pandas from writing the row numbers to the file
df.to_csv(&#039;cleaned_data.csv', index=False)

# Export to Excel
df.to_excel(&#039;cleaned_data.xlsx', sheet_name='Cleaned', index=False)

# Export to JSON
# orient='records' creates a standard list of objects
df.to_json(&#039;cleaned_data.json', orient='records', indent=4)

6. Mini Project: CSV Cleanup Utility

python

12345678910111213141516171819202122232425262728293031323334

import pandas as pd

def clean_csv_file(input_path, output_path):
    print(f"Loading {input_path}...")
    try:
        # Try UTF-8 first, fallback to latin1 if it fails
        try:
            df = pd.read_csv(input_path, encoding=&#039;utf-8')
        except UnicodeDecodeError:
            print("UTF-8 failed, trying latin1 encoding...")
            df = pd.read_csv(input_path, encoding=&#039;latin1')
        
        print(f"Original shape: {df.shape}")
        
        # 1. Standardize column names (lowercase, replace spaces with underscores)
        df.columns = df.columns.str.strip().str.lower().str.replace(&#039; ', '_')
        
        # 2. Drop completely empty rows
        df = df.dropna(how=&#039;all')
        
        # 3. Drop exact duplicate rows
        df = df.drop_duplicates()
        
        print(f"Cleaned shape: {df.shape}")
        
        # Export
        df.to_csv(output_path, index=False)
        print(f"Success! Saved to {output_path}")
        
    except Exception as e:
        print(f"Error processing file: {e}")

# Usage:
# clean_csv_file('messy_input.csv', 'clean_output.csv')

7. Common Mistakes

Forgetting index=False when saving CSVs: Pandas will write the row numbers (0, 1, 2...) as a new column every time you save. If you load and save the file 3 times, you'll have 3 useless "Unnamed: 0" columns.

Ignoring encoding errors: If a CSV contains accents (é, ñ) and won't load, beginners often delete the bad rows manually. Use encoding='latin1' or encoding='utf-8-sig' instead.

8. MCQs

Question 1

Which pandas function is used to read a CSV file?

Question 2

If your CSV file uses tabs instead of commas, what parameter do you use?

Question 3

How do you skip the first 2 rows when reading an Excel file?

Question 4

What is `json``normalize` used for?

Question 5

Why is `index=False` important in `tocsv()`?

Question 6

What does `on``badlines='skip'` do in `read``csv`?

Question 7

How do you read ALL sheets from an Excel file at once?

Question 8

Which parameter allows you to treat strings like "Missing" as NaNs when loading?

Question 9

What is a common fix for a `UnicodeDecodeError` when reading a CSV?

Question 10

How do you output JSON as a clean array of objects?

9. Interview Questions

Q: You are loading a 50GB CSV file that doesn't fit in RAM. How do you handle it using Pandas?

Q: Explain how you would flatten a JSON API response where one column contains lists of dictionaries.

10. Summary
Pandas provides robust tools for loading data. readcsv() handles text files (use sep, encoding, onbadlines). readexcel() handles spreadsheets (use sheetname, skiprows). readjson() and jsonnormalize() handle nested web data. When exporting, always use index=False to keep files clean.

11. Next Chapter Recommendation

In Chapter 5: Data Types and Data Formatting, we will learn how to convert strings to numbers, format dates, and fix type inconsistencies.

Browse All 21+ Subject Areas

Quick Links

Visual Algorithm Labs

Frontend Dev

The Future of Web Architecture in 2026

Practice Quizzes

Working with CSV, Excel, and JSON Files #

1. Chapter Introduction #

2. Reading CSV Files #

3. Reading Excel Files #

4. Reading JSON Files #

5. Exporting Cleaned Data #

6. Mini Project: CSV Cleanup Utility #

7. Common Mistakes #

8. MCQs #

Which pandas function is used to read a CSV file?

If your CSV file uses tabs instead of commas, what parameter do you use?

How do you skip the first 2 rows when reading an Excel file?

What is jsonnormalize used for?

Why is index=False important in tocsv()?

What does onbadlines='skip' do in readcsv?

How do you read ALL sheets from an Excel file at once?

Which parameter allows you to treat strings like "Missing" as NaNs when loading?

What is a common fix for a UnicodeDecodeError when reading a CSV?

How do you output JSON as a clean array of objects?

9. Interview Questions #

10. Summary #

11. Next Chapter Recommendation #

Finish this Chapter

Discussion

Explore More

📖 Related Tutorials 5

🧪 Related Labs 1

🗺️ Related Roadmaps 1

Send Feedback / Bug

Feedback Submitted!

Working with CSV, Excel, and JSON Files

1. Chapter Introduction

2. Reading CSV Files

3. Reading Excel Files

4. Reading JSON Files

5. Exporting Cleaned Data

6. Mini Project: CSV Cleanup Utility

7. Common Mistakes

8. MCQs

What is `json``normalize` used for?

Why is `index=False` important in `tocsv()`?

What does `on``badlines='skip'` do in `read``csv`?

What is a common fix for a `UnicodeDecodeError` when reading a CSV?

9. Interview Questions

10. Summary

11. Next Chapter Recommendation