Skip to main content
Python for Data Science
CHAPTER 17 Beginner

Working with CSV, Excel, and JSON Data

Updated: May 18, 2026
5 min read

# CHAPTER 17

Working with CSV, Excel, and JSON Data

1. Chapter Introduction

We briefly touched on pd.readcsv() in Chapter 13. However, real-world files are messy. Sometimes the CSV uses semicolons instead of commas. Sometimes the Excel file has 5 different sheets. Sometimes the dates are formatted weirdly. This chapter covers the advanced parameters of Pandas I/O (Input/Output) functions so you can handle any file format a client sends you.

2. Advanced CSV Ingestion

The readcsv() function has dozens of parameters to handle messy files.

python
12345678910111213141516
import pandas as pd

# Problem 1: The separator isn't a comma, it's a semicolon (common in Europe)
# Solution: Use the 'sep' parameter
df = pd.read_csv('data.csv', sep=';')

# Problem 2: The CSV has no header row, so Pandas accidentally makes the first row of data the header!
# Solution: Tell Pandas there is no header, and provide the names manually
df = pd.read_csv('data.csv', header=None, names=['ID', 'Name', 'Age'])

# Problem 3: You only want 3 specific columns out of a 100-column file to save RAM.
# Solution: Use 'usecols'
df = pd.read_csv('data.csv', usecols=['Name', 'Salary'])

# Problem 4: Automatically converting string dates into proper DateTime objects!
df = pd.read_csv('data.csv', parse_dates=['Date_Joined'])

3. Working with Excel Files (.xlsx)

Excel files are different from CSVs because a single Excel workbook can contain multiple "Sheets". *(Note: You may need to run !pip install openpyxl to read Excel files in Pandas).*

python
12345678
# By default, this reads the FIRST sheet only
df_excel = pd.read_excel('financials.xlsx')

# To read a specific sheet by name
df_q2 = pd.read_excel('financials.xlsx', sheet_name='Q2_Report')

# To read a specific sheet by index (the second sheet)
df_sheet2 = pd.read_excel('financials.xlsx', sheet_name=1)

4. Working with JSON Data

JSON is the language of the web. If you pull data from a web API, it arrives as JSON.

python
12345678910111213141516
# Reading a JSON file directly into a DataFrame
df_json = pd.read_json('users.json')

# Often, JSON is nested (dictionaries inside dictionaries). 
# Pandas has a utility to "flatten" it out.
import json
from pandas.json_normalize import json_normalize

raw_json = [
    {"id": 1, "name": "Alice", "address": {"city": "NY", "zip": "10001"}},
    {"id": 2, "name": "Bob", "address": {"city": "LA", "zip": "90001"}}
]

# Flattens 'address' into 'address.city' and 'address.zip' columns!
df_flat = json_normalize(raw_json)
print(df_flat)

5. Exporting Data (Writing Files)

Once you finish your groupby() analysis or data cleaning, you need to save the results.

python
123456789
# 1. Export to CSV
# index=False prevents Pandas from writing the 0,1,2 index numbers as a column
df.to_csv('cleaned_data.csv', index=False)

# 2. Export to Excel
df.to_excel('final_report.xlsx', sheet_name='Summary', index=False)

# 3. Export to JSON (records orientation is best for APIs)
df.to_json('data_export.json', orient='records')

6. Mini Project: Sales Report Generator

Let's simulate a workflow where we combine data from a CSV and an Excel sheet, process it, and export it.

python
123456789101112131415
# 1. Ingest Data
# (Assuming these files exist)
# customers = pd.read_csv('customers.csv')
# sales = pd.read_excel('sales_2023.xlsx', sheet_name='Dec')

# Simulate the data for this example
customers = pd.DataFrame({"ID": [1, 2], "Name": ["Alice", "Bob"]})
sales = pd.DataFrame({"Cust_ID": [1, 2], "Amount": [500, 300]})

# 2. Merge Data (Like SQL JOIN)
report = pd.merge(sales, customers, left_on='Cust_ID', right_on='ID')

# 3. Export to a clean CSV for the boss
report.to_csv('December_Sales_Report.csv', index=False)
print("Report successfully generated and exported!")

7. Common Mistakes

  • Forgetting index=False when exporting: By default, .tocsv() saves the row numbers (0, 1, 2). When you load that CSV back in tomorrow, Pandas will assign *new* row numbers, and the old ones become a useless "Unnamed: 0" column. Always use index=False unless the index is a meaningful date or name.
  • Not parsing dates on load: If you don't use parsedates=['ColName'], Pandas treats dates as Text. You won't be able to filter for "Sales in 2023" without doing messy text conversions later.

8. MCQs

Question 1

If a CSV file uses semicolons instead of commas, how do you open it?

Question 2

How do you prevent Pandas from using the first row of data as the column names?

Question 3

Which parameter allows you to load only 2 specific columns from a massive CSV file?

Question 4

What parameter automatically converts string dates (like "2023-01-01") into real DateTime objects upon loading?

Question 5

How do you specify which sheet to load from an Excel workbook?

Question 6

What function is incredibly useful for converting deeply nested JSON data into a flat, tabular DataFrame?

Question 7

What method saves a DataFrame back to a CSV file on your hard drive?

Question 8

Why is index=False a critical parameter when using df.tocsv()?

Question 9

Which Python library is required under-the-hood by Pandas to read Excel files?

Question 10

How can you combine two DataFrames based on a common ID column (similar to a SQL JOIN)?

9. Interview Questions

  • Q: A client sends you a 5GB CSV file, but you only need the "CustomerID" and "Revenue" columns. How do you load this efficiently without crashing your RAM?
  • Q: Explain why you should use parsedates during ingestion rather than converting the column type after the dataframe is loaded.

10. Summary

Mastering data ingestion saves you hours of cleaning later. Use sep, header, and usecols to handle messy CSV structures. Use sheet
name to navigate Excel files, and jsonnormalize to flatten nested web data. Finally, always remember to export your finished analysis using .tocsv(index=False) to prevent useless index columns from polluting your final reports.

11. Next Chapter Recommendation

In Chapter 18: Data Visualization with Matplotlib, we will bring our data to life. A table of 1000 aggregated numbers is boring; we will learn how to turn those numbers into compelling line charts, bar charts, and scatter plots.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·