Skip to main content
Pandas & NumPy
CHAPTER 11 Beginner

Reading and Writing Data Files

Updated: May 18, 2026
5 min read

# CHAPTER 11

Reading and Writing Data Files

1. Chapter Introduction

Real data lives in files — CSVs, Excel sheets, JSON APIs, databases. Pandas' IO functions are the gateway from raw files to analysis-ready DataFrames, supporting 15+ formats with production-grade options.

2. Reading CSV Files

python
1234567891011121314151617181920212223
import pandas as pd

# Basic read
df = pd.read_csv('sales.csv')

# With options
df = pd.read_csv(
    'sales.csv',
    sep=',',              # Delimiter (default ',')
    header=0,             # Row to use as column names (0=first row)
    index_col='ID',       # Column to use as index
    usecols=['Date', 'Product', 'Revenue'],  # Only these columns
    nrows=1000,           # Read only first 1000 rows
    skiprows=[1, 2],      # Skip specific rows
    na_values=['N/A', 'null', '-'],  # Treat as NaN
    parse_dates=['Date'], # Auto-parse date columns
    dtype={'Revenue': float, 'Quantity': int}  # Force dtypes
)

# Quick look
print(df.shape)
print(df.head())
print(df.dtypes)
text
1234567
# Sample sales.csv
Date,Product,Category,Quantity,Revenue,Region
2024-01-15,Laptop,Electronics,2,1800.00,North
2024-01-16,Phone,Electronics,5,3500.00,South
2024-01-17,Desk,Furniture,1,450.00,North
2024-01-18,Chair,Furniture,3,900.00,East
2024-01-19,Monitor,Electronics,2,600.00,West

3. Reading Excel Files

python
1234567891011121314
# Read Excel
df = pd.read_excel(
    'data.xlsx',
    sheet_name='Sales',    # Sheet name or index
    header=1,              # If header is on row 2
    usecols='A:F'          # Read columns A to F
)

# Read multiple sheets
all_sheets = pd.read_excel('data.xlsx', sheet_name=None)
# Returns dict: {'Sheet1': df1, 'Sheet2': df2}

for sheet_name, sheet_df in all_sheets.items():
    print(f"Sheet: {sheet_name}, Shape: {sheet_df.shape}")

4. Reading JSON

python
1234567891011121314151617181920
# From JSON file
df = pd.read_json('data.json')

# From JSON string (common with APIs)
import json
json_data = '''[
    {"id": 1, "name": "Alice", "salary": 85000},
    {"id": 2, "name": "Bob", "salary": 72000}
]'''
df = pd.read_json(json_data)

# Nested JSON → normalize
nested = {
    "employees": [
        {"name": "Alice", "contact": {"email": "alice@ex.com", "phone": "555-1234"}},
        {"name": "Bob",   "contact": {"email": "bob@ex.com",   "phone": "555-5678"}}
    ]
}
df = pd.json_normalize(nested['employees'])
print(df.columns)  # name, contact.email, contact.phone

5. Writing Data Files

python
1234567891011121314151617181920
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Sales': [45000, 62000, 38000],
    'Region': ['North', 'South', 'East']
})

# Write to CSV
df.to_csv('output.csv', index=False)          # No row numbers

# Write to Excel
df.to_excel('output.xlsx', sheet_name='Sales', index=False)

# Write to JSON
df.to_json('output.json', orient='records', indent=2)

# Write to multiple Excel sheets
with pd.ExcelWriter('report.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='Sales', index=False)
    df.describe().to_excel(writer, sheet_name='Statistics')
    print("Multi-sheet Excel saved!")

6. Mini Project: CSV Sales Report Analyzer

python
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849
import pandas as pd
import numpy as np

# Create sample CSV data
sales_data = """Date,Product,Category,Quantity,Revenue,Region,Salesperson
2024-01-05,Laptop,Electronics,2,2400,North,Alice
2024-01-06,Phone,Electronics,5,3750,South,Bob
2024-01-07,Desk,Furniture,3,1350,East,Carol
2024-01-08,Laptop,Electronics,1,1200,West,Alice
2024-01-09,Chair,Furniture,4,800,North,David
2024-01-10,Monitor,Electronics,3,900,South,Bob
2024-01-11,Phone,Electronics,2,1500,East,Carol
2024-01-12,Desk,Furniture,2,900,West,David
2024-01-13,Laptop,Electronics,3,3600,North,Alice
2024-01-14,Monitor,Electronics,5,1500,South,Bob"""

with open('sales_report.csv', 'w') as f:
    f.write(sales_data)

# Load and analyze
df = pd.read_csv('sales_report.csv', parse_dates=['Date'])

print("=" * 55)
print("SALES ANALYSIS REPORT")
print("=" * 55)

# Overview
print(f"\nData range: {df['Date'].min().date()} to {df['Date'].max().date()}")
print(f"Total transactions: {len(df)}")
print(f"Total revenue: ${df['Revenue'].sum():,.0f}")

# Revenue by category
print("\nRevenue by Category:")
cat_rev = df.groupby('Category')['Revenue'].agg(['sum', 'mean', 'count'])
cat_rev.columns = ['Total', 'Average', 'Transactions']
print(cat_rev)

# Top salesperson
print("\nTop Salesperson:")
sales_rev = df.groupby('Salesperson')['Revenue'].sum().sort_values(ascending=False)
print(sales_rev)

# Best selling product
print("\nBest Products by Revenue:")
print(df.groupby('Product')['Revenue'].sum().sort_values(ascending=False))

# Save report
df.to_csv('analyzed_report.csv', index=False)
print("\n✅ Report saved to 'analyzed_report.csv'")

7. Common Mistakes

  • Forgetting index=False in tocsv(): Without it, Pandas adds a row number column — creating a duplicate index on re-read.
  • Date columns not auto-parsed: Use parsedates=['Date'] in readcsv() to get proper datetime dtype, not strings.

8. MCQs

Question 1

readcsv('file.csv', indexcol='ID') does?

Question 2

tocsv(index=False) prevents?

Question 3

usecols=['A','B'] in readcsv?

Question 4

nrows=500 in readcsv?

Question 5

parsedates=['Date'] converts?

Question 6

pd.jsonnormalize() handles?

Question 7

pd.readexcel(sheetname=None) returns?

Question 8

ExcelWriter context manager is for?

Question 9

navalues=['N/A', '-'] in readcsv?

Question 10

dtype={'Revenue': float} in readcsv?

9. Interview Questions

  • Q: How do you load only specific columns from a large CSV in Pandas?
  • Q: How do you write a DataFrame to multiple sheets in a single Excel file?

10. Summary

Pandas IO handles every common data format. Key params: index
col, usecols, nrows, parsedates, dtype, navalues. Always use index=False when writing CSVs. Use ExcelWriter for multi-sheet Excel reports.

11. Next Chapter Recommendation

In Chapter 12: Data Selection and Filtering, we master loc[], iloc[], boolean filtering, and query syntax to extract exactly the data we need.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·