Working with CSV, Excel, and JSON Data
# CHAPTER 17
Working with CSV, Excel, and JSON Data
1. Chapter Introduction
We briefly touched onpd.readcsv() in Chapter 13. However, real-world files are messy. Sometimes the CSV uses semicolons instead of commas. Sometimes the Excel file has 5 different sheets. Sometimes the dates are formatted weirdly. This chapter covers the advanced parameters of Pandas I/O (Input/Output) functions so you can handle any file format a client sends you.
2. Advanced CSV Ingestion
The readcsv() function has dozens of parameters to handle messy files.
3. Working with Excel Files (.xlsx)
Excel files are different from CSVs because a single Excel workbook can contain multiple "Sheets".
*(Note: You may need to run !pip install openpyxl to read Excel files in Pandas).*
4. Working with JSON Data
JSON is the language of the web. If you pull data from a web API, it arrives as JSON.
5. Exporting Data (Writing Files)
Once you finish your groupby() analysis or data cleaning, you need to save the results.
6. Mini Project: Sales Report Generator
Let's simulate a workflow where we combine data from a CSV and an Excel sheet, process it, and export it.
7. Common Mistakes
-
Forgetting
index=Falsewhen exporting: By default,.tocsv()saves the row numbers (0, 1, 2). When you load that CSV back in tomorrow, Pandas will assign *new* row numbers, and the old ones become a useless "Unnamed: 0" column. Always useindex=Falseunless the index is a meaningful date or name.
-
Not parsing dates on load: If you don't use
parsedates=['ColName'], Pandas treats dates as Text. You won't be able to filter for "Sales in 2023" without doing messy text conversions later.
8. MCQs
If a CSV file uses semicolons instead of commas, how do you open it?
How do you prevent Pandas from using the first row of data as the column names?
Which parameter allows you to load only 2 specific columns from a massive CSV file?
What parameter automatically converts string dates (like "2023-01-01") into real DateTime objects upon loading?
How do you specify which sheet to load from an Excel workbook?
What function is incredibly useful for converting deeply nested JSON data into a flat, tabular DataFrame?
What method saves a DataFrame back to a CSV file on your hard drive?
Why is index=False a critical parameter when using df.tocsv()?
Which Python library is required under-the-hood by Pandas to read Excel files?
How can you combine two DataFrames based on a common ID column (similar to a SQL JOIN)?
9. Interview Questions
- Q: A client sends you a 5GB CSV file, but you only need the "CustomerID" and "Revenue" columns. How do you load this efficiently without crashing your RAM?
-
Q: Explain why you should use
parsedatesduring ingestion rather than converting the column type after the dataframe is loaded.
10. Summary
Mastering data ingestion saves you hours of cleaning later. Usesep, header, and usecols to handle messy CSV structures. Use sheetname to navigate Excel files, and jsonnormalize to flatten nested web data. Finally, always remember to export your finished analysis using .tocsv(index=False) to prevent useless index columns from polluting your final reports.