CHAPTER 29
Beginner
Performance Optimization in Pandas and NumPy
Updated: May 18, 2026
1. Chapter Introduction
Data science code that runs in 10 hours is useless in production. This chapter systematically optimizes Pandas and NumPy: replacing loops with vectorized operations, choosing efficient dtypes, using eval/query, and profiling with timeit.
2. Benchmarking Tools
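As a minimal sketch of benchmarking with the standard-library timeit module, the snippet below times a Python loop against a vectorized NumPy call. The helper name best_time and both example functions are hypothetical, chosen only for illustration; the repeat and number values are arbitrary.

```python
import timeit

import numpy as np


def python_sum_of_squares(values):
    """Sum of squares using an explicit Python-level loop."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def numpy_sum_of_squares(values):
    """The same computation as a single vectorized NumPy call."""
    return float(np.dot(values, values))


def best_time(func, *args, repeat=3, number=3):
    """Return the best per-call time in seconds across several repeats.

    Taking the minimum reduces noise from other processes on the machine.
    """
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number


data = np.random.default_rng(0).random(100_000)

loop_time = best_time(python_sum_of_squares, data)
vec_time = best_time(numpy_sum_of_squares, data)
print(f"loop: {loop_time:.6f} s   vectorized: {vec_time:.6f} s")
```

In IPython or Jupyter, the %timeit magic wraps the same module and is usually more convenient for quick checks.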
3. Vectorization — Replace All Loops
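One way to illustrate the idea is to rewrite a row-by-row calculation with conditional logic as column arithmetic plus np.where. The column names and the discount rule below are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [10.0, 25.0, 40.0, 5.0], "qty": [3, 1, 2, 10]})

# Loop version: one Python-level iteration per row.
totals_loop = []
for row in df.itertuples(index=False):
    total = row.price * row.qty
    if total > 50:          # hypothetical rule: 10% discount on large orders
        total *= 0.9
    totals_loop.append(total)

# Vectorized version: whole-column arithmetic, with np.where replacing the if.
gross = df["price"] * df["qty"]
df["total"] = np.where(gross > 50, gross * 0.9, gross)

print(df)
```

Both versions compute the same values; the vectorized one pushes the per-element work into compiled code, so the gap widens as the number of rows grows.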
4. eval() and query() — Faster Expression Evaluation
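The sketch below shows pd.eval(), DataFrame.eval(), and DataFrame.query() next to their plain-Pandas equivalents. The column names are assumptions. Any speedup generally depends on the optional numexpr package being installed and on the frame being large (the pandas documentation suggests more than roughly 10,000 rows); on a small frame like this one the plain expressions can be faster.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1_000, 3)), columns=["a", "b", "c"])

# Plain expression: each operator creates an intermediate Series.
plain = df["a"] + df["b"] * df["c"]

# pd.eval evaluates the whole string expression in one pass.
evaluated = pd.eval("df.a + df.b * df.c")

# DataFrame.eval lets the expression refer to columns by name.
evaluated_on_frame = df.eval("a + b * c")

# Boolean-mask filter and the equivalent query() call.
mask_filtered = df[(df["a"] > 0.5) & (df["b"] < 0.2)]
threshold = 0.5
queried = df.query("a > @threshold and b < 0.2")  # @ refers to a Python variable

print(len(mask_filtered), len(queried))
```

Beyond speed, query() is often chosen for readability when a filter combines several conditions.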
5. Memory Optimization Patterns
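A minimal sketch of dtype-based memory reduction, assuming a frame with an integer ID, a float score, and a low-cardinality string column (all hypothetical). Note that converting to float32 loses precision, so it is only appropriate when that is acceptable for the data.

```python
import numpy as np
import pandas as pd

n = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "user_id": rng.integers(0, 50_000, size=n),               # int64 by default
    "score": rng.random(n),                                    # float64 by default
    "country": rng.choice(["US", "DE", "IN", "BR"], size=n),   # string column
})

before = df.memory_usage(deep=True).sum()

# Downcast integers to the smallest unsigned type that holds the values.
df["user_id"] = pd.to_numeric(df["user_id"], downcast="unsigned")

# Halve float storage (lossy: float32 keeps about 7 significant digits).
df["score"] = df["score"].astype("float32")

# Store repeated strings once and reference them by integer code.
df["country"] = df["country"].astype("category")

after = df.memory_usage(deep=True).sum()
print(f"before: {before / 1e6:.2f} MB   after: {after / 1e6:.2f} MB")
print(df.dtypes)
```

Passing deep=True matters here: without it, memory_usage() does not count the Python string objects behind an object column.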
6. NumPy Performance Patterns
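The following sketch touches three NumPy patterns this chapter refers to: pre-allocating an output array instead of appending, checking and restoring C-contiguous layout, and writing results in place with the out= argument. The array sizes and variable names are arbitrary choices for illustration.

```python
import numpy as np

# 1. Pre-allocation: fill slices of one buffer instead of calling np.append,
#    which copies the whole array on every call.
chunks = [np.arange(start, start + 4, dtype=np.float64) for start in range(0, 12, 4)]
buffer = np.empty(12, dtype=np.float64)
pos = 0
for chunk in chunks:
    buffer[pos:pos + chunk.size] = chunk
    pos += chunk.size

# 2. Memory layout: a transpose is a view that is not C-contiguous.
matrix = np.arange(12, dtype=np.float64).reshape(3, 4)
transposed = matrix.T
print(transposed.flags["C_CONTIGUOUS"])      # layout flag of the view

# np.ascontiguousarray returns a C-ordered copy when one is needed.
c_ordered = np.ascontiguousarray(transposed)
print(c_ordered.flags["C_CONTIGUOUS"])

# 3. In-place ufunc: out= reuses the existing buffer, avoiding a temporary.
np.multiply(buffer, 2.0, out=buffer)
print(buffer)
```

Whether a contiguous copy pays off depends on how many times the array is traversed afterwards, since the copy itself has a cost.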
7. Caching and Memoization
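As a small illustration of memoization with functools.lru_cache, the sketch below caches a lookup that is called once per row but has only a few distinct inputs. The function exchange_rate and its hard-coded rates are hypothetical stand-ins for an expensive computation or remote call.

```python
from functools import lru_cache

import numpy as np
import pandas as pd


@lru_cache(maxsize=None)
def exchange_rate(currency: str) -> float:
    """Hypothetical expensive lookup; the rates are made-up example values."""
    rates = {"USD": 1.0, "EUR": 1.1, "GBP": 1.3}
    return rates[currency]


df = pd.DataFrame({
    "currency": ["USD", "EUR", "USD", "GBP", "EUR"],
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# The function body runs once per distinct currency; repeats hit the cache.
df["rate"] = df["currency"].map(exchange_rate)
df["amount_usd"] = df["amount"] * df["rate"]

print(exchange_rate.cache_info())
print(df)
```

lru_cache requires hashable arguments, so it works for strings, numbers, and tuples but not for DataFrames or NumPy arrays passed directly.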
8. Common Mistakes
- Using iterrows() for computation: iterrows() is often 100-1000x slower than vectorized operations. It's only appropriate for non-vectorizable custom logic.
- Unnecessary apply() on math: df.apply(lambda x: x['A'] + x['B'], axis=1) vs df['A'] + df['B']; the latter is typically 10-50x faster or more.
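To check these two mistakes on your own machine, a rough timing comparison might look like the sketch below. The frame size is an arbitrary assumption, and the measured ratios will vary with hardware, data size, and library versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"A": rng.random(2_000), "B": rng.random(2_000)})


def with_iterrows(frame):
    """Row-by-row sum: builds a Series object for every row."""
    values = [row["A"] + row["B"] for _, row in frame.iterrows()]
    return pd.Series(values, index=frame.index)


def with_apply(frame):
    """apply with axis=1: still one Python function call per row."""
    return frame.apply(lambda x: x["A"] + x["B"], axis=1)


def vectorized(frame):
    """Column arithmetic: a single call into compiled code."""
    return frame["A"] + frame["B"]


for func in (with_iterrows, with_apply, vectorized):
    seconds = min(timeit.repeat(lambda: func(df), repeat=3, number=1))
    print(f"{func.__name__:>14}: {seconds * 1000:.2f} ms")
```

All three functions return the same values, so any difference in the printed times comes from how the work is dispatched rather than from what is computed.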
9. MCQs
Question 1
Fastest way to double every value in a DataFrame column?
Question 2
pd.eval() advantage?
Question 3
df.query() is preferred over boolean filter when?
Question 4
Pre-allocating NumPy array is faster than appending because?
Question 5
itertuples() vs iterrows()?
Question 6
inplace=True in Pandas operations?
Question 7
Best profiling tool for Python functions?
Question 8
np.ascontiguousarray() is for?
Question 9
usecols=['A','B'] in read_csv improves?
Question 10
When is @lru_cache useful in data science?
10. Interview Questions
- Q: You have a 5GB CSV. How do you process it with Pandas?
- Q: What are the most common causes of slow Pandas code?
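For the large-CSV question, one common approach (a sketch, not the only valid answer) is to read the file in chunks with read_csv, loading only the needed columns with compact dtypes and aggregating as you go. The example uses a tiny in-memory CSV as a stand-in for the real file; the column names and the chunk size are assumptions.

```python
import io

import pandas as pd

# Stand-in for a large file on disk: 10 rows, one column we do not need.
rows = [f"{'north' if i % 2 else 'south'},{i},unused" for i in range(10)]
source = io.StringIO("region,amount,note\n" + "\n".join(rows))

reader = pd.read_csv(
    source,
    usecols=["region", "amount"],   # skip columns that are never used
    dtype={"amount": "int32"},      # compact dtype chosen at load time
    chunksize=4,                    # rows per chunk; far larger in practice
)

# Aggregate each chunk, then combine the partial results.
totals = {}
for chunk in reader:
    partial = chunk.groupby("region")["amount"].sum()
    for region, value in partial.items():
        totals[region] = totals.get(region, 0) + int(value)

print(totals)
```

Only one chunk is held in memory at a time, so peak usage is set by the chunk size rather than the file size. This works for aggregations that can be combined from partial results, such as sums and counts.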
11. Summary
Performance hierarchy: vectorized ops > NumPy ufuncs > str.* > apply() > itertuples() > iterrows(). Use pd.eval() for complex multi-column expressions. Optimize dtypes at load time. Profile with timeit before optimizing. Contiguous C-order arrays are fastest for row-wise NumPy operations.