Skip to main content
Pandas & NumPy
CHAPTER 29 Beginner

Performance Optimization in Pandas and NumPy

Updated: May 18, 2026
5 min read

# CHAPTER 29

Performance Optimization in Pandas & NumPy

1. Chapter Introduction

Data science code that runs in 10 hours is useless in production. This chapter systematically optimizes Pandas and NumPy — replacing loops with vectorized operations, choosing efficient dtypes, using eval/query, and profiling with timeit.

2. Benchmarking Tools

python
1234567891011121314151617181920212223242526272829303132
import pandas as pd
import numpy as np
import time

# timeit — accurate multi-run benchmarking
import timeit

n = 1_000_000
arr = np.random.rand(n)

# Method 1: Python loop
def loop_method(arr):
    return [x * 2 + 1 for x in arr]

# Method 2: NumPy vectorized
def numpy_method(arr):
    return arr * 2 + 1

# Benchmark
t_loop  = timeit.timeit(lambda: loop_method(arr),  number=5) / 5
t_numpy = timeit.timeit(lambda: numpy_method(arr), number=5) / 5

print(f"Loop:  {t_loop:.4f}s")
print(f"NumPy: {t_numpy:.5f}s")
print(f"Speedup: {t_loop/t_numpy:.0f}x")

# Memory profiling
import sys
list_data = [float(x) for x in range(n)]
np_data   = np.arange(n, dtype=np.float64)
print(f"\nList memory:  {sys.getsizeof(list_data)/1e6:.1f} MB")
print(f"NumPy memory: {np_data.nbytes/1e6:.1f} MB")

3. Vectorization — Replace All Loops

python
123456789101112131415161718192021222324252627
# ❌ Slow: Python for loop on DataFrame
df = pd.DataFrame({'A': np.random.randint(1, 100, 100000),
                   'B': np.random.randint(1, 100, 100000)})

# DON'T
slow_result = []
for _, row in df.iterrows():
    if row['A'] > row['B']:
        slow_result.append(row['A'] * 2)
    else:
        slow_result.append(row['B'] * 2)

# ✅ Fast: Vectorized with np.where
fast_result = np.where(df['A'] > df['B'], df['A'] * 2, df['B'] * 2)

# ❌ Slow: apply() for math operations
df['result_slow'] = df.apply(lambda row: row['A'] ** 2 + row['B'] ** 2, axis=1)

# ✅ Fast: Direct vectorized math
df['result_fast'] = df['A'] ** 2 + df['B'] ** 2

# Comparison for string operations
df_str = pd.DataFrame({'text': ['Hello World'] * 100000})
# ❌ Slow
df_str['lower_slow'] = df_str['text'].apply(lambda x: x.lower())
# ✅ Fast
df_str['lower_fast'] = df_str['text'].str.lower()

4. eval() and query() — Faster Expression Evaluation

pd.eval
123456789101112131415161718192021222324
n = 1_000_000
df = pd.DataFrame({
    'A': np.random.rand(n),
    'B': np.random.rand(n),
    'C': np.random.rand(n)
})

# Standard Pandas (creates multiple intermediate arrays)
t1 = timeit.timeit(lambda: df['A'] + df['B'] * df['C'] - df['A'] * df['B'], number=3)/3

# pd.eval() (single pass, less memory)
t2 = timeit.timeit(lambda: pd.eval('df.A + df.B * df.C - df.A * df.B'), number=3)/3

print(f"Standard: {t1:.4f}s")
print(f"eval():   {t2:.4f}s")

# df.eval() — add result as column
df.eval('D = A + B * C', inplace=True)

# df.query() vs boolean filter
# ❌ Creates temporary boolean arrays
result1 = df[(df['A'] > 0.5) & (df['B'] > 0.5)]
# ✅ More memory efficient for large DataFrames
result2 = df.query('A > 0.5 and B > 0.5')

5. Memory Optimization Patterns

python
12345678910111213141516171819202122232425
# Pattern 1: Read only needed columns
df = pd.read_csv('large_file.csv', usecols=['date', 'revenue', 'region'])

# Pattern 2: Specify dtypes at read time
dtypes = {
    'user_id': np.int32,
    'product_id': np.int16,
    'quantity': np.int8,
    'price': np.float32,
    'category': 'category'
}
df = pd.read_csv('orders.csv', dtype=dtypes)

# Pattern 3: Delete intermediate DataFrames
df_temp = df[df['revenue'] > 1000]
# ... process df_temp ...
del df_temp
import gc; gc.collect()

# Pattern 4: Use inplace operations
df.drop(columns=['unused_col'], inplace=True)
df.dropna(inplace=True)

# Pattern 5: String operations → categorical for repeated values
df[&#039;category'] = df['category'].astype('category')  # If < 50% unique values

6. NumPy Performance Patterns

python
12345678910111213141516171819202122232425262728
# Pattern 1: Pre-allocate output arrays
n = 1_000_000
# ❌ Slow — appending
result = []
for i in range(n):
    result.append(i * 2)

# ✅ Fast — pre-allocated
result = np.empty(n, dtype=np.int64)
result[:] = np.arange(n) * 2

# Pattern 2: Use NumPy's built-in where applicable
data = np.random.rand(n)
# ❌ Python sum
total = sum(data)
# ✅ NumPy sum (faster, especially with BLAS for large arrays)
total = np.sum(data)

# Pattern 3: Avoid unnecessary copies
arr = np.arange(1000000)
# ❌ Creates copy
arr2 = arr + 0
# ✅ In-place addition
arr += 0  # No new array

# Pattern 4: Use contiguous arrays for matrix ops
arr_f = np.asfortranarray(np.arange(1000).reshape(20, 50))
arr_c = np.ascontiguousarray(arr_f)  # Convert to C-contiguous for faster row ops

7. Caching and Memoization

python
1234567891011121314
from functools import lru_cache
import numpy as np

# Cache expensive computations
@lru_cache(maxsize=128)
def expensive_calculation(n):
    return np.sum(np.arange(n) ** 2)

# Or use a dict cache
_cache = {}
def cached_groupby(df_hash, groupby_col, agg_col):
    if df_hash not in _cache:
        _cache[df_hash] = df.groupby(groupby_col)[agg_col].mean()
    return _cache[df_hash]

8. Common Mistakes

  • Using iterrows() for computation: iterrows() is 100-1000x slower than vectorized operations. It's only appropriate for non-vectorizable custom logic.
  • Unnecessary apply() on math: df.apply(lambda x: x['A'] + x['B'], axis=1) vs df['A'] + df['B'] — the latter is 10-50x faster.

9. MCQs

Question 1

Fastest way to double every value in a DataFrame column?

Question 2

pd.eval() advantage?

Question 3

df.query() is preferred over boolean filter when?

Question 4

Pre-allocating NumPy array is faster than appending because?

Question 5

itertuples() vs iterrows()?

Question 6

inplace=True in Pandas operations?

Question 7

Best profiling tool for Python functions?

Question 8

np.ascontiguousarray() is for?

Question 9

usecols=['A','B'] in readcsv improves?

Question 10

When is @lrucache useful in data science?

10. Interview Questions

  • Q: You have a 5GB CSV. How do you process it with Pandas?
  • Q: What are the most common causes of slow Pandas code?

11. Summary

Performance hierarchy: vectorized ops > NumPy ufuncs > str.* > apply() > itertuples() > iterrows(). Use pd.eval() for complex multi-column expressions. Optimize dtypes at load time. Profile with timeit before optimizing. Contiguous C-order arrays are fastest for row-wise NumPy operations.

12. Next Chapter Recommendation

In Chapter 30: Final Projects and Real-World Applications, we build 6 complete data science applications combining every skill from this course.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·