CHAPTER 29 Beginner

Performance Optimization in Pandas and NumPy

Updated: May 18, 2026

5 min read

# CHAPTER 29

Performance Optimization in Pandas & NumPy

1. Chapter Introduction

Data science code that runs in 10 hours is useless in production. This chapter systematically optimizes Pandas and NumPy — replacing loops with vectorized operations, choosing efficient dtypes, using eval/query, and profiling with timeit.

2. Benchmarking Tools

python

1234567891011121314151617181920212223242526272829303132

import pandas as pd
import numpy as np
import time

# timeit — accurate multi-run benchmarking
import timeit

n = 1_000_000
arr = np.random.rand(n)

# Method 1: Python loop
def loop_method(arr):
    return [x * 2 + 1 for x in arr]

# Method 2: NumPy vectorized
def numpy_method(arr):
    return arr * 2 + 1

# Benchmark
t_loop  = timeit.timeit(lambda: loop_method(arr),  number=5) / 5
t_numpy = timeit.timeit(lambda: numpy_method(arr), number=5) / 5

print(f"Loop:  {t_loop:.4f}s")
print(f"NumPy: {t_numpy:.5f}s")
print(f"Speedup: {t_loop/t_numpy:.0f}x")

# Memory profiling
import sys
list_data = [float(x) for x in range(n)]
np_data   = np.arange(n, dtype=np.float64)
print(f"\nList memory:  {sys.getsizeof(list_data)/1e6:.1f} MB")
print(f"NumPy memory: {np_data.nbytes/1e6:.1f} MB")

3. Vectorization — Replace All Loops

python

123456789101112131415161718192021222324252627

# ❌ Slow: Python for loop on DataFrame
df = pd.DataFrame({&#039;A': np.random.randint(1, 100, 100000),
                   &#039;B': np.random.randint(1, 100, 100000)})

# DON'T
slow_result = []
for _, row in df.iterrows():
    if row[&#039;A'] > row['B']:
        slow_result.append(row[&#039;A'] * 2)
    else:
        slow_result.append(row[&#039;B'] * 2)

# ✅ Fast: Vectorized with np.where
fast_result = np.where(df[&#039;A'] > df['B'], df['A'] * 2, df['B'] * 2)

# ❌ Slow: apply() for math operations
df[&#039;result_slow'] = df.apply(lambda row: row['A'] ** 2 + row['B'] ** 2, axis=1)

# ✅ Fast: Direct vectorized math
df[&#039;result_fast'] = df['A'] ** 2 + df['B'] ** 2

# Comparison for string operations
df_str = pd.DataFrame({&#039;text': ['Hello World'] * 100000})
# ❌ Slow
df_str[&#039;lower_slow'] = df_str['text'].apply(lambda x: x.lower())
# ✅ Fast
df_str[&#039;lower_fast'] = df_str['text'].str.lower()

4. eval() and query() — Faster Expression Evaluation

pd.eval

123456789101112131415161718192021222324

n = 1_000_000
df = pd.DataFrame({
    &#039;A': np.random.rand(n),
    &#039;B': np.random.rand(n),
    &#039;C': np.random.rand(n)
})

# Standard Pandas (creates multiple intermediate arrays)
t1 = timeit.timeit(lambda: df[&#039;A'] + df['B'] * df['C'] - df['A'] * df['B'], number=3)/3

# pd.eval() (single pass, less memory)
t2 = timeit.timeit(lambda: pd.eval(&#039;df.A + df.B * df.C - df.A * df.B'), number=3)/3

print(f"Standard: {t1:.4f}s")
print(f"eval():   {t2:.4f}s")

# df.eval() — add result as column
df.eval(&#039;D = A + B * C', inplace=True)

# df.query() vs boolean filter
# ❌ Creates temporary boolean arrays
result1 = df[(df[&#039;A'] > 0.5) & (df['B'] > 0.5)]
# ✅ More memory efficient for large DataFrames
result2 = df.query(&#039;A > 0.5 and B > 0.5')

5. Memory Optimization Patterns

python

12345678910111213141516171819202122232425

# Pattern 1: Read only needed columns
df = pd.read_csv(&#039;large_file.csv', usecols=['date', 'revenue', 'region'])

# Pattern 2: Specify dtypes at read time
dtypes = {
    &#039;user_id': np.int32,
    &#039;product_id': np.int16,
    &#039;quantity': np.int8,
    &#039;price': np.float32,
    &#039;category': 'category'
}
df = pd.read_csv(&#039;orders.csv', dtype=dtypes)

# Pattern 3: Delete intermediate DataFrames
df_temp = df[df[&#039;revenue'] > 1000]
# ... process df_temp ...
del df_temp
import gc; gc.collect()

# Pattern 4: Use inplace operations
df.drop(columns=[&#039;unused_col'], inplace=True)
df.dropna(inplace=True)

# Pattern 5: String operations → categorical for repeated values
df[&#039;category'] = df['category'].astype('category')  # If < 50% unique values

6. NumPy Performance Patterns

python

12345678910111213141516171819202122232425262728

# Pattern 1: Pre-allocate output arrays
n = 1_000_000
# ❌ Slow — appending
result = []
for i in range(n):
    result.append(i * 2)

# ✅ Fast — pre-allocated
result = np.empty(n, dtype=np.int64)
result[:] = np.arange(n) * 2

# Pattern 2: Use NumPy's built-in where applicable
data = np.random.rand(n)
# ❌ Python sum
total = sum(data)
# ✅ NumPy sum (faster, especially with BLAS for large arrays)
total = np.sum(data)

# Pattern 3: Avoid unnecessary copies
arr = np.arange(1000000)
# ❌ Creates copy
arr2 = arr + 0
# ✅ In-place addition
arr += 0  # No new array

# Pattern 4: Use contiguous arrays for matrix ops
arr_f = np.asfortranarray(np.arange(1000).reshape(20, 50))
arr_c = np.ascontiguousarray(arr_f)  # Convert to C-contiguous for faster row ops

7. Caching and Memoization

python

1234567891011121314

from functools import lru_cache
import numpy as np

# Cache expensive computations
@lru_cache(maxsize=128)
def expensive_calculation(n):
    return np.sum(np.arange(n) ** 2)

# Or use a dict cache
_cache = {}
def cached_groupby(df_hash, groupby_col, agg_col):
    if df_hash not in _cache:
        _cache[df_hash] = df.groupby(groupby_col)[agg_col].mean()
    return _cache[df_hash]

8. Common Mistakes

Using iterrows() for computation: iterrows() is 100-1000x slower than vectorized operations. It's only appropriate for non-vectorizable custom logic.

Unnecessary apply() on math: df.apply(lambda x: x['A'] + x['B'], axis=1) vs df['A'] + df['B'] — the latter is 10-50x faster.

9. MCQs

Question 1

Fastest way to double every value in a DataFrame column?

Question 2

`pd.eval()` advantage?

Question 3

`df.query()` is preferred over boolean filter when?

Question 4

Pre-allocating NumPy array is faster than appending because?

Question 5

`itertuples()` vs `iterrows()`?

Question 6

`inplace=True` in Pandas operations?

Question 7

Best profiling tool for Python functions?

Question 8

`np.ascontiguousarray()` is for?

Question 9

`usecols=['A','B']` in readcsv improves?

Question 10

When is `@lru``cache` useful in data science?

10. Interview Questions

Q: You have a 5GB CSV. How do you process it with Pandas?

Q: What are the most common causes of slow Pandas code?

11. Summary

Performance hierarchy: vectorized ops > NumPy ufuncs > str.* > apply() > itertuples() > iterrows(). Use pd.eval() for complex multi-column expressions. Optimize dtypes at load time. Profile with timeit before optimizing. Contiguous C-order arrays are fastest for row-wise NumPy operations.

12. Next Chapter Recommendation

In Chapter 30: Final Projects and Real-World Applications, we build 6 complete data science applications combining every skill from this course.

Browse All 21+ Subject Areas

Quick Links

Visual Algorithm Labs

Frontend Dev

The Future of Web Architecture in 2026

Practice Quizzes

Performance Optimization in Pandas & NumPy #

1. Chapter Introduction #

2. Benchmarking Tools #

3. Vectorization — Replace All Loops #

4. eval() and query() — Faster Expression Evaluation #

5. Memory Optimization Patterns #

6. NumPy Performance Patterns #

7. Caching and Memoization #

8. Common Mistakes #

9. MCQs #

Fastest way to double every value in a DataFrame column?

pd.eval() advantage?

df.query() is preferred over boolean filter when?

Pre-allocating NumPy array is faster than appending because?

itertuples() vs iterrows()?

inplace=True in Pandas operations?

Best profiling tool for Python functions?

np.ascontiguousarray() is for?

usecols=['A','B'] in readcsv improves?

When is @lrucache useful in data science?

10. Interview Questions #

11. Summary #

12. Next Chapter Recommendation #

Finish this Chapter

Discussion

Explore More

📖 Related Tutorials 5

Send Feedback / Bug

Feedback Submitted!

Performance Optimization in Pandas & NumPy

1. Chapter Introduction

2. Benchmarking Tools

3. Vectorization — Replace All Loops

4. eval() and query() — Faster Expression Evaluation

5. Memory Optimization Patterns

6. NumPy Performance Patterns

7. Caching and Memoization

8. Common Mistakes

9. MCQs

`pd.eval()` advantage?

`df.query()` is preferred over boolean filter when?

`itertuples()` vs `iterrows()`?

`inplace=True` in Pandas operations?

`np.ascontiguousarray()` is for?

`usecols=['A','B']` in readcsv improves?

When is `@lru``cache` useful in data science?

10. Interview Questions

11. Summary

12. Next Chapter Recommendation