CHAPTER 29 Beginner

Performance Optimization for Large Datasets

Updated: May 18, 2026

1. Chapter Introduction

Plotting 10 million data points with plt.scatter() can take a long time to render, and the result is usually an unreadable blob. In SVG-based browser charts, the same data can freeze the page. This chapter covers sampling, pre-aggregation, hexbin charts, WebGL rendering, and simple timing checks for production-scale visualization performance.

2. The Problem with Large Data

python
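import time

import matplotlib.pyplot as plt
import numpy as np

n = 1_000_000
x = np.random.rand(n)
y = np.random.rand(n)

# Test: how slow is a naive scatter for 1M points?
# ax.scatter() only builds the artist; the expensive step is the draw,
# so force a render with fig.canvas.draw() inside the timed region.
t0 = time.perf_counter()
fig, ax = plt.subplots()
ax.scatter(x, y, s=1, alpha=0.1)  # slow to draw, and overplotted into a blob
fig.canvas.draw()
elapsed = time.perf_counter() - t0
ax.set_title(f'1M Points (bad): {elapsed:.2f}s')
print(f'Naive scatter, 1M points: {elapsed:.2f}s')
plt.close(fig)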

# Solution options:
print("Solutions for large data:")
print("1. Random sampling — quick, loses outliers")
print("2. Pre-aggregation — keeps all data, loses detail")
print("3. Hexbin — density visualization, fast, readable")
print("4. WebGL backend — renders millions of points in browser")
print("5. Datashader — production-grade large data rendering")
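Options 2 and 3 rest on the same idea: count points per cell instead of drawing every point. The following is a minimal sketch of that idea using np.histogram2d on the same kind of uniform data as above. The 100 x 100 grid and the fixed seed are arbitrary choices for illustration, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.random(n)
y = rng.random(n)

# Count points per cell on a 100 x 100 grid (grid size is an illustrative choice).
counts, x_edges, y_edges = np.histogram2d(x, y, bins=100, range=[[0, 1], [0, 1]])

print(counts.shape)        # shape of the grid of counts
print(int(counts.sum()))   # every point lands in exactly one cell
print(f"{n:,} points -> {counts.size:,} cells to draw")
```

The grid can then be drawn as an image, for example with ax.imshow(counts.T, origin='lower', extent=[0, 1, 0, 1]). The transpose is needed because np.histogram2d puts x along the first axis.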

3. Strategy 1: Intelligent Sampling

python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)
n = 2_000_000
x = np.concatenate([np.random.normal(3, 1, n//2), np.random.normal(7, 1.5, n//2)])
y = np.concatenate([np.random.normal(4, 1, n//2), np.random.normal(8, 1, n//2)])

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# 1: Random sample (10k points)
sample_idx = np.random.choice(n, 10000, replace=False)
axes[0].scatter(x[sample_idx], y[sample_idx], s=5, alpha=0.3, color='#1565C0')
axes[0].set_title('Random Sample (10K/2M)\nFast but may miss outliers')

# 2: Stratified sample (preserves distribution)
# Split x into 20 equal-count (quantile) buckets and draw up to 500 points from each
edges = np.percentile(x, np.arange(5, 100, 5))   # 19 interior edges -> 20 buckets
bins = np.digitize(x, edges)                     # bucket labels 0..19
sample_idx2 = np.concatenate([
    np.random.choice(members, min(500, len(members)), replace=False)
    for members in (np.flatnonzero(bins == b) for b in range(20))
])
axes[1].scatter(x[sample_idx2], y[sample_idx2], s=5, alpha=0.3, color='#2E7D32')
axes[1].set_title('Stratified Sample\nPreserves distribution shape')

# 3: Hexbin (density)
hb = axes[2].hexbin(x, y, gridsize=60, cmap='Blues', mincnt=1)
plt.colorbar(hb, ax=axes[2], label='Count')
axes[2].set_title('Hexbin Chart\nAll 2M points, shows density')

for ax in axes:
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

plt.suptitle('Large Data Visualization Strategies', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.savefig('large_data.png', dpi=150)
plt.show()
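The random-sample panel above is labelled "may miss outliers". One possible way around that, sketched here under stated assumptions, is to plot a random subset plus every point that lies far from the mean. The helper name sample_keep_outliers, the z-score threshold of 4, and the planted outliers are all hypothetical choices for illustration.

```python
import numpy as np

def sample_keep_outliers(x, y, n_random=10_000, z_thresh=4.0, seed=0):
    """Return (indices to plot, outlier indices).

    The indices to plot are a random subset plus every flagged outlier.
    n_random and z_thresh are illustrative defaults, not tuned values.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    random_idx = rng.choice(n, size=min(n_random, n), replace=False)
    # Flag points more than z_thresh standard deviations from the mean on either axis.
    zx = np.abs((x - x.mean()) / x.std())
    zy = np.abs((y - y.mean()) / y.std())
    outlier_idx = np.flatnonzero((zx > z_thresh) | (zy > z_thresh))
    # union1d returns sorted, de-duplicated indices.
    return np.union1d(random_idx, outlier_idx), outlier_idx

rng = np.random.default_rng(42)
x = rng.normal(5, 1, 200_000)
y = rng.normal(5, 1, 200_000)
x[:3] = [25.0, -15.0, 30.0]   # plant three obvious outliers

idx, outliers = sample_keep_outliers(x, y)
print(len(idx), "points to plot,", len(outliers), "of them flagged as outliers")
```

The result can be passed straight to ax.scatter(x[idx], y[idx], ...). A z-score rule only suits roughly bell-shaped data; for skewed data a percentile cutoff may be a better fit.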

4. Strategy 2: Pre-Aggregation with Pandas

python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Simulate 5M transaction records
np.random.seed(42)
n = 5_000_000
transactions = pd.DataFrame({
    'date':     pd.date_range('2020-01-01', periods=n, freq='min'),
    'revenue':  np.random.exponential(scale=50, size=n),
    'region':   np.random.choice(['North', 'South', 'East', 'West'], n),
    'product':  np.random.choice(['A', 'B', 'C', 'D'], n),
})

# PRE-AGGREGATE before plotting: each groupby is one pass over the 5M rows,
# and the plots then draw a few thousand marks at most instead of millions
daily = transactions.groupby(transactions['date'].dt.floor('D'))['revenue'].sum()
monthly = transactions.groupby(transactions['date'].dt.to_period('M'))['revenue'].sum()
by_region = transactions.groupby('region')['revenue'].sum()

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot aggregates (not raw data)
daily.plot(ax=axes[0], color='#1565C0', linewidth=1, title='Daily Revenue (aggregated)')
monthly.plot(kind='bar', ax=axes[1], color='#2E7D32', title='Monthly Revenue')
by_region.sort_values().plot(kind='barh', ax=axes[2], color='#E65100', title='By Region')

for ax in axes:
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.grid(True, alpha=0.3)

plt.suptitle('5M Transactions → Pre-aggregated (Fast!)', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.savefig('preagg.png', dpi=150)
plt.show()
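To make the size of the reduction concrete, here is a small sketch that counts rows before and after a daily aggregation. It assumes the same minute-by-minute layout as the transactions above but uses 200,000 rows so it runs quickly, and it uses DataFrame.resample as an alternative to the groupby shown earlier.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200_000   # smaller than the 5M above so the sketch runs quickly
df = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=n, freq='min'),
    'revenue': rng.exponential(scale=50, size=n),
})

# One row per calendar day instead of one row per minute.
daily = df.resample('D', on='date')['revenue'].sum()

print(f"{len(df):,} rows -> {len(daily):,} daily points "
      f"({len(df) / len(daily):.0f}x fewer marks to draw)")

# Summing keeps the total intact: the daily sums add up to the raw total.
print(np.isclose(daily.sum(), df['revenue'].sum()))
```

With one record per minute, a day holds 1,440 rows, so daily aggregation shrinks the data by about three orders of magnitude while the overall total stays the same. What is lost is the within-day detail.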

5. Strategy 3: Plotly WebGL for Interactive Large Data

python
import plotly.graph_objects as go
import numpy as np

np.random.seed(42)
n = 500_000
x = np.random.randn(n)
y = np.random.randn(n)
color = np.random.rand(n)

# Scattergl = WebGL rendering (typically much faster than SVG-based go.Scatter for large data)
fig = go.Figure(go.Scattergl(
    x=x, y=y,
    mode='markers',
    marker=dict(size=2, color=color, colorscale='Viridis',
                 opacity=0.3, showscale=True),
))
fig.update_layout(title='500K Points — WebGL (Scattergl) Rendering',
                   template='plotly_white', height=500)
fig.show()
# WebGL draws on the GPU through a canvas, so 500K+ points usually stay interactive

6. Common Mistakes

  • Rendering raw 1M+ points with SVG-based charts: SVG creates one DOM element per point, so the browser slows down or freezes. Use WebGL (Scattergl) or pre-aggregate instead.
  • Sampling without stratification: Pure random sampling can under-represent rare but important events (outliers, fraud, rare products), because a small sample may contain few or none of them.
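The second mistake is easy to reproduce. The sketch below uses made-up data, 100 hypothetical "fraud" rows among one million transactions, and compares a pure random sample with a sample drawn per class. The sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
# Hypothetical labels: 100 'fraud' rows hidden among one million transactions.
is_fraud = np.zeros(n, dtype=bool)
is_fraud[rng.choice(n, size=100, replace=False)] = True

# Pure random sample: the expected number of fraud rows is 1000 * 100 / n = 0.1.
random_idx = rng.choice(n, size=1_000, replace=False)

# Stratified by label: draw from each class separately so the rare one survives.
per_class = 500
strat_idx = np.concatenate([
    rng.choice(members, size=min(per_class, members.size), replace=False)
    for members in (np.flatnonzero(is_fraud), np.flatnonzero(~is_fraud))
])

print("fraud rows in random sample:    ", int(is_fraud[random_idx].sum()))
print("fraud rows in stratified sample:", int(is_fraud[strat_idx].sum()))
```

Note that the stratified sample deliberately over-represents the rare class, so a chart built from it should say so, or encode the classes separately, to avoid suggesting that fraud is common.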

7. MCQs

Question 1

At roughly what data size does plt.scatter() become slow?

Question 2

What advantage does a hexbin chart have for large data?

Question 3

What does the pre-aggregation strategy involve?

Question 4

Which rendering technology does go.Scattergl use in Plotly?

Question 5

What does stratified sampling ensure?

Question 6

What does gridsize=60 control in hexbin?

Question 7

What does mincnt=1 do in hexbin?

Question 8

Roughly how many rows remain after aggregating 5M rows to daily totals?

Question 9

What is the best strategy for a 10M-point scatter plot?

Question 10

How does alpha=0.1 help in a large scatter plot?

8. Interview Questions

  • Q: You have 10 million data points to visualize. What is your approach?
  • Q: What is the difference between SVG and WebGL rendering in Plotly?

9. Summary

Large data visualization strategy: 1) Sample intelligently (stratified sampling keeps rare groups that pure random sampling can miss), 2) Pre-aggregate with Pandas groupby before plotting, 3) Use hexbin to show scatter density, 4) Use Plotly's Scattergl (WebGL) for interactive rendering of 100K+ points. Never render 1M+ raw points in SVG-based charts.

10. Next Chapter Recommendation

In Chapter 30: Final Projects, we build 6 complete production dashboards combining every skill from this course.
