CHAPTER 29 Beginner

Performance Optimization for Large Datasets

Updated: May 18, 2026

1. Chapter Introduction

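Plotting 10 million data points with a single plt.scatter() call is slow to render and produces an overplotted, unreadable chart, and sending that many points to an SVG-based browser chart can freeze the page. This chapter covers sampling, pre-aggregation, hexbin charts, WebGL rendering, and profiling techniques for production-scale visualization performance.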

2. The Problem with Large Data

python

import numpy as np
import time

n = 1_000_000
x = np.random.rand(n)
y = np.random.rand(n)

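# Test: how slow is a naive scatter for 1M points?
import matplotlib.pyplot as plt

t0 = time.time()
fig, ax = plt.subplots()
ax.scatter(x, y, s=1, alpha=0.1)  # Slow to draw and heavily overplotted
fig.canvas.draw()  # Force rendering so the timing includes the actual draw
elapsed = time.time() - t0
ax.set_title(f'1M Points (bad): {elapsed:.2f}s')
print(f"Naive scatter of {n:,} points took {elapsed:.2f}s")
plt.close(fig)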

# Solution options:
print("Solutions for large data:")
print("1. Random sampling — quick, loses outliers")
print("2. Pre-aggregation — keeps all data, loses detail")
print("3. Hexbin — density visualization, fast, readable")
print("4. WebGL backend — renders millions of points in browser")
print("5. Datashader — production-grade large data rendering")
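
A quick way to see why a raw scatter stops being useful is to compare the number of points with the number of pixels available to draw them. The sketch below assumes matplotlib's default figure size (6.4 x 4.8 inches) and a dpi of 100; `points_per_pixel` is a hypothetical helper name, not a library function, and the real plotting area is smaller than the full canvas because of margins.

```python
def points_per_pixel(n_points, width_in=6.4, height_in=4.8, dpi=100):
    """Average number of points that land on each pixel of the figure canvas."""
    n_pixels = round(width_in * dpi) * round(height_in * dpi)
    return n_points / n_pixels

for n_points in (10_000, 1_000_000, 10_000_000):
    ratio = points_per_pixel(n_points)
    print(f"{n_points:>10,} points -> {ratio:.2f} points per pixel")
```

Once the ratio goes well above 1, most markers are drawn on top of other markers, so the extra rendering time buys no extra information.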

3. Strategy 1: Intelligent Sampling

python

import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)
n = 2_000_000
x = np.concatenate([np.random.normal(3, 1, n//2), np.random.normal(7, 1.5, n//2)])
y = np.concatenate([np.random.normal(4, 1, n//2), np.random.normal(8, 1, n//2)])

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# 1: Random sample (10k points)
sample_idx = np.random.choice(n, 10000, replace=False)
axes[0].scatter(x[sample_idx], y[sample_idx], s=5, alpha=0.3, color='#1565C0')
axes[0].set_title('Random Sample (10K/2M)\nFast but may miss outliers')

# 2: Stratified sample (preserves distribution)
# Split x into 20 equal-count buckets (5% quantiles), then take up to 500 points from each
edges = np.percentile(x, np.arange(5, 100, 5))  # 19 interior edges -> bucket ids 0..19
bins = np.digitize(x, edges)
sample_idx2 = np.concatenate([
    np.random.choice(np.flatnonzero(bins == b),
                     min(500, np.count_nonzero(bins == b)), replace=False)
    for b in range(20)])
axes[1].scatter(x[sample_idx2], y[sample_idx2], s=5, alpha=0.3, color='#2E7D32')
axes[1].set_title('Stratified Sample\nPreserves distribution shape')

# 3: Hexbin (density)
hb = axes[2].hexbin(x, y, gridsize=60, cmap='Blues', mincnt=1)
plt.colorbar(hb, ax=axes[2], label='Count')
axes[2].set_title('Hexbin Chart\nAll 2M points, shows density')

for ax in axes:
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

plt.suptitle('Large Data Visualization Strategies', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.savefig('large_data.png', dpi=150)
plt.show()
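
Hexbin is fast because it counts points per cell instead of drawing each point. The same idea with rectangular cells can be sketched with `np.histogram2d`. The 60 x 60 grid mirrors `gridsize=60` above but is otherwise an arbitrary choice, and the smaller `n` is only there to keep the example quick.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000  # illustrative size, smaller than the 2M points above
x = np.concatenate([rng.normal(3, 1, n // 2), rng.normal(7, 1.5, n // 2)])
y = np.concatenate([rng.normal(4, 1, n // 2), rng.normal(8, 1, n // 2)])

# Count points per cell on a 60 x 60 grid spanning the data range;
# every point falls into exactly one cell.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=60)

print(counts.shape)                  # shape of the grid of counts
print(int(counts.sum()) == n)        # True: no point is dropped
print(int((counts > 0).sum()), "non-empty cells to draw instead of", n, "markers")
```

The `counts` array can then be shown as an image, for example with `ax.imshow(counts.T, origin='lower', extent=[x_edges[0], x_edges[-1], y_edges[0], y_edges[-1]])`, so the drawing cost depends on the grid size rather than on the number of points.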

4. Strategy 2: Pre-Aggregation with Pandas

python

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Simulate 5M transaction records
np.random.seed(42)
n = 5_000_000
transactions = pd.DataFrame({
    'date':     pd.date_range('2020-01-01', periods=n, freq='min'),
    'revenue':  np.random.exponential(scale=50, size=n),
    'region':   np.random.choice(['North', 'South', 'East', 'West'], n),
    'product':  np.random.choice(['A', 'B', 'C', 'D'], n),
})

# PRE-AGGREGATE before plotting: each chart then draws a few thousand values at most, not 5M rows
# dt.floor('D') keeps the keys as datetimes, which avoids building slow Python date objects
daily = transactions.groupby(transactions['date'].dt.floor('D'))['revenue'].sum()
monthly = transactions.groupby(transactions['date'].dt.to_period('M'))['revenue'].sum()
by_region = transactions.groupby('region')['revenue'].sum()

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot aggregates (not raw data)
daily.plot(ax=axes[0], color='#1565C0', linewidth=1, title='Daily Revenue (aggregated)')
monthly.plot(kind='bar', ax=axes[1], color='#2E7D32', title='Monthly Revenue')
by_region.sort_values().plot(kind='barh', ax=axes[2], color='#E65100', title='By Region')

for ax in axes:
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.grid(True, alpha=0.3)

plt.suptitle('5M Transactions → Pre-aggregated (Fast!)', fontsize=13, fontweight='bold')
plt.tight_layout()
plt.savefig('preagg.png', dpi=150)
plt.show()
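
The payoff of pre-aggregation is the reduction in the number of values the plotting library has to draw. A minimal sketch on a smaller, hypothetical frame (100,000 one-minute rows, chosen only to keep it fast) compares the row count before and after a daily `resample`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 100_000  # illustrative size, smaller than the 5M rows above
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=n, freq="min"),
    "revenue": rng.exponential(scale=50, size=n),
})

# resample("D", on="date") groups rows into calendar days using the datetime column
daily = df.resample("D", on="date")["revenue"].sum()

print(f"raw rows:   {len(df):,}")
print(f"daily rows: {len(daily):,}")
print("total preserved:", np.isclose(daily.sum(), df["revenue"].sum()))
```

Here 100,000 rows collapse to 70 daily totals, and the overall revenue sum is unchanged, so the chart loses per-minute detail but not the totals.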

5. Strategy 3: Plotly WebGL for Interactive Large Data

python

import plotly.graph_objects as go
import numpy as np

np.random.seed(42)
n = 500_000
x = np.random.randn(n)
y = np.random.randn(n)
color = np.random.rand(n)

# Scattergl = WebGL rendering (typically much faster than the SVG-based go.Scatter for large data)
fig = go.Figure(go.Scattergl(
    x=x, y=y,
    mode='markers',
    marker=dict(size=2, color=color, colorscale='Viridis',
                 opacity=0.3, showscale=True),
))
fig.update_layout(title='500K Points: WebGL (Scattergl) Rendering',
                   template='plotly_white', height=500)
fig.show()
# WebGL draws on the GPU through the browser, so 500K+ points stay interactive on typical hardware

6. Common Mistakes

Rendering raw 1M+ points with SVG-based charts: SVG creates one DOM element per point → browser freezes. Use WebGL (Scattergl) or pre-aggregate.

Sampling without stratification: Pure random sampling can under-represent rare but important events (outliers, fraud, rare products).
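
One way to reduce that risk is to keep every point flagged as unusual and sample only the rest. The sketch below uses a simple z-score rule with a cutoff of 4 as the outlier flag. Both the rule and the cutoff are assumptions for illustration, and `sample_keep_outliers` is a hypothetical helper.

```python
import numpy as np

def sample_keep_outliers(values, n_sample, z_cutoff=4.0, seed=0):
    """Return sorted indices: every |z| > z_cutoff point plus a random sample of the rest."""
    rng = np.random.default_rng(seed)
    z = (values - values.mean()) / values.std()
    is_outlier = np.abs(z) > z_cutoff
    outlier_idx = np.flatnonzero(is_outlier)
    normal_idx = np.flatnonzero(~is_outlier)
    k = min(n_sample, len(normal_idx))
    sampled = rng.choice(normal_idx, size=k, replace=False)
    return np.sort(np.concatenate([outlier_idx, sampled]))

rng = np.random.default_rng(42)
values = rng.normal(50, 10, 1_000_000)
values[:20] = 500.0  # inject 20 rare extreme events

idx = sample_keep_outliers(values, n_sample=10_000)
kept_events = int((values[idx] == 500.0).sum())
print(len(idx), "points kept;", kept_events, "of 20 injected events retained")
```

A plain 10,000-point random sample of this array would be expected to contain the injected events only rarely, since they make up 20 of 1,000,000 rows.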

7. MCQs

Question 1

`plt.scatter()` becomes slow at?

Question 2

Hexbin chart advantage for large data?

Question 3

Pre-aggregation strategy involves?

Question 4

`go.Scattergl` in Plotly uses?

Question 5

Stratified sampling ensures?

Question 6

`gridsize=60` in hexbin?

Question 7

`mincnt=1` in hexbin?

Question 8

5M rows → daily aggregation reduces to?

Question 9

Best strategy for 10M point scatter?

Question 10

`alpha=0.1` for large scatter helps?

8. Interview Questions

Q: You have 10 million data points to visualize. What is your approach?

Q: What is the difference between SVG and WebGL rendering in Plotly?

9. Summary

Large data visualization strategy: 1) Sample intelligently (prefer stratified over purely random sampling), 2) Pre-aggregate with Pandas groupby before plotting, 3) Use hexbin for scatter density, 4) Use Plotly's Scattergl (WebGL) for interactive rendering of 100K+ points. Avoid rendering 1M+ raw points in SVG-based charts.

10. Next Chapter Recommendation

In Chapter 30: Final Projects, we build 6 complete production dashboards combining every skill from this course.
