CHAPTER 08 Beginner

Scatter Plots and Correlation Analysis

Updated: May 18, 2026

# CHAPTER 8

Scatter Plots and Correlation Analysis

1. Chapter Introduction

Scatter plots reveal the relationship between two numeric variables. They are among the most widely used visualizations in exploratory data analysis and the natural starting point for regression analysis. This chapter moves from a basic scatter plot to colored scatter plots and bubble charts, using business-style correlation examples built on simulated data.

2. Basic Scatter Plot

python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

np.random.seed(42)
ad_spend  = np.random.uniform(10000, 100000, 100)
sales     = ad_spend * 0.45 + np.random.normal(0, 8000, 100) + 20000

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(ad_spend, sales, alpha=0.6, s=60, color='#2196F3', edgecolors='white', linewidth=0.5)

# Regression line
slope, intercept, r, p, _ = stats.linregress(ad_spend, sales)
x_line = np.linspace(ad_spend.min(), ad_spend.max(), 100)
ax.plot(x_line, slope * x_line + intercept, 'r-', linewidth=2, label=f'Regression (r²={r**2:.3f})')

ax.set_title('Advertising Spend vs Sales Revenue', fontsize=14, fontweight='bold')
ax.set_xlabel('Ad Spend ($)')
ax.set_ylabel('Sales Revenue ($)')
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'${x/1000:.0f}K'))
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'${x/1000:.0f}K'))
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Add correlation text (the 0.7 / 0.4 cut-offs are a common rule of thumb, not a formal standard)
strength = 'Strong' if abs(r) > 0.7 else 'Moderate' if abs(r) > 0.4 else 'Weak'
direction = 'positive' if r > 0 else 'negative'  # derive the sign from r instead of hard-coding it
ax.text(0.05, 0.95, f'Pearson r = {r:.3f}\n{strength} {direction} correlation',
        transform=ax.transAxes, fontsize=11, va='top',
        bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))
plt.tight_layout()
plt.savefig('scatter_basic.png', dpi=150)
plt.show()

3. Colored Scatter (Three Variables)

python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)
experience = np.random.randint(1, 20, 150)
salary     = experience * 4000 + np.random.normal(0, 8000, 150) + 30000
dept       = np.random.choice(['Engineering', 'Marketing', 'Sales'], 150, p=[0.4, 0.3, 0.3])

# One color per department; each department is plotted separately so it gets its own legend entry
dept_colors = {'Engineering': '#1565C0', 'Marketing': '#E91E63', 'Sales': '#2E7D32'}

fig, ax = plt.subplots(figsize=(10, 6))
for d, color in dept_colors.items():
    mask = dept == d
    ax.scatter(experience[mask], salary[mask], label=d, color=color,
               alpha=0.7, s=70, edgecolors='white', linewidth=0.5)

ax.set_title('Experience vs Salary by Department', fontsize=14, fontweight='bold')
ax.set_xlabel('Years of Experience')
ax.set_ylabel('Annual Salary ($)')
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'${x:,.0f}'))
ax.legend(title='Department', framealpha=0.8)
ax.grid(True, alpha=0.3)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.savefig('scatter_colored.png', dpi=150)
plt.show()

4. Bubble Chart (Four Variables)

python
import matplotlib.pyplot as plt

countries = ['USA', 'China', 'Germany', 'Japan', 'UK', 'India', 'Brazil', 'France']
gdp   = [25.5, 17.7, 4.3, 4.2, 3.1, 3.7, 2.1, 2.9]      # Trillion USD
hdi   = [0.926, 0.788, 0.942, 0.920, 0.929, 0.633, 0.754, 0.910]  # HDI
co2   = [14.9, 7.6, 8.2, 8.5, 5.5, 1.9, 6.5, 5.4]         # CO2 per capita
pop   = [331, 1412, 84, 126, 67, 1380, 213, 68]             # Population (M)

fig, ax = plt.subplots(figsize=(11, 7))
scatter = ax.scatter(gdp, hdi, s=[p * 0.2 for p in pop],
                      c=co2, cmap='YlOrRd', alpha=0.8,
                      edgecolors='gray', linewidth=0.8)

for i, country in enumerate(countries):
    ax.annotate(country, (gdp[i], hdi[i]),
                xytext=(5, 5), textcoords='offset points', fontsize=9)

fig.colorbar(scatter, ax=ax, label='CO₂ per capita (tonnes)')
ax.set_xlabel('GDP (Trillion USD)', fontsize=12)
ax.set_ylabel('Human Development Index', fontsize=12)
ax.set_title('GDP vs HDI\n(Bubble size = Population, Color = CO₂ emissions)',
              fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.savefig('bubble_chart.png', dpi=150)
plt.show()
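
The chart title says that bubble size encodes population, but the figure has no key for it. One way to add a size legend is the `legend_elements` method of the collection returned by `ax.scatter`. The following is a minimal sketch, not part of the original listing: it uses a four-country subset of the data above, and the `SIZE_SCALE` constant, the `bubbles` name, and the output filename are illustrative assumptions.

```python
import matplotlib.pyplot as plt

# Subset of the chapter's bubble-chart data
countries = ['USA', 'China', 'Germany', 'India']
gdp = [25.5, 17.7, 4.3, 3.7]        # Trillion USD
hdi = [0.926, 0.788, 0.942, 0.633]  # Human Development Index
pop = [331, 1412, 84, 1380]         # Population (M)

SIZE_SCALE = 0.2  # assumed name for the population-to-marker-area factor used above

fig, ax = plt.subplots(figsize=(9, 6))
bubbles = ax.scatter(gdp, hdi, s=[p * SIZE_SCALE for p in pop],
                     alpha=0.7, edgecolors='gray', linewidth=0.8)

for i, country in enumerate(countries):
    ax.annotate(country, (gdp[i], hdi[i]),
                xytext=(5, 5), textcoords='offset points', fontsize=9)

# legend_elements builds proxy markers for a few representative sizes;
# func inverts the scaling so the labels show population, not marker area
handles, labels = bubbles.legend_elements(prop='sizes', num=4, alpha=0.5,
                                          func=lambda s: s / SIZE_SCALE)
ax.legend(handles, labels, title='Population (M)', loc='lower right',
          labelspacing=1.2, framealpha=0.8)

ax.set_xlabel('GDP (Trillion USD)')
ax.set_ylabel('Human Development Index')
ax.grid(True, alpha=0.3)
fig.tight_layout()
fig.savefig('bubble_size_legend.png', dpi=150)
```

Passing the inverse of the scaling as `func` keeps the legend labels in the original units, so readers do not have to know the scale factor.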

5. Common Mistakes

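  • Overplotting: When many points overlap, dense regions merge into a solid blob and patterns are hidden. Use transparency (for example alpha=0.3), or switch to hexbin for very large datasets.
  • Confusing correlation with causation: A scatter plot showing r=0.9 does NOT prove that X causes Y. It shows only that the two variables move together; a lurking third variable may drive both.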
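
The two overplotting remedies can be compared side by side. The sketch below is illustrative only: the simulated data, the 50,000-point sample size, the `gridsize=40` setting, and the output filename are assumptions chosen for the example, not values from this chapter. It draws the same points once with low alpha and once with `ax.hexbin`, which groups points into hexagonal bins and colors each bin by its count.

```python
import matplotlib.pyplot as plt
import numpy as np

# Simulated data (assumption): 50,000 correlated points, too dense to read as raw markers
rng = np.random.default_rng(42)
n_points = 50_000
x = rng.normal(0, 1, n_points)
y = 0.6 * x + rng.normal(0, 0.8, n_points)

fig, (ax_alpha, ax_hex) = plt.subplots(1, 2, figsize=(12, 5))

# Option 1: transparency, so dense regions appear darker than sparse ones
ax_alpha.scatter(x, y, alpha=0.3, s=5, color='#2196F3', linewidths=0)
ax_alpha.set_title('Scatter with alpha=0.3')

# Option 2: hexagonal binning; mincnt=1 leaves empty hexagons blank
hb = ax_hex.hexbin(x, y, gridsize=40, cmap='Blues', mincnt=1)
fig.colorbar(hb, ax=ax_hex, label='Points per hexagon')
ax_hex.set_title('Hexbin (gridsize=40)')

for ax in (ax_alpha, ax_hex):
    ax.set_xlabel('x')
    ax.set_ylabel('y')

fig.tight_layout()
fig.savefig('overplotting_hexbin.png', dpi=150)

# Sanity check: total number of points captured by the non-empty hexagons
total_binned = int(hb.get_array().sum())
print(f'Points binned: {total_binned}')
```

Transparency is usually enough for a few thousand points. Once the plot saturates even at low alpha, binning tends to read better because the color scale reports density directly.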

6. MCQs

Question 1

What does a scatter plot visualize?

Question 2

What does a Pearson r of 0.85 indicate?

Question 3

What does a bubble chart add to a scatter plot?

Question 4

What does alpha=0.6 help with in a scatter plot?

Question 5

What does stats.linregress(x, y) return?

Question 6

What is a solution to overplotting with millions of points?

Question 7

What does color encoding add to a scatter plot?

Question 8

What do c=values and cmap='YlOrRd' color the points by?

Question 9

What does the correlation-causation warning mean?

a) Correlation is wrong
b) Correlation ≠ causation; lurking variables may explain the relationship

Answer: b

Question 10

What does an R² of 0.81 mean?

7. Interview Questions

  • Q: What is the difference between a scatter plot and a bubble chart?
  • Q: How do you visualize three variables in a scatter plot?

8. Summary

Scatter plots reveal relationships between two numeric variables, which makes them the standard tool for correlation analysis. Add a regression line to quantify the trend. Use color to encode a third variable (often categorical) and bubble size to encode a fourth. Key metrics: Pearson r (strength and direction of the linear relationship) and r² (share of variance explained). Always remind viewers that correlation ≠ causation.
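
The two key metrics can be checked numerically. The sketch below regenerates the simulated ad-spend data from Section 2, computes Pearson r directly from its definition, and confirms that for a simple linear regression r² equals 1 − SS_res/SS_tot, the share of variance in y explained by the fitted line. Variable names such as `r_manual` and `r2_from_residuals` are illustrative.

```python
import numpy as np
from scipy import stats

# Same simulated data as the basic scatter plot in Section 2
np.random.seed(42)
ad_spend = np.random.uniform(10000, 100000, 100)
sales = ad_spend * 0.45 + np.random.normal(0, 8000, 100) + 20000

# Pearson r from its definition: co-variation divided by the product of the spreads
dx = ad_spend - ad_spend.mean()
dy = sales - sales.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same value as reported by SciPy's linear regression
result = stats.linregress(ad_spend, sales)

# r² as "variance explained": 1 - (residual sum of squares / total sum of squares)
predicted = result.slope * ad_spend + result.intercept
ss_res = ((sales - predicted) ** 2).sum()
ss_tot = (dy ** 2).sum()
r2_from_residuals = 1 - ss_res / ss_tot

print(f'Pearson r (manual)     = {r_manual:.4f}')
print(f'Pearson r (linregress) = {result.rvalue:.4f}')
print(f'r² (rvalue squared)    = {result.rvalue ** 2:.4f}')
print(f'r² (1 - SS_res/SS_tot) = {r2_from_residuals:.4f}')
```

The first two printed values should agree, as should the last two. This identity holds for a straight-line fit with an intercept; it is not a general property of every model.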

9. Next Chapter Recommendation

In Chapter 9: Histograms and Distribution Analysis, we visualize the spread and shape of single numeric variables — essential for understanding data before modeling.
