CHAPTER 09 Beginner

Histograms and Distribution Analysis

Updated: May 18, 2026
5 min read

1. Chapter Introduction

Before modeling data, you must understand its distribution — is it normal, skewed, bimodal? Histograms answer this question visually in seconds. This chapter covers histograms, KDE, and distribution comparison across groups.

2. Histogram Fundamentals

python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

np.random.seed(42)
salaries = np.random.normal(loc=75000, scale=15000, size=1000)

fig, axes = plt.subplots(2, 2, figsize=(13, 10))

# 1: Basic histogram
axes[0,0].hist(salaries, bins=30, color='#2196F3', edgecolor='white', alpha=0.8)
axes[0,0].set_title('Basic Histogram (30 bins)')
axes[0,0].set_xlabel('Salary ($)')

# 2: Different bin sizes
axes[0,1].hist(salaries, bins=10, color='#4CAF50', edgecolor='white', alpha=0.8, label='10 bins')
axes[0,1].hist(salaries, bins=50, color='#FF9800', edgecolor='white', alpha=0.5, label='50 bins')
axes[0,1].set_title('Bin Size Comparison')
axes[0,1].legend()

# 3: Histogram + KDE
count, bins, _ = axes[1,0].hist(salaries, bins=30, color='#9C27B0',
                                  edgecolor='white', alpha=0.6, density=True)
x_range = np.linspace(salaries.min(), salaries.max(), 200)
kde = stats.gaussian_kde(salaries)
axes[1,0].plot(x_range, kde(x_range), 'r-', linewidth=2.5, label='KDE')
axes[1,0].axvline(salaries.mean(), color='blue', linestyle='--', label=f'Mean: ${salaries.mean():,.0f}')
axes[1,0].axvline(np.median(salaries), color='green', linestyle='--', label=f'Median: ${np.median(salaries):,.0f}')
axes[1,0].set_title('Histogram + KDE with Reference Lines')
axes[1,0].legend(fontsize=8)

# 4: Skewed distribution
skewed = np.random.exponential(scale=30000, size=1000) + 20000
axes[1,1].hist(skewed, bins=40, color='#F44336', edgecolor='white', alpha=0.8, density=True)
x_s = np.linspace(skewed.min(), skewed.max(), 200)
axes[1,1].plot(x_s, stats.gaussian_kde(skewed)(x_s), 'b-', linewidth=2.5)
skewness = stats.skew(skewed)
axes[1,1].set_title(f'Right-Skewed Distribution (skew={skewness:.2f})')
axes[1,1].set_xlabel('Value')

for ax in axes.flatten():
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.grid(True, alpha=0.3)

plt.suptitle('Histogram Variations', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('histograms.png', dpi=150)
plt.show()
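
The KDE curve in panel 3 can be rebuilt by hand as a sanity check: place one Gaussian bump on every data point and average them. The sketch below assumes SciPy's default bandwidth (Scott's rule, read back from `kde.factor`) and one-dimensional data. The helper name `manual_kde` is hypothetical and exists only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=75000, scale=15000, size=500)

kde = stats.gaussian_kde(sample)            # default bandwidth: Scott's rule

def manual_kde(x, data, factor):
    """Average of Gaussian kernels centred on each data point (1-D only)."""
    h = factor * data.std(ddof=1)           # kernel width gaussian_kde uses in 1-D
    z = (x[:, None] - data[None, :]) / h    # shape: (len(x), len(data))
    return stats.norm.pdf(z).mean(axis=1) / h

x_grid = np.linspace(sample.min(), sample.max(), 200)
by_hand = manual_kde(x_grid, sample, kde.factor)
from_scipy = kde(x_grid)

# Both curves should agree to floating-point precision
print("curves match:", np.allclose(by_hand, from_scipy))
```

A larger `factor` widens each bump and gives a smoother curve; a smaller one follows the data more closely, much like adding bins to a histogram.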

3. Distribution Comparison

python
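# Compare salary distributions across departments
import matplotlib.pyplot as plt
import numpy as np

eng_salaries   = np.random.normal(95000, 18000, 300)
mkt_salaries   = np.random.normal(72000, 12000, 200)
sales_salaries = np.random.normal(65000, 20000, 250)

# Shared bin edges so bars from different departments line up
shared_bins = np.histogram_bin_edges(
    np.concatenate([eng_salaries, mkt_salaries, sales_salaries]), bins=30)

fig, ax = plt.subplots(figsize=(11, 6))

# dept_salaries avoids shadowing the `salaries` array from the previous listing
for dept_salaries, dept, color in [(eng_salaries, 'Engineering', '#1565C0'),
                                   (mkt_salaries, 'Marketing', '#2E7D32'),
                                   (sales_salaries, 'Sales', '#E65100')]:
    ax.hist(dept_salaries, bins=shared_bins, alpha=0.5, color=color, edgecolor='none', label=dept)
    ax.axvline(dept_salaries.mean(), color=color, linestyle='--', linewidth=1.5)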

ax.set_title('Salary Distribution by Department', fontsize=14, fontweight='bold')
ax.set_xlabel('Annual Salary ($)')
ax.set_ylabel('Employee Count')
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'${x/1000:.0f}K'))
ax.legend(title='Department')
ax.grid(True, alpha=0.3)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.savefig('distribution_comparison.png', dpi=150)
plt.show()
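
The departments in the listing above have different headcounts (300, 200, and 250), so raw bar heights partly reflect group size rather than shape. The sketch below illustrates the effect with two hypothetical groups drawn from the same distribution, one ten times larger than the other. It uses `np.histogram`, which accepts the same `density` flag as `ax.hist`; the group names and bin range are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
small_group = rng.normal(72000, 12000, 200)
large_group = rng.normal(72000, 12000, 2000)   # same distribution, 10x the rows

edges = np.linspace(20000, 130000, 31)          # 30 shared bins
widths = np.diff(edges)

results = {}
for name, group in [("small", small_group), ("large", large_group)]:
    counts, _ = np.histogram(group, bins=edges)
    density, _ = np.histogram(group, bins=edges, density=True)
    results[name] = (counts, density)
    area = float((density * widths).sum())      # total bar area of the density version
    print(f"{name}: tallest count bar = {counts.max()}, density area = {area:.3f}")
```

The count bars differ by roughly the ratio of the group sizes, while the density bars of both groups enclose the same total area, which is what makes their shapes directly comparable.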

4. Common Mistakes

  • Wrong bin count: Too few bins hide the shape; too many create noisy patterns. Sturges' rule: bins = 1 + log2(n). For n=1000, that's ~11. The Freedman-Diaconis rule sets the bin width to 2 × IQR × n^(-1/3), so it is robust to outliers and adapts to the spread of the data.
  • Not using density=True for comparison: When comparing histograms of groups with different sample sizes, pass density=True so each histogram is normalized to a probability density (total bar area of 1).
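
Both bin rules mentioned above can be computed directly. The following is a minimal sketch on synthetic data (the sample itself is an assumption), comparing a hand calculation with NumPy's named estimators `"sturges"` and `"fd"` in `np.histogram_bin_edges`.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=75000, scale=15000, size=1000)
n = data.size

# Sturges: depends only on the sample size
sturges_bins = int(np.ceil(1 + np.log2(n)))

# Freedman-Diaconis: bin width from the interquartile range
q75, q25 = np.percentile(data, [75, 25])
fd_width = 2 * (q75 - q25) * n ** (-1 / 3)
fd_bins = int(np.ceil((data.max() - data.min()) / fd_width))

# NumPy ships both rules as named estimators
np_sturges = len(np.histogram_bin_edges(data, bins="sturges")) - 1
np_fd = len(np.histogram_bin_edges(data, bins="fd")) - 1

print("Sturges:", sturges_bins, "| NumPy:", np_sturges)
print("Freedman-Diaconis:", fd_bins, "| NumPy:", np_fd)
```

Matplotlib's `hist` accepts the same strings, so `ax.hist(data, bins="fd")` applies the rule without a manual calculation.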

5. MCQs

Question 1

What is a histogram best suited for?

Question 2

What happens when a histogram has too few bins?

Question 3

What is a KDE (Kernel Density Estimate)?

Question 4

What does density=True do in a histogram?

Question 5

What characterizes a right-skewed distribution?

Question 6

What does stats.skew() return?

Question 7

What does a bimodal distribution look like in a histogram?

Question 8

For large overlapping datasets, which alpha value should you use?

Question 9

What does the Freedman-Diaconis rule use to determine the bin count?

Question 10

What does overlaying multiple histograms to compare distributions require?

6. Interview Questions

  • Q: How do you choose the right number of bins for a histogram?
  • Q: What does a right-skewed distribution tell you about the data?

7. Summary

Histograms reveal distribution shape — normal, skewed, bimodal, uniform. Overlay KDE for smooth approximation. Use density=True when comparing groups of different sizes. Mean vs median divergence indicates skewness. Bin count matters: too few = over-smoothed, too many = noise.
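
The mean-versus-median point can be checked numerically. This sketch assumes two synthetic samples, one symmetric and one built like the right-skewed example earlier in the chapter, and prints the gap between mean and median next to `stats.skew()` for each.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
symmetric = rng.normal(loc=75000, scale=15000, size=5000)
right_skewed = rng.exponential(scale=30000, size=5000) + 20000

for name, values in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    gap = values.mean() - np.median(values)   # positive gap: mean pulled right by the tail
    print(f"{name:>12}: mean - median = {gap:>9,.0f}, skew = {stats.skew(values):.2f}")
```

For the symmetric sample both numbers should sit near zero; for the right-skewed sample the long upper tail drags the mean above the median and the skewness is clearly positive.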

8. Next Chapter Recommendation

In Chapter 10: Box Plots and Statistical Visualization, we visualize quartiles, medians, and outliers — comparing distributions across groups more compactly than histograms.
