CHAPTER 23 Beginner

Statistical Analysis with Pandas and NumPy

Updated: May 18, 2026

1. Chapter Introduction

Statistics is the mathematical foundation of data science. This chapter covers descriptive statistics, correlation analysis, hypothesis testing basics, and confidence intervals. Pandas and NumPy are enough for the descriptive statistics; the significance tests and t-based intervals use SciPy's stats module.

2. Descriptive Statistics

python
import pandas as pd
import numpy as np
from scipy import stats

data = pd.DataFrame({
    'Exam_A': [85,92,78,96,67,88,74,91,83,88,76,94,71,89,82],
    'Exam_B': [82,88,75,93,70,85,79,90,80,87,73,91,68,88,83],
    'Study_Hours': [6,8,5,9,4,7,5,8,6,7,5,9,4,8,6]
})

# Central tendency
print("=== CENTRAL TENDENCY ===")
for col in ['Exam_A', 'Exam_B']:
    print(f"\n{col}:")
    print(f"  Mean:   {data[col].mean():.2f}")
    print(f"  Median: {data[col].median():.2f}")
    print(f"  Mode:   {data[col].mode()[0]}")

# Dispersion
print("\n=== DISPERSION ===")
for col in ['Exam_A', 'Exam_B']:
    print(f"\n{col}:")
    print(f"  Variance: {data[col].var():.2f}")
    print(f"  Std Dev:  {data[col].std():.2f}")
    print(f"  Range:    {data[col].max() - data[col].min()}")
    print(f"  IQR:      {data[col].quantile(0.75) - data[col].quantile(0.25):.2f}")

# Shape (pandas reports excess kurtosis, so a normal distribution scores 0)
print("\n=== DISTRIBUTION SHAPE ===")
for col in ['Exam_A', 'Study_Hours']:
    print(f"{col}: skewness={data[col].skew():.3f}, kurtosis={data[col].kurtosis():.3f}")
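
One detail behind the dispersion figures above is easy to miss: pandas `.var()` and `.std()` divide by n-1 (the sample estimate), while NumPy's `np.var()` and `np.std()` divide by n unless `ddof=1` is passed. The following is a minimal sketch of that difference; it reuses the Exam_A scores from the listing above, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Exam_A scores copied from the chapter's DataFrame
exam_a = pd.Series([85, 92, 78, 96, 67, 88, 74, 91, 83, 88, 76, 94, 71, 89, 82])

n = len(exam_a)
squared_dev = ((exam_a - exam_a.mean()) ** 2).sum()  # sum of squared deviations

sample_var = squared_dev / (n - 1)   # sample variance (ddof=1)
population_var = squared_dev / n     # population variance (ddof=0)

values = exam_a.to_numpy()
print(f"pandas .var():         {exam_a.var():.2f}")
print(f"manual, divide by n-1: {sample_var:.2f}")
print(f"np.var(), default:     {np.var(values):.2f}")
print(f"manual, divide by n:   {population_var:.2f}")
print(f"np.var(..., ddof=1):   {np.var(values, ddof=1):.2f}")
```

With only 15 observations the two estimates differ noticeably, so it is worth checking which one a given result uses before comparing pandas and NumPy output.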

3. Correlation Analysis

python
# Pearson correlation (linear relationship)
corr_matrix = data.corr()
print("Correlation Matrix:")
print(corr_matrix.round(3))

# Specific pair correlation
r, p_value = stats.pearsonr(data['Study_Hours'], data['Exam_A'])
print(f"\nStudy Hours vs Exam A:")
print(f"  Pearson r: {r:.4f}")
print(f"  P-value:   {p_value:.6f}")
print(f"  Significant (p<0.05): {p_value < 0.05}")

# Spearman correlation (rank-based; captures monotonic, possibly non-linear relationships)
rho, p_spearman = stats.spearmanr(data['Study_Hours'], data['Exam_A'])
print(f"\nSpearman ρ: {rho:.4f}, p-value: {p_spearman:.6f}")

# Interpretation
print("""
Correlation Strength Guide:
|r| 0.0-0.2: Negligible
|r| 0.2-0.4: Weak
|r| 0.4-0.6: Moderate
|r| 0.6-0.8: Strong
|r| 0.8-1.0: Very strong
""")

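To make the two coefficients above less of a black box, the sketch below computes Pearson r directly from its definition and then obtains Spearman's ρ as Pearson r applied to ranks. It uses the Study_Hours and Exam_A values from the chapter's DataFrame; the helper name `pearson_r` is hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd

study_hours = np.array([6, 8, 5, 9, 4, 7, 5, 8, 6, 7, 5, 9, 4, 8, 6], dtype=float)
exam_a = np.array([85, 92, 78, 96, 67, 88, 74, 91, 83, 88, 76, 94, 71, 89, 82], dtype=float)

def pearson_r(x, y):
    """Pearson r: co-deviation divided by the product of the deviation norms."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_manual = pearson_r(study_hours, exam_a)
r_numpy = np.corrcoef(study_hours, exam_a)[0, 1]  # reference value from NumPy

# Spearman: rank each variable (ties receive the average rank), then apply Pearson
rank_x = pd.Series(study_hours).rank().to_numpy()
rank_y = pd.Series(exam_a).rank().to_numpy()
rho_manual = pearson_r(rank_x, rank_y)

print(f"Pearson r (manual):    {r_manual:.4f}")
print(f"Pearson r (np):        {r_numpy:.4f}")
print(f"Spearman rho (manual): {rho_manual:.4f}")
```

Because Spearman only looks at ranks, it is less sensitive to outliers and to curvature than Pearson, which is why the two values can diverge on skewed data.
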
4. Hypothesis Testing Basics

python
# T-test: Is there a significant difference between two groups?
group_a = data['Exam_A']
group_b = data['Exam_B']

# Paired t-test (same students, two exams)
t_stat, p_value = stats.ttest_rel(group_a, group_b)
print(f"Paired t-test (Exam A vs B):")
print(f"  t-statistic: {t_stat:.4f}")
print(f"  p-value:     {p_value:.6f}")
print(f"  Significant difference: {p_value < 0.05}")
print(f"  Exam A mean: {group_a.mean():.2f}, Exam B mean: {group_b.mean():.2f}")

# Independent t-test (two different groups)
male_scores   = np.array([88, 76, 92, 84, 79, 91, 85, 83])
female_scores = np.array([91, 89, 94, 87, 95, 88, 92, 96])

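# equal_var=True (the default) assumes both groups share a variance;
# pass equal_var=False for Welch's t-test when that is doubtful
t, p = stats.ttest_ind(male_scores, female_scores)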
print(f"\nIndependent t-test (Male vs Female scores):")
print(f"  t-statistic: {t:.4f}, p-value: {p:.6f}")
print(f"  Male mean: {male_scores.mean():.2f}")
print(f"  Female mean: {female_scores.mean():.2f}")
print(f"  Significant: {p < 0.05}")

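# Chi-square test (categorical independence)
# Illustrative 2x2 counts (gender by pass/fail). For a 2x2 table SciPy
# applies Yates' continuity correction by default (correction=True).
contingency_table = np.array([[45, 55], [30, 70]])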
chi2, p_chi2, dof, expected = stats.chi2_contingency(contingency_table)
print(f"\nChi-square test (Gender vs Pass/Fail):")
print(f"  χ² = {chi2:.4f}, p = {p_chi2:.4f}, dof = {dof}")

5. Confidence Intervals

python
# 95% Confidence Interval for mean
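def confidence_interval(values, confidence=0.95):
    # Parameter is named `values` so it does not shadow the `data` DataFrame
    n = len(values)
    mean = np.mean(values)
    se = stats.sem(values)  # Standard error of the mean (uses ddof=1)
    ci = stats.t.interval(confidence, df=n-1, loc=mean, scale=se)
    return mean, ci[0], ci[1]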

for name, scores in [('Exam A', data['Exam_A']), ('Exam B', data['Exam_B'])]:
    mean, lower, upper = confidence_interval(scores)
    print(f"{name}: Mean={mean:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")

# Bootstrap CI (distribution-free)
rng = np.random.default_rng(42)
exam_a = data['Exam_A'].to_numpy()
n_boot = 10000
boot_means = [rng.choice(exam_a, len(exam_a), replace=True).mean() for _ in range(n_boot)]
ci_low  = np.percentile(boot_means, 2.5)
ci_high = np.percentile(boot_means, 97.5)
print(f"\nBootstrap 95% CI for Exam A: [{ci_low:.2f}, {ci_high:.2f}]")
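
The t-based interval above is the familiar formula mean ± t* × s/√n. As a rough sketch of what `stats.t.interval` does internally, the block below builds the same 95% interval from the critical value given by `stats.t.ppf`; it assumes the Exam_A scores from the chapter, and the names `t_crit`, `lower`, and `upper` are illustrative.

```python
import numpy as np
from scipy import stats

exam_a = np.array([85, 92, 78, 96, 67, 88, 74, 91, 83, 88, 76, 94, 71, 89, 82], dtype=float)

n = len(exam_a)
mean = exam_a.mean()
se = exam_a.std(ddof=1) / np.sqrt(n)   # standard error of the mean

# Two-sided 95% interval leaves 2.5% in each tail
t_crit = stats.t.ppf(0.975, df=n - 1)
lower = mean - t_crit * se
upper = mean + t_crit * se

print(f"t* = {t_crit:.4f}")
print(f"95% CI by hand: [{lower:.2f}, {upper:.2f}]")
print(f"stats.t.interval: {stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)}")
```

With small samples the t critical value is larger than the normal value of about 1.96, which widens the interval to reflect the extra uncertainty in the estimated standard deviation.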

6. Common Mistakes

  • Confusing correlation with causation: A high correlation between two variables does NOT mean one causes the other.
  • Using a t-test on non-normal data: the t-test assumes approximately normal data, which matters most for small samples. When that assumption fails, use a non-parametric test such as Mann-Whitney U.
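
The second mistake can be guarded against in code. The sketch below is one possible workflow, not a fixed rule: it runs a Shapiro-Wilk normality check on each group and then the Mann-Whitney U test as the non-parametric alternative to `ttest_ind`. It reuses the male and female scores from the hypothesis-testing listing; note that with only 8 values per group a normality test has little power, so a non-significant result is weak evidence of normality.

```python
import numpy as np
from scipy import stats

male_scores = np.array([88, 76, 92, 84, 79, 91, 85, 83])
female_scores = np.array([91, 89, 94, 87, 95, 88, 92, 96])

# Shapiro-Wilk: a small p-value suggests the sample is not normally distributed
for name, scores in [("Male", male_scores), ("Female", female_scores)]:
    w_stat, p_norm = stats.shapiro(scores)
    print(f"{name}: W={w_stat:.4f}, p={p_norm:.4f}")

# Mann-Whitney U compares the two groups using ranks, with no normality assumption
u_stat, p_mw = stats.mannwhitneyu(male_scores, female_scores, alternative="two-sided")
print(f"Mann-Whitney U={u_stat:.1f}, p={p_mw:.4f}")
print(f"Significant (p<0.05): {p_mw < 0.05}")
```

For paired data such as Exam A versus Exam B, the analogous rank-based choice would be the Wilcoxon signed-rank test (`stats.wilcoxon`) rather than Mann-Whitney U.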

7. MCQs

Question 1

Mean is most affected by?

Question 2

Median is preferred when?

Question 3

Pearson r of -0.85 means?

Question 4

P-value < 0.05 typically means?

Question 5

stats.ttest_rel() is for?

Question 6

Standard error (SE) measures?

Question 7

Chi-square test is for?

Question 8

Bootstrap CI is useful when?

Question 9

Kurtosis measures?

Question 10

IQR (Interquartile Range) is?

8. Interview Questions

  • Q: What is the difference between correlation and causation?
  • Q: When would you use a Mann-Whitney test instead of a t-test?

9. Summary

Statistical analysis in Python covers descriptive stats (mean, std, skew, kurtosis), Pearson/Spearman correlation, hypothesis testing (ttest_ind, ttest_rel, chi2_contingency), and confidence intervals. Always validate assumptions (normality, equal variance) before applying parametric tests.

10. Next Chapter Recommendation

In Chapter 24: Working with Large Datasets, we handle datasets that don't fit in memory using chunking, Dask, and memory optimization.
