CHAPTER 23 Beginner

Statistical Analysis with Pandas and NumPy

Updated: May 18, 2026

1. Chapter Introduction

Statistics is the mathematical foundation of data science. This chapter applies descriptive statistics, correlation analysis, basic hypothesis testing, and confidence intervals using Pandas and NumPy. The descriptive statistics need nothing beyond those two libraries; the significance tests and t-based intervals use SciPy's `stats` module.

2. Descriptive Statistics

python

import pandas as pd
import numpy as np
from scipy import stats

data = pd.DataFrame({
    'Exam_A': [85,92,78,96,67,88,74,91,83,88,76,94,71,89,82],
    'Exam_B': [82,88,75,93,70,85,79,90,80,87,73,91,68,88,83],
    &#039;Study_Hours': [6,8,5,9,4,7,5,8,6,7,5,9,4,8,6]
})

# Central tendency
print("=== CENTRAL TENDENCY ===")
for col in ['Exam_A', 'Exam_B']:
    print(f"\n{col}:")
    print(f"  Mean:   {data[col].mean():.2f}")
    print(f"  Median: {data[col].median():.2f}")
    print(f"  Mode:   {data[col].mode()[0]}")

# Dispersion
print("\n=== DISPERSION ===")
for col in ['Exam_A', 'Exam_B']:
    print(f"\n{col}:")
    print(f"  Variance: {data[col].var():.2f}")
    print(f"  Std Dev:  {data[col].std():.2f}")
    print(f"  Range:    {data[col].max() - data[col].min()}")
    print(f"  IQR:      {data[col].quantile(0.75) - data[col].quantile(0.25):.2f}")

# Shape (pandas reports excess kurtosis, so a normal distribution scores about 0)
print("\n=== DISTRIBUTION SHAPE ===")
for col in ['Exam_A', 'Study_Hours']:
    print(f"{col}: skewness={data[col].skew():.3f}, kurtosis={data[col].kurtosis():.3f}")
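
One detail worth checking when mixing the two libraries: pandas `.var()` and `.std()` default to the sample estimate (`ddof=1`), while NumPy's `np.var` and `np.std` default to the population estimate (`ddof=0`). The short sketch below reuses the Exam_A scores from the listing above to show the difference; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Same Exam_A scores as in the listing above
scores = pd.Series([85, 92, 78, 96, 67, 88, 74, 91, 83, 88, 76, 94, 71, 89, 82])
values = scores.to_numpy()

sample_std = scores.std()            # pandas default: ddof=1 (sample estimate)
population_std = np.std(values)      # NumPy default: ddof=0 (population estimate)
matched_std = np.std(values, ddof=1) # NumPy asked explicitly for the sample estimate

print(f"pandas std (ddof=1): {sample_std:.4f}")
print(f"NumPy std  (ddof=0): {population_std:.4f}")
print(f"NumPy std  (ddof=1): {matched_std:.4f}")
```

With only 15 observations the two defaults differ noticeably, so it helps to pass `ddof` explicitly whenever a result is computed in one library and compared against the other.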

3. Correlation Analysis

python

# Pearson correlation (linear relationship)
corr_matrix = data.corr()
print("Correlation Matrix:")
print(corr_matrix.round(3))

# Specific pair correlation
r, p_value = stats.pearsonr(data['Study_Hours'], data['Exam_A'])
print("\nStudy Hours vs Exam A:")
print(f"  Pearson r: {r:.4f}")
print(f"  P-value:   {p_value:.6f}")
print(f"  Significant (p<0.05): {p_value < 0.05}")

# Spearman correlation (monotonic, handles non-linear)
rho, p_spearman = stats.spearmanr(data['Study_Hours'], data['Exam_A'])
print(f"\nSpearman ρ: {rho:.4f}, p-value: {p_spearman:.6f}")

# Interpretation
print("""
Correlation Strength Guide:
|r| 0.0-0.2: Negligible
|r| 0.2-0.4: Weak
|r| 0.4-0.6: Moderate
|r| 0.6-0.8: Strong
|r| 0.8-1.0: Very strong
""")
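
To make the Pearson coefficient less of a black box, here is a minimal sketch that computes it from its definition (the sum of products of deviations divided by the square root of the product of the summed squared deviations) and maps the result onto the strength guide above. `pearson_r` and `strength_label` are hypothetical helper names, and treating each band's lower bound as inclusive is an assumption, since the guide does not say which band a boundary value belongs to.

```python
import numpy as np

# Study hours and Exam_A scores from the chapter's DataFrame
study_hours = np.array([6, 8, 5, 9, 4, 7, 5, 8, 6, 7, 5, 9, 4, 8, 6], dtype=float)
exam_a = np.array([85, 92, 78, 96, 67, 88, 74, 91, 83, 88, 76, 94, 71, 89, 82], dtype=float)

def pearson_r(x, y):
    """Pearson correlation computed directly from deviations about the mean."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

def strength_label(r):
    """Map |r| onto the guide's bands (assumption: lower bounds are inclusive)."""
    magnitude = abs(r)
    for upper, label in [(0.2, "Negligible"), (0.4, "Weak"),
                         (0.6, "Moderate"), (0.8, "Strong")]:
        if magnitude < upper:
            return label
    return "Very strong"

r_manual = pearson_r(study_hours, exam_a)
r_numpy = np.corrcoef(study_hours, exam_a)[0, 1]  # NumPy's built-in for comparison

print(f"manual r = {r_manual:.4f}, np.corrcoef r = {r_numpy:.4f}")
print(f"strength: {strength_label(r_manual)}")
```

The label depends only on the magnitude of r; the sign tells you the direction of the relationship.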

4. Hypothesis Testing Basics

python

# T-test: Is there a significant difference between two groups?
group_a = data['Exam_A']
group_b = data['Exam_B']

# Paired t-test (same students, two exams)
t_stat, p_value = stats.ttest_rel(group_a, group_b)
print("Paired t-test (Exam A vs B):")
print(f"  t-statistic: {t_stat:.4f}")
print(f"  p-value:     {p_value:.6f}")
print(f"  Significant difference: {p_value < 0.05}")
print(f"  Exam A mean: {group_a.mean():.2f}, Exam B mean: {group_b.mean():.2f}")

# Independent t-test (two different groups)
male_scores   = np.array([88, 76, 92, 84, 79, 91, 85, 83])
female_scores = np.array([91, 89, 94, 87, 95, 88, 92, 96])

# ttest_ind assumes equal variances by default; pass equal_var=False for Welch's t-test
t, p = stats.ttest_ind(male_scores, female_scores)
print("\nIndependent t-test (Male vs Female scores):")
print(f"  t-statistic: {t:.4f}, p-value: {p:.6f}")
print(f"  Male mean: {male_scores.mean():.2f}")
print(f"  Female mean: {female_scores.mean():.2f}")
print(f"  Significant: {p < 0.05}")

# Chi-square test (categorical independence)
# For 2x2 tables, chi2_contingency applies Yates' continuity correction by default
contingency_table = np.array([[45, 55], [30, 70]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(contingency_table)
print("\nChi-square test (Gender vs Pass/Fail):")
print(f"  χ² = {chi2:.4f}, p = {p_chi2:.4f}, dof = {dof}")
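
It can help to see what the paired t-test does internally: it is a one-sample t-test on the per-student differences. The sketch below reproduces the statistic by hand for the two exam columns and compares it with `stats.ttest_rel`; the names `diff`, `t_manual`, and `p_manual` are illustrative only.

```python
import numpy as np
from scipy import stats

# Exam scores for the same 15 students, as in the chapter's DataFrame
exam_a = np.array([85, 92, 78, 96, 67, 88, 74, 91, 83, 88, 76, 94, 71, 89, 82], dtype=float)
exam_b = np.array([82, 88, 75, 93, 70, 85, 79, 90, 80, 87, 73, 91, 68, 88, 83], dtype=float)

diff = exam_a - exam_b               # per-student difference
n = len(diff)
se_diff = diff.std(ddof=1) / np.sqrt(n)  # standard error of the mean difference

t_manual = diff.mean() / se_diff
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)  # two-sided p-value

t_scipy, p_scipy = stats.ttest_rel(exam_a, exam_b)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.6f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.6f}")
```

Because the test works on differences, it is the differences (not the raw scores) that should look roughly normal for the result to be trustworthy.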

5. Confidence Intervals

python

# 95% Confidence Interval for mean
def confidence_interval(values, confidence=0.95):
    """Return (mean, lower, upper) for a t-based confidence interval of the mean."""
    n = len(values)
    mean = np.mean(values)
    se = stats.sem(values)  # Standard error of the mean (uses ddof=1)
    ci = stats.t.interval(confidence, df=n-1, loc=mean, scale=se)
    return mean, ci[0], ci[1]

for name, scores in [('Exam A', data['Exam_A']), ('Exam B', data['Exam_B'])]:
    mean, lower, upper = confidence_interval(scores)
    print(f"{name}: Mean={mean:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")

# Bootstrap CI (distribution-free)
rng = np.random.default_rng(42)
exam_a = data['Exam_A'].values
n_boot = 10000
boot_means = [rng.choice(exam_a, len(exam_a), replace=True).mean() for _ in range(n_boot)]
ci_low  = np.percentile(boot_means, 2.5)
ci_high = np.percentile(boot_means, 97.5)
print(f"\nBootstrap 95% CI for Exam A: [{ci_low:.2f}, {ci_high:.2f}]")
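
The t-based interval above boils down to mean ± t_critical × standard error. As a rough sketch, the same bounds can be assembled step by step and checked against `stats.t.interval`; the variable names here are illustrative, not part of the chapter's code.

```python
import numpy as np
from scipy import stats

# Exam_A scores from the chapter's DataFrame
exam_a = np.array([85, 92, 78, 96, 67, 88, 74, 91, 83, 88, 76, 94, 71, 89, 82], dtype=float)

n = len(exam_a)
mean = exam_a.mean()
se = exam_a.std(ddof=1) / np.sqrt(n)   # standard error of the mean

# Critical value leaving 2.5% in each tail of the t distribution with n-1 df
t_crit = stats.t.ppf(0.975, df=n - 1)

lower = mean - t_crit * se
upper = mean + t_crit * se

ref_lower, ref_upper = stats.t.interval(0.95, df=n - 1, loc=mean, scale=se)

print(f"manual 95% CI: [{lower:.2f}, {upper:.2f}]")
print(f"scipy  95% CI: [{ref_lower:.2f}, {ref_upper:.2f}]")
```

Writing it out this way shows why the interval narrows as n grows: the standard error shrinks with the square root of n, and the critical value moves toward its normal-distribution limit.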

6. Common Mistakes

Confusing correlation with causation: A high correlation between two variables does NOT mean one causes the other.
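
A small simulation can make this concrete. In the sketch below, a hidden variable drives two otherwise unrelated quantities, which end up strongly correlated even though neither influences the other. All data are synthetic and the variable names (`temperature`, `ice_cream_sales`, `sunburn_cases`) are invented for illustration; removing the hidden variable's linear effect is one simple way to check, not a general causal-inference method.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical hidden common cause
temperature = rng.normal(25, 5, n)

# Two outcomes that each depend on temperature but not on each other
ice_cream_sales = 2.0 * temperature + rng.normal(0, 5, n)
sunburn_cases = 1.5 * temperature + rng.normal(0, 5, n)

r_raw = np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1]

def residuals(y, x):
    """Remove the best-fit linear effect of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation that remains after accounting for temperature (partial correlation)
r_partial = np.corrcoef(residuals(ice_cream_sales, temperature),
                        residuals(sunburn_cases, temperature))[0, 1]

print(f"raw correlation:            {r_raw:.3f}")
print(f"after removing temperature: {r_partial:.3f}")
```

The raw correlation is high, yet it largely disappears once the common cause is accounted for, which is the pattern to suspect whenever two variables share an obvious driver.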

Using a t-test on non-normal data: The t-test assumes approximately normal data, and violations matter most when samples are small. When that assumption fails, use a non-parametric test such as Mann-Whitney U.
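
As a hedged sketch of that workflow, the example below draws two skewed synthetic samples, checks each for normality with the Shapiro-Wilk test, and then compares the groups with `stats.mannwhitneyu`. The data are simulated (exponential draws with an arbitrary seed), and the 0.05 cutoff is a common convention rather than a rule.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic, right-skewed samples (illustration only)
group_1 = rng.exponential(scale=1.0, size=30)
group_2 = rng.exponential(scale=2.0, size=30)

# Shapiro-Wilk: a small p-value suggests the sample is not normally distributed
p_normal_1 = stats.shapiro(group_1).pvalue
p_normal_2 = stats.shapiro(group_2).pvalue
print(f"Shapiro-Wilk p-values: {p_normal_1:.4f}, {p_normal_2:.4f}")

# Mann-Whitney U compares the groups using ranks, without assuming normality
u_stat, p_u = stats.mannwhitneyu(group_1, group_2, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_u:.4f}")
```

Note that Mann-Whitney U is for independent groups; for paired measurements the usual non-parametric counterpart is the Wilcoxon signed-rank test (`stats.wilcoxon`).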

7. MCQs

Question 1

Mean is most affected by?

Question 2

Median is preferred when?

Question 3

Pearson r of -0.85 means?

Question 4

P-value < 0.05 typically means?

Question 5

`stats.ttest_rel()` is for?

Question 6

Standard error (SE) measures?

Question 7

Chi-square test is for?

Question 8

Bootstrap CI is useful when?

Question 9

Kurtosis measures?

Question 10

IQR (Interquartile Range) is?

8. Interview Questions

Q: What is the difference between correlation and causation?

Q: When would you use a Mann-Whitney test instead of a t-test?

9. Summary
This chapter covered statistical analysis in Python: descriptive statistics (mean, standard deviation, skewness, kurtosis), Pearson and Spearman correlation, hypothesis testing (`ttest_ind`, `ttest_rel`, `chi2_contingency`), and confidence intervals. Always validate assumptions (normality, equal variance) before applying parametric tests.
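
For the equal-variance assumption specifically, one common convention (a sketch, not the only defensible approach) is to run Levene's test first and fall back to Welch's t-test when the variances look unequal. The example reuses the male and female scores from the hypothesis testing section; the 0.05 threshold and the name `equal_var` for the decision flag are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

# Scores from the independent t-test example in section 4
male_scores = np.array([88, 76, 92, 84, 79, 91, 85, 83])
female_scores = np.array([91, 89, 94, 87, 95, 88, 92, 96])

# Levene's test: a small p-value suggests the group variances differ
levene_stat, levene_p = stats.levene(male_scores, female_scores)
equal_var = bool(levene_p >= 0.05)

# equal_var=True gives Student's t-test, equal_var=False gives Welch's t-test
t_stat, p_value = stats.ttest_ind(male_scores, female_scores, equal_var=equal_var)

print(f"Levene p = {levene_p:.4f} -> equal_var={equal_var}")
print(f"t = {t_stat:.4f}, p = {p_value:.6f}")
```

Some practitioners skip the preliminary test and always use Welch's version, since it gives nearly the same answer as Student's test when the variances really are equal.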

10. Next Chapter Recommendation

In Chapter 24: Working with Large Datasets, we handle datasets that don't fit in memory using chunking, Dask, and memory optimization.
