Pandas & NumPy
CHAPTER 27 Beginner

Real-World Data Science Projects

Updated: May 18, 2026

1. Chapter Introduction

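This chapter brings together the Pandas and NumPy skills from earlier chapters in three complete analysis projects: a sales analytics dashboard, a student performance analysis, and a customer churn analysis. Each project generates its own synthetic data with NumPy, so every example runs without external files and can be adapted into a portfolio piece.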

---

Project 1: Sales Analytics Dashboard

```python
import pandas as pd
import numpy as np

np.random.seed(42)  # fixed seed so the synthetic data is reproducible
n = 500

# Synthetic transactions, one per day; every column is drawn independently
sales = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=n, freq='D'),
    'Salesperson': np.random.choice(['Alice','Bob','Carol','David','Eve'], n),
    'Product': np.random.choice(['Laptop','Phone','Monitor','Desk','Chair'], n),
    'Category': np.nan,  # placeholder, filled from Product below
    'Region': np.random.choice(['North','South','East','West'], n),
    'Units': np.random.randint(1, 15, n),  # 1 to 14 units (upper bound is exclusive)
    'Unit_Price': np.random.choice([1200, 500, 300, 450, 200], n)
})
sales['Category'] = sales['Product'].map({'Laptop':'Electronics','Phone':'Electronics',
                                           'Monitor':'Electronics','Desk':'Furniture','Chair':'Furniture'})
sales['Revenue'] = sales['Units'] * sales['Unit_Price']
sales['Quarter'] = sales['Date'].dt.quarter.map({1:'Q1',2:'Q2',3:'Q3',4:'Q4'})

print("=== SALES ANALYTICS DASHBOARD ===\n")
print(f"Period: {sales['Date'].min().date()} to {sales['Date'].max().date()}")
print(f"Total Revenue: ${sales['Revenue'].sum():,.0f}")
print(f"Total Transactions: {len(sales):,}")

# Monthly revenue totals and the three strongest months
monthly = sales.groupby(sales['Date'].dt.to_period('M'))['Revenue'].sum()
print("\nTop 3 months by revenue:")
print(monthly.nlargest(3))

# Salesperson performance: total revenue, order count, average order value
print("\nSalesperson Rankings:")
print(sales.groupby('Salesperson').agg(
    Revenue=('Revenue','sum'), Orders=('Revenue','count'), Avg_Order=('Revenue','mean')
).sort_values('Revenue', ascending=False).round(0))

# Regional breakdown
print("\nRevenue by Region:")
print(sales.groupby('Region')['Revenue'].sum().sort_values(ascending=False))
```
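
The monthly totals computed in this project can be turned into a growth rate. The following is a minimal sketch rather than part of the project itself: it assumes a small hand-written table `tx` (hypothetical data, chosen so the arithmetic is easy to check) and uses `pct_change()` to compare each month with the previous one.

```python
import pandas as pd

# Hypothetical transactions (not the project data), small enough to verify by hand
tx = pd.DataFrame({
    'Date': pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-03',
                            '2024-02-25', '2024-03-10']),
    'Revenue': [400, 600, 1500, 500, 1000],
})

# Same grouping as in the dashboard: one total per calendar month
monthly = tx.groupby(tx['Date'].dt.to_period('M'))['Revenue'].sum()

# Month-over-month growth in percent; the first month has no predecessor, so it is NaN
mom_growth = monthly.pct_change().mul(100).round(1)

print(monthly)      # 2024-01: 1000, 2024-02: 2000, 2024-03: 1000
print(mom_growth)   # 2024-01: NaN,  2024-02: 100.0, 2024-03: -50.0
```

On a monthly series that spans more than twelve months, `pct_change(periods=12)` would compare each month with the same month one year earlier instead.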

---

Project 2: Student Performance Analysis

```python
import pandas as pd
import numpy as np

np.random.seed(123)
n_students = 300

# Synthetic student records; scores are clipped to valid ranges
students = pd.DataFrame({
    'Student_ID': [f'S{i:04d}' for i in range(1, n_students+1)],
    'Gender': np.random.choice(['Male','Female'], n_students),
    'School_Type': np.random.choice(['Public','Private'], n_students, p=[0.65, 0.35]),
    'Study_Hours': np.random.normal(5, 2, n_students).clip(0, 12).round(1),
    'Attendance': np.random.normal(85, 10, n_students).clip(50, 100).round(1),
    'Math': np.random.normal(72, 15, n_students).clip(0, 100).round(0),
    'Science': np.random.normal(74, 14, n_students).clip(0, 100).round(0),
    'English': np.random.normal(76, 12, n_students).clip(0, 100).round(0),
})

students['Average'] = students[['Math','Science','English']].mean(axis=1).round(1)
# include_lowest=True keeps an average of exactly 0 in the 'F' band
students['Grade'] = pd.cut(students['Average'],
                            bins=[0,50,60,70,80,100],
                            labels=['F','D','C','B','A'],
                            include_lowest=True)

print("=== STUDENT PERFORMANCE ANALYSIS ===\n")
print(f"Students: {len(students)}")
print(f"Overall Average: {students['Average'].mean():.1f}")
print(f"\nGrade Distribution:\n{students['Grade'].value_counts().sort_index()}")

# Gender performance gap
print("\nAverage by Gender:")
print(students.groupby('Gender')[['Math','Science','English','Average']].mean().round(2))

# Correlation: study hours and attendance vs. average score
corr = students[['Study_Hours','Attendance','Average']].corr()
print("\nCorrelation with Average Score:")
print(corr['Average'].drop('Average').round(3))

# At-risk students: low average OR low attendance
at_risk = students[(students['Average'] < 60) | (students['Attendance'] < 70)]
print(f"\nAt-risk students: {len(at_risk)} ({len(at_risk)/len(students)*100:.1f}%)")
```
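
The grade bands in this project depend on how `pd.cut` treats bin edges. As a rough sketch, assuming a hypothetical series `averages` made up of boundary values, the snippet below shows that intervals are closed on the right by default and that the lowest edge is excluded unless `include_lowest=True` is passed.

```python
import pandas as pd

# Hypothetical boundary values (not taken from the generated student data)
averages = pd.Series([0.0, 50.0, 59.9, 60.0, 60.1, 100.0])
bins = [0, 50, 60, 70, 80, 100]
labels = ['F', 'D', 'C', 'B', 'A']

# Default: intervals are (0, 50], (50, 60], ... so exactly 0 falls outside every bin
default = pd.cut(averages, bins=bins, labels=labels)
print(default.tolist())      # [nan, 'F', 'D', 'D', 'C', 'A']

# include_lowest=True closes the first interval on the left as well
with_lowest = pd.cut(averages, bins=bins, labels=labels, include_lowest=True)
print(with_lowest.tolist())  # ['F', 'F', 'D', 'D', 'C', 'A']
```

One consequence: an average of exactly 60 is graded D, yet the at-risk filter `Average < 60` does not flag it, so the two rules disagree at that single boundary value.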

---

Project 3: Customer Churn Analysis

```python
import pandas as pd
import numpy as np

np.random.seed(42)
n = 1000

# Synthetic customer base
churn = pd.DataFrame({
    'Customer_ID': [f'C{i:04d}' for i in range(1, n+1)],
    'Age': np.random.randint(18, 75, n),
    'Tenure_Months': np.random.randint(1, 72, n),
    'Monthly_Charges': np.random.normal(65, 25, n).clip(20, 150).round(2),
    'Contract': np.random.choice(['Month-to-Month','One Year','Two Year'], n, p=[0.55, 0.25, 0.20]),
    'Tech_Support': np.random.choice(['Yes','No'], n, p=[0.45, 0.55]),
    'Payment_Method': np.random.choice(['Credit Card','Bank Transfer','Electronic Check','Mailed Check'], n),
})
churn['Total_Charges'] = (churn['Monthly_Charges'] * churn['Tenure_Months']).round(2)

# Churn probability: higher for month-to-month contracts, charges above 80, and tenure under 12 months
churn_prob = (0.4 * (churn['Contract'] == 'Month-to-Month').astype(float) +
              0.2 * (churn['Monthly_Charges'] > 80).astype(float) +
              0.1 * (churn['Tenure_Months'] < 12).astype(float))
# Rescale so the riskiest profile churns with probability 0.5, then draw the outcome
churn['Churned'] = (np.random.random(n) < churn_prob / churn_prob.max() * 0.5).astype(int)

print("=== CUSTOMER CHURN ANALYSIS ===\n")
print(f"Overall churn rate: {churn['Churned'].mean()*100:.1f}%")

print("\nChurn Rate by Contract Type:")
print(churn.groupby('Contract')['Churned'].mean().sort_values(ascending=False).round(3))

print("\nChurn Rate by Payment Method:")
print(churn.groupby('Payment_Method')['Churned'].mean().sort_values(ascending=False).round(3))

# Churned vs Retained comparison
comparison = churn.groupby('Churned')[['Age','Tenure_Months','Monthly_Charges','Total_Charges']].mean()
comparison.index = comparison.index.map({0: 'Retained', 1: 'Churned'})  # label by value, not by position
print("\nChurned vs Retained Customer Profile:")
print(comparison.round(2))
```
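
The simulated churn probability is a weighted score that is rescaled so the riskiest profile churns half the time. To make that arithmetic concrete, here is a small sketch that applies the same weights to four hypothetical customer profiles (the `profiles` table is an illustration, not part of the project data).

```python
import pandas as pd

# Four hypothetical profiles covering the extremes of the score
profiles = pd.DataFrame({
    'Contract': ['Month-to-Month', 'Month-to-Month', 'Two Year', 'Two Year'],
    'Monthly_Charges': [95.0, 60.0, 95.0, 60.0],
    'Tenure_Months': [6, 24, 6, 24],
})

# Same weights as the project: 0.4 contract, 0.2 high charges, 0.1 short tenure
score = (0.4 * (profiles['Contract'] == 'Month-to-Month').astype(float) +
         0.2 * (profiles['Monthly_Charges'] > 80).astype(float) +
         0.1 * (profiles['Tenure_Months'] < 12).astype(float))

# Divide by the maximum score (0.7 here) and scale to a top probability of 0.5
profiles['Churn_Prob'] = (score / score.max() * 0.5).round(3)
print(profiles)   # Churn_Prob: 0.5, 0.286, 0.214, 0.0
```

A customer with none of the three risk factors gets a probability of 0, so such customers never churn in this simulation. That is a property of the synthetic generator and should not be read as a finding about real customers.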

5. Common Mistakes

  • Analyzing without domain knowledge: Numbers without context are meaningless. Always understand what each metric means in the business context.
  • Drawing causal conclusions from correlation: A correlation between churn and high charges does not show that high charges cause churn. The relationship could be confounded by a third variable such as contract type.
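
The confounding risk in the last point can be checked by comparing groups within each level of the suspected confounder. The sketch below uses hypothetical counts (the `cells` list is invented for illustration) in which contract type drives churn and also happens to correlate with charge level.

```python
import pandas as pd

# Hypothetical counts per (contract, charge level): customers, and how many churned
cells = [
    ('Month-to-Month', 'High', 60, 30),
    ('Month-to-Month', 'Low',  20, 10),
    ('Two Year',       'High', 20,  2),
    ('Two Year',       'Low',  60,  6),
]

# Expand the counts into one row per customer
rows = [
    {'Contract': contract, 'Charges': level, 'Churned': int(i < churned)}
    for contract, level, total, churned in cells
    for i in range(total)
]
customers = pd.DataFrame(rows)

# Ignoring contract type, high-charge customers churn twice as often
overall = customers.groupby('Charges')['Churned'].mean()
print(overall)   # High 0.4, Low 0.2

# Within each contract type the gap disappears
within = customers.groupby(['Contract', 'Charges'])['Churned'].mean().unstack()
print(within)    # Month-to-Month: 0.5 and 0.5, Two Year: 0.1 and 0.1
```

In this constructed example the whole difference between high and low charges comes from the mix of contract types, which is why a stratified comparison is worth running before any causal claim.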

6. MCQs

Question 1

What does nlargest(3) return?

Question 2

What is the primary metric in a churn analysis?

Question 3

What does pd.cut(x, bins=[0,50,100], labels=['Low','High']) create?

Question 4

What range of values does corr() return?

Question 5

What does .clip(0, 100) ensure?

Question 6

What does dt.to_period('M') convert?

Question 7

What does f'C{i:04d}' create?

Question 8

What does .astype(float) do to a boolean column?

Question 9

What does the at-risk identification use?

Question 10

What does axis=1 do in .mean() on a DataFrame?
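
Several of these questions can be answered by running the call on a tiny input. The following sketch uses made-up values (none come from the projects) to show the behaviour of the methods the questions refer to.

```python
import pandas as pd

# nlargest(3): the three largest values, sorted descending, original index labels kept
top3 = pd.Series([10, 40, 20, 30]).nlargest(3)
print(top3.tolist(), top3.index.tolist())   # [40, 30, 20] [1, 3, 2]

# clip(0, 100): values below 0 become 0, values above 100 become 100
clipped = pd.Series([-5, 50, 120]).clip(0, 100)
print(clipped.tolist())                     # [0, 50, 100]

# dt.to_period('M'): a timestamp becomes its calendar month
period = pd.Series(pd.to_datetime(['2024-03-15'])).dt.to_period('M')
print(period.iloc[0])                       # 2024-03

# f'C{i:04d}': zero-padded to four digits
cid = f'C{7:04d}'
print(cid)                                  # C0007

# astype(float) on booleans: True -> 1.0, False -> 0.0
flags = pd.Series([True, False]).astype(float)
print(flags.tolist())                       # [1.0, 0.0]

# mean(axis=1): one mean per row, computed across the columns
scores = pd.DataFrame({'Math': [80, 60], 'Science': [90, 70]})
row_means = scores.mean(axis=1)
print(row_means.tolist())                   # [85.0, 65.0]

# corr(): Pearson correlation, always between -1 and 1
r = scores['Math'].corr(scores['Science'])
print(-1.0 <= round(r, 9) <= 1.0)           # True
```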

7. Interview Questions

  • Q: How would you identify at-risk customers for churn analysis?
  • Q: What metrics would you track in a sales analytics dashboard?
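
For the first question, one possible starting point is a rule-based risk tier. The sketch below is an assumption-laden illustration rather than a recommended model: the `customers` table, the three risk factors, and the tier cut-offs are all hypothetical, and `np.select` assigns the first tier whose condition matches.

```python
import numpy as np
import pandas as pd

# Hypothetical customers (illustration only)
customers = pd.DataFrame({
    'Customer_ID': ['C0001', 'C0002', 'C0003', 'C0004'],
    'Contract': ['Month-to-Month', 'Month-to-Month', 'One Year', 'Two Year'],
    'Monthly_Charges': [95.0, 55.0, 90.0, 40.0],
    'Tenure_Months': [5, 30, 8, 60],
})

# Count how many of the assumed risk factors apply to each customer
risk_factors = (
    (customers['Contract'] == 'Month-to-Month').astype(int)
    + (customers['Monthly_Charges'] > 80).astype(int)
    + (customers['Tenure_Months'] < 12).astype(int)
)

# Conditions are checked in order; the first match wins
customers['Risk_Tier'] = np.select(
    [risk_factors >= 3, risk_factors >= 1],
    ['High', 'Medium'],
    default='Low',
)
print(customers[['Customer_ID', 'Risk_Tier']])
# C0001 High, C0002 Medium, C0003 Medium, C0004 Low
```

In practice the factors and thresholds would come from observed churn rates per segment, and a fitted model could replace the hand-set rules once enough labelled data is available.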

8. Summary

The projects in this chapter apply the core Pandas skills: generating synthetic data with NumPy, aggregating with groupby, measuring correlation, segmenting with pd.cut, flagging at-risk records with boolean conditions, and comparing groups side by side. The three projects cover sales analytics, education, and CRM, three common business domains in data science.

9. Next Chapter Recommendation

In Chapter 28: Interview Preparation, we compile 50 interview questions, 20 coding challenges, and debugging tasks for Pandas & NumPy interviews.
