CHAPTER 27 Beginner

Real-World Data Science Projects

Updated: May 18, 2026

5 min read

1. Chapter Introduction

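This chapter brings together the Pandas and NumPy skills from earlier chapters in three complete, real-world-style analysis projects built on synthetic data: a sales analytics dashboard, a student performance analysis, and a customer churn analysis. Projects like these make useful portfolio pieces for a data science professional.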

---

Project 1: Sales Analytics Dashboard

```python
import pandas as pd
import numpy as np

np.random.seed(42)  # fixed seed so the synthetic data is reproducible
n = 500

# Synthetic transactions: one row per day, starting 2024-01-01
sales = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=n, freq='D'),
    'Salesperson': np.random.choice(['Alice','Bob','Carol','David','Eve'], n),
    'Product': np.random.choice(['Laptop','Phone','Monitor','Desk','Chair'], n),
    'Region': np.random.choice(['North','South','East','West'], n),
    'Units': np.random.randint(1, 15, n),  # 1 to 14; the upper bound is exclusive
    # Prices are drawn independently of Product in this synthetic data
    'Unit_Price': np.random.choice([1200, 500, 300, 450, 200], n)
})

# Derived columns
sales['Category'] = sales['Product'].map({'Laptop':'Electronics','Phone':'Electronics',
                                           'Monitor':'Electronics','Desk':'Furniture','Chair':'Furniture'})
sales['Revenue'] = sales['Units'] * sales['Unit_Price']
sales['Quarter'] = sales['Date'].dt.quarter.map({1:'Q1',2:'Q2',3:'Q3',4:'Q4'})

print("=== SALES ANALYTICS DASHBOARD ===\n")
print(f"Period: {sales['Date'].min().date()} to {sales['Date'].max().date()}")
print(f"Total Revenue: ${sales['Revenue'].sum():,.0f}")
print(f"Total Transactions: {len(sales):,}")

# Monthly revenue totals (a ranking of months, not a year-over-year comparison)
monthly = sales.groupby(sales['Date'].dt.to_period('M'))['Revenue'].sum()
print("\nTop 3 months by revenue:")
print(monthly.nlargest(3))

# Salesperson performance: total revenue, order count, average order value
print("\nSalesperson Rankings:")
print(sales.groupby('Salesperson').agg(
    Revenue=('Revenue','sum'), Orders=('Revenue','count'), Avg_Order=('Revenue','mean')
).sort_values('Revenue', ascending=False).round(0))

# Regional breakdown
print("\nRevenue by Region:")
print(sales.groupby('Region')['Revenue'].sum().sort_values(ascending=False))
```
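
The `Category` and `Quarter` columns are created above but do not appear in any of the printed summaries. As a minimal sketch of how they might be used, the block below builds a quarter-by-category revenue table with `pivot_table`. The `demo` frame and its values are invented for illustration and are not taken from the project data.

```python
import pandas as pd

# Hypothetical mini data set; values invented for illustration
demo = pd.DataFrame({
    'Quarter':  ['Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
    'Category': ['Electronics', 'Furniture', 'Electronics', 'Electronics', 'Furniture'],
    'Revenue':  [1200, 450, 500, 300, 200],
})

# Rows = quarters, columns = categories, cells = summed revenue
quarterly = demo.pivot_table(index='Quarter', columns='Category',
                             values='Revenue', aggfunc='sum', fill_value=0)

# Share of each quarter's revenue contributed by each category
share = quarterly.div(quarterly.sum(axis=1), axis=0).round(3)

print(quarterly)
print(share)
```

Swapping `demo` for the `sales` frame should give the same layout for the full data set, assuming the column names match.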

---

Project 2: Student Performance Analysis

```python
import pandas as pd
import numpy as np

np.random.seed(123)  # fixed seed so the synthetic data is reproducible
n_students = 300

# Synthetic student records; scores are clipped to valid ranges
students = pd.DataFrame({
    'Student_ID': [f'S{i:04d}' for i in range(1, n_students+1)],
    'Gender': np.random.choice(['Male','Female'], n_students),
    'School_Type': np.random.choice(['Public','Private'], n_students, p=[0.65, 0.35]),
    'Study_Hours': np.random.normal(5, 2, n_students).clip(0, 12).round(1),
    'Attendance': np.random.normal(85, 10, n_students).clip(50, 100).round(1),
    'Math': np.random.normal(72, 15, n_students).clip(0, 100).round(0),
    'Science': np.random.normal(74, 14, n_students).clip(0, 100).round(0),
    'English': np.random.normal(76, 12, n_students).clip(0, 100).round(0),
})

# Row-wise mean of the three subjects, then letter grades from score bands
students['Average'] = students[['Math','Science','English']].mean(axis=1).round(1)
students['Grade'] = pd.cut(students['Average'],
                            bins=[0,50,60,70,80,100],
                            labels=['F','D','C','B','A'],
                            include_lowest=True)  # so an average of exactly 0 is graded 'F'

print("=== STUDENT PERFORMANCE ANALYSIS ===\n")
print(f"Students: {len(students)}")
print(f"Overall Average: {students['Average'].mean():.1f}")
print(f"\nGrade Distribution:\n{students['Grade'].value_counts().sort_index()}")

# Gender performance gap
print("\nAverage by Gender:")
print(students.groupby('Gender')[['Math','Science','English','Average']].mean().round(2))

# Correlation: study hours and attendance vs performance
# (all columns are generated independently here, so expect values near 0)
corr = students[['Study_Hours','Attendance','Average']].corr()
print("\nCorrelation with Average Score:")
print(corr['Average'].drop('Average').round(3))

# At-risk students: low average OR low attendance
at_risk = students[(students['Average'] < 60) | (students['Attendance'] < 70)]
print(f"\nAt-risk students: {len(at_risk)} ({len(at_risk)/len(students)*100:.1f}%)")
```
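
The grade bands depend on how `pd.cut` treats bin edges. With the default `right=True`, each bin includes its right edge and excludes its left edge, which decides where boundary scores such as 0 or 80 land. A short sketch with hand-picked scores (the `scores` values are chosen only to hit the boundaries):

```python
import pandas as pd

scores = pd.Series([0, 50, 50.1, 80, 80.1, 100])
bins = [0, 50, 60, 70, 80, 100]
labels = ['F', 'D', 'C', 'B', 'A']

# Default: intervals are (0, 50], (50, 60], ... so 0 itself falls outside every bin
default_cut = pd.cut(scores, bins=bins, labels=labels)

# include_lowest=True makes the first interval [0, 50], so a score of 0 becomes 'F'
inclusive_cut = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({'score': scores,
                    'default': default_cut,
                    'include_lowest': inclusive_cut}))
```

Note that a score of exactly 80 is graded `B` under these bins, because 80 is the right edge of the `(70, 80]` interval.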

---

Project 3: Customer Churn Analysis

```python
import pandas as pd
import numpy as np

np.random.seed(42)  # fixed seed so the synthetic data is reproducible
n = 1000

# Synthetic customer records
churn = pd.DataFrame({
    'Customer_ID': [f'C{i:04d}' for i in range(1, n+1)],
    'Age': np.random.randint(18, 75, n),
    'Tenure_Months': np.random.randint(1, 72, n),
    'Monthly_Charges': np.random.normal(65, 25, n).clip(20, 150).round(2),
    'Contract': np.random.choice(['Month-to-Month','One Year','Two Year'], n, p=[0.55, 0.25, 0.20]),
    'Tech_Support': np.random.choice(['Yes','No'], n, p=[0.45, 0.55]),
    'Payment_Method': np.random.choice(['Credit Card','Bank Transfer','Electronic Check','Mailed Check'], n),
})
churn['Total_Charges'] = (churn['Monthly_Charges'] * churn['Tenure_Months']).round(2)

# Churn score: higher for month-to-month contracts, high charges, and short tenure
churn_prob = (0.4 * (churn['Contract'] == 'Month-to-Month').astype(float) +
              0.2 * (churn['Monthly_Charges'] > 80).astype(float) +
              0.1 * (churn['Tenure_Months'] < 12).astype(float))
# Rescale so the highest-risk customers churn with probability 0.5, then draw outcomes
churn['Churned'] = (np.random.random(n) < churn_prob / churn_prob.max() * 0.5).astype(int)

print("=== CUSTOMER CHURN ANALYSIS ===\n")
print(f"Overall churn rate: {churn['Churned'].mean()*100:.1f}%")

print("\nChurn Rate by Contract Type:")
print(churn.groupby('Contract')['Churned'].mean().sort_values(ascending=False).round(3))

print("\nChurn Rate by Payment Method:")
print(churn.groupby('Payment_Method')['Churned'].mean().sort_values(ascending=False).round(3))

# Churned vs Retained comparison
comparison = churn.groupby('Churned')[['Age','Tenure_Months','Monthly_Charges','Total_Charges']].mean()
# Map by label rather than by position, so the names cannot be swapped
comparison = comparison.rename(index={0: 'Retained', 1: 'Churned'})
print("\nChurned vs Retained Customer Profile:")
print(comparison.round(2))
```
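
The simulated churn probability combines three weighted risk factors and rescales them so the riskiest profile churns half the time. The arithmetic can be sketched for a few hypothetical profiles, assuming at least one customer in the data has all three factors (so that `churn_prob.max()` equals 0.7):

```python
import pandas as pd

# The three risk-factor weights used in the simulation
weights = {'month_to_month': 0.4, 'high_charges': 0.2, 'short_tenure': 0.1}
max_score = sum(weights.values())  # 0.7 when all three factors apply

# Hypothetical customer profiles (True = factor present)
profiles = pd.DataFrame(
    {'month_to_month': [True, True, False, False],
     'high_charges':   [True, False, True, False],
     'short_tenure':   [True, False, False, False]},
    index=['all three', 'contract only', 'charges only', 'none'])

# Weighted sum of the factors, then rescaled to a 0 to 0.5 probability
raw = sum(w * profiles[col].astype(float) for col, w in weights.items())
prob = (raw / max_score * 0.5).round(3)
print(prob)
```

A customer with none of the three factors never churns in this simulation, which is a simplification worth keeping in mind when reading the churn rates.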

5. Common Mistakes

Analyzing without domain knowledge: Numbers carry little meaning without context. Before interpreting a metric, establish what it measures in the business setting.

Drawing causal conclusions from correlation: Churn correlating with high charges does not mean that high charges cause churn. The relationship could be confounded by contract type, for example if month-to-month customers tend both to pay more and to leave more often.
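
One way to probe for this kind of confounding is to compare churn rates across charge levels within each contract type, rather than only in the pooled data. The sketch below uses hypothetical counts constructed so that contract type alone drives churn; the column name `Charges_Level` and all numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical counts, constructed so that contract type drives churn
rows = [
    # (contract, charges level, customers, churned)
    ('Month-to-Month', 'High', 8, 4),
    ('Month-to-Month', 'Low',  2, 1),
    ('Two Year',       'High', 2, 0),
    ('Two Year',       'Low',  8, 0),
]
records = []
for contract, level, total, churned in rows:
    for i in range(total):
        records.append({'Contract': contract, 'Charges_Level': level,
                        'Churned': int(i < churned)})
df = pd.DataFrame(records)

# Pooled view: high charges look strongly associated with churn
overall = df.groupby('Charges_Level')['Churned'].mean()

# Stratified view: within each contract type the gap disappears
stratified = df.groupby(['Contract', 'Charges_Level'])['Churned'].mean().unstack()

print(overall)
print(stratified)
```

In this constructed example the pooled churn rate is 40% for high charges against 10% for low charges, yet within each contract type the two rates are identical. Real data is rarely this clean, but a stratified table is a reasonable first check.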

6. MCQs

Question 1

What does `nlargest(3)` return?

Question 2

What is the primary metric in a churn analysis?

Question 3

What does `pd.cut(x, bins=[0,50,100], labels=['Low','High'])` create?

Question 4

What range of values does `corr()` return?

Question 5

What does `.clip(0, 100)` ensure?

Question 6

What does `dt.to_period('M')` convert?

Question 7

What does `f'C{i:04d}'` create?

Question 8

What does `.astype(float)` do to a boolean column?

Question 9

What technique is used to identify at-risk students?

Question 10

What does `axis=1` do in `.mean()` on a DataFrame?
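
Several of the behaviours these questions probe can be checked directly in a Python session. A small sketch with made-up values (all variable names here are illustrative):

```python
import pandas as pd

s = pd.Series([10, 40, 20, 30], index=['a', 'b', 'c', 'd'])
top3 = s.nlargest(3)                      # three largest values, sorted descending

clipped = pd.Series([-5, 50, 120]).clip(0, 100)   # values forced into [0, 100]

dates = pd.Series(pd.to_datetime(['2024-01-15', '2024-02-03']))
periods = dates.dt.to_period('M')         # timestamps become monthly periods

cust_id = f'C{7:04d}'                     # integer zero-padded to width 4

flags = pd.Series([True, False]).astype(float)    # True -> 1.0, False -> 0.0

df = pd.DataFrame({'Math': [70, 90], 'Science': [80, 100]})
row_means = df.mean(axis=1)               # one mean per row, taken across columns

for name, value in [('top3', top3), ('clipped', clipped), ('periods', periods),
                    ('cust_id', cust_id), ('flags', flags), ('row_means', row_means)]:
    print(name, value, sep='\n', end='\n\n')
```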

7. Interview Questions

Q: How would you identify at-risk customers for churn analysis?

Q: What metrics would you track in a sales analytics dashboard?
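
For the dashboard question, one possible answer pairs totals with a growth rate. The sketch below computes month-over-month revenue growth with `pct_change`; the revenue figures and the `MoM_Growth_%` column name are hypothetical, and this is only one of many reasonable metric sets.

```python
import pandas as pd

# Hypothetical monthly revenue totals
monthly = pd.Series([1000.0, 1200.0, 900.0],
                    index=pd.period_range('2024-01', periods=3, freq='M'))

kpis = pd.DataFrame({
    'Revenue': monthly,
    # Percentage change from the previous month; the first month has no predecessor
    'MoM_Growth_%': (monthly.pct_change() * 100).round(1),
})
print(kpis)
```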

8. Summary

Real-world projects draw on the full range of Pandas skills: data generation with NumPy, aggregation and group comparison with `.groupby()`, correlation analysis, segmentation with `pd.cut`, and conditional logic for risk flagging. The three projects in this chapter cover sales analytics, education, and CRM, which are common business domains for data science work.

9. Next Chapter Recommendation

In Chapter 28: Interview Preparation, we compile 50 interview questions, 20 coding challenges, and debugging tasks for Pandas & NumPy interviews.
