CHAPTER 01 Beginner

Introduction to Data Science, Pandas, and NumPy

Updated: May 18, 2026
5 min read

1. Chapter Introduction

Data is often called the oil of the 21st century, but raw oil is useless without refining. Data science is the discipline of extracting meaning from data. NumPy and Pandas are two of the most widely used Python libraries for that work, at organizations such as Google, Netflix, and NASA as well as in research.

Analogy: NumPy is like a high-powered calculator for arrays of numbers. Pandas is like a programmable spreadsheet: the same rows-and-columns model as Excel, but scriptable, reproducible, and much better suited to large tables.

2. Learning Objectives

  • Understand what data science is and where it applies.
  • Know what NumPy and Pandas are and why they exist.
  • Understand the difference between NumPy arrays and Pandas DataFrames.
  • Identify real-world applications of data science.

3. What is Data Science?

text
Data Science Pipeline:

Raw Data → Collection → Cleaning → Analysis → Visualization → Insights → Action

Tools at each stage:
Collection:    APIs, web scraping, databases, CSV files
Cleaning:      Pandas (dropna, fillna, drop_duplicates)
Analysis:      Pandas + NumPy (groupby, aggregate, math)
Visualization: Matplotlib, Seaborn, Plotly
Insights:      Statistical analysis, ML models
Action:        Business decisions, dashboards, reports
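
The cleaning and analysis stages can be sketched in a few lines of Pandas. This is a minimal illustration, not a full pipeline: the `raw` table, its `region` and `units` columns, and the choice to fill missing values with 0 are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one missing value and one exact duplicate row
raw = pd.DataFrame({
    "region": ["North", "South", "South", "North", "East"],
    "units":  [10, 5, 5, np.nan, 8],
})

# Cleaning: drop exact duplicate rows, then fill the missing value with 0
# (raw.dropna() would instead discard the row that has the missing value)
clean = raw.drop_duplicates().fillna({"units": 0})

# Analysis: total units per region
totals = clean.groupby("region")["units"].sum()
print(totals)
```

Whether to fill or drop missing values depends on the data; filling with 0 is only reasonable when a missing entry really means "nothing sold".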

4. What is NumPy?

python
# NumPy = Numerical Python
# The fundamental package for scientific computing in Python

import numpy as np

# NumPy's core: the ndarray (N-dimensional array)
# Much faster than Python lists for numerical computation

# Python list (slow for math):
python_list = [1, 2, 3, 4, 5]
result = [x * 2 for x in python_list]   # Needs a loop

# NumPy array (fast — C-level computation):
np_array = np.array([1, 2, 3, 4, 5])
result = np_array * 2   # Vectorized — no loop needed!
print(result)           # [ 2  4  6  8 10]
text
NumPy Capabilities:
✅ N-dimensional arrays (1D, 2D, 3D, nD)
✅ Vectorized math (no Python loops)
✅ Linear algebra (matrices, eigenvalues)
✅ Random number generation
✅ Fourier transforms
✅ Broadcasting (operations on different-shaped arrays)
✅ Foundation for Pandas, Scikit-learn, TensorFlow
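
A few of these capabilities can be seen in a short sketch. The arrays and the seed value below are arbitrary choices for illustration.

```python
import numpy as np

# N-dimensional arrays: a 2D array with 2 rows and 3 columns
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
print(matrix.shape)   # (2, 3)

# Broadcasting: the 1D row is added to every row of the 2D array
row = np.array([10, 20, 30])
shifted = matrix + row
print(shifted)

# Linear algebra: solve the system A @ x = b
A = np.array([[2.0, 0.0],
              [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(A, b)
print(x)              # [1. 2.]

# Random number generation with a seeded generator for reproducibility
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0.0, scale=1.0, size=5)
print(samples.shape)  # (5,)
```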

5. What is Pandas?

python
# The name "pandas" derives from "panel data", an econometrics term
# One of the most popular Python libraries for data analysis

import pandas as pd

# Pandas' core: DataFrame — a 2D labeled table like Excel/SQL
data = {
    'Name': ['Alice', 'Bob', 'Carol', 'David'],
    'Age':  [25, 30, 35, 28],
    'Salary': [55000, 72000, 88000, 61000],
    'Department': ['Engineering', 'Marketing', 'Engineering', 'Sales']
}

df = pd.DataFrame(data)
print(df)
text
Output:
    Name  Age  Salary   Department
0  Alice   25   55000  Engineering
1    Bob   30   72000    Marketing
2  Carol   35   88000  Engineering
3  David   28   61000        Sales
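
Once the data is in a DataFrame, the labels do most of the work. The sketch below rebuilds the same table so it runs on its own, then shows three common operations: selecting a column, filtering rows, and aggregating by group.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'David'],
    'Age': [25, 30, 35, 28],
    'Salary': [55000, 72000, 88000, 61000],
    'Department': ['Engineering', 'Marketing', 'Engineering', 'Sales'],
})

# Select one column by name: the result is a Series
ages = df['Age']

# Filter rows with a boolean mask
engineers = df[df['Department'] == 'Engineering']

# Aggregate: mean salary per department
mean_salary = df.groupby('Department')['Salary'].mean()

print(engineers['Name'].tolist())   # ['Alice', 'Carol']
print(mean_salary['Engineering'])   # 71500.0
```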

6. NumPy vs Pandas — When to Use Which

text
+------------------+---------------------------+---------------------------+
| Feature          | NumPy                     | Pandas                    |
+------------------+---------------------------+---------------------------+
| Data type        | Homogeneous (one dtype)   | Mixed (strings + numbers) |
| Structure        | Arrays (nD)               | Series (1D) + DataFrame   |
| Labels           | No (integer index only)   | Yes (named columns/index) |
| Use case         | Math, ML, computation     | Data analysis, cleaning   |
| Excel equivalent | Arrays/matrices           | Spreadsheet               |
| Best for         | Linear algebra, ML        | CSV, SQL, EDA             |
+------------------+---------------------------+---------------------------+
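
The first and third rows of the table are easy to check directly. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype for all of its elements
arr = np.array([1, 2, 3.5])
print(arr.dtype)            # float64: the integers were upcast to float

# A DataFrame keeps a separate dtype for each column
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [90.5, 82.0]})
print(df.dtypes)

# Pandas can address data by label; NumPy addresses it by position
print(df.loc[0, 'score'])   # 90.5
print(arr[0])               # 1.0

# A numeric column can be handed to NumPy when raw arrays are needed
scores = df['score'].to_numpy()
print(type(scores).__name__)   # ndarray
```

In practice the two are used together: Pandas for loading, labeling, and cleaning, and NumPy for the numerical work on the resulting columns.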

7. Industry Applications

text
Finance:
  - Stock price analysis (time series)
  - Risk modeling (statistical analysis)
  - Fraud detection (anomaly detection)

Healthcare:
  - Patient outcome analysis
  - Drug efficacy studies
  - Epidemiology (COVID-19 tracking)

Retail/Ecommerce:
  - Sales analytics and forecasting
  - Customer segmentation
  - Inventory optimization

Technology:
  - Log analysis
  - A/B testing
  - Recommendation systems

Used by: Google, Meta, Netflix, Uber, Airbnb, NASA, WHO

8. Mini Project: Analyze Sample Student Data

python
import pandas as pd
import numpy as np

# Sample student dataset
students = {
    'Name': ['Alice', 'Bob', 'Carol', 'David', 'Eve', 'Frank'],
    'Math': [92, 78, 85, 60, 95, 72],
    'Science': [88, 82, 90, 55, 97, 65],
    'English': [76, 88, 84, 70, 89, 80],
    'Grade': ['A', 'B', 'B', 'C', 'A', 'B']
}

df = pd.DataFrame(students)

# 1. Basic info
print("=== Student Dataset ===")
print(df)
print(f"\nShape: {df.shape}")     # (6, 5) — 6 rows, 5 columns
print(f"Total students: {len(df)}")

# 2. Statistical summary with NumPy
math_scores = np.array(df['Math'])
print(f"\nMath Statistics:")
print(f"  Mean:   {np.mean(math_scores):.1f}")    # 80.3
print(f"  Median: {np.median(math_scores):.1f}")  # 81.5
print(f"  Std:    {np.std(math_scores):.1f}")     # 12.0 (population std, ddof=0)
print(f"  Min:    {np.min(math_scores)}")         # 60
print(f"  Max:    {np.max(math_scores)}")         # 95

# 3. Overall score
df['Total'] = df['Math'] + df['Science'] + df['English']
df['Average'] = df['Total'] / 3

# 4. Rank students
df = df.sort_values('Average', ascending=False).reset_index(drop=True)
df.index = df.index + 1  # Rank starts at 1
df.index.name = 'Rank'

print("\n=== Student Rankings ===")
print(df[['Name', 'Math', 'Science', 'English', 'Average', 'Grade']])

# 5. Grade distribution
print("\n=== Grade Distribution ===")
print(df['Grade'].value_counts())

# 6. Top performers
top_students = df[df['Average'] >= 85]
print(f"\nTop performers (avg ≥ 85): {list(top_students['Name'])}")
text
Output (excerpt):
=== Student Rankings ===
      Name  Math  Science  English    Average Grade
Rank
1      Eve    95       97       89  93.666667     A
2    Carol    85       90       84  86.333333     B
3    Alice    92       88       76  85.333333     A
4      Bob    78       82       88  82.666667     B
5    Frank    72       65       80  72.333333     B
6    David    60       55       70  61.666667     C

Top performers (avg ≥ 85): ['Eve', 'Carol', 'Alice']
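
One detail in the statistics step is worth a closer look: `np.std` and the Pandas `.std()` method use different defaults, so they give different answers on the same data. The sketch below reuses the Math scores from the project to show the gap.

```python
import numpy as np
import pandas as pd

math = pd.Series([92, 78, 85, 60, 95, 72], name='Math')

# NumPy defaults to the population standard deviation (ddof=0)
pop_std = np.std(math.to_numpy())

# Pandas defaults to the sample standard deviation (ddof=1)
sample_std = math.std()

print(f"{pop_std:.1f}")      # 12.0
print(f"{sample_std:.1f}")   # 13.1

# Passing ddof explicitly makes the two libraries agree
print(np.isclose(np.std(math.to_numpy(), ddof=1), sample_std))   # True
```

Neither default is wrong; the point is to know which one a given number came from before comparing results.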

9. Common Mistakes

  • Confusing NumPy arrays with Python lists: NumPy arrays have a single fixed dtype, support vectorized operations, and are much faster for numerical work. Avoid element-by-element Python loops over NumPy arrays when a vectorized operation exists.
  • Importing without aliases: Use import numpy as np and import pandas as pd. These aliases are near-universal conventions, so code that follows them is easier for others to read.
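
The first mistake often shows up as a silent wrong answer rather than an error, because the same operators mean different things for lists and arrays. A small sketch with made-up values:

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

# On a list, * repeats and + concatenates
print(values * 2)        # [1, 2, 3, 1, 2, 3]
print(values + values)   # [1, 2, 3, 1, 2, 3]

# On an array, the same operators work element by element
print(arr * 2)           # [2 4 6]
print(arr + arr)         # [2 4 6]

# Fixed dtype: a float assigned into an integer array is truncated
arr[0] = 9.7
print(arr)               # [9 2 3]
```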

10. MCQs

Question 1

What does NumPy stand for?

Question 2

What is Pandas' primary data structure?

Question 3

How are NumPy arrays best described?

Question 4

Which is faster for mathematical operations?

Question 5

What is Pandas built on top of?

Question 6

What is the standard alias for Pandas?

Question 7

What is the standard alias for NumPy?

Question 8

What is a DataFrame most like?

Question 9

NumPy is the foundation for which libraries?

Question 10

What is the final stage of the data science workflow?

11. Interview Questions

  • Q: What is the difference between NumPy and Pandas?
  • Q: Why is NumPy faster than Python lists for mathematical operations?

12. Summary

Data science transforms raw data into decisions. NumPy provides fast, vectorized numerical arrays. Pandas provides labeled DataFrames for structured data analysis. Together, they sit at the core of most Python data science workflows, from exploratory analysis to machine learning preprocessing.

13. Next Chapter Recommendation

In Chapter 2: Installing Python, NumPy, and Pandas, we set up a complete data science development environment with Python, Jupyter Notebook, and all required libraries.
