Python for Beginners
CHAPTER 28 Beginner

Updated: May 17, 2026
30 min read

# Data Analysis with Python Basics

Welcome to Chapter 28! Python is one of the most widely used languages for data analysis. In this chapter, you'll learn its two core data libraries, NumPy (numerical computing) and Pandas (data manipulation), and finish with basic charts in Matplotlib.

---

## 1. Learning Objectives

  • Use NumPy for numerical operations.
  • Use Pandas for data manipulation.
  • Work with DataFrames and Series.
  • Read and analyze CSV files.
  • Create basic data visualizations.

---

## 2. NumPy Basics

```bash
pip install numpy
```

```python id="py28ex1"
import numpy as np

# Creating arrays
arr = np.array([1, 2, 3, 4, 5])
print(f"Array: {arr}")
print(f"Shape: {arr.shape}")
print(f"Type: {arr.dtype}")

# Array operations (vectorized: no loops needed!)
print(f"Sum: {arr.sum()}")
print(f"Mean: {arr.mean()}")
print(f"Std: {arr.std():.2f}")
print(f"Max: {arr.max()}")

# Element-wise operations
print(f"Doubled: {arr * 2}")
print(f"Squared: {arr ** 2}")

# Creating special arrays
zeros = np.zeros(5)
ones = np.ones((3, 3))
range_arr = np.arange(0, 10, 2)    # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5)    # [0, 0.25, 0.5, 0.75, 1.0]
random_arr = np.random.rand(5)     # 5 random floats

print(f"Range: {range_arr}")
print(f"Linspace: {linspace}")
```
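To make the "no loops needed" point concrete, here is a minimal sketch that compares an explicit Python loop with the equivalent vectorized NumPy expression. The `prices` values are made up for illustration.

```python
import numpy as np

prices = [19.99, 5.49, 102.00, 7.25]  # hypothetical sample values

# Pure-Python approach: visit every element explicitly
doubled_loop = [p * 2 for p in prices]

# NumPy approach: one expression applies to every element
price_arr = np.array(prices)
doubled_vec = price_arr * 2

# Both approaches produce the same numbers
print(np.allclose(doubled_loop, doubled_vec))  # True

# Pitfall: `*` on a plain list repeats the list instead of multiplying
print(len(prices * 2))  # 8
```

The vectorized form is shorter, and on large arrays it is usually much faster because the loop runs inside NumPy's compiled code.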
### 2D Arrays (Matrices)

```python id="py28_ex2"
import numpy as np

matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

print(f"Shape: {matrix.shape}")             # (3, 3)
print(f"Element [1,2]: {matrix[1, 2]}")     # 6
print(f"Row 0: {matrix[0]}")                # [1, 2, 3]
print(f"Col 1: {matrix[:, 1]}")             # [2, 5, 8]
print(f"Sum: {matrix.sum()}")               # 45
print(f"Row sums: {matrix.sum(axis=1)}")    # [6, 15, 24]
```
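The `axis` argument is a common source of confusion, so here is a small sketch that reuses the same 3x3 matrix. It shows that `axis=0` collapses the rows (one result per column) and `axis=1` collapses the columns (one result per row).

```python
import numpy as np

matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

col_sums = matrix.sum(axis=0)    # collapse rows: one value per column
row_sums = matrix.sum(axis=1)    # collapse columns: one value per row
col_means = matrix.mean(axis=0)  # the same rule applies to mean, max, min, ...

print(col_sums.tolist())   # [12, 15, 18]
print(row_sums.tolist())   # [6, 15, 24]
print(col_means.tolist())  # [4.0, 5.0, 6.0]
```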
---

## 3. Pandas Basics

```bash
pip install pandas
```

```python id="py28_ex3"
import pandas as pd

# Series (1D labeled array)
grades = pd.Series(
    [85, 92, 78, 95, 88],
    index=["Alice", "Bob", "Charlie", "Diana", "Eve"],
)
print(grades)
print(f"\nMean: {grades.mean():.1f}")
print(f"Max: {grades.max()} ({grades.idxmax()})")
```
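Because a Series is labeled, its values can be reached by label, by position, or through a boolean condition. The sketch below reuses the `grades` Series from above to show all three.

```python
import pandas as pd

grades = pd.Series(
    [85, 92, 78, 95, 88],
    index=["Alice", "Bob", "Charlie", "Diana", "Eve"],
)

by_label = grades.loc["Charlie"]     # look up by index label
by_position = grades.iloc[2]         # look up by integer position
top_students = grades[grades >= 90]  # keep entries where the condition holds

print(by_label, by_position)         # 78 78
print(list(top_students.index))      # ['Bob', 'Diana']
```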
### DataFrames (2D labeled table)

```python id="py28_ex4"
import pandas as pd

# Creating a DataFrame from a dict (one key per column)
data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "Age": [25, 30, 28, 22, 27],
    "City": ["NYC", "LA", "Chicago", "NYC", "LA"],
    "Salary": [75000, 85000, 70000, 65000, 90000],
    "Department": ["Engineering", "Marketing", "Engineering", "HR", "Marketing"],
}

df = pd.DataFrame(data)
print(df)
print(f"\nShape: {df.shape}")
print("\nInfo:")
df.info()  # info() prints its report itself and returns None
print("\nStatistics:")
print(df.describe())
```
---

## 4. DataFrame Operations

```python id="py28ex5"
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "Age": [25, 30, 28, 22, 27],
    "Salary": [75000, 85000, 70000, 65000, 90000],
    "Dept": ["Eng", "Mkt", "Eng", "HR", "Mkt"],
})

# Selecting columns
print(df["Name"])            # Single column (Series)
print(df[["Name", "Age"]])   # Multiple columns (DataFrame)

# Filtering rows
seniors = df[df["Age"] > 25]
high_salary = df[df["Salary"] >= 80000]
eng_team = df[df["Dept"] == "Eng"]

print("\nSeniors (age > 25):")
print(seniors)

# Sorting
sorted_df = df.sort_values("Salary", ascending=False)
print("\nSorted by Salary:")
print(sorted_df)

# Adding columns
df["Bonus"] = df["Salary"] * 0.1
df["Tax"] = df["Salary"] * 0.2

# Group by
dept_avg = df.groupby("Dept")["Salary"].mean()
print("\nAverage Salary by Department:")
print(dept_avg)
```
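A group-by is not limited to one statistic. As a small extension of the example above, this sketch uses `agg()` to compute several summaries per department in a single call, on the same sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "Salary": [75000, 85000, 70000, 65000, 90000],
    "Dept": ["Eng", "Mkt", "Eng", "HR", "Mkt"],
})

# One row per department, one column per aggregation
summary = df.groupby("Dept")["Salary"].agg(["mean", "min", "max", "count"])
print(summary)

# Individual cells can be read with .loc[row_label, column_label]
print(summary.loc["Eng", "mean"])  # 72500.0
```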
---

## 5. Reading CSV Files

```python id="py28ex6"
import pandas as pd

# Create a sample CSV file
sample_data = pd.DataFrame({
    "Date": ["2025-01-01", "2025-01-02", "2025-01-03", "2025-01-04"],
    "Product": ["Laptop", "Phone", "Tablet", "Laptop"],
    "Quantity": [5, 12, 8, 3],
    "Price": [999.99, 699.99, 499.99, 1099.99],
})
sample_data.to_csv("sales.csv", index=False)

# Read the CSV back into a DataFrame
df = pd.read_csv("sales.csv")
print(df.head())

# Quick analysis
print(f"\nTotal Revenue: ${(df['Quantity'] * df['Price']).sum():,.2f}")
print(f"Average Price: ${df['Price'].mean():,.2f}")
print(f"Most Sold: {df.loc[df['Quantity'].idxmax(), 'Product']}")
```
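By default `read_csv` loads date columns as plain strings. The sketch below shows one way to parse them on load with the `parse_dates` parameter, then totals revenue per product. It reads the same sample rows from an in-memory string (via `io.StringIO`) so it runs without creating a file. The `Revenue` column is added here for illustration.

```python
import io
import pandas as pd

csv_text = """Date,Product,Quantity,Price
2025-01-01,Laptop,5,999.99
2025-01-02,Phone,12,699.99
2025-01-03,Tablet,8,499.99
2025-01-04,Laptop,3,1099.99
"""

# parse_dates converts the named column to datetime values while loading
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(df.dtypes)

# Revenue per row, then summed per product
df["Revenue"] = df["Quantity"] * df["Price"]
revenue_by_product = df.groupby("Product")["Revenue"].sum()
print(revenue_by_product.round(2))
print(f"Top product by revenue: {revenue_by_product.idxmax()}")
```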
---

## 6. Data Cleaning

```python id="py28ex7"
import pandas as pd
import numpy as np

# Sample data with issues
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Diana", "Eve"],
    "Age": [25, np.nan, 28, 22, 27],
    "Score": [85, 92, 78, np.nan, 88],
})

print("Before cleaning:")
print(df)
print(f"\nMissing values:\n{df.isnull().sum()}")

# Fill missing values (assign the result back; calling fillna with
# inplace=True on a selected column is unreliable in recent pandas versions)
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Name"] = df["Name"].fillna("Unknown")

# Drop rows that still contain any NaN (here: the row with a missing Score)
df_clean = df.dropna()

print("\nAfter cleaning:")
print(df_clean)
```
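Dropping and filling are not the only options: for ordered numeric data, `interpolate()` estimates each gap from its neighbours. The following sketch compares the three strategies on a small series. The `temps` readings are hypothetical.

```python
import numpy as np
import pandas as pd

temps = pd.Series([20.0, np.nan, 24.0, np.nan, 28.0])  # hypothetical readings

dropped = temps.dropna()                  # remove the gaps entirely
filled_mean = temps.fillna(temps.mean())  # replace each gap with the mean
interpolated = temps.interpolate()        # linear estimate between neighbours

print(dropped.tolist())       # [20.0, 24.0, 28.0]
print(filled_mean.tolist())   # [20.0, 24.0, 24.0, 24.0, 28.0]
print(interpolated.tolist())  # [20.0, 22.0, 24.0, 26.0, 28.0]
```

Which strategy fits depends on the data: interpolation suits smoothly changing measurements, while dropping rows is safer when a guessed value could mislead.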
---

## 7. Basic Visualization

```python id="py28ex8"
# Note: Install matplotlib: pip install matplotlib
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend (saves files, opens no window)
import matplotlib.pyplot as plt

# Sales data
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [12000, 15000, 13500, 17000, 19000, 22000]

# Line chart
plt.figure(figsize=(10, 5))
plt.plot(months, sales, marker='o', color='#3498db', linewidth=2)
plt.title("Monthly Sales 2025", fontsize=16)
plt.xlabel("Month")
plt.ylabel("Sales ($)")
plt.grid(True, alpha=0.3)
plt.savefig("sales_chart.png", dpi=100, bbox_inches='tight')
plt.close()
print("📊 Chart saved as sales_chart.png")

# Bar chart
departments = ["Engineering", "Marketing", "HR", "Sales"]
headcount = [45, 30, 15, 25]

plt.figure(figsize=(8, 5))
plt.bar(departments, headcount, color=['#3498db', '#e74c3c', '#2ecc71', '#f39c12'])
plt.title("Department Headcount")
plt.ylabel("Employees")
plt.savefig("departments.png", dpi=100, bbox_inches='tight')
plt.close()
print("📊 Chart saved as departments.png")
```

---

## 8. MCQs with Answers

Q1: For numerical operations, NumPy arrays are: A) Slower than lists B) Faster than lists C) Same speed D) Only for strings Answer: B. NumPy runs vectorized operations in optimized C code.

Q2: df.head() shows: A) Last 5 rows B) First 5 rows C) All rows D) Column names Answer: B

Q3: df.describe() provides: A) Column names B) Statistical summary C) Data types D) Missing values Answer: B

Q4: df.groupby() does: A) Sorts data B) Groups and aggregates C) Filters data D) Merges data Answer: B

Q5: pd.read_csv() returns: A) List B) Dict C) DataFrame D) Array Answer: C

Q6: df.isnull().sum() counts: A) Rows B) Columns C) Missing values per column D) Total cells Answer: C

Q7: NumPy np.zeros((3,3)) creates: A) 3x3 of ones B) 3x3 of zeros C) 1D array D) Error Answer: B

Q8: df.sort_values("col") sorts by: A) Index B) Column values C) Data type D) Memory Answer: B

Q9: fillna() does: A) Drops missing B) Fills missing values C) Finds missing D) Counts missing Answer: B

Q10: Pandas Series is: A) 2D B) 1D labeled array C) Dict D) Matrix Answer: B

---

## 9. Interview Questions

  1. NumPy vs Python lists? NumPy arrays are faster for numerical work (C-optimized), support vectorized operations, use less memory, and hold a single fixed data type.
  2. What is a DataFrame? A 2D labeled data structure with rows and columns (like a spreadsheet or SQL table).
  3. How to handle missing data? Use dropna(), fillna(), or interpolate(). The right choice depends on context.
  4. Series vs DataFrame? A Series is 1D; a DataFrame is 2D. A DataFrame is a collection of Series that share one index.
  5. How to merge DataFrames? Use pd.merge() (SQL-like joins), pd.concat() (stacking), or df.join() (joining on the index).
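
The difference between joining and stacking in the last answer is easier to see in code. This is a minimal sketch on made-up tables: the `employees`, `departments`, and `new_hires` frames and the `DeptId` key are assumptions for illustration.

```python
import pandas as pd

# Hypothetical tables that share a "DeptId" key column
employees = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "DeptId": [1, 2, 1]})
departments = pd.DataFrame({"DeptId": [1, 2], "Dept": ["Eng", "Mkt"]})

# pd.merge: SQL-style join that adds columns by matching key values
merged = pd.merge(employees, departments, on="DeptId", how="left")
print(merged)

# pd.concat: stacks tables with the same columns on top of each other
new_hires = pd.DataFrame({"Name": ["Diana"], "DeptId": [2]})
stacked = pd.concat([employees, new_hires], ignore_index=True)
print(stacked)
```

In short, `merge` makes a table wider by bringing in columns from another table, while `concat` (along the default axis) makes it longer by appending rows.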

---

## 10. Summary

  • NumPy provides fast numerical computing with arrays.
  • Pandas provides DataFrames for data manipulation and analysis.
  • Key Pandas operations: head(), describe(), groupby(), sort_values(), merge().
  • Handle missing data with dropna() and fillna().
  • Visualize data with Matplotlib.

---

## 11. Next Chapter Recommendation

In Chapter 29: Python Interview Preparation, you'll prepare for technical interviews with 50 questions and 20 coding exercises! 🚀
