# Data Analysis with Python Basics
Welcome to Chapter 28! Python is one of the most widely used languages for data analysis. In this chapter, you'll learn two of its core libraries, NumPy (numerical computing) and Pandas (data manipulation), and use Matplotlib for basic charts.
---
## 1. Learning Objectives
- Use NumPy for numerical operations.
- Use Pandas for data manipulation.
- Work with DataFrames and Series.
- Read and analyze CSV files.
- Create basic data visualizations.
---
## 2. NumPy Basics
```python id="py28_ex1"
import numpy as np

# Creating arrays
arr = np.array([1, 2, 3, 4, 5])
print(f"Array: {arr}")
print(f"Shape: {arr.shape}")
print(f"Type: {arr.dtype}")

# Array operations (vectorized, no loops needed!)
print(f"Sum: {arr.sum()}")
print(f"Mean: {arr.mean()}")
print(f"Std: {arr.std():.2f}")
print(f"Max: {arr.max()}")

# Element-wise operations
print(f"Doubled: {arr * 2}")
print(f"Squared: {arr ** 2}")

# Creating special arrays
zeros = np.zeros(5)
ones = np.ones((3, 3))
range_arr = np.arange(0, 10, 2)   # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5)   # [0, 0.25, 0.5, 0.75, 1.0]
random_arr = np.random.rand(5)    # 5 random floats in [0, 1)

print(f"Range: {range_arr}")
print(f"Linspace: {linspace}")
```
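To make the "no loops needed" comment concrete, here is a minimal sketch that compares an explicit Python loop with the equivalent vectorized expression and then filters with a boolean mask. The names `prices`, `discounted_loop`, `discounted_vec`, and `expensive` are illustrative assumptions, not part of the chapter's examples.

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0, 40.0])

# Loop version: apply a 10% discount one element at a time
discounted_loop = np.array([p * 0.9 for p in prices])

# Vectorized version: one expression over the whole array
discounted_vec = prices * 0.9

# Both approaches produce the same values
print(np.allclose(discounted_loop, discounted_vec))

# Boolean mask: keep only the prices above 15
expensive = prices[prices > 15]
print(expensive)
```

The vectorized form is usually both shorter and faster, because the per-element work happens inside NumPy rather than in the Python interpreter.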
```python id="py28_ex2"
import numpy as np

# A 2D array (matrix)
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

print(f"Shape: {matrix.shape}")               # (3, 3)
print(f"Element [1,2]: {matrix[1, 2]}")       # 6
print(f"Row 0: {matrix[0]}")                  # [1, 2, 3]
print(f"Col 1: {matrix[:, 1]}")               # [2, 5, 8]
print(f"Sum: {matrix.sum()}")                 # 45
print(f"Row sums: {matrix.sum(axis=1)}")      # [6, 15, 24]
```
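The `axis` argument is easy to mix up, so the following sketch shows both directions on the same 3x3 matrix and adds a small broadcasting step. The names `col_sums`, `row_means`, and `centered` are assumptions chosen for illustration.

```python
import numpy as np

matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

# axis=0 collapses the rows, giving one result per column
col_sums = matrix.sum(axis=0)
print(col_sums)   # [12 15 18]

# axis=1 collapses the columns, giving one result per row
row_means = matrix.mean(axis=1)
print(row_means)  # [2. 5. 8.]

# Broadcasting: subtract each column's mean from every row
centered = matrix - matrix.mean(axis=0)
print(centered)
```

One way to remember it: the axis you pass is the one that disappears from the result's shape.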
```bash
pip install pandas
```
```python id="py28_ex3"
import pandas as pd

# Series (1D labeled array)
grades = pd.Series(
    [85, 92, 78, 95, 88],
    index=["Alice", "Bob", "Charlie", "Diana", "Eve"]
)
print(grades)
print(f"\nMean: {grades.mean():.1f}")
print(f"Max: {grades.max()} ({grades.idxmax()})")
```
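Because a Series carries labels, you can look values up by name and filter without losing track of who each value belongs to. The sketch below reuses the same grades; the names `top` and `curved` are illustrative assumptions.

```python
import pandas as pd

grades = pd.Series(
    [85, 92, 78, 95, 88],
    index=["Alice", "Bob", "Charlie", "Diana", "Eve"],
)

# Label-based lookup with .loc
print(grades.loc["Bob"])  # 92

# Boolean filtering keeps the labels of the matching entries
top = grades[grades >= 90]
print(top)

# Arithmetic is element-wise and the labels are preserved
curved = grades + 5
print(curved.loc["Charlie"])  # 83
```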
```python id="py28_ex4"
import pandas as pd

# Creating a DataFrame from a dict (one key per column)
data = {
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "Age": [25, 30, 28, 22, 27],
    "City": ["NYC", "LA", "Chicago", "NYC", "LA"],
    "Salary": [75000, 85000, 70000, 65000, 90000],
    "Department": ["Engineering", "Marketing", "Engineering", "HR", "Marketing"]
}

df = pd.DataFrame(data)
print(df)
print(f"\nShape: {df.shape}")

print("\nInfo:")
df.info()  # info() prints directly and returns None, so don't wrap it in print()

print("\nStatistics:")
print(df.describe())
```
```python id="py28_ex5"
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "Age": [25, 30, 28, 22, 27],
    "Salary": [75000, 85000, 70000, 65000, 90000],
    "Dept": ["Eng", "Mkt", "Eng", "HR", "Mkt"]
})

# Selecting columns
print(df["Name"])            # Single column (Series)
print(df[["Name", "Age"]])   # Multiple columns (DataFrame)

# Filtering rows with boolean conditions
seniors = df[df["Age"] > 25]
high_salary = df[df["Salary"] >= 80000]
eng_team = df[df["Dept"] == "Eng"]

print("\nSeniors (age > 25):")
print(seniors)

# Sorting
sorted_df = df.sort_values("Salary", ascending=False)
print("\nSorted by Salary:")
print(sorted_df)

# Adding columns
df["Bonus"] = df["Salary"] * 0.1
df["Tax"] = df["Salary"] * 0.2

# Group by
dept_avg = df.groupby("Dept")["Salary"].mean()
print("\nAverage Salary by Department:")
print(dept_avg)
```
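The group-by step above computes a single statistic. As a hedged sketch of how several statistics can be computed in one pass, the example below uses pandas named aggregation; the output column names (`headcount`, `avg_salary`, `max_salary`) are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    "Salary": [75000, 85000, 70000, 65000, 90000],
    "Dept": ["Eng", "Mkt", "Eng", "HR", "Mkt"],
})

# Named aggregation: each keyword becomes an output column
summary = df.groupby("Dept").agg(
    headcount=("Name", "count"),
    avg_salary=("Salary", "mean"),
    max_salary=("Salary", "max"),
)
print(summary)

# Sort the departments by average salary, highest first
ranked = summary.sort_values("avg_salary", ascending=False)
print(ranked)
```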
```python id="py28_ex6"
import pandas as pd

# Create a sample CSV file
sample_data = pd.DataFrame({
    "Date": ["2025-01-01", "2025-01-02", "2025-01-03", "2025-01-04"],
    "Product": ["Laptop", "Phone", "Tablet", "Laptop"],
    "Quantity": [5, 12, 8, 3],
    "Price": [999.99, 699.99, 499.99, 1099.99]
})
sample_data.to_csv("sales.csv", index=False)

# Read the CSV back into a DataFrame
df = pd.read_csv("sales.csv")
print(df.head())

# Quick analysis
print(f"\nTotal Revenue: ${(df['Quantity'] * df['Price']).sum():,.2f}")
print(f"Average Price: ${df['Price'].mean():,.2f}")
print(f"Most Sold: {df.loc[df['Quantity'].idxmax(), 'Product']}")
```
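By default `read_csv` leaves date columns as plain strings. The sketch below shows one way to parse them on load and then break revenue down by product. It reads the same rows from an in-memory string (via `io.StringIO`) to stay self-contained; `csv_text`, `sales`, and `revenue_by_product` are assumed names.

```python
import io
import pandas as pd

# Same rows as the sample sales data, held in memory for this sketch
csv_text = """Date,Product,Quantity,Price
2025-01-01,Laptop,5,999.99
2025-01-02,Phone,12,699.99
2025-01-03,Tablet,8,499.99
2025-01-04,Laptop,3,1099.99
"""

# parse_dates converts the Date column to datetime64 instead of strings
sales = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(sales.dtypes)

# Revenue per row, then the total per product
sales["Revenue"] = sales["Quantity"] * sales["Price"]
revenue_by_product = sales.groupby("Product")["Revenue"].sum()
print(revenue_by_product.round(2))
```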
```python id="py28_ex7"
import pandas as pd
import numpy as np

# Sample data with issues
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Diana", "Eve"],
    "Age": [25, np.nan, 28, 22, 27],
    "Score": [85, 92, 78, np.nan, 88]
})

print("Before cleaning:")
print(df)
print(f"\nMissing values:\n{df.isnull().sum()}")

# Fill missing values. Assign the result back to the column: calling
# fillna(..., inplace=True) on df["col"] triggers a warning in recent pandas
# and may not update df at all.
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Name"] = df["Name"].fillna("Unknown")

# Drop rows that still contain NaN (here, the row with a missing Score)
df_clean = df.dropna()

print("\nAfter cleaning:")
print(df_clean)
```
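Filling with the mean is only one strategy. For ordered data such as sensor readings, interpolation is often a better fit. The following is a minimal sketch on made-up values; `readings`, `filled`, and `median_filled` are illustrative names, not data from the chapter.

```python
import numpy as np
import pandas as pd

readings = pd.Series([10.0, np.nan, 14.0, np.nan, np.nan, 20.0])

# Linear interpolation fills each gap from its neighbors
filled = readings.interpolate()
print(filled.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

# Filling with the median is an alternative when order carries no meaning
median_filled = readings.fillna(readings.median())
print(median_filled.tolist())
```

Which approach is appropriate depends on the data: interpolation assumes neighboring values are related, while a median fill does not.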
```python id="py28_ex8"
# Note: Install matplotlib first: pip install matplotlib
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend (saves files, opens no window)
import matplotlib.pyplot as plt

# Sales data
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [12000, 15000, 13500, 17000, 19000, 22000]

# Line chart
plt.figure(figsize=(10, 5))
plt.plot(months, sales, marker='o', color='#3498db', linewidth=2)
plt.title("Monthly Sales 2025", fontsize=16)
plt.xlabel("Month")
plt.ylabel("Sales ($)")
plt.grid(True, alpha=0.3)
plt.savefig("saleschart.png", dpi=100, bbox_inches='tight')
plt.close()
print("📊 Chart saved as saleschart.png")

# Bar chart
departments = ["Engineering", "Marketing", "HR", "Sales"]
headcount = [45, 30, 15, 25]

plt.figure(figsize=(8, 5))
plt.bar(departments, headcount, color=['#3498db', '#e74c3c', '#2ecc71', '#f39c12'])
plt.title("Department Headcount")
plt.ylabel("Employees")
plt.savefig("departments.png", dpi=100, bbox_inches='tight')
plt.close()
print("📊 Chart saved as departments.png")
```
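The charts above plot hand-typed lists. In practice the values often come from a DataFrame, so here is a hedged sketch that aggregates with Pandas first and then passes the result to Matplotlib. The output filename `avg_salary.png` and the variable names are assumptions made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend, as in the example above
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Dept": ["Eng", "Mkt", "Eng", "HR", "Mkt"],
    "Salary": [75000, 85000, 70000, 65000, 90000],
})

# Aggregate first, then hand the labels and values to Matplotlib
avg_salary = df.groupby("Dept")["Salary"].mean()

fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(avg_salary.index, avg_salary.values, color="#3498db")
ax.set_title("Average Salary by Department")
ax.set_ylabel("Salary ($)")
fig.savefig("avg_salary.png", dpi=100, bbox_inches="tight")
plt.close(fig)
```

This version uses the `fig, ax` object interface rather than the `plt.*` functions; both draw the same kind of chart, and the object interface tends to scale better once a script produces several figures.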
---
## 8. MCQs with Answers
Q1: NumPy arrays are:
A) Slower than lists B) Faster than lists C) Same speed D) Only for strings
Answer: B. For numerical operations on large arrays, NumPy runs its loops in optimized C code.
Q2: df.head() shows:
A) Last 5 rows B) First 5 rows C) All rows D) Column names
Answer: B
Q3: df.describe() provides:
A) Column names B) Statistical summary C) Data types D) Missing values
Answer: B
Q4: df.groupby() does:
A) Sorts data B) Groups and aggregates C) Filters data D) Merges data
Answer: B
Q5: pd.read_csv() returns:
A) List B) Dict C) DataFrame D) Array
Answer: C
Q6: df.isnull().sum() counts:
A) Rows B) Columns C) Missing values per column D) Total cells
Answer: C
Q7: NumPy np.zeros((3,3)) creates:
A) 3x3 of ones B) 3x3 of zeros C) 1D array D) Error
Answer: B
Q8: df.sort_values("col") sorts by:
A) Index B) Column values C) Data type D) Memory
Answer: B
Q9: fillna() does:
A) Drops missing B) Fills missing values C) Finds missing D) Counts missing
Answer: B
Q10: A Pandas Series is:
A) 2D B) 1D labeled array C) Dict D) Matrix
Answer: B
---
## 9. Interview Questions
- 1. NumPy vs Python lists? NumPy arrays are faster for numerical work (C-optimized), support vectorized operations, use less memory, and hold elements of a single fixed type.
- 2. What is a DataFrame? 2D labeled data structure with rows and columns (like a spreadsheet or SQL table).
- 3. How to handle missing data? `dropna()`, `fillna()`, `interpolate()`. The choice depends on context.
- 4. Series vs DataFrame? Series is 1D; DataFrame is 2D. A DataFrame is a collection of Series.
- 5. How to merge DataFrames? `pd.merge()` (SQL-like joins), `pd.concat()` (stacking), `df.join()`.
---
## 10. Summary
- NumPy provides fast numerical computing with arrays.
- Pandas provides DataFrames for data manipulation and analysis.
- Key Pandas operations: `head()`, `describe()`, `groupby()`, `sort_values()`, `merge()`.
- Handle missing data with `dropna()` and `fillna()`.
- Visualize data with Matplotlib.
---
## 11. Next Chapter Recommendation
In Chapter 29: Python Interview Preparation, you'll prepare for technical interviews with 50 questions and 20 coding exercises! 🚀