Scikit-learn Basics
CHAPTER 15 Intermediate

Dimensionality Reduction with PCA

Updated: May 16, 2026

1. Introduction

Modern datasets can be very wide. A 100x100 pixel grayscale image already has 10,000 features. Training a model on 10,000 features costs a lot of time and memory and often leads to severe overfitting, one symptom of the "Curse of Dimensionality." What if we could compress those 10,000 features down to just 50 while keeping most of the core information? That is the job of Principal Component Analysis (PCA). In this chapter, we will learn how to compress data intelligently.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Explain the "Curse of Dimensionality".
  • Understand the mathematical intuition behind PCA.
  • Define "Variance" in the context of machine learning.
  • Implement PCA using Scikit-learn.
  • Use PCA to compress data and speed up model training.

3. The Curse of Dimensionality

As you add more features (dimensions) to your dataset, the amount of data you need to keep the model from overfitting grows very quickly: covering the feature space at the same density takes exponentially more samples with every added dimension. Furthermore, we cannot directly plot data in more than 3 dimensions. Dimensionality reduction addresses both problems by finding a smaller set of new variables that retain most of the information in the original features.
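
To make the "exponentially more samples" claim concrete, here is a minimal sketch. It assumes we split every feature into 10 bins and scatter 1,000 random points; the names `n_points`, `bins`, and `coverage` are illustrative only. It reports what fraction of the resulting grid cells contains at least one point.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, bins = 1000, 10

coverage = {}
for n_dims in (1, 2, 3, 6):
    # Uniform random points in the unit hypercube [0, 1)^n_dims
    points = rng.random((n_points, n_dims))
    # Index of the bin each point falls into along every axis (0..9)
    cells = np.floor(points * bins).astype(int)
    occupied = len(np.unique(cells, axis=0))
    total_cells = bins ** n_dims
    coverage[n_dims] = occupied / total_cells
    print(f"{n_dims} dims: {total_cells:>9,} cells, "
          f"{coverage[n_dims] * 100:.2f}% contain data")
```

The same 1,000 points that saturate a 1D or 2D grid can touch at most 0.1% of a 6D grid, because the number of cells is `bins ** n_dims`.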

4. How PCA Works

PCA is an unsupervised algorithm: it never looks at the labels. It looks at the dataset and asks: "Which direction contains the most variance (spread of data)?" Imagine a 3D cloud of data points shaped like a flat pancake.
  • You don't actually need 3 dimensions to describe a flat pancake; you can describe it in 2 dimensions (length and width) because it has almost no thickness.
  • PCA rotates the axes to align with the "length" and "width" of the pancake. These new axes are called Principal Components.
  • It then drops the "thickness" dimension because it contains very little variance (information). We just successfully compressed 3D data to 2D!
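
The pancake picture above can be sketched in code. This is an illustrative example on synthetic data: the scales (5.0, 3.0, 0.1) and the 30-degree tilt are arbitrary assumptions chosen so that the pancake is not aligned with the original axes.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n = 500

# A flat "pancake": wide in two directions, almost no thickness
flat = np.column_stack([
    rng.normal(scale=5.0, size=n),   # length
    rng.normal(scale=3.0, size=n),   # width
    rng.normal(scale=0.1, size=n),   # thickness
])

# Tilt the pancake so no original axis lines up with it
theta = np.pi / 6
rotation = np.array([
    [np.cos(theta), 0.0, np.sin(theta)],
    [0.0,           1.0, 0.0],
    [-np.sin(theta), 0.0, np.cos(theta)],
])
pancake = flat @ rotation.T

# PCA finds the rotated axes again; the third one carries almost no variance
pca_3d = PCA(n_components=3).fit(pancake)
print(np.round(pca_3d.explained_variance_ratio_, 4))

# Keeping 2 components drops the "thickness" direction
pancake_2d = PCA(n_components=2).fit_transform(pancake)
print(pancake_2d.shape)
```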

5. Variance is Information

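In PCA, Variance = Information. If a feature does not vary (e.g., a column where everyone is exactly 30 years old), it tells the model nothing useful. PCA does not pick individual features, though. It searches for the directions in feature space (weighted combinations of the original features) along which the data varies the most, and those directions are the "Principal Components."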
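
As a rough check of this idea, the sketch below compares the variance along the first principal component with the variance along 1,000 random directions and along the single most variable feature. It uses the breast cancer dataset that appears later in this chapter; the random directions and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Variance of the data along the first principal component
pca = PCA(n_components=1).fit(X_scaled)
pc1_variance = pca.explained_variance_[0]

# Variance of the data along 1,000 random unit-length directions
rng = np.random.default_rng(0)
directions = rng.normal(size=(1000, X_scaled.shape[1]))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
random_variances = np.var(X_scaled @ directions.T, axis=0, ddof=1)

# Variance of the single most variable (scaled) feature
best_single_feature = np.var(X_scaled, axis=0, ddof=1).max()

print(f"Variance along PC1:              {pc1_variance:.2f}")
print(f"Best of 1,000 random directions: {random_variances.max():.2f}")
print(f"Best single scaled feature:      {best_single_feature:.2f}")
```

No random direction and no single feature captures as much variance as the first principal component, which is exactly what PCA optimizes for.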

6. Implementing PCA in Scikit-learn

Let's compress a dataset with 30 features down to just 2 features so we can plot it on a 2D graph!
python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

# 1. Load Data (30 features)
X, y = load_breast_cancer(return_X_y=True)
print(f"Original shape: {X.shape}") # (569, 30)

# 2. IMPORTANT: PCA centers the data but does not rescale it, so standardize first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Initialize PCA (We want 2 components)
pca = PCA(n_components=2)

# 4. Fit and Transform
X_pca = pca.fit_transform(X_scaled)

print(f"Compressed shape: {X_pca.shape}") # (569, 2)
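
To see why the scaling step matters, this short sketch fits PCA twice on the same dataset, once on the raw features and once after StandardScaler. The variable names `ratio_raw` and `ratio_scaled` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# PCA on the raw features: large-valued columns dominate the variance
ratio_raw = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA on standardized features: every column starts with unit variance
X_scaled = StandardScaler().fit_transform(X)
ratio_scaled = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print(f"Raw data:    PC1 = {ratio_raw[0] * 100:.1f}%, PC2 = {ratio_raw[1] * 100:.1f}%")
print(f"Scaled data: PC1 = {ratio_scaled[0] * 100:.1f}%, PC2 = {ratio_scaled[1] * 100:.1f}%")
```

On the raw data, the first component appears to explain nearly everything, but that mostly reflects which columns have the largest units rather than real structure.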

7. Explained Variance Ratio

We just replaced 30 features with 2! Did we lose too much information? We can check the explained_variance_ratio_ attribute.
python
variance = pca.explained_variance_ratio_
print(f"Component 1 explains: {variance[0]*100:.2f}% of the variance")
print(f"Component 2 explains: {variance[1]*100:.2f}% of the variance")
print(f"Total variance retained: {np.sum(variance)*100:.2f}%")

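*If 2 components retained 95% of the variance, we would have cut the number of features by 93% (30 down to 2) while keeping almost all the critical information. On this dataset, 2 components retain roughly 63% of the variance, so the 2D view is a useful summary but not a lossless one.*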
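
A common way to judge the trade-off is to look at the cumulative explained variance across all components. The sketch below is one way to compute it with `np.cumsum`; the names `pca_full`, `cumulative`, and `n_for_95` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# With no n_components argument, PCA keeps every component
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for k in (1, 2, 5, 10, 30):
    print(f"First {k:>2} components retain {cumulative[k - 1] * 100:.1f}% of the variance")

# Smallest number of components whose cumulative share reaches 95%
n_for_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(f"Components needed for 95%: {n_for_95}")
```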

8. Choosing the Right Number of Components

Instead of manually guessing n_components=2, you can pass a fraction between 0 and 1, and Scikit-learn will keep the smallest number of components that retains at least that share of the variance.
python
# Tell PCA to keep at least 95% of the variance
pca_auto = PCA(n_components=0.95)
X_auto = pca_auto.fit_transform(X_scaled)

# Let's see how many components it took to reach 95%
print(f"Components needed for 95% variance: {pca_auto.n_components_}")
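
One way to use this in practice is to place the scaler, PCA, and a classifier in a single pipeline, so the compression is learned only from the training folds. This is a sketch, not a tuned model: the choice of LogisticRegression, `max_iter=1000`, and 5-fold cross-validation are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scale -> compress to 95% of the variance -> classify
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=1000),
)

# Each fold refits the scaler and PCA on its own training split
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean cross-validated accuracy: {scores.mean():.3f}")
```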

9. Common Mistakes

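  • Forgetting to Scale: If one feature is measured in thousands and another in decimals, the large-valued feature dominates the variance, and PCA builds its first components almost entirely from it. Standardize the features (for example with StandardScaler) before PCA whenever they are on different scales.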
  • Losing Interpretability: Once you transform data with PCA, the new columns are no longer "Age" or "Income". They are "Principal Component 1" and "Principal Component 2"—mathematical mashups of the original features. You cannot easily explain *why* the model made a prediction to a business stakeholder.
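
Some interpretability can be recovered by inspecting the weights (loadings) stored in `pca.components_`, where each row describes one principal component as a combination of the original features. The sketch below prints the five largest weights per component; showing the top five is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
pca = PCA(n_components=2).fit(X_scaled)

# components_ has one row per component and one column per original feature
for i, component in enumerate(pca.components_, start=1):
    top = np.argsort(np.abs(component))[::-1][:5]   # largest absolute weights
    print(f"Principal Component {i}:")
    for j in top:
        print(f"  {data.feature_names[j]:<25} weight = {component[j]:+.3f}")
```

The weights show which original features drive each component, although a component built from 30 weighted features is still harder to explain than a single column such as "Age".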

10. Best Practices

  • Use PCA for Image Data: PCA is heavily used in image processing to compress images into a small number of components before feeding them to classifiers. In facial recognition, the principal components of a set of face images are known as Eigenfaces.
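
As a small-scale illustration of image compression, the sketch below uses Scikit-learn's bundled 8x8 digits images (not the high-resolution photos mentioned above). Keeping 16 components is an arbitrary assumption, and scaling is skipped here because all pixels already share the same 0 to 16 range.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# 1,797 images of 8x8 pixels, flattened to 64 features each
X_digits, _ = load_digits(return_X_y=True)

# Compress 64 pixel values down to 16 numbers per image
pca_img = PCA(n_components=16)
X_small = pca_img.fit_transform(X_digits)

# Map the compressed data back to pixel space (a lossy reconstruction)
X_restored = pca_img.inverse_transform(X_small)

retained = pca_img.explained_variance_ratio_.sum()
mse = np.mean((X_digits - X_restored) ** 2)

print(f"Compressed shape: {X_small.shape}")
print(f"Variance retained: {retained * 100:.1f}%")
print(f"Mean squared reconstruction error: {mse:.3f}")
```

`inverse_transform` gives an approximation of the original images, so comparing `X_restored` with `X_digits` shows what the discarded components cost.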

11. Exercises

  1. If you have a dataset with 50 features and you apply PCA(n_components=0.99), what are you instructing the algorithm to do?
  2. Why does PCA make it harder to explain a model's prediction to non-technical stakeholders?

12. MCQ Quiz with Answers

Question 1

What is the primary purpose of Principal Component Analysis (PCA)?

Question 2

Which preprocessing step MUST be performed before applying PCA?

13. Interview Questions

  • Q: Explain the "Curse of Dimensionality" and how PCA helps solve it.
  • Q: What does explained_variance_ratio_ tell you about your PCA transformation?

14. FAQs

Q: Can I use PCA on categorical text data?
A: Not directly. PCA relies on variance and covariance matrices, so it is designed for continuous numerical data. Categorical data (even if One-Hot Encoded) usually requires different dimensionality reduction techniques like Multiple Correspondence Analysis (MCA).

15. Summary

PCA is one of the most widely used dimensionality reduction techniques in Machine Learning. By rotating our dataset onto the axes with the highest variance and discarding the low-variance ones, we can remove redundancy, speed up our training times, and visualize high-dimensional datasets in 2D or 3D space.

16. Next Chapter Recommendation

We have trained a variety of models, but how do we definitively prove which one is best? "Accuracy" is not always enough. In Chapter 16: Model Evaluation Metrics, we will learn how to read Confusion Matrices, F1 Scores, and ROC curves to judge our models like a professional.
