CHAPTER 11
Intermediate
Image Classification with CNNs in PyTorch
Updated: May 16, 2026
1. Introduction
If you feed a high-resolution 4K image into the standard nn.Linear (Dense) neural network we built previously, the first layer needs a separate weight for every pixel value, for every one of its neurons. With even a few hundred neurons, that adds up to billions of parameters in a single layer. The network will likely run out of memory and, with so many parameters, overfit easily. Furthermore, Dense layers don't understand spatial relationships (e.g., that an eye is usually above a nose), because they treat the image as a flat list of numbers. To solve Computer Vision problems like this, researchers developed the Convolutional Neural Network (CNN). In this chapter, we build the kind of architecture that allows cars to see.
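To make the scale concrete, here is a rough back-of-the-envelope count. The 3840x2160 resolution, the 1,000 hidden units, and the 16 filters are assumptions chosen for illustration; the chapter does not fix these numbers.

```python
# Rough weight count: dense layer vs. convolutional layer on one 4K RGB image.
# Assumed for illustration: 3840x2160 pixels, 3 channels, 1,000 hidden units.
width, height, channels = 3840, 2160, 3
hidden_units = 1_000

inputs = width * height * channels      # one input value per pixel per channel
dense_weights = inputs * hidden_units   # every input connects to every hidden unit

# A conv layer with 16 filters of size 3x3 reuses the same small weights
# at every image position, so its size does not depend on the resolution.
conv_weights = 16 * channels * 3 * 3

print(f"Inputs per image:    {inputs:,}")
print(f"Dense layer weights: {dense_weights:,}")
print(f"Conv layer weights:  {conv_weights:,}")
```

Biases are left out to keep the arithmetic simple; they do not change the picture.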
2. Learning Objectives
By the end of this chapter, you will be able to:
- Explain why standard Linear networks fail on complex images.
- Understand the Convolution operation (Filters).
- Understand MaxPooling and feature compression.
- Build a CNN architecture using nn.Conv2d and nn.MaxPool2d.
- Understand PyTorch's Image Tensor Shape format [Batch, Channels, Height, Width].
3. How Convolutions Work
Instead of looking at the entire image at once, a CNN uses a Filter (a small grid of weights, typically 3x3).
- 1. The 3x3 filter slides (convolves) across the image, one position at a time, scanning it like a flashlight.
- 2. Each filter responds to a specific feature (like a horizontal edge, a vertical edge, or a curve). Its weights are not designed by hand; the network learns them during training.
- 3. Wherever the filter finds its feature, it "lights up" (produces a large value). The grid of these responses is called a Feature Map.
- 4. Each convolution layer applies many filters (often dozens) in parallel, producing one Feature Map per filter. The first layer detects simple edges. Deeper layers combine those edges to detect complex shapes (like a dog's ear).
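The sliding-filter idea can be sketched with a single hand-written filter. This is only an illustration: the tiny 5x5 image and the edge filter values are made up, and in a real CNN the filter weights are learned rather than typed in.

```python
import torch
import torch.nn.functional as F

# A tiny 1-channel 5x5 "image": dark top rows (0s), bright bottom rows (1s).
image = torch.tensor([
    [0., 0., 0., 0., 0.],
    [0., 0., 0., 0., 0.],
    [1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1.],
    [1., 1., 1., 1., 1.],
]).reshape(1, 1, 5, 5)  # [Batch, Channels, Height, Width]

# Hand-written 3x3 filter that responds to "dark above, bright below".
horizontal_edge = torch.tensor([
    [-1., -1., -1.],
    [ 0.,  0.,  0.],
    [ 1.,  1.,  1.],
]).reshape(1, 1, 3, 3)  # [OutChannels, InChannels, KernelHeight, KernelWidth]

# Slide the filter over the image (stride 1, no padding).
feature_map = F.conv2d(image, horizontal_edge)

print(feature_map.shape)   # a 5x5 input and a 3x3 filter give a 3x3 feature map
print(feature_map[0, 0])   # large values where the edge is, zeros in the flat region
```

The top two rows of the feature map "light up" because the filter window straddles the dark-to-bright edge there, while the bottom row is zero because the window covers only bright pixels.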
4. MaxPooling (Compression)
After a Convolution layer finds features, the feature maps are still large. We use MaxPooling to shrink them. An nn.MaxPool2d layer with a 2x2 window looks at each 2x2 block of values and keeps only the maximum (brightest) one, discarding the other three. This halves both the height and the width, so only a quarter of the values remain. The strongest feature responses are kept, and the computational requirements drop drastically.
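A small sketch of this compression, using a made-up 4x4 input so the result can be checked by eye:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)  # the stride defaults to the kernel size

x = torch.tensor([
    [1., 3., 2., 0.],
    [4., 2., 1., 5.],
    [0., 1., 9., 8.],
    [2., 6., 7., 3.],
]).reshape(1, 1, 4, 4)  # [Batch, Channels, Height, Width]

pooled = pool(x)

print(pooled.shape)    # height and width are both halved: 4x4 -> 2x2
print(pooled[0, 0])    # each entry is the maximum of one 2x2 block
```

Each of the four non-overlapping 2x2 blocks contributes exactly one value to the output.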
5. Standard CNN Architecture
A simple CNN classifier usually follows this pattern:
- 1. Conv2d -> ReLU -> MaxPool2d (Extract low-level features, compress)
- 2. Conv2d -> ReLU -> MaxPool2d (Extract mid-level features, compress)
- 3. Flatten (Convert the stack of 2D feature maps into a single 1D vector)
- 4. Linear (Standard dense network to make the final prediction based on the features)
6. The PyTorch Image Shape Quirk
This is incredibly important. When working with images, different frameworks order the tensor dimensions differently.
- TensorFlow/NumPy Format: [BatchSize, Height, Width, ColorChannels]
- PyTorch Format: [BatchSize, ColorChannels, Height, Width]
If you have a batch of 32 color (RGB) images that are 150x150 pixels, the PyTorch tensor shape MUST be: [32, 3, 150, 150].
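As a minimal sketch, a channels-last batch can be reordered with permute before it reaches a PyTorch layer. The random tensor stands in for real image data, and the Conv2d settings (16 filters, 3x3 kernel, padding of 1) are assumptions for illustration.

```python
import torch
import torch.nn as nn

# A batch in channels-last layout: [BatchSize, Height, Width, ColorChannels].
batch_nhwc = torch.rand(32, 150, 150, 3)

# Reorder the axes to PyTorch's layout: [BatchSize, ColorChannels, Height, Width].
batch_nchw = batch_nhwc.permute(0, 3, 1, 2)
print(batch_nchw.shape)

# Conv2d reads the channel count from dimension 1, so in_channels must be 3 here.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
out = conv(batch_nchw)
print(out.shape)
```

Passing the channels-last tensor directly would fail, because the layer would interpret the 150 in dimension 1 as the channel count.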
7. Mini Project: Cat vs Dog Classifier Architecture
Let's build a CNN to classify color images of Cats and Dogs (150x150 pixels, 3 color channels).
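The listing for this project is not reproduced here, so what follows is a sketch of one architecture that fits the description, not the original code. The class name CatDogCNN, the filter counts (16 and 32), the padding, and the 64-unit hidden layer are all assumptions.

```python
import torch
import torch.nn as nn


class CatDogCNN(nn.Module):
    """Hypothetical two-block CNN for 3x150x150 images (illustrative sketch)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # [N, 16, 150, 150]
            nn.ReLU(),
            nn.MaxPool2d(2),                              # [N, 16, 75, 75]
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # [N, 32, 75, 75]
            nn.ReLU(),
            nn.MaxPool2d(2),                              # [N, 32, 37, 37]
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # [N, 32 * 37 * 37]
            nn.Linear(32 * 37 * 37, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),                   # one raw score per class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))


model = CatDogCNN()
dummy_batch = torch.rand(4, 3, 150, 150)  # 4 random stand-in "images"
logits = model(dummy_batch)
print(logits.shape)  # one row per image, one column per class
```

The model outputs raw scores (logits); a loss such as nn.CrossEntropyLoss would normally be applied to them during training.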
8. Common Mistakes
- The Flattening Math: Calculating in_features for the first nn.Linear layer after the Convolutions is easy to get wrong. If an image starts at 150x150 and you apply two MaxPool2d(2) layers (with convolutions that preserve the spatial size, e.g. 3x3 kernels with padding=1), it shrinks to 75x75 and then to 37x37, because 75 / 2 is rounded down. If your last conv layer outputs 32 channels, in_features is 32 * 37 * 37 = 43,808. If you get this wrong, PyTorch will raise a shape mismatch RuntimeError in the Linear layer.
- Wrong Tensor Dimensions: Passing an image as [150, 150, 3] instead of [3, 150, 150]. Images loaded with libraries like PIL or OpenCV and converted to arrays are laid out as [Height, Width, Channels], so you must rearrange the dimensions, for example with torch.permute().
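One way to sidestep the flattening arithmetic is to push a dummy image through the convolutional part and read the size off the result. This is a sketch under assumed layer settings (16 and 32 filters, 3x3 kernels, padding of 1), not a prescribed method.

```python
import torch
import torch.nn as nn

# The convolutional part of a network, with assumed layer sizes.
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

# Run one dummy image through it; no gradients are needed for a shape check.
with torch.no_grad():
    dummy = torch.zeros(1, 3, 150, 150)
    out = features(dummy)

# Flatten everything except the batch dimension and read off the size.
in_features = out.flatten(start_dim=1).shape[1]
print(out.shape)
print(in_features)
```

The measured value can then be passed as in_features to the first nn.Linear layer, and it stays correct if the input size or the layers change.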
9. Best Practices
- Data Augmentation: Neural networks need large amounts of data. If you only have 1,000 pictures of cats, you can use torchvision.transforms to randomly flip, rotate, and zoom the images as they load, so the network rarely sees exactly the same image twice. It is as if your 1,000 images had become many thousands of variations. This helps reduce overfitting.
10. Exercises
- 1. Look at nn.Conv2d(in_channels=3, out_channels=16). What do the numbers 3 and 16 represent in the context of images and filters?
- 2. If an image enters a nn.MaxPool2d(kernel_size=2) layer with a spatial size of 100x100 pixels, what size will it be when it exits the layer?
11. MCQ Quiz with Answers
Question 1
Why are CNNs superior to standard nn.Linear networks for processing images?
Question 2
What is the correct PyTorch tensor shape format for a batch of 64 color images that are 224x224 pixels in size?
12. Interview Questions
- Q: Explain the mathematical operation of Convolution in the context of image processing.
- Q: Describe a standard CNN architecture flow from the Input layer to the Output layer, including the purpose of the Flatten operation.
13. FAQs
Q: Can I use CNNs for things other than images? A: Yes! 1D Convolutions (nn.Conv1d) are highly effective at processing audio wave signals and even certain types of sequential text and time-series data.
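As a small sketch of the 1D case, nn.Conv1d expects input shaped [Batch, Channels, Length]. The random signal, the 4 filters, and the kernel size of 5 are assumptions for illustration.

```python
import torch
import torch.nn as nn

# 8 single-channel signals, each 1,000 samples long: [Batch, Channels, Length].
signal = torch.rand(8, 1, 1000)

# 4 filters, each sliding a window of 5 samples along the signal.
conv1d = nn.Conv1d(in_channels=1, out_channels=4, kernel_size=5)
out = conv1d(signal)
print(out.shape)  # without padding, the length shrinks by kernel_size - 1
```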
14. Summary
Convolutional Neural Networks revolutionized Artificial Intelligence. By utilizing sliding filters (Conv2d) to extract edges and shapes, and MaxPooling to compress the data, CNNs can "see" complex images without buckling under billions of parameters. They remain one of the most widely used architectures in Computer Vision.