Image Classification with CNNs in PyTorch

Updated: May 16, 2026

# CHAPTER 11

1. Introduction

If you feed a high-resolution 4K image into the standard nn.Linear (Dense) neural network we built previously, every neuron in the first layer needs its own weight for every pixel value. A 3840x2160 RGB image flattens to roughly 25 million inputs, so a first layer of just 128 neurons already requires over 3 billion parameters. The network will exhaust memory and, with that much capacity, overfit easily. Furthermore, Dense layers ignore spatial relationships (e.g., that an eye is usually above a nose), because flattening discards which pixels were neighbors. To solve Computer Vision problems like this, researchers developed the Convolutional Neural Network (CNN). In this chapter, we build the architecture that allows cars to see.
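
To make the scale concrete, here is a rough back-of-the-envelope count. The hidden width of 128 and the 16 convolution filters are assumed, illustrative values, not figures from a specific model.

```python
# Rough weight count for one fully connected layer on a 4K RGB image,
# compared with a small convolutional layer (biases ignored).
height, width, channels = 2160, 3840, 3   # 4K UHD resolution, RGB
hidden_units = 128                        # assumed width of the first Dense layer

input_features = height * width * channels      # one input per pixel per channel
dense_weights = input_features * hidden_units   # one weight per (input, neuron) pair

# A 3x3 convolution with 16 filters reuses the same small kernels at every position.
conv_weights = 3 * 3 * channels * 16

print(f"Inputs after flattening: {input_features:,}")   # 24,883,200
print(f"Dense layer weights:     {dense_weights:,}")    # 3,185,049,600
print(f"Conv layer weights:      {conv_weights:,}")     # 432
```

The convolutional layer's weight count does not depend on the image size at all, which is the key to the approach described in the rest of this chapter.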

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Explain why standard Linear networks fail on complex images.
  • Understand the Convolution operation (Filters).
  • Understand MaxPooling and feature compression.
  • Build a CNN architecture using nn.Conv2d and nn.MaxPool2d.
  • Understand PyTorch's Image Tensor Shape format [Batch, Channels, Height, Width].

3. How Convolutions Work

Instead of looking at the entire image at once, a CNN uses a Filter (also called a kernel): a small grid of weights, commonly 3x3.
  1. The 3x3 filter slides (convolves) across the image, position by position, scanning it like a flashlight.
  2. Each filter responds to a specific feature (like a horizontal edge, a vertical edge, or a curve). Its weights are not designed by hand; they are learned during training.
  3. Where the filter finds its feature, it "lights up" with a high value. The grid of responses across all positions is a Feature Map.
  4. Each convolutional layer applies dozens of these filters in parallel. The first layer detects simple edges. Deeper layers combine those edges to detect complex shapes (like a dog's ear).
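
The sliding-filter idea can be sketched with a single filter written by hand. This is only an illustration: in a trained CNN the nine weights are learned rather than chosen, and the toy 6x6 image below is made up for the example.

```python
import torch
import torch.nn.functional as F

# A toy 6x6 grayscale image: dark (0) on the left half, bright (1) on the right half.
image = torch.zeros(1, 1, 6, 6)          # [Batch, Channels, Height, Width]
image[:, :, :, 3:] = 1.0

# A hand-written 3x3 filter that responds to dark-to-bright vertical edges.
vertical_edge = torch.tensor([[-1.0, 0.0, 1.0],
                              [-1.0, 0.0, 1.0],
                              [-1.0, 0.0, 1.0]]).reshape(1, 1, 3, 3)

# Slide the filter over the image (no padding, so 6x6 shrinks to 4x4).
feature_map = F.conv2d(image, vertical_edge)

print(feature_map.shape)   # torch.Size([1, 1, 4, 4])
print(feature_map[0, 0])   # every row is [0, 3, 3, 0]
```

The feature map is zero over the flat regions and "lights up" with the value 3 only where the filter window straddles the dark-to-bright boundary.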

4. MaxPooling (Compression)

After a Convolution layer finds features, the feature maps are still large. We use MaxPooling to shrink them. An nn.MaxPool2d layer with kernel_size=2 looks at each 2x2 grid of values and keeps only the maximum (brightest) one, discarding the other three. This halves both the height and the width, so each feature map keeps a quarter of its values. The strongest responses survive, and the computational cost of the later layers drops sharply.
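
A small worked example may help here. The 4x4 values below are arbitrary numbers picked so that each 2x2 window has an obvious maximum.

```python
import torch
import torch.nn as nn

# A 4x4 single-channel feature map, shaped [Batch, Channels, Height, Width].
feature_map = torch.tensor([[1.0, 3.0, 2.0, 0.0],
                            [4.0, 2.0, 1.0, 5.0],
                            [0.0, 1.0, 7.0, 2.0],
                            [2.0, 6.0, 3.0, 4.0]]).reshape(1, 1, 4, 4)

# Each non-overlapping 2x2 window is replaced by its largest value.
pool = nn.MaxPool2d(kernel_size=2, stride=2)
pooled = pool(feature_map)

print(pooled.shape)    # torch.Size([1, 1, 2, 2])
print(pooled[0, 0])    # values: [[4, 5], [6, 7]]
```

Sixteen values go in and four come out: height and width are both halved.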

5. Standard CNN Architecture

A basic CNN classifier typically follows this pattern:
  1. Conv2d -> ReLU -> MaxPool2d (Extract low-level features, compress)
  2. Conv2d -> ReLU -> MaxPool2d (Extract mid-level features, compress)
  3. Flatten (Convert the stack of 2D feature maps into one 1D vector per image)
  4. Linear (Standard dense network to make the final prediction based on the features)

6. The PyTorch Image Shape Quirk

This is incredibly important. When working with images, different frameworks order the tensor dimensions differently.
  • TensorFlow/NumPy Format: [BatchSize, Height, Width, ColorChannels]
  • PyTorch Format: [BatchSize, ColorChannels, Height, Width]

If you have a batch of 32 color (RGB) images that are 150x150 pixels, the PyTorch tensor shape MUST be: [32, 3, 150, 150].
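
As a minimal sketch of the conversion, assuming the batch arrives as a channels-last NumPy array (random values stand in for real image data here):

```python
import numpy as np
import torch

# Stand-in for a batch in channels-last layout: [Batch, Height, Width, Channels].
batch_hwc = np.random.rand(32, 150, 150, 3).astype(np.float32)

# Move the channel axis from position 3 to position 1: [B, H, W, C] -> [B, C, H, W].
# .contiguous() copies the data into the new memory order after the permute.
batch_chw = torch.from_numpy(batch_hwc).permute(0, 3, 1, 2).contiguous()

print(batch_chw.shape)   # torch.Size([32, 3, 150, 150])
```

Note that permute reorders the axes without mixing up pixel values, whereas a plain reshape to the same shape would scramble the image.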

7. Mini Project: Cat vs Dog Classifier Architecture

Let's build a CNN to classify color images of Cats and Dogs (150x150 pixels, 3 color channels).
```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()

        # Layer 1: Convolution
        # in_channels=3 (RGB image), out_channels=16 (Create 16 different feature filters)
        # kernel_size=3 (The filter is 3x3 pixels), padding=1 (Keep edges intact)
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)  # Halves height and width (150 -> 75)

        # Layer 2: Convolution
        # in_channels MUST match the out_channels of the previous layer (16)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)  # Halves again, rounding down (75 -> 37)

        # Flatten layer
        self.flatten = nn.Flatten()

        # The Math: 32 channels * 37 height * 37 width = 43,808 incoming features
        self.fc1 = nn.Linear(in_features=32 * 37 * 37, out_features=128)
        self.relu3 = nn.ReLU()

        # Output layer (1 neuron for Binary Classification)
        self.output = nn.Linear(in_features=128, out_features=1)

    def forward(self, x):
        # Feature Extraction Block
        x = self.pool1(self.relu1(self.conv1(x)))
        x = self.pool2(self.relu2(self.conv2(x)))

        # Flattening
        x = self.flatten(x)

        # Classification Block
        x = self.relu3(self.fc1(x))
        # Returns a raw score (logit); apply a sigmoid to turn it into a probability
        x = self.output(x)
        return x

model = SimpleCNN()
print(model)

# Test with a dummy image tensor [Batch=1, Channels=3, Height=150, Width=150]
dummy_img = torch.randn(1, 3, 150, 150)
out = model(dummy_img)
print("Output shape:", out.shape)  # Output: torch.Size([1, 1])
```

8. Common Mistakes

  • The Flattening Math: Calculating the in_features for the first nn.Linear layer after the Convolutions is easy to get wrong. If an image starts at 150x150 and you apply two MaxPool2d(2) layers, it shrinks to 75x75 and then to 37x37 (75 / 2 = 37.5, and pooling rounds down). If your last conv layer outputs 32 channels, the math is 32 * 37 * 37. If you get this wrong, PyTorch raises a shape mismatch error at the first Linear layer.
  • Wrong Tensor Dimensions: Passing an image as [150, 150, 3] instead of [3, 150, 150]. Images loaded with standard Python libraries like PIL or OpenCV are laid out as [Height, Width, Channels] once converted to arrays, so you must use torch.permute() to rearrange the dimensions.
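
One way to avoid the flattening mistake is to compute the size instead of guessing it. The sketch below shows two options for the 150x150 architecture from the mini project: the standard output-size formula (for dilation 1), and a dummy forward pass. The helper name conv_out_size is made up for this example.

```python
import torch
import torch.nn as nn

def conv_out_size(size, kernel_size, stride=1, padding=0):
    """Output height/width of Conv2d or MaxPool2d (dilation 1, rounds down)."""
    return (size + 2 * padding - kernel_size) // stride + 1

# Option 1: apply the formula layer by layer for a 150x150 input.
size = 150
size = conv_out_size(size, kernel_size=3, padding=1)   # conv: 150 -> 150
size = conv_out_size(size, kernel_size=2, stride=2)    # pool: 150 -> 75
size = conv_out_size(size, kernel_size=3, padding=1)   # conv: 75 -> 75
size = conv_out_size(size, kernel_size=2, stride=2)    # pool: 75 -> 37
by_formula = 32 * size * size

# Option 2: let PyTorch do the counting with a dummy forward pass.
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Flatten(),
)
with torch.no_grad():
    by_dummy_pass = features(torch.zeros(1, 3, 150, 150)).shape[1]

print(by_formula, by_dummy_pass)   # 43808 43808
```

The dummy-pass approach keeps working when you change the input size or add layers, so it is often the safer habit.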

9. Best Practices

  • Data Augmentation: Neural networks need large amounts of data. If you only have 1,000 pictures of cats, you can use torchvision.transforms to randomly flip, rotate, and zoom the images as they load. The model then sees a slightly different version of each picture in every epoch, as if the dataset were many times larger. This does not replace real data, but it noticeably reduces overfitting.

10. Exercises

  1. Look at nn.Conv2d(in_channels=3, out_channels=16). What do the numbers 3 and 16 represent in the context of images and filters?
  2. If an image enters a nn.MaxPool2d(kernel_size=2) layer with a spatial size of 100x100 pixels, what size will it be when it exits the layer?

11. MCQ Quiz with Answers

Question 1

Why are CNNs superior to standard nn.Linear networks for processing images?

Question 2

What is the correct PyTorch tensor shape format for a batch of 64 color images that are 224x224 pixels in size?

12. Interview Questions

  • Q: Explain the mathematical operation of Convolution in the context of image processing.
  • Q: Describe a standard CNN architecture flow from the Input layer to the Output layer, including the purpose of the Flatten operation.

13. FAQs

Q: Can I use CNNs for things other than images? A: Yes! 1D Convolutions (nn.Conv1d) are highly effective at processing audio wave signals and even certain types of sequential text and time-series data.
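
As a small sketch of the 1D case, nn.Conv1d expects input shaped [Batch, Channels, Length]. The clip length and layer sizes below are illustrative values only.

```python
import torch
import torch.nn as nn

# Stand-in for 8 mono audio clips of 16,000 samples each: [Batch, Channels, Length].
signals = torch.randn(8, 1, 16000)

# Four filters, each 9 samples wide, sliding along the time axis.
# padding=4 keeps the output length equal to the input length.
conv = nn.Conv1d(in_channels=1, out_channels=4, kernel_size=9, padding=4)
features = conv(signals)

print(features.shape)   # torch.Size([8, 4, 16000])
```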

14. Summary

Convolutional Neural Networks revolutionized Artificial Intelligence. By utilizing sliding filters (Conv2d) to extract edges and shapes, and MaxPooling to compress the data, CNNs can "see" complex images without buckling under billions of parameters. They are the undeniable kings of Computer Vision.

15. Next Chapter Recommendation

Training a large CNN from scratch can demand huge datasets and days of GPU time. What if you only have a laptop and 500 images of your specific dog? Can you still build a world-class AI? Yes! In Chapter 12: Transfer Learning with Pretrained Models, we will learn how to "steal" the brains of supercomputers.
