PyTorch Essentials
CHAPTER 08 Intermediate

Activation Functions and Loss Functions

Updated: May 16, 2026

# CHAPTER 8

1. Introduction

In the previous chapter, we built a neural network and sprinkled in mysterious layers like nn.ReLU(). If you copy-paste these into future projects without understanding them, sooner or later you will pair the wrong output layer with the wrong loss and end up with a model that trains poorly. Activation Functions are the "spark" that allows a neural network to learn complex patterns, while Loss Functions are the "ruler" used to measure how far the network's predictions are from the correct answers during training. In this chapter, we will decipher exactly what these functions do and how to choose the right ones in PyTorch.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Explain why non-linear Activation Functions are required.
  • Identify the use cases for ReLU, Sigmoid, and Softmax.
  • Define what a Loss Function (Criterion) does.
  • Choose the correct Loss Function for Regression vs. Classification tasks.
  • Implement PyTorch's nn.CrossEntropyLoss and nn.MSELoss.

3. Why Do We Need Activation Functions?

Recall the math inside a single nn.Linear layer: Output = (Input * Weight) + Bias. This is a Linear equation (a straight line in one dimension, a flat plane in higher dimensions). If you stack 100 linear layers on top of each other with nothing in between, they mathematically collapse into a single linear layer with one combined Weight and one combined Bias. The network would be incapable of learning complex, curvy patterns (like the shape of a face), no matter how deep it is. Activation Functions inject "Non-Linearity" (curves) between the layers, allowing the network to learn highly complex, real-world data.
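
The collapse described above can be checked numerically. The following is a minimal sketch, assuming arbitrary layer sizes (4, 8, 3) and random inputs chosen purely for illustration: it folds two stacked nn.Linear layers into one and compares the outputs, then repeats the comparison with a ReLU placed between the layers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers with NO activation in between (sizes are arbitrary).
layer1 = nn.Linear(4, 8)
layer2 = nn.Linear(8, 3)

# Fold them into one equivalent layer:
#   W = W2 @ W1    and    b = W2 @ b1 + b2
combined = nn.Linear(4, 3)
with torch.no_grad():
    combined.weight.copy_(layer2.weight @ layer1.weight)
    combined.bias.copy_(layer2.weight @ layer1.bias + layer2.bias)

x = torch.randn(5, 4)  # a batch of 5 random inputs
with torch.no_grad():
    stacked_out = layer2(layer1(x))
    combined_out = combined(x)
    relu_out = layer2(torch.relu(layer1(x)))

# Without an activation, the two-layer stack equals the single folded layer.
same_without_activation = torch.allclose(stacked_out, combined_out, atol=1e-5)
# With a ReLU in between, the folded layer generally no longer matches.
same_with_relu = torch.allclose(relu_out, combined_out, atol=1e-5)

print(same_without_activation, same_with_relu)
```

The first comparison holds for any weights, because composing two linear maps always gives another linear map. The second one is expected to fail as soon as any hidden value is negative, which is what gives the network its extra expressive power.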

4. Core Activation Functions in PyTorch

A. ReLU (Rectified Linear Unit)
  • *What it does:* If the number coming out of the neuron is negative, ReLU turns it to 0. If it is positive, it leaves it alone.
  • *Where to use it:* Hidden Layers. It is the industry standard. It is mathematically simple, extremely fast to compute, and it largely avoids the vanishing-gradient problem that made older deep networks (built on Sigmoid) hard to train.
python
self.relu = nn.ReLU()
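
To see the rule in action outside of a model class, here is a small sketch that applies nn.ReLU to a hand-picked tensor (the input values are made up for illustration):

```python
import torch
import torch.nn as nn

relu = nn.ReLU()

# A mix of negative, zero, and positive values.
x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
y = relu(x)

# Negatives become 0; 1.5 and 3.0 pass through unchanged.
print(y)
```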

B. Sigmoid

  • *What it does:* Squashes any number into a value strictly between 0.0 and 1.0, so the output can be read as a probability.
  • *Where to use it:* Output Layer (Binary Classification). If you are predicting Cat (1) or Dog (0), a single output neuron with a Sigmoid activation might output 0.85, which you read as "85% sure it is a Cat".

python
self.sigmoid = nn.Sigmoid()
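
As a quick illustration, the sketch below feeds three made-up raw scores through nn.Sigmoid. An input of 0 maps to exactly 0.5, while large negative and large positive inputs approach 0 and 1 without reaching them.

```python
import torch
import torch.nn as nn

sigmoid = nn.Sigmoid()

# Raw scores (logits) from a hypothetical single output neuron.
logits = torch.tensor([-4.0, 0.0, 4.0])
probs = sigmoid(logits)

# Each value lies between 0 and 1 and can be read as P(class = 1).
print(probs)
```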

C. Softmax

  • *What it does:* Used when you have multiple output neurons. It converts the raw outputs into values between 0 and 1 that add up to 1.0 (100%), with larger raw outputs receiving larger shares.
  • *Where to use it:* Output Layer (Multi-class Classification). If predicting 10 different digits, Softmax ensures the probabilities of all 10 digits sum to 100%.
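
The sketch below shows Softmax on a single sample with three made-up class scores. The dim=1 argument is needed so that normalization happens across the classes of each sample rather than across the batch.

```python
import torch
import torch.nn as nn

softmax = nn.Softmax(dim=1)  # normalize across the class dimension

# One sample, three classes (scores are illustrative).
logits = torch.tensor([[2.0, 1.0, 0.1]])
probs = softmax(logits)

total = probs.sum(dim=1)         # probabilities of one sample sum to 1
predicted = probs.argmax(dim=1)  # index of the most likely class

print(probs, total, predicted)
```

This is mainly useful at inference time, when you want human-readable probabilities. During training with nn.CrossEntropyLoss the model should output the raw scores instead.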

5. What are Loss Functions?

During training, the network makes a guess. The Loss Function (often called the criterion in PyTorch code) turns the gap between that guess and the true answer into a single number.
  • If the network confidently predicts "Cat" and the picture is a Cat, the Loss is close to 0.
  • If it confidently predicts "Dog" and the picture is a Cat, the Loss is High.
The Optimizer's only goal is to change the Weights so that the Loss becomes as small as possible.
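
The behaviour in the two bullets can be reproduced with a few hand-written scores. This is a sketch only: the labelling (class 0 = "Cat", class 1 = "Dog") and the score values are assumptions made for the example, and it uses nn.CrossEntropyLoss, which is covered in the next section.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
target = torch.tensor([0])  # the picture really is a Cat (class 0, hypothetical labelling)

# Raw scores for [Cat, Dog] in three situations.
confident_right = torch.tensor([[5.0, -5.0]])
unsure = torch.tensor([[0.0, 0.0]])
confident_wrong = torch.tensor([[-5.0, 5.0]])

loss_right = criterion(confident_right, target).item()
loss_unsure = criterion(unsure, target).item()
loss_wrong = criterion(confident_wrong, target).item()

# The loss grows as the guess moves away from the true answer.
print(loss_right, loss_unsure, loss_wrong)
```

Note that the loss depends on how confident the network is, not only on whether the top guess is right: the unsure 50/50 prediction sits between the two extremes.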

6. Choosing the Correct Loss Function

Choosing the wrong loss function is one of the most common reasons beginner models fail to learn.

Scenario 1: Regression (Predicting a Continuous Number)

  • *Task:* Predicting House Prices in dollars.
  • *Loss Function:* Mean Squared Error (MSE). Squares the difference between each guess and the real price, then averages the results, so large mistakes are punished much more than small ones.

python
criterion = nn.MSELoss()
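
As a worked example, the sketch below computes the MSE for three made-up house prices (in thousands of dollars) and compares it with the same calculation written out by hand.

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()

# Hypothetical prices in $1000s.
predicted = torch.tensor([250.0, 310.0, 180.0])
actual = torch.tensor([260.0, 300.0, 200.0])

loss = criterion(predicted, actual)

# By hand: errors are -10, 10, -20 -> squared 100, 100, 400 -> mean 200.
manual = ((predicted - actual) ** 2).mean()

print(loss.item(), manual.item())
```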

Scenario 2: Binary Classification (Yes or No)

  • *Task:* Predicting Spam (1) or Not Spam (0).
  • *Loss Function:* Binary Cross Entropy (BCE). Excellent for measuring probabilities between two choices.

python
# Note: BCELoss requires the output of the model to already have a Sigmoid applied!
# BCEWithLogitsLoss applies the Sigmoid automatically for better numerical stability.
criterion = nn.BCEWithLogitsLoss()
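
The relationship between the two criteria in the comments above can be verified directly. This sketch uses made-up logits and labels; note that the targets must be floats with the same shape as the model output.

```python
import torch
import torch.nn as nn

# Raw outputs of a hypothetical one-neuron model for three emails.
logits = torch.tensor([[1.2], [-0.8], [3.0]])
# Float targets, same shape as the logits: 1.0 = Spam, 0.0 = Not Spam.
targets = torch.tensor([[1.0], [0.0], [1.0]])

# Option A: feed raw logits, the Sigmoid is applied inside the loss.
loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)

# Option B: apply the Sigmoid yourself, then use plain BCELoss.
loss_manual = nn.BCELoss()(torch.sigmoid(logits), targets)

print(loss_with_logits.item(), loss_manual.item())
```

Both options give the same value here. Option A is generally preferred because it stays numerically stable for very large or very small logits.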

Scenario 3: Multi-class Classification (A, B, or C)

  • *Task:* Predicting 10 different handwritten digits.
  • *Loss Function:* Cross Entropy Loss.

python
# CRITICAL PYTORCH QUIRK:
# nn.CrossEntropyLoss() expects raw scores (logits) and applies LogSoftmax internally,
# followed by the negative log-likelihood loss.
# Do NOT put a Softmax layer at the end of your neural network if you use this loss function.
criterion = nn.CrossEntropyLoss()
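
To make the quirk concrete, the sketch below compares nn.CrossEntropyLoss with the two steps it combines, log-softmax followed by the negative log-likelihood loss. The logits and labels are made up; the targets are integer class indices, not one-hot vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Raw scores for two samples and three classes (illustrative values).
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
# Class indices (int64), one per sample.
targets = torch.tensor([0, 2])

# One call on raw logits...
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# ...matches log-softmax followed by negative log-likelihood.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(loss_ce.item(), loss_manual.item())
```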

7. Step-by-Step Implementation: Matching Output to Loss

Let's look at how the Problem dictates the Output Layer and Loss Function:
python
import torch.nn as nn

# Problem 1: Predicting House Price (Regression)
# Output: 1 Neuron. Loss: MSE
model_reg = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
criterion_reg = nn.MSELoss()

# Problem 2: Spam Detection (Binary Classification)
# Output: 1 Neuron. Loss: BCEWithLogitsLoss
model_bin = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
criterion_bin = nn.BCEWithLogitsLoss() # Applies Sigmoid internally

# Problem 3: Predicting 3 Animal Types (Multi-class Classification)
# Output: 3 Neurons. Loss: CrossEntropyLoss
model_multi = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 3))
criterion_multi = nn.CrossEntropyLoss() # Applies LogSoftmax internally; feed it raw logits
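
A quick way to confirm that an output layer and a loss fit together is to push one random batch through them. The sketch below does this for the three pairings; the batch size of 8 and the random features and targets are assumptions for illustration. Pay attention to the target shape and dtype each criterion expects.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
features = torch.randn(8, 10)  # batch of 8 samples, 10 features each (made up)

# Regression: float targets with shape (batch, 1).
model_reg = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
price_targets = torch.randn(8, 1)
loss_reg = nn.MSELoss()(model_reg(features), price_targets)

# Binary classification: float 0.0/1.0 targets with shape (batch, 1).
model_bin = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
spam_targets = torch.randint(0, 2, (8, 1)).float()
loss_bin = nn.BCEWithLogitsLoss()(model_bin(features), spam_targets)

# Multi-class: integer class indices with shape (batch,).
model_multi = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 3))
animal_targets = torch.randint(0, 3, (8,))
loss_multi = nn.CrossEntropyLoss()(model_multi(features), animal_targets)

print(loss_reg.item(), loss_bin.item(), loss_multi.item())
```

Each criterion returns a single scalar by default (the mean over the batch), which is the value the training loop will later call .backward() on.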

8. Common Mistakes

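  • Double Softmax: nn.CrossEntropyLoss already applies (Log)Softmax inside its formula, so if you manually add nn.Softmax() as the final layer of your model, Softmax is applied twice. The loss then receives values squeezed into the 0 to 1 range instead of raw scores, which flattens the predicted distribution and shrinks the gradients. Training usually becomes much slower and the final accuracy often suffers.
  • Using Mean Squared Error for Classification: MSE is designed for continuous targets. On probabilities produced by Sigmoid or Softmax, its gradient becomes very small precisely when the model is confidently wrong, so learning can stall. Cross-entropy losses do not have this weakness, which is why they are the standard choice for "Cat vs Dog" style problems.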
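
The double Softmax effect is easy to observe on a single prediction. In this sketch (three classes, made-up scores), the model is confidently correct. Fed raw scores, the loss is essentially 0; fed Softmax outputs, the loss stays well above 0 even though the prediction could hardly be better.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# A confidently correct prediction for class 0 (illustrative values).
logits = torch.tensor([[10.0, -10.0, -10.0]])
target = torch.tensor([0])

# Correct usage: pass the raw logits.
loss_correct = criterion(logits, target).item()

# Mistake: apply Softmax first, so the loss applies it a second time.
loss_double = criterion(torch.softmax(logits, dim=1), target).item()

print(loss_correct, loss_double)
```

With three classes, the doubled version cannot drop below roughly 0.55, because the inputs to the loss are confined to the range 0 to 1. The optimizer is left chasing a target it can never reach.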

9. Best Practices

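  • Default to ReLU: For hidden layers, do not overthink it. ReLU is the right starting point in the vast majority of cases. Only explore variants (like Leaky ReLU) if your model is specifically struggling to learn, for example because many neurons output 0 for every input and stop updating.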
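
For reference, this is how the Leaky ReLU variant mentioned above differs from plain ReLU. The sketch uses made-up inputs and sets negative_slope to 0.01, which is also PyTorch's default for nn.LeakyReLU.

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, 0.0, 3.0])

relu_out = nn.ReLU()(x)
# Leaky ReLU keeps a small slope for negative inputs instead of zeroing them.
leaky_out = nn.LeakyReLU(negative_slope=0.01)(x)

print(relu_out, leaky_out)
```

Because negative inputs still produce a small non-zero output and gradient, a neuron using Leaky ReLU keeps receiving updates even when its input is negative.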

10. Exercises

  1. You are building a neural network to predict if a patient has a disease ("Yes" or "No"). Describe exactly how you would configure the final nn.Linear layer and which PyTorch criterion you would use.
  2. Why is a non-linear activation function required in hidden layers?

11. MCQ Quiz with Answers

Question 1

Which activation function is standard for hidden layers because of its computational efficiency and ability to introduce non-linearity?

Question 2

You are predicting whether an image is a car, a truck, or a motorcycle. You use nn.CrossEntropyLoss(). What MUST you remember regarding the architecture of your model?

12. Interview Questions

  • Q: Explain the PyTorch quirk regarding nn.CrossEntropyLoss() and the Softmax activation function.
  • Q: If your neural network's loss is not decreasing during training, and you notice you used nn.MSELoss for a classification task, explain mathematically why this is causing a problem.

13. FAQs

Q: I keep hearing about the "Vanishing Gradient" problem. What is it? A: Historically, researchers used the Sigmoid function in hidden layers. The slope (derivative) of Sigmoid is never larger than 0.25, and Backpropagation multiplies in one such slope for every layer it passes through, so the error signals (gradients) became smaller and smaller on their way backward until they were effectively 0. The layers closest to the input stopped learning. ReLU largely avoids this because its slope is exactly 1 for positive inputs.
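
The shrinking effect can be demonstrated with a toy experiment. This is a deliberately simplified sketch: it chains 20 activations with no weights in between (an assumption made to isolate the activation's contribution) and measures the gradient that arrives back at the input. The helper name input_gradient is hypothetical.

```python
import torch

def input_gradient(activation, depth=20):
    """Apply `activation` `depth` times and return d(output)/d(input)."""
    x = torch.tensor(1.0, requires_grad=True)
    h = x
    for _ in range(depth):
        h = activation(h)
    h.backward()
    return x.grad.item()

# Each Sigmoid multiplies the gradient by at most 0.25.
grad_sigmoid = input_gradient(torch.sigmoid)
# ReLU passes the gradient through unchanged for positive values.
grad_relu = input_gradient(torch.relu)

print(grad_sigmoid, grad_relu)
```

After 20 Sigmoid steps the gradient is at most 0.25 raised to the power of 20, which is below one trillionth, while the ReLU chain returns a gradient of 1. Real networks also have weights that scale the signal, so the numbers differ in practice, but the trend is the same.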

14. Summary

You are no longer guessing when building models. You now know that ReLU introduces the non-linear flexibility required for deep learning. Most importantly, you understand PyTorch's specific quirks: ensuring you pair the correct Loss Function (MSE vs. BCE vs. CrossEntropy) to your specific business problem, and avoiding the "double Softmax" trap.

15. Next Chapter Recommendation

We have built the perfect model and defined our Loss function. But how do we actually tell PyTorch to update the weights? It is time to write the infamous Training Loop. In Chapter 9: Training and Evaluating Models in PyTorch, we will bring our AI to life.
