CHAPTER 08
Intermediate
Activation Functions and Loss Functions
Updated: May 16, 2026
1. Introduction
In the previous chapter, we built a neural network and sprinkled in mysterious layers like nn.ReLU(). If you blindly copy-paste these into future projects, your models will eventually fail. Activation Functions are the "spark" that allows a neural network to learn complex patterns, while Loss Functions are the "ruler" used to measure how badly the network is failing during training. In this chapter, we will decipher exactly what these functions do and how to choose the right ones in PyTorch.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Explain why non-linear Activation Functions are required.
- Identify the use cases for ReLU, Sigmoid, and Softmax.
- Define what a Loss Function (Criterion) does.
- Choose the correct Loss Function for Regression vs. Classification tasks.
- Implement PyTorch's nn.CrossEntropyLoss and nn.MSELoss.
3. Why Do We Need Activation Functions?
Recall the math inside a single nn.Linear layer: Output = (Input * Weight) + Bias.
This is a linear equation (a straight line). If you stack 100 linear layers on top of each other with nothing in between, they mathematically collapse into a single linear layer: one giant straight line. The network would be completely incapable of learning complex, curvy patterns (like the shape of a face).
Activation Functions inject "Non-Linearity" (curves) into the network, allowing it to learn highly complex, real-world data.
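The collapse described above can be checked directly. The following is a minimal sketch, assuming arbitrary layer sizes (4, 8, 3) and random inputs chosen only for illustration: it folds two stacked nn.Linear layers into one equivalent layer, then shows that the equivalence disappears once a ReLU sits between them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible random weights and inputs

# Two stacked linear layers with no activation in between.
layer1 = nn.Linear(4, 8)
layer2 = nn.Linear(8, 3)

# Fold them into one layer:
#   layer2(layer1(x)) = x @ (W2 @ W1).T + (W2 @ b1 + b2)
merged = nn.Linear(4, 3)
with torch.no_grad():
    merged.weight.copy_(layer2.weight @ layer1.weight)
    merged.bias.copy_(layer2.weight @ layer1.bias + layer2.bias)

x = torch.randn(5, 4)  # a batch of 5 made-up inputs

stacked_out = layer2(layer1(x))
merged_out = merged(x)
same_without_activation = torch.allclose(stacked_out, merged_out, atol=1e-5)

# With a ReLU between the layers, a single linear layer no longer matches.
relu_out = layer2(torch.relu(layer1(x)))
same_with_activation = torch.allclose(relu_out, merged_out, atol=1e-5)

print("Stack equals one linear layer:", same_without_activation)
print("Still equal after adding ReLU:", same_with_activation)
```

The same folding trick works for any number of stacked linear layers, which is why depth alone adds nothing without an activation function.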
4. Core Activation Functions in PyTorch
A. ReLU (Rectified Linear Unit)
- *What it does:* If the number coming out of the neuron is negative, ReLU turns it to 0. If it is positive, it leaves it alone.
- *Where to use it:* Hidden Layers. It is the industry standard. It is mathematically simple, extremely fast to compute, and mitigates the vanishing-gradient problem that made deep Sigmoid-based networks hard to train.
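A minimal sketch of ReLU in PyTorch. The input values and the layer sizes in the small model are made up for illustration.

```python
import torch
import torch.nn as nn

relu = nn.ReLU()

# Made-up neuron outputs: two negatives, a zero, two positives.
x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
y = relu(x)
print(y)  # negatives become 0, positives pass through unchanged

# Typical placement: between the linear layers of a hidden block.
# The sizes (4 -> 8 -> 1) are arbitrary.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)
out = model(torch.randn(2, 4))  # batch of 2 samples
```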
B. Sigmoid
- *What it does:* Squashes any number into a value strictly between 0.0 and 1.0, which can be read as a probability.
- *Where to use it:* Output Layer (Binary Classification). If you are predicting Cat (1) or Dog (0), a single output neuron with a Sigmoid activation might output 0.85 (85% sure it is a Cat).
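A minimal sketch of Sigmoid in PyTorch, assuming three made-up raw outputs (logits) from a single output neuron:

```python
import torch
import torch.nn as nn

sigmoid = nn.Sigmoid()

# Made-up raw outputs: strongly negative, neutral, strongly positive.
logits = torch.tensor([-4.0, 0.0, 4.0])
probs = sigmoid(logits)
print(probs)  # every value lies between 0 and 1; a logit of 0 maps to 0.5

# Read each value as "probability of class 1" and threshold at 0.5.
predicted_class = (probs > 0.5).long()
```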
C. Softmax
- *What it does:* Used when you have multiple output neurons. It rescales all outputs into positive values that add up to exactly 1.0 (100%).
- *Where to use it:* Output Layer (Multi-class Classification). If predicting 10 different digits, Softmax ensures the probabilities of all 10 digits sum to 100%.
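A minimal sketch of Softmax over 10 classes, assuming random numbers stand in for the raw outputs of a digit classifier:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# dim=1 applies Softmax across the class dimension of a (batch, classes) tensor.
softmax = nn.Softmax(dim=1)

logits = torch.randn(1, 10)  # one sample, 10 made-up class scores
probs = softmax(logits)
total = probs.sum(dim=1)

print(probs)
print(total)  # sums to 1 (up to floating-point rounding)

# Softmax preserves the ranking: the largest logit gets the largest probability.
predicted_digit = probs.argmax(dim=1)
```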
5. What are Loss Functions?
During training, the network makes a guess. The Loss Function (often called the criterion in PyTorch code) measures how far that guess is from the true answer.
- If the network confidently predicts "Cat" and the picture is a Cat, the Loss is close to 0.
- If it predicts "Dog" and the picture is a Cat, the Loss is high.
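The Cat and Dog example can be sketched with a criterion in a few lines. The class indices (0 = Dog, 1 = Cat) and the raw scores are assumptions made for illustration.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Assumed labelling: index 0 = Dog, index 1 = Cat.
target = torch.tensor([1])  # the picture is a Cat

good_guess = torch.tensor([[-3.0, 3.0]])  # raw scores strongly favoring Cat
bad_guess = torch.tensor([[3.0, -3.0]])   # raw scores strongly favoring Dog

loss_good = criterion(good_guess, target)
loss_bad = criterion(bad_guess, target)

print(loss_good.item())  # close to 0
print(loss_bad.item())   # much larger
```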
6. Choosing the Correct Loss Function
Choosing the wrong loss function is one of the most common reasons beginner models fail to learn.
Scenario 1: Regression (Predicting a Continuous Number)
- *Task:* Predicting House Prices in dollars.
- *Loss Function:* Mean Squared Error (MSE). Averages the squared difference between each guess and the real price, so large misses are penalized much more heavily than small ones.
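A minimal sketch of nn.MSELoss, assuming three made-up house prices expressed in thousands of dollars:

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()

# Made-up values, in thousands of dollars.
predicted = torch.tensor([310.0, 195.0, 420.0])
actual = torch.tensor([300.0, 200.0, 400.0])

loss = criterion(predicted, actual)

# By hand: the errors are 10, -5 and 20, so
# (10**2 + 5**2 + 20**2) / 3 = 525 / 3 = 175
print(loss.item())
```

Because the errors are squared, the single miss of 20 contributes 400 of the 525 total.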
Scenario 2: Binary Classification (Yes or No)
- *Task:* Predicting Spam (1) or Not Spam (0).
- *Loss Function:* Binary Cross Entropy (BCE). Measures how far a predicted probability (between 0 and 1) is from the true label, and penalizes confident wrong answers heavily.
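A minimal sketch of Binary Cross Entropy in PyTorch, assuming made-up raw outputs for three emails. It shows two equivalent routes: nn.BCELoss on Sigmoid probabilities, and nn.BCEWithLogitsLoss, which takes the raw outputs and applies the Sigmoid internally.

```python
import torch
import torch.nn as nn

# Made-up raw outputs for three emails; shape (batch, 1).
logits = torch.tensor([[2.0], [-1.0], [0.5]])
# Targets must be floats with the same shape: 1.0 = Spam, 0.0 = Not Spam.
targets = torch.tensor([[1.0], [0.0], [1.0]])

# Route 1: Sigmoid first, then BCELoss on the probabilities.
probs = torch.sigmoid(logits)
loss_bce = nn.BCELoss()(probs, targets)

# Route 2: BCEWithLogitsLoss on the raw outputs (no Sigmoid in the model).
loss_bce_logits = nn.BCEWithLogitsLoss()(logits, targets)

print(loss_bce.item(), loss_bce_logits.item())  # the two values match
```

If the model already ends in a Sigmoid, nn.BCELoss is the matching criterion; if it outputs raw scores, nn.BCEWithLogitsLoss is the matching one. Mixing them applies the Sigmoid twice or not at all.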
Scenario 3: Multi-class Classification (A, B, or C)
- *Task:* Predicting 10 different handwritten digits.
- *Loss Function:* Cross Entropy Loss.
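A minimal sketch of nn.CrossEntropyLoss for 10 digit classes, assuming random numbers stand in for the model's raw outputs. The criterion expects raw scores of shape (batch, classes) and integer class indices of shape (batch,).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

criterion = nn.CrossEntropyLoss()

logits = torch.randn(4, 10)           # batch of 4 samples, 10 made-up class scores each
targets = torch.tensor([3, 7, 0, 9])  # true digit for each sample (integer indices)

loss = criterion(logits, targets)

# The same value computed in two explicit steps: log-softmax, then negative log-likelihood.
manual = nn.NLLLoss()(torch.log_softmax(logits, dim=1), targets)

print(loss.item(), manual.item())  # the two values match
```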
7. Step-by-Step Implementation: Matching Output to Loss
Let's look at how the Problem dictates the Output Layer and Loss Function:
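One way to sketch this pairing, assuming a made-up batch of 8 samples with 16 features each and random targets. The variable names and sizes are illustrative, not taken from the chapter's original listing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
features = torch.randn(8, 16)  # made-up batch: 8 samples, 16 features

# Regression: 1 output neuron, no activation, MSELoss.
regression_head = nn.Linear(16, 1)
regression_targets = torch.randn(8, 1)
regression_loss = nn.MSELoss()(regression_head(features), regression_targets)

# Binary classification: 1 output neuron + Sigmoid, BCELoss.
binary_head = nn.Sequential(nn.Linear(16, 1), nn.Sigmoid())
binary_targets = torch.randint(0, 2, (8, 1)).float()
binary_loss = nn.BCELoss()(binary_head(features), binary_targets)

# Multi-class classification: 1 output neuron per class, no Softmax,
# CrossEntropyLoss (it applies log-softmax internally).
multiclass_head = nn.Linear(16, 10)
multiclass_targets = torch.randint(0, 10, (8,))
multiclass_loss = nn.CrossEntropyLoss()(multiclass_head(features), multiclass_targets)

print(regression_loss.item(), binary_loss.item(), multiclass_loss.item())
```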
8. Common Mistakes
- Double Softmax: Because nn.CrossEntropyLoss automatically applies Softmax (as log-softmax) inside its formula, if you manually add nn.Softmax() as the final layer of your model, you are applying Softmax twice. The second Softmax flattens the differences between classes, which shrinks the gradients, so your model will train very slowly or stall.
- Using Mean Squared Error for Classification: MSE is designed for continuous numbers. Paired with a Sigmoid or Softmax output, it produces very small gradients when the model is confidently wrong, so learning slows to a crawl.
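The Double Softmax mistake can be made visible with a small experiment. This is a sketch with made-up scores for a 3-class problem: the prediction is confident and correct, yet feeding Softmax probabilities into nn.CrossEntropyLoss keeps the loss well above zero. With 3 classes the inputs are confined to the range 0 to 1, so the loss cannot drop below log(1 + 2/e), roughly 0.55.

```python
import math
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
target = torch.tensor([0])

# Made-up raw scores: very confident, and correct.
logits = torch.tensor([[10.0, 0.0, 0.0]])

# Correct usage: pass raw scores straight to the criterion.
loss_correct = criterion(logits, target)

# The mistake: apply Softmax in the model, then CrossEntropyLoss on top.
probs = torch.softmax(logits, dim=1)
loss_double = criterion(probs, target)

# Lowest loss reachable with 3 classes when the inputs are probabilities.
floor = math.log(1 + 2 / math.e)

print(loss_correct.item())  # close to 0
print(loss_double.item())   # stuck near the floor
print(floor)
```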
9. Best Practices
- Default to ReLU: For hidden layers, do not overthink it. Use ReLU 99% of the time. Only explore advanced variants (like Leaky ReLU) if your model is specifically struggling to learn.
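For comparison, a minimal sketch of ReLU next to Leaky ReLU on made-up inputs. The negative_slope of 0.01 is PyTorch's default, written out here for clarity.

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, 0.0, 3.0])  # made-up neuron outputs

relu_out = nn.ReLU()(x)
# Leaky ReLU multiplies negative inputs by a small slope instead of zeroing them.
leaky_out = nn.LeakyReLU(negative_slope=0.01)(x)

print(relu_out)   # the negative input becomes 0
print(leaky_out)  # the negative input becomes -2.0 * 0.01
```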
10. Exercises
- 1. You are building a neural network to predict if a patient has a disease ("Yes" or "No"). Describe exactly how you would configure the final nn.Linear layer and which PyTorch criterion you would use.
- 2. Why is a non-linear activation function required in hidden layers?
11. MCQ Quiz with Answers
Question 1
Which activation function is standard for hidden layers because of its computational efficiency and ability to introduce non-linearity?
Question 2
You are predicting whether an image is a car, a truck, or a motorcycle. You use nn.CrossEntropyLoss(). What MUST you remember regarding the architecture of your model?
12. Interview Questions
- Q: Explain the PyTorch quirk regarding nn.CrossEntropyLoss() and the Softmax activation function.
- Q: If your neural network's loss is not decreasing during training, and you notice you used nn.MSELoss for a classification task, explain mathematically why this is causing a problem.