PyTorch Essentials
CHAPTER 15 Intermediate

LSTM and Sequence Models

Updated: May 16, 2026

# CHAPTER 15

LSTM and Sequence Models

1. Introduction

In the last chapter, we learned that a standard RNN suffers from the Vanishing Gradient problem: it forgets the beginning of a paragraph by the time it reaches the end. To address this, Hochreiter and Schmidhuber introduced the LSTM (Long Short-Term Memory) network in 1997. For years, LSTMs were the heavy machinery of sequence modeling. Before Transformers took over, LSTM-based models powered products such as Google Translate and voice assistants like Siri and Alexa. In this chapter, we will learn how LSTMs manage memory and how to use them in PyTorch.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Explain how an LSTM mitigates the Vanishing Gradient problem.
  • Understand the function of the Cell State and internal Gates.
  • Implement an nn.LSTM layer in PyTorch.
  • Compare Bi-directional LSTMs to standard LSTMs.
  • Build a predictive Sequence Model.

3. How an LSTM Works (The Conveyor Belt)

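An LSTM is a Recurrent layer, but in addition to the Hidden State it carries a second piece of memory: the Cell State. Imagine the Cell State as a conveyor belt running straight through the top of the network, from the first time step to the last. At each step the belt is only rescaled and added to, rather than being pushed through a full matrix multiplication and squashing function. Information (and gradients) can therefore travel along it across many time steps with little decay, which greatly reduces the Vanishing Gradient problem, although it does not eliminate it entirely.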

4. The Three Gates

To control what goes onto the conveyor belt, the LSTM uses three mathematical "Gates". Each gate looks at the current input and the previous hidden state, and outputs values between 0 and 1:
  1. Forget Gate: Decides which parts of the old Cell State are no longer relevant and should be thrown away (e.g., the sentence subject changed from "Bob" to "Alice").
  2. Input Gate: Decides what *new* information from the current word is important enough to add to the conveyor belt.
  3. Output Gate: Decides how much of the updated Cell State is exposed as the hidden state, which is the layer's output for this specific time step.
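
To make the three gates concrete, here is a minimal sketch of one LSTM time step written by hand and compared against PyTorch's built-in `nn.LSTMCell`. The function name `manual_lstm_step` and the tensor sizes are assumptions chosen for illustration; the gate ordering (input, forget, candidate, output) follows how PyTorch lays out the cell's weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not from the chapter)
input_size, hidden_size, batch = 8, 16, 4
cell = nn.LSTMCell(input_size, hidden_size)

x = torch.randn(batch, input_size)        # current input (e.g. a word embedding)
h_prev = torch.randn(batch, hidden_size)  # previous hidden state
c_prev = torch.randn(batch, hidden_size)  # previous cell state (the "conveyor belt")

def manual_lstm_step(x, h_prev, c_prev, cell):
    """One LSTM step computed by hand from the cell's own weights."""
    # All four pre-activations come from the current input and previous hidden state.
    gates = (x @ cell.weight_ih.T + cell.bias_ih
             + h_prev @ cell.weight_hh.T + cell.bias_hh)
    # PyTorch stores the four blocks in the order: input, forget, candidate, output.
    i, f, g, o = gates.chunk(4, dim=1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # gates in (0, 1)
    g = torch.tanh(g)                                               # candidate memory
    c_new = f * c_prev + i * g         # forget some old memory, add some new memory
    h_new = o * torch.tanh(c_new)      # output gate decides what to expose
    return h_new, c_new

with torch.no_grad():
    h_manual, c_manual = manual_lstm_step(x, h_prev, c_prev, cell)
    h_ref, c_ref = cell(x, (h_prev, c_prev))

print(torch.allclose(h_manual, h_ref, atol=1e-5))
print(torch.allclose(c_manual, c_ref, atol=1e-5))
```

Note that the cell state update is an element-wise multiply followed by an add. This is the "conveyor belt": when the forget gate stays close to 1, old memory passes through almost untouched.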

5. Implementing LSTM in PyTorch

Replacing an nn.RNN with an nn.LSTM in PyTorch requires exactly one word change in the constructor. However, be aware that the return value changes! An nn.RNN returns the output and the hidden state, while an nn.LSTM returns the output and a tuple containing the hidden state AND the new cell state.
```python
import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)

        # The LSTM layer with hidden_size memory units per layer
        # num_layers=2 means we stack two LSTMs on top of each other!
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers=2, batch_first=True)

        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, seq_len) tensor of integer token ids
        embedded = self.embedding(x)

        # LSTM returns: output, (hidden_state, cell_state)
        lstm_out, (hidden, cell) = self.lstm(embedded)

        # hidden has shape (num_layers, batch, hidden_size).
        # Index -1 grabs the final hidden state of the TOP layer.
        final_memory = hidden[-1]

        out = self.fc(final_memory)
        return out

# hidden_size=64 gives each LSTM layer 64 memory units
model = LSTMModel(vocab_size=10000, embed_dim=32, hidden_size=64)
```

*This model will typically outperform a standard RNN on long movie reviews because it is much better at retaining context from the very first sentence!*
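
The difference in return values is easiest to see by printing shapes. The following is a small sketch with arbitrary sizes (the batch size and sequence length are assumptions for illustration) that runs the same input through an `nn.RNN` and an `nn.LSTM`:

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions)
batch, seq_len, embed_dim, hidden_size, num_layers = 4, 10, 32, 64, 2
x = torch.randn(batch, seq_len, embed_dim)

rnn = nn.RNN(embed_dim, hidden_size, num_layers=num_layers, batch_first=True)
lstm = nn.LSTM(embed_dim, hidden_size, num_layers=num_layers, batch_first=True)

# nn.RNN: output plus a single hidden-state tensor
rnn_out, rnn_hidden = rnn(x)

# nn.LSTM: output plus a TUPLE of (hidden state, cell state)
lstm_out, (lstm_hidden, lstm_cell) = lstm(x)

print(rnn_out.shape)      # (batch, seq_len, hidden_size)
print(rnn_hidden.shape)   # (num_layers, batch, hidden_size)
print(lstm_out.shape)     # (batch, seq_len, hidden_size)
print(lstm_hidden.shape)  # (num_layers, batch, hidden_size)
print(lstm_cell.shape)    # (num_layers, batch, hidden_size)

# For a unidirectional LSTM, the last time step of the output is the
# final hidden state of the top layer.
last_step_matches = torch.allclose(lstm_out[:, -1, :], lstm_hidden[-1], atol=1e-6)
print(last_step_matches)
```

This is why `hidden[-1]` and `lstm_out[:, -1, :]` are interchangeable in a unidirectional model when every sequence in the batch has the same length.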

6. Bidirectional LSTMs

When you read the sentence "The bank of the river," you know "bank" means land, not a financial institution, because of the word "river" at the end of the sentence. Standard LSTMs read strictly left-to-right, so when they process "bank" they have not seen "river" yet. A Bidirectional LSTM runs two LSTMs over the same sequence: one reads left-to-right, and the other reads right-to-left! It combines their outputs, which often improves accuracy on tasks where the whole sequence is available up front.
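```python
# Inside your model's __init__:
# Creating a Bidirectional LSTM in PyTorch is as simple as adding bidirectional=True
self.lstm_bidir = nn.LSTM(embed_dim, hidden_size, bidirectional=True, batch_first=True)

# WARNING: Because it runs two LSTMs, the output feature size is DOUBLED:
# lstm_out has hidden_size * 2 features per time step. The hidden state instead
# holds one (batch, hidden_size) slice per direction, which you must concatenate.
# Either way, your next Linear layer MUST have: in_features = hidden_size * 2
self.fc_bidir = nn.Linear(hidden_size * 2, 1)
```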
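
As a complete illustration, here is a minimal sketch of a bidirectional classifier. The class name `BiLSTMClassifier` and the input sizes are assumptions for this example. The key step is concatenating the last two slices of the hidden state, which hold the final forward and backward states of the top layer.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, vocab_size, embed_dim, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, bidirectional=True, batch_first=True)
        # Forward and backward states are concatenated, hence hidden_size * 2
        self.fc = nn.Linear(hidden_size * 2, 1)

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, (hidden, cell) = self.lstm(embedded)
        # hidden: (num_layers * 2, batch, hidden_size)
        # hidden[-2] is the top layer's forward direction, hidden[-1] its backward direction
        final_memory = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return self.fc(final_memory)

model = BiLSTMClassifier(vocab_size=10000, embed_dim=32, hidden_size=64)
tokens = torch.randint(0, 10000, (4, 20))  # placeholder batch: 4 sequences of 20 token ids
scores = model(tokens)
print(scores.shape)
```

Using only `hidden[-1]` here would silently discard the forward direction and then fail at the Linear layer, because a single slice has only `hidden_size` features.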

7. Mini Project: Sequence Prediction (Text Generation)

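Let's look at the architecture for a model that reads a sequence of words and predicts the very next word. This is the same next-word objective that large language models like ChatGPT are trained on, although those models use Transformers rather than LSTMs.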
```python
import torch
import torch.nn as nn

class TextGeneratorLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)

        # A 3-layer LSTM for deep understanding
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers=3, batch_first=True)

        # Output layer size matches the ENTIRE vocabulary size.
        # It outputs one raw score (logit) for every possible "next" word.
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, (hidden, cell) = self.lstm(embedded)

        # We use the final hidden state of the top layer to predict the next word.
        # These are logits, not probabilities: apply softmax to get probabilities,
        # or pass them directly to nn.CrossEntropyLoss during training.
        logits = self.fc(hidden[-1])
        return logits
```
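
To show how such a model might be used, here is a minimal sketch of one training step followed by greedy generation. The tiny vocabulary, the random placeholder data, the Adam optimizer, and the learning rate are all assumptions made for illustration; a real project would use tokenized text and many training steps. The class is repeated so the example runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TextGeneratorLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers=3, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, (hidden, cell) = self.lstm(embedded)
        return self.fc(hidden[-1])  # raw scores, one per vocabulary word

vocab_size = 50  # tiny vocabulary, for illustration only
model = TextGeneratorLSTM(vocab_size, embed_dim=16, hidden_size=32)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # expects raw scores, applies softmax internally

# Placeholder data: 8 contexts of 5 token ids each, plus the "next word" for each
contexts = torch.randint(0, vocab_size, (8, 5))
next_words = torch.randint(0, vocab_size, (8,))

# One training step
model.train()
optimizer.zero_grad()
logits = model(contexts)             # (8, vocab_size)
loss = loss_fn(logits, next_words)
loss.backward()
optimizer.step()

# Greedy generation: repeatedly append the most likely next token
model.eval()
generated = contexts[:1].clone()     # start from the first context, shape (1, 5)
with torch.no_grad():
    for _ in range(3):
        probs = torch.softmax(model(generated), dim=-1)
        next_id = probs.argmax(dim=-1, keepdim=True)  # shape (1, 1)
        generated = torch.cat([generated, next_id], dim=1)

print(generated.shape)
```

Greedy decoding always picks the single most likely word. Sampling from `probs` instead (for example with `torch.multinomial`) usually gives more varied text.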

8. Common Mistakes

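  • Overfitting with complex LSTMs: For the same input and hidden sizes, an LSTM layer has four times as many parameters as a standard RNN layer (one set of weights for each of the three gates, plus one for the candidate memory). It can overfit quickly on small datasets. When num_layers is greater than 1, consider adding dropout (for example dropout=0.2) to your nn.LSTM instantiation. This dropout is applied only between stacked layers, so it has no effect when num_layers is 1.
  • Ignoring GRUs: PyTorch also provides an nn.GRU (Gated Recurrent Unit) layer. It is a simplified version of an LSTM that uses two gates and no separate cell state. It has fewer parameters, usually trains somewhat faster, and often reaches comparable accuracy. It is worth trying a GRU as a baseline!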
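
The parameter counts behind both points can be checked directly. This is a small sketch; the helper `count_parameters` is a name introduced here for illustration, and the sizes match the model from Section 5.

```python
import torch.nn as nn

def count_parameters(module):
    """Total number of learnable values in a module (illustrative helper)."""
    return sum(p.numel() for p in module.parameters())

embed_dim, hidden_size = 32, 64

rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)
gru = nn.GRU(embed_dim, hidden_size, batch_first=True)
lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)

rnn_params = count_parameters(rnn)
gru_params = count_parameters(gru)
lstm_params = count_parameters(lstm)

print(rnn_params)   # one weight block
print(gru_params)   # three weight blocks
print(lstm_params)  # four weight blocks

# Dropout between stacked layers only makes sense when num_layers > 1
stacked_lstm = nn.LSTM(embed_dim, hidden_size, num_layers=2, dropout=0.2, batch_first=True)
```

With these sizes the GRU has three times and the LSTM four times the parameters of the plain RNN, which is where the GRU's speed advantage comes from.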

9. Best Practices

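  • Use 1D Convolutions with LSTMs: A common trick for text processing is passing the Embeddings through an nn.Conv1d layer *before* feeding them to the LSTM. The convolution extracts local phrase patterns, and with a stride greater than 1 (or a pooling layer) it also shortens the sequence, making the LSTM's job easier and faster! Remember that nn.Conv1d expects input shaped (batch, channels, length), so the embedding output must be transposed first.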
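
Here is a minimal sketch of that pattern. The class name `ConvLSTMClassifier`, the channel count, and the kernel, stride, and padding values are assumptions chosen for illustration, not recommended settings.

```python
import torch
import torch.nn as nn

class ConvLSTMClassifier(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, vocab_size, embed_dim, conv_channels, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # stride=2 roughly halves the sequence length the LSTM has to process
        self.conv = nn.Conv1d(embed_dim, conv_channels, kernel_size=3, stride=2, padding=1)
        self.lstm = nn.LSTM(conv_channels, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        embedded = self.embedding(x)                   # (batch, seq_len, embed_dim)
        features = self.conv(embedded.transpose(1, 2)) # Conv1d wants (batch, channels, length)
        features = torch.relu(features)
        features = features.transpose(1, 2)            # back to (batch, new_len, channels)
        lstm_out, (hidden, cell) = self.lstm(features)
        return self.fc(hidden[-1])

model = ConvLSTMClassifier(vocab_size=10000, embed_dim=32, conv_channels=48, hidden_size=64)
tokens = torch.randint(0, 10000, (4, 20))  # placeholder batch: 4 sequences of 20 token ids
scores = model(tokens)

# Length of the sequence after the convolution
shortened_len = model.conv(model.embedding(tokens).transpose(1, 2)).shape[-1]
print(scores.shape, shortened_len)
```

With a sequence length of 20, this convolution hands the LSTM only 10 time steps, so the recurrent part does about half the sequential work.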

10. Exercises

  1. What is the purpose of the "Forget Gate" inside an LSTM cell?
  2. If you create nn.LSTM(input_size, hidden_size=128, bidirectional=True), what must the in_features of your subsequent nn.Linear layer be?

11. MCQ Quiz with Answers

Question 1

How does an LSTM solve the Vanishing Gradient problem found in standard RNNs?

Question 2

When should you use a Bidirectional LSTM instead of a standard LSTM?

12. Interview Questions

  • Q: Explain the difference in architecture between a standard RNN and an LSTM.
  • Q: Why would you use bidirectional=True on an LSTM, and in what scenario (like real-time forecasting) would this actually be a bad idea?

13. FAQs

Q: Are LSTMs obsolete because of Transformers (like GPT-4)?
A: For massive, billion-parameter language modeling, yes, Transformers have taken over. However, for smaller tasks (like real-time IoT sensor forecasting, basic sentiment analysis, or mobile app features), LSTMs are still widely used because they can be much smaller and cheaper to train and run than a Transformer.

14. Summary

LSTMs were a major step forward for sequence modeling. By using internal gates to actively manage a long-term memory via the Cell State, LSTMs largely tamed the Vanishing Gradient problem that limits standard RNNs. Whether generating text, translating languages, or forecasting time series such as financial data, LSTMs remain a useful tool for working with sequential data.

15. Next Chapter Recommendation

You have built incredibly complex CNNs and LSTMs. But how do you save them? How do you put them into an app so users can interact with them? In Chapter 16: Saving, Loading, and Deploying PyTorch Models, we bridge the gap between Data Science and Software Engineering.
