CHAPTER 13 Intermediate

Natural Language Processing Basics with PyTorch

Updated: May 16, 2026

1. Introduction

Neural networks operate only on numbers. If you pass the string "I love this movie!" to an nn.Linear layer, PyTorch raises an error, because the layer expects a tensor. To teach a machine to read, we must first translate human language into numbers. The field of AI that deals with this is called Natural Language Processing (NLP). In this chapter, we will learn the standard pipeline for preparing text data: cleaning, tokenization, vocabulary building, and Word Embeddings.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Explain the standard NLP preprocessing workflow.
  • Understand Tokenization and Vocabulary creation.
  • Convert sentences into numerical sequences.
  • Pad sequences to handle variable sentence lengths.
  • Understand and implement nn.Embedding in PyTorch.

3. The NLP Workflow

Before training an NLP model (like a spam detector or sentiment analyzer), text typically goes through four steps:
  1. Cleaning: Removing punctuation and HTML tags, and converting everything to lowercase.
  2. Tokenization: Splitting a sentence into individual words (tokens).
  3. Vocabulary Building: Assigning a unique integer to every unique word in your entire dataset (e.g., "I" = 1, "love" = 2).
  4. Sequencing & Padding: Replacing words with their integer IDs. Every sequence in a batch must have the same length so the batch can be stored as a single tensor, so if one sentence is 5 words and another is 10, we add 0s to the short sentence to make them equal length.
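
The code in the next section covers tokenization, vocabulary building, and padding, but not the cleaning step. A minimal sketch of cleaning, using only the Python standard library, might look like the following. The helper names clean_text and tokenize are hypothetical, and the regular expressions are deliberately simple: they assume ASCII English text and would need adjusting for real-world HTML or other languages.

```python
import html
import re

def clean_text(text: str) -> str:
    """Lowercase the text and strip HTML tags and punctuation (simplified sketch)."""
    text = html.unescape(text)                # decode entities such as "&amp;"
    text = re.sub(r"<[^>]+>", " ", text)      # replace HTML tags with a space
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace punctuation with a space
    return " ".join(text.split())             # collapse repeated whitespace

def tokenize(text: str) -> list:
    """Clean the text, then split it into word tokens on whitespace."""
    return clean_text(text).split()

raw = "<p>I LOVE this movie!</p>"
tokens = tokenize(raw)
print(tokens)  # ['i', 'love', 'this', 'movie']
```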

4. Step-by-Step: Tokenization and Sequencing

PyTorch has a companion library called torchtext, but it is no longer actively developed. To understand the basics, we will build a simple vocabulary manually.
python
import torch

# 1. Mock Dataset
sentences = [
    "I love deep learning",
    "I love PyTorch",
    "Deep learning is amazing"
]

# 2. Tokenization & Vocabulary Building
vocab = {"<PAD>": 0, "<OOV>": 1} # PADDING and Out-Of-Vocabulary tokens
word_idx = 2

for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocab:
            vocab[word] = word_idx
            word_idx += 1

print("Vocabulary:", vocab)
# Output: {'<PAD>': 0, '<OOV>': 1, 'i': 2, 'love': 3, 'deep': 4, 'learning': 5, ...}

# 3. Sequencing and Padding
def text_to_tensor(text, vocab, max_len=5):
    sequence = [vocab.get(w, vocab["<OOV>"]) for w in text.lower().split()]
    # Padding: Add 0s to the end if it's too short
    while len(sequence) < max_len:
        sequence.append(vocab["<PAD>"])
    # Truncate if it's too long
    return torch.tensor(sequence[:max_len])

tensor_seq = text_to_tensor("I love PyTorch", vocab)
print("Sequence Tensor:", tensor_seq)
# Output: tensor([2, 3, 6, 0, 0]) -> Padded perfectly!
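
The function above pads one sentence to a fixed max_len. When preparing a whole batch, a common alternative is to pad every sentence to the length of the longest one in that batch. The sketch below uses torch.nn.utils.rnn.pad_sequence for this; the vocabulary is copied from the example above, the encode helper is a hypothetical name, and "transformers" is included only to show the <OOV> fallback.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Same vocabulary that the manual loop above produces
vocab = {"<PAD>": 0, "<OOV>": 1, "i": 2, "love": 3, "deep": 4,
         "learning": 5, "pytorch": 6, "is": 7, "amazing": 8}

def encode(text, vocab):
    """Map each word to its ID, falling back to <OOV> for unseen words."""
    ids = [vocab.get(w, vocab["<OOV>"]) for w in text.lower().split()]
    return torch.tensor(ids, dtype=torch.long)

batch_sentences = [
    "I love PyTorch",
    "Deep learning is amazing",
    "I love transformers",  # "transformers" is not in the vocabulary
]

encoded = [encode(s, vocab) for s in batch_sentences]

# Pad every sequence with <PAD> up to the longest sequence in the batch
batch = pad_sequence(encoded, batch_first=True, padding_value=vocab["<PAD>"])
print(batch)
# tensor([[2, 3, 6, 0],
#         [4, 5, 7, 8],
#         [2, 3, 1, 0]])
print(batch.shape)  # torch.Size([3, 4])
```

Note that a model with a fixed-size Linear layer after flattening, like the one later in this chapter, still needs a fixed max_len; batch-level padding suits models that can handle variable sequence lengths.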

5. The Problem with Integers

We successfully turned words into numbers. However, the IDs are arbitrary labels, and a network that consumes them directly treats them as quantities. If "Apple" is 1 and "Banana" is 100, a Linear layer will multiply those values by its weights as if a Banana were 100 times greater than an Apple. This makes no logical sense. We need a way to tell the network the *meaning* of the words.
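
To make the problem concrete, here is a small illustrative experiment. It assumes we (incorrectly) feed raw word IDs as floats into a single-weight Linear layer without a bias; the IDs 1 and 100 come from the example above, and the variable names are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One weight, no bias: the output is simply weight * input
layer = nn.Linear(in_features=1, out_features=1, bias=False)

apple_id = torch.tensor([[1.0]])     # "Apple"  -> ID 1
banana_id = torch.tensor([[100.0]])  # "Banana" -> ID 100

out_apple = layer(apple_id)
out_banana = layer(banana_id)

# The arbitrary ID is treated as a magnitude, so the outputs differ by a factor of about 100
ratio = (out_banana / out_apple).item()
print(round(ratio, 2))
```

Whatever weight the layer learns, the output for "Banana" is about 100 times the output for "Apple", purely because of how the IDs happened to be assigned.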

6. Word Embeddings (nn.Embedding)

Embeddings are a core building block of modern NLP. Instead of a single integer, an Embedding layer maps every word ID to a dense vector of floating-point numbers, and those vectors are learned during training. As an intuition, imagine a 2D graph with "Technology" on the X-axis and "Fruit" on the Y-axis.
  • "Apple" (the fruit) might be plotted at [0.1, 0.9].
  • "Banana" might be plotted at [0.1, 0.95].
Because the two points sit close together on the graph, the network can treat the words as having similar meanings. In practice, embeddings have many more dimensions, and the individual axes rarely correspond to a human-readable concept.
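
One common way to quantify "close together" is cosine similarity. The sketch below reuses the two toy vectors from the text and adds a hypothetical third word, "laptop", with made-up coordinates near the "Technology" axis for contrast. None of these numbers come from a trained model.

```python
import torch
import torch.nn.functional as F

# Toy 2D vectors: [technology, fruit]
vectors = {
    "apple": torch.tensor([0.1, 0.9]),
    "banana": torch.tensor([0.1, 0.95]),
    "laptop": torch.tensor([0.9, 0.1]),  # hypothetical, for contrast only
}

def similarity(word_a, word_b):
    """Cosine similarity between two toy word vectors (1.0 = same direction)."""
    return F.cosine_similarity(vectors[word_a], vectors[word_b], dim=0).item()

sim_fruit = similarity("apple", "banana")
sim_mixed = similarity("apple", "laptop")

print(f"apple vs banana: {sim_fruit:.3f}")  # very close to 1.0
print(f"apple vs laptop: {sim_mixed:.3f}")  # much lower, roughly 0.22
```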

7. Adding an Embedding Layer in PyTorch

In PyTorch, adding an Embedding layer takes one line. It acts as a lookup table (row i of its weight matrix is the vector for word ID i) and is normally the first layer in an NLP model.
python
import torch.nn as nn

class NLPModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        
        # 1. The Embedding Layer
        # num_embeddings = Total words in dictionary
        # embedding_dim = How many dimensions (axes) the semantic graph has
        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)
        
        # 2. Flatten the embeddings so Linear layers can read them
        self.flatten = nn.Flatten()
        
        # 3. Dense classification layers (Assuming sequences are exactly 5 words long)
        self.fc = nn.Linear(in_features=embedding_dim * 5, out_features=16)
        self.relu = nn.ReLU()
        self.output = nn.Linear(16, 1) # Binary sentiment (Positive or Negative)

    def forward(self, x):
        # x starts as a batch of integer IDs with shape (batch_size, 5), e.g. [[2, 3, 6, 0, 0]]
        x = self.embedding(x)
        # Now x has shape (batch_size, 5, embedding_dim): one learned float vector per token
        x = self.flatten(x)
        x = self.relu(self.fc(x))
        x = self.output(x)
        return x

model = NLPModel(vocab_size=len(vocab), embedding_dim=8)
print(model)
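
To see what happens to the tensor shapes inside forward, the sketch below runs a small batch through a standalone Embedding layer followed by nn.Flatten. It assumes a vocabulary of 9 entries (as built earlier) and 8 embedding dimensions. The padding_idx=0 argument is an optional extra that the model above does not use: it keeps the <PAD> row at zeros and excludes it from gradient updates.

```python
import torch
import torch.nn as nn

# 9 vocabulary entries, 8 dimensions per word; row 0 is reserved for <PAD>
embedding = nn.Embedding(num_embeddings=9, embedding_dim=8, padding_idx=0)
flatten = nn.Flatten()

# A batch of 2 padded sequences, each 5 tokens long (integer IDs)
batch = torch.tensor([[2, 3, 6, 0, 0],
                      [4, 5, 7, 8, 0]])

embedded = embedding(batch)
print(embedded.shape)  # torch.Size([2, 5, 8])

flat = flatten(embedded)
print(flat.shape)      # torch.Size([2, 40]) -> matches in_features = embedding_dim * 5

# With padding_idx=0, every <PAD> position maps to an all-zero vector
pad_is_zero = bool((embedded[0, 3] == 0).all())
print(pad_is_zero)     # True
```

Note that nn.Flatten keeps the first (batch) dimension, so the input needs a batch dimension even for a single sentence, for example by calling unsqueeze(0) on a 1D tensor of IDs.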

8. Common Mistakes

  • Forgetting the <OOV> token: When your model is deployed, users will type words that weren't in your training data. Without an Out-Of-Vocabulary (<OOV>) fallback, a plain dictionary lookup such as vocab[word] raises a KeyError for any unseen word.
  • Applying Embeddings to Floats: The input to an nn.Embedding layer must be a tensor of integer IDs (dtype=torch.long or torch.int). If you pass floats, PyTorch will throw an error because it uses these IDs as exact row indexes in a lookup table.
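
Both mistakes are easy to reproduce. The sketch below uses a small hypothetical vocabulary and an untrained Embedding layer; it shows the <OOV> fallback via dict.get and then the failure (and fix) when float IDs are passed to the layer.

```python
import torch
import torch.nn as nn

# Mistake 1: looking up an unseen word without an <OOV> fallback
vocab = {"<PAD>": 0, "<OOV>": 1, "i": 2, "love": 3}
# vocab["transformers"] would raise KeyError; dict.get with a default avoids that
word_id = vocab.get("transformers", vocab["<OOV>"])
print(word_id)  # 1

# Mistake 2: passing float IDs to an Embedding layer
embedding = nn.Embedding(num_embeddings=9, embedding_dim=4)
float_ids = torch.tensor([2.0, 3.0, 6.0])

try:
    embedding(float_ids)
    failed = False
except (RuntimeError, TypeError) as err:
    failed = True
    print("Lookup failed:", type(err).__name__)

# Fix: cast the IDs to an integer dtype before the lookup
int_ids = float_ids.long()
word_vectors = embedding(int_ids)
print(word_vectors.shape)  # torch.Size([3, 4])
```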

9. Best Practices

  • Pre-trained Embeddings: Just like Transfer Learning for images, you don't have to train Embeddings from scratch. You can download pre-trained Embeddings (like GloVe or Word2Vec) that were trained on corpora containing billions of words, and load them into your nn.Embedding layer with nn.Embedding.from_pretrained.
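
Loading pre-trained vectors can be sketched with nn.Embedding.from_pretrained. In the example below, the weight matrix is random and only stands in for real GloVe or Word2Vec vectors; in a real project you would fill each row with the downloaded vector for the matching word in your vocabulary. The vocabulary and the 8-dimensional size are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab = {"<PAD>": 0, "<OOV>": 1, "i": 2, "love": 3, "pytorch": 4}

# Stand-in for real pre-trained vectors: one 8-dimensional row per vocabulary entry
pretrained_matrix = torch.randn(len(vocab), 8)
pretrained_matrix[vocab["<PAD>"]] = 0.0  # keep the padding vector at zero

# freeze=True keeps the vectors fixed during training; set freeze=False to fine-tune them
embedding = nn.Embedding.from_pretrained(
    pretrained_matrix, freeze=True, padding_idx=vocab["<PAD>"]
)

print(embedding.weight.requires_grad)  # False

ids = torch.tensor([[2, 3, 4, 0, 0]])
print(embedding(ids).shape)            # torch.Size([1, 5, 8])
```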

10. Exercises

  1. What does padding a sequence mean, and why is it mandatory for Neural Networks?
  2. Explain why representing the word "Dog" as the integer 5 and "Cat" as the integer 10 causes mathematical problems in a standard Linear network.

11. MCQ Quiz with Answers

Question 1

In the NLP workflow, what is the process of converting a sentence into a sequence of individual integer IDs called?

Question 2

What is the primary purpose of a PyTorch nn.Embedding layer?

12. Interview Questions

  • Q: Explain the step-by-step preprocessing pipeline required to feed a raw text string into a PyTorch neural network.
  • Q: What is an Out-Of-Vocabulary (OOV) token, and why is it crucial for production NLP models?

13. FAQs

Q: Do people still write tokenizers from scratch like this? A: Rarely. In production, we typically use well-tested subword tokenizers from libraries like Hugging Face Tokenizers and Transformers. However, building one from scratch is a good way to understand what those libraries are doing under the hood.

14. Summary

Before a model can process language, text must be converted into numbers. By cleaning text, tokenizing it, mapping tokens to integer sequences, padding those sequences to a uniform length, and passing them through Word Embeddings, we translate human vocabulary into the dense vectors that PyTorch models operate on.

15. Next Chapter Recommendation

We have encoded the words, but our simple Linear network still reads the sentence all at once via flattening. It has no built-in notion of sequence: each position is just another input feature, so nothing it learns about a word in one position carries over to the same word elsewhere ("The dog bit the man" vs "The man bit the dog"). In Chapter 14: Recurrent Neural Networks (RNN), we will learn how to build models with a concept of time and memory.
