TensorFlow Introduction
CHAPTER 13 Intermediate

Natural Language Processing Basics

Updated: May 16, 2026


1. Introduction

Neural networks operate on numbers, not text. If you pass the raw string "I love this movie!" to a Dense layer, TensorFlow raises an error instead of producing a prediction. To teach a machine to read, we must first translate human language into numbers. This field of AI is called Natural Language Processing (NLP). In this chapter, we will learn the standard pipeline for preparing text data: cleaning, tokenization, sequencing, and padding, followed by the concept of Word Embeddings.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Explain the standard NLP preprocessing workflow.
  • Implement Tokenization using Keras.
  • Convert sentences into numerical sequences.
  • Pad sequences to handle variable sentence lengths.
  • Understand the concept of Word Embeddings.

3. The NLP Workflow

Before training an NLP model (like a Spam detector or Sentiment Analyzer), text must go through four strict steps:
  1. Cleaning: Removing punctuation, HTML tags, and making everything lowercase.
  2. Tokenization: Splitting a sentence into individual words (Tokens) and assigning a unique integer to every word in your vocabulary (e.g., "I" = 1, "love" = 2).
  3. Sequencing: Replacing the words in the original sentence with their integer IDs.
  4. Padding: Neural networks require fixed-size inputs. If one sentence is 5 words and another is 10, we add 0s to the short sentence to make them equal length.
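
The four steps above can be sketched in plain Python to make each one concrete. This is only an illustration, not the Keras implementation: the helper names (`clean`, `build_vocab`, `to_sequence`, `pad`) and the two sample sentences are assumptions, and IDs are assigned in order of first appearance rather than by word frequency.

```python
import re

def clean(text):
    """Step 1: lowercase the text, then strip HTML tags and punctuation."""
    text = re.sub(r"<[^>]+>", " ", text.lower())
    return re.sub(r"[^a-z0-9\s]", "", text)

def build_vocab(texts):
    """Step 2: give every unique word an integer ID, starting at 1 (0 is kept for padding)."""
    vocab = {}
    for text in texts:
        for word in clean(text).split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def to_sequence(text, vocab):
    """Step 3: replace each known word with its integer ID."""
    return [vocab[word] for word in clean(text).split() if word in vocab]

def pad(sequence, length):
    """Step 4: append zeros (or cut the tail) so every sequence has the same length."""
    return (sequence + [0] * length)[:length]

texts = ["I love this movie!", "<b>Great</b> film."]
vocab = build_vocab(texts)
padded = [pad(to_sequence(t, vocab), 5) for t in texts]

print(vocab)   # {'i': 1, 'love': 2, 'this': 3, 'movie': 4, 'great': 5, 'film': 6}
print(padded)  # [[1, 2, 3, 4, 0], [5, 6, 0, 0, 0]]
```

Reserving 0 for padding means a padded position can never be confused with a real word.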

4. Step-by-Step Implementation: Tokenization and Sequencing

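Keras provides a built-in Tokenizer that handles tokenization and sequencing, plus basic cleaning (lowercasing and punctuation removal, but not HTML stripping). A separate pad_sequences helper handles padding. Recent TensorFlow documentation marks this Tokenizer as deprecated in favor of the TextVectorization layer, but it remains a clear way to see each step.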
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# 1. Mock Dataset
sentences = [
    "I love machine learning!",
    "I love TensorFlow.",
    "Deep learning is amazing."
]

# 2. Initialize Tokenizer
# num_words=100 limits encoding to the most frequent words (IDs below 100)
# oov_token specifies a placeholder for words the Tokenizer has never seen before
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")

# 3. Build the Vocabulary
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print("Vocabulary Dictionary:", word_index)
# Output: {'<OOV>': 1, 'i': 2, 'love': 3, 'learning': 4, 'machine': 5, ...}

# 4. Convert Sentences to Sequences of Numbers
sequences = tokenizer.texts_to_sequences(sentences)
print("\nSequences:", sequences)
# "I love TensorFlow." becomes [2, 3, 6]

# 5. Padding (Make all sentences the exact same length)
# 'post' means add zeros at the end. maxlen forces all sequences to length 5.
padded_sequences = pad_sequences(sequences, padding='post', maxlen=5)
print("\nPadded Sequences:\n", padded_sequences)
# Output for "I love TensorFlow." -> [2, 3, 6, 0, 0]
```

*We now have a perfect NumPy matrix of integers ready for a neural network!*
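
One case the example does not exercise is a sentence longer than maxlen. pad_sequences is documented with padding='pre' and truncating='pre' as its defaults, so unless you say otherwise it drops tokens from the start of a long sequence. The sketch below imitates that behavior in plain Python; `pad_or_truncate` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre"):
    """Plain-Python imitation of the padding/truncating options (illustrative only)."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the front, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    zeros = [0] * (maxlen - len(seq))
    return zeros + seq if padding == "pre" else seq + zeros

long_sentence = [7, 4, 8, 9, 2, 3, 6]   # seven token IDs, longer than maxlen=5
short_sentence = [2, 3, 6]

print(pad_or_truncate(long_sentence, 5))                     # [8, 9, 2, 3, 6]
print(pad_or_truncate(long_sentence, 5, truncating="post"))  # [7, 4, 8, 9, 2]
print(pad_or_truncate(short_sentence, 5, padding="post"))    # [2, 3, 6, 0, 0]
```

Which end to cut depends on the data: if the important words tend to come first, truncating='post' keeps them.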

5. The Problem with Integers

We successfully turned words into numbers. However, these integer IDs are arbitrary labels. If "Apple" is 1 and "Banana" is 100, a network that treats the ID as a numeric input will behave as if a Banana were 100 times greater than an Apple, even though the numbers say nothing about either word. We need a way to tell the network the *meaning* of the words.
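
A tiny illustration of the problem, assuming a single linear unit that multiplies its input by a weight. The weight value 0.5 is arbitrary and chosen only for this sketch.

```python
# IDs taken from the paragraph above; the weight is an arbitrary illustrative value.
word_ids = {"apple": 1, "banana": 100}
weight = 0.5

# A single linear unit computes weight * input, so its output scales with the ID itself.
outputs = {word: weight * idx for word, idx in word_ids.items()}
print(outputs)  # {'apple': 0.5, 'banana': 50.0}
```

Swapping the two IDs would swap the outputs, even though nothing about the words themselves changed.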

6. Word Embeddings

Embeddings are the secret sauce of modern NLP. Instead of a single integer, an Embedding layer converts every word into a dense vector of numbers (an array). Imagine a 2D graph with "Fruit" on the Y-axis and "Technology" on the X-axis.
  • "Apple" (the fruit) might be plotted at [0.1, 0.9].
  • "Banana" might be plotted at [0.1, 0.95].
Because they are physically plotted next to each other on the graph, the neural network understands they have similar meanings!
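
The idea of "plotted next to each other" can be checked numerically by measuring the distance between vectors. The sketch below uses the two toy vectors from the text, ordered as [technology, fruit]. The "laptop" vector is a made-up third point added only for contrast.

```python
import math

# Toy 2-D vectors as [technology, fruit]; "laptop" is a made-up extra point for contrast.
embeddings = {
    "apple":  [0.1, 0.9],
    "banana": [0.1, 0.95],
    "laptop": [0.9, 0.1],
}

def distance(a, b):
    """Euclidean distance between two word vectors."""
    return math.dist(embeddings[a], embeddings[b])

print(round(distance("apple", "banana"), 3))  # 0.05
print(round(distance("apple", "laptop"), 3))  # 1.131
```

Real embeddings work the same way, only with many more dimensions and with values learned during training rather than written by hand.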

7. Adding an Embedding Layer

In Keras, adding an Embedding layer is straightforward. It takes integer token IDs as input, so it is normally the very first layer in your NLP model.
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

vocab_size = 100  # Size of the vocabulary (largest token ID + 1)
embed_dim = 16    # How many dimensions (axes) the graph has
max_length = 5    # The length of our padded sequences

model = Sequential([
    # 1. The Embedding Layer
    # Note: newer Keras versions deprecate input_length; it can be omitted there.
    Embedding(input_dim=vocab_size, output_dim=embed_dim, input_length=max_length),

    # 2. Flatten the embeddings so Dense layers can read them
    Flatten(),

    # 3. Dense classification layers
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')  # Binary sentiment (Positive or Negative)
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```
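
To see what the layers in this model do to one padded sentence, the sketch below imitates the Embedding lookup and the Flatten step in plain Python, then counts the trainable parameters by hand. It is an illustration under assumptions: the random `table` stands in for learned weights, and the names `table`, `embedded`, and `flat` are not Keras objects.

```python
import random

vocab_size, embed_dim, max_length = 100, 16, 5

# An embedding layer's weights form a (vocab_size x embed_dim) table.
# Random values stand in for the learned ones here.
random.seed(0)
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
         for _ in range(vocab_size)]

padded = [2, 3, 6, 0, 0]                     # "I love TensorFlow." after padding
embedded = [table[i] for i in padded]        # lookup: one 16-number row per token ID
flat = [x for row in embedded for x in row]  # what Flatten hands to the Dense layer
print(len(embedded), len(embedded[0]), len(flat))  # 5 16 80

# Trainable parameters per layer (weights plus biases for the Dense layers)
embedding_params = vocab_size * embed_dim    # 100 * 16 = 1600
dense1_params = len(flat) * 16 + 16          # 80 * 16 + 16 = 1296
dense2_params = 16 * 1 + 1                   # 17
print(embedding_params + dense1_params + dense2_params)  # 2913
```

The padding ID 0 also gets its own row in the table, which is why the vocabulary size must cover the largest token ID plus one.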

8. Common Mistakes

  • Forgetting oov_token: When your model is deployed, users will type words that weren't in your training data. If you don't use an Out-Of-Vocabulary (<OOV>) token, the Tokenizer will completely drop those words, ruining the structure of the sentence.
  • Fitting the Tokenizer on the Test Data: You must only run tokenizer.fit_on_texts() on the X_train data. If you fit it on the test data, it's Data Leakage. The test data should only be processed using texts_to_sequences.
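
Both mistakes can be seen in a small plain-Python sketch. The helpers `fit_vocab` and `encode` and the sample sentences are hypothetical stand-ins for the Tokenizer, written only to show the effect: the vocabulary is built from training text alone, and an unseen test word either maps to the OOV ID or silently disappears.

```python
OOV = "<OOV>"

def fit_vocab(train_texts):
    """Build the vocabulary from training text only; ID 1 is reserved for unknown words."""
    vocab = {OOV: 1}
    for text in train_texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode(text, vocab, use_oov=True):
    """Map words to IDs; unseen words become the OOV ID, or are dropped if use_oov is False."""
    ids = []
    for word in text.lower().split():
        if word in vocab:
            ids.append(vocab[word])
        elif use_oov:
            ids.append(vocab[OOV])
    return ids

train = ["i love this movie", "i hate this film"]
vocab = fit_vocab(train)              # fitted on the training data only
test_sentence = "i adore this movie"  # "adore" never appeared in training

print(encode(test_sentence, vocab))                 # [2, 1, 4, 5]
print(encode(test_sentence, vocab, use_oov=False))  # [2, 4, 5]
```

With the OOV ID the sentence keeps its four positions. Without it the sequence shrinks to three, and every later word shifts one place to the left.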

9. Best Practices

  • Pre-trained Embeddings: Just like Transfer Learning for images, you can download pre-trained Embeddings (like GloVe or Word2Vec) that already know the relationships between billions of English words, rather than forcing your Embedding layer to learn them from scratch.

10. Exercises

  1. What does the pad_sequences function do, and why is it mandatory for Neural Networks?
  2. Explain why representing the word "Dog" as the integer 5 and "Cat" as the integer 10 causes problems in a standard Dense neural network.

11. MCQ Quiz with Answers

Question 1

In the NLP workflow, what is the process of converting a sentence into a sequence of individual integer IDs called?

Question 2

What is the primary purpose of a Keras Embedding layer?

12. Interview Questions

  • Q: Explain the step-by-step preprocessing pipeline required to feed a raw text string into a Keras Dense neural network.
  • Q: What is an Out-Of-Vocabulary (OOV) token, and why is it crucial for production NLP models?

13. FAQs

Q: Does Tokenization work for languages other than English?
A: Yes, but it requires different rules. Languages like Chinese don't use spaces between words, so standard space-based tokenizers fail. You have to use specialized libraries (like jieba for Chinese) before feeding the data into Keras.
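
A quick way to see the problem is to split an English and a Chinese sentence on whitespace. The sketch uses only Python's built-in str.split, and the two sentences are illustrative examples.

```python
english = "I love machine learning"
chinese = "我爱机器学习"  # the same sentence, written without spaces

print(english.split())  # ['I', 'love', 'machine', 'learning']
print(chinese.split())  # ['我爱机器学习']
```

The whole Chinese sentence comes back as a single "word", which is why a dedicated segmenter is needed before the text reaches Keras.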

14. Summary

Before a model can work with language, the text must be converted into numbers. By cleaning text, tokenizing it into integer sequences, padding it to uniform lengths, and using Word Embeddings to place similar words near each other, we translate human vocabulary into the numerical vectors that TensorFlow requires.

15. Next Chapter Recommendation

We have encoded the words, but our simple Dense network still reads the sentence all at once. It doesn't understand that the order of words matters! ("The dog bit the man" vs "The man bit the dog"). In Chapter 14: Recurrent Neural Networks (RNN), we will learn how to build models with a concept of time and memory.
