CHAPTER 08
Beginner
Stemming and Lemmatization
Updated: May 14, 2026
20 min read
1. Introduction
In English, a single concept can be written in many different ways depending on tense and grammar. For example: "Run", "Running", "Ran", and "Runs". To a computer, these look like four entirely different words, which inflates the vocabulary and splits what the model learns about one concept across four separate entries. In this chapter, we will learn how Stemming and Lemmatization reduce words toward a shared root form so that related forms can be treated as the same word. Irregular forms such as "Ran" need Lemmatization; Stemming cannot reach them by chopping suffixes.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Explain the concept of reducing words to their root form.
- Define Stemming and understand its crude, rule-based approach.
- Define Lemmatization and understand its dictionary-based approach.
- Compare the pros and cons of both techniques.
3. Beginner-Friendly Explanation
Imagine you are sorting a massive pile of Lego bricks. You see a "blue 4-peg brick", a "scratched blue 4-peg brick", and a "shiny blue 4-peg brick". To build a wall, you don't care about the scratches or the shine; you just need to know it's a "blue 4-peg brick." In NLP, we have words like "playing", "played", and "plays". We don't care about the tense. We just want the computer to know the core action is "play". Stemming takes a chainsaw and chops the end of the word off. Lemmatization takes a scalpel and carefully transforms the word into its dictionary form.
4. What is Stemming?
Stemming is a fast, crude, rule-based process that chops suffixes off the end of words. The most famous algorithm is the Porter Stemmer, published by Martin Porter in 1980.
- Playing -> chops off "ing" -> Play
- Played -> chops off "ed" -> Play
- University -> chops off "ity" -> Univers
- Universe -> chops off "e" -> Univers
Notice that the last two results are not real words, and that two words with different meanings end up with the same stem.
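The suffix-chopping idea can be sketched in a few lines of Python. This is a toy written for this chapter, not the Porter algorithm: the suffix list, its order, and the minimum stem length are all assumptions chosen so the four examples above come out the same way.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# The suffix list and the minimum stem length are assumptions chosen for this demo.
SUFFIXES = ("ing", "ity", "ed", "es", "s", "e")  # checked in this order
MIN_STEM_LENGTH = 3  # never cut a word down to fewer than 3 letters


def toy_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix, if any."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word


for w in ["Playing", "Played", "University", "Universe"]:
    print(w, "->", toy_stem(w))
```

The toy also shows why stemming is called crude: it turns "running" into "runn" and leaves "ran" untouched, so neither is linked to "run". Real stemmers such as Porter's add extra rules (for example, for doubled consonants), but they still cannot handle irregular forms.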
5. What is Lemmatization?
Lemmatization is a smarter, slower process. It uses the word's part of speech and a built-in dictionary (such as WordNet) to find the Lemma, the form you would look up in a dictionary.
- Playing (verb) -> dictionary lookup -> Play
- Better (adjective) -> dictionary lookup -> Good
- Geese (noun) -> dictionary lookup -> Goose
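To make the "dictionary lookup" idea concrete, here is a minimal sketch in which a tiny hand-written table stands in for a real dictionary. The table entries and the function name `toy_lemmatize` are made up for illustration; a real lemmatizer ships with a far larger dictionary and extra rules.

```python
# Toy lemmatizer: a tiny hand-written lookup table stands in for a real
# dictionary such as WordNet. The entries below are illustrative assumptions.
LEMMA_TABLE = {
    ("playing", "verb"): "play",
    ("better", "adjective"): "good",
    ("geese", "noun"): "goose",
    ("leaves", "noun"): "leaf",
    ("leaves", "verb"): "leave",
}


def toy_lemmatize(word: str, pos: str = "noun") -> str:
    """Return the dictionary form for (word, pos), or the word itself if unknown."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(toy_lemmatize("Geese"))                # goose
print(toy_lemmatize("Better", "adjective"))  # good
print(toy_lemmatize("leaves", "noun"))       # leaf
print(toy_lemmatize("leaves", "verb"))       # leave
print(toy_lemmatize("Playing"))              # playing (looked up as a noun, no entry)
```

The last call shows why the part of speech matters: the same spelling can have different lemmas, or none at all, depending on how it is tagged.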
6. Stemming vs Lemmatization
| Feature | Stemming (Chainsaw) | Lemmatization (Scalpel) |
|---|---|---|
| Speed | Very Fast | Slower |
| Accuracy | Lower (can produce non-words) | Higher (returns real dictionary words) |
| Method | Chops off suffixes | Dictionary lookup plus part of speech |
| Example | Studies -> Studi | Studies -> Study |
7. When to Use Which?
- Use Stemming: When you are building a massive search engine indexing billions of pages and speed is your number one priority. Both the pages and the query are stemmed, so a search for "Universe" becomes "Univers" and returns results for both Universe and University: more matches, but some of them irrelevant.
- Use Lemmatization: When building a Chatbot, an AI summarizer, or a Sentiment Analyzer where actual human meaning and precision are required.
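The search-engine case can be sketched as a tiny inverted index. Everything here is hypothetical: the two sample pages, the function names, and especially `crude_stem`, which is a three-rule stand-in for a real stemmer and exists only to keep the example short.

```python
from collections import defaultdict


# Hypothetical mini search index. The three-rule "stemmer" below is a stand-in
# for a real algorithm such as Porter's and exists only to keep the demo short.
def crude_stem(word: str) -> str:
    word = word.lower()
    for suffix in ("ity", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each stem to the set of page names that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in pages.items():
        for token in text.split():
            index[crude_stem(token)].add(name)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Stem the query the same way the pages were stemmed, then look it up."""
    return index.get(crude_stem(query), set())


pages = {
    "astronomy": "the universe is expanding",
    "admissions": "apply to the university",
}
index = build_index(pages)
print(sorted(search(index, "Universe")))  # ['admissions', 'astronomy']
```

The query for "Universe" also returns the admissions page, because both words share the stem "univers". That is the trade-off of stemming in search: cheap, broad matching at the cost of some irrelevant hits.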
8. Python Examples
Using NLTK, we can see the stark difference between the two techniques.
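The sketch below is one way to run the comparison. It assumes NLTK is installed and that the WordNet data can be downloaded (some NLTK versions also ask for the `omw-1.4` package). The helper names `stem_all` and `lemmatize_all` and the word lists are made up for this example.

```python
# Assumes NLTK is installed (pip install nltk). The word lists are illustrative.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()


def stem_all(words):
    """Apply the Porter stemmer to each word."""
    return [stemmer.stem(w) for w in words]


def lemmatize_all(tagged_words):
    """Lemmatize (word, pos) pairs; pos is a WordNet tag: 'n', 'v', 'a' or 'r'."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(w, pos=p) for w, p in tagged_words]


if __name__ == "__main__":
    # The lemmatizer needs the WordNet data, downloaded once per machine.
    nltk.download("wordnet", quiet=True)

    words = ["playing", "played", "university", "universe"]
    print("Stems: ", stem_all(words))

    tagged = [("playing", "v"), ("better", "a"), ("geese", "n")]
    print("Lemmas:", lemmatize_all(tagged))
```

The stemmer needs no extra data and works on bare strings, while the lemmatizer needs both a dictionary download and a part-of-speech tag for each word. That difference in inputs is a large part of the speed gap between the two.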
9. Mini Project
Act as the Algorithm: Look at the following three words: ["flying", "flies", "flew"]
- 1. What would a crude Stemmer likely output for "flies"? *(Probably "fli")*.
- 2. What would a Lemmatizer output for all three words when they are treated as verbs? *(Answer: "fly")*.
10. Best Practices
- Part of Speech matters in Lemmatization: You often have to tell the Lemmatizer whether the word is a Verb or a Noun. NLTK's WordNet lemmatizer, for example, assumes a noun unless you say otherwise. If you pass the word "leaves" as a noun, the Lemma is "leaf". If you pass it as a verb, the Lemma is "leave".
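In practice the part of speech usually comes from a tagger, and many English taggers emit Penn Treebank tags such as "VBZ" or "NNS", while the WordNet lemmatizer expects the single letters 'n', 'v', 'a', or 'r'. A small conversion helper bridges the two. The function name `penn_to_wordnet` is hypothetical, and matching on the first letter is a deliberate simplification.

```python
# Hypothetical helper: convert Penn Treebank tags (the tag set produced by many
# English POS taggers, e.g. "VBD", "NNS", "JJ") to the single-letter codes the
# WordNet lemmatizer accepts. Matching on the first letter is a simplification.
def penn_to_wordnet(tag: str) -> str:
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for everything else


print(penn_to_wordnet("VBZ"))  # v
print(penn_to_wordnet("NNS"))  # n
```

Falling back to noun mirrors the lemmatizer's own default, so an unknown tag never makes things worse than passing no tag at all.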
11. Common Mistakes
- Using Stemming for Text Generation: Never use Stemming if the final output will be read by a human. The human will see chopped up gibberish words like "comput" instead of "computer".
12. Exercises
- 1. Explain why Stemming would fail to link the words "Mouse" and "Mice" together, but Lemmatization would succeed.
13. Coding Challenges
Challenge 1: Write pseudocode for a simple text normalization pipeline that takes a raw string, tokenizes it, and applies lemmatization.
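One possible shape for such a pipeline, written as runnable Python rather than pseudocode. The regex tokenizer and the four-entry lookup table are assumptions for illustration only; in a real pipeline you would pass a library lemmatizer in through the `lemmatize` parameter.

```python
import re
from typing import Callable

# Minimal normalization pipeline sketch. The regex tokenizer and the tiny
# lookup table are assumptions for illustration; a real pipeline would plug in
# a library tokenizer and lemmatizer instead.
TOY_LEMMAS = {"geese": "goose", "mice": "mouse", "ran": "run", "flew": "fly"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep runs of letters (and apostrophes) as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def toy_lemmatize(token: str) -> str:
    """Look the token up in the toy table; unknown tokens pass through."""
    return TOY_LEMMAS.get(token, token)


def normalize(text: str, lemmatize: Callable[[str], str] = toy_lemmatize) -> list[str]:
    """raw string -> tokens -> lemmas"""
    return [lemmatize(token) for token in tokenize(text)]


print(normalize("The geese flew; the mice ran!"))
# ['the', 'goose', 'fly', 'the', 'mouse', 'run']
```

Keeping the lemmatizer as a parameter means the same pipeline can be tested with the toy table and later switched to a real dictionary-backed lemmatizer without changing `normalize`.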
14. MCQs with Answers
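Question 1
Which technique reduces words to their root by crudely chopping off suffixes based on a set of rules?
Answer: Stemming.
Question 2
Why is Lemmatization generally preferred over Stemming for complex NLP tasks?
Answer: It returns real dictionary words (lemmas) rather than chopped stems, so the meaning of the text is preserved.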
15. Interview Questions
- Q: Compare and contrast Stemming and Lemmatization. Provide an example where Stemming fails but Lemmatization succeeds.
- Q: In a scenario where processing speed is critical and millions of documents are analyzed per second, which text normalization technique would you choose and why?