CHAPTER 08 Beginner

Stemming and Lemmatization

Updated: May 14, 2026

1. Introduction

In English, a single concept can be written in many different ways depending on tense and grammar, for example "run", "running", "ran", and "runs". To a computer, these are four unrelated strings, which inflates the vocabulary and splits the evidence for one concept across several entries. In this chapter, we will learn how Stemming and Lemmatization reduce words to a common base form so that a model can treat them as the same concept.

2. Learning Objectives

By the end of this chapter, you will be able to:

Explain the concept of reducing words to their root form.

Define Stemming and understand its crude, rule-based approach.

Define Lemmatization and understand its dictionary-based approach.

Compare the pros and cons of both techniques.

3. Beginner-Friendly Explanation

Imagine you are sorting a massive pile of Lego bricks. You see a "blue 4-peg brick", a "scratched blue 4-peg brick", and a "shiny blue 4-peg brick". To build a wall, you don't care about the scratches or the shine; you just need to know it's a "blue 4-peg brick." In NLP, we have words like "playing", "played", and "plays". We don't care about the tense. We just want the computer to know the core action is "play". Stemming takes a chainsaw and chops the end of the word off. Lemmatization takes a scalpel and carefully transforms the word into its dictionary form.

4. What is Stemming?

Stemming is a fast, crude, rule-based process that strips suffixes from words. The best-known algorithm is the Porter Stemmer, published by Martin Porter in 1980.

Playing -> chops off "ing" -> Play

Played -> chops off "ed" -> Play

*The Problem:* Because it just blindly chops letters based on rules, it often creates non-existent words.

University -> chops off "ity" -> Univers

Universe -> chops off "e" -> Univers

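Both words end up with the same stem, so any code that compares stems treats them as the same word even though their meanings differ. This kind of error is called over-stemming.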
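The chopping behaviour described above can be sketched with a toy suffix stripper. This is a minimal illustration, not the Porter algorithm: the rule list `SUFFIX_RULES`, the function `toy_stem`, and the `min_stem_len` parameter are all made up for this example.

```python
# Toy suffix-stripping stemmer. This is NOT the Porter algorithm; the rule
# list is a hypothetical illustration of the "chop the ending" idea.
SUFFIX_RULES = ["ing", "ity", "ed", "es", "s", "e"]  # checked in this order

def toy_stem(word: str, min_stem_len: int = 3) -> str:
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        # Only strip when enough letters remain to form a plausible stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["Playing", "Played", "University", "Universe"]:
    print(w, "->", toy_stem(w))
# Playing -> play
# Played -> play
# University -> univers
# Universe -> univers
```

The rules never look at meaning, which is why "University" and "Universe" collapse to the same non-word.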

5. What is Lemmatization?

Lemmatization is a smarter but slower process. It uses the word's part of speech and a large built-in dictionary (such as WordNet) to find the Lemma, the form under which the word would be listed in a dictionary.

Playing -> dictionary lookup -> Play

Better -> dictionary lookup -> Good

Geese -> dictionary lookup -> Goose

Notice how Stemming could never turn "Better" into "Good" or "Geese" into "Goose", because no amount of chopping off letters produces those words. Lemmatization can, because irregular forms are stored in its dictionary.
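The dictionary lookup can be sketched in a few lines. This is a toy model under stated assumptions: the tables `IRREGULAR_FORMS` and `KNOWN_LEMMAS` are tiny hand-written stand-ins for a real lexical database, and `toy_lemmatize` is a hypothetical helper, not a library function.

```python
# Toy dictionary lemmatizer. Every table entry is illustrative; a real
# lemmatizer relies on a full lexical database instead.
IRREGULAR_FORMS = {"better": "good", "geese": "goose", "mice": "mouse", "ran": "run"}
KNOWN_LEMMAS = {"play", "good", "goose", "mouse", "run"}
REGULAR_SUFFIXES = ["ing", "ed", "s"]

def toy_lemmatize(word: str) -> str:
    """Return the dictionary form of a word, or the word itself if unknown."""
    word = word.lower()
    # 1. Irregular forms cannot be derived by rules, so they are looked up.
    if word in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[word]
    # 2. Regular forms: strip a suffix, but accept the result only if it is
    #    a real entry in the dictionary.
    for suffix in REGULAR_SUFFIXES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in KNOWN_LEMMAS:
                return candidate
    return word

for w in ["Playing", "Better", "Geese", "Glass"]:
    print(w, "->", toy_lemmatize(w))
# Playing -> play
# Better -> good
# Geese -> goose
# Glass -> glass
```

The check against `KNOWN_LEMMAS` is the key difference from stemming: "Glass" is left alone because "glas" is not a dictionary word.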

6. Stemming vs Lemmatization

Feature	Stemming (Chainsaw)	Lemmatization (Scalpel)
Speed	Very Fast	Slower
Accuracy	Low (creates fake words)	High (uses actual dictionary)
Method	Chops off suffixes	Dictionary lookup
Example	`Studies` -> `studi`	`Studies` -> `study`

7. When to Use Which?

Use Stemming: When you are building a massive search engine indexing billions of pages and speed is your number one priority. Both the indexed text and the query are stemmed, so a search for "universities" also finds pages containing "university" and, as a side effect of over-stemming, "universe".

Use Lemmatization: When building a Chatbot, an AI summarizer, or a Sentiment Analyzer where actual human meaning and precision are required.
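The search engine case can be sketched as a small inverted index keyed by stem. The `crude_stem` function below is a hypothetical stand-in for a real stemmer, and the three documents are made up for illustration.

```python
from collections import defaultdict

def crude_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strip one common ending."""
    word = word.lower()
    for suffix in ("ities", "ity", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

documents = {
    1: "The universe is expanding",
    2: "She studies at a university",
    3: "Two universities merged",
}

# Build the inverted index: stem -> set of ids of documents containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.split():
        index[crude_stem(token)].add(doc_id)

def search(query: str) -> list:
    """Stem the query the same way as the documents, then look it up."""
    return sorted(index.get(crude_stem(query), set()))

print(search("universities"))  # [1, 2, 3]
print(search("galaxy"))        # []
```

The query for "universities" finds document 2 and 3 as intended, but also document 1 about the universe. That is the trade-off of stemming: one cheap lookup per word, at the cost of some false matches.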

8. Python Examples

Using NLTK, we can see the stark difference between the two techniques.

```python
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# The lemmatizer needs the WordNet data (downloaded once, then cached)
nltk.download("wordnet", quiet=True)

# Initialize tools
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word = "studies"

# Stemming applies a suffix rule: 'ies' becomes 'i'
print("Stemmed:", stemmer.stem(word))
# Output: Stemmed: studi  <-- Not a real word.

# Lemmatization finds the dictionary form; pos='v' marks the word as a verb
print("Lemmatized:", lemmatizer.lemmatize(word, pos='v'))
# Output: Lemmatized: study  <-- A real dictionary word.
```

9. Mini Project

Act as the Algorithm: Look at the following three words: ["flying", "flies", "flew"]

1. What would a crude Stemmer likely output for "flies"? *(Probably "fli")*.

2. What would a Lemmatizer output for all three words when they are tagged as verbs? *(Answer: "fly")*.
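To see where "fli" comes from, here is a sketch of step 1a of the original Porter algorithm, the step that handles plural-style endings. It covers only this one step, and library implementations may add their own extensions, so treat `porter_step_1a` as an illustration rather than a full stemmer.

```python
def porter_step_1a(word: str) -> str:
    """Apply only step 1a of the original Porter stemmer (plural endings)."""
    if word.endswith("sses"):
        return word[:-2]   # sses -> ss   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]   # ies  -> i    (flies -> fli)
    if word.endswith("ss"):
        return word        # ss   -> ss   (caress stays caress)
    if word.endswith("s"):
        return word[:-1]   # s    -> ""   (cats -> cat)
    return word

for w in ["flying", "flies", "flew"]:
    print(w, "->", porter_step_1a(w))
# flying -> flying   (the "ing" ending is handled by a later step)
# flies -> fli
# flew -> flew       (no rule applies to an irregular past tense)
```

No rule in any step can connect "flew" to "fly", which is why the irregular form needs a dictionary.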

10. Best Practices

Part of Speech matters in Lemmatization: In Python, you often have to tell the Lemmatizer whether the word is a Verb or a Noun. NLTK's WordNetLemmatizer treats every word as a noun unless you pass the pos argument. If you pass the word "leaves" as a noun, the Lemma is "leaf". If you pass it as a verb, the Lemma is "leave".
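The effect of the part of speech can be modelled with a lookup keyed by both the word and its tag. This is a toy sketch: `LEMMA_TABLE` and `lemma_for` are hypothetical, and the tag letters "n", "v", and "a" simply mirror the style of the pos argument used in this chapter's NLTK example.

```python
# Hand-written lookup keyed by (surface form, part-of-speech tag).
LEMMA_TABLE = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("better", "a"): "good",
}

def lemma_for(word: str, pos: str = "n") -> str:
    """Look up the lemma for a word under a given tag (noun by default)."""
    word = word.lower()
    # If there is no entry for this word under this tag, return it unchanged.
    return LEMMA_TABLE.get((word, pos), word)

print(lemma_for("leaves"))        # leaf
print(lemma_for("leaves", "v"))   # leave
print(lemma_for("better"))        # better  (wrong tag, so nothing happens)
print(lemma_for("better", "a"))   # good
```

The third call shows the typical failure: with the wrong tag the word silently comes back unchanged instead of raising an error.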

11. Common Mistakes

Using Stemming for Text Generation: Never use Stemming if the final output will be read by a human. The human will see chopped up gibberish words like "comput" instead of "computer".
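One possible way to avoid showing stems to users is to remember which original words produced each stem and display the most common one. The sketch below assumes the stems have already been computed; the `stemmed_tokens` data and the `display_form` helper are illustrative, not part of any library.

```python
from collections import Counter, defaultdict

# Precomputed (original word, stem) pairs. In practice the stems would come
# from a stemmer; these values are hard-coded for illustration.
stemmed_tokens = [
    ("computer", "comput"),
    ("computers", "comput"),
    ("computing", "comput"),
    ("computer", "comput"),
    ("played", "play"),
]

# For every stem, count how often each original word produced it.
surface_forms = defaultdict(Counter)
for token, stem in stemmed_tokens:
    surface_forms[stem][token] += 1

def display_form(stem: str) -> str:
    """Return the most frequent original word for a stem, or the stem itself."""
    if stem not in surface_forms:
        return stem
    return surface_forms[stem].most_common(1)[0][0]

print(display_form("comput"))  # computer
```

The stem stays an internal key for matching, while the reader only ever sees a real word.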

12. Exercises

1. Explain why Stemming would fail to link the words "Mouse" and "Mice" together, but Lemmatization would succeed.

13. Coding Challenges

Challenge 1: Write pseudocode for a simple text normalization pipeline that takes a raw string, tokenizes it, and applies lemmatization.

```text
raw_text = "The children are playing"

tokens = word_tokenize(raw_text) // ["The", "children", "are", "playing"]
tagged_tokens = pos_tag(tokens)  // each token paired with its part of speech
lemmatized_array = []

For (word, pos) in tagged_tokens:
    root = Lemmatize(word, pos)
    Add root to lemmatized_array

Print lemmatized_array
// Output: ["The", "child", "be", "play"]
```
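For readers who want to run the pipeline, here is a minimal Python sketch of the same steps. It uses hypothetical stand-ins so that it has no dependencies: a regular expression replaces the library tokenizer, and the tiny `LEMMAS` table replaces a dictionary-backed lemmatizer.

```python
import re

# Hand-written lemma table covering only the words in the example sentence.
LEMMAS = {"children": "child", "are": "be", "playing": "play"}

def tokenize(text: str) -> list:
    """Split text into word tokens (letters and apostrophes only)."""
    return re.findall(r"[A-Za-z']+", text)

def lemmatize(token: str) -> str:
    """Return the lemma if the token is in the table, else the token itself."""
    return LEMMAS.get(token.lower(), token)

def normalize(text: str) -> list:
    """Tokenize the raw string, then lemmatize every token."""
    return [lemmatize(token) for token in tokenize(text)]

print(normalize("The children are playing"))
# ['The', 'child', 'be', 'play']
```

Swapping in a real tokenizer and lemmatizer only changes the bodies of `tokenize` and `lemmatize`; the shape of `normalize` stays the same.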

14. MCQs with Answers

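Question 1

Which technique reduces words to their root by crudely chopping off suffixes based on a set of rules?

Answer: Stemming.

Question 2

Why is Lemmatization generally preferred over Stemming for complex NLP tasks?

Answer: It returns real dictionary words and handles irregular forms such as "Geese" -> "Goose", so the meaning of the text is preserved.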

15. Interview Questions

Q: Compare and contrast Stemming and Lemmatization. Provide an example where Stemming fails but Lemmatization succeeds.

Q: In a scenario where processing speed is critical and millions of documents are analyzed per second, which text normalization technique would you choose and why?

16. FAQs

Q: Do modern LLMs like GPT-4 use Stemming?

A: No. Modern Transformer models use Sub-word Tokenization (which we covered in Chapter 6). This naturally handles variations of words without the need for destructive stemming or dictionary-heavy lemmatization. Stemming and Lemmatizing are primarily used for traditional Machine Learning and Search Engine indexing.

17. Summary

In Chapter 8, we finalized our text cleaning process. To prevent the AI from treating variations of the same word as different entities, we reduce them to a common base form. Stemming does this quickly by chopping off endings, while Lemmatization does this accurately by looking up each word's dictionary form.

18. Next Chapter Recommendation

Our words are now clean and isolated. But how does the computer know if a word is a noun, a verb, or an adjective? Proceed to Chapter 9: Parts of Speech Tagging (POS) to learn how AI understands grammar.
