NLP Basics Tutorial
CHAPTER 08 Beginner

Stemming and Lemmatization

Updated: May 14, 2026
20 min read

1. Introduction

In English, a single concept can be written in many different ways depending on tense and grammar. For example: "Run", "Running", "Ran", and "Runs". To a computer, these are four unrelated strings, which inflates the vocabulary and spreads the evidence for one concept across several entries. In this chapter, we will learn how Stemming and Lemmatization reduce words to a common root so that the model can treat them as variations of the same word.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Explain the concept of reducing words to their root form.
  • Define Stemming and understand its crude, rule-based approach.
  • Define Lemmatization and understand its dictionary-based approach.
  • Compare the pros and cons of both techniques.

3. Beginner-Friendly Explanation

Imagine you are sorting a massive pile of Lego bricks. You see a "blue 4-peg brick", a "scratched blue 4-peg brick", and a "shiny blue 4-peg brick". To build a wall, you don't care about the scratches or the shine; you just need to know it's a "blue 4-peg brick." In NLP, we have words like "playing", "played", and "plays". We don't care about the tense. We just want the computer to know the core action is "play". Stemming takes a chainsaw and chops the end of the word off. Lemmatization takes a scalpel and carefully transforms the word into its dictionary form (its lemma).

4. What is Stemming?

Stemming is a fast, crude, rule-based process that chops suffixes off the end of words. The most famous algorithm is the Porter Stemmer, published by Martin Porter in 1980.
  • Playing -> chops off "ing" -> Play
  • Played -> chops off "ed" -> Play
*The Problem:* Because it blindly chops letters based on rules, it often produces stems that are not real words.
  • University -> chops off "ity" -> Univers
  • Universe -> chops off "e" -> Univers
The stemmer maps both words to the same stem even though they mean different things, which is wrong!
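The chopping behaviour described above can be sketched with a toy suffix stripper. This is only an illustration under simple assumptions, not the Porter algorithm: the `SUFFIXES` list and the `toy_stem` function are hypothetical names invented for this example.

```python
# Toy suffix-stripping stemmer: an illustration of the idea only,
# NOT the Porter algorithm. The rule list below is made up.
SUFFIXES = ("ing", "ity", "ed", "es", "s", "e")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["playing", "played", "plays", "university", "universe"]:
    print(w, "->", toy_stem(w))
# playing -> play
# played -> play
# plays -> play
# university -> univers
# universe -> univers
```

Even this tiny rule list reproduces both effects: the three forms of "play" collapse to one stem, and "university" and "universe" collide on the non-word "univers".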

5. What is Lemmatization?

Lemmatization is a smarter but slower process. It uses the word's part of speech and a large built-in dictionary (such as WordNet) to find the Lemma (the actual dictionary form of the word).
  • Playing -> dictionary lookup -> Play
  • Better (as an adjective) -> dictionary lookup -> Good
  • Geese -> dictionary lookup -> Goose
Notice how Stemming could never turn "Better" into "Good" or "Geese" into "Goose" because chopping off letters wouldn't work. Lemmatization relies on knowledge of the vocabulary rather than on surface spelling.
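To make the lookup idea concrete, here is a minimal sketch that assumes a tiny hand-written table. Real lemmatizers use a full lexicon; `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names used only for this illustration.

```python
# Hypothetical mini-dictionary mapping inflected forms to lemmas.
# A real lemmatizer uses a full lexicon; this table is made up.
LEMMA_TABLE = {
    "playing": "play",
    "played": "play",
    "better": "good",
    "geese": "goose",
    "mice": "mouse",
}

def toy_lemmatize(word: str) -> str:
    """Return the lemma from the table, or the lowercased word if unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

for w in ["Playing", "Better", "Geese", "table"]:
    print(w, "->", toy_lemmatize(w))
# Playing -> play
# Better -> good
# Geese -> goose
# table -> table
```

Irregular forms such as "geese" are handled because the answer is stored, not computed from the spelling. The price is that the table must exist and be searched, which is part of why lemmatization tends to be slower.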

6. Stemming vs Lemmatization

| Feature | Stemming (Chainsaw) | Lemmatization (Scalpel) |
| --- | --- | --- |
| Speed | Very fast | Slower |
| Accuracy | Low (can create non-words) | High (uses an actual dictionary) |
| Method | Chops off suffixes | Dictionary lookup |
| Example | Studies -> Studi | Studies -> Study |

7. When to Use Which?

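  • Use Stemming: When you are building a massive search engine indexing billions of pages and speed is your number one priority. Because the query is stemmed the same way as the pages, a search for "universities" finds pages containing "university", but it may also return pages about "universe", since both reduce to "univers".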
  • Use Lemmatization: When building a Chatbot, an AI summarizer, or a Sentiment Analyzer where actual human meaning and precision are required.
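The search-engine case can be sketched as a tiny inverted index in which both documents and queries pass through the same stemmer. This is a rough sketch under stated assumptions: `crude_stem` is a hypothetical stand-in for a real stemmer, and the three documents are invented for the example.

```python
from collections import defaultdict

def crude_stem(word: str) -> str:
    # Hypothetical stand-in for a real stemmer: strips a few common suffixes.
    for suffix in ("ities", "ity", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs: dict) -> dict:
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[crude_stem(token)].add(doc_id)
    return index

def search(index: dict, query: str) -> list:
    """Stem the query the same way as the documents, then look it up."""
    return sorted(index.get(crude_stem(query.lower()), set()))

docs = {
    1: "the universe is expanding",
    2: "she studies at a university",
    3: "cats sleep all day",
}
index = build_index(docs)
print(search(index, "universities"))  # [1, 2]
print(search(index, "cat"))           # [3]
```

The first query shows the trade-off: the user finds the page about a university despite typing the plural, but also gets the unrelated page about the universe.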

8. Python Examples

Using NLTK, we can see the stark difference between the two techniques.
```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# The lemmatizer needs the WordNet data (downloaded once)
nltk.download("wordnet", quiet=True)

# Initialize tools
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word = "studies"

# Stemming applies suffix rules: "ies" becomes "i"
print("Stemmed:", stemmer.stem(word))
# Output: Stemmed: studi  <-- Not a real word.

# Lemmatization finds the dictionary root (pos="v" marks the word as a verb)
print("Lemmatized:", lemmatizer.lemmatize(word, pos="v"))
# Output: Lemmatized: study  <-- A valid dictionary word.
```

9. Mini Project

Act as the Algorithm: Look at the following three words: ["flying", "flies", "flew"]
  1. What would a crude Stemmer likely output for "flies"? *(Probably "fli")*.
  2. What would a Lemmatizer output for all three words when they are treated as verbs? *(Answer: "fly")*.

10. Best Practices

  • Part of Speech matters in Lemmatization: In Python, you often have to tell the Lemmatizer whether the word is a Verb or a Noun. NLTK's WordNetLemmatizer, for example, treats every word as a noun unless you pass a different value for pos. If you pass the word "leaves" as a noun, the Lemma is "leaf". If you pass it as a verb, the Lemma is "leave".
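One way to picture why the part of speech changes the answer is a lookup keyed by the pair (word, part of speech). The sketch below uses a made-up table; `LEMMAS_BY_POS` and `lemmatize_with_pos` are hypothetical names, not part of any library.

```python
# Hypothetical lookup keyed by (word, part of speech).
# The entries are illustrative only.
LEMMAS_BY_POS = {
    ("leaves", "noun"): "leaf",
    ("leaves", "verb"): "leave",
    ("saw", "noun"): "saw",
    ("saw", "verb"): "see",
}

def lemmatize_with_pos(word: str, pos: str) -> str:
    """Look up the lemma for this word used as the given part of speech."""
    key = word.lower()
    return LEMMAS_BY_POS.get((key, pos), key)

print(lemmatize_with_pos("leaves", "noun"))  # leaf
print(lemmatize_with_pos("leaves", "verb"))  # leave
print(lemmatize_with_pos("saw", "verb"))     # see
```

The same spelling maps to different lemmas, so a lemmatizer that is given the wrong part of speech can return the wrong root or leave the word unchanged.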

11. Common Mistakes

  • Using Stemming for Text Generation: Never use Stemming if the final output will be read by a human. The human will see chopped up gibberish words like "comput" instead of "computer".

12. Exercises

  1. Explain why Stemming would fail to link the words "Mouse" and "Mice" together, but Lemmatization would succeed.

13. Coding Challenges

Challenge 1: Write pseudocode for a simple text normalization pipeline that takes a raw string, tokenizes it, and applies lemmatization.
```text
raw_text = "The children are playing"

tokens = word_tokenize(raw_text) // ["The", "children", "are", "playing"]
lemmatized_array = []

For word in tokens:
    root = Lemmatize(word)
    Add root to lemmatized_array

Print lemmatized_array
// Output: ["The", "child", "be", "play"]
```
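The pseudocode above could be turned into runnable Python along the following lines. This is a minimal sketch, assuming a regex tokenizer and a hand-written lemma table that covers only this sentence; `LEMMAS`, `tokenize`, and `normalize` are hypothetical names, and a real pipeline would call a library lemmatizer instead.

```python
import re

# Hypothetical lemma table covering just the example sentence.
LEMMAS = {"children": "child", "are": "be", "playing": "play"}

def tokenize(text: str) -> list:
    """Split text into word tokens (letters and apostrophes only)."""
    return re.findall(r"[A-Za-z']+", text)

def normalize(text: str) -> list:
    """Tokenize, lowercase, and replace each token with its lemma if known."""
    return [LEMMAS.get(tok.lower(), tok.lower()) for tok in tokenize(text)]

print(normalize("The children are playing"))
# ['the', 'child', 'be', 'play']
```

Unlike the pseudocode, this version also lowercases each token, which is why "The" comes out as "the".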

14. MCQs with Answers

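Question 1

Which technique reduces words to their root by crudely chopping off suffixes based on a set of rules?

Answer: Stemming.

Question 2

Why is Lemmatization generally preferred over Stemming for complex NLP tasks?

Answer: Because it returns real dictionary words and preserves meaning, at the cost of some speed.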

15. Interview Questions

  • Q: Compare and contrast Stemming and Lemmatization. Provide an example where Stemming fails but Lemmatization succeeds.
  • Q: In a scenario where processing speed is critical and millions of documents are analyzed per second, which text normalization technique would you choose and why?

16. FAQs

Q: Do modern LLMs like GPT-4 use Stemming?
A: No. Modern Transformer models use Sub-word Tokenization (which we covered in Chapter 6). This naturally handles variations of words without the need for destructive stemming or dictionary-heavy lemmatization. Stemming and Lemmatizing are primarily used for traditional Machine Learning and Search Engine indexing.

17. Summary

In Chapter 8, we finalized our text cleaning process. To prevent the AI from treating variations of the same word as different entities, we shrink them to their roots. Stemming does this quickly by chopping off endings, while Lemmatization does this accurately by looking up each word's dictionary form.

18. Next Chapter Recommendation

Our words are now clean and isolated. But how does the computer know if a word is a noun, a verb, or an adjective? Proceed to Chapter 9: Parts of Speech Tagging (POS) to learn how AI understands grammar.
