NLP Basics Tutorial
CHAPTER 12 Beginner

Text Classification Fundamentals

Updated: May 14, 2026
25 min read

1. Introduction

While Sentiment Analysis categorizes text based on emotion, what if we want to categorize text based on its topic or purpose? This broader concept is called Text Classification. It is one of the most fundamental and widely used techniques in NLP. In this chapter, we will explore how AI automatically reads a document and sorts it into predefined buckets, powering everything from spam filters to news aggregators.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Define Text Classification.
  • Identify common use cases (Spam Detection, Topic Labeling, Intent Recognition).
  • Understand the concept of "Bag of Words" and TF-IDF.
  • Explain how Supervised Learning is used to train a classifier.

3. Beginner-Friendly Explanation

Imagine a mailroom clerk at a massive corporation. Their job is to read every incoming letter and throw it into one of three bins: "Billing", "Legal", or "Junk". The clerk looks for keywords. If the letter contains "Invoice" or "Payment", it goes to Billing. If it contains "Lawsuit", it goes to Legal. Text Classification automates the work of this mailroom clerk. We train an AI model to look at the words in a document and assign it a specific "Label" or "Class". (Note: Sentiment Analysis from the previous chapter is just a specific type of Text Classification where the labels are Positive/Negative.)

4. Common Use Cases

  • Spam Filtering: One of the oldest and best-known text classifiers. Labels emails as Spam or Not Spam.
  • Topic Categorization: A news aggregator such as Google News processes huge numbers of articles and automatically categorizes them into Sports, Politics, Technology, or Entertainment.
  • Intent Recognition: When you ask Alexa, "What's the score of the game?", a classifier reads the transcribed text and classifies your *intent* as SPORTS_SCORE_QUERY, routing it to the sports database.

5. How It Works: The "Bag of Words" (BoW)

Before we can train a model, we must convert our text into numbers (Feature Engineering). The simplest way is the Bag of Words model. Imagine a giant matrix (spreadsheet). Every row is a sentence. Every column is a word in the vocabulary, meaning every distinct word that appears in the dataset. If the word appears in the sentence, we put a 1 in that column (or, in the more common count-based version, the number of times it appears). If it doesn't, we put a 0. *Problem:* This ignores word order entirely (it's just a "bag" of words), so "dog bites man" and "man bites dog" look identical. Even so, it is surprisingly effective for basic topic classification!
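To make the matrix idea concrete, here is a minimal sketch in plain Python. The two sentences and the whitespace tokenizer are assumptions chosen for illustration, and the sketch stores counts; clipping every count to 1 would give the 1/0 version.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (real tokenizers also handle punctuation)."""
    return text.lower().split()

# Hypothetical two-sentence dataset.
sentences = ["the team won the game", "the senator won the vote"]

# Vocabulary: every distinct word in the dataset, sorted. One column per word.
vocabulary = sorted({word for s in sentences for word in tokenize(s)})

def bag_of_words(text):
    """Return one matrix row: how often each vocabulary word occurs in `text`."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

matrix = [bag_of_words(s) for s in sentences]

print(vocabulary)  # ['game', 'senator', 'team', 'the', 'vote', 'won']
for row in matrix:
    print(row)
# [1, 0, 1, 2, 0, 1]
# [0, 1, 0, 2, 1, 1]
```

Shuffling the words of a sentence produces exactly the same row, which is the word-order problem in action.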

6. A Better Way: TF-IDF

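Raw Bag of Words treats every word as equally important, so common words like "the" can drown out the informative ones. To fix this, data scientists often use TF-IDF (Term Frequency-Inverse Document Frequency). This weighting scheme scores a word highly when it appears often in one document (term frequency) but rarely across the rest of the collection (inverse document frequency). If the word "Algorithm" appears 5 times in one article, but almost never appears in the rest of the database, TF-IDF gives "Algorithm" a very high score for that article. During training, the AI can then learn that "Algorithm" is a strong clue that the article belongs in the Technology bucket.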
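The scoring idea can be sketched by hand. This example uses one common textbook variant, tf × log(N / df). Libraries such as scikit-learn apply smoothed and normalized versions, so their numbers will differ. The three documents are made up for illustration.

```python
import math

# Hypothetical mini-collection of three documents.
documents = [
    "the algorithm sorts the data with a fast algorithm",
    "the team won the game",
    "the senator won the vote",
]
tokenized = [doc.split() for doc in documents]

def tf_idf(word, doc_tokens):
    """Score `word` for one document. Assumes the word occurs somewhere in the collection."""
    # Term frequency: share of this document's words that are `word`.
    tf = doc_tokens.count(word) / len(doc_tokens)
    # Document frequency: number of documents that contain `word`.
    df = sum(1 for tokens in tokenized if word in tokens)
    # Inverse document frequency: words found in fewer documents score higher.
    idf = math.log(len(tokenized) / df)
    return tf * idf

first = tokenized[0]
print(round(tf_idf("algorithm", first), 3))  # 0.244
print(round(tf_idf("the", first), 3))        # 0.0
```

Both words occur twice in the first document, but "the" appears in every document, so its inverse document frequency is log(1) = 0 and its score vanishes.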

7. Training the Classifier

Text Classification is a classic Supervised Learning problem.
  1. Gather Data: Collect 10,000 news articles.
  2. Label Data: A human labels 5,000 as Sports and 5,000 as Politics.
  3. Train: The AI (like a Naive Bayes algorithm or a Neural Network) learns from the TF-IDF scores of the labeled articles. It learns that words like "Touchdown" and "Coach" strongly correlate with the Sports label.
  4. Predict: Give the AI a brand new, unlabeled article, and it will output a probability for each label (e.g., 95% chance this is Sports).
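The four steps above can be sketched with scikit-learn. The four-sentence dataset is a hypothetical stand-in for the 10,000 labeled articles, so the probabilities it prints are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Steps 1-2: gather and label data (a toy, made-up dataset).
texts = [
    "the coach praised the team after the touchdown",
    "the striker scored a late goal",
    "the senator proposed a new tax bill",
    "voters elected a new president",
]
labels = ["Sports", "Sports", "Politics", "Politics"]

# Step 3: train. The pipeline converts text to TF-IDF features, then fits Naive Bayes.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Step 4: predict. predict_proba returns one probability per class, in the order of model.classes_.
probabilities = model.predict_proba(["the coach scored a goal"])[0]
for label, probability in zip(model.classes_, probabilities):
    print(f"{label}: {probability:.2f}")
```

The probabilities always sum to 1 across the known labels, so a high value means "most likely among the classes the model was trained on", not "certainly correct".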

8. Python Examples

Using scikit-learn, we can build a simple text classifier pipeline.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# 1. Our tiny dataset (Training Data)
texts = ["I love playing basketball", "The election was close", "He scored a touchdown", "The senator voted yes"]
labels = ["Sports", "Politics", "Sports", "Politics"]

# 2. Build the Pipeline (TF-IDF converter + Naive Bayes AI Model)
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 3. Train the model
model.fit(texts, labels)

# 4. Predict a brand new, unseen sentence.
# With only four training sentences, the model knows only the words it has seen,
# so the test sentence reuses some of them ("scored", "playing", "basketball").
# A sentence like "The quarterback threw the ball" shares just "the" with the
# training data, and "the" appears only in the Politics examples here.
new_sentence = ["She scored while playing basketball"]
prediction = model.predict(new_sentence)

print(prediction[0])
# Output: Sports
```

9. Mini Project

Act as the Classifier: You have three buckets: Support, Billing, and Sales. Categorize the following customer messages:
  1. "My credit card was charged twice!" *(Billing)*
  2. "How much does the enterprise tier cost?" *(Sales)*
  3. "The app keeps crashing when I open it." *(Support)*

10. Best Practices

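  • Balanced Data: If you are training a Spam filter, do not train it on 9,000 Normal emails and 100 Spam emails without correcting for the imbalance. The model can simply guess "Normal" every time and achieve about 99% accuracy while catching no spam at all. Aim for a reasonably balanced dataset, and check per-class metrics such as precision and recall rather than accuracy alone.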
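The arithmetic behind this warning is easy to check. The sketch below uses the same 9,000/100 split and a pretend "model" that always answers "Normal". No real classifier is involved.

```python
# Labels mirroring the example above: 9,000 normal emails and 100 spam emails.
true_labels = ["Normal"] * 9000 + ["Spam"] * 100

# A "model" that ignores its input and always answers "Normal".
predictions = ["Normal"] * len(true_labels)

# Accuracy: fraction of all emails labeled correctly.
correct = sum(t == p for t, p in zip(true_labels, predictions))
accuracy = correct / len(true_labels)

# Spam recall: of the real spam emails, what fraction did the model catch?
spam_caught = sum(t == "Spam" and p == "Spam" for t, p in zip(true_labels, predictions))
spam_recall = spam_caught / true_labels.count("Spam")

print(f"accuracy: {accuracy:.1%}")        # accuracy: 98.9%
print(f"spam recall: {spam_recall:.1%}")  # spam recall: 0.0%
```

Accuracy looks excellent, yet not a single spam email is caught, which is why a per-class metric is worth reporting alongside it.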

11. Common Mistakes

  • Ignoring the "Unknown" class: If your classifier is only trained on Sports and Politics, and a user uploads a recipe for a cake, the AI will force the cake recipe into one of those two categories. Always plan for this case: add an "Other/Unknown" class, or reject predictions whose confidence falls below a threshold!
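One way to apply a confidence threshold is a small wrapper around the classifier's probabilities. The function name and the 0.7 cutoff below are illustrative assumptions, and the cutoff would need tuning on real data.

```python
def label_with_threshold(probabilities, threshold=0.7):
    """Return the most likely label, or "Unknown" if the model is not confident enough.

    `probabilities` maps each label to the model's probability for it.
    The 0.7 default is an arbitrary starting point, not a recommended value.
    """
    best_label = max(probabilities, key=probabilities.get)
    if probabilities[best_label] < threshold:
        return "Unknown"
    return best_label

print(label_with_threshold({"Sports": 0.95, "Politics": 0.05}))  # Sports
print(label_with_threshold({"Sports": 0.55, "Politics": 0.45}))  # Unknown
```

A threshold is only a partial safeguard, since a model can still be confidently wrong on text unlike anything it saw in training.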

12. Exercises

  1. Explain why TF-IDF is generally a better mathematical representation of text than a simple Bag of Words (BoW) count for topic classification.

13. Coding Challenges

Challenge 1: Write pseudocode for how an Intent Recognition classifier routes a user's chatbot message to the correct database.
```text
user_message = "Turn off the living room lights."
intent = AI_Classifier.predict(user_message)

If intent == "WEATHER_QUERY":
    Return get_weather_api()
Else If intent == "SMART_HOME_COMMAND":
    Return trigger_smart_home_lights()
Else:
    Return "I'm sorry, I don't understand that command."
```
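The pseudocode above could be turned into runnable Python along these lines. The keyword-based `classify_intent` is a hypothetical stand-in for a trained classifier, and the handler functions only return placeholder strings instead of calling real services.

```python
def classify_intent(message):
    """Stand-in for a trained classifier: a hypothetical keyword lookup."""
    text = message.lower()
    if "weather" in text or "forecast" in text:
        return "WEATHER_QUERY"
    if "lights" in text or "thermostat" in text:
        return "SMART_HOME_COMMAND"
    return "UNKNOWN"

def get_weather(message):
    # Placeholder: a real system would call a weather service here.
    return "Fetching the weather forecast..."

def control_smart_home(message):
    # Placeholder: a real system would send a command to the device hub here.
    return "Sending command to the smart home hub..."

# Map each intent label to the function that handles it.
HANDLERS = {
    "WEATHER_QUERY": get_weather,
    "SMART_HOME_COMMAND": control_smart_home,
}

def route(message):
    """Classify the message, then dispatch it to the matching handler."""
    intent = classify_intent(message)
    handler = HANDLERS.get(intent)
    if handler is None:
        return "I'm sorry, I don't understand that command."
    return handler(message)

print(route("Turn off the living room lights."))
```

Using a dictionary instead of a long if/else chain means that supporting a new intent only requires adding one entry to `HANDLERS`.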

14. MCQs with Answers

Question 1

A system that automatically routes incoming customer service emails to either the "Tech Support" team or the "Refunds" team is an example of:

Question 2

What does the "Bag of Words" model fail to capture when converting text into mathematics?

15. Interview Questions

  • Q: Describe the process of training a supervised Text Classifier from raw text data to deployment.
  • Q: What is TF-IDF, and how does it help a machine learning model identify the most important words in a document?

16. FAQs

Q: Can a text belong to more than one category?
A: Yes! That is called "Multi-label Classification." An article about a politician playing basketball could legitimately be tagged as both Sports and Politics.
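In practice, multi-label targets are often stored as one 0/1 column per label instead of a single label per document, so the model can answer a separate yes/no question for each label. A minimal sketch of that encoding, using made-up article tags:

```python
# Hypothetical tags for three articles; the third carries two labels at once.
labels_per_article = [{"Sports"}, {"Politics"}, {"Sports", "Politics"}]

# One column per distinct label, in sorted order.
all_labels = sorted(set().union(*labels_per_article))

# Indicator matrix: 1 if the article has that label, otherwise 0.
indicator = [[int(label in tags) for label in all_labels] for tags in labels_per_article]

print(all_labels)  # ['Politics', 'Sports']
for row in indicator:
    print(row)
# [0, 1]
# [1, 0]
# [1, 1]
```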

17. Summary

In Chapter 12, we learned how to organize the chaos of unstructured text. Text Classification is the supervised ML process of reading a document and assigning it a predefined label. By converting text into mathematical weights (using BoW or TF-IDF), algorithms can learn to detect Spam, categorize news topics, and route chatbot intents, often with high accuracy when enough labeled data is available.

18. Next Chapter Recommendation

TF-IDF is great, but it still doesn't truly understand *meaning*. How do modern AI models know that "King" and "Queen" are related concepts? Proceed to Chapter 13: Introduction to Word Embeddings to learn the geometry of language.
