CHAPTER 06
Beginner
Tokenization and Text Segmentation
Updated: May 14, 2026
20 min read
1. Introduction
Machine learning algorithms cannot ingest a giant paragraph of text all at once. They need the text broken down into digestible, discrete pieces. This crucial process is called Tokenization. In this chapter, we will learn how to chop paragraphs into sentences, sentences into words, and words into characters, turning a block of text into an array of analyzable "Tokens".
2. Learning Objectives
By the end of this chapter, you will be able to:
- Define Tokenization in NLP.
- Explain the difference between Sentence Tokenization and Word Tokenization.
- Understand the benefits and drawbacks of Character Tokenization.
- Identify the challenges of tokenizing complex languages.
3. Beginner-Friendly Explanation
Imagine a child eating a giant pizza. They cannot swallow the whole pizza in one bite. First, the parent cuts the pizza into slices (Sentence Tokenization). Then, the parent cuts the slices into smaller bites so the child can chew them easily (Word Tokenization). In NLP, the computer is the child. It cannot process "I love AI. It is great." as one block. It needs the text chopped into pieces: ["I", "love", "AI", ".", "It", "is", "great", "."]. Each piece is called a Token.
4. Sentence Tokenization (Text Segmentation)
This is the process of breaking a large paragraph into individual sentences. You might think this is as easy as splitting the text every time you see a period (.). But it is much harder!
Consider this text: *"Dr. Smith went to Washington D.C. to buy a U.S. flag."*
If we split at every period, we get:
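- "Dr"
- "Smith went to Washington D"
- "C"
- "to buy a U"
- "S"
- "flag"
None of these fragments is a complete sentence. The periods in "Dr.", "D.C.", and "U.S." mark abbreviations, not sentence endings, so a sentence tokenizer has to tell the two apart.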
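To make the problem concrete, here is a minimal sketch that compares the naive period split with a slightly smarter splitter. The `ABBREVIATIONS` set and the `split_sentences` function are hypothetical names invented for this illustration, and a second short sentence has been added to the example text so there is a real boundary to find. Real sentence tokenizers use much richer rules or trained models.

```python
text = "Dr. Smith went to Washington D.C. to buy a U.S. flag. He was very happy."

# Naive approach: cut at every period and drop empty pieces.
naive = [piece.strip() for piece in text.split(".") if piece.strip()]
print(naive)

# Hypothetical, hand-written abbreviation list (an assumption for this demo).
ABBREVIATIONS = {"Dr.", "D.C.", "U.S."}

def split_sentences(paragraph):
    """End a sentence only at '.', '!' or '?' when the word is not a known abbreviation."""
    sentences = []
    current = []
    for word in paragraph.split():
        current.append(word)
        ends_sentence = word.endswith((".", "!", "?"))
        if ends_sentence and word not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that has no final punctuation
        sentences.append(" ".join(current))
    return sentences

smart = split_sentences(text)
print(smart)
```

This sketch still has a blind spot: if a sentence happens to end with an abbreviation (for example "... Washington D.C."), it will not be split there. That is one reason dedicated libraries are preferred.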
5. Word Tokenization
This is the most common form of tokenization. It breaks a sentence into individual words and punctuation marks.
Input: "I can't wait!"
Output: ["I", "ca", "n't", "wait", "!"]
Notice how the tokenizer split the contraction "can't" into "ca" and "n't". The two pieces stand for "can" and "not", which play different grammatical roles, so tokenizers that follow the Penn Treebank convention (such as NLTK's) keep them as separate tokens.
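The behaviour above can be imitated with two regular expressions. This is only a rough sketch: `tokenize_words` is a hypothetical helper written for this chapter, not the code any library actually uses, and it handles only the "n't" contraction.

```python
import re

def tokenize_words(sentence):
    """Toy word tokenizer: separates punctuation and splits "n't" contractions."""
    # Step 1: put a space before the "n't" of a contraction: "can't" -> "ca n't".
    spaced = re.sub(r"(\w)(n't)\b", r"\1 \2", sentence)
    # Step 2: collect "n't", runs of word characters, or single punctuation marks.
    return re.findall(r"n't|\w+|[^\w\s]", spaced)

tokens = tokenize_words("I can't wait!")
print(tokens)  # ['I', 'ca', "n't", 'wait', '!']
```

Other contractions such as "I'm" or "they're" would need extra rules, which is why real tokenizers carry long lists of special cases.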
6. Sub-Word and Character Tokenization
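- Character Tokenization: Splitting text into single letters: ["a", "p", "p", "l", "e"]. Individual letters carry little meaning on their own and the token sequences become much longer, so this is used less often. Its advantage is that misspelled or never-seen words can always be represented.
- Sub-word Tokenization: The modern standard, used by models such as ChatGPT. It keeps common words whole and breaks rare words into smaller, meaningful chunks.
Input: "Unbelievable"
Output: ["Un", "believ", "able"] (an illustrative split; the exact pieces depend on the vocabulary the tokenizer has learned). This allows the AI to understand the root meaning of a word even if it has never seen the exact word before.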
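Both ideas can be sketched in a few lines. The sub-word part below uses a greedy longest-match strategy over a tiny, hand-picked vocabulary (`TOY_VOCAB` is an assumption made for this demo). Real sub-word tokenizers learn their vocabulary from large amounts of text, so treat this as a simplified illustration of the idea and not as how any particular model works. The sketch also lowercases its input, so it returns "un" where the text above shows "Un".

```python
def char_tokenize(word):
    """Character tokenization: one token per character."""
    return list(word)

# Hypothetical toy vocabulary of known word pieces (lowercase).
TOY_VOCAB = {"un", "believ", "able", "play", "ing"}

def subword_tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match split; falls back to single characters."""
    lowered = word.lower()
    tokens = []
    start = 0
    while start < len(lowered):
        # Try the longest remaining piece first, then shorter ones.
        for end in range(len(lowered), start, -1):
            piece = lowered[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No known piece starts here: emit one character and move on.
            tokens.append(lowered[start])
            start += 1
    return tokens

print(char_tokenize("apple"))            # ['a', 'p', 'p', 'l', 'e']
print(subword_tokenize("Unbelievable"))  # ['un', 'believ', 'able']
print(subword_tokenize("playing"))       # ['play', 'ing']
```

Note the character fallback: a word with no known pieces still gets tokenized, just one letter at a time.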
7. The Challenges of Tokenization
Tokenization is relatively easy in English because we put spaces between our words. However, languages like Chinese, Japanese, and Thai do not use spaces between words.
Example: 我喜欢吃苹果 (I like to eat apples).
Where do you split this? For these languages, tokenizers rely on word dictionaries and statistical models to figure out where one word ends and the next begins.
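One classic dictionary-based approach is "forward maximum matching": at each position, take the longest dictionary word that fits. The sketch below assumes a four-word toy dictionary (`TOY_DICTIONARY` is made up for this example); production segmenters combine far larger dictionaries with statistical models.

```python
sentence = "我喜欢吃苹果"  # "I like to eat apples"

# Whitespace splitting finds no boundaries at all.
print(sentence.split())  # ['我喜欢吃苹果']

# Hypothetical toy dictionary; real systems use far larger word lists.
TOY_DICTIONARY = {"我", "喜欢", "吃", "苹果"}
MAX_WORD_LENGTH = max(len(word) for word in TOY_DICTIONARY)

def max_match(text, dictionary=TOY_DICTIONARY):
    """Forward maximum matching: prefer the longest dictionary word at each step."""
    tokens = []
    start = 0
    while start < len(text):
        longest = min(MAX_WORD_LENGTH, len(text) - start)
        for length in range(longest, 0, -1):
            candidate = text[start:start + length]
            # A single character is always accepted so the loop makes progress.
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                start += length
                break
    return tokens

segmented = max_match(sentence)
print(segmented)  # ['我', '喜欢', '吃', '苹果']
```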
8. Python Examples
While you can tokenize using Python's basic split() method, NLP developers usually rely on a dedicated library such as NLTK (Natural Language Toolkit).
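The short example below contrasts `split()` with NLTK. It is a sketch under a few assumptions: NLTK must be installed (`pip install nltk`), and the example uses `TreebankWordTokenizer` because it is rule-based and needs no extra data files. The import is wrapped in `try`/`except` only so the script still runs where NLTK is missing.

```python
text = "I can't wait!"

# Baseline: plain str.split() only breaks on whitespace.
naive_tokens = text.split()
print(naive_tokens)  # ['I', "can't", 'wait!']

try:
    # Rule-based tokenizer that follows the Penn Treebank conventions.
    from nltk.tokenize import TreebankWordTokenizer
except ImportError:
    TreebankWordTokenizer = None  # NLTK is not installed

if TreebankWordTokenizer is not None:
    nltk_tokens = TreebankWordTokenizer().tokenize(text)
    print(nltk_tokens)  # ['I', 'ca', "n't", 'wait', '!']
```

NLTK also provides the convenience functions `word_tokenize` and `sent_tokenize`. These depend on NLTK's Punkt sentence models, which have to be fetched once with `nltk.download` before first use.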
9. Mini Project
Manual Tokenizer: Look at the phrase: "New York is a great city."
If a naive Python script just splits the text at every space, what is the output? What is the problem with this output?
*(Answer: ["New", "York", "is", "a", "great", "city."] The problem is that "New York" is a single entity, but it was split into two tokens. Also, the period is attached to "city", making the computer think "city" and "city." are different words).*
10. Best Practices
- Use dedicated libraries: Never write your own tokenizer using text.split(" "). It will fail on edge cases like abbreviations, contractions, and punctuation. Always use industry-standard tools like NLTK or spaCy.
11. Common Mistakes
- Ignoring the language: A tokenizer built for English will give poor results on other languages. German has its own abbreviations and long compound words, and Chinese has no spaces between words at all. Always specify the language when using a tokenization tool.
12. Exercises
- 1. Explain why Sub-word Tokenization (used by modern LLMs) is superior to standard Word Tokenization when dealing with newly invented words or typos.
13. Coding Challenges
Challenge 1: Write pseudocode for how a naive Python script splits text by spaces, and contrast it with an ideal tokenized array.
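One possible solution sketch is shown below. `naive_split` is a hypothetical helper that spells out, character by character, what a space-based split does. The "ideal" array is written by hand for comparison, with "New York" kept as one token as discussed in the Mini Project.

```python
def naive_split(text):
    """Split on spaces only, the way a naive script would."""
    tokens = []
    current = ""
    for ch in text:
        if ch == " ":
            if current:  # finish the word collected so far
                tokens.append(current)
            current = ""
        else:
            current += ch
    if current:  # do not forget the last word
        tokens.append(current)
    return tokens

sentence = "New York is a great city."
naive_tokens = naive_split(sentence)
print(naive_tokens)  # ['New', 'York', 'is', 'a', 'great', 'city.']

# Hand-written target: the entity stays together, the period stands alone.
ideal_tokens = ["New York", "is", "a", "great", "city", "."]
print(ideal_tokens)
```

The naive version has no notion of punctuation or multi-word names. It only knows where the spaces are.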
14. MCQs with Answers
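Question 1
What is the main problem with splitting a paragraph into sentences by simply looking for periods?
Answer: Periods also appear inside abbreviations such as "Dr." and "U.S.", so the text gets cut in the middle of a sentence.
Question 2
Which tokenization method breaks words into smaller meaningful chunks (e.g., "playing" into "play" and "ing"), and is used by models like ChatGPT?
Answer: Sub-word Tokenization.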
15. Interview Questions
- Q: Explain the difference between Word Tokenization and Character Tokenization. What are the pros and cons of each?
- Q: Why is tokenizing languages like Chinese or Japanese significantly harder than tokenizing English?
16. FAQs
Q: Do I need to remove punctuation before or after I tokenize?
A: Usually, you tokenize first, and the tokenizer will separate the punctuation marks into their own tokens (e.g., ","). Then, you can easily filter those punctuation tokens out of your array.
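As a small illustration of that filtering step, the sketch below starts from an already tokenized list and drops the punctuation tokens. `is_punctuation` is a hypothetical helper, and it only recognizes the ASCII punctuation listed in Python's `string.punctuation`.

```python
import string

# Tokens as produced by a word tokenizer (punctuation already separated).
tokens = ["I", "love", "AI", ".", "It", "is", "great", "."]

def is_punctuation(token):
    """True if the token is non-empty and made only of punctuation characters."""
    return bool(token) and all(ch in string.punctuation for ch in token)

words = [token for token in tokens if not is_punctuation(token)]
print(words)  # ['I', 'love', 'AI', 'It', 'is', 'great']
```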