NLP Basics Tutorial
CHAPTER 06 Beginner

Tokenization and Text Segmentation

Updated: May 14, 2026


1. Introduction

Machine learning algorithms cannot ingest a giant paragraph of text all at once. They need the text broken down into digestible, discrete pieces. This crucial process is called Tokenization. In this chapter, we will learn how to chop paragraphs into sentences, sentences into words, and words into characters, turning a block of text into an array of analyzable "Tokens".

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Define Tokenization in NLP.
  • Explain the difference between Sentence Tokenization and Word Tokenization.
  • Understand the benefits and drawbacks of Character Tokenization.
  • Identify the challenges of tokenizing complex languages.

3. Beginner-Friendly Explanation

Imagine a child eating a giant pizza. They cannot swallow the whole pizza in one bite. First, the parent cuts the pizza into slices (Sentence Tokenization). Then, the parent cuts the slices into smaller bites so the child can chew them easily (Word Tokenization). In NLP, the computer is the child. It cannot process "I love AI. It is great." It needs the text chopped into pieces: ["I", "love", "AI", ".", "It", "is", "great", "."]. Each piece is called a Token.
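The pizza-cutting idea can be sketched in a few lines of Python. This is a minimal illustration rather than a production tokenizer: the `simple_tokenize` name and the regular expression are assumptions made for this example.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ grabs a run of letters/digits; [^\w\s] grabs one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("I love AI. It is great.")
print(tokens)
# ['I', 'love', 'AI', '.', 'It', 'is', 'great', '.']
```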

4. Sentence Tokenization (Text Segmentation)

This is the process of breaking a large paragraph into individual sentences. You might think this is as easy as splitting the text at every period (.), but it is much harder. Consider this text: *"Dr. Smith went to Washington D.C. to buy a U.S. flag."* If we split at every period, we get:
  1. Dr
  2. Smith went to Washington D
  3. C
  4. to buy a U
  5. S
  6. flag
This is completely wrong: one sentence became six fragments. Practical sentence tokenizers rely on abbreviation lists or trained statistical models to decide whether a period marks an abbreviation or the end of a sentence.
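To make the failure concrete, here is a small sketch that compares a naive period split with a splitter that consults an abbreviation list. The `ABBREVIATIONS` set, both function names, and the second sample sentence are hypothetical and chosen only for this example; real sentence tokenizers are considerably more sophisticated.

```python
# Hypothetical, hand-written abbreviation list (lowercase) for this demo only
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "d.c.", "u.s."}

def naive_split(text):
    """Split at every period and drop empty fragments."""
    return [part.strip() for part in text.split(".") if part.strip()]

def split_sentences(text):
    """End a sentence at . ! ? unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word[-1] in ".!?" and word.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks end punctuation
        sentences.append(" ".join(current))
    return sentences

text = "Dr. Smith went to Washington D.C. to buy a U.S. flag. It was a good trip."

fragments = naive_split("Dr. Smith went to Washington D.C. to buy a U.S. flag.")
print(fragments)
# ['Dr', 'Smith went to Washington D', 'C', 'to buy a U', 'S', 'flag']

sentences = split_sentences(text)
print(sentences)
# ['Dr. Smith went to Washington D.C. to buy a U.S. flag.', 'It was a good trip.']
```

Note the limitation of this sketch: if a sentence ends with an abbreviation (for example "...moved to the U.S."), the list-based rule will not end the sentence there, which is one reason statistical approaches exist.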

5. Word Tokenization

This is the most common form of tokenization. It breaks a sentence into individual words and punctuation marks. Input: "I can't wait!" Output: ["I", "ca", "n't", "wait", "!"]. Notice how the tokenizer splits the contraction "can't" into "ca" (standing for "can") and "n't" (standing for "not"), because the two parts play different grammatical roles: a verb and a negation. This is the convention followed by Penn Treebank-style tokenizers such as NLTK's word_tokenize.
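The contraction split can be imitated with two regular expressions. This is a simplified sketch that handles only the "n't" case; the `tokenize_with_contractions` name is hypothetical, and library tokenizers apply many more rules than this.

```python
import re

def tokenize_with_contractions(sentence):
    """Split words and punctuation, separating "n't" from its verb."""
    # Step 1: insert a space before "n't" so "can't" becomes "ca n't"
    spaced = re.sub(r"(\w)(n't)\b", r"\1 \2", sentence)
    # Step 2: pull out "n't", whole words, and single punctuation marks
    return re.findall(r"n't|\w+|[^\w\s]", spaced)

word_tokens = tokenize_with_contractions("I can't wait!")
print(word_tokens)
# ['I', 'ca', "n't", 'wait', '!']
```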

6. Sub-Word and Character Tokenization

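  • Character Tokenization: Splitting text into single characters: ["a", "p", "p", "l", "e"]. It is less common for word-level tasks because individual letters carry little meaning on their own and the token sequences become much longer, but it never meets an unknown word, so it copes well with misspellings.
  • Sub-word Tokenization: The modern standard, used by large language models such as those behind ChatGPT. It keeps common words whole and breaks rare words into smaller, reusable chunks.
Input: "Unbelievable" Output: ["Un", "believ", "able"] (an illustrative split; real tokenizers learn their chunks from data, so the actual pieces vary by model). This lets the model relate a word it has never seen to pieces it already knows.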
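Both ideas can be sketched briefly. Character tokenization is just `list()`. For sub-words, the example below uses a greedy longest-match lookup against a tiny hand-written vocabulary. This is an assumption-laden simplification: the `VOCAB` set and the `subword_tokenize` function are hypothetical, and production tokenizers learn their vocabularies from large corpora instead of using a fixed list like this one.

```python
# Character tokenization: every character becomes a token
char_tokens = list("apple")
print(char_tokens)
# ['a', 'p', 'p', 'l', 'e']

# Hypothetical toy vocabulary of known sub-word pieces
VOCAB = {"Un", "believ", "able", "happi", "ness"}

def subword_tokenize(word, vocab):
    """Greedy longest-match: repeatedly take the longest known prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known piece
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character
            end = start + 1
        tokens.append(word[start:end])
        start = end
    return tokens

print(subword_tokenize("Unbelievable", VOCAB))
# ['Un', 'believ', 'able']
print(subword_tokenize("Unhappiness", VOCAB))
# ['Un', 'happi', 'ness']
```

The second call shows the benefit described above: "Unhappiness" is not in the vocabulary as a whole word, yet it is still covered by known pieces. The single-character fallback is what guarantees that no input is ever rejected as unknown.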

7. The Challenges of Tokenization

Tokenization is relatively easy in English because we put spaces between our words. However, languages like Chinese, Japanese, and Thai do not use spaces between words. Take 我喜欢吃苹果 (I like to eat apples). Where do you split this? The intended words are 我 / 喜欢 / 吃 / 苹果, but nothing in the text marks those boundaries. Tokenizers for these languages must rely on word dictionaries or statistical models to figure out where one word ends and the next begins.
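One classic dictionary-based approach is forward maximum matching: at each position, take the longest dictionary word that fits. The sketch below assumes a tiny hand-written `DICTIONARY` and a hypothetical `max_match` function; real segmenters use far larger dictionaries and usually statistical models on top.

```python
# Hypothetical mini-dictionary containing only the words in the example
DICTIONARY = {"我", "喜欢", "吃", "苹果"}

def max_match(text, dictionary, max_len=4):
    """Forward maximum matching: prefer the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a last resort
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

sentence = "我喜欢吃苹果"

print(sentence.split())
# ['我喜欢吃苹果']  (splitting on spaces finds nothing to split)

segmented = max_match(sentence, DICTIONARY)
print(segmented)
# ['我', '喜欢', '吃', '苹果']
```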

8. Python Examples

While you can tokenize using Python's built-in split(), a common choice in practice is a dedicated library such as NLTK (Natural Language Toolkit).
python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# NLTK needs its Punkt sentence-tokenizer data; download it once before first use
# nltk.download('punkt')  # recent NLTK releases ask for 'punkt_tab' instead

paragraph = "Hello! Dr. AI is here. Let's learn."

# Sentence Tokenization
sentences = sent_tokenize(paragraph)
print(sentences)
# Output: ['Hello!', 'Dr. AI is here.', "Let's learn."]

# Word Tokenization
words = word_tokenize("Let's learn.")
print(words)
# Output: ['Let', "'s", 'learn', '.']

9. Mini Project

Manual Tokenizer: Look at the phrase: "New York is a great city." If a naive Python script just splits the text at every space, what is the output? What is the problem with this output? *(Answer: ["New", "York", "is", "a", "great", "city."] The problem is that "New York" is a single entity, but it was split into two tokens. Also, the period is attached to "city", making the computer think "city" and "city." are different words).*
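The second problem from the answer, "city" versus "city.", is easy to observe by counting tokens. The sample sentence and variable names below are made up for this sketch, and the regular expression is only a stand-in for a real tokenizer.

```python
import re
from collections import Counter

text = "New York is a great city. I love this city"

# Naive approach: the period stays glued to the first "city"
naive_tokens = text.split(" ")
naive_counts = Counter(naive_tokens)
print(naive_counts["city"], naive_counts["city."])
# 1 1  (the same word is counted under two different keys)

# Separating punctuation first gives a single, consistent key
better_tokens = re.findall(r"\w+|[^\w\s]", text)
better_counts = Counter(better_tokens)
print(better_counts["city"])
# 2
```

Note that neither version keeps "New York" together; recognizing multi-word names needs an extra step beyond punctuation handling.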

10. Best Practices

  • Use dedicated libraries: Never write your own tokenizer using text.split(" "). It will fail on edge cases like abbreviations, contractions, and punctuation. Always use industry-standard tools like NLTK or spaCy.

11. Common Mistakes

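  • Ignoring the language: A tokenizer built for English will perform poorly on other languages. On German it will stumble over different abbreviations and long compound words, and on Chinese it will fail outright because there are no spaces to split on. Always specify the language when the tool lets you; NLTK's sent_tokenize, for example, accepts a language argument.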

12. Exercises

  1. Explain why Sub-word Tokenization (used by modern LLMs) is superior to standard Word Tokenization when dealing with newly invented words or typos.

13. Coding Challenges

Challenge 1: Write pseudocode for how a naive Python script splits text by spaces, and contrast it with an ideal tokenized array.
python
raw_text = "I am happy."

# Naive approach (Fails on punctuation)
naive_tokens = raw_text.split(" ") 
# Output: ["I", "am", "happy."]

# Ideal Tokenizer output
ideal_tokens = ["I", "am", "happy", "."]

14. MCQs with Answers

Question 1

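What is the main problem with splitting a paragraph into sentences by simply looking for periods? *(Answer: Periods also appear inside abbreviations such as "Dr." and "U.S.", so splitting at every period cuts sentences into meaningless fragments.)*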

Question 2

Which tokenization method breaks words into smaller meaningful chunks (e.g., "playing" into "play" and "ing"), and is used by models like ChatGPT? *(Answer: Sub-word Tokenization.)*

15. Interview Questions

  • Q: Explain the difference between Word Tokenization and Character Tokenization. What are the pros and cons of each?
  • Q: Why is tokenizing languages like Chinese or Japanese significantly harder than tokenizing English?

16. FAQs

Q: Do I need to remove punctuation before or after I tokenize? A: Usually, you tokenize first, and the tokenizer will separate the punctuation marks into their own tokens (e.g., ","). Then, you can easily filter those punctuation tokens out of your array.
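A minimal sketch of that "tokenize first, filter second" order is shown below. The token list is written by hand to stand in for tokenizer output, and the rule used here (keep a token only if it contains at least one letter or digit) is one reasonable assumption among several possible filters.

```python
# Hand-written token list standing in for real tokenizer output
tokens = ["Hello", ",", "world", "!", "Let", "'s", "learn", "."]

# Keep tokens that contain at least one letter or digit
words_only = [t for t in tokens if any(ch.isalnum() for ch in t)]
print(words_only)
# ['Hello', 'world', 'Let', "'s", 'learn']
```

Checking for an alphanumeric character, instead of comparing against a fixed list of punctuation marks, keeps tokens such as "'s" that mix punctuation with letters.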

17. Summary

In Chapter 6, we chopped our text into pieces. Tokenization is the process of breaking paragraphs into sentences, and sentences into words or sub-words. This transforms a giant, unreadable block of text into an organized array of "Tokens," which serves as the foundational input for all subsequent NLP analysis and modeling.

18. Next Chapter Recommendation

Our text is chopped up, but our array is full of very common, low-information words like "the", "a", and "is". Proceed to Chapter 7: Stop Words and Text Cleaning to learn how to filter out the noise.
