Skip to main content
System Design – Complete Beginner to Advanced Guide
CHAPTER 13 Intermediate

Search Systems and Indexing

Updated: May 16, 2026
30 min read

# CHAPTER 13

Search Systems and Indexing

1. Introduction

If you have an e-commerce database with 50 million products, and a user types "Wireless Blue Headphones" into the search bar, a standard SQL LIKE '%Wireless Blue Headphones%' query will scan every single row sequentially. This will take minutes, consume 100% of the database CPU, and the user will abandon the site. To search massive datasets in milliseconds, we must abandon traditional relational queries and build specialized Search Architectures. In this chapter, we will master Search Systems and Indexing. We will uncover the magic of the Inverted Index, explore distributed search engines like Elasticsearch, and architect synchronization pipelines to ensure our search engine perfectly mirrors our primary database in real-time.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Explain why standard SQL databases fail at massive Full-Text Search.
  • Understand the core mechanics of an "Inverted Index."
  • Architect a distributed search cluster using Elasticsearch.
  • Design data synchronization pipelines between a primary DB and a Search Engine.
  • Explore basic ranking and relevance scoring (TF-IDF).
Relational databases are built for precise, structured queries (SELECT * WHERE user_id = 5).
  • The Problem: If a user searches for "Running Shoes," but the database entry is "Shoe for Runners," a standard SQL search will return exactly zero results because the string does not match perfectly.
  • The Performance Cost: If you use fuzzy matching (wildcards), the database is forced to perform a "Full Table Scan," reading every single character of every single row. At millions of rows, this is an architectural bottleneck that will bring down the entire system.
A Search Engine (like Google or Elasticsearch) does not scan documents; it scans an index.
  • The Concept: It works exactly like the index at the back of a textbook.
  • The Process: When a product ("Blue Nike Running Shoe") is added to the system, the Search Engine tears the string apart (Tokenization). It creates an index where every unique word points directly to the document IDs containing that word.
  • The Index:
  • Blue -> [Doc 1, Doc 45]
  • Nike -> [Doc 1, Doc 9]
  • Shoe -> [Doc 1, Doc 9, Doc 45]
  • The Speed: When a user searches for "Nike Shoe," the engine doesn't scan millions of products. It instantly looks up Nike and Shoe in the index, finds the intersection (Doc 1, Doc 9), and returns the results in 2 milliseconds.

5. Elasticsearch Architecture

Elasticsearch is the industry-standard distributed search engine.
  • The Structure: It is essentially a massively scalable NoSQL document store explicitly optimized for searching.
  • Distributed by Default: An Elasticsearch cluster is built from dozens of "Nodes" (servers). Indexes are broken into "Shards" and distributed across the cluster, allowing you to search terabytes of text in parallel.
  • Fuzzy Matching: It inherently understands human error. It can handle typos ("Nikke Shos"), synonyms, and stemming (knowing that "Running" and "Run" are the same concept).

6. The Synchronization Pipeline

You cannot use Elasticsearch as your primary, ACID-compliant database. You must run it *alongside* your primary database.
  • The Challenge: When a seller updates the price of a shoe in the primary PostgreSQL database, how does Elasticsearch know about the update?
  • The Solution (Event-Driven): You utilize a Message Broker (Kafka) or a Change Data Capture (CDC) tool (like Debezium).
  • The Workflow: PostgreSQL updates the row. Debezium detects the change in the database log and fires an event to Kafka. A background worker pulls the event from Kafka and instantly updates the JSON document in Elasticsearch. The search is updated with mere milliseconds of latency.

7. Diagrams/Visual Suggestions

*Architecture Diagram: The Search Synchronization Pipeline*
text
123456
[ Client App ] --(Writes Data)--> [ Primary DB (PostgreSQL) ]
                                          |
                                    (CDC / Kafka Pipeline)
                                          |
                                          v
[ Client App ] <--(Reads Search)- [ ELASTICSEARCH CLUSTER ]

8. Best Practices

  • Relevance Scoring: Finding the documents is only half the battle; ordering them is the hard part. Advanced search systems use algorithms like TF-IDF (Term Frequency - Inverse Document Frequency). If a user searches for "The red shoe," the word "shoe" is highly specific and scores heavily, while the word "the" appears in every document and is given a mathematical weight of almost zero.

9. Common Mistakes

  • The Synchronous Dual-Write: A developer tries to keep the DB and Search Engine in sync by writing code that saves to PostgreSQL, and immediately in the next line of code, saves to Elasticsearch. *The Failure:* If Elasticsearch is temporarily down, the entire API request fails, or worse, the DB saves but the Search fails, leaving the systems permanently out of sync. *The Fix:* Always use asynchronous event queues (Kafka/CDC) to sync secondary datastores.

10. Mini Project: Architect a Netflix Search Engine

Let's design the system that finds movies.
  1. 1. The Primary Data: The central movie catalog is stored securely in an SQL database.
  1. 2. The Ingestion: We set up a Kafka pipeline. Every time a new movie is added to SQL, it streams into our Elasticsearch cluster.
  1. 3. The Indexing: Elasticsearch tokenizes the title, cast, and description. It creates complex mappings (e.g., mapping "Sci-Fi" to "Science Fiction").
  1. 4. The Search Bar: A user types "Spcce Wrs" (with typos). The API hits Elasticsearch, which executes a fuzzy search against the Inverted Index, calculates the relevance score in 5 milliseconds, and returns "Star Wars" as the #1 result.

11. Practice Exercises

  1. 1. Define the concept of an "Inverted Index." Explain how it is fundamentally different from a traditional SQL database table scan, and why it allows for millisecond search speeds.
  1. 2. Discuss the architectural necessity of running a dedicated Search Engine (like Elasticsearch) alongside a primary relational database. Why not just use Elasticsearch for everything?

12. MCQs with Answers

Question 1

When building a massive e-commerce search bar, relying on standard SQL LIKE '%search_term%' queries becomes a catastrophic architectural bottleneck because the database is forced to execute a slow "Full Table Scan." What specialized data structure do Search Engines use to solve this problem and return text results in milliseconds?

Question 2

To ensure that an Elasticsearch cluster remains perfectly synchronized with a primary PostgreSQL database without risking synchronous API failures or data loss, what is the industry standard architectural pattern for moving data between the two systems?

13. Interview Questions

  • Q: Explain the mechanical process of "Tokenization" when a document is ingested into a Search Engine. How does the engine handle things like stop words (and, the, a) and stemming (running vs. run)?
  • Q: Walk me through the architectural danger of the "Synchronous Dual-Write" pattern. If your API attempts to simultaneously write data to both a primary SQL database and an Elasticsearch cluster in the same HTTP request, what are the failure modes?
  • Q: You are tasked with building a global product search system for Amazon. Explain how you would utilize Elasticsearch shards to distribute the massive index across multiple physical servers.

14. FAQs

Q: Does adding Elasticsearch mean I have two copies of all my data? A: Yes. In modern distributed systems, data duplication is heavily encouraged. You have the strict, relational copy in SQL (The Source of Truth), and the flattened, heavily indexed JSON copy in Elasticsearch (The Search View). Storage is cheap; CPU scanning time is expensive.

15. Summary

In Chapter 13, we engineered the ability to find a needle in a digital haystack. We recognized that standard relational databases are incapable of massive full-text search, forcing us to adopt specialized architectures. We unlocked the immense speed of the Inverted Index, allowing us to map unstructured text to specific documents in milliseconds. We deployed Elasticsearch clusters, distributing our search capabilities across massive server fleets to handle typos, synonyms, and complex relevance scoring. Finally, we architected resilient, event-driven synchronization pipelines, guaranteeing our lightning-fast search engines perfectly mirror the rigid reality of our primary databases.

16. Next Chapter Recommendation

Our systems can store files, search text, and process transactions. But how do we push data instantly to a user's screen without them refreshing the page? Proceed to Chapter 14: Real-Time Systems Design.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·