
MongoDB Embedding vs Referencing | Denormalization

Updated: May 16, 2026
15 min read

# CHAPTER 14

Embedding vs Referencing Documents

1. Introduction

One of the most consequential architectural decisions you will make in MongoDB is whether to Embed data (store it inside the parent document) or Reference data (store it in another collection and link to it by ID). There is no "perfect" answer; it is a balance between read performance on one side and write performance and data consistency on the other. In this chapter, we will analyze the trade-offs of both approaches by architecting a real-world e-commerce product system.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Define Embedding (Denormalization).
  • Define Referencing (Normalization).
  • Identify the pros and cons of both architectures.
  • Understand the problem of "Data Duplication" and "Update Anomalies".
  • Make confident architectural decisions for real-world schemas.

3. The Embedded Approach (Denormalization)

Embedding means putting all related data directly inside a single document.
javascript
// Collection: orders
{
  "_id": 101,
  "user": { "name": "John Doe", "email": "john@example.com" }, // Embedded!
  "product": { "id": 55, "name": "Laptop", "price": 999 },     // Embedded!
  "total": 999
}

Pros:

  • Blazing Fast Reads: To display an Order Receipt page, the backend executes exactly 1 query, and that single document already contains everything the page needs.

Cons:

  • Data Duplication: If John places 50 separate orders, his name and email are duplicated 50 times across 50 different documents.
  • Update Anomalies: If John changes his email address, updating a single users document is not enough. You must also run an updateMany() that finds and rewrites all 50 of his historical order documents, and until it finishes, some orders still show the old email!
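
To make the update anomaly concrete, here is a minimal sketch in plain JavaScript. The in-memory `orders` array only stands in for the real collection, and `buildEmailFanOut` and `applyFanOut` are hypothetical helpers (not MongoDB APIs). The `filter` and `update` objects are the kind of arguments you would pass to `db.orders.updateMany()` in mongosh.

```javascript
// Hypothetical helper: builds the two arguments for db.orders.updateMany().
function buildEmailFanOut(oldEmail, newEmail) {
  return {
    filter: { "user.email": oldEmail },           // match on the embedded field
    update: { $set: { "user.email": newEmail } }, // rewrite the embedded copy
  };
}

// In-memory stand-in for the orders collection: 50 orders that each embed John.
const orders = Array.from({ length: 50 }, (_, i) => ({
  _id: 101 + i,
  user: { name: "John Doe", email: "john@example.com" },
  total: 999,
}));

// Simulates the work updateMany() must do: touch every matching document.
function applyFanOut(docs, oldEmail, newEmail) {
  let modified = 0;
  for (const doc of docs) {
    if (doc.user.email === oldEmail) {
      doc.user.email = newEmail;
      modified += 1;
    }
  }
  return modified;
}

const { filter, update } = buildEmailFanOut("john@example.com", "john.doe@example.com");
// Against a real database: db.orders.updateMany(filter, update)
const modifiedCount = applyFanOut(orders, "john@example.com", "john.doe@example.com");
console.log(modifiedCount); // 50
```

One email change rewrites 50 documents here; with referencing, the same change would touch one.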

4. The Referenced Approach (Normalization)

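Referencing means splitting the data into separate collections and linking them by _id, typically an ObjectId. This resembles a SQL Foreign Key, with one important difference: MongoDB does not enforce the link, so your application is responsible for keeping references valid.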
javascript
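// Note: ObjectId("User_John") and similar IDs are readable placeholders.
// A real ObjectId is a 24-character hex string.
// Collection: users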
{ "_id": ObjectId("User_John"), "name": "John Doe", "email": "john@example.com" }

// Collection: products
{ "_id": ObjectId("Prod_Laptop"), "name": "Laptop", "price": 999 }

// Collection: orders
{
  "_id": 101,
  "user_id": ObjectId("User_John"),      // Reference!
  "product_id": ObjectId("Prod_Laptop"), // Reference!
  "total": 999
}

Pros:

  • Clean Updates: If John changes his email, you run one simple updateOne() on the users collection. The orders collection doesn't need to change because it just points to John's ID.

Cons:

  • Slower Reads: To display the Order Receipt page, the backend must execute 3 separate queries (Get Order, Get User, Get Product). Each query is an extra network round trip, so latency increases.
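
The three-query read path can be sketched as follows. This is an in-memory simulation, not driver code: the `Map` objects stand in for the three collections, the string IDs are placeholders for real ObjectIds, and `findById` and `loadReceipt` are hypothetical helpers. Each `findById` call models one `findOne({ _id })` round trip.

```javascript
// In-memory stand-ins for the three collections, keyed by _id.
const users = new Map([
  ["User_John", { _id: "User_John", name: "John Doe", email: "john@example.com" }],
]);
const products = new Map([
  ["Prod_Laptop", { _id: "Prod_Laptop", name: "Laptop", price: 999 }],
]);
const orders = new Map([
  [101, { _id: 101, user_id: "User_John", product_id: "Prod_Laptop", total: 999 }],
]);

let queryCount = 0;

// Stand-in for a findOne({ _id }) call; each call counts as one round trip.
function findById(collection, id) {
  queryCount += 1;
  return collection.has(id) ? collection.get(id) : null;
}

// Assembles the receipt by following the references stored on the order.
function loadReceipt(orderId) {
  const order = findById(orders, orderId);
  if (order === null) return null;
  const user = findById(users, order.user_id);
  const product = findById(products, order.product_id);
  return {
    orderId: order._id,
    customer: user.name,
    email: user.email,
    item: product.name,
    total: order.total,
  };
}

const receipt = loadReceipt(101);
console.log(receipt.customer, receipt.item, queryCount); // John Doe Laptop 3
```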

5. The Hybrid Approach (The Industry Standard)

In the real world, architects use a Hybrid approach. You reference the data, but you embed a *snapshot* of the data that the application needs immediately.
javascript
// Collection: orders
{
  "_id": 101,
  "user_id": ObjectId("User_John"), 
  // We EMBED the snapshot of the price at the time of checkout!
  "snapshot_product": { 
      "id": ObjectId("Prod_Laptop"), 
      "name": "Laptop", 
      "price_at_checkout": 999 
  }
}

Why this is brilliant: If the live Laptop price increases to $1200 tomorrow, John's historical receipt must still say $999. Embedding the snapshot preserves historical accuracy while referencing maintains overall database sanity!
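
A minimal sketch of the snapshot idea, in plain JavaScript: `buildOrder` is a hypothetical helper that copies the product fields into the order at checkout time, and the string IDs are placeholders for real ObjectIds. Because the price is copied by value, a later change to the live product does not alter the order.

```javascript
// Live catalog entry; the string _id is a placeholder for a real ObjectId.
const product = { _id: "Prod_Laptop", name: "Laptop", price: 999 };

// Hypothetical helper: copies the fields the receipt needs at checkout time.
function buildOrder(orderId, userId, liveProduct) {
  return {
    _id: orderId,
    user_id: userId, // still a reference to the users collection
    snapshot_product: {
      id: liveProduct._id,
      name: liveProduct.name,
      price_at_checkout: liveProduct.price, // copied by value, not linked
    },
  };
}

const order = buildOrder(101, "User_John", product);

// The catalog price rises the next day...
product.price = 1200;

// ...but the historical receipt is unchanged.
console.log(order.snapshot_product.price_at_checkout); // 999
```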

6. The Decision Matrix (When to use which?)

Ask yourself these questions when designing a schema:
  1. Does the data grow infinitely? (e.g., Server Logs).
     -> *REFERENCE it. Embedding will hit the 16MB limit.*
  2. Is the data accessed together 99% of the time? (e.g., A blog post and its title).
     -> *EMBED it. Maximize read performance.*
  3. Does the data change constantly? (e.g., A live stock price).
     -> *REFERENCE it. Duplicating rapidly changing data will destroy write performance.*
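
The three questions can be written down as a small function. This is only a sketch: `recommendModel` is a hypothetical helper, and the order in which it checks the questions (unbounded growth first, then constant change, then co-access) is an assumption, since the matrix above does not say which answer wins when they conflict.

```javascript
// Hypothetical helper encoding the three questions from the decision matrix.
// Assumed precedence: the two "reference" conditions are checked before "embed".
function recommendModel({ growsWithoutBound, changesConstantly, readTogether }) {
  if (growsWithoutBound) return "reference"; // would eventually hit the 16MB limit
  if (changesConstantly) return "reference"; // duplicated copies would need constant rewrites
  if (readTogether) return "embed";          // one query returns everything
  return "either";                           // no strong signal from these three questions
}

// The three examples from the matrix:
console.log(recommendModel({ growsWithoutBound: true, changesConstantly: false, readTogether: false })); // reference
console.log(recommendModel({ growsWithoutBound: false, changesConstantly: false, readTogether: true })); // embed
console.log(recommendModel({ growsWithoutBound: false, changesConstantly: true, readTogether: true }));  // reference
```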

7. Mini Project: Ecommerce Product System

Let's design the products collection. A product has 5 Images, and belongs to 1 Category.
  • Images: There are only 5. They are only ever viewed when looking at the product. They don't change often. -> EMBED an array of URLs.
  • Category: A category name might change (for example, from "Electronics" to "Tech"), and millions of products share the same category. -> REFERENCE the Category ID.
javascript
{
  "_id": ObjectId("..."),
  "name": "Wireless Mouse",
  "category_id": ObjectId("Cat_Electronics"), // Referenced
  "images": [ // Embedded
    "url.com/img1.jpg", 
    "url.com/img2.jpg"
  ]
}

8. Common Mistakes

  • Fear of Duplication: Developers coming from strict SQL backgrounds are often wary of data duplication. In MongoDB, duplicating a user's username across 100 comments to avoid a join on every read is a common and accepted trade-off when the value rarely changes. Storage is cheap; an extra lookup on every read is not.

9. Best Practices

  • Analyze the Read/Write Ratio: If a field is read 10,000 times a second but updated once a year (like a username), embed and duplicate it everywhere. The massive speed boost on Reads heavily outweighs the annoyance of running an updateMany() once a year.
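
A rough back-of-the-envelope version of this argument is sketched below. The cost model is an assumption made for illustration: it counts document operations per year, charges an embedded read 1 operation and a referenced read 2 (the document plus one lookup), and charges an embedded update one write per duplicated copy. The figure of 100 copies is also assumed; `yearlyOps` is a hypothetical helper.

```javascript
const SECONDS_PER_YEAR = 365 * 24 * 60 * 60; // 31,536,000

// Assumed cost model: count document operations per year for each approach.
function yearlyOps({ readsPerSecond, updatesPerYear, copies }) {
  const reads = readsPerSecond * SECONDS_PER_YEAR;
  return {
    embedded: reads * 1 + updatesPerYear * copies, // cheap reads, fan-out on update
    referenced: reads * 2 + updatesPerYear * 1,    // extra lookup per read, single update
  };
}

const ops = yearlyOps({ readsPerSecond: 10000, updatesPerYear: 1, copies: 100 });
console.log(ops.embedded);                  // 315360000100
console.log(ops.referenced);                // 630720000001
console.log(ops.embedded < ops.referenced); // true
```

Under these assumptions the yearly fan-out adds 100 operations, while the extra lookup on every read adds hundreds of billions.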

10. Exercises

  1. What is the MongoDB term for putting all related data directly inside a single document (Denormalization)?
  2. If a document embeds an array of data that grows by 100 items per minute, what strict MongoDB limitation will the document eventually hit, causing further writes to it to fail?

11. MongoDB Challenges

You are designing a Twitter clone. A "User" document has a list of "Followers". Some famous users have 50 million followers. Should the followers be Embedded as an array inside the User document, or Referenced in a separate collection? State your reasoning. *(Answer: They MUST be Referenced in a separate collection. An array of 50 million IDs embedded inside a single document will vastly exceed the 16MB document size limit).*
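
The arithmetic behind that answer can be checked with a short calculation. It uses two facts: an ObjectId is 12 bytes, and the maximum BSON document size is 16 mebibytes. The result is a lower bound, because it ignores the per-element overhead of a BSON array and every other field in the User document.

```javascript
const MAX_BSON_DOCUMENT_BYTES = 16 * 1024 * 1024; // 16,777,216
const OBJECT_ID_BYTES = 12;
const followers = 50 * 1000 * 1000; // 50 million

// Lower bound: raw ID bytes only, ignoring array overhead and other fields.
const rawIdBytes = followers * OBJECT_ID_BYTES;

// Upper bound on how many IDs could fit even in an otherwise empty document.
const maxIdsThatCouldFit = Math.floor(MAX_BSON_DOCUMENT_BYTES / OBJECT_ID_BYTES);

console.log(rawIdBytes);                           // 600000000
console.log(maxIdsThatCouldFit);                   // 1398101
console.log(rawIdBytes > MAX_BSON_DOCUMENT_BYTES); // true
```

Even under this generous estimate, 50 million followers need roughly 600 MB, and fewer than 1.4 million IDs could ever fit in one document.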

12. MCQ Quiz with Answers

Question 1

What is a significant drawback (Con) of using the Embedded (Denormalized) approach in MongoDB?

Question 2

When architecting an Order Receipt in an E-commerce database, why is it an industry standard to Embed a "Snapshot" of the product's price directly into the Order document, rather than Referencing the live Product collection?

13. Interview Questions

  • Q: You are tasked with migrating a highly normalized MySQL database to MongoDB. The Lead Engineer suggests keeping all 50 tables exactly as they are and just using ObjectIds to link them. Provide an architectural argument against this approach.
  • Q: Explain the Decision Matrix for Embedding vs. Referencing. What three questions should an architect ask themselves before finalizing a schema?

14. FAQs

Q: If I use Referencing, how do I do a SQL JOIN to get the data?
A: You can do a "Manual Join" in your backend code (Fetch Order -> Extract ID -> Fetch User). MongoDB also provides $lookup, an Aggregation Framework stage that performs a left outer join against another collection in the same database. We will cover this in Chapter 16.
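
As a brief preview, a `$lookup` pipeline for the referenced orders schema from this chapter might look like the sketch below. `buildReceiptPipeline` is a hypothetical helper that only builds the pipeline array; the collection and field names (`users`, `products`, `user_id`, `product_id`) are taken from the earlier examples.

```javascript
// Hypothetical helper: builds an aggregation pipeline for one order receipt.
function buildReceiptPipeline(orderId) {
  return [
    { $match: { _id: orderId } },
    { $lookup: { from: "users", localField: "user_id", foreignField: "_id", as: "user" } },
    { $lookup: { from: "products", localField: "product_id", foreignField: "_id", as: "product" } },
    // $lookup always produces an array; $unwind flattens the single match.
    // Note: $unwind drops the order if a lookup found nothing.
    { $unwind: "$user" },
    { $unwind: "$product" },
  ];
}

const pipeline = buildReceiptPipeline(101);
// In mongosh: db.orders.aggregate(pipeline)
console.log(pipeline.length); // 5
```

This replaces three round trips with one, although the server still performs the joins.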

15. Summary

Schema design is the hardest part of NoSQL. By understanding that Embedding maximizes Read performance (at the cost of duplication), and Referencing maximizes Update performance (at the cost of slow reads), you can make intelligent, application-driven architectural decisions that balance speed, integrity, and scalability.

16. Next Chapter Recommendation

We have mastered queries, updates, and schema architecture. Now, we must analyze the data. How do we calculate total revenue, or find the average age of users? In Chapter 15: Aggregation Framework Basics, we will unlock MongoDB's most powerful analytical engine.
