Databases and Data Storage
# CHAPTER 4
Databases and Data Storage
1. Chapter Introduction
If servers are the brain of your application, databases are the memory. In a system design interview, the database architecture is often the most scrutinized component. Choosing the wrong data storage solution leads to data corruption, massive latency, and system failure under load. This chapter explores the foundational concepts of Relational Databases (RDBMS), explaining how they guarantee data integrity through ACID properties and how Indexing provides lightning-fast search capabilities.2. What is a Relational Database?
A Relational Database Management System (RDBMS) stores data in tables consisting of rows and columns.-
Tables: Collections of related data (e.g.,
Users,Orders).
- Rows (Records): A single entity (e.g., John Doe's user profile).
-
Columns (Fields): Attributes of the entity (e.g.,
Email,CreatedAt).
-
Primary Key: A unique identifier for every row (e.g.,
UserID: 123).
*Popular RDBMS:* MySQL, PostgreSQL, Oracle, SQL Server.
*The defining feature:* Data is "Relational." An Order row contains a UserID (Foreign Key) that links it directly to the Users table.
3. ACID Properties (The Guarantee)
Relational databases are famous because they guarantee data validity despite errors, power failures, or network crashes. They do this via ACID properties.- 1. Atomicity: Transactions are "all or nothing." If you transfer $100 from Account A to Account B, and the server crashes mid-transfer, Atomicity ensures the $100 isn't permanently deducted from Account A without being added to Account B. It rolls back the transaction entirely.
- 2. Consistency: The database ensures that any transaction brings the database from one valid state to another. Constraints (like "Account Balance cannot be negative") are enforced.
- 3. Isolation: Concurrent transactions occur as if they are executed sequentially. If User 1 and User 2 try to buy the last concert ticket at the exact same millisecond, Isolation ensures they don't both get it.
- 4. Durability: Once a transaction is committed, it will remain in the system, even in the event of a total system failure (e.g., power loss). Data is written to non-volatile storage (disk).
*Interview Tip:* If an interviewer asks you to design a Financial, Banking, or E-commerce billing system, you MUST choose a database with strict ACID compliance (usually an RDBMS).
4. Database Transactions
A transaction is a sequence of database operations treated as a single logical unit of work. *Example Workflow:*-
1.
BEGIN TRANSACTION;
- 2. Update User A's balance (-$100).
- 3. Update User B's balance (+$100).
-
4.
COMMIT;(If everything succeeded, write to disk permanently).
-
5.
ROLLBACK;(If any step failed, undo all changes).
5. Indexing (The Secret to Speed)
Imagine a library with 1 million books thrown randomly on the floor. If you want to find a book by Stephen King, you have to look at every single book one by one. This is a Full Table Scan (O(N) time complexity). It is devastatingly slow.An Index is like the card catalog in the library. It creates an organized, sorted data structure (usually a B-Tree) that holds a copy of specific columns (e.g., AuthorName) and a pointer to the original row on the disk.
Searching a B-Tree Index takes O(log N) time. It is lightning fast.
6. The Tradeoffs of Indexing
If indexes make searching so fast, why don't we index every single column in the database?- Slower Writes: Every time you INSERT, UPDATE, or DELETE a row, the database must also update every associated index.
- Storage Space: Indexes consume additional hard drive space.
WHERE, ORDER BY, or JOIN clauses. Do not index columns that are frequently updated but rarely searched.
7. HR/Interview Perspective: The Schema Design Test
In an interview, you will be asked to draw the Database Schema. If asked to design an Instagram clone, you must draw the tables and show relationships:-
Users(id, username, email)
-
Posts(id, userid, imageurl, createdat)
-
Follows(followerid, followeeid)
8. Real-World Scenario: The Missing Index
*Scenario:* An application has aUsers table with 50 million rows. Users log in using their email address. The login API suddenly starts timing out, taking 15 seconds to respond.
*The Problem:* The engineer forgot to put an index on the Email column. Every time a user tried to log in, the database was performing a Full Table Scan of 50 million rows to find the matching email.
*The Fix:* The DBA runs CREATE INDEX idxemail ON Users(Email);. The query time instantly drops from 15 seconds to 2 milliseconds.
9. Mini Project: Design a Banking Schema
On a piece of paper, design the basic schema for a bank.-
1.
Create a
Userstable.
-
2.
Create an
Accountstable (checking, savings).
-
3.
Create a
Transactionstable to track the movement of money.
10. Common Mistakes
- Using strings as Primary Keys: Storing a 50-character string (like an email) as a primary key makes indexing and joining incredibly slow. Use auto-incrementing Integers or UUIDs as primary keys.
- Over-indexing: Creating 15 indexes on a single table. Writes will grind to a halt because the database has to update 15 B-Trees for every single row insertion.
11. Best Practices
- Normalization (Up to 3NF): Design your tables to avoid data duplication. If you have a customer's address, it should exist in one place, not repeated across 50 of their orders.
-
Soft Deletes: Instead of permanently deleting user data (
DELETE FROM Users), add anis_deletedboolean column. This prevents breaking foreign key constraints and preserves historical data.
12. Exercises
- 1. Explain the "A" in ACID using the analogy of buying a cup of coffee.
-
2.
Write a SQL query to find all users in a
Userstable who signed up after '2023-01-01'. Which column should be indexed to make this fast?
13. MCQs
What defines a "Relational" database?
What does the acronym ACID stand for in database systems?
Which ACID property ensures that a transaction is "all or nothing," preventing partial updates if a crash occurs mid-transaction?
If you are asked to design a high-stakes financial ledger or billing system, what type of database property is mandatory?
What happens when a database performs a "Full Table Scan"?
How does a Database Index improve search performance?
What is the major tradeoff (downside) of adding too many Indexes to a table?
Why is using an auto-incrementing Integer or UUID better for a Primary Key than using an Email Address?
What does the "Durability" property of ACID guarantee?
What is "Database Normalization"?
14. Interview Questions
- Q: "You have a table with 100 million rows representing IoT sensor data. Your SELECT queries are timing out. Walk me through exactly how you would debug and fix this."
15. FAQs
- Q: Are relational databases obsolete now that NoSQL exists?