Skip to main content
Pandas & NumPy
CHAPTER 17 Beginner

Merging and Joining DataFrames

Updated: May 18, 2026
5 min read

# CHAPTER 17

Merging and Joining DataFrames

1. Chapter Introduction

Real data rarely lives in one table. Customer records, orders, products, and locations each live in separate datasets. Pandas merge, join, and concat replicate all SQL join types, enabling powerful multi-dataset analysis.

2. pd.concat — Stacking DataFrames

python
123456789101112131415
import pandas as pd

# Vertical concat (append rows)
q1 = pd.DataFrame({'Product': ['Laptop','Phone'], 'Revenue': [12000, 4500], 'Quarter': 'Q1'})
q2 = pd.DataFrame({'Product': ['Laptop','Desk'],  'Revenue': [14000, 2100], 'Quarter': 'Q2'})
q3 = pd.DataFrame({'Product': ['Phone','Laptop'], 'Revenue': [5200, 13500], 'Quarter': 'Q3'})

all_quarters = pd.concat([q1, q2, q3], ignore_index=True)
print(all_quarters)

# Horizontal concat (add columns)
df1 = pd.DataFrame({'Name': ['Alice','Bob','Carol']})
df2 = pd.DataFrame({'Salary': [85000,72000,91000], 'Dept': ['Eng','Mkt','Eng']})
combined = pd.concat([df1, df2], axis=1)
print(combined)

3. merge — SQL-Style Joins

python
12345678910111213141516171819202122232425262728293031
employees = pd.DataFrame({
    'EmpID': [1, 2, 3, 4, 5],
    'Name':  ['Alice', 'Bob', 'Carol', 'David', 'Eve'],
    'DeptID': [10, 20, 10, 30, 40]
})

departments = pd.DataFrame({
    'DeptID': [10, 20, 30],
    'DeptName': ['Engineering', 'Marketing', 'Sales'],
    'Location': ['NYC', 'LA', 'Chicago']
})

# INNER JOIN (only matching rows)
inner = pd.merge(employees, departments, on='DeptID', how='inner')
print("INNER JOIN:")
print(inner)  # 4 rows — Eve's dept (40) has no match, dropped

# LEFT JOIN (all left rows, NaN for unmatched right)
left = pd.merge(employees, departments, on='DeptID', how='left')
print("\nLEFT JOIN:")
print(left)   # 5 rows — Eve has NaN for DeptName, Location

# RIGHT JOIN (all right rows)
right = pd.merge(employees, departments, on='DeptID', how='right')
print("\nRIGHT JOIN:")
print(right)

# OUTER JOIN (all rows from both)
outer = pd.merge(employees, departments, on='DeptID', how='outer')
print("\nOUTER JOIN:")
print(outer)

4. Merge on Different Column Names

python
123456789101112131415161718
orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 104],
    'CustomerID': ['C01', 'C02', 'C01', 'C03'],
    'Amount': [500, 1200, 800, 350]
})

customers = pd.DataFrame({
    'CustID': ['C01', 'C02', 'C04'],
    'Name': ['Alice', 'Bob', 'David'],
    'City': ['NYC', 'LA', 'Chicago']
})

# Different key column names
result = pd.merge(orders, customers,
                  left_on='CustomerID',
                  right_on='CustID',
                  how='left')
print(result)

5. Multi-Table Real-World Example

python
1234567891011121314151617181920212223242526272829303132333435
# E-commerce dataset
products = pd.DataFrame({
    'ProductID': [1, 2, 3, 4],
    'Name': ['Laptop', 'Phone', 'Desk', 'Monitor'],
    'Category': ['Electronics', 'Electronics', 'Furniture', 'Electronics'],
    'Price': [1200, 500, 450, 300]
})

orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 104, 105, 106],
    'CustomerID': [1, 2, 1, 3, 2, 1],
    'ProductID': [1, 2, 3, 1, 4, 2],
    'Quantity': [1, 2, 1, 1, 3, 1],
    'Date': pd.date_range('2024-01', periods=6, freq='W')
})

customers = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Carol'],
    'City': ['NYC', 'LA', 'Chicago']
})

# Multi-table merge (chain merges)
analysis = (orders
    .merge(products, on='ProductID')
    .merge(customers, on='CustomerID')
    .assign(Revenue=lambda df: df['Quantity'] * df['Price'])
)

print("Order Analysis:")
print(analysis[['OrderID', 'Name_y', 'Name_x', 'Quantity', 'Revenue', 'Date']])

# Revenue per customer
print("\nRevenue per Customer:")
print(analysis.groupby('Name_y')['Revenue'].sum())

6. Common Mistakes

  • Duplicate column names after merge: When both DataFrames have a column 'Name', Pandas adds suffixes: Namex, Namey. Rename before or after merging.
  • Many-to-many merges create row explosion: Merging on a non-unique key in BOTH DataFrames multiplies rows. Always check key uniqueness before merging.

7. MCQs

Question 1

pd.concat([df1,df2], axis=0) does?

Question 2

INNER JOIN keeps?

Question 3

LEFT JOIN keeps?

Question 4

lefton/righton is used when?

Question 5

After merge, Namex and Namey appear because?

Question 6

how='outer' keeps?

Question 7

ignoreindex=True in concat?

Question 8

Many-to-many merge risk?

Question 9

validate='onetoone' in merge?

Question 10

pd.merge(df1, df2, on='ID') default join type?

8. Interview Questions

  • Q: What is the difference between merge, join, and concat in Pandas?
  • Q: How do you perform a left join in Pandas?

9. Summary

concat stacks DataFrames. merge performs SQL-style joins (inner, left, right, outer). left
on/right_on handle different key names. Chain merges for multi-table analysis. Watch for duplicate column names (suffixes) and many-to-many row explosions.

10. Next Chapter Recommendation

In Chapter 18: Working with Dates and Time Series, we master datetime parsing, date-based filtering, resampling, and time series analysis.

Finish this Chapter

Save your progress on your learning path and prepare for coding interview challenges.

Discussion

Join the discussion

Log in or create a free account to participate.

Sort: ·