PyTorch Essentials
CHAPTER 10 Intermediate

PyTorch Datasets and DataLoaders

Updated: May 16, 2026

1. Introduction

If you have a 1 GB dataset of images, you can load it into RAM and pass it through your model in a single call (model(X_train)). But what if your dataset is 500 terabytes of driving video? It will not fit in memory, and trying to load it all at once will exhaust your RAM long before training starts. Instead, you must "stream" the data from your hard drive to your GPU in small chunks called batches. PyTorch provides two classes for exactly this: Dataset and DataLoader. In this chapter, we will use them to build industrial-grade data pipelines.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Explain the concept of Batch Training.
  • Subclass torch.utils.data.Dataset to create a custom dataset.
  • Understand the __len__ and __getitem__ methods.
  • Use torch.utils.data.DataLoader to automatically batch and shuffle data.
  • Integrate DataLoaders into the PyTorch Training Loop.

3. The Dataset Class

The Dataset class is a blueprint. It tells PyTorch *where* your data is and *how* to grab a single item from it. To create a custom dataset, you subclass Dataset and override three methods (strictly, the base class only requires __getitem__, but DataLoader's default batching and shuffling also need __len__):
  1. __init__: Runs once, when the dataset object is created. You load your CSV file or define your image folders here.
  2. __len__: Returns the total number of items in the dataset.
  3. __getitem__: The magic function. Given an index (like 5), it returns exactly one piece of data (the row at index 5, which is the sixth row because indexing starts at 0) and its label.
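
The three methods above can also be sketched in a lazy-loading form, where __init__ only records file paths and __getitem__ reads one file from disk when it is asked for. This is a minimal sketch, not the chapter's own code: the class name LazyNpyDataset, the one-.npy-file-per-sample layout, and the temporary directory are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import Dataset


class LazyNpyDataset(Dataset):
    """Hypothetical dataset: one .npy feature file per sample, loaded on demand."""

    def __init__(self, file_paths, labels):
        # Only store paths and labels here; no sample data is read yet.
        self.file_paths = list(file_paths)
        self.labels = list(labels)

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        # Read exactly one sample from disk when it is requested.
        features = torch.from_numpy(np.load(self.file_paths[idx])).float()
        label = torch.tensor(self.labels[idx], dtype=torch.long)
        return features, label


# Demo setup (assumption): write 10 tiny sample files to a temporary directory.
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(10):
    path = os.path.join(tmp_dir, f"sample_{i}.npy")
    np.save(path, np.full(3, float(i)))
    paths.append(path)

lazy_dataset = LazyNpyDataset(paths, labels=[i % 2 for i in range(10)])
features, label = lazy_dataset[5]  # index 5 is the sixth sample (zero-based)
print(len(lazy_dataset), features, label)
```

Because nothing is read until __getitem__ is called, memory use depends on how many samples are requested at a time rather than on the total size of the dataset.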

4. Building a Custom Dataset

Let's build a dataset that wraps around a simple Pandas DataFrame.
python
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np

class CustomTabularDataset(Dataset):
    def __init__(self, csv_file):
        # 1. Load the data
        # In a real scenario, you'd use pd.read_csv(csv_file)
        self.data = pd.DataFrame({
            "Feature1": np.random.rand(100),
            "Feature2": np.random.rand(100),
            "Label": np.random.randint(0, 2, 100) # Binary 0 or 1
        })
        
        # Convert to PyTorch Tensors
        self.X = torch.tensor(self.data[["Feature1", "Feature2"]].values, dtype=torch.float32)
        self.y = torch.tensor(self.data["Label"].values, dtype=torch.long)

    def __len__(self):
        # 2. Return total rows
        return len(self.data)

    def __getitem__(self, idx):
        # 3. Return ONE specific row and its label
        features = self.X[idx]
        label = self.y[idx]
        return features, label

# Instantiate the dataset
my_dataset = CustomTabularDataset("dummy_path.csv")
print(f"Total items: {len(my_dataset)}")

5. The DataLoader Class

The Dataset only grabs one item at a time. The DataLoader acts as a manager. You hand it your Dataset, and it automatically:
  • Grabs a fixed number of items at a time (for example, batch_size=32).
  • Stacks them into a single batch Tensor.
  • Shuffles the data randomly every epoch (shuffle=True).
  • Can use multiple worker processes to load data in the background (for example, num_workers=4).
python
# Create the DataLoader
# It will group our 100 rows into batches of 16
train_loader = DataLoader(dataset=my_dataset, batch_size=16, shuffle=True)

# Let's look at one batch
# iter() and next() allow us to grab a single batch from the loader
features_batch, labels_batch = next(iter(train_loader))

print("Features Batch Shape:", features_batch.shape) 
# Output: torch.Size([16, 2]) -> 16 rows, 2 features!
print("Labels Batch Shape:", labels_batch.shape)
# Output: torch.Size([16])
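
With 100 rows and batch_size=16, the rows do not divide evenly, so the final batch of each epoch is smaller than the others. The sketch below works through that arithmetic. It uses a stand-in TensorDataset of random values instead of the chapter's my_dataset (an assumption, so that the snippet runs on its own).

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data (assumption): 100 rows, 2 features, binary labels.
X = torch.rand(100, 2)
y = torch.randint(0, 2, (100,))
dataset = TensorDataset(X, y)

loader = DataLoader(dataset, batch_size=16, shuffle=True)
batch_sizes = [features.shape[0] for features, _ in loader]

print(len(loader))                    # number of batches per epoch
print(math.ceil(len(dataset) / 16))   # same value, computed by hand
print(batch_sizes)                    # six full batches, then the leftover rows

# drop_last=True discards the incomplete final batch.
even_loader = DataLoader(dataset, batch_size=16, shuffle=True, drop_last=True)
print(len(even_loader))
```

Here len(loader) is 7 (six batches of 16 plus one batch of 4), and the drop_last=True loader yields 6 batches.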

6. Mini Project: The Batched Training Loop

Now that we have a DataLoader, our Training Loop from Chapter 9 changes slightly. We must add a second for loop to iterate through the batches!
python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

epochs = 5

for epoch in range(epochs):
    model.train()
    running_loss = 0.0
    
    # NEW: Loop through the DataLoader!
    # Each 'X_batch' holds up to 16 rows (the last of the 7 batches holds only 4)
    for X_batch, y_batch in train_loader:
        
        # The 5 standard training steps
        optimizer.zero_grad()
        predictions = model(X_batch)
        loss = criterion(predictions, y_batch)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        
    # Calculate average loss for the entire epoch
    avg_loss = running_loss / len(train_loader)
    print(f"Epoch {epoch+1} completed. Average Loss: {avg_loss:.4f}")
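
Because the last batch can be smaller than the rest, averaging the per-batch losses gives the samples in that batch slightly more weight. One possible refinement is to weight each batch loss by its size. The sketch below assumes stand-in random data, and the helper name train_one_epoch is hypothetical.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in data (assumption): 100 rows, 2 features, binary labels.
dataset = TensorDataset(torch.rand(100, 2), torch.randint(0, 2, (100,)))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
criterion = nn.CrossEntropyLoss()  # returns the mean loss over the batch
optimizer = optim.Adam(model.parameters(), lr=0.01)


def train_one_epoch(model, loader, criterion, optimizer):
    """Hypothetical helper: run one epoch and return the per-sample mean loss."""
    model.train()
    total_loss = 0.0
    total_samples = 0
    for X_batch, y_batch in loader:
        optimizer.zero_grad()
        loss = criterion(model(X_batch), y_batch)
        loss.backward()
        optimizer.step()
        # Undo the per-batch mean so every sample counts equally.
        total_loss += loss.item() * X_batch.size(0)
        total_samples += X_batch.size(0)
    return total_loss / total_samples


epoch_loss = train_one_epoch(model, loader, criterion, optimizer)
print(f"Per-sample average loss: {epoch_loss:.4f}")
```

For a dataset this small the difference from the simple per-batch average is minor. It matters more when the last batch is much smaller than the others.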

7. Common Mistakes

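  • Forgetting shuffle=True: If you load a dataset where all the "Cats" are in the first 500 rows and all the "Dogs" are in the last 500 rows, and you don't shuffle, then with batch_size=32 the neural network will see nothing but Cats for the first 15 batches. Its weights will drift toward predicting Cats and then swing toward Dogs, instead of learning both classes together. Always shuffle the Training set! (Note: You do *not* need to shuffle the Test set, because evaluation results do not depend on sample order.)
  • Batch Size too Large: If you set batch_size=10000, the GPU must hold 10,000 items and their intermediate activations in memory at once, which can fail with an "Out of Memory" (OOM) error for large inputs or models. Start with a moderate size such as 32, 64, or 128 (powers of 2 are a common convention, not a requirement) and increase it only while memory allows.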
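
For evaluation, the usual pattern is a second DataLoader with shuffle=False. A minimal sketch follows, assuming a held-out set of random stand-in data and an untrained model, so the accuracy value itself is meaningless here.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in test data (assumption): 40 rows, 2 features, binary labels.
test_dataset = TensorDataset(torch.rand(40, 2), torch.randint(0, 2, (40,)))
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))

model.eval()                 # switch layers such as dropout to inference mode
correct = 0
with torch.no_grad():        # no gradients are needed for evaluation
    for X_batch, y_batch in test_loader:
        predicted = model(X_batch).argmax(dim=1)
        correct += (predicted == y_batch).sum().item()

accuracy = correct / len(test_dataset)
print(f"Accuracy: {accuracy:.2f}")
```

Dividing by len(test_dataset) rather than by the number of batches keeps the metric correct even though the last batch holds only 8 rows.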

8. Best Practices

  • Use num_workers: By default (num_workers=0), DataLoader loads data in the main process, the same one that runs your training loop. If the GPU finishes training Batch 1 in 0.1 seconds, it then sits idle while the CPU loads Batch 2. Set num_workers=2 or 4 to have worker processes prepare the next batches in the background while the GPU is working!
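
A minimal sketch of enabling background loading is shown below. The helper name build_loader and the stand-in random data are assumptions. The best worker count depends on your machine and on how expensive each __getitem__ call is, so treat 2 as a starting point rather than a recommendation.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loader(dataset, num_workers):
    """Hypothetical helper: the same loader, with a configurable worker count."""
    return DataLoader(
        dataset,
        batch_size=16,
        shuffle=True,
        num_workers=num_workers,  # 0 = load in the main process
    )


if __name__ == "__main__":
    # The guard matters on platforms that start workers by spawning a new
    # Python process (for example Windows and macOS).
    dataset = TensorDataset(torch.rand(100, 2), torch.randint(0, 2, (100,)))
    loader = build_loader(dataset, num_workers=2)
    total_rows = sum(X_batch.shape[0] for X_batch, _ in loader)
    print(f"Loaded {total_rows} rows using 2 worker processes")
```

For a tiny in-memory dataset like this one, worker processes add overhead instead of saving time. The benefit appears when loading a sample involves real work such as reading and decoding files.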

9. Exercises

  1. What are the three methods this chapter defines when subclassing torch.utils.data.Dataset?
  2. If your dataset has 1,000 images, and your DataLoader has batch_size=100, how many iterations will the inner for loop run to complete exactly 1 Epoch?

10. MCQ Quiz with Answers

Question 1

What is the primary purpose of the DataLoader class in PyTorch?

Question 2

In a custom Dataset class, what is the role of the __getitem__ method?

11. Interview Questions

  • Q: Explain the distinct responsibilities of the Dataset class versus the DataLoader class.
  • Q: Why is "Batch Training" superior to passing the entire dataset into the model at once, both in terms of hardware limitations and mathematical convergence?

12. FAQs

Q: Do I always have to write a custom Dataset class? A: No! For standard datasets like MNIST or standard image folders, the torchvision package provides built-in classes (like torchvision.datasets.MNIST and torchvision.datasets.ImageFolder) that already act as Dataset objects, so you don't have to write a custom class.
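
A related shortcut, not covered in this chapter, applies when your features and labels already sit in memory as tensors: torch.utils.data.TensorDataset wraps them without a custom class. A brief sketch with random stand-in data (an assumption):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data (assumption): 100 rows, 2 features, binary labels.
features = torch.rand(100, 2)
labels = torch.randint(0, 2, (100,))

# TensorDataset indexes every tensor along its first dimension.
tensor_dataset = TensorDataset(features, labels)
x0, y0 = tensor_dataset[0]
print(len(tensor_dataset), x0.shape, y0.shape)

# It plugs into DataLoader exactly like a custom Dataset.
loader = DataLoader(tensor_dataset, batch_size=16, shuffle=True)
X_batch, y_batch = next(iter(loader))
print(X_batch.shape, y_batch.shape)
```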

13. Summary

You are no longer limited by your computer's RAM. By utilizing the Dataset class to define how data is accessed, and the DataLoader to efficiently orchestrate the batching, shuffling, and multi-core streaming of that data, you have built the industrial pipeline required to feed massive datasets to a hungry GPU.

14. Next Chapter Recommendation

We have mastered the foundational plumbing of PyTorch. Now, let's build something visually stunning. Why do simple Dense layers fail at recognizing complex photos? In Chapter 11: Image Classification with CNNs in PyTorch, we will learn the architecture that revolutionized Artificial Intelligence.
