CHAPTER 15 Intermediate

Random Forest Regression

Updated: May 16, 2026

1. Introduction

In the last chapter, we learned that a single Decision Tree is highly unstable. Even a small change to the training data can make the entire flowchart rearrange itself, resulting in erratic predictions. To solve this, data scientists asked a simple question: *"What if we ask 100 different trees for their prediction, and average their answers?"* This concept is called Ensemble Learning, and one of its most famous implementations is the Random Forest. In this chapter, we explore one of the most widely used algorithms for tabular data.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Explain the concept of Ensemble Learning.
  • Understand how a Random Forest creates diversity (Bagging).
  • Train a RandomForestRegressor in scikit-learn.
  • Extract Feature Importances from the forest.
  • Understand why Random Forests are highly resistant to overfitting.

3. What is Ensemble Learning?

Ensemble Learning relies on the "Wisdom of the Crowd." If you ask one person to guess the weight of a cow, they might be off by 500 lbs. If you ask 1,000 random people and average their guesses, the individual errors tend to cancel, and the average is usually much closer to the true weight than a typical single guess. A Random Forest works the same way. It builds an "ensemble" of many individual Decision Trees (100 by default in scikit-learn). When a new data point comes in, every tree makes its own prediction, and the final prediction for regression is simply the average of those answers.
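
To make the averaging concrete, here is a minimal sketch that asks every tree for its own guess and averages the guesses by hand. The data is synthetic and invented purely for illustration; `estimators_` is the scikit-learn attribute that holds the fitted trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, made up purely for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

X_new = np.array([[5.0, 1.0, 7.0]])

# Ask every tree individually, then average the answers by hand.
tree_predictions = np.array([tree.predict(X_new)[0] for tree in forest.estimators_])
manual_average = tree_predictions.mean()

# The forest's own prediction should match the manual average.
forest_prediction = forest.predict(X_new)[0]

print(f"Lowest / highest single-tree guess: "
      f"{tree_predictions.min():.2f} / {tree_predictions.max():.2f}")
print(f"Average of the {len(tree_predictions)} trees: {manual_average:.2f}")
print(f"forest.predict(): {forest_prediction:.2f}")
```

The individual trees typically disagree with each other, while their average matches what `forest.predict()` returns.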

4. How the Forest Stays Random (Bagging)

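If you train 100 trees on the exact same data with the exact same rules, they will all build the exact same flowchart. That defeats the purpose! The forest must be diverse. It achieves this by combining Bagging (Bootstrap Aggregating) with random feature selection:
  1. Random Data (Bagging): Each tree is trained on a bootstrap sample, meaning rows are drawn at random *with replacement* from the training set. Some rows appear more than once and others are left out entirely (on average, each tree sees only about 63% of the distinct houses).
  2. Random Features: At every split in the flowchart, the tree is only allowed to look at a random subset of columns (e.g., at one particular split, a tree may be forced to ignore the "Bedrooms" column). In scikit-learn this is controlled by the max_features hyperparameter. Note that RandomForestRegressor considers all columns by default, so you must lower max_features to switch this on.

*Because every tree is slightly "blind," they all make different mistakes. When you average them, many of those mistakes cancel each other out, resulting in a much more stable prediction than any single tree!*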
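
The two sources of randomness described above can be sketched with plain NumPy. This is only an illustration of the sampling idea, not scikit-learn's internal code; the dataset size, the seed, and the choice of 3 columns per split are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility

n_rows, n_features = 10_000, 8  # hypothetical dataset size

# Bootstrap sample: draw n_rows row indices *with replacement*.
bootstrap_idx = rng.integers(low=0, high=n_rows, size=n_rows)

# Fraction of distinct rows that made it into this tree's sample.
unique_fraction = np.unique(bootstrap_idx).size / n_rows
print(f"Distinct rows seen by this tree: {unique_fraction:.1%}")

# Rows that were never drawn are "out-of-bag" for this tree.
oob_mask = np.ones(n_rows, dtype=bool)
oob_mask[bootstrap_idx] = False
print(f"Out-of-bag rows: {oob_mask.sum()}")

# Random feature subset considered at one split (here 3 of the 8 columns).
split_features = rng.choice(n_features, size=3, replace=False)
print(f"Columns considered at this split: {sorted(split_features.tolist())}")
```

For a large dataset, the fraction of distinct rows lands near 63%, which is where the figure in the list above comes from.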

5. Mini Project: Car Price Prediction

Let's build a robust Random Forest to predict the price of used cars based on Mileage, Age, and Engine Size.
python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# 1. Provide the Data [Mileage, Age_Years, Engine_Liters]
X_train = np.array([
    [50000, 5, 2.0],
    [10000, 1, 3.5],
    [120000, 10, 1.8],
    [30000, 3, 2.5]
])
y_train = np.array([15000, 35000, 5000, 22000]) # Price in $

# 2. Initialize the Model
# n_estimators = 100 (This means "Plant 100 Trees!")
# random_state ensures reproducibility
forest_model = RandomForestRegressor(n_estimators=100, random_state=42)

# 3. Train the Forest
forest_model.fit(X_train, y_train)

# 4. Make a Prediction!
# Predict for a car with 60,000 miles, 6 years old, 2.0L engine
X_test = np.array([[60000, 6, 2.0]])
prediction = forest_model.predict(X_test)
print(f"Predicted Car Price: ${prediction[0]:.2f}")

6. Feature Importance (The Power of Forests)

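Unlike Linear Regression, where raw coefficients can be misleading due to scale, Random Forests provide a built-in ranking of how important every feature is. Each score ranges from 0.0 to 1.0, and the scores sum to 1.0. These scores are computed from the training data (how much each feature reduced the error across all splits), so treat them as a useful guide rather than a guarantee: they can be inflated for features with many unique values.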
python
# Extract the importance of each feature
importances = forest_model.feature_importances_

print(f"Importance of Mileage: {importances[0]*100:.1f}%")
print(f"Importance of Age: {importances[1]*100:.1f}%")
print(f"Importance of Engine: {importances[2]*100:.1f}%")

# The output will clearly show which feature the 100 trees relied on the most!
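
As a cross-check on the built-in scores, one option is scikit-learn's `permutation_importance`, which measures how much the score on held-out data drops when a column is shuffled. The sketch below uses synthetic data (an assumption for illustration) in which column 0 drives the target, column 1 matters a little, and column 2 is pure noise.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Impurity-based importances: computed from the training data, sum to 1.0.
impurity_importances = forest.feature_importances_

# Permutation importances: drop in R^2 on held-out data when a column is shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)

for i in range(X.shape[1]):
    print(f"Feature {i}: impurity={impurity_importances[i]:.3f}, "
          f"permutation={perm.importances_mean[i]:.3f}")
```

When both methods agree on the ranking, you can be more confident in it; when they disagree, the held-out permutation scores are usually the safer guide.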

7. Overfitting and Random Forests

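Random Forests are much more resistant to overfitting than a single Decision Tree. Because the final answer is an average of many models, a single tree memorizing a noisy data point gets largely "drowned out" by the other trees, many of which never saw that point in their bootstrap sample or split on it differently. Adding more trees does not make the forest overfit more. A forest can still overfit very noisy data, though, so hyperparameters like max_depth and min_samples_leaf remain worth tuning. Even so, Random Forests usually work well straight out of the box with default settings!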
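
A quick way to see this effect is to compare a single tree and a forest on noisy data and look at the gap between training and test scores. The sketch below uses a made-up noisy sine wave; the exact numbers depend on the data and seed, so treat it as an illustration rather than a benchmark.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data (an assumption for illustration): a sine wave plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(600, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=600)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# .score() returns R^2; a large train/test gap is a sign of overfitting.
tree_train, tree_test = tree.score(X_train, y_train), tree.score(X_test, y_test)
forest_train, forest_test = forest.score(X_train, y_train), forest.score(X_test, y_test)

print(f"Single tree -> train R^2: {tree_train:.2f}, test R^2: {tree_test:.2f}")
print(f"Forest      -> train R^2: {forest_train:.2f}, test R^2: {forest_test:.2f}")
```

The fully grown single tree memorizes the training set, noise included, while the forest gives up a little training accuracy in exchange for a better test score.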

8. Common Mistakes

  • Setting n_estimators too low: If you only use 5 trees, you do not have a forest, and you won't get the benefits of the Wisdom of the Crowd. Start with 100 (the scikit-learn default); adding more trees costs training time but does not hurt accuracy.
  • Using Forests for Trending Time-Series: Unlike Linear Regression, a Random Forest cannot predict a number higher (or lower) than the target values it saw in training, because every prediction is an average of training targets. That makes a plain Random Forest a poor choice for forecasting stock prices that are trending upward into unseen territory.
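
The extrapolation limit is easy to demonstrate. The sketch below trains on an invented upward trend (y = 2x for x from 0 to 99) and then asks for a prediction far outside that range, with Linear Regression alongside for contrast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Made-up upward trend: y = 2 * x for x = 0..99.
X_train = np.arange(100, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train[:, 0]

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
line = LinearRegression().fit(X_train, y_train)

X_future = np.array([[200.0]])  # far outside the training range
forest_pred = forest.predict(X_future)[0]
line_pred = line.predict(X_future)[0]

print(f"Largest target seen in training: {y_train.max():.1f}")
print(f"Random Forest prediction at x=200: {forest_pred:.1f}")
print(f"Linear Regression prediction at x=200: {line_pred:.1f}")
```

The forest's answer stays at or below the largest training target, while the linear model continues the trend.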

9. Best Practices

  • Use as a Baseline: For any tabular (CSV) data problem, the Random Forest is a strong baseline. Run it before you try complex Neural Networks. Often, the Random Forest will be faster to train and just as accurate!

10. Exercises

  1. What does the hyperparameter n_estimators=250 tell the RandomForestRegressor to do?
  2. Explain how a Random Forest calculates its final prediction for a regression task.

11. MCQ Quiz with Answers

Question 1

What is the fundamental concept behind Ensemble Learning algorithms like Random Forest?

Question 2

How does a Random Forest prevent all of its internal trees from looking exactly the same?

12. Interview Questions

  • Q: Explain the mechanism of "Bootstrap Aggregating" (Bagging) inside a Random Forest.
  • Q: Why is a Random Forest generally much more resistant to overfitting on training data than a single Decision Tree?

13. FAQs

Q: What is XGBoost, and is it better than Random Forest?
A: XGBoost is another Ensemble tree method that uses "Boosting" instead of "Bagging." Instead of building its trees independently, it builds them sequentially, where Tree 2 tries to fix the specific errors made by Tree 1. XGBoost is harder to tune, but when tuned well it often achieves higher accuracy on tabular data, which is why it has been so popular in Kaggle competitions.
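
The "Tree 2 fixes Tree 1" idea can be sketched with two ordinary scikit-learn trees. This is a simplified illustration of boosting on residuals with a squared-error loss, not XGBoost's actual implementation; the data and the max_depth=2 setting are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=300)

# Tree 1 fits the target directly.
tree_1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
pred_1 = tree_1.predict(X)

# Tree 2 is trained on the residuals, i.e. the errors tree 1 left behind.
residuals = y - pred_1
tree_2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)

# Boosted prediction = tree 1's answer plus tree 2's correction.
pred_2 = pred_1 + tree_2.predict(X)

mse_1 = np.mean((y - pred_1) ** 2)
mse_2 = np.mean((y - pred_2) ** 2)
print(f"Training MSE after tree 1: {mse_1:.4f}")
print(f"Training MSE after tree 2: {mse_2:.4f}")
```

Contrast this with a Random Forest, where no tree ever sees the errors of another tree.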

14. Summary

The Random Forest is a triumph of statistical engineering. By planting a diverse forest of slightly "blind" decision trees and averaging their individual predictions, the algorithm creates a highly stable, non-linear model that is far less sensitive to noise and overfitting than a single tree. It remains one of the strongest default choices for tabular Machine Learning.

15. Next Chapter Recommendation

We have explored Lines and we have explored Trees. But there is a third, mathematically fascinating way to draw boundaries through data points using margins and vectors. In Chapter 16: Support Vector Regression (SVR), we will explore an algorithm designed for complex, high-dimensional spaces.
