Python for Data Science
CHAPTER 23 Beginner

Regression Algorithms

Updated: May 18, 2026

1. Chapter Introduction

We are ready to predict the future. The most fundamental task in Supervised Machine Learning is Regression—predicting a continuous number. How much will this house sell for? What will our revenue be next month? This chapter covers Linear Regression, how to train it using Scikit-Learn, and how to evaluate if its predictions are actually accurate.

2. What is Linear Regression?

Linear Regression attempts to draw a straight "line of best fit" through your data points.

If you plot Square Footage on the X-axis and House Price on the Y-axis, the algorithm finds the line that minimizes the sum of the squared vertical distances between the line and the data points (this is called Ordinary Least Squares). Once the line is drawn, you can use it to predict the price of a house size that isn't even in your dataset.
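To make the "line of best fit" idea concrete, here is a minimal sketch that computes the line by hand for a single feature. The five data points and the helper `predict_price` are made up for illustration; they are not from the chapter's dataset.

```python
import numpy as np

# Hypothetical toy data: square footage (x) and sale price (y)
sqft = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
price = np.array([200000.0, 290000.0, 410000.0, 500000.0, 600000.0])

# Closed-form least-squares estimates for one feature:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
x_mean, y_mean = sqft.mean(), price.mean()
slope = np.sum((sqft - x_mean) * (price - y_mean)) / np.sum((sqft - x_mean) ** 2)
intercept = y_mean - slope * x_mean


def predict_price(square_feet):
    """Read a price off the fitted line (illustrative helper)."""
    return intercept + slope * square_feet


# Sum of squared residuals: the quantity the fitted line minimizes
residuals = price - predict_price(sqft)
sse = np.sum(residuals ** 2)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, SSE={sse:.2f}")
print(f"Predicted price for 1800 sqft: {predict_price(1800):,.2f}")
```

Any other slope or intercept would give a larger sum of squared residuals on these five points, which is what "best fit" means here.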

3. Training a Linear Regression Model

Let's assume our data is already preprocessed (Cleaned, Encoded, Split, and Scaled) from Chapter 22.

python
from sklearn.linear_model import LinearRegression

# 1. Initialize the model
model = LinearRegression()

# 2. Train the model on the Training Data
# fit() finds the coefficients and intercept that minimize the squared error
# between the model's predictions and y_train
model.fit(X_train, y_train)

print("Model Training Complete!")

# 3. View the learned rules (Coefficients)
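# A coefficient of 50 means: a 1-unit increase in that feature raises the predicted
# Price by 50, holding the other features constant. If the features were scaled,
# "1 unit" is one scaled unit (for example, one standard deviation), not one square foot.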
print("Coefficients:", model.coef_)
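The snippet above relies on X_train and y_train from Chapter 22. As a self-contained sketch, the example below trains on synthetic data where the true rule is known in advance, so you can check that the learned coefficient lands close to it. The data, the rule (price = 50 * sqft + 10,000 plus noise), and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: price = 50 * sqft + 10_000 + random noise
rng = np.random.default_rng(0)
sqft = rng.uniform(800, 3500, size=200)
noise = rng.normal(0, 5_000, size=200)
X_demo = sqft.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
y_demo = 50 * sqft + 10_000 + noise

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_model = LinearRegression()
demo_model.fit(X_tr, y_tr)

# coef_ holds one weight per feature; intercept_ is the prediction when every feature is 0
print("Learned slope:", demo_model.coef_[0])
print("Learned intercept:", demo_model.intercept_)
```

The learned slope should sit near 50 rather than exactly on it, because the model only sees noisy samples of the rule.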

4. Making Predictions

Now that the model has learned the rules, we test it. We give it the X_test data (which it has never seen before) and ask it to guess the house prices.

python
# Pass the unseen features to the model
predictions = model.predict(X_test)

# Let's compare the first 3 guesses to the real answers!
print("Model Guesses:", predictions[:3])
print("Real Answers: ", y_test[:3].values)

5. Evaluating Regression Models (Metrics)

How do we know if the model is good? We calculate the error between the predictions and the real y_test answers.

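  • Mean Absolute Error (MAE): The average amount the model was wrong by, in the target's own units (here, dollars).
  • R-Squared (R²): The share of the variance in the target that the model explains. An R² of 0.85 means the model explains 85% of the variance in house prices. 1.0 is perfect, 0 means the model is no better than always predicting the average price, and the score can go negative on test data when the model does worse than that.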

python
from sklearn.metrics import mean_absolute_error, r2_score

# Calculate MAE
mae = mean_absolute_error(y_test, predictions)
print(f"On average, our model's predictions are off by: ${mae:,.2f}")

# Calculate R-Squared
r2 = r2_score(y_test, predictions)
print(f"R-Squared Score: {r2:.2f}")
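To see what these two functions compute, the sketch below reproduces both metrics by hand with NumPy and compares them to Scikit-Learn's results. The four prices and predictions are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true prices and model predictions
y_true = np.array([300000.0, 450000.0, 500000.0, 200000.0])
y_pred = np.array([310000.0, 430000.0, 520000.0, 190000.0])

# MAE: the mean of the absolute errors
mae_manual = np.mean(np.abs(y_true - y_pred))

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)         # error left over after the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(f"MAE: manual={mae_manual:,.2f}, sklearn={mean_absolute_error(y_true, y_pred):,.2f}")
print(f"R^2: manual={r2_manual:.4f}, sklearn={r2_score(y_true, y_pred):.4f}")

# Hypothetical predictions that run opposite to the truth:
# worse than predicting the mean, so R^2 drops below 0
bad_pred = np.array([500000.0, 200000.0, 200000.0, 500000.0])
r2_bad = r2_score(y_true, bad_pred)
print(f"R^2 of a bad model: {r2_bad:.2f}")
```

The formula shows why R² has no lower bound: as soon as the model's squared error exceeds that of the "always predict the mean" baseline, the ratio is above 1 and the score turns negative.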

6. Polynomial Regression (Non-Linear Data)

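What if the relationship isn't a straight line? What if it curves? A straight line will underfit curved data, leaving large, systematic errors. One common fix is Polynomial Regression, which bends the line of best fit by adding powers of the features (such as X²) as extra inputs.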

In Scikit-Learn, we do this by transforming the features *before* feeding them to Linear Regression.

python
from sklearn.preprocessing import PolynomialFeatures

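# Add squared and interaction versions of our features (X^2, X1*X2).
# With the default settings, a constant column of 1s is added as well.
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)

# Train the model on the curved features
model_curved = LinearRegression()
model_curved.fit(X_train_poly, y_train)

# Apply the same (already fitted) transform to the test features before predicting
X_test_poly = poly.transform(X_test)
predictions_curved = model_curved.predict(X_test_poly)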
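As a self-contained sketch of the difference this makes, the example below fits a straight line and a degree-2 polynomial to synthetic curved data and compares their test R² scores. The data-generating rule and variable names are assumptions for illustration. It also uses `make_pipeline`, one convenient way to make sure the polynomial transform is applied identically at training and prediction time.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical curved data: y = 3x^2 - 2x + 5 + noise
rng = np.random.default_rng(1)
x = rng.uniform(-5, 5, size=300)
y_curve = 3 * x**2 - 2 * x + 5 + rng.normal(0, 2, size=300)
X_curve = x.reshape(-1, 1)

Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_curve, y_curve, test_size=0.2, random_state=42
)

# Baseline: a plain straight line
straight = LinearRegression().fit(Xc_train, yc_train)

# Polynomial regression: the pipeline transforms the features, then fits the line.
# include_bias=False because LinearRegression already learns its own intercept.
curved = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
curved.fit(Xc_train, yc_train)

r2_straight = r2_score(yc_test, straight.predict(Xc_test))
r2_curved = r2_score(yc_test, curved.predict(Xc_test))
print(f"Straight line R^2: {r2_straight:.3f}")
print(f"Degree-2 curve R^2: {r2_curved:.3f}")
```

Because the pipeline owns the transform, calling `curved.predict(Xc_test)` on raw features is enough; there is no separate transformed test array to keep in sync.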

7. Mini Project: House Price Predictor

python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# 1. Fake Dataset (Sqft, Bedrooms, Age -> Price)
X = pd.DataFrame({
    'Sqft': [1500, 2000, 2500, 1200, 3000],
    'Beds': [3, 4, 4, 2, 5],
    'Age': [10, 5, 2, 20, 1]
})
y = pd.Series([300000, 450000, 500000, 200000, 650000])

# 2. Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train
model = LinearRegression()
model.fit(X_train, y_train)

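# 4. Evaluate on the held-out test row (a single row here, so the number is only illustrative)
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"Test MAE: ${test_mae:,.2f}")

# 5. Predict a brand new house not in the dataset!
new_house = pd.DataFrame({'Sqft': [2100], 'Beds': [3], 'Age': [8]})
predicted_price = model.predict(new_house)

print(f"Predicted Price for new house: ${predicted_price[0]:,.2f}")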
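With several features, model.coef_ is just an unlabeled array. The sketch below repeats the mini project's toy dataset and pairs each weight with its column name; `coef_table` is a name introduced here for illustration. With only four training rows, the fitted numbers are not meaningful estimates, so treat this as a demonstration of the mechanics only.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same toy dataset as the mini project
X = pd.DataFrame({
    'Sqft': [1500, 2000, 2500, 1200, 3000],
    'Beds': [3, 4, 4, 2, 5],
    'Age': [10, 5, 2, 20, 1]
})
y = pd.Series([300000, 450000, 500000, 200000, 650000])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Pair each learned weight with the feature it belongs to
coef_table = pd.Series(model.coef_, index=X_train.columns, name="coefficient")
print(coef_table)
print("Intercept:", model.intercept_)
```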

8. Common Mistakes

  • Evaluating on Training Data: If you calculate your error metrics using model.predict(X_train), the score will look better than it really is. The model was fitted to exactly that data, so the result says little about how it handles new houses. Always evaluate using model.predict(X_test).
  • Ignoring the MAE scale: An MAE of 5,000 is terrible if you are predicting the price of a $20 book. An MAE of 5,000 is incredible if you are predicting the price of a $1,000,000 house. Context matters.
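Both mistakes can be shown in a few lines. The sketch below uses made-up data in which only one of 15 features matters, and a deliberately tiny training set so the model has room to fit noise; the last loop then redoes the book-versus-house arithmetic from the second bullet. All data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical data: only the first of 15 features matters; the rest are pure noise
rng = np.random.default_rng(7)
X_noise = rng.normal(size=(220, 15))
y_noise = 3 * X_noise[:, 0] + rng.normal(0, 1, size=220)

# Deliberately tiny training set (20 rows) so the model can fit the noise
Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    X_noise, y_noise, train_size=20, random_state=42
)

overfit_model = LinearRegression().fit(Xn_train, yn_train)

# Mistake 1: the training score flatters the model
train_mae = mean_absolute_error(yn_train, overfit_model.predict(Xn_train))
test_mae = mean_absolute_error(yn_test, overfit_model.predict(Xn_test))
print(f"MAE on training data: {train_mae:.2f}")
print(f"MAE on test data:     {test_mae:.2f}")

# Mistake 2: the same MAE means very different things at different price scales
mae_value = 5_000
for item, typical_price in [("book", 20), ("house", 1_000_000)]:
    print(f"{item}: MAE is {mae_value / typical_price:.1%} of a typical price")
```

Expressing MAE as a fraction of a typical target value is one simple way to make the number comparable across problems.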

9. MCQs

Question 1

What is the goal of Regression in Machine Learning?

Question 2

Which Scikit-Learn method asks the model to learn from the data?

Question 3

Which Scikit-Learn method asks the trained model to make a guess on unseen data?

Question 4

What does Mean Absolute Error (MAE) represent?

Question 5

What is the best possible R-Squared (R²) score?

Question 6

If your data has a curved relationship (like exponential growth), what should you use?

Question 7

To evaluate the true accuracy of your model, which data should you use?

Question 8

What attribute holds the learned rules/weights of a Linear Regression model?

Question 9

Is Linear Regression an example of Supervised or Unsupervised learning?

Question 10

Can Linear Regression be used to predict if an email is Spam or Not Spam?

a) Yes
b) No, predicting discrete categories requires Classification algorithms

Answer: b

10. Interview Questions

  • Q: Explain how R-Squared and Mean Absolute Error differ. Which one is easier to explain to a non-technical business manager?
  • Q: You train a Linear Regression model, but the R-Squared is very low (0.3). The data looks curved on a scatter plot. How do you fix this pipeline?

11. Summary

Regression predicts numbers. The workflow is universal: initialize LinearRegression(), teach it the rules using .fit(X_train, y_train), and ask it to guess the test data using .predict(X_test). Finally, calculate the mean_absolute_error to see exactly how many dollars (or units) your model's predictions are off by on average.

12. Next Chapter Recommendation

In Chapter 24: Classification Algorithms, we shift from predicting numbers to predicting *categories*. We will use Logistic Regression and Decision Trees to predict whether a tumor is malignant, or if an email is spam.
