
Linear Regression in Scikit-learn

Updated: May 16, 2026

# CHAPTER 9

1. Introduction

We have prepared our data, and now it is time to build our first predictive model. Regression is a type of Supervised Learning used to predict a continuous number (e.g., Salary, House Price, Temperature). Linear Regression is one of the simplest and most widely used regression algorithms. In this chapter, we will learn how the algorithm works conceptually and implement it in Python using Scikit-learn.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Understand the difference between Regression and Classification.
  • Explain the concept of the "Line of Best Fit."
  • Implement LinearRegression using Scikit-learn.
  • Interpret the model's coefficients (slope) and intercept.
  • Evaluate regression performance using R-squared and Mean Squared Error.

3. Regression Basics

While Classification predicts a category ("Cat" or "Dog"), Regression predicts a continuous numeric value. If you plot house sizes (square feet) on the X-axis and house prices on the Y-axis, you will usually see a trend: as size increases, price tends to increase. Linear Regression draws the straight line that best fits these data points. When a new house comes on the market, you find its size on the X-axis, move up to the line, and move over to the Y-axis to read off its predicted price.

4. The Math (Simplified)

You might remember the equation for a straight line from high school algebra: y = mx + b

In Machine Learning, we write it as: y = (w * x) + b

  • y: The Target we want to predict (Price).
  • x: The Input Feature (Square footage).
  • w (Weight / Coefficient): The slope of the line. How much does price increase for every 1 extra sq ft?
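  • b (Bias / Intercept): The value of y where the line crosses x = 0. You can read it as a "base price" for a 0 sq ft house (just land), but treat that with caution when no house in the data is anywhere near 0 sq ft.

Scikit-learn's job is to calculate the w and b that minimize the sum of squared vertical distances between the line and the data points. This method is called ordinary least squares.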
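
For a single feature, the least-squares w and b have a simple closed form, so the result can be checked by hand. The sketch below is illustrative only: it reuses the mock house numbers from the mini project later in this chapter and compares the manual formula with what LinearRegression returns. The variable names are our own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Mock data: square footage and price (same numbers as the mini project).
x = np.array([1000, 1500, 2000, 2500, 3000, 3500], dtype=float)
y = np.array([150000, 200000, 250000, 310000, 360000, 400000], dtype=float)

# Closed-form least squares for one feature:
#   w = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   b = mean_y - w * mean_x
x_mean, y_mean = x.mean(), y.mean()
w = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - w * x_mean

# Scikit-learn expects a 2-D feature array, hence the reshape.
model = LinearRegression().fit(x.reshape(-1, 1), y)

print(f"Manual:       w = {w:.4f}, b = {b:.2f}")
print(f"Scikit-learn: w = {model.coef_[0]:.4f}, b = {model.intercept_:.2f}")
```

Both lines should print the same slope and intercept, up to floating-point rounding.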

5. Mini Project: House Price Prediction

Let's build this in Scikit-learn. We will use a mock dataset of house sizes and prices.
python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# 1. Provide Data
# X = Square footage
X = np.array([[1000], [1500], [2000], [2500], [3000], [3500]])
# y = Price in dollars
y = np.array([150000, 200000, 250000, 310000, 360000, 400000])

# 2. Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3. Initialize and Train the Model
model = LinearRegression()
model.fit(X_train, y_train)

# 4. Make Predictions
predictions = model.predict(X_test)

for i in range(len(X_test)):
    print(f"SqFt: {X_test[i][0]} | Predicted Price: ${predictions[i]:.2f} | Actual Price: ${y_test[i]}")

6. Model Coefficients

Once the model is trained (fit), we can peek inside to see the math it learned.
python
# The Weight (w)
print(f"Coefficient (Price per SqFt): ${model.coef_[0]:.2f}")

# The Intercept (b)
print(f"Intercept (Base Price): ${model.intercept_:.2f}")

*If the Coefficient is $100, the model learned that for every additional square foot, the house value increases by $100.*
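
To see that the coefficient and intercept really are the whole model, you can rebuild a prediction by hand with y = (w * x) + b. This is a minimal sketch that refits the model on all six mock rows for simplicity; the 2200 sq ft house is a made-up input.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1000], [1500], [2000], [2500], [3000], [3500]])
y = np.array([150000, 200000, 250000, 310000, 360000, 400000])

# Fit on all rows here; the mini project uses a train/test split instead.
model = LinearRegression().fit(X, y)

w = model.coef_[0]      # slope: price change per extra square foot
b = model.intercept_    # value of the line at 0 sq ft

new_sqft = 2200         # hypothetical house size
manual = w * new_sqft + b
from_model = model.predict(np.array([[new_sqft]]))[0]

print(f"Manual (w * x + b): ${manual:.2f}")
print(f"model.predict():    ${from_model:.2f}")
```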

7. Regression Evaluation

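We cannot use "Accuracy" for regression. If the actual price is $250,000 and the model predicts $250,001, an "Accuracy" metric would count the prediction as wrong because the two numbers are not exactly equal. Instead, we measure the *error* (the distance between the prediction and the actual value).
  • Mean Squared Error (MSE): The average of the squared errors. Lower is better. Its unit is the square of the target's unit (dollars squared here), so its square root (RMSE) is often easier to read.
  • R-squared (R2 Score): The proportion of the variance in the target that the model explains. A score of 1 is a perfect fit and 0 is no better than always predicting the mean. On test data the score can even be negative if the model does worse than that. Whether a score such as 0.90 is good enough depends on the problem.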
python
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print(f"Mean Squared Error: {mse}")
print(f"R-squared Score: {r2}")
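
Both metrics are short formulas, and computing them by hand can make the numbers less mysterious. The sketch below uses made-up actual and predicted prices (not output from the model above) and checks the manual results against Scikit-learn. The last part shows a deliberately bad set of predictions that produces a negative R-squared.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted prices.
y_true = np.array([200000, 310000, 400000], dtype=float)
y_pred = np.array([210000, 300000, 390000], dtype=float)

errors = y_true - y_pred

# MSE: average of squared errors. RMSE: back in dollars.
mse_manual = np.mean(errors ** 2)
rmse = np.sqrt(mse_manual)

# R-squared: 1 - (squared error of the model / squared error of the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE  manual: {mse_manual:.1f} | sklearn: {mean_squared_error(y_true, y_pred):.1f}")
print(f"RMSE manual: {rmse:.1f}")
print(f"R2   manual: {r2_manual:.4f} | sklearn: {r2_score(y_true, y_pred):.4f}")

# Predictions that are worse than always guessing the mean give R2 < 0.
y_bad = np.array([400000, 200000, 310000], dtype=float)
r2_bad = r2_score(y_true, y_bad)
print(f"R2 for bad predictions: {r2_bad:.4f}")
```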

8. Multiple Linear Regression

What if we have more than one input feature (e.g., SqFt, Number of Bedrooms, Age of House)? The code does not change! Scikit-learn handles it automatically and learns one weight per feature: y = (w1 * SqFt) + (w2 * Beds) + (w3 * Age) + b
python
# X_train_multi is assumed to be a training array with 3 columns (SqFt, Beds, Age).
# After fitting, model.coef_ holds 3 coefficients, one per column.
model.fit(X_train_multi, y_train)
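
Here is a self-contained sketch of the same idea. The three-column dataset is invented for illustration (the bedroom counts and ages are not from the chapter's data), and with only six rows the individual coefficient values should not be read as meaningful. The point is the shape of the output: one coefficient per column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: columns are [SqFt, Beds, Age in years].
X_multi = np.array([
    [1000, 2, 30],
    [1500, 3, 20],
    [2000, 3, 15],
    [2500, 4, 10],
    [3000, 4, 5],
    [3500, 5, 2],
], dtype=float)
y = np.array([150000, 200000, 250000, 310000, 360000, 400000], dtype=float)

model_multi = LinearRegression().fit(X_multi, y)

# One learned weight per input column.
for name, coef in zip(["SqFt", "Beds", "Age"], model_multi.coef_):
    print(f"w for {name}: {coef:.2f}")
print(f"b (intercept): {model_multi.intercept_:.2f}")

# A prediction is the weighted sum of the features plus the intercept.
manual = X_multi @ model_multi.coef_ + model_multi.intercept_
```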

9. Common Mistakes

  • Applying Linear Regression to Non-Linear Data: If your data looks like a curved U-shape when plotted, a straight line will fit it poorly and its predictions will be unreliable. You would need to add polynomial terms (Scikit-learn's PolynomialFeatures) or use a non-linear algorithm like Random Forest.
  • Forgetting to Scale Features: The predictions of plain Linear Regression do not depend on feature scale, unlike SVM or KNN. Scaling still matters when you want to compare coefficients across features measured in different units, and when you move on to regularized or gradient-based models such as Ridge, Lasso, or SGDRegressor.
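
The first mistake is easy to reproduce. The sketch below generates synthetic U-shaped data (y is roughly x squared plus noise; the data and the degree of 2 are our assumptions) and compares a plain straight-line fit with a pipeline that adds polynomial terms before the same LinearRegression step. Scores are computed on the training data, purely to show the difference in fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic U-shaped data: y = x^2 plus a little noise.
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.3, size=50)

# A straight line cannot follow the curve.
linear = LinearRegression().fit(X, y)

# Adding an x^2 column lets the same linear model fit the curve.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

r2_linear = linear.score(X, y)  # .score() returns R-squared for regressors
r2_poly = poly.score(X, y)

print(f"R2, straight line:      {r2_linear:.3f}")
print(f"R2, degree-2 polynomial: {r2_poly:.3f}")
```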

10. Best Practices

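  • Analyze Coefficients: In business, stakeholders don't just want a prediction; they want to know *why*. By printing model.coef_, you can show how much the prediction changes for a one-unit change in each feature. Keep in mind that coefficients depend on each feature's units and describe association, not proof of cause.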
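
One common way to put coefficients on a comparable footing is to standardize the features first, so each coefficient means "change in prediction per one standard deviation of that feature." The sketch below is an illustration under assumptions: the three-column data is made up, and with so few rows the values themselves are not meaningful. It also confirms that scaling leaves the predictions of plain Linear Regression unchanged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: columns are [SqFt, Beds, Age in years].
X_multi = np.array([
    [1000, 2, 30],
    [1500, 3, 20],
    [2000, 3, 15],
    [2500, 4, 10],
    [3000, 4, 5],
    [3500, 5, 2],
], dtype=float)
y = np.array([150000, 200000, 250000, 310000, 360000, 400000], dtype=float)
names = ["SqFt", "Beds", "Age"]

# Raw model: each coefficient is in dollars per original unit.
raw = LinearRegression().fit(X_multi, y)

# Scaled model: each coefficient is in dollars per standard deviation.
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_multi, y)
scaled_coefs = scaled[-1].coef_  # last pipeline step is the regression

for name, c_raw, c_std in zip(names, raw.coef_, scaled_coefs):
    print(f"{name:>4}: raw = {c_raw:12.2f} | standardized = {c_std:12.2f}")

# Scaling changes the coefficients, not the predictions.
same_predictions = np.allclose(raw.predict(X_multi), scaled.predict(X_multi))
print(f"Predictions identical: {same_predictions}")
```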

11. Exercises

  1. If an ML model predicts a continuous number like a bank account balance, is it a Classification or Regression problem?
  2. Write the Python code to import LinearRegression from Scikit-learn and instantiate it into a variable named model.

12. MCQ Quiz with Answers

Question 1

Which evaluation metric is appropriate for a Linear Regression model?

Question 2

What does the "Coefficient" in a single-feature Linear Regression model represent?

13. Interview Questions

  • Q: Explain the mathematical concept of Linear Regression in simple terms.
  • Q: What is the difference between Mean Squared Error (MSE) and the R-squared metric?

14. FAQs

Q: What is Ridge or Lasso Regression? A: They are advanced versions of Linear Regression that add a "Penalty" (Regularization) to the coefficients. This prevents the model from overfitting when dealing with hundreds of complex features.
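
Both models live in sklearn.linear_model and are used exactly like LinearRegression. The sketch below is a rough illustration on synthetic data where only 3 of 20 features actually influence the target. The alpha values (the penalty strength) are arbitrary choices for the demo, not recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: 20 features, but only the first 3 affect the target.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 20))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=60)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty: shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty: can set some to exactly 0

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
lasso_zeros = int(np.sum(lasso.coef_ == 0))

print(f"Coefficient size, plain: {ols_norm:.3f}")
print(f"Coefficient size, Ridge: {ridge_norm:.3f}")
print(f"Lasso coefficients set to zero: {lasso_zeros} of {X.shape[1]}")
```

Ridge keeps every feature but with smaller weights, while Lasso tends to drop features it finds unhelpful, which is why it is sometimes used for feature selection.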

15. Summary

Linear Regression is the foundation of predictive modeling. By calculating the weights (slope) and bias (intercept) that minimize the squared error across the training data, Scikit-learn lets us predict continuous numerical targets whenever the relationship is roughly linear. While simple, it remains one of the most interpretable and widely used algorithms in data science.

16. Next Chapter Recommendation

We have conquered Regression. But what if we want to predict a "Yes" or "No" answer, like whether an email is Spam? In Chapter 10: Logistic Regression for Classification, we will tackle the other half of Supervised Learning.
