Linear Regression in Scikit-learn
# CHAPTER 9
1. Introduction
We have prepared our data, and now it is time to build our first predictive algorithm. Regression is a type of Supervised Learning used to predict a continuous number (e.g., Salary, House Price, Temperature). One of the simplest and most widely used regression algorithms is Linear Regression. In this chapter, we will learn how the algorithm works conceptually and implement it in Python using Scikit-learn.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Understand the difference between Regression and Classification.
- Explain the concept of the "Line of Best Fit."
- Implement `LinearRegression` using Scikit-learn.
- Interpret the model's coefficients (slope) and intercept.
- Evaluate regression performance using R-squared and Mean Squared Error.
3. Regression Basics
While Classification predicts a category ("Cat" or "Dog"), Regression predicts a continuous numeric value that can fall anywhere in a range. If you plot house sizes (square feet) on the X-axis and house prices on the Y-axis, you will see a trend: as size increases, price increases. Linear Regression draws a straight line through the middle of these data points. When a new house comes on the market, you find its size on the X-axis, move up to the line, and move over to the Y-axis to predict its price.
4. The Math (Simplified)
You might remember the equation for a straight line from high school algebra:
y = mx + b
In Machine Learning, we write it as:
y = (w * x) + b
- y: The Target we want to predict (Price).
- x: The Input Feature (Square footage).
- w (Weight / Coefficient): The slope of the line. How much does price increase for every 1 extra sq ft?
- b (Bias / Intercept): The base price. If a house is 0 sq ft (just land), what is it worth?
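Scikit-learn's job is to find the w and b that place the line as close as possible to the data points. For Linear Regression, "close" means minimizing the sum of the squared vertical distances between each point and the line (ordinary least squares).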
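To make the roles of w and b concrete, here is a minimal sketch that computes them by hand with NumPy for a single feature, using the standard least-squares formulas. The sizes and prices are invented for illustration, and the variable names are assumptions rather than anything Scikit-learn requires.

```python
import numpy as np

# Hypothetical mock data (invented for illustration): size in sq ft, price in USD.
sizes = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
prices = np.array([200000.0, 250000.0, 300000.0, 350000.0, 400000.0])

# Least-squares solution for one feature:
#   w = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   b = mean_y - w * mean_x
x_mean, y_mean = sizes.mean(), prices.mean()
w = np.sum((sizes - x_mean) * (prices - y_mean)) / np.sum((sizes - x_mean) ** 2)
b = y_mean - w * x_mean

print(f"w (slope): {w:.2f}")      # price increase per extra sq ft
print(f"b (intercept): {b:.2f}")  # predicted price at 0 sq ft

# Use the line to predict the price of a new 2,200 sq ft house.
print(f"Predicted price: {w * 2200 + b:.2f}")
```

Because this mock data lies exactly on a line, the sketch recovers w = 100 and b = 100000. Real data is noisy, so the line will only pass near the points.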
5. Mini Project: House Price Prediction
Let's build this in Scikit-learn. We will use a mock dataset of house sizes and prices.
6. Model Coefficients
Once the model is trained (fit), we can peek inside to see the math it learned.
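The chapter's original listing is not reproduced here, so the following is a minimal sketch of what training and inspecting the model could look like. The dataset values are invented for illustration; `coef_` and `intercept_` are the attributes Scikit-learn sets on a fitted `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical mock dataset (values invented for illustration).
# Scikit-learn expects X as a 2D array: one row per house, one column per feature.
X = np.array([[1000], [1500], [2000], [2500], [3000]])  # square footage
y = np.array([205000, 248000, 301000, 352000, 398000])  # price in USD

# Train (fit) the model.
model = LinearRegression()
model.fit(X, y)

# Peek inside: coef_ holds one weight per feature, intercept_ holds the bias.
print(f"Coefficient (w): {model.coef_[0]:.2f}")
print(f"Intercept (b): {model.intercept_:.2f}")

# Predict the price of a new 2,200 sq ft house.
new_house = np.array([[2200]])
predicted_price = model.predict(new_house)[0]
print(f"Predicted price: {predicted_price:,.0f}")
```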
*If the Coefficient is $100, the model learned that for every additional square foot, the house value increases by $100.*
7. Regression Evaluation
We cannot use "Accuracy" for regression. If the actual price is $250,000 and the model predicts $250,001, an "Accuracy" metric would mark it as 100% wrong. Instead, we measure the *error* (the distance between the prediction and the actual value).
- Mean Squared Error (MSE): The average of the squared errors. Lower is better.
- R-squared (R2 Score): The proportion of the variance in the target that is explained by the features. A perfect model scores 1, a model that always predicts the mean scores 0, and a model that does worse than that scores below 0. An R2 of 0.90 means the model explains 90% of the variance, which is a strong result for many problems.
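Both metrics are available in `sklearn.metrics`. The sketch below computes them on a handful of actual and predicted prices that are invented for illustration, and also computes MSE by hand to show that it matches the definition above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted prices in USD (invented for illustration).
y_true = np.array([200000.0, 250000.0, 300000.0, 350000.0])
y_pred = np.array([210000.0, 240000.0, 310000.0, 340000.0])

# MSE: average of the squared errors (units are dollars squared).
mse = mean_squared_error(y_true, y_pred)
mse_by_hand = np.mean((y_true - y_pred) ** 2)

# Taking the square root (RMSE) brings the error back to dollars.
rmse = np.sqrt(mse)

# R-squared: proportion of the variance in y_true explained by the predictions.
r2 = r2_score(y_true, y_pred)

print(f"MSE: {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R2: {r2:.3f}")
```

In practice these metrics are usually computed on a held-out test set rather than on the data the model was trained on.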
8. Multiple Linear Regression
What if we have more than one input feature (e.g., SqFt, Number of Bedrooms, Age of House)? The modeling code does not change: X simply has one column per feature, and Scikit-learn learns one weight for each.
y = (w1 * SqFt) + (w2 * Beds) + (w3 * Age) + b
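As a minimal sketch of the multi-feature case, the example below builds a small mock dataset whose prices are generated from a made-up formula, so the learned weights can be checked against known values. The feature values, `true_w`, and `true_b` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical mock dataset. Columns: SqFt, Beds, Age (years).
X = np.array([
    [1000.0, 2.0, 30.0],
    [1500.0, 3.0, 20.0],
    [2000.0, 3.0, 15.0],
    [2500.0, 4.0, 10.0],
    [3000.0, 4.0, 5.0],
    [1800.0, 3.0, 25.0],
])

# Prices generated from a made-up formula (an assumption, not real market data):
# price = 100 * SqFt + 5000 * Beds - 1000 * Age + 50000
true_w = np.array([100.0, 5000.0, -1000.0])
true_b = 50000.0
y = X @ true_w + true_b

# Same two lines as the single-feature case.
model = LinearRegression()
model.fit(X, y)

# coef_ now contains one weight per column of X: w1, w2, w3.
for name, weight in zip(["SqFt", "Beds", "Age"], model.coef_):
    print(f"{name}: {weight:.2f}")
print(f"Intercept: {model.intercept_:.2f}")
```

Because the prices here contain no noise, the model recovers the formula's weights almost exactly. With real data the weights are estimates.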
9. Common Mistakes
- Applying Linear Regression to Non-Linear Data: If your data looks like a curved U-shape when plotted, a straight line will fail to predict anything accurately. You would need Polynomial Features or a non-linear algorithm like Random Forest.
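- Forgetting to Scale Features: Plain Linear Regression is not as sensitive to unscaled data as SVM or KNN, and scaling does not change its predictions. Scaling still helps because it puts the coefficients on a comparable scale, and it becomes important for gradient-based or regularized variants such as SGDRegressor, Ridge, and Lasso.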
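The first mistake can be seen in a few lines. This sketch fits a straight line and then a degree-2 polynomial pipeline to U-shaped data invented for illustration (y = x squared), and compares their R2 scores. The scores are computed on the training data purely to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical U-shaped data (invented for illustration): y = x^2.
X = np.arange(-5, 6, dtype=float).reshape(-1, 1)
y = (X ** 2).ravel()

# A straight line cannot follow the U shape.
linear_model = LinearRegression().fit(X, y)
linear_r2 = linear_model.score(X, y)  # score() returns R2 for regressors

# Adding x^2 as an extra feature lets the same algorithm fit the curve.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
poly_r2 = poly_model.score(X, y)

print(f"Straight line R2: {linear_r2:.3f}")
print(f"Polynomial (degree 2) R2: {poly_r2:.3f}")
```

The model is still linear in its weights; only the input features have changed.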
10. Best Practices
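- Analyze Coefficients: In business, stakeholders don't just want a prediction; they want to know *why*. By printing `model.coef_`, you can show the marketing team how much the prediction changes for a one-unit increase in each feature. Compare coefficients across features only when the features are on the same scale.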
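Raw coefficients depend on each feature's units, so one hedged way to compare features is to standardize them first. The sketch below uses a mock dataset with prices generated from a made-up formula (both are assumptions for illustration) and prints the raw and standardized coefficients side by side.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

feature_names = ["SqFt", "Beds", "Age"]

# Hypothetical mock dataset. Columns: SqFt, Beds, Age (years).
X = np.array([
    [1000.0, 2.0, 30.0],
    [1500.0, 3.0, 20.0],
    [2000.0, 3.0, 15.0],
    [2500.0, 4.0, 10.0],
    [3000.0, 4.0, 5.0],
    [1800.0, 3.0, 25.0],
])
# Prices from a made-up formula (an assumption, not real market data).
y = X @ np.array([100.0, 5000.0, -1000.0]) + 50000.0

# Raw coefficients: change in price per one unit of each feature.
raw_model = LinearRegression().fit(X, y)

# Standardized coefficients: change in price per one standard deviation.
scaled_model = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)
scaled_coefs = scaled_model.named_steps["linearregression"].coef_

for name, raw, scaled in zip(feature_names, raw_model.coef_, scaled_coefs):
    print(f"{name}: raw={raw:.2f}, standardized={scaled:.2f}")
```

Both models make the same predictions. Only the units of the coefficients differ, which is why the standardized values are the ones to compare across features.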
11. Exercises
- 1. If an ML model predicts a continuous number like a bank account balance, is it a Classification or Regression problem?
- 2. Write the Python code to import `LinearRegression` from Scikit-learn and instantiate it into a variable named `model`.
12. MCQ Quiz with Answers
Which evaluation metric is appropriate for a Linear Regression model?
What does the "Coefficient" in a single-feature Linear Regression model represent?
13. Interview Questions
- Q: Explain the mathematical concept of Linear Regression in simple terms.
- Q: What is the difference between Mean Squared Error (MSE) and the R-squared metric?