CHAPTER 08
Intermediate
Regression Assumptions Explained
Updated: May 16, 2026
6 min read
1. Introduction
Linear Regression is not magic; it is pure statistics. Because it is a statistical algorithm, it comes with a set of rules, known as the Assumptions of Linear Regression. If you feed scikit-learn a dataset that violates these rules, the Python code will still run, and the model will still output a prediction. However, that prediction may be unreliable, and the coefficients and any error estimates derived from them can be badly misleading. In this chapter, we explore the 5 rules you must check before trusting your model in production.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Explain the necessity of Regression Assumptions.
- Define Linearity.
- Understand Independence of Observations.
- Explain Homoscedasticity vs. Heteroscedasticity.
- Define Normal Distribution of Residuals.
- Identify Multicollinearity and its dangers.
3. Rule 1: Linearity
The Rule: There must be a linear (straight-line) relationship between the independent variables (X) and the dependent variable (y).
The Intuition: If you are trying to fit a perfectly straight stick through data points that form a U-shape (like a smile), the stick will miss almost all the points.
How to check: Create a scatter plot of each X against y. If the dots follow a curve rather than a straight line, a plain linear model will fit poorly. In that case, transform the feature or switch to a model that can bend, such as Polynomial Regression.
4. Rule 2: Independence of Observations
The Rule: The data points must be independent of each other. One row's target value (more precisely, its error) must not influence the next row's.
The Intuition: If Row 1 is the temperature on Monday, and Row 2 is the temperature on Tuesday, these are *not* independent. Monday's weather heavily influences Tuesday's weather. This is "Time-Series" data.
The Danger: Standard Linear Regression is a poor fit for stock prices or weather patterns because it assumes the rows are unrelated. You need specialized Time-Series models (like ARIMA or LSTMs) for this.
5. Rule 3: Homoscedasticity (Equal Variance)
The Rule: The spread of the "errors" (how far the dots are from the regression line) must remain roughly constant across all predictions.
The Intuition: Imagine predicting the price of a house. For cheap houses ($100k), the model might be off by $5k. But for expensive mansions ($5 Million), the error explodes to $500k. The errors get wider like a megaphone. This is called Heteroscedasticity, and it means your model is much less reliable for larger values.
How to check: Plot the predicted values against the errors (a Residual Plot). If the dots fan out into a cone shape, the rule is violated.
6. Rule 4: Normal Distribution of Errors (Residuals)
The Rule: The errors the model makes should follow a normal "Bell Curve" distribution.
The Intuition: Your model should make small mistakes very frequently, and massive mistakes very rarely. Furthermore, the mistakes should be balanced: it should over-predict and under-predict about equally often. If your model consistently under-predicts prices by a large margin, the errors are skewed, which suggests the model is missing a key variable or is otherwise misspecified.
7. Rule 5: No Multicollinearity
The Rule: The independent variables (X columns) should not be highly correlated with *each other*.
The Intuition: Imagine predicting House Price using two columns: SizeinSquareFeet and SizeinSquareMeters. These two columns contain the exact same information!
The Danger: If two features are perfectly correlated, the math inside Linear Regression breaks down: the matrix that must be inverted to solve for the coefficients becomes singular, which is the matrix equivalent of dividing by zero. When features are highly but not perfectly correlated, the coefficients ($m$) swing wildly with small changes in the data, and the model's interpretation becomes useless.
How to fix: Drop one of the correlated columns before training!
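The square-feet versus square-meters example can be reproduced numerically. The following is a minimal sketch using NumPy only; the synthetic house sizes and the variable names (`sqft`, `sqm`, `X_fixed`) are assumptions made for illustration and do not come from a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical house sizes; sqm is the same information in another unit.
sqft = rng.uniform(500, 4000, size=200)
sqm = sqft * 0.092903

# Design matrix with an intercept column plus both size columns.
X = np.column_stack([np.ones_like(sqft), sqft, sqm])

# Three columns, but only two carry independent information.
rank = np.linalg.matrix_rank(X)
cond = np.linalg.cond(X)  # a huge condition number signals a near-singular matrix
print("columns:", X.shape[1], "rank:", rank, "condition number:", cond)

# Dropping one of the duplicated columns restores full column rank.
X_fixed = np.column_stack([np.ones_like(sqft), sqft])
rank_fixed = np.linalg.matrix_rank(X_fixed)
print("columns:", X_fixed.shape[1], "rank:", rank_fixed)
```

A rank lower than the number of columns means the coefficients are not uniquely determined, which is why dropping one of the duplicated columns is the usual remedy.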
8. The "LINE" Acronym
A great way to remember the core assumptions for interviews is the LINE acronym:
- Linearity
- Independence
- Normality of residuals
- Equal variance (Homoscedasticity)
*(Note: Multicollinearity is the 5th, often separate assumption dealing directly with Multiple Regression).*
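Each letter of LINE can be given a rough numeric check on the residuals of a fitted line. The sketch below is one possible way to do that with NumPy alone. The helper name `line_diagnostics`, the synthetic datasets, and the choice of summary numbers are illustrative assumptions; they are quick proxies for the plots described earlier, not formal statistical tests.

```python
import numpy as np

def line_diagnostics(x, y):
    """Fit y = intercept + slope * x by least squares and summarize the residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x
    resid = y - fitted
    centered_sq = (x - x.mean()) ** 2
    return {
        # L: leftover U-shape shows up as correlation with the squared, centered x.
        "curvature": float(np.corrcoef(resid, centered_sq)[0, 1]),
        # I: Durbin-Watson statistic, close to 2 when consecutive residuals
        # are uncorrelated (rows must be in their natural order).
        "durbin_watson": float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)),
        # N: sample skewness, close to 0 for a symmetric bell curve.
        "skewness": float(np.mean(resid ** 3) / np.std(resid) ** 3),
        # E: if the spread grows with the prediction, |residual| correlates
        # with the fitted value (the megaphone shape).
        "fan": float(np.corrcoef(np.abs(resid), fitted)[0, 1]),
    }

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1, 10, size=n)
noise = rng.normal(0, 1, size=n)

# Synthetic datasets: one that follows the rules and three that break one each.
well_behaved = line_diagnostics(x, 3 + 2 * x + noise)
curved = line_diagnostics(x, (x - 5.5) ** 2 + noise)   # breaks Linearity
fanning = line_diagnostics(x, 3 + 2 * x + noise * x)   # breaks Equal variance

# Time-ordered data whose errors carry over from one row to the next.
t = np.arange(n, dtype=float)
ar_noise = np.zeros(n)
for i in range(1, n):
    ar_noise[i] = 0.8 * ar_noise[i - 1] + noise[i]
autocorrelated = line_diagnostics(t, 3 + 0.5 * t + ar_noise)  # breaks Independence

for name, result in [("well_behaved", well_behaved), ("curved", curved),
                     ("fanning", fanning), ("autocorrelated", autocorrelated)]:
    print(name, result)
```

In practice these numbers complement the scatter and residual plots rather than replace them; any cutoff for "too large" is a judgment call.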
9. Common Mistakes
- Ignoring the Assumptions: The biggest mistake beginners make is loading a CSV, calling model.fit(), and deploying the model immediately because the R-Squared score looked good. If the data is highly multicollinear, the coefficients will be unstable and the model may behave unpredictably in production.
- Deleting data to force Normality: If your errors are not normally distributed, do not just delete the data points that the model got wrong. This is data manipulation! Investigate *why* the model failed on those points.
10. Best Practices
- Correlation Heatmaps: Always plot a Pandas Correlation Matrix/Heatmap (df.corr()) before training. If you see two Input Features (X) that have a 0.95 correlation with each other, delete one of them to prevent Multicollinearity!
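The correlation check can also be done without a plot by scanning the matrix that df.corr() returns. Here is a minimal sketch; the helper `highly_correlated_pairs`, the column names, and the synthetic data are hypothetical, and the 0.95 cutoff simply mirrors the figure used above.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.95):
    """Return (column_a, column_b, |correlation|) for each pair at or above threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    # Only look above the diagonal so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = float(corr.iloc[i, j])
            if value >= threshold:
                pairs.append((cols[i], cols[j], value))
    return pairs

rng = np.random.default_rng(42)
size_sqft = rng.uniform(500, 4000, size=300)

# Hypothetical feature table: two columns duplicate each other, one does not.
features = pd.DataFrame({
    "size_sqft": size_sqft,
    "size_sqm": size_sqft * 0.092903,
    "bedrooms": rng.integers(1, 6, size=300),
})

pairs = highly_correlated_pairs(features)
print(pairs)

# Drop the second column of each flagged pair before training.
reduced = features.drop(columns=[b for _, b, _ in pairs])
print(list(reduced.columns))
```

Which column of a flagged pair to drop is a modeling decision; this sketch drops the second one only to keep the example short.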
11. Exercises
- 1. Why is standard Linear Regression a poor choice for predicting the stock market (Time-Series data)? Which assumption does it violate?
- 2. What is the difference between Homoscedasticity and Heteroscedasticity?
12. MCQ Quiz with Answers
Question 1
What does the assumption of "Linearity" dictate?
Question 2
If you include both "Year of Birth" and "Current Age" as input features in a Multiple Regression model, which assumption are you severely violating?
13. Interview Questions
- Q: Explain the acronym "LINE" in the context of Linear Regression assumptions.
- Q: Describe a scenario where Heteroscedasticity (megaphone-shaped errors) might occur in a predictive model, and why it makes the model untrustworthy.