CHAPTER 25 Beginner

Model Evaluation Techniques

Updated: May 18, 2026

1. Chapter Introduction

In Chapter 24, we evaluated our model with accuracy_score. But accuracy on its own can be dangerously misleading. Imagine a dataset of 100 emails where 99 are Safe and 1 is Spam. A broken model that guesses "Safe" every single time scores 99% accuracy, yet it fails its only job: catching the spam. This chapter introduces the Confusion Matrix, Precision, Recall, and Cross-Validation so you can evaluate Classification models properly.
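
To make the 99% figure concrete, here is a minimal sketch in plain Python. The label encoding (1 = Spam, 0 = Safe) and the always-"Safe" model are assumptions built to match the 100-email example above, not code from Chapter 24.

```python
# Hypothetical dataset matching the example: 99 Safe emails (0) and 1 Spam email (1)
y_true = [0] * 99 + [1]

# A "broken" model that predicts Safe (0) for every single email
y_pred = [0] * 100

# Accuracy = fraction of predictions that match the true label
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# Count how many of the actual Spam emails the model caught
spam_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"Accuracy: {accuracy:.2f}")                      # Accuracy: 0.99
print(f"Spam caught: {spam_caught} of {sum(y_true)}")   # Spam caught: 0 of 1
```

The score looks excellent, but the one email that mattered was missed.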

2. The Confusion Matrix

A Confusion Matrix breaks down exactly *how* your model was right, and *how* it was wrong. For a binary classifier it is a 2x2 grid with four cells:

True Positives (TP): Model predicted Spam (1), and it WAS Spam. (Good!)

True Negatives (TN): Model predicted Safe (0), and it WAS Safe. (Good!)

False Positives (FP): Model predicted Spam (1), but it was actually Safe. (Bad - A false alarm).

False Negatives (FN): Model predicted Safe (0), but it was actually Spam. (Bad - A missed threat).
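
The four cells can also be counted by hand, which may help before relying on a library. The sketch below uses plain Python and a hypothetical set of 8 emails (1 = Spam, 0 = Safe). It arranges the counts with rows as the actual class and columns as the predicted class, which is the layout scikit-learn's confusion_matrix uses for labels 0 and 1.

```python
# Hypothetical labels for 8 emails: 1 = Spam, 0 = Safe
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

tp = tn = fp = fn = 0
for actual, predicted in zip(y_true, y_pred):
    if predicted == 1 and actual == 1:
        tp += 1   # flagged Spam, and it was Spam
    elif predicted == 0 and actual == 0:
        tn += 1   # left as Safe, and it was Safe
    elif predicted == 1 and actual == 0:
        fp += 1   # false alarm
    else:
        fn += 1   # missed threat

# Rows = actual class (0, 1), columns = predicted class (0, 1)
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)  # [[4, 1], [1, 2]]
```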

python

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming you have y_test and predictions from Chapter 24
cm = confusion_matrix(y_test, predictions)

# Visualize it cleanly using Seaborn
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title("Confusion Matrix")
plt.show()

3. Precision vs. Recall

Depending on your business problem, you will usually favor either Precision or Recall, because improving one tends to cost you some of the other.

1. Precision (Quality of Alarms): *Formula: TP / (TP + FP)* Out of all the emails the model flagged as Spam, how many were *actually* Spam? *Use Case:* When False Positives are terrible. (e.g., You don't want a legitimate email from your boss going to the Spam folder).

2. Recall (Catching the Threats): *Formula: TP / (TP + FN)* Out of all the actual Spam emails that existed, how many did the model *catch*? *Use Case:* When False Negatives are terrible. (e.g., Cancer detection. It is better to falsely alarm a patient than to miss a real tumor).
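
Both formulas are simple enough to compute directly from the confusion-matrix counts. The helper names and the counts below are hypothetical, and returning 0.0 when the denominator is zero is a convention chosen for this sketch.

```python
def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP): of everything flagged as Spam, how much really was Spam."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN): of all actual Spam, how much the model caught."""
    actual_positives = tp + fn
    return tp / actual_positives if actual_positives else 0.0


# Hypothetical counts: 8 Spam caught, 2 false alarms, 8 Spam missed
tp, fp, fn = 8, 2, 8

print(f"Precision: {precision(tp, fp):.2f}")  # Precision: 0.80
print(f"Recall:    {recall(tp, fn):.2f}")     # Recall:    0.50
```

Here the alarms are mostly trustworthy (high Precision), yet half of the Spam still gets through (low Recall).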

python

from sklearn.metrics import precision_score, recall_score, classification_report

print(f"Precision: {precision_score(y_test, predictions):.2f}")
print(f"Recall: {recall_score(y_test, predictions):.2f}")

# The ultimate function that prints everything at once:
print(classification_report(y_test, predictions))

4. Cross-Validation

A Train-Test split relies on a random slice of data. What if the Test set happens to contain only the easiest data points by pure luck? Your score will be artificially high.

K-Fold Cross-Validation reduces this risk. It chops the data into K roughly equal pieces (folds). With K = 5, it trains on 4 folds and tests on the remaining 1. Then it rotates, repeating this 5 times until every fold has served as the Test set exactly once. The final score is the average of the 5 test scores.
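
The rotation can be sketched in plain Python without any library. The function name k_fold_indices is hypothetical, and this version splits the rows in order with no shuffling or class balancing, so treat it as an illustration of the idea, not a replacement for scikit-learn's splitters.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 10 rows, 5 folds: each fold tests on 2 rows and trains on the other 8
for fold, (train, test) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"Fold {fold}: test on {test}, train on {train}")
```

Every row index appears in exactly one test fold, which is what lets the averaged score reflect the whole dataset.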

python

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=3)

# Run 5-Fold Cross Validation on the entire X and y datasets
# cv=5 means 5 folds
scores = cross_val_score(model, X, y, cv=5)

print("Scores for each fold:", scores)
print(f"Mean cross-validated accuracy: {scores.mean():.2f}")
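
The snippet above assumes X and y already exist. As a self-contained sketch, the version below builds a synthetic, imbalanced dataset with make_classification (an assumption for illustration, not data from this course). It also passes scoring="recall" to cross_val_score, since accuracy alone can mislead on imbalanced data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: roughly 90% class 0 and 10% class 1 (purely illustrative)
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=42
)

model = DecisionTreeClassifier(max_depth=3, random_state=42)

# Default scoring for a classifier is accuracy
accuracy_scores = cross_val_score(model, X, y, cv=5)

# Same folds, but scored on recall for the positive class (1)
recall_scores = cross_val_score(model, X, y, cv=5, scoring="recall")

print(f"Accuracy: {accuracy_scores.mean():.2f} +/- {accuracy_scores.std():.2f}")
print(f"Recall:   {recall_scores.mean():.2f} +/- {recall_scores.std():.2f}")
```

Reporting the spread across folds next to the mean gives a rough sense of how much the score depends on which rows land in the test fold.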

5. Mini Project: Cancer Detection Evaluator

Let's evaluate a fake model predicting Malignant (1) vs Benign (0) tumors.

python

from sklearn.metrics import classification_report, confusion_matrix

# 10 Patients. 1 = Malignant (Cancer), 0 = Benign (Safe)
y_actual = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]

# The model's guesses
y_pred =   [1, 0, 0, 0, 1, 1, 0, 1, 0, 0] 

# Evaluate
print("--- CANCER PREDICTION REPORT ---")
print(classification_report(y_actual, y_pred))

# Look closely at the Confusion Matrix
cm = confusion_matrix(y_actual, y_pred)
print("\nConfusion Matrix:")
print(f"True Negatives (Safe correctly predicted): {cm[0][0]}")
print(f"False Positives (Safe patient told they have cancer): {cm[0][1]}")
print(f"False Negatives (Cancer patient told they are safe): {cm[1][0]}")
print(f"True Positives (Cancer correctly detected): {cm[1][1]}")

*Business Decision: The False Negative is deadly. We must adjust the algorithm to prioritize Recall over Precision.*
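
One common way to favor Recall is to lower the probability threshold at which a patient is flagged as Malignant. The sketch below reuses y_actual from the mini project. The probabilities in y_prob are hypothetical, chosen so that a 0.5 threshold reproduces the y_pred list above. With scikit-learn, such probabilities would typically come from a classifier's predict_proba method.

```python
# Same 10 patients as the mini project: 1 = Malignant, 0 = Benign
y_actual = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]

# Hypothetical predicted probabilities of "Malignant" for each patient
y_prob = [0.90, 0.10, 0.20, 0.40, 0.80, 0.60, 0.32, 0.70, 0.05, 0.35]


def precision_recall_at(threshold: float):
    """Flag a patient as Malignant when probability >= threshold, then score."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for a, p in zip(y_actual, y_pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_actual, y_pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_actual, y_pred) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.5, 0.3):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.5: precision=0.75, recall=0.75
# threshold=0.3: precision=0.57, recall=1.00
```

Lowering the threshold catches the missed cancer patient, at the price of more false alarms. That is the Precision-for-Recall trade the business decision calls for.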

6. Common Mistakes

Relying solely on Accuracy for Imbalanced Data: If a dataset is 99% Class A and 1% Class B, accuracy is a useless metric. You must look at the Confusion Matrix.

Confusing Precision and Recall: Precision asks "When you yelled wolf, was there actually a wolf?" Recall asks "Of all the wolves that were there, how many did you yell at?"

7. MCQs

Question 1

Why is "Accuracy" a flawed metric for imbalanced datasets?

Question 2

What does a "False Positive" mean in a Spam filter?

Question 3

What does a "False Negative" mean in a cancer detection model?

Question 4

Which metric asks: "Out of all the items the model flagged as Positive, how many were actually Positive?"

Question 5

Which metric asks: "Out of all the actual Positives in the dataset, how many did the model successfully catch?"

Question 6

If missing a threat (False Negative) is catastrophic (e.g., Cancer detection), which metric must you prioritize?

Question 7

If a false alarm (False Positive) is unacceptable (e.g., sending the CEO's email to Spam), which metric must you prioritize?

Question 8

What Scikit-Learn function prints Precision, Recall, and Accuracy all at once?

Question 9

What is K-Fold Cross-Validation?

Question 10

How many quadrants are in a standard binary Confusion Matrix?

8. Interview Questions

Q: You are building a model to detect fraudulent credit card transactions. Fraud is extremely rare (0.1% of transactions). Why is Accuracy a terrible metric here? What metric would you use instead?

Q: Explain K-Fold Cross-Validation. Why is it more robust than a single Train-Test split?

9. Summary

Never trust Accuracy on its own. Use confusion_matrix() to see exactly where your model is failing. Optimize for Precision if False Positives (false alarms) are expensive. Optimize for Recall if False Negatives (missed threats) are dangerous. Use classification_report() to see everything at a glance, and use cross_val_score() to check your model's robustness before putting it into production.

10. Next Chapter Recommendation

In Chapter 26: Working with APIs and Web Data, we step away from Machine Learning to learn how Data Engineers gather raw data from the internet using REST APIs and the Python requests library.
