Scikit-learn Basics
CHAPTER 16 Intermediate

Model Evaluation Metrics

Updated: May 16, 2026
6 min read

1. Introduction

If you build a model to detect a rare disease that affects 1% of the population, and the model simply predicts "Healthy" for every single person, its Accuracy is 99%. That score looks like an A+, but the model never identifies a single sick patient, so it is useless and potentially dangerous. Accuracy on its own is a misleading metric for imbalanced datasets. In this chapter, we will learn the metrics Data Scientists use to evaluate classifiers properly: Precision, Recall, F1 Score, and the Confusion Matrix.
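
The 99% figure is easy to reproduce. The sketch below uses made-up screening data (1,000 people, 10 of them sick) and a stand-in "model" that always predicts Healthy; the numbers are illustrative assumptions, not real medical data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical screening data: 1,000 people, 1% (10) actually sick (label 1).
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "model" that predicts Healthy (0) for every single person.
y_pred = np.zeros(1000, dtype=int)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.2f}")  # 0.99
print(f"Recall:   {recall:.2f}")    # 0.00, not one sick patient was found
```

Recall, introduced later in this chapter, exposes the problem immediately: the model finds none of the 10 sick patients.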

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Explain why overall Accuracy can be misleading.
  • Interpret a Confusion Matrix (True Positives, False Positives, etc.).
  • Calculate and understand Precision and Recall.
  • Utilize the F1 Score to balance Precision and Recall.
  • Generate a Classification Report in Scikit-learn.

3. The Confusion Matrix

The foundation of all classification metrics is the Confusion Matrix. For a binary classifier, it is a simple 2x2 table that shows exactly *how* your model was confused: which class it mistook for which.

Imagine testing 100 emails (50 Spam, 50 Normal), with Spam as the "Positive" class.

  • True Positives (TP): It was Spam, and the model correctly predicted Spam.
  • True Negatives (TN): It was Normal, and the model correctly predicted Normal.
  • False Positives (FP) [Type 1 Error]: It was Normal, but the model falsely predicted Spam. (The user misses an important email).
  • False Negatives (FN) [Type 2 Error]: It was Spam, but the model falsely predicted Normal. (A scam email hits the inbox).

4. Scikit-learn Confusion Matrix

python
from sklearn.metrics import confusion_matrix
import numpy as np

# y_true = Actual answers, y_pred = Model's predictions
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])

cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:\n", cm)
# Output format:
# [[TN  FP]
#  [FN  TP]]
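
For binary problems, a common idiom is to unpack the four counts directly from the matrix. The sketch below reuses the same six labels; the variable names `tn`, `fp`, `fn`, and `tp` are just illustrative choices.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])

# For binary labels 0 and 1, ravel() flattens the 2x2 matrix row by row,
# which yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=2, FP=0, FN=1, TP=3
```

The single False Negative is the third example: it was actually 1, but the model predicted 0.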

5. Precision vs. Recall

Depending on your business problem, you will care more about False Positives or False Negatives.

Precision: Out of all the times the model predicted "Positive", how many were actually correct?

  • *Formula:* TP / (TP + FP)
  • *Use Case:* Spam detection. You want high Precision. You do NOT want False Positives (sending a crucial work email to the Spam folder).

Recall (Sensitivity): Out of all the actual "Positives" in the real world, how many did the model find?

  • *Formula:* TP / (TP + FN)
  • *Use Case:* Cancer detection. You want high Recall. You do NOT want False Negatives (telling a sick patient they are healthy).
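
Both formulas can be checked by hand against Scikit-learn's `precision_score` and `recall_score`. The ten labels below are invented for illustration (1 = Spam, 0 = Normal).

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical results for 10 emails: 1 = Spam (Positive), 0 = Normal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Count the three outcomes the formulas need.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision_manual = tp / (tp + fp)  # TP / (TP + FP)
recall_manual = tp / (tp + fn)     # TP / (TP + FN)

print(f"Precision: {precision_manual:.3f} (sklearn: {precision_score(y_true, y_pred):.3f})")
print(f"Recall:    {recall_manual:.3f} (sklearn: {recall_score(y_true, y_pred):.3f})")
# Precision: 0.667 (sklearn: 0.667)
# Recall:    0.500 (sklearn: 0.500)
```

Here the model flagged 3 emails as Spam and 2 of them really were (Precision 2/3), but it found only 2 of the 4 actual Spam emails (Recall 1/2).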

6. The F1 Score

Usually, if you tune a model to increase Precision, Recall drops. If you increase Recall, Precision drops. The F1 Score is the harmonic mean of Precision and Recall. It gives you a single metric that balances both. If the F1 Score is high, both Precision and Recall are solid.
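
Written out, the harmonic mean is F1 = 2 × Precision × Recall / (Precision + Recall). The sketch below shows why it is used instead of a plain average; `harmonic_mean` is a hypothetical helper written for this example, and the labels are invented.

```python
from sklearn.metrics import f1_score, precision_score, recall_score


def harmonic_mean(p, r):
    """F1 = 2 * P * R / (P + R); treated as 0 when both inputs are 0."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


# A lopsided model: perfect Precision, but it finds only 10% of the positives.
lopsided_f1 = harmonic_mean(1.0, 0.1)
print(f"Lopsided F1: {lopsided_f1:.3f}")  # 0.182, far below the plain average of 0.55

# The helper agrees with Scikit-learn's f1_score on hypothetical labels.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
f1_manual = harmonic_mean(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
f1_sklearn = f1_score(y_true, y_pred)
print(f"Manual F1: {f1_manual:.3f}, sklearn F1: {f1_sklearn:.3f}")  # both 0.571
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot earn a high F1 Score by excelling at only one of them.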

7. Scikit-learn Classification Report

Instead of calculating these manually, use Scikit-learn's classification_report function, which returns a formatted text report of all these metrics in a single call.
python
from sklearn.metrics import classification_report

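y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 2, 0]

# Class 1 is never predicted, so its Precision is undefined (0 / 0).
# zero_division=0 reports it as 0.0 instead of emitting a warning.
print(classification_report(y_true, y_pred, zero_division=0))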

*This report will show Precision, Recall, F1-Score, and Support (how many examples of each class existed) for every single category!*
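
When you need the numbers rather than printed text, for example to log them or compare models, `classification_report` also accepts `output_dict=True`. A minimal sketch on the same five labels; `zero_division=0` is used here because class 1 is never predicted.

```python
from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 2, 0]

# output_dict=True returns a nested dict instead of a formatted string.
# zero_division=0 reports 0 for class 1's undefined Precision without a warning.
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

# Class labels become string keys; overall accuracy has its own top-level key.
print("Class 2 recall:   ", report["2"]["recall"])
print("Class 1 precision:", report["1"]["precision"])
print("Class 1 support:  ", report["1"]["support"])
print("Accuracy:         ", report["accuracy"])
```

Class 2 is recovered perfectly, while class 1 (a single example that the model labeled 0) scores zero even though overall accuracy is 4 out of 5.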

8. ROC and AUC

ROC (Receiver Operating Characteristic) Curve: A graph that plots the True Positive Rate against the False Positive Rate at various threshold settings (remember the Sigmoid 0.5 threshold from Chapter 10?).

AUC (Area Under the Curve): A single number summarizing the ROC curve.
  • AUC = 0.5: The model is guessing randomly.
  • AUC = 1.0: The model is perfect.
  • AUC > 0.8: The model is often considered strong, although what counts as "good" depends on the problem.
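
In Scikit-learn, both can be computed with `roc_curve` and `roc_auc_score`. The sketch below assumes a Logistic Regression trained on synthetic data from `make_classification`; the dataset and `max_iter=1000` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC analysis needs scores, not hard 0/1 predictions.
# Column 1 of predict_proba is the estimated probability of class 1.
scores = model.predict_proba(X_test)[:, 1]

# One (FPR, TPR) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

print(f"AUC: {auc:.3f}")
print(f"Points on the ROC curve: {len(thresholds)}")
```

Passing `model.predict(X_test)` instead of the probabilities would collapse the curve to a single threshold, so always feed scores to these two functions.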

9. Mini Project: Compare ML Models

Let's train two models and use the F1 Score to see which is better.
python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
lr = LogisticRegression().fit(X_train, y_train)
lr_preds = lr.predict(X_test)

# Train Random Forest
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
rf_preds = rf.predict(X_test)

print(f"Logistic Regression F1: {f1_score(y_test, lr_preds):.3f}")
print(f"Random Forest F1: {f1_score(y_test, rf_preds):.3f}")

10. Common Mistakes

  • Relying on Accuracy for Imbalanced Data: As mentioned in the introduction, if 99% of your data is Class 0, a model that predicts "0" every single time is 99% accurate but never finds a single Class 1 example. Prefer F1-Score or AUC for imbalanced datasets.

11. Best Practices

  • Define the business metric early: Before building the model, ask the business stakeholders: "What is worse for our business? A False Positive or a False Negative?" This dictates whether you optimize the model for Precision or Recall.
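
Once stakeholders have answered that question, one common way to act on it is to move the decision threshold away from the default 0.5. The sketch below is an illustration under assumed settings (synthetic data, Logistic Regression, three arbitrary thresholds), not a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # estimated probability of class 1

recalls = []
for threshold in (0.2, 0.5, 0.8):  # arbitrary example thresholds
    # Predict Positive only when the probability reaches the threshold.
    preds = (proba >= threshold).astype(int)
    p = precision_score(y_test, preds, zero_division=0)
    r = recall_score(y_test, preds)
    recalls.append(r)
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```

Raising the threshold can never increase Recall, because fewer examples are labeled Positive; it usually increases Precision. If False Negatives are the costlier error, lower the threshold; if False Positives are, raise it.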

12. Exercises

  1. You are building a model to flag fraudulent credit card transactions. Is it better to have a False Positive (flagging a normal purchase as fraud) or a False Negative (letting a fraudulent purchase go through)? Would you optimize for Precision or Recall?
  2. Write the code to import confusion_matrix and classification_report.

13. MCQ Quiz with Answers

Question 1

In a medical test for a deadly disease, which type of error is generally considered the most dangerous?

Answer: A False Negative (Type 2 Error), because a sick patient is told they are healthy.

Question 2

Which metric provides a balance between Precision and Recall?

Answer: The F1 Score, the harmonic mean of Precision and Recall.

14. Interview Questions

  • Q: Explain why Accuracy is a poor metric to evaluate a model trained on a highly imbalanced dataset (e.g., 99% Class 0, 1% Class 1).
  • Q: Define Precision and Recall and give an example of a scenario where you would prioritize Precision.

15. FAQs

Q: What is a good F1 Score? A: It depends entirely on the difficulty of the problem. For classifying distinct images of cats and dogs, an F1 of 0.95 is expected. For predicting the stock market, an F1 of 0.55 might make you a billionaire.

16. Summary

Metrics translate mathematical performance into business reality. By moving beyond simple Accuracy and mastering the Confusion Matrix, Precision, Recall, and the F1 Score, you can confidently evaluate models on difficult, imbalanced datasets and tune them to meet exact business requirements.

17. Next Chapter Recommendation

Suppose you built a Random Forest and got an F1 Score of 0.80. Can you get it to 0.85? Often yes, by tuning the algorithm's hyperparameters, the settings you choose before training. In Chapter 17: Hyperparameter Tuning and GridSearchCV, we will learn how to automate the search for the best model settings.
