Python for Data Science
CHAPTER 24 Beginner

Classification Algorithms

Updated: May 18, 2026

1. Chapter Introduction

While Regression predicts numbers (e.g., $450,000), Classification predicts discrete categories. Will this customer Churn (Yes/No)? Is this email Spam or Not Spam? Is this image a Cat, Dog, or Bird? This chapter introduces the three most common classification algorithms in Scikit-Learn: Logistic Regression, K-Nearest Neighbors (KNN), and Decision Trees.

2. Logistic Regression (Binary Classification)

Despite the word "Regression" in its name, Logistic Regression is used for *Classification*. It predicts the probability (from 0 to 1) that an item belongs to a specific class. With Scikit-Learn's default behavior, if that probability is greater than 0.5, the model predicts "Yes" (1). Otherwise, it predicts "No" (0).

python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Initialize
model = LogisticRegression()

# 2. Train (Assuming X_train, y_train are already preprocessed)
# y_train contains 1s (Spam) and 0s (Not Spam)
model.fit(X_train, y_train)

# 3. Predict
predictions = model.predict(X_test)

# 4. Evaluate (What percentage did we get right?)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy * 100:.1f}%")
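
To make the probability-and-threshold idea concrete, here is a minimal sketch. The toy data is made up for illustration, and `predict_proba` is the standard Scikit-Learn method for reading the probabilities behind a prediction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (made up): one feature, label 1 for the larger values
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# predict_proba returns one column per class, ordered as in model.classes_
proba = model.predict_proba(X)   # shape (6, 2)
p_class1 = proba[:, 1]           # probability that each row belongs to class 1

# Applying the 0.5 threshold by hand gives the same labels as model.predict()
manual_preds = (p_class1 > 0.5).astype(int)

print(np.round(p_class1, 2))
print(manual_preds)
print(model.predict(X))
```

If a different cut-off suits your problem better (say, flagging spam only above 0.8), you can compare `p_class1` against that value yourself instead of calling `.predict()`.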

3. K-Nearest Neighbors (KNN)

KNN is intuitive. It plots the new data point, looks at the 'K' closest historical data points, and takes a vote. If K=5, and 4 of the 5 closest points are "Cats", the model classifies the new point as a "Cat".
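
The voting idea can be sketched by hand in a few lines of NumPy. This is only an illustration of the concept, not how Scikit-Learn implements it internally; the points, labels, and the `knn_vote` helper are all made up for this example.

```python
from collections import Counter

import numpy as np

# Made-up 2-D historical points and their labels
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.0, 2.5],
                   [6.0, 6.0], [7.0, 5.5], [6.5, 7.0]])
labels = ['Cat', 'Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog']


def knn_vote(new_point, k=5):
    """Hypothetical helper: classify new_point by majority vote of its k nearest points."""
    # Euclidean distance from the new point to every stored point
    distances = np.sqrt(((points - new_point) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those k neighbors
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0], dict(votes)


winner, votes = knn_vote(np.array([2.0, 2.0]), k=5)
print(votes)    # {'Cat': 4, 'Dog': 1}
print(winner)   # Cat
```

With K=5, the four nearby "Cat" points outvote the single "Dog" point that makes it into the top five, which is exactly the 4-out-of-5 scenario described above.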

*Crucial:* KNN relies on distance. You MUST scale your data (for example with StandardScaler) before using KNN, or features with large numeric ranges will dominate the distance calculation.

python
from sklearn.neighbors import KNeighborsClassifier

# Initialize with K=5
knn_model = KNeighborsClassifier(n_neighbors=5)

# Train and Predict
knn_model.fit(X_train_scaled, y_train)
knn_preds = knn_model.predict(X_test_scaled)
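
The snippet above assumes `X_train_scaled` and `X_test_scaled` already exist. One way to produce them is sketched below. The age/income data is made up, and `n_neighbors=3` is used only because the toy training set has just six rows.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Made-up data: [age in years, income in dollars]
X_train = np.array([[25, 30_000], [30, 32_000], [35, 31_000],
                    [45, 90_000], [50, 95_000], [55, 88_000]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[28, 31_000], [52, 91_000]], dtype=float)

# Learn the mean and standard deviation from the training data only,
# then apply the same transformation to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train_scaled, y_train)
knn_preds = knn_model.predict(X_test_scaled)
print(knn_preds)  # [0 1]
```

Calling `fit_transform` on the training set and plain `transform` on the test set keeps information about the test data out of the scaling step.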

4. Decision Trees

A Decision Tree works like a flowchart. It asks a series of True/False questions about the features to split the data until it reaches a conclusion. (e.g., *Is Income > 50k?* -> *Is Age > 30?* -> *Classify as 'Will Buy'*).

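Decision Trees do *not* require feature scaling! Each split compares a single feature against a threshold, so rescaling a feature moves the threshold but does not change which rows land on each side.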

python
from sklearn.tree import DecisionTreeClassifier

# Initialize
# max_depth caps how many questions deep the tree can go,
# so it cannot keep splitting until it has memorized the training data
tree_model = DecisionTreeClassifier(max_depth=3)

tree_model.fit(X_train, y_train)
tree_preds = tree_model.predict(X_test)
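
To see the "flowchart" a tree has learned, Scikit-Learn's `export_text` helper prints the splits as indented rules. The sketch below uses made-up income/age data; the exact thresholds you see depend on the data and are not meant to be meaningful.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: 1 = will buy, 0 = will not buy
X = pd.DataFrame({
    'Income': [20_000, 35_000, 60_000, 80_000, 55_000, 90_000],
    'Age':    [22, 45, 25, 40, 35, 50],
})
y = pd.Series([0, 0, 0, 1, 1, 1])

tree_model = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_model.fit(X, y)

# Each indented line is one True/False question; "class:" lines are the conclusions
rules = export_text(tree_model, feature_names=list(X.columns))
print(rules)
```

Reading the printed rules from top to bottom is the same as following the flowchart from the first question down to a final class.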

5. Multi-Class Classification

Classification isn't limited to Yes/No. If your y_train column contains 3 categories (0=Cat, 1=Dog, 2=Bird), the three Scikit-Learn classifiers in this chapter handle it automatically. You write the exact same .fit() and .predict() code.
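
As a rough illustration, the sketch below runs the same `.fit()` and `.predict()` calls on a three-class problem for each algorithm from this chapter. The data is made up, with three clearly separated groups on a similar scale, so no scaler is used here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Made-up data: three well-separated groups, labelled 0=Cat, 1=Dog, 2=Bird
X_train = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                    [5.0, 5.0], [5.1, 4.8], [4.9, 5.2],
                    [9.0, 1.0], [9.2, 1.1], [8.8, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
X_new = np.array([[1.1, 1.0], [5.0, 5.1], [9.1, 1.0]])

labels = {0: 'Cat', 1: 'Dog', 2: 'Bird'}
models = {
    'Logistic Regression': LogisticRegression(),
    'KNN': KNeighborsClassifier(n_neighbors=3),
    'Decision Tree': DecisionTreeClassifier(max_depth=3, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)      # identical call for every algorithm
    preds = model.predict(X_new)     # returns one of the labels 0, 1, or 2 per row
    results[name] = [labels[int(p)] for p in preds]
    print(name, results[name])
```

Nothing in the loop mentions the number of classes; each model reads it from the distinct values in `y_train`.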

6. Mini Project: Spam Email Classifier

Let's simulate classifying emails based on two features: Word Count and Number of Links.

python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Fake Data
# Target: 1 = Spam, 0 = Safe
X_train = pd.DataFrame({
    'Word_Count': [50, 200, 20, 500, 10],
    'Num_Links': [1, 0, 5, 1, 10]
})
y_train = pd.Series([0, 0, 1, 0, 1]) 

# Train
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Predict a new email: 15 words, 8 links
new_email = pd.DataFrame({'Word_Count': [15], 'Num_Links': [8]})
prediction = clf.predict(new_email)

if prediction[0] == 1:
    print("ALERT: Email classified as SPAM")
else:
    print("Email is SAFE")

7. Common Mistakes

  • Not scaling data for KNN: If Feature A ranges from 0-1 and Feature B ranges from 0-1,000,000, KNN will effectively ignore Feature A because the distance calculation is overwhelmed by B. Scale the features first (for example with StandardScaler).
  • Overfitting Decision Trees: If you don't set max_depth, a Decision Tree keeps branching until its leaves are pure, which usually means every training point is classified perfectly. It can score 100% on training data and still do poorly on test data because it memorized the noise.
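
Both mistakes can be seen in a small experiment. The sketch below uses made-up numbers throughout: two hand-picked points for the distance calculation, and random features with random labels for the tree, so there is nothing real for the tree to learn.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# --- Mistake 1: unscaled distances ---
# Made-up points: [Feature A in 0-1, Feature B in 0-1,000,000]
p = np.array([0.1, 500_000.0])
q = np.array([0.9, 500_100.0])   # very different in A, nearly identical in B
diff = q - p
distance = np.sqrt(np.sum(diff ** 2))
share_from_b = diff[1] ** 2 / np.sum(diff ** 2)
print(f"Distance: {distance:.2f}, share contributed by B: {share_from_b:.4%}")

# --- Mistake 2: a tree with no depth limit ---
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)   # labels are pure noise
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# The unlimited tree memorizes the training rows; because the labels are random,
# neither tree can be expected to do much better than chance on the test rows
train_acc = deep_tree.score(X_train, y_train)
test_acc = deep_tree.score(X_test, y_test)
print(f"Unlimited tree: train={train_acc:.2f}, test={test_acc:.2f}")
print(f"max_depth=3:    train={shallow_tree.score(X_train, y_train):.2f}, "
      f"test={shallow_tree.score(X_test, y_test):.2f}")
```

In the first part, almost all of the squared distance comes from Feature B even though the two points differ far more, relatively speaking, in Feature A. In the second part, the gap between training and test accuracy for the unlimited tree is the signature of overfitting.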

8. MCQs

Question 1

Classification algorithms are used to predict what?

Question 2

What does Logistic Regression predict under the hood?

Question 3

Which algorithm classifies new data by taking a "vote" among its closest historical neighbors?

Question 4

Which algorithm relies heavily on distance mathematics and MUST have its features scaled?

Question 5

Which algorithm acts like a flowchart of True/False questions?

Question 6

What metric is used to evaluate the percentage of correct guesses made by a classification model?

Question 7

What is "Binary Classification"?

Question 8

How do you prevent a Decision Tree from memorizing the training data (overfitting)?

Question 9

Is Logistic Regression used for Regression or Classification?

Question 10

If the target y column contains 5 different categories (e.g., Car brands), which algorithms can handle it?

9. Interview Questions

  • Q: Explain how K-Nearest Neighbors (KNN) makes a prediction. Why is feature scaling absolutely mandatory for this algorithm?
  • Q: Despite its name, what task is Logistic Regression actually used for?

11. Summary

Classification predicts categories. Use Logistic Regression for fast, probability-based binary predictions. Use K-Nearest Neighbors for distance-based voting (remembering to scale the data first). Use Decision Trees to build interpretable flowcharts (remembering to limit max_depth to prevent overfitting). No matter the algorithm, the Scikit-Learn .fit() and .predict() syntax remains identical.

12. Next Chapter Recommendation

In Chapter 25: Model Evaluation Techniques, we will learn why standard "Accuracy" is often a lie. You will learn advanced classification metrics like Precision, Recall, and the Confusion Matrix to truly understand where your model is failing.
