Scikit-learn Basics
CHAPTER 07 Intermediate

Feature Engineering and Encoding

Updated: May 16, 2026

1. Introduction

Machine learning algorithms operate on numbers. If your dataset contains a "City" column with values like "New York", "London", and "Tokyo", most Scikit-learn estimators will raise an error when you try to fit them on the raw strings, so we must first convert these text categories into numbers. Sometimes the raw data isn't enough, either: creating new, more informative columns out of existing ones can substantially improve model accuracy. Converting categories into numbers is called Encoding; building new columns is called Feature Engineering.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Understand the difference between Categorical and Numerical data.
  • Apply Label Encoding for ordinal categories.
  • Apply One-Hot Encoding for nominal categories.
  • Generate Polynomial Features to capture complex relationships.
  • Understand basic Feature Selection.

3. Categorical Data Types

Before encoding text, you must identify its type:
  • Ordinal Data: Categories with a built-in mathematical order or ranking. (e.g., T-shirt sizes: Small, Medium, Large. Large is clearly bigger than Small).
  • Nominal Data: Categories with NO inherent order. (e.g., Colors: Red, Green, Blue. Blue is not "greater" than Red).
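
The distinction can also be recorded in the data itself. The following is a minimal sketch, assuming pandas is available; the sample values are invented for illustration. It declares one column as an ordered categorical and leaves the other unordered.

```python
import pandas as pd

# Hypothetical sample columns, used only for illustration.
sizes = pd.Series(["Small", "Large", "Medium", "Small"])
colors = pd.Series(["Red", "Blue", "Green", "Red"])

# Ordinal: declare the ranking explicitly, lowest to highest.
size_cat = pd.Categorical(
    sizes, categories=["Small", "Medium", "Large"], ordered=True
)

# Nominal: categories exist, but no ranking is declared.
color_cat = pd.Categorical(colors)

print(size_cat.ordered)   # True
print(color_cat.ordered)  # False
print(size_cat.codes)     # rank of each value: [0 2 1 0]
```

Declaring the order once keeps the ranking next to the data, so later encoding steps do not have to guess it.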

4. Label Encoding (For Ordinal Data)

Label Encoding converts each category into a simple integer. *Small -> 0, Medium -> 1, Large -> 2.* Because 2 > 0, the model can use the fact that Large ranks above Small, provided the integers are assigned in the categories' true order.
python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample ordinal data
df = pd.DataFrame({"Size": ["Small", "Medium", "Large", "Medium"]})

# LabelEncoder sorts the classes alphabetically (Large=0, Medium=1, Small=2),
# so it does NOT preserve the Small < Medium < Large ranking.
# It is also designed for target labels (y) rather than input features.
encoder = LabelEncoder()
df["Size_Encoded"] = encoder.fit_transform(df["Size"])

# To keep the real order, map each category to its rank explicitly.
df["Size_Mapped"] = df["Size"].map({"Small": 0, "Medium": 1, "Large": 2})

print(df)
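
Scikit-learn also ships an encoder intended for ordered feature columns, OrdinalEncoder. The sketch below assumes the same three sizes as above and passes the ranking explicitly through the `categories` parameter; the column name `Size_Ordinal` is made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"Size": ["Small", "Medium", "Large", "Medium"]})

# List the categories in rank order so Small < Medium < Large is preserved.
encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])

# The encoder expects a 2-D input, hence the double brackets.
df["Size_Ordinal"] = encoder.fit_transform(df[["Size"]]).ravel()

print(df)
```

Unlike a plain dictionary with `.map()`, a fitted encoder can be reused on new data and placed inside a Scikit-learn pipeline.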

5. One-Hot Encoding (For Nominal Data)

If you use Label Encoding on nominal data (Red->0, Green->1, Blue->2), the algorithm will mistakenly assume Blue is mathematically greater than Red. One-Hot Encoding solves this by creating a new binary (0 or 1) column for *every* unique category.
python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Red"]})

# drop='first' removes the first category (alphabetically, "Blue") to avoid
# the Dummy Variable Trap (multicollinearity).
# sparse_output requires scikit-learn 1.2+; older versions call it sparse.
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_colors = encoder.fit_transform(df[["Color"]])

print(encoder.get_feature_names_out())  # ['Color_Green' 'Color_Red']
print(encoded_colors)
# Each row holds 1.0 in the column of its color; "Blue" rows are all 0.0.

Pandas Shortcut: For quick data exploration, Pandas offers a one-line equivalent, get_dummies:

python
df_encoded = pd.get_dummies(df, columns=["Color"], drop_first=True)

6. Feature Engineering: Creating New Features

Feature engineering is the art of using domain knowledge to create new inputs. *Example:* You have a dataset for predicting whether someone will default on a loan, with columns for Total_Debt and Total_Income. Instead of feeding only those raw numbers to the model, you create a new column: Debt_to_Income_Ratio = Total_Debt / Total_Income. This engineered feature is often a better predictor than either raw number alone, because it measures debt relative to the borrower's ability to repay.
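
In pandas, such a ratio is a single column operation. The figures below are invented purely to illustrate the calculation.

```python
import pandas as pd

# Hypothetical loan data, invented for illustration.
loans = pd.DataFrame({
    "Total_Debt": [20000, 5000, 45000],
    "Total_Income": [50000, 60000, 40000],
})

# New engineered feature: debt relative to income.
loans["Debt_to_Income_Ratio"] = loans["Total_Debt"] / loans["Total_Income"]

print(loans)
```

The third applicant has the middle income but owes more than a year's earnings, which the ratio exposes directly.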

7. Polynomial Features

Sometimes, the relationship between features and the target is not a straight line (linear); it's curved. By creating Polynomial Features, you square or multiply existing features together to help simple algorithms capture complex patterns.
python
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# A simple feature matrix
X = np.array([[2, 3]])

# Create polynomial features of degree 2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Original: [a, b]
# New features: [a, b, a^2, a*b, b^2]
print(X_poly)  # Output: [[2. 3. 4. 6. 9.]]
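
To see why this helps, the sketch below fits the same linear model twice on synthetic data that follows y = x^2 exactly (data made up for this illustration): once on the raw feature and once on the polynomial features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data following y = x^2 exactly (made up for this sketch).
X = np.arange(-3, 4).reshape(-1, 1).astype(float)
y = X.ravel() ** 2

# A straight line cannot follow the curve.
linear = LinearRegression().fit(X, y)
r2_linear = linear.score(X, y)

# Adding x^2 as a feature lets the same algorithm fit it.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
curved = LinearRegression().fit(X_poly, y)
r2_poly = curved.score(X_poly, y)

print(f"R^2 with raw feature:         {r2_linear:.3f}")
print(f"R^2 with polynomial features: {r2_poly:.3f}")
```

The algorithm is unchanged in both fits; only the inputs differ. Real data is noisier than this toy curve, so the improvement is rarely this clean.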

8. Feature Selection

If you have 1,000 features, using all of them can run into the "Curse of Dimensionality": training slows down, and the model is more likely to overfit. Feature Selection is the process of keeping only the most informative columns and dropping the rest. Scikit-learn provides tools like SelectKBest, which scores each feature against the target with a statistical test and keeps only the K highest-scoring features.
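
A minimal sketch of SelectKBest on Scikit-learn's bundled iris dataset follows. The choice of the ANOVA F-test (`f_classif`) and of k=2 is an assumption made for this example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Four numeric features, three flower classes.
X, y = load_iris(return_X_y=True, as_frame=True)

# Score every feature with an ANOVA F-test and keep the 2 best.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

# get_support() returns a boolean mask over the original columns.
kept = X.columns[selector.get_support()]
print(X.shape, "->", X_selected.shape)  # (150, 4) -> (150, 2)
print(list(kept))
```

The score function should match the problem: `f_classif` suits classification targets, while regression targets call for a different test such as `f_regression`.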

9. Common Mistakes

  • Using Label Encoding on Nominal Data: This is the most common beginner mistake. Do not encode "City" names as 1, 2, 3, 4. The model will treat the codes as quantities, so a linear model effectively assumes City 4 is 4 times greater than City 1. Use One-Hot Encoding for nominal data instead.
  • The Dummy Variable Trap: When One-Hot Encoding 3 categories (Red, Green, Blue), you only need 2 columns to represent them. If Red=0 and Green=0, the model knows it must be Blue. Including the 3rd column causes mathematical redundancy (multicollinearity), which mainly hurts unregularized linear models. Use drop='first' in those cases.
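
The redundancy behind the Dummy Variable Trap can be checked numerically. The sketch below uses a hypothetical helper, `rank_with_intercept`, which adds the column of ones that a linear model's intercept contributes and then computes the matrix rank.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Red"]})

full = pd.get_dummies(df, columns=["Color"], dtype=float)
reduced = pd.get_dummies(df, columns=["Color"], drop_first=True, dtype=float)

# With all 3 columns, every row sums to 1, duplicating the intercept column.
print(full.sum(axis=1).tolist())  # [1.0, 1.0, 1.0, 1.0]

def rank_with_intercept(frame):
    """Rank of the design matrix after adding a column of ones."""
    design = np.column_stack([np.ones(len(frame)), frame.to_numpy()])
    return np.linalg.matrix_rank(design)

# Rank below the column count means one column is redundant.
print(rank_with_intercept(full), "of", full.shape[1] + 1)        # 3 of 4
print(rank_with_intercept(reduced), "of", reduced.shape[1] + 1)  # 3 of 3
```

Dropping one category removes the redundant column without losing any information.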

10. Best Practices

  • Handle rare categories: If you are One-Hot Encoding a "City" column, and 50 cities only appear once in a 10,000-row dataset, your dataset will explode with 50 mostly-empty columns. Group these rare cities into an "Other" category before encoding.
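
One way to group rare categories in pandas is sketched below. The city data and the `min_count` threshold are invented for illustration; a sensible threshold depends on your dataset.

```python
import pandas as pd

# Hypothetical data: two common cities and three that appear once each.
cities = pd.Series(
    ["London"] * 5 + ["Tokyo"] * 4 + ["Oslo", "Lima", "Perth"], name="City"
)

min_count = 2  # illustrative threshold; tune it for your dataset
counts = cities.value_counts()
rare = counts[counts < min_count].index

# Replace every rare city with a single "Other" label, then encode.
grouped = cities.where(~cities.isin(rare), "Other")
encoded = pd.get_dummies(grouped, prefix="City")

print(encoded.columns.tolist())  # ['City_London', 'City_Other', 'City_Tokyo']
```

Five distinct cities collapse into three columns. Recent versions of Scikit-learn's OneHotEncoder offer a `min_frequency` parameter that applies similar grouping during encoding.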

11. Exercises

  1. You have a column named "CarMake" with values ["Toyota", "Ford", "Honda"]. Should you use Label Encoding or One-Hot Encoding? Why?
  2. Write a Pandas get_dummies call to encode a column named "Country".

12. MCQ Quiz with Answers

Question 1

Why must we encode text-based categorical data before training a Scikit-learn model?

Question 2

Which encoding technique creates a new binary column (containing 0s and 1s) for every unique category value?

13. Interview Questions

  • Q: Explain the Dummy Variable Trap and how to avoid it when using One-Hot Encoding.
  • Q: Give an example of a situation where you would engineer a new feature from existing data rather than relying on the raw data.

14. FAQs

Q: My dataset has a column with zip codes (e.g., 90210, 10001). They are numbers. Should I leave them alone? A: No! Zip codes are written with digits, but they represent *Nominal categories* (geographical areas), not mathematical values. Zip code 90210 is not mathematically "greater" than 10001. Treat them as text and One-Hot Encode them; because a zip code column usually has many distinct values, group rare codes into an "Other" category first.
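
A short sketch of that conversion, using a made-up three-row frame: casting the column to strings first means pandas treats each zip code as a category rather than a quantity.

```python
import pandas as pd

df = pd.DataFrame({"Zip": [90210, 10001, 90210]})  # hypothetical sample

# Stored as integers, so pandas would treat them as ordinary numbers.
print(df["Zip"].dtype)

# Cast to text first, then One-Hot Encode.
df["Zip"] = df["Zip"].astype(str)
encoded = pd.get_dummies(df, columns=["Zip"])

print(encoded.columns.tolist())  # ['Zip_10001', 'Zip_90210']
```

Reading zip codes as text from the start also preserves leading zeros, which are lost when the column is parsed as integers.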

15. Summary

Feature Engineering and Encoding transform raw, human-readable data into the numerical matrices that machine learning algorithms require. By applying Label Encoding to ranked data and One-Hot Encoding to unranked categories, and by engineering new features where domain knowledge suggests them, we can train models on diverse real-world datasets.

16. Next Chapter Recommendation

Our data is finally clean, scaled, and encoded into numbers. But before we feed it to an algorithm, we must solve a critical problem: How will we test the model? In Chapter 8: Train-Test Split and Cross Validation, we will learn how to divide our dataset so that we can catch the cardinal sin of ML: Overfitting.
