CHAPTER 07
Intermediate
Feature Engineering and Encoding
Updated: May 16, 2026
1. Introduction
Machine learning algorithms are strictly mathematical; they operate only on numbers. If your dataset contains a "City" column with values like "New York", "London", and "Tokyo", most models will raise an error when you try to train on it. We must first convert these text categories into numbers. Furthermore, sometimes the raw data isn't enough: creating new, more informative columns out of existing ones can dramatically improve model accuracy. This process is called Feature Engineering and Encoding.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Understand the difference between Categorical and Numerical data.
- Apply Label Encoding for ordinal categories.
- Apply One-Hot Encoding for nominal categories.
- Generate Polynomial Features to capture complex relationships.
- Understand basic Feature Selection.
3. Categorical Data Types
Before encoding text, you must identify its type:
- Ordinal Data: Categories with a built-in order or ranking. (e.g., T-shirt sizes: Small, Medium, Large. Large is clearly bigger than Small).
- Nominal Data: Categories with NO inherent order. (e.g., Colors: Red, Green, Blue. Blue is not "greater" than Red).
4. Label Encoding (For Ordinal Data)
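Label Encoding converts each category into a simple integer. *Small -> 0, Medium -> 1, Large -> 2.* Because algorithms see 2 > 0, they will treat Large as greater than Small. This only works if the integers follow the real order of the categories: an automatic alphabetical assignment (Large -> 0, Medium -> 1, Small -> 2) would scramble the ranking, so specify the order explicitly.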
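The mapping above can be sketched as follows. This is a minimal example, assuming pandas and scikit-learn are installed; the column name `size` and the toy data are hypothetical. It uses `OrdinalEncoder` with an explicit category order rather than `LabelEncoder`, because `LabelEncoder` assigns integers in sorted (alphabetical) order and is intended for target labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; the column name "size" is an assumption.
df = pd.DataFrame({"size": ["Small", "Large", "Medium", "Small"]})

# State the order explicitly so that Small < Medium < Large.
encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])

# fit_transform expects a 2D input and returns a 2D float array.
df["size_encoded"] = encoder.fit_transform(df[["size"]]).ravel()

print(df)
```

With this order, Small becomes 0.0, Medium 1.0, and Large 2.0.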
5. One-Hot Encoding (For Nominal Data)
If you use Label Encoding on nominal data (Red->0, Green->1, Blue->2), the algorithm will mistakenly assume Blue is mathematically greater than Red. One-Hot Encoding solves this by creating a new binary (0 or 1) column for *every* unique category.
Pandas Shortcut: Pandas has a built-in function, `pd.get_dummies`, that is convenient for quick data exploration.
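The Pandas route might look like this. It is a sketch only; the DataFrame and the column name `color` are hypothetical, and `dtype=int` is passed so the output shows 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy data; the column name "color" is an assumption.
df = pd.DataFrame({"color": ["Red", "Green", "Blue", "Green"]})

# One new column per category, named <column>_<category>.
dummies = pd.get_dummies(df, columns=["color"], dtype=int)

print(dummies)
```

Unlike a fitted scikit-learn encoder, `pd.get_dummies` does not remember the categories it saw, so new data can produce a different set of columns. That is why it suits exploration more than a training pipeline.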
6. Feature Engineering: Creating New Features
Feature engineering is the art of using domain knowledge to create new inputs. *Example:* You have a dataset for predicting whether someone will default on a loan, with columns for `Total_Debt` and `Total_Income`.
Instead of feeding only those raw numbers to the model, you create a new column: `Debt_to_Income_Ratio = Total_Debt / Total_Income`. This newly engineered feature will likely be a far better predictor than the raw numbers alone!
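A minimal sketch of the ratio feature in Pandas. The numbers are made up for illustration, and the guard against zero income is an added assumption, not something the loan example specifies.

```python
import numpy as np
import pandas as pd

# Made-up illustrative values; column names follow the example in the text.
loans = pd.DataFrame({
    "Total_Debt": [20000, 5000, 12000],
    "Total_Income": [50000, 40000, 0],
})

# Assumption: treat zero income as missing so the division yields NaN, not inf.
income = loans["Total_Income"].replace(0, np.nan)

loans["Debt_to_Income_Ratio"] = loans["Total_Debt"] / income

print(loans)
```

The row with zero income ends up with a missing ratio, which then has to be imputed or dropped before training.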
7. Polynomial Features
Sometimes, the relationship between features and the target is not a straight line (linear); it's curved. By creating Polynomial Features, you square or multiply existing features together to help simple algorithms capture complex patterns.
8. Feature Selection
If you have 1,000 features, using all of them can lead to "The Curse of Dimensionality": training slows down and the model becomes more likely to overfit. Feature Selection is the process of keeping only the most informative columns and dropping the rest. Scikit-learn provides tools like `SelectKBest`, which scores every feature with a statistical test and keeps only the top 'K' scoring features.
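A minimal sketch of `SelectKBest`. The Iris dataset, the ANOVA F-test (`f_classif`), and `k=2` are illustrative choices, not recommendations from this chapter.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Small built-in dataset with 4 numeric features and a class label.
X, y = load_iris(return_X_y=True)

# Score each feature against the target and keep the 2 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)
print(selector.get_support())  # boolean mask marking the kept features
```

The scoring function must match the task: `f_classif` is meant for classification targets, while regression problems need a different test such as `f_regression`.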
9. Common Mistakes
- Using Label Encoding on Nominal Data: This is the most common beginner mistake. Do not encode "City" names as 1, 2, 3, 4. The model will assume City 4 is mathematically 4 times greater than City 1. Always use One-Hot Encoding for nominal data.
- The Dummy Variable Trap: When One-Hot Encoding, if you have 3 categories (Red, Green, Blue), you only need 2 columns to represent them. If Red=0 and Green=0, the model knows it must be Blue. Including the 3rd column causes mathematical redundancy (multicollinearity), which mainly hurts linear models. To avoid it, use `drop='first'`.
10. Best Practices
- Handle rare categories: If you are One-Hot Encoding a "City" column, and 50 cities only appear once in a 10,000-row dataset, your dataset will explode with 50 mostly-empty columns. Group these rare cities into an "Other" category before encoding.
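Grouping rare categories before encoding could be sketched like this. The city values and the `min_count` threshold are hypothetical; a sensible threshold depends on the dataset.

```python
import pandas as pd

# Hypothetical toy data: two frequent cities and two that appear only once.
cities = pd.Series(
    ["London", "London", "Tokyo", "Tokyo", "Oslo", "Lima"], name="City"
)

min_count = 2  # hypothetical threshold for a category to keep its own column

# Find the categories that appear fewer than min_count times.
counts = cities.value_counts()
rare = counts[counts < min_count].index

# Keep frequent values as they are and replace rare ones with "Other".
grouped = cities.where(~cities.isin(rare), "Other")

dummies = pd.get_dummies(grouped, prefix="City", dtype=int)
print(dummies)
```

Four distinct cities collapse into three columns here; on a real dataset with many one-off values the saving is much larger.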
11. Exercises
- 1. You have a column named "CarMake" with values ["Toyota", "Ford", "Honda"]. Should you use Label Encoding or One-Hot Encoding? Why?
- 2. Use the Pandas `get_dummies` function to encode a column named "Country".
12. MCQ Quiz with Answers
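Question 1
Why must we encode text-based categorical data before training a Scikit-learn model?
Answer: Because the algorithms are mathematical and operate only on numbers, text categories must be converted to numeric values first.
Question 2
Which encoding technique creates a new binary column (containing 0s and 1s) for every unique category value?
Answer: One-Hot Encoding.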
13. Interview Questions
- Q: Explain the Dummy Variable Trap and how to avoid it when using One-Hot Encoding.
- Q: Give an example of a situation where you would engineer a new feature from existing data rather than relying on the raw data.