CHAPTER 06 Intermediate

Data Preprocessing and Cleaning

Updated: May 16, 2026

1. Introduction

In the real world, datasets are rarely perfect. Users skip form fields, sensors malfunction, and databases crash. If you feed a Scikit-learn model a dataset containing missing values (NaNs) or text instead of numbers, most estimators will raise an error during fitting. Extreme outliers are quieter: training runs without complaint, but the resulting model can be badly skewed. "Garbage In, Garbage Out" is the golden rule of Data Science. In this chapter, we will learn how to clean messy datasets using Pandas and prepare them for Machine Learning.

2. Learning Objectives

By the end of this chapter, you will be able to:

Identify and remove duplicate data.

Handle missing values using imputation strategies.

Detect and handle extreme outliers.

Understand the importance of Feature Scaling.

Use Scikit-learn's StandardScaler and MinMaxScaler.

3. Handling Duplicate Data

Duplicates can artificially inflate the importance of certain records. Removing them in Pandas is trivial.

python

import pandas as pd

df = pd.read_csv("messy_data.csv")

# Check how many duplicates exist
print(df.duplicated().sum())

# Remove them entirely
df = df.drop_duplicates()
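
Before dropping anything, it can help to look at the rows that would be removed. The sketch below uses a small made-up DataFrame (the column names and values are illustrative, since the contents of messy_data.csv are not shown in this chapter). It shows `duplicated(keep=False)`, which flags every copy of a repeated row, and the `subset` parameter for cases where only some columns define a duplicate.

```python
import pandas as pd

# Hypothetical toy data, not the real messy_data.csv
df_demo = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana", "Cara"],
    "city": ["Lima", "Oslo", "Lima", "Rome"],
    "age": [31, 45, 31, 28],
})

# keep=False marks every copy of a duplicated row, not just the later ones
dupes = df_demo[df_demo.duplicated(keep=False)]

# Treat rows as duplicates when they match on the selected columns only
deduped = df_demo.drop_duplicates(subset=["name", "city"], keep="first")

print(dupes)
print(deduped.shape)
```

Inspecting `dupes` first is a cheap sanity check: it shows whether the repeated rows are true copies or distinct records that happen to share values.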

4. Handling Missing Values (NaN)

When data is missing, Pandas represents it as NaN (Not a Number). Scikit-learn algorithms generally cannot process NaNs. You have two choices: Drop them or Fill them (Imputation).

A. Dropping Missing Data: If you have 10,000 rows and only 5 are missing an Age, just drop those 5 rows.

python

# Drops any row that contains at least one NaN
df_clean = df.dropna()
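
Note that a bare `dropna()` removes a row if any column is missing, not only Age. If the goal is to drop just the rows that lack an Age, the `subset` parameter restricts the check. A minimal sketch with made-up values (the City column is an assumption added for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative data: one missing Age, one missing City
people = pd.DataFrame({
    "Age": [22, np.nan, 35, 41],
    "City": ["Pune", "Delhi", None, "Agra"],
})

drop_any = people.dropna()                # removes rows 1 and 2
drop_age = people.dropna(subset=["Age"])  # removes row 1 only

print(people.shape, drop_any.shape, drop_age.shape)
```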

B. Imputation (Filling): If you have a small dataset, dropping rows loses valuable information. Instead, we fill the missing values with the Mean (average) or Median of that column.

python

# Pandas approach
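mean_age = df['Age'].mean()
# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about
df['Age'] = df['Age'].fillna(mean_age)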

Scikit-learn approach (SimpleImputer): Scikit-learn provides a built-in tool for this, which is essential when building automated pipelines later.

python

from sklearn.impute import SimpleImputer
import numpy as np

# Create imputer to replace NaN with the mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit and transform the Age column
df['Age'] = imputer.fit_transform(df[['Age']])
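
The choice between mean and median matters when a column contains extreme values. The sketch below, on made-up ages, fits one `SimpleImputer` per strategy and reads back the learned fill value from the `statistics_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up ages with one extreme value and one missing entry
ages = pd.DataFrame({"Age": [22.0, 25.0, 27.0, 95.0, np.nan]})

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

mean_filled = mean_imputer.fit_transform(ages)
median_filled = median_imputer.fit_transform(ages)

# statistics_ holds the fill value learned for each column during fit
print(mean_imputer.statistics_)    # mean of 22, 25, 27, 95
print(median_imputer.statistics_)  # median of 22, 25, 27, 95
```

Here the single value of 95 pulls the mean well above most of the observed ages, while the median stays close to them, so the median is often the safer default for skewed columns.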

5. Handling Outliers

An outlier is a data point that differs significantly from other observations (e.g., an age of 150 years). Outliers can severely distort models like Linear Regression.

Detection: You can use visualization (Boxplots) or statistical rules (Interquartile Range - IQR) to find them.

Handling: Once identified, you can drop the row or cap the value to a logical maximum.
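
The IQR rule and both handling options can be sketched as follows. The data is made up, and the 1.5 multiplier is a widely used convention rather than something this chapter prescribes, so treat the thresholds as an assumption to tune for your data.

```python
import pandas as pd

# Made-up ages; 150 is the implausible value mentioned above
ages = pd.Series([23, 25, 27, 29, 31, 33, 35, 150], name="Age")

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (a convention, not a hard rule)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = ages[(ages < lower) | (ages > upper)]

# Option 1: drop the flagged values
ages_dropped = ages[(ages >= lower) & (ages <= upper)]

# Option 2: cap values at the fences instead of dropping them
ages_capped = ages.clip(lower=lower, upper=upper)

print(outliers.tolist())
print(ages_capped.max())
```

Capping keeps the row (and its other columns) in the dataset, which matters when the dataset is small; dropping is simpler when the value is clearly a data-entry error.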

6. Feature Scaling (Normalization vs Standardization)

Consider two features: Age (ranges from 18 to 80) and Salary (ranges from 30,000 to 150,000). Some ML algorithms (like KNN or SVM) calculate the distance between points. Because Salary differences are thousands of times larger than Age differences, they dominate the distance calculation, and the algorithm effectively ignores Age. To fix this, we scale all features so they have a similar range.
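
To make the imbalance concrete, here is a small arithmetic sketch with two hypothetical people (the values are invented for illustration). It computes the Euclidean distance on the raw features and the share of the squared distance that each feature contributes.

```python
import numpy as np

# Two hypothetical people as (Age, Salary)
a = np.array([20.0, 50000.0])
b = np.array([60.0, 51000.0])

diff = a - b
distance = np.sqrt(np.sum(diff ** 2))

# Share of the squared distance contributed by each feature
share = diff ** 2 / np.sum(diff ** 2)

print(distance)
print(share)
```

A 40-year age gap contributes well under 1% of the squared distance, while a 1,000 salary gap contributes the rest, even though the age difference is arguably the more meaningful one.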

A. Normalization (Min-Max Scaling): Rescales each feature to the range 0 to 1, so the column's minimum becomes 0 and its maximum becomes 1.

python

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit calculates the min/max, transform applies the math
scaled_data = scaler.fit_transform(df[['Age', 'Salary']])
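
The "math" that transform applies is, per column, (x - min) / (max - min). The following sketch checks that by hand on a tiny made-up DataFrame; the values are assumptions chosen to match the Age and Salary ranges described earlier.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up rows spanning the ranges used in the text
demo = pd.DataFrame({"Age": [18, 40, 80], "Salary": [30000, 90000, 150000]})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(demo)

# Same calculation by hand: (x - min) / (max - min), column by column
manual = (demo - demo.min()) / (demo.max() - demo.min())

print(scaled)
print(np.allclose(scaled, manual.to_numpy()))
```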

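B. Standardization: Rescales each feature to a mean of 0 and a standard deviation of 1. It is less sensitive to outliers than Min-Max scaling, because a single extreme value does not squeeze every other value into a narrow band, although outliers still shift the mean and standard deviation.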

python

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
standardized_data = scaler.fit_transform(df[['Age', 'Salary']])
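
Standardization computes z = (x - mean) / std for each column. A minimal check on made-up data is sketched below. One detail worth knowing: StandardScaler uses the population standard deviation (ddof=0), whereas the Pandas `.std()` method defaults to ddof=1, so the manual version has to set `ddof=0` to match.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows spanning the ranges used in the text
demo = pd.DataFrame({"Age": [18, 40, 80], "Salary": [30000, 90000, 150000]})

scaler = StandardScaler()
standardized = scaler.fit_transform(demo)

# By hand: z = (x - mean) / std, using the population std (ddof=0)
manual = (demo - demo.mean()) / demo.std(ddof=0)

print(standardized.mean(axis=0))  # approximately 0 for each column
print(standardized.std(axis=0))   # approximately 1 for each column
```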

7. Mini Project: Clean Messy Dataset

Let's tie it all together in a quick cleaning script.

python

import numpy as np  # needed for np.nan below
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# 1. Load Data
data = {"Age": [25, 30, np.nan, 40, 25], "Salary": [50000, 60000, 70000, np.nan, 50000]}
df = pd.DataFrame(data)

# 2. Drop Duplicates
df = df.drop_duplicates()

# 3. Impute Missing Values (Mean)
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# 4. Standardize Features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_imputed), columns=df.columns)

print(df_scaled)

8. Common Mistakes

Scaling the Target Variable: You should generally scale only the *Features* (X), not the *Target/Label* (y). If you scale the target house price from $500,000 to 0.5, your model will predict on that scaled range (0.5), and you will have to convert every prediction back to dollars.

Data Leakage in Scaling: We will cover this in Chapter 8, but you must fit your Scaler ONLY on the Training data, not the Testing data.
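
As a preview of the leakage rule, the pattern looks roughly like this. The data here is synthetic and the split parameters are arbitrary assumptions; the point is only that `fit_transform` is called on the training rows and plain `transform` on the testing rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and labels for illustration only
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 2))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows
X_test_scaled = scaler.transform(X_test)        # reuse them; no refitting on test rows
```

Because the scaler never sees the testing rows during fit, the test set stays a fair stand-in for unseen data.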

9. Best Practices

Document your cleaning steps: The exact cleaning steps applied to your training data must be applied to new data when the model is in production. Using Scikit-learn Transformers (SimpleImputer, StandardScaler) makes this process repeatable.
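
One way to make the steps repeatable is to chain the transformers in a Scikit-learn `Pipeline`. The sketch below is a minimal example under assumed data: the training frame reuses values from the mini project, and `new_rows` is a hypothetical record arriving after training.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Training data, plus a hypothetical new record arriving later
train = pd.DataFrame({
    "Age": [25, 30, np.nan, 40],
    "Salary": [50000, 60000, 70000, np.nan],
})
new_rows = pd.DataFrame({"Age": [np.nan], "Salary": [65000]})

cleaner = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

cleaner.fit(train)                       # learn fill values and scaling from training data
new_clean = cleaner.transform(new_rows)  # apply the identical steps to new data

print(new_clean)
```

The new record's missing Age is filled with the training mean and then scaled with the training statistics, so production data is treated the same way as the data the model learned from.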

10. Exercises

1. Use Pandas to create a DataFrame with missing values. Try using both dropna() and fillna() and observe how the shape of the DataFrame changes.

2. Why does the StandardScaler help machine learning algorithms perform better?

11. MCQ Quiz with Answers

Question 1

What is the process of replacing missing data values (NaN) with substituted values like the mean or median called?

Answer: Imputation.

Question 2

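If you have two features: "Number of Bedrooms" (1-5) and "House Price" ($100k-$1M), what must you do before feeding them into a distance-based algorithm like KNN?

Answer: Apply Feature Scaling (for example, StandardScaler or MinMaxScaler) so that House Price does not dominate the distance calculation.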

12. Interview Questions

Q: Explain the difference between Normalization (MinMaxScaler) and Standardization (StandardScaler).

Q: How do you handle missing data in a dataset if the missing column is critical but 40% of its values are NaN?

13. FAQs

Q: Do all ML algorithms require Feature Scaling? A: No! Tree-based algorithms like Decision Trees and Random Forests split on one feature at a time, so they are largely unaffected by the scale of the features. However, algorithms like SVM, KNN, and Neural Networks are sensitive to scale and usually perform noticeably worse without it.

14. Summary

Cleaning data is the unglamorous but vital reality of machine learning. By utilizing Pandas to drop duplicates and Scikit-learn's preprocessing modules (SimpleImputer, StandardScaler) to handle missing values and scale numeric ranges, we ensure our data is mathematically sound before it ever touches a predictive algorithm.

15. Next Chapter Recommendation

We have cleaned our numeric data, but what do we do when our dataset contains text columns like "Male/Female" or "City Names"? Algorithms cannot do math on words! In Chapter 7: Feature Engineering and Encoding, we will learn how to translate text categories into numbers.
