Scikit-learn Basics
CHAPTER 06 Intermediate

Data Preprocessing and Cleaning

Updated: May 16, 2026

1. Introduction

In the real world, datasets are rarely perfect. Users skip form fields, sensors malfunction, and databases crash. If you feed a Scikit-learn model a dataset containing missing values (NaNs) or text instead of numbers, most estimators will raise an error. Extreme outliers are more subtle: the model trains without complaint, but its results can be badly distorted. The phrase "Garbage In, Garbage Out" is the golden rule of Data Science. In this chapter, we will learn how to clean messy datasets using Pandas and Scikit-learn and prepare them for Machine Learning.

2. Learning Objectives

By the end of this chapter, you will be able to:
  • Identify and remove duplicate data.
  • Handle missing values using imputation strategies.
  • Detect and handle extreme outliers.
  • Understand the importance of Feature Scaling.
  • Use Scikit-learn's StandardScaler and MinMaxScaler.

3. Handling Duplicate Data

Duplicates can artificially inflate the importance of certain records. Removing them in Pandas is trivial.
python
import pandas as pd

df = pd.read_csv("messy_data.csv")

# Check how many duplicates exist
print(df.duplicated().sum())

# Remove them entirely
df = df.drop_duplicates()
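
By default, both calls above compare every column. As a minimal sketch on made-up data (the `people` DataFrame and its columns are assumptions, not part of messy_data.csv), the `subset` and `keep` parameters of `drop_duplicates` let you decide which columns define a duplicate and which copy survives:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 3 are exact duplicates,
# rows 1 and 2 share an email but differ in city.
people = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "b@x.com", "a@x.com"],
    "city": ["Pune", "Delhi", "Mumbai", "Pune"],
})

# Exact duplicates: every column must match an earlier row
exact_dupes = people.duplicated().sum()

# Treat rows with the same email as duplicates and keep the first one seen
unique_by_email = people.drop_duplicates(subset=["email"], keep="first")

print(exact_dupes)           # 1
print(len(unique_by_email))  # 2
```

Whether "same email" really means "same record" is a judgment about your data, so check a few matches by hand before dropping on a subset.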

4. Handling Missing Values (NaN)

When data is missing, Pandas represents it as NaN (Not a Number). Scikit-learn algorithms generally cannot process NaNs. You have two choices: Drop them or Fill them (Imputation).

A. Dropping Missing Data: If you have 10,000 rows and only 5 are missing an Age, just drop those 5 rows.

python
# Drops any row that contains at least one NaN
df_clean = df.dropna()
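
Note that `dropna()` with no arguments removes a row if any column is missing. If only a missing Age should disqualify a row, the `subset` parameter narrows the check. A small sketch on made-up data (the `toy` DataFrame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is missing Age, row 2 is missing Salary
toy = pd.DataFrame({
    "Age": [25, np.nan, 40, 31],
    "Salary": [50000, 60000, np.nan, 52000],
})

dropped_any = toy.dropna()                # drops rows 1 and 2
dropped_age = toy.dropna(subset=["Age"])  # drops only row 1

print(toy.shape, dropped_any.shape, dropped_age.shape)  # (4, 2) (2, 2) (3, 2)
```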

B. Imputation (Filling): If you have a small dataset, dropping rows loses valuable information. Instead, we fill the missing values with the Mean (average) or Median of that column.

python
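# Pandas approach
mean_age = df['Age'].mean()
# Assign the result back: fillna(..., inplace=True) on a selected column
# may not update df in recent pandas versions
df['Age'] = df['Age'].fillna(mean_age)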

Scikit-learn approach (SimpleImputer): Scikit-learn provides a built-in tool for this, which is essential when building automated pipelines later.

python
from sklearn.impute import SimpleImputer
import numpy as np

# Create imputer to replace NaN with the mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit and transform the Age column
# (fit_transform returns a 2D array, so flatten it back to one column)
df['Age'] = imputer.fit_transform(df[['Age']]).ravel()
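
The choice between mean and median matters when a column contains an extreme value. The sketch below uses made-up ages (an assumption for illustration) and inspects `statistics_`, the attribute where a fitted SimpleImputer stores the fill value it learned for each column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical ages with one missing value and one extreme value (90)
ages = pd.DataFrame({"Age": [22, 25, 27, np.nan, 90]})

mean_imputer = SimpleImputer(strategy="mean").fit(ages)
median_imputer = SimpleImputer(strategy="median").fit(ages)

# The fill value each imputer would use for the missing Age
print(mean_imputer.statistics_)    # [41.]
print(median_imputer.statistics_)  # [26.]
```

Here the single age of 90 pulls the mean well above most of the observed values, while the median stays close to them.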

5. Handling Outliers

An outlier is a data point that differs significantly from other observations (e.g., an age of 150 years). Outliers can severely distort models like Linear Regression.
  • Detection: You can use visualization (Boxplots) or statistical rules (Interquartile Range - IQR) to find them.
  • Handling: Once identified, you can drop the row or cap the value to a logical maximum.
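
The detection and handling steps above can be sketched with the IQR rule. The 1.5 multiplier is the common convention rather than something this chapter prescribes, and the ages are made up for illustration:

```python
import pandas as pd

# Hypothetical ages containing one implausible value
ages = pd.Series([22, 25, 27, 30, 31, 35, 38, 150], name="Age")

# Interquartile range and the conventional 1.5 * IQR fences
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Detection: anything outside the fences is flagged
is_outlier = (ages < lower) | (ages > upper)
outliers = ages[is_outlier]

# Handling, option 1: drop the flagged values
dropped = ages[~is_outlier]

# Handling, option 2: cap values at the fences
capped = ages.clip(lower=lower, upper=upper)

print(outliers.tolist())  # [150]
```

Capping at the IQR fence is only one choice. A maximum based on domain knowledge (for example, a realistic upper bound on human age) can be passed to `clip` instead.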

6. Feature Scaling (Normalization vs Standardization)

Consider two features: Age (ranges from 18 to 80) and Salary (ranges from 30,000 to 150,000). Some ML algorithms (like KNN or SVM) calculate the distance between points. Because Salary values are more than a thousand times larger than Age values, differences in Salary dominate the distance calculation and Age is effectively ignored. To fix this, we scale all features so they have a similar range.
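
A quick way to see the problem is to compute Euclidean distances on unscaled data. The three people below are made up for illustration:

```python
import numpy as np

# Hypothetical people as [Age, Salary]
a = np.array([25, 50000])
b = np.array([60, 51000])  # 35 years older, similar salary
c = np.array([26, 58000])  # almost the same age, different salary

dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)

print(round(dist_ab, 1), round(dist_ac, 1))
```

On raw values, `a` is far "closer" to `b` than to `c`, and the 35-year age gap adds less than one unit to a distance of about 1,000.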

A. Normalization (Min-Max Scaling): Rescales each feature so that its minimum maps to 0 and its maximum maps to 1 (the default range in MinMaxScaler).

python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit calculates the min/max, transform applies the math
scaled_data = scaler.fit_transform(df[['Age', 'Salary']])

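B. Standardization: Rescales each feature to a mean of 0 and a standard deviation of 1. It is usually less distorted by outliers than Min-Max scaling, because a single extreme value does not fix the endpoints of the range, although the mean and standard deviation are still affected by it.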

python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
standardized_data = scaler.fit_transform(df[['Age', 'Salary']])
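
To compare how the two scalers react to an extreme value, the sketch below runs both on a made-up salary column containing one very large entry (the numbers are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical salaries with one extreme value
salaries = np.array([[30000], [40000], [50000], [60000], [1000000]])

# Min-Max: (x - min) / (max - min)
minmax = MinMaxScaler().fit_transform(salaries).ravel()

# Standardization: (x - mean) / std
standard = StandardScaler().fit_transform(salaries).ravel()

print(np.round(minmax, 3))
print(np.round(standard, 3))
```

In both outputs the four ordinary salaries end up bunched close together, which is a reminder that scaling does not replace outlier handling.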

7. Mini Project: Clean Messy Dataset

Let's tie it all together in a quick cleaning script.
python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# 1. Load Data
data = {"Age": [25, 30, np.nan, 40, 25], "Salary": [50000, 60000, 70000, np.nan, 50000]}
df = pd.DataFrame(data)

# 2. Drop Duplicates
df = df.drop_duplicates()

# 3. Impute Missing Values (Mean)
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# 4. Standardize Features
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_imputed), columns=df.columns)

print(df_scaled)

8. Common Mistakes

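  • Scaling the Target Variable: You should generally only scale the *Features* (X), not the *Target/Label* (y). If you scale the target house price from $500,000 to 0.5, your model will predict values like 0.5, and you must convert them back to dollars (for example with the scaler's inverse_transform) before they mean anything.
  • Data Leakage in Scaling: We will cover this in Chapter 8, but you must fit your Scaler ONLY on the Training data, then use that fitted Scaler to transform both the Training and the Testing data. Fitting on the Testing data leaks information about it into the model.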
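
The leakage rule above can be sketched as follows. The feature matrix and labels are randomly generated placeholders, and the 80/20 split is an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data: 100 rows, 2 numeric features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 2))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics, no refitting

print(X_train_scaled.shape, X_test_scaled.shape)  # (80, 2) (20, 2)
```

The test set is never passed to `fit` or `fit_transform`, so nothing about it influences the statistics used for scaling.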

9. Best Practices

  • Document your cleaning steps: The exact cleaning steps applied to your training data must be applied to new data when the model is in production. Using Scikit-learn Transformers (SimpleImputer, StandardScaler) makes this process repeatable.
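
One way to make the steps repeatable is to chain the transformers in a Scikit-learn Pipeline. This is a minimal sketch, assuming a small made-up training set and a couple of hypothetical new rows arriving later:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data and new data seen after training
train = pd.DataFrame({
    "Age": [25, 30, np.nan, 40],
    "Salary": [50000, 60000, 70000, np.nan],
})
new_rows = pd.DataFrame({"Age": [np.nan, 35], "Salary": [65000, np.nan]})

# Imputation followed by standardization, applied in this order
preprocess = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

train_ready = preprocess.fit_transform(train)  # learns means and scaling from train
new_ready = preprocess.transform(new_rows)     # applies the same learned values

print(new_ready)
```

Because the new rows go through `transform` only, their missing values are filled with the training means and scaled with the training statistics.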

10. Exercises

  1. Use Pandas to create a DataFrame with missing values. Try using both dropna() and fillna() and observe how the shape of the DataFrame changes.
  2. Why does the StandardScaler help machine learning algorithms perform better?

11. MCQ Quiz with Answers

Question 1

What is the process of replacing missing data values (NaN) with substituted values like the mean or median called?

Question 2

If you have two features: "Number of Bedrooms" (1-5) and "House Price" ($100k-$1M), what must you do before feeding them into a distance-based algorithm like KNN?

12. Interview Questions

  • Q: Explain the difference between Normalization (MinMaxScaler) and Standardization (StandardScaler).
  • Q: How do you handle missing data in a dataset if the missing column is critical but 40% of its values are NaN?

13. FAQs

Q: Do all ML algorithms require Feature Scaling? A: No! Tree-based algorithms like Decision Trees and Random Forests split on one feature at a time, so they do not care about the scale of the features. However, algorithms like SVM, KNN, and Neural Networks are sensitive to scale and usually perform much worse without it.
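
The claim about trees can be checked with a small experiment. The data below is randomly generated and the label rule (Salary above 90,000) is an assumption for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def make_features(n):
    # Hypothetical [Age, Salary] rows drawn uniformly from the chapter's ranges
    return np.column_stack([
        rng.uniform(18, 80, size=n),
        rng.uniform(30000, 150000, size=n),
    ])

X = make_features(200)
y = (X[:, 1] > 90000).astype(int)  # made-up label rule
X_new = make_features(50)

scaler = StandardScaler().fit(X)

# Same tree settings, once on raw features and once on standardized features
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X), y)

same_predictions = np.array_equal(
    tree_raw.predict(X_new),
    tree_scaled.predict(scaler.transform(X_new)),
)
print(same_predictions)
```

Standardization preserves the ordering of values within each feature, so the tree finds equivalent split points either way.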

14. Summary

Cleaning data is the unglamorous but vital reality of machine learning. By utilizing Pandas to drop duplicates and Scikit-learn's preprocessing modules (SimpleImputer, StandardScaler) to handle missing values and scale numeric ranges, we ensure our data is mathematically sound before it ever touches a predictive algorithm.

15. Next Chapter Recommendation

We have cleaned our numeric data, but what do we do when our dataset contains text columns like "Male/Female" or "City Names"? Algorithms cannot do math on words! In Chapter 7: Feature Engineering and Encoding, we will learn how to translate text categories into numbers.
