# CHAPTER 6
Data Preprocessing and Cleaning
1. Introduction
In the real world, datasets are rarely perfect. Users skip form fields, sensors malfunction, and databases crash. If you feed a Scikit-learn model a dataset containing missing values (NaNs) or text instead of numbers, most algorithms will raise an error. Extreme outliers usually raise no error at all, but they can quietly distort the results. The phrase "Garbage In, Garbage Out" is the golden rule of Data Science. In this chapter, we will learn how to clean messy datasets using Pandas and prepare them for Machine Learning.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Identify and remove duplicate data.
- Handle missing values using imputation strategies.
- Detect and handle extreme outliers.
- Understand the importance of Feature Scaling.
- Use Scikit-learn's `StandardScaler` and `MinMaxScaler`.
3. Handling Duplicate Data
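As a minimal sketch of duplicate handling, assuming a small made-up DataFrame (the column names and values are illustrative only), `duplicated()` flags repeated rows and `drop_duplicates()` removes them:

```python
import pandas as pd

# Hypothetical records; the second and third rows are exact copies.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cara"],
    "age": [34, 29, 29, 41],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped)
```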
Duplicates can artificially inflate the importance of certain records. In Pandas, `drop_duplicates()` removes them in a single call.
4. Handling Missing Values (NaN)
When data is missing, Pandas represents it as NaN (Not a Number). Scikit-learn algorithms generally cannot process NaNs. You have two choices: Drop them or Fill them (Imputation).
A. Dropping Missing Data: If you have 10,000 rows and only 5 are missing an Age, just drop those 5 rows.
B. Imputation (Filling): If you have a small dataset, dropping rows loses valuable information. Instead, we fill the missing values with the Mean (average) or Median of that column.
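Both options can be sketched in Pandas as follows. The DataFrame, its column names, and the choice of the median as the fill value are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Lagos", "Pune", "Lima", "Oslo"],
})

# Option A: drop every row whose age is missing.
dropped = df.dropna(subset=["age"])

# Option B: fill missing ages with the column median.
# median() skips NaN by default, so this is the median of 25, 31 and 40.
age_median = df["age"].median()
filled = df.assign(age=df["age"].fillna(age_median))

print(dropped.shape)  # one row fewer than the original
print(filled.shape)   # same number of rows as the original
```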
Scikit-learn approach (SimpleImputer): Scikit-learn provides a built-in tool for this, which is essential when building automated pipelines later.
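A minimal sketch of `SimpleImputer`, assuming a small made-up array with `[age, salary]` columns and a median strategy (`strategy="mean"` is the library default):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with columns [age, salary]; NaN marks a missing value.
X = np.array([
    [25.0, 50_000.0],
    [np.nan, 62_000.0],
    [40.0, np.nan],
    [31.0, 58_000.0],
])

# fit() learns one fill value per column; transform() applies it.
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)

print(imputer.statistics_)  # the per-column medians learned during fit
print(X_filled)
```

Because the learned medians are stored on the fitted object, the same `imputer.transform(...)` call can later be applied to new rows without recomputing them.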
5. Handling Outliers
An outlier is a data point that differs significantly from other observations (e.g., an age of 150 years). Outliers can severely distort models like Linear Regression.
- Detection: You can use visualization (Boxplots) or statistical rules (Interquartile Range - IQR) to find them.
- Handling: Once identified, you can drop the row or cap the value to a logical maximum.
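The detection and handling steps above can be sketched with the IQR rule. The 1.5 × IQR fences used here are a common convention rather than something the chapter prescribes, and the ages are made up for illustration:

```python
import pandas as pd

# Hypothetical ages; 150 is the obvious outlier.
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 150], dtype="float64", name="age")

# Interquartile range: the spread of the middle 50% of the data.
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Conventional fences: anything beyond 1.5 * IQR from the quartiles is flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (ages < lower) | (ages > upper)

# Handling option 1: drop the flagged rows.
without_outliers = ages[~is_outlier]

# Handling option 2: cap values at the fences instead of dropping them.
capped = ages.clip(lower=lower, upper=upper)

print(ages[is_outlier])
print(capped.max())
```

In practice the cap can also be a domain limit chosen by hand (such as a maximum plausible age) instead of the computed fence.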
6. Feature Scaling (Normalization vs Standardization)
Consider two features: Age (ranges from 18 to 80) and Salary (ranges from 30,000 to 150,000).
Some ML algorithms (like KNN or SVM) calculate the distance between points. Because Salary values are thousands of times larger than Age values, differences in Salary dominate the distance, and the algorithm effectively treats Salary as far more important than Age!
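To make the imbalance concrete, here is a rough sketch with three made-up people described as `[age, salary]`. The numbers are assumptions chosen only to show the effect on a Euclidean distance:

```python
import numpy as np

# Hypothetical people as [age, salary].
a = np.array([25.0, 50_000.0])
b = np.array([65.0, 50_500.0])  # 40 years older, salary almost identical
c = np.array([25.0, 52_000.0])  # same age, salary 2,000 higher

# Euclidean distance on the raw, unscaled values.
dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)

# A 40-year age gap counts for less than a 2,000 salary gap.
print(dist_ab < dist_ac)
```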
To fix this, we scale all features so they have a similar range.
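A. Normalization (Min-Max Scaling): Rescales each feature so its values fall between 0 and 1, with the minimum mapped to 0 and the maximum mapped to 1.
B. Standardization: Centers the data around a mean of 0, with a standard deviation of 1. It is usually less sensitive to outliers than Min-Max scaling, because a single extreme value does not fix the endpoints of the range.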
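Both scalers can be sketched on a small made-up `[age, salary]` matrix (the values are illustrative, picked to match the ranges mentioned above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature matrix with columns [age, salary].
X = np.array([
    [18.0, 30_000.0],
    [35.0, 60_000.0],
    [52.0, 90_000.0],
    [80.0, 150_000.0],
])

# Normalization: x' = (x - min) / (max - min), computed per column.
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: z = (x - mean) / std, computed per column.
X_standard = StandardScaler().fit_transform(X)

print(X_minmax.min(axis=0), X_minmax.max(axis=0))
print(X_standard.mean(axis=0), X_standard.std(axis=0))
```

After either transformation, Age and Salary live on comparable scales, so neither dominates a distance calculation simply because of its units.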
7. Mini Project: Clean Messy Dataset
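A minimal sketch of such a cleaning script, assuming a tiny made-up table with `age` and `salary` columns and an assumed age cap of 100 (all values are illustrative). For brevity it fits the imputer and scaler on the whole table. With a real train/test split they would be fitted on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical messy data: one duplicate row, two missing values, one impossible age.
raw = pd.DataFrame({
    "age":    [25.0, 32.0, 32.0, np.nan, 41.0, 150.0, 29.0],
    "salary": [48_000.0, 61_000.0, 61_000.0, 55_000.0, np.nan, 72_000.0, 50_000.0],
})
cols = ["age", "salary"]

# Step 1: remove exact duplicate rows.
df = raw.drop_duplicates().reset_index(drop=True)

# Step 2: fill missing values with each column's median.
imputer = SimpleImputer(strategy="median")
df[cols] = imputer.fit_transform(df[cols])

# Step 3: cap ages at an assumed logical maximum of 100.
df["age"] = df["age"].clip(upper=100)

# Step 4: standardize the features (mean 0, standard deviation 1 per column).
scaler = StandardScaler()
X_clean = scaler.fit_transform(df[cols])

print(df)
print(X_clean.round(2))
```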
A quick cleaning script ties everything together: remove duplicates, fill missing values, cap outliers, and scale the features.
8. Common Mistakes
- Scaling the Target Variable: You should generally only scale the *Features* (X), not the *Target/Label* (y). If you scale the target house price from $500,000 to 0.5, your model will predict 0.5, and you will have to reverse the scaling to get a price back!
- Data Leakage in Scaling: We will cover this in Chapter 8, but you must fit your Scaler ONLY on the Training data, then use that fitted Scaler to transform both the Training and Testing data.
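The leakage point above can be sketched as follows, assuming randomly generated stand-in data and an 80/20 split (both are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical features and labels; the labels are deliberately left unscaled.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
# fit_transform: learn the mean and standard deviation from the training rows only.
X_train_scaled = scaler.fit_transform(X_train)
# transform: reuse those training statistics on the test rows, with no refitting.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)
print(X_test_scaled.mean(axis=0))  # close to 0, but not exactly 0
```

Calling `fit` on the full dataset, or a second time on the test rows, would let information from the test set influence the preprocessing.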
9. Best Practices
- Document your cleaning steps: The exact cleaning steps applied to your training data must be applied to new data when the model is in production. Using Scikit-learn Transformers (`SimpleImputer`, `StandardScaler`) makes this process repeatable.
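One way to make the steps repeatable is to chain the transformers in a Scikit-learn `Pipeline`. This is a sketch under assumed data: the step names `"impute"` and `"scale"` and the sample values are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The cleaning steps, recorded once in a fixed order.
preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Hypothetical training data with columns [age, salary].
X_train = np.array([
    [25.0, 50_000.0],
    [np.nan, 62_000.0],
    [40.0, 58_000.0],
    [31.0, np.nan],
])
preprocess.fit(X_train)

# A new record arriving in production gets exactly the same treatment,
# using the medians, means and standard deviations learned from training.
X_new = np.array([[np.nan, 55_000.0]])
X_new_clean = preprocess.transform(X_new)

print(X_new_clean)
```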
10. Exercises
1. Use Pandas to create a DataFrame with missing values. Try using both `dropna()` and `fillna()` and observe how the shape of the DataFrame changes.
2. Why does the `StandardScaler` help machine learning algorithms perform better?
11. MCQ Quiz with Answers
- Q: What is the process of replacing missing data values (NaN) with substituted values like the mean or median called? A: Imputation.
- Q: If you have two features: "Number of Bedrooms" (1-5) and "House Price" ($100k-$1M), what must you do before feeding them into a distance-based algorithm like KNN? A: Scale the features (for example, with `StandardScaler` or `MinMaxScaler`) so they share a similar range.
12. Interview Questions
- Q: Explain the difference between Normalization (MinMaxScaler) and Standardization (StandardScaler).
- Q: How do you handle missing data in a dataset if the missing column is critical but 40% of its values are NaN?
13. FAQs
Q: Do all ML algorithms require Feature Scaling? A: No! Tree-based algorithms like Decision Trees and Random Forests do not care about the scale of the features. However, algorithms like SVM, KNN, and Neural Networks are sensitive to scale and usually need it to perform well.
14. Summary
Cleaning data is the unglamorous but vital reality of machine learning. By utilizing Pandas to drop duplicates and Scikit-learn's preprocessing modules (SimpleImputer, StandardScaler) to handle missing values and scale numeric ranges, we ensure our data is mathematically sound before it ever touches a predictive algorithm.