CHAPTER 14
Intermediate
Clustering with K-Means
Updated: May 16, 2026
1. Introduction
Every algorithm we have learned so far (Regression, SVM, Trees) has been a form of Supervised Learning: we provided the data (X) AND the correct answers (y). But what happens when you don't have the answers? Imagine you have 10,000 customer records (Age and Income), and the marketing team asks you to "group them into distinct buyer profiles." You don't know what the profiles are yet. This is Unsupervised Learning. In this chapter, we will use K-Means, one of the most popular unsupervised algorithms, to automatically discover hidden structure in data.
2. Learning Objectives
By the end of this chapter, you will be able to:
- Define Unsupervised Learning and Clustering.
- Explain how the K-Means algorithm mathematically finds groups.
- Implement KMeans clustering in Scikit-learn.
- Determine the optimal number of clusters using the "Elbow Method".
- Build a customer segmentation model.
3. What is Clustering?
Clustering is the task of dividing a dataset into groups so that data points in the same group (a cluster) are more similar to each other than to data points in other groups.
4. How K-Means Works
The "K" in K-Means stands for the number of clusters you want the algorithm to find. If you set K=3:
- 1. Initialization: The algorithm places 3 random "Centroids" (center points) in the feature space.
- 2. Assignment: It calculates the distance from every data point to each of the 3 centroids. Each data point is assigned to the cluster of its closest centroid.
- 3. Update: For each cluster, it calculates the mean of all the points assigned to that cluster and moves the cluster's centroid to that mean.
- 4. Repeat: It repeats steps 2 and 3 until the centroids stop moving (or a maximum number of iterations is reached). The clusters are now locked in!
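The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not Scikit-learn's implementation: the function name `kmeans`, its parameters, and the toy data are hypothetical, and the sketch assumes the starting centroids are picked from the data points themselves (one common way to do the "random" initialization).

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)

    def assign(centroids):
        # Distance from every point to every centroid, then pick the nearest.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        return distances.argmin(axis=1)

    # 1. Initialization: pick k distinct data points as starting centroids (assumption).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: each point joins the cluster of its closest centroid.
        labels = assign(centroids)
        # 3. Update: move each centroid to the mean of its points
        #    (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids stop moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign(centroids)

# Toy data: three blobs of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans(X, k=3)
print(centroids.round(2))
```

Because the result depends on the random starting centroids, a single run like this can settle on a poor grouping; library implementations typically guard against that with smarter initialization or repeated runs.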
5. Mini Project: Customer Segmentation
Let's cluster a mock dataset of customers based on their Annual Income and Spending Score. Notice that there is no y (target) variable here!
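A minimal sketch of such a segmentation could look like the following. The mock data is invented for illustration (three synthetic groups, with income in thousands of dollars and spending score on a 1 to 100 scale), and the variable names are assumptions rather than the chapter's original listing.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical mock data: 3 customer groups,
# columns = [annual income (k$), spending score (1-100)].
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(25, 20), scale=5, size=(50, 2)),    # low income, low spending
    rng.normal(loc=(60, 50), scale=5, size=(50, 2)),    # mid income, mid spending
    rng.normal(loc=(100, 85), scale=5, size=(50, 2)),   # high income, high spending
])

# Put both features on the same scale before measuring distances.
X_scaled = StandardScaler().fit_transform(X)

# No y is passed anywhere: the model only ever sees X.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

print(clusters[:10])             # cluster label of the first 10 customers
print(kmeans.cluster_centers_)   # centroids, in scaled units
```

The cluster numbers (0, 1, 2) are arbitrary identifiers. It is up to you to inspect each group and give it a business meaning such as "budget shoppers" or "premium customers".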
6. The Elbow Method (Finding the right K)
In the example above, I chose K=3 because I knew the mock data had 3 groups. In reality, you don't know how many groups exist. How do you find the best K?
We use the Elbow Method.
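- 1. We run K-Means multiple times, from K=1 to K=10.
- 2. For each run, we record the Inertia: the sum of squared distances from each point to its assigned centroid. It measures how tightly packed the clusters are. Lower means tighter, but inertia generally keeps shrinking as K grows, so the lowest value alone does not identify the best K.
- 3. We plot the inertia values against K. The line will look like an arm. The "Elbow" (where the line stops dropping sharply and flattens out) is a good choice for K.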
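The procedure can be sketched as follows. The data here is a synthetic stand-in generated with `make_blobs` (an assumption, not the chapter's dataset), and the output file name `elbow.png` is arbitrary.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical stand-in data with 3 well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Steps 1 and 2: fit K-Means for K = 1..10 and record the inertia of each run.
k_values = list(range(1, 11))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"K={k:2d}  inertia={inertia:10.1f}")

# Step 3: plot inertia against K and look for the bend.
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(k_values, inertias, marker="o")
ax.set_xlabel("Number of clusters (K)")
ax.set_ylabel("Inertia")
ax.set_title("Elbow Method")
fig.savefig("elbow.png", dpi=150)  # or plt.show() in an interactive session
```

On data like this, the curve should drop steeply up to K=3 and flatten afterwards. On real data the bend is often less sharp, so treat the elbow as a guide rather than a definitive answer.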
7. Visualizing Clusters
If your data has only 2 or 3 features, you can plot the results directly with Matplotlib. Put the X features on the axes of a scatter plot and color each point by its entry in the clusters array returned by K-Means. This kind of plot is an effective way to present segments to business stakeholders.
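A minimal two-feature sketch of such a plot is shown below. The data is synthetic (`make_blobs`), and the axis labels and the file name `clusters.png` are placeholders you would replace with your own.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical 2-feature data so the clusters can be drawn on a flat plot.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X)

fig, ax = plt.subplots(figsize=(6, 4))
# One dot per data point, colored by its cluster label.
ax.scatter(X[:, 0], X[:, 1], c=clusters, cmap="viridis", s=20)
# Mark the final centroids on top.
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
           c="red", marker="X", s=200, label="Centroids")
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
ax.set_title("K-Means clusters (K=3)")
ax.legend()
fig.savefig("clusters.png", dpi=150)  # or plt.show() in an interactive session
```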
8. Common Mistakes
- Forgetting to Scale: K-Means calculates Euclidean distance. If Income is in the $100,000s and Age is in the 20s, Income will dominate the distance and Age will have almost no influence on the clusters. Scale the features first, for example with StandardScaler.
- Assuming K-Means always finds the best clusters: K-Means implicitly assumes roughly spherical clusters of similar size. If your data forms complex shapes (like a crescent moon), K-Means will split it poorly. A density-based algorithm such as DBSCAN is a better fit for that kind of data.
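Both mistakes can be reproduced on small synthetic datasets. The sketch below is illustrative only: the age and income data are invented so that the "true" groups differ by age alone, the DBSCAN settings (`eps=0.3`, `min_samples=5`) are hand-picked for this toy data, and agreement with the known groups is measured with the adjusted Rand index (1.0 means a perfect match, about 0 means no better than chance).

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# --- Mistake 1: forgetting to scale ---
# Hypothetical data: two age groups (about 25 and 60); income is unrelated noise.
rng = np.random.default_rng(0)
true_group = np.repeat([0, 1], 100)
age = np.where(true_group == 0, rng.normal(25, 2, 200), rng.normal(60, 2, 200))
income = rng.normal(60_000, 15_000, 200)
X = np.column_stack([age, income])

raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X)
)
ari_raw = adjusted_rand_score(true_group, raw_labels)
ari_scaled = adjusted_rand_score(true_group, scaled_labels)
print(f"Age groups recovered: unscaled ARI={ari_raw:.2f}, scaled ARI={ari_scaled:.2f}")

# --- Mistake 2: non-spherical clusters ---
# Two interleaving crescents; y_moons holds the crescent each point belongs to.
X_moons, y_moons = make_moons(n_samples=200, noise=0.05, random_state=0)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_moons)
ari_km = adjusted_rand_score(y_moons, km_labels)
ari_db = adjusted_rand_score(y_moons, db_labels)
print(f"Crescents recovered: K-Means ARI={ari_km:.2f}, DBSCAN ARI={ari_db:.2f}")
```

Note that DBSCAN needs its own tuning: a poorly chosen `eps` can merge everything into one cluster or label most points as noise.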
9. Best Practices
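- Random Initialization Trap: Sometimes, the initial placement of centroids leads to poor clusters. Scikit-learn addresses this with the n_init parameter, which runs the algorithm several times with different starting centroids and keeps the run with the lowest inertia. Older releases defaulted to n_init=10; since version 1.4 the default is n_init='auto', which means 1 run with the default init='k-means++' and 10 runs with init='random'.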
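The effect of repeated starts can be observed by comparing a single random initialization with ten. This is a small sketch on synthetic data (`make_blobs` with 6 centers is an arbitrary choice); `init="random"` is set explicitly so that the starting centroids really are random rather than chosen by k-means++.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data with several groups, where a bad start is more likely to hurt.
X, _ = make_blobs(n_samples=500, centers=6, cluster_std=1.0, random_state=7)

# One purely random start: the result depends on luck.
single = KMeans(n_clusters=6, init="random", n_init=1, random_state=0).fit(X)

# Ten random starts: scikit-learn keeps the run with the lowest inertia.
multi = KMeans(n_clusters=6, init="random", n_init=10, random_state=0).fit(X)

print(f"n_init=1   inertia={single.inertia_:.1f}")
print(f"n_init=10  inertia={multi.inertia_:.1f}")
```

The ten-start result is never worse than the single start here, because with the same `random_state` the first of the ten runs begins from the same centroids as the single run.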
10. Exercises
- 1. Run K-Means on a dataset with K=1. What will the algorithm do? Where will the single centroid end up?
- 2. Explain the difference between K-Means (Unsupervised) and K-Nearest Neighbors (Supervised).
11. MCQ Quiz with Answers
Question 1
In the K-Means algorithm, what does "K" represent?
Answer: The number of clusters you want the algorithm to find.
Question 2
Which technique is commonly used to determine the optimal value of K when you do not know how many groups exist in your data?
Answer: The Elbow Method.
12. Interview Questions
- Q: Explain how the K-Means algorithm iteratively updates its centroids until it converges.
- Q: What is Unsupervised Learning? Give a real-world business example where it would be used instead of Supervised Learning.
13. FAQs
Q: Can I use K-Means to predict outcomes for new data?
A: Yes! Once K-Means has locked its centroids (using .fit()), you can pass new data to .predict(). It measures the distance from each new point to the established centroids and assigns the point to the closest cluster.
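As a small illustration of that answer, the sketch below fits K-Means on six made-up points, predicts the cluster of two new points, and checks that the result matches a manual nearest-centroid lookup. All data values are invented.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical training data: two obvious groups.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                    [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

# New, unseen points: one near each group.
X_new = np.array([[0.0, 0.5], [10.0, 10.0]])
new_labels = kmeans.predict(X_new)

# The same assignment done by hand: pick the nearest fitted centroid.
distances = np.linalg.norm(
    X_new[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)
manual_labels = distances.argmin(axis=1)

print("predict():        ", new_labels)
print("nearest centroid: ", manual_labels)
```

The centroids are not updated by .predict(). If the new data should influence the clusters, the model has to be fitted again.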