Introduction to Clustering
Clustering is the art of finding natural groups in data without being told what to look for. Unlike classification, where we have labels, clustering is unsupervised learning: we let the algorithm discover patterns on its own. It's like sorting a pile of mixed fruit without knowing the names - you'd naturally group apples with apples and oranges with oranges based on appearance.
What is Clustering? (Simple Definition)
Clustering is like being a detective who organizes clues into piles without knowing what crime was committed. You look at similarities and group things together.
Real-Life Examples
- Spotify grouping similar songs into playlists
- Netflix suggesting shows like ones you watched
- Stores grouping customers by shopping habits
- Email providers separating spam from real mail
Key Question Clustering Answers
"I have a bunch of data points. Which ones are similar to each other? Can I organize them into meaningful groups without knowing ahead of time what those groups should be?"
Types of Clustering
Centroid-Based
Clusters are defined by a central point (centroid). K-Means is the classic example. Works best with spherical, similarly-sized clusters.
Hierarchical
Builds a tree of clusters. Can be agglomerative (bottom-up) or divisive (top-down). Great for understanding cluster relationships.
Density-Based
Clusters are regions of high density separated by low-density areas. DBSCAN can find clusters of any shape and detect outliers.
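These three families can behave very differently on the same data. Below is a minimal sketch comparing one algorithm from each family on the two-moons shape; the DBSCAN `eps=0.3` setting is an illustrative choice for this noise level, not a tuned value.

```python
# Sketch: one algorithm from each clustering family on the same dataset
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

models = {
    "Centroid (K-Means)": KMeans(n_clusters=2, n_init=10, random_state=42),
    "Hierarchical (Agglomerative)": AgglomerativeClustering(n_clusters=2),
    "Density (DBSCAN)": DBSCAN(eps=0.3),
}
for name, model in models.items():
    labels = model.fit_predict(X)
    # DBSCAN labels noise points as -1, so exclude them from the count
    print(f"{name}: {len(set(labels) - {-1})} clusters found")
```

Only the density-based method can follow the curved moon shapes; the centroid-based and (ward-linked) hierarchical methods tend to cut the moons in half.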
Essential Vocabulary (Know These Terms!)
- Cluster: A group of similar data points that belong together
- Centroid: The center point of a cluster (like the average of all points)
- Distance: How far apart two data points are (usually Euclidean distance)
- Inertia: Sum of squared distances of all points to their centroids (lower = tighter clusters)
- Unsupervised: Learning without labels - the algorithm discovers patterns alone
- Outlier: A data point that doesn't fit into any cluster (noise)
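Two of these terms, distance and inertia, are easy to compute by hand on a toy dataset to make them concrete. The four points below are made up for illustration, and inertia here follows scikit-learn's definition: the sum of squared distances to the assigned centroids.

```python
# Sketch: computing "centroid", "distance", and "inertia" by hand
import numpy as np

points = np.array([[1.0, 2.0], [2.0, 3.0], [8.0, 8.0], [9.0, 9.0]])
labels = np.array([0, 0, 1, 1])  # two clusters of two points each

# Centroid = mean of each cluster's points: (1.5, 2.5) and (8.5, 8.5)
centroids = np.array([points[labels == k].mean(axis=0) for k in (0, 1)])
print("Centroids:\n", centroids)

# Euclidean distance of the first point to its own centroid
d = np.linalg.norm(points[0] - centroids[0])
print(f"Distance: {d:.3f}")

# Inertia = sum of squared distances of all points to their centroids
inertia = sum(np.sum((p - centroids[l]) ** 2) for p, l in zip(points, labels))
print(f"Inertia: {inertia:.3f}")
```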
Real-World Applications
Customer Segmentation
Group customers by purchasing behavior to create targeted marketing campaigns. High-value customers, bargain hunters, and seasonal shoppers emerge naturally.
Image Segmentation
Group pixels by color/texture to identify objects in images. Used in medical imaging to identify tumors or in self-driving cars to recognize road elements.
Gene Expression Analysis
Cluster genes with similar expression patterns to discover gene families and understand biological processes.
Anomaly Detection
Points that don't belong to any cluster are outliers. Used in fraud detection, network intrusion detection, and quality control.
# Quick Clustering Demo
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import numpy as np
# Generate sample data with 3 natural clusters
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=42)
# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X)
print(f"Data shape: {X.shape}")
print(f"Cluster labels: {np.unique(y_pred)}")
print(f"Cluster sizes: {np.bincount(y_pred)}")
print(f"Cluster centers shape: {kmeans.cluster_centers_.shape}")
Practice Questions: Introduction
Test your understanding with these coding challenges.
Task: Use make_blobs to create data with 4 clusters and 500 samples. Print shape and center count.
Show Solution
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=500, centers=4, random_state=42)
print(f"Data shape: {X.shape}")
print(f"Number of clusters: {len(set(y))}")
print(f"Samples per cluster: {[list(y).count(i) for i in range(4)]}")
Task: Generate blob data, cluster with KMeans, and calculate adjusted rand score.
Show Solution
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X)
ari = adjusted_rand_score(y_true, y_pred)
print(f"Adjusted Rand Index: {ari:.3f}")
# 1.0 = perfect clustering, 0 = random
Task: Load Iris, apply KMeans with k=3, compare with true species labels using ARI.
Show Solution
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler
iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X_scaled)
ari = adjusted_rand_score(iris.target, y_pred)
print(f"ARI (vs true species): {ari:.3f}")
Task: Use make_moons to create 300 samples with noise=0.1. Print the shape and unique labels.
Show Solution
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=300, noise=0.1, random_state=42)
print(f"Data shape: {X.shape}")
print(f"Unique labels: {set(y)}")
print(f"Samples per class: {[list(y).count(i) for i in [0, 1]]}")
Task: Load Wine dataset, apply StandardScaler, then cluster with KMeans. Compare mean and std before/after scaling.
Show Solution
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
import numpy as np
wine = load_wine()
X = wine.data
print("Before scaling:")
print(f"Mean: {np.mean(X, axis=0)[:3]}...") # First 3 features
print(f"Std: {np.std(X, axis=0)[:3]}...")
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("\nAfter scaling:")
print(f"Mean: {np.mean(X_scaled, axis=0)[:3]}...") # ~0
print(f"Std: {np.std(X_scaled, axis=0)[:3]}...") # ~1
Task: Use make_circles to create concentric circles data. Print shape and apply KMeans - note why it fails.
Show Solution
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
X, y_true = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)
print(f"Data shape: {X.shape}")
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X)
ari = adjusted_rand_score(y_true, y_pred)
print(f"ARI: {ari:.3f}")
print("K-Means fails on circles because it assumes spherical clusters!")
Task: Cluster blob data and calculate ARI, NMI (Normalized Mutual Information), and Silhouette Score.
Show Solution
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, silhouette_score
X, y_true = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X)
ari = adjusted_rand_score(y_true, y_pred)
nmi = normalized_mutual_info_score(y_true, y_pred)
sil = silhouette_score(X, y_pred)
print(f"Adjusted Rand Index: {ari:.3f}")
print(f"Normalized Mutual Info: {nmi:.3f}")
print(f"Silhouette Score: {sil:.3f}")
Task: Cluster data and print a simple text-based visualization showing cluster assignments.
Show Solution
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import numpy as np
X, _ = make_blobs(n_samples=20, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
print("Cluster Assignments:")
for cluster in range(3):
indices = np.where(labels == cluster)[0]
print(f"\nCluster {cluster}: {len(indices)} points")
print(f" Points: {indices.tolist()}")
print(f" Centroid: {kmeans.cluster_centers_[cluster].round(2)}")
K-Means Clustering
K-Means is the most popular clustering algorithm due to its simplicity and speed. It partitions data into K clusters where each point belongs to the cluster with the nearest centroid (center). Think of it like placing K magnets in a room and letting each data point stick to its nearest magnet.
Understanding K-Means (Beginner's Guide)
What is "K"?
K is the number of groups you want. You choose K before running the algorithm. K=3 means "find me 3 groups."
What is a "Centroid"?
The center point of each cluster. Like the flag pole in the middle of a camp - everyone in that group gathers around it.
What is "Inertia"?
Measures how spread out points are from their centroid. Lower inertia = tighter, better clusters.
Think of it like this: You're a teacher splitting 30 students into K study groups. You want students in each group to be as similar as possible (by skill level, interests, etc.). The centroid is like the "average student" representing each group.
The K-Means Algorithm
Initialize
Randomly select K data points as initial centroids
Assign
Assign each point to nearest centroid
Update
Move centroid to mean of assigned points
Repeat
Go to step 2 until centroids stop moving
Watch K-Means in Action (Step-by-Step Example)
| Step | What Happens | Example (K=2) |
|---|---|---|
| 1 | Pick K random points as starting centroids | Centroid A = (2,3), Centroid B = (8,7) |
| 2 | Each point measures distance to all centroids, joins nearest | Point (1,2) → closer to A → joins Cluster A |
| 3 | Move each centroid to the average position of its members | Cluster A members average = (2.5, 2.8) → new centroid |
| 4 | Repeat steps 2-3 until centroids stop moving | Usually 5-20 iterations until stable |
| Done! | Final clusters are ready | Each point has a cluster label (0 or 1) |
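The four steps in the table can be written out as a short from-scratch sketch in NumPy. This is illustrative only: sklearn's implementation adds k-means++ seeding, multiple restarts, and many optimizations, and the two well-separated blobs below are synthetic.

```python
# Minimal from-scratch K-Means following the four steps above
import numpy as np

rng = np.random.default_rng(42)
# Two well-separated blobs around (0, 0) and (5, 5)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

K = 2
# Step 1: Initialize - pick K random data points as starting centroids
centroids = X[rng.choice(len(X), K, replace=False)]

for iteration in range(100):
    # Step 2: Assign - each point joins its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 3: Update - move each centroid to the mean of its members
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Step 4: Repeat - stop once centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(f"Converged after {iteration + 1} iterations")
print(f"Centroids:\n{centroids.round(2)}")
```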
# K-Means Clustering with Scikit-learn
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
# Always scale your data for K-Means!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Create and fit K-Means
kmeans = KMeans(
n_clusters=4, # Number of clusters (K)
init='k-means++', # Smart initialization (default)
n_init=10, # Run 10 times with different seeds
max_iter=300, # Max iterations per run
random_state=42
)
kmeans.fit(X_scaled)
# Results
print(f"Cluster labels: {kmeans.labels_[:10]}...") # First 10 labels
print(f"Cluster centers:\n{kmeans.cluster_centers_}")
print(f"Inertia (within-cluster sum of squares): {kmeans.inertia_:.2f}")
print(f"Number of iterations: {kmeans.n_iter_}")
K-Means++ Initialization
Random initialization can lead to poor results if centroids start too close together. K-Means++ is a smarter initialization that spreads initial centroids apart:
Random Init Problems
- Centroids may start in same cluster
- Results vary greatly between runs
- May converge to poor local minimum
- Needs many restarts (n_init)
K-Means++ Benefits
- Spreads centroids across data
- More consistent results
- Faster convergence
- Better final clustering
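The seeding rule behind K-Means++ is simple to sketch: after the first centroid is picked uniformly at random, each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far, so far-away regions are likely to get a seed. This is a simplified version of the idea; sklearn's internal routine is more elaborate.

```python
# Sketch of the k-means++ seeding rule (simplified, for illustration)
import numpy as np
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
rng = np.random.default_rng(42)
K = 4

centroids = [X[rng.integers(len(X))]]  # first centroid: uniform random point
for _ in range(K - 1):
    # Squared distance of every point to its nearest chosen centroid
    d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
    probs = d2 / d2.sum()  # far points get higher sampling probability
    centroids.append(X[rng.choice(len(X), p=probs)])

print("Seed centroids:\n", np.array(centroids).round(2))
```

Because an already-chosen point has squared distance zero, it can never be picked twice, and the seeds naturally spread across the data.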
# Comparing Random vs K-Means++ Initialization
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# Random initialization
kmeans_random = KMeans(n_clusters=4, init='random', n_init=1, random_state=42)
kmeans_random.fit(X)
print(f"Random init - Inertia: {kmeans_random.inertia_:.2f}, Iterations: {kmeans_random.n_iter_}")
# K-Means++ initialization (default)
kmeans_pp = KMeans(n_clusters=4, init='k-means++', n_init=1, random_state=42)
kmeans_pp.fit(X)
print(f"K-Means++ - Inertia: {kmeans_pp.inertia_:.2f}, Iterations: {kmeans_pp.n_iter_}")
# Lower inertia = tighter clusters = better!
Limitations of K-Means
- Must specify K: You need to know/guess the number of clusters
- Spherical clusters: Assumes clusters are roughly round and similar in size
- Sensitive to outliers: Outliers can pull centroids away from dense regions
- Sensitive to scale: Always standardize your features!
Common Beginner Mistakes to Avoid
Mistake #1: Forgetting to Scale
If salary is 50,000 and age is 30, K-Means will think salary matters 1,667x more! Always use StandardScaler first.
Mistake #2: Random K Selection
Don't just guess K=5 because it "sounds right." Use Elbow Method or Silhouette Score to find the optimal value.
Mistake #3: Using K-Means on Non-Spherical Data
K-Means assumes round clusters. Moon-shaped or ring-shaped data? Use DBSCAN instead.
Mistake #4: Running Only Once
K-Means results depend on initial centroid positions. Use n_init=10 to run 10 times and pick the best result.
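Mistake #1 is easy to demonstrate with toy numbers (the salary/age rows below are made up). Two customers with nearly identical salaries but very different ages look almost identical to raw Euclidean distance, because the salary axis dwarfs the age axis; after standardization, the age difference dominates as it should.

```python
# Sketch of Mistake #1: raw vs scaled distances on toy salary/age data
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: salary, age (made-up values)
X = np.array([[50_000, 30], [51_000, 60], [90_000, 31], [91_000, 59]], dtype=float)

# Raw distance between customers 0 and 1: the 30-year age gap barely registers
d_raw = np.linalg.norm(X[0] - X[1])
print(f"Raw distance:    {d_raw:.1f}")

X_scaled = StandardScaler().fit_transform(X)
d_scaled = np.linalg.norm(X_scaled[0] - X_scaled[1])
print(f"Scaled distance: {d_scaled:.3f}")
```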
Practice Questions: K-Means
Test your understanding with these coding challenges.
Task: Load wine dataset, scale features, apply K-Means with k=3, print cluster sizes.
Show Solution
from sklearn.datasets import load_wine
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np
wine = load_wine()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(wine.data)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
print(f"Cluster sizes: {np.bincount(labels)}")
Task: Train KMeans, then use .predict() to assign new data points to clusters.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np
X_train, _ = make_blobs(n_samples=200, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X_train)
# New data points to predict
X_new = np.array([[0, 0], [5, 5], [-5, -5]])
new_labels = kmeans.predict(X_new)
print(f"New point clusters: {new_labels}")
Task: Run K-Means for k=1 to 10 and print the inertia for each.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=5, random_state=42)
inertias = []
for k in range(1, 11):
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X)
inertias.append(kmeans.inertia_)
print(f"k={k:2d}: Inertia = {kmeans.inertia_:.2f}")
# Look for the "elbow" where inertia stops decreasing rapidly
Task: Train KMeans with k=3 and print the coordinates of all cluster centers.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)
print("Cluster Centers:")
for i, center in enumerate(kmeans.cluster_centers_):
print(f"Cluster {i}: ({center[0]:.2f}, {center[1]:.2f})")
Task: Run KMeans with random and k-means++ init. Compare inertia values for 5 different random states.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
print("Random State | Random Init | K-Means++")
print("-" * 40)
for rs in range(5):
km_rand = KMeans(n_clusters=4, init='random', n_init=1, random_state=rs)
km_pp = KMeans(n_clusters=4, init='k-means++', n_init=1, random_state=rs)
km_rand.fit(X)
km_pp.fit(X)
print(f"{rs:12d} | {km_rand.inertia_:11.2f} | {km_pp.inertia_:.2f}")
Task: Cluster the Digits dataset (first 2 PCA components) and show cluster size distribution.
Show Solution
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import numpy as np
digits = load_digits()
pca = PCA(n_components=2)
X_pca = pca.fit_transform(digits.data)
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_pca)
print("Cluster Size Distribution:")
for i in range(10):
count = np.sum(labels == i)
print(f"Cluster {i}: {count} samples ({count/len(labels)*100:.1f}%)")
Task: After clustering, calculate the distance of each point to its assigned centroid and find the furthest point.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
# Calculate distance to assigned centroid
distances = []
for i, point in enumerate(X):
centroid = kmeans.cluster_centers_[labels[i]]
dist = np.linalg.norm(point - centroid)
distances.append(dist)
print(f"Average distance to centroid: {np.mean(distances):.3f}")
print(f"Max distance: {np.max(distances):.3f}")
print(f"Furthest point index: {np.argmax(distances)}")
Task: Use KMeans transform() method to get distances from each point to all centroids.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np
X, _ = make_blobs(n_samples=10, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)
# transform() returns distance to each centroid
distances = kmeans.transform(X)
print("Distances to each centroid:")
print(f"Shape: {distances.shape}") # (n_samples, n_clusters)
print(f"\nFirst 5 points:")
for i in range(5):
print(f"Point {i}: {distances[i].round(2)} -> Closest: Cluster {np.argmin(distances[i])}")
Choosing the Optimal K
The biggest challenge in K-Means is choosing the right number of clusters. Too few clusters underfit (miss patterns), too many overfit (find noise). The Elbow Method and Silhouette Score are the two most popular techniques to find the "sweet spot."
The "How Many Groups?" Problem (Beginner's Guide)
Choosing K is like deciding how many drawers you need to organize your clothes. Too few = everything crammed together. Too many = mostly empty drawers and wasted space.
Elbow Method (Visual Approach)
What it measures: How much "tightness" you gain by adding more clusters
What to look for: The "elbow" point where the curve bends - adding more clusters after this doesn't help much
Analogy: Like squeezing a sponge. At first, lots of water comes out. Eventually, squeezing harder barely helps. The elbow is when you should stop squeezing.
Silhouette Score (Quality Score)
What it measures: How well each point fits in its cluster vs neighboring clusters
Score range: -1 (wrong cluster) → 0 (border) → +1 (perfect fit)
Analogy: Like asking each student "Do you feel you belong in this group?" High scores mean happy, well-placed students.
The Elbow Method
Plot inertia (within-cluster sum of squares) for different K values. Look for the "elbow" - the point where adding more clusters doesn't significantly reduce inertia anymore.
# Elbow Method
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
# Calculate inertia for different K values
k_values = range(1, 11)
inertias = []
for k in k_values:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X)
inertias.append(kmeans.inertia_)
# Print results (for plotting)
for k, inertia in zip(k_values, inertias):
print(f"k={k:2d}: Inertia = {inertia:8.2f}")
# The "elbow" is typically where the curve bends - around k=4 here
Silhouette Score
The Silhouette Score measures how similar a point is to its own cluster compared to other clusters. It ranges from -1 to 1:
- +1: Point is well-matched to its cluster and poorly matched to neighbors
- 0: Point is on the border between two clusters
- -1: Point is probably assigned to the wrong cluster
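Behind these numbers is the formula s = (b - a) / max(a, b), where a is the point's mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. A hand computation on four toy points, cross-checked against sklearn:

```python
# Sketch: silhouette score for one point, by hand vs sklearn
import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])

p = X[0]
# a = mean distance to other points in the same cluster (excluding p itself)
a = np.mean([np.linalg.norm(p - q) for q in X[labels == 0]
             if not np.array_equal(q, p)])
# b = mean distance to points in the nearest other cluster
b = np.mean([np.linalg.norm(p - q) for q in X[labels == 1]])
s = (b - a) / max(a, b)
print(f"a = {a:.3f}, b = {b:.3f}, s = {s:.3f}")
print(f"sklearn says: {silhouette_samples(X, labels)[0]:.3f}")
```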
# Silhouette Score Analysis
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
# Calculate silhouette score for different K values
k_values = range(2, 11) # Silhouette needs at least 2 clusters
silhouette_scores = []
for k in k_values:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
score = silhouette_score(X, labels)
silhouette_scores.append(score)
print(f"k={k:2d}: Silhouette = {score:.3f}")
# Higher is better - find the maximum!
best_k = k_values[silhouette_scores.index(max(silhouette_scores))]
print(f"\nBest k by silhouette: {best_k}")
Combining Both Methods
# Complete Analysis
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
print("K | Inertia | Silhouette")
print("-" * 30)
for k in range(2, 11):
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
sil = silhouette_score(X, labels)
print(f"{k} | {kmeans.inertia_:10.2f} | {sil:.3f}")
# Look for: elbow in inertia + high silhouette score
Practice Questions: Choosing K
Test your understanding with these coding challenges.
Task: Cluster Iris data with k=3 and print the silhouette score.
Show Solution
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
score = silhouette_score(X_scaled, labels)
print(f"Silhouette Score (k=3): {score:.3f}")
Task: Test k=2 to 8 on Wine dataset and find the k with highest silhouette score.
Show Solution
from sklearn.datasets import load_wine
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
wine = load_wine()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(wine.data)
best_k, best_score = 2, -1
for k in range(2, 9):
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
score = silhouette_score(X_scaled, labels)
print(f"k={k}: {score:.3f}")
if score > best_score:
best_k, best_score = k, score
print(f"\nBest: k={best_k} with score {best_score:.3f}")
Task: Calculate the rate of change in inertia to programmatically find the elbow.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
inertias = []
for k in range(1, 11):
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X)
inertias.append(kmeans.inertia_)
# Calculate rate of change (first derivative)
deltas = np.diff(inertias)
# Elbow is where delta stops decreasing rapidly
ratios = deltas[:-1] / deltas[1:]
elbow = np.argmax(ratios) + 2 # +2 because of diff offsets
print(f"Inertias: {[int(i) for i in inertias]}")
print(f"Detected elbow at k={elbow}")
Task: Use silhouette_samples() to get per-point silhouette scores and find the worst-clustered point.
Show Solution
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
from sklearn.datasets import make_blobs
import numpy as np
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
scores = silhouette_samples(X, labels)
print(f"Mean silhouette: {scores.mean():.3f}")
print(f"Worst point index: {np.argmin(scores)}")
print(f"Worst point score: {scores.min():.3f}")
print(f"Best point score: {scores.max():.3f}")
Task: Calculate silhouette score for Iris, Wine, and Blobs datasets with their natural K values.
Show Solution
from sklearn.datasets import load_iris, load_wine, make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
datasets = {
'Iris (k=3)': (load_iris().data, 3),
'Wine (k=3)': (load_wine().data, 3),
'Blobs (k=4)': (make_blobs(n_samples=300, centers=4, random_state=42)[0], 4)
}
for name, (X, k) in datasets.items():
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
score = silhouette_score(X_scaled, labels)
print(f"{name}: Silhouette = {score:.3f}")
Task: Use calinski_harabasz_score (another metric - higher is better) to find optimal K.
Show Solution
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
print("K | Calinski-Harabasz Score")
print("-" * 30)
best_k, best_score = 2, -1
for k in range(2, 11):
labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
score = calinski_harabasz_score(X, labels)
print(f"{k} | {score:.2f}")
if score > best_score:
best_k, best_score = k, score
print(f"\nBest K: {best_k}")
Task: Calculate Davies-Bouldin Index (lower is better) for different K values.
Show Solution
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
print("K | Davies-Bouldin Score (lower = better)")
print("-" * 40)
best_k, best_score = 2, float('inf')
for k in range(2, 11):
labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
score = davies_bouldin_score(X, labels)
print(f"{k} | {score:.3f}")
if score < best_score:
best_k, best_score = k, score
print(f"\nBest K: {best_k} (score: {best_score:.3f})")
Task: Create a comprehensive table showing Silhouette, Calinski-Harabasz, and Davies-Bouldin for k=2-8.
Show Solution
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
print("K | Silhouette | Calinski-H | Davies-B")
print("-" * 45)
for k in range(2, 9):
labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
sil = silhouette_score(X, labels)
ch = calinski_harabasz_score(X, labels)
db = davies_bouldin_score(X, labels)
print(f"{k} | {sil:.3f} | {ch:7.1f} | {db:.3f}")
print("\nGoal: High Silhouette, High CH, Low DB")
Hierarchical Clustering
Hierarchical Clustering builds a tree of clusters called a dendrogram. Unlike K-Means, you don't need to specify K upfront - you can cut the tree at any level to get different numbers of clusters. It's like organizing a family tree where similar family members are grouped together.
Understanding Hierarchical Clustering (Beginner's Guide)
What is a Dendrogram?
A tree diagram showing how clusters merge together. The height shows how different the merged groups are. Think of it like a company org chart, but for data similarity.
Why is it Flexible?
You can "cut" the tree at any height to get 2 clusters, 5 clusters, or any number! Unlike K-Means where you must decide K before training.
School Analogy: Imagine organizing all students by similarity. First, best friends pair up (most similar). Then pairs join to form friend groups. Groups combine into class sections. Sections form grades. Grades form the whole school. The dendrogram shows this entire hierarchy - cut at "grades" level to see 4 groups, or at "friend groups" level to see 50 groups!
Agglomerative vs Divisive
Agglomerative (Bottom-Up)
Start with each point as its own cluster. Merge the two closest clusters repeatedly until one cluster remains. Most common approach.
Divisive (Top-Down)
Start with all points in one cluster. Split recursively until each point is its own cluster. Less common but useful for some problems.
# Agglomerative Hierarchical Clustering
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
X, _ = make_blobs(n_samples=50, centers=3, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Agglomerative clustering with 3 clusters
agg = AgglomerativeClustering(
n_clusters=3, # Number of clusters
linkage='ward', # Linkage criterion
metric='euclidean' # Distance metric (ward requires euclidean)
)
labels = agg.fit_predict(X_scaled)
print(f"Cluster labels: {labels}")
print(f"Cluster sizes: {[list(labels).count(i) for i in range(3)]}")
Applies Agglomerative (bottom-up) Hierarchical Clustering using Ward linkage, which minimizes within-cluster variance. Unlike K-Means, hierarchical clustering builds a tree structure by progressively merging the closest clusters. The output shows which cluster each point belongs to and the size of each resulting cluster.
Linkage Methods
When merging clusters, how do we measure the distance between them? Different linkage methods give different results:
What's "Linkage"? (Simple Explanation)
When you merge two groups, you need to know how "far apart" they are. But groups have multiple points! Linkage is the rule for measuring distance between groups. Different rules give different cluster shapes.
Ward
Minimizes within-cluster variance. Creates compact, spherical clusters. Default and usually best choice.
Single
Distance = minimum between any two points. Can create elongated chains. Good for non-spherical clusters.
Complete
Distance = maximum between any two points. Creates compact clusters. Sensitive to outliers.
Average
Distance = average of all pairwise distances. Balanced approach between single and complete.
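The three pairwise-distance rules are easy to see on two tiny hand-made clusters (ward is omitted here because it is variance-based rather than a simple pairwise-distance rule):

```python
# Sketch: single, complete, and average linkage between two tiny clusters
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])  # cluster A
B = np.array([[4.0, 0.0], [6.0, 0.0]])  # cluster B

D = cdist(A, B)  # all pairwise distances between A's and B's points
print(f"Single   (min):  {D.min():.2f}")   # 3.00
print(f"Complete (max):  {D.max():.2f}")   # 6.00
print(f"Average  (mean): {D.mean():.2f}")  # 4.50
```

Single linkage only cares about the closest pair (which is why it can chain clusters together), complete linkage about the farthest pair, and average linkage sits in between.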
Understanding Dendrograms
# Creating and Analyzing Dendrograms
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs
import numpy as np
# Small dataset for visualization
X, _ = make_blobs(n_samples=10, centers=3, random_state=42)
# Compute linkage matrix
Z = linkage(X, method='ward')
# The linkage matrix Z has 4 columns:
# [cluster1_idx, cluster2_idx, distance, n_points_in_new_cluster]
print("Linkage Matrix (first 5 merges):")
print("Cluster1 | Cluster2 | Distance | Size")
for i, row in enumerate(Z[:5]):
print(f" {int(row[0]):5d} | {int(row[1]):5d} | {row[2]:.2f} | {int(row[3])}")
# To cut the dendrogram at a specific number of clusters:
from scipy.cluster.hierarchy import fcluster
labels_3 = fcluster(Z, t=3, criterion='maxclust')
print(f"\n3 clusters: {labels_3}")
Practice Questions: Hierarchical Clustering
Test your understanding with these coding challenges.
Task: Cluster Iris data using AgglomerativeClustering with 3 clusters and ward linkage.
Show Solution
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X_scaled)
ari = adjusted_rand_score(iris.target, labels)
print(f"Adjusted Rand Index: {ari:.3f}")
Task: Compare ward, single, complete, and average linkage on the same data.
Show Solution
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
for linkage in ['ward', 'single', 'complete', 'average']:
agg = AgglomerativeClustering(n_clusters=3, linkage=linkage)
labels = agg.fit_predict(X_scaled)
score = silhouette_score(X_scaled, labels)
print(f"{linkage:8s}: Silhouette = {score:.3f}")
Task: Create linkage matrix and use fcluster to get 2, 3, 4, and 5 clusters.
Show Solution
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
Z = linkage(X_scaled, method='ward')
for n_clusters in [2, 3, 4, 5]:
labels = fcluster(Z, t=n_clusters, criterion='maxclust')
score = silhouette_score(X_scaled, labels)
print(f"{n_clusters} clusters: Silhouette = {score:.3f}")
Task: Create a linkage matrix with scipy and print its shape and first 3 rows.
Show Solution
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=20, centers=3, random_state=42)
Z = linkage(X, method='ward')
print(f"Linkage matrix shape: {Z.shape}")
print(f"Columns: [cluster1, cluster2, distance, size]")
print(f"\nFirst 3 merges:")
for row in Z[:3]:
print(f" Merge {int(row[0])} + {int(row[1])}, dist={row[2]:.2f}, size={int(row[3])}")
Task: Use fcluster with criterion='distance' to cut at different height thresholds.
Show Solution
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=100, centers=4, random_state=42)
Z = linkage(X, method='ward')
print("Threshold | Clusters")
print("-" * 25)
for threshold in [5, 10, 15, 20, 30]:
    labels = fcluster(Z, t=threshold, criterion='distance')
    n_clusters = len(set(labels))
    print(f"{threshold:9.1f} | {n_clusters}")
Task: Show that sklearn's AgglomerativeClustering and scipy's fcluster give the same results.
Show Solution
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
# sklearn approach
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels_sklearn = agg.fit_predict(X)
# scipy approach
Z = linkage(X, method='ward')
labels_scipy = fcluster(Z, t=3, criterion='maxclust')
# Compare (labels may be numbered differently, use ARI)
ari = adjusted_rand_score(labels_sklearn, labels_scipy)
print(f"ARI between sklearn and scipy: {ari:.3f}")
print("(1.0 means identical clustering)")
Task: Calculate inconsistency statistics to automatically determine optimal cut point.
Show Solution
from scipy.cluster.hierarchy import linkage, fcluster, inconsistent
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
X, _ = make_blobs(n_samples=150, centers=4, random_state=42)
Z = linkage(X, method='ward')
# Inconsistency method - look for jumps
R = inconsistent(Z, d=2)
max_inconsistency = R[:, 3].max()
print(f"Max inconsistency: {max_inconsistency:.2f}")
# Try different inconsistency thresholds
for thresh in [0.7, 0.8, 0.9, 1.0]:
    labels = fcluster(Z, t=thresh, criterion='inconsistent')
    n_c = len(set(labels))
    if n_c >= 2:
        sil = silhouette_score(X, labels)
        print(f"threshold={thresh}: {n_c} clusters, silhouette={sil:.3f}")
Task: Calculate cophenetic correlation to measure how well the dendrogram preserves pairwise distances.
Show Solution
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=100, centers=4, random_state=42)
# Compare different linkage methods
print("Linkage Method | Cophenetic Correlation")
print("-" * 40)
for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)
    c, _ = cophenet(Z, pdist(X))
    print(f"{method:14s} | {c:.4f}")
print("\n(Higher = dendrogram better preserves original distances)")
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters based on density. It doesn't require specifying the number of clusters and can find clusters of arbitrary shapes. Best of all, it automatically identifies outliers as noise points.
Understanding DBSCAN (Beginner's Guide)
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. Unlike K-Means which requires you to specify the number of clusters beforehand, DBSCAN automatically discovers clusters based on how densely packed the data points are. It's particularly powerful because it can find clusters of any shape and automatically identifies outliers!
eps (ε)
"Neighborhood Radius"
The maximum distance between two points for them to be considered neighbors. Think of it as drawing a circle around each point.
Like asking "Who lives within 1 mile of me?"
min_samples
"Minimum Crowd Size"
The minimum number of points required within eps distance for a point to be considered a "core point" that starts a cluster.
Like "Need 5 friends nearby to be a hub."
-1 Label
"Noise / Outliers"
Points that don't belong to any cluster are automatically labeled as -1. These are outliers or noise in your data.
Like lone houses in the countryside!
Types of Points in DBSCAN
Core Point
Has ≥ min_samples neighbors within eps distance
The "popular kids" who start clusters.
Border Point
Within eps of a core point, but not enough neighbors itself
The "fringe members" of a cluster.
Noise Point
Not within eps of any core point (label = -1)
The "loners" - outliers in your data.
How DBSCAN Works (Step by Step)
DBSCAN Superpower: Finds clusters of ANY shape (crescents, rings, spirals) because it follows density, not geometry. K-Means only finds round blobs! Plus, you don't need to specify the number of clusters - DBSCAN discovers them automatically!
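Since the step-by-step mechanics matter here, the whole algorithm fits in a short sketch. This is a toy, brute-force implementation written for this section (the function `dbscan_toy` is our own illustration, not sklearn code): visit each point, check whether it is a core point, and if so grow a cluster by absorbing neighbors of neighbors.

```python
import numpy as np

def dbscan_toy(X, eps=0.5, min_samples=5):
    """Brute-force DBSCAN for intuition only: O(n^2), use on tiny data."""
    n = len(X)
    labels = np.full(n, -1)  # -1 = noise until proven otherwise
    # Precompute all pairwise distances and each point's eps-neighborhood
    dists = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(n)]
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_samples:
            continue  # not a core point: stays noise unless a cluster claims it
        # i is a core point: start a new cluster and expand it
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # core or border point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_samples:
                    queue.extend(neighbors[j])  # j is also core: keep expanding
        cluster += 1
    return labels

# Quick sanity check on two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
labels = dbscan_toy(X, eps=0.8, min_samples=5)
print(sorted(set(labels.tolist())))
```

Note the expansion loop: border points get a cluster label but never extend the queue, which is exactly why clusters follow density rather than geometry.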
Key Concepts
Core Points
Points with at least min_samples neighbors within eps distance. These form the "heart" of clusters.
Border Points
Not core points themselves, but within eps of a core point. They're on the edge of clusters.
Noise Points
Not core or border points. These are outliers that don't belong to any cluster. Labeled as -1.
# DBSCAN Clustering
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
import numpy as np
# Generate moon-shaped data (non-spherical clusters)
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply DBSCAN
dbscan = DBSCAN(
    eps=0.3,            # Maximum distance between neighbors
    min_samples=5,      # Minimum points to form a core point
    metric='euclidean'  # Distance metric
)
labels = dbscan.fit_predict(X_scaled)
# Analyze results
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Number of clusters: {n_clusters}")
print(f"Number of noise points: {n_noise}")
print(f"Cluster labels: {np.unique(labels)}")
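One practical gap worth knowing: unlike K-Means, a fitted DBSCAN has no predict() method for new points. A common workaround, sketched below with a helper of our own (`dbscan_predict` is not a real sklearn API), assigns each new point to the cluster of its nearest core sample, or to noise if no core sample lies within eps:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
import numpy as np

X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
dbscan = DBSCAN(eps=0.3, min_samples=5).fit(X_scaled)

def dbscan_predict(dbscan, X_new):
    """Assign new points to the cluster of their nearest core sample (or -1)."""
    cores = dbscan.components_  # coordinates of the core samples
    core_labels = dbscan.labels_[dbscan.core_sample_indices_]
    nn = NearestNeighbors(n_neighbors=1).fit(cores)
    dist, idx = nn.kneighbors(X_new)
    labels = core_labels[idx.ravel()]
    labels[dist.ravel() > dbscan.eps] = -1  # too far from any core point: noise
    return labels

new_points = np.array([[0.0, 0.0], [5.0, 5.0]])  # one near the data, one far away
print(dbscan_predict(dbscan, new_points))
```

Only core samples are used for matching because border points sit at cluster edges and would blur the boundary; a point far from every core sample is treated the way DBSCAN itself would treat it, as noise.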
Choosing eps and min_samples
eps (epsilon)
The maximum distance between two points to be considered neighbors.
- Too small: Most points are noise, many small clusters
- Too large: Clusters merge together, few noise points
- Tip: Use k-distance graph to find good eps
min_samples
Minimum points in neighborhood to be a core point.
- Too small: Noise points form clusters
- Too large: Smaller clusters become noise
- Tip: Start with n_dimensions + 1 or more
Quick DBSCAN Tuning Guide
If you see too many clusters:
→ Increase eps (make neighborhoods bigger) OR decrease min_samples
If you see too few clusters (everything in one):
→ Decrease eps (make neighborhoods smaller) OR increase min_samples
# Finding Good eps Using k-Distance Graph
from sklearn.neighbors import NearestNeighbors
import numpy as np
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Find distance to k-th nearest neighbor for each point
k = 5 # min_samples we're considering
neigh = NearestNeighbors(n_neighbors=k + 1)  # +1: kneighbors counts the point itself
neigh.fit(X_scaled)
distances, _ = neigh.kneighbors(X_scaled)
# Sort distances to the true k-th nearest neighbor
k_distances = np.sort(distances[:, k])
# Print some distances - look for the "knee"
print("k-distances (sorted):")
print(k_distances[::20]) # Every 20th value
# Good eps is often at the "knee" of this curve
DBSCAN vs K-Means
# DBSCAN Handles Non-Spherical Clusters
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
# Moon-shaped data - K-Means will struggle!
X, y_true = make_moons(n_samples=200, noise=0.1, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# K-Means (assumes spherical clusters)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)
kmeans_ari = adjusted_rand_score(y_true, kmeans_labels)
# DBSCAN (finds density-based clusters)
dbscan = DBSCAN(eps=0.3, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)
dbscan_ari = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {kmeans_ari:.3f}")
print(f"DBSCAN ARI: {dbscan_ari:.3f}")
# DBSCAN wins on non-spherical data!
Limitations of DBSCAN
- Varying densities: Struggles when clusters have different densities
- High dimensions: eps becomes hard to tune in high-dimensional space
- Parameters matter: Results are very sensitive to eps and min_samples
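The high-dimensions caveat is easy to demonstrate. On random Gaussian data (our own construction, purely for illustration), the spread of k-nearest-neighbor distances collapses as dimensionality grows, so the k-distance curve loses the sharp "knee" you would normally use to pick eps:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
ratios = {}
for dim in [2, 10, 50]:
    X = rng.normal(size=(500, dim))
    # Distance to the 5th true neighbor (n_neighbors=6 since each point counts itself)
    nn = NearestNeighbors(n_neighbors=6).fit(X)
    dists, _ = nn.kneighbors(X)
    k_dist = dists[:, -1]
    ratios[dim] = k_dist.max() / k_dist.min()
    print(f"{dim:2d}D: 5-NN distance min={k_dist.min():.2f}, "
          f"max={k_dist.max():.2f}, max/min={ratios[dim]:.1f}")
```

As the max/min ratio shrinks toward 1, every candidate eps either merges everything into one cluster or marks everything as noise, which is why DBSCAN tuning gets hard beyond a few dozen dimensions.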
Practice Questions: DBSCAN
Test your understanding with these coding challenges.
Task: Use DBSCAN on make_blobs data with added outliers and count noise points.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import numpy as np
# Create data with outliers
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
outliers = np.random.uniform(-10, 10, size=(20, 2))
X_with_outliers = np.vstack([X, outliers])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_with_outliers)
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)
n_noise = list(labels).count(-1)
print(f"Noise points detected: {n_noise}")
Task: Run DBSCAN with eps values 0.1, 0.3, 0.5, 1.0 and compare clusters/noise.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
for eps in [0.1, 0.3, 0.5, 1.0]:
    dbscan = DBSCAN(eps=eps, min_samples=5)
    labels = dbscan.fit_predict(X_scaled)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(labels).count(-1)
    print(f"eps={eps:.1f}: {n_clusters} clusters, {n_noise} noise")
Task: Compute k-distances, find a good eps, and verify with silhouette score.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import numpy as np
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Find k-distances
k = 5
neigh = NearestNeighbors(n_neighbors=k + 1)  # +1: kneighbors counts the point itself
neigh.fit(X_scaled)
distances, _ = neigh.kneighbors(X_scaled)
k_distances = np.sort(distances[:, k])
# Look for the knee - try percentile approach
eps_candidates = [np.percentile(k_distances, p) for p in [10, 20, 30, 40]]
for eps in eps_candidates:
    dbscan = DBSCAN(eps=eps, min_samples=k)
    labels = dbscan.fit_predict(X_scaled)
    if len(set(labels) - {-1}) >= 2:
        non_noise = labels != -1
        if sum(non_noise) > 1:
            score = silhouette_score(X_scaled[non_noise], labels[non_noise])
            print(f"eps={eps:.3f}: silhouette={score:.3f}")
Task: After DBSCAN, count how many points are core, border, and noise.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
import numpy as np
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
dbscan = DBSCAN(eps=1.5, min_samples=5)
labels = dbscan.fit_predict(X)
# Core points have indices in core_sample_indices_
n_core = len(dbscan.core_sample_indices_)
n_noise = list(labels).count(-1)
n_border = len(X) - n_core - n_noise
print(f"Core points: {n_core}")
print(f"Border points: {n_border}")
print(f"Noise points: {n_noise}")
Task: Keep eps fixed and vary min_samples from 3 to 15. Show how it affects clustering.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
print("min_samples | Clusters | Noise")
print("-" * 35)
for min_s in [3, 5, 7, 10, 15]:
    dbscan = DBSCAN(eps=0.3, min_samples=min_s)
    labels = dbscan.fit_predict(X_scaled)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(labels).count(-1)
    print(f"{min_s:11d} | {n_clusters:8d} | {n_noise}")
Task: Create blobs with different cluster_std values and see how DBSCAN handles varying densities.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
import numpy as np
# Create blobs with different densities
X1, y1 = make_blobs(n_samples=100, centers=[[0, 0]], cluster_std=0.5, random_state=42)
X2, y2 = make_blobs(n_samples=100, centers=[[5, 5]], cluster_std=2.0, random_state=42)
X = np.vstack([X1, X2])
y_true = np.hstack([np.zeros(100), np.ones(100)])
for eps in [0.5, 1.0, 1.5, 2.0]:
    dbscan = DBSCAN(eps=eps, min_samples=5)
    labels = dbscan.fit_predict(X)
    ari = adjusted_rand_score(y_true, labels)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: {n_clusters} clusters, ARI={ari:.3f}")
print("\nNote: DBSCAN struggles with varying densities!")
Task: Try OPTICS, which handles varying densities better than DBSCAN.
Show Solution
from sklearn.cluster import OPTICS, DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
X, y_true = make_moons(n_samples=200, noise=0.1, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
# OPTICS - similar to DBSCAN but more flexible
optics = OPTICS(min_samples=5, xi=0.05, min_cluster_size=0.1)
optics_labels = optics.fit_predict(X_scaled)
# Compare with DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)
print(f"OPTICS - Clusters: {len(set(optics_labels)) - (1 if -1 in optics_labels else 0)}")
print(f"DBSCAN - Clusters: {len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)}")
print(f"OPTICS ARI: {adjusted_rand_score(y_true, optics_labels):.3f}")
print(f"DBSCAN ARI: {adjusted_rand_score(y_true, dbscan_labels):.3f}")
Task: Create a grid search to find the best eps and min_samples combination.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import itertools
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
# Grid search
eps_values = [0.2, 0.3, 0.4, 0.5]
min_samples_values = [3, 5, 7, 10]
best_score = -1
best_params = {}
for eps, min_s in itertools.product(eps_values, min_samples_values):
    dbscan = DBSCAN(eps=eps, min_samples=min_s)
    labels = dbscan.fit_predict(X_scaled)
    # Skip if only one cluster or all noise
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    if n_clusters >= 2:
        mask = labels != -1
        if mask.sum() > 1:
            score = silhouette_score(X_scaled[mask], labels[mask])
            if score > best_score:
                best_score = score
                best_params = {'eps': eps, 'min_samples': min_s}
print(f"Best params: {best_params}")
print(f"Best silhouette: {best_score:.3f}")
Algorithm Comparison
Each clustering algorithm has its strengths and ideal use cases. Understanding when to use which algorithm is key to successful clustering projects.
Quick Decision Helper (Beginner's Cheat Sheet)
Not sure which algorithm to pick? Answer these questions:
"I need speed & simplicity"
→ Use K-Means
Fast, works with millions of points, easy to understand. Just need to pick K.
"I want to see relationships"
→ Use Hierarchical
See how groups relate to each other. Pick K after looking at the dendrogram.
"I have weird shapes or outliers"
→ Use DBSCAN
Finds any shape, spots outliers automatically. No need to specify K.
Pro Tip: When in doubt, try all three and compare silhouette scores! The code at the bottom of this section shows you how.
Quick Comparison Table
Why Does Choosing the Right Algorithm Matter?
Using K-Means on crescent-shaped data is like using a hammer to drive in a screw - it might sort of work, but poorly. Matching the algorithm to the shape of your data saves time and produces meaningful results. The table below helps you make the right choice!
| Aspect | K-Means | Hierarchical | DBSCAN |
|---|---|---|---|
| Specify K? | Yes, required | Can cut dendrogram later | No, finds automatically |
| Cluster Shape | Spherical only | Depends on linkage | Any shape |
| Handles Outliers | No (sensitive) | Somewhat (depends) | Yes (labels as noise) |
| Scalability | Excellent (O(n)) | Poor (O(n²) or O(n³)) | Good (O(n log n) with index) |
| Interpretability | Centroids are interpretable | Dendrogram shows hierarchy | Core/border points |
| Best For | Large data, spherical clusters | Understanding relationships | Arbitrary shapes, outlier detection |
How to Read This Table
Focus on your main constraint: Speed? → K-Means wins. Don't know K? → DBSCAN or Hierarchical. Weird shapes? → DBSCAN. Want to explore hierarchy? → Hierarchical. When in doubt, try multiple algorithms and compare silhouette scores!
Decision Guide
Use K-Means When
- You have a rough idea of K
- Clusters are roughly spherical
- Dataset is large (>10,000 points)
- Speed is important
- Need reproducible results
Use Hierarchical When
- Want to explore different K values
- Need to understand cluster relationships
- Dataset is small-medium (<5,000)
- Creating taxonomy/hierarchy
- Visualization is important
Use DBSCAN When
- Don't know number of clusters
- Clusters are non-spherical
- Data has outliers/noise
- Clusters have similar density
- Need anomaly detection
# Complete Comparison on Same Dataset
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score, silhouette_score
import numpy as np
# Non-spherical data with some noise
X, y_true = make_moons(n_samples=200, noise=0.1, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply all three algorithms
results = {}
# K-Means
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
results['K-Means'] = kmeans.fit_predict(X_scaled)
# Hierarchical
agg = AgglomerativeClustering(n_clusters=2)
results['Hierarchical'] = agg.fit_predict(X_scaled)
# DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
results['DBSCAN'] = dbscan.fit_predict(X_scaled)
# Compare results
print("Algorithm | ARI | Silhouette")
print("-" * 40)
for name, labels in results.items():
    ari = adjusted_rand_score(y_true, labels)
    # For silhouette, exclude noise points in DBSCAN
    mask = labels != -1
    if mask.sum() > 1 and len(set(labels[mask])) > 1:
        sil = silhouette_score(X_scaled[mask], labels[mask])
    else:
        sil = float('nan')
    print(f"{name:14s} | {ari:.3f} | {sil:.3f}")
Practice Questions: Algorithm Comparison
Test your understanding with these coding challenges.
Task: Apply K-Means, Agglomerative, and DBSCAN to make_blobs data and compare ARI.
Show Solution
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
models = {
'K-Means': KMeans(n_clusters=3, random_state=42, n_init=10),
'Agglomerative': AgglomerativeClustering(n_clusters=3),
'DBSCAN': DBSCAN(eps=1.0, min_samples=5)
}
for name, model in models.items():
    labels = model.fit_predict(X)
    ari = adjusted_rand_score(y_true, labels)
    print(f"{name}: ARI = {ari:.3f}")
Task: Use make_circles (concentric circles) - which algorithm works best?
Show Solution
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
X, y_true = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
models = {
'K-Means': KMeans(n_clusters=2, random_state=42, n_init=10),
'Agglomerative': AgglomerativeClustering(n_clusters=2),
'DBSCAN': DBSCAN(eps=0.3, min_samples=5)
}
for name, model in models.items():
    labels = model.fit_predict(X_scaled)
    ari = adjusted_rand_score(y_true, labels)
    print(f"{name}: ARI = {ari:.3f}")
# DBSCAN should win on concentric circles!
Task: Create a function that tries all 3 algorithms and returns the best one based on silhouette score.
Show Solution
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons
def best_clustering(X, n_clusters=3):
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    results = {}
    # K-Means
    km = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    km_labels = km.fit_predict(X_scaled)
    results['K-Means'] = silhouette_score(X_scaled, km_labels)
    # Agglomerative
    agg = AgglomerativeClustering(n_clusters=n_clusters)
    agg_labels = agg.fit_predict(X_scaled)
    results['Agglomerative'] = silhouette_score(X_scaled, agg_labels)
    # DBSCAN
    db = DBSCAN(eps=0.5, min_samples=5)
    db_labels = db.fit_predict(X_scaled)
    mask = db_labels != -1
    if len(set(db_labels[mask])) >= 2:
        results['DBSCAN'] = silhouette_score(X_scaled[mask], db_labels[mask])
    best = max(results, key=results.get)
    return best, results
# Test it
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
best, scores = best_clustering(X, n_clusters=2)
print(f"Best: {best}")
print(f"Scores: {scores}")
Task: Time how long each algorithm takes to cluster the same dataset.
Show Solution
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
import time
X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)
models = {
'K-Means': KMeans(n_clusters=5, random_state=42, n_init=10),
'Agglomerative': AgglomerativeClustering(n_clusters=5),
'DBSCAN': DBSCAN(eps=1.0, min_samples=5)
}
print("Algorithm | Time (ms)")
print("-" * 30)
for name, model in models.items():
    start = time.time()
    model.fit_predict(X)
    elapsed = (time.time() - start) * 1000
    print(f"{name:14s} | {elapsed:.2f}")
Task: Create elongated blobs using a transformation and compare algorithms.
Show Solution
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
import numpy as np
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
# Stretch the blobs to make them elongated
transformation = [[0.6, -0.6], [-0.4, 0.8]]
X_aniso = np.dot(X, transformation)
models = {
'K-Means': KMeans(n_clusters=3, random_state=42, n_init=10),
'Agglomerative': AgglomerativeClustering(n_clusters=3),
'DBSCAN': DBSCAN(eps=0.5, min_samples=5)
}
print("Algorithm | ARI")
print("-" * 25)
for name, model in models.items():
    labels = model.fit_predict(X_aniso)
    ari = adjusted_rand_score(y_true, labels)
    print(f"{name:14s} | {ari:.3f}")
Task: Add 20 random outliers to blob data and see which algorithm handles them best.
Show Solution
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
import numpy as np
X, y_true = make_blobs(n_samples=200, centers=3, random_state=42)
# Add outliers
outliers = np.random.uniform(-15, 15, size=(20, 2))
X_with_outliers = np.vstack([X, outliers])
y_with_outliers = np.hstack([y_true, [-1] * 20]) # -1 for outliers
models = {
'K-Means': KMeans(n_clusters=3, random_state=42, n_init=10),
'Agglomerative': AgglomerativeClustering(n_clusters=3),
'DBSCAN': DBSCAN(eps=1.5, min_samples=5)
}
print("Algorithm | Clusters | Noise detected")
print("-" * 45)
for name, model in models.items():
    labels = model.fit_predict(X_with_outliers)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(labels).count(-1)
    print(f"{name:14s} | {n_clusters:8d} | {n_noise}")
Task: Create a complete pipeline that scales data, tries all algorithms, and reports the best.
Show Solution
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import load_wine
def cluster_comparison(X, k_range=(2, 6)):
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    results = []
    # K-Means with different K
    for k in range(*k_range):
        km = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = km.fit_predict(X_scaled)
        score = silhouette_score(X_scaled, labels)
        results.append(('K-Means', k, score))
    # Agglomerative with different K
    for k in range(*k_range):
        agg = AgglomerativeClustering(n_clusters=k)
        labels = agg.fit_predict(X_scaled)
        score = silhouette_score(X_scaled, labels)
        results.append(('Agglomerative', k, score))
    # Best result
    best = max(results, key=lambda x: x[2])
    return best, sorted(results, key=lambda x: -x[2])[:5]
wine = load_wine()
best, top5 = cluster_comparison(wine.data)
print(f"Best: {best[0]} with k={best[1]}, silhouette={best[2]:.3f}")
print("\nTop 5:")
for alg, k, score in top5:
print(f" {alg} (k={k}): {score:.3f}")
Task: Run multiple clusterings and use consensus/voting to get final labels.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
import numpy as np
from scipy.stats import mode
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
# Run K-Means multiple times with different seeds
all_labels = []
for seed in range(10):
    km = KMeans(n_clusters=3, random_state=seed, n_init=1)
    labels = km.fit_predict(X)
    all_labels.append(labels)
all_labels = np.array(all_labels)  # Shape: (10, 300)
# Cluster IDs are arbitrary per run, so align each run's labels to the
# first run before voting (Hungarian matching on the confusion matrix)
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix
aligned = [all_labels[0]]
for labels in all_labels[1:]:
    cm = confusion_matrix(labels, all_labels[0])
    rows, cols = linear_sum_assignment(-cm)
    mapping = dict(zip(rows, cols))
    aligned.append(np.array([mapping[l] for l in labels]))
# Consensus: majority vote for each point
consensus_labels, _ = mode(np.array(aligned), axis=0)
consensus_labels = np.asarray(consensus_labels).flatten()
# Compare
single_ari = adjusted_rand_score(y_true, all_labels[0])
consensus_ari = adjusted_rand_score(y_true, consensus_labels)
print(f"Single run ARI: {single_ari:.3f}")
print(f"Consensus ARI: {consensus_ari:.3f}")
print("Ensemble often improves stability!")