Introduction to Clustering
Clustering is the art of finding natural groups in data without being told what to look for. Unlike classification, where we have labels, clustering is unsupervised learning: we let the algorithm discover patterns on its own. It's like sorting a pile of mixed fruit without knowing the names - you'd naturally group apples with apples and oranges with oranges based on appearance.
What is Clustering? (Simple Definition)
Clustering is like being a detective who organizes clues into piles without knowing what crime was committed. You look at similarities and group things together.
Real-Life Examples
- Spotify grouping similar songs into playlists
- Netflix suggesting shows like ones you watched
- Stores grouping customers by shopping habits
- Email providers separating spam from real mail
Key Question Clustering Answers
"I have a bunch of data points. Which ones are similar to each other? Can I organize them into meaningful groups without knowing ahead of time what those groups should be?"
Types of Clustering
Centroid-Based
Clusters are defined by a central point (centroid). K-Means is the classic example. Works best with spherical, similarly-sized clusters.
Hierarchical
Builds a tree of clusters. Can be agglomerative (bottom-up) or divisive (top-down). Great for understanding cluster relationships.
Density-Based
Clusters are regions of high density separated by low-density areas. DBSCAN can find clusters of any shape and detect outliers.
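These three families can behave very differently on the same data. Below is a minimal sketch comparing one algorithm from each family on the two-moons shape; the DBSCAN `eps=0.3` setting is an illustrative choice for this noise level, not a tuned value.

```python
# Sketch: one algorithm from each clustering family on the same dataset
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

models = {
    "Centroid (K-Means)": KMeans(n_clusters=2, n_init=10, random_state=42),
    "Hierarchical (Agglomerative)": AgglomerativeClustering(n_clusters=2),
    "Density (DBSCAN)": DBSCAN(eps=0.3),
}
for name, model in models.items():
    labels = model.fit_predict(X)
    # DBSCAN labels noise points as -1, so exclude them from the count
    print(f"{name}: {len(set(labels) - {-1})} clusters found")
```

Only the density-based method can follow the curved moon shapes; the centroid-based and (ward-linked) hierarchical methods tend to cut the moons in half.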
Essential Vocabulary (Know These Terms!)
- Cluster: A group of similar data points that belong together
- Centroid: The center point of a cluster (like the average of all points)
- Distance: How far apart two data points are (usually Euclidean distance)
- Inertia: Sum of squared distances of all points to their centroids (lower = tighter clusters)
- Unsupervised: Learning without labels - the algorithm discovers patterns alone
- Outlier: A data point that doesn't fit into any cluster (noise)
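Two of these terms, distance and inertia, are easy to compute by hand on a toy dataset to make them concrete. The four points below are made up for illustration, and inertia here follows scikit-learn's definition: the sum of squared distances to the assigned centroids.

```python
# Sketch: computing "centroid", "distance", and "inertia" by hand
import numpy as np

points = np.array([[1.0, 2.0], [2.0, 3.0], [8.0, 8.0], [9.0, 9.0]])
labels = np.array([0, 0, 1, 1])  # two clusters of two points each

# Centroid = mean of each cluster's points: (1.5, 2.5) and (8.5, 8.5)
centroids = np.array([points[labels == k].mean(axis=0) for k in (0, 1)])
print("Centroids:\n", centroids)

# Euclidean distance of the first point to its own centroid
d = np.linalg.norm(points[0] - centroids[0])
print(f"Distance: {d:.3f}")

# Inertia = sum of squared distances of all points to their centroids
inertia = sum(np.sum((p - centroids[l]) ** 2) for p, l in zip(points, labels))
print(f"Inertia: {inertia:.3f}")
```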
Real-World Applications
Customer Segmentation
Group customers by purchasing behavior to create targeted marketing campaigns. High-value customers, bargain hunters, and seasonal shoppers emerge naturally.
Image Segmentation
Group pixels by color/texture to identify objects in images. Used in medical imaging to identify tumors or in self-driving cars to recognize road elements.
Gene Expression Analysis
Cluster genes with similar expression patterns to discover gene families and understand biological processes.
Anomaly Detection
Points that don't belong to any cluster are outliers. Used in fraud detection, network intrusion detection, and quality control.
# Quick Clustering Demo
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import numpy as np
# Generate sample data with 3 natural clusters
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=42)
# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X)
print(f"Data shape: {X.shape}")
print(f"Cluster labels: {np.unique(y_pred)}")
print(f"Cluster sizes: {np.bincount(y_pred)}")
print(f"Cluster centers shape: {kmeans.cluster_centers_.shape}")
Practice Questions: Introduction
Test your understanding with these coding challenges.
Task: Use make_blobs to create data with 4 clusters and 500 samples. Print shape and center count.
Show Solution
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=500, centers=4, random_state=42)
print(f"Data shape: {X.shape}")
print(f"Number of clusters: {len(set(y))}")
print(f"Samples per cluster: {[list(y).count(i) for i in range(4)]}")
Task: Generate blob data, cluster with KMeans, and calculate adjusted rand score.
Show Solution
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X)
ari = adjusted_rand_score(y_true, y_pred)
print(f"Adjusted Rand Index: {ari:.3f}")
# 1.0 = perfect clustering, 0 = random
Task: Load Iris, apply KMeans with k=3, compare with true species labels using ARI.
Show Solution
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler
iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X_scaled)
ari = adjusted_rand_score(iris.target, y_pred)
print(f"ARI (vs true species): {ari:.3f}")
Task: Use make_moons to create 300 samples with noise=0.1. Print the shape and unique labels.
Show Solution
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=300, noise=0.1, random_state=42)
print(f"Data shape: {X.shape}")
print(f"Unique labels: {set(y)}")
print(f"Samples per class: {[list(y).count(i) for i in [0, 1]]}")
Task: Load Wine dataset, apply StandardScaler, then cluster with KMeans. Compare mean and std before/after scaling.
Show Solution
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
import numpy as np
wine = load_wine()
X = wine.data
print("Before scaling:")
print(f"Mean: {np.mean(X, axis=0)[:3]}...") # First 3 features
print(f"Std: {np.std(X, axis=0)[:3]}...")
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("\nAfter scaling:")
print(f"Mean: {np.mean(X_scaled, axis=0)[:3]}...") # ~0
print(f"Std: {np.std(X_scaled, axis=0)[:3]}...") # ~1
Task: Use make_circles to create concentric circles data. Print shape and apply KMeans - note why it fails.
Show Solution
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
X, y_true = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)
print(f"Data shape: {X.shape}")
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X)
ari = adjusted_rand_score(y_true, y_pred)
print(f"ARI: {ari:.3f}")
print("K-Means fails on circles because it assumes spherical clusters!")
Task: Cluster blob data and calculate ARI, NMI (Normalized Mutual Information), and Silhouette Score.
Show Solution
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, silhouette_score
X, y_true = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X)
ari = adjusted_rand_score(y_true, y_pred)
nmi = normalized_mutual_info_score(y_true, y_pred)
sil = silhouette_score(X, y_pred)
print(f"Adjusted Rand Index: {ari:.3f}")
print(f"Normalized Mutual Info: {nmi:.3f}")
print(f"Silhouette Score: {sil:.3f}")
Task: Cluster data and print a simple text-based visualization showing cluster assignments.
Show Solution
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import numpy as np
X, _ = make_blobs(n_samples=20, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
print("Cluster Assignments:")
for cluster in range(3):
indices = np.where(labels == cluster)[0]
print(f"\nCluster {cluster}: {len(indices)} points")
print(f" Points: {indices.tolist()}")
print(f" Centroid: {kmeans.cluster_centers_[cluster].round(2)}")
K-Means Clustering
K-Means is the most popular clustering algorithm due to its simplicity and speed. It partitions data into K clusters where each point belongs to the cluster with the nearest centroid (center). Think of it like placing K magnets in a room and letting each data point stick to its nearest magnet.
Understanding K-Means (Beginner's Guide)
What is "K"?
K is the number of groups you want. You choose K before running the algorithm. K=3 means "find me 3 groups."
What is a "Centroid"?
The center point of each cluster. Like the flag pole in the middle of a camp - everyone in that group gathers around it.
What is "Inertia"?
Measures how spread out points are from their centroid. Lower inertia = tighter, better clusters.
Think of it like this: You're a teacher splitting 30 students into K study groups. You want students in each group to be as similar as possible (by skill level, interests, etc.). The centroid is like the "average student" representing each group.
The K-Means Algorithm
Initialize
Randomly select K data points as initial centroids
Assign
Assign each point to nearest centroid
Update
Move centroid to mean of assigned points
Repeat
Go to step 2 until centroids stop moving
Watch K-Means in Action (Step-by-Step Example)
| Step | What Happens | Example (K=2) |
|---|---|---|
| 1 | Pick K random points as starting centroids | Centroid A = (2,3), Centroid B = (8,7) |
| 2 | Each point measures distance to all centroids, joins nearest | Point (1,2) → closer to A → joins Cluster A |
| 3 | Move each centroid to the average position of its members | Cluster A members average = (2.5, 2.8) → new centroid |
| 4 | Repeat steps 2-3 until centroids stop moving | Usually 5-20 iterations until stable |
| Done! | Final clusters are ready | Each point has a cluster label (0 or 1) |
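The four steps in the table can be written out as a short from-scratch sketch in NumPy. This is illustrative only: sklearn's implementation adds k-means++ seeding, multiple restarts, and many optimizations, and the two well-separated blobs below are synthetic.

```python
# Minimal from-scratch K-Means following the four steps above
import numpy as np

rng = np.random.default_rng(42)
# Two well-separated blobs around (0, 0) and (5, 5)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

K = 2
# Step 1: Initialize - pick K random data points as starting centroids
centroids = X[rng.choice(len(X), K, replace=False)]

for iteration in range(100):
    # Step 2: Assign - each point joins its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 3: Update - move each centroid to the mean of its members
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Step 4: Repeat - stop once centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(f"Converged after {iteration + 1} iterations")
print(f"Centroids:\n{centroids.round(2)}")
```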
# K-Means Clustering with Scikit-learn
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
# Always scale your data for K-Means!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Create and fit K-Means
kmeans = KMeans(
n_clusters=4, # Number of clusters (K)
init='k-means++', # Smart initialization (default)
n_init=10, # Run 10 times with different seeds
max_iter=300, # Max iterations per run
random_state=42
)
kmeans.fit(X_scaled)
# Results
print(f"Cluster labels: {kmeans.labels_[:10]}...") # First 10 labels
print(f"Cluster centers:\n{kmeans.cluster_centers_}")
print(f"Inertia (within-cluster sum of squares): {kmeans.inertia_:.2f}")
print(f"Number of iterations: {kmeans.n_iter_}")
K-Means++ Initialization
Random initialization can lead to poor results if centroids start too close together. K-Means++ is a smarter initialization that spreads initial centroids apart:
Random Init Problems
- Centroids may start in same cluster
- Results vary greatly between runs
- May converge to poor local minimum
- Needs many restarts (n_init)
K-Means++ Benefits
- Spreads centroids across data
- More consistent results
- Faster convergence
- Better final clustering
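The seeding rule behind K-Means++ is simple to sketch: after the first centroid is picked uniformly at random, each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far, so far-away regions are likely to get a seed. This is a simplified version of the idea; sklearn's internal routine is more elaborate.

```python
# Sketch of the k-means++ seeding rule (simplified, for illustration)
import numpy as np
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
rng = np.random.default_rng(42)
K = 4

centroids = [X[rng.integers(len(X))]]  # first centroid: uniform random point
for _ in range(K - 1):
    # Squared distance of every point to its nearest chosen centroid
    d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
    probs = d2 / d2.sum()  # far points get higher sampling probability
    centroids.append(X[rng.choice(len(X), p=probs)])

print("Seed centroids:\n", np.array(centroids).round(2))
```

Because an already-chosen point has squared distance zero, it can never be picked twice, and the seeds naturally spread across the data.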
# Comparing Random vs K-Means++ Initialization
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
# Random initialization
kmeans_random = KMeans(n_clusters=4, init='random', n_init=1, random_state=42)
kmeans_random.fit(X)
print(f"Random init - Inertia: {kmeans_random.inertia_:.2f}, Iterations: {kmeans_random.n_iter_}")
# K-Means++ initialization (default)
kmeans_pp = KMeans(n_clusters=4, init='k-means++', n_init=1, random_state=42)
kmeans_pp.fit(X)
print(f"K-Means++ - Inertia: {kmeans_pp.inertia_:.2f}, Iterations: {kmeans_pp.n_iter_}")
# Lower inertia = tighter clusters = better!
Limitations of K-Means
- Must specify K: You need to know/guess the number of clusters
- Spherical clusters: Assumes clusters are roughly round and similar in size
- Sensitive to outliers: Outliers can pull centroids away from dense regions
- Sensitive to scale: Always standardize your features!
Common Beginner Mistakes to Avoid
Mistake #1: Forgetting to Scale
If salary is 50,000 and age is 30, K-Means will think salary matters 1,667x more! Always use StandardScaler first.
Mistake #2: Random K Selection
Don't just guess K=5 because it "sounds right." Use Elbow Method or Silhouette Score to find the optimal value.
Mistake #3: Using K-Means on Non-Spherical Data
K-Means assumes round clusters. Moon-shaped or ring-shaped data? Use DBSCAN instead.
Mistake #4: Running Only Once
K-Means results depend on initial centroid positions. Use n_init=10 to run 10 times and pick the best result.
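Mistake #1 is easy to demonstrate with toy numbers (the salary/age rows below are made up). Two customers with nearly identical salaries but very different ages look almost identical to raw Euclidean distance, because the salary axis dwarfs the age axis; after standardization, the age difference dominates as it should.

```python
# Sketch of Mistake #1: raw vs scaled distances on toy salary/age data
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: salary, age (made-up values)
X = np.array([[50_000, 30], [51_000, 60], [90_000, 31], [91_000, 59]], dtype=float)

# Raw distance between customers 0 and 1: the 30-year age gap barely registers
d_raw = np.linalg.norm(X[0] - X[1])
print(f"Raw distance:    {d_raw:.1f}")

X_scaled = StandardScaler().fit_transform(X)
d_scaled = np.linalg.norm(X_scaled[0] - X_scaled[1])
print(f"Scaled distance: {d_scaled:.3f}")
```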
Practice Questions: K-Means
Test your understanding with these coding challenges.
Task: Load wine dataset, scale features, apply K-Means with k=3, print cluster sizes.
Show Solution
from sklearn.datasets import load_wine
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np
wine = load_wine()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(wine.data)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
print(f"Cluster sizes: {np.bincount(labels)}")
Task: Train KMeans, then use .predict() to assign new data points to clusters.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np
X_train, _ = make_blobs(n_samples=200, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X_train)
# New data points to predict
X_new = np.array([[0, 0], [5, 5], [-5, -5]])
new_labels = kmeans.predict(X_new)
print(f"New point clusters: {new_labels}")
Task: Run K-Means for k=1 to 10 and print the inertia for each.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=5, random_state=42)
inertias = []
for k in range(1, 11):
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X)
inertias.append(kmeans.inertia_)
print(f"k={k:2d}: Inertia = {kmeans.inertia_:.2f}")
# Look for the "elbow" where inertia stops decreasing rapidly
Task: Train KMeans with k=3 and print the coordinates of all cluster centers.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)
print("Cluster Centers:")
for i, center in enumerate(kmeans.cluster_centers_):
print(f"Cluster {i}: ({center[0]:.2f}, {center[1]:.2f})")
Task: Run KMeans with random and k-means++ init. Compare inertia values for 5 different random states.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
print("Random State | Random Init | K-Means++")
print("-" * 40)
for rs in range(5):
km_rand = KMeans(n_clusters=4, init='random', n_init=1, random_state=rs)
km_pp = KMeans(n_clusters=4, init='k-means++', n_init=1, random_state=rs)
km_rand.fit(X)
km_pp.fit(X)
print(f"{rs:12d} | {km_rand.inertia_:11.2f} | {km_pp.inertia_:.2f}")
Task: Cluster the Digits dataset (first 2 PCA components) and show cluster size distribution.
Show Solution
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import numpy as np
digits = load_digits()
pca = PCA(n_components=2)
X_pca = pca.fit_transform(digits.data)
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_pca)
print("Cluster Size Distribution:")
for i in range(10):
count = np.sum(labels == i)
print(f"Cluster {i}: {count} samples ({count/len(labels)*100:.1f}%)")
Task: After clustering, calculate the distance of each point to its assigned centroid and find the furthest point.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
# Calculate distance to assigned centroid
distances = []
for i, point in enumerate(X):
centroid = kmeans.cluster_centers_[labels[i]]
dist = np.linalg.norm(point - centroid)
distances.append(dist)
print(f"Average distance to centroid: {np.mean(distances):.3f}")
print(f"Max distance: {np.max(distances):.3f}")
print(f"Furthest point index: {np.argmax(distances)}")
Task: Use KMeans transform() method to get distances from each point to all centroids.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np
X, _ = make_blobs(n_samples=10, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)
# transform() returns distance to each centroid
distances = kmeans.transform(X)
print("Distances to each centroid:")
print(f"Shape: {distances.shape}") # (n_samples, n_clusters)
print(f"\nFirst 5 points:")
for i in range(5):
print(f"Point {i}: {distances[i].round(2)} -> Closest: Cluster {np.argmin(distances[i])}")
Choosing the Optimal K
The biggest challenge in K-Means is choosing the right number of clusters. Too few clusters underfit (miss patterns), too many overfit (find noise). The Elbow Method and Silhouette Score are the two most popular techniques to find the "sweet spot."
The "How Many Groups?" Problem (Beginner's Guide)
Choosing K is like deciding how many drawers you need to organize your clothes. Too few = everything crammed together. Too many = mostly empty drawers and wasted space.
Elbow Method (Visual Approach)
What it measures: How much "tightness" you gain by adding more clusters
What to look for: The "elbow" point where the curve bends - adding more clusters after this doesn't help much
Analogy: Like squeezing a sponge. At first, lots of water comes out. Eventually, squeezing harder barely helps. The elbow is when you should stop squeezing.
Silhouette Score (Quality Score)
What it measures: How well each point fits in its cluster vs neighboring clusters
Score range: -1 (wrong cluster) → 0 (border) → +1 (perfect fit)
Analogy: Like asking each student "Do you feel you belong in this group?" High scores mean happy, well-placed students.
The Elbow Method
Plot inertia (within-cluster sum of squares) for different K values. Look for the "elbow" - the point where adding more clusters doesn't significantly reduce inertia anymore.
# Elbow Method
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
# Calculate inertia for different K values
k_values = range(1, 11)
inertias = []
for k in k_values:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X)
inertias.append(kmeans.inertia_)
# Print results (for plotting)
for k, inertia in zip(k_values, inertias):
print(f"k={k:2d}: Inertia = {inertia:8.2f}")
# The "elbow" is typically where the curve bends - around k=4 here
Silhouette Score
The Silhouette Score measures how similar a point is to its own cluster compared to other clusters. It ranges from -1 to 1:
- +1: Point is well-matched to its cluster and poorly matched to neighbors
- 0: Point is on the border between two clusters
- -1: Point is probably assigned to the wrong cluster
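Behind these numbers is the formula s = (b - a) / max(a, b), where a is the point's mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. A hand computation on four toy points, cross-checked against sklearn:

```python
# Sketch: silhouette score for one point, by hand vs sklearn
import numpy as np
from sklearn.metrics import silhouette_samples

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])

p = X[0]
# a = mean distance to other points in the same cluster (excluding p itself)
a = np.mean([np.linalg.norm(p - q) for q in X[labels == 0]
             if not np.array_equal(q, p)])
# b = mean distance to points in the nearest other cluster
b = np.mean([np.linalg.norm(p - q) for q in X[labels == 1]])
s = (b - a) / max(a, b)
print(f"a = {a:.3f}, b = {b:.3f}, s = {s:.3f}")
print(f"sklearn says: {silhouette_samples(X, labels)[0]:.3f}")
```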
# Silhouette Score Analysis
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
# Calculate silhouette score for different K values
k_values = range(2, 11) # Silhouette needs at least 2 clusters
silhouette_scores = []
for k in k_values:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
score = silhouette_score(X, labels)
silhouette_scores.append(score)
print(f"k={k:2d}: Silhouette = {score:.3f}")
# Higher is better - find the maximum!
best_k = k_values[silhouette_scores.index(max(silhouette_scores))]
print(f"\nBest k by silhouette: {best_k}")
Combining Both Methods
# Complete Analysis
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
print("K | Inertia | Silhouette")
print("-" * 30)
for k in range(2, 11):
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
sil = silhouette_score(X, labels)
print(f"{k} | {kmeans.inertia_:10.2f} | {sil:.3f}")
# Look for: elbow in inertia + high silhouette score
Practice Questions: Choosing K
Test your understanding with these coding challenges.
Task: Cluster Iris data with k=3 and print the silhouette score.
Show Solution
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
score = silhouette_score(X_scaled, labels)
print(f"Silhouette Score (k=3): {score:.3f}")
Task: Test k=2 to 8 on Wine dataset and find the k with highest silhouette score.
Show Solution
from sklearn.datasets import load_wine
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
wine = load_wine()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(wine.data)
best_k, best_score = 2, -1
for k in range(2, 9):
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
score = silhouette_score(X_scaled, labels)
print(f"k={k}: {score:.3f}")
if score > best_score:
best_k, best_score = k, score
print(f"\nBest: k={best_k} with score {best_score:.3f}")
Task: Calculate the rate of change in inertia to programmatically find the elbow.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
inertias = []
for k in range(1, 11):
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X)
inertias.append(kmeans.inertia_)
# Calculate rate of change (first derivative)
deltas = np.diff(inertias)
# Elbow is where delta stops decreasing rapidly
ratios = deltas[:-1] / deltas[1:]
elbow = np.argmax(ratios) + 2 # +2 because of diff offsets
print(f"Inertias: {[int(i) for i in inertias]}")
print(f"Detected elbow at k={elbow}")
Task: Use silhouette_samples() to get per-point silhouette scores and find the worst-clustered point.
Show Solution
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
from sklearn.datasets import make_blobs
import numpy as np
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
scores = silhouette_samples(X, labels)
print(f"Mean silhouette: {scores.mean():.3f}")
print(f"Worst point index: {np.argmin(scores)}")
print(f"Worst point score: {scores.min():.3f}")
print(f"Best point score: {scores.max():.3f}")
Task: Calculate silhouette score for Iris, Wine, and Blobs datasets with their natural K values.
Show Solution
from sklearn.datasets import load_iris, load_wine, make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
datasets = {
'Iris (k=3)': (load_iris().data, 3),
'Wine (k=3)': (load_wine().data, 3),
'Blobs (k=4)': (make_blobs(n_samples=300, centers=4, random_state=42)[0], 4)
}
for name, (X, k) in datasets.items():
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
score = silhouette_score(X_scaled, labels)
print(f"{name}: Silhouette = {score:.3f}")
Task: Use calinski_harabasz_score (another metric - higher is better) to find optimal K.
Show Solution
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
print("K | Calinski-Harabasz Score")
print("-" * 30)
best_k, best_score = 2, -1
for k in range(2, 11):
labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
score = calinski_harabasz_score(X, labels)
print(f"{k} | {score:.2f}")
if score > best_score:
best_k, best_score = k, score
print(f"\nBest K: {best_k}")
Task: Calculate Davies-Bouldin Index (lower is better) for different K values.
Show Solution
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
print("K | Davies-Bouldin Score (lower = better)")
print("-" * 40)
best_k, best_score = 2, float('inf')
for k in range(2, 11):
labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
score = davies_bouldin_score(X, labels)
print(f"{k} | {score:.3f}")
if score < best_score:
best_k, best_score = k, score
print(f"\nBest K: {best_k} (score: {best_score:.3f})")
Task: Create a comprehensive table showing Silhouette, Calinski-Harabasz, and Davies-Bouldin for k=2-8.
Show Solution
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
print("K | Silhouette | Calinski-H | Davies-B")
print("-" * 45)
for k in range(2, 9):
labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
sil = silhouette_score(X, labels)
ch = calinski_harabasz_score(X, labels)
db = davies_bouldin_score(X, labels)
print(f"{k} | {sil:.3f} | {ch:7.1f} | {db:.3f}")
print("\nGoal: High Silhouette, High CH, Low DB")
Hierarchical Clustering
Hierarchical Clustering builds a tree of clusters called a dendrogram. Unlike K-Means, you don't need to specify K upfront - you can cut the tree at any level to get different numbers of clusters. It's like organizing a family tree where similar family members are grouped together.
Understanding Hierarchical Clustering (Beginner's Guide)
What is a Dendrogram?
A tree diagram showing how clusters merge together. The height shows how different the merged groups are. Think of it like a company org chart, but for data similarity.
Why is it Flexible?
You can "cut" the tree at any height to get 2 clusters, 5 clusters, or any number! Unlike K-Means where you must decide K before training.
School Analogy: Imagine organizing all students by similarity. First, best friends pair up (most similar). Then pairs join to form friend groups. Groups combine into class sections. Sections form grades. Grades form the whole school. The dendrogram shows this entire hierarchy - cut at "grades" level to see 4 groups, or at "friend groups" level to see 50 groups!
Agglomerative vs Divisive
Agglomerative (Bottom-Up)
Start with each point as its own cluster. Merge the two closest clusters repeatedly until one cluster remains. Most common approach.
Divisive (Top-Down)
Start with all points in one cluster. Split recursively until each point is its own cluster. Less common but useful for some problems.
# Agglomerative Hierarchical Clustering
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
X, _ = make_blobs(n_samples=50, centers=3, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Agglomerative clustering with 3 clusters
agg = AgglomerativeClustering(
n_clusters=3, # Number of clusters
linkage='ward', # Linkage criterion
metric='euclidean' # Distance metric (ward requires euclidean)
)
labels = agg.fit_predict(X_scaled)
print(f"Cluster labels: {labels}")
print(f"Cluster sizes: {[list(labels).count(i) for i in range(3)]}")
Applies Agglomerative (bottom-up) Hierarchical Clustering using Ward linkage, which minimizes within-cluster variance. Unlike K-Means, hierarchical clustering builds a tree structure by progressively merging the closest clusters. The output shows which cluster each point belongs to and the size of each resulting cluster.
Linkage Methods
When merging clusters, how do we measure the distance between them? Different linkage methods give different results:
What's "Linkage"? (Simple Explanation)
When you merge two groups, you need to know how "far apart" they are. But groups have multiple points! Linkage is the rule for measuring distance between groups. Different rules give different cluster shapes.
Ward
Minimizes within-cluster variance. Creates compact, spherical clusters. Default and usually best choice.
Single
Distance = minimum between any two points. Can create elongated chains. Good for non-spherical clusters.
Complete
Distance = maximum between any two points. Creates compact clusters. Sensitive to outliers.
Average
Distance = average of all pairwise distances. Balanced approach between single and complete.
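The three pairwise-distance rules are easy to see on two tiny hand-made clusters (ward is omitted here because it is variance-based rather than a simple pairwise-distance rule):

```python
# Sketch: single, complete, and average linkage between two tiny clusters
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [1.0, 0.0]])  # cluster A
B = np.array([[4.0, 0.0], [6.0, 0.0]])  # cluster B

D = cdist(A, B)  # all pairwise distances between A's and B's points
print(f"Single   (min):  {D.min():.2f}")   # 3.00
print(f"Complete (max):  {D.max():.2f}")   # 6.00
print(f"Average  (mean): {D.mean():.2f}")  # 4.50
```

Single linkage only cares about the closest pair (which is why it can chain clusters together), complete linkage about the farthest pair, and average linkage sits in between.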
Understanding Dendrograms
# Creating and Analyzing Dendrograms
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs
import numpy as np
# Small dataset for visualization
X, _ = make_blobs(n_samples=10, centers=3, random_state=42)
# Compute linkage matrix
Z = linkage(X, method='ward')
# The linkage matrix Z has 4 columns:
# [cluster1_idx, cluster2_idx, distance, n_points_in_new_cluster]
print("Linkage Matrix (first 5 merges):")
print("Cluster1 | Cluster2 | Distance | Size")
for i, row in enumerate(Z[:5]):
print(f" {int(row[0]):5d} | {int(row[1]):5d} | {row[2]:.2f} | {int(row[3])}")
# To cut the dendrogram at a specific number of clusters:
from scipy.cluster.hierarchy import fcluster
labels_3 = fcluster(Z, t=3, criterion='maxclust')
print(f"\n3 clusters: {labels_3}")
Practice Questions: Hierarchical Clustering
Test your understanding with these coding challenges.
Task: Cluster Iris data using AgglomerativeClustering with 3 clusters and ward linkage.
Show Solution
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X_scaled)
ari = adjusted_rand_score(iris.target, labels)
print(f"Adjusted Rand Index: {ari:.3f}")
Task: Compare ward, single, complete, and average linkage on the same data.
Show Solution
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
for linkage in ['ward', 'single', 'complete', 'average']:
agg = AgglomerativeClustering(n_clusters=3, linkage=linkage)
labels = agg.fit_predict(X_scaled)
score = silhouette_score(X_scaled, labels)
print(f"{linkage:8s}: Silhouette = {score:.3f}")
Task: Create linkage matrix and use fcluster to get 2, 3, 4, and 5 clusters.
Show Solution
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)
Z = linkage(X_scaled, method='ward')
for n_clusters in [2, 3, 4, 5]:
labels = fcluster(Z, t=n_clusters, criterion='maxclust')
score = silhouette_score(X_scaled, labels)
print(f"{n_clusters} clusters: Silhouette = {score:.3f}")
Task: Create a linkage matrix with scipy and print its shape and first 3 rows.
Show Solution
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=20, centers=3, random_state=42)
Z = linkage(X, method='ward')
print(f"Linkage matrix shape: {Z.shape}")
print(f"Columns: [cluster1, cluster2, distance, size]")
print(f"\nFirst 3 merges:")
for row in Z[:3]:
print(f" Merge {int(row[0])} + {int(row[1])}, dist={row[2]:.2f}, size={int(row[3])}")
Task: Use fcluster with criterion='distance' to cut at different height thresholds.
Show Solution
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=100, centers=4, random_state=42)
Z = linkage(X, method='ward')
print("Threshold | Clusters")
print("-" * 25)
for threshold in [5, 10, 15, 20, 30]:
    labels = fcluster(Z, t=threshold, criterion='distance')
    n_clusters = len(set(labels))
    print(f"{threshold:9.1f} | {n_clusters}")
Task: Show that sklearn's AgglomerativeClustering and scipy's fcluster give the same results.
Show Solution
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
# sklearn approach
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels_sklearn = agg.fit_predict(X)
# scipy approach
Z = linkage(X, method='ward')
labels_scipy = fcluster(Z, t=3, criterion='maxclust')
# Compare (labels may be numbered differently, use ARI)
ari = adjusted_rand_score(labels_sklearn, labels_scipy)
print(f"ARI between sklearn and scipy: {ari:.3f}")
print("(1.0 means identical clustering)")
Task: Calculate inconsistency statistics to automatically determine optimal cut point.
Show Solution
from scipy.cluster.hierarchy import linkage, fcluster, inconsistent
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
X, _ = make_blobs(n_samples=150, centers=4, random_state=42)
Z = linkage(X, method='ward')
# Inconsistency method - look for jumps
R = inconsistent(Z, d=2)
max_inconsistency = R[:, 3].max()
print(f"Max inconsistency: {max_inconsistency:.2f}")
# Try different inconsistency thresholds
for thresh in [0.7, 0.8, 0.9, 1.0]:
    labels = fcluster(Z, t=thresh, criterion='inconsistent')
    n_c = len(set(labels))
    if n_c >= 2:
        sil = silhouette_score(X, labels)
        print(f"threshold={thresh}: {n_c} clusters, silhouette={sil:.3f}")
Task: Calculate cophenetic correlation to measure how well the dendrogram preserves pairwise distances.
Show Solution
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=100, centers=4, random_state=42)
# Compare different linkage methods
print("Linkage Method | Cophenetic Correlation")
print("-" * 40)
for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)
    c, _ = cophenet(Z, pdist(X))
    print(f"{method:14s} | {c:.4f}")
print("\n(Higher = dendrogram better preserves original distances)")
DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters based on density. It doesn't require specifying the number of clusters and can find clusters of arbitrary shapes. Best of all, it automatically identifies outliers as noise points.
Understanding DBSCAN (Beginner's Guide)
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. Unlike K-Means which requires you to specify the number of clusters beforehand, DBSCAN automatically discovers clusters based on how densely packed the data points are. It's particularly powerful because it can find clusters of any shape and automatically identifies outliers!
eps (ε)
"Neighborhood Radius"
The maximum distance between two points for them to be considered neighbors. Think of it as drawing a circle around each point.
Like asking "Who lives within 1 mile of me?"
min_samples
"Minimum Crowd Size"
The minimum number of points required within eps distance for a point to be considered a "core point" that starts a cluster.
Like "Need 5 friends nearby to be a hub."
-1 Label
"Noise / Outliers"
Points that don't belong to any cluster are automatically labeled as -1. These are outliers or noise in your data.
Like lone houses in the countryside!
Types of Points in DBSCAN
Core Point
Has ≥ min_samples neighbors within eps distance
The "popular kids" who start clusters.
Border Point
Within eps of a core point, but not enough neighbors itself
The "fringe members" of a cluster.
Noise Point
Not within eps of any core point (label = -1)
The "loners" - outliers in your data.
How DBSCAN Works (Step by Step)
DBSCAN Superpower: Finds clusters of ANY shape (crescents, rings, spirals) because it follows density, not geometry. K-Means only finds round blobs! Plus, you don't need to specify the number of clusters - DBSCAN discovers them automatically!
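Since the step-by-step mechanics matter here, the whole algorithm fits in a short sketch. This is a toy, brute-force implementation written for this section (the function `dbscan_toy` is our own illustration, not sklearn code): visit each point, check whether it is a core point, and if so grow a cluster by absorbing neighbors of neighbors.

```python
import numpy as np

def dbscan_toy(X, eps=0.5, min_samples=5):
    """Brute-force DBSCAN for intuition only: O(n^2), use on tiny data."""
    n = len(X)
    labels = np.full(n, -1)  # -1 = noise until proven otherwise
    # Precompute all pairwise distances and each point's eps-neighborhood
    dists = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(n)]
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_samples:
            continue  # not a core point: stays noise unless a cluster claims it
        # i is a core point: start a new cluster and expand it
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # core or border point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_samples:
                    queue.extend(neighbors[j])  # j is also core: keep expanding
        cluster += 1
    return labels

# Quick sanity check on two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
labels = dbscan_toy(X, eps=0.8, min_samples=5)
print(sorted(set(labels.tolist())))
```

Note the expansion loop: border points get a cluster label but never extend the queue, which is exactly why clusters follow density rather than geometry.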
Key Concepts
Core Points
Points with at least min_samples neighbors within eps distance. These form the "heart" of clusters.
Border Points
Not core points themselves, but within eps of a core point. They're on the edge of clusters.
Noise Points
Not core or border points. These are outliers that don't belong to any cluster. Labeled as -1.
# DBSCAN Clustering
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
import numpy as np
# Generate moon-shaped data (non-spherical clusters)
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply DBSCAN
dbscan = DBSCAN(
    eps=0.3,            # Maximum distance between neighbors
    min_samples=5,      # Minimum points to form a core point
    metric='euclidean'  # Distance metric
)
labels = dbscan.fit_predict(X_scaled)
# Analyze results
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Number of clusters: {n_clusters}")
print(f"Number of noise points: {n_noise}")
print(f"Cluster labels: {np.unique(labels)}")
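One practical gap worth knowing: unlike K-Means, a fitted DBSCAN has no predict() method for new points. A common workaround, sketched below with a helper of our own (`dbscan_predict` is not a real sklearn API), assigns each new point to the cluster of its nearest core sample, or to noise if no core sample lies within eps:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
import numpy as np

X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
dbscan = DBSCAN(eps=0.3, min_samples=5).fit(X_scaled)

def dbscan_predict(dbscan, X_new):
    """Assign new points to the cluster of their nearest core sample (or -1)."""
    cores = dbscan.components_  # coordinates of the core samples
    core_labels = dbscan.labels_[dbscan.core_sample_indices_]
    nn = NearestNeighbors(n_neighbors=1).fit(cores)
    dist, idx = nn.kneighbors(X_new)
    labels = core_labels[idx.ravel()]
    labels[dist.ravel() > dbscan.eps] = -1  # too far from any core point: noise
    return labels

new_points = np.array([[0.0, 0.0], [5.0, 5.0]])  # one near the data, one far away
print(dbscan_predict(dbscan, new_points))
```

Only core samples are used for matching because border points sit at cluster edges and would blur the boundary; a point far from every core sample is treated the way DBSCAN itself would treat it, as noise.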
Choosing eps and min_samples
eps (epsilon)
The maximum distance between two points to be considered neighbors.
- Too small: Most points are noise, many small clusters
- Too large: Clusters merge together, few noise points
- Tip: Use k-distance graph to find good eps
min_samples
Minimum points in neighborhood to be a core point.
- Too small: Noise points form clusters
- Too large: Smaller clusters become noise
- Tip: Start with n_dimensions + 1 or more
Quick DBSCAN Tuning Guide
If you see too many clusters:
→ Increase eps (make neighborhoods bigger) OR decrease min_samples
If you see too few clusters (everything in one):
→ Decrease eps (make neighborhoods smaller) OR increase min_samples
# Finding Good eps Using k-Distance Graph
from sklearn.neighbors import NearestNeighbors
import numpy as np
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Find distance to k-th nearest neighbor for each point
k = 5 # min_samples we're considering
neigh = NearestNeighbors(n_neighbors=k + 1)  # +1: kneighbors counts the point itself
neigh.fit(X_scaled)
distances, _ = neigh.kneighbors(X_scaled)
# Sort distances to the true k-th nearest neighbor
k_distances = np.sort(distances[:, k])
# Print some distances - look for the "knee"
print("k-distances (sorted):")
print(k_distances[::20]) # Every 20th value
# Good eps is often at the "knee" of this curve
DBSCAN vs K-Means
# DBSCAN Handles Non-Spherical Clusters
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
# Moon-shaped data - K-Means will struggle!
X, y_true = make_moons(n_samples=200, noise=0.1, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# K-Means (assumes spherical clusters)
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans_labels = kmeans.fit_predict(X_scaled)
kmeans_ari = adjusted_rand_score(y_true, kmeans_labels)
# DBSCAN (finds density-based clusters)
dbscan = DBSCAN(eps=0.3, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)
dbscan_ari = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {kmeans_ari:.3f}")
print(f"DBSCAN ARI: {dbscan_ari:.3f}")
# DBSCAN wins on non-spherical data!
Limitations of DBSCAN
- Varying densities: Struggles when clusters have different densities
- High dimensions: eps becomes hard to tune in high-dimensional space
- Parameters matter: Results are very sensitive to eps and min_samples
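The high-dimensions caveat is easy to demonstrate. On random Gaussian data (our own construction, purely for illustration), the spread of k-nearest-neighbor distances collapses as dimensionality grows, so the k-distance curve loses the sharp "knee" you would normally use to pick eps:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
ratios = {}
for dim in [2, 10, 50]:
    X = rng.normal(size=(500, dim))
    # Distance to the 5th true neighbor (n_neighbors=6 since each point counts itself)
    nn = NearestNeighbors(n_neighbors=6).fit(X)
    dists, _ = nn.kneighbors(X)
    k_dist = dists[:, -1]
    ratios[dim] = k_dist.max() / k_dist.min()
    print(f"{dim:2d}D: 5-NN distance min={k_dist.min():.2f}, "
          f"max={k_dist.max():.2f}, max/min={ratios[dim]:.1f}")
```

As the max/min ratio shrinks toward 1, every candidate eps either merges everything into one cluster or marks everything as noise, which is why DBSCAN tuning gets hard beyond a few dozen dimensions.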
Practice Questions: DBSCAN
Test your understanding with these coding challenges.
Task: Use DBSCAN on make_blobs data with added outliers and count noise points.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import numpy as np
# Create data with outliers
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
outliers = np.random.uniform(-10, 10, size=(20, 2))
X_with_outliers = np.vstack([X, outliers])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_with_outliers)
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)
n_noise = list(labels).count(-1)
print(f"Noise points detected: {n_noise}")
Task: Run DBSCAN with eps values 0.1, 0.3, 0.5, 1.0 and compare clusters/noise.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
for eps in [0.1, 0.3, 0.5, 1.0]:
    dbscan = DBSCAN(eps=eps, min_samples=5)
    labels = dbscan.fit_predict(X_scaled)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(labels).count(-1)
    print(f"eps={eps:.1f}: {n_clusters} clusters, {n_noise} noise")
Task: Compute k-distances, find a good eps, and verify with silhouette score.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import numpy as np
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Find k-distances
k = 5
neigh = NearestNeighbors(n_neighbors=k + 1)  # +1: kneighbors counts the point itself
neigh.fit(X_scaled)
distances, _ = neigh.kneighbors(X_scaled)
k_distances = np.sort(distances[:, k])
# Look for the knee - try percentile approach
eps_candidates = [np.percentile(k_distances, p) for p in [10, 20, 30, 40]]
for eps in eps_candidates:
    dbscan = DBSCAN(eps=eps, min_samples=k)
    labels = dbscan.fit_predict(X_scaled)
    if len(set(labels) - {-1}) >= 2:
        non_noise = labels != -1
        if sum(non_noise) > 1:
            score = silhouette_score(X_scaled[non_noise], labels[non_noise])
            print(f"eps={eps:.3f}: silhouette={score:.3f}")
Task: After DBSCAN, count how many points are core, border, and noise.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
import numpy as np
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
dbscan = DBSCAN(eps=1.5, min_samples=5)
labels = dbscan.fit_predict(X)
# Core points have indices in core_sample_indices_
n_core = len(dbscan.core_sample_indices_)
n_noise = list(labels).count(-1)
n_border = len(X) - n_core - n_noise
print(f"Core points: {n_core}")
print(f"Border points: {n_border}")
print(f"Noise points: {n_noise}")
Task: Keep eps fixed and vary min_samples from 3 to 15. Show how it affects clustering.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
print("min_samples | Clusters | Noise")
print("-" * 35)
for min_s in [3, 5, 7, 10, 15]:
    dbscan = DBSCAN(eps=0.3, min_samples=min_s)
    labels = dbscan.fit_predict(X_scaled)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(labels).count(-1)
    print(f"{min_s:11d} | {n_clusters:8d} | {n_noise}")
Task: Create blobs with different cluster_std values and see how DBSCAN handles varying densities.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
import numpy as np
# Create blobs with different densities
X1, y1 = make_blobs(n_samples=100, centers=[[0, 0]], cluster_std=0.5, random_state=42)
X2, y2 = make_blobs(n_samples=100, centers=[[5, 5]], cluster_std=2.0, random_state=42)
X = np.vstack([X1, X2])
y_true = np.hstack([np.zeros(100), np.ones(100)])
for eps in [0.5, 1.0, 1.5, 2.0]:
    dbscan = DBSCAN(eps=eps, min_samples=5)
    labels = dbscan.fit_predict(X)
    ari = adjusted_rand_score(y_true, labels)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: {n_clusters} clusters, ARI={ari:.3f}")
print("\nNote: DBSCAN struggles with varying densities!")
Task: Try OPTICS, which handles varying densities better than DBSCAN.
Show Solution
from sklearn.cluster import OPTICS, DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
X, y_true = make_moons(n_samples=200, noise=0.1, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
# OPTICS - similar to DBSCAN but more flexible
optics = OPTICS(min_samples=5, xi=0.05, min_cluster_size=0.1)
optics_labels = optics.fit_predict(X_scaled)
# Compare with DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
dbscan_labels = dbscan.fit_predict(X_scaled)
print(f"OPTICS - Clusters: {len(set(optics_labels)) - (1 if -1 in optics_labels else 0)}")
print(f"DBSCAN - Clusters: {len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)}")
print(f"OPTICS ARI: {adjusted_rand_score(y_true, optics_labels):.3f}")
print(f"DBSCAN ARI: {adjusted_rand_score(y_true, dbscan_labels):.3f}")
Task: Create a grid search to find the best eps and min_samples combination.
Show Solution
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import itertools
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
# Grid search
eps_values = [0.2, 0.3, 0.4, 0.5]
min_samples_values = [3, 5, 7, 10]
best_score = -1
best_params = {}
for eps, min_s in itertools.product(eps_values, min_samples_values):
    dbscan = DBSCAN(eps=eps, min_samples=min_s)
    labels = dbscan.fit_predict(X_scaled)
    # Skip if only one cluster or all noise
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    if n_clusters >= 2:
        mask = labels != -1
        if mask.sum() > 1:
            score = silhouette_score(X_scaled[mask], labels[mask])
            if score > best_score:
                best_score = score
                best_params = {'eps': eps, 'min_samples': min_s}
print(f"Best params: {best_params}")
print(f"Best silhouette: {best_score:.3f}")
Algorithm Comparison
Each clustering algorithm has its strengths and ideal use cases. Understanding when to use which algorithm is key to successful clustering projects.
Quick Decision Helper (Beginner's Cheat Sheet)
Not sure which algorithm to pick? Answer these questions:
"I need speed & simplicity"
→ Use K-Means
Fast, works with millions of points, easy to understand. Just need to pick K.
"I want to see relationships"
→ Use Hierarchical
See how groups relate to each other. Pick K after looking at the dendrogram.
"I have weird shapes or outliers"
→ Use DBSCAN
Finds any shape, spots outliers automatically. No need to specify K.
Pro Tip: When in doubt, try all three and compare silhouette scores! The code at the bottom of this section shows you how.
Quick Comparison Table
Why Does Choosing the Right Algorithm Matter?
Using K-Means on crescent-shaped data is like using a hammer to drive in a screw - it might sort of work, but poorly. Matching the algorithm to the shape of your data saves time and produces meaningful results. The table below helps you make the right choice!
| Aspect | K-Means | Hierarchical | DBSCAN |
|---|---|---|---|
| Specify K? | Yes, required | Can cut dendrogram later | No, finds automatically |
| Cluster Shape | Spherical only | Depends on linkage | Any shape |
| Handles Outliers | No (sensitive) | Somewhat (depends) | Yes (labels as noise) |
| Scalability | Excellent (O(n)) | Poor (O(n²) or O(n³)) | Good (O(n log n) with index) |
| Interpretability | Centroids are interpretable | Dendrogram shows hierarchy | Core/border points |
| Best For | Large data, spherical clusters | Understanding relationships | Arbitrary shapes, outlier detection |
How to Read This Table
Focus on your main constraint: Speed? → K-Means wins. Don't know K? → DBSCAN or Hierarchical. Weird shapes? → DBSCAN. Want to explore hierarchy? → Hierarchical. When in doubt, try multiple algorithms and compare silhouette scores!
Decision Guide
Use K-Means When
- You have a rough idea of K
- Clusters are roughly spherical
- Dataset is large (>10,000 points)
- Speed is important
- Need reproducible results
Use Hierarchical When
- Want to explore different K values
- Need to understand cluster relationships
- Dataset is small-medium (<5,000)
- Creating taxonomy/hierarchy
- Visualization is important
Use DBSCAN When
- Don't know number of clusters
- Clusters are non-spherical
- Data has outliers/noise
- Clusters have similar density
- Need anomaly detection
# Complete Comparison on Same Dataset
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score, silhouette_score
import numpy as np
# Non-spherical data with some noise
X, y_true = make_moons(n_samples=200, noise=0.1, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply all three algorithms
results = {}
# K-Means
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
results['K-Means'] = kmeans.fit_predict(X_scaled)
# Hierarchical
agg = AgglomerativeClustering(n_clusters=2)
results['Hierarchical'] = agg.fit_predict(X_scaled)
# DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
results['DBSCAN'] = dbscan.fit_predict(X_scaled)
# Compare results
print("Algorithm | ARI | Silhouette")
print("-" * 40)
for name, labels in results.items():
    ari = adjusted_rand_score(y_true, labels)
    # For silhouette, exclude noise points in DBSCAN
    mask = labels != -1
    if mask.sum() > 1 and len(set(labels[mask])) > 1:
        sil = silhouette_score(X_scaled[mask], labels[mask])
    else:
        sil = float('nan')
    print(f"{name:14s} | {ari:.3f} | {sil:.3f}")
Practice Questions: Algorithm Comparison
Test your understanding with these coding challenges.
Task: Apply K-Means, Agglomerative, and DBSCAN to make_blobs data and compare ARI.
Show Solution
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
models = {
'K-Means': KMeans(n_clusters=3, random_state=42, n_init=10),
'Agglomerative': AgglomerativeClustering(n_clusters=3),
'DBSCAN': DBSCAN(eps=1.0, min_samples=5)
}
for name, model in models.items():
    labels = model.fit_predict(X)
    ari = adjusted_rand_score(y_true, labels)
    print(f"{name}: ARI = {ari:.3f}")
Task: Use make_circles (concentric circles) - which algorithm works best?
Show Solution
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
X, y_true = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
models = {
'K-Means': KMeans(n_clusters=2, random_state=42, n_init=10),
'Agglomerative': AgglomerativeClustering(n_clusters=2),
'DBSCAN': DBSCAN(eps=0.3, min_samples=5)
}
for name, model in models.items():
    labels = model.fit_predict(X_scaled)
    ari = adjusted_rand_score(y_true, labels)
    print(f"{name}: ARI = {ari:.3f}")
# DBSCAN should win on concentric circles!
Task: Create a function that tries all 3 algorithms and returns the best one based on silhouette score.
Show Solution
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons
def best_clustering(X, n_clusters=3):
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    results = {}
    # K-Means
    km = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    km_labels = km.fit_predict(X_scaled)
    results['K-Means'] = silhouette_score(X_scaled, km_labels)
    # Agglomerative
    agg = AgglomerativeClustering(n_clusters=n_clusters)
    agg_labels = agg.fit_predict(X_scaled)
    results['Agglomerative'] = silhouette_score(X_scaled, agg_labels)
    # DBSCAN
    db = DBSCAN(eps=0.5, min_samples=5)
    db_labels = db.fit_predict(X_scaled)
    mask = db_labels != -1
    if len(set(db_labels[mask])) >= 2:
        results['DBSCAN'] = silhouette_score(X_scaled[mask], db_labels[mask])
    best = max(results, key=results.get)
    return best, results
# Test it
X, _ = make_moons(n_samples=200, noise=0.1, random_state=42)
best, scores = best_clustering(X, n_clusters=2)
print(f"Best: {best}")
print(f"Scores: {scores}")
Task: Time how long each algorithm takes to cluster the same dataset.
Show Solution
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
import time
X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)
models = {
'K-Means': KMeans(n_clusters=5, random_state=42, n_init=10),
'Agglomerative': AgglomerativeClustering(n_clusters=5),
'DBSCAN': DBSCAN(eps=1.0, min_samples=5)
}
print("Algorithm | Time (ms)")
print("-" * 30)
for name, model in models.items():
    start = time.time()
    model.fit_predict(X)
    elapsed = (time.time() - start) * 1000
    print(f"{name:14s} | {elapsed:.2f}")
Task: Create elongated blobs using a transformation and compare algorithms.
Show Solution
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
import numpy as np
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
# Stretch the blobs to make them elongated
transformation = [[0.6, -0.6], [-0.4, 0.8]]
X_aniso = np.dot(X, transformation)
models = {
'K-Means': KMeans(n_clusters=3, random_state=42, n_init=10),
'Agglomerative': AgglomerativeClustering(n_clusters=3),
'DBSCAN': DBSCAN(eps=0.5, min_samples=5)
}
print("Algorithm | ARI")
print("-" * 25)
for name, model in models.items():
    labels = model.fit_predict(X_aniso)
    ari = adjusted_rand_score(y_true, labels)
    print(f"{name:14s} | {ari:.3f}")
Task: Add 20 random outliers to blob data and see which algorithm handles them best.
Show Solution
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
import numpy as np
X, y_true = make_blobs(n_samples=200, centers=3, random_state=42)
# Add outliers
outliers = np.random.uniform(-15, 15, size=(20, 2))
X_with_outliers = np.vstack([X, outliers])
y_with_outliers = np.hstack([y_true, [-1] * 20]) # -1 for outliers
models = {
'K-Means': KMeans(n_clusters=3, random_state=42, n_init=10),
'Agglomerative': AgglomerativeClustering(n_clusters=3),
'DBSCAN': DBSCAN(eps=1.5, min_samples=5)
}
print("Algorithm | Clusters | Noise detected")
print("-" * 45)
for name, model in models.items():
    labels = model.fit_predict(X_with_outliers)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(labels).count(-1)
    print(f"{name:14s} | {n_clusters:8d} | {n_noise}")
Task: Create a complete pipeline that scales data, tries all algorithms, and reports the best.
Show Solution
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import load_wine
def cluster_comparison(X, k_range=(2, 6)):
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    results = []
    # K-Means with different K
    for k in range(*k_range):
        km = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = km.fit_predict(X_scaled)
        score = silhouette_score(X_scaled, labels)
        results.append(('K-Means', k, score))
    # Agglomerative with different K
    for k in range(*k_range):
        agg = AgglomerativeClustering(n_clusters=k)
        labels = agg.fit_predict(X_scaled)
        score = silhouette_score(X_scaled, labels)
        results.append(('Agglomerative', k, score))
    # Best result
    best = max(results, key=lambda x: x[2])
    return best, sorted(results, key=lambda x: -x[2])[:5]
wine = load_wine()
best, top5 = cluster_comparison(wine.data)
print(f"Best: {best[0]} with k={best[1]}, silhouette={best[2]:.3f}")
print("\nTop 5:")
for alg, k, score in top5:
print(f" {alg} (k={k}): {score:.3f}")
Task: Run multiple clusterings and use consensus/voting to get final labels.
Show Solution
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
import numpy as np
from scipy.stats import mode
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
# Run K-Means multiple times with different seeds
all_labels = []
for seed in range(10):
    km = KMeans(n_clusters=3, random_state=seed, n_init=1)
    labels = km.fit_predict(X)
    all_labels.append(labels)
all_labels = np.array(all_labels)  # Shape: (10, 300)
# Cluster IDs are arbitrary per run, so align each run's labels to the
# first run before voting (Hungarian matching on the confusion matrix)
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix
aligned = [all_labels[0]]
for labels in all_labels[1:]:
    cm = confusion_matrix(labels, all_labels[0])
    rows, cols = linear_sum_assignment(-cm)
    mapping = dict(zip(rows, cols))
    aligned.append(np.array([mapping[l] for l in labels]))
# Consensus: majority vote for each point
consensus_labels, _ = mode(np.array(aligned), axis=0)
consensus_labels = np.asarray(consensus_labels).flatten()
# Compare
single_ari = adjusted_rand_score(y_true, all_labels[0])
consensus_ari = adjusted_rand_score(y_true, consensus_labels)
print(f"Single run ARI: {single_ari:.3f}")
print(f"Consensus ARI: {consensus_ari:.3f}")
print("Ensemble often improves stability!")