Module 10.2

Dimensionality Reduction

Learn how to reduce the number of features in your data while preserving important information. Master PCA for linear reduction and t-SNE/UMAP for powerful visualizations!

40 min read
Intermediate
Hands-on Examples
What You'll Learn
  • Principal Component Analysis (PCA)
  • Explained variance and scree plots
  • t-SNE for visualization
  • UMAP for high-dimensional data
  • Choosing the right technique
Contents
01

Principal Component Analysis (PCA)

PCA is the most widely used dimensionality reduction technique. It transforms your data into a new coordinate system where the axes (principal components) capture the maximum variance in decreasing order.

Why Reduce Dimensions?

Imagine trying to describe a person using 1,000 different measurements - height, weight, shoe size, favorite color, number of books read, coffee preference, etc. Many of these measurements are related (tall people often have larger shoe sizes) or irrelevant for your task. Dimensionality reduction is like finding the most important 5-10 characteristics that capture the essence of a person.

Real-World Analogy: Think of a high-resolution photo (millions of pixels) vs a thumbnail (few hundred pixels). The thumbnail loses detail, but you can still recognize what's in the image. That's dimensionality reduction - keeping the essential information while drastically reducing size.

High-dimensional data presents several challenges: It's hard to visualize (we can only see 3 dimensions at most), computationally expensive to process (more features = more calculations), and often contains redundant or correlated features (like temperature in Celsius and Fahrenheit - they're the same information!). Dimensionality reduction addresses these issues while preserving the essential patterns and relationships in your data.

Benefits (Why Use It?)
  • Faster model training: Fewer features = less computation. A model with 10 features typically trains far faster than one with 100.
  • Reduced storage: Save disk space and memory. Store 50 features instead of 5,000.
  • Better visualization: We can see 2D/3D plots. Can't visualize 784 dimensions (like in images)!
  • Removes noise: Gets rid of random fluctuations and keeps the signal. Like removing static from a radio.
  • Avoids overfitting: Fewer features mean simpler models that generalize better to new data.
Trade-offs (The Cost)
  • Information loss: Like compressing a song to MP3 - smaller file, but some quality is lost.
  • Less interpretable: "Principal Component 1" is harder to explain than "age" or "income".
  • Linear only (PCA): PCA can't capture curved patterns. Like trying to draw a circle with straight lines.
  • Requires scaling: Must standardize features first or results will be biased toward large-scale features.
  • Outlier sensitive: Extreme values can distort the components. One billionaire in a dataset of average earners!
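The scaling trade-off is easy to verify yourself. Here is a minimal sketch on synthetic data (not from this lesson) showing how an unscaled large-magnitude feature hijacks the first component:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two equally informative features - but one measured on a 1000x larger scale
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])

# Without scaling, the large-scale feature swallows nearly all the variance
pca_raw = PCA(n_components=2).fit(X)
print(pca_raw.explained_variance_ratio_.round(4))   # first entry is ~1.0

# After standardization, both features contribute about equally
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=2).fit(X_scaled)
print(pca_scaled.explained_variance_ratio_.round(4))  # roughly [0.5, 0.5]
```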

How PCA Works

Beginner-Friendly Explanation

Imagine you're looking at a football field from the stands. You see players moving in all directions. But if you rotate your view to look from the sideline, you'll notice most movement happens along the length of the field (1st dimension) and less width-wise (2nd dimension).

PCA does exactly this! It "rotates" your data to find the directions where your data varies the most. The 1st principal component is the direction of maximum variation (like the field's length), the 2nd is perpendicular and captures the next most variation (like the width), and so on.

Technically speaking: PCA finds new axes (called principal components) that are linear combinations of your original features. Think of a linear combination as a recipe - "PC1 = 0.5 × height + 0.3 × weight + 0.2 × age". Each component is a weighted mix of your original features.

The key rule: The first component points in the direction of maximum variance (where data is most spread out). The second is perpendicular (orthogonal) to the first and captures the next most variance. The third is perpendicular to both, and so on. This guarantees no overlap or redundancy between components!

Algorithm

PCA Steps (What Happens Under the Hood)

  1. Standardize: Scale features to have zero mean (center at 0) and unit variance (same spread).
    Why? Without this, features with larger scales (like salary in $100K vs age in years) would dominate the components unfairly.
  2. Covariance Matrix: Calculate how each pair of features varies together.
    Think of it as a "relationship table" - high covariance means two features move together (height & weight), low means they're independent.
  3. Eigenvectors & Eigenvalues: Find the directions (eigenvectors) of maximum variance and their importance (eigenvalues).
    Eigenvectors = the new axes (principal components). Eigenvalues = how much variance each axis captures. Bigger eigenvalue = more important component!
  4. Select Components: Keep the top k components based on explained variance (usually 90-95%).
    You're choosing: "I want to keep 95% of the information, how many components do I need?" Typically much fewer than original features!
  5. Transform: Project your original data onto the selected principal components.
    This is the final step - rotating and projecting your data into the new coordinate system. Your 100 features become 10 components!
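The five steps above can be sketched directly in NumPy. This is a pedagogical sketch to show what happens under the hood - in practice you would use scikit-learn's PCA, as shown later in this lesson:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

# Step 1: standardize
X = StandardScaler().fit_transform(load_iris().data)

# Steps 2-3: covariance matrix, then its eigenvectors/eigenvalues
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]        # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: explained variance ratio comes straight from the eigenvalues
ratio = eigvals / eigvals.sum()

# Step 5: project the data onto the top 2 components
X_manual = X @ eigvecs[:, :2]

# Cross-check against scikit-learn (component signs may be flipped)
pca = PCA(n_components=2).fit(X)
print(np.allclose(ratio[:2], pca.explained_variance_ratio_))  # True
```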

Key Insight (The Magic of PCA):

Each principal component is orthogonal (perpendicular, at 90°) to all others, which means the component scores are completely uncorrelated. No redundancy! If features A and B are highly correlated (largely redundant), PCA folds their shared variance into a single component. You get independent, information-rich features!
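You can check the "uncorrelated" claim in a couple of lines: the correlation between the scores of any two components is zero up to floating-point noise.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

X_scaled = StandardScaler().fit_transform(load_iris().data)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# The component scores are uncorrelated up to floating-point noise
corr = np.corrcoef(X_pca[:, 0], X_pca[:, 1])[0, 1]
print(f"Correlation between PC1 and PC2 scores: {corr:.2e}")
```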

Implementing PCA in Python

Scikit-learn makes PCA straightforward. Always remember to scale your data first!

# Import required libraries
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For creating visualizations
from sklearn.decomposition import PCA  # The PCA algorithm implementation
from sklearn.preprocessing import StandardScaler  # For scaling features (CRITICAL!)
from sklearn.datasets import load_iris  # Sample dataset with 4 features

# Load the famous Iris dataset
# - 150 flowers (samples)
# - 4 measurements per flower: sepal length, sepal width, petal length, petal width
# - 3 species: setosa, versicolor, virginica
iris = load_iris()
X = iris.data  # Shape: (150, 4) - 150 flowers, 4 features each
y = iris.target  # Species labels: 0, 1, or 2

print(f"Original shape: {X.shape}")  # (150, 4) - We'll reduce from 4D to 2D
print(f"Feature names: {iris.feature_names}")  # See what we're compressing
Always scale first! PCA is based on variance. Without scaling, features with larger scales will dominate the principal components.
# Step 1: Scale the data (ABSOLUTELY CRITICAL!)
# WHY: PCA is based on variance. If one feature ranges 0-1000 and another 0-5,
# the 0-1000 feature will dominate the principal components unfairly!
# StandardScaler makes all features have mean=0 and standard deviation=1

scaler = StandardScaler()  # Create the scaler
X_scaled = scaler.fit_transform(X)  # Fit to data AND transform it

# After scaling, all features have:
# - Mean = 0 (centered)
# - Std Dev = 1 (same spread)
# This ensures fair comparison!

print(f"Original mean: {X.mean(axis=0).round(2)}")  # Different means
print(f"Scaled mean: {X_scaled.mean(axis=0).round(2)}")  # All near 0!

# Step 2: Apply PCA (reduce from 4 dimensions to 2)
# n_components=2 means "give me the top 2 principal components"
# These 2 will capture the MOST variance from the original 4 features

pca = PCA(n_components=2)  # Create PCA object
X_pca = pca.fit_transform(X_scaled)  # Fit PCA and transform data

# What just happened?
# - fit(): PCA analyzed X_scaled and found the 2 best directions
# - transform(): Projected the data onto those 2 directions
# - Result: We went from 4D to 2D!

print(f"Reduced shape: {X_pca.shape}")  # (150, 2) - Same 150 flowers, but only 2 features now!
print(f"We reduced dimensions by {(1 - 2/4)*100:.0f}%!")  # 50% reduction

Visualize the reduced data to see how well the classes separate:

# Visualize the 2D projection
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA: Iris Dataset (4D to 2D)')
plt.colorbar(scatter, label='Species')
plt.show()

Understanding PCA Attributes

After fitting, PCA provides useful attributes to understand the transformation:

# After fitting, PCA provides useful attributes to understand the transformation

# 1. Explained Variance Ratio - THE MOST IMPORTANT METRIC!
# This tells you: "What percentage of total variance does each component capture?"
print("Explained variance ratio:", pca.explained_variance_ratio_)
# Example output: [0.7296, 0.2285]
# Interpretation:
# - PC1 captures 73% of the variance (MOST important direction)
# - PC2 captures 23% of the variance (2nd most important)
# - Together: 73% + 23% = 96% of total variance retained!

# 2. Total Variance Retained
# How much of the original information did we keep?
total_variance = sum(pca.explained_variance_ratio_)
print(f"Total variance retained: {total_variance:.2%}")  # e.g., 95.81%

# Rule of thumb:
# - 90-95%: Excellent! Lost very little information
# - 80-90%: Good for most tasks
# - < 80%: Might have lost too much information

# 3. Component Loadings (Feature Contributions)
# These show HOW each original feature contributes to each component
print("Component loadings shape:", pca.components_.shape)  # (2, 4)
# Shape: (n_components, n_original_features)
# Each row is one principal component
# Each column shows contribution from one original feature

# Example: If pca.components_[0] = [0.5, 0.3, -0.6, -0.5]
# It means: PC1 = 0.5*sepal_length + 0.3*sepal_width - 0.6*petal_length - 0.5*petal_width
# Positive = feature increases with PC1, Negative = feature decreases with PC1

Feature Importance from PCA

The component loadings show how much each original feature contributes to each principal component:

# Visualize feature contributions to components
import pandas as pd

loadings = pd.DataFrame(
    pca.components_.T,
    columns=['PC1', 'PC2'],
    index=iris.feature_names
)

print(loadings)

# Plot loadings as heatmap
plt.figure(figsize=(8, 4))
plt.imshow(loadings.values, cmap='coolwarm', aspect='auto')
plt.colorbar(label='Loading')
plt.xticks([0, 1], ['PC1', 'PC2'])
plt.yticks(range(4), iris.feature_names)
plt.title('Feature Loadings on Principal Components')
plt.show()

Practice Questions: PCA

Test your understanding with these hands-on exercises.

Task: Load the iris dataset, scale it, and apply PCA to reduce to 3 components.

Solution
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

print(f"Reduced shape: {X_pca.shape}")

Task: Apply PCA with 2 components and print the total variance retained as a percentage.

Solution
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)

pca = PCA(n_components=2)
pca.fit(X_scaled)

total = sum(pca.explained_variance_ratio_)
print(f"Total variance retained: {total:.2%}")

Task: Apply PCA to reduce iris data to 2D and create a scatter plot colored by species.

Solution
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=iris.target, cmap='viridis')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA of Iris Dataset')
plt.colorbar(label='Species')
plt.show()

Task: Use PCA with n_components=0.95 to automatically select components that retain 95% variance.

Solution
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)

# Use ratio to auto-select components
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(f"Components needed: {pca.n_components_}")
print(f"Variance retained: {sum(pca.explained_variance_ratio_):.2%}")
02

Explained Variance & Scree Plot

How many components should you keep? Explained variance ratios and scree plots help you decide the optimal number of dimensions to retain while preserving most of the information in your data.

Understanding Explained Variance

Beginner-Friendly Explanation

Imagine you have a $100 budget to explain why students succeed. You discover:

  • Study hours explains $60 of success (60%)
  • Sleep quality explains $25 (25%)
  • Diet explains $10 (10%)
  • Shoe size explains $5 (5%)

The first two factors (study + sleep) give you $85 of your $100 budget - that's 85% explained! This is exactly what explained variance means in PCA. Each component "explains" a portion of the total variation in your data, and you want to keep enough components to hit your budget (usually 90-95%).

Each principal component captures a portion of the total variance (spread) in your data. The first component always captures the most (it's designed to!), the second captures the next most from what remains, and so on. The explained variance ratio tells you what fraction of total variance each component represents - like slices of a pie.

Why does this matter? If PC1 explains 90% of variance and PC2 explains 8%, keeping just these 2 components means you've retained 98% of your data's information! The remaining components (PC3, PC4, ...) might just be noise and can be safely discarded.

Individual Variance
One Component at a Time

explained_variance_ratio_ shows what proportion each component captures individually. Think: "How much does this ONE component contribute?"

Example: [0.73, 0.23, 0.04]
PC1: 73% (biggest slice)
PC2: 23% (2nd slice)
PC3: 4% (tiny slice)
Cumulative Variance
Running Total

The cumulative sum shows total variance retained as you add more components. Think: "How much do I have so far?"

Example: [0.73, 0.96, 1.00]
PC1 alone: 73%
PC1+PC2: 96% (almost there!)
All 3: 100% (everything)
Target Threshold
When to Stop?

Common practice: Keep enough components to retain 90-95% of total variance. Think: "I'm okay losing 5-10% to save space."

Trade-off:
More components = more info = more features
Fewer components = less info = more compression
90-95% is the sweet spot!
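A quick way to answer "how many components do I need for 95%?" is to scan the cumulative sum. The np.argmax trick below is one common idiom for finding the first index that crosses a threshold:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

X_scaled = StandardScaler().fit_transform(load_iris().data)
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)

# Index of the first component at which the running total crosses 95%
k = int(np.argmax(cumulative >= 0.95)) + 1
print(f"Keep {k} components to retain {cumulative[k - 1]:.1%} of the variance")
```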

Computing Explained Variance

Fit PCA on all components first to see how variance is distributed:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

# Load and scale data
iris = load_iris()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(iris.data)

# Fit PCA on ALL components
pca = PCA()  # No n_components = keep all
pca.fit(X_scaled)

# Get explained variance
variance_ratio = pca.explained_variance_ratio_
cumulative = np.cumsum(variance_ratio)

print("Individual variance:", variance_ratio.round(3))
print("Cumulative variance:", cumulative.round(3))
Tip: When n_components is not specified, PCA keeps all components (min of n_samples, n_features). This is useful for analyzing variance distribution before deciding how many to keep.

Creating a Scree Plot

A scree plot visualizes the explained variance for each component. Look for an "elbow" where adding more components yields diminishing returns:

# Create a scree plot
fig, ax1 = plt.subplots(figsize=(10, 6))

# Bar chart for individual variance
components = range(1, len(variance_ratio) + 1)
ax1.bar(components, variance_ratio, alpha=0.7, color='steelblue', label='Individual')
ax1.set_xlabel('Principal Component')
ax1.set_ylabel('Explained Variance Ratio', color='steelblue')
ax1.tick_params(axis='y', labelcolor='steelblue')

# Line for cumulative variance
ax2 = ax1.twinx()
ax2.plot(components, cumulative, 'ro-', label='Cumulative')
ax2.axhline(y=0.95, color='g', linestyle='--', label='95% threshold')
ax2.set_ylabel('Cumulative Explained Variance', color='red')
ax2.tick_params(axis='y', labelcolor='red')

plt.title('Scree Plot with Cumulative Variance')
fig.legend(loc='center right', bbox_to_anchor=(0.85, 0.5))
plt.tight_layout()
plt.show()
Visualization

Reading a Scree Plot (The Elbow Hunter's Guide)

The scree plot is named after the geological term for debris at the base of a cliff. Imagine a mountain:

  • The Cliff (Steep Drop): First few components with HIGH explained variance - these are important!
  • The Scree (Flat Tail): Later components with LOW explained variance - mostly noise, can discard
  • The Elbow (Bend Point): Where the steep drop transitions to flat - THIS is your cutoff!

The Elbow Rule (How to Decide):

Choose the number of components at the point where the curve bends sharply (the "elbow").
Before the elbow: Each component adds significant value.
After the elbow: Adding more components gives diminishing returns - not worth it!

Pro Tip: If you see 4 components before the elbow, keep 4. Don't keep the flat tail!
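Scikit-learn has no built-in "elbow finder", but one simple heuristic is to look for the sharpest bend in the variance ratios via their second differences. Treat this as a rough sketch, not a rule - always eyeball the scree plot too:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

X_scaled = StandardScaler().fit_transform(load_iris().data)
ratio = PCA().fit(X_scaled).explained_variance_ratio_

# The sharpest bend is where the second difference of the curve peaks
second_diff = np.diff(ratio, n=2)
elbow = int(np.argmax(second_diff)) + 2  # +2 because diff(n=2) shifts the index
print(f"Suggested cutoff: keep {elbow} components")
```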

Automatic Component Selection

Instead of manually choosing, you can specify a variance threshold:

# Method 1: Specify variance to retain
pca_95 = PCA(n_components=0.95)  # Keep 95% of variance
X_reduced = pca_95.fit_transform(X_scaled)
print(f"Components for 95%: {pca_95.n_components_}")

# Method 2: Specify exact number
pca_2 = PCA(n_components=2)
X_2d = pca_2.fit_transform(X_scaled)
print(f"Variance with 2 components: {sum(pca_2.explained_variance_ratio_):.2%}")

Variance Thresholds Table

Common rules of thumb for choosing the number of components:

Target Variance | Use Case | Trade-off
80-85% | Quick exploration, visualization | Loses more detail, but very compact
90-95% | General purpose, most ML tasks | Good balance of compression and accuracy
99% | When accuracy is critical | Minimal compression, removes only noise
2-3 components | Visualization only (2D/3D plots) | May lose significant variance
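You can reproduce this table for your own dataset by sweeping the thresholds. Shown here on Iris - your component counts will differ:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Sweep the common thresholds and see what each one costs
for target in [0.80, 0.90, 0.95, 0.99]:
    pca = PCA(n_components=target).fit(X_scaled)
    kept = sum(pca.explained_variance_ratio_)
    print(f"{target:.0%} target -> {pca.n_components_} components ({kept:.1%} retained)")
```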

Reconstructing Data from Components

You can transform data back to the original space. The difference between original and reconstructed data shows what information was lost:

# Reduce to 2 components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Reconstruct back to original dimensions
X_reconstructed = pca.inverse_transform(X_reduced)

# Calculate reconstruction error
mse = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"Reconstruction MSE: {mse:.4f}")

# Compare original vs reconstructed for first sample
print("Original:", X_scaled[0].round(2))
print("Reconstructed:", X_reconstructed[0].round(2))

Practice Questions: Explained Variance

Test your understanding with these hands-on exercises.

Task: Fit PCA on the iris dataset without limiting components and print the explained variance ratio.

Solution
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

pca = PCA()  # Keep all components
pca.fit(X_scaled)

print("Variance ratio:", pca.explained_variance_ratio_.round(4))

Task: Compute and print the cumulative explained variance using numpy cumsum.

Solution
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

pca = PCA()
pca.fit(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print("Cumulative variance:", cumulative.round(4))

Task: Create a bar chart of individual variance and overlay a line plot of cumulative variance.

Solution
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA().fit(X_scaled)

variance = pca.explained_variance_ratio_
cumulative = np.cumsum(variance)
components = range(1, len(variance) + 1)

fig, ax1 = plt.subplots(figsize=(8, 5))
ax1.bar(components, variance, alpha=0.7, color='blue')
ax1.set_xlabel('Component')
ax1.set_ylabel('Individual Variance')

ax2 = ax1.twinx()
ax2.plot(components, cumulative, 'r-o')
ax2.set_ylabel('Cumulative Variance')

plt.title('Scree Plot')
plt.show()

Task: Use n_components=0.90 to find how many components are needed to retain 90% variance.

Solution
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

pca = PCA(n_components=0.90)
pca.fit(X_scaled)

print(f"Components for 90%: {pca.n_components_}")
print(f"Actual variance: {sum(pca.explained_variance_ratio_):.2%}")
03

t-SNE for Visualization

t-SNE (t-distributed Stochastic Neighbor Embedding) excels at revealing clusters and patterns in high-dimensional data by creating stunning 2D or 3D visualizations that preserve local structure.

What is t-SNE?

Beginner-Friendly Explanation

Imagine you're creating a map of a city. PCA would preserve the overall layout - north stays north, the downtown area stays in the center. But what if you care more about neighborhoods? You want houses on the same street to stay close together, even if that means bending the overall map.

That's t-SNE! It focuses on keeping neighbors together. If data points A and B were close in the original high-dimensional space, t-SNE makes sure they're also close in the 2D visualization. It's willing to distort global distances to preserve local neighborhoods - perfect for discovering clusters!

Technical explanation: Unlike PCA which preserves global variance (overall spread of data), t-SNE focuses on preserving local neighborhoods. Points that are close together in high-dimensional space will remain close in the low-dimensional projection. This makes t-SNE excellent for discovering clusters that might not be visible with PCA.

Important caveat: Because t-SNE distorts global structure, the distances between clusters are NOT meaningful. Two clusters might appear far apart in a t-SNE plot but could actually be close in the original space. t-SNE is for visualization and cluster discovery ONLY, not for measuring distances!

PCA (The Global Preserver)
  • Linear transformation: Like rotating a photo - no bending or warping, just rotation
  • Preserves global structure: Overall relationships and distances stay accurate
  • Fast & deterministic: Same input = same output, every time. Runs in seconds!
  • Feature reduction: Great for ML pipelines - use PCA features as input to models
  • Interpretable axes: Can understand what each PC represents (hard but possible)
t-SNE (The Cluster Revealer)
  • Non-linear transformation: Can bend and warp to reveal hidden patterns
  • Preserves local structure: Keeps neighbors together, but distorts overall layout
  • Slower & stochastic: Different runs give slightly different results. Takes minutes!
  • Visualization only: Beautiful plots but don't use for machine learning features
  • Not interpretable: Axes have no meaning - just look at cluster separation
Algorithm

How t-SNE Works (The Neighborhood Preservation Dance)

  1. Compute High-D Probabilities: Calculate the probability that each point would "pick" every other point as a neighbor in high-dimensional space.
    Think: "If point A were at a party, how likely is it to hang out with point B vs point C?" Close points = high probability.
  2. Random Low-D Initialization: Randomly scatter all points in 2D or 3D space (like throwing confetti).
    Starting point is random - that's why different runs give slightly different results (stochastic).
  3. Iterative Optimization: Gradually move points around in 2D so that the low-D probabilities match the high-D probabilities.
    Like organizing a party - keep rearranging people until everyone is near their friends. This takes many iterations (1000+).
  4. Use t-Distribution: In low dimensions, use a t-distribution (instead of Gaussian) which has heavy tails.
    Why? Prevents "crowding problem" - gives clusters room to spread out. Without this, all points would pile up in the center!

Key Insight (The t-SNE Philosophy):

t-SNE optimizes for neighborhood preservation, NOT absolute distances! Two points that are far apart in the t-SNE plot might still be relatively close in the original space - you just can't tell. Only trust local clusters, not global layout! The magic: reveals hidden clusters beautifully.

Applying t-SNE in Python

Scikit-learn provides an easy-to-use t-SNE implementation:

from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load and scale data
iris = load_iris()
X = iris.data
y = iris.target

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_scaled)

print(f"Original shape: {X.shape}")  # (150, 4)
print(f"t-SNE shape: {X_tsne.shape}")  # (150, 2)
Important: t-SNE has no transform() method. You cannot apply a fitted t-SNE to new data. Always use fit_transform() on all your data at once.
# Visualize t-SNE results
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', alpha=0.7, s=50)
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.title('t-SNE Visualization of Iris Dataset')
plt.colorbar(scatter, label='Species')
plt.show()

The Perplexity Parameter

Perplexity is the most important hyperparameter. It controls how many neighbors each point considers:

Perplexity | Effect on Visualization | Best For
5-10 (very low) | Focuses on very tight, local structure; each point only considers its 5-10 nearest neighbors. Can create many small, fragmented clusters. | Small datasets (< 100 points), seeing micro-clusters, detecting very tight groups
30 (default) | Balanced view of both local and global structure; each point considers ~30 neighbors. The most reliable and recommended starting point. | Most datasets, general-purpose visualization, when you're not sure what value to use - START HERE!
50-100 (high) | Considers broader neighborhoods and preserves more global structure. Creates smoother, more spread-out clusters with less fragmentation. | Larger datasets (1000+ samples), seeing macro-structure, reducing over-clustering artifacts
Rule of Thumb: Keep perplexity between 5 and 50. For a dataset with N samples, try values around N/100 to N/50, capped at about 50. With 1,000 samples, try 10-20; with 10,000 samples, try 30-50. Always visualize with multiple perplexity values to ensure your clusters are real and not artifacts!
# Compare different perplexity values
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
perplexities = [5, 30, 50]

for ax, perp in zip(axes, perplexities):
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42)
    X_tsne = tsne.fit_transform(X_scaled)
    ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', alpha=0.7)
    ax.set_title(f'Perplexity = {perp}')
    ax.set_xlabel('Dim 1')
    ax.set_ylabel('Dim 2')

plt.tight_layout()
plt.show()

t-SNE Best Practices

Do
  • Use for visualization only, not as features
  • Always set random_state for reproducibility
  • Try multiple perplexity values
  • Scale your data first
  • Use PCA to reduce to 50 dims first for speed
Avoid
  • Do not interpret distances between clusters
  • Do not use for feature engineering
  • Do not trust cluster sizes (they can be artifacts)
  • Do not run on very large datasets directly
  • Do not use on new unseen data

Speed Optimization with PCA

For datasets with many features, first reduce with PCA before applying t-SNE:

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import fetch_openml

# Load MNIST (784 features)
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist.data[:5000], mnist.target[:5000].astype(int)

# Step 1: Reduce to 50 dimensions with PCA
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X)

# Step 2: Apply t-SNE on reduced data (much faster!)
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_pca)

print("784 dims -> 50 dims -> 2 dims")

Practice Questions: t-SNE

Test your understanding with these hands-on exercises.

Task: Apply t-SNE to reduce the iris dataset to 2 dimensions.

Solution
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

print(f"Shape: {X_tsne.shape}")

Task: Create a scatter plot of t-SNE output colored by species.

Solution
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=iris.target, cmap='viridis')
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.colorbar(label='Species')
plt.title('t-SNE of Iris')
plt.show()

Task: Create a 1x3 subplot comparing t-SNE with perplexity values 5, 30, and 50.

Solution
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
for ax, perp in zip(axes, [5, 30, 50]):
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42)
    X_tsne = tsne.fit_transform(X_scaled)
    ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=iris.target, cmap='viridis')
    ax.set_title(f'Perplexity = {perp}')
plt.tight_layout()
plt.show()

Task: For large datasets, first reduce dimensions with PCA before applying t-SNE.

Solution
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

digits = load_digits()
X = digits.data

# Step 1: PCA to 30 dimensions
pca = PCA(n_components=30)
X_pca = pca.fit_transform(X)

# Step 2: t-SNE on reduced data
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_pca)

print(f"Original: {X.shape} -> PCA: {X_pca.shape} -> t-SNE: {X_tsne.shape}")
04

UMAP for High-Dimensional Data

UMAP (Uniform Manifold Approximation and Projection) is a modern technique that preserves both local and global structure, runs faster than t-SNE, and works well for both visualization and as a preprocessing step for machine learning.

Why UMAP? (The Modern Champion)

Beginner-Friendly Explanation

Imagine you need to create a map of your city for tourists:

  • PCA: Fast but boring - just a rotated satellite view, loses neighborhoods
  • t-SNE: Beautiful and shows neighborhoods, but SLOW, and you can't add new locations later
  • UMAP: Beautiful like t-SNE, FAST like PCA, AND you can add new locations to the same map!

UMAP is the best of both worlds! Developed in 2018 to fix t-SNE's limitations, it creates gorgeous visualizations (like t-SNE) while being faster, preserving more global structure, and supporting transformation of new data (like PCA).

Because it combines the speed and ML-pipeline usability of PCA with the visualization quality of t-SNE, UMAP has quickly become the preferred choice for many data scientists and machine learning practitioners.

The UMAP advantage: Unlike t-SNE which only preserves local structure, UMAP preserves both local AND global structure. This means cluster separations and relative positions are more meaningful. Plus, it scales to millions of data points efficiently!

Speed & Scalability: Blazing Fast!

UMAP is significantly faster than t-SNE, especially on larger datasets. Where t-SNE might take 30 minutes, UMAP finishes in 2 minutes!

Scales beautifully:
  • 1,000 points: seconds
  • 10,000 points: 1-2 minutes
  • 1,000,000+ points: possible!

Global + Local Structure: Best of Both!

Unlike t-SNE (local only), UMAP preserves more global relationships. Distances between clusters are more meaningful and trustworthy.

What this means:
  • See clusters (local)
  • AND their relationships (global)
  • A more accurate overall view!

Transform New Data: ML Ready!

UMAP can transform new data using a fitted model. This makes it usable in machine learning pipelines, unlike t-SNE!

Workflow:
  • Fit on training data
  • Transform test data
  • Use as ML features!
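These structure-preservation claims can be checked numerically. As a sketch, scikit-learn's trustworthiness score measures how well an embedding's 2D neighborhoods match the original-space neighborhoods (1.0 is perfect). Here it compares PCA and t-SNE embeddings of the iris data; the same function works on a UMAP embedding once umap-learn is installed.

```python
# Quantify local-structure preservation with scikit-learn's trustworthiness score
# (1.0 = the 2-D neighborhoods perfectly match the original-space neighborhoods)
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X),
    "t-SNE": TSNE(n_components=2, random_state=42).fit_transform(X),
}

for name, emb in embeddings.items():
    score = trustworthiness(X, emb, n_neighbors=5)
    print(f"{name}: trustworthiness = {score:.3f}")
```

Exact scores vary by dataset and random seed, so treat them as a relative comparison between embeddings rather than an absolute quality grade.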

Installing UMAP

UMAP is not included in scikit-learn. Install it separately:

# Install UMAP (run once)
# pip install umap-learn

# Note: the package is 'umap-learn' but you import 'umap'
import umap

print("UMAP installed successfully!")
Package name: The pip package is umap-learn (not umap), but you import it as import umap or from umap import UMAP.

Applying UMAP in Python

import umap
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load and scale data
iris = load_iris()
X = iris.data
y = iris.target

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply UMAP
reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X_scaled)

print(f"Original shape: {X.shape}")  # (150, 4)
print(f"UMAP shape: {X_umap.shape}")  # (150, 2)
# Visualize UMAP results
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='viridis', alpha=0.7, s=50)
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')
plt.title('UMAP Visualization of Iris Dataset')
plt.colorbar(scatter, label='Species')
plt.show()

Key UMAP Parameters

UMAP's embedding is shaped mainly by two hyperparameters, n_neighbors and min_dist; n_components and metric are also worth knowing:

n_neighbors (default: 15)
Controls local vs global focus — the number of neighbors each point considers when building the manifold. Higher = broader view; lower = tighter focus.
  • Increase (30-100): for more global structure, smoother embedding
  • Decrease (5-10): for very local, fine details; small datasets

min_dist (default: 0.1)
Controls point spacing — the minimum distance between points in the low-D output. Think "cluster tightness". Lower = tighter clusters; higher = more spread out.
  • Lower (0.01-0.05): tight, dense clusters - good for clear separation
  • Higher (0.3-0.5): spread out - better for seeing individual points

n_components (default: 2)
Output dimensions. Usually 2 for visualization (2D plots), 3 for 3D plots, or higher for ML features.
  • 2 or 3: for visualization/plotting only
  • 10-50: when using as features for machine learning models

metric (default: 'euclidean')
Distance metric — how to measure "closeness" between points. Euclidean is standard straight-line distance; other options include cosine, manhattan, hamming, etc.
  • 'cosine': for text/document data (word vectors)
  • 'manhattan': for grid-like data, coordinates
  • 'euclidean': most data - a good default!
Quick Start Guide: For most tasks, use the defaults (n_neighbors=15, min_dist=0.1) first. Then experiment: try n_neighbors=5 (local focus) vs n_neighbors=50 (global focus), and min_dist=0.01 (tight) vs min_dist=0.5 (spread). See which reveals your data best!
# Experiment with parameters (reuses X_scaled and y from the example above)
fig, axes = plt.subplots(2, 2, figsize=(12, 12))

params = [
    {'n_neighbors': 5, 'min_dist': 0.1},
    {'n_neighbors': 50, 'min_dist': 0.1},
    {'n_neighbors': 15, 'min_dist': 0.01},
    {'n_neighbors': 15, 'min_dist': 0.5},
]

for ax, p in zip(axes.flat, params):
    reducer = umap.UMAP(**p, random_state=42)
    X_umap = reducer.fit_transform(X_scaled)
    ax.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='viridis', alpha=0.7)
    ax.set_title(f"n_neighbors={p['n_neighbors']}, min_dist={p['min_dist']}")

plt.tight_layout()
plt.show()

UMAP vs t-SNE Comparison

Speed
  • t-SNE (2012): Slow (O(n²)) - 10K points: 10-30 minutes
  • UMAP (2018): Fast (approximate neighbors) - 10K points: 1-2 minutes!

Global Structure
  • t-SNE: Poor preservation - only local neighborhoods are reliable
  • UMAP: Better preservation - both local and global structure are trustworthy

Transform New Data
  • t-SNE: Not supported - must rerun on all data every time
  • UMAP: Supported! - can transform() new test data

Scalability
  • t-SNE: Up to ~10K points - becomes impractical beyond this
  • UMAP: Millions of points! - scales beautifully to large datasets

Use as ML Features
  • t-SNE: Not recommended - visualization ONLY, don't use for training
  • UMAP: Can work well! - use as preprocessing for ML models

Winner?
  • t-SNE: Good for viz when you only need 2D plots of small data
  • UMAP: The modern default! Faster, more versatile, scales better
Bottom Line

  • For visualization: Try both! UMAP is usually faster and better, but t-SNE sometimes reveals different patterns.
  • For ML preprocessing: Use UMAP (or PCA). Never use t-SNE as features.
  • For large datasets (>10K points): UMAP is your only practical choice.

Using UMAP for ML Preprocessing

Unlike t-SNE, UMAP can be used as a preprocessing step for machine learning:

import umap
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits

# Load data
digits = load_digits()
X, y = digits.data, digits.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit UMAP on training data
reducer = umap.UMAP(n_components=10, random_state=42)
X_train_umap = reducer.fit_transform(X_train)

# Transform test data using fitted UMAP
X_test_umap = reducer.transform(X_test)

# Train classifier on UMAP features
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_umap, y_train)
print(f"Accuracy: {clf.score(X_test_umap, y_test):.2%}")
Best Practice

Choosing a Technique

  • PCA: Start here. Fast, interpretable, good for feature reduction and denoising
  • t-SNE: Best visualization for finding clusters. Use only for visualization
  • UMAP: Modern default. Fast, good visualization, can be used for ML features

Tip: For visualization, try both t-SNE and UMAP. For ML pipelines, prefer PCA or UMAP since they support transforming new data.

Practice Questions: UMAP

Test your understanding with these hands-on exercises.

Task: Use UMAP to reduce the iris dataset to 2 dimensions.

Show Solution
import umap
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X_scaled)

print(f"Shape: {X_umap.shape}")

Task: Apply UMAP with n_neighbors=30 and min_dist=0.05.

Show Solution
import umap
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

reducer = umap.UMAP(n_neighbors=30, min_dist=0.05, random_state=42)
X_umap = reducer.fit_transform(X_scaled)

print(f"Shape: {X_umap.shape}")

Task: Fit UMAP on training data and transform test data separately.

Show Solution
import umap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

reducer = umap.UMAP(n_components=2, random_state=42)
X_train_umap = reducer.fit_transform(X_train_scaled)
X_test_umap = reducer.transform(X_test_scaled)

print(f"Train: {X_train_umap.shape}, Test: {X_test_umap.shape}")

Task: Create a scatter plot of UMAP output colored by target class.

Show Solution
import umap
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)

reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
plt.scatter(X_umap[:, 0], X_umap[:, 1], c=iris.target, cmap='viridis')
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.colorbar(label='Species')
plt.title('UMAP of Iris Dataset')
plt.show()

Decision Flowchart: Which Technique to Use?

Confused about when to use PCA, t-SNE, or UMAP? Follow this visual guide to choose the right dimensionality reduction technique for your specific situation!

START

I need dimensionality reduction

Question 1: What is your PRIMARY goal?

Think about what matters most for your task

Feature Reduction

I want to reduce features for use in machine learning models

Visualization

I want beautiful 2D/3D plots to explore and present my data

Both

I need features for ML AND want to visualize my data

Use PCA

Fast & Efficient

Why PCA?
• Preserves maximum variance
• Linear, interpretable
• Can transform new data
• Perfect for ML pipelines

Tip: Keep 90-95% variance. Use n_components=0.95 for automatic selection.
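That tip can be sketched in a couple of lines: passing a float between 0 and 1 as n_components tells scikit-learn's PCA to keep just enough components to reach that fraction of the total variance. The digits dataset is used here purely as an illustration.

```python
# Keep enough principal components to explain 95% of the variance automatically
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)  # 64 features

pca = PCA(n_components=0.95)  # a float in (0, 1) = target variance fraction
X_reduced = pca.fit_transform(X)

print(f"Reduced 64 features to {pca.n_components_} components")
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.1%}")
```

After fitting, pca.n_components_ tells you how many components were actually kept, so you can sanity-check the reduction before wiring it into a pipeline.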

Go to Question 2

Need More Info

Multiple Options
For visualization, we need to consider:
• Dataset size
• Speed requirements
• Quality needs

Next: Answer questions about your data characteristics below.

Use UMAP

Best of Both!

Why UMAP?
• Great visualizations
• Can transform new data
• Faster than t-SNE
• Works for ML features

Tip: Use 2-3 components for viz, 10-50 for ML. Experiment with n_neighbors parameter.

Question 2: How large is your dataset?

(Only for visualization goal - if you chose feature reduction, you're already done with PCA!)

Small

< 1,000 samples
(e.g., Iris: 150 samples)

Medium

1K - 10K samples
(e.g., MNIST subset)

Large

> 10,000 samples
(e.g., Full MNIST, ImageNet)

t-SNE or UMAP

Both Work Great

Your Choice!
Small datasets run fast with either method. Try both and see which reveals clusters better!

t-SNE: perplexity=5-30
UMAP: n_neighbors=5-15

UMAP (Preferred)

Faster Choice

Why UMAP?
t-SNE starts getting slow here. UMAP provides similar quality much faster.

Settings: n_neighbors=15-30, min_dist=0.1 (default works well)

UMAP Only

t-SNE Too Slow

Must Use UMAP
t-SNE becomes impractical beyond 10K points. UMAP scales to millions!

Tip: For huge datasets, first reduce to ~50 dims with PCA, then apply UMAP.

Quick Reference Guide

PCA

Use when:

  • Need features for ML
  • Want speed & efficiency
  • Data is roughly linear
  • Interpretability matters

Linear Fast Transform

t-SNE

Use when:

  • Only need visualization
  • Small datasets (< 10K)
  • Want stunning cluster plots
  • Publication-quality viz

Non-linear Slow Viz Only

UMAP

Use when:

  • Need viz AND features
  • Any dataset size
  • Want speed + quality
  • Modern default choice

Non-linear Fast Versatile

Pro Tip: The Hybrid Approach

For large, high-dimensional datasets (like images with 784+ features):
1. Use PCA first to reduce to ~50 dimensions (fast, removes noise)
2. Then apply UMAP or t-SNE to reduce to 2D for visualization

This combination gives you: Speed + Quality + Efficiency!
X_pca = PCA(50).fit_transform(X) → X_viz = UMAP().fit_transform(X_pca)

Key Takeaways

PCA for Linear Reduction

Projects data onto orthogonal axes of maximum variance. Fast, interpretable, and great for preprocessing

Scree Plot for Component Selection

Look for the "elbow" where explained variance drops. Aim to retain 80-95% of total variance

t-SNE for Cluster Visualization

Excellent for revealing clusters in 2D/3D. Use perplexity 5-50 and only for visualization

UMAP is Faster and Versatile

Preserves global structure better than t-SNE. Good for both visualization and ML preprocessing

Always Scale Before Reduction

StandardScaler is essential for PCA. Features with larger scales will dominate otherwise
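A quick sketch makes the scaling point concrete: artificially inflating one iris feature (the 1000x factor is an illustrative exaggeration, as if it were recorded in different units) lets that feature's variance swallow PC1 entirely, while standardizing first restores a sensible split.

```python
# Demonstrate scale dominance: inflate one feature and watch PC1 lock onto it
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data.copy()
X[:, 0] *= 1000  # pretend sepal length was measured in different units

ratio_raw = PCA(n_components=1).fit(X).explained_variance_ratio_[0]
ratio_scaled = PCA(n_components=1).fit(
    StandardScaler().fit_transform(X)
).explained_variance_ratio_[0]

print(f"Unscaled: PC1 captures {ratio_raw:.1%} (dominated by the inflated feature)")
print(f"Scaled:   PC1 captures {ratio_scaled:.1%} (variance shared across features)")
```

Without scaling, PC1 is essentially just the inflated feature; with StandardScaler, all four features contribute on equal footing.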

Choose Based on Goal

PCA for preprocessing/speed, t-SNE for publication visuals, UMAP for general-purpose reduction

Knowledge Check

Quick Quiz

Test what you've learned about dimensionality reduction

1 What does PCA maximize when selecting principal components?
2 What is the purpose of a scree plot?
3 Which technique is best for visualizing high-dimensional data in 2D while preserving local structure?
4 What is the main advantage of UMAP over t-SNE?
5 Why is scaling important before applying PCA?
6 What does the perplexity parameter control in t-SNE?
Answer all questions to check your score