Why Reduce Dimensions?
Imagine trying to describe a person using 1000 measurements. Most of that information is redundant! Dimensionality reduction is the art of keeping what matters while discarding the noise. It speeds up algorithms, enables visualization, and often improves model performance by removing irrelevant features.
What is Dimensionality Reduction? (Simple Definition)
Dimensionality reduction means taking data with many features (columns) and creating a smaller, simpler version that still captures the essential patterns. Think of it like summarizing a 500-page book into a 10-page summary.
Real-Life Examples
- Compressing a 10MP photo to a thumbnail
- Summarizing customer data for visualization
- Reducing 1000 gene expressions to key patterns
- Simplifying image features for faster recognition
Key Question It Answers
"I have data with 100+ features. How can I reduce it to 2-3 dimensions for visualization, or 10-20 for faster ML training, without losing the important information?"
The Curse of Dimensionality
As the number of features (dimensions) grows, strange things happen to your data. This phenomenon is called the curse of dimensionality, and it creates serious problems for machine learning algorithms.
Understanding the Curse (Beginner's Guide)
Imagine you're playing hide and seek. In a 1D world (a line), it's easy to find someone. In a 2D world (a room), harder. In a 100D world? Impossible! That's the curse of dimensionality.
Data Becomes Sparse
Points spread out exponentially. Your 1000 samples that seemed dense in 2D become isolated in 100D.
Distances Become Meaningless
In high dimensions, all points are roughly equidistant. "Nearest neighbor" loses its meaning.
Models Overfit Easily
More features than samples? Your model memorizes noise instead of learning patterns.
Rule of Thumb: You need exponentially more data as dimensions increase. 10 samples per dimension is a minimum. With 100 features, you need at least 1000 samples!
# Visualizing the Curse of Dimensionality
import numpy as np
from sklearn.neighbors import NearestNeighbors
# Generate random data in increasing dimensions
np.random.seed(42)
n_samples = 1000
for dims in [2, 10, 50, 100, 500]:
    X = np.random.randn(n_samples, dims)
    # Find distances to nearest neighbor
    nn = NearestNeighbors(n_neighbors=2)
    nn.fit(X)
    distances, _ = nn.kneighbors(X)
    # Ratio of max to min distance (large in low dims, approaches 1 in high dims)
    ratio = distances[:, 1].max() / distances[:, 1].min()
    print(f"Dims: {dims:3d} | Avg NN dist: {distances[:, 1].mean():.2f} | Max/Min ratio: {ratio:.2f}")
This code generates random data points in increasing dimensions (2D, 10D, 50D, 100D, and 500D) to demonstrate the curse of dimensionality. For each dimension, we create 1000 random points using np.random.randn() and then use NearestNeighbors with n_neighbors=2 to find each point's closest neighbor (we need 2 because the first neighbor returned is always the point itself, at distance 0). The actual nearest-neighbor distances therefore live in column 1, extracted with distances[:, 1].
The key observation is what happens to the max/min distance ratio. In low dimensions (like 2D), some points are very close while others are far apart, giving a large ratio. As dimensions increase, all points become roughly the same distance from each other and the ratio approaches 1. This is the curse of dimensionality in action: when all points are equidistant, the concept of "nearest neighbor" becomes meaningless, which is why distance-based algorithms (KNN, K-Means, DBSCAN) struggle in high-dimensional spaces.
Two Approaches to Dimensionality Reduction
Feature Extraction
Create new features by combining existing ones. The new features are transformations (like weighted averages) of the originals.
Examples: PCA, t-SNE, UMAP
Feature Selection
Keep only the best original features and discard the rest. No transformation, just selection based on importance.
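As a concrete sketch of the selection approach, scikit-learn ships VarianceThreshold and RFE; the dataset, threshold, and estimator below are illustrative choices, not the only options:

```python
# Feature selection keeps original columns; nothing is transformed
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Variance-based: drop features that barely vary across samples
vt = VarianceThreshold(threshold=0.01)
X_vt = vt.fit_transform(X)
print(f"VarianceThreshold kept {X_vt.shape[1]} of {X.shape[1]} features")

# RFE: recursively discard the weakest feature according to a model
X_scaled = StandardScaler().fit_transform(X)
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
rfe.fit(X_scaled, y)
print("RFE selected:", list(cancer.feature_names[rfe.support_]))
```

Because the selected features are original columns, they stay directly interpretable — unlike extracted components.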
Examples: Variance, Correlation, RFE
# Quick Demo: High-dimensional data
from sklearn.datasets import load_digits
import numpy as np
# Load handwritten digits dataset
digits = load_digits()
X, y = digits.data, digits.target
print(f"Original shape: {X.shape}") # (1797, 64) - 64 pixel values per image
print(f"Each digit is 8x8 = 64 features")
print(f"Sample values: {X[0, :10]}...") # First 10 pixel values of first image
print(f"Target labels: {np.unique(y)}") # Digits 0-9
This code loads the famous handwritten digits dataset from scikit-learn using load_digits(). Each handwritten digit (0 through 9) is stored as an 8×8 pixel grayscale image, so each sample has 64 features — one brightness value per pixel. Checking X.shape gives (1797, 64): 1797 digit images, each represented by 64 numbers. The y array contains the corresponding labels (0-9) telling us which digit each image actually represents.
This dataset is perfect for demonstrating dimensionality reduction: 64 dimensions is impossible for humans to visualize directly, but techniques like PCA, t-SNE, and UMAP can compress the data down to just 2 dimensions so we can plot it and see whether digits of the same type naturally cluster together. Spoiler: they do! Similar digits (like all the 3s) tend to be close in the high-dimensional space, and dimensionality reduction helps us visualize this structure.
Practice Questions: Introduction
Test your understanding with these coding challenges.
Task: Load the breast cancer dataset and examine its dimensionality.
Show Solution
from sklearn.datasets import load_breast_cancer
import numpy as np
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
print(f"Shape: {X.shape}") # (569, 30)
print(f"Features: {cancer.feature_names[:5]}...")
print(f"Classes: {cancer.target_names}")
print(f"Samples per class: {np.bincount(y)}")
Task: Compare average pairwise distances as dimensions increase.
Show Solution
from sklearn.metrics import pairwise_distances
import numpy as np
np.random.seed(42)
n_samples = 100
print("Dims | Avg Distance | Std Distance")
print("-" * 35)
for dims in [2, 5, 10, 50, 100]:
    X = np.random.randn(n_samples, dims)
    distances = pairwise_distances(X)
    # Get upper triangle (exclude diagonal)
    upper = distances[np.triu_indices(n_samples, k=1)]
    print(f"{dims:4d} | {upper.mean():12.2f} | {upper.std():12.2f}")
Task: Find which features have the highest and lowest variance.
Show Solution
from sklearn.datasets import load_breast_cancer
import numpy as np
cancer = load_breast_cancer()
X = cancer.data
feature_names = cancer.feature_names
variances = np.var(X, axis=0)
sorted_idx = np.argsort(variances)[::-1]
print("Top 5 highest variance features:")
for i in sorted_idx[:5]:
    print(f"  {feature_names[i]}: {variances[i]:.2f}")
print("\nTop 5 lowest variance features:")
for i in sorted_idx[-5:]:
    print(f"  {feature_names[i]}: {variances[i]:.4f}")
Principal Component Analysis (PCA)
PCA is the workhorse of dimensionality reduction. It finds the directions (principal components) along which your data varies the most, then projects your data onto these directions. Think of it as finding the best camera angle to photograph a 3D object as a 2D picture.
Understanding PCA (Beginner's Guide)
PCA (Principal Component Analysis) finds the "best" way to squish high-dimensional data into fewer dimensions while keeping as much information as possible.
PC1: Most Variance
The first component points in the direction where data spreads the most.
PC2: Second Most
Perpendicular to PC1, captures the next most variance.
Keep Top K
Usually 2-3 for visualization, or enough to capture 95% variance.
PCA Superpower: It's fast, deterministic (same result every time), and the components are interpretable. You can see which original features contribute most to each component!
How PCA Works
PCA Steps (Simplified)
- Center the data - Subtract the mean from each feature
- Calculate covariance matrix - How features vary together
- Find eigenvectors - These are the principal component directions
- Sort by eigenvalues - Higher = more variance explained
- Project data - Transform onto top K components
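The five steps above can be spelled out directly in NumPy — a didactic sketch (scikit-learn's PCA uses SVD internally, but the projection matches up to a sign flip of each component axis):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# 1. Center the data (already near zero-mean after scaling, but be explicit)
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix (features x features)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigenvectors of the covariance matrix = component directions
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort by eigenvalue, descending (eigh returns ascending order)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Project onto the top 2 components
X_manual = X_centered @ eigvecs[:, :2]

# Should agree with scikit-learn up to the sign of each axis
X_sklearn = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(X_manual), np.abs(X_sklearn)))
```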
Important Notes
- Always scale first! - PCA is sensitive to feature scales
- Linear only - Can't capture curved relationships
- Deterministic - Same input = same output
- Interpretable - Components are weighted sums of features
# Basic PCA with Scikit-learn
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import numpy as np
# Load and scale data
iris = load_iris()
X, y = iris.data, iris.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(f"Original shape: {X_scaled.shape}") # (150, 4)
# Apply PCA - reduce to 2 dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"Reduced shape: {X_pca.shape}") # (150, 2)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {sum(pca.explained_variance_ratio_):.2%}")
This code demonstrates the complete PCA workflow on the Iris dataset, reducing it from 4 features down to 2. The first critical step is standardization with StandardScaler().fit_transform(X), which scales every feature to mean 0 and standard deviation 1. This step is essential for PCA: without it, features with naturally larger values (measurements in centimeters vs millimeters, say) would dominate the principal components purely because of their scale, not their actual importance.
Next, PCA(n_components=2) requests exactly 2 output dimensions. The fit_transform() method does two things in one step: it fits PCA by learning the principal components (the directions of maximum variance in the data), then transforms the data by projecting it onto these new axes.
The explained_variance_ratio_ attribute is incredibly useful — it tells us what percentage of the original variance each component captures. For example, [0.73, 0.22] means the first principal component captures 73% of the variance and the second captures 22%, for a total of 95% of the original information preserved in just 2 dimensions instead of 4.
Choosing the Number of Components
How many components should you keep? Two common approaches:
Choosing n_components
Option 1: Specify a number
PCA(n_components=2) - Get exactly 2 components
Option 2: Specify variance to keep
PCA(n_components=0.95) - Keep 95% of variance
# Finding Optimal Number of Components
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits
import numpy as np
# Load digits dataset (64 features)
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
# Fit PCA with all components
pca_full = PCA()
pca_full.fit(X)
# Cumulative explained variance
cumsum = np.cumsum(pca_full.explained_variance_ratio_)
# Find number of components for different thresholds
for threshold in [0.80, 0.90, 0.95, 0.99]:
    n_components = np.argmax(cumsum >= threshold) + 1
    print(f"{threshold:.0%} variance: {n_components} components (from 64)")
# Or let PCA decide automatically
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X)
print(f"\nAuto 95%: {pca_95.n_components_} components")
This code shows two approaches for determining how many principal components to keep. First, we fit PCA with no n_components argument, which computes ALL possible components (64 for the digits dataset). We then use np.cumsum() to turn the explained variance ratios into a running total — something like [0.12, 0.22, 0.10, ...] becomes [0.12, 0.34, 0.44, ...], showing how much total variance is explained as components are added. np.argmax(cumsum >= 0.95) finds the first index where the cumulative variance reaches or exceeds 95%, and adding 1 converts the zero-based index into a component count.
There's an even easier way: pass a decimal directly, as in PCA(n_components=0.95), and scikit-learn automatically determines how many components are needed to retain 95% of the variance. For the 64-feature digits dataset, about 29 components typically suffice — a reduction of more than half the features while losing only 5% of the variance. This technique helps you balance dimensionality reduction against information loss.
Scree Plot: Visualizing Explained Variance
A scree plot is a visual tool for choosing the number of components. It shows the explained variance of each component, helping you identify the "elbow" where adding more components provides diminishing returns.
# Step 1: Import libraries and load data
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits
import numpy as np
import matplotlib.pyplot as plt
# Load and scale the digits dataset
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
print(f"Dataset shape: {X.shape}") # (1797, 64)
We start by importing all necessary libraries: PCA for dimensionality reduction, StandardScaler for normalizing our data, load_digits to get the handwritten digits dataset, numpy for numerical operations, and matplotlib.pyplot for creating visualizations. After loading the digits dataset, we immediately scale it using StandardScaler because PCA is sensitive to feature scales — without scaling, features with larger numerical values would unfairly dominate the principal components. The digits dataset has 1797 samples (handwritten digit images) with 64 features each (8×8 pixels).
# Step 2: Fit PCA with ALL components to analyze variance
pca = PCA() # No n_components = keep all 64
pca.fit(X)
# Check how much variance each component explains
print(f"Total components: {len(pca.explained_variance_ratio_)}")
print(f"First 5 components explain: {pca.explained_variance_ratio_[:5]}")
Here we create a PCA object without specifying n_components, which tells scikit-learn to keep ALL principal components (64 in this case). This is essential for scree plots because we need to see the variance explained by every component to make an informed decision about how many to keep. The explained_variance_ratio_ attribute is an array where each element tells us what fraction of the total variance that component captures. For example, if the first element is 0.12, it means PC1 explains 12% of the total variance in the data. The components are automatically sorted from highest to lowest variance.
# Step 3: Create the Scree Plot (bar chart)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# Plot individual variance for first 20 components
axes[0].bar(range(1, 21), pca.explained_variance_ratio_[:20], alpha=0.7, color='steelblue')
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance Ratio')
axes[0].set_title('Scree Plot - Individual Variance')
axes[0].set_xticks(range(1, 21))
This creates the classic scree plot as a bar chart. The name "scree plot" comes from geology — scree is the pile of broken rocks at the base of a cliff, and the shape of this plot resembles that slope. We use plt.subplots(1, 2) to create a figure with two side-by-side plots. The first plot shows the first 20 components as bars, where the height of each bar represents how much variance that single component explains. You'll typically see the first few bars being tall (PC1 might explain 12%, PC2 might explain 10%), then a steep drop-off, followed by a long "tail" of short bars. The "elbow" is where the bars stop dropping quickly and become uniformly short — that's your sweet spot for choosing components.
# Step 4: Create Cumulative Variance Plot
cumsum = np.cumsum(pca.explained_variance_ratio_)
axes[1].plot(range(1, len(cumsum)+1), cumsum, 'bo-', markersize=4)
axes[1].axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
axes[1].axhline(y=0.90, color='orange', linestyle='--', label='90% threshold')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Explained Variance')
axes[1].set_title('Cumulative Variance Explained')
axes[1].legend()
axes[1].set_xlim(0, 50)
plt.tight_layout()
plt.show()
The second plot shows cumulative explained variance, which is often more practical for decision-making. We use np.cumsum() to calculate a running total — if PC1 explains 12% and PC2 explains 10%, the cumsum would be [0.12, 0.22, ...]. This tells us "with N components, how much total variance do I capture?" We add horizontal reference lines at 90% (orange) and 95% (red) thresholds using axhline() — these are common targets in practice. You can simply look at where your curve crosses these lines to determine how many components you need. For instance, if the curve crosses 0.95 at x=29, you need 29 components to retain 95% of the variance. The plt.tight_layout() ensures the plots don't overlap.
# Step 5: Print variance summary for first 10 components
print("Variance per component (first 10):")
print("-" * 40)
for i in range(10):
    individual = pca.explained_variance_ratio_[i]
    cumulative = cumsum[i]
    print(f"  PC{i+1}: {individual:.3f} ({cumulative:.1%} cumulative)")
Finally, we print a numerical summary showing both individual and cumulative variance for the first 10 components. This gives you exact numbers instead of reading from the plot. For the digits dataset, you might see output like: PC1 explains 0.120 (12.0% cumulative), PC2 explains 0.097 (21.7% cumulative), and so on. By PC10, you might already have ~60% cumulative variance. This table helps you make precise decisions — if you need exactly 95% variance, you can see exactly which component crosses that threshold. The combination of visual plots and numerical output gives you a complete picture for choosing the optimal number of components.
Interpreting Principal Components
# Understanding What Each Component Means
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2)
pca.fit(X_scaled)
# The components_ attribute shows feature loadings
print("Feature names:", iris.feature_names)
print("\nPC1 loadings (which features contribute most):")
for name, loading in zip(iris.feature_names, pca.components_[0]):
    print(f"  {name}: {loading:.3f}")
print("\nPC2 loadings:")
for name, loading in zip(iris.feature_names, pca.components_[1]):
    print(f"  {name}: {loading:.3f}")
# Higher absolute value = more important for that component
This code reveals what's happening "under the hood" of PCA by examining the pca.components_ attribute, which contains the "recipe" for each principal component. Think of each component as a weighted combination of the original features: components_ is a matrix where each row is one principal component, and each column shows how much the corresponding original feature contributes to it. These weights are called "loadings".
For example, if the loading for "petal length" is 0.5, petal length contributes positively to that component — samples with longer petals get higher values on it. A negative loading like -0.3 means the feature pulls in the opposite direction. The magnitude (absolute value) of each loading tells you how important the feature is: loadings close to 0 mean the feature barely influences the component, while loadings close to 1 or -1 mean it is very influential.
This interpretability is one of PCA's biggest advantages over methods like t-SNE and UMAP! You can actually explain your components in human terms, like "PC1 seems to represent overall flower size because all measurement features have high positive loadings."
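You can verify that a component score really is just this weighted sum: a sample's PC1 value equals the dot product of its centered features with the PC1 loadings. A small self-contained check:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# transform() computes (x - mean_) @ components_.T; reproduce it for sample 0
manual_pc1 = (X_scaled[0] - pca.mean_) @ pca.components_[0]
print(f"Manual PC1 score:  {manual_pc1:.4f}")
print(f"sklearn PC1 score: {X_pca[0, 0]:.4f}")
```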
PCA Limitations
- Linear only: Can't capture non-linear relationships (use t-SNE/UMAP)
- Scale sensitive: Always standardize your data first!
- Assumes Gaussian: Works best when data is normally distributed
- Maximizes variance: Not always what you want for classification
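The "linear only" limitation has an in-family workaround: scikit-learn's KernelPCA. A sketch on make_circles, a toy dataset of two concentric rings that no linear projection can separate (the gamma value here is an illustrative choice for this dataset, not a universal default):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric rings: linearly inseparable by construction
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=42)

# Linear PCA can only rotate the data; the rings stay entangled
X_lin = PCA(n_components=2).fit_transform(X)

# RBF-kernel PCA can unfold the rings along its leading components
X_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# Compare class separation along the first component of each embedding
for name, Z in [("Linear PCA", X_lin), ("Kernel PCA", X_rbf)]:
    gap = abs(Z[y == 0, 0].mean() - Z[y == 1, 0].mean())
    print(f"{name}: mean PC1 gap between classes = {gap:.3f}")
```

Plotting X_rbf colored by y makes the unfolding visible; with linear PCA the two rings remain nested.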
Practice Questions: PCA
Test your understanding with these coding challenges.
Task: Reduce the Wine dataset from 13 features to 3 components.
Show Solution
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)
print(f"Original: {wine.data.shape}")
print(f"Reduced: {X_pca.shape}")
print(f"Variance explained: {pca.explained_variance_ratio_}")
print(f"Total: {sum(pca.explained_variance_ratio_):.2%}")
Task: Determine how many components capture 95% of variance in the digits dataset.
Show Solution
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data)
# Method 1: Specify variance threshold
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(f"Components for 95%: {pca.n_components_}")
# Method 2: Manual calculation
pca_full = PCA().fit(X_scaled)
cumsum = np.cumsum(pca_full.explained_variance_ratio_)
n_95 = np.argmax(cumsum >= 0.95) + 1
print(f"Verified: {n_95} components")
Task: Show the difference between PCA with and without scaling.
Show Solution
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
wine = load_wine()
X = wine.data
# Without scaling
pca_unscaled = PCA(n_components=2)
X_unscaled = pca_unscaled.fit_transform(X)
# With scaling
X_scaled = StandardScaler().fit_transform(X)
pca_scaled = PCA(n_components=2)
X_pca_scaled = pca_scaled.fit_transform(X_scaled)
print("Unscaled PCA - Variance explained:", pca_unscaled.explained_variance_ratio_)
print("Scaled PCA - Variance explained:", pca_scaled.explained_variance_ratio_)
print("\nNote: Unscaled is dominated by high-variance features!")
Task: Apply PCA, then inverse transform to reconstruct the original data.
Show Solution
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data)
for n in [2, 10, 30, 64]:
    pca = PCA(n_components=n)
    X_reduced = pca.fit_transform(X_scaled)
    X_reconstructed = pca.inverse_transform(X_reduced)
    # Calculate reconstruction error
    error = np.mean((X_scaled - X_reconstructed) ** 2)
    print(f"n={n:2d}: Reconstruction MSE = {error:.4f}")
Task: Find which features contribute most to PC1 in the breast cancer dataset.
Show Solution
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)
pca = PCA(n_components=2)
pca.fit(X_scaled)
# Get absolute loadings for PC1
loadings = np.abs(pca.components_[0])
sorted_idx = np.argsort(loadings)[::-1]
print("Top 5 features contributing to PC1:")
for i in sorted_idx[:5]:
    print(f"  {cancer.feature_names[i]}: {loadings[i]:.3f}")
t-SNE (t-Distributed Stochastic Neighbor Embedding)
While PCA is great for preserving global structure, t-SNE excels at revealing local clusters in high-dimensional data. It's the go-to technique for creating stunning 2D visualizations of complex datasets like gene expressions, word embeddings, or image features.
Understanding t-SNE (Beginner's Guide)
t-SNE is a visualization technique that groups similar points together in 2D/3D, even when the original data has hundreds of dimensions. It's like creating a map where friends stand close together.
Preserves Neighbors
Points that are close in high dimensions stay close in low dimensions.
Non-deterministic
Different runs may give slightly different results. Use random_state for reproducibility.
Slow on Large Data
O(n^2) complexity. For big datasets, use PCA first to reduce dimensions.
Key Parameter - Perplexity: Controls how many neighbors each point considers. Low (5-10) = focus on very local structure. High (30-50) = consider broader neighborhood. Try values between 5 and 50!
# Step 1: Import libraries and load data
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import numpy as np
# Load handwritten digits dataset
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target
print(f"Original shape: {X.shape}") # (1797, 64)
We start by importing TSNE from scikit-learn's manifold module — t-SNE is classified as a "manifold learning" technique because it tries to unfold the complex curved surface (manifold) on which high-dimensional data lives. We load the digits dataset which contains 1797 images of handwritten digits, each represented as 64 pixel values (8×8 images). Just like with PCA, we apply StandardScaler to normalize the data so all features have similar scales. We also keep the target labels y (digits 0-9) so we can color-code our visualization later to see if t-SNE correctly groups similar digits together.
# Step 2: Create t-SNE object with key parameters
tsne = TSNE(
    n_components=2,   # Output dimensions (2D for visualization)
    perplexity=30,    # How many neighbors to consider (5-50 typical)
    random_state=42,  # For reproducible results
    max_iter=1000     # Optimization iterations (named n_iter before scikit-learn 1.5)
)
print("t-SNE configured with:")
print(f"  - Output dimensions: {tsne.n_components}")
print(f"  - Perplexity: {tsne.perplexity}")
print(f"  - Iterations: {tsne.max_iter}")
Here we configure t-SNE with its most important parameters. n_components=2 requests a 2D output for plotting — unlike PCA, t-SNE is almost never used for more than 3 dimensions because its purpose is visualization. perplexity=30 is the key hyperparameter: it roughly means "each point considers about 30 neighbors when deciding where to position itself." Think of perplexity as the "attention span" of each point. Lower values (5-10) make points focus only on their immediate neighbors, creating tight local clusters; higher values (30-50) make points aware of a broader neighborhood, showing more global structure.
We set random_state=42 because t-SNE uses random initialization and stochastic optimization — without it, you'd get different results each run! Finally, max_iter=1000 (this parameter was called n_iter in scikit-learn versions before 1.5) controls how many optimization steps t-SNE takes to arrange the points; more iterations usually mean better results but take longer.
# Step 3: Apply t-SNE transformation
X_tsne = tsne.fit_transform(X)
print(f"t-SNE output shape: {X_tsne.shape}") # (1797, 2)
print(f"Sample coordinates: {X_tsne[0]}") # [x, y] for first point
print(f"X range: [{X_tsne[:,0].min():.1f}, {X_tsne[:,0].max():.1f}]")
print(f"Y range: [{X_tsne[:,1].min():.1f}, {X_tsne[:,1].max():.1f}]")
Now we apply t-SNE using fit_transform(X). This is a crucial difference from PCA: t-SNE only has fit_transform(), NOT separate fit() and transform() methods. This means you cannot train t-SNE on one dataset and then embed new points later — if you get new data, you must re-run t-SNE on the entire dataset including the new points. The output X_tsne has shape (1797, 2), meaning each of our 1797 digit images now has just 2 coordinates that we can plot on a 2D scatter plot. The actual coordinate values are arbitrary (they might range from -50 to +50 or any other range) — what matters is the relative positions of points: similar digits should cluster together. When you plot this, you'll see 10 distinct clusters corresponding to digits 0-9!
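A minimal plot of such an embedding (subsampled to 500 digits to keep the run quick; the Agg backend writes the figure to a file — drop that line to display it interactively):

```python
import matplotlib
matplotlib.use("Agg")  # write to file instead of opening a window
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = StandardScaler().fit_transform(digits.data)[:500]
y = digits.target[:500]

X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

# One color per digit class; similar digits should fall into the same blob
sc = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap="tab10", s=10)
plt.colorbar(sc, label="digit")
plt.title("t-SNE embedding of 500 handwritten digits")
plt.savefig("tsne_digits.png", dpi=150)
print("Saved tsne_digits.png")
```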
Perplexity: The Key Parameter
Perplexity roughly corresponds to the number of nearest neighbors considered for each point. Different values reveal different aspects of your data's structure.
# Step 1: Prepare data for perplexity comparison
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
print(f"Testing perplexity values on {X.shape[0]} samples")
We set up an experiment to see how different perplexity values affect t-SNE's output. We import NearestNeighbors which we'll use later to measure how spread out the resulting clusters are. Perplexity is a crucial hyperparameter that essentially controls the "scale" at which t-SNE looks at your data — should it focus on very local neighborhoods (low perplexity) or consider broader regions (high perplexity)? There's no single correct answer, which is why we'll test multiple values and compare the results.
# Step 2: Run t-SNE with different perplexity values
perplexity_values = [5, 15, 30, 50]
for perplexity in perplexity_values:
    # Apply t-SNE with current perplexity
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    X_tsne = tsne.fit_transform(X)
    print(f"Perplexity {perplexity}: Embedding complete")
We test four different perplexity values: 5, 15, 30, and 50. With perplexity=5, each point only "cares about" its 5 closest neighbors — this creates many small, tight, well-separated clusters, but you might miss the bigger picture of how clusters relate to each other. With perplexity=15-30, you get a balanced view that works well for most datasets. With perplexity=50, points consider a much wider neighborhood, which can cause smaller clusters to merge together but gives a better sense of overall structure. A good rule of thumb: perplexity should be less than the number of points in your smallest expected cluster, and definitely less than the total number of samples.
# Step 3: Quantify the effect - measure average neighbor distances
for perplexity in perplexity_values:
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    X_tsne = tsne.fit_transform(X)
    # Find distances to 5 nearest neighbors in the embedding
    nn = NearestNeighbors(n_neighbors=6).fit(X_tsne)  # 6 because first is self
    distances, _ = nn.kneighbors(X_tsne)
    avg_dist = distances[:, 1:].mean()  # Exclude self (column 0)
    print(f"Perplexity {perplexity:2d}: Avg neighbor distance = {avg_dist:.3f}")
To objectively compare embeddings, we measure the average distance between neighboring points using NearestNeighbors. We use n_neighbors=6 because the first neighbor is always the point itself (with distance 0), so distances[:, 1:] gives us the 5 actual nearest neighbors. Lower average distance means points are packed more tightly together. You'll typically see that low perplexity values create tighter local clusters (lower average distance), while high perplexity values spread points out more evenly (higher average distance). This metric helps you quantify what you're seeing visually in the plots. The best practice is to try 2-3 different perplexity values, visualize each one, and choose the one that best reveals the structure you're interested in.
t-SNE Caveats
- Cluster sizes are meaningless: t-SNE can make small clusters look big
- Distances between clusters are arbitrary: Don't interpret far-apart clusters as more different
- Random initialization: Run multiple times to ensure stability
- Not for new data: Can't transform unseen points (no transform method)
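Since t-SNE cannot embed unseen points directly, one common workaround (an approximation layered on top, not part of t-SNE itself) is to fit a regressor that maps original features to the learned embedding, then apply it to new samples:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
X_train, X_new = X[:500], X[500:520]  # pretend the last 20 arrived later

# Fit t-SNE on the training points only
X_embedded = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_train)

# Learn the mapping: original 64-D space -> 2-D embedding
mapper = KNeighborsRegressor(n_neighbors=5).fit(X_train, X_embedded)

# Place unseen points near their neighbors' embedding coordinates
new_coords = mapper.predict(X_new)
print(f"Embedded {len(new_coords)} new points, shape: {new_coords.shape}")
```

Libraries like openTSNE provide a true transform() if you need this regularly.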
Best Practices for t-SNE
# Step 1: Load and prepare data for speed comparison
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import time
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
print(f"Dataset shape: {X.shape}")
print(f"We'll compare two approaches...")
We're setting up a speed comparison experiment to demonstrate why PCA before t-SNE is a professional best practice. The digits dataset has 64 features (8×8 pixel images), which is high enough that t-SNE will take noticeable time. We import the time module to measure execution time for each approach. The question we're answering: Is it faster to run t-SNE directly on all 64 dimensions, or to first reduce to fewer dimensions with PCA?
# Step 2: Direct t-SNE on original dimensions (the slow way)
start = time.time()
tsne_direct = TSNE(n_components=2, random_state=42)
X_direct = tsne_direct.fit_transform(X)
direct_time = time.time() - start
print(f"Direct t-SNE time: {direct_time:.2f} seconds")
This is the straightforward approach that most beginners use: apply t-SNE directly to the full 64-dimensional data. We record the time before and after using time.time() to measure how long it takes. The reason this is slow is that t-SNE must compute distances between every pair of points in 64-dimensional space — that's a lot of calculations! For N samples with D dimensions, the complexity scales with both N² (all pairs) and D (dimensions). This approach works fine for small datasets, but becomes painfully slow as data size or dimensionality increases.
# Step 3: First, reduce dimensions with PCA (very fast)
start = time.time()
pca = PCA(n_components=50)
X_pca = pca.fit_transform(X)
pca_time = time.time() - start
print(f"PCA reduction time: {pca_time:.4f} seconds")
print(f"Reduced from {X.shape[1]} to {X_pca.shape[1]} dimensions")
print(f"Variance retained: {sum(pca.explained_variance_ratio_):.2%}")
Here we apply PCA first, reducing from 64 dimensions to 50. This is extremely fast — typically just milliseconds even for large datasets — because PCA is a linear operation that just multiplies matrices. The explained_variance_ratio_ tells us how much information we kept; with 50 components, we typically retain 95%+ of the original variance, meaning we've lost almost nothing important while making the data much easier to work with. This preprocessing step removes noise and redundancy, leaving only the most important structure.
# Step 4: Now run t-SNE on the PCA-reduced data
start = time.time()
tsne = TSNE(n_components=2, random_state=42)
X_pca_tsne = tsne.fit_transform(X_pca)
tsne_time = time.time() - start
print(f"t-SNE on PCA output: {tsne_time:.2f} seconds")
Now we run t-SNE on the PCA-reduced data instead of the original. Even though we're still computing the same embedding, t-SNE now only needs to calculate distances in 50 dimensions instead of 64 — and more importantly, those 50 dimensions contain cleaner, more meaningful information without noise. The result is usually just as good (often better!) than direct t-SNE, but computed faster.
# Step 5: Compare the two approaches
total_pca_tsne_time = pca_time + tsne_time
speedup = direct_time / total_pca_tsne_time
print(f"\n=== Speed Comparison ===")
print(f"Direct t-SNE: {direct_time:.2f} seconds")
print(f"PCA + t-SNE: {total_pca_tsne_time:.2f} seconds")
print(f"Speedup: {speedup:.1f}x faster!")
print(f"\nRecommendation: Always use PCA first for data > 50 dimensions")
We compare the total time of both approaches. The PCA + t-SNE combination is typically 2-10x faster, and the speedup becomes even more dramatic with higher-dimensional data (like images with thousands of pixels or text embeddings with hundreds of dimensions). The original t-SNE authors themselves recommend this preprocessing step for any dataset with more than about 50 features. Beyond speed, PCA removes noise that might confuse t-SNE, often producing cleaner visualizations. This two-step approach is considered the professional standard and should be your default workflow for any serious dimensionality reduction task!
Practice Questions: t-SNE
Test your understanding with these coding challenges.
Task: Visualize the Iris dataset in 2D using t-SNE.
Show Solution
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
print(f"Original: {iris.data.shape}")
print(f"t-SNE: {X_tsne.shape}")
print(f"Ranges: X=[{X_tsne[:,0].min():.1f}, {X_tsne[:,0].max():.1f}]")
Task: Run t-SNE with perplexity 5, 30, and 100 on digits data.
Show Solution
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target
for perplexity in [5, 30, 100]:
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    X_tsne = tsne.fit_transform(X)
    score = silhouette_score(X_tsne, y)
    print(f"Perplexity {perplexity:3d}: Silhouette = {score:.3f}")
Task: Compare cluster separation using silhouette scores.
Show Solution
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target
# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
pca_score = silhouette_score(X_pca, y)
# t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
tsne_score = silhouette_score(X_tsne, y)
print(f"PCA Silhouette: {pca_score:.3f}")
print(f"t-SNE Silhouette: {tsne_score:.3f}")
print("t-SNE usually gives better cluster separation!")
Task: Test different PCA dimensions before t-SNE and measure performance.
Show Solution
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import time
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target
print("PCA Dims | Time (s) | Silhouette")
print("-" * 35)
for n_pca in [10, 20, 30, 50, 64]:
    start = time.time()
    if n_pca < 64:
        X_pca = PCA(n_components=n_pca).fit_transform(X)
    else:
        X_pca = X
    X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X_pca)
    elapsed = time.time() - start
    score = silhouette_score(X_tsne, y)
    print(f"{n_pca:8d} | {elapsed:8.2f} | {score:.3f}")
UMAP (Uniform Manifold Approximation and Projection)
UMAP is the newer, faster alternative to t-SNE. It preserves both local and global structure, runs much faster on large datasets, and supports transforming new data points. It's becoming the default choice for visualization in modern ML workflows.
Understanding UMAP (Beginner's Guide)
UMAP is like t-SNE's faster, smarter cousin. It creates beautiful visualizations while also preserving the "big picture" structure of your data, and it can handle new points without rerunning!
Much Faster
Handles millions of points where t-SNE would take hours.
Global Structure
Preserves cluster relationships, not just local neighborhoods.
Transform New Data
Has a transform() method for new points!
Key: n_neighbors
Like perplexity in t-SNE. Higher = more global view.
Installation: UMAP is not in scikit-learn. Install with: pip install umap-learn
# Basic UMAP Usage
# First install: pip install umap-learn
import umap
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
# Load and scale data
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target
print(f"Original shape: {X.shape}") # (1797, 64)
# Apply UMAP
reducer = umap.UMAP(
    n_components=2,   # Output dimensions
    n_neighbors=15,   # Local neighborhood size (like perplexity)
    min_dist=0.1,     # How tightly points cluster
    random_state=42
)
X_umap = reducer.fit_transform(X)
print(f"UMAP shape: {X_umap.shape}") # (1797, 2)
This code applies UMAP (Uniform Manifold Approximation and Projection) to reduce the 64-dimensional digits dataset to 2D for visualization. Note that UMAP is a separate package — you need to install it with pip install umap-learn (not just "umap"). Like t-SNE, we typically use n_components=2 for visualization purposes. The n_neighbors=15 parameter is similar to t-SNE's perplexity — it controls how many neighbors UMAP considers for each point when building its high-dimensional graph. Lower values focus on very local structure (tight clusters), while higher values capture more global relationships (smoother, more connected embeddings). The min_dist=0.1 parameter controls how tightly UMAP packs points together in the output. With min_dist=0.0, points can be placed right on top of each other, creating very dense, compact clusters. With higher values like 0.5 or 1.0, points are spread out more evenly. Like t-SNE, UMAP is stochastic (involves randomness), so we set random_state=42 for reproducible results. The major advantages of UMAP over t-SNE are speed (10-100x faster on large datasets), better preservation of global structure (distances between clusters are more meaningful), and the ability to transform new data (which we'll see next). For most modern applications, UMAP has become the go-to choice for non-linear dimensionality reduction.
UMAP Parameters
n_neighbors
Controls how UMAP balances local vs global structure.
- Low (2-5): Very local, may lose global structure
- Medium (15-30): Good balance (default: 15)
- High (50+): More global, smoother embeddings
min_dist
Minimum distance between points in the output.
- Low (0.0-0.1): Tight, dense clusters
- Medium (0.1-0.5): Balanced spacing
- High (0.5-1.0): Points spread out more
# UMAP Can Transform New Data (Unlike t-SNE!)
import umap
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit on training data
reducer = umap.UMAP(n_components=2, random_state=42)
reducer.fit(X_train)
# Transform both sets
X_train_umap = reducer.transform(X_train)
X_test_umap = reducer.transform(X_test)
print(f"Train shape: {X_train_umap.shape}")
print(f"Test shape: {X_test_umap.shape}")
print("UMAP can embed new data without refitting!")
This code demonstrates UMAP's killer feature that sets it apart from t-SNE: the ability to transform new, unseen data points without refitting the entire model. We first split our data using train_test_split into 80% training and 20% test sets to simulate a real-world scenario where you have training data and later receive new data. We then call reducer.fit(X_train) which learns the embedding structure from only the training data. After fitting, we can use reducer.transform(X_train) to get the 2D coordinates for our training points, but here's the magic: we can also call reducer.transform(X_test) to embed completely new test data that UMAP never saw during training! The new points are projected into the same 2D space in a way that's consistent with the original embedding. This is impossible with t-SNE — if you want to add even a single new point with t-SNE, you must re-run the entire algorithm on all your data from scratch. UMAP's transform capability is crucial for production machine learning pipelines: you can train UMAP once on your training data, save the fitted model, and then quickly embed new data points as they arrive without expensive recomputation. This makes UMAP suitable for real-time applications, streaming data, and any scenario where you need to process new samples on-the-fly.
UMAP vs t-SNE Comparison
| Aspect | t-SNE | UMAP |
|---|---|---|
| Speed | Slow (O(n²)) | Fast (O(n log n)) |
| Global Structure | Weak | Strong |
| Transform New Data | No | Yes |
| Deterministic | No | No |
| Scalability | ~10K points max | Millions of points |
| Best For | Small datasets, local clusters | Large datasets, general use |
Practice Questions: UMAP
Test your understanding with these coding challenges.
Task: Reduce Iris to 2D using UMAP with default parameters.
Show Solution
import umap
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X_scaled)
print(f"Original: {iris.data.shape}")
print(f"UMAP: {X_umap.shape}")
print(f"X range: [{X_umap[:,0].min():.1f}, {X_umap[:,0].max():.1f}]")
Task: Test n_neighbors values of 5, 15, 50 on digits data.
Show Solution
import umap
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target
for n_neighbors in [5, 15, 50]:
    reducer = umap.UMAP(n_components=2, n_neighbors=n_neighbors, random_state=42)
    X_umap = reducer.fit_transform(X)
    score = silhouette_score(X_umap, y)
    print(f"n_neighbors={n_neighbors:2d}: Silhouette = {score:.3f}")
Task: Fit UMAP on training data and transform test data separately.
Show Solution
import umap
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
reducer = umap.UMAP(n_components=2, random_state=42)
X_train_umap = reducer.fit_transform(X_train)
X_test_umap = reducer.transform(X_test)
print(f"Train embedded: {X_train_umap.shape}")
print(f"Test embedded: {X_test_umap.shape}")
Task: Benchmark all three methods on the digits dataset.
Show Solution
import umap
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import time
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target
results = {}
for name, method in [
    ('PCA', PCA(n_components=2)),
    ('t-SNE', TSNE(n_components=2, random_state=42)),
    ('UMAP', umap.UMAP(n_components=2, random_state=42))
]:
    start = time.time()
    X_reduced = method.fit_transform(X)
    elapsed = time.time() - start
    score = silhouette_score(X_reduced, y)
    results[name] = (elapsed, score)
    print(f"{name:6s}: Time={elapsed:.2f}s, Silhouette={score:.3f}")
Feature Selection
Sometimes the best way to reduce dimensions is simply to throw away bad features. Feature selection keeps your original features (no transformation), making models more interpretable and often faster to train.
Feature Selection vs Feature Extraction
Feature Selection
Keep the best original features, discard the rest.
- Interpretable (same features as before)
- Fast at inference time
- May miss feature interactions
Feature Extraction
Create new features by combining originals.
- Captures more information per dimension
- New features are combinations
- Harder to interpret
Variance Threshold
The simplest feature selection: remove features with near-zero variance. If a feature barely changes across samples, it can't help distinguish between them.
# Variance Threshold - Remove Low-Variance Features
from sklearn.feature_selection import VarianceThreshold
from sklearn.datasets import load_breast_cancer
import numpy as np
cancer = load_breast_cancer()
X = cancer.data
print(f"Original features: {X.shape[1]}")
# Remove features with variance below threshold
selector = VarianceThreshold(threshold=0.5)
X_selected = selector.fit_transform(X)
print(f"After selection: {X_selected.shape[1]}")
# Which features were kept?
kept = selector.get_support()
print(f"\nKept features: {np.array(cancer.feature_names)[kept]}")
This code demonstrates the simplest form of feature selection: removing features that barely vary across samples. The logic is intuitive — if a feature has nearly the same value for every sample, it cannot possibly help distinguish between different samples or classes! We create a VarianceThreshold(threshold=0.5) selector which will remove any feature whose variance is below 0.5. When we call fit_transform(X), scikit-learn calculates the variance of each feature across all samples and keeps only those features with variance above our threshold. The get_support() method returns a boolean array showing which original features survived (True) or were removed (False). For example, imagine a "country" feature where 99% of your samples are from the same country — this feature has almost zero variance and provides virtually no discriminative information, so it should be removed. Variance threshold is an excellent first step in any feature selection pipeline because it's extremely fast, requires no labels (unsupervised), and removes obviously useless features before you apply more sophisticated methods. A practical tip: start with threshold=0 to remove only constant features, then gradually increase the threshold to see how aggressively you can trim features without hurting model performance.
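The tip above about sweeping the threshold can be sketched in a few lines (the threshold values are chosen for illustration):

```python
# Sketch: sweep the variance threshold and watch how many features survive.
from sklearn.feature_selection import VarianceThreshold
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer().data

counts = []
for threshold in [0.0, 0.1, 0.5, 1.0]:
    # Higher thresholds remove more low-variance features
    n_kept = VarianceThreshold(threshold=threshold).fit_transform(X).shape[1]
    counts.append(n_kept)
    print(f"threshold={threshold:.1f}: {n_kept} of {X.shape[1]} features kept")
```

A sweep like this shows how aggressively you can trim before re-checking model performance at each level.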
Correlation-Based Selection
# Remove Highly Correlated Features
from sklearn.datasets import load_breast_cancer
import numpy as np
cancer = load_breast_cancer()
X = cancer.data
feature_names = cancer.feature_names
# Compute correlation matrix
corr_matrix = np.corrcoef(X.T)
# Find highly correlated pairs
threshold = 0.95
to_remove = set()
for i in range(len(corr_matrix)):
    for j in range(i + 1, len(corr_matrix)):
        if abs(corr_matrix[i, j]) > threshold:
            to_remove.add(j)
            print(f"Corr({feature_names[i]}, {feature_names[j]}) = {corr_matrix[i, j]:.2f}")
print(f"\nOriginal: {X.shape[1]} features")
print(f"To remove: {len(to_remove)} features")
print(f"Remaining: {X.shape[1] - len(to_remove)} features")
This code identifies and removes redundant features by detecting highly correlated pairs. When two features are strongly correlated, they essentially carry the same information — keeping both just adds noise and computational overhead. We compute the correlation matrix using np.corrcoef(X.T), which calculates the Pearson correlation coefficient between every pair of features. The result is a matrix where the value at position [i,j] represents the correlation between feature i and feature j. Correlation ranges from -1 to +1: a value of +1.0 means the features move perfectly together (when one goes up, the other goes up by a proportional amount), -1.0 means they move perfectly opposite (when one goes up, the other goes down), and 0 means no linear relationship. We set threshold = 0.95, meaning any pair with absolute correlation above 0.95 is considered redundant. For each such pair, we add one feature to our removal set using to_remove.add(j) — we keep the first feature and remove the second. A classic example: if your dataset has both "height_inches" and "height_cm", they're 100% correlated since one is just a unit conversion of the other. There's no reason to keep both! We use 0.95 as a conservative threshold, but you could use 0.8 or 0.9 to be more aggressive in removing features, especially if you suspect multicollinearity issues.
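The loop above counts the redundant columns but never drops them. Here is a small self-contained sketch showing one way to actually apply the removal (the keep list is an illustrative choice):

```python
# Sketch: drop the highly correlated columns identified by the pair scan.
import numpy as np
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer().data
corr_matrix = np.corrcoef(X.T)

threshold = 0.95
to_remove = set()
for i in range(len(corr_matrix)):
    for j in range(i + 1, len(corr_matrix)):
        if abs(corr_matrix[i, j]) > threshold:
            to_remove.add(j)  # keep feature i, drop feature j

# Build the list of surviving column indices and slice them out in one step
keep = [j for j in range(X.shape[1]) if j not in to_remove]
X_reduced = X[:, keep]
print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} features")
```

Keeping the keep list around also lets you apply the exact same column selection to new data later.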
SelectKBest - Statistical Feature Selection
# Select Top K Features Based on Statistical Tests
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.datasets import load_breast_cancer
import numpy as np
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
# Select top 10 features using ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(f"Original: {X.shape[1]} features")
print(f"Selected: {X_selected.shape[1]} features")
# Feature scores
scores = selector.scores_
top_indices = np.argsort(scores)[::-1][:10]
print("\nTop 10 features by F-score:")
for i in top_indices:
    print(f" {cancer.feature_names[i]}: {scores[i]:.2f}")
This code uses statistical tests to score each feature based on how well it helps distinguish between classes, then keeps only the top K highest-scoring features. We use f_classif as our scoring function, which performs an ANOVA F-test for each feature. This test essentially asks: "Do the values of this feature differ significantly between classes?" If a feature has very different average values for malignant vs benign tumors, it gets a high F-score because it's useful for classification. We create SelectKBest(score_func=f_classif, k=10) to keep only the 10 features with the highest scores. Notice that unlike PCA, we call fit_transform(X, y) with both features AND labels — this is because SelectKBest is a supervised method that needs to know the target variable to evaluate how useful each feature is for prediction. After fitting, selector.scores_ contains the F-score for each original feature, letting you see exactly how useful each feature is (higher score = more discriminative). Different scoring functions are available for different situations: mutual_info_classif captures non-linear relationships between features and target, chi2 works with non-negative features (like word counts in text), and f_regression is used when your target is continuous instead of categorical. SelectKBest is perfect when you want interpretable results and need to explain which features your model relies on.
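To compare scoring functions as described above, here is a short sketch (the choice of k=5 and pinning mutual_info_classif's seed via functools.partial are illustrative):

```python
# Sketch: do f_classif and mutual_info_classif agree on the top features?
from functools import partial
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# mutual_info_classif is stochastic, so pin its seed with functools.partial
scorers = [("f_classif", f_classif),
           ("mutual_info", partial(mutual_info_classif, random_state=42))]

for name, score_func in scorers:
    selector = SelectKBest(score_func=score_func, k=5).fit(X, y)
    top = np.array(cancer.feature_names)[selector.get_support()]
    print(f"{name:12s}: {', '.join(top)}")
```

The two scorers often overlap heavily but not perfectly, since mutual information can pick up non-linear relationships that the F-test misses.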
Recursive Feature Elimination (RFE)
# RFE - Iteratively Remove Least Important Features
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
# Use Random Forest to rank features
model = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(estimator=model, n_features_to_select=10, step=1)
rfe.fit(X, y)
print(f"Original: {X.shape[1]} features")
print(f"Selected: {rfe.n_features_} features")
# Show selected features
selected = rfe.support_
print("\nSelected features:")
for name, keep in zip(cancer.feature_names, selected):
    if keep:
        print(f" {name}")
This code demonstrates Recursive Feature Elimination (RFE), which uses an actual machine learning model to iteratively identify and remove the least important features. We use a RandomForestClassifier as our estimator because Random Forest naturally provides feature importance scores based on how much each feature helps split the data and improve predictions. We create RFE(estimator=model, n_features_to_select=10, step=1), which will recursively remove one feature at a time (step=1) until only 10 features remain. Here's how RFE works step by step: first, it trains the model on all features and looks at the feature importances. It removes the single least important feature, then trains the model again on the remaining features. This process repeats — train, remove the worst, train, remove the worst — until only 10 features are left. The rfe.support_ attribute is a boolean array showing which features survived the elimination process (True means kept, False means removed). The rfe.ranking_ attribute is even more informative: features with ranking 1 were kept, ranking 2 means "removed in the last round," ranking 3 means "removed second-to-last," and so on. RFE's major advantage over simple statistical methods like SelectKBest is that it considers feature interactions — since a real model evaluates importance, the removal of one feature might change the importance of others. The downside is speed: RFE trains the model many times (once per elimination round), so it can be slow for large datasets or complex models.
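The ranking_ attribute described above can be inspected directly. Here is a sketch that swaps in a lighter LogisticRegression estimator so the repeated fits stay fast (the estimator choice is an assumption for speed, not the Random Forest from the example above):

```python
# Sketch: inspect RFE's elimination order via the ranking_ attribute.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X = StandardScaler().fit_transform(cancer.data)  # scale for logistic regression
y = cancer.target

rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=5, step=1).fit(X, y)

# ranking_ == 1 means kept; 2 means eliminated in the final round; and so on
order = np.argsort(rfe.ranking_)
for idx in order[:8]:
    print(f"rank {rfe.ranking_[idx]:2d}: {cancer.feature_names[idx]}")
```

With 30 features, selecting 5 one at a time yields rankings from 1 (kept) up to 26 (the very first feature eliminated).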
Practice Questions: Feature Selection
Test your understanding with these coding challenges.
Task: Remove features with variance less than 1.0 from wine dataset.
Show Solution
from sklearn.feature_selection import VarianceThreshold
from sklearn.datasets import load_wine
wine = load_wine()
X = wine.data
selector = VarianceThreshold(threshold=1.0)
X_selected = selector.fit_transform(X)
print(f"Original: {X.shape[1]} features")
print(f"Selected: {X_selected.shape[1]} features")
print(f"Removed: {X.shape[1] - X_selected.shape[1]} features")
Task: Select top 5 features using mutual_info_classif.
Show Solution
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
X, y = iris.data, iris.target
# Iris only has 4 features, so select all 4 for demo
selector = SelectKBest(score_func=mutual_info_classif, k=4)
X_selected = selector.fit_transform(X, y)
scores = selector.scores_
for name, score in zip(iris.feature_names, scores):
    print(f"{name}: MI = {score:.3f}")
Task: Train a model with all features vs top 10 selected features.
Show Solution
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
model = RandomForestClassifier(n_estimators=100, random_state=42)
# All features
score_all = cross_val_score(model, X, y, cv=5).mean()
# Top 10 features
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
score_selected = cross_val_score(model, X_selected, y, cv=5).mean()
print(f"All {X.shape[1]} features: {score_all:.3f}")
print(f"Top 10 features: {score_selected:.3f}")
When to Use What?
Each dimensionality reduction technique has its strengths. Here's a practical guide to choosing the right tool for your task.
| Use Case | Best Choice | Why |
|---|---|---|
| Preprocessing for ML | PCA | Fast, deterministic, reduces overfitting, works with new data |
| Visualization (small data) | t-SNE | Best cluster separation for visual inspection |
| Visualization (large data) | UMAP | Fast, scales to millions, preserves global structure |
| Interpretable models | Feature Selection | Keeps original features, easy to explain |
| Image compression | PCA | Reconstructable, good for faces/simple images |
| Noise reduction | PCA | Removes low-variance components (often noise) |
Quick Decision Guide
- Need to train a model? Use PCA (or feature selection if interpretability matters)
- Just need to visualize? Use UMAP (or t-SNE for small datasets)
- Want to keep original features? Use feature selection
- Have new data to transform? Use PCA or UMAP (not t-SNE)
- Very large dataset? Use UMAP or PCA (t-SNE too slow)
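For the "preprocessing for ML" case, PCA slots naturally into a scikit-learn Pipeline, which applies the learned projection consistently to each cross-validation fold's held-out data. A minimal sketch (the component count and classifier are illustrative choices):

```python
# Sketch: PCA as a preprocessing step inside a Pipeline.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_digits

digits = load_digits()

# Scale -> reduce -> classify, as one estimator; cross_val_score refits
# the whole chain on each training fold, avoiding data leakage
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=20),
                     LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, digits.data, digits.target, cv=5)
print(f"5-fold accuracy with PCA(20): {scores.mean():.3f}")
```

Because the pipeline has fit and predict methods, it also handles new data end-to-end, which is exactly why PCA (and not t-SNE) belongs in a production workflow.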
# Step 1: Import all methods and load data
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import time
# Load and scale the digits dataset
digits = load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Classes: {len(set(y))} digit types (0-9)")
We set up a fair comparison by importing all three dimensionality reduction methods: PCA for linear reduction, TSNE for non-linear visualization, and umap (which needs to be installed separately with pip install umap-learn). We also import silhouette_score from sklearn's metrics module — this will measure how well-separated the resulting 2D clusters are. The score ranges from -1 to +1, where higher values mean clusters are more distinct and well-separated. We use the digits dataset (1797 samples, 64 features) because it has known class labels (digits 0-9), allowing us to objectively measure which method creates the best cluster separation.
# Step 2: Apply PCA (linear, fast, interpretable)
start = time.time()
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
pca_time = time.time() - start
pca_silhouette = silhouette_score(X_pca, y)
print(f"PCA: {pca_time:.3f}s | Silhouette: {pca_silhouette:.3f}")
print(f" - Can transform new data: Yes")
print(f" - Variance retained: {sum(pca.explained_variance_ratio_):.1%}")
PCA is the fastest method by far — often completing in just a few milliseconds even on larger datasets. This is because PCA only involves matrix operations (computing eigenvectors of the covariance matrix), which are highly optimized. However, you'll notice its silhouette score is typically lower than t-SNE or UMAP. This is because PCA is a linear method — it can only find straight directions in the data and cannot capture curved or complex cluster shapes. The upside is that PCA preserves global structure (relative distances are maintained) and can transform new data points using the same learned projection. The explained_variance_ratio_ tells us how much information we retained; with just 2 components from 64, we typically keep only 20-30% of the variance, which explains the lower cluster quality.
# Step 3: Apply t-SNE (non-linear, best clusters, slow)
start = time.time()
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X)
tsne_time = time.time() - start
tsne_silhouette = silhouette_score(X_tsne, y)
print(f"t-SNE: {tsne_time:.2f}s | Silhouette: {tsne_silhouette:.3f}")
print(f" - Can transform new data: No")
print(f" - Best for: Small datasets, publication-quality visualizations")
t-SNE typically produces the highest silhouette score — meaning the tightest, best-separated clusters — because it's specifically designed to preserve local neighborhood structure. Points that were neighbors in 64D will be neighbors in 2D. However, this quality comes at a cost: t-SNE is significantly slower (seconds instead of milliseconds) because it uses iterative optimization to arrange points. The major limitation is that t-SNE cannot transform new data — if you get a new sample, you must re-run t-SNE on the entire dataset including the new point. This makes t-SNE unsuitable for production ML pipelines but perfect for exploratory data analysis and creating beautiful visualizations for reports or publications. The random_state=42 ensures reproducibility since t-SNE uses random initialization.
# Step 4: Apply UMAP (fast + good clusters + transformable)
start = time.time()
reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X)
umap_time = time.time() - start
umap_silhouette = silhouette_score(X_umap, y)
print(f"UMAP: {umap_time:.2f}s | Silhouette: {umap_silhouette:.3f}")
print(f" - Can transform new data: Yes")
print(f" - Best for: Large datasets, production systems")
UMAP often achieves the best of both worlds: it produces cluster quality comparable to t-SNE (sometimes even better) while running far faster than t-SNE, though still slower than PCA's simple matrix operations. UMAP's speed advantage becomes even more dramatic on larger datasets, where it can be 10-100x faster than t-SNE. Critically, UMAP CAN transform new data using reducer.transform(new_data), making it suitable for production ML systems where you need to embed new samples without re-running the entire algorithm. UMAP also tends to preserve global structure better than t-SNE, meaning the relative positions of clusters in the visualization more accurately reflect their relationships in the original high-dimensional space. For most modern applications, UMAP has become the default choice for non-linear dimensionality reduction.
# Step 5: Print summary comparison table
print("\n" + "=" * 55)
print("Method | Time (s) | Silhouette | Transform New Data?")
print("-" * 55)
print(f"PCA | {pca_time:8.3f} | {pca_silhouette:10.3f} | Yes")
print(f"t-SNE | {tsne_time:8.2f} | {tsne_silhouette:10.3f} | No")
print(f"UMAP | {umap_time:8.2f} | {umap_silhouette:10.3f} | Yes")
print("=" * 55)
print("\nRecommendation:")
print(" - ML preprocessing: PCA (fast, interpretable)")
print(" - Visualization: UMAP (fast + good quality)")
print(" - Publication figures: t-SNE (best local structure)")
This summary table gives you a clear decision framework. For ML preprocessing (feeding reduced data into a classifier), use PCA because it's deterministic, fast, interpretable, and can transform new data — plus its linear nature works well as a preprocessing step. For visualization in most cases, use UMAP because it balances speed and quality, and can handle large datasets that would take forever with t-SNE. For publication-quality figures or when you need the absolute best cluster visualization on smaller datasets, use t-SNE because it often produces the most visually striking separation. The "Transform New Data?" column is crucial for production systems — if you're building an application that needs to embed user uploads or streaming data, t-SNE is simply not an option, leaving you with PCA or UMAP depending on whether you need linear or non-linear reduction.
Key Takeaways
Curse of Dimensionality
High dimensions make data sparse, distances meaningless, and models prone to overfitting
PCA for Preprocessing
Fast, linear reduction that preserves maximum variance - great for ML pipelines
t-SNE for Local Clusters
Best for visualizing cluster structure but slow and non-deterministic
UMAP is Modern Default
Fast, scalable, preserves global structure, and can transform new data
Feature Selection for Interpretability
Keep original features when you need to explain which inputs matter
Always Scale First
PCA and t-SNE are sensitive to feature scales - standardize before reducing
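A quick way to see why scaling matters: run PCA on the Wine dataset with and without standardization (the dataset choice is illustrative; its features live on wildly different scales, with proline in the hundreds and other features near 1):

```python
# Sketch: PCA with vs without scaling on features of very different magnitudes.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine

X = load_wine().data

pca_raw = PCA(n_components=2).fit(X)                                  # no scaling
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Without scaling, the single largest-magnitude feature dominates PC1
print(f"Unscaled: PC1 explains {pca_raw.explained_variance_ratio_[0]:.1%}")
print(f"Scaled:   PC1 explains {pca_scaled.explained_variance_ratio_[0]:.1%}")
```

Unscaled, the first component is essentially just the highest-variance raw feature; after standardization, the components reflect genuine shared structure across all features.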
Knowledge Check
Test your understanding of dimensionality reduction:
What does the "curse of dimensionality" refer to?
What is the main goal of PCA?
Why is scaling important before applying PCA?
What is a key advantage of UMAP over t-SNE?
What does the perplexity parameter control in t-SNE?
When should you use feature selection instead of PCA?