Module 5.2

Hyperparameter Tuning

Learn how to find the optimal hyperparameters for your machine learning models. Master Grid Search, Random Search, and cross-validation techniques to maximize model performance while avoiding overfitting.

40 min
Intermediate
Hands-on
What You'll Learn
  • Parameters vs Hyperparameters
  • Grid Search with GridSearchCV
  • Random Search optimization
  • Cross-validation strategies
  • Best practices and pitfalls
01. Parameters vs Hyperparameters

Before diving into tuning, it is essential to understand the difference between parameters and hyperparameters. Parameters are learned from data during training (like weights in neural networks), while hyperparameters are set before training and control the learning process itself (like learning rate or tree depth).

Key Concept

What are Hyperparameters?

Hyperparameters are configuration settings that control the learning algorithm itself. They are not learned from data but must be specified before training begins. The right hyperparameter values can dramatically improve model performance.

Examples: Learning rate, number of trees in Random Forest, max depth of decision trees, regularization strength (C in SVM), number of neighbors (k in KNN)

Parameters

  • Learned from training data
  • Model weights and biases
  • Split points in decision trees
  • Coefficients in linear regression

Hyperparameters

  • Set before training begins
  • Control the learning process
  • Number of trees, max depth
  • Learning rate, regularization
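The distinction is easy to see in scikit-learn: hyperparameters are whatever you pass to the constructor (inspectable via get_params()), while learned parameters appear as trailing-underscore attributes only after fit(). A minimal sketch of that convention:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Hyperparameters: set in the constructor, available before training
model = LogisticRegression(C=0.5, max_iter=200)
print(model.get_params()["C"])  # 0.5

# Parameters: learned from data, exist only after fit()
model.fit(X, y)
print(model.coef_.shape)  # (3, 4): one row per class, one column per feature
```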

Why Hyperparameter Tuning Matters

Default hyperparameters rarely give optimal results. A well-tuned model can significantly outperform a default one. However, tuning must be done carefully to avoid overfitting to the validation set.

# Example: Impact of hyperparameters on model performance
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Default hyperparameters
rf_default = RandomForestClassifier(random_state=42)
rf_default.fit(X_train, y_train)
default_acc = accuracy_score(y_test, rf_default.predict(X_test))
print(f"Default accuracy: {default_acc:.4f}")  # ~0.9561

# Tuned hyperparameters
rf_tuned = RandomForestClassifier(
    n_estimators=200, max_depth=10, min_samples_split=5, random_state=42
)
rf_tuned.fit(X_train, y_train)
tuned_acc = accuracy_score(y_test, rf_tuned.predict(X_test))
print(f"Tuned accuracy: {tuned_acc:.4f}")  # ~0.9737

This code demonstrates the impact of hyperparameter tuning on model performance. We start by importing the necessary libraries and loading the breast cancer dataset, which we split into 80% training and 20% testing sets. First, we train a Random Forest classifier with default hyperparameters and achieve approximately 95.6% accuracy. Then, we create a tuned version with specific hyperparameters: 200 trees instead of the default 100, a maximum depth of 10 to prevent overfitting, and requiring at least 5 samples to split a node. The tuned model achieves approximately 97.4% accuracy - a significant improvement that shows why hyperparameter tuning matters.

Pro Tip: Hyperparameter tuning is about finding the sweet spot between underfitting (too simple) and overfitting (too complex). Use cross-validation to get reliable estimates.
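A single train/test split, as above, can be noisy. A sketch of the same default-vs-tuned comparison using 5-fold cross-validation (exact scores will vary slightly by environment):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Evaluate each configuration with 5-fold CV instead of one split
for name, model in [
    ("default", RandomForestClassifier(random_state=42)),
    ("tuned", RandomForestClassifier(n_estimators=200, max_depth=10,
                                     min_samples_split=5, random_state=42)),
]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
```

Because each configuration is scored on five different folds, the comparison is less sensitive to one lucky or unlucky split.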

Common Hyperparameters by Algorithm

  • Random Forest: n_estimators (10-500), max_depth (3-30), min_samples_split (2-20)
  • SVM: C (0.001-1000), kernel (rbf/linear/poly), gamma (scale/auto)
  • KNN: n_neighbors (1-50), weights (uniform/distance), metric (euclidean/manhattan)
  • Gradient Boosting: learning_rate (0.01-0.3), n_estimators (50-500), max_depth (3-10)
  • Logistic Regression: C (0.001-100), penalty (l1/l2), solver (lbfgs/liblinear)

Practice Questions

Task: Create a Random Forest classifier and print both the hyperparameters (before fitting) and the learned parameters (after fitting).

Show Solution
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load data
X, y = load_iris(return_X_y=True)

# Create model - hyperparameters are set here
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

# Print hyperparameters (before fitting)
print("Hyperparameters:")
print(f"  n_estimators: {rf.n_estimators}")
print(f"  max_depth: {rf.max_depth}")

# Train the model - parameters are learned here
rf.fit(X, y)

# Print learned parameters (after fitting)
print("\nLearned Parameters:")
print(f"  Number of trees: {len(rf.estimators_)}")
print(f"  Feature importances: {rf.feature_importances_}")

Task: Train two SVM classifiers on the breast cancer dataset - one with default parameters and one with C=10, gamma=0.001. Compare their accuracies.

Show Solution
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load and split data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale features (important for SVM)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Default SVM
svm_default = SVC(random_state=42)
svm_default.fit(X_train_scaled, y_train)
default_acc = accuracy_score(y_test, svm_default.predict(X_test_scaled))

# Tuned SVM
svm_tuned = SVC(C=10, gamma=0.001, random_state=42)
svm_tuned.fit(X_train_scaled, y_train)
tuned_acc = accuracy_score(y_test, svm_tuned.predict(X_test_scaled))

print(f"Default SVM accuracy: {default_acc:.4f}")
print(f"Tuned SVM accuracy: {tuned_acc:.4f}")
print(f"Improvement: {(tuned_acc - default_acc)*100:.2f}%")

Task: Train Decision Tree classifiers with max_depth values from 1 to 20. Plot training and validation accuracy to visualize underfitting vs overfitting.

Show Solution
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

depths = range(1, 21)
train_scores = []
val_scores = []

for depth in depths:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    train_scores.append(dt.score(X_train, y_train))
    val_scores.append(dt.score(X_val, y_val))

plt.figure(figsize=(10, 6))
plt.plot(depths, train_scores, 'b-o', label='Training Accuracy')
plt.plot(depths, val_scores, 'r-o', label='Validation Accuracy')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.title('Effect of max_depth on Model Performance')
plt.legend()
plt.grid(True)
plt.show()

best_depth = depths[val_scores.index(max(val_scores))]
print(f"Best max_depth: {best_depth} with val accuracy: {max(val_scores):.4f}")

Task: Create a function that returns a dictionary of common hyperparameters and their typical ranges for different sklearn classifiers (RandomForest, SVM, KNN).

Show Solution
def get_hyperparameter_ranges(algorithm):
    """Return common hyperparameter ranges for different algorithms."""
    
    ranges = {
        'RandomForest': {
            'n_estimators': (50, 500),
            'max_depth': (3, 30),
            'min_samples_split': (2, 20),
            'min_samples_leaf': (1, 10),
            'max_features': ['sqrt', 'log2', None]
        },
        'SVM': {
            'C': (0.001, 1000),  # Log scale recommended
            'kernel': ['rbf', 'linear', 'poly'],
            'gamma': ['scale', 'auto', 0.001, 0.01, 0.1],
            'degree': (2, 5)  # Only for poly kernel
        },
        'KNN': {
            'n_neighbors': (1, 50),
            'weights': ['uniform', 'distance'],
            'metric': ['euclidean', 'manhattan', 'minkowski'],
            'p': (1, 2)  # 1=manhattan, 2=euclidean
        }
    }
    
    # Fall back to an empty dict so callers can iterate safely
    return ranges.get(algorithm, {})

# Test the function
for algo in ['RandomForest', 'SVM', 'KNN']:
    print(f"\n{algo} Hyperparameters:")
    for param, range_val in get_hyperparameter_ranges(algo).items():
        print(f"  {param}: {range_val}")
04. Cross-Validation Strategies

Cross-validation is crucial for hyperparameter tuning because it provides a reliable estimate of model performance on unseen data. Different CV strategies are appropriate for different types of data. Choosing the right strategy prevents data leakage and gives you trustworthy results.

K-Fold CV

Standard approach. Splits data into k equal folds. Each fold serves as validation once. Use when data is i.i.d. (independent and identically distributed).

Stratified K-Fold

Preserves class proportions in each fold. Essential for imbalanced classification. Default for classification in sklearn.

Time Series Split

For sequential data. Training set grows, test set moves forward. Prevents future data leaking into training.

Implementing Different CV Strategies

from sklearn.model_selection import (KFold, StratifiedKFold,
                                     TimeSeriesSplit, cross_val_score)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(random_state=42)

# Standard K-Fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf)
print(f"K-Fold: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

# Stratified K-Fold (preserves class proportions)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
print(f"Stratified K-Fold: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

# Time Series Split (for temporal data)
# Note: this dataset is not temporal - the call below only demonstrates
# the API. Never shuffle real time series data!
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=tscv)
print(f"Time Series Split: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

This code demonstrates three different cross-validation strategies and when to use each. We import KFold (basic splitting), StratifiedKFold (preserves class ratios), TimeSeriesSplit (for temporal data), and cross_val_score for evaluation. Standard K-Fold with 5 splits and shuffling enabled works well for balanced datasets with i.i.d. data. Stratified K-Fold ensures each fold maintains the same proportion of each class as the full dataset, which is critical for imbalanced classification problems. TimeSeriesSplit is designed for sequential data where the training window grows and the test window moves forward in time - we never shuffle this data to preserve temporal order and prevent future information from leaking into training. Each method reports the mean score and a spread of two standard deviations across folds.

Using Custom CV in GridSearchCV

from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Create stratified CV splitter
cv_strategy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Use in GridSearchCV
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 15]}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=cv_strategy,  # Custom CV strategy
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X, y)
print(f"Best score with Stratified CV: {grid_search.best_score_:.4f}")

This code shows how to use a custom cross-validation strategy within GridSearchCV. We create a StratifiedKFold splitter with 5 folds, shuffling enabled, and a fixed random_state for reproducibility. Instead of passing just an integer to the cv parameter of GridSearchCV, we pass our custom cv_strategy object. This ensures that class proportions are maintained in every fold during the hyperparameter search, which is especially important for imbalanced datasets where regular K-Fold might create folds with very few samples of the minority class. The stratified approach gives more reliable and consistent cross-validation scores.

Pro Tip: Always use Stratified K-Fold for classification with imbalanced classes. For time series, never shuffle data - use TimeSeriesSplit to prevent future information leaking into training.

Practice Questions

Task: Create an imbalanced dataset and compare K-Fold vs Stratified K-Fold cross-validation scores.

Show Solution
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
import numpy as np

# Create imbalanced dataset (90% class 0, 10% class 1)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], 
                           random_state=42)

model = LogisticRegression(random_state=42)

# Standard K-Fold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
kf_scores = cross_val_score(model, X, y, cv=kf, scoring='f1')
print(f"K-Fold F1: {kf_scores.mean():.4f} (+/- {kf_scores.std()*2:.4f})")
print(f"  Individual folds: {kf_scores}")

# Stratified K-Fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
skf_scores = cross_val_score(model, X, y, cv=skf, scoring='f1')
print(f"\nStratified K-Fold F1: {skf_scores.mean():.4f} (+/- {skf_scores.std()*2:.4f})")
print(f"  Individual folds: {skf_scores}")

print(f"\nStratified has lower variance: {skf_scores.std() < kf_scores.std()}")

Task: Create synthetic time series data and visualize how TimeSeriesSplit creates the train/test folds. Show the growing training window.

Show Solution
from sklearn.model_selection import TimeSeriesSplit
import numpy as np
import matplotlib.pyplot as plt

# Create synthetic time series
np.random.seed(42)
n_samples = 100
X = np.arange(n_samples).reshape(-1, 1)
y = np.sin(X.ravel() * 0.1) + np.random.randn(n_samples) * 0.1

# TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)

fig, axes = plt.subplots(5, 1, figsize=(12, 8), sharex=True)

for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    ax = axes[fold]
    ax.scatter(train_idx, y[train_idx], c='blue', label='Train', alpha=0.7)
    ax.scatter(test_idx, y[test_idx], c='red', label='Test', alpha=0.7)
    ax.set_ylabel(f'Fold {fold + 1}')
    ax.legend(loc='upper right')
    ax.set_xlim(0, n_samples)
    print(f"Fold {fold + 1}: Train size={len(train_idx)}, Test size={len(test_idx)}")

axes[-1].set_xlabel('Time Index')
plt.suptitle('TimeSeriesSplit Visualization - Growing Training Window')
plt.tight_layout()
plt.show()
05. Best Practices

Hyperparameter tuning is powerful but can lead to overfitting if done incorrectly. Following best practices ensures your tuned model generalizes well to truly unseen data. These guidelines help you avoid common pitfalls and build more robust models.

Do's
  • Hold out a final test set: Never tune on the test set. Use train/validation/test split.
  • Start coarse, then refine: Begin with a wide search, then narrow down around promising values.
  • Use appropriate CV: Match your CV strategy to your data type (stratified for imbalanced, time series split for sequential).
  • Consider computational budget: Use Random Search when the parameter space is large.
  • Log your experiments: Track all hyperparameter combinations and their scores for reproducibility.
Don'ts
  • Don't tune on test data: This leads to overly optimistic performance estimates.
  • Don't ignore data leakage: Ensure preprocessing is done inside CV folds, not before.
  • Don't over-tune: Too many iterations can lead to overfitting the validation set.
  • Don't forget scaling: Many algorithms require feature scaling - include it in your pipeline.
  • Don't ignore variance: A model with slightly lower mean but much lower variance may be better.
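The "start coarse, then refine" advice above can be sketched as a two-stage search: a wide randomized pass, then a narrow grid around the best values found. The ranges and step sizes below are illustrative choices, not recommendations:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# Stage 1: coarse random search over wide ranges
coarse = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": randint(50, 500), "max_depth": randint(3, 30)},
    n_iter=10, cv=3, random_state=42, n_jobs=-1
)
coarse.fit(X, y)
best_n = coarse.best_params_["n_estimators"]
best_d = coarse.best_params_["max_depth"]

# Stage 2: fine grid search in a neighborhood of the coarse optimum
fine = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [max(10, best_n - 50), best_n, best_n + 50],
     "max_depth": [max(1, best_d - 2), best_d, best_d + 2]},
    cv=3, n_jobs=-1
)
fine.fit(X, y)
print(f"Refined params: {fine.best_params_}, score: {fine.best_score_:.4f}")
```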

Using Pipelines to Prevent Data Leakage

Always include preprocessing steps inside your pipeline. This ensures that transformations are fit only on training data during each CV fold.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# Create pipeline with scaling INSIDE
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(random_state=42))
])

# Parameter names use step__param format
param_grid = {
    'svm__C': [0.1, 1, 10, 100],
    'svm__gamma': ['scale', 'auto', 0.01, 0.1]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X, y)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")

This code demonstrates how to use Pipelines to prevent data leakage during hyperparameter tuning. We create a Pipeline with named steps: 'scaler' for StandardScaler and 'svm' for SVC. The key benefit is that the scaler will be fit only on training data during each CV fold, preventing information from the validation set from leaking into the preprocessing step. When defining the parameter grid, we use the step__param naming convention - for example, 'svm__C' refers to the C parameter of the svm step. Running GridSearchCV on the entire pipeline ensures that scaling happens correctly inside each fold, giving us realistic performance estimates and preventing the common mistake of scaling all data before splitting.

Nested Cross-Validation

For unbiased performance estimation, use nested CV: outer loop for evaluation, inner loop for hyperparameter tuning.

from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# Inner CV: hyperparameter tuning
param_grid = {'n_estimators': [50, 100], 'max_depth': [5, 10]}
inner_cv = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=3, n_jobs=-1
)

# Outer CV: unbiased performance estimate
outer_scores = cross_val_score(inner_cv, X, y, cv=5, scoring='accuracy')

print(f"Nested CV scores: {outer_scores}")
print(f"Mean: {outer_scores.mean():.4f} (+/- {outer_scores.std()*2:.4f})")

This code implements nested cross-validation for unbiased performance estimation. We create two CV loops: an inner loop (GridSearchCV with 3-fold CV) that tunes hyperparameters and finds the best configuration, and an outer loop (cross_val_score with 5-fold CV) that evaluates how well the entire tuning process generalizes to unseen data. The outer scores give us an unbiased estimate of how well our tuned model will actually perform in production. Without nested CV, we might overfit to the validation set during tuning and get overly optimistic performance estimates. This approach is especially important when you need to report realistic expected performance to stakeholders or compare different algorithms fairly.

Practice Questions

Task: Create a complete pipeline with scaling, PCA, and SVM. Use GridSearchCV to tune and properly evaluate on a held-out test set.

Show Solution
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)

# Hold out test set FIRST
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('svm', SVC(random_state=42))
])

# Parameter grid
param_grid = {
    'pca__n_components': [5, 10, 15],
    'svm__C': [0.1, 1, 10],
    'svm__gamma': ['scale', 0.01, 0.1]
}

# Tune on training data only
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"CV score: {grid_search.best_score_:.4f}")

# Final evaluation on test set
y_pred = grid_search.predict(X_test)
print(f"\nTest set performance:")
print(classification_report(y_test, y_pred))

Task: Combine hyperparameter tuning with early stopping for a Gradient Boosting model. Use validation score to stop training early.

Show Solution
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import load_breast_cancer
import numpy as np

X, y = load_breast_cancer(return_X_y=True)

# Split: train, validation (for early stopping), test
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.2, random_state=42)

# Grid search with early stopping via n_iter_no_change
param_grid = {
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5],
    'n_estimators': [500],  # High value, early stopping will limit
    'n_iter_no_change': [10],  # Stop if no improvement for 10 iterations
    'validation_fraction': [0.1]  # Use 10% for internal validation
}

grid_search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
print(f"Best params: {grid_search.best_params_}")
print(f"Actual n_estimators used: {best_model.n_estimators_}")
print(f"Test accuracy: {best_model.score(X_test, y_test):.4f}")

Task: Create a function that performs hyperparameter tuning with full logging of all experiments to a CSV file for reproducibility.

Show Solution
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from datetime import datetime

def tune_with_logging(estimator, param_grid, X, y, log_file='tuning_log.csv'):
    """Perform tuning with full experiment logging."""
    
    # Record start time
    start_time = datetime.now()
    
    # Run grid search
    grid_search = GridSearchCV(
        estimator, param_grid, cv=5, 
        scoring='accuracy', n_jobs=-1,
        return_train_score=True
    )
    grid_search.fit(X, y)
    
    # Create results DataFrame
    results = pd.DataFrame(grid_search.cv_results_)
    
    # Add metadata
    results['experiment_time'] = start_time.strftime('%Y-%m-%d %H:%M:%S')
    results['estimator'] = type(estimator).__name__
    results['total_fits'] = len(results) * 5  # n_combinations * cv_folds
    
    # Select important columns
    cols = ['experiment_time', 'estimator', 'params', 
            'mean_train_score', 'mean_test_score', 'std_test_score', 
            'rank_test_score', 'mean_fit_time']
    results = results[cols]
    
    # Save to CSV (append; write the header only on the first write)
    import os
    results.to_csv(log_file, mode='a', index=False,
                   header=not os.path.exists(log_file))
    
    print(f"Logged {len(results)} experiments to {log_file}")
    print(f"Best params: {grid_search.best_params_}")
    print(f"Best score: {grid_search.best_score_:.4f}")
    
    return grid_search

# Usage
X, y = load_breast_cancer(return_X_y=True)
param_grid = {'n_estimators': [50, 100], 'max_depth': [5, 10]}

grid_search = tune_with_logging(
    RandomForestClassifier(random_state=42),
    param_grid, X, y
)

Key Takeaways

Hyperparameters Matter

Hyperparameters control the learning process and can dramatically affect model performance. Default values rarely give optimal results.

Grid Search

Exhaustively tries all combinations. Thorough but slow. Use when parameter space is small and you need to be comprehensive.

Random Search

Samples randomly from distributions. Often more efficient than Grid Search. Use for large parameter spaces.
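A minimal Random Search sketch using RandomizedSearchCV; scipy's loguniform distribution suits scale-spanning hyperparameters like C, and the ranges and n_iter below are illustrative assumptions:

```python
from scipy.stats import loguniform
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC(random_state=42))])

# Sample C and gamma log-uniformly across several orders of magnitude
param_dist = {"svm__C": loguniform(1e-3, 1e3),
              "svm__gamma": loguniform(1e-4, 1e-1)}

search = RandomizedSearchCV(pipe, param_dist, n_iter=20, cv=5,
                            random_state=42, n_jobs=-1)
search.fit(X, y)
print(f"Best params: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.4f}")
```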

Cross-Validation

Essential for reliable estimates. Use Stratified for imbalanced data, TimeSeriesSplit for temporal data.

Prevent Data Leakage

Always use pipelines to ensure preprocessing happens inside CV. Never tune on test data.

Balance Performance

Consider both mean score and variance. A slightly lower mean with much lower variance may generalize better.
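One heuristic way to act on this: rank GridSearchCV candidates by mean fold score minus one standard deviation instead of mean alone. The "robust_score" column below is an invented name for this sketch, not a sklearn field:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    {"n_estimators": [50, 100], "max_depth": [5, 10]},
                    cv=5, n_jobs=-1)
grid.fit(X, y)

results = pd.DataFrame(grid.cv_results_)
# Penalize unstable candidates: mean minus one std of the fold scores
results["robust_score"] = results["mean_test_score"] - results["std_test_score"]
best = results.sort_values("robust_score", ascending=False).iloc[0]
print(f"Most robust params: {best['params']}")
```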

Knowledge Check

Test your understanding of hyperparameter tuning with this quick quiz.

Question 1 of 6

What is the main difference between parameters and hyperparameters?

Question 2 of 6

When should you prefer Random Search over Grid Search?

Question 3 of 6

Why should you use Stratified K-Fold for imbalanced classification?

Question 4 of 6

What is data leakage in the context of hyperparameter tuning?

Question 5 of 6

How do you access the best model after GridSearchCV finishes?

Question 6 of 6

What distribution should you use for hyperparameters like learning rate that span several orders of magnitude?
