Learning Curves & Bias-Variance Analysis
Learning curves show how model performance changes as the amount of training data increases. They're essential for diagnosing whether your model suffers from high bias (underfitting) or high variance (overfitting), helping you decide whether to collect more data, simplify your model, or increase its capacity.
What is a Learning Curve?
A learning curve plots model performance (e.g., accuracy or error) against the number of training samples. By comparing training and validation scores at different training set sizes, you can diagnose whether your model needs more data, more regularization, or a different architecture.
Why it matters: Learning curves reveal whether collecting more data will help (high variance) or if you need a more complex model (high bias), saving you time and resources.
Understanding Bias and Variance
The bias-variance tradeoff is fundamental to machine learning. High bias means your model is too simple and underfits, while high variance means your model is too complex and overfits to training noise.
High Bias (Underfitting)
Both training and validation scores are low. The model is too simple to capture the underlying patterns. Solution: use a more complex model, add features, or reduce regularization.
High Variance (Overfitting)
The training score is high but the validation score is much lower. The model memorizes noise in the training data. Solution: add more data, simplify the model, or increase regularization.
Plotting Learning Curves with Scikit-learn
Scikit-learn provides the learning_curve function that computes training and validation scores for different training set sizes using cross-validation.
# Step 1: Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
We import NumPy for numerical operations, Matplotlib for plotting, learning_curve from sklearn for generating the curve data, and load a classifier and dataset for our example.
# Step 2: Load data and create the model
X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)
We load the breast cancer dataset and create a Random Forest classifier. This dataset has 569 samples with 30 features, making it ideal for demonstrating learning curves.
# Step 3: Compute learning curve
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),  # 10% to 100% of data
    cv=5,                                   # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1                               # Use all CPU cores
)
The learning_curve function trains the model on increasing portions of data (10% to 100% in 10 steps) and evaluates using 5-fold cross-validation. It returns the actual training sizes used, and the training and validation scores for each fold.
# Step 4: Calculate mean and standard deviation
train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)
We compute the mean and standard deviation across all CV folds for both training and validation scores. The standard deviation helps us understand the variability in our estimates.
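To make the axis convention concrete, here is a small NumPy-only sketch using a hypothetical score matrix with the same shape `learning_curve` returns (rows are training sizes, columns are CV folds); `axis=1` collapses the folds:

```python
import numpy as np

# Toy score matrix: 3 training sizes (rows) x 5 CV folds (columns),
# mirroring the shape returned by learning_curve.
scores = np.array([
    [0.80, 0.82, 0.78, 0.81, 0.79],
    [0.85, 0.86, 0.84, 0.85, 0.85],
    [0.88, 0.89, 0.87, 0.88, 0.88],
])

mean_per_size = scores.mean(axis=1)  # average over folds -> one value per size
std_per_size = scores.std(axis=1)    # fold-to-fold variability per size

print(mean_per_size.shape)  # (3,)
```

Averaging over `axis=1` leaves one mean and one standard deviation per training size, which is exactly what gets plotted.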
# Step 5: Plot the learning curve
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training Score')
plt.plot(train_sizes, val_mean, 'o-', color='green', label='Validation Score')
# Add confidence bands
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,
                 alpha=0.1, color='blue')
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std,
                 alpha=0.1, color='green')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy Score')
plt.title('Learning Curve - Random Forest')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
We create the visualization with training and validation curves, adding shaded regions to show the standard deviation (uncertainty) at each training size. The gap between curves indicates variance, while their absolute values indicate bias.
Interpreting Learning Curves
The shape of learning curves tells you what actions to take:
- Large gap between curves → High variance (overfitting). Get more data or simplify model.
- Both curves plateau low → High bias (underfitting). Use more complex model or add features.
- Curves converge at high score → Good fit! Your model generalizes well.
- Curves still converging → More data may help improve performance.
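These rules can be sketched as a tiny helper that maps the final training and validation scores to a diagnosis. The 0.1 gap and 0.8 score thresholds here are illustrative assumptions, not universal constants; tune them for your metric and problem:

```python
def interpret_curve(train_score, val_score, gap_threshold=0.1, score_threshold=0.8):
    """Map final learning-curve scores to a rough diagnosis.

    The thresholds are illustrative defaults, not universal constants.
    """
    gap = train_score - val_score
    if gap > gap_threshold:
        return "high variance"   # large train/val gap -> overfitting
    if val_score < score_threshold:
        return "high bias"       # both scores plateau low -> underfitting
    return "good fit"            # curves converge at a high score

print(interpret_curve(0.99, 0.75))  # large gap -> "high variance"
print(interpret_curve(0.70, 0.68))  # both low  -> "high bias"
print(interpret_curve(0.95, 0.93))  # converged -> "good fit"
```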
Practice Questions
Task: Create a learning curve for LogisticRegression on the iris dataset with 10 training sizes from 10% to 100%.
Show Solution
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='accuracy'
)
plt.plot(train_sizes, train_scores.mean(axis=1), label='Training')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation')
plt.xlabel('Training Size')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Learning Curve - Logistic Regression')
plt.show()
Task: Plot learning curves for DecisionTreeClassifier with max_depth=2 (high bias) and max_depth=None (high variance) on the same plot to compare.
Show Solution
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import learning_curve
from sklearn.datasets import make_classification
import numpy as np
import matplotlib.pyplot as plt
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, depth, title in zip(axes, [2, None], ['High Bias (depth=2)', 'High Variance (depth=None)']):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=5
    )
    ax.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training')
    ax.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation')
    ax.set_xlabel('Training Size')
    ax.set_ylabel('Accuracy')
    ax.set_title(title)
    ax.legend()
    ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Task: Write a function that takes a model and data, plots the learning curve, and prints a diagnosis (high bias, high variance, or good fit) based on the final gap and score values.
Show Solution
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def diagnose_model(model, X, y, threshold_gap=0.1, threshold_score=0.8):
    """Diagnose model fit using learning curves."""
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, train_sizes=np.linspace(0.1, 1.0, 10), cv=5
    )
    train_final = train_scores.mean(axis=1)[-1]
    val_final = val_scores.mean(axis=1)[-1]
    gap = train_final - val_final
    # Plot
    plt.figure(figsize=(10, 5))
    plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training')
    plt.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation')
    plt.xlabel('Training Size')
    plt.ylabel('Score')
    plt.legend()
    plt.title('Learning Curve Diagnosis')
    plt.show()
    # Diagnose
    print(f"Training Score: {train_final:.3f}")
    print(f"Validation Score: {val_final:.3f}")
    print(f"Gap: {gap:.3f}")
    if gap > threshold_gap:
        print("Diagnosis: HIGH VARIANCE (overfitting)")
        print("Recommendation: Get more data, simplify model, or increase regularization")
    elif val_final < threshold_score:
        print("Diagnosis: HIGH BIAS (underfitting)")
        print("Recommendation: Use more complex model or add features")
    else:
        print("Diagnosis: GOOD FIT")
        print("Model generalizes well!")
# Usage
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
diagnose_model(RandomForestClassifier(n_estimators=100), X, y)
Task: Generate a learning curve using F1-score instead of accuracy for a RandomForestClassifier on an imbalanced dataset.
Show Solution
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np
import matplotlib.pyplot as plt
# Create imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
# Use F1 scoring instead of accuracy
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='f1'  # F1-score for imbalanced data
)
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training F1')
plt.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation F1')
plt.fill_between(train_sizes,
                 train_scores.mean(axis=1) - train_scores.std(axis=1),
                 train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.1)
plt.fill_between(train_sizes,
                 val_scores.mean(axis=1) - val_scores.std(axis=1),
                 val_scores.mean(axis=1) + val_scores.std(axis=1), alpha=0.1)
plt.xlabel('Training Size')
plt.ylabel('F1 Score')
plt.legend()
plt.title('Learning Curve with F1-Score (Imbalanced Dataset)')
plt.grid(True, alpha=0.3)
plt.show()
Task: Generate learning curves for SVC using parallel processing (n_jobs=-1) and measure the time difference compared to sequential processing.
Show Solution
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC
from sklearn.datasets import make_classification
import numpy as np
import time
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
model = SVC(kernel='rbf', random_state=42)
# Sequential processing
start = time.time()
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, n_jobs=1  # Sequential
)
sequential_time = time.time() - start
# Parallel processing
start = time.time()
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, n_jobs=-1  # Use all CPU cores
)
parallel_time = time.time() - start
print(f"Sequential time: {sequential_time:.2f}s")
print(f"Parallel time: {parallel_time:.2f}s")
print(f"Speedup: {sequential_time/parallel_time:.2f}x")
Task: Create a professional learning curve plot that includes shaded regions showing +/- 1 standard deviation for both training and validation scores.
Show Solution
from sklearn.model_selection import learning_curve
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
import numpy as np
import matplotlib.pyplot as plt
X, y = load_breast_cancer(return_X_y=True)
model = GradientBoostingClassifier(n_estimators=50, random_state=42)
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='accuracy'
)
train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)
plt.figure(figsize=(10, 6))
# Plot training scores with std band
plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training Score')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,
                 alpha=0.15, color='blue')
# Plot validation scores with std band
plt.plot(train_sizes, val_mean, 'o-', color='green', label='Validation Score')
plt.fill_between(train_sizes, val_mean - val_std, val_mean + val_std,
                 alpha=0.15, color='green')
plt.xlabel('Training Set Size', fontsize=12)
plt.ylabel('Accuracy Score', fontsize=12)
plt.title('Learning Curve with Confidence Bands', fontsize=14)
plt.legend(loc='lower right', fontsize=10)
plt.grid(True, alpha=0.3)
plt.ylim(0.8, 1.02)
plt.tight_layout()
plt.show()
Task: Write a function that generates and plots learning curves for multiple models side-by-side, showing which model benefits most from additional training data.
Show Solution
def compare_learning_curves(models, X, y, cv=5, figsize=(15, 5)):
    """Compare learning curves for multiple models."""
    from sklearn.model_selection import learning_curve
    import numpy as np
    import matplotlib.pyplot as plt
    n_models = len(models)
    fig, axes = plt.subplots(1, n_models, figsize=figsize)
    if n_models == 1:
        axes = [axes]
    train_sizes_list = np.linspace(0.1, 1.0, 10)
    for ax, (name, model) in zip(axes, models.items()):
        train_sizes, train_scores, val_scores = learning_curve(
            model, X, y, train_sizes=train_sizes_list, cv=cv, n_jobs=-1
        )
        train_mean = train_scores.mean(axis=1)
        val_mean = val_scores.mean(axis=1)
        train_std = train_scores.std(axis=1)
        val_std = val_scores.std(axis=1)
        ax.plot(train_sizes, train_mean, 'o-', label='Training')
        ax.plot(train_sizes, val_mean, 'o-', label='Validation')
        ax.fill_between(train_sizes, train_mean - train_std,
                        train_mean + train_std, alpha=0.1)
        ax.fill_between(train_sizes, val_mean - val_std,
                        val_mean + val_std, alpha=0.1)
        ax.set_xlabel('Training Size')
        ax.set_ylabel('Score')
        ax.set_title(f'{name}\nFinal Val: {val_mean[-1]:.3f}')
        ax.legend(loc='lower right')
        ax.grid(True, alpha=0.3)
        ax.set_ylim(0.5, 1.05)
    plt.suptitle('Learning Curve Comparison', fontsize=14, y=1.02)
    plt.tight_layout()
    plt.show()
# Usage
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
models = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}
compare_learning_curves(models, X, y)
Validation Curves for Hyperparameter Analysis
While learning curves show how performance changes with data size, validation curves show how performance changes with hyperparameter values. They help you find the optimal hyperparameter value and understand the sensitivity of your model to different settings.
What is a Validation Curve?
A validation curve plots model performance against different values of a single hyperparameter. It shows both training and validation scores, helping you identify the optimal hyperparameter value and detect overfitting or underfitting at different settings.
Why it matters: Validation curves help you understand how sensitive your model is to a hyperparameter and find the sweet spot before running expensive grid searches.
Plotting Validation Curves
Scikit-learn's validation_curve function evaluates your model across a range of hyperparameter values using cross-validation.
# Step 1: Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
We import validation_curve from sklearn along with an SVC classifier. Support Vector Machines are a good example because their performance is highly sensitive to the C (regularization) parameter.
# Step 2: Load data and define parameter range
X, y = load_breast_cancer(return_X_y=True)
param_range = np.logspace(-3, 3, 7) # [0.001, 0.01, 0.1, 1, 10, 100, 1000]
We create a logarithmic range for the C parameter since regularization parameters typically span several orders of magnitude. This gives us values from 0.001 to 1000.
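A quick NumPy sketch of what `logspace` produces — the points are spaced evenly in the exponent, not in the raw values:

```python
import numpy as np

# np.logspace(start, stop, num) returns num points from 10**start to 10**stop,
# evenly spaced in the exponent.
param_range = np.logspace(-3, 3, 7)
print(param_range)  # 0.001, 0.01, 0.1, 1, 10, 100, 1000
```

Each value is 10x the previous one, which is why these ranges suit regularization parameters like C.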
# Step 3: Compute validation curve
train_scores, val_scores = validation_curve(
    SVC(kernel='rbf'),        # The estimator
    X, y,                     # Data
    param_name='C',           # Hyperparameter to vary
    param_range=param_range,  # Values to try
    cv=5,                     # 5-fold cross-validation
    scoring='accuracy',
    n_jobs=-1
)
The validation_curve function trains the model at each C value using 5-fold CV and returns training and validation scores. Unlike learning_curve, it varies a hyperparameter instead of training size.
# Step 4: Plot the validation curve
plt.figure(figsize=(10, 6))
plt.semilogx(param_range, train_scores.mean(axis=1), 'o-', color='blue', label='Training Score')
plt.semilogx(param_range, val_scores.mean(axis=1), 'o-', color='green', label='Validation Score')
plt.fill_between(param_range,
                 train_scores.mean(axis=1) - train_scores.std(axis=1),
                 train_scores.mean(axis=1) + train_scores.std(axis=1),
                 alpha=0.1, color='blue')
plt.fill_between(param_range,
                 val_scores.mean(axis=1) - val_scores.std(axis=1),
                 val_scores.mean(axis=1) + val_scores.std(axis=1),
                 alpha=0.1, color='green')
plt.xlabel('C (Regularization Parameter)')
plt.ylabel('Accuracy Score')
plt.title('Validation Curve - SVM with RBF Kernel')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.show()
We use semilogx for the x-axis since C spans several orders of magnitude. The optimal C value is where the validation score peaks before the gap between training and validation widens (indicating overfitting).
Validation Curves for Different Hyperparameters
Different hyperparameters have different effects. Here's how to analyze tree depth for Decision Trees:
from sklearn.tree import DecisionTreeClassifier
# Vary max_depth from 1 to 20
param_range = np.arange(1, 21)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42),
    X, y,
    param_name='max_depth',
    param_range=param_range,
    cv=5, scoring='accuracy'
)
plt.figure(figsize=(10, 6))
plt.plot(param_range, train_scores.mean(axis=1), 'o-', label='Training')
plt.plot(param_range, val_scores.mean(axis=1), 'o-', label='Validation')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.title('Validation Curve - Decision Tree Max Depth')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Find optimal depth
optimal_depth = param_range[np.argmax(val_scores.mean(axis=1))]
print(f"Optimal max_depth: {optimal_depth}")
For tree depth, the training score typically increases monotonically (deeper trees fit the training data better), while the validation score peaks at some depth and then decreases as the tree starts overfitting. The optimal depth is where the validation score is maximized.
Practice Questions
Task: Create a validation curve for RandomForestClassifier varying n_estimators from 10 to 200 in steps of 20.
Show Solution
from sklearn.model_selection import validation_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
import numpy as np
import matplotlib.pyplot as plt
X, y = load_breast_cancer(return_X_y=True)
param_range = np.arange(10, 201, 20)
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=42),
    X, y,
    param_name='n_estimators',
    param_range=param_range,
    cv=5, scoring='accuracy', n_jobs=-1
)
plt.plot(param_range, train_scores.mean(axis=1), 'o-', label='Training')
plt.plot(param_range, val_scores.mean(axis=1), 'o-', label='Validation')
plt.xlabel('Number of Trees')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Validation Curve - Random Forest n_estimators')
plt.show()
Task: Create a 2x2 subplot showing validation curves for KNeighborsClassifier with n_neighbors (1-20) and for Ridge regression with alpha (0.001-1000 log scale).
Show Solution
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Ridge
from sklearn.datasets import load_breast_cancer, load_diabetes
import numpy as np
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# KNN n_neighbors
X, y = load_breast_cancer(return_X_y=True)
param_range = np.arange(1, 21)
train_s, val_s = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name='n_neighbors', param_range=param_range, cv=5
)
axes[0].plot(param_range, train_s.mean(axis=1), 'o-', label='Training')
axes[0].plot(param_range, val_s.mean(axis=1), 'o-', label='Validation')
axes[0].set_xlabel('n_neighbors')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('KNN - n_neighbors')
axes[0].legend()
# Ridge alpha
X, y = load_diabetes(return_X_y=True)
param_range = np.logspace(-3, 3, 7)
train_s, val_s = validation_curve(
    Ridge(), X, y,
    param_name='alpha', param_range=param_range, cv=5, scoring='r2'
)
axes[1].semilogx(param_range, train_s.mean(axis=1), 'o-', label='Training')
axes[1].semilogx(param_range, val_s.mean(axis=1), 'o-', label='Validation')
axes[1].set_xlabel('Alpha (log scale)')
axes[1].set_ylabel('R² Score')
axes[1].set_title('Ridge - Alpha')
axes[1].legend()
plt.tight_layout()
plt.show()
Task: Create a validation curve for LogisticRegression varying C from 0.01 to 100 and print the optimal C value based on validation score.
Show Solution
from sklearn.model_selection import validation_curve
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
import numpy as np
import matplotlib.pyplot as plt
X, y = load_breast_cancer(return_X_y=True)
param_range = np.logspace(-2, 2, 20)
train_scores, val_scores = validation_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    param_name='C',
    param_range=param_range,
    cv=5, scoring='accuracy'
)
# Find optimal C
optimal_idx = np.argmax(val_scores.mean(axis=1))
optimal_C = param_range[optimal_idx]
plt.figure(figsize=(10, 6))
plt.semilogx(param_range, train_scores.mean(axis=1), 'o-', label='Training')
plt.semilogx(param_range, val_scores.mean(axis=1), 'o-', label='Validation')
plt.axvline(optimal_C, color='red', linestyle='--', label=f'Optimal C={optimal_C:.3f}')
plt.xlabel('C (Regularization)')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Validation Curve - Logistic Regression')
plt.show()
print(f"Optimal C: {optimal_C:.4f}")
print(f"Best Validation Score: {val_scores.mean(axis=1)[optimal_idx]:.4f}")
Task: Create a validation curve for a Pipeline with PolynomialFeatures and LinearRegression, varying the polynomial degree from 1 to 10.
Show Solution
from sklearn.model_selection import validation_curve
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_diabetes
import numpy as np
import matplotlib.pyplot as plt
X, y = load_diabetes(return_X_y=True)
# Create pipeline
pipeline = Pipeline([
    ('poly', PolynomialFeatures()),
    ('regressor', LinearRegression())
])
param_range = np.arange(1, 11)
train_scores, val_scores = validation_curve(
    pipeline, X, y,
    param_name='poly__degree',  # Note: use __ for nested parameters
    param_range=param_range,
    cv=5, scoring='r2', n_jobs=-1
)
plt.figure(figsize=(10, 6))
plt.plot(param_range, train_scores.mean(axis=1), 'o-', label='Training')
plt.plot(param_range, val_scores.mean(axis=1), 'o-', label='Validation')
plt.fill_between(param_range,
                 val_scores.mean(axis=1) - val_scores.std(axis=1),
                 val_scores.mean(axis=1) + val_scores.std(axis=1),
                 alpha=0.1)
plt.xlabel('Polynomial Degree')
plt.ylabel('R² Score')
plt.legend()
plt.title('Validation Curve - Polynomial Regression')
plt.xticks(param_range)
plt.show()
optimal_degree = param_range[np.argmax(val_scores.mean(axis=1))]
print(f"Optimal Polynomial Degree: {optimal_degree}")
Task: Create validation curves showing both accuracy and F1-score on the same plot for GradientBoostingClassifier varying n_estimators.
Show Solution
from sklearn.model_selection import validation_curve
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
import numpy as np
import matplotlib.pyplot as plt
X, y = load_breast_cancer(return_X_y=True)
param_range = np.arange(10, 201, 20)
model = GradientBoostingClassifier(random_state=42)
# Compute for both metrics
metrics = {'accuracy': 'Accuracy', 'f1': 'F1-Score'}
colors = {'accuracy': 'blue', 'f1': 'green'}
plt.figure(figsize=(12, 5))
for i, (scoring, label) in enumerate(metrics.items()):
    train_scores, val_scores = validation_curve(
        model, X, y,
        param_name='n_estimators',
        param_range=param_range,
        cv=5, scoring=scoring, n_jobs=-1
    )
    plt.subplot(1, 2, i+1)
    plt.plot(param_range, train_scores.mean(axis=1), 'o-',
             color=colors[scoring], label='Training')
    plt.plot(param_range, val_scores.mean(axis=1), 's--',
             color=colors[scoring], label='Validation')
    plt.fill_between(param_range,
                     val_scores.mean(axis=1) - val_scores.std(axis=1),
                     val_scores.mean(axis=1) + val_scores.std(axis=1),
                     alpha=0.1, color=colors[scoring])
    plt.xlabel('n_estimators')
    plt.ylabel(label)
    plt.title(f'Validation Curve - {label}')
    plt.legend()
    plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Task: Write a function that automatically determines if a hyperparameter needs log scaling based on its typical range, then creates an appropriate validation curve.
Show Solution
def auto_validation_curve(estimator, X, y, param_name, param_range,
                          cv=5, scoring='accuracy'):
    """Automatically detect if log scale is needed and plot validation curve."""
    from sklearn.model_selection import validation_curve
    import numpy as np
    import matplotlib.pyplot as plt
    # Use log scale if the range spans more than 2 orders of magnitude
    range_ratio = max(param_range) / min(param_range)
    use_log = range_ratio > 100
    train_scores, val_scores = validation_curve(
        estimator, X, y,
        param_name=param_name,
        param_range=param_range,
        cv=cv, scoring=scoring, n_jobs=-1
    )
    train_mean = train_scores.mean(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)
    plt.figure(figsize=(10, 6))
    if use_log:
        plt.semilogx(param_range, train_mean, 'o-', label='Training')
        plt.semilogx(param_range, val_mean, 'o-', label='Validation')
    else:
        plt.plot(param_range, train_mean, 'o-', label='Training')
        plt.plot(param_range, val_mean, 'o-', label='Validation')
    plt.fill_between(param_range, val_mean - val_std, val_mean + val_std, alpha=0.1)
    # Find and mark optimal value
    optimal_idx = np.argmax(val_mean)
    optimal_val = param_range[optimal_idx]
    plt.axvline(optimal_val, color='red', linestyle='--', alpha=0.7,
                label=f'Optimal: {optimal_val:.4g}')
    plt.xlabel(param_name + (' (log scale)' if use_log else ''))
    plt.ylabel(scoring.capitalize())
    plt.title(f'Validation Curve for {param_name}')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
    return optimal_val, val_mean[optimal_idx]
# Usage
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
optimal_C, best_score = auto_validation_curve(
    SVC(), X, y, 'C', np.logspace(-3, 3, 15)
)
print(f"Optimal C: {optimal_C}, Best Score: {best_score:.4f}")
Task: Create a heatmap visualization showing validation scores for SVC with varying C and gamma parameters simultaneously using GridSearchCV results.
Show Solution
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
import numpy as np
import matplotlib.pyplot as plt
X, y = load_breast_cancer(return_X_y=True)
# Define parameter grid
C_range = np.logspace(-2, 2, 9)
gamma_range = np.logspace(-4, 0, 9)
param_grid = {'C': C_range, 'gamma': gamma_range}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5,
                    scoring='accuracy', n_jobs=-1)
grid.fit(X, y)
# Reshape results for heatmap
scores = grid.cv_results_['mean_test_score'].reshape(len(C_range), len(gamma_range))
plt.figure(figsize=(10, 8))
plt.imshow(scores, interpolation='nearest', cmap='viridis', aspect='auto')
plt.colorbar(label='Accuracy')
plt.xticks(np.arange(len(gamma_range)), [f'{g:.0e}' for g in gamma_range], rotation=45)
plt.yticks(np.arange(len(C_range)), [f'{c:.0e}' for c in C_range])
plt.xlabel('Gamma')
plt.ylabel('C')
plt.title('SVM Hyperparameter Validation Heatmap')
# Mark best parameters
best_idx = np.unravel_index(scores.argmax(), scores.shape)
plt.scatter(best_idx[1], best_idx[0], marker='*', s=300, c='red',
            edgecolors='white', linewidths=2, label='Best')
plt.legend()
plt.tight_layout()
plt.show()
print(f"Best C: {grid.best_params_['C']}")
print(f"Best gamma: {grid.best_params_['gamma']}")
print(f"Best Score: {grid.best_score_:.4f}")
Model Comparison & Statistical Significance
After tuning individual models, you need to compare them fairly. This involves using proper cross-validation, computing confidence intervals, and conducting statistical tests to ensure observed differences are significant rather than due to random chance.
Comparing Multiple Models
Compare models using the same cross-validation splits to ensure a fair comparison. Use cross_val_score with a fixed random state for reproducibility.
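To see why a fixed random state gives a fair comparison, here is a small sketch: a StratifiedKFold with `shuffle=True` and a set `random_state` yields identical splits every time it is iterated, so every model is scored on exactly the same folds. The toy `X` and `y` below are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Tiny toy dataset: 20 samples, balanced binary labels
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# With a fixed random_state, the shuffled splits are reproducible:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
first_pass = [test.tolist() for _, test in cv.split(X, y)]
second_pass = [test.tolist() for _, test in cv.split(X, y)]
print(first_pass == second_pass)  # True -> every model sees the same folds
```

Passing this same `cv` object to `cross_val_score` for each model guarantees the per-fold scores are paired, which the statistical tests later in this section rely on.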
# Step 1: Import models and utilities
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
We import multiple classifiers to compare, along with RepeatedStratifiedKFold which provides more robust cross-validation by repeating the k-fold process multiple times with different random splits.
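As a quick sanity check on how many evaluations the repeated scheme produces, `get_n_splits` reports the total number of train/test rounds:

```python
from sklearn.model_selection import RepeatedStratifiedKFold

# 10 folds repeated 3 times -> 30 train/evaluate rounds per model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
print(cv.get_n_splits())  # 30
```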
# Step 2: Load data and define models
X, y = load_breast_cancer(return_X_y=True)
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5)
}
We create a dictionary of models to compare. Using a dictionary makes it easy to iterate and store results. We set random_state where applicable for reproducibility.
# Step 3: Evaluate all models with repeated cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy', n_jobs=-1)
    results[name] = scores
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
We use 10-fold CV repeated 3 times (30 total evaluations per model) for robust estimates. The standard deviation is multiplied by 2 to approximate a 95% confidence interval.
# Step 4: Visualize comparison with box plots
plt.figure(figsize=(12, 6))
plt.boxplot(results.values(), labels=results.keys())
plt.ylabel('Accuracy')
plt.title('Model Comparison - Cross-Validation Scores')
plt.xticks(rotation=45, ha='right')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Box plots show the distribution of CV scores for each model. The box spans the interquartile range (25th-75th percentile), the line marks the median, and the whiskers extend to the most extreme points within 1.5x the interquartile range (points beyond are drawn as outliers). This visualization helps identify which models are consistently better.
Statistical Significance Testing
When comparing models, use statistical tests to determine if differences are significant. The paired t-test or Wilcoxon signed-rank test can compare two models on the same CV folds.
from scipy import stats
# Compare the top two models
model1_scores = results['Random Forest']
model2_scores = results['Gradient Boosting']
# Paired t-test (assumes normality)
t_stat, p_value = stats.ttest_rel(model1_scores, model2_scores)
print(f"Paired t-test: t={t_stat:.4f}, p={p_value:.4f}")
# Wilcoxon signed-rank test (non-parametric alternative)
w_stat, w_pvalue = stats.wilcoxon(model1_scores, model2_scores)
print(f"Wilcoxon test: W={w_stat:.4f}, p={w_pvalue:.4f}")
if p_value < 0.05:
    print("The difference is statistically significant (p < 0.05)")
else:
    print("No significant difference between models (p >= 0.05)")
The paired t-test compares means of paired observations (same CV folds). The Wilcoxon test is a non-parametric alternative that doesn't assume normality. A p-value less than 0.05 typically indicates the difference is unlikely due to chance.
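A self-contained sketch of the paired t-test on hypothetical per-fold accuracies (the numbers below are made up for illustration): because both arrays come from the same folds, the test looks at the fold-wise differences rather than the two means in isolation:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies for two models on the SAME CV folds
model_a = np.array([0.90, 0.92, 0.91, 0.93, 0.89])
model_b = np.array([0.85, 0.88, 0.86, 0.87, 0.84])

# ttest_rel pairs the observations fold by fold
t_stat, p_value = stats.ttest_rel(model_a, model_b)
print(f"t={t_stat:.2f}, p={p_value:.4f}")
# The fold-wise differences are consistently around 0.05, so p comes out small
```

Note that the means differ by only 0.05, yet the test is decisive because the difference is consistent across every fold; an unpaired test would be far less sensitive here.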
Practice Questions
Task: Compare LinearRegression, Ridge, Lasso, and ElasticNet on the diabetes dataset using R² score with 5-fold CV repeated 3 times.
Show Solution
from sklearn.model_selection import cross_val_score, RepeatedKFold
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt
X, y = load_diabetes(return_X_y=True)
models = {
'Linear Regression': LinearRegression(),
'Ridge': Ridge(alpha=1.0),
'Lasso': Lasso(alpha=0.1),
'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5)
}
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
results = {}
for name, model in models.items():
scores = cross_val_score(model, X, y, cv=cv, scoring='r2')
results[name] = scores
print(f"{name}: R²={scores.mean():.4f} (+/- {scores.std()*2:.4f})")
plt.boxplot(results.values(), labels=results.keys())
plt.ylabel('R² Score')
plt.title('Regression Model Comparison')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Task: Create a function that compares models and returns a pandas DataFrame with mean score, std, and rank. Include statistical test results comparing each model to the best one.
Show Solution
import pandas as pd
from scipy import stats
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
def compare_models(models, X, y, scoring='accuracy', cv=None):
"""Compare models and return ranking DataFrame with statistical tests."""
if cv is None:
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
results = {}
for name, model in models.items():
scores = cross_val_score(model, X, y, cv=cv, scoring=scoring, n_jobs=-1)
results[name] = scores
# Create DataFrame
df = pd.DataFrame({
'Model': list(results.keys()),
'Mean': [r.mean() for r in results.values()],
'Std': [r.std() for r in results.values()],
})
df = df.sort_values('Mean', ascending=False).reset_index(drop=True)
df['Rank'] = range(1, len(df) + 1)
# Statistical tests vs best model
best_name = df.iloc[0]['Model']
best_scores = results[best_name]
p_values = []
for name in df['Model']:
if name == best_name:
p_values.append(1.0)
else:
_, p = stats.wilcoxon(best_scores, results[name])
p_values.append(p)
df['p-value vs Best'] = p_values
return df
# Usage
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
X, y = load_breast_cancer(return_X_y=True)
models = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Gradient Boosting': GradientBoostingClassifier(random_state=42),
'SVM': SVC(random_state=42)
}
print(compare_models(models, X, y))
Task: Compare 3 classifiers (LogisticRegression, DecisionTree, KNN) using 10-fold CV and visualize results with a box plot.
Show Solution
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
X, y = load_iris(return_X_y=True)
models = {
'Logistic Regression': LogisticRegression(max_iter=200),
'Decision Tree': DecisionTreeClassifier(random_state=42),
'KNN (k=5)': KNeighborsClassifier(n_neighbors=5)
}
results = {}
for name, model in models.items():
scores = cross_val_score(model, X, y, cv=10, scoring='accuracy')
results[name] = scores
print(f"{name}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
# Box plot visualization
fig, ax = plt.subplots(figsize=(10, 6))
ax.boxplot(results.values(), labels=results.keys())
ax.set_ylabel('Accuracy')
ax.set_title('Model Comparison - 10-Fold CV')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Task: Compare RandomForest and GradientBoosting using both accuracy and F1-score on the breast cancer dataset.
Show Solution
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
models = {
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Gradient Boosting': GradientBoostingClassifier(random_state=42)
}
metrics = ['accuracy', 'f1', 'precision', 'recall']
print("Model Comparison Across Multiple Metrics:")
print("-" * 60)
for name, model in models.items():
print(f"\n{name}:")
for metric in metrics:
scores = cross_val_score(model, X, y, cv=5, scoring=metric)
print(f" {metric.capitalize():12s}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
Task: Compare two models and use a paired t-test to determine if the performance difference is statistically significant (p < 0.05).
Show Solution
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from scipy import stats
X, y = load_breast_cancer(return_X_y=True)
# Use same CV splits for fair comparison
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
model1 = RandomForestClassifier(n_estimators=100, random_state=42)
model2 = GradientBoostingClassifier(random_state=42)
scores1 = cross_val_score(model1, X, y, cv=cv, scoring='accuracy')
scores2 = cross_val_score(model2, X, y, cv=cv, scoring='accuracy')
print(f"Random Forest: {scores1.mean():.4f} (+/- {scores1.std()*2:.4f})")
print(f"Gradient Boosting: {scores2.mean():.4f} (+/- {scores2.std()*2:.4f})")
# Paired t-test (samples are paired since same CV splits)
t_stat, p_value = stats.ttest_rel(scores1, scores2)
print(f"\nPaired t-test:")
print(f" t-statistic: {t_stat:.4f}")
print(f" p-value: {p_value:.4f}")
if p_value < 0.05:
winner = "Random Forest" if scores1.mean() > scores2.mean() else "Gradient Boosting"
print(f"\n✓ Difference is statistically significant! {winner} is better.")
else:
print("\n✗ Difference is NOT statistically significant.")
Task: Compare 4 models and visualize their score distributions using violin plots instead of box plots for richer information.
Show Solution
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
models = {
'LR': LogisticRegression(max_iter=1000),
'RF': RandomForestClassifier(n_estimators=100, random_state=42),
'GB': GradientBoostingClassifier(random_state=42),
'SVM': SVC(random_state=42)
}
# Collect all scores
all_scores = []
labels = []
for name, model in models.items():
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
all_scores.append(scores)
labels.append(f"{name}\n({scores.mean():.3f})")
# Violin plot
fig, ax = plt.subplots(figsize=(10, 6))
parts = ax.violinplot(all_scores, positions=range(len(models)), showmeans=True, showmedians=True)
# Customize colors
for pc in parts['bodies']:
pc.set_facecolor('lightblue')
pc.set_alpha(0.7)
ax.set_xticks(range(len(models)))
ax.set_xticklabels(labels)
ax.set_ylabel('Accuracy')
ax.set_title('Model Comparison - Violin Plot')
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
Task: Create a comprehensive benchmarking function that compares models, tracks training time, tests statistical significance, and generates a formatted report.
Show Solution
import time
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from scipy import stats
def benchmark_models(models, X, y, scoring='accuracy', n_splits=5, n_repeats=3):
"""Comprehensive model benchmarking with timing and statistical tests."""
cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)
results = {'Model': [], 'Mean Score': [], 'Std': [], 'Train Time (s)': [],
'Scores': []}
for name, model in models.items():
start_time = time.time()
scores = cross_val_score(model, X, y, cv=cv, scoring=scoring, n_jobs=-1)
elapsed = time.time() - start_time
results['Model'].append(name)
results['Mean Score'].append(scores.mean())
results['Std'].append(scores.std())
results['Train Time (s)'].append(elapsed)
results['Scores'].append(scores)
# Create DataFrame
df = pd.DataFrame({k: v for k, v in results.items() if k != 'Scores'})
df = df.sort_values('Mean Score', ascending=False).reset_index(drop=True)
df['Rank'] = range(1, len(df) + 1)
# Statistical tests vs best model
best_scores = results['Scores'][list(models.keys()).index(df.iloc[0]['Model'])]
p_values = []
significant = []
for model_name in df['Model']:
idx = list(models.keys()).index(model_name)
model_scores = results['Scores'][idx]
if np.array_equal(model_scores, best_scores):
p_values.append(1.0)
significant.append('-')
else:
_, p = stats.wilcoxon(best_scores, model_scores)
p_values.append(p)
significant.append('Yes' if p < 0.05 else 'No')
df['p-value'] = p_values
df['Sig. Diff'] = significant
# Format output
print("=" * 80)
print(f"MODEL BENCHMARK REPORT - Scoring: {scoring}")
print("=" * 80)
print(df.to_string(index=False))
print("=" * 80)
print(f"Winner: {df.iloc[0]['Model']} with {df.iloc[0]['Mean Score']:.4f}")
return df
# Usage
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
models = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Gradient Boosting': GradientBoostingClassifier(random_state=42),
'SVM': SVC(random_state=42)
}
report = benchmark_models(models, X, y)
Ensemble Methods for Model Combination
Instead of choosing a single best model, ensemble methods combine multiple models to achieve better performance than any individual model. Voting combines predictions from different model types, while stacking uses a meta-learner to optimally weight model contributions.
Voting Classifiers
Voting combines predictions from multiple models. Hard voting takes the majority vote, while soft voting averages predicted probabilities (often better if models output calibrated probabilities).
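The two rules can disagree: under soft voting, a highly confident minority can outvote a lukewarm majority. A toy example with hypothetical class-1 probabilities for a single sample:

```python
import numpy as np

# Predicted P(class 1) for one sample from three hypothetical classifiers
probas = np.array([0.45, 0.45, 0.95])

# Hard voting: each model votes for its most likely class, majority wins
votes = (probas >= 0.5).astype(int)                        # [0, 0, 1]
hard_pred = int(np.bincount(votes, minlength=2).argmax())  # class 0

# Soft voting: average the probabilities, then take the most likely class
soft_pred = int(probas.mean() >= 0.5)                      # mean ~0.617 -> class 1

print(f"hard voting -> class {hard_pred}, soft voting -> class {soft_pred}")
```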
# Step 1: Import ensemble classes
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer
VotingClassifier from sklearn.ensemble allows combining multiple diverse classifiers. We'll combine logistic regression, decision tree, and SVM - three fundamentally different algorithms.
# Step 2: Create base models
X, y = load_breast_cancer(return_X_y=True)
# Define individual estimators
estimators = [
('lr', LogisticRegression(max_iter=1000)),
('dt', DecisionTreeClassifier(max_depth=5, random_state=42)),
('svc', SVC(probability=True, random_state=42)) # probability=True for soft voting
]
We create a list of (name, estimator) tuples. For soft voting with SVC, we need probability=True, which enables probability estimates via an internal cross-validation (Platt scaling); this slows training but is required for averaging probabilities.
# Step 3: Create and evaluate voting classifiers
# Hard voting - majority vote
hard_voting = VotingClassifier(estimators=estimators, voting='hard')
# Soft voting - average probabilities
soft_voting = VotingClassifier(estimators=estimators, voting='soft')
# Compare all models
print("Individual Models:")
for name, model in estimators:
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f" {name}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
print("\nEnsemble Models:")
for name, ensemble in [('Hard Voting', hard_voting), ('Soft Voting', soft_voting)]:
scores = cross_val_score(ensemble, X, y, cv=5, scoring='accuracy')
print(f" {name}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
We compare individual models against voting ensembles. The ensemble typically outperforms individual models because errors from different models tend to cancel out when combined.
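The "errors cancel out" intuition can be quantified: if three classifiers are each 70% accurate and make independent mistakes, the majority vote is correct whenever at least two of the three are. A quick check with the binomial formula (in practice model errors are correlated, so the real gain is smaller):

```python
from math import comb

p = 0.70   # accuracy of each individual classifier
n = 3      # number of voters

# P(majority correct) = P(at least 2 of 3 are correct), assuming independence
majority_acc = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(2, n + 1))
print(f"individual: {p:.3f}, majority vote: {majority_acc:.3f}")  # 0.700 vs 0.784
```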
Stacking Classifiers
Stacking uses a meta-learner (final estimator) to learn how to best combine base model predictions. It often outperforms simple voting.
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
# Define base estimators
base_estimators = [
('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
('knn', KNeighborsClassifier(n_neighbors=5)),
('svc', SVC(probability=True, random_state=42))
]
# Create stacking classifier with logistic regression as meta-learner
stacking = StackingClassifier(
estimators=base_estimators,
final_estimator=LogisticRegression(max_iter=1000),
cv=5, # Cross-validation for generating meta-features
stack_method='auto' # Use predict_proba if available
)
# Evaluate
scores = cross_val_score(stacking, X, y, cv=5, scoring='accuracy')
print(f"Stacking: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
Stacking works in two levels: base estimators make predictions using cross-validation (to avoid data leakage), then a meta-learner (here LogisticRegression) learns to optimally combine those predictions. The cv=5 parameter controls how base estimator predictions are generated.
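What StackingClassifier does internally can be approximated by hand with cross_val_predict: out-of-fold probabilities from the base models become the meta-learner's training features, so the meta-learner never sees predictions made on data the base models trained on. A sketch of that idea (base models, split sizes, and the 0.85 threshold-free setup are illustrative choices, not the library's exact internals):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=42),
    KNeighborsClassifier(n_neighbors=5),
]

# Out-of-fold P(class 1) from each base model -> leakage-free meta-features
meta_train = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method='predict_proba')[:, 1]
    for m in base_models
])

# Refit base models on all training data to produce test-set meta-features
meta_test = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models
])

meta_learner = LogisticRegression().fit(meta_train, y_train)
print(f"manual stacking accuracy: {meta_learner.score(meta_test, y_test):.4f}")
```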
Voting Regressors
For regression problems, use VotingRegressor which averages predictions from multiple regressors.
from sklearn.ensemble import VotingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_diabetes
X, y = load_diabetes(return_X_y=True)
# Create voting regressor
voting_reg = VotingRegressor(estimators=[
('ridge', Ridge()),
('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
('gb', GradientBoostingRegressor(random_state=42))
])
scores = cross_val_score(voting_reg, X, y, cv=5, scoring='r2')
print(f"Voting Regressor R²: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
VotingRegressor simply averages predictions from all base regressors. You can also specify weights to give more importance to certain models based on their individual performance.
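For example, passing weights makes VotingRegressor compute a weighted rather than plain mean of the base predictions. A small sketch; the [2, 1] weighting and the two chosen regressors are illustrative, not tuned values:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

weighted_reg = VotingRegressor(
    estimators=[
        ('ridge', Ridge()),
        ('rf', RandomForestRegressor(n_estimators=50, random_state=42)),
    ],
    weights=[2, 1],  # illustrative: trust Ridge twice as much as the forest
)

scores = cross_val_score(weighted_reg, X, y, cv=5, scoring='r2')
print(f"Weighted VotingRegressor R²: {scores.mean():.4f}")
```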
Practice Questions
Task: Create a soft voting classifier with LogisticRegression (weight=2), DecisionTree (weight=1), and SVC (weight=2) on the iris dataset.
Show Solution
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
voting = VotingClassifier(
estimators=[
('lr', LogisticRegression(max_iter=1000)),
('dt', DecisionTreeClassifier(random_state=42)),
('svc', SVC(probability=True, random_state=42))
],
voting='soft',
weights=[2, 1, 2] # LR and SVC get double weight
)
scores = cross_val_score(voting, X, y, cv=5, scoring='accuracy')
print(f"Weighted Voting: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
Task: Create a stacking classifier using RandomForest, GradientBoosting, and SVC as base estimators, with a GradientBoostingClassifier as the final estimator.
Show Solution
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
X, y = load_breast_cancer(return_X_y=True)
stacking = StackingClassifier(
estimators=[
('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
('gb', GradientBoostingClassifier(random_state=42)),
('svc', SVC(probability=True, random_state=42))
],
final_estimator=GradientBoostingClassifier(
n_estimators=50, max_depth=3, random_state=42
),
cv=5
)
scores = cross_val_score(stacking, X, y, cv=5, scoring='accuracy')
print(f"Stacking with GB Meta-Learner: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
Task: Create a VotingRegressor combining Ridge, RandomForestRegressor, and GradientBoostingRegressor on the diabetes dataset.
Show Solution
from sklearn.ensemble import VotingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
X, y = load_diabetes(return_X_y=True)
# Create individual regressors
estimators = [
('ridge', Ridge(alpha=1.0)),
('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
('gb', GradientBoostingRegressor(random_state=42))
]
# Create voting regressor
voting = VotingRegressor(estimators=estimators)
# Evaluate
scores = cross_val_score(voting, X, y, cv=5, scoring='r2')
print(f"Voting Regressor R²: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
# Compare with individual models
print("\nIndividual model scores:")
for name, model in estimators:
ind_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f" {name}: {ind_scores.mean():.4f}")
Task: Create both hard and soft voting classifiers and compare their performance on the iris dataset.
Show Solution
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
estimators = [
('lr', LogisticRegression(max_iter=200)),
('dt', DecisionTreeClassifier(random_state=42)),
('nb', GaussianNB()) # GaussianNB supports probability by default
]
# Hard voting - majority class wins
hard_voting = VotingClassifier(estimators=estimators, voting='hard')
# Soft voting - average probabilities
soft_voting = VotingClassifier(estimators=estimators, voting='soft')
print("Voting Classifier Comparison:")
print("-" * 40)
hard_scores = cross_val_score(hard_voting, X, y, cv=5, scoring='accuracy')
print(f"Hard Voting: {hard_scores.mean():.4f} (+/- {hard_scores.std()*2:.4f})")
soft_scores = cross_val_score(soft_voting, X, y, cv=5, scoring='accuracy')
print(f"Soft Voting: {soft_scores.mean():.4f} (+/- {soft_scores.std()*2:.4f})")
winner = "Soft Voting" if soft_scores.mean() > hard_scores.mean() else "Hard Voting"
print(f"\nBetter approach: {winner}")
Task: Create a StackingRegressor with Ridge, SVR, and RandomForest as base learners and LinearRegression as the meta-learner.
Show Solution
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
X, y = load_diabetes(return_X_y=True)
# Base estimators (SVR needs scaling)
base_estimators = [
('ridge', Ridge(alpha=1.0)),
('svr', Pipeline([
('scaler', StandardScaler()),
('svr', SVR(kernel='rbf'))
])),
('rf', RandomForestRegressor(n_estimators=50, random_state=42))
]
# Create stacking regressor
stacking = StackingRegressor(
estimators=base_estimators,
final_estimator=LinearRegression(),
cv=5
)
# Evaluate
scores = cross_val_score(stacking, X, y, cv=5, scoring='r2')
print(f"Stacking Regressor R²: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
# Compare with base models
print("\nBase model performance:")
for name, model in base_estimators:
base_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f" {name}: {base_scores.mean():.4f}")
Task: Create a function that automatically determines optimal weights for a voting classifier based on individual model cross-validation scores.
Show Solution
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
import numpy as np
def create_weighted_voting(models, X, y, cv=5, scoring='accuracy'):
"""Create voting classifier with weights based on CV performance."""
# Get CV scores for each model
cv_scores = {}
for name, model in models.items():
scores = cross_val_score(model, X, y, cv=cv, scoring=scoring)
cv_scores[name] = scores.mean()
print(f"{name}: {scores.mean():.4f}")
# Calculate weights proportional to scores
score_values = list(cv_scores.values())
min_score = min(score_values)
# Shift scores to be positive and normalize
weights = [(s - min_score + 0.01) for s in score_values]
weights = [w / sum(weights) * len(weights) for w in weights]
print(f"\nCalculated weights: {dict(zip(models.keys(), [f'{w:.2f}' for w in weights]))}")
# Create voting classifier with weights
estimators = [(name, model) for name, model in models.items()]
voting = VotingClassifier(
estimators=estimators,
voting='soft',
weights=weights
)
# Evaluate weighted ensemble
ensemble_scores = cross_val_score(voting, X, y, cv=cv, scoring=scoring)
print(f"\nWeighted Ensemble: {ensemble_scores.mean():.4f} (+/- {ensemble_scores.std()*2:.4f})")
return voting, weights
# Usage
X, y = load_breast_cancer(return_X_y=True)
models = {
'LR': LogisticRegression(max_iter=1000),
'RF': RandomForestClassifier(n_estimators=100, random_state=42),
'GB': GradientBoostingClassifier(random_state=42)
}
voting, weights = create_weighted_voting(models, X, y)
Task: Create a two-layer stacking ensemble where the first layer uses multiple diverse models and the second layer combines their predictions with another stacking classifier.
Show Solution
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
X, y = load_breast_cancer(return_X_y=True)
# Layer 1: Create two stacking classifiers with different base learners
stack_1a = StackingClassifier(
estimators=[
('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
('svc', SVC(probability=True, random_state=42))
],
final_estimator=LogisticRegression(max_iter=1000),
cv=3
)
stack_1b = StackingClassifier(
estimators=[
('gb', GradientBoostingClassifier(n_estimators=50, random_state=42)),
('knn', KNeighborsClassifier(n_neighbors=5)),
('nb', GaussianNB())
],
final_estimator=LogisticRegression(max_iter=1000),
cv=3
)
# Layer 2: Stack the first layer stacking classifiers
final_stack = StackingClassifier(
estimators=[
('stack_1a', stack_1a),
('stack_1b', stack_1b)
],
final_estimator=GradientBoostingClassifier(n_estimators=50, random_state=42),
cv=3
)
# Evaluate
print("Multi-Layer Stacking Performance:")
print("-" * 50)
# Individual first-layer stackers
scores_1a = cross_val_score(stack_1a, X, y, cv=5, scoring='accuracy')
print(f"Stack 1A (RF+SVC): {scores_1a.mean():.4f} (+/- {scores_1a.std()*2:.4f})")
scores_1b = cross_val_score(stack_1b, X, y, cv=5, scoring='accuracy')
print(f"Stack 1B (GB+KNN+NB): {scores_1b.mean():.4f} (+/- {scores_1b.std()*2:.4f})")
# Final multi-layer stack
final_scores = cross_val_score(final_stack, X, y, cv=5, scoring='accuracy')
print(f"\nFinal Stack (2-Layer): {final_scores.mean():.4f} (+/- {final_scores.std()*2:.4f})")
Model Selection Best Practices
Putting it all together: a systematic approach to model selection involves proper experimental design, avoiding common pitfalls, and documenting your process for reproducibility.
The Complete Model Selection Workflow
Follow this workflow for robust model selection:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
import numpy as np
# Step 1: Split data - hold out test set
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")
# Step 2: Define candidate models with pipelines
candidates = {
'Logistic': Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression(max_iter=1000))
]),
'RandomForest': Pipeline([
('model', RandomForestClassifier(n_estimators=100, random_state=42))
]),
'GradientBoosting': Pipeline([
('model', GradientBoostingClassifier(random_state=42))
]),
'SVM': Pipeline([
('scaler', StandardScaler()),
('model', SVC(random_state=42))
])
}
# Step 3: Compare models using cross-validation on training set only
print("\nModel Comparison (5-fold CV on training set):")
results = {}
for name, pipeline in candidates.items():
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy')
results[name] = scores
print(f" {name}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
# Step 4: Select best model and tune hyperparameters
best_name = max(results, key=lambda k: results[k].mean())
print(f"\nBest model: {best_name}")
# Step 5: Final evaluation on held-out test set
best_model = candidates[best_name]
best_model.fit(X_train, y_train)
test_score = best_model.score(X_test, y_test)
print(f"Final Test Score: {test_score:.4f}")
This workflow ensures proper separation between model selection and final evaluation. The test set is never used during model selection or tuning, only for the final unbiased estimate of performance.
Common Pitfalls to Avoid
Data Leakage
Never use test data for any decision. Preprocess (scale, impute) inside cross-validation using Pipelines to avoid leaking information from validation folds.
Overfitting to Validation
If you try too many models/hyperparameters, you risk overfitting to your validation set. Use nested CV or a held-out test set for unbiased evaluation.
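This selection bias is easy to demonstrate: even on pure-noise labels, where no model can genuinely beat chance, the best of many candidates typically looks better than chance on the very CV scores used to pick it. A sketch using randomly seeded trees as the "candidate models" (an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = rng.randint(0, 2, 100)  # labels are pure noise: no real signal to learn

# "Select" the best of 20 candidate models on the same CV data
candidate_scores = [
    cross_val_score(
        DecisionTreeClassifier(max_depth=2, splitter='random', random_state=seed),
        X, y, cv=5
    ).mean()
    for seed in range(20)
]
best_cv = max(candidate_scores)
print(f"Best candidate's CV accuracy on pure noise: {best_cv:.3f}")
```

The winner's score overstates its true performance; only a held-out test set (or nested CV) reveals that nothing was actually learned.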
Nested Cross-Validation
Nested CV provides an unbiased estimate of generalization performance when hyperparameter tuning is involved.
from sklearn.model_selection import cross_val_score, GridSearchCV
# Outer CV evaluates overall model performance
# Inner CV (inside GridSearchCV) tunes hyperparameters
param_grid = {
'model__n_estimators': [50, 100, 200],
'model__max_depth': [3, 5, 10, None]
}
pipeline = Pipeline([
('model', RandomForestClassifier(random_state=42))
])
# Inner CV for hyperparameter tuning
inner_cv = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
# Outer CV for performance estimation
outer_scores = cross_val_score(inner_cv, X, y, cv=5, scoring='accuracy')
print(f"Nested CV Score: {outer_scores.mean():.4f} (+/- {outer_scores.std()*2:.4f})")
Nested CV runs GridSearchCV inside each fold of the outer cross-validation. This gives an honest estimate of how well the entire pipeline (including hyperparameter tuning) will perform on new data.
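To see which hyperparameters each outer fold actually selects (they can differ from fold to fold, which is part of the variability nested CV measures), you can run the outer loop manually. A sketch with a deliberately small illustrative grid:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, None]}

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for fold, (train_idx, test_idx) in enumerate(outer.split(X, y)):
    # Inner CV: tune on this fold's training portion only
    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          param_grid, cv=3, scoring='accuracy', n_jobs=-1)
    search.fit(X[train_idx], y[train_idx])
    # Outer evaluation: score the tuned model on the held-out fold
    score = search.score(X[test_idx], y[test_idx])
    fold_scores.append(score)
    print(f"Fold {fold + 1}: score={score:.4f}, chosen params={search.best_params_}")

print(f"Nested CV estimate: {sum(fold_scores) / len(fold_scores):.4f}")
```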
Practice Questions
Task: Create a function that takes X, y and performs complete model selection: splits data, compares 3 models, tunes the best one with GridSearchCV, and returns the final model with its test score.
Show Solution
def select_best_model(X, y, random_state=42):
"""Complete model selection pipeline."""
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=random_state, stratify=y
)
# Define candidates
candidates = {
'LogisticRegression': {
'pipeline': Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression(max_iter=1000))
]),
'params': {'model__C': [0.1, 1, 10]}
},
'RandomForest': {
'pipeline': Pipeline([('model', RandomForestClassifier(random_state=42))]),
'params': {'model__n_estimators': [50, 100], 'model__max_depth': [5, 10, None]}
},
'GradientBoosting': {
'pipeline': Pipeline([('model', GradientBoostingClassifier(random_state=42))]),
'params': {'model__n_estimators': [50, 100], 'model__learning_rate': [0.05, 0.1]}
}
}
# Compare models
print("Initial Model Comparison:")
cv_scores = {}
for name, config in candidates.items():
scores = cross_val_score(config['pipeline'], X_train, y_train, cv=5)
cv_scores[name] = scores.mean()
print(f" {name}: {scores.mean():.4f}")
# Select and tune best
best_name = max(cv_scores, key=cv_scores.get)
print(f"\nTuning best model: {best_name}")
grid_search = GridSearchCV(
candidates[best_name]['pipeline'],
candidates[best_name]['params'],
cv=5, scoring='accuracy', n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
# Final evaluation
test_score = grid_search.score(X_test, y_test)
print(f"Final test score: {test_score:.4f}")
return grid_search.best_estimator_, test_score
# Usage
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
best_model, test_score = select_best_model(X, y)
Task: Split data into train/test sets with stratification, train a RandomForest on training data only, and evaluate on the held-out test set.
Show Solution
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
# Stratified split to maintain class proportions
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print(f"Training class distribution: {dict(zip(*np.unique(y_train, return_counts=True)))}")
print(f"Test class distribution: {dict(zip(*np.unique(y_test, return_counts=True)))}")
# Train model on training data ONLY
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Evaluate on held-out test set
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"\nTest Accuracy: {test_accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Task: Create a Pipeline that includes StandardScaler and LogisticRegression, then evaluate using cross-validation to ensure no data leakage.
Show Solution
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
# CORRECT: Pipeline ensures scaler is fit only on training data in each fold
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', LogisticRegression(max_iter=1000))
])
# Cross-validation with pipeline - no data leakage!
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f"Pipeline CV Score: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
# WRONG WAY (data leakage - DON'T DO THIS):
# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X) # Leaks test info into training!
# scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)
print("\nNote: Always use Pipeline for preprocessing to avoid data leakage!")
Task: Implement nested cross-validation with GridSearchCV as the inner loop and cross_val_score as the outer loop for unbiased performance estimation.
Show Solution
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
# Define model and parameter grid
model = RandomForestClassifier(random_state=42)
param_grid = {
'n_estimators': [50, 100],
'max_depth': [5, 10, None],
'min_samples_split': [2, 5]
}
# Inner CV: GridSearchCV for hyperparameter tuning
inner_cv = GridSearchCV(
model, param_grid,
cv=3, # 3-fold inner CV
scoring='accuracy',
n_jobs=-1
)
# Outer CV: Evaluate the entire tuning process
outer_scores = cross_val_score(
inner_cv, X, y,
cv=5, # 5-fold outer CV
scoring='accuracy'
)
print("Nested Cross-Validation Results:")
print(f"Outer CV Scores: {outer_scores}")
print(f"Mean: {outer_scores.mean():.4f} (+/- {outer_scores.std()*2:.4f})")
print("\nThis gives an unbiased estimate of generalization performance,"
"\nincluding the variability from hyperparameter tuning.")
Task: Implement a proper 3-way split (train/validation/test) for model selection and final evaluation without using cross-validation.
Solution:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
# First split: Separate test set (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Second split: Separate validation set (25% of remaining = 20% of original)
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
print(f"Train: {len(X_train)} | Val: {len(X_val)} | Test: {len(X_test)}")
# Define candidate models
models = {
'Logistic Regression': LogisticRegression(max_iter=1000),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Gradient Boosting': GradientBoostingClassifier(random_state=42)
}
# Select best model using validation set
print("\nModel Selection (Validation Set):")
best_model_name = None
best_val_score = 0
for name, model in models.items():
model.fit(X_train, y_train)
val_score = model.score(X_val, y_val)
print(f" {name}: {val_score:.4f}")
if val_score > best_val_score:
best_val_score = val_score
best_model_name = name
# Final evaluation on test set
print(f"\nBest Model: {best_model_name}")
final_model = models[best_model_name]
final_model.fit(X_train, y_train) # Could also retrain on train+val
test_score = final_model.score(X_test, y_test)
print(f"Final Test Score: {test_score:.4f}")
Task: Create a reproducible experiment class that tracks random seeds, logs all results, and allows experiments to be exactly replicated.
Solution:
import numpy as np
import json
from datetime import datetime
from sklearn.model_selection import cross_val_score, train_test_split
class ReproducibleExperiment:
"""Framework for reproducible ML experiments."""
def __init__(self, random_state=42):
self.random_state = random_state
self.results = []
self.start_time = datetime.now()
np.random.seed(random_state)
def run_cv_experiment(self, model, X, y, cv=5, scoring='accuracy', name=None):
"""Run a cross-validation experiment and log results."""
model_name = name or model.__class__.__name__
scores = cross_val_score(model, X, y, cv=cv, scoring=scoring)
result = {
'model': model_name,
'cv_folds': cv,
'scoring': scoring,
'scores': scores.tolist(),
'mean': scores.mean(),
'std': scores.std(),
'random_state': self.random_state,
'timestamp': datetime.now().isoformat()
}
self.results.append(result)
print(f"{model_name}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
return result
def get_summary(self):
"""Get summary DataFrame of all experiments."""
import pandas as pd
return pd.DataFrame([{
'Model': r['model'],
'Mean': r['mean'],
'Std': r['std']
} for r in self.results]).sort_values('Mean', ascending=False)
def save_results(self, filename):
"""Save results to JSON for reproducibility."""
with open(filename, 'w') as f:
json.dump({
'random_state': self.random_state,
'experiments': self.results
}, f, indent=2)
print(f"Results saved to {filename}")
# Usage
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
exp = ReproducibleExperiment(random_state=42)
exp.run_cv_experiment(LogisticRegression(max_iter=1000), X, y)
exp.run_cv_experiment(RandomForestClassifier(n_estimators=100, random_state=42), X, y)
exp.run_cv_experiment(GradientBoostingClassifier(random_state=42), X, y)
print("\nSummary:")
print(exp.get_summary())
Task: Create an automated model selection function that tries multiple models with different hyperparameters, uses nested CV for evaluation, and returns the best overall pipeline.
Solution:
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer
import warnings
warnings.filterwarnings('ignore')
def auto_select_model(X, y, cv=5, scoring='accuracy', verbose=True):
"""AutoML-style model selection with nested CV."""
# Define search space
model_configs = {
'LogisticRegression': {
'pipeline': Pipeline([
('scaler', StandardScaler()),
('model', LogisticRegression(max_iter=1000))
]),
'params': {'model__C': [0.1, 1, 10]}
},
'SVM': {
'pipeline': Pipeline([
('scaler', StandardScaler()),
('model', SVC(random_state=42))
]),
'params': {'model__C': [0.1, 1, 10], 'model__kernel': ['rbf', 'linear']}
},
'RandomForest': {
'pipeline': Pipeline([('model', RandomForestClassifier(random_state=42))]),
'params': {'model__n_estimators': [50, 100], 'model__max_depth': [5, 10, None]}
},
'GradientBoosting': {
'pipeline': Pipeline([('model', GradientBoostingClassifier(random_state=42))]),
'params': {'model__n_estimators': [50, 100], 'model__learning_rate': [0.05, 0.1]}
}
}
results = {}
if verbose:
print("=" * 60)
print("AUTO MODEL SELECTION")
print("=" * 60)
for name, config in model_configs.items():
# Nested CV: GridSearchCV inside cross_val_score
inner_cv = GridSearchCV(
config['pipeline'], config['params'],
cv=3, scoring=scoring, n_jobs=-1
)
outer_scores = cross_val_score(inner_cv, X, y, cv=cv, scoring=scoring)
results[name] = {
'mean': outer_scores.mean(),
'std': outer_scores.std(),
'scores': outer_scores,
'config': config
}
if verbose:
print(f"{name:20s}: {outer_scores.mean():.4f} (+/- {outer_scores.std()*2:.4f})")
# Select best
best_name = max(results, key=lambda k: results[k]['mean'])
best_config = results[best_name]['config']
# Final fit with best configuration
final_grid = GridSearchCV(
best_config['pipeline'], best_config['params'],
cv=cv, scoring=scoring, n_jobs=-1
)
final_grid.fit(X, y)
if verbose:
print("=" * 60)
print(f"BEST MODEL: {best_name}")
print(f"Best Params: {final_grid.best_params_}")
print(f"Nested CV Score: {results[best_name]['mean']:.4f}")
print("=" * 60)
return final_grid.best_estimator_, results
# Usage
X, y = load_breast_cancer(return_X_y=True)
best_model, all_results = auto_select_model(X, y)
Key Takeaways
Learning Curves for Diagnosis
Use learning_curve to visualize bias-variance tradeoff. Large gap = high variance (get more data), both curves low = high bias (use complex model).
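A minimal sketch of this diagnostic with scikit-learn's learning_curve; the dataset, model, and training-size grid are illustrative choices, not prescriptions:

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
# Compute train/validation scores at 5 training-set sizes, 5-fold CV each
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y, cv=5, scoring='accuracy',
    train_sizes=np.linspace(0.1, 1.0, 5), n_jobs=-1
)
# Average across folds, then inspect the final train-validation gap
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
gap = train_mean[-1] - val_mean[-1]
print(f"Final train score:      {train_mean[-1]:.4f}")
print(f"Final validation score: {val_mean[-1]:.4f}")
print(f"Gap (large => high variance): {gap:.4f}")
```

Plotting train_mean and val_mean against train_sizes gives the usual learning-curve picture; the printed gap is the quick numeric check.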
Validation Curves for Tuning
Use validation_curve to find optimal hyperparameter values. Look for where validation score peaks before training-validation gap widens.
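A short sketch of that tuning pattern with scikit-learn's validation_curve; max_depth and its candidate values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
param_range = [2, 5, 10, 20]  # candidate max_depth values (illustrative)
# Score the model across the parameter range with 5-fold CV
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y, param_name='max_depth', param_range=param_range,
    cv=5, scoring='accuracy', n_jobs=-1
)
# Pick the value where the mean validation score peaks
val_mean = val_scores.mean(axis=1)
best = param_range[int(np.argmax(val_mean))]
print(f"Validation means: {np.round(val_mean, 4)}")
print(f"Best max_depth by validation score: {best}")
```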
Fair Model Comparison
Use same CV splits for all models. Apply statistical tests (paired t-test, Wilcoxon) to verify differences are significant, not due to chance.
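One way to sketch this: pass the same KFold object to both models so the fold-wise scores are paired, then apply scipy's paired t-test (ttest_rel). The two models and the 0.05 threshold are illustrative choices:

```python
from scipy import stats
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
# A fixed KFold object => identical splits for every model compared
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores_lr = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_rf = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=cv
)
# Paired t-test on the fold-wise scores
t_stat, p_value = stats.ttest_rel(scores_lr, scores_rf)
print(f"LogReg: {scores_lr.mean():.4f} | RF: {scores_rf.mean():.4f}")
print(f"Paired t-test p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Difference is statistically significant at alpha=0.05")
else:
    print("No significant difference detected")
```

stats.wilcoxon(scores_lr, scores_rf) is the non-parametric alternative when the score differences are not roughly normal.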
Ensembles Combine Strengths
VotingClassifier averages predictions, StackingClassifier learns optimal weights. Ensembles work best with diverse base models.
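A compact sketch of both ensemble styles on the same base models; the particular trio of base learners is an arbitrary but deliberately diverse choice:

```python
from sklearn.ensemble import (VotingClassifier, StackingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
base_models = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('nb', GaussianNB())
]
# Soft voting averages the base models' predicted probabilities
voting = VotingClassifier(estimators=base_models, voting='soft')
# Stacking trains a meta-learner on the base models' predictions
stacking = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000)
)
for name, ens in [('Voting', voting), ('Stacking', stacking)]:
    scores = cross_val_score(ens, X, y, cv=5)
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
```

Note that voting='soft' requires every base model to implement predict_proba, which all three here do.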
Avoid Data Leakage
Never use test data for decisions. Use Pipelines for preprocessing inside CV. Hold out final test set for unbiased evaluation.
Nested CV for Honest Estimates
When tuning hyperparameters, use nested CV: inner CV for tuning, outer CV for performance estimation.
Knowledge Check
What does a large gap between training and validation curves in a learning curve indicate?
What is the purpose of a validation curve?
Which cross-validation strategy provides more robust estimates by repeating k-fold multiple times?
What is the difference between hard voting and soft voting in VotingClassifier?
Why is nested cross-validation important when tuning hyperparameters?
What is data leakage in machine learning?