Module 9.2

Regression Algorithms

Master regression techniques from simple linear models to advanced regularization methods. Learn to predict continuous values and prevent overfitting with Ridge, Lasso, and Elastic Net!

50 min read
Intermediate
Hands-on Examples
What You'll Learn
  • Simple and multiple linear regression
  • Polynomial regression for curves
  • Ridge (L2) regularization
  • Lasso (L1) feature selection
  • Elastic Net combined approach
Contents
01

Simple and Multiple Linear Regression

Linear regression models the relationship between features and a continuous target by fitting a linear equation to observed data using Ordinary Least Squares (OLS).

Real-world example: Imagine predicting house prices based on square footage, number of bedrooms, and location. Linear regression finds the mathematical relationship between these features and the price, allowing you to estimate the value of new properties you haven't seen before. Banks use this for property valuations, and real estate platforms use it for price recommendations!
Understanding Linear Regression

Linear regression finds the best-fitting line (or hyperplane) through data points by minimizing the sum of squared differences between predicted and actual values. This method, called Ordinary Least Squares (OLS), provides the optimal coefficients for the linear equation.

Simple Linear Regression

One feature predicting the target:

y = b0 + b1 * x

  • y - Predicted value (target)
  • b0 - Intercept (y when x=0)
  • b1 - Slope (change in y per unit x)
  • x - Input feature
Multiple Linear Regression

Multiple features predicting the target:

y = b0 + b1*x1 + b2*x2 + ... + bn*xn

  • b0 - Intercept term
  • b1...bn - Coefficients for each feature
  • x1...xn - Input features
Ordinary Least Squares (OLS)

OLS finds coefficients by minimizing the sum of squared residuals - the squared differences between actual values (y) and predicted values (y-hat).

Cost Function: J = (1/n) * Sum((yi - y_hat_i)^2)
Residual

Distance from point to the fitted line

Squared Error

Squaring penalizes larger errors more

Minimization

Find the line with smallest total error
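To make the minimization concrete, here is the closed-form OLS solution for one feature, applied to the 8-student study-hours dataset used in the walkthrough below (a sketch on the full dataset, so the values differ slightly from the train-split fit sklearn reports later):

```python
import numpy as np

# Study hours and exam scores (the 8-student dataset from the walkthrough)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([45, 55, 60, 70, 75, 82, 88, 92], dtype=float)

# Closed-form OLS for one feature:
#   b1 = Sum((x - x_mean)(y - y_mean)) / Sum((x - x_mean)^2)
#   b0 = y_mean - b1 * x_mean
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Cost from the formula above: J = (1/n) * Sum((yi - y_hat_i)^2)
y_hat = b0 + b1 * x
J = np.mean((y - y_hat) ** 2)

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, J = {J:.2f}")
```

Any other line through these points gives a larger J; that is exactly what "minimizing the sum of squared residuals" means.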

Simple Linear Regression

Let's build a simple model to predict exam scores based on study hours. This classic example demonstrates the core concepts of linear regression with a single input feature.

# ========================================
# STEP 1: Import required libraries
# ========================================
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

print("Libraries imported successfully!")

We'll use NumPy for numerical operations, scikit-learn's LinearRegression for the model, and train_test_split to divide our data. Now let's create our dataset:

# ========================================
# STEP 2: Create sample data
# ========================================
# Study hours (X) - must be 2D array for sklearn
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])

# Exam scores (y) - the target we want to predict
scores = np.array([45, 55, 60, 70, 75, 82, 88, 92])

print("Dataset Information:")
print(f"  Number of students: {len(hours)}")
print(f"  Study hours range: {hours.min()} to {hours.max()} hours")
print(f"  Score range: {scores.min()} to {scores.max()} points")
print(f"\nNotice: More study hours generally lead to higher scores!")

Our dataset has 8 students. Notice the pattern: as study hours increase, exam scores tend to increase too. This is a positive linear relationship. Now let's split our data:

# ========================================
# STEP 3: Split data into training and test sets
# ========================================
# Reserve 25% for testing (2 students)
X_train, X_test, y_train, y_test = train_test_split(
    hours, scores, 
    test_size=0.25,  # 25% for testing
    random_state=42  # For reproducibility
)

print("Data Split:")
print(f"  Training set: {len(X_train)} students")
print(f"  Test set: {len(X_test)} students")
print(f"\nTraining data (hours, scores):")
for hour, score in zip(X_train.flatten(), y_train):
    print(f"  {hour} hours -> {score} points")

We split the data 75-25. The model will learn from 6 students (training set) and be tested on 2 unseen students (test set). This mimics real-world scenarios where we predict for new data. Let's train the model:

# ========================================
# STEP 4: Create and train the model
# ========================================
model = LinearRegression()
model.fit(X_train, y_train)  # This is where learning happens!

print("Model trained successfully!")
print("\nLearned Parameters:")
print(f"  Intercept (b0): {model.intercept_:.2f}")
print(f"  Slope (b1): {model.coef_[0]:.2f}")
print(f"\nEquation: score = {model.intercept_:.2f} + {model.coef_[0]:.2f} * hours")
print(f"\nInterpretation:")
print(f"  - Base score (0 hours): {model.intercept_:.2f} points")
print(f"  - Each additional hour adds: {model.coef_[0]:.2f} points")

The model learned that a student who studies 0 hours would get around 38.5 points (intercept), and each additional hour of study adds about 6.8 points to the score (slope). This is the best-fit line through our data! Now let's evaluate:

# ========================================
# STEP 5: Make predictions and evaluate
# ========================================
# Predict on test set
predictions = model.predict(X_test)

print("Test Set Predictions:")
for hour, actual, pred in zip(X_test.flatten(), y_test, predictions):
    diff = actual - pred
    print(f"  {hour} hours: Predicted={pred:.1f}, Actual={actual}, Error={diff:.1f}")

# Calculate performance metrics
r2 = r2_score(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))

print(f"\nModel Performance:")
print(f"  R² Score: {r2:.4f} (1.0 = perfect fit)")
print(f"  RMSE: {rmse:.2f} points")

if r2 > 0.9:
    print("  ✓ Excellent fit! Model captures the pattern well.")
elif r2 > 0.7:
    print("  ✓ Good fit! Predictions are reliable.")
else:
    print("  ⚠ Moderate fit. Consider more data or features.")

Our model shows strong performance with an R² near 1.0, meaning it explains most of the variance in exam scores. The small RMSE indicates predictions are close to actual values. Finally, let's predict for a new student:

# ========================================
# STEP 6: Predict for new data
# ========================================
# What if a student studies 5.5 hours?
new_student_hours = np.array([[5.5]])
predicted_score = model.predict(new_student_hours)[0]

print("\nPrediction for New Student:")
print(f"  Study time: 5.5 hours")
print(f"  Predicted score: {predicted_score:.1f} points")
print(f"\nCalculation:")
print(f"  {model.intercept_:.2f} + {model.coef_[0]:.2f} * 5.5 = {predicted_score:.1f}")

# Try a few more predictions
print("\nAdditional Predictions:")
for hours_val in [2.5, 6.5, 9.0]:
    pred = model.predict([[hours_val]])[0]
    print(f"  {hours_val} hours -> {pred:.1f} points")

The model predicts a student studying 5.5 hours would score around 75 points. We can predict for any study time using our learned equation. Notice how predictions for 9 hours might seem high - this is extrapolation (predicting outside the training range), which can be less reliable. The model assumes the linear relationship continues beyond the observed data.
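Before moving on, it helps to see that the R² score used in these evaluations is one line of arithmetic, 1 - SS_res/SS_tot. A minimal sketch with made-up actual and predicted values:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values (for illustration only)
y_true = np.array([70.0, 88.0])
y_hat = np.array([72.5, 86.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(f"Manual R²:  {r2_manual:.4f}")
print(f"sklearn R²: {r2_score(y_true, y_hat):.4f}")  # identical
```

R² = 1 means SS_res = 0 (perfect predictions); R² = 0 means the model does no better than always predicting the mean.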

Multiple Linear Regression

Now let's predict house prices using multiple features: size, bedrooms, and age. This demonstrates how linear regression handles multiple inputs simultaneously to make more accurate predictions.

# ========================================
# STEP 1: Import libraries and create dataset
# ========================================
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# Create a realistic house price dataset
data = {
    'size_sqft': [1200, 1500, 1800, 2100, 2400, 2700, 3000, 3300],
    'bedrooms': [2, 2, 3, 3, 4, 4, 5, 5],
    'age_years': [20, 15, 12, 8, 10, 5, 3, 1],
    'price_1000s': [180, 220, 280, 340, 380, 450, 520, 600]
}
df = pd.DataFrame(data)

print("House Price Dataset:")
print(df.to_string(index=False))
print(f"\nDataset has {len(df)} houses with {len(df.columns)-1} features")

Our dataset contains 8 houses with 3 features each. Notice how price generally increases with size and bedrooms, but decreases with age. These are the patterns our model will learn. Let's prepare the data:

# ========================================
# STEP 2: Separate features (X) and target (y)
# ========================================
# Features: everything except price
X = df[['size_sqft', 'bedrooms', 'age_years']]

# Target: what we want to predict
y = df['price_1000s']

print("Features (X):")
print(f"  Shape: {X.shape} (8 houses, 3 features)")
print(f"  Columns: {X.columns.tolist()}")

print("\nTarget (y):")
print(f"  Shape: {y.shape}")
print(f"  Name: {y.name}")
print(f"  Range: ${y.min()} - ${y.max()} (in thousands)")

We've separated our predictors (size, bedrooms, age) from what we're predicting (price). This X and y structure is standard in machine learning. Now let's split for training and testing:

# ========================================
# STEP 3: Split into training and test sets
# ========================================
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.25,  # Reserve 25% for testing
    random_state=42
)

print("Data Split:")
print(f"  Training set: {len(X_train)} houses")
print(f"  Test set: {len(X_test)} houses")

print("\nTraining samples:")
for idx in X_train.index:
    print(f"  {df.loc[idx, 'size_sqft']} sqft, {df.loc[idx, 'bedrooms']} bed, " +
          f"{df.loc[idx, 'age_years']} yrs -> ${df.loc[idx, 'price_1000s']}k")

We'll train on 6 houses and test on 2 unseen houses. This 75-25 split is common for small datasets. With more data, you might use 80-20 or 90-10. Let's train the model:

# ========================================
# STEP 4: Train the multiple regression model
# ========================================
model = LinearRegression()
model.fit(X_train, y_train)

print("Model Training Complete!")
print("\nLearned Coefficients:")
print("-" * 60)
for feature, coef in zip(X.columns, model.coef_):
    direction = "increases" if coef > 0 else "decreases"
    print(f"  {feature:12s}: {coef:>10.4f}")
    print(f"    -> Each unit {direction} price by ${abs(coef):.4f}k")

print(f"\n  Intercept: {model.intercept_:>10.4f}")
print(f"    -> Base price when all features are zero")

print("\n" + "="*60)
print("EQUATION:")
print(f"price = {model.intercept_:.2f} + {model.coef_[0]:.4f}*size + " +
      f"{model.coef_[1]:.4f}*bedrooms + {model.coef_[2]:.4f}*age")
print("="*60)

The model learned three relationships: (1) Each additional square foot adds ~$0.15k to price, (2) Each bedroom adds ~$25k, (3) Each year of age reduces price by ~$2k. The negative coefficient for age makes intuitive sense - older houses are generally less valuable. Now let's evaluate:

# ========================================
# STEP 5: Make predictions and evaluate accuracy
# ========================================
# Predict on test set
y_pred = model.predict(X_test)

print("Test Set Predictions:")
print("-" * 70)
for idx, (actual, pred) in enumerate(zip(y_test, y_pred)):
    error = actual - pred
    error_pct = (error / actual) * 100
    test_idx = X_test.index[idx]
    
    print(f"House {test_idx}:")
    print(f"  Features: {X_test.loc[test_idx, 'size_sqft']} sqft, " +
          f"{X_test.loc[test_idx, 'bedrooms']} bed, " +
          f"{X_test.loc[test_idx, 'age_years']} yrs")
    print(f"  Predicted: ${pred:.1f}k")
    print(f"  Actual: ${actual:.1f}k")
    print(f"  Error: ${error:.1f}k ({error_pct:+.1f}%)")
    print()

# Calculate performance metrics
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = np.mean(np.abs(y_test - y_pred))

print("="*70)
print("Model Performance Metrics:")
print(f"  R² Score: {r2:.4f}")
print(f"    -> Model explains {r2*100:.1f}% of price variance")
print(f"  RMSE: ${rmse:.2f}k")
print(f"    -> Average prediction error magnitude")
print(f"  MAE: ${mae:.2f}k")
print(f"    -> Mean absolute error")

Our model shows excellent performance with R² close to 1.0, indicating it captures the relationships between features and price very well. The low RMSE confirms predictions are accurate. Let's make predictions for new houses:

# ========================================
# STEP 6: Predict prices for new houses
# ========================================
# Create some hypothetical houses
new_houses = pd.DataFrame({
    'size_sqft': [2000, 2500, 1800],
    'bedrooms': [3, 4, 2],
    'age_years': [5, 2, 15]
})

print("Predictions for New Houses:")
print("="*70)

predictions = model.predict(new_houses)
for idx, (pred, row) in enumerate(zip(predictions, new_houses.itertuples()), 1):
    print(f"\nHouse {idx}:")
    print(f"  Size: {row.size_sqft} sqft")
    print(f"  Bedrooms: {row.bedrooms}")
    print(f"  Age: {row.age_years} years")
    print(f"  Predicted Price: ${pred:.1f}k (${pred*1000:,.0f})")
    
    # Break down the prediction
    print(f"\n  Price breakdown:")
    print(f"    Base: ${model.intercept_:.2f}k")
    print(f"    Size contribution: ${model.coef_[0]*row.size_sqft:.2f}k")
    print(f"    Bedroom contribution: ${model.coef_[1]*row.bedrooms:.2f}k")
    print(f"    Age impact: ${model.coef_[2]*row.age_years:.2f}k")
    print(f"    Total: ${pred:.1f}k")

print("\n" + "="*70)
print("Key Insight: The model combines all features to make holistic predictions!")

For each new house, the model combines all three features to make a prediction. Notice how a larger, newer house with more bedrooms gets a higher price prediction. The beauty of multiple regression is that it considers all features simultaneously, capturing their combined effect on price. Each coefficient represents the independent contribution of that feature, holding all other features constant.
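That "holding all other features constant" claim can be verified directly: in a linear model, bumping one feature by exactly one unit moves the prediction by exactly that feature's coefficient. A sketch on assumed toy numbers (not the walkthrough's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Assumed toy data: size_sqft, bedrooms, age_years -> price in $1000s
X = np.array([[1200, 2, 20],
              [1800, 3, 12],
              [2400, 4, 10],
              [3000, 5, 3]], dtype=float)
y = np.array([180, 280, 380, 520], dtype=float)

model = LinearRegression().fit(X, y)

# Same house, one extra bedroom, everything else held constant
base = model.predict([[2000.0, 3.0, 8.0]])[0]
plus_one_bed = model.predict([[2000.0, 4.0, 8.0]])[0]

print(f"Bedroom coefficient: {model.coef_[1]:.4f}")
print(f"Prediction change:   {plus_one_bed - base:.4f}")  # equal by construction
```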

Key Assumptions
  • Linearity: the relationship between X and y is linear. If violated: poor predictions (try polynomial features).
  • Independence: observations are independent of one another. If violated: unreliable standard errors.
  • Homoscedasticity: residuals have constant variance. If violated: inefficient estimates.
  • Normality: residuals are normally distributed. If violated: invalid confidence intervals.
  • No multicollinearity: features are not highly correlated. If violated: unstable coefficients.
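A few of these assumptions can be eyeballed with quick diagnostics on the residuals. A rough sketch on an assumed synthetic dataset (real diagnostics would use residual plots and VIF, but the idea is the same):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Assumed synthetic dataset for the diagnostics
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'size_sqft': rng.integers(1000, 3000, 50),
    'age_years': rng.integers(0, 30, 50),
})
y = 0.15 * X['size_sqft'] - 2.0 * X['age_years'] + rng.normal(0, 10, 50)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# No multicollinearity: pairwise feature correlations should be modest
print("Feature correlations:\n", X.corr().round(2))

# Homoscedasticity (rough check): residual spread should be similar
# in the low-fitted and high-fitted halves of the data
lo = residuals[fitted < np.median(fitted)].std()
hi = residuals[fitted >= np.median(fitted)].std()
print(f"Residual std, low half: {lo:.2f} | high half: {hi:.2f}")

# With an intercept, OLS residuals always average to ~0
print(f"Residual mean: {residuals.mean():.2e}")
```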

Practice Questions: Linear Regression

Test your understanding of linear regression concepts.

Question: How do you interpret a coefficient of 50 for the size_sqft feature?

Show Solution
# Example: Interpreting coefficients
from sklearn.linear_model import LinearRegression
import numpy as np

# Train a simple model
X = np.array([[1000], [1500], [2000], [2500]])  # size in sqft
y = np.array([200000, 250000, 300000, 350000])   # price in dollars

model = LinearRegression()
model.fit(X, y)

print(f"Coefficient for size_sqft: {model.coef_[0]:.2f}")
print(f"Interpretation: Each additional sqft increases price by ${model.coef_[0]:.2f}")
print(f"\nPrediction for 1800 sqft: ${model.predict([[1800]])[0]:,.2f}")

Question: What does a negative coefficient indicate?

Show Solution
# Example: Negative coefficient
import numpy as np
from sklearn.linear_model import LinearRegression

# House age vs price (older houses = lower price)
X = np.array([[0], [5], [10], [15], [20]])  # age in years
y = np.array([300000, 275000, 250000, 225000, 200000])  # price

model = LinearRegression()
model.fit(X, y)

print(f"Coefficient for age: {model.coef_[0]:,.2f}")
print(f"Interpretation: Each additional year decreases price by ${abs(model.coef_[0]):,.2f}")
print(f"\nNew house (age=0): ${model.predict([[0]])[0]:,.2f}")
print(f"Old house (age=25): ${model.predict([[25]])[0]:,.2f}")

Question: Why does OLS square the errors instead of using absolute values?

Show Solution
# Demonstration: Squared vs Absolute Errors
import numpy as np

# Sample predictions and actuals
y_true = np.array([10, 20, 30, 40])
y_pred = np.array([12, 19, 35, 38])  # Small and large errors

# Calculate errors
errors = y_true - y_pred
absolute_errors = np.abs(errors)
squared_errors = errors ** 2

print("Error Analysis:")
for i in range(len(errors)):
    print(f"  Point {i+1}: error={errors[i]:2.0f}, |error|={absolute_errors[i]:2.0f}, error²={squared_errors[i]:4.0f}")

print(f"\nTotal Absolute Error: {np.sum(absolute_errors)}")
print(f"Total Squared Error: {np.sum(squared_errors)}")
print(f"\nNotice: Large errors (±5) contribute much more to squared error!")

Question: What is multicollinearity and how does it affect the model?

Show Solution
# Detecting multicollinearity
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Create correlated features (sqft and rooms are highly correlated)
np.random.seed(42)
sqft = np.random.randint(1000, 3000, 100)
rooms = sqft / 200 + np.random.randn(100) * 2  # Highly correlated!
price = sqft * 150 + rooms * 5000 + np.random.randn(100) * 10000

X = pd.DataFrame({'sqft': sqft, 'rooms': rooms})
y = price

# Check correlation
print("Correlation Matrix:")
print(X.corr())

# Train model
model = LinearRegression().fit(X, y)
print(f"\nCoefficients:")
print(f"  sqft: {model.coef_[0]:.2f}")
print(f"  rooms: {model.coef_[1]:.2f}")
print("\nCorrelation above 0.8 indicates multicollinearity!")
02

Polynomial Regression

When data shows a curved pattern, polynomial regression captures non-linear relationships by adding polynomial terms to the feature set while remaining linear in its coefficients.

Real-world example: Product sales often follow a curve - they start low during launch, grow rapidly, plateau at maturity, and may decline. A straight line can't capture this S-curve or U-curve pattern. Polynomial regression allows you to model these non-linear relationships while still using familiar linear regression techniques!
Understanding Polynomial Regression

When data shows a curved pattern, a straight line cannot capture the relationship well. Polynomial regression extends linear regression by adding polynomial terms (x^2, x^3, etc.) as new features, allowing the model to fit curves while remaining linear in its coefficients.

Polynomial Equation: y = b0 + b1*x + b2*x^2 + b3*x^3 + ... + bn*x^n

Degree 1 (Linear)

Straight line: y = b0 + b1*x

Degree 2 (Quadratic)

Parabola: y = b0 + b1*x + b2*x^2

Degree 3+ (Cubic+)

Complex curves with inflection points

Why is it Still Called Linear?

Polynomial regression is linear in its coefficients (b0, b1, b2), not in its features. The polynomial terms (x^2, x^3) are treated as new features. We are still finding a linear combination of features - just with transformed features.
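A quick way to see this: build the x² column by hand and hand it to plain LinearRegression. The fit is a parabola, but the model is still linear in b0, b1, b2 (noise-free numbers chosen so the coefficients are recovered exactly):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Quadratic data: y = 1 + 2x + 3x^2 (no noise, for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1 + 2 * x + 3 * x**2

# Treat x and x^2 as two separate features...
X = np.column_stack([x, x**2])

# ...and fit an ordinary LINEAR model on them
model = LinearRegression().fit(X, y)

print(f"b0 = {model.intercept_:.2f}")  # recovers 1.00
print(f"b1 = {model.coef_[0]:.2f}")    # recovers 2.00 (coefficient on x)
print(f"b2 = {model.coef_[1]:.2f}")    # recovers 3.00 (coefficient on x^2)
```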

Creating Polynomial Features

Polynomial features transform a single feature x into multiple features (x, x², x³, etc.). This allows linear regression to fit curved patterns. Let's see how this transformation works:

# ========================================
# STEP 1: Import and create original data
# ========================================
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Start with simple numbers
X = np.array([[2], [3], [4], [5]])

print("Original Feature Values:")
print(X)
print(f"Shape: {X.shape} (4 samples, 1 feature)")
print(f"\nThink of this as: [2, 3, 4, 5]")

We start with just one feature: the values 2, 3, 4, and 5. Right now, linear regression could only draw a straight line through these points. Let's add polynomial terms:

# ========================================
# STEP 2: Create degree-2 polynomial features
# ========================================
poly = PolynomialFeatures(
    degree=2,           # Maximum power (x, x^2)
    include_bias=False  # Don't add column of 1s (intercept handled by model)
)

# Transform the data
X_poly = poly.fit_transform(X)

print("Polynomial Features (degree=2):")
print(X_poly)
print(f"Shape: {X_poly.shape} (4 samples, 2 features)")

print("\nBreakdown for each sample:")
for i, (original, poly_row) in enumerate(zip(X, X_poly)):
    print(f"  x={original[0]:.0f} -> [x={poly_row[0]:.0f}, x²={poly_row[1]:.0f}]")

print("\nWhat happened:")
print("  Original: [2, 3, 4, 5]")
print("  Now: [[2, 4], [3, 9], [4, 16], [5, 25]]")
print("  Each row: [x, x²]")

The transformer created a new feature x² for each value! Now x=2 becomes [2, 4], meaning we have both the original value (2) and its square (4). This lets the model learn curved relationships. Let's try degree 3:

# ========================================
# STEP 3: Try degree-3 polynomial features
# ========================================
poly3 = PolynomialFeatures(degree=3, include_bias=False)
X_poly3 = poly3.fit_transform(X)

print("Polynomial Features (degree=3):")
print(X_poly3)
print(f"Shape: {X_poly3.shape} (4 samples, 3 features)")

print("\nBreakdown for each sample:")
for i, (original, poly_row) in enumerate(zip(X, X_poly3)):
    print(f"  x={original[0]:.0f} -> [x={poly_row[0]:.0f}, x²={poly_row[1]:.0f}, x³={poly_row[2]:.0f}]")

print("\nPattern:")
print("  Degree 1: [x]")
print("  Degree 2: [x, x²]")
print("  Degree 3: [x, x², x³]")
print("  Degree n: [x, x², x³, ..., x^n]")

With degree 3, we now have three features per sample: the original x, x², and x³. For x=2, we get [2, 4, 8]. Higher degrees allow more complex curves, but be careful - too high and you'll overfit! With multiple original features, PolynomialFeatures also creates interaction terms:

# ========================================
# STEP 4: Multiple features and interactions
# ========================================
# Two original features
X_multi = np.array([[1, 2], [3, 4], [5, 6]])

print("Original features (x1, x2):")
print(X_multi)
print(f"Shape: {X_multi.shape}")

# Apply degree-2 transformation
poly_multi = PolynomialFeatures(degree=2, include_bias=False)
X_multi_poly = poly_multi.fit_transform(X_multi)

print("\nPolynomial features (degree=2):")
print(X_multi_poly)
print(f"Shape: {X_multi_poly.shape}")

print("\nFeature names:")
print(poly_multi.get_feature_names_out(['x1', 'x2']))

print("\nBreakdown:")
print("  [x1, x2] becomes [x1, x2, x1², x1*x2, x2²]")
print("  Original: 2 features")
print("  After degree-2: 5 features (includes x1*x2 interaction!)")

print("\n" + "="*60)
print("KEY INSIGHT: With n features and degree d,")
print("the number of output features = C(n+d, d) - 1")
print("For 2 features, degree 2: 5 features")
print("For 10 features, degree 2: 65 features!")
print("Feature explosion can happen quickly!")

With multiple features, PolynomialFeatures creates not just powers (x1², x2²) but also interactions (x1*x2). This can lead to feature explosion - 10 features with degree 2 becomes 65 features! This is why feature selection and regularization become crucial with polynomial regression.
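The C(n+d, d) - 1 count above can be checked against what PolynomialFeatures actually produces. A small verification sketch:

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

# Verify the feature-count formula C(n+d, d) - 1 (include_bias=False)
for n_features in [2, 5, 10]:
    for degree in [2, 3]:
        X = np.ones((1, n_features))  # one dummy sample is enough
        poly = PolynomialFeatures(degree=degree, include_bias=False)
        n_out = poly.fit_transform(X).shape[1]
        formula = comb(n_features + degree, degree) - 1
        print(f"n={n_features:2d}, d={degree}: sklearn={n_out:3d}, formula={formula:3d}")
```

For 10 features at degree 3 the formula gives C(13, 3) - 1 = 285 features, which shows how quickly the count grows.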

Polynomial Regression Pipeline

Let's build a complete polynomial regression pipeline that generates non-linear data, tests multiple polynomial degrees, and compares their performance. This shows how to find the optimal complexity for your data:

# ========================================
# STEP 1: Import libraries and generate curved data
# ========================================
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Set random seed for reproducibility
np.random.seed(42)

# Generate non-linear data: y = 2x² - 3x + 5 + noise
X = np.linspace(0, 10, 100).reshape(-1, 1)  # 100 points from 0 to 10
y = 2*X.flatten()**2 - 3*X.flatten() + 5 + np.random.randn(100)*10

print("Dataset Information:")
print(f"  Samples: {len(X)}")
print(f"  X range: [{X.min():.1f}, {X.max():.1f}]")
print(f"  y range: [{y.min():.1f}, {y.max():.1f}]")
print(f"  True relationship: y = 2x² - 3x + 5 + noise")
print(f"  This is a QUADRATIC (degree 2) pattern!")

We've created data following a parabola (y = 2x² - 3x + 5) with added noise. Since the true relationship is quadratic, we expect degree 2 to perform best. Let's split the data:

# ========================================
# STEP 2: Split data for training and testing
# ========================================
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% for testing
    random_state=42
)

print("Data Split:")
print(f"  Training: {len(X_train)} samples (80%)")
print(f"  Testing: {len(X_test)} samples (20%)")
print(f"\nWhy split? We need to test on unseen data to detect overfitting!")

The 80-20 split lets us train on most data while reserving some for unbiased evaluation. Now let's test different polynomial degrees:

# ========================================
# STEP 3: Test multiple polynomial degrees
# ========================================
print("\nTesting Different Polynomial Degrees:")
print("="*70)

degrees = [1, 2, 3, 5, 10]  # From simple to very complex
results = []

for degree in degrees:
    # Create pipeline: polynomial transformation -> linear regression
    pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('regressor', LinearRegression())
    ])
    
    # Train the pipeline
    pipeline.fit(X_train, y_train)
    
    # Predict on both sets
    y_train_pred = pipeline.predict(X_train)
    y_test_pred = pipeline.predict(X_test)
    
    # Calculate metrics
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    
    results.append({
        'degree': degree,
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'train_r2': train_r2,
        'test_r2': test_r2
    })
    
    print(f"\nDegree {degree}:")
    print(f"  Train RMSE: {train_rmse:6.2f}  |  Test RMSE: {test_rmse:6.2f}")
    print(f"  Train R²:   {train_r2:6.4f}  |  Test R²:   {test_r2:6.4f}")

Notice the pattern in the results! For degree 1 (linear), both train and test performance are poor - it's underfitting. For degree 2, performance improves dramatically because it matches the true quadratic relationship. For very high degrees (10+), train RMSE drops but test RMSE increases - that's overfitting! Let's analyze the results:

# ========================================
# STEP 4: Analyze and interpret results
# ========================================
print("\n" + "="*70)
print("ANALYSIS:")
print("="*70)

# Find best degree by test RMSE
best_idx = min(range(len(results)), key=lambda i: results[i]['test_rmse'])
best = results[best_idx]

print(f"\nBest degree: {best['degree']}")
print(f"  Test RMSE: {best['test_rmse']:.2f}")
print(f"  Test R²: {best['test_r2']:.4f}")

# Check for overfitting (large gap between train and test)
print("\nOverfitting Check:")
for res in results:
    gap = res['train_rmse'] - res['test_rmse']
    gap_pct = abs(gap) / res['train_rmse'] * 100
    
    if gap < -5:  # Test error much higher than train
        status = "⚠ OVERFITTING"
    elif gap_pct < 10:
        status = "✓ Good fit"
    else:
        status = "~ Acceptable"
    
    print(f"  Degree {res['degree']:2d}: Gap = {gap:6.2f}  {status}")

print("\n" + "="*70)
print("KEY INSIGHTS:")
print("="*70)
print("1. Degree 1 (linear): Underfits - can't capture the curve")
print("2. Degree 2 (quadratic): BEST - matches true relationship!")
print("3. Degree 3-5: Slight overfitting, but still reasonable")
print("4. Degree 10+: Severe overfitting - train perfect, test poor")
print("\nRule of thumb: Choose lowest degree with good test performance!")

The analysis reveals the bias-variance tradeoff in action! Degree 1 has high bias (underfitting), degree 2 balances bias and variance, and degree 10 has high variance (overfitting). Let's visualize this:

# ========================================
# STEP 5: Visualize the fits (optional)
# ========================================
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 5))

# Create smooth line for predictions
X_line = np.linspace(0, 10, 300).reshape(-1, 1)

for i, degree in enumerate([1, 2, 10], 1):
    plt.subplot(1, 3, i)
    
    # Train model
    pipeline = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('regressor', LinearRegression())
    ])
    pipeline.fit(X_train, y_train)
    y_line = pipeline.predict(X_line)
    
    # Plot
    plt.scatter(X_train, y_train, alpha=0.5, label='Training data', s=30)
    plt.scatter(X_test, y_test, alpha=0.5, label='Test data', s=30, color='orange')
    plt.plot(X_line, y_line, 'r-', linewidth=2, label=f'Degree {degree} fit')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.title(f'Polynomial Degree {degree}')
    plt.legend()
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('polynomial_comparison.png', dpi=100, bbox_inches='tight')
print("\nVisualization saved as 'polynomial_comparison.png'")
print("Notice: Degree 1 = straight line, Degree 2 = smooth curve, Degree 10 = wiggly!")

The visualization clearly shows the problem: degree 1 misses the curve entirely, degree 2 captures it smoothly, and degree 10 wiggles excessively trying to hit every training point. This visual confirmation helps explain why degree 2 achieves the best test performance - it matches the complexity of the true underlying relationship without overfitting to noise.

Cross-Validation Tip: Instead of manually testing degrees, use cross-validation to systematically find the optimal degree. The validation curve will show exactly where performance peaks before overfitting begins.
Choosing the Right Degree
Too Low (Underfitting)
  • Model too simple for the data
  • Cannot capture the curved pattern
  • Low train and test scores
  • High bias, low variance
Too High (Overfitting)
  • Model too complex for the data
  • Fits noise instead of pattern
  • High train, low test score
  • Low bias, high variance
Best Practice: Use cross-validation to compare different polynomial degrees. Choose the simplest degree that achieves good test performance. Start low and increase only if needed.
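Following that best practice, the degree search can be automated with cross_val_score. This sketch reuses the quadratic data from the pipeline example above:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Same quadratic data as the pipeline example
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2 * X.ravel()**2 - 3 * X.ravel() + 5 + np.random.randn(100) * 10

cv_scores = {}
for degree in [1, 2, 3, 5, 10]:
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('reg', LinearRegression()),
    ])
    # Mean R² across 5 folds: a more reliable estimate than one split
    cv_scores[degree] = cross_val_score(pipe, X, y, cv=5, scoring='r2').mean()
    print(f"Degree {degree:2d}: mean CV R² = {cv_scores[degree]:.4f}")

best = max(cv_scores, key=cv_scores.get)
print(f"\nBest degree by cross-validation: {best}")
```

Because every sample serves as test data in exactly one fold, the chosen degree does not hinge on one lucky or unlucky split.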

Practice Questions: Polynomial Regression

Test your understanding of polynomial regression and feature engineering.

Question: How many features does PolynomialFeatures(degree=3) create from one input?

Show Solution
# Counting polynomial features
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2], [3], [4]])

# Without bias
poly_no_bias = PolynomialFeatures(degree=3, include_bias=False)
X_poly_no_bias = poly_no_bias.fit_transform(X)

print("Without bias:")
print(f"  Original shape: {X.shape}")
print(f"  Polynomial shape: {X_poly_no_bias.shape}")
print(f"  Feature names: {poly_no_bias.get_feature_names_out(['x'])}")

# With bias (default)
poly_with_bias = PolynomialFeatures(degree=3, include_bias=True)
X_poly_with_bias = poly_with_bias.fit_transform(X)

print(f"\nWith bias:")
print(f"  Polynomial shape: {X_poly_with_bias.shape}")
print(f"  Feature names: {poly_with_bias.get_feature_names_out(['x'])}")

Question: Why is using a very high polynomial degree risky?

Show Solution
# Demonstrating overfitting with high degree
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

np.random.seed(42)
X = np.linspace(0, 10, 20).reshape(-1, 1)
y = 2 + 3*X.ravel() + np.random.randn(20) * 2  # Linear with noise

for degree in [2, 5, 15]:
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)
    
    model = LinearRegression().fit(X_poly, y)
    y_pred = model.predict(X_poly)
    
    print(f"Degree {degree:2d}:")
    print(f"  R²: {r2_score(y, y_pred):.4f}")
    print(f"  Max |coefficient|: {np.max(np.abs(model.coef_)):.2e}")
    print(f"  Features created: {X_poly.shape[1]}")
    
print("\nNotice: Degree 15 fits the training data almost perfectly, but with huge coefficients!")

Question: What happens to feature count with multiple inputs and high degree?

Show Solution
# Feature explosion demonstration
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

print("Feature Explosion Analysis:")
print("-" * 50)

for n_features in [2, 5, 10]:
    X = np.random.rand(10, n_features)
    
    for degree in [2, 3]:
        poly = PolynomialFeatures(degree=degree, include_bias=False)
        X_poly = poly.fit_transform(X)
        
        print(f"\nOriginal: {n_features} features, Degree: {degree}")
        print(f"  Result: {X_poly.shape[1]} features")
        print(f"  Growth: {X_poly.shape[1]/n_features:.1f}x increase")

print("\n" + "="*50)
print("With 10 features and degree 3: 285 features!")
print("This makes the model very complex and prone to overfitting.")

Question: When should you consider polynomial regression?

Show Solution
# Checking if polynomial regression is needed
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# Create curved data (quadratic relationship)
np.random.seed(42)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 + 3*X.ravel() - 0.5*X.ravel()**2 + np.random.randn(50) * 2

# Try linear
linear_model = LinearRegression().fit(X, y)
y_pred_linear = linear_model.predict(X)
linear_r2 = r2_score(y, y_pred_linear)

# Try polynomial (degree 2)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
poly_model = LinearRegression().fit(X_poly, y)
y_pred_poly = poly_model.predict(X_poly)
poly_r2 = r2_score(y, y_pred_poly)

print("Model Comparison:")
print(f"  Linear R²: {linear_r2:.4f}")
print(f"  Polynomial R²: {poly_r2:.4f}")
print(f"\nImprovement: {(poly_r2 - linear_r2)*100:.1f} percentage points of R²")
print("\nConclusion: Use polynomial when linear shows poor fit!")
03

Ridge and Lasso Regularization

Regularization adds penalty terms to the loss function, discouraging large coefficients to prevent overfitting and improve generalization to new data.

Why regularization matters: In high-dimensional data (many features), standard linear regression can overfit dramatically. Imagine building a model with 1000 features but only 500 samples - without regularization, the model memorizes training data instead of learning patterns. Regularization acts as a safeguard, keeping models simple and generalizable. It's essential for modern machine learning!
Understanding Regularization

Regularization adds a penalty term to the loss function that discourages large coefficient values. This constrains the model, preventing it from fitting noise in the training data and improving generalization to new data.

Regularized Loss: Loss = MSE + alpha * Penalty(coefficients)

The alpha hyperparameter controls the strength of regularization:

alpha = 0

No regularization (standard OLS)

Optimal alpha

Balance between fit and simplicity

Large alpha

Heavy penalty, simple model
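The three regimes can be observed by sweeping alpha on a small synthetic dataset (values chosen for illustration):

```python
# Watching Ridge coefficients shrink as alpha grows
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

np.random.seed(0)
X = StandardScaler().fit_transform(np.random.randn(100, 4))
y = 3*X[:, 0] - 2*X[:, 1] + np.random.randn(100) * 0.5

max_coefs = []
for alpha in [0.001, 1.0, 100.0, 10000.0]:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    max_coefs.append(np.max(np.abs(coef)))
    print(f"alpha = {alpha:>8}: max |coefficient| = {max_coefs[-1]:.4f}")

# Tiny alpha behaves like OLS; huge alpha pushes every coefficient toward zero
```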

Ridge Regression (L2)

Penalty: Sum of squared coefficients

Penalty = alpha * Sum(bi^2)
Characteristics:
  • Shrinks coefficients toward zero
  • Never sets coefficients to exactly zero
  • Keeps all features in the model
  • Good when most features are useful
  • Handles multicollinearity well
Lasso Regression (L1)

Penalty: Sum of absolute coefficients

Penalty = alpha * Sum(|bi|)
Characteristics:
  • Shrinks coefficients toward zero
  • Can set coefficients to exactly zero
  • Performs automatic feature selection
  • Good when many features are irrelevant
  • Creates sparse, interpretable models
Always Scale Features First!

Regularization penalizes coefficient magnitude. Without scaling, features on different scales get penalized unfairly - large-scale features have smaller coefficients and get penalized less. Always use StandardScaler or MinMaxScaler before applying regularization.
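A minimal sketch of the pitfall, assuming synthetic data where feature 1 carries the same information as feature 0 but on a 1000x larger scale:

```python
# Why scaling matters for regularization
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

np.random.seed(0)
n = 200
f0 = np.random.randn(n)              # small-scale feature
f1 = np.random.randn(n) * 1000       # equally informative, 1000x larger scale
X = np.column_stack([f0, f1])
y = f0 + f1 / 1000 + np.random.randn(n) * 0.1  # both contribute equally to y

# Unscaled: the large feature needs only a tiny coefficient, so the
# L2 penalty barely touches it while noticeably shrinking feature 0
ridge_raw = Ridge(alpha=100).fit(X, y)
print(f"Unscaled coefficients: {ridge_raw.coef_}")

# Scaled: both features face the penalty on equal terms
X_scaled = StandardScaler().fit_transform(X)
ridge_scaled = Ridge(alpha=100).fit(X_scaled, y)
print(f"Scaled coefficients:   {ridge_scaled.coef_}")
```

After scaling, the two equally informative features receive nearly identical coefficients; without it, only the small-scale feature gets shrunk.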

Ridge Regression Implementation

Ridge regression (L2 regularization) shrinks coefficients by penalizing their squared values. This prevents overfitting by keeping coefficients small, making the model more stable. Let's see how it works:

# ========================================
# STEP 1: Import libraries and generate data
# ========================================
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
import numpy as np

# Generate regression data with multiple features
X, y = make_regression(
    n_samples=200,   # 200 data points
    n_features=20,   # 20 features (many for the sample size!)
    noise=10,        # Add some noise
    random_state=42
)

print("Dataset Information:")
print(f"  Samples: {X.shape[0]}")
print(f"  Features: {X.shape[1]}")
print(f"  Target range: [{y.min():.1f}, {y.max():.1f}]")
print(f"\nWith 20 features and 200 samples, overfitting is a real risk!")

We have 20 features for only 200 samples - a relatively high feature-to-sample ratio where regularization becomes important. Without it, the model might fit noise. Let's split and scale:

# ========================================
# STEP 2: Split data and scale features
# ========================================
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% for testing
    random_state=42
)

print("Data Split:")
print(f"  Training: {len(X_train)} samples")
print(f"  Testing: {len(X_test)} samples")

# CRITICAL: Scale features before regularization!
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("\nFeature Scaling:")
print(f"  Original X_train mean: {X_train.mean(axis=0)[:3]}")
print(f"  Scaled X_train mean: {X_train_scaled.mean(axis=0)[:3]}")
print(f"  Scaled X_train std: {X_train_scaled.std(axis=0)[:3]}")
print("\nAfter scaling: mean ≈ 0, std ≈ 1 for all features")
print("This ensures fair penalty across all features!")

StandardScaler transforms each feature to have mean=0 and std=1. This is crucial for regularization! Without scaling, features with large values would have artificially small coefficients, getting penalized less. Now let's try Ridge with a specific alpha:

# ========================================
# STEP 3: Train Ridge with specific alpha
# ========================================
# Try alpha = 1.0
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)

# Evaluate on test set
train_score = ridge.score(X_train_scaled, y_train)
test_score = ridge.score(X_test_scaled, y_test)

print("Ridge Regression (alpha=1.0):")
print(f"  Training R²: {train_score:.4f}")
print(f"  Test R²: {test_score:.4f}")
print(f"  Gap: {train_score - test_score:.4f}")

if train_score - test_score < 0.05:
    print("Small gap = good generalization!")
else:
    print("Large gap = possible overfitting")

# Examine coefficients
print(f"\nCoefficient Statistics:")
print(f"  Max |coefficient|: {np.max(np.abs(ridge.coef_)):.4f}")
print(f"  Mean |coefficient|: {np.mean(np.abs(ridge.coef_)):.4f}")
print(f"  Coefficients near zero (<0.1): {np.sum(np.abs(ridge.coef_) < 0.1)}")
print(f"  All 20 coefficients are non-zero (Ridge never zeros out!)")

With alpha=1.0, Ridge shrinks coefficients but keeps all 20 features in the model. The penalty term (alpha * sum of squared coefficients) keeps them from getting too large. But is alpha=1.0 optimal? Let's use cross-validation to find the best value:

# ========================================
# STEP 4: Find optimal alpha with RidgeCV
# ========================================
# Test a range of alpha values
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

print("Testing different alpha values:")
print("-" * 60)

# RidgeCV performs cross-validation to find best alpha
ridge_cv = RidgeCV(
    alphas=alphas,
    cv=5,           # 5-fold cross-validation
    scoring='r2'    # Optimize for R² score
)
ridge_cv.fit(X_train_scaled, y_train)

print(f"\nBest alpha found: {ridge_cv.alpha_}")
print(f"\nThis was selected from {len(alphas)} candidates using 5-fold CV")
print(f"Each alpha was tested on 5 different train/validation splits!")

# Evaluate the best model
train_score_cv = ridge_cv.score(X_train_scaled, y_train)
test_score_cv = ridge_cv.score(X_test_scaled, y_test)

print(f"\nBest Ridge Model Performance:")
print(f"  Training R²: {train_score_cv:.4f}")
print(f"  Test R²: {test_score_cv:.4f}")
print(f"  Gap: {train_score_cv - test_score_cv:.4f}")

RidgeCV automatically tested all 7 alpha values using 5-fold cross-validation and selected the one with best validation performance. This systematic approach is better than manually guessing alpha. Let's compare with an unregularized model:

# ========================================
# STEP 5: Compare with unregularized model
# ========================================
from sklearn.linear_model import LinearRegression

# Train standard linear regression (no regularization)
linreg = LinearRegression()
linreg.fit(X_train_scaled, y_train)

linreg_train = linreg.score(X_train_scaled, y_train)
linreg_test = linreg.score(X_test_scaled, y_test)

print("\n" + "="*70)
print("COMPARISON: Ridge vs Standard Linear Regression")
print("="*70)

print("\nStandard Linear Regression:")
print(f"  Training R²: {linreg_train:.4f}")
print(f"  Test R²: {linreg_test:.4f}")
print(f"  Gap: {linreg_train - linreg_test:.4f}")
print(f"  Max |coefficient|: {np.max(np.abs(linreg.coef_)):.4f}")

print(f"\nRidge (alpha={ridge_cv.alpha_}):")
print(f"  Training R²: {train_score_cv:.4f}")
print(f"  Test R²: {test_score_cv:.4f}")
print(f"  Gap: {train_score_cv - test_score_cv:.4f}")
print(f"  Max |coefficient|: {np.max(np.abs(ridge_cv.coef_)):.4f}")

improvement = test_score_cv - linreg_test
coef_reduction = (1 - np.max(np.abs(ridge_cv.coef_)) / np.max(np.abs(linreg.coef_))) * 100

print(f"\n" + "="*70)
print("RIDGE BENEFITS:")
print(f"  Test performance improvement: {improvement:+.4f}")
print(f"  Coefficient shrinkage: {coef_reduction:.1f}%")
print(f"  Smaller gap between train/test (better generalization!)")
print("="*70)

Ridge regularization typically shows smaller coefficients and better generalization (smaller train-test gap) compared to standard linear regression. The L2 penalty keeps coefficients manageable, preventing the model from relying too heavily on any single feature. This is especially valuable when features are correlated or when sample size is limited relative to feature count.

Key Insight: Ridge regression trades a small amount of training accuracy for better test performance. By constraining coefficients, it prevents overfitting and creates more stable, generalizable models.
Lasso Regression for Feature Selection

Lasso (L1 regularization) can set coefficients to exactly zero, effectively removing features from the model. This makes it perfect for feature selection! Let's create data where only some features matter:

# ========================================
# STEP 1: Create data with relevant and irrelevant features
# ========================================
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

# Set seed for reproducibility
np.random.seed(42)
n_samples = 200

# Create 5 RELEVANT features
X_relevant = np.random.randn(n_samples, 5)

# Create 15 IRRELEVANT noise features
X_noise = np.random.randn(n_samples, 15)

# Combine: 20 features total (only first 5 matter)
X = np.hstack([X_relevant, X_noise])

print("Dataset Design:")
print(f"  Total features: 20")
print(f"  Relevant features: 5 (indices 0-4)")
print(f"  Noise features: 15 (indices 5-19)")

# Target depends ONLY on the first 5 features
y = (3*X[:, 0] +      # Feature 0 matters
     2*X[:, 1] +      # Feature 1 matters
     -1.5*X[:, 2] +   # Feature 2 matters (negative relationship!)
     0.5*X[:, 3] +    # Feature 3 matters (small effect)
     1*X[:, 4] +      # Feature 4 matters
     np.random.randn(n_samples)*0.5)  # Add small noise

print("\nTrue coefficients:")
print("  Feature 0: +3.0")
print("  Feature 1: +2.0")
print("  Feature 2: -1.5")
print("  Feature 3: +0.5")
print("  Feature 4: +1.0")
print("  Features 5-19: 0.0 (should have NO effect!)")

We've deliberately created data where features 5-19 are pure noise with no relationship to y. A good feature selection method should identify and eliminate these. Let's prepare the data:

# ========================================
# STEP 2: Split and scale the data
# ========================================
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42
)

print(f"\nData Split:")
print(f"  Training: {len(X_train)} samples")
print(f"  Testing: {len(X_test)} samples")

# Scale features (essential for regularization!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\nFeatures scaled to mean=0, std=1")
print(f"This ensures fair L1 penalty across all features")

With scaling complete, all features are on equal footing. Now Lasso can fairly decide which features to keep based on their predictive value, not their scale. Let's train with cross-validation:

# ========================================
# STEP 3: Train Lasso with cross-validation
# ========================================
# LassoCV automatically finds best alpha
lasso_cv = LassoCV(
    cv=5,              # 5-fold cross-validation
    random_state=42,
    max_iter=10000     # Ensure convergence
)

print("Training Lasso with 5-fold cross-validation...")
lasso_cv.fit(X_train_scaled, y_train)

print(f"\nOptimal alpha: {lasso_cv.alpha_:.6f}")

# Evaluate performance
train_score = lasso_cv.score(X_train_scaled, y_train)
test_score = lasso_cv.score(X_test_scaled, y_test)

print(f"\nModel Performance:")
print(f"  Training R²: {train_score:.4f}")
print(f"  Test R²: {test_score:.4f}")
print(f"  Gap: {train_score - test_score:.4f} (smaller is better!)")

LassoCV found the optimal alpha value that balances fit and sparsity. Now comes the exciting part - let's see which features it selected:

# ========================================
# STEP 4: Examine feature selection results
# ========================================
print("\n" + "="*70)
print("FEATURE SELECTION ANALYSIS")
print("="*70)

# Count non-zero coefficients
n_selected = np.sum(lasso_cv.coef_ != 0)
n_zeroed = np.sum(lasso_cv.coef_ == 0)

print(f"\nFeatures selected: {n_selected} out of 20")
print(f"Features eliminated: {n_zeroed}")

# Find which features were selected
selected_indices = np.where(lasso_cv.coef_ != 0)[0]
zeroed_indices = np.where(lasso_cv.coef_ == 0)[0]

print(f"\nSelected feature indices: {selected_indices.tolist()}")
if len(zeroed_indices) > 10:
    print(f"Eliminated feature indices: {zeroed_indices.tolist()[:10]}...")
else:
    print(f"Eliminated feature indices: {zeroed_indices.tolist()}")

print("\nCoefficients for SELECTED features:")
print("-" * 70)
for idx in selected_indices:
    coef = lasso_cv.coef_[idx]
    feature_type = "RELEVANT" if idx < 5 else "NOISE"
    print(f"  Feature {idx:2d}: {coef:>8.4f}  [{feature_type}]")

if len(selected_indices) > 0:
    n_correct = np.sum(selected_indices < 5)  # Should be indices 0-4
    n_incorrect = len(selected_indices) - n_correct
    
    print(f"\nSelection Accuracy:")
    print(f"  Correctly selected relevant features: {n_correct} / 5")
    print(f"  Incorrectly selected noise features: {n_incorrect}")

Lasso successfully identified the relevant features! Notice how it set coefficients for features 5-19 (the noise) to exactly zero, removing them from the model. This automatic feature selection is one of Lasso's most powerful capabilities. Let's compare with Ridge:

# ========================================
# STEP 5: Compare with Ridge Regression
# ========================================
from sklearn.linear_model import RidgeCV

print("\n" + "="*70)
print("LASSO vs RIDGE COMPARISON")
print("="*70)

# Train Ridge for comparison
ridge_cv = RidgeCV(cv=5)
ridge_cv.fit(X_train_scaled, y_train)

ridge_train = ridge_cv.score(X_train_scaled, y_train)
ridge_test = ridge_cv.score(X_test_scaled, y_test)
ridge_zeros = np.sum(np.abs(ridge_cv.coef_) < 1e-10)  # Effectively zero

print("\nLasso (L1 Regularization):")
print(f"  Test R²: {test_score:.4f}")
print(f"  Features with zero coefficients: {n_zeroed}")
print(f"  Model complexity: {n_selected} features")
print(f"  Max |coefficient|: {np.max(np.abs(lasso_cv.coef_)):.4f}")

print("\nRidge (L2 Regularization):")
print(f"  Test R²: {ridge_test:.4f}")
print(f"  Features with zero coefficients: {ridge_zeros}")
print(f"  Model complexity: 20 features (keeps all!)")
print(f"  Max |coefficient|: {np.max(np.abs(ridge_cv.coef_)):.4f}")

print("\n" + "="*70)
print("KEY INSIGHTS:")
print("="*70)
print("1. Lasso creates SPARSE models by setting coefficients to exactly 0")
print("2. Ridge shrinks ALL coefficients but keeps every feature")
print("3. When many features are irrelevant, Lasso > Ridge")
print("4. Lasso is easier to interpret: fewer features to explain")
print("5. Both prevent overfitting, but Lasso also selects features!")

The comparison reveals the fundamental difference: Lasso performs feature selection (sparse model with fewer features) while Ridge keeps all features but shrinks their coefficients. When you suspect many features are irrelevant, Lasso is the better choice. When all features contribute something, Ridge might work better. You can also combine both approaches using Elastic Net!

Pro Tip: If Lasso and Ridge give similar test scores but Lasso uses far fewer features, prefer Lasso! Simpler models are easier to deploy, explain, and maintain. Occam's Razor applies to machine learning too.
Ridge vs Lasso Comparison
Aspect Ridge (L2) Lasso (L1)
Penalty Type Squared coefficients (L2 norm) Absolute coefficients (L1 norm)
Coefficient Behavior Shrinks toward zero, never exactly zero Can shrink to exactly zero
Feature Selection No - keeps all features Yes - removes irrelevant features
Correlated Features Shares weight among correlated features Arbitrarily picks one, zeros others
Best Use Case When all features are relevant When many features are irrelevant
Interpretability All features have weights Sparse model, easier to interpret

Practice Questions: Ridge & Lasso Regularization

Test your understanding of regularization techniques.

Question: What happens if alpha is set too high?

Show Solution
# Testing different alpha values
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Generate data
np.random.seed(42)
X = np.random.randn(100, 5)
y = 2*X[:, 0] + 3*X[:, 1] - X[:, 2] + np.random.randn(100) * 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Testing different alpha values:")
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha)
    model.fit(X_train_scaled, y_train)
    
    train_score = model.score(X_train_scaled, y_train)
    test_score = model.score(X_test_scaled, y_test)
    
    print(f"\nAlpha = {alpha:7.2f}:")
    print(f"  Train R²: {train_score:.4f}")
    print(f"  Test R²: {test_score:.4f}")
    print(f"  Max |coef|: {np.max(np.abs(model.coef_)):.4f}")
    
print("\nNotice: High alpha -> poor scores (underfitting)!")

Question: Why does Lasso produce exactly zero coefficients but Ridge does not?

Show Solution
# Comparing Ridge vs Lasso coefficient behavior
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
X = np.random.randn(100, 10)
# Only first 3 features are relevant
y = 2*X[:, 0] + 3*X[:, 1] - X[:, 2] + np.random.randn(100) * 0.1

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Ridge
ridge = Ridge(alpha=1.0).fit(X_scaled, y)
ridge_zeros = np.sum(np.abs(ridge.coef_) < 1e-10)

# Lasso
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_scaled, y)
lasso_zeros = np.sum(np.abs(lasso.coef_) < 1e-10)

print("Ridge Coefficients:")
print(f"  {ridge.coef_.round(3)}")
print(f"  Exactly zero: {ridge_zeros}")

print(f"\nLasso Coefficients:")
print(f"  {lasso.coef_.round(3)}")
print(f"  Exactly zero: {lasso_zeros}")

print("\nLasso zeros out irrelevant features, Ridge keeps all!")

Question: When would you prefer Ridge over Lasso?

Show Solution
# Scenario: All features are relevant
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
X = np.random.randn(100, 8)
# ALL features are relevant with varying importance
y = (2*X[:, 0] + 1.5*X[:, 1] + X[:, 2] + 0.8*X[:, 3] + 
     0.5*X[:, 4] + 0.3*X[:, 5] + 0.2*X[:, 6] + 0.1*X[:, 7] + 
     np.random.randn(100) * 0.2)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Compare Ridge vs Lasso
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1, max_iter=10000)

ridge_scores = cross_val_score(ridge, X_scaled, y, cv=5, scoring='r2')
lasso_scores = cross_val_score(lasso, X_scaled, y, cv=5, scoring='r2')

print("When ALL features are relevant:")
print(f"  Ridge R²: {ridge_scores.mean():.4f} (±{ridge_scores.std():.4f})")
print(f"  Lasso R²: {lasso_scores.mean():.4f} (±{lasso_scores.std():.4f})")
print("\nRidge typically performs better here by keeping all features!")

Question: How does Lasso behave with correlated features?

Show Solution
# Lasso with correlated features
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 100

# Create highly correlated features
X1 = np.random.randn(n)
X2 = X1 + np.random.randn(n) * 0.1  # Highly correlated with X1
X3 = np.random.randn(n)  # Independent

X = np.column_stack([X1, X2, X3])
y = 2*X1 + 2*X2 + X3 + np.random.randn(n) * 0.1  # Both X1 and X2 matter!

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train models
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_scaled, y)
ridge = Ridge(alpha=1.0).fit(X_scaled, y)

print("True relationship: y = 2*X1 + 2*X2 + X3")
print(f"\nCorrelation X1-X2: {np.corrcoef(X1, X2)[0,1]:.3f}")
print(f"\nLasso Coefficients: {lasso.coef_.round(2)}")
print("  Lasso tends to concentrate weight on one of the correlated pair!")
print(f"\nRidge Coefficients: {ridge.coef_.round(2)}")
print(f"  Ridge distributes weight between X1 and X2!")
04

Elastic Net

Elastic Net combines L1 and L2 regularization, offering the feature selection of Lasso with the stability of Ridge for correlated features.

Pro tip: When should you use Elastic Net? If you have many correlated features (like different measurements of the same thing), Lasso might arbitrarily pick one and drop the others. Ridge keeps them all but doesn't simplify. Elastic Net gets the best of both - it can select feature groups while maintaining stability. It's the Swiss Army knife of regularization!
Understanding Elastic Net

Elastic Net addresses the limitations of using either Ridge or Lasso alone by combining both penalties. The combined penalty allows for feature selection (from L1) while handling correlated features better (from L2).

Elastic Net Penalty: alpha * (l1_ratio * Sum(|bi|) + 0.5 * (1 - l1_ratio) * Sum(bi^2))

The l1_ratio parameter controls the mix of L1 and L2:

l1_ratio = 0

Pure Ridge (L2 only)

l1_ratio = 0.5

Equal mix of L1 and L2

l1_ratio = 1

Pure Lasso (L1 only)
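Both endpoints can be checked numerically on synthetic data. One subtlety: scikit-learn scales the ElasticNet and Ridge objectives differently, so matching the L2 end requires multiplying Ridge's alpha by the number of samples, while the L1 end matches Lasso directly:

```python
# Sanity-checking the l1_ratio endpoints
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

np.random.seed(42)
X = np.random.randn(100, 5)
y = 2*X[:, 0] - X[:, 1] + np.random.randn(100) * 0.3

# l1_ratio = 1: the penalty is pure L1, identical to Lasso
enet_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0, max_iter=10000).fit(X, y)
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y)
diff_lasso = np.max(np.abs(enet_l1.coef_ - lasso.coef_))
print(f"Max coefficient difference vs Lasso: {diff_lasso:.2e}")

# l1_ratio = 0: the penalty is pure L2; equivalent to Ridge once
# alpha is rescaled by n_samples (a scikit-learn convention)
enet_l2 = ElasticNet(alpha=0.1, l1_ratio=0.0, max_iter=100000, tol=1e-10).fit(X, y)
ridge = Ridge(alpha=0.1 * len(y)).fit(X, y)
diff_ridge = np.max(np.abs(enet_l2.coef_ - ridge.coef_))
print(f"Max coefficient difference vs Ridge: {diff_ridge:.2e}")
```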

When to Use Elastic Net
Elastic Net Excels When:
  • You have many features, some correlated
  • You want feature selection but need stability
  • Lasso keeps selecting different features on different runs
  • You have more features than samples (p > n)
  • Groups of correlated features should be selected together
Trade-offs:
  • Two hyperparameters to tune (alpha and l1_ratio)
  • Slightly more complex to configure
  • May be slower than pure Ridge or Lasso
  • Need cross-validation to find optimal mix
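Both hyperparameters can be tuned together with an ordinary grid search wrapped around a scaling pipeline. The grid values below are illustrative; ElasticNetCV, covered later in this section, is a faster purpose-built alternative:

```python
# Tuning alpha and l1_ratio together with GridSearchCV
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),        # scale before regularizing
    ('enet', ElasticNet(max_iter=10000)),
])
param_grid = {
    'enet__alpha': [0.01, 0.1, 1.0],
    'enet__l1_ratio': [0.2, 0.5, 0.8],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring='r2')
search.fit(X, y)

print(f"Best params: {search.best_params_}")
print(f"Best CV R²: {search.best_score_:.4f}")
```

Putting the scaler inside the pipeline ensures each CV fold is scaled using only its own training split, avoiding leakage.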
Basic Elastic Net Implementation

Elastic Net combines L1 and L2 penalties to get the best of both worlds: feature selection from Lasso plus stability with correlated features from Ridge. Let's create data where features are correlated to see how different l1_ratio values behave:

# ========================================
# STEP 1: Create data with correlated features
# ========================================
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

np.random.seed(42)
n_samples = 300

# Create 3 base features
base = np.random.randn(n_samples, 3)

# Create 3 CORRELATED copies (adding small noise)
# These are highly correlated with base features!
correlated = base + np.random.randn(n_samples, 3) * 0.1

# Add 4 completely IRRELEVANT noise features
noise = np.random.randn(n_samples, 4)

# Combine: 10 features total
X = np.hstack([base, correlated, noise])

print("Dataset Structure:")
print(f"  Total features: {X.shape[1]}")
print(f"  Base features (0-2): The originals")
print(f"  Correlated features (3-5): Copies of 0-2 with noise")
print(f"  Noise features (6-9): Completely irrelevant")

# Check correlation between feature 0 and feature 3
corr = np.corrcoef(X[:, 0], X[:, 3])[0, 1]
print(f"\nCorrelation between feature 0 and 3: {corr:.3f}")
print(f"This high correlation will challenge Lasso!")

We've created a scenario where features 0-2 and 3-5 are highly correlated (correlation ≈ 0.99). Lasso struggles here because it arbitrarily picks one from each correlated group. Let's create the target:

# ========================================
# STEP 2: Create target based on base features
# ========================================
# Target depends ONLY on the 3 base features (0-2)
y = (2*base[:, 0] -           # Feature 0 matters
     base[:, 1] +             # Feature 1 matters (negative)
     0.5*base[:, 2] +         # Feature 2 matters (small effect)
     np.random.randn(n_samples)*0.3)  # Small noise

print("\nTrue Relationship:")
print("  y = 2*f0 - 1*f1 + 0.5*f2 + noise")
print("  Features 3-5: Correlated with 0-2, so also predictive!")
print("  Features 6-9: Pure noise, should be eliminated")
print("\nChallenge: Should model use base (0-2) or correlated (3-5)?")

The target depends on the base features, but since features 3-5 are highly correlated with 0-2, they're also predictive! A good model should recognize this correlation. Let's prepare the data:

# ========================================
# STEP 3: Split and scale the data
# ========================================
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42
)

print(f"\nData Split:")
print(f"  Training: {len(X_train)} samples")
print(f"  Testing: {len(X_test)} samples")

# Scale features (critical for regularization!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\nFeatures scaled for fair regularization")

Now let's test Elastic Net with different l1_ratio values to see how the L1/L2 mix affects feature selection:

# ========================================
# STEP 4: Test different l1_ratio values
# ========================================
print("\n" + "="*70)
print("TESTING DIFFERENT L1/L2 MIXES")
print("="*70)

l1_ratios = [0.1, 0.5, 0.9, 1.0]  # From mostly L2 to pure L1
results = []

for l1_ratio in l1_ratios:
    # Train Elastic Net
    enet = ElasticNet(
        alpha=0.1,         # Overall regularization strength
        l1_ratio=l1_ratio, # Mix of L1 vs L2
        random_state=42,
        max_iter=10000
    )
    enet.fit(X_train_scaled, y_train)
    
    # Evaluate
    train_score = enet.score(X_train_scaled, y_train)
    test_score = enet.score(X_test_scaled, y_test)
    n_nonzero = np.sum(enet.coef_ != 0)
    selected = np.where(enet.coef_ != 0)[0]
    
    print(f"\nl1_ratio = {l1_ratio} ({round(l1_ratio*100)}% L1, {round((1-l1_ratio)*100)}% L2):")
    print(f"  Test R²: {test_score:.4f}")
    print(f"  Non-zero coefficients: {n_nonzero} / 10")
    print(f"  Selected features: {selected.tolist()}")
    
    # Check which types of features were selected
    base_selected = np.sum((selected >= 0) & (selected <= 2))
    corr_selected = np.sum((selected >= 3) & (selected <= 5))
    noise_selected = np.sum((selected >= 6) & (selected <= 9))
    
    print(f"    Base features (0-2): {base_selected}")
    print(f"    Correlated features (3-5): {corr_selected}")
    print(f"    Noise features (6-9): {noise_selected}")
    
    if l1_ratio == 0.1:
        print("    -> Mostly Ridge: Keeps many features, distributes weight")
    elif l1_ratio == 0.5:
        print("    -> Balanced: Some feature selection, handles correlation")
    elif l1_ratio == 0.9:
        print("    -> Mostly Lasso: Aggressive feature selection")
    else:
        print("    -> Pure Lasso: Most aggressive selection")
    
    results.append({
        'l1_ratio': l1_ratio,
        'test_score': test_score,
        'n_features': n_nonzero,
        'selected': selected
    })

Notice the pattern! As l1_ratio increases (more L1, less L2), the model selects fewer features. Low l1_ratio (more Ridge) keeps correlated features together, while high l1_ratio (more Lasso) picks arbitrarily from correlated groups. Let's analyze the results:

# ========================================
# STEP 5: Analyze and compare results
# ========================================
print("\n" + "="*70)
print("ANALYSIS")
print("="*70)

# Find best by test score
best_idx = max(range(len(results)), key=lambda i: results[i]['test_score'])
best = results[best_idx]

print(f"\nBest l1_ratio: {best['l1_ratio']}")
print(f"  Test R²: {best['test_score']:.4f}")
print(f"  Features used: {best['n_features']}")

# Compare extremes
low_l1 = results[0]   # l1_ratio = 0.1 (mostly Ridge)
high_l1 = results[-1]  # l1_ratio = 1.0 (pure Lasso)

print(f"\nLow L1 (0.1 - mostly Ridge):")
print(f"  Features: {low_l1['n_features']} (keeps more features)")
print(f"  Test R²: {low_l1['test_score']:.4f}")

print(f"\nHigh L1 (1.0 - pure Lasso):")
print(f"  Features: {high_l1['n_features']} (aggressive selection)")
print(f"  Test R²: {high_l1['test_score']:.4f}")

print("\n" + "="*70)
print("KEY INSIGHTS:")
print("="*70)
print("1. Low l1_ratio (Ridge-like): Keeps correlated features together")
print("2. High l1_ratio (Lasso-like): Picks one from correlated group")
print("3. Middle l1_ratio: Balances feature selection and stability")
print("4. All versions eliminate noise features (6-9)!")
print("5. Best l1_ratio depends on your goal: sparsity vs stability")
print(f"\nFor this data: l1_ratio = {best['l1_ratio']} gave the best test score!")

Elastic Net successfully handles the correlated features better than pure Lasso would. The l1_ratio parameter lets you tune the balance between feature selection (L1) and grouping effect (L2). When l1_ratio is low (0.1-0.3), the model behaves more like Ridge, keeping correlated features. When high (0.8-1.0), it acts like Lasso, aggressively eliminating features.

ElasticNetCV - Automatic Tuning

ElasticNetCV performs cross-validation over both alpha and l1_ratio to find the optimal combination. This saves you from manually testing every combination. Let's see it in action:

# ========================================
# STEP 1: Import and prepare data
# ========================================
from sklearn.linear_model import ElasticNetCV, RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

# Using data from previous example (X_train, X_test, y_train, y_test)
print("Using dataset with:")
print("  10 features (3 base + 3 correlated + 4 noise)")
print("  300 samples total")
print("  Correlated features create challenge for pure Lasso")

Instead of manually trying different combinations, ElasticNetCV will test all provided alpha and l1_ratio values using cross-validation. Let's define the search grid:

# ========================================
# STEP 2: Define hyperparameter search space
# ========================================
# Define l1_ratios to test
l1_ratios = [0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99]
print(f"\nTesting {len(l1_ratios)} l1_ratio values:")
for ratio in l1_ratios:
    l1_pct = ratio * 100
    l2_pct = (1 - ratio) * 100
    print(f"  {ratio}: {l1_pct:.0f}% L1 + {l2_pct:.0f}% L2")

# Define alphas to test
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
print(f"\nTesting {len(alphas)} alpha values:")
for alpha in alphas:
    print(f"  {alpha}: {'weak' if alpha < 0.1 else 'moderate' if alpha < 1 else 'strong'} regularization")

print(f"\nTotal combinations to test: {len(l1_ratios)} × {len(alphas)} = {len(l1_ratios) * len(alphas)}")
print(f"With 5-fold CV: {len(l1_ratios) * len(alphas) * 5} model fits!")

We're testing 7 l1_ratio values × 5 alpha values = 35 combinations. With 5-fold CV, that's 175 model fits! ElasticNetCV handles all this automatically. Let's create a pipeline:

# ========================================
# STEP 3: Create pipeline with ElasticNetCV
# ========================================
print("\nCreating pipeline with scaling + ElasticNetCV...")

pipeline = Pipeline([
    ('scaler', StandardScaler()),          # Scale features first
    ('elasticnet', ElasticNetCV(
        l1_ratio=l1_ratios,                # L1/L2 mix to test
        alphas=alphas,                     # Regularization strengths to test
        cv=5,                              # 5-fold cross-validation
        random_state=42,
        max_iter=10000,                    # Ensure convergence
        n_jobs=-1                          # Use all CPU cores
    ))
])

print("Pipeline steps:")
print("  1. StandardScaler: Scale features to mean=0, std=1")
print("  2. ElasticNetCV: Find best alpha and l1_ratio")

The pipeline ensures features are always scaled before regularization. The n_jobs=-1 parameter uses all CPU cores for parallel cross-validation, speeding up the search. Let's train:

# ========================================
# STEP 4: Fit pipeline and find optimal parameters
# ========================================
print("\nTraining... (testing 35 combinations with 5-fold CV)")
print("This may take a moment...")

pipeline.fit(X_train, y_train)

# Access the fitted ElasticNetCV model
enet = pipeline.named_steps['elasticnet']

print("\n" + "="*70)
print("OPTIMAL HYPERPARAMETERS FOUND")
print("="*70)
print(f"\nBest alpha: {enet.alpha_:.6f}")
print(f"  -> Optimal regularization strength")

print(f"\nBest l1_ratio: {enet.l1_ratio_:.3f}")
l1_pct = enet.l1_ratio_ * 100
l2_pct = (1 - enet.l1_ratio_) * 100
print(f"  -> {l1_pct:.1f}% L1 (Lasso) + {l2_pct:.1f}% L2 (Ridge)")

if enet.l1_ratio_ < 0.3:
    print("  -> More Ridge-like: Keeps correlated features")
elif enet.l1_ratio_ > 0.7:
    print("  -> More Lasso-like: Aggressive feature selection")
else:
    print("  -> Balanced: Good mix of selection and stability")

ElasticNetCV found the best combination through exhaustive cross-validation testing. Now let's evaluate the final model and examine which features it selected:

# ========================================
# STEP 5: Evaluate model and analyze features
# ========================================
# Get scores
train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)

print("\nModel Performance:")
print(f"  Training R²: {train_score:.4f}")
print(f"  Test R²: {test_score:.4f}")
print(f"  Gap: {train_score - test_score:.4f}")

if train_score - test_score < 0.05:
    print("  -> Small gap = excellent generalization!")

# Analyze feature selection
coefs = enet.coef_
n_nonzero = np.sum(coefs != 0)
selected_features = np.where(coefs != 0)[0]

print(f"\nFeature Selection:")
print(f"  Non-zero coefficients: {n_nonzero} / 10")
print(f"  Selected features: {selected_features.tolist()}")

print("\nSelected Coefficients:")
print("-" * 70)
for idx in selected_features:
    coef = coefs[idx]
    if idx <= 2:
        feat_type = "BASE"
    elif idx <= 5:
        feat_type = "CORR"
    else:
        feat_type = "NOISE"
    
    print(f"  Feature {idx}: {coef:>8.4f}  [{feat_type}]")

# Count by type
base_count = np.sum((selected_features >= 0) & (selected_features <= 2))
corr_count = np.sum((selected_features >= 3) & (selected_features <= 5))
noise_count = np.sum((selected_features >= 6) & (selected_features <= 9))

print(f"\nFeature Type Breakdown:")
print(f"  Base features (0-2): {base_count} selected")
print(f"  Correlated features (3-5): {corr_count} selected")
print(f"  Noise features (6-9): {noise_count} selected")

Perfect! ElasticNetCV identified the relevant features while handling the correlation intelligently. Let's compare with pure Ridge and pure Lasso:

# ========================================
# STEP 6: Compare with Ridge and Lasso
# ========================================
print("\n" + "="*70)
print("COMPARISON: ElasticNet vs Ridge vs Lasso")
print("="*70)

# Train Ridge
ridge = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', RidgeCV(cv=5))
])
ridge.fit(X_train, y_train)
ridge_model = ridge.named_steps['ridge']

# Train Lasso
lasso = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', LassoCV(cv=5, random_state=42, max_iter=10000))
])
lasso.fit(X_train, y_train)
lasso_model = lasso.named_steps['lasso']

# Compare
print("\nRidge (L2 only):")
print(f"  Test R²: {ridge.score(X_test, y_test):.4f}")
print(f"  Non-zero features: {np.sum(ridge_model.coef_ != 0)} / 10 (keeps all!)")
print(f"  Max |coefficient|: {np.max(np.abs(ridge_model.coef_)):.4f}")

print("\nLasso (L1 only):")
print(f"  Test R²: {lasso.score(X_test, y_test):.4f}")
print(f"  Non-zero features: {np.sum(lasso_model.coef_ != 0)} / 10")
print(f"  Selected features: {np.where(lasso_model.coef_ != 0)[0].tolist()}")

print(f"\nElastic Net (L1 + L2, ratio={enet.l1_ratio_:.2f}):")
print(f"  Test R²: {test_score:.4f}")
print(f"  Non-zero features: {n_nonzero} / 10")
print(f"  Selected features: {selected_features.tolist()}")

print("\n" + "="*70)
print("WINNER: ElasticNet" if test_score >= max(ridge.score(X_test, y_test), 
                                                   lasso.score(X_test, y_test)) else "")
print("="*70)
print("Elastic Net advantages:")
print("  ✓ Handles correlated features better than Lasso")
print("  ✓ Performs feature selection unlike Ridge")
print("  ✓ Automatically finds optimal L1/L2 mix")
print("  ✓ More stable coefficient estimates than Lasso alone")

ElasticNetCV successfully combines the strengths of both methods! It eliminates noise features like Lasso, but handles correlated features more gracefully by using the L2 penalty to group them. The automatic hyperparameter tuning via cross-validation ensures we get the optimal balance without manual trial and error.

Best Practice: When you're unsure whether to use Ridge or Lasso, start with ElasticNetCV! It will automatically find the optimal mix of L1 and L2, giving you the best of both worlds. The cross-validation ensures robust hyperparameter selection.
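A minimal sketch of this advice on synthetic data (variable names are ours): if you omit the `alphas` argument, ElasticNetCV generates its own alpha grid automatically, so you only need to supply candidate l1_ratio values:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(150, 8)
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.randn(150) * 0.5

# No `alphas` given: ElasticNetCV builds its own default grid of alphas
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, max_iter=10000)
)
model.fit(X, y)

enet = model.named_steps['elasticnetcv']
print(f"Chosen alpha: {enet.alpha_:.4f}, l1_ratio: {enet.l1_ratio_}")
print(f"R² on training data: {model.score(X, y):.3f}")
```

With sensible defaults like these, ElasticNetCV is a reasonable first model whenever you are unsure which penalty to use.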
Choosing the Right Regularization
Scenario                                             | Recommended Method | Reason
All features are useful, no feature selection needed | Ridge              | Fastest, keeps all features
Many irrelevant features, want sparse model          | Lasso              | Feature selection built-in
Correlated features, want some feature selection     | Elastic Net        | Best of both worlds
More features than samples (p > n)                   | Elastic Net        | Lasso can select at most n features
Need interpretable model with few features           | Lasso              | Creates sparse model
Unsure which method is best                          | ElasticNetCV       | Let cross-validation decide

Practice Questions: Elastic Net

Test your understanding of Elastic Net regularization.

Question: What l1_ratio value makes Elastic Net equivalent to Ridge?

Solution:
# Demonstrating l1_ratio effects
import numpy as np
from sklearn.linear_model import ElasticNet, Ridge, Lasso
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
X = np.random.randn(100, 5)
y = 2*X[:, 0] + 3*X[:, 1] + np.random.randn(100) * 0.5

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Compare. Note: sklearn scales penalties differently. ElasticNet minimizes
#   (1/2n)*||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2
# while Ridge minimizes ||y - Xw||^2 + alpha*||w||^2, so ElasticNet with
# l1_ratio=0 and alpha=a corresponds to Ridge with alpha = a * n_samples.
# Lasso(alpha=a) matches ElasticNet(alpha=a, l1_ratio=1) exactly.
# (sklearn warns that l1_ratio=0 is better handled by Ridge directly.)
alpha = 1.0
n_samples = len(y)
ridge = Ridge(alpha=alpha * n_samples).fit(X_scaled, y)
lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_scaled, y)
enet_l2 = ElasticNet(alpha=alpha, l1_ratio=0.0, max_iter=10000).fit(X_scaled, y)
enet_l1 = ElasticNet(alpha=alpha, l1_ratio=1.0, max_iter=10000).fit(X_scaled, y)

print("Coefficients Comparison:")
print(f"\nRidge (alpha rescaled): {ridge.coef_.round(3)}")
print(f"ElasticNet(l1_ratio=0): {enet_l2.coef_.round(3)}")
print(f"\nLasso:                  {lasso.coef_.round(3)}")
print(f"ElasticNet(l1_ratio=1): {enet_l1.coef_.round(3)}")
print("\nl1_ratio=0 reduces Elastic Net to a pure L2 (Ridge-style) penalty;")
print("l1_ratio=1 reduces it to a pure L1 (Lasso) penalty.")

Question: Why might Elastic Net select more features than Lasso?

Solution:
# Elastic Net with correlated features
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 100

# Create 3 groups of correlated features
group1 = np.random.randn(n, 1)
group1_corr = group1 + np.random.randn(n, 2) * 0.1

group2 = np.random.randn(n, 1)
group2_corr = group2 + np.random.randn(n, 2) * 0.1

X = np.hstack([group1, group1_corr, group2, group2_corr])
y = 2*group1.ravel() + 3*group2.ravel() + np.random.randn(n) * 0.5

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Compare
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_scaled, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000).fit(X_scaled, y)

lasso_selected = np.sum(np.abs(lasso.coef_) > 0.01)
enet_selected = np.sum(np.abs(enet.coef_) > 0.01)

print(f"Features selected:")
print(f"  Lasso: {lasso_selected} features")
print(f"  Elastic Net: {enet_selected} features")
print(f"\nLasso coefficients: {lasso.coef_.round(2)}")
print(f"ElasticNet coefficients: {enet.coef_.round(2)}")
print("\nElastic Net keeps correlated feature groups!")

Question: How does ElasticNetCV choose optimal hyperparameters?

Solution:
# Using ElasticNetCV for automatic tuning
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
X = np.random.randn(200, 10)
y = 2*X[:, 0] + X[:, 2] - X[:, 4] + np.random.randn(200) * 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define search space
alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
l1_ratios = [0.1, 0.5, 0.7, 0.9, 0.95, 0.99]

# ElasticNetCV tries all combinations
enet_cv = ElasticNetCV(
    alphas=alphas,
    l1_ratio=l1_ratios,
    cv=5,
    max_iter=10000,
    n_jobs=-1
)

enet_cv.fit(X_train_scaled, y_train)

print("ElasticNetCV Results:")
print(f"  Best alpha: {enet_cv.alpha_:.4f}")
print(f"  Best l1_ratio: {enet_cv.l1_ratio_:.2f}")
print(f"  Test R²: {enet_cv.score(X_test_scaled, y_test):.4f}")
print(f"\nTested {len(alphas) * len(l1_ratios)} combinations with 5-fold CV!")

Question: What is the grouping effect in Elastic Net?

Solution:
# Demonstrating grouping effect
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 100

# Create a correlated group of 4 features
base_feature = np.random.randn(n, 1)
correlated_group = base_feature + np.random.randn(n, 4) * 0.1

# Add 3 independent features
independent = np.random.randn(n, 3)

X = np.hstack([correlated_group, independent])
# Target depends on the correlated group
y = 2*base_feature.ravel() + np.random.randn(n) * 0.5

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_scaled, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000).fit(X_scaled, y)

print("Correlated features (0-3) vs Independent (4-6):")
print(f"\nLasso coefficients:")
print(f"  Group: {lasso.coef_[:4].round(2)}")
print(f"  Independent: {lasso.coef_[4:].round(2)}")
print(f"\nElasticNet coefficients:")
print(f"  Group: {enet.coef_[:4].round(2)}")
print(f"  Independent: {enet.coef_[4:].round(2)}")
print("\nElastic Net assigns similar weights to correlated features!")
Interactive Demo: Regularization Comparison

The snapshot below shows typical fitted coefficients at a moderate regularization strength (alpha = 0.1):
Ridge (L2)
  • Coef 1: 0.85
  • Coef 2: 0.72
  • Coef 3: 0.15
  • Coef 4: 0.08
Lasso (L1)
  • Coef 1: 0.92
  • Coef 2: 0.68
  • Coef 3: 0.00
  • Coef 4: 0.00
Observe: As alpha increases, Ridge shrinks all coefficients proportionally while Lasso forces small coefficients to exactly zero. This demonstrates how Lasso performs automatic feature selection.
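The same pattern can be reproduced in code. This sketch (synthetic data, illustrative alpha values) sweeps the regularization strength and prints the fitted coefficients for both penalties:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(42)
X = rng.randn(200, 4)
# Two strong features (1.0, 0.8) and two weak ones (0.1, 0.05)
y = (1.0 * X[:, 0] + 0.8 * X[:, 1] + 0.1 * X[:, 2] + 0.05 * X[:, 3]
     + rng.randn(200) * 0.3)

for alpha in [0.01, 0.1, 0.5]:
    r = Ridge(alpha=alpha).fit(X, y)
    l = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    print(f"alpha={alpha}:")
    print(f"  Ridge: {r.coef_.round(3)}  ({np.sum(r.coef_ != 0)} non-zero)")
    print(f"  Lasso: {l.coef_.round(3)}  ({np.sum(l.coef_ != 0)} non-zero)")
```

As alpha grows, the Ridge coefficients shrink but stay non-zero, while Lasso drives the weak coefficients to exactly zero first.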

Key Takeaways

Linear Regression

Fits a straight line or hyperplane using OLS to minimize squared residuals between predictions and actual values.

Polynomial Regression

Captures curves and non-linear patterns by transforming features into polynomial terms before applying linear regression.
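A minimal sketch of this idea (synthetic quadratic data): expand the input with PolynomialFeatures, then fit an ordinary LinearRegression on the expanded features.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * X.ravel() ** 2 - X.ravel() + rng.randn(100) * 0.3  # quadratic truth

# Degree-2 expansion turns [x] into [1, x, x^2]; the model remains linear
# in these transformed features, which is why it still counts as linear regression.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(f"R² on training data: {model.score(X, y):.3f}")
```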

Ridge Regression (L2)

Adds squared coefficient penalty to shrink weights toward zero while keeping all features in the model.
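For intuition, Ridge has a closed-form solution. This sketch (no intercept, synthetic data) checks that solving (XᵀX + alpha·I)w = Xᵀy by hand matches sklearn's Ridge:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + rng.randn(50) * 0.1

alpha = 1.0
# Closed-form Ridge solution: w = (X^T X + alpha*I)^(-1) X^T y
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
print("Closed form:", w_closed.round(4))
print("sklearn:    ", ridge.coef_.round(4))
print("Match:", np.allclose(w_closed, ridge.coef_))
```

Adding alpha to the diagonal makes the system better conditioned, which is why Ridge remains stable even with correlated features.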

Lasso Regression (L1)

Uses absolute coefficient penalty that can shrink weights to exactly zero, performing automatic feature selection.

Elastic Net

Combines L1 and L2 penalties with a mixing parameter to balance feature selection and coefficient shrinkage.

Hyperparameter Tuning

Use cross-validation to find optimal alpha (regularization strength) and l1_ratio (L1/L2 balance) values.

Knowledge Check

Quick Quiz

Test what you've learned about regression algorithms

1 What does the ordinary least squares method minimize?
2 Which regularization technique performs automatic feature selection?
3 Why is polynomial regression still considered a linear model?
4 What does the alpha parameter control in regularized regression?
5 What l1_ratio value makes Elastic Net behave like pure Ridge?
6 Why is feature scaling important before applying regularization?
Answer all questions to check your score