Understanding Linear Regression
Linear regression is the foundation of machine learning - it's where your journey into predictive modeling begins. At its core, linear regression answers a fundamental question: how can we predict a continuous value based on one or more input features? Imagine drawing the best-fit line through scattered data points - that's linear regression in action.
What is Linear Regression?
Linear regression is a supervised learning algorithm that models the relationship between input features and a continuous target variable by fitting a linear equation to observed data.
Think of it like this: Imagine plotting points on a graph and trying to draw the "best-fit" straight line through them. Linear regression does exactly this mathematically! It finds the line (or hyperplane in multiple dimensions) that comes closest to all your data points, allowing you to make predictions for new inputs.
Why "Linear"? Because the relationship is represented as a straight line equation (like y = mx + b from algebra). The predictions follow a linear pattern - if you double the input, the predicted output doesn't necessarily double, but it changes in a consistent, proportional way.
Real-world analogy: It's like predicting how much you'll earn based on years of experience. If you plot salary vs. experience for many people, you'll see a general upward trend. Linear regression finds the line that best captures this trend, letting you estimate salary for any experience level.
Real-World Applications
Linear regression powers countless real-world applications. Predicting house prices from square footage, estimating sales revenue from advertising spend, determining insurance premiums, and predicting student test scores are all tasks where linear regression shines; it even serves as a simple baseline for harder problems like stock-price forecasting.
Housing Prices
Predict home values based on size, location, bedrooms, age, and neighborhood features.
Sales Forecasting
Estimate future sales based on advertising budget, season, and market trends.
The Linear Equation
The simplest form of linear regression uses one input feature (x) to predict one output (y). The equation is y = mx + b, whose components are:
Slope (m)
How steep the line is - the rate of change between x and y
Intercept (b)
Where the line crosses the y-axis (value when x=0)
Target (y)
The predicted output (dependent variable)
Feature (x)
The input feature (independent variable)
Simple Example in Python
import numpy as np
import matplotlib.pyplot as plt
# Sample data: hours studied vs test score
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8])
scores = np.array([50, 55, 60, 65, 70, 75, 80, 85])
# Calculate slope (m) and intercept (b)
m = np.cov(hours, scores)[0, 1] / np.var(hours, ddof=1) # 5.0 (ddof=1 matches np.cov's default normalization)
b = np.mean(scores) - m * np.mean(hours) # 45.0
# Make prediction
new_hours = 9
predicted_score = m * new_hours + b # 90.0
print(f"Predicted score: {predicted_score}")
Regression vs Classification
| Aspect | Regression | Classification |
|---|---|---|
| Output Type | Continuous (numbers) | Discrete (categories) |
| Examples | Price, temperature, age | Spam/not spam, cat/dog |
| Algorithms | Linear, polynomial, ridge | Logistic, SVM, decision tree |
| Evaluation | MSE, MAE, R-squared | Accuracy, F1-score, AUC |
Practice Questions
Problem: Given a car's weight in pounds, predict its miles per gallon (MPG). Use the equation: MPG = 50 - 0.006 × weight. What's the MPG for a 3000-pound car?
Show Solution
# Given equation: MPG = 50 - 0.006 * weight
weight = 3000
# Calculate MPG
mpg = 50 - 0.006 * weight # 32.0
print(f"A {weight} lb car gets {mpg} MPG")
# Output: A 3000 lb car gets 32.0 MPG
Explanation: This is a simple linear equation where weight is the input feature, and MPG is the output. The negative slope (-0.006) indicates that heavier cars get worse mileage.
Problem: Given two points on a line: (2, 5) and (6, 13), calculate the slope and intercept of the line. Then predict y when x = 10.
Show Solution
# Two points: (x1, y1) = (2, 5) and (x2, y2) = (6, 13)
x1, y1 = 2, 5
x2, y2 = 6, 13
# Calculate slope: m = (y2 - y1) / (x2 - x1)
slope = (y2 - y1) / (x2 - x1) # 2.0
# Calculate intercept: b = y - mx
intercept = y1 - slope * x1 # 1.0
# Predict for x = 10
x_new = 10
y_pred = slope * x_new + intercept # 21.0
print(f"Slope: {slope}, Intercept: {intercept}")
print(f"When x={x_new}, y={y_pred}")
Explanation: The slope formula measures the rate of change. The intercept is calculated by plugging one point into y = mx + b and solving for b.
Problem: Create a function that calculates the best-fit line for a dataset using the least squares method. Use it to fit a line to: x = [1, 2, 3, 4, 5] and y = [2, 4, 5, 4, 5].
Show Solution
import numpy as np
def fit_linear_regression(x, y):
    """Calculate slope and intercept using least squares"""
    # Calculate means
    x_mean = np.mean(x)
    y_mean = np.mean(y)
    # Calculate slope
    numerator = np.sum((x - x_mean) * (y - y_mean))
    denominator = np.sum((x - x_mean) ** 2)
    slope = numerator / denominator
    # Calculate intercept
    intercept = y_mean - slope * x_mean
    return slope, intercept
# Test data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
# Fit the model
slope, intercept = fit_linear_regression(x, y) # 0.6, 2.2
print(f"Best-fit line: y = {slope}x + {intercept}")
print(f"Prediction for x=6: {slope * 6 + intercept}") # 5.8
Explanation: This implements the ordinary least squares method, which minimizes the sum of squared differences between actual and predicted values. The formulas calculate the optimal slope and intercept.
The Mathematics
Understanding the math behind linear regression empowers you to debug models, tune hyperparameters, and extend the algorithm to solve new problems. Don't worry - we'll build the intuition step by step before diving into formulas.
The Cost Function (MSE)
How do we measure how "good" our line is? We use a cost function, specifically Mean Squared Error (MSE). It calculates the average squared difference between predicted values and actual values. The goal of training is to minimize this cost.
Mean Squared Error

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
What does this mean? MSE measures how "wrong" our predictions are on average. Here's the breakdown:
- $(y_i - \hat{y}_i)$: The difference between actual and predicted values (this is the "error")
- Squared $(...)^2$: We square each error to make all values positive (so positive and negative errors don't cancel out) and to penalize larger errors more heavily
- Sum $\sum$: Add up all the squared errors from every data point
- Average $\frac{1}{n}$: Divide by the number of samples to get the average squared error
Interpretation: Lower MSE = better model! An MSE of 0 means perfect predictions (rare in practice). The MSE value is in "squared units" - if predicting prices in dollars, MSE is in "dollars squared".
Step 1: Define the MSE Function
import numpy as np
def mean_squared_error(y_true, y_pred):
    """Calculate Mean Squared Error"""
    return np.mean((y_true - y_pred) ** 2)
(y_true - y_pred) ** 2 computes the squared difference between actual and predicted values - squaring ensures all errors are positive and penalizes larger errors more heavily. Then np.mean() averages all these squared errors to give us a single performance metric.
Key insight: MSE is in "squared units" (if predicting prices in thousands, MSE is in "thousands squared"), which is why we often take the square root (RMSE) for easier interpretation.
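To illustrate that square-root step, a minimal standalone sketch (re-computing MSE inline rather than reusing the helper above):

```python
import numpy as np

# Actual vs predicted prices, in thousands of dollars
y_true = np.array([200, 250, 300])
y_pred = np.array([210, 240, 310])

mse = np.mean((y_true - y_pred) ** 2)  # in (thousands of dollars) squared
rmse = np.sqrt(mse)                    # back in thousands of dollars

print(f"MSE:  {mse:.1f} (squared units)")   # 100.0
print(f"RMSE: {rmse:.1f} (target units)")   # 10.0 -> "off by about $10k"
```

RMSE reads directly as "typical prediction error in the target's own units", which is why it is often reported alongside MSE.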
Step 2: Prepare Example Data
# Example: actual vs predicted house prices (in thousands)
actual = np.array([200, 250, 300, 350, 400])
predicted = np.array([210, 240, 310, 340, 390])
actual contains the true house prices from the market, and predicted contains what our machine learning model predicted. Notice the predictions aren't perfect: first house is actually $200k but predicted $210k (off by $10k).
Real-world context: This mismatch is normal! No model is perfect. The goal is to minimize these differences, which is exactly what MSE measures.
Step 3: Calculate and Display MSE
mse = mean_squared_error(actual, predicted) # 100.0
print(f"MSE: {mse}k squared") # Lower is better!
Performance guide: MSE alone isn't meaningful - you need context. For house prices, an MSE of 100 (each prediction off by about $10k) might be excellent for $200k-$400k homes, but terrible for $50k apartments. Always compare MSE to your target value range. Lower MSE = Better model!
Gradient Descent
Gradient descent is an iterative optimization algorithm that systematically finds the optimal parameters (slope and intercept) for our linear regression model. Imagine you're blindfolded on a hillside and need to reach the lowest point (valley). You can't see where the valley is, but you can feel the slope beneath your feet. You take small steps in the direction that feels most downhill, and with each step, you get closer to the bottom. That's exactly how gradient descent works!
The algorithm starts with random initial values for the slope (m) and intercept (b) - essentially making a random guess at the line. It then calculates the error (how wrong the predictions are) using the cost function. Next, it computes the "gradient" - a mathematical measure that tells us which direction and how steeply the error is increasing. By moving in the opposite direction of the gradient (downhill), we reduce the error. The algorithm repeats this process hundreds or thousands of times, taking small steps each time, until it converges to the optimal parameters where the error is minimized. The size of each step is controlled by the "learning rate" - too large and you might overshoot the valley; too small and it takes forever to get there.
Initialize
Start with random values for m (slope) and b (intercept), typically both set to 0. This is your starting point - imagine placing a completely random line on your data. The line will be completely wrong at first, but that's okay! The algorithm will improve it step by step.
Calculate Gradient
Calculate how much the error changes when you slightly adjust m and b. This tells you the direction of steepest descent - which way to move to reduce error fastest. Think of it as measuring the slope of the hill you're standing on to figure out which direction is most downhill.
Update Parameters
Move m and b in the opposite direction of the gradient (downhill) by a small step determined by the learning rate. Repeat steps 2-3 hundreds of times until the error stops decreasing significantly - this is called convergence, meaning you've found the optimal line!
Gradient Descent Equations

$$m := m - \alpha \frac{\partial MSE}{\partial m} \qquad\qquad b := b - \alpha \frac{\partial MSE}{\partial b}$$

Where $\alpha$ is the learning rate (step size), typically between 0.001 and 0.1.
Understanding the equations:
- $m$ and $b$: The slope and intercept of our line (the parameters we're trying to find)
- $\frac{\partial MSE}{\partial m}$ and $\frac{\partial MSE}{\partial b}$: These are "gradients" - they tell us which direction to adjust $m$ and $b$ to reduce error
- $\alpha$ (learning rate): Controls how big each adjustment step is. Too large and we might overshoot; too small and learning takes forever
The intuition: Imagine you're hiking down a mountain in fog (can't see the bottom). You feel the slope under your feet and take a step downhill. That's gradient descent! The gradient tells you which way is "downhill" (toward lower error), and the learning rate determines your step size. You repeat this until you reach the valley (minimum error).
Step 1: Define the Gradient Descent Function
import numpy as np
def gradient_descent(x, y, learning_rate=0.01, epochs=1000):
    m, b = 0, 0  # Initialize slope and intercept to zero
    n = len(x)   # Number of data points
m (slope) and b (intercept) both start at 0 - imagine starting with a flat horizontal line at y=0. These will iteratively improve during training.
Parameters explained:
- learning_rate=0.01: Controls how big each step is during optimization. Too large = overshooting, too small = slow training.
- epochs=1000: Number of complete passes through the dataset to adjust m and b.
- n = len(x): Dataset size, needed for gradient calculations.
Step 2: Training Loop - Make Predictions
    for epoch in range(epochs):
        # Make predictions using current m and b
        y_pred = m * x + b
The prediction rule is y_pred = m * x + b. In the first epoch, with m=0 and b=0, all predictions are 0 - completely wrong! But that's okay, the gradients will tell us how to improve.
Training cycle: This loop runs 1000 times. Each iteration makes the predictions slightly better by adjusting m and b based on the errors.
Step 3: Calculate Gradients
        # Calculate gradients (direction of steepest increase in error)
        dm = -(2/n) * np.sum(x * (y - y_pred))  # Gradient for slope
        db = -(2/n) * np.sum(y - y_pred)        # Gradient for intercept
Mathematical intuition:
- dm = -(2/n) * Σ[x * (y - y_pred)]: Shows how much to change slope m
- db = -(2/n) * Σ[(y - y_pred)]: Shows how much to change intercept b
- Negative sign: We move opposite to the gradient (downhill) to minimize error
- The (2/n) factor comes from averaging over all n data points
Step 4: Update Parameters
        # Update parameters by moving against the gradient
        m = m - learning_rate * dm
        b = b - learning_rate * db

        # Print progress every 100 epochs
        if epoch % 100 == 0:
            mse = np.mean((y - y_pred) ** 2)
            print(f"Epoch {epoch}: MSE = {mse:.2f}")
learning_rate (0.01) acts as a multiplier - it prevents us from taking huge jumps that might overshoot the optimal values.
Progress tracking: Every 100 epochs, we print the MSE to monitor training. You should see it steadily decreasing - if it increases, your learning rate is too high! Typical output: "Epoch 0: MSE = 50.20" → "Epoch 100: MSE = 12.80" → "Epoch 900: MSE = 2.30"
Step 5: Complete Example with Test Data
    return m, b
# Test with sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
m, b = gradient_descent(x, y) # m ≈ 0.6, b ≈ 2.2
print(f"Final equation: y = {m:.2f}x + {b:.2f}")
Verification: The test data points are [1,2,3,4,5] and [2,4,5,4,5]. Plugging into y = 0.6x + 2.2: when x=1 → y=2.8 (close to actual 2), when x=5 → y=5.2 (close to actual 5). The model has learned the underlying pattern! You can now use this equation to predict y for any new x value.
Learning Rate Impact
| Learning Rate | Behavior | Result |
|---|---|---|
| Too Small (0.0001) | Tiny steps down the hill | Slow convergence, may timeout |
| Just Right (0.01) | Steady progress downhill | Converges efficiently to minimum |
| Too Large (0.5) | Overshoots the valley | Oscillates or diverges |
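The table's behaviors can be reproduced in a few lines. This sketch reuses the gradient-descent update from this section on small illustrative data, comparing a moderate learning rate with an overly large one:

```python
import numpy as np

def run_gd(x, y, lr, epochs=50):
    """Run gradient descent for a fixed number of epochs, return final MSE."""
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        y_pred = m * x + b
        dm = -(2 / n) * np.sum(x * (y - y_pred))
        db = -(2 / n) * np.sum(y - y_pred)
        m -= lr * dm
        b -= lr * db
    return np.mean((y - (m * x + b)) ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

print(f"lr=0.01 -> final MSE {run_gd(x, y, 0.01):.4f}")  # steadily shrinks
print(f"lr=0.5  -> final MSE {run_gd(x, y, 0.5):.3g}")   # explodes (diverges)
```

With lr=0.5 the updates overshoot the minimum and grow without bound, exactly the "oscillates or diverges" row of the table.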
The Normal Equation: Direct Solution
The Normal Equation is a powerful mathematical formula that provides a direct, closed-form analytical solution to linear regression - completely different from gradient descent's iterative approach. While gradient descent is like carefully walking down a mountain step-by-step in the fog, the Normal Equation is like having a helicopter that takes you directly to the bottom in one jump!
This method uses matrix calculus and linear algebra to find the optimal parameters θ (theta) that minimize the cost function in a single computation. Instead of repeatedly adjusting parameters through hundreds or thousands of iterations, it solves the optimization problem algebraically by taking the derivative of the cost function, setting it to zero, and solving for θ. The result is a formula that, when given your data, outputs the perfect parameters immediately.
The beauty of the Normal Equation lies in its mathematical elegance: it's derived from the principle that at the minimum of a convex function (like our MSE cost function), the gradient equals zero. By solving the equation ∇J(θ) = 0 using matrix differentiation, we arrive at a direct formula. However, this convenience comes with computational costs - the formula requires computing a matrix inverse, which becomes extremely slow and memory-intensive when dealing with large feature matrices. This is why understanding both gradient descent and the Normal Equation is crucial: each has its place depending on your dataset size and computational resources.
Breaking Down the Normal Equation

$$\theta = (X^TX)^{-1}X^Ty$$

$X^TX$ (X transpose times X): Creates a square matrix that captures the relationships between all features. This is called the "Gram matrix" and it measures how features correlate with each other.
$(X^TX)^{-1}$ (The inverse): This is the computationally expensive part! Matrix inversion requires O(n³) operations where n is the number of features. For 10,000 features, that's 1 trillion operations! This is why the Normal Equation slows down dramatically with more features.
$X^Ty$ (X transpose times y): Projects the target values onto the feature space, creating a vector that represents how each feature relates to the output.
Why this works: The formula essentially finds the point in parameter space where the gradient of the cost function equals zero - the mathematical definition of a minimum. For linear regression's convex cost function, this point is guaranteed to be the global minimum, giving us the absolutely best possible parameters.
The tradeoff:
- ✓ Advantages: Exact solution (no approximation), no hyperparameters to tune (no learning rate!), guaranteed to find the global minimum, no need for feature scaling, no risk of getting stuck in local minima.
- ✗ Disadvantages: Computationally expensive O(n³) for large feature sets, requires storing large matrices in memory, fails when $X^TX$ is non-invertible (singular matrix), doesn't work with non-linear models, can't be easily adapted to online learning.
When to use it: Ideal for small to medium datasets (< 10,000 samples and < 1,000 features) where you want an exact solution quickly. Perfect for academic projects, small business applications, and situations where you need interpretable, reproducible results. For massive datasets, web-scale applications, or real-time learning, stick with gradient descent or its variants.
Advantages
- No learning rate needed - No hyperparameter tuning required, eliminating trial and error
- One calculation - Solves in a single step instead of thousands of iterations
- Exact solution - Mathematically guaranteed optimal parameters, no approximation
- Perfect for small datasets - Fast and efficient when features are limited
Limitations
- Slow for large features - Becomes impractical when n > 10,000 features
- Matrix inversion cost - O(n³) complexity makes it exponentially slower
- Singular matrix issues - Fails when $X^TX$ is non-invertible
- Limited scope - Only works for linear models, not other algorithms
import numpy as np
# Sample data
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]]) # First column is 1s
y = np.array([2, 4, 5, 4, 5])
# Apply normal equation: theta = (X^T X)^-1 X^T y
X_transpose = X.T
theta = np.linalg.inv(X_transpose.dot(X)).dot(X_transpose).dot(y)
print(f"Optimal parameters: {theta}")
# ≈ [2.2, 0.6] - intercept and slope
# Make predictions
X_test = np.array([[1, 6], [1, 7]]) # Include 1s for intercept
predictions = X_test.dot(theta)
print(f"Predictions: {predictions}") # [4.8 5.4]
- Gradient Descent: Large datasets (n > 10,000 features), works with other algorithms (logistic regression, neural networks), scalable
- Normal Equation: Small datasets (n < 10,000 features), need exact solution, no hyperparameter tuning required
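A practical aside: you rarely form the inverse explicitly. A sketch on the same toy data showing that numpy's numerically sturdier routes, np.linalg.solve and np.linalg.lstsq, recover the same parameters as the explicit-inverse formula:

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Normal equation with an explicit inverse (as in the example above)
theta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Better: solve the normal equations as a linear system, no inverse formed
theta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Best for ill-conditioned problems: a least-squares solver on X directly
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_inv, theta_solve, theta_lstsq)  # all ≈ [2.2, 0.6]
```

All three agree here; on nearly-singular $X^TX$, solve and lstsq stay accurate where the explicit inverse loses precision.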
Practice Questions
Problem: Given actual values [10, 20, 30] and predicted values [12, 18, 28], calculate the Mean Squared Error.
Show Solution
import numpy as np
actual = np.array([10, 20, 30])
predicted = np.array([12, 18, 28])
# Calculate squared errors
squared_errors = (actual - predicted) ** 2 # [4, 4, 4]
# Calculate mean
mse = np.mean(squared_errors) # 4.0
print(f"MSE: {mse}")
# Output: MSE: 4.0
Explanation: Each prediction is off by 2, so squared errors are all 4. The average is 4.0.
Problem: For the line y = 2x + 1 with one data point (3, 8), calculate the gradients ∂MSE/∂m and ∂MSE/∂b. Prediction is 2(3) + 1 = 7.
Show Solution
import numpy as np
# Data point
x, y = 3, 8
# Current parameters
m, b = 2, 1
# Make prediction
y_pred = m * x + b # 7
error = y - y_pred # 1
# Calculate gradients
dm = -2 * x * error # -6
db = -2 * error # -2
print(f"Gradient for m: {dm}")
print(f"Gradient for b: {db}")
# dm = -6 means m should increase
# db = -2 means b should increase
Explanation: Negative gradients indicate the parameters should increase to reduce error.
Problem: Implement a complete training loop with gradient descent for the dataset x=[1,2,3], y=[2,4,6]. Track MSE every 10 iterations for 100 epochs.
Show Solution
import numpy as np
x = np.array([1, 2, 3])
y = np.array([2, 4, 6])
m, b = 0, 0 # Initialize
learning_rate = 0.01
n = len(x)
print("Epoch | MSE")
print("-" * 20)
for epoch in range(101):
    # Predictions
    y_pred = m * x + b
    # Calculate MSE
    mse = np.mean((y - y_pred) ** 2)
    # Print progress
    if epoch % 10 == 0:
        print(f"{epoch:5d} | {mse:.4f}")
    # Calculate gradients
    dm = -(2/n) * np.sum(x * (y - y_pred))
    db = -(2/n) * np.sum(y - y_pred)
    # Update parameters
    m -= learning_rate * dm
    b -= learning_rate * db
print(f"\nFinal: m={m:.4f}, b={b:.4f}")
# The data lie exactly on y = 2x, so m and b head toward 2 and 0
Explanation: MSE decreases every iteration. The perfect line is y = 2x (m=2, b=0); after only 100 epochs at this learning rate the parameters are close but not fully converged - with more epochs, m and b settle ever nearer to 2 and 0.
Simple Linear Regression
Simple linear regression is the most fundamental supervised learning algorithm in machine learning. It's called "simple" not because it's trivial, but because it uses exactly one input feature to predict the target variable. Think of it as finding the best-fitting straight line through your data points - like connecting the dots in a way that minimizes errors.
Imagine you're trying to predict someone's weight based solely on their height. You collect data from 100 people, plot height on the x-axis and weight on the y-axis, and notice the points form a rough upward pattern - taller people tend to weigh more. Simple linear regression draws the single straight line that best captures this trend, allowing you to predict weight for any height value. While real-world relationships are rarely perfectly linear, this technique provides an excellent starting point for understanding machine learning and works surprisingly well for many practical problems.
Simple Linear Regression Explained
What it is: Simple linear regression models the relationship between one input feature (x) and one output variable (y) by fitting a straight line through your data points. The line represents a mathematical function that transforms any input value into a predicted output value.
Understanding the equation:
- y (Target Variable): The value we want to predict. This is the "dependent variable" because its value depends on x. Examples: house price, student test score, stock price tomorrow, electricity consumption.
- x (Input Feature): The single piece of information we use to make predictions. This is the "independent variable" because we treat it as given. Examples: house square footage, hours studied, today's stock price, outdoor temperature.
- m (Slope): The rate of change - how much y increases (or decreases if negative) when x increases by 1 unit. This is the coefficient or weight. Example: if m = 5000, then each additional year of experience adds $5000 to predicted salary. A steeper slope means x has a stronger influence on y.
- b (Intercept): The baseline value - what we predict for y when x equals zero. This is also called the bias term. Example: if b = 30000, the model predicts a starting salary of $30,000 with 0 years of experience. Sometimes this doesn't make physical sense (negative height at age 0?), but it's mathematically necessary for the line to fit properly.
How the model learns: During training, the algorithm tries millions of different combinations of m and b values. For each combination, it calculates how wrong the predictions are (using the cost function). Through an optimization process (gradient descent or normal equation), it systematically adjusts m and b to minimize these errors. The final values of m and b represent the "learned" line that best captures the relationship between x and y in your training data.
Real-world interpretation: Once trained, you can interpret your model: "For every 1 unit increase in [x], we expect [y] to change by [m] units, starting from a baseline of [b] when [x] is zero." This interpretability is one of linear regression's greatest strengths - unlike complex neural networks, you can easily explain the model's decisions to stakeholders.
When to use simple linear regression:
- You have one primary feature that you believe influences your target
- The relationship appears roughly linear when you plot the data
- You need an interpretable model that's easy to explain
- You're just starting with machine learning and want to understand fundamentals
- ✗ Don't use if: you have multiple important features (use multiple regression), the relationship is clearly curved (use polynomial regression), or you have categorical predictions (use classification)
Key limitation: The word "simple" specifically means we use only ONE feature. Real-world problems often depend on multiple factors - house prices depend on size, location, age, neighborhood, etc. When you need to consider multiple features simultaneously, you'll graduate to Multiple Linear Regression, which extends this concept to many dimensions while keeping the same interpretable structure.
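As a preview of that graduation, here is a minimal sketch of multiple linear regression using only numpy's least-squares solver. The dataset below is invented for illustration - the feature names and numbers are assumptions, not real market data:

```python
import numpy as np

# Hypothetical data: size (100s of sqft) and age (years) -> price ($1000s)
sizes = np.array([10, 15, 12, 20, 18], dtype=float)
ages = np.array([30, 10, 20, 5, 8], dtype=float)
prices = np.array([200, 320, 250, 410, 370], dtype=float)

# Design matrix: a column of 1s for the intercept, then one column per feature
X = np.column_stack([np.ones_like(sizes), sizes, ages])

# Least-squares fit: price ≈ b + w1*size + w2*age
coef, *_ = np.linalg.lstsq(X, prices, rcond=None)
b, w1, w2 = coef
print(f"price ≈ {b:.1f} + {w1:.1f}*size + {w2:.1f}*age")

# Predict for a 1,600 sqft, 15-year-old house (intercept term first)
pred = np.array([1.0, 16.0, 15.0]) @ coef
print(f"Predicted price: {pred:.0f}k")
```

The structure is identical to the single-feature case - the line just becomes a plane (or hyperplane), and each coefficient keeps its "change in y per unit of this feature, holding the others fixed" interpretation.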
Example: Predicting Salary from Experience
Let's predict salary based on years of experience. This is a classic simple linear regression problem - as experience increases, we expect salary to increase linearly.
Step 1: Import Libraries and Prepare Data
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Data: years of experience and salary (in thousands)
experience = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
salary = np.array([40, 45, 50, 55, 65, 70, 75, 80, 85, 90])
The call .reshape(-1, 1) converts experience from a 1D array [1,2,3,...] to a 2D array [[1],[2],[3],...].
Why reshape? Scikit-learn REQUIRES features to be 2D (rows × columns matrix), even for single-feature problems. The -1 means "figure out the number of rows automatically" and 1 means "make it 1 column". This is a common gotcha for beginners!
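A quick numpy-only sketch of what the reshape does to the array's shape:

```python
import numpy as np

experience = np.array([1, 2, 3, 4, 5])
print(experience.shape)        # (5,)   -- 1D: the shape sklearn rejects as X

X = experience.reshape(-1, 1)  # -1: infer row count, 1: force one column
print(X.shape)                 # (5, 1) -- 2D: 5 samples, 1 feature
```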
Step 2: Create and Train the Model
# Create and train the model
model = LinearRegression()
model.fit(experience, salary)
We then call .fit() on our data. The fit method automatically finds the optimal slope and intercept that minimize the mean squared error.
Behind the scenes: Scikit-learn's LinearRegression doesn't iterate like gradient descent - it computes the least-squares solution directly with an SVD-based solver (scipy.linalg.lstsq), which finds the same optimum as the Normal Equation $\theta = (X^TX)^{-1}X^Ty$ but is numerically more robust. For small datasets this direct solution is fast and exact!
Step 3: Extract and Display Parameters
# Get parameters
slope = model.coef_[0] # ≈ 5.73
intercept = model.intercept_ # ≈ 34.00
print(f"Salary = {slope:.2f} × Experience + {intercept:.2f}")
# Output: Salary = 5.73 × Experience + 34.00
model.coef_ is an array containing the slope(s) - we use [0] to get the first (and only) coefficient. model.intercept_ gives us the y-intercept.
Interpreting results: The equation y = 5.73x + 34.00 tells us:
- Starting salary (0 experience) ≈ $34,000
- Each year of experience adds roughly $5,730 to salary
- After 10 years: $34,000 + (10 × $5,730) ≈ $91,300, close to the $90k in the data
Step 4: Make Predictions
# Make prediction for 12 years experience
prediction = model.predict([[12]]) # ≈ 102.73k
print(f"Predicted salary for 12 years: ${prediction[0]:.2f}k")
We pass [[12]] as a 2D array (matching the training format) to get a prediction of approximately $102,730.
Important note: Predictions are most reliable within the training data range (1-10 years). Predicting for 12 years is extrapolation, which can be less accurate. For 50 years of experience, the model would predict an unrealistic salary because the linear relationship might not hold for extreme values!
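To make the extrapolation risk concrete, a tiny sketch using round-number parameters near the fitted ones (the values 5.73 and 34.0 here are illustrative assumptions, not the exact fit):

```python
# Illustrative parameters, assumed close to the fitted model above
slope, intercept = 5.73, 34.0

def predict_salary(years):
    """Salary in $1000s predicted by the linear model."""
    return slope * years + intercept

print(predict_salary(5))    # within the training range: a plausible ~62.7k
print(predict_salary(50))   # far outside it: ~320k, likely unrealistic
```

The line happily extends forever, but careers don't stay linear for 50 years - the model has no way to know that.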
Visualizing the Fit
import matplotlib.pyplot as plt
# Plot data points
plt.scatter(experience, salary, color='blue', label='Actual Data')
# Plot regression line
plt.plot(experience, model.predict(experience),
color='red', linewidth=2, label='Best Fit Line')
plt.xlabel('Years of Experience')
plt.ylabel('Salary ($1000s)')
plt.title('Salary vs Experience')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Shows how well the line fits through the data points
Assumptions of Linear Regression
Linearity
The relationship between x and y is linear. Plot the data to verify this assumption holds.
Independence
Observations are independent of each other. No correlation between residuals.
Homoscedasticity
Variance of errors is constant across all levels of the independent variable.
Normality
Residuals (errors) are normally distributed. Check with a histogram or Q-Q plot.
Complete Example with Train-Test Split
Step 1: Import Libraries and Generate Sample Data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Generate sample data
np.random.seed(42) # For reproducibility
X = np.random.rand(100, 1) * 10
y = 2.5 * X.squeeze() + 5 + np.random.randn(100) * 2
We generate data following y = 2.5x + 5 + noise. The np.random.randn(100) adds Gaussian (normal) random noise to simulate real-world data imperfections.
Why add noise? Real-world data is NEVER perfectly linear - there are always measurement errors, unmodeled factors, and random variations. Adding noise makes our synthetic data more realistic and tests the model's ability to find the underlying pattern despite the noise. The true parameters (m=2.5, b=5) are what we want the model to recover.
Step 2: Split Data into Training and Testing Sets
# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
We split the data with train_test_split. The random_state=42 ensures reproducibility - you'll get the same split every time you run the code.
The golden rule: NEVER train and test on the same data! Training on all data gives falsely optimistic performance metrics. The test set simulates "unseen future data" - if the model performs well on test data, it will likely work well in production. This is the foundation of preventing overfitting!
Step 3: Train the Model
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
We train on the training set only by calling .fit(). At this point, the model has never seen the 20 test samples.
Best practice verification: The test data is "locked away" during training - pretend it doesn't exist! This separation is crucial for honest evaluation. If you peek at test data during training (data leakage), your performance metrics will be misleadingly high and your model will fail in production.
Step 4: Evaluate Model Performance
# Evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred) # ~4.2
r2 = r2_score(y_test, y_pred) # ~0.92
print(f"MSE: {mse:.2f}")
print(f"R²: {r2:.2f}")
print(f"Model: y = {model.coef_[0]:.2f}x + {model.intercept_:.2f}")
Interpreting the results:
• MSE = 4.2: Average squared prediction error. Taking sqrt gives RMSE = 2.05, meaning predictions are typically within ±2 units of actual values
• R² = 0.92: Model explains 92% of variance in the data - excellent! (0.7-0.8 is good, 0.9+ is great)
• Learned equation (y ≈ 2.5x + 5): The model recovers values close to the true parameters despite the noise - this validates our model and training process.
Practice Questions
Problem: Create a LinearRegression model to fit the data: x = [1, 2, 3, 4, 5] and y = [3, 5, 7, 9, 11]. Print the slope and intercept.
Show Solution
from sklearn.linear_model import LinearRegression
import numpy as np
# Prepare data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([3, 5, 7, 9, 11])
# Create and train model
model = LinearRegression()
model.fit(X, y)
# Extract parameters
slope = model.coef_[0] # 2.0
intercept = model.intercept_ # 1.0
print(f"Slope: {slope}")
print(f"Intercept: {intercept}")
print(f"Equation: y = {slope}x + {intercept}")
# Output: y = 2.0x + 1.0
Explanation: The data follows y = 2x + 1 perfectly, so the model learns these exact values.
Problem: Train a model on house data (size in sqft, price in $k): [(1000, 200), (1500, 250), (2000, 300), (2500, 350)]. Predict price for 1800 sqft.
Show Solution
from sklearn.linear_model import LinearRegression
import numpy as np
# Data
sizes = np.array([1000, 1500, 2000, 2500]).reshape(-1, 1)
prices = np.array([200, 250, 300, 350])
# Train model
model = LinearRegression()
model.fit(sizes, prices)
# Predict for 1800 sqft
size_new = np.array([[1800]])
predicted_price = model.predict(size_new) # ~280k
print(f"Predicted price: ${predicted_price[0]:.0f}k")
print(f"Price per sqft: ${model.coef_[0]:.2f}/sqft")
# Output: Predicted price: $280k
# Price per sqft: $0.10/sqft
Explanation: The model learns a linear relationship: each square foot adds $100 to the price.
Problem: Generate 50 data points from y = 3x + 5 + noise. Train 3 models with different train-test splits (60%, 70%, 80%). Compare R² scores.
Show Solution
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np
# Generate data
np.random.seed(42)
X = np.random.rand(50, 1) * 10
y = 3 * X.squeeze() + 5 + np.random.randn(50)
# Test different split ratios
split_ratios = [0.4, 0.3, 0.2] # test sizes
print("Test Size | Train R² | Test R²")
print("-" * 35)
for test_size in split_ratios:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    model = LinearRegression()
    model.fit(X_train, y_train)
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    print(f"{test_size:.1f} | {train_r2:.4f} | {test_r2:.4f}")
# More training data generally gives better, more stable models
Explanation: Larger training sets typically yield more reliable models. Test R² shows how well the model generalizes.
Multiple Linear Regression
Real-world problems rarely depend on just one feature. Multiple linear regression extends simple linear regression to handle multiple input features simultaneously. Instead of a line in 2D space, we're fitting a hyperplane in multi-dimensional space.
Multiple Linear Regression
Instead of just one feature, we can use many features to make more accurate predictions:
$$ y = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b $$
Understanding the components:
- $y$: The target variable we're predicting (still just one output)
- $x_1, x_2, ..., x_n$: Multiple input features (e.g., for house prices: $x_1$ = size, $x_2$ = bedrooms, $x_3$ = location score, $x_4$ = age)
- $w_1, w_2, ..., w_n$: Weights (coefficients) - one for each feature, showing how much that feature impacts the prediction
- $b$: The bias/intercept term - base prediction when all features are zero
Real-world example: Predicting house prices...
- Simple regression: Price = 150 × Size + 50000
- Multiple regression: Price = 150 × Size + 20000 × Bedrooms - 500 × Age + 30000 × Location + 50000
The benefit: By considering multiple factors at once, we get more nuanced and accurate predictions that capture real-world complexity!
Example: House Price Prediction
Let's predict house prices using multiple features: size, number of bedrooms, age, and location score.
Step 1: Import Libraries and Create DataFrame
from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd
# House data: size, bedrooms, age, location_score, price
data = pd.DataFrame({
'size': [1500, 1800, 2400, 2000, 2200],
'bedrooms': [3, 3, 4, 3, 4],
'age': [10, 5, 2, 8, 3],
'location': [7, 8, 9, 6, 8],
'price': [300, 350, 450, 380, 420]
})
Data structure insight: Unlike simple regression (1 feature), we now have 4 features predicting price. This is multiple linear regression. The model will learn how each feature independently contributes to price:
price = b₀ + b₁(size) + b₂(bedrooms) + b₃(age) + b₄(location)
Step 2: Separate Features and Target
# Separate features and target
X = data[['size', 'bedrooms', 'age', 'location']]
y = data['price']
We separate the DataFrame into features (X) and target (y). X is a 5×4 matrix (5 houses, 4 features each), and y is a 5-element array of prices.
Machine learning terminology:
• Features (X): Independent variables, inputs, predictors - what we know about each house
• Target (y): Dependent variable, output, label - what we're trying to predict
This split is standard practice in ML - X and y are fed separately to the
.fit() method.
Step 3: Train the Model
# Train model
model = LinearRegression()
model.fit(X, y)
How it works: The model solves the system:
y = b₀ + b₁x₁ + b₂x₂ + b₃x₃ + b₄x₄ by finding the values of b₀ (intercept) and b₁-b₄ (coefficients) that minimize prediction error across all 5 houses. This happens in milliseconds using matrix algebra!
Step 4: Examine Feature Coefficients
# Print coefficients
print("Feature Importance:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature:12s}: {coef:7.2f}")
print(f"Intercept: {model.intercept_:.2f}")
Coefficient interpretation (illustrative values - the exact numbers depend on the data):
• size = 0.15: Each additional sqft adds $150 to price (coefficient × 1000)
• bedrooms = 20.0: Each extra bedroom adds $20,000
• age = -2.5: Each year older reduces price by $2,500 (negative = bad for price)
• location = 15.0: Each point of location score adds $15,000
Larger magnitude = stronger influence on price predictions.
Step 5: Make Prediction for New House
# Predict for new house: 2100 sqft, 3 bed, 5 years old, location 8
new_house = [[2100, 3, 5, 8]]
predicted_price = model.predict(new_house) # ~396k
print(f"\nPredicted price: ${predicted_price[0]:.0f}k")
Critical requirement: Features MUST be provided in the exact same order and format as training:
[[size, bedrooms, age, location]]. Mixing up the order (e.g., [bedrooms, size, age, location]) will produce garbage predictions! The double brackets create a 2D array for a single prediction - for multiple houses, use: [[2100,3,5,8], [1800,2,3,9]]
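One way to make the feature order explicit is to predict with a DataFrame whose columns match the training columns - recent scikit-learn versions check column names and warn on mismatches. A sketch using the same illustrative house data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.DataFrame({
    'size': [1500, 1800, 2400, 2000, 2200],
    'bedrooms': [3, 3, 4, 3, 4],
    'age': [10, 5, 2, 8, 3],
    'location': [7, 8, 9, 6, 8],
    'price': [300, 350, 450, 380, 420],
})
X = data[['size', 'bedrooms', 'age', 'location']]
y = data['price']
model = LinearRegression().fit(X, y)

# Named columns make the feature order unambiguous at prediction time
new_house = pd.DataFrame(
    [[2100, 3, 5, 8]],
    columns=['size', 'bedrooms', 'age', 'location']
)
pred = model.predict(new_house)[0]
print(f"Predicted price: ${pred:.0f}k")
```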
Matrix Notation
Multiple linear regression is elegantly expressed using matrix notation, which is how it's actually computed:
Vectorized Equation
$$ \mathbf{y} = \mathbf{X}\mathbf{w} + b $$
What's happening here? Instead of writing separate equations for each data point, we use matrices to represent ALL the data and calculations at once. This is the "vectorized" form - it's more compact and much faster to compute!
Breaking down the components:
- $\mathbf{X}$: Feature matrix (n samples × m features) - each row is one data point, each column is one feature. Think of it as a spreadsheet with your data!
- $\mathbf{w}$: Weight vector (m × 1) - contains the weight for each feature in a single column
- $\mathbf{y}$: Target vector (n × 1) - contains the predicted value for each sample
- $b$: Bias term - often incorporated into $\mathbf{X}$ by adding a column of 1s
The Normal Equation (vectorized):
$ \mathbf{w} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} $
This directly computes optimal weights using matrix operations.
Why use matrix form? It's not just notation! Matrix operations are highly optimized in libraries like NumPy, making calculations on thousands of data points nearly instant. Writing loops would be much slower!
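The Normal Equation above can be written in a few lines of NumPy and checked against scikit-learn. A minimal sketch (using np.linalg.solve rather than an explicit matrix inverse, which is the numerically safer way to evaluate the same formula):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Small dataset that follows y = 2x + 1 exactly
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Incorporate the bias b by prepending a column of 1s to X
X_b = np.hstack([np.ones((len(X), 1)), X])

# Normal equation: w = (XᵀX)⁻¹ Xᵀy, solved as a linear system
w = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
print(f"Normal equation: intercept={w[0]:.2f}, slope={w[1]:.2f}")  # intercept=1.00, slope=2.00

# Matches scikit-learn's result
model = LinearRegression().fit(X, y)
print(f"scikit-learn:    intercept={model.intercept_:.2f}, slope={model.coef_[0]:.2f}")
```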
Feature Scaling
When features have different scales (like size in thousands and bedrooms in single digits), it's important to standardize them. This makes coefficients comparable and helps gradient descent converge faster.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
import numpy as np
# Data with different scales
X = np.array([
[2000, 3, 10], # size, bedrooms, age
[1500, 2, 5],
[2500, 4, 2],
[1800, 3, 8]
])
y = np.array([400, 300, 500, 350])
# Scale features (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train model on scaled data
model = LinearRegression()
model.fit(X_scaled, y)
# Predict for new house
new_house = [[2100, 3, 6]]
new_house_scaled = scaler.transform(new_house)
prediction = model.predict(new_house_scaled) # ~380k
print(f"Predicted price: ${prediction[0]:.0f}k")
Feature Engineering
| Technique | Description | Example |
|---|---|---|
| Polynomial Features | Create interaction terms | size² or size × bedrooms |
| Log Transform | Handle skewed distributions | log(price) or log(size) |
| One-Hot Encoding | Convert categories to binary | location → [0,1,0,0] |
| Feature Binning | Group continuous values | age → [young, mid, old] |
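As a quick example of the one-hot encoding row, here's a sketch with pandas get_dummies; the generated column names (like location_city) come from the category values:

```python
import pandas as pd

# One-hot encoding: turn a categorical column into binary indicator columns
houses = pd.DataFrame({
    'size': [1500, 1800, 2400],
    'location': ['suburb', 'city', 'rural'],
})
encoded = pd.get_dummies(houses, columns=['location'])
print(encoded)
# Columns: size, location_city, location_rural, location_suburb
```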
Polynomial Regression for Non-Linear Data
Real-world relationships are often non-linear - like how productivity increases with sleep up to a point, then decreases. Polynomial regression extends linear regression to model curved relationships by adding polynomial features (x², x³, etc.).
Polynomial Regression
Polynomial regression allows us to fit curved patterns in data by transforming features into polynomial terms (squared, cubed, etc.), then applying linear regression techniques. It's perfect for when your data doesn't follow a straight line!
Example - Degree 2 (Quadratic):
$$ y = b_0 + b_1 x + b_2 x^2 $$
How it works:
- Original feature: $x$ (e.g., temperature)
- Create new features: $x^2$ (temperature squared), and optionally $x^3$, $x^4$, etc.
- Fit linear regression: Treat $x$ and $x^2$ as separate features and find coefficients $b_0, b_1, b_2$
- Result: A curved line that can capture U-shaped or hill-shaped patterns!
The name confusion: Despite having $x^2$ terms, it's still called "linear" regression because it's linear in the coefficients ($b_0, b_1, b_2$). We're not squaring the coefficients - we're squaring the features! The model still finds $b$ values using the same linear regression math.
Real-world example: Predicting ice cream sales vs. temperature. Sales might increase with temperature, but above 95°F people stay indoors, so sales drop. This hill-shaped (inverted-U) pattern needs polynomial regression - a straight line can't capture it!
Warning: Higher degrees (3, 4, 5+) can fit training data very closely but may "overfit" - they follow noise rather than true patterns. Start with degree 2 or 3 and test on validation data!
Degree 1 (Linear)
$$ y = b_0 + b_1 x $$
Straight line
Degree 2 (Quadratic)
$$ y = b_0 + b_1 x + b_2 x^2 $$
U-shaped or inverted-U curve
Degree 3 (Cubic)
$$ y = b_0 + b_1 x + b_2 x^2 + b_3 x^3 $$
S-shaped curves
Step 1: Import Libraries and Create Non-Linear Data
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# Non-linear data (productivity vs hours of sleep)
X = np.array([4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([60, 70, 85, 95, 90, 75, 60]) # Inverted U-shape
Why simple linear regression fails: A straight line can NEVER fit this curved pattern well - it would either underpredict the middle values or overpredict the extremes. This is where polynomial regression comes in! By adding x² terms, we can model curves, parabolas, and complex non-linear relationships.
Step 2: Create Polynomial Features
# Create polynomial features (degree 2)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print("Original features:", X[0]) # [4]
print("Polynomial features:", X_poly[0]) # [4, 16] (x, x²)
Each input is expanded with PolynomialFeatures. For degree=2, each x becomes [x, x²]. Example: x=4 → [4, 16].
The polynomial trick: We're still using linear regression, but on transformed features! The model learns:
y = b₀ + b₁x + b₂x². It's linear in the coefficients (b₀, b₁, b₂) but creates a curved prediction. Higher degrees (3, 4, 5...) create more complex curves but risk overfitting. include_bias=False means don't add a column of 1s (intercept is handled by LinearRegression).
Step 3: Train Model on Polynomial Features
# Train model on polynomial features
model = LinearRegression()
model.fit(X_poly, y)
We fit LinearRegression on the transformed features X_poly. The model is "linear" but the relationship it learns is quadratic: y = b₀ + b₁x + b₂x².
Key insight: This is the magic of polynomial regression - we're using the exact same LinearRegression class, just feeding it engineered features! The model doesn't know it's fitting a curve; from its perspective, it's doing regular linear regression with 2 features. But when plotted against the original x, the predictions form a parabola that captures the inverted-U pattern.
Step 4: Make Predictions
# Predict
X_test = np.array([[7.5]])
X_test_poly = poly.transform(X_test)
prediction = model.predict(X_test_poly)
print(f"Predicted productivity at 7.5 hrs sleep: {prediction[0]:.1f}")
We transform the new input into [7.5, 56.25] using poly.transform(), then feed it to the model for prediction.
Critical step - DON'T FORGET: You MUST transform new data using the same
poly object that transformed training data. If you forget and just pass [[7.5]] to the model, you'll get an error ("Expected 2 features, got 1") or wrong predictions. Always: X_new → poly.transform(X_new) → model.predict()
Step 5: Visualize the Polynomial Fit
# Visualize
X_range = np.linspace(4, 10, 100).reshape(-1, 1)
X_range_poly = poly.transform(X_range)
y_pred = model.predict(X_range_poly)
plt.scatter(X, y, color='blue', label='Actual data')
plt.plot(X_range, y_pred, color='red', label='Polynomial fit (degree 2)')
plt.xlabel('Hours of Sleep')
plt.ylabel('Productivity Score')
plt.legend()
plt.title('Polynomial Regression: Productivity vs Sleep')
plt.show()
Visualization insight: You'll see the red polynomial curve pass through or near the blue actual data points, capturing the inverted-U pattern. Compare this to a linear fit (straight line), which would have large errors at the extremes. This demonstrates polynomial regression's power for non-linear relationships! Pro tip: Try different degrees (3, 4, 5) to see how curve complexity changes.
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np
# Try different polynomial degrees
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([2, 4, 5, 7, 8, 9, 10, 11, 11, 12])
for degree in range(1, 6):
    # Create pipeline: polynomial features + linear regression
    model = make_pipeline(
        PolynomialFeatures(degree=degree),
        LinearRegression()
    )
    # Cross-validation scores
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(f"Degree {degree}: R² = {scores.mean():.3f} (+/- {scores.std():.3f})")
# Choose degree with highest average R² without overfitting
# Output might show:
# Degree 1: R² = 0.920 (+/- 0.050)
# Degree 2: R² = 0.945 (+/- 0.030) <- Best balance
# Degree 3: R² = 0.935 (+/- 0.055) <- Starting to overfit
# Degree 4: R² = 0.880 (+/- 0.120) <- Overfitting
Multicollinearity Warning
import pandas as pd
import numpy as np
# Create sample data
data = pd.DataFrame({
'size': [1500, 1800, 2400, 2000, 2200],
'rooms': [5, 6, 8, 7, 8], # Highly correlated with size
'age': [10, 5, 2, 8, 3]
})
# Calculate correlation matrix
correlation_matrix = data.corr()
print(correlation_matrix)
# If correlation > 0.8, consider removing one feature
print("\nHigh correlations detected:")
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            col1 = correlation_matrix.columns[i]
            col2 = correlation_matrix.columns[j]
            print(f"{col1} & {col2}: {correlation_matrix.iloc[i, j]:.2f}")
Practice Questions
Problem: Train a model with two features: hours_studied [2, 3, 4, 5] and hours_slept [8, 7, 8, 9], predicting test_score [60, 65, 75, 80]. Predict for 4.5 hours studied and 8 hours slept.
Show Solution
from sklearn.linear_model import LinearRegression
import numpy as np
# Features
X = np.array([
[2, 8], # studied, slept
[3, 7],
[4, 8],
[5, 9]
])
y = np.array([60, 65, 75, 80])
# Train
model = LinearRegression()
model.fit(X, y)
# Predict
prediction = model.predict([[4.5, 8]]) # ~76
print(f"Predicted score: {prediction[0]:.1f}")
# Coefficients
print(f"Study effect: {model.coef_[0]:.2f} points per hour")
print(f"Sleep effect: {model.coef_[1]:.2f} points per hour")
Explanation: The model learns how much each factor (studying and sleeping) contributes to the test score.
Problem: Given features with different scales - income [50000, 60000, 70000] and age [25, 30, 35] predicting spending [5000, 6000, 7000]. Apply StandardScaler and train a model.
Show Solution
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
import numpy as np
# Data
X = np.array([[50000, 25], [60000, 30], [70000, 35]])
y = np.array([5000, 6000, 7000])
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Original:", X[0])
print("Scaled:", X_scaled[0])
# Train on scaled data
model = LinearRegression()
model.fit(X_scaled, y)
# Predict for income=65000, age=32
new_data = [[65000, 32]]
new_scaled = scaler.transform(new_data)
prediction = model.predict(new_scaled) # ~6500
print(f"Predicted spending: ${prediction[0]:.0f}")
Explanation: StandardScaler ensures both features contribute fairly despite income being in tens of thousands and age in tens.
Problem: Create polynomial features (x², xy) from x=[1,2,3] and y=[2,3,4], then train a model to predict z=[5,10,17]. Compare with and without polynomial features.
Show Solution
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np
X = np.array([[1, 2], [2, 3], [3, 4]])
z = np.array([5, 10, 17])
# Model 1: Linear features only
model1 = LinearRegression()
model1.fit(X, z)
pred1 = model1.predict(X)
r2_linear = r2_score(z, pred1)
# Model 2: With polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X) # x, y, x², xy, y²
model2 = LinearRegression()
model2.fit(X_poly, z)
pred2 = model2.predict(X_poly)
r2_poly = r2_score(z, pred2)
print(f"Linear R²: {r2_linear:.4f}")
print(f"Polynomial R²: {r2_poly:.4f}")
print(f"\nNew features: {poly.get_feature_names_out()}")
# Polynomial features capture non-linear relationships!
Explanation: Polynomial features allow linear regression to model non-linear relationships by creating interaction and squared terms.
Python Implementation
Let's put everything together with a complete, production-ready implementation. We'll load real data, preprocess it, train a model, evaluate performance, and make predictions - all the steps you'll use in actual projects.
Complete End-to-End Pipeline
Step 1: Import Libraries and Load Data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt
# 1. Load and explore data
data = pd.DataFrame({
'square_feet': [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700],
'bedrooms': [3, 3, 2, 4, 2, 3, 4, 5, 3, 3],
'age_years': [10, 8, 15, 5, 20, 12, 3, 2, 18, 9],
'price': [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
})
print("Dataset shape:", data.shape)
print("\nFirst few rows:")
print(data.head())
We use pd.DataFrame to organize the data with labeled columns for easy manipulation.
Best practice: Always explore data first!
data.shape tells you dimensions (10 rows, 4 columns), data.head() shows first 5 rows to verify data loaded correctly, data.describe() gives statistics (mean, std, min, max). This prevents errors like missing values, wrong data types, or incorrect columns. Real-world tip: Add data.isnull().sum() to check for missing data!
Step 2: Prepare and Split Data
# 2. Prepare features and target
X = data[['square_feet', 'bedrooms', 'age_years']]
y = data['price']
# 3. Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
We hold out 20% of the rows for testing; random_state=42 ensures reproducibility.
The 80/20 rule: This is the most common split ratio, but it depends on dataset size:
• Small datasets (<100 samples): Use 80/20 or cross-validation
• Medium datasets (100-10k): 80/20 or 70/30
• Large datasets (>100k): Can use 90/10 or even 98/2
Rule of thumb: Test set should have at least 30-50 samples for reliable performance estimates. Our 2-sample test set is tiny - in production, you'd need more data!
Step 3: Scale Features
# 4. Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
We standardize with StandardScaler so each feature has mean=0 and standard deviation=1. This rescales square_feet (roughly 1100-2450) and bedrooms (2-5) to comparable scales.
Critical workflow:
1. scaler.fit(X_train): learns the mean and std from training data ONLY
2. scaler.transform(X_train): applies the transformation to training data
3. scaler.transform(X_test): applies the SAME transformation to test data
Never fit on test data! That would cause data leakage - the test set is supposed to be "unseen". Formula used: z = (x - mean) / std
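The formula can be verified by hand - StandardScaler's output matches a manual z-score (NumPy's default std, ddof=0, is what the scaler uses):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few rows of training features (square_feet, bedrooms)
X_train = np.array([[1400.0, 3], [1875.0, 4], [2350.0, 4], [1100.0, 2]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Same result by hand: z = (x - mean) / std, column by column
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_scaled, manual))  # True
```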
Step 4: Train Model
# 5. Train model
model = LinearRegression()
model.fit(X_train_scaled, y_train)
We train the model with .fit(). The model learns the relationship between the standardized features and prices.
Training speed: For linear regression with small datasets like this, training is nearly instantaneous (<1ms). The algorithm uses matrix operations to directly compute optimal coefficients. For comparison: gradient descent would take longer but works better for huge datasets that don't fit in memory.
Step 5: Make Predictions
# 6. Make predictions
y_train_pred = model.predict(X_train_scaled)
y_test_pred = model.predict(X_test_scaled)
Overfitting check: If train R² >> test R² (e.g., train=0.95, test=0.65), the model memorized training data but doesn't generalize. For linear regression with a proper train/test split, this is rare - but common with complex models like deep neural networks. Ideally, train and test scores should be close (e.g., both around 0.85-0.90).
Step 6: Evaluate Performance
# 7. Evaluate
print("\nModel Performance:")
print(f"Train R²: {r2_score(y_train, y_train_pred):.4f}")
print(f"Test R²: {r2_score(y_test, y_test_pred):.4f}")
print(f"Test MSE: {mean_squared_error(y_test, y_test_pred):.2f}")
print(f"Test MAE: {mean_absolute_error(y_test, y_test_pred):.2f}")
# 8. Feature importance
print("\nFeature Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:.2f}")
Metrics interpretation guide:
• R² (Train vs Test): Both should be high and similar. If Train R²=0.95 and Test R²=0.93 → Great! If Test R² < 0.7 → Model struggles with unseen data
• MSE: Depends on scale - for $200k-$400k prices, MSE < 1000 is excellent
• MAE: More interpretable than MSE - shows average prediction error in same units as target
• Feature coefficients: After scaling, larger magnitude = more important feature. Sign shows direction (+ increases price, - decreases)
Evaluation Metrics Explained
R² Score
Range: typically 0 to 1 (higher is better); can be negative when the model is worse than always predicting the mean
Meaning: Proportion of variance in y explained by the model. R² = 0.85 means 85% of variance is explained.
MSE
Range: 0 to ∞ (lower is better)
Meaning: Average of squared errors. Penalizes large errors heavily. Units are squared ($ squared).
RMSE
Range: 0 to ∞ (lower is better)
Meaning: Square root of MSE. Same units as target ($). More interpretable than MSE.
MAE
Range: 0 to ∞ (lower is better)
Meaning: Average of absolute errors. More robust to outliers. Same units as target ($).
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
# True and predicted values
y_true = np.array([100, 150, 200, 250, 300])
y_pred = np.array([110, 140, 190, 260, 290])
# Calculate all metrics
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse) # Or use sklearn.metrics.root_mean_squared_error (the older squared=False argument is deprecated)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MSE: {mse:.2f} (units squared)")
print(f"RMSE: {rmse:.2f} (same units as target)")
print(f"MAE: {mae:.2f} (same units as target)")
print(f"R²: {r2:.4f} (0 to 1, dimensionless)")
# Example output:
# MSE: 200.00 (units squared)
# RMSE: 14.14 (same units as target)
# MAE: 10.00 (same units as target)
# R²: 0.9840 (0 to 1, dimensionless)
- R²: Best for overall model quality and comparison
- RMSE: When you want interpretable error in original units, sensitive to outliers
- MAE: When you want interpretable error that's robust to outliers
- MSE: When large errors should be heavily penalized (in optimization)
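The outlier sensitivity is easy to see with toy numbers: one large miss moves MAE modestly but inflates RMSE much more, because the error gets squared:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
clean   = np.array([105.0, 145.0, 205.0, 245.0, 305.0])  # small errors everywhere
outlier = np.array([105.0, 145.0, 205.0, 245.0, 250.0])  # one $50k miss

for name, y_pred in [("clean", clean), ("one outlier", outlier)]:
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print(f"{name}: MAE={mae:.1f}, RMSE={rmse:.1f}")

# clean: MAE=5.0, RMSE=5.0
# one outlier: MAE=14.0, RMSE=22.8
```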
Model Diagnostics
After training, always check residuals (prediction errors) to verify assumptions hold:
Step 1: Calculate Residuals
import numpy as np
import matplotlib.pyplot as plt
# Calculate residuals (prediction errors)
residuals = y_test - y_test_pred
A residual is actual − predicted: a positive residual means the model underestimated, a negative one means it overestimated.
Why analyze residuals? They reveal model problems that overall metrics (R², MSE) might hide:
• Patterns in residuals: Indicate missing features or wrong model type
• Non-normal distribution: Violates linear regression assumptions
• Increasing variance: Heteroscedasticity problem (predictions less reliable at extremes)
Perfect model: residuals are random noise centered at 0 with constant variance.
Step 2: Create Residual Plot
# 1. Residual plot (should show random scatter)
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
plt.scatter(y_test_pred, residuals, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
Good residual plot: Random scatter around y=0 with no pattern (like random noise)
Bad residual plot patterns:
• Curved pattern: Need polynomial features or non-linear model
• Cone/funnel shape: Heteroscedasticity - variance increases with predicted values (try log transformation)
• Clusters/groups: Missing categorical features or outliers
If you see patterns, your model is systematically wrong in predictable ways!
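The curved-pattern case is easy to reproduce: fit a straight line to deliberately quadratic data and the residuals trace a parabola instead of random noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Truly quadratic data, no noise - the "wrong model type" scenario
x = np.linspace(-3, 3, 100)
y = x ** 2

model = LinearRegression().fit(x.reshape(-1, 1), y)
residuals = y - model.predict(x.reshape(-1, 1))

# Residuals are positive at both extremes and negative in the middle,
# a clear curved pattern rather than random scatter around zero
print(residuals[0] > 0, residuals[50] < 0, residuals[-1] > 0)  # True True True
```

Adding a squared feature (polynomial features) would make the residuals flatten out to noise.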
Step 3: Create Histogram of Residuals
# 2. Histogram (should be roughly normal/bell-shaped)
plt.subplot(1, 3, 2)
plt.hist(residuals, bins=10, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Residual Distribution')
Distribution shapes:
• Bell curve centered at 0: Perfect! Normality assumption satisfied
• Skewed right/left: Asymmetric errors - consider transforming target variable (log, sqrt)
• Multiple peaks (bimodal): Suggests missing features or subgroups in data
• Uniform/flat: Non-normal, possibly wrong model type
Why care? Non-normal residuals mean confidence intervals and hypothesis tests are unreliable!
Step 4: Create Q-Q Plot
# 3. Q-Q plot (points should follow diagonal line)
plt.subplot(1, 3, 3)
from scipy import stats
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot')
plt.tight_layout()
plt.show()
📏 Reading Q-Q plots:
• Points on diagonal: Residuals are normally distributed
• S-shaped curve: Distribution has heavier or lighter tails than normal
• Points diverge at extremes: Outliers present
• Systematic deviation: Skewed distribution
Q-Q plots are MORE SENSITIVE than histograms for detecting non-normality: even slight deviations from the line can indicate problems. For large datasets, though, minor deviations are usually acceptable in practice.
Step 5: Statistical Test for Normality
# Statistical test for normality
_, p_value = stats.shapiro(residuals)
print(f"Shapiro test p-value: {p_value:.4f}")
print("Normal!" if p_value > 0.05 else "Not normal - check assumptions")
Interpreting p-values:
• p > 0.05: Cannot reject null hypothesis → residuals are likely normal (good!)
• p ≤ 0.05: Reject null hypothesis → residuals are NOT normal (problem!)
• p < 0.01: Strong evidence of non-normality
Important caveats: With large samples (>1000), Shapiro-Wilk becomes oversensitive - tiny deviations from normality (that don't matter practically) will show p<0.05. For large datasets, rely more on visual plots (histogram, Q-Q) than p-values. For small datasets (<50), the test has low power and might miss non-normality.
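A quick illustration of that oversensitivity (the skew strength and seed here are arbitrary choices for the demo): the same mildly skewed distribution is far more likely to be flagged at n=2000 than at n=40:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=2000)
skewed = x + 0.1 * x**2            # mildly skewed, looks near-normal by eye

# Small sample: the test often lacks power to detect the mild skew
_, p_small = stats.shapiro(skewed[:40])

# Large sample: the same mild skew is almost always flagged
_, p_large = stats.shapiro(skewed)

print(f"n=40:   p = {p_small:.4f}")
print(f"n=2000: p = {p_large:.2e}")
```

This is why visual checks (histogram, Q-Q plot) should carry more weight than the p-value once the dataset gets large.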
Cross-Validation
Use k-fold cross-validation to get a more reliable estimate of model performance:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np
# Create model
model = LinearRegression()
# Perform 5-fold cross-validation
cv_scores = cross_val_score(
    model, X_train_scaled, y_train,
    cv=5, scoring='r2'
)
print("Cross-Validation R² Scores:", cv_scores)
print(f"Mean R²: {cv_scores.mean():.4f}")
print(f"Std Dev: {cv_scores.std():.4f}")
# Also check MSE
cv_mse = -cross_val_score(
    model, X_train_scaled, y_train,
    cv=5, scoring='neg_mean_squared_error'
)
print(f"\nMean MSE: {cv_mse.mean():.2f}")
Saving and Loading Models
import joblib
# Save model and scaler
joblib.dump(model, 'house_price_model.pkl')
joblib.dump(scaler, 'feature_scaler.pkl')
print("Model saved successfully!")
# Load model later for predictions
loaded_model = joblib.load('house_price_model.pkl')
loaded_scaler = joblib.load('feature_scaler.pkl')
# Make prediction with loaded model
new_house = [[2000, 3, 10]] # sqft, beds, age - column order must match training data
new_house_scaled = loaded_scaler.transform(new_house)
prediction = loaded_model.predict(new_house_scaled)
print(f"Predicted price: ${prediction[0]:.0f}k")
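An alternative worth considering: bundle the scaler and model into a single Pipeline and save one artifact, so the two can never get out of sync (filenames and numbers here are made up for the sketch):

```python
import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Hypothetical training data: [sqft, beds, age] -> price in $k
X = np.array([[1500.0, 3, 20], [2000.0, 3, 10], [2500.0, 4, 5], [1200.0, 2, 30]])
y = np.array([250.0, 320.0, 400.0, 200.0])

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
]).fit(X, y)

joblib.dump(pipeline, 'house_price_pipeline.pkl')   # one file, not two
loaded = joblib.load('house_price_pipeline.pkl')

# Scaling happens inside the pipeline - pass raw features directly
print(loaded.predict([[2000.0, 3, 10]]))
```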
Common Pitfalls to Avoid
Practice Questions
Problem: Given actual values [100, 200, 300] and predictions [110, 190, 290], calculate the R² score.
Show Solution
from sklearn.metrics import r2_score
import numpy as np
y_true = np.array([100, 200, 300])
y_pred = np.array([110, 190, 290])
r2 = r2_score(y_true, y_pred) # ~0.985
print(f"R² Score: {r2:.4f}")
print(f"The model explains {r2*100:.1f}% of the variance")
# Manual calculation
ss_res = np.sum((y_true - y_pred) ** 2) # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2) # total sum of squares
r2_manual = 1 - (ss_res / ss_tot)
print(f"Manual R²: {r2_manual:.4f}")
Explanation: R² measures how much better your model is compared to just predicting the mean.
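That baseline interpretation can be checked directly: predicting the mean for every point gives an R² of exactly 0, and a model worse than the mean goes negative:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100.0, 200.0, 300.0])

# Baseline: always predict the mean -> R² is exactly 0
baseline = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, baseline))   # 0.0

# Predictions worse than the baseline -> R² goes negative
bad = np.array([300.0, 100.0, 200.0])
print(r2_score(y_true, bad))        # -2.0
```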
Problem: Create a complete pipeline with StandardScaler and LinearRegression. Use 3-fold cross-validation to evaluate. Data: X = [[1,2], [2,3], [3,4], [4,5], [5,6]], y = [3, 5, 7, 9, 11].
Show Solution
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np
X = np.array([[1,2], [2,3], [3,4], [4,5], [5,6]])
y = np.array([3, 5, 7, 9, 11])
# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression())
])
# Cross-validation
# Note: with only 5 samples, 3-fold CV leaves one fold with a single test
# point, where R² is undefined (nan) - so score with MSE instead
cv_mse = -cross_val_score(pipeline, X, y, cv=3, scoring='neg_mean_squared_error')
print("CV MSE per fold:", cv_mse)
print(f"Mean MSE: {cv_mse.mean():.4f} (+/- {cv_mse.std():.4f})")
# Train on full data for deployment
pipeline.fit(X, y)
# Predict
new_data = [[6, 7]]
prediction = pipeline.predict(new_data)
print(f"Prediction for [6,7]: {prediction[0]:.2f}")
Explanation: Pipelines ensure preprocessing steps are applied consistently and prevent data leakage during cross-validation.
Problem: Generate 100 points with y = 2x + 3 + noise. Train a model, calculate residuals, and check if they're normally distributed using Shapiro-Wilk test.
Show Solution
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy import stats
import matplotlib.pyplot as plt
# Generate data
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2 * X.squeeze() + 3 + np.random.normal(0, 1, 100)
# Train model
model = LinearRegression()
model.fit(X, y)
# Calculate residuals
y_pred = model.predict(X)
residuals = y - y_pred
# Normality test
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk statistic: {stat:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value > 0.05:
    print("✓ Residuals are normally distributed (assumption holds)")
else:
    print("✗ Residuals not normal (violation of assumption)")
# Visual check
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.subplot(1, 2, 2)
plt.hist(residuals, bins=20, edgecolor='black')
plt.xlabel('Residuals')
plt.title('Distribution')
plt.tight_layout()
plt.show()
Explanation: Shapiro-Wilk test checks normality. P > 0.05 means residuals are likely normal, validating a key regression assumption.
Key Takeaways
Linear Model
Linear regression fits a straight line through data points to predict continuous values based on input features
Cost Function
Uses Mean Squared Error to measure prediction accuracy and find optimal parameters (via the closed-form solution or gradient descent)
Scikit-learn
Python's scikit-learn provides LinearRegression class for easy implementation and training of models
Evaluation Metrics
Use R-squared, MSE, and MAE to assess model performance and compare different regression models
Assumptions
Linear regression assumes linearity, independence, homoscedasticity, and normally distributed errors
Feature Engineering
Transform and scale features appropriately to improve model performance and interpretability
Knowledge Check
Test your understanding of linear regression:
What is the primary goal of linear regression?
In the equation y = mx + b, what does 'm' represent?
What is the most common cost function used in linear regression?
What does an R-squared value of 0.85 indicate?
Which scikit-learn method trains a linear regression model?
What is the difference between simple and multiple linear regression?