Neurons and Perceptrons
The human brain contains approximately 86 billion neurons, each connected to thousands of others through synapses. Artificial neural networks draw inspiration from this biological architecture, using simplified mathematical models to process information. Understanding how artificial neurons work is the foundation for all deep learning concepts.
Artificial Neuron
An artificial neuron is a mathematical function that receives one or more inputs, multiplies each by a weight, sums them together with a bias term, and passes the result through an activation function to produce an output.
Mathematical Formula: output = activation(w1*x1 + w2*x2 + ... + wn*xn + b)
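As a quick numeric sketch (the input, weight, and bias values below are made up for illustration), this formula with a step activation can be computed directly:

```python
import numpy as np

# Hypothetical values for a neuron with three inputs
x = np.array([1.0, 2.0, 3.0])    # inputs x1..x3
w = np.array([0.2, -0.5, 0.1])   # weights w1..w3
b = 0.4                          # bias

weighted_sum = np.dot(w, x) + b        # 0.2 - 1.0 + 0.3 + 0.4 = -0.1
output = 1 if weighted_sum > 0 else 0  # step activation
print(weighted_sum, output)
```

Because the weighted sum is negative, the step activation outputs 0; the neuron does not "fire" for this input.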
Biological vs Artificial Neurons
While artificial neurons are inspired by biological neurons, they are significantly simplified. Biological neurons communicate through electrical impulses and chemical signals, while artificial neurons perform straightforward mathematical operations. Despite this simplification, artificial neural networks can learn remarkably complex patterns.
Biological Neuron
- Dendrites receive signals from other neurons
- Cell body processes incoming signals
- Axon transmits output signal
- Synapses connect to other neurons
Artificial Neuron
- Inputs receive data values
- Weights determine input importance
- Bias shifts the activation threshold
- Activation function produces output
The Perceptron Model
The perceptron, invented by Frank Rosenblatt in 1958, is the simplest form of a neural network. It consists of a single artificial neuron that can learn to classify inputs into two categories. Despite its simplicity, the perceptron laid the groundwork for modern deep learning.
import numpy as np

class Perceptron:
    def __init__(self, n_inputs, learning_rate=0.01):
        # Initialize weights randomly and bias to zero
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0
        self.lr = learning_rate

    def activation(self, x):
        # Step function: returns 1 if x > 0, else 0
        return 1 if x > 0 else 0

    def predict(self, inputs):
        # Weighted sum plus bias
        weighted_sum = np.dot(inputs, self.weights) + self.bias
        return self.activation(weighted_sum)  # Returns 0 or 1
Weights and Biases
Weights determine how much influence each input has on the output. A larger weight means the corresponding input has more importance. The bias allows the neuron to shift its activation threshold, enabling it to fire even when all inputs are zero.
# Demonstrating weights and bias effect
import numpy as np
inputs = np.array([0.5, 0.3, 0.2])
weights = np.array([0.4, 0.6, 0.8])
bias = -0.1
# Weighted sum calculation
weighted_sum = np.dot(inputs, weights) + bias
print(f"Inputs: {inputs}")
print(f"Weights: {weights}")
print(f"Weighted sum: {weighted_sum:.3f}") # 0.5*0.4 + 0.3*0.6 + 0.2*0.8 - 0.1 = 0.44
Training a Perceptron
The perceptron learning algorithm updates weights based on prediction errors. When the perceptron makes a wrong prediction, weights are adjusted in the direction that would have produced the correct output.
    def train(self, X, y, epochs=100):
        for epoch in range(epochs):
            errors = 0
            for inputs, target in zip(X, y):
                prediction = self.predict(inputs)
                error = target - prediction
                if error != 0:
                    # Update weights: w = w + learning_rate * error * input
                    self.weights += self.lr * error * inputs
                    self.bias += self.lr * error
                    errors += 1
            if errors == 0:
                print(f"Converged at epoch {epoch}")
                break
Complete Perceptron Example: Logic Gates
Let us train a perceptron to learn the AND logic gate. This is a classic example that demonstrates how a single perceptron can learn linearly separable patterns.
import numpy as np
# AND gate training data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1]) # AND: only 1 when both inputs are 1
# Create and train perceptron
perceptron = Perceptron(n_inputs=2, learning_rate=0.1)
perceptron.train(X, y, epochs=100)
# Test the trained perceptron
for inputs in X:
    prediction = perceptron.predict(inputs)
    print(f"{inputs} -> {prediction}")  # Should output: [0,0]->0, [0,1]->0, [1,0]->0, [1,1]->1
Multi-Layer Perceptrons (MLPs)
To overcome the limitations of single perceptrons, we stack multiple layers of neurons. This architecture, called a Multi-Layer Perceptron (MLP), can learn complex non-linear patterns. An MLP consists of an input layer, one or more hidden layers, and an output layer.
| Layer Type | Purpose | Characteristics |
|---|---|---|
| Input Layer | Receives raw input data | Number of neurons equals number of features |
| Hidden Layer(s) | Learns intermediate representations | Extract features and patterns from data |
| Output Layer | Produces final predictions | Size depends on task (1 for regression, n for n-class classification) |
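The three layer types in the table can be sketched as a tiny NumPy MLP; the layer sizes here (4 input features, 5 hidden neurons, 3 output classes) are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Input layer size = number of features; output size = number of classes
W1, b1 = rng.standard_normal((5, 4)) * 0.1, np.zeros((5, 1))  # hidden layer
W2, b2 = rng.standard_normal((3, 5)) * 0.1, np.zeros((3, 1))  # output layer

x = rng.standard_normal((4, 1))  # one sample with 4 features
h = np.maximum(0, W1 @ x + b1)   # hidden layer with ReLU
scores = W2 @ h + b2             # raw output scores, one per class
print(scores.shape)              # (3, 1)
```

Each `Dense`-style step is just a matrix multiply plus bias; the hidden layer's non-linearity is what lets the stack learn more than a single linear map.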
Practice: Neurons and Perceptrons
Task: Given inputs [0.6, 0.8], weights [0.5, 0.5], and bias -0.4, calculate the weighted sum and apply a step activation function (output 1 if sum > 0, else 0).
Show Solution
import numpy as np
inputs = np.array([0.6, 0.8])
weights = np.array([0.5, 0.5])
bias = -0.4
# Calculate weighted sum
weighted_sum = np.dot(inputs, weights) + bias
print(f"Weighted sum: {weighted_sum}") # 0.6*0.5 + 0.8*0.5 - 0.4 = 0.3
# Apply step activation
output = 1 if weighted_sum > 0 else 0
print(f"Output: {output}") # 1 (since 0.3 > 0)
Task: Modify the perceptron example to learn the OR logic gate instead of AND. The OR gate outputs 1 if at least one input is 1.
Show Solution
import numpy as np

class Perceptron:
    def __init__(self, n_inputs, learning_rate=0.1):
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0
        self.lr = learning_rate

    def predict(self, inputs):
        return 1 if np.dot(inputs, self.weights) + self.bias > 0 else 0

    def train(self, X, y, epochs=100):
        for epoch in range(epochs):
            for inputs, target in zip(X, y):
                error = target - self.predict(inputs)
                self.weights += self.lr * error * inputs
                self.bias += self.lr * error

# OR gate: output 1 if ANY input is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])

perceptron = Perceptron(2)
perceptron.train(X, y)
for inputs in X:
    print(f"{inputs} -> {perceptron.predict(inputs)}")  # 0, 1, 1, 1
Task: The XOR problem cannot be solved by a single perceptron. Implement a simple 2-layer network using NumPy with hardcoded weights that correctly computes XOR. Hint: XOR can be computed as (x1 OR x2) AND NOT(x1 AND x2).
Show Solution
import numpy as np

def step(x):
    return (x > 0).astype(int)

# XOR using 2 layers with manual weights
# Hidden layer: one neuron for OR, one for NAND
# Output layer: AND of hidden outputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer weights (2 neurons)
W1 = np.array([[1, 1],     # OR neuron
               [-1, -1]])  # NAND neuron (NOT AND)
b1 = np.array([-0.5, 1.5])  # Thresholds

# Output layer weights (1 neuron - AND)
W2 = np.array([[1], [1]])
b2 = np.array([-1.5])

for x in X:
    # Forward pass
    hidden = step(np.dot(x, W1.T) + b1)
    output = step(np.dot(hidden, W2) + b2)
    print(f"XOR{tuple(x)} = {output[0]}")  # 0, 1, 1, 0
Activation Functions
Activation functions are the secret sauce that gives neural networks their power. Without them, a neural network would simply be a linear transformation, incapable of learning complex patterns. Activation functions introduce non-linearity, allowing networks to approximate virtually any function and solve problems that linear models cannot.
Why Non-linearity Matters
If we stack linear transformations (without activation functions), the result is still a linear transformation. No matter how many layers we add, the network could only learn linear relationships.
Mathematical proof: f(g(x)) = W2(W1*x + b1) + b2 = (W2*W1)*x + (W2*b1 + b2) = W'x + b' (still linear!)
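This identity can be checked numerically with randomly chosen matrices (the shapes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
x = rng.standard_normal((3, 1))

# Two stacked linear layers with no activation in between
stacked = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer: W' = W2*W1, b' = W2*b1 + b2
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(stacked, collapsed))  # True
```

No matter how many linear layers are stacked, they always collapse into one matrix and one bias vector like this, which is why the activation in between is essential.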
Sigmoid Activation
The sigmoid function squashes any input into the range (0, 1). It was historically popular but has fallen out of favor for hidden layers due to the vanishing gradient problem. However, it is still used in output layers for binary classification.
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)  # Derivative for backpropagation

# Example values
x = np.array([-2, -1, 0, 1, 2])
print(f"Input: {x}")
print(f"Sigmoid: {sigmoid(x).round(3)}")  # [0.119, 0.269, 0.5, 0.731, 0.881]
print(f"Derivative: {sigmoid_derivative(x).round(3)}")  # [0.105, 0.197, 0.25, 0.197, 0.105]
Sigmoid Advantages
- Output bounded between 0 and 1
- Smooth gradient, always defined
- Interpretable as probability
Sigmoid Disadvantages
- Vanishing gradients for large inputs
- Outputs not zero-centered
- Computationally expensive (exponential)
Tanh (Hyperbolic Tangent)
The tanh function is similar to sigmoid but outputs values between -1 and 1, making it zero-centered. This property can help with training convergence. Like sigmoid, it also suffers from vanishing gradients for extreme inputs.
def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

# Compare with sigmoid
x = np.array([-2, -1, 0, 1, 2])
print(f"Tanh: {tanh(x).round(3)}")  # [-0.964, -0.762, 0.0, 0.762, 0.964]
print("Range: (-1, 1), Zero-centered: Yes")
ReLU (Rectified Linear Unit)
ReLU is the most widely used activation function in modern deep learning. It simply outputs the input if positive, otherwise zero. ReLU is computationally efficient, helps mitigate vanishing gradients, and enables faster training.
def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)  # 1 if x > 0, else 0

# ReLU in action
x = np.array([-2, -1, 0, 1, 2])
print(f"Input: {x}")
print(f"ReLU: {relu(x)}")  # [0, 0, 0, 1, 2]
print(f"Derivative: {relu_derivative(x)}")  # [0, 0, 0, 1, 1]
Leaky ReLU and Variants
Leaky ReLU allows a small gradient when the input is negative, preventing neurons from dying. Other variants like Parametric ReLU (PReLU) and Exponential Linear Unit (ELU) offer similar benefits with different characteristics.
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2, -1, 0, 1, 2])
print(f"Leaky ReLU (alpha=0.01): {leaky_relu(x).round(3)}")  # [-0.02, -0.01, 0, 1, 2]
print(f"ELU (alpha=1.0): {elu(x).round(3)}")  # [-0.865, -0.632, 0, 1, 2]
Softmax for Multi-class Classification
Softmax is used in the output layer for multi-class classification. It converts raw scores (logits) into probabilities that sum to 1. Each output represents the probability of the input belonging to that class.
def softmax(x):
    # Subtract max for numerical stability
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum()

# Raw scores (logits) for 3 classes
logits = np.array([2.0, 1.0, 0.1])
probabilities = softmax(logits)
print(f"Logits: {logits}")
print(f"Probabilities: {probabilities.round(3)}")  # [0.659, 0.242, 0.099]
print(f"Sum: {probabilities.sum():.3f}")  # 1.000
Activation Function Comparison
| Function | Range | Use Case | Gradient Issue |
|---|---|---|---|
| Sigmoid | (0, 1) | Binary classification output | Vanishing gradients |
| Tanh | (-1, 1) | RNNs, zero-centered outputs | Vanishing gradients |
| ReLU | [0, infinity) | Hidden layers (default choice) | Dying neurons |
| Leaky ReLU | (-infinity, infinity) | Hidden layers (prevents dying) | None significant |
| Softmax | (0, 1), sum=1 | Multi-class output layer | N/A |
Choosing the Right Activation
The choice of activation function depends on the layer position and the problem type. Here are the practical guidelines used by most practitioners today.
# Practical activation function selection
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # Hidden layers: use ReLU (or Leaky ReLU)
    layers.Dense(128, activation='relu'),
    layers.Dense(64, activation='relu'),
    # Output layer depends on the task:
    # Binary classification: sigmoid
    layers.Dense(1, activation='sigmoid'),
    # Multi-class classification: softmax
    # layers.Dense(10, activation='softmax'),
    # Regression: no activation (linear)
    # layers.Dense(1, activation=None),
])
Practice: Activation Functions
Task: Create a plot comparing ReLU and Sigmoid functions over the range [-5, 5]. Plot both functions on the same graph with proper labels and legend.
Show Solution
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-5, 5, 100)
sigmoid = 1 / (1 + np.exp(-x))
relu = np.maximum(0, x)
plt.figure(figsize=(10, 5))
plt.plot(x, sigmoid, label='Sigmoid', linewidth=2)
plt.plot(x, relu, label='ReLU', linewidth=2)
plt.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.xlabel('Input')
plt.ylabel('Output')
plt.title('Sigmoid vs ReLU Activation Functions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Task: Calculate and compare the gradients of sigmoid and ReLU for inputs [-10, -5, 0, 5, 10]. Show how sigmoid gradients become very small for extreme values while ReLU maintains useful gradients.
Show Solution
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_gradient(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu_gradient(x):
    return (x > 0).astype(float)

x_values = np.array([-10, -5, 0, 5, 10])
print("Vanishing Gradient Demonstration")
print("-" * 40)
print(f"{'x':>6} | {'Sigmoid Grad':>12} | {'ReLU Grad':>10}")
print("-" * 40)
for x in x_values:
    sig_grad = sigmoid_gradient(x)
    relu_grad = relu_gradient(np.array([x]))[0]
    print(f"{x:>6} | {sig_grad:>12.6f} | {relu_grad:>10.1f}")
# Output shows sigmoid gradient near 0 for x=-10, x=10
Task: Implement a softmax function with a temperature parameter. Show how temperature affects the probability distribution: low temperature makes it sharper (more confident), high temperature makes it flatter (more uniform).
Show Solution
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    scaled = logits / temperature
    exp_x = np.exp(scaled - np.max(scaled))
    return exp_x / exp_x.sum()

logits = np.array([2.0, 1.0, 0.5])
print("Softmax Temperature Scaling")
print("-" * 50)
for temp in [0.1, 0.5, 1.0, 2.0, 5.0]:
    probs = softmax_with_temperature(logits, temp)
    print(f"T={temp:<3} -> {probs.round(3)}")
# Low temp (0.1): essentially one-hot [1.00, 0.00, 0.00]
# High temp (5.0): more uniform [0.39, 0.32, 0.29]
Forward Propagation
Forward propagation is the process of passing input data through the neural network to generate predictions. Data flows from the input layer through hidden layers to the output layer, with each neuron computing a weighted sum and applying an activation function. Understanding this flow is essential before learning how networks are trained.
Forward Pass
The forward pass transforms input data into output predictions by sequentially applying linear transformations (weights and biases) followed by non-linear activations at each layer.
For each layer: Z = W*A_prev + b (linear), A = activation(Z) (non-linear)
Step-by-Step Forward Propagation
Let us walk through forward propagation in a simple 2-layer network (one hidden layer, one output layer). We will trace how a single input sample flows through the network to produce a prediction.
import numpy as np
# Network architecture: 3 inputs -> 4 hidden neurons -> 2 outputs
np.random.seed(42)
# Initialize weights and biases
W1 = np.random.randn(4, 3) * 0.01 # Hidden layer: 4 neurons, 3 inputs each
b1 = np.zeros((4, 1))
W2 = np.random.randn(2, 4) * 0.01 # Output layer: 2 neurons, 4 inputs each
b2 = np.zeros((2, 1))
# Sample input (3 features)
X = np.array([[0.5], [0.8], [0.2]]) # Shape: (3, 1)
print(f"Input shape: {X.shape}")
print(f"W1 shape: {W1.shape}, b1 shape: {b1.shape}")
print(f"W2 shape: {W2.shape}, b2 shape: {b2.shape}")
Layer 1: Input to Hidden
The first layer receives the raw input and transforms it. Each hidden neuron computes a weighted sum of all inputs, adds its bias, and applies the activation function.
def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward propagation through hidden layer
Z1 = np.dot(W1, X) + b1  # Linear transformation
A1 = relu(Z1)            # Apply ReLU activation
print(f"Z1 (pre-activation): shape {Z1.shape}")
print(f"A1 (post-activation): shape {A1.shape}")
print(f"Hidden layer output:\n{A1.flatten().round(4)}")
Layer 2: Hidden to Output
The output layer takes the hidden layer activations as input and produces the final predictions. For binary classification, we use sigmoid to get probabilities between 0 and 1.
# Forward propagation through output layer
Z2 = np.dot(W2, A1) + b2 # Linear transformation
A2 = sigmoid(Z2) # Sigmoid for binary classification
print(f"Z2 (pre-activation): shape {Z2.shape}")
print(f"A2 (final output): shape {A2.shape}")
print(f"Predictions: {A2.flatten().round(4)}")
Complete Forward Propagation Function
Let us encapsulate the entire forward pass in a reusable function. This function stores intermediate values (cache) which will be needed for backpropagation during training.
def forward_propagation(X, parameters):
    """
    Forward pass through a 2-layer network.
    Returns predictions and cache for backprop.
    """
    W1, b1 = parameters['W1'], parameters['b1']
    W2, b2 = parameters['W2'], parameters['b2']
    # Hidden layer
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    # Output layer
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)
    # Store values for backpropagation
    cache = {'Z1': Z1, 'A1': A1, 'Z2': Z2, 'A2': A2}
    return A2, cache

# Package parameters
params = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
predictions, cache = forward_propagation(X, params)
print(f"Final predictions: {predictions.flatten()}")
Batch Processing
In practice, we process multiple samples simultaneously (batch processing). This is more efficient because matrix operations can be parallelized. The forward propagation code works the same way, just with different input dimensions.
# Batch of 5 samples, each with 3 features
X_batch = np.random.randn(3, 5) # Shape: (features, samples)
predictions_batch, _ = forward_propagation(X_batch, params)
print(f"Input batch shape: {X_batch.shape}")
print(f"Output shape: {predictions_batch.shape}") # (2, 5) - 2 outputs per sample
print(f"Predictions for 5 samples:\n{predictions_batch.round(4)}")
Computing the Loss
After forward propagation, we compare predictions to actual labels using a loss function. For binary classification, we use binary cross-entropy loss. The goal of training is to minimize this loss.
def compute_loss(A2, Y):
    """
    Binary cross-entropy loss.
    A2: predictions, shape (1, m)
    Y: true labels, shape (1, m)
    """
    epsilon = 1e-8  # Prevent log(0)
    loss = -np.mean(Y * np.log(A2 + epsilon) + (1 - Y) * np.log(1 - A2 + epsilon))
    return loss

# Example: 5 samples with binary labels
Y = np.array([[1, 0, 1, 1, 0]])             # True labels
A2 = np.array([[0.9, 0.2, 0.8, 0.7, 0.1]])  # Predictions
loss = compute_loss(A2, Y)
print(f"Predictions: {A2}")
print(f"True labels: {Y}")
print(f"Binary Cross-Entropy Loss: {loss:.4f}")  # Lower is better
| Step | Operation | Formula |
|---|---|---|
| 1. Linear (Layer 1) | Weighted sum + bias | Z1 = W1 * X + b1 |
| 2. Activation (Layer 1) | Apply ReLU | A1 = max(0, Z1) |
| 3. Linear (Layer 2) | Weighted sum + bias | Z2 = W2 * A1 + b2 |
| 4. Activation (Layer 2) | Apply Sigmoid | A2 = sigmoid(Z2) |
| 5. Loss Computation | Compare with labels | L = cross_entropy(A2, Y) |
Practice: Forward Propagation
Task: Given input X=[1, 2], weights W=[[0.5, 0.5], [0.3, 0.7]], and bias b=[0, 0], manually calculate Z, then apply ReLU to get the hidden layer output.
Show Solution
import numpy as np
X = np.array([[1], [2]]) # Shape (2, 1)
W = np.array([[0.5, 0.5],
[0.3, 0.7]]) # Shape (2, 2)
b = np.array([[0], [0]]) # Shape (2, 1)
# Linear transformation
Z = np.dot(W, X) + b
print(f"Z = W @ X + b")
print(f"Z[0] = 0.5*1 + 0.5*2 + 0 = {Z[0,0]}") # 1.5
print(f"Z[1] = 0.3*1 + 0.7*2 + 0 = {Z[1,0]}") # 1.7
# ReLU activation
A = np.maximum(0, Z)
print(f"\nA = ReLU(Z) = {A.flatten()}") # [1.5, 1.7]
Task: Extend the forward propagation to handle a 3-layer network (2 hidden layers + output). Use ReLU for hidden layers and sigmoid for output.
Show Solution
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_3layer(X, params):
    W1, b1 = params['W1'], params['b1']
    W2, b2 = params['W2'], params['b2']
    W3, b3 = params['W3'], params['b3']
    # Hidden layer 1
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    # Hidden layer 2
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    # Output layer
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)
    return A3

# Example: 4 inputs -> 8 hidden -> 4 hidden -> 1 output
params = {
    'W1': np.random.randn(8, 4) * 0.01, 'b1': np.zeros((8, 1)),
    'W2': np.random.randn(4, 8) * 0.01, 'b2': np.zeros((4, 1)),
    'W3': np.random.randn(1, 4) * 0.01, 'b3': np.zeros((1, 1))
}
X = np.random.randn(4, 1)
output = forward_3layer(X, params)
print(f"Output: {output[0,0]:.4f}")
Task: Create a general forward propagation function that works for networks with any number of layers. Store parameters as a list and use a loop to process each layer.
Show Solution
import numpy as np

def forward_deep(X, parameters, activations):
    """
    Forward propagation for arbitrary depth.
    parameters: list of (W, b) tuples
    activations: list of activation function names
    """
    A = X
    caches = []
    for (W, b), act in zip(parameters, activations):
        Z = np.dot(W, A) + b
        if act == 'relu':
            A = np.maximum(0, Z)
        elif act == 'sigmoid':
            A = 1 / (1 + np.exp(-Z))
        elif act == 'tanh':
            A = np.tanh(Z)
        else:
            A = Z  # Linear
        caches.append((Z, A))
    return A, caches

# 5-layer network: 10 -> 64 -> 32 -> 16 -> 8 -> 1
layer_dims = [10, 64, 32, 16, 8, 1]
params = [(np.random.randn(layer_dims[i+1], layer_dims[i]) * 0.01,
           np.zeros((layer_dims[i+1], 1)))
          for i in range(len(layer_dims)-1)]
acts = ['relu', 'relu', 'relu', 'relu', 'sigmoid']
X = np.random.randn(10, 1)
output, _ = forward_deep(X, params, acts)
print(f"5-layer output: {output[0,0]:.4f}")
Backpropagation and Gradient Descent
Backpropagation is the algorithm that enables neural networks to learn. It calculates how much each weight contributed to the prediction error, then uses gradient descent to update weights in the direction that reduces the error. This elegant combination of calculus and optimization is what makes deep learning possible.
Backpropagation
Backpropagation computes the gradient of the loss function with respect to each weight by applying the chain rule of calculus, propagating error gradients backward from the output layer to the input layer.
Chain Rule: dL/dW = dL/dA * dA/dZ * dZ/dW
The Chain Rule of Calculus
The chain rule allows us to compute derivatives of composite functions. In neural networks, the output is a composition of many functions (layers). The chain rule tells us how to find the derivative of this composition with respect to any parameter.
# Chain rule example: f(g(x)) derivative
# If y = f(g(x)), then dy/dx = df/dg * dg/dx
import numpy as np
# Example: y = (2x + 1)^2
# Let g(x) = 2x + 1, f(g) = g^2
# dy/dx = df/dg * dg/dx = 2g * 2 = 4(2x + 1)
x = 3
g = 2 * x + 1 # g = 7
df_dg = 2 * g # = 14
dg_dx = 2
dy_dx = df_dg * dg_dx # = 28
print(f"x = {x}")
print(f"dy/dx using chain rule: {dy_dx}") # 28
Computing Gradients for Output Layer
We start from the output and work backward. First, we compute the gradient of the loss with respect to the output layer's pre-activation (Z2), then use that to find gradients for W2 and b2.
def backward_output_layer(A2, Y, A1):
    """
    Compute gradients for output layer (with sigmoid activation).
    A2: predictions, Y: true labels, A1: previous layer activations
    """
    m = Y.shape[1]  # Number of samples
    # Gradient of loss w.r.t. Z2 (simplified for sigmoid + cross-entropy)
    dZ2 = A2 - Y  # Shape: (1, m)
    # Gradient of loss w.r.t. W2
    dW2 = (1/m) * np.dot(dZ2, A1.T)  # Shape: (1, hidden_size)
    # Gradient of loss w.r.t. b2
    db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)  # Shape: (1, 1)
    return dZ2, dW2, db2
Computing Gradients for Hidden Layer
The hidden layer gradients are computed by propagating the error backward through the network. We use the gradients from the output layer (dZ2) and the weights connecting hidden to output (W2).
def relu_derivative(Z):
    return (Z > 0).astype(float)

def backward_hidden_layer(dZ2, W2, Z1, X):
    """
    Compute gradients for hidden layer (with ReLU activation).
    """
    m = X.shape[1]
    # Propagate gradient backward through W2
    dA1 = np.dot(W2.T, dZ2)
    # Gradient through ReLU
    dZ1 = dA1 * relu_derivative(Z1)
    # Gradients for W1 and b1
    dW1 = (1/m) * np.dot(dZ1, X.T)
    db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
    return dW1, db1
Complete Backpropagation
Now let us combine everything into a complete backpropagation function that computes gradients for all parameters in our 2-layer network.
def backward_propagation(X, Y, params, cache):
    """
    Complete backpropagation for a 2-layer network.
    Returns gradients for all parameters.
    """
    m = X.shape[1]
    W1, W2 = params['W1'], params['W2']
    Z1, A1, Z2, A2 = cache['Z1'], cache['A1'], cache['Z2'], cache['A2']
    # Output layer gradients
    dZ2 = A2 - Y
    dW2 = (1/m) * np.dot(dZ2, A1.T)
    db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
    # Hidden layer gradients
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = dA1 * (Z1 > 0)  # ReLU derivative
    dW1 = (1/m) * np.dot(dZ1, X.T)
    db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
    gradients = {'dW1': dW1, 'db1': db1, 'dW2': dW2, 'db2': db2}
    return gradients
Gradient Descent Update
Once we have gradients, we update weights using gradient descent. The learning rate controls how large each update step is. We subtract the gradient (times learning rate) because we want to move in the direction that decreases the loss.
def update_parameters(params, gradients, learning_rate):
    """
    Update parameters using gradient descent.
    New weight = old weight - learning_rate * gradient
    """
    params['W1'] -= learning_rate * gradients['dW1']
    params['b1'] -= learning_rate * gradients['db1']
    params['W2'] -= learning_rate * gradients['dW2']
    params['b2'] -= learning_rate * gradients['db2']
    return params

# Example update
learning_rate = 0.01
params = update_parameters(params, gradients, learning_rate)
print("Parameters updated!")
Learning Rate Too High
- Updates are too large
- May overshoot the minimum
- Loss can oscillate or diverge
- Training becomes unstable
Learning Rate Too Low
- Updates are too small
- Training is very slow
- May stall on plateaus or in shallow local minima
- Requires many more iterations
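Both failure modes show up even on the one-parameter loss L(w) = w^2, whose gradient is 2w; the learning rates below are illustrative choices:

```python
def descend(lr, steps=20, w=5.0):
    """Run gradient descent on L(w) = w^2 starting from w = 5."""
    for _ in range(steps):
        w -= lr * 2 * w  # gradient of w^2 is 2w
    return w

print(descend(0.1))    # converges close to the minimum at w = 0
print(descend(1.1))    # each step overshoots: |w| grows, loss diverges
print(descend(0.001))  # barely moves from the starting point
```

With lr = 0.1 each step multiplies w by 0.8, so it shrinks steadily toward 0; with lr = 1.1 each step multiplies w by -1.2, so the iterate oscillates with growing magnitude; with lr = 0.001 the factor 0.998 leaves w almost unchanged after 20 steps.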
The Complete Training Loop
Training a neural network involves repeating forward propagation, loss computation, backpropagation, and parameter updates for many iterations (epochs). Here is a complete training loop that brings everything together.
def train_network(X, Y, layer_sizes, learning_rate=0.01, epochs=1000):
    """
    Complete training loop for a 2-layer neural network.
    """
    np.random.seed(42)
    n_input, n_hidden, n_output = layer_sizes
    # Initialize parameters
    params = {
        'W1': np.random.randn(n_hidden, n_input) * 0.01,
        'b1': np.zeros((n_hidden, 1)),
        'W2': np.random.randn(n_output, n_hidden) * 0.01,
        'b2': np.zeros((n_output, 1))
    }
    losses = []
    for epoch in range(epochs):
        # Forward propagation
        A2, cache = forward_propagation(X, params)
        # Compute loss
        loss = compute_loss(A2, Y)
        losses.append(loss)
        # Backward propagation
        gradients = backward_propagation(X, Y, params, cache)
        # Update parameters
        params = update_parameters(params, gradients, learning_rate)
        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Loss = {loss:.4f}")
    return params, losses
Gradient Descent Variants
Standard gradient descent uses all training samples to compute each update. In practice, we use variants that are more efficient and often lead to better convergence.
| Variant | Batch Size | Characteristics |
|---|---|---|
| Batch GD | All samples | Stable but slow, memory intensive |
| Stochastic GD | 1 sample | Fast updates, noisy gradients |
| Mini-batch GD | 32-256 samples | Best of both worlds, most common |
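A mini-batch loop can be sketched on a toy one-parameter regression problem; the data, learning rate, and batch size below are made-up illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3x plus a little noise
X = rng.standard_normal(200)
y = 3 * X + 0.1 * rng.standard_normal(200)

w, lr, batch_size = 0.0, 0.1, 32
for epoch in range(20):
    perm = rng.permutation(len(X))            # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]  # one mini-batch
        grad = np.mean(2 * (w * X[idx] - y[idx]) * X[idx])  # d(MSE)/dw on the batch
        w -= lr * grad
print(round(w, 2))  # recovers a slope close to 3
```

Each update uses only one mini-batch's gradient, so the iterates are slightly noisy but far cheaper per step than full-batch descent; setting `batch_size = 1` gives stochastic GD and `batch_size = len(X)` gives batch GD.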
Putting It All Together
Let us train a complete neural network on a simple classification problem to see all the concepts in action.
# Generate simple classification data
np.random.seed(42)
X = np.random.randn(2, 200) # 200 samples, 2 features
Y = ((X[0] + X[1]) > 0).astype(int).reshape(1, -1) # Label based on sum
# Train the network
layer_sizes = (2, 4, 1) # 2 inputs, 4 hidden, 1 output
params, losses = train_network(X, Y, layer_sizes, learning_rate=0.5, epochs=1000)
# Evaluate
predictions, _ = forward_propagation(X, params)
accuracy = np.mean((predictions > 0.5) == Y)
print(f"\nFinal Accuracy: {accuracy * 100:.1f}%")
Practice: Backpropagation and Gradient Descent
Task: For the function y = (x*w + b)^2, compute dy/dw and dy/db using the chain rule. Then verify numerically with x=2, w=3, b=1.
Show Solution
# Analytical solution using chain rule:
# Let z = x*w + b, then y = z^2
# dy/dz = 2z
# dz/dw = x
# dz/db = 1
# dy/dw = dy/dz * dz/dw = 2z * x = 2(x*w + b) * x
# dy/db = dy/dz * dz/db = 2z * 1 = 2(x*w + b)
x, w, b = 2, 3, 1
z = x * w + b # z = 7
y = z ** 2 # y = 49
dy_dw = 2 * z * x # 2 * 7 * 2 = 28
dy_db = 2 * z # 2 * 7 = 14
print(f"z = x*w + b = {z}")
print(f"y = z^2 = {y}")
print(f"dy/dw = {dy_dw}")
print(f"dy/db = {dy_db}")
# Numerical verification (small perturbation)
eps = 1e-5
y_plus_w = ((x * (w + eps) + b) ** 2)
numerical_dy_dw = (y_plus_w - y) / eps
print(f"Numerical dy/dw: {numerical_dy_dw:.1f}")
Task: Implement numerical gradient checking to verify backpropagation is correct. Compare analytical gradients from backprop with numerical gradients computed using finite differences.
Show Solution
import numpy as np

def numerical_gradient(params, X, Y, param_name, i, j, eps=1e-7):
    """Compute a numerical gradient using central finite differences."""
    original = params[param_name][i, j]
    # f(x + eps)
    params[param_name][i, j] = original + eps
    A2_plus, _ = forward_propagation(X, params)
    loss_plus = compute_loss(A2_plus, Y)
    # f(x - eps)
    params[param_name][i, j] = original - eps
    A2_minus, _ = forward_propagation(X, params)
    loss_minus = compute_loss(A2_minus, Y)
    # Restore the original value
    params[param_name][i, j] = original
    return (loss_plus - loss_minus) / (2 * eps)

# Compare gradients
X = np.random.randn(3, 5)
Y = np.random.randint(0, 2, (1, 5))
A2, cache = forward_propagation(X, params)
grads = backward_propagation(X, Y, params, cache)
# Check W1[0, 0]
num_grad = numerical_gradient(params, X, Y, 'W1', 0, 0)
ana_grad = grads['dW1'][0, 0]
diff = abs(num_grad - ana_grad) / max(abs(num_grad), abs(ana_grad))
print(f"Numerical: {num_grad:.6f}, Analytical: {ana_grad:.6f}")
print(f"Relative difference: {diff:.2e}")
Task: Create a 3D visualization showing gradient descent on a simple loss surface L(w1, w2) = w1^2 + w2^2. Plot the path taken by gradient descent from a random starting point to the minimum at (0, 0).
Show Solution
import numpy as np
import matplotlib.pyplot as plt

def loss(w1, w2):
    return w1**2 + w2**2

def gradient(w1, w2):
    return 2*w1, 2*w2

# Gradient descent
w1, w2 = 4.0, 3.0
lr = 0.1
path = [(w1, w2, loss(w1, w2))]
for _ in range(20):
    dw1, dw2 = gradient(w1, w2)
    w1 -= lr * dw1
    w2 -= lr * dw2
    path.append((w1, w2, loss(w1, w2)))

# Plotting
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
# Loss surface
W1, W2 = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
L = loss(W1, W2)
ax.plot_surface(W1, W2, L, alpha=0.5, cmap='viridis')
# Gradient descent path
path = np.array(path)
ax.plot(path[:, 0], path[:, 1], path[:, 2], 'r.-', markersize=10, linewidth=2)
ax.set_xlabel('w1'); ax.set_ylabel('w2'); ax.set_zlabel('Loss')
plt.title('Gradient Descent on Loss Surface')
plt.show()
Key Takeaways
Biological Inspiration
Artificial neurons mimic biological neurons, receiving inputs, applying weights, and producing outputs through activation
Perceptron Foundation
The perceptron is the simplest neural network unit, combining weighted inputs with a bias and activation function
Non-linear Activations
Activation functions like ReLU, Sigmoid, and Tanh enable networks to learn complex non-linear patterns
Forward Propagation
Data flows forward through layers, with each neuron computing weighted sums and applying activations
Backpropagation
Errors propagate backward through the network, computing gradients for each weight using the chain rule
Gradient Descent
Weights are updated iteratively using gradients, moving down the loss surface toward optimal values
Knowledge Check
Test your understanding of neural network fundamentals: