Module 3.1

Neural Networks Fundamentals

Explore the building blocks of deep learning. Understand how artificial neurons work, how activation functions introduce non-linearity, and how networks learn through forward propagation and backpropagation with gradient descent.

55 min read
Intermediate
Hands-on
What You'll Learn
  • Biological inspiration and artificial neurons
  • Perceptron model and multi-layer networks
  • Activation functions (ReLU, Sigmoid, Tanh)
  • Forward propagation through networks
  • Backpropagation and gradient descent
Contents
01

Neurons and Perceptrons

The human brain contains approximately 86 billion neurons, each connected to thousands of others through synapses. Artificial neural networks draw inspiration from this biological architecture, using simplified mathematical models to process information. Understanding how artificial neurons work is the foundation for all deep learning concepts.

Key Concept

Artificial Neuron

An artificial neuron is a mathematical function that receives one or more inputs, multiplies each by a weight, sums them together with a bias term, and passes the result through an activation function to produce an output.

Mathematical Formula: output = activation(w1*x1 + w2*x2 + ... + wn*xn + b)

Biological vs Artificial Neurons

While artificial neurons are inspired by biological neurons, they are significantly simplified. Biological neurons communicate through electrical impulses and chemical signals, while artificial neurons perform straightforward mathematical operations. Despite this simplification, artificial neural networks can learn remarkably complex patterns.

Biological Neuron

  • Dendrites receive signals from other neurons
  • Cell body processes incoming signals
  • Axon transmits output signal
  • Synapses connect to other neurons

Artificial Neuron

  • Inputs receive data values
  • Weights determine input importance
  • Bias shifts the activation threshold
  • Activation function produces output

The Perceptron Model

The perceptron, invented by Frank Rosenblatt in 1958, is the simplest form of a neural network. It consists of a single artificial neuron that can learn to classify inputs into two categories. Despite its simplicity, the perceptron laid the groundwork for modern deep learning.

import numpy as np

class Perceptron:
    def __init__(self, n_inputs, learning_rate=0.01):
        # Initialize weights randomly and bias to zero
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0
        self.lr = learning_rate
    
    def activation(self, x):
        # Step function: returns 1 if x > 0, else 0
        return 1 if x > 0 else 0
    
    def predict(self, inputs):
        # Weighted sum plus bias
        weighted_sum = np.dot(inputs, self.weights) + self.bias
        return self.activation(weighted_sum)  # Returns 0 or 1

Weights and Biases

Weights determine how much influence each input has on the output. A larger weight means the corresponding input has more importance. The bias allows the neuron to shift its activation threshold, enabling it to fire even when all inputs are zero.

# Demonstrating weights and bias effect
import numpy as np

inputs = np.array([0.5, 0.3, 0.2])
weights = np.array([0.4, 0.6, 0.8])
bias = -0.1

# Weighted sum calculation
weighted_sum = np.dot(inputs, weights) + bias
print(f"Inputs: {inputs}")
print(f"Weights: {weights}")
print(f"Weighted sum: {weighted_sum:.3f}")  # 0.5*0.4 + 0.3*0.6 + 0.2*0.8 - 0.1 = 0.44
Learning Process: During training, the network adjusts weights and biases to minimize the difference between predicted and actual outputs. This is how neural networks "learn" from data.

Training a Perceptron

The perceptron learning algorithm updates weights based on prediction errors. When the perceptron makes a wrong prediction, weights are adjusted in the direction that would have produced the correct output.

# train method of the Perceptron class defined above
def train(self, X, y, epochs=100):
    for epoch in range(epochs):
        errors = 0
        for inputs, target in zip(X, y):
            prediction = self.predict(inputs)
            error = target - prediction
            
            if error != 0:
                # Update weights: w = w + learning_rate * error * input
                self.weights += self.lr * error * inputs
                self.bias += self.lr * error
                errors += 1
        
        if errors == 0:
            print(f"Converged at epoch {epoch}")
            break

Complete Perceptron Example: Logic Gates

Let us train a perceptron to learn the AND logic gate. This is a classic example that demonstrates how a single perceptron can learn linearly separable patterns.

import numpy as np

# AND gate training data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])  # AND: only 1 when both inputs are 1

# Create and train perceptron
perceptron = Perceptron(n_inputs=2, learning_rate=0.1)
perceptron.train(X, y, epochs=100)

# Test the trained perceptron
for inputs in X:
    prediction = perceptron.predict(inputs)
    print(f"{inputs} -> {prediction}")  # Should output: [0,0]->0, [0,1]->0, [1,0]->0, [1,1]->1
Limitation: A single perceptron can only learn linearly separable patterns. It cannot learn XOR (exclusive or) because XOR requires a non-linear decision boundary.

Multi-Layer Perceptrons (MLPs)

To overcome the limitations of single perceptrons, we stack multiple layers of neurons. This architecture, called a Multi-Layer Perceptron (MLP), can learn complex non-linear patterns. An MLP consists of an input layer, one or more hidden layers, and an output layer.

Layer Type | Purpose | Characteristics
Input Layer | Receives raw input data | Number of neurons equals number of features
Hidden Layer(s) | Learns intermediate representations | Extract features and patterns from data
Output Layer | Produces final predictions | Size depends on task (1 for regression, n for n-class classification)
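To make the table concrete, here is a small sketch of the parameter bookkeeping for a hypothetical 4-8-3 MLP (the layer sizes are chosen purely for illustration):

```python
import numpy as np

# Hypothetical architecture: 4 input features -> 8 hidden -> 3 outputs
layer_sizes = [4, 8, 3]

rng = np.random.default_rng(0)
# One weight matrix of shape (n_out, n_in) and one bias vector per layer
weights = [rng.standard_normal((n_out, n_in)) * 0.01
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in layer_sizes[1:]]

for i, (W, b) in enumerate(zip(weights, biases), start=1):
    print(f"Layer {i}: W {W.shape}, b {b.shape}, "
          f"{W.size + b.size} trainable parameters")
# Layer 1: W (8, 4), b (8, 1), 40 trainable parameters
# Layer 2: W (3, 8), b (3, 1), 27 trainable parameters
```

Counting parameters this way is a quick sanity check before training: each layer contributes n_out * n_in weights plus n_out biases.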

Practice: Neurons and Perceptrons

Task: Given inputs [0.6, 0.8], weights [0.5, 0.5], and bias -0.4, calculate the weighted sum and apply a step activation function (output 1 if sum > 0, else 0).

Solution:
import numpy as np

inputs = np.array([0.6, 0.8])
weights = np.array([0.5, 0.5])
bias = -0.4

# Calculate weighted sum
weighted_sum = np.dot(inputs, weights) + bias
print(f"Weighted sum: {weighted_sum:.2f}")  # 0.6*0.5 + 0.8*0.5 - 0.4 = 0.3

# Apply step activation
output = 1 if weighted_sum > 0 else 0
print(f"Output: {output}")  # 1 (since 0.3 > 0)

Task: Modify the perceptron example to learn the OR logic gate instead of AND. The OR gate outputs 1 if at least one input is 1.

Solution:
import numpy as np

class Perceptron:
    def __init__(self, n_inputs, learning_rate=0.1):
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0
        self.lr = learning_rate
    
    def predict(self, inputs):
        return 1 if np.dot(inputs, self.weights) + self.bias > 0 else 0
    
    def train(self, X, y, epochs=100):
        for epoch in range(epochs):
            for inputs, target in zip(X, y):
                error = target - self.predict(inputs)
                self.weights += self.lr * error * inputs
                self.bias += self.lr * error

# OR gate: output 1 if ANY input is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])

perceptron = Perceptron(2)
perceptron.train(X, y)

for inputs in X:
    print(f"{inputs} -> {perceptron.predict(inputs)}")  # 0,1,1,1

Task: The XOR problem cannot be solved by a single perceptron. Implement a simple 2-layer network using NumPy with hardcoded weights that correctly computes XOR. Hint: XOR can be computed as (x1 OR x2) AND NOT(x1 AND x2).

Solution:
import numpy as np

def step(x):
    return (x > 0).astype(int)

# XOR using 2 layers with manual weights
# Hidden layer: one neuron for OR, one for NAND
# Output layer: AND of hidden outputs

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer weights (2 neurons)
W1 = np.array([[1, 1],      # OR neuron
               [-1, -1]])    # NAND neuron (NOT AND)
b1 = np.array([-0.5, 1.5])  # Thresholds

# Output layer weights (1 neuron - AND)
W2 = np.array([[1], [1]])
b2 = np.array([-1.5])

for x in X:
    # Forward pass
    hidden = step(np.dot(x, W1.T) + b1)
    output = step(np.dot(hidden, W2) + b2)
    print(f"XOR{tuple(x)} = {output[0]}")  # 0,1,1,0
02

Activation Functions

Activation functions are the secret sauce that gives neural networks their power. Without them, a neural network would simply be a linear transformation, incapable of learning complex patterns. Activation functions introduce non-linearity, allowing networks to approximate virtually any function and solve problems that linear models cannot.

Key Concept

Why Non-linearity Matters

If we stack linear transformations (without activation functions), the result is still a linear transformation. No matter how many layers we add, the network could only learn linear relationships.

Mathematical proof: f(g(x)) = W2(W1*x + b1) + b2 = (W2*W1)*x + (W2*b1 + b2) = W'x + b' (still linear!)
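The collapse is easy to verify numerically. The sketch below builds two random linear layers (toy sizes, no activation in between) and checks that a single combined layer W' = W2*W1, b' = W2*b1 + b2 produces identical outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two linear layers with toy sizes: 3 -> 4 -> 2, no activation
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

x = rng.standard_normal(3)

# Stacked linear layers
two_layer = W2 @ (W1 @ x + b1) + b2

# Equivalent single linear layer
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ x + b_prime

print(np.allclose(two_layer, one_layer))  # True
```

No matter how many linear layers you stack, the same collapse applies, which is why the non-linear activation between layers is essential.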

Sigmoid Activation

The sigmoid function squashes any input into the range (0, 1). It was historically popular but has fallen out of favor for hidden layers due to the vanishing gradient problem. However, it is still used in output layers for binary classification.

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)  # Derivative for backpropagation

# Example values
x = np.array([-2, -1, 0, 1, 2])
print(f"Input: {x}")
print(f"Sigmoid: {sigmoid(x).round(3)}")  # [0.119, 0.269, 0.5, 0.731, 0.881]
print(f"Derivative: {sigmoid_derivative(x).round(3)}")  # [0.105, 0.197, 0.25, 0.197, 0.105]

Sigmoid Advantages

  • Output bounded between 0 and 1
  • Smooth gradient, always defined
  • Interpretable as probability

Sigmoid Disadvantages

  • Vanishing gradients for large inputs
  • Outputs not zero-centered
  • Computationally expensive (exponential)

Tanh (Hyperbolic Tangent)

The tanh function is similar to sigmoid but outputs values between -1 and 1, making it zero-centered. This property can help with training convergence. Like sigmoid, it also suffers from vanishing gradients for extreme inputs.

def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

# Compare sigmoid and tanh
x = np.array([-2, -1, 0, 1, 2])
print(f"Tanh: {tanh(x).round(3)}")  # [-0.964, -0.762, 0.0, 0.762, 0.964]
print(f"Range: (-1, 1), Zero-centered: Yes")
Relationship: tanh(x) = 2 * sigmoid(2x) - 1. Tanh is essentially a scaled and shifted version of sigmoid.
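This identity is easy to confirm numerically:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Check tanh(x) = 2 * sigmoid(2x) - 1 over a range of inputs
x = np.linspace(-3, 3, 7)
lhs = np.tanh(x)
rhs = 2 * sigmoid(2 * x) - 1

print(np.allclose(lhs, rhs))  # True
```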

ReLU (Rectified Linear Unit)

ReLU is the most widely used activation function in modern deep learning. It simply outputs the input if positive, otherwise zero. ReLU is computationally efficient, helps mitigate vanishing gradients, and enables faster training.

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)  # 1 if x > 0, else 0

# ReLU in action
x = np.array([-2, -1, 0, 1, 2])
print(f"Input: {x}")
print(f"ReLU: {relu(x)}")  # [0, 0, 0, 1, 2]
print(f"Derivative: {relu_derivative(x)}")  # [0, 0, 0, 1, 1]
Dying ReLU Problem: If a neuron's pre-activation (weighted sum) becomes negative for every training input, ReLU outputs 0 and its gradient is also 0, so the weights never update. The neuron "dies" and stops learning. Leaky ReLU addresses this issue.
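A minimal sketch of a dead neuron, using hypothetical weights that training has driven far negative (so that with non-negative inputs the pre-activation is negative for every sample):

```python
import numpy as np

np.random.seed(0)
X = np.random.rand(100, 3)        # 100 samples, features in [0, 1)
w = np.array([-5.0, -5.0, -5.0])  # hypothetical weights pushed negative
b = -1.0

z = X @ w + b                     # pre-activation: always < 0 here
a = np.maximum(0, z)              # ReLU output: all zeros
grad = (z > 0).astype(float)      # ReLU gradient: all zeros

print(f"Nonzero outputs: {int((a > 0).sum())}, "
      f"nonzero gradients: {int(grad.sum())}")  # 0, 0
# Zero gradient means zero weight updates -- the neuron cannot recover.
```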

Leaky ReLU and Variants

Leaky ReLU allows a small gradient when the input is negative, preventing neurons from dying. Other variants like Parametric ReLU (PReLU) and Exponential Linear Unit (ELU) offer similar benefits with different characteristics.

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2, -1, 0, 1, 2])
print(f"Leaky ReLU (alpha=0.01): {leaky_relu(x).round(3)}")  # [-0.02, -0.01, 0, 1, 2]
print(f"ELU (alpha=1.0): {elu(x).round(3)}")  # [-0.865, -0.632, 0, 1, 2]

Softmax for Multi-class Classification

Softmax is used in the output layer for multi-class classification. It converts raw scores (logits) into probabilities that sum to 1. Each output represents the probability of the input belonging to that class.

def softmax(x):
    # Subtract max for numerical stability
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum()

# Raw scores (logits) for 3 classes
logits = np.array([2.0, 1.0, 0.1])
probabilities = softmax(logits)

print(f"Logits: {logits}")
print(f"Probabilities: {probabilities.round(3)}")  # [0.659, 0.242, 0.099]
print(f"Sum: {probabilities.sum():.3f}")  # 1.000

Activation Function Comparison

Function | Range | Use Case | Gradient Issue
Sigmoid | (0, 1) | Binary classification output | Vanishing gradients
Tanh | (-1, 1) | RNNs, zero-centered outputs | Vanishing gradients
ReLU | [0, infinity) | Hidden layers (default choice) | Dying neurons
Leaky ReLU | (-infinity, infinity) | Hidden layers (prevents dying) | None significant
Softmax | (0, 1), sum = 1 | Multi-class output layer | N/A

Choosing the Right Activation

The choice of activation function depends on the layer position and the problem type. Here are the practical guidelines used by most practitioners today.

# Practical activation function selection
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # Hidden layers: Use ReLU (or Leaky ReLU)
    layers.Dense(128, activation='relu'),
    layers.Dense(64, activation='relu'),
    
    # Output layer depends on task:
    # Binary classification: sigmoid
    layers.Dense(1, activation='sigmoid'),
    
    # Multi-class classification: softmax
    # layers.Dense(10, activation='softmax'),
    
    # Regression: no activation (linear)
    # layers.Dense(1, activation=None),
])

Practice: Activation Functions

Task: Create a plot comparing ReLU and Sigmoid functions over the range [-5, 5]. Plot both functions on the same graph with proper labels and legend.

Solution:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 100)
sigmoid = 1 / (1 + np.exp(-x))
relu = np.maximum(0, x)

plt.figure(figsize=(10, 5))
plt.plot(x, sigmoid, label='Sigmoid', linewidth=2)
plt.plot(x, relu, label='ReLU', linewidth=2)
plt.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.xlabel('Input')
plt.ylabel('Output')
plt.title('Sigmoid vs ReLU Activation Functions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Task: Calculate and compare the gradients of sigmoid and ReLU for inputs [-10, -5, 0, 5, 10]. Show how sigmoid gradients become very small for extreme values while ReLU maintains useful gradients.

Solution:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_gradient(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu_gradient(x):
    return (x > 0).astype(float)

x_values = np.array([-10, -5, 0, 5, 10])

print("Vanishing Gradient Demonstration")
print("-" * 40)
print(f"{'x':>6} | {'Sigmoid Grad':>12} | {'ReLU Grad':>10}")
print("-" * 40)

for x in x_values:
    sig_grad = sigmoid_gradient(x)
    relu_grad = relu_gradient(np.array([x]))[0]
    print(f"{x:>6} | {sig_grad:>12.6f} | {relu_grad:>10.1f}")
    
# Output shows sigmoid gradient near 0 for x=-10, x=10

Task: Implement a softmax function with a temperature parameter. Show how temperature affects the probability distribution: low temperature makes it sharper (more confident), high temperature makes it flatter (more uniform).

Solution:
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    scaled = logits / temperature
    exp_x = np.exp(scaled - np.max(scaled))
    return exp_x / exp_x.sum()

logits = np.array([2.0, 1.0, 0.5])

print("Softmax Temperature Scaling")
print("-" * 50)
for temp in [0.1, 0.5, 1.0, 2.0, 5.0]:
    probs = softmax_with_temperature(logits, temp)
    print(f"T={temp:<3} -> {probs.round(3)}")

# Low temp (0.1): essentially one-hot [1.0, 0.0, 0.0]
# High temp (5.0): more uniform [0.39, 0.32, 0.29]

03

Forward Propagation

Forward propagation is the process of passing input data through the neural network to generate predictions. Data flows from the input layer through hidden layers to the output layer, with each neuron computing a weighted sum and applying an activation function. Understanding this flow is essential before learning how networks are trained.

Key Concept

Forward Pass

The forward pass transforms input data into output predictions by sequentially applying linear transformations (weights and biases) followed by non-linear activations at each layer.

For each layer: Z = W*A_prev + b (linear), A = activation(Z) (non-linear)

Step-by-Step Forward Propagation

Let us walk through forward propagation in a simple 2-layer network (one hidden layer, one output layer). We will trace how a single input sample flows through the network to produce a prediction.

import numpy as np

# Network architecture: 3 inputs -> 4 hidden neurons -> 2 outputs
np.random.seed(42)

# Initialize weights and biases
W1 = np.random.randn(4, 3) * 0.01  # Hidden layer: 4 neurons, 3 inputs each
b1 = np.zeros((4, 1))
W2 = np.random.randn(2, 4) * 0.01  # Output layer: 2 neurons, 4 inputs each
b2 = np.zeros((2, 1))

# Sample input (3 features)
X = np.array([[0.5], [0.8], [0.2]])  # Shape: (3, 1)

print(f"Input shape: {X.shape}")
print(f"W1 shape: {W1.shape}, b1 shape: {b1.shape}")
print(f"W2 shape: {W2.shape}, b2 shape: {b2.shape}")

Layer 1: Input to Hidden

The first layer receives the raw input and transforms it. Each hidden neuron computes a weighted sum of all inputs, adds its bias, and applies the activation function.

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward propagation through hidden layer
Z1 = np.dot(W1, X) + b1  # Linear transformation
A1 = relu(Z1)            # Apply ReLU activation

print(f"Z1 (pre-activation): shape {Z1.shape}")
print(f"A1 (post-activation): shape {A1.shape}")
print(f"Hidden layer output:\n{A1.flatten().round(4)}")

Layer 2: Hidden to Output

The output layer takes the hidden layer activations as input and produces the final predictions. For binary classification, we use sigmoid to get probabilities between 0 and 1.

# Forward propagation through output layer
Z2 = np.dot(W2, A1) + b2  # Linear transformation
A2 = sigmoid(Z2)          # Sigmoid for binary classification

print(f"Z2 (pre-activation): shape {Z2.shape}")
print(f"A2 (final output): shape {A2.shape}")
print(f"Predictions: {A2.flatten().round(4)}")
Matrix Dimensions: Always verify shapes match. For W of shape (neurons_out, neurons_in) and A of shape (neurons_in, samples), the output W*A has shape (neurons_out, samples).

Complete Forward Propagation Function

Let us encapsulate the entire forward pass in a reusable function. This function stores intermediate values (cache) which will be needed for backpropagation during training.

def forward_propagation(X, parameters):
    """
    Forward pass through a 2-layer network.
    Returns predictions and cache for backprop.
    """
    W1, b1, W2, b2 = parameters['W1'], parameters['b1'], parameters['W2'], parameters['b2']
    
    # Hidden layer
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    
    # Output layer
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)
    
    # Store values for backpropagation
    cache = {'Z1': Z1, 'A1': A1, 'Z2': Z2, 'A2': A2}
    
    return A2, cache

# Package parameters
params = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
predictions, cache = forward_propagation(X, params)
print(f"Final predictions: {predictions.flatten()}")

Batch Processing

In practice, we process multiple samples simultaneously (batch processing). This is more efficient because matrix operations can be parallelized. The forward propagation code works the same way, just with different input dimensions.

# Batch of 5 samples, each with 3 features
X_batch = np.random.randn(3, 5)  # Shape: (features, samples)

predictions_batch, _ = forward_propagation(X_batch, params)
print(f"Input batch shape: {X_batch.shape}")
print(f"Output shape: {predictions_batch.shape}")  # (2, 5) - 2 outputs per sample
print(f"Predictions for 5 samples:\n{predictions_batch.round(4)}")

Computing the Loss

After forward propagation, we compare predictions to actual labels using a loss function. For binary classification, we use binary cross-entropy loss. The goal of training is to minimize this loss.

def compute_loss(A2, Y):
    """
    Binary cross-entropy loss.
    A2: predictions, shape (1, m)
    Y: true labels, shape (1, m)
    """
    m = Y.shape[1]
    epsilon = 1e-8  # Prevent log(0)
    
    loss = -np.mean(Y * np.log(A2 + epsilon) + (1 - Y) * np.log(1 - A2 + epsilon))
    return loss

# Example: 5 samples with binary labels
Y = np.array([[1, 0, 1, 1, 0]])  # True labels
A2 = np.array([[0.9, 0.2, 0.8, 0.7, 0.1]])  # Predictions

loss = compute_loss(A2, Y)
print(f"Predictions: {A2}")
print(f"True labels: {Y}")
print(f"Binary Cross-Entropy Loss: {loss:.4f}")  # Lower is better

Step | Operation | Formula
1. Linear (Layer 1) | Weighted sum + bias | Z1 = W1 * X + b1
2. Activation (Layer 1) | Apply ReLU | A1 = max(0, Z1)
3. Linear (Layer 2) | Weighted sum + bias | Z2 = W2 * A1 + b2
4. Activation (Layer 2) | Apply Sigmoid | A2 = sigmoid(Z2)
5. Loss Computation | Compare with labels | L = cross_entropy(A2, Y)

Practice: Forward Propagation

Task: Given input X=[1, 2], weights W=[[0.5, 0.5], [0.3, 0.7]], and bias b=[0, 0], manually calculate Z, then apply ReLU to get the hidden layer output.

Solution:
import numpy as np

X = np.array([[1], [2]])  # Shape (2, 1)
W = np.array([[0.5, 0.5], 
              [0.3, 0.7]])  # Shape (2, 2)
b = np.array([[0], [0]])  # Shape (2, 1)

# Linear transformation
Z = np.dot(W, X) + b
print(f"Z = W @ X + b")
print(f"Z[0] = 0.5*1 + 0.5*2 + 0 = {Z[0,0]}")  # 1.5
print(f"Z[1] = 0.3*1 + 0.7*2 + 0 = {Z[1,0]}")  # 1.7

# ReLU activation
A = np.maximum(0, Z)
print(f"\nA = ReLU(Z) = {A.flatten()}")  # [1.5, 1.7]

Task: Extend the forward propagation to handle a 3-layer network (2 hidden layers + output). Use ReLU for hidden layers and sigmoid for output.

Solution:
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_3layer(X, params):
    W1, b1 = params['W1'], params['b1']
    W2, b2 = params['W2'], params['b2']
    W3, b3 = params['W3'], params['b3']
    
    # Hidden layer 1
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    
    # Hidden layer 2
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    
    # Output layer
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)
    
    return A3

# Example: 4 inputs -> 8 hidden -> 4 hidden -> 1 output
params = {
    'W1': np.random.randn(8, 4) * 0.01, 'b1': np.zeros((8, 1)),
    'W2': np.random.randn(4, 8) * 0.01, 'b2': np.zeros((4, 1)),
    'W3': np.random.randn(1, 4) * 0.01, 'b3': np.zeros((1, 1))
}

X = np.random.randn(4, 1)
output = forward_3layer(X, params)
print(f"Output: {output[0,0]:.4f}")

Task: Create a general forward propagation function that works for networks with any number of layers. Store parameters as a list and use a loop to process each layer.

Solution:
import numpy as np

def forward_deep(X, parameters, activations):
    """
    Forward propagation for arbitrary depth.
    parameters: list of (W, b) tuples
    activations: list of activation function names
    """
    A = X
    caches = []
    
    for i, ((W, b), act) in enumerate(zip(parameters, activations)):
        Z = np.dot(W, A) + b
        
        if act == 'relu':
            A = np.maximum(0, Z)
        elif act == 'sigmoid':
            A = 1 / (1 + np.exp(-Z))
        elif act == 'tanh':
            A = np.tanh(Z)
        else:
            A = Z  # Linear
        
        caches.append((Z, A))
    
    return A, caches

# 5-layer network: 10 -> 64 -> 32 -> 16 -> 8 -> 1
layer_dims = [10, 64, 32, 16, 8, 1]
params = [(np.random.randn(layer_dims[i+1], layer_dims[i]) * 0.01,
           np.zeros((layer_dims[i+1], 1))) 
          for i in range(len(layer_dims)-1)]
acts = ['relu', 'relu', 'relu', 'relu', 'sigmoid']

X = np.random.randn(10, 1)
output, _ = forward_deep(X, params, acts)
print(f"5-layer output: {output[0,0]:.4f}")
04

Backpropagation and Gradient Descent

Backpropagation is the algorithm that enables neural networks to learn. It calculates how much each weight contributed to the prediction error, then uses gradient descent to update weights in the direction that reduces the error. This elegant combination of calculus and optimization is what makes deep learning possible.

Key Concept

Backpropagation

Backpropagation computes the gradient of the loss function with respect to each weight by applying the chain rule of calculus, propagating error gradients backward from the output layer to the input layer.

Chain Rule: dL/dW = dL/dA * dA/dZ * dZ/dW

The Chain Rule of Calculus

The chain rule allows us to compute derivatives of composite functions. In neural networks, the output is a composition of many functions (layers). The chain rule tells us how to find the derivative of this composition with respect to any parameter.

# Chain rule example: f(g(x)) derivative
# If y = f(g(x)), then dy/dx = df/dg * dg/dx

import numpy as np

# Example: y = (2x + 1)^2
# Let g(x) = 2x + 1, f(g) = g^2
# dy/dx = df/dg * dg/dx = 2g * 2 = 4(2x + 1)

x = 3
g = 2 * x + 1  # g = 7
df_dg = 2 * g  # = 14
dg_dx = 2
dy_dx = df_dg * dg_dx  # = 28

print(f"x = {x}")
print(f"dy/dx using chain rule: {dy_dx}")  # 28

Computing Gradients for Output Layer

We start from the output and work backward. First, we compute the gradient of the loss with respect to the output layer's pre-activation (Z2), then use that to find gradients for W2 and b2.

def backward_output_layer(A2, Y, A1):
    """
    Compute gradients for output layer (with sigmoid activation).
    A2: predictions, Y: true labels, A1: previous layer activations
    """
    m = Y.shape[1]  # Number of samples
    
    # Gradient of loss w.r.t. Z2 (simplified for sigmoid + cross-entropy)
    dZ2 = A2 - Y  # Shape: (1, m)
    
    # Gradient of loss w.r.t. W2
    dW2 = (1/m) * np.dot(dZ2, A1.T)  # Shape: (1, hidden_size)
    
    # Gradient of loss w.r.t. b2
    db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)  # Shape: (1, 1)
    
    return dZ2, dW2, db2

Computing Gradients for Hidden Layer

The hidden layer gradients are computed by propagating the error backward through the network. We use the gradients from the output layer (dZ2) and the weights connecting hidden to output (W2).

def relu_derivative(Z):
    return (Z > 0).astype(float)

def backward_hidden_layer(dZ2, W2, Z1, X):
    """
    Compute gradients for hidden layer (with ReLU activation).
    """
    m = X.shape[1]
    
    # Propagate gradient backward through W2
    dA1 = np.dot(W2.T, dZ2)
    
    # Gradient through ReLU
    dZ1 = dA1 * relu_derivative(Z1)
    
    # Gradients for W1 and b1
    dW1 = (1/m) * np.dot(dZ1, X.T)
    db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
    
    return dW1, db1
Gradient Flow: Gradients flow backward through the network. Each layer receives gradients from the next layer, computes its own gradients, and passes them to the previous layer.

Complete Backpropagation

Now let us combine everything into a complete backpropagation function that computes gradients for all parameters in our 2-layer network.

def backward_propagation(X, Y, params, cache):
    """
    Complete backpropagation for 2-layer network.
    Returns gradients for all parameters.
    """
    m = X.shape[1]
    W1, W2 = params['W1'], params['W2']
    Z1, A1, Z2, A2 = cache['Z1'], cache['A1'], cache['Z2'], cache['A2']
    
    # Output layer gradients
    dZ2 = A2 - Y
    dW2 = (1/m) * np.dot(dZ2, A1.T)
    db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
    
    # Hidden layer gradients
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = dA1 * (Z1 > 0)  # ReLU derivative
    dW1 = (1/m) * np.dot(dZ1, X.T)
    db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
    
    gradients = {'dW1': dW1, 'db1': db1, 'dW2': dW2, 'db2': db2}
    return gradients

Gradient Descent Update

Once we have gradients, we update weights using gradient descent. The learning rate controls how large each update step is. We subtract the gradient (times learning rate) because we want to move in the direction that decreases the loss.

def update_parameters(params, gradients, learning_rate):
    """
    Update parameters using gradient descent.
    New weight = old weight - learning_rate * gradient
    """
    params['W1'] -= learning_rate * gradients['dW1']
    params['b1'] -= learning_rate * gradients['db1']
    params['W2'] -= learning_rate * gradients['dW2']
    params['b2'] -= learning_rate * gradients['db2']
    
    return params

# Example update
learning_rate = 0.01
params = update_parameters(params, gradients, learning_rate)
print("Parameters updated!")

Learning Rate Too High

  • Updates are too large
  • May overshoot the minimum
  • Loss can oscillate or diverge
  • Training becomes unstable

Learning Rate Too Low

  • Updates are too small
  • Training is very slow
  • May get stuck in local minima
  • Requires many more iterations
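Both failure modes are easy to see on a toy objective. The sketch below runs gradient descent on f(w) = w^2, whose gradient is 2w, with a too-high, a too-low, and a reasonable learning rate:

```python
def descend(lr, steps=20, w0=1.0):
    """Run `steps` gradient descent updates on f(w) = w**2."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w  # gradient of w**2 is 2w
    return w

for lr in [1.1, 0.001, 0.1]:
    print(f"lr={lr:<6} final w = {descend(lr):.4g}")

# lr=1.1  : each step multiplies w by (1 - 2*1.1) = -1.2 -> |w| grows, diverges
# lr=0.001: w barely moves after 20 steps -> very slow progress
# lr=0.1  : w shrinks steadily toward the minimum at w = 0
```

The update factor (1 - 2*lr) tells the whole story here: training diverges whenever its magnitude exceeds 1, and crawls when it is barely below 1.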

The Complete Training Loop

Training a neural network involves repeating forward propagation, loss computation, backpropagation, and parameter updates for many iterations (epochs). Here is a complete training loop that brings everything together.

def train_network(X, Y, layer_sizes, learning_rate=0.01, epochs=1000):
    """
    Complete training loop for a 2-layer neural network.
    """
    np.random.seed(42)
    n_input, n_hidden, n_output = layer_sizes
    
    # Initialize parameters
    params = {
        'W1': np.random.randn(n_hidden, n_input) * 0.01,
        'b1': np.zeros((n_hidden, 1)),
        'W2': np.random.randn(n_output, n_hidden) * 0.01,
        'b2': np.zeros((n_output, 1))
    }
    
    losses = []
    
    for epoch in range(epochs):
        # Forward propagation
        A2, cache = forward_propagation(X, params)
        
        # Compute loss
        loss = compute_loss(A2, Y)
        losses.append(loss)
        
        # Backward propagation
        gradients = backward_propagation(X, Y, params, cache)
        
        # Update parameters
        params = update_parameters(params, gradients, learning_rate)
        
        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Loss = {loss:.4f}")
    
    return params, losses

Gradient Descent Variants

Standard gradient descent uses all training samples to compute each update. In practice, we use variants that are more efficient and often lead to better convergence.

Variant | Batch Size | Characteristics
Batch GD | All samples | Stable but slow, memory intensive
Stochastic GD | 1 sample | Fast updates, noisy gradients
Mini-batch GD | 32-256 samples | Best of both worlds, most common
Modern Optimizers: In practice, we use advanced optimizers like Adam, RMSprop, or AdaGrad that adapt the learning rate for each parameter and often converge faster than vanilla gradient descent.
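Each mini-batch update uses only a slice of the data. Here is a minimal sketch of the batching logic (the `minibatches` helper is illustrative, not part of the training code above), assuming samples are stored as columns as in this module:

```python
import numpy as np

def minibatches(X, Y, batch_size=32, seed=0):
    """Shuffle once, then yield (X_batch, Y_batch) column slices."""
    m = X.shape[1]                       # number of samples (columns)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]        # shuffle sample columns together
    for start in range(0, m, batch_size):
        yield X[:, start:start+batch_size], Y[:, start:start+batch_size]

X = np.random.randn(2, 200)
Y = (X.sum(axis=0) > 0).astype(int).reshape(1, -1)

sizes = [xb.shape[1] for xb, yb in minibatches(X, Y)]
print(sizes)  # [32, 32, 32, 32, 32, 32, 8] -- the last batch is smaller
```

Inside the training loop, each epoch would iterate over these batches and run forward propagation, backpropagation, and a parameter update per batch rather than once per epoch.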

Putting It All Together

Let us train a complete neural network on a simple classification problem to see all the concepts in action.

# Generate simple classification data
np.random.seed(42)
X = np.random.randn(2, 200)  # 200 samples, 2 features
Y = ((X[0] + X[1]) > 0).astype(int).reshape(1, -1)  # Label based on sum

# Train the network
layer_sizes = (2, 4, 1)  # 2 inputs, 4 hidden, 1 output
params, losses = train_network(X, Y, layer_sizes, learning_rate=0.5, epochs=1000)

# Evaluate
predictions, _ = forward_propagation(X, params)
accuracy = np.mean((predictions > 0.5) == Y)
print(f"\nFinal Accuracy: {accuracy * 100:.1f}%")

Practice: Backpropagation and Gradient Descent

Task: For the function y = (x*w + b)^2, compute dy/dw and dy/db using the chain rule. Then verify numerically with x=2, w=3, b=1.

Solution:
# Analytical solution using chain rule:
# Let z = x*w + b, then y = z^2
# dy/dz = 2z
# dz/dw = x
# dz/db = 1
# dy/dw = dy/dz * dz/dw = 2z * x = 2(x*w + b) * x
# dy/db = dy/dz * dz/db = 2z * 1 = 2(x*w + b)

x, w, b = 2, 3, 1

z = x * w + b  # z = 7
y = z ** 2     # y = 49

dy_dw = 2 * z * x  # 2 * 7 * 2 = 28
dy_db = 2 * z      # 2 * 7 = 14

print(f"z = x*w + b = {z}")
print(f"y = z^2 = {y}")
print(f"dy/dw = {dy_dw}")
print(f"dy/db = {dy_db}")

# Numerical verification (small perturbation)
eps = 1e-5
y_plus_w = ((x * (w + eps) + b) ** 2)
numerical_dy_dw = (y_plus_w - y) / eps
print(f"Numerical dy/dw: {numerical_dy_dw:.1f}")
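The forward difference above has error on the order of eps. A central difference, which evaluates the function on both sides of w, has error on the order of eps squared and is generally preferred for gradient checking. A quick comparison on the same function:

```python
x, w, b, eps = 2, 3, 1, 1e-5

def f(w_):
    # y = (x*w + b)^2 viewed as a function of w alone
    return (x * w_ + b) ** 2

forward = (f(w + eps) - f(w)) / eps              # error O(eps)
central = (f(w + eps) - f(w - eps)) / (2 * eps)  # error O(eps^2)

print(f"forward difference: {forward:.8f}")
print(f"central difference: {central:.8f}  (true value: 28)")
```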

Task: Implement numerical gradient checking to verify backpropagation is correct. Compare analytical gradients from backprop with numerical gradients computed using finite differences.

Solution:
import numpy as np

def numerical_gradient(params, X, Y, param_name, i, j, eps=1e-7):
    """Compute numerical gradient using finite differences."""
    original = params[param_name][i, j]
    
    # f(x + eps)
    params[param_name][i, j] = original + eps
    A2_plus, _ = forward_propagation(X, params)
    loss_plus = compute_loss(A2_plus, Y)
    
    # f(x - eps)
    params[param_name][i, j] = original - eps
    A2_minus, _ = forward_propagation(X, params)
    loss_minus = compute_loss(A2_minus, Y)
    
    # Restore
    params[param_name][i, j] = original
    
    return (loss_plus - loss_minus) / (2 * eps)

# Compare gradients (2 input features, matching the trained network's input size)
X = np.random.randn(2, 5)
Y = np.random.randint(0, 2, (1, 5))

A2, cache = forward_propagation(X, params)
grads = backward_propagation(X, Y, params, cache)

# Check W1[0,0]
num_grad = numerical_gradient(params, X, Y, 'W1', 0, 0)
ana_grad = grads['dW1'][0, 0]
diff = abs(num_grad - ana_grad) / max(abs(num_grad), abs(ana_grad))
print(f"Numerical: {num_grad:.6f}, Analytical: {ana_grad:.6f}")
print(f"Relative difference: {diff:.2e}")
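To interpret the relative difference printed above, a common heuristic (an informal rule of thumb, not part of the lesson's code) is: below 1e-7 the backprop implementation is almost certainly correct, between 1e-7 and 1e-4 it deserves a closer look, and above 1e-4 there is likely a bug:

```python
def check_gradient(diff):
    """Classify a relative difference between numerical and
    analytical gradients using common heuristic thresholds."""
    if diff < 1e-7:
        return 'backprop looks correct'
    elif diff < 1e-4:
        return 'possibly OK, inspect further'
    return 'likely a bug in backprop'

print(check_gradient(3e-9))  # backprop looks correct
print(check_gradient(2e-3))  # likely a bug in backprop
```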

Task: Create a 3D visualization showing gradient descent on a simple loss surface L(w1, w2) = w1^2 + w2^2. Plot the path taken by gradient descent from a random starting point to the minimum at (0, 0).

Solution:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def loss(w1, w2):
    return w1**2 + w2**2

def gradient(w1, w2):
    return 2*w1, 2*w2

# Gradient descent
w1, w2 = 4.0, 3.0
lr = 0.1
path = [(w1, w2, loss(w1, w2))]

for _ in range(20):
    dw1, dw2 = gradient(w1, w2)
    w1 -= lr * dw1
    w2 -= lr * dw2
    path.append((w1, w2, loss(w1, w2)))

# Plotting
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Loss surface
W1, W2 = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
L = loss(W1, W2)
ax.plot_surface(W1, W2, L, alpha=0.5, cmap='viridis')

# GD path
path = np.array(path)
ax.plot(path[:,0], path[:,1], path[:,2], 'r.-', markersize=10, linewidth=2)
ax.set_xlabel('w1'); ax.set_ylabel('w2'); ax.set_zlabel('Loss')
plt.title('Gradient Descent on Loss Surface')
plt.show()
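The learning rate lr = 0.1 used above controls the step size along this surface. On the same bowl-shaped loss, reduced to one dimension L(w) = w^2, a small experiment (illustrative values) shows the three regimes: a small rate converges gradually, lr = 0.5 lands on the minimum in one step for this particular loss, and too large a rate diverges:

```python
def descend(lr, w=4.0, steps=20):
    """Run gradient descent on L(w) = w^2 from a fixed start."""
    for _ in range(steps):
        w -= lr * 2 * w  # gradient of w^2 is 2w
    return w

for lr in (0.1, 0.5, 1.1):
    print(f"lr = {lr}: final w = {descend(lr):.3g}")
```

Each step multiplies w by (1 - 2*lr), so the iterates shrink when that factor has magnitude below 1 and blow up when it exceeds 1.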

Key Takeaways

Biological Inspiration

Artificial neurons mimic biological neurons, receiving inputs, applying weights, and producing outputs through activation

Perceptron Foundation

The perceptron is the simplest neural network unit, combining weighted inputs with a bias and activation function

Non-linear Activations

Activation functions like ReLU, Sigmoid, and Tanh enable networks to learn complex non-linear patterns

Forward Propagation

Data flows forward through layers, with each neuron computing weighted sums and applying activations

Backpropagation

Errors propagate backward through the network, computing gradients for each weight using the chain rule

Gradient Descent

Weights are updated iteratively using gradients, moving down the loss surface toward optimal values

Knowledge Check

Test your understanding of neural network fundamentals:

1. What are the main components of an artificial neuron (perceptron)?
2. Why are activation functions necessary in neural networks?
3. Which activation function is most commonly used in hidden layers of modern deep networks?
4. What happens during forward propagation?
5. What mathematical technique does backpropagation use to compute gradients?
6. In gradient descent, what does the learning rate control?