Module 3.1

Neural Networks Fundamentals

Explore the building blocks of deep learning. Understand how artificial neurons work, how activation functions introduce non-linearity, and how networks learn through forward propagation and backpropagation with gradient descent.

55 min read
Intermediate
Hands-on
What You'll Learn
  • Biological inspiration and artificial neurons
  • Perceptron model and multi-layer networks
  • Activation functions (ReLU, Sigmoid, Tanh)
  • Forward propagation through networks
  • Backpropagation and gradient descent
Contents
01

Neurons and Perceptrons

The human brain contains approximately 86 billion neurons, each connected to thousands of others through synapses. Artificial neural networks draw inspiration from this biological architecture, using simplified mathematical models to process information. Understanding how artificial neurons work is the foundation for all deep learning concepts.

Key Concept

Artificial Neuron

An artificial neuron is a mathematical function that receives one or more inputs, multiplies each by a weight, sums them together with a bias term, and passes the result through an activation function to produce an output.

Mathematical Formula: output = activation(w1*x1 + w2*x2 + ... + wn*xn + b)

Biological vs Artificial Neurons

While artificial neurons are inspired by biological neurons, they are significantly simplified. Biological neurons communicate through electrical impulses and chemical signals, while artificial neurons perform straightforward mathematical operations. Despite this simplification, artificial neural networks can learn remarkably complex patterns.

Biological Neuron

  • Dendrites receive signals from other neurons
  • Cell body processes incoming signals
  • Axon transmits output signal
  • Synapses connect to other neurons

Artificial Neuron

  • Inputs receive data values
  • Weights determine input importance
  • Bias shifts the activation threshold
  • Activation function produces output

The Perceptron Model

The perceptron, invented by Frank Rosenblatt in 1958, is the simplest form of a neural network. It consists of a single artificial neuron that can learn to classify inputs into two categories. Despite its simplicity, the perceptron laid the groundwork for modern deep learning.

import numpy as np

class Perceptron:
    def __init__(self, n_inputs, learning_rate=0.01):
        # Initialize weights randomly and bias to zero
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0
        self.lr = learning_rate
    
    def activation(self, x):
        # Step function: returns 1 if x > 0, else 0
        return 1 if x > 0 else 0
    
    def predict(self, inputs):
        # Weighted sum plus bias
        weighted_sum = np.dot(inputs, self.weights) + self.bias
        return self.activation(weighted_sum)  # Returns 0 or 1

Weights and Biases

Weights determine how much influence each input has on the output. A larger weight means the corresponding input has more importance. The bias allows the neuron to shift its activation threshold, enabling it to fire even when all inputs are zero.

# Demonstrating weights and bias effect
import numpy as np

inputs = np.array([0.5, 0.3, 0.2])
weights = np.array([0.4, 0.6, 0.8])
bias = -0.1

# Weighted sum calculation
weighted_sum = np.dot(inputs, weights) + bias
print(f"Inputs: {inputs}")
print(f"Weights: {weights}")
print(f"Weighted sum: {weighted_sum:.3f}")  # 0.5*0.4 + 0.3*0.6 + 0.2*0.8 - 0.1 = 0.44
Learning Process: During training, the network adjusts weights and biases to minimize the difference between predicted and actual outputs. This is how neural networks "learn" from data.

Training a Perceptron

The perceptron learning algorithm updates weights based on prediction errors. When the perceptron makes a wrong prediction, weights are adjusted in the direction that would have produced the correct output.

# train method of the Perceptron class defined above
def train(self, X, y, epochs=100):
    for epoch in range(epochs):
        errors = 0
        for inputs, target in zip(X, y):
            prediction = self.predict(inputs)
            error = target - prediction
            
            if error != 0:
                # Update weights: w = w + learning_rate * error * input
                self.weights += self.lr * error * inputs
                self.bias += self.lr * error
                errors += 1
        
        if errors == 0:
            print(f"Converged at epoch {epoch}")
            break

Complete Perceptron Example: Logic Gates

Let us train a perceptron to learn the AND logic gate. This is a classic example that demonstrates how a single perceptron can learn linearly separable patterns.

import numpy as np

# AND gate training data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])  # AND: only 1 when both inputs are 1

# Create and train perceptron
perceptron = Perceptron(n_inputs=2, learning_rate=0.1)
perceptron.train(X, y, epochs=100)

# Test the trained perceptron
for inputs in X:
    prediction = perceptron.predict(inputs)
    print(f"{inputs} -> {prediction}")  # Should output: [0,0]->0, [0,1]->0, [1,0]->0, [1,1]->1
Limitation: A single perceptron can only learn linearly separable patterns. It cannot learn XOR (exclusive or) because XOR requires a non-linear decision boundary.

Multi-Layer Perceptrons (MLPs)

To overcome the limitations of single perceptrons, we stack multiple layers of neurons. This architecture, called a Multi-Layer Perceptron (MLP), can learn complex non-linear patterns. An MLP consists of an input layer, one or more hidden layers, and an output layer.

Layer Type | Purpose | Characteristics
Input Layer | Receives raw input data | Number of neurons equals number of features
Hidden Layer(s) | Learns intermediate representations | Extract features and patterns from data
Output Layer | Produces final predictions | Size depends on task (1 for regression, n for n-class classification)
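To make the table concrete, here is a small sketch of the parameter bookkeeping for a hypothetical 4-8-3 MLP (the layer sizes are chosen purely for illustration):

```python
import numpy as np

# Hypothetical architecture: 4 input features -> 8 hidden -> 3 outputs
layer_sizes = [4, 8, 3]

rng = np.random.default_rng(0)
# One weight matrix of shape (n_out, n_in) and one bias vector per layer
weights = [rng.standard_normal((n_out, n_in)) * 0.01
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in layer_sizes[1:]]

for i, (W, b) in enumerate(zip(weights, biases), start=1):
    print(f"Layer {i}: W {W.shape}, b {b.shape}, "
          f"{W.size + b.size} trainable parameters")
# Layer 1: W (8, 4), b (8, 1), 40 trainable parameters
# Layer 2: W (3, 8), b (3, 1), 27 trainable parameters
```

Counting parameters this way is a quick sanity check before training: each layer contributes n_out * n_in weights plus n_out biases.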

Practice: Neurons and Perceptrons

Task: Given inputs [0.6, 0.8], weights [0.5, 0.5], and bias -0.4, calculate the weighted sum and apply a step activation function (output 1 if sum > 0, else 0).

Solution:
import numpy as np

inputs = np.array([0.6, 0.8])
weights = np.array([0.5, 0.5])
bias = -0.4

# Calculate weighted sum
weighted_sum = np.dot(inputs, weights) + bias
print(f"Weighted sum: {weighted_sum:.2f}")  # 0.6*0.5 + 0.8*0.5 - 0.4 = 0.3

# Apply step activation
output = 1 if weighted_sum > 0 else 0
print(f"Output: {output}")  # 1 (since 0.3 > 0)

Task: Modify the perceptron example to learn the OR logic gate instead of AND. The OR gate outputs 1 if at least one input is 1.

Solution:
import numpy as np

class Perceptron:
    def __init__(self, n_inputs, learning_rate=0.1):
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0
        self.lr = learning_rate
    
    def predict(self, inputs):
        return 1 if np.dot(inputs, self.weights) + self.bias > 0 else 0
    
    def train(self, X, y, epochs=100):
        for epoch in range(epochs):
            for inputs, target in zip(X, y):
                error = target - self.predict(inputs)
                self.weights += self.lr * error * inputs
                self.bias += self.lr * error

# OR gate: output 1 if ANY input is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])

perceptron = Perceptron(2)
perceptron.train(X, y)

for inputs in X:
    print(f"{inputs} -> {perceptron.predict(inputs)}")  # 0,1,1,1

Task: The XOR problem cannot be solved by a single perceptron. Implement a simple 2-layer network using NumPy with hardcoded weights that correctly computes XOR. Hint: XOR can be computed as (x1 OR x2) AND NOT(x1 AND x2).

Solution:
import numpy as np

def step(x):
    return (x > 0).astype(int)

# XOR using 2 layers with manual weights
# Hidden layer: one neuron for OR, one for NAND
# Output layer: AND of hidden outputs

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer weights (2 neurons)
W1 = np.array([[1, 1],      # OR neuron
               [-1, -1]])    # NAND neuron (NOT AND)
b1 = np.array([-0.5, 1.5])  # Thresholds

# Output layer weights (1 neuron - AND)
W2 = np.array([[1], [1]])
b2 = np.array([-1.5])

for x in X:
    # Forward pass
    hidden = step(np.dot(x, W1.T) + b1)
    output = step(np.dot(hidden, W2) + b2)
    print(f"XOR{tuple(x)} = {output[0]}")  # 0,1,1,0
02

Activation Functions

Activation functions are the secret sauce that gives neural networks their power. Without them, a neural network would simply be a linear transformation, incapable of learning complex patterns. Activation functions introduce non-linearity, allowing networks to approximate virtually any function and solve problems that linear models cannot.

Key Concept

Why Non-linearity Matters

If we stack linear transformations (without activation functions), the result is still a linear transformation. No matter how many layers we add, the network could only learn linear relationships.

Mathematical proof: f(g(x)) = W2(W1*x + b1) + b2 = (W2*W1)*x + (W2*b1 + b2) = W'x + b' (still linear!)
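The collapse is easy to verify numerically. The sketch below builds two random linear layers (toy sizes, no activation in between) and checks that a single combined layer W' = W2*W1, b' = W2*b1 + b2 produces identical outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two linear layers with toy sizes: 3 -> 4 -> 2, no activation
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

x = rng.standard_normal(3)

# Stacked linear layers
two_layer = W2 @ (W1 @ x + b1) + b2

# Equivalent single linear layer
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ x + b_prime

print(np.allclose(two_layer, one_layer))  # True
```

No matter how many linear layers you stack, the same collapse applies, which is why the non-linear activation between layers is essential.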

Sigmoid Activation

The sigmoid function squashes any input into the range (0, 1). It was historically popular but has fallen out of favor for hidden layers due to the vanishing gradient problem. However, it is still used in output layers for binary classification.

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)  # Derivative for backpropagation

# Example values
x = np.array([-2, -1, 0, 1, 2])
print(f"Input: {x}")
print(f"Sigmoid: {sigmoid(x).round(3)}")  # [0.119, 0.269, 0.5, 0.731, 0.881]
print(f"Derivative: {sigmoid_derivative(x).round(3)}")  # [0.105, 0.197, 0.25, 0.197, 0.105]

Sigmoid Advantages

  • Output bounded between 0 and 1
  • Smooth gradient, always defined
  • Interpretable as probability

Sigmoid Disadvantages

  • Vanishing gradients for large inputs
  • Outputs not zero-centered
  • Computationally expensive (exponential)

Tanh (Hyperbolic Tangent)

The tanh function is similar to sigmoid but outputs values between -1 and 1, making it zero-centered. This property can help with training convergence. Like sigmoid, it also suffers from vanishing gradients for extreme inputs.

def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

# Compare sigmoid and tanh
x = np.array([-2, -1, 0, 1, 2])
print(f"Tanh: {tanh(x).round(3)}")  # [-0.964, -0.762, 0.0, 0.762, 0.964]
print(f"Range: (-1, 1), Zero-centered: Yes")
Relationship: tanh(x) = 2 * sigmoid(2x) - 1. Tanh is essentially a scaled and shifted version of sigmoid.
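This identity is easy to confirm numerically:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Check tanh(x) = 2 * sigmoid(2x) - 1 over a range of inputs
x = np.linspace(-3, 3, 7)
lhs = np.tanh(x)
rhs = 2 * sigmoid(2 * x) - 1

print(np.allclose(lhs, rhs))  # True
```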

ReLU (Rectified Linear Unit)

ReLU is the most widely used activation function in modern deep learning. It simply outputs the input if positive, otherwise zero. ReLU is computationally efficient, helps mitigate vanishing gradients, and enables faster training.

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)  # 1 if x > 0, else 0

# ReLU in action
x = np.array([-2, -1, 0, 1, 2])
print(f"Input: {x}")
print(f"ReLU: {relu(x)}")  # [0, 0, 0, 1, 2]
print(f"Derivative: {relu_derivative(x)}")  # [0, 0, 0, 1, 1]
Dying ReLU Problem: If a neuron's pre-activation (weighted sum) becomes negative for every training input, ReLU outputs 0 and its gradient is also 0, so the weights never update. The neuron "dies" and stops learning. Leaky ReLU addresses this issue.
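A minimal sketch of a dead neuron, using hypothetical weights that training has driven far negative (so that with non-negative inputs the pre-activation is negative for every sample):

```python
import numpy as np

np.random.seed(0)
X = np.random.rand(100, 3)        # 100 samples, features in [0, 1)
w = np.array([-5.0, -5.0, -5.0])  # hypothetical weights pushed negative
b = -1.0

z = X @ w + b                     # pre-activation: always < 0 here
a = np.maximum(0, z)              # ReLU output: all zeros
grad = (z > 0).astype(float)      # ReLU gradient: all zeros

print(f"Nonzero outputs: {int((a > 0).sum())}, "
      f"nonzero gradients: {int(grad.sum())}")  # 0, 0
# Zero gradient means zero weight updates -- the neuron cannot recover.
```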

Leaky ReLU and Variants

Leaky ReLU allows a small gradient when the input is negative, preventing neurons from dying. Other variants like Parametric ReLU (PReLU) and Exponential Linear Unit (ELU) offer similar benefits with different characteristics.

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2, -1, 0, 1, 2])
print(f"Leaky ReLU (alpha=0.01): {leaky_relu(x).round(3)}")  # [-0.02, -0.01, 0, 1, 2]
print(f"ELU (alpha=1.0): {elu(x).round(3)}")  # [-0.865, -0.632, 0, 1, 2]

Softmax for Multi-class Classification

Softmax is used in the output layer for multi-class classification. It converts raw scores (logits) into probabilities that sum to 1. Each output represents the probability of the input belonging to that class.

def softmax(x):
    # Subtract max for numerical stability
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum()

# Raw scores (logits) for 3 classes
logits = np.array([2.0, 1.0, 0.1])
probabilities = softmax(logits)

print(f"Logits: {logits}")
print(f"Probabilities: {probabilities.round(3)}")  # [0.659, 0.242, 0.099]
print(f"Sum: {probabilities.sum():.3f}")  # 1.000

Activation Function Comparison

Function | Range | Use Case | Gradient Issue
Sigmoid | (0, 1) | Binary classification output | Vanishing gradients
Tanh | (-1, 1) | RNNs, zero-centered outputs | Vanishing gradients
ReLU | [0, infinity) | Hidden layers (default choice) | Dying neurons
Leaky ReLU | (-infinity, infinity) | Hidden layers (prevents dying) | None significant
Softmax | (0, 1), sum = 1 | Multi-class output layer | N/A

Choosing the Right Activation

The choice of activation function depends on the layer position and the problem type. Here are the practical guidelines used by most practitioners today.

# Practical activation function selection
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # Hidden layers: Use ReLU (or Leaky ReLU)
    layers.Dense(128, activation='relu'),
    layers.Dense(64, activation='relu'),
    
    # Output layer depends on task:
    # Binary classification: sigmoid
    layers.Dense(1, activation='sigmoid'),
    
    # Multi-class classification: softmax
    # layers.Dense(10, activation='softmax'),
    
    # Regression: no activation (linear)
    # layers.Dense(1, activation=None),
])

Practice: Activation Functions

Task: Create a plot comparing ReLU and Sigmoid functions over the range [-5, 5]. Plot both functions on the same graph with proper labels and legend.

Solution:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 100)
sigmoid = 1 / (1 + np.exp(-x))
relu = np.maximum(0, x)

plt.figure(figsize=(10, 5))
plt.plot(x, sigmoid, label='Sigmoid', linewidth=2)
plt.plot(x, relu, label='ReLU', linewidth=2)
plt.axhline(y=0, color='gray', linestyle='--', alpha=0.5)
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5)
plt.xlabel('Input')
plt.ylabel('Output')
plt.title('Sigmoid vs ReLU Activation Functions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Task: Calculate and compare the gradients of sigmoid and ReLU for inputs [-10, -5, 0, 5, 10]. Show how sigmoid gradients become very small for extreme values while ReLU maintains useful gradients.

Solution:
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_gradient(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu_gradient(x):
    return (x > 0).astype(float)

x_values = np.array([-10, -5, 0, 5, 10])

print("Vanishing Gradient Demonstration")
print("-" * 40)
print(f"{'x':>6} | {'Sigmoid Grad':>12} | {'ReLU Grad':>10}")
print("-" * 40)

for x in x_values:
    sig_grad = sigmoid_gradient(x)
    relu_grad = relu_gradient(np.array([x]))[0]
    print(f"{x:>6} | {sig_grad:>12.6f} | {relu_grad:>10.1f}")
    
# Output shows sigmoid gradient near 0 for x=-10, x=10

Task: Implement a softmax function with a temperature parameter. Show how temperature affects the probability distribution: low temperature makes it sharper (more confident), high temperature makes it flatter (more uniform).

Solution:
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    scaled = logits / temperature
    exp_x = np.exp(scaled - np.max(scaled))
    return exp_x / exp_x.sum()

logits = np.array([2.0, 1.0, 0.5])

print("Softmax Temperature Scaling")
print("-" * 50)
for temp in [0.1, 0.5, 1.0, 2.0, 5.0]:
    probs = softmax_with_temperature(logits, temp)
    print(f"T={temp:<3} -> {probs.round(3)}")

# Low temp (0.1): essentially one-hot [1.0, 0.0, 0.0]
# High temp (5.0): more uniform [0.39, 0.32, 0.29]

03

Forward Propagation

Forward propagation is the process of passing input data through the neural network to generate predictions. Data flows from the input layer through hidden layers to the output layer, with each neuron computing a weighted sum and applying an activation function. Understanding this flow is essential before learning how networks are trained.

Key Concept

Forward Pass

The forward pass transforms input data into output predictions by sequentially applying linear transformations (weights and biases) followed by non-linear activations at each layer.

For each layer: Z = W*A_prev + b (linear), A = activation(Z) (non-linear)

Step-by-Step Forward Propagation

Let us walk through forward propagation in a simple 2-layer network (one hidden layer, one output layer). We will trace how a single input sample flows through the network to produce a prediction.

import numpy as np

# Network architecture: 3 inputs -> 4 hidden neurons -> 2 outputs
np.random.seed(42)

# Initialize weights and biases
W1 = np.random.randn(4, 3) * 0.01  # Hidden layer: 4 neurons, 3 inputs each
b1 = np.zeros((4, 1))
W2 = np.random.randn(2, 4) * 0.01  # Output layer: 2 neurons, 4 inputs each
b2 = np.zeros((2, 1))

# Sample input (3 features)
X = np.array([[0.5], [0.8], [0.2]])  # Shape: (3, 1)

print(f"Input shape: {X.shape}")
print(f"W1 shape: {W1.shape}, b1 shape: {b1.shape}")
print(f"W2 shape: {W2.shape}, b2 shape: {b2.shape}")

Layer 1: Input to Hidden

The first layer receives the raw input and transforms it. Each hidden neuron computes a weighted sum of all inputs, adds its bias, and applies the activation function.

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward propagation through hidden layer
Z1 = np.dot(W1, X) + b1  # Linear transformation
A1 = relu(Z1)            # Apply ReLU activation

print(f"Z1 (pre-activation): shape {Z1.shape}")
print(f"A1 (post-activation): shape {A1.shape}")
print(f"Hidden layer output:\n{A1.flatten().round(4)}")

Layer 2: Hidden to Output

The output layer takes the hidden layer activations as input and produces the final predictions. For binary classification, we use sigmoid to get probabilities between 0 and 1.

# Forward propagation through output layer
Z2 = np.dot(W2, A1) + b2  # Linear transformation
A2 = sigmoid(Z2)          # Sigmoid for binary classification

print(f"Z2 (pre-activation): shape {Z2.shape}")
print(f"A2 (final output): shape {A2.shape}")
print(f"Predictions: {A2.flatten().round(4)}")
Matrix Dimensions: Always verify shapes match. For W of shape (neurons_out, neurons_in) and A of shape (neurons_in, samples), the output W*A has shape (neurons_out, samples).

Complete Forward Propagation Function

Let us encapsulate the entire forward pass in a reusable function. This function stores intermediate values (cache) which will be needed for backpropagation during training.

def forward_propagation(X, parameters):
    """
    Forward pass through a 2-layer network.
    Returns predictions and cache for backprop.
    """
    W1, b1, W2, b2 = parameters['W1'], parameters['b1'], parameters['W2'], parameters['b2']
    
    # Hidden layer
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    
    # Output layer
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)
    
    # Store values for backpropagation
    cache = {'Z1': Z1, 'A1': A1, 'Z2': Z2, 'A2': A2}
    
    return A2, cache

# Package parameters
params = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
predictions, cache = forward_propagation(X, params)
print(f"Final predictions: {predictions.flatten()}")

Batch Processing

In practice, we process multiple samples simultaneously (batch processing). This is more efficient because matrix operations can be parallelized. The forward propagation code works the same way, just with different input dimensions.

# Batch of 5 samples, each with 3 features
X_batch = np.random.randn(3, 5)  # Shape: (features, samples)

predictions_batch, _ = forward_propagation(X_batch, params)
print(f"Input batch shape: {X_batch.shape}")
print(f"Output shape: {predictions_batch.shape}")  # (2, 5) - 2 outputs per sample
print(f"Predictions for 5 samples:\n{predictions_batch.round(4)}")

Computing the Loss

After forward propagation, we compare predictions to actual labels using a loss function. For binary classification, we use binary cross-entropy loss. The goal of training is to minimize this loss.

def compute_loss(A2, Y):
    """
    Binary cross-entropy loss.
    A2: predictions, shape (1, m)
    Y: true labels, shape (1, m)
    """
    m = Y.shape[1]
    epsilon = 1e-8  # Prevent log(0)
    
    loss = -np.mean(Y * np.log(A2 + epsilon) + (1 - Y) * np.log(1 - A2 + epsilon))
    return loss

# Example: 5 samples with binary labels
Y = np.array([[1, 0, 1, 1, 0]])  # True labels
A2 = np.array([[0.9, 0.2, 0.8, 0.7, 0.1]])  # Predictions

loss = compute_loss(A2, Y)
print(f"Predictions: {A2}")
print(f"True labels: {Y}")
print(f"Binary Cross-Entropy Loss: {loss:.4f}")  # Lower is better

Step | Operation | Formula
1. Linear (Layer 1) | Weighted sum + bias | Z1 = W1 * X + b1
2. Activation (Layer 1) | Apply ReLU | A1 = max(0, Z1)
3. Linear (Layer 2) | Weighted sum + bias | Z2 = W2 * A1 + b2
4. Activation (Layer 2) | Apply Sigmoid | A2 = sigmoid(Z2)
5. Loss Computation | Compare with labels | L = cross_entropy(A2, Y)

Practice: Forward Propagation

Task: Given input X=[1, 2], weights W=[[0.5, 0.5], [0.3, 0.7]], and bias b=[0, 0], manually calculate Z, then apply ReLU to get the hidden layer output.

Solution:
import numpy as np

X = np.array([[1], [2]])  # Shape (2, 1)
W = np.array([[0.5, 0.5], 
              [0.3, 0.7]])  # Shape (2, 2)
b = np.array([[0], [0]])  # Shape (2, 1)

# Linear transformation
Z = np.dot(W, X) + b
print(f"Z = W @ X + b")
print(f"Z[0] = 0.5*1 + 0.5*2 + 0 = {Z[0,0]}")  # 1.5
print(f"Z[1] = 0.3*1 + 0.7*2 + 0 = {Z[1,0]}")  # 1.7

# ReLU activation
A = np.maximum(0, Z)
print(f"\nA = ReLU(Z) = {A.flatten()}")  # [1.5, 1.7]

Task: Extend the forward propagation to handle a 3-layer network (2 hidden layers + output). Use ReLU for hidden layers and sigmoid for output.

Solution:
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_3layer(X, params):
    W1, b1 = params['W1'], params['b1']
    W2, b2 = params['W2'], params['b2']
    W3, b3 = params['W3'], params['b3']
    
    # Hidden layer 1
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    
    # Hidden layer 2
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    
    # Output layer
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)
    
    return A3

# Example: 4 inputs -> 8 hidden -> 4 hidden -> 1 output
params = {
    'W1': np.random.randn(8, 4) * 0.01, 'b1': np.zeros((8, 1)),
    'W2': np.random.randn(4, 8) * 0.01, 'b2': np.zeros((4, 1)),
    'W3': np.random.randn(1, 4) * 0.01, 'b3': np.zeros((1, 1))
}

X = np.random.randn(4, 1)
output = forward_3layer(X, params)
print(f"Output: {output[0,0]:.4f}")

Task: Create a general forward propagation function that works for networks with any number of layers. Store parameters as a list and use a loop to process each layer.

Solution:
import numpy as np

def forward_deep(X, parameters, activations):
    """
    Forward propagation for arbitrary depth.
    parameters: list of (W, b) tuples
    activations: list of activation function names
    """
    A = X
    caches = []
    
    for i, ((W, b), act) in enumerate(zip(parameters, activations)):
        Z = np.dot(W, A) + b
        
        if act == 'relu':
            A = np.maximum(0, Z)
        elif act == 'sigmoid':
            A = 1 / (1 + np.exp(-Z))
        elif act == 'tanh':
            A = np.tanh(Z)
        else:
            A = Z  # Linear
        
        caches.append((Z, A))
    
    return A, caches

# 5-layer network: 10 -> 64 -> 32 -> 16 -> 8 -> 1
layer_dims = [10, 64, 32, 16, 8, 1]
params = [(np.random.randn(layer_dims[i+1], layer_dims[i]) * 0.01,
           np.zeros((layer_dims[i+1], 1))) 
          for i in range(len(layer_dims)-1)]
acts = ['relu', 'relu', 'relu', 'relu', 'sigmoid']

X = np.random.randn(10, 1)
output, _ = forward_deep(X, params, acts)
print(f"5-layer output: {output[0,0]:.4f}")
04

Backpropagation and Gradient Descent

Backpropagation is the algorithm that enables neural networks to learn. It calculates how much each weight contributed to the prediction error, then uses gradient descent to update weights in the direction that reduces the error. This elegant combination of calculus and optimization is what makes deep learning possible.

Key Concept

Backpropagation

Backpropagation computes the gradient of the loss function with respect to each weight by applying the chain rule of calculus, propagating error gradients backward from the output layer to the input layer.

Chain Rule: dL/dW = dL/dA * dA/dZ * dZ/dW

The Chain Rule of Calculus

The chain rule allows us to compute derivatives of composite functions. In neural networks, the output is a composition of many functions (layers). The chain rule tells us how to find the derivative of this composition with respect to any parameter.

# Chain rule example: f(g(x)) derivative
# If y = f(g(x)), then dy/dx = df/dg * dg/dx

import numpy as np

# Example: y = (2x + 1)^2
# Let g(x) = 2x + 1, f(g) = g^2
# dy/dx = df/dg * dg/dx = 2g * 2 = 4(2x + 1)

x = 3
g = 2 * x + 1  # g = 7
df_dg = 2 * g  # = 14
dg_dx = 2
dy_dx = df_dg * dg_dx  # = 28

print(f"x = {x}")
print(f"dy/dx using chain rule: {dy_dx}")  # 28

Computing Gradients for Output Layer

We start from the output and work backward. First, we compute the gradient of the loss with respect to the output layer's pre-activation (Z2), then use that to find gradients for W2 and b2.

def backward_output_layer(A2, Y, A1):
    """
    Compute gradients for output layer (with sigmoid activation).
    A2: predictions, Y: true labels, A1: previous layer activations
    """
    m = Y.shape[1]  # Number of samples
    
    # Gradient of loss w.r.t. Z2 (simplified for sigmoid + cross-entropy)
    dZ2 = A2 - Y  # Shape: (1, m)
    
    # Gradient of loss w.r.t. W2
    dW2 = (1/m) * np.dot(dZ2, A1.T)  # Shape: (1, hidden_size)
    
    # Gradient of loss w.r.t. b2
    db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)  # Shape: (1, 1)
    
    return dZ2, dW2, db2

Computing Gradients for Hidden Layer

The hidden layer gradients are computed by propagating the error backward through the network. We use the gradients from the output layer (dZ2) and the weights connecting hidden to output (W2).

def relu_derivative(Z):
    return (Z > 0).astype(float)

def backward_hidden_layer(dZ2, W2, Z1, X):
    """
    Compute gradients for hidden layer (with ReLU activation).
    """
    m = X.shape[1]
    
    # Propagate gradient backward through W2
    dA1 = np.dot(W2.T, dZ2)
    
    # Gradient through ReLU
    dZ1 = dA1 * relu_derivative(Z1)
    
    # Gradients for W1 and b1
    dW1 = (1/m) * np.dot(dZ1, X.T)
    db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
    
    return dW1, db1
Gradient Flow: Gradients flow backward through the network. Each layer receives gradients from the next layer, computes its own gradients, and passes them to the previous layer.

Complete Backpropagation

Now let us combine everything into a complete backpropagation function that computes gradients for all parameters in our 2-layer network.

def backward_propagation(X, Y, params, cache):
    """
    Complete backpropagation for 2-layer network.
    Returns gradients for all parameters.
    """
    m = X.shape[1]
    W1, W2 = params['W1'], params['W2']
    Z1, A1, Z2, A2 = cache['Z1'], cache['A1'], cache['Z2'], cache['A2']
    
    # Output layer gradients
    dZ2 = A2 - Y
    dW2 = (1/m) * np.dot(dZ2, A1.T)
    db2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
    
    # Hidden layer gradients
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = dA1 * (Z1 > 0)  # ReLU derivative
    dW1 = (1/m) * np.dot(dZ1, X.T)
    db1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
    
    gradients = {'dW1': dW1, 'db1': db1, 'dW2': dW2, 'db2': db2}
    return gradients

Gradient Descent Update

Once we have gradients, we update weights using gradient descent. The learning rate controls how large each update step is. We subtract the gradient (times learning rate) because we want to move in the direction that decreases the loss.

def update_parameters(params, gradients, learning_rate):
    """
    Update parameters using gradient descent.
    New weight = old weight - learning_rate * gradient
    """
    params['W1'] -= learning_rate * gradients['dW1']
    params['b1'] -= learning_rate * gradients['db1']
    params['W2'] -= learning_rate * gradients['dW2']
    params['b2'] -= learning_rate * gradients['db2']
    
    return params

# Example update
learning_rate = 0.01
params = update_parameters(params, gradients, learning_rate)
print("Parameters updated!")

Learning Rate Too High

  • Updates are too large
  • May overshoot the minimum
  • Loss can oscillate or diverge
  • Training becomes unstable

Learning Rate Too Low

  • Updates are too small
  • Training is very slow
  • May get stuck in local minima
  • Requires many more iterations
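Both failure modes are easy to see on a toy objective. The sketch below runs gradient descent on f(w) = w^2, whose gradient is 2w, with a too-high, a too-low, and a reasonable learning rate:

```python
def descend(lr, steps=20, w0=1.0):
    """Run `steps` gradient descent updates on f(w) = w**2."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w  # gradient of w**2 is 2w
    return w

for lr in [1.1, 0.001, 0.1]:
    print(f"lr={lr:<6} final w = {descend(lr):.4g}")

# lr=1.1  : each step multiplies w by (1 - 2*1.1) = -1.2 -> |w| grows, diverges
# lr=0.001: w barely moves after 20 steps -> very slow progress
# lr=0.1  : w shrinks steadily toward the minimum at w = 0
```

The update factor (1 - 2*lr) tells the whole story here: training diverges whenever its magnitude exceeds 1, and crawls when it is barely below 1.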

The Complete Training Loop

Training a neural network involves repeating forward propagation, loss computation, backpropagation, and parameter updates for many iterations (epochs). Here is a complete training loop that brings everything together.

def train_network(X, Y, layer_sizes, learning_rate=0.01, epochs=1000):
    """
    Complete training loop for a 2-layer neural network.
    """
    np.random.seed(42)
    n_input, n_hidden, n_output = layer_sizes
    
    # Initialize parameters
    params = {
        'W1': np.random.randn(n_hidden, n_input) * 0.01,
        'b1': np.zeros((n_hidden, 1)),
        'W2': np.random.randn(n_output, n_hidden) * 0.01,
        'b2': np.zeros((n_output, 1))
    }
    
    losses = []
    
    for epoch in range(epochs):
        # Forward propagation
        A2, cache = forward_propagation(X, params)
        
        # Compute loss
        loss = compute_loss(A2, Y)
        losses.append(loss)
        
        # Backward propagation
        gradients = backward_propagation(X, Y, params, cache)
        
        # Update parameters
        params = update_parameters(params, gradients, learning_rate)
        
        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Loss = {loss:.4f}")
    
    return params, losses

Gradient Descent Variants

Standard gradient descent uses all training samples to compute each update. In practice, we use variants that are more efficient and often lead to better convergence.

Variant | Batch Size | Characteristics
Batch GD | All samples | Stable but slow, memory intensive
Stochastic GD | 1 sample | Fast updates, noisy gradients
Mini-batch GD | 32-256 samples | Best of both worlds, most common
Modern Optimizers: In practice, we use advanced optimizers like Adam, RMSprop, or AdaGrad that adapt the learning rate for each parameter and often converge faster than vanilla gradient descent.
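Each mini-batch update uses only a slice of the data. Here is a minimal sketch of the batching logic (the `minibatches` helper is illustrative, not part of the training code above), assuming samples are stored as columns as in this module:

```python
import numpy as np

def minibatches(X, Y, batch_size=32, seed=0):
    """Shuffle once, then yield (X_batch, Y_batch) column slices."""
    m = X.shape[1]                       # number of samples (columns)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]        # shuffle sample columns together
    for start in range(0, m, batch_size):
        yield X[:, start:start+batch_size], Y[:, start:start+batch_size]

X = np.random.randn(2, 200)
Y = (X.sum(axis=0) > 0).astype(int).reshape(1, -1)

sizes = [xb.shape[1] for xb, yb in minibatches(X, Y)]
print(sizes)  # [32, 32, 32, 32, 32, 32, 8] -- the last batch is smaller
```

Inside the training loop, each epoch would iterate over these batches and run forward propagation, backpropagation, and a parameter update per batch rather than once per epoch.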

Putting It All Together

Let us train a complete neural network on a simple classification problem to see all the concepts in action.

# Generate simple classification data
np.random.seed(42)
X = np.random.randn(2, 200)  # 200 samples, 2 features
Y = ((X[0] + X[1]) > 0).astype(int).reshape(1, -1)  # Label based on sum

# Train the network
layer_sizes = (2, 4, 1)  # 2 inputs, 4 hidden, 1 output
params, losses = train_network(X, Y, layer_sizes, learning_rate=0.5, epochs=1000)

# Evaluate
predictions, _ = forward_propagation(X, params)
accuracy = np.mean((predictions > 0.5) == Y)
print(f"\nFinal Accuracy: {accuracy * 100:.1f}%")

Practice: Backpropagation and Gradient Descent

Task: For the function y = (x*w + b)^2, compute dy/dw and dy/db using the chain rule. Then verify numerically with x=2, w=3, b=1.

Solution:
# Analytical solution using chain rule:
# Let z = x*w + b, then y = z^2
# dy/dz = 2z
# dz/dw = x
# dz/db = 1
# dy/dw = dy/dz * dz/dw = 2z * x = 2(x*w + b) * x
# dy/db = dy/dz * dz/db = 2z * 1 = 2(x*w + b)

x, w, b = 2, 3, 1

z = x * w + b  # z = 7
y = z ** 2     # y = 49

dy_dw = 2 * z * x  # 2 * 7 * 2 = 28
dy_db = 2 * z      # 2 * 7 = 14

print(f"z = x*w + b = {z}")
print(f"y = z^2 = {y}")
print(f"dy/dw = {dy_dw}")
print(f"dy/db = {dy_db}")

# Numerical verification (small perturbation)
eps = 1e-5
y_plus_w = ((x * (w + eps) + b) ** 2)
numerical_dy_dw = (y_plus_w - y) / eps
print(f"Numerical dy/dw: {numerical_dy_dw:.1f}")
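The forward difference above has error on the order of eps. A central difference, which evaluates the function on both sides of w, has error on the order of eps squared and is generally preferred for gradient checking. A quick comparison on the same function:

```python
x, w, b, eps = 2, 3, 1, 1e-5

def f(w_):
    # y = (x*w + b)^2 viewed as a function of w alone
    return (x * w_ + b) ** 2

forward = (f(w + eps) - f(w)) / eps              # error O(eps)
central = (f(w + eps) - f(w - eps)) / (2 * eps)  # error O(eps^2)

print(f"forward difference: {forward:.8f}")
print(f"central difference: {central:.8f}  (true value: 28)")
```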

Task: Implement numerical gradient checking to verify backpropagation is correct. Compare analytical gradients from backprop with numerical gradients computed using finite differences.

Solution:
import numpy as np

def numerical_gradient(params, X, Y, param_name, i, j, eps=1e-7):
    """Compute numerical gradient using finite differences."""
    original = params[param_name][i, j]
    
    # f(x + eps)
    params[param_name][i, j] = original + eps
    A2_plus, _ = forward_propagation(X, params)
    loss_plus = compute_loss(A2_plus, Y)
    
    # f(x - eps)
    params[param_name][i, j] = original - eps
    A2_minus, _ = forward_propagation(X, params)
    loss_minus = compute_loss(A2_minus, Y)
    
    # Restore
    params[param_name][i, j] = original
    
    return (loss_plus - loss_minus) / (2 * eps)

# Compare gradients (2 input features, matching the trained network's input size)
X = np.random.randn(2, 5)
Y = np.random.randint(0, 2, (1, 5))

A2, cache = forward_propagation(X, params)
grads = backward_propagation(X, Y, params, cache)

# Check W1[0,0]
num_grad = numerical_gradient(params, X, Y, 'W1', 0, 0)
ana_grad = grads['dW1'][0, 0]
diff = abs(num_grad - ana_grad) / max(abs(num_grad), abs(ana_grad))
print(f"Numerical: {num_grad:.6f}, Analytical: {ana_grad:.6f}")
print(f"Relative difference: {diff:.2e}")
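To interpret the relative difference printed above, a common heuristic (an informal rule of thumb, not part of the lesson's code) is: below 1e-7 the backprop implementation is almost certainly correct, between 1e-7 and 1e-4 it deserves a closer look, and above 1e-4 there is likely a bug:

```python
def check_gradient(diff):
    """Classify a relative difference between numerical and
    analytical gradients using common heuristic thresholds."""
    if diff < 1e-7:
        return 'backprop looks correct'
    elif diff < 1e-4:
        return 'possibly OK, inspect further'
    return 'likely a bug in backprop'

print(check_gradient(3e-9))  # backprop looks correct
print(check_gradient(2e-3))  # likely a bug in backprop
```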

Task: Create a 3D visualization showing gradient descent on a simple loss surface L(w1, w2) = w1^2 + w2^2. Plot the path taken by gradient descent from a random starting point to the minimum at (0, 0).

Solution:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def loss(w1, w2):
    return w1**2 + w2**2

def gradient(w1, w2):
    return 2*w1, 2*w2

# Gradient descent
w1, w2 = 4.0, 3.0
lr = 0.1
path = [(w1, w2, loss(w1, w2))]

for _ in range(20):
    dw1, dw2 = gradient(w1, w2)
    w1 -= lr * dw1
    w2 -= lr * dw2
    path.append((w1, w2, loss(w1, w2)))

# Plotting
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Loss surface
W1, W2 = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
L = loss(W1, W2)
ax.plot_surface(W1, W2, L, alpha=0.5, cmap='viridis')

# GD path
path = np.array(path)
ax.plot(path[:,0], path[:,1], path[:,2], 'r.-', markersize=10, linewidth=2)
ax.set_xlabel('w1'); ax.set_ylabel('w2'); ax.set_zlabel('Loss')
plt.title('Gradient Descent on Loss Surface')
plt.show()
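The learning rate lr = 0.1 used above controls the step size along this surface. On the same bowl-shaped loss, reduced to one dimension L(w) = w^2, a small experiment (illustrative values) shows the three regimes: a small rate converges gradually, lr = 0.5 lands on the minimum in one step for this particular loss, and too large a rate diverges:

```python
def descend(lr, w=4.0, steps=20):
    """Run gradient descent on L(w) = w^2 from a fixed start."""
    for _ in range(steps):
        w -= lr * 2 * w  # gradient of w^2 is 2w
    return w

for lr in (0.1, 0.5, 1.1):
    print(f"lr = {lr}: final w = {descend(lr):.3g}")
```

Each step multiplies w by (1 - 2*lr), so the iterates shrink when that factor has magnitude below 1 and blow up when it exceeds 1.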

Key Takeaways

Biological Inspiration

Artificial neurons mimic biological neurons, receiving inputs, applying weights, and producing outputs through activation

Perceptron Foundation

The perceptron is the simplest neural network unit, combining weighted inputs with a bias and activation function

Non-linear Activations

Activation functions like ReLU, Sigmoid, and Tanh enable networks to learn complex non-linear patterns

Forward Propagation

Data flows forward through layers, with each neuron computing weighted sums and applying activations

Backpropagation

Errors propagate backward through the network, computing gradients for each weight using the chain rule

Gradient Descent

Weights are updated iteratively using gradients, moving down the loss surface toward optimal values

Knowledge Check

Test your understanding of neural network fundamentals:

1. What are the main components of an artificial neuron (perceptron)?
2. Why are activation functions necessary in neural networks?
3. Which activation function is most commonly used in hidden layers of modern deep networks?
4. What happens during forward propagation?
5. What mathematical technique does backpropagation use to compute gradients?
6. In gradient descent, what does the learning rate control?