Recurrent Neural Networks | AI Course | ClassNotes.Site
Module 3.3

Recurrent Neural Networks

Unlock the power of sequential data processing. Learn how RNNs maintain memory across time steps, master LSTM and GRU architectures that solve vanishing gradients, build time series forecasting models, and discover attention mechanisms that revolutionized NLP.

65 min read
Intermediate
Hands-on
What You'll Learn
  • Sequential data and temporal dependencies
  • Vanilla RNN architecture and limitations
  • LSTM and GRU gating mechanisms
  • Time series forecasting with RNNs
  • Attention mechanisms and self-attention
Contents
01

Sequence Modeling

Many real-world problems involve sequential data where order matters. Text is a sequence of words, audio is a sequence of samples, and stock prices are sequences of values over time. Unlike feedforward networks that process fixed-size inputs independently, Recurrent Neural Networks maintain a hidden state that captures information from previous time steps, enabling them to model temporal dependencies.

Sequential Data

Sequential data is any data where the order of elements carries meaning. Examples include text (word order determines meaning), time series (temporal patterns), audio (waveform evolution), and video (frame sequences). The key characteristic is that elements are not independent; each depends on what came before.

Why Traditional Networks Fail

Feedforward neural networks treat each input independently with no memory of previous inputs. For the sentence "The cat sat on the ___", a feedforward network cannot use the context of previous words to predict "mat". RNNs solve this by maintaining hidden state that passes information forward through time.

# The problem with feedforward networks for sequences
import numpy as np

# Toy shared weights so both snippets run (illustrative sizes)
embedding_dim, hidden_dim, output_dim = 10, 32, 5
W_input = np.random.randn(embedding_dim, hidden_dim) * 0.01
W_hidden = np.random.randn(hidden_dim, hidden_dim) * 0.01
W_output = np.random.randn(hidden_dim, output_dim) * 0.01
b = np.zeros(hidden_dim)

# Feedforward: each input processed independently
def feedforward_predict(word_embedding):
    """No context from previous words."""
    # Each word processed in isolation
    hidden = np.tanh(np.dot(word_embedding, W_input) + b)
    output = np.dot(hidden, W_output)
    return output

# RNN: maintains hidden state across time steps
def rnn_predict(word_embedding, previous_hidden):
    """Uses context from previous words via hidden state."""
    # Combine current input with previous hidden state
    hidden = np.tanh(
        np.dot(word_embedding, W_input) +
        np.dot(previous_hidden, W_hidden) +  # Memory!
        b
    )
    output = np.dot(hidden, W_output)
    return output, hidden  # Pass hidden to next step

The Recurrent Neural Network Architecture

An RNN processes sequences one element at a time. At each time step t, it takes the current input x_t and the previous hidden state h_(t-1), computes a new hidden state h_t, and optionally produces an output y_t. The same weights are shared across all time steps.

# Vanilla RNN implementation from scratch
import numpy as np

class SimpleRNN:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights
        self.Wxh = np.random.randn(input_size, hidden_size) * 0.01
        self.Whh = np.random.randn(hidden_size, hidden_size) * 0.01
        self.Why = np.random.randn(hidden_size, output_size) * 0.01
        self.bh = np.zeros((1, hidden_size))
        self.by = np.zeros((1, output_size))
        self.hidden_size = hidden_size
    
    def forward(self, inputs):
        """
        Process a sequence of inputs.
        inputs: list of input vectors [x_1, x_2, ..., x_T]
        """
        h = np.zeros((1, self.hidden_size))  # Initial hidden state
        self.hidden_states = [h]
        outputs = []
        
        for x in inputs:
            # RNN step: h_t = tanh(x_t @ Wxh + h_(t-1) @ Whh + bh)
            h = np.tanh(np.dot(x, self.Wxh) + np.dot(h, self.Whh) + self.bh)
            self.hidden_states.append(h)
            
            # Output: y_t = h_t @ Why + by
            y = np.dot(h, self.Why) + self.by
            outputs.append(y)
        
        return outputs, h  # Return all outputs and final hidden state

# Example usage
rnn = SimpleRNN(input_size=10, hidden_size=32, output_size=5)
sequence = [np.random.randn(1, 10) for _ in range(20)]  # 20 time steps
outputs, final_hidden = rnn.forward(sequence)

Hidden State

The hidden state h_t is a vector that encodes information about all previous inputs seen so far. It acts as the network's "memory". At each time step, the hidden state is updated based on the current input and the previous hidden state: h_t = f(W_xh * x_t + W_hh * h_(t-1) + b). This recurrence is what gives RNNs their ability to model sequences.
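
This "memory" effect can be seen in a toy sketch (hypothetical weights, toy sizes, bias omitted for brevity): feeding the same current input after two different histories produces two different hidden states, because the state encodes the past.

```python
import numpy as np

np.random.seed(1)
Wxh = np.random.randn(4, 8) * 0.5   # input-to-hidden weights (toy sizes)
Whh = np.random.randn(8, 8) * 0.5   # hidden-to-hidden weights

def step(x, h):
    # One recurrence step: h_t = tanh(x_t @ Wxh + h_(t-1) @ Whh)
    return np.tanh(x @ Wxh + h @ Whh)

h0 = np.zeros(8)
x_now = np.ones(4)  # identical current input in both runs

h_after_zero_history = step(x_now, step(np.zeros(4), h0))
h_after_other_history = step(x_now, step(2 * np.ones(4), h0))

# Same current input, different hidden states: the state carries history
print(np.allclose(h_after_zero_history, h_after_other_history))  # False
```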

Building RNNs with Keras

Keras provides high-level RNN layers that handle the recurrence automatically. The SimpleRNN layer processes sequences and can return either the final hidden state or outputs at every time step.

# RNN layers in Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, Embedding

# Text classification with RNN
def create_text_classifier(vocab_size, embedding_dim, hidden_units, num_classes):
    """
    RNN for sentiment analysis or text classification.
    """
    model = Sequential([
        # Convert word indices to dense vectors
        Embedding(vocab_size, embedding_dim, input_length=100),
        
        # RNN layer processes the sequence
        # return_sequences=False: only return final hidden state
        SimpleRNN(hidden_units, return_sequences=False),
        
        # Classification head
        Dense(64, activation='relu'),
        Dense(num_classes, activation='softmax')
    ])
    
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

# Create model
model = create_text_classifier(
    vocab_size=10000,
    embedding_dim=128,
    hidden_units=64,
    num_classes=3  # Positive, Negative, Neutral
)
model.summary()

Sequence-to-Sequence Patterns

RNNs can be configured for different input-output relationships: many-to-one (classification), one-to-many (generation), and many-to-many (translation). The architecture depends on whether you need outputs at every step or just the final prediction.

# Different RNN configurations
from tensorflow.keras.layers import SimpleRNN, Dense, TimeDistributed
from tensorflow.keras.models import Sequential

# Many-to-One: Sequence classification (sentiment analysis)
# Input: sequence of words, Output: single class
many_to_one = Sequential([
    SimpleRNN(64, return_sequences=False, input_shape=(100, 128)),
    Dense(3, activation='softmax')
])

# Many-to-Many (same length): Sequence labeling (POS tagging)
# Input: sequence of words, Output: tag for each word
many_to_many = Sequential([
    SimpleRNN(64, return_sequences=True, input_shape=(100, 128)),
    TimeDistributed(Dense(20, activation='softmax'))  # 20 POS tags
])

# Stacked RNNs for more capacity
stacked_rnn = Sequential([
    SimpleRNN(64, return_sequences=True, input_shape=(100, 128)),
    SimpleRNN(64, return_sequences=True),  # Stack another layer
    SimpleRNN(32, return_sequences=False),  # Final layer
    Dense(10, activation='softmax')
])

print("Many-to-One output:", many_to_one.output_shape)
print("Many-to-Many output:", many_to_many.output_shape)

The Vanishing Gradient Problem

Vanilla RNNs struggle with long sequences because gradients either vanish (shrink to zero) or explode (grow unbounded) when backpropagating through many time steps. After about 10-20 steps, the gradient becomes too small to update early weights, making it impossible to learn long-range dependencies. This is why LSTM and GRU were invented.
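
A minimal sketch of why this happens, assuming (for illustration) we ignore the tanh derivative factors, which only shrink gradients further: backpropagating through T steps multiplies the gradient by the recurrent matrix T times, so its norm decays (or grows) geometrically.

```python
import numpy as np

np.random.seed(0)
hidden_size = 32
# Small-scale initialization, as in the SimpleRNN class above
Whh = np.random.randn(hidden_size, hidden_size) * 0.01

# The gradient reaching a step T positions back is (roughly) a product
# of T copies of Whh transposed applied to the upstream gradient
grad = np.ones(hidden_size)
norms = []
for _ in range(20):
    grad = Whh.T @ grad
    norms.append(np.linalg.norm(grad))

# Norm collapses toward zero after ~20 steps
print(f"after 1 step: {norms[0]:.2e}, after 20 steps: {norms[-1]:.2e}")
```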

Practice: Sequence Modeling

Q: What role does the hidden state play in an RNN?
Answer: The hidden state serves as the network's memory, encoding information about all previous inputs in the sequence. It allows the RNN to use context from earlier time steps when processing the current input, enabling the network to model temporal dependencies.

Q: What does return_sequences=True do, and when should you use it?
Answer: When return_sequences=True, the RNN outputs the hidden state at every time step, resulting in a 3D output (batch, timesteps, features). Use it when you need outputs at each position (sequence labeling, stacking RNN layers). When False, only the final hidden state is returned (batch, features), used for sequence classification.

Q: Why do RNNs share the same weights across all time steps?
Answer: Weight sharing serves multiple purposes: (1) It allows RNNs to process sequences of any length with a fixed number of parameters. (2) It enforces the assumption that the same transformation should apply at every position. (3) It dramatically reduces parameters compared to having separate weights per timestep. (4) It enables the network to generalize patterns learned at one position to other positions in the sequence.

02

LSTM and GRU Cells

Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures revolutionized sequence modeling by solving the vanishing gradient problem. They use gating mechanisms to control information flow, allowing networks to selectively remember or forget information over long sequences. These cells have become the workhorses of modern NLP and time series applications.

Long Short-Term Memory (LSTM)

An LSTM cell maintains two states: the hidden state h_t and the cell state c_t. The cell state acts as a "conveyor belt" that carries information across time steps with minimal modification. Three gates control information flow: the forget gate decides what to discard, the input gate decides what to store, and the output gate decides what to output. This architecture enables learning dependencies over hundreds of time steps.

LSTM Gates Explained

Each LSTM gate is a neural network layer with sigmoid activation that outputs values between 0 (block everything) and 1 (let everything through). The gates work together to selectively update the cell state and produce the output.

# LSTM cell implementation from scratch
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

class LSTMCell:
    def __init__(self, input_size, hidden_size):
        # One weight matrix per gate, each applied to [x, h_prev]
        # (production implementations often fuse these into one matrix for speed)
        self.Wf = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        self.Wi = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        self.Wc = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        self.Wo = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        self.bf = np.zeros((1, hidden_size))
        self.bi = np.zeros((1, hidden_size))
        self.bc = np.zeros((1, hidden_size))
        self.bo = np.zeros((1, hidden_size))
    
    def forward(self, x, h_prev, c_prev):
        """
        Single LSTM step.
        x: current input (batch, input_size)
        h_prev: previous hidden state (batch, hidden_size)
        c_prev: previous cell state (batch, hidden_size)
        """
        # Concatenate input and previous hidden state
        combined = np.concatenate([x, h_prev], axis=1)
        
        # Forget gate: what to discard from cell state
        f = sigmoid(np.dot(combined, self.Wf) + self.bf)
        
        # Input gate: what new info to store
        i = sigmoid(np.dot(combined, self.Wi) + self.bi)
        
        # Candidate values: potential new content
        c_tilde = np.tanh(np.dot(combined, self.Wc) + self.bc)
        
        # Update cell state: forget old + add new
        c = f * c_prev + i * c_tilde
        
        # Output gate: what to output
        o = sigmoid(np.dot(combined, self.Wo) + self.bo)
        
        # Hidden state: filtered cell state
        h = o * np.tanh(c)
        
        return h, c

# Example usage
lstm_cell = LSTMCell(input_size=128, hidden_size=256)
h = np.zeros((1, 256))
c = np.zeros((1, 256))
x = np.random.randn(1, 128)
h_new, c_new = lstm_cell.forward(x, h, c)

Using LSTM in Keras

Keras provides optimized LSTM layers that are easy to use. The layer handles the recurrence internally and supports GPU acceleration through cuDNN when available.

# LSTM layers in Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout, Bidirectional

# Sentiment analysis with LSTM
def create_sentiment_model(vocab_size=10000, max_length=200):
    """LSTM for binary sentiment classification."""
    model = Sequential([
        Embedding(vocab_size, 128, input_length=max_length),
        
        # LSTM layer with dropout for regularization
        LSTM(128, dropout=0.2, recurrent_dropout=0.2),
        
        Dense(64, activation='relu'),
        Dropout(0.5),
        Dense(1, activation='sigmoid')  # Binary classification
    ])
    
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    return model

# Stacked LSTM for more capacity
def create_deep_lstm(vocab_size=10000, max_length=200):
    """Multi-layer LSTM for complex sequences."""
    model = Sequential([
        Embedding(vocab_size, 128, input_length=max_length),
        
        # First LSTM: return sequences for stacking
        LSTM(128, return_sequences=True, dropout=0.2),
        
        # Second LSTM: return sequences for stacking  
        LSTM(64, return_sequences=True, dropout=0.2),
        
        # Final LSTM: return only last output
        LSTM(32, dropout=0.2),
        
        Dense(1, activation='sigmoid')
    ])
    return model

model = create_sentiment_model()
model.summary()

Gated Recurrent Unit (GRU)

The GRU simplifies LSTM by combining the forget and input gates into a single update gate and merging the cell and hidden states. It uses only two gates: the reset gate controls how much past information to forget, and the update gate controls how much new information to incorporate. GRU has fewer parameters than LSTM while achieving comparable performance on many tasks.

GRU Architecture

GRU achieves similar performance to LSTM with a simpler architecture. The reduced complexity means faster training and less memory usage, making GRU a popular choice when resources are limited.

# GRU cell implementation
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

class GRUCell:
    def __init__(self, input_size, hidden_size):
        # Reset gate weights
        self.Wr = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        # Update gate weights
        self.Wz = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        # Candidate hidden state weights
        self.Wh = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        
        self.br = np.zeros((1, hidden_size))
        self.bz = np.zeros((1, hidden_size))
        self.bh = np.zeros((1, hidden_size))
    
    def forward(self, x, h_prev):
        """
        GRU step: simpler than LSTM, no separate cell state.
        """
        combined = np.concatenate([x, h_prev], axis=1)
        
        # Reset gate: how much past to forget
        r = sigmoid(np.dot(combined, self.Wr) + self.br)
        
        # Update gate: balance between old and new
        z = sigmoid(np.dot(combined, self.Wz) + self.bz)
        
        # Candidate hidden state with reset gate applied
        combined_reset = np.concatenate([x, r * h_prev], axis=1)
        h_tilde = np.tanh(np.dot(combined_reset, self.Wh) + self.bh)
        
        # Final hidden state: interpolate between old and new
        h = (1 - z) * h_prev + z * h_tilde
        
        return h

# GRU has no cell state, just hidden state
gru_cell = GRUCell(input_size=128, hidden_size=256)
h = np.zeros((1, 256))
x = np.random.randn(1, 128)
h_new = gru_cell.forward(x, h)

Bidirectional RNNs

Bidirectional RNNs process sequences in both forward and backward directions, capturing context from both past and future. This is particularly useful for tasks like named entity recognition where the meaning of a word depends on both preceding and following words.

# Bidirectional LSTM and GRU in Keras
from tensorflow.keras.layers import (
    Bidirectional, LSTM, GRU, Dense, Embedding, TimeDistributed
)
from tensorflow.keras.models import Sequential

# Bidirectional LSTM for NER or sequence labeling
def create_bidirectional_model(vocab_size, max_length, num_tags):
    """
    Bidirectional LSTM for sequence labeling.
    Captures both left and right context for each token.
    """
    model = Sequential([
        Embedding(vocab_size, 128, input_length=max_length),
        
        # Bidirectional: runs LSTM forward AND backward
        # Output size is 2 * hidden_size (concatenated)
        Bidirectional(LSTM(64, return_sequences=True)),
        
        # Second bidirectional layer
        Bidirectional(LSTM(32, return_sequences=True)),
        
        # Output tag for each position
        TimeDistributed(Dense(num_tags, activation='softmax'))
    ])
    
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

# Compare LSTM vs GRU
from tensorflow.keras.layers import TimeDistributed

# GRU variant (fewer parameters, often similar performance)
gru_model = Sequential([
    Embedding(10000, 128, input_length=100),
    Bidirectional(GRU(64, return_sequences=True)),
    Bidirectional(GRU(32, return_sequences=False)),
    Dense(64, activation='relu'),
    Dense(5, activation='softmax')
])

print("GRU model parameters:", gru_model.count_params())

LSTM vs GRU: When to Choose Which

LSTM generally performs better on tasks requiring very long-term memory (1000+ steps) due to its separate cell state. GRU is faster to train and often sufficient for shorter sequences. In practice, try both and compare on your specific task. GRU is a good default for prototyping; switch to LSTM if you need more capacity. Modern alternatives like Transformers often outperform both on NLP tasks.
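
The parameter difference can be sketched with the textbook cell equations above (Keras's actual counts can differ slightly, e.g. its GRU defaults to a `reset_after` bias variant): an LSTM has four gate blocks acting on the concatenated [input, hidden] vector, a GRU only three.

```python
def lstm_params(n_input, n_hidden):
    # 4 blocks (forget, input, candidate, output):
    # each has (n_input + n_hidden) * n_hidden weights + n_hidden biases
    return 4 * ((n_input + n_hidden) * n_hidden + n_hidden)

def gru_params(n_input, n_hidden):
    # 3 blocks (reset, update, candidate)
    return 3 * ((n_input + n_hidden) * n_hidden + n_hidden)

print(lstm_params(128, 64))  # 49408
print(gru_params(128, 64))   # 37056
```

For the same sizes, the GRU carries 25% fewer recurrent parameters, which is where its speed and memory advantage comes from.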

Practice: LSTM and GRU

Q: What does the forget gate do in an LSTM?
Answer: The forget gate decides what information to discard from the cell state. It outputs values between 0 (forget completely) and 1 (keep fully) for each element in the cell state. This allows the LSTM to clear irrelevant information and make room for new content.

Q: How does a bidirectional LSTM work, and what is its output dimension?
Answer: A bidirectional LSTM runs two separate LSTMs: one processing the sequence forward and one backward. Their outputs are concatenated at each time step, doubling the output dimension. If each LSTM has 64 hidden units, the bidirectional output is 128 dimensions (64 forward + 64 backward).

Q: How does the cell state help LSTMs avoid vanishing gradients?
Answer: The cell state creates a highway for gradients to flow unchanged across time steps. When the forget gate is 1 and input gate is 0, the cell state is copied exactly: c_t = c_(t-1). During backpropagation, this creates a gradient of 1, avoiding the multiplicative decay that causes vanishing gradients. The gates learn when to allow this direct flow, enabling the network to maintain gradients over hundreds of steps.

03

Time Series Prediction

Time series forecasting is one of the most practical applications of RNNs. From predicting stock prices and weather patterns to forecasting energy demand and sales, LSTMs excel at learning temporal patterns in sequential data. This section covers data preparation, windowing strategies, and building robust forecasting models.

Time Series Data

A time series is a sequence of data points indexed by time. Key characteristics include trend (long-term direction), seasonality (repeating patterns), and noise (random variation). RNNs learn to capture these patterns and extrapolate them for forecasting. Common examples: stock prices, temperature readings, website traffic, and sensor data.
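
These three components can be illustrated with a small synthetic series (hypothetical numbers chosen purely for the example):

```python
import numpy as np

np.random.seed(42)
t = np.arange(365)                            # one year, daily resolution
trend = 0.05 * t                              # long-term upward direction
seasonality = 10 * np.sin(2 * np.pi * t / 7)  # weekly repeating pattern
noise = 2 * np.random.randn(365)              # random variation
series = trend + seasonality + noise

# Subtracting the known trend and seasonality leaves only the noise
residual = series - trend - seasonality
print(round(residual.std(), 2))  # close to the noise scale of 2
```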

Data Preparation for Time Series

Preparing time series data for RNNs involves creating sequences of fixed length (lookback windows) as inputs and the next value(s) as targets. Proper normalization and train-test splitting are crucial for reliable predictions.

# Time series data preparation
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def prepare_time_series_data(data, lookback=60, forecast_horizon=1):
    """
    Create sequences for time series prediction.
    
    lookback: number of past timesteps to use as input
    forecast_horizon: number of future steps to predict
    """
    # Normalize data to [0, 1] range
    scaler = MinMaxScaler()
    data_scaled = scaler.fit_transform(data.reshape(-1, 1))
    
    X, y = [], []
    for i in range(lookback, len(data_scaled) - forecast_horizon + 1):
        # Input: previous 'lookback' values
        X.append(data_scaled[i - lookback:i, 0])
        # Target: next 'forecast_horizon' values
        y.append(data_scaled[i:i + forecast_horizon, 0])
    
    X = np.array(X)
    y = np.array(y)
    
    # Reshape X for LSTM: (samples, timesteps, features)
    X = X.reshape((X.shape[0], X.shape[1], 1))
    
    return X, y, scaler

# Example: prepare stock price data
prices = np.random.randn(1000).cumsum() + 100  # Simulated prices
X, y, scaler = prepare_time_series_data(prices, lookback=60, forecast_horizon=1)

print(f"Input shape: {X.shape}")   # (samples, 60, 1)
print(f"Target shape: {y.shape}")  # (samples, 1)

# Train-test split (time series: use recent data for testing)
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

Building an LSTM Forecasting Model

A typical LSTM forecasting model takes sequences of historical values and predicts future values. The architecture can be simple (single LSTM layer) or deep (stacked LSTMs) depending on data complexity.

# LSTM model for time series forecasting
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

def create_forecasting_model(lookback, n_features=1, forecast_horizon=1):
    """
    LSTM model for univariate or multivariate time series.
    """
    model = Sequential([
        # First LSTM layer
        LSTM(64, return_sequences=True, 
             input_shape=(lookback, n_features)),
        Dropout(0.2),
        
        # Second LSTM layer
        LSTM(32, return_sequences=False),
        Dropout(0.2),
        
        # Dense layers for prediction
        Dense(16, activation='relu'),
        Dense(forecast_horizon)  # Output: predicted values
    ])
    
    model.compile(
        optimizer='adam',
        loss='mse',  # Mean Squared Error for regression
        metrics=['mae']  # Mean Absolute Error
    )
    return model

# Create and train model
model = create_forecasting_model(lookback=60, n_features=1)

callbacks = [
    EarlyStopping(patience=10, restore_best_weights=True),
    ReduceLROnPlateau(factor=0.5, patience=5)
]

history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.1,
    callbacks=callbacks,
    verbose=1
)

Multivariate Time Series

Multivariate time series have multiple features at each time step. For example, predicting stock prices using not just past prices, but also trading volume, market indices, and sentiment scores. RNNs naturally handle multivariate data by accepting input shape (timesteps, features). More features can improve predictions if they contain relevant information.

Multivariate Forecasting

Real-world forecasting often benefits from multiple input features. LSTMs can learn relationships between different variables to make better predictions.

# Multivariate time series forecasting
import numpy as np
import pandas as pd

def prepare_multivariate_data(df, target_col, lookback=30):
    """
    Prepare multivariate data for LSTM.
    df: DataFrame with multiple features
    target_col: column name to predict
    """
    from sklearn.preprocessing import StandardScaler
    
    # Scale all features
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(df.values)
    
    # Get target column index
    target_idx = df.columns.get_loc(target_col)
    
    X, y = [], []
    for i in range(lookback, len(scaled_data)):
        # Input: all features for past 'lookback' steps
        X.append(scaled_data[i - lookback:i, :])
        # Target: only the target column
        y.append(scaled_data[i, target_idx])
    
    return np.array(X), np.array(y), scaler

# Example: predict temperature using multiple weather features
weather_data = pd.DataFrame({
    'temperature': np.random.randn(1000).cumsum(),
    'humidity': np.random.rand(1000) * 100,
    'pressure': 1013 + np.random.randn(1000) * 10,
    'wind_speed': np.abs(np.random.randn(1000)) * 20
})

X, y, scaler = prepare_multivariate_data(
    weather_data, 
    target_col='temperature',
    lookback=24  # 24 hours of history
)

print(f"Input shape: {X.shape}")  # (samples, 24, 4 features)
print(f"Target shape: {y.shape}")  # (samples,)

# Build model for 4 input features
model = create_forecasting_model(lookback=24, n_features=4, forecast_horizon=1)

Multi-Step Forecasting

Often we need to predict multiple future time steps, not just the next one. Two approaches: direct prediction (output all future steps at once) or recursive prediction (predict one step, feed it back as input).

# Multi-step forecasting strategies
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, RepeatVector, TimeDistributed

# Strategy 1: Direct multi-output prediction
def create_direct_multistep_model(lookback, n_features, forecast_steps):
    """Predict all future steps at once."""
    model = Sequential([
        LSTM(64, input_shape=(lookback, n_features)),
        Dense(32, activation='relu'),
        Dense(forecast_steps)  # Output all future steps
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

# Strategy 2: Encoder-Decoder for sequence-to-sequence
def create_seq2seq_model(lookback, n_features, forecast_steps):
    """Encoder-Decoder architecture for multi-step prediction."""
    model = Sequential([
        # Encoder: compress input sequence
        LSTM(64, input_shape=(lookback, n_features)),
        
        # Repeat encoder output for each forecast step
        RepeatVector(forecast_steps),
        
        # Decoder: generate output sequence
        LSTM(64, return_sequences=True),
        
        # Output at each forecast step
        TimeDistributed(Dense(1))
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

# Strategy 3: Recursive prediction
def recursive_forecast(model, initial_sequence, n_steps):
    """Predict one step at a time, feeding predictions back."""
    sequence = initial_sequence.copy()
    predictions = []
    
    for _ in range(n_steps):
        # Predict next step
        pred = model.predict(sequence.reshape(1, -1, 1), verbose=0)
        predictions.append(pred[0, 0])
        
        # Shift sequence and append prediction
        sequence = np.roll(sequence, -1)
        sequence[-1] = pred[0, 0]
    
    return np.array(predictions)

# Example usage
direct_model = create_direct_multistep_model(60, 1, 7)  # Predict 7 days
seq2seq_model = create_seq2seq_model(60, 1, 7)

Time Series Best Practices

Always use time-based train-test splits (never shuffle time series data). Normalize data within training set and apply same transform to test set to prevent data leakage. Consider using multiple lookback windows and pick the best through validation. For highly seasonal data, ensure your lookback covers at least one full cycle. Monitor for concept drift in production where patterns change over time.
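
The split-then-normalize step can be sketched as follows (a minimal example on a simulated series; in practice use your real data and chosen scaler):

```python
import numpy as np

np.random.seed(0)
series = np.random.randn(500).cumsum()  # simulated series

# Time-based split: the test set comes strictly after the training set
split = int(len(series) * 0.8)
train, test = series[:split], series[split:]

# Fit normalization statistics on the TRAINING data only,
# then apply the same transform to the test data (no leakage)
mean, std = train.mean(), train.std()
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
```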

Practice: Time Series Prediction

Q: Why should you never shuffle time series data before the train-test split?
Answer: Shuffling would cause data leakage where future information appears in the training set. In time series, we must simulate real-world conditions where we only have access to past data. The test set should always come after the training set chronologically to properly evaluate how the model performs on unseen future data.

Q: What are the trade-offs between larger and smaller lookback windows?
Answer: Larger lookback windows capture more historical patterns and long-term dependencies, but increase computational cost, memory usage, and risk of including irrelevant old data. Smaller windows are faster and may work well for short-term patterns, but may miss important seasonal or cyclical patterns. The optimal size depends on the data characteristics and should be chosen through validation.

Q: Compare direct and recursive strategies for multi-step forecasting.
Answer: Direct prediction outputs all future steps at once, avoiding error accumulation but requiring the model to learn multiple output relationships simultaneously. Recursive prediction is simpler but errors compound as predictions are fed back as inputs. Direct is better for short horizons; recursive can use simpler models but degrades over longer horizons. Encoder-decoder architectures offer a middle ground, learning to generate output sequences step-by-step while maintaining a learned state.

04

Attention Mechanisms

Attention mechanisms revolutionized sequence modeling by allowing networks to focus on relevant parts of the input when producing each output. Instead of compressing an entire sequence into a fixed-size vector, attention creates dynamic weighted connections between inputs and outputs. This breakthrough led directly to Transformers, which replaced RNNs entirely for many NLP tasks.

Attention Mechanism

An attention mechanism computes a weighted sum of input representations, where the weights indicate relevance to the current task. It uses three components: a Query (what we are looking for), Keys (what we are comparing against), and Values (what we retrieve). Attention weights are computed as softmax(Query * Keys^T), then applied to Values to get the context-aware output.
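
This computation can be sketched in NumPy for a single query (random toy tensors, illustration only):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

np.random.seed(0)
keys = np.random.randn(5, 8)    # one key per input position
values = np.random.randn(5, 8)  # one value per input position
query = np.random.randn(8)      # what we are currently looking for

scores = keys @ query           # Query . Keys^T: one score per position
weights = softmax(scores)       # attention weights: non-negative, sum to 1
context = weights @ values      # weighted sum of the Values

print(weights.sum(), context.shape)
```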

The Bottleneck Problem

Traditional encoder-decoder models compress the entire input sequence into a single fixed-size context vector. For long sequences, this bottleneck loses important information. Attention allows the decoder to look back at all encoder states, focusing on relevant parts for each output step.

# The bottleneck problem in seq2seq (structural sketch, not runnable:
# encoder_step, decoder_step, and attention are assumed defined elsewhere)
import numpy as np

# Traditional Encoder-Decoder (bottleneck)
class TraditionalSeq2Seq:
    def encode(self, input_sequence):
        """Compress entire sequence into single vector."""
        hidden = np.zeros(256)
        for token in input_sequence:
            hidden = self.encoder_step(token, hidden)
        # Only final hidden state passed to decoder
        return hidden  # Lost information about early tokens!
    
    def decode(self, context_vector, max_length):
        """Generate output using only the context vector."""
        hidden = context_vector
        outputs = []
        for _ in range(max_length):
            output, hidden = self.decoder_step(hidden)
            outputs.append(output)
        return outputs

# With Attention: decoder can access ALL encoder states
class AttentionSeq2Seq:
    def encode(self, input_sequence):
        """Store all hidden states for attention."""
        hidden_states = []
        hidden = np.zeros(256)
        for token in input_sequence:
            hidden = self.encoder_step(token, hidden)
            hidden_states.append(hidden)  # Keep all states!
        return np.array(hidden_states)
    
    def decode_with_attention(self, encoder_states, max_length):
        """Use attention to focus on relevant encoder states."""
        hidden = np.zeros(256)
        outputs = []
        for _ in range(max_length):
            # Compute attention weights
            context = self.attention(hidden, encoder_states)
            # Combine with decoder state
            output, hidden = self.decoder_step(hidden, context)
            outputs.append(output)
        return outputs

Bahdanau Attention (Additive)

Bahdanau attention (2014) was the first attention mechanism for neural machine translation. It uses a feedforward network to compute alignment scores between the decoder state and each encoder state.

# Bahdanau (Additive) Attention implementation
import numpy as np
from tensorflow.keras.layers import Layer, Dense
import tensorflow as tf

class BahdanauAttention(Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = Dense(units)  # Transform encoder states
        self.W2 = Dense(units)  # Transform decoder state
        self.V = Dense(1)       # Score function
    
    def call(self, decoder_hidden, encoder_outputs):
        """
        decoder_hidden: (batch, hidden_size) - current decoder state
        encoder_outputs: (batch, seq_len, hidden_size) - all encoder states
        """
        # Expand decoder hidden for broadcasting
        # (batch, hidden_size) -> (batch, 1, hidden_size)
        decoder_hidden_expanded = tf.expand_dims(decoder_hidden, 1)
        
        # Compute attention scores
        # score = V(tanh(W1(encoder) + W2(decoder)))
        score = self.V(tf.nn.tanh(
            self.W1(encoder_outputs) + self.W2(decoder_hidden_expanded)
        ))  # (batch, seq_len, 1)
        
        # Convert to probabilities
        attention_weights = tf.nn.softmax(score, axis=1)  # (batch, seq_len, 1)
        
        # Weighted sum of encoder outputs
        context = attention_weights * encoder_outputs  # (batch, seq_len, hidden)
        context = tf.reduce_sum(context, axis=1)  # (batch, hidden)
        
        return context, attention_weights

# Example usage
attention = BahdanauAttention(units=64)
decoder_state = tf.random.normal((32, 256))
encoder_outputs = tf.random.normal((32, 50, 256))  # 50 tokens
context, weights = attention(decoder_state, encoder_outputs)
print(f"Context shape: {context.shape}")  # (32, 256)
print(f"Attention weights shape: {weights.shape}")  # (32, 50, 1)

Self-Attention

Self-attention allows each position in a sequence to attend to all other positions in the same sequence. Unlike cross-attention between encoder and decoder, self-attention computes relationships within a single sequence. Each token generates Query, Key, and Value vectors; the attention output for each position is a weighted sum of all Values, where weights come from Query-Key compatibility. This is the foundation of Transformers.
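As a concrete toy illustration of the Query/Key/Value projections described above, here is self-attention for a short sequence in NumPy; the projection matrices are randomly initialized stand-ins for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8

X = rng.standard_normal((seq_len, d_model))   # token embeddings
# Each token projects to Query, Key, and Value vectors
Wq = rng.standard_normal((d_model, d_model))
Wk = rng.standard_normal((d_model, d_model))
Wv = rng.standard_normal((d_model, d_model))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Every position attends to every position in the same sequence
scores = Q @ K.T / np.sqrt(d_model)           # (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V                          # (seq_len, d_model)
print(output.shape)  # (4, 8)
```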

Scaled Dot-Product Attention

Scaled dot-product attention, used in Transformers, computes attention as softmax(QK^T / sqrt(d_k)) * V. It is faster than additive attention in practice because it reduces to highly optimized matrix multiplication, and the sqrt(d_k) scaling keeps the logits from growing with dimension and saturating the softmax into regions with vanishingly small gradients.

# Scaled Dot-Product Attention (Transformer-style)
import tensorflow as tf
import numpy as np

def scaled_dot_product_attention(query, key, value, mask=None):
    """
    Compute attention using scaled dot product.
    
    query: (batch, num_heads, seq_len_q, d_k)
    key: (batch, num_heads, seq_len_k, d_k)
    value: (batch, num_heads, seq_len_v, d_v)
    mask: optional mask for padding or causal attention
    """
    # Compute attention scores: Q @ K^T
    matmul_qk = tf.matmul(query, key, transpose_b=True)
    
    # Scale by sqrt(d_k) to prevent vanishing gradients in softmax
    d_k = tf.cast(tf.shape(key)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(d_k)
    
    # Apply mask if provided (for padding or causal masking)
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)
    
    # Softmax to get attention weights
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    
    # Weighted sum of values
    output = tf.matmul(attention_weights, value)
    
    return output, attention_weights

# Example: self-attention on a sequence
seq_len = 10
d_model = 64

# Same input serves as Q, K, V for self-attention
x = tf.random.normal((1, seq_len, d_model))
query = key = value = x

output, weights = scaled_dot_product_attention(query, key, value)
print(f"Output shape: {output.shape}")  # (1, 10, 64)
print(f"Each position attends to all 10 positions")
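The `mask` argument above is accepted but never constructed in the example. A minimal sketch of a causal (look-ahead) mask under the same convention (1 marks a blocked position, added to the logits as -1e9):

```python
import tensorflow as tf

seq_len = 5
# Lower triangle (incl. diagonal) is allowed; the strict upper
# triangle (future positions) is blocked
mask = 1.0 - tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)

logits = tf.random.normal((1, 1, seq_len, seq_len))
masked_logits = logits + mask * -1e9
weights = tf.nn.softmax(masked_logits, axis=-1)

# Row i now places zero weight on every future position j > i
print(weights[0, 0].numpy().round(2))
```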

Attention in Keras RNN Models

Keras provides a built-in Attention layer (Luong-style dot-product attention; AdditiveAttention implements the Bahdanau variant) that can be integrated with RNN encoder-decoder models. This combines the sequential processing of LSTMs with the focusing ability of attention.

# Attention layer with LSTM in Keras
from tensorflow.keras.layers import (Input, LSTM, Dense, Attention,
                                      Concatenate, TimeDistributed, Embedding)
from tensorflow.keras.models import Model

def create_attention_seq2seq(input_vocab, output_vocab, 
                              embedding_dim=256, hidden_units=512):
    """
    Sequence-to-sequence model with dot-product attention
    (Keras's built-in Attention layer, Luong-style).
    For machine translation, summarization, etc.
    """
    # Encoder
    encoder_inputs = Input(shape=(None,), name='encoder_input')
    encoder_embedding = Embedding(input_vocab, embedding_dim)(encoder_inputs)
    encoder_lstm = LSTM(hidden_units, return_sequences=True, 
                        return_state=True, name='encoder_lstm')
    encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
    
    # Decoder
    decoder_inputs = Input(shape=(None,), name='decoder_input')
    decoder_embedding = Embedding(output_vocab, embedding_dim)(decoder_inputs)
    decoder_lstm = LSTM(hidden_units, return_sequences=True,
                        return_state=True, name='decoder_lstm')
    decoder_outputs, _, _ = decoder_lstm(
        decoder_embedding, 
        initial_state=[state_h, state_c]
    )
    
    # Attention layer
    attention = Attention(name='attention')
    context = attention([decoder_outputs, encoder_outputs])
    
    # Combine attention context with decoder output
    decoder_combined = Concatenate()([decoder_outputs, context])
    
    # Output projection
    output = TimeDistributed(Dense(output_vocab, activation='softmax'))(
        decoder_combined
    )
    
    model = Model([encoder_inputs, decoder_inputs], output)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    
    return model

from tensorflow.keras.layers import Embedding

# Create translation model
model = create_attention_seq2seq(
    input_vocab=10000,   # Source vocabulary
    output_vocab=8000,   # Target vocabulary
    embedding_dim=256,
    hidden_units=512
)
model.summary()
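Training this model uses teacher forcing: the decoder input is the target sequence shifted right by one step. A sketch of the data layout, where the token ids and the use of 0 as the start-of-sequence id are illustrative assumptions:

```python
import numpy as np

batch, src_len, tgt_len = 2, 6, 5
encoder_in = np.random.randint(1, 10000, size=(batch, src_len))
target = np.random.randint(1, 8000, size=(batch, tgt_len))

decoder_in = np.zeros_like(target)    # position 0 = assumed start token id 0
decoder_in[:, 1:] = target[:, :-1]    # shift right by one step

# model.fit([encoder_in, decoder_in], target, ...) would then train
# against the unshifted target with sparse_categorical_crossentropy.
```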

From Attention to Transformers

The success of attention led researchers to ask: do we need RNNs at all? The Transformer architecture (2017) uses only attention mechanisms, processing all positions in parallel. This enables much faster training and better long-range dependency modeling. For most NLP tasks today, Transformers (BERT, GPT) have replaced RNNs. However, RNNs remain relevant for streaming data and resource-constrained environments.

Practice: Attention Mechanisms

Question: In attention, what roles do the Query, Key, and Value vectors play?

Answer: Query (Q) represents what we are looking for. Keys (K) represent what we are comparing against. Values (V) contain the actual information we want to retrieve. Attention computes similarity between Q and K to determine which V elements to focus on, then returns a weighted sum of V.

Question: Why does scaled dot-product attention divide the scores by sqrt(d_k)?

Answer: When d_k is large, the dot products between Q and K grow large in magnitude, pushing the softmax into regions with extremely small gradients. Dividing by sqrt(d_k) keeps the variance of the dot products stable regardless of dimension, ensuring the softmax produces reasonable gradients for training.
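This claim is easy to verify numerically: for unit-variance vectors, the dot product has variance about d_k, so its typical magnitude grows with sqrt(d_k), and dividing by sqrt(d_k) restores unit scale. A quick NumPy check (not from the lesson):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 256):
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = (q * k).sum(axis=1)
    # Unscaled std grows like sqrt(d_k); scaled std stays near 1
    print(d_k, dots.std().round(1), (dots / np.sqrt(d_k)).std().round(2))
```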

Question: How does self-attention differ from encoder-decoder attention, and what advantages does it offer?

Answer: In encoder-decoder attention, the decoder queries attend to encoder key-value pairs from a different sequence. In self-attention, Q, K, and V all come from the same sequence, allowing each position to attend to all other positions within the sequence. This provides two key advantages: (1) direct modeling of long-range dependencies without passing through many RNN steps, and (2) parallel computation since all positions can be processed simultaneously, unlike the sequential nature of RNNs.

Question: How does attention handle word-order differences between source and target languages in translation?

Answer: Attention allows the decoder to focus on different source positions for each target word, regardless of their sequential order. When generating a word early in the target sentence that corresponds to a word late in the source sentence, attention can assign high weight to that distant source position. This flexibility handles word reordering naturally without requiring the model to memorize long-range alignments in its hidden state.

Interactive Demo: Attention Visualizer

Explore how attention mechanisms focus on different parts of a sequence. Select a target word to see which source words receive the most attention weight.

Source sequence (English): The cat sat on the mat
Target sequence (French): Le chat était assis sur le tapis

[Interactive widget: selecting a French word highlights the attention weight it assigns to each English word.]

Understanding the Visualization

This demo simulates attention weights in neural machine translation. Notice how each French word primarily attends to its English equivalent but also draws on context. For example, "était" and "assis" ("was sitting") both attend heavily to "sat", while the articles "Le" and "le" attend to "The" and "the" respectively.

Key Takeaways

Sequential Memory

RNNs process sequences one step at a time, maintaining hidden state that captures information from previous inputs

Long-Term Dependencies

LSTM cells use gates to control information flow, solving the vanishing gradient problem for long sequences

GRU Efficiency

GRU simplifies LSTM with fewer gates while maintaining similar performance, reducing computation

Time Series Forecasting

RNNs excel at predicting future values from historical patterns in stock prices, weather, and sensor data

Attention Focus

Attention mechanisms let models focus on relevant parts of input sequences, enabling better context understanding

NLP Foundation

RNNs and attention laid the groundwork for modern NLP, from machine translation to text generation

Knowledge Check

Test your understanding of Recurrent Neural Networks:

1 What is the main advantage of RNNs over feedforward neural networks for sequence data?
2 What problem do LSTM cells solve that vanilla RNNs struggle with?
3 How many gates does a GRU cell have compared to an LSTM cell?
4 In time series prediction, what does a "lookback window" refer to?
5 What does the attention mechanism compute?
6 What is the key difference between self-attention and cross-attention?