Sequence Modeling
Many real-world problems involve sequential data where order matters. Text is a sequence of words, audio is a sequence of samples, and stock prices are sequences of values over time. Unlike feedforward networks, which process fixed-size inputs independently, Recurrent Neural Networks (RNNs) maintain a hidden state that captures information from previous time steps, enabling them to model temporal dependencies.
Sequential data is any data where the order of elements carries meaning. Examples include text (word order determines meaning), time series (temporal patterns), audio (waveform evolution), and video (frame sequences). The key characteristic is that elements are not independent; each depends on what came before.
Why Traditional Networks Fail
Feedforward neural networks treat each input independently with no memory of previous inputs. For the sentence "The cat sat on the ___", a feedforward network cannot use the context of previous words to predict "mat". RNNs solve this by maintaining hidden state that passes information forward through time.
# The problem with feedforward networks for sequences
import numpy as np

# (W_input, W_hidden, W_output, and b are assumed to be defined elsewhere)

# Feedforward: each input processed independently
def feedforward_predict(word_embedding):
    """No context from previous words."""
    # Each word processed in isolation
    hidden = np.tanh(np.dot(word_embedding, W_input) + b)
    output = np.dot(hidden, W_output)
    return output

# RNN: maintains hidden state across time steps
def rnn_predict(word_embedding, previous_hidden):
    """Uses context from previous words via hidden state."""
    # Combine current input with previous hidden state
    hidden = np.tanh(
        np.dot(word_embedding, W_input) +
        np.dot(previous_hidden, W_hidden) +  # Memory!
        b
    )
    output = np.dot(hidden, W_output)
    return output, hidden  # Pass hidden to next step
The Recurrent Neural Network Architecture
An RNN processes sequences one element at a time. At each time step t, it takes the current input x_t and the previous hidden state h_(t-1), computes a new hidden state h_t, and optionally produces an output y_t. The same weights are shared across all time steps.
# Vanilla RNN implementation from scratch
import numpy as np

class SimpleRNN:
    def __init__(self, input_size, hidden_size, output_size):
        # Initialize weights
        self.Wxh = np.random.randn(input_size, hidden_size) * 0.01
        self.Whh = np.random.randn(hidden_size, hidden_size) * 0.01
        self.Why = np.random.randn(hidden_size, output_size) * 0.01
        self.bh = np.zeros((1, hidden_size))
        self.by = np.zeros((1, output_size))
        self.hidden_size = hidden_size

    def forward(self, inputs):
        """
        Process a sequence of inputs.
        inputs: list of input vectors [x_1, x_2, ..., x_T]
        """
        h = np.zeros((1, self.hidden_size))  # Initial hidden state
        self.hidden_states = [h]
        outputs = []
        for x in inputs:
            # RNN step: h_t = tanh(x_t @ Wxh + h_(t-1) @ Whh + bh)
            h = np.tanh(np.dot(x, self.Wxh) + np.dot(h, self.Whh) + self.bh)
            self.hidden_states.append(h)
            # Output: y_t = h_t @ Why + by
            y = np.dot(h, self.Why) + self.by
            outputs.append(y)
        return outputs, h  # Return all outputs and final hidden state

# Example usage
rnn = SimpleRNN(input_size=10, hidden_size=32, output_size=5)
sequence = [np.random.randn(1, 10) for _ in range(20)]  # 20 time steps
outputs, final_hidden = rnn.forward(sequence)
The hidden state h_t is a vector that encodes information about all previous inputs seen so far. It acts as the network's "memory". At each time step, the hidden state is updated based on the current input and the previous hidden state: h_t = f(W_xh * x_t + W_hh * h_(t-1) + b). This recurrence is what gives RNNs their ability to model sequences.
Building RNNs with Keras
Keras provides high-level RNN layers that handle the recurrence automatically. The SimpleRNN layer processes sequences and can return either the final hidden state or outputs at every time step.
# RNN layers in Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, Embedding

# Text classification with RNN
def create_text_classifier(vocab_size, embedding_dim, hidden_units, num_classes):
    """
    RNN for sentiment analysis or text classification.
    """
    model = Sequential([
        # Convert word indices to dense vectors
        Embedding(vocab_size, embedding_dim, input_length=100),
        # RNN layer processes the sequence
        # return_sequences=False: only return final hidden state
        SimpleRNN(hidden_units, return_sequences=False),
        # Classification head
        Dense(64, activation='relu'),
        Dense(num_classes, activation='softmax')
    ])
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

# Create model
model = create_text_classifier(
    vocab_size=10000,
    embedding_dim=128,
    hidden_units=64,
    num_classes=3  # Positive, Negative, Neutral
)
model.summary()
Sequence-to-Sequence Patterns
RNNs can be configured for different input-output relationships: many-to-one (classification), one-to-many (generation), and many-to-many (translation). The architecture depends on whether you need outputs at every step or just the final prediction.
# Different RNN configurations
from tensorflow.keras.layers import SimpleRNN, Dense, TimeDistributed
from tensorflow.keras.models import Sequential

# Many-to-One: Sequence classification (sentiment analysis)
# Input: sequence of words, Output: single class
many_to_one = Sequential([
    SimpleRNN(64, return_sequences=False, input_shape=(100, 128)),
    Dense(3, activation='softmax')
])

# Many-to-Many (same length): Sequence labeling (POS tagging)
# Input: sequence of words, Output: tag for each word
many_to_many = Sequential([
    SimpleRNN(64, return_sequences=True, input_shape=(100, 128)),
    TimeDistributed(Dense(20, activation='softmax'))  # 20 POS tags
])

# Stacked RNNs for more capacity
stacked_rnn = Sequential([
    SimpleRNN(64, return_sequences=True, input_shape=(100, 128)),
    SimpleRNN(64, return_sequences=True),   # Stack another layer
    SimpleRNN(32, return_sequences=False),  # Final layer
    Dense(10, activation='softmax')
])

print("Many-to-One output:", many_to_one.output_shape)
print("Many-to-Many output:", many_to_many.output_shape)
The Vanishing Gradient Problem
Vanilla RNNs struggle with long sequences because gradients either vanish (shrink to zero) or explode (grow unbounded) when backpropagating through many time steps. After about 10-20 steps, the gradient becomes too small to update early weights, making it impossible to learn long-range dependencies. This is why LSTM and GRU were invented.
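The decay can be seen numerically. In backpropagation through time, the gradient is multiplied by the step Jacobian diag(1 - h_t^2) @ Whh^T at every time step. A minimal sketch (with an illustrative 32-unit hidden state and the same small-weight initialization used in the SimpleRNN example above):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 32
Whh = rng.standard_normal((hidden_size, hidden_size)) * 0.01

# Backprop through time multiplies the gradient by
# J_t = diag(1 - h_t^2) @ Whh.T at every step
grad = np.ones(hidden_size)
norms = []
for t in range(20):
    h = np.tanh(rng.standard_normal(hidden_size))  # stand-in hidden state
    jacobian = np.diag(1 - h**2) @ Whh.T
    grad = jacobian @ grad
    norms.append(np.linalg.norm(grad))

print(f"gradient norm after 1 step:   {norms[0]:.2e}")
print(f"gradient norm after 20 steps: {norms[-1]:.2e}")  # many orders of magnitude smaller
```

With larger recurrent weights the same product explodes instead, which is why gradient clipping is common when training vanilla RNNs.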
Practice: Sequence Modeling
Question: What is the purpose of the hidden state in an RNN?
Answer: The hidden state serves as the network's memory, encoding information about all previous inputs in the sequence. It allows the RNN to use context from earlier time steps when processing the current input, enabling the network to model temporal dependencies.
Question: What does the return_sequences parameter control, and when should it be True?
Answer: When return_sequences=True, the RNN outputs the hidden state at every time step, resulting in a 3D output (batch, timesteps, features). Use it when you need outputs at each position (sequence labeling, stacking RNN layers). When False, only the final hidden state is returned (batch, features), used for sequence classification.
Question: Why are the same weights shared across all time steps?
Answer: Weight sharing serves multiple purposes: (1) It allows RNNs to process sequences of any length with a fixed number of parameters. (2) It enforces the assumption that the same transformation should apply at every position. (3) It dramatically reduces parameters compared to having separate weights per timestep. (4) It enables the network to generalize patterns learned at one position to other positions in the sequence.
LSTM and GRU Cells
Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures revolutionized sequence modeling by solving the vanishing gradient problem. They use gating mechanisms to control information flow, allowing networks to selectively remember or forget information over long sequences. These cells have become the workhorses of modern NLP and time series applications.
An LSTM cell maintains two states: the hidden state h_t and the cell state c_t. The cell state acts as a "conveyor belt" that carries information across time steps with minimal modification. Three gates control information flow: the forget gate decides what to discard, the input gate decides what to store, and the output gate decides what to output. This architecture enables learning dependencies over hundreds of time steps.
LSTM Gates Explained
Each LSTM gate is a neural network layer with sigmoid activation that outputs values between 0 (block everything) and 1 (let everything through). The gates work together to selectively update the cell state and produce the output.
# LSTM cell implementation from scratch
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

class LSTMCell:
    def __init__(self, input_size, hidden_size):
        # One weight matrix and bias per gate:
        # forget, input, candidate, output
        self.Wf = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        self.Wi = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        self.Wc = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        self.Wo = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        self.bf = np.zeros((1, hidden_size))
        self.bi = np.zeros((1, hidden_size))
        self.bc = np.zeros((1, hidden_size))
        self.bo = np.zeros((1, hidden_size))

    def forward(self, x, h_prev, c_prev):
        """
        Single LSTM step.
        x: current input (batch, input_size)
        h_prev: previous hidden state (batch, hidden_size)
        c_prev: previous cell state (batch, hidden_size)
        """
        # Concatenate input and previous hidden state
        combined = np.concatenate([x, h_prev], axis=1)
        # Forget gate: what to discard from cell state
        f = sigmoid(np.dot(combined, self.Wf) + self.bf)
        # Input gate: what new info to store
        i = sigmoid(np.dot(combined, self.Wi) + self.bi)
        # Candidate values: potential new content
        c_tilde = np.tanh(np.dot(combined, self.Wc) + self.bc)
        # Update cell state: forget old + add new
        c = f * c_prev + i * c_tilde
        # Output gate: what to output
        o = sigmoid(np.dot(combined, self.Wo) + self.bo)
        # Hidden state: filtered cell state
        h = o * np.tanh(c)
        return h, c

# Example usage
lstm_cell = LSTMCell(input_size=128, hidden_size=256)
h = np.zeros((1, 256))
c = np.zeros((1, 256))
x = np.random.randn(1, 128)
h_new, c_new = lstm_cell.forward(x, h, c)
Using LSTM in Keras
Keras provides optimized LSTM layers that are easy to use. The layer handles the recurrence internally and supports GPU acceleration through cuDNN when available.
# LSTM layers in Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout

# Sentiment analysis with LSTM
def create_sentiment_model(vocab_size=10000, max_length=200):
    """LSTM for binary sentiment classification."""
    model = Sequential([
        Embedding(vocab_size, 128, input_length=max_length),
        # LSTM layer with dropout for regularization
        LSTM(128, dropout=0.2, recurrent_dropout=0.2),
        Dense(64, activation='relu'),
        Dropout(0.5),
        Dense(1, activation='sigmoid')  # Binary classification
    ])
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    return model

# Stacked LSTM for more capacity
def create_deep_lstm(vocab_size=10000, max_length=200):
    """Multi-layer LSTM for complex sequences."""
    model = Sequential([
        Embedding(vocab_size, 128, input_length=max_length),
        # First LSTM: return sequences for stacking
        LSTM(128, return_sequences=True, dropout=0.2),
        # Second LSTM: return sequences for stacking
        LSTM(64, return_sequences=True, dropout=0.2),
        # Final LSTM: return only last output
        LSTM(32, dropout=0.2),
        Dense(1, activation='sigmoid')
    ])
    return model

model = create_sentiment_model()
model.summary()
The GRU simplifies LSTM by combining the forget and input gates into a single update gate and merging the cell and hidden states. It uses only two gates: the reset gate controls how much past information to forget, and the update gate controls how much new information to incorporate. GRU has fewer parameters than LSTM while achieving comparable performance on many tasks.
GRU Architecture
GRU achieves similar performance to LSTM with a simpler architecture. The reduced complexity means faster training and less memory usage, making GRU a popular choice when resources are limited.
# GRU cell implementation
import numpy as np

class GRUCell:
    def __init__(self, input_size, hidden_size):
        # Reset gate weights
        self.Wr = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        # Update gate weights
        self.Wz = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        # Candidate hidden state weights
        self.Wh = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        self.br = np.zeros((1, hidden_size))
        self.bz = np.zeros((1, hidden_size))
        self.bh = np.zeros((1, hidden_size))

    def forward(self, x, h_prev):
        """
        GRU step: simpler than LSTM, no separate cell state.
        (Uses the sigmoid() helper defined in the LSTM example above.)
        """
        combined = np.concatenate([x, h_prev], axis=1)
        # Reset gate: how much past to forget
        r = sigmoid(np.dot(combined, self.Wr) + self.br)
        # Update gate: balance between old and new
        z = sigmoid(np.dot(combined, self.Wz) + self.bz)
        # Candidate hidden state with reset gate applied
        combined_reset = np.concatenate([x, r * h_prev], axis=1)
        h_tilde = np.tanh(np.dot(combined_reset, self.Wh) + self.bh)
        # Final hidden state: interpolate between old and new
        h = (1 - z) * h_prev + z * h_tilde
        return h

# GRU has no cell state, just hidden state
gru_cell = GRUCell(input_size=128, hidden_size=256)
h = np.zeros((1, 256))
x = np.random.randn(1, 128)
h_new = gru_cell.forward(x, h)
Bidirectional RNNs
Bidirectional RNNs process sequences in both forward and backward directions, capturing context from both past and future. This is particularly useful for tasks like named entity recognition where the meaning of a word depends on both preceding and following words.
# Bidirectional LSTM and GRU in Keras
from tensorflow.keras.layers import (Bidirectional, LSTM, GRU, Dense,
                                     Embedding, TimeDistributed)
from tensorflow.keras.models import Sequential

# Bidirectional LSTM for NER or sequence labeling
def create_bidirectional_model(vocab_size, max_length, num_tags):
    """
    Bidirectional LSTM for sequence labeling.
    Captures both left and right context for each token.
    """
    model = Sequential([
        Embedding(vocab_size, 128, input_length=max_length),
        # Bidirectional: runs LSTM forward AND backward
        # Output size is 2 * hidden_size (concatenated)
        Bidirectional(LSTM(64, return_sequences=True)),
        # Second bidirectional layer
        Bidirectional(LSTM(32, return_sequences=True)),
        # Output tag for each position
        TimeDistributed(Dense(num_tags, activation='softmax'))
    ])
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    return model

# Compare LSTM vs GRU
# GRU variant (fewer parameters, often similar performance)
gru_model = Sequential([
    Embedding(10000, 128, input_length=100),
    Bidirectional(GRU(64, return_sequences=True)),
    Bidirectional(GRU(32, return_sequences=False)),
    Dense(64, activation='relu'),
    Dense(5, activation='softmax')
])
print("GRU model parameters:", gru_model.count_params())
LSTM vs GRU: When to Choose Which
LSTM generally performs better on tasks requiring very long-term memory (1000+ steps) due to its separate cell state. GRU is faster to train and often sufficient for shorter sequences. In practice, try both and compare on your specific task. GRU is a good default for prototyping; switch to LSTM if you need more capacity. Modern alternatives like Transformers often outperform both on NLP tasks.
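The parameter gap is easy to count for the from-scratch cells above: each gate needs a weight matrix of shape (input_size + hidden_size, hidden_size) plus a bias, and LSTM has four such blocks (forget, input, candidate, output) to GRU's three (reset, update, candidate). A quick sketch (Keras layer counts can differ slightly, e.g. TensorFlow's GRU adds an extra bias when reset_after=True):

```python
def rnn_cell_params(input_size, hidden_size, n_gates):
    # Each gate: a weight matrix (input+hidden, hidden) plus a bias (hidden,)
    return n_gates * ((input_size + hidden_size) * hidden_size + hidden_size)

# Same sizes as the LSTMCell / GRUCell examples above
lstm = rnn_cell_params(128, 256, n_gates=4)  # forget, input, candidate, output
gru = rnn_cell_params(128, 256, n_gates=3)   # reset, update, candidate

print(f"LSTM: {lstm:,} parameters")
print(f"GRU:  {gru:,} parameters ({gru / lstm:.0%} of LSTM)")  # exactly 75%
```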
Practice: LSTM and GRU
Question: What does the forget gate do in an LSTM?
Answer: The forget gate decides what information to discard from the cell state. It outputs values between 0 (forget completely) and 1 (keep fully) for each element in the cell state. This allows the LSTM to clear irrelevant information and make room for new content.
Question: How does a bidirectional LSTM work, and what is its output dimension?
Answer: A bidirectional LSTM runs two separate LSTMs: one processing the sequence forward and one backward. Their outputs are concatenated at each time step, doubling the output dimension. If each LSTM has 64 hidden units, the bidirectional output is 128 dimensions (64 forward + 64 backward).
Question: How does the LSTM cell state mitigate the vanishing gradient problem?
Answer: The cell state creates a highway for gradients to flow unchanged across time steps. When the forget gate is 1 and input gate is 0, the cell state is copied exactly: c_t = c_(t-1). During backpropagation, this creates a gradient of 1, avoiding the multiplicative decay that causes vanishing gradients. The gates learn when to allow this direct flow, enabling the network to maintain gradients over hundreds of steps.
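This gradient highway can be checked numerically against the cell-state update c = f * c_prev + i * c_tilde from the LSTMCell example (toy 3-element state):

```python
import numpy as np

c_prev = np.array([0.5, -1.2, 3.0])
f = np.ones(3)                 # forget gate fully open: keep everything
i = np.zeros(3)                # input gate closed: add nothing
c_tilde = np.random.randn(3)   # candidate content (ignored since i = 0)

c = f * c_prev + i * c_tilde   # LSTM cell-state update
print(c)                       # identical to c_prev: state copied unchanged

# The local gradient dc_t/dc_(t-1) of this update is just the forget gate,
# so a gate near 1 passes gradients through without multiplicative decay.
grad = f
print(grad)  # [1. 1. 1.]
```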
Time Series Prediction
Time series forecasting is one of the most practical applications of RNNs. From predicting stock prices and weather patterns to forecasting energy demand and sales, LSTMs excel at learning temporal patterns in sequential data. This section covers data preparation, windowing strategies, and building robust forecasting models.
A time series is a sequence of data points indexed by time. Key characteristics include trend (long-term direction), seasonality (repeating patterns), and noise (random variation). RNNs learn to capture these patterns and extrapolate them for forecasting. Common examples: stock prices, temperature readings, website traffic, and sensor data.
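These components can be made concrete with a synthetic series (the magnitudes and the 30-step cycle below are purely illustrative):

```python
import numpy as np

t = np.arange(365)                              # one year of daily points
trend = 0.05 * t                                # long-term upward drift
seasonality = 10 * np.sin(2 * np.pi * t / 30)   # repeating ~monthly cycle
noise = np.random.randn(365) * 2                # random variation
series = 50 + trend + seasonality + noise       # observed time series

print(series.shape)  # (365,)
```

A forecasting model sees only the combined `series`; learning to separate these overlapping patterns is exactly what the RNN must do.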
Data Preparation for Time Series
Preparing time series data for RNNs involves creating sequences of fixed length (lookback windows) as inputs and the next value(s) as targets. Proper normalization and train-test splitting are crucial for reliable predictions.
# Time series data preparation
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def prepare_time_series_data(data, lookback=60, forecast_horizon=1):
    """
    Create sequences for time series prediction.
    lookback: number of past timesteps to use as input
    forecast_horizon: number of future steps to predict
    """
    # Normalize data to [0, 1] range
    # (For simplicity the scaler is fit on the full series here; in practice,
    # fit it on the training split only to avoid leakage -- see Best Practices.)
    scaler = MinMaxScaler()
    data_scaled = scaler.fit_transform(data.reshape(-1, 1))
    X, y = [], []
    for i in range(lookback, len(data_scaled) - forecast_horizon + 1):
        # Input: previous 'lookback' values
        X.append(data_scaled[i - lookback:i, 0])
        # Target: next 'forecast_horizon' values
        y.append(data_scaled[i:i + forecast_horizon, 0])
    X = np.array(X)
    y = np.array(y)
    # Reshape X for LSTM: (samples, timesteps, features)
    X = X.reshape((X.shape[0], X.shape[1], 1))
    return X, y, scaler

# Example: prepare stock price data
prices = np.random.randn(1000).cumsum() + 100  # Simulated prices
X, y, scaler = prepare_time_series_data(prices, lookback=60, forecast_horizon=1)
print(f"Input shape: {X.shape}")   # (samples, 60, 1)
print(f"Target shape: {y.shape}")  # (samples, 1)

# Train-test split (time series: use recent data for testing)
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]
Building an LSTM Forecasting Model
A typical LSTM forecasting model takes sequences of historical values and predicts future values. The architecture can be simple (single LSTM layer) or deep (stacked LSTMs) depending on data complexity.
# LSTM model for time series forecasting
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

def create_forecasting_model(lookback, n_features=1, forecast_horizon=1):
    """
    LSTM model for univariate or multivariate time series.
    """
    model = Sequential([
        # First LSTM layer
        LSTM(64, return_sequences=True,
             input_shape=(lookback, n_features)),
        Dropout(0.2),
        # Second LSTM layer
        LSTM(32, return_sequences=False),
        Dropout(0.2),
        # Dense layers for prediction
        Dense(16, activation='relu'),
        Dense(forecast_horizon)  # Output: predicted values
    ])
    model.compile(
        optimizer='adam',
        loss='mse',      # Mean Squared Error for regression
        metrics=['mae']  # Mean Absolute Error
    )
    return model

# Create and train model (X_train, y_train from the previous example)
model = create_forecasting_model(lookback=60, n_features=1)
callbacks = [
    EarlyStopping(patience=10, restore_best_weights=True),
    ReduceLROnPlateau(factor=0.5, patience=5)
]
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.1,
    callbacks=callbacks,
    verbose=1
)
Multivariate time series have multiple features at each time step. For example, predicting stock prices using not just past prices, but also trading volume, market indices, and sentiment scores. RNNs naturally handle multivariate data by accepting input shape (timesteps, features). More features can improve predictions if they contain relevant information.
Multivariate Forecasting
Real-world forecasting often benefits from multiple input features. LSTMs can learn relationships between different variables to make better predictions.
# Multivariate time series forecasting
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def prepare_multivariate_data(df, target_col, lookback=30):
    """
    Prepare multivariate data for LSTM.
    df: DataFrame with multiple features
    target_col: column name to predict
    """
    # Scale all features
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(df.values)
    # Get target column index
    target_idx = df.columns.get_loc(target_col)
    X, y = [], []
    for i in range(lookback, len(scaled_data)):
        # Input: all features for past 'lookback' steps
        X.append(scaled_data[i - lookback:i, :])
        # Target: only the target column
        y.append(scaled_data[i, target_idx])
    return np.array(X), np.array(y), scaler

# Example: predict temperature using multiple weather features
weather_data = pd.DataFrame({
    'temperature': np.random.randn(1000).cumsum(),
    'humidity': np.random.rand(1000) * 100,
    'pressure': 1013 + np.random.randn(1000) * 10,
    'wind_speed': np.abs(np.random.randn(1000)) * 20
})
X, y, scaler = prepare_multivariate_data(
    weather_data,
    target_col='temperature',
    lookback=24  # 24 hours of history
)
print(f"Input shape: {X.shape}")   # (samples, 24, 4 features)
print(f"Target shape: {y.shape}")  # (samples,)

# Build model for 4 input features
model = create_forecasting_model(lookback=24, n_features=4, forecast_horizon=1)
Multi-Step Forecasting
Often we need to predict multiple future time steps, not just the next one. Two approaches: direct prediction (output all future steps at once) or recursive prediction (predict one step, feed it back as input).
# Multi-step forecasting strategies
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, RepeatVector, TimeDistributed

# Strategy 1: Direct multi-output prediction
def create_direct_multistep_model(lookback, n_features, forecast_steps):
    """Predict all future steps at once."""
    model = Sequential([
        LSTM(64, input_shape=(lookback, n_features)),
        Dense(32, activation='relu'),
        Dense(forecast_steps)  # Output all future steps
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

# Strategy 2: Encoder-Decoder for sequence-to-sequence
def create_seq2seq_model(lookback, n_features, forecast_steps):
    """Encoder-Decoder architecture for multi-step prediction."""
    model = Sequential([
        # Encoder: compress input sequence
        LSTM(64, input_shape=(lookback, n_features)),
        # Repeat encoder output for each forecast step
        RepeatVector(forecast_steps),
        # Decoder: generate output sequence
        LSTM(64, return_sequences=True),
        # Output at each forecast step
        TimeDistributed(Dense(1))
    ])
    model.compile(optimizer='adam', loss='mse')
    return model

# Strategy 3: Recursive prediction
def recursive_forecast(model, initial_sequence, n_steps):
    """Predict one step at a time, feeding predictions back."""
    sequence = initial_sequence.copy()
    predictions = []
    for _ in range(n_steps):
        # Predict next step
        pred = model.predict(sequence.reshape(1, -1, 1), verbose=0)
        predictions.append(pred[0, 0])
        # Shift sequence and append prediction
        sequence = np.roll(sequence, -1)
        sequence[-1] = pred[0, 0]
    return np.array(predictions)

# Example usage
direct_model = create_direct_multistep_model(60, 1, 7)  # Predict 7 days
seq2seq_model = create_seq2seq_model(60, 1, 7)
Time Series Best Practices
Always use time-based train-test splits (never shuffle time series data). Normalize data within training set and apply same transform to test set to prevent data leakage. Consider using multiple lookback windows and pick the best through validation. For highly seasonal data, ensure your lookback covers at least one full cycle. Monitor for concept drift in production where patterns change over time.
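The split-then-normalize rule can be sketched with scikit-learn's MinMaxScaler (used earlier in this section); the key is fitting the scaler on the training portion only:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = np.random.randn(1000).cumsum().reshape(-1, 1)  # simulated series

# 1. Chronological split FIRST -- never shuffle time series
split = int(len(values) * 0.8)
train, test = values[:split], values[split:]

# 2. Fit the scaler on the training set only...
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)

# 3. ...then apply the SAME transform to the test set
test_scaled = scaler.transform(test)  # may fall outside [0, 1]; that's expected

print(train_scaled.min(), train_scaled.max())  # 0.0 1.0
```

Fitting the scaler on the full series would let the test set's range influence the training inputs, a subtle form of the leakage this section warns against.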
Practice: Time Series Prediction
Question: Why must time series data never be shuffled when splitting into train and test sets?
Answer: Shuffling would cause data leakage where future information appears in the training set. In time series, we must simulate real-world conditions where we only have access to past data. The test set should always come after the training set chronologically to properly evaluate how the model performs on unseen future data.
Question: What are the trade-offs between larger and smaller lookback windows?
Answer: Larger lookback windows capture more historical patterns and long-term dependencies, but increase computational cost, memory usage, and risk of including irrelevant old data. Smaller windows are faster and may work well for short-term patterns, but may miss important seasonal or cyclical patterns. The optimal size depends on the data characteristics and should be chosen through validation.
Question: Compare direct and recursive strategies for multi-step forecasting.
Answer: Direct prediction outputs all future steps at once, avoiding error accumulation but requiring the model to learn multiple output relationships simultaneously. Recursive prediction is simpler but errors compound as predictions are fed back as inputs. Direct is better for short horizons; recursive can use simpler models but degrades over longer horizons. Encoder-decoder architectures offer a middle ground, learning to generate output sequences step-by-step while maintaining a learned state.
Attention Mechanisms
Attention mechanisms revolutionized sequence modeling by allowing networks to focus on relevant parts of the input when producing each output. Instead of compressing an entire sequence into a fixed-size vector, attention creates dynamic weighted connections between inputs and outputs. This breakthrough led directly to Transformers, which replaced RNNs entirely for many NLP tasks.
An attention mechanism computes a weighted sum of input representations, where the weights indicate relevance to the current task. It uses three components: a Query (what we are looking for), Keys (what we are comparing against), and Values (what we retrieve). Attention weights are computed as softmax(Query * Keys^T), then applied to Values to get the context-aware output.
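A minimal NumPy sketch of that computation, with toy dimensions and a single query:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d = 5, 8
keys = np.random.randn(seq_len, d)    # what we are comparing against
values = np.random.randn(seq_len, d)  # what we retrieve
query = np.random.randn(d)            # what we are looking for

scores = query @ keys.T   # one compatibility score per input position
weights = softmax(scores) # relevance as a probability distribution
context = weights @ values  # weighted sum of values: the attention output

print(context.shape)  # (8,)
```

The weights sum to 1, so positions with high Query-Key compatibility dominate the context while the rest are suppressed rather than discarded.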
The Bottleneck Problem
Traditional encoder-decoder models compress the entire input sequence into a single fixed-size context vector. For long sequences, this bottleneck loses important information. Attention allows the decoder to look back at all encoder states, focusing on relevant parts for each output step.
# The bottleneck problem in seq2seq
# (Illustrative pseudocode: encoder_step, decoder_step, and attention
# are helper methods whose implementations are omitted.)
import numpy as np

# Traditional Encoder-Decoder (bottleneck)
class TraditionalSeq2Seq:
    def encode(self, input_sequence):
        """Compress entire sequence into single vector."""
        hidden = np.zeros(256)
        for token in input_sequence:
            hidden = self.encoder_step(token, hidden)
        # Only final hidden state passed to decoder
        return hidden  # Lost information about early tokens!

    def decode(self, context_vector, max_length):
        """Generate output using only the context vector."""
        hidden = context_vector
        outputs = []
        for _ in range(max_length):
            output, hidden = self.decoder_step(hidden)
            outputs.append(output)
        return outputs

# With Attention: decoder can access ALL encoder states
class AttentionSeq2Seq:
    def encode(self, input_sequence):
        """Store all hidden states for attention."""
        hidden_states = []
        hidden = np.zeros(256)
        for token in input_sequence:
            hidden = self.encoder_step(token, hidden)
            hidden_states.append(hidden)  # Keep all states!
        return np.array(hidden_states)

    def decode_with_attention(self, encoder_states, max_length):
        """Use attention to focus on relevant encoder states."""
        hidden = np.zeros(256)
        outputs = []
        for _ in range(max_length):
            # Compute attention weights
            context = self.attention(hidden, encoder_states)
            # Combine with decoder state
            output, hidden = self.decoder_step(hidden, context)
            outputs.append(output)
        return outputs
Bahdanau Attention (Additive)
Bahdanau attention (2014) was the first attention mechanism for neural machine translation. It uses a feedforward network to compute alignment scores between the decoder state and each encoder state.
# Bahdanau (Additive) Attention implementation
import tensorflow as tf
from tensorflow.keras.layers import Layer, Dense

class BahdanauAttention(Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = Dense(units)  # Transform encoder states
        self.W2 = Dense(units)  # Transform decoder state
        self.V = Dense(1)       # Score function

    def call(self, decoder_hidden, encoder_outputs):
        """
        decoder_hidden: (batch, hidden_size) - current decoder state
        encoder_outputs: (batch, seq_len, hidden_size) - all encoder states
        """
        # Expand decoder hidden for broadcasting
        # (batch, hidden_size) -> (batch, 1, hidden_size)
        decoder_hidden_expanded = tf.expand_dims(decoder_hidden, 1)
        # Compute attention scores
        # score = V(tanh(W1(encoder) + W2(decoder)))
        score = self.V(tf.nn.tanh(
            self.W1(encoder_outputs) + self.W2(decoder_hidden_expanded)
        ))  # (batch, seq_len, 1)
        # Convert to probabilities
        attention_weights = tf.nn.softmax(score, axis=1)  # (batch, seq_len, 1)
        # Weighted sum of encoder outputs
        context = attention_weights * encoder_outputs  # (batch, seq_len, hidden)
        context = tf.reduce_sum(context, axis=1)       # (batch, hidden)
        return context, attention_weights

# Example usage
attention = BahdanauAttention(units=64)
decoder_state = tf.random.normal((32, 256))
encoder_outputs = tf.random.normal((32, 50, 256))  # 50 tokens
context, weights = attention(decoder_state, encoder_outputs)
print(f"Context shape: {context.shape}")            # (32, 256)
print(f"Attention weights shape: {weights.shape}")  # (32, 50, 1)
Self-attention allows each position in a sequence to attend to all other positions in the same sequence. Unlike cross-attention between encoder and decoder, self-attention computes relationships within a single sequence. Each token generates Query, Key, and Value vectors; the attention output for each position is a weighted sum of all Values, where weights come from Query-Key compatibility. This is the foundation of Transformers.
Scaled Dot-Product Attention
Scaled dot-product attention, used in Transformers, is more computationally efficient than additive attention. It computes attention as softmax(QK^T / sqrt(d_k)) * V, where the scaling factor prevents softmax from having extremely small gradients.
# Scaled Dot-Product Attention (Transformer-style)
import tensorflow as tf

def scaled_dot_product_attention(query, key, value, mask=None):
    """
    Compute attention using scaled dot product.
    query: (batch, num_heads, seq_len_q, d_k)
    key: (batch, num_heads, seq_len_k, d_k)
    value: (batch, num_heads, seq_len_v, d_v)
    mask: optional mask for padding or causal attention
    """
    # Compute attention scores: Q @ K^T
    matmul_qk = tf.matmul(query, key, transpose_b=True)
    # Scale by sqrt(d_k) so large dot products don't push
    # softmax into regions with near-zero gradients
    d_k = tf.cast(tf.shape(key)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(d_k)
    # Apply mask if provided (for padding or causal masking)
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)
    # Softmax to get attention weights
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    # Weighted sum of values
    output = tf.matmul(attention_weights, value)
    return output, attention_weights

# Example: self-attention on a sequence
seq_len = 10
d_model = 64
# Same input serves as Q, K, V for self-attention
x = tf.random.normal((1, seq_len, d_model))
query = key = value = x
output, weights = scaled_dot_product_attention(query, key, value)
print(f"Output shape: {output.shape}")  # (1, 10, 64)
print("Each position attends to all 10 positions")
Attention in Keras RNN Models
Keras provides an Attention layer that can be integrated with RNN encoder-decoder models. This combines the sequential processing of LSTMs with the focusing ability of attention.
# Attention layer with LSTM in Keras
from tensorflow.keras.layers import (Input, LSTM, Dense, Embedding,
                                     Attention, Concatenate, TimeDistributed)
from tensorflow.keras.models import Model

def create_attention_seq2seq(input_vocab, output_vocab,
                             embedding_dim=256, hidden_units=512):
    """
    Sequence-to-sequence model with Luong-style (dot-product) attention,
    as implemented by the Keras Attention layer.
    For machine translation, summarization, etc.
    """
    # Encoder
    encoder_inputs = Input(shape=(None,), name='encoder_input')
    encoder_embedding = Embedding(input_vocab, embedding_dim)(encoder_inputs)
    encoder_lstm = LSTM(hidden_units, return_sequences=True,
                        return_state=True, name='encoder_lstm')
    encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
    # Decoder
    decoder_inputs = Input(shape=(None,), name='decoder_input')
    decoder_embedding = Embedding(output_vocab, embedding_dim)(decoder_inputs)
    decoder_lstm = LSTM(hidden_units, return_sequences=True,
                        return_state=True, name='decoder_lstm')
    decoder_outputs, _, _ = decoder_lstm(
        decoder_embedding,
        initial_state=[state_h, state_c]
    )
    # Attention layer: decoder outputs (queries) attend to encoder outputs
    attention = Attention(name='attention')
    context = attention([decoder_outputs, encoder_outputs])
    # Combine attention context with decoder output
    decoder_combined = Concatenate()([decoder_outputs, context])
    # Output projection
    output = TimeDistributed(Dense(output_vocab, activation='softmax'))(
        decoder_combined
    )
    model = Model([encoder_inputs, decoder_inputs], output)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    return model
# Create translation model
model = create_attention_seq2seq(
input_vocab=10000, # Source vocabulary
output_vocab=8000, # Target vocabulary
embedding_dim=256,
hidden_units=512
)
model.summary()
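To see what the Attention layer computes in isolation, here is a toy forward pass with random tensors; the shapes are illustrative, not tied to the model above. Keras `Attention` scores with dot products (Luong-style), while `AdditiveAttention` scores with a small learned feedforward network (Bahdanau-style):

```python
import tensorflow as tf
from tensorflow.keras.layers import Attention, AdditiveAttention

# Toy decoder states (queries) and encoder outputs (keys/values)
decoder_out = tf.random.normal((2, 4, 8))   # (batch, target_len, units)
encoder_out = tf.random.normal((2, 6, 8))   # (batch, source_len, units)

# Both layers take [query, value] and return one context vector
# per query (i.e., per decoder time step)
luong_context = Attention()([decoder_out, encoder_out])
bahdanau_context = AdditiveAttention()([decoder_out, encoder_out])

print(luong_context.shape)     # (2, 4, 8)
print(bahdanau_context.shape)  # (2, 4, 8)
```

Either layer can be dropped into the seq2seq model above; the dot-product variant is cheaper, while the additive variant adds learnable scoring parameters.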
From Attention to Transformers
The success of attention led researchers to ask: do we need RNNs at all? The Transformer architecture (2017) uses only attention mechanisms, processing all positions in parallel. This enables much faster training and better long-range dependency modeling. For most NLP tasks today, Transformers (BERT, GPT) have replaced RNNs. However, RNNs remain relevant for streaming data and resource-constrained environments.
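Keras exposes the Transformer's core building block directly as `MultiHeadAttention`. As a brief sketch, passing the same tensor as query, key, and value gives self-attention over every position at once, with no recurrence:

```python
import tensorflow as tf

# Multi-head self-attention: 4 heads, each projecting to 16 dimensions
mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=16)

x = tf.random.normal((1, 10, 64))   # (batch, seq_len, d_model)
out = mha(query=x, value=x, key=x)  # every position attends to every other

print(out.shape)  # (1, 10, 64): same shape as the input, computed in parallel
```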
Practice: Attention Mechanisms
Question: What roles do the Query, Key, and Value vectors play in attention?
Answer: Query (Q) represents what we are looking for. Keys (K) represent what we are comparing against. Values (V) contain the actual information we want to retrieve. Attention computes similarity between Q and K to determine which V elements to focus on, then returns a weighted sum of V.
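A tiny NumPy illustration of this retrieval view (the numbers here are made up): the query is deliberately most similar to the second key, so the softmax concentrates on it and the weighted sum lands close to the second value.

```python
import numpy as np

# Three key/value pairs; the query is closest to key 1
keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
values = np.array([10.0, 20.0, 30.0])
query = np.array([0.1, 5.0])

scores = keys @ query                             # Q.K similarity
weights = np.exp(scores) / np.exp(scores).sum()   # softmax
output = weights @ values                         # weighted sum of values

print(np.round(weights, 3))   # weight concentrates on key 1
print(round(float(output), 1))  # close to values[1] = 20.0
```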
Question: Why does scaled dot-product attention divide the scores by sqrt(d_k)?
Answer: When d_k is large, the dot products between Q and K grow large in magnitude, pushing the softmax into regions with extremely small gradients. Dividing by sqrt(d_k) keeps the variance of the dot products stable regardless of dimension, ensuring the softmax produces reasonable gradients for training.
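The variance argument can be checked numerically: for q and k with unit-variance entries, Var(q.k) grows like d_k, and dividing by sqrt(d_k) brings it back to roughly 1. A quick Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 512):
    q = rng.standard_normal((10000, d_k))
    k = rng.standard_normal((10000, d_k))
    dots = (q * k).sum(axis=1)    # raw dot products
    scaled = dots / np.sqrt(d_k)  # scaled as in attention
    # Raw variance grows like d_k; scaled variance stays near 1
    print(d_k, round(float(dots.var()), 1), round(float(scaled.var()), 2))
```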
Question: How does self-attention differ from encoder-decoder attention, and what advantages does it offer?
Answer: In encoder-decoder attention, the decoder queries attend to encoder key-value pairs from a different sequence. In self-attention, Q, K, and V all come from the same sequence, allowing each position to attend to all other positions within the sequence. This provides two key advantages: (1) direct modeling of long-range dependencies without passing through many RNN steps, and (2) parallel computation since all positions can be processed simultaneously, unlike the sequential nature of RNNs.
Question: How does attention handle word reordering between source and target languages?
Answer: Attention allows the decoder to focus on different source positions for each target word, regardless of their sequential order. When generating a word early in the target sentence that corresponds to a word late in the source sentence, attention can assign high weight to that distant source position. This flexibility handles word reordering naturally without requiring the model to memorize long-range alignments in its hidden state.
Interactive Demo: Attention Visualizer
Explore how attention mechanisms focus on different parts of a sequence. Select a target word to see which source words receive the most attention weight.
Understanding the Visualization
This demo simulates attention weights in neural machine translation. Notice how each French word primarily attends to its English equivalent, but also considers context. For example, "était" and "assis" (was sitting) both attend heavily to "sat", while articles like "Le" and "le" attend to their respective "The" and "the" positions.
Key Takeaways
Sequential Memory
RNNs process sequences one step at a time, maintaining hidden state that captures information from previous inputs
Long-Term Dependencies
LSTM cells use gates to control information flow, solving the vanishing gradient problem for long sequences
GRU Efficiency
GRU simplifies LSTM with fewer gates while maintaining similar performance, reducing computation
Time Series Forecasting
RNNs excel at predicting future values from historical patterns in stock prices, weather, and sensor data
Attention Focus
Attention mechanisms let models focus on relevant parts of input sequences, enabling better context understanding
NLP Foundation
RNNs and attention laid the groundwork for modern NLP, from machine translation to text generation
Knowledge Check
Test your understanding of Recurrent Neural Networks: