Introduction to Transformers
In 2017, a paper titled "Attention Is All You Need" introduced the Transformer architecture, fundamentally changing how we approach sequence processing. Unlike RNNs that process text sequentially, transformers process entire sequences in parallel, enabling unprecedented training efficiency and model capabilities.
What are Transformers?
Transformers are a type of neural network architecture designed for processing sequential data like text. Before transformers, Recurrent Neural Networks (RNNs) and LSTMs were the go-to models for NLP. But they had a critical limitation: they processed words one at a time, making training slow and struggling with long sentences where early words faded from memory.
Transformer
A Transformer is a neural network architecture that processes entire sequences simultaneously using self-attention mechanisms. This parallel processing enables faster training and better capture of long-range dependencies compared to sequential models like RNNs.
Key insight: Instead of processing words one by one, transformers let every word "look at" every other word at once, determining which relationships matter most.
Why Transformers Changed Everything
The transformer architecture solved multiple problems at once and enabled capabilities that were previously impossible. This is why virtually all modern language models are based on transformers.
Parallel Processing
Unlike RNNs that process tokens sequentially, transformers process all tokens simultaneously, enabling massive GPU parallelization.
Long-Range Dependencies
In a 100-word sentence, first and last words can directly attend to each other. RNNs struggle as information degrades over distance.
Scalability
Transformers scale efficiently with more data and parameters, enabling models such as GPT-3 with 175 billion parameters and beyond.
Transfer Learning
Pre-trained transformers can be fine-tuned for specific tasks with minimal data, adapting to your domain with just hundreds of examples.
RNN vs Transformer: A Visual Comparison
| Aspect | RNN/LSTM | Transformer |
|---|---|---|
| Processing | Sequential (one token at a time) | Parallel (all tokens at once) |
| Training Speed | Slow (cannot parallelize) | Fast (fully parallelizable) |
| Long Sequences | Struggles (vanishing gradients) | Handles well (direct attention) |
| Memory | Hidden state (fixed size) | Attention over all tokens |
| Scalability | Limited | Scales to billions of parameters |
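The contrast in this table can be sketched in a few lines of NumPy (a toy illustration, not a real RNN or transformer): the RNN-style loop carries a step-to-step data dependency, while the transformer-style pairwise scores come from a single matrix product that a GPU can parallelize.

```python
import numpy as np

np.random.seed(0)
seq_len, d = 100, 64
tokens = np.random.randn(seq_len, d)

# RNN-style: each step needs the state from the previous step,
# so the 100 steps cannot run in parallel
W = np.random.randn(d, d) * 0.01
state = np.zeros(d)
for t in range(seq_len):
    state = np.tanh(tokens[t] + W @ state)

# Transformer-style: one matrix product relates every token to
# every other token at once - a single parallelizable operation
scores = tokens @ tokens.T
print(scores.shape)  # (100, 100) pairwise relationships
```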
The Paper That Started It All
In June 2017, researchers at Google published "Attention Is All You Need" - a paper that would reshape the entire field of AI. The key innovation was removing recurrence entirely and relying solely on attention mechanisms to capture relationships between words.
# The impact of transformers: A timeline
transformers_timeline = {
    2017: "Attention Is All You Need paper published",
    2018: "BERT released by Google - revolutionizes NLP benchmarks",
    2019: "GPT-2 released by OpenAI - impressive text generation",
    2020: "GPT-3 with 175B parameters - few-shot learning",
    2022: "ChatGPT launches - AI goes mainstream",
    2023: "GPT-4 - multimodal capabilities",
    2024: "Transformers power search, coding, creative tools",
    2025: "Transformers are the backbone of most AI systems",
}

for year, event in transformers_timeline.items():
    print(f"{year}: {event}")
Output:
2017: Attention Is All You Need paper published
2018: BERT released by Google - revolutionizes NLP benchmarks
2019: GPT-2 released by OpenAI - impressive text generation
2020: GPT-3 with 175B parameters - few-shot learning
2022: ChatGPT launches - AI goes mainstream
2023: GPT-4 - multimodal capabilities
2024: Transformers power search, coding, creative tools
2025: Transformers are the backbone of most AI systems
Practice Questions: Introduction
Test your understanding of transformer basics.
Task: Explain in your own words why parallel processing makes transformers faster.
Show Answer
Transformers process all tokens in a sequence simultaneously, while RNNs must process tokens one at a time (sequentially). This means:
- RNN: Token 1, then Token 2, then Token 3... (100 steps for 100 tokens)
- Transformer: All 100 tokens at once (1 parallel step)
GPUs excel at parallel computation, so transformers leverage hardware much better.
Task: Give an example sentence where understanding requires connecting distant words.
Show Answer
Example: "The cat, which had been sitting on the warm windowsill all afternoon watching birds fly by, finally jumped down."
To understand that "jumped" refers to "cat" (not "birds" or "windowsill"), the model must connect words that are 15+ tokens apart. In RNNs, information about "cat" may fade by the time we reach "jumped". Transformers handle this easily because "jumped" can directly attend to "cat".
Task: Name three sequence models that came before transformers and their limitations.
Show Answer
- RNN (Recurrent Neural Network): Sequential processing, vanishing gradients, cannot parallelize
- LSTM (Long Short-Term Memory): Better memory than RNN but still sequential, complex gating mechanisms
- GRU (Gated Recurrent Unit): Simpler than LSTM but same sequential limitations
- Seq2Seq with Attention: Added attention to encoder-decoder but still used RNN backbone
Transformers eliminated the recurrent component entirely, using only attention.
Self-Attention Mechanism
Self-attention is the heart of transformers. It allows each word to look at every other word in the sentence and determine how much attention to pay to each one. This mechanism captures relationships regardless of distance, solving the long-range dependency problem that plagued RNNs.
Understanding Self-Attention
Imagine reading a sentence and trying to understand what each pronoun refers to. When you see "it" in a sentence, your brain automatically looks back to find what "it" refers to. Self-attention does this computationally - for every word, it calculates how much attention to pay to every other word.
Self-Attention
Self-attention is a mechanism where each element in a sequence computes attention scores with every other element, creating a weighted representation that captures contextual relationships. The "self" means the sequence attends to itself.
Intuition: For the word "bank" in "I sat by the river bank", self-attention helps the model attend more to "river" than to "sat", understanding that this is about nature, not finance.
Query, Key, and Value: The Attention Trinity
Self-attention uses three learned projections called Query (Q), Key (K), and Value (V). Think of it like a search engine: the Query is what you are looking for, Keys are labels on documents, and Values are the actual document contents. You match your Query against Keys to find relevant Values.
Query (Q)
"What am I looking for?" - The current token's representation that will be matched against all other tokens to find relevant context.
Key (K)
"What do I contain?" - Each token's identifier that Queries are compared against. High Query-Key similarity means high relevance.
Value (V)
"What information do I provide?" - The actual content that gets weighted and combined based on attention scores.
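The search-engine analogy can be made concrete as a "soft" dictionary lookup (a toy sketch; the entries, labels, and numbers are invented for illustration): instead of returning only the single best match, attention blends all Values by their Query-Key similarity.

```python
import numpy as np

# Keys "label" each entry; Values hold the content; the Query is what we seek
keys   = np.array([[1.0, 0.0],   # an entry labeled "river"
                   [0.0, 1.0]])  # an entry labeled "finance"
values = np.array([[10.0],       # content stored under "river"
                   [20.0]])      # content stored under "finance"
query  = np.array([0.9, 0.1])    # mostly looking for "river"

# Query-Key similarity, turned into weights via softmax
scores  = keys @ query
weights = np.exp(scores) / np.exp(scores).sum()

# A hard lookup would return one value; attention returns a weighted blend
output = weights @ values
print(weights.round(3), output.round(3))
```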
Computing Attention Step by Step
The attention formula is: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
Let us break this down into understandable steps:
Step 1: Set Up the Function
import numpy as np

def self_attention_simple(query, key, value):
    """
    Simplified self-attention computation.

    Args:
        query: What we're looking for (n_tokens x d_k)
        key: What each token offers (n_tokens x d_k)
        value: Information to retrieve (n_tokens x d_v)
    """
    d_k = key.shape[-1]  # Dimension of keys
We define a function that takes three inputs: Query (what we are searching for), Key (what each token contains), and Value (the actual information to retrieve). The variable d_k stores the dimension of our keys, which we will use later for scaling.
Step 2: Compute Raw Attention Scores
    # Step 1: Compute attention scores (Query @ Key^T)
    # This tells us how much each token relates to each other token
    scores = np.dot(query, key.T)
    print("Step 1 - Raw attention scores:")
    print(scores)
We multiply the Query matrix by the transpose of the Key matrix. This produces a score matrix where each entry (i, j) represents how much token i should attend to token j. Higher scores mean stronger relationships. This is the core of the attention mechanism - finding which tokens are relevant to each other.
Step 3: Scale the Scores
    # Step 2: Scale by sqrt(d_k) to prevent large values
    # Large values would make softmax too peaked (one-hot like)
    scaled_scores = scores / np.sqrt(d_k)
    print(f"\nStep 2 - Scaled scores (divided by sqrt({d_k})):")
    print(scaled_scores)
We divide the scores by the square root of the key dimension. Why? When dimensions are large, dot products can become very large numbers. Large values push the softmax function to produce extreme outputs (close to 0 or 1), which causes vanishing gradients during training. Scaling keeps the values in a reasonable range for stable learning.
Step 4: Apply Softmax for Weights
    # Step 3: Apply softmax to get attention weights (sum to 1)
    attention_weights = np.exp(scaled_scores) / np.sum(np.exp(scaled_scores), axis=-1, keepdims=True)
    print("\nStep 3 - Attention weights (softmax):")
    print(attention_weights)
The softmax function converts our scaled scores into probabilities that sum to 1. Each row now represents a probability distribution over all tokens. For example, if the first row is [0.6, 0.3, 0.1], it means the first token pays 60% attention to itself, 30% to the second token, and 10% to the third.
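One practical note: exponentiating raw scores directly can overflow for large values. Softmax is unchanged by subtracting a constant from every score, so real implementations first subtract the row maximum. A small sketch of this standard trick:

```python
import numpy as np

def stable_softmax(x, axis=-1):
    # Softmax is invariant to adding a constant to every score,
    # so subtracting the max shifts all exponents into a safe range
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

big_scores = np.array([[1000.0, 1001.0, 1002.0]])  # naive exp() would overflow
print(stable_softmax(big_scores))  # finite weights that still sum to 1
```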
Step 5: Compute Weighted Sum of Values
    # Step 4: Multiply weights by values to get output
    output = np.dot(attention_weights, value)
    print("\nStep 4 - Final output (weighted sum of values):")
    print(output)
    return output, attention_weights
Finally, we multiply the attention weights by the Value matrix. This produces a weighted sum of all token values, where tokens with higher attention weights contribute more. The result is a new representation for each token that incorporates context from all other relevant tokens.
Step 6: Run an Example
# Example: 3 tokens with dimension 4
np.random.seed(42)
tokens = np.random.randn(3, 4) # 3 tokens, 4 dimensions each
# In practice, Q, K, V come from learned linear projections
# Here we use the same tokens for simplicity
output, weights = self_attention_simple(tokens, tokens, tokens)
We create a simple example with 3 tokens, each represented by 4 dimensions. In real transformers, Q, K, and V are computed by multiplying the input by learned weight matrices. Here, we use the same tokens for all three to keep it simple. The function returns both the contextualized output and the attention weights for visualization.
Output:
Step 1 - Raw attention scores:
[[ 3.    2.11 -1.32]
 [ 2.11  3.19 -1.11]
 [-1.32 -1.11  0.95]]
Step 2 - Scaled scores (divided by sqrt(4)):
[[ 1.5   1.05 -0.66]
 [ 1.05  1.6  -0.55]
 [-0.66 -0.55  0.47]]
Step 3 - Attention weights (softmax):
[[0.57 0.36 0.07]
 [0.34 0.59 0.07]
 [0.19 0.21 0.6 ]]
Step 4 - Final output (weighted sum of values):
[[ 0.17 -0.13  0.91  1.12]
 [-0.00 -0.15  1.12  0.94]
 [-0.23  0.25  0.19  0.18]]
Looking at the attention weights (shown rounded to two decimals, so each row sums to 1), token 1 attends mostly to itself (0.57) but also strongly to token 2 (0.36), token 2 focuses on itself (0.59), and token 3 attends strongly to itself (0.60). This makes sense - each token finds itself most relevant. The final output shows how each token's representation has been updated by incorporating information from all tokens according to these weights.
Multi-Head Attention
A single attention head might focus on one type of relationship (like syntax). Multi-head attention runs multiple attention operations in parallel, allowing the model to capture different types of relationships simultaneously - one head for syntax, another for semantics, another for coreference.
Step 1: Define the Class and Initialize Parameters
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.d_k = d_model // num_heads  # Dimension per head

        # Linear projections for Q, K, V (and output)
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
We create a PyTorch module for multi-head attention. The key insight is that we split the model dimension (512) across multiple heads (8), so each head works with 64 dimensions. We define four learnable linear layers: three to project inputs into Query, Key, and Value spaces, and one to project the final concatenated output back to the model dimension.
Step 2: Project Input to Q, K, V
    def forward(self, x):
        batch_size, seq_len, _ = x.shape

        # Project to Q, K, V
        Q = self.W_q(x)  # (batch, seq_len, d_model)
        K = self.W_k(x)
        V = self.W_v(x)
In the forward pass, we first get the input dimensions. Then we pass the input through three separate linear projections to create Query, Key, and Value matrices. Each projection learns different aspects of the input - the Query learns what to look for, the Key learns what each position offers, and the Value learns what information to provide.
Step 3: Reshape for Multiple Heads
        # Reshape for multi-head: (batch, num_heads, seq_len, d_k)
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
Here is where the "multi-head" magic happens. We reshape each matrix to split the 512 dimensions into 8 heads of 64 dimensions each. The view operation reshapes the tensor, and transpose rearranges the dimensions so that each head's computation can be done in parallel. After this, each head has its own 64-dimensional subspace to work with.
Step 4: Compute Attention for All Heads
        # Compute attention for all heads in parallel
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
        attention = torch.softmax(scores, dim=-1)
        context = torch.matmul(attention, V)
This is the same attention computation we saw earlier, but now happening for all 8 heads simultaneously thanks to tensor operations. We compute scores by multiplying Q and K, scale by the square root of dimension, apply softmax to get weights, and multiply by V to get context. Each head learns to attend to different types of relationships in the data.
Step 5: Concatenate Heads and Project Output
        # Concatenate heads and project
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        output = self.W_o(context)
        return output
After each head computes its attention independently, we need to combine them. We transpose back to the original dimension order, then reshape to concatenate all heads' outputs (8 heads x 64 dimensions = 512 dimensions). Finally, we apply one more linear projection to mix information across heads, allowing the model to combine insights from different attention patterns.
Step 6: Test with Example Input
# Example usage
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512) # batch=2, seq_len=10, d_model=512
output = mha(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Each head processes: {512 // 8} dimensions")
We create a multi-head attention module and test it with a batch of 2 sequences, each containing 10 tokens represented by 512 dimensions. The output has the same shape as input - this is important because it allows us to stack multiple attention layers. Each of the 8 heads processes only 64 dimensions, but together they cover all 512 dimensions from different perspectives.
Output:
Input shape: torch.Size([2, 10, 512])
Output shape: torch.Size([2, 10, 512])
Each head processes: 64 dimensions
Visualizing Attention Patterns
Attention weights reveal what the model "pays attention to". Different heads learn different patterns - some attend to adjacent words (local), others to sentence structure (global), and some to specific linguistic relationships like subject-verb agreement.
| Attention Head | What It Might Learn | Example Pattern |
|---|---|---|
| Head 1 | Positional (nearby words) | "the big" - "the" attends to "big" |
| Head 2 | Syntactic (grammar structure) | Subject attends to verb |
| Head 3 | Coreference (pronouns) | "she" attends to "Maria" |
| Head 4 | Semantic similarity | "happy" attends to "joyful" |
Practice Questions: Self-Attention
Deepen your understanding of attention mechanisms.
Task: Explain why attention weights for each token must sum to 1.
Show Answer
Attention weights sum to 1 because softmax is applied, which normalizes scores into a probability distribution. This means:
- Weights represent "how much attention" as percentages
- The output is a weighted average of values
- If one weight is high, others must be lower (competition)
Example: If "cat" gets 0.6 attention, other words share the remaining 0.4.
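This property is easy to check numerically; the sketch below (arbitrary random scores, seeded for reproducibility) confirms that every row of softmaxed scores sums to 1, and that raising one score drains weight from the others in its row.

```python
import numpy as np

np.random.seed(0)
scores = np.random.randn(3, 3)  # arbitrary raw attention scores

weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
print(weights.sum(axis=-1))  # every row sums to 1

# Raising one score lowers the other weights in that row (competition)
scores[0, 0] += 5.0
weights2 = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
print(weights2[0].round(3))
```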
Task: Given Q=[1, 0], K=[[1, 0], [0, 1]], V=[[5], [10]], calculate the attention output.
Show Solution
import numpy as np
Q = np.array([[1, 0]])
K = np.array([[1, 0], [0, 1]])
V = np.array([[5], [10]])
# Score = Q @ K^T = [[1*1 + 0*0, 1*0 + 0*1]] = [[1, 0]]
scores = Q @ K.T
print(f"Scores: {scores}")
# Scale (d_k = 2)
scaled = scores / np.sqrt(2)
print(f"Scaled: {scaled}")
# Softmax
weights = np.exp(scaled) / np.sum(np.exp(scaled))
print(f"Weights: {weights}")
# Output = weights @ V
output = weights @ V
print(f"Output: {output}") # Closer to 5 than 10 (attends more to first token)
Task: Explain the benefits of multi-head attention over single-head attention.
Show Answer
Multi-head attention provides several benefits:
- Different relationship types: Each head can specialize in capturing different patterns (syntax, semantics, position)
- Richer representations: Combining multiple perspectives creates more expressive features
- Stability: Multiple heads provide redundancy - if one head fails, others compensate
- No additional cost: Total computation is similar (heads split the dimensions)
Analogy: Like having multiple experts review a document, each focusing on different aspects.
Transformer Architecture
The transformer architecture consists of an encoder stack and a decoder stack, each made up of identical layers. The encoder processes the input sequence, while the decoder generates the output sequence. Together, they form a powerful sequence-to-sequence model.
The Big Picture
The original transformer was designed for machine translation (e.g., English to French). The encoder reads the entire input sentence and builds a rich representation. The decoder then uses this representation to generate the output sentence one word at a time.
Encoder-Decoder Structure
The transformer has two main parts: the Encoder (processes input) and the Decoder (generates output). Each consists of stacked identical layers (typically 6). The encoder sees all tokens at once; the decoder generates tokens sequentially but uses attention to look at all previously generated tokens.
Key difference from RNN: Even though the decoder generates tokens one by one, it uses attention (not recurrence) to access context, making it parallelizable during training.
Encoder: Understanding the Input
Each encoder layer has two main components: multi-head self-attention and a feed-forward network. Both have residual connections and layer normalization. The encoder processes the input and creates contextualized representations that the decoder can attend to.
Step 1: Define the Encoder Layer Class
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Multi-head self-attention
        self.self_attention = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
We create an encoder layer class with key hyperparameters: d_model is the dimension of each token (512), num_heads is the number of attention heads (8), d_ff is the feed-forward hidden dimension (2048, typically 4x d_model), and dropout prevents overfitting. The first component is multi-head self-attention with layer normalization.
Step 2: Add the Feed-Forward Network
        # Feed-forward network
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
The feed-forward network is a two-layer MLP applied to each position independently. It first expands the dimensions from 512 to 2048 (allowing more expressive transformations), applies ReLU activation, then projects back to 512. This "expand-contract" pattern with non-linearity helps the model learn complex patterns. We add a second layer normalization and dropout for regularization.
Step 3: Define the Forward Pass with Residual Connections
    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output, _ = self.self_attention(x, x, x, key_padding_mask=mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed-forward with residual connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + ff_output)
        return x
The forward pass follows a pattern: sublayer output + residual connection + layer normalization. First, we apply self-attention where each token attends to all other tokens. The residual connection (x + attn_output) adds the original input back, helping gradients flow during training. Then we do the same for the feed-forward network. This "Add & Norm" pattern is crucial for training deep transformers.
Step 4: Stack Multiple Encoder Layers
# Stack 6 encoder layers
class Encoder(nn.Module):
    def __init__(self, num_layers=6, d_model=512, num_heads=8):
        super().__init__()
        self.layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads) for _ in range(num_layers)
        ])

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask)
        return x
The full encoder stacks 6 identical encoder layers. We use nn.ModuleList to properly register the layers with PyTorch. In the forward pass, the input flows through each layer sequentially, with each layer refining the representations further. By the end, each token's representation incorporates rich contextual information from the entire sequence.
Step 5: Check Parameter Count
encoder = Encoder()
print(f"Encoder has {sum(p.numel() for p in encoder.parameters()):,} parameters")
We instantiate the encoder and count its learnable parameters. With 6 layers, each containing attention projections and feed-forward networks, the total comes to about 19 million parameters. This is just the encoder - the full transformer (encoder + decoder) has roughly double this. Modern large language models have billions of parameters, but the architecture remains similar.
Output:
Encoder has 18,914,304 parameters
Decoder: Generating the Output
The decoder is similar to the encoder but adds a crucial third component: cross-attention. This allows each decoder position to attend to all encoder outputs, connecting the generated output to the input context. The decoder also uses masked self-attention to prevent "peeking" at future tokens.
Masked Self-Attention
Prevents the decoder from seeing future tokens during training. Position i can only attend to positions 0 through i (itself and earlier tokens), ensuring autoregressive generation.
Cross-Attention
Connects decoder to encoder outputs. The decoder's Query attends to encoder's Keys and Values, pulling relevant input information for each generated token.
Feed-Forward Network
Same as encoder: a two-layer MLP that processes each position independently. Adds non-linearity and increases model capacity.
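The causal mask behind masked self-attention can be sketched as an upper-triangular matrix of -inf values added to the scores before softmax (a minimal illustration; real implementations vary in how the mask is built and applied):

```python
import torch

seq_len = 5
# -inf above the diagonal: position i may attend to positions 0..i only
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

scores = torch.randn(seq_len, seq_len)          # raw attention scores
weights = torch.softmax(scores + mask, dim=-1)  # masked positions get weight 0

# Row 0 attends only to token 0; row 4 may attend to all five tokens
print(weights)
```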
Positional Encoding: Injecting Word Order
Since transformers process all tokens in parallel, they have no inherent notion of word order. "The cat sat on the mat" and "mat the on sat cat the" would look identical! Positional encoding adds information about each token's position in the sequence.
Step 1: Define the Encoding Function
import numpy as np

def positional_encoding(max_len, d_model):
    """
    Generate positional encodings using sine and cosine functions.

    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]
We create a function that generates positional encodings for any sequence length up to max_len. The formula uses sine and cosine functions at different frequencies. We initialize an empty matrix to hold the encodings (max_len x d_model) and create a column vector of positions (0, 1, 2, ..., max_len-1) for broadcasting in later calculations.
Step 2: Compute the Frequency Term
    # Compute the division term
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
This computes the frequency for each dimension. Lower dimensions get higher frequencies (faster oscillation), while higher dimensions get lower frequencies (slower oscillation). The base of 10000 was chosen empirically - it creates a good spread of frequencies that allows the model to learn both fine-grained and coarse-grained positional information.
Step 3: Apply Sine and Cosine
    # Apply sine to even indices, cosine to odd indices
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe
We apply sine to even dimensions (0, 2, 4, ...) and cosine to odd dimensions (1, 3, 5, ...). Using both sine and cosine together is important - it means for any fixed offset k, the encoding for position (pos+k) can be expressed as a linear function of the encoding at position pos. This helps the model learn relative positions, not just absolute ones.
Step 4: Generate and Inspect the Encodings
# Generate positional encoding
pe = positional_encoding(max_len=100, d_model=512)
print("Positional Encoding Shape:", pe.shape)
print("\nFirst 5 positions, first 8 dimensions:")
print(pe[:5, :8].round(3))
We generate encodings for 100 positions with 512 dimensions each. Looking at the first 5 positions and 8 dimensions, you can see the patterns: position 0 starts at [0, 1, 0, 1, ...] (sine=0, cosine=1 at the start). Each subsequent position shifts these values based on the frequency, creating unique fingerprints for each position.
Step 5: Compare Different Positions
# The encoding for position 0
print("\nPosition 0 (first 10 values):", pe[0, :10].round(3))
# The encoding for position 50 (very different!)
print("Position 50 (first 10 values):", pe[50, :10].round(3))
By comparing position 0 and position 50, we see they have completely different patterns. Position 0 has the "starting" pattern (sin=0, cos=1), while position 50 shows how the waves have evolved. The model learns to decode these patterns to understand where each token sits in the sequence, enabling it to handle word order appropriately.
Output:
Positional Encoding Shape: (100, 512)
First 5 positions, first 8 dimensions:
[[ 0.     1.     0.     1.     0.     1.     0.     1.   ]
 [ 0.841  0.54   0.822  0.57   0.802  0.598  0.782  0.623]
 [ 0.909 -0.416  0.936 -0.351  0.958 -0.286  0.975 -0.223]
 [ 0.141 -0.99   0.245 -0.97   0.343 -0.939  0.434 -0.901]
 [-0.757 -0.654 -0.657 -0.754 -0.549 -0.836 -0.434 -0.901]]
Position 0 (first 10 values): [0. 1. 0. 1. 0. 1. 0. 1. 0. 1.]
Position 50 (first 10 values): [-0.262  0.965 -0.895 -0.445  0.561 -0.828  0.785  0.62  -0.632  0.775]
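The relative-position property described earlier (the encoding at position pos+k being a linear function of the encoding at position pos) can be verified for a single sine/cosine pair; the frequency f and offset k below are arbitrary choices for the check.

```python
import numpy as np

f, k = 0.05, 7
# Shifting the position by k is a fixed rotation, independent of pos
rotation = np.array([[np.cos(k * f),  np.sin(k * f)],
                     [-np.sin(k * f), np.cos(k * f)]])

for pos in [0, 10, 50]:
    pe_pos   = np.array([np.sin(pos * f), np.cos(pos * f)])
    pe_shift = np.array([np.sin((pos + k) * f), np.cos((pos + k) * f)])
    # The same matrix maps PE(pos) to PE(pos + k) for every pos
    assert np.allclose(rotation @ pe_pos, pe_shift)

print("PE(pos + k) is a fixed linear function of PE(pos)")
```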
Component Summary
| Component | Location | Purpose |
|---|---|---|
| Self-Attention | Encoder and Decoder | Capture relationships within a sequence |
| Cross-Attention | Decoder only | Connect decoder to encoder outputs |
| Feed-Forward | Encoder and Decoder | Add non-linearity, process each position |
| Positional Encoding | Input to both | Inject word order information |
| Layer Normalization | After each sublayer | Stabilize training, normalize activations |
| Residual Connections | Around each sublayer | Enable gradient flow, easier optimization |
Practice Questions: Architecture
Test your knowledge of transformer components.
Task: Explain why the decoder needs to mask future tokens.
Show Answer
The causal (look-ahead) mask prevents the decoder from "cheating" during training by:
- Ensuring position i can only see positions 0 through i (itself and earlier tokens), never later ones
- Simulating autoregressive generation during training
- Forcing the model to learn to predict the next token without seeing it
Without this mask, the model could just copy the answer instead of learning to generate.
Task: Explain the difference between self-attention and cross-attention in terms of Q, K, V sources.
Show Answer
Self-Attention (Encoder):
- Q, K, V all come from the same source (encoder input)
- Each token attends to all other tokens in the same sequence
Cross-Attention (Decoder):
- Q comes from decoder (what we are generating)
- K, V come from encoder output (the input context)
- Allows decoder to "look at" the input while generating
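This difference can be sketched with PyTorch's nn.MultiheadAttention, which takes query, key, and value as separate arguments (the dimensions below are illustrative, and one module is reused for both calls just to show the shapes):

```python
import torch
import torch.nn as nn

d_model = 64
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

encoder_out = torch.randn(1, 12, d_model)  # 12 input tokens from the encoder
decoder_x   = torch.randn(1, 5, d_model)   # 5 tokens generated so far

# Self-attention: Q, K, V all come from the same sequence
self_out, _ = attn(decoder_x, decoder_x, decoder_x)

# Cross-attention: Q from the decoder, K and V from the encoder output
cross_out, _ = attn(decoder_x, encoder_out, encoder_out)

# Both outputs take their sequence length from the Query
print(self_out.shape, cross_out.shape)
```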
Task: Explain the benefits of residual connections in deep transformer networks.
Show Answer
Residual connections (x + sublayer(x)) provide several benefits:
- Gradient flow: Gradients can flow directly through the skip connection, preventing vanishing gradients in deep networks
- Easier optimization: The network only needs to learn the "residual" (difference), which is often easier
- Identity mapping: If a layer is not useful, it can learn to output zero, effectively skipping that layer
- Depth without degradation: Enables training very deep models (50+ layers) that would otherwise degrade
BERT and GPT
BERT and GPT represent two different approaches to using transformers. BERT uses bidirectional encoding to understand context from both directions, excelling at understanding tasks. GPT uses autoregressive decoding to generate text one token at a time, excelling at generation tasks.
BERT: Bidirectional Encoder Representations from Transformers
BERT, released by Google in 2018, uses only the encoder part of the transformer. Its key innovation is bidirectional training - it sees context from both left and right simultaneously. This makes BERT excellent at understanding tasks like classification, question answering, and named entity recognition.
BERT
BERT is an encoder-only transformer pre-trained on two tasks: Masked Language Modeling (predicting masked words) and Next Sentence Prediction. It reads text bidirectionally, meaning each token can attend to tokens on both its left and right.
Pre-training objective: "The cat [MASK] on the mat" - BERT learns to predict that [MASK] should be "sat" by looking at both "cat" (left) and "on the mat" (right).
from transformers import BertTokenizer, BertForMaskedLM
import torch
We import the necessary modules from the Hugging Face Transformers library. BertTokenizer handles converting text to tokens and back, while BertForMaskedLM is the BERT model configured for masked language modeling (predicting hidden words).
# Load pre-trained BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
We load the pre-trained BERT base model (uncased means it treats "Hello" and "hello" the same). This model was trained on millions of documents and already understands language. The from_pretrained method downloads the model weights from Hugging Face's servers.
# Masked language modeling example
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors='pt')
We create a sentence with a [MASK] token where we want BERT to predict the missing word. The tokenizer converts this text into numerical IDs that the model understands. The return_tensors='pt' argument returns PyTorch tensors ready for the model.
# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits
We run the model in inference mode (torch.no_grad() disables gradient computation for efficiency). The model outputs logits (raw scores before softmax) for every position in the sequence. Each position has scores for all 30,000+ vocabulary words.
# Find the masked token position
mask_position = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero()[0, 1]
# Get top 5 predictions for the masked position
top_5_tokens = torch.topk(predictions[0, mask_position], 5).indices
We locate where the [MASK] token is in our input sequence. Then we use torch.topk
to find the 5 highest-scoring word predictions for that position. These are the words BERT
thinks are most likely to fill in the blank.
print(f"Input: {text}")
print("Top 5 predictions for [MASK]:")
for i, token_id in enumerate(top_5_tokens):
    token = tokenizer.decode([token_id])
    prob = torch.softmax(predictions[0, mask_position], dim=0)[token_id]
    print(f"  {i+1}. {token} ({prob:.1%})")
Finally, we display the results. We convert token IDs back to readable words using the tokenizer's decode method. We also apply softmax to convert raw scores into probabilities, showing how confident BERT is about each prediction. "Paris" wins with 92.3% confidence!
Output:
Input: The capital of France is [MASK].
Top 5 predictions for [MASK]:
1. paris (92.3%)
2. lyon (1.8%)
3. marseille (0.9%)
4. france (0.7%)
5. toulouse (0.5%)
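The softmax step that turns BERT's raw logits into the probabilities shown above can be sketched with the standard library alone. The logit values here are toy numbers for illustration, not real BERT scores:

```python
import math

def softmax(scores):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtract the max before exponentiating for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for three candidate words at the [MASK] position
logits = [5.0, 2.0, 1.0]
probs = softmax(logits)
print([round(p, 3) for p in probs])  # → [0.936, 0.047, 0.017]
```

Note how a modest gap in logits (5.0 vs 2.0) becomes a large gap in probability, which is why a single prediction like "paris" can dominate the top-5 list.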
GPT: Generative Pre-trained Transformer
GPT, developed by OpenAI, uses only the decoder part of the transformer. It is trained to predict the next token given all previous tokens (autoregressive). This makes GPT excellent at text generation - from completing sentences to writing essays, code, and even poetry.
GPT
GPT is a decoder-only transformer pre-trained on next-token prediction. It reads text left-to-right, with each token only able to attend to previous tokens. This autoregressive nature makes it ideal for text generation tasks.
Pre-training objective: Given "The cat sat on the", predict the next word is "mat". GPT learns patterns from billions of text examples to generate coherent continuations.
from transformers import GPT2Tokenizer, GPT2LMHeadModel
We import GPT-2 components from Hugging Face. GPT2Tokenizer converts text to tokens
using Byte Pair Encoding (BPE), while GPT2LMHeadModel is the GPT-2 model with a
language modeling head for text generation.
# Load pre-trained GPT-2
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
We load the smallest GPT-2 model (124 million parameters). Larger versions include gpt2-medium (355M), gpt2-large (774M), and gpt2-xl (1.5B). Each was trained on 40GB of internet text to learn patterns of natural language.
# Text generation example
prompt = "Artificial intelligence will"
inputs = tokenizer(prompt, return_tensors='pt')
We provide a prompt - the starting text that GPT will continue. The tokenizer converts it to token IDs. Unlike BERT, there is no [MASK] token; GPT simply continues from where the text ends.
# Generate text
outputs = model.generate(
    inputs['input_ids'],
    max_length=50,
    num_return_sequences=1,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
The generate method handles the autoregressive loop. Key parameters:
max_length limits total tokens, temperature controls randomness (lower =
more deterministic, higher = more creative), do_sample=True enables sampling instead
of always picking the most likely word. The model generates one token at a time, feeding each
new token back as input until done.
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Prompt: {prompt}")
print(f"Generated: {generated_text}")
We decode the generated token IDs back to human-readable text. The skip_special_tokens=True
removes technical tokens like end-of-sequence markers. The result is a coherent continuation of
our prompt, demonstrating GPT's ability to generate fluent, contextually relevant text.
Output:
Prompt: Artificial intelligence will
Generated: Artificial intelligence will transform how we work, learn, and interact with technology.
From healthcare diagnostics to autonomous vehicles, AI systems are becoming increasingly capable
of tasks that once required human intelligence.
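The autoregressive loop that generate performs can be sketched by hand. This is a toy illustration, not GPT-2: instead of a real transformer, it uses a hand-made bigram "model" that scores candidate next tokens given only the last token, and greedy decoding that always picks the top score:

```python
# Toy bigram "model": given the last token, score every candidate next token.
bigram_scores = {
    "artificial":   {"intelligence": 0.9, "sweetener": 0.1},
    "intelligence": {"will": 0.7, "is": 0.3},
    "will":         {"transform": 0.6, "fail": 0.4},
    "transform":    {"society": 0.8, "<eos>": 0.2},
    "society":      {"<eos>": 1.0},
}

def generate_greedy(prompt_tokens, max_length=10):
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        # The model only sees tokens generated so far (left-to-right)
        scores = bigram_scores.get(tokens[-1], {"<eos>": 1.0})
        # Greedy decoding: always pick the highest-scoring next token
        next_token = max(scores, key=scores.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # feed the new token back as input
    return tokens

print(generate_greedy(["artificial"]))
# → ['artificial', 'intelligence', 'will', 'transform', 'society']
```

Real GPT models do exactly this loop, except the next-token scores come from a transformer instead of a lookup table, and do_sample=True samples from the distribution rather than always taking the maximum.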
BERT vs GPT: Head-to-Head Comparison
BERT (Bidirectional)
Architecture: Encoder-only
Direction: Sees both left and right context
Best for: Understanding (classification, QA, NER)
Training: Masked language modeling
GPT (Autoregressive)
Architecture: Decoder-only
Direction: Left-to-right only
Best for: Generation (text, code, dialogue)
Training: Next-token prediction
| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder-only | Decoder-only |
| Context | Bidirectional (full sentence) | Unidirectional (left-to-right) |
| Pre-training | Masked LM + Next Sentence Prediction | Next Token Prediction |
| Use Cases | Classification, QA, NER, Similarity | Text Generation, Chatbots, Code |
| Fine-tuning | Add classification head | Prompt engineering or fine-tune |
| Famous Versions | BERT, RoBERTa, ALBERT, DistilBERT | GPT-2, GPT-3, GPT-4, ChatGPT |
The Evolution of Transformer Models
# Model sizes over time (approximate parameter counts)
model_evolution = {
    "BERT-base (2018)": "110M parameters",
    "BERT-large (2018)": "340M parameters",
    "GPT-2 (2019)": "1.5B parameters",
    "T5-11B (2020)": "11B parameters",
    "GPT-3 (2020)": "175B parameters",
    "PaLM (2022)": "540B parameters",
    "GPT-4 (2023)": "~1.7T parameters (estimated)",
    "Gemini Ultra (2024)": "Multi-trillion (estimated)"
}
print("Transformer Model Evolution:")
print("-" * 50)
for model, params in model_evolution.items():
    print(f"{model:25} | {params}")
# Growth factor
print(f"\nGrowth: BERT to GPT-4 = ~15,000x increase in parameters!")
Output:
Transformer Model Evolution:
--------------------------------------------------
BERT-base (2018) | 110M parameters
BERT-large (2018) | 340M parameters
GPT-2 (2019) | 1.5B parameters
T5-11B (2020) | 11B parameters
GPT-3 (2020) | 175B parameters
PaLM (2022) | 540B parameters
GPT-4 (2023) | ~1.7T parameters (estimated)
Gemini Ultra (2024) | Multi-trillion (estimated)
Growth: BERT to GPT-4 = ~15,000x increase in parameters!
Practice Questions: BERT and GPT
Compare and contrast these transformer giants.
Task: Explain the architectural reason BERT is not suited for text generation.
Show Answer
BERT cannot generate text effectively because:
- It was trained with bidirectional attention (sees all words at once)
- During generation, you do not have future words to condition on
- It was trained for masked prediction (fill in blanks), not sequential generation
- The encoder architecture has no autoregressive mechanism
GPT's left-to-right attention naturally supports generating one token at a time.
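The architectural difference in the answer above can be visualized with toy attention masks, where mask[i][j] == 1 means token i is allowed to attend to token j:

```python
n = 4  # toy sequence length

# BERT-style bidirectional mask: every token sees every token
bert_mask = [[1] * n for _ in range(n)]

# GPT-style causal mask: token i sees only positions j <= i
gpt_mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in gpt_mask:
    print(row)
# → [1, 0, 0, 0]
#   [1, 1, 0, 0]
#   [1, 1, 1, 0]
#   [1, 1, 1, 1]
```

The lower-triangular GPT mask is what makes generation possible: position i never depends on future positions, so the model can emit one token at a time. BERT's all-ones mask bakes future context into every representation, which is exactly what generation cannot provide.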
Task: Write code to classify sentiment using a pre-trained BERT model.
Show Solution
from transformers import pipeline
# Load pre-trained sentiment classifier (fine-tuned BERT)
classifier = pipeline("sentiment-analysis")
# Test sentences
sentences = [
    "I absolutely love this product!",
    "This was a complete waste of money.",
    "It's okay, nothing special."
]
print("Sentiment Analysis Results:")
for sentence in sentences:
    result = classifier(sentence)[0]
    print(f"{result['label']:8} ({result['score']:.1%}): {sentence}")
Task: Research and explain the training differences between GPT-3 and ChatGPT.
Show Answer
ChatGPT added two key training stages beyond GPT-3:
- Supervised Fine-Tuning (SFT): Human trainers provided example conversations, teaching the model how to be helpful
- Reinforcement Learning from Human Feedback (RLHF):
- Humans ranked multiple model outputs
- A reward model learned human preferences
- The model was fine-tuned to maximize the reward
Result: ChatGPT follows instructions better, refuses harmful requests, and produces more helpful responses.
Practical Applications
Transformers power the most impressive AI applications today, from ChatGPT and Google Search to code completion and image generation. Learning to use pre-trained transformer models through libraries like Hugging Face opens up powerful capabilities with minimal code.
Getting Started with Hugging Face
Hugging Face has become the go-to platform for working with transformers. Their library provides easy access to thousands of pre-trained models and simple APIs for common tasks. You can accomplish state-of-the-art NLP with just a few lines of code.
Step 1: Import the Pipeline API
# Install the transformers library
# pip install transformers torch
from transformers import pipeline
The pipeline API is Hugging Face's simplest interface for using transformers. It
automatically handles loading the right model, tokenizer, and post-processing for each task.
This abstraction lets you focus on your application rather than implementation details.
Step 2: Sentiment Analysis in One Line
# 1. Sentiment Analysis
sentiment = pipeline("sentiment-analysis")
result = sentiment("I love learning about transformers!")
print(f"Sentiment: {result[0]['label']} ({result[0]['score']:.1%})")
Just specify "sentiment-analysis" and the pipeline loads a fine-tuned DistilBERT model.
Pass any text and it returns whether the sentiment is POSITIVE or NEGATIVE along with a confidence
score. This model was trained on movie reviews but generalizes well to most English text.
Step 3: Named Entity Recognition
# 2. Named Entity Recognition
ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Elon Musk founded SpaceX in California.")
print(f"\nEntities found:")
for ent in entities:
    print(f"  {ent['word']:15} -> {ent['entity_group']:10} ({ent['score']:.1%})")
Named Entity Recognition (NER) identifies people (PER), organizations (ORG), locations (LOC),
and other entities in text. The aggregation_strategy="simple" combines multi-token
entities (like "Elon Musk" which is two tokens) into single entries. The model returns each
entity's text, type, and confidence score.
Step 4: Question Answering
# 3. Question Answering
qa = pipeline("question-answering")
context = "Transformers were introduced in 2017 by Google researchers in the paper Attention Is All You Need."
question = "When were transformers introduced?"
answer = qa(question=question, context=context)
print(f"\nQuestion: {question}")
print(f"Answer: {answer['answer']} (confidence: {answer['score']:.1%})")
Extractive question answering finds the answer within a given context. You provide both a question and a passage of text (context). The model identifies the span of text that best answers the question. It does not generate new text - it extracts directly from the context, which makes answers more reliable and verifiable.
Output:
Sentiment: POSITIVE (99.8%)
Entities found:
Elon Musk -> PER (99.2%)
SpaceX -> ORG (98.7%)
California -> LOC (99.5%)
Question: When were transformers introduced?
Answer: 2017 (confidence: 97.3%)
Real-World Applications of Transformers
Conversational AI
ChatGPT, Claude, and Gemini use transformer models to hold natural conversations, answer questions, and assist with tasks.
Code Completion
GitHub Copilot and similar tools use transformers to suggest code, complete functions, and even write entire programs from descriptions.
Search Engines
Google, Bing, and other search engines use BERT and similar models to understand search queries and rank results.
Translation
Google Translate and DeepL use transformer-based models for high-quality translation between hundreds of languages.
Content Generation
From marketing copy to email drafts, transformers help generate written content for businesses and individuals.
Image Generation
DALL-E, Midjourney, and Stable Diffusion use transformer components to generate images from text descriptions.
Text Summarization
Transformers excel at summarization - condensing long documents into concise summaries while preserving key information. This is invaluable for processing large amounts of text quickly.
Step 1: Load the Summarization Model
from transformers import pipeline
# Load summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
We load BART (Bidirectional and Auto-Regressive Transformers), a model specifically fine-tuned on the CNN/Daily Mail dataset for news summarization. BART combines BERT's bidirectional encoder with GPT's autoregressive decoder, making it excellent at understanding input and generating fluent output.
Step 2: Provide the Text to Summarize
# Long text to summarize
article = """
Transformers have revolutionized natural language processing since their introduction in 2017.
The architecture, introduced in the paper "Attention Is All You Need," replaced recurrent
neural networks with self-attention mechanisms. This change enabled parallel processing of
sequences, dramatically improving training efficiency.
The impact has been enormous. BERT, released by Google in 2018, achieved state-of-the-art
results on numerous NLP benchmarks. OpenAI's GPT series demonstrated impressive text
generation capabilities, culminating in ChatGPT which brought AI assistants to the mainstream.
Today, transformers power search engines, translation services, code assistants, and
conversational AI. The architecture has even expanded beyond NLP into computer vision
(Vision Transformers) and audio processing. Researchers continue to improve efficiency
and capabilities, making transformers increasingly accessible for diverse applications.
"""
Our example article is 143 words about transformers. In real applications, you might summarize news articles, research papers, emails, or any lengthy document. The model can handle texts up to about 1024 tokens (roughly 750-1000 words depending on vocabulary).
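For documents longer than the model's limit, a common workaround is to split the text into chunks and summarize each chunk separately. A minimal word-based sketch is below; note that real models count subword tokens, not words, so the max_words parameter here is only a rough proxy for the 1024-token limit:

```python
def chunk_text(text, max_words=200):
    """Split a long document into roughly equal word-count chunks.

    Real tokenizers count subword tokens, so word count is only
    an approximation of the model's actual context limit.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

doc = "word " * 450  # a 450-word stand-in document
chunks = chunk_text(doc, max_words=200)
print(len(chunks), [len(c.split()) for c in chunks])
# → 3 [200, 200, 50]
```

Each chunk can then be passed to the summarizer and the partial summaries joined (or summarized again) to cover the whole document.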
Step 3: Generate the Summary
# Generate summary
summary = summarizer(article, max_length=80, min_length=30, do_sample=False)
print("Original length:", len(article.split()), "words")
print("Summary length:", len(summary[0]['summary_text'].split()), "words")
print("\nSummary:")
print(summary[0]['summary_text'])
We control summary length with max_length and min_length parameters.
Setting do_sample=False uses deterministic generation (always picks the most likely
next word), ensuring consistent summaries. The model distills the key points: when transformers
were introduced, BERT's impact, and current applications.
Output:
Original length: 143 words
Summary length: 42 words
Summary:
Transformers have revolutionized natural language processing since their introduction in 2017.
BERT achieved state-of-the-art results on numerous NLP benchmarks. Today, transformers power
search engines, translation services, code assistants, and conversational AI.
Zero-Shot Classification
One of the most powerful capabilities of modern transformers is zero-shot classification - categorizing text into labels the model has never been explicitly trained on. This eliminates the need for task-specific training data.
Step 1: Load the Zero-Shot Classifier
from transformers import pipeline
# Load zero-shot classification pipeline
classifier = pipeline("zero-shot-classification")
Zero-shot classification uses a model trained on natural language inference (NLI). It works by reformulating classification as: "Does this text entail that it is about [label]?" This clever approach lets the model classify into any categories you specify, without retraining.
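The NLI reformulation can be made concrete with a small sketch. Each candidate label is slotted into a hypothesis sentence, and the NLI model then scores whether the input text (the premise) entails each hypothesis. The template string below is an assumption for illustration; the Hugging Face pipeline exposes a similar configurable hypothesis_template parameter:

```python
def build_nli_pairs(text, labels, template="This example is {}."):
    """Turn a classification problem into premise/hypothesis pairs for NLI."""
    return [(text, template.format(label)) for label in labels]

pairs = build_nli_pairs(
    "The new iPhone 16 features an improved camera system.",
    ["technology", "sports"],
)
for premise, hypothesis in pairs:
    print(f"premise:    {premise}")
    print(f"hypothesis: {hypothesis}")
```

The entailment score for each pair becomes that label's classification score, which is why arbitrary label sets work without any retraining.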
Step 2: Define Text and Labels
# Text to classify
text = "The new iPhone 16 features an improved camera system and longer battery life."
# Define candidate labels (no training needed!)
candidate_labels = ["technology", "sports", "politics", "entertainment", "science"]
You provide the text to classify and a list of possible categories. The magic is that you can change these labels to anything - product categories, emotions, topics, urgency levels - without retraining the model. It understands the semantic meaning of your labels from its pre-training.
Step 3: Get Classification Results
# Classify
result = classifier(text, candidate_labels)
print(f"Text: {text}\n")
print("Classification Results:")
for label, score in zip(result['labels'], result['scores']):
    bar = "=" * int(score * 30)
    print(f"  {label:15} | {bar:30} | {score:.1%}")
The model returns labels sorted by confidence score. Our tech-related text about iPhone features correctly gets classified as "technology" with 95.2% confidence. The visual bar chart shows how confident the model is for each category. This is powerful for content moderation, email routing, customer support ticket categorization, and more.
Output:
Text: The new iPhone 16 features an improved camera system and longer battery life.
Classification Results:
technology | ============================== | 95.2%
entertainment | == | 2.1%
science | = | 1.5%
sports | | 0.7%
politics | | 0.5%
Installation note: install the required libraries with pip install transformers torch. The first time you run a pipeline, it will download the model (which may take a few minutes). After that, models are cached locally.
Choosing the Right Pipeline
| Task | Pipeline Name | Use Case |
|---|---|---|
| Sentiment Analysis | "sentiment-analysis" | Review analysis, social media monitoring |
| Named Entity Recognition | "ner" | Extract people, places, organizations |
| Question Answering | "question-answering" | FAQ bots, document search |
| Summarization | "summarization" | News summaries, document condensation |
| Translation | "translation_xx_to_yy" | Language translation |
| Text Generation | "text-generation" | Content creation, story writing |
| Zero-Shot Classification | "zero-shot-classification" | Flexible categorization without training |
Practice Questions: Applications
Apply transformers to real-world problems.
Task: Use the sentiment pipeline to analyze 3 product reviews of your choice.
Show Solution
from transformers import pipeline
sentiment = pipeline("sentiment-analysis")
reviews = [
    "Best purchase I've ever made! Highly recommend.",
    "Product broke after one week. Very disappointed.",
    "It works as expected. Nothing special."
]
for review in reviews:
    result = sentiment(review)[0]
    print(f"{result['label']:8} ({result['score']:.1%}): {review}")
Task: Create a QA system that answers questions about a company based on a context paragraph.
Show Solution
from transformers import pipeline
qa = pipeline("question-answering")
# Company FAQ context
context = """
TechCorp was founded in 2015 by Jane Smith and John Doe.
The company is headquartered in San Francisco and has 500 employees.
TechCorp specializes in artificial intelligence solutions for healthcare.
The company's main product is MedAssist, an AI diagnostic tool.
"""
questions = [
    "Who founded TechCorp?",
    "Where is TechCorp located?",
    "What is the company's main product?"
]
for q in questions:
    answer = qa(question=q, context=context)
    print(f"Q: {q}")
    print(f"A: {answer['answer']} ({answer['score']:.1%})\n")
Task: Use zero-shot classification to route support tickets to the right department.
Show Solution
from transformers import pipeline
classifier = pipeline("zero-shot-classification")
tickets = [
    "I cannot log into my account, password reset not working",
    "When will my order arrive? It's been 2 weeks",
    "I was charged twice for the same item",
    "The product manual is missing installation instructions"
]
departments = ["technical support", "shipping", "billing", "documentation"]
for ticket in tickets:
    result = classifier(ticket, departments)
    top_dept = result['labels'][0]
    confidence = result['scores'][0]
    print(f"Ticket: {ticket[:50]}...")
    print(f"Route to: {top_dept} ({confidence:.1%})\n")
Key Takeaways
Transformers Replaced RNNs
By processing sequences in parallel rather than sequentially, transformers achieve faster training and better capture long-range dependencies in text
Self-Attention is Key
Self-attention allows each token to attend to all other tokens, computing relevance scores that capture semantic relationships regardless of distance
Encoder-Decoder Design
The original transformer uses encoders to understand input and decoders to generate output, with cross-attention connecting them
BERT vs GPT Approaches
BERT reads text bidirectionally for understanding tasks, while GPT generates text autoregressively for generation tasks
Positional Encoding Matters
Since transformers process all tokens simultaneously, positional encodings inject word order information that would otherwise be lost
Hugging Face Simplifies Usage
Pre-trained transformer models are easily accessible through Hugging Face, enabling state-of-the-art NLP with just a few lines of code
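The self-attention mechanism highlighted in the takeaways can be sketched in pure Python on toy 2-dimensional vectors. This is a minimal illustration of scaled dot-product attention, ignoring the learned projection matrices and multiple heads of a real transformer:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of small vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Compare this token's query against every token's key
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        # Softmax turns scores into attention weights that sum to 1
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three toy tokens; using the same vectors for queries, keys, and values
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(x, x, x)
print([[round(v, 2) for v in row] for row in out])
```

Every token attends to every other token in a single step, which is the property that lets transformers capture long-range dependencies without the step-by-step information decay of RNNs.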