Introduction to Transformers
In 2017, a paper titled "Attention Is All You Need" introduced the Transformer architecture, fundamentally changing how we approach sequence processing. Unlike RNNs that process text sequentially, transformers process entire sequences in parallel, enabling unprecedented training efficiency and model capabilities.
What are Transformers?
Transformers are a type of neural network architecture designed for processing sequential data like text. Before transformers, Recurrent Neural Networks (RNNs) and LSTMs were the go-to models for NLP. But they had a critical limitation: they processed words one at a time, making training slow and struggling with long sentences where early words faded from memory.
Transformer
A Transformer is a neural network architecture that processes entire sequences simultaneously using self-attention mechanisms. This parallel processing enables faster training and better capture of long-range dependencies compared to sequential models like RNNs.
Key insight: Instead of processing words one by one, transformers let every word "look at" every other word at once, determining which relationships matter most.
Why Transformers Changed Everything
The transformer architecture solved multiple problems at once and enabled capabilities that were previously impossible. This is why virtually all modern language models are based on transformers.
Parallel Processing
Unlike RNNs that process tokens sequentially, transformers process all tokens simultaneously, enabling massive GPU parallelization.
Long-Range Dependencies
In a 100-word sentence, first and last words can directly attend to each other. RNNs struggle as information degrades over distance.
Scalability
Transformers scale efficiently with more data and parameters, enabling models such as GPT-3 with 175 billion parameters and beyond.
Transfer Learning
Pre-trained transformers can be fine-tuned for specific tasks with minimal data, adapting to your domain with just hundreds of examples.
RNN vs Transformer: A Visual Comparison
| Aspect | RNN/LSTM | Transformer |
|---|---|---|
| Processing | Sequential (one token at a time) | Parallel (all tokens at once) |
| Training Speed | Slow (cannot parallelize) | Fast (fully parallelizable) |
| Long Sequences | Struggles (vanishing gradients) | Handles well (direct attention) |
| Memory | Hidden state (fixed size) | Attention over all tokens |
| Scalability | Limited | Scales to billions of parameters |
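The contrast in this table can be sketched in a few lines of NumPy (a toy illustration, not a real RNN or transformer): the RNN-style loop carries a step-to-step data dependency, while the transformer-style pairwise scores come from a single matrix product that a GPU can parallelize.

```python
import numpy as np

np.random.seed(0)
seq_len, d = 100, 64
tokens = np.random.randn(seq_len, d)

# RNN-style: each step needs the state from the previous step,
# so the 100 steps cannot run in parallel
W = np.random.randn(d, d) * 0.01
state = np.zeros(d)
for t in range(seq_len):
    state = np.tanh(tokens[t] + W @ state)

# Transformer-style: one matrix product relates every token to
# every other token at once - a single parallelizable operation
scores = tokens @ tokens.T
print(scores.shape)  # (100, 100) pairwise relationships
```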
The Paper That Started It All
In June 2017, researchers at Google published "Attention Is All You Need" - a paper that would reshape the entire field of AI. The key innovation was removing recurrence entirely and relying solely on attention mechanisms to capture relationships between words.
# The impact of transformers: A timeline
transformers_timeline = {
    2017: "Attention Is All You Need paper published",
    2018: "BERT released by Google - revolutionizes NLP benchmarks",
    2019: "GPT-2 released by OpenAI - impressive text generation",
    2020: "GPT-3 with 175B parameters - few-shot learning",
    2022: "ChatGPT launches - AI goes mainstream",
    2023: "GPT-4 - multimodal capabilities",
    2024: "Transformers power search, coding, creative tools",
    2025: "Transformers are the backbone of most AI systems",
}

for year, event in transformers_timeline.items():
    print(f"{year}: {event}")
Output:
2017: Attention Is All You Need paper published
2018: BERT released by Google - revolutionizes NLP benchmarks
2019: GPT-2 released by OpenAI - impressive text generation
2020: GPT-3 with 175B parameters - few-shot learning
2022: ChatGPT launches - AI goes mainstream
2023: GPT-4 - multimodal capabilities
2024: Transformers power search, coding, creative tools
2025: Transformers are the backbone of most AI systems
Practice Questions: Introduction
Test your understanding of transformer basics.
Task: Explain in your own words why parallel processing makes transformers faster.
Show Answer
Transformers process all tokens in a sequence simultaneously, while RNNs must process tokens one at a time (sequentially). This means:
- RNN: Token 1, then Token 2, then Token 3... (100 steps for 100 tokens)
- Transformer: All 100 tokens at once (1 parallel step)
GPUs excel at parallel computation, so transformers leverage hardware much better.
Task: Give an example sentence where understanding requires connecting distant words.
Show Answer
Example: "The cat, which had been sitting on the warm windowsill all afternoon watching birds fly by, finally jumped down."
To understand that "jumped" refers to "cat" (not "birds" or "windowsill"), the model must connect words that are 15+ tokens apart. In RNNs, information about "cat" may fade by the time we reach "jumped". Transformers handle this easily because "jumped" can directly attend to "cat".
Task: Name three sequence models that came before transformers and their limitations.
Show Answer
- RNN (Recurrent Neural Network): Sequential processing, vanishing gradients, cannot parallelize
- LSTM (Long Short-Term Memory): Better memory than RNN but still sequential, complex gating mechanisms
- GRU (Gated Recurrent Unit): Simpler than LSTM but same sequential limitations
- Seq2Seq with Attention: Added attention to encoder-decoder but still used RNN backbone
Transformers eliminated the recurrent component entirely, using only attention.
Self-Attention Mechanism
Self-attention is the heart of transformers. It allows each word to look at every other word in the sentence and determine how much attention to pay to each one. This mechanism captures relationships regardless of distance, solving the long-range dependency problem that plagued RNNs.
Understanding Self-Attention
Imagine reading a sentence and trying to understand what each pronoun refers to. When you see "it" in a sentence, your brain automatically looks back to find what "it" refers to. Self-attention does this computationally - for every word, it calculates how much attention to pay to every other word.
Self-Attention
Self-attention is a mechanism where each element in a sequence computes attention scores with every other element, creating a weighted representation that captures contextual relationships. The "self" means the sequence attends to itself.
Intuition: For the word "bank" in "I sat by the river bank", self-attention helps the model attend more to "river" than to "sat", understanding that this is about nature, not finance.
Query, Key, and Value: The Attention Trinity
Self-attention uses three learned projections called Query (Q), Key (K), and Value (V). Think of it like a search engine: the Query is what you are looking for, Keys are labels on documents, and Values are the actual document contents. You match your Query against Keys to find relevant Values.
Query (Q)
"What am I looking for?" - The current token's representation that will be matched against all other tokens to find relevant context.
Key (K)
"What do I contain?" - Each token's identifier that Queries are compared against. High Query-Key similarity means high relevance.
Value (V)
"What information do I provide?" - The actual content that gets weighted and combined based on attention scores.
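The search-engine analogy can be made concrete as a "soft" dictionary lookup (a toy sketch; the entries, labels, and numbers are invented for illustration): instead of returning only the single best match, attention blends all Values by their Query-Key similarity.

```python
import numpy as np

# Keys "label" each entry; Values hold the content; the Query is what we seek
keys   = np.array([[1.0, 0.0],   # an entry labeled "river"
                   [0.0, 1.0]])  # an entry labeled "finance"
values = np.array([[10.0],       # content stored under "river"
                   [20.0]])      # content stored under "finance"
query  = np.array([0.9, 0.1])    # mostly looking for "river"

# Query-Key similarity, turned into weights via softmax
scores  = keys @ query
weights = np.exp(scores) / np.exp(scores).sum()

# A hard lookup would return one value; attention returns a weighted blend
output = weights @ values
print(weights.round(3), output.round(3))
```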
Computing Attention Step by Step
The attention formula is: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
Let us break this down into understandable steps:
Step 1: Set Up the Function
import numpy as np

def self_attention_simple(query, key, value):
    """
    Simplified self-attention computation.

    Args:
        query: What we're looking for (n_tokens x d_k)
        key: What each token offers (n_tokens x d_k)
        value: Information to retrieve (n_tokens x d_v)
    """
    d_k = key.shape[-1]  # Dimension of keys
We define a function that takes three inputs: Query (what we are searching for), Key (what each token contains), and Value (the actual information to retrieve). The variable d_k stores the dimension of our keys, which we will use later for scaling.
Step 2: Compute Raw Attention Scores
    # Step 1: Compute attention scores (Query @ Key^T)
    # This tells us how much each token relates to each other token
    scores = np.dot(query, key.T)
    print("Step 1 - Raw attention scores:")
    print(scores)
We multiply the Query matrix by the transpose of the Key matrix. This produces a score matrix where each entry (i, j) represents how much token i should attend to token j. Higher scores mean stronger relationships. This is the core of the attention mechanism - finding which tokens are relevant to each other.
Step 3: Scale the Scores
    # Step 2: Scale by sqrt(d_k) to prevent large values
    # Large values would make softmax too peaked (one-hot like)
    scaled_scores = scores / np.sqrt(d_k)
    print(f"\nStep 2 - Scaled scores (divided by sqrt({d_k})):")
    print(scaled_scores)
We divide the scores by the square root of the key dimension. Why? When dimensions are large, dot products can become very large numbers. Large values push the softmax function to produce extreme outputs (close to 0 or 1), which causes vanishing gradients during training. Scaling keeps the values in a reasonable range for stable learning.
Step 4: Apply Softmax for Weights
    # Step 3: Apply softmax to get attention weights (sum to 1)
    attention_weights = np.exp(scaled_scores) / np.sum(np.exp(scaled_scores), axis=-1, keepdims=True)
    print("\nStep 3 - Attention weights (softmax):")
    print(attention_weights)
The softmax function converts our scaled scores into probabilities that sum to 1. Each row now represents a probability distribution over all tokens. For example, if the first row is [0.6, 0.3, 0.1], it means the first token pays 60% attention to itself, 30% to the second token, and 10% to the third.
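One practical note: exponentiating raw scores directly can overflow for large values. Softmax is unchanged by subtracting a constant from every score, so real implementations first subtract the row maximum. A small sketch of this standard trick:

```python
import numpy as np

def stable_softmax(x, axis=-1):
    # Softmax is invariant to adding a constant to every score,
    # so subtracting the max shifts all exponents into a safe range
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

big_scores = np.array([[1000.0, 1001.0, 1002.0]])  # naive exp() would overflow
print(stable_softmax(big_scores))  # finite weights that still sum to 1
```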
Step 5: Compute Weighted Sum of Values
    # Step 4: Multiply weights by values to get output
    output = np.dot(attention_weights, value)
    print("\nStep 4 - Final output (weighted sum of values):")
    print(output)
    return output, attention_weights
Finally, we multiply the attention weights by the Value matrix. This produces a weighted sum of all token values, where tokens with higher attention weights contribute more. The result is a new representation for each token that incorporates context from all other relevant tokens.
Step 6: Run an Example
# Example: 3 tokens with dimension 4
np.random.seed(42)
tokens = np.random.randn(3, 4) # 3 tokens, 4 dimensions each
# In practice, Q, K, V come from learned linear projections
# Here we use the same tokens for simplicity
output, weights = self_attention_simple(tokens, tokens, tokens)
We create a simple example with 3 tokens, each represented by 4 dimensions. In real transformers, Q, K, and V are computed by multiplying the input by learned weight matrices. Here, we use the same tokens for all three to keep it simple. The function returns both the contextualized output and the attention weights for visualization.
Output:
Step 1 - Raw attention scores:
[[ 3.    2.11 -1.32]
 [ 2.11  3.19 -1.11]
 [-1.32 -1.11  0.95]]
Step 2 - Scaled scores (divided by sqrt(4)):
[[ 1.5   1.05 -0.66]
 [ 1.05  1.6  -0.55]
 [-0.66 -0.55  0.47]]
Step 3 - Attention weights (softmax):
[[0.57 0.36 0.07]
 [0.34 0.59 0.07]
 [0.19 0.21 0.6 ]]
Step 4 - Final output (weighted sum of values):
[[ 0.17 -0.13  0.91  1.12]
 [-0.00 -0.15  1.12  0.94]
 [-0.23  0.25  0.19  0.18]]
Looking at the attention weights (shown rounded to two decimals, so each row sums to 1), token 1 attends mostly to itself (0.57) but also strongly to token 2 (0.36), token 2 focuses on itself (0.59), and token 3 attends strongly to itself (0.60). This makes sense - each token finds itself most relevant. The final output shows how each token's representation has been updated by incorporating information from all tokens according to these weights.
Multi-Head Attention
A single attention head might focus on one type of relationship (like syntax). Multi-head attention runs multiple attention operations in parallel, allowing the model to capture different types of relationships simultaneously - one head for syntax, another for semantics, another for coreference.
Step 1: Define the Class and Initialize Parameters
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.d_k = d_model // num_heads  # Dimension per head

        # Linear projections for Q, K, V (and output)
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
We create a PyTorch module for multi-head attention. The key insight is that we split the model dimension (512) across multiple heads (8), so each head works with 64 dimensions. We define four learnable linear layers: three to project inputs into Query, Key, and Value spaces, and one to project the final concatenated output back to the model dimension.
Step 2: Project Input to Q, K, V
    def forward(self, x):
        batch_size, seq_len, _ = x.shape

        # Project to Q, K, V
        Q = self.W_q(x)  # (batch, seq_len, d_model)
        K = self.W_k(x)
        V = self.W_v(x)
In the forward pass, we first get the input dimensions. Then we pass the input through three separate linear projections to create Query, Key, and Value matrices. Each projection learns different aspects of the input - the Query learns what to look for, the Key learns what each position offers, and the Value learns what information to provide.
Step 3: Reshape for Multiple Heads
        # Reshape for multi-head: (batch, num_heads, seq_len, d_k)
        Q = Q.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
Here is where the "multi-head" magic happens. We reshape each matrix to split the 512 dimensions into 8 heads of 64 dimensions each. The view operation reshapes the tensor, and transpose rearranges the dimensions so that each head's computation can be done in parallel. After this, each head has its own 64-dimensional subspace to work with.
Step 4: Compute Attention for All Heads
        # Compute attention for all heads in parallel
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
        attention = torch.softmax(scores, dim=-1)
        context = torch.matmul(attention, V)
This is the same attention computation we saw earlier, but now happening for all 8 heads simultaneously thanks to tensor operations. We compute scores by multiplying Q and K, scale by the square root of dimension, apply softmax to get weights, and multiply by V to get context. Each head learns to attend to different types of relationships in the data.
Step 5: Concatenate Heads and Project Output
        # Concatenate heads and project
        context = context.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
        output = self.W_o(context)
        return output
After each head computes its attention independently, we need to combine them. We transpose back to the original dimension order, then reshape to concatenate all heads' outputs (8 heads x 64 dimensions = 512 dimensions). Finally, we apply one more linear projection to mix information across heads, allowing the model to combine insights from different attention patterns.
Step 6: Test with Example Input
# Example usage
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512) # batch=2, seq_len=10, d_model=512
output = mha(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Each head processes: {512 // 8} dimensions")
We create a multi-head attention module and test it with a batch of 2 sequences, each containing 10 tokens represented by 512 dimensions. The output has the same shape as input - this is important because it allows us to stack multiple attention layers. Each of the 8 heads processes only 64 dimensions, but together they cover all 512 dimensions from different perspectives.
Output:
Input shape: torch.Size([2, 10, 512])
Output shape: torch.Size([2, 10, 512])
Each head processes: 64 dimensions
Visualizing Attention Patterns
Attention weights reveal what the model "pays attention to". Different heads learn different patterns - some attend to adjacent words (local), others to sentence structure (global), and some to specific linguistic relationships like subject-verb agreement.
| Attention Head | What It Might Learn | Example Pattern |
|---|---|---|
| Head 1 | Positional (nearby words) | "the big" - "the" attends to "big" |
| Head 2 | Syntactic (grammar structure) | Subject attends to verb |
| Head 3 | Coreference (pronouns) | "she" attends to "Maria" |
| Head 4 | Semantic similarity | "happy" attends to "joyful" |
Practice Questions: Self-Attention
Deepen your understanding of attention mechanisms.
Task: Explain why attention weights for each token must sum to 1.
Show Answer
Attention weights sum to 1 because softmax is applied, which normalizes scores into a probability distribution. This means:
- Weights represent "how much attention" as percentages
- The output is a weighted average of values
- If one weight is high, others must be lower (competition)
Example: If "cat" gets 0.6 attention, other words share the remaining 0.4.
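This property is easy to check numerically; the sketch below (arbitrary random scores, seeded for reproducibility) confirms that every row of softmaxed scores sums to 1, and that raising one score drains weight from the others in its row.

```python
import numpy as np

np.random.seed(0)
scores = np.random.randn(3, 3)  # arbitrary raw attention scores

weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
print(weights.sum(axis=-1))  # every row sums to 1

# Raising one score lowers the other weights in that row (competition)
scores[0, 0] += 5.0
weights2 = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
print(weights2[0].round(3))
```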
Task: Given Q=[1, 0], K=[[1, 0], [0, 1]], V=[[5], [10]], calculate the attention output.
Show Solution
import numpy as np
Q = np.array([[1, 0]])
K = np.array([[1, 0], [0, 1]])
V = np.array([[5], [10]])
# Score = Q @ K^T = [[1*1 + 0*0, 1*0 + 0*1]] = [[1, 0]]
scores = Q @ K.T
print(f"Scores: {scores}")
# Scale (d_k = 2)
scaled = scores / np.sqrt(2)
print(f"Scaled: {scaled}")
# Softmax
weights = np.exp(scaled) / np.sum(np.exp(scaled))
print(f"Weights: {weights}")
# Output = weights @ V
output = weights @ V
print(f"Output: {output}") # Closer to 5 than 10 (attends more to first token)
Task: Explain the benefits of multi-head attention over single-head attention.
Show Answer
Multi-head attention provides several benefits:
- Different relationship types: Each head can specialize in capturing different patterns (syntax, semantics, position)
- Richer representations: Combining multiple perspectives creates more expressive features
- Stability: Multiple heads provide redundancy - if one head fails, others compensate
- No additional cost: Total computation is similar (heads split the dimensions)
Analogy: Like having multiple experts review a document, each focusing on different aspects.
Transformer Architecture
The transformer architecture consists of an encoder stack and a decoder stack, each made up of identical layers. The encoder processes the input sequence, while the decoder generates the output sequence. Together, they form a powerful sequence-to-sequence model.
The Big Picture
The original transformer was designed for machine translation (e.g., English to French). The encoder reads the entire input sentence and builds a rich representation. The decoder then uses this representation to generate the output sentence one word at a time.
Encoder-Decoder Structure
The transformer has two main parts: the Encoder (processes input) and the Decoder (generates output). Each consists of stacked identical layers (typically 6). The encoder sees all tokens at once; the decoder generates tokens sequentially but uses attention to look at all previously generated tokens.
Key difference from RNN: Even though the decoder generates tokens one by one, it uses attention (not recurrence) to access context, making it parallelizable during training.
Encoder: Understanding the Input
Each encoder layer has two main components: multi-head self-attention and a feed-forward network. Both have residual connections and layer normalization. The encoder processes the input and creates contextualized representations that the decoder can attend to.
Step 1: Define the Encoder Layer Class
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        # Multi-head self-attention
        self.self_attention = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
We create an encoder layer class with key hyperparameters: d_model is the dimension of each token (512), num_heads is the number of attention heads (8), d_ff is the feed-forward hidden dimension (2048, typically 4x d_model), and dropout prevents overfitting. The first component is multi-head self-attention with layer normalization.
Step 2: Add the Feed-Forward Network
        # Feed-forward network
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
The feed-forward network is a two-layer MLP applied to each position independently. It first expands the dimensions from 512 to 2048 (allowing more expressive transformations), applies ReLU activation, then projects back to 512. This "expand-contract" pattern with non-linearity helps the model learn complex patterns. We add a second layer normalization and dropout for regularization.
Step 3: Define the Forward Pass with Residual Connections
    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output, _ = self.self_attention(x, x, x, key_padding_mask=mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed-forward with residual connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + ff_output)
        return x
The forward pass follows a pattern: sublayer output + residual connection + layer normalization. First, we apply self-attention where each token attends to all other tokens. The residual connection (x + attn_output) adds the original input back, helping gradients flow during training. Then we do the same for the feed-forward network. This "Add & Norm" pattern is crucial for training deep transformers.
Step 4: Stack Multiple Encoder Layers
# Stack 6 encoder layers
class Encoder(nn.Module):
    def __init__(self, num_layers=6, d_model=512, num_heads=8):
        super().__init__()
        self.layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads) for _ in range(num_layers)
        ])

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask)
        return x
The full encoder stacks 6 identical encoder layers. We use nn.ModuleList to properly register the layers with PyTorch. In the forward pass, the input flows through each layer sequentially, with each layer refining the representations further. By the end, each token's representation incorporates rich contextual information from the entire sequence.
Step 5: Check Parameter Count
encoder = Encoder()
print(f"Encoder has {sum(p.numel() for p in encoder.parameters()):,} parameters")
We instantiate the encoder and count its learnable parameters. With 6 layers, each containing attention projections and feed-forward networks, the total comes to about 19 million parameters. This is just the encoder - the full transformer (encoder + decoder) has roughly double this. Modern large language models have billions of parameters, but the architecture remains similar.
Output:
Encoder has 18,914,304 parameters
Decoder: Generating the Output
The decoder is similar to the encoder but adds a crucial third component: cross-attention. This allows each decoder position to attend to all encoder outputs, connecting the generated output to the input context. The decoder also uses masked self-attention to prevent "peeking" at future tokens.
Masked Self-Attention
Prevents the decoder from seeing future tokens during training. Position i can only attend to positions 0 through i (itself and earlier tokens), ensuring autoregressive generation.
Cross-Attention
Connects decoder to encoder outputs. The decoder's Query attends to encoder's Keys and Values, pulling relevant input information for each generated token.
Feed-Forward Network
Same as encoder: a two-layer MLP that processes each position independently. Adds non-linearity and increases model capacity.
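The causal mask behind masked self-attention can be sketched as an upper-triangular matrix of -inf values added to the scores before softmax (a minimal illustration; real implementations vary in how the mask is built and applied):

```python
import torch

seq_len = 5
# -inf above the diagonal: position i may attend to positions 0..i only
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

scores = torch.randn(seq_len, seq_len)          # raw attention scores
weights = torch.softmax(scores + mask, dim=-1)  # masked positions get weight 0

# Row 0 attends only to token 0; row 4 may attend to all five tokens
print(weights)
```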
Positional Encoding: Injecting Word Order
Since transformers process all tokens in parallel, they have no inherent notion of word order. "The cat sat on the mat" and "mat the on sat cat the" would look identical! Positional encoding adds information about each token's position in the sequence.
Step 1: Define the Encoding Function
import numpy as np

def positional_encoding(max_len, d_model):
    """
    Generate positional encodings using sine and cosine functions.

    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len)[:, np.newaxis]
We create a function that generates positional encodings for any sequence length up to max_len. The formula uses sine and cosine functions at different frequencies. We initialize an empty matrix to hold the encodings (max_len x d_model) and create a column vector of positions (0, 1, 2, ..., max_len-1) for broadcasting in later calculations.
Step 2: Compute the Frequency Term
    # Compute the division term
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
This computes the frequency for each dimension. Lower dimensions get higher frequencies (faster oscillation), while higher dimensions get lower frequencies (slower oscillation). The base of 10000 was chosen empirically - it creates a good spread of frequencies that allows the model to learn both fine-grained and coarse-grained positional information.
Step 3: Apply Sine and Cosine
    # Apply sine to even indices, cosine to odd indices
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe
We apply sine to even dimensions (0, 2, 4, ...) and cosine to odd dimensions (1, 3, 5, ...). Using both sine and cosine together is important - it means for any fixed offset k, the encoding for position (pos+k) can be expressed as a linear function of the encoding at position pos. This helps the model learn relative positions, not just absolute ones.
Step 4: Generate and Inspect the Encodings
# Generate positional encoding
pe = positional_encoding(max_len=100, d_model=512)
print("Positional Encoding Shape:", pe.shape)
print("\nFirst 5 positions, first 8 dimensions:")
print(pe[:5, :8].round(3))
We generate encodings for 100 positions with 512 dimensions each. Looking at the first 5 positions and 8 dimensions, you can see the patterns: position 0 starts at [0, 1, 0, 1, ...] (sine=0, cosine=1 at the start). Each subsequent position shifts these values based on the frequency, creating unique fingerprints for each position.
Step 5: Compare Different Positions
# The encoding for position 0
print("\nPosition 0 (first 10 values):", pe[0, :10].round(3))
# The encoding for position 50 (very different!)
print("Position 50 (first 10 values):", pe[50, :10].round(3))
By comparing position 0 and position 50, we see they have completely different patterns. Position 0 has the "starting" pattern (sin=0, cos=1), while position 50 shows how the waves have evolved. The model learns to decode these patterns to understand where each token sits in the sequence, enabling it to handle word order appropriately.
Output:
Positional Encoding Shape: (100, 512)
First 5 positions, first 8 dimensions:
[[ 0.     1.     0.     1.     0.     1.     0.     1.   ]
 [ 0.841  0.54   0.822  0.57   0.802  0.598  0.782  0.623]
 [ 0.909 -0.416  0.936 -0.351  0.958 -0.286  0.975 -0.223]
 [ 0.141 -0.99   0.245 -0.97   0.343 -0.939  0.434 -0.901]
 [-0.757 -0.654 -0.657 -0.754 -0.549 -0.836 -0.434 -0.901]]
Position 0 (first 10 values): [0. 1. 0. 1. 0. 1. 0. 1. 0. 1.]
Position 50 (first 10 values): [-0.262  0.965 -0.895 -0.445  0.561 -0.828  0.785  0.62  -0.632  0.775]
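The relative-position property described earlier (the encoding at position pos+k being a linear function of the encoding at position pos) can be verified for a single sine/cosine pair; the frequency f and offset k below are arbitrary choices for the check.

```python
import numpy as np

f, k = 0.05, 7
# Shifting the position by k is a fixed rotation, independent of pos
rotation = np.array([[np.cos(k * f),  np.sin(k * f)],
                     [-np.sin(k * f), np.cos(k * f)]])

for pos in [0, 10, 50]:
    pe_pos   = np.array([np.sin(pos * f), np.cos(pos * f)])
    pe_shift = np.array([np.sin((pos + k) * f), np.cos((pos + k) * f)])
    # The same matrix maps PE(pos) to PE(pos + k) for every pos
    assert np.allclose(rotation @ pe_pos, pe_shift)

print("PE(pos + k) is a fixed linear function of PE(pos)")
```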
Component Summary
| Component | Location | Purpose |
|---|---|---|
| Self-Attention | Encoder and Decoder | Capture relationships within a sequence |
| Cross-Attention | Decoder only | Connect decoder to encoder outputs |
| Feed-Forward | Encoder and Decoder | Add non-linearity, process each position |
| Positional Encoding | Input to both | Inject word order information |
| Layer Normalization | After each sublayer | Stabilize training, normalize activations |
| Residual Connections | Around each sublayer | Enable gradient flow, easier optimization |
Practice Questions: Architecture
Test your knowledge of transformer components.
Task: Explain why the decoder needs to mask future tokens.
Show Answer
The causal (look-ahead) mask prevents the decoder from "cheating" during training by:
- Ensuring position i can only see positions 0 through i (itself and earlier tokens), never later ones
- Simulating autoregressive generation during training
- Forcing the model to learn to predict the next token without seeing it
Without this mask, the model could just copy the answer instead of learning to generate.
Task: Explain the difference between self-attention and cross-attention in terms of Q, K, V sources.
Show Answer
Self-Attention (Encoder):
- Q, K, V all come from the same source (encoder input)
- Each token attends to all other tokens in the same sequence
Cross-Attention (Decoder):
- Q comes from decoder (what we are generating)
- K, V come from encoder output (the input context)
- Allows decoder to "look at" the input while generating
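This difference can be sketched with PyTorch's nn.MultiheadAttention, which takes query, key, and value as separate arguments (the dimensions below are illustrative, and one module is reused for both calls just to show the shapes):

```python
import torch
import torch.nn as nn

d_model = 64
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

encoder_out = torch.randn(1, 12, d_model)  # 12 input tokens from the encoder
decoder_x   = torch.randn(1, 5, d_model)   # 5 tokens generated so far

# Self-attention: Q, K, V all come from the same sequence
self_out, _ = attn(decoder_x, decoder_x, decoder_x)

# Cross-attention: Q from the decoder, K and V from the encoder output
cross_out, _ = attn(decoder_x, encoder_out, encoder_out)

# Both outputs take their sequence length from the Query
print(self_out.shape, cross_out.shape)
```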
Task: Explain the benefits of residual connections in deep transformer networks.
Show Answer
Residual connections (x + sublayer(x)) provide several benefits:
- Gradient flow: Gradients can flow directly through the skip connection, preventing vanishing gradients in deep networks
- Easier optimization: The network only needs to learn the "residual" (difference), which is often easier
- Identity mapping: If a layer is not useful, it can learn to output zero, effectively skipping that layer
- Depth without degradation: Enables training very deep models (50+ layers) that would otherwise degrade
BERT and GPT
BERT and GPT represent two different approaches to using transformers. BERT uses bidirectional encoding to understand context from both directions, excelling at understanding tasks. GPT uses autoregressive decoding to generate text one token at a time, excelling at generation tasks.
BERT: Bidirectional Encoder Representations from Transformers
BERT, released by Google in 2018, uses only the encoder part of the transformer. Its key innovation is bidirectional training - it sees context from both left and right simultaneously. This makes BERT excellent at understanding tasks like classification, question answering, and named entity recognition.
BERT
BERT is an encoder-only transformer pre-trained on two tasks: Masked Language Modeling (predicting masked words) and Next Sentence Prediction. It reads text bidirectionally, meaning each token can attend to tokens on both its left and right.
Pre-training objective: "The cat [MASK] on the mat" - BERT learns to predict that [MASK] should be "sat" by looking at both "cat" (left) and "on the mat" (right).
from transformers import BertTokenizer, BertForMaskedLM
import torch
We import the necessary modules from the Hugging Face Transformers library. BertTokenizer handles converting text to tokens and back, while BertForMaskedLM is the BERT model configured for masked language modeling (predicting hidden words).
# Load pre-trained BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
We load the pre-trained BERT base model (uncased means it treats "Hello" and "hello" the same). This model was trained on millions of documents and already understands language. The from_pretrained method downloads the model weights from Hugging Face's servers.
# Masked language modeling example
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors='pt')
We create a sentence with a [MASK] token where we want BERT to predict the missing word. The tokenizer converts this text into numerical IDs that the model understands. The return_tensors='pt' argument returns PyTorch tensors ready for the model.
# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits
We run the model in inference mode (torch.no_grad() disables gradient computation for efficiency). The model outputs logits (raw scores before softmax) for every position in the sequence. Each position has scores for all 30,000+ vocabulary words.
# Find the masked token position
mask_position = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero()[0, 1]
# Get top 5 predictions for the masked position
top_5_tokens = torch.topk(predictions[0, mask_position], 5).indices
We locate where the [MASK] token is in our input sequence. Then we use torch.topk
to find the 5 highest-scoring word predictions for that position. These are the words BERT
thinks are most likely to fill in the blank.
print(f"Input: {text}")
print("Top 5 predictions for [MASK]:")
for i, token_id in enumerate(top_5_tokens):
    token = tokenizer.decode([token_id])
    prob = torch.softmax(predictions[0, mask_position], dim=0)[token_id]
    print(f"  {i+1}. {token} ({prob:.1%})")
Finally, we display the results. We convert token IDs back to readable words using the tokenizer's decode method. We also apply softmax to convert raw scores into probabilities, showing how confident BERT is about each prediction. "Paris" wins with 92.3% confidence!
Output:
Input: The capital of France is [MASK].
Top 5 predictions for [MASK]:
1. paris (92.3%)
2. lyon (1.8%)
3. marseille (0.9%)
4. france (0.7%)
5. toulouse (0.5%)
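The softmax step that turns BERT's raw logits into the probabilities shown above can be sketched with the standard library alone. The logit values here are toy numbers for illustration, not real BERT scores:

```python
import math

def softmax(scores):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtract the max before exponentiating for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for three candidate words at the [MASK] position
logits = [5.0, 2.0, 1.0]
probs = softmax(logits)
print([round(p, 3) for p in probs])  # → [0.936, 0.047, 0.017]
```

Note how a modest gap in logits (5.0 vs 2.0) becomes a large gap in probability, which is why a single prediction like "paris" can dominate the top-5 list.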
GPT: Generative Pre-trained Transformer
GPT, developed by OpenAI, uses only the decoder part of the transformer. It is trained to predict the next token given all previous tokens (autoregressive). This makes GPT excellent at text generation - from completing sentences to writing essays, code, and even poetry.
GPT
GPT is a decoder-only transformer pre-trained on next-token prediction. It reads text left-to-right, with each token only able to attend to previous tokens. This autoregressive nature makes it ideal for text generation tasks.
Pre-training objective: Given "The cat sat on the", predict the next word is "mat". GPT learns patterns from billions of text examples to generate coherent continuations.
from transformers import GPT2Tokenizer, GPT2LMHeadModel
We import GPT-2 components from Hugging Face. GPT2Tokenizer converts text to tokens
using Byte Pair Encoding (BPE), while GPT2LMHeadModel is the GPT-2 model with a
language modeling head for text generation.
# Load pre-trained GPT-2
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
We load the smallest GPT-2 model (124 million parameters). Larger versions include gpt2-medium (355M), gpt2-large (774M), and gpt2-xl (1.5B). Each was trained on 40GB of internet text to learn patterns of natural language.
# Text generation example
prompt = "Artificial intelligence will"
inputs = tokenizer(prompt, return_tensors='pt')
We provide a prompt - the starting text that GPT will continue. The tokenizer converts it to token IDs. Unlike BERT, there is no [MASK] token; GPT simply continues from where the text ends.
# Generate text
outputs = model.generate(
    inputs['input_ids'],
    max_length=50,
    num_return_sequences=1,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
The generate method handles the autoregressive loop. Key parameters:
max_length limits total tokens, temperature controls randomness (lower =
more deterministic, higher = more creative), do_sample=True enables sampling instead
of always picking the most likely word. The model generates one token at a time, feeding each
new token back as input until done.
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Prompt: {prompt}")
print(f"Generated: {generated_text}")
We decode the generated token IDs back to human-readable text. The skip_special_tokens=True
removes technical tokens like end-of-sequence markers. The result is a coherent continuation of
our prompt, demonstrating GPT's ability to generate fluent, contextually relevant text.
Output:
Prompt: Artificial intelligence will
Generated: Artificial intelligence will transform how we work, learn, and interact with technology.
From healthcare diagnostics to autonomous vehicles, AI systems are becoming increasingly capable
of tasks that once required human intelligence.
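The autoregressive loop that generate performs can be sketched by hand. This is a toy illustration, not GPT-2: instead of a real transformer, it uses a hand-made bigram "model" that scores candidate next tokens given only the last token, and greedy decoding that always picks the top score:

```python
# Toy bigram "model": given the last token, score every candidate next token.
bigram_scores = {
    "artificial":   {"intelligence": 0.9, "sweetener": 0.1},
    "intelligence": {"will": 0.7, "is": 0.3},
    "will":         {"transform": 0.6, "fail": 0.4},
    "transform":    {"society": 0.8, "<eos>": 0.2},
    "society":      {"<eos>": 1.0},
}

def generate_greedy(prompt_tokens, max_length=10):
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        # The model only sees tokens generated so far (left-to-right)
        scores = bigram_scores.get(tokens[-1], {"<eos>": 1.0})
        # Greedy decoding: always pick the highest-scoring next token
        next_token = max(scores, key=scores.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # feed the new token back as input
    return tokens

print(generate_greedy(["artificial"]))
# → ['artificial', 'intelligence', 'will', 'transform', 'society']
```

Real GPT models do exactly this loop, except the next-token scores come from a transformer instead of a lookup table, and do_sample=True samples from the distribution rather than always taking the maximum.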
BERT vs GPT: Head-to-Head Comparison
BERT (Bidirectional)
Architecture: Encoder-only
Direction: Sees both left and right context
Best for: Understanding (classification, QA, NER)
Training: Masked language modeling
GPT (Autoregressive)
Architecture: Decoder-only
Direction: Left-to-right only
Best for: Generation (text, code, dialogue)
Training: Next-token prediction
| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder-only | Decoder-only |
| Context | Bidirectional (full sentence) | Unidirectional (left-to-right) |
| Pre-training | Masked LM + Next Sentence Prediction | Next Token Prediction |
| Use Cases | Classification, QA, NER, Similarity | Text Generation, Chatbots, Code |
| Fine-tuning | Add classification head | Prompt engineering or fine-tune |
| Famous Versions | BERT, RoBERTa, ALBERT, DistilBERT | GPT-2, GPT-3, GPT-4, ChatGPT |
The Evolution of Transformer Models
# Model sizes over time (approximate parameter counts)
model_evolution = {
    "BERT-base (2018)": "110M parameters",
    "BERT-large (2018)": "340M parameters",
    "GPT-2 (2019)": "1.5B parameters",
    "T5-11B (2020)": "11B parameters",
    "GPT-3 (2020)": "175B parameters",
    "PaLM (2022)": "540B parameters",
    "GPT-4 (2023)": "~1.7T parameters (estimated)",
    "Gemini Ultra (2024)": "Multi-trillion (estimated)"
}
print("Transformer Model Evolution:")
print("-" * 50)
for model, params in model_evolution.items():
    print(f"{model:25} | {params}")
# Growth factor
print(f"\nGrowth: BERT to GPT-4 = ~15,000x increase in parameters!")
Output:
Transformer Model Evolution:
--------------------------------------------------
BERT-base (2018) | 110M parameters
BERT-large (2018) | 340M parameters
GPT-2 (2019) | 1.5B parameters
T5-11B (2020) | 11B parameters
GPT-3 (2020) | 175B parameters
PaLM (2022) | 540B parameters
GPT-4 (2023) | ~1.7T parameters (estimated)
Gemini Ultra (2024) | Multi-trillion (estimated)
Growth: BERT to GPT-4 = ~15,000x increase in parameters!
Practice Questions: BERT and GPT
Compare and contrast these transformer giants.
Task: Explain the architectural reason BERT is not suited for text generation.
Show Answer
BERT cannot generate text effectively because:
- It was trained with bidirectional attention (sees all words at once)
- During generation, you do not have future words to condition on
- It was trained for masked prediction (fill in blanks), not sequential generation
- The encoder architecture has no autoregressive mechanism
GPT's left-to-right attention naturally supports generating one token at a time.
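The architectural difference in the answer above can be visualized with toy attention masks, where mask[i][j] == 1 means token i is allowed to attend to token j:

```python
n = 4  # toy sequence length

# BERT-style bidirectional mask: every token sees every token
bert_mask = [[1] * n for _ in range(n)]

# GPT-style causal mask: token i sees only positions j <= i
gpt_mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in gpt_mask:
    print(row)
# → [1, 0, 0, 0]
#   [1, 1, 0, 0]
#   [1, 1, 1, 0]
#   [1, 1, 1, 1]
```

The lower-triangular GPT mask is what makes generation possible: position i never depends on future positions, so the model can emit one token at a time. BERT's all-ones mask bakes future context into every representation, which is exactly what generation cannot provide.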
Task: Write code to classify sentiment using a pre-trained BERT model.
Show Solution
from transformers import pipeline
# Load pre-trained sentiment classifier (fine-tuned BERT)
classifier = pipeline("sentiment-analysis")
# Test sentences
sentences = [
    "I absolutely love this product!",
    "This was a complete waste of money.",
    "It's okay, nothing special."
]
print("Sentiment Analysis Results:")
for sentence in sentences:
    result = classifier(sentence)[0]
    print(f"{result['label']:8} ({result['score']:.1%}): {sentence}")
Task: Research and explain the training differences between GPT-3 and ChatGPT.
Show Answer
ChatGPT added two key training stages beyond GPT-3:
- Supervised Fine-Tuning (SFT): Human trainers provided example conversations, teaching the model how to be helpful
- Reinforcement Learning from Human Feedback (RLHF):
- Humans ranked multiple model outputs
- A reward model learned human preferences
- The model was fine-tuned to maximize the reward
Result: ChatGPT follows instructions better, refuses harmful requests, and produces more helpful responses.
Practical Applications
Transformers power the most impressive AI applications today, from ChatGPT and Google Search to code completion and image generation. Learning to use pre-trained transformer models through libraries like Hugging Face opens up powerful capabilities with minimal code.
Getting Started with Hugging Face
Hugging Face has become the go-to platform for working with transformers. Their library provides easy access to thousands of pre-trained models and simple APIs for common tasks. You can accomplish state-of-the-art NLP with just a few lines of code.
Step 1: Import the Pipeline API
# Install the transformers library
# pip install transformers torch
from transformers import pipeline
The pipeline API is Hugging Face's simplest interface for using transformers. It
automatically handles loading the right model, tokenizer, and post-processing for each task.
This abstraction lets you focus on your application rather than implementation details.
Step 2: Sentiment Analysis in One Line
# 1. Sentiment Analysis
sentiment = pipeline("sentiment-analysis")
result = sentiment("I love learning about transformers!")
print(f"Sentiment: {result[0]['label']} ({result[0]['score']:.1%})")
Just specify "sentiment-analysis" and the pipeline loads a fine-tuned DistilBERT model.
Pass any text and it returns whether the sentiment is POSITIVE or NEGATIVE along with a confidence
score. This model was trained on movie reviews but generalizes well to most English text.
Step 3: Named Entity Recognition
# 2. Named Entity Recognition
ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Elon Musk founded SpaceX in California.")
print(f"\nEntities found:")
for ent in entities:
    print(f"  {ent['word']:15} -> {ent['entity_group']:10} ({ent['score']:.1%})")
Named Entity Recognition (NER) identifies people (PER), organizations (ORG), locations (LOC),
and other entities in text. The aggregation_strategy="simple" combines multi-token
entities (like "Elon Musk" which is two tokens) into single entries. The model returns each
entity's text, type, and confidence score.
Step 4: Question Answering
# 3. Question Answering
qa = pipeline("question-answering")
context = "Transformers were introduced in 2017 by Google researchers in the paper Attention Is All You Need."
question = "When were transformers introduced?"
answer = qa(question=question, context=context)
print(f"\nQuestion: {question}")
print(f"Answer: {answer['answer']} (confidence: {answer['score']:.1%})")
Extractive question answering finds the answer within a given context. You provide both a question and a passage of text (context). The model identifies the span of text that best answers the question. It does not generate new text - it extracts directly from the context, which makes answers more reliable and verifiable.
Output:
Sentiment: POSITIVE (99.8%)
Entities found:
Elon Musk -> PER (99.2%)
SpaceX -> ORG (98.7%)
California -> LOC (99.5%)
Question: When were transformers introduced?
Answer: 2017 (confidence: 97.3%)
Real-World Applications of Transformers
Conversational AI
ChatGPT, Claude, and Gemini use transformer models to hold natural conversations, answer questions, and assist with tasks.
Code Completion
GitHub Copilot and similar tools use transformers to suggest code, complete functions, and even write entire programs from descriptions.
Search Engines
Google, Bing, and other search engines use BERT and similar models to understand search queries and rank results.
Translation
Google Translate and DeepL use transformer-based models for high-quality translation between hundreds of languages.
Content Generation
From marketing copy to email drafts, transformers help generate written content for businesses and individuals.
Image Generation
DALL-E, Midjourney, and Stable Diffusion use transformer components to generate images from text descriptions.
Text Summarization
Transformers excel at summarization - condensing long documents into concise summaries while preserving key information. This is invaluable for processing large amounts of text quickly.
Step 1: Load the Summarization Model
from transformers import pipeline
# Load summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
We load BART (Bidirectional and Auto-Regressive Transformers), a model specifically fine-tuned on the CNN/Daily Mail dataset for news summarization. BART combines BERT's bidirectional encoder with GPT's autoregressive decoder, making it excellent at understanding input and generating fluent output.
Step 2: Provide the Text to Summarize
# Long text to summarize
article = """
Transformers have revolutionized natural language processing since their introduction in 2017.
The architecture, introduced in the paper "Attention Is All You Need," replaced recurrent
neural networks with self-attention mechanisms. This change enabled parallel processing of
sequences, dramatically improving training efficiency.
The impact has been enormous. BERT, released by Google in 2018, achieved state-of-the-art
results on numerous NLP benchmarks. OpenAI's GPT series demonstrated impressive text
generation capabilities, culminating in ChatGPT which brought AI assistants to the mainstream.
Today, transformers power search engines, translation services, code assistants, and
conversational AI. The architecture has even expanded beyond NLP into computer vision
(Vision Transformers) and audio processing. Researchers continue to improve efficiency
and capabilities, making transformers increasingly accessible for diverse applications.
"""
Our example article is 143 words about transformers. In real applications, you might summarize news articles, research papers, emails, or any lengthy document. The model can handle texts up to about 1024 tokens (roughly 750-1000 words depending on vocabulary).
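For documents longer than the model's limit, a common workaround is to split the text into chunks and summarize each chunk separately. A minimal word-based sketch is below; note that real models count subword tokens, not words, so the max_words parameter here is only a rough proxy for the 1024-token limit:

```python
def chunk_text(text, max_words=200):
    """Split a long document into roughly equal word-count chunks.

    Real tokenizers count subword tokens, so word count is only
    an approximation of the model's actual context limit.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

doc = "word " * 450  # a 450-word stand-in document
chunks = chunk_text(doc, max_words=200)
print(len(chunks), [len(c.split()) for c in chunks])
# → 3 [200, 200, 50]
```

Each chunk can then be passed to the summarizer and the partial summaries joined (or summarized again) to cover the whole document.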
Step 3: Generate the Summary
# Generate summary
summary = summarizer(article, max_length=80, min_length=30, do_sample=False)
print("Original length:", len(article.split()), "words")
print("Summary length:", len(summary[0]['summary_text'].split()), "words")
print("\nSummary:")
print(summary[0]['summary_text'])
We control summary length with max_length and min_length parameters.
Setting do_sample=False uses deterministic generation (always picks the most likely
next word), ensuring consistent summaries. The model distills the key points: when transformers
were introduced, BERT's impact, and current applications.
Output:
Original length: 143 words
Summary length: 42 words
Summary:
Transformers have revolutionized natural language processing since their introduction in 2017.
BERT achieved state-of-the-art results on numerous NLP benchmarks. Today, transformers power
search engines, translation services, code assistants, and conversational AI.
Zero-Shot Classification
One of the most powerful capabilities of modern transformers is zero-shot classification - categorizing text into labels the model has never been explicitly trained on. This eliminates the need for task-specific training data.
Step 1: Load the Zero-Shot Classifier
from transformers import pipeline
# Load zero-shot classification pipeline
classifier = pipeline("zero-shot-classification")
Zero-shot classification uses a model trained on natural language inference (NLI). It works by reformulating classification as: "Does this text entail that it is about [label]?" This clever approach lets the model classify into any categories you specify, without retraining.
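The NLI reformulation can be made concrete with a small sketch. Each candidate label is slotted into a hypothesis sentence, and the NLI model then scores whether the input text (the premise) entails each hypothesis. The template string below is an assumption for illustration; the Hugging Face pipeline exposes a similar configurable hypothesis_template parameter:

```python
def build_nli_pairs(text, labels, template="This example is {}."):
    """Turn a classification problem into premise/hypothesis pairs for NLI."""
    return [(text, template.format(label)) for label in labels]

pairs = build_nli_pairs(
    "The new iPhone 16 features an improved camera system.",
    ["technology", "sports"],
)
for premise, hypothesis in pairs:
    print(f"premise:    {premise}")
    print(f"hypothesis: {hypothesis}")
```

The entailment score for each pair becomes that label's classification score, which is why arbitrary label sets work without any retraining.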
Step 2: Define Text and Labels
# Text to classify
text = "The new iPhone 16 features an improved camera system and longer battery life."
# Define candidate labels (no training needed!)
candidate_labels = ["technology", "sports", "politics", "entertainment", "science"]
You provide the text to classify and a list of possible categories. The magic is that you can change these labels to anything - product categories, emotions, topics, urgency levels - without retraining the model. It understands the semantic meaning of your labels from its pre-training.
Step 3: Get Classification Results
# Classify
result = classifier(text, candidate_labels)
print(f"Text: {text}\n")
print("Classification Results:")
for label, score in zip(result['labels'], result['scores']):
    bar = "=" * int(score * 30)
    print(f"  {label:15} | {bar:30} | {score:.1%}")
The model returns labels sorted by confidence score. Our tech-related text about iPhone features correctly gets classified as "technology" with 95.2% confidence. The visual bar chart shows how confident the model is for each category. This is powerful for content moderation, email routing, customer support ticket categorization, and more.
Output:
Text: The new iPhone 16 features an improved camera system and longer battery life.
Classification Results:
technology | ============================== | 95.2%
entertainment | == | 2.1%
science | = | 1.5%
sports | | 0.7%
politics | | 0.5%
Installation note: install the required libraries with pip install transformers torch. The first time you run a pipeline, it will download the model (which may take a few minutes). After that, models are cached locally.
Choosing the Right Pipeline
| Task | Pipeline Name | Use Case |
|---|---|---|
| Sentiment Analysis | "sentiment-analysis" | Review analysis, social media monitoring |
| Named Entity Recognition | "ner" | Extract people, places, organizations |
| Question Answering | "question-answering" | FAQ bots, document search |
| Summarization | "summarization" | News summaries, document condensation |
| Translation | "translation_xx_to_yy" | Language translation |
| Text Generation | "text-generation" | Content creation, story writing |
| Zero-Shot Classification | "zero-shot-classification" | Flexible categorization without training |
Practice Questions: Applications
Apply transformers to real-world problems.
Task: Use the sentiment pipeline to analyze 3 product reviews of your choice.
Show Solution
from transformers import pipeline
sentiment = pipeline("sentiment-analysis")
reviews = [
    "Best purchase I've ever made! Highly recommend.",
    "Product broke after one week. Very disappointed.",
    "It works as expected. Nothing special."
]
for review in reviews:
    result = sentiment(review)[0]
    print(f"{result['label']:8} ({result['score']:.1%}): {review}")
Task: Create a QA system that answers questions about a company based on a context paragraph.
Show Solution
from transformers import pipeline
qa = pipeline("question-answering")
# Company FAQ context
context = """
TechCorp was founded in 2015 by Jane Smith and John Doe.
The company is headquartered in San Francisco and has 500 employees.
TechCorp specializes in artificial intelligence solutions for healthcare.
The company's main product is MedAssist, an AI diagnostic tool.
"""
questions = [
    "Who founded TechCorp?",
    "Where is TechCorp located?",
    "What is the company's main product?"
]
for q in questions:
    answer = qa(question=q, context=context)
    print(f"Q: {q}")
    print(f"A: {answer['answer']} ({answer['score']:.1%})\n")
Task: Use zero-shot classification to route support tickets to the right department.
Show Solution
from transformers import pipeline
classifier = pipeline("zero-shot-classification")
tickets = [
    "I cannot log into my account, password reset not working",
    "When will my order arrive? It's been 2 weeks",
    "I was charged twice for the same item",
    "The product manual is missing installation instructions"
]
departments = ["technical support", "shipping", "billing", "documentation"]
for ticket in tickets:
    result = classifier(ticket, departments)
    top_dept = result['labels'][0]
    confidence = result['scores'][0]
    print(f"Ticket: {ticket[:50]}...")
    print(f"Route to: {top_dept} ({confidence:.1%})\n")
Key Takeaways
Transformers Replaced RNNs
By processing sequences in parallel rather than sequentially, transformers achieve faster training and better capture long-range dependencies in text
Self-Attention is Key
Self-attention allows each token to attend to all other tokens, computing relevance scores that capture semantic relationships regardless of distance
Encoder-Decoder Design
The original transformer uses encoders to understand input and decoders to generate output, with cross-attention connecting them
BERT vs GPT Approaches
BERT reads text bidirectionally for understanding tasks, while GPT generates text autoregressively for generation tasks
Positional Encoding Matters
Since transformers process all tokens simultaneously, positional encodings inject word order information that would otherwise be lost
Hugging Face Simplifies Usage
Pre-trained transformer models are easily accessible through Hugging Face, enabling state-of-the-art NLP with just a few lines of code
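The self-attention mechanism highlighted in the takeaways can be sketched in pure Python on toy 2-dimensional vectors. This is a minimal illustration of scaled dot-product attention, ignoring the learned projection matrices and multiple heads of a real transformer:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of small vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Compare this token's query against every token's key
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        # Softmax turns scores into attention weights that sum to 1
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three toy tokens; using the same vectors for queries, keys, and values
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(x, x, x)
print([[round(v, 2) for v in row] for row in out])
```

Every token attends to every other token in a single step, which is the property that lets transformers capture long-range dependencies without the step-by-step information decay of RNNs.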