Attention Is All You Need
This paper introduces the Transformer, a novel neural network architecture based entirely on attention mechanisms, eliminating the need for recurrence and convolutions. The model achieves state-of-the-art results on machine translation tasks (28.4 BLEU on WMT 2014 English-to-German) while being significantly more parallelizable and requiring less training time than previous approaches.
Attention Is All You Need: The Transformer Architecture
1. Introduction and Problem Statement
The Transformer represents a paradigm shift in sequence modeling. Prior to this work, state-of-the-art sequence transduction models (like machine translation systems) relied heavily on complex recurrent neural networks (RNNs) or convolutional neural networks (CNNs) with encoder-decoder architectures.
Key Limitations of Previous Approaches
- Sequential computation bottleneck: RNNs process tokens one-by-one, making parallelization impossible within training examples
- Long-range dependency challenges: Information must traverse many sequential steps, making it difficult to learn relationships between distant tokens
- Training inefficiency: Memory constraints and sequential processing lead to slow training, especially on longer sequences
The Core Innovation: The Transformer architecture dispenses with recurrence and convolutions entirely, relying solely on attention mechanisms to draw global dependencies between input and output.
2. Technical Approach
2.1 High-Level Architecture
The Transformer follows the standard encoder-decoder structure but replaces recurrent layers with self-attention and position-wise feed-forward networks.
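To make this structure concrete, here is a minimal PyTorch sketch of one encoder layer (an illustration using torch.nn.MultiheadAttention for brevity, not the paper's reference implementation). Each sub-layer is wrapped in a residual connection followed by layer normalization, i.e. LayerNorm(x + Sublayer(x)), as in the paper:
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head self-attention plus a position-wise
    feed-forward network, each followed by residual connection + LayerNorm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, p_drop=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=p_drop, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x, key_padding_mask=None):
        # Sub-layer 1: multi-head self-attention, then residual + LayerNorm
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: position-wise feed-forward, then residual + LayerNorm
        ffn_out = self.ffn(x)
        return self.norm2(x + self.dropout(ffn_out))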
2.2 Scaled Dot-Product Attention
The fundamental building block is the Scaled Dot-Product Attention mechanism:
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Args:
        Q: Queries tensor (batch_size, seq_len, d_k)
        K: Keys tensor (batch_size, seq_len, d_k)
        V: Values tensor (batch_size, seq_len, d_v)
        mask: Optional mask blocking illegal connections (e.g. future positions)
    Returns:
        output: Weighted sum of values
        attention_weights: Attention distribution
    """
    d_k = Q.shape[-1]
    # Compute attention scores: QK^T, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # Apply mask (for decoder self-attention)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    # Softmax over the key dimension gives the attention weights
    attention_weights = F.softmax(scores, dim=-1)
    # Weighted sum of values
    output = torch.matmul(attention_weights, V)
    return output, attention_weights
Mathematical formulation:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Key design choice: The scaling factor prevents dot products from growing too large, which would push the softmax into regions with extremely small gradients.
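A quick numerical illustration of this effect (a minimal sketch with random vectors; the numbers are illustrative, not from the paper):
import torch

d_k, n = 64, 10
q = torch.randn(n, d_k)
k = torch.randn(n, d_k)
# Dot products of unit-variance d_k-dimensional vectors have standard
# deviation around sqrt(d_k), so unscaled scores tend to saturate the softmax.
scores = q @ k.T
unscaled = torch.softmax(scores, dim=-1)
scaled = torch.softmax(scores / d_k ** 0.5, dim=-1)
print(unscaled.max(dim=-1).values)  # typically near 1.0: almost one-hot, tiny gradients
print(scaled.max(dim=-1).values)    # noticeably softer distribution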
2.3 Multi-Head Attention
Instead of performing a single attention function, the model uses multiple attention heads in parallel, each learning different representation subspaces.
Implementation:
def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O, num_heads=8):
    """
    Args:
        Q, K, V: Input queries, keys, values (batch, seq_len, d_model)
        W_Q, W_K, W_V: Lists of per-head projection matrices, each (d_model, d_k)
        W_O: Output projection matrix (num_heads * d_v, d_model)
        num_heads: Number of attention heads (h)
    Note: d_k = d_v = d_model // num_heads (64 in the base model).
    """
    heads = []
    for i in range(num_heads):
        # Linear projections for each head
        Q_i = Q @ W_Q[i]  # (batch, seq_len, d_k)
        K_i = K @ W_K[i]
        V_i = V @ W_V[i]
        # Apply scaled dot-product attention (the weights are discarded here)
        head_i, _ = scaled_dot_product_attention(Q_i, K_i, V_i)
        heads.append(head_i)
    # Concatenate all heads along the feature dimension
    multi_head = torch.cat(heads, dim=-1)  # (batch, seq_len, num_heads * d_v)
    # Final linear projection back to d_model
    return multi_head @ W_O
Parameters:
- Base model: h = 8 heads, d_k = d_v = 64 dimensions per head
- Total computational cost similar to single-head attention with full dimensionality
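A quick shape check of the sketch above (random tensors and hypothetical per-head weight lists, for illustration only):
import torch

batch, seq_len, d_model, h = 2, 10, 512, 8
d_k = d_v = d_model // h

x = torch.randn(batch, seq_len, d_model)
W_Q = [torch.randn(d_model, d_k) for _ in range(h)]
W_K = [torch.randn(d_model, d_k) for _ in range(h)]
W_V = [torch.randn(d_model, d_v) for _ in range(h)]
W_O = torch.randn(h * d_v, d_model)

out = multi_head_attention(x, x, x, W_Q, W_K, W_V, W_O, num_heads=h)
print(out.shape)  # torch.Size([2, 10, 512])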
2.4 Three Types of Attention in the Transformer
- Encoder Self-Attention: Each position attends to all positions in the previous encoder layer
- Decoder Self-Attention: Each position attends to all previous positions (masked to preserve the auto-regressive property; a mask sketch follows this list)
- Encoder-Decoder Cross-Attention: Decoder queries attend to encoder outputs (keys and values)
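For the masked decoder self-attention, here is a minimal sketch of the look-ahead mask, compatible with the scaled_dot_product_attention sketch above (the helper name subsequent_mask is ours, not from the paper):
import torch

def subsequent_mask(seq_len):
    # Entry (i, j) is True if position i may attend to position j (j <= i).
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Positions where the mask is 0/False receive a score of -1e9 and therefore
# effectively zero attention weight after the softmax.
mask = subsequent_mask(5)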
2.5 Position-wise Feed-Forward Networks
After attention, each position passes through an identical feed-forward network:
def position_wise_ffn(x, W_1, b_1, W_2, b_2):
    """
    FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
    Two linear transformations with a ReLU in between, applied
    identically and independently at every position.
    """
    hidden = torch.relu(x @ W_1 + b_1)  # (batch, seq_len, d_ff)
    output = hidden @ W_2 + b_2         # (batch, seq_len, d_model)
    return output
- Inner layer dimensionality: d_ff = 2048
- Can be viewed as two 1×1 convolutions (see the sketch below)
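To make the 1×1-convolution view concrete, here is a small equivalence sketch (it assumes PyTorch's nn.Conv1d and is ours, not the paper's code); a kernel of size 1 mixes channels at each position independently, exactly like the position-wise FFN:
import torch
import torch.nn as nn

d_model, d_ff, seq_len = 512, 2048, 10
x = torch.randn(1, seq_len, d_model)

# The position-wise FFN expressed as two kernel-size-1 convolutions
# (Conv1d expects (batch, channels, length), hence the transposes).
conv_ffn = nn.Sequential(
    nn.Conv1d(d_model, d_ff, kernel_size=1),
    nn.ReLU(),
    nn.Conv1d(d_ff, d_model, kernel_size=1),
)
out = conv_ffn(x.transpose(1, 2)).transpose(1, 2)  # (1, seq_len, d_model)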
2.6 Positional Encoding
Since the model has no recurrence or convolution, positional encodings inject sequence order information:
import numpy as np

def positional_encoding(max_len, d_model):
    """
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pe = np.zeros((max_len, d_model))
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # i already equals 2i in the formula above, so the exponent is i / d_model
            angle = pos / (10000 ** (i / d_model))
            pe[pos, i] = np.sin(angle)
            pe[pos, i + 1] = np.cos(angle)
    return pe
Why sinusoidal functions?
- Allow the model to attend by relative positions
- For any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos) (verified numerically after this list)
- May extrapolate to longer sequences than seen during training
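A quick numerical check of the linear-relation property, reusing the positional_encoding sketch above (the offset and dimensions are illustrative):
import numpy as np

d_model, pos, k = 8, 5, 3
pe = positional_encoding(pos + k + 1, d_model)

# For each sin/cos pair with frequency w, PE(pos + k) is a fixed rotation
# (depending only on the offset k) applied to PE(pos).
for i in range(0, d_model, 2):
    w = 1.0 / (10000 ** (i / d_model))
    rotation = np.array([[np.cos(k * w), np.sin(k * w)],
                         [-np.sin(k * w), np.cos(k * w)]])
    assert np.allclose(rotation @ pe[pos, i:i + 2], pe[pos + k, i:i + 2])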
3. Why Self-Attention? Comparative Analysis
3.1 Complexity Comparison
| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
|---|---|---|---|
| Self-Attention | O(n² · d) | O(1) | O(1) |
| Recurrent | O(n · d²) | O(n) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(log_k(n)) |
Where:
- n = sequence length
- d = representation dimension
- k = kernel size
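To make these asymptotics concrete, here is a quick back-of-the-envelope comparison (n, d, and k are illustrative values, not figures from the paper):
n, d, k = 50, 512, 3  # illustrative: a 50-token sentence, base model dimension
print(f"self-attention : {n * n * d:>12,} ~ops/layer")      # O(n^2 * d)
print(f"recurrent      : {n * d * d:>12,} ~ops/layer")      # O(n * d^2)
print(f"convolutional  : {k * n * d * d:>12,} ~ops/layer")  # O(k * n * d^2)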
3.2 Key Advantages
- Constant path length: Self-attention connects all positions with O(1) sequential operations, making it easier to learn long-range dependencies
- Parallelization: Unlike RNNs, all positions can be processed simultaneously
- Computational efficiency: When n < d (typical for sentence representations), self-attention is faster than recurrent layers
- Interpretability: Attention distributions can be visualized to understand what the model learns
4. Training Details
4.1 Training Configuration
Dataset:
- WMT 2014 English-German: ~4.5M sentence pairs
- WMT 2014 English-French: 36M sentence pairs
- Byte-pair encoding with shared vocabulary
Hardware:
- 8 NVIDIA P100 GPUs
- Base model: 0.4 seconds/step, 100K steps (12 hours)
- Big model: 1.0 seconds/step, 300K steps (3.5 days)
4.2 Optimizer: Warmup + Decay
def learning_rate_schedule(step_num, d_model=512, warmup_steps=4000):
    """
    lrate = d_model^(-0.5) * min(step_num^(-0.5),
                                 step_num * warmup_steps^(-1.5))
    """
    arg1 = step_num ** (-0.5)
    arg2 = step_num * (warmup_steps ** (-1.5))
    return (d_model ** (-0.5)) * min(arg1, arg2)
Optimizer: Adam with β₁ = 0.9, β₂ = 0.98, ε = 10⁻⁹
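A usage sketch showing the shape of this schedule (step values chosen for illustration):
# The rate increases linearly during warmup, peaks at step == warmup_steps,
# then decays proportionally to 1 / sqrt(step).
for step in (1, 1000, 4000, 40000, 100000):
    print(f"{step:>7d}  {learning_rate_schedule(step):.2e}")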
4.3 Regularization Techniques
- Residual Dropout: P_drop = 0.1 (base model); 0.3 for the big model, reduced to 0.1 for the EN-FR big model
  - Applied to attention outputs and feed-forward outputs
  - Applied to embedding + positional encoding sums
- Label Smoothing: ε_ls = 0.1
  - Hurts perplexity, but improves accuracy and BLEU (a minimal sketch follows this list)
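For reference, a minimal sketch of one common way to build the smoothed target distribution (our illustration of the general technique, not the paper's exact implementation):
import torch

def smooth_labels(targets, vocab_size, eps=0.1):
    """targets: (batch,) integer token ids -> (batch, vocab_size) distribution."""
    # Put 1 - eps on the correct token and spread eps over the remaining classes.
    dist = torch.full((targets.size(0), vocab_size), eps / (vocab_size - 1))
    dist.scatter_(1, targets.unsqueeze(1), 1.0 - eps)
    return dist

# Training then minimizes the cross-entropy (KL divergence) between the
# model's output distribution and this smoothed target.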
5. Key Results
5.1 Machine Translation Performance
WMT 2014 English-to-German:
| Model | BLEU Score | Training Cost (FLOPs) |
|---|---|---|
| ByteNet | 23.75 | - |
| GNMT + RL | 24.6 | 2.3×10¹⁹ |
| ConvS2S | 25.16 | 9.6×10¹⁸ |
| Transformer (base) | 27.3 | 3.3×10¹⁸ |
| Transformer (big) | 28.4 | 2.3×10¹⁹ |
WMT 2014 English-to-French:
| Model | BLEU Score | Training Cost (FLOPs) |
|---|---|---|
| GNMT + RL | 39.92 | 1.4×10²⁰ |
| ConvS2S | 40.46 | 1.5×10²⁰ |
| Transformer (base) | 38.1 | 3.3×10¹⁸ |
| Transformer (big) | 41.8 | 2.3×10¹⁹ |
Key Achievement: The Transformer (big) achieves 28.4 BLEU on EN-DE, outperforming all previous models including ensembles by over 2 BLEU points, while training in just 3.5 days.
5.2 Model Variations Study
Impact of attention heads (Table 3, rows A):
- Single head: 0.9 BLEU worse than best setting
- 8 heads (base): Optimal balance
- Too many heads also degrades quality
Impact of attention key size (rows B):
- Reducing d_k hurts model quality
- Suggests compatibility computation is non-trivial
Model size matters (rows C):
- Bigger models consistently perform better
- Dropout crucial for avoiding overfitting
Positional encoding (row E):
- Sinusoidal vs. learned embeddings: nearly identical results
5.3 English Constituency Parsing
To test generalization beyond translation:
| Parser | Training Data | WSJ 23 F1 |
|---|---|---|
| Vinyals & Kaiser (2014) | WSJ only | 88.3 |
| Dyer et al. (2016) | WSJ only | 91.7 |
| Transformer (4 layers) | WSJ only | 91.3 |
| BerkeleyParser | Semi-supervised | 92.1 |
| Transformer (4 layers) | Semi-supervised | 92.7 |
Surprising finding: Despite no task-specific tuning, the Transformer outperforms the BerkeleyParser even with only 40K training sentences.
6. Practical Implications
6.1 Real-World Applications
- Machine Translation: State-of-the-art quality with dramatically reduced training time
  - Production systems can be trained in days instead of weeks
  - Lower computational cost enables more experimentation
- Sequence Modeling: General architecture applicable to:
  - Language modeling
  - Text summarization
  - Question answering
  - Constituency parsing
- Parallelization Benefits:
  - Efficient use of modern GPU/TPU hardware
  - Scales better to longer sequences than RNNs
6.2 Architectural Insights
The attention mechanism provides interpretability: different attention heads learn to perform different linguistic tasks automatically.
7. Related Work and Context
7.1 Evolution from Prior Approaches
Recurrent Models (LSTMs, GRUs):
- Sequential bottleneck limits parallelization
- Difficulty learning long-range dependencies
- State-of-the-art before Transformers
Convolutional Models (ByteNet, ConvS2S):
- Parallel computation within layers
- Path length grows with distance (linearly or logarithmically)
- More expensive than recurrent layers
Attention Mechanisms:
- Previously used with recurrent networks
- Transformer is first to rely entirely on attention
7.2 Key Innovations
- Self-attention as primary mechanism: Replaces recurrence entirely
- Multi-head attention: Allows attending to different representation subspaces
- Positional encoding: Injects sequence order without recurrence
- Scaled dot-product: Prevents gradient vanishing in attention computation
8. Model Architecture Details
8.1 Base Model Configuration
base_config = {
    'N': 6,           # Number of encoder/decoder layers
    'd_model': 512,   # Model dimension
    'd_ff': 2048,     # Feed-forward inner dimension
    'h': 8,           # Number of attention heads
    'd_k': 64,        # Key dimension (d_model / h)
    'd_v': 64,        # Value dimension
    'P_drop': 0.1,    # Dropout rate
    'params': 65e6,   # Total parameters (~65M)
}
8.2 Big Model Configuration
big_config = {
    'N': 6,
    'd_model': 1024,
    'd_ff': 4096,
    'h': 16,
    'd_k': 64,
    'd_v': 64,
    'P_drop': 0.3,
    'params': 213e6,  # ~213M total parameters
}
9. Conclusion and Impact
9.1 Key Contributions
- First sequence transduction model based entirely on attention
- Superior translation quality with significantly faster training
- Strong generalization to other tasks (parsing)
- Highly parallelizable architecture
9.2 Future Directions (from paper)
- Extend to other modalities (images, audio, video)
- Investigate local, restricted attention for large inputs
- Make generation less sequential
9.3 Historical Impact
The Transformer architecture has become the foundation for modern NLP, spawning models like BERT, GPT, T5, and countless others. Its influence extends beyond NLP to computer vision (Vision Transformers) and multimodal learning.
Code availability: https://github.com/tensorflow/tensor2tensor
Appendix: Attention Visualization Examples
The paper includes visualizations showing attention heads learning interpretable patterns:
- Anaphora resolution: Attention heads that resolve pronouns to their referents
- Syntactic dependencies: Heads that capture grammatical relationships
- Long-range dependencies: Heads that connect distant related words
These visualizations demonstrate that the model learns linguistically meaningful representations without explicit supervision.