Attention Is All You Need
This paper introduces the Transformer, a novel neural network architecture based entirely on attention mechanisms, eliminating the need for recurrence and convolutions. The model achieves state-of-the-art results on machine translation tasks (28.4 BLEU on WMT 2014 English-to-German) while being significantly more parallelizable and requiring less training time than previous approaches.
Attention Is All You Need: The Transformer Architecture
1. Introduction and Problem Statement
The Transformer represents a paradigm shift in sequence modeling. Prior to this work, state-of-the-art sequence transduction models (like machine translation systems) relied heavily on complex recurrent neural networks (RNNs) or convolutional neural networks (CNNs) with encoder-decoder architectures.
Key Limitations of Previous Approaches
- Sequential computation bottleneck: RNNs process tokens one-by-one, making parallelization impossible within training examples
- Long-range dependency challenges: Information must traverse many sequential steps, making it difficult to learn relationships between distant tokens
- Training inefficiency: Memory constraints and sequential processing lead to slow training, especially on longer sequences
The Core Innovation: The Transformer architecture dispenses with recurrence and convolutions entirely, relying solely on attention mechanisms to draw global dependencies between input and output.
2. Technical Approach
2.1 High-Level Architecture
The Transformer follows the standard encoder-decoder structure but replaces recurrent layers with self-attention and position-wise feed-forward networks.
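To make this structure concrete, here is a minimal PyTorch sketch of one encoder layer (an illustration using torch.nn.MultiheadAttention for brevity, not the paper's reference implementation). Each sub-layer is wrapped in a residual connection followed by layer normalization, i.e. LayerNorm(x + Sublayer(x)), as in the paper:
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head self-attention plus a position-wise
    feed-forward network, each followed by residual connection + LayerNorm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, p_drop=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                               dropout=p_drop, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x, key_padding_mask=None):
        # Sub-layer 1: multi-head self-attention, then residual + LayerNorm
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Sub-layer 2: position-wise feed-forward, then residual + LayerNorm
        ffn_out = self.ffn(x)
        return self.norm2(x + self.dropout(ffn_out))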
2.2 Scaled Dot-Product Attention
The fundamental building block is the Scaled Dot-Product Attention mechanism:
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Args:
        Q: Queries tensor (batch_size, seq_len, d_k)
        K: Keys tensor (batch_size, seq_len, d_k)
        V: Values tensor (batch_size, seq_len, d_v)
        mask: Optional mask blocking illegal connections (e.g. future positions)
    Returns:
        output: Weighted sum of values
        attention_weights: Attention distribution
    """
    d_k = Q.shape[-1]
    # Compute attention scores: QK^T, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # Apply mask (for decoder self-attention)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    # Softmax over the key dimension gives the attention weights
    attention_weights = F.softmax(scores, dim=-1)
    # Weighted sum of values
    output = torch.matmul(attention_weights, V)
    return output, attention_weights
Mathematical formulation:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
Key design choice: The scaling factor prevents dot products from growing too large, which would push the softmax into regions with extremely small gradients.
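A quick numerical illustration of this effect (a minimal sketch with random vectors; the numbers are illustrative, not from the paper):
import torch

d_k, n = 64, 10
q = torch.randn(n, d_k)
k = torch.randn(n, d_k)
# Dot products of unit-variance d_k-dimensional vectors have standard
# deviation around sqrt(d_k), so unscaled scores tend to saturate the softmax.
scores = q @ k.T
unscaled = torch.softmax(scores, dim=-1)
scaled = torch.softmax(scores / d_k ** 0.5, dim=-1)
print(unscaled.max(dim=-1).values)  # typically near 1.0: almost one-hot, tiny gradients
print(scaled.max(dim=-1).values)    # noticeably softer distribution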
2.3 Multi-Head Attention
Instead of performing a single attention function, the model uses multiple attention heads in parallel, each learning different representation subspaces.
Implementation:
def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O, num_heads=8):
    """
    Args:
        Q, K, V: Input queries, keys, values (batch, seq_len, d_model)
        W_Q, W_K, W_V: Lists of per-head projection matrices, each (d_model, d_k)
        W_O: Output projection matrix (num_heads * d_v, d_model)
        num_heads: Number of attention heads (h)
    Note: d_k = d_v = d_model // num_heads (64 in the base model).
    """
    heads = []
    for i in range(num_heads):
        # Linear projections for each head
        Q_i = Q @ W_Q[i]  # (batch, seq_len, d_k)
        K_i = K @ W_K[i]
        V_i = V @ W_V[i]
        # Apply scaled dot-product attention (the weights are discarded here)
        head_i, _ = scaled_dot_product_attention(Q_i, K_i, V_i)
        heads.append(head_i)
    # Concatenate all heads along the feature dimension
    multi_head = torch.cat(heads, dim=-1)  # (batch, seq_len, num_heads * d_v)
    # Final linear projection back to d_model
    return multi_head @ W_O
Parameters:
- Base model: h = 8 heads, d_k = d_v = 64 dimensions per head
- Total computational cost similar to single-head attention with full dimensionality
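A quick shape check of the sketch above (random tensors and hypothetical per-head weight lists, for illustration only):
import torch

batch, seq_len, d_model, h = 2, 10, 512, 8
d_k = d_v = d_model // h

x = torch.randn(batch, seq_len, d_model)
W_Q = [torch.randn(d_model, d_k) for _ in range(h)]
W_K = [torch.randn(d_model, d_k) for _ in range(h)]
W_V = [torch.randn(d_model, d_v) for _ in range(h)]
W_O = torch.randn(h * d_v, d_model)

out = multi_head_attention(x, x, x, W_Q, W_K, W_V, W_O, num_heads=h)
print(out.shape)  # torch.Size([2, 10, 512])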
2.4 Three Types of Attention in the Transformer
- Encoder Self-Attention: Each position attends to all positions in the previous encoder layer
- Decoder Self-Attention: Each position attends to all previous positions (masked to preserve the auto-regressive property; a mask sketch follows this list)
- Encoder-Decoder Cross-Attention: Decoder queries attend to encoder outputs (keys and values)
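For the masked decoder self-attention, here is a minimal sketch of the look-ahead mask, compatible with the scaled_dot_product_attention sketch above (the helper name subsequent_mask is ours, not from the paper):
import torch

def subsequent_mask(seq_len):
    # Entry (i, j) is True if position i may attend to position j (j <= i).
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Positions where the mask is 0/False receive a score of -1e9 and therefore
# effectively zero attention weight after the softmax.
mask = subsequent_mask(5)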
2.5 Position-wise Feed-Forward Networks
After attention, each position passes through an identical feed-forward network:
def position_wise_ffn(x, W_1, b_1, W_2, b_2):
    """
    FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
    Two linear transformations with a ReLU in between, applied
    identically and independently at every position.
    """
    hidden = torch.relu(x @ W_1 + b_1)  # (batch, seq_len, d_ff)
    output = hidden @ W_2 + b_2         # (batch, seq_len, d_model)
    return output
- Inner layer dimensionality: d_ff = 2048
- Can be viewed as two 1×1 convolutions (see the sketch below)
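To make the 1×1-convolution view concrete, here is a small equivalence sketch (it assumes PyTorch's nn.Conv1d and is ours, not the paper's code); a kernel of size 1 mixes channels at each position independently, exactly like the position-wise FFN:
import torch
import torch.nn as nn

d_model, d_ff, seq_len = 512, 2048, 10
x = torch.randn(1, seq_len, d_model)

# The position-wise FFN expressed as two kernel-size-1 convolutions
# (Conv1d expects (batch, channels, length), hence the transposes).
conv_ffn = nn.Sequential(
    nn.Conv1d(d_model, d_ff, kernel_size=1),
    nn.ReLU(),
    nn.Conv1d(d_ff, d_model, kernel_size=1),
)
out = conv_ffn(x.transpose(1, 2)).transpose(1, 2)  # (1, seq_len, d_model)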
2.6 Positional Encoding
Since the model has no recurrence or convolution, positional encodings inject sequence order information:
import numpy as np

def positional_encoding(max_len, d_model):
    """
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pe = np.zeros((max_len, d_model))
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # i already equals 2i in the formula above, so the exponent is i / d_model
            angle = pos / (10000 ** (i / d_model))
            pe[pos, i] = np.sin(angle)
            pe[pos, i + 1] = np.cos(angle)
    return pe
Why sinusoidal functions?
- Allow the model to attend by relative positions
- For any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos) (verified numerically after this list)
- May extrapolate to longer sequences than seen during training
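A quick numerical check of the linear-relation property, reusing the positional_encoding sketch above (the offset and dimensions are illustrative):
import numpy as np

d_model, pos, k = 8, 5, 3
pe = positional_encoding(pos + k + 1, d_model)

# For each sin/cos pair with frequency w, PE(pos + k) is a fixed rotation
# (depending only on the offset k) applied to PE(pos).
for i in range(0, d_model, 2):
    w = 1.0 / (10000 ** (i / d_model))
    rotation = np.array([[np.cos(k * w), np.sin(k * w)],
                         [-np.sin(k * w), np.cos(k * w)]])
    assert np.allclose(rotation @ pe[pos, i:i + 2], pe[pos + k, i:i + 2])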
3. Why Self-Attention? Comparative Analysis
3.1 Complexity Comparison
| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
|---|---|---|---|
| Self-Attention | O(n² · d) | O(1) | O(1) |
| Recurrent | O(n · d²) | O(n) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(log_k(n)) |
Where:
- n = sequence length
- d = representation dimension
- k = kernel size
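To make these asymptotics concrete, here is a quick back-of-the-envelope comparison (n, d, and k are illustrative values, not figures from the paper):
n, d, k = 50, 512, 3  # illustrative: a 50-token sentence, base model dimension
print(f"self-attention : {n * n * d:>12,} ~ops/layer")      # O(n^2 * d)
print(f"recurrent      : {n * d * d:>12,} ~ops/layer")      # O(n * d^2)
print(f"convolutional  : {k * n * d * d:>12,} ~ops/layer")  # O(k * n * d^2)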
3.2 Key Advantages
- Constant path length: Self-attention connects all positions with O(1) sequential operations, making it easier to learn long-range dependencies
- Parallelization: Unlike RNNs, all positions can be processed simultaneously
- Computational efficiency: When n < d (typical for sentence representations), self-attention is faster than recurrent layers
- Interpretability: Attention distributions can be visualized to understand what the model learns
4. Training Details
4.1 Training Configuration
Dataset:
- WMT 2014 English-German: ~4.5M sentence pairs
- WMT 2014 English-French: 36M sentence pairs
- Byte-pair encoding with shared vocabulary
Hardware:
- 8 NVIDIA P100 GPUs
- Base model: 0.4 seconds/step, 100K steps (12 hours)
- Big model: 1.0 seconds/step, 300K steps (3.5 days)
4.2 Optimizer: Warmup + Decay
def learning_rate_schedule(step_num, d_model=512, warmup_steps=4000):
    """
    lrate = d_model^(-0.5) * min(step_num^(-0.5),
                                 step_num * warmup_steps^(-1.5))
    """
    arg1 = step_num ** (-0.5)
    arg2 = step_num * (warmup_steps ** (-1.5))
    return (d_model ** (-0.5)) * min(arg1, arg2)
Optimizer: Adam with β₁ = 0.9, β₂ = 0.98, ε = 10⁻⁹
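A usage sketch showing the shape of this schedule (step values chosen for illustration):
# The rate increases linearly during warmup, peaks at step == warmup_steps,
# then decays proportionally to 1 / sqrt(step).
for step in (1, 1000, 4000, 40000, 100000):
    print(f"{step:>7d}  {learning_rate_schedule(step):.2e}")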
4.3 Regularization Techniques
- Residual Dropout: P_drop = 0.1 (base model); 0.3 for the big model, reduced to 0.1 for the EN-FR big model
  - Applied to attention outputs and feed-forward outputs
  - Applied to embedding + positional encoding sums
- Label Smoothing: ε_ls = 0.1
  - Hurts perplexity, but improves accuracy and BLEU (a minimal sketch follows this list)
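For reference, a minimal sketch of one common way to build the smoothed target distribution (our illustration of the general technique, not the paper's exact implementation):
import torch

def smooth_labels(targets, vocab_size, eps=0.1):
    """targets: (batch,) integer token ids -> (batch, vocab_size) distribution."""
    # Put 1 - eps on the correct token and spread eps over the remaining classes.
    dist = torch.full((targets.size(0), vocab_size), eps / (vocab_size - 1))
    dist.scatter_(1, targets.unsqueeze(1), 1.0 - eps)
    return dist

# Training then minimizes the cross-entropy (KL divergence) between the
# model's output distribution and this smoothed target.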
5. Key Results
5.1 Machine Translation Performance
WMT 2014 English-to-German:
| Model | BLEU Score | Training Cost (FLOPs) |
|---|---|---|
| ByteNet | 23.75 | - |
| GNMT + RL | 24.6 | 2.3×10¹⁹ |
| ConvS2S | 25.16 | 9.6×10¹⁸ |
| Transformer (base) | 27.3 | 3.3×10¹⁸ |
| Transformer (big) | 28.4 | 2.3×10¹⁹ |
WMT 2014 English-to-French:
| Model | BLEU Score | Training Cost (FLOPs) |
|---|---|---|
| GNMT + RL | 39.92 | 1.4×10²⁰ |
| ConvS2S | 40.46 | 1.5×10²⁰ |
| Transformer (base) | 38.1 | 3.3×10¹⁸ |
| Transformer (big) | 41.8 | 2.3×10¹⁹ |
Key Achievement: The Transformer (big) achieves 28.4 BLEU on EN-DE, outperforming all previous models including ensembles by over 2 BLEU points, while training in just 3.5 days.
5.2 Model Variations Study
Impact of attention heads (Table 3, rows A):
- Single head: 0.9 BLEU worse than best setting
- 8 heads (base): Optimal balance
- Too many heads also degrades quality
Impact of attention key size (rows B):
- Reducing d_k hurts model quality
- Suggests compatibility computation is non-trivial
Model size matters (rows C):
- Bigger models consistently perform better
- Dropout crucial for avoiding overfitting
Positional encoding (row E):
- Sinusoidal vs. learned embeddings: nearly identical results
5.3 English Constituency Parsing
To test generalization beyond translation:
| Parser | Training Data | WSJ 23 F1 |
|---|---|---|
| Vinyals & Kaiser (2014) | WSJ only | 88.3 |
| Dyer et al. (2016) | WSJ only | 91.7 |
| Transformer (4 layers) | WSJ only | 91.3 |
| BerkeleyParser | Semi-supervised | 92.1 |
| Transformer (4 layers) | Semi-supervised | 92.7 |
Surprising finding: Despite no task-specific tuning, the Transformer outperforms the BerkeleyParser even with only 40K training sentences.
6. Practical Implications
6.1 Real-World Applications
- Machine Translation: State-of-the-art quality with dramatically reduced training time
  - Production systems can be trained in days instead of weeks
  - Lower computational cost enables more experimentation
- Sequence Modeling: General architecture applicable to:
  - Language modeling
  - Text summarization
  - Question answering
  - Constituency parsing
- Parallelization Benefits:
  - Efficient use of modern GPU/TPU hardware
  - Scales better to longer sequences than RNNs
6.2 Architectural Insights
The attention mechanism provides interpretability: different attention heads learn to perform different linguistic tasks automatically.
7. Related Work and Context
7.1 Evolution from Prior Approaches
Recurrent Models (LSTMs, GRUs):
- Sequential bottleneck limits parallelization
- Difficulty learning long-range dependencies
- State-of-the-art before Transformers
Convolutional Models (ByteNet, ConvS2S):
- Parallel computation within layers
- Path length grows with distance (linearly or logarithmically)
- More expensive than recurrent layers
Attention Mechanisms:
- Previously used with recurrent networks
- Transformer is first to rely entirely on attention
7.2 Key Innovations
- Self-attention as primary mechanism: Replaces recurrence entirely
- Multi-head attention: Allows attending to different representation subspaces
- Positional encoding: Injects sequence order without recurrence
- Scaled dot-product: Prevents gradient vanishing in attention computation
8. Model Architecture Details
8.1 Base Model Configuration
base_config = {
    'N': 6,           # Number of encoder/decoder layers
    'd_model': 512,   # Model dimension
    'd_ff': 2048,     # Feed-forward inner dimension
    'h': 8,           # Number of attention heads
    'd_k': 64,        # Key dimension (d_model / h)
    'd_v': 64,        # Value dimension
    'P_drop': 0.1,    # Dropout rate
    'params': 65e6,   # Total parameters (~65M)
}
8.2 Big Model Configuration
big_config = {
    'N': 6,
    'd_model': 1024,
    'd_ff': 4096,
    'h': 16,
    'd_k': 64,
    'd_v': 64,
    'P_drop': 0.3,
    'params': 213e6,  # ~213M total parameters
}
9. Conclusion and Impact
9.1 Key Contributions
- First sequence transduction model based entirely on attention
- Superior translation quality with significantly faster training
- Strong generalization to other tasks (parsing)
- Highly parallelizable architecture
9.2 Future Directions (from paper)
- Extend to other modalities (images, audio, video)
- Investigate local, restricted attention for large inputs
- Make generation less sequential
9.3 Historical Impact
The Transformer architecture has become the foundation for modern NLP, spawning models like BERT, GPT, T5, and countless others. Its influence extends beyond NLP to computer vision (Vision Transformers) and multimodal learning.
Code availability: https://github.com/tensorflow/tensor2tensor
Appendix: Attention Visualization Examples
The paper includes visualizations showing attention heads learning interpretable patterns:
- Anaphora resolution: Attention heads that resolve pronouns to their referents
- Syntactic dependencies: Heads that capture grammatical relationships
- Long-range dependencies: Heads that connect distant related words
These visualizations demonstrate that the model learns linguistically meaningful representations without explicit supervision.