Category: Computer Architecture · NIPS 2017 · Tags: transformer, attention mechanism, machine translation, neural networks, parallel processing, sequence modeling, deep learning, NLP

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

This paper introduces the Transformer, a novel neural network architecture based entirely on attention mechanisms, eliminating the need for recurrence and convolutions. The model achieves state-of-the-art results on machine translation tasks (28.4 BLEU on WMT 2014 English-to-German) while being significantly more parallelizable and requiring less training time than previous approaches.


Attention Is All You Need: The Transformer Architecture

1. Introduction and Problem Statement

The Transformer represents a paradigm shift in sequence modeling. Prior to this work, state-of-the-art sequence transduction models (like machine translation systems) relied heavily on complex recurrent neural networks (RNNs) or convolutional neural networks (CNNs) with encoder-decoder architectures.

Key Limitations of Previous Approaches

  • Sequential computation bottleneck: RNNs process tokens one by one, which precludes parallelization within a training example
  • Long-range dependency challenges: Information must traverse many sequential steps, making it difficult to learn relationships between distant tokens
  • Training inefficiency: Memory constraints and sequential processing lead to slow training, especially on longer sequences

The Core Innovation: The Transformer architecture dispenses with recurrence and convolutions entirely, relying solely on attention mechanisms to draw global dependencies between input and output.


2. Technical Approach

2.1 High-Level Architecture

The Transformer follows the standard encoder-decoder structure but replaces recurrent layers with self-attention and position-wise feed-forward networks.


2.2 Scaled Dot-Product Attention

The fundamental building block is the Scaled Dot-Product Attention mechanism:

import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Args:
        Q: Queries tensor (batch_size, seq_len, d_k)
        K: Keys tensor (batch_size, seq_len, d_k)
        V: Values tensor (batch_size, seq_len, d_v)
        mask: Optional tensor; positions where mask == 0 are blocked
    
    Returns:
        attention_output: Weighted sum of values (batch_size, seq_len, d_v)
        attention_weights: Attention distribution (batch_size, seq_len, seq_len)
    """
    d_k = Q.shape[-1]
    
    # Compute attention scores: similarity of every query with every key
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    
    # Apply mask (e.g. the causal mask in decoder self-attention)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    # Softmax over the key dimension gives the attention weights
    attention_weights = torch.softmax(scores, dim=-1)
    
    # Compute weighted sum of values
    output = torch.matmul(attention_weights, V)
    
    return output, attention_weights
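
As a quick shape check (an illustrative usage of the function above; the batch size, sequence length, and dimensions are arbitrary, not taken from the paper):

# Batch of 2 sequences, length 10, d_k = d_v = 64 (illustrative sizes)
Q = torch.randn(2, 10, 64)
K = torch.randn(2, 10, 64)
V = torch.randn(2, 10, 64)

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)      # torch.Size([2, 10, 64])
print(weights.shape)  # torch.Size([2, 10, 10])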

Mathematical formulation:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Key design choice: The 1/√d_k scaling keeps the dot products from growing large in magnitude as d_k increases; without it, the softmax is pushed into regions where its gradients are extremely small.
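
A minimal sketch of why this matters (illustrative only, reusing the imports above; the numbers are not from the paper): for queries and keys with roughly unit-variance components, the raw dot product has standard deviation around √d_k, so unscaled scores grow with the key dimension and saturate the softmax.

# Unscaled dot products grow like sqrt(d_k); scaling restores unit scale
d_k = 512
q = torch.randn(1000, d_k)
k = torch.randn(1000, d_k)

raw = (q * k).sum(dim=-1)        # unscaled dot products
scaled = raw / math.sqrt(d_k)    # scaled as in the attention formula
print(raw.std(), scaled.std())   # roughly sqrt(512) ≈ 22.6 vs. roughly 1

# Large-magnitude scores saturate the softmax (near one-hot, tiny gradients)
print(torch.softmax(torch.tensor([3.0, 1.0, 0.5]), dim=-1))
print(torch.softmax(torch.tensor([30.0, 10.0, 5.0]), dim=-1))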

2.3 Multi-Head Attention

Instead of performing a single attention function, the model uses multiple attention heads in parallel, each learning different representation subspaces.


Implementation:

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O, num_heads=8):
    """
    Args:
        Q, K, V: Input queries, keys, values (batch, seq_len, d_model)
        W_Q, W_K, W_V: Per-head projection matrices, each (d_model, d_k)
        W_O: Output projection matrix (num_heads * d_v, d_model)
        num_heads: Number of attention heads (h)
    
    In the base model, d_k = d_v = d_model // num_heads = 64.
    """
    heads = []
    for i in range(num_heads):
        # Linear projections into each head's subspace
        Q_i = Q @ W_Q[i]  # (batch, seq_len, d_k)
        K_i = K @ W_K[i]
        V_i = V @ W_V[i]
        
        # Apply scaled dot-product attention; keep only the output here
        head_i, _ = scaled_dot_product_attention(Q_i, K_i, V_i)
        heads.append(head_i)
    
    # Concatenate all heads along the feature dimension
    multi_head = torch.cat(heads, dim=-1)  # (batch, seq_len, num_heads * d_v)
    
    # Final linear projection back to d_model
    output = multi_head @ W_O
    
    return output

Parameters:

  • Base model: h = 8 heads, d_k = d_v = 64 dimensions per head
  • Total computational cost similar to single-head attention with full dimensionality
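
A quick shape check with randomly initialized projection matrices (a sketch only; in a real model these weights are learned parameters, and the sizes simply reuse the base configuration):

# Illustrative shape check; W_Q/W_K/W_V/W_O would normally be learned
d_model, h = 512, 8
d_k = d_v = d_model // h

W_Q = [torch.randn(d_model, d_k) for _ in range(h)]
W_K = [torch.randn(d_model, d_k) for _ in range(h)]
W_V = [torch.randn(d_model, d_v) for _ in range(h)]
W_O = torch.randn(h * d_v, d_model)

x = torch.randn(2, 10, d_model)   # (batch, seq_len, d_model)
out = multi_head_attention(x, x, x, W_Q, W_K, W_V, W_O, num_heads=h)
print(out.shape)                  # torch.Size([2, 10, 512])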

2.4 Three Types of Attention in the Transformer

  1. Encoder Self-Attention: Each position attends to all positions in the previous encoder layer
  2. Decoder Self-Attention: Each position attends to all positions up to and including itself, masked to preserve the auto-regressive property (see the mask sketch after this list)
  3. Encoder-Decoder Cross-Attention: Decoder queries attend to encoder outputs (keys and values)
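
A minimal sketch of the causal mask used in decoder self-attention, following the convention in the attention code above (entries equal to 0 are blocked):

# Lower-triangular mask: position i may attend to positions 0..i only
seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(causal_mask)

# Passing the mask zeroes out attention to "future" positions
Q = K = V = torch.randn(1, seq_len, 64)
out, weights = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(weights[0])  # upper triangle is (numerically) zero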

2.5 Position-wise Feed-Forward Networks

After attention, each position passes through an identical feed-forward network:

def position_wise_ffn(x, W_1, b_1, W_2, b_2):
    """
    FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
    
    Two linear transformations with a ReLU in between, applied
    independently and identically at every position.
    """
    hidden = torch.relu(x @ W_1 + b_1)  # (batch, seq_len, d_ff)
    output = hidden @ W_2 + b_2         # (batch, seq_len, d_model)
    return output

  • Inner-layer dimensionality: d_ff = 2048 (with d_model = 512 in the base model)
  • Can be viewed as two convolutions with kernel size 1
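
The kernel-size-1 convolution view can be written directly with torch.nn layers (a sketch under that interpretation; the module layout is illustrative, not the paper's code):

import torch
import torch.nn as nn

# The position-wise FFN expressed as two convolutions with kernel size 1
d_model, d_ff = 512, 2048
ffn_conv = nn.Sequential(
    nn.Conv1d(d_model, d_ff, kernel_size=1),
    nn.ReLU(),
    nn.Conv1d(d_ff, d_model, kernel_size=1),
)

x = torch.randn(2, 10, d_model)                   # (batch, seq_len, d_model)
y = ffn_conv(x.transpose(1, 2)).transpose(1, 2)   # Conv1d expects (batch, channels, seq_len)
print(y.shape)                                    # torch.Size([2, 10, 512])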

2.6 Positional Encoding

Since the model has no recurrence or convolution, positional encodings inject sequence order information:

import numpy as np

def positional_encoding(max_len, d_model):
    """
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    pe = np.zeros((max_len, d_model))
    
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # The loop variable i steps over even columns, so it already
            # equals the "2i" in the formula above
            angle = pos / (10000 ** (i / d_model))
            pe[pos, i] = np.sin(angle)
            pe[pos, i + 1] = np.cos(angle)
    
    return pe

Why sinusoidal functions?

  • Allow the model to attend by relative positions
  • For any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos) (see the identities below)
  • May extrapolate to longer sequences than seen during training
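
The relative-position property follows from the standard angle-addition identities (writing ω_i for the frequency 1/10000^(2i/d_model)):

  sin(ω_i·(pos + k)) = sin(ω_i·pos)·cos(ω_i·k) + cos(ω_i·pos)·sin(ω_i·k)
  cos(ω_i·(pos + k)) = cos(ω_i·pos)·cos(ω_i·k) − sin(ω_i·pos)·sin(ω_i·k)

Each (sin, cos) pair at position pos + k is therefore a fixed rotation, by angle ω_i·k, of the pair at position pos: a linear transformation that depends only on the offset k, not on the absolute position.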

3. Why Self-Attention? Comparative Analysis

3.1 Complexity Comparison

| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
|---|---|---|---|
| Self-Attention | O(n² · d) | O(1) | O(1) |
| Recurrent | O(n · d²) | O(n) | O(n) |
| Convolutional | O(k · n · d²) | O(1) | O(log_k(n)) |

Where:

  • n = sequence length
  • d = representation dimension
  • k = kernel size
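
As a rough, illustrative comparison (the specific values n = 50, d = 512, k = 3 are assumptions for a typical sentence, not figures from the paper), the per-layer operation counts work out as follows:

# Rough per-layer operation counts for an illustrative sentence
n, d, k = 50, 512, 3           # sequence length, model dimension, kernel size

self_attention = n**2 * d      # 1,280,000
recurrent      = n * d**2      # 13,107,200
convolutional  = k * n * d**2  # 39,321,600

print(self_attention, recurrent, convolutional)

This is the regime the paper emphasizes: when n is smaller than d, self-attention does less work per layer than a recurrent layer while also keeping the maximum path length constant.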

3.2 Key Advantages

  1. Constant path length: Self-attention connects all positions with O(1) sequential operations, making it easier to learn long-range dependencies

  2. Parallelization: Unlike RNNs, all positions can be processed simultaneously

  3. Computational efficiency: When n < d (typical for sentence representations), self-attention is faster than recurrent layers

  4. Interpretability: Attention distributions can be visualized to understand what the model learns


4. Training Details

4.1 Training Configuration

Dataset:

  • WMT 2014 English-German: ~4.5M sentence pairs
  • WMT 2014 English-French: 36M sentence pairs
  • Byte-pair encoding with shared vocabulary

Hardware:

  • 8 NVIDIA P100 GPUs
  • Base model: 0.4 seconds/step, 100K steps (12 hours)
  • Big model: 1.0 seconds/step, 300K steps (3.5 days)

4.2 Optimizer: Warmup + Decay

def learning_rate_schedule(step_num, d_model=512, warmup_steps=4000):
    """
    lrate = d_model^(-0.5) * min(step_num^(-0.5),
                                  step_num * warmup_steps^(-1.5))
    
    step_num is 1-indexed: the rate increases linearly for the first
    warmup_steps steps, then decays proportionally to step_num^(-0.5).
    """
    arg1 = step_num ** (-0.5)
    arg2 = step_num * (warmup_steps ** (-1.5))
    
    return (d_model ** (-0.5)) * min(arg1, arg2)
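
A quick way to see the shape of this schedule is to evaluate it at a few steps (a sketch that reuses the function above; the printed values are computed, not quoted from the paper):

# Learning rate rises linearly during warmup, then decays as step^(-0.5)
for step in [1, 1000, 4000, 10000, 100000]:
    print(step, learning_rate_schedule(step))

The peak occurs at step 4000 (= warmup_steps), where the two terms inside min() coincide.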

Optimizer: Adam with β₁ = 0.9, β₂ = 0.98, ε = 10⁻⁹

4.3 Regularization Techniques

  1. Residual Dropout: P_drop = 0.1 (base model), 0.3 (big model); the big EN-FR model used 0.1

    • Applied to attention outputs and feed-forward outputs
    • Applied to embedding + positional encoding sums
  2. Label Smoothing: ε_ls = 0.1

    • Hurts perplexity but improves accuracy and BLEU
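
A minimal sketch of how label smoothing with ε_ls = 0.1 can be realized (one common construction, assumed here, spreads ε over the non-target classes; this is an illustration, not the paper's training code):

import torch

def smoothed_targets(labels, vocab_size, eps=0.1):
    # 1 - eps on the true class, eps shared equally among the other classes
    targets = torch.full((labels.shape[0], vocab_size), eps / (vocab_size - 1))
    targets.scatter_(1, labels.unsqueeze(1), 1.0 - eps)
    return targets

labels = torch.tensor([2, 0])                  # true token ids for a batch of 2
print(smoothed_targets(labels, vocab_size=5))  # each row sums to 1.0

Training against these softer targets makes the model less confident, which raises measured perplexity but, as reported in the paper, improves accuracy and BLEU.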

5. Key Results

5.1 Machine Translation Performance

WMT 2014 English-to-German:

| Model | BLEU Score | Training Cost (FLOPs) |
|---|---|---|
| ByteNet | 23.75 | - |
| GNMT + RL | 24.6 | 2.3×10¹⁹ |
| ConvS2S | 25.16 | 9.6×10¹⁸ |
| Transformer (base) | 27.3 | 3.3×10¹⁸ |
| Transformer (big) | 28.4 | 2.3×10¹⁹ |

WMT 2014 English-to-French:

| Model | BLEU Score | Training Cost (FLOPs) |
|---|---|---|
| GNMT + RL | 39.92 | 1.4×10²⁰ |
| ConvS2S | 40.46 | 1.5×10²⁰ |
| Transformer (base) | 38.1 | 3.3×10¹⁸ |
| Transformer (big) | 41.8 | 2.3×10¹⁹ |

Key Achievement: The Transformer (big) achieves 28.4 BLEU on EN-DE, outperforming all previously reported models, including ensembles, by more than 2 BLEU points, while training in just 3.5 days.

5.2 Model Variations Study

Impact of attention heads (Table 3, rows A):

  • Single head: 0.9 BLEU worse than best setting
  • 8 heads (base): Optimal balance
  • Too many heads also degrades quality

Impact of attention key size (rows B):

  • Reducing d_k hurts model quality
  • Suggests compatibility computation is non-trivial

Model size matters (rows C):

  • Bigger models consistently perform better
  • Dropout crucial for avoiding overfitting

Positional encoding (row E):

  • Sinusoidal vs. learned embeddings: nearly identical results

5.3 English Constituency Parsing

To test generalization beyond translation:

| Parser | Training Data | WSJ 23 F1 |
|---|---|---|
| Vinyals & Kaiser (2014) | WSJ only | 88.3 |
| Dyer et al. (2016) | WSJ only | 91.7 |
| Transformer (4 layers) | WSJ only | 91.3 |
| BerkeleyParser | Semi-supervised | 92.1 |
| Transformer (4 layers) | Semi-supervised | 92.7 |

Surprising finding: Despite no task-specific tuning, the Transformer outperforms the BerkeleyParser even with only 40K training sentences.


6. Practical Implications

6.1 Real-World Applications

  1. Machine Translation: State-of-the-art quality with dramatically reduced training time

    • Production systems can be trained in days instead of weeks
    • Lower computational cost enables more experimentation
  2. Sequence Modeling: General architecture applicable to:

    • Language modeling
    • Text summarization
    • Question answering
    • Constituency parsing
  3. Parallelization Benefits:

    • Efficient use of modern GPU/TPU hardware
    • Scales better to longer sequences than RNNs

6.2 Architectural Insights

The attention mechanism provides interpretability:


Different attention heads learn to perform different linguistic tasks automatically.


7. Comparison with Prior Approaches

7.1 Evolution from Prior Approaches

Recurrent Models (LSTMs, GRUs):

  • Sequential bottleneck limits parallelization
  • Difficulty learning long-range dependencies
  • State-of-the-art before Transformers

Convolutional Models (ByteNet, ConvS2S):

  • Parallel computation within layers
  • Path length between distant positions grows with distance (linearly for ConvS2S, logarithmically for ByteNet)
  • Generally more expensive per layer than recurrent layers by a factor of k

Attention Mechanisms:

  • Previously used with recurrent networks
  • Transformer is first to rely entirely on attention

7.2 Key Innovations

  1. Self-attention as primary mechanism: Replaces recurrence entirely
  2. Multi-head attention: Allows attending to different representation subspaces
  3. Positional encoding: Injects sequence order without recurrence
  4. Scaled dot-product: The 1/√d_k factor keeps the softmax out of its small-gradient (saturated) regime when d_k is large


8. Model Architecture Details

8.1 Base Model Configuration

base_config = {
    'N': 6,              # Number of layers
    'd_model': 512,      # Model dimension
    'd_ff': 2048,        # Feed-forward dimension
    'h': 8,              # Number of attention heads
    'd_k': 64,           # Key dimension (d_model / h)
    'd_v': 64,           # Value dimension
    'P_drop': 0.1,       # Dropout rate
    'params': 65e6       # Total parameters
}

8.2 Big Model Configuration

big_config = {
    'N': 6,
    'd_model': 1024,
    'd_ff': 4096,
    'h': 16,
    'd_k': 64,
    'd_v': 64,
    'P_drop': 0.3,
    'params': 213e6
}

9. Conclusion and Impact

9.1 Key Contributions

  1. First sequence transduction model based entirely on attention
  2. Superior translation quality with significantly faster training
  3. Strong generalization to other tasks (parsing)
  4. Highly parallelizable architecture

9.2 Future Directions (from paper)

  • Extend to other modalities (images, audio, video)
  • Investigate local, restricted attention for large inputs
  • Make generation less sequential

9.3 Historical Impact

The Transformer architecture has become the foundation for modern NLP, spawning models like BERT, GPT, T5, and countless others. Its influence extends beyond NLP to computer vision (Vision Transformers) and multimodal learning.

Code availability: https://github.com/tensorflow/tensor2tensor


Appendix: Attention Visualization Examples

The paper includes visualizations showing attention heads learning interpretable patterns:

  • Anaphora resolution: Attention heads that resolve pronouns to their referents
  • Syntactic dependencies: Heads that capture grammatical relationships
  • Long-range dependencies: Heads that connect distant related words

These visualizations demonstrate that the model learns linguistically meaningful representations without explicit supervision.