Transformer Architecture: Compute & Data Movement Characteristics
Understanding the phases of transformer architecture and their NoC traffic patterns, from embedding to self-attention to feedforward layers.
Note: 98% of this content was generated by an LLM.
Table of Contents
- Phases of Transformer Architecture in Terms of Computation & Data Movement
Phases of Transformer Architecture in Terms of Computation & Data Movement
The Transformer model (used in GPT, BERT, LLaMA, etc.) consists of multiple stages of computation, each with unique NoC (Network-on-Chip) stress patterns. Understanding these phases is crucial for optimizing AI accelerators.
1. High-Level Phases of Transformer Computation
The Transformer model operates in five major phases during training and inference:
| Phase | Computation Type | Data Movement Characteristics | Latency Bottlenecks |
|---|---|---|---|
| 1. Embedding & Input Projection | Lookup tables, matrix multiplications | Reads from memory, low bandwidth | Memory access time |
| 2. Multi-Head Self-Attention (MHSA) | Matrix multiplications (QKV), softmax | High bandwidth, many-to-many communication | NoC congestion |
| 3. Feedforward Layers (MLP) | Fully connected layers (FC), activation functions | Less bandwidth, structured memory access | Memory latency |
| 4. Layer Norm & Residual Connections | Element-wise operations, normalization | Small memory access, low NoC traffic | Minimal latency impact |
| 5. Output Projection & Softmax | Softmax, final probability computation | Heavy memory writes | Last-layer memory bottleneck |
Key Takeaway:
- MHSA phase is the most NoC-intensive part due to massive all-to-all communication.
- Feedforward (MLP) layers are compute-heavy but require structured memory access.
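To make the relative costs in the table above concrete, here is a rough back-of-the-envelope cost model for a single transformer block. The FLOP formulas are the standard dense-transformer estimates; the batch size, fp16 precision, 4x MLP expansion, and example dimensions are illustrative assumptions rather than values taken from this article.

```python
# Rough, illustrative cost model for one dense transformer block.
# Assumptions (illustrative, not from this article): batch size 1, fp16
# activations (2 bytes/element), MLP expansion factor 4, dense attention.
def block_costs(seq_len: int, d_model: int, n_heads: int) -> dict:
    # FLOPs: one multiply-accumulate counted as 2 FLOPs.
    qkv_proj = 3 * 2 * seq_len * d_model * d_model      # Q, K, V projections
    attn_scores = 2 * seq_len * seq_len * d_model       # QK^T over all heads
    attn_values = 2 * seq_len * seq_len * d_model       # softmax(QK^T) @ V
    out_proj = 2 * seq_len * d_model * d_model           # output projection
    mlp = 2 * 2 * seq_len * d_model * (4 * d_model)      # two FC layers, 4x expansion

    # Intermediate tensors that must move across the NoC / memory system.
    b = 2  # bytes per fp16 element
    qkv_bytes = 3 * seq_len * d_model * b                # Q, K, V activations
    score_bytes = n_heads * seq_len * seq_len * b        # attention matrices
    mlp_bytes = seq_len * 4 * d_model * b                # expanded MLP activations

    return {
        "attention_flops": qkv_proj + attn_scores + attn_values + out_proj,
        "mlp_flops": mlp,
        "attention_traffic_bytes": qkv_bytes + score_bytes,
        "mlp_traffic_bytes": mlp_bytes,
    }

# Example: a GPT-2-small-sized block (d_model=768, 12 heads) at a 2048-token context.
print(block_costs(seq_len=2048, d_model=768, n_heads=12))
```

At long sequence lengths the per-head seq_len x seq_len attention matrices dominate the traffic estimate, which is one way to see why the MHSA phase stresses the interconnect far more than the MLP phase.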
2. References
For a deeper understanding of the Transformer architecture and its visualization:
- The Illustrated Transformer by Jay Alammar - An excellent visual guide to understanding the Transformer architecture with step-by-step illustrations
- Transformer Explainer - An interactive visualization tool that lets you explore how Transformers work at multiple levels of abstraction
3. Step-by-Step Transformer Data Flow
3.1 Phase 1: Embedding & Input Projection
Computation:
- Convert input tokens into dense vector embeddings.
- Perform matrix multiplications to project the embeddings into the model's hidden space.
Data Movement in NoC:
| Operation | NoC Traffic Type |
|---|---|
| Read embeddings from memory | Memory-to-core transfer (HBM) |
| Compute input projections | Local core communication |
| Store projected embeddings | Write to DRAM/HBM |
NoC Behavior:
- Low traffic → mostly memory-bound, not NoC-intensive.
- Bottleneck: DRAM bandwidth if embeddings are large.
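A minimal NumPy sketch of this phase is shown below; the vocabulary size, model width, token ids, and random weights are placeholder assumptions used only for illustration.

```python
import numpy as np

# Phase 1 sketch: embedding lookup (a memory gather) followed by an input
# projection (a core-local matmul). Sizes and weights are illustrative.
vocab_size, d_model = 50_000, 768
rng = np.random.default_rng(0)

embedding_table = rng.standard_normal((vocab_size, d_model)).astype(np.float32)
w_in = rng.standard_normal((d_model, d_model)).astype(np.float32) * 0.02

token_ids = np.array([101, 2054, 2003])   # placeholder token ids
x = embedding_table[token_ids]            # lookup: reads rows from DRAM/HBM
x = x @ w_in                              # projection: local compute on the core
# x would now be written back to DRAM/HBM as the projected embeddings.
```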
3.2 Phase 2: Multi-Head Self-Attention (MHSA)
Most NoC-Intensive Phase!
Computation:
- Compute Query (Q), Key (K), and Value (V) matrices.
- Perform QK^T (Attention Score Calculation).
- Apply Softmax & Weighted Sum of Values.
Data Movement in NoC:
| Operation | NoC Traffic Type |
|---|---|
| Broadcast Key (K) and Value (V) to all heads | All-to-all (many-to-many) |
| Compute QK^T | Memory-intensive tensor multiplication |
| Softmax normalization | Local core memory accesses |
| Weighted sum of values | High-bandwidth data movement |
NoC Behavior:
- Extreme congestion → the NoC must support high-bandwidth many-to-many traffic.
- Major bottleneck → memory-bound attention operations slow down inference.
- Optimization needed → hierarchical interconnects (NVLink, Infinity Fabric) reduce contention.
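The sketch below implements dense multi-head self-attention in NumPy so the three computation steps above are visible in code. The sizes are illustrative and the weights are random placeholders; the point to notice is the seq_len x seq_len score matrix per head, which is what drives the heavy traffic described above.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, n_heads, rng):
    """Dense MHSA sketch; weights are random placeholders."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    w_q, w_k, w_v, w_o = (
        (rng.standard_normal((d_model, d_model)) * 0.02).astype(np.float32)
        for _ in range(4)
    )

    # Project, then split into heads: shape (n_heads, seq_len, d_head).
    def heads(t):
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = heads(x @ w_q), heads(x @ w_k), heads(x @ w_v)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # QK^T: seq_len^2 per head
    attn = softmax(scores, axis=-1)                      # local, element-wise
    out = attn @ v                                       # weighted sum of values
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ w_o                                     # output projection

rng = np.random.default_rng(1)
x = rng.standard_normal((128, 768)).astype(np.float32)
y = multi_head_attention(x, n_heads=12, rng=rng)         # shape (128, 768)
```

Every head needs the full K and V for the whole sequence, so on a many-core accelerator these tensors must reach every core that owns part of the attention computation, which is the many-to-many pattern in the table above.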
3.3 Phase 3: Feedforward Layers (MLP)
Compute-Intensive Phase
Computation:
- Linear transformation via fully connected (FC) layers.
- Non-linear activation functions (ReLU, GeLU, SiLU).
Data Movement in NoC:
| Operation | NoC Traffic Type |
|---|---|
| FC layer computation | Core-local memory access |
| Activation function (ReLU, GeLU) | Minimal memory movement |
| Store intermediate results | Write to HBM (if batch size is large) |
NoC Behavior:
- Structured memory access → less NoC congestion than attention.
- Compute-bound bottleneck → optimized tensor cores help accelerate FC layers.
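Below is a minimal NumPy sketch of the position-wise feedforward block, assuming the common 4x hidden expansion and a GeLU activation; dimensions and random weights are illustrative.

```python
import numpy as np

def gelu(z):
    # tanh approximation of GeLU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def feedforward(x, w1, b1, w2, b2):
    h = gelu(x @ w1 + b1)   # expand d_model -> 4*d_model: dense, core-local matmul
    return h @ w2 + b2      # project back 4*d_model -> d_model

d_model = 768
rng = np.random.default_rng(2)
x = rng.standard_normal((128, d_model)).astype(np.float32)
w1 = (rng.standard_normal((d_model, 4 * d_model)) * 0.02).astype(np.float32)
b1 = np.zeros(4 * d_model, dtype=np.float32)
w2 = (rng.standard_normal((4 * d_model, d_model)) * 0.02).astype(np.float32)
b2 = np.zeros(d_model, dtype=np.float32)

y = feedforward(x, w1, b1, w2, b2)   # shape (128, 768)
```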
3.4 Phase 4: Layer Norm & Residual Connections
Lightweight Memory Operations
Computation:
- Normalize activation outputs (LayerNorm).
- Add residual connection (skip connection).
Data Movement in NoC:
| Operation | NoC Traffic Type |
|---|---|
| Read intermediate activations | Memory-to-core transfer |
| Apply element-wise LayerNorm | Minimal NoC load |
| Perform residual sum | Low-bandwidth local computation |
NoC Behavior:
- Low NoC stress → no global communication required.
- Minimal bottlenecks → mostly memory-latency bound.
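A short NumPy sketch of LayerNorm plus a residual add, illustrating that both are per-token, element-wise operations (dimensions are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalization is per token (last axis only): no cross-token communication.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

d_model = 768
rng = np.random.default_rng(3)
x = rng.standard_normal((128, d_model)).astype(np.float32)             # block input
sublayer_out = rng.standard_normal((128, d_model)).astype(np.float32)  # e.g. MHSA output

gamma = np.ones(d_model, dtype=np.float32)
beta = np.zeros(d_model, dtype=np.float32)

y = x + layer_norm(sublayer_out, gamma, beta)  # residual add: element-wise, low NoC load
```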
3.5 Phase 5: Output Projection & Softmax
Final Memory-Intensive Step
Computation:
- Compute final token probabilities using softmax.
- Select the next token during inference.
Data Movement in NoC:
| Operation | NoC Traffic Type |
|---|---|
| Compute output probabilities | High memory bandwidth needed |
| Store results for next token | Memory write operation |
NoC Behavior:
- Latency bottleneck in the last layer → softmax reads large activation data.
- Memory-bandwidth limited → if the batch size is large, DRAM access slows down processing.
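This final step can be sketched as one large matrix multiply against the output vocabulary followed by a softmax. The vocabulary size, random weights, and greedy token selection below are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

vocab_size, d_model = 50_000, 768
rng = np.random.default_rng(4)
w_out = (rng.standard_normal((d_model, vocab_size)) * 0.02).astype(np.float32)

hidden = rng.standard_normal((1, d_model)).astype(np.float32)  # last position's hidden state
logits = hidden @ w_out              # reads a d_model x vocab_size weight matrix from memory
probs = softmax(logits)              # probability distribution over the vocabulary
next_token = int(np.argmax(probs))   # greedy decoding; sampling is also common
```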
4. How NoC Behavior Changes Over Time
The NoC traffic pattern changes dynamically across transformer layers.
Transformer NoC Traffic Over Time
| Phase | Traffic Pattern | Bottleneck |
|---|---|---|
| Embedding | Low traffic (read-heavy) | Memory latency |
| MHSA (Self-Attention) | All-to-all NoC congestion | Memory bandwidth & communication delays |
| MLP (Feedforward Layers) | Compute-heavy, structured NoC usage | Compute efficiency |
| LayerNorm & Residual | Minimal NoC traffic | None |
| Output Projection | Memory writes, softmax communication | DRAM bandwidth |
Observations:
- Early phases (Embedding, Attention) are memory-bound.
- MHSA creates the most NoC congestion (all-to-all traffic).
- MLP (Feedforward) is compute-heavy, but NoC load is lower.
5. NoC Optimizations for Transformer Models
Since MHSA creates the most NoC congestion, AI accelerators optimize their interconnects:
Techniques to Optimize NoC for Transformer Workloads
| Optimization | Benefit |
|---|---|
| Hierarchical Interconnects (NVLink, Infinity Fabric) | Reduces NoC congestion by distributing traffic. |
| 3D NoC Architectures | Reduces average hop count and improves bandwidth. |
| Express Virtual Channels (EVCs) | Allows priority paths for critical tensor transfers. |
| Sparse Attention Techniques | Reduces the total number of all-to-all memory accesses. |
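As a concrete illustration of the last row of the table above, a sliding-window (local) attention pattern restricts each query to a fixed-size neighborhood, shrinking the score matrix and the key traffic from roughly seq_len^2 to seq_len * window entries. The NumPy sketch below shows the masking idea only; it is not any specific accelerator's or library's implementation.

```python
import numpy as np

def sliding_window_scores(q, k, window):
    """Causal local attention scores: each query only reads `window` nearby keys,
    so score storage and K-fetch traffic drop from seq_len^2 to ~seq_len * window."""
    seq_len, d_head = q.shape
    scores = np.full((seq_len, window), -np.inf, dtype=q.dtype)
    for i in range(seq_len):
        lo = max(0, i - window + 1)                 # causal window [lo, i]
        local_k = k[lo:i + 1]                       # only a small slice of K moves
        scores[i, :local_k.shape[0]] = (q[i] @ local_k.T) / np.sqrt(d_head)
    return scores

rng = np.random.default_rng(5)
q = rng.standard_normal((2048, 64)).astype(np.float32)
k = rng.standard_normal((2048, 64)).astype(np.float32)

s = sliding_window_scores(q, k, window=128)   # 2048*128 entries instead of 2048*2048
```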
6. Final Takeaways
- Self-Attention (MHSA) is the biggest NoC bottleneck due to all-to-all communication.
- MLP layers stress compute but not the NoC as much (mostly structured memory accesses).