
Transformer Architecture: Compute & Data Movement Characteristics

Understanding the phases of transformer architecture and their NoC traffic patterns, from embedding to self-attention to feedforward layers.


Prerequisites

Make sure you're familiar with these concepts before diving in:

Computer Architecture
Neural Networks
Matrix Operations
Network On Chip Basics

Learning Objectives

By the end of this topic, you will be able to:

Understand transformer computation phases and their data movement patterns
Analyze NoC traffic characteristics during multi-head self-attention
Identify bottlenecks in transformer inference and training workloads
Design NoC optimizations specific to transformer model requirements
Compare compute vs memory-bound phases in transformer execution

98% of this content was generated by LLM.

šŸš€ Phases of Transformer Architecture in Terms of Computation & Data Movement

The Transformer model (used in GPT, BERT, LLaMA, etc.) consists of multiple stages of computation, each with unique NoC (Network-on-Chip) stress patterns. Understanding these phases is crucial for optimizing AI accelerators.

šŸ“Œ 1. High-Level Phases of Transformer Computation

The Transformer model operates in five major phases during training and inference:

| Phase | Computation Type | Data Movement Characteristics | Latency Bottlenecks |
|---|---|---|---|
| 1. Embedding & Input Projection | Lookup tables, matrix multiplications | Reads from memory, low bandwidth | Memory access time |
| 2. Multi-Head Self-Attention (MHSA) | Matrix multiplications (QKV), softmax | High bandwidth, many-to-many communication | NoC congestion |
| 3. Feedforward Layers (MLP) | Fully connected (FC) layers, activation functions | Less bandwidth, structured memory access | Memory latency |
| 4. Layer Norm & Residual Connections | Element-wise operations, normalization | Small memory accesses, low NoC traffic | Minimal latency impact |
| 5. Output Projection & Softmax | Softmax, final probability computation | Heavy memory writes | Last-layer memory bottleneck |

šŸš€ Key Takeaway:

  • The MHSA phase is the most NoC-intensive part due to massive all-to-all communication (a rough traffic estimate is sketched below).
  • Feedforward (MLP) layers are compute-heavy but require structured memory access.
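
To see why the attention phase dominates as sequences grow, the back-of-envelope sketch below compares how QK^T score traffic and MLP activation traffic scale with sequence length. The hidden size, head count, FP16 element size, and sequence lengths are illustrative assumptions, not measurements from any particular accelerator or NoC.

```python
# Back-of-envelope sketch: attention-score traffic grows quadratically with
# sequence length, MLP activation traffic only linearly. All dimensions are
# illustrative assumptions.
d_model, n_heads, bytes_elt = 4096, 32, 2  # hidden size, heads, FP16 bytes

def mhsa_score_bytes(seq_len):
    # QK^T score matrix: one (seq_len x seq_len) tile per head.
    return n_heads * seq_len * seq_len * bytes_elt

def mlp_activation_bytes(seq_len):
    # 4x expansion layer: (seq_len x 4*d_model) intermediate activations.
    return seq_len * 4 * d_model * bytes_elt

for s in (512, 2048, 8192):
    print(f"seq_len={s:5d}  attention scores: {mhsa_score_bytes(s)/2**20:8.1f} MiB"
          f"  MLP activations: {mlp_activation_bytes(s)/2**20:8.1f} MiB")
```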

[Figure: Transformer Architecture Overview]

šŸ“š References

For deeper understanding of Transformer architecture and visualization:

  • The Illustrated Transformer by Jay Alammar - An excellent visual guide to understanding the Transformer architecture with step-by-step illustrations
  • Transformer Explainer - An interactive visualization tool that lets you explore how Transformers work at multiple levels of abstraction

šŸ“Œ 2. Step-by-Step Transformer Data Flow

šŸš€ Phase 1: Embedding & Input Projection

šŸ”¹ Computation: Convert input tokens into dense vector embeddings. Perform matrix multiplications to project embeddings into the model's hidden space.

šŸ”¹ Data Movement in NoC:

| Operation | NoC Traffic Type |
|---|---|
| Read embeddings from memory | Memory-to-core transfer (HBM) |
| Compute input projections | Local core communication |
| Store projected embeddings | Write to DRAM/HBM |

āœ… NoC Behavior:

  • Low traffic → mostly memory-bound, not NoC-intensive.
  • Bottleneck: DRAM bandwidth if embeddings are large.
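
As a concrete reference point, here is a minimal NumPy sketch of Phase 1: a gather from an embedding table followed by a dense input projection. The vocabulary size, hidden size, and sequence length are illustrative assumptions.

```python
import numpy as np

# Phase 1 sketch: embedding lookup + input projection (illustrative sizes).
vocab_size, d_model, seq_len = 32000, 512, 16

embedding_table = np.random.randn(vocab_size, d_model).astype(np.float32)
w_in            = np.random.randn(d_model, d_model).astype(np.float32)

token_ids = np.random.randint(0, vocab_size, size=seq_len)

# Lookup: a read-heavy, low-bandwidth gather from (HBM-resident) memory.
x = embedding_table[token_ids]          # (seq_len, d_model)

# Projection: a dense matmul that stays local to the compute core.
x = x @ w_in                            # (seq_len, d_model)
print(x.shape)
```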

šŸš€ Phase 2: Multi-Head Self-Attention (MHSA)

šŸ“Œ Most NoC-Intensive Phase!

šŸ”¹ Computation:

  1. Compute Query (Q), Key (K), and Value (V) matrices.
  2. Perform QK^T (Attention Score Calculation).
  3. Apply Softmax & Weighted Sum of Values.

šŸ”¹ Data Movement in NoC:

| Operation | NoC Traffic Type |
|---|---|
| Broadcast Key (K) and Value (V) to all heads | All-to-all (many-to-many) |
| Compute QK^T | Memory-intensive tensor multiplication |
| Softmax normalization | Local core memory accesses |
| Weighted sum of values | High-bandwidth data movement |

āœ… NoC Behavior:

  • Extreme congestion → the NoC must support high-bandwidth many-to-many traffic.
  • Major bottleneck → memory-bound attention operations slow down inference.
  • Optimization needed → hierarchical interconnects (NVLink, Infinity Fabric) reduce contention.
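
A minimal NumPy sketch of the three MHSA steps above (QKV projection, QK^T scores, softmax plus weighted sum) follows. The dimensions are illustrative; in a real accelerator each head/token tile would live on a different core, which is where the many-to-many traffic comes from.

```python
import numpy as np

# Phase 2 sketch: single-sequence multi-head self-attention (illustrative sizes).
seq_len, d_model, n_heads = 16, 512, 8
d_head = d_model // n_heads

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

x = np.random.randn(seq_len, d_model).astype(np.float32)
w_q, w_k, w_v = (np.random.randn(d_model, d_model).astype(np.float32) for _ in range(3))

# Step 1: QKV projections (local matmuls on each core).
q = (x @ w_q).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)  # (heads, seq, d_head)
k = (x @ w_k).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
v = (x @ w_v).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

# Step 2: QK^T scores -- every query attends to every key, which is what
# turns into all-to-all traffic once heads/tokens are spread across cores.
scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)

# Step 3: softmax + weighted sum of values.
attn = softmax(scores, axis=-1)
out  = (attn @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
print(out.shape)
```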

šŸš€ Phase 3: Feedforward Layers (MLP)

šŸ“Œ Compute-Intensive Phase

šŸ”¹ Computation:

  1. Linear transformation via fully connected (FC) layers.
  2. Non-linear activation functions (ReLU, GeLU, SiLU).

šŸ”¹ Data Movement in NoC:

| Operation | NoC Traffic Type |
|---|---|
| FC layer computation | Core-local memory access |
| Activation function (ReLU, GeLU) | Minimal memory movement |
| Store intermediate results | Write to HBM (if batch size is large) |

āœ… NoC Behavior:

  • Structured memory access → less NoC congestion than attention.
  • Compute-bound bottleneck → optimized tensor cores help accelerate FC layers.
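
Below is a minimal NumPy sketch of the feedforward block: two dense matmuls with a 4Ɨ hidden expansion and a GeLU in between. The sizes and the tanh GeLU approximation are illustrative assumptions.

```python
import numpy as np

# Phase 3 sketch: position-wise feedforward block with 4x expansion (illustrative sizes).
seq_len, d_model = 16, 512
d_ff = 4 * d_model

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x  = np.random.randn(seq_len, d_model).astype(np.float32)
w1 = np.random.randn(d_model, d_ff).astype(np.float32)
w2 = np.random.randn(d_ff, d_model).astype(np.float32)

# Two dense matmuls: compute-bound, with structured (streaming) weight reads
# rather than the many-to-many traffic of attention.
h   = gelu(x @ w1)        # (seq_len, d_ff)
out = h @ w2              # (seq_len, d_model)
print(out.shape)
```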

šŸš€ Phase 4: Layer Norm & Residual Connections

šŸ“Œ Lightweight Memory Operations

šŸ”¹ Computation:

  1. Normalize activation outputs (LayerNorm).
  2. Add residual connection (skip connection).

šŸ”¹ Data Movement in NoC:

| Operation | NoC Traffic Type |
|---|---|
| Read intermediate activations | Memory-to-core transfer |
| Apply element-wise LayerNorm | Minimal NoC load |
| Perform residual sum | Low-bandwidth local computation |

āœ… NoC Behavior:

  • Low NoC stress → no global communication required.
  • Minimal bottlenecks → mostly memory-latency bound.
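
A minimal NumPy sketch of this phase, assuming the post-norm ordering of the original Transformer (many modern models use pre-norm instead):

```python
import numpy as np

# Phase 4 sketch: residual add + LayerNorm. Purely element-wise / per-row work,
# so data movement stays local to each core.
seq_len, d_model = 16, 512

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var  = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

residual = np.random.randn(seq_len, d_model).astype(np.float32)  # block input
sublayer = np.random.randn(seq_len, d_model).astype(np.float32)  # attention/MLP output

# Post-norm ordering (original Transformer): normalize the residual sum.
out = layer_norm(residual + sublayer)
print(out.shape)
```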

šŸš€ Phase 5: Output Projection & Softmax

šŸ“Œ Final Memory-Intensive Step

šŸ”¹ Computation: Compute final token probabilities using softmax. Select the next token during inference.

šŸ”¹ Data Movement in NoC:

| Operation | NoC Traffic Type |
|---|---|
| Compute output probabilities | High memory bandwidth needed |
| Store results for next token | Memory write operation |

āœ… NoC Behavior:

  • Latency bottleneck in the last layer → softmax reads large activation data.
  • Memory-bandwidth limited → with large batch sizes, DRAM access slows down processing.
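
A minimal NumPy sketch of the final step, assuming greedy decoding; the vocabulary and hidden sizes are illustrative.

```python
import numpy as np

# Phase 5 sketch: project the last hidden state onto the vocabulary, apply
# softmax, and pick the next token greedily (illustrative sizes).
vocab_size, d_model = 32000, 512

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

hidden_last = np.random.randn(d_model).astype(np.float32)               # last-token hidden state
w_out       = np.random.randn(d_model, vocab_size).astype(np.float32)   # large matrix -> heavy memory traffic

logits     = hidden_last @ w_out          # (vocab_size,)
probs      = softmax(logits)
next_token = int(np.argmax(probs))        # greedy decoding; sampling is also common
print(next_token, probs.shape)
```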

šŸ“Œ 3. How NoC Behavior Changes Over Time

The NoC traffic pattern changes dynamically across transformer layers.

āœ… Transformer NoC Traffic Over Time

| Phase | Traffic Pattern | Bottleneck |
|---|---|---|
| Embedding | Low traffic (read-heavy) | Memory latency |
| MHSA (Self-Attention) | All-to-all NoC congestion | Memory bandwidth & communication delays |
| MLP (Feedforward Layers) | Compute-heavy, structured NoC usage | Compute efficiency |
| LayerNorm & Residual | Minimal NoC traffic | None |
| Output Projection | Memory writes, softmax communication | DRAM bandwidth |

šŸ“Œ Observations:

  • Early phases (Embedding, Attention) are memory-bound.
  • MHSA creates the most NoC congestion (all-to-all traffic).
  • MLP (Feedforward) is compute-heavy, but its NoC load is lower.

šŸ“Œ 4. NoC Optimizations for Transformer Models

Since MHSA creates the most NoC congestion, AI accelerators optimize their interconnects:

āœ… Techniques to Optimize NoC for Transformer Workloads

| Optimization | Benefit |
|---|---|
| Hierarchical Interconnects (NVLink, Infinity Fabric) | Reduces NoC congestion by distributing traffic. |
| 3D NoC Architectures | Reduces average hop count and improves bandwidth. |
| Express Virtual Channels (EVCs) | Allows priority paths for critical tensor transfers. |
| Sparse Attention Techniques | Reduces the total number of all-to-all memory accesses. |
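
To make the last row concrete, the sketch below counts how many QK^T score entries (and hence K/V fetches) a simple sliding-window sparse-attention mask keeps relative to dense attention. The sequence length and window size are illustrative assumptions; real sparse-attention schemes (block-sparse, local+global, etc.) differ in detail.

```python
import numpy as np

# Sliding-window sparse attention: each query only attends to keys within a
# local window, shrinking the all-to-all score matrix and the K/V traffic
# behind it. Sequence length and window size are illustrative.
seq_len, window = 1024, 128

# Boolean mask: query i may only attend to keys j with |i - j| <= window.
i = np.arange(seq_len)[:, None]
j = np.arange(seq_len)[None, :]
mask = np.abs(i - j) <= window

dense_accesses  = seq_len * seq_len       # full QK^T
sparse_accesses = int(mask.sum())         # entries actually computed/fetched

print(f"dense  score entries: {dense_accesses}")
print(f"sparse score entries: {sparse_accesses} "
      f"({100 * sparse_accesses / dense_accesses:.1f}% of dense)")
```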

šŸ“Œ 5. Final Takeaways

āœ… Self-Attention (MHSA) is the biggest NoC bottleneck due to all-to-all communication.

āœ… MLP layers stress compute but not the NoC as much (mostly structured memory accesses).