RNGD Tensor-Contraction Processor
A paradigm-shifting AI accelerator built on tensor contraction primitives for LLM inference, achieving 512 TOPS on a 653mm² die at a 150W TDP and 4.1× better performance per watt than competing GPUs.
Architectural Highlights
- Tensor contraction as the fundamental primitive operation vs traditional matrix multiplication
- 653mm² die with 8 processing elements achieving 512 TOPS peak performance
- Hierarchical three-level Network-on-Chip architecture
- Slice-based redundancy with 65 slices per PE (64 active + 1 spare)
- Dual-context execution enabling computation and memory overlap
Innovative Features
- Einstein-summation tensor contraction enabling massive parallelism
- Time-axis pipelining for continuous data flow optimization
- Triple-engine slice architecture: Contraction, Vector, and Transpose engines
- SECDED error correction with distributed ECC across the memory hierarchy
- Address Translation Unit with per-PE memory isolation
- SR-IOV support for cloud virtualization deployment
1. Executive Summary
The RNGD processor, developed by FuriosaAI in collaboration with Dongguk University, represents a paradigm shift in AI accelerator design for Large Language Models (LLMs). Built on a 5nm process node with a massive 653mm² die area, this processor achieves 512 TOPS (Tera Operations Per Second) of compute performance while maintaining a remarkably efficient 150W TDP (Thermal Design Power). The chip's revolutionary approach uses tensor contraction as its fundamental primitive operation rather than traditional matrix multiplication, enabling unprecedented parallelism and energy efficiency.
2. Fundamental Architecture and Tensor Contraction Theory
2.1 Tensor Contraction vs Matrix Multiplication
Traditional AI accelerators map tensor operations onto matrix multiplication units (GEMM - General Matrix Multiply). The RNGD takes a fundamentally different approach by using tensor contraction as its primitive operation.
Mathematical Definition of Tensor Contraction: For tensors $A$ and $B$ sharing a set of contracted indices $j_1, \dots, j_n$, tensor contraction is defined as:

$$C_{i_1 \dots i_m,\, k_1 \dots k_p} = \sum_{j_1, \dots, j_n} A_{i_1 \dots i_m,\, j_1 \dots j_n} \, B_{j_1 \dots j_n,\, k_1 \dots k_p}$$

This operation generalizes Einstein summation notation (a NumPy sketch follows the list below) and allows for:
- Massive parallelism: Multiple dimensions can be processed simultaneously
- Data locality optimization: Better cache utilization through dimensional reordering
- Time-axis pipelining: Similar to vector processors, enabling continuous data flow
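The sketch below expresses an attention-style contraction directly in Einstein-summation form with NumPy and checks it against the reshaped GEMM it generalizes. The shapes and code are illustrative only; they are not an RNGD kernel, just a way to see why a single contraction primitive can absorb the reshapes and transposes a GEMM-based lowering would need.

```python
import numpy as np

B, H, L, D = 2, 8, 128, 64          # batch, heads, sequence length, head dim (illustrative)
q = np.random.rand(B, H, L, D)
k = np.random.rand(B, H, L, D)

# Contract over the shared head dimension d while keeping b, h, and both
# sequence axes free: C[b,h,i,j] = sum_d Q[b,h,i,d] * K[b,h,j,d]
scores = np.einsum("bhid,bhjd->bhij", q, k)

# The same operation forced through GEMM needs explicit reshapes and a transpose.
scores_gemm = (q.reshape(B * H, L, D) @ k.reshape(B * H, L, D).transpose(0, 2, 1)
               ).reshape(B, H, L, L)
assert np.allclose(scores, scores_gemm)
```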
2.2 Core Specifications
Parameter | Value | Calculation/Explanation |
---|---|---|
Process Node | 5nm TSMC | Advanced FinFET technology |
Die Area | 653mm² | Among the largest AI chips |
Peak Performance | 512 TOPS | 8 PEs × 64 TOPS/PE |
Memory Bandwidth | 1.5 TB/s | 2 × HBM3 stacks |
TDP | 150W | Board power limit |
Operating Frequency | 1 GHz (NoC), 2 GHz (CPU) | Dual-clock domain |
SRAM per PE | 32MB | Local storage for tensors |
Total On-chip Memory | 256MB + 28MB + 2MB | 8×32MB tensor SRAM + 8×3.5MB SPM + 8×256KB L2 cache |
2.3 Processing Element (PE) Architecture
Each PE contains:
- 65 slices (64 active + 1 spare for yield improvement)
- Tensor Unit (TU): 64 TOPS compute capability
- CPU Core: RISC-V based, 2GHz, manages control flow
- TDMA Engine: Tensor DMA for asynchronous data movement
- Memory Hierarchy:
  - L1 I/D Cache: 64KB
  - L2 Cache: 256KB
  - Scratch Pad Memory (SPM): 3.5MB
  - Tensor SRAM: 32MB
Yield Calculation: With 65 slices and 1 spare, a PE remains functional as long as at most one slice is defective. For a single-slice failure probability $p$:

$$P_{\text{functional}} = (1-p)^{65} + 65\,p\,(1-p)^{64}$$

This redundancy scheme significantly improves manufacturing yield for the large 653mm² die, as the short calculation below illustrates.
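A minimal sketch of the yield formula above, comparing the 64+1 redundant PE against a hypothetical PE with no spare slice. The per-slice defect probabilities are made-up illustrative values, not process data.

```python
from math import comb

def pe_yield(p: float, slices: int = 65, spares: int = 1) -> float:
    """Probability that at most `spares` of the `slices` slices are defective."""
    return sum(comb(slices, k) * p**k * (1 - p)**(slices - k)
               for k in range(spares + 1))

for p in (0.01, 0.02, 0.05):
    no_spare = (1 - p) ** 64            # all 64 required slices must work
    print(f"p={p:.2f}: with spare {pe_yield(p):.3f}  vs  without {no_spare:.3f}")
```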
3. Hierarchical Network-on-Chip (NoC) Architecture
3.1 Three-Level NoC Design
The RNGD implements a sophisticated hierarchical NoC:
- TU NoC (Intra-PE):
  - 65 router nodes in a bi-directional ring topology
  - Supports multicasting for weight broadcasting
  - Bandwidth: 256 GB/s per direction
- PE Cluster NoC (Inter-PE):
  - Connects 4 PEs within a cluster
  - 1 GHz operation frequency
  - Provides 1 TB/s aggregate bandwidth
  - QoS control and timeout management
- Memory NoC (System-level):
  - Connects PE clusters to HBM3 memory
  - Address hashing for load balancing across 32 HBM channels (sketched below)
  - Supports up to 1.5 TB/s memory bandwidth
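The actual RNGD address hash is not public; the following is an assumption-level sketch showing why a hash of higher address bits, rather than plain interleaving, keeps power-of-two strides from piling onto one HBM channel. `INTERLEAVE_BYTES` and the XOR-fold are illustrative choices.

```python
NUM_CHANNELS = 32          # 2 HBM3 stacks x 16 channels
INTERLEAVE_BYTES = 256     # assumed interleaving granularity

def channel_of(addr: int) -> int:
    blk = addr // INTERLEAVE_BYTES
    # XOR-fold higher-order bit groups into the low 5 bits so that strided
    # access patterns do not all land on the same channel.
    return (blk ^ (blk >> 5) ^ (blk >> 10)) % NUM_CHANNELS

# A large power-of-two stride: worst case for plain modulo interleaving,
# which would map every access to channel 0.
hits = [0] * NUM_CHANNELS
for i in range(1 << 15):
    hits[channel_of(i * 8192)] += 1
print(min(hits), max(hits))   # roughly balanced across the 32 channels
```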
3.2 Bandwidth Calculations
Effective Memory Bandwidth per Operation:

$$\frac{1.5\ \text{TB/s}}{512\ \text{TOPS}} \approx 0.003\ \text{bytes/op} \quad\Longleftrightarrow\quad \text{machine balance} \approx 341\ \text{ops/byte}$$

This ratio is critical for LLM inference, where memory bandwidth often bottlenecks performance.
Roofline Model Analysis: The operational intensity $I$ (operations performed per byte moved from memory) determines whether an operation is compute or memory bound:

$$\text{Attainable performance} = \min\left(\text{Peak compute},\ I \times \text{Memory bandwidth}\right)$$

For LLMs with a typical operational intensity of 50-100 ops/byte, well below the ~341 ops/byte balance point, the RNGD operates in the memory-bound regime, making its high bandwidth crucial.
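A short roofline sketch using the published peak numbers, showing where 50-100 ops/byte workloads land relative to the ~341 ops/byte balance point. The intensity values are illustrative.

```python
PEAK_TOPS = 512            # INT8 peak, 8 PEs x 64 TOPS
PEAK_BW_TBPS = 1.5         # 2 HBM3 stacks

def attainable_tops(ops_per_byte: float) -> float:
    # Roofline: bounded by either peak compute or bandwidth x intensity.
    return min(PEAK_TOPS, ops_per_byte * PEAK_BW_TBPS)

ridge = PEAK_TOPS / PEAK_BW_TBPS   # ~341 ops/byte machine balance point
for intensity in (50, 100, ridge, 500):
    print(f"I = {intensity:6.1f} ops/byte -> {attainable_tops(intensity):6.1f} TOPS")
# LLM decode kernels at 50-100 ops/byte sit well below the ridge point,
# i.e. squarely in the memory-bound region of the roofline.
```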
4. Slice Architecture and Compute Engine Details
4.1 Slice Components
Each slice contains three specialized engines:
- Contraction Engine (CE):
  - 8 Dot-Product Engines (DPEs)
  - Configurable reduction trees
  - Supports INT8/FP16/BF16 operations
  - Peak throughput: 1 TOPS per slice
- Vector Engine (VE):
  - Non-linear activation functions (ReLU, GELU, Softmax)
  - Element-wise operations
  - Type conversions
  - Reduction operations
- Transpose Engine (TE):
  - Tensor axis permutation
  - Example: reordering a (batch, sequence, heads, head_dim) tensor to (batch, heads, sequence, head_dim), as illustrated after this list
  - Critical for attention mechanism efficiency
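A NumPy view of the kind of axis permutation the Transpose Engine performs in hardware, using the attention layout change mentioned above. The shapes are illustrative and not tied to a specific RNGD kernel.

```python
import numpy as np

B, L, H, Dh = 1, 2048, 32, 128
x = np.random.rand(B, L, H, Dh).astype(np.float32)   # [batch, seq, heads, head_dim]

# Attention wants heads outermost so each head's L x Dh tile is a contiguous
# unit of work: (B, L, H, Dh) -> (B, H, L, Dh)
x_heads = np.transpose(x, (0, 2, 1, 3))
assert x_heads.shape == (B, H, L, Dh)
```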
4.2 Data Reuse Strategies
The architecture supports three data reuse patterns:
Strategy | Storage Location | Use Case | Energy Efficiency |
---|---|---|---|
Weight Stationary | Register File | Convolutions | Highest (minimal data movement) |
Input Stationary | CE Input Buffer | Batch processing | Medium |
Output Stationary | Accumulator Registers | Partial sum accumulation | Medium |
Energy Calculation Example: For a weight-stationary operation with 8-bit weights, each weight is fetched into the register file once and then reused across many multiply-accumulates, so the local access cost is on the order of ~0.3 pJ/byte. Compared to DRAM access at ~100 pJ/byte, this represents a >300× energy reduction per access, as the back-of-envelope sketch below illustrates.
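A back-of-envelope data-movement energy comparison for a weight-stationary schedule. The per-access energies, weight footprint, and reuse factor are assumed, order-of-magnitude values (register file ~0.3 pJ/byte, DRAM ~100 pJ/byte as quoted above), not measured RNGD numbers.

```python
E_RF_PJ_PER_BYTE = 0.3        # assumed register-file access energy
E_DRAM_PJ_PER_BYTE = 100.0    # assumed DRAM access energy

weights_bytes = 32 * 1024 * 1024   # one PE's 32 MB of INT8 weights (illustrative)
reuse_factor = 512                 # times each weight is reused once resident (assumed)

# Weight-stationary: pay DRAM once per weight, then cheap local accesses for every reuse.
e_stationary = weights_bytes * (E_DRAM_PJ_PER_BYTE + reuse_factor * E_RF_PJ_PER_BYTE)
# Naive schedule: re-fetch every weight from DRAM on every use.
e_naive = weights_bytes * reuse_factor * E_DRAM_PJ_PER_BYTE

print(f"energy ratio (naive / weight-stationary): {e_naive / e_stationary:.0f}x")
```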
5. HBM3 Integration and Signal/Power Integrity
5.1 HBM3 Specifications
- Configuration: 2 stacks, 12-high configuration
- Channels: 32 total (16 per stack)
- Bandwidth per stack: 750 GB/s
- Operating voltage: 0.4V VDDQL, 0.75V VDD
- Interface width: 1024 bits per stack
5.2 Power Delivery Network (PDN) Design
The chip employs sophisticated power management:
Decoupling Capacitance Hierarchy:
- Deep Trench Capacitors (DTC): 24.5μF per cluster
- Metal-Insulator-Metal (MiM): ~1μF on-die
- On-die capacitors: ~100nF distributed
Voltage Ripple Analysis: Supply ripple on each rail is bounded by the resistive drop and the inductive transient term of the PDN:

$$\Delta V \approx I \cdot R_{\text{PDN}} + L_{\text{PDN}} \cdot \frac{dI}{dt}$$

Measured ripple was characterized separately for the two HBM-facing rails, VDDQL (0.4V) and VDD (0.75V).
5.3 Thermal Management
Power Density Calculation:

$$\frac{150\ \text{W}}{653\ \text{mm}^2} \approx 0.23\ \text{W/mm}^2$$
The custom heat sink design maintains a junction temperature below 85°C with:
- Low junction-to-ambient thermal resistance sized for the 150W board power (see the sketch after this list)
- Air cooling with a custom fin design
- DVFS for dynamic thermal management
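A simple steady-state thermal check reproducing the power density above and deriving the junction-to-ambient resistance budget needed to stay under 85°C. The ambient temperature is an assumption for illustration, so the resulting θ_JA figure is indicative only.

```python
TDP_W = 150.0
DIE_MM2 = 653.0
T_JUNCTION_MAX_C = 85.0
T_AMBIENT_C = 35.0                      # assumed datacenter inlet temperature

power_density = TDP_W / DIE_MM2         # ~0.23 W/mm^2
theta_ja_budget = (T_JUNCTION_MAX_C - T_AMBIENT_C) / TDP_W   # ~0.33 C/W

print(f"power density: {power_density:.2f} W/mm^2")
print(f"required junction-to-ambient resistance: <= {theta_ja_budget:.2f} C/W")
```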
6. Performance Analysis and Benchmarks
6.1 LLaMA-2 7B Performance
The RNGD demonstrates superior performance on LLaMA-2 7B model inference (531 tok/s at 150W, per the comparison table below).
Attention Computation Breakdown: For sequence length L = 2048 and per-head dimension d = 128 (as in LLaMA-2 7B), the $QK^{T}$ score computation costs about $2L^2 d \approx 1.07$ GOP per head, the score-value product costs another $\approx 1.07$ GOP, and the $L \times L$ score matrix alone occupies roughly 8 MB in FP16; the arithmetic is reproduced in the sketch below.
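The per-head attention cost worked out from the parameters quoted above, counting a multiply-accumulate as 2 ops; the numbers are plain arithmetic, not measured kernel timings.

```python
L, d = 2048, 128

ops_qk = 2 * L * L * d        # Q @ K^T            -> ~1.07 GOP per head
ops_pv = 2 * L * L * d        # softmax(QK^T) @ V  -> ~1.07 GOP per head
score_bytes_fp16 = L * L * 2  # the L x L score matrix in FP16

print(f"QK^T: {ops_qk / 1e9:.2f} GOP, PV: {ops_pv / 1e9:.2f} GOP, "
      f"score matrix: {score_bytes_fp16 / 2**20:.0f} MiB")
```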
6.2 Comparative Performance Metrics
Metric | RNGD | NVIDIA L40s | NVIDIA H100 | Analysis |
---|---|---|---|---|
Peak Memory BW | 1.5 TB/s | 0.86 TB/s | 3.35 TB/s | RNGD: 1.74× L40s |
TDP | 150W | 350W | 700W | RNGD: 57% lower than L40s |
Throughput (LLaMA-2) | 531 tok/s | 301 tok/s | 913 tok/s | RNGD: 1.76× L40s |
Perf/Watt | 3.54 tok/s/W | 0.86 tok/s/W | 1.30 tok/s/W | RNGD: 4.1× L40s efficiency |
GPT-J 6B (99% acc) | 12.3 TOPS/W | 8.0 TOPS/W | - | 53% better than L40s |
Efficiency Calculation:

$$\frac{531\ \text{tok/s}}{150\ \text{W}} = 3.54\ \text{tok/s/W} \quad\text{vs.}\quad \frac{301\ \text{tok/s}}{350\ \text{W}} = 0.86\ \text{tok/s/W} \;\Rightarrow\; \frac{3.54}{0.86} \approx 4.1\times$$
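The performance-per-watt figures in the table can be reproduced directly from the quoted throughput and TDP numbers:

```python
chips = {
    "RNGD":        (531.0, 150.0),   # tok/s, TDP in W
    "NVIDIA L40s": (301.0, 350.0),
    "NVIDIA H100": (913.0, 700.0),
}

eff = {name: tok / watt for name, (tok, watt) in chips.items()}
for name, e in eff.items():
    print(f"{name:12s}: {e:.2f} tok/s/W")
print(f"RNGD vs L40s: {eff['RNGD'] / eff['NVIDIA L40s']:.1f}x")   # ~4.1x
```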
6.3 Scalability Analysis
The RNGD supports multi-chip configurations via PCIe P2P:
- Without PCIe switch: 32 GB/s inter-chip bandwidth
- With PCIe switch: 52 GB/s inter-chip bandwidth
For an 8-chip configuration, aggregate resources scale to $8 \times 512 = 4096$ TOPS of compute, $8 \times 1.5 = 12$ TB/s of HBM bandwidth, and $8 \times 150 = 1200$ W of board power, with inter-chip traffic bounded by the PCIe P2P links above (worked out below).
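A rough sketch of the 8-chip aggregates plus an assumed ring all-reduce model to show how the PCIe P2P bandwidth bounds tensor-parallel communication. The per-layer payload and the all-reduce formula are illustrative assumptions, not RNGD software behavior.

```python
CHIPS = 8
agg_compute_tops = CHIPS * 512          # 4096 TOPS
agg_hbm_tbps     = CHIPS * 1.5          # 12 TB/s
agg_power_w      = CHIPS * 150          # 1200 W

p2p_gb_s = 52                           # per-link bandwidth with a PCIe switch
activation_mb = 32                      # assumed per-layer all-reduce payload
# Ring all-reduce moves 2*(N-1)/N of the payload over the slowest link.
allreduce_ms = 2 * (CHIPS - 1) / CHIPS * activation_mb / (p2p_gb_s * 1000) * 1000

print(agg_compute_tops, "TOPS,", agg_hbm_tbps, "TB/s,", agg_power_w, "W")
print(f"ring all-reduce of {activation_mb} MB: ~{allreduce_ms:.2f} ms per layer")
```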
7. Software Stack and Programming Model
7.1 Dual-Context Execution
Each slice supports two execution contexts:
- Main context: Tensor operations (matrix multiplies, convolutions)
- Sub-context: Vector operations and memory transfers
This enables operation overlap: while the main context computes with the current layer's weights, the sub-context prefetches the weights for the next one, so the time per layer approaches the maximum of the compute and transfer times rather than their sum.
7.2 Command Queue Architecture
The TU Controller (TUC) implements asynchronous operation:
- CPU queues commands to TUC
- TUC broadcasts configuration to all slices
- Operations execute deterministically
- CPU continues other work or polls for completion
Latency Hiding Calculation: For weight transfer time $t_W$ and computation time $t_C$, the effective time per layer is approximately $\max(t_W, t_C)$ (a numeric sketch follows this list):
- Short sequences ($t_C < t_W$): computation is hidden behind the weight transfer
- Long sequences ($t_C > t_W$): the weight transfer is hidden behind computation
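A minimal model of the latency-hiding effect, comparing serial execution against overlapped dual-context execution. The microsecond figures are made-up illustrative values, not measured RNGD timings.

```python
def layer_time(t_compute_us: float, t_weights_us: float, overlapped: bool) -> float:
    # Overlapped: weight prefetch runs in the sub-context alongside compute.
    return max(t_compute_us, t_weights_us) if overlapped else t_compute_us + t_weights_us

T_WEIGHTS_US = 120.0                                  # assumed weight-transfer time
for t_c, label in ((40.0, "short sequence"), (400.0, "long sequence")):
    serial = layer_time(t_c, T_WEIGHTS_US, overlapped=False)
    overlap = layer_time(t_c, T_WEIGHTS_US, overlapped=True)
    print(f"{label:14s}: serial {serial:5.0f} us -> overlapped {overlap:5.0f} us")
```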
8. Reliability and Manufacturing Features
8.1 Error Correction
- SRAM/SPM: Single-Error Correction, Double-Error Detection (SECDED)
- HBM Controller: ECC with error counting and interrupt generation
- Error Rate: bit error rate held below the design target (a toy SECDED illustration follows this list)
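A toy SECDED sketch using an extended Hamming(8,4) code, purely to illustrate the single-error-correct / double-error-detect classification that the RNGD applies (at much wider word sizes) across its SRAM/SPM arrays. The code-word layout and helper names here are illustrative, not the chip's actual ECC implementation.

```python
def encode(data):
    """data: [d1, d2, d3, d4] -> 8-bit extended Hamming code word."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4               # covers code-word positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4               # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4               # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p3, d2, d3, d4]   # Hamming positions 1..7
    p0 = 0
    for b in word:                  # overall parity turns SEC into SECDED
        p0 ^= b
    return [p0] + word

def classify(word):
    """Return the SECDED classification of a possibly corrupted code word."""
    p0, p1, p2, d1, p3, d2, d3, d4 = word
    s1 = p1 ^ d1 ^ d2 ^ d4
    s2 = p2 ^ d1 ^ d3 ^ d4
    s3 = p3 ^ d2 ^ d3 ^ d4
    syndrome = (s3 << 2) | (s2 << 1) | s1   # 1..7 = flipped Hamming position
    overall = 0
    for b in word:
        overall ^= b                         # 0 if overall parity still holds
    if syndrome == 0 and overall == 0:
        return "no error"
    if syndrome != 0 and overall == 1:
        return f"single-bit error at code-word position {syndrome} (correctable)"
    if syndrome != 0 and overall == 0:
        return "double-bit error detected (uncorrectable)"
    return "error in the overall parity bit"

cw = encode([1, 0, 1, 1])
cw[5] ^= 1                     # one flipped bit -> identified and correctable
print(classify(cw))
cw[6] ^= 1                     # a second flipped bit -> detected only
print(classify(cw))
```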
8.2 Security Features
- Secure Boot: Encryption-based firmware verification
- Address Translation Unit: Per-PE memory isolation
- SR-IOV Support: Virtual machine isolation for cloud deployment
8.3 Monitoring and Diagnostics
- Temperature Sensors: Distributed thermal monitoring
- Voltage Droop Detectors: Wide-bandwidth supply monitoring
- Timing Margin Monitors: Long-term reliability tracking
9. Conclusions and Future Outlook
The RNGD processor represents a significant advancement in AI accelerator design, specifically optimized for the memory-bound nature of LLM inference. Its tensor-contraction architecture, combined with high memory bandwidth and power efficiency, delivers 4.1× better performance per watt than comparable GPUs. The sophisticated NoC design, redundancy features, and comprehensive monitoring capabilities make it suitable for datacenter deployment.
Key Innovations:
- Tensor contraction as primitive operation - enabling better parallelism than matrix multiplication
- Hierarchical NoC architecture - providing 1.5 TB/s memory bandwidth efficiently
- Slice-based redundancy - improving yield on large 653mm² dies
- Dual-context execution - hiding memory latency behind computation
- Comprehensive power management - maintaining 150W TDP with DVFS
The RNGD demonstrates that specialized architectures can significantly outperform general-purpose GPUs for LLM inference, pointing toward a future of domain-specific accelerators for AI workloads. With its 53% improvement in GPT-J performance per watt and 1.76× throughput advantage over L40s at 57% lower power, the RNGD sets a new standard for efficient LLM inference acceleration.
Analysis based on architectural specifications and performance data from FuriosaAI's RNGD processor development collaboration with Dongguk University.