FuriosaAI & Dongguk University

RNGD Tensor-Contraction Processor

A paradigm-shifting AI accelerator built on tensor contraction primitives for LLM inference, achieving 512 TOPS on 653mm² die with 150W TDP and 4.1× better performance per watt than competing GPUs.

5nm TSMC process
Released 2024
Updated 1/16/2025

Key Performance Metrics

512 TOPS peak performance at 150W TDP
1.5 TB/s memory bandwidth via dual HBM3 stacks
4.1× better performance per watt than NVIDIA L40s
1.76× higher LLaMA-2 7B throughput than L40s at 57% lower power
531 tokens/second on LLaMA-2 7B inference
12.3 TOPS/W efficiency on GPT-J 6B (53% better than L40s)

Architectural Highlights

  • Tensor contraction as fundamental primitive operation vs traditional matrix multiplication
  • 653mm² die with 8 processing elements achieving 512 TOPS peak performance
  • Hierarchical three-level Network-on-Chip architecture
  • Slice-based redundancy with 65 slices per PE (64 active + 1 spare)
  • Dual-context execution enabling computation and memory overlap

Technical Specifications

8 Processing Elements with 64 TOPS each
65 slices per PE (64 active + 1 spare for yield)
256MB total tensor SRAM (32MB per PE) + 28MB SPM + 2MB L2
Dual HBM3 stacks: 32 channels, 1.5 TB/s aggregate bandwidth
Dual-clock domains: 1GHz NoC, 2GHz CPU cores
Triple-engine slice design: CE (1 TOPS), VE, TE per slice
Deep Trench Capacitors: 24.5μF per cluster for power integrity

Innovative Features

  • Einstein summation tensor contraction enabling massive parallelism
  • Time-axis pipelining for continuous data flow optimization
  • Triple-engine slice architecture: Contraction, Vector, and Transpose engines
  • SECDED error correction with distributed ECC across memory hierarchy
  • Address Translation Unit with per-PE memory isolation
  • SR-IOV support for cloud virtualization deployment

1. Executive Summary

The RNGD processor, developed by FuriosaAI in collaboration with Dongguk University, represents a paradigm shift in AI accelerator design for Large Language Models (LLMs). Built on a 5nm process node with a massive 653mm² die area, this processor achieves 512 TOPS (Tera Operations Per Second) of compute performance while maintaining a remarkably efficient 150W TDP (Thermal Design Power). The chip's revolutionary approach uses tensor contraction as its fundamental primitive operation rather than traditional matrix multiplication, enabling unprecedented parallelism and energy efficiency.

2. Fundamental Architecture and Tensor Contraction Theory

2.1 Tensor Contraction vs Matrix Multiplication

Traditional AI accelerators map tensor operations onto matrix multiplication units (GEMM - General Matrix Multiply). The RNGD takes a fundamentally different approach by using tensor contraction as its primitive operation.

Mathematical Definition of Tensor Contraction: For tensors A and B sharing a contracted index m, the contraction is defined as

C[i,j,k] = \sum_m A[i,m,j] \times B[m,k]

This operation generalizes Einstein summation notation and allows for:

  • Massive parallelism: Multiple dimensions can be processed simultaneously
  • Data locality optimization: Better cache utilization through dimensional reordering
  • Time-axis pipelining: Similar to vector processors, enabling continuous data flow
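To make the index pattern concrete, here is a minimal sketch of the contraction above written with NumPy's einsum on a host CPU (the shapes are arbitrary example values, not RNGD parameters or toolchain code):

```python
import numpy as np

# Illustrative sketch of C[i,j,k] = sum_m A[i,m,j] * B[m,k].
# Dimension sizes are arbitrary example values.
I, M, J, K = 4, 8, 5, 6
A = np.random.rand(I, M, J)
B = np.random.rand(M, K)

# Einstein-summation form: contract over the shared index m.
C = np.einsum('imj,mk->ijk', A, B)

# Reference result via an explicit loop over the contracted index.
C_ref = np.zeros((I, J, K))
for m in range(M):
    C_ref += A[:, m, :, None] * B[m, None, None, :]

assert np.allclose(C, C_ref)
print(C.shape)  # (4, 5, 6)
```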

2.2 Core Specifications

| Parameter | Value | Calculation / Explanation |
| --- | --- | --- |
| Process node | 5nm TSMC | Advanced FinFET technology |
| Die area | 653 mm² | Among the largest AI chips |
| Peak performance | 512 TOPS | 8 PEs × 64 TOPS/PE |
| Memory bandwidth | 1.5 TB/s | 2 × HBM3 stacks |
| TDP | 150 W | Board power limit |
| Operating frequency | 1 GHz (NoC), 2 GHz (CPU) | Dual-clock domains |
| SRAM per PE | 32 MB | Local tensor storage |
| Total on-chip memory | 256 MB + 28 MB + 2 MB | 8 × 32 MB tensor SRAM + SPM + L2 caches |

2.3 Processing Element (PE) Architecture

Each PE contains:

  • 65 slices (64 active + 1 spare for yield improvement)
  • Tensor Unit (TU): 64 TOPS compute capability
  • CPU Core: RISC-V based, 2GHz, manages control flow
  • TDMA Engine: Tensor DMA for asynchronous data movement
  • Memory Hierarchy:
    • L1 I/D Cache: 64KB
    • L2 Cache: 256KB
    • Scratch Pad Memory (SPM): 3.5MB
    • Tensor SRAM: 32MB

Yield Calculation: With 65 slices and 1 spare, a PE remains functional as long as at most one slice is defective. For a single-slice failure probability p:

P(\text{functional PE}) = (1-p)^{65} + 65\,p\,(1-p)^{64}

This redundancy scheme significantly improves manufacturing yield for the large 653mm² die.
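A minimal sketch of this yield model is shown below; the per-slice failure probabilities are hypothetical values for illustration, since the source does not publish defect densities.

```python
def pe_yield_with_spare(p: float, slices: int = 65) -> float:
    """A PE with one spare slice remains functional if at most one slice fails."""
    return (1 - p) ** slices + slices * p * (1 - p) ** (slices - 1)

def pe_yield_no_spare(p: float, slices: int = 64) -> float:
    """Without redundancy, every active slice must be defect-free."""
    return (1 - p) ** slices

# Hypothetical per-slice failure probabilities, for illustration only.
for p in (0.001, 0.005, 0.01):
    print(f"p={p:.3f}: with spare {pe_yield_with_spare(p):.4f}, "
          f"without spare {pe_yield_no_spare(p):.4f}")
```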

3. Hierarchical Network-on-Chip (NoC) Architecture

3.1 Three-Level NoC Design

The RNGD implements a sophisticated hierarchical NoC:

  1. TU NoC (Intra-PE):

    • 65 router nodes in bi-directional ring topology
    • Supports multicasting for weight broadcasting
    • Bandwidth: 256 GB/s per direction
  2. PE Cluster NoC (Inter-PE):

    • Connects 4 PEs within a cluster
    • 1 GHz operation frequency
    • Provides 1 TB/s aggregate bandwidth
    • QoS control and timeout management
  3. Memory NoC (System-level):

    • Connects PE clusters to HBM3 memory
    • Address hashing for load balancing across 32 HBM channels
    • Supports up to 1.5 TB/s memory bandwidth

3.2 Bandwidth Calculations

Effective Memory Bandwidth per Operation:

BW_{\text{per op}} = \frac{1.5\ \text{TB/s}}{512\ \text{TOPS}} \approx 2.93 \times 10^{-3}\ \text{bytes/operation}

This ratio is critical for LLM inference where memory bandwidth often bottlenecks performance.

Roofline Model Analysis: The operational intensity (I) determines whether an operation is compute- or memory-bound; the threshold (ridge point) is

I_{\text{threshold}} = \frac{\text{Peak OPS}}{\text{Peak bandwidth}} = \frac{512\ \text{TOPS}}{1.5\ \text{TB/s}} \approx 341\ \text{ops/byte}

For LLMs with typical operational intensity of 50-100 ops/byte, the RNGD operates in the memory-bound regime, making its high bandwidth crucial.
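The roofline can be evaluated directly from the two peak figures; the operational-intensity values in the sketch below are assumptions chosen to bracket typical LLM inference.

```python
PEAK_OPS = 512e12   # 512 TOPS
PEAK_BW = 1.5e12    # 1.5 TB/s

def attainable_ops(intensity: float) -> float:
    """Roofline model: attainable throughput for a given operational intensity (ops/byte)."""
    return min(PEAK_OPS, intensity * PEAK_BW)

ridge = PEAK_OPS / PEAK_BW  # ~341 ops/byte
print(f"Ridge point: {ridge:.0f} ops/byte")

# Assumed LLM-like intensities plus points near and above the ridge.
for intensity in (50, 100, 341, 500):
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"I = {intensity:3d} ops/byte -> {attainable_ops(intensity)/1e12:6.1f} TOPS ({regime})")
```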

4. Slice Architecture and Compute Engine Details

4.1 Slice Components

Each slice contains three specialized engines:

  1. Contraction Engine (CE):

    • 8 Dot-Product Engines (DPE)
    • Configurable reduction trees
    • Supports INT8/FP16/BF16 operations
    • Peak throughput: 1 TOPS per slice
  2. Vector Engine (VE):

    • Non-linear activation functions (ReLU, GELU, Softmax)
    • Element-wise operations
    • Type conversions
    • Reduction operations
  3. Transpose Engine (TE):

    • Tensor axis permutation
    • Example: b \times l \times e \rightarrow b \times e \times l transformation
    • Critical for attention-mechanism efficiency (see the sketch after this list)
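The sketch below mimics the TE's axis permutation in NumPy purely to show the data movement involved in preparing K^T for the attention-score computation; tensor names and shapes are illustrative, not RNGD toolchain code.

```python
import numpy as np

# Illustrative shapes: batch b, sequence length l, per-head dimension e.
b, l, e = 2, 2048, 128
k = np.random.rand(b, l, e)          # keys laid out as b × l × e

# TE-style permutation: b × l × e -> b × e × l (i.e., K -> K^T per batch).
k_t = np.transpose(k, (0, 2, 1))

q = np.random.rand(b, l, e)
scores = q @ k_t                      # attention scores, shape b × l × l
print(k_t.shape, scores.shape)        # (2, 128, 2048) (2, 2048, 2048)
```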

4.2 Data Reuse Strategies

The architecture supports three data reuse patterns:

| Strategy | Storage Location | Use Case | Energy Efficiency |
| --- | --- | --- | --- |
| Weight stationary | Register file | Convolutions | Highest (minimal data movement) |
| Input stationary | CE input buffer | Batch processing | Medium |
| Output stationary | Accumulator registers | Partial-sum accumulation | Medium |

Energy Calculation Example: For weight-stationary operation with 8-bit weights:

\text{Energy per MAC} = E_{\text{compute}} + E_{\text{register access}} = 0.2\ \text{pJ} + 0.1\ \text{pJ} = 0.3\ \text{pJ at 5nm}

Compared to DRAM access at ~100 pJ/byte, this represents >300× energy reduction.
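A quick back-of-the-envelope check of that comparison, using the per-MAC figures quoted above and an assumed ~100 pJ/byte DRAM access cost:

```python
E_MAC = 0.2 + 0.1          # pJ: compute + register access at 5nm (figures from the text)
E_DRAM_PER_BYTE = 100.0    # pJ/byte: assumed typical off-chip DRAM access cost

# If each 8-bit operand had to come from DRAM instead of a local register,
# data-movement energy would dwarf the MAC energy itself.
print(f"Energy per MAC (weight-stationary): {E_MAC:.1f} pJ")
print(f"DRAM access per byte:               {E_DRAM_PER_BYTE:.0f} pJ")
print(f"Ratio: ~{E_DRAM_PER_BYTE / E_MAC:.0f}x")   # ~333x, consistent with >300x
```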

5. HBM3 Integration and Signal/Power Integrity

5.1 HBM3 Specifications

  • Configuration: 2 stacks, 12-high configuration
  • Channels: 32 total (16 per stack)
  • Bandwidth per stack: 750 GB/s
  • Operating voltage: 0.4V VDDQL, 0.75V VDD
  • Interface width: 1024 bits per stack

5.2 Power Delivery Network (PDN) Design

The chip employs sophisticated power management:

Decoupling Capacitance Hierarchy:

  1. Deep Trench Capacitors (DTC): 24.5μF per cluster
  2. Metal-Insulator-Metal (MiM): ~1μF on-die
  3. On-die capacitors: ~100nF distributed

Voltage Ripple Analysis:

V_{pp} = I_{\text{peak}} \times Z_{\text{PDN}}

Measured values for the two HBM supply rails:

  • VDDQL (0.4V): V_{pp} = 3.65\% = 14.6\ \text{mV}
  • VDD (0.75V): V_{pp} = 7.48\% = 56.1\ \text{mV}

5.3 Thermal Management

Power Density Calculation:

\text{Power density} = \frac{150\ \text{W}}{653\ \text{mm}^2} \approx 0.23\ \text{W/mm}^2

The custom heat sink design maintains junction temperature < 85°C with:

  • Thermal resistance: Rθ(j-a) < 0.3 °C/W (see the check after this list)
  • Air cooling with custom fin design
  • DVFS for dynamic thermal management
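As a rough sanity check on that budget (assuming a 40°C ambient, which is not a published figure), the junction-temperature estimate below lands exactly at the 85°C target when Rθ(j-a) sits at its 0.3 °C/W bound, which is presumably why the bound is stated as an upper limit:

```python
TDP_W = 150.0        # board power limit
R_THETA_JA = 0.3     # °C/W: junction-to-ambient thermal resistance, upper bound from the text
T_AMBIENT_C = 40.0   # assumed datacenter inlet/ambient temperature (not a published figure)

# Steady-state junction temperature estimate: T_j = T_ambient + P * R_theta(j-a)
t_junction = T_AMBIENT_C + TDP_W * R_THETA_JA
print(f"Estimated T_j = {t_junction:.0f} C (target < 85 C)")   # 85 C at the bound
```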

6. Performance Analysis and Benchmarks

6.1 LLaMA-2 7B Performance

The RNGD demonstrates superior performance on LLaMA-2 7B model inference:

Attention Computation Breakdown: For sequence length L = 2048 and per-head dimension d = 128:

QK^T \text{ computation: } 2 \times L \times L \times d = 2 \times 2048^2 \times 128 \approx 1.07\ \text{GFLOP}

\text{Memory required: } 3 \times L \times d \times 2\ \text{bytes} = 3 \times 2048 \times 128 \times 2 = 1.5\ \text{MB}
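The same arithmetic as a short script, so the numbers can be re-derived for other sequence lengths (FP16 operands and the per-head dimension are as quoted above):

```python
L = 2048          # sequence length
d = 128           # per-head dimension
BYTES_FP16 = 2

qkt_flops = 2 * L * L * d                 # multiply-accumulate work for Q @ K^T
qkv_bytes = 3 * L * d * BYTES_FP16        # Q, K, V activations for one head

print(f"QK^T compute:  {qkt_flops / 1e9:.2f} GFLOP")   # ~1.07 GFLOP
print(f"Q/K/V storage: {qkv_bytes / 2**20:.2f} MB")    # ~1.50 MB
```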

6.2 Comparative Performance Metrics

| Metric | RNGD | NVIDIA L40s | NVIDIA H100 | Analysis |
| --- | --- | --- | --- | --- |
| Peak memory BW | 1.5 TB/s | 0.86 TB/s | 3.35 TB/s | RNGD: 1.74× L40s |
| TDP | 150 W | 350 W | 700 W | RNGD: 57% lower than L40s |
| Throughput (LLaMA-2 7B) | 531 tok/s | 301 tok/s | 913 tok/s | RNGD: 1.76× L40s |
| Perf/Watt | 3.54 tok/s/W | 0.86 tok/s/W | 1.30 tok/s/W | RNGD: 4.1× L40s efficiency |
| GPT-J 6B (99% acc) | 12.3 TOPS/W | 8.0 TOPS/W | - | 53% better than L40s |

Efficiency Calculation:

\text{RNGD efficiency} = \frac{531\ \text{tok/s}}{150\ \text{W}} = 3.54\ \text{tokens/s/W}

\text{L40s efficiency} = \frac{301\ \text{tok/s}}{350\ \text{W}} = 0.86\ \text{tokens/s/W}

\text{Improvement} = \frac{3.54}{0.86} \approx 4.1\times
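The same perf/W comparison computed directly from the table entries, with the H100 figures included for reference:

```python
# Throughput (tokens/s) and TDP (W) from the comparison table above.
chips = {"RNGD": (531, 150), "L40s": (301, 350), "H100": (913, 700)}

eff = {name: tok / watt for name, (tok, watt) in chips.items()}
for name, e in eff.items():
    print(f"{name}: {e:.2f} tok/s/W")

print(f"RNGD vs L40s: {eff['RNGD'] / eff['L40s']:.1f}x")   # ~4.1x
```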

6.3 Scalability Analysis

The RNGD supports multi-chip configurations via PCIe P2P:

  • Without PCIe switch: 32 GB/s inter-chip bandwidth
  • With PCIe switch: 52 GB/s inter-chip bandwidth

For an 8-chip configuration:

\text{Aggregate performance} = 8 \times 512\ \text{TOPS} = 4{,}096\ \text{TOPS} \approx 4.1\ \text{POPS}

\text{Aggregate memory BW} = 8 \times 1.5\ \text{TB/s} = 12\ \text{TB/s}

\text{Total power} = 8 \times 150\ \text{W} = 1{,}200\ \text{W}

7. Software Stack and Programming Model

7.1 Dual-Context Execution

Each slice supports two execution contexts:

  • Main context: Tensor operations (matrix multiplies, convolutions)
  • Sub-context: Vector operations and memory transfers

This enables operation overlap:

Time_{\text{total}} = \max(Time_{\text{tensor ops}},\ Time_{\text{vector ops}} + Time_{\text{memory}})

rather than the serial

Time_{\text{serial}} = Time_{\text{tensor ops}} + Time_{\text{vector ops}} + Time_{\text{memory}}
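A toy model of the overlap benefit; the per-layer timings below are invented placeholders meant only to show how the max() formulation differs from serial execution.

```python
def serial_time(t_tensor: float, t_vector: float, t_mem: float) -> float:
    """No overlap: every phase runs back-to-back."""
    return t_tensor + t_vector + t_mem

def dual_context_time(t_tensor: float, t_vector: float, t_mem: float) -> float:
    """Main context runs tensor ops while the sub-context handles vector ops and DMA."""
    return max(t_tensor, t_vector + t_mem)

# Hypothetical per-layer timings in microseconds (placeholders, not measured values).
t_tensor, t_vector, t_mem = 120.0, 30.0, 80.0
print(f"Serial:       {serial_time(t_tensor, t_vector, t_mem):.0f} us")   # 230 us
print(f"Dual-context: {dual_context_time(t_tensor, t_vector, t_mem):.0f} us")   # 120 us
```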

7.2 Command Queue Architecture

The TU Controller (TUC) implements asynchronous operation:

  1. CPU queues commands to TUC
  2. TUC broadcasts configuration to all slices
  3. Operations execute deterministically
  4. CPU continues other work or polls for completion

Latency Hiding Calculation: For weight-transfer time T_w and computation time T_c:

  • Short sequences (T_c < T_w): computation is hidden behind the weight transfer
  • Long sequences (T_c > T_w): the weight transfer is hidden behind computation

8. Reliability and Manufacturing Features

8.1 Error Correction

  • SRAM/SPM: Single-Error Correction, Double-Error Detection (SECDED)
  • HBM Controller: ECC with error counting and interrupt generation
  • Error Rate: Target < 10^{-12} bit error rate

8.2 Security Features

  • Secure Boot: Encryption-based firmware verification
  • Address Translation Unit: Per-PE memory isolation
  • SR-IOV Support: Virtual machine isolation for cloud deployment

8.3 Monitoring and Diagnostics

  • Temperature Sensors: Distributed thermal monitoring
  • Voltage Droop Detectors: Wide-bandwidth supply monitoring
  • Timing Margin Monitors: Long-term reliability tracking

9. Conclusions and Future Outlook

The RNGD processor represents a significant advancement in AI accelerator design, specifically optimized for the memory-bound nature of LLM inference. Its tensor-contraction architecture, combined with high memory bandwidth and power efficiency, delivers 4.1× better performance per watt than comparable GPUs. The sophisticated NoC design, redundancy features, and comprehensive monitoring capabilities make it suitable for datacenter deployment.

Key Innovations:

  1. Tensor contraction as primitive operation - enabling better parallelism than matrix multiplication
  2. Hierarchical NoC architecture - providing 1.5 TB/s memory bandwidth efficiently
  3. Slice-based redundancy - improving yield on large 653mm² dies
  4. Dual-context execution - hiding memory latency behind computation
  5. Comprehensive power management - maintaining 150W TDP with DVFS

The RNGD demonstrates that specialized architectures can significantly outperform general-purpose GPUs for LLM inference, pointing toward a future of domain-specific accelerators for AI workloads. With its 53% improvement in GPT-J performance per watt and 1.76× throughput advantage over L40s at 57% lower power, the RNGD sets a new standard for efficient LLM inference acceleration.

Analysis based on architectural specifications and performance data from FuriosaAI's RNGD processor development collaboration with Dongguk University.

Performance Benchmarks

LLaMA-2 7B: 531 tokens/second vs 301 on L40s
GPT-J 6B: 12.3 TOPS/W vs 8.0 TOPS/W on L40s
Multi-chip scaling: 8× chips = 4.1 POPS aggregate
Memory bandwidth per operation: ~2.93 × 10⁻³ bytes/op
Operational intensity threshold: 341 ops/byte