FuriosaAI & Dongguk University

RNGD Tensor-Contraction Processor

A paradigm-shifting AI accelerator built on tensor contraction primitives for LLM inference, achieving 512 TOPS on 653mm² die with 150W TDP and 4.1× better performance per watt than competing GPUs.

5nm TSMC process
Released 2024
Updated 1/16/2025

Key Performance Metrics

512 TOPS peak performance at 150W TDP
1.5 TB/s memory bandwidth via dual HBM3 stacks
4.1× better performance per watt than NVIDIA L40s
1.76× higher LLaMA-2 7B throughput than L40s at 57% lower power
531 tokens/second on LLaMA-2 7B inference
12.3 TOPS/W efficiency on GPT-J 6B (53% better than L40s)

Architectural Highlights

  • Tensor contraction as fundamental primitive operation vs traditional matrix multiplication
  • 653mm² die with 8 processing elements achieving 512 TOPS peak performance
  • Hierarchical three-level Network-on-Chip architecture
  • Slice-based redundancy with 65 slices per PE (64 active + 1 spare)
  • Dual-context execution enabling computation and memory overlap

Technical Specifications

8 Processing Elements with 64 TOPS each
65 slices per PE (64 active + 1 spare for yield)
256MB total tensor SRAM (32MB per PE) + 28MB SPM + 2MB L2
Dual HBM3 stacks: 32 channels, 1.5 TB/s aggregate bandwidth
Dual-clock domains: 1GHz NoC, 2GHz CPU cores
Triple-engine slice design: CE (1 TOPS), VE, TE per slice
Deep Trench Capacitors: 24.5μF per cluster for power integrity

Innovative Features

  • Einstein summation tensor contraction enabling massive parallelism
  • Time-axis pipelining for continuous data flow optimization
  • Triple-engine slice architecture: Contraction, Vector, and Transpose engines
  • SECDED error correction with distributed ECC across memory hierarchy
  • Address Translation Unit with per-PE memory isolation
  • SR-IOV support for cloud virtualization deployment

1. Executive Summary

The RNGD processor, developed by FuriosaAI in collaboration with Dongguk University, represents a paradigm shift in AI accelerator design for Large Language Models (LLMs). Built on a 5nm process node with a massive 653mm² die area, this processor achieves 512 TOPS (Tera Operations Per Second) of compute performance while maintaining a remarkably efficient 150W TDP (Thermal Design Power). The chip's revolutionary approach uses tensor contraction as its fundamental primitive operation rather than traditional matrix multiplication, enabling unprecedented parallelism and energy efficiency.

2. Fundamental Architecture and Tensor Contraction Theory

2.1 Tensor Contraction vs Matrix Multiplication

Traditional AI accelerators map tensor operations onto matrix multiplication units (GEMM - General Matrix Multiply). The RNGD takes a fundamentally different approach by using tensor contraction as its primitive operation.

Mathematical Definition of Tensor Contraction: For tensors A and B sharing a contracted index m, the contraction is defined as

C[i,j,k] = \sum_m A[i,m,j] \times B[m,k]

This operation generalizes Einstein summation notation and allows for:

  • Massive parallelism: Multiple dimensions can be processed simultaneously
  • Data locality optimization: Better cache utilization through dimensional reordering
  • Time-axis pipelining: Similar to vector processors, enabling continuous data flow
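To make the index pattern concrete, here is a minimal sketch of the contraction above written with NumPy's einsum on a host CPU (the shapes are arbitrary example values, not RNGD parameters or toolchain code):

```python
import numpy as np

# Illustrative sketch of C[i,j,k] = sum_m A[i,m,j] * B[m,k].
# Dimension sizes are arbitrary example values.
I, M, J, K = 4, 8, 5, 6
A = np.random.rand(I, M, J)
B = np.random.rand(M, K)

# Einstein-summation form: contract over the shared index m.
C = np.einsum('imj,mk->ijk', A, B)

# Reference result via an explicit loop over the contracted index.
C_ref = np.zeros((I, J, K))
for m in range(M):
    C_ref += A[:, m, :, None] * B[m, None, None, :]

assert np.allclose(C, C_ref)
print(C.shape)  # (4, 5, 6)
```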

2.2 Core Specifications

| Parameter | Value | Calculation / Explanation |
| --- | --- | --- |
| Process node | 5nm TSMC | Advanced FinFET technology |
| Die area | 653 mm² | Among the largest AI chips |
| Peak performance | 512 TOPS | 8 PEs × 64 TOPS/PE |
| Memory bandwidth | 1.5 TB/s | 2 × HBM3 stacks |
| TDP | 150 W | Board power limit |
| Operating frequency | 1 GHz (NoC), 2 GHz (CPU) | Dual-clock domains |
| SRAM per PE | 32 MB | Local tensor storage |
| Total on-chip memory | 256 MB + 28 MB + 2 MB | 8 × 32 MB tensor SRAM + SPM + L2 caches |

2.3 Processing Element (PE) Architecture

Each PE contains:

  • 65 slices (64 active + 1 spare for yield improvement)
  • Tensor Unit (TU): 64 TOPS compute capability
  • CPU Core: RISC-V based, 2GHz, manages control flow
  • TDMA Engine: Tensor DMA for asynchronous data movement
  • Memory Hierarchy:
    • L1 I/D Cache: 64KB
    • L2 Cache: 256KB
    • Scratch Pad Memory (SPM): 3.5MB
    • Tensor SRAM: 32MB

Yield Calculation: With 65 slices and 1 spare, a PE remains functional as long as at most one slice is defective. For a single-slice failure probability p:

P(\text{functional PE}) = (1-p)^{65} + 65\,p\,(1-p)^{64}

This redundancy scheme significantly improves manufacturing yield for the large 653mm² die.
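A minimal sketch of this yield model is shown below; the per-slice failure probabilities are hypothetical values for illustration, since the source does not publish defect densities.

```python
def pe_yield_with_spare(p: float, slices: int = 65) -> float:
    """A PE with one spare slice remains functional if at most one slice fails."""
    return (1 - p) ** slices + slices * p * (1 - p) ** (slices - 1)

def pe_yield_no_spare(p: float, slices: int = 64) -> float:
    """Without redundancy, every active slice must be defect-free."""
    return (1 - p) ** slices

# Hypothetical per-slice failure probabilities, for illustration only.
for p in (0.001, 0.005, 0.01):
    print(f"p={p:.3f}: with spare {pe_yield_with_spare(p):.4f}, "
          f"without spare {pe_yield_no_spare(p):.4f}")
```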

3. Hierarchical Network-on-Chip (NoC) Architecture

3.1 Three-Level NoC Design

The RNGD implements a sophisticated hierarchical NoC:

  1. TU NoC (Intra-PE):

    • 65 router nodes in bi-directional ring topology
    • Supports multicasting for weight broadcasting
    • Bandwidth: 256 GB/s per direction
  2. PE Cluster NoC (Inter-PE):

    • Connects 4 PEs within a cluster
    • 1 GHz operation frequency
    • Provides 1 TB/s aggregate bandwidth
    • QoS control and timeout management
  3. Memory NoC (System-level):

    • Connects PE clusters to HBM3 memory
    • Address hashing for load balancing across 32 HBM channels
    • Supports up to 1.5 TB/s memory bandwidth

3.2 Bandwidth Calculations

Effective Memory Bandwidth per Operation:

BW_{\text{per op}} = \frac{1.5\ \text{TB/s}}{512\ \text{TOPS}} \approx 2.93 \times 10^{-3}\ \text{bytes/operation}

This ratio is critical for LLM inference where memory bandwidth often bottlenecks performance.

Roofline Model Analysis: The operational intensity (I) determines whether an operation is compute- or memory-bound; the threshold (ridge point) is

I_{\text{threshold}} = \frac{\text{Peak OPS}}{\text{Peak bandwidth}} = \frac{512\ \text{TOPS}}{1.5\ \text{TB/s}} \approx 341\ \text{ops/byte}

For LLMs with typical operational intensity of 50-100 ops/byte, the RNGD operates in the memory-bound regime, making its high bandwidth crucial.
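The roofline can be evaluated directly from the two peak figures; the operational-intensity values in the sketch below are assumptions chosen to bracket typical LLM inference.

```python
PEAK_OPS = 512e12   # 512 TOPS
PEAK_BW = 1.5e12    # 1.5 TB/s

def attainable_ops(intensity: float) -> float:
    """Roofline model: attainable throughput for a given operational intensity (ops/byte)."""
    return min(PEAK_OPS, intensity * PEAK_BW)

ridge = PEAK_OPS / PEAK_BW  # ~341 ops/byte
print(f"Ridge point: {ridge:.0f} ops/byte")

# Assumed LLM-like intensities plus points near and above the ridge.
for intensity in (50, 100, 341, 500):
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"I = {intensity:3d} ops/byte -> {attainable_ops(intensity)/1e12:6.1f} TOPS ({regime})")
```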

4. Slice Architecture and Compute Engine Details

4.1 Slice Components

Each slice contains three specialized engines:

  1. Contraction Engine (CE):

    • 8 Dot-Product Engines (DPE)
    • Configurable reduction trees
    • Supports INT8/FP16/BF16 operations
    • Peak throughput: 1 TOPS per slice
  2. Vector Engine (VE):

    • Non-linear activation functions (ReLU, GELU, Softmax)
    • Element-wise operations
    • Type conversions
    • Reduction operations
  3. Transpose Engine (TE):

    • Tensor axis permutation
    • Example: b \times l \times e \rightarrow b \times e \times l transformation
    • Critical for attention-mechanism efficiency (see the sketch after this list)
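The sketch below mimics the TE's axis permutation in NumPy purely to show the data movement involved in preparing K^T for the attention-score computation; tensor names and shapes are illustrative, not RNGD toolchain code.

```python
import numpy as np

# Illustrative shapes: batch b, sequence length l, per-head dimension e.
b, l, e = 2, 2048, 128
k = np.random.rand(b, l, e)          # keys laid out as b × l × e

# TE-style permutation: b × l × e -> b × e × l (i.e., K -> K^T per batch).
k_t = np.transpose(k, (0, 2, 1))

q = np.random.rand(b, l, e)
scores = q @ k_t                      # attention scores, shape b × l × l
print(k_t.shape, scores.shape)        # (2, 128, 2048) (2, 2048, 2048)
```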

4.2 Data Reuse Strategies

The architecture supports three data reuse patterns:

| Strategy | Storage Location | Use Case | Energy Efficiency |
| --- | --- | --- | --- |
| Weight stationary | Register file | Convolutions | Highest (minimal data movement) |
| Input stationary | CE input buffer | Batch processing | Medium |
| Output stationary | Accumulator registers | Partial-sum accumulation | Medium |

Energy Calculation Example: For weight-stationary operation with 8-bit weights:

\text{Energy per MAC} = E_{\text{compute}} + E_{\text{register access}} = 0.2\ \text{pJ} + 0.1\ \text{pJ} = 0.3\ \text{pJ at 5nm}

Compared to DRAM access at ~100 pJ/byte, this represents >300× energy reduction.
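A quick back-of-the-envelope check of that comparison, using the per-MAC figures quoted above and an assumed ~100 pJ/byte DRAM access cost:

```python
E_MAC = 0.2 + 0.1          # pJ: compute + register access at 5nm (figures from the text)
E_DRAM_PER_BYTE = 100.0    # pJ/byte: assumed typical off-chip DRAM access cost

# If each 8-bit operand had to come from DRAM instead of a local register,
# data-movement energy would dwarf the MAC energy itself.
print(f"Energy per MAC (weight-stationary): {E_MAC:.1f} pJ")
print(f"DRAM access per byte:               {E_DRAM_PER_BYTE:.0f} pJ")
print(f"Ratio: ~{E_DRAM_PER_BYTE / E_MAC:.0f}x")   # ~333x, consistent with >300x
```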

5. HBM3 Integration and Signal/Power Integrity

5.1 HBM3 Specifications

  • Configuration: 2 stacks, 12-high configuration
  • Channels: 32 total (16 per stack)
  • Bandwidth per stack: 750 GB/s
  • Operating voltage: 0.4V VDDQL, 0.75V VDD
  • Interface width: 1024 bits per stack

5.2 Power Delivery Network (PDN) Design

The chip employs sophisticated power management:

Decoupling Capacitance Hierarchy:

  1. Deep Trench Capacitors (DTC): 24.5μF per cluster
  2. Metal-Insulator-Metal (MiM): ~1μF on-die
  3. On-die capacitors: ~100nF distributed

Voltage Ripple Analysis:

V_{pp} = I_{\text{peak}} \times Z_{\text{PDN}}

Measured values for the two HBM supply rails:

  • VDDQL (0.4V): V_{pp} = 3.65\% = 14.6\ \text{mV}
  • VDD (0.75V): V_{pp} = 7.48\% = 56.1\ \text{mV}

5.3 Thermal Management

Power Density Calculation:

\text{Power density} = \frac{150\ \text{W}}{653\ \text{mm}^2} \approx 0.23\ \text{W/mm}^2

The custom heat sink design maintains junction temperature < 85°C with:

  • Thermal resistance: Rθ(j-a) < 0.3 °C/W (see the check after this list)
  • Air cooling with custom fin design
  • DVFS for dynamic thermal management
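As a rough sanity check on that budget (assuming a 40°C ambient, which is not a published figure), the junction-temperature estimate below lands exactly at the 85°C target when Rθ(j-a) sits at its 0.3 °C/W bound, which is presumably why the bound is stated as an upper limit:

```python
TDP_W = 150.0        # board power limit
R_THETA_JA = 0.3     # °C/W: junction-to-ambient thermal resistance, upper bound from the text
T_AMBIENT_C = 40.0   # assumed datacenter inlet/ambient temperature (not a published figure)

# Steady-state junction temperature estimate: T_j = T_ambient + P * R_theta(j-a)
t_junction = T_AMBIENT_C + TDP_W * R_THETA_JA
print(f"Estimated T_j = {t_junction:.0f} C (target < 85 C)")   # 85 C at the bound
```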

6. Performance Analysis and Benchmarks

6.1 LLaMA-2 7B Performance

The RNGD demonstrates superior performance on LLaMA-2 7B model inference:

Attention Computation Breakdown: For sequence length L = 2048 and per-head dimension d = 128:

QK^T \text{ computation: } 2 \times L \times L \times d = 2 \times 2048^2 \times 128 \approx 1.07\ \text{GFLOP}

\text{Memory required: } 3 \times L \times d \times 2\ \text{bytes} = 3 \times 2048 \times 128 \times 2 = 1.5\ \text{MB}
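The same arithmetic as a short script, so the numbers can be re-derived for other sequence lengths (FP16 operands and the per-head dimension are as quoted above):

```python
L = 2048          # sequence length
d = 128           # per-head dimension
BYTES_FP16 = 2

qkt_flops = 2 * L * L * d                 # multiply-accumulate work for Q @ K^T
qkv_bytes = 3 * L * d * BYTES_FP16        # Q, K, V activations for one head

print(f"QK^T compute:  {qkt_flops / 1e9:.2f} GFLOP")   # ~1.07 GFLOP
print(f"Q/K/V storage: {qkv_bytes / 2**20:.2f} MB")    # ~1.50 MB
```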

6.2 Comparative Performance Metrics

| Metric | RNGD | NVIDIA L40s | NVIDIA H100 | Analysis |
| --- | --- | --- | --- | --- |
| Peak memory BW | 1.5 TB/s | 0.86 TB/s | 3.35 TB/s | RNGD: 1.74× L40s |
| TDP | 150 W | 350 W | 700 W | RNGD: 57% lower than L40s |
| Throughput (LLaMA-2 7B) | 531 tok/s | 301 tok/s | 913 tok/s | RNGD: 1.76× L40s |
| Perf/Watt | 3.54 tok/s/W | 0.86 tok/s/W | 1.30 tok/s/W | RNGD: 4.1× L40s efficiency |
| GPT-J 6B (99% acc) | 12.3 TOPS/W | 8.0 TOPS/W | - | 53% better than L40s |

Efficiency Calculation:

\text{RNGD efficiency} = \frac{531\ \text{tok/s}}{150\ \text{W}} = 3.54\ \text{tokens/s/W}

\text{L40s efficiency} = \frac{301\ \text{tok/s}}{350\ \text{W}} = 0.86\ \text{tokens/s/W}

\text{Improvement} = \frac{3.54}{0.86} \approx 4.1\times
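The same perf/W comparison computed directly from the table entries, with the H100 figures included for reference:

```python
# Throughput (tokens/s) and TDP (W) from the comparison table above.
chips = {"RNGD": (531, 150), "L40s": (301, 350), "H100": (913, 700)}

eff = {name: tok / watt for name, (tok, watt) in chips.items()}
for name, e in eff.items():
    print(f"{name}: {e:.2f} tok/s/W")

print(f"RNGD vs L40s: {eff['RNGD'] / eff['L40s']:.1f}x")   # ~4.1x
```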

6.3 Scalability Analysis

The RNGD supports multi-chip configurations via PCIe P2P:

  • Without PCIe switch: 32 GB/s inter-chip bandwidth
  • With PCIe switch: 52 GB/s inter-chip bandwidth

For an 8-chip configuration:

\text{Aggregate performance} = 8 \times 512\ \text{TOPS} = 4{,}096\ \text{TOPS} \approx 4.1\ \text{POPS}

\text{Aggregate memory BW} = 8 \times 1.5\ \text{TB/s} = 12\ \text{TB/s}

\text{Total power} = 8 \times 150\ \text{W} = 1{,}200\ \text{W}

7. Software Stack and Programming Model

7.1 Dual-Context Execution

Each slice supports two execution contexts:

  • Main context: Tensor operations (matrix multiplies, convolutions)
  • Sub-context: Vector operations and memory transfers

This enables operation overlap:

Time_{\text{total}} = \max(Time_{\text{tensor ops}},\ Time_{\text{vector ops}} + Time_{\text{memory}})

rather than the serial

Time_{\text{serial}} = Time_{\text{tensor ops}} + Time_{\text{vector ops}} + Time_{\text{memory}}
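A toy model of the overlap benefit; the per-layer timings below are invented placeholders meant only to show how the max() formulation differs from serial execution.

```python
def serial_time(t_tensor: float, t_vector: float, t_mem: float) -> float:
    """No overlap: every phase runs back-to-back."""
    return t_tensor + t_vector + t_mem

def dual_context_time(t_tensor: float, t_vector: float, t_mem: float) -> float:
    """Main context runs tensor ops while the sub-context handles vector ops and DMA."""
    return max(t_tensor, t_vector + t_mem)

# Hypothetical per-layer timings in microseconds (placeholders, not measured values).
t_tensor, t_vector, t_mem = 120.0, 30.0, 80.0
print(f"Serial:       {serial_time(t_tensor, t_vector, t_mem):.0f} us")   # 230 us
print(f"Dual-context: {dual_context_time(t_tensor, t_vector, t_mem):.0f} us")   # 120 us
```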

7.2 Command Queue Architecture

The TU Controller (TUC) implements asynchronous operation:

  1. CPU queues commands to TUC
  2. TUC broadcasts configuration to all slices
  3. Operations execute deterministically
  4. CPU continues other work or polls for completion

Latency Hiding Calculation: For weight-transfer time T_w and computation time T_c:

  • Short sequences (T_c < T_w): computation is hidden behind the weight transfer
  • Long sequences (T_c > T_w): the weight transfer is hidden behind computation

8. Reliability and Manufacturing Features

8.1 Error Correction

  • SRAM/SPM: Single-Error Correction, Double-Error Detection (SECDED)
  • HBM Controller: ECC with error counting and interrupt generation
  • Error Rate: Target < 10^{-12} bit error rate

8.2 Security Features

  • Secure Boot: Encryption-based firmware verification
  • Address Translation Unit: Per-PE memory isolation
  • SR-IOV Support: Virtual machine isolation for cloud deployment

8.3 Monitoring and Diagnostics

  • Temperature Sensors: Distributed thermal monitoring
  • Voltage Droop Detectors: Wide-bandwidth supply monitoring
  • Timing Margin Monitors: Long-term reliability tracking

9. Conclusions and Future Outlook

The RNGD processor represents a significant advancement in AI accelerator design, specifically optimized for the memory-bound nature of LLM inference. Its tensor-contraction architecture, combined with high memory bandwidth and power efficiency, delivers 4.1× better performance per watt than comparable GPUs. The sophisticated NoC design, redundancy features, and comprehensive monitoring capabilities make it suitable for datacenter deployment.

Key Innovations:

  1. Tensor contraction as primitive operation - enabling better parallelism than matrix multiplication
  2. Hierarchical NoC architecture - providing 1.5 TB/s memory bandwidth efficiently
  3. Slice-based redundancy - improving yield on large 653mm² dies
  4. Dual-context execution - hiding memory latency behind computation
  5. Comprehensive power management - maintaining 150W TDP with DVFS

The RNGD demonstrates that specialized architectures can significantly outperform general-purpose GPUs for LLM inference, pointing toward a future of domain-specific accelerators for AI workloads. With its 53% improvement in GPT-J performance per watt and 1.76× throughput advantage over L40s at 57% lower power, the RNGD sets a new standard for efficient LLM inference acceleration.

Analysis based on architectural specifications and performance data from FuriosaAI's RNGD processor development collaboration with Dongguk University.

Performance Benchmarks

LLaMA-2 7B: 531 tokens/second vs 301 on L40s
GPT-J 6B: 12.3 TOPS/W vs 8.0 TOPS/W on L40s
Multi-chip scaling: 8× chips = 4.1 POPS aggregate
Memory bandwidth per operation: ~2.93 × 10⁻³ bytes/op
Operational intensity threshold: 341 ops/byte