Inside the NVIDIA Rubin Platform: Six New Chips, One AI Supercomputer
The NVIDIA Rubin platform introduces a rack-scale AI supercomputer architecture built on six co-designed chips (Vera CPU, Rubin GPU, NVLink 6 switch, ConnectX-9, BlueField-4 DPU, and Spectrum-6 Ethernet switch) optimized for continuous AI factory operation. Extreme co-design across compute, networking, power delivery, and cooling enables sustained intelligence production at scale, with 10x higher inference throughput and 10x lower cost per token than the previous generation.
1. Introduction: The AI Factory Era
The AI industry has entered a new industrial phase. What began as discrete systems for model training and inference has evolved into always-on AI factories that continuously convert power, silicon, and data into intelligence at scale. These factories now power applications that generate business plans, analyze markets, conduct research, and reason across vast knowledge bases.
"AI factories now underpin applications that generate business plans, analyze markets, conduct deep research, and reason across vast bodies of knowledge."
1.1 The Challenge
Modern AI workloads demand:
- Hundreds of thousands of input tokens for long-context reasoning
- Real-time inference under strict constraints
- Agentic reasoning with complex multi-step workflows
- Multimodal pipelines processing diverse data types
All while operating within tight constraints on:
- Power consumption
- Reliability and uptime
- Security and isolation
- Deployment velocity
- Cost per token
1.2 Three Scaling Laws Driving AI Progress
The evolution is captured by three fundamental scaling laws:
- Pre-training scaling: Models acquire broad knowledge as data and compute grow (one common empirical form appears after this list)
- Post-training scaling: Models learn to reason through fine-tuning and reinforcement learning
- Test-time scaling: Models reason by generating more tokens during inference
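One widely used empirical form for the first law, from Chinchilla-style scaling studies, relates loss to parameter count N and training-token count D; the constants and exponents are fitted per model family, not fundamental:

$$
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
$$

Here E is the irreducible loss, and the two power-law terms shrink as model size and data grow, which is what makes continued scaling pay off.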
2. The NVIDIA Rubin Platform: Extreme Co-Design Philosophy
The Rubin platform represents a fundamental shift: treating the data center, not a single GPU server, as the unit of compute. This approach is built on extreme co-design where GPUs, CPUs, networking, security, software, power delivery, and cooling are architected together as a unified system.
2.1 Five Platform-Level Breakthroughs
| Breakthrough | Description |
|---|---|
| Rack-scale coherence | 72 GPUs operate as single coherent machine |
| Sustained intelligence production | Optimized for continuous operation, not peak bursts |
| Predictable performance | Deterministic behavior under real workloads |
| Security-first architecture | Built-in isolation and confidential computing |
| Operational efficiency | Designed for zero-downtime maintenance |
2.2 Vera Rubin NVL72: The Flagship System
The Vera Rubin NVL72 is a rack-scale system where the entire rack operates as a coherent machine within a larger AI factory. It's optimized for:
- Predictable latency
- High utilization across heterogeneous execution phases
- Efficient conversion of power into usable intelligence
3. Six New Chips, One AI Supercomputer
The Rubin platform integrates six purpose-built chips, each engineered for a specific role in the AI factory:
3.1 Vera CPU: Purpose-Built for AI Factories
The Vera CPU is the high-bandwidth, low-latency data movement engine that keeps AI factories operating efficiently at scale.
Key Specifications
| Feature | Grace CPU | Vera CPU |
|---|---|---|
| Cores | 72 Neoverse V2 | 88 NVIDIA Olympus |
| Threads | 72 | 176 (Spatial Multithreading) |
| L2 Cache/core | 1 MB | 2 MB |
| L3 Cache | 114 MB | 162 MB |
| Memory BW | Up to 512 GB/s | Up to 1.2 TB/s |
| Memory Capacity | Up to 480 GB LPDDR5X | Up to 1.5 TB LPDDR5X |
| NVLink-C2C | 900 GB/s | 1.8 TB/s |
| PCIe/CXL | Gen5 | Gen6/CXL 3.1 |
| Confidential Computing | No | Yes |
NVIDIA Olympus Core Innovation
Spatial Multithreading: A new multithreading approach that runs two hardware threads per core by physically partitioning resources instead of time-slicing. This enables:
- Runtime tradeoff between performance and efficiency
- Increased throughput and virtual CPU density
- Predictable performance with strong isolation
- Critical for multi-tenant AI factories
Scalable Coherency Fabric (SCF)
The second-generation SCF connects all 88 Olympus cores to shared L3 cache and memory on a single monolithic die:
- Avoids chiplet boundaries for consistent latency
- Sustains >90% of peak memory bandwidth under load
- Enables deterministic, high-throughput data movement
- Ensures linear scaling as core count increases
Coherent Memory Architecture
NVLink-C2C provides 1.8 TB/s of coherent bandwidth between Vera CPUs and Rubin GPUs, enabling:
- Unified address space across CPU and GPU memory
- Applications that treat LPDDR5X and HBM4 as a single coherent pool
- Reduced data movement overhead
- Efficient KV-cache offload and multi-model execution (a minimal sketch follows)
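The offload pattern is easy to express in standard PyTorch. A minimal sketch, assuming pinned host buffers stand in for the coherent LPDDR5X pool; the calls are standard PyTorch, not Rubin-specific:

```python
# Minimal KV-cache offload sketch. On a coherent NVLink-C2C system
# these copies would traverse the 1.8 TB/s CPU-GPU link.
import torch

def offload_kv(kv_gpu: torch.Tensor) -> torch.Tensor:
    """Move a KV-cache block from GPU HBM to pinned CPU memory."""
    kv_cpu = torch.empty(kv_gpu.shape, dtype=kv_gpu.dtype,
                         device="cpu", pin_memory=True)
    kv_cpu.copy_(kv_gpu, non_blocking=True)  # async over the coherent link
    return kv_cpu

def restore_kv(kv_cpu: torch.Tensor) -> torch.Tensor:
    """Bring an offloaded KV-cache block back into GPU HBM."""
    return kv_cpu.to("cuda", non_blocking=True)
```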
3.2 Rubin GPU: Execution Engine for Transformer-Era AI
The Rubin GPU transforms rack-scale capability into sustained intelligence production, designed for continuous training, post-training, and inference.
Core Specifications
| Feature | Blackwell | Rubin |
|---|---|---|
| Transistors | 208B | 336B |
| Compute Dies | 2 | 2 |
| NVFP4 Inference | 10 PFLOPS* | 50 PFLOPS* |
| NVFP4 Training | 10 PFLOPS | 35 PFLOPS |
| Softmax Acceleration | 16 Ops/Clk/SM (FP32), 32 Ops/Clk/SM (FP16) | 64 Ops/Clk/SM (FP16) |
*Transformer Engine compute
Architecture Highlights
- 224 Streaming Multiprocessors (SMs) with 6th-gen Tensor Cores
- Optimized for NVFP4 and FP8 low-precision execution
- Expanded Special Function Units (SFUs) for attention and activation
- Improved branch prediction, prefetching, and load-store performance
Third-Generation Transformer Engine
New capabilities include:
- Hardware-accelerated adaptive compression for NVFP4 (the format is simulated in the sketch after this list)
- Up to 50 PFLOPS NVFP4 for inference
- Full backward compatibility with Blackwell GPUs
- Automatic optimization without code changes
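To make the format concrete, here is a software simulation of block-scaled FP4 (e2m1) quantization, the numeric family NVFP4 belongs to. The block size and scale handling are illustrative assumptions, not the Transformer Engine's actual compression scheme:

```python
# Illustrative FP4 (e2m1) block quantization in NumPy. Block size and
# per-block scaling are assumptions for demonstration only.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # e2m1 magnitudes

def quantize_fp4_blocked(x: np.ndarray, block: int = 16):
    """Quantize a 1-D array to signed FP4 with one scale per block."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale[scale == 0] = 1.0
    mag = np.abs(x) / scale
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)  # nearest code
    return np.sign(x) * FP4_GRID[idx], scale

x = np.random.randn(64).astype(np.float32)
q, s = quantize_fp4_blocked(x)
print("max abs error:", np.abs((q * s).reshape(-1) - x).max())
```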
HBM4 Memory Subsystem
Key advances in the memory subsystem (a rough decode-throughput estimate follows the list):
- Up to 288 GB of HBM4 per GPU
- Up to 22 TB/s aggregate bandwidth (~3x vs Blackwell)
- Doubled interface width compared to HBM3e
- Improved decode and front-end efficiency
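Why the bandwidth matters: single-stream decode is typically bound by the time to stream the model weights from HBM for every generated token. A back-of-envelope estimate, using a hypothetical 70B-parameter model at 4 bits per weight:

```python
# Rough upper bound on bandwidth-limited decode throughput.
# Model size and precision are hypothetical; 22 TB/s is quoted above.
hbm_bw = 22e12                    # bytes/s aggregate HBM4 bandwidth
weight_bytes = 70e9 * 0.5         # 70B params at 4 bits/param ~ 35 GB
print(f"~{hbm_bw / weight_bytes:.0f} tokens/s")  # ~629 tokens/s per GPU
```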
Scientific Computing Convergence
| Feature | Hopper GPU | Blackwell GPU | Rubin GPU |
|---|---|---|---|
| FP32 Vector (TFLOPS) | 67 | 80 | 130 |
| FP32 Matrix (TFLOPS) | 67 | 227* | 400* |
| FP64 Vector (TFLOPS) | 34 | 40 | 33 |
| FP64 Matrix (TFLOPS) | 67 | 150* | 200* |
*Peak performance using Tensor Core-based emulation algorithms
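The idea behind emulated FP32/FP64 matrix math can be sketched in software: split each operand into high and low halves and recover most of the accuracy from a few low-precision products. This is a simplified illustration of the technique class (real schemes, such as Ozaki-style splitting, are more refined), and NumPy still accumulates in FP32 where hardware would run BF16 Tensor Core GEMMs with FP32 accumulation:

```python
# Split-precision GEMM sketch: approximate an FP32 matmul from three
# lower-precision partial products. Illustrative of the technique only.
import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
    """Truncate float32 mantissas to bfloat16 precision."""
    return (x.astype(np.float32).view(np.uint32) & 0xFFFF0000).view(np.float32)

def emulated_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a_hi, b_hi = to_bf16(a), to_bf16(b)
    a_lo, b_lo = a - a_hi, b - b_hi
    # On hardware each term would be a BF16 Tensor Core GEMM.
    return a_hi @ b_hi + a_hi @ b_lo + a_lo @ b_hi

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64), dtype=np.float32)
b = rng.standard_normal((64, 64), dtype=np.float32)
print("max deviation:", np.abs(emulated_matmul(a, b) - a @ b).max())
```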
3.3 NVLink 6 Switch: Rack-Scale Scale-Up Fabric
NVLink 6 eliminates communication bottlenecks by enabling 72 Rubin GPUs to operate as a single, tightly coupled accelerator.
Key Capabilities
- 3.6 TB/s bidirectional bandwidth per GPU (2x vs previous gen)
- Full all-to-all topology across the rack
- Uniform latency between any pair of GPUs
- SHARP in-network compute: 14.4 TFLOPS FP8 per switch tray, used transparently by collectives (see the sketch below)
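Applications exercise SHARP through ordinary collectives; nothing changes at the framework level. A minimal sketch using PyTorch's NCCL backend (launcher and environment setup, e.g. torchrun, are assumed):

```python
# Gradient all-reduce via NCCL. When SHARP in-network reduction is
# enabled, the sum can be offloaded to the NVLink switches without
# any change to this code.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

grad = torch.randn(1 << 20, device="cuda")
dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # summed across all ranks
grad /= dist.get_world_size()                # mean gradient
```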
MoE and Reasoning Optimization
For Mixture-of-Experts (MoE) workloads:
- Up to 2x higher throughput for all-to-all operations
- Dynamic token routing without fabric saturation
- Efficient expert parallelism across all 72 GPUs (see the dispatch sketch below)
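Expert dispatch maps onto a single all-to-all over the NVLink fabric. A toy sketch with equal splits (real MoE layers use router-driven, uneven splits plus a second all-to-all to return results):

```python
# Toy MoE token dispatch: each rank hosts one expert and exchanges
# token shards with every other rank in one collective. Equal splits
# are an assumption to keep the sketch minimal.
import torch
import torch.distributed as dist

def dispatch_tokens(tokens: torch.Tensor) -> torch.Tensor:
    """Exchange equal token shards across ranks (first dim must be
    divisible by the world size)."""
    received = torch.empty_like(tokens)
    dist.all_to_all_single(received, tokens)
    return received
```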
Operational Features
- Hot-swappable trays
- Continued operation with partial population
- Dynamic traffic rerouting around faults
- In-service software updates
- Fine-grained link telemetry
3.4 ConnectX-9: AI Scale-Out Bandwidth
ConnectX-9 serves as the intelligent endpoint of the Spectrum-X Ethernet fabric, delivering 1.6 Tb/s of network bandwidth per Rubin GPU.
Endpoint Intelligence
Programmable congestion control at the endpoint:
- Smooths traffic injection during all-to-all phases
- Reduces head-of-line blocking
- Maintains high effective bandwidth under load
- Prevents congestion before it forms (the smoothing idea is illustrated below)
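The flavor of endpoint pacing can be shown with a token bucket. ConnectX-9's actual algorithms run in NIC hardware and are programmable, so this toy model only illustrates the smoothing behavior, not the real implementation:

```python
# Toy token-bucket pacer: inject traffic at a sustained rate with a
# bounded burst, smoothing injection the way endpoint congestion
# control does. Illustration only, not the ConnectX-9 algorithm.
import time

class TokenBucketPacer:
    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate, self.burst = rate_bytes_per_s, burst_bytes
        self.tokens, self.last = burst_bytes, time.monotonic()

    def wait_to_send(self, nbytes: int) -> None:
        """Block until this flow has budget to inject nbytes."""
        assert nbytes <= self.burst, "message exceeds burst budget"
        while True:
            now = time.monotonic()
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)
```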
Multi-Tenant Isolation
- Enforces fairness and isolation per job/tenant
- Predictable network behavior regardless of other workloads
- Critical for shared AI infrastructure
Security Features
- Data-in-transit encryption (IPsec, PSP)
- Data-at-rest encryption for storage platforms
- Secure boot and firmware authentication
- Device attestation
3.5 BlueField-4 DPU: Operating System of the AI Factory
BlueField-4 is the processor powering the operating system of the AI factory, handling control, security, data movement, and orchestration independently of AI computation.
Architecture
Dual-die package combining:
- 64-core NVIDIA Grace CPU for infrastructure offload
- Integrated ConnectX-9 for tightly coupled data movement
- Up to 800 Gb/s ultra-low-latency connectivity
Generational Improvements
| Feature | BlueField-3 | BlueField-4 |
|---|---|---|
| Bandwidth | 400 Gb/s | 800 Gb/s |
| Compute | 16 Arm Cortex-A78 cores | 64 Arm Neoverse V2 cores |
| Performance | Baseline | 6x |
| Memory BW | 75 GB/s | 250 GB/s |
| Memory Capacity | 32 GB | 128 GB |
| Cloud Networking | 32K hosts | 128K hosts |
| Data Encryption | 400 Gb/s | 800 Gb/s |
| NVMe Storage | 10M IOPS @ 4 KB | 20M IOPS @ 4 KB |
ASTRA: Advanced Secure Trusted Resource Architecture
System-level trust architecture that:
- Establishes a trust domain within each compute tray
- Provides a single, trusted control point
- Enables secure bare-metal operation
- Enforces strong multi-tenant isolation
- Keeps infrastructure control trusted, independent of host software
Inference Context Memory Storage
AI-native infrastructure tier powered by BlueField-4:
- Pod-level access to shared inference context
- Efficient reuse of KV caches across requests (a toy prefix-keyed store is sketched below)
- Up to 5x boost in tokens per second
- Up to 5x power efficiency vs traditional storage
- Extends GPU memory capacity for long-context workloads
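A toy sketch of prefix-keyed KV reuse, the access pattern this tier accelerates; the store, hashing, and serialization here are illustrative assumptions, not NVIDIA's design:

```python
# Toy prefix-keyed KV-cache store: requests sharing a prompt prefix
# reuse cached KV blocks instead of recomputing prefill.
import hashlib

class KVCacheStore:
    def __init__(self):
        self._store: dict[str, bytes] = {}  # prefix hash -> serialized KV

    @staticmethod
    def _key(prefix: str) -> str:
        return hashlib.sha256(prefix.encode()).hexdigest()

    def get(self, prefix: str) -> bytes | None:
        """Return cached KV blocks if this prefix was seen before."""
        return self._store.get(self._key(prefix))

    def put(self, prefix: str, kv_blob: bytes) -> None:
        self._store[self._key(prefix)] = kv_blob
```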
3.6 Spectrum-6 Ethernet Switch: Scale-Out and Scale-Across
Spectrum-6 advances NVIDIA's purpose-built Ethernet fabric for accelerated computing with co-packaged optics.
Core Specifications
- 102.4 Tb/s per switch chip (2x vs Spectrum-4)
- 200G PAM4 SerDes
- 128x 800 Gb/s ports
- Hardware-assisted performance isolation
Spectrum-X Ethernet Photonics
Co-packaged optics deliver major efficiency gains:
- ~5x better network power efficiency
- Lower end-to-end latency
- Dramatically improved signal integrity
- Optical loss reduced from ~22 dB to ~4 dB
- Up to 64x better signal integrity (the dB arithmetic is checked below)
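The 64x figure follows directly from the loss numbers: an improvement from ~22 dB to ~4 dB is 18 dB, and decibels convert to a linear power ratio as 10^(dB/10):

```python
# Converting the optical-loss improvement to a linear ratio.
improvement_db = 22 - 4
print(f"{10 ** (improvement_db / 10):.0f}x")  # ~63x, matching "up to 64x"
```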
Platform Evolution
| Feature | Blackwell | Rubin |
|---|---|---|
| Platform | Spectrum-X SN5000 + CX-8 | Spectrum-X SN6000 + CX-9 |
| Chip | Spectrum-4 + CX-8 | Spectrum-6 + CX-9 |
| Switch BW | 51.2 Tb/s (64x 800G) | 102.4 Tb/s (128x 800G) |
| GPU BW | 800 Gb/s (2x400G) | 1600 Gb/s (2x800G) |
| SerDes | 100G PAM4 | 200G PAM4 |
| Protocol | Ethernet | Ethernet |
| Connectivity | OSFP | OSFP |
AI Traffic Optimization
Spectrum-X Ethernet handles variable all-to-all communication:
- Coordinated congestion control across fabric
- Adaptive routing for bursty MoE traffic
- Significantly faster job completion times
- Distance-aware congestion control for geo-distributed deployments
4. From Chips to Systems: Scaling to AI Factory
The progression from silicon to deployable AI factory follows a deliberate path:
4.1 Vera Rubin Superchip
The foundational compute building block combining:
- Rubin GPU with HBM4
- Vera CPU with LPDDR5X
- NVLink-C2C coherent interconnect
- Unified memory addressing
4.2 NVL72 Rack Architecture
Rack-scale integration features:
- 72 Rubin GPUs in single coherent domain
- NVLink 6 all-to-all fabric
- Integrated power and cooling
- BlueField-4 infrastructure control
- Spectrum-X scale-out connectivity
4.3 DGX SuperPOD: Deployment-Scale Unit
The DGX SuperPOD represents the deployment-scale unit of an AI factory:
- Multiple NVL72 racks
- Unified management and orchestration
- Predictable scaling characteristics
- Production-ready operations
- Enterprise support and services
5. Software and Developer Experience
The Rubin platform's software stack makes rack-scale systems programmable and accessible.
5.1 Software Foundation
- NVIDIA CUDA: Core parallel computing platform
- NVIDIA CUDA-X: Accelerated libraries for AI, HPC, data science
- NVIDIA NCCL: Optimized collective communications
- NVIDIA DOCA: Infrastructure services framework for BlueField DPUs
5.2 Framework Integration
Seamless integration with:
- PyTorch, TensorFlow, JAX
- NVIDIA NeMo for LLM training
- NVIDIA TensorRT for inference optimization
- NVIDIA Triton Inference Server
5.3 Programming Model
Key characteristics:
- Transparent scaling: Code written for a single GPU scales to all 72 (a standard DDP step is sketched after this list)
- Coherent memory: Unified addressing across CPU/GPU
- Automatic optimization: Frameworks leverage hardware features
- Backward compatibility: Existing CUDA code runs unmodified
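A standard PyTorch DDP step shows the transparency in practice; nothing here is Rubin-specific, and launcher setup (e.g. torchrun) is assumed:

```python
# The same training step runs on one GPU or across all 72 in a rack;
# NCCL and the NVLink fabric handle the gradient exchange.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(4096, 4096).cuda(), device_ids=[local_rank])
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 4096, device="cuda")
loss = model(x).square().mean()  # toy objective
loss.backward()                  # gradients all-reduced automatically
opt.step()
```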
6. Operating at AI Factory Scale
Production foundations ensure reliability, security, and efficiency.
6.1 Reliability and Uptime
- Hot-swappable components
- Dynamic traffic rerouting
- Continued operation during maintenance
- Fine-grained telemetry and monitoring
- Predictive failure detection
6.2 Security Architecture
Multi-layered security:
- Confidential computing support (Vera CPU)
- ASTRA trusted control plane (BlueField-4)
- Data-in-transit encryption (ConnectX-9)
- Secure boot and attestation
- Strong multi-tenant isolation
6.3 Energy Efficiency
- Co-packaged optics: ~5x network power efficiency
- Optimized power delivery per rack
- Efficient low-precision compute (NVFP4)
- Workload-aware power management
- Coherent memory reduces data movement
6.4 Ecosystem Readiness
- Support from major cloud providers
- OEM partnerships for deployment
- ISV software certification
- Open networking standards (Ethernet)
- Arm software ecosystem compatibility
7. Performance and Efficiency at Scale
The Rubin platform delivers measurable gains in real-world AI factory deployments.
7.1 Training Performance
- One-fourth as many GPUs required for equivalent training throughput
- Higher sustained utilization across training phases
- Faster convergence through improved communication
- Efficient scaling to thousands of GPUs
7.2 Inference Throughput
- 10x higher inference throughput for long-context workloads
- 10x lower cost per token through efficiency gains
- Up to 2x improvement in MoE all-to-all operations
- 5x tokens per second boost with Inference Context Memory Storage
7.3 Real-World Impact
Key metrics:
- Sustained GPU utilization >90% in production
- Predictable latency under variable load
- Linear scaling efficiency to rack scale
- Reduced time-to-deployment for new models
8. Why Rubin is the AI Factory Platform
8.1 Extreme Co-Design Delivers Results
The Rubin platform demonstrates that treating the data center as the unit of compute delivers:
- Predictable Performance: Behavior remains consistent under real workloads
- Economic Efficiency: Lower cost per token through sustained utilization
- Operational Scalability: Systems that can be deployed and maintained at scale
- Security by Design: Trust and isolation built into architecture
- Future-Proof Foundation: Coherent scaling path as AI demands grow
8.2 The Shift to Intelligence Production
"AI factories now function as always-on intelligence production systems, where efficiency in reasoning, context handling, and data movement determines performance."
The Rubin platform is purpose-built for this reality, where:
- Reasoning models require sustained multi-step inference
- Long contexts demand massive memory bandwidth
- Agentic workflows interleave compute and communication
- Multi-tenant operation requires strong isolation
- Cost per token determines competitive advantage
8.3 From Components to Coherent Systems
The six-chip architecture works as one:
- Vera CPU orchestrates data movement
- Rubin GPU executes transformer workloads
- NVLink 6 enables rack-scale coherence
- ConnectX-9 controls scale-out endpoints
- BlueField-4 operates the infrastructure
- Spectrum-6 provides efficient scale-out fabric
Together, they transform the AI factory from a collection of servers into a unified intelligence production system.
Conclusion
The NVIDIA Rubin platform represents a fundamental architectural shift in how AI infrastructure is designed, deployed, and operated. By applying extreme co-design across compute, networking, memory, and infrastructure, Rubin establishes a new foundation for producing intelligence efficiently, securely, and predictably at scale.
As AI workloads continue to evolve toward longer contexts, deeper reasoning, and more complex agentic behaviors, the Rubin platform provides the architectural foundation to meet these demands while maintaining the economic efficiency required for widespread AI deployment.
The era of AI factories has arrived, and the Rubin platform is purpose-built to power them.