TPU Pod Optical Interconnects vs NVIDIA NVSwitch Comparison
Comprehensive comparison of Google TPU Pod optical interconnects with NVIDIA NVSwitch, InfiniBand, Ethernet, and emerging datacenter interconnect technologies for AI infrastructure
Executive Summary
Modern AI accelerators require high-bandwidth, low-latency interconnects to scale from single devices to datacenter-scale systems. This document compares Google's TPU Pod Optical Interconnects with NVIDIA's NVSwitch, traditional InfiniBand, Ethernet-based solutions, and emerging technologies. Understanding these trade-offs is critical for designing scalable AI infrastructure.
1. TPU Pod Optical Interconnects vs. NVIDIA NVSwitch
Architecture Philosophy
Aspect | TPU Pod OCS | NVIDIA NVSwitch |
---|---|---|
Primary Goal | Scale-out (thousands of accelerators) | Scale-up (within server/rack) |
Technology | Optical circuit switching | Electrical packet switching |
Topology | 3D Torus (inter-chip) | Fat-tree/All-to-All (intra-server) |
Scope | Pod-wide (datacenter scale) | Server/rack-scale (up to 256 GPUs via NVLink Switch System) |
Detailed Technical Comparison
1.1 NVIDIA NVSwitch Architecture
NVSwitch 3rd Generation (Hopper/H100):
- 64 NVLink 4.0 ports per switch
- 3.2 TB/s aggregate bidirectional bandwidth (64 ports × 50 GB/s)
- 50 GB/s per port bidirectional (25 GB/s each direction); 900 GB/s total per H100 GPU across its 18 links
- Electrical signaling (SerDes-based)
- Hardware-accelerated collectives (SHARP)
NVSwitch Topology:
DGX H100 Configuration:
8 GPUs × 18 NVLink ports each = 144 total links
4 NVSwitch chips × 64 ports = 256 ports
→ Full non-blocking fabric within the server (each GPU's 18 links are split across the 4 switches)
Scale-out via InfiniBand:
8 ConnectX-7 NICs × 400 Gbps = 3.2 Tbps external bandwidth
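A quick back-of-the-envelope check of the accounting above; the per-GPU link count, switch radix, and NIC figures are the commonly cited HGX H100 values, so treat this as an illustration rather than a vendor specification:

```python
# Rough sanity check of the DGX H100 fabric accounting quoted above.
GPUS = 8
NVLINKS_PER_GPU = 18             # NVLink 4 links per H100
NVLINK_BW_GBS = 50               # GB/s bidirectional per link
NVSWITCHES = 4
PORTS_PER_SWITCH = 64
NICS = 8
NIC_GBPS = 400                   # ConnectX-7 line rate per NIC

gpu_side_links = GPUS * NVLINKS_PER_GPU            # 144 links enter the fabric
switch_ports = NVSWITCHES * PORTS_PER_SWITCH       # 256 switch ports available
per_gpu_bw_gbs = NVLINKS_PER_GPU * NVLINK_BW_GBS   # 900 GB/s per GPU
external_tbps = NICS * NIC_GBPS / 1000             # 3.2 Tbps of scale-out bandwidth

print(gpu_side_links, switch_ports, per_gpu_bw_gbs, external_tbps)
```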
NVSwitch 4th Generation (Blackwell/B200):
- Announced for 2024-2025
- Expected 5th gen NVLink support
- 1.8 TB/s per GPU bidirectional
- 72 NVLink ports per switch
1.2 TPU Pod Optical Interconnect Architecture
TPU v4 Pod Configuration:
- 4,096 TPU v4 chips
- 3D Torus: 16×16×16 topology
- 6 bidirectional optical links per chip (±X, ±Y, ±Z)
- ~100-200 Gbps per optical link
- Optical circuit switching for reconfigurability
TPU v5p Pod:
- SparseCores with enhanced interconnect
- Improved for mixture-of-experts models
- Higher bandwidth per link (~200-400 Gbps estimated)
- Enhanced collective communication primitives
Performance Characteristics
1.3 Bandwidth Analysis
NVSwitch (per GPU in DGX H100):
- Intra-server: 900 GB/s bidirectional to fabric
- All-reduce bandwidth: ~600-700 GB/s (accounting for algorithm efficiency)
- Inter-server: Limited by InfiniBand (400 Gbps = 50 GB/s per rail)
TPU Pod OCS (per chip):
- 6 links × ~150 Gbps = 900 Gbps total (112.5 GB/s)
- All-reduce bandwidth: ~80-90 GB/s (dimension-ordered reduction)
- Scales linearly with Pod size due to torus topology
Key Insight:
- NVSwitch: Higher per-accelerator bandwidth, limited to server scale
- TPU Pod: Lower per-chip bandwidth, but scales to thousands of chips
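A minimal sketch of how these effective numbers fall out of the raw injection bandwidths; the efficiency factors and the ~150 Gbps TPU link rate are this document's rough estimates, and `effective_allreduce_bw` is just an illustrative helper:

```python
# Approximate effective all-reduce bandwidth per accelerator.
# Efficiency factors and link rates follow this document's rough estimates.
def effective_allreduce_bw(injection_gb_s, efficiency):
    """Injection bandwidth (GB/s) scaled by an algorithm-efficiency factor."""
    return injection_gb_s * efficiency

nvswitch_gpu = effective_allreduce_bw(900.0, 0.7)       # ~630 GB/s inside a DGX H100
tpu_chip = effective_allreduce_bw(6 * 150 / 8, 0.75)    # 112.5 GB/s injection -> ~84 GB/s
ib_rail = 400 / 8                                       # 50 GB/s ceiling per 400 Gbps rail

print(f"NVSwitch: ~{nvswitch_gpu:.0f} GB/s, TPU chip: ~{tpu_chip:.0f} GB/s, "
      f"inter-server IB rail: {ib_rail:.0f} GB/s")
```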
1.4 Latency Analysis
Metric | NVSwitch | TPU Pod OCS |
---|---|---|
Switch latency | ~200-300 ns | ~500-800 ns |
Hop latency | 250-300 ns | 600-1000 ns |
Cable latency | 5 ns/m (electrical) | 5 ns/m (optical fiber) |
All-reduce (256 devices) | ~5-10 μs | ~20-30 μs |
All-reduce (4096 devices) | N/A (requires multi-hop IB) | ~40-60 μs |
Latency Trade-offs:
- Electrical switching (NVSwitch): Lower absolute latency
- Optical switching: Higher single-hop latency, but better scaling characteristics
- Circuit-switched optical: Deterministic, no congestion-related variance
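The all-reduce rows in the table can be roughly reproduced with a latency-term-only model (payload transfer time ignored). The hop latencies are the ballpark values from the table; the step counts assume a ring reduce-scatter/all-gather per torus dimension versus a simple combining tree for the switched fabric, both simplifying assumptions:

```python
import math

# Latency-term-only all-reduce model (ignores payload/serialization time).
# Hop latencies are the ballpark figures from the table above.

def torus_allreduce_us(k, hop_ns=800):
    """k x k x k torus: ring reduce-scatter + all-gather along each of 3 dims."""
    steps = 6 * (k - 1)
    return steps * hop_ns / 1e3

def tree_allreduce_us(n_devices, hop_ns=275):
    """Switched fabric: reduce up and broadcast down a binary combining tree."""
    steps = 2 * math.ceil(math.log2(n_devices))
    return steps * hop_ns / 1e3

print(tree_allreduce_us(256))    # ~4 us, same order as the table's 5-10 us
print(torus_allreduce_us(16))    # ~72 us, same order as the table's 40-60 us
```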
1.5 Power Efficiency
NVSwitch Power:
- NVSwitch 3rd gen: estimated ~300-600W per switch chip
- DGX H100 system: ~1,200-2,400W for the interconnect subsystem (4× NVSwitch + SerDes links)
- Power/bandwidth: ~0.33W per GB/s aggregate throughput
TPU Pod OCS Power:
- Optical transceiver: ~5-10W per 100 Gbps lane
- Per chip (6 links): ~60-120W for interconnect
- Optical switches (shared): Amortized across thousands of links
- Power/bandwidth: ~0.1-0.2W per GB/s aggregate throughput
Winner: TPU Pod OCS (~3× better power efficiency)
1.6 Scalability
NVSwitch Scalability Limits:
- Electrical reach: Limited to ~1-2m for high-speed SerDes
- Radix: 64 ports per switch chip (need hierarchical design for >64 GPUs)
- Power density: 500W+ per switch becomes thermal challenge
- Scale-out: Must transition to InfiniBand/Ethernet for multi-rack
NVSwitch Multi-Tier Architecture:
Tier 1: NVSwitch within server (8 GPUs)
Tier 2: InfiniBand leaf switches (up to 64 servers)
Tier 3: InfiniBand spine switches (scale to thousands)
Bandwidth degradation: 900 GB/s → 50 GB/s at tier boundaries
TPU Pod Scalability Advantages:
- Optical reach: 100m+ without active components
- Uniform topology: Same 3D torus across entire Pod
- Bisection bandwidth: Maintained across all scales
- No tier boundaries: Eliminates bandwidth cliffs
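To make the bisection-bandwidth point concrete: a k×k×k torus has 2 × k² links (counting wraparound) crossing any bisecting plane, so bisection bandwidth grows as N^(2/3) with Pod size. A minimal sketch, reusing the ~150 Gbps per-link estimate from earlier:

```python
# Bisection bandwidth of a k x k x k torus: 2 * k^2 links cross the cut
# (the factor of 2 counts the wraparound links). The ~150 Gbps link rate
# is this document's estimate, used purely for illustration.
def torus_bisection_tbps(k, link_gbps=150):
    crossing_links = 2 * k * k
    return crossing_links * link_gbps / 1000.0

for k in (8, 16, 32):
    print(f"{k}x{k}x{k} ({k**3} chips): ~{torus_bisection_tbps(k):.1f} Tbps bisection")
# 16x16x16 (4,096 chips): ~76.8 Tbps, with no tier boundary to throttle it
```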
2. Comparison with Traditional Datacenter Interconnects
2.1 InfiniBand (Mellanox/NVIDIA)
InfiniBand HDR (200 Gbps):
- Packet-switched, lossless via credit-based link-level flow control
- RDMA support for low latency
- Congestion control: ECN-style marking (FECN/BECN), adaptive routing
- Widely deployed in HPC and AI clusters
InfiniBand NDR (400 Gbps):
- Current state-of-the-art
- 4 lanes × 100 Gbps (PAM4) per 400 Gbps port
- ~1-2 μs latency for RDMA operations
- SHARP (in-network collectives)
TPU Pod OCS vs. InfiniBand:
Feature | InfiniBand NDR | TPU Pod OCS |
---|---|---|
Technology | Electrical (copper) or optical (AOC) | Pure optical |
Switching | Packet-switched | Circuit-switched |
Bandwidth/port | 400 Gbps | 100-400 Gbps |
Latency | 1-2 μs (RDMA) | 0.5-1 μs (circuit) |
Congestion | Possible (needs PFC/ECN) | None (dedicated circuits) |
Reconfigurability | Fixed topology | Dynamic via OCS |
Cost | High (switches, adapters, cables) | Very high (optical components) |
Adoption | Widespread (open standard) | Google proprietary |
Key Differences:
- InfiniBand: General-purpose, flexible, industry standard
- TPU Pod OCS: Purpose-built for AI training, optimized for collectives
- InfiniBand: Better for mixed workloads (storage, compute, network)
- TPU Pod OCS: Better for homogeneous, bulk-synchronous AI training
2.2 Ethernet-Based Solutions
RoCE (RDMA over Converged Ethernet):
- 100/200/400 GbE with RDMA
- Lossless Ethernet via PFC/ECN
- Lower cost than InfiniBand
- Used in hyperscaler AI clusters (Meta, Microsoft)
Ultra Ethernet Consortium (2023+):
- Industry consortium: AMD, Intel, Meta, Microsoft, etc.
- Target: AI/ML optimized Ethernet
- Goals: Lower latency, better congestion control, in-network collectives
- Competitive with InfiniBand for AI workloads
TPU Pod OCS vs. Ethernet:
Feature | 400G Ethernet | TPU Pod OCS |
---|---|---|
Latency | 2-5 μs (RoCE) | 0.5-1 μs |
Jitter | Moderate (packet-switched) | Very low (circuit-switched) |
Congestion | Requires careful tuning | None |
Cost | Lower (commodity) | Higher (custom) |
Ecosystem | Broad vendor support | Google only |
Collective ops | Software-based | Hardware-optimized |
Use Case Fit:
- Ethernet: Cost-sensitive deployments, mixed workloads, vendor diversity
- TPU Pod OCS: Maximum performance, homogeneous AI training, single vendor
3. Emerging Interconnect Technologies
3.1 Co-Packaged Optics (CPO)
Technology:
- Optical transceivers integrated directly with switch/accelerator package
- Eliminates electrical reach limitations
- Reduces power and latency
Status:
- Early deployment (Ayar Labs, Intel, others)
- Expected to mature 2025-2027
- Could enable 10+ Tbps per chip
Comparison to TPU Pod OCS:
- CPO: Next-generation optical, tighter integration
- TPU Pod: Current-generation optical, module-based
- Both: Optical switching with similar topology options
3.2 Silicon Photonics
Technology:
- Photonic circuits fabricated in silicon
- Integrates lasers, modulators, detectors on-chip
- Potential for massive I/O bandwidth (100+ Tbps)
Industry Players:
- Intel Silicon Photonics
- Broadcom (optical ASICs)
- Ayar Labs (TeraPHY, SuperNova)
- Lightmatter (optical interconnect/compute)
Relationship to TPU Pods:
- TPU v4/v5: Likely use discrete optical modules
- Future TPUs: May adopt silicon photonics
- Same architecture (3D torus, OCS) with better technology
3.3 AMD Infinity Fabric
AMD MI300 Interconnect:
- Infinity Fabric both within the package (chiplet-to-chiplet) and between GPUs
- 896 GB/s per GPU of Infinity Fabric bandwidth to peer GPUs (7 links × 128 GB/s)
- Scale-out via InfiniBand/Ethernet (same as NVIDIA)
Comparison to NVSwitch/TPU:
- Similar philosophy to NVSwitch (intra-package bandwidth)
- Requires external fabric (IB/Ethernet) for multi-node
- Less integrated than NVSwitch at server level
3.4 Intel Gaudi (Habana)
Gaudi 2/3 Interconnect:
- 24× 100 GbE RoCE ports per accelerator (Gaudi 2); Gaudi 3 moves to 24× 200 GbE
- Integrated NIC (no external switches for small clusters)
- All-to-all connectivity in scale-up configurations
Comparison to TPU Pod:
- Gaudi: Ethernet-based, lower cost, less specialized
- TPU Pod: Optical-based, higher performance, custom
- Gaudi: Better for cost-sensitive deployments
- TPU Pod: Better for maximum scale and performance
4. Topology Deep-Dive
4.1 NVSwitch: Fat-Tree/Clos
            [Spine Switches]
           /       |        \
      [Leaf]    [Leaf]    [Leaf]
      /    \    /    \    /    \
    GPU   GPU GPU   GPU GPU   GPU
Characteristics:
- Non-blocking within tier
- Oversubscription between tiers (typically 1:1 to 4:1)
- Requires more switches for larger scales
- Well-understood, extensively studied
4.2 TPU Pod: 3D Torus
      Z
      |
      |_____ Y
     /
    X
Each node connects to 6 neighbors: ±X, ±Y, ±Z
16×16×16 = 4,096 nodes
Characteristics:
- Fixed degree (6) regardless of scale
- Multiple paths between nodes (fault-tolerant)
- Diameter: O(N^(1/3)) for N nodes; a k×k×k torus has a maximum hop count of 3 × (k/2), i.e. 24 hops for 16×16×16
- Optimized for nearest-neighbor and dimension-ordered collectives
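The wraparound neighbor relation is simple to write down explicitly; the coordinate scheme below is illustrative, not Google's actual chip addressing:

```python
# Neighbors of a node in a k x k x k torus with wraparound (illustrative addressing).
def torus_neighbors(x, y, z, k=16):
    """Return the 6 neighbors (+/-X, +/-Y, +/-Z) of node (x, y, z)."""
    return [
        ((x + 1) % k, y, z), ((x - 1) % k, y, z),   # +/-X
        (x, (y + 1) % k, z), (x, (y - 1) % k, z),   # +/-Y
        (x, y, (z + 1) % k), (x, y, (z - 1) % k),   # +/-Z
    ]

print(torus_neighbors(0, 0, 0))   # even a "corner" node has 6 neighbors thanks to wraparound
# Maximum hop count (diameter) for k=16: 3 * (16 // 2) = 24 hops
```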
All-Reduce on 3D Torus:
# Simplified sketch of a dimension-ordered all-reduce; send/recv/reduce stand in
# for the chip-to-chip ICI primitives, and local_data is a list of data chunks.
def all_reduce_3d_torus(local_data, dims, send, recv, reduce):
    # Phases 1-3: ring reduce-scatter along X, then Y, then Z. Each step pushes
    # one chunk to the "+" neighbor and combines the chunk arriving from the
    # "-" neighbor, so every link carries unique data concurrently.
    for axis, size in enumerate(dims):            # axis 0 = X, 1 = Y, 2 = Z
        for i in range(size - 1):
            send(local_data[i], (axis, +1))
            remote = recv((axis, -1))
            local_data[i] = reduce(local_data[i], remote)

    # Phases 4-6: all-gather in the reverse order (Z, Y, X). Same ring pattern,
    # but received chunks simply overwrite local ones (no reduction needed).
    for axis in reversed(range(len(dims))):
        for i in range(dims[axis] - 1):
            send(local_data[i], (axis, +1))
            local_data[i] = recv((axis, -1))

    return local_data                             # fully reduced copy on every chip
Bandwidth Efficiency:
- Utilizes all links simultaneously
- No congestion (each link carries unique data)
- Latency term: O(N^(1/3)) steps for N chips (6 × (k - 1) ring steps on a k×k×k torus), versus O(log N) for switch-based tree reductions
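Putting the latency and bandwidth terms together, a standard alpha-beta style cost model for this dimension-ordered all-reduce looks roughly like the sketch below; the link rate and per-hop latency are the ballpark estimates used throughout this document:

```python
# Simple alpha-beta cost model for a dimension-ordered all-reduce on a k^3 torus.
# Link rate (~150 Gbps) and per-hop latency (~0.8 us) are this document's estimates.
def torus_allreduce_time_ms(payload_gb, k=16, link_gbps=150, hop_us=0.8):
    per_chip_gb_s = 6 * link_gbps / 8                        # all six links used concurrently
    latency_ms = 6 * (k - 1) * hop_us / 1000.0               # ring steps across 3 dimensions
    bandwidth_ms = 2 * payload_gb / per_chip_gb_s * 1000.0   # reduce-scatter + all-gather
    return latency_ms + bandwidth_ms

# A 1 GB gradient buffer per chip is dominated by the bandwidth term (~18 ms),
# which is why per-link bandwidth matters more than hop latency at this scale.
print(f"{torus_allreduce_time_ms(1.0):.1f} ms")
```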
4.3 Dragonfly/Dragonfly+ (HPC systems)
Used in: Cray Shasta systems, some supercomputers
Characteristics:
- Hierarchical: intra-group (all-to-all), inter-group (adaptive routing)
- High radix switches (64+ ports)
- Better diameter than torus for large systems
vs. TPU Pod:
- Dragonfly: Better for communication patterns with locality
- TPU Pod: Better for uniform, bulk-synchronous patterns
- Dragonfly: Uses adaptive routing (complex)
- TPU Pod: Uses dimension-ordered routing (simple, deterministic)
5. Cost and TCO Analysis
5.1 Cost Breakdown (Estimated per 256 Accelerators)
Component | NVSwitch + IB | TPU Pod OCS | Ethernet RoCE |
---|---|---|---|
Accelerators | $5-8M | $4-7M | $3-5M |
Interconnect HW | $500K-1M | $1-2M | $200-500K |
Optical transceivers | $100K | $400-800K | $100-200K |
Switches | $500K | Included in OCS | $300-500K |
Cables | $100K | $200K | $100K |
Power (3 yr) | $800K | $600K | $1M |
Cooling (3 yr) | $400K | $300K | $500K |
Total (3 yr) | ~$8-11M | ~$7-11M | ~$5-8M |
Key Insights:
- Accelerator cost dominates (70-80% of total)
- TPU Pod OCS: Higher upfront, lower operational costs
- Ethernet: Lowest upfront and operational (with tradeoffs)
- NVSwitch: Middle ground, but requires expensive IB for scale-out
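The three-year totals in the table roll up as sketched below, taking the midpoint of each quoted range; all figures are rough illustrations, not quotes:

```python
# Three-year TCO roll-up using midpoints of the ranges quoted in the table above
# (millions of USD for a 256-accelerator cluster; illustrative only).
costs_musd = {
    "NVSwitch + IB": {"accelerators": 6.5, "interconnect": 0.75, "optics": 0.1,
                      "switches": 0.5, "cables": 0.1, "power_3yr": 0.8, "cooling_3yr": 0.4},
    "TPU Pod OCS":   {"accelerators": 5.5, "interconnect": 1.5, "optics": 0.6,
                      "switches": 0.0, "cables": 0.2, "power_3yr": 0.6, "cooling_3yr": 0.3},
    "Ethernet RoCE": {"accelerators": 4.0, "interconnect": 0.35, "optics": 0.15,
                      "switches": 0.4, "cables": 0.1, "power_3yr": 1.0, "cooling_3yr": 0.5},
}

for system, parts in costs_musd.items():
    total = sum(parts.values())
    print(f"{system}: ~${total:.1f}M over 3 years "
          f"(accelerators: ${parts['accelerators']:.1f}M, the dominant line item)")
```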
5.2 Performance per Dollar
Assuming 256-accelerator cluster for large model training:
Metric | NVSwitch+IB | TPU Pod | Ethernet |
---|---|---|---|
Training throughput | 1.0× (baseline) | 0.9-1.1× | 0.7-0.9× |
3-year TCO | $9M | $9M | $6.5M |
Perf/$ (normalized) | 1.0× | 0.9-1.1× | 1.0-1.2× |
Interpretation:
- Ethernet: Best cost/performance for smaller scales
- NVSwitch: Best absolute performance within server
- TPU Pod: Best performance at massive scale (1000+ chips)
6. Use Case Fit
6.1 When to Use NVSwitch
✅ Ideal for:
- Server-scale deployments (8-64 GPUs)
- Mixed precision training
- Models with high inter-GPU communication (model parallelism)
- When maximum per-GPU bandwidth is critical
- Existing NVIDIA ecosystem/CUDA code
❌ Not ideal for:
- Cost-sensitive deployments
- Scales >256 GPUs (requires expensive IB fabric)
- Heterogeneous workloads (overkill for sparse communication)
6.2 When to Use TPU Pod OCS
✅ Ideal for:
- Massive scale training (1000+ accelerators)
- Bulk-synchronous parallel workloads
- Google ecosystem (TensorFlow, JAX)
- Energy efficiency is critical
- Long-term, homogeneous AI training
❌ Not ideal for:
- Small-scale deployments (<256 chips)
- Mixed workloads (storage, compute, network)
- Vendor diversity requirements
- Non-Google frameworks (PyTorch, though improving)
6.3 When to Use InfiniBand/Ethernet
✅ Ideal for:
- Multi-vendor environments
- Mixed workloads (AI + HPC + storage)
- Cost-sensitive deployments
- Need for industry-standard ecosystem
- Future flexibility (upgrade paths, vendor changes)
❌ Not ideal for:
- Absolute maximum performance
- Lowest latency requirements
- Extremely large-scale synchronous training
7. Future Trends
7.1 Convergence
Key Observation: Technologies are converging:
- NVSwitch may adopt optical links for longer reach
- TPU Pods may incorporate electrical for short-reach
- InfiniBand/Ethernet adding AI-specific features
7.2 In-Network Computing
Trend: Moving computation into the network:
- NVIDIA SHARP (in-network reductions)
- AMD Infinity Fabric (on-path computation)
- Optical switches with photonic computing elements
Impact on TPU Pods:
- Already has optimized collective primitives
- Could add in-network reduction for better scalability
7.3 Disaggregation
Trend: Separating compute, memory, and storage:
- CXL (Compute Express Link) for memory pooling
- Optical interconnects enabling disaggregation
- Composable infrastructure
TPU Pod Evolution:
- Future versions may disaggregate memory
- Optical fabric enables flexible resource allocation
8. Key Takeaways for Interviews
Quick Comparison Table
Factor | NVSwitch | TPU Pod OCS | InfiniBand | Ethernet |
---|---|---|---|---|
Scale sweet spot | 8-256 GPUs | 512-4096 TPUs | 64-1024 nodes | 32-512 nodes |
Technology | Electrical | Optical | Electrical/optical | Electrical/optical |
Bandwidth per accelerator | 900 GB/s | 100-200 GB/s | 50-100 GB/s | 50-100 GB/s |
Latency | Lowest | Low | Medium | Medium-high |
Power efficiency | Medium | Best | Medium | Worst |
Cost | High | Highest | High | Lowest |
Flexibility | Low | Lowest | High | Highest |
Adoption | NVIDIA only | Google only | Broad (HPC/AI) | Universal |
Architectural Insights
- No single winner: Choice depends on workload, scale, and constraints
- Optical advantage: Scales better with distance and power, but costs more
- Electrical advantage: Lower latency at short reach, lower cost, mature
- Topology matters: Torus (TPU) vs. Fat-tree (NVSwitch) impacts scalability
- Circuit vs. Packet: Circuit switching (OCS) is deterministic but less flexible
Discussion Points
If asked "Which is better, TPU Pod or NVSwitch?"
Answer framework:
- "It depends on the use case and scale..."
- "For intra-server: NVSwitch wins (higher bandwidth, lower latency)"
- "For massive scale: TPU Pod wins (better power, uniform topology, no tier boundaries)"
- "For cost: Neither - Ethernet-based solutions are more economical"
- "For ecosystem: NVSwitch (broader adoption) vs. TPU Pod (Google-only)"
If asked "Why did Google choose optical for TPU Pods?"
Key points:
- Scale: Optical scales better to thousands of nodes
- Power: 3-5× better power efficiency than electrical at high bandwidth
- Distance: Can span large datacenter without repeaters
- Reconfigurability: OCS allows dynamic topology changes
- Workload fit: AI training is bulk-synchronous, benefits from circuit switching
References and Further Reading
- Google TPU Papers:
  - Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit" (ISCA 2017)
  - Jouppi et al., "A Domain-Specific Supercomputer for Training Deep Neural Networks" (CACM 2020)
- NVIDIA NVSwitch:
  - NVIDIA Hopper Architecture Whitepaper
  - NVLink and NVSwitch Documentation
- Optical Interconnects:
  - Kachris et al., "Optical Interconnects for Data Centers" (2013)
  - Sun et al., "LIONS: An RDMA-Oriented Design for Low-Latency Optical Switches" (SIGCOMM 2020)
- Network Topologies:
  - Dally & Towles, "Principles and Practices of Interconnection Networks" (2004)
  - Kim et al., "Technology-Driven, Highly-Scalable Dragonfly Topology" (ISCA 2008)
- Industry Reports:
  - Omdia: "AI Infrastructure Market Analysis"
  - Dell'Oro Group: "Data Center Network Equipment Report"