
TPU Pod Optical Interconnects vs NVIDIA NVSwitch Comparison

Comprehensive comparison of Google TPU Pod optical interconnects with NVIDIA NVSwitch, InfiniBand, Ethernet, and emerging datacenter interconnect technologies for AI infrastructure

Updated 10/1/2025

Prerequisites

Make sure you're familiar with these concepts before diving in:

Understanding of TPU Pod architecture and optical interconnects
Knowledge of GPU architecture and NVLink/NVSwitch
Familiarity with datacenter networking (InfiniBand, Ethernet)
Understanding of AI training scalability challenges

Learning Objectives

By the end of this topic, you will be able to:

Compare optical vs electrical interconnect technologies for AI
Evaluate trade-offs between scale-up and scale-out architectures
Analyze network topology impacts on collective communication
Understand cost, power, and performance trade-offs at datacenter scale
Assess emerging interconnect technologies (CPO, silicon photonics)

Executive Summary

Modern AI accelerators require high-bandwidth, low-latency interconnects to scale from single devices to datacenter-scale systems. This document compares Google's TPU Pod Optical Interconnects with NVIDIA's NVSwitch, traditional InfiniBand, Ethernet-based solutions, and emerging technologies. Understanding these trade-offs is critical for designing scalable AI infrastructure.


1. TPU Pod Optical Interconnects vs. NVIDIA NVSwitch

Architecture Philosophy

| Aspect | TPU Pod OCS | NVIDIA NVSwitch |
|---|---|---|
| Primary goal | Scale-out (thousands of accelerators) | Scale-up (within server/rack) |
| Technology | Optical circuit switching | Electrical packet switching |
| Topology | 3D torus (inter-chip) | Fat-tree/all-to-all (intra-server) |
| Scope | Pod-wide (datacenter scale) | Server-wide (up to 256 GPUs with NVSwitch) |

Detailed Technical Comparison

1.1 NVIDIA NVSwitch Architecture

NVSwitch 3rd Generation (Hopper/H100):

  • 64 NVLink 4.0 ports per switch
  • 3.2 TB/s aggregate bidirectional bandwidth per switch
  • 50 GB/s per port bidirectional (25 GB/s each direction); 900 GB/s per GPU across 18 links
  • Electrical signaling (SerDes-based)
  • Hardware-accelerated collectives (SHARP)

NVSwitch Topology:

DGX H100 Configuration:
8 GPUs × 18 NVLink ports each = 144 total links
4 NVSwitch chips × 64 ports = 256 ports
→ Full non-blocking fabric within server
 
Scale-out via InfiniBand:
8 ConnectX-7 NICs × 400 Gbps = 3.2 Tbps external bandwidth
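
As a quick sanity check on the arithmetic above, the sketch below recomputes the link, port, and bandwidth totals from the figures quoted in this section (the constants and names are illustrative, not NVIDIA-published identifiers):

# Back-of-the-envelope check of the DGX H100 fabric figures quoted above.
GPUS_PER_NODE = 8
NVLINK_PORTS_PER_GPU = 18        # NVLink 4.0 links per H100
NVLINK_GBS_PER_LINK = 50         # ~50 GB/s bidirectional per NVLink 4.0 link
NVSWITCH_CHIPS = 4
PORTS_PER_NVSWITCH = 64
NICS_PER_NODE = 8
NIC_GBITS = 400                  # ConnectX-7, Gbit/s each

nvlink_links = GPUS_PER_NODE * NVLINK_PORTS_PER_GPU        # 144 links to terminate
switch_ports = NVSWITCH_CHIPS * PORTS_PER_NVSWITCH         # 256 ports available
per_gpu_bw = NVLINK_PORTS_PER_GPU * NVLINK_GBS_PER_LINK    # 900 GB/s per GPU
external_tbits = NICS_PER_NODE * NIC_GBITS / 1000          # 3.2 Tbit/s per node

print(nvlink_links, switch_ports, per_gpu_bw, external_tbits)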

NVSwitch 4th Generation (Blackwell/B200):

  • Announced for 2024-2025
  • Expected 5th gen NVLink support
  • 1.8 TB/s per GPU bidirectional
  • 72 NVLink ports per switch

1.2 TPU Pod Optical Interconnect Architecture

TPU v4 Pod Configuration:

  • 4,096 TPU v4 chips
  • 3D Torus: 16×16×16 topology
  • 6 bidirectional optical links per chip (±X, ±Y, ±Z)
  • ~100-200 Gbps per optical link
  • Optical circuit switching for reconfigurability
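
Putting the Pod-level numbers together, a minimal sketch using this document's ~150 Gbit/s mid-range estimate per optical link (not an official Google figure):

# Aggregate link count and raw optical bandwidth of a 16 x 16 x 16 TPU v4 Pod.
K = 16
CHIPS = K ** 3                           # 4,096 chips
LINKS_PER_CHIP = 6                       # +/-X, +/-Y, +/-Z
LINK_GBITS = 150                         # hedged mid-range estimate per optical link

total_links = CHIPS * LINKS_PER_CHIP // 2        # each link is shared by two chips
pod_bandwidth_tbits = total_links * LINK_GBITS / 1000

print(total_links, "bidirectional links,", pod_bandwidth_tbits, "Tbit/s raw fabric bandwidth")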

TPU v5p Pod:

  • SparseCores with enhanced interconnect
  • Improved for mixture-of-experts models
  • Higher bandwidth per link (~200-400 Gbps estimated)
  • Enhanced collective communication primitives

Performance Characteristics

1.3 Bandwidth Analysis

NVSwitch (per GPU in DGX H100):

  • Intra-server: 900 GB/s bidirectional to fabric
  • All-reduce bandwidth: ~600-700 GB/s (accounting for algorithm efficiency)
  • Inter-server: Limited by InfiniBand (400 Gbps = 50 GB/s per rail)

TPU Pod OCS (per chip):

  • 6 links × ~150 Gbps = 900 Gbps total (112.5 GB/s)
  • All-reduce bandwidth: ~80-90 GB/s (dimension-ordered reduction)
  • Scales linearly with Pod size due to torus topology

Key Insight:

  • NVSwitch: Higher per-accelerator bandwidth, limited to server scale
  • TPU Pod: Lower per-chip bandwidth, but scales to thousands of chips
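
The sketch below restates that insight numerically, using only the per-link figures already quoted in this section (estimates, not measurements):

# Per-accelerator fabric bandwidth, from the figures above.
nvswitch_per_gpu_gbs = 18 * 50          # 18 NVLink 4.0 links x 50 GB/s = 900 GB/s
tpu_per_chip_gbs = 6 * 150 / 8          # 6 optical links x ~150 Gbit/s = 112.5 GB/s

# NVSwitch holds its 900 GB/s only inside one server (8 GPUs); beyond that the
# per-GPU rate drops to the InfiniBand rail (~50 GB/s). The torus keeps the same
# ~112.5 GB/s per chip whether the job spans 64 chips or 4,096.
print(round(nvswitch_per_gpu_gbs / tpu_per_chip_gbs, 1), "x per-accelerator advantage for NVSwitch in-server")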

1.4 Latency Analysis

| Metric | NVSwitch | TPU Pod OCS |
|---|---|---|
| Switch latency | ~200-300 ns | ~500-800 ns |
| Hop latency | 250-300 ns | 600-1000 ns |
| Cable latency | 5 ns/m (electrical) | 5 ns/m (optical fiber) |
| All-reduce (256 devices) | ~5-10 μs | ~20-30 μs |
| All-reduce (4096 devices) | N/A (requires multi-hop IB) | ~40-60 μs |

Latency Trade-offs:

  • Electrical switching (NVSwitch): Lower absolute latency
  • Optical switching: Higher single-hop latency, but better scaling characteristics
  • Circuit-switched optical: Deterministic, no congestion-related variance
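
One way to see the scaling argument: in a bandwidth-optimal reduce-scatter plus all-gather, the number of serial communication steps (and hence the latency term) grows with the ring length of each dimension rather than with the total device count. A minimal sketch:

# Why a 3D torus helps at scale: serial steps grow with the per-dimension ring
# length k, not with the total device count N.
def ring_steps(n):
    return 2 * (n - 1)                 # flat ring over n devices

def torus3d_steps(k):
    return 3 * ring_steps(k)           # one ring per dimension of a k x k x k torus

N = 4096
k = round(N ** (1 / 3))                # 16 for a 16x16x16 Pod
print(ring_steps(N))                   # 8190 steps for a single flat ring
print(torus3d_steps(k))                # 90 steps for the 3D torus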

1.5 Power Efficiency

NVSwitch Power:

  • NVSwitch 3rd gen: ~500-600W per switch chip
  • DGX H100 system: ~1200W for interconnect subsystem (4× NVSwitch + traces)
  • Power/bandwidth: ~0.33W per GB/s aggregate throughput

TPU Pod OCS Power:

  • Optical transceiver: ~5-10W per 100 Gbps lane
  • Per chip (6 links): ~60-120W for interconnect
  • Optical switches (shared): Amortized across thousands of links
  • Power/bandwidth: ~0.1-0.2W per GB/s aggregate throughput

Winner: TPU Pod OCS (~3× better power efficiency)

1.6 Scalability

NVSwitch Scalability Limits:

  1. Electrical reach: Limited to ~1-2m for high-speed SerDes
  2. Radix: 64 ports per switch chip (need hierarchical design for >64 GPUs)
  3. Power density: 500W+ per switch becomes thermal challenge
  4. Scale-out: Must transition to InfiniBand/Ethernet for multi-rack

NVSwitch Multi-Tier Architecture:

Tier 1: NVSwitch within server (8 GPUs)
Tier 2: InfiniBand leaf switches (up to 64 servers)
Tier 3: InfiniBand spine switches (scale to thousands)
 
Bandwidth degradation: 900 GB/s → 50 GB/s at tier boundaries
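
The bandwidth cliff at the server boundary falls directly out of the per-node figures above; a minimal sketch:

# Per-GPU bandwidth inside vs. outside the NVLink domain (figures quoted above).
intra_node_gbs = 900                       # NVLink/NVSwitch bandwidth per GPU
inter_node_gbs = 8 * 50 / 8                # 8 NICs x 50 GB/s shared by 8 GPUs = 50 GB/s per GPU

print(intra_node_gbs / inter_node_gbs, "x drop when traffic crosses the tier boundary")   # 18x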

TPU Pod Scalability Advantages:

  1. Optical reach: 100m+ without active components
  2. Uniform topology: Same 3D torus across entire Pod
  3. Bisection bandwidth: Maintained across all scales
  4. No tier boundaries: Eliminates bandwidth cliffs
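
To make point 3 concrete: for a k-ary 3D torus, the bisection cut crosses 2·k² bidirectional links (see Dally & Towles in the references), so total bisection bandwidth grows with the cross-sectional area as the Pod scales. A rough sketch using this document's per-link estimate:

# Bisection links of a k x k x k torus: cutting the machine in half crosses
# 2 * k^2 bidirectional links (wraparound included).
def torus_bisection_links(k):
    return 2 * k * k

LINK_GBITS = 150                              # hedged per-link estimate from above
for k in (8, 12, 16):
    links = torus_bisection_links(k)
    print(k ** 3, "chips:", links, "links,", links * LINK_GBITS / 1000, "Tbit/s bisection")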

2. Comparison with Traditional Datacenter Interconnects

2.1 InfiniBand (Mellanox/NVIDIA)

InfiniBand HDR (200 Gbps):

  • Packet-switched, lossy or lossless (via PFC)
  • RDMA support for low-latency
  • Congestion control: ECN, PFC, adaptive routing
  • Widely deployed in HPC and AI clusters

InfiniBand NDR (400 Gbps):

  • Current state-of-the-art
  • 4 lanes × 100 Gbps (PAM4)
  • ~1-2 μs latency for RDMA operations
  • SHARP (in-network collectives)

TPU Pod OCS vs. InfiniBand:

| Feature | InfiniBand NDR | TPU Pod OCS |
|---|---|---|
| Technology | Electrical (copper) or optical (AOC) | Pure optical |
| Switching | Packet-switched | Circuit-switched |
| Bandwidth/port | 400 Gbps | 100-400 Gbps |
| Latency | 1-2 μs (RDMA) | 0.5-1 μs (circuit) |
| Congestion | Possible (needs PFC/ECN) | None (dedicated circuits) |
| Reconfigurability | Fixed topology | Dynamic via OCS |
| Cost | High (per-port licensing, switches) | Very high (optical components) |
| Adoption | Widespread (open standard) | Google proprietary |

Key Differences:

  • InfiniBand: General-purpose, flexible, industry standard
  • TPU Pod OCS: Purpose-built for AI training, optimized for collectives
  • InfiniBand: Better for mixed workloads (storage, compute, network)
  • TPU Pod OCS: Better for homogeneous, bulk-synchronous AI training

2.2 Ethernet-Based Solutions

RoCE (RDMA over Converged Ethernet):

  • 100/200/400 GbE with RDMA
  • Lossless Ethernet via PFC/ECN
  • Lower cost than InfiniBand
  • Used in hyperscaler AI clusters (Meta, Microsoft)

Ultra Ethernet Consortium (2023+):

  • Industry consortium: AMD, Intel, Meta, Microsoft, etc.
  • Target: AI/ML optimized Ethernet
  • Goals: Lower latency, better congestion control, in-network collectives
  • Competitive with InfiniBand for AI workloads

TPU Pod OCS vs. Ethernet:

| Feature | 400G Ethernet | TPU Pod OCS |
|---|---|---|
| Latency | 2-5 μs (RoCE) | 0.5-1 μs |
| Jitter | Moderate (packet-switched) | Very low (circuit-switched) |
| Congestion | Requires careful tuning | None |
| Cost | Lower (commodity) | Higher (custom) |
| Ecosystem | Broad vendor support | Google only |
| Collective ops | Software-based | Hardware-optimized |

Use Case Fit:

  • Ethernet: Cost-sensitive deployments, mixed workloads, vendor diversity
  • TPU Pod OCS: Maximum performance, homogeneous AI training, single vendor

3. Emerging Interconnect Technologies

3.1 Co-Packaged Optics (CPO)

Technology:

  • Optical transceivers integrated directly with switch/accelerator package
  • Eliminates electrical reach limitations
  • Reduces power and latency

Status:

  • Early deployment (Ayar Labs, Intel, others)
  • Expected to mature 2025-2027
  • Could enable 10+ Tbps per chip

Comparison to TPU Pod OCS:

  • CPO: Next-generation optical, tighter integration
  • TPU Pod: Current-generation optical, module-based
  • Both: Optical switching with similar topology options

3.2 Silicon Photonics

Technology:

  • Photonic circuits fabricated in silicon
  • Integrates lasers, modulators, detectors on-chip
  • Potential for massive I/O bandwidth (100+ Tbps)

Industry Players:

  • Intel Silicon Photonics
  • Broadcom (optical ASICs)
  • Ayar Labs (TeraPHY, SuperNova)
  • Lightmatter (optical interconnect/compute)

Relationship to TPU Pods:

  • TPU v4/v5: Likely use discrete optical modules
  • Future TPUs: May adopt silicon photonics
  • Same architecture (3D torus, OCS) with better technology

3.3 AMD Infinity Fabric

AMD MI300 Interconnect:

  • Infinity Fabric within package (chiplet-to-chiplet)
  • 896 GB/s per GPU via Infinity Fabric links
  • Scale-out via InfiniBand/Ethernet (same as NVIDIA)

Comparison to NVSwitch/TPU:

  • Similar philosophy to NVSwitch (intra-package bandwidth)
  • Requires external fabric (IB/Ethernet) for multi-node
  • Less integrated than NVSwitch at server level

3.4 Intel Gaudi (Habana)

Gaudi 2/3 Interconnect:

  • 24× 100 GbE RoCE ports per accelerator (Gaudi 2; Gaudi 3 raises the per-port speed)
  • Integrated NIC (no external switches for small clusters)
  • All-to-all connectivity in scale-up configurations
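
The aggregate scale-out bandwidth per Gaudi accelerator follows directly from the port count; a minimal sketch assuming the Gaudi 2 port speed listed above:

# Aggregate RoCE bandwidth per Gaudi accelerator.
ports = 24
port_gbits = 100                      # Gaudi 2; Gaudi 3 raises the per-port speed
total_gbits = ports * port_gbits      # 2,400 Gbit/s
print(total_gbits / 8, "GB/s of integrated scale-out bandwidth per accelerator")   # 300 GB/s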

Comparison to TPU Pod:

  • Gaudi: Ethernet-based, lower cost, less specialized
  • TPU Pod: Optical-based, higher performance, custom
  • Gaudi: Better for cost-sensitive deployments
  • TPU Pod: Better for maximum scale and performance

4. Topology Deep-Dive

4.1 NVSwitch: Fat-Tree/Clos

        [Spine Switches]
       /       |       \
    [Leaf]  [Leaf]  [Leaf]
     / \      / \      / \
   GPU GPU  GPU GPU  GPU GPU

Characteristics:

  • Non-blocking within tier
  • Oversubscription between tiers (typically 1:1 to 4:1)
  • Requires more switches for larger scales
  • Well-understood, extensively studied

4.2 TPU Pod: 3D Torus

      Z
      |
      |_____ Y
     /
    X
 
Each node connects to 6 neighbors: ±X, ±Y, ±Z
16×16×16 = 4,096 nodes

Characteristics:

  • Fixed degree (6) regardless of scale
  • Multiple paths between nodes (fault-tolerant)
  • Diameter: grows as N^(1/3) for N nodes (at most 3·⌊k/2⌋ hops on a k×k×k torus; see the sketch after this list)
  • Optimized for nearest-neighbor and dimension-ordered collectives
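
The simple, deterministic routing a torus allows follows from its regularity: each chip's six neighbors differ by ±1 (mod k) in exactly one coordinate. A minimal sketch of that neighbor calculation and the resulting diameter (function names are illustrative):

# Neighbors and diameter of a k x k x k torus (e.g., k = 16 for a 4,096-chip Pod).
def torus_neighbors(x, y, z, k):
    """The six neighbors of node (x, y, z): +/-1 along each axis, with wraparound."""
    return [((x + dx) % k, (y + dy) % k, (z + dz) % k)
            for dx, dy, dz in [(1, 0, 0), (-1, 0, 0),
                               (0, 1, 0), (0, -1, 0),
                               (0, 0, 1), (0, 0, -1)]]

def torus_diameter(k):
    """Worst-case hop count: at most k // 2 hops per dimension."""
    return 3 * (k // 2)

print(torus_neighbors(0, 0, 0, 16))   # wraparound links reach (15, ...) etc.
print(torus_diameter(16))             # 24 hops across a 16x16x16 Pod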

All-Reduce on 3D Torus:

# Simplified sketch: dimension-ordered all-reduce on a k x k x k torus.
# send/recv/reduce and the chunk helpers are placeholders for the hardware ICI
# primitives; each phase is a ring reduce-scatter along one torus dimension,
# using both directions of the bidirectional links.
def all_reduce_3d_torus(local_data, dims=("x", "y", "z"), k=16):
    # Phases 1-3: reduce-scatter along X, then Y, then Z
    for dim in dims:
        for step in range(k - 1):
            chunk = chunk_to_send(local_data, dim, step)
            send(chunk, neighbor(dim, +1))             # forward around the ring
            remote = recv(neighbor(dim, -1))           # from the opposite neighbor
            accumulate(local_data, dim, step, remote)  # elementwise reduce (e.g., sum)

    # Phases 4-6: all-gather in reverse dimension order (Z, Y, X),
    # symmetric to the reduce-scatter but without the reduction.
    for dim in reversed(dims):
        for step in range(k - 1):
            chunk = chunk_to_send(local_data, dim, step)
            send(chunk, neighbor(dim, +1))
            store(local_data, dim, step, recv(neighbor(dim, -1)))

    return local_data   # now holds the fully reduced result on every node

Bandwidth Efficiency:

  • Utilizes all links simultaneously
  • No congestion (each link carries unique data)
  • Latency: O(N^(1/3)) steps (one ring pass per torus dimension)

4.3 Dragonfly/Dragonfly+ (HPC systems)

Used in: Cray Shasta systems, some supercomputers

Characteristics:

  • Hierarchical: intra-group (all-to-all), inter-group (adaptive routing)
  • High radix switches (64+ ports)
  • Better diameter than torus for large systems

vs. TPU Pod:

  • Dragonfly: Better for communication patterns with locality
  • TPU Pod: Better for uniform, bulk-synchronous patterns
  • Dragonfly: Uses adaptive routing (complex)
  • TPU Pod: Uses dimension-ordered routing (simple, deterministic)

5. Cost and TCO Analysis

5.1 Cost Breakdown (Estimated per 256 Accelerators)

| Component | NVSwitch + IB | TPU Pod OCS | Ethernet RoCE |
|---|---|---|---|
| Accelerators | $5-8M | $4-7M | $3-5M |
| Interconnect HW | $500K-1M | $1-2M | $200-500K |
| Optical transceivers | $100K | $400-800K | $100-200K |
| Switches | $500K | Included in OCS | $300-500K |
| Cables | $100K | $200K | $100K |
| Power (3 yr) | $800K | $600K | $1M |
| Cooling (3 yr) | $400K | $300K | $500K |
| Total (3 yr) | ~$8-11M | ~$7-11M | ~$5-8M |

Key Insights:

  1. Accelerator cost dominates (70-80% of total)
  2. TPU Pod OCS: Higher upfront, lower operational costs
  3. Ethernet: Lowest upfront and operational (with tradeoffs)
  4. NVSwitch: Middle ground, but requires expensive IB for scale-out
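
A rough roll-up of the table above, using midpoints of the quoted ranges (illustrative only; the ranges carry the real uncertainty):

# Rough 3-year TCO roll-up from the per-256-accelerator estimates above (in $M).
costs_musd = {
    "NVSwitch + IB": {"accelerators": 6.5, "interconnect": 0.75, "optics": 0.1,
                      "switches": 0.5, "cables": 0.1, "power": 0.8, "cooling": 0.4},
    "TPU Pod OCS":   {"accelerators": 5.5, "interconnect": 1.5,  "optics": 0.6,
                      "switches": 0.0, "cables": 0.2, "power": 0.6, "cooling": 0.3},
    "Ethernet RoCE": {"accelerators": 4.0, "interconnect": 0.35, "optics": 0.15,
                      "switches": 0.4, "cables": 0.1, "power": 1.0, "cooling": 0.5},
}
for option, parts in costs_musd.items():
    total = sum(parts.values())
    accel_share = parts["accelerators"] / total
    print(f"{option}: ~${total:.1f}M over 3 years ({accel_share:.0%} accelerators)")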

5.2 Performance per Dollar

Assuming 256-accelerator cluster for large model training:

| Metric | NVSwitch+IB | TPU Pod | Ethernet |
|---|---|---|---|
| Training throughput | 1.0× (baseline) | 0.9-1.1× | 0.7-0.9× |
| 3-year TCO | $9M | $9M | $6.5M |
| Perf/$ (normalized) | 1.0× | 0.9-1.1× | 1.0-1.2× |

Interpretation:

  • Ethernet: Best cost/performance for smaller scales
  • NVSwitch: Best absolute performance within server
  • TPU Pod: Best performance at massive scale (1000+ chips)
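
The "Perf/$ (normalized)" row is simply relative throughput divided by TCO; a minimal sketch with the midpoint values from the table:

# Normalized performance-per-dollar (illustrative midpoints only).
throughput = {"NVSwitch + IB": 1.0, "TPU Pod": 1.0, "Ethernet": 0.8}   # relative training throughput
tco_musd   = {"NVSwitch + IB": 9.0, "TPU Pod": 9.0, "Ethernet": 6.5}   # 3-year TCO, $M

baseline = throughput["NVSwitch + IB"] / tco_musd["NVSwitch + IB"]
for option in throughput:
    perf_per_dollar = throughput[option] / tco_musd[option]
    print(f"{option}: {perf_per_dollar / baseline:.2f}x perf/$ vs NVSwitch baseline")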

6. Use Case Fit

6.1 When to Use NVSwitch

Ideal for:

  • Server-scale deployments (8-64 GPUs)
  • Mixed precision training
  • Models with high inter-GPU communication (model parallelism)
  • When maximum per-GPU bandwidth is critical
  • Existing NVIDIA ecosystem/CUDA code

Not ideal for:

  • Cost-sensitive deployments
  • Scales >256 GPUs (requires expensive IB fabric)
  • Heterogeneous workloads (overkill for sparse communication)

6.2 When to Use TPU Pod OCS

Ideal for:

  • Massive scale training (1000+ accelerators)
  • Bulk-synchronous parallel workloads
  • Google ecosystem (TensorFlow, JAX)
  • Energy efficiency is critical
  • Long-term, homogeneous AI training

Not ideal for:

  • Small-scale deployments (<256 chips)
  • Mixed workloads (storage, compute, network)
  • Vendor diversity requirements
  • Non-Google frameworks (PyTorch, though improving)

6.3 When to Use InfiniBand/Ethernet

Ideal for:

  • Multi-vendor environments
  • Mixed workloads (AI + HPC + storage)
  • Cost-sensitive deployments
  • Need for industry-standard ecosystem
  • Future flexibility (upgrade paths, vendor changes)

Not ideal for:

  • Absolute maximum performance
  • Lowest latency requirements
  • Extremely large-scale synchronous training

7. Future Trends

7.1 Convergence

Key Observation: Technologies are converging:

  • NVSwitch may adopt optical links for longer reach
  • TPU Pods may incorporate electrical for short-reach
  • InfiniBand/Ethernet adding AI-specific features

7.2 In-Network Computing

Trend: Moving computation into the network:

  • NVIDIA SHARP (in-network reductions)
  • AMD Infinity Fabric (on-path computation)
  • Optical switches with photonic computing elements

Impact on TPU Pods:

  • Already has optimized collective primitives
  • Could add in-network reduction for better scalability

7.3 Disaggregation

Trend: Separating compute, memory, and storage:

  • CXL (Compute Express Link) for memory pooling
  • Optical interconnects enabling disaggregation
  • Composable infrastructure

TPU Pod Evolution:

  • Future versions may disaggregate memory
  • Optical fabric enables flexible resource allocation

8. Key Takeaways for Interviews

8.1 Quick Comparison Table

| Factor | NVSwitch | TPU Pod OCS | InfiniBand | Ethernet |
|---|---|---|---|---|
| Scale sweet spot | 8-256 GPUs | 512-4096 TPUs | 64-1024 nodes | 32-512 nodes |
| Technology | Electrical | Optical | Electrical/optical | Electrical/optical |
| Bandwidth per accelerator | 900 GB/s | 100-200 GB/s | 50-100 GB/s | 50-100 GB/s |
| Latency | Lowest | Low | Medium | Medium-high |
| Power efficiency | Medium | Best | Medium | Worst |
| Cost | High | Highest | High | Lowest |
| Flexibility | Low | Lowest | High | Highest |
| Adoption | NVIDIA only | Google only | Broad (HPC/AI) | Universal |

8.2 Architectural Insights

  1. No single winner: Choice depends on workload, scale, and constraints
  2. Optical advantage: Scales better with distance and power, but costs more
  3. Electrical advantage: Lower latency at short reach, lower cost, mature
  4. Topology matters: Torus (TPU) vs. Fat-tree (NVSwitch) impacts scalability
  5. Circuit vs. Packet: Circuit switching (OCS) is deterministic but less flexible

8.3 Discussion Points

If asked "Which is better, TPU Pod or NVSwitch?"

Answer framework:

  1. "It depends on the use case and scale..."
  2. "For intra-server: NVSwitch wins (higher bandwidth, lower latency)"
  3. "For massive scale: TPU Pod wins (better power, uniform topology, no tier boundaries)"
  4. "For cost: Neither - Ethernet-based solutions are more economical"
  5. "For ecosystem: NVSwitch (broader adoption) vs. TPU Pod (Google-only)"

If asked "Why did Google choose optical for TPU Pods?"

Key points:

  1. Scale: Optical scales better to thousands of nodes
  2. Power: 3-5× better power efficiency than electrical at high bandwidth
  3. Distance: Can span large datacenter without repeaters
  4. Reconfigurability: OCS allows dynamic topology changes
  5. Workload fit: AI training is bulk-synchronous, benefits from circuit switching

9. References and Further Reading

  1. Google TPU Papers:

    • Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit" (ISCA 2017)
    • Jouppi et al., "A Domain-Specific Supercomputer for Training Deep Neural Networks" (CACM 2020)
  2. NVIDIA NVSwitch:

    • NVIDIA Hopper Architecture Whitepaper
    • NVLink and NVSwitch Documentation
  3. Optical Interconnects:

    • Kachris et al., "Optical Interconnects for Data Centers" (2013)
    • Sun et al., "LIONS: An RDMA-Oriented Design for Low-Latency Optical Switches" (SIGCOMM 2020)
  4. Network Topologies:

    • Dally & Towles, "Principles and Practices of Interconnection Networks" (2004)
    • Kim et al., "Technology-Driven, Highly-Scalable Dragonfly Topology" (ISCA 2008)
  5. Industry Reports:

    • Omdia: "AI Infrastructure Market Analysis"
    • Dell'Oro Group: "Data Center Network Equipment Report"