Broadcom

Tomahawk5 Switch Chip

A monumental achievement in network switching technology: 51.2 Tb/s of switching capacity in a single monolithic 5nm die, with 512 lanes of 106.25 Gb/s PAM4 SerDes, co-packaged optics support, and 8.8 pJ/bit power efficiency for AI datacenter infrastructure.

10 min read
5nm TSMC process
Released 2024
Updated 1/16/2025

Key Performance Metrics

51.2 Tb/s total switching capacity (double previous generation)
8.8 pJ/bit power efficiency at 450W typical deployment power
512 SerDes lanes supporting up to 45dB insertion loss
64 ports of 800GbE or 128 ports of 400GbE maximum configuration
4.1× performance per watt improvement over 5 generations
9,352 pin custom BGA package with advanced thermal design
6 integrated ARM cores for on-chip processing

Architectural Highlights

  • Monolithic 5nm die delivering 51.2 Tb/s switching capacity
  • 512 lanes of 106.25Gb/s PAM4 SerDes with DSP equalization
  • Revolutionary co-packaged optics (CPO) and linear pluggable optics (LPO) support
  • Custom compressed hex BGA pattern with 0.9mm pitch for improved signal integrity
  • Shared buffer architecture with dynamic allocation across all ports
  • Multi-bin Adaptive Voltage Scaling (AVS) for power optimization

Technical Specifications

TSMC 5nm FinFET monolithic die implementation
512 lanes × 106.25 Gb/s PAM4 SerDes technology
9,352 pin custom compressed hex BGA (0.9mm pitch)
Shared buffer: estimated 100+ MB with dynamic allocation
Power delivery: 430A peak current, 550ns transition time
Voltage bins: 8 levels (0.700V to 0.7875V, 12.5mV steps)
Air cooling compatible with custom heatsink design

Innovative Features

  • Direct drive architecture eliminating DSP retiming in optics (100ns latency reduction)
  • Dynamic shared buffer allocation achieving 80%+ utilization vs 30-50% static
  • 45dB insertion loss compensation with advanced DSP equalization
  • CPO achieving 5.5 pJ/bit optical efficiency vs 15+ pJ/bit traditional
  • Custom hex BGA pattern reducing package size and improving signal integrity
  • Comprehensive load line implementation maintaining 3% voltage regulation

1. Executive Summary

The Broadcom Tomahawk5 (TH5, BCM78900 series) represents a monumental achievement in network switching technology, delivering 51.2 Tb/s of switching capacity in a single monolithic die, double the bandwidth of its predecessor, the Tomahawk4. Fabricated on TSMC's 5nm process, the chip features 512 lanes of 106.25Gb/s PAM4 SerDes, integrates six ARM cores, and achieves 8.8 pJ/bit power efficiency at 450W typical deployment power. Its support for co-packaged optics (CPO) and linear pluggable optics (LPO) positions it as a cornerstone of next-generation AI/ML datacenter networking infrastructure.

2. Architectural Innovation and Core Technology

2.1 Monolithic Die Architecture

The TH5's implementation as a monolithic die represents a critical design decision with significant implications:

Advantages of Monolithic vs Multi-Chip:

Monolithic Benefits:
- Lower cost: Single die manufacturing
- Lower power: No die-to-die serialization overhead  
- Lower latency: Direct on-chip communication
- Simpler packaging: No interposer or bridges needed
 
Trade-offs:
- Larger die size → Lower yield
- Single point of failure
- Manufacturing complexity at reticle limits

Die Specifications:

  • Process Technology: TSMC 5nm FinFET
  • Package Pins: 9,352 pins (massive BGA array)
  • Package Type: Organic BGA with custom technology
  • Ball Pitch: 0.9mm (custom compressed hex pattern)

2.2 Core Performance Metrics

Parameter             Value                 Calculation/Explanation
Total Bandwidth       51.2 Tb/s             512 lanes × 106.25 Gb/s
SerDes Technology     PAM4                  4-level pulse amplitude modulation
Port Configurations   100/200/400/800 GbE   Flexible port breakout
Maximum 800G Ports    64                    51.2 Tb/s ÷ 800 Gb/s
Maximum 400G Ports    128                   51.2 Tb/s ÷ 400 Gb/s
Maximum 100G Ports    512                   51.2 Tb/s ÷ 100 Gb/s
Typical Power         450W                  Customer deployments
Power Efficiency      8.8 pJ/bit            450W ÷ 51.2 Tb/s
ARM Cores             6                     On-chip processing
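The table's figures follow directly from the lane count and per-lane rate; the short Python check below reproduces them (it assumes the 51.2 Tb/s capacity counts a net ~100 Gb/s of Ethernet payload per lane after FEC, with 106.25 Gb/s as the raw PAM4 line rate):

```python
# Sanity-check of TH5 headline figures. The 51.2 Tb/s capacity corresponds
# to a net ~100 Gb/s per lane after FEC (raw lane rate is 106.25 Gb/s PAM4).
LANES = 512
NET_LANE_GBPS = 100.0
TYPICAL_POWER_W = 450.0

total_gbps = LANES * NET_LANE_GBPS                  # 51,200 Gb/s = 51.2 Tb/s
ports_800g = total_gbps / 800                       # 64 ports of 800GbE
ports_400g = total_gbps / 400                       # 128 ports of 400GbE
pj_per_bit = TYPICAL_POWER_W / (total_gbps / 1000)  # W per Tb/s == pJ/bit
```

Note the handy identity in the last line: watts divided by terabits per second is numerically equal to picojoules per bit.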

2.3 Generational Evolution Analysis

Tomahawk Family Progression:

Generation   Process   Bandwidth   Power   Efficiency
TH1          28nm      3.2 Tb/s    115W    35.9 pJ/bit
TH2          16nm      6.4 Tb/s    180W    28.1 pJ/bit
TH3          16nm      12.8 Tb/s   220W    17.2 pJ/bit
TH4          7nm       25.6 Tb/s   306W    12.0 pJ/bit
TH5          5nm       51.2 Tb/s   450W    8.8 pJ/bit
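The efficiency column can be recomputed from the bandwidth and power columns alone, since W ÷ Tb/s is numerically pJ/bit:

```python
# Recompute pJ/bit for each Tomahawk generation from power and bandwidth.
gens = {  # name: (bandwidth in Tb/s, typical power in W)
    "TH1": (3.2, 115), "TH2": (6.4, 180), "TH3": (12.8, 220),
    "TH4": (25.6, 306), "TH5": (51.2, 450),
}
# W divided by Tb/s equals pJ/bit, so no unit conversion is needed.
eff = {name: power / bw for name, (bw, power) in gens.items()}
```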

Power Efficiency Improvement:

Generation-to-generation: ~30% power reduction
Process technology alone: ~15-20% reduction
Architecture innovation: ~10-15% additional reduction

3. SerDes Technology and Signal Integrity

3.1 Peregrine SerDes Architecture

The integrated Peregrine SerDes technology enables multiple connectivity options:

SerDes Specifications:

  • Lane Rate: 106.25 Gb/s PAM4
  • Number of Lanes: 512
  • Insertion Loss Support: >45dB at 10⁻⁶ pre-FEC BER
  • DAC Cable Support: 4-meter cables
  • Modulation: PAM4 (2 bits per symbol)

Effective Data Rate Calculation:

Symbol rate = 53.125 GBaud
Bits per symbol = 2 (PAM4)
Raw rate = 53.125 × 2 = 106.25 Gb/s
With ~7% FEC overhead: net rate ≈ 99 Gb/s
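The same lane-rate arithmetic in executable form (the flat 7% overhead figure is the approximation quoted above, not an exact FEC code rate):

```python
# PAM4 lane-rate arithmetic for the 106.25 Gb/s SerDes lanes.
symbol_rate_gbaud = 53.125
bits_per_symbol = 2                        # PAM4 carries 2 bits per symbol
raw_gbps = symbol_rate_gbaud * bits_per_symbol   # 106.25 Gb/s
fec_overhead = 0.07                        # ~7%, per the text above
net_gbps = raw_gbps * (1 - fec_overhead)   # ≈ 98.8 Gb/s, roughly the ~99 quoted
```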

3.2 DSP-Based Equalization

The DSP SerDes implementation provides:

Channel Compensation:
- Feed-Forward Equalization (FFE)
- Decision Feedback Equalization (DFE)  
- Continuous Time Linear Equalization (CTLE)
- Maximum Likelihood Sequence Detection (MLSD)
 
Performance:
- 45dB insertion loss compensation
- BER < 10⁻⁶ pre-FEC
- BER < 10⁻¹² post-FEC

3.3 Signal Integrity Innovations

Custom BGA Pattern Benefits:

  • Traditional: 1.0mm pitch → >100mm package size
  • TH5 Custom: 0.9mm hex pattern → Reduced size
  • Signal Isolation: Improved FEXT/NEXT performance
  • Insertion Loss: Reduced trace lengths

4. Shared Buffer Architecture

4.1 Output-Queued Shared Buffer Design

The TH5 implements an advanced shared buffer architecture:

Buffer Architecture:
- Total Buffer Size: Estimated 100+ MB
- Dynamic Allocation: Across all ports and queues
- Queue Types: Unicast, Multicast, CPU
- QoS Levels: 8 priority queues per port

Buffer Efficiency Calculation:

Traditional static buffer:
512 ports × 200KB/port = 102.4MB (fixed allocation)
Utilization: ~30-50% typical

TH5 shared buffer:
~100MB shared dynamically
Utilization: >80% achievable
Effective capacity: 1.6-2.7× improvement
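The 1.6-2.7× range falls out of the utilization figures; a minimal sketch, using the estimated 100 MB pool:

```python
# Usable buffer: fixed per-port carving vs a dynamically shared pool,
# using the utilization figures quoted above (100 MB total is an estimate).
total_mb = 100.0
static_util_lo, static_util_hi = 0.30, 0.50   # typical static utilization
shared_util = 0.80                            # achievable with sharing

shared_mb = total_mb * shared_util                 # 80 MB usable
gain_hi = shared_mb / (total_mb * static_util_lo)  # ≈ 2.7× vs worst static case
gain_lo = shared_mb / (total_mb * static_util_hi)  # = 1.6× vs best static case
```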

4.2 Traffic Management Features

Burst Absorption Capability: for an 800G port at full rate,

Burst duration = buffer allocation ÷ port rate
If 1MB allocated: (1MB × 8) ÷ 800Gb/s = 10µs burst
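The burst-duration formula generalizes to any allocation and port speed; a small helper (the function name is illustrative):

```python
# Burst absorption: how long a port can sink line-rate traffic into a
# given buffer allocation before dropping or pausing.
def burst_us(buffer_mb: float, port_gbps: float) -> float:
    """Burst duration in microseconds."""
    bits = buffer_mb * 1e6 * 8             # MB -> bits
    return bits / (port_gbps * 1e9) * 1e6  # seconds -> microseconds
```

burst_us(1, 800) reproduces the 10 µs worked example; the same 1 MB sustains an 8× longer burst on a 100G port.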

Quality of Service Implementation:

  • Weighted Fair Queuing (WFQ)
  • Strict Priority Scheduling
  • Deficit Weighted Round Robin (DWRR)
  • Hierarchical QoS with multiple levels

5. Power Management and Thermal Design

5.1 Adaptive Voltage Scaling (AVS)

The multi-bin AVS system provides sophisticated power optimization:

AVS Implementation:

Voltage Bins: 8 levels
Range: 0.700V to 0.7875V
Granularity: 12.5mV steps
Selection: Wafer probe testing
Storage: OTP (One-Time Programmable) array

Power Savings Calculation:

Power ∝ V² × f
Voltage reduction: 0.7875V → 0.700V (11.1% reduction)
Power savings: 1 − (0.700 ÷ 0.7875)² ≈ 21% power reduction
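The quadratic relationship makes the 21% figure easy to verify:

```python
# Dynamic power scales roughly with V^2 at fixed frequency, so binning a
# die from the top AVS voltage down to the bottom one saves about 21%.
v_hi, v_lo = 0.7875, 0.700
voltage_reduction = 1 - v_lo / v_hi        # ≈ 11.1%
power_reduction = 1 - (v_lo / v_hi) ** 2   # ≈ 21%
```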

5.2 Power Delivery Network (PDN) Design

Critical PDN Requirements:

Peak current transition: 430A
Transition time: 550ns
Current slew rate: 430A ÷ 550ns = 782 A/µs
Voltage droop target: under 3% of VDD

PDN Impedance Calculation:

Z_target = ΔV_max ÷ ΔI_max = (0.03 × 0.75V) ÷ 430A = 52.3 µΩ

The PDN must maintain impedance below 52.3μΩ across relevant frequency bands.
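Both requirements follow from Ohm's law on the droop budget; a quick check (the 0.75 V nominal is assumed here as a representative mid-bin core voltage):

```python
# Target PDN impedance from the droop budget and current step; 0.75 V is
# an assumed mid-bin core voltage, not a published specification.
vdd = 0.75
droop_frac = 0.03      # 3% droop budget
i_step_a = 430.0       # peak current transition
t_step_ns = 550.0      # transition time

z_target_uohm = droop_frac * vdd / i_step_a * 1e6   # ≈ 52.3 µΩ
slew_a_per_us = i_step_a / (t_step_ns / 1000)       # ≈ 782 A/µs
```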

5.3 Load Line Implementation

Measured Performance:

  • Voltage Droop: 39.9mV (with load line disabled)
  • With Load Line: Maintains 3% specification
  • Compensation: Dedicated sense line from die to VRM

6. Co-Packaged Optics (CPO) Innovation

6.1 CPO Architecture

The TH5-Bailly variant integrates optical engines directly:

CPO Specifications:

Configuration: 8 optical engines × 6.4T each
Total optical bandwidth: 51.2 Tb/s
System power: 820W total (274W optics + 546W ASIC)
Optical efficiency: 274W ÷ 51.2 Tb/s = 5.35 pJ/bit
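The power split and efficiency figure check out arithmetically:

```python
# TH5-Bailly CPO power budget; W divided by Tb/s is numerically pJ/bit.
optics_w, asic_w = 274.0, 546.0
bandwidth_tbps = 51.2

system_w = optics_w + asic_w                      # 820 W total
optical_pj_per_bit = optics_w / bandwidth_tbps    # ≈ 5.35 pJ/bit
```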

6.2 Direct Drive Architecture

The "direct drive" approach eliminates DSP retiming in optics:

Traditional Optical Link:

TX ASIC → DSP → E/O → Fiber → O/E → DSP → RX ASIC
Total DSPs: 2
Latency: ~200ns
Power: ~15+ pJ/bit

Direct Drive (CPO/LPO):

TX ASIC → Linear E/O → Fiber → Linear O/E → RX ASIC
Total DSPs: 0 (linear conversion only)
Latency: ~100ns (100ns reduction)
Power: ~5.5 pJ/bit

6.3 Power Efficiency Comparison

Configuration   Power Efficiency   Relative to Fully Retimed
CPO             4.8 pJ/bit         ~68% reduction (best)
LPO             10 pJ/bit          33% reduction
LRO             12 pJ/bit          20% reduction
Fully Retimed   15+ pJ/bit         Baseline

7. Packaging Technology and Mechanical Design

7.1 Package Innovation

Custom Compressed Hex BGA Pattern:

Traditional Square Grid:
- 1.0mm pitch
- Package size: >100mm × 100mm
- Longer traces → Higher insertion loss
 
TH5 Hex Pattern:
- 0.9mm pitch (10% reduction)
- Hexagonal arrangement
- Package size: under 90mm × 90mm
- Improved SI metrics

7.2 Thermal Performance

Air Cooling Achievement:

Power density = 450W ÷ package area
Assuming a 90 × 90mm package: 450W ÷ 8,100mm² = 55.6 mW/mm²
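In executable form (the 90 × 90 mm footprint is the same assumption used above, not a published dimension):

```python
# Power density under the assumed 90 x 90 mm package footprint.
power_w = 450.0
side_mm = 90.0
density_mw_per_mm2 = power_w / side_mm**2 * 1000   # ≈ 55.6 mW/mm²
```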

Thermal Solution:

  • Lidless package design
  • Custom heatsink
  • Air cooling sufficient (no liquid required)
  • Junction temperature: Within spec

7.3 Mechanical Reliability

JEDEC Compliance:

  • Component Level: Passed JESD47K first attempt
  • Room Temperature Coplanarity: under 200μm (met)
  • High Temperature Warpage: -140μm/+230μm (within spec)
  • Shock & Bend Tests: Passed IPC9702/3

8. System Integration and Deployment

8.1 Silicon Validation Kit (SVK)

The TH5 SVK demonstrates system simplicity:

Configuration:
- 64 ports of 800G
- Stacked OSFP connectors (belly-to-belly)
- PCB routed signals (no flyover cables)
- Air cooled system
- Simplified design vs competitor solutions

8.2 Module Test Platform (MTP)

For LPO/LRO qualification:

  • Multiple form factor support (QSFP, OSFP, etc.)
  • Electrical channel variety for testing
  • Pre-FEC BER measurement capability
  • Comprehensive module validation

8.3 Deployment Flexibility

Connectivity Options Supported:

  1. Direct Attach Copper (DAC): Up to 4 meters
  2. Front Panel Pluggables: Standard transceivers
  3. Linear Pluggable Optics (LPO): Reduced power
  4. Co-Packaged Optics (CPO): Maximum integration

9. Performance Analysis and Benchmarking

9.1 Latency Analysis

Port-to-Port Latency Components:

SerDes TX: ~10ns
Switch fabric: ~100-200ns
Shared buffer: variable (depends on congestion)
SerDes RX: ~10ns
Total (uncongested): ~120-220ns

With CPO/LPO: 100ns reduction per link
End-to-end improvement: ~200ns for a switched connection (ingress and egress links)

9.2 Throughput Efficiency

Non-blocking Performance:

Radix: 512 × 100G or 64 × 800G
Bisection bandwidth: 25.6 Tb/s
Oversubscription: none (1:1 non-blocking)

9.3 Power Efficiency Evolution

Performance per Watt Improvement:

TH1: 3.2 Tb/s ÷ 115W = 27.8 Gb/s/W
TH5: 51.2 Tb/s ÷ 450W = 113.8 Gb/s/W
Improvement: 4.1× over five generations
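The 4.1× figure is the ratio of these two throughput-per-watt numbers:

```python
# Throughput per watt for the first and latest Tomahawk generations.
th1_gbps_per_w = 3.2e3 / 115            # 3.2 Tb/s over 115 W ≈ 27.8 Gb/s/W
th5_gbps_per_w = 51.2e3 / 450           # 51.2 Tb/s over 450 W ≈ 113.8 Gb/s/W
gain = th5_gbps_per_w / th1_gbps_per_w  # ≈ 4.1×
```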

10. AI/ML Networking Optimization

10.1 AI Workload Characteristics

Modern AI training requires:

- All-reduce operations: High bisection bandwidth
- Parameter servers: Low latency
- Gradient aggregation: Multicast support
- Model parallelism: Predictable latency

10.2 TH5 AI Optimizations

Features for AI/ML:

  • 51.2 Tb/s eliminates network bottlenecks
  • Shared buffer handles bursty gradient traffic
  • Low latency for synchronous training
  • CPO option for highest density

AI Cluster Scaling:

GPUs per TH5: 64 (assuming 800G per GPU)
Bandwidth per GPU: 800 Gb/s bidirectional
Total cluster size: multiple TH5s in a 2-tier topology
Scale: thousands of GPUs with under 5µs latency
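As a rough illustration of the "thousands of GPUs" claim, consider a non-blocking two-tier leaf-spine fabric built only from 64-port 800G switches. The half-down/half-up port split below is a hypothetical sizing assumption for illustration, not a TH5 specification:

```python
# Illustrative non-blocking leaf-spine sizing with 64-port 800G switches.
# The half-down/half-up split is a hypothetical assumption used to make
# the "thousands of GPUs" claim concrete.
ports = 64
downlinks = uplinks = ports // 2   # 32 GPU-facing, 32 spine-facing per leaf
spines = uplinks                   # 32 spines absorb one uplink per leaf
leaves = ports                     # each 64-port spine fans out to 64 leaves
gpus = leaves * downlinks          # 64 leaves × 32 GPUs = 2,048 GPUs at 800G
```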

11. Conclusions

The Broadcom Tomahawk5 represents a quantum leap in datacenter switching technology, doubling bandwidth while maintaining exceptional power efficiency. Its monolithic 5nm implementation achieves 51.2 Tb/s of switching capacity at 8.8 pJ/bit, a roughly 30% efficiency improvement over the previous generation and more than process scaling alone would deliver.

Key Innovations:

  1. Monolithic architecture - Reducing cost, power, and latency vs multi-chip designs
  2. Advanced SerDes technology - Supporting 45dB insertion loss with DSP equalization
  3. Three-tier power optimization - AVS, PDN design, and load line implementation achieving 450W typical power
  4. Revolutionary optics integration - CPO achieving 5.5 pJ/bit with 100ns latency reduction
  5. Custom packaging innovations - Compressed hex BGA pattern enabling SI improvements

The TH5's support for CPO and LPO, combined with traditional copper and optical interfaces, provides unprecedented deployment flexibility. Its shared buffer architecture with dynamic allocation ensures efficient handling of AI/ML workloads' bursty traffic patterns.

With 4.1× improvement in performance per watt over five generations and air-cooling compatibility, the Tomahawk5 sets new standards for sustainable, high-performance datacenter networking infrastructure. It's perfectly positioned for the explosive growth in AI/ML computational demands, enabling cluster scaling to thousands of GPUs while maintaining microsecond-level latencies critical for synchronous training operations.

The chip represents the convergence of electrical and photonic technologies, with its CPO capabilities pointing toward the future of integrated silicon photonics in high-performance computing systems.


Analysis based on Broadcom Tomahawk5 technical specifications, industry presentations, and datacenter networking architecture requirements.

Product Information

Manufacturer:
Broadcom
Process Node:
5nm TSMC
Release Year:
2024
Category:
DatacenterArch

Performance Benchmarks

51.2 Tb/s non-blocking switching capacity
Port-to-port latency: 120-220ns uncongested
CPO latency reduction: 100ns per optical link
SerDes performance: BER < 10⁻⁶ pre-FEC, < 10⁻¹² post-FEC
DAC cable support: up to 4 meters
Power efficiency evolution: 27.8 to 113.8 Gb/s/W (4.1× improvement)