Broadcom

Tomahawk5 Switch Chip

A monumental achievement in network switching technology: 51.2 Tb/s of switching capacity in a single monolithic 5nm die, with 512 lanes of 106.25 Gb/s PAM4 SerDes, co-packaged optics support, and 8.8 pJ/bit power efficiency for AI datacenter infrastructure.

10 min read
5nm TSMC process
Released 2024
Updated 1/16/2025

Key Performance Metrics

51.2 Tb/s total switching capacity (double previous generation)
8.8 pJ/bit power efficiency at 450W typical deployment power
512 SerDes lanes supporting up to 45dB insertion loss
64 ports of 800GbE or 128 ports of 400GbE maximum configuration
4.1× performance per watt improvement over 5 generations
9,352 pin custom BGA package with advanced thermal design
6 integrated ARM cores for on-chip processing

Architectural Highlights

  • Monolithic 5nm die delivering 51.2 Tb/s switching capacity
  • 512 lanes of 106.25Gb/s PAM4 SerDes with DSP equalization
  • Revolutionary co-packaged optics (CPO) and linear pluggable optics (LPO) support
  • Custom compressed hex BGA pattern with 0.9mm pitch for improved signal integrity
  • Shared buffer architecture with dynamic allocation across all ports
  • Multi-bin Adaptive Voltage Scaling (AVS) for power optimization

Technical Specifications

TSMC 5nm FinFET monolithic die implementation
512 lanes × 106.25 Gb/s PAM4 SerDes technology
9,352 pin custom compressed hex BGA (0.9mm pitch)
Shared buffer: estimated 100+ MB with dynamic allocation
Power delivery: 430A peak current, 550ns transition time
Voltage bins: 8 levels (0.700V to 0.7875V, 12.5mV steps)
Air cooling compatible with custom heatsink design

Innovative Features

  • Direct drive architecture eliminating DSP retiming in optics (100ns latency reduction)
  • Dynamic shared buffer allocation achieving 80%+ utilization vs 30-50% static
  • 45dB insertion loss compensation with advanced DSP equalization
  • CPO achieving 5.5 pJ/bit optical efficiency vs 15+ pJ/bit traditional
  • Custom hex BGA pattern reducing package size and improving signal integrity
  • Comprehensive load line implementation maintaining 3% voltage regulation

1. Executive Summary

The Broadcom Tomahawk5 (TH5, BCM78900 series) represents a monumental achievement in network switching technology, delivering 51.2 Tb/s of switching capacity in a single monolithic die, double the bandwidth of its predecessor, the Tomahawk4. Fabricated on TSMC's 5nm process, the chip features 512 lanes of 106.25Gb/s PAM4 SerDes, integrates six ARM cores, and achieves 8.8 pJ/bit power efficiency at 450W typical deployment power. Its support for co-packaged optics (CPO) and linear pluggable optics (LPO) positions it as a cornerstone of next-generation AI/ML datacenter networking infrastructure.

2. Architectural Innovation and Core Technology

2.1 Monolithic Die Architecture

The TH5's implementation as a monolithic die represents a critical design decision with significant implications:

Advantages of Monolithic vs Multi-Chip:

Monolithic Benefits:
- Lower cost: Single die manufacturing
- Lower power: No die-to-die serialization overhead  
- Lower latency: Direct on-chip communication
- Simpler packaging: No interposer or bridges needed
 
Trade-offs:
- Larger die size → Lower yield
- Single point of failure
- Manufacturing complexity at reticle limits

Die Specifications:

  • Process Technology: TSMC 5nm FinFET
  • Package Pins: 9,352 pins (massive BGA array)
  • Package Type: Organic BGA with custom technology
  • Ball Pitch: 0.9mm (custom compressed hex pattern)

2.2 Core Performance Metrics

Parameter             Value                 Calculation/Explanation
Total Bandwidth       51.2 Tb/s             512 lanes × 106.25 Gb/s
SerDes Technology     PAM4                  4-level pulse amplitude modulation
Port Configurations   100/200/400/800 GbE   Flexible port breakout
Maximum 800G Ports    64                    51.2 Tb/s ÷ 800 Gb/s
Maximum 400G Ports    128                   51.2 Tb/s ÷ 400 Gb/s
Maximum 100G Ports    512                   51.2 Tb/s ÷ 100 Gb/s
Typical Power         450W                  Customer deployments
Power Efficiency      8.8 pJ/bit            450W ÷ 51.2 Tb/s
ARM Cores             6                     On-chip processing
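The table's figures follow directly from the lane count and per-lane rate; the short Python check below reproduces them (it assumes the 51.2 Tb/s capacity counts a net ~100 Gb/s of Ethernet payload per lane after FEC, with 106.25 Gb/s as the raw PAM4 line rate):

```python
# Sanity-check of TH5 headline figures. The 51.2 Tb/s capacity corresponds
# to a net ~100 Gb/s per lane after FEC (raw lane rate is 106.25 Gb/s PAM4).
LANES = 512
NET_LANE_GBPS = 100.0
TYPICAL_POWER_W = 450.0

total_gbps = LANES * NET_LANE_GBPS                  # 51,200 Gb/s = 51.2 Tb/s
ports_800g = total_gbps / 800                       # 64 ports of 800GbE
ports_400g = total_gbps / 400                       # 128 ports of 400GbE
pj_per_bit = TYPICAL_POWER_W / (total_gbps / 1000)  # W per Tb/s == pJ/bit
```

Note the handy identity in the last line: watts divided by terabits per second is numerically equal to picojoules per bit.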

2.3 Generational Evolution Analysis

Tomahawk Family Progression:

Generation   Process   Bandwidth   Power   Efficiency
TH1          28nm      3.2 Tb/s    115W    35.9 pJ/bit
TH2          16nm      6.4 Tb/s    180W    28.1 pJ/bit
TH3          16nm      12.8 Tb/s   220W    17.2 pJ/bit
TH4          7nm       25.6 Tb/s   306W    12.0 pJ/bit
TH5          5nm       51.2 Tb/s   450W    8.8 pJ/bit
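The efficiency column can be recomputed from the bandwidth and power columns alone, since W ÷ Tb/s is numerically pJ/bit:

```python
# Recompute pJ/bit for each Tomahawk generation from power and bandwidth.
gens = {  # name: (bandwidth in Tb/s, typical power in W)
    "TH1": (3.2, 115), "TH2": (6.4, 180), "TH3": (12.8, 220),
    "TH4": (25.6, 306), "TH5": (51.2, 450),
}
# W divided by Tb/s equals pJ/bit, so no unit conversion is needed.
eff = {name: power / bw for name, (bw, power) in gens.items()}
```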

Power Efficiency Improvement:

Generation-to-generation: ~30% power reduction
Process technology alone: ~15-20% reduction
Architecture innovation: ~10-15% additional reduction

3. SerDes Technology and Signal Integrity

3.1 Peregrine SerDes Architecture

The integrated Peregrine SerDes technology enables multiple connectivity options:

SerDes Specifications:

  • Lane Rate: 106.25 Gb/s PAM4
  • Number of Lanes: 512
  • Insertion Loss Support: >45dB at 10⁻⁶ pre-FEC BER
  • DAC Cable Support: 4-meter cables
  • Modulation: PAM4 (2 bits per symbol)

Effective Data Rate Calculation:

Symbol rate = 53.125 GBaud
Bits per symbol = 2 (PAM4)
Raw rate = 53.125 × 2 = 106.25 Gb/s
With ~7% FEC overhead: net rate ≈ 99 Gb/s
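The same lane-rate arithmetic in executable form (the flat 7% overhead figure is the approximation quoted above, not an exact FEC code rate):

```python
# PAM4 lane-rate arithmetic for the 106.25 Gb/s SerDes lanes.
symbol_rate_gbaud = 53.125
bits_per_symbol = 2                        # PAM4 carries 2 bits per symbol
raw_gbps = symbol_rate_gbaud * bits_per_symbol   # 106.25 Gb/s
fec_overhead = 0.07                        # ~7%, per the text above
net_gbps = raw_gbps * (1 - fec_overhead)   # ≈ 98.8 Gb/s, roughly the ~99 quoted
```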

3.2 DSP-Based Equalization

The DSP SerDes implementation provides:

Channel Compensation:
- Feed-Forward Equalization (FFE)
- Decision Feedback Equalization (DFE)  
- Continuous Time Linear Equalization (CTLE)
- Maximum Likelihood Sequence Detection (MLSD)
 
Performance:
- 45dB insertion loss compensation
- BER < 10⁻⁶ pre-FEC
- BER < 10⁻¹² post-FEC

3.3 Signal Integrity Innovations

Custom BGA Pattern Benefits:

  • Traditional: 1.0mm pitch → >100mm package size
  • TH5 Custom: 0.9mm hex pattern → Reduced size
  • Signal Isolation: Improved FEXT/NEXT performance
  • Insertion Loss: Reduced trace lengths

4. Shared Buffer Architecture

4.1 Output-Queued Shared Buffer Design

The TH5 implements an advanced shared buffer architecture:

Buffer Architecture:
- Total Buffer Size: Estimated 100+ MB
- Dynamic Allocation: Across all ports and queues
- Queue Types: Unicast, Multicast, CPU
- QoS Levels: 8 priority queues per port

Buffer Efficiency Calculation:

Traditional static buffer:
512 ports × 200KB/port = 102.4MB (fixed allocation)
Utilization: ~30-50% typical

TH5 shared buffer:
~100MB shared dynamically
Utilization: >80% achievable
Effective capacity: 1.6-2.7× improvement
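The 1.6-2.7× range falls out of the utilization figures; a minimal sketch, using the estimated 100 MB pool:

```python
# Usable buffer: fixed per-port carving vs a dynamically shared pool,
# using the utilization figures quoted above (100 MB total is an estimate).
total_mb = 100.0
static_util_lo, static_util_hi = 0.30, 0.50   # typical static utilization
shared_util = 0.80                            # achievable with sharing

shared_mb = total_mb * shared_util                 # 80 MB usable
gain_hi = shared_mb / (total_mb * static_util_lo)  # ≈ 2.7× vs worst static case
gain_lo = shared_mb / (total_mb * static_util_hi)  # = 1.6× vs best static case
```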

4.2 Traffic Management Features

Burst Absorption Capability: for an 800G port at full rate,

Burst duration = buffer allocation ÷ port rate
If 1MB allocated: (1MB × 8) ÷ 800Gb/s = 10µs burst
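The burst-duration formula generalizes to any allocation and port speed; a small helper (the function name is illustrative):

```python
# Burst absorption: how long a port can sink line-rate traffic into a
# given buffer allocation before dropping or pausing.
def burst_us(buffer_mb: float, port_gbps: float) -> float:
    """Burst duration in microseconds."""
    bits = buffer_mb * 1e6 * 8             # MB -> bits
    return bits / (port_gbps * 1e9) * 1e6  # seconds -> microseconds
```

burst_us(1, 800) reproduces the 10 µs worked example; the same 1 MB sustains an 8× longer burst on a 100G port.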

Quality of Service Implementation:

  • Weighted Fair Queuing (WFQ)
  • Strict Priority Scheduling
  • Deficit Weighted Round Robin (DWRR)
  • Hierarchical QoS with multiple levels

5. Power Management and Thermal Design

5.1 Adaptive Voltage Scaling (AVS)

The multi-bin AVS system provides sophisticated power optimization:

AVS Implementation:

Voltage Bins: 8 levels
Range: 0.700V to 0.7875V
Granularity: 12.5mV steps
Selection: Wafer probe testing
Storage: OTP (One-Time Programmable) array

Power Savings Calculation:

Power ∝ V² × f
Voltage reduction: 0.7875V → 0.700V (11.1% reduction)
Power savings: 1 − (0.700 ÷ 0.7875)² ≈ 21% power reduction
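The quadratic relationship makes the 21% figure easy to verify:

```python
# Dynamic power scales roughly with V^2 at fixed frequency, so binning a
# die from the top AVS voltage down to the bottom one saves about 21%.
v_hi, v_lo = 0.7875, 0.700
voltage_reduction = 1 - v_lo / v_hi        # ≈ 11.1%
power_reduction = 1 - (v_lo / v_hi) ** 2   # ≈ 21%
```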

5.2 Power Delivery Network (PDN) Design

Critical PDN Requirements:

Peak current transition: 430A
Transition time: 550ns
Current slew rate: 430A ÷ 550ns = 782 A/µs
Voltage droop target: under 3% of VDD

PDN Impedance Calculation:

Z_target = ΔV_max ÷ ΔI_max = (0.03 × 0.75V) ÷ 430A = 52.3 µΩ

The PDN must maintain impedance below 52.3μΩ across relevant frequency bands.
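Both requirements follow from Ohm's law on the droop budget; a quick check (the 0.75 V nominal is assumed here as a representative mid-bin core voltage):

```python
# Target PDN impedance from the droop budget and current step; 0.75 V is
# an assumed mid-bin core voltage, not a published specification.
vdd = 0.75
droop_frac = 0.03      # 3% droop budget
i_step_a = 430.0       # peak current transition
t_step_ns = 550.0      # transition time

z_target_uohm = droop_frac * vdd / i_step_a * 1e6   # ≈ 52.3 µΩ
slew_a_per_us = i_step_a / (t_step_ns / 1000)       # ≈ 782 A/µs
```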

5.3 Load Line Implementation

Measured Performance:

  • Voltage Droop: 39.9mV (with load line disabled)
  • With Load Line: Maintains 3% specification
  • Compensation: Dedicated sense line from die to VRM

6. Co-Packaged Optics (CPO) Innovation

6.1 CPO Architecture

The TH5-Bailly variant integrates optical engines directly:

CPO Specifications:

Configuration: 8 optical engines × 6.4T each
Total optical bandwidth: 51.2 Tb/s
System power: 820W total (274W optics + 546W ASIC)
Optical efficiency: 274W ÷ 51.2 Tb/s = 5.35 pJ/bit
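The power split and efficiency figure check out arithmetically:

```python
# TH5-Bailly CPO power budget; W divided by Tb/s is numerically pJ/bit.
optics_w, asic_w = 274.0, 546.0
bandwidth_tbps = 51.2

system_w = optics_w + asic_w                      # 820 W total
optical_pj_per_bit = optics_w / bandwidth_tbps    # ≈ 5.35 pJ/bit
```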

6.2 Direct Drive Architecture

The "direct drive" approach eliminates DSP retiming in optics:

Traditional Optical Link:

TX ASIC → DSP → E/O → Fiber → O/E → DSP → RX ASIC
Total DSPs: 2
Latency: ~200ns
Power: ~15+ pJ/bit

Direct Drive (CPO/LPO):

TX ASIC → Linear E/O → Fiber → Linear O/E → RX ASIC
Total DSPs: 0 (linear conversion only)
Latency: ~100ns (100ns reduction)
Power: ~5.5 pJ/bit

6.3 Power Efficiency Comparison

Configuration   Power Efficiency   Relative to Fully Retimed
CPO             4.8 pJ/bit         ~68% reduction (best)
LPO             10 pJ/bit          33% reduction
LRO             12 pJ/bit          20% reduction
Fully Retimed   15+ pJ/bit         Baseline

7. Packaging Technology and Mechanical Design

7.1 Package Innovation

Custom Compressed Hex BGA Pattern:

Traditional Square Grid:
- 1.0mm pitch
- Package size: >100mm × 100mm
- Longer traces → Higher insertion loss
 
TH5 Hex Pattern:
- 0.9mm pitch (10% reduction)
- Hexagonal arrangement
- Package size: under 90mm × 90mm
- Improved SI metrics

7.2 Thermal Performance

Air Cooling Achievement:

Power density = 450W ÷ package area
Assuming a 90 × 90mm package: 450W ÷ 8,100mm² = 55.6 mW/mm²
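In executable form (the 90 × 90 mm footprint is the same assumption used above, not a published dimension):

```python
# Power density under the assumed 90 x 90 mm package footprint.
power_w = 450.0
side_mm = 90.0
density_mw_per_mm2 = power_w / side_mm**2 * 1000   # ≈ 55.6 mW/mm²
```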

Thermal Solution:

  • Lidless package design
  • Custom heatsink
  • Air cooling sufficient (no liquid required)
  • Junction temperature: Within spec

7.3 Mechanical Reliability

JEDEC Compliance:

  • Component Level: Passed JESD47K first attempt
  • Room Temperature Coplanarity: under 200μm (met)
  • High Temperature Warpage: -140μm/+230μm (within spec)
  • Shock & Bend Tests: Passed IPC9702/3

8. System Integration and Deployment

8.1 Silicon Validation Kit (SVK)

The TH5 SVK demonstrates system simplicity:

Configuration:
- 64 ports of 800G
- Stacked OSFP connectors (belly-to-belly)
- PCB routed signals (no flyover cables)
- Air cooled system
- Simplified design vs competitor solutions

8.2 Module Test Platform (MTP)

For LPO/LRO qualification:

  • Multiple form factor support (QSFP, OSFP, etc.)
  • Electrical channel variety for testing
  • Pre-FEC BER measurement capability
  • Comprehensive module validation

8.3 Deployment Flexibility

Connectivity Options Supported:

  1. Direct Attach Copper (DAC): Up to 4 meters
  2. Front Panel Pluggables: Standard transceivers
  3. Linear Pluggable Optics (LPO): Reduced power
  4. Co-Packaged Optics (CPO): Maximum integration

9. Performance Analysis and Benchmarking

9.1 Latency Analysis

Port-to-Port Latency Components:

SerDes TX: ~10ns
Switch fabric: ~100-200ns
Shared buffer: variable (depends on congestion)
SerDes RX: ~10ns
Total (uncongested): ~120-220ns

With CPO/LPO: 100ns reduction per link
End-to-end improvement: ~200ns for a switched connection (ingress and egress links)

9.2 Throughput Efficiency

Non-blocking Performance:

Radix: 512 × 100G or 64 × 800G
Bisection bandwidth: 25.6 Tb/s
Oversubscription: none (1:1 non-blocking)

9.3 Power Efficiency Evolution

Performance per Watt Improvement:

TH1: 3.2 Tb/s ÷ 115W = 27.8 Gb/s/W
TH5: 51.2 Tb/s ÷ 450W = 113.8 Gb/s/W
Improvement: 4.1× over five generations
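The 4.1× figure is the ratio of these two throughput-per-watt numbers:

```python
# Throughput per watt for the first and latest Tomahawk generations.
th1_gbps_per_w = 3.2e3 / 115            # 3.2 Tb/s over 115 W ≈ 27.8 Gb/s/W
th5_gbps_per_w = 51.2e3 / 450           # 51.2 Tb/s over 450 W ≈ 113.8 Gb/s/W
gain = th5_gbps_per_w / th1_gbps_per_w  # ≈ 4.1×
```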

10. AI/ML Networking Optimization

10.1 AI Workload Characteristics

Modern AI training requires:

- All-reduce operations: High bisection bandwidth
- Parameter servers: Low latency
- Gradient aggregation: Multicast support
- Model parallelism: Predictable latency

10.2 TH5 AI Optimizations

Features for AI/ML:

  • 51.2 Tb/s eliminates network bottlenecks
  • Shared buffer handles bursty gradient traffic
  • Low latency for synchronous training
  • CPO option for highest density

AI Cluster Scaling:

GPUs per TH5: 64 (assuming 800G per GPU)
Bandwidth per GPU: 800 Gb/s bidirectional
Total cluster size: multiple TH5s in a 2-tier topology
Scale: thousands of GPUs with under 5µs latency
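As a rough illustration of the "thousands of GPUs" claim, consider a non-blocking two-tier leaf-spine fabric built only from 64-port 800G switches. The half-down/half-up port split below is a hypothetical sizing assumption for illustration, not a TH5 specification:

```python
# Illustrative non-blocking leaf-spine sizing with 64-port 800G switches.
# The half-down/half-up split is a hypothetical assumption used to make
# the "thousands of GPUs" claim concrete.
ports = 64
downlinks = uplinks = ports // 2   # 32 GPU-facing, 32 spine-facing per leaf
spines = uplinks                   # 32 spines absorb one uplink per leaf
leaves = ports                     # each 64-port spine fans out to 64 leaves
gpus = leaves * downlinks          # 64 leaves × 32 GPUs = 2,048 GPUs at 800G
```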

11. Conclusions

The Broadcom Tomahawk5 represents a quantum leap in datacenter switching technology, doubling bandwidth while maintaining exceptional power efficiency. Its monolithic 5nm implementation achieves 51.2 Tb/s of switching capacity at 8.8 pJ/bit, a roughly 30% efficiency improvement over the previous generation and more than process scaling alone would deliver.

Key Innovations:

  1. Monolithic architecture - Reducing cost, power, and latency vs multi-chip designs
  2. Advanced SerDes technology - Supporting 45dB insertion loss with DSP equalization
  3. Three-tier power optimization - AVS, PDN design, and load line implementation achieving 450W typical power
  4. Revolutionary optics integration - CPO achieving 5.5 pJ/bit with 100ns latency reduction
  5. Custom packaging innovations - Compressed hex BGA pattern enabling SI improvements

The TH5's support for CPO and LPO, combined with traditional copper and optical interfaces, provides unprecedented deployment flexibility. Its shared buffer architecture with dynamic allocation ensures efficient handling of AI/ML workloads' bursty traffic patterns.

With 4.1× improvement in performance per watt over five generations and air-cooling compatibility, the Tomahawk5 sets new standards for sustainable, high-performance datacenter networking infrastructure. It's perfectly positioned for the explosive growth in AI/ML computational demands, enabling cluster scaling to thousands of GPUs while maintaining microsecond-level latencies critical for synchronous training operations.

The chip represents the convergence of electrical and photonic technologies, with its CPO capabilities pointing toward the future of integrated silicon photonics in high-performance computing systems.


Analysis based on Broadcom Tomahawk5 technical specifications, industry presentations, and datacenter networking architecture requirements.

Product Information

Manufacturer:
Broadcom
Process Node:
5nm TSMC
Release Year:
2024
Category:
DatacenterArch

Performance Benchmarks

51.2 Tb/s non-blocking switching capacity
Port-to-port latency: 120-220ns uncongested
CPO latency reduction: 100ns per optical link
SerDes performance: BER < 10⁻⁶ pre-FEC, < 10⁻¹² post-FEC
DAC cable support: up to 4 meters
Power efficiency evolution: 27.8 to 113.8 Gb/s/W (4.1× improvement)