Tensor Processing Unit (TPU) Architecture
Deep dive into TPU design philosophy, architecture evolution, and lessons for AI accelerator development
1. Introduction
Tensor Processing Units (TPUs) represent one of the most successful custom AI accelerator architectures ever deployed. Designed specifically for neural network inference and training, TPUs demonstrate how domain-specific architectures can achieve significant improvements in performance, power efficiency, and cost-effectiveness over general-purpose processors.
2. TPU Design Philosophy
2.1 Core Principles
The TPU architecture is built on several key design principles:
- Specialization over Generality: Optimize for neural network operations rather than general computation
- Efficiency over Peak Performance: Maximize useful work per watt and per dollar
- Deterministic Execution: Predictable performance for production deployment
- Co-design with Software: Architecture and compiler developed together
2.2 Design Trade-offs
TPU vs GPU Design Philosophy:
┌───────────────────┬─────────────┬─────────────┐
│ Aspect            │ TPU         │ GPU         │
├───────────────────┼─────────────┼─────────────┤
│ Control Logic     │ Minimal     │ Complex     │
│ Cache Hierarchy   │ Simplified  │ Multi-level │
│ Memory Bandwidth  │ Very High   │ High        │
│ Precision         │ Reduced     │ Full        │
│ Programmability   │ Limited     │ General     │
│ Energy Efficiency │ Optimized   │ Moderate    │
└───────────────────┴─────────────┴─────────────┘
3. TPU v1 Architecture
3.1 System Overview
TPU v1 was designed specifically for neural network inference:
- Process Technology: 28nm
- Die Size: ~331 mm²
- Peak Performance: 92 TOPS (8-bit integer)
- Memory: 28 MiB on-chip, plus 8 GiB of off-chip DDR3
- Precision: 8-bit integer only
3.2 Matrix Multiply Unit (MXU)
The heart of TPU v1 is a massive systolic array:
TPU v1 Systolic Array (256×256):
     Weights loaded from the top
       ↓   ↓   ↓         ↓
     ┌───┬───┬───┬─────┬───┐
   → │PE │PE │PE │ ... │PE │
     ├───┼───┼───┼─────┼───┤
   → │PE │PE │PE │ ... │PE │
     ├───┼───┼───┼─────┼───┤
     │ : │ : │ : │  :  │ : │
     ├───┼───┼───┼─────┼───┤
   → │PE │PE │PE │ ... │PE │
     └───┴───┴───┴─────┴───┘
       ↓   ↓   ↓         ↓
  Activations stream in from the left (→), weights are pre-loaded
  from the top, and partial sums flow down into the accumulators (↓)
Each PE performs:
result = activation × weight + partial_sum
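To make the dataflow concrete, here is a minimal Python/NumPy sketch that simulates the multiply-accumulate pattern of a (tiny) systolic array. It is illustrative only: the function name and the fully unrolled loop schedule are assumptions chosen for readability, not the actual pipelined TPU microarchitecture.

import numpy as np

def systolic_matmul(activations, weights):
    """Simulate a systolic matrix multiply: out = activations @ weights.

    Each (i, j) position models one processing element (PE) holding a
    stationary weight; activations stream across rows and partial sums
    accumulate down columns. Real hardware overlaps these steps cycle
    by cycle rather than looping sequentially.
    """
    n_rows, n_cols = weights.shape           # 256 x 256 on TPU v1
    batch = activations.shape[0]
    out = np.zeros((batch, n_cols), dtype=np.int32)
    for b in range(batch):                   # one activation vector at a time
        for j in range(n_cols):              # partial sums flow down column j
            partial_sum = 0
            for i in range(n_rows):          # PE (i, j): MAC with its weight
                partial_sum += int(activations[b, i]) * int(weights[i, j])
            out[b, j] = partial_sum          # result drains into the accumulators
    return out

# Sanity check against an ordinary matrix multiply
A = np.random.randint(-128, 127, size=(4, 8), dtype=np.int8)
W = np.random.randint(-128, 127, size=(8, 3), dtype=np.int8)
assert np.array_equal(systolic_matmul(A, W), A.astype(np.int32) @ W.astype(np.int32))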
3.3 Memory Hierarchy
TPU v1 Memory Hierarchy:
┌─────────────────────────────────────┐
│ Host Memory (via PCIe)              │ ← Model storage
├─────────────────────────────────────┤
│ DDR3 DRAM (8 GiB, 30 GB/s)          │ ← Large tensors
├─────────────────────────────────────┤
│ Unified Buffer (24 MiB, 400 GB/s)   │ ← Activation storage
├─────────────────────────────────────┤
│ Weight FIFO (64 KiB)                │ ← Weight streaming
├─────────────────────────────────────┤
│ Accumulator Memory (4 MiB)          │ ← Partial sums
└─────────────────────────────────────┘
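A rough, roofline-style estimate using the 92 TOPS and 30 GB/s figures above shows why the large on-chip buffers are essential: DRAM bandwidth alone cannot keep the matrix unit busy.

# Back-of-the-envelope check of why on-chip storage matters (numbers taken
# from the hierarchy above; this is an estimate, not a measured figure).
peak_ops_per_s = 92e12        # 92 TOPS of 8-bit operations
dram_bandwidth = 30e9         # ~30 GB/s of off-chip DDR3 bandwidth
min_ops_per_byte = peak_ops_per_s / dram_bandwidth
print(f"Need ~{min_ops_per_byte:.0f} ops per DRAM byte to saturate the MXU")
# -> roughly 3000 ops/byte: data fetched from DRAM must be reused heavily,
#    which is exactly what the Unified Buffer and Weight FIFO provide.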
3.4 Performance Analysis
TPU v1 achieved remarkable efficiency:
- Area Efficiency: 65,536 8-bit MAC units in 331 mm²
- Power Efficiency: roughly 30-80× higher TOPS/W than contemporary CPUs and GPUs (per the original TPU paper)
- Memory Efficiency: High bandwidth, low latency on-chip storage
- Cost Efficiency: Simplified design reduces manufacturing costs
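The 92 TOPS peak figure follows directly from the array size and the 700 MHz clock reported for TPU v1:

# Where the 92 TOPS number comes from
macs_per_cycle = 256 * 256        # 65,536 8-bit MAC units in the MXU
clock_hz = 700e6                  # TPU v1 clock frequency
ops_per_mac = 2                   # one multiply plus one add
peak_tops = macs_per_cycle * clock_hz * ops_per_mac / 1e12
print(f"Peak throughput: {peak_tops:.1f} TOPS")   # ~91.8 TOPS, quoted as 92 TOPS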
4. TPU v2/v3 Evolution
4.1 Training Support
TPU v2 added training capabilities:
- Floating-Point Support: bfloat16 precision
- Bidirectional Data Flow: Forward and backward pass support
- Increased Memory: 16 GiB of high-bandwidth memory (HBM) per chip in v2, doubled to 32 GiB in v3
- Vector Processing Units: Element-wise operations
- Scalar Processing Units: Control and setup operations
4.2 bfloat16 Precision
The bfloat16 format was introduced specifically for TPU architectures:
IEEE FP16: [S][EEEEE][FFFFFFFFFF] (1+5+10 bits)
TPU bfloat16: [S][EEEEEEEE][FFFFFFF] (1+8+7 bits)
Benefits of bfloat16:
- Same dynamic range as FP32 (8-bit exponent)
- Reduced precision requirements (7-bit mantissa)
- Cheap conversion to and from FP32 (keep or zero-fill the low 16 bits)
- Hardware simplification vs FP16
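Because bfloat16 is simply the upper half of an IEEE float32, the conversion can be sketched in a few lines of NumPy. Truncation is shown here for clarity; real hardware typically also supports round-to-nearest-even.

import numpy as np

def fp32_to_bfloat16_bits(x):
    """Convert float32 values to bfloat16 bit patterns by keeping the top 16 bits."""
    bits32 = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits32 >> 16).astype(np.uint16)     # sign, 8-bit exponent, 7-bit mantissa

def bfloat16_bits_to_fp32(bits16):
    """Expand bfloat16 bit patterns back to float32 by zero-filling the low 16 bits."""
    return (bits16.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.14159, 1e-38, 65504.0, 1e30], dtype=np.float32)
print(bfloat16_bits_to_fp32(fp32_to_bfloat16_bits(x)))
# The full FP32 dynamic range survives; only ~2-3 significant decimal digits remain.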
4.3 TPU Pod Architecture
TPU v2/v3 introduced pod-scale deployment:
TPU Pod Topology (v2/v3), 4×4 slice shown:
┌─────┬─────┬─────┬─────┐
│ TPU │ TPU │ TPU │ TPU │
│  0  │  1  │  2  │  3  │
├─────┼─────┼─────┼─────┤
│ TPU │ TPU │ TPU │ TPU │
│  4  │  5  │  6  │  7  │
├─────┼─────┼─────┼─────┤
│ TPU │ TPU │ TPU │ TPU │
│  8  │  9  │ 10  │ 11  │
├─────┼─────┼─────┼─────┤
│ TPU │ TPU │ TPU │ TPU │
│ 12  │ 13  │ 14  │ 15  │
└─────┴─────┴─────┴─────┘
Interconnect Features:
- High-speed inter-chip interconnect (ICI) links (hundreds of Gb/s per link)
- 2D torus topology for scalability
- Dedicated interconnect for collective operations
- Software-managed overlap of communication with computation
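In JAX these collective operations are exposed directly to the programmer. Below is a minimal sketch of the gradient all-reduce that data-parallel training performs over the inter-chip links; the axis name "chips" and the toy data are hypothetical, and on a CPU host the same code runs across local devices.

import jax
import jax.numpy as jnp

def _average_grads(local_grad):
    # pmean performs an all-reduce (sum then divide) across the named device axis,
    # which maps onto the dedicated interconnect on a TPU pod.
    return jax.lax.pmean(local_grad, axis_name="chips")

average_grads = jax.pmap(_average_grads, axis_name="chips")

n = jax.local_device_count()
local_grads = jnp.arange(n * 4, dtype=jnp.float32).reshape(n, 4)  # one slice per device
print(average_grads(local_grads))  # every device now holds the same averaged values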
5. TPU v4 and Beyond
5.1 Advanced Features
TPU v4 introduced several architectural improvements:
- Sparsity Support: SparseCore units that accelerate embedding lookups and other sparse operations
- Optical Interconnect: Circuit-switched optical networking
- Advanced Memory: Improved capacity and bandwidth
- Enhanced Precision: Multiple precision formats
5.2 Optical Circuit Switching
TPU v4 Optical Interconnect:
┌─────────────────────────────────────┐
│ Optical Circuit Switch (OCS)        │
├─────────────────────────────────────┤
│ • Reconfigurable topology           │
│ • Multi-Tbps aggregate bandwidth    │
│ • Low-latency circuit switching     │
│ • Power-efficient optical links     │
└─────────────────────────────────────┘
Benefits:
- Reduced power vs electrical interconnect
- Higher bandwidth density
- Flexible topology reconfiguration
- Improved scaling to massive pod sizes
6. Software Ecosystem
6.1 XLA (Accelerated Linear Algebra)
TPUs rely heavily on the XLA compiler:
- Graph Optimization: Fusion, constant folding, dead code elimination
- Memory Management: Buffer assignment and lifetime analysis
- Parallelization: Automatic sharding across multiple TPUs
- Hardware Mapping: Efficient use of systolic arrays and memory hierarchy
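As a small illustration of these optimizations in practice, XLA fuses the matmul, bias add, and ReLU below into a handful of kernels. Recent JAX versions let you inspect both the lowered graph and the optimized HLO; the exact inspection API is an assumption that depends on the JAX version installed.

import jax
import jax.numpy as jnp

def fused_bias_relu(x, w, b):
    # Three logical ops that XLA compiles into fused kernels
    return jnp.maximum(x @ w + b, 0.0)

x = jnp.ones((128, 256))
w = jnp.ones((256, 512))
b = jnp.zeros((512,))

lowered = jax.jit(fused_bias_relu).lower(x, w, b)
print(lowered.as_text()[:400])            # StableHLO emitted for the graph
print(lowered.compile().as_text()[:400])  # optimized HLO after XLA's fusion passes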
6.2 JAX Integration
JAX provides a natural programming model for TPUs:
import jax
import jax.numpy as jnp

# Automatic compilation to TPU via XLA
# (multi_head_attention, layer_norm, ffn, weights, and x are assumed to be
#  defined elsewhere)
@jax.jit
def transformer_layer(x, weights):
    # Multi-head attention with a residual connection
    attn_out = multi_head_attention(x, weights.attn)
    x = layer_norm(x + attn_out, weights.ln1)
    # Feed-forward network with a residual connection
    ffn_out = ffn(x, weights.ffn)
    x = layer_norm(x + ffn_out, weights.ln2)
    return x

# Automatic sharding across a TPU pod slice
# (newer JAX versions prefer jax.sharding.NamedSharding)
devices = jax.devices()
x_sharded = jax.device_put(x, jax.sharding.PositionalSharding(devices))
7. Performance Characteristics
7.1 Workload Analysis
TPUs excel at certain types of workloads:
TPU Performance Sweet Spot:
┌────────────────────────┬───────────┬───────────────────────────┐
│ Workload Type          │ TPU Fit   │ Reasoning                 │
├────────────────────────┼───────────┼───────────────────────────┤
│ Large CNNs             │ Excellent │ Dense matrix operations   │
│ Transformers           │ Good      │ Attention is matmul-heavy │
│ RNNs                   │ Moderate  │ Sequential dependencies   │
│ Graph Neural Networks  │ Poor      │ Irregular memory access   │
│ Reinforcement Learning │ Variable  │ Depends on the model      │
└────────────────────────┴───────────┴───────────────────────────┘
7.2 Scaling Behavior
TPU performance scales well with model size and batch size:
- Model Parallelism: Efficient weight distribution across chips
- Data Parallelism: High-bandwidth inter-chip communication
- Pipeline Parallelism: Overlapped execution across layers
- Mixed Parallelism: Combination strategies for very large models
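A minimal sketch of how such strategies are expressed in JAX: devices are laid out on a named mesh, arrays are sharded along "data" and "model" axes, and XLA inserts the required collectives. The mesh layout, axis names, and array shapes here are hypothetical.

import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange whatever devices are available on a 2D (data, model) mesh
devices = np.array(jax.devices()).reshape(-1, 1)
mesh = Mesh(devices, axis_names=("data", "model"))

x = jnp.ones((8, 1024))        # activations: shard the batch across "data"
w = jnp.ones((1024, 4096))     # weights: shard columns across "model"
x = jax.device_put(x, NamedSharding(mesh, P("data", None)))
w = jax.device_put(w, NamedSharding(mesh, P(None, "model")))

@jax.jit
def layer(x, w):
    return jnp.dot(x, w)       # partitioned matmul; XLA adds any needed collectives

print(layer(x, w).sharding)    # output is typically sharded over ("data", "model")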
8. Design Lessons and Principles
8.1 Key Insights from TPU Development
- Domain Specialization Pays Off: 10-100× improvements possible with focused design
- Memory Bandwidth is Critical: Often more important than peak compute
- Deterministic Execution: Simplifies deployment and debugging
- Co-design is Essential: Hardware and software must evolve together
- Scale Drives Innovation: Large-scale deployment motivates optimization
8.2 Architectural Trade-offs
TPU Design Decision Analysis:
┌──────────────────────┬──────────────┬──────────────┐
│ Decision             │ Benefits     │ Trade-offs   │
├──────────────────────┼──────────────┼──────────────┤
│ Reduced Precision    │ Area/Power   │ Accuracy     │
│ Systolic Arrays      │ Efficiency   │ Flexibility  │
│ Simple Control       │ Power/Area   │ Generality   │
│ Large On-chip Memory │ Bandwidth    │ Die Area     │
│ Fixed Function       │ Optimization │ Adaptability │
└──────────────────────┴──────────────┴──────────────┘
9. Impact on Industry
9.1 Influence on AI Accelerator Design
The TPU's success has influenced the entire AI hardware industry:
- Systolic Array Adoption: Many accelerators now use similar architectures
- Custom Precision Formats: bfloat16 adopted by other vendors
- Software Co-design: Increased focus on compiler optimization
- Pod-Scale Deployment: Multi-chip systems becoming standard
- Optical Interconnect: Growing interest in photonic networking
9.2 Lessons for Future Architectures
The TPU demonstrates several principles for AI accelerator design:
- Start with the workload: Understand target applications deeply
- Optimize for common case: Design for typical rather than worst-case scenarios
- Memory matters: Bandwidth and capacity often limit performance
- Software is half the solution: Compilers are critical for efficiency
- Scale changes everything: Distributed deployment creates new opportunities
The TPU represents a landmark achievement in computer architecture, demonstrating how domain-specific design can deliver transformational improvements in AI workload performance and efficiency.