Tensor Processing Unit (TPU) Architecture

Deep dive into TPU design philosophy, architecture evolution, and lessons for AI accelerator development


Prerequisites

Make sure you're familiar with these concepts before diving in:

Understanding of neural network operations
Basic knowledge of computer architecture
Familiarity with matrix multiplication algorithms

1. Introduction

Tensor Processing Units (TPUs) represent one of the most successful custom AI accelerator architectures ever deployed. Designed specifically for neural network inference and training, TPUs demonstrate how domain-specific architectures can achieve significant improvements in performance, power efficiency, and cost-effectiveness over general-purpose processors.

2. TPU Design Philosophy

2.1 Core Principles

The TPU architecture is built on several key design principles:

  1. Specialization over Generality: Optimize for neural network operations rather than general computation
  2. Efficiency over Peak Performance: Maximize useful work per watt and per dollar
  3. Deterministic Execution: Predictable performance for production deployment
  4. Co-design with Software: Architecture and compiler developed together

2.2 Design Trade-offs

TPU vs GPU Design Philosophy:
┌───────────────────┬─────────────┬─────────────┐
│ Aspect            │ TPU         │ GPU         │
├───────────────────┼─────────────┼─────────────┤
│ Control Logic     │ Minimal     │ Complex     │
│ Cache Hierarchy   │ Simplified  │ Multi-level │
│ Memory Bandwidth  │ Very High   │ High        │
│ Precision         │ Reduced     │ Full        │
│ Programmability   │ Limited     │ General     │
│ Energy Efficiency │ Optimized   │ Moderate    │
└───────────────────┴─────────────┴─────────────┘

3. TPU v1 Architecture

3.1 System Overview

TPU v1 was designed specifically for neural network inference:

  • Process Technology: 28nm
  • Die Size: ~331 mm²
  • Peak Performance: 92 TOPS (8-bit integer)
  • Memory: 28 MiB on-chip, high-bandwidth off-chip
  • Precision: 8-bit integer only

3.2 Matrix Multiply Unit (MXU)

The heart of TPU v1 is a massive systolic array:

TPU v1 Systolic Array (256×256):

              Weights load from the top (via the Weight FIFO)
              and stay resident during a compute pass
                ↓   ↓   ↓         ↓
              ┌───┬───┬───┬     ┬───┐
Activations → │PE │PE │PE │ ... │PE │
stream in     ├───┼───┼───┼     ┼───┤
from the    → │PE │PE │PE │ ... │PE │
left          ├───┼───┼───┼     ┼───┤
(Unified    → │PE │PE │PE │ ... │PE │
Buffer)       ├───┼───┼───┼     ┼───┤
            → │PE │PE │PE │ ... │PE │
              └───┴───┴───┴     ┴───┘
                ↓   ↓   ↓         ↓
              Partial sums flow down into the accumulators

Each PE performs one 8-bit multiply-accumulate (MAC) per cycle:
partial_sum_out = activation × weight + partial_sum_in
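
To make the dataflow concrete, the sketch below models the arithmetic of a (much smaller) weight-stationary array in plain Python with NumPy. It is an illustrative model of the per-PE multiply-accumulate only, not of the hardware's pipelining or timing, and the array sizes and function name are invented for the example.

import numpy as np

def systolic_matmul(activations, weights):
    """Model a weight-stationary systolic array: weights stay resident in the
    PE grid, activation vectors stream through, and each PE adds its
    activation × weight product to the partial sum passed down its column."""
    rows, cols = weights.shape                    # PE grid size (256×256 on TPU v1)
    out = np.zeros((activations.shape[0], cols), dtype=np.int32)
    for b in range(activations.shape[0]):         # one streamed activation vector at a time
        partial = np.zeros(cols, dtype=np.int32)  # partial sums flowing down the columns
        for r in range(rows):                     # PE rows, top to bottom
            for c in range(cols):                 # PE columns
                partial[c] += int(activations[b, r]) * int(weights[r, c])
        out[b] = partial                          # accumulators collect the column sums
    return out

# 8-bit operands, 32-bit accumulation, as on TPU v1
a = np.random.randint(-128, 127, size=(4, 8), dtype=np.int8)
w = np.random.randint(-128, 127, size=(8, 8), dtype=np.int8)
assert np.array_equal(systolic_matmul(a, w), a.astype(np.int32) @ w.astype(np.int32))

The nested loops collapse to an ordinary matrix multiply; the value of the systolic organization is that each operand is fetched from memory once and then reused across an entire row or column of PEs.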

3.3 Memory Hierarchy

TPU v1 Memory Hierarchy:
┌─────────────────────────────────────┐
│ Host Memory (via PCIe)              │ ← Model storage
├─────────────────────────────────────┤
│ DDR3 DRAM (8 GiB, 30 GB/s)          │ ← Large tensors
├─────────────────────────────────────┤
│ Unified Buffer (24 MiB, 400 GB/s)   │ ← Activation storage
├─────────────────────────────────────┤
│ Weight FIFO (64 KiB)                │ ← Weight streaming
├─────────────────────────────────────┤
│ Accumulator Memory (4 MiB)          │ ← Partial sums
└─────────────────────────────────────┘
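
A back-of-the-envelope calculation using the figures above shows why the large on-chip Unified Buffer matters (the numbers are rounded, illustrative values):

peak_ops = 92e12                  # TPU v1 peak 8-bit ops/s (Section 3.1)
dram_bw_bytes = 30e9              # off-chip DDR3 bandwidth from the diagram above
print(peak_ops / dram_bw_bytes)   # ≈ 3,000 operations needed per DRAM byte

Staying anywhere near peak therefore requires each byte fetched from DRAM to be reused thousands of times, which is exactly what the 24 MiB Unified Buffer and the weight-stationary MXU provide.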

3.4 Performance Analysis

TPU v1 achieved remarkable efficiency:

  • Area Efficiency: 65,536 8-bit MAC units in ~331 mm² (see the arithmetic sketch after this list)
  • Power Efficiency: roughly 30-80× better performance per watt than contemporary CPUs and GPUs
  • Memory Efficiency: High bandwidth, low latency on-chip storage
  • Cost Efficiency: Simplified design reduces manufacturing costs
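
The headline 92 TOPS figure follows directly from the array size; the ~700 MHz clock used below comes from the published TPU v1 paper (Jouppi et al., 2017) rather than from this section:

macs_per_cycle = 256 * 256                    # one 8-bit MAC per PE per cycle
clock_hz = 700e6                              # ~700 MHz, per the TPU v1 paper
tops = macs_per_cycle * 2 * clock_hz / 1e12   # count multiply and add separately
print(tops)                                   # ≈ 91.8, i.e. the quoted 92 TOPS peak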

4. TPU v2/v3 Evolution

4.1 Training Support

TPU v2 added training capabilities:

  • Floating-Point Support: bfloat16 precision
  • Bidirectional Data Flow: Forward and backward pass support
  • Increased Memory: on-package high-bandwidth memory (HBM), 16 GiB per chip on v2 and 32 GiB on v3
  • Vector Processing Units: Element-wise operations
  • Scalar Processing Units: Control and setup operations

4.2 bfloat16 Precision

The bfloat16 ("brain floating point") format was developed at Google Brain and first deployed on TPUs:

IEEE FP16:    [S][EEEEE][FFFFFFFFFF]    (1 + 5 + 10 bits)
TPU bfloat16: [S][EEEEEEEE][FFFFFFF]    (1 + 8 + 7 bits)

Benefits of bfloat16:
- Same dynamic range as FP32 (8-bit exponent), so loss scaling is rarely needed
- Reduced precision (7-bit mantissa) that neural network training tolerates well
- Cheap conversion to and from FP32: drop or zero-fill the low 16 bits (illustrated below)
- Smaller multipliers than FP16 or FP32, since mantissa width dominates multiplier area
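
The conversion benefit is easy to see in code. The sketch below (illustrative, using NumPy bit manipulation; real hardware typically rounds to nearest-even rather than truncating) converts FP32 values to bfloat16 bit patterns and back:

import numpy as np

def fp32_to_bf16_bits(x):
    """Truncate float32 to a bfloat16 bit pattern: keep the sign, the full
    8-bit exponent, and the top 7 mantissa bits; drop the low 16 bits."""
    return (np.asarray(x, dtype=np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_fp32(b):
    """Zero-fill the low 16 bits to re-expand a bfloat16 pattern to float32."""
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.14159265, 1.0e30], dtype=np.float32)
print(bf16_bits_to_fp32(fp32_to_bf16_bits(x)))   # first element becomes 3.140625

The round trip keeps the full FP32 exponent, so values like 1e30 that would overflow IEEE FP16 survive with their magnitude intact, while the mantissa is cut to roughly three decimal digits of precision.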

4.3 TPU Pod Architecture

TPU v2/v3 introduced pod-scale deployment:

TPU Pod Topology (v2/v3), 4×4 slice of the 2D torus (full pods scale to hundreds of chips on v2 and over a thousand on v3):
┌─────┬─────┬─────┬─────┐
│TPU  │TPU  │TPU  │TPU  │
│ 0   │ 1   │ 2   │ 3   │
├─────┼─────┼─────┼─────┤
│TPU  │TPU  │TPU  │TPU  │  
│ 4   │ 5   │ 6   │ 7   │
├─────┼─────┼─────┼─────┤
│TPU  │TPU  │TPU  │TPU  │
│ 8   │ 9   │ 10  │ 11  │
├─────┼─────┼─────┼─────┤
│TPU  │TPU  │TPU  │TPU  │
│ 12  │ 13  │ 14  │ 15  │
└─────┴─────┴─────┴─────┘
 
Interconnect Features:
- High-speed inter-chip links (>100 GB/s per link)
- 2D torus topology for scalability
- Dedicated interconnect for collective operations
- Software-managed overlap of communication and compute (see the collective example below)
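
As a small, hypothetical example of how these collectives are driven from software, the JAX snippet below averages per-device gradients with jax.pmap and jax.lax.psum; the device count and array shapes are placeholders:

from functools import partial
import jax
import jax.numpy as jnp

n_dev = jax.local_device_count()                   # e.g. 8 TPU cores on one host

@partial(jax.pmap, axis_name="devices")            # replicate the function across local devices
def allreduce_mean(local_grad):
    # psum is lowered to an all-reduce over the dedicated interconnect on TPU pods
    return jax.lax.psum(local_grad, axis_name="devices") / n_dev

grads = jnp.arange(n_dev * 4, dtype=jnp.float32).reshape(n_dev, 4)   # one shard per device
print(allreduce_mean(grads))                       # every device now holds the same mean

The same code runs unchanged on a single host for testing; on a pod, the compiler schedules the all-reduce over the inter-chip links and can overlap it with compute.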

5. TPU v4 and Beyond

5.1 Advanced Features

TPU v4 introduces several architectural improvements:

  • Sparsity Support: Hardware acceleration for sparse operations
  • Optical Interconnect: Circuit-switched optical networking
  • Advanced Memory: Improved capacity and bandwidth
  • Enhanced Precision: Multiple precision formats

5.2 Optical Circuit Switching

TPU v4 Optical Interconnect:
┌─────────────────────────────────────┐
│ Optical Circuit Switch (OCS)        │
├─────────────────────────────────────┤
│ • Reconfigurable topology           │
│ • Multi-Tbps aggregate bandwidth    │
│ • Low-latency circuit switching     │
│ • Power-efficient optical links     │
└─────────────────────────────────────┘
 
Benefits:
- Reduced power vs electrical interconnect
- Higher bandwidth density
- Flexible topology reconfiguration
- Improved scaling to massive pod sizes

6. Software Ecosystem

6.1 XLA (Accelerated Linear Algebra)

TPUs rely heavily on the XLA compiler:

  • Graph Optimization: Fusion, constant folding, dead code elimination
  • Memory Management: Buffer assignment and lifetime analysis
  • Parallelization: Automatic sharding across multiple TPUs
  • Hardware Mapping: Efficient use of the systolic arrays and memory hierarchy (the example after this list shows the program XLA receives)
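
One way to see this pipeline from user code, on recent JAX versions, is to inspect what gets handed to XLA; the tiny fused computation below is an illustrative example, not taken from the original text:

import jax
import jax.numpy as jnp

def fused(x, w):
    # matmul followed by an element-wise op: a natural candidate for XLA operator fusion
    return jax.nn.relu(x @ w)

x = jnp.ones((8, 128))
w = jnp.ones((128, 64))

print(jax.make_jaxpr(fused)(x, w))            # the traced program JAX produces
print(jax.jit(fused).lower(x, w).as_text())   # the lowered module handed to the XLA compiler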

6.2 JAX Integration

JAX provides a natural programming model for TPUs:

import jax
import jax.numpy as jnp

# multi_head_attention, layer_norm, and ffn are assumed to be defined elsewhere;
# `weights` is a pytree (e.g. a dataclass) holding the per-sublayer parameters.

# jax.jit triggers XLA compilation; on a TPU host the compiled program runs on the TPU
@jax.jit
def transformer_layer(x, weights):
    # Multi-head attention with residual connection and layer norm
    attn_out = multi_head_attention(x, weights.attn)
    x = layer_norm(x + attn_out, weights.ln1)

    # Feed-forward network with residual connection and layer norm
    ffn_out = ffn(x, weights.ffn)
    x = layer_norm(x + ffn_out, weights.ln2)
    return x

# Shard the input across the available TPU devices; XLA partitions the computation
# and inserts any needed collectives automatically
devices = jax.devices()
x_sharded = jax.device_put(x, jax.sharding.PositionalSharding(devices))

7. Performance Characteristics

7.1 Workload Analysis

TPUs excel at certain types of workloads:

TPU Performance Sweet Spot:
┌────────────────────────┬───────────┬──────────────────────────┐
│ Workload Type          │ TPU Fit   │ Reasoning                │
├────────────────────────┼───────────┼──────────────────────────┤
│ Large CNNs             │ Excellent │ Dense matrix operations  │
│ Transformers           │ Good      │ Attention maps to matmul │
│ RNNs                   │ Moderate  │ Sequential dependencies  │
│ Graph neural nets      │ Poor      │ Irregular memory access  │
│ Reinforcement learning │ Variable  │ Depends on the workload  │
└────────────────────────┴───────────┴──────────────────────────┘

7.2 Scaling Behavior

TPU performance scales well with model size and batch size:

  • Model Parallelism: Efficient weight distribution across chips
  • Data Parallelism: High-bandwidth inter-chip communication
  • Pipeline Parallelism: Overlapped execution across layers
  • Mixed Parallelism: Combination strategies for very large models (sketched in the example below)
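
The JAX sketch below shows how a combined data/model-parallel layout is expressed with a device mesh; the 2×4 mesh shape, axis names, and array sizes are placeholder assumptions, and at least 8 devices are needed for it to run as written:

import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange devices into a 2D mesh: one axis for data parallelism, one for
# model (tensor) parallelism. The 2×4 shape is a placeholder.
devices = np.array(jax.devices()[:8]).reshape(2, 4)
mesh = Mesh(devices, axis_names=("data", "model"))

# Shard the batch over the "data" axis and the weight columns over "model"
x = jax.device_put(jnp.ones((32, 1024)),  NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((1024, 4096)), NamedSharding(mesh, P(None, "model")))

y = jax.jit(lambda a, b: a @ b)(x, w)   # XLA partitions the matmul across the mesh
print(y.sharding)

Under jit, XLA inserts the collectives needed to combine partial results, so the parallelism strategy lives in the sharding annotations rather than in the model code.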

8. Design Lessons and Principles

8.1 Key Insights from TPU Development

  1. Domain Specialization Pays Off: 10-100× improvements possible with focused design
  2. Memory Bandwidth is Critical: Often more important than peak compute
  3. Deterministic Execution: Simplifies deployment and debugging
  4. Co-design is Essential: Hardware and software must evolve together
  5. Scale Drives Innovation: Large-scale deployment motivates optimization

8.2 Architectural Trade-offs

TPU Design Decision Analysis:
┌─────────────────────┬──────────────┬──────────────┐
│ Decision            │ Benefits     │ Trade-offs   │
├─────────────────────┼──────────────┼──────────────┤
│ Reduced Precision   │ Area/Power   │ Accuracy     │
│ Systolic Arrays     │ Efficiency   │ Flexibility  │
│ Simple Control      │ Power/Area   │ Generality   │
│ Large On-chip Memory│ Bandwidth    │ Die Area     │
│ Fixed Function Units│ Optimization │ Adaptability │
└─────────────────────┴──────────────┴──────────────┘

9. Impact on Industry

9.1 Influence on AI Accelerator Design

The TPU's success has influenced the entire AI hardware industry:

  • Systolic Array Adoption: Many accelerators now use similar architectures
  • Custom Precision Formats: bfloat16 adopted by other vendors
  • Software Co-design: Increased focus on compiler optimization
  • Pod-Scale Deployment: Multi-chip systems becoming standard
  • Optical Interconnect: Growing interest in photonic networking

9.2 Lessons for Future Architectures

The TPU demonstrates several principles for AI accelerator design:

  • Start with the workload: Understand target applications deeply
  • Optimize for common case: Design for typical rather than worst-case scenarios
  • Memory matters: Bandwidth and capacity often limit performance
  • Software is half the solution: Compilers are critical for efficiency
  • Scale changes everything: Distributed deployment creates new opportunities

The TPU represents a landmark achievement in computer architecture, demonstrating how domain-specific design can deliver transformational improvements in AI workload performance and efficiency.