Tensor Processing Unit (TPU) Architecture
Deep dive into TPU design philosophy, architecture evolution, and lessons for AI accelerator development
1. Introduction
Tensor Processing Units (TPUs) represent one of the most successful custom AI accelerator architectures ever deployed. Designed specifically for neural network inference and training, TPUs demonstrate how domain-specific architectures can achieve significant improvements in performance, power efficiency, and cost-effectiveness over general-purpose processors.
2. TPU Design Philosophy
2.1 Core Principles
The TPU architecture is built on several key design principles:
- Specialization over Generality: Optimize for neural network operations rather than general computation
- Efficiency over Peak Performance: Maximize useful work per watt and per dollar
- Deterministic Execution: Predictable performance for production deployment
- Co-design with Software: Architecture and compiler developed together
2.2 Design Trade-offs
TPU vs GPU Design Philosophy:
┌───────────────────┬─────────────┬─────────────┐
│ Aspect            │ TPU         │ GPU         │
├───────────────────┼─────────────┼─────────────┤
│ Control Logic     │ Minimal     │ Complex     │
│ Cache Hierarchy   │ Simplified  │ Multi-level │
│ Memory Bandwidth  │ Very High   │ High        │
│ Precision         │ Reduced     │ Full        │
│ Programmability   │ Limited     │ General     │
│ Energy Efficiency │ Optimized   │ Moderate    │
└───────────────────┴─────────────┴─────────────┘
3. TPU v1 Architecture
3.1 System Overview
TPU v1 was designed specifically for neural network inference:
- Process Technology: 28nm
- Die Size: ~331 mm²
- Peak Performance: 92 TOPS (8-bit integer)
- Memory: 28 MiB on-chip, plus 8 GiB of off-chip DDR3
- Precision: 8-bit integer only
3.2 Matrix Multiply Unit (MXU)
The heart of TPU v1 is a massive systolic array:
TPU v1 Systolic Array (256×256):
     Weights loaded from the top
       ↓   ↓   ↓         ↓
     ┌───┬───┬───┬─────┬───┐
   → │PE │PE │PE │ ... │PE │
     ├───┼───┼───┼─────┼───┤
   → │PE │PE │PE │ ... │PE │
     ├───┼───┼───┼─────┼───┤
     │ : │ : │ : │  :  │ : │
     ├───┼───┼───┼─────┼───┤
   → │PE │PE │PE │ ... │PE │
     └───┴───┴───┴─────┴───┘
       ↓   ↓   ↓         ↓
  Activations stream in from the left (→), weights are pre-loaded
  from the top, and partial sums flow down into the accumulators (↓)
Each PE performs:
result = activation × weight + partial_sum
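To make the dataflow concrete, here is a minimal Python/NumPy sketch that simulates the multiply-accumulate pattern of a (tiny) systolic array. It is illustrative only: the function name and the fully unrolled loop schedule are assumptions chosen for readability, not the actual pipelined TPU microarchitecture.

import numpy as np

def systolic_matmul(activations, weights):
    """Simulate a systolic matrix multiply: out = activations @ weights.

    Each (i, j) position models one processing element (PE) holding a
    stationary weight; activations stream across rows and partial sums
    accumulate down columns. Real hardware overlaps these steps cycle
    by cycle rather than looping sequentially.
    """
    n_rows, n_cols = weights.shape           # 256 x 256 on TPU v1
    batch = activations.shape[0]
    out = np.zeros((batch, n_cols), dtype=np.int32)
    for b in range(batch):                   # one activation vector at a time
        for j in range(n_cols):              # partial sums flow down column j
            partial_sum = 0
            for i in range(n_rows):          # PE (i, j): MAC with its weight
                partial_sum += int(activations[b, i]) * int(weights[i, j])
            out[b, j] = partial_sum          # result drains into the accumulators
    return out

# Sanity check against an ordinary matrix multiply
A = np.random.randint(-128, 127, size=(4, 8), dtype=np.int8)
W = np.random.randint(-128, 127, size=(8, 3), dtype=np.int8)
assert np.array_equal(systolic_matmul(A, W), A.astype(np.int32) @ W.astype(np.int32))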
3.3 Memory Hierarchy
TPU v1 Memory Hierarchy:
┌─────────────────────────────────────┐
│ Host Memory (via PCIe)              │ ← Model storage
├─────────────────────────────────────┤
│ DDR3 DRAM (8 GiB, 30 GB/s)          │ ← Large tensors
├─────────────────────────────────────┤
│ Unified Buffer (24 MiB, 400 GB/s)   │ ← Activation storage
├─────────────────────────────────────┤
│ Weight FIFO (64 KiB)                │ ← Weight streaming
├─────────────────────────────────────┤
│ Accumulator Memory (4 MiB)          │ ← Partial sums
└─────────────────────────────────────┘
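A rough, roofline-style estimate using the 92 TOPS and 30 GB/s figures above shows why the large on-chip buffers are essential: DRAM bandwidth alone cannot keep the matrix unit busy.

# Back-of-the-envelope check of why on-chip storage matters (numbers taken
# from the hierarchy above; this is an estimate, not a measured figure).
peak_ops_per_s = 92e12        # 92 TOPS of 8-bit operations
dram_bandwidth = 30e9         # ~30 GB/s of off-chip DDR3 bandwidth
min_ops_per_byte = peak_ops_per_s / dram_bandwidth
print(f"Need ~{min_ops_per_byte:.0f} ops per DRAM byte to saturate the MXU")
# -> roughly 3000 ops/byte: data fetched from DRAM must be reused heavily,
#    which is exactly what the Unified Buffer and Weight FIFO provide.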
3.4 Performance Analysis
TPU v1 achieved remarkable efficiency:
- Area Efficiency: 65,536 8-bit MAC units in 331 mm²
- Power Efficiency: roughly 30-80× higher TOPS/W than contemporary CPUs and GPUs (per the original TPU paper)
- Memory Efficiency: High bandwidth, low latency on-chip storage
- Cost Efficiency: Simplified design reduces manufacturing costs
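The 92 TOPS peak figure follows directly from the array size and the 700 MHz clock reported for TPU v1:

# Where the 92 TOPS number comes from
macs_per_cycle = 256 * 256        # 65,536 8-bit MAC units in the MXU
clock_hz = 700e6                  # TPU v1 clock frequency
ops_per_mac = 2                   # one multiply plus one add
peak_tops = macs_per_cycle * clock_hz * ops_per_mac / 1e12
print(f"Peak throughput: {peak_tops:.1f} TOPS")   # ~91.8 TOPS, quoted as 92 TOPS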
4. TPU v2/v3 Evolution
4.1 Training Support
TPU v2 added training capabilities:
- Floating-Point Support: bfloat16 precision
- Bidirectional Data Flow: Forward and backward pass support
- Increased Memory: 16 GiB of high-bandwidth memory (HBM) per chip in v2, doubled to 32 GiB in v3
- Vector Processing Units: Element-wise operations
- Scalar Processing Units: Control and setup operations
4.2 bfloat16 Precision
The bfloat16 format was introduced specifically for TPU architectures:
IEEE FP16: [S][EEEEE][FFFFFFFFFF] (1+5+10 bits)
TPU bfloat16: [S][EEEEEEEE][FFFFFFF] (1+8+7 bits)
Benefits of bfloat16:
- Same dynamic range as FP32 (8-bit exponent)
- Reduced precision requirements (7-bit mantissa)
- Cheap conversion to and from FP32 (keep or zero-fill the low 16 bits)
- Hardware simplification vs FP16
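Because bfloat16 is simply the upper half of an IEEE float32, the conversion can be sketched in a few lines of NumPy. Truncation is shown here for clarity; real hardware typically also supports round-to-nearest-even.

import numpy as np

def fp32_to_bfloat16_bits(x):
    """Convert float32 values to bfloat16 bit patterns by keeping the top 16 bits."""
    bits32 = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits32 >> 16).astype(np.uint16)     # sign, 8-bit exponent, 7-bit mantissa

def bfloat16_bits_to_fp32(bits16):
    """Expand bfloat16 bit patterns back to float32 by zero-filling the low 16 bits."""
    return (bits16.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.14159, 1e-38, 65504.0, 1e30], dtype=np.float32)
print(bfloat16_bits_to_fp32(fp32_to_bfloat16_bits(x)))
# The full FP32 dynamic range survives; only ~2-3 significant decimal digits remain.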
4.3 TPU Pod Architecture
TPU v2/v3 introduced pod-scale deployment:
TPU Pod Topology (v2/v3), 4×4 slice shown:
┌─────┬─────┬─────┬─────┐
│ TPU │ TPU │ TPU │ TPU │
│  0  │  1  │  2  │  3  │
├─────┼─────┼─────┼─────┤
│ TPU │ TPU │ TPU │ TPU │
│  4  │  5  │  6  │  7  │
├─────┼─────┼─────┼─────┤
│ TPU │ TPU │ TPU │ TPU │
│  8  │  9  │ 10  │ 11  │
├─────┼─────┼─────┼─────┤
│ TPU │ TPU │ TPU │ TPU │
│ 12  │ 13  │ 14  │ 15  │
└─────┴─────┴─────┴─────┘
Interconnect Features:
- High-speed inter-chip interconnect (ICI) links (hundreds of Gb/s per link)
- 2D torus topology for scalability
- Dedicated interconnect for collective operations
- Software-managed overlap of communication with computation
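In JAX these collective operations are exposed directly to the programmer. Below is a minimal sketch of the gradient all-reduce that data-parallel training performs over the inter-chip links; the axis name "chips" and the toy data are hypothetical, and on a CPU host the same code runs across local devices.

import jax
import jax.numpy as jnp

def _average_grads(local_grad):
    # pmean performs an all-reduce (sum then divide) across the named device axis,
    # which maps onto the dedicated interconnect on a TPU pod.
    return jax.lax.pmean(local_grad, axis_name="chips")

average_grads = jax.pmap(_average_grads, axis_name="chips")

n = jax.local_device_count()
local_grads = jnp.arange(n * 4, dtype=jnp.float32).reshape(n, 4)  # one slice per device
print(average_grads(local_grads))  # every device now holds the same averaged values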
5. TPU v4 and Beyond
5.1 Advanced Features
TPU v4 introduced several architectural improvements:
- Sparsity Support: SparseCore units that accelerate embedding lookups and other sparse operations
- Optical Interconnect: Circuit-switched optical networking
- Advanced Memory: Improved capacity and bandwidth
- Enhanced Precision: Multiple precision formats
5.2 Optical Circuit Switching
TPU v4 Optical Interconnect:
┌─────────────────────────────────────┐
│ Optical Circuit Switch (OCS)        │
├─────────────────────────────────────┤
│ • Reconfigurable topology           │
│ • Multi-Tbps aggregate bandwidth    │
│ • Low-latency circuit switching     │
│ • Power-efficient optical links     │
└─────────────────────────────────────┘
Benefits:
- Reduced power vs electrical interconnect
- Higher bandwidth density
- Flexible topology reconfiguration
- Improved scaling to massive pod sizes
6. Software Ecosystem
6.1 XLA (Accelerated Linear Algebra)
TPUs rely heavily on the XLA compiler:
- Graph Optimization: Fusion, constant folding, dead code elimination
- Memory Management: Buffer assignment and lifetime analysis
- Parallelization: Automatic sharding across multiple TPUs
- Hardware Mapping: Efficient use of systolic arrays and memory hierarchy
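As a small illustration of these optimizations in practice, XLA fuses the matmul, bias add, and ReLU below into a handful of kernels. Recent JAX versions let you inspect both the lowered graph and the optimized HLO; the exact inspection API is an assumption that depends on the JAX version installed.

import jax
import jax.numpy as jnp

def fused_bias_relu(x, w, b):
    # Three logical ops that XLA compiles into fused kernels
    return jnp.maximum(x @ w + b, 0.0)

x = jnp.ones((128, 256))
w = jnp.ones((256, 512))
b = jnp.zeros((512,))

lowered = jax.jit(fused_bias_relu).lower(x, w, b)
print(lowered.as_text()[:400])            # StableHLO emitted for the graph
print(lowered.compile().as_text()[:400])  # optimized HLO after XLA's fusion passes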
6.2 JAX Integration
JAX provides a natural programming model for TPUs:
import jax
import jax.numpy as jnp

# Automatic compilation to TPU via XLA
# (multi_head_attention, layer_norm, ffn, weights, and x are assumed to be
#  defined elsewhere)
@jax.jit
def transformer_layer(x, weights):
    # Multi-head attention with a residual connection
    attn_out = multi_head_attention(x, weights.attn)
    x = layer_norm(x + attn_out, weights.ln1)
    # Feed-forward network with a residual connection
    ffn_out = ffn(x, weights.ffn)
    x = layer_norm(x + ffn_out, weights.ln2)
    return x

# Automatic sharding across a TPU pod slice
# (newer JAX versions prefer jax.sharding.NamedSharding)
devices = jax.devices()
x_sharded = jax.device_put(x, jax.sharding.PositionalSharding(devices))
7. Performance Characteristics
7.1 Workload Analysis
TPUs excel at certain types of workloads:
TPU Performance Sweet Spot:
┌────────────────────────┬───────────┬───────────────────────────┐
│ Workload Type          │ TPU Fit   │ Reasoning                 │
├────────────────────────┼───────────┼───────────────────────────┤
│ Large CNNs             │ Excellent │ Dense matrix operations   │
│ Transformers           │ Good      │ Attention is matmul-heavy │
│ RNNs                   │ Moderate  │ Sequential dependencies   │
│ Graph Neural Networks  │ Poor      │ Irregular memory access   │
│ Reinforcement Learning │ Variable  │ Depends on the model      │
└────────────────────────┴───────────┴───────────────────────────┘
7.2 Scaling Behavior
TPU performance scales well with model size and batch size:
- Model Parallelism: Efficient weight distribution across chips
- Data Parallelism: High-bandwidth inter-chip communication
- Pipeline Parallelism: Overlapped execution across layers
- Mixed Parallelism: Combination strategies for very large models
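A minimal sketch of how such strategies are expressed in JAX: devices are laid out on a named mesh, arrays are sharded along "data" and "model" axes, and XLA inserts the required collectives. The mesh layout, axis names, and array shapes here are hypothetical.

import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange whatever devices are available on a 2D (data, model) mesh
devices = np.array(jax.devices()).reshape(-1, 1)
mesh = Mesh(devices, axis_names=("data", "model"))

x = jnp.ones((8, 1024))        # activations: shard the batch across "data"
w = jnp.ones((1024, 4096))     # weights: shard columns across "model"
x = jax.device_put(x, NamedSharding(mesh, P("data", None)))
w = jax.device_put(w, NamedSharding(mesh, P(None, "model")))

@jax.jit
def layer(x, w):
    return jnp.dot(x, w)       # partitioned matmul; XLA adds any needed collectives

print(layer(x, w).sharding)    # output is typically sharded over ("data", "model")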
8. Design Lessons and Principles
8.1 Key Insights from TPU Development
- Domain Specialization Pays Off: 10-100× improvements possible with focused design
- Memory Bandwidth is Critical: Often more important than peak compute
- Deterministic Execution: Simplifies deployment and debugging
- Co-design is Essential: Hardware and software must evolve together
- Scale Drives Innovation: Large-scale deployment motivates optimization
8.2 Architectural Trade-offs
TPU Design Decision Analysis:
┌──────────────────────┬──────────────┬──────────────┐
│ Decision             │ Benefits     │ Trade-offs   │
├──────────────────────┼──────────────┼──────────────┤
│ Reduced Precision    │ Area/Power   │ Accuracy     │
│ Systolic Arrays      │ Efficiency   │ Flexibility  │
│ Simple Control       │ Power/Area   │ Generality   │
│ Large On-chip Memory │ Bandwidth    │ Die Area     │
│ Fixed Function       │ Optimization │ Adaptability │
└──────────────────────┴──────────────┴──────────────┘
9. Impact on Industry
9.1 Influence on AI Accelerator Design
The TPU's success has influenced the entire AI hardware industry:
- Systolic Array Adoption: Many accelerators now use similar architectures
- Custom Precision Formats: bfloat16 adopted by other vendors
- Software Co-design: Increased focus on compiler optimization
- Pod-Scale Deployment: Multi-chip systems becoming standard
- Optical Interconnect: Growing interest in photonic networking
9.2 Lessons for Future Architectures
The TPU demonstrates several principles for AI accelerator design:
- Start with the workload: Understand target applications deeply
- Optimize for common case: Design for typical rather than worst-case scenarios
- Memory matters: Bandwidth and capacity often limit performance
- Software is half the solution: Compilers are critical for efficiency
- Scale changes everything: Distributed deployment creates new opportunities
The TPU represents a landmark achievement in computer architecture, demonstrating how domain-specific design can deliver transformational improvements in AI workload performance and efficiency.