Advanced GPU Architecture & Performance
Modern GPU microarchitecture, SIMT execution model, and performance optimization for senior-level design.
Modern GPUs are throughput-oriented many-core processors that have revolutionized parallel computing. But here's what separates the pros from the amateurs: understanding the execution model, memory hierarchy, and performance methodology well enough to design, analyze, and debug at scale.
Let's dive deep into the architecture that powers everything from gaming to AI training. 🚀
1. The Big Picture: GPU Design Philosophy
A modern GPU trades single-thread latency for massive data-parallel throughput:
- SIMT execution: Groups of threads (warps) execute one instruction stream in lockstep
- Enormous in-flight concurrency to hide long memory latencies
- Software-visible hierarchy: Grid → thread blocks → warps → threads
- Hardware hierarchy: GPCs → TPCs → SMs → execution lanes
- Deep memory hierarchy: Registers → shared/L1 → L2 → HBM/GDDR
- High-bandwidth interconnects: Within GPU and across GPUs (NVLink)
2. Programming & Execution Model
2.1 CUDA-Style Mental Model
The GPU execution model creates a hierarchy of parallelism:
- Kernel launch: Grid of thread blocks (CTAs)
- Thread block placement: Each CTA assigned to an SM; multiple CTAs can be resident if resources allow
- Warp execution: Warps of 32 threads issue instructions together; since Volta, Independent Thread Scheduling gives each thread its own program counter
- Divergence handling: Hardware tracks masks/stacks, reconverging at post-dominator
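To make this concrete, here is a minimal vector-add launch (kernel and variable names are illustrative; a, b, c are device pointers): the launch creates a grid of CTAs, each CTA lands on one SM, and its threads execute in warps of 32.
// Each thread computes one element; the grid covers all N elements
__global__ void add_kernel(const float *a, const float *b, float *c, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // Global thread index
    if (i < N) c[i] = a[i] + b[i];
}
// Host side: a grid of CTAs (thread blocks), each with 256 threads = 8 warps
void launch_add(const float *a, const float *b, float *c, int N) {
    int block = 256;
    int grid  = (N + block - 1) / block;
    add_kernel<<<grid, block>>>(a, b, c, N);
}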
2.2 Critical Volta+ Changes: Independent Thread Scheduling
Pre-Volta: Warps executed in perfect lockstep - warp-synchronous programming was "safe"
Volta+: Each thread has its own program counter, enabling:
- Finer-grained reconvergence after divergence
- Better performance for irregular control flow
- Breaking change: Warp-synchronous code now has race conditions!
// Pre-Volta: this "worked" because the warp ran in lockstep (still undefined behavior)
__global__ void unsafe_warp_sync(int *out) {
    int lane = threadIdx.x % 32;        // assumes a single warp per block for simplicity
    __shared__ int data[32];
    data[lane] = lane;
    // Relies on an implicit warp barrier - NO LONGER GUARANTEED on Volta+!
    out[threadIdx.x] = data[31 - lane]; // Race condition on Volta+
}
// Volta+: must use explicit synchronization
__global__ void safe_warp_sync(int *out) {
    int lane = threadIdx.x % 32;        // assumes a single warp per block for simplicity
    __shared__ int data[32];
    data[lane] = lane;
    __syncwarp();                       // Explicit warp barrier: all writes visible before the read
    out[threadIdx.x] = data[31 - lane]; // Now safe
}
2.3 Divergence & Reconvergence
When threads in a warp take different execution paths:
- Hardware tracks active mask for each path
- Serializes execution of divergent paths
- Reconverges at the immediate post-dominator
- Cost is proportional to path imbalance
// Bad: Maximum divergence
if (threadIdx.x % 2 == 0) {
// 50% of threads execute this
expensive_computation_A();
} else {
// 50% of threads execute this
expensive_computation_B();
}
// Total time = time_A + time_B
// Better: Minimize divergence
int warp_id = threadIdx.x / 32;
if (warp_id % 2 == 0) {
// Entire warps take same path
expensive_computation_A();
} else {
expensive_computation_B();
}
3. SM (Streaming Multiprocessor) Microarchitecture
The SM is where the magic happens. Let's break down the key components:
3.1 Front-End: Instruction Fetch & Scheduling
- Warp schedulers: Each SM is split into partitions (typically four), each with its own scheduler that fetches and issues instructions for the warps resident on it
- Scoreboard tracking: Monitors register dependencies and outstanding memory operations before a warp is marked ready
- Issue logic: Each scheduler selects a ready warp per cycle, so an SM can issue from several warps simultaneously
3.2 Execution Units
Modern SMs contain heterogeneous execution resources:
- FP32/INT32 ALUs: Basic arithmetic and logic operations
- FP64 ALUs: Double-precision (typically fewer units)
- Tensor Cores: Matrix multiply-accumulate for AI workloads
- SFUs (Special Function Units): Transcendentals, interpolation
- Load/Store Units: Memory operations with coalescing logic
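For a taste of how the Tensor Core path is programmed, here is a minimal sketch using the public nvcuda::wmma API: one warp multiplies a single 16x16x16 FP16 tile with FP32 accumulation (pointers are assumed to be device memory in the stated layouts).
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;
// One warp multiplies a 16x16 row-major A tile by a 16x16 col-major B tile
__global__ void wmma_tile(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);              // C = 0
    wmma::load_matrix_sync(a_frag, A, 16);          // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // C += A * B on Tensor Cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}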
3.3 Register File & Operand Collection
- Massive register file: 64K+ 32-bit registers per SM
- Operand collectors: Gather operands from register file
- Bank conflicts: Register file is banked; avoid conflicts with careful allocation
4. Hopper+ Advanced Features
4.1 Thread Block Clusters (TBC)
Innovation: Multiple CTAs scheduled concurrently across nearby SMs with hardware coordination:
// Hopper: cluster-level programming via the cooperative groups cluster API
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel() {
    extern __shared__ int shared_data[];   // dynamic shared memory, sized at launch
    cg::cluster_group cluster = cg::this_cluster();
    shared_data[threadIdx.x] = threadIdx.x;
    cluster.sync(); // Hardware barrier across all CTAs in the cluster
    // Distributed shared memory: map a pointer into a neighboring CTA's shared memory
    unsigned int target_rank = (cluster.block_rank() + 1) % cluster.num_blocks();
    int *remote_shared = cluster.map_shared_rank(shared_data, target_rank);
    int value = remote_shared[threadIdx.x]; // Direct load from another CTA's shared mem
    (void)value;
    cluster.sync(); // Keep remote shared memory alive until every CTA is done reading
}
4.2 Tensor Memory Accelerator (TMA)
Game changer: Asynchronous tensor-shaped copies with hardware acceleration:
// TMA: overlap data movement with compute (illustrative pseudocode; the real
// mechanism is exposed via cp.async.bulk.tensor with tensor maps created by
// cuTensorMapEncodeTiled, or through cuda::memcpy_async in libcu++)
__global__ void tma_gemm(/* tensor-map descriptors, tile coordinates, ... */) {
    __shared__ float A_shared[TILE_M][TILE_K];
    __shared__ float B_shared[TILE_K][TILE_N];
    // Kick off asynchronous copies of the NEXT tile while computing the current one
    tma_load_async(A_shared, global_A_ptr, tile_coords); // placeholder call
    tma_load_async(B_shared, global_B_ptr, tile_coords); // placeholder call
    // Compute on the tile fetched in the previous iteration
    compute_tile(prev_A, prev_B, C_accum);
    // Wait for the async copies before the next iteration touches the new tile
    tma_wait_group(0); // placeholder call
}
4.3 Distributed Shared Memory (DSMEM)
Capability: Direct loads/stores/atomics to another CTA's shared memory within a cluster:
- Use case: Reduce global memory traffic for inter-CTA communication
- Performance: Much faster than global memory roundtrip
- Programming: Explicit rank mapping and synchronization required
5. Memory Hierarchy Deep Dive
5.1 The Complete Hierarchy
Registers (per-thread) → ~1 cycle latency, highest bandwidth
↓
Shared Memory/L1 (per-SM) → ~20-30 cycle latency, coalescing critical
↓
L2 Cache (unified) → ~200 cycle latency, cross-SM sharing
↓
HBM/GDDR (off-package) → ~400-800 cycle latency, massive bandwidth
5.2 Coalescing: The Make-or-Break Optimization
The rule: Warp memory accesses should be contiguous and aligned to minimize transactions.
// Bad: strided access - up to 32 separate memory transactions for one warp
__global__ void strided_bad(const float *input, float *output, int stride) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    output[tid] = input[tid * stride]; // Each thread touches a different cache line
}
// Good: coalesced access - consecutive threads read consecutive elements
__global__ void coalesced_good(const float *input, float *output) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    output[tid] = input[tid]; // One (or a few) transactions serve the whole warp
}
// Advanced: vectorized 128-bit accesses when the data is 16-byte aligned
__global__ void vectorized_best(const float *input, float *output) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float4 vals = reinterpret_cast<const float4 *>(input)[tid]; // One float4 load per thread
    reinterpret_cast<float4 *>(output)[tid] = vals;
}
5.3 Shared Memory Banking
Shared memory is divided into 32 banks (4-byte words). Simultaneous access to the same bank causes bank conflicts.
// Bad: 32-way bank conflict - every thread in the warp hits bank 0
__shared__ float tile_a[32][32];
float val_a = tile_a[threadIdx.x][0]; // Addresses differ by 32 words -> same bank
// Good: pad each row to 33 words so column accesses spread across all banks
__shared__ float tile_b[32][33];
float val_b = tile_b[threadIdx.x][threadIdx.y]; // Conflict-free
// Advanced: swizzle the column index instead of padding (unpadded 32x32 array)
int swizzled_idx = (threadIdx.x + threadIdx.y) % 32;
float val_c = tile_a[threadIdx.x][swizzled_idx]; // Also conflict-free
5.4 Virtual Memory & TLBs
GPUs have full virtual memory with hardware page table walkers:
- TLB reach: Critical for large working sets
- Page sizes: 4KB, 2MB, 1GB huge pages supported
- UVM (Unified Virtual Memory): Transparent CPU-GPU memory management
- Page fault handling: Can be expensive; prefer explicit memory management
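As a sketch of what "prefer explicit memory management" looks like under UVM (device 0 and the 1 GiB size are arbitrary choices for illustration), managed memory can be prefetched and given placement hints instead of relying on demand paging:
#include <cuda_runtime.h>
void uvm_prefetch_example() {
    float *buf = nullptr;
    size_t bytes = size_t(1) << 30;               // 1 GiB, arbitrary example size
    cudaMallocManaged(&buf, bytes);               // CPU+GPU visible managed allocation
    // Hints: pages should live on GPU 0; prefetch them before any kernel touches them
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetPreferredLocation, 0);
    cudaMemPrefetchAsync(buf, bytes, 0, 0);       // device 0, default stream
    // ... launch kernels that read/write buf ...
    cudaMemPrefetchAsync(buf, bytes, cudaCpuDeviceId, 0); // bring pages back for host access
    cudaDeviceSynchronize();
    cudaFree(buf);
}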
6. Concurrency, Residency, and Scheduling
6.1 Occupancy vs Performance
Occupancy = Active warps / Maximum possible warps per SM
// Calculate theoretical occupancy (all per-SM limits are architecture dependent)
int max_warps_per_sm = 64;
int warps_per_block = (block_size + 31) / 32;
int blocks_per_sm_regs = max_regs_per_sm / (regs_per_thread * block_size);
int blocks_per_sm_shared = max_shared_per_sm / shared_per_block;
int blocks_per_sm = min(blocks_per_sm_regs, blocks_per_sm_shared);
// Also clamp by the warp limit itself (and by the hardware max-CTAs-per-SM limit)
blocks_per_sm = min(blocks_per_sm, max_warps_per_sm / warps_per_block);
// Occupancy is a fraction - integer division here would always round down to 0
float occupancy = float(blocks_per_sm * warps_per_block) / max_warps_per_sm;
Key insight: High occupancy ≠ high performance. Optimize for useful work, not just occupancy.
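In practice you rarely do this arithmetic by hand; the CUDA runtime can report the residency limit for a given kernel and block size. A minimal sketch, with my_kernel as a placeholder:
#include <cuda_runtime.h>
#include <cstdio>
__global__ void my_kernel(float *data) { data[threadIdx.x] *= 2.0f; } // placeholder kernel
void report_occupancy(int block_size, size_t dynamic_smem_bytes) {
    int max_blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm, my_kernel, block_size, dynamic_smem_bytes);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int active_warps = max_blocks_per_sm * (block_size / 32);
    int max_warps = prop.maxThreadsPerMultiProcessor / 32;
    printf("occupancy: %d / %d warps per SM\n", active_warps, max_warps);
}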
6.2 Resource Limits
Each SM has finite resources that limit CTA residency:
- Registers per SM: 64K 32-bit registers (256 KB) on recent architectures
- Shared memory per SM: 48KB-228KB (configurable L1/shared split)
- Max warps per SM: 32-64 depending on architecture
- Max CTAs per SM: 16-32 depending on architecture
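To see where a particular kernel sits against these limits, query its compiled resource usage at runtime (or pass -Xptxas -v to nvcc). A short sketch, reusing the my_kernel placeholder from the previous example:
#include <cuda_runtime.h>
#include <cstdio>
void report_kernel_resources() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, my_kernel);
    printf("registers per thread: %d\n", attr.numRegs);
    printf("static shared memory: %zu bytes\n", attr.sharedSizeBytes);
    printf("local memory (spills): %zu bytes\n", attr.localSizeBytes);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
}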
6.3 MPS & MIG: Multi-Tenancy
MPS (Multi-Process Service):
- Software multiplexing of GPU resources
- Lower submission overhead
- Process isolation with Volta+ hardware assist
MIG (Multi-Instance GPU):
- Hardware partitioning into up to 7 isolated instances
- Each instance gets dedicated HBM slice, cache, and compute
- Full QoS isolation for cloud/multi-tenant scenarios
7. Scaling Out: NVLink & Modern Interconnects
7.1 NVLink Evolution
5th Generation (Blackwell):
- Scale: Up to 576 GPUs in single NVLink fabric
- Bandwidth: ~130 TB/s in 72-GPU NVL72 domain
- Per-GPU: ~1.8 TB/s interconnect bandwidth
7.2 Grace-Blackwell Integration
NVLink-C2C (Chip-to-Chip):
- Bandwidth: ~900 GB/s bidirectional CPU-GPU
- Coherency: Unified memory space between Grace CPU and Blackwell GPU
- Use cases: Large model inference, tight CPU-GPU coupling
7.3 GB200 NVL72: Rack-Scale Computing
- Architecture: 36 Grace CPUs + 72 Blackwell GPUs in a single NVLink domain
- Performance: ~30× faster real-time inference for trillion-parameter LLMs vs 8-GPU nodes
- Programming model: Acts like "one big GPU" with coherent memory
8. Specialized Engines & Datatypes
8.1 Tensor Cores Evolution
Each generation brings new capabilities:
- Volta: FP16 matrix operations
- Ampere: BF16, TF32, structured sparsity (2:4)
- Hopper: FP8, Transformer Engine with dynamic precision
- Blackwell: FP4, even more aggressive mixed precision
// Hopper Transformer Engine (illustrative sketch - in practice TE is used through
// its PyTorch/JAX integration rather than raw C calls like the one below)
#include <transformer_engine/transformer_engine.h>
// Idea: FP8 inputs with per-tensor scale factors, higher-precision output,
// with precision chosen dynamically from tensor statistics
te_gemm(A_fp8, B_fp8, C_fp16,          // placeholder call, not the exact TE API
        scale_A, scale_B,              // per-tensor scaling factors
        TE_DType_kFloat8_E4M3,         // FP8 (E4M3) inputs
        TE_DType_kFloat16);            // FP16 output
8.2 Specialized Accelerators
DPX Instructions (Hopper):
- Dynamic programming acceleration
- Genomics, graph algorithms
- Custom instruction set for DP recurrence relations
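The kind of inner loop DPX targets is an add-then-max recurrence, as in Smith-Waterman-style alignment scoring. A plain CUDA sketch of that recurrence (helper name is illustrative; on Hopper the fused add/max/clamp steps map onto DPX instructions):
// Score recurrence: H[i][j] = max(0, diag + sub, up - gap, left - gap)
__device__ int sw_cell(int diag, int up, int left, int sub, int gap) {
    int best = diag + sub;            // match/mismatch score
    best = max(best, up - gap);       // gap in one sequence
    best = max(best, left - gap);     // gap in the other sequence
    return max(best, 0);              // local alignment clamps at zero
}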
Confidential Computing:
- TEE (Trusted Execution Environment) support
- Memory encryption and attestation
- Secure multi-tenant AI workloads
9. Performance Methodology
9.1 Roofline Analysis First
// Compute roof: peak throughput of the unit you target (FLOP/s)
float compute_roof = peak_tensor_ops_per_sec;
// Memory roof: arithmetic intensity (FLOP/byte) x peak bandwidth (byte/s) = FLOP/s
float memory_roof = arithmetic_intensity * peak_bandwidth;
// Attainable performance is bounded by whichever roof is lower
float perf_bound = fminf(compute_roof, memory_roof);
Key insight: Identify if you're compute-bound or memory-bound, then optimize accordingly.
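A quick worked example with made-up but representative peaks (60 TFLOP/s compute, 2 TB/s memory bandwidth):
#include <math.h>
// Hypothetical machine: 60 TFLOP/s compute peak, 2 TB/s memory bandwidth
float attainable(float ai_flop_per_byte) {
    const float peak_flops = 60e12f;
    const float peak_bw    = 2e12f;
    return fminf(peak_flops, ai_flop_per_byte * peak_bw);
}
// attainable(10.0f) = 20 TFLOP/s -> memory-bound
// attainable(50.0f) = 60 TFLOP/s -> compute-bound (hits the compute roof)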
9.2 Traffic Shaping & Tiling
// Bad: naive triple loop - each element of A and B is re-read many times from memory
for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < K; k++) {
            C[i * N + j] += A[i * K + k] * B[k * N + j]; // Poor cache reuse
        }
    }
}
// Good: tiled so each block of A and B is loaded once and reused from fast memory
for (int ii = 0; ii < M; ii += TILE_M) {
    for (int jj = 0; jj < N; jj += TILE_N) {
        for (int kk = 0; kk < K; kk += TILE_K) {
            // gemm_tile (placeholder) computes a TILE_M x TILE_N block of C
            // from the corresponding tiles of A and B held in shared memory/registers
            gemm_tile(A + ii * K + kk, B + kk * N + jj, C + ii * N + jj);
        }
    }
}
9.3 Occupancy vs Resources
Strategy: Increase active warps until you hide memory latency without causing register spills or shared memory pressure.
9.4 Multi-GPU Considerations
- Topology awareness: Minimize NVLink hops
- Communication patterns: Optimize all-reduce, all-gather
- Load balancing: Account for heterogeneous performance
- Memory placement: NUMA-aware allocation
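On the communication side, a minimal single-process NCCL all-reduce across the local GPUs looks roughly like this sketch (error handling omitted, at most 8 local GPUs assumed, per-device send/receive buffers allocated by the caller):
#include <nccl.h>
#include <cuda_runtime.h>
// All-reduce one float buffer per GPU across ndev GPUs in a single process
void allreduce_example(float **sendbuf, float **recvbuf, size_t count, int ndev) {
    ncclComm_t comms[8];
    cudaStream_t streams[8];
    int devs[8];
    for (int i = 0; i < ndev; i++) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
    }
    ncclCommInitAll(comms, ndev, devs);   // One communicator per local GPU
    ncclGroupStart();                     // Batch the per-GPU calls
    for (int i = 0; i < ndev; i++) {
        ncclAllReduce(sendbuf[i], recvbuf[i], count,
                      ncclFloat, ncclSum, comms[i], streams[i]);
    }
    ncclGroupEnd();
    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
}
NCCL chooses ring or tree schedules based on the NVLink/PCIe topology it detects, which is one reason topology-aware placement matters.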
10. Common Pitfalls & Debugging
10.1 Silent Warp-Sync Bugs (Volta+)
# Use Compute Sanitizer to detect races
$ compute-sanitizer --tool racecheck ./my_program
$ compute-sanitizer --tool synccheck ./my_program
10.2 Memory Access Patterns
# Debug coalescing issues with Nsight Compute
$ ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum ./my_program
# Compare the measured sector count to the theoretical minimum
10.3 Resource Utilization
# Check occupancy and shared-memory bank conflicts
$ ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./my_program
$ ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum ./my_program
11. Interview Deep Dives
12. Key Takeaways
- SIMT execution model: Understand divergence costs and Volta+ independent scheduling
- Memory hierarchy mastery: Coalescing and banking are make-or-break optimizations
- Resource balance: Occupancy is a means, not an end - optimize for useful work
- Modern features: Leverage TMA, clusters, and tensor cores for cutting-edge performance
- Scale-out awareness: NVLink topology and multi-GPU programming are increasingly critical
The GPU architecture landscape is evolving rapidly. Master these fundamentals, then stay current with each generation's innovations. The future belongs to those who can think in parallel! 🚀