Advanced GPU Architecture & Performance
Modern GPU microarchitecture, SIMT execution model, and performance optimization for senior-level design.
Modern GPUs are throughput-oriented many-core processors that have revolutionized parallel computing. But here's what separates the pros from the amateurs: understanding the execution model, memory hierarchy, and performance methodology well enough to design, analyze, and debug at scale.
Let's dive deep into the architecture that powers everything from gaming to AI training. 🚀
1. The Big Picture: GPU Design Philosophy
A modern GPU trades single-thread latency for massive data-parallel throughput:
- SIMT execution: Groups of threads (warps) execute one instruction stream in lockstep
- Enormous in-flight concurrency to hide long memory latencies
- Software-visible hierarchy: Grid → thread blocks → warps → threads
- Hardware hierarchy: GPCs → TPCs → SMs → execution lanes
- Deep memory hierarchy: Registers → shared/L1 → L2 → HBM/GDDR
- High-bandwidth interconnects: Within GPU and across GPUs (NVLink)
2. Programming & Execution Model
2.1 CUDA-Style Mental Model
The GPU execution model creates a hierarchy of parallelism:
- Kernel launch: Grid of thread blocks (CTAs)
- Thread block placement: Each CTA assigned to an SM; multiple CTAs can be resident if resources allow
- Warp execution: Warps of 32 threads issue instructions together; since Volta, Independent Thread Scheduling gives each thread its own program counter
- Divergence handling: Hardware tracks masks/stacks, reconverging at post-dominator
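To make this concrete, here is a minimal vector-add launch (kernel and variable names are illustrative; a, b, c are device pointers): the launch creates a grid of CTAs, each CTA lands on one SM, and its threads execute in warps of 32.
// Each thread computes one element; the grid covers all N elements
__global__ void add_kernel(const float *a, const float *b, float *c, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // Global thread index
    if (i < N) c[i] = a[i] + b[i];
}
// Host side: a grid of CTAs (thread blocks), each with 256 threads = 8 warps
void launch_add(const float *a, const float *b, float *c, int N) {
    int block = 256;
    int grid  = (N + block - 1) / block;
    add_kernel<<<grid, block>>>(a, b, c, N);
}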
2.2 Critical Volta+ Changes: Independent Thread Scheduling
Pre-Volta: Warps executed in perfect lockstep - warp-synchronous programming was "safe"
Volta+: Each thread has its own program counter, enabling:
- Finer-grained reconvergence after divergence
- Better performance for irregular control flow
- Breaking change: Warp-synchronous code now has race conditions!
// Pre-Volta: this "worked" because the warp ran in lockstep (still undefined behavior)
__global__ void unsafe_warp_sync(int *out) {
    int lane = threadIdx.x % 32;        // assumes a single warp per block for simplicity
    __shared__ int data[32];
    data[lane] = lane;
    // Relies on an implicit warp barrier - NO LONGER GUARANTEED on Volta+!
    out[threadIdx.x] = data[31 - lane]; // Race condition on Volta+
}
// Volta+: must use explicit synchronization
__global__ void safe_warp_sync(int *out) {
    int lane = threadIdx.x % 32;        // assumes a single warp per block for simplicity
    __shared__ int data[32];
    data[lane] = lane;
    __syncwarp();                       // Explicit warp barrier: all writes visible before the read
    out[threadIdx.x] = data[31 - lane]; // Now safe
}
2.3 Divergence & Reconvergence
When threads in a warp take different execution paths:
- Hardware tracks active mask for each path
- Serializes execution of divergent paths
- Reconverges at the immediate post-dominator
- Cost is proportional to path imbalance
// Bad: Maximum divergence
if (threadIdx.x % 2 == 0) {
// 50% of threads execute this
expensive_computation_A();
} else {
// 50% of threads execute this
expensive_computation_B();
}
// Total time = time_A + time_B
// Better: Minimize divergence
int warp_id = threadIdx.x / 32;
if (warp_id % 2 == 0) {
// Entire warps take same path
expensive_computation_A();
} else {
expensive_computation_B();
}
3. SM (Streaming Multiprocessor) Microarchitecture
The SM is where the magic happens. Let's break down the key components:
3.1 Front-End: Instruction Fetch & Scheduling
- Warp schedulers: Each SM is split into partitions (typically four), each with its own scheduler that fetches and issues instructions for the warps resident on it
- Scoreboard tracking: Monitors register dependencies and outstanding memory operations before a warp is marked ready
- Issue logic: Each scheduler selects a ready warp per cycle, so an SM can issue from several warps simultaneously
3.2 Execution Units
Modern SMs contain heterogeneous execution resources:
- FP32/INT32 ALUs: Basic arithmetic and logic operations
- FP64 ALUs: Double-precision (typically fewer units)
- Tensor Cores: Matrix multiply-accumulate for AI workloads
- SFUs (Special Function Units): Transcendentals, interpolation
- Load/Store Units: Memory operations with coalescing logic
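For a taste of how the Tensor Core path is programmed, here is a minimal sketch using the public nvcuda::wmma API: one warp multiplies a single 16x16x16 FP16 tile with FP32 accumulation (pointers are assumed to be device memory in the stated layouts).
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;
// One warp multiplies a 16x16 row-major A tile by a 16x16 col-major B tile
__global__ void wmma_tile(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);              // C = 0
    wmma::load_matrix_sync(a_frag, A, 16);          // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // C += A * B on Tensor Cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}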
3.3 Register File & Operand Collection
- Massive register file: 64K+ 32-bit registers per SM
- Operand collectors: Gather operands from register file
- Bank conflicts: Register file is banked; avoid conflicts with careful allocation
4. Hopper+ Advanced Features
4.1 Thread Block Clusters (TBC)
Innovation: Multiple CTAs scheduled concurrently across nearby SMs with hardware coordination:
// Hopper: cluster-level programming via the cooperative groups cluster API
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel() {
    extern __shared__ int shared_data[];   // dynamic shared memory, sized at launch
    cg::cluster_group cluster = cg::this_cluster();
    shared_data[threadIdx.x] = threadIdx.x;
    cluster.sync(); // Hardware barrier across all CTAs in the cluster
    // Distributed shared memory: map a pointer into a neighboring CTA's shared memory
    unsigned int target_rank = (cluster.block_rank() + 1) % cluster.num_blocks();
    int *remote_shared = cluster.map_shared_rank(shared_data, target_rank);
    int value = remote_shared[threadIdx.x]; // Direct load from another CTA's shared mem
    (void)value;
    cluster.sync(); // Keep remote shared memory alive until every CTA is done reading
}
4.2 Tensor Memory Accelerator (TMA)
Game changer: Asynchronous tensor-shaped copies with hardware acceleration:
// TMA: overlap data movement with compute (illustrative pseudocode; the real
// mechanism is exposed via cp.async.bulk.tensor with tensor maps created by
// cuTensorMapEncodeTiled, or through cuda::memcpy_async in libcu++)
__global__ void tma_gemm(/* tensor-map descriptors, tile coordinates, ... */) {
    __shared__ float A_shared[TILE_M][TILE_K];
    __shared__ float B_shared[TILE_K][TILE_N];
    // Kick off asynchronous copies of the NEXT tile while computing the current one
    tma_load_async(A_shared, global_A_ptr, tile_coords); // placeholder call
    tma_load_async(B_shared, global_B_ptr, tile_coords); // placeholder call
    // Compute on the tile fetched in the previous iteration
    compute_tile(prev_A, prev_B, C_accum);
    // Wait for the async copies before the next iteration touches the new tile
    tma_wait_group(0); // placeholder call
}
4.3 Distributed Shared Memory (DSMEM)
Capability: Direct loads/stores/atomics to another CTA's shared memory within a cluster:
- Use case: Reduce global memory traffic for inter-CTA communication
- Performance: Much faster than global memory roundtrip
- Programming: Explicit rank mapping and synchronization required
5. Memory Hierarchy Deep Dive
5.1 The Complete Hierarchy
Registers (per-thread) → ~1 cycle latency, highest bandwidth
↓
Shared Memory/L1 (per-SM) → ~20-30 cycle latency, coalescing critical
↓
L2 Cache (unified) → ~200 cycle latency, cross-SM sharing
↓
HBM/GDDR (off-package) → ~400-800 cycle latency, massive bandwidth
5.2 Coalescing: The Make-or-Break Optimization
The rule: Warp memory accesses should be contiguous and aligned to minimize transactions.
// Bad: strided access - up to 32 separate memory transactions for one warp
__global__ void strided_bad(const float *input, float *output, int stride) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    output[tid] = input[tid * stride]; // Each thread touches a different cache line
}
// Good: coalesced access - consecutive threads read consecutive elements
__global__ void coalesced_good(const float *input, float *output) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    output[tid] = input[tid]; // One (or a few) transactions serve the whole warp
}
// Advanced: vectorized 128-bit accesses when the data is 16-byte aligned
__global__ void vectorized_best(const float *input, float *output) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float4 vals = reinterpret_cast<const float4 *>(input)[tid]; // One float4 load per thread
    reinterpret_cast<float4 *>(output)[tid] = vals;
}
5.3 Shared Memory Banking
Shared memory is divided into 32 banks (4-byte words). Simultaneous access to the same bank causes bank conflicts.
// Bad: 32-way bank conflict - every thread in the warp hits bank 0
__shared__ float tile_a[32][32];
float val_a = tile_a[threadIdx.x][0]; // Addresses differ by 32 words -> same bank
// Good: pad each row to 33 words so column accesses spread across all banks
__shared__ float tile_b[32][33];
float val_b = tile_b[threadIdx.x][threadIdx.y]; // Conflict-free
// Advanced: swizzle the column index instead of padding (unpadded 32x32 array)
int swizzled_idx = (threadIdx.x + threadIdx.y) % 32;
float val_c = tile_a[threadIdx.x][swizzled_idx]; // Also conflict-free
5.4 Virtual Memory & TLBs
GPUs have full virtual memory with hardware page table walkers:
- TLB reach: Critical for large working sets
- Page sizes: 4KB, 2MB, 1GB huge pages supported
- UVM (Unified Virtual Memory): Transparent CPU-GPU memory management
- Page fault handling: Can be expensive; prefer explicit memory management
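As a sketch of what "prefer explicit memory management" looks like under UVM (device 0 and the 1 GiB size are arbitrary choices for illustration), managed memory can be prefetched and given placement hints instead of relying on demand paging:
#include <cuda_runtime.h>
void uvm_prefetch_example() {
    float *buf = nullptr;
    size_t bytes = size_t(1) << 30;               // 1 GiB, arbitrary example size
    cudaMallocManaged(&buf, bytes);               // CPU+GPU visible managed allocation
    // Hints: pages should live on GPU 0; prefetch them before any kernel touches them
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetPreferredLocation, 0);
    cudaMemPrefetchAsync(buf, bytes, 0, 0);       // device 0, default stream
    // ... launch kernels that read/write buf ...
    cudaMemPrefetchAsync(buf, bytes, cudaCpuDeviceId, 0); // bring pages back for host access
    cudaDeviceSynchronize();
    cudaFree(buf);
}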
6. Concurrency, Residency, and Scheduling
6.1 Occupancy vs Performance
Occupancy = Active warps / Maximum possible warps per SM
// Calculate theoretical occupancy (all per-SM limits are architecture dependent)
int max_warps_per_sm = 64;
int warps_per_block = (block_size + 31) / 32;
int blocks_per_sm_regs = max_regs_per_sm / (regs_per_thread * block_size);
int blocks_per_sm_shared = max_shared_per_sm / shared_per_block;
int blocks_per_sm = min(blocks_per_sm_regs, blocks_per_sm_shared);
// Also clamp by the warp limit itself (and by the hardware max-CTAs-per-SM limit)
blocks_per_sm = min(blocks_per_sm, max_warps_per_sm / warps_per_block);
// Occupancy is a fraction - integer division here would always round down to 0
float occupancy = float(blocks_per_sm * warps_per_block) / max_warps_per_sm;
Key insight: High occupancy ≠ high performance. Optimize for useful work, not just occupancy.
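In practice you rarely do this arithmetic by hand; the CUDA runtime can report the residency limit for a given kernel and block size. A minimal sketch, with my_kernel as a placeholder:
#include <cuda_runtime.h>
#include <cstdio>
__global__ void my_kernel(float *data) { data[threadIdx.x] *= 2.0f; } // placeholder kernel
void report_occupancy(int block_size, size_t dynamic_smem_bytes) {
    int max_blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm, my_kernel, block_size, dynamic_smem_bytes);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int active_warps = max_blocks_per_sm * (block_size / 32);
    int max_warps = prop.maxThreadsPerMultiProcessor / 32;
    printf("occupancy: %d / %d warps per SM\n", active_warps, max_warps);
}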
6.2 Resource Limits
Each SM has finite resources that limit CTA residency:
- Registers per SM: 64K 32-bit registers (256 KB) on recent architectures
- Shared memory per SM: 48KB-228KB (configurable L1/shared split)
- Max warps per SM: 32-64 depending on architecture
- Max CTAs per SM: 16-32 depending on architecture
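To see where a particular kernel sits against these limits, query its compiled resource usage at runtime (or pass -Xptxas -v to nvcc). A short sketch, reusing the my_kernel placeholder from the previous example:
#include <cuda_runtime.h>
#include <cstdio>
void report_kernel_resources() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, my_kernel);
    printf("registers per thread: %d\n", attr.numRegs);
    printf("static shared memory: %zu bytes\n", attr.sharedSizeBytes);
    printf("local memory (spills): %zu bytes\n", attr.localSizeBytes);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
}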
6.3 MPS & MIG: Multi-Tenancy
MPS (Multi-Process Service):
- Software multiplexing of GPU resources
- Lower submission overhead
- Process isolation with Volta+ hardware assist
MIG (Multi-Instance GPU):
- Hardware partitioning into up to 7 isolated instances
- Each instance gets dedicated HBM slice, cache, and compute
- Full QoS isolation for cloud/multi-tenant scenarios
7. Scaling Out: NVLink & Modern Interconnects
7.1 NVLink Evolution
5th Generation (Blackwell):
- Scale: Up to 576 GPUs in single NVLink fabric
- Bandwidth: ~130 TB/s in 72-GPU NVL72 domain
- Per-GPU: ~1.8 TB/s interconnect bandwidth
7.2 Grace-Blackwell Integration
NVLink-C2C (Chip-to-Chip):
- Bandwidth: ~900 GB/s bidirectional CPU-GPU
- Coherency: Unified memory space between Grace CPU and Blackwell GPU
- Use cases: Large model inference, tight CPU-GPU coupling
7.3 GB200 NVL72: Rack-Scale Computing
- Architecture: 36 Grace CPUs + 72 Blackwell GPUs in a single NVLink domain
- Performance: ~30× faster real-time inference for trillion-parameter LLMs vs 8-GPU nodes
- Programming model: Acts like "one big GPU" with coherent memory
8. Specialized Engines & Datatypes
8.1 Tensor Cores Evolution
Each generation brings new capabilities:
- Volta: FP16 matrix operations
- Ampere: BF16, TF32, structured sparsity (2:4)
- Hopper: FP8, Transformer Engine with dynamic precision
- Blackwell: FP4, even more aggressive mixed precision
// Hopper Transformer Engine (illustrative sketch - in practice TE is used through
// its PyTorch/JAX integration rather than raw C calls like the one below)
#include <transformer_engine/transformer_engine.h>
// Idea: FP8 inputs with per-tensor scale factors, higher-precision output,
// with precision chosen dynamically from tensor statistics
te_gemm(A_fp8, B_fp8, C_fp16,          // placeholder call, not the exact TE API
        scale_A, scale_B,              // per-tensor scaling factors
        TE_DType_kFloat8_E4M3,         // FP8 (E4M3) inputs
        TE_DType_kFloat16);            // FP16 output
8.2 Specialized Accelerators
DPX Instructions (Hopper):
- Dynamic programming acceleration
- Genomics, graph algorithms
- Custom instruction set for DP recurrence relations
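The kind of inner loop DPX targets is an add-then-max recurrence, as in Smith-Waterman-style alignment scoring. A plain CUDA sketch of that recurrence (helper name is illustrative; on Hopper the fused add/max/clamp steps map onto DPX instructions):
// Score recurrence: H[i][j] = max(0, diag + sub, up - gap, left - gap)
__device__ int sw_cell(int diag, int up, int left, int sub, int gap) {
    int best = diag + sub;            // match/mismatch score
    best = max(best, up - gap);       // gap in one sequence
    best = max(best, left - gap);     // gap in the other sequence
    return max(best, 0);              // local alignment clamps at zero
}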
Confidential Computing:
- TEE (Trusted Execution Environment) support
- Memory encryption and attestation
- Secure multi-tenant AI workloads
9. Performance Methodology
9.1 Roofline Analysis First
// Compute roof: peak throughput of the unit you target (FLOP/s)
float compute_roof = peak_tensor_ops_per_sec;
// Memory roof: arithmetic intensity (FLOP/byte) x peak bandwidth (byte/s) = FLOP/s
float memory_roof = arithmetic_intensity * peak_bandwidth;
// Attainable performance is bounded by whichever roof is lower
float perf_bound = fminf(compute_roof, memory_roof);
Key insight: Identify if you're compute-bound or memory-bound, then optimize accordingly.
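A quick worked example with made-up but representative peaks (60 TFLOP/s compute, 2 TB/s memory bandwidth):
#include <math.h>
// Hypothetical machine: 60 TFLOP/s compute peak, 2 TB/s memory bandwidth
float attainable(float ai_flop_per_byte) {
    const float peak_flops = 60e12f;
    const float peak_bw    = 2e12f;
    return fminf(peak_flops, ai_flop_per_byte * peak_bw);
}
// attainable(10.0f) = 20 TFLOP/s -> memory-bound
// attainable(50.0f) = 60 TFLOP/s -> compute-bound (hits the compute roof)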
9.2 Traffic Shaping & Tiling
// Bad: naive triple loop - each element of A and B is re-read many times from memory
for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < K; k++) {
            C[i * N + j] += A[i * K + k] * B[k * N + j]; // Poor cache reuse
        }
    }
}
// Good: tiled so each block of A and B is loaded once and reused from fast memory
for (int ii = 0; ii < M; ii += TILE_M) {
    for (int jj = 0; jj < N; jj += TILE_N) {
        for (int kk = 0; kk < K; kk += TILE_K) {
            // gemm_tile (placeholder) computes a TILE_M x TILE_N block of C
            // from the corresponding tiles of A and B held in shared memory/registers
            gemm_tile(A + ii * K + kk, B + kk * N + jj, C + ii * N + jj);
        }
    }
}
9.3 Occupancy vs Resources
Strategy: Increase active warps until you hide memory latency without causing register spills or shared memory pressure.
9.4 Multi-GPU Considerations
- Topology awareness: Minimize NVLink hops
- Communication patterns: Optimize all-reduce, all-gather
- Load balancing: Account for heterogeneous performance
- Memory placement: NUMA-aware allocation
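On the communication side, a minimal single-process NCCL all-reduce across the local GPUs looks roughly like this sketch (error handling omitted, at most 8 local GPUs assumed, per-device send/receive buffers allocated by the caller):
#include <nccl.h>
#include <cuda_runtime.h>
// All-reduce one float buffer per GPU across ndev GPUs in a single process
void allreduce_example(float **sendbuf, float **recvbuf, size_t count, int ndev) {
    ncclComm_t comms[8];
    cudaStream_t streams[8];
    int devs[8];
    for (int i = 0; i < ndev; i++) {
        devs[i] = i;
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
    }
    ncclCommInitAll(comms, ndev, devs);   // One communicator per local GPU
    ncclGroupStart();                     // Batch the per-GPU calls
    for (int i = 0; i < ndev; i++) {
        ncclAllReduce(sendbuf[i], recvbuf[i], count,
                      ncclFloat, ncclSum, comms[i], streams[i]);
    }
    ncclGroupEnd();
    for (int i = 0; i < ndev; i++) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        ncclCommDestroy(comms[i]);
    }
}
NCCL chooses ring or tree schedules based on the NVLink/PCIe topology it detects, which is one reason topology-aware placement matters.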
10. Common Pitfalls & Debugging
10.1 Silent Warp-Sync Bugs (Volta+)
# Use Compute Sanitizer to detect races
$ compute-sanitizer --tool racecheck ./my_program
$ compute-sanitizer --tool synccheck ./my_program
10.2 Memory Access Patterns
# Debug coalescing issues with Nsight Compute
$ ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum ./my_program
# Compare the measured sector count to the theoretical minimum
10.3 Resource Utilization
# Check occupancy and shared-memory bank conflicts
$ ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./my_program
$ ncu --metrics l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum ./my_program
11. Interview Deep Dives
12. Key Takeaways
- SIMT execution model: Understand divergence costs and Volta+ independent scheduling
- Memory hierarchy mastery: Coalescing and banking are make-or-break optimizations
- Resource balance: Occupancy is a means, not an end - optimize for useful work
- Modern features: Leverage TMA, clusters, and tensor cores for cutting-edge performance
- Scale-out awareness: NVLink topology and multi-GPU programming are increasingly critical
The GPU architecture landscape is evolving rapidly. Master these fundamentals, then stay current with each generation's innovations. The future belongs to those who can think in parallel! 🚀