GPU Architecture Fundamentals
Deep dive into graphics processing unit design, SIMD execution, and parallel computing architectures.
Graphics Processing Units have evolved from specialized graphics accelerators to general-purpose parallel computing powerhouses. Understanding their architecture is crucial for modern system design.
CPU vs GPU Philosophy
The fundamental difference lies in design philosophy:
CPU: Optimized for sequential performance
- Few cores (4-32)
- Large caches
- Complex control logic
- Branch prediction
- Out-of-order execution
GPU: Optimized for throughput
- Thousands of cores
- Small caches per core
- Simple control logic
- Massive parallelism
- In-order execution
SIMD vs SIMT
SIMD (Single Instruction, Multiple Data)
Traditional vector processors execute the same instruction on multiple data elements simultaneously.
SIMT (Single Instruction, Multiple Threads)
A GPU refinement of SIMD: the hardware issues one instruction across a group of threads, but each thread keeps its own registers and program state, so threads can take different branches (at a cost, covered under branch divergence below).
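To make the contrast concrete, here is a hedged sketch (not from the original text): the same elementwise add written with x86 AVX intrinsics, where one thread issues an explicit 8-wide vector instruction, and as a CUDA kernel, where each thread runs scalar code and the hardware groups 32 threads into a warp sharing one instruction stream.
#include <immintrin.h>

// SIMD: one thread explicitly issues an 8-wide vector instruction.
void addSIMD(const float* a, const float* b, float* c) {
    __m256 va = _mm256_loadu_ps(a);   // load 8 floats at once
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(c, _mm256_add_ps(va, vb));
}

// SIMT: each thread runs scalar code on its own index; the hardware
// groups 32 threads into a warp that shares one instruction stream.
// Assumes the launch grid exactly covers the array.
__global__ void addSIMT(const float* a, const float* b, float* c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}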
GPU Memory Hierarchy
Modern GPUs have a complex memory system optimized for bandwidth:
Global Memory
- Largest capacity (GBs)
- Highest latency (400-800 cycles)
- Accessible by all threads
- Coalesced access patterns are crucial (illustrated below)
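A hedged illustration of the coalescing point above: in the first kernel, consecutive threads read consecutive addresses, so a warp's 32 loads merge into a few wide transactions; in the second, the stride scatters them across many separate transactions. Both kernels assume the grid exactly covers the input.
// Coalesced: thread i reads element i, so one warp touches one
// contiguous 128-byte region.
__global__ void coalesced(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: neighboring threads read addresses 'stride' elements apart,
// so a warp's accesses fan out across many memory transactions.
__global__ void strided(const float* in, float* out, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];
}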
Shared Memory
- Fast, low-latency (1-32 cycles)
- Shared within thread block
- Explicitly managed by programmer
- Bank conflicts can hurt performance (see the transpose sketch below)
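A sketch of explicitly managed shared memory, assuming a 32×32 thread block and a matrix dimension n that is a multiple of 32; the +1 padding column is a standard trick to avoid 32-way bank conflicts on column reads:
#define TILE 32

__global__ void transposeTile(const float* in, float* out, int n) {
    // The +1 padding shifts each row by one bank, so a warp reading a
    // column of the tile hits 32 different banks instead of one.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();  // all loads complete before any thread reads

    x = blockIdx.y * TILE + threadIdx.x;  // transposed block origin
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}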
Registers
- Fastest access
- Private to each thread
- Limited quantity affects occupancy
Constant Memory
- Read-only
- Cached for broadcast patterns
- Good for parameters accessed by all threads (example below)
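A minimal sketch of the broadcast pattern, using a hypothetical 9-tap filter: in each loop iteration every thread reads the same coeffs[k], which the constant cache serves as a single broadcast.
__constant__ float coeffs[9];  // hypothetical filter weights

__global__ void filter9(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 4 && i < n - 4) {
        float acc = 0.0f;
        for (int k = 0; k < 9; ++k)
            acc += coeffs[k] * in[i + k - 4];  // same address for all
        out[i] = acc;                          // threads: broadcast hit
    }
}

// Host side, before launch:
// cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(coeffs));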
Warp Scheduling
Modern GPUs group threads into fixed-size execution units: typically 32 threads per warp on NVIDIA hardware, or 64 per wavefront on AMD's GCN architecture:
Warp Execution
- All threads in a warp execute the same instruction
- Divergent branches cause serialization
- Inactive threads are masked out (see the warp-level sketch below)
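A hedged sketch of warp-level execution in practice: a 32-lane sum using CUDA's __shfl_down_sync, which moves register values between lanes of the same warp without touching shared memory; the 0xffffffff mask declares that all 32 lanes participate.
// Each step halves the number of lanes carrying live partial sums;
// after five steps, lane 0 holds the total for the warp.
__device__ float warpSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}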
Occupancy
The ratio of active warps to maximum possible warps per SM:
- Higher occupancy → better latency hiding
- Limited by registers, shared memory, thread blocks
- Sweet spot often 50-75%, not always 100% (see the query sketch below)
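As a hedged sketch, the CUDA runtime can report an occupancy-maximizing block size for a given kernel's register and shared-memory footprint (dummyKernel here is a stand-in):
#include <cstdio>

__global__ void dummyKernel(float* data) { }

int main() {
    int minGridSize = 0, blockSize = 0;
    // Returns the block size that maximizes occupancy for this kernel,
    // given its register and shared-memory usage.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       dummyKernel, 0, 0);
    printf("suggested block size: %d\n", blockSize);
    return 0;
}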
Branch Divergence
When threads in a warp take different execution paths:
if (threadIdx.x < 16) {
    // Path A - threads 0-15
    result = expensive_computation_A();
} else {
    // Path B - threads 16-31
    result = expensive_computation_B();
}
Both paths execute serially, reducing effective parallelism by 50%.
Modern GPU Features
Tensor Cores
- Specialized for AI workloads
- Mixed-precision matrix operations
- 4×4×4 matrix multiply-accumulate
- Massive speedup for deep learning (see the WMMA sketch below)
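As a hedged sketch, NVIDIA exposes these units through the WMMA API in mma.h (requires a tensor-core-capable GPU, sm_70 or newer); one warp cooperatively multiplies a 16×16 half-precision tile pair and accumulates in float. The pointers are assumed to address complete, properly aligned 16×16 tiles.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes C = A * B + C on a 16x16x16 tile; the hardware
// decomposes the tile into smaller multiply-accumulate operations.
__global__ void tileMma(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, 16);  // 16 = leading dimension
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}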
Matrix Acceleration Engines (e.g., AMD Matrix Cores, Intel XMX)
- Optimized for AI/ML workloads
- Support for various data precisions
- Hardware-accelerated matrix operations
Ray Tracing Cores
- Hardware-accelerated ray-triangle intersection
- Bounding volume hierarchy traversal
- Real-time ray tracing capabilities
Programming Models
CUDA Programming Model
__global__ void vectorAdd(float* A, float* B, float* C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}
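A minimal host-side launch for the kernel above, assuming A, B, and C were already allocated with cudaMalloc and the inputs copied to the device:
int N = 1 << 20;
int threads = 256;
int blocks = (N + threads - 1) / threads;  // round up to cover all N
vectorAdd<<<blocks, threads>>>(A, B, C, N);
cudaDeviceSynchronize();  // block until the kernel completes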
OpenCL (Cross-platform)
__kernel void vectorAdd(__global float* A, __global float* B,
                        __global float* C, int N) {
    int i = get_global_id(0);
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}
Performance Optimization
Key strategies for GPU performance:
- Maximize Occupancy: Balance registers and shared memory usage
- Coalesced Memory Access: Have consecutive threads access consecutive, aligned addresses
- Minimize Divergence: Structure algorithms to avoid branch divergence
- Hide Latency: Use enough threads to cover memory latency
- Optimize Memory Hierarchy: Use shared memory for data reuse
Key Takeaways
- GPUs trade single-thread performance for massive parallelism
- SIMT execution enables flexible parallel programming
- Memory bandwidth is often the limiting factor
- Warp-level thinking is crucial for optimization
- Modern GPUs are increasingly specialized for AI/ML workloads