GPU Architecture Fundamentals
Deep dive into graphics processing unit design, SIMD execution, and parallel computing architectures.
Graphics Processing Units (GPUs) have evolved from specialized graphics accelerators into general-purpose parallel computing powerhouses. Understanding their architecture is crucial for modern system design.
1. CPU vs GPU Philosophy
The fundamental difference lies in design philosophy:
CPU: Optimized for sequential performance
- Few cores (4-32)
- Large caches
- Complex control logic
- Branch prediction
- Out-of-order execution
GPU: Optimized for throughput
- Thousands of cores
- Small caches per core
- Simple control logic
- Massive parallelism
- In-order execution
2. SIMD vs SIMT
2.1 SIMD (Single Instruction, Multiple Data)
Traditional vector processors execute the same instruction on multiple data elements simultaneously.
2.2 SIMT (Single Instruction, Multiple Threads)
Modern GPU innovation: execute the same instruction across multiple threads, but each thread has its own program counter and registers.
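The difference is easiest to see in code. Below is a minimal, illustrative CUDA kernel (the kernel name and data layout are assumptions for the example): every thread runs the same instruction stream, but each keeps its own index and registers and may branch on its own data, which is what distinguishes SIMT from lane-locked SIMD.
__global__ void simt_abs(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // per-thread index, held in a register
    if (i < n) {
        // Each thread may take its own path; the hardware masks inactive lanes
        out[i] = (in[i] >= 0.0f) ? in[i] : -in[i];
    }
}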
3. GPU Memory Hierarchy
Modern GPUs have a complex memory system optimized for bandwidth:
3.1 Global Memory
- Largest capacity (GBs)
- Highest latency (400-800 cycles)
- Accessible by all threads
- Coalesced access patterns are crucial for effective bandwidth (see the sketch below)
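As a rough sketch (the kernel names and stride parameter are illustrative), the two kernels below do the same per-element work, but in the first one consecutive threads touch consecutive addresses, so each warp's loads coalesce into a few memory transactions, while the second scatters them across far-apart addresses:
__global__ void coalesced_read(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                      // neighbouring threads, neighbouring addresses
}

__global__ void strided_read(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];    // far-apart addresses, many transactions
}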
3.2 Shared Memory
- Fast, low-latency (1-32 cycles)
- Shared within thread block
- Explicitly managed by programmer
- Bank conflicts between threads of a warp can hurt performance (see the reduction sketch below)
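A typical use is staging data once in shared memory and reusing it within the block. The block-sum kernel below is a minimal sketch (it assumes blockDim.x is a power of two and uses dynamically sized shared memory); its sequential addressing also keeps the reduction free of bank conflicts.
__global__ void block_sum(const float* in, float* block_sums, int n) {
    extern __shared__ float tile[];                 // dynamically sized shared memory
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;             // stage one element per thread
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction within the block
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = tile[0];
}
It would be launched as block_sum<<<grid, block, block * sizeof(float)>>>(in, block_sums, n), with the third launch parameter sizing the shared-memory tile.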
3.3 Registers
- Fastest access
- Private to each thread
- Limited quantity affects occupancy
3.4 Constant Memory
- Read-only
- Cached for broadcast patterns
- Good for parameters read by all threads (see the sketch below)
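A minimal sketch of the broadcast pattern (the coefficient table and kernel are hypothetical): when every thread in a warp reads the same constant-memory address, the constant cache serves it as a single broadcast.
__constant__ float coeffs[16];                       // small, read-only parameter table

__global__ void apply_gain(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * coeffs[0];           // same address for all threads -> broadcast
}

// Host side: cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(float) * 16);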
4. Warp Scheduling
Modern GPUs group threads into fixed-size execution units: 32-thread warps on NVIDIA hardware, or 32- and 64-thread wavefronts on AMD hardware:
4.1 Warp Execution
- All threads in a warp execute the same instruction
- Divergent branches cause serialization
- Inactive threads are masked out
4.2 Occupancy
The ratio of active warps to the maximum possible warps per streaming multiprocessor (SM):
- Higher occupancy → better latency hiding
- Limited by registers, shared memory, thread blocks
- Sweet spot is often 50-75%, not necessarily 100% (see the occupancy query sketch below)
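Theoretical occupancy can be estimated with the CUDA runtime occupancy API. The sketch below (the kernel and block size are illustrative) queries how many blocks of a given size fit on one SM and converts that into an occupancy figure; error checking is omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int blockSize = 256;
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, scale, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (float)(blocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
    printf("Theoretical occupancy: %.0f%%\n", occupancy * 100.0f);
    return 0;
}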
5. Branch Divergence
When threads in a warp take different execution paths:
if (threadIdx.x < 16) {
    // Path A - threads 0-15
    result = expensive_computation_A();
} else {
    // Path B - threads 16-31
    result = expensive_computation_B();
}
The warp executes both paths one after the other, masking off the inactive lanes in each, so effective parallelism drops by roughly 50%.
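One common mitigation, sketched below, is to branch at warp granularity so that all 32 lanes of a warp take the same path (the helper functions are the same hypothetical ones as above, given placeholder bodies to keep the sketch self-contained):
__device__ float expensive_computation_A() { return 1.0f; }   // placeholder body
__device__ float expensive_computation_B() { return 2.0f; }   // placeholder body

__global__ void warp_aligned_branch(float* result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = threadIdx.x / 32;          // identical for all threads in a warp
    if (i < n) {
        if (warp_id % 2 == 0) {
            result[i] = expensive_computation_A();   // whole warp takes path A
        } else {
            result[i] = expensive_computation_B();   // whole warp takes path B
        }
    }
}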
6. Modern GPU Features
6.1 Tensor Cores
- Specialized for AI workloads
- Mixed-precision matrix operations
- 4×4×4 matrix multiply-accumulate
- Massive speedups for deep learning (see the WMMA sketch below)
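On NVIDIA GPUs, tensor cores are exposed in CUDA through the warp-level WMMA API, which operates on 16×16×16 tiles built from the 4×4×4 hardware operation and requires a tensor-core-capable GPU (sm_70 or newer). The kernel below is a minimal single-tile sketch; the matrix layouts and leading dimension of 16 are assumptions for the example.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile of D = A * B + C on tensor cores
__global__ void wmma_tile(const half* A, const half* B, const float* C, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                        // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major);

    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);           // mixed-precision MMA

    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}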
6.2 Matrix Acceleration Engines
- Optimized for AI/ML workloads
- Support for various data precisions
- Hardware-accelerated matrix operations
6.3 Ray Tracing Cores
- Hardware-accelerated ray-triangle intersection
- Bounding volume hierarchy traversal
- Real-time ray tracing capabilities
7. Programming Models
7.1 CUDA Programming Model
__global__ void vectorAdd(float* A, float* B, float* C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}
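For completeness, a minimal host-side sketch for launching vectorAdd (the array size, block size of 256, and use of unified memory are illustrative choices; error checking is omitted):
#include <cuda_runtime.h>

int main() {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);           // unified memory keeps the sketch short
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    int block = 256;
    int grid = (N + block - 1) / block;     // enough blocks to cover all N elements
    vectorAdd<<<grid, block>>>(A, B, C, N);
    cudaDeviceSynchronize();

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}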
7.2 OpenCL (Cross-platform)
__kernel void vectorAdd(__global float* A, __global float* B,
                        __global float* C, int N) {
    int i = get_global_id(0);
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}
8. Performance Optimization
Key strategies for GPU performance:
- Maximize Occupancy: Balance registers and shared memory usage
- Coalesced Memory Access: Have consecutive threads access consecutive addresses so each warp's accesses coalesce into few transactions
- Minimize Divergence: Structure algorithms to avoid branch divergence
- Hide Latency: Use enough resident threads to cover memory latency (see the grid-stride sketch below)
- Optimize Memory Hierarchy: Use shared memory for data reuse
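Several of these strategies come together in the common grid-stride loop pattern, sketched below with an illustrative SAXPY kernel: a fixed-size grid covers any problem size, consecutive threads keep touching consecutive elements, and each SM retains enough resident threads to hide memory latency.
__global__ void saxpy_gridstride(int n, float a, const float* x, float* y) {
    int stride = blockDim.x * gridDim.x;    // total number of threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        y[i] = a * x[i] + y[i];             // consecutive threads touch consecutive elements
    }
}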
9. Key Takeaways
- GPUs trade single-thread performance for massive parallelism
- SIMT execution enables flexible parallel programming
- Memory bandwidth is often the limiting factor
- Warp-level thinking is crucial for optimization
- Modern GPUs are increasingly specialized for AI/ML workloads