GPU Architecture Fundamentals
Deep dive into graphics processing unit design, SIMD execution, and parallel computing architectures.
Graphics Processing Units have evolved from specialized graphics accelerators to general-purpose parallel computing powerhouses. Understanding their architecture is crucial for modern system design.
CPU vs GPU Philosophy
The fundamental difference lies in design philosophy:
CPU: Optimized for sequential performance
- Few cores (4-32)
- Large caches
- Complex control logic
- Branch prediction
- Out-of-order execution
GPU: Optimized for throughput
- Thousands of cores
- Small caches per core
- Simple control logic
- Massive parallelism
- In-order execution
SIMD vs SIMT
SIMD (Single Instruction, Multiple Data)
Traditional vector processors execute the same instruction on multiple data elements simultaneously.
SIMT (Single Instruction, Multiple Threads)
A GPU refinement of SIMD: the hardware issues one instruction across a group of threads, but each thread keeps its own registers and program state, so threads can take different branches (at a cost, covered under branch divergence below).
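To make the contrast concrete, here is a hedged sketch (not from the original text): the same elementwise add written with x86 AVX intrinsics, where one thread issues an explicit 8-wide vector instruction, and as a CUDA kernel, where each thread runs scalar code and the hardware groups 32 threads into a warp sharing one instruction stream.
#include <immintrin.h>

// SIMD: one thread explicitly issues an 8-wide vector instruction.
void addSIMD(const float* a, const float* b, float* c) {
    __m256 va = _mm256_loadu_ps(a);   // load 8 floats at once
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(c, _mm256_add_ps(va, vb));
}

// SIMT: each thread runs scalar code on its own index; the hardware
// groups 32 threads into a warp that shares one instruction stream.
// Assumes the launch grid exactly covers the array.
__global__ void addSIMT(const float* a, const float* b, float* c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}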
GPU Memory Hierarchy
Modern GPUs have a complex memory system optimized for bandwidth:
Global Memory
- Largest capacity (GBs)
- Highest latency (400-800 cycles)
- Accessible by all threads
- Coalesced access patterns are crucial (illustrated below)
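A hedged illustration of the coalescing point above: in the first kernel, consecutive threads read consecutive addresses, so a warp's 32 loads merge into a few wide transactions; in the second, the stride scatters them across many separate transactions. Both kernels assume the grid exactly covers the input.
// Coalesced: thread i reads element i, so one warp touches one
// contiguous 128-byte region.
__global__ void coalesced(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Strided: neighboring threads read addresses 'stride' elements apart,
// so a warp's accesses fan out across many memory transactions.
__global__ void strided(const float* in, float* out, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];
}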
Shared Memory
- Fast, low-latency (1-32 cycles)
- Shared within thread block
- Explicitly managed by programmer
- Bank conflicts can hurt performance (see the transpose sketch below)
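A sketch of explicitly managed shared memory, assuming a 32×32 thread block and a matrix dimension n that is a multiple of 32; the +1 padding column is a standard trick to avoid 32-way bank conflicts on column reads:
#define TILE 32

__global__ void transposeTile(const float* in, float* out, int n) {
    // The +1 padding shifts each row by one bank, so a warp reading a
    // column of the tile hits 32 different banks instead of one.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();  // all loads complete before any thread reads

    x = blockIdx.y * TILE + threadIdx.x;  // transposed block origin
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}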
Registers
- Fastest access
- Private to each thread
- Limited quantity affects occupancy
Constant Memory
- Read-only
- Cached for broadcast patterns
- Good for parameters accessed by all threads (example below)
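A minimal sketch of the broadcast pattern, using a hypothetical 9-tap filter: in each loop iteration every thread reads the same coeffs[k], which the constant cache serves as a single broadcast.
__constant__ float coeffs[9];  // hypothetical filter weights

__global__ void filter9(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 4 && i < n - 4) {
        float acc = 0.0f;
        for (int k = 0; k < 9; ++k)
            acc += coeffs[k] * in[i + k - 4];  // same address for all
        out[i] = acc;                          // threads: broadcast hit
    }
}

// Host side, before launch:
// cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(coeffs));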
Warp Scheduling
Modern GPUs group threads into fixed-size execution units: typically 32 threads per warp on NVIDIA hardware, or 64 per wavefront on AMD's GCN architecture:
Warp Execution
- All threads in a warp execute the same instruction
- Divergent branches cause serialization
- Inactive threads are masked out (see the warp-level sketch below)
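A hedged sketch of warp-level execution in practice: a 32-lane sum using CUDA's __shfl_down_sync, which moves register values between lanes of the same warp without touching shared memory; the 0xffffffff mask declares that all 32 lanes participate.
// Each step halves the number of lanes carrying live partial sums;
// after five steps, lane 0 holds the total for the warp.
__device__ float warpSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}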
Occupancy
The ratio of active warps to maximum possible warps per SM:
- Higher occupancy → better latency hiding
- Limited by registers, shared memory, thread blocks
- Sweet spot often 50-75%, not always 100% (see the query sketch below)
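As a hedged sketch, the CUDA runtime can report an occupancy-maximizing block size for a given kernel's register and shared-memory footprint (dummyKernel here is a stand-in):
#include <cstdio>

__global__ void dummyKernel(float* data) { }

int main() {
    int minGridSize = 0, blockSize = 0;
    // Returns the block size that maximizes occupancy for this kernel,
    // given its register and shared-memory usage.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       dummyKernel, 0, 0);
    printf("suggested block size: %d\n", blockSize);
    return 0;
}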
Branch Divergence
When threads in a warp take different execution paths:
if (threadIdx.x < 16) {
    // Path A - threads 0-15
    result = expensive_computation_A();
} else {
    // Path B - threads 16-31
    result = expensive_computation_B();
}
Both paths execute serially, reducing effective parallelism by 50%.
Modern GPU Features
Tensor Cores
- Specialized for AI workloads
- Mixed-precision matrix operations
- 4×4×4 matrix multiply-accumulate
- Massive speedup for deep learning (see the WMMA sketch below)
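As a hedged sketch, NVIDIA exposes these units through the WMMA API in mma.h (requires a tensor-core-capable GPU, sm_70 or newer); one warp cooperatively multiplies a 16×16 half-precision tile pair and accumulates in float. The pointers are assumed to address complete, properly aligned 16×16 tiles.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes C = A * B + C on a 16x16x16 tile; the hardware
// decomposes the tile into smaller multiply-accumulate operations.
__global__ void tileMma(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, 16);  // 16 = leading dimension
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}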
Matrix Acceleration Engines (e.g., AMD Matrix Cores, Intel XMX)
- Optimized for AI/ML workloads
- Support for various data precisions
- Hardware-accelerated matrix operations
Ray Tracing Cores
- Hardware-accelerated ray-triangle intersection
- Bounding volume hierarchy traversal
- Real-time ray tracing capabilities
Programming Models
CUDA Programming Model
__global__ void vectorAdd(float* A, float* B, float* C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}
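A minimal host-side launch for the kernel above, assuming A, B, and C were already allocated with cudaMalloc and the inputs copied to the device:
int N = 1 << 20;
int threads = 256;
int blocks = (N + threads - 1) / threads;  // round up to cover all N
vectorAdd<<<blocks, threads>>>(A, B, C, N);
cudaDeviceSynchronize();  // block until the kernel completes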
OpenCL (Cross-platform)
__kernel void vectorAdd(__global float* A, __global float* B,
                        __global float* C, int N) {
    int i = get_global_id(0);
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}
Performance Optimization
Key strategies for GPU performance:
- Maximize Occupancy: Balance registers and shared memory usage
- Coalesced Memory Access: Have consecutive threads access consecutive, aligned addresses
- Minimize Divergence: Structure algorithms to avoid branch divergence
- Hide Latency: Use enough threads to cover memory latency
- Optimize Memory Hierarchy: Use shared memory for data reuse
Key Takeaways
- GPUs trade single-thread performance for massive parallelism
- SIMT execution enables flexible parallel programming
- Memory bandwidth is often the limiting factor
- Warp-level thinking is crucial for optimization
- Modern GPUs are increasingly specialized for AI/ML workloads