GPU Sparsity Patterns & Performance
GPU sparsity patterns from unstructured CSR to structured 2:4 for modern tensor cores.
Sparsity is everywhere in modern AI workloads - from pruned neural networks to sparse attention mechanisms. But here's the catch: not all sparsity is created equal on GPUs. Let's dive deep into when sparsity helps, when it hurts, and how to implement it right.
1. Types of Sparsity
Understanding the sparsity landscape is crucial for making the right architectural decisions:
1.1 Unstructured Sparsity
- Pattern: Arbitrary zero locations
- Storage: CSR/CSC/COO formats with full indirection
- Pros: Maximum compression ratio, works with any sparsity pattern
- Cons: Poor memory locality, irregular memory access patterns, warp divergence
1.2 Structured Sparsity (2:4)
- Pattern: Fixed ratio - 2 non-zeros in every 4 elements
- Storage: Compressed values + compact metadata
- Pros: Hardware-friendly, predictable access patterns, tensor core support
- Cons: Limited to specific sparsity ratios, may require retraining
1.3 Block Sparsity (BCSR)
- Pattern: Dense B×B blocks with sparse block-level structure
- Storage: Block indices + dense sub-matrices
- Pros: Better coalescing, lower metadata overhead, cache-friendly
- Cons: Less flexible than unstructured, block size tuning required
2. ELI5: Sparsity in Plain English
Think of sparsity like a coloring book with lots of blank squares:
- What is sparsity? Lots of zeros in your data. If a page has blank squares, you skip coloring them. Skipping zeros saves time and memory.
- Why do GPUs care? GPUs work in big groups of 32 threads (warps). If data is irregular, some lanes sit idle, like a dance line tripping over a messy stage.
- The flavors:
  - Unstructured: zeros are random. Great compression, messy for GPUs.
  - 2:4 structured: in every 4 numbers, only 2 are kept. Very tidy, hardware loves it.
  - Block: keep non-zeros in squares (tiles); easier, faster moves.
- Storage trick: save only the positions and values of non-zeros. More structure = fewer positions to store.
- When it helps: extremely sparse or nicely patterned data. When it hurts: only mildly sparse or very irregular data.
Sound bite: If it's super-sparse or neatly patterned, work that sparse runway. If not, dense is still the snappiest number.
3. Storage Formats Deep Dive
3.1 CSR (Compressed Sparse Row)
// CSR representation
std::vector<int>   row_ptr;  // row i's nonzeros span [row_ptr[i], row_ptr[i+1])
std::vector<int>   col_idx;  // column index of each nonzero
std::vector<float> val;      // nonzero values, stored row by row
- Best for: SpMV and SpMM with row-wise access patterns
- Memory pattern: Indirection through col_idx hurts coalescing
- Warp efficiency: Variable nnz per row causes load imbalance
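To make the layout concrete, here is a small worked example; the 4×4 matrix and its values are made up for illustration:

// A hypothetical 4×4 matrix and its CSR arrays
//     [5 0 0 1]
// A = [0 0 2 0]
//     [0 3 0 0]
//     [4 0 0 6]
#include <vector>
std::vector<int>   row_ptr = {0, 2, 3, 4, 6};     // row i spans [row_ptr[i], row_ptr[i+1])
std::vector<int>   col_idx = {0, 3, 2, 1, 0, 3};  // column of each stored value
std::vector<float> val     = {5, 1, 2, 3, 4, 6};  // non-zero values, row by row

Row 1 has a single non-zero, so its slice of col_idx/val is just one entry wide; the only structure CSR guarantees is that rows are stored back to back.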
3.2 BCSR (Block Compressed Sparse Row)
// BCSR representation
std::vector<int>   block_row_ptr;  // Block-row start indices
std::vector<int>   block_col_idx;  // Block-column indices
std::vector<float> block_vals;     // Dense B×B blocks, stored contiguously
- Sweet spot for GPUs: Reduces metadata, improves coalescing
- Block size tuning: Match to shared memory capacity and warp size
- Trade-off: Some zero-padding within blocks
3.3 ELL/ELLPACK
- Pattern: Fixed nnz per row (pad shorter rows)
- Pros: Perfect coalescing, simple indexing
- Cons: Memory waste from padding, poor for skewed distributions
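As a quick illustration, padding the same hypothetical 4×4 matrix from the CSR example to a fixed width of 2 gives:

// ELL with width 2 (the widest row has 2 non-zeros); padded slots get column -1, value 0.
// Laid out column-major (all first entries, then all second entries) so that a
// one-thread-per-row kernel reads consecutive addresses.
int   ell_cols[8] = {0, 2, 1, 0,   3, -1, -1, 3};
float ell_vals[8] = {5, 2, 3, 4,   1,  0,  0, 6};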
4. Performance Modeling
4.1 Back-of-Envelope Analysis
For a sparse GEMM C = A * B, where A (M×K) is sparse and B (K×N) is dense:
// Dense GEMM FLOPs
dense_flops = 2 * M * N * K;
// Sparse effective FLOPs
sparse_flops = (1 - sparsity) * dense_flops;
// But add overhead:
// - Metadata bytes per nonzero
// - Warp divergence penalties
// - Reduced arithmetic intensity
Critical insight: Unstructured SpMM often needs ≥90% sparsity to beat dense kernels. Structured 2:4 wins much earlier due to hardware support.
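To see why the threshold is so high, here is a rough back-of-envelope sketch; all figures are illustrative assumptions (FP32 values, 32-bit indices, CSR storage for A, and B and C kept dense), not measurements:

#include <cstdio>

int main() {
    const double M = 4096, N = 4096, K = 4096;
    for (double sparsity : {0.50, 0.90, 0.99}) {
        double nnz          = (1.0 - sparsity) * M * K;
        double dense_flops  = 2.0 * M * N * K;
        double sparse_flops = 2.0 * nnz * N;
        double dense_bytes  = 4.0 * (M * K + K * N + M * N);   // A, B, C touched once each
        double sparse_bytes = 8.0 * nnz + 4.0 * (M + 1)        // A values + col_idx + row_ptr
                            + 4.0 * (K * N + M * N);           // B and C are still dense
        printf("sparsity %.2f: FLOPs shrink %5.1fx, bytes shrink only %.2fx\n",
               sparsity, dense_flops / sparse_flops, dense_bytes / sparse_bytes);
    }
    return 0;
}

At 90% sparsity the FLOPs drop 10×, but the bytes moved barely drop because B and C stay dense, so a bandwidth-bound sparse kernel has little headroom against a tensor-core dense GEMM until sparsity becomes extreme.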
4.2 Roofline Analysis for Sparse Kernels
// Effective bandwidth calculation
effective_bw = (nnz_bytes + metadata_bytes) / kernel_time;
// Arithmetic intensity
AI = sparse_flops / (nnz_bytes + metadata_bytes);
// Performance bound
perf = min(compute_roof, AI * memory_roof);
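Plugging in rough numbers makes the bound concrete. The sketch below assumes A100-class peaks (about 19.5 TFLOP/s FP32 and 1.5 TB/s of HBM bandwidth) and an average of 32 non-zeros per row; every figure here is an assumption for illustration:

#include <algorithm>
#include <cstdio>

int main() {
    const double compute_roof = 19.5e12;                 // FLOP/s (assumed)
    const double memory_roof  = 1.5e12;                  // bytes/s (assumed)
    const double num_rows = 1 << 20;
    const double nnz      = 32.0 * num_rows;             // ~32 nnz per row (assumed)

    double sparse_flops   = 2.0 * nnz;                   // one FMA per non-zero
    double nnz_bytes      = nnz * (4 + 4);               // FP32 value + int32 col_idx
    double metadata_bytes = (num_rows + 1) * 4;          // row_ptr
    double vector_bytes   = 2.0 * num_rows * 4;          // x read once (good reuse) + y written
    double AI   = sparse_flops / (nnz_bytes + metadata_bytes + vector_bytes);
    double perf = std::min(compute_roof, AI * memory_roof);
    printf("AI = %.2f FLOP/byte -> bound at %.0f GFLOP/s (far below the compute roof)\n",
           AI, perf / 1e9);
    return 0;
}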
5. GPU Implementation Strategies
5.1 Coalescing-First Design
// Bad: x[col_idx[i]] gathers from effectively random addresses
float result = 0.0f;
for (int i = 0; i < nnz; i++) {
    result += val[i] * x[col_idx[i]];   // random access, no coalescing
}
// Good: reorder work for coalescing
// Process non-zeros by sorted column blocks, or stage the reused slice of x in
// shared memory (see the staging sketch in Section 5.3)
5.2 Warp-Cooperative Processing
Strategy 1: Row per Warp
__global__ void spmv_row_per_warp(const int* row_ptr, const int* col_idx,
                                  const float* val, const float* x, float* y,
                                  int num_rows) {
    int warp_id = threadIdx.x / 32;
    int lane_id = threadIdx.x % 32;
    int warps_per_block = blockDim.x / 32;
    int row = blockIdx.x * warps_per_block + warp_id;
    if (row >= num_rows) return;                  // guard against the partial last block
    // Warp cooperatively processes one row: lanes stride over its non-zeros
    float sum = 0.0f;
    for (int j = row_ptr[row] + lane_id; j < row_ptr[row + 1]; j += 32) {
        sum += val[j] * x[col_idx[j]];
    }
    // Warp reduction: combine the 32 partial sums
    sum = warp_reduce_sum(sum);
    if (lane_id == 0) y[row] = sum;
}
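The kernel relies on a warp_reduce_sum helper that the listing leaves out; a minimal shuffle-based version (declared before the kernel) might look like this:

__device__ float warp_reduce_sum(float v) {
    // Tree reduction across the 32 lanes of a warp
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;   // lane 0 ends up holding the full sum
}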
Strategy 2: Block per Warp (BCSR)
__global__ void spmm_bcsr_block_per_warp(/* ... */) {
// Each warp processes one B×B block
// Better coalescing, lower metadata overhead
// Shared memory for input vector reuse
}
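The skeleton above is only comments. Below is a minimal sketch of the idea for SpMV (matrix-vector), which keeps the code short; it assumes B = 4, block values stored row-major as 16 contiguous floats, a square matrix, and one warp per block row. All names are illustrative, not a library API.

#define BLOCK_DIM 4   // B

__global__ void spmv_bcsr_warp_per_block_row(const int* block_row_ptr,
                                             const int* block_col_idx,
                                             const float* block_vals,
                                             const float* x, float* y,
                                             int num_block_rows) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane    = threadIdx.x % 32;
    if (warp_id >= num_block_rows) return;                       // whole warp exits together

    float acc[BLOCK_DIM] = {0.f, 0.f, 0.f, 0.f};                 // partial sums, one per output row
    int first = block_row_ptr[warp_id];
    int last  = block_row_ptr[warp_id + 1];
    for (int blk = first + lane; blk < last; blk += 32) {        // one dense block per lane
        const float* v = block_vals + blk * BLOCK_DIM * BLOCK_DIM;
        int col0 = block_col_idx[blk] * BLOCK_DIM;
        for (int r = 0; r < BLOCK_DIM; ++r)
            for (int c = 0; c < BLOCK_DIM; ++c)
                acc[r] += v[r * BLOCK_DIM + c] * x[col0 + c];    // dense 4×4 times x slice
    }
    // Combine partial sums across the warp, then lane 0 writes the 4 output rows
    for (int r = 0; r < BLOCK_DIM; ++r) {
        float s = acc[r];
        for (int off = 16; off > 0; off >>= 1)
            s += __shfl_down_sync(0xffffffff, s, off);
        if (lane == 0) y[warp_id * BLOCK_DIM + r] = s;
    }
}

Metadata is fetched once per 16 values instead of once per value, and each block's values are contiguous; production kernels typically go further and map lanes to individual block entries for fully coalesced loads.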
5.3 Shared Memory Staging
__shared__ float x_shared[BLOCK_SIZE];
// Stage frequently accessed vector elements
// Reduces global memory traffic for irregular access patterns
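As a concrete (and simplified) illustration, the sketch below assumes a banded matrix, so each thread block knows in advance which slice of x its rows will touch and can stage it once; HALF_BW, ROWS_PER_BLOCK, and the kernel name are made up for this example.

#define HALF_BW        64     // assumed: every non-zero in row r has |col - r| <= HALF_BW
#define ROWS_PER_BLOCK 128    // launch with blockDim.x == ROWS_PER_BLOCK

__global__ void spmv_banded_staged(const int* row_ptr, const int* col_idx,
                                   const float* val, const float* x, float* y,
                                   int n) {                      // n rows == n cols
    __shared__ float x_shared[ROWS_PER_BLOCK + 2 * HALF_BW];
    int row_base = blockIdx.x * ROWS_PER_BLOCK;
    int x_base   = row_base - HALF_BW;                           // first column this block can touch
    // Cooperatively stage the window of x (coalesced global loads)
    for (int i = threadIdx.x; i < ROWS_PER_BLOCK + 2 * HALF_BW; i += blockDim.x) {
        int col = x_base + i;
        x_shared[i] = (col >= 0 && col < n) ? x[col] : 0.0f;
    }
    __syncthreads();
    int row = row_base + threadIdx.x;
    if (row < n) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x_shared[col_idx[j] - x_base];       // hits shared memory, not DRAM
        y[row] = sum;
    }
}

For general (non-banded) matrices the same idea shows up as column tiling: non-zeros are grouped by column range, and each tile's slice of x is staged before processing.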
6. 2:4 Structured Sparsity
6.1 Hardware Support
Modern NVIDIA tensor cores (Ampere and later) have native 2:4 sparse support: 2-bit per-element metadata tells the hardware which positions were pruned, letting it skip the zeros and roughly double peak math throughput for eligible GEMMs.
6.2 Implementation Pattern
// 1. Prune weights to 2:4 pattern during training
// 2. Compress: pack 2 values + 2-bit metadata per 4 elements
// 3. Use cuSPARSELt or custom kernels for inference
// Example: 2:4 sparse GEMM via cuSPARSELt (handle, plan, and buffers set up beforehand)
cusparseLtMatmul(&handle, &plan, &alpha,
                 compressed_A, B, &beta,
                 C, D,                        // C is the accumulator input, D the output (may alias C)
                 workspace, streams, num_streams);
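Step 1 above can be as simple as a magnitude rule: in every group of four weights, keep the two largest magnitudes. The host-side sketch below is illustrative only; in practice cuSPARSELt ships its own prune and compress helpers that you would call instead.

#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical magnitude-based 2:4 pruning: zero the 2 smallest of every 4 weights.
void prune_2_4(std::vector<float>& w) {
    for (size_t g = 0; g + 4 <= w.size(); g += 4) {
        int idx[4] = {0, 1, 2, 3};
        std::sort(idx, idx + 4, [&](int a, int b) {
            return std::fabs(w[g + a]) > std::fabs(w[g + b]);
        });
        w[g + idx[2]] = 0.0f;   // drop the two smallest magnitudes
        w[g + idx[3]] = 0.0f;
    }
}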
6.3 Performance Characteristics
- Theoretical speedup: 2× (half the math)
- Practical speedup: 1.3-1.6× (memory/scheduling overhead)
- Memory reduction: ~50% for weights
- Accuracy: Minimal degradation with proper training
7. Load Balancing Strategies
7.1 Handling Skewed Distributions
// Problem: some rows have many non-zeros, others few
// Solution 1: row bucketing - group rows by nnz and launch a tuned kernel per bucket
std::vector<std::vector<int>> buckets(MAX_NNZ_BUCKET);
for (int i = 0; i < num_rows; i++) {
    int nnz = row_ptr[i + 1] - row_ptr[i];
    buckets[std::min(nnz, MAX_NNZ_BUCKET - 1)].push_back(i);
}
// Process each bucket with an appropriate kernel variant
// Solution 2: work stealing - long rows are split across multiple warps/blocks
7.2 Dynamic Load Balancing
__global__ void spmv_work_stealing(/* ... */) {
__shared__ int work_queue[QUEUE_SIZE];
__shared__ int queue_head, queue_tail;
// Warps grab work from shared queue
// Split large rows dynamically
}
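A simpler variant of the same idea, sketched below, skips the shared-memory queue and lets each warp pull the next unprocessed row from a global atomic counter (next_row must be zeroed before launch; all names are illustrative):

__global__ void spmv_dynamic_rows(const int* row_ptr, const int* col_idx,
                                  const float* val, const float* x, float* y,
                                  int num_rows, int* next_row) {
    int lane = threadIdx.x % 32;
    while (true) {
        int row;
        if (lane == 0) row = atomicAdd(next_row, 1);   // lane 0 claims the next row
        row = __shfl_sync(0xffffffff, row, 0);         // broadcast to the whole warp
        if (row >= num_rows) break;                    // no work left
        float sum = 0.0f;
        for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += 32)
            sum += val[j] * x[col_idx[j]];
        for (int off = 16; off > 0; off >>= 1)         // warp reduction
            sum += __shfl_down_sync(0xffffffff, sum, off);
        if (lane == 0) y[row] = sum;
    }
}

Because fast warps simply claim more rows, no block gets stuck behind one heavy row; full work stealing additionally splits very long rows into chunks.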
8. Validation & Benchmarking
8.1 Correctness Validation
// Always validate against a reference implementation (CPU GEMM or cuSPARSE)
bool validate_sparse_result(const DenseMatrix& C_sparse,
                            const DenseMatrix& C_reference,
                            int M, int N,               // output dimensions
                            float tolerance = 1e-3f) {  // absolute tolerance
    float max_error = 0.0f;
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            float error = std::fabs(C_sparse[i][j] - C_reference[i][j]);
            max_error = std::max(max_error, error);
        }
    }
    // For large reduction dimensions, consider a relative tolerance instead
    return max_error < tolerance;
}
8.2 Performance Metrics
// Key metrics to track
struct SparseKernelMetrics {
float achieved_bandwidth_gbps;
float effective_flops_per_sec;
float warp_efficiency;
float memory_efficiency;
float speedup_vs_dense;
float speedup_vs_cusparse;
};
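A hypothetical way to fill these in for a CSR SpMV run, assuming kernel_ms and dense_ms were measured with CUDA events and num_rows, num_cols, and nnz describe the matrix (none of these names come from the text):

SparseKernelMetrics m = {};
double bytes_moved = nnz * (sizeof(float) + sizeof(int))     // values + col_idx
                   + (num_rows + 1) * sizeof(int)            // row_ptr
                   + (num_rows + num_cols) * sizeof(float);  // y written + x read (assumes good reuse)
m.achieved_bandwidth_gbps = bytes_moved / (kernel_ms * 1e-3) / 1e9;
m.effective_flops_per_sec = 2.0 * nnz / (kernel_ms * 1e-3);  // one FMA (2 FLOPs) per non-zero
m.speedup_vs_dense        = dense_ms / kernel_ms;
// warp_efficiency and memory_efficiency usually come from a profiler such as Nsight Compute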
9. Interview Practice Questions
10. Hands-On Exercises
10.1 Exercise 1: Sparsity Analysis
Given a CSR matrix with M=1024, N=1024, nnz=10240:
- Calculate sparsity ratio
- Estimate metadata overhead (bytes)
- Compute arithmetic intensity for SpMV
- Determine if memory or compute bound on A100
10.2 Exercise 2: Block Size Optimization
Design BCSR block sizes for a sparse matrix that:
- Fits in 48KB shared memory per SM
- Avoids conflicts across the 32 shared memory banks
- Minimizes zero-padding overhead
- Show your calculations and trade-offs
10.3 Exercise 3: 2:4 Conversion
Take a dense 4×4 weight matrix, convert to 2:4 sparse format:
- Show pruning decisions
- Pack compressed representation
- Outline tensor core kernel usage
- Estimate memory savings and speedup
11. Key Takeaways
- Structure matters: 2:4 > Block > Unstructured for GPU performance
- Coalescing first: Design memory access patterns before optimizing compute
- High sparsity threshold: Need 90%+ for unstructured to beat dense
- Load balancing critical: Skewed distributions kill performance
- Validate everything: Sparse kernels are notoriously buggy
The future is structured sparsity with hardware acceleration. Master 2:4 patterns and tensor core integration - that's where the performance gains are hiding! 🚀