GPU Sparsity Patterns & Performance
GPU sparsity patterns from unstructured CSR to structured 2:4 for modern tensor cores.
Sparsity is everywhere in modern AI workloads - from pruned neural networks to sparse attention mechanisms. But here's the catch: not all sparsity is created equal on GPUs. Let's dive deep into when sparsity helps, when it hurts, and how to implement it right.
1. Types of Sparsity
Understanding the sparsity landscape is crucial for making the right architectural decisions:
1.1 Unstructured Sparsity
- Pattern: Arbitrary zero locations
- Storage: CSR/CSC/COO formats with full indirection
- Pros: Maximum compression ratio, works with any sparsity pattern
- Cons: Poor memory locality, irregular memory access patterns, warp divergence
1.2 Structured Sparsity (2:4)
- Pattern: Fixed ratio - 2 non-zeros in every 4 elements
- Storage: Compressed values + compact metadata
- Pros: Hardware-friendly, predictable access patterns, tensor core support
- Cons: Limited to specific sparsity ratios, may require retraining
1.3 Block Sparsity (BCSR)
- Pattern: Dense B×B blocks with sparse block-level structure
- Storage: Block indices + dense sub-matrices
- Pros: Better coalescing, lower metadata overhead, cache-friendly
- Cons: Less flexible than unstructured, block size tuning required
2. ELI5: Sparsity in Plain English
Think of sparsity like a coloring book with lots of blank squares:
- What is sparsity? Lots of zeros in your data. If a page has blank squares, you skip coloring them. Skipping zeros saves time and memory.
- Why do GPUs care? GPUs work in big groups of 32 threads (warps). If data is irregular, some lanes sit idle, like a dance line tripping over a messy stage.
- The flavors:
  - Unstructured: zeros are random. Great compression, messy for GPUs.
  - 2:4 structured: in every 4 numbers, only 2 are kept. Very tidy, hardware loves it.
  - Block: keep non-zeros in squares (tiles); easier, faster moves.
- Storage trick: save only the positions and values of non-zeros. More structure = fewer positions to store.
- When it helps: extremely sparse or nicely patterned data. When it hurts: only mildly sparse or very irregular data.
Sound bite: If it's super-sparse or neatly patterned, work that sparse runway. If not, dense is still the snappiest number.
3. Storage Formats Deep Dive
3.1 CSR (Compressed Sparse Row)
// CSR representation
std::vector<int>   row_ptr;  // row i's nonzeros span [row_ptr[i], row_ptr[i+1])
std::vector<int>   col_idx;  // column index of each nonzero
std::vector<float> val;      // nonzero values, stored row by row
- Best for: SpMV and SpMM with row-wise access patterns
- Memory pattern: Indirection through col_idx hurts coalescing
- Warp efficiency: Variable nnz per row causes load imbalance
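To make the layout concrete, here is a small worked example; the 4×4 matrix and its values are made up for illustration:

// A hypothetical 4×4 matrix and its CSR arrays
//     [5 0 0 1]
// A = [0 0 2 0]
//     [0 3 0 0]
//     [4 0 0 6]
#include <vector>
std::vector<int>   row_ptr = {0, 2, 3, 4, 6};     // row i spans [row_ptr[i], row_ptr[i+1])
std::vector<int>   col_idx = {0, 3, 2, 1, 0, 3};  // column of each stored value
std::vector<float> val     = {5, 1, 2, 3, 4, 6};  // non-zero values, row by row

Row 1 has a single non-zero, so its slice of col_idx/val is just one entry wide; the only structure CSR guarantees is that rows are stored back to back.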
3.2 BCSR (Block Compressed Sparse Row)
// BCSR representation
std::vector<int>   block_row_ptr;  // Block-row start indices
std::vector<int>   block_col_idx;  // Block-column indices
std::vector<float> block_vals;     // Dense B×B blocks, stored contiguously
- Sweet spot for GPUs: Reduces metadata, improves coalescing
- Block size tuning: Match to shared memory capacity and warp size
- Trade-off: Some zero-padding within blocks
3.3 ELL/ELLPACK
- Pattern: Fixed nnz per row (pad shorter rows)
- Pros: Perfect coalescing, simple indexing
- Cons: Memory waste from padding, poor for skewed distributions
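As a quick illustration, padding the same hypothetical 4×4 matrix from the CSR example to a fixed width of 2 gives:

// ELL with width 2 (the widest row has 2 non-zeros); padded slots get column -1, value 0.
// Laid out column-major (all first entries, then all second entries) so that a
// one-thread-per-row kernel reads consecutive addresses.
int   ell_cols[8] = {0, 2, 1, 0,   3, -1, -1, 3};
float ell_vals[8] = {5, 2, 3, 4,   1,  0,  0, 6};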
4. Performance Modeling
4.1 Back-of-Envelope Analysis
For a sparse GEMM C = A * B, where A (M×K) is sparse and B (K×N) is dense:
// Dense GEMM FLOPs
dense_flops = 2 * M * N * K;
// Sparse effective FLOPs
sparse_flops = (1 - sparsity) * dense_flops;
// But add overhead:
// - Metadata bytes per nonzero
// - Warp divergence penalties
// - Reduced arithmetic intensity
Critical insight: Unstructured SpMM often needs ≥90% sparsity to beat dense kernels. Structured 2:4 wins much earlier due to hardware support.
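To see why the threshold is so high, here is a rough back-of-envelope sketch; all figures are illustrative assumptions (FP32 values, 32-bit indices, CSR storage for A, and B and C kept dense), not measurements:

#include <cstdio>

int main() {
    const double M = 4096, N = 4096, K = 4096;
    for (double sparsity : {0.50, 0.90, 0.99}) {
        double nnz          = (1.0 - sparsity) * M * K;
        double dense_flops  = 2.0 * M * N * K;
        double sparse_flops = 2.0 * nnz * N;
        double dense_bytes  = 4.0 * (M * K + K * N + M * N);   // A, B, C touched once each
        double sparse_bytes = 8.0 * nnz + 4.0 * (M + 1)        // A values + col_idx + row_ptr
                            + 4.0 * (K * N + M * N);           // B and C are still dense
        printf("sparsity %.2f: FLOPs shrink %5.1fx, bytes shrink only %.2fx\n",
               sparsity, dense_flops / sparse_flops, dense_bytes / sparse_bytes);
    }
    return 0;
}

At 90% sparsity the FLOPs drop 10×, but the bytes moved barely drop because B and C stay dense, so a bandwidth-bound sparse kernel has little headroom against a tensor-core dense GEMM until sparsity becomes extreme.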
4.2 Roofline Analysis for Sparse Kernels
// Effective bandwidth calculation
effective_bw = (nnz_bytes + metadata_bytes) / kernel_time;
// Arithmetic intensity
AI = sparse_flops / (nnz_bytes + metadata_bytes);
// Performance bound
perf = min(compute_roof, AI * memory_roof);
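Plugging in rough numbers makes the bound concrete. The sketch below assumes A100-class peaks (about 19.5 TFLOP/s FP32 and 1.5 TB/s of HBM bandwidth) and an average of 32 non-zeros per row; every figure here is an assumption for illustration:

#include <algorithm>
#include <cstdio>

int main() {
    const double compute_roof = 19.5e12;                 // FLOP/s (assumed)
    const double memory_roof  = 1.5e12;                  // bytes/s (assumed)
    const double num_rows = 1 << 20;
    const double nnz      = 32.0 * num_rows;             // ~32 nnz per row (assumed)

    double sparse_flops   = 2.0 * nnz;                   // one FMA per non-zero
    double nnz_bytes      = nnz * (4 + 4);               // FP32 value + int32 col_idx
    double metadata_bytes = (num_rows + 1) * 4;          // row_ptr
    double vector_bytes   = 2.0 * num_rows * 4;          // x read once (good reuse) + y written
    double AI   = sparse_flops / (nnz_bytes + metadata_bytes + vector_bytes);
    double perf = std::min(compute_roof, AI * memory_roof);
    printf("AI = %.2f FLOP/byte -> bound at %.0f GFLOP/s (far below the compute roof)\n",
           AI, perf / 1e9);
    return 0;
}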
5. GPU Implementation Strategies
5.1 Coalescing-First Design
// Bad: x[col_idx[i]] gathers from effectively random addresses
float result = 0.0f;
for (int i = 0; i < nnz; i++) {
    result += val[i] * x[col_idx[i]];   // random access, no coalescing
}
// Good: reorder work for coalescing
// Process non-zeros by sorted column blocks, or stage the reused slice of x in
// shared memory (see the staging sketch in Section 5.3)
5.2 Warp-Cooperative Processing
Strategy 1: Row per Warp
__global__ void spmv_row_per_warp(const int* row_ptr, const int* col_idx,
                                  const float* val, const float* x, float* y,
                                  int num_rows) {
    int warp_id = threadIdx.x / 32;
    int lane_id = threadIdx.x % 32;
    int warps_per_block = blockDim.x / 32;
    int row = blockIdx.x * warps_per_block + warp_id;
    if (row >= num_rows) return;                  // guard against the partial last block
    // Warp cooperatively processes one row: lanes stride over its non-zeros
    float sum = 0.0f;
    for (int j = row_ptr[row] + lane_id; j < row_ptr[row + 1]; j += 32) {
        sum += val[j] * x[col_idx[j]];
    }
    // Warp reduction: combine the 32 partial sums
    sum = warp_reduce_sum(sum);
    if (lane_id == 0) y[row] = sum;
}
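The kernel relies on a warp_reduce_sum helper that the listing leaves out; a minimal shuffle-based version (declared before the kernel) might look like this:

__device__ float warp_reduce_sum(float v) {
    // Tree reduction across the 32 lanes of a warp
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;   // lane 0 ends up holding the full sum
}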
Strategy 2: Block per Warp (BCSR)
__global__ void spmm_bcsr_block_per_warp(/* ... */) {
// Each warp processes one B×B block
// Better coalescing, lower metadata overhead
// Shared memory for input vector reuse
}
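The skeleton above is only comments. Below is a minimal sketch of the idea for SpMV (matrix-vector), which keeps the code short; it assumes B = 4, block values stored row-major as 16 contiguous floats, a square matrix, and one warp per block row. All names are illustrative, not a library API.

#define BLOCK_DIM 4   // B

__global__ void spmv_bcsr_warp_per_block_row(const int* block_row_ptr,
                                             const int* block_col_idx,
                                             const float* block_vals,
                                             const float* x, float* y,
                                             int num_block_rows) {
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane    = threadIdx.x % 32;
    if (warp_id >= num_block_rows) return;                       // whole warp exits together

    float acc[BLOCK_DIM] = {0.f, 0.f, 0.f, 0.f};                 // partial sums, one per output row
    int first = block_row_ptr[warp_id];
    int last  = block_row_ptr[warp_id + 1];
    for (int blk = first + lane; blk < last; blk += 32) {        // one dense block per lane
        const float* v = block_vals + blk * BLOCK_DIM * BLOCK_DIM;
        int col0 = block_col_idx[blk] * BLOCK_DIM;
        for (int r = 0; r < BLOCK_DIM; ++r)
            for (int c = 0; c < BLOCK_DIM; ++c)
                acc[r] += v[r * BLOCK_DIM + c] * x[col0 + c];    // dense 4×4 times x slice
    }
    // Combine partial sums across the warp, then lane 0 writes the 4 output rows
    for (int r = 0; r < BLOCK_DIM; ++r) {
        float s = acc[r];
        for (int off = 16; off > 0; off >>= 1)
            s += __shfl_down_sync(0xffffffff, s, off);
        if (lane == 0) y[warp_id * BLOCK_DIM + r] = s;
    }
}

Metadata is fetched once per 16 values instead of once per value, and each block's values are contiguous; production kernels typically go further and map lanes to individual block entries for fully coalesced loads.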
5.3 Shared Memory Staging
__shared__ float x_shared[BLOCK_SIZE];
// Stage frequently accessed vector elements
// Reduces global memory traffic for irregular access patterns
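As a concrete (and simplified) illustration, the sketch below assumes a banded matrix, so each thread block knows in advance which slice of x its rows will touch and can stage it once; HALF_BW, ROWS_PER_BLOCK, and the kernel name are made up for this example.

#define HALF_BW        64     // assumed: every non-zero in row r has |col - r| <= HALF_BW
#define ROWS_PER_BLOCK 128    // launch with blockDim.x == ROWS_PER_BLOCK

__global__ void spmv_banded_staged(const int* row_ptr, const int* col_idx,
                                   const float* val, const float* x, float* y,
                                   int n) {                      // n rows == n cols
    __shared__ float x_shared[ROWS_PER_BLOCK + 2 * HALF_BW];
    int row_base = blockIdx.x * ROWS_PER_BLOCK;
    int x_base   = row_base - HALF_BW;                           // first column this block can touch
    // Cooperatively stage the window of x (coalesced global loads)
    for (int i = threadIdx.x; i < ROWS_PER_BLOCK + 2 * HALF_BW; i += blockDim.x) {
        int col = x_base + i;
        x_shared[i] = (col >= 0 && col < n) ? x[col] : 0.0f;
    }
    __syncthreads();
    int row = row_base + threadIdx.x;
    if (row < n) {
        float sum = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x_shared[col_idx[j] - x_base];       // hits shared memory, not DRAM
        y[row] = sum;
    }
}

For general (non-banded) matrices the same idea shows up as column tiling: non-zeros are grouped by column range, and each tile's slice of x is staged before processing.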
6. 2:4 Structured Sparsity
6.1 Hardware Support
Modern NVIDIA tensor cores (Ampere and later) have native 2:4 sparse support: 2-bit per-element metadata tells the hardware which positions were pruned, letting it skip the zeros and roughly double peak math throughput for eligible GEMMs.
6.2 Implementation Pattern
// 1. Prune weights to 2:4 pattern during training
// 2. Compress: pack 2 values + 2-bit metadata per 4 elements
// 3. Use cuSPARSELt or custom kernels for inference
// Example: 2:4 sparse GEMM via cuSPARSELt (handle, plan, and buffers set up beforehand)
cusparseLtMatmul(&handle, &plan, &alpha,
                 compressed_A, B, &beta,
                 C, D,                        // C is the accumulator input, D the output (may alias C)
                 workspace, streams, num_streams);
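Step 1 above can be as simple as a magnitude rule: in every group of four weights, keep the two largest magnitudes. The host-side sketch below is illustrative only; in practice cuSPARSELt ships its own prune and compress helpers that you would call instead.

#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical magnitude-based 2:4 pruning: zero the 2 smallest of every 4 weights.
void prune_2_4(std::vector<float>& w) {
    for (size_t g = 0; g + 4 <= w.size(); g += 4) {
        int idx[4] = {0, 1, 2, 3};
        std::sort(idx, idx + 4, [&](int a, int b) {
            return std::fabs(w[g + a]) > std::fabs(w[g + b]);
        });
        w[g + idx[2]] = 0.0f;   // drop the two smallest magnitudes
        w[g + idx[3]] = 0.0f;
    }
}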
6.3 Performance Characteristics
- Theoretical speedup: 2× (half the math)
- Practical speedup: 1.3-1.6× (memory/scheduling overhead)
- Memory reduction: ~50% for weights
- Accuracy: Minimal degradation with proper training
7. Load Balancing Strategies
7.1 Handling Skewed Distributions
// Problem: some rows have many non-zeros, others few
// Solution 1: row bucketing - group rows by nnz and launch a tuned kernel per bucket
std::vector<std::vector<int>> buckets(MAX_NNZ_BUCKET);
for (int i = 0; i < num_rows; i++) {
    int nnz = row_ptr[i + 1] - row_ptr[i];
    buckets[std::min(nnz, MAX_NNZ_BUCKET - 1)].push_back(i);
}
// Process each bucket with an appropriate kernel variant
// Solution 2: work stealing - long rows are split across multiple warps/blocks
7.2 Dynamic Load Balancing
__global__ void spmv_work_stealing(/* ... */) {
__shared__ int work_queue[QUEUE_SIZE];
__shared__ int queue_head, queue_tail;
// Warps grab work from shared queue
// Split large rows dynamically
}
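A simpler variant of the same idea, sketched below, skips the shared-memory queue and lets each warp pull the next unprocessed row from a global atomic counter (next_row must be zeroed before launch; all names are illustrative):

__global__ void spmv_dynamic_rows(const int* row_ptr, const int* col_idx,
                                  const float* val, const float* x, float* y,
                                  int num_rows, int* next_row) {
    int lane = threadIdx.x % 32;
    while (true) {
        int row;
        if (lane == 0) row = atomicAdd(next_row, 1);   // lane 0 claims the next row
        row = __shfl_sync(0xffffffff, row, 0);         // broadcast to the whole warp
        if (row >= num_rows) break;                    // no work left
        float sum = 0.0f;
        for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += 32)
            sum += val[j] * x[col_idx[j]];
        for (int off = 16; off > 0; off >>= 1)         // warp reduction
            sum += __shfl_down_sync(0xffffffff, sum, off);
        if (lane == 0) y[row] = sum;
    }
}

Because fast warps simply claim more rows, no block gets stuck behind one heavy row; full work stealing additionally splits very long rows into chunks.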
8. Validation & Benchmarking
8.1 Correctness Validation
// Always validate against a reference implementation (CPU GEMM or cuSPARSE)
bool validate_sparse_result(const DenseMatrix& C_sparse,
                            const DenseMatrix& C_reference,
                            int M, int N,               // output dimensions
                            float tolerance = 1e-3f) {  // absolute tolerance
    float max_error = 0.0f;
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            float error = std::fabs(C_sparse[i][j] - C_reference[i][j]);
            max_error = std::max(max_error, error);
        }
    }
    // For large reduction dimensions, consider a relative tolerance instead
    return max_error < tolerance;
}
8.2 Performance Metrics
// Key metrics to track
struct SparseKernelMetrics {
float achieved_bandwidth_gbps;
float effective_flops_per_sec;
float warp_efficiency;
float memory_efficiency;
float speedup_vs_dense;
float speedup_vs_cusparse;
};
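A hypothetical way to fill these in for a CSR SpMV run, assuming kernel_ms and dense_ms were measured with CUDA events and num_rows, num_cols, and nnz describe the matrix (none of these names come from the text):

SparseKernelMetrics m = {};
double bytes_moved = nnz * (sizeof(float) + sizeof(int))     // values + col_idx
                   + (num_rows + 1) * sizeof(int)            // row_ptr
                   + (num_rows + num_cols) * sizeof(float);  // y written + x read (assumes good reuse)
m.achieved_bandwidth_gbps = bytes_moved / (kernel_ms * 1e-3) / 1e9;
m.effective_flops_per_sec = 2.0 * nnz / (kernel_ms * 1e-3);  // one FMA (2 FLOPs) per non-zero
m.speedup_vs_dense        = dense_ms / kernel_ms;
// warp_efficiency and memory_efficiency usually come from a profiler such as Nsight Compute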
9. Interview Practice Questions
10. Hands-On Exercises
10.1 Exercise 1: Sparsity Analysis
Given a CSR matrix with M=1024, N=1024, nnz=10240:
- Calculate sparsity ratio
- Estimate metadata overhead (bytes)
- Compute arithmetic intensity for SpMV
- Determine if memory or compute bound on A100
10.2 Exercise 2: Block Size Optimization
Design BCSR block sizes for a sparse matrix that:
- Fits in 48KB shared memory per SM
- Avoids conflicts across the 32 shared memory banks
- Minimizes zero-padding overhead
- Show your calculations and trade-offs
10.3 Exercise 3: 2:4 Conversion
Take a dense 4×4 weight matrix, convert to 2:4 sparse format:
- Show pruning decisions
- Pack compressed representation
- Outline tensor core kernel usage
- Estimate memory savings and speedup
11. Key Takeaways
- Structure matters: 2:4 > Block > Unstructured for GPU performance
- Coalescing first: Design memory access patterns before optimizing compute
- High sparsity threshold: Need 90%+ for unstructured to beat dense
- Load balancing critical: Skewed distributions kill performance
- Validate everything: Sparse kernels are notoriously buggy
The future is structured sparsity with hardware acceleration. Master 2:4 patterns and tensor core integration - that's where the performance gains are hiding! 🚀