
GPU Sparsity Patterns & Performance

GPU sparsity patterns from unstructured CSR to structured 2:4 for modern tensor cores.

25 min read
Updated 9/7/2024

Prerequisites

Make sure you're familiar with these concepts before diving in:

GPU Architecture Basics
Memory Hierarchy
CUDA Programming

Learning Objectives

By the end of this topic, you will be able to:

Understand different sparsity types and their GPU performance implications
Choose optimal storage formats (CSR, BCSR, ELL) for different sparsity patterns
Implement efficient sparse kernels with proper coalescing and warp cooperation
Model sparse kernel performance and validate against dense baselines
Leverage 2:4 structured sparsity with tensor cores effectively


Sparsity is everywhere in modern AI workloads - from pruned neural networks to sparse attention mechanisms. But here's the catch: not all sparsity is created equal on GPUs. Let's dive deep into when sparsity helps, when it hurts, and how to implement it right.

1. Types of Sparsity

Understanding the sparsity landscape is crucial for making the right architectural decisions:


1.1 Unstructured Sparsity

  • Pattern: Arbitrary zero locations
  • Storage: CSR/CSC/COO formats with full indirection
  • Pros: Maximum compression ratio, works with any sparsity pattern
  • Cons: Poor memory locality, irregular memory access patterns, warp divergence

1.2 Structured Sparsity (2:4)

  • Pattern: Fixed ratio - 2 non-zeros in every 4 elements
  • Storage: Compressed values + compact metadata
  • Pros: Hardware-friendly, predictable access patterns, tensor core support
  • Cons: Limited to specific sparsity ratios, may require retraining
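
To make the pattern concrete, here is a minimal CPU-side pruning sketch: in each group of 4 weights, keep the two largest magnitudes and zero the rest. This is illustrative code under my own naming (prune_2_4 is a made-up helper, not a library routine):

#include <cmath>
#include <cstddef>
#include <utility>

void prune_2_4(float* w, std::size_t n) {   // n must be a multiple of 4
    for (std::size_t g = 0; g < n; g += 4) {
        // Find the two smallest-magnitude entries in this group of 4
        int lo1 = 0, lo2 = 1;
        if (std::fabs(w[g + 0]) > std::fabs(w[g + 1])) std::swap(lo1, lo2);
        for (int k = 2; k < 4; k++) {
            if (std::fabs(w[g + k]) < std::fabs(w[g + lo1])) { lo2 = lo1; lo1 = k; }
            else if (std::fabs(w[g + k]) < std::fabs(w[g + lo2])) { lo2 = k; }
        }
        w[g + lo1] = 0.0f;   // zero the two smallest, keep the two largest
        w[g + lo2] = 0.0f;
    }
}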

1.3 Block Sparsity (BCSR)

  • Pattern: Dense B×B blocks with sparse block-level structure
  • Storage: Block indices + dense sub-matrices
  • Pros: Better coalescing, lower metadata overhead, cache-friendly
  • Cons: Less flexible than unstructured, block size tuning required

2. ELI5: Sparsity in Plain English

Think of sparsity like a coloring book with lots of blank squares:

  • What is sparsity? Lots of zeros in your data. If a page has blank squares, you skip coloring them. Skipping zeros saves time and memory.

  • Why do GPUs care? GPUs work in groups of 32 threads (warps) that move in lockstep. If data is irregular, some lanes sit idle while the rest finish their share, like a dance line where everyone waits on the slowest dancer.

  • The flavors:

    • Unstructured: zeros are random. Great compression, messy for GPUs.
    • 2:4 structured: in every 4 numbers, only 2 are kept. Very tidy, hardware loves it.
    • Block: group non-zeros into square tiles; simpler bookkeeping, faster and more regular memory moves.
  • Storage trick: save only positions and values of non-zeros. More structure = fewer positions to store.

  • When it helps: extremely sparse or nicely patterned. When it hurts: only mildly sparse or very irregular.


Sound bite: If it's super-sparse or neatly patterned, work that sparse runway. If not, dense is still the snappiest number.

3. Storage Formats Deep Dive

3.1 CSR (Compressed Sparse Row)

// CSR representation (#include <vector>)
std::vector<int>   row_ptr;   // row start offsets, size M+1
std::vector<int>   col_idx;   // column index of each non-zero
std::vector<float> val;       // non-zero values

  • Best for: SpMV, SpMM with row-wise access patterns
  • Memory pattern: Indirection through col_idx hurts coalescing
  • Warp efficiency: Variable nnz per row causes load imbalance
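
For reference, the host-side SpMV loop this layout supports looks like the sketch below (num_rows, x, and y are assumed to be declared elsewhere):

// y = A*x with A in CSR form
for (int i = 0; i < num_rows; i++) {
    float sum = 0.0f;
    for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
        sum += val[j] * x[col_idx[j]];   // indirect gather through col_idx
    y[i] = sum;
}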

3.2 BCSR (Block Compressed Sparse Row)

// BCSR representation
std::vector<int>   block_row_ptr;  // block-row start offsets
std::vector<int>   block_col_idx;  // block-column indices
std::vector<float> block_vals;     // B*B values per block, stored densely

  • Sweet spot for GPUs: Reduces metadata, improves coalescing
  • Block size tuning: Match to shared memory capacity and warp size
  • Trade-off: Some zero-padding within blocks
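
The corresponding host-side SpMV loop, sketched under the assumption that each stored block is a row-major B×B tile:

// y = A*x with A in BCSR form; one dense B×B multiply per stored block
for (int bi = 0; bi < num_block_rows; bi++)
    for (int p = block_row_ptr[bi]; p < block_row_ptr[bi + 1]; p++) {
        int bj = block_col_idx[p];
        const float* blk = &block_vals[p * B * B];   // dense B×B tile
        for (int r = 0; r < B; r++)
            for (int c = 0; c < B; c++)
                y[bi * B + r] += blk[r * B + c] * x[bj * B + c];
    }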

3.3 ELL/ELLPACK

  • Pattern: Fixed nnz per row (pad shorter rows)
  • Pros: Perfect coalescing, simple indexing
  • Cons: Memory waste from padding, poor for skewed distributions
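
A quick sketch of why ELL coalesces so well: store the padded arrays column-major, so at each step k adjacent rows (threads) touch adjacent addresses. Array names here are illustrative:

// y = A*x with A in ELL form; ell_vals/ell_cols are num_rows x max_nnz,
// and padded entries hold value 0 so they contribute nothing
for (int i = 0; i < num_rows; i++) {          // one GPU thread per row
    float sum = 0.0f;
    for (int k = 0; k < max_nnz; k++) {
        int idx = k * num_rows + i;           // column-major: coalesced on GPU
        sum += ell_vals[idx] * x[ell_cols[idx]];
    }
    y[i] = sum;
}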

4. Performance Modeling

4.1 Back-of-Envelope Analysis

For GEMM C = A * B, where A is sparse and B, C are dense:

// Dense GEMM FLOPs
double dense_flops = 2.0 * M * N * K;

// Sparse effective FLOPs (only the useful math)
double sparse_flops = (1.0 - sparsity) * dense_flops;

// But add overhead:
// - metadata bytes per non-zero (indices, pointers)
// - warp-divergence penalties
// - reduced arithmetic intensity

Critical insight: Unstructured SpMM often needs ≥90% sparsity to beat dense kernels. Structured 2:4 wins much earlier due to hardware support.
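
A rough bandwidth-only model shows where that threshold comes from. The 85% and 25% efficiency factors below are illustrative assumptions, not measurements:

// Dense streams A at near-peak bandwidth; CSR moves 8 bytes per non-zero
// (4B value + 4B index) but gathers x at a fraction of peak
double dense_bytes  = (double)M * K * 4;
double sparse_bytes = (double)nnz * 8;
double dense_time   = dense_bytes  / (peak_bw * 0.85);  // assumed efficiency
double sparse_time  = sparse_bytes / (peak_bw * 0.25);  // gathers hurt
// With these assumptions sparse_time < dense_time only when
// nnz/(M*K) < ~0.15, i.e. roughly 85-90% sparsity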


4.2 Roofline Analysis for Sparse Kernels

// Effective bandwidth: bytes actually moved per unit time
double effective_bw = (nnz_bytes + metadata_bytes) / kernel_time;

// Arithmetic intensity: FLOPs per byte moved
double AI = sparse_flops / (nnz_bytes + metadata_bytes);

// Roofline performance bound
double perf = std::min(compute_roof, AI * memory_roof);
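
Worked example: CSR SpMV does 2 FLOPs per non-zero while moving roughly 8 bytes of matrix data for it (4-byte value + 4-byte column index), so AI ≈ 0.25 FLOP/byte. On an A100 (~19.5 FP32 TFLOP/s, ~1.6 TB/s HBM) the ridge point sits near 12 FLOP/byte, so SpMV lives deep in the memory-bound region: perf ≈ AI × memory_roof ≈ 0.4 TFLOP/s at best.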

5. GPU Implementation Strategies

5.1 Coalescing-First Design

// Bad: every lane gathers x through col_idx - effectively random access
for (int i = 0; i < nnz; i++) {
    result += val[i] * x[col_idx[i]];  // uncoalesced loads of x
}

// Good: reorder non-zeros into sorted column blocks, or stage the hot
// slice of x in shared memory (see 5.3) so the gather hits fast memory

5.2 Warp-Cooperative Processing

Strategy 1: Row per Warp

__global__ void spmv_row_per_warp(const int* row_ptr, const int* col_idx,
                                  const float* val, const float* x,
                                  float* y, int num_rows) {
    int warps_per_block = blockDim.x / 32;
    int warp_id = threadIdx.x / 32;
    int lane_id = threadIdx.x % 32;
    int row = blockIdx.x * warps_per_block + warp_id;
    if (row >= num_rows) return;

    // Lanes stride across the row's non-zeros: adjacent lanes read
    // adjacent val/col_idx entries, so those loads coalesce
    float sum = 0.0f;
    for (int j = row_ptr[row] + lane_id; j < row_ptr[row + 1]; j += 32) {
        sum += val[j] * x[col_idx[j]];  // the x gather is still irregular
    }

    // Warp reduction via shuffles
    for (int offset = 16; offset > 0; offset /= 2)
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    if (lane_id == 0) y[row] = sum;
}
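
A typical launch pairs this with a few warps per block; assuming the parameter list above, something like:

int warps_per_block = 4;   // 128 threads/block = 4 warps, one row each
int blocks = (num_rows + warps_per_block - 1) / warps_per_block;
spmv_row_per_warp<<<blocks, warps_per_block * 32>>>(row_ptr, col_idx,
                                                    val, x, y, num_rows);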

Strategy 2: Block per Warp (BCSR)

__global__ void spmm_bcsr_block_per_warp(/* ... */) {
    // Each warp processes one B×B block
    // Better coalescing, lower metadata overhead
    // Shared memory for input vector reuse
}

5.3 Shared Memory Staging

__shared__ float x_shared[BLOCK_SIZE];
 
// Stage frequently accessed vector elements
// Reduces global memory traffic for irregular access patterns
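
One way this can look end-to-end, sketched under a strong assumption: the matrix has been pre-tiled by column so that every non-zero handled by a block has a column index inside one TILE-wide slice of x. Names are illustrative, and blockDim.x >= TILE is assumed:

#define TILE 256

__global__ void spmv_tile_staged(const int* row_ptr, const int* col_idx,
                                 const float* val, const float* x, float* y,
                                 int num_rows, int tile_start) {
    __shared__ float x_sh[TILE];
    if (threadIdx.x < TILE)                 // cooperative, coalesced load
        x_sh[threadIdx.x] = x[tile_start + threadIdx.x];
    __syncthreads();

    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;
    float sum = 0.0f;
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; j++)
        sum += val[j] * x_sh[col_idx[j] - tile_start];  // gather hits SMEM
    atomicAdd(&y[row], sum);  // partial sum; other column tiles add theirs
}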

6. 2:4 Structured Sparsity

6.1 Hardware Support

Modern NVIDIA tensor cores (Ampere and later) support 2:4 sparsity natively: per-group metadata records which two of every four elements were kept, and the hardware skips the zeroed multiplies.


6.2 Implementation Pattern

// 1. Prune weights to a 2:4 pattern during (or after) training
// 2. Compress: pack the 2 kept values plus their 2-bit position
//    indices per group of 4
// 3. Use cuSPARSELt (or custom kernels) for inference

// Example: 2:4 sparse GEMM, D = alpha*A*B + beta*C
cusparseLtMatmul(&handle, &plan, &alpha,
                 compressed_A, B, &beta, C, D,
                 workspace, streams, num_streams);

6.3 Performance Characteristics

  • Theoretical speedup: 2× (half the math)
  • Practical speedup: 1.3-1.6× (memory/scheduling overhead)
  • Memory reduction: ~50% for weights
  • Accuracy: Minimal degradation with proper training

7. Load Balancing Strategies

7.1 Handling Skewed Distributions

// Problem: some rows have many non-zeros, others few
// Solution 1: row bucketing - group rows by nnz count
std::vector<std::vector<int>> buckets(MAX_NNZ_BUCKET);
for (int i = 0; i < num_rows; i++) {
    int nnz = row_ptr[i + 1] - row_ptr[i];
    buckets[std::min(nnz, MAX_NNZ_BUCKET - 1)].push_back(i);
}

// Process each bucket with an appropriately sized kernel variant

// Solution 2: work splitting
// Long rows are split across multiple warps/blocks

7.2 Dynamic Load Balancing

__global__ void spmv_work_stealing(/* ... */) {
    __shared__ int work_queue[QUEUE_SIZE];
    __shared__ int queue_head, queue_tail;
    
    // Warps grab work from shared queue
    // Split large rows dynamically
}
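
A simpler alternative to an in-block queue is a persistent-kernel style global counter: each warp atomically grabs the next row until none remain. This is a sketch of that pattern, not the queue design above; *next_row must be zeroed before launch:

__global__ void spmv_grab_next_row(const int* row_ptr, const int* col_idx,
                                   const float* val, const float* x, float* y,
                                   int num_rows, int* next_row) {
    int lane = threadIdx.x % 32;
    while (true) {
        int row = 0;
        if (lane == 0) row = atomicAdd(next_row, 1);  // lane 0 grabs work
        row = __shfl_sync(0xffffffff, row, 0);        // broadcast to warp
        if (row >= num_rows) return;

        float sum = 0.0f;
        for (int j = row_ptr[row] + lane; j < row_ptr[row + 1]; j += 32)
            sum += val[j] * x[col_idx[j]];
        for (int off = 16; off > 0; off /= 2)         // warp reduction
            sum += __shfl_down_sync(0xffffffff, sum, off);
        if (lane == 0) y[row] = sum;
    }
}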

8. Validation & Benchmarking

8.1 Correctness Validation

// Always validate against a trusted reference implementation
bool validate_sparse_result(const DenseMatrix& C_sparse,
                            const DenseMatrix& C_reference,
                            int M, int N, float tolerance) {
    float max_error = 0.0f;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            max_error = std::max(max_error,
                                 std::fabs(C_sparse[i][j] - C_reference[i][j]));
    return max_error < tolerance;
}

8.2 Performance Metrics

// Key metrics to track
struct SparseKernelMetrics {
    float achieved_bandwidth_gbps;
    float effective_flops_per_sec;  
    float warp_efficiency;
    float memory_efficiency;
    float speedup_vs_dense;
    float speedup_vs_cusparse;
};

9. Hands-On Exercises

9.1 Exercise 1: Sparsity Analysis

Given a CSR matrix with M=1024, N=1024, nnz=10240:

  • Calculate sparsity ratio
  • Estimate metadata overhead (bytes)
  • Compute arithmetic intensity for SpMV
  • Determine if memory or compute bound on A100

9.2 Exercise 2: Block Size Optimization

Design BCSR block sizes for a sparse matrix such that the blocks:

  • Fit in the 48KB of shared memory per SM
  • Avoid conflicts across the 32 shared-memory banks
  • Minimize zero-padding overhead

Show your calculations and trade-offs.

9.3 Exercise 3: 2:4 Conversion

Take a dense 4×4 weight matrix and convert it to the 2:4 sparse format:

  • Show pruning decisions
  • Pack compressed representation
  • Outline tensor core kernel usage
  • Estimate memory savings and speedup

10. Key Takeaways

  1. Structure matters: 2:4 > Block > Unstructured for GPU performance
  2. Coalescing first: Design memory access patterns before optimizing compute
  3. High sparsity threshold: Need 90%+ for unstructured to beat dense
  4. Load balancing critical: Skewed distributions kill performance
  5. Validate everything: Sparse kernels are notoriously buggy

The future is structured sparsity with hardware acceleration. Master 2:4 patterns and tensor core integration - that's where the performance gains are hiding! 🚀