Matrix Multiplication Optimization Techniques
Progressive optimization of matrix multiplication: from naive O(N³) to cache-blocked SIMD implementations.
Matrix multiplication is a fundamental operation in scientific computing that demonstrates key optimization principles: cache locality, loop optimization, and vectorization.
1. Performance Evolution Overview
This guide presents four progressive optimization techniques:
- Naive O(N³) - Basic triple-nested loop
- Spatial Locality - Loop reordering for cache efficiency
- Cache Blocking - Tiling to fit cache hierarchy
- SIMD + Blocking - Vectorization with optimal memory access
2. Technique 1: Naive Implementation
The straightforward implementation follows the mathematical definition directly but suffers from poor cache behavior. (All versions in this guide assume C is zero-initialized.)
```cpp
#include <vector>
using std::vector;

// Naive implementation, O(N^3)
void matmul_naive(int N, vector<vector<double>>& A,
                  vector<vector<double>>& B,
                  vector<vector<double>>& C) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            for (int k = 0; k < N; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
```
2.1 Performance Issues
- Poor spatial locality: `B[k][j]` accesses stride through the columns of a row-major matrix (illustrated below)
- Cache misses: each `B` access likely misses cache for large matrices
- Memory bandwidth: inefficient use of loaded cache lines
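To make the stride concrete, the flat-array helper below (an illustration only, not part of the implementations in this guide) shows the row-major address calculation: in the i-j-k order the inner loop varies `k`, so successive `B` reads are `N * sizeof(double)` bytes apart and touch a new cache line almost every iteration.

```cpp
#include <cstddef>
#include <vector>

// Row-major layout: element (r, c) of an N x N matrix stored in one flat
// array lives at offset r * N + c.
inline double& at(std::vector<double>& M, int N, int r, int c) {
    return M[static_cast<std::size_t>(r) * N + c];
}

// In the i-j-k order the inner loop varies k, so successive B reads hit
// offsets k*N + j, (k+1)*N + j, ... : a stride of N doubles, which lands on
// a different cache line almost every iteration for large N.
```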
3. Technique 2: Spatial Locality Optimization
Reordering loops to improve memory access patterns dramatically improves performance.
```cpp
void matmul_spatial_locality(int N, vector<vector<double>>& A,
                             vector<vector<double>>& B,
                             vector<vector<double>>& C) {
    // Loop reordering: i-k-j instead of i-j-k
    for (int i = 0; i < N; ++i) {
        for (int k = 0; k < N; ++k) {
            for (int j = 0; j < N; ++j) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
```
3.1 Key Optimization
- Sequential access: `B[k][j]` is now accessed sequentially in the inner loop
- Cache reuse: each `B[k][*]` row is loaded once per k-iteration
- Bandwidth utilization: full cache-line usage for row-major matrices
3.2 Performance Gain
Typically a 2-4x speedup for large matrices, because the strided (cache-missing) accesses to B are replaced by sequential ones.
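To check such numbers on your own hardware, the two versions above can be timed directly and converted to GFLOP/s (a matrix multiply performs 2·N³ floating-point operations). The harness below is a minimal sketch that assumes the functions above are in scope; the size N = 1024 and the random seed are arbitrary choices.

```cpp
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>
using std::vector;

int main() {
    const int N = 1024;  // arbitrary benchmark size
    vector<vector<double>> A(N, vector<double>(N)), B(N, vector<double>(N)),
                           C(N, vector<double>(N, 0.0));
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) { A[i][j] = dist(gen); B[i][j] = dist(gen); }

    auto t0 = std::chrono::steady_clock::now();
    matmul_spatial_locality(N, A, B, C);  // swap in matmul_naive to compare
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    double gflops  = 2.0 * N * N * N / seconds / 1e9;
    std::printf("N=%d: %.3f s, %.2f GFLOP/s\n", N, seconds, gflops);
}
```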
4. Technique 3: Cache Blocking (Tiling)
Blocking divides computation into smaller tiles that fit in cache hierarchy.
```cpp
#include <algorithm>
using std::min;

void matmul_blocking(int N, vector<vector<double>>& A,
                     vector<vector<double>>& B,
                     vector<vector<double>>& C,
                     int BLOCK_SIZE) {
    for (int i = 0; i < N; i += BLOCK_SIZE) {
        for (int j = 0; j < N; j += BLOCK_SIZE) {
            for (int k = 0; k < N; k += BLOCK_SIZE) {
                // Process one BLOCK_SIZE x BLOCK_SIZE tile
                // (the SIMD version below additionally reorders the inner loops to i-k-j)
                for (int ii = i; ii < min(i + BLOCK_SIZE, N); ++ii) {
                    for (int jj = j; jj < min(j + BLOCK_SIZE, N); ++jj) {
                        for (int kk = k; kk < min(k + BLOCK_SIZE, N); ++kk) {
                            C[ii][jj] += A[ii][kk] * B[kk][jj];
                        }
                    }
                }
            }
        }
    }
}
```
4.1 Cache Analysis
For L1 cache optimization with a 32 KB capacity, one A block and one B block should stay resident at the same time:

Memory per pair of blocks: 2 × BLOCK_SIZE² × 8 bytes ≤ 32 KB
BLOCK_SIZE ≤ √(32768 / 16) = √2048 ≈ 45

A practical choice is a multiple of the 4-wide SIMD vector near this bound, e.g. 40 or 44. The examples here use BLOCK_SIZE = 48, a common round figure that slightly exceeds the two-block budget but usually still performs well; the exact value should be tuned per machine.
4.2 Performance Characteristics
- Cache reuse: each data block is reused multiple times before eviction
- Reduced memory traffic: misses in the outer cache levels (and DRAM accesses) are minimized
- Scalability: performance degrades far more gracefully as matrices outgrow the caches
4.3 Trade-offs
- Overhead: blocking adds loop overhead and complexity that can hurt for small matrices
- Significant gains: large matrices see substantial improvement
- Tuning required: the block size must match the cache hierarchy (a simple sweep is sketched below)
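Because the best block size depends on the machine, it is usually found empirically. The helper below is a rough sketch of such a sweep: it times `matmul_blocking` for a few candidate sizes (the candidate list is an assumption) and returns the fastest.

```cpp
#include <chrono>
#include <vector>
using std::vector;

// Time matmul_blocking for a few candidate block sizes and return the fastest.
int tune_block_size(int N, vector<vector<double>>& A,
                    vector<vector<double>>& B) {
    const int candidates[] = {16, 24, 32, 40, 48, 64};  // candidate sizes: an assumption
    int best = candidates[0];
    double best_time = 1e300;
    for (int bs : candidates) {
        vector<vector<double>> C(N, vector<double>(N, 0.0));
        auto t0 = std::chrono::steady_clock::now();
        matmul_blocking(N, A, B, C, bs);
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        if (s < best_time) { best_time = s; best = bs; }
    }
    return best;
}
```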
5. Technique 4: SIMD + Blocking Optimization
Combining blocking with explicit vectorization gives the best performance of the four techniques presented here.
```cpp
#include <immintrin.h>  // AVX2/FMA intrinsics (compile with -mavx2 -mfma or -march=native)

void matmul_blocked_simd(
    const vector<vector<double>>& A,
    const vector<vector<double>>& B,
    vector<vector<double>>& C,
    int N,
    int BLOCK_SIZE = 48  // roughly sized for a 32KB L1 cache (see Section 4.1)
) {
    for (int ii = 0; ii < N; ii += BLOCK_SIZE) {
        int i_max = min(ii + BLOCK_SIZE, N);
        for (int kk = 0; kk < N; kk += BLOCK_SIZE) {
            int k_max = min(kk + BLOCK_SIZE, N);
            for (int jj = 0; jj < N; jj += BLOCK_SIZE) {
                int j_max = min(jj + BLOCK_SIZE, N);
                // Process block with the cache-friendly i-k-j order
                for (int i = ii; i < i_max; ++i) {
                    for (int k = kk; k < k_max; ++k) {
                        // Broadcast A[i][k] to all SIMD lanes
                        __m256d a_val = _mm256_set1_pd(A[i][k]);
                        int j = jj;
                        // SIMD inner loop - process 4 doubles at once
                        for (; j <= j_max - 4; j += 4) {
                            __m256d c_vec = _mm256_loadu_pd(&C[i][j]);
                            __m256d b_vec = _mm256_loadu_pd(&B[k][j]);
                            // Fused multiply-add: C[i][j] += A[i][k] * B[k][j]
                            c_vec = _mm256_fmadd_pd(a_val, b_vec, c_vec);
                            _mm256_storeu_pd(&C[i][j], c_vec);
                        }
                        // Scalar cleanup for remaining columns
                        for (; j < j_max; ++j) {
                            C[i][j] += A[i][k] * B[k][j];
                        }
                    }
                }
            }
        }
    }
}
```
5.1 SIMD Optimizations
Vectorization Strategy
- 4-wide vectors: process 4 double-precision values simultaneously
- Broadcast: a single `A[i][k]` value is broadcast to all SIMD lanes
- FMA instruction: fused multiply-add reduces the instruction count
- Cleanup loop: handles remaining elements when the block width is not divisible by the vector width
Performance Benefits
- Data-level parallelism: up to a 4x theoretical speedup from 4-wide vectors
- Reduced instruction count: FMA combines the multiply and add operations
- Memory efficiency: the cache-friendly blocked access pattern is preserved
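These intrinsics require a CPU with AVX2 and FMA. If the binary may run on older x86 hardware, a runtime check can fall back to the scalar blocked kernel. The sketch below uses the GCC/Clang builtin `__builtin_cpu_supports` (compiler-specific) and assumes the SIMD kernel itself is compiled with AVX2/FMA enabled, e.g. in a separate translation unit or via a function target attribute; `matmul_dispatch` is an illustrative name.

```cpp
#include <vector>
using std::vector;

// Pick the SIMD kernel only when the CPU supports AVX2 and FMA
// (illustrative wrapper; __builtin_cpu_supports is GCC/Clang-specific).
void matmul_dispatch(int N, vector<vector<double>>& A,
                     vector<vector<double>>& B,
                     vector<vector<double>>& C) {
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma")) {
        matmul_blocked_simd(A, B, C, N);                 // vectorized path
    } else {
        matmul_blocking(N, A, B, C, /*BLOCK_SIZE=*/48);  // portable fallback
    }
}
```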
6. Performance Comparison
6.1 Expected Speedup (Large Matrices)
| Technique | Relative Performance | Key Benefit |
|---|---|---|
| Naive | 1.0x | Baseline |
| Spatial Locality | 2-4x | Sequential access to B |
| Cache Blocking | 3-6x | Maximizes cache reuse |
| SIMD + Blocking | 8-15x | Vector parallelism + cache optimization |
6.2 Matrix Size Impact
- Small matrices (N < 64): Blocking overhead may hurt performance
- Medium matrices (64 ≤ N ≤ 512): Spatial locality dominates
- Large matrices (N > 512): Blocking and SIMD essential
7. Implementation Guidelines
7.1 Block Size Selection
```cpp
// L1 cache sizing calculation
constexpr int L1_CACHE_SIZE   = 32 * 1024;  // 32 KB is typical for L1d
constexpr int DOUBLE_SIZE     = 8;
constexpr int BLOCKS_IN_CACHE = 2;          // one A block + one B block
// std::sqrt is not constexpr in portable C++, so precompute the bound:
// sqrt(32768 / (2 * 8)) = sqrt(2048) ≈ 45
constexpr int OPTIMAL_BLOCK_SIZE = 45;
```
7.2 Compiler Optimizations
```sh
# Recommended compiler flags
# -march=native enables the AVX2/FMA instructions used above;
# -ffast-math relaxes strict IEEE floating-point semantics.
-O3 -march=native -ffast-math -funroll-loops
```
7.3 Memory Alignment
```cpp
// alignas() on a std::vector aligns only the vector object, not its heap
// storage, so it does not give SIMD-aligned rows.  The loadu/storeu intrinsics
// above tolerate unaligned data; for aligned loads use a flat aligned buffer:
double* data = static_cast<double*>(std::aligned_alloc(32, N * N * sizeof(double)));  // size must be a multiple of 32
```
8. Advanced Techniques
8.1 Multi-threading
```cpp
// Parallelize the outer block grid; each thread owns distinct (ii, jj) blocks
// of C, so the writes are disjoint.  Compile with -fopenmp.
#pragma omp parallel for collapse(2)
for (int ii = 0; ii < N; ii += BLOCK_SIZE) {
    for (int jj = 0; jj < N; jj += BLOCK_SIZE) {
        // Per-block computation as in the blocked kernels above
    }
}
```
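A complete version of this idea, combining OpenMP with the scalar blocked kernel from Technique 3, might look like the sketch below (`matmul_blocked_parallel` is an illustrative name). Parallelizing over the (ii, jj) block grid keeps each thread's writes to C disjoint, so no synchronization is needed; parallelizing over the k dimension would require a reduction.

```cpp
#include <algorithm>
#include <vector>
using std::min;
using std::vector;

// Blocked multiply with the outer block grid parallelized via OpenMP.
void matmul_blocked_parallel(int N, const vector<vector<double>>& A,
                             const vector<vector<double>>& B,
                             vector<vector<double>>& C, int BLOCK_SIZE) {
    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < N; ii += BLOCK_SIZE) {
        for (int jj = 0; jj < N; jj += BLOCK_SIZE) {
            for (int kk = 0; kk < N; kk += BLOCK_SIZE) {  // each thread owns its (ii, jj) block of C
                for (int i = ii; i < min(ii + BLOCK_SIZE, N); ++i) {
                    for (int k = kk; k < min(kk + BLOCK_SIZE, N); ++k) {
                        double a = A[i][k];
                        for (int j = jj; j < min(jj + BLOCK_SIZE, N); ++j) {
                            C[i][j] += a * B[k][j];
                        }
                    }
                }
            }
        }
    }
}
```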
8.2 Cache-Oblivious Algorithms
- Recursive blocking: automatically adapts to the cache hierarchy (a minimal sketch follows)
- No tuning required: behaves well even when cache sizes are unknown
- Complex implementation: higher development complexity
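A minimal sketch of the idea (without the optimized base-case kernels a production cache-oblivious implementation would use) splits the largest of the three dimensions in half until the sub-problem is small:

```cpp
#include <vector>
using std::vector;

// Recursive blocking: compute C[i0:i1, j0:j1] += A[i0:i1, k0:k1] * B[k0:k1, j0:j1]
// by halving the largest dimension until a small base case remains.
void matmul_recursive(const vector<vector<double>>& A,
                      const vector<vector<double>>& B,
                      vector<vector<double>>& C,
                      int i0, int i1, int j0, int j1, int k0, int k1) {
    int di = i1 - i0, dj = j1 - j0, dk = k1 - k0;
    const int BASE = 32;  // base-case size: an assumption, tune per machine
    if (di <= BASE && dj <= BASE && dk <= BASE) {
        for (int i = i0; i < i1; ++i)
            for (int k = k0; k < k1; ++k)
                for (int j = j0; j < j1; ++j)
                    C[i][j] += A[i][k] * B[k][j];
    } else if (di >= dj && di >= dk) {   // split rows of A and C
        int m = i0 + di / 2;
        matmul_recursive(A, B, C, i0, m, j0, j1, k0, k1);
        matmul_recursive(A, B, C, m, i1, j0, j1, k0, k1);
    } else if (dj >= dk) {               // split columns of B and C
        int m = j0 + dj / 2;
        matmul_recursive(A, B, C, i0, i1, j0, m, k0, k1);
        matmul_recursive(A, B, C, i0, i1, m, j1, k0, k1);
    } else {                             // split the shared k dimension
        int m = k0 + dk / 2;
        matmul_recursive(A, B, C, i0, i1, j0, j1, k0, m);
        matmul_recursive(A, B, C, i0, i1, j0, j1, m, k1);
    }
}
```

It is called as `matmul_recursive(A, B, C, 0, N, 0, N, 0, N)`; the recursion naturally produces working sets that fit each cache level, which is what makes the approach cache-oblivious.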
8.3 Specialized Libraries
- BLAS: highly optimized vendor implementations (an example `dgemm` call is shown below)
- Commercial libraries: CPU-specific optimizations
- OpenBLAS: open-source alternative
- GPU libraries: GPU acceleration frameworks such as cuBLAS
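These libraries are typically called through the BLAS `dgemm` interface. The sketch below assumes a CBLAS implementation such as OpenBLAS is installed (link with `-lopenblas`); note that BLAS operates on flat row- or column-major arrays rather than `vector<vector<double>>`.

```cpp
#include <cblas.h>
#include <vector>

// C = A * B for flat row-major N x N matrices (alpha = 1.0, beta = 0.0).
void matmul_blas(int N, const std::vector<double>& A,
                 const std::vector<double>& B, std::vector<double>& C) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,
                1.0, A.data(), N,   // A and its leading dimension
                     B.data(), N,   // B and its leading dimension
                0.0, C.data(), N);  // C and its leading dimension
}
```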
9. Profiling and Analysis
9.1 Performance Metrics
- FLOPS: Floating-point operations per second
- Cache miss rate: L1/L2/L3 miss statistics
- Memory bandwidth: Bytes transferred per second
- Vectorization efficiency: SIMD instruction usage
9.2 Tools
- Commercial profilers: detailed performance analysis
- perf: Linux performance profiling (example invocation below)
- Valgrind: cache simulation and analysis via cachegrind
- Hardware counters: direct cache-miss measurement
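For example, on Linux the hardware counters and a cache simulation can be collected as follows (`./matmul` is a placeholder binary name):

```sh
# Hardware counters: cache references/misses, instructions, cycles
perf stat -e cache-references,cache-misses,instructions,cycles ./matmul

# Simulated cache behaviour, attributed to source lines
valgrind --tool=cachegrind ./matmul
cg_annotate cachegrind.out.<pid>
```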
10. Key Takeaways
10.1 Optimization Hierarchy
- Algorithm complexity: Choose optimal algorithmic approach
- Cache optimization: Maximize temporal and spatial locality
- Vectorization: Utilize SIMD instruction sets
- Parallelization: Scale across multiple cores
10.2 Performance Principles
- Memory hierarchy awareness: Design for cache behavior
- Data layout: row-major vs. column-major storage determines which loop order is cache-friendly
- Instruction-level parallelism: Compiler and hardware cooperation
- Measurement: Profile before and after optimizations
Matrix multiplication optimization demonstrates fundamental performance engineering: understanding hardware characteristics, optimizing memory access patterns, and leveraging specialized instructions for maximum computational efficiency.