Tags: performance, advanced, optimization, cache, simd, matrix, blocking

Matrix Multiplication Optimization Techniques

Progressive optimization of matrix multiplication: from naive O(N³) to cache-blocked SIMD implementations.

25 min read
Updated 9/8/2024

Prerequisites

Make sure you're familiar with these concepts before diving in:

C++ Fundamentals and Performance
Computer Architecture Basics

Learning Objectives

By the end of this topic, you will be able to:

Understand cache locality impacts on performance
Master loop reordering and blocking techniques
Apply SIMD vectorization for computational kernels
Analyze performance characteristics of optimization strategies


Matrix multiplication is a fundamental operation in scientific computing that demonstrates key optimization principles: cache locality, loop optimization, and vectorization.
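
As a reminder of what every implementation below computes: for N×N matrices,

C[i][j] = Σ_k A[i][k] · B[k][j],  k = 0 … N−1

which amounts to 2N³ floating-point operations (N³ multiplies and N³ additions), a figure used later when converting runtimes to FLOPS.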

1. Performance Evolution Overview

This guide presents four progressive optimization techniques:

  1. Naive O(N³) - Basic triple-nested loop
  2. Spatial Locality - Loop reordering for cache efficiency
  3. Cache Blocking - Tiling to fit cache hierarchy
  4. SIMD + Blocking - Vectorization with optimal memory access

2. Technique 1: Naive Implementation

The straightforward implementation follows the mathematical definition directly but suffers from poor cache behavior.

// Naive implementation, O(N^3). Conventions for every snippet in this guide:
// #include <vector>, #include <algorithm>, using std::vector, and using std::min
// are assumed, and C is zero-initialized before each call.
void matmul_naive(int N, const vector<vector<double>>& A,
                  const vector<vector<double>>& B,
                  vector<vector<double>>& C) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            for (int k = 0; k < N; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

2.1 Performance Issues

  • Poor spatial locality: accesses to B[k][j] stride down a column, so each inner-loop iteration touches a different cache line
  • Cache misses: Each B access likely misses cache for large matrices
  • Memory bandwidth: Inefficient use of loaded cache lines
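
To make the stride problem concrete, the following micro-benchmark sketch (illustrative, not part of the guide's kernels) sums a flat row-major N×N matrix first row by row (stride-1) and then column by column (stride-N); for large N the second version is typically several times slower because almost every access lands on a new cache line.

// Hypothetical illustration of stride-1 vs stride-N access over row-major data.
double sum_rows(const vector<double>& M, int N) {   // stride-1: consecutive addresses
    double s = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += M[i * N + j];
    return s;
}

double sum_cols(const vector<double>& M, int N) {   // stride-N: new cache line per access
    double s = 0.0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += M[i * N + j];
    return s;
}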

3. Technique 2: Spatial Locality Optimization

Reordering loops to improve memory access patterns dramatically improves performance.

void matmul_spatial_locality(int N, const vector<vector<double>>& A,
                             const vector<vector<double>>& B,
                             vector<vector<double>>& C) {
    // Loop reordering: i-k-j instead of i-j-k
    for (int i = 0; i < N; ++i) {
        for (int k = 0; k < N; ++k) {
            for (int j = 0; j < N; ++j) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

3.1 Key Optimization

  • Sequential access: B[k][j] now accessed sequentially in inner loop
  • Cache reuse: Each B[k][*] row loaded once per k-iteration
  • Bandwidth utilization: Full cache line usage for row-major matrices

3.2 Performance Gain

Typically a 2-4x speedup for large matrices, because the inner loop now streams through B row by row instead of missing the cache on nearly every access.

4. Technique 3: Cache Blocking (Tiling)

Blocking divides computation into smaller tiles that fit in cache hierarchy.

void matmul_blocking(int N, const vector<vector<double>>& A,
                     const vector<vector<double>>& B,
                     vector<vector<double>>& C,
                     int BLOCK_SIZE) {
    for (int i = 0; i < N; i += BLOCK_SIZE) {
        for (int j = 0; j < N; j += BLOCK_SIZE) {
            for (int k = 0; k < N; k += BLOCK_SIZE) {
                // Process block
                for (int ii = i; ii < min(i + BLOCK_SIZE, N); ++ii) {
                    for (int jj = j; jj < min(j + BLOCK_SIZE, N); ++jj) {
                        for (int kk = k; kk < min(k + BLOCK_SIZE, N); ++kk) {
                            C[ii][jj] += A[ii][kk] * B[kk][jj];
                        }
                    }
                }
            }
        }
    }
}

4.1 Cache Analysis

For L1 optimization with a 32KB L1 data cache, one block of A and one block of B should stay resident at the same time:

Memory requirement per block pair: BLOCK_SIZE² × 2 × 8 bytes ≤ 32KB
BLOCK_SIZE ≤ √(32768 / 16) = √2048 ≈ 45

In practice a BLOCK_SIZE in the 40-48 range works well; 48 is a convenient multiple of the 4-wide SIMD width, and although two 48×48 double blocks total roughly 36KB, the small overflow is absorbed by L2.

4.2 Performance Characteristics

  • Cache reuse: Data blocks reused multiple times before eviction
  • Reduced memory traffic: far fewer requests fall through to L2, L3, and DRAM
  • Scalability: Performance maintained as matrix size increases

4.3 Trade-offs

  • Overhead: Blocking adds complexity for small matrices
  • Significant gains: Large matrices see substantial improvement
  • Tuning required: Block size must match cache hierarchy

5. Technique 4: SIMD + Blocking Optimization

Combining blocking with vectorization achieves peak performance.

#include <immintrin.h>  // AVX2/FMA intrinsics (compile with -mavx2 -mfma or -march=native)
 
void matmul_blocked_simd(
    const vector<vector<double>>& A,
    const vector<vector<double>>& B,
    vector<vector<double>>& C,
    int N,
    int BLOCK_SIZE = 48  // Fits in 32KB L1 cache
) {
    for (int ii = 0; ii < N; ii += BLOCK_SIZE) {
        int i_max = min(ii + BLOCK_SIZE, N);
        for (int kk = 0; kk < N; kk += BLOCK_SIZE) {
            int k_max = min(kk + BLOCK_SIZE, N);
            for (int jj = 0; jj < N; jj += BLOCK_SIZE) {
                int j_max = min(jj + BLOCK_SIZE, N);
 
                // Process block with SIMD
                for (int i = ii; i < i_max; ++i) {
                    for (int k = kk; k < k_max; ++k) {
                        // Broadcast A[i][k] to all SIMD lanes
                        __m256d a_val = _mm256_set1_pd(A[i][k]);
                        int j = jj;
 
                        // SIMD inner loop - process 4 doubles at once
                        for (; j <= j_max - 4; j += 4) {
                            __m256d c_vec = _mm256_loadu_pd(&C[i][j]);
                            __m256d b_vec = _mm256_loadu_pd(&B[k][j]);
                            // Fused multiply-add: C[i][j] += A[i][k] * B[k][j]
                            c_vec = _mm256_fmadd_pd(a_val, b_vec, c_vec);
                            _mm256_storeu_pd(&C[i][j], c_vec);
                        }
 
                        // Scalar cleanup for remaining columns
                        for (; j < j_max; ++j) {
                            C[i][j] += A[i][k] * B[k][j];
                        }
                    }
                }
            }
        }
    }
}

5.1 SIMD Optimizations

Vectorization Strategy

  • 4-wide vectors: Process 4 double-precision values simultaneously
  • Broadcast: Single A[i][k] value broadcast to all SIMD lanes
  • FMA instruction: Fused multiply-add reduces instruction count
  • Cleanup loop: Handle remaining elements not divisible by vector width

Performance Benefits

  • Data-level parallelism: 4x theoretical speedup from 4-wide double vectors
  • Reduced instruction count: FMA combines multiply and add operations
  • Memory efficiency: Optimal memory access patterns maintained
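
Since _mm256_fmadd_pd requires both AVX2 and FMA support, a guard along the following lines keeps the code portable by falling back to the blocked scalar kernel when those instruction sets are not enabled. This is a sketch assuming the GCC/Clang predefined macros __AVX2__ and __FMA__; the wrapper name matmul_best is illustrative.

// Compile-time dispatch between the vectorized and scalar blocked kernels.
void matmul_best(int N, const vector<vector<double>>& A,
                 const vector<vector<double>>& B,
                 vector<vector<double>>& C, int BLOCK_SIZE) {
#if defined(__AVX2__) && defined(__FMA__)
    matmul_blocked_simd(A, B, C, N, BLOCK_SIZE);   // vector path
#else
    matmul_blocking(N, A, B, C, BLOCK_SIZE);       // portable scalar fallback
#endif
}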

6. Performance Comparison

6.1 Expected Speedup (Large Matrices)

Technique        | Relative Performance | Key Benefit
-----------------|----------------------|----------------------------------------
Naive            | 1.0x                 | Baseline
Spatial Locality | 2-4x                 | Eliminates cache misses
Cache Blocking   | 3-6x                 | Maximizes cache reuse
SIMD + Blocking  | 8-15x                | Vector parallelism + cache optimization

6.2 Matrix Size Impact

  • Small matrices (N < 64): Blocking overhead may hurt performance
  • Medium matrices (64 ≤ N ≤ 512): Spatial locality dominates
  • Large matrices (N > 512): Blocking and SIMD essential
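
To check these ranges on your own machine, a minimal timing sketch such as the one below can be used. The harness is illustrative and assumes the functions defined earlier in this guide; absolute numbers depend heavily on CPU, cache sizes, and compiler flags.

#include <chrono>
#include <iostream>

// Illustrative harness: time one call of a matmul routine on N x N inputs.
template <typename MatMul>
double time_matmul(int N, MatMul&& mm) {
    vector<vector<double>> A(N, vector<double>(N, 1.0));
    vector<vector<double>> B(N, vector<double>(N, 2.0));
    vector<vector<double>> C(N, vector<double>(N, 0.0));
    auto t0 = std::chrono::steady_clock::now();
    mm(N, A, B, C);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();   // seconds
}

int main() {
    const int N = 1024, BS = 48;
    std::cout << "naive:     " << time_matmul(N, matmul_naive) << " s\n";
    std::cout << "reordered: " << time_matmul(N, matmul_spatial_locality) << " s\n";
    std::cout << "blocked:   " << time_matmul(N, [&](int n, auto& A, auto& B, auto& C) {
                     matmul_blocking(n, A, B, C, BS); }) << " s\n";
    std::cout << "simd:      " << time_matmul(N, [&](int n, auto& A, auto& B, auto& C) {
                     matmul_blocked_simd(A, B, C, n, BS); }) << " s\n";
}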

7. Implementation Guidelines

7.1 Block Size Selection

// L1 cache sizing calculation
#include <cmath>

constexpr int L1_CACHE_SIZE   = 32 * 1024;  // 32KB typical L1 data cache
constexpr int DOUBLE_SIZE     = 8;
constexpr int BLOCKS_IN_CACHE = 2;          // one block each of A and B
// std::sqrt is not constexpr in standard C++ (pre-C++26), so compute this once
// at startup; the result is sqrt(2048) ≈ 45.
const int OPTIMAL_BLOCK_SIZE =
    static_cast<int>(std::sqrt(double(L1_CACHE_SIZE / (BLOCKS_IN_CACHE * DOUBLE_SIZE))));

7.2 Compiler Optimizations

# Recommended compiler flags; -ffast-math relaxes IEEE floating-point semantics,
# so drop it if bit-exact results matter. Example invocation (file name is illustrative):
g++ -O3 -march=native -ffast-math -funroll-loops matmul.cpp -o matmul

7.3 Memory Alignment

Note that alignas(32) applied to a std::vector aligns the vector object itself, not the heap buffer it allocates, so it does not guarantee 32-byte-aligned matrix data. The SIMD kernel above sidesteps this by using unaligned loads and stores (_mm256_loadu_pd / _mm256_storeu_pd).
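
For guaranteed alignment, one option is a flat, row-major buffer allocated with std::aligned_alloc (C++17). The struct below is an illustrative sketch, not code from the kernels above:

#include <cstdlib>
#include <cstddef>

// Illustrative flat row-major matrix with 32-byte-aligned storage.
struct FlatMatrix {
    std::size_t n;
    double*     data;

    explicit FlatMatrix(std::size_t N)
        : n(N),
          // std::aligned_alloc requires the size to be a multiple of the alignment,
          // so round the byte count up to the next multiple of 32.
          data(static_cast<double*>(std::aligned_alloc(
              32, ((N * N * sizeof(double) + 31) / 32) * 32))) {}

    FlatMatrix(const FlatMatrix&) = delete;            // no accidental copies
    FlatMatrix& operator=(const FlatMatrix&) = delete;
    ~FlatMatrix() { std::free(data); }

    double& operator()(std::size_t i, std::size_t j) { return data[i * n + j]; }
};

With this layout, element (i, j) lives at data[i * n + j]; when N is a multiple of 4 every row starts on a 32-byte boundary, so the aligned load variant (_mm256_load_pd) can be used.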

8. Advanced Techniques

8.1 Multi-threading

#pragma omp parallel for collapse(2)
for (int ii = 0; ii < N; ii += BLOCK_SIZE) {
    for (int jj = 0; jj < N; jj += BLOCK_SIZE) {
        // Parallel block processing
    }
}
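
One way to flesh out the fragment above is sketched below, under the same assumptions as the earlier kernels (nested vectors, zero-initialized C). Each thread receives disjoint (ii, jj) tiles of C, so no synchronization is needed; compile with -fopenmp.

// Blocked matmul with the (ii, jj) tile pairs distributed across threads.
void matmul_blocking_omp(int N, const vector<vector<double>>& A,
                         const vector<vector<double>>& B,
                         vector<vector<double>>& C, int BLOCK_SIZE) {
    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < N; ii += BLOCK_SIZE) {
        for (int jj = 0; jj < N; jj += BLOCK_SIZE) {
            for (int kk = 0; kk < N; kk += BLOCK_SIZE) {
                for (int i = ii; i < min(ii + BLOCK_SIZE, N); ++i)
                    for (int j = jj; j < min(jj + BLOCK_SIZE, N); ++j)
                        for (int k = kk; k < min(kk + BLOCK_SIZE, N); ++k)
                            C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}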

8.2 Cache-Oblivious Algorithms

  • Recursive blocking: Automatically adapts to cache hierarchy
  • No tuning required: Optimal for unknown cache sizes
  • Complex implementation: Higher development complexity
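
A minimal sketch of the recursive-blocking idea follows, assuming a flat row-major layout with leading dimension ld and a power-of-two n; the function name and the cutoff value are illustrative. Each level splits the matrices into quadrants, so the working set halves until it fits in some cache level, whatever its size.

// Cache-oblivious recursive matmul: C += A * B on n x n submatrices.
void matmul_recursive(const double* A, const double* B, double* C,
                      int n, int ld, int cutoff = 64) {
    if (n <= cutoff) {
        // Base case: simple i-k-j kernel on a small tile.
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
        return;
    }
    int h = n / 2;  // assumes n is a power of two for simplicity
    const double *A11 = A,          *A12 = A + h,
                 *A21 = A + h * ld, *A22 = A + h * ld + h;
    const double *B11 = B,          *B12 = B + h,
                 *B21 = B + h * ld, *B22 = B + h * ld + h;
    double *C11 = C,          *C12 = C + h,
           *C21 = C + h * ld, *C22 = C + h * ld + h;
    // C11 += A11*B11 + A12*B21, and similarly for the other quadrants.
    matmul_recursive(A11, B11, C11, h, ld, cutoff);
    matmul_recursive(A12, B21, C11, h, ld, cutoff);
    matmul_recursive(A11, B12, C12, h, ld, cutoff);
    matmul_recursive(A12, B22, C12, h, ld, cutoff);
    matmul_recursive(A21, B11, C21, h, ld, cutoff);
    matmul_recursive(A22, B21, C21, h, ld, cutoff);
    matmul_recursive(A21, B12, C22, h, ld, cutoff);
    matmul_recursive(A22, B22, C22, h, ld, cutoff);
}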

8.3 Specialized Libraries

  • BLAS: Highly optimized vendor implementations
  • Commercial libraries: CPU-specific optimizations
  • OpenBLAS: Open-source alternative
  • GPU libraries: GPU acceleration frameworks
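
In practice, reaching for a tuned library is usually the right call. A call through the standard CBLAS interface (provided by OpenBLAS and other implementations) looks like this for flat row-major N×N matrices; link with, for example, -lopenblas:

#include <cblas.h>

// C = A * B via BLAS dgemm (row-major, flat arrays).
void matmul_blas(int N, const double* A, const double* B, double* C) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,          // M, N, K
                1.0, A, N,        // alpha, A, lda
                B, N,             // B, ldb
                0.0, C, N);       // beta, C, ldc
}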

9. Profiling and Analysis

9.1 Performance Metrics

  • FLOPS: Floating-point operations per second
  • Cache miss rate: L1/L2/L3 miss statistics
  • Memory bandwidth: Bytes transferred per second
  • Vectorization efficiency: SIMD instruction usage
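
For reference, multiplying two N×N matrices performs 2N³ floating-point operations, so a measured runtime converts to throughput as follows:

// Convert a measured runtime in seconds to GFLOP/s for an N x N multiply.
double gflops(int N, double seconds) {
    return 2.0 * N * N * N / seconds / 1e9;
}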

9.2 Tools

  • Commercial profilers: Detailed performance analysis
  • perf: Linux performance profiling
  • Valgrind: Cache simulation and analysis
  • Hardware counters: Direct cache miss measurement
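
For example, perf can report several of these metrics directly using its generic hardware events (exact event names vary by CPU and kernel version):

perf stat -e cycles,instructions,cache-references,cache-misses ./matmul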

10. Key Takeaways

10.1 Optimization Hierarchy

  1. Algorithm complexity: Choose optimal algorithmic approach
  2. Cache optimization: Maximize temporal and spatial locality
  3. Vectorization: Utilize SIMD instruction sets
  4. Parallelization: Scale across multiple cores

10.2 Performance Principles

  • Memory hierarchy awareness: Design for cache behavior
  • Data layout: row-major vs. column-major storage determines which loop order is cache-friendly
  • Instruction-level parallelism: Compiler and hardware cooperation
  • Measurement: Profile before and after optimizations

Matrix multiplication optimization demonstrates fundamental performance engineering: understanding hardware characteristics, optimizing memory access patterns, and leveraging specialized instructions for maximum computational efficiency.