Tags: performance, advanced, optimization, cache, simd, matrix, blocking

Matrix Multiplication Optimization Techniques

Progressive optimization of matrix multiplication: from naive O(N³) to cache-blocked SIMD implementations.

25 min read
Updated 9/8/2024

Prerequisites

Make sure you're familiar with these concepts before diving in:

C++ Fundamentals and Performance
Computer Architecture Basics

Learning Objectives

By the end of this topic, you will be able to:

Understand cache locality impacts on performance
Master loop reordering and blocking techniques
Apply SIMD vectorization for computational kernels
Analyze performance characteristics of optimization strategies


Matrix multiplication is a fundamental operation in scientific computing that demonstrates key optimization principles: cache locality, loop optimization, and vectorization.
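
As a reminder of what every implementation below computes: for N×N matrices,

C[i][j] = Σ_k A[i][k] · B[k][j],  k = 0 … N−1

which amounts to 2N³ floating-point operations (N³ multiplies and N³ additions), a figure used later when converting runtimes to FLOPS.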

1. Performance Evolution Overview

This guide presents four progressive optimization techniques:

  1. Naive O(N³) - Basic triple-nested loop
  2. Spatial Locality - Loop reordering for cache efficiency
  3. Cache Blocking - Tiling to fit cache hierarchy
  4. SIMD + Blocking - Vectorization with optimal memory access

2. Technique 1: Naive Implementation

The straightforward implementation follows the mathematical definition directly but suffers from poor cache behavior.

// Naive implementation, O(N^3). Conventions for every snippet in this guide:
// #include <vector>, #include <algorithm>, using std::vector, and using std::min
// are assumed, and C is zero-initialized before each call.
void matmul_naive(int N, const vector<vector<double>>& A,
                  const vector<vector<double>>& B,
                  vector<vector<double>>& C) {
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            for (int k = 0; k < N; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

2.1 Performance Issues

  • Poor spatial locality: accesses to B[k][j] stride down a column, so each inner-loop iteration touches a different cache line
  • Cache misses: Each B access likely misses cache for large matrices
  • Memory bandwidth: Inefficient use of loaded cache lines
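
To make the stride problem concrete, the following micro-benchmark sketch (illustrative, not part of the guide's kernels) sums a flat row-major N×N matrix first row by row (stride-1) and then column by column (stride-N); for large N the second version is typically several times slower because almost every access lands on a new cache line.

// Hypothetical illustration of stride-1 vs stride-N access over row-major data.
double sum_rows(const vector<double>& M, int N) {   // stride-1: consecutive addresses
    double s = 0.0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s += M[i * N + j];
    return s;
}

double sum_cols(const vector<double>& M, int N) {   // stride-N: new cache line per access
    double s = 0.0;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s += M[i * N + j];
    return s;
}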

3. Technique 2: Spatial Locality Optimization

Reordering loops to improve memory access patterns dramatically improves performance.

void matmul_spatial_locality(int N, const vector<vector<double>>& A,
                             const vector<vector<double>>& B,
                             vector<vector<double>>& C) {
    // Loop reordering: i-k-j instead of i-j-k
    for (int i = 0; i < N; ++i) {
        for (int k = 0; k < N; ++k) {
            for (int j = 0; j < N; ++j) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

3.1 Key Optimization

  • Sequential access: B[k][j] now accessed sequentially in inner loop
  • Cache reuse: Each B[k][*] row loaded once per k-iteration
  • Bandwidth utilization: Full cache line usage for row-major matrices

3.2 Performance Gain

Typically a 2-4x speedup for large matrices, because the inner loop now streams through B row by row instead of missing the cache on nearly every access.

4. Technique 3: Cache Blocking (Tiling)

Blocking divides computation into smaller tiles that fit in cache hierarchy.

void matmul_blocking(int N, const vector<vector<double>>& A,
                     const vector<vector<double>>& B,
                     vector<vector<double>>& C,
                     int BLOCK_SIZE) {
    for (int i = 0; i < N; i += BLOCK_SIZE) {
        for (int j = 0; j < N; j += BLOCK_SIZE) {
            for (int k = 0; k < N; k += BLOCK_SIZE) {
                // Process block
                for (int ii = i; ii < min(i + BLOCK_SIZE, N); ++ii) {
                    for (int jj = j; jj < min(j + BLOCK_SIZE, N); ++jj) {
                        for (int kk = k; kk < min(k + BLOCK_SIZE, N); ++kk) {
                            C[ii][jj] += A[ii][kk] * B[kk][jj];
                        }
                    }
                }
            }
        }
    }
}

4.1 Cache Analysis

For L1 optimization with a 32KB L1 data cache, one block of A and one block of B should stay resident at the same time:

Memory requirement per block pair: BLOCK_SIZE² × 2 × 8 bytes ≤ 32KB
BLOCK_SIZE ≤ √(32768 / 16) = √2048 ≈ 45

In practice a BLOCK_SIZE in the 40-48 range works well; 48 is a convenient multiple of the 4-wide SIMD width, and although two 48×48 double blocks total roughly 36KB, the small overflow is absorbed by L2.

4.2 Performance Characteristics

  • Cache reuse: Data blocks reused multiple times before eviction
  • Reduced memory traffic: far fewer requests fall through to L2, L3, and DRAM
  • Scalability: Performance maintained as matrix size increases

4.3 Trade-offs

  • Overhead: Blocking adds complexity for small matrices
  • Significant gains: Large matrices see substantial improvement
  • Tuning required: Block size must match cache hierarchy

5. Technique 4: SIMD + Blocking Optimization

Combining blocking with vectorization achieves peak performance.

#include <immintrin.h>  // AVX2/FMA intrinsics (compile with -mavx2 -mfma or -march=native)
 
void matmul_blocked_simd(
    const vector<vector<double>>& A,
    const vector<vector<double>>& B,
    vector<vector<double>>& C,
    int N,
    int BLOCK_SIZE = 48  // Fits in 32KB L1 cache
) {
    for (int ii = 0; ii < N; ii += BLOCK_SIZE) {
        int i_max = min(ii + BLOCK_SIZE, N);
        for (int kk = 0; kk < N; kk += BLOCK_SIZE) {
            int k_max = min(kk + BLOCK_SIZE, N);
            for (int jj = 0; jj < N; jj += BLOCK_SIZE) {
                int j_max = min(jj + BLOCK_SIZE, N);
 
                // Process block with SIMD
                for (int i = ii; i < i_max; ++i) {
                    for (int k = kk; k < k_max; ++k) {
                        // Broadcast A[i][k] to all SIMD lanes
                        __m256d a_val = _mm256_set1_pd(A[i][k]);
                        int j = jj;
 
                        // SIMD inner loop - process 4 doubles at once
                        for (; j <= j_max - 4; j += 4) {
                            __m256d c_vec = _mm256_loadu_pd(&C[i][j]);
                            __m256d b_vec = _mm256_loadu_pd(&B[k][j]);
                            // Fused multiply-add: C[i][j] += A[i][k] * B[k][j]
                            c_vec = _mm256_fmadd_pd(a_val, b_vec, c_vec);
                            _mm256_storeu_pd(&C[i][j], c_vec);
                        }
 
                        // Scalar cleanup for remaining columns
                        for (; j < j_max; ++j) {
                            C[i][j] += A[i][k] * B[k][j];
                        }
                    }
                }
            }
        }
    }
}

5.1 SIMD Optimizations

Vectorization Strategy

  • 4-wide vectors: Process 4 double-precision values simultaneously
  • Broadcast: Single A[i][k] value broadcast to all SIMD lanes
  • FMA instruction: Fused multiply-add reduces instruction count
  • Cleanup loop: Handle remaining elements not divisible by vector width

Performance Benefits

  • Data-level parallelism: 4x theoretical speedup from 4-wide double vectors
  • Reduced instruction count: FMA combines multiply and add operations
  • Memory efficiency: Optimal memory access patterns maintained
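
Since _mm256_fmadd_pd requires both AVX2 and FMA support, a guard along the following lines keeps the code portable by falling back to the blocked scalar kernel when those instruction sets are not enabled. This is a sketch assuming the GCC/Clang predefined macros __AVX2__ and __FMA__; the wrapper name matmul_best is illustrative.

// Compile-time dispatch between the vectorized and scalar blocked kernels.
void matmul_best(int N, const vector<vector<double>>& A,
                 const vector<vector<double>>& B,
                 vector<vector<double>>& C, int BLOCK_SIZE) {
#if defined(__AVX2__) && defined(__FMA__)
    matmul_blocked_simd(A, B, C, N, BLOCK_SIZE);   // vector path
#else
    matmul_blocking(N, A, B, C, BLOCK_SIZE);       // portable scalar fallback
#endif
}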

6. Performance Comparison

6.1 Expected Speedup (Large Matrices)

Technique        | Relative Performance | Key Benefit
-----------------|----------------------|----------------------------------------
Naive            | 1.0x                 | Baseline
Spatial Locality | 2-4x                 | Eliminates cache misses
Cache Blocking   | 3-6x                 | Maximizes cache reuse
SIMD + Blocking  | 8-15x                | Vector parallelism + cache optimization

6.2 Matrix Size Impact

  • Small matrices (N < 64): Blocking overhead may hurt performance
  • Medium matrices (64 ≤ N ≤ 512): Spatial locality dominates
  • Large matrices (N > 512): Blocking and SIMD essential
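
To check these ranges on your own machine, a minimal timing sketch such as the one below can be used. The harness is illustrative and assumes the functions defined earlier in this guide; absolute numbers depend heavily on CPU, cache sizes, and compiler flags.

#include <chrono>
#include <iostream>

// Illustrative harness: time one call of a matmul routine on N x N inputs.
template <typename MatMul>
double time_matmul(int N, MatMul&& mm) {
    vector<vector<double>> A(N, vector<double>(N, 1.0));
    vector<vector<double>> B(N, vector<double>(N, 2.0));
    vector<vector<double>> C(N, vector<double>(N, 0.0));
    auto t0 = std::chrono::steady_clock::now();
    mm(N, A, B, C);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();   // seconds
}

int main() {
    const int N = 1024, BS = 48;
    std::cout << "naive:     " << time_matmul(N, matmul_naive) << " s\n";
    std::cout << "reordered: " << time_matmul(N, matmul_spatial_locality) << " s\n";
    std::cout << "blocked:   " << time_matmul(N, [&](int n, auto& A, auto& B, auto& C) {
                     matmul_blocking(n, A, B, C, BS); }) << " s\n";
    std::cout << "simd:      " << time_matmul(N, [&](int n, auto& A, auto& B, auto& C) {
                     matmul_blocked_simd(A, B, C, n, BS); }) << " s\n";
}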

7. Implementation Guidelines

7.1 Block Size Selection

// L1 cache sizing calculation
#include <cmath>

constexpr int L1_CACHE_SIZE   = 32 * 1024;  // 32KB typical L1 data cache
constexpr int DOUBLE_SIZE     = 8;
constexpr int BLOCKS_IN_CACHE = 2;          // one block each of A and B
// std::sqrt is not constexpr in standard C++ (pre-C++26), so compute this once
// at startup; the result is sqrt(2048) ≈ 45.
const int OPTIMAL_BLOCK_SIZE =
    static_cast<int>(std::sqrt(double(L1_CACHE_SIZE / (BLOCKS_IN_CACHE * DOUBLE_SIZE))));

7.2 Compiler Optimizations

# Recommended compiler flags; -ffast-math relaxes IEEE floating-point semantics,
# so drop it if bit-exact results matter. Example invocation (file name is illustrative):
g++ -O3 -march=native -ffast-math -funroll-loops matmul.cpp -o matmul

7.3 Memory Alignment

Note that alignas(32) applied to a std::vector aligns the vector object itself, not the heap buffer it allocates, so it does not guarantee 32-byte-aligned matrix data. The SIMD kernel above sidesteps this by using unaligned loads and stores (_mm256_loadu_pd / _mm256_storeu_pd).
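
For guaranteed alignment, one option is a flat, row-major buffer allocated with std::aligned_alloc (C++17). The struct below is an illustrative sketch, not code from the kernels above:

#include <cstdlib>
#include <cstddef>

// Illustrative flat row-major matrix with 32-byte-aligned storage.
struct FlatMatrix {
    std::size_t n;
    double*     data;

    explicit FlatMatrix(std::size_t N)
        : n(N),
          // std::aligned_alloc requires the size to be a multiple of the alignment,
          // so round the byte count up to the next multiple of 32.
          data(static_cast<double*>(std::aligned_alloc(
              32, ((N * N * sizeof(double) + 31) / 32) * 32))) {}

    FlatMatrix(const FlatMatrix&) = delete;            // no accidental copies
    FlatMatrix& operator=(const FlatMatrix&) = delete;
    ~FlatMatrix() { std::free(data); }

    double& operator()(std::size_t i, std::size_t j) { return data[i * n + j]; }
};

With this layout, element (i, j) lives at data[i * n + j]; when N is a multiple of 4 every row starts on a 32-byte boundary, so the aligned load variant (_mm256_load_pd) can be used.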

8. Advanced Techniques

8.1 Multi-threading

#pragma omp parallel for collapse(2)
for (int ii = 0; ii < N; ii += BLOCK_SIZE) {
    for (int jj = 0; jj < N; jj += BLOCK_SIZE) {
        // Parallel block processing
    }
}
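
One way to flesh out the fragment above is sketched below, under the same assumptions as the earlier kernels (nested vectors, zero-initialized C). Each thread receives disjoint (ii, jj) tiles of C, so no synchronization is needed; compile with -fopenmp.

// Blocked matmul with the (ii, jj) tile pairs distributed across threads.
void matmul_blocking_omp(int N, const vector<vector<double>>& A,
                         const vector<vector<double>>& B,
                         vector<vector<double>>& C, int BLOCK_SIZE) {
    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < N; ii += BLOCK_SIZE) {
        for (int jj = 0; jj < N; jj += BLOCK_SIZE) {
            for (int kk = 0; kk < N; kk += BLOCK_SIZE) {
                for (int i = ii; i < min(ii + BLOCK_SIZE, N); ++i)
                    for (int j = jj; j < min(jj + BLOCK_SIZE, N); ++j)
                        for (int k = kk; k < min(kk + BLOCK_SIZE, N); ++k)
                            C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}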

8.2 Cache-Oblivious Algorithms

  • Recursive blocking: Automatically adapts to cache hierarchy
  • No tuning required: Optimal for unknown cache sizes
  • Complex implementation: Higher development complexity
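
A minimal sketch of the recursive-blocking idea follows, assuming a flat row-major layout with leading dimension ld and a power-of-two n; the function name and the cutoff value are illustrative. Each level splits the matrices into quadrants, so the working set halves until it fits in some cache level, whatever its size.

// Cache-oblivious recursive matmul: C += A * B on n x n submatrices.
void matmul_recursive(const double* A, const double* B, double* C,
                      int n, int ld, int cutoff = 64) {
    if (n <= cutoff) {
        // Base case: simple i-k-j kernel on a small tile.
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
        return;
    }
    int h = n / 2;  // assumes n is a power of two for simplicity
    const double *A11 = A,          *A12 = A + h,
                 *A21 = A + h * ld, *A22 = A + h * ld + h;
    const double *B11 = B,          *B12 = B + h,
                 *B21 = B + h * ld, *B22 = B + h * ld + h;
    double *C11 = C,          *C12 = C + h,
           *C21 = C + h * ld, *C22 = C + h * ld + h;
    // C11 += A11*B11 + A12*B21, and similarly for the other quadrants.
    matmul_recursive(A11, B11, C11, h, ld, cutoff);
    matmul_recursive(A12, B21, C11, h, ld, cutoff);
    matmul_recursive(A11, B12, C12, h, ld, cutoff);
    matmul_recursive(A12, B22, C12, h, ld, cutoff);
    matmul_recursive(A21, B11, C21, h, ld, cutoff);
    matmul_recursive(A22, B21, C21, h, ld, cutoff);
    matmul_recursive(A21, B12, C22, h, ld, cutoff);
    matmul_recursive(A22, B22, C22, h, ld, cutoff);
}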

8.3 Specialized Libraries

  • BLAS: Highly optimized vendor implementations
  • Commercial libraries: CPU-specific optimizations
  • OpenBLAS: Open-source alternative
  • GPU libraries: GPU acceleration frameworks
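
In practice, reaching for a tuned library is usually the right call. A call through the standard CBLAS interface (provided by OpenBLAS and other implementations) looks like this for flat row-major N×N matrices; link with, for example, -lopenblas:

#include <cblas.h>

// C = A * B via BLAS dgemm (row-major, flat arrays).
void matmul_blas(int N, const double* A, const double* B, double* C) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,          // M, N, K
                1.0, A, N,        // alpha, A, lda
                B, N,             // B, ldb
                0.0, C, N);       // beta, C, ldc
}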

9. Profiling and Analysis

9.1 Performance Metrics

  • FLOPS: Floating-point operations per second
  • Cache miss rate: L1/L2/L3 miss statistics
  • Memory bandwidth: Bytes transferred per second
  • Vectorization efficiency: SIMD instruction usage
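
For reference, multiplying two N×N matrices performs 2N³ floating-point operations, so a measured runtime converts to throughput as follows:

// Convert a measured runtime in seconds to GFLOP/s for an N x N multiply.
double gflops(int N, double seconds) {
    return 2.0 * N * N * N / seconds / 1e9;
}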

9.2 Tools

  • Commercial profilers: Detailed performance analysis
  • perf: Linux performance profiling
  • Valgrind: Cache simulation and analysis
  • Hardware counters: Direct cache miss measurement
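
For example, perf can report several of these metrics directly using its generic hardware events (exact event names vary by CPU and kernel version):

perf stat -e cycles,instructions,cache-references,cache-misses ./matmul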

10. Key Takeaways

10.1 Optimization Hierarchy

  1. Algorithm complexity: Choose optimal algorithmic approach
  2. Cache optimization: Maximize temporal and spatial locality
  3. Vectorization: Utilize SIMD instruction sets
  4. Parallelization: Scale across multiple cores

10.2 Performance Principles

  • Memory hierarchy awareness: Design for cache behavior
  • Data layout: row-major vs. column-major storage determines which loop order is cache-friendly
  • Instruction-level parallelism: Compiler and hardware cooperation
  • Measurement: Profile before and after optimizations

Matrix multiplication optimization demonstrates fundamental performance engineering: understanding hardware characteristics, optimizing memory access patterns, and leveraging specialized instructions for maximum computational efficiency.