
Memory Hierarchy 101

Understand why memory hierarchy exists, from registers to DRAM. Learn about cache levels, locality principles, and bandwidth vs latency trade-offs.

32 min read
Updated 10/1/2025

Prerequisites

Make sure you're familiar with these concepts before diving in:

Basic computer architecture knowledge
Understanding of binary and hexadecimal
Familiarity with CPU concepts

Learning Objectives

By the end of this topic, you will be able to:

Understand the speed-capacity trade-off driving memory hierarchy design
Master register file, L1/L2/L3 cache, and DRAM organization
Apply spatial and temporal locality principles
Identify cache hits vs misses and their performance impact
Analyze bandwidth vs latency trade-offs in memory systems

Memory Hierarchy 101

The Memory Wall: CPUs can execute billions of operations per second, but fetching data from main memory takes hundreds of cycles! The memory hierarchy is the ingenious solution that makes modern computing practical. Let's understand how it works! 🏗ïļ

1. Why Memory Hierarchy Exists

1.1 The Fundamental Trade-off: Speed vs Capacity

Physics dictates: You can't have fast, large, AND cheap memory simultaneously.

Technology      Speed            Size            Cost/GB
                (access time)    (capacity)      (relative)
SRAM (cache)    0.5-1 ns         KB - MB         $5,000-10,000
DRAM (RAM)      50-100 ns        Gigabytes       $5-10
SSD             ~100 Ξs          Terabytes       ~$0.10
HDD             5-10 ms          Terabytes       ~$0.02

(Fastest, smallest, and most expensive per GB at the top; slowest, largest, and cheapest at the bottom.)

1.2 The Memory Hierarchy Pyramid

        ┌─────────────┐ ← Fastest, Smallest, Most Expensive
        │  Registers  │    1 cycle, ~1 KB
        ├─────────────â”Ī
        │  L1 Cache   │    4 cycles, 32-64 KB
        ├─────────────â”Ī
        │  L2 Cache   │    12 cycles, 256 KB - 1 MB
        ├─────────────â”Ī
        │  L3 Cache   │    40 cycles, 4-32 MB
        ├─────────────â”Ī
        │    DRAM     │    200 cycles, 8-128 GB
        ├─────────────â”Ī
        │     SSD     │    100,000 cycles, 1-4 TB
        └─────────────┘ ← Slowest, Largest, Cheapest

Key Principle: Keep frequently accessed data in fast, small memories. Move data between levels as needed.

2. Registers: The Fastest Storage

Registers are the CPU's own working storage, named directly by instructions.

2.1 Register File Organization

// Typical 64-bit CPU register file
x0-x31:  General-purpose registers (32 × 64-bit)
f0-f31:  Floating-point registers (32 × 64-bit)
PC:      Program Counter
SP:      Stack Pointer
 
Total: roughly 0.5-1 KB of the fastest storage in the machine!

2.2 Register Access

  • Latency: effectively zero extra cycles (registers are read and written within a pipeline stage)
  • Bandwidth: multiple reads and writes per cycle
  • Capacity: limited by instruction encoding (5-6 bits → 32-64 architectural registers)

ADD r1, r2, r3   # All operands in registers - instant access!

Why so few?

  • Larger register files → slower access (longer wires)
  • More bits needed in instructions to encode register numbers
  • Compiler register allocation becomes harder

3. Cache Memory: The Smart Middle Ground

Cache is small, fast SRAM between CPU and main memory.

3.1 Cache Levels

L1 Cache (Primary Cache)

Split design (Harvard architecture):

L1 Instruction Cache (L1-I):  32 KB, read-only
L1 Data Cache (L1-D):          32 KB, read-write
 
Why split?
- Simultaneous instruction fetch and data access (no structural hazard)
- Different access patterns (mostly sequential fetch vs mixed reads and writes)
- Each side can be tuned separately (the read-only I-cache is simpler to build)

Characteristics:

  • Latency: 4-5 cycles
  • Bandwidth: 1-2 TB/s (multiple accesses per cycle)
  • Private: Each core has its own L1

L2 Cache (Secondary Cache)

Size: 256 KB - 1 MB
Latency: 12-15 cycles
Bandwidth: 200-500 GB/s
Organization: Unified (instructions + data)
Access: Private per core (most designs)

Purpose:

  • Holds L1 victims (data evicted from L1)
  • Larger working set than L1
  • Filters memory traffic

L3 Cache (Last-Level Cache, LLC)

Size: 4-32 MB
Latency: 40-50 cycles  
Bandwidth: 50-200 GB/s
Organization: Unified, shared across all cores
Inclusive: Contains copies of data in L1/L2 (design-dependent)

Purpose:

  • Shared cache for inter-core communication
  • Reduces DRAM traffic
  • Victim cache for lower levels
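
Exact sizes vary by processor. As a quick check on your own machine, here is a minimal sketch that queries them at runtime; it assumes Linux with glibc, where sysconf() exposes cache parameters as an extension (other platforms may return 0 or -1).

#include <unistd.h>   // sysconf and the _SC_LEVEL*_CACHE_* constants (glibc extension)
#include <cstdio>

int main() {
    // Each call returns a size in bytes, or 0/-1 if the value is unknown.
    printf("L1 data cache:  %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1 cache line:  %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2 cache:       %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache:       %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}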

3.2 Cache Access: Hits vs Misses

Cache Hit ✅

Data is found in cache → Fast!

// Access pattern
Read address 0x1000
→ Check L1 cache
→ HIT! Data returned in 4 cycles

Performance: Served at cache speed (single-digit cycles)

Cache Miss ❌

Data not in cache → Must fetch from next level.

// Access pattern
Read address 0x2000
→ Check L1 cache → MISS
→ Check L2 cache → MISS  
→ Check L3 cache → MISS
→ Fetch from DRAM (200 cycles!)
→ Fill L3, L2, L1
→ Return data to CPU

Performance: Served at DRAM speed (hundreds of cycles)

Types of Misses

1. Compulsory Miss (Cold Start)

// First access to data - must fetch from memory
int array[1000];
x = array[0];  // Cold miss - never accessed before

2. Capacity Miss

// Working set larger than cache
int array[100000];  // 400 KB > 32 KB L1 cache
for (int i = 0; i < 100000; i++) {
    sum += array[i];  // Many capacity misses
}

3. Conflict Miss (Associativity)

// Multiple addresses map to same cache line
// (More on cache organization in advanced topics)

3.3 Hit Rate vs Miss Rate

Hit Rate = Hits / (Hits + Misses)
Miss Rate = 1 - Hit Rate
 
Example: 95% hit rate
→ 95% of accesses served in 4 cycles (L1)
→ 5% of accesses served in 200 cycles (DRAM)
 
Average access time = 0.95 × 4 + 0.05 × 200 = 13.8 cycles

Small changes in hit rate have huge impact!

90% hit rate → 0.1 × 200 = 20 cycle penalty
95% hit rate → 0.05 × 200 = 10 cycle penalty (2x better!)
99% hit rate → 0.01 × 200 = 2 cycle penalty (10x better!)
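
The same arithmetic is easy to script. A minimal sketch using the illustrative latencies above (4-cycle L1 hit, 200-cycle DRAM miss):

#include <cstdio>
#include <initializer_list>

// Average access time = hit_rate × hit_latency + miss_rate × miss_latency
double average_access_cycles(double hit_rate, double hit_cycles, double miss_cycles) {
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles;
}

int main() {
    for (double hit_rate : {0.90, 0.95, 0.99})
        printf("%.0f%% hit rate -> %.1f cycles on average\n",
               hit_rate * 100.0, average_access_cycles(hit_rate, 4.0, 200.0));
    return 0;
}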

4. Principles of Locality

Why caches work: Programs exhibit predictable access patterns.

4.1 Temporal Locality ⏰

If you access data once, you'll likely access it again soon.

// Loop variable accessed repeatedly
for (int i = 0; i < 1000; i++) {  // 'i' has high temporal locality
    sum += array[i];
}
 
// Hot function called frequently
double compute() { ... }  // Code has temporal locality
for (int i = 0; i < 1000; i++) {
    result = compute();  // Function re-executed
}

Cache benefit: Keep recently accessed data in cache.

4.2 Spatial Locality 📍

If you access data at address X, you'll likely access address X+1, X+2, ... soon.

// Sequential array access
int array[1000];
for (int i = 0; i < 1000; i++) {
    sum += array[i];  // Accesses adjacent elements
}
 
// Struct field accesses
struct Point { int x, y, z; };
Point p;
p.x = 1;  // Accessing p.x
p.y = 2;  // Likely to access p.y next (adjacent in memory)

Cache benefit: Fetch entire cache line (64 bytes), not just one word.

4.3 Cache Line Size

Cache line = 64 bytes (typical)
 
Access array[0] → Fetch array[0..15] (16 ints)
Access array[1] → HIT! Already in cache
Access array[2] → HIT! Already in cache
...
Access array[16] → MISS! Need next cache line

Spatial locality exploitation: 1 cold miss is followed by 15 hits!
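
A quick sketch of that arithmetic for a cold cache, assuming 4-byte ints, 64-byte lines, and a line-aligned array:

#include <cstdio>

int main() {
    const long n = 1000;                                           // array elements
    const long ints_per_line = 64 / sizeof(int);                   // 16 ints per cache line
    const long misses = (n + ints_per_line - 1) / ints_per_line;   // one miss per line touched
    const long hits = n - misses;                                  // the remaining accesses hit in cache
    printf("%ld sequential accesses -> %ld misses, %ld hits (%.1f%% hit rate)\n",
           n, misses, hits, 100.0 * hits / n);
    return 0;
}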

4.4 Bad Access Patterns

Strided Access (Poor Spatial Locality)

// Access every 16th element → waste cache lines
for (int i = 0; i < n; i += 16) {
    sum += array[i];  // Only use 1/16 of each cache line
}

Random Access (No Locality)

// Random access → no temporal or spatial locality
for (int i = 0; i < n; i++) {
    sum += array[random() % n];  // Cache thrashing!
}

5. DRAM: Main Memory

DRAM (Dynamic RAM) is the "working memory" of the computer.

5.1 DRAM Organization

DRAM Chip
├── Banks (8-16 per chip)
│   ├── Rows (thousands)
│   │   └── Columns (thousands)
│   │       └── Cells (1 bit each)

5.2 DRAM Access Protocol

1. Row Activation (RAS - Row Address Strobe)

Select row → Sense amplifiers read entire row into row buffer
Latency: ~15 ns

2. Column Access (CAS - Column Address Strobe)

Select column from row buffer → Transfer data
Latency: ~15 ns

Total first-word latency: ~50-100 ns (150-300 CPU cycles!)
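
The cycle counts in parentheses are just nanoseconds multiplied by the CPU clock rate; for example, assuming a 3 GHz core:

#include <cstdio>
#include <initializer_list>

int main() {
    const double clock_ghz = 3.0;   // assumed CPU clock
    for (double latency_ns : {50.0, 100.0})
        printf("%.0f ns = %.0f CPU cycles at %.1f GHz\n",
               latency_ns, latency_ns * clock_ghz, clock_ghz);
    return 0;
}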

5.3 DRAM Bandwidth vs Latency

Latency: Time to first byte

CAS Latency (CL): 15-20 cycles (at DRAM clock)
First word: ~50-100 ns

Bandwidth: Peak transfer rate

DDR4-3200:
- Clock: 1600 MHz (double data rate)
- Bus width: 64 bits
- Peak bandwidth: 25.6 GB/s
 
DDR5-4800:
- Clock: 2400 MHz
- Peak bandwidth: 38.4 GB/s
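
Those peak numbers fall out of transfer rate × bus width. A sketch of the arithmetic for a single 64-bit channel (real systems multiply this by the number of channels):

#include <cstdio>

// Peak bandwidth = transfers per second × bytes per transfer
double peak_gb_per_second(double mega_transfers_per_s, int bus_bits) {
    return mega_transfers_per_s * 1e6 * (bus_bits / 8.0) / 1e9;
}

int main() {
    printf("DDR4-3200: %.1f GB/s\n", peak_gb_per_second(3200.0, 64));  // 25.6 GB/s
    printf("DDR5-4800: %.1f GB/s\n", peak_gb_per_second(4800.0, 64));  // 38.4 GB/s
    return 0;
}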

Key insight: DRAM is like a truck

  • High latency (takes time to get to warehouse)
  • High bandwidth (carries lots of cargo once moving)

→ Amortize latency by transferring large blocks!

5.4 Row Buffer Locality

Row buffer acts like a cache for one DRAM row.

// Good: Sequential access (same row)
for (int i = 0; i < 1000; i++) {
    data[i] = 0;  // Row buffer hits!
}
Accesses: 1 row activation + 1000 column accesses
 
// Bad: Random access (different rows)
for (int i = 0; i < 1000; i++) {
    data[random() % 1000] = 0;  // Row buffer misses!
}
Accesses: ~1000 row activations + 1000 column accesses (several times slower, and worse still once precharge overhead is counted)
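
A rough cost model makes the gap concrete. It reuses the ~15 ns activation and column timings from above and ignores precharge and burst transfers, so it understates how bad random access really is:

#include <cstdio>

int main() {
    const double t_activate_ns = 15.0;   // row activation (RAS)
    const double t_column_ns   = 15.0;   // column access (CAS)
    const int    accesses      = 1000;

    // Sequential within one open row: activate once, then stream column accesses.
    const double sequential_ns = t_activate_ns + accesses * t_column_ns;

    // Random rows: pay the activation cost on (nearly) every access.
    const double random_ns = accesses * (t_activate_ns + t_column_ns);

    printf("sequential: %.0f ns   random: %.0f ns   (%.1fx slower)\n",
           sequential_ns, random_ns, random_ns / sequential_ns);
    return 0;
}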

6. Bandwidth vs Latency Trade-offs

6.1 Bandwidth-Optimized: GPUs

GPU GDDR6 Memory:
- Latency: 200-300 cycles (worse than CPU DRAM!)
- Bandwidth: 600-900 GB/s (20x better than CPU!)
 
Why?
- Wide bus (384-512 bits vs 64 for CPU)
- Many parallel memory channels
- High clock speeds

Use case: Streaming workloads where you can hide latency with parallelism

6.2 Latency-Optimized: CPUs

CPU DDR4 Memory:
- Latency: 50-80 ns (better than GPU)
- Bandwidth: 25-50 GB/s (limited)
 
Why?
- Sophisticated prefetchers
- Out-of-order execution to hide latency
- Large caches to reduce memory accesses

Use case: Irregular, pointer-chasing workloads
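
One way to quantify the trade-off is the bandwidth-delay product: to keep a memory system busy you need roughly bandwidth × latency bytes in flight. A sketch using rough, assumed figures within the ranges quoted above:

#include <cstdio>

// Outstanding 64-byte cache-line requests needed to sustain a given bandwidth.
double lines_in_flight(double bandwidth_gb_per_s, double latency_ns) {
    const double bytes_in_flight = bandwidth_gb_per_s * latency_ns;  // GB/s × ns = bytes
    return bytes_in_flight / 64.0;
}

int main() {
    printf("CPU DDR4  (~25.6 GB/s, ~80 ns):  %.0f outstanding lines\n",
           lines_in_flight(25.6, 80.0));
    printf("GPU GDDR6 (~700 GB/s, ~250 ns):  %.0f outstanding lines\n",
           lines_in_flight(700.0, 250.0));
    return 0;
}

A few dozen misses in flight is within reach of an out-of-order core and its prefetchers; thousands require the massive thread-level parallelism GPUs are built around.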

7. Memory Access Patterns in Practice

7.1 Pattern 1: Sequential Scan (Good!)

// Streaming access - excellent spatial locality
for (int i = 0; i < n; i++) {
    sum += array[i];
}
 
Characteristics:
✅ High spatial locality (cache lines fully utilized)
✅ Prefetcher friendly (predictable pattern)
✅ Row buffer hits (DRAM friendly)

7.2 Pattern 2: Strided Access (Medium)

// Matrix column access
for (int col = 0; col < n; col++) {
    for (int row = 0; row < n; row++) {
        sum += matrix[row][col];  // Stride = n
    }
}
 
Characteristics:
⚠ïļ Poor spatial locality if stride &gt; cache line
⚠ïļ Some prefetcher support (constant stride)
✅ Row buffer locality if stride is small

7.3 Pattern 3: Random Access (Bad!)

// Hash table lookup
for (int i = 0; i < n; i++) {
    sum += hash_table[hash(keys[i])];
}
 
Characteristics:
❌ No spatial locality
❌ No temporal locality (unless highly skewed)
❌ Prefetcher cannot help
❌ Cache thrashing

8. Cache-Friendly Programming

8.1 Technique 1: Loop Ordering

// Bad: Column-major access of row-major array
for (int col = 0; col < n; col++) {
    for (int row = 0; row < n; row++) {
        sum += matrix[row][col];  // Stride = n (poor spatial locality)
    }
}
 
// Good: Row-major access of row-major array
for (int row = 0; row < n; row++) {
    for (int col = 0; col < n; col++) {
        sum += matrix[row][col];  // Stride = 1 (excellent spatial locality)
    }
}
 
Speedup: 3-10x from just changing loop order!
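
If you want to measure this yourself, here is a minimal, self-contained timing sketch (build with optimizations, e.g. -O2; the exact speedup depends on matrix size, compiler, and hardware):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 4096;                                  // 4096×4096 ints = 64 MB, larger than typical L3
    std::vector<int> matrix(static_cast<size_t>(n) * n, 1);
    auto at = [&](int row, int col) -> int& {
        return matrix[static_cast<size_t>(row) * n + col];
    };

    auto time_scan = [&](bool row_major, const char* label) {
        long long sum = 0;
        const auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                sum += row_major ? at(i, j) : at(j, i);  // stride-1 vs stride-n access
        const auto stop = std::chrono::steady_clock::now();
        const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
        printf("%s: sum=%lld, %lld ms\n", label, sum, (long long)ms);
    };

    time_scan(false, "column-major (stride n)");
    time_scan(true,  "row-major    (stride 1)");
    return 0;
}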

8.2 Technique 2: Cache Blocking (Tiling)

// Bad: Working set > cache size
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < n; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
 
// Good: Block to fit in cache
const int BLOCK = 64;
for (int ii = 0; ii < n; ii += BLOCK) {
    for (int jj = 0; jj < n; jj += BLOCK) {
        for (int kk = 0; kk < n; kk += BLOCK) {
            for (int i = ii; i < min(ii+BLOCK, n); i++) {
                for (int j = jj; j < min(jj+BLOCK, n); j++) {
                    for (int k = kk; k < min(kk+BLOCK, n); k++) {
                        C[i][j] += A[i][k] * B[k][j];
                    }
                }
            }
        }
    }
}
 
Speedup: 2-5x from cache blocking!

8.3 Technique 3: Array of Structs → Struct of Arrays

// Bad: Poor spatial locality
struct Particle { float x, y, z, vx, vy, vz; };
Particle particles[10000];
 
for (int i = 0; i < 10000; i++) {
    particles[i].x += particles[i].vx;  // Touches only x and vx - most of each cache line is wasted
}
 
// Good: Excellent spatial locality
struct ParticlesSoA {
    float x[10000];
    float y[10000];
    float z[10000];
    float vx[10000], vy[10000], vz[10000];
};
ParticlesSoA soa;
 
for (int i = 0; i < 10000; i++) {
    soa.x[i] += soa.vx[i];  // Full cache line utilization!
}
 
Speedup: 2-4x from data layout change!

9. Key Takeaways

✅ Memory hierarchy exists due to fundamental speed-capacity-cost trade-offs

✅ Cache levels provide progressively larger but slower storage

  • L1: 4 cycles, 32-64 KB
  • L2: 12 cycles, 256 KB - 1 MB
  • L3: 40 cycles, 4-32 MB
  • DRAM: 200 cycles, GB-scale

✅ Locality principles make caches effective:

  • Temporal: reuse recently accessed data
  • Spatial: access nearby data

✅ Cache hits are served in single-digit cycles; cache misses cost 100-300 cycles → a huge penalty!

✅ Bandwidth vs latency: Different workloads need different trade-offs

  • CPUs: latency-optimized
  • GPUs: bandwidth-optimized

✅ Cache-friendly programming can give 2-10x speedups for free!

10. Practice Problems

1. Hit Rate Impact

L1 hit rate = 95%, L1 latency = 4 cycles, DRAM latency = 200 cycles
What is the average access time?
What if hit rate improves to 98%?

2. Spatial Locality

Array of 1000 ints, cache line = 64 bytes
How many cache misses for sequential access?
How many for strided access with stride=16?

3. Cache Blocking

Matrix multiply: 1024×1024 matrices, L1 cache = 32 KB
What block size should you use? Why?

11. Next Steps

Now that you understand memory hierarchy basics:

→ Cache Coherence: How do multi-core systems keep caches consistent?

→ Advanced Caching: Associativity, replacement policies, write policies

→ GPU Memory: HBM, GDDR, memory coalescing

→ KV Cache Optimization: Apply these principles to ML inference

Remember: The memory hierarchy is the most important performance factor in modern systems! 🚀