System & Microarchitecture Deep Dive
End-to-end reasoning about compute + data pathologies with evidence-based fixes for CPU pipelines, GPU occupancy, and memory hierarchies
Practical Exercises
- CPU pipeline bottleneck analysis with Top-Down methodology
- NUMA-aware memory allocation optimization
- GPU occupancy calculation workshop
- Hardware prefetching distance optimization
Real-World Applications
- Optimizing transformer inference pipelines
- Debugging memory-bound ML workloads
- Datacenter resource allocation
- GPU kernel performance tuning
Part of Learning Tracks
System & Microarchitecture — Deep Dive
Author: Principal Computer Architect & Professor of Computer Architecture
Scope: CPU pipelines, caches/TLBs, prefetching, coherence, NUMA, SIMD, GPU occupancy & memory hierarchy, interconnects (NoC/PCIe/CXL/NVLink), storage & I/O.
Use this when: You need to reason end‑to‑end about compute + data, eliminate pathologies, and communicate fixes with evidence.
📋 Table of Contents
0) Mental model
1) CPU pipeline basics (OoO superscalar)
2) Cache & memory hierarchy
3) Prefetching (HW + SW)
4) Coherency & sharing
5) NUMA effects
6) Vector/SIMD (CPU)
7) GPU occupancy & memory behavior
8) Interconnects, storage & I/O
9) Worked examples
10) Checklist & anti‑patterns
Further reading
0) Mental model
A modern SoC couples:
- CPU complex: out‑of‑order (OoO) cores with private L1I/L1D and often private or per‑core L2; a shared LLC (L3) and ring/mesh NoC.
- Accelerators: GPU, NPU/TPU, ISP, video, DMA engines.
- Memory subsystem: DDRx/HBM stacks, memory controllers (MCs), address interleaving, IOMMU.
- I/O fabric: PCIe/CXL, vendor coherency links (NVLink/NVSwitch, Infinity Fabric).
The art is to feed compute (bandwidth, latency, locality), hide latency (parallelism, prefetch), and balance resources (utilization without interference).
1) CPU pipeline basics (OoO superscalar)
Modern out-of-order (OoO) superscalar processors achieve high performance through sophisticated pipeline structures that allow instructions to execute out of program order while maintaining the illusion of sequential execution.
Pipeline stages deep dive
┌────────┬─────────┬─────────┬──────────┬───────┬─────────┬───────────┬────────┐
│ FETCH  │ DECODE  │ RENAME  │ DISPATCH │ ISSUE │ EXECUTE │ WRITEBACK │ RETIRE │
│        │         │         │          │       │         │           │        │
│ I-cache│ µop     │ RAT     │ RS/      │ Wakeup│ Func    │ Result    │ ROB    │
│ Branch │ decode  │ Free    │ ROB      │ Select│ Units   │ Forward   │ Commit │
│ Pred   │ Fusion  │ list    │ Alloc    │       │         │           │        │
└────────┴─────────┴─────────┴──────────┴───────┴─────────┴───────────┴────────┘
1. FETCH Stage
- Purpose: Retrieve instructions from memory and predict control flow
- Key Components:
- Branch Predictor: Predicts taken/not-taken for conditional branches
- Branch Target Buffer (BTB): Caches branch target addresses
- Return Address Stack (RAS): Dedicated stack for function returns
- I-Cache: Instruction cache (typically 32-64 KB, 4-8 way associative)
- I-TLB: Instruction translation lookaside buffer
Flow: PC → I-TLB lookup → I-Cache access → Branch prediction → Next PC generation
2. DECODE Stage
- Purpose: Parse instruction bytes into microoperations (µops)
- Key Components:
- Instruction Decoders: Convert x86/ARM instructions to internal µops
- µop Cache: Stores decoded µops to avoid repeated decoding (1-2K entries)
- Macro-op Fusion: Combines related instructions (e.g., cmp+jcc)
- Complex Instruction Support: Microcode ROM for complex instructions
Flow: Raw instructions → Decode units → µop generation → µop cache storage
3. RENAME Stage
- Purpose: Eliminate false dependencies by mapping architectural to physical registers
- Key Components:
- Register Alias Table (RAT): Maps architectural → physical registers
- Physical Register File: Pool of physical registers (160-256 entries)
- Free List: Tracks available physical registers
- Dependency Tracking: Builds dependency chains for scheduling
Example:
Original: ADD R1, R2, R3 ; R1 = R2 + R3
SUB R4, R1, R5 ; R4 = R1 - R5 (depends on previous R1)
Renamed: ADD P37, P21, P45 ; P37 = P21 + P45
SUB P52, P37, P33 ; P52 = P37 - P33 (dependency preserved)
4. DISPATCH Stage
- Purpose: Allocate resources and place µops in execution queues
- Key Components:
- Reservation Stations (RS): Hold µops waiting for operands (36-60 entries)
- Reorder Buffer (ROB): Maintains program order for retirement (168-352 entries)
- Load/Store Queue: Buffers memory operations (48-72 load, 32-56 store entries)
- Resource Allocation: Assigns execution units and ports
Flow: Renamed µops → ROB allocation → RS insertion → Resource assignment
5. ISSUE Stage
- Purpose: Wake up ready µops and select for execution
- Key Components:
- Wakeup Logic: Monitors operand readiness (CAM-based)
- Select Logic: Chooses ready µops for execution ports
- Age-based Priority: Older instructions get priority
- Port Binding: Maps µops to specific execution ports
Critical Path: Operand ready → Wakeup CAM → Select logic → Port assignment (1 cycle)
6. EXECUTE Stage
- Purpose: Perform actual computation
- Execution Ports (Intel Skylake-like):
- Port 0: ALU, Vector ALU, Vector Shift, Branch
- Port 1: ALU, Vector ALU, Vector Multiply, Slow LEA
- Port 2: Load Unit, Address Generation
- Port 3: Load Unit, Address Generation
- Port 4: Store Data
- Port 5: ALU, Vector ALU, Vector Shuffle, Fast LEA
- Port 6: ALU, Branch, Store Address
- Port 7: Store Address
Bypass Network: Results forwarded directly to dependent µops (0-1 cycle latency)
7. WRITEBACK Stage
- Purpose: Update architectural state and forward results
- Key Components:
- Result Buses: Carry results to register file and bypass network
- Exception Handling: Detect and signal architectural exceptions
- Register File Update: Write results to physical registers
8. RETIRE Stage
- Purpose: Commit instructions in program order
- Key Components:
- ROB Head: Points to oldest uncommitted instruction
- Exception Resolution: Handle precise exceptions
- Resource Reclamation: Free physical registers and ROB entries
- Memory Ordering: Enforce memory consistency model
Pipeline Stage to Microarchitectural Structure Mapping
Understanding which structures belong to each pipeline stage is crucial for performance analysis and optimization. Here's the complete mapping:
🔧 STAGE-BY-STAGE STRUCTURE BREAKDOWN
Pipeline Stage | Primary Structures | Function | Capacity/Size |
---|---|---|---|
FETCH | Branch Predictor, BTB, I-Cache, I-TLB | Predict and fetch instructions | 32 KB I-Cache; 4K BTB entries |
DECODE | µop Cache, Decode Units, Macro-fusion | Convert to internal µops | 1.5K µop cache entries |
RENAME | RAT, Free List, Physical RegFile | Eliminate false dependencies | 180 physical registers |
DISPATCH | ROB, RS, Load/Store Queues | Allocate resources, queue µops | 224 ROB entries; 97 RS entries |
ISSUE | Wakeup Logic, Select Logic | Choose ready µops for execution | CAM-based wakeup |
EXECUTE | Functional Units, Execution Ports | Perform computation | 8 execution ports |
WRITEBACK | Bypass Network, Result Buses | Forward results | Multi-cycle bypass |
RETIRE | ROB Head, Commit Logic | Maintain program order | Up to 4 µops/cycle |
🎯 VISUAL REFERENCES FOR EACH STAGE
📚 Academic References:
- Hennessy & Patterson: "Computer Architecture: A Quantitative Approach" - Chapter 3 (Pipeline diagrams)
- Intel Optimization Manual: Section 2.2 (Microarchitecture overview)
- Agner Fog's Microarchitecture Guide: Detailed pipeline analysis for modern CPUs
🌐 Online Visual Resources:
- WikiChip Microarchitecture Diagrams: https://en.wikichip.org/wiki/intel/microarchitectures
- Real World Technologies: CPU architecture deep dives with block diagrams
- AnandTech CPU Reviews: Pipeline stage analysis with performance correlations
- Intel Developer Documentation: Optimization reference manuals with diagrams
DETAILED STAGE ANALYSIS WITH STRUCTURES
1. FETCH STAGE STRUCTURES
┌─────────────────────────────────────────────┐
│ FETCH STAGE │
┌─────────────────┐ │ ┌─────────────┐ ┌──────────────────┐ │
│ Program │ │ │ Branch │────│ Branch Target │ │
│ Counter │─┼──│ Predictor │ │ Buffer (BTB) │ │
│ (PC) │ │ │ │ │ │ │
└─────────────────┘ │ └─────────────┘ └──────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ FETCH UNIT │ │
│ │ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ I-Cache │ │ I-TLB │ │ │
│ │ │ 32-64 KB │ │ 128 entries │ │ │
│ │ └─────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
│
▼ (16-32 bytes/cycle)
┌──────────────┐
│ Instruction │
│ Queue │
└──────────────┘
ELI5: FETCH Stage "Think of this like a librarian who needs to find the next book (instruction) you want to read. The Branch Predictor is like a smart librarian who guesses which book you'll want next, even before you ask! The I-Cache is like a small bookshelf right at your desk with your most recently used books, and the I-TLB is like a super-fast address book that remembers where each book is located in the big library."
2. DECODE STAGE STRUCTURES
┌─────────────────────────────────────────────┐
│ DECODE STAGE │
┌──────────────┐ │ ┌─────────────────────────────────────┐ │
│ Instruction │ │ │ µop Cache │ │
│ Queue │────┼──│ (1.5K entries) │ │
│ │ │ │ ┌─────────┐ ┌─────────────────┐ │ │
└──────────────┘ │ │ │ Tag │ │ Decoded µops │ │ │
│ │ │ Array │ │ (6 wide) │ │ │
│ │ └─────────┘ └─────────────────┘ │ │
│ └─────────────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ (cache miss) │
│ ┌─────────────┐ ┌──────────────────┐ │
│ │ 4 x │ │ Macro-fusion │ │
│ │ Decoders │ │ Logic │ │
│ │ (1 complex, │ │ (CMP + JCC = 1) │ │
│ │ 3 simple) │ │ │ │
│ │ └─────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────┘
│
▼ (up to 6 µops/cycle)
┌──────────────┐
│ Rename │
│ Queue │
└──────────────┘
ELI5: DECODE Stage
"This is like translating a foreign language book into your native language. The µop Cache is like having pre-translated pages ready to go, so you don't have to translate the same page twice. When you haven't seen a page before, the Decoders are like human translators who break down complex sentences (x86 instructions) into simple actions (µops) that the CPU can easily understand."
3. RENAME STAGE STRUCTURES
┌─────────────────────────────────────────────┐
│ RENAME STAGE │
┌──────────────┐ │ ┌─────────────────────────────────────┐ │
│ Rename │ │ │ Register Alias Table (RAT) │ │
│ Queue │────┼──│ Arch → Physical Register Map │ │
│ │ │ │ ┌───┬───┬───┬───┬───┬───┬───┐ │ │
└──────────────┘ │ │ │R0 │R1 │R2 │R3 │R4 │..│R15│ │ │
│ │ ├───┼───┼───┼───┼───┼───┼───┤ │ │
│ │ │P37│P42│P15│P28│P71│..│P93│ │ │
│ │ └───┴───┴───┴───┴───┴───┴───┘ │ │
│ └─────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼─────────────────┐ │
│ │ Physical Register File │ │
│ │ (180 registers) │ │
│ │ ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │ │
│ │ │P0│P1│P2│.│..│37│..│71│..│179│ │ │
│ │ └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ │ │
│ └─────────────────┬─────────────────┘ │
│ │ │
│ ┌─────────────────▼─────────────────┐ │
│ │ Free List │ │
│ │ (Available Physical Regs) │ │
│ │ P22, P44, P67, P88... │ │
│ └───────────────────────────────────┘ │
└─────────────────────────────────────────────┘
ELI5: RENAME Stage "Imagine you have 16 desks (architectural registers) but 180 actual workspaces (physical registers). The RAT is like a smart assistant who puts sticky notes on your desks saying which workspace to actually use. This way, if two people need 'Desk 1' at the same time, they can use 'Workspace 37' and 'Workspace 42' without conflicts. The Free List is like the assistant's notebook tracking which workspaces are available."
4. DISPATCH STAGE STRUCTURES
┌──────────────────────────────────────────────┐
│ DISPATCH STAGE │
┌────────────┐ │ ┌────────────────────────────────────────┐ │
│ Renamed │ │ │ Reorder Buffer (ROB) │ │
│ µops │─────┼─│ (224 entries, 4-wide) │ │
│ │ │ │ ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │ │
└────────────┘ │ │ │H │ │ │ │ │ │ │ │ │ │ │T │ │ │
│ │ └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ │ │
│ │ ▲Head Tail▲ │ │
│ └────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────▼────────────────┐ │
│ │ Resource Allocation │ │
│ └───────────────┬────────────────┘ │
│ │ │
┌──────────────────────┬─────────────▼─────────┬───────────────────────┐
│ │ │ │
│ ┌─────────────────┐ │ ┌─────────────────┐ │ ┌─────────────────┐ │
│ │ Reservation │ │ │ Load Queue │ │ │ Store Queue │ │
│ │ Stations │ │ │ (48 entries) │ │ │ (32 entries) │ │
│ │ (97 entries) │ │ │ │ │ │ │ │
│ └─────────────────┘ │ └─────────────────┘ │ └─────────────────┘ │
└──────────────────────┴───────────────────────┴───────────────────────┘
ELI5: DISPATCH Stage "This is like a restaurant manager organizing orders. The ROB is like a numbered ticket system that keeps track of every order in the exact sequence customers arrived. The Reservation Stations are like different prep stations (salad, grill, dessert) where orders wait until all ingredients are ready. The Load/Store Queues are special stations just for getting items from storage or putting things away."
5. ISSUE STAGE STRUCTURES
┌──────────────────────────────────────────────┐
│ ISSUE STAGE │
┌───────────────┼─────────────────────────────────────────────┼───────────────┐
│ │ ┌─────────────────────────────────────┐ │ │
│ Reservation │ │ Wakeup Logic │ │ Operand │
│ Stations │ │ (CAM-based) │ │ Bypass │
│ │◄───┤ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │──┤ Network │
│ ┌───┬───┬───┐ │ │ │Tag 0│ │Tag 1│ │Tag 2│ │Tag 3│ │ │ │
│ │Rdy│Rdy│ ! │ │ │ └─────┘ └─────┘ └─────┘ └─────┘ │ │ ┌───┬───┬───┐ │
│ │Rdy│ ! │Rdy│ │ │ │ │ │ │ │ │ │P37│P42│P15│ │
│ │ ! │Rdy│ ! │ │ │ ▼ ▼ ▼ ▼ │ │ └───┴───┴───┘ │
│ └───┴───┴───┘ │ │ ┌─────────────────────────────┐ │ │ │
└───────────────┼────┤ │ Select Logic │ │──┼───────────────┘
│ │ │ (Age + Port availability) │ │ │
│ │ └─────────────────────────────┘ │ │
│ └─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Port Assignment │ │
│ │ │ │
│ │ Port0 Port1 Port2 Port3 Port4 │ │
│ │ ALU ALU Load Load Store │ │
│ │ VEC VEC AGU AGU Data │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
ELI5: ISSUE Stage "This is like a smart traffic controller at a busy intersection. The Wakeup Logic is constantly watching for when all the 'ingredients' (operands) for a recipe (instruction) are ready. The Select Logic is like the traffic controller deciding which cars (ready instructions) get to go through the intersection (execution ports) first, considering both how long they've been waiting and which lanes are available."
6. EXECUTE STAGE STRUCTURES
┌──────────────────────────────────────────────┐
│ EXECUTE STAGE │
│ │
┌─────────┐ │ ┌─────┬─────┬─────┬─────┬─────┬─────┐ │ ┌─────────┐
│Selected │────►│ │Port0│Port1│Port2│Port3│Port4│Port5│ │────►│Results │
│ µops │ │ └─────┴─────┴─────┴─────┴─────┴─────┘ │ │ to │
│ │ │ │ │ │ │ │ │ │ │Bypass │
└─────────┘ │ ▼ ▼ ▼ ▼ ▼ ▼ │ └─────────┘
│ ┌─────┬─────┬─────┬─────┬─────┬─────┐ │
│ │ ALU │ ALU │Load │Load │Store│ ALU │ │
│ │ │ │Unit │Unit │Data │ │ │
│ │ Vec │ Vec │ │ │ │ Vec │ │
│ │ ALU │ ALU │ AGU │ AGU │ │ ALU │ │
│ └─────┴─────┴─────┴─────┴─────┴─────┘ │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ Bypass Network │ │
│ │ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │ │
│ │ │Bus0│ │Bus1│ │Bus2│ │Bus3│ ... │ │
│ │ └────┘ └────┘ └────┘ └────┘ │ │
│ │ │ │ │ │ │ │
│ │ ▼ ▼ ▼ ▼ │ │
│ │ ┌─────────────────────────────┐ │ │
│ │ │ Forwarding Muxes │ │ │
│ │ │ (to dependent µops) │ │ │
│ │ └─────────────────────────────┘ │ │
│ └──────────────────────────────────────┘ │
└──────────────────────────────────────────────┘
ELI5: EXECUTE Stage "This is like a kitchen with 6 different cooking stations (execution ports). Each station can do different things - some can chop vegetables (ALU), some can get ingredients from the pantry (Load Units), some can put dishes on the serving table (Store Units). The Bypass Network is like a super-fast conveyor belt that immediately sends finished ingredients from one station to another that needs them, without waiting for the dish to be completely done."
7. WRITEBACK STAGE STRUCTURES
┌──────────────────────────────────────────────┐
│ WRITEBACK STAGE │
┌─────────┐ │ ┌─────────────────────────────────────────┐ │
│Execution│────►│ │ Result Buses │ │
│Results │ │ │ (8 buses, 64-bit each) │ │
│ │ │ │ ┌─────┬─────┬─────┬─────┬─────┬─────┐ │ │
└─────────┘ │ │ │Bus0 │Bus1 │Bus2 │Bus3 │Bus4 │Bus5 │ │ │
│ │ └─────┴─────┴─────┴─────┴─────┴─────┘ │ │
│ └─────────────────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────────┐ │
│ │ Physical Reg │ │ Exception Logic │ │
│ │ File Update │ │ ┌─────────────────┐│ │
│ │ ┌─────────────┐ │ │ │ Fault Detection │││ │
│ │ │ 180 x │ │ │ │ Priority Logic │││ │
│ │ │ 64-bit Regs │ │ │ │ Exception Queue │││ │
│ │ └─────────────┘ │ │ └─────────────────┘│ │
│ └─────────────────┘ └─────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ ROB Update │ │
│ │ (Mark instructions complete) │ │
│ └─────────────────────────────────────────┘ │
└──────────────────────────────────────────────┘
ELI5: WRITEBACK Stage "This is like the final quality check and packaging in a factory. The Result Buses are like conveyor belts carrying finished products. The Physical Register File is like a big storage warehouse where all the finished work gets stored with proper labels. The Exception Logic is like a quality inspector who checks if anything went wrong and needs to be reported to management (the operating system)."
8. RETIRE STAGE STRUCTURES
┌──────────────────────────────────────────────┐
│ RETIRE STAGE │
│ │
┌─────────┐ │ ┌─────────────────────────────────────────┐ │ ┌─────────┐
│Completed│────►│ │ Reorder Buffer │ │────►│Arch │
│ µops │ │ │ (In-Order Commit) │ │ │State │
│ │ │ │ ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │ │ │Update │
└─────────┘ │ │ │ H│ │ │ C│ C│ C│ │ │ │ │ │ T│ │ │ └─────────┘
│ │ └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ │ │
│ │ Head→ Complete→ ←Tail │ │
│ └─────────────────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────────┐ │
│ │ Memory Ordering │ │ Resource Reclaim │ │
│ │ ┌─────────────┐ │ │ ┌─────────────────┐ │ │
│ │ │Store Buffer │ │ │ │ Free Physical │ │ │
│ │ │ Drain │ │ │ │ Registers │ │ │
│ │ │Load/Store │ │ │ │ ROB Entries │ │ │
│ │ │Forwarding │ │ │ │ RS Entries │ │ │
│ │ └─────────────┘ │ │ └─────────────────┘ │ │
│ └─────────────────┘ └─────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ Exception Handling │ │
│ │ • Precise exception delivery │ │
│ │ • Pipeline flush on exceptions │ │
│ │ • Architectural state recovery │ │
│ └─────────────────────────────────────────┘ │
└──────────────────────────────────────────────┘
ELI5: RETIRE Stage "This is like the final graduation ceremony where students (instructions) can only graduate in the exact order they enrolled, even if some finished their coursework early. The ROB Head is like the dean who calls students one by one to receive their diploma. Resource Reclaim is like cleaning up the classroom after each student leaves, making space for new students. Exception Handling is like having security ready to handle any problems while maintaining the proper ceremony order."
Critical performance relationships
- Frontend: Branch predictor accuracy directly impacts I-cache hit rate and decode bandwidth
- Rename: RAT size limits instruction window; free list exhaustion stalls dispatch
- Scheduler: RS size × wakeup speed determines out-of-order window depth
- Backend: Port contention limits achievable IPC despite large instruction window
- Memory: LDQ/STQ size affects memory-level parallelism and store-to-load forwarding
Throughput vs. latency
- Throughput ≈ issue_width × freq × utilization.
- Latency affects serial portions and responsiveness (Amdahl).
- CPI decomposition:
CPI = CPI_retiring + CPI_FE + CPI_BE_core + CPI_BE_memory + CPI_bad_spec
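To make the decomposition concrete, here is a minimal sketch that attributes a measured CPI to Top-Down buckets as a rough, first-order approximation; the counter fractions are assumed sample values, not from any specific CPU.
#include <stdio.h>
int main(void) {
    double cpi_total   = 1.25;  // measured cycles per instruction (assumed)
    double f_retiring  = 0.40;  // Top-Down slot fractions (assumed sample values)
    double f_frontend  = 0.15;
    double f_bad_spec  = 0.10;
    double f_be_core   = 0.10;
    double f_be_memory = 0.25;
    // Rough attribution: scale total CPI by each slot fraction
    printf("CPI_retiring  = %.3f\n", cpi_total * f_retiring);
    printf("CPI_FE        = %.3f\n", cpi_total * f_frontend);
    printf("CPI_bad_spec  = %.3f\n", cpi_total * f_bad_spec);
    printf("CPI_BE_core   = %.3f\n", cpi_total * f_be_core);
    printf("CPI_BE_memory = %.3f\n", cpi_total * f_be_memory);
    return 0;
}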
Back‑pressure patterns (symptoms → likely cause → fix)
Symptom | Likely Cause | First Fixes |
---|---|---|
High FE‑bound + I‑cache/TLB misses | Code bloat, poor layout, small pages | LTO/PGO/FDO code layout; huge pages; reduce I-cache footprint |
High bad‑speculation | Branchy control, data‑dependent branches | If-convert small diamonds (see the sketch after this table); loop unswitching; value speculation avoidance |
BE core‑bound + port pressure | Over‑unrolled vector kernels, reg pressure, imbalanced ports | Reduce unroll; rebalance scalar/SIMD ops; software pipelining |
BE memory‑bound + high MPKI | Strided/irregular access, poor locality | SoA layout; blocking/tiling; software prefetch; cache hints/pragmas |
Store‑bound, store‑forward stalls | Small scattered stores, write‑combining misses | Buffer & copy (memcpy-like); align and batch; non-temporal stores when appropriate |
Tip: Always correlate with Top‑Down percentages and MPKI/latency to avoid misdiagnosis.
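To illustrate the "if-convert small diamonds" fix, a minimal sketch (hypothetical functions) contrasting a branchy reduction with an if-converted one that compilers typically lower to a conditional move/select:
#include <stddef.h>
#include <stdint.h>
// Branchy version: the data-dependent branch mispredicts on random inputs.
int64_t sum_above_branchy(const int32_t *a, size_t n, int32_t thresh) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] >= thresh) sum += a[i];
    }
    return sum;
}
// If-converted version: the ternary usually compiles to cmov/select,
// trading a possible misprediction for a short data dependency.
int64_t sum_above_branchless(const int32_t *a, size_t n, int32_t thresh) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (a[i] >= thresh) ? a[i] : 0;
    }
    return sum;
}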
2) Cache & memory hierarchy
The memory hierarchy bridges the performance gap between fast processor cores and relatively slow DRAM through multiple levels of progressively larger but slower caches.
Memory hierarchy latency characteristics
Level | Typical Size | Latency (ns) | Latency (cycles @3GHz) | Bandwidth | Key Properties |
---|---|---|---|---|---|
L1 Data | 32-64 KB | 1-2 ns | 3-6 cycles | ~1 TB/s | Per-core, virtually indexed |
L1 Instruction | 32-64 KB | 1-2 ns | 3-6 cycles | ~500 GB/s | Per-core, virtually indexed |
L2 Private | 256 KB-1 MB | 3-10 ns | 9-30 cycles | ~400 GB/s | Per-core or per-2-cores |
L3/LLC (Shared) | 8-32 MB | 10-40 ns | 30-120 cycles | ~200 GB/s | Shared, physically indexed |
DRAM (Local) | 8-128 GB | 70-150 ns | 210-450 cycles | ~50-100 GB/s | Main memory, NUMA local |
DRAM (Remote) | Cross-socket | 100-200 ns | 300-600 cycles | ~25-50 GB/s | NUMA remote access |
NVMe SSD | 1-8 TB | 10-100 µs | 30K-300K cycles | ~3-7 GB/s | Non-volatile storage |
Cache hierarchy policies: Deep dive
Understanding cache policies is crucial for predicting performance behavior and optimizing memory access patterns.
Inclusion Policies
Inclusive Hierarchy:
- Definition: All data in L1/L2 must also exist in LLC
- Advantage: Simpler coherence protocol - LLC acts as directory filter
- Disadvantage: LLC space "wasted" on L1/L2 duplicates
- Example: Intel pre-Skylake architectures
┌─────────┐ ┌─────────────────────┐
│ L1 │ │ L3/LLC │
│ [A,B,C] │────│ [A,B,C,D,E,F,G,H] │ ← Contains ALL L1 data
└─────────┘ └─────────────────────┘
Exclusive Hierarchy:
- Definition: Data exists in only ONE cache level
- Advantage: Maximum effective cache capacity
- Disadvantage: Complex victim handling, potential ping-ponging
- Example: AMD Bulldozer family (L1-L2 exclusive)
┌─────────┐ ┌─────────────────────┐
│ L1 │ │ L3/LLC │
│ [A,B,C] │ │ [D,E,F,G,H,I,J,K] │ ← Disjoint data sets
└─────────┘ └─────────────────────┘
Non-Inclusive (Mostly Exclusive):
- Definition: No strict inclusion requirement; data can exist anywhere
- Advantage: Balanced approach - flexibility without waste
- Implementation: Intel Skylake+ uses this approach
- Behavior: L1/L2 victims may or may not be in LLC
Write Policies
Write-Allocate vs. No-Write-Allocate:
// Write-Allocate behavior:
array[random_index] = value;
// 1. Cache miss brings entire line into cache
// 2. Modifies only the needed bytes
// 3. Good for subsequent reads to same line
// No-Write-Allocate behavior:
array[random_index] = value;
// 1. Write goes directly to next level
// 2. No cache line allocated
// 3. Better for streaming writes with no reuse
Write-Through vs. Write-Back:
- Write-Through: Every write immediately propagates to next level (simpler, higher bandwidth usage)
- Write-Back: Writes accumulate in cache, propagated only on eviction (complex, bandwidth efficient)
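For streaming writes with no reuse, non-temporal stores sidestep both write-allocate and cache pollution. A minimal sketch, assuming x86-64 with SSE2, a 16-byte-aligned destination, and a length that is a multiple of 16:
#include <emmintrin.h>   // SSE2 intrinsics
#include <stdint.h>
#include <stddef.h>
// dst must be 16-byte aligned and n a multiple of 16 for this sketch.
void fill_stream(uint8_t *dst, size_t n, uint8_t value) {
    __m128i v = _mm_set1_epi8((char)value);
    for (size_t i = 0; i < n; i += 16) {
        _mm_stream_si128((__m128i *)(dst + i), v);  // write-combining, bypasses cache
    }
    _mm_sfence();  // order streaming stores before later normal stores
}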
Replacement Policies
LRU (Least Recently Used):
- Traditional choice for small associativities
- Problem: Thrashing on streaming access patterns
- Implementation cost: O(n log n) bits per set
RRIP (Re-Reference Interval Prediction):
┌──────────┬──────────────────┬─────────────┐
│ RRIP Val │ Meaning │ Behavior │
├──────────┼──────────────────┼─────────────┤
│ 00 │ Near-immediate │ Don't evict │
│ 01 │ Intermediate │ Age once │
│ 10 │ Distant │ Age twice │
│ 11 │ Long/Never │ Evict first │
└──────────┴──────────────────┴─────────────┘
- Scan-resistant: Streaming data gets RRIP=11 (immediate eviction)
- Used in: Intel Ivy Bridge+ LLC
Translation Lookaside Buffers (TLBs) and Virtual Memory
TLBs are critical caches for virtual-to-physical address translation, often the hidden bottleneck in memory-intensive workloads.
TLB Structure and Organization
Instruction TLB (I-TLB):
- Size: 128-256 entries for 4KB pages
- Purpose: Caches instruction page translations
- Critical for: Large codebases, shared libraries
Data TLB (D-TLB):
- L1 D-TLB: 64-128 entries (4KB pages) + 32-64 entries (2MB/1GB pages)
- L2 D-TLB: 1K-2K entries (unified page sizes)
- Purpose: Caches data page translations
Page Walk Process
When TLB misses occur, hardware must perform a page walk:
Virtual Address (x86-64 with 4-level paging):
┌─────────┬─────────┬─────────┬─────────┬──────────────┐
│ PML4(9) │ PDP(9) │ PD(9) │ PT(9) │ Offset(12) │
└─────────┴─────────┴─────────┴─────────┴──────────────┘
Page Walk Steps:
1. CR3 holds the PML4 base address (register read, no memory access)
2. PML4[index] → Page Directory Pointer base (1 memory access)
3. PDP[index] → Page Directory base (1 memory access)
4. PD[index] → Page Table base (1 memory access)
5. PT[index] → Physical page frame (1 memory access)
Total: 4 memory accesses per walk for a 4 KB page, on top of the original data access
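A small sketch (the example address is arbitrary) showing how the four indexes and the offset are carved out of a 48-bit virtual address:
#include <stdint.h>
#include <stdio.h>
int main(void) {
    uint64_t va = 0x00007f3a12345678ULL;            // arbitrary user-space address
    unsigned pml4 = (unsigned)((va >> 39) & 0x1FF); // bits 47:39
    unsigned pdp  = (unsigned)((va >> 30) & 0x1FF); // bits 38:30
    unsigned pd   = (unsigned)((va >> 21) & 0x1FF); // bits 29:21
    unsigned pt   = (unsigned)((va >> 12) & 0x1FF); // bits 20:12
    unsigned off  = (unsigned)(va & 0xFFF);         // bits 11:0
    printf("PML4=%u PDP=%u PD=%u PT=%u offset=0x%x\n", pml4, pdp, pd, pt, off);
    return 0;
}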
Page Walk Caches (PWC):
- L1 PWC: Caches intermediate page table entries
- L2 PWC: Larger intermediate cache
- Impact: Reduces 4-5 memory accesses to 1-2 for subsequent nearby translations
Huge Pages: Performance Game-Changer
Standard vs. Huge Page TLB Coverage:
Page Size | x86-64 | ARM64 | TLB Entries | Coverage | Use Case |
---|---|---|---|---|---|
Small | 4 KB | 4 KB | 128 | 512 KB | Default, fine-grained |
Medium | 2 MB | 64 KB | 32 | 64 MB | Large arrays, databases |
Large | 1 GB | 2 MB | 16 | 16 GB | ML models, in-memory DBs |
Huge Page Benefits:
- Reduced TLB pressure: 512x fewer TLB entries needed (2MB vs 4KB)
- Fewer page walks: Dramatically reduced page table traversals
- Better cache utilization: Page table entries stay hot longer
Huge Page Implementation:
#include <sys/mman.h>
#include <stdlib.h>
// Method 1: Transparent Huge Pages (system-wide)
// /sys/kernel/mm/transparent_hugepage/enabled = "always"
// Method 2: Explicit huge page allocation (requires reserved hugepages,
// e.g. via /proc/sys/vm/nr_hugepages)
void* huge_malloc(size_t size) {
    void* addr = mmap(NULL, size,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                      -1, 0);
    if (addr == MAP_FAILED) return NULL;
    return addr;
}
// Method 3: madvise() hint for an existing mapping; allocate 2 MB-aligned so
// THP can back the whole range (madvise needs a page-aligned address)
void* buf = aligned_alloc(2 * 1024 * 1024, large_size);
madvise(buf, large_size, MADV_HUGEPAGE);
When to Use Huge Pages:
- Large contiguous allocations > 2MB (ML models, embeddings, KV caches)
- Read-heavy workloads with good spatial locality
- Long-lived allocations to amortize setup cost
When NOT to Use Huge Pages:
- Small, scattered allocations < 256KB
- Short-lived objects with high allocation/free churn
- Memory-constrained environments (internal fragmentation)
NUMA Effects on Memory Hierarchy
Local vs. Remote Memory Access:
┌─────────────────────────────────────────────────────────────┐
│ NUMA Node 0 │
│ ┌─────┐ ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │
│ │CPU 0│──│L1/L2/L3 │──│Memory │ │ │ │
│ │CPU 1│ │Caches │ │Controller│ │ Local DRAM │ │
│ └─────┘ └─────────┘ └─────────┘ │ (50-80 ns) │ │
└─────────────────────────────────────┴─────────────────────┘
│
┌────┴────┐ Inter-socket Link
│QPI/UPI │ (40-60 ns additional)
└────┬────┘
┌─────────────────────────────────────────────────────────────┐
│ NUMA Node 1 │
│ ┌─────┐ ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │
│ │CPU 2│──│L1/L2/L3 │──│Memory │ │ │ │
│ │CPU 3│ │Caches │ │Controller│ │ Remote DRAM │ │
│ └─────┘ └─────────┘ └─────────┘ │ (100-150 ns) │ │
└─────────────────────────────────────┴─────────────────────┘
Practical optimization strategies
Memory Layout Optimization
Cache Line Alignment:
// BAD: False sharing: the two hot fields sit in the same 64-byte cache line
struct SharedDataBad {
    volatile int counter_a; // CPU 0 writes
    volatile int counter_b; // CPU 1 writes
};
// GOOD: Cache line separation
struct SharedDataGood {
    volatile int counter_a;
    char padding[60];       // Pad so counter_b starts on the next cache line
    volatile int counter_b;
} __attribute__((aligned(64))); // Each counter in its own cache line
Prefetch-Friendly Layout:
// Prefetch distance (in loop iterations) ≈ memory latency in cycles / cycles per loop iteration
int prefetch_distance = memory_latency_cycles / cycles_per_iteration;
for (int i = 0; i < N; i++) {
// Prefetch future data
if (i + prefetch_distance < N) {
__builtin_prefetch(&data[i + prefetch_distance], 0, 3);
}
// Process current data
process_data(&data[i]);
}
Memory Placement Optimization Techniques
First-Touch NUMA Policy:
// Ensure memory allocation matches thread placement (first-touch: the thread
// that first writes a page determines which NUMA node backs it).
// Assumes cores_per_node, large_array, array_size, initial_value are defined elsewhere.
#pragma omp parallel
{
    int thread_id  = omp_get_thread_num();
    int numa_node  = thread_id / cores_per_node;
    int local_core = thread_id % cores_per_node;   // core index within the socket
    // Bind thread to a core on its NUMA node
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(numa_node * cores_per_node + local_core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
    // First-touch allocation on the local node
    #pragma omp for
    for (int i = 0; i < array_size; i++) {
        large_array[i] = initial_value; // page faults allocate on the local node
    }
}
Page Coloring for LLC Conflict Avoidance:
// Align critical data structures to avoid LLC set conflicts
#define LLC_WAYS 16
#define LLC_SET_SIZE (64 * LLC_WAYS) // 1024 bytes per set
// Allocate arrays with offset to use different LLC sets
void* allocate_conflict_free(size_t size, int color) {
size_t offset = color * LLC_SET_SIZE;
void* base = aligned_alloc(4096, size + offset + 4096);
return (char*)base + offset;
}
💡 Key Takeaways:
- TLB misses cost 4-5x more than cache misses due to page walk overhead
- Huge pages provide 100-500x TLB reach improvement for large datasets
- Cache policies fundamentally change behavior - inclusive vs exclusive affects eviction patterns
- NUMA topology awareness can improve performance by 2-3x for memory-bound workloads
- Always profile memory hierarchy behavior with hardware counters before optimizing
3) Prefetching (HW + SW)
Hardware prefetchers: next‑line, stride, delta‑correlation, spatial. Too aggressive ⇒ pollution & bandwidth waste.
Software prefetch: bring data just‑in‑time.
Estimating prefetch distance (rule‑of‑thumb):
Let memory latency be L cycles, the loop body issue I iterations/cycle on average, each iteration touch B bytes, and sustainable bandwidth be BW bytes/cycle. To overlap the latency:
distance ≈ ceil( L / max(1, (BW / B) / I) )
Pseudocode:
for (int i = 0; i < N; i += 16) {
__builtin_prefetch(A + i + PDIST, 0, 3); // read-mostly
// compute on A[i..i+15]
}
Measure pf_hits / pf_issued, LLC MPKI, and DRAM BW before/after.
4) Coherency & sharing
Protocols: MESI/MOESI; broadcast snoop vs. directory; inclusive LLC acts as probe filter on some systems.
Pathologies: false sharing; migratory objects; producer/consumer bouncing.
Mitigations
- Avoid fine‑grained write sharing; shard counters; per‑thread buffers with combining (see the sketch after this list).
- Pad shared structs to cache line size; use SPSC/MPSC rings for message‑passing.
- For read‑mostly data, prefer read‑only segments or duplicate per‑socket copies.
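A minimal sketch of the "shard counters, pad to a cache line" advice using C11 relaxed atomics; MAX_THREADS and the layout are assumptions for illustration, not a specific library's API:
#include <stdatomic.h>
#include <stdint.h>
#define CACHE_LINE  64
#define MAX_THREADS 64   // assumed upper bound on worker threads
// One counter per thread, each padded out to its own cache line.
struct padded_counter {
    _Atomic uint64_t value;
    char pad[CACHE_LINE - sizeof(_Atomic uint64_t)];
} __attribute__((aligned(CACHE_LINE)));
static struct padded_counter counters[MAX_THREADS];
// Writers touch only their own shard: no cache-line bouncing between cores.
static inline void count_event(int tid) {
    atomic_fetch_add_explicit(&counters[tid].value, 1, memory_order_relaxed);
}
// Readers pay the cost of summing shards, acceptable for occasional reads.
static uint64_t count_total(void) {
    uint64_t sum = 0;
    for (int i = 0; i < MAX_THREADS; i++)
        sum += atomic_load_explicit(&counters[i].value, memory_order_relaxed);
    return sum;
}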
5) NUMA effects
What hurts: remote DRAM latency, inter‑socket link contention, remote LLC misses.
Do this:
- Pin threads + memory (numactl -C … -m …, or libnuma from code; see the sketch below).
- Shard services per socket; route requests to local shard (stickiness).
- Enable automatic NUMA balancing only if phases are long and stable.
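A minimal libnuma sketch of binding an allocation to a node from code; link with -lnuma, and note that the 64 MB size and node 0 are assumed example values:
#include <numa.h>     // libnuma; link with -lnuma
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    size_t sz = 64UL << 20;   // 64 MB working set (assumed)
    int node = 0;             // assumed target node
    void *buf;
    if (numa_available() < 0) {
        buf = malloc(sz);                 // no NUMA support: plain allocation
    } else {
        buf = numa_alloc_onnode(sz, node); // pages policy-bound to `node`
    }
    if (!buf) { fprintf(stderr, "allocation failed\n"); return 1; }
    printf("allocated %zu bytes on node %d\n", sz, node);
    if (numa_available() < 0) free(buf); else numa_free(buf, sz);
    return 0;
}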
6) Vector/SIMD (CPU)
ISAs: SSE/AVX/AVX‑512, NEON/SVE/SVE2. Wider vectors increase throughput and power transients (some platforms down‑clock under heavy AVX‑512).
Make vectorization easy:
- Affine indexes + contiguous access, minimal control flow; consider predication rather than branches.
- Align loads/stores (e.g., 32/64‑byte); keep gather/scatter rare; use software pipelining to hide latencies.
- Use compiler diagnostics (e.g., -Rpass=loop-vectorize) and only drop to intrinsics when the compiler can't form the pattern (see the sketch below).
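A short sketch of a loop written so the vectorizer has an easy job: unit-stride, affine index, restrict-qualified pointers, branch-free body (saxpy is used here purely as an illustrative kernel):
#include <stddef.h>
// Compiles to packed SIMD at -O3 with clang/gcc; check with -Rpass=loop-vectorize.
void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];   // unit-stride loads/stores, no control flow
    }
}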
7) GPU occupancy & memory behavior
Execution: SIMT warps (e.g., 32 threads) scheduled by SM warp schedulers.
Occupancy: active warps per SM limited by registers, shared memory, blocks/SM, and architectural caps.
Occupancy math (example):
Assume per‑SM caps: 64 warps, 256 KB registers, 100 KB shared memory. Kernel uses 64 regs/thread, 48 KB shared/block, 256 threads/block (= 8 warps/block).
- Reg‑limited blocks = floor( (256 KB / (64 regs × 256 thr × 4 B)) ) = floor( (262144 / 65536) ) = 4 blocks.
- Smem‑limited blocks = floor(100 KB / 48 KB) = 2 blocks.
- Blocks cap ⇒ 2 blocks/SM × 8 warps = 16 active warps ⇒ occupancy = 16 / 64 = 25%.
Try reducing regs to 48 or smem to 32 KB to raise occupancy while watching ILP.
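The same arithmetic as a small sketch you can adapt; it uses the assumed caps from the example above and ignores real-GPU allocation granularities (register and shared-memory allocation units):
#include <stdio.h>
int main(void) {
    // Per-SM caps (assumed, matching the example)
    const long max_warps = 64;
    const long max_regs  = 256L * 1024 / 4;  // 256 KB register file = 64K 32-bit regs
    const long max_smem  = 100L * 1024;      // 100 KB shared memory, in bytes
    // Kernel resource usage (assumed)
    const long regs_per_thread   = 64;
    const long smem_per_block    = 48L * 1024;
    const long threads_per_block = 256;
    const long warps_per_block   = threads_per_block / 32;
    long reg_blocks  = max_regs / (regs_per_thread * threads_per_block); // 4
    long smem_blocks = max_smem / smem_per_block;                        // 2
    long warp_blocks = max_warps / warps_per_block;                      // 8
    long blocks = reg_blocks;
    if (smem_blocks < blocks) blocks = smem_blocks;
    if (warp_blocks < blocks) blocks = warp_blocks;
    double occupancy = (double)(blocks * warps_per_block) / (double)max_warps;
    printf("blocks/SM=%ld, active warps=%ld, occupancy=%.0f%%\n",
           blocks, blocks * warps_per_block, occupancy * 100.0);  // prints 25%
    return 0;
}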
Memory hierarchy (NVIDIA‑like): registers → shared/L1 (configurable split) → L2 → HBM/DRAM.
Best practices:
- Coalescing: map threadIdx.x to contiguous addresses.
- Shared memory: avoid bank conflicts; use async copy to stage tiles.
- L2 residency: temporal tiling; reuse windows that fit L2 to reduce HBM traffic.
- Profile warp stall reasons (memory dependency, barrier, dispatch) to focus fixes.
8) Interconnects, storage & I/O
NoC (on‑chip)
Topologies: ring, mesh, crossbar, custom hierarchical fabrics. Watch hotspot links; use QoS (virtual channels, priorities) for latency‑critical flows (e.g., CPU misses) vs. bulk DMA.
Off‑chip
- PCIe/CXL: bandwidth per direction ≈ lane_rate × lanes × encoding_efficiency (see the sketch below). Use it to size DMA batches and to decide between copy vs. map.
- Vendor links: NVLink/NVSwitch provide coherent or near‑coherent GPU‑GPU paths; design for collectives (all‑reduce) locality.
- Storage (NVMe): µs‑scale latency; increase queue depth and I/O size for throughput; avoid sync in hot path; io_uring helps.
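Plugging numbers into the bandwidth formula, a small sketch assuming a PCIe Gen4 x16 link with 128b/130b encoding:
#include <stdio.h>
int main(void) {
    double lane_rate_gt = 16.0;          // PCIe Gen4: 16 GT/s per lane (assumed)
    double lanes        = 16.0;          // x16 link
    double encoding     = 128.0 / 130.0; // 128b/130b encoding efficiency
    double gbps = lane_rate_gt * lanes * encoding;   // Gbit/s per direction
    printf("~%.0f Gb/s (~%.1f GB/s) per direction, before protocol overhead\n",
           gbps, gbps / 8.0);            // ~252 Gb/s, ~31.5 GB/s
    return 0;
}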
I/O checklist
- Batch small I/Os; use direct I/O for log‑structured stores.
- Tune interrupt moderation vs. latency (polling for micro‑bursts).
- Ensure IOMMU mappings are large and reused (reduce TLB pressure).
9) Worked examples
(A) Choosing prefetch distance
You measure: L2 miss latency ≈ 35 ns (≈105 cycles @ 3 GHz); the loop issues ≈0.5 iters/cycle; each iteration reads 64 B; aggregate DRAM BW ≈ 200 B/cycle.
(BW / B) / I = (200 / 64) / 0.5 = 6.25 ⇒ distance ≈ ceil(105 / 6.25) = 17.
Start with PDIST = 16–20 and verify the prefetch hit rate.
(B) NUMA shard sizing
2‑socket server; a request spends 100 µs on compute + 80 µs on memory (≈180 µs local); remote access adds ~40 µs on average, a >20% increase that shows up directly in p99. Enforce shard‑local routing and first‑touch; verify the p99 reduction.
10) Checklist & anti‑patterns
- ✅ Measure OI + bandwidth (roofline) before micro‑tuning kernels.
- ✅ Stabilize placement (NUMA pinning) before comparing results.
- ✅ Use huge pages for large, hot datasets and pad shared structs.
- ❌ Chasing 100% GPU occupancy at the expense of ILP.
- ❌ Over‑aggressive prefetching when memory BW is already saturated.
- ❌ Mixed small/large I/O without batching.
Further reading
- CUDA/ROCm tuning guides (occupancy, memory coalescing), Intel/ARM optimization manuals.
- CXL/NVLink/NVSwitch technical overviews for interconnect behavior.