System & Microarchitecture Deep Dive
End-to-end reasoning about compute + data pathologies with evidence-based fixes for CPU pipelines, GPU occupancy, and memory hierarchies
Practical Exercises
- CPU pipeline bottleneck analysis with Top-Down methodology
- NUMA-aware memory allocation optimization
- GPU occupancy calculation workshop
- Hardware prefetching distance optimization
Real-World Applications
- Optimizing transformer inference pipelines
- Debugging memory-bound ML workloads
- Datacenter resource allocation
- GPU kernel performance tuning
Part of Learning Tracks
System & Microarchitecture — Deep Dive
Author: Principal Computer Architect & Professor of Computer Architecture
Scope: CPU pipelines, caches/TLBs, prefetching, coherence, NUMA, SIMD, GPU occupancy & memory hierarchy, interconnects (NoC/PCIe/CXL/NVLink), storage & I/O.
Use this when: You need to reason end‑to‑end about compute + data, eliminate pathologies, and communicate fixes with evidence.
📋 Table of Contents
0) Mental model
1) CPU pipeline basics (OoO superscalar)
2) Cache & memory hierarchy
3) Prefetching (HW + SW)
4) Coherency & sharing
5) NUMA effects
6) Vector/SIMD (CPU)
7) GPU occupancy & memory behavior
8) Interconnects, storage & I/O
9) Worked examples
10) Checklist & anti‑patterns
Further reading
0) Mental model
A modern SoC couples:
- CPU complex: out‑of‑order (OoO) cores with private L1I/L1D and often private or per‑core L2; a shared LLC (L3) and ring/mesh NoC.
- Accelerators: GPU, NPU/TPU, ISP, video, DMA engines.
- Memory subsystem: DDRx/HBM stacks, memory controllers (MCs), address interleaving, IOMMU.
- I/O fabric: PCIe/CXL, vendor coherency links (NVLink/NVSwitch, Infinity Fabric).
The art is to feed compute (bandwidth, latency, locality), hide latency (parallelism, prefetch), and balance resources (utilization without interference).
1) CPU pipeline basics (OoO superscalar)
Modern out-of-order (OoO) superscalar processors achieve high performance through sophisticated pipeline structures that allow instructions to execute out of program order while maintaining the illusion of sequential execution.
Pipeline stages deep dive
┌────────┬─────────┬─────────┬──────────┬───────┬─────────┬───────────┬────────┐
│ FETCH  │ DECODE  │ RENAME  │ DISPATCH │ ISSUE │ EXECUTE │ WRITEBACK │ RETIRE │
│        │         │         │          │       │         │           │        │
│ I-cache│ µop     │ RAT     │ RS/      │ Wakeup│ Func    │ Result    │ ROB    │
│ Branch │ decode  │ Free    │ ROB      │ Select│ Units   │ Forward   │ Commit │
│ Pred   │ Fusion  │ list    │ Alloc    │       │         │           │        │
└────────┴─────────┴─────────┴──────────┴───────┴─────────┴───────────┴────────┘
1. FETCH Stage
- Purpose: Retrieve instructions from memory and predict control flow
- Key Components:
- Branch Predictor: Predicts taken/not-taken for conditional branches
- Branch Target Buffer (BTB): Caches branch target addresses
- Return Address Stack (RAS): Dedicated stack for function returns
- I-Cache: Instruction cache (typically 32-64 KB, 4-8 way associative)
- I-TLB: Instruction translation lookaside buffer
Flow: PC → I-TLB lookup → I-Cache access → Branch prediction → Next PC generation
2. DECODE Stage
- Purpose: Parse instruction bytes into microoperations (µops)
- Key Components:
- Instruction Decoders: Convert x86/ARM instructions to internal µops
- µop Cache: Stores decoded µops to avoid repeated decoding (1-2K entries)
- Macro-op Fusion: Combines related instructions (e.g., cmp+jcc)
- Complex Instruction Support: Microcode ROM for complex instructions
Flow: Raw instructions → Decode units → µop generation → µop cache storage
3. RENAME Stage
- Purpose: Eliminate false dependencies by mapping architectural to physical registers
- Key Components:
- Register Alias Table (RAT): Maps architectural → physical registers
- Physical Register File: Pool of physical registers (160-256 entries)
- Free List: Tracks available physical registers
- Dependency Tracking: Builds dependency chains for scheduling
Example:
Original: ADD R1, R2, R3 ; R1 = R2 + R3
SUB R4, R1, R5 ; R4 = R1 - R5 (depends on previous R1)
Renamed: ADD P37, P21, P45 ; P37 = P21 + P45
SUB P52, P37, P33 ; P52 = P37 - P33 (dependency preserved)
4. DISPATCH Stage
- Purpose: Allocate resources and place µops in execution queues
- Key Components:
- Reservation Stations (RS): Hold µops waiting for operands (36-60 entries)
- Reorder Buffer (ROB): Maintains program order for retirement (168-352 entries)
- Load/Store Queue: Buffers memory operations (48-72 load, 32-56 store entries)
- Resource Allocation: Assigns execution units and ports
Flow: Renamed µops → ROB allocation → RS insertion → Resource assignment
5. ISSUE Stage
- Purpose: Wake up ready µops and select for execution
- Key Components:
- Wakeup Logic: Monitors operand readiness (CAM-based)
- Select Logic: Chooses ready µops for execution ports
- Age-based Priority: Older instructions get priority
- Port Binding: Maps µops to specific execution ports
Critical Path: Operand ready → Wakeup CAM → Select logic → Port assignment (1 cycle)
6. EXECUTE Stage
- Purpose: Perform actual computation
- Execution Ports (Intel Skylake-like):
- Port 0: ALU, Vector ALU, Vector Shift, Branch
- Port 1: ALU, Vector ALU, Vector Multiply, Slow LEA
- Port 2: Load Unit, Address Generation
- Port 3: Load Unit, Address Generation
- Port 4: Store Data
- Port 5: ALU, Vector ALU, Vector Shuffle, Fast LEA
- Port 6: ALU, Branch, Store Address
- Port 7: Store Address
Bypass Network: Results forwarded directly to dependent µops (0-1 cycle latency)
7. WRITEBACK Stage
- Purpose: Update architectural state and forward results
- Key Components:
- Result Buses: Carry results to register file and bypass network
- Exception Handling: Detect and signal architectural exceptions
- Register File Update: Write results to physical registers
8. RETIRE Stage
- Purpose: Commit instructions in program order
- Key Components:
- ROB Head: Points to oldest uncommitted instruction
- Exception Resolution: Handle precise exceptions
- Resource Reclamation: Free physical registers and ROB entries
- Memory Ordering: Enforce memory consistency model
Pipeline Stage to Microarchitectural Structure Mapping
Understanding which structures belong to each pipeline stage is crucial for performance analysis and optimization. Here's the complete mapping:
🔧 STAGE-BY-STAGE STRUCTURE BREAKDOWN
Pipeline Stage | Primary Structures | Function | Capacity/Size |
---|---|---|---|
FETCH | Branch Predictor, BTB, I-Cache, I-TLB | Predict and fetch instructions | 32 KB I-Cache; 4K BTB entries |
DECODE | µop Cache, Decode Units, Macro-fusion | Convert to internal µops | 1.5K µop cache entries |
RENAME | RAT, Free List, Physical RegFile | Eliminate false dependencies | 180 physical registers |
DISPATCH | ROB, RS, Load/Store Queues | Allocate resources, queue µops | 224 ROB entries; 97 RS entries |
ISSUE | Wakeup Logic, Select Logic | Choose ready µops for execution | CAM-based wakeup |
EXECUTE | Functional Units, Execution Ports | Perform computation | 8 execution ports |
WRITEBACK | Bypass Network, Result Buses | Forward results | Multi-cycle bypass |
RETIRE | ROB Head, Commit Logic | Maintain program order | Up to 4 µops/cycle |
🎯 VISUAL REFERENCES FOR EACH STAGE
📚 Academic References:
- Hennessy & Patterson: "Computer Architecture: A Quantitative Approach" - Chapter 3 (Pipeline diagrams)
- Intel Optimization Manual: Section 2.2 (Microarchitecture overview)
- Agner Fog's Microarchitecture Guide: Detailed pipeline analysis for modern CPUs
🌐 Online Visual Resources:
- WikiChip Microarchitecture Diagrams: https://en.wikichip.org/wiki/intel/microarchitectures
- Real World Technologies: CPU architecture deep dives with block diagrams
- AnandTech CPU Reviews: Pipeline stage analysis with performance correlations
- Intel Developer Documentation: Optimization reference manuals with diagrams
DETAILED STAGE ANALYSIS WITH STRUCTURES
1. FETCH STAGE STRUCTURES
┌─────────────────────────────────────────────┐
│ FETCH STAGE │
┌─────────────────┐ │ ┌─────────────┐ ┌──────────────────┐ │
│ Program │ │ │ Branch │────│ Branch Target │ │
│ Counter │─┼──│ Predictor │ │ Buffer (BTB) │ │
│ (PC) │ │ │ │ │ │ │
└─────────────────┘ │ └─────────────┘ └──────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ FETCH UNIT │ │
│ │ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ I-Cache │ │ I-TLB │ │ │
│ │ │ 32-64 KB │ │ 128 entries │ │ │
│ │ └─────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
│
▼ (16-32 bytes/cycle)
┌──────────────┐
│ Instruction │
│ Queue │
└──────────────┘
ELI5: FETCH Stage "Think of this like a librarian who needs to find the next book (instruction) you want to read. The Branch Predictor is like a smart librarian who guesses which book you'll want next, even before you ask! The I-Cache is like a small bookshelf right at your desk with your most recently used books, and the I-TLB is like a super-fast address book that remembers where each book is located in the big library."
2. DECODE STAGE STRUCTURES
┌─────────────────────────────────────────────┐
│ DECODE STAGE │
┌──────────────┐ │ ┌─────────────────────────────────────┐ │
│ Instruction │ │ │ µop Cache │ │
│ Queue │────┼──│ (1.5K entries) │ │
│ │ │ │ ┌─────────┐ ┌─────────────────┐ │ │
└──────────────┘ │ │ │ Tag │ │ Decoded µops │ │ │
│ │ │ Array │ │ (6 wide) │ │ │
│ │ └─────────┘ └─────────────────┘ │ │
│ └─────────────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ (cache miss) │
│ ┌─────────────┐ ┌──────────────────┐ │
│ │ 4 x │ │ Macro-fusion │ │
│ │ Decoders │ │ Logic │ │
│ │ (1 complex, │ │ (CMP + JCC = 1) │ │
│ │ 3 simple) │ │ │ │
│ │ └─────────────┘ └──────────────────┘ │
└─────────────────────────────────────────────┘
│
▼ (up to 6 µops/cycle)
┌──────────────┐
│ Rename │
│ Queue │
└──────────────┘
ELI5: DECODE Stage
"This is like translating a foreign language book into your native language. The µop Cache is like having pre-translated pages ready to go, so you don't have to translate the same page twice. When you haven't seen a page before, the Decoders are like human translators who break down complex sentences (x86 instructions) into simple actions (µops) that the CPU can easily understand."
3. RENAME STAGE STRUCTURES
┌─────────────────────────────────────────────┐
│ RENAME STAGE │
┌──────────────┐ │ ┌─────────────────────────────────────┐ │
│ Rename │ │ │ Register Alias Table (RAT) │ │
│ Queue │────┼──│ Arch → Physical Register Map │ │
│ │ │ │ ┌───┬───┬───┬───┬───┬───┬───┐ │ │
└──────────────┘ │ │ │R0 │R1 │R2 │R3 │R4 │..│R15│ │ │
│ │ ├───┼───┼───┼───┼───┼───┼───┤ │ │
│ │ │P37│P42│P15│P28│P71│..│P93│ │ │
│ │ └───┴───┴───┴───┴───┴───┴───┘ │ │
│ └─────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────▼─────────────────┐ │
│ │ Physical Register File │ │
│ │ (180 registers) │ │
│ │ ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │ │
│ │ │P0│P1│P2│.│..│37│..│71│..│179│ │ │
│ │ └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ │ │
│ └─────────────────┬─────────────────┘ │
│ │ │
│ ┌─────────────────▼─────────────────┐ │
│ │ Free List │ │
│ │ (Available Physical Regs) │ │
│ │ P22, P44, P67, P88... │ │
│ └───────────────────────────────────┘ │
└─────────────────────────────────────────────┘
ELI5: RENAME Stage "Imagine you have 16 desks (architectural registers) but 180 actual workspaces (physical registers). The RAT is like a smart assistant who puts sticky notes on your desks saying which workspace to actually use. This way, if two people need 'Desk 1' at the same time, they can use 'Workspace 37' and 'Workspace 42' without conflicts. The Free List is like the assistant's notebook tracking which workspaces are available."
4. DISPATCH STAGE STRUCTURES
┌──────────────────────────────────────────────┐
│ DISPATCH STAGE │
┌────────────┐ │ ┌────────────────────────────────────────┐ │
│ Renamed │ │ │ Reorder Buffer (ROB) │ │
│ µops │─────┼─│ (224 entries, 4-wide) │ │
│ │ │ │ ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │ │
└────────────┘ │ │ │H │ │ │ │ │ │ │ │ │ │ │T │ │ │
│ │ └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ │ │
│ │ ▲Head Tail▲ │ │
│ └────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────▼────────────────┐ │
│ │ Resource Allocation │ │
│ └───────────────┬────────────────┘ │
│ │ │
┌──────────────────────┬─────────────▼─────────┬───────────────────────┐
│ │ │ │
│ ┌─────────────────┐ │ ┌─────────────────┐ │ ┌─────────────────┐ │
│ │ Reservation │ │ │ Load Queue │ │ │ Store Queue │ │
│ │ Stations │ │ │ (48 entries) │ │ │ (32 entries) │ │
│ │ (97 entries) │ │ │ │ │ │ │ │
│ └─────────────────┘ │ └─────────────────┘ │ └─────────────────┘ │
└──────────────────────┴───────────────────────┴───────────────────────┘
ELI5: DISPATCH Stage "This is like a restaurant manager organizing orders. The ROB is like a numbered ticket system that keeps track of every order in the exact sequence customers arrived. The Reservation Stations are like different prep stations (salad, grill, dessert) where orders wait until all ingredients are ready. The Load/Store Queues are special stations just for getting items from storage or putting things away."
5. ISSUE STAGE STRUCTURES
┌──────────────────────────────────────────────┐
│ ISSUE STAGE │
┌───────────────┼─────────────────────────────────────────────┼───────────────┐
│ │ ┌─────────────────────────────────────┐ │ │
│ Reservation │ │ Wakeup Logic │ │ Operand │
│ Stations │ │ (CAM-based) │ │ Bypass │
│ │◄───┤ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │──┤ Network │
│ ┌───┬───┬───┐ │ │ │Tag 0│ │Tag 1│ │Tag 2│ │Tag 3│ │ │ │
│ │Rdy│Rdy│ ! │ │ │ └─────┘ └─────┘ └─────┘ └─────┘ │ │ ┌───┬───┬───┐ │
│ │Rdy│ ! │Rdy│ │ │ │ │ │ │ │ │ │P37│P42│P15│ │
│ │ ! │Rdy│ ! │ │ │ ▼ ▼ ▼ ▼ │ │ └───┴───┴───┘ │
│ └───┴───┴───┘ │ │ ┌─────────────────────────────┐ │ │ │
└───────────────┼────┤ │ Select Logic │ │──┼───────────────┘
│ │ │ (Age + Port availability) │ │ │
│ │ └─────────────────────────────┘ │ │
│ └─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────┐ │
│ │ Port Assignment │ │
│ │ │ │
│ │ Port0 Port1 Port2 Port3 Port4 │ │
│ │ ALU ALU Load Load Store │ │
│ │ VEC VEC AGU AGU Data │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────────┘
ELI5: ISSUE Stage "This is like a smart traffic controller at a busy intersection. The Wakeup Logic is constantly watching for when all the 'ingredients' (operands) for a recipe (instruction) are ready. The Select Logic is like the traffic controller deciding which cars (ready instructions) get to go through the intersection (execution ports) first, considering both how long they've been waiting and which lanes are available."
6. EXECUTE STAGE STRUCTURES
┌──────────────────────────────────────────────┐
│ EXECUTE STAGE │
│ │
┌─────────┐ │ ┌─────┬─────┬─────┬─────┬─────┬─────┐ │ ┌─────────┐
│Selected │────►│ │Port0│Port1│Port2│Port3│Port4│Port5│ │────►│Results │
│ µops │ │ └─────┴─────┴─────┴─────┴─────┴─────┘ │ │ to │
│ │ │ │ │ │ │ │ │ │ │Bypass │
└─────────┘ │ ▼ ▼ ▼ ▼ ▼ ▼ │ └─────────┘
│ ┌─────┬─────┬─────┬─────┬─────┬─────┐ │
│ │ ALU │ ALU │Load │Load │Store│ ALU │ │
│ │ │ │Unit │Unit │Data │ │ │
│ │ Vec │ Vec │ │ │ │ Vec │ │
│ │ ALU │ ALU │ AGU │ AGU │ │ ALU │ │
│ └─────┴─────┴─────┴─────┴─────┴─────┘ │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ Bypass Network │ │
│ │ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │ │
│ │ │Bus0│ │Bus1│ │Bus2│ │Bus3│ ... │ │
│ │ └────┘ └────┘ └────┘ └────┘ │ │
│ │ │ │ │ │ │ │
│ │ ▼ ▼ ▼ ▼ │ │
│ │ ┌─────────────────────────────┐ │ │
│ │ │ Forwarding Muxes │ │ │
│ │ │ (to dependent µops) │ │ │
│ │ └─────────────────────────────┘ │ │
│ └──────────────────────────────────────┘ │
└──────────────────────────────────────────────┘
ELI5: EXECUTE Stage "This is like a kitchen with 6 different cooking stations (execution ports). Each station can do different things - some can chop vegetables (ALU), some can get ingredients from the pantry (Load Units), some can put dishes on the serving table (Store Units). The Bypass Network is like a super-fast conveyor belt that immediately sends finished ingredients from one station to another that needs them, without waiting for the dish to be completely done."
7. WRITEBACK STAGE STRUCTURES
┌──────────────────────────────────────────────┐
│ WRITEBACK STAGE │
┌─────────┐ │ ┌─────────────────────────────────────────┐ │
│Execution│────►│ │ Result Buses │ │
│Results │ │ │ (8 buses, 64-bit each) │ │
│ │ │ │ ┌─────┬─────┬─────┬─────┬─────┬─────┐ │ │
└─────────┘ │ │ │Bus0 │Bus1 │Bus2 │Bus3 │Bus4 │Bus5 │ │ │
│ │ └─────┴─────┴─────┴─────┴─────┴─────┘ │ │
│ └─────────────────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────────┐ │
│ │ Physical Reg │ │ Exception Logic │ │
│ │ File Update │ │ ┌─────────────────┐│ │
│ │ ┌─────────────┐ │ │ │ Fault Detection │││ │
│ │ │ 180 x │ │ │ │ Priority Logic │││ │
│ │ │ 64-bit Regs │ │ │ │ Exception Queue │││ │
│ │ └─────────────┘ │ │ └─────────────────┘│ │
│ └─────────────────┘ └─────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ ROB Update │ │
│ │ (Mark instructions complete) │ │
│ └─────────────────────────────────────────┘ │
└──────────────────────────────────────────────┘
ELI5: WRITEBACK Stage "This is like the final quality check and packaging in a factory. The Result Buses are like conveyor belts carrying finished products. The Physical Register File is like a big storage warehouse where all the finished work gets stored with proper labels. The Exception Logic is like a quality inspector who checks if anything went wrong and needs to be reported to management (the operating system)."
8. RETIRE STAGE STRUCTURES
┌──────────────────────────────────────────────┐
│ RETIRE STAGE │
│ │
┌─────────┐ │ ┌─────────────────────────────────────────┐ │ ┌─────────┐
│Completed│────►│ │ Reorder Buffer │ │────►│Arch │
│ µops │ │ │ (In-Order Commit) │ │ │State │
│ │ │ │ ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │ │ │Update │
└─────────┘ │ │ │ H│ │ │ C│ C│ C│ │ │ │ │ │ T│ │ │ └─────────┘
│ │ └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ │ │
│ │ Head→ Complete→ ←Tail │ │
│ └─────────────────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────────┐ │
│ │ Memory Ordering │ │ Resource Reclaim │ │
│ │ ┌─────────────┐ │ │ ┌─────────────────┐ │ │
│ │ │Store Buffer │ │ │ │ Free Physical │ │ │
│ │ │ Drain │ │ │ │ Registers │ │ │
│ │ │Load/Store │ │ │ │ ROB Entries │ │ │
│ │ │Forwarding │ │ │ │ RS Entries │ │ │
│ │ └─────────────┘ │ │ └─────────────────┘ │ │
│ └─────────────────┘ └─────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ Exception Handling │ │
│ │ • Precise exception delivery │ │
│ │ • Pipeline flush on exceptions │ │
│ │ • Architectural state recovery │ │
│ └─────────────────────────────────────────┘ │
└──────────────────────────────────────────────┘
ELI5: RETIRE Stage "This is like the final graduation ceremony where students (instructions) can only graduate in the exact order they enrolled, even if some finished their coursework early. The ROB Head is like the dean who calls students one by one to receive their diploma. Resource Reclaim is like cleaning up the classroom after each student leaves, making space for new students. Exception Handling is like having security ready to handle any problems while maintaining the proper ceremony order."
Critical performance relationships
- Frontend: Branch predictor accuracy directly impacts I-cache hit rate and decode bandwidth
- Rename: RAT size limits instruction window; free list exhaustion stalls dispatch
- Scheduler: RS size × wakeup speed determines out-of-order window depth
- Backend: Port contention limits achievable IPC despite large instruction window
- Memory: LDQ/STQ size affects memory-level parallelism and store-to-load forwarding
Throughput vs. latency
- Throughput ≈ issue_width × freq × utilization.
- Latency affects serial portions and responsiveness (Amdahl).
- CPI decomposition:
CPI = CPI_retiring + CPI_FE + CPI_BE_core + CPI_BE_memory + CPI_bad_spec
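To make the decomposition concrete, here is a minimal sketch that attributes a measured CPI to Top-Down buckets as a rough, first-order approximation; the counter fractions are assumed sample values, not from any specific CPU.
#include <stdio.h>
int main(void) {
    double cpi_total   = 1.25;  // measured cycles per instruction (assumed)
    double f_retiring  = 0.40;  // Top-Down slot fractions (assumed sample values)
    double f_frontend  = 0.15;
    double f_bad_spec  = 0.10;
    double f_be_core   = 0.10;
    double f_be_memory = 0.25;
    // Rough attribution: scale total CPI by each slot fraction
    printf("CPI_retiring  = %.3f\n", cpi_total * f_retiring);
    printf("CPI_FE        = %.3f\n", cpi_total * f_frontend);
    printf("CPI_bad_spec  = %.3f\n", cpi_total * f_bad_spec);
    printf("CPI_BE_core   = %.3f\n", cpi_total * f_be_core);
    printf("CPI_BE_memory = %.3f\n", cpi_total * f_be_memory);
    return 0;
}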
Back‑pressure patterns (symptoms → likely cause → fix)
Symptom | Likely Cause | First Fixes |
---|---|---|
High FE‑bound + I‑cache/TLB misses | Code bloat, poor layout, small pages | LTO/PGO/FDO code layout; huge pages; reduce I-cache footprint |
High bad‑speculation | Branchy control, data‑dependent branches | If-convert small diamonds (see the sketch after this table); loop unswitching; value speculation avoidance |
BE core‑bound + port pressure | Over‑unrolled vector kernels, reg pressure, imbalanced ports | Reduce unroll; rebalance scalar/SIMD ops; software pipelining |
BE memory‑bound + high MPKI | Strided/irregular access, poor locality | SoA layout; blocking/tiling; software prefetch; cache hints/pragmas |
Store‑bound, store‑forward stalls | Small scattered stores, write‑combining misses | Buffer & copy (memcpy-like); align and batch; non-temporal stores when appropriate |
Tip: Always correlate with Top‑Down percentages and MPKI/latency to avoid misdiagnosis.
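To illustrate the "if-convert small diamonds" fix, a minimal sketch (hypothetical functions) contrasting a branchy reduction with an if-converted one that compilers typically lower to a conditional move/select:
#include <stddef.h>
#include <stdint.h>
// Branchy version: the data-dependent branch mispredicts on random inputs.
int64_t sum_above_branchy(const int32_t *a, size_t n, int32_t thresh) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] >= thresh) sum += a[i];
    }
    return sum;
}
// If-converted version: the ternary usually compiles to cmov/select,
// trading a possible misprediction for a short data dependency.
int64_t sum_above_branchless(const int32_t *a, size_t n, int32_t thresh) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (a[i] >= thresh) ? a[i] : 0;
    }
    return sum;
}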
2) Cache & memory hierarchy
The memory hierarchy bridges the performance gap between fast processor cores and relatively slow DRAM through multiple levels of progressively larger but slower caches.
Memory hierarchy latency characteristics
Level | Typical Size | Latency (ns) | Latency (cycles @3GHz) | Bandwidth | Key Properties |
---|---|---|---|---|---|
L1 Data | 32-64 KB | 1-2 ns | 3-6 cycles | ~1 TB/s | Per-core, virtually indexed |
L1 Instruction | 32-64 KB | 1-2 ns | 3-6 cycles | ~500 GB/s | Per-core, virtually indexed |
L2 Private | 256 KB-1 MB | 3-10 ns | 9-30 cycles | ~400 GB/s | Per-core or per-2-cores |
L3/LLC (Shared) | 8-32 MB | 10-40 ns | 30-120 cycles | ~200 GB/s | Shared, physically indexed |
DRAM (Local) | 8-128 GB | 70-150 ns | 210-450 cycles | ~50-100 GB/s | Main memory, NUMA local |
DRAM (Remote) | Cross-socket | 100-200 ns | 300-600 cycles | ~25-50 GB/s | NUMA remote access |
NVMe SSD | 1-8 TB | 10-100 µs | 30K-300K cycles | ~3-7 GB/s | Non-volatile storage |
Cache hierarchy policies: Deep dive
Understanding cache policies is crucial for predicting performance behavior and optimizing memory access patterns.
Inclusion Policies
Inclusive Hierarchy:
- Definition: All data in L1/L2 must also exist in LLC
- Advantage: Simpler coherence protocol - LLC acts as directory filter
- Disadvantage: LLC space "wasted" on L1/L2 duplicates
- Example: Intel pre-Skylake architectures
┌─────────┐ ┌─────────────────────┐
│ L1 │ │ L3/LLC │
│ [A,B,C] │────│ [A,B,C,D,E,F,G,H] │ ← Contains ALL L1 data
└─────────┘ └─────────────────────┘
Exclusive Hierarchy:
- Definition: Data exists in only ONE cache level
- Advantage: Maximum effective cache capacity
- Disadvantage: Complex victim handling, potential ping-ponging
- Example: AMD Bulldozer family (L1-L2 exclusive)
┌─────────┐ ┌─────────────────────┐
│ L1 │ │ L3/LLC │
│ [A,B,C] │ │ [D,E,F,G,H,I,J,K] │ ← Disjoint data sets
└─────────┘ └─────────────────────┘
Non-Inclusive (Mostly Exclusive):
- Definition: No strict inclusion requirement; data can exist anywhere
- Advantage: Balanced approach - flexibility without waste
- Implementation: Intel Skylake+ uses this approach
- Behavior: L1/L2 victims may or may not be in LLC
Write Policies
Write-Allocate vs. No-Write-Allocate:
// Write-Allocate behavior:
array[random_index] = value;
// 1. Cache miss brings entire line into cache
// 2. Modifies only the needed bytes
// 3. Good for subsequent reads to same line
// No-Write-Allocate behavior:
array[random_index] = value;
// 1. Write goes directly to next level
// 2. No cache line allocated
// 3. Better for streaming writes with no reuse
Write-Through vs. Write-Back:
- Write-Through: Every write immediately propagates to next level (simpler, higher bandwidth usage)
- Write-Back: Writes accumulate in cache, propagated only on eviction (complex, bandwidth efficient)
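For streaming writes with no reuse, non-temporal stores sidestep both write-allocate and cache pollution. A minimal sketch, assuming x86-64 with SSE2, a 16-byte-aligned destination, and a length that is a multiple of 16:
#include <emmintrin.h>   // SSE2 intrinsics
#include <stdint.h>
#include <stddef.h>
// dst must be 16-byte aligned and n a multiple of 16 for this sketch.
void fill_stream(uint8_t *dst, size_t n, uint8_t value) {
    __m128i v = _mm_set1_epi8((char)value);
    for (size_t i = 0; i < n; i += 16) {
        _mm_stream_si128((__m128i *)(dst + i), v);  // write-combining, bypasses cache
    }
    _mm_sfence();  // order streaming stores before later normal stores
}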
Replacement Policies
LRU (Least Recently Used):
- Traditional choice for small associativities
- Problem: Thrashing on streaming access patterns
- Implementation cost: O(n log n) bits per set
RRIP (Re-Reference Interval Prediction):
┌──────────┬──────────────────┬─────────────┐
│ RRIP Val │ Meaning │ Behavior │
├──────────┼──────────────────┼─────────────┤
│ 00 │ Near-immediate │ Don't evict │
│ 01 │ Intermediate │ Age once │
│ 10 │ Distant │ Age twice │
│ 11 │ Long/Never │ Evict first │
└──────────┴──────────────────┴─────────────┘
- Scan-resistant: Streaming data gets RRIP=11 (immediate eviction)
- Used in: Intel Ivy Bridge+ LLC
Translation Lookaside Buffers (TLBs) and Virtual Memory
TLBs are critical caches for virtual-to-physical address translation, often the hidden bottleneck in memory-intensive workloads.
TLB Structure and Organization
Instruction TLB (I-TLB):
- Size: 128-256 entries for 4KB pages
- Purpose: Caches instruction page translations
- Critical for: Large codebases, shared libraries
Data TLB (D-TLB):
- L1 D-TLB: 64-128 entries (4KB pages) + 32-64 entries (2MB/1GB pages)
- L2 D-TLB: 1K-2K entries (unified page sizes)
- Purpose: Caches data page translations
Page Walk Process
When TLB misses occur, hardware must perform a page walk:
Virtual Address (x86-64 with 4-level paging):
┌─────────┬─────────┬─────────┬─────────┬──────────────┐
│ PML4(9) │ PDP(9) │ PD(9) │ PT(9) │ Offset(12) │
└─────────┴─────────┴─────────┴─────────┴──────────────┘
Page Walk Steps:
1. CR3 holds the PML4 base address (register read, no memory access)
2. PML4[index] → Page Directory Pointer base (1 memory access)
3. PDP[index] → Page Directory base (1 memory access)
4. PD[index] → Page Table base (1 memory access)
5. PT[index] → Physical page frame (1 memory access)
Total: 4 memory accesses per walk for a 4 KB page, on top of the original data access
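A small sketch (the example address is arbitrary) showing how the four indexes and the offset are carved out of a 48-bit virtual address:
#include <stdint.h>
#include <stdio.h>
int main(void) {
    uint64_t va = 0x00007f3a12345678ULL;            // arbitrary user-space address
    unsigned pml4 = (unsigned)((va >> 39) & 0x1FF); // bits 47:39
    unsigned pdp  = (unsigned)((va >> 30) & 0x1FF); // bits 38:30
    unsigned pd   = (unsigned)((va >> 21) & 0x1FF); // bits 29:21
    unsigned pt   = (unsigned)((va >> 12) & 0x1FF); // bits 20:12
    unsigned off  = (unsigned)(va & 0xFFF);         // bits 11:0
    printf("PML4=%u PDP=%u PD=%u PT=%u offset=0x%x\n", pml4, pdp, pd, pt, off);
    return 0;
}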
Page Walk Caches (PWC):
- L1 PWC: Caches intermediate page table entries
- L2 PWC: Larger intermediate cache
- Impact: Reduces 4-5 memory accesses to 1-2 for subsequent nearby translations
Huge Pages: Performance Game-Changer
Standard vs. Huge Page TLB Coverage:
Page Size | x86-64 | ARM64 | TLB Entries | Coverage | Use Case |
---|---|---|---|---|---|
Small | 4 KB | 4 KB | 128 | 512 KB | Default, fine-grained |
Medium | 2 MB | 64 KB | 32 | 64 MB | Large arrays, databases |
Large | 1 GB | 2 MB | 16 | 16 GB | ML models, in-memory DBs |
Huge Page Benefits:
- Reduced TLB pressure: 512x fewer TLB entries needed (2MB vs 4KB)
- Fewer page walks: Dramatically reduced page table traversals
- Better cache utilization: Page table entries stay hot longer
Huge Page Implementation:
#include <sys/mman.h>
#include <stdlib.h>
// Method 1: Transparent Huge Pages (system-wide)
// /sys/kernel/mm/transparent_hugepage/enabled = "always"
// Method 2: Explicit huge page allocation (requires reserved hugepages,
// e.g. via /proc/sys/vm/nr_hugepages)
void* huge_malloc(size_t size) {
    void* addr = mmap(NULL, size,
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                      -1, 0);
    if (addr == MAP_FAILED) return NULL;
    return addr;
}
// Method 3: madvise() hint for an existing mapping; allocate 2 MB-aligned so
// THP can back the whole range (madvise needs a page-aligned address)
void* buf = aligned_alloc(2 * 1024 * 1024, large_size);
madvise(buf, large_size, MADV_HUGEPAGE);
When to Use Huge Pages:
- Large contiguous allocations > 2MB (ML models, embeddings, KV caches)
- Read-heavy workloads with good spatial locality
- Long-lived allocations to amortize setup cost
When NOT to Use Huge Pages:
- Small, scattered allocations < 256KB
- Short-lived objects with high allocation/free churn
- Memory-constrained environments (internal fragmentation)
NUMA Effects on Memory Hierarchy
Local vs. Remote Memory Access:
┌─────────────────────────────────────────────────────────────┐
│ NUMA Node 0 │
│ ┌─────┐ ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │
│ │CPU 0│──│L1/L2/L3 │──│Memory │ │ │ │
│ │CPU 1│ │Caches │ │Controller│ │ Local DRAM │ │
│ └─────┘ └─────────┘ └─────────┘ │ (50-80 ns) │ │
└─────────────────────────────────────┴─────────────────────┘
│
┌────┴────┐ Inter-socket Link
│QPI/UPI │ (40-60 ns additional)
└────┬────┘
┌─────────────────────────────────────────────────────────────┐
│ NUMA Node 1 │
│ ┌─────┐ ┌─────────┐ ┌─────────┐ ┌─────────────────┐ │
│ │CPU 2│──│L1/L2/L3 │──│Memory │ │ │ │
│ │CPU 3│ │Caches │ │Controller│ │ Remote DRAM │ │
│ └─────┘ └─────────┘ └─────────┘ │ (100-150 ns) │ │
└─────────────────────────────────────┴─────────────────────┘
Practical optimization strategies
Memory Layout Optimization
Cache Line Alignment:
// BAD: False sharing: the two hot fields sit in the same 64-byte cache line
struct SharedDataBad {
    volatile int counter_a; // CPU 0 writes
    volatile int counter_b; // CPU 1 writes
};
// GOOD: Cache line separation
struct SharedDataGood {
    volatile int counter_a;
    char padding[60];       // Pad so counter_b starts on the next cache line
    volatile int counter_b;
} __attribute__((aligned(64))); // Each counter in its own cache line
Prefetch-Friendly Layout:
// Prefetch distance (in loop iterations) ≈ memory latency in cycles / cycles per loop iteration
int prefetch_distance = memory_latency_cycles / cycles_per_iteration;
for (int i = 0; i < N; i++) {
// Prefetch future data
if (i + prefetch_distance < N) {
__builtin_prefetch(&data[i + prefetch_distance], 0, 3);
}
// Process current data
process_data(&data[i]);
}
Memory Placement Optimization Techniques
First-Touch NUMA Policy:
// Ensure memory allocation matches thread placement (first-touch: the thread
// that first writes a page determines which NUMA node backs it).
// Assumes cores_per_node, large_array, array_size, initial_value are defined elsewhere.
#pragma omp parallel
{
    int thread_id  = omp_get_thread_num();
    int numa_node  = thread_id / cores_per_node;
    int local_core = thread_id % cores_per_node;   // core index within the socket
    // Bind thread to a core on its NUMA node
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(numa_node * cores_per_node + local_core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
    // First-touch allocation on the local node
    #pragma omp for
    for (int i = 0; i < array_size; i++) {
        large_array[i] = initial_value; // page faults allocate on the local node
    }
}
Page Coloring for LLC Conflict Avoidance:
// Align critical data structures to avoid LLC set conflicts
#define LLC_WAYS 16
#define LLC_SET_SIZE (64 * LLC_WAYS) // 1024 bytes per set
// Allocate arrays with offset to use different LLC sets
void* allocate_conflict_free(size_t size, int color) {
size_t offset = color * LLC_SET_SIZE;
void* base = aligned_alloc(4096, size + offset + 4096);
return (char*)base + offset;
}
💡 Key Takeaways:
- TLB misses cost 4-5x more than cache misses due to page walk overhead
- Huge pages provide 100-500x TLB reach improvement for large datasets
- Cache policies fundamentally change behavior - inclusive vs exclusive affects eviction patterns
- NUMA topology awareness can improve performance by 2-3x for memory-bound workloads
- Always profile memory hierarchy behavior with hardware counters before optimizing
3) Prefetching (HW + SW)
Hardware prefetchers: next‑line, stride, delta‑correlation, spatial. Too aggressive ⇒ pollution & bandwidth waste.
Software prefetch: bring data just‑in‑time.
Estimating prefetch distance (rule‑of‑thumb):
Let memory latency be L cycles, the loop body issue I iterations/cycle on average, each iteration touch B bytes, and sustainable bandwidth be BW bytes/cycle. To overlap the latency:
distance ≈ ceil( L / max(1, (BW / B) / I) )
Pseudocode:
for (int i = 0; i < N; i += 16) {
__builtin_prefetch(A + i + PDIST, 0, 3); // read-mostly
// compute on A[i..i+15]
}
Measure pf_hits / pf_issued, LLC MPKI, and DRAM BW before/after.
4) Coherency & sharing
Protocols: MESI/MOESI; broadcast snoop vs. directory; inclusive LLC acts as probe filter on some systems.
Pathologies: false sharing; migratory objects; producer/consumer bouncing.
Mitigations
- Avoid fine‑grained write sharing; shard counters; per‑thread buffers with combining (see the sketch after this list).
- Pad shared structs to cache line size; use SPSC/MPSC rings for message‑passing.
- For read‑mostly data, prefer read‑only segments or duplicate per‑socket copies.
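A minimal sketch of the "shard counters, pad to a cache line" advice using C11 relaxed atomics; MAX_THREADS and the layout are assumptions for illustration, not a specific library's API:
#include <stdatomic.h>
#include <stdint.h>
#define CACHE_LINE  64
#define MAX_THREADS 64   // assumed upper bound on worker threads
// One counter per thread, each padded out to its own cache line.
struct padded_counter {
    _Atomic uint64_t value;
    char pad[CACHE_LINE - sizeof(_Atomic uint64_t)];
} __attribute__((aligned(CACHE_LINE)));
static struct padded_counter counters[MAX_THREADS];
// Writers touch only their own shard: no cache-line bouncing between cores.
static inline void count_event(int tid) {
    atomic_fetch_add_explicit(&counters[tid].value, 1, memory_order_relaxed);
}
// Readers pay the cost of summing shards, acceptable for occasional reads.
static uint64_t count_total(void) {
    uint64_t sum = 0;
    for (int i = 0; i < MAX_THREADS; i++)
        sum += atomic_load_explicit(&counters[i].value, memory_order_relaxed);
    return sum;
}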
5) NUMA effects
What hurts: remote DRAM latency, inter‑socket link contention, remote LLC misses.
Do this:
- Pin threads + memory (numactl -C … -m …, or libnuma from code; see the sketch below).
- Shard services per socket; route requests to local shard (stickiness).
- Enable automatic NUMA balancing only if phases are long and stable.
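A minimal libnuma sketch of binding an allocation to a node from code; link with -lnuma, and note that the 64 MB size and node 0 are assumed example values:
#include <numa.h>     // libnuma; link with -lnuma
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    size_t sz = 64UL << 20;   // 64 MB working set (assumed)
    int node = 0;             // assumed target node
    void *buf;
    if (numa_available() < 0) {
        buf = malloc(sz);                 // no NUMA support: plain allocation
    } else {
        buf = numa_alloc_onnode(sz, node); // pages policy-bound to `node`
    }
    if (!buf) { fprintf(stderr, "allocation failed\n"); return 1; }
    printf("allocated %zu bytes on node %d\n", sz, node);
    if (numa_available() < 0) free(buf); else numa_free(buf, sz);
    return 0;
}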
6) Vector/SIMD (CPU)
ISAs: SSE/AVX/AVX‑512, NEON/SVE/SVE2. Wider vectors increase throughput and power transients (some platforms down‑clock under heavy AVX‑512).
Make vectorization easy:
- Affine indexes + contiguous access, minimal control flow; consider predication rather than branches.
- Align loads/stores (e.g., 32/64‑byte); keep gather/scatter rare; use software pipelining to hide latencies.
- Use compiler diagnostics (e.g., -Rpass=loop-vectorize) and only drop to intrinsics when the compiler can't form the pattern (see the sketch below).
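A short sketch of a loop written so the vectorizer has an easy job: unit-stride, affine index, restrict-qualified pointers, branch-free body (saxpy is used here purely as an illustrative kernel):
#include <stddef.h>
// Compiles to packed SIMD at -O3 with clang/gcc; check with -Rpass=loop-vectorize.
void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];   // unit-stride loads/stores, no control flow
    }
}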
7) GPU occupancy & memory behavior
Execution: SIMT warps (e.g., 32 threads) scheduled by SM warp schedulers.
Occupancy: active warps per SM limited by registers, shared memory, blocks/SM, and architectural caps.
Occupancy math (example):
Assume per‑SM caps: 64 warps, 256 KB registers, 100 KB shared memory. Kernel uses 64 regs/thread, 48 KB shared/block, 256 threads/block (= 8 warps/block).
- Reg‑limited blocks = floor( (256 KB / (64 regs × 256 thr × 4 B)) ) = floor( (262144 / 65536) ) = 4 blocks.
- Smem‑limited blocks = floor(100 KB / 48 KB) = 2 blocks.
- Blocks cap ⇒ 2 blocks/SM × 8 warps = 16 active warps ⇒ occupancy = 16 / 64 = 25%.
Try reducing regs to 48 or smem to 32 KB to raise occupancy while watching ILP.
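The same arithmetic as a small sketch you can adapt; it uses the assumed caps from the example above and ignores real-GPU allocation granularities (register and shared-memory allocation units):
#include <stdio.h>
int main(void) {
    // Per-SM caps (assumed, matching the example)
    const long max_warps = 64;
    const long max_regs  = 256L * 1024 / 4;  // 256 KB register file = 64K 32-bit regs
    const long max_smem  = 100L * 1024;      // 100 KB shared memory, in bytes
    // Kernel resource usage (assumed)
    const long regs_per_thread   = 64;
    const long smem_per_block    = 48L * 1024;
    const long threads_per_block = 256;
    const long warps_per_block   = threads_per_block / 32;
    long reg_blocks  = max_regs / (regs_per_thread * threads_per_block); // 4
    long smem_blocks = max_smem / smem_per_block;                        // 2
    long warp_blocks = max_warps / warps_per_block;                      // 8
    long blocks = reg_blocks;
    if (smem_blocks < blocks) blocks = smem_blocks;
    if (warp_blocks < blocks) blocks = warp_blocks;
    double occupancy = (double)(blocks * warps_per_block) / (double)max_warps;
    printf("blocks/SM=%ld, active warps=%ld, occupancy=%.0f%%\n",
           blocks, blocks * warps_per_block, occupancy * 100.0);  // prints 25%
    return 0;
}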
Memory hierarchy (NVIDIA‑like): registers → shared/L1 (configurable split) → L2 → HBM/DRAM.
Best practices:
- Coalescing: map threadIdx.x to contiguous addresses.
- Shared memory: avoid bank conflicts; use async copy to stage tiles.
- L2 residency: temporal tiling; reuse windows that fit L2 to reduce HBM traffic.
- Profile warp stall reasons (memory dependency, barrier, dispatch) to focus fixes.
8) Interconnects, storage & I/O
NoC (on‑chip)
Topologies: ring, mesh, crossbar, custom hierarchical fabrics. Watch hotspot links; use QoS (virtual channels, priorities) for latency‑critical flows (e.g., CPU misses) vs. bulk DMA.
Off‑chip
- PCIe/CXL: bandwidth per direction ≈ lane_rate × lanes × encoding_efficiency (see the sketch below). Use it to size DMA batches and to decide between copy vs. map.
- Vendor links: NVLink/NVSwitch provide coherent or near‑coherent GPU‑GPU paths; design for collectives (all‑reduce) locality.
- Storage (NVMe): µs‑scale latency; increase queue depth and I/O size for throughput; avoid sync in hot path; io_uring helps.
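Plugging numbers into the bandwidth formula, a small sketch assuming a PCIe Gen4 x16 link with 128b/130b encoding:
#include <stdio.h>
int main(void) {
    double lane_rate_gt = 16.0;          // PCIe Gen4: 16 GT/s per lane (assumed)
    double lanes        = 16.0;          // x16 link
    double encoding     = 128.0 / 130.0; // 128b/130b encoding efficiency
    double gbps = lane_rate_gt * lanes * encoding;   // Gbit/s per direction
    printf("~%.0f Gb/s (~%.1f GB/s) per direction, before protocol overhead\n",
           gbps, gbps / 8.0);            // ~252 Gb/s, ~31.5 GB/s
    return 0;
}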
I/O checklist
- Batch small I/Os; use direct I/O for log‑structured stores.
- Tune interrupt moderation vs. latency (polling for micro‑bursts).
- Ensure IOMMU mappings are large and reused (reduce TLB pressure).
9) Worked examples
(A) Choosing prefetch distance
You measure: L2 miss latency ≈ 35 ns (≈105 cycles @ 3 GHz); the loop issues ≈0.5 iters/cycle; each iteration reads 64 B; aggregate DRAM BW ≈ 200 B/cycle.
(BW / B) / I = (200 / 64) / 0.5 = 6.25 ⇒ distance ≈ ceil(105 / 6.25) = 17.
Start with PDIST = 16–20 and verify the prefetch hit rate.
(B) NUMA shard sizing
2‑socket server; a request spends 100 µs on compute + 80 µs on memory (≈180 µs local); remote access adds ~40 µs on average, a >20% increase that shows up directly in p99. Enforce shard‑local routing and first‑touch; verify the p99 reduction.
10) Checklist & anti‑patterns
- ✅ Measure OI + bandwidth (roofline) before micro‑tuning kernels.
- ✅ Stabilize placement (NUMA pinning) before comparing results.
- ✅ Use huge pages for large, hot datasets and pad shared structs.
- ❌ Chasing 100% GPU occupancy at the expense of ILP.
- ❌ Over‑aggressive prefetching when memory BW is already saturated.
- ❌ Mixed small/large I/O without batching.
Further reading
- CUDA/ROCm tuning guides (occupancy, memory coalescing), Intel/ARM optimization manuals.
- CXL/NVLink/NVSwitch technical overviews for interconnect behavior.