Tools & Methods: Top-Down, CDRD, and Roofline
Turn counters and simple models into clear diagnoses and action items using systematic performance analysis methodologies
Practical Exercises
- Implement Top-Down analysis on CPU workloads
- Build automated roofline plotting for GPU kernels
- Create performance counter calibration harness
- Design visualization for bottleneck identification
Real-World Applications
- Systematic ML inference optimization
- Datacenter workload characterization
- Hardware evaluation and comparison
- Performance regression root cause analysis
Tools & Methods to Localize Bottlenecks — Top‑Down, CDRD, and Roofline
Goal: Turn counters and simple models into clear diagnoses and action items.
📋 Table of Contents
1) The Top‑Down Method (TMAM) — portable anatomy
2) Intel CDRD & doc hygiene
3) Roofline Modeling — from whiteboard to numbers
4) Automation snippets
5) Pitfalls
References (starter set)
1) The Top‑Down Method (TMAM) — portable anatomy
Principle: Every cycle (or slot) is classified into one of four buckets: Retiring, Front‑End Bound, Back‑End Bound, Bad Speculation. Each bucket expands into finer nodes (e.g., FE→i‑cache/itlb/decoder/uop‑cache; BE→core vs. memory→L1/L2/LLC/DRAM).
1.1 Mapping to your platform
Define slots = pipeline_width × cycles. Then map counters:
- Retiring: uops retired (or instructions × µops per inst).
- Bad speculation: branch mispredict penalties + machine clears + faults.
- Front‑End bound: i‑cache/ITLB misses, decode throughput misses, µop cache misses, fetch bubbles.
- Back‑End Core: scheduler full, RS/ROB full, port pressure, FPU/ALU utilization.
- Back‑End Memory: L1D/L2/LLC MPKI, load miss latency, LFB/line‑fill buffer stalls, DTLB misses, page walks.
Create a JSON mapping in your repo that spells out each node's formula from raw counters. Version it alongside simulator configs.
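A minimal sketch of such a mapping, written here as Python that dumps JSON (the event names and formulas are illustrative Level‑1 placeholders; substitute the events your microarchitecture and PerfMon JSON actually expose):
import json

# Illustrative Level-1 node formulas; event names vary by microarchitecture,
# so treat these strings as placeholders to be validated against vendor docs.
topdown_mapping = {
    "slots": "pipeline_width * CPU_CLK_UNHALTED.THREAD",
    "nodes": {
        "retiring": "UOPS_RETIRED.RETIRE_SLOTS / slots",
        "bad_spec": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS"
                    " + pipeline_width * INT_MISC.RECOVERY_CYCLES) / slots",
        "fe_bound": "IDQ_UOPS_NOT_DELIVERED.CORE / slots",
        "be_bound": "1 - retiring - bad_spec - fe_bound",
    },
}

with open("topdown_mapping.json", "w") as f:
    json.dump(topdown_mapping, f, indent=2)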
1.2 A minimal workflow
- Calibrate counters with microbenchmarks: pointer‑chases (memory), DGEMM (compute), branchy loops (speculation), tight i‑cache loops (FE).
- Capture counters in consistent ROIs and sampling windows (per phase).
- Render a 100% stacked bar (one per kernel/phase), and distill each into 2–3 bullets: why it is bound where it is, and what to try next.
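A sketch of the calibration check, assuming microbenchmark binaries named ./pointer_chase, ./dgemm_bench, ./branchy_loop, and ./icache_thrash (names and the 50% dominance threshold are placeholders):
# Each calibration microbenchmark should land overwhelmingly in one bucket.
EXPECTED_BUCKET = {
    "./pointer_chase": "be_mem",    # latency-bound pointer chase
    "./dgemm_bench":   "retiring",  # compute-bound DGEMM
    "./branchy_loop":  "bad_spec",  # data-dependent branches
    "./icache_thrash": "fe_bound",  # code footprint larger than the i-cache
}

def check_calibration(bench, pct, threshold=0.5):
    """pct: {bucket: fraction of slots} measured for bench (see section 4).

    If the expected bucket does not dominate, suspect the counter mapping
    rather than the workload, and fix the mapping before trusting real runs.
    """
    expected = EXPECTED_BUCKET[bench]
    if pct[expected] < threshold:
        raise RuntimeError(f"{bench}: expected {expected} to dominate, got {pct}")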
1.3 Quick interpretations
- FE‑bound + high icache/itlb misses → shrink code, huge pages, layout (PGO/LTO).
- Bad speculation dominates → simplify control, if‑convert, precompute branch conditions, avoid unpredictable divisions/mods in hot loops.
- BE‑core with port pressure → reduce unroll, split loads across ports, interleave independent ops.
- BE‑memory with L2/LLC MPKI → restructure access; blocking to raise reuse; software prefetch; NUMA locality.
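As one concrete instance of the last bullet, blocking restructures a loop nest so a small tile stays resident in cache; the pure‑Python matmul below is only a sketch of the access‑pattern change (real code would rely on BLAS or a vectorizing compiler):
def blocked_matmul(A, B, C, n, tile=64):
    """Blocked (tiled) n x n matmul over row-major lists; C must be zero-initialized.

    Keeping tile x tile sub-blocks of A, B, and C hot in cache raises reuse
    (and operational intensity) versus the untiled triple loop.
    """
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]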
1.4 Extending to GPUs
On GPUs the four buckets map onto analogous stall sources:
- FE analog: instruction fetch/issue limits, warp scheduler dispatch stalls.
- Bad speculation analog: divergence & replays.
- BE‑core: functional unit saturation, register file bank conflicts.
- BE‑memory: L1/shared bank conflicts, L2/DRAM latency/BW stalls.
2) Intel CDRD & doc hygiene
- Maintain a pointer file of exact versions/hashes for: Optimization Reference Manual, µarch counter guides, Metric definitions (PerfMon JSON), VTune metrics equations.
- Keep a small sanity harness that recomputes vendor "top metrics" from raw counters; alert on drift >5%.
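A sketch of that harness (the counters and the vendor value in the commented example are placeholders):
def check_drift(raw, vendor_value, formula, tol=0.05):
    """Recompute a vendor 'top metric' from raw counters and flag drift beyond tol.

    raw:          dict of raw counter values for the ROI
    vendor_value: the same metric as reported by the vendor tool
    formula:      callable taking the raw dict and returning the metric
    """
    ours = formula(raw)
    drift = abs(ours - vendor_value) / max(abs(vendor_value), 1e-12)
    if drift > tol:
        raise RuntimeError(f"metric drift {drift:.1%} exceeds {tol:.0%}: "
                           f"recomputed {ours:.4g} vs vendor {vendor_value:.4g}")
    return drift

# Example: IPC recomputed from raw counters vs a tool-reported value.
# check_drift({"instructions": 1.2e9, "cycles": 1.0e9}, 1.18,
#             lambda r: r["instructions"] / r["cycles"])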
3) Roofline Modeling — from whiteboard to numbers
Attainable performance = min( Peak_Compute , Sustained_BW × Operational_Intensity )
- Operational Intensity (OI): FLOPs / bytes moved to/from the chosen memory level. Use nested rooflines (L1/shared vs. L2 vs. DRAM) to focus reuse.
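The model itself is a one‑liner; a small helper (names are illustrative) keeps the memory level explicit:
def attainable_flops(peak_flops, sustained_bw, oi):
    """Roofline: attainable = min(peak compute, sustained bandwidth * OI)."""
    return min(peak_flops, sustained_bw * oi)

def nested_roofline(peak_flops, bw_by_level, oi_by_level):
    """Evaluate the roofline at several memory levels, e.g. {'L2': ..., 'DRAM': ...}."""
    return {lvl: attainable_flops(peak_flops, bw, oi_by_level[lvl])
            for lvl, bw in bw_by_level.items()}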
3.1 Procedure (CPU or GPU)
- Measure sustained bandwidth (STREAM, device memcopies, NCCL collectives).
- Measure or compute peak compute (ISA width × pipes × freq × ops/cycle).
- Compute OI of each hot kernel; include read and write traffic; be explicit about the level (DRAM? L2?).
- Plot points with variance bars (repeat runs).
- Annotate changes that move points: tiling/fusion/compression (↑ OI), frequency/DVFS/core count (↑ roof), channel count/HBM (↑ BW).
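A sketch of the plotting step (matplotlib; the kernel dictionary, axis range, and file name are placeholders, and repeat‑run variance bars can be added via errorbar):
import numpy as np
import matplotlib.pyplot as plt

def plot_roofline(peak_flops, sustained_bw, kernels, fname="roofline.png"):
    """kernels: {name: (oi_in_FLOP_per_byte, measured_FLOP_per_s)}."""
    oi = np.logspace(-2, 2, 256)                      # log-spaced OI axis
    roof = np.minimum(peak_flops, sustained_bw * oi)  # min(compute roof, BW roof)
    fig, ax = plt.subplots()
    ax.loglog(oi, roof, label="roofline")
    for name, (k_oi, k_perf) in kernels.items():
        ax.loglog(k_oi, k_perf, "o", label=name)
    ax.set_xlabel("Operational intensity (FLOP/B)")
    ax.set_ylabel("Performance (FLOP/s)")
    ax.legend()
    fig.savefig(fname, dpi=150)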
3.2 Worked example
Platform: peak FP32 60 TF/s; sustained DRAM BW 1.2 TB/s.
Kernel: 120 GF, 300 GB memory traffic ⇒ OI = 120e9 / 300e9 = 0.4 FLOP/B.
Bandwidth roof at OI: 1.2e12 B/s × 0.4 FLOP/B = 0.48e12 FLOP/s = 0.48 TF/s ⇒ BW‑bound (~0.8% of peak compute).
Action: block for reuse (↑OI), quantize/compress, prefetch, consider HBM if external.
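Plugging the worked numbers into the attainable_flops helper above reproduces the diagnosis:
attainable_flops(peak_flops=60e12, sustained_bw=1.2e12, oi=0.4)  # 4.8e11 FLOP/s = 0.48 TF/s
# 0.48e12 / 60e12 ≈ 0.8% of the compute roof → firmly bandwidth-bound.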
3.3 Communicating
One slide per kernel: Top‑Down bar, Roofline point, Before/After table with change → metric deltas (e.g., "tiling 128× → OI 0.4→1.3, perf 0.48→1.6 TF/s").
4) Automation snippets
Linux perf to CSV:
perf stat -x, -o perf.csv -e cycles,instructions,branches,branch-misses,cache-misses,cache-references -- sleep 10
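A minimal parser for that CSV (perf's -x, layout puts the count in the first field and the event name in the third; comment and "<not counted>" rows are skipped):
import csv

def parse_perf_csv(path):
    """Read `perf stat -x, -o path` output into {event_name: count}."""
    counts = {}
    with open(path) as f:
        for row in csv.reader(f):
            if len(row) < 3 or row[0].startswith("#"):
                continue  # skip header comments and blank/malformed rows
            value, event = row[0], row[2]
            try:
                counts[event] = float(value)
            except ValueError:
                pass      # "<not counted>" / "<not supported>"
    return counts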
Python sketch to compute Top‑Down buckets (inputs are raw counter values for the ROI, expressed in slot units):
slots = width * cycles  # total issue slots in the region of interest
buckets = {
    "retiring": uops_retired,
    "bad_spec": branch_mispred_penalty + machine_clears,
    "fe_bound": icache_stall + itlb_stall + decode_stall + uopcache_miss,
    "be_core":  rs_full + rob_full + port_pressure - overlap_terms,
    "be_mem":   l1d_miss_penalty + l2_miss_penalty + llc_miss_penalty + dtlb_walk,
}
pct = {name: value / slots for name, value in buckets.items()}  # fraction of slots per bucket
5) Pitfalls
- Using theoretical roofs instead of sustained, measured ones (misleading).
- Ignoring writes in OI (common mistake).
- Mixing unwarmed regions with steady‑state measurements.
- Treating vendor‑tool "derived metrics" as ground truth—always recompute from raw counters.
References (starter set)
- Yasin, A. "A Top‑Down Method for Performance Analysis and Counters Architecture." ISPASS 2014.
- Intel PerfMon event JSON files and VTune metrics documentation.
- Williams, Waterman, and Patterson. "Roofline: An Insightful Visual Performance Model for Multicore Architectures." CACM, 2009.