Tools & Methods: Top-Down, CDRD, and Roofline
Turn counters and simple models into clear diagnoses and action items using systematic performance analysis methodologies
Practical Exercises
- Implement Top-Down analysis on CPU workloads
- Build automated roofline plotting for GPU kernels
- Create performance counter calibration harness
- Design visualization for bottleneck identification
Real-World Applications
- Systematic ML inference optimization
- Datacenter workload characterization
- Hardware evaluation and comparison
- Performance regression root cause analysis
Tools & Methods to Localize Bottlenecks — Top‑Down, CDRD, and Roofline
Goal: Turn counters and simple models into clear diagnoses and action items.
📋 Table of Contents
1) The Top‑Down Method (TMAM) — portable anatomy
2) Intel CDRD & doc hygiene
3) Roofline Modeling — from whiteboard to numbers
4) Automation snippets
5) Pitfalls
References (starter set)
1) The Top‑Down Method (TMAM) — portable anatomy
Principle: Every cycle (or slot) is classified into one of four buckets: Retiring, Front‑End Bound, Back‑End Bound, Bad Speculation. Each bucket expands into finer nodes (e.g., FE→i‑cache/itlb/decoder/uop‑cache; BE→core vs. memory→L1/L2/LLC/DRAM).
1.1 Mapping to your platform
Define slots = pipeline_width × cycles. Then map counters:
- Retiring: uops retired (or instructions × µops per inst).
- Bad speculation: branch mispredict penalties + machine clears + faults.
- Front‑End bound: i‑cache/ITLB misses, decode throughput misses, µop cache misses, fetch bubbles.
- Back‑End Core: scheduler full, RS/ROB full, port pressure, FPU/ALU utilization.
- Back‑End Memory: L1D/L2/LLC MPKI, load miss latency, LFB/line‑fill buffer stalls, DTLB misses, page walks.
Create a JSON mapping in your repo that spells out each node's formula from raw counters. Version it alongside simulator configs.
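A minimal sketch of such a mapping, written here as Python that dumps JSON (the event names and formulas are illustrative Level‑1 placeholders; substitute the events your microarchitecture and PerfMon JSON actually expose):
import json

# Illustrative Level-1 node formulas; event names vary by microarchitecture,
# so treat these strings as placeholders to be validated against vendor docs.
topdown_mapping = {
    "slots": "pipeline_width * CPU_CLK_UNHALTED.THREAD",
    "nodes": {
        "retiring": "UOPS_RETIRED.RETIRE_SLOTS / slots",
        "bad_spec": "(UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS"
                    " + pipeline_width * INT_MISC.RECOVERY_CYCLES) / slots",
        "fe_bound": "IDQ_UOPS_NOT_DELIVERED.CORE / slots",
        "be_bound": "1 - retiring - bad_spec - fe_bound",
    },
}

with open("topdown_mapping.json", "w") as f:
    json.dump(topdown_mapping, f, indent=2)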
1.2 A minimal workflow
- Calibrate counters with microbenchmarks: pointer‑chases (memory), DGEMM (compute), branchy loops (speculation), tight i‑cache loops (FE).
- Capture counters in consistent ROIs and sampling windows (per phase).
- Render a 100% stacked bar (one per kernel/phase), and distill each into 2–3 bullets: why it is bound where it is, and what to try next.
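A sketch of the calibration check, assuming microbenchmark binaries named ./pointer_chase, ./dgemm_bench, ./branchy_loop, and ./icache_thrash (names and the 50% dominance threshold are placeholders):
# Each calibration microbenchmark should land overwhelmingly in one bucket.
EXPECTED_BUCKET = {
    "./pointer_chase": "be_mem",    # latency-bound pointer chase
    "./dgemm_bench":   "retiring",  # compute-bound DGEMM
    "./branchy_loop":  "bad_spec",  # data-dependent branches
    "./icache_thrash": "fe_bound",  # code footprint larger than the i-cache
}

def check_calibration(bench, pct, threshold=0.5):
    """pct: {bucket: fraction of slots} measured for bench (see section 4).

    If the expected bucket does not dominate, suspect the counter mapping
    rather than the workload, and fix the mapping before trusting real runs.
    """
    expected = EXPECTED_BUCKET[bench]
    if pct[expected] < threshold:
        raise RuntimeError(f"{bench}: expected {expected} to dominate, got {pct}")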
1.3 Quick interpretations
- FE‑bound + high icache/itlb misses → shrink code, huge pages, layout (PGO/LTO).
- Bad speculation dominates → simplify control, if‑convert, precompute branch conditions, avoid unpredictable divisions/mods in hot loops.
- BE‑core with port pressure → reduce unroll, split loads across ports, interleave independent ops.
- BE‑memory with L2/LLC MPKI → restructure access; blocking to raise reuse; software prefetch; NUMA locality.
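As one concrete instance of the last bullet, blocking restructures a loop nest so a small tile stays resident in cache; the pure‑Python matmul below is only a sketch of the access‑pattern change (real code would rely on BLAS or a vectorizing compiler):
def blocked_matmul(A, B, C, n, tile=64):
    """Blocked (tiled) n x n matmul over row-major lists; C must be zero-initialized.

    Keeping tile x tile sub-blocks of A, B, and C hot in cache raises reuse
    (and operational intensity) versus the untiled triple loop.
    """
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]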
1.4 Extending to GPUs
On GPUs the four buckets map onto analogous stall sources:
- FE analog: instruction fetch/issue limits, warp scheduler dispatch stalls.
- Bad speculation analog: divergence & replays.
- BE‑core: functional unit saturation, register file bank conflicts.
- BE‑memory: L1/shared bank conflicts, L2/DRAM latency/BW stalls.
2) Intel CDRD & doc hygiene
- Maintain a pointer file of exact versions/hashes for: Optimization Reference Manual, µarch counter guides, Metric definitions (PerfMon JSON), VTune metrics equations.
- Keep a small sanity harness that recomputes vendor "top metrics" from raw counters; alert on drift >5%.
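A sketch of that harness (the counters and the vendor value in the commented example are placeholders):
def check_drift(raw, vendor_value, formula, tol=0.05):
    """Recompute a vendor 'top metric' from raw counters and flag drift beyond tol.

    raw:          dict of raw counter values for the ROI
    vendor_value: the same metric as reported by the vendor tool
    formula:      callable taking the raw dict and returning the metric
    """
    ours = formula(raw)
    drift = abs(ours - vendor_value) / max(abs(vendor_value), 1e-12)
    if drift > tol:
        raise RuntimeError(f"metric drift {drift:.1%} exceeds {tol:.0%}: "
                           f"recomputed {ours:.4g} vs vendor {vendor_value:.4g}")
    return drift

# Example: IPC recomputed from raw counters vs a tool-reported value.
# check_drift({"instructions": 1.2e9, "cycles": 1.0e9}, 1.18,
#             lambda r: r["instructions"] / r["cycles"])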
3) Roofline Modeling — from whiteboard to numbers
Attainable performance = min( Peak_Compute , Sustained_BW × Operational_Intensity )
- Operational Intensity (OI): FLOPs / bytes moved to/from the chosen memory level. Use nested rooflines (L1/shared vs. L2 vs. DRAM) to focus reuse.
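The model itself is a one‑liner; a small helper (names are illustrative) keeps the memory level explicit:
def attainable_flops(peak_flops, sustained_bw, oi):
    """Roofline: attainable = min(peak compute, sustained bandwidth * OI)."""
    return min(peak_flops, sustained_bw * oi)

def nested_roofline(peak_flops, bw_by_level, oi_by_level):
    """Evaluate the roofline at several memory levels, e.g. {'L2': ..., 'DRAM': ...}."""
    return {lvl: attainable_flops(peak_flops, bw, oi_by_level[lvl])
            for lvl, bw in bw_by_level.items()}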
3.1 Procedure (CPU or GPU)
- Measure sustained bandwidth (STREAM, device memcopies, NCCL collectives).
- Measure or compute peak compute (ISA width × pipes × freq × ops/cycle).
- Compute OI of each hot kernel; include read and write traffic; be explicit about the level (DRAM? L2?).
- Plot points with variance bars (repeat runs).
- Annotate changes that move points: tiling/fusion/compression (↑ OI), frequency/DVFS/core count (↑ roof), channel count/HBM (↑ BW).
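A sketch of the plotting step (matplotlib; the kernel dictionary, axis range, and file name are placeholders, and repeat‑run variance bars can be added via errorbar):
import numpy as np
import matplotlib.pyplot as plt

def plot_roofline(peak_flops, sustained_bw, kernels, fname="roofline.png"):
    """kernels: {name: (oi_in_FLOP_per_byte, measured_FLOP_per_s)}."""
    oi = np.logspace(-2, 2, 256)                      # log-spaced OI axis
    roof = np.minimum(peak_flops, sustained_bw * oi)  # min(compute roof, BW roof)
    fig, ax = plt.subplots()
    ax.loglog(oi, roof, label="roofline")
    for name, (k_oi, k_perf) in kernels.items():
        ax.loglog(k_oi, k_perf, "o", label=name)
    ax.set_xlabel("Operational intensity (FLOP/B)")
    ax.set_ylabel("Performance (FLOP/s)")
    ax.legend()
    fig.savefig(fname, dpi=150)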
3.2 Worked example
Platform: peak FP32 60 TF/s; sustained DRAM BW 1.2 TB/s.
Kernel: 120 GF, 300 GB memory traffic ⇒ OI = 120e9 / 300e9 = 0.4 FLOP/B.
Bandwidth roof at OI: 1.2e12 B/s × 0.4 FLOP/B = 0.48e12 FLOP/s = 0.48 TF/s ⇒ BW‑bound (~0.8% of peak compute).
Action: block for reuse (↑OI), quantize/compress, prefetch, consider HBM if external.
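Plugging the worked numbers into the attainable_flops helper above reproduces the diagnosis:
attainable_flops(peak_flops=60e12, sustained_bw=1.2e12, oi=0.4)  # 4.8e11 FLOP/s = 0.48 TF/s
# 0.48e12 / 60e12 ≈ 0.8% of the compute roof → firmly bandwidth-bound.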
3.3 Communicating
One slide per kernel: Top‑Down bar, Roofline point, Before/After table with change → metric deltas (e.g., "tiling 128× → OI 0.4→1.3, perf 0.48→1.6 TF/s").
4) Automation snippets
Linux perf to CSV:
perf stat -x, -o perf.csv -e cycles,instructions,branches,branch-misses,cache-misses,cache-references -- sleep 10
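A minimal parser for that CSV (perf's -x, layout puts the count in the first field and the event name in the third; comment and "<not counted>" rows are skipped):
import csv

def parse_perf_csv(path):
    """Read `perf stat -x, -o path` output into {event_name: count}."""
    counts = {}
    with open(path) as f:
        for row in csv.reader(f):
            if len(row) < 3 or row[0].startswith("#"):
                continue  # skip header comments and blank/malformed rows
            value, event = row[0], row[2]
            try:
                counts[event] = float(value)
            except ValueError:
                pass      # "<not counted>" / "<not supported>"
    return counts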
Python sketch to compute Top‑Down buckets (inputs are raw counter values for the ROI, expressed in slot units):
slots = width * cycles  # total issue slots in the region of interest
buckets = {
    "retiring": uops_retired,
    "bad_spec": branch_mispred_penalty + machine_clears,
    "fe_bound": icache_stall + itlb_stall + decode_stall + uopcache_miss,
    "be_core":  rs_full + rob_full + port_pressure - overlap_terms,
    "be_mem":   l1d_miss_penalty + l2_miss_penalty + llc_miss_penalty + dtlb_walk,
}
pct = {name: value / slots for name, value in buckets.items()}  # fraction of slots per bucket
5) Pitfalls
- Using theoretical roofs instead of sustained, measured ones (misleading).
- Ignoring writes in OI (common mistake).
- Mixing unwarmed regions with steady‑state measurements.
- Treating vendor‑tool "derived metrics" as ground truth—always recompute from raw counters.
References (starter set)
- Yasin, A. "A Top‑Down Method for Performance Analysis and Counters Architecture." ISPASS 2014.
- Intel PerfMon event JSON files and VTune metrics documentation.
- Williams, Waterman, and Patterson. "Roofline: An Insightful Visual Performance Model for Multicore Architectures." CACM, 2009.