
Tools & Methods: Top-Down, CDRD, and Roofline

Turn counters and simple models into clear diagnoses and action items using systematic performance analysis methodologies

Expert · Performance · 150 min
4 Exercises · 5 Tools · 4 Applications · 4 Min Read

Practical Exercises

  • Implement Top-Down analysis on CPU workloads
  • Build automated roofline plotting for GPU kernels
  • Create performance counter calibration harness
  • Design visualization for bottleneck identification

Tools Required

  • Intel VTune
  • perf
  • NVIDIA Nsight
  • Python/matplotlib
  • Performance counters

Real-World Applications

  • Systematic ML inference optimization
  • Datacenter workload characterization
  • Hardware evaluation and comparison
  • Performance regression root cause analysis

Tools & Methods to Localize Bottlenecks — Top‑Down, CDRD, and Roofline

Goal: Turn counters and simple models into clear diagnoses and action items.



1) The Top‑Down Method (TMAM) — portable anatomy

Principle: Every pipeline slot (issue width × cycles) is classified into one of four buckets: Retiring, Front‑End Bound, Back‑End Bound, Bad Speculation. Each bucket expands into finer nodes (e.g., FE → i‑cache/ITLB/decoder/µop‑cache; BE → core vs. memory → L1/L2/LLC/DRAM).

1.1 Mapping to your platform

Define slots = pipeline_width × cycles. Then map counters:

  • Retiring: uops retired (or instructions × µops per inst).
  • Bad speculation: branch mispredict penalties + machine clears + faults.
  • Front‑End bound: i‑cache/ITLB misses, decode throughput misses, µop cache misses, fetch bubbles.
  • Back‑End Core: scheduler full, RS/ROB full, port pressure, FPU/ALU utilization.
  • Back‑End Memory: L1D/L2/LLC MPKI, load miss latency, LFB/line‑fill buffer stalls, DTLB misses, page walks.

Create a JSON mapping in your repo that spells out each node's formula from raw counters. Version it alongside simulator configs.
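
A minimal sketch of such a mapping, generated from Python so it can live next to the analysis scripts; the counter and node names below are placeholders for your platform's actual PerfMon events, not real event names:

import json

# Illustrative Top-Down node formulas expressed over raw counter names.
# Replace the placeholder counter names with your platform's actual events.
topdown_mapping = {
    "slots": "pipeline_width * cycles",
    "retiring": "uops_retired / slots",
    "bad_speculation": "(branch_mispred_penalty + machine_clears) / slots",
    "frontend_bound": "(icache_stall + itlb_stall + decode_stall + uopcache_miss) / slots",
    "backend_core": "(rs_full + rob_full + port_pressure) / slots",
    "backend_memory": "(l1d_miss_penalty + l2_miss_penalty + llc_miss_penalty + dtlb_walk) / slots",
}

with open("topdown_mapping.json", "w") as f:
    json.dump(topdown_mapping, f, indent=2)  # commit this file next to the simulator configs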

1.2 A minimal workflow

  1. Calibrate counters with microbenchmarks: pointer‑chases (memory), DGEMM (compute), branchy loops (speculation), tight i‑cache loops (FE).
  2. Capture counters in consistent ROIs and sampling windows (per phase).
  3. Render a 100% stacked bar (one per kernel/phase). Distill into 2–3 bullets of why and what to try next.
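
A minimal matplotlib sketch for step 3, assuming the per-kernel bucket fractions have already been computed (the kernel names and values below are made up for illustration):

import matplotlib.pyplot as plt

# Per-kernel Top-Down fractions; each column should sum to ~1.0.
kernels = ["gemm", "attention", "embedding"]
buckets = {
    "Retiring":        [0.55, 0.30, 0.15],
    "Bad Speculation": [0.05, 0.10, 0.05],
    "Front-End Bound": [0.10, 0.15, 0.10],
    "Back-End Bound":  [0.30, 0.45, 0.70],
}

bottom = [0.0] * len(kernels)
for name, fractions in buckets.items():
    plt.bar(kernels, fractions, bottom=bottom, label=name)  # stack each bucket on top of the previous
    bottom = [b + f for b, f in zip(bottom, fractions)]

plt.ylabel("Fraction of slots")
plt.legend()
plt.savefig("topdown_stacked.png")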

1.3 Quick interpretations

  • FE‑bound + high icache/itlb misses → shrink code, huge pages, layout (PGO/LTO).
  • Bad speculation dominates → simplify control, if‑convert, precompute branch conditions, avoid unpredictable divisions/mods in hot loops.
  • BE‑core with port pressure → reduce unroll, split loads across ports, interleave independent ops.
  • BE‑memory with L2/LLC MPKI → restructure access; blocking to raise reuse; software prefetch; NUMA locality.

1.4 Extending to GPUs

On GPUs the four buckets still map:

  • FE analog: instruction fetch/issue limits, warp scheduler dispatch stalls.
  • Bad speculation analog: divergence & replays.
  • BE‑core: functional unit saturation, register file bank conflicts.
  • BE‑memory: L1/shared bank conflicts, L2/DRAM latency/BW stalls.

2) Intel CDRD & doc hygiene

  • Maintain a pointer file of exact versions/hashes for: Optimization Reference Manual, µarch counter guides, Metric definitions (PerfMon JSON), VTune metrics equations.
  • Keep a small sanity harness that recomputes vendor "top metrics" from raw counters; alert on drift >5%.
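
A sketch of such a harness, assuming you have both the vendor-reported metric and the raw counters needed to recompute it; the 5% threshold comes from the note above, everything else is illustrative:

def check_drift(name, vendor_value, recomputed_value, tolerance=0.05):
    """Flag vendor-derived metrics that disagree with a recomputation from raw counters."""
    drift = abs(recomputed_value - vendor_value) / max(abs(vendor_value), 1e-12)
    if drift > tolerance:
        print(f"ALERT: {name} drifts {drift:.1%} from the raw-counter recomputation")
    return drift

# Example: IPC as reported by the tool vs. instructions/cycles from raw counters (placeholder values).
raw = {"instructions": 8.1e9, "cycles": 3.0e9}
check_drift("IPC", vendor_value=2.50, recomputed_value=raw["instructions"] / raw["cycles"])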

3) Roofline Modeling — from whiteboard to numbers

Attainable performance = min( Peak_Compute , Sustained_BW × Operational_Intensity )

  • Operational Intensity (OI): FLOPs / bytes moved to/from the chosen memory level. Use nested rooflines (L1/shared vs. L2 vs. DRAM) to pinpoint which level of reuse to improve.

3.1 Procedure (CPU or GPU)

  1. Measure sustained bandwidth (STREAM, device memcopies, NCCL collectives).
  2. Measure or compute peak compute (ISA width × pipes × freq × ops/cycle).
  3. Compute OI of each hot kernel; include read and write traffic; be explicit about the level (DRAM? L2?).
  4. Plot points with variance bars (repeat runs).
  5. Annotate changes that move points: tiling/fusion/compression (↑ OI), frequency/DVFS/core count (↑ roof), channel count/HBM (↑ BW).
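
A small helper that packages steps 1–4 for one kernel at one memory level; the function name and inputs are illustrative, fed by whatever your bandwidth benchmark and counter collection produced:

def roofline_point(flops, bytes_moved, peak_flops, sustained_bw):
    """Operational intensity and attainable performance for one kernel at one memory level."""
    oi = flops / bytes_moved                         # FLOP per byte at the chosen level
    attainable = min(peak_flops, sustained_bw * oi)  # roofline: min(compute roof, BW roof at this OI)
    bound = "compute" if sustained_bw * oi >= peak_flops else "bandwidth"
    return oi, attainable, bound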

3.2 Worked example

Platform: peak FP32 60 TF/s; sustained DRAM BW 1.2 TB/s.
Kernel: 120 GFLOP of work, 300 GB of memory traffic ⇒ OI = 120e9 / 300e9 = 0.4 FLOP/B.
Bandwidth roof at that OI: 1.2e12 B/s × 0.4 FLOP/B = 0.48e12 FLOP/s = 0.48 TF/s ⇒ BW‑bound (~0.8% of peak compute).
Action: block for reuse (↑OI), quantize/compress, prefetch, consider HBM if external.
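
The same arithmetic in a few lines of Python, as a quick sanity check of the numbers above:

peak, bw = 60e12, 1.2e12            # FP32 peak (FLOP/s) and sustained DRAM BW (B/s)
oi = 120e9 / 300e9                  # 0.4 FLOP/B
attainable = min(peak, bw * oi)     # 0.48e12 FLOP/s -> bandwidth-bound
print(f"OI={oi:.2f} FLOP/B, attainable={attainable / 1e12:.2f} TF/s ({attainable / peak:.1%} of peak)")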

3.3 Communicating

One slide per kernel: Top‑Down bar, Roofline point, Before/After table with change → metric deltas (e.g., "tiling 128× → OI 0.4→1.3, perf 0.48→1.6 TF/s").


4) Automation snippets

Linux perf to CSV:

# perf stat writes counters to stderr, so capture the CSV with -o (or 2>); replace "sleep 10" with your workload
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,cache-references -x, -o perf.csv -- sleep 10
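
A short parsing sketch, assuming the -x, CSV layout where the counter value is the first field and the event name the third (the exact column set varies a little across perf versions):

import csv

# Read "perf stat -x, -o perf.csv" output into {event_name: value}.
counters = {}
with open("perf.csv") as f:
    for row in csv.reader(f):
        if not row or row[0].startswith("#"):            # skip comments and blank lines
            continue
        value, event = row[0], row[2]
        if value not in ("<not supported>", "<not counted>"):
            counters[event] = float(value)

print(counters.get("cycles"), counters.get("instructions"))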

Python sketch to compute Top‑Down bucket fractions from raw counters (the counter names are placeholders for your platform's mapping):

def topdown_buckets(width, cycles, c):
    """Return each Top-Down bucket as a fraction of total issue slots."""
    slots = width * cycles
    buckets = {
        "retiring": c["uops_retired"],
        "bad_spec": c["branch_mispred_penalty"] + c["machine_clears"],
        "fe_bound": c["icache_stall"] + c["itlb_stall"] + c["decode_stall"] + c["uopcache_miss"],
        "be_core":  c["rs_full"] + c["rob_full"] + c["port_pressure"] - c["overlap_terms"],
        "be_mem":   c["l1d_miss_penalty"] + c["l2_miss_penalty"] + c["llc_miss_penalty"] + c["dtlb_walk"],
    }
    return {name: value / slots for name, value in buckets.items()}
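
An illustrative call with placeholder counter values (real values come from your perf/VTune exports):

c = {"uops_retired": 3.2e9, "branch_mispred_penalty": 0.4e9, "machine_clears": 0.05e9,
     "icache_stall": 0.3e9, "itlb_stall": 0.1e9, "decode_stall": 0.2e9, "uopcache_miss": 0.15e9,
     "rs_full": 0.5e9, "rob_full": 0.3e9, "port_pressure": 0.4e9, "overlap_terms": 0.2e9,
     "l1d_miss_penalty": 0.6e9, "l2_miss_penalty": 0.5e9, "llc_miss_penalty": 0.8e9, "dtlb_walk": 0.1e9}
print(topdown_buckets(width=4, cycles=2.0e9, c=c))  # fractions of 8e9 issue slots per bucket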

5) Pitfalls

  • Using theoretical roofs instead of sustained (misleads).
  • Ignoring writes in OI (common mistake).
  • Mixing unwarmed regions with steady‑state measurements.
  • Treating vendor‑tool "derived metrics" as ground truth—always recompute from raw counters.

References (starter set)

  • Yasin, A., "A Top‑Down Method for Performance Analysis and Counters Architecture," ISPASS 2014.
  • Intel PerfMon event/metric JSON definitions and VTune metrics documentation.
  • Williams, Waterman, Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," CACM 2009.
#Top-Down #roofline #performance-analysis #profiling #bottleneck-analysis