Validation & Measurement — Trust, But Verify
Cross-validate models with real counters, quantify uncertainty, and communicate limits in performance analysis
Practical Exercises
- Hardware counter correlation with simulator output
- eBPF-based system bottleneck identification
- Performance model validation methodology
- Statistical confidence interval calculation
Tools Required
- perf
- eBPF/bpftrace
- VTune
- Hardware counters
- Statistical analysis tools
Real-World Applications
- Performance model validation for new processors
- Production system bottleneck diagnosis
- Benchmark result verification
- Hardware counter-based optimization
Goal: Cross‑validate models with real counters, quantify uncertainty, and communicate limits.
📋 Table of Contents
1) The triangle check
2) Linux perf / eBPF / friends
3) Communicating limits
4) Quick validations
References
1) The triangle check
Simulator ↔ hardware counters/profilers ↔ application‑level KPIs. If two sides disagree, investigate before drawing conclusions.
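As a minimal sketch of one side of that triangle, the commands below pull measured IPC from hardware counters and compare it with a simulator prediction. SIM_IPC and the comparison itself are placeholders for your own model output; event-name spellings and the CSV column layout can shift a little across perf versions.
# measured IPC from counters vs. a (placeholder) simulator prediction
SIM_IPC=1.42    # replace with the IPC your model reports for the same ROI and inputs
perf stat -x, -e cycles,instructions -o ipc.csv -- ./app --args
awk -F, -v sim="$SIM_IPC" '
  $3 == "cycles"       { c = $1 }
  $3 == "instructions" { i = $1 }
  END { ipc = i / c;
        printf "measured IPC %.3f vs simulated %.3f (%+.1f%%)\n",
               ipc, sim, 100 * (ipc - sim) / sim }' ipc.csv
If the gap exceeds your calibration tolerance, or the application-level KPI moves in the opposite direction, investigate before trusting either number.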
2) Linux perf / eBPF / friends
- perf stat for macro counters; perf record/report for hotspots (annotate).
- eBPF with bcc/bpftrace to trace syscalls, the scheduler, and TCP/IO; build USE (Utilization, Saturation, Errors) dashboards.
- Pin ROIs with markers (usdt/user probes) to align with simulator ROIs.
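A minimal sketch of ROI pinning with user-space probes, assuming the binary is built with symbols and exposes (hypothetical) marker functions roi_begin/roi_end:
# add uprobes on the (hypothetical) ROI marker functions
perf probe -x ./app roi_begin
perf probe -x ./app roi_end
# record counters together with the markers; slice by marker timestamps afterwards
perf record -e cycles,probe_app:roi_begin,probe_app:roi_end -- ./app --args
perf script | head
# clean up the probes when done
perf probe -d 'probe_app:*'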
Counter set (starter):
- cycles, instructions, branches, branch-misses, cache-references, cache-misses.
- L1D/L2/LLC misses & refills; dTLB/iTLB loads & misses; page walks.
- Offcore responses: local vs. remote DRAM, prefetch vs. demand.
- Memory BW (uncore counters), throttling/thermal flags.
Examples
# macro view
perf stat -e cycles,instructions,branches,branch-misses,cache-misses ./app --args
# profile FE stalls or branchy code
perf record -e branch-misses -g -- ./app --args
perf report --stdio
# eBPF TCP latency histogram (bpftrace)
bpftrace -e 'kprobe:tcp_recvmsg { @ns[tid] = nsecs; } kretprobe:tcp_recvmsg /@ns[tid]/ { @d = hist(nsecs - @ns[tid]); delete(@ns[tid]); }'
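Two further examples in the same vein; exact TLB/LLC and uncore event names vary by CPU and kernel, so check perf list on the target first (the uncore IMC unit below is an Intel-server example).
# TLB and last-level cache detail (generic event names; availability varies)
perf stat -e dTLB-load-misses,iTLB-load-misses,LLC-loads,LLC-load-misses ./app --args
# memory bandwidth from uncore IMC counters (system-wide; unit names differ per platform)
perf stat -a -e 'uncore_imc_0/cas_count_read/,uncore_imc_0/cas_count_write/' -- sleep 10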
Correlate with MPKI, miss latency, measured BW (e.g., pcm-memory, perf uncore_imc/*/), and Top‑Down percentages.
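MPKI itself is quick to derive from raw counts; a small sketch (CSV columns as emitted by perf stat -x,, which may drift across perf versions):
# MPKI = misses per thousand retired instructions
perf stat -x, -e instructions,cache-misses -o mpki.csv -- ./app --args
awk -F, '$3=="instructions"{i=$1} $3=="cache-misses"{m=$1} END{printf "cache-miss MPKI: %.2f\n", 1000*m/i}' mpki.csv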
3) Communicating limits
- State what the model cannot see (firmware throttling, PCIe back‑pressure, OS jitter).
- Separate calibrated from extrapolated claims.
- Provide error bars and rerun counts; show before/after with the same inputs and placement.
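One low-effort way to get both reruns and spread, since perf stat prints the mean and relative stddev across repeats:
# 10 repeats of the same binary and inputs; perf reports "( +- x% )" per counter
perf stat -r 10 -e cycles,instructions,cache-misses -- ./app --args
Report the repeat count and the spread alongside the headline number, not just the mean.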
4) Quick validations
- Memory‑bound? MPKI↑ + miss latency↑ + BW near roofline ⇒ yes.
- Compute‑bound? Port pressure/issue utilization high + close to compute roof.
- Storage‑bound? fio with same queue depths & sizes; check p99, not just the mean.
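A sketch of a matched fio run; the device path, block size, and queue depth here are placeholders to mirror from the modeled configuration, and fio's default output already includes clat percentiles (p99 among them).
# 4 KiB random reads at a fixed queue depth against a placeholder device
fio --name=randread --filename=/dev/nvme0n1 --direct=1 --rw=randread \
    --bs=4k --iodepth=32 --numjobs=1 --runtime=60 --time_based --group_reporting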
References
- Brendan Gregg's USE method; Linux perf manpages; eBPF tutorials.
#validation #measurement #perf #eBPF #hardware-counters #verification