Performance Metrics & Analysis Foundations
Master essential performance metrics including IPC, throughput, latency, and bandwidth. Learn to identify bottlenecks using Amdahl's Law and the Roofline Model.
"You can't optimize what you can't measure" - this fundamental principle drives all performance engineering. Whether you're designing a new processor, optimizing software, or choosing between architectural alternatives, understanding performance metrics is essential! š
1. Core Performance Metrics
1.1 Instructions Per Cycle (IPC)
Definition: Average number of instructions completed per clock cycle.
IPC = Instructions Executed / Clock Cycles
Example:
// Execution of 1000 instructions takes 1250 cycles
IPC = 1000 / 1250 = 0.8
What it tells you:
- IPC < 1: Pipeline stalls, hazards, or serial execution
- IPC = 1: Perfect scalar pipeline (ideal 5-stage)
- IPC > 1: Superscalar execution (multiple instructions/cycle)
- IPC >> 1: Wide superscalar or VLIW architecture
Real-world examples:
- In-order CPU: IPC ≈ 0.5-1.0
- Out-of-order CPU (x86): IPC ≈ 1.5-4.0
- GPU SIMT core: IPC ≈ 0.3-0.6 (but thousands of concurrent threads!)
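On Linux you can measure IPC for a region of your own code directly from the hardware counters. Below is a minimal sketch using the perf_event_open syscall; workload() is a placeholder for whatever you want to profile, and error handling is kept to a bare minimum:

```c
// Sketch: measure IPC of a code region via Linux perf_event_open.
// Linux-only; compile with: gcc -O2 ipc.c -o ipc
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static int open_counter(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;            // which hardware event to count
    attr.disabled = 1;               // start disabled, enable explicitly
    attr.exclude_kernel = 1;         // count user-space only
    attr.exclude_hv = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

static void workload(void) {         // placeholder: the code you want to measure
    volatile double x = 1.0;
    for (long i = 0; i < 100000000L; i++) x = x * 1.000001 + 0.5;
}

int main(void) {
    int fd_instr  = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int fd_cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    if (fd_instr < 0 || fd_cycles < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd_instr, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_cycles, PERF_EVENT_IOC_ENABLE, 0);
    workload();
    ioctl(fd_instr, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd_cycles, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t instructions = 0, cycles = 0;
    read(fd_instr, &instructions, sizeof(instructions));
    read(fd_cycles, &cycles, sizeof(cycles));

    printf("instructions=%llu cycles=%llu IPC=%.2f\n",
           (unsigned long long)instructions, (unsigned long long)cycles,
           cycles ? (double)instructions / (double)cycles : 0.0);
    return 0;
}
```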
1.2 Throughput vs Latency
These are different concepts that are often confused!
Throughput: Work completed per unit time
Throughput = Total Work / Total Time
Example: Instructions per second (IPS), Transactions per second (TPS)
Analogy: Cars crossing a bridge
- Throughput = 100 cars/hour
Latency: Time for one operation to complete
Latency = Time from start to completion
Example: Time for one instruction, memory access time
Analogy: Time for one car to cross the bridge
- Latency = 1 minute
1.3 The Throughput-Latency Trade-off
Key Insight: You can improve throughput without reducing latency!
Example: Pipelining
Non-pipelined: 5 cycles per instruction
Latency: 5 cycles
Throughput: 1 inst / 5 cycles = 0.2 IPC
Pipelined: Still 5 cycles per instruction
Latency: 5 cycles (same!)
Throughput: 1 inst / 1 cycle = 1.0 IPC (5x better!)
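The same effect shows up in software via instruction-level parallelism. A hedged C sketch (exact timings depend on compiler flags and hardware): a chain of dependent multiplies pays the full multiply latency on every operation, while four independent chains let the pipelined FPU overlap operations, improving throughput even though each multiply's latency is unchanged.

```c
// Sketch: latency-bound (dependent chain) vs throughput-oriented (independent
// accumulators). Compile with: gcc -O2 lat_vs_thr.c -o lat_vs_thr
#include <stdio.h>
#include <time.h>

#define N 200000000L

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    // 1) Dependent chain: each multiply waits for the previous result,
    //    so total time ~ N * multiply latency.
    double t0 = now_sec();
    double a = 1.0;
    for (long i = 0; i < N; i++) a = a * 1.0000001;
    double t_dep = now_sec() - t0;

    // 2) Four independent chains: the CPU keeps several multiplies in flight,
    //    so time per operation approaches the multiply throughput.
    t0 = now_sec();
    double b0 = 1.0, b1 = 1.0, b2 = 1.0, b3 = 1.0;
    for (long i = 0; i < N; i += 4) {
        b0 *= 1.0000001; b1 *= 1.0000001;
        b2 *= 1.0000001; b3 *= 1.0000001;
    }
    double t_ind = now_sec() - t0;

    printf("dependent:   %.2f s (%.2f ns/op)\n", t_dep, t_dep / N * 1e9);
    printf("independent: %.2f s (%.2f ns/op)\n", t_ind, t_ind / N * 1e9);
    printf("results (prevent dead-code elimination): %g %g\n",
           a, b0 + b1 + b2 + b3);
    return 0;
}
```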
When each matters:
- Latency-critical: Real-time systems, user interaction, single-threaded code
- Throughput-critical: Batch processing, server workloads, parallel code
1.4 Bandwidth
Definition: Maximum data transfer rate
Bandwidth = Data Size / Time
Units: GB/s, MB/s, Gb/s
Memory Bandwidth Example:
DDR4-3200 RAM:
- Clock: 1600 MHz (double data rate → 3200 MT/s)
- Bus width: 64 bits
- Bandwidth = 3200 MT/s × 64 bits / 8 = 25.6 GB/s
Bandwidth Types:
Type | Typical Value | Use Case |
---|---|---|
L1 Cache | 1-2 TB/s | Register spills, hot data |
L2 Cache | 200-500 GB/s | Working set, shared data |
L3 Cache | 50-200 GB/s | Shared cache, victim cache |
DRAM | 25-100 GB/s | Main memory |
NVMe SSD | 3-7 GB/s | Storage |
SATA SSD | 0.5 GB/s | Older storage |
Peak vs Sustained Bandwidth:
- Peak: Theoretical maximum (marketing numbers!)
- Sustained: Real achievable bandwidth (what you actually get)
- Gap: 20-50% is common due to overhead, arbitration, refresh
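You can estimate your machine's sustained bandwidth with a simple STREAM-style copy. A hedged C sketch follows; the array size, repetition count, and the 2×-bytes accounting for a copy are illustrative assumptions, and the measured value will normally land below the theoretical peak computed above:

```c
// Sketch: measure sustained memory bandwidth with a large array copy
// (STREAM-style). Compile with: gcc -O2 bw.c -o bw
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N (256L * 1024 * 1024 / sizeof(double))   // ~256 MB per array, >> LLC

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    if (!a || !b) return 1;
    for (long i = 0; i < (long)N; i++) { a[i] = 1.0; b[i] = 0.0; }  // touch pages

    double best = 1e30;
    for (int rep = 0; rep < 5; rep++) {            // best-of-5 to reduce noise
        double t0 = now_sec();
        memcpy(b, a, N * sizeof(double));
        double t = now_sec() - t0;
        if (t < best) best = t;
    }

    // A copy moves N doubles in and N doubles out = 2 * N * 8 bytes.
    double gbytes = 2.0 * N * sizeof(double) / 1e9;
    printf("sustained copy bandwidth: %.1f GB/s\n", gbytes / best);
    free(a); free(b);
    return 0;
}
```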
2. Amdahl's Law: The Speedup Reality Check
Amdahl's Law tells you the maximum speedup possible from an optimization.
2.1 The Formula
Speedup = 1 / ((1 - P) + P/S)
Where:
P = Fraction of program that is parallelizable
S = Speedup of the parallel portion
2.2 Worked Example
Scenario: Optimize matrix multiplication
- 80% of runtime is parallelizable (P = 0.8)
- Parallel version runs 10x faster (S = 10)
Speedup = 1 / ((1 - 0.8) + 0.8/10)
= 1 / (0.2 + 0.08)
= 1 / 0.28
= 3.57x
Interpretation: Even with a 10x speedup on 80% of the code, overall speedup is only 3.57x due to the serial portion (20%)!
2.3 The Serial Bottleneck
Key insight: The unparallelized portion dominates at high core counts.
For P = 0.8 (the example above):
Cores:    1     2     4     8     16    32    64
Speedup:  1.0x  1.7x  2.5x  3.3x  4.0x  4.4x  4.7x
                                  ↑ Diminishing returns!
Even with infinite cores (S → ∞):
Max Speedup = 1 / (1 - P) = 1 / 0.2 = 5x
That 20% serial code limits you to 5x maximum speedup no matter how many cores you add!
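The scaling table and the 5x asymptote fall straight out of the formula. A short C sketch, assuming the same 80%-parallel workload as the worked example:

```c
// Sketch: Amdahl's Law speedup for a workload that is P parallelizable,
// evaluated at increasing core counts. Reproduces the table above for P = 0.8.
#include <stdio.h>

static double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);   // Speedup = 1 / ((1-P) + P/S)
}

int main(void) {
    const double P = 0.8;                       // parallel fraction
    const int cores[] = {1, 2, 4, 8, 16, 32, 64};

    for (int i = 0; i < 7; i++)
        printf("%3d cores: %.2fx\n", cores[i], amdahl(P, cores[i]));

    printf("infinite cores (asymptote): %.2fx\n", 1.0 / (1.0 - P));
    return 0;
}
```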
2.4 Practical Applications
1. Multicore Scaling
// Web server: 95% of time in parallel request handling
P = 0.95, S = number of cores
8 cores: Speedup = 1/(0.05 + 0.95/8) = 5.9x
16 cores: Speedup = 1/(0.05 + 0.95/16) = 9.1x
32 cores: Speedup = 1/(0.05 + 0.95/32) = 12.5x
// Diminishing returns evident!
2. Accelerator Decisions
// ML inference: 70% matrix ops (GPU-friendly), 30% pre/post-processing
GPU is 100x faster for matrix ops
Speedup = 1 / (0.30 + 0.70/100) = 1 / 0.307 = 3.26x
// GPU only gives 3.26x overall speedup!
// Need to optimize the 30% CPU portion too
3. The Roofline Model: Compute vs Memory Bound
The Roofline Model visualizes whether your code is:
- Compute-bound: Limited by ALU/FPU throughput
- Memory-bound: Limited by DRAM bandwidth
3.1 The Roofline Plot
Performance (GFLOP/s)
│
│              _____________________ Compute Roof (Peak FLOPS)
│             /
│            /
│           /   Memory Roof (BW × Arithmetic Intensity)
│          /
│         /
│        /
│_______/________________________________→ Arithmetic Intensity
                                            (FLOPS/Byte)
3.2 Key Concepts
Arithmetic Intensity (AI):
AI = Operations / Bytes Transferred
Example: Matrix multiply (N×N)
Operations: 2N³ FLOPs (multiply-add for each element)
Data: 3N² floats × 4 bytes = 12N² bytes
AI = 2N³ / (12N²) = N/6 FLOPS/byte
For N=1024: AI = 170 FLOPS/byte (compute-bound!)
For N=8: AI = 1.3 FLOPS/byte (memory-bound!)
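The same ratio is easy to evaluate for other sizes. A tiny C sketch that keeps the simplifying assumption above (each of the three N×N matrices moves between memory and the chip exactly once):

```c
// Sketch: arithmetic intensity of naive NxN matrix multiply, assuming each of
// the three matrices is transferred exactly once (the same model as above).
#include <stdio.h>

static double matmul_ai(double n) {
    double flops = 2.0 * n * n * n;          // one multiply-add per inner step
    double bytes = 3.0 * n * n * 4.0;        // three NxN float matrices
    return flops / bytes;                    // simplifies to N / 6
}

int main(void) {
    printf("N=8:    %.1f FLOPs/byte\n", matmul_ai(8));      // ~1.3, memory-bound
    printf("N=1024: %.1f FLOPs/byte\n", matmul_ai(1024));   // ~170, compute-bound
    return 0;
}
```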
Attainable Performance:
Performance = min(
Peak FLOPS, // Compute roof
Memory BW × Arithmetic Intensity // Memory roof
)
3.3 Example Analysis
System specs:
- Peak FLOPS: 1000 GFLOP/s
- Memory BW: 50 GB/s
- Ridge point: 1000/50 = 20 FLOPS/byte
Your kernel:
- Arithmetic Intensity: 5 FLOPS/byte
Prediction:
Performance = min(1000, 50 × 5) = min(1000, 250) = 250 GFLOP/s
Status: MEMORY-BOUND (using only 25% of peak FLOPS)
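The prediction is nothing more than a min() of the two roofs. A short C sketch that encodes it, using the example system's 1000 GFLOP/s and 50 GB/s as assumed machine parameters:

```c
// Sketch: roofline prediction for a kernel, given peak compute, memory
// bandwidth, and the kernel's arithmetic intensity. Numbers are the example
// system above; substitute your own hardware specs.
#include <stdio.h>

typedef struct { double peak_gflops; double mem_bw_gbs; } machine_t;

static double roofline(machine_t m, double ai) {
    double mem_roof = m.mem_bw_gbs * ai;                    // GB/s * FLOPs/byte
    return mem_roof < m.peak_gflops ? mem_roof : m.peak_gflops;
}

int main(void) {
    machine_t m = { .peak_gflops = 1000.0, .mem_bw_gbs = 50.0 };
    double ridge = m.peak_gflops / m.mem_bw_gbs;            // 20 FLOPs/byte

    double ai = 5.0;                                        // your kernel
    double perf = roofline(m, ai);
    printf("ridge point: %.1f FLOPs/byte\n", ridge);
    printf("attainable:  %.0f GFLOP/s (%s-bound)\n",
           perf, ai < ridge ? "memory" : "compute");
    return 0;
}
```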
Optimization strategy:
- Increase AI: Cache blocking (see the tiled matmul sketch below), loop fusion, algorithm changes
- Increase effective BW: Prefetching, coalescing, compression
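As an illustration of the first strategy, here is a hedged sketch of cache blocking (tiling) for matrix multiply: each BLOCK×BLOCK tile is reused many times once it sits in cache, which raises arithmetic intensity without changing the arithmetic performed. The block size is an assumption to be tuned so that a few tiles fit in L1/L2:

```c
// Sketch: cache-blocked (tiled) matrix multiply. Each BLOCK x BLOCK tile of A
// and B is reused BLOCK times once loaded, raising arithmetic intensity.
// BLOCK is illustrative; tune it for the target cache sizes.
#define BLOCK 64

// C[N][N] += A[N][N] * B[N][N], row-major, N a multiple of BLOCK for brevity.
void matmul_blocked(int n, const float *A, const float *B, float *C) {
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                // Multiply one tile of A by one tile of B into a tile of C.
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int k = kk; k < kk + BLOCK; k++) {
                        float a = A[i * n + k];
                        for (int j = jj; j < jj + BLOCK; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```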
3.4 Roofline in Practice
Case Study: ML Matrix Operations
// Small batch inference (batch=1, hidden=512)
AI = 0.5 FLOPS/byte → Memory-bound
Action: Increase batch size to amortize memory transfers
// Large batch training (batch=256, hidden=4096)
AI = 200 FLOPS/byte → Compute-bound
Action: Optimize CUDA kernels, use Tensor Cores
4. Basic Profiling Concepts
4.1 What to Measure
1. Time Breakdown
Total Time = Compute Time + Memory Time + Stall Time
// Use hardware counters
perf stat -e cycles,instructions,cache-misses ./app
2. Hotspots
// Identify functions consuming most time
perf record ./app
perf report
Output:
50.2% matmul() ← Focus optimization here!
20.1% softmax()
15.3% layer_norm()
3. Cache Behavior
// L1/L2/L3 miss rates
perf stat -e L1-dcache-load-misses,LLC-load-misses ./app
High miss rate → Memory-bound
Low miss rate + low IPC → Compute-bound or other stalls
4.2 Key Performance Counters (PMCs)
Counter | Meaning | Good Value |
---|---|---|
IPC | Instructions/Cycle | > 1.0 for modern CPUs |
Cache Miss Rate | Misses/Accesses | < 5% for L1 |
Branch Mispredict | Mispredicts/Branches | < 10% |
TLB Miss Rate | Misses/Accesses | < 1% |
Memory Stall Cycles | Cycles waiting for memory | < 30% of total |
5. Workload Characterization
5.1 Understanding Your Application
1. Compute Intensity
FLOP-heavy: ML training, scientific computing
Integer-heavy: Databases, compression, crypto
Mixed: General applications
2. Memory Access Pattern
Sequential: Streaming, good for prefetch
Random: Graph algorithms, hash tables, poor locality
Strided: Matrix operations, tunable
3. Parallelism
Embarrassingly parallel: Image processing, Monte Carlo
Medium: Matrix ops with some dependencies
Serial: Recursive algorithms, sequential parsing
5.2 Architecture Matching
Workload Type | Best Architecture | Why |
---|---|---|
High AI, parallel | GPU, TPU | Maximize throughput |
Low AI, parallel | Multi-core CPU | Better memory subsystem |
Serial, branchy | High-IPC CPU | Out-of-order, prediction |
Real-time | Specialized | Predictable latency |
6. Optimization Strategy
6.1 The Optimization Hierarchy
1. Algorithm (10-100x gains)
O(N²) → O(N log N) can dwarf all other optimizations!
2. Data Structures (2-10x gains)
Array-of-Structs → Struct-of-Arrays for SIMD (see the sketch after this list)
Hash table → Cache-friendly design
3. Architecture Awareness (1.5-5x gains)
Cache blocking, prefetching, vectorization
4. Micro-optimizations (1.1-1.5x gains)
Loop unrolling, instruction scheduling
Rule: Always start at the top! Don't optimize bit-twiddling when you have an O(N³) algorithm that could be O(N log N).
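To make the data-structure level concrete, here is a minimal sketch of the Array-of-Structs → Struct-of-Arrays transformation from step 2 above. With the SoA layout, a pass over a single field touches contiguous memory, which is what SIMD units and hardware prefetchers want; field names and sizes are illustrative:

```c
// Sketch: Array-of-Structs vs Struct-of-Arrays for a particle update.
#define N_PARTICLES 100000

// AoS: each particle's fields are adjacent; updating only x strides over
// unused y/z/mass bytes and wastes cache lines and vector lanes.
struct particle { float x, y, z, mass; };
struct particle aos[N_PARTICLES];

// SoA: each field is its own contiguous array; a pass over x is a pure
// sequential stream that vectorizes and prefetches well.
struct particles_soa {
    float x[N_PARTICLES];
    float y[N_PARTICLES];
    float z[N_PARTICLES];
    float mass[N_PARTICLES];
} soa;

void advance_x_aos(float dt, const float *vx) {
    for (int i = 0; i < N_PARTICLES; i++)
        aos[i].x += vx[i] * dt;          // 16-byte stride per element
}

void advance_x_soa(float dt, const float *vx) {
    for (int i = 0; i < N_PARTICLES; i++)
        soa.x[i] += vx[i] * dt;          // unit stride, SIMD-friendly
}
```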
7. Key Takeaways
✅ IPC measures pipeline efficiency - target > 1.0 for modern CPUs
✅ Throughput ≠ Latency - pipelining improves one without the other
✅ Amdahl's Law reveals the painful truth: serial portions limit speedup
✅ Roofline Model identifies if you're compute or memory bound
✅ Profile first, optimize second - measure to find real bottlenecks
✅ Arithmetic Intensity is the key metric for accelerator suitability
8. Practice Problems
1. Speedup Calculation
50% of code is parallelizable and runs 8x faster on 8 cores.
What is the overall speedup?
2. Roofline Analysis
System: 500 GFLOP/s peak, 40 GB/s memory bandwidth
Kernel: 2 FLOPS/byte arithmetic intensity
Is it compute or memory bound? What's the attainable performance?
3. Bottleneck Identification
Profiler shows: 70% time in memory stalls, IPC = 0.4
What's the problem? What should you optimize?
9. Next Steps
Now that you understand performance fundamentals:
→ Memory Hierarchy: Understand cache behavior and optimization
→ Parallel Computing: Apply Amdahl's Law to real parallel programs
→ Advanced Profiling: Intel VTune, NVIDIA Nsight, perf
→ Optimization: Cache blocking, vectorization, prefetching
Remember: Performance engineering is about making informed trade-offs based on measured data!