Performance Metrics & Analysis Foundations
Master essential performance metrics including IPC, throughput, latency, and bandwidth. Learn to identify bottlenecks using Amdahl's Law and the Roofline Model.
"You can't optimize what you can't measure" - this fundamental principle drives all performance engineering. Whether you're designing a new processor, optimizing software, or choosing between architectural alternatives, understanding performance metrics is essential! š
1. Core Performance Metrics
1.1 Instructions Per Cycle (IPC)
Definition: Average number of instructions completed per clock cycle.
IPC = Instructions Executed / Clock Cycles
Example:
// Execution of 1000 instructions takes 1250 cycles
IPC = 1000 / 1250 = 0.8
What it tells you:
- IPC < 1: Pipeline stalls, hazards, or serial execution
- IPC = 1: Perfect scalar pipeline (ideal 5-stage)
- IPC > 1: Superscalar execution (multiple instructions/cycle)
- IPC >> 1: Wide superscalar or VLIW architecture
Real-world examples:
- In-order CPU: IPC ≈ 0.5-1.0
- Out-of-order CPU (x86): IPC ≈ 1.5-4.0
- GPU SIMT core: IPC ≈ 0.3-0.6 (but thousands of concurrent threads!)
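On Linux you can measure IPC for a region of your own code directly from the hardware counters. Below is a minimal sketch using the perf_event_open syscall; workload() is a placeholder for whatever you want to profile, and error handling is kept to a bare minimum:

```c
// Sketch: measure IPC of a code region via Linux perf_event_open.
// Linux-only; compile with: gcc -O2 ipc.c -o ipc
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static int open_counter(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;            // which hardware event to count
    attr.disabled = 1;               // start disabled, enable explicitly
    attr.exclude_kernel = 1;         // count user-space only
    attr.exclude_hv = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

static void workload(void) {         // placeholder: the code you want to measure
    volatile double x = 1.0;
    for (long i = 0; i < 100000000L; i++) x = x * 1.000001 + 0.5;
}

int main(void) {
    int fd_instr  = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int fd_cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    if (fd_instr < 0 || fd_cycles < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd_instr, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_cycles, PERF_EVENT_IOC_ENABLE, 0);
    workload();
    ioctl(fd_instr, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd_cycles, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t instructions = 0, cycles = 0;
    read(fd_instr, &instructions, sizeof(instructions));
    read(fd_cycles, &cycles, sizeof(cycles));

    printf("instructions=%llu cycles=%llu IPC=%.2f\n",
           (unsigned long long)instructions, (unsigned long long)cycles,
           cycles ? (double)instructions / (double)cycles : 0.0);
    return 0;
}
```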
1.2 Throughput vs Latency
These are different concepts that are often confused!
Throughput: Work completed per unit time
Throughput = Total Work / Total Time
Example: Instructions per second (IPS), Transactions per second (TPS)
Analogy: Cars crossing a bridge
- Throughput = 100 cars/hour
Latency: Time for one operation to complete
Latency = Time from start to completion
Example: Time for one instruction, memory access time
Analogy: Time for one car to cross the bridge
- Latency = 1 minute
1.3 The Throughput-Latency Trade-off
Key Insight: You can improve throughput without reducing latency!
Example: Pipelining
Non-pipelined: 5 cycles per instruction
Latency: 5 cycles
Throughput: 1 inst / 5 cycles = 0.2 IPC
Pipelined: Still 5 cycles per instruction
Latency: 5 cycles (same!)
Throughput: 1 inst / 1 cycle = 1.0 IPC (5x better!)
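The same effect shows up in software via instruction-level parallelism. A hedged C sketch (exact timings depend on compiler flags and hardware): a chain of dependent multiplies pays the full multiply latency on every operation, while four independent chains let the pipelined FPU overlap operations, improving throughput even though each multiply's latency is unchanged.

```c
// Sketch: latency-bound (dependent chain) vs throughput-oriented (independent
// accumulators). Compile with: gcc -O2 lat_vs_thr.c -o lat_vs_thr
#include <stdio.h>
#include <time.h>

#define N 200000000L

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    // 1) Dependent chain: each multiply waits for the previous result,
    //    so total time ~ N * multiply latency.
    double t0 = now_sec();
    double a = 1.0;
    for (long i = 0; i < N; i++) a = a * 1.0000001;
    double t_dep = now_sec() - t0;

    // 2) Four independent chains: the CPU keeps several multiplies in flight,
    //    so time per operation approaches the multiply throughput.
    t0 = now_sec();
    double b0 = 1.0, b1 = 1.0, b2 = 1.0, b3 = 1.0;
    for (long i = 0; i < N; i += 4) {
        b0 *= 1.0000001; b1 *= 1.0000001;
        b2 *= 1.0000001; b3 *= 1.0000001;
    }
    double t_ind = now_sec() - t0;

    printf("dependent:   %.2f s (%.2f ns/op)\n", t_dep, t_dep / N * 1e9);
    printf("independent: %.2f s (%.2f ns/op)\n", t_ind, t_ind / N * 1e9);
    printf("results (prevent dead-code elimination): %g %g\n",
           a, b0 + b1 + b2 + b3);
    return 0;
}
```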
When each matters:
- Latency-critical: Real-time systems, user interaction, single-threaded code
- Throughput-critical: Batch processing, server workloads, parallel code
1.4 Bandwidth
Definition: Maximum data transfer rate
Bandwidth = Data Size / Time
Units: GB/s, MB/s, Gb/s
Memory Bandwidth Example:
DDR4-3200 RAM:
- Clock: 1600 MHz (double data rate → 3200 MT/s)
- Bus width: 64 bits
- Bandwidth = 3200 MT/s × 64 bits / 8 = 25.6 GB/s
Bandwidth Types:
Type | Typical Value | Use Case |
---|---|---|
L1 Cache | 1-2 TB/s | Register spills, hot data |
L2 Cache | 200-500 GB/s | Working set, shared data |
L3 Cache | 50-200 GB/s | Shared cache, victim cache |
DRAM | 25-100 GB/s | Main memory |
NVMe SSD | 3-7 GB/s | Storage |
SATA SSD | 0.5 GB/s | Older storage |
Peak vs Sustained Bandwidth:
- Peak: Theoretical maximum (marketing numbers!)
- Sustained: Real achievable bandwidth (what you actually get)
- Gap: 20-50% is common due to overhead, arbitration, refresh
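You can estimate your machine's sustained bandwidth with a simple STREAM-style copy. A hedged C sketch follows; the array size, repetition count, and the 2×-bytes accounting for a copy are illustrative assumptions, and the measured value will normally land below the theoretical peak computed above:

```c
// Sketch: measure sustained memory bandwidth with a large array copy
// (STREAM-style). Compile with: gcc -O2 bw.c -o bw
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N (256L * 1024 * 1024 / sizeof(double))   // ~256 MB per array, >> LLC

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    if (!a || !b) return 1;
    for (long i = 0; i < (long)N; i++) { a[i] = 1.0; b[i] = 0.0; }  // touch pages

    double best = 1e30;
    for (int rep = 0; rep < 5; rep++) {            // best-of-5 to reduce noise
        double t0 = now_sec();
        memcpy(b, a, N * sizeof(double));
        double t = now_sec() - t0;
        if (t < best) best = t;
    }

    // A copy moves N doubles in and N doubles out = 2 * N * 8 bytes.
    double gbytes = 2.0 * N * sizeof(double) / 1e9;
    printf("sustained copy bandwidth: %.1f GB/s\n", gbytes / best);
    free(a); free(b);
    return 0;
}
```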
2. Amdahl's Law: The Speedup Reality Check
Amdahl's Law tells you the maximum speedup possible from an optimization.
2.1 The Formula
Speedup = 1 / ((1 - P) + P/S)
Where:
P = Fraction of program that is parallelizable
S = Speedup of the parallel portion
2.2 Worked Example
Scenario: Optimize matrix multiplication
- 80% of runtime is parallelizable (P = 0.8)
- Parallel version runs 10x faster (S = 10)
Speedup = 1 / ((1 - 0.8) + 0.8/10)
= 1 / (0.2 + 0.08)
= 1 / 0.28
= 3.57x
Interpretation: Even with a 10x speedup on 80% of the code, overall speedup is only 3.57x due to the serial portion (20%)!
2.3 The Serial Bottleneck
Key insight: The unparallelized portion dominates at high core counts.
For P = 0.8 (the example above):
Cores:    1     2     4     8     16    32    64
Speedup:  1.0x  1.7x  2.5x  3.3x  4.0x  4.4x  4.7x
                                  ↑ Diminishing returns!
Even with infinite cores (S → ∞):
Max Speedup = 1 / (1 - P) = 1 / 0.2 = 5x
That 20% serial code limits you to 5x maximum speedup no matter how many cores you add!
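The scaling table and the 5x asymptote fall straight out of the formula. A short C sketch, assuming the same 80%-parallel workload as the worked example:

```c
// Sketch: Amdahl's Law speedup for a workload that is P parallelizable,
// evaluated at increasing core counts. Reproduces the table above for P = 0.8.
#include <stdio.h>

static double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);   // Speedup = 1 / ((1-P) + P/S)
}

int main(void) {
    const double P = 0.8;                       // parallel fraction
    const int cores[] = {1, 2, 4, 8, 16, 32, 64};

    for (int i = 0; i < 7; i++)
        printf("%3d cores: %.2fx\n", cores[i], amdahl(P, cores[i]));

    printf("infinite cores (asymptote): %.2fx\n", 1.0 / (1.0 - P));
    return 0;
}
```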
2.4 Practical Applications
1. Multicore Scaling
// Web server: 95% of time in parallel request handling
P = 0.95, S = number of cores
8 cores: Speedup = 1/(0.05 + 0.95/8) = 5.9x
16 cores: Speedup = 1/(0.05 + 0.95/16) = 9.1x
32 cores: Speedup = 1/(0.05 + 0.95/32) = 12.5x
// Diminishing returns evident!
2. Accelerator Decisions
// ML inference: 70% matrix ops (GPU-friendly), 30% pre/post-processing
GPU is 100x faster for matrix ops
Speedup = 1 / (0.30 + 0.70/100) = 1 / 0.307 = 3.26x
// GPU only gives 3.26x overall speedup!
// Need to optimize the 30% CPU portion too
3. The Roofline Model: Compute vs Memory Bound
The Roofline Model visualizes whether your code is:
- Compute-bound: Limited by ALU/FPU throughput
- Memory-bound: Limited by DRAM bandwidth
3.1 The Roofline Plot
Performance (GFLOP/s)
│
│              _____________________ Compute Roof (Peak FLOPS)
│             /
│            /
│           /   Memory Roof (BW × Arithmetic Intensity)
│          /
│         /
│        /
│_______/________________________________→ Arithmetic Intensity
                                            (FLOPS/Byte)
3.2 Key Concepts
Arithmetic Intensity (AI):
AI = Operations / Bytes Transferred
Example: Matrix multiply (N×N)
Operations: 2N³ FLOPs (multiply-add for each element)
Data: 3N² floats × 4 bytes = 12N² bytes
AI = 2N³ / (12N²) = N/6 FLOPS/byte
For N=1024: AI = 170 FLOPS/byte (compute-bound!)
For N=8: AI = 1.3 FLOPS/byte (memory-bound!)
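The same ratio is easy to evaluate for other sizes. A tiny C sketch that keeps the simplifying assumption above (each of the three N×N matrices moves between memory and the chip exactly once):

```c
// Sketch: arithmetic intensity of naive NxN matrix multiply, assuming each of
// the three matrices is transferred exactly once (the same model as above).
#include <stdio.h>

static double matmul_ai(double n) {
    double flops = 2.0 * n * n * n;          // one multiply-add per inner step
    double bytes = 3.0 * n * n * 4.0;        // three NxN float matrices
    return flops / bytes;                    // simplifies to N / 6
}

int main(void) {
    printf("N=8:    %.1f FLOPs/byte\n", matmul_ai(8));      // ~1.3, memory-bound
    printf("N=1024: %.1f FLOPs/byte\n", matmul_ai(1024));   // ~170, compute-bound
    return 0;
}
```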
Attainable Performance:
Performance = min(
Peak FLOPS, // Compute roof
Memory BW × Arithmetic Intensity // Memory roof
)
3.3 Example Analysis
System specs:
- Peak FLOPS: 1000 GFLOP/s
- Memory BW: 50 GB/s
- Ridge point: 1000/50 = 20 FLOPS/byte
Your kernel:
- Arithmetic Intensity: 5 FLOPS/byte
Prediction:
Performance = min(1000, 50 × 5) = min(1000, 250) = 250 GFLOP/s
Status: MEMORY-BOUND (using only 25% of peak FLOPS)
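The prediction is nothing more than a min() of the two roofs. A short C sketch that encodes it, using the example system's 1000 GFLOP/s and 50 GB/s as assumed machine parameters:

```c
// Sketch: roofline prediction for a kernel, given peak compute, memory
// bandwidth, and the kernel's arithmetic intensity. Numbers are the example
// system above; substitute your own hardware specs.
#include <stdio.h>

typedef struct { double peak_gflops; double mem_bw_gbs; } machine_t;

static double roofline(machine_t m, double ai) {
    double mem_roof = m.mem_bw_gbs * ai;                    // GB/s * FLOPs/byte
    return mem_roof < m.peak_gflops ? mem_roof : m.peak_gflops;
}

int main(void) {
    machine_t m = { .peak_gflops = 1000.0, .mem_bw_gbs = 50.0 };
    double ridge = m.peak_gflops / m.mem_bw_gbs;            // 20 FLOPs/byte

    double ai = 5.0;                                        // your kernel
    double perf = roofline(m, ai);
    printf("ridge point: %.1f FLOPs/byte\n", ridge);
    printf("attainable:  %.0f GFLOP/s (%s-bound)\n",
           perf, ai < ridge ? "memory" : "compute");
    return 0;
}
```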
Optimization strategy:
- Increase AI: Cache blocking (see the tiled matmul sketch below), loop fusion, algorithm changes
- Increase effective BW: Prefetching, coalescing, compression
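As an illustration of the first strategy, here is a hedged sketch of cache blocking (tiling) for matrix multiply: each BLOCK×BLOCK tile is reused many times once it sits in cache, which raises arithmetic intensity without changing the arithmetic performed. The block size is an assumption to be tuned so that a few tiles fit in L1/L2:

```c
// Sketch: cache-blocked (tiled) matrix multiply. Each BLOCK x BLOCK tile of A
// and B is reused BLOCK times once loaded, raising arithmetic intensity.
// BLOCK is illustrative; tune it for the target cache sizes.
#define BLOCK 64

// C[N][N] += A[N][N] * B[N][N], row-major, N a multiple of BLOCK for brevity.
void matmul_blocked(int n, const float *A, const float *B, float *C) {
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int kk = 0; kk < n; kk += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                // Multiply one tile of A by one tile of B into a tile of C.
                for (int i = ii; i < ii + BLOCK; i++)
                    for (int k = kk; k < kk + BLOCK; k++) {
                        float a = A[i * n + k];
                        for (int j = jj; j < jj + BLOCK; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```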
3.4 Roofline in Practice
Case Study: ML Matrix Operations
// Small batch inference (batch=1, hidden=512)
AI = 0.5 FLOPS/byte → Memory-bound
Action: Increase batch size to amortize memory transfers
// Large batch training (batch=256, hidden=4096)
AI = 200 FLOPS/byte → Compute-bound
Action: Optimize CUDA kernels, use Tensor Cores
4. Basic Profiling Concepts
4.1 What to Measure
1. Time Breakdown
Total Time = Compute Time + Memory Time + Stall Time
// Use hardware counters
perf stat -e cycles,instructions,cache-misses ./app
2. Hotspots
// Identify functions consuming most time
perf record ./app
perf report
Output:
50.2% matmul() ← Focus optimization here!
20.1% softmax()
15.3% layer_norm()
3. Cache Behavior
// L1/L2/L3 miss rates
perf stat -e L1-dcache-load-misses,LLC-load-misses ./app
High miss rate → Memory-bound
Low miss rate + low IPC → Compute-bound or other stalls
4.2 Key Performance Counters (PMCs)
Counter | Meaning | Good Value |
---|---|---|
IPC | Instructions/Cycle | > 1.0 for modern CPUs |
Cache Miss Rate | Misses/Accesses | < 5% for L1 |
Branch Mispredict | Mispredicts/Branches | < 10% |
TLB Miss Rate | Misses/Accesses | < 1% |
Memory Stall Cycles | Cycles waiting for memory | < 30% of total |
5. Workload Characterization
5.1 Understanding Your Application
1. Compute Intensity
FLOP-heavy: ML training, scientific computing
Integer-heavy: Databases, compression, crypto
Mixed: General applications
2. Memory Access Pattern
Sequential: Streaming, good for prefetch
Random: Graph algorithms, hash tables, poor locality
Strided: Matrix operations, tunable
3. Parallelism
Embarrassingly parallel: Image processing, Monte Carlo
Medium: Matrix ops with some dependencies
Serial: Recursive algorithms, sequential parsing
5.2 Architecture Matching
Workload Type | Best Architecture | Why |
---|---|---|
High AI, parallel | GPU, TPU | Maximize throughput |
Low AI, parallel | Multi-core CPU | Better memory subsystem |
Serial, branchy | High-IPC CPU | Out-of-order, prediction |
Real-time | Specialized | Predictable latency |
6. Optimization Strategy
6.1 The Optimization Hierarchy
1. Algorithm (10-100x gains)
O(N²) → O(N log N) can dwarf all other optimizations!
2. Data Structures (2-10x gains)
Array-of-Structs → Struct-of-Arrays for SIMD (see the sketch after this list)
Hash table → Cache-friendly design
3. Architecture Awareness (1.5-5x gains)
Cache blocking, prefetching, vectorization
4. Micro-optimizations (1.1-1.5x gains)
Loop unrolling, instruction scheduling
Rule: Always start at the top! Don't optimize bit-twiddling when you have an O(N³) algorithm that could be O(N log N).
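To make the data-structure level concrete, here is a minimal sketch of the Array-of-Structs → Struct-of-Arrays transformation from step 2 above. With the SoA layout, a pass over a single field touches contiguous memory, which is what SIMD units and hardware prefetchers want; field names and sizes are illustrative:

```c
// Sketch: Array-of-Structs vs Struct-of-Arrays for a particle update.
#define N_PARTICLES 100000

// AoS: each particle's fields are adjacent; updating only x strides over
// unused y/z/mass bytes and wastes cache lines and vector lanes.
struct particle { float x, y, z, mass; };
struct particle aos[N_PARTICLES];

// SoA: each field is its own contiguous array; a pass over x is a pure
// sequential stream that vectorizes and prefetches well.
struct particles_soa {
    float x[N_PARTICLES];
    float y[N_PARTICLES];
    float z[N_PARTICLES];
    float mass[N_PARTICLES];
} soa;

void advance_x_aos(float dt, const float *vx) {
    for (int i = 0; i < N_PARTICLES; i++)
        aos[i].x += vx[i] * dt;          // 16-byte stride per element
}

void advance_x_soa(float dt, const float *vx) {
    for (int i = 0; i < N_PARTICLES; i++)
        soa.x[i] += vx[i] * dt;          // unit stride, SIMD-friendly
}
```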
7. Key Takeaways
✅ IPC measures pipeline efficiency - target > 1.0 for modern CPUs
✅ Throughput ≠ Latency - pipelining improves one without the other
✅ Amdahl's Law reveals the painful truth: serial portions limit speedup
✅ Roofline Model identifies if you're compute or memory bound
✅ Profile first, optimize second - measure to find real bottlenecks
✅ Arithmetic Intensity is the key metric for accelerator suitability
8. Practice Problems
1. Speedup Calculation
50% of code is parallelizable and runs 8x faster on 8 cores.
What is the overall speedup?
2. Roofline Analysis
System: 500 GFLOP/s peak, 40 GB/s memory bandwidth
Kernel: 2 FLOPS/byte arithmetic intensity
Is it compute or memory bound? What's the attainable performance?
3. Bottleneck Identification
Profiler shows: 70% time in memory stalls, IPC = 0.4
What's the problem? What should you optimize?
9. Next Steps
Now that you understand performance fundamentals:
→ Memory Hierarchy: Understand cache behavior and optimization
→ Parallel Computing: Apply Amdahl's Law to real parallel programs
→ Advanced Profiling: Intel VTune, NVIDIA Nsight, perf
→ Optimization: Cache blocking, vectorization, prefetching
Remember: Performance engineering is about making informed trade-offs based on measured data!