
Performance Metrics & Analysis Foundations

Master essential performance metrics including IPC, throughput, latency, and bandwidth. Learn to identify bottlenecks using Amdahl's Law and the Roofline Model.

28 min read
Updated 10/1/2025

Prerequisites

Make sure you're familiar with these concepts before diving in:

Basic computer architecture knowledge
Understanding of CPU pipelines
Familiarity with memory hierarchy

Learning Objectives

By the end of this topic, you will be able to:

Understand key metrics: IPC, throughput, latency, and bandwidth
Apply Amdahl's Law to predict speedup from optimizations
Use the Roofline Model to identify compute vs memory bottlenecks
Perform basic profiling and workload characterization
Make informed architectural trade-offs based on metrics

Performance Metrics & Analysis Foundations

"You can't optimize what you can't measure" - this fundamental principle drives all performance engineering. Whether you're designing a new processor, optimizing software, or choosing between architectural alternatives, understanding performance metrics is essential! šŸ“Š

1. Core Performance Metrics

1.1 Instructions Per Cycle (IPC) šŸ”¢

Definition: Average number of instructions completed per clock cycle.

IPC = Instructions Executed / Clock Cycles

Example:

// Execution of 1000 instructions takes 1250 cycles
IPC = 1000 / 1250 = 0.8

What it tells you:

  • IPC < 1: Pipeline stalls, hazards, or serial execution
  • IPC = 1: Perfect scalar pipeline (ideal 5-stage)
  • IPC > 1: Superscalar execution (multiple instructions/cycle)
  • IPC >> 1: Wide superscalar or VLIW architecture

Real-world examples:

  • In-order CPU: IPC ≈ 0.5-1.0
  • Out-of-order CPU (x86): IPC ≈ 1.5-4.0
  • GPU SIMT core: IPC ≈ 0.3-0.6 (but thousands of concurrent threads!)
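
To make the definition concrete, here is a minimal Python sketch (not part of the original material) that computes IPC from raw counts such as those reported by a profiler:

# Minimal sketch: compute IPC from instruction and cycle counts (e.g., from perf stat).
def ipc(instructions: int, cycles: int) -> float:
    """Average instructions completed per clock cycle."""
    return instructions / cycles

print(ipc(1000, 1250))  # 0.8 -> below 1.0, suggesting stalls, hazards, or serial execution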

1.2 Throughput vs Latency ⏱ļø

These are different concepts that are often confused!

Throughput: Work completed per unit time

Throughput = Total Work / Total Time
 
Example: Instructions per second (IPS), Transactions per second (TPS)

Analogy: Cars crossing a bridge

  • Throughput = 100 cars/hour

Latency: Time for one operation to complete

Latency = Time from start to completion
 
Example: Time for one instruction, memory access time

Analogy: Time for one car to cross the bridge

  • Latency = 1 minute

1.3 The Throughput-Latency Trade-off

Key Insight: You can improve throughput without reducing latency!

Example: Pipelining

Non-pipelined: 5 cycles per instruction
Latency:      5 cycles
Throughput:   1 inst / 5 cycles = 0.2 IPC
 
Pipelined: Still 5 cycles per instruction
Latency:      5 cycles (same!)
Throughput:   1 inst / 1 cycle = 1.0 IPC (5x better!)

When each matters:

  • Latency-critical: Real-time systems, user interaction, single-threaded code
  • Throughput-critical: Batch processing, server workloads, parallel code
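
To see the two metrics side by side, here is an illustrative Python sketch (assumed instruction counts, idealized pipeline with no stalls) of the pipelining example above:

# Sketch: latency vs throughput for an idealized 5-stage pipeline with no hazards.
STAGES = 5

def non_pipelined(n: int) -> tuple[int, float]:
    cycles = n * STAGES          # each instruction runs start to finish on its own
    return STAGES, n / cycles    # (latency in cycles, throughput in inst/cycle)

def pipelined(n: int) -> tuple[int, float]:
    cycles = STAGES + (n - 1)    # fill the pipeline once, then retire 1 per cycle
    return STAGES, n / cycles

print(non_pipelined(1000))  # (5, 0.2)    latency unchanged, throughput 0.2 inst/cycle
print(pipelined(1000))      # (5, ~0.996) latency unchanged, throughput approaches 1.0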

1.4 Bandwidth šŸ“”

Definition: Maximum data transfer rate

Bandwidth = Data Size / Time
 
Units: GB/s, MB/s, Gb/s

Memory Bandwidth Example:

DDR4-3200 RAM:
- Clock: 1600 MHz (double data rate → 3200 MT/s)
- Bus width: 64 bits
- Bandwidth = 3200 MT/s Ɨ (64 bits / 8) = 25.6 GB/s
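
The same arithmetic applies to any memory interface; here is a small Python sketch (the helper name is mine) of the peak-bandwidth calculation:

# Sketch: theoretical peak bandwidth = transfer rate x bytes per transfer.
def peak_bandwidth_gbs(mt_per_s: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s for a given transfer rate (MT/s) and bus width (bits)."""
    return mt_per_s * 1e6 * (bus_width_bits / 8) / 1e9

print(peak_bandwidth_gbs(3200, 64))  # 25.6 GB/s for a 64-bit DDR4-3200 channel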

Bandwidth Types:

Type      | Typical Bandwidth | Use Case
----------|-------------------|----------------------------
L1 Cache  | 1-2 TB/s          | Register spills, hot data
L2 Cache  | 200-500 GB/s      | Working set, shared data
L3 Cache  | 50-200 GB/s       | Shared cache, victim cache
DRAM      | 25-100 GB/s       | Main memory
NVMe SSD  | 3-7 GB/s          | Storage
SATA SSD  | 0.5 GB/s          | Older storage

Peak vs Sustained Bandwidth:

  • Peak: Theoretical maximum (marketing numbers!)
  • Sustained: Real achievable bandwidth (what you actually get)
  • Gap: 20-50% is common due to overhead, arbitration, refresh

2. Amdahl's Law: The Speedup Reality Check

Amdahl's Law tells you the maximum speedup possible from an optimization.

2.1 The Formula

Speedup = 1 / ((1 - P) + P/S)
 
Where:
P = Fraction of execution time that benefits from the improvement (e.g., the parallelizable portion)
S = Speedup of that portion

2.2 Worked Example

Scenario: Optimize matrix multiplication

  • 80% of runtime is parallelizable (P = 0.8)
  • Parallel version runs 10x faster (S = 10)
Speedup = 1 / ((1 - 0.8) + 0.8/10)
        = 1 / (0.2 + 0.08)
        = 1 / 0.28
        = 3.57x

Interpretation: Even with a 10x speedup on 80% of the code, overall speedup is only 3.57x due to the serial portion (20%)!
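
The formula translates directly into a few lines of Python; this sketch (the helper name is mine) reproduces the 3.57x result:

# Sketch: Amdahl's Law.
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of runtime is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)

print(f"{amdahl_speedup(0.8, 10):.2f}x")  # 3.57x -- matches the worked example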

2.3 The Serial Bottleneck

Key insight: The unparallelized portion dominates at high core counts.

Assuming P = 0.8 and S = number of cores:

Cores:    1      2      4      8     16     32     64
Speedup:  1.0x   1.7x   2.5x   3.3x  4.0x  4.4x  4.7x
                                            ↑
                                    Diminishing returns!

Even with infinite cores (S → ∞):

Max Speedup = 1 / (1 - P) = 1 / 0.2 = 5x

That 20% serial code limits you to 5x maximum speedup no matter how many cores you add!
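
A quick sweep (a sketch assuming P = 0.8, as above) makes the ceiling visible:

# Sketch: speedup vs core count for P = 0.8; values approach the 1/(1-P) = 5x limit.
P = 0.8
for cores in (1, 2, 4, 8, 16, 32, 64, 1_000_000):
    speedup = 1.0 / ((1.0 - P) + P / cores)
    print(f"{cores:>9} cores: {speedup:.2f}x")
# 8 cores -> 3.33x, 64 cores -> 4.71x, 1,000,000 cores -> ~5.00x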

2.4 Practical Applications

1. Multicore Scaling

// Web server: 95% of time in parallel request handling
P = 0.95, S = number of cores
 
8 cores:  Speedup = 1/(0.05 + 0.95/8)  = 5.9x
16 cores: Speedup = 1/(0.05 + 0.95/16) = 9.1x
32 cores: Speedup = 1/(0.05 + 0.95/32) = 12.5x
 
// Diminishing returns evident!

2. Accelerator Decisions

// ML inference: 70% matrix ops (GPU-friendly), 30% pre/post-processing
GPU is 100x faster for matrix ops
 
Speedup = 1 / (0.30 + 0.70/100) = 1 / 0.307 = 3.26x
 
// GPU only gives 3.26x overall speedup!
// Need to optimize the 30% CPU portion too

3. The Roofline Model: Compute vs Memory Bound

The Roofline Model visualizes whether your code is:

  • Compute-bound: Limited by ALU/FPU throughput
  • Memory-bound: Limited by DRAM bandwidth

3.1 The Roofline Plot

Performance (GFLOP/s)
   ↑
   │                    _____________________ Compute Roof (Peak FLOPS)
   │                   /
   │                  /
   │                 /  Memory Roof (BW Ɨ Arithmetic Intensity)
   │                /
   │               /
   │              /
   │_____________/________________________________→ Arithmetic Intensity
                                                    (FLOPS/Byte)

3.2 Key Concepts

Arithmetic Intensity (AI):

AI = Operations / Bytes Transferred
 
Example: Matrix multiply (NƗN)
Operations: 2N³ FLOPs (multiply-add for each element)
Data:       3N² floats Ɨ 4 bytes = 12N² bytes
AI = 2N³ / (12N²) = N/6 FLOPS/byte
 
For N=1024: AI = 170 FLOPS/byte (compute-bound!)
For N=8:    AI = 1.3 FLOPS/byte (memory-bound!)

Attainable Performance:

Performance = min(
    Peak FLOPS,                    // Compute roof
    Memory BW Ɨ Arithmetic Intensity  // Memory roof
)

3.3 Example Analysis

System specs:

  • Peak FLOPS: 1000 GFLOP/s
  • Memory BW: 50 GB/s
  • Ridge point: 1000/50 = 20 FLOPS/byte

Your kernel:

  • Arithmetic Intensity: 5 FLOPS/byte

Prediction:

Performance = min(1000, 50 Ɨ 5) = min(1000, 250) = 250 GFLOP/s
 
Status: MEMORY-BOUND (using only 25% of peak FLOPS)
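
Because attainable performance is just a min() of the two roofs, the analysis is easy to script; this Python sketch (the helper name is mine) reproduces the prediction and shows what happens past the ridge point:

# Sketch: Roofline attainable performance = min(compute roof, memory roof).
def attainable_gflops(peak_gflops: float, mem_bw_gbs: float, ai: float) -> float:
    return min(peak_gflops, mem_bw_gbs * ai)

print(attainable_gflops(1000, 50, 5))   # 250.0 -> memory-bound (25% of peak)
print(attainable_gflops(1000, 50, 30))  # 1000  -> past the 20 FLOPS/byte ridge, compute-bound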

Optimization strategy:

  1. Increase AI: Cache blocking, loop fusion, algorithm changes
  2. Increase effective BW: Prefetching, coalescing, compression

3.4 Roofline in Practice

Case Study: ML Matrix Operations

// Small batch inference (batch=1, hidden=512)
AI = 0.5 FLOPS/byte → Memory-bound
Action: Increase batch size to amortize memory transfers
 
// Large batch training (batch=256, hidden=4096)
AI = 200 FLOPS/byte → Compute-bound
Action: Optimize CUDA kernels, use Tensor Cores

4. Basic Profiling Concepts

4.1 What to Measure

1. Time Breakdown ⏱ļø

Total Time = Compute Time + Memory Time + Stall Time
 
// Use hardware counters
perf stat -e cycles,instructions,cache-misses ./app

2. Hotspots šŸ”„

// Identify functions consuming most time
perf record ./app
perf report
 
Output:
50.2%  matmul()        ← Focus optimization here!
20.1%  softmax()
15.3%  layer_norm()

3. Cache Behavior šŸ’¾

// L1/L2/L3 miss rates
perf stat -e L1-dcache-load-misses,LLC-load-misses ./app
 
High miss rate → Memory-bound
Low miss rate + low IPC → Compute-bound or other stalls

4.2 Key Performance Counters (PMCs)

Counter             | Meaning                    | Good Value
--------------------|----------------------------|----------------------
IPC                 | Instructions / Cycle       | > 1.0 for modern CPUs
Cache Miss Rate     | Misses / Accesses          | < 5% for L1
Branch Mispredict   | Mispredicts / Branches     | < 10%
TLB Miss Rate       | Misses / Accesses          | < 1%
Memory Stall Cycles | Cycles waiting for memory  | < 30% of total

5. Workload Characterization

5.1 Understanding Your Application

1. Compute Intensity

FLOP-heavy: ML training, scientific computing
Integer-heavy: Databases, compression, crypto
Mixed: General applications

2. Memory Access Pattern

Sequential: Streaming, good for prefetch
Random: Graph algorithms, hash tables, poor locality
Strided: Matrix operations, tunable

3. Parallelism

Embarrassingly parallel: Image processing, Monte Carlo
Medium: Matrix ops with some dependencies  
Serial: Recursive algorithms, sequential parsing

5.2 Architecture Matching

Workload Type     | Best Architecture | Why
------------------|-------------------|--------------------------
High AI, parallel | GPU, TPU          | Maximize throughput
Low AI, parallel  | Multi-core CPU    | Better memory subsystem
Serial, branchy   | High-IPC CPU      | Out-of-order, prediction
Real-time         | Specialized       | Predictable latency

6. Optimization Strategy

6.1 The Optimization Hierarchy

1. Algorithm (10-100x gains)

O(N²) → O(N log N) can dwarf all other optimizations!

2. Data Structures (2-10x gains)

Array-of-Structs → Struct-of-Arrays for SIMD
Hash table → Cache-friendly design

3. Architecture Awareness (1.5-5x gains)

Cache blocking, prefetching, vectorization

4. Micro-optimizations (1.1-1.5x gains)

Loop unrolling, instruction scheduling

Rule: Always start at the top! Don't optimize bit-twiddling when you have an O(N³) algorithm that could be O(N log N).

7. Key Takeaways

āœ… IPC measures pipeline efficiency - target > 1.0 for modern CPUs

āœ… Throughput ≠ Latency - pipelining improves throughput without reducing latency

āœ… Amdahl's Law reveals the painful truth: serial portions limit speedup

āœ… Roofline Model identifies if you're compute or memory bound

āœ… Profile first, optimize second - measure to find real bottlenecks

āœ… Arithmetic Intensity is the key metric for accelerator suitability

8. Practice Problems

1. Speedup Calculation

50% of code is parallelizable and runs 8x faster on 8 cores.
What is the overall speedup?

2. Roofline Analysis

System: 500 GFLOP/s peak, 40 GB/s memory bandwidth
Kernel: 2 FLOPS/byte arithmetic intensity
Is it compute or memory bound? What's the attainable performance?

3. Bottleneck Identification

Profiler shows: 70% time in memory stalls, IPC = 0.4
What's the problem? What should you optimize?

9. Next Steps

Now that you understand performance fundamentals:

→ Memory Hierarchy: Understand cache behavior and optimization

→ Parallel Computing: Apply Amdahl's Law to real parallel programs

→ Advanced Profiling: Intel VTune, NVIDIA Nsight, perf

→ Optimization: Cache blocking, vectorization, prefetching

Remember: Performance engineering is about making informed trade-offs based on measured data! šŸŽÆ