Benchmarks & Workloads — MLPerf Essentials
What MLPerf Inference/Training measure, how to read QPS/latency/accuracy, and pragmatic usage for architecture evaluation
Level: intermediate · Track: MLSystems · ~75 min
Practical Exercises
- MLPerf result analysis and comparison
- Custom workload integration with MLPerf
- SLO tier mapping for production systems
- Cost-per-token calculation methodology
Tools Required
MLPerf benchmarks, LoadGen, performance analysis tools
Real-World Applications
- Hardware procurement decision support
- Competitive analysis of AI accelerators
- Capacity planning for inference workloads
- Performance regression testing
1) Taxonomy
- Inference: Datacenter (Server, Offline) and Edge (SingleStream, MultiStream). The metric depends on the scenario: Server reports QPS subject to a per-query latency bound, Offline reports raw throughput, and the Edge scenarios report latency; in every case accuracy must stay at or above the target.
- Training: time to target quality, i.e. the minimum wall-clock time to reach a specified validation metric.
Closed vs. Open divisions: Closed fixes the model, accuracy target, and inputs for comparability; Open permits alternative models and optimizations, so results are less directly comparable.
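The scenario-to-metric mapping above can be sketched as a small lookup table. This is an illustrative summary, not an official MLPerf schema; the names and descriptions follow the taxonomy in this section.

```python
# Illustrative sketch of the MLPerf Inference scenario taxonomy:
# each scenario pairs a load pattern with the metric it reports.
SCENARIOS = {
    "Server":       ("Datacenter", "Poisson query arrivals",  "QPS under a per-query latency bound"),
    "Offline":      ("Datacenter", "all queries issued at once", "throughput (samples/s)"),
    "SingleStream": ("Edge",       "one query at a time",     "90th-percentile latency"),
    "MultiStream":  ("Edge",       "fixed-size query bursts", "latency at a fixed number of streams"),
}

def describe(scenario: str) -> str:
    division, load, metric = SCENARIOS[scenario]
    return f"{scenario} ({division}): {load} -> {metric}"

print(describe("Server"))
print(describe("SingleStream"))
```

A table like this is handy when parsing scraped result pages: the scenario name alone tells you which column is the headline metric.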
2) Reading a result like a pro
- Confirm accuracy equivalence (e.g., 99% vs. 99.9% of reference).
- Compare Server (latency‑bounded QPS) vs. Offline (throughput) within the same submission.
- Check power runs (energy per inference) and configuration details: quantization recipe, graph capture, KV-cache reuse, MIG layout.
- Always look for SLO tiers analogous to your production tiers.
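The checks above can be automated against scraped result rows. A hedged sketch, assuming a hand-made `dict` per submission (the field names here are illustrative, not the official MLPerf results schema):

```python
# Sanity checks one might run over a scraped results row.
# Field names are hypothetical; adapt to your scraper's output.
def check_submission(row: dict, ref_accuracy: float) -> list[str]:
    notes = []
    # Accuracy equivalence: was this the 99% or 99.9% target?
    tier = row["accuracy"] / ref_accuracy
    notes.append(f"accuracy tier ~{tier:.1%} of reference")
    # Server QPS is bounded above by Offline throughput; the gap
    # shows the cost of meeting the latency constraint.
    gap = row["server_qps"] / row["offline_qps"]
    notes.append(f"Server achieves {gap:.0%} of Offline throughput")
    if gap < 0.5:
        notes.append("large latency-bound penalty; inspect batching config")
    return notes

row = {"accuracy": 0.792, "server_qps": 9_000, "offline_qps": 12_000}
for note in check_submission(row, ref_accuracy=0.80):
    print(note)
```

Comparing Server to Offline within the same submission, as the list recommends, keeps hardware and software constant so the ratio isolates the latency bound.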
3) Internal usage
- Use as a sanity suite to guard regressions.
- Augment with proprietary workloads (real tokenizer, real pre-/post-processing).
- Track $ per 1k tokens for LLMs; feed into capacity planning.
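The $-per-1k-tokens figure follows directly from instance cost and sustained token throughput. A minimal sketch; the hourly rate and throughput below are made-up numbers for illustration:

```python
# Cost per 1k generated tokens from hourly instance cost and
# sustained decode throughput. All inputs are illustrative.
def cost_per_1k_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1000

# e.g. a $4/hr accelerator sustaining 2,500 tokens/s:
print(f"${cost_per_1k_tokens(4.0, 2500):.5f} per 1k tokens")
```

Feeding this number into capacity planning lets you translate an SLO tier ("QPS at latency X") into a dollar budget per workload.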
4) Communication patterns
Provide a one‑pager per SKU:
- QPS @ latency target, Energy/op, $ per 1k tokens.
- Footnote accuracy, batching mode, precision (FP8/FP16/INT8), and KV-cache policies.
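The one-pager above can be generated mechanically once the headline numbers are in hand. A sketch with a hypothetical SKU name and made-up values:

```python
# Plain-text per-SKU one-pager: headline numbers plus the footnoted
# configuration. SKU name and all values are illustrative.
def one_pager(sku: str, qps_at_slo: float, energy_j_per_inf: float,
              usd_per_1k_tok: float, footnote: str) -> str:
    lines = [
        f"SKU: {sku}",
        f"  QPS @ latency target : {qps_at_slo:,.0f}",
        f"  Energy / inference   : {energy_j_per_inf:.2f} J",
        f"  $ / 1k tokens        : ${usd_per_1k_tok:.4f}",
        f"  Footnote: {footnote}",
    ]
    return "\n".join(lines)

print(one_pager("ACME-X1", 9000, 0.35, 0.0004,
                "99.9% accuracy, FP8, continuous batching, paged KV cache"))
```

Keeping the footnote inseparable from the headline numbers prevents the most common misreading: quoting QPS without its accuracy tier or precision.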
References
- MLPerf Inference and Training documentation portals; latest public result pages.