ML Systems in Datacenters — LLM Inference Realities
TTFT vs tokens/s optimization, batching strategies, KV-cache memory management, PagedAttention/vLLM impact, and practical serving tactics
Practical Exercises
- KV-cache sizing calculations for production workloads
- PagedAttention implementation analysis
- Continuous batching policy design
- TTFT vs throughput trade-off analysis
Real-World Applications
- Large language model serving optimization
- Datacenter GPU resource allocation
- Inference latency SLO management
- AI workload capacity planning
ML Systems in Datacenters — LLM Inference Realities
Focus: TTFT vs. tokens/s, batching/latency, KV‑cache memory, PagedAttention/vLLM impact, practical serving tactics.
📋 Table of Contents
Serving KPIs & Phases (Prefill vs. Decode)
1) The three metrics you'll see everywhere
2) What actually happens on the GPU? Two phases
3) SLOs and tiers (how companies set expectations)
4) Why prefill vs. decode behave so differently
5) Scheduling and system patterns that move the needle
6) How to measure correctly (and avoid common pitfalls)
7) Quick checklist for first deployments
8) KV‑Cache Sizing, Pressure & PagedAttention — From First Principles
9) Scheduling/batching policies
10) Practical engineering checklist
11) Debugging playbook
12) Capacity planning sketch (worked)
Further reading
References
Serving KPIs & Phases (Prefill vs. Decode) — A Practical, No‑Assumptions Guide
Who this is for: Engineers and PMs who are new to LLM serving and want a solid, technically correct mental model that connects what you measure to what actually runs on the GPU.
1) The three metrics you'll see everywhere
1.1 Time to First Token (TTFT)
TTFT is the wall‑clock time from sending the request to the first byte/token arriving. It is dominated by the prefill phase (tokenization + network + scheduler queuing + the model's first pass).
Figure: LLM inference phases showing prefill (parallel processing) and decode (sequential generation). Source: Cho et al., "KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation"
Why it matters: users perceive responsiveness from TTFT; sub‑second TTFT dramatically improves UX for chat UIs and tools that stream output.
How people measure it: record a timestamp just before the request, then the timestamp when the first token is yielded. (Don't include model load time or cold starts unless that's part of your SLO.)
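A minimal measurement sketch, assuming a hypothetical streaming client `stream_tokens(prompt)` that yields output tokens as they arrive; swap in your real client:

```python
import time

def measure_ttft(stream_tokens, prompt):
    """Return TTFT in seconds for one streaming request.

    `stream_tokens(prompt)` is a hypothetical client call that yields
    output tokens as they arrive over the wire.
    """
    t_send = time.perf_counter()                 # timestamp just before sending the request
    for _token in stream_tokens(prompt):
        return time.perf_counter() - t_send      # stop at the first yielded token
    raise RuntimeError("stream produced no tokens")
```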
Understanding the Prefill Phase (and Why It Dominates TTFT)
This section gives you a visual, practical mental model for the prefill phase of LLM inference—the part that typically dominates TTFT. It includes ASCII diagrams, a timeline, and a concrete numeric example.
High‑Level Timeline
Time ──► | network & queue |=========== PREFILL (process full prompt) ===========| first token ► | decode t1 | decode t2 | decode t3 | …
↑——————————————— TTFT ————————————————↑
- TTFT is the wall‑clock time from when the request is sent until the first byte/token arrives.
- Prefill runs the entire prompt through the model, layer by layer, and writes per‑token K/V entries to the KV cache.
- Decode then generates output one token at a time, reusing the KV cache to avoid reprocessing the whole prompt.
Intuition: Prefill is compute‑bound (lots of matrix math across all prompt tokens), while decode tends to be memory/latency‑bound (small, per‑token steps that reuse the cache).
What the Model Does During Prefill
tokens → embeddings → [layer 1: self‑attn → MLP] → [layer 2: self‑attn → MLP] → … → logits
                               │                            │
                      write K,V to KV cache        write K,V to KV cache
- The model processes all prompt tokens in parallel per layer.
- Each attention layer stores keys/values (the KV cache) so the decode loop can reuse them.
- After prefill completes, the first token can be produced and streamed to the client.
Concrete (Illustrative) Example
The numbers below are hypothetical but realistic for a mid‑size modern decoder‑only LLM on a single data‑center GPU. Your exact figures will vary by hardware, batch size, scheduler, model, precision, and prompt length.
Scenario
- Prompt length: 1,000 tokens
- Output target: 100 tokens
- Network + queueing overhead: 50 ms
- Average prefill compute cost: 0.6 ms/token (aggregate across layers)
- Additional framework overhead: 50 ms
- Decode speed: 40 tokens/sec (~25 ms/token)
Derived timeline
- Prefill time ≈ 1,000 × 0.6 ms + 50 ms ≈ 650 ms
- TTFT ≈ 50 ms (net/queue) + 650 ms (prefill) ≈ 700 ms
- Decode time for 100 tokens ≈ 100 × 25 ms = 2,500 ms
- Overall E2E (request sent to last token): ≈ 700 ms + 2,500 ms ≈ 3.2 s
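The same arithmetic as a tiny sketch you can rerun with your own measured costs (the defaults are the illustrative numbers above):

```python
def ttft_and_e2e_ms(prompt_tokens, output_tokens,
                    net_queue_ms=50.0, prefill_ms_per_token=0.6,
                    framework_overhead_ms=50.0, decode_ms_per_token=25.0):
    """First-order latency estimate using the illustrative numbers above."""
    prefill_ms = prompt_tokens * prefill_ms_per_token + framework_overhead_ms
    ttft_ms = net_queue_ms + prefill_ms
    decode_ms = output_tokens * decode_ms_per_token
    return ttft_ms, ttft_ms + decode_ms

ttft, e2e = ttft_and_e2e_ms(prompt_tokens=1_000, output_tokens=100)
print(f"TTFT ~ {ttft:.0f} ms, end-to-end ~ {e2e/1000:.1f} s")   # ~700 ms, ~3.2 s
```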
Why Prefill Often Dominates TTFT
- It must process every prompt token before any output can be streamed.
- Attention during prefill works over the whole prompt, so compute scales with prompt length.
- Techniques like chunked prefill, prefill–decode disaggregation, prefix caching, and KV‑Runahead/parallelization specifically target TTFT.
Practical Tips for Optimizing TTFT
- Keep prompts lean; move large context, tools, or retrieved content to structured inputs when possible.
- Use a serving stack that supports chunked prefill and balanced scheduling between compute‑bound prefill and memory‑bound decode.
- Measure TTFT separately from ITL/TPS to catch regressions specific to the prompt path.
1.2 Inter‑Token Latency (ITL) / Time‑Per‑Output‑Token (TPOT)
Once the first token appears, you care about the rate at which the rest of the tokens arrive. ITL (a.k.a. TPOT) is the average time between successive output tokens during decoding. Lower is better.
1.3 Tokens per Second (TPS, "tokens/s")
Throughput measured as output tokens per second (sometimes split into input and output token rates). Useful for capacity planning, autoscaling, and cost/performance tradeoffs. TPS must be considered alongside latency metrics—maximizing TPS alone can harm TTFT/ITL in interactive apps.
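Given per-token arrival timestamps (as logged in the TTFT sketch above), ITL/TPOT and output TPS fall out directly; a minimal sketch:

```python
def itl_and_tps(t_send, arrival_times):
    """Compute TTFT, ITL/TPOT, and output tokens/s from per-token arrival timestamps.

    `t_send` is when the request was sent; `arrival_times` are the wall-clock
    times (seconds) at which each output token arrived.
    """
    ttft = arrival_times[0] - t_send
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0      # average time between tokens
    tps = len(arrival_times) / (arrival_times[-1] - t_send)   # output tokens per second
    return ttft, itl, tps
```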
2) What actually happens on the GPU? Two phases
LLM inference alternates between two very different phases:
- Prefill (a.k.a. prompt processing): the entire prompt (N tokens) is processed in parallel. This is compute‑bound (large GEMMs & attention) and mainly determines TTFT.
- Decode (generation): one new token at a time. This stage repeatedly reads the KV cache created earlier. It's memory‑bandwidth/latency‑bound and mainly determines ITL/TPOT and tokens/s.
A first‑order mental model:
TTFT ~= tokenization + network + queueing + prefill compute
ITL ~= (KV-cache reads per step + small compute) / available memory bandwidth
TPS ~= 1 / ITL (for a single stream; scheduling affects multi‑stream TPS)
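To make the decode half of that model concrete, a rough memory-bound floor on ITL can be estimated as bytes moved per decode step (model weights streamed once per step plus every active sequence's KV cache) divided by achievable HBM bandwidth. The sketch below ignores compute, caches, and overlap, so treat it only as a lower bound, and the example numbers are assumptions.

```python
def decode_step_floor_ms(weight_bytes, batch, context_tokens,
                         kv_bytes_per_token, hbm_bw_bytes_per_s):
    """Memory-bound lower bound on per-step decode time (an ITL floor).

    Per decode step the GPU streams the model weights once (shared across the
    batch) and reads each sequence's KV cache.
    """
    bytes_per_step = weight_bytes + batch * context_tokens * kv_bytes_per_token
    return 1e3 * bytes_per_step / hbm_bw_bytes_per_s

# Hypothetical example: ~14 GB of FP16 weights (7B params), batch 16,
# 4k context, ~0.5 MB of KV per token, ~2 TB/s achievable HBM bandwidth.
print(decode_step_floor_ms(14e9, 16, 4096, 0.5e6, 2e12), "ms")   # ~23 ms floor
```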
3) SLOs and tiers (how companies set expectations)
Most production systems define tiered SLOs:
- Premium / Low‑latency: tight TTFT (e.g., streaming begins quickly), controlled ITL for interactive tooling. May trade raw throughput for jitter control.
- Throughput tier / Batch: looser TTFT but very high TPS for offline or background workloads.
- Hybrid: separate budgets for TTFT vs. ITL, sometimes using disaggregation (see §5) to tune them independently.
4) Why prefill vs. decode behave so differently
- Prefill is parallel compute: big matrix multiplications over the whole prompt; very high arithmetic intensity. Scheduling and queuing can dominate here under load.
- Decode is memory traffic: each new token attends to all prior tokens by reading the stored K/V tensors (the KV cache). Arithmetic per token is small; latency is often gated by HBM bandwidth and cache locality.
This division is why optimizations often split along phase boundaries.
5) Scheduling and system patterns that move the needle
5.1 Continuous/rolling batching
Keep the GPUs saturated by continuously admitting new requests/tokens into active batches. This improves utilization and throughput without letting TTFT/ITL explode (good schedulers prioritize decodes).
5.2 Chunked prefill
Break large prefills into chunks that can be interleaved with decodes so decoders don't starve. Often combined with decode‑first prioritization to keep ITL stable.
5.3 Disaggregating prefill and decode
Run prefill and decode on different worker pools (or even different hardware/parallelism strategies). This lets you independently tune TTFT and ITL, and reduce tail latency interference when long prefills block decodes.
5.4 Prefix caching
If many requests share the same prompt prefix (system prompt, RAG boilerplate), reuse the KV cache for that prefix across requests and skip recompute. This reduces TTFT and pressure on memory capacity/bandwidth.
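In vLLM these patterns surface as engine arguments; a minimal sketch is below. Continuous batching is the default scheduler behavior, and the flag names reflect recent vLLM releases, so verify them against your installed version.

```python
from vllm import LLM, SamplingParams

# Continuous batching is vLLM's default scheduling behavior; the knobs below
# enable chunked prefill and automatic prefix caching (argument names as of
# recent vLLM releases -- check the docs for your version).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # example model; use your own
    enable_chunked_prefill=True,                # interleave prefill chunks with decodes
    enable_prefix_caching=True,                 # reuse KV blocks for shared prompt prefixes
    max_num_batched_tokens=8192,                # per-step token budget for the scheduler
    gpu_memory_utilization=0.90,                # fraction of VRAM for weights + KV blocks
)

out = llm.generate(["Summarize continuous batching in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```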
6) How to measure correctly (and avoid common pitfalls)
- Include queuing in TTFT if you're measuring end‑to‑end user latency.
- Stream responses to cut perceived latency; but remember that streaming does not change end‑to‑end latency for a fixed number of tokens. It mainly improves UX and early‑exit behavior.
- Separate metrics by phase in your dashboards: TTFT (prefill), ITL/TPOT (decode), plus request‑level p50/p95 and GPU‑level utilization/bandwidth counters.
- Watch prompt lengths: TTFT grows with input tokens; ITL grows mostly with output tokens and KV‑cache bandwidth pressure.
- Quantization and speculative decoding mainly help decode: weight and KV quantization cut the bytes read per step in the bandwidth-bound decode phase, and speculative decoding amortizes several output tokens per target-model pass. Expect smaller gains for compute-bound prefill unless you also use lower-precision compute kernels there.
7) Quick checklist for first deployments
- Track TTFT, ITL/TPOT, TPS per route/model; alert on tail (p95/p99).
- Enable continuous batching; prefer decode‑first scheduling.
- Turn on chunked prefill for long prompts.
- Use prefix caching if prefixes repeat.
- Consider prefill/decode disaggregation when traffic & SLOs warrant it.
- Capacity plan with input token distribution and output length caps.
Further reading (annotated)
- NVIDIA: "Metrics for LLM benchmarking"—clear definitions of TTFT, ITL/TPOT and throughput, plus measurement tips.
- BentoML LLM Inference Handbook: concise explanations of prefill = compute‑bound, decode = memory‑bound, and why disaggregation helps.
- Databricks engineering blogs and docs: practical advice on balancing throughput vs. SLOs; evidence that decode is memory‑bound; how to benchmark.
- Microsoft "DéjàVu" paper: formalizes prefill compute‑bound vs. decode memory‑bound and discusses decode's near‑constant step time with KV cache.
- vLLM docs: continuous batching, chunked prefill, and prefix caching knobs you can use today.
Additional Prefill & TTFT Optimization Resources
- vLLM – Optimization & Tuning (Chunked Prefill): https://docs.vllm.ai/en/stable/configuration/optimization.html
- NVIDIA – TensorRT‑LLM: Chunked Prefill (developer blog): https://developer.nvidia.com/blog/streamlining-ai-inference-performance-and-deployment-with-nvidia-tensorrt-llm-chunked-prefill/
- Apple ML Research – KV‑Runahead (parallelizing the prompt phase to reduce TTFT): https://machinelearning.apple.com/research/kv-runahead (arXiv: https://arxiv.org/html/2405.05329)
- LLM Inference Handbook – Prefill/Decode overview & disaggregation: https://bentoml.com/llm/inference-optimization/prefill-decode-disaggregation
- Glean – How KV caches impact Time‑to‑First‑Token (charts & practical notes): https://www.glean.com/jp/blog/glean-kv-caches-llm-latency
These links contain diagrams and deeper explanations of the prefill → first token → decode flow, plus techniques to improve TTFT.
8) KV‑Cache Sizing, Pressure & PagedAttention — From First Principles
Goal: Understand exactly what the KV cache is, how to size it, where memory is wasted, and why systems like vLLM with PagedAttention fix the problem. This section assumes no prior GPU background.
8.1) What is the KV cache?
During attention, every token produces three vectors per layer: Q, K, and V.
For decoding the next token, the model only needs the new Q—but it must read all prior K and V for that sequence (and head) to compute attention. We therefore store all previously computed K/V in GPU memory as the KV cache so we don't recompute them at every step.
Implication: decode steps are read‑heavy on the KV cache, so bandwidth/latency dominate performance once generation begins.
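To make this concrete, a toy single-head decode step with a KV cache (NumPy; no batching, multiple layers, or heads): the new token contributes one Q, appends one K and one V, and attention reads the entire cache.

```python
import numpy as np

def decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """One decode step for a single head: append new K/V, attend over all cached K/V.

    x_new: (d_model,) hidden state of the newest token.
    k_cache, v_cache: lists of (d_head,) vectors from all previous tokens.
    """
    q = x_new @ W_q                        # only the new token's query is needed
    k_cache.append(x_new @ W_k)            # store K so later steps can reuse it
    v_cache.append(x_new @ W_v)            # store V so later steps can reuse it

    K = np.stack(k_cache)                  # (T, d_head): read the whole cache
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(q.shape[-1])  # attention over all prior + current tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                     # (d_head,) attention output for this step
```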
8.2) How big is the KV cache? (Exact formula + example)
For most decoder‑only transformers, the KV memory for one sequence of length T is approximately:
KV_bytes_per_sequence ≈ T × L × H × D × 2 × bytes_per_elem
Where:
- T = tokens processed so far (context length)
- L = number of transformer layers
- H = attention heads (use the KV-head count for GQA/MQA models)
- D = head dimension
- The factor 2 accounts for K and V (no Q is stored)
- bytes_per_elem = 2 for FP16/BF16, 1 for FP8/INT8 variants, 4 for FP32
For a batch of B sequences, multiply by B (or by the sum of individual sequence lengths).
Worked example (matches the slide)
Batch = 16, Layers = 32, Heads = 32, Head dim = 128, Precision = FP16 (2 bytes), Context = 4096 tokens:
KV_bytes ≈ 16 × 32 × 32 × 128 × 2 × 2 × 4096
= 34.36 GB (decimal) = 32.00 GiB (binary)
That is just for KV — you still need memory for activations, parameters, and workspace. This is why naïve serving often "runs out of VRAM" even on large GPUs.
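A small calculator for this formula; the defaults reproduce the worked example.

```python
def kv_cache_bytes(batch, context_tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache footprint in bytes: B x T x L x H x D x 2 (K and V) x element size."""
    return batch * context_tokens * layers * kv_heads * head_dim * 2 * bytes_per_elem

b = kv_cache_bytes(batch=16, context_tokens=4096, layers=32,
                   kv_heads=32, head_dim=128, bytes_per_elem=2)
print(f"{b / 1e9:.2f} GB  ({b / 2**30:.2f} GiB)")   # 34.36 GB  (32.00 GiB)
```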
8.3) Why naïve allocation wastes memory (fragmentation)
If you allocate one contiguous slab per request sized to the max context (or grow and copy it as context expands), three bad things happen:
- External fragmentation: small holes between slabs can't be used for new sequences.
- Internal fragmentation: you reserve space for max length even if the sequence is short.
- Resize stalls: when a request outgrows its slab, you may need to reallocate/copy, which stalls decoding.
These pathologies cap batch size and increase tail latency.
8.4) PagedAttention: make KV look like virtual memory
Key idea: manage KV in fixed‑size blocks (e.g., 16–32 tokens per block) and maintain a page table that maps each sequence's logical blocks to physical blocks located anywhere in GPU memory.
Benefits:
- Near‑zero external fragmentation: any free block can be reused for any sequence.
- Small internal fragmentation: at most one partial block per sequence.
- Fast growth: append a block; no copy of prior KV.
- Sharing & caching: identical prefixes across requests can point to the same physical blocks (prefix caching).
- Kernel‑aware: attention kernels are optimized to fetch from these paged KV layouts efficiently.
You'll see knobs like block_size = 16 or 32 tokens and features like automatic prefix caching in modern systems (e.g., vLLM).
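A toy sketch of the bookkeeping (not vLLM's kernels or block manager): fixed-size blocks, a free list, a per-sequence page table, and reference counts so shared prefixes can point at the same physical blocks. Copy-on-write for a shared, partially filled last block is omitted for brevity.

```python
class PagedKVAllocator:
    """Toy block manager in the spirit of PagedAttention (bookkeeping only)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical-block free list
        self.refcount = [0] * num_blocks      # >1 when a block is shared (prefix caching)
        self.page_table = {}                  # seq_id -> [physical block ids]
        self.seq_len = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve KV space for one more token; allocate a block only on overflow."""
        table = self.page_table.setdefault(seq_id, [])
        n = self.seq_len.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full (or first token)
            blk = self.free.pop()             # any free block works: no external fragmentation
            self.refcount[blk] += 1
            table.append(blk)
        self.seq_len[seq_id] = n + 1

    def fork_prefix(self, src_id, dst_id):
        """Share src's blocks with dst (prefix reuse; no KV copy)."""
        self.page_table[dst_id] = list(self.page_table[src_id])
        self.seq_len[dst_id] = self.seq_len[src_id]
        for blk in self.page_table[dst_id]:
            self.refcount[blk] += 1

    def free_seq(self, seq_id):
        """Return a finished sequence's blocks to the free list when unreferenced."""
        for blk in self.page_table.pop(seq_id, []):
            self.refcount[blk] -= 1
            if self.refcount[blk] == 0:
                self.free.append(blk)
        self.seq_len.pop(seq_id, None)
```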
8.5) Practical knobs that relieve KV pressure
- Reduce precision for KV (e.g., FP8/INT8) to cut bytes per element.
- Prefix caching to avoid recomputing shared prompts; the reused blocks do not multiply memory.
- Continuous batching to keep utilization high while admitting decodes promptly.
- Chunked prefill so long prompts don't starve decoders.
- KV compression/eviction (research): e.g., SnapKV, RocketKV, head/token dropping—trade a bit of model quality for much lower bandwidth/VRAM.
- Disaggregate prefill & decode so memory‑bound decode runs on appropriately tuned instances without being interrupted by long prefills.
8.6) Sanity‑check calculator (how to size a box)
Given model {L, H, D} and precision p:
KV bytes per token = 2 × L × (H × D) × bytes(p)
KV bytes per seq = (tokens_so_far) × [bytes per token]
KV bytes per batch = sum_over_sequences(KV bytes per seq)
Add headroom for: model weights, temporary activations (prefill), CUDA workspace, attention metadata, and fragmentation. For multi‑tenant services, assume p95 input lengths, not just means.
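Turning that into a quick "does it fit" check, with hypothetical headroom figures for weights, activations, and workspace that you should replace with your own measurements:

```python
def max_concurrent_seqs(gpu_gib, weight_gib, headroom_gib,
                        p95_tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """How many p95-length sequences fit once weights and headroom are reserved."""
    kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V
    kv_per_seq_gib = p95_tokens * kv_per_token / 2**30
    budget_gib = gpu_gib - weight_gib - headroom_gib
    return int(budget_gib // kv_per_seq_gib)

# Hypothetical box: 80 GiB GPU, 14 GiB of FP16 weights (7B model),
# 10 GiB headroom for activations/workspace, p95 context of 4,096 tokens.
print(max_concurrent_seqs(80, 14, 10, 4096, layers=32, kv_heads=32, head_dim=128))  # 28
```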
8.7) What this looks like in a real engine (vLLM)
- PagedAttention kernels read KV from block‑structured caches and use a page‑table lookup to hop to the right physical blocks efficiently.
- Block management shows up in configs (block_size), schedulers (preempt & resume sequences across blocks), and features like automatic prefix caching and continuous batching.
- Chunked prefill is enabled by default in newer vLLM releases, so decodes keep flowing even under long prefills.
8.8) Takeaways
- KV footprint scales with B × T × L × H × D × precision; it grows linearly with context length and batch size.
- Decode performance is often capped by the memory bandwidth needed to read KV; optimize for locality, precision, and block layout.
- PagedAttention plus prefix caching and continuous batching are practical, production‑proven ways to reduce waste and improve both latency and throughput.
Further reading (annotated)
- NVIDIA "Mastering LLM Techniques – Inference Optimization": includes the canonical KV sizing formula and concrete examples.
- vLLM paper & docs on PagedAttention and block‑level memory management; slides detailing block sizes (16/32 tokens) and how page tables minimize fragmentation.
- vLLM design docs for paged attention kernels and prefix caching.
- Research on KV compression & bandwidth reduction (e.g., RocketKV, head‑level compression).
9) Scheduling/batching policies
- Static batching: good throughput, but requests wait for the whole batch to finish, GPUs idle between batches, and tail latency suffers.
- Continuous batching: admit new requests into in-flight batches between token steps; better latency/throughput balance (a toy admission loop is sketched after this list).
- Speculative decoding: draft model proposes k tokens, target verifies; reduces wall time under accuracy guardrails.
- Separate low‑latency tier (small batches, reserved capacity) from throughput tier (large batches).
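A toy decode-first admission loop, referenced from the continuous-batching item above, to make the policy concrete; real schedulers also handle preemption, KV-block availability, and fairness.

```python
from collections import deque

def continuous_batching_step(running, waiting, max_batch_tokens):
    """Build one engine step: decodes first, then admit prefills into leftover budget.

    `running`: sequences mid-generation (one token each this step).
    `waiting`: deque of (seq, remaining_prompt_tokens) not yet fully prefilled.
    """
    step, budget = [], max_batch_tokens

    for seq in running:                      # decode-first keeps ITL stable
        step.append((seq, "decode", 1))
        budget -= 1

    while waiting and budget > 0:            # admit new prefills with leftover budget
        seq, prompt_len = waiting[0]
        chunk = min(prompt_len, budget)      # chunked prefill: take only what fits
        step.append((seq, "prefill", chunk))
        budget -= chunk
        if chunk == prompt_len:
            waiting.popleft()
            running.append(seq)              # finished prefill; starts decoding next step
        else:
            waiting[0] = (seq, prompt_len - chunk)
    return step
```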
10) Practical engineering checklist
- Profile prefill vs. decode separately; scale resources accordingly (more tensor cores vs. more KV BW).
- Keep KV in HBM; if offloading to CPU, prefetch and overlap DMA; pin pages; use large IOMMU pages.
- Use quantization (int8/int4) where quality allows; beware quantization‑aware kernels for attention.
- Consider MIG or node‑level isolation; noisy neighbors wreak havoc on tails.
- Instrument TTFT distribution and tokens/s per request; log batch sizes and effective context lengths.
11) Debugging playbook
- TTFT spikes: request coalescing delays, tokenizer CPU hotspot, cold model weights, page faults.
- Low tokens/s: poor achieved occupancy, unfused attention kernels, KV gather patterns causing bank conflicts or L2 thrash, PCIe back‑pressure.
- OOM: reduce max context or enable paged KV, apply prefix caching, share LoRA adapters.
12) Capacity planning sketch (worked)
Target p99 ≤ 300 ms for 128‑token responses. Budget: TTFT 50 ms, decode 250 ms.
If the amortized decode cost is ≈ 0.5 ms per output token at batch = 32, aggregate throughput ≈ 2,000 output tokens/s per GPU.
With continuous batching + paged KV, expect +20–50% throughput depending on context distribution; validate on your model mix.
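A back-of-the-envelope version of that sketch; the uplift factor for continuous batching plus paged KV is the assumption to validate on your own model mix.

```python
import math

def gpus_needed(req_per_s, tokens_per_resp, per_gpu_tokens_per_s, uplift=1.2):
    """GPUs required to sustain a request rate at a given per-GPU decode throughput."""
    demand = req_per_s * tokens_per_resp          # output tokens/s to serve
    supply = per_gpu_tokens_per_s * uplift        # continuous batching + paged KV uplift
    return math.ceil(demand / supply)

# 50 req/s of 128-token responses at ~2,000 tokens/s/GPU, assuming a 1.2x uplift.
print(gpus_needed(req_per_s=50, tokens_per_resp=128, per_gpu_tokens_per_s=2000))  # 3
```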
References
- vLLM PagedAttention; NVIDIA TensorRT‑LLM KV manager; CUDA performance backgrounders.