Technical Glossary
Comprehensive definitions of computer architecture, AI systems, and performance optimization concepts. Hover over any term to explore detailed technical explanations.
Contrastive Language–Image Pre-training (OpenAI, 2021) trains a vision encoder and a text encoder together so that match...
Dynamic batching strategy that continuously admits new requests into active batches as tokens complete, rather than wait...
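The admission policy above can be sketched in a few lines. This is a toy simulation, not any serving framework's actual scheduler: request IDs, the `max_batch` parameter, and the one-token-per-step model are all illustrative assumptions.

```python
from collections import deque

def continuous_batching(pending, max_batch):
    """Toy scheduler: a new request is admitted as soon as a batch slot
    frees up, instead of waiting for the whole batch to drain."""
    queue = deque(pending)   # waiting requests: (request_id, tokens_to_generate)
    active = {}              # request_id -> tokens still to generate
    finished = []            # request ids in completion order
    while queue or active:
        # Admit waiting requests into any free batch slots.
        while queue and len(active) < max_batch:
            rid, remaining = queue.popleft()
            active[rid] = remaining
        # One decode step: every active request emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                finished.append(rid)
    return finished
```

Note how request "c" starts decoding as soon as the short request "b" finishes, rather than waiting for "a" to drain as well.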
The second phase of LLM inference that generates output one token at a time. This stage repeatedly reads the KV cache cr...
Specialized parallel computing architecture optimized for throughput over latency. Core principle: thousands of lightwei...
Many-to-one communication pattern where multiple senders simultaneously transmit to a single receiver, creating synchron...
Also called Time-Per-Output-Token (TPOT). The average time between successive output tokens during decoding. Lower ITL m...
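A minimal sketch of how ITL is computed from per-token arrival timestamps (the timestamp list is a hypothetical input; real benchmarks record these from the streaming response):

```python
def inter_token_latency(arrival_times):
    """Average gap between successive output-token timestamps (ITL / TPOT).
    `arrival_times` is a sorted list of per-token arrival times in seconds."""
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    return sum(gaps) / len(gaps)
```

For a single stream, throughput follows directly: TPS ≈ 1 / ITL.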
Key-Value cache that stores previously computed K (key) and V (value) tensors from attention layers. During decode, the ...
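Because the cache stores one K and one V vector per layer per token, its size grows linearly with sequence length. A back-of-the-envelope sketch (the model shape below is a hypothetical 7B-class configuration, used only to show the arithmetic):

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache added per generated token: one K and one V vector
    per layer (the factor of 2), each n_kv_heads * head_dim elements."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# e.g. 32 layers, 32 KV heads, head_dim 128, fp16 -> 512 KiB per token,
# so a 4096-token context holds 2 GiB of KV cache for one sequence.
```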
Hierarchical storage architecture designed to bridge the speed gap between fast processors and slow main memory. Exploit...
Short-duration traffic spikes where arrival rate temporarily exceeds link capacity. Typically last microseconds to milli...
Memory management technique that manages KV cache in fixed-size blocks (16-32 tokens) with a page table mapping logical ...
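The logical-to-physical mapping can be illustrated with a toy allocator. This is a sketch of the page-table idea only, not any framework's real block manager; class and method names are invented for illustration:

```python
class PagedKV:
    """Toy paged KV allocator: each sequence's logical token positions map
    to fixed-size physical blocks through a per-sequence page table."""
    def __init__(self, block_size=16):
        self.block_size = block_size
        self.next_block = 0
        self.tables = {}    # sequence id -> list of physical block ids
        self.lengths = {}   # sequence id -> tokens stored so far

    def append_token(self, seq):
        length = self.lengths.get(seq, 0)
        if length % self.block_size == 0:   # current block is full (or first token)
            self.tables.setdefault(seq, []).append(self.next_block)
            self.next_block += 1
        self.lengths[seq] = length + 1

    def physical_slot(self, seq, pos):
        """Translate a logical token position to (physical block, offset)."""
        block = self.tables[seq][pos // self.block_size]
        return block, pos % self.block_size
```

Blocks are allocated on demand, so a sequence never reserves more than one partially filled block — the source of the memory savings over contiguous preallocation.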
A mathematical model where job arrivals occur randomly and independently, with the time between arrivals following an ex...
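Sampling such a process is a one-liner per arrival, since the inter-arrival gaps are i.i.d. exponential. A minimal sketch (function name and seeding are illustrative):

```python
import random

def poisson_arrivals(rate, n, seed=0):
    """Arrival times of a Poisson process with the given rate (events/sec):
    gaps between arrivals are i.i.d. Exponential(rate)."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate)   # mean gap = 1 / rate
        times.append(t)
    return times
```

This is the standard open-loop load model for LLM-serving benchmarks: with `rate=2.0`, requests arrive on average every 0.5 s, but the gaps vary, producing realistic bursts.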
The first phase of LLM inference where the entire prompt (N tokens) is processed in parallel. This is compute-bound (lar...
Specialized cache architecture that reduces miss penalties using predetermined access patterns. Features temporal decoup...
Quantitative target for service reliability/performance. For LLM serving, typically includes TTFT and ITL thresholds at ...
Mathematical optimization technique developed by Samsung for Exynos 2400 NPU to solve memory hierarchy problems. Creates...
"Tail" refers to high-percentile latencies (p95/p99) - the slow end of the latency distribution. p95 means 95% of reques...
The wall-clock time from sending the request to the first byte/token arriving. TTFT is dominated by the prefill phase (t...
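Measuring TTFT from a streaming response can be sketched as follows; the token-iterator interface here is a hypothetical stand-in for whatever streaming API a client uses:

```python
import time

def measure_ttft(token_stream):
    """Wall-clock seconds from issuing the request until the first token
    arrives. `token_stream` is any iterator yielding tokens."""
    start = time.perf_counter()
    first_token = next(iter(token_stream))
    return time.perf_counter() - start, first_token
```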
Throughput metric measuring output tokens per second. For single stream: TPS ≈ 1/ITL. For multiple streams, scheduling a...
Neural network architecture based on the attention mechanism that processes sequences in parallel. Core in...
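The attention mechanism at the heart of the architecture is scaled dot-product attention: softmax(QKᵀ/√d)·V. A single-head, pure-Python sketch (list-of-lists vectors, no batching — purely for clarity, not efficiency):

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention for one head.
    Q, K, V are lists of vectors (lists of floats); K and V have equal length."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Similarity of the query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

A query closely aligned with one key pulls its output almost entirely from that key's value vector.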
Vision Transformer (ViT) is a deep learning architecture that revolutionized computer vision by applying the transformer...
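The first step of a ViT is splitting the image into fixed-size patches, each flattened into a vector before linear embedding. A minimal sketch on a single-channel image represented as a list of rows (real implementations operate on multi-channel tensors):

```python
def image_to_patches(img, patch):
    """Split an H x W image (list of rows) into non-overlapping
    patch x patch tiles, each flattened row-major -- the token sequence
    a ViT feeds into its transformer encoder."""
    H, W = len(img), len(img[0])
    patches = []
    for r in range(0, H, patch):
        for c in range(0, W, patch):
            patches.append([img[r + i][c + j]
                            for i in range(patch) for j in range(patch)])
    return patches
```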