
Technical Glossary

Comprehensive definitions of computer architecture, AI systems, and performance optimization concepts.

CLIP (Contrastive Language-Image Pre-training) [advanced]

Contrastive Language–Image Pre-training (OpenAI, 2021) trains a vision encoder and a text encoder together so that match...

multimodal, contrastive-learning, text-encoder, stable-diffusion, cross-attention, conditioning, openai
Continuous Batching [advanced]

Dynamic batching strategy that continuously admits new requests into active batches as tokens complete, rather than wait...

llm, batching, scheduling, optimization
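The admit-as-slots-free behavior can be sketched in a few lines. The scheduler below is a hypothetical toy (output-token counts stand in for real sequences, and `max_batch` for available batch slots), not any serving engine's actual implementation:

```python
def continuous_batching(requests, max_batch=4):
    """requests: list of output lengths; returns total decode steps taken."""
    waiting = list(requests)
    active = {}          # request id -> tokens still to generate
    next_id, steps = 0, 0
    while waiting or active:
        # Admit waiting requests into any free slots before each step.
        while waiting and len(active) < max_batch:
            active[next_id] = waiting.pop(0)
            next_id += 1
        # One decode step: every active sequence emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # slot freed immediately, mid-"batch"
        steps += 1
    return steps
```

With static batching, a newly arrived request would wait for the entire current batch to drain; here it is admitted the moment any sequence finishes and frees a slot.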
Decode Phase [intermediate]

The second phase of LLM inference that generates output one token at a time. This stage repeatedly reads the KV cache cr...

llm, inference, memory-bound, sequential
GPU Architecture Fundamentals [advanced]

Specialized parallel computing architecture optimized for throughput over latency. Core principle: thousands of lightwei...

gpu, parallel-computing, simd, cuda, streaming-multiprocessor, warp, memory-hierarchy, tensor-cores
Incast Scenarios [intermediate]

Many-to-one communication pattern where multiple senders simultaneously transmit to a single receiver, creating synchron...

networking, traffic-pattern, datacenter, ml-systems, congestion
Inter-Token Latency (ITL) [basic]

Also called Time-Per-Output-Token (TPOT). The average time between successive output tokens during decoding. Lower ITL m...

llm, latency, streaming, decode
KV Cache [advanced]

Key-Value cache that stores previously computed K (key) and V (value) tensors from attention layers. During decode, the ...

llm, memory, attention, optimization
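A single-head decode step against a growing cache can be sketched with NumPy. The dictionary-of-arrays cache and projection-free setup are simplifying assumptions, not how a real engine lays out memory:

```python
import numpy as np

def decode_step(q, kv_cache, k_new, v_new):
    """Append this step's key/value rows, then attend over the full cache."""
    kv_cache["k"] = np.vstack([kv_cache["k"], k_new])   # (t, d)
    kv_cache["v"] = np.vstack([kv_cache["v"], v_new])
    scores = kv_cache["k"] @ q / np.sqrt(q.shape[-1])   # (t,)
    w = np.exp(scores - scores.max())
    w /= w.sum()                                        # softmax over cached keys
    return w @ kv_cache["v"]                            # (d,)

# The cache grows by one K row and one V row per generated token.
rng = np.random.default_rng(0)
d = 8
cache = {"k": np.empty((0, d)), "v": np.empty((0, d))}
for _ in range(3):
    out = decode_step(rng.normal(size=d), cache,
                      rng.normal(size=(1, d)), rng.normal(size=(1, d)))
```

Each step reads the entire cache but appends only one row to it, which is why decode tends to be memory-bandwidth-bound.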
Memory Systems [advanced]

Hierarchical storage architecture designed to bridge the speed gap between fast processors and slow main memory. Exploit...

memory-hierarchy, cache, virtual-memory, tlb, numa, memory-management, performance, locality
Microbursts [intermediate]

Short-duration traffic spikes where arrival rate temporarily exceeds link capacity. Typically last microseconds to milli...

networking, traffic, datacenter, queueing, performance
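A toy fluid model shows how a burst that briefly exceeds line rate builds a queue that persists after the burst ends (the packet counts below are illustrative, not measured):

```python
def queue_depths(arrivals, capacity):
    """Queue depth after each tick; the link drains `capacity` packets/tick."""
    depth, depths = 0, []
    for a in arrivals:
        depth = max(0, depth + a - capacity)
        depths.append(depth)
    return depths

# A 2-tick microburst of 10 pkts/tick on a link that drains 4 pkts/tick.
depths = queue_depths([2, 10, 10, 2, 2, 2], capacity=4)
```

The queue peaks at 12 packets during the burst and takes several ticks to drain afterward, which is exactly the buffering (or loss) that microbursts cause.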
PagedAttention [advanced]

Memory management technique that manages KV cache in fixed-size blocks (16-32 tokens) with a page table mapping logical ...

llm, memory-management, optimization, vllm
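A minimal sketch of the block-table indirection, loosely modeled on vLLM's design (the 16-token block size matches the range above; the free-list and class names are hypothetical):

```python
BLOCK = 16  # tokens per KV-cache block

class BlockTable:
    def __init__(self, free_blocks):
        self.free = list(free_blocks)   # pool of physical block ids
        self.table = []                 # logical block index -> physical block

    def slot(self, token_idx):
        """Map a logical token position to a physical (block, offset) pair."""
        logical = token_idx // BLOCK
        while len(self.table) <= logical:
            self.table.append(self.free.pop(0))  # allocate on demand
        return self.table[logical], token_idx % BLOCK

bt = BlockTable(free_blocks=[7, 3, 9])
```

Because blocks are allocated on demand and need not be contiguous, a sequence never reserves memory for tokens it has not generated yet, eliminating most KV-cache fragmentation.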
Poisson Arrivals [intermediate]

A mathematical model where job arrivals occur randomly and independently, with the time between arrivals following an ex...

queueing-theory, probability, arrival-process, mathematical-modeling
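Sampling such a process takes one line per arrival, since exponential inter-arrival gaps are exactly what the standard library's `random.expovariate` draws (the rate and horizon below are arbitrary examples):

```python
import random

def poisson_arrival_times(rate, horizon, rng=random.Random(42)):
    """Arrival timestamps in [0, horizon] for a Poisson process of given rate."""
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)   # memoryless gap to the next arrival
        if t > horizon:
            return times
        times.append(t)

arrivals = poisson_arrival_times(rate=5.0, horizon=1000.0)
```

Over a long horizon the arrival count concentrates near rate × horizon, here about 5000 arrivals.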
Prefill Phase [intermediate]

The first phase of LLM inference where the entire prompt (N tokens) is processed in parallel. This is compute-bound (lar...

llm, inference, compute-bound, parallel
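A back-of-envelope arithmetic-intensity comparison makes the contrast with decode concrete. The ~2 FLOPs per parameter per token rule of thumb and fp16 weights are assumptions, and 7B is just an example model size:

```python
def flops_per_byte(params, tokens):
    """Arithmetic intensity of one forward pass over `tokens` positions."""
    flops = 2 * params * tokens   # ~2 FLOPs per parameter per token
    bytes_read = 2 * params       # fp16 weights read once per pass
    return flops / bytes_read

prefill_intensity = flops_per_byte(7e9, tokens=2048)  # whole prompt at once
decode_intensity = flops_per_byte(7e9, tokens=1)      # one token per pass
```

Prefill amortizes each weight read over thousands of tokens (2048 FLOPs/byte here), while decode gets one token per weight read (1 FLOP/byte), far below what a GPU needs to stay compute-bound.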
Q-Cache (Queuing Cache) [advanced]

Specialized cache architecture that reduces miss penalties using predetermined access patterns. Features temporal decoup...

npu, cache, memory-hierarchy, ai-accelerator
Service Level Objective (SLO) [intermediate]

Quantitative target for service reliability/performance. For LLM serving, typically includes TTFT and ITL thresholds at ...

sre, reliability, performance, monitoring
Skewness-Curve-Based Methods [advanced]

Mathematical optimization technique developed by Samsung for Exynos 2400 NPU to solve memory hierarchy problems. Creates...

tensor-optimization, tiling, memory-reuse, npu, ai-accelerator, samsung, memory-hierarchy
Tail Latency [intermediate]

"Tail" refers to high-percentile latencies (p95/p99) - the slow end of the latency distribution. p95 means 95% of reques...

performance, sre, monitoring, latency
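Computing p95/p99 from a sample is a sort plus an index. The nearest-rank method below is one common convention (NumPy's `percentile`, for instance, defaults to linear interpolation instead):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value >= p% of the sample."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

latencies_ms = list(range(1, 101))   # 100 requests: 1 ms .. 100 ms
```

Here p95 is 95 ms: only 5 of the 100 requests were slower, and those few slow requests are what a tail-latency SLO guards against.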
Time To First Token (TTFT) [basic]

The wall-clock time from sending the request to the first byte/token arriving. TTFT is dominated by the prefill phase (t...

llm, inference, latency, prefill
Tokens Per Second (TPS) [basic]

Throughput metric measuring output tokens per second. For single stream: TPS ≈ 1/ITL. For multiple streams, scheduling a...

llm, throughput, performance, batching
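The single-stream identity TPS ≈ 1/ITL falls straight out of per-token timestamps (the steady 25 ms gaps below are made-up example data):

```python
def itl_and_tps(token_times):
    """token_times: arrival time in seconds of each output token."""
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps)      # mean inter-token latency
    return itl, 1.0 / itl            # single-stream tokens per second

itl, tps = itl_and_tps([0.0, 0.025, 0.050, 0.075, 0.100])
```

A steady 25 ms ITL yields 40 tokens/s; with batching, aggregate TPS can rise well past this even though each individual stream's ITL may get worse.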
Transformer Architecture [advanced]

Revolutionary neural network architecture based on the attention mechanism that processes sequences in parallel. Core in...

transformer, attention, self-attention, multi-head, encoder, decoder, llm, neural-networks
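The parallel-over-the-sequence property comes from the attention operation itself. A single-head, projection-free sketch in NumPy (real transformers add learned Q/K/V projections, multiple heads, and masking):

```python
import numpy as np

def self_attention(x):
    """x: (seq, d). Every position attends to every position in one matmul."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                    # (seq, seq) similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # row-wise softmax
    return w @ x                                     # (seq, d) mixed values

out = self_attention(np.random.default_rng(1).normal(size=(5, 16)))
```

All five positions are processed in the same pair of matrix multiplies, with no recurrence over the sequence, which is what makes training parallelizable.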
Vision Transformer (ViT) [advanced]

Vision Transformer (ViT) is a deep learning architecture that revolutionized computer vision by applying the transformer...

vision-transformer, attention, patches, self-attention, computer-vision, transformer
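The patch step that turns an image into transformer tokens is essentially a reshape; 224x224 inputs with 16x16 patches (the standard ViT-Base setting) give 196 tokens of 768 values each:

```python
import numpy as np

def patchify(img, patch=16):
    """img: (H, W, C) -> (num_patches, patch*patch*C) flattened patch tokens."""
    h, w, c = img.shape
    x = img.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)        # group the two patch-grid axes
    return x.reshape(-1, patch * patch * c)

tokens = patchify(np.zeros((224, 224, 3)))
```

Each flattened patch is then linearly projected and fed to a standard transformer encoder, treated just like a word embedding in an LLM.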
