Technical Glossary
Comprehensive definitions of computer architecture, AI systems, and performance optimization concepts. Hover over any term to explore detailed technical explanations.
Contrastive Language–Image Pre-training (OpenAI, 2021) trains a vision encoder and a text encoder together so that match...
Dynamic batching strategy that continuously admits new requests into active batches as tokens complete, rather than wait...
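The admission policy above can be sketched in a few lines. This is a toy simulation, not any serving framework's actual scheduler: request IDs, the `max_batch` parameter, and the one-token-per-step model are all illustrative assumptions.

```python
from collections import deque

def continuous_batching(pending, max_batch):
    """Toy scheduler: a new request is admitted as soon as a batch slot
    frees up, instead of waiting for the whole batch to drain."""
    queue = deque(pending)   # waiting requests: (request_id, tokens_to_generate)
    active = {}              # request_id -> tokens still to generate
    finished = []            # request ids in completion order
    while queue or active:
        # Admit waiting requests into any free batch slots.
        while queue and len(active) < max_batch:
            rid, remaining = queue.popleft()
            active[rid] = remaining
        # One decode step: every active request emits one token.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                finished.append(rid)
    return finished
```

Note how request "c" starts decoding as soon as the short request "b" finishes, rather than waiting for "a" to drain as well.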
The second phase of LLM inference that generates output one token at a time. This stage repeatedly reads the KV cache cr...
Specialized parallel computing architecture optimized for throughput over latency. Core principle: thousands of lightwei...
Many-to-one communication pattern where multiple senders simultaneously transmit to a single receiver, creating synchron...
Also called Time-Per-Output-Token (TPOT). The average time between successive output tokens during decoding. Lower ITL m...
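A minimal sketch of how ITL is computed from per-token arrival timestamps (the timestamp list is a hypothetical input; real benchmarks record these from the streaming response):

```python
def inter_token_latency(arrival_times):
    """Average gap between successive output-token timestamps (ITL / TPOT).
    `arrival_times` is a sorted list of per-token arrival times in seconds."""
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    return sum(gaps) / len(gaps)
```

For a single stream, throughput follows directly: TPS ≈ 1 / ITL.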
Key-Value cache that stores previously computed K (key) and V (value) tensors from attention layers. During decode, the ...
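Because the cache stores one K and one V vector per layer per token, its size grows linearly with sequence length. A back-of-the-envelope sketch (the model shape below is a hypothetical 7B-class configuration, used only to show the arithmetic):

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache added per generated token: one K and one V vector
    per layer (the factor of 2), each n_kv_heads * head_dim elements."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# e.g. 32 layers, 32 KV heads, head_dim 128, fp16 -> 512 KiB per token,
# so a 4096-token context holds 2 GiB of KV cache for one sequence.
```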
Hierarchical storage architecture designed to bridge the speed gap between fast processors and slow main memory. Exploit...
Short-duration traffic spikes where arrival rate temporarily exceeds link capacity. Typically last microseconds to milli...
Memory management technique that manages KV cache in fixed-size blocks (16-32 tokens) with a page table mapping logical ...
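The logical-to-physical mapping can be illustrated with a toy allocator. This is a sketch of the page-table idea only, not any framework's real block manager; class and method names are invented for illustration:

```python
class PagedKV:
    """Toy paged KV allocator: each sequence's logical token positions map
    to fixed-size physical blocks through a per-sequence page table."""
    def __init__(self, block_size=16):
        self.block_size = block_size
        self.next_block = 0
        self.tables = {}    # sequence id -> list of physical block ids
        self.lengths = {}   # sequence id -> tokens stored so far

    def append_token(self, seq):
        length = self.lengths.get(seq, 0)
        if length % self.block_size == 0:   # current block is full (or first token)
            self.tables.setdefault(seq, []).append(self.next_block)
            self.next_block += 1
        self.lengths[seq] = length + 1

    def physical_slot(self, seq, pos):
        """Translate a logical token position to (physical block, offset)."""
        block = self.tables[seq][pos // self.block_size]
        return block, pos % self.block_size
```

Blocks are allocated on demand, so a sequence never reserves more than one partially filled block — the source of the memory savings over contiguous preallocation.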
A mathematical model where job arrivals occur randomly and independently, with the time between arrivals following an ex...
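Sampling such a process is a one-liner per arrival, since the inter-arrival gaps are i.i.d. exponential. A minimal sketch (function name and seeding are illustrative):

```python
import random

def poisson_arrivals(rate, n, seed=0):
    """Arrival times of a Poisson process with the given rate (events/sec):
    gaps between arrivals are i.i.d. Exponential(rate)."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate)   # mean gap = 1 / rate
        times.append(t)
    return times
```

This is the standard open-loop load model for LLM-serving benchmarks: with `rate=2.0`, requests arrive on average every 0.5 s, but the gaps vary, producing realistic bursts.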
The first phase of LLM inference where the entire prompt (N tokens) is processed in parallel. This is compute-bound (lar...
Specialized cache architecture that reduces miss penalties using predetermined access patterns. Features temporal decoup...
Quantitative target for service reliability/performance. For LLM serving, typically includes TTFT and ITL thresholds at ...
Mathematical optimization technique developed by Samsung for Exynos 2400 NPU to solve memory hierarchy problems. Creates...
"Tail" refers to high-percentile latencies (p95/p99) - the slow end of the latency distribution. p95 means 95% of reques...
The wall-clock time from sending the request to the first byte/token arriving. TTFT is dominated by the prefill phase (t...
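Measuring TTFT from a streaming response can be sketched as follows; the token-iterator interface here is a hypothetical stand-in for whatever streaming API a client uses:

```python
import time

def measure_ttft(token_stream):
    """Wall-clock seconds from issuing the request until the first token
    arrives. `token_stream` is any iterator yielding tokens."""
    start = time.perf_counter()
    first_token = next(iter(token_stream))
    return time.perf_counter() - start, first_token
```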
Throughput metric measuring output tokens per second. For single stream: TPS ≈ 1/ITL. For multiple streams, scheduling a...
Neural network architecture based on the attention mechanism that processes sequences in parallel. Core in...
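The attention mechanism at the heart of the architecture is scaled dot-product attention: softmax(QKᵀ/√d)·V. A single-head, pure-Python sketch (list-of-lists vectors, no batching — purely for clarity, not efficiency):

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention for one head.
    Q, K, V are lists of vectors (lists of floats); K and V have equal length."""
    d = len(Q[0])
    out = []
    for q in Q:
        # Similarity of the query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

A query closely aligned with one key pulls its output almost entirely from that key's value vector.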
Vision Transformer (ViT) is a deep learning architecture that revolutionized computer vision by applying the transformer...
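The first step of a ViT is splitting the image into fixed-size patches, each flattened into a vector before linear embedding. A minimal sketch on a single-channel image represented as a list of rows (real implementations operate on multi-channel tensors):

```python
def image_to_patches(img, patch):
    """Split an H x W image (list of rows) into non-overlapping
    patch x patch tiles, each flattened row-major -- the token sequence
    a ViT feeds into its transformer encoder."""
    H, W = len(img), len(img[0])
    patches = []
    for r in range(0, H, patch):
        for c in range(0, W, patch):
            patches.append([img[r + i][c + j]
                            for i in range(patch) for j in range(patch)])
    return patches
```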