Expert Modules
Deep-dive technical modules covering system architecture, performance analysis, and AI infrastructure.
Advanced GPU Architecture for ML
Deep dive into modern GPU architectures optimized for machine learning, from the latest datacenter GPUs to next-generation designs
AI Hardware Simulation & Modeling
Develop high-fidelity simulators and performance models for evaluating next-generation AI accelerator architectures
Cluster-Level Thinking — Scheduling, Placement, Isolation
SRE and platform engineering for ML training/serving clusters: resource allocation, gang scheduling, and system-level optimization (see the scheduling sketch below)
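A minimal sketch of the gang-scheduling constraint this module covers: a job is admitted only if every GPU it requests can be allocated at the same time, otherwise it stays queued. The cluster state, job shape, and greedy placement policy below are illustrative assumptions, not any particular scheduler's API.

```python
# Toy gang scheduler: admit a job only if ALL requested GPUs are free at once.
# Node capacities and job shapes below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpus_needed: int   # total GPUs required, all-or-nothing

def try_gang_schedule(job: Job, free_gpus_per_node: dict[str, int]) -> dict[str, int] | None:
    """Return a {node: gpus} placement covering the whole gang, or None to keep the job queued."""
    placement, remaining = {}, job.gpus_needed
    # Greedy packing over the nodes with the most free capacity (placement policy is a simplification).
    for node, free in sorted(free_gpus_per_node.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            placement[node] = take
            remaining -= take
    return placement if remaining == 0 else None  # partial placements are rejected

cluster = {"node-0": 8, "node-1": 6, "node-2": 3}
print(try_gang_schedule(Job("llm-pretrain", 16), cluster))  # fits: 8 + 6 + 2
print(try_gang_schedule(Job("big-job", 32), cluster))       # None: stays queued, no partial start
```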
Deep Learning ASIC Architecture
Master the design principles of custom AI accelerators, from tensor processing units to emerging neuromorphic architectures
Interconnect Fabrics for AI Systems
Design and optimization of high-performance interconnects for distributed AI training and inference systems
ML Systems in Datacenters — LLM Inference Realities
TTFT vs. tokens/s optimization, batching strategies, KV-cache memory management, PagedAttention/vLLM impact, and practical serving tactics
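As a feel for the KV-cache pressure this module analyzes, here is a back-of-the-envelope sizing sketch; the model dimensions are assumptions roughly in the shape of a 13B-parameter decoder, not figures for any specific model.

```python
# Back-of-the-envelope KV-cache sizing for a decoder-only transformer:
# 2 tensors (K and V) per layer, each [seq_len, n_kv_heads * head_dim], per sequence.
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes

# Illustrative 13B-class decoder: 40 layers, 40 KV heads, head_dim 128, fp16 cache.
per_seq = kv_cache_bytes(batch=1, seq_len=4096, n_layers=40, n_kv_heads=40, head_dim=128)
print(f"KV cache per 4k-token sequence: {per_seq / 2**30:.2f} GiB")
print(f"Batch of 32 sequences: {32 * per_seq / 2**30:.1f} GiB")
```

Under these assumptions a single 4k-token sequence holds roughly 3 GiB of cache and a batch of 32 holds about 100 GiB, which is why paged KV-cache allocation and batching policy dominate serving capacity.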
Modeling & Simulation
Strategic simulation methodology: choose the right simulation paradigm and fidelity level, ask targeted questions, and validate against reality
Multi-Node AI Training Systems
Master the design and optimization of distributed AI training systems across hundreds of nodes and GPUs
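A quick sketch of the communication-cost estimate this kind of module works through: a ring all-reduce moves roughly 2(N-1)/N times the gradient buffer per worker per step. The model size and link bandwidth below are assumptions, and the model ignores latency and compute/communication overlap.

```python
# Estimate per-step all-reduce time for data-parallel training with a ring algorithm.
def ring_allreduce_seconds(param_bytes: float, n_workers: int, link_gbps: float) -> float:
    # Each worker sends and receives ~2*(N-1)/N of the buffer; bandwidth-only model,
    # latency and overlap with compute are ignored (deliberate simplification).
    traffic_bytes = 2 * (n_workers - 1) / n_workers * param_bytes
    return traffic_bytes / (link_gbps * 1e9 / 8)

# Illustrative numbers: 7B params, fp16 gradients (~14 GB), 64 workers, 400 Gb/s per GPU.
grad_bytes = 7e9 * 2
print(f"~{ring_allreduce_seconds(grad_bytes, 64, 400):.2f} s per all-reduce")
```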
Multimodal Foundation Models: Architecture & System Design
Comprehensive analysis of multimodal foundation model architectures, training methodologies, and system engineering challenges for vision-language AI systems
Power & Thermal Awareness — From Activity to perf/W
Translate simulated activity into power/thermal behavior and communicate perf/W trade-offs credibly using McPAT and HotSpot
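A minimal sketch of the activity-to-power translation involved, far simpler than what McPAT models but the same shape of calculation: per-event energies, activity counts, and the static term below are illustrative assumptions, not calibrated values.

```python
# Convert simulated activity counts into a rough power estimate:
# P = sum(events/sec * energy/event) + static/leakage term.
ENERGY_PJ = {            # illustrative per-event energies in picojoules (assumptions)
    "fp16_fma": 1.5,
    "sram_read_64B": 25.0,
    "dram_read_64B": 2000.0,
}

def power_watts(event_rates_per_s: dict[str, float], static_w: float = 40.0) -> float:
    dynamic = sum(rate * ENERGY_PJ[ev] * 1e-12 for ev, rate in event_rates_per_s.items())
    return dynamic + static_w

activity = {"fp16_fma": 100e12, "sram_read_64B": 5e12, "dram_read_64B": 0.05e12}
p = power_watts(activity)
perf_per_watt = 2 * activity["fp16_fma"] / p   # FLOP/s per watt (2 FLOPs per FMA)
print(f"{p:.0f} W, {perf_per_watt / 1e9:.0f} GFLOP/s per W")
```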
PPA Analysis Methodologies
Master Performance, Power, and Area analysis techniques for evaluating hardware design trade-offs in AI accelerators
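A small sketch of the derived metrics a PPA comparison typically reports, perf/W, perf/mm², and energy-delay product, for two hypothetical design points; all numbers are made-up assumptions for illustration.

```python
# Compare two hypothetical accelerator design points on common PPA-derived metrics.
designs = {
    # name: (throughput TOPS, power W, area mm^2) -- illustrative numbers only
    "wide_low_clock":    (400.0, 250.0, 600.0),
    "narrow_high_clock": (400.0, 320.0, 450.0),
}

for name, (tops, watts, mm2) in designs.items():
    perf_per_w   = tops / watts            # TOPS/W
    perf_per_mm2 = tops / mm2              # TOPS/mm^2
    edp = (watts / tops) * (1.0 / tops)    # energy-delay product, arbitrary units (lower is better)
    print(f"{name:18s}  {perf_per_w:5.2f} TOPS/W  {perf_per_mm2:5.2f} TOPS/mm^2  EDP={edp:.2e}")
```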
System & Microarchitecture Deep Dive
End-to-end reasoning about compute and data-movement pathologies, with evidence-based fixes for CPU pipelines, GPU occupancy, and memory hierarchies
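As one concrete instance of the occupancy reasoning covered here, a simplified estimate of how many warps can stay resident on an SM given a kernel's register and shared-memory footprint; the per-SM limits are typical values assumed for illustration, not any specific GPU's datasheet.

```python
# Simplified GPU occupancy estimate: resident warps limited by registers,
# shared memory, and the hardware warp cap. Per-SM limits are assumed values.
import math

SM_REGISTERS = 65536   # 32-bit registers per SM (assumption)
SM_SHARED_KB = 100     # shared memory per SM in KiB (assumption)
SM_MAX_WARPS = 64      # max resident warps per SM (assumption)
WARP_SIZE    = 32

def occupancy(threads_per_block: int, regs_per_thread: int, smem_per_block_kb: float) -> float:
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    blocks_by_regs  = SM_REGISTERS // (regs_per_thread * threads_per_block)
    blocks_by_smem  = int(SM_SHARED_KB // smem_per_block_kb) if smem_per_block_kb else 10**9
    blocks_by_warps = SM_MAX_WARPS // warps_per_block
    resident_warps  = min(blocks_by_regs, blocks_by_smem, blocks_by_warps) * warps_per_block
    return resident_warps / SM_MAX_WARPS

# A register-heavy kernel: 256 threads/block, 96 regs/thread, 32 KiB shared memory/block.
print(f"occupancy ~ {occupancy(256, 96, 32):.0%}")   # the register file is the limiter here
```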
Tail Latency & Scale-Out — p95/p99/p99.9 Engineering
Design for tails, not means: queueing theory, amplification effects, and tail-tolerant distributed system patterns
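A tiny worked example of the amplification effect named here: if a request fans out to N backends and must wait for all of them, the chance it experiences at least one backend's tail grows as 1 - p^N, where p is the target percentile. The fan-out sizes are illustrative.

```python
# Fan-out tail amplification: P(at least one of N parallel calls is "slow"),
# where each call independently exceeds its own p-th percentile with probability 1 - p.
def prob_hits_tail(n_fanout: int, percentile: float = 0.99) -> float:
    return 1.0 - percentile ** n_fanout

for n in (1, 10, 100):
    print(f"fan-out {n:3d}: {prob_hits_tail(n):.1%} of requests see at least one p99-slow call")
# At fan-out 100, ~63% of requests are gated by some backend's 1% tail, which is
# why hedged requests and per-shard deadlines matter at scale.
```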
Tools & Methods: Top-Down, CDRD, and Roofline
Turn counters and simple models into clear diagnoses and action items using systematic performance analysis methodologies
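A minimal sketch of the roofline bound this module builds on: attainable throughput is min(peak compute, arithmetic intensity × memory bandwidth). The peak and bandwidth figures, and the sample kernels, are assumptions for illustration.

```python
# Roofline model: attainable FLOP/s = min(peak_flops, arithmetic_intensity * mem_bandwidth).
PEAK_FLOPS = 100e12   # assumed peak compute, FLOP/s
MEM_BW     = 2e12     # assumed memory bandwidth, bytes/s

def attainable_flops(arithmetic_intensity_flops_per_byte: float) -> float:
    return min(PEAK_FLOPS, arithmetic_intensity_flops_per_byte * MEM_BW)

for name, ai in [("streaming vector add", 0.25), ("small GEMM", 10.0), ("large GEMM", 200.0)]:
    frac = attainable_flops(ai) / PEAK_FLOPS
    bound = "memory-bound" if ai < PEAK_FLOPS / MEM_BW else "compute-bound"
    print(f"{name:20s} AI={ai:6.2f} FLOP/B -> {frac:5.1%} of peak ({bound})")
```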
Transformer Hardware Optimization
Deep dive into optimizing hardware architectures for transformer-based models, from attention mechanisms to large language model inference
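To ground the inference side of this, a rough per-token arithmetic-intensity estimate for single-stream decode, which is what makes decode memory-bandwidth-bound on most accelerators; the parameter count, KV-cache size, and hardware figures are assumptions.

```python
# Why single-stream LLM decode is memory-bound: per generated token, the full weight
# set (plus KV cache) is read once, but only ~2 FLOPs are done per parameter.
PARAMS      = 13e9     # assumed model size
DTYPE_BYTES = 2        # fp16 weights
KV_BYTES    = 3.2e9    # assumed resident KV cache for the active context

flops_per_token = 2 * PARAMS
bytes_per_token = PARAMS * DTYPE_BYTES + KV_BYTES
ai = flops_per_token / bytes_per_token          # FLOP per byte moved

# Assumed accelerator: 1000 TFLOP/s peak, 3 TB/s HBM bandwidth -> ridge point near 333 FLOP/B.
print(f"decode arithmetic intensity ~ {ai:.2f} FLOP/B (far below the ~333 FLOP/B ridge)")
print(f"bandwidth-limited ceiling ~ {3e12 / bytes_per_token:.0f} tokens/s per stream")
```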
Validation & Measurement — Trust, But Verify
Cross-validate models against real hardware counters, quantify uncertainty, and communicate model limits in performance analysis
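A small sketch of the cross-validation step this module asks for: compare modeled counters against repeated measurements and report relative error alongside run-to-run spread, so model limits are stated with numbers. The counter values below are made-up examples.

```python
# Compare a model's predictions against repeated hardware measurements:
# report relative error plus the spread of the measurements themselves.
from statistics import mean, stdev

model_pred = {"dram_reads": 1.20e9, "l2_misses": 3.4e8, "cycles": 9.0e9}   # illustrative
measured = {                                                               # illustrative runs
    "dram_reads": [1.31e9, 1.29e9, 1.33e9],
    "l2_misses":  [3.0e8, 3.1e8, 2.9e8],
    "cycles":     [9.6e9, 9.5e9, 9.8e9],
}

for counter, runs in measured.items():
    m, s = mean(runs), stdev(runs)
    rel_err = (model_pred[counter] - m) / m
    noise   = s / m
    print(f"{counter:10s} model-vs-measured error {rel_err:+6.1%}, run-to-run noise +/-{noise:.1%}")
```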