
Cluster-Level Thinking — Scheduling, Placement, Isolation

SRE and platform engineering for ML training/serving clusters: resource allocation, gang scheduling, and system-level optimization

Level: Expert · Module: DatacenterArch · 110 min · 4 Exercises · 5 Tools · 4 Applications · 13 Min Read

Practical Exercises

  • GPU bin-packing algorithm implementation
  • Gang scheduling policy design for distributed training
  • Resource quota and priority system design
  • Cluster utilization optimization case study

Tools Required

  • Kubernetes
  • Slurm
  • Ray
  • Borg-like systems
  • Resource management tools

Real-World Applications

  • Large-scale ML training cluster management
  • Multi-tenant GPU sharing strategies
  • Datacenter resource optimization
  • Container orchestration for AI workloads

Cluster‑Level Thinking — Scheduling, Placement, Isolation

A Comprehensive Guide for SREs, Capacity Planners, and Platform Engineers


📋 Table of Contents

  • 1) Scheduling and Placement
  • 2) System Analogs & Tools
  • 3) Practical Policies
  • 4) Observability
  • Best Practices Summary
  • References

1) Scheduling and Placement

Pack vs. Spread Strategies

A pack strategy maximizes locality by placing related workloads on the same node or in close network proximity (intra‑node, same rack); a spread strategy distributes workloads across nodes, racks, or availability zones for fault tolerance and failure isolation.

Key Concepts:

  • Locality: Co‑locating communicating processes to minimize network latency and maximize cache efficiency
  • Failure isolation: Distributing workloads to prevent correlated failures from affecting multiple instances
  • Network topology awareness: Considering rack, switch, and datacenter boundaries in placement decisions

Implementation Examples:

Ray implements both strategies through placement groups: PACK places bundles onto as few nodes as possible (STRICT_PACK requires a single node), while SPREAD distributes bundles across distinct nodes on a best‑effort basis (STRICT_SPREAD makes this a hard requirement).

Pack Strategy Use Cases:

  • High‑communication ML training jobs
  • Distributed databases requiring low‑latency coordination
  • Microservices with frequent inter‑service calls

Spread Strategy Use Cases:

  • High‑availability web services
  • Batch processing jobs requiring fault tolerance
  • Multi‑tenant workloads requiring isolation
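
To make the two strategies concrete, the following Python sketch (hypothetical Node records, greedy single‑replica selection) scores candidate nodes in opposite directions for PACK and SPREAD; real schedulers layer topology, constraints, and preemption on top:

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    rack: str
    free_gpus: int
    running_replicas: int  # replicas of this workload already on the node

def place(nodes, gpus_needed, strategy):
    """Pick a node for one replica: PACK favors already-used nodes, SPREAD avoids them."""
    candidates = [n for n in nodes if n.free_gpus >= gpus_needed]
    if not candidates:
        return None
    if strategy == "PACK":
        # Prefer nodes that already host replicas (locality), then fuller nodes
        return max(candidates, key=lambda n: (n.running_replicas, -n.free_gpus))
    else:  # SPREAD
        # Prefer nodes with no replicas yet, then emptier nodes
        return min(candidates, key=lambda n: (n.running_replicas, -n.free_gpus))

nodes = [Node("gpu-01", "rack-a", 4, 1), Node("gpu-02", "rack-a", 8, 0), Node("gpu-03", "rack-b", 2, 0)]
print(place(nodes, 2, "PACK").name)    # gpu-01: co-locate with the existing replica
print(place(nodes, 2, "SPREAD").name)  # gpu-02: empty node with the most headroom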

Oversubscription with cgroups/quotas

Oversubscription lets the sum of allocated resources exceed physical capacity, relying on Linux cgroups for isolation and work‑conserving scheduling. It pays off because individual workloads rarely use their full allocation at the same time.

Key Concepts:

  • cgroups (Control Groups): Linux kernel feature for resource isolation and management
  • Work‑conserving: Scheduler property that never leaves resources idle if there are runnable tasks
  • Quota/Period model: Using cpu.cfs_quota_us and cpu.cfs_period_us to control bandwidth allocation over time windows

Technical Implementation:

# Example: Limit a cgroup to 50% of one CPU core
echo 50000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us    # 50ms quota
echo 100000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_period_us  # 100ms period

Benefits:

  • Increased cluster utilization (improvements of roughly 20‑30% are commonly reported)
  • Better resource sharing between production and batch workloads
  • Dynamic resource reclamation based on actual usage

Challenges:

  • Risk of performance degradation during load spikes
  • Complex capacity planning
  • Need for sophisticated monitoring and alerting
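
As a rough illustration of the quota/period arithmetic behind oversubscription, the Python sketch below (illustrative numbers, not a production policy) converts an oversubscription factor into per‑cgroup CFS quotas and shows why the sum of quotas can exceed physical capacity:

def cfs_quota_us(cores_granted: float, period_us: int = 100_000) -> int:
    """CPU bandwidth quota for a cgroup granted a fractional number of cores."""
    return int(cores_granted * period_us)

node_cores = 64
oversub_factor = 1.5            # sell 96 "cores" on a 64-core node
tenants = 12                    # equal-share tenants on this node

cores_per_tenant = node_cores * oversub_factor / tenants
print(cores_per_tenant)                 # 8.0 cores granted per tenant
print(cfs_quota_us(cores_per_tenant))   # 800000 us quota per 100000 us period
# The sum of quotas (9.6M us per period) exceeds real capacity (6.4M us per period):
# contention only appears when tenants burst simultaneously.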

Gang Scheduling

Gang scheduling provides "all‑or‑nothing" scheduling for distributed jobs requiring synchronized execution across multiple nodes, particularly critical for distributed ML training where all workers must start simultaneously.

Key Concepts:

  • All‑or‑nothing: Either all tasks in a job get scheduled simultaneously or none do
  • Distributed training coordination: Essential for frameworks like TensorFlow and PyTorch where training cannot proceed with partial worker sets
  • PodGroup: Abstraction used by Kubernetes batch schedulers (e.g., Volcano, the coscheduling plugin) that defines minimum requirements (minMember, minResources) for gang scheduling

Implementation Example:

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: training-job-pg
spec:
  minMember: 4        # Require all 4 workers
  minResources:
    cpu: "16"         # Total CPU requirement
    memory: "64Gi"    # Total memory requirement
    nvidia.com/gpu: "4"  # Total GPU requirement

Use Cases:

  • Distributed ML training (TensorFlow, PyTorch, Horovod)
  • MPI‑based HPC applications
  • Distributed databases requiring quorum
  • Multi‑stage pipeline jobs

Benefits:

  • Prevents resource deadlocks
  • Eliminates wasted resources from partial scheduling
  • Ensures consistent training performance
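
The all‑or‑nothing property described above can be captured in a few lines: a gang is admitted only if every bundle can be reserved in the same pass, otherwise nothing is reserved. A minimal Python sketch (hypothetical free‑GPU map) follows:

import copy

def try_gang_schedule(free_gpus_per_node, bundles):
    """All-or-nothing placement: return an assignment for every bundle or None."""
    free = copy.deepcopy(free_gpus_per_node)
    assignment = []
    for gpus in bundles:
        node = next((n for n, f in free.items() if f >= gpus), None)
        if node is None:
            return None          # one bundle cannot fit: reserve nothing
        free[node] -= gpus
        assignment.append(node)
    return assignment            # every bundle fits: commit atomically

free = {"node-a": 4, "node-b": 2}
print(try_gang_schedule(free, [2, 2, 2]))     # ['node-a', 'node-a', 'node-b']
print(try_gang_schedule(free, [2, 2, 2, 2]))  # None: only 6 GPUs free, job needs 8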

Bin‑packing GPUs and MIG

GPU bin‑packing prevents fragmentation through intelligent allocation, while Multi‑Instance GPU (MIG) partitions a single GPU into up to seven isolated instances, each with dedicated compute cores, memory, and cache.

Key Concepts:

  • GPU fragmentation: As MIG instances are created and destroyed, their physical position on the device constrains which other profiles can still be instantiated, leaving capacity stranded
  • MIG (Multi‑Instance GPU): NVIDIA technology providing hardware‑level isolation with guaranteed QoS and fault isolation between instances
  • Dedicated vs. shared allocation: MIG provides hardware‑partitioned isolation, whereas time‑slicing shares a whole GPU temporally

MIG Configuration Examples:

# Enable MIG mode on GPU 0
nvidia-smi -i 0 -mig 1
 
# Create two 3g.20gb GPU instances (profile ID 9) with their default compute instances
nvidia-smi mig -cgi 9,9 -C
 
# List the resulting GPU instances and compute instances
nvidia-smi mig -lgi
nvidia-smi mig -lci

Available MIG Profiles (A100‑40GB):

  • 1g.5gb: 1/7 GPU, 5GB memory
  • 2g.10gb: 2/7 GPU, 10GB memory
  • 3g.20gb: 3/7 GPU, 20GB memory
  • 7g.40gb: Full GPU, 40GB memory

Bin‑packing Strategies:

  • First Fit: Place the workload on the first GPU with sufficient free resources
  • Best Fit: Place the workload on the fitting GPU with the least leftover capacity (tightest fit)
  • Worst Fit: Place the workload on the GPU with the most remaining capacity, preserving large contiguous headroom (see the sketch below)
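
The Python sketch below contrasts the three heuristics on a toy model of per‑GPU free memory; production schedulers operate on MIG slices or whole devices, but the selection logic is the same:

def choose_gpu(free_mem, request, strategy):
    """Pick a GPU index for `request` GB using first/best/worst fit, or None."""
    fitting = [(i, m) for i, m in enumerate(free_mem) if m >= request]
    if not fitting:
        return None
    if strategy == "first":
        return fitting[0][0]
    if strategy == "best":   # tightest fit: least leftover capacity
        return min(fitting, key=lambda x: x[1] - request)[0]
    if strategy == "worst":  # loosest fit: keeps large contiguous headroom
        return max(fitting, key=lambda x: x[1] - request)[0]

free_mem = [30, 10, 20]                  # GB free on GPUs 0, 1, 2
print(choose_gpu(free_mem, 8, "first"))  # 0
print(choose_gpu(free_mem, 8, "best"))   # 1 (leaves only 2 GB stranded)
print(choose_gpu(free_mem, 8, "worst"))  # 0 (preserves the mid-sized GPUs)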

2) System Analogs & Tools

Borg Mental Models

Google's Borg pioneered cluster management concepts including hierarchical quotas, priority‑based preemption, admission control, and resource reclamation where unused reserved resources are allocated to lower‑priority jobs.

Core Concepts:

  • Quota allocation: Vector of resource quantities (CPU, RAM, disk) at given priority levels for time periods, used for admission control
  • Priority bands: Non‑overlapping priority levels (monitoring, production, batch, best effort) with preemption policies
  • Resource reclamation: Allocating over‑provisioned but unused resources to batch jobs, with about 20% of workload running on reclaimed resources

Borg Architecture:

Cell (Cluster)
├── Borgmaster (Control Plane)
│   ├── Scheduler
│   ├── State Manager
│   └── API Server
└── Borglets (Node Agents)
    ├── Task Management
    ├── Resource Monitoring
    └── Health Reporting

Key Innovations:

  • Declarative job specifications (BCL, the Borg Configuration Language)
  • Continuous scheduling (not batch‑based)
  • Resource estimation and reclamation
  • Priority‑driven preemption with cascading prevention
  • Fine‑grained resource isolation using Linux containers
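
A toy model of the priority‑band and preemption ideas (a simplification for intuition, not Borg's actual algorithm) is sketched below in Python: a new job may preempt strictly lower bands until enough capacity is freed:

PRIORITY = {"monitoring": 3, "production": 2, "batch": 1, "best-effort": 0}

def admit(node_capacity, running, new_job):
    """Admit new_job, preempting strictly lower-priority jobs if needed.
    Jobs are dicts: {"name", "band", "cpus"}. Returns preempted job names, or None."""
    used = sum(j["cpus"] for j in running)
    if node_capacity - used >= new_job["cpus"]:
        return []                                   # fits in free capacity
    preempted, freed = [], 0
    for j in sorted(running, key=lambda j: PRIORITY[j["band"]]):
        if PRIORITY[j["band"]] >= PRIORITY[new_job["band"]]:
            break                                   # never preempt an equal or higher band
        preempted.append(j["name"])
        freed += j["cpus"]
        if node_capacity - used + freed >= new_job["cpus"]:
            return preempted
    return None                                     # cannot make room

running = [{"name": "batch-1", "band": "batch", "cpus": 20},
           {"name": "web", "band": "production", "cpus": 30}]
print(admit(64, running, {"name": "train", "band": "production", "cpus": 25}))
# ['batch-1']: only 14 CPUs are free, so the batch job is preempted to make room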

Slurm Mechanics

Slurm uses partitions as job queues with attached QoS policies, GRES for GPU management, preemption based on priority hierarchies, job arrays for parallel execution, and fair‑share scheduling based on historical usage.

Key Components:

  • Partitions: Job queues imposing restrictions like job size limits, time limits, and user permissions
  • QoS (Quality of Service): Affects scheduling priority, preemption capability, and resource limits; can be assigned to jobs or partitions
  • GRES (Generic Resources): Configurable resource types like GPUs, with specific allocation and accounting
  • Fair‑share: Scheduling factor based on allocated shares versus consumed resources, promoting equitable access

Slurm Configuration Example:

# Partition configuration
PartitionName=gpu Nodes=gpu[01-10] MaxTime=72:00:00 DefaultTime=01:00:00
PartitionName=cpu Nodes=cpu[01-50] MaxTime=168:00:00 DefaultTime=01:00:00
 
# QoS configuration
sacctmgr create qos name=high priority=1000 MaxJobs=100 MaxWall=24:00:00
sacctmgr create qos name=normal priority=500 MaxJobs=500 MaxWall=72:00:00
sacctmgr create qos name=low priority=100 MaxJobs=1000 MaxWall=168:00:00
 
# GRES configuration
GresTypes=gpu
NodeName=gpu[01-10] Gres=gpu:tesla_v100:4

Job Submission Example:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=high
#SBATCH --gres=gpu:2
#SBATCH --time=04:00:00
#SBATCH --mem=32G
 
python train_model.py
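
For intuition on the fair‑share factor listed under Key Components: Slurm's classic multifactor formula is F = 2^(-U/S), where U is a user's normalized effective usage and S their normalized share. The Python sketch below reproduces that curve (simplified, ignoring usage decay and the Fair Tree algorithm):

def fairshare_factor(normalized_usage: float, normalized_shares: float) -> float:
    """Slurm-style classic fair-share factor: 1.0 when unused, 0.5 when usage == shares."""
    return 2 ** (-normalized_usage / normalized_shares)

# Two users, each entitled to 50% of the cluster
print(round(fairshare_factor(0.00, 0.5), 3))  # 1.0  : no usage yet, highest priority factor
print(round(fairshare_factor(0.50, 0.5), 3))  # 0.5  : used exactly their share
print(round(fairshare_factor(1.00, 0.5), 3))  # 0.25 : used twice their share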

Ray Framework

Ray provides placement groups for resource co‑location, actors for stateful long‑running processes, and autoscaler integration for dynamic cluster sizing.

Key Features:

  • Placement groups: Resource bundles with PACK/SPREAD strategies for coordinated scheduling
  • Actors: Long‑lived stateful processes that persist across multiple tasks
  • Autoscaler hooks: Integration points for dynamic node provisioning

Ray Placement Group Example:

import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
 
ray.init()
 
# Create a placement group for distributed training
pg = ray.util.placement_group([
    {"CPU": 2, "GPU": 1},  # Worker 1
    {"CPU": 2, "GPU": 1},  # Worker 2
    {"CPU": 2, "GPU": 1},  # Worker 3
    {"CPU": 1},            # Parameter server
], strategy="PACK")  # Co-locate bundles on as few nodes as possible
 
# Block until the whole group is reserved (all-or-nothing)
ray.get(pg.ready())
 
@ray.remote(num_cpus=2, num_gpus=1)
def train_worker(data_shard):
    # Training logic here
    return len(data_shard)
 
data_shards = [list(range(100)) for _ in range(3)]  # placeholder shards
 
# Schedule each worker into its bundle within the placement group
futures = [
    train_worker.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
            placement_group_bundle_index=i,
        )
    ).remote(data_shards[i])
    for i in range(3)
]
results = ray.get(futures)

3) Practical Policies

Classes of Service (Gold/Silver/Bronze)

Implement hierarchical service tiers with different SLOs and preemption policies to manage resource allocation and performance expectations.

Service Tier Definition:

Gold Tier:

  • SLO: 99.9% availability, under 100ms p99 latency
  • Resource guarantees: Reserved capacity, no preemption
  • Priority: Highest scheduling priority
  • Cost: Premium pricing

Silver Tier:

  • SLO: 99.5% availability, under 500ms p99 latency
  • Resource guarantees: Best‑effort with preemption protection from Bronze
  • Priority: Medium scheduling priority
  • Cost: Standard pricing

Bronze Tier:

  • SLO: 99% availability, under 2s p99 latency
  • Resource guarantees: Preemptible, runs on excess capacity
  • Priority: Lowest scheduling priority
  • Cost: Discounted pricing

Implementation in Kubernetes:

# Gold tier priority class
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gold-priority
value: 1000
globalDefault: false
description: "Gold tier - highest priority, no preemption"
---
# Silver tier priority class
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: silver-priority
value: 500
globalDefault: false
description: "Silver tier - medium priority"
---
# Bronze tier priority class
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: bronze-priority
value: 100
globalDefault: true
description: "Bronze tier - preemptible"

Node Labeling & Anti‑affinity

Use node labels and anti‑affinity rules to prevent resource conflicts and isolate different workload types.

Node Labeling Strategy:

# Label nodes by workload type
kubectl label nodes worker-01 workload-type=cpu-intensive
kubectl label nodes worker-02 workload-type=memory-intensive
kubectl label nodes worker-03 workload-type=storage-heavy
kubectl label nodes gpu-01 workload-type=gpu-compute
 
# Label nodes by performance tier
kubectl label nodes worker-01 performance-tier=high
kubectl label nodes worker-02 performance-tier=medium
kubectl label nodes worker-03 performance-tier=standard

Anti‑affinity Implementation:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-intensive-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cpu-intensive-app
  template:
    metadata:
      labels:
        app: cpu-intensive-app
    spec:
      containers:
      - name: app
        image: registry.example.com/cpu-intensive-app:latest  # placeholder image
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: workload-type
                operator: In
                values: ["cpu-intensive"]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["storage-heavy-app"]
              topologyKey: kubernetes.io/hostname

KV Cache Locality

Optimize key‑value cache performance by ensuring cache data stays local to processing units, particularly important for GPU‑accelerated serving workloads.

Implementation Strategies:

GPU Memory Pinning:

import torch
 
batch_size, seq_len, hidden_size = 8, 2048, 4096
 
# Keep the KV cache resident in GPU memory so lookups stay local to the device
kv_cache = torch.zeros(batch_size, seq_len, hidden_size, device="cuda")
 
# Pin the host-side staging buffer: pinned (page-locked) memory enables async
# host<->device copies; pin_memory() applies to CPU tensors, not CUDA tensors
host_buffer = torch.zeros(batch_size, seq_len, hidden_size).pin_memory()
 
# Use a dedicated CUDA stream so cache transfers overlap with compute
stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    kv_cache.copy_(host_buffer, non_blocking=True)
    # result = model.forward_with_cache(input_ids, kv_cache)  # model-specific call

Node‑local Cache Configuration:

apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  containers:
  - name: inference-server
    image: registry.example.com/inference-server:latest  # placeholder image
    resources:
      requests:
        nvidia.com/gpu: 1
        memory: "32Gi"
      limits:
        nvidia.com/gpu: 1
        memory: "32Gi"
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: "0"
    - name: KV_CACHE_SIZE_GB
      value: "16"
  nodeSelector:
    gpu-type: "a100"
  tolerations:
  - key: "dedicated-inference"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

4) Observability

Cluster Dashboards

Modern observability requires comprehensive dashboards tracking utilization, fragmentation metrics, queue depth, preemption counts, and SLO compliance with both real‑time and historical trend analysis.

Essential Metrics:

Resource Utilization:

  • CPU utilization (per node, per cluster)
  • Memory utilization and pressure
  • GPU utilization and memory usage
  • Network bandwidth and packet rates
  • Storage I/O and capacity

Scheduling Metrics:

  • Queue depth by priority/QoS level
  • Scheduling latency (time from submission to start)
  • Preemption frequency and duration
  • Resource fragmentation index
  • Pod pending time by reason

Performance Metrics:

  • Job completion time percentiles
  • Throughput (jobs/hour, tasks/second)
  • Resource efficiency (allocated vs. used)
  • Failure rates by workload type

Example Prometheus Queries:

# CPU utilization by node
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
 
# GPU utilization across the cluster (metric name depends on your exporter;
# this is the NVIDIA DCGM exporter's gauge)
avg(DCGM_FI_DEV_GPU_UTIL)
 
# Pod scheduling latency (kube-scheduler histogram)
histogram_quantile(0.95,
  rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m])
)
 
# Free-but-unrequested CPU per node; many nodes with small remainders indicate fragmentation
sum by (node) (kube_node_status_allocatable{resource="cpu"})
  - sum by (node) (kube_pod_container_resource_requests{resource="cpu"})

SLO Compliance & Monitoring

Service Level Objectives require continuous monitoring with error budget tracking, burn rate alerts, and both request‑based and time‑window‑based measurements.

SLO Framework Components:

  • SLI (Service Level Indicators): Quantitative measurements like latency percentiles, error rates, and availability
  • Error budget: Tolerable amount of system unavailability within compliance periods
  • Burn rate: Rate at which error budget is being consumed (a worked example follows below)
  • Tail latency: Focus on high percentiles (p95, p99) rather than just mean latency
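
Burn rate is simply the observed error rate divided by the rate the SLO allows. The short Python sketch below walks through the arithmetic for a 99.9% availability SLO over a 30‑day window (illustrative numbers only):

slo_target = 0.999                      # 99.9% availability SLO
error_budget = 1 - slo_target           # 0.1% of requests may fail per window
window_days = 30

observed_error_rate = 0.005             # 0.5% of requests currently failing
burn_rate = observed_error_rate / error_budget
print(burn_rate)                        # 5.0: the budget is burning 5x faster than allowed

# At this burn rate the entire 30-day budget is exhausted in 6 days
days_to_exhaustion = window_days / burn_rate
print(days_to_exhaustion)               # 6.0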

SLO Implementation Example:

# Prometheus recording rules for SLI calculation
groups:
- name: sli_rules
  rules:
  - record: sli:request_latency_99p
    expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

  - record: sli:error_rate
    expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

  - record: sli:availability_5m
    expr: 1 - sli:error_rate

# SLO alerting rules
- name: slo_alerts
  rules:
  - alert: HighErrorBudgetBurnRate
    expr: sli:error_rate > 0.001  # 0.1% error rate threshold
    for: 5m
    labels:
      severity: warning
      slo_type: availability
    annotations:
      summary: "High error budget burn rate detected"
      description: "Error rate is {{ $value | humanizePercentage }} over 5 minutes"

Admission Control

Admission control systems validate resource requests before scheduling, enforce quotas and policies, and provide headroom estimates to prevent oversubscription.

Key Functions:

  • Headroom estimation: Predicting available capacity for new workloads (sketched below)
  • Resource validation: Checking if requested resources are available and policy‑compliant
  • Quota enforcement: Ensuring resource usage stays within defined limits
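
A minimal headroom estimate, assuming requests‑based accounting and a fixed safety buffer (both assumptions for illustration), might look like the Python sketch below; real admission controllers also account for pending pods and system overhead:

def headroom(allocatable: dict, requested: dict, buffer_fraction: float = 0.1) -> dict:
    """Remaining schedulable capacity per resource after a safety buffer."""
    return {r: allocatable[r] * (1 - buffer_fraction) - requested.get(r, 0)
            for r in allocatable}

def admits(request: dict, allocatable: dict, requested: dict) -> bool:
    free = headroom(allocatable, requested)
    return all(free.get(r, 0) >= qty for r, qty in request.items())

allocatable = {"cpu": 128, "memory_gib": 512, "gpu": 8}
requested   = {"cpu": 100, "memory_gib": 300, "gpu": 6}
print(headroom(allocatable, requested))                       # {'cpu': 15.2, 'memory_gib': 160.8, 'gpu': 1.2}
print(admits({"cpu": 8, "gpu": 1}, allocatable, requested))   # True
print(admits({"cpu": 8, "gpu": 2}, allocatable, requested))   # False: only 1.2 GPUs of headroom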

Implementation in Kubernetes:

# Resource quota per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team‑a‑quota
  namespace: team‑a
spec:
  hard:
    requests.cpu: "100"
    requests.memory: "200Gi" 
    requests.nvidia.com/gpu: "10"
    limits.cpu: "200"
    limits.memory: "400Gi"
    pods: "50"
 
‑‑‑
# Limit range for default/max values
apiVersion: v1
kind: LimitRange
metadata:
  name: team‑a‑limits
  namespace: team‑a
spec:
  limits:
  ‑ default:
      cpu: "2"
      memory: "4Gi"
    defaultRequest:
      cpu: "100m" 
      memory: "128Mi"
    max:
      cpu: "16"
      memory: "32Gi"
    type: Container

Admission Controller for Custom Policies:

// Example admission webhook handler (sketch): validates GPU requests against a
// fragmentation policy. getGPURequests and willCauseFragmentation are assumed to be
// implemented elsewhere; imports: admission/v1, core/v1, meta/v1, encoding/json.
func (w *GPUAdmissionWebhook) validateGPURequest(req *v1.AdmissionRequest) *v1.AdmissionResponse {
    pod := &corev1.Pod{}
    if err := json.Unmarshal(req.Object.Raw, pod); err != nil {
        return &v1.AdmissionResponse{
            Allowed: false,
            Result:  &metav1.Status{Message: err.Error()},
        }
    }

    // Reject requests whose placement would strand GPU capacity
    gpuRequests := getGPURequests(pod)
    if willCauseFragmentation(gpuRequests) {
        return &v1.AdmissionResponse{
            Allowed: false,
            Result: &metav1.Status{
                Message: "GPU request would cause excessive fragmentation",
            },
        }
    }

    return &v1.AdmissionResponse{Allowed: true}
}

Best Practices Summary

Scheduling Optimization

  1. Use gang scheduling for distributed workloads to prevent resource deadlocks
  2. Implement tiered service classes with appropriate preemption policies
  3. Leverage MIG for GPU sharing when full GPU isolation isn't required
  4. Monitor and alert on resource fragmentation to maintain scheduling efficiency

Resource Management

  1. Enable oversubscription carefully with proper monitoring and limits
  2. Use work‑conserving schedulers to maximize utilization
  3. Implement resource reclamation for unused reserved capacity
  4. Maintain headroom for burst capacity and emergency workloads

Observability & SLOs

  1. Define meaningful SLIs that reflect user experience
  2. Monitor error budgets and burn rates for proactive incident response
  3. Use tail latency metrics (p95, p99) rather than just averages
  4. Implement comprehensive dashboards covering utilization, performance, and reliability

Operational Excellence

  1. Test preemption and failure scenarios regularly
  2. Automate capacity planning based on historical trends and growth projections
  3. Implement gradual rollouts for scheduler configuration changes
  4. Maintain runbooks for common scheduling and resource issues

References

  1. Ray Project Documentation, "Ray Core Scheduling"
  2. NVIDIA, "Multi-Instance GPU (MIG) User Guide"
  3. Abhishek Verma et al., "Large-scale cluster management at Google with Borg," EuroSys 2015
  4. Slurm Workload Manager, Official Documentation
  5. Kubernetes Documentation, "Pod Quality of Service Classes"
  6. Linux Kernel Documentation, "Control Group v2"
  7. Hongzi Mao et al., "Learning scheduling algorithms for data processing clusters," SIGCOMM 2019
  8. "Scheduling Parallel-Task Jobs Subject to Packing and Placement Constraints" (gang scheduling research)
  9. "A survey of Kubernetes scheduling algorithms" (cluster scheduling survey)
  10. Google, "Site Reliability Engineering" (SRE Handbook)

This guide provides comprehensive coverage of cluster‑level thinking for modern distributed systems. For the latest updates and additional resources, refer to the project documentation and research papers cited above.

#scheduling#placement#isolation#cluster#GPU#MIG#resource-management