
Cluster-Level Thinking — Scheduling, Placement, Isolation

SRE and platform engineering for ML training/serving clusters: resource allocation, gang scheduling, and system-level optimization

Level: Expert · Module: DatacenterArch · 110 min · 4 Exercises · 5 Tools · 4 Applications · 13 Min Read

Practical Exercises

  • GPU bin-packing algorithm implementation
  • Gang scheduling policy design for distributed training
  • Resource quota and priority system design
  • Cluster utilization optimization case study

Tools Required

  • Kubernetes
  • Slurm
  • Ray
  • Borg-like systems
  • Resource management tools

Real-World Applications

  • Large-scale ML training cluster management
  • Multi-tenant GPU sharing strategies
  • Datacenter resource optimization
  • Container orchestration for AI workloads

Cluster‑Level Thinking — Scheduling, Placement, Isolation

A Comprehensive Guide for SREs, Capacity Planners, and Platform Engineers


📋 Table of Contents

  • 1) Scheduling and Placement
  • 2) System Analogs & Tools
  • 3) Practical Policies
  • 4) Observability
  • Best Practices Summary
  • References

1) Scheduling and Placement

Pack vs. Spread Strategies

A pack strategy maximizes locality by placing related workloads on the same node or in close network proximity (intra‑node, same rack); a spread strategy distributes workloads across nodes, racks, or availability zones for fault tolerance and failure isolation.

Key Concepts:

  • Locality: Co‑locating communicating processes to minimize network latency and maximize cache efficiency
  • Failure isolation: Distributing workloads to prevent correlated failures from affecting multiple instances
  • Network topology awareness: Considering rack, switch, and datacenter boundaries in placement decisions

Implementation Examples:

Ray implements both strategies through placement groups: PACK places bundles onto as few nodes as possible (STRICT_PACK requires a single node), while SPREAD distributes bundles across distinct nodes on a best‑effort basis (STRICT_SPREAD makes this a hard requirement).

Pack Strategy Use Cases:

  • High‑communication ML training jobs
  • Distributed databases requiring low‑latency coordination
  • Microservices with frequent inter‑service calls

Spread Strategy Use Cases:

  • High‑availability web services
  • Batch processing jobs requiring fault tolerance
  • Multi‑tenant workloads requiring isolation
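
To make the two strategies concrete, the following Python sketch (hypothetical Node records, greedy single‑replica selection) scores candidate nodes in opposite directions for PACK and SPREAD; real schedulers layer topology, constraints, and preemption on top:

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    rack: str
    free_gpus: int
    running_replicas: int  # replicas of this workload already on the node

def place(nodes, gpus_needed, strategy):
    """Pick a node for one replica: PACK favors already-used nodes, SPREAD avoids them."""
    candidates = [n for n in nodes if n.free_gpus >= gpus_needed]
    if not candidates:
        return None
    if strategy == "PACK":
        # Prefer nodes that already host replicas (locality), then fuller nodes
        return max(candidates, key=lambda n: (n.running_replicas, -n.free_gpus))
    else:  # SPREAD
        # Prefer nodes with no replicas yet, then emptier nodes
        return min(candidates, key=lambda n: (n.running_replicas, -n.free_gpus))

nodes = [Node("gpu-01", "rack-a", 4, 1), Node("gpu-02", "rack-a", 8, 0), Node("gpu-03", "rack-b", 2, 0)]
print(place(nodes, 2, "PACK").name)    # gpu-01: co-locate with the existing replica
print(place(nodes, 2, "SPREAD").name)  # gpu-02: empty node with the most headroom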

Oversubscription with cgroups/quotas

Oversubscription lets the sum of allocated resources exceed physical capacity, relying on Linux cgroups for isolation and work‑conserving scheduling. It pays off because individual workloads rarely use their full allocation at the same time.

Key Concepts:

  • cgroups (Control Groups): Linux kernel feature for resource isolation and management
  • Work‑conserving: Scheduler property that never leaves resources idle if there are runnable tasks
  • Quota/Period model: Using cpu.cfs_quota_us and cpu.cfs_period_us to control bandwidth allocation over time windows

Technical Implementation:

# Example: Limit a cgroup to 50% of one CPU core
echo 50000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_quota_us    # 50ms quota
echo 100000 > /sys/fs/cgroup/cpu/mygroup/cpu.cfs_period_us  # 100ms period

Benefits:

  • Increased cluster utilization (improvements of roughly 20‑30% are commonly reported)
  • Better resource sharing between production and batch workloads
  • Dynamic resource reclamation based on actual usage

Challenges:

  • Risk of performance degradation during load spikes
  • Complex capacity planning
  • Need for sophisticated monitoring and alerting
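
As a rough illustration of the quota/period arithmetic behind oversubscription, the Python sketch below (illustrative numbers, not a production policy) converts an oversubscription factor into per‑cgroup CFS quotas and shows why the sum of quotas can exceed physical capacity:

def cfs_quota_us(cores_granted: float, period_us: int = 100_000) -> int:
    """CPU bandwidth quota for a cgroup granted a fractional number of cores."""
    return int(cores_granted * period_us)

node_cores = 64
oversub_factor = 1.5            # sell 96 "cores" on a 64-core node
tenants = 12                    # equal-share tenants on this node

cores_per_tenant = node_cores * oversub_factor / tenants
print(cores_per_tenant)                 # 8.0 cores granted per tenant
print(cfs_quota_us(cores_per_tenant))   # 800000 us quota per 100000 us period
# The sum of quotas (9.6M us per period) exceeds real capacity (6.4M us per period):
# contention only appears when tenants burst simultaneously.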

Gang Scheduling

Gang scheduling provides "all‑or‑nothing" scheduling for distributed jobs requiring synchronized execution across multiple nodes, particularly critical for distributed ML training where all workers must start simultaneously.

Key Concepts:

  • All‑or‑nothing: Either all tasks in a job get scheduled simultaneously or none do
  • Distributed training coordination: Essential for frameworks like TensorFlow and PyTorch where training cannot proceed with partial worker sets
  • PodGroup: Abstraction used by Kubernetes batch schedulers (e.g., Volcano, the coscheduling plugin) that defines minimum requirements (minMember, minResources) for gang scheduling

Implementation Example:

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: training-job-pg
spec:
  minMember: 4        # Require all 4 workers
  minResources:
    cpu: "16"         # Total CPU requirement
    memory: "64Gi"    # Total memory requirement
    nvidia.com/gpu: "4"  # Total GPU requirement

Use Cases:

  • Distributed ML training (TensorFlow, PyTorch, Horovod)
  • MPI‑based HPC applications
  • Distributed databases requiring quorum
  • Multi‑stage pipeline jobs

Benefits:

  • Prevents resource deadlocks
  • Eliminates wasted resources from partial scheduling
  • Ensures consistent training performance
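
The all‑or‑nothing property described above can be captured in a few lines: a gang is admitted only if every bundle can be reserved in the same pass, otherwise nothing is reserved. A minimal Python sketch (hypothetical free‑GPU map) follows:

import copy

def try_gang_schedule(free_gpus_per_node, bundles):
    """All-or-nothing placement: return an assignment for every bundle or None."""
    free = copy.deepcopy(free_gpus_per_node)
    assignment = []
    for gpus in bundles:
        node = next((n for n, f in free.items() if f >= gpus), None)
        if node is None:
            return None          # one bundle cannot fit: reserve nothing
        free[node] -= gpus
        assignment.append(node)
    return assignment            # every bundle fits: commit atomically

free = {"node-a": 4, "node-b": 2}
print(try_gang_schedule(free, [2, 2, 2]))     # ['node-a', 'node-a', 'node-b']
print(try_gang_schedule(free, [2, 2, 2, 2]))  # None: only 6 GPUs free, job needs 8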

Bin‑packing GPUs and MIG

GPU bin‑packing prevents fragmentation through intelligent allocation, while Multi‑Instance GPU (MIG) partitions a single GPU into up to seven isolated instances, each with dedicated compute cores, memory, and cache.

Key Concepts:

  • GPU fragmentation: As MIG instances are created and destroyed, their physical position on the device constrains which other profiles can still be instantiated, leaving capacity stranded
  • MIG (Multi‑Instance GPU): NVIDIA technology providing hardware‑level isolation with guaranteed QoS and fault isolation between instances
  • Dedicated vs. shared allocation: MIG provides hardware‑partitioned isolation, whereas time‑slicing shares a whole GPU temporally

MIG Configuration Examples:

# Enable MIG mode on GPU 0
nvidia-smi -i 0 -mig 1
 
# Create two 3g.20gb GPU instances (profile ID 9) with their default compute instances
nvidia-smi mig -cgi 9,9 -C
 
# List the resulting GPU instances and compute instances
nvidia-smi mig -lgi
nvidia-smi mig -lci

Available MIG Profiles (A100‑40GB):

  • 1g.5gb: 1/7 GPU, 5GB memory
  • 2g.10gb: 2/7 GPU, 10GB memory
  • 3g.20gb: 3/7 GPU, 20GB memory
  • 7g.40gb: Full GPU, 40GB memory

Bin‑packing Strategies:

  • First Fit: Place the workload on the first GPU with sufficient free resources
  • Best Fit: Place the workload on the fitting GPU with the least leftover capacity (tightest fit)
  • Worst Fit: Place the workload on the GPU with the most remaining capacity, preserving large contiguous headroom (see the sketch below)
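
The Python sketch below contrasts the three heuristics on a toy model of per‑GPU free memory; production schedulers operate on MIG slices or whole devices, but the selection logic is the same:

def choose_gpu(free_mem, request, strategy):
    """Pick a GPU index for `request` GB using first/best/worst fit, or None."""
    fitting = [(i, m) for i, m in enumerate(free_mem) if m >= request]
    if not fitting:
        return None
    if strategy == "first":
        return fitting[0][0]
    if strategy == "best":   # tightest fit: least leftover capacity
        return min(fitting, key=lambda x: x[1] - request)[0]
    if strategy == "worst":  # loosest fit: keeps large contiguous headroom
        return max(fitting, key=lambda x: x[1] - request)[0]

free_mem = [30, 10, 20]                  # GB free on GPUs 0, 1, 2
print(choose_gpu(free_mem, 8, "first"))  # 0
print(choose_gpu(free_mem, 8, "best"))   # 1 (leaves only 2 GB stranded)
print(choose_gpu(free_mem, 8, "worst"))  # 0 (preserves the mid-sized GPUs)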

2) System Analogs & Tools

Borg Mental Models

Google's Borg pioneered cluster management concepts including hierarchical quotas, priority‑based preemption, admission control, and resource reclamation where unused reserved resources are allocated to lower‑priority jobs.

Core Concepts:

  • Quota allocation: Vector of resource quantities (CPU, RAM, disk) at given priority levels for time periods, used for admission control
  • Priority bands: Non‑overlapping priority levels (monitoring, production, batch, best effort) with preemption policies
  • Resource reclamation: Allocating over‑provisioned but unused resources to batch jobs, with about 20% of workload running on reclaimed resources

Borg Architecture:

Cell (Cluster)
├── Borgmaster (Control Plane)
│   ├── Scheduler
│   ├── State Manager
│   └── API Server
└── Borglets (Node Agents)
    ├── Task Management
    ├── Resource Monitoring
    └── Health Reporting

Key Innovations:

  • Declarative job specifications (BCL, the Borg Configuration Language)
  • Continuous scheduling (not batch‑based)
  • Resource estimation and reclamation
  • Priority‑driven preemption with cascading prevention
  • Fine‑grained resource isolation using Linux containers
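
A toy model of the priority‑band and preemption ideas (a simplification for intuition, not Borg's actual algorithm) is sketched below in Python: a new job may preempt strictly lower bands until enough capacity is freed:

PRIORITY = {"monitoring": 3, "production": 2, "batch": 1, "best-effort": 0}

def admit(node_capacity, running, new_job):
    """Admit new_job, preempting strictly lower-priority jobs if needed.
    Jobs are dicts: {"name", "band", "cpus"}. Returns preempted job names, or None."""
    used = sum(j["cpus"] for j in running)
    if node_capacity - used >= new_job["cpus"]:
        return []                                   # fits in free capacity
    preempted, freed = [], 0
    for j in sorted(running, key=lambda j: PRIORITY[j["band"]]):
        if PRIORITY[j["band"]] >= PRIORITY[new_job["band"]]:
            break                                   # never preempt an equal or higher band
        preempted.append(j["name"])
        freed += j["cpus"]
        if node_capacity - used + freed >= new_job["cpus"]:
            return preempted
    return None                                     # cannot make room

running = [{"name": "batch-1", "band": "batch", "cpus": 20},
           {"name": "web", "band": "production", "cpus": 30}]
print(admit(64, running, {"name": "train", "band": "production", "cpus": 25}))
# ['batch-1']: only 14 CPUs are free, so the batch job is preempted to make room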

Slurm Mechanics

Slurm uses partitions as job queues with attached QoS policies, GRES for GPU management, preemption based on priority hierarchies, job arrays for parallel execution, and fair‑share scheduling based on historical usage.

Key Components:

  • Partitions: Job queues imposing restrictions like job size limits, time limits, and user permissions
  • QoS (Quality of Service): Affects scheduling priority, preemption capability, and resource limits; can be assigned to jobs or partitions
  • GRES (Generic Resources): Configurable resource types like GPUs, with specific allocation and accounting
  • Fair‑share: Scheduling factor based on allocated shares versus consumed resources, promoting equitable access

Slurm Configuration Example:

# Partition configuration
PartitionName=gpu Nodes=gpu[01-10] MaxTime=72:00:00 DefaultTime=01:00:00
PartitionName=cpu Nodes=cpu[01-50] MaxTime=168:00:00 DefaultTime=01:00:00
 
# QoS configuration
sacctmgr create qos name=high priority=1000 MaxJobs=100 MaxWall=24:00:00
sacctmgr create qos name=normal priority=500 MaxJobs=500 MaxWall=72:00:00
sacctmgr create qos name=low priority=100 MaxJobs=1000 MaxWall=168:00:00
 
# GRES configuration
GresTypes=gpu
NodeName=gpu[01-10] Gres=gpu:tesla_v100:4

Job Submission Example:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --qos=high
#SBATCH --gres=gpu:2
#SBATCH --time=04:00:00
#SBATCH --mem=32G
 
python train_model.py
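
For intuition on the fair‑share factor listed under Key Components: Slurm's classic multifactor formula is F = 2^(-U/S), where U is a user's normalized effective usage and S their normalized share. The Python sketch below reproduces that curve (simplified, ignoring usage decay and the Fair Tree algorithm):

def fairshare_factor(normalized_usage: float, normalized_shares: float) -> float:
    """Slurm-style classic fair-share factor: 1.0 when unused, 0.5 when usage == shares."""
    return 2 ** (-normalized_usage / normalized_shares)

# Two users, each entitled to 50% of the cluster
print(round(fairshare_factor(0.00, 0.5), 3))  # 1.0  : no usage yet, highest priority factor
print(round(fairshare_factor(0.50, 0.5), 3))  # 0.5  : used exactly their share
print(round(fairshare_factor(1.00, 0.5), 3))  # 0.25 : used twice their share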

Ray Framework

Ray provides placement groups for resource co‑location, actors for stateful long‑running processes, and autoscaler integration for dynamic cluster sizing.

Key Features:

  • Placement groups: Resource bundles with PACK/SPREAD strategies for coordinated scheduling
  • Actors: Long‑lived stateful processes that persist across multiple tasks
  • Autoscaler hooks: Integration points for dynamic node provisioning

Ray Placement Group Example:

import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
 
ray.init()
 
# Create a placement group for distributed training
pg = ray.util.placement_group([
    {"CPU": 2, "GPU": 1},  # Worker 1
    {"CPU": 2, "GPU": 1},  # Worker 2
    {"CPU": 2, "GPU": 1},  # Worker 3
    {"CPU": 1},            # Parameter server
], strategy="PACK")  # Co-locate bundles on as few nodes as possible
 
# Block until the whole group is reserved (all-or-nothing)
ray.get(pg.ready())
 
@ray.remote(num_cpus=2, num_gpus=1)
def train_worker(data_shard):
    # Training logic here
    return len(data_shard)
 
data_shards = [list(range(100)) for _ in range(3)]  # placeholder shards
 
# Schedule each worker into its bundle within the placement group
futures = [
    train_worker.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(
            placement_group=pg,
            placement_group_bundle_index=i,
        )
    ).remote(data_shards[i])
    for i in range(3)
]
results = ray.get(futures)

3) Practical Policies

Classes of Service (Gold/Silver/Bronze)

Implement hierarchical service tiers with different SLOs and preemption policies to manage resource allocation and performance expectations.

Service Tier Definition:

Gold Tier:

  • SLO: 99.9% availability, under 100ms p99 latency
  • Resource guarantees: Reserved capacity, no preemption
  • Priority: Highest scheduling priority
  • Cost: Premium pricing

Silver Tier:

  • SLO: 99.5% availability, under 500ms p99 latency
  • Resource guarantees: Best‑effort with preemption protection from Bronze
  • Priority: Medium scheduling priority
  • Cost: Standard pricing

Bronze Tier:

  • SLO: 99% availability, under 2s p99 latency
  • Resource guarantees: Preemptible, runs on excess capacity
  • Priority: Lowest scheduling priority
  • Cost: Discounted pricing

Implementation in Kubernetes:

# Gold tier priority class
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gold-priority
value: 1000
globalDefault: false
description: "Gold tier - highest priority, no preemption"
---
# Silver tier priority class
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: silver-priority
value: 500
globalDefault: false
description: "Silver tier - medium priority"
---
# Bronze tier priority class
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: bronze-priority
value: 100
globalDefault: true
description: "Bronze tier - preemptible"

Node Labeling & Anti‑affinity

Use node labels and anti‑affinity rules to prevent resource conflicts and isolate different workload types.

Node Labeling Strategy:

# Label nodes by workload type
kubectl label nodes worker-01 workload-type=cpu-intensive
kubectl label nodes worker-02 workload-type=memory-intensive
kubectl label nodes worker-03 workload-type=storage-heavy
kubectl label nodes gpu-01 workload-type=gpu-compute
 
# Label nodes by performance tier
kubectl label nodes worker-01 performance-tier=high
kubectl label nodes worker-02 performance-tier=medium
kubectl label nodes worker-03 performance-tier=standard

Anti‑affinity Implementation:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-intensive-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cpu-intensive-app
  template:
    metadata:
      labels:
        app: cpu-intensive-app
    spec:
      containers:
      - name: app
        image: registry.example.com/cpu-intensive-app:latest  # placeholder image
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: workload-type
                operator: In
                values: ["cpu-intensive"]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values: ["storage-heavy-app"]
              topologyKey: kubernetes.io/hostname

KV Cache Locality

Optimize key‑value cache performance by ensuring cache data stays local to processing units, particularly important for GPU‑accelerated serving workloads.

Implementation Strategies:

GPU Memory Pinning:

import torch
 
batch_size, seq_len, hidden_size = 8, 2048, 4096
 
# Keep the KV cache resident in GPU memory so lookups stay local to the device
kv_cache = torch.zeros(batch_size, seq_len, hidden_size, device="cuda")
 
# Pin the host-side staging buffer: pinned (page-locked) memory enables async
# host<->device copies; pin_memory() applies to CPU tensors, not CUDA tensors
host_buffer = torch.zeros(batch_size, seq_len, hidden_size).pin_memory()
 
# Use a dedicated CUDA stream so cache transfers overlap with compute
stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    kv_cache.copy_(host_buffer, non_blocking=True)
    # result = model.forward_with_cache(input_ids, kv_cache)  # model-specific call

Node‑local Cache Configuration:

apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  containers:
  - name: inference-server
    image: registry.example.com/inference-server:latest  # placeholder image
    resources:
      requests:
        nvidia.com/gpu: 1
        memory: "32Gi"
      limits:
        nvidia.com/gpu: 1
        memory: "32Gi"
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: "0"
    - name: KV_CACHE_SIZE_GB
      value: "16"
  nodeSelector:
    gpu-type: "a100"
  tolerations:
  - key: "dedicated-inference"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

4) Observability

Cluster Dashboards

Modern observability requires comprehensive dashboards tracking utilization, fragmentation metrics, queue depth, preemption counts, and SLO compliance with both real‑time and historical trend analysis.

Essential Metrics:

Resource Utilization:

  • CPU utilization (per node, per cluster)
  • Memory utilization and pressure
  • GPU utilization and memory usage
  • Network bandwidth and packet rates
  • Storage I/O and capacity

Scheduling Metrics:

  • Queue depth by priority/QoS level
  • Scheduling latency (time from submission to start)
  • Preemption frequency and duration
  • Resource fragmentation index
  • Pod pending time by reason

Performance Metrics:

  • Job completion time percentiles
  • Throughput (jobs/hour, tasks/second)
  • Resource efficiency (allocated vs. used)
  • Failure rates by workload type

Example Prometheus Queries:

# CPU utilization by node
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
 
# GPU utilization across the cluster (metric name depends on your exporter;
# this is the NVIDIA DCGM exporter's gauge)
avg(DCGM_FI_DEV_GPU_UTIL)
 
# Pod scheduling latency (kube-scheduler histogram)
histogram_quantile(0.95,
  rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m])
)
 
# Free-but-unrequested CPU per node; many nodes with small remainders indicate fragmentation
sum by (node) (kube_node_status_allocatable{resource="cpu"})
  - sum by (node) (kube_pod_container_resource_requests{resource="cpu"})

SLO Compliance & Monitoring

Service Level Objectives require continuous monitoring with error budget tracking, burn rate alerts, and both request‑based and time‑window‑based measurements.

SLO Framework Components:

  • SLI (Service Level Indicators): Quantitative measurements like latency percentiles, error rates, and availability
  • Error budget: Tolerable amount of system unavailability within compliance periods
  • Burn rate: Rate at which error budget is being consumed (a worked example follows below)
  • Tail latency: Focus on high percentiles (p95, p99) rather than just mean latency
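
Burn rate is simply the observed error rate divided by the rate the SLO allows. The short Python sketch below walks through the arithmetic for a 99.9% availability SLO over a 30‑day window (illustrative numbers only):

slo_target = 0.999                      # 99.9% availability SLO
error_budget = 1 - slo_target           # 0.1% of requests may fail per window
window_days = 30

observed_error_rate = 0.005             # 0.5% of requests currently failing
burn_rate = observed_error_rate / error_budget
print(burn_rate)                        # 5.0: the budget is burning 5x faster than allowed

# At this burn rate the entire 30-day budget is exhausted in 6 days
days_to_exhaustion = window_days / burn_rate
print(days_to_exhaustion)               # 6.0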

SLO Implementation Example:

# Prometheus recording rules for SLI calculation
groups:
- name: sli_rules
  rules:
  - record: sli:request_latency_99p
    expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

  - record: sli:error_rate
    expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])

  - record: sli:availability_5m
    expr: 1 - sli:error_rate

# SLO alerting rules
- name: slo_alerts
  rules:
  - alert: HighErrorBudgetBurnRate
    expr: sli:error_rate > 0.001  # 0.1% error rate threshold
    for: 5m
    labels:
      severity: warning
      slo_type: availability
    annotations:
      summary: "High error budget burn rate detected"
      description: "Error rate is {{ $value | humanizePercentage }} over 5 minutes"

Admission Control

Admission control systems validate resource requests before scheduling, enforce quotas and policies, and provide headroom estimates to prevent oversubscription.

Key Functions:

  • Headroom estimation: Predicting available capacity for new workloads (sketched below)
  • Resource validation: Checking if requested resources are available and policy‑compliant
  • Quota enforcement: Ensuring resource usage stays within defined limits
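
A minimal headroom estimate, assuming requests‑based accounting and a fixed safety buffer (both assumptions for illustration), might look like the Python sketch below; real admission controllers also account for pending pods and system overhead:

def headroom(allocatable: dict, requested: dict, buffer_fraction: float = 0.1) -> dict:
    """Remaining schedulable capacity per resource after a safety buffer."""
    return {r: allocatable[r] * (1 - buffer_fraction) - requested.get(r, 0)
            for r in allocatable}

def admits(request: dict, allocatable: dict, requested: dict) -> bool:
    free = headroom(allocatable, requested)
    return all(free.get(r, 0) >= qty for r, qty in request.items())

allocatable = {"cpu": 128, "memory_gib": 512, "gpu": 8}
requested   = {"cpu": 100, "memory_gib": 300, "gpu": 6}
print(headroom(allocatable, requested))                       # {'cpu': 15.2, 'memory_gib': 160.8, 'gpu': 1.2}
print(admits({"cpu": 8, "gpu": 1}, allocatable, requested))   # True
print(admits({"cpu": 8, "gpu": 2}, allocatable, requested))   # False: only 1.2 GPUs of headroom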

Implementation in Kubernetes:

# Resource quota per namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team‑a‑quota
  namespace: team‑a
spec:
  hard:
    requests.cpu: "100"
    requests.memory: "200Gi" 
    requests.nvidia.com/gpu: "10"
    limits.cpu: "200"
    limits.memory: "400Gi"
    pods: "50"
 
‑‑‑
# Limit range for default/max values
apiVersion: v1
kind: LimitRange
metadata:
  name: team‑a‑limits
  namespace: team‑a
spec:
  limits:
  ‑ default:
      cpu: "2"
      memory: "4Gi"
    defaultRequest:
      cpu: "100m" 
      memory: "128Mi"
    max:
      cpu: "16"
      memory: "32Gi"
    type: Container

Admission Controller for Custom Policies:

// Example admission webhook handler (sketch): validates GPU requests against a
// fragmentation policy. getGPURequests and willCauseFragmentation are assumed to be
// implemented elsewhere; imports: admission/v1, core/v1, meta/v1, encoding/json.
func (w *GPUAdmissionWebhook) validateGPURequest(req *v1.AdmissionRequest) *v1.AdmissionResponse {
    pod := &corev1.Pod{}
    if err := json.Unmarshal(req.Object.Raw, pod); err != nil {
        return &v1.AdmissionResponse{
            Allowed: false,
            Result:  &metav1.Status{Message: err.Error()},
        }
    }

    // Reject requests whose placement would strand GPU capacity
    gpuRequests := getGPURequests(pod)
    if willCauseFragmentation(gpuRequests) {
        return &v1.AdmissionResponse{
            Allowed: false,
            Result: &metav1.Status{
                Message: "GPU request would cause excessive fragmentation",
            },
        }
    }

    return &v1.AdmissionResponse{Allowed: true}
}

Best Practices Summary

Scheduling Optimization

  1. Use gang scheduling for distributed workloads to prevent resource deadlocks
  2. Implement tiered service classes with appropriate preemption policies
  3. Leverage MIG for GPU sharing when full GPU isolation isn't required
  4. Monitor and alert on resource fragmentation to maintain scheduling efficiency

Resource Management

  1. Enable oversubscription carefully with proper monitoring and limits
  2. Use work‑conserving schedulers to maximize utilization
  3. Implement resource reclamation for unused reserved capacity
  4. Maintain headroom for burst capacity and emergency workloads

Observability & SLOs

  1. Define meaningful SLIs that reflect user experience
  2. Monitor error budgets and burn rates for proactive incident response
  3. Use tail latency metrics (p95, p99) rather than just averages
  4. Implement comprehensive dashboards covering utilization, performance, and reliability

Operational Excellence

  1. Test preemption and failure scenarios regularly
  2. Automate capacity planning based on historical trends and growth projections
  3. Implement gradual rollouts for scheduler configuration changes
  4. Maintain runbooks for common scheduling and resource issues

References

  1. Ray Project Documentation, "Ray Core Scheduling"
  2. NVIDIA, "Multi-Instance GPU (MIG) User Guide"
  3. Abhishek Verma et al., "Large-scale cluster management at Google with Borg," EuroSys 2015
  4. Slurm Workload Manager, Official Documentation
  5. Kubernetes Documentation, "Pod Quality of Service Classes"
  6. Linux Kernel Documentation, "Control Group v2"
  7. Hongzi Mao et al., "Learning scheduling algorithms for data processing clusters," SIGCOMM 2019
  8. "Scheduling Parallel-Task Jobs Subject to Packing and Placement Constraints" (gang scheduling research)
  9. "A survey of Kubernetes scheduling algorithms" (cluster scheduling survey)
  10. Google, "Site Reliability Engineering" (SRE Handbook)

This guide provides comprehensive coverage of cluster‑level thinking for modern distributed systems. For the latest updates and additional resources, refer to the project documentation and research papers cited above.

#scheduling#placement#isolation#cluster#GPU#MIG#resource-management