Scalable Cache Coherence for Manycore Processors
Novel directory-based coherence protocol that reduces memory overhead by 60% while maintaining performance in 256-core systems.
As processor core counts continue to grow, traditional cache coherence protocols face significant scalability challenges. This ISCA 2023 paper presents a breakthrough approach to directory-based coherence that dramatically reduces memory overhead.
1. The Scalability Problem
Traditional directory-based coherence protocols maintain a directory entry for each memory block, tracking which caches have copies. For a 256-core system with 64-byte cache lines:
Memory Overhead Calculation:
- Directory entry size: 256 bits (1 bit per core) + state bits
- For 1 GB of memory: 1 GB ÷ 64 B per line = 16M cache lines, and 16M × 32 bytes/entry = 512 MB of directory storage
- Roughly 50% memory overhead, which is clearly unsustainable at scale (a back-of-envelope calculator for this baseline is sketched below)
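To make the arithmetic concrete, here is a minimal calculator for full-bit-vector directory overhead. The per-entry layout (one presence bit per core plus a small state field) follows the description above; the 8 state bits and the function name are assumptions chosen for illustration, not values from the paper.

```python
# Back-of-envelope estimate of full-bit-vector directory overhead.
# Assumption: one directory entry per cache line, each holding a presence
# bit per core plus 8 state bits (state-bit count is an illustrative guess).

def directory_overhead(cores: int, memory_bytes: int,
                       line_bytes: int = 64, state_bits: int = 8) -> float:
    """Return directory storage as a fraction of the memory it tracks."""
    entries = memory_bytes // line_bytes          # one entry per cache line
    entry_bits = cores + state_bits               # presence vector + state
    directory_bytes = entries * entry_bits / 8
    return directory_bytes / memory_bytes

for cores in (64, 128, 256):
    pct = directory_overhead(cores, memory_bytes=1 << 30) * 100
    print(f"{cores:3d} cores: {pct:.1f}% overhead")
# 256 cores works out to ~51.6% of tracked memory, matching the ~50% figure above.
```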
2. Key Innovation: Hierarchical Directories
The paper introduces a three-level directory hierarchy (a sketch of how a request walks these levels follows the list below):
2.1 Level 1: Local Directories
- Track sharing within 8-core clusters
- Small, fast, co-located with L3 cache
- Handles 80% of coherence traffic locally
2.2 Level 2: Regional Directories
- Track sharing across 4 clusters (32 cores)
- Moderate size, shared among clusters
- Aggregates coherence requests
2.3 Level 3: Global Directory
- Sparse representation for system-wide sharing
- Only activated for widely-shared data
- Uses compressed bit vectors
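The sketch below illustrates how a write's invalidation lookup might resolve sharers at the lowest directory level that covers them. The class, method names, and dictionary-based bookkeeping are illustrative assumptions, not the paper's hardware structures; the cluster and region sizes follow the text above.

```python
CLUSTER_SIZE = 8     # cores per local (level-1) directory
REGION_SIZE = 4      # clusters per regional (level-2) directory

class HierarchicalDirectory:
    def __init__(self) -> None:
        self.local = {}       # (cluster_id, block) -> sharer cores in that cluster
        self.regional = {}    # (region_id, block)  -> clusters holding sharers
        self.global_dir = {}  # block -> regions with sharers (widely-shared data only)

    def add_sharer(self, core: int, block: int) -> None:
        cluster = core // CLUSTER_SIZE
        region = cluster // REGION_SIZE
        self.local.setdefault((cluster, block), set()).add(core)
        self.regional.setdefault((region, block), set()).add(cluster)
        regions = {r for (r, b) in self.regional if b == block}
        if len(regions) > 1:                        # sharing crossed a region boundary
            self.global_dir[block] = regions

    def sharers_to_invalidate(self, requester: int, block: int) -> set:
        """Resolve sharers at the lowest level of the hierarchy that covers them."""
        cluster = requester // CLUSTER_SIZE
        region = cluster // REGION_SIZE
        if block in self.global_dir:                                # level 3: system-wide
            clusters = set().union(*(self.regional.get((r, block), set())
                                     for r in self.global_dir[block]))
        elif self.regional.get((region, block), set()) - {cluster}:  # level 2: within region
            clusters = self.regional[(region, block)]
        else:                                                        # level 1: local cluster
            clusters = {cluster}
        victims = set().union(*(self.local.get((c, block), set()) for c in clusters))
        return victims - {requester}
```

In this sketch, sharing confined to the requester's cluster never consults the upper levels, which is the property the paper relies on to keep most coherence traffic local.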
3. Adaptive Coherence Granularity
Traditional protocols use fixed cache-line granularity. This work introduces adaptive granularity:
- Fine-grained (64B): For private or lightly-shared data
- Coarse-grained (512B): For widely-shared read-only data
- Dynamic switching: Based on observed sharing patterns (see the sketch after the benefits below)
Benefits:
- Reduces directory entries for read-only shared data
- Maintains fine-grained control for frequently modified data
- 40% reduction in directory traffic
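A minimal sketch of the switching decision is given below. The sharer-count and write-ratio thresholds, and all class and field names, are assumed values chosen for illustration; the paper's exact promotion heuristics are not reproduced here.

```python
from collections import defaultdict

FINE_GRAIN = 64      # bytes: default granularity for private or lightly-shared data
COARSE_GRAIN = 512   # bytes: granularity for widely-shared read-only data

class GranularityTracker:
    """Decide per coarse region whether to track coherence at 64 B or 512 B."""

    def __init__(self, sharer_threshold: int = 4, write_ratio_limit: float = 0.05):
        self.reads = defaultdict(int)
        self.writes = defaultdict(int)
        self.sharers = defaultdict(set)
        self.sharer_threshold = sharer_threshold      # "widely shared" cutoff (assumed)
        self.write_ratio_limit = write_ratio_limit    # "read-mostly" cutoff (assumed)

    def observe(self, addr: int, core: int, is_write: bool) -> None:
        region = addr - addr % COARSE_GRAIN           # align to the coarse region
        self.sharers[region].add(core)
        if is_write:
            self.writes[region] += 1
        else:
            self.reads[region] += 1

    def granularity(self, addr: int) -> int:
        region = addr - addr % COARSE_GRAIN
        accesses = self.reads[region] + self.writes[region]
        if accesses == 0:
            return FINE_GRAIN
        write_ratio = self.writes[region] / accesses
        widely_shared = len(self.sharers[region]) >= self.sharer_threshold
        read_mostly = write_ratio <= self.write_ratio_limit
        # One coarse entry replaces eight fine entries for read-mostly shared data.
        return COARSE_GRAIN if widely_shared and read_mostly else FINE_GRAIN
```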
4. Invalidation Aggregation
Instead of sending individual invalidations, the protocol aggregates them:
- Temporal Aggregation: Collect invalidations over a 10-cycle window
- Spatial Aggregation: Combine invalidations for adjacent cache lines
- Hierarchical Propagation: Send aggregated messages up the hierarchy
Result: 35% reduction in coherence message count. A simplified sketch of the aggregation logic follows.
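The sketch below combines the temporal and spatial steps. The (base address, byte length) message format and the window bookkeeping are assumptions made for illustration; the 10-cycle window and 64 B line size come from the text above.

```python
LINE_BYTES = 64
WINDOW_CYCLES = 10   # temporal aggregation window (value from the paper)

class InvalidationAggregator:
    def __init__(self) -> None:
        self.pending = []        # line addresses buffered in the current window
        self.window_start = 0

    def post(self, cycle: int, line_addr: int):
        """Buffer an invalidation; return merged messages once the window closes."""
        flushed = None
        if self.pending and cycle - self.window_start >= WINDOW_CYCLES:
            flushed = self.flush()
            self.window_start = cycle
        elif not self.pending:
            self.window_start = cycle
        self.pending.append(line_addr)
        return flushed

    def flush(self):
        """Spatially merge adjacent cache lines into (base, length) range messages."""
        if not self.pending:
            return []
        lines = sorted(set(self.pending))
        self.pending = []
        messages = []
        base = prev = lines[0]
        for addr in lines[1:]:
            if addr == prev + LINE_BYTES:   # adjacent line: extend the current run
                prev = addr
            else:                           # gap: close the run and start a new one
                messages.append((base, prev - base + LINE_BYTES))
                base = prev = addr
        messages.append((base, prev - base + LINE_BYTES))
        return messages
```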
5. Performance Evaluation
The authors evaluated the protocol on 64-, 128-, and 256-core systems:
5.1 Memory Overhead Reduction
- 64 cores: 45% reduction (vs. traditional directory)
- 128 cores: 58% reduction
- 256 cores: 62% reduction
5.2 Performance Impact
- Latency: 5-8% increase for remote accesses
- Throughput: 2-3% performance loss on average
- Bandwidth: 30% reduction in coherence traffic
5.3 Workload Analysis
- Scientific computing: Excellent results (minimal sharing)
- Database workloads: Good results (read-heavy sharing)
- Graph analytics: Moderate results (irregular sharing patterns)
6. Implementation Challenges
6.1 Hardware Complexity
- Requires three-level directory hierarchy
- Additional logic for granularity adaptation
- More complex coherence state machines
6.2 Software Implications
- OS page allocation affects clustering efficiency
- NUMA-aware scheduling becomes more important
- Memory allocators should consider sharing patterns
7. Industry Relevance
This work directly addresses real industry needs:
Datacenter Processors:
- Modern x86 processors: 64+ cores per socket
- High-end server CPUs: 56+ cores per socket
- ARM server processors: 128+ cores planned
Cost Implications:
- Directory memory is expensive (SRAM/eDRAM)
- A ~60% reduction (roughly 512 MB down to about 200 MB of directory state per GB of memory at 256 cores) enables cost-effective scaling
- Reduces power consumption of coherence logic
8. Future Directions
The paper opens several research avenues:
- Machine Learning Integration: Use ML to predict sharing patterns
- Heterogeneous Systems: Extend to CPU+GPU coherence
- Persistent Memory: Adapt for storage-class memory
- Security: Coherence-based side-channel mitigation
9. Critical Analysis
Strengths:
- Addresses real scalability bottleneck
- Comprehensive evaluation methodology
- Practical implementation considerations
Limitations:
- Increased hardware complexity
- Performance overhead for some workloads
- Limited evaluation of security implications
10. Key Takeaways
- Directory overhead is a major scalability bottleneck for manycore systems
- Hierarchical approaches can dramatically reduce memory requirements
- Adaptive granularity provides flexibility for different sharing patterns
- Small performance trade-offs can enable significant cost savings
- Industry adoption will depend on implementation complexity vs. benefits
This research represents a significant step toward practical 256+ core processors, making manycore systems economically viable for broader deployment.