ISCA 2023 · CPU · coherence · directory · scalability · manycore

Scalable Cache Coherence for Manycore Processors

Sarah Chen, Michael Rodriguez, Dr. Lisa Wang

Novel directory-based coherence protocol that reduces memory overhead by 60% while maintaining performance in 256-core systems.



As processor core counts continue to grow, traditional cache coherence protocols face significant scalability challenges. This ISCA 2023 paper presents a breakthrough approach to directory-based coherence that dramatically reduces memory overhead.

1. The Scalability Problem

Traditional directory-based coherence protocols maintain a directory entry for each memory block, tracking which caches have copies. For a 256-core system with 64-byte cache lines:

Memory Overhead Calculation:

  • Directory entry size: 256 bits (1 bit per core) + state bits
  • For 1GB memory: 16M cache lines × 32 bytes/entry = 512MB directory overhead
  • 50% memory overhead, which is clearly unsustainable (the sketch below reproduces this arithmetic)
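To make the arithmetic concrete, the following minimal sketch reproduces the numbers above for a full bit-vector directory. The function name and parameters are mine for illustration, not the paper's:

```python
def directory_overhead(mem_bytes, line_bytes, num_cores, state_bits=0):
    """Memory consumed by a full bit-vector directory (one entry per cache line)."""
    num_lines = mem_bytes // line_bytes      # tracked cache lines
    entry_bits = num_cores + state_bits      # one presence bit per core, plus state
    dir_bytes = num_lines * entry_bits / 8
    return dir_bytes, dir_bytes / mem_bytes

GiB = 1024 ** 3
dir_bytes, ratio = directory_overhead(mem_bytes=1 * GiB, line_bytes=64, num_cores=256)
print(f"{dir_bytes / (1024 ** 2):.0f} MiB directory for 1 GiB of memory ({ratio:.0%} overhead)")
# -> 512 MiB directory for 1 GiB of memory (50% overhead)
```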

2. Key Innovation: Hierarchical Directories

The paper introduces a three-level directory hierarchy (a code sketch of the resulting lookup path follows the level descriptions below):

2.1 Level 1: Local Directories

  • Track sharing within 8-core clusters
  • Small, fast, co-located with L3 cache
  • Handles 80% of coherence traffic locally

2.2 Level 2: Regional Directories

  • Track sharing across 4 clusters (32 cores)
  • Moderate size, shared among clusters
  • Aggregates coherence requests

2.3 Level 3: Global Directory

  • Sparse representation for system-wide sharing
  • Only activated for widely-shared data
  • Uses compressed bit vectors
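The paper does not publish hardware pseudocode, but the lookup path implied by the three levels can be sketched roughly as follows. The cluster and region sizes come from the description above; the data structures, method names, and fast-path check are illustrative assumptions, with Python sets standing in for the hardware's (compressed) bit vectors:

```python
from collections import defaultdict

CORES_PER_CLUSTER = 8       # Level 1 scope: 8-core clusters
CLUSTERS_PER_REGION = 4     # Level 2 scope: 4 clusters = 32 cores

class HierarchicalDirectory:
    """Illustrative three-level sharer tracking: cluster -> region -> global.
    Python sets stand in for hardware bit vectors; the global level would use
    the paper's compressed representation."""

    def __init__(self):
        self.local = defaultdict(set)       # (cluster, block) -> sharing cores
        self.regional = defaultdict(set)    # (region, block)  -> sharing clusters
        self.global_dir = defaultdict(set)  # block            -> sharing regions

    def record_sharer(self, block, core):
        cluster = core // CORES_PER_CLUSTER
        region = cluster // CLUSTERS_PER_REGION
        self.local[(cluster, block)].add(core)
        self.regional[(region, block)].add(cluster)
        self.global_dir[block].add(region)

    def sharers_of(self, block, requesting_core):
        cluster = requesting_core // CORES_PER_CLUSTER
        region = cluster // CLUSTERS_PER_REGION
        # Fast path: sharing is confined to the requester's cluster, so the
        # small Level 1 directory answers without touching the upper levels.
        if (self.global_dir[block] <= {region}
                and self.regional[(region, block)] <= {cluster}):
            return self.local[(cluster, block)]
        # Slow path: widely shared block, walk the hierarchy top-down.
        sharers = set()
        for r in self.global_dir[block]:
            for cl in self.regional[(r, block)]:
                sharers |= self.local[(cl, block)]
        return sharers

# Cores 3 and 5 (cluster 0) share a block; core 200 (a different region) joins later.
d = HierarchicalDirectory()
d.record_sharer(block=0x1000, core=3)
d.record_sharer(block=0x1000, core=5)
print(sorted(d.sharers_of(0x1000, requesting_core=3)))   # [3, 5], resolved locally
d.record_sharer(block=0x1000, core=200)
print(sorted(d.sharers_of(0x1000, requesting_core=3)))   # [3, 5, 200], via the global level
```

Keeping Level 1 co-located with the L3 cache is what lets this fast path handle the claimed 80% of coherence traffic without leaving the cluster.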

3. Adaptive Coherence Granularity

Traditional protocols use a fixed cache-line granularity. This work introduces adaptive granularity (a sketch of the switching policy follows the benefits list below):

  • Fine-grained (64B): For private or lightly-shared data
  • Coarse-grained (512B): For widely-shared read-only data
  • Dynamic switching: Based on observed sharing patterns

Benefits:

  • Reduces directory entries for read-only shared data
  • Maintains fine-grained control for frequently modified data
  • 40% reduction in directory traffic
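A rough sketch of how the dynamic switching might be driven by observed sharing patterns is shown below. The promotion threshold, counters, and class name are assumptions made for illustration, not the paper's exact heuristic:

```python
FINE_BYTES = 64      # fine-grained tracking: one entry per cache line
COARSE_BYTES = 512   # coarse-grained tracking: one entry per 8 lines

class GranularityTracker:
    """Toy per-region policy (assumed, not the paper's): regions read by many
    cores are promoted to a single coarse entry; any write demotes the region
    back to per-line entries so invalidations stay precise."""

    def __init__(self, sharer_threshold=4):
        self.sharer_threshold = sharer_threshold
        self.readers = {}       # region -> set of cores that have read it
        self.coarse = set()     # regions currently tracked at 512B granularity

    def _region(self, addr):
        return addr // COARSE_BYTES

    def on_read(self, addr, core):
        region = self._region(addr)
        self.readers.setdefault(region, set()).add(core)
        # Widely shared, read-only so far: one coarse entry replaces 8 fine ones.
        if len(self.readers[region]) >= self.sharer_threshold:
            self.coarse.add(region)

    def on_write(self, addr, core):
        region = self._region(addr)
        # A write means fine-grained invalidation control is needed again.
        self.coarse.discard(region)
        self.readers[region] = {core}

    def granularity(self, addr):
        return COARSE_BYTES if self._region(addr) in self.coarse else FINE_BYTES

# Four cores read the same 512B region -> it is promoted to coarse tracking.
t = GranularityTracker()
for core in range(4):
    t.on_read(0x2000 + core * 64, core)
print(t.granularity(0x2000))   # 512
t.on_write(0x2040, core=7)     # a write demotes the region back to 64B entries
print(t.granularity(0x2000))   # 64
```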

4. Invalidation Aggregation

Instead of sending individual invalidations, the protocol aggregates them (a sketch follows the list):

  1. Temporal Aggregation: Collect invalidations over a 10-cycle window
  2. Spatial Aggregation: Combine invalidations for adjacent cache lines
  3. Hierarchical Propagation: Send aggregated messages up the hierarchy

Result: 35% reduction in coherence message count
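The first two aggregation steps can be approximated in a few lines. Only the 10-cycle window and the merging of adjacent lines come from the description above; the message format, flush policy, and names are assumptions for illustration:

```python
from collections import defaultdict

LINE_BYTES = 64
WINDOW_CYCLES = 10   # temporal aggregation window from the paper's description

class InvalidationAggregator:
    """Toy model: buffer invalidations for up to WINDOW_CYCLES, merge those
    targeting adjacent cache lines, and emit one message per destination."""

    def __init__(self):
        self.pending = defaultdict(list)   # destination -> invalidated line addresses
        self.window_start = 0

    def invalidate(self, cycle, dest, addr):
        msgs = []
        if cycle - self.window_start >= WINDOW_CYCLES:
            msgs = self.flush()            # window expired: emit aggregated messages
            self.window_start = cycle
        self.pending[dest].append(addr - addr % LINE_BYTES)
        return msgs

    def flush(self):
        """Emit aggregated messages as (dest, base_line, line_count) runs."""
        messages = []
        for dest, lines in self.pending.items():
            lines = sorted(set(lines))
            run_start, run_len = lines[0], 1
            for addr in lines[1:]:
                if addr == run_start + run_len * LINE_BYTES:
                    run_len += 1           # spatial aggregation: extend the run
                else:
                    messages.append((dest, run_start, run_len))
                    run_start, run_len = addr, 1
            messages.append((dest, run_start, run_len))
        self.pending.clear()
        return messages

agg = InvalidationAggregator()
for cycle, addr in enumerate([0x4000, 0x4040, 0x4080, 0x9000]):
    agg.invalidate(cycle=cycle, dest="cluster-3", addr=addr)
print(agg.flush())
# [('cluster-3', 16384, 3), ('cluster-3', 36864, 1)] -> 2 messages instead of 4
```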

5. Performance Evaluation

The authors evaluated the protocol on 64-, 128-, and 256-core systems:

5.1 Memory Overhead Reduction

  • 64 cores: 45% reduction (vs. traditional directory)
  • 128 cores: 58% reduction
  • 256 cores: 62% reduction

5.2 Performance Impact

  • Latency: 5-8% increase in remote access latency
  • Throughput: 2-3% performance loss on average
  • Bandwidth: 30% reduction in coherence traffic

5.3 Workload Analysis

  • Scientific computing: Excellent results (minimal sharing)
  • Database workloads: Good results (read-heavy sharing)
  • Graph analytics: Moderate results (irregular sharing patterns)

6. Implementation Challenges

6.1 Hardware Complexity

  • Requires a three-level directory hierarchy
  • Additional logic for granularity adaptation
  • More complex coherence state machines

6.2 Software Implications

  • OS page allocation affects clustering efficiency
  • NUMA-aware scheduling becomes more important
  • Memory allocators should consider sharing patterns

7. Industry Relevance

This work directly addresses real industry needs:

Datacenter Processors:

  • Modern x86 processors: 64+ cores per socket
  • High-end server CPUs: 56+ cores per socket
  • ARM server processors: 128+ cores planned

Cost Implications:

  • Directory memory is expensive (SRAM/eDRAM)
  • 60% reduction enables cost-effective scaling
  • Reduces power consumption of coherence logic

8. Future Directions

The paper opens several research avenues:

  1. Machine Learning Integration: Use ML to predict sharing patterns
  2. Heterogeneous Systems: Extend to CPU+GPU coherence
  3. Persistent Memory: Adapt for storage-class memory
  4. Security: Coherence-based side-channel mitigation

9. Critical Analysis

Strengths:

  • Addresses real scalability bottleneck
  • Comprehensive evaluation methodology
  • Practical implementation considerations

Limitations:

  • Increased hardware complexity
  • Performance overhead for some workloads
  • Limited evaluation of security implications

10. Key Takeaways

  1. Directory overhead is a major scalability bottleneck for manycore systems
  2. Hierarchical approaches can dramatically reduce memory requirements
  3. Adaptive granularity provides flexibility for different sharing patterns
  4. Small performance trade-offs can enable significant cost savings
  5. Industry adoption will depend on implementation complexity vs. benefits

This research represents a significant step toward practical 256+ core processors, making manycore systems economically viable for broader deployment.