Scalable Cache Coherence for Manycore Processors
Novel directory-based coherence protocol that reduces memory overhead by 60% while maintaining performance in 256-core systems.
As processor core counts continue to grow, traditional cache coherence protocols face significant scalability challenges. This ISCA 2023 paper presents a breakthrough approach to directory-based coherence that dramatically reduces memory overhead.
1. The Scalability Problem
Traditional directory-based coherence protocols maintain a directory entry for each memory block, tracking which caches have copies. For a 256-core system with 64-byte cache lines:
Memory Overhead Calculation:
- Directory entry size: 256 bits (1 bit per core) + state bits
- For 1 GB of memory: 1 GB ÷ 64 B per line = 16M cache lines, and 16M × 32 bytes/entry = 512 MB of directory storage
- Roughly 50% memory overhead, which is clearly unsustainable at scale (a back-of-envelope calculator for this baseline is sketched below)
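To make the arithmetic concrete, here is a minimal calculator for full-bit-vector directory overhead. The per-entry layout (one presence bit per core plus a small state field) follows the description above; the 8 state bits and the function name are assumptions chosen for illustration, not values from the paper.

```python
# Back-of-envelope estimate of full-bit-vector directory overhead.
# Assumption: one directory entry per cache line, each holding a presence
# bit per core plus 8 state bits (state-bit count is an illustrative guess).

def directory_overhead(cores: int, memory_bytes: int,
                       line_bytes: int = 64, state_bits: int = 8) -> float:
    """Return directory storage as a fraction of the memory it tracks."""
    entries = memory_bytes // line_bytes          # one entry per cache line
    entry_bits = cores + state_bits               # presence vector + state
    directory_bytes = entries * entry_bits / 8
    return directory_bytes / memory_bytes

for cores in (64, 128, 256):
    pct = directory_overhead(cores, memory_bytes=1 << 30) * 100
    print(f"{cores:3d} cores: {pct:.1f}% overhead")
# 256 cores works out to ~51.6% of tracked memory, matching the ~50% figure above.
```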
2. Key Innovation: Hierarchical Directories
The paper introduces a three-level directory hierarchy (a sketch of how a request walks these levels follows the list below):
2.1 Level 1: Local Directories
- Track sharing within 8-core clusters
- Small, fast, co-located with L3 cache
- Handles 80% of coherence traffic locally
2.2 Level 2: Regional Directories
- Track sharing across 4 clusters (32 cores)
- Moderate size, shared among clusters
- Aggregates coherence requests
2.3 Level 3: Global Directory
- Sparse representation for system-wide sharing
- Only activated for widely-shared data
- Uses compressed bit vectors
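The sketch below illustrates how a write's invalidation lookup might resolve sharers at the lowest directory level that covers them. The class, method names, and dictionary-based bookkeeping are illustrative assumptions, not the paper's hardware structures; the cluster and region sizes follow the text above.

```python
CLUSTER_SIZE = 8     # cores per local (level-1) directory
REGION_SIZE = 4      # clusters per regional (level-2) directory

class HierarchicalDirectory:
    def __init__(self) -> None:
        self.local = {}       # (cluster_id, block) -> sharer cores in that cluster
        self.regional = {}    # (region_id, block)  -> clusters holding sharers
        self.global_dir = {}  # block -> regions with sharers (widely-shared data only)

    def add_sharer(self, core: int, block: int) -> None:
        cluster = core // CLUSTER_SIZE
        region = cluster // REGION_SIZE
        self.local.setdefault((cluster, block), set()).add(core)
        self.regional.setdefault((region, block), set()).add(cluster)
        regions = {r for (r, b) in self.regional if b == block}
        if len(regions) > 1:                        # sharing crossed a region boundary
            self.global_dir[block] = regions

    def sharers_to_invalidate(self, requester: int, block: int) -> set:
        """Resolve sharers at the lowest level of the hierarchy that covers them."""
        cluster = requester // CLUSTER_SIZE
        region = cluster // REGION_SIZE
        if block in self.global_dir:                                # level 3: system-wide
            clusters = set().union(*(self.regional.get((r, block), set())
                                     for r in self.global_dir[block]))
        elif self.regional.get((region, block), set()) - {cluster}:  # level 2: within region
            clusters = self.regional[(region, block)]
        else:                                                        # level 1: local cluster
            clusters = {cluster}
        victims = set().union(*(self.local.get((c, block), set()) for c in clusters))
        return victims - {requester}
```

In this sketch, sharing confined to the requester's cluster never consults the upper levels, which is the property the paper relies on to keep most coherence traffic local.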
3. Adaptive Coherence Granularity
Traditional protocols use fixed cache-line granularity. This work introduces adaptive granularity:
- Fine-grained (64B): For private or lightly-shared data
- Coarse-grained (512B): For widely-shared read-only data
- Dynamic switching: Based on observed sharing patterns (see the sketch after the benefits below)
Benefits:
- Reduces directory entries for read-only shared data
- Maintains fine-grained control for frequently modified data
- 40% reduction in directory traffic
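A minimal sketch of the switching decision is given below. The sharer-count and write-ratio thresholds, and all class and field names, are assumed values chosen for illustration; the paper's exact promotion heuristics are not reproduced here.

```python
from collections import defaultdict

FINE_GRAIN = 64      # bytes: default granularity for private or lightly-shared data
COARSE_GRAIN = 512   # bytes: granularity for widely-shared read-only data

class GranularityTracker:
    """Decide per coarse region whether to track coherence at 64 B or 512 B."""

    def __init__(self, sharer_threshold: int = 4, write_ratio_limit: float = 0.05):
        self.reads = defaultdict(int)
        self.writes = defaultdict(int)
        self.sharers = defaultdict(set)
        self.sharer_threshold = sharer_threshold      # "widely shared" cutoff (assumed)
        self.write_ratio_limit = write_ratio_limit    # "read-mostly" cutoff (assumed)

    def observe(self, addr: int, core: int, is_write: bool) -> None:
        region = addr - addr % COARSE_GRAIN           # align to the coarse region
        self.sharers[region].add(core)
        if is_write:
            self.writes[region] += 1
        else:
            self.reads[region] += 1

    def granularity(self, addr: int) -> int:
        region = addr - addr % COARSE_GRAIN
        accesses = self.reads[region] + self.writes[region]
        if accesses == 0:
            return FINE_GRAIN
        write_ratio = self.writes[region] / accesses
        widely_shared = len(self.sharers[region]) >= self.sharer_threshold
        read_mostly = write_ratio <= self.write_ratio_limit
        # One coarse entry replaces eight fine entries for read-mostly shared data.
        return COARSE_GRAIN if widely_shared and read_mostly else FINE_GRAIN
```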
4. Invalidation Aggregation
Instead of sending individual invalidations, the protocol aggregates them:
- Temporal Aggregation: Collect invalidations over a 10-cycle window
- Spatial Aggregation: Combine invalidations for adjacent cache lines
- Hierarchical Propagation: Send aggregated messages up the hierarchy
Result: 35% reduction in coherence message count. A simplified sketch of the aggregation logic follows.
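The sketch below combines the temporal and spatial steps. The (base address, byte length) message format and the window bookkeeping are assumptions made for illustration; the 10-cycle window and 64 B line size come from the text above.

```python
LINE_BYTES = 64
WINDOW_CYCLES = 10   # temporal aggregation window (value from the paper)

class InvalidationAggregator:
    def __init__(self) -> None:
        self.pending = []        # line addresses buffered in the current window
        self.window_start = 0

    def post(self, cycle: int, line_addr: int):
        """Buffer an invalidation; return merged messages once the window closes."""
        flushed = None
        if self.pending and cycle - self.window_start >= WINDOW_CYCLES:
            flushed = self.flush()
            self.window_start = cycle
        elif not self.pending:
            self.window_start = cycle
        self.pending.append(line_addr)
        return flushed

    def flush(self):
        """Spatially merge adjacent cache lines into (base, length) range messages."""
        if not self.pending:
            return []
        lines = sorted(set(self.pending))
        self.pending = []
        messages = []
        base = prev = lines[0]
        for addr in lines[1:]:
            if addr == prev + LINE_BYTES:   # adjacent line: extend the current run
                prev = addr
            else:                           # gap: close the run and start a new one
                messages.append((base, prev - base + LINE_BYTES))
                base = prev = addr
        messages.append((base, prev - base + LINE_BYTES))
        return messages
```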
5. Performance Evaluation
The authors evaluated the protocol on 64-, 128-, and 256-core systems:
5.1 Memory Overhead Reduction
- 64 cores: 45% reduction (vs. traditional directory)
- 128 cores: 58% reduction
- 256 cores: 62% reduction
5.2 Performance Impact
- Latency: 5-8% increase for remote accesses
- Throughput: 2-3% performance loss on average
- Bandwidth: 30% reduction in coherence traffic
5.3 Workload Analysis
- Scientific computing: Excellent results (minimal sharing)
- Database workloads: Good results (read-heavy sharing)
- Graph analytics: Moderate results (irregular sharing patterns)
6. Implementation Challenges
6.1 Hardware Complexity
- Requires three-level directory hierarchy
- Additional logic for granularity adaptation
- More complex coherence state machines
6.2 Software Implications
- OS page allocation affects clustering efficiency
- NUMA-aware scheduling becomes more important
- Memory allocators should consider sharing patterns
7. Industry Relevance
This work directly addresses real industry needs:
Datacenter Processors:
- Modern x86 processors: 64+ cores per socket
- High-end server CPUs: 56+ cores per socket
- ARM server processors: 128+ cores planned
Cost Implications:
- Directory memory is expensive (SRAM/eDRAM)
- A ~60% reduction (roughly 512 MB down to about 200 MB of directory state per GB of memory at 256 cores) enables cost-effective scaling
- Reduces power consumption of coherence logic
8. Future Directions
The paper opens several research avenues:
- Machine Learning Integration: Use ML to predict sharing patterns
- Heterogeneous Systems: Extend to CPU+GPU coherence
- Persistent Memory: Adapt for storage-class memory
- Security: Coherence-based side-channel mitigation
9. Critical Analysis
Strengths:
- Addresses real scalability bottleneck
- Comprehensive evaluation methodology
- Practical implementation considerations
Limitations:
- Increased hardware complexity
- Performance overhead for some workloads
- Limited evaluation of security implications
10. Key Takeaways
- Directory overhead is a major scalability bottleneck for manycore systems
- Hierarchical approaches can dramatically reduce memory requirements
- Adaptive granularity provides flexibility for different sharing patterns
- Small performance trade-offs can enable significant cost savings
- Industry adoption will depend on implementation complexity vs. benefits
This research represents a significant step toward practical 256+ core processors, making manycore systems economically viable for broader deployment.