High Performance Cache Replacement Using Re-Reference Interval Prediction
- Hide Paper Summary
Link: https://dl.acm.org/citation.cfm?id=1815971
Year: ISCA 2010
Keyword: RRIP; Cache Replacement; LRU
This paper proposes RRIP, a machanism for determining whether a cache line should be replaced by predicting the interval from the current time till its next reference. The motivation for RRIP is locality anomaly, which is not uncommon for lower level caches, such as L2 and LLC. The classical Least Recently Used (LRU) algorithm works pretty well, sometimes approximating Belady’s optimal OPT algorithm, for access patterns that have high locality, which is particularly true for L1 cache. On L2 and LLC, LRU may not work well for three reasons. First, most of the locality has been filtered out by L1, and only those misses the L1 cache can be observed by L2 and LCC. This property violates the assumption of LRU that both a cache miss and cache hit imply accesses to the same line in the near future. Second, when the size of the working set is larger than the cache, cache thrashing can happen. The most typical example is accessing an array in a circular manner. All accesses will result in cache misses, because the LRU algorithm always selects array elements that are smaller for eviction, which are accessed earlier and hence are closer to the LRU position. The third reason is scan pattern, which accesses non-temporal blocks. Scan pattern also violates the assumption of LRU, because cache misses bring in data that will never be accessed in the near future. Overall speaking, LRU is not suitable for L2 and LLC caches as a replacement algorithm, as a consequence of different locality assumptions.
Cache replacement algorithms can be described using policies. Cache replacement policies define the state transition when particular events take place. For example, the insert policy defines the status of a loaded and perhaps existing cache blocks when a block is loaded. Similarly, the hit policy and miss policy define the transition of states when a cache block is hit and when a miss is signaled respectively. In LRU, the insert and hit policy both put the block at MRU position, which is the furthest possible position from LRU. The miss policy simply evicts the block at LRU position. As stated above, the LRU policy works sub-optimally for lower level caches, because the locality on these levels usually discourage predicting references for hit and inserted blocks as to be in the near future.
RRIP fixes LRU by predicting newly inserted cache blocks to be accessed in the long future, therefore prioritizing them for victim selection on cache misses. To implement the policy, each cache block is associated with an M-bit saturating counter. In the paper M is chosen to be two. A value of zero indicates the block is expected to be accessed in the near future. The larger the value is, the longer the predicted interval of future accesses will be, and hence the more likely the block will be chosen for victim selection. A newly inserted block will have a counter value of (2M - 2). On victim selection, the cache controller attempts to evict the block with counter value (2M - 1). If such a block does not exist, then the counters of all blocks are incremented by one, and the controller will repeat until it finds one. On a cache hit, the controller can either reset the counter of the hit block to zero immediately, demonstrating strong belief that the block will be accessed in the near future, or act less aggressively and only decrement the counter by one. The former is called Hit Priority (HP) and the latter Frequency Priority (FP). It is noted in the paper that RRIP-HP may harm performance, because cache lines that are only accessed once after being inserted is expected to stay in the cache for a long time, which may not be true. RRIP-HP, on the other hand, prioritize the eviction of cache blocks based on their reference frequency. The entire scheme is called Static RRIP (SRRIP), because decisions are made statically without run-time information.
SRRIP is scan-resistant as long as the scan length is within a certain bound. The bound is determined by the working set size, the cache size, and the width of the counter. For short scans, since cache blocks are inserted with expected long reference interval, they should also be evicted before the working set. For long scans, however, given that the aging mechanism works by incrementing the counter of cache blocks until the eligible one is found, performance will be adversely affected because both temproal and non-temporal lines will have the same predicted interval. Furthermore, if the re-reference interval is larger than the cache size, then SRRIP still causes cache thrashing. All accesses will miss the cache.
To solve the problem with SRRIP, the paper also proses Dynamic RRIP (DRRIP), which dynamically chooses between SRRIP and Bimodal RRIP (BRRIP). Bimodal RRIP inserts the majority of new blocks with re-reference interval of (2M - 1), and with small probablity with an interval of (2M - 2). BRRIP can yield better result compared with SRRIP, because it at least presevres part of the working set in the cache. Neither SRRIP nor BRRIP work well under all circumstances. In this regard, DRRIP uses a selector to choose the one that works better using run-time information. DRRIP uses set-dueling as an evaluation framework for both replacement algorithms. A subset of cache sets are dedicated to SRRIP, another to BRRIP, and the rest just follows the current better one. DRRIP maintains two sets of counters to record the performance of both. The one that has better performance will be adopted for the follower sets.