Exploiting Compressed Block Size as an Indicator of Future Reuse
Link: https://ieeexplore.ieee.org/document/7056021/
Year: HPCA 2015
Keyword: CAMP; Cache; Compression
Highlight:
- Evicting small blocks should be discouraged, since several small blocks may need to be evicted to accommodate one larger block. In practice, the original replacement algorithm's priority ranking is combined with block size using the ratio pi / si.
- The block size class that should be given priority can be selected using set dueling, i.e. sample a few sets from the cache and dry-run each candidate policy on them; the policy that produces fewer misses is adopted for the next interval.
- Block size can sometimes serve as a reuse indicator, because blocks of regular content such as array elements or pointer values tend to be accessed in loops, which exhibit high spatial and temporal locality. When such a pattern is detected, eviction of blocks of the same size class should be discouraged.
This paper proposes Compression-Aware Management Policies (CAMP), a set of cache management policies specifically tuned for compressed caches. Cache compression has been proposed in prior work to reduce memory traffic and increase effective cache capacity. Its effectiveness, however, also depends on the cache management policy: conventional policies such as LRU, or commercially adopted policies such as RRIP, may not be optimal for compressed caches. To address this, more factors need to be taken into account when designing management policies for compressed caches.
The paper makes two critical observations. First, because compressed blocks have different, sometimes significantly different, sizes, Belady's OPT replacement algorithm may no longer minimize the number of misses: multiple cache blocks may need to be evicted to make room for a single larger block. In this case, cache replacement policies that implicitly or explicitly rely on future reuse distance may no longer perform well. The second observation is that cache blocks of similar compressed sizes often have similar access patterns, because of loops in program code and arrays in program data structures. For example, a nested loop iterating over array elements generates regular accesses to the array. In addition, arrays typically contain data of the same type, sometimes even within a small dynamic value range, such as pointers to small objects allocated by malloc(). Most malloc() implementations allocate from a pool of larger memory blocks and optimize for locality, so the resulting objects are close to each other in the address space and hence compress to similar sizes. Another example is a sparse matrix in which most elements are zero: most of its cache lines contain only one or two non-zero values and are therefore easily compressible to small, similar sizes.
The above two observations motivate CAMP, which consists of two policies: Minimum Value Eviction (MVE), an improved version of existing eviction policies, and Size-based Insertion Policy (SIP), which determines the eviction priority of blocks when they are fetched into the cache. The two policies complement each other. The initial eviction priority of a line brought into the cache is determined by SIP, while the replacement decision is made using a combination of the eviction priority and the compressed block size (as described below). Eviction priorities are also updated when cache blocks are hit.
In the baseline version of MVE, the paper assumes a conventional set-associative cache organized into sets and ways. No particular replacement algorithm is assumed, but the algorithm must maintain a priority value that ranks all ways in a set as eviction candidates, and the way with the highest priority value is evicted. In LRU, this value corresponds to the distance of the way from the top of the LRU stack (i.e. the way at the tail of the stack has the highest priority). In RRIP, the priority value is the re-reference prediction value, RRPV. On eviction, MVE computes a value Vi for each way i in the set using the formula Vi = pi / si, where pi is the inverse of the eviction priority and si is the size of the compressed block corresponding to the tag. This formula discourages the eviction of small blocks, since several small blocks may need to be evicted to make room for one large, barely compressible block, which reduces the effective cache size. Note that pi actually denotes the "importance" of the block: in LRU it is the distance of the block from the tail of the LRU stack, while in RRIP it is computed by subtracting the RRPV from the maximum RRPV. After computing Vi for each way i in the set, the cache controller evicts the way with the smallest Vi.
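The victim-selection step can be summarized in a few lines. Below is a minimal C++ sketch of MVE's choice for one set, assuming an illustrative Way structure that stores the importance value pi and the compressed size si; the names and layout are not from the paper.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One way (tag entry) of a compressed cache set. Field names are illustrative.
struct Way {
    bool     valid;
    uint32_t importance;   // p_i: e.g. LRU-stack distance from the tail, or (maxRRPV - RRPV)
    uint32_t size_bytes;   // s_i: compressed block size
};

// MVE victim selection: evict the way with the smallest V_i = p_i / s_i,
// i.e. prefer large blocks of low importance and discourage evicting small blocks.
std::size_t select_mve_victim(const std::vector<Way>& set) {
    std::size_t victim = 0;
    double best_v = 1e300;
    for (std::size_t i = 0; i < set.size(); ++i) {
        if (!set[i].valid) return i;  // free slot: no eviction needed
        double v = static_cast<double>(set[i].importance) / set[i].size_bytes;
        if (v < best_v) { best_v = v; victim = i; }
    }
    return victim;
}
```

The only change relative to the underlying policy is the division by si; if all blocks have the same size, the sketch degenerates to evicting the least important way.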
SIP works by setting up one or more "virtual caches" that are driven by accesses to a small subset of sets in the main cache. Each virtual cache runs the ranking algorithm with a different parameter (i.e. which size class is given higher insertion priority). A virtual cache does not hold any data; it consists of tags only and serves as a "dry run" of the candidate algorithm, maintaining statistics. During execution, the cache controller periodically starts a "sampling interval", in which accesses made to the sampled sets of the main cache are also routed to the virtual caches. Each virtual cache behaves differently based on the policy it is assigned. At the end of the sampling interval, the statistics of the virtual caches are compared with those of the main cache, and the policy with better performance is used for the next interval.
In the case of SIP, given n different compressed size classes and m sampled sets per virtual cache, the cache controller maintains n tag-only virtual caches of m sets each, as described above. Virtual cache i prioritizes blocks of size class i when they are fetched. For each virtual cache i, a counter Ci tracks the relative number of misses between the candidate policy and the main cache. When an access misses in a sampled set of the main cache but hits in the virtual cache, the counter is incremented to indicate that the candidate policy helps reduce misses. Conversely, if an access misses in the virtual cache but hits in the main cache, the counter is decremented to indicate that the policy hurts performance. If both hit or both miss, the counter is unchanged. At the end of the interval, the size class whose counter has the largest positive value is selected for the main cache; if no counter is positive, the main cache's policy is unchanged. During the next interval, the main cache inserts blocks of the selected size class with high priority: in RRIP, this translates into assigning an RRPV of zero, while in LRU it is equivalent to inserting the block at the head of the LRU stack.
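The counter bookkeeping is simple enough to show directly. The following C++ sketch maintains one counter Ci per candidate size class and follows the increment/decrement rules above; the SampledAccess type and function names are hypothetical, introduced only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Outcome of one access to a sampled set, observed both in the real cache and
// in one virtual (tag-only) cache running a candidate insertion policy.
struct SampledAccess {
    bool main_hit;     // did the access hit in the main cache?
    bool virtual_hit;  // would it have hit under the candidate policy?
};

// Update the per-candidate counter C_i: the candidate gains credit when it
// avoids a miss the main cache suffered, and loses credit when it would have
// caused a miss the main cache avoided.
void update_counter(int32_t& counter, const SampledAccess& a) {
    if (!a.main_hit && a.virtual_hit) ++counter;       // candidate helps
    else if (a.main_hit && !a.virtual_hit) --counter;  // candidate hurts
    // both hit or both miss: no change
}

// At the end of a sampling interval, pick the size class whose counter has the
// largest positive value; -1 means keep the current policy unchanged.
int pick_size_class(const std::vector<int32_t>& counters) {
    int best = -1;
    int32_t best_val = 0;  // must be strictly positive to win
    for (std::size_t i = 0; i < counters.size(); ++i)
        if (counters[i] > best_val) { best_val = counters[i]; best = static_cast<int>(i); }
    return best;
}
```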
The paper also proposes porting CAMP to the V-Way cache design, yielding G-MVE and G-SIP. The V-Way cache features a global replacement policy with decoupled tag and data arrays. Instead of statically binding one data slot to each tag, the V-Way cache doubles the number of sets in the tag array and uses an indirection field to allow a tag to address an arbitrary slot in the global data array. This way, an eviction can often be satisfied by reclaiming a data slot from another, less frequently accessed set and then associating the incoming tag with that slot, effectively increasing the number of ways available to the set under pressure. To make replacement decisions, each data slot has a counter tracking the access frequency of the block; the counter is incremented on hits. On eviction, a pointer (PTR) scans the global data array with wrap-around: the cache controller evicts the first block whose counter is zero, decrementing every non-zero counter it passes over.
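For reference, here is a minimal sketch of the baseline V-Way reuse-based victim search just described, assuming an illustrative array of per-slot reuse counters (non-empty) and a persistent scan pointer; the structure and names are not from the paper.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-data-slot reuse counters of the baseline V-Way replacement policy.
// Counters are incremented on hits elsewhere; the victim is found by scanning
// from a persistent pointer with wrap-around. Assumes `reuse` is non-empty.
struct VWayDataArray {
    std::vector<uint8_t> reuse;  // one counter per global data slot
    std::size_t ptr = 0;         // scan pointer, persists across evictions

    // Return the first slot whose counter is zero, decrementing every non-zero
    // counter that is passed over so that each slot eventually expires.
    std::size_t find_victim() {
        for (;;) {
            std::size_t slot = ptr;
            ptr = (ptr + 1) % reuse.size();
            if (reuse[slot] == 0) return slot;
            --reuse[slot];
        }
    }
};
```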
G-MVE improves the above replacement algorithm as follows. Instead of using only the frequency value, the cache controller computes Vi = pi / si, where pi is the frequency counter and si is the compressed block size. At most 64 values of Vi are scanned, and the block with the minimum Vi is evicted, repeatedly if necessary, until there is sufficient room for the new block. The scan is limited to 64 blocks for two reasons. First, the algorithm terminates within a fixed number of cycles. Second, in the most extreme case, 64 one-byte blocks need to be evicted to make room for an uncompressed 64-byte line.
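A sketch of the G-MVE selection under these assumptions is shown below; the 64-entry bound and the Vi = pi / si ranking follow the description above, while the Slot layout and function name are illustrative. Repeated invocations (updating the scan pointer and freeing the chosen slot) would be needed when a single eviction does not make enough room.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Slot {
    uint32_t reuse;       // p_i: V-Way reuse/frequency counter
    uint32_t size_bytes;  // s_i: compressed block size
};

// G-MVE victim selection sketch: scan at most 64 slots starting from `ptr`
// (with wrap-around) and return the slot with the minimum V_i = p_i / s_i.
// Bounding the scan keeps eviction latency fixed; 64 also covers the worst
// case of evicting 64 one-byte blocks for one uncompressed 64-byte line.
std::size_t gmve_select(const std::vector<Slot>& data, std::size_t ptr) {
    std::size_t best = ptr;
    double best_v = 1e300;
    for (std::size_t n = 0; n < 64 && n < data.size(); ++n) {
        std::size_t i = (ptr + n) % data.size();
        double v = static_cast<double>(data[i].reuse) / data[i].size_bytes;
        if (v < best_v) { best_v = v; best = i; }
    }
    return best;
}
```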
G-SIP cannot directly reuse SIP's set sampling, since sets are no longer isolated from each other when replacement decisions are global. The paper instead proposes a form of set dueling over the global data array. With n size classes, both the tag array and the data array are partitioned into (n + 1) regions, and G-SIP enforces the rule that replacement decisions are made only within the region that the tag belongs to. The paper claims this has minimal impact on performance, given a large number of sets. Cache lines brought into region i are inserted with high priority if their compressed size falls into size class i. The last region runs the default V-Way cache replacement algorithm, in which a new line is always inserted with the minimum priority. A miss counter is associated with each region and is incremented when a miss occurs on a tag within that region. The cache controller periodically starts a training phase and terminates it after a short period; the duration is chosen such that training occupies only a small fraction of run time. At the end of the interval, the size class whose region has the smallest miss counter value is selected and used for priority assignment in the following interval.
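At the end of a training interval, the selection boils down to a minimum over the per-region miss counters. The sketch below assumes the baseline region is excluded when mapping the winner back to a size class, which is an interpretation of the description above rather than a detail confirmed by the paper.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Pick the winning size class from per-region miss counters gathered during a
// G-SIP training interval. Regions 0..n-1 each bias insertion toward one size
// class; the last region runs the unmodified V-Way policy and is excluded here
// (an assumption, see the lead-in above). Illustrative sketch.
int gsip_pick_class(const std::vector<uint32_t>& region_misses) {
    if (region_misses.size() < 2) return -1;     // need at least one biased region
    auto last_biased = region_misses.end() - 1;  // exclude the baseline region
    auto it = std::min_element(region_misses.begin(), last_biased);
    return static_cast<int>(it - region_misses.begin());
}
```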