A Fully-Associative Software-Managed Cache Design



Paper Title: A Fully-Associative Software-Managed Cache Design
Link: https://dl.acm.org/doi/10.1145/339647.339660
Year: ISCA 2000
Keyword: Cache; IIC; Indirect Indexed Cache




Questions

  1. This paper uses a different definition of full associativity. In other publications, and even in the earlier part of this paper, it means that an address can be stored anywhere in the entire cache. But later it is revealed that the proposed design is just a chaining extension of a regular 4-way set-associative table, which IS NOT FULLY ASSOCIATIVE AT ALL! It still takes a hash function to map an address to only a subset of the tags. This can arguably be justified by the fact that the primary table is far smaller than the tag pool, but again, it is not fully associative.
  2. IIC does not save the tag CAM lookup overhead at all. The first primary table lookup is still a 4-way set-associative lookup without any optimization. The only optimization is that the data array is not accessed in parallel, at the cost of longer latency.
  3. The idea of not statically binding tags and data does not contribute to the overall merit. In fact, there must be an equal number of tags and data slots, otherwise it makes no sense (this is not virtual memory). On the other hand, IIC could be implemented even if each tag were statically bound to a data slot.
  4. The paper does not discuss how an empty tag can be found when a new one is to be allocated. Given the large number of tags, this process is likely to be time-consuming without some clever data structure.
  5. So essentially this paper only describes a cache organization with flexible way sizes and a novel replacement algorithm. Even so, the paper makes unrealistic assumptions about the hardware environment running the replacement algorithm.

This paper proposes the Indirect Index Cache (IIC), a fully associative L2 cache design featuring a better hit rate and more flexible replacement policies. The paper begins by identifying similarities between cache memory and virtual memory. Both were proposed and implemented as a means of dealing with the increasing speed gap between components in the memory hierarchy. The difference is that in commercial products, cache memories, at all levels, are typically implemented as a hardware-managed array of tags and data. Tags are statically mapped to data slots. In order to access data using an address, the tags (or more likely, a subset of the tags) need to be searched until a matching entry is found, after which the data slot at the same offset is accessed. Virtual memory, on the other hand, works differently. There is no fixed binding between a virtual page and a physical page frame (although some implementations may enforce one to aid virtually addressed caches or to improve performance). Any physical page frame can hold any virtual page. Hardware must instead consult a dedicated mapping structure to acquire the mapping.

Being able to map cache lines in a fully associative manner benefits cache memory design and application performance, as pointed out by the paper, for three reasons. First, a fully associative cache offers a better chance of finding an ideal victim for replacement on capacity misses. In traditional caches, the victim must be found within the same set as the address being accessed. A fully associative cache treats all lines in the cache as candidates, allowing more sophisticated replacement algorithms to be used. Second, some advanced features can be implemented easily on a fully associative cache, such as cache line pinning and hardware transactional memory. Cache line pinning “locks” a certain cache block in the set, either to retain atomicity of data operations on that block or to protect the line from cache thrashing. With limited associativity per set, locking a line in the cache always means losing some of the flexibility of replacement. In the worst case, all cache lines in the set are locked, and the hardware is trapped in a deadlock when a new cache line is to be brought into the same set. The same applies to hardware transactional memory, such as Intel TSX, in which a set conflict (too many transactionally written lines in a set) will abort the transaction and block progress. With a fully associative cache design, neither of the above two use cases becomes a problem until all lines in the cache are locked, which is an extremely rare corner case. The last reason is that cache partitioning can be implemented more efficiently on a fully associative cache. On regular caches, partitioning is often done by ways, which limits both the granularity and the number of partitions.

The paper identifies two challenges that must be solved for fully associative caches. The first is efficient address mapping and cache lookup. Classical caches typically use direct mapping, a Content-Addressable Memory (CAM), or both to generate the index of the data slot, and then access the data. For a fully associative cache, the entire CAM array must be searched, since an address can potentially be cached anywhere in the array, which is impractical in terms of both energy and decoding logic. The second challenge is to design and implement an efficient replacement algorithm that copes with the large associativity and the special access pattern of the L2 cache. Compared with regular caches, IIC needs to select a replacement candidate among tens of thousands of lines, which cannot be done efficiently with plain LRU. Besides, LRU is not particularly good for L2 caches, since the L2 only observes a filtered subset of the actual access trace, in which most locality has already been absorbed by the L1 level.

IIC is organized as a chaining hash table. A tag in IIC consists of several fields. The address field stores the address tag and supports CAM-style lookup by the read logic. The coherence state and other state fields store regular coherence and internal state information, which is orthogonal to the cache organization. The most important fields are the “chain” field, which stores the next-element pointer of the chaining hash table, and the “repl” field, which stores replacement metadata. The chain field only needs a sufficient number of bits to index all tags (12 bits in the paper), while the “repl” field contains 32 bits for organizing the replacement FIFO lists, as we will see below.
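
To make the field layout concrete, here is a minimal C sketch of what one IIC tag entry might look like, assuming the widths quoted above (a 12-bit chain index and 32 bits of replacement metadata); the field names and the explicit data-slot index are illustrative, not taken from the paper.

```c
#include <stdint.h>

/* Illustrative IIC tag entry, assuming the field widths quoted above.
 * Names are hypothetical; the paper only describes the fields' roles. */
typedef struct iic_tag {
    uint64_t addr_tag;   /* address tag compared during lookup           */
    uint8_t  coh_state;  /* coherence state (orthogonal to organization) */
    uint8_t  ref_bit;    /* reference bit, set by hardware on a hit      */
    uint16_t chain;      /* 12-bit index of the next tag in the chain    */
    uint32_t repl;       /* 32 bits of replacement metadata (FIFO lists) */
    uint16_t data_idx;   /* index of the data slot this tag points to    */
} iic_tag_t;
```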

The IIC works as a hardware hash table using chaining for conflict resolution. Like a software chaining hash table, the IIC hardware consists of two parts. The first part is the primary table, which is organized like a regular 4-way set-associative cache (with a significantly smaller number of sets). An incoming address is hashed into one of the sets, called a “bucket”, and a regular CAM lookup is performed within the set. If none of the address tags match, instead of declaring a cache miss as in a regular cache, the hardware continues searching the chained tags until the end of the chain is reached or a hit is signaled. Note that all tags within a bucket share one “chain” field, which is stored only once. The hardware tag walker repeatedly follows the “chain” field to the next tag, compares the address, and signals a hit if the addresses match. The data slot is read only after a hit is signaled, i.e., data access is serialized after tag comparison. If a miss is signaled, the software handler is invoked to evict a line, after which the controller fetches the missing address.
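
The lookup flow can be summarized with a short C sketch. The table sizes, the hash function, and the sentinel value are assumptions for illustration; only the two-level structure (a 4-way primary bucket followed by a walk of the collision chain) follows the paper's description.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS       4
#define NUM_SETS   256        /* illustrative sizes, not the paper's      */
#define NUM_TAGS   4096       /* 12-bit tag index space                   */
#define CHAIN_END  0xFFFFu    /* sentinel: end of a collision chain       */

typedef struct {
    uint64_t addr_tag;
    bool     valid;
    uint16_t chain;           /* next tag index, or CHAIN_END             */
    uint16_t data_idx;
} tag_t;

typedef struct {
    tag_t    ways[WAYS];      /* primary-table bucket, probed like a set  */
    uint16_t chain;           /* shared head of the bucket's chain        */
} bucket_t;

static bucket_t primary[NUM_SETS];      /* chain fields assumed to be     */
static tag_t    tag_pool[NUM_TAGS];     /* initialized to CHAIN_END       */

static unsigned hash_addr(uint64_t addr_tag) { return addr_tag % NUM_SETS; }

/* Returns the index of the data slot on a hit, or -1 on a miss.  The
 * primary bucket is probed like a 4-way set; on no match, the tag
 * walker follows the chain until a match or the end of the chain. */
int iic_lookup(uint64_t addr_tag)
{
    bucket_t *b = &primary[hash_addr(addr_tag)];

    for (int w = 0; w < WAYS; w++)
        if (b->ways[w].valid && b->ways[w].addr_tag == addr_tag)
            return b->ways[w].data_idx;              /* primary-table hit */

    for (uint16_t i = b->chain; i != CHAIN_END; i = tag_pool[i].chain)
        if (tag_pool[i].valid && tag_pool[i].addr_tag == addr_tag)
            return tag_pool[i].data_idx;             /* hit after a walk  */

    return -1;                   /* miss: software handler picks a victim */
}
```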

The software handler running the replacement algorithm can either be programmed into the cache controller or run as a dedicated hardware thread on the processor. The paper does not describe how the cache hardware would interact with a software handler running on the CPU: the cache's internal state is necessary for the replacement decision, yet it is normally inaccessible because the cache is supposed to be transparent.

The paper identifies four major overheads of IIC. The first is tag storage overhead, since IIC uses almost 2x more tag bits than a regular 4-way set-associative cache. The paper claims, however, that tag storage is always insignificant in the overall cache storage, since the majority of the SRAM is dedicated to the data array. The second overhead is serialized tag and data access. This may add a few extra cycles to the critical path, since a regular cache can access the tag and the data slot in the same cycle (and discard the data if no hit is signaled). The paper justifies this with the fact that some commercial designs already serialize tag and data access to reduce energy consumption. The third overhead is the chaining and tag walk overhead. The paper proposes that after each successful lookup, the tag that actually hits (if it is in the chain) should be moved to the first entry of the collision chain, as sketched below. This way, the more frequently a tag is hit, the fewer steps it is expected to take before it is found. The last overhead is the software handler overhead, which may incur significantly longer latency than a regular LRU policy on low-associativity caches. To alleviate this, the paper proposes that the DRAM access for a miss be initiated while the handler is still running, so the two can be overlapped.
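
A possible sketch of the chain-reordering heuristic: after a hit in the collision chain, the hitting tag is relinked to the chain head so that hot tags are found in fewer walk steps. The array layout and sentinel value are assumptions (consistent with the lookup sketch above); in hardware the predecessor is already known from the walk, so it is passed in rather than searched for.

```c
#include <stdint.h>

#define NUM_TAGS  4096
#define CHAIN_END 0xFFFFu

static uint16_t next[NUM_TAGS];   /* per-tag "chain" pointers */

/* Relink the tag that just hit ('hit') to the front of its bucket's
 * collision chain.  'prev' is the tag that pointed to it during the
 * walk, or CHAIN_END if 'hit' was found in the primary bucket or is
 * already the chain head, in which case nothing needs to move. */
void chain_move_to_front(uint16_t *chain_head, uint16_t hit, uint16_t prev)
{
    if (prev == CHAIN_END || *chain_head == hit)
        return;                          /* nothing to reorder           */
    next[prev]  = next[hit];             /* unlink the hitting tag       */
    next[hit]   = *chain_head;           /* splice it at the chain head  */
    *chain_head = hit;
}
```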

The paper also proposes the Generational Replacement algorithm as a replacement for the commonly used LRU (or its variants). The observation is that the L2 cache only sees a filtered trace in which most locality is no longer present. Given this, LRU does not work well, since a frequently accessed cache line may stay in the L1 for the majority of the time and therefore not be touched in the L2 for a long period. Page replacement algorithms from virtual memory management also do not transfer well, since a page table's access bit is set on every memory access going through the TLB, which exposes locality far better than what the L2 cache can observe.

The generational algorithm leverages the concept of access frequency. Instead of relying only on the last time an address was accessed in the L2 cache, the frequency of access is also important: a “hot” cache line may only occasionally be evicted from the L1, but as long as it remains “hot”, there should be a steady stream of accesses to the L2 for that address, regardless of when the line was last accessed. The generational algorithm captures this by maintaining multiple LRU stacks. The paper gives an example with four stacks (called “pools” in the paper): stack 0 has the lowest priority and is the one considered for eviction, while stacks 1 to 3 have increasing priority. When a block is at the head of stacks 1 to 3 and its reference bit is not set, the algorithm moves the block's tag down one level, to the tail of the next lower-priority stack, decreasing its priority and increasing the likelihood that the line will be evicted if it is not accessed within a certain number of misses. On each cache miss, the algorithm examines the head of each stack and moves it to the next lower-priority stack as described above if the reference bit is clear, or to the next higher-priority stack if the bit is set, to honor the likelihood of a steady stream of future references. The line pushed out of the lowest-priority stack is selected as the eviction victim. No action is needed on a cache hit, except that the hardware sets the reference bit automatically. (I think the author is simply trying to use one bit to approximate coarse-grained frequency tracking, instead of actually recording each reference from the L1 and then performing a global minimum operation.)
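
The miss-time bookkeeping might look roughly like the following C sketch, which models each pool as a simple FIFO; in the actual design the pools are intrusive linked lists threaded through the 12-bit indices in each tag's repl field, and the new-block pool (next paragraph) is omitted. Pool sizes, names, and the behavior when the lowest pool's head was referenced are assumptions.

```c
#include <stdbool.h>

#define NUM_POOLS 4            /* pool 0 = lowest priority (eviction side) */
#define POOL_CAP  1024         /* illustrative capacity per pool           */

typedef struct { int line; bool ref; } entry_t;   /* line id + ref bit     */

typedef struct {               /* simple FIFO standing in for the paper's  */
    entry_t q[POOL_CAP];       /* intrusive linked lists                   */
    int head, count;
} pool_t;

static pool_t pools[NUM_POOLS];

static entry_t pool_pop(pool_t *p)
{
    entry_t e = p->q[p->head];
    p->head = (p->head + 1) % POOL_CAP;
    p->count--;
    return e;
}

static void pool_push(pool_t *p, entry_t e)
{
    p->q[(p->head + p->count) % POOL_CAP] = e;
    p->count++;
}

/* On each miss, look at the head of every pool: a referenced head is
 * promoted one pool up (ref bit cleared), an unreferenced head is
 * demoted one pool down, and whatever falls out of pool 0 becomes the
 * victim.  Returns the victim's line id, or -1 if pool 0's head had
 * been referenced (a real implementation would keep searching). */
int generational_select_victim(void)
{
    int victim = -1;

    for (int p = 0; p < NUM_POOLS; p++) {
        if (pools[p].count == 0)
            continue;
        entry_t e = pool_pop(&pools[p]);
        if (e.ref) {
            e.ref = false;
            int up = (p + 1 < NUM_POOLS) ? p + 1 : p;
            pool_push(&pools[up], e);       /* promote: recently referenced */
        } else if (p == 0) {
            victim = e.line;                /* pushed out of the lowest pool */
        } else {
            pool_push(&pools[p - 1], e);    /* demote toward eviction        */
        }
    }
    return victim;
}
```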

New cache lines are treated differently, because they are supposed to stay in the L2 cache at least for a while so that the algorithm can determine whether they will be referenced again. The algorithm therefore also has a new-block stack, which only cycles cache lines within itself and is never considered for eviction. After a certain period of time, cache lines in this stack are moved out and inserted into one of the normal stacks. Time is tracked with timestamps using a centralized timestamp counter.
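
A tiny sketch of the timestamp check that could gate graduation out of the new-block stack; the threshold and helper name are hypothetical, and the unsigned subtraction keeps the comparison robust to counter wrap-around.

```c
#include <stdbool.h>

#define NURSERY_AGE 256u   /* hypothetical threshold, not from the paper */

/* A new block only graduates into the normal priority pools once enough
 * time has elapsed since it was allocated.  'now' is the centralized
 * timestamp counter; 'alloc_ts' was sampled when the line was filled.  */
bool ready_to_graduate(unsigned now, unsigned alloc_ts)
{
    return (now - alloc_ts) >= NURSERY_AGE;   /* wrap-safe unsigned delta */
}
```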

The implementation of the generational algorithm requires one reference bit, two 12-bit index fields to form the doubly linked lists of the LRU stacks, and a timestamp for the new-block stack.
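
One way these fields could be packed into the 32-bit repl word mentioned earlier; the exact layout and the timestamp width are assumptions, since the paper only gives the individual field sizes.

```c
#include <stdint.h>

/* Hypothetical packing of the per-tag replacement metadata: a reference
 * bit, two 12-bit indices forming the doubly linked pool lists, and a
 * small timestamp for the new-block stack (1 + 12 + 12 + 7 = 32 bits). */
typedef struct {
    uint32_t ref  : 1;    /* reference bit set by hardware on a hit      */
    uint32_t prev : 12;   /* previous tag in this pool's linked list     */
    uint32_t next : 12;   /* next tag in this pool's linked list         */
    uint32_t ts   : 7;    /* allocation timestamp for the new-block pool */
} repl_meta_t;
```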