The ZCache: Decoupling Ways and Associativity
- Hide Paper Summary
Link: https://ieeexplore.ieee.org/document/5695536/
Year: MICRO 2010
Keyword: zCache
This paper proposes ZCache, a cache array design that decouples ways from associativity. Traditional set-associative cache lookup and replacement policies are based on the concept of sets, which consists of a few ways. All addresses are mapped to individual sets, in which the address can be stored by any of the way. The paper points out that this organization prevents an optimal or near optimal replacement decision from being made, since the replacement policy, regardless of which policy it is as long as the domain is within the set, can only evict an existing block from the current set, while some where in the cache a better candidate may exist.
The paper identifies the root cause of non-optimal or even bad replacement decisions as coupled cache replacement policy and cache line placement policy on a miss. Associativity is defined for both replacement and placement. When a miss occurs and the target line is fetched from the lower level, the placement policy dictates which blocks could be used to store the line, such that a later associative lookup only needs to compare tags in these slots without a chance of “missing” it. When a set conflict miss occurs, the replacement policy determines, from a pool of candidates, which lines should be evicted for the new line. In a traditional set-associative cache, these two policies are equivalent: The replacement policy always selects a block in the current set the address is mapped into, and the newly fetched line is also always stored in the same slot, such that cache lookup operations only compare tags in the current set.
One easy way of increasing the chance that a better candicate can be selected during eviction is to increase the associativity, i.e. the number of ways, in the cache organization. This has been done on commercial products where the LLC is highly associative, featuring more than 16 ways. Having a larger set to select candidates from of course enables better decisions to be made, but this benefit comes at a cost: During cache lookup, all 16 way tags and data in the set need to be read out from the SRAM register file, and then compared with the requested address, even though only at most one of the 16 data reads will be useful. On modern hardware, such parallel SRAM read is a major source of cache power consumption and heat dissipation, and is likely not to be scaled further for the next genetation product.
ZCache decouples associativity from replacement by allowing a larger number of blocks to be selected as candicates beyond the current set. A “set conflict” is defined as two addresses having a non-zero chance of being mapped onto the same physical slot. This “conflict” relation is transitive for traditional cache organization, since one address can only be statically mapped to one set using the lower bits in the requested address. Using this definition, when a miss occurs, the cache line to be evicted must be within the transitive closure of the “conflict” relation using the current content of the set, since the new line being fetched from lower leval can only be stored in one of the conflicting addresses’ slots.
ZCache extends the semantics of “set conflict” by allowing one address being statically mapped to more than one sets. This way, the set conflict relation is not longer transitive, since address A, B conflicting and B, C conflicting do not necessarily imply that A and C must also conflict, since A, B and B, C may conflict on different sets. This way, the transitive closure of the transitive relation is much larger than in the first case, since more cache lines can be included while we “reach out” to other conflicting addresses by following the relation.
In practice, ZCache can be implemented as follows. The tag and data slots are partitioned into equal-sized parts of size W, called “ways”. Given a N-way partition, the cache also defined N hash functions (preferably) for any given address. The N hash functions map the address into one of the W slots on each way respectively (hash function Hi only maps the address for wayi). Cache lookup requires reading the tags of each slot from each way, and compares that with the requested address. If any one of the slots indicate an address match, a hit is signaled. Otherwise, it will be a cache miss. The physical implementation of ways can just be SRAM register files with one read port. Each way has its own read port and R/W logic to support parallel tag comparison.
Recall that according to the definition of set conflicts, an address A stored on way i conflicts with other addresses B stored on way j slot k, if and only if A also hash into slot k on way j. In other words, the conflict set of any address stored in the cache consists of all addresses that happen to be hashed into the same slot on other ways. An address tag can always be found by the lookup as long as it is stored in one of the conflicting slots.
When a cache miss occurs, one existing line in the cache must be evicted for replacement. In the classical way of performing eviction, the algorithm simply find the conflicting addresses for the requested address A, and evict one of them (e.g. B) to make space. ZCache goes one step forward by allowing a cache line (e.g. C) in the conflicting set of conflicting addresses be evicted also, after which the conflicting address B is moved to the secondary conflicting slot of C, and the requested address is moved to the slot of B. This can be applied recursively for several iterations as long as tags are always repositioned into its conflicting addresses, after the latter has been evicted or repositioned. The paper calls this process a “tag walk”, and points out that this is essentially a BFS search on the “set conflict” relation.
The actual implementation of the tag walk is described as follows. A small candidate buffer implemented as a single-port SRAM is added to the cache controller. On signaling a miss, the cache controller computes the conflicting set for each level in a breadth-first manner. It first schedules reading the first level W conflicting tags of the requested address by reading into the banked register file holding tags. After the output is available, these tags are then inserted into the candicate buffer since they serve as candidates of eviction. In the meantime, the output is also used for scheduling for reading next-level conflicting set. This iterative process stops when a certain threshold (in terms of levels) is reached, at which time the candicate buffer holds a list of possible candicates for eviction.
Values in candicate buffers are pointers to the tag (i.e. way ID and slot ID). These values also implicitly form a tree structure. No explicit parent is stored, since the tree is of fixed fan-out, and parents can be implied by the index of an element. We need parent information because it is used for repositioning.
The actual replacement algorithm proposed by the paper is LRU. Conventional LRU implementation does not work for ZCache, since there is no set ordering unlike a conventional set in which an LRU stack can be formed. The paper proposes using simple timestamp based LRU. A centralized counter dispenses the current timestamp with regard to cache operations. When a hit or line fetch occurs, the timestmap is written into the tag of the line. During eviction, the cache controller scans the candidate buffer, and evicts the line with the smallest timestamp. Repositioning is performed after the line (which is not necessarily at leaf level) is evicted by reading the parent node and store it into the child that was just evicted or repositioned. Although this tag walk, scan and repositioning process will have larger latency than simple LRU, the paper claims that such latency can be overlapped with line fetch on cache misses, which, in the case of LLC, will take hundreds of cycles.