Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines



- Hide Paper Summary
Paper Title: Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines
Link: https://dl.acm.org/doi/10.1109/HPCA.2007.346202
Year: HPCA 2007
Keyword: Cache; LDIS; Line Distillation



Back

Highlight:

  1. Improves dynamic associativity by partial caching. We can treat the partial lines as logical cache lines.

  2. Use set dueling to make policy decisions on whether word level caching should be used. Although set dualing is not specialized to LDIS, it is impressive how this mechanism can help. (i.e. If you are not sure how to make a policy decision, then try set dueling with duplicated tags conducting “dry runs” of the workload, and then compare the performance of different schemes based on the stats collected from each tag set.)

Questions

  1. Some statements are self-conflicting. For example, in Sec. 5.2 it is said that when there is a fetch request from L1, the L2 controller should mark the footprint bit. This, however, conflicts with the previous statement in Sec. 4.1 saying that the footprint is updated when L1d write back happens. And in practice it is unnecessary to do that on L1 fetch, unless the L2 is non-inclusive.

  2. Handling dirty eviction might be more complicated than it seems to be. If you retain dirty words in the WOC, then DRAM should do a read-modify-write to only write back partial dirty words, which may incur coherence problem if another device, e.g. DMA or another core, accesses the DRAM. If you always wb full cache line before sending it to WOC, then WOC only reduces read/write misses in L2. The paper seems to suggest the former, i.e. dirty lines are partially retained. In this case DRAM contains partial dirty line, which complicates coherence.

  3. What if more than one 8 byte words are accessed by L1? How does the circuit compare more than one words in the tag array? One way is to always flush the WOC and perform a line fill, even if all accesses hit WOC.

  4. I can’t see how head bit helps storage allocation. The lookup circuit can always just read out all tags in parallel and then find the range based on addresses, unless you want to save energy by not comparing address tag per lookup (which is nonsense because you have to do it anyway for tag matching).

  5. Changing L1 to the sector cache design is somewhat an overkill. I would say it is better to just not cache the partially hit line of WOC in L1. This also solves problem (6) below.

  6. How are write backs of partial lines handled? Are they entered into the WOC? Do they trigger a write back? Not mentioned at all.

This paper proposes line distillation (LDIS), a technique for dynamically increasing cache associativity. This paper identifies that in most workloads, not all words in a cache block are accessed equally by the processor before the line is evicted from the cache. This motivates the design that cache lines are only partially evicted, and words that are likely to be reused in the near future be cached by a small word level store. Since only partial data is stored per line, more lines can be “squeezed” into a physical slot, increasing effective associativity. In addition, the paper also points out that the chances that unused words in a cache line be accessed later is highly related to the position of the line in the LRU stack. For lines that are closer to the LRU position within a set, it is less likely that unused words will be accessed later. This further justifies the design that only previously accessed words in evicted lines are stored by the word level cache, while other words are simply discarded.

The LDIS design consists of two independently operating caches. The first cache, called the Line Organized Cache (LOC), functions as a conventional set-associative cache. Note that although the paper uses LRU and LRU stack location to illutrate the relationship between spatial locality and access frequency of cache lines, the conventional cache can use any replacement algorithm. The second cache, called the Word Organized Cache (WOC), is a logical cache of a maximum associativity of 8, but share the data slot among all eight tags to store only frequently used words in the logical cache lines. The actual runtime associativity of a particular way of the WOC is dependent on the layout of data in the physical storage. In practice, the L2 cache is divided to implement both LOC and WOC. For example, for a 4-way set associative L2, the first three ways are dedicated to LOC, whose operation is unchanged. The last way can be dedicated to the WOC for storing at most eight logical lines, achieving a maximum dynamic associativity of eleven. The indexing scheme of WOC is the same as LOC. Each set of the WOC is only used to store cache lines that are mapped to the same set in the LOC.

In order to accommodate eight logical lines in the WOC, seven more tags are added to each WOC entry. These tags contain not only the conventional address tags, which identifie the line address, but also contain the word offset within the cache line. The WOC lookup circuit should compare the word to be accessed, which is passed down from the L1 cache, with the word offset in the tag array. A hit is signaled only if both the address tag and the word offset match.

To simplify storage management of the physical slot, data cannot be laid out arbitrarily. Several constraints are made to avoid fragmentation and compaction of data. First, single word data can be placed anywhere in the physical slot, as long as it is aligned to the word boundary. Tags are statically associated with words in the data slot, which means the tag is also updated to the address and offset of the single word. Second, multi-word data from an evicted line can only be stored in a consecutive, power-of-two size granularity, i.e. 2, 4, or 8 words, which must also form a consecutive range in the original line. This is to prevent external fragmentation, since two empty entries of the same class can always be combined to form a larger entry of the next size class. The last constriant is that a segment of size K must also align to K words boundaries in the physical slot. This is also to avoid external fragmentation, since segments are stored compactly for most of the time if they are properly aligned. To track the starting word of a segment, the paper also proposes adding a bit, called the “head-bit” bit, to each tag in the tag array. The head bit is set if the tag maps the first word of a segment from a logical line in the physical slot.

In order to track words that are accessed by the processor after a cache line is fetched into the L2, the paper proposes adding a bit vector, called the “footprint”, to each entry in the LOC and the L1 data cache. The footprint vector is flash-cleared when a slot is filled with a cache line fetched from the lower level. On every L1 access, the corresponding bit in L1 footprint vector is set. On L1 write backs, the footprint vector is also sent back to L2 by the eviction request. The L2 controller will combine this footprint vector to the vector of the same address using bitwise OR. On L1 misses, the MSHR should communicate the word offset of the access to L2, which will then set the corresponding bit in L2 entry’s footprint vector before serving the line.

Since partial line data might be returned from L2 accesses, the paper also proposes extending each entry of the L1 data cache with a bit vector indicating the presence of words on the corresponding offset. Accesses to non-existing words are treated as misses, which are then forwarded to the L2.

On a lookup operation, both LOC and WOC are probed in parallel. The paper lists four possible outcomes of the lookup operation: (1) The access hits LOC; (2) The access hits a partial word in WOC; (3) The access misses both; (4) The access hits an address tag in WOC, but misses the offset check. In the first two cases, the full or partial cache line are returned to the L1 respectively. The L2 controller initializes the valid bit vector for the partial line such that only valid words are marked. In the third case, a cache miss is signaled, which initiates a cache fill request to the lower level. The fetched line is always installed into the LOC with footprint bit vector cleared. In the last case, all entries in the WOC that correspond to the requested address are invalidated. In addition, a line fill request is sent to the lower level to fetch the full line. If any of the WOC entries is dirty, these dirty words are also applied to the full line just fetched from the lower level. The full line is then installed into the LOC. Write accesses are handled similarly. The paper, however, does not give any description on how partial writes from L1 is handled.

On an LOC eviction, the dirty words are collected, and then installed into the WOC. The WOC may need to evict a few logical lines to accommodate for the newly evicted block. The paper suggests that eviction decisions are made randomly to minimize the overhead. Furthermore, not all cache lines will be installed into the WOC when they are evicted. The paper proposes that only cache lines whose spatial locality is lower than the median of all lines are entered, while the rest are simply discarded to avoid polluting the WOC and hence decreasing the dynamic associativity. To collect statistics of the median, the L2 controller maintains eight counters, with counter i tracking the number of cache lines with i accessed words when evicted. An extra counter tracks the total number of lines evicted from the LOC. Only lines whose spatial locality represented by the number of bits in the footprint vector is lower than the median are installed into the WOC.

The paper also recognizes that not all workloads benefit from LDIS. For those workloads that tend to access unused bytes in LRU blocks, they demonstrate higher cache miss rates, since most accesses will be case (4), which triggers a miss anyway in the L2 cache, which actually reduces dynamic L2 associativity. The paper proposes turning off the WOC and returning the cache way used by it back to the LOC which then functions as a conventional L2 cache. The policy decision is made using set dueling: A few random sets (called “leader sets”) are selected from the L2 with their address tags duplicated (called “Auxiliary Tag Directory”, ATD), which conducts “dry runs” of the conventional L2 policy without LDIS. Statistics are collected from the dry run to determine whether switching policy would benefit performance by having a better statistics, or worsen it. In our case, a single counter is used to track the difference in the number of cache misses between the two policies. Each miss that occurs only in the leader set will increment the counter, while misses only occurring in the ATD decrement the counter. At the end of a sampling window, policy will be switched if the counter value is larger than or smaller than a certain threshold.