TLC: A Tag-Less Cache for Reducing Dynamic First Level Cache Energy

Paper Title: TLC: A Tag-Less Cache for Reducing Dynamic First Level Cache Energy
Link: https://dl.acm.org/doi/10.1145/2540708.2540714
Year: MICRO 2013
Keyword: Cache; TLB; TLC; Tag-Less Cache




Highlights:

  1. This paper is essentially a combination of (1) virtual address caches, (2) super-block caches, and (3) decoupled tag and data caches (with indirection pointers for locating data). The novelty is that super-blocks can be page-sized (micro-page-sized, actually), and that the super-block tag array is co-located with the TLB, such that the information of both can be read out with a single lookup.

  2. Synonyms in virtual address caches can be handled with a reverse translation table, by not allowing both TLB entries to co-exist.

  3. By changing the super-block tag size (i.e., the TLB granularity) we can trade off cache misses caused by coarse super-block tags under low locality against the metadata cost of a fine-grained TLB. In addition, the L1 TLB can operate on micro-pages by injecting entries from the L2 TLB, without changing how the rest of the MMU works.

Comments:

  1. I do not get why the paper proposes an extra level of indirection for back pointers to handle synonyms. I understand the authors' argument that it is necessary because the location of the newly inserted synonym entry changes. This, however, is not even true, because the newly inserted entry can simply reuse the slot of the old synonym entry to be evicted, i.e., just update the VA field, and it is done. In that case, the validity bits and way numbers do not need to be copied, and there is no need for an extra level of indirection.

This paper proposes the Tag-Less Cache (TLC) architecture to reduce the energy consumption of conventional L1 caches. It is motivated by the fact that conventional caches are rather power hungry, which stems from two design choices. First, cache accesses always need tag comparisons in order to locate the data block in one of the many ways, or to declare a cache miss. Second, to reduce access latency, both the tags and the data blocks of a set are read in parallel once the set index is computed, and only one of the blocks is selected in the case of a cache hit, wasting the energy spent on the rest of the set. The waste can be quite significant, since the entire set is accessed while at most one block is actually useful.

Previous works attempt to reduce the energy cost of cache accesses with two techniques. The first, staged access, serializes the tag access and the data access into two phases: only the data way that contains the requested block is accessed on a hit, and the data bank is not accessed at all on a miss. This approach saves data bank energy, but increases the access latency by one cycle. The second technique, way prediction, predicts the way an access will likely hit and checks only that way. It avoids the parallel, fully associative set lookup by speculatively reading the data bank first, and checking the tag array only for confirmation. If the speculation fails, a conventional lookup is still performed, which increases both latency and energy consumption.

This paper, on the other hand, proposes storing the way information of addresses in the TLB, so that the L1 cache can simply get rid of the tag array. To be more specific: each TLB entry is extended with 64 one-bit flags indicating whether the corresponding block of the page is present in the L1 cache, and 64 log2(K)-bit way indicators (for a K-way cache), one per block. The L1 cache has no tag array; it only has a data array, which is accessed by set and way number. Each data block also carries a back pointer storing the index of its TLB entry, so that the TLB entry can be located quickly when the block is evicted.
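
To make the layout concrete, here is a minimal behavioral sketch of these structures in Python, assuming the paper's configuration (32KB, 8-way L1 with 64B blocks and 4KB pages); the field names are mine, not the paper's:

```python
BLOCKS_PER_PAGE = 64   # 4KB page / 64B block
WAYS = 8
NUM_SETS = 64          # 32KB / 64B blocks / 8 ways

class TLCEntry:
    """A TLB entry extended with per-block cache-presence metadata."""
    def __init__(self, vpn, ppn):
        self.vpn = vpn                          # virtual page number (TLB tag)
        self.ppn = ppn                          # physical page number
        self.valid = [False] * BLOCKS_PER_PAGE  # 64 x 1-bit presence flags
        self.way = [0] * BLOCKS_PER_PAGE        # 64 x log2(WAYS)-bit way indices

class DataSlot:
    """One slot of the tag-less L1 data array (no tag stored)."""
    def __init__(self):
        self.data = None
        self.back_ptr = None    # index of the owning TLB entry
        self.block_idx = None   # which block of that entry's page this is
```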

We next describe the operations in detail. On a cache access, the virtual address is first used to probe the TLB. On a TLB hit, the corresponding validity bit, indexed by the address bits between the page number and the block offset, is checked first. If the validity bit is zero, meaning that the block is not present, a cache miss can be signaled immediately, and the physical page number is read out to form the address to be fetched from the lower level. If the validity bit is one, the access hits the cache, and the way number is retrieved, which, combined with the set index (generated in the same manner as in a conventional cache), is used to access the data array. The paper notes that in the case of a cache hit (which should be the majority of cases), the physical page number does not even need to be retrieved from the TLB, since the way number is sufficient to access the data bank.
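
As an illustration, a sketch of this hit/miss path using the structures above; `tlb.probe` is a hypothetical helper for the associative TLB lookup, returning the index of the matching entry or `None`:

```python
def tlc_lookup(tlb, data_array, vaddr):
    vpn = vaddr >> 12                  # 4KB page number
    block_idx = (vaddr >> 6) & 0x3F    # bits between page number and block offset
    entry_id = tlb.probe(vpn)          # associative TLB lookup (hypothetical helper)
    if entry_id is None:
        return "tlb-miss"              # handled by the TLB miss path
    entry = tlb.entries[entry_id]
    if not entry.valid[block_idx]:
        # Immediate miss: the PPN is read out only on this path, to build
        # the physical address fetched from the lower level.
        return ("miss", (entry.ppn << 12) | (vaddr & 0xFFF))
    # Hit: the way number plus the conventionally computed set index is
    # enough to address the data array; no tag compare, no PPN read-out,
    # and only a single way is accessed.
    set_idx = (vaddr >> 6) & (NUM_SETS - 1)
    return ("hit", data_array[set_idx][entry.way[block_idx]].data)
```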

On a cache miss, an existing block needs to be evicted from the L1. The eviction follows the back pointer of the victim block to locate its TLB entry, and resets the corresponding validity bit. When the new block is fetched from the lower level, the corresponding TLB entry is also initialized by first setting the validity bit, and then storing the way number.
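
A matching sketch of this miss/fill path, with victim selection abstracted into the `victim_way` argument (replacement policy is orthogonal here):

```python
def tlc_fill(tlb, data_array, vaddr, new_data, victim_way):
    set_idx = (vaddr >> 6) & (NUM_SETS - 1)
    victim = data_array[set_idx][victim_way]
    if victim.back_ptr is not None:
        # Follow the back pointer to the victim's TLB entry and clear its
        # validity bit; no tag array is needed to identify the victim.
        tlb.entries[victim.back_ptr].valid[victim.block_idx] = False
    entry_id = tlb.probe(vaddr >> 12)   # guaranteed to hit on the fill path
    block_idx = (vaddr >> 6) & 0x3F
    victim.data = new_data
    victim.back_ptr, victim.block_idx = entry_id, block_idx
    entry = tlb.entries[entry_id]
    entry.valid[block_idx] = True       # set the validity bit first...
    entry.way[block_idx] = victim_way   # ...then record the way number
```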

On a TLB miss, an existing TLB entry is evicted, and the new entry is brought in. When the existing entry is evicted, all cache blocks mapped by that entry must also be evicted from the L1 cache. The new TLB entry is initialized with all validity bits and way number fields set to zero, since the eviction protocol guarantees that addresses not mapped by the TLB are never cached.
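
The eviction protocol can be sketched the same way. Note that with 64 sets and 64 blocks per page in the assumed geometry, the set index of a block equals its in-page block index, so the evicted entry's own metadata locates every block it maps:

```python
def tlb_evict_entry(tlb, data_array, entry_id):
    entry = tlb.entries[entry_id]
    for block_idx in range(BLOCKS_PER_PAGE):
        if entry.valid[block_idx]:
            set_idx = block_idx   # 64 sets, 64 blocks/page: indices coincide
            slot = data_array[set_idx][entry.way[block_idx]]
            slot.data, slot.back_ptr, slot.block_idx = None, None, None
            entry.valid[block_idx] = False
    # The replacement entry starts with all-zero metadata: the flush above
    # guarantees no cached block maps to an address outside the TLB.
```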

One issue with the above baseline design is that evicting TLB entries also evicts cache blocks, causing unnecessary misses. This may not be a huge problem for programs with good locality, since the coverage of the TLB is much larger than the capacity of the L1 cache. For programs with low locality, however, the worst-case scenario is that each TLB entry maps only one valid block, in which case the total number of valid blocks that can co-exist in the L1 cache equals the number of TLB entries, which is far fewer than the number of L1 data slots.

To address this issue, the paper proposes a micro-page TLB: instead of caching VA-to-PA translations at 4KB granularity, the L1 TLB caches the translation information, together with the validity bits and cache way numbers, at 512-byte micro-page granularity. The number of L1 TLB entries is accordingly increased by 8x, meaning that even in the worst-case scenario the TLB can still maintain block information for the entire cache (the paper uses a 32KB cache with a 512-entry micro-page TLB, versus a 64-entry baseline TLB). The micro-page L1 TLB works as transparently as the baseline TLB: entries are generated from the standard 4KB translation and injected into the L1 TLB on L1 TLB misses, while in-memory page table entries and the L2 TLB are unmodified.
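
To make the coverage arithmetic concrete with the paper's numbers: a 4KB page splits into 4096 / 512 = 8 micro-pages, so the 64-entry baseline TLB grows to 64 x 8 = 512 micro-page entries, while a 32KB cache with 64B blocks holds 32768 / 64 = 512 blocks. Even in the degenerate case of one valid block per TLB entry, 512 entries therefore still cover all 512 data slots.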

Adopting the micro-page TLB, however, increases L1 TLB misses, especially on streaming access patterns, since a single 4KB translation now takes eight TLB misses to be fully fetched into the L1 TLB. To deal with this, the paper proposes macro-page prefetching: when one micro-page entry is injected, the missing entries of the same 4KB regular page are injected into the L1 TLB as well. The prefetched entries are placed at the LRU end of the replacement chain to avoid polluting the L1 TLB.
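
A sketch of this prefetch path; `lookup`, `insert_mru`, `insert_lru`, and `contains` are hypothetical helpers on the TLB objects, not names from the paper:

```python
def handle_l1_tlb_miss(l1_tlb, l2_tlb, vaddr):
    translation = l2_tlb.lookup(vaddr >> 12)   # standard 4KB translation
    demand = (vaddr >> 9) & 0x7                # micro-page index within the page
    for i in range(8):
        micro_vpn = ((vaddr >> 12) << 3) | i   # 512B-granularity VPN
        if i == demand:
            l1_tlb.insert_mru(micro_vpn, translation)  # demanded entry
        elif not l1_tlb.contains(micro_vpn):
            # Prefetched siblings go to the LRU end to limit pollution.
            l1_tlb.insert_lru(micro_vpn, translation)
```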

By associating way indices with the TLB's translation entries, the TLC design essentially adds a virtual address tag array that is physically co-located with the TLB tags. Virtual tags, as in all virtually tagged designs, inevitably create two problems: synonyms and homonyms. Homonyms are easily solved by adding an ASID to each TLB entry, which is already present in many modern implementations. Synonyms, on the other hand, are trickier: if two or more VAs map to the same PA, there would be multiple TLB entries with the same PA, while there is only one copy of the data per physical address in the cache. To handle this, TLC makes a policy decision to not allow synonym translation entries to co-exist in the L1 TLB, which is justified by the fact that synonyms are rarely used, and it is therefore sufficient to maintain correctness but not efficiency. The paper proposes adding a reverse translation table that maps PAs back to VAs and is coupled with the L1 TLB, meaning that the contents of the reverse table are always consistent with the L1 TLB. When an entry is to be inserted into the L1 TLB, the table is first consulted, and if an entry with the same PA but a different VA exists, the insertion is performed by updating the existing entry's VA field with the new VA. The validity bits and way numbers are left unmodified in this case, since both entries describe the status of the same physical page.
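
A sketch of this synonym check, with the reverse table modeled as a plain dict standing in for the hardware structure and `insert` as a hypothetical normal-insertion helper:

```python
def tlb_insert_no_synonyms(l1_tlb, reverse_table, vpn, ppn):
    existing = reverse_table.get(ppn)        # PA -> resident TLB entry index
    if existing is not None:
        entry = l1_tlb.entries[existing]
        if entry.vpn != vpn:
            # Synonym: retag the resident entry in place. Its validity bits
            # and way numbers stay untouched, since they describe the same
            # physical page.
            entry.vpn = vpn
        return existing
    idx = l1_tlb.insert(TLCEntry(vpn, ppn))  # normal insertion path
    reverse_table[ppn] = idx                 # keep the table consistent with the TLB
    return idx
```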