The Direct-to-Data (D2D) Cache: Navigating the Cache Hierarchy with a Single Lookup
Link: https://dl.acm.org/doi/10.1145/2678373.2665694
Year: ISCA 2014 (ACM SIGARCH Computer Architecture News, June 2014)
Keyword: D2D Cache; TLB
Highlight:
-
The entire hierarchy can be made tag-less by using a central repo for tracking cache block locations and the component ID.
-
The central repo can also form a hierarchy such that frequently used page-level entries are cached along with TLB entries.
-
Cache block data and different levels of the repo can be chained together using pointers to each other, such that the correct component can be found when a cache block is evicted / inserted.
-
Exclusive cache design is useful here, since only one copy of any address is allowed across the hierarchy, so the central repo only needs to track that single copy. Inclusion is usually implemented to simplify coherence, since lower level caches can act as coherence filters. In the D2D design, since cache location information can be fully obtained from the Hub, inclusion gives no sensible benefit and merely wastes space in the L2.
Comments:
-
In the “eTLB hit but cache miss” case, the block is inserted into the L1, and its Hub pointer must be set. In this case, how do we know which Hub entry maps the block? Does this require a separate Hub lookup? I understand that this may not be on the critical path, because the Hub lookup can be overlapped with the main memory fetch anyway.
-
The paper claims that the D2D architecture can be extended to a three-level hierarchy with trivial effort. I did not see how this is trivial. Do we extend the Hub such that both the L2 and the LLC have one? How do you maintain exclusiveness (between the LLC and the private hierarchy, or is the private hierarchy also exclusive)? They actually published a follow-up paper on how to extend D2D to a three-level hierarchy. As I said above, the design is indeed non-trivial compared with D2D.
This paper proposes the Direct-to-Data (D2D) cache, a cache hierarchy architecture that gets rid of the conventional tag array and uses a centralized repository for cache block location lookup. The paper builds on a previous work on the tag-less L1 cache (TLC). The D2D cache adopts TLC's idea of extending the TLB into a page-sized super-block tag array, and applies the same design philosophy to the remaining levels of the cache hierarchy, seeking to remove the tag array entirely for all cache components. To achieve this, the D2D cache adds a few extra hardware components that track not only the location of a block in the L1, but also its location in the rest of the hierarchy and its sharing status.
The D2D cache is developed based on the Tag-Less Cache (TLC), which is motivated by the fact that both the tag array lookup and the parallel data array read consume a significant portion of an L1 cache's access energy. This is, however, unavoidable in a tag-based cache, since the block can be stored in any of the ways within a set. The TLC design, on the other hand, proposes that the way number and validity status be tracked by the TLB, essentially decoupling the tag and data arrays, and treating the TLB as a page-sized super-block tag array. Each entry of the extended TLB (called the eTLB) in TLC hence contains a vector of valid bits and an array of way numbers, each tracking the status of the block at the corresponding page offset. Cache accesses then use the information in the eTLB to determine whether the requested address is cached, and if it is, the way within the set where it is stored. Each cache block in the data array must also be extended with a back-pointer to the eTLB entry whose virtual address covers the block's address, such that the entry can be updated to reflect a status change when the block is evicted. When an eTLB entry is evicted as a result of TLB replacement, all blocks whose addresses lie within the evicted entry's virtual page must also be evicted, since every block in the cache must be covered by an eTLB entry. Because the TLB is a virtual address cache, the TLC is also virtually addressed, and is hence prone to homonym and synonym issues. Homonyms are resolved with the existing ASID field of TLB entries, meaning that entries mapping the same virtual address to different physical addresses can still be distinguished by their ASIDs. Synonyms, however, are not handled by the TLB itself, and the TLC paper proposes that a reverse translation table mapping physical page addresses to virtual page addresses be added for tracking them. When an entry is to be inserted into the eTLB, the reverse table is consulted to enforce the invariant that no existing virtual address already maps to the same physical page. If a synonym is detected, the invariant is enforced by copying the existing entry to the new entry, rewriting its virtual address, and retaining the cache blocks covered by the entry.
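To make the eTLB organization concrete, here is a minimal C sketch of a TLC-style eTLB entry and data block. The layout and parameters (4 KB pages, 64-byte blocks, hence 64 blocks per page) are illustrative assumptions, not the paper's exact configuration:

```c
#include <stdint.h>

/* Illustrative parameters (assumptions): 4 KB pages with 64 B blocks give
   64 blocks per page, so one 64-bit word holds the per-block valid bits. */
#define BLOCKS_PER_PAGE 64
#define INVALID_PTR     0xFF            /* "null" back-pointer encoding    */

/* One TLC eTLB entry: a conventional translation plus per-block location
   information, turning the TLB into a page-sized super-block tag array.  */
typedef struct {
    uint64_t vpn;                       /* virtual page number (the tag)   */
    uint64_t ppn;                       /* physical page number            */
    uint16_t asid;                      /* distinguishes homonyms          */
    uint64_t valid;                     /* one valid bit per block         */
    uint8_t  way[BLOCKS_PER_PAGE];      /* L1 way holding each valid block */
} etlb_entry_t;

/* Each L1 data block carries a back-pointer to its covering eTLB entry so
   the entry can be updated when the block is evicted.                    */
typedef struct {
    uint8_t etlb_index;                 /* back-pointer into the eTLB      */
    uint8_t data[64];
} tlc_block_t;
```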
The D2D cache extends the above paradigm to the L2 (and potentially the LLC) by also tracking the current holder of each block in the eTLB entries (the paper uses a two-level cache hierarchy with a shared L2 as the example). In addition, the D2D proposal adds a hardware directory at the L2 level, called the Hub, that serves as a central repository tracking cache line status across the entire hierarchy. The cache location portion of the L1 eTLB is now treated as a smaller but faster “cache” of the Hub. The Hub is organized as a set-associative Content-Addressable Memory (CAM) using physical page addresses as keys. Each Hub entry tracks the locations of all blocks within the page using three fields: a valid bit field indicating whether the block is actually cached anywhere in the hierarchy; a way number field indicating the way of the block; and a component ID field indicating the cache component that currently holds the cached copy. Each Hub entry also has a pointer into the eTLB, which stores the index of the corresponding eTLB entry if the same physical page is also present there (and is set to invalid otherwise).
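Continuing the C sketch above, a Hub entry might be laid out as follows; the component encoding and pointer widths are assumptions for illustration:

```c
/* Which component currently holds a block (assumed encoding). */
enum component { COMP_NONE = 0, COMP_L1 = 1, COMP_L2 = 2 };

/* One Hub entry: physically tagged, tracking where in the (exclusive)
   hierarchy each block of the page currently lives.                      */
typedef struct {
    uint64_t ppn;                       /* physical page number (CAM key)  */
    uint64_t valid;                     /* block cached anywhere at all?   */
    uint8_t  way[BLOCKS_PER_PAGE];      /* way within the owning component */
    uint8_t  comp[BLOCKS_PER_PAGE];     /* which component holds the block */
    uint8_t  etlb_ptr;                  /* eTLB index, or INVALID_PTR      */
} hub_entry_t;

/* In D2D the eTLB's per-block fields become a cached copy of the Hub's,
   so an eTLB entry also records the owning component of each block.      */
typedef struct {
    uint64_t vpn, ppn;
    uint16_t asid;
    uint64_t valid;
    uint8_t  way[BLOCKS_PER_PAGE];
    uint8_t  comp[BLOCKS_PER_PAGE];     /* cached from the Hub             */
} d2d_etlb_entry_t;

/* Cached blocks now carry a back-pointer to the Hub instead of the eTLB. */
typedef struct {
    uint8_t hub_index;                  /* per-block pointer to Hub entry  */
    uint8_t data[64];
} d2d_block_t;
```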
To simplify location tracking of cache blocks, the D2D hierarchy is strictly exclusive, meaning that a block can be cached in at most one component across the hierarchy, such that only one component ID needs to be tracked. The D2D design also enforces two invariants. First, every block in the hierarchy must have a Hub entry whose physical page covers the block's address. To help locate this Hub entry, a per-block pointer to the Hub entry is added to all blocks in the hierarchy. Second, the Hub is always inclusive of the eTLB, meaning that every eTLB entry must also be present in the Hub. In addition, if a physical page is in both the eTLB and the Hub, only the information in the eTLB is considered valid, while the Hub copy may be stale. This can easily be determined by checking whether the eTLB pointer in the Hub entry is valid, as a valid pointer indicates that the eTLB entry should be used.
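On the sketch's assumed fields, the eTLB-overrides-Hub rule reduces to a one-line check:

```c
#include <stdbool.h>

/* If the page is present in both structures, the eTLB copy is the live
   one and the Hub copy may be stale; a valid eTLB pointer encodes this. */
static inline bool hub_copy_may_be_stale(const hub_entry_t *h) {
    return h->etlb_ptr != INVALID_PTR;
}
```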
We next describe the operation of the D2D cache. In the common case, an access hits the eTLB, and the valid bit in the entry is checked. If the valid bit is set, the block is directly fetched from either the L1 or the L2 (we deviate slightly from the paper's description and simplify by ignoring the L1 instruction cache) using the way number stored in the eTLB entry. Special care must be taken if the block is cached in the L2, as the L2 set index may be generated using bits from the physical page number. If this is the case (depending on the L2 cache configuration), the physical page number must also be retrieved from the eTLB to compute the L2 index. If the valid bit is not set, the block is deemed not cached in any component, and a request is sent directly to main memory using the physical address derived from the conventional part of the eTLB entry. The new block, once fetched from main memory, is inserted into the L1 cache, and the eTLB entry is updated accordingly. The Hub pointer of the newly inserted block is also set to point to the corresponding Hub entry.
If, however, the eTLB misses, the request is forwarded to the Hub, which stores the full set of cache block status information. Since the Hub uses physical addresses for lookup, address translation must first be performed at the L2 TLB to retrieve the physical page number, which is then used for the Hub lookup. If the page that covers the block is in the Hub, the block is fetched directly from the cache component indicated by the Hub entry (the block can still be in the L1 even though the eTLB has no entry for it). The entry is also copied from the Hub into the eTLB.
If the Hub also misses, the block is not cached anywhere in the hierarchy, which constitutes a cache miss. The block is then fetched from main memory as in the eTLB-hit miss case above. In addition, a new entry is inserted into the Hub, and its valid bit and way number are set accordingly.
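Putting these lookup cases together, the access path might look like the following C sketch, built on the structs above. The helper functions are hypothetical stand-ins for hardware datapaths, not real APIs:

```c
/* Hypothetical helpers standing in for hardware datapaths. */
extern d2d_etlb_entry_t *etlb_lookup(uint64_t vpn, uint16_t asid);
extern hub_entry_t      *hub_lookup(uint64_t ppn);
extern uint64_t          l2_tlb_translate(uint64_t vpn, uint16_t asid);
extern void              etlb_refill_from_hub(hub_entry_t *h);
extern void read_block(uint8_t comp, uint8_t way, uint64_t paddr);
extern void fetch_from_memory(uint64_t paddr); /* installs in L1, updates
                                                  eTLB/Hub as a side effect */

void d2d_access(uint64_t vaddr, uint16_t asid) {
    uint64_t vpn = vaddr >> 12;                        /* 4 KB pages       */
    int      blk = (int)((vaddr >> 6) & (BLOCKS_PER_PAGE - 1));

    d2d_etlb_entry_t *e = etlb_lookup(vpn, asid);
    if (e != NULL) {
        uint64_t paddr = (e->ppn << 12) | (vaddr & 0xFFF);
        if (e->valid & (1ULL << blk))
            /* eTLB hit, block cached: one lookup yields both component and
               way; if the block lives in the L2, the ppn bits inside paddr
               supply the L2 set index. */
            read_block(e->comp[blk], e->way[blk], paddr);
        else
            /* eTLB hit, block uncached anywhere: skip the hierarchy and go
               straight to memory. */
            fetch_from_memory(paddr);
        return;
    }

    /* eTLB miss: translate at the L2 TLB, then probe the physically
       tagged Hub. */
    uint64_t ppn   = l2_tlb_translate(vpn, asid);
    uint64_t paddr = (ppn << 12) | (vaddr & 0xFFF);
    hub_entry_t *h = hub_lookup(ppn);
    if (h != NULL && (h->valid & (1ULL << blk))) {
        read_block(h->comp[blk], h->way[blk], paddr);  /* may still be L1  */
        etlb_refill_from_hub(h);                       /* warm the eTLB    */
    } else {
        fetch_from_memory(paddr);  /* true miss: allocate a Hub entry too  */
    }
}
```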
On an eTLB eviction, the entry is simply written back to the Hub, updating the Hub entry with potentially more up-to-date block status, and the eTLB pointer in the Hub entry is invalidated. No cache blocks need to be invalidated, since block status is tracked by the Hub, and the eTLB merely serves as a fast cache of it. On a Hub eviction, however, all blocks belonging to the page that are currently in the hierarchy must be evicted. These blocks can be enumerated by walking the valid bit vector and following the way numbers stored in the Hub entry. Hub evictions can be optimized with techniques similar to those in the original TLC paper, such as micro-pages and micro-page prefetching.
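The asymmetry between the two eviction paths can be sketched as follows (same assumed structs; the helpers are hypothetical):

```c
extern void writeback_etlb_to_hub(const d2d_etlb_entry_t *e, hub_entry_t *h);
extern void evict_block(uint8_t comp, uint8_t way, uint64_t ppn, int blk);

/* eTLB eviction: cheap. Fold the (possibly newer) location bits back into
   the Hub and drop the back-pointer; no data blocks move.                */
void on_etlb_evict(const d2d_etlb_entry_t *e, hub_entry_t *h) {
    writeback_etlb_to_hub(e, h);        /* Hub copy becomes current again  */
    h->etlb_ptr = INVALID_PTR;
}

/* Hub eviction: expensive. Every block of the page still in the hierarchy
   must be evicted, found by walking the valid bits.                      */
void on_hub_evict(hub_entry_t *h) {
    for (int blk = 0; blk < BLOCKS_PER_PAGE; blk++)
        if (h->valid & (1ULL << blk))
            evict_block(h->comp[blk], h->way[blk], h->ppn, blk);
}
```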
On cache block eviction from the L1 to the L2, or from the L2 to main memory, the Hub entry and possibly the eTLB entry need to be updated. This is achieved by following the Hub pointer in the evicted block to update the Hub entry first; if that entry's eTLB pointer is valid, the eTLB entry is updated as well.
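In the sketch's terms, this pointer chasing might look like the following (hub_entry_at/etlb_entry_at are hypothetical index-to-entry helpers):

```c
extern hub_entry_t      *hub_entry_at(uint8_t idx);
extern d2d_etlb_entry_t *etlb_entry_at(uint8_t idx);

/* Called when a block moves down one level (new_comp = COMP_L2 for an
   L1 -> L2 eviction, COMP_NONE for an L2 -> memory eviction).            */
void on_block_evict(d2d_block_t *b, int blk, uint8_t new_comp, uint8_t new_way) {
    hub_entry_t *h = hub_entry_at(b->hub_index);  /* per-block Hub pointer */
    h->comp[blk] = new_comp;
    h->way[blk]  = new_way;
    if (new_comp == COMP_NONE)
        h->valid &= ~(1ULL << blk);               /* left the hierarchy    */

    if (h->etlb_ptr != INVALID_PTR) {             /* page cached in eTLB:  */
        d2d_etlb_entry_t *e = etlb_entry_at(h->etlb_ptr);
        e->comp[blk] = new_comp;                  /* mirror the update     */
        e->way[blk]  = new_way;
        if (new_comp == COMP_NONE)
            e->valid &= ~(1ULL << blk);
    }
}
```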
Similar to TLC, D2D also prevents synonyms from co-existing in the eTLB. Synonyms are detected when a new entry is about to be inserted into the eTLB on an eTLB miss. Since the Hub is physically tagged, if two virtual pages map to the same physical page and one of them is already in the eTLB, the Hub entry's eTLB pointer will already be set, indicating a synonym. Synonyms are resolved in the same manner as in TLC: the old eTLB entry is copied over to the new entry, and the virtual page number is rewritten to the new virtual page.
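A sketch of this check on eTLB refill, reusing the hypothetical helpers declared above:

```c
/* On an eTLB refill, a physically tagged Hub lookup exposes synonyms: a
   Hub entry that already points at an eTLB entry means some other virtual
   page mapping this physical page is live in the eTLB.                   */
void etlb_insert(uint64_t vpn, uint16_t asid, uint64_t ppn) {
    hub_entry_t *h = hub_lookup(ppn);
    if (h != NULL && h->etlb_ptr != INVALID_PTR) {
        d2d_etlb_entry_t *old = etlb_entry_at(h->etlb_ptr);
        old->vpn  = vpn;     /* rewrite the virtual tag in place,          */
        old->asid = asid;    /* keeping all cached blocks intact           */
        return;
    }
    /* No synonym: allocate a fresh eTLB entry (filled from the Hub entry
       if it hits, or after a page walk and Hub allocation if it misses). */
}
```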
The prior discussion assumed the L2 cache is private, which simplifies certain issues such as tracking a block's status across multiple L1 eTLBs. In a shared L2 configuration, where multiple private L1 caches may cache the same address in shared state, an extra directory needs to be added to the Hub, such that each Hub entry tracks the coherence states of all blocks in the physical page, and each entry has one eTLB pointer per private L1 in order to find all shared copies. Coherence actions must be performed before an incompatible request is served (e.g., a GET to a block in M state, or a GETX to a block in S state). The paper also notes that, for shared blocks, the latency of accessing the shared L2 is equivalent to that of a regular tag-based L2, since the Hub must be consulted for coherence. One possible optimization is to classify blocks into shared and private, such that private blocks can still be fetched directly from the L2 on an L1 miss, while shared blocks take the slow path that always performs coherence checks.
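A multicore variant of the Hub entry might be laid out as below; the MESI-style state encoding and the per-core pointer array are assumptions about one plausible implementation, not the paper's exact format:

```c
#define NUM_CORES 4                      /* assumed core count             */

enum coh_state { COH_I, COH_S, COH_E, COH_M };  /* assumed MESI encoding   */

/* Hub entry doubling as a directory entry: per-block coherence state plus
   one eTLB pointer per private L1, so all shared copies can be found.    */
typedef struct {
    uint64_t ppn;
    uint64_t valid;
    uint8_t  way[BLOCKS_PER_PAGE];
    uint8_t  comp[BLOCKS_PER_PAGE];
    uint8_t  state[BLOCKS_PER_PAGE];     /* coherence state per block      */
    uint8_t  etlb_ptr[NUM_CORES];        /* one per private L1's eTLB      */
} hub_dir_entry_t;
```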