A Split Cache Hierarchy for Enabling Data-Oriented Optimizations



Paper Title: A Split Cache Hierarchy for Enabling Data-Oriented Optimizations
Link: https://ieeexplore.ieee.org/document/7920820
Year: HPCA 2017
Keyword: D2M Cache; TLB




Highlights:

  1. D2D can be extended to support the three-level multicore hierarchy by adding an extra tag store, the MD3, as both a tag store and a coherence controller;

  2. Even with coherence, the explicit addressing scheme of D2D still works: each private hierarchy operates unmodified as in D2D, and can point to a block in another node using only the node ID, without storing the block's exact location there.

Comments:

  1. Many details are omitted in this paper. For example, how are the replacement pointers (RP) set on miss requests? The design also seems over-engineered, with many corner cases that the paper does not cover.

This paper proposes the Direct-to-Master (D2M) cache, a tag-less cache architecture for multicore environments. The D2M proposal builds on the earlier Direct-to-Data (D2D) design, in which cache block locations in the private L1 and L2 caches are tracked by a two-level hierarchy of metadata stores, namely the eTLB and the Hub. D2M extends this architecture to a shared LLC with multicore coherence.

We first describe the operation of the baseline D2D design. In the original D2D work, per-block address tags are eliminated from the private hierarchy. Instead, a centralized repository called the Hub is added to the private hierarchy to track cache line locations, i.e., the cache component and the way number in which each line resides. The Hub is essentially a set-associative sector cache tag array whose super-blocks are of 4KB page size. The Hub uses physical page numbers as tags and is co-located with the L2 cache. For each block in the super-block, a Hub entry stores a valid bit, the component the block is currently cached in, and its way number in that cache. Set indices are not stored, as they can be generated directly from the virtual or physical address (depending on the cache size: for the L1 the index is fully virtual, while for the L2 it also needs a few bits from the physical page number). The D2D cache maintains the invariant that every block stored in the private hierarchy has a corresponding Hub entry tracking its status, and cache lookups must use the information stored in the Hub to locate the requested block. In addition, to avoid keeping multiple copies of the same block in the private hierarchy, the D2D design mandates that the L1 and L2 be exclusive, meaning that any block can be cached by at most one component.
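
As a concrete illustration, the per-region metadata that a Hub entry tracks might look like the following C sketch. The struct and field names, the bit widths, and the 4KB-region/64B-block split are my own assumptions for illustration; the paper only specifies that each entry records, per block, a valid bit, the caching component, and the way number.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCKS_PER_REGION 64   /* a 4KB super-block holds 64 64B cache blocks (assumed geometry) */

/* Which component currently holds a block. */
enum component { LOC_MEMORY, LOC_L1, LOC_L2 };

/* Location of one cache block inside the private hierarchy. */
struct block_loc {
    unsigned valid     : 1;
    unsigned component : 2;   /* LOC_L1, LOC_L2, or LOC_MEMORY        */
    unsigned way       : 3;   /* way within the indicated cache level */
};

/* One Hub entry: a sector-cache tag plus per-block location info.
 * Set indices are not stored; they are recomputed from the address. */
struct hub_entry {
    uint64_t ppn;            /* physical page number used as the tag   */
    bool     etlb_valid;     /* does a corresponding eTLB entry exist? */
    uint16_t etlb_index;     /* back-pointer into the eTLB, if so      */
    struct block_loc loc[BLOCKS_PER_REGION];
};
```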

To accelerate cache lookups, and to avoid a relatively expensive Hub lookup on every memory operation, a smaller but faster cache of Hub entries is added next to the L1 cache to track frequently used entries. This Hub cache is co-located with the L1 TLB and is called the eTLB; it is essentially a virtually tagged super-block tag array. Each eTLB entry tracks the same location information as the underlying Hub entry for the same physical address. Memory operations issued from the pipeline first perform a lookup in the eTLB and, on a miss, in the Hub. Once an entry is found in either structure, the cache block is accessed using the component ID and the way number stored in that entry.
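
The eTLB-first, Hub-second lookup order can be summarized with the rough sketch below. All function names (etlb_lookup, hub_lookup, translate, and so on) are invented placeholders, and the control flow is only my reading of the description above, not the paper's exact microarchitecture.

```c
#include <stdbool.h>
#include <stdint.h>

struct block_loc { bool valid; int component; int way; };

/* Placeholder hooks standing in for the actual hardware structures. */
extern bool etlb_lookup(uint64_t vaddr, struct block_loc *out);
extern bool hub_lookup(uint64_t paddr, struct block_loc *out);
extern uint64_t translate(uint64_t vaddr);                        /* VA -> PA              */
extern int  set_index(int component, uint64_t addr);              /* derived, never stored */
extern void read_data_array(int component, int set, int way, void *buf);
extern void fetch_from_memory(uint64_t paddr, void *buf);

void d2d_load(uint64_t vaddr, void *buf)
{
    struct block_loc loc;

    /* Fast path: the eTLB caches both the translation and the location. */
    if (etlb_lookup(vaddr, &loc) && loc.valid) {
        read_data_array(loc.component, set_index(loc.component, vaddr), loc.way, buf);
        return;
    }

    /* Slow path: consult the Hub using the physical address. */
    uint64_t paddr = translate(vaddr);
    if (hub_lookup(paddr, &loc) && loc.valid) {
        read_data_array(loc.component, set_index(loc.component, paddr), loc.way, buf);
        return;
    }

    /* Miss in the entire private hierarchy. */
    fetch_from_memory(paddr, buf);
}
```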

To facilitate cache block eviction, each data block carries a Hub pointer that points to the Hub entry covering the block's address. Similarly, each Hub entry carries an eTLB pointer, which points to the corresponding eTLB entry, if one exists. On block eviction, the Hub and optionally the eTLB (if the Hub entry's eTLB pointer is valid) are notified, and the block's location information is updated. This is achieved by following the data block's Hub pointer to the Hub entry, and optionally also following the Hub entry's eTLB pointer to the eTLB entry.
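
A minimal sketch of this back-pointer chasing on eviction is shown below, with illustrative structure names; the key point is that no address search is needed, since the data block points directly at its Hub entry and the Hub entry points at its eTLB entry.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCKS_PER_REGION 64

struct block_loc { bool valid; int component; int way; };

struct etlb_entry { struct block_loc loc[BLOCKS_PER_REGION]; };

struct hub_entry {
    bool etlb_valid;
    struct etlb_entry *etlb_ptr;      /* back-pointer to the eTLB entry, if any */
    struct block_loc loc[BLOCKS_PER_REGION];
};

struct data_block {
    struct hub_entry *hub_ptr;        /* back-pointer to the covering Hub entry */
    int block_idx;                    /* which block within the region          */
};

void on_block_eviction(struct data_block *blk)
{
    /* The Hub always tracks the block, by the D2D invariant. */
    blk->hub_ptr->loc[blk->block_idx].valid = false;

    /* The eTLB only needs updating if it currently caches this entry. */
    if (blk->hub_ptr->etlb_valid)
        blk->hub_ptr->etlb_ptr->loc[blk->block_idx].valid = false;
}
```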

On Hub entry eviction, all blocks of that entry currently valid in the private hierarchy are evicted as well, in order to maintain the invariant.

Since the eTLB uses virtual page numbers, both homonyms and synonyms are possible. Homonyms are prevented as in a conventional TLB by including the ASID field. Synonyms are detected when a new entry is about to be inserted into the eTLB: the Hub entry's eTLB pointer is followed to check whether an eTLB entry with the same physical address but a different virtual address already exists. The paper argues that since synonyms are relatively infrequent, the eTLB simply disallows synonym entries from co-existing. In practice, this is achieved by evicting the previous synonym entry and copying its block location information into the new entry when the new entry is inserted.
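
The synonym handling on eTLB insertion could look roughly like the sketch below. The function signature and the assumption that the Hub always holds up-to-date location information are mine; the paper only states that the previous synonym entry is evicted and its location information is carried over.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCKS_PER_REGION 64

struct block_loc { bool valid; int component; int way; };

struct etlb_entry {
    bool     valid;
    uint64_t vpn;                        /* virtual page number tag */
    struct block_loc loc[BLOCKS_PER_REGION];
};

struct hub_entry {
    bool etlb_valid;
    struct etlb_entry *etlb_ptr;         /* back-pointer into the eTLB */
    struct block_loc loc[BLOCKS_PER_REGION];
};

/* Install a new eTLB entry mapping 'vpn' to the region tracked by 'hub';
 * 'victim' is the slot chosen by the normal eTLB replacement policy. */
void etlb_insert(struct hub_entry *hub, uint64_t vpn, struct etlb_entry *victim)
{
    struct etlb_entry *slot = victim;

    if (hub->etlb_valid && hub->etlb_ptr->vpn != vpn) {
        /* Synonym: the same physical page is already mapped under a
         * different virtual name. Evict that entry instead of letting
         * the two co-exist, and reuse its slot for the new mapping. */
        slot = hub->etlb_ptr;
    }

    slot->valid = true;
    slot->vpn   = vpn;

    /* Carry over the block location information (the Hub is kept
     * consistent with the eTLB, so copying from the Hub is safe). */
    for (int i = 0; i < BLOCKS_PER_REGION; i++)
        slot->loc[i] = hub->loc[i];

    hub->etlb_valid = true;
    hub->etlb_ptr   = slot;
}
```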

We next describe the architecture of D2M. The D2M design extends D2D to the LLC with multicore cache coherence. The private hierarchy operates largely as in the baseline D2D, with four minor differences. First, the Hub and the eTLB are renamed to MD2 and MD1, respectively. Second, the page-sized super-block tags, now called Location Information (LI) entries, no longer need to be of page size; the paper uses the term "region" for the size-aligned range of addresses covered by a single entry. This change only adds flexibility to the design and does not modify the way D2D operates. Third, the paper seems to suggest that the private hierarchy is no longer exclusive; instead, the L2 is inclusive of all blocks in the L1.

The last difference is that the per-block location indicator is extended to six bits (the number of bits depends on the system configuration; the paper assumes 8 private hierarchies, a 32-way LLC, and 8-way L1 and L2 caches). The indicator in the private hierarchy can thus address a block in the local hierarchy as in D2D (three bits for the way, and the remaining three to indicate L1 or L2), a block in another private hierarchy using only the node ID (three bits for the node ID, plus a three-bit prefix), a block in the LLC (five bits for the way, plus a one-bit prefix indicating the LLC), or a block in memory (many of the remaining code points are available for this purpose).
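
One possible encoding of the six-bit indicator for the assumed configuration is sketched below. The paper does not give the exact bit assignment, so the prefixes chosen here (a one-bit prefix for the LLC, three-bit prefixes for the local L1/L2, remote nodes, and memory) are an assumption that merely matches the bit budget described above.

```c
#include <stdint.h>

enum loc_kind { LOC_LLC, LOC_LOCAL_L1, LOC_LOCAL_L2, LOC_REMOTE_NODE, LOC_MEMORY };

struct decoded_loc {
    enum loc_kind kind;
    unsigned way;      /* valid for LOC_LLC and LOC_LOCAL_* */
    unsigned node;     /* valid for LOC_REMOTE_NODE         */
};

/* Assumed layout of the 6-bit location indicator (bit 5 is the MSB):
 *   0b0wwwww : LLC, way = wwwww (one-bit prefix, five-bit way)
 *   0b100www : local L1, way = www
 *   0b101www : local L2, way = www
 *   0b110nnn : remote node nnn (node ID only, no exact slot)
 *   0b111xxx : memory (several unused code points)                  */
struct decoded_loc decode_location(uint8_t li)
{
    struct decoded_loc d = {0};
    if ((li & 0x20) == 0)         { d.kind = LOC_LLC;         d.way  = li & 0x1F; }
    else if ((li & 0x38) == 0x20) { d.kind = LOC_LOCAL_L1;    d.way  = li & 0x07; }
    else if ((li & 0x38) == 0x28) { d.kind = LOC_LOCAL_L2;    d.way  = li & 0x07; }
    else if ((li & 0x38) == 0x30) { d.kind = LOC_REMOTE_NODE; d.node = li & 0x07; }
    else                          { d.kind = LOC_MEMORY; }
    return d;
}
```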

To extend metadata tracking to the LLC, the D2M design adds another level of tag store, called the MD3. Just like the MD1 and MD2, the MD3 tracks block locations at region granularity. The MD3 also serves as the global coherence controller and dispatcher of coherence messages. Each region entry in the MD3 is extended with a sharer bit vector that tracks, approximately, which nodes cache blocks of the region in shared state (the vector is per-region rather than per-block). The MD3 is also inclusive, in the sense that every block cached in a private hierarchy has a covering MD3 entry.
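
An MD3 region entry might therefore look like the following sketch, with a per-region sharer bit vector next to the per-block ownership records; the field names and widths are assumptions for illustration.

```c
#include <stdint.h>

#define BLOCKS_PER_REGION 64
#define NUM_NODES          8

struct md3_entry {
    uint64_t region_tag;                    /* physical region tag                   */
    uint8_t  sharers;                       /* one bit per node (NUM_NODES = 8);
                                               per-region, so it may over-approximate
                                               but never misses a sharer             */
    uint8_t  ownership[BLOCKS_PER_REGION];  /* per-block ownership record (see the
                                               following sketch for one possible
                                               interpretation)                       */
};
```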

The MD3's per-block location indicator only encodes whether a block is shared, exclusively owned, or invalid, using a per-block ownership record together with the above-mentioned sharer bit vector. A cache block can be in one of the following states in the D2M design. First, a block can be shared by two or more private hierarchies, in which case the MD3 only records the shared state in the ownership record and uses the bit vector to approximately track the private hierarchies that could possibly cache the block (false negatives are impossible, while false positives are likely). If one of the private hierarchies intends to write to a block in this state, it must obtain write permission as in a standard protocol. Read requests that miss the private hierarchy are fulfilled by the LLC, since the LLC is the owner of the block. Second, a block can be exclusively owned by exactly one private hierarchy, or by the LLC itself. In the former case, the MD3 stores the node ID of the owner in the ownership record and indicates that the block is exclusively owned. In the latter case, the MD3 simply stores the way number of the block in the LLC data array. Note that certain protocols (e.g., MOESI, MESIF) support sharing an owned block across private hierarchies; in such cases, both the sharer vector and the node ID are needed to track the block's state. In this paper, the authors seem to assume a simpler protocol in which a dirty block (M state in MESI) or the only copy of a clean block (E state in MESI) in a private hierarchy is considered the owner (although no level of the hierarchy tracks explicit block states, we describe them here in terms of their MESI equivalents). Lastly, an invalid block is one that is not tracked by the MD3, indicated by setting the ownership record to point to main memory.
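
The sketch below shows one way the per-block ownership record could be interpreted into the three cases just described, and when a write would require coherence actions at the MD3. The enum, the fields, and the helper are illustrative assumptions, not the paper's encoding.

```c
#include <stdbool.h>
#include <stdint.h>

enum owner_kind { OWNER_MEMORY, OWNER_LLC, OWNER_NODE, OWNER_SHARED };

struct md3_block_state {
    enum owner_kind owner;
    uint8_t node_id;   /* valid when OWNER_NODE: which node owns the block */
    uint8_t llc_way;   /* valid when OWNER_LLC: way in the LLC data array  */
};

/* Does a write from 'requester' need coherence actions at the MD3? */
bool write_needs_coherence(const struct md3_block_state *st, uint8_t requester)
{
    switch (st->owner) {
    case OWNER_SHARED:  return true;                      /* must invalidate sharers   */
    case OWNER_LLC:     return true;                      /* must transfer ownership   */
    case OWNER_NODE:    return st->node_id != requester;  /* fine if already the owner */
    case OWNER_MEMORY:  return true;                      /* must fetch and allocate   */
    }
    return true;
}
```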

We next describe the operational details of the D2M cache. Within the private hierarchy, most operations are unmodified. The D2D part of the design is extended with the ability to react to coherence requests coming from the MD3. A coherence request may change the location information of a block, which is easily implemented by updating both the MD2 and the MD1 (MD2 entries contain pointers to MD1 entries), provided the requested address is actually present (recall that the sharer vector in the MD3 is only a coarse-grained approximation). In addition, if the private hierarchy is the current owner of the block, a coherence request may downgrade the block by forcing a transfer of ownership back to the LLC with a write back, and by updating the location information to recognize the LLC as the owner (the block can still be cached locally, and the location indicator can still point to the local copy).
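
A rough sketch of how a private hierarchy might react to a downgrade request from the MD3 is given below. The structure and function names are placeholders, and the exact sequencing of the write back versus the metadata update is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCKS_PER_REGION 64

struct block_loc { bool valid; bool local_owner; };

struct md1_entry { struct block_loc loc[BLOCKS_PER_REGION]; };

struct md2_entry {
    bool md1_valid;
    struct md1_entry *md1_ptr;        /* back-pointer into the MD1 */
    struct block_loc loc[BLOCKS_PER_REGION];
};

/* Placeholder hooks; real hardware would perform these actions. */
extern struct md2_entry *md2_lookup(uint64_t paddr);
extern void write_back_to_llc(uint64_t paddr);

void on_downgrade_request(uint64_t paddr, int block_idx)
{
    struct md2_entry *e = md2_lookup(paddr);

    /* The MD3 sharer vector is approximate, so the address may not
     * actually be present here; in that case there is nothing to do. */
    if (e == NULL || !e->loc[block_idx].valid)
        return;

    if (e->loc[block_idx].local_owner) {
        /* Hand ownership back to the LLC with a write back.  The data
         * can stay cached locally; only the ownership changes. */
        write_back_to_llc(paddr);
        e->loc[block_idx].local_owner = false;
    }

    /* Keep the MD1 copy of the location information consistent. */
    if (e->md1_valid)
        e->md1_ptr->loc[block_idx] = e->loc[block_idx];
}
```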

To eliminate lookups for blocks that are known not to be shared by any other private hierarchy, each region entry in the private cache is also extended with a "private" bit. This bit is set by the MD3's response to a previous miss if the region is not shared by any other private hierarchy, and it can be revoked by the MD3 if any block in the region transitions to shared state; this is analogous to downgrading an E state block. A block in a region with the "private" bit set needs no coherence actions on writes, similar to the silent upgrade of an E state block in MESI.
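
The "private" bit fast path for stores could be expressed as in the sketch below; the function names are invented, and the point is simply that a store to a private region skips the MD3 round trip.

```c
#include <stdbool.h>
#include <stdint.h>

struct region_entry {
    bool private_region;     /* set by an MD3 response; revoked when the
                                region becomes shared by another node   */
};

/* Placeholder hooks for the actual store path and coherence channel. */
extern void write_locally(uint64_t addr);
extern void request_write_permission_from_md3(uint64_t addr);

void handle_store(struct region_entry *r, uint64_t addr)
{
    if (r->private_region) {
        /* No other node can hold the block: write without coherence,
         * like a silent E->M upgrade in MESI. */
        write_locally(addr);
    } else {
        /* Potentially shared: go through the MD3 as usual. */
        request_write_permission_from_md3(addr);
        write_locally(addr);
    }
}
```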

When a request misses the private hierarchy (the block is not cached, for reads, or the hierarchy is not the owner, for writes), the request is sent to the MD3. The MD3 acts as the coherence controller and may issue coherence downgrades to other private hierarchies before responding with block data. The coherence messages also update the location indicators of blocks in each private hierarchy, such that these indicators always point to the current owner of the block, if the hierarchy does not cache the block, or to the local copy (with the private bit cleared), if there is one.
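
Putting the pieces together, the MD3's handling of a miss might follow a control flow like the sketch below. This is an assumed flow that matches the description above (downgrade the current owner, then respond and update the ownership record); the paper does not spell out the protocol at this level of detail.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_NODES 8

enum owner_kind { OWNER_MEMORY, OWNER_LLC, OWNER_NODE };

struct md3_block_state {
    enum owner_kind owner;
    uint8_t node_id;       /* valid when OWNER_NODE                       */
    bool    shared;        /* true if the block is in the shared state    */
    uint8_t sharers;       /* approximate sharer vector (per-region in the
                              real design, per-block here for simplicity) */
};

/* Placeholder hooks standing in for the actual message channels. */
extern void send_downgrade(uint8_t node, uint64_t paddr);
extern void respond_with_data(uint8_t requester, uint64_t paddr, bool grant_ownership);
extern void fetch_from_memory_into_llc(uint64_t paddr);

void md3_handle_miss(struct md3_block_state *st, uint8_t requester,
                     uint64_t paddr, bool is_write)
{
    if (st->owner == OWNER_NODE && st->node_id != requester) {
        /* Another private hierarchy owns the block: pull the ownership
         * back (with a write back) before serving the requester. */
        send_downgrade(st->node_id, paddr);
        st->owner = OWNER_LLC;
    } else if (st->owner == OWNER_MEMORY) {
        fetch_from_memory_into_llc(paddr);
        st->owner = OWNER_LLC;
    }

    if (is_write) {
        /* Invalidate the (possibly over-approximated) sharers, then
         * hand exclusive ownership to the requester. */
        for (uint8_t n = 0; n < NUM_NODES; n++)
            if (((st->sharers >> n) & 1) && n != requester)
                send_downgrade(n, paddr);
        st->sharers = 0;
        st->shared  = false;
        st->owner   = OWNER_NODE;
        st->node_id = requester;
        respond_with_data(requester, paddr, /*grant_ownership=*/true);
    } else {
        /* Read: the LLC remains the owner; record the new sharer. */
        st->shared   = true;
        st->sharers |= (uint8_t)(1u << requester);
        respond_with_data(requester, paddr, /*grant_ownership=*/false);
    }
}
```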

Block evictions are handled similarly within a private hierarchy. The paper briefly suggests adding a "replacement pointer" (RP) to aid the metadata updates in the MD1/MD2 during replacement, but does not describe how this works.