Efficiently Enabling Conventional Block Sizes for Very Large Die-Stacked DRAM Caches



Paper Title: Efficiently Enabling Conventional Block Sizes for Very Large Die-Stacked DRAM Caches
Link: https://dl.acm.org/citation.cfm?id=2155673
Year: MICRO 2011
Keyword: L4 Cache; DRAM Cache; LH Cache




This paper presents a design for a DRAM-based L4 cache implemented with Die-Stacked DRAM. Conventional off-chip DRAM is not suitable for implementing a cache, because probing a DRAM cache that is roughly as slow as main memory before going to the home location can only add latency. Die-Stacked DRAM, however, makes DRAM caching feasible due to its lower access latency. Previous research reported that the latency of Die-Stacked DRAM ranges from one half to one quarter of the latency of conventional DRAM, which opens up the possibility of using on-chip Die-Stacked DRAM to implement an additional L4 cache between the LLC and conventional DRAM.

This paper points out, however, that due to the large capacity of a DRAM cache, storing the full set of tags on chip is difficult. For example, the paper reports that a 1GB DRAM cache needs a 96MB tag store (i.e. 9.4% storage overhead), which already exceeds the amount of fast SRAM implementable with today's technology (the paper was written in 2011, but I believe even in 2019 this amount of SRAM is either impossible or extremely slow/expensive). The paper also discusses several existing solutions to the tag store problem.

The first solution is to increase the block size, and hence reduce the number of tags needed for the same amount of cached data. The problem is that large blocks (e.g. 4KB) must be read from and written back to their home locations as an indivisible unit. Without sufficient spatial locality, which is typically lacking in lower level caches because locality has already been filtered out by higher level caches, only a few 64-byte blocks in the large unit are actually used, wasting bandwidth and increasing contention on the memory bus (a larger block takes longer to transfer).

The second solution is a sector cache, in which the block granularity is unchanged, but multiple blocks, collectively called a "sector", are grouped together and associated with a single tag. Blocks can still be transferred and evicted at the 64-byte granularity used by on-chip caches, and a bit vector per sector indicates which blocks are present and which are not. The problem with the sector cache is that it assumes spatial locality of access, i.e. if a block is brought into the cache by an access, later accesses are expected to touch nearby addresses. As argued above, this assumption does not hold well for lower level caches due to the filtering effect of higher level caches, so sector caches see higher miss rates than a regular cache of the same size.

The last solution is to store the tags in a separate region of the DRAM organized as a tag array. Before data in the DRAM cache is accessed, we first read the tag array to locate the data block (or detect a miss), and then issue a second access to read the data block. The problem with this method is that three DRAM accesses are incurred for every cache hit: the first two read the tag and data store, and the last one updates the LRU state. Although the last access is not on the critical path of a cache hit, the extra DRAM update request increases contention at the memory controller, delaying later requests.
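To sanity-check the 96MB figure, here is a quick back-of-the-envelope calculation in Python. The 64-byte block size and roughly 6 bytes of tag plus metadata per block are my assumptions, borrowed from the row layout described later in this summary, not numbers copied directly from this part of the paper.

```python
# Rough tag-store overhead estimate for a conventional 1GB DRAM cache.
# Block and tag sizes are assumptions consistent with the rest of this summary.

CACHE_SIZE = 1 << 30   # 1 GB of cached data
BLOCK_SIZE = 64        # conventional 64-byte blocks
TAG_BYTES  = 6         # ~6 bytes of tag + metadata per block (assumption)

num_blocks   = CACHE_SIZE // BLOCK_SIZE          # 16M blocks
tag_store    = num_blocks * TAG_BYTES            # bytes of tag storage
overhead_pct = 100.0 * tag_store / CACHE_SIZE

print(f"blocks: {num_blocks}, tag store: {tag_store / (1 << 20):.0f} MB, "
      f"overhead: {overhead_pct:.1f}%")
# -> blocks: 16777216, tag store: 96 MB, overhead: 9.4%
```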

This paper argues that DRAM caches can be made efficient by reducing the latency of accessing the tag store. The paper assumes a 2KB row size for the Die-Stacked DRAM, though other row sizes are also applicable. One important observation is that current generations of DRAM are equipped with a row buffer, which is populated with the bits read from the internal array when a row is activated. The content of the row buffer can be preserved after the access, which enables the next read to "hit" the row buffer if it targets the same row as the one currently open. This "open page" policy creates an opportunity for DRAM caches: if we co-locate tags and data in the same row, both can be accessed with one relatively expensive row activation followed by several cheaper column accesses.

The DRAM cache is organized as follows. We still use the conventional set-associative organization, but with extremely high associativity. The paper suggests storing an entire set, including data, tags and metadata, in a single DRAM row. For 2KB rows, at most 29 data blocks can be stored, which takes 1856 bytes. Assuming 6-byte tags (48-bit physical addresses), we need an extra 174 bytes to store the tags as a compact array, such that they can be "streamed" out of the DRAM with burst reads. The remaining 18 bytes can be dedicated to other metadata such as dirty bits, coherence states, or profiling data.
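The per-row budget above can be reproduced with a few lines of arithmetic; this is just a sketch of the accounting, not the paper's exact field encoding.

```python
# Per-row budget for the tags-in-DRAM organization: one 2KB row holds one set,
# with data blocks, a compact tag array, and a few bytes of leftover metadata.

ROW_SIZE   = 2048   # 2KB DRAM row
BLOCK_SIZE = 64     # 64-byte data blocks
TAG_BYTES  = 6      # 6-byte tags (48-bit physical address)

ways       = ROW_SIZE // (BLOCK_SIZE + TAG_BYTES)   # 29 ways fit per row
data_bytes = ways * BLOCK_SIZE                      # 1856 bytes of data
tag_bytes  = ways * TAG_BYTES                       # 174 bytes of tags
leftover   = ROW_SIZE - data_bytes - tag_bytes      # 18 bytes of metadata

print(ways, data_bytes, tag_bytes, leftover)        # 29 1856 174 18
```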

The operation of the DRAM cache is described as follows. When the L3 cache misses, a probe into the L4 DRAM cache is initiated. First, the set index is computed from the middle bits of the miss address in exactly the same way as in a conventional cache. The target row number is then computed by adding this set index to the base row number of the cache region (configured once at system startup time). After computing the target row, the cache controller sends an activate command to the DRAM cache with the target row number. After the row has been activated, the controller first streams out the array of tags (174 bytes) into an internal TCAM buffer, and then associatively searches the buffer with the remaining bits of the miss address. In the meantime, the cache controller closes the row by writing back the row to the array and precharging the bit lines; the row buffer contents remain valid even after the row is closed. Given a tag match, the data block is located using the index of the matching tag in the tag array, and then read out of the DRAM with another read command. Note that since the row buffer still holds valid contents of the row, this second read hits the row buffer and hence incurs only negligible latency. Updates to coherence states, dirty bits and LRU bits are performed in the background. If no tag in the array matches the miss address, an L4 cache miss is generated, and a read to the conventional DRAM is scheduled.
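The hit path can be summarized with the sketch below. This is illustrative only: the row is modeled as a plain byte string returned by a caller-supplied read_row function, little-endian tag packing is assumed, and valid bits, the early row close, and the background LRU/metadata updates are all omitted.

```python
# Illustrative sketch of the L4 lookup path: one expensive row activation,
# stream out the tag array, associative search, then a cheap column read.

WAYS, TAG_BYTES, BLOCK_SIZE = 29, 6, 64
TAG_ARRAY_BYTES = WAYS * TAG_BYTES          # 174-byte compact tag array

def l4_lookup(miss_addr, num_sets, read_row):
    """read_row(set_index) -> 2KB bytes; a stand-in for activate + column reads."""
    block_addr = miss_addr // BLOCK_SIZE
    set_index  = block_addr % num_sets      # middle bits select the set/row
    target_tag = block_addr // num_sets     # remaining bits form the tag

    row  = read_row(set_index)              # row activation (expensive)
    tags = row[:TAG_ARRAY_BYTES]            # tags are streamed out first

    for way in range(WAYS):                 # associative search of the tag array
        tag = int.from_bytes(tags[way*TAG_BYTES:(way+1)*TAG_BYTES], "little")
        if tag == target_tag:
            off = TAG_ARRAY_BYTES + way * BLOCK_SIZE
            return row[off:off + BLOCK_SIZE]   # row-buffer hit: cheap column read
    return None                             # L4 miss: schedule a read to main memory
```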

Given that a probe into the DRAM cache is a relatively expensive event, taking the latency of a DRAM row activation to resolve, the paper also proposes the MissMap to eliminate unnecessary DRAM cache probes whenever possible. The MissMap is a dedicated on-chip hardware structure for recording block residency. It consists of a set of entries recording which addresses are currently in the DRAM cache and which are not. Each entry covers a "segment" of memory, e.g. 4KB, and uses a single bit per block in the segment to indicate that block's presence. An address tag is also associated with each entry. The MissMap is organized as a conventional cache; in fact, the paper suggests that the MissMap can share the same SRAM storage with the LLC. Because the MissMap represents block residency in a compact manner, a few MBs of storage dedicated to the MissMap can cover hundreds of MBs of DRAM cache.
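To give a sense of the coverage, the calculation below works out how much memory a 2MB MissMap can track, assuming 4KB segments, 64-byte blocks, and roughly 6 bytes of tag per entry. The per-entry tag size is my assumption; the paper's exact encoding may differ.

```python
# Back-of-the-envelope MissMap coverage: each entry tracks one 4KB segment
# with one presence bit per 64-byte block, plus an address tag.

SEGMENT_SIZE = 4096          # memory covered per MissMap entry
BLOCK_SIZE   = 64
TAG_BYTES    = 6             # assumed per-entry tag/metadata size

bits_per_entry = SEGMENT_SIZE // BLOCK_SIZE        # 64 presence bits
entry_bytes    = bits_per_entry // 8 + TAG_BYTES   # 8B vector + 6B tag = 14B

missmap_bytes  = 2 * (1 << 20)                     # e.g. 2MB of SRAM
entries        = missmap_bytes // entry_bytes
coverage_mb    = entries * SEGMENT_SIZE / (1 << 20)

print(f"{entries} entries cover ~{coverage_mb:.0f} MB of memory")
# -> ~150k entries cover roughly 585 MB
```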

When an L3 miss is detected, the cache controller initiates an access to the MissMap. The middle bits of the address are used to locate the set, and an associative search is performed within the set to locate the entry. Assuming a MissMap hit, after the entry is read, the in-segment offset is used to locate the status bit of the target block. If the bit is set, the block is present in the DRAM cache, and the cache controller reads the DRAM cache as described earlier. If the bit is clear, the DRAM cache is not accessed, and the cache controller directly reads the home location. On a MissMap miss, an existing entry is evicted to make space for the new entry. To avoid giving inaccurate block residency information, when an entry is evicted from the MissMap, the blocks it covers are also evicted from the DRAM cache, writing them back if they are dirty. When a block is inserted into or removed from the DRAM cache, the corresponding entry is updated by setting or clearing its bit. Such updates can be moved off the critical path by performing them in the background.
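A minimal sketch of this decision flow, with the MissMap modeled as a Python dict from segment number to a presence bit vector. Entry allocation, eviction of the covered blocks, and the background updates described above are omitted; the read_l4 and read_memory callbacks are hypothetical placeholders.

```python
# Illustrative sketch of miss handling: consult the MissMap first, and only
# probe the DRAM cache when the presence bit says the block is resident.

SEGMENT_SIZE, BLOCK_SIZE = 4096, 64

def handle_l3_miss(addr, missmap, read_l4, read_memory):
    segment = addr // SEGMENT_SIZE
    entry   = missmap.get(segment)              # associative MissMap lookup
    if entry is not None:
        bit = (addr % SEGMENT_SIZE) // BLOCK_SIZE
        if (entry >> bit) & 1:                  # presence bit set:
            return read_l4(addr)                #   block is in the DRAM cache
    return read_memory(addr)                    # clear bit (or MissMap miss,
                                                # handled as allocate + evict
                                                # elsewhere): go to main memory
```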