Stash: Have Your Scratchpad and Cache It Too



Paper Title: Stash: Have Your Scratchpad and Cache It Too
Link: https://dl.acm.org/doi/10.1145/2749469.2750374
Year: ISCA 2015
Keyword: Scratchpad memory; Stash; GPGPU




Highlights:

  1. This paper strikes a balance between cache memory (software-transparent, coherent) and scratchpad memory (software-managed, non-coherent) by proposing a scratchpad design that remains fine-grained and software-managed, but adds on-demand load and eviction plus hardware coherence support. This retains the efficiency of scratchpad memory while removing usability barriers such as software-managed coherence and explicit data loading/eviction.

  2. Scratchpad memory does not need per-word address tracking. For GPGPU applications, it is sufficient to support strided patterns over compact 2D tiles. Address generation then requires only a few size arguments, from which we can compute an element's offset in the global address space given its offset in the scratchpad memory.

Comments:

  1. This paper essentially describes a tag-less, word-granularity virtually addressed (VA) cache, where the application explicitly manages the data storage and uses private address pointers. While this is a good balancing point between pure cache and pure scratchpad designs, the description in the paper is over-complicated, since the authors are basically describing how an on-demand cache works without calling it a cache.

  2. The TLB/RTLB seem like heavyweight components that deserve more attention than the paper gives them. How are the TLB/RTLB loaded? Is there a page walker? Is there a dedicated page table? Does the application initialize them (likely not)? On second thought, the TLB can grab entries from the GPU's TLB, and the RTLB contains the same information as the TLB, so it does not need any page walk either.

  3. The paper does not describe how the RTLB works beyond saying that it provides reverse translation. I understand that it is used to translate the PA in coherence requests back to a VA, since Stash is essentially a tag-less VA cache (and an RTLB is standard for VA caches). But since Stash has no tag array, how do you locate the word even given the VA? The stash map stores the information for address generation, true, but that is only easy to compute in the forward direction (given an offset, compute the VA). I did not see how to easily determine whether a word in Stash corresponds to a given VA. Never mind, I get it: the information is tracked in the coherence state and sent to Stash in the coherence request. But that raises another question: per-word (per-chunk) stash map info is already tracked for lazy write backs, so why does it also need to be carried in the coherence request?

This paper proposes Stash, a globally addressable, cache-coherent scratchpad memory architecture for GPGPUs. The work is motivated by the fact that existing commercial implementations of scratchpad memory have limited efficiency and usability due to their private, non-coherent address space. Stash addresses these issues by letting the scratchpad translate its private addresses to and from global virtual addresses, so that the scratchpad can be loaded on demand like a traditional cache, while remaining easy to access by retaining the direct addressing and fine-grained access of existing scratchpad memory.

The paper recognizes that both caches and scratchpad memories have limitations. A cache requires expensive address translation and tag lookups for every operation, which costs extra energy on top of the data access itself. Furthermore, a cache has a fixed-size block interface, meaning data must be transferred between the cache and main memory in units of blocks, which wastes both storage and bandwidth when only a small portion of each block is used. Such strided patterns are common in certain GPGPU applications, for example when a single field of an array of structs (AoS) is accessed in parallel.
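
To make this concrete, here is a minimal CUDA sketch of the pattern (the `Particle` layout and kernel are hypothetical examples of mine, not from the paper): each thread touches one 4-byte field of a 32-byte object, so only a fraction of every fetched cache line is useful.

```cuda
// Hypothetical AoS layout: 32 bytes per object.
struct Particle {
    float x, y, z;        // position
    float vx, vy, vz;     // velocity
    float mass, charge;
};

// Each thread reads/writes a single 4-byte field, so a 64-byte cache line
// fetch brings in only two useful words.
__global__ void scale_mass(Particle *ps, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        ps[i].mass *= k;  // strided access: one word per 32-byte object
}
```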

Scratchpad memory, on the other hand, overcomes these limitations by allowing direct, fine-grained addressing into its storage, eliminating the tag array and the block interface entirely. It suffers, however, from two other kinds of inefficiency. First, current scratchpad implementations have their own isolated address space, meaning that applications must explicitly move data into the scratchpad by issuing memory instructions. These explicit instructions pollute the cache with data that is unlikely to be reused, since most operations will take place in the scratchpad. Second, scratchpad memory is not cache coherent, so software must issue extra load and write-back instructions to ensure data correctness. For example, data updated in the scratchpad must be reflected back to global memory by software issuing explicit data movement instructions. Similarly, if a data item is updated by the CPU, software must check whether the version stored in the scratchpad is still up to date. Both requirements add software and runtime complexity.
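
Using CUDA shared memory as a stand-in for a software-managed scratchpad, the burden looks roughly like the sketch below (my own illustration, not code from the paper): the application must stage data in, synchronize, and write results back explicitly, and the staging loads travel through, and pollute, the cache hierarchy.

```cuda
#define TILE 256

// Sketch: the grid is assumed to be sized so that every index is in range.
__global__ void scale_tile(float *g) {
    __shared__ float tile[TILE];          // the software-managed scratchpad
    int i = blockIdx.x * TILE + threadIdx.x;

    tile[threadIdx.x] = g[i];             // explicit load into the scratchpad
    __syncthreads();

    tile[threadIdx.x] *= 2.0f;            // all compute hits the local copy

    __syncthreads();
    g[i] = tile[threadIdx.x];             // explicit write back: no hardware
                                          // coherence keeps g[] up to date
}
```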

Stash combines the advantages of both cache and scratchpad memory, while avoiding their shortcomings, by adding an extra address translation layer between the scratchpad address space and the global address space. Stash optimizes for the case where a field of a struct (one or a few words in size) in an AoS is accessed, and the access follows a strided pattern within one rectangular block of a 2D tile. The pattern can therefore be described with a few size arguments, including the field size, the object size, the total 2D tile size, the row size of a single block, and the number of strides per block; offsets within the 2D tile can be generated from these size arguments and the base address of the tile, assuming that all elements of the tile are stored compactly. Applications initialize the stash hardware with this information through two interfaces, AddMap and ChangeMap, whose calls are inserted by the compiler. The stash hardware stores these map entries in a table called the stash map, which is consulted when the application accesses the scratchpad.
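
A minimal sketch of the forward address generation, assuming the tile's elements are stored compactly in the stash; the struct fields and names below are my reconstruction of the size arguments, not the paper's exact interface:

```cuda
#include <cstdint>

// Assumed stash map entry: enough information to map a compact tile of one
// struct field back to its (strided) layout in the global address space.
struct StashMapEntry {
    uintptr_t base_va;      // global VA of the tile's first field element
    uint32_t  field_size;   // bytes of the accessed field
    uint32_t  object_size;  // bytes per struct (stride between fields)
    uint32_t  tile_width;   // field elements per tile row
    uint32_t  row_size;     // bytes per row of the global 2D structure
};

// Forward translation: byte offset in compact stash storage -> global VA.
uintptr_t stash_to_va(const StashMapEntry &e, uint32_t stash_offset) {
    uint32_t elem   = stash_offset / e.field_size;  // which field element
    uint32_t within = stash_offset % e.field_size;  // byte within the field
    uint32_t row    = elem / e.tile_width;          // tile row
    uint32_t col    = elem % e.tile_width;          // tile column
    return e.base_va + (uintptr_t)row * e.row_size
                     + (uintptr_t)col * e.object_size + within;
}
```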

We now describe the operation of Stash. As in current scratchpad implementations, the application first allocates a chunk of contiguous storage from the scratchpad before using it. The application also initializes the stash map with the access pattern, so that the stash hardware can generate virtual addresses from the size arguments and base virtual address, given an offset into the scratchpad storage. The stash map entry is invalidated at the end of the kernel computation to free resources used by the entry, such as scratchpad storage and TLB/RTLB entries (see below). Applications still explicitly manage scratchpad storage and use scratchpad-local pointers to access the private address space. The instructions that access stash storage are extended with an extra operand that specifies the access pattern by referring to a stash map entry. Applications, however, no longer load data from the global virtual address space into the scratchpad explicitly, nor do they explicitly evict dirty data to maintain coherence. Instead, Stash operates like a conventional on-demand cache: it loads data automatically from the global address space on an access miss, and automatically responds to coherence requests coming from other components in the cache hierarchy. To accomplish this, each word in the stash carries an extra valid bit and dirty bit. The valid bit is set when a word is loaded from global memory; if this bit is not set on an access, a miss is signaled, triggering a global memory access. The dirty bit is set when the word is written, and determines whether the word must be written back on eviction or on a coherence request.
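
A behavioral sketch of the per-word state (structure and names are assumptions of mine; real hardware would evaluate this in parallel rather than in code):

```cuda
#include <cstdint>

struct WordState { bool valid = false, dirty = false; };

struct StashBank {
    uint32_t  *data;   // word-granularity scratchpad storage
    WordState *state;  // one valid/dirty pair per word

    // va is the map-generated global virtual address for word w.
    uint32_t read(uint32_t w, uintptr_t va) {
        if (!state[w].valid) {                            // miss: fetch on demand
            data[w] = *reinterpret_cast<uint32_t *>(va);  // models the global fill
            state[w].valid = true;
        }
        return data[w];
    }

    void write(uint32_t w, uint32_t v) {
        data[w] = v;
        state[w].valid = state[w].dirty = true;  // dirty => write back on
    }                                            // eviction/coherence request
};
```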

On an access miss, the stash hardware generates the virtual address of the access using the information in the stash map entry named by the access instruction's operand. The paper proposes adding a TLB to translate the virtual address to a physical address, so that the stash hardware can directly issue coherence requests with physical addresses to acquire the requested data. This TLB can share physical hardware with the core's TLB, or it can be a separate structure. The paper also proposes a reverse TLB (RTLB) that translates physical addresses back to virtual addresses, so that coherence requests from the hierarchy can be expressed in the virtual addresses Stash uses for its internal address generation. Each TLB and RTLB entry contains a back pointer to the stash map entry whose generated addresses it covers. When a stash map entry is invalidated at the end of a kernel computation, the corresponding TLB and RTLB entries are invalidated as well, since they are no longer useful. Because stash map invalidations happen only at kernel boundaries, which are relatively infrequent, this can be done by walking the TLB and RTLB and invalidating entries whose back pointer matches the stash map entry being invalidated. Dirty data belonging to the map entry is also written back to global memory on invalidation, just as a normal cache writes back data, only at a smaller granularity.
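
The invalidation walk might look like the following sketch; the entry layout and names are assumptions for illustration:

```cuda
#include <cstdint>

struct TlbEntry {
    bool     valid;
    uint64_t vpn, ppn;    // virtual/physical page numbers
    uint32_t map_index;   // back pointer to the owning stash map entry
};

// Invalidate all TLB/RTLB entries backing stash map entry m. The linear walk
// is acceptable because it runs only at kernel boundaries.
void invalidate_map_entry(TlbEntry *tlb, TlbEntry *rtlb, int n, uint32_t m) {
    for (int i = 0; i < n; i++) {
        if (tlb[i].valid  && tlb[i].map_index  == m) tlb[i].valid  = false;
        if (rtlb[i].valid && rtlb[i].map_index == m) rtlb[i].valid = false;
    }
    // Dirty words owned by map entry m are also written back here, like a
    // normal cache write back but at word granularity.
}
```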

Stash supports lazy write back of dirty data after a kernel computation completes: the invalidation of stash map entries (and TLB/RTLB entries) is not performed immediately. Instead, Stash writes dirty data back on demand and frees resources only when the same scratchpad storage is allocated to another kernel, or when the TLB/RTLB needs a free entry but cannot find one. To let Stash identify which stash map entry a data word belongs to, the paper adds a per-word back pointer to the stash map entry, analogous to the TLB/RTLB back pointers, which is walked when a map entry is evicted on demand. The pointer can also be kept at a coarser granularity, such as per 64-byte chunk, to reduce metadata overhead.
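
A sketch of the lazy reclamation path under these rules, assuming 64-byte chunks of 4-byte words; all structures and names here are hypothetical:

```cuda
#include <cstdint>

constexpr uint32_t WORDS_PER_CHUNK = 16;  // 64-byte chunks of 4-byte words

struct Chunk {
    uint32_t map_index;               // back pointer: owning stash map entry
    bool     dirty[WORDS_PER_CHUNK];  // per-word dirty bits
};

// Invoked only when the scratchpad storage is reallocated to another kernel,
// or when the TLB/RTLB cannot find a free entry; never eagerly at kernel end.
void lazy_reclaim(Chunk *chunks, uint32_t n_chunks, uint32_t stale_map) {
    for (uint32_t c = 0; c < n_chunks; c++) {
        if (chunks[c].map_index != stale_map) continue;
        for (uint32_t w = 0; w < WORDS_PER_CHUNK; w++) {
            if (chunks[c].dirty[w]) {
                // ...write the word back using the stale entry's address
                // generation info, then clear the dirty bit...
                chunks[c].dirty[w] = false;
            }
        }
    }
}
```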

Another important feature of Stash is that it is cache coherent. Values updated in Stash must be registered as modified in the underlying directory, and external coherence requests received by Stash must be answered properly. Both are possible because the TLB and RTLB translate addresses between the global address space and the scratchpad's private address space. Although this may sound similar to how a regular cache controller handles coherence, the paper points out a few differences. First, Stash operates at a smaller granularity than most caches, so the underlying protocol must be able to register the coherence status of individual words. Second, Stash writes back data at word granularity, so the protocol must be able to merge a word into the standard-sized block used by other components. Lastly, the coherence protocol must also track, per word, the stash map index the word belongs to, which is sent to Stash in each coherence request. This information locates the stash map entry and the data word when a request arrives; it is necessary because otherwise it would be difficult to find the target of the request from the RTLB-provided virtual address alone.
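
To see why the map index suffices, note that reverse lookup becomes plain arithmetic once the map entry is known: the request's map index selects the entry, the RTLB supplies the VA, and inverting the forward address generation yields the stash offset. A sketch of this inversion (my reconstruction, mirroring the forward sketch above):

```cuda
#include <cstdint>

struct StashMapEntry {     // same assumed fields as the forward sketch
    uintptr_t base_va;
    uint32_t  field_size, object_size, tile_width, row_size;
};

struct CoherenceRequest {
    uint64_t pa;           // physical address of the requested word
    uint32_t map_index;    // per-word map index tracked by the protocol
};

// Inverse translation: global VA (from the RTLB) -> byte offset into the
// stash's compact storage, using the map entry named by the request.
uint32_t va_to_stash(const StashMapEntry &e, uintptr_t va) {
    uintptr_t d    = va - e.base_va;              // byte offset within the tile
    uint32_t  row  = (uint32_t)(d / e.row_size);
    uint32_t  rem  = (uint32_t)(d % e.row_size);
    uint32_t  col  = rem / e.object_size;
    uint32_t  byte = rem % e.object_size;         // byte within the field
    return (row * e.tile_width + col) * e.field_size + byte;
}
```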