Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses

Paper Title: Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses
Link: https://dl.acm.org/doi/10.1145/2830772.2830820
Year: MICRO 2015
Keyword: DRAM; GS-DRAM;




Highlight:

  1. Using a butterfly-network-based shuffling scheme to distribute words across chips, such that words at the same offset of W consecutive cache lines are guaranteed to be mapped to different physical locations, avoiding chip conflicts.

  2. Different access patterns can be selected with simple bitwise operations: each chip generates its physical column offset from the logical offset of the line and the pattern ID.

Questions

  1. The paper should mention that the stride cannot be too large (although in typical use cases it will not be), since all the gathered lines must lie within the same DRAM row, i.e., the row buffers of all chips in the rank must hold the same row ID. If the stride is too large, some of the gathered lines may reside in a different row, requiring the chips in the rank to activate different rows (some lines are on row X, and some lines are on row X + 1), complicating the design of the row buffer.

  2. The above problem is actually fatal when some of the gathered lines are at the end of one row and the rest are at the beginning of the next. In that case, even with a small stride, chips in a rank still have to activate different rows.

  3. If the access pattern of a physical address range changes, or the range is deallocated and reallocated to a different process, the DRAM content stored there must be rearranged, since the words within each cache line have been shuffled.

  4. Is it even a useful pattern when the pattern ID is all-ones but the column ID is not aligned on a W-line boundary? In the 32-byte line example, if the pattern ID is 2'b11 and the last two bits of the column ID are 2'b01 (not aligned to a 4-line boundary), the per-chip offsets will be 2'b00 ^ 2'b01 = 2'b01, 2'b01 ^ 2'b01 = 2'b00, 2'b10 ^ 2'b01 = 2'b11, and 2'b11 ^ 2'b01 = 2'b10, which does not seem to correspond to any meaningful access.

  5. The cache controller needs to know the actual data layout in DRAM, since it must expand the pattern ID and address tag into word addresses, which requires decoding the address tag into per-chip offsets (the column address in the paper).

  6. What is the coherence state of a gathered cache line?

  7. Generally speaking, I would say it is better not to cache the gathered cache lines.

This paper proposes Gather-Scatter DRAM, a novel DRAM architecture that enables fast gather and scatter semantics. The paper points out that the current DRAM interface only supports cache-line-granularity access, which makes strided access patterns difficult to serve efficiently. This prevents several common patterns from being implemented efficiently. For example, in matrix manipulation code, accessing multiple matrix elements in the same column requires strided reads whose stride is the size of a matrix row, if the matrix is stored in row-major order where elements of the same row are stored contiguously. Second, in HTAP database applications, tuples can be accessed either transactionally, where only a few tuples are touched, or analytically, where certain fields of a large set of tuples are scanned. In a row-store data layout, scanning one field of all tuples also requires strided accesses to memory, with the stride being the physical size of a tuple. The last example is SIMD, which supports data gathering and scattering with special instructions. These instructions, however, can be inefficient if multiple cache lines need to be fetched from DRAM.

The paper assumes the following DRAM architecture. DRAM storage consists of potentially multiple channels, each supporting several DIMMs. Channels have their own control and data signals. Multiple ranks exist within a channel. All DRAM chips in a rank share the same command and data path and operate as a basic unit of access. This paper assumes that cache lines are always mapped to one rank for simplicity of discussion, but other address mappings can also be supported. Each rank consists of several DRAM chips, and each chip stores part of a cache line. In this paper, it is assumed that each chip provides 8 bytes of data per access. The cache line width is determined by the number of chips in a rank. For 64-byte cache lines, eight chips are present: chip 0 provides bytes 0 - 7, chip 1 provides bytes 8 - 15, and so on, up to chip 7, which provides bytes 56 - 63. Although the internal organization of chips is not important to this paper, it is briefly described: each chip contains several arrays, where each array provides one bit of the chip's output. On a read, an activation first fetches an entire row of bits from each array and caches them in a structure called the row buffer. Although only one bit of the row is accessed per array, the remaining bits of the row are latched in the row buffer to serve future reads without further precharge and activation. The offset of the bits being accessed is called the column address, which identifies the position of the chip's 64-bit chunk within the row buffer.
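As a toy illustration of this byte-to-chip mapping, here is a minimal sketch using the paper's example parameters (64-byte lines, eight chips); the function name is ours:

```python
# Conventional rank organization: 64-byte cache line, 8 chips,
# each chip supplying 8 bytes per access.
LINE_BYTES = 64
CHIPS = 8
BYTES_PER_CHIP = LINE_BYTES // CHIPS  # 8

def chip_for_byte(byte_offset: int) -> int:
    """Which chip supplies the given byte offset of a cache line."""
    return byte_offset // BYTES_PER_CHIP

assert chip_for_byte(0) == 0    # bytes 0 - 7   -> chip 0
assert chip_for_byte(8) == 1    # bytes 8 - 15  -> chip 1
assert chip_for_byte(63) == 7   # bytes 56 - 63 -> chip 7
```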

The above DRAM organization has two important properties that make strided accesses inefficient. First, the bits at a given offset of all cache lines mapped to a rank are always provided by the same chip. Second, all chips in a rank can only serve one request at a time, since there is only one command and data path per rank. Together, these two properties prevent a single access from fetching words at the same offset of two different cache lines, since those words are mapped to the same chip.

This paper solves the above issue by mapping words to chips in a different manner. Conventionally, the word at offset i is always mapped to chip i. The paper proposes a novel data shuffling algorithm that ensures, for a cache line size of W words, that any W adjacent cache lines have their words at offset i mapped to W different chips, enabling parallel access to words at the same offset of adjacent cache lines, which is the most common strided access pattern.

The data shuffling algorithm works as follows. For a cache line at column address C (the column address here can simply be interpreted as the offset of the cache line within the rank), the W words are shuffled before they are stored, according to the bit representation of C. The shuffling algorithm takes the lowest log2(W) bits (3 bits for 64-byte lines) as the control key. The shuffling process consists of log2(W) steps, each of which may swap words within the cache line. In step j, the j-th bit of the control key (counting from the LSB, starting at zero) is checked. If this bit is zero, no shuffling is performed, and the line is passed to the next step. If the j-th bit is set, the shuffling is performed as follows. First, the W words are divided into groups of 2^j words; e.g., with j = 2 and W = 8, this divides the cache line into two groups, the first being words 0 - 3, the second being words 4 - 7. Then each even-numbered group is swapped with the odd-numbered group that follows it. In the above example, words 0 - 3 are swapped to offsets 4 - 7, while words 4 - 7 are swapped to offsets 0 - 3. The data shuffling distributes words evenly across any W consecutive cache lines: no two of their words at the same logical offset are stored on the same chip, enabling efficient strided accesses, since words from different cache lines at the same offset can be read from different chips in parallel. Note that this property only holds for W or fewer cache lines. For more than W lines, it is unavoidable that two words at the same offset map to the same chip. This case, however, is not important, as the DRAM interface only allows accessing W words per operation: even with a strided access pattern, at most W distinct cache lines are touched per access, eliminating any possibility of chip conflict.
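Below is a minimal Python sketch of the staged shuffle as described above (our own reconstruction, not the paper's hardware implementation); the assertion checks the claimed property that, for any W consecutive columns, the word at a given logical offset lands on W different chips:

```python
def shuffle(words, col):
    """Shuffle one cache line of W words before storing it at column `col`.
    Stage j swaps adjacent groups of 2**j words iff bit j of the key is set."""
    W = len(words)                     # power of two; 8 for 64-byte lines
    key = col & (W - 1)                # low log2(W) bits form the control key
    words = list(words)
    j, size = 0, 1
    while size < W:
        if (key >> j) & 1:
            for base in range(0, W, 2 * size):
                for k in range(size):
                    a, b = base + k, base + size + k
                    words[a], words[b] = words[b], words[a]
        j, size = j + 1, size * 2
    return words

# Net effect: the word at logical offset i of the line at column C ends up
# on chip i ^ (C mod W), so any W consecutive columns use W distinct chips.
W = 8
for i in range(W):
    chips = {shuffle(list(range(W)), c).index(i) for c in range(W)}
    assert chips == set(range(W))
```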

In order to access distinct cache lines in one request, the address computation within each chip must also be modified. In the conventional scheme, when the cache line at offset X is requested, all chips access the word stored at their local offset X, and the outputs are assembled into a cache line. With data shuffling, the paper proposes that, given a cache line offset X, each chip reads the word stored at offset ((chip ID & pattern ID) ^ X), where the pattern ID is a log2(W)-bit constant identifying the pattern. When the pattern ID is zero, the rank operates in default mode: every chip accesses its word at the same offset X. When the pattern ID is all-ones, the low bits of the offset are XOR'ed with the chip ID. If these bits are zero, meaning that the cache line itself is on a W-line boundary, the XOR behaves as a bitwise OR, and the words accessed are: the word on chip 0 of line X, the word on chip 1 of line (X + 1), ..., the word on chip 7 of line (X + 7). Under the shuffled layout, the word on chip 0 of physical line X is the first word of logical line X, the word on chip 1 of physical line (X + 1) is the first word of logical line (X + 1), ..., and the word on chip 7 of physical line (X + 7) is the first word of logical line (X + 7), meaning that the all-ones pattern (7 for W = 8) accesses the first word of 8 adjacent lines in one request.
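A small sketch of this per-chip column translation (the function name and constants are ours):

```python
def chip_column(chip_id: int, pattern: int, col: int) -> int:
    """Column accessed by one chip for a read with column address `col`
    and the given pattern ID."""
    return (chip_id & pattern) ^ col

# Pattern 0 is the default mode: every chip reads the same column X.
assert all(chip_column(c, 0b000, 12) == 12 for c in range(8))

# All-ones pattern with X on an 8-line boundary: chip c reads column X + c.
X = 16
assert [chip_column(c, 0b111, X) for c in range(8)] == list(range(X, X + 8))
```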

Other words can be accessed similarly: to gather the second word of the same 8 consecutive lines, the DRAM controller issues the read with column address (X + 1), where X is still aligned to a W-line boundary. In general, to access the word at offset i of lines X through (X + W - 1), the controller issues the read with column address (X + i) and the all-ones pattern ID; each chip then XORs its chip ID into the low bits of this address to obtain its physical column, and chip c ends up returning word i of line (X + (c ^ i)).
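Putting the two pieces together, here is a toy end-to-end gather that reuses the `shuffle` and `chip_column` sketches above (the memory model is our own illustration):

```python
W, X, PATTERN = 8, 16, 0b111   # X is W-aligned

# Toy memory: column c stores the shuffled form of logical line c, whose
# word at offset w is represented by the pair (c, w).
memory = {c: shuffle([(c, w) for w in range(W)], c) for c in range(X, X + W)}

for i in range(W):
    # The controller issues column address X + i; chip c reads column
    # (c & PATTERN) ^ (X + i) and contributes the word at its own position.
    gathered = [memory[chip_column(c, PATTERN, X + i)][c] for c in range(W)]
    # One request returns word i of each of the lines X .. X + W - 1.
    assert sorted(gathered) == [(X + k, i) for k in range(W)]
```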

At the operating system level, if a page is expected to be accessed in strided patterns, the OS should mark the page property in the page table and include the pattern ID in memory requests. The returned cache line is stored in the cache hierarchy as a physical line tagged with the pattern ID. Cache coherence is handled by checking both conventional lines and patterned lines: the cache controller expands the pattern ID and base address tag into the word addresses the line covers, which are then checked against coherence messages. Since more than one copy of a word can exist in the same cache, the cache controller should also check whether an address is cached in multiple locations when performing a write to a non-dirty location.
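As a rough sketch of how such an expansion might look under the same toy model as above (this helper is our own illustration, not the paper's design):

```python
def covered_words(col: int, pattern: int, W: int = 8):
    """Expand a patterned line's tag (column address + pattern ID) into the
    (line, word-offset) pairs it covers, for matching coherence messages."""
    for chip in range(W):
        line = (chip & pattern) ^ col            # column this chip read from
        word = chip ^ (line & (W - 1))           # logical word the chip holds
        yield line, word

# All-ones pattern, column 17 (X = 16, i = 1): word 1 of lines 16 .. 23.
assert sorted(covered_words(17, 0b111)) == [(16 + k, 1) for k in range(8)]
```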