RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization



- Hide Paper Summary
Paper Title: RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization
Link: https://dl.acm.org/doi/10.1145/2540708.2540725
Year: MICRO 2013
Keyword: RowClone; DRAM



Back

Highlight:

  1. Very simple concept: When the SRAM row buffer contains meaningful data, connecting the SRAM output to the bit lines of DRAM cells will overwrite cell content. This is part of the DRAM read protocol, but can also be utilized to duplicate a row.

  2. When two rows are not in the same bank, we just perform two back-to-back row activations, and copy from the source row buffer to the destination row buffer. The destination row buffer will overwrite the destination row in the same way as in the previous point.

  3. The copy granularity (DRAM row for FPM, 64 byte blocks for PSM) makes OS CoW easy since the copy granularity of this operation is also large.

  4. When copying or initializing data, the destination addresses in the hierarchy should be invalidated to ensure a coherent memory image. One optimization is to duplicate source blocks and rewrite the address tag to be the destination block.

Comments:

  1. The cache block duplication and rewrite may not be as easy as it seems to be. There are a few complexities. First, in an inclusive hierarchy, how do you maintain inclusiveness? Second, if only part of the source range is cached, how do you deal with those that are not cached? I am not saying this is impossible. Rather, I am just saying that this optimization might require significantly more hardware resource than suggested in the paper (in fact, this mechanism itself deserves a separate research paper, e.g., cache scrubbing).

  2. Although not the fault of the author, I am really curious about the mechanism that the row buffer and sense amplifier can rewrite contents in DRAM cells. My best guess is that, the SRAM row buffer is driven by CMOS’s Vdd and ground line, so it can actually charge and discharge the capacitor, if connected to the bit line.

This paper proposes RowClone, a fast data copy and initialization mechanism built into DRAM chips. The paper points out that, on current memory architecture, data copy and initialization (specifically, zeroing out a block) are two of the most commonly performed tasks, but yet very little optimizations exist. On most platforms, these two tasks are performed by having the processor issuing instructions to fetch data into the hierarchy, and then write it back to a different location. Some implementations optimize this using special instructions provided by the ISA for this exact purposes. Two possible optimizations exist on today’s x86. The first one is SIMD instructions, with which more than 64 bit data can be read from the an address into an SIMD register, and written to another address. In addition, with non-temporal loads and stores, these instructions may bypass the cache hierarchy, and directly access DRAM to avoid cache pollution and the extra overhead of cache flushing. The second optimization is called Efficient MOVSB/STOSB (ERMSB), which takes advantage of existing 8086 string instructions and its prefixed form (REP prefix) as a hint to the microarchitecture that a string operation is being performed, under which certain memory ordering constraints can be relaxed.

Despite the best efforts, current microarchitectural solutions are sub-optimal, since they all inevitably incur the extra latency and bandwidth overhead of having to transfer data between the core pipeline and the main memory. The paper observes that these overheads are unnecessary, since data copy and initialization can be performed within the DRAM chip, with only minor modifications to the interval logic.

The paper assumes the following DRAM operation model. A DRAM device consists of banks and subarrays. Although the actual physical device may have higher level abstractions such as channels and ranks, they are nothing more than just a bunch of banks organized together for addressing and resource sharing purposes, and can be decomposed into a set of banks. Each bank consists of several subarrays, in which only one of them can be activated (this is an important factor that restricts the design space, as we will see below). Subarrays are accessed in the unit of rows, the typical size of which are several KBs (8KB in this paper). Each subarray has a row buffer, which stores the content of a row after activation. The row buffer is an important component on RowClone’s design as it serves as a temporary data buffer between two activations.

The access sequence of DRAM assumed by this paper is described as follows. Before each access, the row index is generated using the physical address, and a word line is raised to select the row to be accessed. The bit lines of the row are opened by the word line signal enabling the access transistor connecting the capacitor and the signal output to the sense amplifier. The sense amplifier then detects the charge at each capacitor, amplifies them to the normal logic level, and latch them into the row buffer. Meanwhile, the values are also written back to the capacitors as they are latched by the sense amplifiers. The latter is critical to the design of RowClone, as we will see later, this implies that if we activate another row in the same subarray, the content of the current row will be written back by the sense amplifier as well, overwriting the existing content in the new row. Data read and write are performed within the row buffer, and therefore, reads and writes have the same sequence of commands. After each activation, the sense amplifiers are precharged for the next access to a different row. This step, however, is unrelated to RowClone’s protocol, and we do not cover it in details.

The simplest form of RowClone, called Fast Parallel Mode (FPM), is built upon an important observation made in the last section: When a row is in activated state, its content will be overwritten by what is currently stored in the row buffer. This is because the row buffer consists of SRAM cells which are driven by Vdd and ground lines to represent 0 and 1 respectively. DRAM cells, on the other hand, are merely capacitors, which will be charged or discharged if the bit line is open and is connected to either Vdd or ground. Activation without precharging will precisely achieve this effect: The DRAM cells are directly connected to the SRAM output of the row buffer, hence overwriting the current content of the cell with the content in the row buffer. If the row buffer contains data previously read from another row, this essentially copies the content of a row to another in two back-to-back activations without any precharge in-between.

The design of FPM is hence trivial: When a row copy command is received, the memory controller issues one activation command to the source row, which reads its contents into the row buffer and restoring the DRAM cell at the same time. Then the memory controller activates the destination row without any precharge, overwriting the destination row with the content of the row buffer.

The limitation of FPM is also obvious: It only supports transferring a line within the same subarray. If the source and destination rows are not in the same subarray, the command cannot be executed, which will be dropped by the memory controller.

The paper then proposes a more generous scheme, the Pipelined Serial Mode (PSM). Under PSM, two banks are activated back-to-back according to the source and destination addresses. The row in the source bank is first activated, with its contents read into the row buffer. Then the source row is closed and pre-charged, and meanwhile, the destination row in a different bank is activated. In the last step, the source bank’s row buffer is copied to the destination bank in the granularity of 64 bytes blocks, overwriting the destination row’s contents.

The PSM mode is a general mode that supports data copy between any two arbitrary banks (but not within the same bank). The granularity of copying is also more flexible, since it utilizes the internal bus of the DRAM device between banks, which can transfer up to 64 bytes of data, at the cost of lower throughput per cycle (one block versus one row). Inter-subarray copy within the same bank is also performed via PSM mode. The memory controller should over-provision an extra row buffer per bank. Inter-subarray copy is implemented by first copying the source blocks to the extra buffer via PSM mode, and then copying the blocks to the destination row buffer. For zero-initialization, the paper proposes that one row is reserved for each subarray whose contents is hardwired to zero. Zero-ing out a block or a row is just copying the zero row to the destination. This incurs a small overhead, but overall it is quite acceptable since subarrays normally have hundreds of rows.

The paper then proposes an microarchitectural level extension to utilize the hardware extension. First, two new instructions are added to the ISA: memcpy and meminit. These two instructions accepts the range to be copied/zeroed-out and the size as operands, and translates them to one or more FPM/PSM commands issued to the memory controller. Second, at cache coherence level, since the copy or initialization operation bypasses the cache hierarchy, affected cache lines in the destination addresses need to be invalidated, and dirty blocks in the source ranges need to be flushed back to the DRAM to ensure a consistent view between DRAM and the cache, both before and after the copy. One optimization can reduce the number of flushes, though, by creating in-cache copies of the destination blocks after invalidating the original blocks. The contents of these blocks are source blocks on the corresponding offset (for memcpy), or simply zeros (for meminit). Lastly, since PSM operations are likely non-atomic, the memory controller should block accesses to the destination address range during the copy or initialization operation.

At application level, Operating Systems may use RowClone to perform copy-on-write. This brings a great advantage to existing implementation, since CoW always copies a page, and therefore is always row-aligned. The DRAM chip should expose its internal address translation scheme as well as row size to the OS via a register on the memory controller. The OS should then use these information to allocate physical pages during CoW such that copies are performed on the same subarray whenever possible. Other applications, such as memset()/memcpy()/memmove() library calls, page migration, application snapshotting, CPU-GPU communication, virtual machine cloning, etc., can also leverage the two primitives.