Tvarak: Software-Managed Hardware Offload for DAX NVM Storage Redundancy



Paper Title: Tvarak: Software-Managed Hardware Offload for DAX NVM Storage Redundancy
Link: http://users.ece.cmu.edu/~rkateja/pubs/tvarak.pdf
Year: CMU Technical Report 2019
Keyword: NVM; Tvarak; Checksum; Redundancy




This paper presents Tvarak, a hardware design for NVM redundancy management. The paper points out that redundancy is critical for certain types of NVM applications to guard against firmware bugs and random data corruption, and it identifies several corruption modes, such as unexpectedly dropped writes, corrupted mapping tables, and random bit flips. The classical defense against these problems is to compute a checksum for each 4KB page in the page buffer before it is written back to the disk; the checksum is stored in a separate bank or device to minimize the chance of correlated failures. With NVM direct access (DAX) enabled, however, this scheme no longer works, since the DAX address space is mapped directly onto the NVM device's physical address space rather than onto a page buffer. The consequence is that the OS no longer knows which pages in the DAX address space are being modified and which ones are written back, since both the dirty bits and the write-back mechanism are now controlled by the hardware.

Tvarak solves this problem by adding a redundancy controller at the LLC level, above the device firmware (such that a malfunctioning firmware cannot destroy the redundancy data). The redundancy controller performs two tasks. The first task is to verify the integrity of data when a cache-line-sized block is fetched from the NVM: an error is raised if the stored checksum of the block (or of a larger block, depending on the granularity) does not match the computed checksum. The second task is to update the checksum when a dirty block is written back to the NVM device: a new checksum is computed from the diff between the dirty line and the clean line on the device. Note that one invariant maintained by Tvarak is that the checksums stored on the device must always match the data image on the device, which requires that the checksum be updated atomically with the data. Power failure is not a concern here, since the paper assumes a battery-backed power source with enough energy to write back everything in the cache to the data and checksum devices.
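To make the two tasks concrete, here is a minimal C sketch of the controller logic. The function names are mine and the `crc32c` helper is assumed to exist (a complete implementation appears in the demo further below); the paper does not give pseudocode.

```c
#include <stdint.h>
#include <stddef.h>

#define LINE_SIZE 64

// Assumed helper: CRC32C over a buffer (full implementation shown later).
uint32_t crc32c(const uint8_t *buf, size_t len);

// Task 1: verify a cache-line-sized block as it is fetched from NVM.
// Returns 0 if the stored checksum matches, -1 to signal an error.
int verify_on_fetch(const uint8_t *line, uint32_t stored_checksum) {
    return crc32c(line, LINE_SIZE) == stored_checksum ? 0 : -1;
}

// Task 2: patch the stored checksum when a dirty line is written back,
// using only the diff between the dirty line and the clean on-device copy.
// CRC32C is linear over XOR: crc(a ^ b) = crc(a) ^ crc(b) ^ crc(zeros),
// so nothing beyond the diff needs to be hashed.
void update_on_writeback(const uint8_t *dirty, const uint8_t *clean,
                         uint32_t *stored_checksum) {
    uint8_t diff[LINE_SIZE], zeros[LINE_SIZE] = {0};
    for (int i = 0; i < LINE_SIZE; i++)
        diff[i] = dirty[i] ^ clean[i];
    *stored_checksum ^= crc32c(diff, LINE_SIZE) ^ crc32c(zeros, LINE_SIZE);
    // The data line and *stored_checksum must now become durable atomically.
}
```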

When a region of memory is mapped as DAX by the file system, the OS notifies the redundancy controller of the starting and ending physical addresses of the region. Although the paper does not explicitly mention this, the design forces the underlying physical addresses on the device to be contiguous, since otherwise the range could not be stored in a pair (or a few pairs) of registers. When a fetch request reaches the LLC, the redundancy controller computes the checksum of the fetched block, reads the stored checksum from the device, and compares the two. If the values mismatch, an exception is raised that forces the OS to handle the issue. Similarly, when a data block is written back from the LLC, the controller first reads the old value from the device image, computes the diff, and then uses the diff to compute the new checksum. Both the checksum and the data are then updated atomically. Note that atomicity here refers to the invariant that no interleaving request may observe a partially updated data-checksum pair. Without special care, this could happen when another core fetches the data after the dirty line has been written to the device but before the new checksum has drained from the write-back buffer; that core would see the old checksum together with the new line content, raising a spurious exception.
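A sketch of the range-register bookkeeping, under my own assumptions about the interface (the paper only says the OS communicates the region's start and end physical addresses):

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_DAX_RANGES 8  // assumed register count, not from the paper

// One pair of range registers per DAX-mapped region; the OS programs these
// when the file system maps a region, so the controller knows which
// physical addresses need the redundancy treatment.
struct dax_range { uint64_t start, end; };

static struct dax_range ranges[MAX_DAX_RANGES];
static int num_ranges;

// Called on behalf of the OS when a region is DAX-mapped.
void register_dax_range(uint64_t start, uint64_t end) {
    ranges[num_ranges++] = (struct dax_range){ start, end };
}

// Consulted by the controller on every LLC fetch/write-back: only accesses
// that fall in a registered range go through the redundancy path.
bool is_dax(uint64_t paddr) {
    for (int i = 0; i < num_ranges; i++)
        if (paddr >= ranges[i].start && paddr < ranges[i].end)
            return true;
    return false;
}
```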

The paper also suggests that extra metadata be maintained along with the checksums, such as parity blocks for recovery. Parity blocks are not accessed during normal operation; when a block is detected to be corrupted due to a firmware bug or a random bit flip, the parity block is read to recover the content of the corrupted block. The parity block must also be updated when a dirty cache line is written back by the LLC, and all such updates must be made atomic by the redundancy controller.
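The paper summary above does not pin down the parity scheme; assuming standard RAID-style XOR parity, the update and recovery would look roughly like this (names are mine):

```c
#include <stdint.h>

#define LINE_SIZE 64

// On write-back, the parity line is patched with the same diff that
// patches the data: parity' = parity ^ (dirty ^ clean).
void update_parity(uint8_t *parity, const uint8_t *dirty,
                   const uint8_t *clean) {
    for (int i = 0; i < LINE_SIZE; i++)
        parity[i] ^= dirty[i] ^ clean[i];
}

// On a detected corruption, the lost line is rebuilt by XOR-ing the
// parity line with every surviving line in the parity group.
void recover_line(uint8_t *out, const uint8_t *parity,
                  const uint8_t *const *peers, int num_peers) {
    for (int i = 0; i < LINE_SIZE; i++) {
        out[i] = parity[i];
        for (int j = 0; j < num_peers; j++)
            out[i] ^= peers[j][i];
    }
}
```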

In the naive design, checksums are stored at page granularity. This requires the entire page to be read whenever the LLC requests a cache line from the device, which suffers from a severe read amplification problem. Writes do not require reading the whole page, since only the block being written back is affected: the controller reads back the old content of the line and computes the diff. Although I am not familiar with CRC32C, the checksum algorithm the paper uses, the paper suggests that the final checksum can be updated using only the diff. In the meantime, the parity update is also computed, and the three cache lines are updated atomically by the controller (the atomicity protocol of which is not described).
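The diff-only update works because CRC32C, like any CRC, is linear over XOR: for equal-length buffers, crc(a ^ b) = crc(a) ^ crc(b) ^ crc(0). The runnable demo below patches a 4KB page checksum from a single dirty line's diff, zero-extended to its page offset. This is my illustration, not the paper's code; real hardware would zero-extend the diff's CRC algebraically rather than scan a padded page.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

#define PAGE_SIZE 4096
#define LINE_SIZE 64

// Minimal table-driven CRC32C (Castagnoli polynomial, reflected form).
static uint32_t table[256];
static void init_table(void) {
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0x82F63B78u : c >> 1;
        table[i] = c;
    }
}
static uint32_t crc32c(const uint8_t *p, size_t n) {
    uint32_t c = 0xFFFFFFFFu;
    while (n--) c = table[(c ^ *p++) & 0xFF] ^ (c >> 8);
    return c ^ 0xFFFFFFFFu;
}

int main(void) {
    init_table();
    static uint8_t old_page[PAGE_SIZE], new_page[PAGE_SIZE], diff_page[PAGE_SIZE];
    static const uint8_t zero_page[PAGE_SIZE];
    for (int i = 0; i < PAGE_SIZE; i++) old_page[i] = (uint8_t)(i * 31 + 7);
    memcpy(new_page, old_page, PAGE_SIZE);
    memset(&new_page[5 * LINE_SIZE], 0xAB, LINE_SIZE);  // dirty one line

    // diff_page is zero everywhere except the modified line.
    for (int i = 0; i < PAGE_SIZE; i++) diff_page[i] = old_page[i] ^ new_page[i];

    // Linearity of CRC over XOR: crc(a ^ b) = crc(a) ^ crc(b) ^ crc(0).
    uint32_t patched = crc32c(old_page, PAGE_SIZE)
                     ^ crc32c(diff_page, PAGE_SIZE)
                     ^ crc32c(zero_page, PAGE_SIZE);
    assert(patched == crc32c(new_page, PAGE_SIZE));
    printf("page checksum patched from diff: %08x\n", patched);
    return 0;
}
```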

To solve the read amplification problem of the naive design, the paper proposes storing cache-line-granularity checksums in addition to page-level checksums. This may seem overly redundant and a waste of storage, since cache-line-level checksums take 63 times more space to maintain. The paper argues, however, that cache-line-level checksums are only used for verification during normal operation, leaving page-level redundancy for crash recovery. To this end, the paper proposes allocating the cache-line-level checksums in DRAM while still keeping the page-level checksums on NVM. On an LLC read, only the cache-line-level checksum and the data block are read, reducing the read overhead from 64 extra cache lines to only one. On an LLC write back, both checksums and the parity block are updated, so the write amplification grows from two extra writes to three. Although the paper does not claim this, I would argue the trade-off is still beneficial, since write backs are usually off the critical path while reads are on it.
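To put numbers on the space trade-off (my arithmetic, assuming 4-byte CRC32C checksums and an example 8GB DAX region):

```c
#include <stdio.h>

int main(void) {
    const long long line = 64, page = 4096, csum = 4;  // bytes
    const long long region = 8LL << 30;                // example 8GB DAX region
    long long line_csums = region / line * csum;       // per-line checksums (DRAM)
    long long page_csums = region / page * csum;       // per-page checksums (NVM)
    // Prints: per-line: 512 MB in DRAM, per-page: 8 MB on NVM (64x)
    printf("per-line: %lld MB in DRAM, per-page: %lld MB on NVM (%lldx)\n",
           line_csums >> 20, page_csums >> 20, line_csums / page_csums);
    return 0;
}
```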

Having to read the old block from the NVM still consumes bandwidth and lowers the bandwidth available for data operations. In addition, the write critical path is lengthened by reading the old block from the NVM to compute the diff. To solve these two issues, the final Tvarak design adds two caches, a checksum cache and a diff cache, to reduce the latency and bandwidth cost of accessing diffs and cache-line-level checksums. These two caches are of write-back type, so the actual update of the NVM image is delayed; the backup power source must guarantee that both caches can be safely written back on a power failure. The paper also suggests dedicating a few ways of the LLC banks to storing this cached metadata. Note that the dedicated storage in the LLC does not need to be very large, since the checksum of a 64-byte cache line only consumes 4 bytes: a single LLC slot covers the checksums of 16 consecutive lines. The lookup algorithm of the dedicated checksum cache must be modified slightly such that four more bits are masked off, since each tag now covers 16 lines. Note that both the diff cache and the checksum cache operate independently from the regular data cache: when entries are evicted from these two, the data cache does not need to force an eviction to ensure atomicity of updates (though optimizations apply; see below).
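The modified lookup amounts to masking off four extra offset bits; roughly (constants and names are mine):

```c
#include <stdint.h>

#define LINE_BITS   6   // 64-byte cache lines
#define SLOT_COVERS 4   // 16 = 2^4 four-byte checksums per 64-byte slot

// Index of a line's checksum within its metadata slot.
static inline uint32_t csum_offset(uint64_t paddr) {
    return (paddr >> LINE_BITS) & ((1u << SLOT_COVERS) - 1);
}

// Tag/index lookup with LINE_BITS + SLOT_COVERS low bits dropped:
// one tag covers the checksums of 16 consecutive lines.
static inline uint64_t csum_slot_addr(uint64_t paddr) {
    return paddr >> (LINE_BITS + SLOT_COVERS);
}
```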

The last optimization is on the computation of the diff. One observation is that the diff is readily available when a cache block is written back from the L2 to the LLC. Note that this assumes an inclusive LLC, which may not always be the case, especially in larger systems where the aggregate L2 capacity approaches that of the LLC. Assuming an inclusive cache, the old line content is always in the LLC when a dirty line is written back. In this case, the controller computes the diff between the two lines and writes the diff to its LLC partition. Updates to the page-level checksum are delayed until the dirty line is written back, at which point the cached diff is consumed. The paper also mentions that when a diff is to be removed from the cache (diffs are discarded rather than written back), the controller should also schedule a write back (without invalidation) of the corresponding dirty line in the LLC, if it exists, and mark it as clean. This avoids an extra update to the page-level checksum when the dirty line is later written back.
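A rough sketch of this path under the inclusive-LLC assumption; the structures and hooks are mine, since the paper does not spell out the eviction protocol:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define LINE_SIZE 64

struct llc_line { uint8_t data[LINE_SIZE]; bool dirty; };  // assumed layout

// With an inclusive LLC, the old copy of the line is already present when
// L2 writes back a dirty line, so the diff comes for free.
void on_l2_writeback(struct llc_line *llc, const uint8_t *dirty_line,
                     uint8_t *diff_out /* goes to the LLC diff partition */) {
    for (int i = 0; i < LINE_SIZE; i++) {
        diff_out[i] = llc->data[i] ^ dirty_line[i];  // old vs. new content
        llc->data[i] = dirty_line[i];                // install the new data
    }
    llc->dirty = true;
}

// When a diff is evicted (diffs are discarded, never written back), force a
// write-back of the matching dirty LLC line without invalidating it, so the
// page-level checksum is updated exactly once while the diff is still at hand.
void on_diff_evict(struct llc_line *line /* NULL if not resident */) {
    if (line && line->dirty) {
        // ... schedule the NVM write-back plus checksum/parity update here ...
        line->dirty = false;  // line stays resident but is now clean
    }
}
```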