Mojim: A Reliable and Highly Available Non-Volatile Memory System



Paper Title: Mojim: A Reliable and Highly Available Non-Volatile Memory System
Link: https://dl.acm.org/doi/10.1145/2786763.2694370
Year: ASPLOS 2015 (SIGARCH CAN)
Keyword: NVM; Mojim; Replication




Highlight:

  1. Redo logging could achieve atomic commit and replication together without any extra overhead

  2. Having a mirrored copy overcomes the biggest issue with redo logging, the write ordering between dirty data and the commit record, because data transfer between the master and the mirror is software-controlled, unlike a hardware cache.

  3. Redo logging is also a good way of performing checkpointing at arbitrary boundaries selected by software. This is similar to what we try to achieve in NVOverlay

Questions

  1. Write amplification from logging reduces the lifetime of the NVM device by half

  2. In the m-sync scheme (the most common one), if the mirror copy crashes, why flush the cache of the master copy? I cannot see how this affects recoverability. The mirror should be able to be rebuilt whether or not the master flushes its cache. Furthermore, flushing the cache will pollute the master’s image (although it is always potentially polluted anyway). As long as the register file is not dumped, data on the master copy is always potentially inconsistent and execution cannot simply resume (I guess the application should provide its own transactional support). What’s more, the paper claims that for a system with N machines, Mojim can tolerate N - 1 failures. In a master-mirror configuration, we do not need to worry about the case where both the master and the mirror fail.

  3. I did not get how the metadata log is used. Does the mirrored copy always recover to exact sync points? How does the master copy know which data should be sent after the metadata log’s tail (the master copy does not keep a log and has no idea which data were sent after the log tail)?

This paper proposes Mojim, an NVM-based replication scheme featuring both high performance and reliability. The paper begins by identifying the problems with current disk-based replication schemes. These schemes assume a disk access latency that is orders of magnitude larger than that of DRAM. The resulting designs pay little attention to the performance impact of software overhead, network stack overhead, and network round-trip latency. In the NVM era, as the speed of NVM is comparable or even close to that of DRAM, the system becomes far more sensitive to software and network overhead, and suffers performance degradation if not tuned specifically for NVM.

Mojim solves the above challenge using a combination of a high-bandwidth network with an RDMA interface, a simpler replication protocol based on redo logging, and several levels of consistency between the master working copy and the mirrored copy. First, with a high-bandwidth interconnect between the master and the mirror, network latency is reduced to a minimum, resulting in less blocking of the application. In addition, by using the RDMA interface rather than TCP or any buffered network interface, no extra software buffering is needed for data transfer between the master and the mirror, reducing software buffering and data movement overhead to a minimum. The RDMA device can directly fetch cache-coherent data from the hardware cache, without requiring a software write-back and possibly blocking on that operation. With the simpler replication protocol, the data exchange between the master and the mirror is as simple as the master sending log entries to the mirror, and the mirror acknowledging completion of the transmission. No complicated two-phase commit is needed, since only one mirror copy is present. Finally, by providing several options with different consistency guarantees, applications can trade off the degree of consistency against performance. Switching between consistency levels is achieved by deciding whether the application should wait for mirror persistence, whether dirty data should be flushed, and so on.

We next introduce the system configuration of Mojim. Mojim is implemented as a kernel-level library that interacts with application programmers via system calls or other exposed interfaces such as memory allocation. The system consists of a master copy with NVM installed, on which applications execute, and a mirrored copy which is read-only. Background tasks such as auditing can be scheduled on the mirrored copy as long as they are read-only. A few optional backup copies can also be supported. Mojim guarantees that the master and the mirror are strongly consistent (if it is configured so), while these two are only weakly consistent with the backup copies. The kernel library works with an NVM file system which manages files on the NVM and maps file contents into the calling process’s virtual address space using mmap(). Mojim provides two simple interfaces for the file system and application programmers to initiate data replication. The first is msync(), which synchronizes a given address range between the master copy and the mirrored copy. The replication is guaranteed to be atomic in two aspects. First, all dirty data within the address range is persisted atomically with regard to crashes after the call returns (note that this is not atomic on the master itself). Second, the master and the mirrored copy are synchronized atomically, such that the mirrored copy always contains the most up-to-date content of the master after the call returns. The second system call is gmsync(), which has the same semantics as msync() but accepts multiple address ranges to be replicated atomically.
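
The sketch below shows how an application might use this interface on a file served by the NVM file system. The prototype of mojim_msync() is my assumption for illustration, not the paper's verbatim API.

    /* Minimal sketch: update a record in an mmap'ed NVM file, then place a
     * sync point so the dirty range is persisted and replicated atomically. */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int mojim_msync(void *addr, size_t len);   /* assumed prototype */

    int append_record(const char *path, const char *rec, size_t len, off_t off)
    {
        int fd = open(path, O_RDWR);           /* file managed by the NVM FS */
        if (fd < 0) return -1;

        char *base = mmap(NULL, off + len, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);  /* NVM mapped into user space */
        if (base == MAP_FAILED) { close(fd); return -1; }

        memcpy(base + off, rec, len);          /* modify the mapped data      */

        /* Sync point: the dirty range is made durable and replicated to the
         * mirror atomically before this call returns. */
        int ret = mojim_msync(base + off, len);

        munmap(base, off + len);
        close(fd);
        return ret;
    }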

Mojim employs redo logging to perform both atomic commit and data replication. Recall that in redo logging, two write orderings must be observed to ensure correctness. First, log entries must be persisted before the commit record. This implies that all log entries must be transferred to the mirrored copy before the commit mark is written, and only after that can the system call return. The second write ordering is between dirty data and the commit record: dirty data can only be written back after the commit record is persisted. The second write ordering means different things for the master and the mirror. On the master, there is no guarantee that the hardware cache will not write back dirty data before the commit record, and therefore the master copy by itself is not atomic at all. If, however, we think of the master copy as a large buffer and the mirrored copy as the actual data, both write orderings are observed, since dirty data is never transferred directly from the master to the mirror (only log entries are), which makes the second write ordering trivially true.
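
To make the two orderings concrete, here is a minimal single-node redo-commit sketch for x86 persistent memory (illustrative only, not Mojim's code); CLFLUSH plus SFENCE stand in for whatever persist barrier the platform provides.

    #include <immintrin.h>
    #include <stdint.h>
    #include <string.h>

    #define CACHELINE 64

    /* Flush the cache lines covering [addr, addr + len) and order the flushes. */
    static void persist(const void *addr, size_t len)
    {
        for (uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
             p < (uintptr_t)addr + len; p += CACHELINE)
            _mm_clflush((const void *)p);
        _mm_sfence();
    }

    struct log_entry { uint64_t *dst; uint64_t new_val; };

    /* Commit a set of redo entries while observing both write orderings. */
    void redo_commit(struct log_entry *log, struct log_entry *entries,
                     int n, uint64_t *commit_mark)
    {
        memcpy(log, entries, n * sizeof(*entries));  /* 1. write redo entries   */
        persist(log, n * sizeof(*entries));          /*    entries durable      */

        *commit_mark = 1;                            /* 2. write commit record  */
        persist(commit_mark, sizeof(*commit_mark));  /*    record durable       */

        for (int i = 0; i < n; i++)                  /* 3. only now update the  */
            *entries[i].dst = entries[i].new_val;    /*    data in place        */
    }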

Mojim divides fault-free program execution into sync points, at which msync() or gmsync() is called to request persistence. On crash recovery, the system image is guaranteed to be recovered to the most recent sync point, provided that the call at that sync point has returned. Correctly identifying and placing sync points is left to the application programmer.
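
As a made-up example of sync point placement, an application updating two related records in NVM could bracket the whole logical operation with a single gmsync(), so that a crash recovers either both updates or neither; mojim_gmsync() and its argument order are assumed names, not the paper's exact API.

    #include <stddef.h>

    int mojim_gmsync(void **addrs, size_t *lens, int num_ranges);  /* assumed */

    struct account { long balance; };

    int transfer(struct account *from, struct account *to, long amount)
    {
        from->balance -= amount;          /* both updates belong to one */
        to->balance   += amount;          /* logical operation          */

        void  *addrs[2] = { from, to };
        size_t lens[2]  = { sizeof(*from), sizeof(*to) };

        /* Sync point: both ranges are replicated to the mirror atomically;
         * after this returns, recovery sees either both updates or neither. */
        return mojim_gmsync(addrs, lens, 2);
    }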

In the most common mode, the protocol ensures the consistency of the system most of the time. At a sync point, the master copy generates log entries for dirty data in the given range or ranges, and then sends them to the mirrored copy via RDMA. Note that no cache flush is involved in this process, since RDMA NICs nowadays are cache-coherent and can directly fetch the most up-to-date lines via the coherence protocol. After receiving these lines, the mirrored copy persists them onto its NVM, again via RDMA, without occupying any CPU cycles. The mirrored copy then writes a commit mark, persists it to the NVM, and sends an acknowledgement back to the master copy. The master copy, meanwhile, blocks on the network until the acknowledgement from the mirror is received, after which msync() returns. The mirrored copy also maintains a “metadata log” which records the IDs of the requests it has received. On crash recovery, this log is used to figure out the last log entry the mirrored copy has received. The master copy then sends the log entries after this point to help the mirrored copy become synchronized and consistent.
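
The master-side flow can be sketched as follows. The two RDMA helpers are stand-ins for the actual verbs, and the request/log layout is invented for illustration; only the ordering of steps reflects the protocol described above.

    #include <stddef.h>
    #include <stdint.h>

    struct log_entry { uint64_t dst_offset; uint64_t len; /* payload follows */ };

    /* Stand-ins for the RDMA transport (assumed, not Mojim's real functions). */
    int rdma_write_to_mirror(uint64_t req_id, const void *buf, size_t len);
    int wait_for_mirror_ack(uint64_t req_id);

    static uint64_t next_req_id;

    int msync_to_mirror(void *addr, size_t len)
    {
        uint64_t id = ++next_req_id;

        /* 1. Build a log entry for the dirty range. No cache flush is needed:
         *    a cache-coherent RDMA NIC reads the newest data directly.       */
        struct log_entry e = { .dst_offset = (uint64_t)(uintptr_t)addr,
                               .len        = (uint64_t)len };

        /* 2. Push the entry header and the data into the mirror's log buffer. */
        if (rdma_write_to_mirror(id, &e, sizeof(e)) < 0) return -1;
        if (rdma_write_to_mirror(id, addr, len) < 0)     return -1;

        /* 3. Block until the mirror has persisted the entries, written its
         *    commit mark, recorded the request ID in its metadata log, and
         *    sent the acknowledgement back.                                   */
        return wait_for_mirror_ack(id);
    }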

Crashes are detected via heartbeat signals. On detecting a crash, the master or the mirror starts crash recovery immediately. If the mirror crashed, the master copy first flushes its entire cache to ensure that dirty data is present on the NVM; the mirrored copy is then rebuilt or restarted. If the master copy crashed, the mirrored copy replays the log entries covered by a commit record and discards the rest. The mirrored copy then acts as the master and resumes execution, while the master copy is rebuilt in the meantime from the image of the mirrored copy. If, however, the master copy also crashes before its flush completes, the image on its NVM is potentially inconsistent, since some dirty blocks may have been lost in the volatile cache. This “vulnerability window” becomes an issue when the master crashes within the window after the mirror has already failed, leading to an inconsistent image and potential data loss.
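
A minimal sketch of the mirror-side replay on master failure is shown below; the log-entry layout and the way committed entries are identified are my assumptions, used only to illustrate "replay what has a commit record, discard the rest".

    #include <stdint.h>
    #include <string.h>

    struct log_entry { uint64_t dst_offset; uint64_t len; uint8_t data[]; };

    /* Replay committed entries onto the NVM image; drop uncommitted ones.
     * 'committed' is the number of leading entries covered by a commit mark. */
    void recover_mirror(uint8_t *nvm_image, struct log_entry **entries,
                        int n, int committed)
    {
        for (int i = 0; i < n; i++) {
            if (i < committed)                       /* commit record present */
                memcpy(nvm_image + entries[i]->dst_offset,
                       entries[i]->data, entries[i]->len);
            /* else: no commit record, the entry is simply discarded */
        }
        /* The mirror then takes over as the master and execution resumes. */
    }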

To avoid the vulnerability window, the paper also proposes msync with cache flush. In this mode, the library always flushes the master’s cache before msync() returns, eliminating the window entirely should the mirror crash. Note that Mojim does not guarantee that the image on the NVM contains no dirty updates from execution after the most recent sync point. In fact, current hardware caches, just like a “steal, no-force” database buffer pool, can evict a dirty block at any time after it is written. The application is assumed to implement its own transactional support to roll back such partial updates on the master node during crash recovery, but the paper does not cover such implementation details.

Log entries received from the master copy are written into dedicated log buffers by the RDMA receiving end on the mirror node. The mirror node periodically replays the log onto its NVM image, when the log buffer is full or when the number of received packets exceeds a certain threshold, after which the log buffer is truncated and reclaimed. If tier-two backup nodes are configured, the mirror copy also sends log entries to the backup nodes in the background. These backup nodes are not strictly synchronized with the master and the mirror, and they only guarantee recovery to one of the recent sync points, not necessarily the most recent one. As a result, msync() does not wait for backup nodes to complete their persistence.
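
The mirror's replay trigger can be sketched as below; the buffer size, packet threshold, and helper names are made up for illustration, with only the "replay when full or after enough packets, then truncate" policy taken from the description above.

    #include <stddef.h>

    #define LOG_BUF_CAPACITY (4u << 20)  /* assumed log buffer capacity   */
    #define PKT_THRESHOLD    1024u       /* assumed packet-count trigger  */

    struct log_buf {
        size_t   used;       /* bytes deposited by the RDMA NIC since last replay */
        unsigned packets;    /* packets received since last replay                */
    };

    void apply_log_to_image(struct log_buf *lb);   /* replays and persists (stub) */

    void on_packet_received(struct log_buf *lb, size_t pkt_len)
    {
        lb->used    += pkt_len;
        lb->packets += 1;

        if (lb->used >= LOG_BUF_CAPACITY || lb->packets >= PKT_THRESHOLD) {
            apply_log_to_image(lb);      /* replay entries onto the NVM image */
            lb->used = 0;                /* then truncate and reclaim the log */
            lb->packets = 0;
        }
    }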