AsymNVM: An Efficient Framework for Implementing Persistent Data Structures on Asymmetric NVM Architecture
Link: https://dl.acm.org/doi/10.1145/3373376.3378511
Year: ASPLOS 2020
Keyword: NVM; Redo Logging; Semantic Logging; AsymNVM
Highlight:
- Combines redo logging and semantic logging to allow both fast persistence and precise state recording. This avoids checkpointing the address space, and also reduces the latency of logical commit (only one semantic log entry has to be written, instead of all redo log entries).
Questions
- This paper is badly written, full of factual errors, grammatical errors, and peculiar expressions. Many concepts are not explained clearly (e.g., how do reads stay consistent with writes?).
This paper introduces AsymNVM, a distributed shared NVM storage architecture. AsymNVM is motivated by the asymmetric NVM architecture, where NVM storage backends are decoupled from computation frontends. The paper points out that once the NVM storage node is decoupled from the computation node, conventional logging approaches are no longer feasible due to increased communication latency: write ordering, which is essential to logging-based persistence algorithms, now requires flushing a cache line to the backend NVM and waiting for the response over the network, taking at least one round-trip delay to complete.
We first describe the asymmetric NVM architecture before presenting the details of AsymNVM. The baseline asymmetric NVM architecture consists of one or more NVM backends and computation frontends. A frontend accesses the NVM devices by issuing RDMA commands that explicitly retrieve and push data over a high-throughput network such as InfiniBand. The mapping between frontends and backends can be arbitrary, in order to maximize device utilization. Both frontends and backends can crash, after which crash recovery is initiated. Frontend nodes do not have any NVM device installed. DRAM is available to run the OS and the application, and to serve as a buffer for data fetched from the backend NVM. The paper also assumes that the working set can be larger than the DRAM capacity of any frontend node, but still within the capacity limit of the backend, because of the higher storage density of NVM compared with DRAM. As a result, keeping the full volatile working set in frontend DRAM is not an option, unlike some previous proposals. A few mirror nodes are also present as backups for the backend storage. Data is synchronized between the backend nodes and mirror nodes periodically.
Note that one of the basic assumptions of the paper is that, although the throughput of RDMA is comparable to that of NVM devices, the latency of RDMA operations is significantly higher than that of NVM. This makes it necessary to design a persistence algorithm that requires less write ordering, as we will see below.
AsymNVM supports a transactional interface, where applications declare transaction boundaries, and the library ensures that all writes to the NVM are either committed as a unit, or none of them is committed. AsymNVM uses a combination of redo logging and semantic logging (operation logging as in the paper) to support both efficient, low-latency commit and a strong atomicity guarantee: once the commit function returns, the transaction is guaranteed to be committed, even across crash recovery. The mechanism works as follows. The backend NVM maintains two log buffers: a data log buffer and a semantic log buffer. The data log buffer stores per-word updates as redo entries, each consisting of an address tag and the updated data. The semantic log buffer, on the other hand, stores semantic log entries, each describing an operation by its function name and arguments, if any. The library provides instrumentation functions to mark transaction begin and end, and to wrap load and store operations.
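To make the interface concrete, here is a minimal sketch of what the instrumented transactional API could look like. All names (tx_begin, tx_load, tx_store, tx_commit) are illustrative, not the paper's published functions, and the counter example is mine:

```c
/* Hypothetical shape of the instrumented transactional interface;
 * names are illustrative, not AsymNVM's actual API. */
#include <stdint.h>
#include <stddef.h>

typedef struct tx tx_t;

/* Begin a transaction: asynchronously write a semantic log entry
 * (operation name + serialized arguments) to the backend. */
tx_t *tx_begin(const char *op_name, const void *args, size_t args_len);

/* Instrumented accesses: translate the NVM address through the local
 * cache; stores additionally append a per-word redo log entry. */
uint64_t tx_load(tx_t *tx, uint64_t nvm_addr);
void     tx_store(tx_t *tx, uint64_t nvm_addr, uint64_t value);

/* Logical commit point: returns once the semantic log entry is durable. */
int tx_commit(tx_t *tx);

/* Example: a persistent counter increment wrapped in a transaction. */
void counter_add(uint64_t counter_addr, uint64_t delta) {
    tx_t *tx = tx_begin("counter_add", &delta, sizeof delta);
    uint64_t v = tx_load(tx, counter_addr);
    tx_store(tx, counter_addr, v + delta);
    tx_commit(tx);
}
```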
At transaction begin, an operation log entry is generated and written to the backend via RDMA write. The write does not need to be synchronous; its completion is only checked at the end of the transaction, which is also the logical commit point of the transaction. As the transaction executes, writes generate per-word data log entries, which are buffered locally at the frontend. A commit record is also appended to the log to indicate end-of-log. Data log entries are not essential to committing the transaction, and therefore do not need to be committed synchronously. The local data log buffer is flushed when a certain number of transactions have committed, or when the buffer is full. Local buffering avoids issuing excessive RDMA operations, which could easily overwhelm the I/O queue of the RDMA controller. Furthermore, batching data from multiple transactions together helps leverage the high throughput of RDMA despite its high per-operation latency. AsymNVM uses a single NVM write operation to transfer the local data log entries to the backend.
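A sketch of this commit path under the above description follows. The rdma_* functions, buffer sizes, and the flush threshold are placeholders (not a real RDMA verbs API), and commit_tx roughly corresponds to the tx_commit call in the earlier sketch:

```c
/* Sketch of the commit path; rdma_* functions are placeholders. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define LOG_BUF_CAP    (1u << 20)   /* local staging buffer for redo entries */
#define FLUSH_TX_BATCH 16           /* flush after this many commits (assumed) */

void rdma_write_async(uint64_t remote_addr, const void *buf, size_t len);
void rdma_wait_completion(void);    /* wait for outstanding async writes */

static uint8_t  log_buf[LOG_BUF_CAP];
static size_t   log_len = 0;
static unsigned commits_since_flush = 0;
static uint64_t backend_log_tail = 0;   /* assumed backend data-log offset */

/* Append a per-word redo entry (address tag + new data) locally.
 * (A real implementation would also flush when the buffer is full.) */
void redo_append(uint64_t addr, uint64_t data) {
    memcpy(log_buf + log_len, &addr, sizeof addr);
    memcpy(log_buf + log_len + sizeof addr, &data, sizeof data);
    log_len += sizeof addr + sizeof data;
}

/* Logical commit: only the semantic log entry (written asynchronously
 * at transaction begin) must be durable before commit returns. */
void commit_tx(uint64_t tx_id) {
    redo_append(~0ULL, tx_id);          /* commit record marks end-of-log */
    commits_since_flush++;
    rdma_wait_completion();             /* semantic log entry is now durable */

    /* Batch redo entries of many transactions into one RDMA write. */
    if (commits_since_flush >= FLUSH_TX_BATCH || log_len > LOG_BUF_CAP / 2) {
        rdma_write_async(backend_log_tail, log_buf, log_len);
        backend_log_tail += log_len;
        log_len = 0;
        commits_since_flush = 0;
    }
}
```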
Although not mentioned by the paper, each transaction in AsymNVM should be assigned a monotonically increasing transaction ID. Semantic log entries and data commit records should both contain the transaction ID, so that the system can accurately replay both logs on crash recovery.
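The paper does not give exact record formats, so the structs below are one plausible layout that carries the transaction ID in both logs; all field names and sizes are my assumptions:

```c
/* Assumed record layouts for the two logs; not taken from the paper. */
#include <stdint.h>

/* Per-word redo entry in the data log. */
struct data_log_entry {
    uint64_t nvm_addr;      /* backend address being updated */
    uint64_t data;          /* new value for that word */
};

/* Commit record terminating one transaction's run of redo entries. */
struct data_commit_record {
    uint64_t magic;         /* distinguishes a commit record from data */
    uint64_t tx_id;         /* monotonically increasing transaction ID */
};

/* Semantic (operation) log entry: which operation, with which arguments. */
struct semantic_log_entry {
    uint64_t tx_id;         /* same ID space as the data log */
    uint32_t op_code;       /* identifies the data structure operation */
    uint32_t args_len;      /* length of the serialized arguments */
    uint8_t  args[];        /* serialized argument bytes */
};
```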
AsymNVM also maintains a local DRAM cache for frequently accessed pages. On a transactional read, the DRAM cache is first checked for the requested address. If found, data is returned directly from the cache. Otherwise, an RDMA read fetches the page to fill the buffer, after a potential eviction. To keep cached contents consistent, transactional stores are also applied to the cached page, if it is present. The mapping between a requested block and its address in the cache is managed by a frontend hash table, which is updated when a page is filled. The read and write wrappers also perform table lookups to translate the requested address into a cache address. Dirty pages do not need to be written back on eviction, since the data redo log entries are sufficient to keep the backend up to date. The paper also suggests that the eviction algorithm be customized according to the workload. For example, for a B+Tree workload, the root node and upper-level nodes should be given higher priority to stay in the cache, while lower-level nodes get lower priority.
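A sketch of the cache translation in the read and write wrappers is below. The structures, page size, and helper functions (cache_lookup, cache_fill) are assumptions, not the paper's design; redo_append is from the commit-path sketch above:

```c
/* Sketch of the frontend DRAM cache translation; all names assumed. */
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE 4096u

struct cache_page {
    uint64_t nvm_page;          /* backend page number currently cached */
    uint8_t  data[PAGE_SIZE];
    bool     valid;
};

extern struct cache_page cache[];          /* local DRAM cache slots */
int  cache_lookup(uint64_t nvm_page);      /* hash table probe, -1 on miss */
int  cache_fill(uint64_t nvm_page);        /* evict a victim, RDMA-read page */
void redo_append(uint64_t addr, uint64_t data);  /* see commit-path sketch */

/* Read wrapper: serve from the DRAM cache if possible, otherwise fetch. */
uint64_t cached_load(uint64_t nvm_addr) {
    uint64_t page = nvm_addr / PAGE_SIZE, off = nvm_addr % PAGE_SIZE;
    int slot = cache_lookup(page);
    if (slot < 0)
        slot = cache_fill(page);           /* may first evict a page */
    uint64_t val;
    memcpy(&val, &cache[slot].data[off], sizeof val);
    return val;
}

/* Write wrapper: keep the cached copy consistent and log the update;
 * evicted pages need no write-back because the redo log has the data. */
void cached_store(uint64_t nvm_addr, uint64_t value) {
    int slot = cache_lookup(nvm_addr / PAGE_SIZE);
    if (slot >= 0)
        memcpy(&cache[slot].data[nvm_addr % PAGE_SIZE], &value, sizeof value);
    redo_append(nvm_addr, value);
}
```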
The backend also maintains the working set image as directly mapped NVM. The image is only updated by continuously replaying the redo log. RDMA reads must wait for the redo log replay to complete before proceeding, in order to access consistent and up-to-date data; if there are still outstanding redo logs, the read is blocked. Note that the original paper does not define how reads are serialized with writes. What I describe here is one of the most conservative solutions, i.e., reads always synchronize with all preceding writes, which guarantees that a transaction can always access its own dirty data. More relaxed solutions, such as optimistic reads with a global image version counter, are also possible. In that solution, reads do not synchronize with log replay. Instead, a global version counter is incremented both before and after each batch of log replay. The RDMA read operation checks the global version counter before the read, and re-checks it after. If the first check reads an odd counter value, or if the two checks read different values, the read risks accessing an inconsistent image and must be retried.
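The optimistic scheme just described is essentially a seqlock. A minimal sketch, assuming the version counter lives at a known backend address and is read with one-sided RDMA (the rdma_* helpers and the counter address are placeholders):

```c
/* Seqlock-style optimistic read of the backend image; names assumed. */
#include <stdint.h>
#include <stddef.h>

uint64_t rdma_read_u64(uint64_t remote_addr);
void     rdma_read(uint64_t remote_addr, void *buf, size_t len);

/* Backend replay increments this counter before and after each batch of
 * redo-log replay, so an odd value means replay is in progress. */
#define IMAGE_VERSION_ADDR 0x0   /* assumed location of the counter */

void optimistic_read(uint64_t remote_addr, void *buf, size_t len) {
    for (;;) {
        uint64_t v1 = rdma_read_u64(IMAGE_VERSION_ADDR);
        if (v1 & 1)                        /* replay in progress, retry */
            continue;
        rdma_read(remote_addr, buf, len);  /* read the image itself */
        uint64_t v2 = rdma_read_u64(IMAGE_VERSION_ADDR);
        if (v1 == v2)                      /* no replay overlapped the read */
            return;
    }
}
```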
One of the most notable differences between regular and asymmetric NVM architectures is NVM storage allocation. In the former configuration, the OS asserts full control over the NVM address space, which it manages exclusively. In the latter configuration, however, since the backend storage is shared among multiple frontends, the resource cannot be monopolized by any single frontend. Instead, the backend runs its own storage allocation engine, which satisfies allocation requests from the frontend at a large granularity ("slabs" in the paper). After a large block of memory is returned to the frontend OS, it is allocated to the application in a fine-grained manner. The OS may also return large blocks to the backend allocation engine if the frontend holds an excessive number of empty blocks. The backend maintains a bitmap to track the allocation status of blocks. The actual allocation map, however, must be rebuilt by a pointer-based scan after crash recovery.
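A sketch of the backend's coarse-grained side of this two-level scheme follows; the slab size, slab count, and function names are assumptions, and the fine-grained frontend allocator on top of each slab is omitted:

```c
/* Sketch of a backend slab allocator tracked by a bitmap; names assumed. */
#include <stdint.h>

#define SLAB_SIZE (2ULL << 20)           /* assumed coarse allocation unit */
#define NUM_SLABS 1024

/* Bitmap of handed-out slabs, kept in backend NVM; the fine-grained map
 * inside each slab is not persisted and is rebuilt by a pointer scan
 * after a crash. */
static uint8_t slab_bitmap[NUM_SLABS / 8];

/* Hand a free slab to a frontend; returns its backend NVM offset. */
int64_t backend_alloc_slab(void) {
    for (int i = 0; i < NUM_SLABS; i++) {
        if (!(slab_bitmap[i / 8] & (1u << (i % 8)))) {
            slab_bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
            /* a real implementation would persist the bitmap update here */
            return (int64_t)i * SLAB_SIZE;
        }
    }
    return -1;                            /* backend NVM exhausted */
}

/* A frontend returns an empty slab to the backend. */
void backend_free_slab(int64_t offset) {
    int i = (int)(offset / SLAB_SIZE);
    slab_bitmap[i / 8] &= (uint8_t)~(1u << (i % 8));
}
```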
On recovery from a frontend crash, the recovery handler first locates both the data log and the semantic log on the backend NVM. The image is then restored to the pre-crash persistent state as follows. First, the handler replays the remaining redo (data) log entries of transactions that have a commit record, and records the ID of the last such completed transaction, L. This step restores the image to a consistent state, but not necessarily the most up-to-date consistent state, since some transactions may have been logically committed with only a partial redo log. Then, the handler scans the semantic log, and for every entry with a transaction ID larger than L, replays the operation by re-executing the transaction with the exact function and arguments. Lastly, allocation metadata and other metadata are restored, after which the system resumes execution.
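The recovery sequence can be summarized in the sketch below; the log-iteration helpers are placeholders for however the logs are actually scanned:

```c
/* Sketch of frontend crash recovery; helper names are assumptions. */
#include <stdint.h>
#include <stdbool.h>

bool data_log_next_committed_tx(uint64_t *tx_id);  /* next tx with a commit record */
void data_log_replay_tx(uint64_t tx_id);           /* apply its redo entries */
bool semantic_log_next(uint64_t *tx_id);           /* next semantic log entry */
void semantic_log_reexecute(uint64_t tx_id);       /* rerun function + arguments */
void rebuild_allocation_map(void);                 /* pointer-based scan */

void frontend_recover(void) {
    /* Step 1: replay every redo-logged transaction that has a commit
     * record; L ends up as the last transaction fully covered by redo. */
    uint64_t L = 0, tx;
    while (data_log_next_committed_tx(&tx)) {
        data_log_replay_tx(tx);
        L = tx;
    }

    /* Step 2: re-execute semantically logged transactions beyond L,
     * i.e., logically committed but with an incomplete redo log. */
    while (semantic_log_next(&tx)) {
        if (tx > L)
            semantic_log_reexecute(tx);
    }

    /* Step 3: rebuild allocation and other metadata, then resume. */
    rebuild_allocation_map();
}
```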