Failure-Atomic Slotted Paging for Persistent Memory

- Hide Paper Summary

Paper Title: Failure-Atomic Slotted Paging for Persistent Memory
Link: https://dl.acm.org/citation.cfm?id=3037737
Year: ASPLOS 2017
Keyword: NVM; FASH; FAST; B+Tree

Back

This paper presents FASH and FAST, two atomic page update schemes for Non-Volatile Memory (NVM) based systems. This paper is motivated by the fact that current implementations of database systems are often designed with block devices, such as disks, in mind. As a consequence, logging is always performed in the granularity of a disk page, which is typically 4KB nowadays. This introduces the write amplification problem, in which a small write is amplified by the large granularity of writes. To make it worse, modern file systems are equipped with a feature called journaling, which duplicates data written to files in order to maintain the abstraction of failure-atomic file write system call. If the database uses a file to store the persistent log, then the I/O operation to persist the log will be duplicated due to file system level journaling, while in fact this journal is unnecessary since a corrupted log will not affect recovery (the database controls write ordering such that even if the log is corrupted, it can be identified and truncated).

This paper makes two major contributions. First, for simple updates, such as updates that only touch one page (or one node of a B+Tree), by using hardware transaction memory support, we can extend the 8-byte atomic write provided by most NVM to cache line sized atomic write, enabling us to update a page in-place without generating any log record. Second, for multi-page or multi-node update, this paper also proposes a lightweight logging scheme that does not have the write amplification problem introduced by file systems. These two contributions, combined together, forms the basis of FASH and FAST, which provide better performance when applied to a widely used database, SQLite.

This paper is based on slotted page mechanism, which is a general solution to to store variably lengthed records, such as database rows, or keys in a B+Tree index. In such a mechanism, a page (presumably 4KB) is divided two parts: A page header growing from the lower end of the page which stores offsets of data items, and the content of the page growing from the higher end, in which variably lengthed data items are stored. The page header consists of a integer counting the number of items in the page, a pointer pointing to the free list (maintained for unallocated space), and an array of offsets which encodes locations of data items. This page layout is similar to that of the Masstree, in which a permutation field is added to each node as an extra layer of indirection, which facilitates atomic insert and delete of elements, since items are not no longer shifted (which is not atomic), but instead we update the permutation field using an atomic Compare-And-Swap (Note: This similarity is also reflected by a paper in ASPLOS published two years later after this one, in which the masstree node is updated atomically w.r.t. NVM by placing the permutation field and the undo log in the same cache line).

This paper first demonstrates how to update the page header atomically with regard to the NVM using HTM. On most platforms, there are two granularities to characterize NVM. The first granularity is about the largest unit of data update when we issue a write back from the cache. In other words, this number describes the maximum nuber of bytes that can be persisted atomically on the NVM side without observing any partial update, especially after a power outrage. The second granularity is more related to the processor’s ISA, which defines the number of bytes that is guaranteed to be performed atomically by the processor, such that either all bits of the write operation are persisted to the NVM, or none of them is. This property is generally related to how instructions are executed internally within a processor. For example, even if some SIMD instructions can write multi-word value, these instructions may not be atomic, i.e. the partially updated state can be observed by a third party (either another processor, or the NVM attached to the bus) via eviction or coherence action on the cache line. In our case, if the page header is not updated atomically as a single unit by the processor, the partially updated state might be persisted to the NVM by an eviction, before we issue the cache line flush. If the system crashes at this point, the image recovered from the NVM is half-written, resulting in uncoverable data corruption. On Intel platforms, the previous granularity is the cache line size, i.e. 64 bytes, while the second granularity is 8 byte, the size of a machine word.

To make it simpler: There are two atomicity related issues for NVM-bases systems: Data that can be updated atomically on the CPU side, and data that can be persisted atomically on the NVM side. The latter implies the former, since if data cannot be updated atomically, an intervening cache eviction will inevitably make partial updates to the NVM storage. The former does not imply the latter, since even if data can be updated by the processor atomically (e.g. using HTM), the NVM still cannot accept them in an atomic batch, i.e. the NVM may have only persisted one cache line while discarded the other.

In order to make atomic update to the page header in cases such as sorting the keys in a B+Tree (which involves shifting some offsets in the array), either we need conventional logging to make sure that these updates can always be undone or replayed, or we should devise a machanism which allows cache-lined sized atomic update on the processor side, and then flush back the cache line to the NVM. In the latter scheme, according to the atomicity of both operations, no matter at which stage the system crashes, the updated cache is either written in its entirety, or not applied at all.

This paper leverages HTM to perform atomic updates to a single cache line. Current implementation of Intel TSX RTM mandates that the transaction be aborted if any cache line in the working set is evicted from the cache hierarchy. Effectively, we can “lock” a cache line within the cache hierarchy using RTM transactions, in order to perform a non-atomic series of updates to the cache line. With the help of HTM, an insertion into a slotted page is described as follows. First, unused storage is allocated within the page by finding a chunk of memory larger than the inserted item in the free list, and then writes the item into this chunk (and updates the free list pointers in other chunks). At this stage, if the system crashes, no change has been made even if the updated free list and/or data has been written into the NVM, since the page header is the only reference when we read the content of a page. In addition, corrupted free list can always be fixed by scanning the offset array and rebuilding the list (since the storage is never freed during normal operation). Then, we flush the updated addresses in the previous step to the NVM using a persist barrier. Next, we update the page header to add a new offset entry pointing to the just inserted item, potentially shifting other offsets. This update is wrapped into a RTM transaction to ensure atomicity. In the last step, we flush the updated header to the NVM, committing all the above changes. Element deletion is performed in a similar manner except that no storage is allocated from the free list. Updated items must not be overwritten, since otherwise the pre-image of the update might not be recoverable if the system crashes before the update is committed. Instead, we allocate new space for the updated item, and keep old value unreferenced which will eventually reclaimed by GC. In case that the transaction aborts due to unrecoverable conditions (which should not happen in practice since the transaction only consists of single cache line update), the system falls back to logging, which will be presented below.

HTM may also seem viable when mutliple pages are updated, in which we simply wrap the multiple page header update in the same transaction, this practice, however, does not work, because this multi-cache line update cannot be flushed atomically back to the NVM. In this case, we have no choice, but just have to fall back to conventional logging. This paper proposes a lightweight logging scheme as follows. In the first step, all storage allocation and value updates are performed within the free area of the pages as described in the previous paragraph. These dirty values are also flushed back to the NVM before the next step. The logical view of the pages do not change at the end this stage. In the second step, we generate log records for page headers to be updated into a small logging area mapped to the NVM address space, and then flush these log records back to the NVM. After this step, a commit mark is written and flushed seperately to commit the log. Only after the commit mark is persisted, are the page modifications actually committed. In the last step, we apply the page header changes to the corresponding pages, making available all the changes for later transactions to use. The log records are invalidated by writing and flushing another mark after the in-place updates are done.

Compared with traditional approach of using a log file for all dirty data, the lightweight logging approach has two obvious advantages. First, only page headers are logged, which saves both NVM bandwidth and log storage pressure. Second, file system is not involved in the logging process, which avoids the “journaling of journal” problem described in the first paragraph.

On recovery, the log manager first checks the log area to see if there is any valid log entries. A valid log entry is one that has the commit mark written, and yet not invalidated. If such entry exists, the recovery manager replays the log before removing it. In all other cases, the log can be truncated, since the system is always at a consistent state.