System Software for Persistent Memory
Link: https://dl.acm.org/doi/10.1145/2592798.2592814
Year: EuroSys 2014
Keyword: NVM; PMFS
This paper introduces PMFS, a file system designed and optimized for NVM. Unlike previous NVM file system papers, in which a concrete problem with existing designs is identified and then solved, this paper reads more like a design document covering the high-level design while explaining the motivation behind it. The paper begins by identifying three different ways of extending the existing file system paradigm to work with NVM. The first is to abandon the file system entirely and shift the responsibility of resource management to the OS's virtual memory manager. The second is to change only the block layer interface and use the NVM as a conventional block device with lower latency and higher throughput. The last is to partially abandon the block layer abstraction and the disk buffer cache, reducing software stack and data movement overhead, while maintaining the conventional file system interface and semantics.
The paper chooses the last option for three reasons. First, legacy applications relying on the file system interface still work on PMFS with reduced latency and increased bandwidth, which eases software migration. Second, by getting rid of the software block layer, PMFS does not suffer from abstraction and data movement overhead, since data is exchanged directly between the application and NVM storage. Last, applications can also benefit from more powerful interfaces such as mmap(), which directly maps a range of virtual addresses to NVM physical addresses, enabling the application to read from and write to NVM directly.
The paper then proceeds to discuss three different techniques for enforcing write ordering on the current architecture. The first uses non-cacheable pages, which are configured via page table attribute bits; loads and stores to these pages bypass the cache. The problem, however, is that non-cacheable accesses incur performance overhead for all memory operations, while only a subset of them needs to observe a certain memory ordering. Besides, with caching disabled, all memory operations go over the memory bus, which has limited bandwidth; frequent accesses to the NVM device not only saturate NVM bandwidth quickly, but also impact the performance of other unrelated applications. The second option is non-temporal stores, special store instructions whose data is not expected to be accessed in the near future and therefore bypasses the cache hierarchy. Due to the extensive use of non-temporal stores in streaming workloads, the processor is also equipped with a write combining buffer, which tries to combine multiple smaller stores into a full cache line as much as possible to reduce write amplification. Using non-temporal stores, however, can lead to unexpected results or complicated interactions with cached loads; the implementation must then be very careful about placing memory barriers to avoid loading stale data or incurring severe performance loss. The last option is a persistence barrier consisting of cache line flush instructions followed by a store fence, which is compatible with most other instructions. PMFS chooses the persistence barrier to enforce write ordering due to its simplicity and efficiency. To issue the flush instructions, the paper assumes that software tracks dirty data at cache line granularity and only flushes dirty lines before the barrier; how this dirty tracking can be implemented is not discussed, though. The paper also mentions hardware-accelerated persistence such as epoch-based persistence. This approach, as pointed out by the paper, requires extensive hardware modification, such as cache line tagging and customized eviction algorithms, so it is unlikely that future hardware will adopt it due to its complexity.
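To make the chosen approach concrete, here is a minimal sketch of a persistence barrier in C, assuming the caller tracks dirty data at cache line granularity; the name pm_persist, the fixed 64-byte line size, and the use of plain clflush (the paper predates clwb/clflushopt) are assumptions of this sketch, not code from the paper:

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */

#define CACHELINE_SIZE 64   /* assumed cache line size */

/* Flush every cache line overlapping [addr, addr + len), then fence so
 * the flushes are ordered before any store issued after the barrier. */
static void pm_persist(const void *addr, size_t len)
{
    uintptr_t line = (uintptr_t)addr & ~((uintptr_t)CACHELINE_SIZE - 1);
    uintptr_t end  = (uintptr_t)addr + len;

    for (; line < end; line += CACHELINE_SIZE)
        _mm_clflush((const void *)line);
    _mm_sfence();
}
```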
The paper then gives an overview of PMFS's architecture. Two special considerations affect the design decisions of PMFS. The first is the byte-addressability of NVM, which enables fine-grained logging at arbitrary byte boundaries; PMFS is therefore optimized to log at smaller granularities. In addition, some operations are intrinsically atomic on the current architecture, so logging is not required for them. The second consideration stems from the fact that the NVM device can be directly mapped into the virtual address space. If not protected properly, a stray write from the kernel or a device driver can permanently alter data or metadata on PMFS, leading to data corruption. The design of PMFS should therefore take address space protection into account, minimizing the possibility of unintended writes to the mapped addresses.
The layout of PMFS is relatively simple. At the start of the mapped address space are two standard file system super blocks that store global metadata of the file system, followed by a logging area that stores log entries for in-flight metadata operations. inodes are organized as a B+Tree, as opposed to conventional file systems, where they are kept in a fixed-size table after the super block; B+Tree nodes are allocated from the NVM as pages. Each inode also uses a B+Tree to organize its blocks, with the offset of the root block stored in the inode. The rest of the storage is maintained by an allocator as pools of fixed-size pages: 4KB pages are used for internal data structures and metadata, while 4KB, 2MB, and 1GB pages are allocated for data. Although the details of the allocation algorithm are not discussed, the paper suggests that it is still at an early stage of development, and only simple strategies, such as coalescing a freed page with neighboring free pages, are used. The allocator, as mentioned later in the paper, maintains its internal data structures in volatile memory, which are only written back on an orderly shutdown and loaded on the next mount. On a system crash, this internal state is lost, and the recovery handler must scan all nodes in the file system to rebuild the allocation map after all other recovery steps are taken.
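To visualize the layout, the sketch below shows one way the on-NVM structures could be declared; every struct and field name here is invented for illustration and does not come from the paper or the PMFS source:

```c
#include <stdint.h>

/* Illustrative on-NVM layout, in order of increasing offset:
 *   [ super block | redundant super block | log area | page pools ... ]
 * The inode B+Tree, the per-inode block B+Trees, and the allocator's
 * page pools all live in the page area and are reached through offsets
 * stored in the super block and in the inodes. */
struct pm_super_block {
    uint64_t magic;
    uint64_t fs_size;           /* total size of the NVM image             */
    uint64_t log_area_off;      /* start of the circular log buffer        */
    uint64_t log_area_len;
    uint64_t inode_btree_root;  /* offset of the root inode B+Tree node    */
    uint64_t gen_id;            /* generation number used to validate logs */
};

struct pm_inode {
    uint64_t block_btree_root;  /* root of this file's block B+Tree        */
    uint64_t file_size;
    uint32_t block_shift;       /* data page size: 4KB, 2MB, or 1GB        */
    /* ... permissions, timestamps, link count, etc. ... */
};
```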
When mmap() is called on a file, PMFS decides whether the file is mapped with regular 4KB pages or with larger pages that reduce TLB and page fault overhead. Using large pages, on the other hand, can hurt performance by introducing external fragmentation in the physical address space, and/or by incurring extra write amplification when file data may be modified by different processes via copy-on-write (note that this is different from the copy-on-write technique for ensuring atomic data updates, as we will see below). The paper suggests that the page size is either given at mount time or inferred from file system calls that reveal likely future usage patterns.
The paper then discusses techniques for performing atomic updates. Three common techniques are covered: shadow paging, undo logging, and redo logging. Shadow paging is assumed to operate at page granularity: it features low write amplification when the majority of the page is updated, but suffers from high write amplification when only a few bytes are changed. Undo and redo logging, on the contrary, have a constant write amplification of two. Undo logging requires the pre-image to be persisted to the NVM before a data item is updated, thus requiring one persistence barrier per update. Redo logging, on the other hand, only requires two barriers: one between the log writes and the commit record, the other between the in-place data updates and the truncate record. However, redo logging requires read redirection if the log has not yet been replayed to its home addresses but the dirty data is read by the same transaction; in this case, the read must search the redo log to avoid accessing the stale version at the home address. PMFS selects cache-line-granularity undo logging for atomic metadata updates, since metadata updates are often small and sparse within a page. Data page updates are made atomic using shadow paging, as sketched below.
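A sketch of page-granularity shadow paging for the data path is shown below; it reuses the pm_persist barrier sketched earlier, and the function and parameter names are hypothetical:

```c
#include <string.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

extern void pm_persist(const void *addr, size_t len);  /* barrier sketch above */

/* Copy-on-write update of one data page: build the new version out of
 * place, persist it, and only then let the caller switch the block
 * pointer in the inode's B+Tree (a logged metadata update). A small
 * write still pays for copying and flushing the whole page. */
static void shadow_update(uint8_t *new_page, const uint8_t *old_page,
                          size_t off, const void *buf, size_t len)
{
    memcpy(new_page, old_page, PAGE_SIZE);   /* carry over untouched bytes  */
    memcpy(new_page + off, buf, len);        /* apply the new data          */
    pm_persist(new_page, PAGE_SIZE);         /* data durable before linking */
}
```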
Three additional lightweight techniques are used to perform atomic updates in place without logging or shadow paging. The first is the 8-byte store primitive, which is atomic as long as the store does not cross a cache line boundary. The second is the wider 16-byte compare-and-exchange instruction, LOCK CMPXCHG16B, which atomically updates two adjacent 8-byte words in the same cache line; the locked form of the instruction guarantees that the cache line cannot be evicted halfway through the update. The last method relies on Restricted Transactional Memory (RTM) support. RTM attempts to retain all dirty cache lines written by a transaction until a hardware limit is reached, in which case the transaction is aborted; before that happens, the transaction can update as many cache lines as it wants (in PMFS's case, one 64-byte line) without worrying about evictions. If too many aborts occur for a single update attempt, the working set of the update likely does not fit in RTM, and a fallback path with conventional logging is used instead.
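The sketch below illustrates these three in-place primitives; it again reuses the pm_persist sketch, requires -mcx16 and -mrtm to compile, and all function names are invented for illustration:

```c
#include <stdint.h>
#include <string.h>
#include <immintrin.h>   /* _xbegin, _xend for RTM */

extern void pm_persist(const void *addr, size_t len);  /* barrier sketch above */

/* 8-byte update: an aligned 64-bit store is atomic on x86, so no
 * logging is needed -- only a barrier to make the new value durable. */
static void atomic_update_u64(uint64_t *field, uint64_t val)
{
    *field = val;
    pm_persist(field, sizeof(*field));
}

/* 16-byte update via LOCK CMPXCHG16B (field must be 16-byte aligned).
 * Both 8-byte halves live in one cache line and reach NVM together. */
static int atomic_update_u128(__int128 *field, __int128 expected,
                              __int128 desired)
{
    if (!__sync_bool_compare_and_swap(field, expected, desired))
        return 0;
    pm_persist(field, sizeof(*field));
    return 1;
}

/* 64-byte update via RTM: stores inside the transaction cannot reach
 * NVM half-written, because eviction of a written line aborts the
 * transaction and discards its writes. On repeated aborts the caller
 * falls back to conventional logging. */
static int atomic_update_line(void *line, const void *src)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        memcpy(line, src, 64);
        _xend();
        pm_persist(line, 64);
        return 1;
    }
    return 0;   /* aborted: retry or take the logging fallback path */
}
```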
As discussed in previous sections, PMFS uses a dedicated logging area on the NVM image, maintained as a circular log buffer. Log entries are aligned to 64-byte cache line boundaries to simplify validation. Each log entry consists of 48 bytes of data, a gen_id field for validation, a txn_id field to distinguish between different metadata update operations, and other necessary header fields. With undo logging, log entries are truncated after a metadata update transaction commits, so there must be an efficient way of identifying invalid log entries when newer entries wrap around past them. In addition, after crash recovery all log entries are invalidated, which also requires a way to quickly invalidate every entry written before the crash. PMFS uses the gen_id field for this validation. A generation number is maintained in the super block of PMFS, and is incremented when the log buffer wraps around and after a crash recovery. When a log entry is generated, the content of the entry is first written without touching gen_id; gen_id is always written last, before the persistence barrier, to ensure that this field is persisted no earlier than the rest of the fields. Because log entries are cache line aligned, this requires no extra persistence barrier: either the gen_id field is written back together with all other fields of the entry, making the write atomic, or the cache line is evicted before gen_id is written, in which case the entry remains invalid until the final persistence barrier. In either case, once the new value of gen_id reaches NVM, the content of the log entry is guaranteed to be valid. During recovery, the log entries are scanned and replayed. An invalid log entry is identified when the gen_id of the next entry is smaller than that of the current one, which must be because the next entry is either (1) incomplete, because a crash happened before the barrier persisted it, or (2) merely a stale, already-truncated entry written before the last wrap-around.
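A sketch of what a 64-byte, cache-line-aligned log entry could look like is shown below; the 48-byte data area, gen_id, and txn_id come from the paper's description, while the remaining field names, widths, and ordering are guesses for illustration:

```c
#include <stdint.h>

/* One undo log entry, sized and aligned to a 64-byte cache line so that
 * it never straddles two lines and can be validated by gen_id alone. */
struct pm_log_entry {
    uint64_t addr;       /* NVM offset of the metadata being logged         */
    uint32_t txn_id;     /* which metadata operation this entry belongs to  */
    uint8_t  size;       /* number of valid bytes in data[]                 */
    uint8_t  type;       /* regular undo entry vs. commit/truncate record   */
    uint16_t gen_id;     /* written last; validates the whole entry         */
    uint8_t  data[48];   /* pre-image of up to 48 bytes of metadata         */
} __attribute__((aligned(64)));

/* 8 + 4 + 1 + 1 + 2 + 48 = 64 bytes. */
```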
PMFS metadata logging works as follows. First, a number of log entries are reserved by atomically incrementing the log tail pointer; for each metadata operation, this number can be determined in advance and should not be too large. Before each update, the old value of the metadata field is copied into a log entry and persisted to the NVM with a barrier, and the field is then updated in place. After all updates are completed, PMFS issues a persistence barrier over all dirty metadata fields to commit the transaction, and then writes a commit record to the log buffer to indicate that all of the transaction's entries are truncated.
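Putting the pieces together, the sketch below walks through this flow; it assumes the pm_persist and pm_log_entry sketches above are in scope, log entries are assumed to have been reserved by bumping the log tail, and current_gen_id plus the type encoding are hypothetical:

```c
#include <string.h>
#include <stdint.h>

extern void pm_persist(const void *addr, size_t len);   /* sketch above */
extern uint16_t current_gen_id(void);                   /* hypothetical  */

/* Log the pre-image of one metadata field, then update it in place.
 * One persistence barrier per logged field, as undo logging requires. */
static void log_and_update(struct pm_log_entry *le, uint32_t txn_id,
                           void *field, const void *new_val, uint8_t size)
{
    le->addr   = (uint64_t)(uintptr_t)field;  /* or an NVM-relative offset */
    le->txn_id = txn_id;
    le->size   = size;
    le->type   = 0;                      /* regular undo entry            */
    memcpy(le->data, field, size);       /* save the pre-image            */
    le->gen_id = current_gen_id();       /* written last, as noted above  */
    pm_persist(le, sizeof(*le));         /* barrier: entry is durable     */

    memcpy(field, new_val, size);        /* now safe to update in place   */
}

/* Commit: make the dirty metadata durable, then append a commit/truncate
 * record. A real implementation would flush all dirty lines under a
 * single fence rather than calling pm_persist once per field. */
static void commit_txn(struct pm_log_entry *commit_le, uint32_t txn_id,
                       void *const *dirty, const size_t *lens, int n)
{
    for (int i = 0; i < n; i++)
        pm_persist(dirty[i], lens[i]);

    commit_le->txn_id = txn_id;
    commit_le->type   = 1;               /* commit/truncate record        */
    commit_le->gen_id = current_gen_id();
    pm_persist(commit_le, sizeof(*commit_le));
}
```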
On crash recovery, the recovery handler scans the log buffer from the log head pointer. It first locates the last uncommitted transaction by finding a suffix of the log without a commit record. The uncommitted transaction is then rolled back by copying the pre-images in its log entries back to their corresponding locations.
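A sketch of this recovery pass, again under the illustrative log entry layout above (nvm_addr is a hypothetical helper that translates an NVM offset back to a pointer; wrap-around of the circular buffer is omitted for brevity):

```c
#include <string.h>
#include <stdint.h>

extern void pm_persist(const void *addr, size_t len);   /* sketch above */
extern void *nvm_addr(uint64_t off);                    /* hypothetical  */

/* Scan from the head: an entry whose gen_id drops below its predecessor's
 * is stale or incomplete and ends the valid region. Trailing entries not
 * covered by a commit record belong to the uncommitted transaction and
 * have their pre-images copied back to their home locations. */
static void recover_log(struct pm_log_entry *log, int nentries, int head)
{
    int last_commit = head - 1;      /* index of the last commit record  */
    int end = head;
    uint16_t prev_gen = log[head].gen_id;

    for (int i = head; i < nentries && log[i].gen_id >= prev_gen; i++) {
        prev_gen = log[i].gen_id;
        if (log[i].type == 1)        /* commit/truncate record           */
            last_commit = i;
        end = i + 1;
    }

    for (int i = end - 1; i > last_commit; i--) {        /* roll back    */
        memcpy(nvm_addr(log[i].addr), log[i].data, log[i].size);
        pm_persist(nvm_addr(log[i].addr), log[i].size);
    }
}
```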
Data operations are also made atomic using shadow paging. New pages are allocated in a copy-on-write fashion when an existing page is updated, and the new page is linked into the file after the operation completes by updating the inode's B+Tree, which is a metadata operation. Although the paper claims that data may become durable before metadata does, I did not see how this can be achieved, since the data pages are not visible before the metadata update commits.
PMFS maps the entire NVM device into the kernel virtual address space to simplify address management. This, however, increases the risk of a rogue device driver or a faulty OS module permanently corrupting NVM data with stray writes. Similarly, when an application maps a chunk of data using mmap(), there is also the risk that this memory is corrupted by the OS. Changing page permissions from read-write to read-only whenever the NVM pages are not intended to be written is theoretically possible, but in practice it incurs excessive TLB shootdown traffic, which severely degrades performance. The paper instead proposes using the processor control bit CR0.WP to protect the NVM space from unintended kernel writes. When CR0.WP is set, the kernel cannot write to read-only kernel pages; when it is clear, read-only permissions are overridden by the kernel (user processes can never access kernel memory, so they are not a concern). The PMFS driver always maps NVM pages into the kernel address space as read-only. During normal operation, CR0.WP is set to 1 to block writes; when a metadata field is to be updated, CR0.WP is cleared to 0 to open a write window. Since CR0.WP is not saved on a context switch, PMFS also disables interrupts while metadata fields are updated.
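A sketch of such a write window in Linux-kernel style is shown below; the helper names are invented, and how (or whether) a given kernel version lets a module toggle CR0.WP varies, so treat this as illustrative only:

```c
#include <linux/irqflags.h>      /* local_irq_save / local_irq_restore */
#include <asm/special_insns.h>   /* read_cr0 / write_cr0               */

#define PMFS_CR0_WP (1UL << 16)  /* the WP bit of CR0 */

/* Open a short window in which the normally read-only PMFS mappings can
 * be written: interrupts are disabled so the window cannot leak across a
 * context switch, then CR0.WP is cleared so the kernel may write
 * read-only pages. */
static unsigned long pmfs_writeable_begin(void)
{
    unsigned long flags;

    local_irq_save(flags);
    write_cr0(read_cr0() & ~PMFS_CR0_WP);
    return flags;
}

static void pmfs_writeable_end(unsigned long flags)
{
    write_cr0(read_cr0() | PMFS_CR0_WP);   /* re-arm write protection */
    local_irq_restore(flags);
}
```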
Before physical NVM devices are available as commercial products, the performance characteristics of PMFS can only be approximated through emulation. The paper introduces the Persistent Memory Emulation Platform (PMEP), which emulates both the latency and the bandwidth limit of NVM using special firmware and microcode. To emulate latency, PMEP monitors the number of stall cycles caused by LLC misses within an execution window; at the end of the window, extra stall cycles are injected by special microcode, computed from the measured LLC-miss stalls and the ratio of the emulated NVM latency to the DRAM latency. To emulate bandwidth, PMEP programs the DRAM controller to upper-bound the number of transactions per unit time, throttling the maximum bandwidth to around 9.5 GB/s.