Finding and Fixing Performance Pathologies in Persistent Memory Software Stacks
Link: https://dl.acm.org/citation.cfm?id=3304077
Year: ASPLOS 2019
Keyword: NVM; File System
This paper evaluates the performance of several disk-based and NVM-based storage systems on NVM. The emergence of NVM raises the new challenge of programming for fine-grained persistent devices that can be directly accessed by processors. Existing software stacks, such as file systems, are mostly designed for block-based persistent devices such as magnetic disks or SSDs. These designs expose an entirely different set of interfaces from what we are familiar with for in-memory applications, and/or manage data and metadata in ways that may not perform well on the newer hardware. Both factors make it challenging to migrate these systems from older platforms to NVM-based platforms.
One of the most prominent differences between disks and NVM is that NVM can be directly accessed by loads and stores issued by user applications, while disks can only be accessed via a predefined set of OS interfaces. Directly porting existing applications to NVM without changing the interface (i.e., treating NVM as a block device and only accessing it at block granularity) can create performance problems: the overhead of the software stack, the read and write characteristics of NVM, and the extra consistency requirements of NVM can each become a bottleneck. The following paragraphs discuss the problems pinpointed by the paper.
The first observation is that metadata changes are costly even on NVM file systems and should be avoided as much as possible. The paper uses SQLite as an example. SQLite supports four different logging modes, covering both redo and undo logging. In the undo logging scheme, a rollback journal file is created for each transaction to record the before-images of modified data items, ensuring atomicity. This journal file is truncated, deleted, or marked invalid by an extra persistent write at the end of the transaction. Redo logging, on the other hand, maintains a redo log across transactions, whose content is replayed onto the database image after a crash. Experiments show that undo logging generally performs worse than redo logging because of the file manipulation at the end of every transaction. Such file operations (delete or truncate) very likely modify multiple metadata pages, which must then be synchronized to the NVM to guarantee persistence. Redo logging, in contrast, does not require costly file operations at the end of a transaction and can therefore sustain higher throughput.
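As a rough sketch (not SQLite's actual code; function names, paths, and arguments are illustrative), the difference in per-transaction file operations looks like this:

```c
/* Hedged sketch contrasting the file operations each logging mode performs
 * per transaction. Names and paths are hypothetical. */
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Undo logging: a rollback-journal file is created, synced, and then
 * deleted (or truncated) at commit -- each step touches file-system
 * metadata that an NVM file system must persist synchronously. */
void undo_commit(const char *journal_path, const void *before_image, size_t len)
{
    int fd = open(journal_path, O_CREAT | O_WRONLY, 0644); /* metadata update */
    write(fd, before_image, len);
    fsync(fd);                  /* persist the before-image */
    close(fd);
    unlink(journal_path);       /* another metadata update, every transaction */
}

/* Redo logging: the WAL file persists across transactions, so a commit is
 * just an append plus a data sync -- no per-transaction metadata churn. */
void redo_commit(int wal_fd, const void *log_record, size_t len)
{
    write(wal_fd, log_record, len);
    fdatasync(wal_fd);          /* only the data needs to reach NVM */
}
```

The unlink (or truncate) in the undo path is what forces the extra metadata synchronization on an NVM file system.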
The second observation is that NVM allocation, unlike volatile memory allocation, is an expensive operation whose invocations should be kept to a minimum. The paper uses SQLite's redo logging and Kyoto Cabinet's write-ahead logging as examples. In both designs, the redo log is a single file in the file system, and its storage is allocated on demand for every log entry. NVM storage allocation, however, is significantly more expensive than in-memory allocation, because all allocator metadata must be kept strictly consistent on the NVM: memory leaks and data corruption would otherwise persist across crashes and reboots. This extra consistency requirement degrades allocation performance in both designs, and the paper points to two solutions. In the first, allocator metadata is not kept on the NVM; it is maintained in volatile memory and rebuilt during post-crash recovery by a full address-space scan. The second takes advantage of the fallocate system call, which pre-allocates storage for files that are expected to grow. By calling fallocate on the WAL file, we avoid frequent allocation and amortize the cost of expensive allocator calls over many appends to the log.
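A minimal sketch of the second solution, assuming a POSIX system and an illustrative WAL path and size, might pre-allocate the log file as follows:

```c
/* Hedged sketch: reserve WAL storage up front so appends do not trigger
 * NVM block allocation (and its metadata persistence) on every write. */
#include <fcntl.h>
#include <unistd.h>

int open_preallocated_wal(const char *path, off_t reserve_bytes)
{
    int fd = open(path, O_CREAT | O_RDWR, 0644);
    if (fd < 0)
        return -1;

    /* Reserve space now; later appends reuse already-allocated blocks,
     * amortizing allocator cost across many log records. */
    if (posix_fallocate(fd, 0, reserve_bytes) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

int main(void)
{
    /* Hypothetical path on an NVM mount; 64 MiB reservation. */
    int fd = open_preallocated_wal("/mnt/pmem/app.wal", 64 << 20);
    if (fd >= 0)
        close(fd);
    return 0;
}
```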
The third observation is that using memory-mapped NVM regions with cache flushes is always better than relying on file system interfaces. By mapping the NVM file into the virtual address space with mmap(), we bypass the file system entirely, removing the overhead of system calls and of traversing the file system software stack. When combined with pre-allocation of file storage, this approach also needs the mremap() system call to accommodate newly allocated space. In addition, the paper suggests that using cache-flush instructions (clwb, clflushopt) together with memory fences for data persistence is better than using msync(), since msync() writes back an entire page even if only a small part of it is dirty.
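A hedged sketch of this approach, assuming a DAX-mounted NVM file system, a CPU with clwb support (compile with -mclwb), and an illustrative file path:

```c
/* Hedged sketch: map a file on a DAX mount and persist an update with
 * cache-line write-backs plus a fence instead of msync(). */
#include <fcntl.h>
#include <immintrin.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CACHELINE 64

static void persist_range(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    for (; p < end; p += CACHELINE)
        _mm_clwb((void *)p);   /* write back each touched cache line */
    _mm_sfence();              /* order the write-backs before continuing */
}

int main(void)
{
    int fd = open("/mnt/pmem/data.bin", O_RDWR);   /* hypothetical DAX file */
    size_t len = 4096;
    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED)
        return 1;

    memcpy(base, "hello", 6);   /* stores go directly to the mapped NVM */
    persist_range(base, 6);     /* flush only the modified cache lines */

    munmap(base, len);
    close(fd);
    return 0;
}
```

Compared with msync(), only the cache lines that were actually written are flushed, which matches the paper's argument against page-granularity write-back.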
The fourth observation is that the old assumption that data can only be transferred atomically in 4KB blocks becomes a problem with byte-addressable NVM. The paper uses the journaling block device, JBD2, as an example. For the reasons discussed above, JBD2 always journals modified metadata in 4KB pages to ensure write atomicity, even when multiple small metadata changes land on different pages. This translates into multiple 4KB writes to the NVM, leaving much room for improvement since the amount of data actually changed is far smaller. As an improvement, the authors redesigned JBD2 into the Journaling DAX Device (JDD). JDD logs only the metadata bytes affected by a write, reducing the amount of information written to a minimum.
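Purely as a hypothetical illustration (not the paper's actual JDD data structures), a byte-granularity journal record might carry only the modified bytes plus their destination, rather than a full page image:

```c
/* Hedged sketch of a byte-granularity journal record; field names and
 * layout are illustrative, not taken from JDD. */
#include <stddef.h>
#include <stdint.h>

struct byte_journal_record {
    uint64_t dest_offset;   /* where on NVM the change applies */
    uint32_t length;        /* number of bytes actually modified */
    uint8_t  data[];        /* only the new bytes, not the surrounding page */
};

/* A JBD2-style record would instead carry a full 4096-byte page image
 * even when only a handful of metadata bytes changed. */
static size_t record_size(uint32_t modified_bytes)
{
    return sizeof(struct byte_journal_record) + modified_bytes;
}
```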
Lastly, because NVM access speed is close to that of DRAM, NVM-based systems expose greater parallelism. In such systems, centralized coordination, such as a single global journaling device, can easily become a bottleneck without proper tuning, as is the case with JBD2 mentioned above. In JBD2, all threads generating journal records must serialize with each other, since the journal is protected by a single lock, limiting concurrency. To solve this problem, the paper suggests maintaining thread-local journaling space so that each thread can generate its own records without interfering with the others. Better performance was observed after replacing JBD2 with JDD.
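A minimal user-space sketch of the thread-local idea (illustrative names and sizes, not the paper's kernel implementation) keeps a private append buffer per thread so the append path takes no global lock:

```c
/* Hedged sketch: per-thread journal buffers in the spirit of JDD's
 * thread-local journaling space. Uses the GCC __thread extension. */
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define JOURNAL_AREA_SIZE (1u << 20)   /* 1 MiB of journal space per thread */

struct thread_journal {
    char  *area;    /* this thread's private journal area */
    size_t head;    /* next free offset within the area */
};

static __thread struct thread_journal tj;   /* one instance per thread */

static int journal_append(const void *record, size_t len)
{
    if (tj.area == NULL)
        tj.area = malloc(JOURNAL_AREA_SIZE);   /* lazily set up per thread */
    if (tj.area == NULL || tj.head + len > JOURNAL_AREA_SIZE)
        return -1;                             /* caller must checkpoint/reclaim */
    memcpy(tj.area + tj.head, record, len);    /* no global lock on this path */
    tj.head += len;
    return 0;
}
```

Commit and checkpointing would still need coordination across threads, but the common append path no longer serializes on a single journal lock.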