DaxVM: Stressing the Limits of Memory as a File Interface
Link: https://ieeexplore.ieee.org/document/9923852/
Year: MICRO 2022
Keyword: Virtual Memory; NVM; mmap
Highlights:
- mmap()'ed files do not necessarily offer better access performance than the traditional file interface, due to factors such as page fault handling, kernel lock contention, and dirty bit tracking.
- On x86, the page table of a mapped file can be constructed as physical pages are allocated to the file, because the page table is a radix tree and hence does not need the virtual address explicitly. The pre-populated page table can be plugged into the process page table with constant cost (but only at aligned offsets).
- Most file accesses are ephemeral, meaning that many virtual pages will be recycled quickly. These virtual pages can be maintained in an ephemeral heap, separate from the regular virtual addresses.
- TLB shootdowns, as a result of munmap(), can be batched for multiple files and implemented as flushing the entire TLB. This approach may expose certain security risks, though, and must be used carefully.
- Many NVM applications track the dirty status of addresses with user-space protocols, making kernel tracking redundant. Kernel dirty status tracking can optionally be removed in this scenario to avoid the page fault overhead.
- Blocks can be zeroed in the background when they are recycled, hence also removing this overhead from the critical path.
Comments:
- The pre-populated page table assumes that the NVM storage is always mapped at the same physical address. This, however, may not be guaranteed, especially if the NVM device is moved to a different system where other NVM devices are already installed. Besides, it breaks the portability of the file system across architectures, as a page table generated on an x86 system cannot work on ARM. The second issue is easy to solve: if compatibility is a concern, the default mmap() behavior can simply be restored.
This paper proposes DaxVM, a virtual memory abstraction tailored to the usage scenarios of direct-mapped NVM. DaxVM is motivated by the observation that the traditional virtual memory interfaces, i.e., mmap() and munmap(), are poorly suited to direct-mapped NVM because they were originally designed for disk-backed file mappings or anonymous mappings. As a result, applications may suffer from issues such as paging overhead, kernel lock contention, and expensive operations on the critical path. As a solution, the paper changes the implementation of the existing virtual memory interfaces and adds new ones in order to optimize system performance for direct-mapped NVM.
The paper assumes direct-mapped NVM access, in which the NVM storage is mapped into part of the physical address space. The NVM-based file system manages the NVM storage area and allocates NVM pages to files as in a regular file system. When a file is to be accessed by user applications, instead of using the traditional read() and write() system calls, the file system lets applications directly map the physical pages of the file into their virtual address space via mmap(). The mapped file can then be accessed as a memory object, while the OS handles page faults as well as the mappings between virtual and physical addresses.
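To make the access model concrete, below is a minimal user-space sketch of direct-mapped (DAX) file access in C. The mount point /mnt/pmem and the file name are hypothetical, and MAP_SYNC is only honored on DAX-capable file systems; the sketch simply shows that, once mapped, the file is read and written with ordinary loads and stores rather than read()/write() copies.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 4096;
    int fd = open("/mnt/pmem/example.dat", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, len) != 0)
        return 1;

    /* MAP_SHARED_VALIDATE | MAP_SYNC requests a true DAX mapping: stores
     * reach the NVM media without going through the page cache, and the
     * call fails if the file system cannot guarantee this. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    strcpy(p, "hello, DAX");   /* store goes directly to the mapped file */
    printf("%s\n", p);         /* load comes directly from the mapping   */

    munmap(p, len);
    close(fd);
    return 0;
}
```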
While the direct-mapped access model avoids unnecessary data copies, i.e., from the disk to the OS page cache and from the OS page cache to the user-space buffer (e.g., the buffer passed to the read() system call), the paper still observes several performance issues with this access model. One example is when an application accesses many small files for a small amount of data each, and each file offset is accessed only once. In this scenario, the paper shows that performance is much worse than with the traditional file interface. The second example is when multiple threads of the same process access mapped files. The paper reports decreased throughput as the thread count increases and attributes this to lock contention in the OS kernel. Lastly, the paper also experiments with different file sizes. It is shown that when the access pattern leaves the address space fragmented, i.e., mapped with many regular 4KB pages, performance is worse than when the file is backed by 2MB huge pages. As a result, 2MB files can yield far better performance, since the entire file can be backed by a single huge page (of course, this only works when the underlying physical pages are also contiguous on the NVM).
The paper then analyzes the causes of these problems and proposes corresponding solutions. First, the paper suggests that paging overhead can cause suboptimal performance, since the existing implementation of mmap() populates page table entries for the process only lazily, i.e., when a mapped virtual page is accessed for the first time. Consequently, the application must pay the overhead of a page fault for every virtual page accessed, which is proportional to the memory footprint of the file access. To address this problem, the paper proposes that the file system pre-populate the page table entries for the file when physical pages are allocated to it. The pre-populated page table is stored as file metadata, and it is structurally identical to a subtree of the existing page table radix tree (and hence the top-level node is a PUD, a PMD, or, less likely, a PTE table). With the pre-populated page table, when a file is opened for access, the pre-populated subtree is directly wired into the process's page table. File access permissions are set on a per-file basis (finer granularities are disabled) at the top-level entries pointing to the wired subtree. This new feature enables O(1) file mapping and page fault handling, at the cost of slower MMU page walks (as they must access the page table stored on the NVM).
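The constant-cost wiring step can be illustrated with a toy model. The sketch below simulates a simplified x86-64-style radix page table entirely in user space (the structures and functions are illustrative, not kernel code); the point is that the file's PTE-level subtree is built without knowing any virtual address, and attaching it later is a single store into an aligned PMD slot.

```c
#include <stdint.h>
#include <stdlib.h>

#define ENTRIES 512   /* entries per page table node on x86-64 */

typedef struct node {
    uint64_t entry[ENTRIES];   /* each entry: child pointer or leaf PFN */
} node_t;

/* Build a PTE-level node for a file as its physical pages are allocated;
 * note that no virtual address is needed to construct this subtree. */
static node_t *build_file_pte_table(const uint64_t *pfns, int npages)
{
    node_t *pte = calloc(1, sizeof(*pte));
    for (int i = 0; i < npages && i < ENTRIES; i++)
        pte->entry[i] = pfns[i];
    return pte;
}

/* Wiring the pre-built subtree into the process page table is one store
 * into the PMD slot covering a 2MB-aligned virtual region: constant cost,
 * independent of file size (files up to 2MB in this toy model). */
static void wire_subtree(node_t *pmd, uint64_t vaddr, node_t *file_pte)
{
    int slot = (vaddr >> 21) & (ENTRIES - 1);   /* PMD index of the region */
    pmd->entry[slot] = (uint64_t)(uintptr_t)file_pte;
}

int main(void)
{
    uint64_t pfns[4] = { 100, 101, 102, 103 };   /* PFNs allocated to the file */
    node_t *file_pte = build_file_pte_table(pfns, 4);
    node_t *pmd = calloc(1, sizeof(*pmd));
    wire_subtree(pmd, 0x7f0000200000ULL, file_pte);  /* 2MB-aligned address */
    return 0;
}
```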
Second, the paper addresses the kernel lock contention problem by observing that most file accesses are ephemeral, i.e., most files are accessed for less than 0.25 seconds. One implication is that a small number of virtual pages can satisfy most file mapping requests, as these pages are expected to be recycled shortly after the files are unmapped. Based on this observation, the paper proposes allocating a small private heap of virtual addresses for each process to serve mmap() requests, rather than letting mmap() grab the kernel lock on the process's memory descriptor every time new virtual addresses are needed. In addition, to support fast operations on this virtual address heap, the design maintains mapping metadata only for the first page of a mapped file in a dedicated data structure. Consequently, a file can only be mapped with a single permission set and must not be unmapped partially.
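A minimal sketch of the ephemeral-heap idea is shown below, assuming fixed-size 2MB slots carved out of one address range reserved up front; the slot allocator and its sizes are illustrative choices, and DaxVM implements the equivalent bookkeeping inside the kernel rather than in user space.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define SLOT_SIZE  (2UL << 20)   /* one 2MB slot per mapped file         */
#define SLOT_COUNT 1024          /* the heap covers 2GB of address space */

static char *heap_base;
static int   free_slots[SLOT_COUNT];
static int   free_top;

/* Reserve the whole address range once; individual files are later placed
 * into slots of this range, so no per-file address-space search is needed. */
void heap_init(void)
{
    heap_base = mmap(NULL, SLOT_SIZE * SLOT_COUNT, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    for (int i = 0; i < SLOT_COUNT; i++)
        free_slots[free_top++] = i;
}

void *slot_alloc(void)            /* O(1): pop a recycled slot */
{
    return free_top ? heap_base + (size_t)free_slots[--free_top] * SLOT_SIZE
                    : NULL;
}

void slot_free(void *addr)        /* O(1): push the slot back for reuse */
{
    free_slots[free_top++] = (int)(((char *)addr - heap_base) / SLOT_SIZE);
}
```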
The third problem is the overhead of munmap(), which is paid when a file is no longer needed by the application. The paper observes that the most time-consuming part of munmap() is performing TLB shootdowns, which involve IPIs and spin waiting. To remove TLB shootdowns from the critical path, the paper proposes asynchronous unmap, which batches the TLB shootdowns that would otherwise be performed by individual munmap() calls and performs them all at once by flushing the entire TLB. This feature is implemented by tracking virtual ranges (called "zombie VMAs") that are already unmapped but not yet shot down from remote TLBs. When the number of such pages exceeds a threshold, the shootdown is performed using the regular shootdown path, but the TLB is flushed entirely. The paper also notes that this approach may expose security risks if an unmapped page remains accessible on remote cores, especially when the physical page is recycled for storing valid data of another file. In this scenario, the design simply performs TLB flushes eagerly to prevent unexpected accesses to the underlying physical addresses.
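The batching logic can be summarized with the following toy model, in which defer_unmap() and flush_entire_tlb() are illustrative placeholders for the kernel's real bookkeeping and IPI-based flush; the point is only that many deferred unmaps are retired by one full TLB flush instead of one targeted shootdown each.

```c
#include <stddef.h>

#define ZOMBIE_THRESHOLD 64

struct zombie { void *addr; size_t len; };

static struct zombie zombies[ZOMBIE_THRESHOLD];
static int nzombies;

static void flush_entire_tlb(void) { /* stands in for the IPI-based flush */ }

/* Called instead of an eager shootdown when a file is unmapped. */
void defer_unmap(void *addr, size_t len)
{
    zombies[nzombies].addr = addr;
    zombies[nzombies].len  = len;
    if (++nzombies == ZOMBIE_THRESHOLD) {
        flush_entire_tlb();   /* one full flush retires all pending zombies */
        nzombies = 0;
    }
}

/* Called when a zombie's physical pages are about to hold new data:
 * flush eagerly so stale translations cannot expose the new contents. */
void flush_before_reuse(void)
{
    if (nzombies) {
        flush_entire_tlb();
        nzombies = 0;
    }
}
```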
The next problem is dirty status tracking, which is needed by the msync() and fsync() system calls to flush dirty cache blocks back to the NVM. In the current Linux kernel, dirty pages are tracked via page faults: the kernel initially maps all pages of a file as read-only, and marks the corresponding page as dirty in the page cache when the first write incurs a fault. However, as shown earlier, page faults at 4KB granularity constitute a large portion of the execution overhead. The paper hence proposes two possible solutions. The first remains compatible with the current system, but tracks dirty status only at higher levels of the page table, i.e., at 2MB granularity or above. This can be implemented by setting the write permission on the higher-level page table entries. The second, more radical solution removes dirty status tracking at the kernel level entirely and encourages user applications to implement their own durability protocol. The second solution makes sense for direct-mapped NVM, as the durability protocol is often fine-grained and implemented at the user-space level anyway.
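For the second solution, a user-space durability protocol typically flushes exactly the cache lines it wrote and then fences, as in the hedged sketch below. It assumes x86 with CLWB support (e.g., compiled with -mclwb); persist() and the 64-byte line size are local choices for illustration, not an API from the paper.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

#define CACHELINE 64

/* Write back every cache line in [addr, addr + len) and fence. */
static void persist(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    for (; p < end; p += CACHELINE)
        _mm_clwb((void *)p);   /* write the dirty cache line back to NVM    */
    _mm_sfence();              /* order the write-backs before later stores */
}

/* Example: update a record in a DAX-mapped region and persist only it,
 * with no msync() and hence no kernel-side dirty tracking required. */
void update_record(char *rec, const char *val, size_t len)
{
    memcpy(rec, val, len);
    persist(rec, len);
}
```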
The last problem is block zeroing, which happens in the current system when a new block is allocated. Block zeroing prevents the contents of an already-freed block from leaking to a different user when the block is recycled, and is hence a critical security measure. However, this operation is currently performed on the critical path and thus hurts performance. To address this problem, the paper proposes adding recycled pages to a special per-core list and using a kernel thread to zero them out in the background when the system is idle. This way, block zeroing need not be performed on the critical path, as the block allocator can simply pick already-zeroed blocks from the per-core lists.
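A rough sketch of the background-zeroing scheme is shown below, using a pthread in place of the kernel thread and illustrative per-core list structures; the only point is the division of work: the worker zeroes freed blocks off the critical path, and the allocator's fast path hands out blocks that are already zeroed.

```c
#include <pthread.h>
#include <sched.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define LIST_CAP   256

struct zero_list {
    pthread_mutex_t lock;
    void *dirty[LIST_CAP]; int ndirty;   /* freed blocks, contents stale    */
    void *clean[LIST_CAP]; int nclean;   /* blocks that are already zeroed  */
};

/* Background worker: drain the dirty list, zero blocks, refill the clean list. */
void *zeroing_thread(void *arg)
{
    struct zero_list *zl = arg;
    for (;;) {
        pthread_mutex_lock(&zl->lock);
        if (zl->ndirty > 0 && zl->nclean < LIST_CAP) {
            void *blk = zl->dirty[--zl->ndirty];
            pthread_mutex_unlock(&zl->lock);
            memset(blk, 0, BLOCK_SIZE);   /* zeroing happens off the critical path */
            pthread_mutex_lock(&zl->lock);
            zl->clean[zl->nclean++] = blk;
        }
        pthread_mutex_unlock(&zl->lock);
        sched_yield();                    /* only run when otherwise idle */
    }
    return NULL;
}

/* Allocation fast path: pick an already-zeroed block, no memset needed here. */
void *alloc_zeroed(struct zero_list *zl)
{
    pthread_mutex_lock(&zl->lock);
    void *blk = zl->nclean ? zl->clean[--zl->nclean] : NULL;
    pthread_mutex_unlock(&zl->lock);
    return blk;   /* NULL means the caller must zero a block itself */
}
```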
All the proposed features above are implemented as two system calls daxvm_mmap() and daxvm_munmap(). The two system calls accept a set of new flags to selectively enable or disable these features based on the application’s requirements. The existing virtual memory system calls, such as mremap() and madvise(), are also modified accordingly such that certain invariants of DaxVM will not break (e.g., files must not be partially unmapped and remapped).
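For illustration, the sketch below shows how an application might opt into individual features through these system calls; the prototypes and flag names are hypothetical placeholders inferred from the summary, not the paper's actual interface.

```c
#include <stddef.h>

/* Assumed prototypes; the real signatures are defined by DaxVM's kernel patch. */
void *daxvm_mmap(int fd, size_t len, int prot, int flags);
int   daxvm_munmap(void *addr, size_t len, int flags);

/* Illustrative placeholder flags, one per optional feature. */
#define DAXVM_PREPOPULATED   0x1   /* wire the file's pre-built page table (O(1) map) */
#define DAXVM_ASYNC_UNMAP    0x2   /* defer TLB shootdowns to a batched full flush    */
#define DAXVM_NO_DIRTY_TRACK 0x4   /* the app handles durability itself (e.g., clwb)  */

void example(int fd, size_t len, int prot)
{
    void *p = daxvm_mmap(fd, len, prot,
                         DAXVM_PREPOPULATED | DAXVM_NO_DIRTY_TRACK);
    /* ... access the mapped file as memory ... */
    daxvm_munmap(p, len, DAXVM_ASYNC_UNMAP);
}
```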