SplitFS: Reducing Software Overhead in File Systems for Persistent Memory

- Hide Paper Summary

Paper Title: SplitFS: Reducing Software Overhead in File Systems for Persistent Memory
Link: https://dl.acm.org/citation.cfm?id=3359631
Year: SOSP 2019
Keyword: NVM; File System; SplitFS

Back

Highlight:

Performing logical logging using EXT4 DAX’s underlying atomic operation as primitives is a great innovation.
I like the idea of only writing data once while still providing data atomicity. Journaling file systems cannot do this because the journal is a shared object used for many purposes, such as metadata logging. This paper proposes decoupling data logging from other journaling tasks, and doing per-file journaling (i.e. having multiple staging files for handling writes). This has the advantage of both in-place update (optimized for read) and per-object journaling (optimized for write, less I/O).

Questions

Despite promising results, I do not quite buy the paper’s argument that Split file system can reduce software overhead. I understand that using block swap instead of write-ahead logging can save you some I/O (writing only once vs writing twice), but I cannot see how this reduce software overhead, or the paper uses the term “software overhead” to refer to logging?
Another confusing argument in this paper is that while logical logging is used to enforce atomicity of multi-step operation (which is an innovation), the paper did not mention that in order for this to work, we should issue fsync() to the underlying EXT4, and rely on atomicity guarantees of EXT4 and persistence guarantee of fsync() to log abstract operations of EXT4.
The paper did not mention how to undo an operation, e.g. during renaming the target name clashes with an existing file. Maybe SplitFS can check the condition by its own (which is redundant work), or it issues commands to EXT4 DAX and let EXT4 report error. The latter requires ability of undoing partial changes, since if EXT4 DAX reports error in the middle of an operation then SplitFS should roll back all prior actions.

This paper presents SplitFS, a user-space file system implementation aiming at reducing software overhead when running on the NVM. Byte-addressable NVM is more sensitive to the overhead of software stack due to its lower latency and higher bandwidth compared with conventional disks and SSD. The paper observes that most kernel file systems introduce non-negligible software overhead by performing file system calls into the kernal on normal operations such as file open and read/write. For example, the paper points out that on write operations, the file system has to perform block allocation, logging, metadata update, etc., which are all on the critical path.

SplitFS features a hybrid architecture in which metadata operations are delegatd to existing kernel file systems such as EXT4 DAX, while user data manipulation and semantics guarantees are delivered by a user space module. Building on top of an existing file system has two obvious benefits. First, metadata operations are rarer compared with data operations, while they often have subtle outcomes and implications, which on a software engineering point of view, is difficult to get correct without many hours of testing. Rather than dedicating thousands of hours creating and debugging a brand new mechanism that will not be used as frequently as normal data operations, just relying on existing mechanisms will be a good balancing point reagrding both performance and ease of development. Second, EXT4 only delivers limited semantics guarantees (i.e. POSIX), while in practice, a stricter or looser semantics may be desired. For example, the paper mentioned that SQLite does not require data writes to be fully atomic with regard to failures, since data writes are protected by logs, and log entry writes are ordered in a way that partially written log records can be identified. In such a use case, if the file system provides atomic write guarantees, we will have a write amplification problem, because every data and log write will be first logged by the file system, which is unnecessary. On the other hand, if the file system provides a strong support for atomicity and persistency of data, no logging is required even for transactional systems, as the logging can be done by the file system. SplitFS is able to provide different semantics guarantees by implementing flexible logging scheme in user space, which makes it more versatile. EXT4 DAX simply provides metadata atomicity guarantee, which enables logical logging to be used.

SplitFS provides three operating modes: POSIX, sync, and strict. Under POSIX mode, only metadata operations are guaranteed to be atomic with regard to failures. They are not necessarily synchronized (i.e. metadata updates may not be persisted and are prone to be lost even after the corresponding file system call returns). Data operations are neither atomic nor syncrhronized. Under sync mode, metadata operations are atomic and synchronized, and data operations are only synchronized but non-atomic. Note that atomicity does not imply synchronicity, since even if a write operation is atomic, the content of the write can still be lost after a crash if it is not synchronous. Synchronicity also does not imply atomicity, since synchronicity of write operations only dictate that the content of the write will not be lost after the operation completes; Partial writes can still occur if the system crashed before the write can return. Under strict mode, all metadata and data operations are atomic and synchronized, which means that their effect will be persistent right after the call returns, and if a crash happens before the operation completes, no harm will be done, and all partial updates will be rolled back. In the following discussion, we focus on the strict mode to demonstrate how SplitFS levarages EXT4 DAX.

SplitFS is compiled as a user space library, and is loaded at binary load time using LD_PRELOAD directive provided by the shell. The library overrides certain POSIX file system calls at glibc level (which is dynamically linked into the executable). When a glibc file system call is made by the application, SplitFS intercepts that call, analyzes the type and argument, and either redirect that call to the underlying EXT4 DAX file system (with possibly altered arguments and/or issuing more calls), or simply handles the request by itself. Since each application program has an instance of SplitFS running with it, SplitFS can provide support for different file access semantics to different applications in the same operating system.

At a high level, SplitFS handles writes atomically and synchronously using two techniques. The first technique is based on the observation that EXT4 DAX metadata operations are atomic (but no more!), which is demanded by POSIX. With these atomic “abstract operations” at hand, SplitFS only performs logical redo logging, which logs the high level abstarct operation on EXT4, and delegates the enforcement of atomicity of these abstract operations to EXT4. As a result, log entries in SplitFS are usually quite short, no larger than a single cache line for most simple operations, since only abstract EXT4 operations and their arguments are logged. The second technique is the use of staging files to buffer updates before they finally commit. Under strict mode, we must guarantee that an operation is recoverable if its operation has returned to the caller, and otherwise all changes may be simply discarded (but it does no harm to commit them whenever commit is possible). On each write or append request, SplitFS will redirect these writes to a pre-allocated staging file on the underlying EXT4 DAX file system, and create log entries describing the write. When the write is to be committed, SplitFS issues a special command, EXT4_IOC_MOVE_EXT, to EXT4 DAX using ioctl interface, the purpose of which is to atomically transfer the blocks in the staging file (i.e. dirty data of the write) to the actual being modified. The atomicity of this operation again is guaranteed by EXT4 DAX. During this process, only one write I/O to the NVM is performed instead of two, which is common for file systems that juornals data to provide atomic writes (first write to the journal, and second write to the actual file in-place).

On write or append operations, SplitFS first redirect data of the write or append to one of the free staging file. To avoid allocating staging files from EXT4 DAX on-demand, which is on the critical path and involves relatively slower metadata operations, SplitFS maintains a pool of free staging files, and pre-allocates a few from EXT4 DAX when it starts up. SplitFS also writes logs to reflect EXT4 DAX operations, such that they can be redone. When the write operation is to be committed, SplitFS writes a commit record to the log, and then issues a MOVE_EXT call to EXT4 DAX, which atomically swaps dirty blocks in the staging file to the master file, as described earlier. If synchronizicy is desired, SplitFS also issues fsync() on both files before the call returns. If the system crashes before writes are committed, no harm will be done to the master file, since it has not been updated in-place. The staging file will be deleted during recovery after they are identified (on clean shutdown there should not be any non-empty staging file, so just scan the pool and remove any staging file that are not empty).

On read operations, SplitFS first checks whether the read range overlaps a file range maintained in staging files. If true, reads are redirected to staging files with translated offsets. Otherwise reads are directly performed on the master file.

On other metadata updates, such as renaming or moving a file, SplitFS simply logs the operation using EXT4 DAX commands, commits the log record, and performs actual operations, before it returns results to the user. If synchronicity is also desired, SplitFS will call fsync() before returning to the caller.