Better I/O Through Byte-Addressable, Persistent Memory
Link: https://dl.acm.org/doi/10.1145/1629575.1629589
Year: SOSP 2009
Keyword: NVM; BPFS; File System
Highlight:
- Uses the B+Tree as a universal construct for representing all objects in the file system, i.e., the inode file, directory files, and user files.
- B+Trees are updated with either an atomic, single-word update or shadow copying. The latter works by duplicating the page, applying updates, and then atomically updating the pointer. If multiple pages are updated, the updates propagate to the common ancestor, which is itself shadow copied while its parent is updated atomically. BPFS uses the atomic update property to swing a pointer to the shadow copy in a single write.
- Write ordering can be enforced by tagging each block in the cache hierarchy with an epoch ID. A global epoch counter represents the current epoch, and all blocks written during the epoch are tagged with the counter value. The cache controller performs a tag walk when an epoch-tagged block is evicted: it finds all blocks with smaller epoch tags and evicts them first to enforce write ordering.
This paper introduces BPFS, a file system built on top of byte-addressable NVM. The paper recognizes that NVM is a promising storage medium for future file storage platforms due to its byte-addressability and low access latency. Compared to a conventional file system, BPFS has several advantages. First, its low read and write latency makes BPFS significantly faster than conventional file systems. Second, even compared with block-based file systems ported to NVM, BPFS still outperforms them by using a unique form of shadow paging that reduces write amplification and data copying. Third, the byte-addressability of NVM makes small writes and certain metadata operations fast, since a few writes suffice instead of copying and writing back entire blocks. Lastly, although not mentioned by the paper, BPFS is capable of operating without an OS-level intermediate buffer on the critical path, since the NVM address space can be directly mapped into the protected virtual addresses of the file system driver. Most notably, no data movement between NVM and DRAM is required, so a DRAM buffer is only optional, unlike in previous designs. Besides all these, BPFS also provides a stronger semantic guarantee without incurring any overhead beyond what shadow paging already imposes.
BPFS is implemented as a block-based file system on top of NVM, with shadow paging as one of its main update algorithms. BPFS is compatible with the conventional file system interface, and provides a rather strong semantic guarantee: all operations complete (almost) atomically in the order that they are issued by the application. Here “atomicity” refers to failure atomicity of data and metadata, meaning that an update to either or both is either fully committed or rolled back after a crash, leaving the file system and its data always in a consistent state. Conventional file systems usually only guarantee the consistency of metadata via journaling, but not data, in order to avoid the write amplification of writing the same data twice; BPFS’s consistency model is therefore a rather strong one, since both metadata and data are protected from crashes.
BPFS is implemented as a collection of B+Trees, which represent the inode file, directory files, and user files. Just like regular B+Trees, each file in BPFS is stored as an on-NVM tree with inner nodes storing pointers to next-level nodes, and leaf nodes storing pointers to data blocks. One particular optimization of the BPFS B+Tree is that a subtree can be trimmed if it contains nothing but NULL pointers at the leaf level, which saves storage and reduces tree traversal latency. The tree traversal functions, called “crawlers”, return immediately once a trimmed subtree is encountered, as if a NULL pointer had been found at the leaf level. Note that subtree trimming is only a physical-level optimization that does not affect the logical content of the tree: the size of the file represented by a tree is independent of the tree’s structure, and the crawler functions always treat missing levels or nodes as if they stored NULL pointers at the leaves.
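A minimal sketch of this crawler behavior is shown below. The node layout, fanout, and function names are my assumptions for illustration, not the actual BPFS source; the point is only that a NULL pointer at any level is treated as an all-NULL (trimmed) subtree.

```c
#include <stdint.h>
#include <stddef.h>

#define PTRS_PER_NODE 512   /* 4 KB node / 8-byte pointers (assumed fanout) */

/* A tree node is simply an array of 64-bit pointers to the next level
 * (inner node) or to 4 KB data blocks (leaf node). */
typedef struct bpfs_node {
    uint64_t ptr[PTRS_PER_NODE];
} bpfs_node;

/* Hypothetical crawler: walk down 'height' pointer levels toward block
 * number 'block_idx'. A NULL pointer at any level means the whole subtree
 * was trimmed and is treated exactly as if every leaf slot below it were NULL. */
static const void *crawl_read(const bpfs_node *node, unsigned height,
                              uint64_t block_idx)
{
    if (node == NULL)
        return NULL;                    /* trimmed subtree: logically empty */
    if (height == 0)
        return (const void *)node;      /* reached a data block */

    /* Select the child slot covering block_idx at this level. */
    uint64_t span = 1;
    for (unsigned h = 1; h < height; h++)
        span *= PTRS_PER_NODE;
    uint64_t slot = (block_idx / span) % PTRS_PER_NODE;

    return crawl_read((const bpfs_node *)(uintptr_t)node->ptr[slot],
                      height - 1, block_idx);
}
```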
Files in BPFS are referred to by a pointer to the root of their B+Tree. In addition, the height of the tree is encoded into the low bits of the pointer (tree pointers are 4KB aligned, so there are 12 redundant bits, which is more than sufficient). Encoding tree heights, although seemingly unnecessary at first glance, helps crawler functions identify trimmed subtrees, in which case NULL pointers will be observed at a non-leaf level. The height field and tree pointer can be updated atomically with a single 64-bit write when a new root node is created, so the tree reference is always consistent.
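A sketch of this packing, assuming the height simply occupies the 12 alignment bits (the exact field layout is my assumption, not taken from the paper):

```c
#include <stdint.h>

/* Root reference: a 4 KB-aligned node address with the tree height packed
 * into the low 12 bits, so both change together in one 64-bit store. */
#define HEIGHT_MASK 0xFFFULL

static inline uint64_t make_root(uint64_t node_addr, unsigned height)
{
    return (node_addr & ~HEIGHT_MASK) | ((uint64_t)height & HEIGHT_MASK);
}

static inline uint64_t root_addr(uint64_t root)   { return root & ~HEIGHT_MASK; }
static inline unsigned root_height(uint64_t root) { return (unsigned)(root & HEIGHT_MASK); }

/* Growing the tree: build the new root off to the side, then publish its
 * address and the new height with a single aligned 64-bit store. */
static inline void publish_root(volatile uint64_t *root_slot,
                                uint64_t new_node, unsigned new_height)
{
    *root_slot = make_root(new_node, new_height);  /* atomic if 8-byte aligned */
}
```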
The entry point of BPFS is the inode file, which stores inode objects indexed by inode number. There is only one global inode B+Tree in the file system, whose root pointer is stored at a known location in the super block. The inode serves the same purpose as in conventional file systems, storing metadata such as the file's root pointer, file size, date, time, and permission bits. Inode zero is by default the root directory of the file system, and the starting point of file system traversal. A directory is represented as a special type of file whose content is just an array of directory entries; each entry stores the file name, inode number, entry type, etc. Directory files and user files are also represented as B+Trees.
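A rough sketch of what these on-NVM records might look like; the field names and sizes are illustrative and not the paper's exact layout:

```c
#include <stdint.h>

/* Illustrative inode: one fixed-size record inside the global inode file,
 * located by inode number. */
struct bpfs_inode {
    uint64_t root;        /* file's B+Tree root pointer, height in low bits */
    uint64_t size;        /* file size in bytes                             */
    uint64_t mtime;       /* modification time                              */
    uint64_t ctime;       /* change time                                    */
    uint32_t mode;        /* file type and permission bits                  */
    uint32_t links;       /* link count                                     */
};

/* Illustrative directory entry: a directory is an ordinary file whose
 * contents are an array of these records. */
struct bpfs_dirent {
    uint64_t ino;         /* inode number; 0 marks an unused slot (assumed) */
    uint8_t  type;        /* regular file, directory, ...                   */
    uint8_t  name_len;
    char     name[110];   /* fixed-size name field for simplicity           */
};
```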
File system updates are performed using two techniques. The first technique relies on the hardware guarantee that 64-bit writes are atomic with respect to failures. If the update operation only causes a single 64-bit field to be written, then no special construct is needed: BPFS directly writes the field and flushes it to the NVM for persistence. One of the most typical use cases is an append that allocates a new data page. In this case, BPFS first allocates a data page from the free page pool, populates the page with data, and then installs the page into the B+Tree leaf node by simply overwriting the leaf pointer with the address of the data page. Metadata updates, such as access date/time, can also be performed with this technique.
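A hedged sketch of that append path: all the heavy writes go to a private page, and the single 64-bit leaf-pointer store is the commit point. `persist_range` and `alloc_nvm_page` are placeholders I introduce for whatever flush/fence and allocation primitives the platform provides.

```c
#include <stdint.h>
#include <string.h>

/* Placeholder: flush the given range from the caches and wait until it is
 * durable on the NVM (e.g. cache-line flushes followed by a fence). */
extern void persist_range(const void *addr, size_t len);

extern void *alloc_nvm_page(void);   /* hypothetical free-page-pool allocator */

/* Append one block to a file whose leaf slot 'leaf_slot' is still NULL. */
void append_block(volatile uint64_t *leaf_slot, const void *data, size_t len)
{
    void *page = alloc_nvm_page();        /* 1. grab a free page                */
    memcpy(page, data, len);              /* 2. fill it while it is unreachable */
    persist_range(page, 4096);            /* 3. make the data durable first     */

    *leaf_slot = (uintptr_t)page;         /* 4. commit: one atomic 64-bit store */
    persist_range((const void *)leaf_slot, sizeof(*leaf_slot));
}
```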
The second update technique is called short-circuit shadow paging, which involves copying the page to be updated, performing the actual updates on the copy, and then updating the reference to the old page in the parent so that it points to the new page. Note that in a conventional shadow-paging file system, this update causes a so-called “avalanche effect”: the update to the parent page itself requires duplicating the parent page, which recursively propagates to the root of the tree, causing huge write amplification. Since the entire structure of BPFS is a collection of B+Trees, this issue could be even worse. However, because pointer updates in BPFS are always atomic, the paper observes that the recursion can stop as soon as the parent page update can be performed with a single pointer update. If multiple paths are being updated this way, the commit point must be a common ancestor of all of them. Although, in rare occurrences, this approach may still propagate updates to the root of the file system, BPFS avoids full-path shadow paging in most cases since most updates are highly localized. One example of large-scale shadowing is when files are moved across directories: both the source and destination directories are updated, and they may be far apart in the inode B+Tree.
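A simplified sketch of the idea: copy-on-write proceeds bottom-up only until a level is reached where the whole change collapses into one 64-bit pointer swing. The recursion structure and names are my reconstruction, not the paper's code, and `persist_range`/`alloc_nvm_page` are the same placeholder primitives as above.

```c
#include <stdint.h>
#include <string.h>

extern void *alloc_nvm_page(void);
extern void persist_range(const void *addr, size_t len);

/* Shadow-update one child slot of 'node'. Returns the address the parent
 * should point at: the same node if the change could be committed in place
 * with one atomic store ("short circuit"), or a fresh shadow copy if the
 * caller must keep propagating toward the root. */
static void *shadow_update(uint64_t *node, unsigned slot, uint64_t new_child,
                           int commit_here)
{
    if (commit_here) {
        /* Short circuit: only a single pointer changes, swing it in place. */
        node[slot] = new_child;
        persist_range(&node[slot], sizeof(uint64_t));
        return node;
    }

    /* Otherwise shadow-copy this node and let the parent swing to it. */
    uint64_t *shadow = alloc_nvm_page();
    memcpy(shadow, node, 4096);
    shadow[slot] = new_child;
    persist_range(shadow, 4096);
    return shadow;
}
```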
Besides the non-volatile data structures on NVM, BPFS also maintains auxiliary volatile data structures in DRAM for faster access to certain metadata. The paper indicates that BPFS caches directories in hash tables to reduce directory file traversal time. In addition, the free block list, free inode list, etc., are also cached in DRAM. On initialization, BPFS scans for free blocks and free inodes and builds these structures. BPFS defines a free block as one not referenced by any of the B+Trees, and a free inode as one not referenced (by inode number) from any valid directory entry. Since NVM access latency is much lower than that of regular disks, this process only takes a matter of seconds on a reasonably large file system.
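A sketch of the kind of mount-time scan that could rebuild the volatile free-block map by marking every block reachable from a tree; the capacity, fanout, and helper names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

#define NBLOCKS       (1ULL << 20)   /* example capacity: 4 GB of 4 KB blocks */
#define PTRS_PER_NODE 512            /* assumed fanout, as in the earlier sketch */

static bool block_used[NBLOCKS];     /* volatile map, rebuilt in DRAM on mount */

extern uint64_t block_index(uint64_t nvm_addr);   /* hypothetical addr -> block no. */

/* Mark every block reachable from one tree; unreferenced blocks remain free. */
static void mark_tree(const uint64_t *node, unsigned height)
{
    if (node == NULL)
        return;                                   /* trimmed subtree: nothing to mark */
    block_used[block_index((uintptr_t)node)] = true;
    if (height == 0)
        return;                                   /* data block, no children */
    for (unsigned i = 0; i < PTRS_PER_NODE; i++)
        mark_tree((const uint64_t *)(uintptr_t)node[i], height - 1);
}
```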
The paper also points out that BPFS relies on hardware primitives to order writes to the NVM. This is critical for the correctness of recovery, since writes to shadow-copied pages must be persisted before the pointer that links these pages into the tree hierarchy is. Otherwise, on a system crash, some page contents may be corrupted because those writes were never persisted. The paper therefore proposes a hardware change, called epoch persistency, that enforces write ordering. In epoch persistency, the execution is divided into consecutive segments called epochs. Each epoch has an epoch number, which is dispensed by a current-epoch counter on the cache controller. The ordering enforced by epoch persistency is that writes in later epochs must not be persisted before writes in earlier epochs are. Applications issue a special primitive that increments the epoch counter, creating a new epoch and hence a new ordering point for all writes after it.
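Under this model, the shadow-paging ordering requirement maps onto two epochs: the shadow-page writes in one, the pointer swing in the next. A sketch, with `epoch_barrier()` standing in as a placeholder name for the primitive that bumps the epoch counter:

```c
#include <stdint.h>
#include <string.h>

extern void epoch_barrier(void);    /* placeholder: increment the current epoch */
extern void *alloc_nvm_page(void);  /* hypothetical allocator, as above          */

void shadowed_write(volatile uint64_t *parent_slot, const void *data, size_t len)
{
    uint64_t *shadow = alloc_nvm_page();
    memcpy(shadow, data, len);        /* epoch N: populate the shadow page        */

    epoch_barrier();                  /* writes above must persist before those below */

    *parent_slot = (uintptr_t)shadow; /* epoch N+1: swing the parent pointer      */
}
```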
The hardware modifications are as follows. Every block in the cache hierarchy is extended with an extra 64-bit epoch tag indicating the epoch in which the cache block was written. A “persist” bit is also added to indicate whether write ordering is enforced for this block; it is set if the block’s address lies in the mapped NVM address space. A store instruction marks the cache line with the current epoch by copying the value of the epoch register into the epoch tag field, if it is not set yet. If the epoch field is already set and is smaller than the current epoch, the line has to be written back first before the write can proceed.
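In C-flavored pseudocode, purely as an illustration of the behavior described above, the store path would look roughly like this:

```c
#include <stdint.h>

/* Illustrative model of the per-line state described in the text. */
struct cache_line {
    uint64_t epoch_tag;   /* epoch in which the line was last written */
    int      persist;     /* set if the address maps to NVM           */
    int      tagged;      /* whether epoch_tag currently holds a value */
    /* ... data, valid/dirty bits, etc. ... */
};

extern uint64_t current_epoch;                   /* epoch register */
extern void write_back(struct cache_line *line); /* flush toward the NVM */

void on_store(struct cache_line *line)
{
    if (!line->persist)
        return;                                   /* ordinary line: no ordering needed */

    if (line->tagged && line->epoch_tag < current_epoch)
        write_back(line);                         /* older-epoch data must persist first */

    line->epoch_tag = current_epoch;              /* stamp with the current epoch */
    line->tagged = 1;
}
```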
The cache replacement algorithm prioritizes blocks with smaller epoch tags. If a block has to be evicted but its epoch tag is not the smallest in the cache, the cache controller performs a full tag walk and writes back all lines with smaller epoch tags to preserve the write ordering. The same write-back can also be triggered by a store to a line tagged with an older epoch, as discussed above.
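The eviction side in the same illustrative style, reusing the `struct cache_line` model and `write_back` helper from the previous sketch:

```c
extern struct cache_line cache[];       /* all lines in this cache (illustrative) */
extern unsigned num_lines;
extern void write_back(struct cache_line *line);

/* Called when the replacement policy wants to evict 'victim'. */
void on_evict(struct cache_line *victim)
{
    if (!victim->persist || !victim->tagged) {
        write_back(victim);             /* no ordering constraint on this line */
        return;
    }

    /* Tag walk: flush every persistent line written in an earlier epoch so
     * that evicting the victim cannot reorder writes across epochs. */
    for (unsigned i = 0; i < num_lines; i++) {
        struct cache_line *line = &cache[i];
        if (line->persist && line->tagged && line->epoch_tag < victim->epoch_tag)
            write_back(line);
    }
    write_back(victim);
}
```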
The paper also proposes that the memory controller should be able to drain its request queue in the event of a power failure, since the processor and cache hierarchy treat a cache line as persisted once it is admitted into the memory controller’s queue. This is achieved by placing a few capacitors on board, so that they can power the controller long enough for the queue to drain.