NOVA: A Log-Strictured File System for Hybrid Volatile/Non-volatile Main Memories
- Hide Paper Summary
Link: https://www.usenix.org/node/194455
Year: FAST 2016
Keyword: NVM; File System; NOVA; Log-Structured
This paper presents NOVA, a file system optimized for NVM which provides strong atomicity guarantees. In order for a file system operation to be atomic, either all of its sub-operations are committed to persistent storage, or none of them is committed. This is especially true when the file system suffers from a crash or failure, in which case the persistent storage may only contain partially updates states, due to the fact that page buffers maintained by the OS or hardware caches can evict pages back to the storage at unpredicable times.
As a result, several techniques are implemented to prevent these partial updates from corrupting the state of the file system, typically via a post-crash recovery process. The first technique is called journaling, in which all modifications to the file system, including metadata and data, are first written to a persistent journal, and then written in-place. A write ordering is enforced between the commit of the journal and the actual update of file system states to ensure that these operations can always be replayed after the journal has been committed. In the post-crash recovery process, if the journal is found to have committed, it will be replayed. The journal can only be removed after all dirty pages are flushed to the disk. The problem of journaling is that it doubles I/O traffic to the disk, since both the journal and the data need to be persisted. The second technique is shadow paging, which is commonly used in tree-structured file systems. In such a file system, it is assumed that directories and files are organized into a search tree, such that there is only one pointer to a page. When a page is to be updated, it is first duplicated by copying all contents of the page to another page, and all modifications are performed on the new page. Non-atomic modifications to the page are committed by an atomic compare-and-swap to the parent of the page. In this scheme, any change made on any level of the tree will propagate to the root, since when we modify the pointer on the parent node, a new copy of the parent node is also created and then swapped into the grandparent node. This cascading effect unnecessary memory traffic and can be expensive. The last technique is log-structured file system, in which all data and metadata are maintained in an append-only manner. Log-structured file systems can commit changes atomically by first appending the changes at the end of the log, followed by an atomic swap of the log tail pointer. This scheme, however, suffers from garbage collection problem, since obsolete data and metadata will need to be constantly removed to reclaim storage. In the case where no continuous storage can be reclaimed, live data needs to be moved around, which also causes unnecessary traffic.
NOVA features log-structured atomic commit of both data and metadata changes, without the cost of expensive page-level journaling, shadow paging, or garbage collection. This is achieved by adapting log-structured updates to provide the ideal semantics of atomic operations, while not maintaining the log as a physically sequential object on the disk. This combination is valid for NVM, due to the fact that NVM writes are less sensitive to the sequentiality of data, while on a spinning disk or SSD it is best practice to make sequential write the common case.
Based on the physical location, NOVA metadata is divided into two parts: An in-memory part which supports efficient lookup but provides no atomic support, and a log-structured NVM part which is difficult to search with (have to scan the log), but supports efficient atomic update. The in-memory part is constructed from the NVM part lazily during normal operations. When the file system is first mounted, only essential metadata is initialized from the super block. As user applications issue file accesses, the in-memory representation of the inode, directory, memory allocator, etc. are initialized by scanning the log and applying all changes.
NOVA organizes its inode table similar to a conventional file system. Instead of appending all inode modifications into the master log as in log-structured file systems, NOVA maintains a central inode table at the beginning of the address space, and partitions the inode table across cores to make parallel inode allocation and scanning possible. An inode in NOVA consists of two pointers, a log head pointer and a log tail pointer. Logs are maintained in a per-inode manner, in the form of linked lists of log blocks. These log blocks do not necessarily occupy continuous storage on the NVM. The log head pointer points to the first log block of the inode, while the log tail pointer points to the last committed state of the inode. States can be committed onto an inode by atomically moving the log tail pointer. The remaining part of the address space is maintained as a heap, from which log blocks and data pages can be allocated in a non-log-structured manner. Precise allocator states are maintained in the main memory as a block pool, which is persisted back to the NVM on a normal shutdown. On a failure, the persistent version of allocator state might be inconsistent, in which case we need to reconstruct the state by scanning all inodes and marking pages used by these inodes as active. After the scan, pages that are unmarked are added to the pool. To facilitate multi-inode updates, each core is also allocated a journal on the NVM, the operation of which will be described later.
We first describe single inode update in NOVA in this paragraph, and multi-node update in the next paragraph. In order to atomically update an inode, we first generate log entries describing the update, and then append these log entries to the log blocks of the inode to be updated. The update is committed by moving the log tail pointer in the inode to the last entry that we just appended. This final commit operation is atomic since it is only a 8-byte memory write which is then flushed back to the NVM. The in-memory representation of the inode is also updated accordingly after the commit, but this does not have to be atomic with regard to failures (the inode is still locked to prevent other threads accessing it though).
Multi-inode updates are more complicated since we cannot update multiple cache lines atomically with regard to the NVM with simple primitives. As a solution, the journal is used to coordinate multiple log tail writes which we describe as follows. First, log entries are generated and appended to the log as in single node update. Then, before moving the log tails, we copy the new value of the log tail at each affected inode into the journal, and flushes the journal back to the NVM. The multi-inode update has been committed after this flush operation. The last step is to update the log tails in each inode, and then flush the new value of the tails back to NVM. The journal is then truncated after all log tails have been persisted. Note that updating the journal is exactly like updating a single inode: We maintain the journal as a 4KB cicular buffer, and use two pointers to indicate the head and tail. When new entries are appended to the journal, we first write the entry, flush them back to NVM, update the tail, and then flushes the tail. This way the journal is updated atomically, with the flush of the tail pointer as the atomic point. Truncating the journal is similar: we simply update the head pointer to the current tail, and then flush the head pointer. Again, since updates to a single 8-byte pointer is always atomic with regard to failure, no complicated mechanism other than a persiste barrier is sufficient, making the entire sequence of update lightweight.
Directories in NOVA are maintained as logs on the NVM. The creation, deletion of files, and updates to file attributes are all stored as log entries to the directory’s inode log. Updating the directory is the same as updating an inode using the update scheme described above. To accelerate directory lookup, NOVA maintains an in-memory copy of the directory as a radix tree, such that the directory can be enumerated without having to walk the log.
File writes are committed to the inode of the file as an extent encoded as the begin address of the block and the length of the extent. If a write request overwrites an existing extent, NOVA first copies the old extent into the new page, and then commit updates as a new extend with the new page (note that pages are flushed before committed). Such copy-on-write style file update ensures that a file is never partially updated, as the old page is always consistent before we commit the inode log. Old pages can be reclaimed after the write is committed. On a system crash, either old page or new page will be leaked (i.e. unreferenced). In this case, the recovery handler will scan the inode’s write history, and locate pages that are unreferenced, and unmark them in the allocator.
Similar to any log-structured file system, NOVA needs constant garbage collection (GC) to free up space occupied by invalidated log entries. One major differences of NOVA is that it requires significantly less effort to be spent on GC at runtime, for three reasons. First, NOVA maintains per-inode log, which implies that GC can be performed on individual inodes seperately. The amount of data to be recollected and compressed per GC call is far less compared with the scheme that utilizes a centralized log. Second, NOVA only logs metadata updates without data (which itself is GC’ed at the end of every write operation). The number of bytes in the log is therefore far less than a centralized log that contains both data and metadata. The last reason is that NOVA does not maintain the log as a single sequential unit on the NVM. Instead, each inode has its own log which is then organized into a linked list. As a result, GC is performed on a log block level. NOVA can continue its operation as long as a block can be reclaimed. These three combined together make GC in NOVA lightweight.