FlatFS: Flatten Hierarchical File System Namespace on Non-volatile Memories
- Hide Paper Summary
Link: https://www.usenix.org/conference/atc22/presentation/cai
Year: USENIX ATC 2022
Keyword: NVM; FlatFS; File System
Highlight:
-
Path walking and name resolution are high-overhead operations for NVM file systems. To address this problem, we can implement the file system as a flat namespace and use an index to track all files and directories. This approach can be also considered as a generalization of having an infinitely sized dcache.
-
B+Tree can be split into two parts using a split key. This operation is done by traversing down the tree and splitting every node on the path using the key. Two subtrees can also be joined together if one of them only has keys that are smaller than the keys in the other. This process can be done by linking the shorter subtree into the taller one or directly merging the root of the shorter tree into the taller tree’s rightmost inner node at the same depth.
-
String keys that represent full paths stored in a B+Tree leaf node can be compressed using the common prefix. When an existing key is modified or a new key is to be inserted, the prefix may need to be adjusted by expanding the suffix.
Comments:
-
The paper indicates that in the experiment, resolving 50 components takes 14x more time than resolving a single component. Why is this non-scalable? It seems perfectly scalable to me. The real problem perhaps is that it is not constant time.
-
How do threads synchronize when they (1) Scan the leaf node? (2) Perform split and join?
This paper presents FlatFS, a file system designed for Byte-Addressable Non-Volatile Memory (NVM) that adopts a flat namespace to replace the hierarchical directory tree. The paper is motivated by the high overhead of component resolution and file tree walking in the existing hierarchical file systems when they are deployed on fast NVM as external storage. The paper addresses the issue by replacing the hierarchical organization of the file system with a flat namespace organization while preserving the conventional hierarchical view. Compared with conventional file systems, FlatFS demonstrates better performance and scalability when the operations are metadata-intensive.
In conventional file systems, files and directories are organized as a directory tree (if we ignore hard links), where the intermediate nodes are directories that can contain files and other directories, while the leaf nodes are regular files or soft links. However, this tree model comes at a cost as a file path needs to be internally resolved into an inode number by the file system before it can be used by other file system calls. On conventional systems using HDDs or SSDs, the cost of path resolution is relatively cheap due to the high overhead of performing I/Os. However, with the introduction of NVM, whose read and write latencies are comparable to those of DRAM and are much faster than even the fastest SSDs, I/O has largely ceased to be the slowest component on the critical path, whose latency now is even comparable to those of the software. As a result, the existing hierarchical file model can eventually become the bottleneck of file operations.
To confirm the above claim, the authors of the paper conducted two sets of experiments. First, the authors measured
the overhead of resolving a component name to the corresponding dentry object. Results show that even with a warm
dcache (a hash table that stores recently resolved paths and maps them to dentry objects) the overhead is
non-negligible, constituting around 30% of the total execution time. The overhead is almost doubled if the dcache
is cold, i.e., the entry is not in the cache and hence must be read from the underlying storage.
Besides, the authors also conducted experiments with a varying number of components on the path. Results show that
component resolution time scales with the number of components, but somehow it is sub-linear and no explanation is
given in the paper.
The second experiment measures the overall cost of path walking and name resolution during metadata operations.
Results show that when executing ls -R
command to list all directories and files recursively in a directory tree of
depth 11, the overhead of directory tree walking can constitute up to 80% of total execution time, again demonstrating
the high overhead of directory tree walking.
FlatFS addresses the above issues by organizing the entire file system as a flat namespace. In FlatFS, files and directories are still uniquely identified by the full absolute path as in a regular file system, but instead of resolving each component of the full path as a separate component by searching the component name in a directory structure, FlatFS directly maps the full path to the file or directory’s inode number using a B+Tree index structure, hence achieving faster lookups while avoiding the time-costing level-by-level component name resolution. Furthermore, since B+Tree traversals mostly involve sequential memory accesses within nodes with only indirections between nodes, the access pattern of the index search is also more suitable for NVM.
We next present FlatFS’s index structure, the Range-Optimized Index Tree or Br-Tree. The Br-Tree is essentially a B+Tree indexed by path strings, with the node size being 256 bytes which aligns with the size of NVM’s internal read and write buffers. As with all B+Trees, keys stored in leaf nodes are sorted in lexicographical order, meaning that files or directories that share the same common prefix (i.e., they are under the same directory) will be stored together. To reduce storage overhead, all keys stored in the same node are compressed by the common prefix. As a result, a key in a node is stored as a triple: the shred prefix, the suffix, and the size of the suffix. Concurrent tree operations are synchronized between different threads using top-down lock coupling. The integrity of the tree is also protected against system failures using undo logging.
Metadata operations on FlatFS start by querying the index. Regular file lookups use the full path and simply perform
point queries. Relative paths should be converted to full paths using the current working directory environmental
variable before being used. Directory lists use the full path of the directory. To delimit a directory, FlatFS
maintains two hidden entries in the index for every directory, with one denoting the beginning of the
directory content, whose
index key is the full path of the directory appended with \x01
, and the other denoting the end of the directory
content, whose index key is the full path of the directory appended with \xFE
. Queries using the directory’s full
path will hence always hit the beginning of the directory. The content of the directory can then be listed by scanning
on the leaf level until the end of the entry is seen.
Other range operations on the index, such as directory deletion and directory move, can be implemented using two index transformation operations. The first is tree split, which, given a split key, partitions the index tree into two parts, such that each part only contains keys that are smaller than or larger than the split key. The split operation is implemented as walking down the tree from the root level and splitting every node on the walking path using the split key, hence producing two equally height trees. The second operation is tree join, which is the exact opposite of tree split. Given two trees where the keys in the left tree are smaller than all keys in the right tree, the operation produces one single tree that contains all keys in both smaller trees. The tree join operation is implemented by simply merging the root of the “shorter” tree (i.e., the one with a smaller number of levels) into the rightmost inner node of the “taller” tree. If merging is impossible due to tree node size limitation, the two trees are joined by recursively inserting a new key into the parent node of the taller tree’s inner node and then directly linking the shorter tree into the parent node as a child. This process can propagate till the root of the taller tree, which may even cause a new root to be created if the current root node is full.
With the two new tree operations, directory remove can be implemented as two tree splits using the delimiter keys of the directory as the split keys, discarding the middle subtree (i.e., the one that contains the directory to be removed and all the content), and then joining the two remaining subtrees together. Directory move can be implemented as first removing the directory from the index tree and then splitting the tree using either of the two delimiter keys of the directory and then joining the tree subtrees together, with the one containing the directory content in the middle. Note that if the move target is already a directory, then a merge needs to be performed instead of an insert. The paper does not cover this case, though.
When a leaf node entry is updated, e.g., by a rename operation, FlatFS checks whether the key can still be compressed by the common prefix of the leaf node. If the check fails, then FlatFS will decompress all keys and find a new common prefix. The decompression process copies characters in the prefix to the suffix and increases the size of the suffix. To avoid frequent decompression, the paper proposes Write-Optimized Compressed Key (WoC Key) that always “prefetch” more characters to the suffix when decompression happens. This technique reduces write operations that occur during frequent decompression because it amortizes the cost of expanding the suffix over fewer large writes.