DeltaFS: Exascale File Systems Scale Better Without Dedicated Servers
Link: https://dl.acm.org/doi/10.1145/2834976.2834984
Year: PDSW 2015
Keyword: File System; DeltaFS
Highlight:
- Per-thread, log-structured metadata for high scalability on PB-class storage.
- Conventional distributed file systems use centralized metadata servers, which require heavy coordination between nodes for consistent updates. DeltaFS instead lets each thread on each node generate its own delta records (metadata changes) and store them locally as first-class objects that other threads can access. This reduces metadata traffic and consistency overhead, since metadata requests are spread across all the nodes that may hold delta records.
- Delta objects are represented using LevelDB's SSTables (the on-disk building block of LSM-trees). These objects are published to a registry service so that other threads can locate the most up-to-date metadata when performing file accesses.
Questions
- How does the delta registry (Section 2.2) map files to their corresponding deltas? Is the mapping per-file? If so, the registry itself would be another metadata repository updated at each application commit point, and I do not see why it would not become a scalability bottleneck if a normal metadata registry would. On the other hand, Section 3.1 suggests that the registry does not perform per-file mapping, because otherwise there would be no read amplification. It appears to be a per-process mapping, but in that case, how exactly does the registry work?
This workshop paper introduces DeltaFS, a distributed file system designed for extremely high scalability without dedicated metadata servers. The paper begins by observing that existing distributed file systems for HPC workloads often scale poorly because of dedicated metadata servers, which must be contacted on file system operations. Keeping the global metadata store consistent also requires extra protocol overhead, which is usually unnecessary: most workloads running on HPC clusters either do not communicate with each other at all, using the file system only as a large, persistent data store, or communicate with other processes through only a small portion of the file system. In either case, maintaining a globally consistent image of file system metadata and sharing it among all processes is an unnecessarily strong semantic that is rarely needed.
The paper uses two examples. The first, Lustre, relies on a centralized, dedicated metadata server node to provide registry information. All file accesses must contact this single metadata server, which can easily become a performance bottleneck. The second, IndexFS, partitions metadata and distributes it across several independent metadata servers, improving scalability. The metadata retrieval and update costs, however, are still paid on a per-access basis, incurring extra bandwidth and protocol overhead on these servers.
DeltaFS addresses this challenge by not maintaining a globally consistent metadata registry and by not contacting a remote metadata server on most file accesses. Instead of updating global metadata as file operations are performed, DeltaFS records local changes to the metadata as local metadata objects. Depending on the file sharing mode, this local metadata object is either made visible to other processes only at the application's commit point, or shared among a small set of processes through application-local metadata servers. In other words, DeltaFS reduces communication between the application and the metadata registry by writing locally and deferring registry updates until the application terminates. Compared with designs such as BatchFS, where metadata updates are asynchronously pushed to the central registry in batches after the application terminates, DeltaFS saves even more bandwidth by keeping metadata updates as local “deltas” and letting other applications fetch metadata directly from these local deltas, rather than forcing all file operations to serialize on the central registry, further reducing contention and improving performance.
The paper assumes the following architecture. The HPC cluster consists of compute nodes and storage nodes. Compute nodes can perform certain file operations locally using the native file system. Storage nodes run the DeltaFS server (the registry server), and their native file system is used only as a key-value object store whose objects can be data segments or file metadata. A central file registry maintains a global image of file metadata. As discussed above, the central registry does not necessarily reflect the most up-to-date state of the file system, since local deltas may exist that override the global registry’s entries. An extra “delta registry” tracks the distributed deltas and their contents; it is updated to point to a local delta object when the corresponding file system object is first requested from the central file registry. The delta registry can be implemented using any distributed key-value store. Applications on the HPC cluster execute on multiple nodes, which can span a significant part of the cluster. Each node may access its own set of files, and some nodes may share file accesses.
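To make the two registries concrete, here is a minimal sketch of how they might be modeled as key-value maps. The data structures and names (`DeltaLocation`, `publish_delta`, `candidate_sources`) are my own illustration, assuming one delta-registry entry per published delta object; the paper does not pin down this granularity (see the question above).

```python
from dataclasses import dataclass, field

@dataclass
class DeltaLocation:
    """Where a delta (SSTable) object lives; field names are illustrative."""
    node: str       # compute/storage node holding the SSTable object
    object_id: str  # key of the SSTable in the underlying object store

@dataclass
class ClusterRegistries:
    # Central file registry: a (possibly stale) global metadata image,
    # e.g. path -> inode-like attribute record.
    central: dict = field(default_factory=dict)
    # Delta registry: tracks published delta objects that may override
    # entries in the central registry. Assumed here: one entry per
    # published delta object, keyed by a delta id.
    deltas: dict = field(default_factory=dict)  # delta_id -> DeltaLocation

    def publish_delta(self, delta_id: str, loc: DeltaLocation) -> None:
        """Called at an application's commit point; only the pointer is
        recorded, the SSTable itself stays where it was written."""
        self.deltas[delta_id] = loc

    def candidate_sources(self, path: str):
        """A lookup may have to consider the central entry plus every
        published delta (whose SSTable must then be probed for the key)."""
        yield ("central", self.central.get(path))
        for delta_id, loc in self.deltas.items():
            yield (delta_id, loc)
```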
We next present the details. The DeltaFS client code is implemented as a linkable library that overrides standard file system calls. On opening a file, if the metadata is not yet on the local node, it is fetched from the central registry, from the delta registry, or from a given delta object location, potentially on a different node (the last is configured by command-line arguments, while the first two are the default behavior). When conflicts occur, i.e., when the requested file object is mapped by more than one entry in the registries or in the delta objects, the application applies its own conflict resolution policy, such as “first writer wins”, without having to synchronize with other processes or the central registry. When file metadata is updated, the DeltaFS client code redirects the changes to a local delta object that stores uncommitted metadata updates. The local delta object, as suggested by the paper, is implemented with LevelDB’s Sorted String Table (SSTable). Each SSTable instance stores the local metadata changes of a node, and multiple SSTables may have conflicting keys. When the application closes, all delta updates are committed by publishing the existence of the delta table object to the delta registry. Note that during this process the delta object itself is not transferred to a remote node; it remains local as part of the file system’s committed metadata image. Nor does the registry perform any conflict resolution, since applications choose which version to access when conflicts occur, as discussed above.
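The following sketch summarizes this client-side flow under simplifying assumptions: plain Python dicts stand in for the central registry, the delta registry, and the SSTable-backed local delta, and the method names (`open`, `update`, `commit`) are hypothetical rather than the paper's API.

```python
class DeltaFSClient:
    """Illustrative client-side flow; not the paper's actual interface."""

    def __init__(self, central, delta_registry, resolve=None):
        self.central = central                # global, possibly stale metadata image
        self.delta_registry = delta_registry  # published (committed) delta entries
        self.local_delta = {}                 # uncommitted local metadata updates
        # The application supplies its own conflict-resolution policy,
        # e.g. "first writer wins".
        self.resolve = resolve or (lambda versions: versions[0])

    def open(self, path, extra_deltas=()):
        # Uncommitted local updates are used directly.
        if path in self.local_delta:
            return self.local_delta[path]
        # Otherwise gather every visible version: delta objects named
        # explicitly (e.g. on the command line), published deltas, and
        # finally the central registry.
        sources = (*extra_deltas, self.delta_registry, self.central)
        versions = [src[path] for src in sources if path in src]
        if not versions:
            raise FileNotFoundError(path)
        # Conflicts are resolved locally, with no global synchronization.
        return self.resolve(versions)

    def update(self, path, metadata):
        # Metadata writes never contact a remote server; they are buffered
        # in the local delta object (an SSTable in the real system).
        self.local_delta[path] = metadata

    def commit(self):
        # At the commit point only the delta's existence is published. The
        # real system registers a pointer to the SSTable; merging the
        # entries into the registry dict here is a simplification of that step.
        self.delta_registry.update(self.local_delta)
        self.local_delta = {}
```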
The authors also note that DeltaFS is not suitable for scenarios where applications interact frequently through the file system, since DeltaFS essentially behaves like a local BatchFS, committing its metadata updates to the registry only at certain points. These metadata updates are invisible to other nodes before the commit point unless sharing is explicitly set up.
The paper proposes two common scenarios for using DeltaFS. In the first scenario, each node of the application accesses a distinct set of files. Since keys in the SSTable instances do not conflict, no inter-node file sharing support is needed, and each node can run at native speed, generating its own delta records. After the commit point, a background compaction process is invoked to merge and sort the entries and re-partition them into a set of new tables, which remain distributed over the compute nodes. Compaction is critical for reducing the number of SSTable reads needed to find an entry, since the SSTables generated by the application are not globally sorted: each is sorted internally, but their key ranges overlap.
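A minimal sketch of such a merge-and-repartition compaction, assuming each SSTable is represented as a sorted list of (key, value) pairs; the partition sizing and the duplicate-handling policy are my own illustrative choices, not the paper's.

```python
import heapq

def compact(sstables, num_partitions):
    """Merge several SSTables (each individually sorted, but with
    overlapping key ranges) into one sorted stream, then re-partition it
    into non-overlapping tables that can again be spread over nodes."""
    merged = list(heapq.merge(*sstables, key=lambda kv: kv[0]))
    # Keep only one version per key (here simply the first encountered in
    # merge order; the real resolution policy is application-chosen).
    deduped = []
    for key, value in merged:
        if not deduped or deduped[-1][0] != key:
            deduped.append((key, value))
    # Split the sorted run into roughly equal, non-overlapping partitions.
    size = max(1, -(-len(deduped) // num_partitions))  # ceiling division
    return [deduped[i:i + size] for i in range(0, len(deduped), size)]
```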
In the second scenario, threads within the application share access to certain files. Metadata updates made on one node must then be propagated to other nodes before the commit point, so that a consistent metadata image is maintained at least within the application’s lifespan. In this case, delta entries are always partitioned by key value: when a delta entry is generated, it is dispatched to the corresponding node. When a file request is issued, the DeltaFS client code first queries the corresponding node for a matching key; if the key is found, the uncommitted entry is used without contacting any global metadata service.
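A sketch of this key-partitioned dispatch is below. The paper only states that delta entries are partitioned by key value; the hash-partitioning scheme and the `send`/`ask` RPC hooks are my own assumptions.

```python
import hashlib

def owner_node(key: str, nodes: list) -> str:
    """Deterministically map a metadata key to the node responsible for it.
    Hash partitioning is one plausible scheme; any key-based partitioning works."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

class SharedNamespace:
    """Sketch of the shared-access mode: every update is dispatched to the
    key's owner node so all processes in the application can see it before
    the commit point. `send` and `ask` are hypothetical RPC hooks."""

    def __init__(self, nodes, send, ask):
        self.nodes = nodes
        self.send = send  # send(node, key, value): push an uncommitted entry
        self.ask = ask    # ask(node, key) -> value or None

    def put(self, key, value):
        self.send(owner_node(key, self.nodes), key, value)

    def get(self, key):
        # Query the key's owner node; if it holds an uncommitted entry,
        # no global metadata service needs to be contacted.
        return self.ask(owner_node(key, self.nodes), key)
```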