HTMFS: Strong Consistency Comes for Free with Hardware Transactional Memory in Persistent Memory File Systems
Link: https://www.usenix.org/system/files/fast22-yi.pdf
Year: FAST 2022
Keyword: NVM; File System; HTM
Highlights:
- File systems designed for DAX NVM can adopt HTM to guarantee both atomicity and durability without using journaling or shadow paging.
- We can use sequence counters and OCC-style validation to protect metadata accesses. In this protocol, an operation is implemented as read-modify-write: the read stage only reads shared data and their sequence counters, and the write stage is performed atomically within an RTM transaction that also validates the sequence counters.
- The page allocator is additionally protected with logical logging so that speculatively allocated and deallocated pages can be handled properly during crash recovery, minimizing security risks.
Comments:
- The paper seems to mix two different types of consistency: multi-threaded synchronization, i.e., consistency at the memory-ordering level, and durability, i.e., consistency for crash recovery. While Intel RTM provides both, these two requirements have different considerations; for example, durability problems can occur even with a single thread. The paper starts with consistency for crash recovery but focuses more on multi-threaded synchronization throughout the body.
- The paper mentions in Section 4.3 that HTMFS falls back to using locks if RTM cannot commit after several retries. In this scenario, how is durability enforced? Do you also fall back to journaling?
This paper presents HTMFS, a file system that leverages existing Hardware Transactional Memory (HTM) support to provide strong consistency guarantees while incurring low overhead. HTMFS achieves its design goals by wrapping multi-store operations within hardware transactions, whose atomicity is guaranteed by the hardware. As a result, HTMFS, compared with similar designs that use journaling or shadow paging, achieves similar or better performance without weakening the consistency guarantees.
The paper is motivated by the high overhead of maintaining consistency in conventional file systems. It notes that conventional file systems use either journaling or shadow paging to maintain crash consistency, and both approaches incur significant overhead. With journaling, the same data is written twice, incurring 2x write amplification. With shadow paging, even a small write to a data page can result in numerous writes to copy the page and update page pointers, causing an avalanche effect, as the authors observed with NOVA. Consequently, conventional file systems often choose to trade consistency guarantees for better performance.
With the introduction of Non-Volatile Memory, file systems can operate directly on the address space, thanks to DAX, which maps NVM storage into the virtual address space for random access. As noted by the paper, this paradigm shift simplifies the consistency challenge somewhat, since processors can now perform atomic updates on a single cache block within the cache hierarchy. Multi-update atomicity, however, is still not guaranteed: the cache hierarchy may write back a dirty block halfway through an update, leaving the NVM image in an inconsistent state after a system crash.
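To make the multi-update atomicity hazard concrete, here is a minimal sketch (not from the paper; the inode layout is invented for illustration) of an append to a DAX-mapped structure. The cache hierarchy may evict either dirty line at any time, so a crash between the two stores leaves the NVM image inconsistent even with a single thread.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical DAX-mapped inode layout, for illustration only. */
struct inode {
    uint64_t size;        /* logical file size            */
    char     data[4096];  /* inlined data for a tiny file */
};

/* Appending requires two dependent updates. Without HTM, journaling, or
 * shadow paging, the cache hierarchy may write back either dirty line
 * first; a crash in between exposes a size without data, or vice versa. */
static void append_unsafe(struct inode *ino, const char *buf, uint64_t len)
{
    memcpy(ino->data + ino->size, buf, len);  /* update 1: data */
    ino->size += len;                         /* update 2: size */
}
```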
To address this problem, the paper observes that HTM is a natural fit, since HTM guarantees in hardware the atomicity of a code block (called a transaction) with respect to concurrently executing threads. On newer hardware, HTM transactions are atomic with regard to failures as well: committed transactions are guaranteed to be persisted to NVM even in the event of a power failure (as is all dirty data in the cache hierarchy), while uncommitted transactions are simply rolled back by the failure, and none of their modified data is visible after the crash.
Such a hardware feature could greatly simplify NVM file system development: developers would simply wrap each file system operation in a single transaction and let the hardware serialize concurrent accesses to the same data and handle power failures. This ideal model, however, is unrealistic on today's commercial HTM implementation, Intel RTM, because RTM only supports limited read and write set sizes. As a result, an RTM transaction may never be able to commit if its working set exceeds the hardware capacity, or because of its data access pattern.
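The ideal model might look like the following sketch using Intel RTM intrinsics (the operation body is a placeholder; this is not HTMFS code). If the operation's working set exceeds the HTM capacity, or its access pattern always conflicts, the retry loop never terminates, which is why HTMFS cannot rely on naive retrying.

```c
#include <immintrin.h>   /* _xbegin/_xend; compile with -mrtm */

/* Placeholder for an arbitrary file system operation. */
void do_fs_operation(void *arg);

/* Naive "wrap everything" model: retry the RTM transaction forever.
 * If the operation's working set exceeds the HTM capacity (or its
 * access pattern always conflicts), this loop never terminates. */
void fs_op_naive(void *arg)
{
    for (;;) {
        if (_xbegin() == _XBEGIN_STARTED) {
            do_fs_operation(arg);   /* all loads/stores are speculative */
            _xend();                /* commit atomically, or...         */
            return;
        }
        /* ...the whole transaction aborted: nothing was made visible. */
    }
}
```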
The main contribution of this paper is a file system design that uses RTM for its consistency benefits while still guaranteeing forward progress. At a high level, every file system operation is implemented as a three-stage process. In the first stage, the metadata required by the operation is read without any writes to shared data (writes to private buffers that are only accessible to the thread itself and will not be persisted after a power failure are fine). In HTMFS, every piece of metadata is accompanied by a sequence counter that stores the version of the metadata. The granularity at which versions are tracked, though, is not specified (and it can vary based on the logical meaning of the metadata). In this stage, after a piece of metadata is read, the corresponding sequence counter is also read, and its value is saved into the local buffer of the accessing thread. In the second stage, the operation performs local computation to derive the new values that should be written to shared metadata. In the last stage, an RTM transaction is started, and all writes are performed as an atomic unit within it. To verify that the metadata read during the first stage remains unchanged, the operation first validates the sequence counters stored in its local buffer against their current values, thereby adding them to the transaction's read set. If any counter mismatches, indicating that some writer must have already updated the metadata, the current operation is abandoned and restarted. Otherwise, the last stage performs the writes with the values computed in the previous stage and then increments the sequence counters of every piece of metadata it has modified. Any attempt by another thread to write the sequence counters after they have been validated will also cause the current RTM transaction to abort, and the operation is restarted as if validation had failed. The operation logically commits when the RTM transaction of the last stage commits, at which point the metadata updates are guaranteed to be atomic and persistent.
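Below is a rough sketch of this three-stage protocol as I understand it from the paper's description; the structure layout, field names, and the two-object example are my own assumptions rather than HTMFS's actual code.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdbool.h>

/* Each piece of shared metadata carries a sequence counter (version). */
struct versioned_meta {
    uint64_t seq;     /* incremented by every committed writer */
    /* ... metadata fields ... */
};

/* Hypothetical operation that reads metadata a and b, computes new
 * values locally, and atomically updates b while validating both. */
bool fs_op_occ(struct versioned_meta *a, struct versioned_meta *b)
{
    for (;;) {
        /* Stage 1: read shared metadata and remember the observed
         * versions in thread-private variables (never shared). */
        uint64_t a_seq = a->seq;
        uint64_t b_seq = b->seq;
        /* ... read the fields of a and b into local variables ... */

        /* Stage 2: local computation only; no writes to shared state. */
        /* ... derive the new values for b ... */

        /* Stage 3: a short RTM transaction validates and writes. */
        if (_xbegin() == _XBEGIN_STARTED) {
            /* Validation puts the counters in the RTM read set, so a
             * concurrent writer that bumps them aborts this transaction. */
            if (a->seq != a_seq || b->seq != b_seq)
                _xabort(0xff);       /* stale read: restart the operation */

            /* ... apply the pre-computed updates to b ... */
            b->seq++;                /* publish the new version */
            _xend();                 /* commit: atomic, durable on eADR HW */
            return true;
        }
        /* RTM abort (conflict, capacity, or failed validation): retry. */
    }
}
```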
Data operations, however, cannot be handled like metadata updates, because data operations can be unbounded in size. To deal with this challenge, the paper proposes several techniques. First, for read operations, instead of wrapping the entire operation in an RTM transaction, which is unlikely to succeed if the read set is large, HTMFS simply records the page pointers and the sequence counters for those pointers, and validates them later during the last stage as if they were metadata accesses. For small writes, HTMFS simply wraps them in RTM transactions, as small writes are unlikely to cause transaction aborts. Lastly, for large writes, the paper proposes copy-on-write: the existing content of a page is copied to a private page, to which the intended modifications are applied. In the last stage of the write operation, the page pointers are updated to point to the new page, and the sequence counter is incremented to notify concurrent read operations that the content of the page has changed.
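The large-write path might look roughly like the following sketch; the page-slot layout and the allocator calls are hypothetical placeholders, not HTMFS's actual interfaces. The expensive copy happens outside any transaction on a private page, and only the pointer swing plus counter bump run inside a tiny RTM transaction.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Hypothetical per-page mapping entry: a pointer plus a version counter
 * that concurrent readers validate instead of re-reading the page. */
struct page_slot {
    char    *page;
    uint64_t seq;
};

char *alloc_page(void);          /* placeholder page allocator   */
void  free_page(char *p);        /* placeholder page deallocator */

/* Copy-on-write path for a large write within a single page. */
bool write_page_cow(struct page_slot *slot, size_t off,
                    const char *buf, size_t len)
{
    if (off + len > PAGE_SIZE)
        return false;                      /* single-page sketch only */

    /* Heavy lifting outside the transaction, on a private page. */
    char *newp = alloc_page();
    if (!newp)
        return false;
    memcpy(newp, slot->page, PAGE_SIZE);   /* copy old content...       */
    memcpy(newp + off, buf, len);          /* ...and apply the new data */

    /* Tiny RTM transaction: swing the pointer and bump the counter so
     * readers that validated the old counter will restart. */
    for (;;) {
        if (_xbegin() == _XBEGIN_STARTED) {
            char *oldp = slot->page;
            slot->page = newp;
            slot->seq++;
            _xend();
            free_page(oldp);               /* old page is unreachable now */
            return true;
        }
        /* aborted: the pointer-swing transaction is tiny, just retry */
    }
}
```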
One particularly critical component of HTMFS is the page allocator, because the internal state of the allocator may end up being rolled back after a crash, which can potentially leak previously uncommitted writes. This situation arises when a large write operation allocates a free page from the allocator, performs writes to the page, but does not commit the operation before the system crashes. In this case, even though the allocator state may remain intact thanks to the metadata update protocol, the after-crash image of the page on NVM may contain uncommitted writes, which poses a potential security risk. To eliminate this risk, the paper proposes that the allocator adopt logical logging to track pages that have been allocated but not yet committed (the "temporary list" in the paper). On crash recovery, the allocator log is replayed, and pages containing uncommitted writes are properly recycled. Page deallocation faces a similar issue, and the same solution applies.
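A rough sketch of the temporary-list idea follows; the log layout and the recovery routine are my guesses at what the paper's logical log might record (including scrubbing the page contents, which is one way to ensure no uncommitted data leaks).

```c
#include <stdint.h>

#define TMP_LIST_CAP 64

/* Persistent, per-operation logical log: pages that have been
 * speculatively allocated but whose operation has not yet committed.
 * The layout here is invented for illustration. */
struct tmp_list {
    uint64_t op_committed;            /* set and persisted at commit   */
    uint32_t nalloc;
    uint64_t alloc_pfn[TMP_LIST_CAP]; /* speculatively allocated pages */
};

void free_page_pfn(uint64_t pfn);     /* placeholder: return page to pool */
void scrub_page_pfn(uint64_t pfn);    /* placeholder: zero page contents  */

/* Crash recovery: replay the logical log. Pages allocated by an
 * operation that never committed may hold uncommitted data, so they
 * are scrubbed and returned to the free pool. */
void recover_tmp_list(struct tmp_list *log)
{
    if (log->op_committed)
        return;                       /* operation committed: nothing to undo */
    for (uint32_t i = 0; i < log->nalloc; i++) {
        scrub_page_pfn(log->alloc_pfn[i]);
        free_page_pfn(log->alloc_pfn[i]);
    }
    log->nalloc = 0;
}
```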
The paper also demonstrates how certain complicated operations are implemented, including path walk, directory updates, timestamp updates, and rename. These operations are specifically optimized with ad hoc protocols such that consistency is always guaranteed.
The paper lastly discusses how to minimize HTM aborts, giving four empirically discovered rules. First, timestamp updates should be placed at the end of the last stage, because this yields the fewest aborts in experiments. Second, the paper observes that a major source of aborts is page faults; to address this, on an abort the operation is first dry-run to pre-fault the addresses in its working set before the RTM transaction is restarted. Third, another source of transaction aborts is the REP-prefixed MOV instructions used in common memcpy implementations; the paper therefore proposes replacing these memcpy functions with ones implemented using SSE2 MOVs, which incur far fewer aborts. Lastly, if a transaction has been restarted too many times, HTMFS deems it unlikely to ever commit, and the operation is executed non-speculatively under a global lock, similar to lock elision.
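The retry-and-fallback policy might resemble the sketch below; the retry threshold, the pre-fault helper, and the simplified spinlock standing in for the global lock are all assumptions, not HTMFS's actual implementation.

```c
#include <immintrin.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_RETRIES 8                  /* hypothetical threshold */

static atomic_bool global_lock;        /* simplified global fallback lock */

void prefault_working_set(void *arg);  /* placeholder: dry-run / pre-fault  */
void do_fs_operation(void *arg);       /* placeholder: the actual operation */

void fs_op_with_fallback(void *arg)
{
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        if (_xbegin() == _XBEGIN_STARTED) {
            /* Subscribe to the fallback lock: abort if some thread holds
             * it, so speculation never races with the lock holder. */
            if (atomic_load_explicit(&global_lock, memory_order_relaxed))
                _xabort(0x01);
            do_fs_operation(arg);
            _xend();
            return;
        }
        /* Page faults cannot be serviced inside an RTM transaction;
         * dry-run to pre-fault the working set, then retry speculatively. */
        prefault_working_set(arg);
    }

    /* Deemed unlikely to ever commit: run non-speculatively under the lock. */
    while (atomic_exchange_explicit(&global_lock, true, memory_order_acquire))
        ;                              /* spin until the lock is acquired */
    do_fs_operation(arg);
    atomic_store_explicit(&global_lock, false, memory_order_release);
}
```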