Romulus: Efficient Algorithms for Persistent Transactional Memory
- Hide Paper Summary
Link: https://dl.acm.org/doi/10.1145/3210377.3210392
Year: SPAA 2018
Keyword: NVM; Transaction; Romulus
This paper presents Romulus, an NVM library that implements persistent critical regions and transactions. Traditional logging methods, such as undo or redo logging, either use too many persistent barriers to ensure writr ordering, or requires read instrumentation to redirect read instructions to the redo log. Romulus solves both problems using shadow paging in which each data item has two copies, one in the master copy (called main copy), the other in the backup copy (called back copy). Romulus maintains the invariant that at any given moment in time, either the main copy is being mutated by an ongoing transaction and the back copy represents the last consistent state, or the main copy represents the most up-to-date consistent state, and the back copy is being updated to match the new copy. This way, when the system crashes, we always have a consistent copy of data to start the application with.
Romulus assumes that the persistent storage is exposed to the user address space via direct access (DAX) files. Physical addresses are assumed to be mapped to the same virtual address on each restart, such that no extra pointer operation is required to maintain the same pointer semantics. The DAX file is divided into three parts. The first part is the area header, which stores metadata on the DAX mapped file. This part is common in many NVM based designs, which stores the name of the mapping, the metadata, and statistics information. The most important thing in this part is a status variable describing the current progress of transactions and the after-commit copy process. The state variable can take three values: IDLE, meaning no transaction is running on the file, and the main copy and the back copy are both consistent; MUT, meaning a transaction is modifying the content of the file, and that the main copy can be in an inconsistent state; COPY, meaning that the transaction has finished execution, but we are still copying the main copy to the back copy. The transaction is considered as already committed in COPY state, because on a crash recovery, we just need to redo this copy process, and then change the state variable to IDLE. The second and third part of the DAX mapped file are two copies of working data. As discussed above, at any point during execution, at least one of these two copies are consistent with the state of most recently committed transaction. Modifications are always made to the main copy by user applications. The back copy is only written by library routines for synchronization after a transaction commit. Reading applications may or may not read into the back copy depending on the Romulus algorithm, as we will see below.
Memory allocation is also implemented within the persistent region. Allocator metadata and object list are both stored at the beginning of the last two areas, such that the consistency of the allocator also depends on the consistency of the two areas. Note that allocators must use relative offsets, because otherwise when the main image is copied to the backup, these pointers will still point to objects in the main copy, rather than the corresponding object in the backup. By implementing the allocator as part of the duplicated state, Romulus allows any allocator to be used without worrying about inconsistencies in the allocator after recovery and the resulting memory leak or double allocation.
Romulus is implemented with C++ programming language as a library. Persistent objects are wrapped with templates using the volatile type as argument. All mutation operators of the object is overloaded with a method that flushes the object into persistent storage before the mutation method returns. Object flush is implemented using clwb instruction on x86 platform. Besides object flush, Romulus also implements sync operation using sfence. We will see later that at most four sfences are issued for one Romulus transaction.
Transactions are marked with begin_transaction and end_transaction. Nested transactions are flattened, such that only the outermost transaction_end will commit the transaction. On transaction begin, we change the status variable from IDLE to MUT, indicating that the main copy is no longer consistent. On every mutation operation during the transaction, we flush the affected object back to the NVM before the method returns. These flushes are not ordered with each other to increase inter-bank parallelism. When the transaction commits, we first execute a barrier to stall the processor until all previous modifications have reached the NVM. Then we change the state of the status variable to COPY to indicate that the main copy is already consistent. A second barrier is issued to ensure that the status change propagates to the NVM. In the next step, we copy the main image to the back image byte-wise, and issue the third barrier after all copies are done. In the last step, we change the status variable to IDLE to indicate that both images are now consistent. A fourth flush is issued to ensure that the variable propagates to the NVM before we return and commits the transaction.
On recovery, we check the value of the status variable, and act accordingly. If the value is IDLE, then no recovery is needed, since both copies are consistent with the most up-to-date state. If the value is MUT, meaning that the main copy is being modified, then we copy the back image over to overwrite the main image, rolling back partial updates made by the incomplete transaction. If the value is COPY, meaning that the system crashes while we copy the value to back image, then the recovery routine copies the main image to back image again to complete the commit process. Note that in the last two cases, since copying one image to the other is an idempotent operation, if the system crashes again during recovery, no special handling is required, and the recovery process can simply restart.
The paper proposes one optimization for reducing the amount of data copied when a transaction commits. The transaction commit process requires copying the entire main image to the back image, even if only a small subset of the main image is updated by the transaction. The optimization maintains a log of addresses within DRAM. When a mutating method is called, the start address and the length of affected area is also written into the DRAM log. When the transaction commits, the library call only copies addresses that are present in the log, reducing the amount of data that is copied. Note that the paper claims that this logging optimization, unlike redo or undo logging, does not introduce extra write amplification, since application written data is always duplicated one more times, without writing them into the log. The system maintains a constant write amplification factor of two.
The paper also proposes three ways of supporting concurrent transactions with proper isolation. In the simplest manner, the application can just use a single global lock for serializing concurrent reads and writes, since only one copy of active data is maintained. In the second way, we use flat combining to aggregate update threads’ requests into a local buffer, which will then be taken and fulfilled by a dedicated thread. Only the dedicated thread can access the main copy, and it processes each request from the buffer on behalf of other threads. Requestor threads need to wait for the return value to appear in the buffer before they can return from the function call to maintain linearizability. In the last method, only one update thread is allowed to execute transaction on the main copy and to copy the main image to the back image. Reader threads either access the main image or the back image based on the current state variable. The paper uses an “offset” variable for determining the current read location. The offset variable is added to read addresses within the transaction area to redirect reads to the corresponding copy. When the state variable is IDLE and COPY, we read from the main image since it is consistent. When the variable is MUT, we read from the back image, since the main image is under modification.