SoftWrAP: A Lightweight Framework for Transactional Support of Storage Class Memory
Link: MSST 2015
Year: 2015
Keyword: NVM; SoftWrAP; Redo Logging
Highlight:
-
Despite what the paper claims, this design is essentially a trade-off between increased total bus bandwidth and decreased NVM bandwidth. By allocating a buffer in DRAM, we add DRAM write traffic on top of the doubled NVM write traffic that is intrinsic to redo logging. By not traversing the NVM log on either reads or log checkpointing, we shift read bandwidth from the NVM to DRAM. The biggest merit of this design is therefore to redistribute the bandwidth requirement from NVM to DRAM, which makes sense, since DRAM usually has higher bandwidth than NVM.
-
Redo log entries conventionally serve two purposes: (1) a reference for dirty data that has not yet been checkpointed to the NVM image; (2) a source of committed data for recovery. This paper decouples the two uses by employing redo log entries only for recovery, and keeping a separate shadow copy of dirty data for access. The decoupling enables more flexible redo logging, such as logical logging (recording only operations that can be replayed deterministically) or variable-sized log entries.
This paper introduces SoftWrAP, a software framework that supports atomic transactions on NVM. The paper sets the goal of SoftWrAP as providing durability and atomicity to user applications in a system equipped with NVM. Without special support, neither durability nor atomicity is guaranteed by the raw memory interface of NVM devices. On one hand, dirty data written by store instructions is buffered by the cache hierarchy and may not reach the NVM before a crash. On the other hand, uncontrolled, arbitrary evictions can happen during the execution of a transaction, breaking atomicity if data is only partially updated when a crash occurs.
SoftWrAP differs from previous designs by using a combination of redo logging and shadowing. In conventional redo logging, two write orderings must be enforced. First, redo log entries must be persisted to the NVM before the commit record. Second, in-place updates of data must only be persisted to the NVM after the commit record is written. The first ordering can be realized by simply executing a persist barrier before and after writing the commit record, which signals the logical commit of the transaction. The second ordering, however, is difficult to achieve on today's hardware: if cache lines are updated in place, they are buffered by the cache hierarchy, whose evictions cannot be controlled by software. Previous hardware approaches either force the cache controller to hold back evictions of such lines until transaction commit, or redirect them to an alternate, "shadow" location to avoid destroying the pre-transaction image. Software approaches, on the other hand, do not update data in place until transaction commit. As compensation, read operations on committed data must check the redo log in addition to the original location, and a background thread must periodically checkpoint the redo log into the NVM image for ease of access. A major concern with this software scheme is that reads must be redirected, which severely hurts read performance due to the linear log search. Furthermore, effective NVM bandwidth drops to at most half of the device bandwidth, due to the intrinsic write amplification of redo logging as well as log traversal overhead.
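For concreteness, here is a minimal sketch of how conventional redo logging might enforce the first ordering on x86, assuming cache-line flushes plus a store fence as the persist barrier; LogEntry, CommitRecord, and redo_commit are illustrative names, not the paper's API.

```cpp
#include <immintrin.h>  // _mm_clflushopt, _mm_sfence (compile with -mclflushopt)
#include <cstdint>
#include <cstddef>

struct LogEntry     { uint64_t addr, size; uint8_t data[48]; };
struct CommitRecord { uint64_t txn_id; uint8_t committed; };

// Flush every cache line covering [addr, addr+len), then fence: a persist barrier.
static void persist_range(const void* addr, size_t len) {
    const uintptr_t kLine = 64;
    uintptr_t p = (uintptr_t)addr & ~(kLine - 1);
    for (; p < (uintptr_t)addr + len; p += kLine)
        _mm_clflushopt((void*)p);
    _mm_sfence();
}

// Ordering 1: all redo entries must be durable before the commit record.
void redo_commit(LogEntry* entries, size_t n, CommitRecord* rec) {
    for (size_t i = 0; i < n; i++)
        persist_range(&entries[i], sizeof entries[i]);
    rec->committed = 1;
    persist_range(rec, sizeof *rec);  // the transaction logically commits here
    // Ordering 2 (in-place updates persist only after this point) is what
    // software cannot enforce against arbitrary cache evictions.
}
```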
SoftWrAP combines shadowing and redo logging to eliminate the log traversal overhead of the software approaches described above. In addition to generating a redo log entry when data is written within a transaction, the library also copies the dirty data into a DRAM buffer indexed by a hash table, which can be accessed in constant time. Data accesses first search the hash table for an up-to-date copy that has not yet been checkpointed to the NVM image, before accessing the image itself. This way, the linear log search is replaced by a constant-time lookup, and NVM bandwidth is no longer wasted on log traversal; DRAM serves as the temporary buffer and absorbs the read traffic. As in other software redo logging approaches, the hash table is periodically checkpointed to the NVM to bound the growth of the buffering area: SoftWrAP copies shadow pages from DRAM to the NVM image, then truncates the redo log to recycle the NVM log buffer.
We next describe the design. SoftWrAP features a transaction interface that defines the unit of atomicity with regard to failures: after a crash, the updates of a transaction are either all committed or none of them are. SoftWrAP uses two macros, wrapOpen() and wrapClose(), to begin and commit a transaction respectively, while wrapRead() and wrapWrite() are provided to access transactional data. Note that the SoftWrAP library does not provide isolation as in conventional transactional systems: data stored by one transaction is visible to other transactions and to non-transactional code as soon as it is written.
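As an illustration, a transaction under this interface might look like the following; the token type and exact signatures are assumptions, since the summary only names the four primitives.

```cpp
#include <cstdint>

// Assumed declarations for illustration; the real library defines these.
typedef int WRAPTOKEN;
WRAPTOKEN wrapOpen();
void      wrapClose(WRAPTOKEN w);
uint64_t  wrapRead(uint64_t* addr, WRAPTOKEN w);
void      wrapWrite(uint64_t* addr, uint64_t value, WRAPTOKEN w);

// After a crash, either both balances reflect the transfer or neither
// does; no isolation is implied between concurrent transactions.
void transfer(uint64_t* src, uint64_t* dst, uint64_t amount) {
    WRAPTOKEN w = wrapOpen();
    uint64_t s = wrapRead(src, w);
    uint64_t d = wrapRead(dst, w);
    wrapWrite(src, s - amount, w);
    wrapWrite(dst, d + amount, w);
    wrapClose(w);  // append and persist the commit record
}
```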
Two extra data structures are maintained. The first is the log buffer on the NVM, implemented as a circular log queue. Log entries are generated and persisted as transactional stores execute. To maximize the efficiency of log persistence, the paper suggests that SoftWrAP use write-combining, non-temporal stores, which avoid polluting the cache while maximizing sequential write bandwidth. The second structure is the hash table that maintains shadow copies of dirty data. To simplify address computation, the hash table stores shadow pages at a fixed 1KB granularity; smaller granularities are also possible, at the cost of proportionally more table entries. Although not mentioned by the paper, the large shadowing granularity may not be a major source of write amplification, since shadow pages are checkpointed to the NVM image only infrequently, as we will see later, which amortizes the cost.
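The two structures could be sketched as follows; the field names and the initialize-from-NVM policy are assumptions layered on the 1KB page granularity described above.

```cpp
#include <cstdint>
#include <cstddef>
#include <unordered_map>

// Circular redo log in NVM; entries are appended with write-combining,
// non-temporal stores, and the region is truncated after checkpointing.
struct RedoLog {
    uint8_t* base;        // start of the NVM log region
    size_t   capacity;
    size_t   head, tail;  // head: oldest live entry; tail: next append offset
};

// DRAM shadow table: page-aligned NVM address -> 1 KB shadow page.
constexpr size_t kShadowPageSize = 1024;
struct ShadowPage { uint8_t data[kShadowPageSize]; };
// Assumption: a shadow page is initialized from its NVM home location on
// allocation, so bytes the transaction never touched still read correctly.
using ShadowTable = std::unordered_map<uintptr_t, ShadowPage*>;
```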
On a transactional store, the instrumented operation first generates a log entry containing the address, the size of the update, and the dirty data. Redo log entries need not have a fixed granularity, since they are not used as the reference for dirty data. The dirty data is also written into the corresponding shadow page in the hash table; a page is allocated if one does not exist. When the transaction commits, a commit record is appended to the log, at which point the transaction logically commits. The log is not replayed on commit.
On a transactional load, the instrumented operation first checks the hash table for dirty data. If the address is not shadowed, the NVM image is accessed directly. Both paths take constant time, making reads faster than in most software redo logging approaches.
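Continuing the structures sketched above, the two instrumented paths might look like this; logAppend and newShadowPage are hypothetical helpers, and accesses are assumed not to cross a shadow-page boundary.

```cpp
#include <cstring>  // memcpy

void logAppend(RedoLog& log, uintptr_t addr, const void* src, size_t size);
ShadowPage* newShadowPage(uintptr_t page);  // copies the page from NVM

// Store path: append a redo entry, then update the DRAM shadow copy;
// the NVM image itself is never written in place before commit.
void wrapWriteBytes(ShadowTable& tbl, RedoLog& log,
                    uintptr_t addr, const void* src, size_t size) {
    logAppend(log, addr, src, size);  // persisted with non-temporal stores
    uintptr_t page = addr & ~(uintptr_t)(kShadowPageSize - 1);
    auto it = tbl.find(page);
    if (it == tbl.end())
        it = tbl.emplace(page, newShadowPage(page)).first;
    memcpy(it->second->data + (addr - page), src, size);
}

// Load path: constant-time hash lookup, falling back to the NVM image.
void wrapReadBytes(const ShadowTable& tbl, uintptr_t addr,
                   void* dst, size_t size) {
    uintptr_t page = addr & ~(uintptr_t)(kShadowPageSize - 1);
    auto it = tbl.find(page);
    const uint8_t* src = (it != tbl.end())
        ? it->second->data + (addr - page)  // newest data, not yet checkpointed
        : (const uint8_t*)addr;             // clean data in the NVM image
    memcpy(dst, src, size);
}
```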
The shadow pages in the hash table are periodically checkpointed to the NVM image in the background. The checkpointing procedure iterates over the pages in the hash table and writes them back to their home locations. Note that only the data of committed transactions is checkpointed back to the NVM; the partially written working set of an in-flight transaction must be kept out of the NVM image.
To make shadow page checkpointing truly non-blocking, and hence a background process, the paper proposes using two hash tables in a double-buffering manner. Execution is divided into epochs, each consisting of several transactions. Each epoch has an associated hash table, called the current table, that stores its shadow pages. When the epoch advances, the current table is "closed": a new hash table is allocated as the current table for the next epoch, while the old table is checkpointed to the NVM in the background.
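A sketch of the epoch switch under this double-buffering scheme, reusing the ShadowTable type from above; startBackgroundCheckpoint is a hypothetical helper.

```cpp
// Double buffering: on an epoch boundary the current table is closed and
// handed to a background checkpointer while a fresh table takes over.
struct Epoch {
    ShadowTable* current;  // Active: new shadow copies are inserted here
    ShadowTable* closed;   // Closed: still searched, being checkpointed
};

void startBackgroundCheckpoint(ShadowTable* t);  // hypothetical helper

void advanceEpoch(Epoch& e) {
    e.closed  = e.current;          // close the table of the finished epoch
    e.current = new ShadowTable();  // fresh table for the new epoch
    // The background thread walks e.closed, writes each shadow page back to
    // its NVM home location, and finally truncates the redo log.
    startBackgroundCheckpoint(e.closed);
}
```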
A hash table can be in one of four states: A (Active), meaning it is the current table into which all updates are inserted; C (Closed), meaning it has become inactive but its contents have not yet been checkpointed; E (Empty), meaning it is empty and ready for use; R (Retiring), meaning it is not empty but all of its contents have been checkpointed.
A table begins its life cycle in the E state, having transitioned from R in the previous iteration. When the table is selected as the current table, it transitions to the A state, and when it is switched out on an epoch boundary, to the C state. A C-state table must still be searched before the NVM image if a read misses the current table, since it may contain the most up-to-date copy of an address that has not yet been checkpointed and whose entry has not been created in the new table. The background checkpointing thread starts writing table entries back to the NVM when the table enters the C state. After checkpointing completes, the table transitions to the R state. The R state is only transient; it exists to allow threads that began accessing the table while it was still in the C state to finish. An R-state table itself is not searched on current-table misses, since all of its contents can also be found on the NVM. After all threads accessing the R-state table complete, the table is recycled into the clean E state for use in a later epoch.
On transaction begin, the macro takes a snapshot of the current states of the two hash tables. A hash table in the A state can transition to the C state only when all threads that started while the table was in the A state have completed. Similarly, a hash table in the R state can transition to the E state only when all threads that started while the table was in the C state have completed, since those threads may still access the table while it is checkpointed and transitions to R concurrently. Once the table is in the R state, no new thread can access it; therefore no thread can hold a reference to the table once all threads that started before the transition have completed.
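One way to realize these transitions is a per-table pin count taken at transaction begin; this is purely an illustration of the rule, not the paper's implementation.

```cpp
#include <atomic>

enum class TableState { Empty, Active, Closed, Retiring };

struct EpochTable {
    ShadowTable             pages;                        // from the sketch above
    std::atomic<TableState> state{TableState::Empty};
    std::atomic<int>        pins{0};  // transactions still holding a snapshot
};

// Transaction begin: snapshot both tables by pinning them; unpin at commit.
void txnBegin(EpochTable& a, EpochTable& b) { a.pins++; b.pins++; }
void txnEnd(EpochTable& a, EpochTable& b)   { a.pins--; b.pins--; }

// A table advances (Active->Closed, or Retiring->Empty) only after every
// transaction that snapshotted it in the earlier state has unpinned it.
bool tryAdvance(EpochTable& t, TableState from, TableState to) {
    if (t.pins.load() != 0)
        return false;
    TableState expected = from;
    return t.state.compare_exchange_strong(expected, to);
}
```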