Write-Behind Logging
Link: https://dl.acm.org/citation.cfm?id=3025116
Venue: VLDB 2016
Keyword: Logging; Durability; Transaction Processing; Database
This paper proposes Write-Behind Logging (WBL), a novel technique that leverages Non-Volatile Memory (NVM) to solve a classical problem in databases: making transactions durable. Earlier work proposed various schemes to preserve the consistency of the database across power failures (media failures and OS/DBMS failures are not considered, because they usually corrupt data in an unpredictable manner that no recovery algorithm could handle), where all in-memory work not yet flushed to durable storage is lost and there is no way to know which part of the database is consistent. ARIES, as one example, uses both undo logging and redo logging to perform recovery. ARIES is intended for a disk-based database system with a managed buffer pool, and it makes two assumptions to stay realistic. First, dirty pages in the buffer pool may be written back to disk at any time during a transaction's execution. This assumption allows the buffer pool to swap in new pages when there are no free slots, as is the case with most buffer pool implementations. As a consequence, before a transaction commits, its dirty pages may overwrite valid data on disk, so on a power failure the recovery algorithm must be able to restore the disk image to a state before those dirty pages were written out. Second, dirty pages need not be flushed to durable storage on transaction commit. This assumption improves transaction throughput: for disk-based systems, writing dirty data back to disk mostly produces random writes, which can easily become the performance bottleneck. On recovery, the algorithm must therefore be able to recover committed values even if they were never flushed to disk before the crash. Together, these two policies are called "steal and no-force".
The ARIES commit protocol is as follows. During normal operation, the database monitors write operations; for each write, both the physical before-image and after-image are saved in the log. On transaction commit, the transaction's log records must be flushed to disk using fsync(), and the transaction is considered committed once fsync() returns. Fuzzy checkpoints are also taken periodically to keep the log from growing without bound. A fuzzy checkpoint does not have to halt all transactions, and hence completes faster than a synchronized checkpoint. Since uncommitted data may appear in a fuzzy checkpoint, both redo and undo are needed to restore the database to a consistent state after a crash.
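To make the "steal, no-force" commit protocol concrete, here is a minimal sketch of ARIES-style write-ahead logging over a toy key-value store. All names (WALStore, the JSON record layout) are hypothetical simplifications; real ARIES also tracks LSNs, page IDs, and compensation log records.

```python
import json, os

class WALStore:
    def __init__(self, log_path):
        self.data = {}                       # in-memory table ("buffer pool")
        self.log = open(log_path, "a+")

    def write(self, txn_id, key, value):
        # ARIES logs both the physical before-image and after-image,
        # so one record supports both undo and redo.
        record = {"txn": txn_id, "key": key,
                  "before": self.data.get(key), "after": value}
        self.log.write(json.dumps(record) + "\n")
        self.data[key] = value               # dirty; may be stolen to disk

    def commit(self, txn_id):
        # Write-ahead rule: the commit record must reach durable storage
        # (fsync) before the transaction is acknowledged as committed.
        self.log.write(json.dumps({"txn": txn_id, "commit": True}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())          # durable once fsync returns
```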
ARIES recovery has three phases. In the first phase, the analysis phase, the recovery handler finds the most recent checkpoint and restores the transaction table and dirty page table as of when the checkpoint was taken; it also restores the database state from the checkpoint. Next, in the redo phase, transactions that had already committed are replayed by the recovery handler; only the after-images of log records taken during and after the checkpoint are used. In the third phase, the undo phase, the recovery handler undoes modifications made to the database by uncommitted transactions. The set of uncommitted transactions is derived in the analysis phase, and the before-images from their log records are used to restore the database to the state before those transactions started. After the undo phase, the state of the database is consistent, and new transactions can be processed.
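A sketch of the three phases over the same toy log format might look as follows. Checkpoint handling is omitted, so analysis starts from the head of the log rather than from the most recent fuzzy checkpoint, and redo replays only committed transactions rather than repeating all history.

```python
import json

def recover(log_path, data):
    records, committed = [], set()
    with open(log_path) as log:
        # Phase 1: analysis -- determine which transactions committed.
        for line in log:
            rec = json.loads(line)
            records.append(rec)
            if rec.get("commit"):
                committed.add(rec["txn"])
    # Phase 2: redo -- replay after-images of committed transactions.
    for rec in records:
        if "after" in rec and rec["txn"] in committed:
            data[rec["key"]] = rec["after"]
    # Phase 3: undo -- restore before-images of uncommitted transactions,
    # scanning backwards so that the oldest before-image wins.
    for rec in reversed(records):
        if "after" in rec and rec["txn"] not in committed:
            if rec["before"] is None:
                data.pop(rec["key"], None)   # key did not exist before
            else:
                data[rec["key"]] = rec["before"]
    return data
```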
With the performance advantage of NVM over disk I/O, WBL achieves better performance than WAL and ARIES, especially in recovery time. Several design decisions are made to take advantage of fast, byte-addressable access to NVM. First, WBL is designed to work with MVCC, where each data item is tagged with two timestamps: a begin timestamp, which is the commit timestamp of the transaction that created this version (an update only creates new versions and never deletes existing ones), and an end timestamp, which is the begin timestamp of the next newer version in the version chain. A reading transaction traverses the version chain to find the version visible under its timestamp (allocated from a centralized counter, similar to how OCC allocates timestamps). The version read rule states that a version with timestamps [ts1, ts2) is visible to a transaction with begin timestamp ts if and only if (1) ts lies within [ts1, ts2); and (2) the transaction that wrote the version has committed (note that ts2 can be ∞, as is the case for the most up-to-date version). Because data items are tagged with versions, no before-image is needed: even after recovery, an older version remains reachable even if an uncommitted transaction has written a newer one over it. The second design decision is that WBL uses a "steal, force" scheme for maintaining the buffer pool. Dirty data is allowed to propagate to durable NVM during transaction execution, but when the transaction commits, all of its dirty data must be forced to the NVM as new versions. In practice, versions are stored in a table as tuples, with a backward pointer field indicating the previous version. This scheme is infeasible for disk-based systems, because the I/O latency and the random write pattern have a huge negative impact on performance; on NVM, however, small random writes are much faster, so the force-commit scheme works well. The third design decision is that the DBMS avoids frequent logging using group commit. Multiple transactions are held in the commit queue even after they finish execution; the DBMS aggregates transactions until it decides the commit overhead can be amortized, and the commit protocol then flushes the dirty data of all of them back to the NVM.
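The version read rule can be expressed compactly in code. The sketch below assumes hypothetical Version objects and a set of committed transaction timestamps; it is not the paper's implementation, only an illustration of the visibility check and the chain traversal.

```python
INF = float("inf")

class Version:
    def __init__(self, begin_ts, end_ts, payload, prev=None):
        self.begin_ts = begin_ts    # commit timestamp of the creating txn
        self.end_ts = end_ts        # begin_ts of the next version, or INF
        self.payload = payload
        self.prev = prev            # backward pointer to the older version

def is_visible(version, ts, committed):
    # Visible iff ts falls within [begin_ts, end_ts) and the writer
    # committed; begin_ts doubles as the writer's identifier here.
    return (version.begin_ts in committed
            and version.begin_ts <= ts < version.end_ts)

def read(newest, ts, committed):
    # Traverse the version chain from newest to oldest until a
    # visible version is found.
    v = newest
    while v is not None and not is_visible(v, ts, committed):
        v = v.prev
    return v.payload if v is not None else None
```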
WBL works as follows. During normal operation, dirty cache lines may be evicted to the NVM at any time. These dirty lines belong to uncommitted transactions, and hence must not become visible to other transactions. The DBMS maintains a dirty tuple table (DTT), which stores, for each transaction, the locations (virtual addresses) of the dirty items it has modified. On transaction commit, no undo or redo log record is created. Instead, the DBMS expects more transactions to commit in the near future so that they can all be committed together. On each group commit, the DBMS first flushes all dirty data items to the NVM as described above: the flush process consults the DTT to identify the locations of dirty items and then uses the ISA's cache-line write-back instructions to persist them. The DBMS then logs the timestamp interval that future transactions will use until they commit. The interval is [cp, cd], where cp is the next timestamp that will be allocated to a committing transaction (i.e., the current timestamp counter plus one), and cd is the point at which the next group commit will happen (i.e., when the global counter reaches cd, a group commit is forced). The choice of cd is largely arbitrary, but to balance transaction latency against commit overhead, the interval should be neither too large nor too small. The interval is written to the persistent log after all dirty lines have been flushed to NVM, which completes the group commit process. On recovery, the handler reads the most recent valid log entry, [cp, cd]. No undo or redo is needed in this process; recovery consists of a single analysis phase that is effectively instantaneous. The handler simply remembers the interval in a DRAM location and continues processing new transactions. In later processing, transactions must treat as invisible every version whose commit timestamp falls within this interval, or that has no commit timestamp at all, since such versions belong to transactions that had not committed when the log record was written (a log record is written on every group commit).
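Below is a sketch of the group-commit protocol under stated assumptions: the DTT is a plain dictionary, persist() stands in for the ISA's cache-line write-back instruction (e.g., CLWB) followed by a fence, and the group size is a fixed constant rather than an adaptive policy. All names are hypothetical.

```python
GROUP_SIZE = 4                      # how many txns to batch per commit

class WBLEngine:
    def __init__(self):
        self.counter = 0            # centralized commit-timestamp counter
        self.dtt = {}               # dirty tuple table: txn -> locations
        self.pending = []           # txns finished but not yet durable
        self.log = []               # persistent log of (cp, cd) intervals

    def record_write(self, txn, location):
        self.dtt.setdefault(txn, []).append(location)

    def commit(self, txn):
        self.pending.append(txn)
        if len(self.pending) >= GROUP_SIZE:
            self.group_commit()

    def group_commit(self):
        # Step 1: flush every dirty tuple of the pending transactions
        # and hand each transaction its commit timestamp.
        for txn in self.pending:
            for loc in self.dtt.pop(txn, []):
                persist(loc)        # stand-in for CLWB + fence
            self.counter += 1       # txn's commit timestamp
        # Step 2: log the interval that *future* transactions will use:
        # cp is the next timestamp to be handed out, cd the counter value
        # at which the next group commit will be forced.
        cp, cd = self.counter + 1, self.counter + GROUP_SIZE
        self.log.append((cp, cd))   # this append is the durability point
        self.pending.clear()

    def recover(self):
        # A single analysis step: remember the newest (cp, cd) entry;
        # versions stamped inside it must be treated as invisible.
        return self.log[-1] if self.log else None

def persist(location):
    pass                            # simulated durability barrier
```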
The number of [cp, cd] intervals may grow indefinitely without proper cleaning. A background garbage collection thread can be used to clean up versions that will never be visible to any transaction. This includes both versions too old to be visible and uncommitted versions left behind by crash recovery. For the latter purpose, the garbage collector takes a [cp, cd] interval from the current set and scans the entire database for versions whose timestamps fall within it; these versions are removed from the database heap. After a full scan, the interval itself is removed from the DBMS's volatile state. Similarly, after a successful group commit, log records before the most up-to-date entry are no longer needed, since the recovery handler only uses the most recent one. The garbage collector can safely remove older entries once the newest entry is persisted on the NVM.
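A sketch of the interval-driven garbage collection pass, reusing the hypothetical Version objects from the earlier sketch (with begin_ts set to None for versions that never received a commit timestamp):

```python
def gc_interval(heap, interval):
    # Scan the version heap and drop every version whose writer never
    # committed: those stamped inside the recovered [cp, cd] interval,
    # and those with no commit timestamp at all. After a full scan the
    # caller can discard the interval from volatile state.
    cp, cd = interval
    survivors = []
    for version in heap:
        ts = version.begin_ts
        if ts is None or cp <= ts <= cd:
            continue                # never visible to any transaction
        survivors.append(version)
    return survivors
```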