ARIES: Atransaction Recivery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging
- Hide Paper Summary
Link: https://dl.acm.org/citation.cfm?id=128770
Year: TODS 1992
Keyword: ARIES; Recovery; Logging
This 69 page paper describes Algorithm for Recovery and Isolation Exploiting Semantics (ARIES) in great detail. ARIES is designed for providing durability to database transactions with fast recovery and low runtime overhead. ARIES also requires less software engineering effort to adopt for an existing system, because it supports rich semantics and most main stream database designs (as of the time the paper was published). We next describe the data structures and operations of ARIES.
ARIES requires two in-memory data structures to be maintained in the runtime. These two data structures are transaction table (TT) which describes the commit state of transactions, and dirty page table (DPT). Neither of them is persisted to the disk except for performing checkpoints. During recovery, these two tables are recoverd by replaying log records since the last checkpoint. The transaction table, as stated above, stores information of currently active transactions and their logging status. Typical fields include the transaction’s commit progress (not yet committed, or waiting for log records to be persisted), last log record written, and the next log record to undo if the transaction is halfway through a partial or full rollback. The dirty page table stores information for dirty pages currently in the buffer pool. ARIES only requires one field, the RecLSN (Recovery LSN), which is the smallest serial number of the log record that modifies the page since the last write back of the same page. During normal processing, the entry is updated to the log record number whenever a page is loaded into the buffer pool for writing, or when a write operation causes the buffer pool entry to become dirty. If a page is evicted from the buffer pool, the corresponding entry is removed. During recovery, however, the content of DPT does not strictly replicate the DPT before the failure. The reasoning behind is that ARIES always use log entries as the ultimate reference for performing recovery. DPT, on the other hand, only acts as a filter that prevents most unnecessary I/O when they are truly avoidable. False positives are possible, but they only affect recovery speed instead of correctness.
ARIES maintains all log entries in a logically serial log. The actual implementation may partition the serial log into multiple files. Each log entry has a Log Serial Number (LSN), which uniquely identifies: (1) The identity of the entry, and (2) The location of the entry in the log, which means that the content of the log entry can be found given the LSN. The LSN grows monotonically in all circumstances. A log entry may contain different information depending on its type. We list them as follows. First, the log entry contains the entry’s LSN, type and Transaction ID. These three fields are common to all log entries. Most log entries also have a PrevLSN field, which stores the LSN of the previous log entry written by the same transaction. The recovery routine could undo all uncommitted changes of a certain transaction by repeatedly following the PrevLSN until a NULL value is seen, without having to scan the entire log. If the log entry describes the modification to a page, it also has a page ID field, which stores the address that the modification is applied to. For log records generated during a rollback, a field UndoNextLSN stores the LSN of the next log entry that should be rolled back. This simplifies handling of transaction rollbacks, because the recovery routine only needs to use the most up-to-date entry to determine the next log entry from the same transaction that should be undone instead of scanning the log backwards. The last field of log entries is data. For normal write operations during normal processing, this field should contain both redo and undo information. For some log records, only redo information is present because they are part of the undo process, and cannot be undone.
We then describe the operations of ARIES during normal operation. The ARIES paper assumes a lock-based transaction management system, but also claims that ARIES does not preclude other design options. In either case, transactions must acquire latches on the page if they wish to insert or modify an element on the page. This is because data modification may incur garbage collection or page reorganization, which requires mutual exclusion on the page. The log record can only be appended to the log while the page latch is held by the modifying transaction. This is to ensure that, for every page, the order of log records is consistent with the order of actual page modifications happening physically. If this property is violated, it is possible that two log records on the same page are written out-of-order. During redo phase, these two entries will then be applied also in the wrong order, causing data corruption in some cases (e.g. redo logging is operational). While appending the log, the transaction manager computes the pre- and after-image of the data item to be modified, and fills in the PrevLSN field with the previous LSN the transaction has written (using the entry in the transaction table). On each persistent page, ARIES also maintains a LSN field called PageLSN. This field is updated whenever a log entry is generated for that page, and will be written back to disk with the page (the easiest way is to make the field part of the page). When the buffer pool decides to evict a page, according to the WAL property, log entries upto the entry that modifies the page must be written back before the page. In this case, the buffer pool manager reads the PageLSN field, and writes back all log entries upto the one indicated by the PageLSN.
On transaction commit, the transaction manager appends a commit record to the log. To ensure durability, it also must write back all log entries the current transaction has written. Similar to the normal operation case, the transaction manager reads the transaction table to determine the last log entry the transaction generated. It then forces log entries upto the last LSN to be written back. The transaction could successfully commit after that. Note that ARIES makes very relaxed assumption about the buffer manager: Buffer pool can be managed as steal, no-force. Steal means dirty pages are allowed to be written back before the corresponding transaction commits. The undo log entry can help recover to the previous state if crash happens. No-force means the transaction manager does not force dirty pages to be written back to disk on transaction commit. As we have shown above, only log entries the transaction has written will be written back. I/O is often significantly faster in this case, because writing the log only involves sequential I/O, while writing disk pages is more random.
On a partial or full rollback, the transaction manager needs to undo the partial change the current transaction has performed on disk pages by scanning the log in the backward direction and undoing all operations performed by that transaction. In ARIES, undo operations also generate log records, called Compensation Log Records (CLR). CLRs guarantee that every write operation can at most be undone once. Imagine the case where a transaction is halfway through a rollback before the system crashes. If no CLR is generated, the recovery routine has no way to know at which point the rollback operation was on at the moment of crash. Some operations in the log have to be undone twice in order to roll back the entire transaction as part of the undo pass. As we will show later, undoing an operation more than once may have disastrous outcome if operational logging is employed. ARIES treats CLR just like ordinary log records except that CLR does not contain a pre-image. In other words, CLRs can only be redone, but not undone since CLR indicates that the transaction is already in the process of being rolled back. When a CLR is generated, the transaction manager writes the PrevLSN in the ordinary log record that the current CLR undoes into the NextUndoLSN field of the CLR. This is to make sure that the recovery routine can quickly locate the next log entry to roll back on recovery. Note that the latest CLR in the log for a certain transaction actually defines an interval: (NextUndoLSN, current LSN], which is the range of log records whose effects have been undone. If nested rollbacks have taken before before the current rollback, the transaction manager will see CLRs written by earlier rollback operations. In this case, the transaction manager reads NextUndoLSN of the first CLR it encounters, and directly jumps to the NextUndoLSN, skipping all log entries in-between.
Periodically, ARIES takes checkpoints of the system and stores them in the log. The checkpoint serves as the starting point for recovery if a crash happens after the completion of the checkpoint. ARIES checkpointing algorithm is fast and non-intrusive in the runtime thanks to two unique features. First, dirty pages will not be written back to disk during the checkpoint operation, saving I/O bandwidth. Only data structures and some auxiliary information will be written back to the disk, which, as stated above, only incurs sequential I/O. Second, the checkpoint operation does not halt transaction execution while it is collecting data in the background. This results in a fuzzy checkpoint, which cannot precisely describe the state of the system at any given moment, but still works with carefully designed recovery algorithm. We next describe the details of checkpointing. First, the log manager writes a transaction begin record to the log. This log record indicates the beginning of the checkpoint operation. Every log entry appended after this entry and before the completion is considered as part of the checkpoint. Then, without interrupting current running transactions, the log manager copies the transaction table as well as the dirty page table in the background and stores them into the checkpoint end record. Note that transactions and the buffer manager may modify the tables in the meantime. This, however, does not affect the correctness of the algorithm, because (1) Existing transactions can only leave the table, which may cause a “ghost entry” to be written to the checkpoint, but in this case, the transaction commit record must be written after the checkpoint begin record, which is also part of the checkpoint. New transactions are also included in the checkpoint because they write transaction begin records after the checkpoing begin; (2) Dirty pages can leave and re-enter the table. In the previous case, it actually does not matter because the DPT is not the ultimate source for deciding whether an operation has been persisted. In the latter case, a similar argument can be used to prove that the new page will not be missed. After collecting data, the log manager writes back the checkpoing end record which contains the DBT and TT. As the final step, the LSN of the checkpoint begin record is written into the master record, which itself is stored in a well-known place on the disk and can be retrieved after the failure. Note that since the checkpoint is only lightweight, log entries before the checkpoint begin record cannot be discarded. Otherwise, transactions that fail to complete may not be able to commit or roll back.
On recovery, ARIES first runs an analysis algorithm to compute the DPT and TT. The TT must be recovered to the state right before the crash, but the content of DPT does not have to be exactly identical. In fact, ARIES only computes a superset of pages that are in the DPT before the crash. The recovery algorithm first reads the checkpoint end log record, and loads both tables into the memory. Due to the fuzzy nature of the checkpoint, some entries might be missing or unnecessary. Then, the recovery routine scans the log from the checkpoint begin record. For every log entry, if it indicates a transactional status change (e.g. begin, commit, rollback), then the change is reflected on the transaction table. Otherwise, if the entry indicates data modification, then the DPT is modified accordingly using the same rule as if it were normal processing. Note that during the analysis pass, no redo is performed. This is because since the checkpoint does not force back all dirty pages, log records before the checkpoint begin record may also be needed in the redo process, while the analysis pass always begins at the checkpoint begin record. CLRs are treated as data modification entries because they also modify pages. Note that the DPT does not evict entries during the analysis pass. During normal processing, it is possible that a page is silently evicted which causes the removal of an entry in the DPT. The algorithm guarantees that the result is still correct. After the analysis pass, the recovery routine computes the minimum RecLSN, and uses this value as the starting point of the following undo pass. Note that the RecLSN can be much smaller than the checkpoint begin LSN if a page is loaded into the buffer pool long time ago and has not been written back. In practice, this may cause very inefficient recovery, because the redo pass then needs to scan a large portion of the log and replay all of them. In addition, storage management becomes more difficult as the log cannot be garbage collected.
The redo pass replays all log entries from the earliest LSN that is known to be missing from the persistent image. The redo pass takes the minimum RecLSN from the analysis pass as an argument, and starts scanning log entries from that point. For all log entries that have an after-image, the page ID is first checked against the DPT. If the page is not in the DPT, then the page is already up-to-date, and does not require redo. If, however, that the page has an entry in the DPT, then the recovery routine suspects that some changes may not be written back to the disk, which requires a redo. Recall that the DPT is inexact, so even if a page is found in the DPT, it might still be the case that the page has been silently written back. To check for this possibility, the recovery routine reads the page into the buffer pool, and compares the PageLSN with the LSN of the log entry. If the latter is smaller, then it indicates that the disk page is actually newer than the log entry, the redo action of which can be skipped. Otherwise, the log entry is applied to the page, and the PageLSN is updated accordingly. Note that in this process no log entry is written, and the buffer pool is free to write back dirty pages. If the system crashes again, redo will begin from the same position. This time the PageLSN will be larger than it was last time, and the recovery routines will ignore the redo operation that has already been performed in the previous recovery attempt. CLRs are treated in the same way as a normal data write entry.
The last stage of recovery is the undo pass, in which “loser transactions” are rolled back using undo log. The recovery routine undoes transactions that have not successfully committed (as recorded by the transaction status in the TT), including those half way through their roll back process but were interrupted by the crash. The undo pass is described as follows. For every loser transaction in the TT, the last written LSN is used as the starting point of undo. The undo process follows the PrevLSN field in the log entry after undoing the current entry using the before-image. CLR entries are ignored, and the undo process instead follows the UndoNextLSN to the next log entry. For every undo action performed by the recovery routine, a CLR is also appended to the log. The CLR is generated in the same way as the CLR in a rollback during normal processing. If another crash happens in the middle of undo, the later recovery routine will find these CLRs, and skip undoing log records that have been undone. When performing undo on a page, the PageLSN must also be updated using the LSN of the CLR. WAL property must also be observed, which forced the log entry to be written back before a dirty page is evicted. After performing undo for the loser transaction, the recovery routine writes a transaction termination record to inform future instances of recovery routines that this transaction can be removed from the TT. The recovery completes after processing all loser transactions, after which normal processing can be resumed.
ARIES guarantees that each redo and undo operation is only performed at most once during the recovery even with multiple failures. For redo-able log entries, their LSN is compared with the page LSN, and if the former is smaller than or equal to PageLSN, its redo will be skipped. For rollbacks, the WAL property guarantees that if the dirty page affected by undo is written back to the disk, then so does the corresponding CLR. This has two consequences: First, just like the case with ordinary log entries in redo pass, if the system crashes during a previous undo attempt, CLRs whose dirty pages have been written back to the disk will be skipped, and the rest will be replayed. Second, the undo pass is guaranteed to begin with the most recent CLR entry in the log, whose effect to disk pages must have already been applied at the time of undo. The recovery routine will then proceed with the UndoNextLSN as the next LSN to undo.
Support for strong “exact-once” semantics of redo and undo implies the adoption of more efficient operational logging. Instead of copying the exact physical before- and after-image in log entries, which can potentially generate prohibitively large log records, operational logging only writes the logical operation with arguments, and relies on the business logic of the DBMS to translate the operation to physical changes during recovery. Note that the physical changes during recovery are not necessarily identical to the changes made during normal operation as we will show in an example below. Operational logging has two obvious benefits. First, it saves log buffer and on-disk log space, because some physical changes are only for bookkeeping purposes, and have no concrete meaning. One example is changing the layout of objects on the page for gargabe collection purposes. While this operation does not alter the logical content stored on the page, its effect must be logged if physical logging is used. The second benefit is that operational logging enables higher parallelism, because transactions do not have to establish unnecessary commit dependencies. Image the case where transaction T1 splits a B+Tree index page, and then transaction T2 deletes a key from the new page. If physical logging is used, T1 will log the physical changes to the B+Tree node first, and then T2 logs the deletion of the key using the pre- and after-image of the new node. If, however, the system crashes after T2 commits but before T1 commits. During the undo pass, modifications made by T2 will be undone, including the page split. After recovery, T1 losts its deletion operation, because as T2 was rolled back duing recovery, the new page was deallocated, and the original content of the pre-split B+Tree page was restored. Durability is violated in this case, because T2 is partially undone. One way of preventing this is to encorce a commit dependency: T2 cannot commit before T1, since T2 reads dirty data generated by T1. Introducing such a commit dependency guarantees recoverability of both transactions, but decreases parallelism when T2 outpaces T1 and could commit earlier. Logical logging solves the problem without extra commit dependency elegantly: Instead of physically undoing T1 by deallocating the after-split B+Tree nodes, T1 is only rolled back logically by removing the key that was inserted (which caused the split). The shape of the B+Tree is necessarily identical to the one during normal processing before the crash, because logical logging does not enforce a strict replication of the history.
ARIES also supports nested top transactions elegantly using CLR. Nested top transactions are operations that must happen atomically, but should not be rolled back once commit under all circumstances. To guarantee atomicity, nested top transactions must also participate in logging, and in the case of failure, be rolled back using the normal strategy. After the transaction has committed, however, even if the parent transaction rolls back, the nested top transaction must remain committed. To support such semantics, when the nested top transaction begins, the log manager remembers the most recent log entry of the transaction. After the top nested transaction commits, the log manager writes a CLR, and fills in the UndoNextLSN with the LSN taken before the transaction. The CLR itself contains no redo information. On recovery, the dummy CLR record protects the log region of the top nested transaction from being undone, as the undo pass will follow the UndoNextLSN to log entries before the nested top transaction.