Scalable Logging Through Emerging Non-Volatile Memory



Paper Title: Scalable Logging Through Emerging Non-Volatile Memory
Link: https://dl.acm.org/citation.cfm?id=2732960
Year: VLDB 2014
Keyword: ARIES; Recovery; Logging; NVM




This paper proposes distributed logging, a mechanism that extends the ARIES-style database recovery algorithm to provide scalable logging on multicore platforms. The traditional ARIES algorithm provides a general-purpose, efficient, and versatile solution for database recovery, requiring only simple data structures and a few extensions to an existing database system. The core of ARIES is a software-maintained sequential log, in which log entries of different types are stored. Log entries are identified by their unique Log Sequence Number (LSN), which corresponds to their logical offset into the log. Although not explicitly mentioned in the paper, in order to append an entry to the log, a transaction should acquire both a page latch and a log latch. The former ensures that two transactions modifying the same page append their entries in the same order in which they perform the modifications. The latter protects the integrity and consistency of the log itself, preventing concurrent modifications from corrupting the log data structure.
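
Although the paper does not show code for this path, the append logic can be pictured with a minimal C++ sketch; the class, field, and function names below are illustrative stand-ins, not the paper's.

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

struct LogEntry {
    uint64_t lsn;                // logical offset into the log
    uint64_t page_id;            // page this entry modifies
    uint64_t txn_id;             // transaction that generated the entry
    std::vector<char> payload;   // redo/undo images
};

class CentralizedLog {
public:
    // The caller is assumed to already hold the latch of the page being
    // modified, so entries for one page land in modification order.
    uint64_t Append(uint64_t page_id, uint64_t txn_id,
                    const char* data, size_t len) {
        std::lock_guard<std::mutex> guard(log_latch_);  // the global log latch
        uint64_t lsn = next_lsn_;
        entries_.push_back({lsn, page_id, txn_id,
                            std::vector<char>(data, data + len)});
        next_lsn_ += len;  // LSN advances by entry size (header omitted)
        return lsn;
    }

private:
    std::mutex log_latch_;           // the contended centralized latch
    uint64_t next_lsn_ = 0;
    std::vector<LogEntry> entries_;  // in-memory tail, flushed in batches
};
```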

To reduce I/O overhead, log entries are only written back to the disk in large batches, which must observe the Write-Ahead Logging (WAL) property. The WAL property defines two types of write ordering. First, when a dirty page is to be written back to the disk, the log entries describing modifications to that page must be written back before the page itself. ARIES enforces this ordering by forcing a log flush up to the PageLSN, i.e. the LSN of the most recent entry that modified the page. The second write ordering is that before a transaction can write its commit record, all log entries generated by that transaction must be written back to the disk. ARIES enforces this by flushing all log entries up to the transaction’s most recent entry. To further amortize the overhead of frequently scheduling disk I/O, the transaction manager can group commit a batch of transactions rather than committing them individually. Instead of writing the log entries of a transaction immediately when it finishes execution, the transaction manager puts the completing transaction into a commit pending queue. Transactions in the pending queue are only committed when the queue is full, or when there has not been any commit for a while. With group commit, only one large I/O operation is scheduled for all transactions in the pending queue, which amortizes the I/O overhead over multiple committing transactions at the cost of longer commit latency.
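
The two WAL orderings can be condensed into a small hedged sketch; flush_log_upto and both struct layouts are assumed helpers for illustration, not the paper's interfaces.

```cpp
#include <cstdint>

struct Page { uint64_t page_lsn; };  // LSN of the last entry that modified the page
struct Txn  { uint64_t last_lsn; };  // transaction's most recent log entry

void flush_log_upto(uint64_t lsn) { /* persist every entry with LSN <= lsn */ }
void write_page_to_disk(const Page&) { /* schedule the page write-back */ }
void append_and_flush_commit_record(const Txn&) { /* durable commit record */ }

// WAL ordering 1: force the log up to the PageLSN before the dirty page
// itself goes to disk.
void WriteBackDirtyPage(const Page& p) {
    flush_log_upto(p.page_lsn);
    write_page_to_disk(p);
}

// WAL ordering 2: force all of the transaction's entries before writing
// its commit record.
void Commit(const Txn& t) {
    flush_log_upto(t.last_lsn);
    append_and_flush_commit_record(t);
}
```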

Being able to write into a centralized log object in DRAM simplifies logging logic, because all log entries have a unique LSN, and their logical ordering is implied by the LSN. During recovery, ARIES does not redo a log entry if the entry’s LSN is smaller than or equal to the PageLSN recorded on the page, which indicates that the page already contains the update. On the other hand, keeping a centralized object in memory that is accessed under a lock can easily become a performance bottleneck on today’s multicore architectures. The situation is only aggravated as the number of cores in the system increases and with multi-node memory architectures such as NUMA.
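
The redo test is mechanical enough to sketch directly; apply_redo and the types are again illustrative stubs.

```cpp
#include <cstdint>

struct Page     { uint64_t page_lsn; };
struct LogEntry { uint64_t lsn; };

void apply_redo(const LogEntry&, Page&) { /* replay the logged change */ }

void RedoEntry(const LogEntry& e, Page& p) {
    if (e.lsn <= p.page_lsn) return;  // page already contains this update
    apply_redo(e, p);
    p.page_lsn = e.lsn;               // remember the newest applied entry
}
```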

To overcome the inherent shortcoming of centralized logging, this paper proposes distributed logging, which allows multiple log objects to be maintained in main memory following some partitioning rule. In addition, recovery can be made more efficient with multiple log objects by adopting concurrent recovery algorithms. The paper also takes advantage of the fact that with the advent of Non-Volatile Memory (NVM), even random I/O to or from NVM will be much faster than sequential I/O to disk. This observation justifies maintaining multiple log objects, which incurs non-sequential I/O in the NVM address space. Finally, the paper points out that current hardware is insufficient for implementing efficient durability, given that data accesses are cached by the processor. Future processor designs may back the entire cache hierarchy with a battery or supercapacitors to extend the persistence domain to the cache. Durability could then be achieved very efficiently using only ordinary memory instructions and fences.

We next describe the design of distributed logging and focus on the challenges it poses to classical ARIES-style logging and recovery. In distributed logging, log entries can be partitioned using one of two keys: page ID or transaction ID. In the former case, the log manager operates like a hash table: whenever a transaction intends to append a new log entry, it hashes the page ID into a bucket, and then appends the entry to the end of the log in the corresponding bucket. Since buckets cover disjoint sets of page IDs, they can be latched and maintained independently of each other, improving parallelism, as the sketch below illustrates. In the other option, transaction-oriented logging, every transaction has a dedicated log buffer. Transactions append only to their local buffers and do not communicate with each other.
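
A rough sketch of the page-oriented variant, assuming a fixed bucket count and modulo hashing (both illustrative choices):

```cpp
#include <array>
#include <cstdint>
#include <mutex>
#include <vector>

constexpr size_t kNumBuckets = 64;

struct Entry { uint64_t page_id, txn_id; std::vector<char> payload; };

struct Bucket {
    std::mutex latch;         // protects only this bucket, not the whole log
    std::vector<Entry> log;   // per-bucket sequential log
};

class PageOrientedLog {
public:
    void Append(uint64_t page_id, uint64_t txn_id,
                const char* data, size_t len) {
        Bucket& b = buckets_[page_id % kNumBuckets];  // hash page -> bucket
        std::lock_guard<std::mutex> g(b.latch);       // no global latch taken
        b.log.push_back({page_id, txn_id,
                         std::vector<char>(data, data + len)});
    }

private:
    std::array<Bucket, kNumBuckets> buckets_;
};
```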

Due to the nature of redo and undo, both logging schemes are hard to deal with under certain circumstances. For page-oriented logging, redo is simple, because the ordering of log entries for a single page still observes the logical ordering of modifications to that page. Undo, however, is hard, because undo is intrinsically transaction-oriented: it requires the recovery manager to traverse the linked list of log entries written by a loser transaction in reverse chronological order. Without a proper notion of inter-log undo ordering, this traversal across partitioned logs would be very inefficient. For transaction-oriented logging, redo is not straightforward, because log records are only ordered with respect to each other within a transaction. In the classical ARIES scheme there is no notion of inter-transaction dependency, as all operations are implicitly serialized by the LSN. Finding an encoding scheme that describes the data dependencies between transactions is hence the essence of solving the redo problem for transaction-oriented logging. If such an encoding scheme can be found, the recovery manager simply follows the dependencies during redo to ensure correct log replay ordering. Undo with transaction-oriented logging is trivial, because each log buffer only contains log entries generated by a single transaction. In the following paragraphs we describe the solutions proposed by the paper for both logging schemes.

To solve the easier problem of not being able to undo transactions efficiently with page-oriented logging, the paper proposes that each transaction keep a private, DRAM-only log buffer. The log buffer resides in DRAM at all times and does not participate in WAL. Every time a transaction appends an entry to the distributed page log, it also writes the same entry into its private log buffer. On partial or full rollbacks, the private log buffer is used to undo the transaction’s previously generated log entries. The log buffer is discarded on successful commit, and can always be reconstructed from the redo WAL entries, which are already written back to the NVM before commit. During recovery, the recovery manager runs the classical ARIES algorithm. In the redo pass, the recovery manager reconstructs the private log buffers as it redoes modifications to pages. Note that even if a log record’s LSN is smaller than or equal to the PageLSN, which indicates that the page already contains the update, the log entry must still be inserted into the private log buffer. At the end of the redo pass, the log buffers are recovered to their state right before the crash, and in the following undo pass they are used to roll back modifications of loser transactions just as in ARIES.
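
A minimal sketch of this private undo buffer, with an assumed restore_before_image helper standing in for the actual page-restoration logic:

```cpp
#include <cstdint>
#include <vector>

struct UndoEntry { uint64_t page_id; std::vector<char> before_image; };

void restore_before_image(uint64_t /*page_id*/,
                          const std::vector<char>& /*img*/) {
    // Copy the before-image back into the page (stub).
}

struct Transaction {
    std::vector<UndoEntry> private_undo;  // DRAM only, never subject to WAL

    // Called alongside every append to the distributed page log. During
    // the redo pass of recovery this runs even for entries whose
    // LSN <= PageLSN, so the buffer is rebuilt exactly.
    void OnLogAppend(uint64_t page_id, std::vector<char> before_image) {
        private_undo.push_back({page_id, std::move(before_image)});
    }

    // Partial/full rollback: walk the private buffer backwards.
    void Rollback() {
        for (auto it = private_undo.rbegin(); it != private_undo.rend(); ++it)
            restore_before_image(it->page_id, it->before_image);
        private_undo.clear();
    }

    // On commit the buffer is simply discarded; the redo entries already
    // persisted to NVM suffice to reconstruct it if needed.
    void Commit() { private_undo.clear(); }
};
```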

Determining the correct order of modifications from different transactions is more difficult in transaction-oriented logging, because two ordering constraints must be satisfied. First, log entries from the same transaction must be ordered according to the program order in which the modifications are carried out. Second, log entries on the same page from different transactions must also be ordered based on the logical ordering of the modifications (e.g., if serializability is implemented, the logical ordering of the modifications is consistent with the logical ordering of the transactions). Since log records are scattered across different transactions’ log objects, it would be difficult to assign LSNs in a globally consistent manner without hampering scalability. To solve the global ordering problem, the paper proposes using Lamport logical clocks. With logical clocks, every entity in the system has a local integer counter, which represents the last time it synchronized with another entity. Entities synchronize by sending messages to each other. The local clock is included in the message, and on receiving such a message, the receiver sets its local clock to the maximum of its local clock and the value included in the message. This guarantees that if event A happens before event B, the clock value of A is no larger than that of B; conversely, if the value of A is smaller than that of B, then A only possibly happened before B. In our case, the entities are pages and transactions, because we want to maintain ordering properties on both. When a transaction causes a page to be loaded into the buffer pool for writing (or changes its ownership), it sets the clocks of both the transaction and the page to the larger of the two plus one (if the access is read-only, the value is just the larger of the two). Each log entry carries the current clock of the transaction. This guarantees that if two log entries write to the same page, their logical clocks are consistent with the physical order in which the writes happened. Similarly, log entries written by the same transaction always have monotonically increasing logical clocks, because the transaction also synchronizes with itself every time a new log entry is generated. The logical clock is called a “Global Sequence Number” (GSN) in the paper.
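
The GSN update rule can be written down almost verbatim from the description above; the struct names are illustrative.

```cpp
#include <algorithm>
#include <cstdint>

struct PageState { uint64_t gsn = 0; };
struct TxnState  { uint64_t gsn = 0; };

// Write access (or ownership change): both clocks jump past each other,
// so later entries on the page and later entries of the transaction both
// receive strictly larger GSNs.
void SyncOnWrite(TxnState& t, PageState& p) {
    uint64_t g = std::max(t.gsn, p.gsn) + 1;
    t.gsn = g;
    p.gsn = g;
}

// Read-only access: the clocks only synchronize to the maximum.
void SyncOnRead(TxnState& t, const PageState& p) {
    t.gsn = std::max(t.gsn, p.gsn);
}

// Each log entry is stamped with the transaction's current GSN.
uint64_t StampEntry(const TxnState& t) { return t.gsn; }
```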

Transaction-oriented undo is trivial: the recovery manager just traverses the per-transaction log buffer and restores the previous states in reverse chronological order. Redo is more challenging, because log entries writing to the same page must be replayed in the same order as they were generated by transactions. Luckily, the GSNs of log records describe exactly this logical ordering. During recovery, log records are globally sorted by GSN into buckets partitioned by page ID. The process works as follows. First, a target GSN is computed as the end point of the current iteration. The recovery manager scans the transaction-local log buffers, and for each log entry, hashes its page ID into one of the buckets. The log entry is inserted at the tail of the bucket, ensuring that entries in each bucket remain sorted by GSN. Note that since log entries within each log buffer are already ordered by GSN, this sorting is similar to the multi-way merge in a merge sort algorithm (see the sketch below). Then, in the second stage, buckets are assigned to recovery threads, and log entries are replayed in the order they were inserted into the buckets.
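
A hedged sketch of the merge step, using a heap-based multi-way merge as the merge-sort analogy suggests; the bucket layout and modulo hashing are assumptions:

```cpp
#include <cstdint>
#include <queue>
#include <vector>

struct RedoEntry { uint64_t gsn, page_id; };

// Merge GSN-sorted per-transaction logs, smallest GSN first, into
// page-ID buckets, so every bucket ends up GSN-ordered.
std::vector<std::vector<RedoEntry>> PartitionForRedo(
        const std::vector<std::vector<RedoEntry>>& txn_logs,
        size_t num_buckets) {
    using Cursor = std::pair<uint64_t, size_t>;  // (next GSN, source log)
    auto cmp = [](const Cursor& a, const Cursor& b) { return a.first > b.first; };
    std::priority_queue<Cursor, std::vector<Cursor>, decltype(cmp)> heap(cmp);

    std::vector<size_t> pos(txn_logs.size(), 0);  // scan position per log
    for (size_t i = 0; i < txn_logs.size(); ++i)
        if (!txn_logs[i].empty()) heap.push({txn_logs[i][0].gsn, i});

    std::vector<std::vector<RedoEntry>> buckets(num_buckets);
    while (!heap.empty()) {
        size_t i = heap.top().second;  // log holding the smallest next GSN
        heap.pop();
        const RedoEntry& e = txn_logs[i][pos[i]++];
        buckets[e.page_id % num_buckets].push_back(e);  // tail insert stays sorted
        if (pos[i] < txn_logs[i].size())
            heap.push({txn_logs[i][pos[i]].gsn, i});
    }
    // Each bucket can now be replayed by an independent recovery thread.
    return buckets;
}
```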

In a disk-based log buffer implementation, when a transaction commits, log entries whose LSNs are smaller than or equal to the transaction’s last LSN must be forced to disk, as required by WAL. With distributed logging, this becomes more challenging, because log entries below a certain GSN can now be scattered across multiple log objects. Even worse, with NVM technology, the NVM is the new disk, while the processor cache becomes the new buffer pool. Log records written by different transactions may be cached by different processors, and their write-back is largely uncontrollable by the application. The paper suggests two solutions. The first solution uses hardware primitives such as pcommit, memory fences, and forced cache line write back (together forming an “epoch barrier”). By issuing an epoch barrier before commit, it is guaranteed that all dirty log entries are written back to the NVM. The second solution employs a special caching policy called “write combining” (WC). Memory regions marked as WC are not cached by the processor. Instead, write operations are buffered in a WC queue, to which some optimizations may apply for common access patterns, while read operations access memory directly. When log entries must be flushed, the application issues an mfence instruction, which stalls the processor until the WC queue is drained. Since log buffers are write-mostly and not read during normal processing, using WC memory is expected to have minimal performance impact on logging.
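
A rough approximation of the two options using today's x86 intrinsics; the paper predates these exact instructions (and pcommit was later dropped from the ISA), so clwb + sfence merely stands in for the epoch barrier here.

```cpp
#include <immintrin.h>  // _mm_clwb, _mm_sfence, _mm_mfence (compile with -mclwb)
#include <cstddef>
#include <cstdint>

// Option 1: "epoch barrier": force every cache line of the log region
// toward NVM, then fence so commit cannot proceed before the write-backs.
void epoch_barrier(const void* buf, size_t len) {
    const size_t kLine = 64;
    uintptr_t p = reinterpret_cast<uintptr_t>(buf) & ~(kLine - 1);
    for (; p < reinterpret_cast<uintptr_t>(buf) + len; p += kLine)
        _mm_clwb(reinterpret_cast<void*>(p));  // write back one cache line
    _mm_sfence();
}

// Option 2: with the log buffer mapped write-combining (WC), no lines are
// cached; a single mfence stalls until the local WC queue is drained.
void wc_flush() { _mm_mfence(); }
```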

The problem of flushing log entries is only partially solved by WC caching. When a processor executes an mfence instruction, it only stalls until its local WC queue is drained; it does not wait for remote processors’ WC queues, which may hold log records whose GSNs are smaller than or equal to that of the committing transaction’s most recent entry. Given that draining a WC queue typically takes much less time than disk I/O in classical ARIES, the paper proposes a group commit mechanism called “passive group commit”. In passive group commit, transactions that have finished execution first flush the local WC queue, and then wait in a commit queue. Each processor maintains two local variables. One is dgsn (dirty GSN), which is updated to the maximum GSN on the local processor when an mfence returns. The other is a dirty bit, which is set when new log entries are written and cleared when mfence returns. To commit transactions, a daemon thread periodically scans the dgsn variables of all processors and computes min_dgsn, the minimum of all dgsns. Transactions whose last GSN is smaller than or equal to min_dgsn can be removed from the pending queue and committed. The dirty bit serves to detect whether any transaction has written any log entry between two scans: if no log entries have been generated on a processor since the last scan, its dgsn is ignored in the next scan, because that processor cannot be holding any unflushed log entries.
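
A sketch of this bookkeeping, where the dgsn/dirty-bit semantics follow the text and everything else (atomics, function structure) is an illustrative choice:

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <vector>

struct CpuState {
    std::atomic<uint64_t> dgsn{0};   // max GSN drained by the last mfence
    std::atomic<bool> dirty{false};  // entries written since the last scan?
};

// Worker side: mark the processor dirty on every log append ...
void OnLogAppend(CpuState& cpu) {
    cpu.dirty.store(true, std::memory_order_release);
}

// ... and publish the durable GSN when mfence returns (WC queue drained).
void OnMfenceReturn(CpuState& cpu, uint64_t max_local_gsn) {
    cpu.dgsn.store(max_local_gsn, std::memory_order_release);
    cpu.dirty.store(false, std::memory_order_release);
}

// Daemon side: a processor that has written nothing since the last scan
// cannot be holding unflushed entries, so its dgsn is ignored.
uint64_t ComputeMinDgsn(const std::vector<CpuState>& cpus) {
    uint64_t min_dgsn = UINT64_MAX;
    for (const auto& c : cpus) {
        if (!c.dirty.load(std::memory_order_acquire)) continue;
        min_dgsn = std::min(min_dgsn,
                            c.dgsn.load(std::memory_order_acquire));
    }
    return min_dgsn;
}

// A pending transaction commits once its last GSN is covered.
bool CanCommit(uint64_t txn_last_gsn, uint64_t min_dgsn) {
    return txn_last_gsn <= min_dgsn;
}
```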

In the future, the persistence domain is expected to expand from what it is today (the NVM and its store buffers) to include the processor caches. This can be realized by backing the processor cache hierarchy with a battery or supercapacitor, allowing the cache to be drained to lower-level storage on a power failure. With this technology at hand, the transaction commit protocol can be further simplified, because memory fences alone are sufficient to guarantee both write ordering and persistence of data the processor has just written. Passive group commit is no longer required, because once a memory fence instruction returns, the processor guarantees global persistence of all of the transaction’s log entries, which is sufficient to ensure durability according to WAL.
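
Under that assumption, commit collapses to a store plus a fence; this fragment is purely illustrative:

```cpp
#include <immintrin.h>  // _mm_sfence
#include <cstdint>

// With a battery/supercap-backed cache hierarchy, ordinary stores already
// land in the persistence domain; the fence provides ordering only, and
// no cache line flush instructions are needed.
void CommitWithPersistentCache(uint64_t* log_tail, uint64_t commit_record) {
    *log_tail = commit_record;  // ordinary store into persistent cache
    _mm_sfence();               // once this returns, the record is durable
}
```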