NVWAL: Exploiting NVRAM in Write-Ahead Logging



Paper Title: NVWAL: Exploiting NVRAM in Write-Ahead Logging
Link: https://dl.acm.org/citation.cfm?id=2872392
Year: ASPLOS 2016
Keyword: NVM; WAL; SQLite; NVWAL; Database




This paper proposes NVWAL, a technique for improving logging performance in SQLite, a widely adopted DBMS on mobile platforms. The current logging mechanism in SQLite stores block-level journals in the file system. The journal is written to persistent storage before data items are updated. A transaction is logically committed by appending a special commit record to the journal after all log entries are persistent and before any data item is updated. This way, the transaction can always be replayed after a crash by reapplying all the log entries in the journal. The journal of a transaction can be truncated once all dirty pages affected by the transaction have been written back to persistent storage or overwritten by a later transaction (note that since logging is performed at page granularity, overwrites can also be detected at page level).
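To make the write ordering concrete, here is a minimal sketch of this commit protocol against an ordinary journal file. The names (commit_transaction, journal_fd, COMMIT_MARK) are hypothetical illustrations, not SQLite's actual code:

```c
#include <stdint.h>
#include <unistd.h>

#define COMMIT_MARK 0xC0FFEE42C0FFEE42ULL  /* assumed 8-byte marker */

/* Append all log entries, make them durable, then append and persist
 * the commit record. Replay after a crash reapplies entries up to the
 * last durable commit mark, so a torn transaction is simply ignored. */
int commit_transaction(int journal_fd, const void **entries,
                       const size_t *sizes, int n)
{
    for (int i = 0; i < n; i++)
        if (write(journal_fd, entries[i], sizes[i]) < 0)
            return -1;
    if (fsync(journal_fd) < 0)       /* log entries durable first */
        return -1;
    uint64_t mark = COMMIT_MARK;
    if (write(journal_fd, &mark, sizeof mark) < 0)
        return -1;
    return fsync(journal_fd);        /* commit record durable last */
}
```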

This approach to logging offers a significant performance improvement over the older undo logging mechanism, which requires writing back undo log entries as soon as they are generated to avoid violating write ordering. It is, however, still not optimal, for two reasons. First, the logging granularity is large compared with the usual granularity of updates. The paper identifies that SQLite features frequent small updates, mainly to its B+Tree structure. Much of the I/O bandwidth for writing the log is wasted because only a small fraction of each logged page is actually useful. The second source of inefficiency is write amplification introduced by the file system. In order to maintain the consistency of metadata and data, the file system itself also journals the journal file's data and affected metadata. This “journaling of the journal” problem introduces unnecessary overhead, since the journaling mechanism requires only very weak semantic guarantees. For example, POSIX mandates that the file system provide the abstraction of atomic writes, while the journal accepts the fact that any log flush can be interrupted by a failure, as long as single 8-byte writes (i.e., the commit mark) are atomic. The mismatch between the capability of the file system and the actual requirement of the journaling mechanism amplifies writes, wasting even more bandwidth.

NVWAL solves both problems by moving the log object to NVM. Instead of creating a separate log file and relying on the file system to maintain metadata, NVWAL simply creates a persistent object using the mmap() system call, requesting a chunk of persistent storage mapped to NVM addresses. The NVM region mapped into the process’s address space is then divided into blocks, which are the basic unit of allocation and deallocation. To save the application from a time-consuming system call every time it needs a chunk of memory, NVWAL also maintains the allocator metadata itself, which permits both efficient storage management and correct recovery after a failure.
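A sketch of how such a persistent heap might be obtained, assuming a DAX-style NVM file system; the path handling, heap size, and block size below are illustrative choices, not values from the paper:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define NVM_HEAP_SIZE  (64UL * 1024 * 1024)  /* illustrative heap size */
#define NVM_BLOCK_SIZE 4096UL                /* allocation unit */

static void *nvm_base;

/* Map one file on an NVM-backed (e.g. DAX) file system; afterwards,
 * blocks are handed out in user space with no further system calls. */
int nvm_heap_open(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, NVM_HEAP_SIZE) < 0) {
        close(fd);
        return -1;
    }
    /* MAP_SHARED so stores reach the NVM-backed file. The base address
     * may differ across runs, which is why next pointers in log frames
     * are encoded as offsets rather than raw pointers. */
    nvm_base = mmap(NULL, NVM_HEAP_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    close(fd);
    return nvm_base == MAP_FAILED ? -1 : 0;
}

/* Translate a block index into an address within the mapping. */
static void *nvm_block(size_t idx)
{
    return (char *)nvm_base + idx * NVM_BLOCK_SIZE;
}
```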

We describe the logging process as follows. The basic unit of logging is called a “log frame”, which consists of a header recording the page number, the offset, the size of the payload, and a pointer to the next log frame, which can be in the same block or another block. The next pointer is encoded in a format independent of the base address of the mmap’ed region; otherwise, the next time the region is mapped to a potentially different address, the pointer would point to invalid data. The payload follows the log frame header, with its size determined by the size field in the header. The header also contains a single bit, padded to 8 bytes, serving as the commit mark. The last log frame of a transaction has its commit mark set and flushed to NVM after all other log frames are written.

Before a log block can be written, it must be allocated by the user-space allocator. The allocator maintains its internal state in DRAM, which is lost on a crash. To reclaim storage used by logs precisely, we make the following observation: the journaling mechanism does not care whether a log block has been written back before the persistence barrier is issued. If a block is invalid or corrupted (partially written) on NVM, we simply discard its content and abort the transaction. The allocator therefore reserves a status word in the block header to record the block’s allocation state. The status word can be in one of three states: Free, meaning the block is not allocated; Pending, meaning the block has been allocated but not yet referenced by another block; and In-use, meaning the block contains valid data and must not be freed. Initially, all status words are Free. After the allocator returns a block, its status is set to Pending; data is then written into the block before it is linked into the log chain. Once linked into the log chain, the block is set to In-use, which makes it logically part of the journal.

During recovery, the recovery handler scans all blocks in the NVM region and rebuilds the in-memory allocation state by checking the status words in block headers. Free and Pending blocks are reclaimed and added to the free list, since they are likely not referenced by any valid block; In-use blocks are left untouched. This simple scheme accepts that some Pending blocks that actually contain valid data and are referenced by other valid blocks may be reclaimed incorrectly. Such misbehavior of the memory allocator is tolerated as long as the status word is always checked before accessing a log block after a crash (actually, I think the post-crash process can first truncate the log to discard blocks that are in the log chain but were reclaimed by the GC). Deallocation works in the reverse manner: to deallocate a block, we first update its status word to Free and then call the allocator to reclaim its storage. This way, the block is recognized by the recovery handler and reclaimed correctly.
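The following sketch puts the block layout and allocation protocol together. The struct fields, the persist() stand-in, and the function names are my assumptions for illustration; the paper specifies only the header contents, the base-independent next pointer, and the three-state status word:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum blk_status { BLK_FREE = 0, BLK_PENDING = 1, BLK_IN_USE = 2 };

struct nvm_block {                /* one allocation unit on NVM */
    uint64_t status;              /* Free / Pending / In-use */
    uint8_t  data[4088];          /* log frames live here */
};

struct log_frame {                /* header of one log entry */
    uint32_t page_no;             /* database page being updated */
    uint32_t offset;              /* byte offset within the page */
    uint32_t size;                /* payload size */
    uint32_t pad;
    uint64_t next_off;            /* next frame, as an offset from the
                                   * heap base (valid across remaps) */
    uint64_t commit_mark;         /* 8-byte-padded commit bit */
    /* payload of `size` bytes follows */
};

/* Stand-in for a real persistence primitive (clwb + sfence on x86). */
static void persist(const void *p, size_t n)
{
    (void)p; (void)n;
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
}

/* Allocate-and-link: Free -> Pending -> write data -> link -> In-use.
 * A crash before the In-use flip leaves a Pending block that recovery
 * reclaims, at worst over-collecting a block that was already linked. */
void publish_block(struct nvm_block *blk, uint64_t blk_off,
                   struct log_frame *chain_tail,
                   const void *frames, size_t len)
{
    blk->status = BLK_PENDING;
    persist(&blk->status, sizeof blk->status);

    memcpy(blk->data, frames, len);          /* fill the block */
    persist(blk->data, len);

    chain_tail->next_off = blk_off;          /* link into the chain */
    persist(&chain_tail->next_off, sizeof chain_tail->next_off);

    blk->status = BLK_IN_USE;                /* now logically in the log */
    persist(&blk->status, sizeof blk->status);
}

/* Recovery scan: rebuild the DRAM free list from the status words.
 * Free and Pending blocks are reclaimed; In-use blocks are kept. */
void rebuild_free_list(struct nvm_block *heap, size_t nblocks,
                       void (*add_free)(size_t idx))
{
    for (size_t i = 0; i < nblocks; i++)
        if (heap[i].status != BLK_IN_USE)
            add_free(i);
}
```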

In short: the allocator exists to avoid memory leaks after a crash. It does not guarantee that every block currently in use survives reclamation; in fact, some Pending blocks may already be referenced by other blocks, but we must still reclaim all Pending blocks to avoid leaks. In a special-purpose application like this, it is acceptable for the post-crash GC to over-collect, as long as the status word is checked every time before use and log blocks of committed transactions are never reclaimed (they will not be, since the write ordering between the commit mark and the status word dictates that if the commit mark is written, the status word must already have been updated).

The logging process works as follows. When a transaction intends to update the database, it generates a log frame that describes the update in the header, with the after-image of the update in the payload. Updates to the database are also reflected in the DRAM copy to allow more efficient reads, because otherwise every reader would first have to check the log for newer updates (I think the SQLite website says the opposite, i.e., readers must check the journal for newer updates; the paper, nevertheless, does not mention this). When the transaction is to be committed, the log manager sets the “commit mark” at the head of the last log block and then flushes that bit to logically commit the transaction. The write ordering between the log block writes and the commit mark write is enforced by executing a persistence barrier after all log frames are generated but before the commit mark is written. To save I/O bandwidth, NVWAL logs the entire page under modification only if no earlier frame for that page exists in the log chain. Before a log frame is generated, the log manager walks the log chain to check whether the same page has already been logged. If so, the log frame is encoded in differential form, i.e., only the parts modified since the last log frame are recorded. This approach balances logging bandwidth against read complexity. By always logging the entire page as the first entry for that page in the log chain, a read can reconstruct the most up-to-date data by a log traversal whenever the page is actually in the log (which can be detected quickly using a Bloom filter, etc.). Otherwise, the reader would also have to consult data outside the log for byte ranges that were never logged, complicating reads.
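A sketch of the commit step under this strict ordering, reusing the hypothetical log_frame layout and persist() helper from the earlier sketch:

```c
#include <stddef.h>
#include <stdint.h>

struct log_frame {                 /* same layout as the earlier sketch */
    uint32_t page_no, offset, size, pad;
    uint64_t next_off, commit_mark;
};

/* Stand-in for clwb + sfence, as in the earlier sketches. */
static void persist(const void *p, size_t n)
{
    (void)p; (void)n;
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
}

/* Strict-ordering commit: persist every frame of the transaction, then
 * set and persist the 8-byte commit mark on the last frame. persist()
 * ends with a fence, which is the persistence barrier that orders the
 * frame writes before the mark. */
void nvwal_commit(struct log_frame **frames, const size_t *lens, int n)
{
    for (int i = 0; i < n; i++)
        persist(frames[i], lens[i]);      /* all frames durable... */
    frames[n - 1]->commit_mark = 1;       /* ...before the mark */
    persist(&frames[n - 1]->commit_mark, sizeof(uint64_t));
}
```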

Periodically, SQLite executes a “checkpointing” process to merge the log content back into the persistent database file. The checkpointing process simply copies the content of the log frames into the database in the order in which they were generated; uncommitted transactions are skipped. The same process is also executed during recovery: the recovery manager first performs block GC and truncates the log to remove garbage-collected blocks, and then runs the checkpointing process to reapply all committed transactions to the database file on persistent storage.
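A sketch of the checkpointing loop, again with hypothetical helpers (frame_at(), a heap base pointer) and the assumption that the chain has already been truncated to committed frames:

```c
#include <stdint.h>
#include <unistd.h>

#define DB_PAGE_SIZE 4096            /* assumed database page size */

struct log_frame {                   /* same layout as the earlier sketch */
    uint32_t page_no, offset, size, pad;
    uint64_t next_off, commit_mark;
    /* payload of `size` bytes follows */
};

extern char *nvm_base;               /* base of the mapped NVM heap */

static struct log_frame *frame_at(uint64_t off)
{
    return off ? (struct log_frame *)(nvm_base + off) : 0;
}

/* Walk the chain in generation order and apply each frame's payload to
 * its page in the database file; fsync once at the end. */
int checkpoint(int db_fd, uint64_t head_off)
{
    for (struct log_frame *f = frame_at(head_off); f;
         f = frame_at(f->next_off)) {
        off_t pos = (off_t)f->page_no * DB_PAGE_SIZE + f->offset;
        if (pwrite(db_fd, f + 1, f->size, pos) < 0)
            return -1;
    }
    return fsync(db_fd);
}
```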

The paper also proposes relaxing the write ordering between the log frames and the final commit mark, trading strictness for fewer stalls and better performance. By not flushing back the cache lines of log frames, we risk the possibility that a log frame is not written back, or is corrupted, while the commit mark is set, making the transaction unrecoverable. The paper suggests adding a checksum field to the final log frame of the transaction, flushed atomically with the commit mark in the same cache line. During recovery, if the checksum indicates a corrupted log frame, the transaction is aborted.
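A sketch of this relaxed commit, assuming a 64-byte commit record that co-locates the mark and the checksum in one cache line; the FNV-1a checksum here is an illustrative stand-in for whatever checksum NVWAL actually uses:

```c
#include <stddef.h>
#include <stdint.h>

/* Commit record padded to a 64-byte cache line so the mark and the
 * checksum become durable together in one atomic flush. */
struct commit_record {
    uint64_t commit_mark;
    uint64_t checksum;               /* over the transaction's frames */
    uint8_t  pad[48];
} __attribute__((aligned(64)));

/* Stand-in for clwb + sfence, as in the earlier sketches. */
static void persist(const void *p, size_t n)
{
    (void)p; (void)n;
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
}

static uint64_t fnv1a(const void *p, size_t n)   /* simple checksum */
{
    const uint8_t *b = p;
    uint64_t h = 1469598103934665603ULL;
    while (n--) { h ^= *b++; h *= 1099511628211ULL; }
    return h;
}

/* No barrier between the frame writes and the commit record: recovery
 * recomputes the checksum and aborts the transaction on a mismatch. */
void relaxed_commit(struct commit_record *rec,
                    const void *frames, size_t len)
{
    rec->checksum = fnv1a(frames, len);
    rec->commit_mark = 1;
    persist(rec, sizeof *rec);       /* one cache-line flush */
}
```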