Fast Databases with Fast Durability and Recovery Through Multicore Parallelism



Paper Title: Fast Databases with Fast Durability and Recovery Through Multicore Parallelism
Link: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zheng_wenting
Year: OSDI 2014
Keyword: Database Recovery; Logging; Multicore




This paper presents SiloR, a fast in-memory database system with support for persistence. SiloR is built upon Silo, an in-memory engine for transaction processing. Silo uses single-version Optimistic Concurrency Control (OCC) for conflict detection. Every tuple in Silo has a version timestamp consisting of three subfields: the epoch, the commit timestamp (ct), and status bits. The first two subfields (epoch and ct) equal the commit timestamp of the last write transaction that updated the tuple. A lock bit in the status field allows a tuple to be locked by a committing transaction, granting that transaction exclusive access until the commit succeeds. Silo differs from classical backward OCC in that it does not rely on a centralized counter to establish serialization order. Instead, Silo serializes transactions in an entirely distributed, data-driven manner, which avoids the scalability problems of a centralized timestamp counter.

Silo keeps track of the tuples read and written during a transaction using a read set (RS) and a write set (WS). At commit time, the transaction manager first locks every tuple in the write set. Lock acquisition follows a global total order to avoid deadlock. The committing transaction then reads a global epoch counter, which is advanced only by a background thread (every 40 ms). The value of the global epoch counter is the epoch in which the transaction commits. Next, the transaction validates its reads by comparing the versions of tuples in the read set against the current tuple versions in the data tables. If the two versions differ, or if a tuple is currently locked by another committing transaction, indicating that another transaction committed (or will commit) to the tuple after the current transaction read it, the current transaction must abort. Although Silo does not use a centralized counter to order transactions, the concurrency control protocol is still commit-time ordered, which means a transaction must ensure that all tuples it accessed remain unchanged until commit time.

If validation succeeds, the transaction determines its ct by the following rules. First, the ct must be larger than the ct of any tuple in its RS and WS. This guarantees that a tuple's timestamp is monotonically increasing. (Personally, I don't think ordering the transaction against its read set makes sense, because OCC is, by nature, commit-time ordered; timestamp assignment does not change the serialization guarantee of the protocol.) Second, the ct must be larger than the previous ct that the same worker thread assigned to a transaction. This guarantees that a single thread always sees monotonically increasing cts. Third, if a transaction commits in epoch e, the final version timestamp must also use e. Based on these rules, the committing transaction computes the maximum ct over data items in its RS and WS, compares it with its last ct, picks the larger of the two, and forms the version timestamp from this value and e. The last step of transaction commit is to update the data items and version timestamps. The transaction first updates tuples using the values in its WS, after which the version timestamps are updated by writing the new values into the tuple headers. Versions are unlocked only after both data and metadata are updated.
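To make the commit protocol concrete, here is a minimal single-process sketch in Python, assuming a toy `Tuple` record whose `lock` field stands in for the lock status bit; the names `Tuple` and `commit` and the dict-based read/write sets are illustrative, not Silo's actual API.

```python
import threading

GLOBAL_EPOCH = 1  # advanced roughly every 40 ms by a background thread (not shown)

class Tuple:
    """A toy record; `lock` stands in for the lock bit in the status field."""
    def __init__(self, value):
        self.value = value
        self.epoch = 0   # epoch subfield of the version timestamp
        self.ct = 0      # commit-timestamp subfield
        self.lock = threading.Lock()

    def version(self):
        return (self.epoch, self.ct)

def commit(read_set, write_set, last_ct):
    """read_set: {Tuple: version observed at read time};
    write_set: {Tuple: new value}; last_ct: this worker's previous ct.
    Returns the new (epoch, ct) on success, or None on abort."""
    # 1. Lock the write set in a global total order (here: by object id)
    #    so that concurrent committers cannot deadlock.
    ordered = sorted(write_set, key=id)
    for t in ordered:
        t.lock.acquire()
    try:
        # 2. Read the global epoch; this is the epoch the transaction commits in.
        e = GLOBAL_EPOCH
        # 3. Validate reads: each tuple must be unchanged since it was read and
        #    must not be locked by another committing transaction.
        for t, seen in read_set.items():
            if t.version() != seen:
                return None  # abort: someone committed after we read
            if t not in write_set and t.lock.locked():
                return None  # abort: locked by a concurrent committer
        # 4. Pick ct: strictly larger than any ct in RS/WS and the worker's last ct.
        ct = 1 + max([last_ct]
                     + [t.ct for t in read_set]
                     + [t.ct for t in write_set])
        # 5. Install new values first, then the new version timestamps.
        for t, v in write_set.items():
            t.value = v
        for t in write_set:
            t.epoch, t.ct = e, ct
        return (e, ct)
    finally:
        # 6. Unlock only after both data and metadata are updated.
        for t in ordered:
            t.lock.release()
```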

Silo commits transactions via group commit, in units of consecutive epochs. The global epoch counter E denotes the current epoch. A transaction is only committed after its epoch ends and the log entries for the transaction (as well as for all other transactions in that epoch) become durable. The worker thread suspends a transaction that has finished execution but has not yet been made persistent. The application is only notified of a transaction's commit after the epoch containing the transaction ends and everything has been persisted to disk.
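A small sketch of this gating logic, assuming a per-epoch queue of completion callbacks that is drained once the durable epoch (the pepoch described below) covers it; `pending`, `finish_transaction`, and `on_pepoch_advance` are illustrative names, not SiloR's interface.

```python
from collections import defaultdict

pending = defaultdict(list)   # epoch -> completion callbacks awaiting durability

def finish_transaction(epoch, notify_client):
    # The transaction has executed and its log entries are queued, but they
    # are not yet durable: defer the client notification instead of
    # reporting commit now.
    pending[epoch].append(notify_client)

def on_pepoch_advance(pepoch):
    # All epochs <= pepoch are durable; release their transactions.
    for e in sorted(k for k in pending if k <= pepoch):
        for notify in pending.pop(e):
            notify()
```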

SiloR assigns one logger thread per disk to maximize throughput. Each logger thread serves several worker threads pinned to the same socket, avoiding remote memory reads and writes. The system has a central buffer pool; buffer objects allocated from the pool hold log entries generated by worker threads and are passed between workers and loggers. During transaction execution within an epoch, worker threads write redo log entries into a buffer object. Each redo entry contains at least the version number and the redo after-image of the tuple. Once the buffer object is full, the worker thread passes it to one of its assigned logger threads. Note that since the maximum number of log entries in memory is bounded by the size of the buffer pool fixed at initialization, this in fact constitutes a feedback mechanism: loggers impose backpressure on workers if transaction throughput exceeds logging throughput, so workers slow down rather than overcommit the system with log entries. In practice, to reduce unnecessary worker stalls caused by failing to claim an empty buffer, the logger thread releases a buffer back to the pool immediately after the write system call returns, instead of waiting for the write to be persisted.
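The buffer handoff can be sketched as follows, assuming a fixed-size pool of byte buffers and two queues for the exchange; the blocking `get()` on an empty pool is exactly what gives loggers backpressure over workers. Pool and buffer sizes here are illustrative.

```python
import queue

POOL_SIZE = 8           # fixed at initialization: bounds in-memory log entries
BUF_CAPACITY = 4096     # illustrative buffer size

free_buffers = queue.Queue()   # empty buffers owned by the central pool
full_buffers = queue.Queue()   # filled buffers in flight to a logger
for _ in range(POOL_SIZE):
    free_buffers.put(bytearray())

class Worker:
    def __init__(self):
        self.buf = free_buffers.get()

    def log(self, entry: bytes):
        self.buf.extend(entry)
        if len(self.buf) >= BUF_CAPACITY:
            full_buffers.put(self.buf)       # hand the full buffer to a logger
            self.buf = free_buffers.get()    # blocks while the pool is empty:
                                             # loggers backpressure workers

def logger_loop(fileobj):
    while True:
        buf = full_buffers.get()
        fileobj.write(buf)          # once write() returns (not after fsync),
        buf.clear()                 # the buffer is released back to the pool
        free_buffers.put(buf)       # to reduce worker stalls
```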

To drive the logging process forward, SiloR worker threads refresh their local copy of the global epoch when a transaction commits (if the thread is busy) or periodically using a timer (if the thread is idle). A worker thread whose local epoch is e will not generate any log records with epochs less than e, which is useful for determining the range of log entries that have already been persisted. Note that there is no guarantee that log entries of transactions committed with a ct below the local epoch have been persisted, even if all worker threads have local epochs greater than some e'. This is because some log entries may not yet have been processed by the logger thread and will remain in volatile buffers for a while. Logger threads hence must consider both the contents of their buffers and the local epochs of the worker threads assigned to them. The process is as follows. First, every logger thread scans its assigned worker threads and computes the minimum of their local epochs. Next, the logger thread scans the buffer objects that have not yet been persisted and computes the minimum epoch over their log entries. These buffers are written to disk as soon as possible; in the meantime, the logger waits for more buffer objects to arrive and repeats this step every time a new buffer object is received. The minimum log entry epoch is only computed after all entries in a buffer object have been persisted to disk. In the third step, the logger thread takes the smaller of the values from steps one and two and advertises it to a special logger thread. The special logger thread computes the minimum over all logger threads and writes this value minus one to a separate file on disk as pepoch. Pepoch indicates the maximum epoch that is known to be durable. The value of pepoch is also broadcast to all worker threads. Transactions whose commit timestamp is in pepoch or an earlier epoch can be safely committed.
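The two-level minimum computation condenses to a few lines, assuming each logger knows its workers' local epochs and the epochs of its still-unpersisted buffered entries; the function names are illustrative.

```python
def logger_publish(worker_epochs, unpersisted_entry_epochs):
    """Each logger advertises the smallest epoch it cannot yet vouch for."""
    min_worker = min(worker_epochs)                                   # step 1
    min_buffered = min(unpersisted_entry_epochs, default=min_worker)  # step 2
    return min(min_worker, min_buffered)                              # step 3

def compute_pepoch(advertised_values):
    # The special logger takes the global minimum, minus one: every epoch
    # <= pepoch is fully durable on every logger's disk.
    return min(advertised_values) - 1
```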

It is worth noting that log entries are not sorted before they are written to disk. It is therefore possible, and in fact the norm, that log entries from different threads and epochs are mixed together. Although the paper suggests that a worker thread should flush its log buffer to the logger thread when the global epoch is advanced, this is merely an optimization to ensure progress; even with this mechanism in place, log entries from the same epoch are still mixed together.

To reduce log storage pressure and to accelerate log restoration during recovery, SiloR periodically takes checkpoints of the entire database and writes them to disk as a starting point for recovery. The checkpointing process is as follows. First, a central coordinator takes the current global epoch as the starting epoch el of the checkpoint. Then n checkpointing threads are started, each responsible for 1/n of each table in the database. These checkpointing threads read tuples after locking them, which guarantees the consistency of a single tuple, but not consistency across tuples. This scheme works well with the concurrency control protocol: if a transaction is about to commit to a tuple that is currently being read, it will detect the lock bit and abort, so no partial update is observable. It is possible, however, that the checkpoint as a whole is inconsistent (a fuzzy checkpoint) because of concurrent transaction commits. Taking fuzzy checkpoints does not harm SiloR's recoverability, since SiloR uses physical redo logging, which can always restore the database to a consistent state by applying redo images; it may redo some tuples unnecessarily, but it never gives up correctness. After the checkpoint is finished on all threads, the central coordinator notes down the current epoch, eh, as the ending epoch. The checkpoint is not considered complete until the global persisted epoch, pepoch (see above), has reached eh, after which the coordinator writes el and eh to the checkpoint and marks it as valid. After a checkpoint is established, all older checkpoints and all log entries before el can be deleted, because they will no longer be used for recovery.

As an optimization, the checkpointing threads may ignore tuples with a timestamp greater than or equal to el, since during recovery these tuples will be restored from the log anyway. This optimization greatly reduces both the size of a checkpoint and the time taken to produce it, which largely eliminates the need for incremental checkpointing.
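Here is a sketch of one checkpointing thread's pass over its table partition, reusing the toy `Tuple` layout from the commit sketch above; the skip of tuples with epoch >= el implements the optimization just described, and `write_record` is a hypothetical output helper.

```python
def checkpoint_partition(partition, el, out):
    """partition: {key: Tuple}; el: the checkpoint's starting epoch;
    out: any sink with a write_record() method (hypothetical)."""
    for key, t in partition.items():
        with t.lock:                                   # consistent view of one
            epoch, ct, value = t.epoch, t.ct, t.value  # tuple; no cross-tuple
                                                       # consistency is promised
        if epoch >= el:
            continue  # optimization: recovery restores these from the log anyway
        out.write_record(key, epoch, ct, value)
```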

During recovery, a dedicated thread first reads the checkpoint into memory and rebuilds tables, indices, etc. Recovery threads then replay the log up to the most recent persistent epoch. The recovery threads first read pepoch from disk, which represents the last persistent epoch. Then they read log files in reverse creation order, i.e., files created later are read before files created earlier. This is because SiloR adopts redo logging and hence only cares about the most up-to-date version of a tuple; if logs are replayed in the reverse of their creation order, there is a large chance that most locations are written only a few times, which reduces memory traffic. Recovery threads then scan the log entries in each file. If an entry has a timestamp larger than pepoch, it is ignored, because it represents an uncommitted update, which must be discarded because users were never notified of the commit. Log entries before el are ignored because they are already included in the checkpoint image. If the timestamp of a log entry is smaller than the current timestamp in the table, it is also ignored, because the current value in the table is a more recent value written by some transaction. A tuple is updated from a log entry only if the tuple's timestamp is smaller than the log entry's timestamp.
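The replay rule reduces to a couple of comparisons, as in the sketch below, which assumes each log record carries (key, epoch, ct, value) and that files are visited newest-first; the record and table shapes are illustrative.

```python
from collections import namedtuple

Record = namedtuple("Record", "epoch ct value")

def replay(files_newest_first, table, pepoch, el):
    """files_newest_first: iterables of (key, epoch, ct, value) log entries;
    table: {key: Record} being rebuilt on top of the checkpoint image."""
    for f in files_newest_first:
        for key, epoch, ct, value in f:
            if epoch > pepoch:
                continue  # never acknowledged to the user: discard
            if epoch < el:
                continue  # already captured by the checkpoint image
            cur = table.get(key)
            # Install only if the log entry is newer than what the table holds.
            if cur is None or (epoch, ct) > (cur.epoch, cur.ct):
                table[key] = Record(epoch, ct, value)
```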