Distributed Logless Atomic Durability with Persistent Memory



Paper Title: Distributed Logless Atomic Durability with Persistent Memory
Link: https://dl.acm.org/citation.cfm?doid=3352460.3358321
Year: MICRO 2019
Keyword: NVM; LAD; Memory Controller




Highlights:

  1. Elegant application of the 2PC protocol to multi-agent atomic commit.

Questions

  1. The L2 and lower levels are basically write-through, which incurs very high latency for coherence messages and write-backs.

This paper proposes Logless Atomic Durability (LAD), a hardware framework for supporting failure-atomic transactions in which no logging is involved most of the time. The paper identifies the problems of software logging as write amplification and excessive write orderings, which require persist barriers to be issued frequently, degrading performance. The paper also points out that hardware schemes, such as Kiln, make unrealistic assumptions that render their proposals less attractive. For example, Kiln assumes that the LLC is manufactured using STT-RAM and that the LLC can atomically commit a transaction by performing a battery-backed cache tag walk. Neither of these two assumptions is realistic today. The LLC is still manufactured using SRAM, and is quite unlikely to be replaced by STT-RAM in the near future. Furthermore, Non-Uniform Cache Access (NUCA), which is common on server processors, partitions the LLC into several slices, each maintained by a separate controller, so the atomicity of operations such as flash-clearing all “speculative” bits, as proposed by Kiln, is not guaranteed.

The paper makes the following observations about implementing failure atomicity. First, in a distributed environment such as NUCA and/or multi-core, atomicity of an operation is not guaranteed, since devices act on their own and communicate only through pre-defined interfaces. To ensure that all of these devices reach the same decision, Two-Phase Commit (2PC) must be employed to determine the final state of the operation. Second, software logging is expensive, making hardware-assisted failure atomicity an attractive option. Among popular hardware logging designs, the paper points out that undo logging fits current NVDIMMs best for the following reasons: (1) For dynamic workloads (i.e. the write set is only known at the end of the transaction, not beforehand), undo logging supports in-place updates, which eliminates the remapping table that would otherwise be needed for a transaction to read its own dirty data; (2) Although both redo and undo logging suffer from write amplification, undo logging has better access locality than redo logging: undo logging needs two row activations to write the undo log and the data, while redo logging needs one row activation to write the redo data, and another two to read the log and update the data in place after log commit; (3) Undo log entries are never read except during recovery, so the log can be trivially discarded after the transaction commits, whereas a redo log must be replayed for in-place updates. Based on the above reasons, LAD falls back to undo logging whenever the write set of a durable transaction exceeds what the hardware can buffer. The last observation is that the memory hierarchy naturally serves as an intermediate buffer for pending updates that are supposed to become durable atomically. Part of the hierarchy is even made persistent using battery-backed SRAM to decrease the latency of certain operations. The paper proposes that instead of draining these buffers as quickly as possible, we clear them only lazily, taking advantage of their non-volatility to implement a small “shadow page” for the write set. In this paper, the write pending queue (WPQ) is considered non-volatile, a feature already implemented in recent commercial products.
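
To make the ordering argument concrete, here is a minimal sketch (in C, with illustrative names) of the two write paths being compared. persist() and persist_barrier() stand in for cache-line write-back and ordering primitives such as clwb and sfence; the log layout is not taken from the paper.

```c
/* Sketch of undo vs. redo write paths; all names and layouts are illustrative. */
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t txn_id; uint64_t addr; uint8_t img[64]; } log_rec_t;

static void persist(const void *p, size_t n) { (void)p; (void)n; /* clwb each line */ }
static void persist_barrier(void) { /* sfence-style ordering point */ }

/* Undo logging: copy the pre-image, persist it, then update in place.
 * The log is never read again unless the system crashes before commit. */
void undo_write(log_rec_t *log, uint64_t txn, uint8_t *addr, const uint8_t *nv)
{
    log->txn_id = txn;
    log->addr   = (uintptr_t)addr;
    memcpy(log->img, addr, 64);       /* pre-image */
    persist(log, sizeof *log);
    persist_barrier();                /* log durable before the data */
    memcpy(addr, nv, 64);             /* in-place update */
    persist(addr, 64);
}

/* Redo logging: append the post-image now; inside the transaction, reads
 * must be redirected to the log (or a remapping table) to see this value. */
void redo_write(log_rec_t *log, uint64_t txn, uint8_t *addr, const uint8_t *nv)
{
    log->txn_id = txn;
    log->addr   = (uintptr_t)addr;
    memcpy(log->img, nv, 64);         /* post-image goes to the log */
    persist(log, sizeof *log);
}

/* Redo logging additionally requires a replay pass after commit --
 * the extra location touch that undo logging avoids. */
void redo_apply_after_commit(const log_rec_t *log)
{
    memcpy((uint8_t *)(uintptr_t)log->addr, log->img, 64);
    persist((const void *)(uintptr_t)log->addr, 64);
}
```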

The paper makes the following assumptions. First, programmers rely on a transactional software interface to declare failure-atomic sections in which either all stores are persisted or none of them is. The failure-atomic section is denoted using a “persistent{}” construct, which is translated into section-begin and section-commit instructions. The processor starts and commits the failure-atomic section accordingly when these instructions are seen, and treats all stores within the section as failure-atomic. Transactions in LAD are identified by a globally unique transaction identifier, which consists of the ID of the thread and the thread-local serial number of the transaction instance. The thread-local serial number can be allocated by simply incrementing a counter that is accessed exclusively by the thread. In the following discussion, we use the term “txn ID” to refer to this global identifier. Second, the paper assumes that the system consists of multiple memory controllers, each with a non-volatile WPQ as discussed in the previous paragraph. The write set of a transaction can be scattered across all memory controllers, which requires the processor to notify each of them about the transaction commit. In addition, the paper assumes an interconnection network in which packets to some controllers might be lost while others are delivered successfully. As we will see later, this assumption plays an important role in the design of LAD, since after a crash a transaction can appear committed on some controllers but uncommitted on the rest.
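
A minimal sketch of how such an interface and ID scheme might look, assuming the begin/commit intrinsics and the <thread ID, serial> bit layout shown below (both are illustrative, not specified by the paper):

```c
/* Illustrative sketch of the assumed programming interface and txn ID scheme. */
#include <stdint.h>

#define MAX_THREADS 256    /* concurrency bound assumed by LAD */

static _Thread_local uint32_t tls_txn_serial;   /* per-thread instance counter */

static inline uint64_t make_txn_id(uint16_t tid)
{
    /* globally unique ID: thread ID in the high bits, serial in the low bits
     * (the exact layout is an assumption; tid is expected to be < MAX_THREADS) */
    return ((uint64_t)tid << 32) | tls_txn_serial++;
}

/* Stand-ins for the section-begin / section-commit instructions that the
 * "persistent{}" construct is translated into. */
static inline void txn_begin(uint64_t txn_id) { (void)txn_id; /* tag subsequent stores */ }
static inline void txn_commit(void)           { /* trigger write-back + 2PC */ }

/* Example durable section: push a node onto a persistent list. */
void push(uint16_t tid, uint64_t **head, uint64_t *node, uint64_t value)
{
    uint64_t id = make_txn_id(tid);
    txn_begin(id);                          /* persistent{                   */
    node[0] = value;                        /*   every store in the section  */
    node[1] = (uint64_t)(uintptr_t)*head;   /*   must persist atomically     */
    *head   = node;
    txn_commit();                           /* }                             */
}
```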

LAD extends both the L1 cache controller and the memory controllers with extra state. On the cache controller side, we first add a bitmap in which each bit records whether the corresponding block is dirty in the current transaction. Dirty blocks are handled differently when they are evicted, requested via coherence, or when the transaction commits. We also add a thread-local counter, which is part of the thread context, for dispensing transaction IDs, and an ACK counter on the L1 controller that records the number of pending memory flushes not yet acknowledged by the memory controller. On the memory controller side, we extend the WPQ by adding a few extra fields to each entry in the queue: (1) A “speculative” bit indicating whether the memory write is uncommitted. The memory controller must not write a speculative entry back to the NVM unless there is no free space in the WPQ when a new entry arrives; in this case the speculative state is spilled to a logging area, as described below; (2) A transaction ID field recording the ID of the transaction that generated the store request. We also add a mapping structure (implemented as an array) that maps thread IDs to the most recent committed transaction ID of that thread. To limit the size of this structure, LAD caps the number of concurrently executing threads at 256. The mapping structure is updated when a thread commits and sends the global ID to the memory controller during the commit protocol. All newly added fields and structures on the memory controller are also battery-backed, such that their contents can be restored after a crash.
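
The added state can be summarized as plain data structures. The sketch below is illustrative; field widths, queue sizes, and names are assumptions rather than values from the paper.

```c
/* Illustrative model of the state LAD adds to the L1 and memory controllers. */
#include <stdint.h>
#include <stdbool.h>

#define L1_BLOCKS   512        /* blocks tracked by the dirty bitmap (assumed) */
#define WPQ_ENTRIES 64         /* WPQ depth (assumed)                          */
#define MAX_THREADS 256        /* bound on concurrent threads                  */

/* Per-core (L1 controller) additions. */
typedef struct {
    uint64_t dirty_bitmap[L1_BLOCKS / 64]; /* 1 = block written in current txn */
    uint32_t txn_serial;                   /* thread-local ID counter          */
    uint32_t pending_acks;                 /* flushes not yet ACKed by the MC  */
} l1_txn_state_t;

/* Per-entry additions to the memory controller's WPQ. */
typedef struct {
    uint64_t addr;
    uint8_t  data[64];
    bool     speculative;    /* uncommitted: must not drain to NVM            */
    uint64_t txn_id;         /* transaction that produced this store          */
} wpq_entry_t;

/* Per-memory-controller additions (battery backed, like the WPQ itself). */
typedef struct {
    wpq_entry_t wpq[WPQ_ENTRIES];
    uint64_t    last_committed[MAX_THREADS]; /* thread -> newest committed txn */
} mc_txn_state_t;
```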

We next describe the operation of LAD. When a new transaction starts, we assign a transaction ID by combining the current thread ID with the thread-local instance counter; the counter is incremented after every assignment. All store operations sent to the memory controller during the transaction are tagged with this transaction ID; load operations are not affected. A store operation also sets the corresponding bit in the per-core bitmap to indicate that the block has not yet been written back to the NVM for persistence. Note that although this bitmap logically belongs to the transaction, the paper suggests that on a context switch all dirty blocks indicated by the bitmap be written back to the NVM (via the memory controller) and the bitmap be cleared, rather than saved into the thread context. When a block is evicted from L1 or requested via coherence, its content is written back through the memory hierarchy to the memory controller, updating lower-level cached copies if any, and the corresponding bit is cleared after the write-back. In the following discussion we first assume that the memory controller’s WPQ can accommodate all write requests; we describe how to handle queue overflow using logging afterwards.
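
A rough sketch of the in-transaction store, eviction, and context-switch paths, with the hardware actions reduced to stub comments (all helper names are mine):

```c
/* Illustrative sketch of LAD's store / eviction / context-switch handling. */
#include <stdint.h>
#include <stdbool.h>

#define L1_BLOCKS 512                 /* number of tracked L1 blocks (assumed) */

typedef struct {
    uint64_t cur_txn_id;              /* ID tagged onto every in-txn store     */
    bool     dirty[L1_BLOCKS];        /* per-block "written in this txn" bits  */
} core_state_t;

/* Issued on every store inside a failure-atomic section. */
void on_txn_store(core_state_t *c, int block)
{
    c->dirty[block] = true;           /* block not yet persisted for this txn  */
    /* the store message carries c->cur_txn_id down the hierarchy */
}

/* Eviction from L1, or a coherence request for a dirty block. */
void on_evict_or_forward(core_state_t *c, int block)
{
    if (c->dirty[block]) {
        /* write the block back through the hierarchy to the memory controller,
         * still tagged with cur_txn_id and marked speculative in the WPQ */
        c->dirty[block] = false;      /* bit cleared after the write-back */
    }
}

/* On a context switch the bitmap is not saved with the thread: dirty blocks
 * are written back and the bitmap is cleared. */
void on_context_switch(core_state_t *c)
{
    for (int b = 0; b < L1_BLOCKS; b++)
        on_evict_or_forward(c, b);
}
```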

When the transaction commits, the processor first stalls the pipeline until the store buffer is drained, then walks the cache tags and writes back all blocks indicated by the bitmap. The processor increments the ACK counter for every dirty block written back and decrements it when the memory controller acknowledges the reception of the request. When the counter reaches zero, the cache controller begins the 2PC by sending a commit request to all memory controllers. Upon receiving the request, each memory controller updates the thread’s most recent committed transaction to the one just received; this is the logical serialization point of the commit. The memory controller also flash-clears the speculative bits of all entries generated by the committed transaction. In the last step, the memory controller responds with a commit ACK message. Upon receiving the commit ACKs, the cache controller resumes execution by unblocking the processor, since the processor now knows that the transaction can always be recovered in case of a crash.
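
The commit handshake can be sketched as follows; message passing is modelled as direct calls, the tag walk and ACK counting are reduced to stubs, and all names are illustrative:

```c
/* Illustrative sketch of the commit handshake between the cache controller
 * and the memory controllers. */
#include <stdint.h>
#include <stdbool.h>

#define WPQ_ENTRIES 64
#define MAX_THREADS 256

typedef struct { bool valid; bool speculative; uint64_t txn_id; } wpq_entry_t;
typedef struct {
    wpq_entry_t wpq[WPQ_ENTRIES];
    uint64_t    last_committed[MAX_THREADS];
} mem_ctrl_t;

/* Stubs standing in for pipeline / interconnect behaviour. */
static int  writeback_dirty_blocks(void) { return 0; }  /* tag walk; flushes issued */
static void wait_until_ack_counter_zero(int pending) { (void)pending; }
static void wait_for_commit_acks(int n_mc) { (void)n_mc; }

/* Memory-controller side: record the commit (the logical serialization point)
 * and flash-clear this transaction's speculative bits. */
static void mc_on_commit_request(mem_ctrl_t *mc, uint16_t tid, uint64_t txn_id)
{
    mc->last_committed[tid] = txn_id;
    for (int i = 0; i < WPQ_ENTRIES; i++)
        if (mc->wpq[i].valid && mc->wpq[i].txn_id == txn_id)
            mc->wpq[i].speculative = false;   /* entry may now drain to NVM */
    /* ... send a commit ACK back to the cache controller ... */
}

/* Cache-controller side: the processor stays blocked until the commit ACKs
 * arrive, after which the transaction is recoverable. */
void core_commit(mem_ctrl_t *mcs, int n_mc, uint16_t tid, uint64_t txn_id)
{
    /* phase 1: drain the store buffer, write back dirty blocks, wait for ACKs */
    int pending = writeback_dirty_blocks();
    wait_until_ack_counter_zero(pending);

    /* phase 2: broadcast the commit request (modelled here as direct calls) */
    for (int m = 0; m < n_mc; m++)
        mc_on_commit_request(&mcs[m], tid, txn_id);
    wait_for_commit_acks(n_mc);               /* then unblock the pipeline */
}
```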

When the memory controller’s WPQ overflows, it has no choice but to write data in place after undo-logging the pre-image of the updated location. The undo logging works just like conventional logging: the data at the target address is first copied to a separate logging area allocated on the NVM and persisted, and only then is the target address updated. The global transaction ID is also written into the log header to help decide whether the log needs to be applied. During crash recovery, we first check whether any undo logs exist and whether their entries belong to committed transactions. If logs exist and do not belong to committed transactions (i.e. the transaction ID in the header is larger than the most recently committed ID), they are applied to the target locations in reverse chronological order (i.e. younger entries first). If the logs belong to committed transactions, they are simply discarded. The paper suggests that the OS is responsible for allocating the log space and performing GC to reclaim stale entries.
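
A sketch of the overflow spill decision and the recovery-time roll-back rule, with the log write reduced to a stub and victim selection simplified (names and structure layouts are mine):

```c
/* Illustrative sketch of WPQ overflow handling and the undo-log recovery rule. */
#include <stdint.h>
#include <stdbool.h>

#define WPQ_ENTRIES 64

typedef struct { bool valid; bool speculative; uint64_t txn_id; } wpq_entry_t;

/* Stub: copy the pre-image of the victim's target line to the OS-allocated
 * log area, persist it, then write the data in place (the same ordering as
 * the earlier undo_write sketch). */
static void undo_log_and_write_in_place(wpq_entry_t *victim) { (void)victim; }

static bool wpq_full(const wpq_entry_t *wpq)
{
    for (int i = 0; i < WPQ_ENTRIES; i++)
        if (!wpq[i].valid) return false;
    return true;
}

/* A new store arrives while the queue is full: spill one speculative entry
 * (victim selection is simplified here) so the incoming entry can be enqueued. */
void on_store_arrival(wpq_entry_t *wpq)
{
    if (!wpq_full(wpq))
        return;                           /* enough space: enqueue normally */
    for (int i = 0; i < WPQ_ENTRIES; i++) {
        if (wpq[i].speculative) {
            undo_log_and_write_in_place(&wpq[i]);
            wpq[i].valid = false;         /* slot freed for the new entry */
            break;
        }
    }
}

/* Recovery rule: an undo record is rolled back (youngest first) only if its
 * transaction never committed on its thread. */
bool must_roll_back(uint64_t log_txn_id, uint64_t last_committed_id)
{
    return log_txn_id > last_committed_id;
}
```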

During recovery, memory controllers first restore the contents of the WPQ and the array of most recently committed transactions. A BIOS thread then calculates the most recent committed transaction ID for each thread by taking the maximum thread-local transaction ID over all memory controllers. Note that memory controllers may disagree about the maximum committed transaction ID, since at the time of the crash some commit messages may have been delivered and processed while others have not. After computing this vector, we first roll back transactions that overflowed the WPQ but did not commit, as described in the previous paragraph. We then discard WPQ entries that belong to uncommitted transactions (the speculative bit is ignored, since a transaction may have committed without its speculative bits having been cleared). In the last step, we drain each memory controller’s WPQ by writing all entries back to the NVM in transaction order. Execution resumes after recovery completes.
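
A sketch of the recovery pass, assuming the <thread ID, serial> ID layout from the earlier sketch and reducing the undo-log roll-back and final drain to comments (names are illustrative):

```c
/* Illustrative sketch of the BIOS-driven recovery pass. */
#include <stdint.h>
#include <stdbool.h>

#define WPQ_ENTRIES 64
#define MAX_THREADS 256
#define TID(txn)    ((uint16_t)((txn) >> 32))  /* assumed ID layout; tid < MAX_THREADS */

typedef struct { bool valid; uint64_t txn_id; } wpq_entry_t;
typedef struct {
    wpq_entry_t wpq[WPQ_ENTRIES];
    uint64_t    last_committed[MAX_THREADS];
} mem_ctrl_t;

void recover(mem_ctrl_t *mcs, int n_mc, uint64_t committed[MAX_THREADS])
{
    /* 1. Per-thread maximum committed txn ID over all controllers: controllers
     *    may disagree because commit messages to some of them were lost.      */
    for (int t = 0; t < MAX_THREADS; t++) {
        committed[t] = 0;
        for (int m = 0; m < n_mc; m++)
            if (mcs[m].last_committed[t] > committed[t])
                committed[t] = mcs[m].last_committed[t];
    }

    /* 2. (Not shown) roll back undo logs of transactions that spilled the WPQ
     *    but never committed, youngest entries first.                          */

    /* 3. Discard WPQ entries of uncommitted transactions, ignoring the
     *    speculative bit (it may simply not have been cleared).                */
    for (int m = 0; m < n_mc; m++)
        for (int i = 0; i < WPQ_ENTRIES; i++) {
            wpq_entry_t *e = &mcs[m].wpq[i];
            if (e->valid && e->txn_id > committed[TID(e->txn_id)])
                e->valid = false;
        }

    /* 4. (Not shown) drain the surviving entries to NVM in transaction order,
     *    then resume execution.                                                */
}
```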