Proteus: A Flexible and Fast Software Supported Hardware Logging Approach for NVM



Paper Title: Proteus: A Flexible and Fast Software Supported Hardware Logging Approach for NVM
Link: https://dl.acm.org/citation.cfm?id=3124539
Year: MICRO 2017
Keyword: NVM; Logging; Proteus




This paper presents Proteus, a hardware extension for NVM-based architectures that supports atomic durability. Prior to Proteus, past designs tended to focus either only on the software side or only on the hardware side. Software solutions require programmers or compilers to insert software persist barriers to enforce correct write ordering. For example, Intel provided programmers with a set of new instructions for persistent memory (PMEM) from which persist barriers can be composed on x86-based architectures. A persist barrier built this way consists of one or more cache line flush instructions, followed by a pcommit instruction at the end. The cache line flush instructions ensure that all dirty state is written to the NVM controller, and pcommit guarantees that the NVM controller has drained its internal write queue. Dirty state generated before the barrier and flushed by the cache line flush instructions is guaranteed to be persistent on the NVM after the barrier returns. Store fences are also inserted to impose correct memory ordering between the PMEM instructions and earlier store instructions. Although software barriers require no expensive hardware changes and can be deployed quickly, they are overly conservative, slow, and cumbersome to use, because programmers have no control over how cache lines are maintained by the hardware cache or how dirty lines are written back by the NVM controller. In contrast, hardware solutions generally extend the cache, the directory, or the memory controller to automate logging and recovery. They resort to hardware address remapping in the case of Copy-on-Write (COW), or a hardware store queue in the case of undo logging. The problem with hardware solutions is that on-chip storage resources are scarce and have inherent limitations (e.g. access latency, capacity), which puts a hard limit on the number of concurrent transactions or the maximum read/write set size. When the working set of a transaction exceeds these limits, hardware systems must either stall to wait for resources, or fall back to a software handler.
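
As a concrete illustration of the software approach, the sketch below shows what such a persist barrier might look like in C on x86. This is a minimal sketch, not code from the paper: it assumes a CPU with CLFLUSHOPT, and since PCOMMIT was deprecated before shipping in hardware, the instruction is emitted here as raw opcode bytes purely to illustrate the sequence.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

/* Flush [addr, addr+len) and drain the NVM controller's write queue. */
static inline void persist_barrier(const void *addr, size_t len) {
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;
    for (; p < end; p += CACHE_LINE)
        _mm_clflushopt((void *)p);  /* push dirty lines toward the NVM controller */
    _mm_sfence();                   /* order the flushes before the drain */
    __asm__ volatile(".byte 0x66, 0x0f, 0xae, 0xf8"); /* PCOMMIT (deprecated) */
    _mm_sfence();                   /* order later stores after the drain */
}
```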

Proteus, on the other hand, borrows from both sides and combines the favorable characteristics of software and hardware approaches. Like software solutions, Proteus relies on programmers or compilers to mark transaction boundaries and to issue instructions that generate and flush log entries when shared data items are updated during a transaction. Like hardware solutions, Proteus extends the processor with a log write queue that stores pending log entries to be written. In addition, an extra log pending queue is added to the memory controller, such that some log entries need never be written to the NVM at all if the transaction commits before the queue overflows. We elaborate on these designs in the following sections.

Proteus assumes transactional semantics for durable writes: either all writes within a transaction become durable as the transaction commits, or none of them does and the transaction is aborted. Special instructions are needed to begin and to attempt to commit a durable transaction. When generating code for a transaction body, the compiler is responsible for expanding every transactional store to a global data item into three instructions: one that reads the old value and generates the log entry, one that writes the data that has been loaded into the cache, and one that writes the log entry to the NVM logging area. The semantics and implementation of these three instructions are covered in detail below. Each transaction is allocated a unique transaction ID (TID) when it begins. The TID is useful not only for conflict detection, but also for the hardware to identify which log entry belongs to which transaction, as we will see later.
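
The sketch below models this three-instruction expansion in plain C so the flow is easy to follow. The names (log_reg_t, proteus_log_load, proteus_log_flush, lta, txn_store_u64) are hypothetical stand-ins for the hardware mechanisms, not the paper's interface, and the sketch assumes the store targets a 32-byte-aligned block that is fully readable.

```c
#include <stdint.h>
#include <string.h>

#define LOG_UNIT 32

typedef struct {                 /* models one 64-byte log register */
    uint64_t addr;               /* source address plus property bits */
    uint64_t tid;                /* owning transaction */
    uint8_t  data[LOG_UNIT];     /* old value of the 32-byte block */
} log_reg_t;

static uint8_t  nvm_log_area[1 << 16];  /* modeled NVM logging area */
static uint64_t lta;                    /* Log Target Address register */

/* log-load: capture the old value of a 32-byte block into a log register. */
static void proteus_log_load(log_reg_t *lr, const void *src, uint64_t tid) {
    lr->addr = (uint64_t)(uintptr_t)src;
    lr->tid  = tid;
    memcpy(lr->data, src, LOG_UNIT);
}

/* log-flush: write the register to the slot LTA points at, then bump LTA. */
static void proteus_log_flush(const log_reg_t *lr) {
    memcpy(nvm_log_area + lta, lr, sizeof(*lr));
    lta += sizeof(*lr);          /* LTA dispenses sequence numbers */
}

/* What the compiler would emit for a transactional store `*p = v`. */
void txn_store_u64(uint64_t *p, uint64_t v, uint64_t tid) {
    log_reg_t lr0;
    proteus_log_load(&lr0, p, tid);  /* 1: read old value, build the entry */
    *p = v;                          /* 2: the ordinary (cached) data store */
    proteus_log_flush(&lr0);         /* 3: push the undo entry to the log */
}
```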

Two new instructions are added to the ISA: log-load and log-flush. The log-load instruction takes one address and one register argument. The address points to an aligned 32-byte memory block, which is the basic unit of logging in Proteus. Choosing 32 bytes as the logging unit reduces the number of NVM writes when an entry is flushed back to the NVM: since write backs to NVM must be 64-byte, cache-line-sized memory blocks, a log entry consisting of 32 bytes of data plus metadata is guaranteed to be written back with just one NVM write operation. On executing the instruction, the processor loads the block at the given address into the L1 cache if it has not already been loaded (raising an exception as usual if the access is invalid), and then copies the 32-byte block into a special register, the log register. Log registers are 64-byte registers allocated from a separate register file; each one holds a full log entry, including data and metadata such as the address and property bits. Compilers are responsible for allocating and reusing log registers. The second instruction is log-flush, which writes the content of a given log register to a target address; both the register name and the address are encoded in the instruction. Note that in undo logging based systems, the same address can be logged multiple times (i.e. a log entry is generated every time the address is written), so undo log entries must be written in the order in which they are generated. During recovery, only the earliest entry for an address is replayed, because later entries contain dirty data written within the transaction, which should not be part of the state after rollback. To achieve this ordering, Proteus also has a special register, LTA (Log Target Address), for dispensing sequence numbers. The register is used to generate the address in the NVM logging area to which the next log entry is written; after each read, the value of LTA is incremented.
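
To make the replay rule concrete, here is a small C sketch of undo recovery under the model above: entries are walked in generation order (the order LTA dispensed), and only the earliest entry per address is applied. The types, and the idea of casting the logged address back to a pointer, are illustrative assumptions rather than the paper's recovery code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOG_UNIT 32

typedef struct {
    uint64_t addr;               /* source address of the logged block */
    uint8_t  data[LOG_UNIT];     /* pre-transaction value */
} undo_entry_t;

/* Roll back an aborted transaction: walk entries oldest first and apply
 * only the first entry seen for each address; later entries for the same
 * address hold in-transaction dirty data and must be skipped. */
void undo_replay(const undo_entry_t *log, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int earliest = 1;
        for (size_t j = 0; j < i; j++)
            if (log[j].addr == log[i].addr) { earliest = 0; break; }
        if (earliest)
            memcpy((void *)(uintptr_t)log[i].addr, log[i].data, LOG_UNIT);
    }
}
```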

To architecturally support the log-flush operation, the paper proposes adding a special log queue that holds log write back operations that have not yet completed. The log queue is an associatively searched hardware structure. Each entry in the queue consists of the source address, destination address, log metadata, and data. When a log-flush instruction enters the pipeline, a slot is reserved in the log queue, which is filled as the instruction flows through the pipeline. After the slot becomes ready, it can be selected by the write back hardware and flushed. Entries to different addresses can be flushed out of order to maximize bandwidth, which also enables some optimizations. To enforce write ordering, when a store instruction is about to commit, an associative search is performed on the log queue. If a log entry with the same source address as the store instruction has not been flushed yet, the pipeline prevents the store from committing, thereby guaranteeing that the log entry is always persisted before the corresponding dirty data. Note that the system bypasses the cache hierarchy when flushing log entries: they are sent directly to the memory controller when they leave the log queue.
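
The ordering check can be pictured with a few lines of C over a toy log queue; the structure layout and queue size here are assumptions for illustration, not the paper's parameters.

```c
#include <stdbool.h>
#include <stdint.h>

#define LOGQ_SLOTS 16   /* assumed queue depth */

typedef struct {
    bool     valid;     /* slot holds a log write back that has not completed */
    uint64_t src_addr;  /* 32-byte block the entry was logged from */
} logq_slot_t;

static logq_slot_t logq[LOGQ_SLOTS];

/* Associative search performed when a store is about to commit: a hit on
 * the same source address means the undo entry is not yet persistent, so
 * the pipeline must hold the store back. */
bool store_may_commit(uint64_t store_addr) {
    uint64_t blk = store_addr & ~(uint64_t)31;  /* align to the logging unit */
    for (int i = 0; i < LOGQ_SLOTS; i++)
        if (logq[i].valid && logq[i].src_addr == blk)
            return false;  /* pending entry found: stall the store */
    return true;           /* entry already flushed: safe to commit */
}
```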

To avoid logging unnecessary data, each processor also has a Log Lookup Table (LLT), which stores the addresses of the most recently generated log entries. Undo logging only requires generating a log entry on the first write to an address within a transaction, so later writes to the same address need not be logged, even though the compiler has expanded every such store; without filtering, these redundant entries would waste bandwidth and NVM storage. To solve this problem, when a new log entry is inserted into the log queue, an entry with the same address is also inserted into the LLT. On executing log-load and log-flush, the address is searched against the LLT; if the address is found, both instructions become no-ops. When the LLT is full, the LRU entry is evicted, and the newest entry is inserted.
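
Below is a minimal software model of the LLT, assuming a tiny fully associative table with LRU replacement (the sizes and names are illustrative): a hit means the block was already logged in the current transaction, so the expanded log-load/log-flush pair can be dropped.

```c
#include <stdbool.h>
#include <stdint.h>

#define LLT_ENTRIES 8   /* assumed table size */

static uint64_t llt_addr[LLT_ENTRIES];
static uint64_t llt_last_use[LLT_ENTRIES];   /* recency counters for LRU */
static bool     llt_valid[LLT_ENTRIES];
static uint64_t llt_clock;

/* Returns true if addr was already logged (the instructions become no-ops);
 * otherwise records addr, evicting the LRU entry when the table is full. */
bool llt_lookup_insert(uint64_t addr) {
    int victim = 0;
    for (int i = 0; i < LLT_ENTRIES; i++) {
        if (llt_valid[i] && llt_addr[i] == addr) {
            llt_last_use[i] = ++llt_clock;   /* refresh recency on a hit */
            return true;
        }
        if (!llt_valid[i] ||                 /* prefer an empty slot, else */
            llt_last_use[i] < llt_last_use[victim])
            victim = i;                      /* track the least recent slot */
    }
    llt_addr[victim]     = addr;             /* miss: insert, maybe evicting */
    llt_valid[victim]    = true;
    llt_last_use[victim] = ++llt_clock;
    return false;
}
```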

The paper also makes an important observation: as long as crashes and recovery are uncommon, log entries are written once and almost never read. The implication is that, instead of always storing log entries on the NVM, which suffers both higher write latency than DRAM and limited P/E cycles, log entries can be buffered in a small memory-backed SRAM hardware queue. This hardware queue, called the Log Pending Queue (LPQ), can be made both fast and durable. An LPQ entry consists of all the fields of a log queue entry, plus the TID of the transaction that generated it. Log entries flushed from processors are first buffered in the LPQ until the buffer becomes full and spills entries onto the NVM, while other data items still follow the normal data path. After a transaction commits, a commit message carrying its TID is sent to the memory controller, which checks the LPQ and invalidates all log entries with that TID. If some entries have already been spilled to the NVM, the commit handler in the memory controller also walks the logging area and invalidates them as well. To slightly accelerate the log cleaning process, log buffers can be allocated per thread: since a thread can only execute one transaction at a given time, a log buffer can be removed wholesale once the corresponding transaction commits.
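
The LPQ's buffer-spill-invalidate behavior can be sketched in the same toy model; the slot count, spill policy, and the helper nvm_spill below are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define LPQ_SLOTS 32       /* assumed SRAM queue depth */

typedef struct {
    bool     valid;
    uint64_t tid;          /* transaction that produced the entry */
    uint64_t dst_addr;     /* slot in the NVM logging area */
    uint8_t  data[48];     /* log entry payload (data + metadata) */
} lpq_slot_t;

static lpq_slot_t lpq[LPQ_SLOTS];

/* Hypothetical spill path: in hardware this would write the entry to NVM. */
static void nvm_spill(const lpq_slot_t *e) { (void)e; }

/* Buffer a freshly flushed log entry, spilling one entry if the queue is full. */
void lpq_insert(const lpq_slot_t *e) {
    for (int i = 0; i < LPQ_SLOTS; i++)
        if (!lpq[i].valid) { lpq[i] = *e; lpq[i].valid = true; return; }
    nvm_spill(&lpq[0]);    /* overflow: push one resident entry to NVM */
    lpq[0] = *e;
    lpq[0].valid = true;
}

/* Commit: drop every buffered entry of the committing transaction in place.
 * Entries that already spilled must still be invalidated in the NVM log area. */
void lpq_commit(uint64_t tid) {
    for (int i = 0; i < LPQ_SLOTS; i++)
        if (lpq[i].valid && lpq[i].tid == tid)
            lpq[i].valid = false;
}
```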