HOOP: Efficient Hardware-Assisted Out-of-Place Update for Non-Volatile Memory



Paper Title: HOOP: Efficient Hardware-Assisted Out-of-Place Update for Non-Volatile Memory
Link: https://dl.acm.org/doi/10.1109/ISCA45697.2020.00055
Year: ISCA 2020
Keyword: NVM; Redo Logging; HOOP




Highlight:

  1. Uses redo logging for failure atomicity. The read redirection problem is addressed by a mapping table on the memory controller, which indexes the redo log entries of both committed and uncommitted data. Log replay is performed in the background by applying log entries back to their home addresses and then removing the corresponding mapping table entries. Transactionally written dirty lines are discarded on eviction to avoid polluting the home-address image.

  2. The cache hierarchy does not track a transaction's write set (the read set need not be tracked for failure atomicity). Instead, updated data is sent to the memory controller when it is first written. This differs from designs where writes are cached in the hierarchy and only flushed back on transaction commit via a tag walk. HOOP avoids the tag walk at the cost of longer first-time write latency.

  3. The GC interval is upper bounded by the memory controller's mapping table size.

  4. The paper also attempts to reduce write amplification by (1) coalescing writes from multiple committed transactions in a group replay that keeps only the most up-to-date write to each address; and (2) logging at byte granularity instead of cache line granularity.

Questions

  1. This paper makes the fundamental mistake of claiming that the algorithm is shadow paging, while it is actually redo logging with an auxiliary index. I do appreciate this combination, which has not been fully explored. I also quite appreciate the idea of amortizing multiple transactions' updates into one log replay, as well as byte-granularity logging, but at the end of the day, this is really not shadow paging. In a shadow paging design, there is no fixed home location to which data items are written back, which is the biggest difference between shadow paging and redo logging. The paper also distinguishes shadow paging from log-structured designs, but these two are in fact the same thing, i.e., log-structured NVM is just an aggressive case of shadowing.

  2. If the on-controller mapping table is full, then GC must be invoked, and there is no way to avoid this. The problem is that the size of a transaction is therefore upper bounded by the mapping table size: a larger transaction would overflow the table, yet no entry could be released, because only committed log entries can be replayed. The paper fails to even mention this issue.

  3. The usable size of the log buffer area is also upper bounded by the mapping table size, yet the paper seems to suggest that the log buffer can be arbitrarily large?

  4. The paper indicates that a persistent bit is added to each line in the hierarchy, but does not clarify how this bit is used. Is it used for tracking the write set?

  5. The paper does not say what happens when a transactionally written dirty line is written back. Is it discarded, or is it treated like a normal write-back? I would say the former is the correct answer, since in the latter case the line would pollute the working image before the transaction commits or aborts, making it impossible to recover. The paper seems to suggest the former option, since there is in fact a bit for tracking transactionally written lines in the hierarchy.

  6. The paper is also inconsistent about how log entries are generated. On page 5 it is claimed that “Whenever a cache line is evicted from the LLC within a transaction, the cache line is written into the OOP region.”, indicating that dirty lines are redirected to the log buffer, while on page 8, under the “Store Operation” subsection, it is stated that “As a result, the cache controller will send the modified data and its home-region address to HOOP”, indicating that the cache controller directly sends the updated data and its physical address to the memory controller on each update. In the latter case, evicted dirty lines are in fact not needed.

  7. What if there are multiple memory controllers? Do they work together on a collective log buffer, or do they have private log buffers? How does GC work? Do you elect a master controller to perform GC, or let the controllers work in parallel (which is difficult to do)?

  8. I do not get why the eviction buffer is required. The paper seems to suggest that GC locks out the entire NVM device to avoid race conditions.

  9. Fine-grained logging may amplify the number of read accesses by generating multiple requests, since a log entry may not contain the full cache line. This can hurt bandwidth, especially if both requests access the NVM.

This paper proposes HOOP, a hardware redo logging design that achieves transactional failure atomicity with low write amplification and low performance overhead. The paper is motivated by the fact that most previous designs use either logging or shadow paging, both of which have flaws. Logging, for example, requires writing the same piece of data twice, first to the log buffer and then to the home location, which doubles the write traffic to the NVM device, harming device lifetime as well as the available bandwidth on the bus. In addition, both undo and redo logging enforce a write ordering between log entries and dirty data on the critical path of execution, degrading performance as the pipeline is frequently stalled. Shadow paging, on the other hand, still incurs write amplification if implemented at page granularity. With cache-line-granularity shadowing, writes no longer require duplicating a page, but previous hardware proposals introduce other performance bottlenecks, such as remapping via TLB translation entries, which incurs a TLB shootdown cost on each entry update. Furthermore, the paper points out that log-structured NVM is also infeasible, despite good write performance and locality, due to the high read indirection cost, which can be as bad as O(log(N)) where N is the number of log entries on the device.

This paper addresses the above challenges with a combination of techniques. First, to reduce the write amplification of logging, the paper proposes that log entries be generated at byte granularity, based on the observation that many cache lines are only updated sparsely. In addition, the paper proposes that writes from different transactions to the same data location be coalesced into a single write, saving log replay bandwidth. Second, redo log entries are not flushed in the foreground, which would stall the processor as the transaction commits. Instead, log entries are generated directly into a buffer on the memory controller and then written back to the NVM log region in the background without any cycle overhead. Third, to avoid the read redirection problem of redo logging, HOOP uses a mapping table on the memory controller that serves as an index over the redo log entries. Data requests are redirected to the log buffer if the mapping table indicates that a log entry, which holds the more up-to-date image, exists.

We describe the hardware changes as follows. The main functionality of HOOP is implemented on the memory controller. Two buffers are added to the controller: a data buffer holding log entries that have not yet been written back to the NVM, and an eviction buffer storing evicted lines while GC is being performed. In addition, a hardware mapping table is added to the memory controller, which maps physical addresses to either a slot in the on-controller data buffer or a location in the NVM log buffer.
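
To make these structures concrete, here is a minimal C++ sketch of the controller-side state under my reading of the paper; the type names, 64-byte line size, and field layouts are illustrative assumptions, not the paper's definitions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr std::size_t kLineSize = 64;  // assumed cache-line size

// Where the latest copy of a logged line currently lives.
enum class Location { kDataBuffer, kNvmLogBuffer };

// One slot in the on-controller data buffer: the modified bytes of a line
// plus a per-byte valid mask (HOOP logs at byte granularity).
struct DataBufferEntry {
  uint64_t home_addr = 0;                  // home-region physical address
  uint64_t txn_id = 0;                     // owning transaction
  std::array<uint8_t, kLineSize> bytes{};  // modified byte values
  std::array<bool, kLineSize> valid{};     // which bytes were written
};

// A mapping-table entry redirects a home address either to a data-buffer
// slot on the controller or to an offset inside the NVM log buffer.
struct MappingEntry {
  Location where = Location::kDataBuffer;
  std::size_t index = 0;  // slot number or log-buffer offset
};

struct MemoryControllerState {
  // Bounded in hardware; modeled as a map here for simplicity.
  std::unordered_map<uint64_t, MappingEntry> mapping_table;
  std::vector<DataBufferEntry> data_buffer;
  // Dirty lines evicted while GC has the NVM image locked are parked here.
  std::unordered_map<uint64_t, std::array<uint8_t, kLineSize>> eviction_buffer;
};
```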

At initialization time, a log buffer is reserved on the NVM to store redo log entries, and its address is known to the memory controller. Redo log entries are organized as chunks of addresses and data. Chunks from the same transaction are linked together as linked lists so that the write set of a single transaction can be scanned. Transaction status is also maintained in the chunk header for recovery purposes. Note that log entries are not uniformly sized, since HOOP generates them at a granularity finer than a cache line. In this regard, HOOP stores the addresses and data of log entries in distinct chunks and links them together with a pointer field in the address chunk. One extra benefit of this design is that the write set of a transaction can be scanned without reading the data, which reduces the complexity of recovery.
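
A rough sketch of how the chunked log layout might look; the header fields and names below are my guesses based on the description, not the paper's exact on-NVM format.

```cpp
#include <cstdint>
#include <vector>

enum class TxnStatus : uint8_t { kRunning, kCommitted, kReplayed };

// One logged update: which home address was touched and which bytes of
// the line changed (byte-granularity mask), plus where its payload sits.
struct AddressRecord {
  uint64_t home_addr;
  uint64_t byte_mask;    // one bit per byte of a 64-byte line
  uint32_t data_offset;  // offset of the payload inside the data chunk
};

// Address chunk: per-transaction metadata plus a list of address records.
// It points to its data chunk and to the next chunk of the same
// transaction, so a transaction's write set can be scanned without ever
// reading payload data.
struct AddressChunk {
  uint64_t txn_id;
  TxnStatus status;                    // consulted during recovery
  uint64_t data_chunk_offset;          // where the payload chunk lives
  uint64_t next_chunk_offset;          // next chunk of the same transaction
  std::vector<AddressRecord> records;  // fixed-capacity region in practice
};

// Data chunk: just the modified bytes, packed back to back in the order
// the address records reference them.
struct DataChunk {
  std::vector<uint8_t> payload;
};
```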

On a memory access, if the request does not hit a line in the hierarchy, it is forwarded to the memory controller. The memory controller first checks the mapping table for a potential log access. If the mapping table indicates a hit, the log entry is accessed. Note that even in this case, the home address may also be accessed, since log entries do not always contain the full cache line. If the mapping table indicates a miss, the eviction buffer is checked, since the eviction buffer is logically part of the NVM image. If the eviction buffer also misses, the home address in the request is accessed.
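
The lookup order can be summarized by a small sketch; this is my reading of the flow, with hypothetical names, and it only decides which locations a missed request must touch.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Which locations a request that missed the cache hierarchy must access.
enum class Source { kLogBuffer, kEvictionBuffer, kHomeAddress };

struct LookupState {
  std::unordered_set<uint64_t> mapping_table;    // addresses with a log entry
  std::unordered_set<uint64_t> partial_entries;  // log entry covers < full line
  std::unordered_set<uint64_t> eviction_buffer;  // populated only during GC
};

// Mapping table first, then eviction buffer, then the home address.
std::vector<Source> plan_read(const LookupState& s, uint64_t addr) {
  std::vector<Source> sources;
  if (s.mapping_table.count(addr)) {
    sources.push_back(Source::kLogBuffer);
    // A partial log entry cannot supply the whole line, so the home
    // address is read as well and the two responses are merged.
    if (s.partial_entries.count(addr)) sources.push_back(Source::kHomeAddress);
  } else if (s.eviction_buffer.count(addr)) {
    sources.push_back(Source::kEvictionBuffer);
  } else {
    sources.push_back(Source::kHomeAddress);
  }
  return sources;
}
```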

For load operations, once the access completes, the memory controller assembles the full cache line from the data in the responses (if multiple requests were generated) and returns it to the upper-level cache. For store operations, the L1 cache controller immediately sends the updated data to the memory controller for log entry generation. The memory controller's data buffer serves to coalesce and absorb repeated writes to the same cache line, avoiding frequent writes into the NVM, which is also a problem with canonical redo logging.
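
A hedged sketch of how the data buffer could absorb repeated stores to the same line; the byte-granularity mask follows the paper, while the map-based buffer and function names are illustrative.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

constexpr std::size_t kLineSize = 64;

// One coalescing slot: the latest modified bytes of a line and a per-byte mask.
struct CoalescingSlot {
  std::array<uint8_t, kLineSize> bytes{};
  std::array<bool, kLineSize> dirty{};
};

// Keyed by home-region address; a small fixed-size structure in hardware.
using DataBuffer = std::unordered_map<uint64_t, CoalescingSlot>;

// Each store forwarded by the L1 controller carries the home address, the
// offset and length within the line, and the new byte values. Repeated
// writes to the same bytes simply overwrite the slot, so only the final
// values reach the NVM log buffer at commit time.
void absorb_store(DataBuffer& buf, uint64_t home_addr, std::size_t offset,
                  const uint8_t* data, std::size_t len) {
  CoalescingSlot& slot = buf[home_addr];
  for (std::size_t i = 0; i < len && offset + i < kLineSize; ++i) {
    slot.bytes[offset + i] = data[i];
    slot.dirty[offset + i] = true;
  }
}
```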

In the meantime, the dirty block in the L1 cache is also marked as transactionally written by setting a bit in its tag entry; the bit is not cleared on transaction commit, keeping commit latency low. Such a block is always discarded by the memory controller when it is evicted from the LLC, to avoid polluting the current NVM image with uncommitted dirty data. Even for committed data, writing it back would be redundant, since the updated data already exists in the log buffer and will be replayed to the home address later anyway.
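
The write-back policy boils down to a single check; the per-line bit is from the paper, while the handler below is my guess at the intended behavior.

```cpp
#include <cstdint>

struct EvictedLine {
  uint64_t home_addr;
  bool transactional;  // per-line bit set when the line was written in a txn
  bool dirty;
  // 64 bytes of data omitted
};

// Transactionally written dirty lines are dropped: the up-to-date bytes
// already sit in the log buffer and GC will replay them to the home
// address, so writing the line back would only risk corrupting the image.
bool should_write_back(const EvictedLine& line) {
  if (line.transactional) return false;  // discard, never touch the home image
  return line.dirty;                     // ordinary dirty lines write back
}
```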

One of the most important features of HOOP is that log entries are fine-grained, meaning that only the modified bytes are persisted. This reduces the write traffic of redo logging, since many lines are updated sparsely. To this end, both the data buffer on the memory controller and the log buffer on the NVM need to track fine-grained status for each byte or word, requiring an extra bitmap per entry.
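
For a load that hits a partial log entry, the controller merges the logged bytes with the home-region copy; a minimal sketch of that merge using the per-byte bitmap:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLineSize = 64;
using Line = std::array<uint8_t, kLineSize>;
using ByteMask = std::array<bool, kLineSize>;

// Start from the home-region copy and overlay only the bytes that the
// log entry actually recorded, yielding the full line returned to the cache.
Line merge_line(const Line& home, const Line& logged, const ByteMask& valid) {
  Line out = home;
  for (std::size_t i = 0; i < kLineSize; ++i) {
    if (valid[i]) out[i] = logged[i];
  }
  return out;
}
```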

On transaction commit, the memory controller flushes its data buffer to the NVM log buffer, stalling the committing processor's pipeline, and then marks the transaction in the log buffer as committed. No other operation is needed, making for a fast commit sequence.
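
As I understand it, commit is just "drain the data buffer into the log region, then persist a commit mark"; a hedged sketch with placeholder persistence functions (the ordering fence is my assumption about how the commit mark is made durable only after the entries):

```cpp
#include <cstdint>
#include <vector>

struct PendingEntry {
  uint64_t home_addr;
  // modified bytes and per-byte mask omitted
};

// Placeholders for whatever persists data into the NVM log region.
void persist_to_log_buffer(const PendingEntry&) { /* NVM write */ }
void ordering_fence() { /* log entries durable before the commit mark */ }
void persist_commit_record(uint64_t /*txn_id*/) { /* mark txn committed */ }

// Drain the coalesced entries, fence, then mark the transaction committed.
// The home region is untouched; replay happens later during GC, which is
// what keeps the commit path short.
void commit(uint64_t txn_id, std::vector<PendingEntry>& data_buffer) {
  for (const PendingEntry& e : data_buffer) persist_to_log_buffer(e);
  ordering_fence();
  persist_commit_record(txn_id);
  data_buffer.clear();
}
```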

When the memory controller's mapping table is full, meaning that no more entries can be inserted into the log buffer, GC is invoked to replay the log entries to their home addresses and clean up the mapping table. The GC process reduces the write bandwidth of log replay by coalescing writes to the same address from different transactions into one write. This is achieved by having the GC thread first scan the entire log buffer from the most recently committed transaction to the least recent one, building an offline index that tracks the most recent committed write to each address. If an address was written multiple times, only the most recent write is preserved in the index. The GC thread then scans the index and writes dirty data back to its home addresses. This process does not need to be atomic, since redo GC is idempotent: even after a crash, the entire process can simply be re-executed, regardless of whether the previous attempt succeeded.
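
The group replay can be illustrated with a short sketch: walk committed transactions from newest to oldest and keep, for every byte of every address, only the most recent committed value, so each home address receives at most one write-back. The data structures and names are mine, not the paper's.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

constexpr std::size_t kLineSize = 64;

struct LogRecord {
  uint64_t home_addr;
  std::array<uint8_t, kLineSize> bytes;
  std::array<bool, kLineSize> valid;  // byte-granularity mask
};

struct CommittedTxn {
  uint64_t txn_id;
  std::vector<LogRecord> records;
};

// Build the offline replay index. Transactions are visited newest first,
// so the first record seen for an address wins; older records only fill
// in bytes that no newer record covered.
std::map<uint64_t, LogRecord> build_replay_index(
    const std::vector<CommittedTxn>& newest_first) {
  std::map<uint64_t, LogRecord> index;
  for (const CommittedTxn& txn : newest_first) {
    for (const LogRecord& rec : txn.records) {
      auto [it, inserted] = index.try_emplace(rec.home_addr, rec);
      if (inserted) continue;
      for (std::size_t i = 0; i < kLineSize; ++i) {
        if (!it->second.valid[i] && rec.valid[i]) {
          it->second.bytes[i] = rec.bytes[i];
          it->second.valid[i] = true;
        }
      }
    }
  }
  return index;
}
```

Replay would then write each merged record back to its home address once and drop the corresponding mapping-table entry; since this is a redo replay, repeating it after a crash is harmless.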

For each address being replayed, the GC thread removes the corresponding entry from the memory controller's mapping table, since the most up-to-date data can now be found at the home address. If no committed transaction is present in the log buffer at the time of GC, however, then nothing can be replayed, and the mapping table remains full. In this case, the current transaction has to abort.

To avoid race conditions with memory write-backs, the entire address space is locked during GC, blocking any write-back from the cache hierarchy. The memory controller's eviction buffer holds lines evicted from the LLC during GC, and it is checked by access requests, since it is logically part of the address space.

The recovery process after a crash is similar to GC, except that multiple threads are employed to accelerate it. The paper suggests that recovery threads first collectively build an index of committed writes by scanning the address chunks of committed transactions. These threads then collaborate to replay the log entries back to their home addresses, after which execution can resume.
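
A short sketch of how the parallel index-building phase of recovery might be organized; the partitioning of address chunks across threads is my assumption, and conflict resolution between transactions that wrote the same address is omitted.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

struct CommittedWrite {
  uint64_t home_addr;
  uint64_t log_offset;  // where the payload sits in the NVM log buffer
};

// Each thread scans a disjoint partition of the committed transactions'
// address chunks and publishes its findings into a shared index; replay
// then proceeds over the merged index exactly as in GC.
void parallel_build_index(
    const std::vector<std::vector<CommittedWrite>>& partitions,
    std::unordered_map<uint64_t, uint64_t>& index) {
  std::mutex mu;
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < partitions.size(); ++i) {
    workers.emplace_back([&partitions, &index, &mu, i] {
      for (const CommittedWrite& w : partitions[i]) {
        std::lock_guard<std::mutex> lock(mu);
        index[w.home_addr] = w.log_offset;  // simplified: last writer wins
      }
    });
  }
  for (std::thread& t : workers) t.join();
}
```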