SSP: Eliminating Redundant Writes in Failure-Atomic NVRAMs via Shadow Sub-Paging



Paper Title: SSP: Eliminating Redundant Writes in Failure-Atomic NVRAMs via Shadow Sub-Paging
Link: https://dl.acm.org/citation.cfm?id=3358326
Year: MICRO 2019
Keyword: NVM; SSP; Shadow Mapping; Double Buffering




Note: I think I got something wrong in the discussion below. It now seems to me that the authors may intend the “committed” vector to indicate the location of the current consistent image, while the “current” vector indicates whether the most recent copy of a cache line is committed or speculative. The “current” vector is hence not transient, and must be preserved after the transaction commits. If a cache line is written during a transaction, the current vector is updated globally to notify all other cores that address generation should use the non-committed slot, since the transaction has not yet been committed. The tricky part is that, when the transaction commits, we toggle the persistent version of the committed vector stored on the NVM, but leave the cached “committed” vectors in all other TLBs unchanged. We also do not clear the “current” vector, so that address generation on those cores is still correct. On the committing core, however, logically speaking, the committed and current bits should be toggled together to indicate that the committed version has changed and that the most recent version is the committed version. Such toggling has no effect on address generation, since the two bits are XOR’ed together to determine the page address. The conclusion is that the commit process does not need to toggle any of the TLB bits, since they are naturally correct. The on-NVM image of the committed bits is flipped by the memory controller, but such changes are not necessarily reflected in the TLBs. A TLB entry is only refreshed when it is evicted and reloaded, in which case the committed bits will be the on-NVM version, and the “current” bits are all cleared. The on-NVM version of these bit vectors is never modified on TLB eviction, since the TLB may have cached a stale and incorrect version.

This paper proposes SSP, a hardware framework for achieving persistent transactions on byte-addressable non-volatile memory systems. The paper points out that existing software schemes, such as logging, shadow mapping, and log-structured update, are all inefficient in terms of both latency and throughput. For example, in the two most common logging schemes, redo logging and undo logging, old or new data must be written and flushed back to the NVM before the corresponding location can be updated in place, to avoid a dirty cache line being evicted back to the NVM before the log entries are persisted. This introduces two problems. The first is write amplification, since data is written twice (and if we consider metadata writes and log truncation, the overhead is even larger). The second is that the write ordering between log entries and in-place updates is critical for correctness: every time write ordering is required, programmers must issue a software write barrier that stalls the processor until previous stores are accepted by the memory controller.

For shadow mapping, one of the biggest concerns is write amplification, which can be significant when the amount of dirty data per tracking unit (e.g. 4KB pages) is small. Even if only a few cache lines are updated in a page, the original page must be duplicated before the update can be performed. In addition, shadow paging requires that every load and store to the NVM area be translated by a translation layer, which identifies the currently active version to be updated and rewrites the target address of the memory operation. On commit, all dirty contents are written back to the in-memory shadow page, and the stable version is switched from the old page to the new page atomically. This can be achieved by software instrumentation provided by the compiler, or by a hardware mapping structure. Both add the overhead of address translation and metadata maintenance to the critical path.

Log-structured update leverages the fact that NVM performs best for sequential writes, and that atomic multi-word update can be achieved in an append-only manner by first appending data to the end of the log, and then switching the pointer to the log tail with an atomic 8-byte write. Updates are not committed until the atomic write is flushed back to the NVM. To facilitate runtime address computation, log-structured systems maintain a virtualized linear address space, in which addresses are mapped to offsets into the log by a volatile mapping table. The mapping table is updated every time an address is updated and appended to the end of the log. On recovery, we rebuild the mapping table by scanning the entire log, without having to recover the potentially inconsistent image from the NVM (the table is kept in volatile DRAM for lower access latency). The major concern with log-structured systems is the overhead of garbage collection, which results in unnecessary data movement and data fragmentation.
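To make the ordering constraint concrete, here is a minimal undo-logging sketch in C, assuming an x86 machine with the CLWB instruction (the `undo_record` layout and function names are my own, not from the paper):

```c
#include <immintrin.h>  /* _mm_clwb / _mm_sfence (x86; compile with -mclwb) */
#include <stdint.h>

/* Hypothetical undo-log record: the pre-image must be durable before
 * the in-place update, otherwise a crash could leave neither version. */
struct undo_record {
    uint64_t *addr;   /* location being overwritten           */
    uint64_t  old;    /* pre-image used to roll back on crash */
};

/* Force a cache line to NVM and order it before subsequent stores. */
static void persist(const void *p) {
    _mm_clwb((void *)p);
    _mm_sfence();
}

/* One failure-atomic 8-byte store under undo logging: the data travels
 * to NVM twice (log entry + in-place update), with a barrier in between. */
void undo_logged_store(struct undo_record *log, uint64_t *addr, uint64_t val) {
    log->addr = addr;
    log->old  = *addr;
    persist(log);       /* log entry must be durable first      */
    *addr = val;        /* only now is the in-place update safe */
    persist(addr);      /* make the new value durable           */
}
```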

SSP is based on the shadow paging scheme described above, but overcomes the problem of write amplification by using finer-granularity tracking and hardware assistance for address remapping. The old page does not need to be duplicated into the buffer before updates can be applied, since SSP tracks shadow pages at cache-line granularity. At a high level, each virtual page mapped to the NVM region is backed by two physical pages, page 0 and page 1. Page 0 is mapped by the OS as in any virtual memory system, while page 1 is assigned to the virtual page by hardware (selected from a pool of free physical pages on the NVM device). At any given moment, the current consistent image of memory can be spread across both pages at cache-line granularity, which is encoded by the per-page “committed” bit vector.

To elaborate: each page has a persistent committed bit vector as part of its page metadata. The number of bits in the committed vector equals the number of cache lines in the page (64 for x86-64). A “1” bit indicates that the corresponding cache line is part of the current consistent image, while a “0” bit indicates that the cache line can be used to buffer speculative updates that may become committed in the future. The memory controller is responsible for maintaining the committed vectors in a separate region of the NVM, and updates them atomically when the consistent image moves from one state to the next (e.g. at the end of a persistent transaction). The memory controller also maintains the invariant that, for each cache line, one and only one bit is set across the committed bit vectors of the two sibling pages assigned to a virtual page frame. When a cache line is to be updated, we first consult the committed bit vector using the physical address of the page (i.e. the one mapped by the OS) to determine which page to write into. If page 1 is to be used (i.e. the current committed version is on page 0), then we return the address of page 1 to the cache controller, together with the data read from page 0 (since the granularity of update is smaller than the granularity of a memory transaction, a read is still needed in case other words of the line are accessed later). When the cache controller receives the line, it rewrites the line tag using the address of page 1, so that the cache line is mapped to the same offset of page 1. This hardware remapping guarantees that even if the dirty line is evicted out of the cache during the transaction, the committed image is not affected, and the system can always recover to the consistent state using the committed image. When the transaction commits, the cache controller writes all dirty lines that are still in the cache back to the NVM. Note that since the cache tag has already been changed to the alternate location, the write back is just a normal cache-line write back. The memory controller must guarantee that the write back of data and the update of the committed bits are atomic, i.e. if the system crashes, either the committed bits reflect the after-commit state and all slots contain committed data, or the committed bits reflect the before-commit state and the (possibly partially written) data in the alternate location is discarded.
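As a rough model of the metadata just described (a sketch with assumed layout and names such as `ssp_page_pair`, not the paper's actual encoding):

```c
#include <stdint.h>
#include <assert.h>

#define LINES_PER_PAGE 64   /* 4KB page / 64B cache lines */
#define LINE_SIZE      64

/* Per-virtual-frame metadata kept by the memory controller: the two
 * sibling physical pages and one committed bit per cache line in each. */
struct ssp_page_pair {
    uint64_t page0_pa;    /* OS-mapped physical page                  */
    uint64_t page1_pa;    /* hardware-assigned shadow page            */
    uint64_t committed0;  /* bit i set: line i is committed in page 0 */
    uint64_t committed1;  /* bit i set: line i is committed in page 1 */
};

/* Invariant maintained by the memory controller: for every line index,
 * exactly one of the two siblings holds the committed copy. */
static int ssp_invariant_holds(const struct ssp_page_pair *p) {
    return (p->committed0 ^ p->committed1) == ~0ULL;
}

/* Physical address of the committed copy of cache line `idx`. */
static uint64_t committed_line_pa(const struct ssp_page_pair *p, unsigned idx) {
    assert(idx < LINES_PER_PAGE);
    uint64_t base = ((p->committed0 >> idx) & 1) ? p->page0_pa : p->page1_pa;
    return base + (uint64_t)idx * LINE_SIZE;
}
```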

We next describe the architecture of SSP. The design consists of two parts, an on-core part and an off-core part. The on-core part consists of an extended TLB and special logic in the cache controller. The TLB is extended with three bit vectors for access tracking, plus an extra address field used to rewrite the tag of an incoming cache line based on the location of the committed data. The three bit vectors are the committed vector, as described above; the current vector, indicating the current read image; and the updated vector, tracking the write set. Note that the current vector may not equal the committed vector, since a transaction may read dirty data written by itself. In this case, the dirty data should be returned rather than the copy indicated by the committed vector.
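One plausible shape of the extended TLB entry, just to visualize the extra state (field names and widths are assumptions; an entry may only need one committed vector since the two siblings' vectors are complements):

```c
#include <stdint.h>

/* SSP-extended TLB entry: a conventional translation plus the alternate
 * (shadow) frame and three bit vectors with one bit per cache line. */
struct ssp_tlb_entry {
    uint64_t vpn;        /* virtual page number                              */
    uint64_t ppn0;       /* physical frame from the OS page table            */
    uint64_t ppn1;       /* alternate frame assigned by hardware             */
    uint64_t committed;  /* bit i set: committed copy of line i is in page 1 */
    uint64_t current;    /* bit i set: line i has a speculative copy         */
    uint64_t updated;    /* bit i set: line i was written by the transaction */
    /* usual valid / permission bits omitted */
};
```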

SSP follows a transactional interface, but does not provide transactional memory capabilities (i.e. no isolation between concurrent transactions). Applications start and end a failure-atomic region using the SSP_BEGIN and SSP_END instructions. All stores within the failure-atomic region are persisted atomically, i.e. either all of them are available after a crash, or none of them is. The paper suggests that applications should synchronize using locks or other mechanisms.
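In application code, a failure-atomic region might look like the sketch below. SSP_BEGIN/SSP_END are new instructions, so I stand them in with placeholder macros (purely hypothetical; a real toolchain would emit the actual opcodes):

```c
#include <stdint.h>

/* Placeholder wrappers: a real compiler/assembler would emit the new
 * SSP_BEGIN / SSP_END instructions here. */
#define SSP_BEGIN() __asm__ volatile("" ::: "memory")
#define SSP_END()   __asm__ volatile("" ::: "memory")

struct account { uint64_t balance; };

/* Both stores become durable atomically, or neither does. Isolation is
 * not provided by SSP; a lock (not shown) must guard concurrent access. */
void transfer(struct account *from, struct account *to, uint64_t amount) {
    SSP_BEGIN();
    from->balance -= amount;
    to->balance   += amount;
    SSP_END();
}
```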

When a cache line is read by a load instruction, the processor checks the TLB as in a normal read. If the TLB misses, a page walk is initiated to fetch the page table entry from memory. SSP extends the page walker such that the alternate page address and the three bit vectors are also loaded from the memory controller using the physical address. (Note: this is likely a design flaw, since the memory controller access must be serialized after the page walk. Fetching the vectors using the virtual address would be better, but this requires that the memory controller be aware of context switches.) After the TLB entry is populated, we check the committed and current bits to determine whether the data should be read from page 0 or page 1. The physical address for probing the L1 cache is formed using page 0’s address (i.e. the one in the page table) if the committed bit indicates that the consistent data is in page 0 and the current bit is clear. We read from page 1 if the committed bit indicates page 0 but the current bit is set, or if the committed bit indicates page 1 and the current bit is clear.
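The read-path selection boils down to an XOR of the two bits, which is also what the note at the top observes; a small sketch (assuming a committed bit of 1 means the consistent copy is in page 1):

```c
#include <stdint.h>

/* Which page should a load of cache line `idx` probe?
 *   committed bit: where the consistent copy lives (0 = page 0, 1 = page 1)
 *   current bit:   a speculative copy exists in the other page
 * Preferring the speculative copy when it exists gives committed XOR current. */
static uint64_t load_line_pa(uint64_t page0_pa, uint64_t page1_pa,
                             uint64_t committed, uint64_t current,
                             unsigned idx) {
    unsigned c = (unsigned)((committed >> idx) & 1);
    unsigned s = (unsigned)((current   >> idx) & 1);
    uint64_t base = (c ^ s) ? page1_pa : page0_pa;
    return base + (uint64_t)idx * 64;
}
```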

The process is more complicated for a write, since a write allocates its slot in the non-committed page. If the TLB misses, we populate the TLB entry as described above. If the current bit is not set, a speculative cache line is created by first performing a read as described above, and then changing the cache tag to the alternate address. We set the current bit to indicate that the speculative cache line has been allocated, so that all future writes can be performed directly on the speculative slot. If the current bit is already set, then we simply generate the address using the page not indicated by the committed bit, without performing the read first. The updated bit is also set accordingly to track dirty data. If a TLB entry whose updated bit vector is non-empty is evicted from the TLB, the paper suggests that the transaction should abort, since evicting the entry loses track of the full set of dirty data. The aborted transaction must retry using a software handler with logging to complete the failure-atomic section. Also note that in a multicore system with multiple private TLBs, all cores should have the same view of the location of the current dirty data written by any other core. This is achieved by synchronizing a core’s current bit vector with the other TLBs whenever it is updated. The paper proposes using the cache coherence network to broadcast the TLB update message to other cores to force a TLB update. The evaluation section suggests that the overhead of broadcasting is negligible, since only a small portion of writes update the current bit vector (most writes hit the cache or are non-transactional). Note that the paper claims the current bit vector update can be piggybacked onto the invalidation message sent as part of the write-invalidation process. This optimization works because the writing processor needs to acquire exclusive ownership globally before the write can be performed, at which time the current vector is also updated. Other processors accessing this cache line later will simply use their own current vectors to generate a new address to probe the cache.
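A sketch of the write-path bookkeeping on a store (a software model of logic that really lives in the TLB and cache controller; structure and names are mine):

```c
#include <stdint.h>

#define LINE_SIZE 64

/* The subset of per-page TLB state a store touches. */
struct tlb_state {
    uint64_t ppn0, ppn1;                   /* sibling physical pages */
    uint64_t committed, current, updated;  /* one bit per cache line */
};

/* Returns the physical address the store should target. On first touch of
 * a line, the cache controller would also fetch the committed copy and
 * retag it with this speculative address; here we only track the bits.
 * *need_bcast is set when other private TLBs must learn about the new
 * current bit (piggybacked on the coherence invalidation in the paper). */
static uint64_t store_line_pa(struct tlb_state *t, unsigned idx, int *need_bcast) {
    uint64_t spec_base = ((t->committed >> idx) & 1) ? t->ppn0 : t->ppn1;
    *need_bcast = 0;
    if (!((t->current >> idx) & 1)) {
        t->current |= 1ULL << idx;   /* speculative slot now allocated  */
        *need_bcast = 1;
    }
    t->updated |= 1ULL << idx;       /* track the transaction write set */
    return spec_base + (uint64_t)idx * LINE_SIZE;
}
```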

When a transaction is committed, the processor walks the TLB, toggles the committed-vector bits of dirty cache lines, and flushes all dirty lines back to the NVM. Note that these dirty lines are not yet visible to recovery at this point. The processor then notifies the memory controller that the transaction can be committed, and the memory controller persists the committed vector atomically using redo write-ahead logging: it first writes all intended changes to the committed vector into a logging area, flushes the log, then performs the updates in place, and finally flushes the updates as well. The transaction is fully committed only after the log is committed. **Although the paper does not mention whether the committed vector should also be made globally consistent by broadcasting to other cores that may have cached a stale version of the committed vector, my best guess is that this is necessary to avoid other cores accessing stale data, which would violate memory consistency ordering.** After the transaction commits, both the current vector and the updated vector are cleared, since they are transaction-local.
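A sketch of the memory controller's redo write-ahead logging for the committed vectors (the log layout and the `pmem_persist` stand-in are assumptions; the real controller works on physical addresses internally):

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in for the controller's internal "force this range to NVM" step. */
static void pmem_persist(const void *p, size_t n) { (void)p; (void)n; }

/* One redo record: the new value of one page's committed vector. */
struct commit_rec {
    uint64_t *dst;            /* committed vector in NVM metadata  */
    uint64_t  new_committed;  /* value it should hold after commit */
};

struct commit_log {
    uint64_t          valid;  /* 0 -> 1 is the atomic commit point */
    uint32_t          nrecs;
    struct commit_rec recs[64];
};

/* Phase 1: persist the records, then the valid flag (the commit point).
 * Phase 2: apply the changes in place, persist them, and retire the log. */
static void mc_commit(struct commit_log *log) {
    pmem_persist(log->recs, log->nrecs * sizeof(log->recs[0]));
    log->valid = 1;
    pmem_persist(&log->valid, sizeof(log->valid));   /* transaction committed here */
    for (uint32_t i = 0; i < log->nrecs; i++) {
        *log->recs[i].dst = log->recs[i].new_committed;
        pmem_persist(log->recs[i].dst, sizeof(uint64_t));
    }
    log->valid = 0;                                  /* truncate the log */
    pmem_persist(&log->valid, sizeof(log->valid));
}
```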

The paper also proposes a way of reducing the storage requirement by performing page consolidation. A page and its shadow buddy can be consolidated if no processor can access any cache line in either page (since consolidation overwrites the speculative version and deallocates one of the two pages). This requirement translates into the condition that, when a page is to be consolidated, no TLB may contain an entry mapping the page and its shadow buddy. To enforce this, the memory controller maintains a reference counter for each page in its internal mapping table. When a page walk acquires the information of a persistent page, the reference counter is incremented by one, and whenever the corresponding TLB entry is evicted, it is decremented by one. A page is consolidated only if its reference counter is zero. Consolidation proceeds by first counting the number of consistent cache lines in each page, by doing a popcount on each of the committed bit vectors. Consistent cache lines are then migrated from the page with the smaller popcount to the page with the larger popcount. After the migration is completed, the OS changes the mapping in the page table such that the virtual address maps to the consolidated page. The other page is released back to the free page pool and can be allocated later. Since the OS is involved in the consolidation process, the paper suggests that consolidation should be done periodically by a background OS thread. Although not mentioned by the paper, consolidation must also be performed only when there is no pending log on either page. In addition, the paper does not describe how consolidation is made atomic. In fact, it is atomic as long as the committed bit vector is only changed as the last step of consolidation, using redo write-ahead logging.
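A sketch of the consolidation step under the same assumptions (popcount chooses the surviving page; the page-table switch and the logged committed-vector update that make it atomic are only noted in comments):

```c
#include <stdint.h>
#include <string.h>

#define LINES_PER_PAGE 64
#define LINE_SIZE      64

/* Consolidate a sibling pair whose reference counter is zero: migrate the
 * committed lines from the page with the smaller popcount into the page
 * with the larger one, so a single page holds the whole consistent image.
 * Returns the surviving page. */
static uint8_t *consolidate(uint8_t *page0, uint64_t committed0,
                            uint8_t *page1, uint64_t committed1) {
    int keep0 = __builtin_popcountll(committed0) >= __builtin_popcountll(committed1);
    uint8_t *dst  = keep0 ? page0 : page1;
    uint8_t *src  = keep0 ? page1 : page0;
    uint64_t move = keep0 ? committed1 : committed0;  /* lines to migrate */

    for (unsigned i = 0; i < LINES_PER_PAGE; i++)
        if ((move >> i) & 1)
            memcpy(dst + i * LINE_SIZE, src + i * LINE_SIZE, LINE_SIZE);

    /* Last step (not shown): set the surviving page's committed vector to
     * all ones via the controller's redo log, update the OS page table,
     * and free the other page. The logged vector update is the atomic point. */
    return dst;
}
```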

On recovery, the system first scans the logging area (whose address is stored in a known location) and replays the log to recover the “committed” bit vectors if the log has been committed (if not, the log is discarded). The system can then resume instantly without any data movement, since the committed vectors indicate the current consistent image. Speculative data in the alternate locations is simply ignored.
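Recovery is then just a replay-or-discard decision over the controller's log; a sketch reusing the shapes from the commit sketch above (repeated here so the snippet is self-contained):

```c
#include <stdint.h>
#include <stddef.h>

static void pmem_persist(const void *p, size_t n) { (void)p; (void)n; }

struct commit_rec { uint64_t *dst; uint64_t new_committed; };
struct commit_log { uint64_t valid; uint32_t nrecs; struct commit_rec recs[64]; };

/* If the valid flag made it to NVM, the transaction committed: replay the
 * records (idempotent). Otherwise discard the log. Either way no user data
 * is copied, so recovery is effectively instant. */
static void ssp_recover(struct commit_log *log) {
    if (log->valid) {
        for (uint32_t i = 0; i < log->nrecs; i++) {
            *log->recs[i].dst = log->recs[i].new_committed;
            pmem_persist(log->recs[i].dst, sizeof(uint64_t));
        }
    }
    log->valid = 0;
    pmem_persist(&log->valid, sizeof(log->valid));
}
```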