PiCL: A Software Transparent, Persistent Cache Log for Nonvolatile Main Memory
Link: https://parallel.princeton.edu/papers/micro18-nguyen-picl.pdf
Year: MICRO 2018
Keyword: NVM; Undo; Logging; PiCL; Checkpoint
This paper proposes PiCL, a novel architecture for supporting main memory checkpoints using NVM. Checkpoints are consistent main memory images that can be restored after a power failure. In this paper, it is assumed that the execution of all processors in the system is divided into “epochs”. Epochs are also the basic unit of recovery: at recovery time, the main memory state is restored to the beginning of one of the previously saved epochs. The paper also assumes that the entire system is backed by NVM connected to the memory bus, through which persistent data can be read or written using ordinary processor load and store operations. Processors access the NVM in the same way as they access DRAM. After a crash, all volatile state such as the processor register files and cache contents is lost. Data stored in NVM, as its name suggests, survives crashes and can be used for post-failure recovery.
Undo and redo logging are the two most typical ways of performing recovery. Both methods require a centralized log kept in the NVM, whose location is either hardcoded in all processors or stored in a well-known location on the NVM. In the following discussion we assume that logging is performed at cache line granularity, as metadata is maintained per cache line. With undo logging, processors only save the undo image, which is the value of the cache line before the modification. To guarantee recoverability of modifications, two write ordering constraints are enforced. The first is that dirty cache lines can only be written back to the NVM after their undo log entries are. This ensures that modifications can always be rolled back in the case of failure. The second is that the epoch must not commit before all dirty cache lines are written back to the NVM. This ensures that the current epoch becomes durable before the next epoch begins, such that if the system crashes during the next epoch, we can always revert all changes and restore the state to the end of the current epoch. Undo log entries can be discarded after the current epoch has committed, because the epoch can no longer be rolled back under any circumstance. With redo logging, processors must adopt a special cache eviction policy called “no-steal”, which means that the cache may not evict any dirty line before the current epoch commits. This is to avoid dirty data being persisted on the NVM before a crash happens, in which case there would be no way to restore the previous state. Store operations to cache lines also generate log entries which contain the after-image of the line. The log must be flushed to the NVM before the epoch can commit. Dirty lines, on the other hand, can remain in the cache for efficiency. On recovery, all modifications in the redo log are replayed in the order they were appended to the centralized log. Redo log entries can be discarded once the corresponding dirty cache lines have been written back to the NVM, or once the cache line is overwritten by another store instruction.
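To make the two undo-logging ordering constraints concrete, below is a minimal sketch in C++ of an undo-logged eviction and an epoch commit. All types and helper functions (Line, nvm_append_undo, nvm_write_back) are hypothetical placeholders of my own, not the paper's interface; a real design enforces these rules in the cache and memory controllers rather than in software.

```cpp
#include <cstdint>
#include <vector>

struct Line { uint64_t addr; uint8_t data[64]; };

// Placeholder durable-write hooks; a real system would issue NVM writes
// followed by the appropriate persist barriers here.
void nvm_append_undo(const Line& /*pre_image*/) { /* append to the undo log */ }
void nvm_write_back(const Line& /*dirty*/)      { /* write data line to NVM */ }

// Constraint 1: the undo entry (pre-image) must be durable before the
// corresponding dirty line is allowed to reach the NVM.
void evict_dirty(const Line& pre_image, const Line& dirty) {
    nvm_append_undo(pre_image);   // log entry first ...
    nvm_write_back(dirty);        // ... then the data write-back
}

// Constraint 2: the epoch commits only after every dirty line it produced has
// been written back, so a crash in the next epoch can always roll the state
// back to this commit point.
void commit_epoch(const std::vector<Line>& pre_images,
                  const std::vector<Line>& dirty_lines) {
    for (size_t i = 0; i < dirty_lines.size(); ++i)
        evict_dirty(pre_images[i], dirty_lines[i]);
    // After this point the undo entries of this epoch may be discarded.
}
```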
This paper identifies several problems in both the classical undo and redo designs. First, for undo logging, the two write ordering constraints put the latency of NVM writes directly on the critical path of epoch execution. In the simple undo model, execution must wait for the persistence of log entries and dirty cache lines to complete before it resumes. Even worse, the performance of random writes on NVM is usually several orders of magnitude worse than that of reads and sequential writes. At the end of the epoch, when dirty cache lines are forced back to the NVM, the write-back stream tends to have very little locality (most locality is absorbed by the cache itself), so the latency of NVM random writes becomes a dominant factor in the total execution time. The second problem of undo logging is that it is not scalable. As the system scales, the cache grows larger and larger, and the cache flush at the end of the epoch becomes more and more expensive, because it takes longer and requires more NVM writes to flush dirty lines from the shared last-level cache. In addition, the paper claims that undo logging introduces extra reads whenever dirty lines are evicted. This is because the paper assumes an undo model where the log entry is only generated on demand, i.e. right before the cache line is evicted. Since the pre-image is no longer available in the cache, the processor has to issue one extra read to the NVM to fetch the pre-image of the cache line, and then issue a write to persist it to the log area. The cache line eviction is suspended and only resumed after the log entry is persisted. This “hidden” read operation is problematic at run time, as it adds random reads to the NVM with poor data locality. Redo logging, on the other hand, does not require expensive cache flushes or read-log-modify, but is severely limited by the “no-steal” property it imposes on the cache replacement policy. One consequence of the “no-steal” policy is that if the cache controller must evict a dirty line, the epoch is forced to complete prematurely, quite similar to how best-effort HTM handles evictions of speculative cache lines. As a mitigation, some designs add a hardware victim cache to the LLC to hold evicted dirty lines until the epoch completes. The victim cache is typically an associative search structure of limited capacity, which is also not scalable as the size of the system increases. Furthermore, the victim cache may overflow, in which case the epoch still has to be aborted forcefully.
The design of PiCL is based on undo logging, and it addresses the above problems by decoupling the write-back of both log entries and dirty lines from epoch commit. In addition, it solves the “read-log-modify” problem by generating log entries eagerly right before the store operation, while the pre-image is still available in the cache. As a trade-off, during recovery there is no guarantee that the memory state can be recovered to the most recently committed epoch, only to a recent epoch. We next describe the details of the design.
PiCL features asynchronous NVM writes and the generation of a multi-undo log. The execution model of PiCL epochs is as follows. Processors still have a global notion of epochs. Instead of committing an epoch before starting the next one in a strictly sequential manner, PiCL divides the completion of an epoch into two events: commit, which means the epoch has completed all of its operations, and persist, which means all dirty lines and log entries have been written back to the NVM, so that the epoch can be restored after a crash. To achieve this, PiCL adds an undo log buffer to each core, and changes the logging mechanism such that log entries are generated eagerly when store instructions commit. For simplicity we ignore the multi-undo log for now. On every store instruction, if the address of the store is within the NVM address space, the processor first acquires the cache line using the normal cache coherence protocol. After receiving the content of the line (or after the request hits a copy in the local cache), the processor copies the cache line into an undo log entry. Since no modification has been made yet, the cache line is exactly the undo image that can be used for recovery. Entries in the log buffer do not have to be written back immediately. In fact, the paper proposes an optimization in which write operations resulting from log write-back are combined into a bulk write of the size of the NVM row buffer. Since row buffer activation takes non-negligible time in an NVM write, this optimization is expected to improve both the latency and the throughput of log writes. A Bloom filter can also be added to allow fast detection of whether an address has a log entry in the buffer. Note that since a Bloom filter does not support efficient deletion, the filter is only updated when new entries are inserted into the buffer, and cleared when the buffer is flushed. When a cache line is to be evicted by the cache controller, the undo log buffer is checked using the address tag of the line. If the address also has an entry in the log buffer, the log buffer is flushed first to guarantee the write ordering between dirty data and log entries. Log entries generated for the same address must be written back in exactly the order they were inserted into the buffer.
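Below is a rough sketch of the per-core undo log buffer with eager log generation and a Bloom filter, as I understand it from the description above. The class, the buffer and filter sizes, and the single hash function are illustrative assumptions, not the paper's actual hardware design.

```cpp
#include <algorithm>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <deque>

struct UndoEntry { uint64_t addr; uint8_t pre_image[64]; };

class UndoLogBuffer {
    std::deque<UndoEntry> buf;   // FIFO: preserves per-address insertion order
    std::bitset<1024> filter;    // Bloom filter (one hash function for brevity)
    static std::size_t hash(uint64_t addr) { return (addr >> 6) % 1024; }
public:
    // Called when a store commits: the still-unmodified line is the undo image.
    void log_store(uint64_t addr, const uint8_t line[64]) {
        UndoEntry e{addr, {}};
        std::copy(line, line + 64, e.pre_image);
        buf.push_back(e);
        filter.set(hash(addr));
    }

    // Fast (conservative) check used on eviction: may this address still have
    // a buffered undo entry?
    bool may_contain(uint64_t addr) const { return filter.test(hash(addr)); }

    // Drain all entries to the NVM in insertion order (e.g. coalesced into
    // row-buffer-sized bursts), then clear the filter since it has no deletes.
    template <class WriteFn>
    void flush(WriteFn nvm_write) {
        for (const UndoEntry& e : buf) nvm_write(e);
        buf.clear();
        filter.reset();
    }
};

// On eviction of a dirty line: if may_contain(addr) is true, flush() runs
// before the dirty line is written back, preserving the log-before-data order.
```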
The second mechanism supporting the undo log is called Asynchronous Cache Scan (ACS). Instead of forcing all dirty cache lines to be flushed back to the NVM on epoch commit, which is neither scalable nor efficient, ACS allows the epoch to be committed immediately and persisted later in the background. The system maintains two epoch registers: the current epoch register and the last persisted epoch register. The latter must also have a persistent copy stored at a well-known location on the NVM, such that the crash recovery routine can read it and restore the system state to that epoch (on a system performing synchronous cache write-back, the values of these two registers would always differ by one, because the current epoch would not begin until the previous epoch had persisted). An epoch is persisted only if all of its dirty cache lines have been written back; as stated above, this does not have to happen when the epoch commits. ACS works as follows. The cache controller periodically scans the tag array and checks whether any cache line was written in epoch (last persisted epoch + 1), i.e. the next epoch to be persisted. If such a cache line is found, the ACS engine writes the cache line back to the NVM. Note that the undo buffer is checked before the write-back is performed, and the undo log entry for the cache line is flushed first if it exists. If a cache line has been modified in several epochs, e.g. epochs i, i + 1, ..., j, the cache line only needs to be written back by ACS for epoch j. Although the cache line contents of epochs i, ..., j - 1 are not written back, their values are saved by the undo log entries generated in epochs i + 1, ..., j. From this example we can also see that for frequently written cache lines, ACS not only hides the latency of the cache line flush, but also reduces the number of write-backs issued. After ACS has processed epoch k, it writes k into the last persisted epoch field on the NVM.
To remember the most recent epoch in which a cache line was written, PiCL extends the cache tag array with an EID field. While ACS is running, the EID field is checked against the epoch ID currently being persisted; if the EID matches, the cache line is written back by ACS, otherwise it is ignored.
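The following is a simplified sketch of how one ACS pass over the tag array might look, combining the EID check with the write ordering described above. The data structures and callback hooks are my own illustration under these assumptions, not the paper's implementation.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct TagEntry { uint64_t addr; uint32_t eid; bool dirty; };

struct AcsEngine {
    std::vector<TagEntry>& tags;     // LLC tag array extended with an EID field
    uint32_t last_persisted_epoch;   // also mirrored at a fixed NVM location

    void scan(const std::function<void(uint64_t)>& flush_undo_entries_for,
              const std::function<void(const TagEntry&)>& write_back_line,
              const std::function<void(uint32_t)>& persist_epoch_id_to_nvm) {
        const uint32_t target = last_persisted_epoch + 1;  // next epoch to persist
        for (TagEntry& t : tags) {
            if (!t.dirty || t.eid != target) continue;     // EID must match
            flush_undo_entries_for(t.addr);  // log entries before data (ordering)
            write_back_line(t);              // dirty line reaches the NVM
            t.dirty = false;
        }
        persist_epoch_id_to_nvm(target);     // update the on-NVM copy first ...
        last_persisted_epoch = target;       // ... then the volatile register
    }
};
```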
PiCL allows a cache line to be modified several times before it is written back to the NVM for persistence. This feature is called multi-undo logging. In PiCL, each log entry carries two epoch IDs: the ID of the epoch in which the cache line was last modified (version create), and the ID of the currently executing epoch (version overwrite). During recovery, these two epoch IDs are used to determine whether the undo log entry should be applied, based on the target epoch ID to restore (i.e. the last persisted ID): if the target ID lies between the two IDs (excluding the version overwrite ID), the undo log entry is restored; otherwise it is ignored. When a cache line is loaded into the L1 cache for a write, there are two possible cases. In the first case, the cache line is read directly from the NVM. In this case we cannot know exactly when the cache line was last modified, so the line is treated as having last been written in the most recently persisted epoch (while in fact it may have been modified after that epoch and simply evicted due to cache replacement), and its EID is set to the ID of the most recently persisted epoch. In the second case, the write request hits in the L1, and the cache line's EID tag is precisely the most recent epoch in which the line was modified. In both cases, a log entry is generated for the modification. The create ID of the undo entry is the cache line's EID, and the overwrite ID is the current running epoch ID. PiCL guarantees that at most one log entry is generated per address per epoch: every time a log entry is generated for a cache line, the line is marked as already logged, and later store operations to that line do not generate further entries. On an epoch change, all cache lines are unmarked so that new log entries can be generated for the new epoch.
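A small sketch of the multi-undo log entry layout and of the interval check used at recovery time is given below. The field and function names are mine; the paper only specifies the two epoch IDs and the pre-image, and the half-open interval follows the "excluding the version overwrite ID" rule above.

```cpp
#include <cstdint>

struct MultiUndoEntry {
    uint64_t addr;
    uint32_t create_eid;     // epoch in which the line was last modified
    uint32_t overwrite_eid;  // epoch performing the current modification
    uint8_t  pre_image[64];  // line contents before the modification
};

// The entry is applied when restoring to target_eid (the last persisted epoch)
// iff target_eid lies in [create_eid, overwrite_eid): the create epoch is
// included, the overwrite epoch is excluded.
inline bool should_apply(const MultiUndoEntry& e, uint32_t target_eid) {
    return e.create_eid <= target_eid && target_eid < e.overwrite_eid;
}
```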
On recovery, the recovery routine scans the log backwards and performs the epoch ID check described in the previous paragraph. If a log entry satisfies the condition, its content is applied to the NVM image. Note that multiple entries for the same address can satisfy the condition; if so, they are applied in the reverse of the order they were inserted into the log. Imagine that a cache line is written in epoch 4, written back to the NVM during epoch 6 due to a capacity eviction, and written again in epoch 7, and that the most recently persisted epoch is 2 when both writes happen (i.e. ACS makes no progress). If the system crashes, there are two log entries whose interval includes epoch 2, the persisted epoch ID. The first entry is generated by the first write and has the interval [2, 4). The second entry is generated by the second write; since the line was evicted back to the NVM before the second write, the create epoch ID in this log entry is set to 2, the most recently persisted ID, giving an interval of [2, 7). The recovery routine first applies the second entry's undo content (which is the content already on the NVM), and then applies the content of the [2, 4) entry (which is the content before the first modification and is not on the NVM). The backward scan can stop once the overwrite epoch ID of a log entry is smaller than the last persisted epoch ID, because any cache line modified before the persisted epoch must have already been written back. Such entries can also be garbage collected, because they will never be applied in any future recovery.
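To tie the pieces together, here is a hypothetical recovery loop matching the worked example above: a backward scan that applies qualifying entries and stops once entries fall entirely before the last persisted epoch. It reuses the MultiUndoEntry and should_apply sketch from the previous paragraph and, purely for simplicity, treats the NVM image as a flat byte array indexed by address.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// `log` is ordered from oldest to newest entry; `nvm_image` stands in for the
// persistent memory image.
void recover(const std::vector<MultiUndoEntry>& log,
             uint32_t last_persisted_eid,
             uint8_t* nvm_image) {
    for (auto it = log.rbegin(); it != log.rend(); ++it) {
        // Entries whose overwrite epoch is older than the persisted epoch can
        // never be applied again (and may be garbage collected), so stop here.
        if (it->overwrite_eid < last_persisted_eid) break;
        // Apply the pre-image if [create, overwrite) contains the target epoch.
        if (should_apply(*it, last_persisted_eid))
            std::memcpy(nvm_image + it->addr, it->pre_image, 64);
    }
}
```

In the example, the backward scan first encounters the [2, 7) entry and restores the pre-image of the second write, then the [2, 4) entry and restores the pre-image of the first write, leaving the line in its state at the end of epoch 2.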