ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors



Paper Title: ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors
Link: https://ieeexplore.ieee.org/document/1003567
Year: ISCA 2002
Keyword: Checkpointing




This paper proposes ReVive, a general-purpose rollback recovery mechanism for shared-memory multiprocessors. There are two design goals: software rollback and crash recovery. Software rollback requires the system to preserve a consistent snapshot of system state; on a rollback request, the snapshot is quickly restored and execution resumes from the restoration point as if nothing had happened. Crash recovery targets the scenario where one or more components of the system malfunction and cause data loss; a previously saved snapshot must then be loaded from some external source, or rebuilt from the information that survives the crash. In the case of ReVive, since all data is stored in DRAM and is therefore volatile, we only consider recovery from detectable errors caused by malfunctioning memory modules.

ReVive is built on the Stanford DASH multiprocessor, a shared-memory architecture that uses a customized hardware coherence protocol to maintain cache coherence. DASH consists of multiple processors and memory modules connected by a low-latency communication network. Each node pairs a processor with a memory module and a directory controller, which handles the coherence requests forwarded to that node. The paper assumes that processors and memory modules can fail independently, i.e., one memory module may stop working while the others remain usable. Failures are restricted to non-power failures, since a power failure would wipe out the contents of every memory module and render the system unrecoverable. The paper also assumes a fail-safe error model: when an error occurs, it is guaranteed to be detected by hardware, using error-correcting codes or other techniques, within a bounded amount of time. This bound is known in advance, so once it has elapsed the system can assume that the corresponding execution will never be rolled back. This is critical for I/O handling, because the system must buffer outgoing I/O until it is certain that these non-retractable operations will not be rolled back in the future. When an error is detected, the system discards the active state and restores the global state to a previously saved, consistent checkpoint.

ReVive protects the system from data loss by adding redundancy, so that it can survive the case where a memory module suddenly stops working and loses all the data it stores. Memory modules are divided into parity groups. For the pages in each parity group, an extra parity page is maintained; if one page in the group is lost, the parity page is used to rebuild the missing page. Parity pages are scattered across all memory modules to reduce the risk that a few broken modules leave the system unrecoverable because more than one page of the same parity group is missing. On a write-back of a dirty cache line, both the memory copy of the line and its parity must be updated. The sequence is as follows. First, the line is marked as busy in the directory so that no other processor can access it through the coherence protocol, preserving the atomicity of the parity update. The directory controller then reads the old content of the line, computes the XOR of the old and new data, and writes the new line. If an acknowledgement for the write-back is required, it is also sent by the directory controller at this point. Next, the directory controller reads the parity, XORs it with the result of the previous XOR, writes the updated parity back, and finally unlocks the cache line. The processor requesting the write-back can resume execution as soon as it receives the acknowledgement from the directory; the parity update proceeds in the background and is overlapped with normal execution.
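To make the parity rule concrete, here is a minimal, runnable sketch (not the paper's implementation) that models pages as small word arrays and maintains the invariant that the parity page equals the XOR of all data pages in its group. The group size, page size, and function names are illustrative assumptions.

```c
#include <stdint.h>
#include <stdio.h>

#define WORDS 4   /* words per (toy) page; real pages are much larger */
#define GROUP 3   /* data pages per parity group; group size is an assumption */

static uint64_t data[GROUP][WORDS];   /* data pages of one parity group */
static uint64_t parity[WORDS];        /* the group's parity page */

/* Write-back of a new image for data page p: fold old^new into the parity,
 * mirroring the directory's read-old, write-new, update-parity sequence. */
static void writeback(int p, const uint64_t *new_page)
{
    for (int i = 0; i < WORDS; i++) {
        parity[i] ^= data[p][i] ^ new_page[i];  /* new_parity = old_parity ^ old ^ new */
        data[p][i] = new_page[i];
    }
}

int main(void)
{
    uint64_t upd[WORDS] = {1, 2, 3, 4};
    writeback(1, upd);

    /* Invariant check: the parity page is still the XOR of all data pages. */
    for (int i = 0; i < WORDS; i++) {
        uint64_t x = 0;
        for (int p = 0; p < GROUP; p++)
            x ^= data[p][i];
        if (x != parity[i]) {
            puts("parity mismatch");
            return 1;
        }
    }
    puts("parity consistent");
    return 0;
}
```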

Note that, in the above procedure, if the two updates, the data write and the parity write, cannot be made atomic, which is the case for most hardware, extra steps are needed to preserve atomicity. The paper suggests extending each parity entry with a one-bit marker. The marker is cleared before the data is updated and set again only after the updated parity has been written. During recovery, if a parity entry is found with the marker clear, the parity is treated as corrupted and must be discarded, because there is no guarantee that it was updated after the data page.
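A small sketch of the ordering this marker enforces is shown below; the struct layout and names are assumptions for illustration, not the paper's actual format.

```c
#include <stdint.h>

#define WORDS 8   /* words per cache line / parity entry (assumed) */

/* Hypothetical parity entry extended with the one-bit marker from the text. */
struct parity_entry {
    uint64_t words[WORDS];
    uint8_t  committed;   /* 0 = update in flight, 1 = parity matches the data */
};

/* Clear the marker, write the data, write the parity, then set the marker.
 * If recovery finds the marker clear, the parity entry may not reflect the
 * data write and must be discarded (and recomputed from the data pages). */
void update_with_marker(struct parity_entry *pe, uint64_t *data_line,
                        const uint64_t *new_data, const uint64_t *new_parity)
{
    pe->committed = 0;                  /* 1. invalidate before touching data */
    for (int i = 0; i < WORDS; i++)
        data_line[i] = new_data[i];     /* 2. data write */
    for (int i = 0; i < WORDS; i++)
        pe->words[i] = new_parity[i];   /* 3. parity write */
    pe->committed = 1;                  /* 4. commit: parity is trustworthy again */
}
```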

Distributed parity partially solves the failure recovery problem: on a component failure, the content of the failing memory module can be reconstructed by XOR'ing the parity page with all remaining data pages in the parity group. Similarly, if the parity page itself is lost, it can be recomputed by XOR'ing all the data pages. The paper assumes that at most one page per parity group is lost in a failure; if more than one page of a group is lost, an external replica must be used to restore the state.
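The reconstruction itself is just the same XOR applied in reverse. The runnable toy below, with illustrative sizes, rebuilds a deliberately zeroed page from the parity page and the surviving pages.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define WORDS 4   /* words per (toy) page */
#define GROUP 3   /* data pages per parity group */

/* Rebuild data page `lost` by XOR'ing the parity page with every surviving page. */
static void rebuild(uint64_t data[GROUP][WORDS], const uint64_t parity[WORDS], int lost)
{
    for (int i = 0; i < WORDS; i++) {
        uint64_t x = parity[i];
        for (int p = 0; p < GROUP; p++)
            if (p != lost)
                x ^= data[p][i];
        data[lost][i] = x;
    }
}

int main(void)
{
    uint64_t data[GROUP][WORDS] = {{1,1,1,1}, {2,2,2,2}, {4,4,4,4}};
    uint64_t parity[WORDS];
    for (int i = 0; i < WORDS; i++)
        parity[i] = data[0][i] ^ data[1][i] ^ data[2][i];

    uint64_t saved[WORDS];
    memcpy(saved, data[1], sizeof saved);
    memset(data[1], 0, sizeof data[1]);   /* "lose" page 1 */

    rebuild(data, parity, 1);
    printf("%s\n", memcmp(saved, data[1], sizeof saved) == 0 ? "rebuilt" : "failed");
    return 0;
}
```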

Rebuilding memory state after a failure is not the whole story: on a failure, part (or all) of the processor state may also be lost, and without the exact processor state it would be impossible to resume execution on the restored memory image. ReVive therefore relies on incremental checkpointing to guarantee that the system can always be rolled back to the last consistent snapshot. During normal execution, the memory controller periodically interrupts all processors with a checkpointing signal. On receiving the signal, a processor immediately stops fetching new instructions. It then drains volatile state such as the store queue and instruction window, so that information held in these hardware structures is not lost on a recovery. Next, it writes back all dirty cache lines to their home memory modules; during this process the parity of the affected memory pages is updated as usual. Finally, the processor writes its execution context to a known location in the address space and replies to the memory controller to indicate that its part of the checkpoint is complete. After receiving replies from all processors, the memory controller sends a second signal that lets them resume execution. This two-phase-commit-style global coordination is necessary: each processor must wait until all of its peers have finished their checkpoint preparation before resuming. Otherwise, a processor that finishes early could interfere with processors that have not finished, corrupting the checkpoint.
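The sketch below captures only the ordering of this two-phase handshake; the stub functions and processor count are assumptions, and real hardware would of course perform the per-processor work in parallel.

```c
#include <stdio.h>

#define NPROC 4   /* number of processors (assumed) */

static void stop_fetch(int p)        { printf("P%d: stop fetching\n", p); }
static void drain_pipeline(int p)    { printf("P%d: drain store queue / instruction window\n", p); }
static void flush_dirty_lines(int p) { printf("P%d: write back dirty lines (+parity update)\n", p); }
static void save_context(int p)      { printf("P%d: write register context to memory\n", p); }
static void resume(int p)            { printf("P%d: resume execution\n", p); }

int main(void)
{
    /* Phase 1: every processor quiesces, flushes, and saves its context,
     * then acks the checkpoint interrupt. */
    for (int p = 0; p < NPROC; p++) {
        stop_fetch(p);
        drain_pipeline(p);
        flush_dirty_lines(p);
        save_context(p);
    }

    /* Phase 2: only after all acks arrive does the controller release anyone.
     * Releasing a processor early could let it dirty lines that race with a
     * peer still flushing, corrupting the checkpoint. */
    for (int p = 0; p < NPROC; p++)
        resume(p);

    return 0;
}
```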

During normal execution, the directory controller monitors memory modifications from all processors and performs undo logging, ensuring that modified state can always be restored to its value at the most recent checkpoint. There are two cases: when the directory sees a read-exclusive or upgrade request, and when it sees a write-back. In the former case, the directory eagerly logs the requested cache line, because a read-exclusive or upgrade request generally indicates that the requesting processor intends to write the line. In the latter case, the directory first reads the original value of the line and writes it to the logging area, updates the parity as usual, and only then serves the write-back. While the log entry is being written, the directory controller keeps the cache line marked as busy so that no write-back to it is served. Enforcing this ordering between the log write and the data write is essential for undo logging: if the data were updated first and the system crashed before the log entry was written, the old value could not be restored because the undo image would be missing.
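A minimal sketch of the directory-side rule, with assumed names and sizes, is as follows: the pre-image goes into the log before the new data is allowed to reach memory.

```c
#include <stdint.h>
#include <string.h>

#define LINE_BYTES  64     /* cache line size (assumed) */
#define LOG_ENTRIES 1024   /* capacity of the toy log */

struct log_entry {
    uint64_t addr;                  /* which line this pre-image belongs to */
    uint8_t  old_data[LINE_BYTES];  /* value at the time of the last checkpoint */
};

static struct log_entry undo_log[LOG_ENTRIES];
static int log_tail;

/* Called when the directory sees a read-exclusive/upgrade or a write-back for
 * a line not yet logged in this interval. The copy into the log completes
 * before the caller is allowed to overwrite the line in memory. */
void log_pre_image(uint64_t addr, const uint8_t *current_data)
{
    struct log_entry *e = &undo_log[log_tail++];
    e->addr = addr;
    memcpy(e->old_data, current_data, LINE_BYTES);
    /* Only now may the new data be written; updating data first would make
     * the old value unrecoverable if the system crashed in between. */
}
```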

Each memory module reserves a chunk of memory dedicated to logging. Given that a logging area is large enough to hold all log entries generated during one checkpoint interval (which is easy to enforce: if the area is about to overflow, the directory controller simply starts a new checkpoint), only two logging areas are needed: one holding the log for the most recent checkpoint, and one collecting entries for the next. A pointer in a known hardware location indicates which area corresponds to the most recent checkpoint. After a new checkpoint is made, the controller atomically updates this pointer to point to the current logging area, and the other logging area is implicitly garbage collected.
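One way to picture the two areas and the commit pointer is the sketch below; the area size, the atomic used for the pointer, and the helper names are all assumptions.

```c
#include <stdatomic.h>

#define LOG_AREA_BYTES (1 << 20)   /* per-area size (assumed) */

static unsigned char log_area[2][LOG_AREA_BYTES];
static _Atomic int current_area;   /* index of the area backing the committed checkpoint */
static unsigned long fill[2];      /* bytes used in each area */

/* Log writes for the current interval go into the area that does NOT hold the
 * committed checkpoint. */
unsigned char *active_log_base(void)
{
    return log_area[1 - atomic_load(&current_area)];
}

/* Once the new checkpoint is complete, a single atomic pointer flip commits
 * it; the previously committed area is implicitly reclaimed for reuse. */
void commit_checkpoint(void)
{
    int next = 1 - atomic_load(&current_area);
    atomic_store(&current_area, next);
    fill[1 - next] = 0;   /* old area becomes the scratch area for the next interval */
}
```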

To avoid excessive logging, i.e., multiple log entries being generated for the same address within the same checkpoint interval, which is unnecessary because undo logging only needs the first pre-image, the directory controller also keeps one bit per page. The bit is set the first time a log entry is generated for the page and cleared when a new checkpoint is made. If the bit is already set when a memory modification is detected, the directory controller skips the logging step, since an undo image for that page is already present.
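A sketch of this per-page filter, with an assumed page count and bitmap layout, could look like the following; `log_pre_image` and `page_of` in the usage comment are hypothetical helpers (the former from the logging sketch above).

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define NPAGES 4096   /* pages tracked per directory (assumed) */

static uint8_t logged_bitmap[NPAGES / 8];   /* one bit per page */

/* Returns whether the page was already logged in this interval, and marks it
 * as logged either way, so only the first modification triggers a log write. */
bool test_and_set_logged(unsigned page)
{
    uint8_t mask = (uint8_t)(1u << (page & 7));
    bool was_set = (logged_bitmap[page >> 3] & mask) != 0;
    logged_bitmap[page >> 3] |= mask;
    return was_set;
}

/* At every new checkpoint all bits are cleared: every page becomes loggable
 * again for the next interval. */
void clear_logged_bits(void)
{
    memset(logged_bitmap, 0, sizeof logged_bitmap);
}

/* Conceptual use in the directory path (hypothetical helpers):
 *   if (!test_and_set_logged(page_of(addr)))
 *       log_pre_image(addr, current_data);
 */
```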

On recovery, the recovery handler first restores the logging area, using parity pages if necessary. It then walks the log entries one by one and restores each page from its undo image. If a page to be restored is inaccessible, the handler first rebuilds it from the parity page and the remaining data pages of its group. After all pages are restored, the recovery handler instructs all processors to load their execution contexts from the logging area. In the last step of recovery, the directory controller waits for all processors to be ready and then sends a signal to resume normal execution.
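The runnable toy below walks a small undo log and puts every logged word back to its checkpoint-time value; it omits the parity-based rebuild of inaccessible pages (see the reconstruction sketch earlier) and the reloading of processor contexts, and all names and sizes are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define MEM_WORDS 16   /* toy "memory" (assumed size) */

static uint64_t memory[MEM_WORDS];

struct undo_entry { unsigned addr; uint64_t old_val; };
static struct undo_entry undo_log[MEM_WORDS];
static int log_len;

/* Post-checkpoint store: record the pre-image, then update memory. */
static void logged_store(unsigned addr, uint64_t val)
{
    undo_log[log_len++] = (struct undo_entry){ addr, memory[addr] };
    memory[addr] = val;
}

/* Recovery walk: replay pre-images so memory matches the last checkpoint.
 * Walking in reverse makes the oldest pre-image win; with ReVive's
 * one-entry-per-page filter the order would not matter. */
static void recover(void)
{
    for (int i = log_len - 1; i >= 0; i--)
        memory[undo_log[i].addr] = undo_log[i].old_val;
    log_len = 0;
    /* Processors would now reload the contexts saved at the last checkpoint
     * and receive the signal to resume execution. */
}

int main(void)
{
    memory[3] = 7;        /* value at the last checkpoint */
    logged_store(3, 99);  /* modification after the checkpoint */
    recover();
    printf("memory[3] = %llu\n", (unsigned long long)memory[3]);  /* prints 7 */
    return 0;
}
```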