PMTest: A Fast and Flexible Testing Framework for Persist Memory Programs

- Hide Paper Summary

Paper Title: PMTest: A Fast and Flexible Testing Framework for Persist Memory Programs
Link: https://dl.acm.org/citation.cfm?id=3304015
Year: ASPLOS 2019
Keyword: NVM, Undo Logging; Testing; PMTest

Back

This paper presents PMTest, a testing framework for persistent memory programs to detect common bugs such as writes not being persisted or incorrect write ordering. The paper claims that programming for NVM is difficult, even with pre-built libraries, for two reasons. First, programmers should track dirty objects, and issue appropriate function calls or instructions to flush them back to the NVM. Failing to do so will result in objects not being persisted after a crash, which renders the object unrecoverable. The second reason is that certain programming paradigm, such as undo logging, requires that the log entry be persisted on NVM before the object can be modified in-place. This is because there is only very limited control over how the cache hierarchy evicts dirty cache lines back to the NVM. If the dirty line reaches the NVM before the log entry does, then after a crash the dirty line cannot be undone.

In practice, programmers often deal with NVM with special instruction sequences or library calls. On x86 platform, a “clwb” (cache line write back) followed by “sfence” (store fence) constitutes what is known as a “persist barrier”. The processor will stall on the persist barrier until the cache line has been written back to the NVM. Various NVM libraries also provide capabilities to support persistent objects and/or logging and recovery. These library calls are also use the basic persistent barrier as the most fundamental primitive.

The paper identifies two types of most common bugs in persistent programs. The first type is not persisting a memory object whose value should be preserved after a crash, which results in inconsistency. The second type is not enforcing the write ordering required to perform logging or other tasks. This may cause dirty cache lines being written back out of the program order, which also results in inconsistency or unrecoverable objects. The paper also identifies a few common pitfalls that may happen in NVM programming, such as flushing a clean cache line, flushing a dirty line more than once, and generating more than one undo log record for one checkpoint. These pitfalls do not affect correctness, but may have negtive performance implications. PMTest treats this as warnings, while the former category as errors.

To deal with these problems, PMTest instruments every memory operation, sfence, and clwb/clflush in the program (which can be done either manually or via an instrumenting compiler). A trace of the executed code is generated by the instrumented program, which contains the above instructions executed by the processor in program order. The trace is then sent to a background thread for analysis. In order to detect potential bugs, programmers should also manually insert testing functions at proper locations (e.g. at the end of a NVM transaction) to “assert” invariants that should always be obeyed in a correct program. The analysis thread will also take these testing function into consideration, and uses the information collected from the trace to determine whether the invariants are satisfied. Currently two testing functions are supported: isPersist() to assert that the address has been persisted, and isOrderedBefore(X, Y) to assert that address X is guaranteed to be persisted before address Y.

The background thread analysis the lifecycle of a memory variable based on the notion of epoches. The basic observation is that, once a memory object is written by a store instruction, due to the fact that the cache hierarchy can evict dirty line at unknown time, the lifecycle of this dirty line is from the mement the store is executed to the next sfence instruction, given that there is a clwb on the dirty address in-between. PMTest therefore divides the entire execution into consetive epoches, separated by sfence instructions in the trace. The initial epoch is set to zero, and every time a sfence instruction is seen, the thread increments the current epoch by one. The lifecycle of memory objects is maintained by an interval tree described as follows. The interval tree takes an address range as key, and outputs a begin and an end epoch, called a “persist interval” (for simplicity we assume objects have non-overlapping address ranges). When an object is written, the address range is added to the tree if not already there, and the persist interval is set to [current epoch, +∞) (if the range already exists, it is not changed). An object also has a “flush interval”, which records the begin and end epoch in which the object is flushed but not guaranteed to be persisted. When a clwb instruction is processed, the flush interval of the address in clwb’s operand is set to [current epoch, +∞), indicating that cache line flush has been initiated, but it is uncertain when it would finish. Note that addresses that are not flushed will not be affected. Finally, when a sfence instruction is seen, for all addresse ranges that have a flush interval whose end epoch is +∞, both the persist interval and the flush interval is set to [original begin epoch, current epoch) after the current epoch is incremented. Note that in order to enumerate all addresses that satisfy this property, we can simply remember all outstanding clwbs that have not seen a corresponding sfence, instead of walking the interval tree. All clwbs are guaranteed to complete after next sfence is executed.

To test isOrdered(), the analysis thread simply checks whether the persist interval of the memory object has a finite end epoch. If true, then the object is guaranteed to have been persisted to the NVM. To test isOrderedBefore(X, Y), the analysis thread checks whether the end epoch of address X is no larger than the begin epoch of address Y. Because otherwise address Y may be persisted unexpected (e.g. by a cache line eviction) before X is guaranteed to be persisted.

Besides the two correctness assertions, the analysis thread can also test some conditions that only affect perofrmance. For example, if the address range does not have a persist interval assigned when a clwb is executed on that object, we can conclude that the clwb is performed on a clean cache line, which is unnecessary. Similarly, if there is already a flush interval when clwb is executed, we know the memory object is flushed more than once. To check whether the write ordering of undo log entries are enforced, the thread can also track log objects and their persistence status with another interval tree. If an object is written without having a corresponding log entry persisted, the object write and log write are in the wrong order and may risk object corruption.