Pangolin: A Fault-Tolerant Persistent Memory Programming Library



- Hide Paper Summary
Paper Title: Pangolin: A Fault-Tolerant Persistent Memory Programming Library
Link: https://www.usenix.org/system/files/atc19-zhang-lu.pdf
Year: USENIX ATC 2019
Keyword: NVM; Pangolin; Redo Logging



Back

This paper proposes Pangolin, a software library providing redundancy and repairing capabilities to existing NVM libraries. The paper identifies two causes of data corruptions in NVM-based systems. The first reason is hardware error, which can be either medium error, or faulty firmware. This type of error can be detected by internal ECC circult, which is reported to the processor via interrupts. The second reason is software error, which happens when a bug in the library overwrites a piece of memory it should not have modified. Such type of errors cannot be caught by hardware, since the hardware does not recognize logical errors in the software level. The goal of the design is to find and correct most errors of these two types. The paper also suggests that compared with prior proposals, Pangolin uses less space, is more efficient, and ssupports online error detection.

This paper assumes the collaboration between the hardware error detection mechanism, the OS, and the library. The hardware replies on ECC bits to detect errors, but cannot fix them. On detection of data corruption, an interrupt called Machine Check Exception is raised by the hardware to the processor. The processor forwards the interrupt to the OS, which further raises a SIGBUS error to the application. The default SIGBUS signal handler will force the application to exit to avoid the error being propagated. The OS will also take measures to remap the corrupted hardware page to another physical location using the ACPI interface. These corupted pages will be marked as “poisoned” by the OS using page protection mechanism to avoid future reads and writes. Currently all error checking mechanism works on page level, since this is the basic unit of memory mapping in most cases.

Pangolin is based on an object level persistence library called libpmemobj. The library uses direct access (DAX) interface which maps a file stored on NVM into virtual address space. Application programs, however, are not allowed to directly write into the DAX file for the purpose of redundancy verification and update. The DAX file is divided into chunks, which is used for memory allocation within the heap. Metadata and data are both stored in the DAX file, the details of which are not covered in the paper. Before an object is accessed, it must be opened using a library call, and after all updates the object should be committed into NVM using another library call. The library supports two modes, one object mode which only guarantees eight byte atomic update, and another transactional mode which supports multi-word atomic updates with higher persistence overhead. To support relocation of DAX mapped objects, all pointers are represented in relative format using object IDs. A library call translates object IDs to pointers that can be used by applications to address the object.

Pangolin uses a combination of checksum and parity to ensure detection and correction of data corruption. For data corruption detection, Pangolin appends a checksum field to every user object and metadata object residing in the mapped file. The checksum algorithm is Adler32 instead of more commonly used CRC32 for its support of incremental checksum update. This is extremely important when the object is large, and the partial update is relatively small compared with the object size. In this case, CRC32 requires rescanning the entire object to compute the checksum, while Adler32 can derive the new checksum only using partial updates. For data correction, Pangolin uses parity chunks to fix single chunk corruption. Chunks are organized into a logical two dimensional array, consisting of chunk rows and chunk columns. The chunk row size S is a configurable parameter when the DAX file is initialized. Beginning from the first chunk in the DAX file, S adjacent chunks constitute a chunk row, and chunks from different rows that are of the same offset constitute a chunk column. Pangolin dedicate the last chunk row to store parity data for chunk columns before them. The last chunk row is not read on normal operations, and will be used to restore the corrupted chunk on detection of errors. Multiple corruptions, however, cannot be corrected using parity. The paper suggests that this is very unlikely, and that Pangolin does not support correcting multiple chunks in the same chunk column. This can be solved by duplicating data into another device or partition, which is out of the scope of the paper.

We next cover the details of error detection and correction. The error detection happens when an object is opened in read-write mode for update. The opening library call allocates a DRAM buffer of the object size, called “micro buffer”, and copies the object content into the DRAM buffer. All updates from the application are made on this buffer to protect data on NVM (recall that Pangolin disallows updating NVM directly by application). The checksum is also computed as the object is copied into the micro buffer. The computed checksum is then compared with the one in the object header to see if they match or not. If they do not match, data corruption is detected, and online recovery is triggered. We postpone the process of online recovery to later paragraphs for clarity.

Reads and writes onto the object must be conducted using the library call as well to record modified ranges on the object image. These information is stored in the micro buffer header for later access. Objects in the DRAM micro buffer are protected from buffer overflow and invalid array indexing using canary bytes inserted around the object. These canary bytes have random values which are duplicated in the micro buffer header also. The library checks the value of these bytes when the object is committed to the NVM for possible overwrites. Software bugs can be detected and fixed this way if the two values mismatch.

The third way of detecting errors is to rely on OS’s SIGBUS signal. The library overrides the signal handler on initialization. During normal operation, if the signal is received, the library will imemdiately enter recovery procedure. The OS should also remap the fauly hardware page to another address. Such remapping needs to be kept persistent, such that between system reboots the faulty page will not be reused.

Read only objects can be opened without allocating a micro buffer. This forfeits the one-time checksum verification on object open. The paper suggests that read-only object can be verified by a background scrubbling process, or verify the object on each access.

We next cover parity update. On transaction or object commit, the library first writes redo log records to the NVM logging area, and then apply updates to the object in-place. The log is cleared after object updates are fully persisted. Note that only affected ranges are updated, while untouched data is not written. Checksums are not treated differently by the library; They are just like normal data that is updated transactionally. The library then computes new parity for affected chunks, and update the parity chunk using bitwise XOR. There are two issues in the above naive algorithm. The first issue is that if multiple transactions are updating the parity concurrently, the parity itself might be corrupted due to lost updates. The paper proposes using atomic XOR instructions that are implemented on modern processors. Each updating transaction first computes the XOR between old and new data, and they apply the result to the parity chunk using atomic XOR. Since XOR operations are commutative, the net result is equivalent to updating the parity chunk with both updates. The paper also proposes using vectorized XOR, which is not atomic on most architectures, to reduce the overhead of atomicity (e.g. implicit atomic barrier). These instructions are more efficient, but extra measures must be taken to serialize concurrent transactions. It is suggested that per-chunk reader-writer locks being used to update parity chunks. If the region is small, then atomic XOR is used, and reader lock is acquired. If the region is large, then we use vectorized instruction to update parity, but acquire write lock for exclusiveness.

The second issue is that there is a short window between atomic data update and parity update. If the system crashes within this short window, then parity and data will be inconsistent. Pangolin does not try to update parity bits atomically with data to reduce write amplification. On a post-crash recovery, it is assumed that parity bits are all corrupted, which will be recomputed using data pages. The paper argues that the chance of two corrupted pages in the same chunk column after a crash is very low, giving justification to preferring performance over correctness guarantee.

The post-crash recovery of Pangolin is not different from other redo log based systems. The recovery procedure first attempts to locate the redo log in the log buffer. If the log is completed, then it is replayed in-place. Otherwise, the log has not been fully persisted before the crash, and we know in-place data is still consistent. After log replay, parity pages are recomputed, assuming that data pages are always correct. If SIGBUS is raised at this point, then Pangolin admits a unrecoverable error, which forces the application to stop.

Online correction of data corruption is triggered when corruption is reported by either the OS, the background scrubbling thread, or the library itself. The correction thread first sets a “freeze” bit in the DAX file to prevent new transactions from starting. In addition, it waits for all current threads to exit. This is to guarantee that the correction thread has full permission for accessing the parity chunk row, because otherwise we may read partially inconsistent parity data. Threads are allowed to proceed once the recovery completes.