ArchTM: Architecture-Aware, High Performance Transaction for Persistent Memory



- Hide Paper Summary
Paper Title: ArchTM: Architecture-Aware, High Performance Transaction for Persistent Memory
Link: https://www.usenix.org/system/files/fast21-wu-kai.pdf
Year: FAST 2021
Keyword: NVM; STM; ArchTM



Back

Highlights:

  1. Object-level shadowing for atomic durability, and two version MVCC w/ locking for transactional synchronization.

  2. Do not maintain memory allocator metadata in the NVM. Use object’s annotated headers to recover objects after a crash. Objects can be identified by linearly scanning the segment. This is made possible by maintaining the header for both allocated and free objects (free space is treated as free objects).

  3. Two memory allocators. Large objects are allocated with jemalloc. Smaller ones are allocated just using a simple segment allocator (e.g., like a stack). Objects are freed into a free list that is checked before using the segment allocator.

  4. Object pointers are abstracted away with object IDs, which are mapped by a volatile direct-mapped array. This allows object versions to be selected dynamically. It also simplifies defragmentation since moving objects only involves changing the mapping entry. This is similar to BwTree’s mapping table.

Comments:

  1. The paper did not mention how to expand the table when new object IDs are larger than table size. The simplest way is to use mmap() to allocate a huge array without backing memory.

  2. Why do you need both ownership record in table entry and the write lock? The ownership record itself can be used as write lock. NULL means unlocked, while any non-NULL value means locked, and the owner of the object is the record pointer which points to the persistent transaction descriptor.

  3. Shadowing does not necessarily reduce write traffic. If objects are large and modifications are small, the entire object needs to be duplicated but only a small part of it is updated. In this case, logging is more economical since logging only writes data that is actually written, while object shadowing needs to persist the entire object at commit time.

This paper presents ArchTM, a durable software transactional memory design for byte-addressable non-volatile memory. The paper is motivated by the discrepancy between previous NVM transactional framework proposals that use emulated or simulated NVM device and the actual hardware performance characteristics of commercial NVM device. Previous designs tend to overestimate write performance and underestimate the write granularity for NVM devices due to lack of precise information and using throttled DRAM device for emulation.

The actual NVM device, however, demonstrates different characteristics than DRAM in the following way. First, NVM devices generally sustain much lower write throughput than read throughput. Writing large amount of data into the device, therefore, will severely degrade performance. This discourages putting non-essential metadata on the NVM since metadata will be updated frequently. Second, NVM devices exhibit optimal performance on multiples of 256B writes. This is caused by its internal write combining buffer whose size is exactly 256 bytes. Writes belonging to the same 256 byte block will likely be combined in the buffer, without accessing the physical storage medium. This implies that application should refrain from writing small data items to randomly addresses; In other words, NVM devices prefer large and consecutive writes in order to achieve optimal performance.

ArchTM leverages the above observations to achieve better transactional performance. First, ArchTM adopts shadowing as the persistence protocol for atomic durability. Conventional logging-based schemes at least doubles the write traffic to the NVM since the same data item must be written twice, once for the log, and second time for committed data. Besides, logging requires excessive persist barriers consisting of cache line flushes and store fences, which can also degrade performance. ArchTM, on the contrary, copies an object when it is to be modified within a transaction for the first time, such that both the old and new version are present. The consistent memory image is still valid as the old object is preserved. This requires less write traffic and less persist barrier, since only objects modified during a transaction are flushed at transaction commit point.

Second, ArchTM reduces non-essential and small, random NVM writes by maintaining object-mapping metadata in the DRAM, since metadata update may happen far more frequently than data updates, and they usually only write a few words. The object mapping will be lost during a crash, and rebuilt during recovery by scanning allocated segments, thanks to the per-object annotation that we discuss later. Similarly, memory allocator metadata is also maintained in the DRAM. Allocated and free blocks are recognized during the after-crash scan by reading the annotation on the block header.

Lastly, ArchTM optimizes memory allocation such that small objects are allocated from a single free list, rather than from several size classes. Despite increased fragmentation when these objects are freed, this is justified by the observation that objects allocated together tend to be written together, increasing the write locality, making it easier for the device to combine writes.

Overall, ArchTM is a software transactional memory framework that follows the conventional transactional interface. Here, “transaction” means two independent semantic guarantees. First, reads and writes performed in different transactions are synchronized with snapshot isolation (SI) semantics, meaning that each transaction observes a consistent snapshot at transaction begin time, and successfully commits if no earlier transaction commits on its write set. Second, transactions are also basic units for atomic durability, meaning that the memory updates made by a
transaction are either all written to the NVM, or none of them is. This is important for data consistency, since writes that transit the system from one consistent state to another are most likely non-atomic to the NVM device. ArchTM is also object based, and objects must be “opened” before it is accessed for read or write. Objects in ArchTM are accessed using a special handler called the object ID, which is an abstract representation of the object instance pointer. Object IDs are indices to an object mapping table, the entry of which stores pointers to old and new version of the object for shadowing purposes.

The framework spawns worker threads to execute given transactional functions. Each worker thread maintains pre-thread metadata such as object allocation list and free list (on DRAM). The framework also spawns background threads for garbage collection and defragmentation.

We next describe data structures. Each thread has a thread-local transaction descriptor that stores the current status (not active, active, committed) of the transaction, the begin timestamp and the commit timestamp. The two timestamps are distributed by a global hardware counter (e.g., x86’s RDTSC), which serve as logical timestamps for transaction begin and commit. When a transaction is not active, these two timestamps are considered as invalid regardless of the actual value.

The object mapping table in the DRAM maps object ID to the pointer of the instance. The table must only be accessed within a transaction, using the object ID and the transaction’s begin timestamp. The mapping table is a direct-mapped, linear array, and is indexed directly using the object ID. Although the paper did not mention the concrete implementation, the table could be allocated as a huge array on the virtual address space (which has 48 valid bits) without any backing physical page. Table entries are demand-paged in as the table expands. Each entry of the table consists of two pointers to old and new versions of the object, an ownership record that contains a pointer to the transaction descriptor, and a write lock. The write lock is acquired before the object is accessed for writes, and not locked for reads.

Object IDs are allocated as a new instance of any object is created. The ID allocator is just a global ID counter and a free list. Freed objects will return their IDs back to the free list for recycling. The object ID allocator is maintained by the memory allocator, such that when a new object is allocated, a new ID is also allocated. The memory allocator then registers the object to the table by writing the pointer value to the table entry (the index of which is the object ID), and clearing all other fields. The allocator returns the object ID as a handler. All object accesses must pass the handler to the corresponding access functions.

Memory allocators also maintain volatile metadata in the DRAM. The paper proposes two different allocation strategies. For large objects, the usual jemalloc is used, and objects are allocated from a thread-local size-specific free list. Smaller objects, however, are allocated from a simple per-thread memory segment like a log. Objects are stored compactly within the segment for better write locality. Per-thread memory segments are allocated from a global pool of segments. In the segment-based allocator, free objects are not returned to the segment since the allocator will not be able to recycle them anyway. Instead, they are inserted into a per-thread free list, which is responsible for recycling freed blocks (the simple allocator searches the free list before allocating a new object on the segment). The per-thread free list is also periodically merged into a global free list for balancing reasons, i.e., it balances free blocks between threads that performs more allocation than deallocation and threads that does the opposite. (I am not too sure about whether this paragraph precisely reflects what the paper intends to convey. I just described what I though was the best strategy. The paper may actually say otherwise. In practice, this can be implemented in several different ways.)

Since all allocation metadata is in the volatile memory, the object layout in the NVM will be lost on a crash. To enable the recovery handler to recognize valid objects after a crash, the allocator always annotates objects with a special header that contains the object ID, the transaction ID that last writes it, and the object size. Both allocated and free objects must possess this header, such that a memory segment can be scanned by linearly by adding the size in the current annotated header to the starting address of the object to compute the header address of the next annotated header.

Initially, a segment is treated as one monolithic free object with the size being the size of the segment minus the header size. When a new object is to be allocated on the free space (free space is treated as a free object), if the allocation size matches the size of the free space, its header is updated with object info, persisted by a barrier, and allocation completes. Otherwise, the free space is split into two by first writing a new header at the split point with size being the size of the free space minus allocated size, persisting the new header with a barrier, and then updating the existing header with object info using a second barrier. This way, object headers on the NVM image is always consistent with the actual allocation map, which guarantees that all allocated objects can be found with one linear scan, and that no memory leak would occur, since objects are allocated atomically with the persistence of the annotated header. (The paper did not go through such details. I guess this is what is actually happening.)

To reduce fragmentation, ArchTM spawns background thread that periodically scans per-thread segments, and compact allocated objects by copying them to another vacant segment. This can be done in parallel with transaction execution, since objects are accessed via object ID, not pointer. Moving an object is as simple as copying to another location, locking the mapping table entry, and changing the pointer to the new location. All later references to the object will use the object on the new location.

At transaction begin, the transaction obtains the begin timestamp and atomically stores it into the thread-local descriptor, together with the new status (active). On transactional reads, the object mapping table is queried with the begin timestamp and object ID. If the object’s entry is currently owned by another transaction, and the owner is committed, then the new version is accessed if the begin timestamp is larger than the commit timestamp of the owner. This is to maintain the logical ordering that objects committed before the transaction begin is part of the snapshot read by the transaction. Otherwise, the old version is accessed.

On a transactional write, the write wrapper attempts to lock the object by acquiring the lock word in the table entry. If the lock operation fails, indicating that a concurrent transaction is also updating the object, the current transaction aborts and retries, since SI requires that the write snapshot must not be overwritten by another concurrent transaction. If locking succeeds, the wrapper duplicates the old object (with transaction ID written to the object header), store the pointer as the new object, and apply the update. The object ID is also inserted into a per-transaction write set for the final commit.

On transaction commit, no validation is performed, since we eagerly synchronize writing transaction using locks. The transaction persists all objects in its write set using persist barriers. The transaction is atomically committed by obtaining a new timestamp from the timestamp source as the commit timestamp, and updating the status in the transaction descriptor to committed, making the new version of the object readable to transactions that started after the commit timestamp.

The committed transaction, however, has not finished yet. It needs to wait for all transactions that started before its commit timestamp to complete or abort, and then reclaim the old objects it has shadowed. The committed transaction simply “spins” on the list of active transactions and wait for those whose begin timestamp is smaller than its commit timestamp to complete or abort. It then assigns the new version to the old version in the table entry, and clearing the new version, and moves the old version to the free block list. The transaction fully completes by unlocking the object and then transiting to not active state atomically.

On crash recovery, the recovery handler first rebuilds the object mapping table by scanning allocated segments. Once a valid object is found, it is inserted into the table being rebuilt using the object ID in the header, if the writing transaction’s status field shows committed or not active in the persistent descriptor. If an object already exists on the entry, the transaction ID (i.e., commit timestamp of the transaction who wrote it) at object header is compared, and the larger one wins. Note that this is possible if the system crashes before background GC invalidates the block, in which case two valid instances with the same object ID but different transaction ID can co-exist. The allocator metadata is also rebuilt in the meantime. The system is restored to a consistent state in which all logically committed transactions are applied, and those not completed are rolled back.

To accelerate the recovery process, the paper proposes that the volatile mapping table be periodically checkpoint by incrementally copying the DRAM pages into the NVM (using virtual memory tricks). After a checkpoint completes, the incremental object allocation since the checkpoint is tracked by storing a list of segmented allocated for objects on the NVM. On recovery, the checkpoint is first loaded into the volatile memory as a starting point, and segments in the aforementioned list are scanned to locate objects that are allocated after the checkpoint. These objects are recovered in the same way as in the base algorithm.