Makalu: Fast Recoverable Allocation of Non-Volatile Memory
- Hide Paper Summary
Link: https://dl.acm.org/citation.cfm?doid=2983990.2984019
Year: OOPSLA 2016
Keyword: NVM; malloc; Makalu; Memory management
This paper presents Makalu, a memory allocator for byte-addressable non-volatile memory. This paper identifies three challenges of implementing memory allocator for NVM due to its non-volatility. The first challenge is to maintain the consistency of allocator metadata, since memory allocator itself also maintains a set of data structures, both non-volatile core and volatile auxiliary data structure, to facilitate efficient allocation and deallocation. To reduce persistency overhead of changing non-volatile states atomically with regard to failures, prior researches usually only maintain the consistency of core data structure, while relying on a post-crash recovery routine to rebild auxiliary data structures. This way, most allocation requests can be satisfied using DRAM-only auxiliary data structure, while the costly NVM allocation can be amortized. The second challenge is to ensure that each block is allocated at most once even after crash. If core metadata is not maintained properly, already allocated block may appear as free after recovery, causing double allocation and data corruption. This challenge is usually solved by persisting the fact that a block has already been allocated to the NVM before returning the address of the allocated block to the application. If the system crashes before the persist completes, the memory block will not be used by the application, and it will be free after recovery. If, on the other hand, the crash happens after the persist completes, the block will not be reused by the allocator after recovery, since the NVM image indicates that the block is already allocated. The third challenge, which is also the most important one, is to avoid memory leak after crash recovery. Memory leaks on NVM are more severe than in volatile memory system, since leaks are permanent, and cannot be reset by restarting the system. Memory leaks occur if the system crashes after a block is marked as allocated, retuened by the allocator, but before it is linked into the application. In fact, in the most common programming paradigm using malloc(), the pointer returned is first stored in a volatile local variable (or register, which is also volatile), whose content is initialized, and then transferred to a field in the persistent data structure. In this paradigm, there is a short time window in which the allocated block belongs to neither the allocator nor the data structure. If crash occurs in this time window, the block will be permanently lost, since no reference to the block exists after crash recovery. To solve this problem, prior researches propose using a new allocation paradigm, which consists of three steps. In the first reservation step, the address of the memory block is returned to the caller, but the allocation is still held pending. If a crash happens, the allocator owns the block, since only volatile state is modified to reflect the fact that the block has been allocated. In the second initialize step, the application initializes the content of the block. In the last activation step, the address of the target word in persistent data structure, as well as the pointer to the block, is passed to an activation function. The activation function atomically transfers the ownership of the block from the allocator to the application, by directly writing the address of the block to the target word and removing the block from the allocator’s metadata the NVM. This process can be implemented with redo write-ahead logging, which commits the activation by the last atomic persistence of the log commit mark. If crash happens after the commit point, the recovery routine will locate and replay the log, which makes it appear that the crash actually happened after the application has linked the block into the data structure.
Makalu assumes the following configuration. The system is installed with both DRAM and NVM. The allocator requests a bulk of memory from the NVM region as a heap. The high end of the heap can be extended further if more memory is needed. Global metadata is maintained at the very beginning of the heap. These global metadata include the starting and end address of the heap, an array of root objects, the address of the block header map, the logging area, the current log version, the addresses of block headers, etc. The paper assumes that the file is always mapped to the same virtual address, and hence absolute virtual address pointers can be used. The heap storage is divided into three parts. The first part stores the heap metadata as descrived above. The second part stores block headers and data. The third part stores block header map, which maps arbitrary address within the data area to the corresponding block header address. The paper does not mention the details of the header map and how it is allocated and stored, so we assume the block header map is simply a hash table in a reserved chunk of memory, which can be stored at the end of the heap area every time a heap is created or expanded (on expansion, just reserve sufficient number of mapping area at the end of the expanded heap, and copy existing header to the new heder map). Memory in the second part is managed in chunks and blocks. A block is defined as 4KB of NVM memory, which is either used to serve large allocations, or divided into equal sized “objects” to serve small and medium allocations. A chunk is a continuous range of blocks, which is maintained by only one chunk header (there is a type field in the header) to save metadata. Initially, the entire heap consists of a single chunk, which is then divided into different sized chunks and blocks as memory requests are served. One difference between Makalu and other NVM allocators is that Makalu stores block headers separately in a well-known location. There can be multiple block header regions within the heap. This happens when the current pool of headers run out, while there are still free storage in the heap, in which case we expand the end of the heap by a small amount to make a new block header region. Although not mentioned by the paper, all headers regions are supposed to be linked together as a linked list, such that a full scan of the linked list can discover all blocks and chunks in the current heap. The allocator always updates block headers atomically using undo logging to ensure a consistent view of blocks and chunks after recovery. Within the block header stores the type and size class of the block (if it is used to allocate small and medium objects), the flag indicating whether the allocator owns the block (1 means not owning), and the bitmap for object allocation within the block. A chunk header only stores the size of the chunk in the number of blocks and the osnership flag.
Makalu also assumes that objects allocated to the application are accessed via pointers. The application using Makalu must guarantee that objects not reachable from one of the 256 roots in the heap header are garbage nodes, which can be reclaimed after crash recovery using an offline mark-and-sweep GC. The offline GC starts at each root pointer, and marks the block as being used in the header (if it is a large object), or marks the block as being used and sets the corresponding allocation bitmap in the block header, if it is a small or medium sized object. Block header addresses can be obtained from the block header map. Thie process continues recursively for every pointer field in the root object, until all reachable objects are marked. The allocator frees those blocks or objects that appear to be allocated, but unmarked.
Makalu solves the first two challenges exactly as we have described in the previous paragraph. To solve metadata consistency problem, Makalu makes a clear distinction between core and auxiliary metadata. Core metadata consists of the heap header, block headers, and the metadata for the persistent file (assuming DAX file access protocol using mmap()). Core metadata is always persistent by using undo logging, and is maintained on the NVM. Auxiliary data structures, such as block and object free lists, are maintained only in DRAM, which will be lost after a crash or normal shutdown. Other auxiliary data structures, such as block header maps, and bitmaps within object block headers, have one persistent copy in NVM and one volatile copy in DRAM. Only the volatile copy is used and updated during regular operation. On a normal shutdown, these metadata are written back to the corresponding NVM area to avoid a time consuming rebuild on every restart after normal shutdown.
Makalu solves double allocation problem after the crash by always marking the block to be allocated (or partially allocated) before returning the address of the object. Every block and chunk header has a field indicating whether the block has been allocated to be used or in the free list. Before using the block or chunk to fulfill an allocation request, the header field is set and persisted, which effectively gives up ownership of the block or chunk (but part of the block may still be free). If the system crashes after the block header field is set, this block would appear to have been allocated to the application, causing memory leak without proper handling. In Makalu, the offline garbage collection ensures that even if such blocks exist, they will be discovered by the post-crash GC, and the ownership will be returned to the allocator.
Makalu also maintains a set of DRAM data structures to facilitate allocation. For each thread, there is a thread-local object list, which is a free list of objects (using object itself to store “next” pointer). Each size class has its own free list, which is used to serve small allocation requests falling into the particular size class. Three global lists are maintained to serve medium and large allocations. Medium allocation uses the global free list, which is similar to the per-thread segregated free list, except that every size class has a lock, which allows allocations on different sizes to proceed individually. Large allocations directly use the global chunk list to acquire a chunk of the requested size. In addition, partially full blocks are also maintained in a global reclamation block list. This list stores free lists of blocks of different size classes. Each block in the list is partially allocated (i.e. some objects are allocated and some are not). This list is the result of truncating the small and medium sized free list, and also will be generated during offline garbage collection. In the former case, objects in the free lists are returned to the corresponding blocks. A block may not be fully empty after the truncation, which qualifies it to be added into the reclamation list. In the latter case, the GC process will mark all objects that are currently in-use within small and medium size class blocks. Those that are not marked are still free, which will cause the block being added into the reclamation list also.
We next describe the allocation process. For small objects (64 - 400 bytes), we first round the requested size to the nearest multiple of 16 bytes, which is the smallest granularity of object allocation. We then check whether the thread-local object list is empty. If not, an object is returned to the caller directly. If the list is empty, we check the block reclamation list for partially allocated blocks of the same size class. If there is a block, it will then be removed from the list, whose free objects are added into the thread-local free list. The block no longer needs to be tracked after this, since we can compute the block address using object’s address. If the reclamation list is also empty, our last resort is to allocate a free block from the global chunk list. This process involves removing a block from the chunk list, initializing and adding the new block header to the header area on NVM, adding the mapping between block address and block header to the block map, and finally updating the old block header. This process is not intrinsically atomic since it contains several non-atomic multi-cache line writes. We use undo logging to ensure the atomicity of operation (see below).
Medium sized objects are allocated in the exact same way as small objects, except that objects are removed from the global free list rather than thread-local free list. If no object is available, a new block is allocated from the global chunk list using undo logging as well. Large objects directly allocate from the global chunk list.
The logging scheme is described as follows. For each step during allocation, we copy the before value of the field to be updated to the logging area located at the heap header, before updating the field in-place. Note that the log entry must be flushed before update can be done. Each log entry contains the address, the old value, and the current log version (also located in the header). After all updates are flushed, we commit the operation by atomically incrementing the log version counter in the heap header. This can be done by a processor fetch-and-increment instruction followed by a flush. After this point, all undo log entries are invalidated, since the version of the log does not match the version in the heap header.
When an object is freed, it is directly added into the thread-local list of the deallocating thread. Note that Makalu does not maintain thread ID for each object denoting the ID of the thread-local pool that this object is allocated from. The object will be returned to the block if the free list of this size class is longer than a threshold, at which point the free list will be truncated. Objects removed from the list will be returned to the corresponding block. Large objects are freed by directly adding the chunk back to the chunk list, possibly merging this chunk with neighboring chunks to reduce fragmentation. The header used by the deallocated object will be freed also. Occasionally, blocks in the reclamation list can become fully empty as a result of free list truncation. We return the clean block back to the free chunk list in the same way as we deallocate a large object.
On a normal shutdown, Makalu will write back the DRAM version of the header map back to the NVM to avoid rebuilding this structure at the next startup. In addition, all objects in both free lists are returned to the corresponding blocks, updating their allocation bitmap, which are then flushed back to the NVM. Reclamation block headers are also written back.
On a normal startup, the allocator first loads the header map into memory, and then initializes free list as empty. The free chunk list and reclamation list are also initialized by scanning the free header map for unallocation chunks and partially allocated blocks, respectively. Normal execution could proceed after we rebuild the DRAM data structure.
On a post-crash recovery, both the header map and block bitmap can be in an inconsistent state. To solve this, we we first check the logging area for any pending operations that have not yet committed. If entries exist, and the version of the entries match the current log entry in the heap header, the pending operation is rolled back by copying the before image back to the indicated address. We increment and flush the log version counter after recovery to commit the completion of recovery. We then scan the header areas within the heap, and rebuild the block header map. For small and medium sized blocks, we clear the allocation bitmap in the header to avoid using potentially inconsistent metadata. The GC process is then invoked, which swaps all objects reachable from the root objects, and marks them as alive. The paper assumes that type information is embedded in the object, such that the GC knows which value is a pointer. Persistent-to-volatile pointers are ignored. For every object identified as live object, we obtain its header via the global header map. If the object is a huge block, then the object header (chunk header) is set as being used. The rest of the chunks are added into the free chunk list. For small and medium sized objects, we locate its block header, and sets the corresponding bit in the bitmap. These blocks are also added into the block reclamation list. The rest of the blocks are also added back to the global chunk list, possibly merging with neighboring chunks. Normal execution could resume after recovery.