Bridging the Performance Gap for Copy-based Garbage Collectors atop Non-Volatile Memory
Link: https://dl.acm.org/doi/10.1145/3447786.3456246
Year: EuroSys 2021
Keyword: Java; Garbage Collection; NVM
Highlights:
- Generational copy GC generates a large quantity of small and random writes, mostly due to object copies and to pointer updates in the header of the source object. These small and random writes are detrimental to performance on NVM, and when mixed with reads they saturate NVM's operation bandwidth faster, causing performance degradation.
- To remove the small and random writes, we can (1) use DRAM as a temporary buffer to hold the copied objects, thus eliminating the small writes to NVM and writing back only the consolidated buffer using large, sequential writes; and (2) use DRAM data structures to track the old-new object relation instead of tracking it with pointers embedded in object headers.
This paper presents an improved Java Garbage Collection (GC) algorithm optimized for byte-addressable Non-Volatile Memory (NVM). The paper is motivated by the performance degradation of Java's default GC algorithm when deployed on NVM, due to NVM's low write bandwidth, especially under random I/O. The proposed algorithm transforms the read/write pattern of GC, which consists of many small writes with poor locality, into large, sequential writes by leveraging DRAM as an intermediate buffer. Consequently, the improved algorithm achieves higher performance, as GC utilizes NVM more efficiently and hence consumes fewer cycles on the execution critical path.
The paper aims at optimizing the GC algorithm of the Java Virtual Machine (JVM) when deployed on NVM. The JVM provides a managed execution environment where heap allocations need not be explicitly freed by the programmer. Instead, the virtual machine maintains internal tracking data structures such that objects living on the heap can be automatically deallocated once they are no longer referenced by the execution context, i.e., have become “dead”. While the GC algorithm in question, Garbage-First GC or “G1GC”, consists of several smaller subroutines to deal with different cases, the paper focuses on the most common and most frequently invoked routine, which we describe as follows. The Java heap allocator places objects (regardless of their sizes, unlike a general heap allocator) in memory blocks called regions using a simple bump-pointer allocation strategy, and objects are never explicitly freed. The JVM maintains two types of regions: young regions, which contain objects that were allocated recently and have not gone through any GC pass, and old regions, which have been GC’ed at least once. For every young region, the GC algorithm keeps track of a set of objects in the old space, called the remembered set, that hold references to objects in the young region. Note that the paper neglects the persistence aspect of NVM and treats it simply as a larger version of DRAM.
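To make the region layout concrete, below is a minimal C++ sketch of a G1-style region with bump-pointer allocation. The field names and the region size are hypothetical illustrations, not code from the paper or from HotSpot.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical G1-style heap region; names and sizes are illustrative.
struct Region {
    static constexpr size_t kSize = 1u << 20;  // fixed-size region, e.g. 1 MiB
    enum class Kind { Young, Old } kind;       // young: never collected yet
    uint8_t* base;                             // start of the region's memory
    uint8_t* top;                              // bump pointer: next free byte

    // Bump-pointer allocation: individual objects are never freed; the
    // whole region is reclaimed at once when the GC evacuates it.
    void* allocate(size_t bytes) {
        if (top + bytes > base + kSize) return nullptr;  // region is full
        void* obj = top;
        top += bytes;
        return obj;
    }
};
```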
During garbage collection, the algorithm first allocates a new region in the old space as the destination of the object copies. Then, for every young region, the algorithm traverses its remembered set and copies every object that is still referenced by an old-space object into the newly allocated region. For each object copied in this process, the algorithm also enqueues the other young objects it references, recursively marking live objects in the young space so that they will eventually be copied as well. The marking is implemented as a graph Depth-First Search (DFS) using a LIFO stack (for better locality). Additionally, after an object is copied from the young space to the old space, the reference pointer that pointed to the object is updated to point to the new copy. A pointer to the destination copy is also stored in the header of the source object. This header pointer serves as a trampoline: if the same object is reached through another reference later, the algorithm can tell that the object has already been GC’ed and correctly update that reference as well.
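The evacuation step might look roughly like the following sketch, again with hypothetical types; `allocate_in` and `in_young` are assumed helpers. The header word doubles as the forwarding (“trampoline”) pointer, and the explicit LIFO stack drives the DFS.

```cpp
#include <cstddef>
#include <vector>

struct Region;  // destination old-space region (see the sketch above)

// Illustrative object layout only; a real JVM header packs class
// metadata and mark bits into the same word.
struct Object {
    Object* forward = nullptr;       // header word reused as forwarding pointer
    std::vector<Object*> fields;     // reference slots to other objects
};

// Assumed helpers (not defined here): bump-allocate in `dest` and copy
// the object body; test whether an object lives in the young space.
Object* allocate_in(Region& dest, const Object* src);
bool in_young(const Object* obj);

// Copy one live young object into the old-space destination, leave a
// forwarding pointer in the source header, and push the copy's
// reference slots for the DFS traversal.
Object* evacuate(Object* src, Region& dest, std::vector<Object**>& stack) {
    if (src->forward) return src->forward;   // already copied: follow trampoline
    Object* copy = allocate_in(dest, src);   // the object copy itself
    src->forward = copy;                     // small random write to the header
    for (Object*& slot : copy->fields)       // schedule referents for marking
        stack.push_back(&slot);
    return copy;
}

// LIFO drain = depth-first traversal; each popped reference slot is
// updated in place to point at the destination copy.
void drain(std::vector<Object**>& stack, Region& dest) {
    while (!stack.empty()) {
        Object** slot = stack.back(); stack.pop_back();
        if (*slot && in_young(*slot))
            *slot = evacuate(*slot, dest, stack);
    }
}
```

Note how both the object copy and the header write in `evacuate` land on NVM in the baseline design; these are exactly the small, random writes the paper targets.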
To identify the performance traits of the above GC process on NVM, the paper conducts experiments that compare execution latency and memory bandwidth consumption between GC running on DRAM and on NVM. The paper makes three critical observations. First, a higher percentage of execution cycles is spent on GC when running on NVM. In particular, NVM-based GC suffers a greater slowdown than the rest of the execution, and a greater slowdown than the average over all benchmarks, suggesting that GC becomes a bigger bottleneck on NVM than on DRAM. Second, when measuring bandwidth consumption over an execution involving both GC and non-GC code, the paper observes that bandwidth consumption drops when GC is triggered, indicating that GC saturates the NVM device on some dimension other than raw byte bandwidth. By contrast, when executing on DRAM, bandwidth consumption increases during GC, as the DRAM device is not saturated. The paper attributes this to the large quantity of small and random I/O requests generated by GC, which can easily saturate NVM’s operation bandwidth, i.e., the number of device operations it can serve per unit time. Lastly, when the number of threads grows beyond eight, bandwidth consumption on NVM stops scaling, while on DRAM it keeps scaling up to a higher thread count. This demonstrates, from a different perspective, that GC saturates NVM even at a low thread count and is therefore not scalable.
To address the performance issues identified above, the paper proposes two design changes to the existing algorithm. The first is to use a DRAM buffer as the temporary copy destination while a region is being filled. By diverting writes from the new region allocated on NVM to a temporary region in DRAM, the GC process no longer issues small and random writes to NVM in the form of object copies. Instead, objects are written to the DRAM buffer, “consolidated” there, and, when the buffer becomes full, written back to the NVM region with large, sequential writes. To avoid updating references to GC’ed objects twice (first to point into the DRAM buffer, and again to the newly allocated NVM region when the buffer is flushed), the algorithm reserves a shadow region on NVM every time a DRAM region is created. References are updated to point directly into the shadow region on NVM, which becomes canonical once the DRAM region is written back.
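A minimal sketch of the shadow-region idea follows, assuming the DRAM buffer and its reserved NVM shadow have the same size (all names are hypothetical). The address handed back by `stage` already points into the NVM shadow, so references never need a second fixup.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical DRAM staging buffer paired with an NVM "shadow" region
// that is reserved at the moment the buffer is created.
struct StagedRegion {
    static constexpr size_t kSize = 1u << 20;
    uint8_t* dram;      // temporary copy destination in DRAM
    uint8_t* shadow;    // region reserved on NVM at creation time
    size_t   used = 0;

    // Objects are copied into DRAM, but the address returned (and
    // installed into references) already points into the NVM shadow
    // region, so no second round of pointer updates is needed.
    void* stage(const void* obj, size_t bytes) {
        if (used + bytes > kSize) return nullptr;   // full: flush, then retry
        std::memcpy(dram + used, obj, bytes);       // small write, but to DRAM
        void* canonical = shadow + used;            // future (canonical) NVM address
        used += bytes;
        return canonical;
    }

    // One large, sequential write back to NVM once the buffer is full.
    void flush() { std::memcpy(shadow, dram, used); }
};
```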
The second problem lies in the trampoline pointer in the header of the source object. During GC, this pointer is set to point to the destination copy so that other references to the object, when traversed, can be correctly updated to point to the destination as well. However, these header updates also generate small and random writes that are particularly detrimental to performance. Fortunately, they can be eliminated by using a separate DRAM-resident data structure to track the source-destination relation of GC’ed objects. To this end, the paper proposes a shared, lock-free map that, given the address of a source object as the key, returns the address of its destination copy. To bound the amount of DRAM the map can use, the map adopts an open-addressing hash table scheme and may fail to insert after a certain number of hash conflicts. If an insertion fails, the algorithm falls back to storing the pointer in the source object’s header.
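The forwarding map could be sketched as below: a fixed-size, open-addressing table with CAS-based insertion and a bounded probe count. The table size, probe limit, and hash mixer are illustrative guesses, and a production version would also have to handle the brief window in which a key is visible before its value has been stored.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Sketch of a bounded, lock-free forwarding map; the fixed table is
// what bounds the map's DRAM usage.
class ForwardingMap {
    static constexpr size_t kSlots = 1u << 20;  // fixed capacity (~16 MB here)
    static constexpr int kMaxProbes = 8;        // give up after this many conflicts
    struct Slot {
        std::atomic<uintptr_t> key{0};
        std::atomic<uintptr_t> val{0};
    };
    Slot table_[kSlots];                        // heap-allocate in practice

    static size_t hash(uintptr_t k) {           // Fibonacci hashing, top 20 bits
        return static_cast<size_t>((k * 0x9E3779B97F4A7C15ull) >> 44);
    }

public:
    // Returns false on failure; the caller then falls back to storing
    // the forwarding pointer in the source object's header.
    bool insert(uintptr_t src, uintptr_t dst) {
        size_t i = hash(src);
        for (int p = 0; p < kMaxProbes; ++p, i = (i + 1) % kSlots) {
            uintptr_t expected = 0;
            if (table_[i].key.compare_exchange_strong(expected, src) ||
                expected == src) {              // claimed a free slot, or same key
                table_[i].val.store(dst);
                return true;
            }
        }
        return false;                           // too many hash conflicts
    }

    // Returns 0 if no entry exists (callers must then check the header).
    uintptr_t lookup(uintptr_t src) const {
        size_t i = hash(src);
        for (int p = 0; p < kMaxProbes; ++p, i = (i + 1) % kSlots)
            if (table_[i].key.load() == src) return table_[i].val.load();
        return 0;
    }
};
```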
The paper also proposes a few extra optimizations. First, object copies from the source region to the DRAM buffer can use non-temporal writes, which bypass the cache hierarchy and therefore do not pollute the cache. Second, since DRAM and NVM are separate devices, writing a DRAM region back to NVM can be carried out by another thread and partially overlapped with object copying. However, the paper notes that the write-back may only start after the DFS traversal has completed for all objects in the DRAM region. To track the progress of DFS, the paper leverages the fact that the first object pushed by DFS is also the last to be popped (due to the LIFO stack), so the first object pushed onto the DFS stack serves as an indicator of whether the region can be written back to NVM. Lastly, to increase the cache hit ratio of the DFS traversal, the paper proposes issuing prefetch instructions for an object when its address is pushed onto the DFS stack. The paper also observes that prefetching benefits NVM more than it benefits DRAM, due to NVM’s higher access latency.
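Below is a sketch of the first and third optimization using x86 intrinsics; it assumes the destination is 16-byte aligned and the size a multiple of 16, and it omits the overlapped write-back thread.

```cpp
#include <immintrin.h>
#include <cstddef>

// Non-temporal copy: streaming stores bypass the cache hierarchy so
// staging copies do not evict the application's working set. Assumes
// `dst` is 16-byte aligned and `bytes` is a multiple of 16.
void copy_nontemporal(void* dst, const void* src, size_t bytes) {
    auto* d = static_cast<char*>(dst);
    auto* s = static_cast<const char*>(src);
    for (size_t i = 0; i < bytes; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + i));
        _mm_stream_si128(reinterpret_cast<__m128i*>(d + i), v);
    }
    _mm_sfence();  // make the streaming stores visible before publishing
}

// Called when an object's address is pushed onto the DFS stack, so the
// line is likely cached by the time the object is popped and scanned.
inline void prefetch_for_dfs(const void* obj) {
    _mm_prefetch(static_cast<const char*>(obj), _MM_HINT_T0);
}
```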