Z-Rays: Divide Arrays and Conquer Speed and Flexibility
Link: https://dl.acm.org/doi/10.1145/1806596.1806649
Venue: PLDI 2010
Keyword: Array; Z-Ray; Data Structure
Highlight:
- Using a non-contiguous chunk design for arrays while incurring only a constant number of indirections per access (at most one). This is better than tree-based array implementations, where access time is not even constant.
- Leveraging the observation that most array accesses fall within the first 4KB of elements.
- With the extra indirection in place, zero elimination and copy-on-write (CoW) come essentially for free.
Questions:
- It is true that discontinuous arrays reduce large chunk allocations, but when the array gets really large, the spine itself becomes a large object, which can cause the same problem (though to a lesser degree than a contiguous chunk). Although this does not offset the benefits of Z-Rays, the paper should mention it as a limitation.
- It is not clear how GC is performed for CoW-shared arraylets. If an arraylet is shared by more than one spine and one of the spines is GC’ed, how do we deal with the arraylet pointers in the other spines? If no special action is taken, the arraylet will become unreferenced after the next CoW, since the write protocol requires that a new arraylet be allocated. Does the GC thread scan for unreferenced arraylets and collect them?
This paper presents Z-Rays, a discontinuous array implementation featuring GC-friendliness and storage advantages. As one of the most commonly used data structures in programming, arrays have long been implemented as a large, contiguous chunk of memory that can be linearly addressed given an index value. This simple design, however, has a few disadvantages when the array becomes large. First, large array objects require a big chunk of virtual address space, which can become an issue for managed languages, where background garbage collection (GC) threads may need to copy objects around during GC. Large objects therefore hurt GC performance by introducing huge object-copy overhead. On real-time platforms, such unpredictable background activity affects the timeliness of operations, making the system less reliable. Second, allocating large chunks introduces fragmentation in the address space, especially when multiple large chunks are present, lowering the utilization of allocator memory. In addition, most allocators are tuned for allocating small objects, not large chunks. Lastly, some big arrays are sparse, containing mostly zeros; storing all these zeros in a contiguous chunk wastes memory and unnecessarily inflates the application's footprint.
Z-Rays addresses the above problems by using uniformly sized, discontinuous, smaller chunks as the data store, organized into a two-dimensional structure for fast access. The array itself is allocated as a wrapper object, called the “spine”, which contains the array of chunk pointers and other metadata, such as the number of elements. Note that Z-Rays only implements fixed-size arrays, i.e., arrays whose sizes are fixed at allocation time and will not expand or shrink; given the fixed chunk size, the spine layout is fully determined by the array size. The chunks, called “arraylets”, are smaller arrays whose size is a compile-time constant. Array elements are still linearly mapped within a chunk, while at the Z-Ray level, at most one indirection is required to access an element, preserving the constant-access-time expectation.
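As a concrete picture, here is a minimal C sketch of a baseline spine for an int32_t array. The type and constant names (zray_spine, ARRAYLET_ELEMS) are my own illustration, not the paper's implementation:

```c
#include <stddef.h>
#include <stdint.h>

#define ARRAYLET_BYTES 4096                          /* compile-time arraylet size */
#define ARRAYLET_ELEMS (ARRAYLET_BYTES / sizeof(int32_t))

typedef struct {
    size_t   length;        /* total element count, fixed at allocation */
    /* floor(length / ARRAYLET_ELEMS) arraylet pointers, plus one extra
     * slot pointing at the inlined remainder elements, which are placed
     * right after this pointer array inside the same spine allocation */
    int32_t *arraylets[];
} zray_spine;
```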
We next describe the baseline Z-Ray. In the baseline, given an array size of S and an arraylet size of M (both in numbers of elements), the spine wrapper object is allocated with an array of floor(S / M) chunk pointers, each pointer referencing a conventional array of M elements. In addition, the remaining (S mod M) elements are inlined at the end of the spine, if S is not a multiple of M. Z-Ray chooses not to store them as a smaller or partially full arraylet in order to: (1) simplify memory management, since only two size classes are then needed (the spine and the arraylet); and (2) avoid wasting storage by dedicating a full-sized arraylet to a few elements. An extra pointer is appended to the pointer array, pointing to the first element of the inlined remainder; this way, addresses of remainder elements are computed the same way as those of other elements. The address of the element at index i is computed by adding the arraylet base address stored in slot floor(i / M) of the spine to (i mod M) multiplied by the element size. An array bounds check must be performed before the address computation to avoid using an invalid arraylet base address.
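A sketch of the baseline read path under the layout above; note that the extra remainder pointer in slot floor(S / M) lets the same formula cover the inlined remainder elements:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline element read, reusing the hypothetical zray_spine sketch above. */
static int32_t zray_get_baseline(const zray_spine *z, size_t i) {
    assert(i < z->length);                        /* bounds check first */
    int32_t *arraylet = z->arraylets[i / ARRAYLET_ELEMS];
    return arraylet[i % ARRAYLET_ELEMS];          /* at most one indirection */
}
```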
The paper then proposes the first-N optimization to avoid the extra indirection on most element accesses. It leverages the observation that 90% or more of array accesses lie within the first 4KB of elements. The first-N optimization inlines the first N bytes of elements in the spine object, where N is 4KB by default. Accesses within the first N bytes therefore involve only one memory access, plus a comparison of the index against the first-N boundary. The paper reports that first-N cuts the extra indirection overhead in half. Addresses of elements beyond the first N are computed by first subtracting (N / element_size) from the index and then following the same rule as in the baseline design.
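A sketch of the first-N fast path, building on the baseline sketch above and assuming a hypothetical first_n[] field inlined before the pointer array (names again illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIRST_N_BYTES 4096
#define FIRST_N_ELEMS (FIRST_N_BYTES / sizeof(int32_t))

/* Hypothetical first-N variant of the spine: the first 4KB of elements
 * live inline, before the arraylet pointer array. */
typedef struct {
    size_t   length;
    int32_t  first_n[FIRST_N_ELEMS];  /* inlined first N bytes of elements */
    int32_t *arraylets[];             /* pointers for the rest (+ remainder) */
} zray_spine_fn;

static int32_t zray_get(const zray_spine_fn *z, size_t i) {
    assert(i < z->length);
    if (i < FIRST_N_ELEMS)
        return z->first_n[i];          /* one access, no indirection */
    size_t j = i - FIRST_N_ELEMS;      /* rebase past the inline region */
    return z->arraylets[j / ARRAYLET_ELEMS][j % ARRAYLET_ELEMS];
}
```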
The next optimization is zero elimination. In a sparse array, where most elements hold the trivial value zero, the array can be compressed by eliminating large ranges of zeros and storing them only implicitly. Z-Rays performs zero elimination at arraylet granularity, meaning that an arraylet full of zeros is not explicitly materialized; instead, its pointer slot in the spine is set to NULL. Read operations, on finding a NULL value in the spine, immediately return zero without dereferencing the pointer. Write operations must allocate an arraylet if the pointer is NULL, initialized to all zeros except for the element being written (zero writes can be elided as well). Note that zero elimination only applies to the middle part of the array; neither the first-N region nor the remainder can be eliminated, since they are stored inline.
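A sketch of zero-eliminated reads and writes, continuing the sketches above (j is the index already rebased past first-N; calloc stands in for the VM's zero-initialized arraylet allocation, and error handling is omitted):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int32_t zray_read_ze(const zray_spine_fn *z, size_t j) {
    int32_t *a = z->arraylets[j / ARRAYLET_ELEMS];
    return a ? a[j % ARRAYLET_ELEMS] : 0;    /* NULL slot => implicit zeros */
}

static void zray_write_ze(zray_spine_fn *z, size_t j, int32_t v) {
    size_t   slot = j / ARRAYLET_ELEMS;
    int32_t *a    = z->arraylets[slot];
    if (a == NULL) {
        if (v == 0)
            return;                           /* writing zero stays implicit */
        a = calloc(ARRAYLET_ELEMS, sizeof(int32_t)); /* fresh all-zero arraylet */
        z->arraylets[slot] = a;
    }
    a[j % ARRAYLET_ELEMS] = v;
}
```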
The last optimization is copy-on-write (CoW), which reduces the number of element copies during array copying. When a Z-Ray is duplicated or copied to another, only the spine is copied, not the arraylets, which are then shared between the two spine instances. To indicate that an arraylet is shared and cannot be written directly, the paper proposes dedicating one bit of the pointer value as a “taint bit”. Given that pointer values have a few redundant bits due to address alignment and OS address mapping, the taint bit can be either a low bit or a high bit of the pointer. The taint bit is checked before an arraylet is written. If it is set, the arraylet is duplicated and linked into the Z-Ray instance under modification before the write is performed on the new arraylet.
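A sketch of the taint-bit protocol, assuming the low pointer bit is used and arraylets are suitably aligned so that real pointers have this bit clear; helper names are illustrative:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TAINT_BIT ((uintptr_t)1)

static int is_tainted(const int32_t *p) { return ((uintptr_t)p & TAINT_BIT) != 0; }
static int32_t *untaint(const int32_t *p) { return (int32_t *)((uintptr_t)p & ~TAINT_BIT); }

/* Check-and-copy step run before any write to arraylet slot `slot`.
 * A NULL (zero-eliminated) slot is untainted and falls through to the
 * zero-elimination write path sketched earlier. */
static int32_t *cow_before_write(zray_spine_fn *z, size_t slot) {
    int32_t *p = z->arraylets[slot];
    if (is_tainted(p)) {
        int32_t *copy = malloc(ARRAYLET_BYTES);  /* private duplicate */
        memcpy(copy, untaint(p), ARRAYLET_BYTES);
        z->arraylets[slot] = copy;               /* untainted from now on */
        p = copy;
    }
    return p;
}
```

Presumably both spines' pointers are tainted after a copy, so each performs its own duplication on first write; this also means neither spine “owns” the original arraylet, which feeds directly into the GC question raised above.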
The paper also discusses the interaction between Z-Rays and generational GC. In generational GC, objects are first allocated in a nursery heap, under the assumption that most objects die young. The GC therefore scans the nursery at a higher frequency than the mature heap, which stores objects that have survived nursery collections. Objects still alive after a nursery GC are moved to the mature heap and scanned less frequently, since old objects are expected to have longer lifespans. Z-Rays reduces copy traffic on this nursery-to-mature promotion path: arraylets are allocated in a separate heap using simple slab allocation, with only the spine being allocated in the nursery. Arraylets, therefore, are not scanned or copied by the GC thread; an arraylet's lifespan depends solely on that of its spine object. When the spine object is GC’ed, the arraylets it references are reclaimed as well, unless they are shared.
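A sketch of reclamation when a spine dies, assuming private (untainted) arraylets are freed along with it; how shared (tainted) arraylets eventually get reclaimed is exactly the open question raised in the Questions section:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Invoked when the collector determines a spine is dead. */
static void zray_on_spine_death(zray_spine_fn *z, size_t num_arraylets) {
    for (size_t s = 0; s < num_arraylets; s++) {
        int32_t *p = z->arraylets[s];
        if (p != NULL && !is_tainted(p))
            free(p);   /* private arraylet dies with its spine */
        /* NULL (zero-eliminated) and tainted (shared) slots are skipped */
    }
}
```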