Mallacc: Accelerating Memory Allocation
Link: https://dl.acm.org/citation.cfm?doid=3037697.3037736
Venue: ASPLOS 2017
Keyword: malloc; Accelerator; Special Purpose Hardware
This paper proposes Mallacc, a special-purpose hardware design that accelerates the C heap memory allocation API: malloc(). Malloc, as one of the most important C library functions, is performance-critical and is expected to complete within a few tens of cycles. We briefly describe common designs of state-of-the-art malloc in this paragraph. Users call malloc() with the size of the request, which is the minimum amount of memory that the allocator should return. On receiving the request, the allocator aligns the requested size and rounds it up to the nearest size available for allocation. Keeping the number of available sizes in a reasonable range strikes a balance between internal fragmentation (memory under-utilization, because not all of the allocated space is used by the user program) and the space taken to maintain metadata, which can hurt performance if it grows too large to fit in the cache. Free memory blocks of each size class are usually maintained per size bucket in the form of linked lists. If the list is non-empty, the allocator removes the head element from the list and returns it to the user. To avoid synchronization overheads, each thread maintains its own set of size buckets. If a size bucket is empty, the allocator tries to claim more memory blocks either from a larger bucket or from the higher-level memory pool. The memory pool is a shared data structure, which itself can be hierarchical and consist of multiple smaller pools. Accesses to the memory pool usually require locking or other forms of synchronization. During normal operation, the first-level buckets are accessed most frequently; this is the fast-path of allocation. The paper notes that the fast-path is already highly optimized, taking only 20 to 30 cycles to complete. Free operations are treated almost the same way as allocations: once the bucket is determined, the freed memory block is inserted at the head of the linked list. Migrations of blocks between threads and between different levels of memory pools are also implemented to bound total memory usage. We do not dwell on these details, because block migration policies differ greatly among implementations and are not on the fast-path.
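To make this concrete, below is a minimal C sketch of a size-class allocator of the kind described above. All names (simplified_malloc, bucket, refill_from_central_pool) are illustrative, not taken from any real allocator, and the rounding rule is deliberately simplistic:

```c
#include <stddef.h>

#define NUM_CLASSES 64               /* illustrative; real allocators differ */

/* A free block stores the pointer to the next free block in its own
 * first bytes, so the free list needs no separate node storage. */
typedef struct FreeBlock { struct FreeBlock *next; } FreeBlock;

/* Per-thread buckets: one singly linked free list per size class,
 * so the fast-path needs no synchronization. */
static __thread FreeBlock *bucket[NUM_CLASSES];

/* Hypothetical slow-path: claim blocks from the shared pool (elided). */
static void *refill_from_central_pool(int cls) { (void)cls; return NULL; }

/* Illustrative rounding: 8-byte granularity, i.e., sizes 1..8 map to
 * class 1, 9..16 to class 2, and so on. */
static inline int size_to_class(size_t size) { return (int)((size + 7) >> 3); }

void *simplified_malloc(size_t size) {
    int cls = size_to_class(size);        /* round up to a size class   */
    FreeBlock *head = bucket[cls];        /* fast path: per-thread list */
    if (head != NULL) {
        bucket[cls] = head->next;         /* pop the head block         */
        return head;
    }
    return refill_from_central_pool(cls); /* empty bucket: slow path    */
}

/* Real allocators recover the size class from metadata; we pass the
 * size explicitly to keep the sketch short. */
void simplified_free(void *ptr, size_t size) {
    int cls = size_to_class(size);
    FreeBlock *blk = ptr;
    blk->next = bucket[cls];              /* push onto the list head    */
    bucket[cls] = blk;
}
```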
In the following sections, we use tcmalloc as a typical example to illustrate how Mallacc improves the performance of memory allocation. The hardware design, however, is fully general and can equally be applied to other malloc implementations (in part because the fast-path is performance-critical, different implementations' fast-paths end up largely the same, or at least very similar, to one another).
Despite the fast-path being already highly optimized, the paper observes two major sources of slowdown in it, described as follows. First, during the initial stage where the request size is rounded up to an existing size class, lookup tables are used to translate the aligned request size into an actual allocation size. To accelerate this translation, tcmalloc uses the aligned request size as an index and performs two table lookups. Both lookup tables are static, read-only areas of memory organized as arrays. The request size (after alignment) is used directly as the index into the first table to compute the size class. The size class is then used to index the second lookup table, whose output is the index of the bucket where the free list is maintained. Both arrays are small enough that there is a non-negligible chance they remain in the LLC or even the L1. The lookups are nevertheless slow, because the two load instructions are dependent and sit on the critical path of malloc, and hence cannot be overlapped or speculated. The second source of slowdown is the overhead of maintaining the linked list after the bucket has been identified. To remove the first element from the linked list and return it as the allocated block, two loads and one store are needed: the first load reads the pointer to the head node, and the second load reads the pointer to the next node (which is stored inside the unallocated block itself). A store instruction then updates the bucket to point to the next block in the list, using the result of the second load. These three memory instructions can become a bottleneck for two reasons. First, they depend on each other, which makes overlapping them impossible. Second, the linked structure may not enjoy the good cache hit ratio of the array lookups from the previous stage, because modern cache hierarchies and prefetchers do not handle pointer-chasing linked structures well.
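The two dependency chains can be seen in the following sketch of the fast path. Table and variable names are ours, not tcmalloc's actual identifiers, and the array sizes are placeholders:

```c
#include <stddef.h>

typedef struct FreeBlock { struct FreeBlock *next; } FreeBlock;

/* Stand-ins for tcmalloc's static read-only lookup arrays and the
 * per-class buckets; sizes and layouts are illustrative. */
static int        class_index_table[2048];   /* aligned size -> class */
static size_t     class_to_size_table[128];  /* class -> alloc size   */
static FreeBlock *bucket[128];               /* per-class free lists  */

void *fast_path(size_t aligned_size, size_t *alloc_size) {
    /* Stage 1: two dependent loads. The second lookup cannot issue
     * until the first completes, so both sit on the critical path. */
    int cls     = class_index_table[aligned_size >> 3];
    *alloc_size = class_to_size_table[cls];

    /* Stage 2: two more dependent loads plus a store. head->next is
     * stored inside the free block itself, so this is a pointer
     * chase with worse locality than the array lookups above. */
    FreeBlock *head = bucket[cls];       /* load: head pointer        */
    if (head == NULL) return NULL;       /* empty bucket: slow path   */
    bucket[cls] = head->next;            /* load + store: unlink head */
    return head;
}
```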
To address these two issues, the paper proposes adding a small hardware buffer to the cache hierarchy to aid the fast-path of malloc. Given that existing fast-path implementations are already very fast, the hardware buffer must be close to the processor and have low access overhead. One important observation made by the paper is that, although tcmalloc allocates from a rather large range of size classes (there were around 2100 size classes), in practice fewer than five size classes are frequently used. Based on this observation, the paper proposes adding a four-entry on-chip buffer to each core. The buffer holds mappings from the aligned request size to the size class index, and then to the first two entries of the free-block linked list. Mappings are learned from previous allocations and must be inserted and updated by software. On a malloc-cache hit, results are returned instantly without waiting. Each entry of the hardware cache contains six fields. The first field is a valid bit indicating whether the entry is valid; it is initialized to zero at system startup and set when the entry is first filled. The second field is a pair of aligned request sizes defining a range. The controller hardware performs an associative search on this field when an aligned request size is used to query the cache; if the request size falls within an entry's range, it is a cache hit. The third and fourth fields are the size class index and the actual allocation size. Both are returned as the result of a query using the request size; they are inserted by software when a previous lookup missed the cache and had to perform the memory reads. On a hit, the two loads are removed from the critical path, because it is no longer necessary to obtain the size class and the actual allocation size via array lookups; this removes the first source of slowdown mentioned above. The fifth and sixth fields are the head pointer and the next-block pointer of the bucket. These two fields are also filled by software on a miss. On a hit, the two values are loaded into registers and can be used directly by malloc to update the bucket and return the allocated block. On x86, if an entry is not found in the cache, the ZF bit in the FLAGS register is set, indicating that software must fill the cache after it has completed the software-only slow-path. In addition, a small circuit can be added to align the requested size, which usually involves adding a constant and then right-shifting by a few bits. With this circuit, the raw request size from the user can be used directly as the input to the cache.
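A plausible software view of one malloc-cache entry is sketched below, with one member per field described above. The field names are ours; the actual hardware layout is not specified at this level of detail:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of the per-core malloc cache. */
typedef struct MallocCacheEntry {
    bool      valid;       /* field 1: set when the entry is first filled */
    size_t    size_lo;     /* field 2: range of aligned request sizes,    */
    size_t    size_hi;     /*          matched associatively on a query   */
    uint32_t  class_idx;   /* field 3: size class index                   */
    size_t    alloc_size;  /* field 4: actual allocation size             */
    uintptr_t head;        /* field 5: first free block of the bucket     */
    uintptr_t next;        /* field 6: second free block (prefetched)     */
} MallocCacheEntry;

/* Four entries per core; a query compares the aligned request size
 * against [size_lo, size_hi] of all entries in parallel. */
static MallocCacheEntry malloc_cache[4];
```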
The head and next pointer fields of the cache need to be maintained in such a way that queries consistently hit. The proposed design is as follows. After obtaining the size class index, malloc removes the first element from the free list using a “pop” instruction. The “pop” instruction writes the current “head” into a register and shifts the “next” field into the “head” field; the “next” field is then cleared. If the “next” field is already empty, the “pop” instruction assumes that software did not update the field after the last pop, and clears the whole entry to avoid handing out invalid pointers. To keep the “next” field updated, every time software pops the head of a bucket, it is expected to issue a special prefetch instruction, which loads the pointer stored in the new head block and fills the “next” field with that value. This prefetch instruction breaks the dependency chain in the original malloc, where two loads and one store must execute sequentially even on an out-of-order core because of the data dependencies between them. With prefetching, the read of the next-block pointer can be overlapped with the normal operation of malloc and the application, so that the next time malloc is invoked, the pointer value is likely ready and the request is fulfilled instantly. If the prefetch has not completed when a request arrives, the request blocks and waits for the prefetch. Note that this does not make things worse: if the prefetch cannot complete in a timely manner, the data must be rather cold, and during normal (non-accelerated) operation execution would also block at this point waiting for the data.
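The semantics of the proposed “pop” and prefetch operations can be modeled in C as below. This is a behavioral sketch only: in hardware each is a single instruction, and the prefetch load proceeds asynchronously rather than as a blocking read:

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal view of a malloc-cache entry; see the struct sketch above. */
typedef struct { bool valid; uintptr_t head, next; } Entry;

/* "pop": hand out the current head, promote "next" into "head", and
 * clear "next". If "next" was never refilled by software after the
 * previous pop, invalidate the entry so a stale pointer can never be
 * handed out. */
static uintptr_t mcache_pop(Entry *e) {
    uintptr_t block = e->head;
    if (e->next == 0) {
        e->valid = false;       /* software missed a refill: kill entry */
    } else {
        e->head = e->next;      /* shift the prefetched block into head */
        e->next = 0;            /* await the next software prefetch     */
    }
    return block;
}

/* Special prefetch, issued by software right after each pop: read the
 * next-block pointer stored inside the new head block and deposit it
 * into "next", overlapping the pointer chase with the application's
 * execution instead of serializing it. */
static void mcache_prefetch_next(Entry *e) {
    if (e->valid && e->head != 0)
        e->next = *(const uintptr_t *)e->head;
}
```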
Since the malloc cache is only a fast access path to frequently used data maintained by malloc, it can always be flushed on a context switch. This is also necessary, because most malloc implementations use thread-local buckets to avoid fast-path synchronization. Occasionally, entries may be filled during misspeculation as a result of branch misprediction; such entries must be invalidated as well. To achieve this, an entry pointer is added to every instruction in the commit queue: if an instruction is squashed because of misprediction, the corresponding entry is also cleared.