A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps
Link: https://dl.acm.org/doi/10.1145/2749469.2750399
Year: ISCA 2015
Keyword: Compression; GPU; BDI; CABA
Note: I am not an expert on GPGPU architecture and have only read very basic GPGPU architecture literature. The following paper summary may be inaccurate and/or incorrect where it involves GPGPU internals. I will strive to clarify as much as I can, and will keep the focus on compression rather than on GPGPU details.
Highlight:
- Uses an execution model similar to hyper-threading on CPUs to leverage idle cycles and resources more efficiently.
- Proposes a GPU implementation of BDI. BDI is friendlier to GPGPUs, since all compressed words have the same size, which fits the GPU's regular computation model.
Questions
- The paper mentions that CABA can help address the idle cycle problem. But for compression, since decompression can only start after the data has been fetched, the main kernel is still stalled until the data arrives and is decompressed. As a result, decompression cannot be overlapped with the memory stall, which seems to contradict the design goal stated in the earlier sections.
This paper proposes Core-Assisted Bottleneck Acceleration (CABA), an architecture that leverages idle cycles and resources on GPGPUs to perform out-of-band tasks for acceleration. The paper points out that resources on modern GPGPUs are often not fully utilized, for several reasons. First, memory bandwidth and inter-component link bandwidth are often underutilized due to the speed discrepancy between execution units and memory modules; if most threads are stalled on memory instructions, execution cycles are wasted, since no progress can be made until the data arrives. Second, on-chip resources such as registers and local memory storage, which are allocated statically by compilers and scheduled dynamically by hardware, are often underutilized as well. This happens when the remaining resources on a processing unit are insufficient to host an entire thread block, which is the unit of scheduling, so the leftover registers and local storage sit unused. The paper proposes to leverage these otherwise wasted resources to perform background tasks.
The CABA architecture is described as follows. CABA adopts a software-hardware co-design approach to strike a balance between efficiency, usability, and hardware complexity; pure hardware or pure software schemes are sub-optimal, due to the amount of new hardware that would have to be added to existing GPGPUs, and the limitations of existing GPGPU programming models, respectively. On the software level, CABA customizes the compiler to allow each kernel to be accompanied by one or more special helper routines, which are executed in the background when the main kernel is stalled or when certain events are triggered. These routines are largely transparent to the main kernel, except for some necessary data exchange at the beginning and the end of execution. The compiler is aware of the semantics of helper routines and compiles them according to the architecture specification. Notably, the registers and local storage of a helper routine are allocated explicitly and statically by the compiler and added to the total resource requirement of each thread block; the GPGPU hardware itself is unaware of the extra resources dedicated to the execution of these helper routines.
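As a rough illustration of this resource accounting, here is a minimal host-side sketch. All structure names, helper names, and numbers are my own assumptions for illustration, not the paper's interface; the point is only that the assist routine's demand is folded into the block's total footprint before occupancy is computed.

```cuda
// Hypothetical illustration of static resource accounting for assist warps.
// Names and numbers are assumptions for this sketch, not the paper's API.
#include <cstdio>

struct KernelResources {
    int regs_per_thread;   // registers used by the main kernel
    int smem_per_block;    // shared/local memory per thread block, bytes
};

struct AssistRoutineResources {
    int regs_per_thread;   // extra registers reserved for the assist routine
    int smem_per_block;    // extra shared/local memory reserved for it
};

// The compiler adds the assist routine's demand to the block's total,
// so the hardware scheduler sees a single combined requirement.
KernelResources combine(const KernelResources& k, const AssistRoutineResources& a) {
    return { k.regs_per_thread + a.regs_per_thread,
             k.smem_per_block + a.smem_per_block };
}

int main() {
    KernelResources kernel     = { 32, 8 * 1024 };  // assumed main kernel demand
    AssistRoutineResources bdi = {  8, 0 };         // assumed decompressor demand
    KernelResources total = combine(kernel, bdi);

    // Occupancy per SM is then limited by the combined footprint.
    const int regs_per_sm = 65536, threads_per_block = 256;
    int blocks_by_regs = regs_per_sm / (total.regs_per_thread * threads_per_block);
    printf("registers per thread (kernel + assist): %d\n", total.regs_per_thread);
    printf("thread blocks per SM limited by registers: %d\n", blocks_by_regs);
    return 0;
}
```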
On the hardware side, when the GPGPU is programmed with a kernel, its helper routines are also transferred to the GPGPU's instruction memory and buffered in an on-chip instruction cache. Helper routines are executed as Assist Warps, in a similar way to the warps of the main kernel. Assist Warps are not executed until certain predefined conditions or events occur, at which point they are scheduled onto the processing unit. The paper proposes three extra hardware components added to the GPGPU for tracking, scheduling, and maintaining state for these warps. The first component is the Assist Warp Store (AWS), which stores the program bodies of the helper routines. More than one routine can be stored at the same time; routines are indexed by a globally unique source ID (SR.ID), and individual instructions within a routine are accessed via an instruction ID (Inst.ID).
The second component is the Assist Warp Buffer (AWB), a buffer for decoded Assist Warp instructions. The paper suggests that this buffer may share storage with the existing Instruction Buffer (IB), while providing a few extra slots for low priority tasks that can be issued concurrently with the main warp in the background, without contending for resources with it.
The last component is the Assist Warp Controller (AWC), which schedules Assist Warps onto the processing unit. During normal execution, the AWC monitors trigger events and conditions. When any trigger fires, it schedules out the main kernel and brings in the Assist Warp's instructions. Note that GPGPUs are typically not equipped with proper hardware for context switching; the paper therefore suggests that Assist Warps execute under the same context as the main kernel. The compiler must be aware of this and reserve dedicated registers for the Assist Warp to use, so that it does not corrupt the main kernel's state unexpectedly.
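To make the three structures concrete, the following is a small software model of the AWS (indexed by SR.ID and Inst.ID), AWB entries carrying a priority, and an AWC-style trigger check. The field names, sizes, and trigger set are my own assumptions for illustration and not the paper's exact hardware layout.

```cuda
// Simplified software model of the CABA front-end structures.
// Sizes, names, and the trigger set are assumptions made for this sketch.
#include <cstdint>
#include <queue>

enum class Trigger  { CacheMissOnLoad, StoreToMemory, None };
enum class Priority { High, Low };

// Assist Warp Store: holds the instructions of every resident assist routine.
// An instruction is addressed by (SR.ID, Inst.ID).
struct AssistWarpStore {
    static constexpr int kMaxRoutines = 4, kMaxInsts = 32;
    uint64_t inst[kMaxRoutines][kMaxInsts];            // encoded instructions
    uint64_t fetch(int sr_id, int inst_id) const { return inst[sr_id][inst_id]; }
};

// Assist Warp Buffer: decoded assist instructions waiting to issue,
// kept alongside the normal Instruction Buffer slots.
struct AWBEntry { uint64_t decoded_inst; Priority prio; };

// Assist Warp Controller: watches triggers and injects assist instructions.
struct AssistWarpController {
    AssistWarpStore* aws;
    std::queue<AWBEntry> awb;

    void on_trigger(Trigger t) {
        if (t == Trigger::CacheMissOnLoad)
            enqueue_routine(/*sr_id=*/0, Priority::High);   // e.g. decompression
        else if (t == Trigger::StoreToMemory)
            enqueue_routine(/*sr_id=*/1, Priority::Low);    // e.g. compression
    }

    void enqueue_routine(int sr_id, Priority p) {
        for (int i = 0; i < AssistWarpStore::kMaxInsts; ++i)
            awb.push({ aws->fetch(sr_id, i), p });  // decode step omitted in this model
    }
};
```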
One of the most important features of CABA is the priority level of Assist Warps. High priority warps preempt the current kernel: normal execution stops and the assist warp is scheduled onto the execution units. Such warps typically perform tasks that must complete for execution to be correct, such as data decompression on a memory load. Low priority warps, in contrast, can be executed in the background concurrently with the main kernel, for example opportunistic prefetching. Their completion is not even guaranteed, since these warps are considered optional.
The paper then discusses the possibility of using CABA to perform compression on the GPGPU's L2 cache and main memory. The paper gives three reasons why compression is better implemented with CABA than with specialized control and data paths. First, CABA leverages existing buffering and execution units on the GPGPU, whereas adding dedicated components for compression would change the hardware significantly while achieving less generality. Second, different workloads exhibit different data regularity, and no single compression algorithm performs best on all of them. To make compression widely applicable on GPGPUs, the compression framework should be able to implement several common algorithms, such as FPC, BDI, and C-PACK, which target different types of data; with dedicated hardware this would be difficult or impossible. Lastly, some workloads are incompressible, in which case compression should simply be turned off. CABA can easily turn compression on or off via software intervention. It also enables application programmers to select the best performing compression scheme by writing their own assist routines, taking semantic-level information into consideration.
CABA-based memory compression only leverages the bandwidth-saving property of compression, without exploiting its cache and storage capacity savings. This is because the cache and memory access protocols would have to be altered significantly to accommodate variable-sized compressed chunks and an extra level of indirection for mapping more logical data than there is physical capacity. Only minimal modification is needed to perform compression: one bit per cache-line-sized block is added to the L2 cache and the main memory to indicate whether the block is compressed. The burst length of each compressed block is also stored in a separate, direct-mapped area of the GPGPU's main memory. To avoid doubling the number of accesses, an on-chip metadata cache is added to allow fast access to both pieces of metadata before the data arrives.
On a read, the AWC checks whether the bit fetched from the metadata cache indicates a compressed block. If it does, the high priority decompression Assist Warp is scheduled to run when the data arrives. Similarly, on a store, the low priority compression Assist Warp is scheduled to execute in the background.
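Below is a hedged sketch of this metadata path. The structure layout, field widths, and function names are invented for illustration; only the overall idea (a per-block compressed bit plus a burst length, cached on chip and consulted before scheduling the decompression assist warp) comes from the paper.

```cuda
// Sketch of the compression metadata path described above.
// Layout, sizes, and names are assumptions for illustration only.
#include <cstdint>
#include <unordered_map>

struct BlockMetadata {
    bool    compressed;     // one bit per cache-line-sized block
    uint8_t burst_length;   // bursts needed to fetch the compressed block
};

// Direct-mapped region of DRAM holds the metadata; an on-chip metadata cache
// makes both fields available before the data itself arrives.
struct MetadataCache {
    std::unordered_map<uint64_t, BlockMetadata> lines;  // stand-in for a small SRAM

    // Returns true on hit; on a miss the real hardware would fetch the metadata
    // from its direct-mapped area in memory (not modeled here).
    bool lookup(uint64_t block_addr, BlockMetadata& out) const {
        auto it = lines.find(block_addr);
        if (it == lines.end()) return false;
        out = it->second;
        return true;
    }
};

// Decision made on a read: if the block is compressed, fetch only
// `burst_length` bursts and trigger the high priority decompression assist warp.
bool should_trigger_decompression(const MetadataCache& mc, uint64_t block_addr,
                                  uint8_t& bursts_to_fetch) {
    BlockMetadata md;
    if (mc.lookup(block_addr, md) && md.compressed) {
        bursts_to_fetch = md.burst_length;
        return true;                       // schedule decompression assist warp
    }
    bursts_to_fetch = 4;                   // assumed full-block burst count
    return false;
}
```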
The paper proposes an implementation of BDI on the GPGPU. One of the most important properties of BDI is the regularity of its compressed data, which makes parallel compression more feasible than with dictionary-based or pattern-based designs. The BDI algorithm searches over the base, the delta size, and the input word size to minimize the number of bits required to encode a cache block. All input words (whose size is itself part of the search) are encoded into shorter but still uniformly sized compressed words. Each input word is therefore mapped to a lane of the Assist Warp, which subtracts either the selected base or the implicit zero base from the word. The hardware then checks whether the delta can be encoded with the given number of bits, and uses a global predicate register to summarize the per-lane results into a single-bit predicate. This process is repeated over the candidate combinations until a valid encoding is found or the block is deemed incompressible.
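The per-lane structure can be sketched in CUDA: each lane takes one word of the cache block, computes its delta against the chosen base (or the implicit zero base), and a warp-wide vote collapses the per-lane fit checks into a single predicate. This is my own simplified rendering of the idea, not the paper's assist warp code: it assumes 4-byte words and a 128-byte block handled by a 32-lane warp, the function names are invented, and the __shfl_sync/__all_sync intrinsics stand in for the paper's global predicate register mechanism.

```cuda
// Simplified BDI "does this (base, delta-width) choice fit?" check, one word per lane.
// Assumes 4-byte words and a 128-byte block covered by one 32-lane warp.
#include <cstdint>

// True if the signed delta can be represented in delta_bytes bytes.
__device__ bool fits_in_delta(int delta, int delta_bytes) {
    if (delta_bytes >= 4) return true;              // full-width delta always fits
    int lo = -(1 << (8 * delta_bytes - 1));
    int hi =  (1 << (8 * delta_bytes - 1)) - 1;
    return delta >= lo && delta <= hi;
}

// Each lane handles words[lane]; returns true in every lane if the whole block
// is compressible with the given delta width.
__device__ bool bdi_block_compressible(const uint32_t* words, int delta_bytes) {
    int lane = threadIdx.x & 31;
    uint32_t w = words[lane];

    // Use lane 0's word as the explicit base (one common BDI choice).
    uint32_t base = __shfl_sync(0xffffffff, w, 0);

    // A word fits if its delta against the base, or against the implicit
    // zero base, can be encoded in delta_bytes bytes.
    bool ok = fits_in_delta((int)(w - base), delta_bytes) ||
              fits_in_delta((int)w, delta_bytes);

    // Warp-wide vote plays the role of the global predicate register:
    // the block is compressible only if every lane's word fits.
    return __all_sync(0xffffffff, ok);
}

__global__ void bdi_check_kernel(const uint32_t* block, int delta_bytes, int* result) {
    bool ok = bdi_block_compressible(block, delta_bytes);
    if ((threadIdx.x & 31) == 0) *result = ok ? 1 : 0;
}
```

In a full implementation this check would be repeated for each candidate (word size, delta width) pair, keeping the smallest encoding whose predicate is true, which mirrors the search loop described above.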