Enhancing Address Translations in Throughput Processors via Compression



Paper Title: Enhancing Address Translations in Throughput Processors via Compression
Link: https://dl.acm.org/doi/10.1145/3410463.3414633
Year: PACT 2020
Keyword: GPGPU; TLB Compression; Compression




Highlights:

  1. The TLB of GPGPUs suffers from high miss ratios due to clustered access patterns. Misses hurt the L2 TLB more than the L1 TLB, since the cost of L1 TLB misses can largely be hidden by warp scheduling.

  2. TLB entries can be compressed using a base-delta algorithm: a per-set register stores the shared base, while compressed entries store only the deltas. Both virtual and physical addresses are compressed this way. The ways of the TLB can also be partitioned so that only some of them hold compressed entries.

  3. The base value of each set can be dynamically updated using a rebasing counter. The counter is decremented on every insertion that cannot be compressed. When the counter reaches zero, a new base is chosen after all entries in the compressed partition are evicted.

Comments:

  1. The paper overestimates the effective compression ratio: it is not 2 but 1.5, since only half of the TLB is compressed.

This paper proposes a technique for compressing GPGPU TLB entries to accommodate the increasingly irregular access patterns of modern GPU workloads. It is motivated by the observation that these workloads suffer high L2 TLB miss ratios, which can become a performance bottleneck under Unified Virtual Memory (UVM). The paper proposes to compress translation entries by extracting the common high bits of both virtual and physical addresses, so that more entries fit in the same hardware structure and fewer misses occur.

The paper assumes a Unified Virtual Memory (UVM) architecture in which the GPU device shares the virtual address space of the process running on the CPU. Compared with prior memory architectures where the GPU and CPU share no address space, UVM has two obvious advantages. First, programmers no longer need to manually transfer data to and from GPU memory, which greatly simplifies development. Second, UVM enables the GPU to access a working set larger than its onboard memory, as data can be dynamically paged in and out of GPU memory.

To support UVM, GPU devices must translate the virtual addresses issued by programs into physical addresses before memory can be accessed. Current GPU implementations either add dedicated TLB structures to the GPU device or rely on the existing IOMMU to perform address translation. In both cases, address translation is not free: TLB misses trigger page walks that fetch translation entries from the in-memory page table, and this latency is usually non-negligible and sits on the critical path.

To study the properties of address translation on GPU devices, the paper conducted experiments on a GPU simulator with 22 workloads. The simulator models a two-level TLB hierarchy: the first level is private and per-SM, and the second level is a shared TLB structure. The paper presents several observations. First, TLB miss ratios vary across workloads but are generally significant, indicating that TLB misses have become a major bottleneck in these executions. In particular, if we quantify locality using reuse distance, i.e., the number of other accesses between two consecutive accesses to the same page, we observe that most pages have large reuse distances, which implies poor locality.

Second, replacing the realistic L1 TLB with an ideal one that incurs no access overhead (but can still miss) brings very limited overall improvement. In contrast, replacing the existing L2 TLB with an ideal one yields significant speedups, a good indication that optimizing the L2 TLB is more profitable than optimizing the L1 TLB. Lastly, the paper observes that accesses exhibit a clustered pattern: within a time window, most accesses cluster around one or more address regions, and the clustered regions may vary across windows. However, most accesses do not follow any stride pattern, i.e., there is no consistent stride value that describes them. This clustered pattern is a strong hint that most addresses in the TLB share the same high bits, as they are close to each other in the address space.
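To make the metric concrete, here is a short C sketch that computes the reuse distance of every access in a toy page-level trace. The trace and the quadratic backward scan are purely illustrative; they are not part of the paper's methodology.

```c
#include <stdio.h>

/* Reuse distance of an access: the number of other accesses between it
 * and the previous access to the same page (-1 for a first access). */
int main(void)
{
    unsigned pages[] = { 1, 2, 3, 1, 4, 5, 6, 2 };   /* toy page trace */
    int n = sizeof pages / sizeof pages[0];

    for (int i = 0; i < n; i++) {
        int dist = -1;                  /* -1 marks a cold (first) access */
        for (int j = i - 1; j >= 0; j--)
            if (pages[j] == pages[i]) { dist = i - j - 1; break; }
        printf("access %d (page %u): reuse distance %d\n", i, pages[i], dist);
    }
    return 0;
}
```

Large reuse distances, as observed in the paper, mean a page's translation entry is likely to have been evicted from the TLB before the page is touched again.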

Based on the above observations, the paper proposes a TLB compression scheme in which L2 TLB entries are compressed using a simple base-delta algorithm. At a high level, every set of the compressed L2 TLB is partitioned into two equally sized parts. The uncompressed partition stores TLB entries in the uncompressed format of the baseline system. The compressed partition stores only the deltas of the virtual and physical addresses from a base address, while the common base is held in per-set registers (two per set, one for virtual and one for physical addresses). More specifically, the paper assumes 2GB of physical memory with regular 4KB pages and an L2 TLB of 32 sets (i.e., the 5 lowest bits are used for set indexing). Each stored virtual page address therefore has 31 bits (the virtual page number excluding the 5 set-index bits), and each physical page address has 19 bits (2GB / 4KB = 2^19 frames), for a total of 50 bits per entry without compression. With compression enabled, a compressed entry stores only a 13-bit virtual delta and a 9-bit physical delta, leaving the higher 18 and 10 bits, respectively, as the base address. Consequently, each TLB entry slot (50 bits) can store two compressed entries (22 bits each), resulting in an effective compression ratio of 1.5 across the entire TLB. Every set also has two base address registers holding the high bits of the virtual and physical addresses, respectively; all compressed entries in a set share the base stored in these registers.
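To make the layout concrete, below is a minimal C sketch of the example configuration (31-bit virtual tags, 19-bit physical page numbers, 13-bit and 9-bit deltas). The paper specifies this design in hardware terms; all type and field names here are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Field widths from the example configuration in the text: 4KB pages,
 * 2GB physical memory, 32-set L2 TLB (5 set-index bits). */
#define VPN_TAG_BITS   31   /* stored virtual tag (VPN minus set index) */
#define PPN_BITS       19   /* 2GB / 4KB = 2^19 physical frames         */
#define VPN_DELTA_BITS 13   /* low bits kept in a compressed entry      */
#define PPN_DELTA_BITS 9
#define VPN_BASE_BITS  (VPN_TAG_BITS - VPN_DELTA_BITS)   /* 18 */
#define PPN_BASE_BITS  (PPN_BITS - PPN_DELTA_BITS)       /* 10 */

/* A compressed entry keeps only the deltas; valid flags are modeled as
 * separate fields here for clarity rather than packed into the slot. */
typedef struct {
    uint32_t vpn_delta;   /* low 13 bits of the virtual tag           */
    uint32_t ppn_delta;   /* low 9 bits of the physical page number   */
    int      valid;
} compressed_entry_t;

/* Per-set base registers shared by all compressed entries in the set. */
typedef struct {
    uint32_t vpn_base;    /* high 18 bits of the virtual tag          */
    uint32_t ppn_base;    /* high 10 bits of the physical page number */
} set_base_t;

/* A translation is compressible iff its high bits match the set's bases. */
static int compressible(set_base_t base, uint32_t vpn_tag, uint32_t ppn)
{
    return (vpn_tag >> VPN_DELTA_BITS) == base.vpn_base &&
           (ppn     >> PPN_DELTA_BITS) == base.ppn_base;
}

int main(void)
{
    set_base_t base = { .vpn_base = 0x12345, .ppn_base = 0x2A };
    uint32_t vpn_tag = (base.vpn_base << VPN_DELTA_BITS) | 0x0FFF;
    uint32_t ppn     = (base.ppn_base << PPN_DELTA_BITS) | 0x1AB;

    printf("compressible: %d\n", compressible(base, vpn_tag, ppn));

    /* Packing check: two compressed entries (2 * 22 = 44 bits) fit in
     * one 50-bit uncompressed slot. */
    assert(2 * (VPN_DELTA_BITS + PPN_DELTA_BITS) <= VPN_TAG_BITS + PPN_BITS);
    return 0;
}
```

The compressed half of the TLB doubles its capacity while the uncompressed half stays unchanged, which is exactly why the whole-TLB ratio is 1.5 rather than 2, as noted in the comment at the top of this summary.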

During a TLB lookup, both the base registers and the set are read. The TLB controller checks the uncompressed partition as usual. For the compressed partition, the controller first reconstructs the full virtual address by concatenating the high bits from the base register with the delta bits stored in the compressed entry, and then compares the requested address against this full address exactly as in the uncompressed case. A hit is signaled if a matching address is found in either partition.
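The lookup path might look like the following sketch, assuming a 4-way set split evenly into two uncompressed slots and two compressed slots holding four compressed entries; the set geometry and all names are assumptions for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define VPN_DELTA_BITS 13
#define PPN_DELTA_BITS 9

typedef struct { uint32_t vpn_tag, ppn; int valid; } full_entry_t;
typedef struct { uint32_t vpn_delta, ppn_delta; int valid; } comp_entry_t;

/* One set: two uncompressed slots plus four compressed entries (the two
 * compressed slots each hold two entries), and the per-set base registers. */
typedef struct {
    uint32_t     vpn_base, ppn_base;
    full_entry_t full[2];
    comp_entry_t comp[4];
} tlb_set_t;

/* Returns 1 and fills *ppn on a hit in either partition, 0 on a miss. */
static int lookup(const tlb_set_t *set, uint32_t vpn_tag, uint32_t *ppn)
{
    /* Uncompressed partition: compare full tags as in the baseline. */
    for (int i = 0; i < 2; i++)
        if (set->full[i].valid && set->full[i].vpn_tag == vpn_tag) {
            *ppn = set->full[i].ppn;
            return 1;
        }

    /* Compressed partition: reconstruct the full tag by concatenating
     * the base register's high bits with the stored delta, then compare. */
    for (int i = 0; i < 4; i++) {
        if (!set->comp[i].valid)
            continue;
        uint32_t tag = (set->vpn_base << VPN_DELTA_BITS) | set->comp[i].vpn_delta;
        if (tag == vpn_tag) {
            *ppn = (set->ppn_base << PPN_DELTA_BITS) | set->comp[i].ppn_delta;
            return 1;
        }
    }
    return 0;
}

int main(void)
{
    tlb_set_t set = { .vpn_base = 0x12345, .ppn_base = 0x2A };
    set.comp[0] = (comp_entry_t){ .vpn_delta = 0x0FF, .ppn_delta = 0x1A, .valid = 1 };

    uint32_t ppn = 0;
    uint32_t vpn_tag = (0x12345u << VPN_DELTA_BITS) | 0x0FF;
    int hit = lookup(&set, vpn_tag, &ppn);
    printf("hit=%d ppn=0x%x\n", hit, ppn);
    return 0;
}
```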

On an L2 TLB insertion, the controller first checks whether the inserted address can be compressed against the set's base address. If so, the new entry is allocated from the compressed partition; otherwise, it is allocated from the uncompressed partition. The paper also suggests that each partition implement its own eviction policy, such that insertions into one partition only ever evict entries from that same partition.
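A sketch of this insertion policy under the same assumed geometry; the round-robin victim pointers merely stand in for the per-partition replacement policy, whose details the paper leaves open.

```c
#include <stdint.h>
#include <stdio.h>

#define VPN_DELTA_BITS 13
#define PPN_DELTA_BITS 9

typedef struct { uint32_t vpn_tag, ppn; int valid; } full_entry_t;
typedef struct { uint32_t vpn_delta, ppn_delta; int valid; } comp_entry_t;

typedef struct {
    uint32_t     vpn_base, ppn_base;
    full_entry_t full[2];          /* uncompressed partition            */
    comp_entry_t comp[4];          /* compressed partition              */
    int full_victim, comp_victim;  /* round-robin stand-ins for each
                                      partition's replacement policy    */
} tlb_set_t;

static int compressible(const tlb_set_t *s, uint32_t vpn_tag, uint32_t ppn)
{
    return (vpn_tag >> VPN_DELTA_BITS) == s->vpn_base &&
           (ppn     >> PPN_DELTA_BITS) == s->ppn_base;
}

/* Insert a translation; each partition evicts only within itself. */
static void insert(tlb_set_t *s, uint32_t vpn_tag, uint32_t ppn)
{
    if (compressible(s, vpn_tag, ppn)) {
        comp_entry_t *e = &s->comp[s->comp_victim];
        s->comp_victim = (s->comp_victim + 1) % 4;
        e->vpn_delta = vpn_tag & ((1u << VPN_DELTA_BITS) - 1);
        e->ppn_delta = ppn & ((1u << PPN_DELTA_BITS) - 1);
        e->valid = 1;
    } else {
        full_entry_t *e = &s->full[s->full_victim];
        s->full_victim = (s->full_victim + 1) % 2;
        e->vpn_tag = vpn_tag;
        e->ppn = ppn;
        e->valid = 1;
    }
}

int main(void)
{
    tlb_set_t set = { .vpn_base = 0x12345, .ppn_base = 0x2A };

    /* High bits match the bases: goes to the compressed partition. */
    insert(&set, (0x12345u << VPN_DELTA_BITS) | 7, (0x2Au << PPN_DELTA_BITS) | 3);
    /* High bits differ: goes to the uncompressed partition. */
    insert(&set, 0xDEAD, 0xBE);

    printf("comp[0].valid=%d full[0].valid=%d\n",
           set.comp[0].valid, set.full[0].valid);
    return 0;
}
```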

The above simple scheme may not work well with the clustered access pattern presented earlier, since the clustered regions shift over time and different windows need different base addresses. It is therefore critical that the compression scheme be able to dynamically adjust the base address of each set. To achieve this goal, the paper adds an extra “rebasing” counter to each set, initialized to a positive value. Whenever an insert operation fails to place an entry into the compressed partition, the counter is decremented by one. When the counter drops to zero, the TLB controller evicts all existing entries in the compressed partition and resets the base register values on the next insertion.
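Finally, a sketch of the rebasing mechanism. The initial counter value of 16 is arbitrary (the paper only requires a positive value), and the rebase_pending flag models deferring the choice of a new base to the next insertion.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define VPN_DELTA_BITS 13
#define PPN_DELTA_BITS 9
#define REBASE_INIT    16   /* arbitrary positive initial value */

typedef struct { uint32_t vpn_delta, ppn_delta; int valid; } comp_entry_t;

typedef struct {
    uint32_t vpn_base, ppn_base;   /* per-set base registers        */
    int rebase_counter;            /* counts down on failed inserts */
    int rebase_pending;            /* set once the counter hits 0   */
    comp_entry_t comp[4];          /* compressed partition          */
} tlb_set_t;

/* Called whenever an insertion cannot go into the compressed partition:
 * decrement the counter; at zero, flush the compressed partition and
 * defer choosing a new base until the next insertion. */
static void on_incompressible_insert(tlb_set_t *s)
{
    if (--s->rebase_counter > 0)
        return;
    memset(s->comp, 0, sizeof s->comp);   /* evict all compressed entries */
    s->rebase_pending = 1;
}

/* On the next insertion after a flush, adopt its high bits as new bases. */
static void maybe_rebase(tlb_set_t *s, uint32_t vpn_tag, uint32_t ppn)
{
    if (!s->rebase_pending)
        return;
    s->vpn_base = vpn_tag >> VPN_DELTA_BITS;
    s->ppn_base = ppn >> PPN_DELTA_BITS;
    s->rebase_counter = REBASE_INIT;
    s->rebase_pending = 0;
}

int main(void)
{
    tlb_set_t s = { .vpn_base = 1, .ppn_base = 1, .rebase_counter = 2 };

    on_incompressible_insert(&s);       /* counter: 2 -> 1               */
    on_incompressible_insert(&s);       /* counter: 1 -> 0, flush set    */
    maybe_rebase(&s, 0xABCDE, 0x155);   /* next insertion sets new bases */

    printf("vpn_base=0x%x ppn_base=0x%x counter=%d\n",
           s.vpn_base, s.ppn_base, s.rebase_counter);
    return 0;
}
```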