Trident: Harnessing Architectural Resources for All Page Sizes in x86 Processors



Paper Title: Trident: Harnessing Architectural Resources for All Page Sizes in x86 Processors
Link: https://dl.acm.org/doi/pdf/10.1145/3466752.3480062
Year: MICRO 2021
Keyword: TLB; Huge Page; Virtual Memory; Trident; THP




Highlights:

  1. 1GB physical memory chunks can be allocated via compaction, similar to how THP creates 2MB chunks. However, since 1GB chunks are much harder to assemble, the compaction process should take unmovable pages and the amount of data copy into account. Trident implements this with two counters per 1GB frame.

  2. Page-level data copy is much cheaper in a virtualized guest, because the host can perform it simply by changing the guest-physical-to-host-physical mapping, which makes guests ideal targets for page compaction. This requires a paravirtualization interface through which the guest communicates its intent to the host.

This paper presents Trident, an OS kernel mechanism that makes 1GB huge pages usable by general applications. The paper is motivated by the fact that the 1GB huge page support in today's processors is largely wasted, since most applications never use it. Meanwhile, users still pay for the hardware resources dedicated to it even when it goes unused. The paper proposes a software mechanism that lets applications leverage 1GB huge pages for better performance, combined with techniques that ensure 1GB physical memory chunks can be found efficiently.

The paper points out that there are currently two ways of leveraging huge pages. First, huge pages can be mapped explicitly using a helper library, most notably libHugetlbfs; the OS sets up the mapping eagerly by allocating 2MB or 1GB chunks of physical memory and mapping the corresponding virtual addresses to them. Second, huge pages may also be created in the background by an OS kernel thread that compacts data pages into 2MB chunks and then maps user space memory with 2MB huge pages whenever possible. This approach is implemented in the Linux kernel as Transparent Huge Pages (THP), but only 2MB pages are supported.
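For concreteness, below is a minimal user-space sketch of these two existing paths on Linux, assuming 1GB hugetlbfs pages were reserved at boot and THP is enabled; it illustrates the status quo the paper starts from, not Trident itself.

```c
/*
 * Sketch of the two existing paths described above (not Trident itself).
 * Assumes a Linux machine where 1GB hugetlbfs pages were reserved at boot
 * (e.g. hugepagesz=1G hugepages=4) and THP is enabled.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)   /* log2(1GB) = 30 */
#endif

#define GB (1UL << 30)

int main(void)
{
    /* 1) Explicit (eager) mapping: backed by a reserved hugetlbfs 1GB page
     *    at mmap() time; fails immediately if no such page is available.   */
    void *eager = mmap(NULL, GB, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                       -1, 0);
    if (eager == MAP_FAILED)
        perror("mmap(MAP_HUGETLB | MAP_HUGE_1GB)");

    /* 2) Transparent Huge Pages: map ordinary anonymous memory and hint the
     *    kernel; khugepaged may later collapse it, but only into 2MB pages. */
    void *thp = mmap(NULL, GB, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (thp != MAP_FAILED)
        madvise(thp, GB, MADV_HUGEPAGE);

    return 0;
}
```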

The paper then lists several reasons why 1GB huge pages are not widely adopted. First, there is insufficient evidence of the performance benefit of 1GB huge pages over 2MB pages alone. Second, earlier processor models provided only a very small number of TLB entries for 1GB pages, so when the amount of memory mapped with 1GB pages exceeded what the hardware could cover, TLB thrashing would occur and degrade performance. Lastly, software support for 1GB huge pages is difficult to build, mostly because finding 1GB-sized chunks of contiguous physical memory is challenging and often incurs significant overhead due to the amount of data copy involved.

The paper also studies 1GB huge pages on a variety of applications by enabling them via static allocation with libHugetlbfs. It presents two observations. First, 1GB huge pages improve performance for many applications, and the improvement is more prominent in virtualized environments. This is not surprising: 1GB huge pages reduce the number of TLB misses by using fewer TLB entries for the same amount of mapped memory, and they shorten the page walk when misses do occur. Furthermore, TLB misses are even more costly under virtualization, as every step of a guest page walk requires a full walk of the host page table, which explains the increased benefit. The second observation is that 1GB pages alone are not sufficient to map a process's address space, since many virtual address ranges are fragmented as memory is allocated and deallocated incrementally. Supporting only 1GB and 4KB pages is also sub-optimal, because frequent page faults are observed in the 4KB regions. The paper hence concludes that all three page sizes should be leveraged to deliver optimal performance.
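As a back-of-the-envelope illustration of why virtualization amplifies the benefit, the standard x86 nested-paging arithmetic (assumed here, not a figure quoted from the paper) goes as follows:

```latex
% Memory references per TLB miss, ignoring paging-structure caches and
% assuming the host uses 4-level (4KB-granularity) page tables:
\[
  \text{nested walk cost} = (n+1)(m+1) - 1
  \quad \text{for an } n\text{-level guest walk over an } m\text{-level host walk.}
\]
% Native 4-level walk:                        4 references
% Virtualized, 4KB guest pages (n=4, m=4):    (4+1)(4+1) - 1 = 24 references
% Virtualized, 1GB guest pages (n=2, m=4):    (2+1)(4+1) - 1 = 14 references
```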

The paper then presents Trident, a software mechanism that makes 1GB huge pages available to applications. Trident is implemented as an OS kernel component and works with libHugetlbfs. Instead of eagerly allocating huge pages as in the current libHugetlbfs implementation, Trident lets applications map 1GB pages explicitly in user space and then lazily backs them with 1GB physical memory chunks on page faults, just as traditional 4KB pages are backed by physical memory. The paper states that the biggest challenge in implementing Trident is therefore to reliably and efficiently find 1GB memory chunks at page fault time. This is generally non-trivial, as physical memory tends to become more and more fragmented as the system allocates and releases memory.

Trident addresses several issues that arise during 1GB page allocation. First, the paper observes that zeroing out a 1GB page lies on the critical path. For 4KB pages, the cost of zeroing is minimal due to the small page size; for 1GB pages, however, zeroing consumes a huge number of cycles and substantial memory bandwidth. The paper therefore proposes a background zeroing scheme, in which 1GB pages are zeroed by a kernel thread once they are found. The kernel thread can use idle resources while the application is inactive (e.g., waiting for I/O). This way, the zeroing overhead is not only removed from the critical path but also avoids contending for resources with the application.
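The following is a minimal user-space analog of the background zeroing idea, assuming a simple pool of donated 1GB chunks; the names and structure are illustrative, not Trident's actual kernel implementation.

```c
/* A background worker pre-zeroes free 1GB-sized chunks so the allocation
 * path can hand out an already-zeroed chunk without paying memset() latency. */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE (1UL << 30)   /* 1GB */
#define POOL_SLOTS 4

struct chunk { void *mem; bool zeroed; };

static struct chunk pool[POOL_SLOTS];
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work = PTHREAD_COND_INITIALIZER;

/* Hand a newly freed 1GB chunk to the pool; the worker will zero it later. */
void donate_chunk(void *mem)
{
    pthread_mutex_lock(&lock);
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!pool[i].mem) {
            pool[i] = (struct chunk){ .mem = mem, .zeroed = false };
            pthread_cond_signal(&work);
            break;
        }
    }
    pthread_mutex_unlock(&lock);
}

/* Fast path: return an already-zeroed chunk, or NULL (caller zeroes inline). */
void *take_zeroed_chunk(void)
{
    void *mem = NULL;
    pthread_mutex_lock(&lock);
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (pool[i].mem && pool[i].zeroed) {
            mem = pool[i].mem;
            pool[i].mem = NULL;          /* hand ownership to the caller */
            break;
        }
    }
    pthread_mutex_unlock(&lock);
    return mem;
}

/* Background worker: performs the expensive memset off the critical path. */
void *zeroing_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    for (;;) {
        struct chunk *c = NULL;
        for (int i = 0; i < POOL_SLOTS; i++)
            if (pool[i].mem && !pool[i].zeroed) { c = &pool[i]; break; }
        if (!c) { pthread_cond_wait(&work, &lock); continue; }

        void *mem = c->mem;
        pthread_mutex_unlock(&lock);
        memset(mem, 0, CHUNK_SIZE);      /* ~1GB of writes, hidden from the app */
        pthread_mutex_lock(&lock);
        c->zeroed = true;                /* slot is not handed out while unzeroed */
    }
}
```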

Second, to reliably find 1GB memory chunks, the paper proposes that Trident keep compacting pages in the background to create free 1GB chunks. A naive approach would scan the entire physical memory allocation map and move as many pages as possible to form contiguous 1GB chunks. This approach, however, is agnostic to both the data copy overhead and unmovable pages (e.g., pages holding kernel data structures). Trident improves on it by tracking, for every 1GB page frame, the amount of allocated memory and the number of unmovable pages. Compaction is prioritized for frames that contain no unmovable pages and require the minimum amount of data copy.
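A sketch of the candidate-selection logic implied by the two per-frame counters follows; the field and function names are assumptions for illustration, not the paper's actual data structures.

```c
/* Frames with unmovable pages are skipped; among the rest, the frame that
 * needs the least data copy is compacted first. */
#include <stddef.h>
#include <stdint.h>

struct frame_1g {
    uint32_t used_4k_pages;       /* movable data that would have to be copied */
    uint32_t unmovable_4k_pages;  /* e.g. kernel allocations pinned in place   */
};

/* Pick the cheapest 1GB frame to compact, or SIZE_MAX if none qualifies. */
size_t pick_compaction_target(const struct frame_1g *frames, size_t nframes)
{
    size_t best = SIZE_MAX;
    uint32_t best_cost = UINT32_MAX;

    for (size_t i = 0; i < nframes; i++) {
        if (frames[i].unmovable_4k_pages > 0)
            continue;                            /* can never become a free 1GB chunk */
        if (frames[i].used_4k_pages < best_cost) {
            best_cost = frames[i].used_4k_pages; /* fewer pages => less copy overhead */
            best = i;
        }
    }
    return best;
}
```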

Third, when a 1GB physical memory chunk cannot be allocated, Trident falls back to backing a virtual memory region mapped as a 1GB page with 2MB pages, rather than dropping directly to 4KB pages. Trident can hence extract most of the benefit of huge pages even when full 1GB allocation fails.
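A sketch of this fallback policy is shown below, with placeholder allocators standing in for the kernel's real contiguous-memory allocators (the names are illustrative assumptions).

```c
/* On a fault in a 1GB-mapped region: try a whole 1GB chunk first, then 2MB
 * pages, and only then fall back to 4KB base pages. */
#include <stddef.h>

enum page_size { PAGE_1G, PAGE_2M, PAGE_4K };

/* Stub allocators for illustration only: pretend nothing large is available. */
static void *alloc_contig_1g(void) { return NULL; }
static void *alloc_contig_2m(void) { return NULL; }
static void *alloc_page_4k(void)   { static char page[4096]; return page; }

/* Decide how to back the faulting part of a 1GB virtual region. */
enum page_size back_1g_region(void **out_mem)
{
    if ((*out_mem = alloc_contig_1g()) != NULL)
        return PAGE_1G;             /* best case: single PDPTE-level mapping   */
    if ((*out_mem = alloc_contig_2m()) != NULL)
        return PAGE_2M;             /* still avoids most TLB / page-walk cost  */
    *out_mem = alloc_page_4k();     /* last resort: base pages                 */
    return PAGE_4K;
}
```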

The paper also proposes Trident-PV, a paravirtualization technique built on Trident that lets virtualized guests move data around with zero copy overhead. The design is based on the observation that 1GB physical chunks can be created in a virtualized guest without any data copy, since guest physical memory is itself mapped through the host's page tables. To leverage this, Trident-PV adds a hyper-call interface through which the guest informs the host of the intended data copies. On receiving the hyper-call, the host satisfies the request simply by updating its page tables to remap guest physical memory as specified. The hyper-call overhead can also be amortized by batching all the copy requests for compacting one 1GB chunk into a single call.
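Below is a guest-side sketch of the batched remap idea; the hyper-call name and calling convention are hypothetical placeholders, since the paper only specifies that the guest communicates remap requests to the host.

```c
/* Batch (src, dst) guest-frame remap requests during compaction of one 1GB
 * region and issue them in a single (hypothetical) paravirtual call. */
#include <stddef.h>
#include <stdint.h>

struct remap_req {
    uint64_t src_gfn;   /* guest frame whose contents should "move"      */
    uint64_t dst_gfn;   /* destination frame inside the 1GB target chunk */
};

#define MAX_BATCH 512   /* e.g. all 2MB sub-regions of one 1GB chunk */

static struct remap_req batch[MAX_BATCH];
static size_t batch_len;

/* Hypothetical hyper-call stub: the host would rewrite its (EPT/NPT)
 * mappings so the data appears at dst_gfn without any guest-visible copy. */
static long hypercall_remap_gfns(const struct remap_req *reqs, size_t n)
{
    (void)reqs; (void)n;
    return 0;
}

/* Issue one hyper-call for the whole 1GB compaction, amortizing its cost. */
long flush_remaps(void)
{
    long ret = hypercall_remap_gfns(batch, batch_len);
    batch_len = 0;
    return ret;
}

void queue_remap(uint64_t src_gfn, uint64_t dst_gfn)
{
    if (batch_len == MAX_BATCH)
        flush_remaps();
    batch[batch_len++] = (struct remap_req){ src_gfn, dst_gfn };
}
```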