Devirtualizing Memory in Heteogeneous System
- Hide Paper Summary
Link: https://dl.acm.org/citation.cfm?doid=3173162.3173194
Year: ASPLOS 2018
Keyword: TLB; Virtual Memory; Accelerator; GPU
This paper proposes Devirtualized Memory (DVM), which aims at providing high memory bandwidth to accelerators that can access virtual address space. This is crucial for improving the usability and efficiency of accelerators such as GPUs for three reasons. First, accelerators need to access virtual memory for protection and resource multiplexing. The virtual memory protection bits and remapping fits into this perfectly. Second, if accelerators could access the virtual memory the same way as the CPU does, then a pointer on the CPU has the same meaning for the accelerator. Pointer based data structures, such as linked lists, trees and graphs, can be transferred between CPU and accelerator directly without data marshalling and unmarshalling. This is particularly attractive when the accelerator processes graphs, where nodes are linked together using pointers. The last reason is that data transfer to and from the accelerator is typically slow. Letting the accelerator fetch data from the main memory, on the other hand, amortizes the overhead of data movement, and hence enables finer grained batches being processed by the accelerator.
When accelerators access the memory using virtual addresses, the IOMMU is responsible for translating virtual addresses into physical addresses. IOMMU is a device connected to the system bus. All memory requests issued by the device must be forwarded by the IOMMU in order to access memory. During system startup, the Operating System configures the IOMMU by assigning I/O page tables to devices. Each device could have its own page table, the content of which may or may not be identical to the page table used by MMU. The IOMMU translates the address in the memory request issued by devices using the page table before forwading them to the memory controller. Similar to the case of CPU, in order to make translation faster, the IOMMU may also have a built-in TLB, which caches recently used translation entries.
The IOMMU address translation can become a performance bottleneck when the accelerator runs memory intensive workloads. This paper addresses this problem using identity mapping. An identify mapping is a trivial way of performing address translation: it directly outputs VA as the PA, which makes address translation unnecessary. Recall that the IOMMU also checks permissions in addition to address translation. To facilitate this, the paper also proposes adding a new page table type: the Permissions Entry (PE). PE can replace a PTE on any level, assuming a multi-level page table. It contains 16 permission descriptors, which describes the access permissions of the memory range covered by an entry on that level. For example, an L2 PE covers 2MB of memory, and hence each permission descriptor defines the permission of 128KB of consecutive memory; For L3 PE each descriptor defines the permission for 64MB of memory. To reduce the number of entries, L1 PEs are not allowed. Each permission descriptor consists of two bits, and hence could represent four different permissions: Read-Only, Read-Write; Read-Execute; Invalid. The Invalid permission can be used to “punch holes” in a page. This is particularly useful if the OS can find an identity mapping that is almost consecutive, with a few “holes” in the middle that has already been occupied for other purposes. Instead of giving up finding an identify mapping, the holds can be masked off by setting the permission bits to Invalid.
The PE distinguishes itself from other PTE types using the type field in the PTE. When the page walker accesses the PTE, it checks whether the PTE is of PE type. If this is true, then translation terminates, and the page walker treats the remainder bits in the address as an offset and computes the permission of the address. Otherwise, the page walker walks the page table as usual. An important invariant of this design is that if there is a PE on the path of the page walk, the time it takes is always less than or equal to normal address translation, because the page walker only does less but not more. The paper therefore concludes that, by using permission bits, the new system can run at least as fast as an unchanged one.
To further accelerate the translation process, the paper also proposes adding a PE cache to the IOMMU, called Access Validation Cache (AVC). The AVC is a regular 4-way set-associative cache which stores PE entries from the page table. Since identity mapping and permission bits can reduce the number of L1 PTEs, there will not be pollution caused by too many L1 PTEs entering the cache. The page walker on IOMMU then checks the cache for a hit before going to the main memory to fetch PTEs.
Address translation can even overlapped with memory access if the accelerator supports load speculation (also called preload in the paper). The IOMMU allows a load request to bypass translation and permission check before it completes the permission lookup. If later on, IOMMU decides that either the page is not readable, or the address translation is not an identity mapping, it will notify the accelerator. On receiving the notification, the accelerator squashes the load operation and everything that follows, and re-executes the load using the correct mapping information.
To obtain identity mapping, the OS memory allocator needs to be modified in order to work with identity mapping. For maximum compatibility, only heap allocations are considered for identity mapping. The memory allocator uses modified mmap() which will try to obtain a range of pre-populated physical pages. It then assigns an identical virtual address to the range, which constitutes an identity mapping. If identity mapping cannot be constructed, the allocator falls back to using the original allocator. Correctness will not not affected in this case.