Efficient virtual memory for big memory servers

Paper Title: Efficient virtual memory for big memory servers
Link: https://dl.acm.org/citation.cfm?doid=2485922.2485943
Year: ISCA 2013
Keyword: Direct Segment; Segmentation

This paper proposes direct segment, a segmentation-based approach to virtual address translation. The motivation is the high overhead of Translation Lookaside Buffer (TLB) misses on modern big-memory machines. On one hand, big-data applications do not require complicated memory mappings. On the other hand, existing paging systems manage the mapping of each 4 KB page separately and rely on a TLB to accelerate translation in the common case. For these workloads, classical paging-based address mapping is inefficient in two ways: it incurs page walk overhead, and it stores redundant memory protection bits. Based on these observations, direct segment is designed to eliminate paging overhead with simple hardware changes. The following paragraphs describe the design in more detail.

Long-running big-data applications, such as memcached or MySQL, exhibit several memory usage patterns that distinguish them from short-running interactive programs. First, they normally do not rely on the OS to swap physical pages in and out in order to overcommit memory transparently. Instead, they treat main memory as a buffer pool or object pool, and automatically adjust their memory usage to the amount of physical memory available in the system. Swapping does very little good here, because big-data applications are not willing to pay the extra I/O overhead that comes with it. Second, big-data applications typically allocate their workspace memory at startup and then manage it on their own. Fine-grained memory management at page granularity is of little use in this scenario, as external fragmentation is not observable by the OS. Third, the workspace memory of big-data applications is almost always both readable and writable. Per-page protection bits are useful in cases such as protecting the code segment from being maliciously altered, but there is no way to selectively turn them off for the workspace memory area. Overall, we conclude that the current hardware mechanism for fine-grained, page-level mapping and protection is functionally sufficient, but poorly suited to delivering high performance for big-data applications.

The paper also argues that page-level mapping and protection consume a non-negligible number of cycles when running big-data applications. The evidence comes from several benchmarks, including graph500, GUPS, and NPB. The results show that with regular 4 KB pages, D-TLB misses can take as much as 51% of execution time. Even with 2 MB and 1 GB super pages, D-TLB misses can still cause a slowdown, ranging from zero to 10% of execution time.

A segmentation approach can help relieve the slowdown caused by page-level mapping for big-data applications. In the direct segment design, three registers are added to the context of a process: BASE, LIMIT and OFFSET. The BASE register holds the start address of the segment in the virtual address space, while LIMIT holds the address of the end of the segment. OFFSET stores the difference between the start of the physical segment and the start of the virtual segment. During an address translation, the MMU compares the virtual address with BASE and LIMIT. If the virtual address falls in between, segmentation is used and the TLB lookup is aborted. The physical address is generated by adding OFFSET to the virtual address, which takes only a few cycles. The page table and the TLB are avoided entirely in this translation path. To turn off segmentation, set LIMIT and BASE to the same value; no address can then fall within the range, so the MMU always uses page translation. The address range mapped by a direct segment is always readable and writable. Because the three registers are part of the process context, the OS must save and restore them on context switches and, optionally, on system calls.
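The translation rule described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the names `DirectSegment` and `translate`, the dictionary used as a stand-in page table, and the choice to treat LIMIT as an exclusive upper bound are all hypothetical.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # regular 4 KB pages


@dataclass
class DirectSegment:
    """Hypothetical model of the three per-process registers."""
    base: int = 0    # BASE: first virtual address of the segment
    limit: int = 0   # LIMIT: end of the segment (assumed exclusive here)
    offset: int = 0  # OFFSET: physical start minus virtual start

    def covers(self, vaddr: int) -> bool:
        # With base == limit the range is empty, so the segment is disabled.
        return self.base <= vaddr < self.limit


def translate(seg: DirectSegment, vaddr: int, page_table: dict) -> int:
    """Return the physical address for vaddr.

    page_table maps virtual page number -> physical frame number and
    stands in for the regular TLB and page walk path.
    """
    if seg.covers(vaddr):
        # Segment hit: a single addition, no TLB or page table involved.
        return vaddr + seg.offset
    # Segment miss: fall back to conventional paging.
    vpn, page_offset = divmod(vaddr, PAGE_SIZE)
    frame = page_table[vpn]  # a KeyError here models a page fault
    return frame * PAGE_SIZE + page_offset
```

For example, with a segment covering virtual addresses 0x10000000 up to 0x50000000 and backed by physical memory starting at 0x20000000, `translate` returns 0x20000040 for virtual address 0x10000040, while an address outside the range goes through `page_table` as usual.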

The OS lets applications take advantage of direct segment through the abstraction of a primary region. Memory allocated in the primary region is expected not to benefit from paging. An application can explicitly request a primary region allocation through system calls (such as mmap()), or the system administrator can configure the OS to place all memory of certain applications in the primary region by default. To implement the primary region efficiently, the OS must be aware of the address space requirements in both the virtual and the physical address space: large, contiguous address ranges need to be reserved for the primary region mapping. The OS can also adjust the size of the region dynamically at runtime. If the OS itself could benefit from direct segment, kernel memory can also be mapped using a direct segment. In this case, the OS must also swap the three registers described above on system call entry and exit.
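To make the contiguity requirement concrete, here is a minimal sketch of how an OS might derive the register values for a primary region and grow it later. Everything in it is an assumption for illustration: `SegmentRegs`, `map_primary_region`, `grow_primary_region` and the `free_phys_after` parameter are hypothetical names, not interfaces from the paper or from any real kernel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentRegs:
    """Hypothetical snapshot of BASE, LIMIT and OFFSET for one process."""
    base: int
    limit: int
    offset: int


# BASE == LIMIT means no address can match, so segmentation is off.
DISABLED = SegmentRegs(base=0, limit=0, offset=0)


def map_primary_region(virt_start: int, size: int, phys_start: int) -> SegmentRegs:
    """Back a contiguous virtual range with a contiguous physical range."""
    if size <= 0:
        return DISABLED
    return SegmentRegs(
        base=virt_start,
        limit=virt_start + size,
        offset=phys_start - virt_start,  # one constant for the whole range
    )


def grow_primary_region(regs: SegmentRegs, extra: int, free_phys_after: int) -> SegmentRegs:
    """Extend the region by `extra` bytes if the physical memory directly
    after the segment is free; otherwise leave the registers unchanged
    (the caller would then serve the request with regular paging).
    """
    if extra > free_phys_after:
        return regs
    # OFFSET stays fixed, which is why both ranges must remain contiguous.
    return SegmentRegs(regs.base, regs.limit + extra, regs.offset)
```

Because a single OFFSET covers the whole segment, the region can only grow in place when the adjacent physical memory is free, which is why the OS has to reserve large contiguous ranges up front.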

Compared with paging, direct segment scales well as memory size keeps growing. To accommodate ever larger main memories, a traditional paging mechanism must increase either the size of the TLB or the granularity of the mapping. The former is not always achievable, because the TLB lookup is on the critical path of every memory operation, and a large TLB can be slow and power hungry. The latter seems feasible, but it is difficult to implement, because designing TLBs for multiple page size classes is challenging. In addition, large mapping units are inflexible: the potential memory waste grows with the size of the mapping unit. Direct segment, on the other hand, scales perfectly. Arbitrarily large segments can be mapped with the addition of only three registers. It is also easy to implement, as the segmentation scheme is straightforward and the OS only needs to save and restore three registers on each context switch. On virtual machines, direct segment can be applied to reduce the overhead of nested page walks, which can require up to 24 memory references. If the mapping from guest physical addresses (gPA) to host physical addresses (hPA) is described by a segment mapping, converting a guest virtual address (gVA) to an hPA takes only 4 memory references plus a segment translation.
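The figures of 24 and 4 memory references can be reproduced with a short calculation. The sketch below assumes four-level page tables for both guest and host and ignores TLB and page walk caches; the function names are hypothetical.

```python
def nested_walk_refs(guest_levels: int = 4, host_levels: int = 4) -> int:
    """Worst-case memory references to translate a gVA to an hPA.

    Each guest page table level is located by a guest physical address,
    which first needs a full host walk (host_levels references) before
    the guest entry itself can be read (1 reference). The final guest
    physical address of the data needs one more host walk.
    """
    per_guest_level = host_levels + 1
    return guest_levels * per_guest_level + host_levels


def segment_backed_walk_refs(guest_levels: int = 4) -> int:
    """Same walk when gPA -> hPA is a segment translation, i.e. an
    addition that costs no memory references."""
    return nested_walk_refs(guest_levels, host_levels=0)
```

With the default arguments, `nested_walk_refs()` evaluates to 4 * (4 + 1) + 4 = 24, and `segment_backed_walk_refs()` evaluates to 4, matching the numbers quoted above.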

Direct segment has a few limitations. First, it is not fully general, as only one segment mapping is provisioned per process context. According to another study, segment mapping works best if several segments can be mapped at the same time. Second, direct segment forfeits the ability of a big-data application to overcommit memory via demand paging: physical memory is allocated even if it is never used. This can be alleviated, however, because the configuration files of some applications specify the maximum amount of memory the application may consume. By configuring these applications carefully, the negative effect of not being able to overcommit may be reduced or even offset.