Rebooting Virtual Memory with Midgard



- Hide Paper Summary
Paper Title: Rebooting Virtual Memory with Midgard
Link: http://cs.yale.edu/homes/abhishek/sidgupta-isca21.pdf
Year: ISCA 2021
Keyword: Virtual Memory; Segmentation; Midgard



Back

Highlights:

  1. This paper absorbs several observations: (1) Large mapping granularity helps to reduce translation overhead; (2) Smaller granularity is good for resource management; (3) Virtual address caches do not need translation, but due to lack of inter-process isolation, they suffer from synonym and homonym problems. The paper combines these three to only use their advantages while dodging the disadvantages: (1) Large granularity mapping is done between VA and MA at processor side to reduce translation overhead; (2) The cache hierarchy uses MA which is similar to VA caches, but processes are properly isolated; (3) Physical pages are still managed in small granularity by mapping MA to PA when data is transferred between the cache hierarchy and the main memory.

  2. Decouples isolation from physical memory management. In conventional virtual memory systems, isolation is performed in much larger granularity, e.g., code and data segments are typically either fully shared or not shared between processes. Similar argument is also true for access permissions. In one word: It is wasteful to maintain fine-grained info at page granularity. On the other hand, maintaining per-4KB page metadata is good for resource management, because it allows flexible address mapping and reduces segmentation. This two goals should be decoupled in future virtual memory systems.

  3. Addresses make no sense in a cache hierarchy. They are simply symbols for distinguishing different blocks. As long as these symbols do not collide (i.e., different blocks do not use the same tag), and that both upper level (CPU / VA space) and lower level (main memory / PA space) understand the symbol (i.e., have a way of translating them from VA/to PA), the cache will work.

  4. In fact, the cache may just have its own address space, according to the point above. Midgard can be generalized by giving the cache hierarchy its own address space, which is neither VA nor PA. The extra address space can be designed to be semantically rich (e.g., contains type information, allows certain hints from the application, etc.). As long as mapping from VA to this special address space and from this space to PA are easy to perform, this is always doable.

Comments:

  1. How is CoW performed in Midgard? When you fork a process, you copy its Midgard mapping table, essentially sharing code and data segments between the parent and child processes, and the permission of the child table changes to read-only. So what happens if a writes happens on the child? Do you just copy the entire segment (expensive)? Or you break down the segment (nullifies Midgard) in the child process such that one 4KB of them can be readable-writable? File-mapped I/O that requires fine-grained access control also needs to break down segments. Same for some mmap() based tricks (e.g., allocating a large array not backed by any physical storage).

  2. Cache coherent DMA (and in general, peripherals that access the memory) has to know the reverse translation between PA and MA, if the DMA is configured with PA, or has a copy of the page table, if the DMA is configured with MA (this might just be the right thing to do). Luckily, existing IOMMUs already implement the latter, and even better, Midgard’s page table is designed to be on the memory controller side, meaning that it is closer to peripherals.

  3. I do not quite get what the “short-circuit translation between MA and PA” is. I understand that it can be performed from the lowest level of the radix tree with the hope that the lowest level entries are already cached in the LLC (it is already done on today’s MMU, with a page walk cache). What I could not get is that, if you perform this optimization, then we do you need to allocate all levels of the radix tree in a linear array? Yes, I know this is because the PA of the entry can be easily computed given the MA. But, with a linear array, you do not even need upper level entries - just read the direct-mapped entry, and that’s it. Later sections (Sec. IV-B) also suggest that all page table entries are mapped into one consecutive MA segment, which is even more confusing. I can be wrong, but I think the authors are mixing two different things here.

  4. I have one question on the argument that caches can “filter out” most translation activities. Although data looks good and it is consistent with the claim, my question is, if the working set is large, then LLC misses should still be an issue, and on each LLC miss, we still perform translation between MA and PA, which is still on the critical path. If the working set is small, then TLBs will cover the working set, and TLB will not become the bottleneck. In fact, TLB coverage is much bigger than LLC size in most systems, so why would the LLC “filter out” something that is cached by a larger structure?

  5. If page walker issue memory operations to the LLC, wouldn’t there be recursive misses? Imagine if the page walker issues command to read the lowest level PTE using Midgard address, which misses the LLC, and the LLC controller needs to go to the physical address to fetch the entry, which will trigger another translation request on the page table’s Midgard address, and so on. The page walker needs to be re-entrant (i.e., being able to serve multiple concurrent walk requests, and switch between different instances of walks with a stack). This is considerably more difficult than a state-less, non-entrant page walker. One solution is to let the LLC know that Midgard page table always has identity mapping (i.e., Midgard addresses always map to the same physical addresses), such that there is no need for address translation when fetching page table entries.

This paper proposes Midgard, a novel virtual memory framework for reduced translation overhead on the critical path. The paper is motived by the fact that memory translation has become a major performance bottleneck on today’s multicore platform, where several TBs of memory on a single node is not uncommon. Maintaining large working sets in the main memory requires proportionally more translation entries, as modern hardware still performs address translation in a fixed granularity. Unfortunately, the hardware structure for caching these entries for fast access, i.e., TLBs, cannot scale with the growth of working set size for various reasons. As a result, when the working set size exceeds the maximum coverage of the TLB, the MMU must frequently perform page walks to bring new translation entries into the TLB. This is likely on the critical path, since memory instructions will remain in the ROB and/or various queues until their addresses are resolved, which can potentially cause structural hazard to the following instructions. The paper also points out that, since TLBs are essentially caches to main memory data (PTEs), large TLBs also complicate the coherence protocol, namely, the TLB shootdown protocol. On modern architectures, as memory accesses become heterogeneous in terms of latency and throughput, it is a common optimization to migrate pages between different memory modules. This process requires updating the translation information both in the main memory and cached by the TLB, which frequently involves TLB shootdowns. Large TLBs can make the performance worse due to the shootdown overhead.

The paper noted that previous proposals attempted to solve this issue from different directions. Virtual hierarchy uses virtual addresses, instead of physical addresses, to avoid paying the cost of translation on the critical path. Addresses are only translated when accesses are bring made external to the virtual hierarchy. This approach, however, suffers from synonym and homonym, which are caused by one physical address (PA) being mapped by multiple virtual addresses (VA) (i.e., page sharing between address spaces), and one VA being mapped to different PAs, respectively.

Other attempts have also been made with single address space OS and huge pages. The former requires significant changes to today’s programming paradigm and OS interface, which is unrealistic. The latter, despite the fact that it is quite mature and has been already deployed, still causes issues such as fragmentation and alignment problems.

Midgard, on the other hand, decouples semantics level process isolation with physical level resource management, which are two independent functionalities coupled together in today’s existing virtual memory system. There are several design highlights. First, it adds an extra Midgard address space (MA space) between the VA space and the PA space. Processes still have their own VA spaces. The difference is that, instead of performing page-level mapping from VA directly to PA, which is enforced to simplify physical page management, process structures are first mapped from VA to the MA in a segmented manner. To elaborate: modern OSes organize the address space of a process into several segments, i.e., the code segment, data segment, and shared libraries. Each segment is a logical entity in which every address shares the same access permission information, and each segment occupy a consecutive range of address in the VA. Segments from different processes are mapped to different segments in the MA space with one single mapping (shared segments in different processes such as the OS image or libraries are mapped to the same MA segment). Since the number of segments remains relatively constant, which is typically not proportional to the number of pages in the working set, and that segments can be large, the number of translation entries from VA to MA space remains low, e.g., several hundreds of entries can be sufficient. Addresses within segments are mapped linearly from VA to MA space, and hence do not need any extra mapping entry.

Second, the cache hierarchy uses translated MA as tags, such that VA in programs only need to be translated to MA, before they can be used to access the cache hierarchy. This is similar to the virtual cache hierarchy setup, but it gets rid of homonym and synonym, since (1) Process isolation is still enforced since non-sharing segments are still mapped to different MA segments; and (2) Pages/segments shared by different processes have the same MA, and therefore, could only have one copy in the hierarchy, eliminating the possibility of synonym.

Third, MA are translated to PA when a block is written back to the cache, or when a fetch request misses the LLC. The MA to PA translation is performed at traditional 4KB page granularity, and therefore still enjoys the benefit of fine-grained memory management and low fragmentation. Although this step still requires a heavyweight translation, as the number of translation entries is still linearly proportional to the working set size, the paper argues that most memory accesses would not need such translation, since the cache hierarchy has filtered out most memory accesses, which would simply use the faster VA to MA translation.

We next describe the implementational details of Midgard. The VA to MA mapping is implemented as a per-process table, which is accelerated by a per-core hardware cache similar to the TLB, called the Virtual Lookaside Buffer (VLB). The table stores range descriptors that encodes the base address, bound, mapped MA space address, and access permission bits of each segment. The translation table is implemented as a B+Tree for ranged lookup, and the VLB is implemented as a CAM that supports ranged address lookup. The paper noted that a typical process will only use hundreds of segments, which can be stored within a single 4KB physical page.

Addresses in MA space has 64 bits, since it is global and must hold all segments from all processes. As a result, cache tags must also be extended to support the full 64-bit MA.

The OS maintains the mapping between VA and MA. All processes in the system share the same MA, and their segments must be allocated in non-overlapping address ranges. Segments can also dynamically grow in the MA space. Sometimes, segments fail to grow because of another nearby segment. In this case, the OS either breaks down the segment after growth into two, or remaps the segment into a different MA. In the latter case, the cache needs to be flushed, since blocks are cached with MA tags. One nice thing about Midgard is that segment remapping does not require movement of data, as a remapped MA segment can always be mapped to the same physical location by updating MA to PA mapping.

The paper notices that the VLB can still become a bottleneck if not implemented properly, due to the longer latency of hardware range lookup. To deal with this, the paper proposes a two-level VLB architecture. The second-level VLB is fully associative, and it still performs ranged lookups. The first-level VLB, however, uses the classical page-based interface, and it can be set-associative for faster accesses. The entries of the first-level VLB are generated on-the-fly when the L2 VLB is accessed. The L1 VLB entry does not correspond to any entry in the in-memory mapping table.

On the memory controller side, a conventional page table translates MA to PA in 4KB granularity. Bigger granularities can also be supported using the same technique as in today’s page table and TLB. To accelerate translation, a Midgard Lookaside Buffer (MLB) is also added to serve the purpose of what is expected from a conventional TLB. The page table is also implemented as a radix tree. The controller-side MMU performs page walks to fetch PTEs from the table similar to a conventional MMU. The paper suggests that, since the page table must be addressable in order for the OS to make updates, the global page table itself is also mapped by Midgard addresses. To achieve this, the OS reserves a chunk of memory at initialization time (with identity mapping maybe?) on both Midgard and physical address space with proper mappings. The page walker, as a result, injects memory operations to the LLC during page walks. The page walk access can be satisfied either by coherence, if a more recent block is cached in the hierarchy, or by the LLC controller issuing commands to read physical memory.

The paper noted that, however, since midgard addresses have 64 bits, the radix tree, given a fan-out of 512, requires six levels, suggesting longer translation latency. To counter this, the paper proposes a few optimizations, such as short-circuit accesses, continuous radix tree nodes, etc.