CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation



Paper Title: CARAT CAKE: Replacing Paging via Compiler/Kernel Cooperation
Link: https://dl.acm.org/doi/10.1145/3503222.3507771
Year: ASPLOS 2022
Keyword: Carat; Compiler; TLB; OS




Highlights:

  1. The MMU hardware is costly in several ways: it limits the L1 cache size, puts TLB misses on the memory-access critical path, consumes chip area and power, and complicates mitigating attacks such as Spectre.

  2. The MMU provides three different features, i.e., permission checking, control flow checking (for OS entry points), and transparent page migration. All three can be implemented in software by having the compiler inject software stubs before memory accesses and control flow transfers.

  3. With software assistance, we can build a single-address-space OS that still supports multiple processes. Isolation among processes, and between processes and the kernel, is achieved with software runtime checks.

  4. When a region (i.e., a segment in the x86 addressing model) is relocated by the OS, the software needs to rewrite all pointers that point to objects within the region. This is achieved by tracking all object addresses and escapes (i.e., memory locations that store pointers to the objects) using internal data structures. The runtime then iterates over the objects in the region and rewrites the escaped pointers so that they point into the relocated region.

Comments:

  1. My biggest concern about the design is its usage scenario. If Carat is to be deployed on servers running cloud workloads, how does it solve the virtualization problem (which is conventionally addressed with 2D page table walks)? The design seems like a perfect fit for micro VMs that run within a lightweight supervisor, as it would let a micro VM start multiple processes as service handlers.

  2. Statically linked binaries can be huge and there is no virtual memory mechanism to deduplicate them. How does Carat address this problem?

This paper proposes Carat Cake, a compiler-OS co-design that eliminates the need for the hardware memory management unit (MMU) and TLB while providing the same level of address mapping and protection. Carat Cake is motivated by the high performance and power overhead of today's address translation infrastructure and aims to eliminate that overhead entirely, allowing applications to execute directly on the physical address space. Carat Cake achieves this goal via software instrumentation inserted by the compiler to check access rights at runtime. It also requires a user-space software runtime as well as operating system coordination so that address mapping and memory compaction can be performed transparently to the application.

Conventionally, address translation occurs for every memory access instruction issued by the CPU. Translation is performed by the hardware MMU, which consists of a TLB that caches recently used translation entries and a page table walker that fetches an entry from main memory when an access misses the TLB. While this provides the flexibility of a per-process virtual address space, the benefit comes at a cost, and the paper identifies four issues with the MMU. First, the MMU consumes chip real estate and increases energy consumption, as it is a significant piece of hardware that is exercised by every memory instruction. It is also difficult to get right and requires non-trivial design and verification effort, which lengthens the development process. Second, address translation limits the size of the L1 cache because the L1 is virtually indexed: the index bits must come from the part of the address that translation leaves unchanged, i.e., the page offset. For example, with 4KB pages the lower 12 bits of an address are unchanged by translation; with 64-byte cache lines, 6 of those bits form the line offset, leaving 12 - 6 = 6 bits for the set index and limiting the L1 to 64 sets (e.g., 32KB at 8-way associativity). Third, address translation adds latency on the memory access critical path, especially when the translation misses the TLB and a page walk must be started. This latency has become a real issue with today's big-data workloads, whose working sets far exceed the coverage of any realistic TLB. Lastly, security attacks that exploit hardware-controlled access permissions, such as Spectre, can be difficult to mitigate with software patches, since the MMU hardware is generally not programmable.
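
As a quick sanity check of the arithmetic above, the following sketch computes the limit, assuming 64-byte cache lines and 8-way associativity (typical x86 values that the paper does not itself fix):

```c
/* Sketch: why a virtually indexed L1 with 4KB pages tops out at 64 sets.
 * Line size and associativity are assumed, not taken from the paper. */
#include <stdio.h>

int main(void) {
    const unsigned page_offset_bits = 12; /* 4KB pages              */
    const unsigned line_offset_bits = 6;  /* 64-byte cache lines    */
    const unsigned ways             = 8;  /* assumed associativity  */

    unsigned index_bits = page_offset_bits - line_offset_bits;  /* 6  */
    unsigned sets       = 1u << index_bits;                     /* 64 */
    unsigned bytes      = sets * (1u << line_offset_bits) * ways;

    printf("index bits: %u, sets: %u, max L1 size: %u KB\n",
           index_bits, sets, bytes / 1024);                     /* 32 KB */
    return 0;
}
```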

In order to address these issues with hardware address translation, the paper proposes to eliminate the hardware MMU and instead use software to provide equivalent mapping and protection. The paper identifies three capabilities that must be provided in order to match the abstraction of the existing virtual memory mechanism. First, the design must enforce access rights on user-defined memory regions; this corresponds to the per-page permission bits of a conventional virtual memory system. Second, the design must allow only pre-defined entry points to be used when requesting kernel or other high-privileged functions, while preventing jumps to arbitrary code in the kernel or in another process's address space; this corresponds to the system call mechanism on existing architectures, which exposes supervisor entry points. Lastly, the design must support transparent relocation, which allows physical memory to be moved or swapped out without disrupting user-space execution; in conventional systems this is achieved by updating the virtual-to-physical mapping and then performing a TLB shootdown.

In order to move the responsibility of the MMU into software, Carat leverages compile-time instrumentation to add stubs before certain critical operations. At a high level, Carat uses a customized compiler toolchain whose special passes run over the generated IR and insert stubs called “guards”. When an application is compiled, the compiler scans the IR, identifies instructions that may require extra checks, and inserts the guards. The same process is repeated for every dependency. The executable is produced by statically linking all the pieces together and signing the binary with the provenance of the compiler used. The Carat runtime is also linked into the application; it is responsible for communicating with the OS and, as we will see, for performing various tasks on behalf of the OS that the hardware MMU would have handled on today's hardware.
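
To make the instrumentation concrete, here is a rough sketch at the C source level of what guard insertion produces; the real passes operate on LLVM IR, and the guard name used here (carat_guard_access) is illustrative rather than the paper's actual symbol:

```c
/* Hypothetical illustration of guard insertion, shown at source level. */
#include <stddef.h>

/* Guard provided by the (assumed) Carat runtime; see the next sketch. */
extern void carat_guard_access(const void *addr, size_t size, int needed_perms);

/* Original code as written by the programmer. */
int sum(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Conceptual shape of the same function after the guard-insertion pass. */
int sum_instrumented(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        carat_guard_access(&a[i], sizeof a[i], /* PERM_READ */ 1); /* injected */
        s += a[i];
    }
    return s;
}
```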

We next describe how Carat implements each of these features. In order to enforce access rights and privilege checks on function calls, Carat inserts a stub before every memory operation, including explicit memory accesses, function calls and returns, and the function prologue/epilogue that sets up the stack. The stub takes the dynamic address of the memory access and compares it against an internal region table. The region table is initialized by the OS when it loads the binary and is communicated to the Carat runtime when the process is initialized; it lists the valid address regions assigned to the process and the access rights of each region. For memory operations, the accessed address and the operation itself are checked to verify that the address falls within an allowed region and that the operation matches the region's permission. For stack operations, Carat checks whether the stack is about to overrun its allocated area; if so, Carat notifies the kernel, which may either allocate more memory for the stack or terminate the process. For function calls, both the target address of the call and the stack boundary are checked. If the target of a call into privileged code is not one of the pre-defined entry points, the call triggers an access violation and the process is terminated.
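
A minimal sketch of what such guard routines might look like, assuming a simple array-backed region table and illustrative names; the actual runtime is considerably more elaborate:

```c
/* Hypothetical guard routines backed by a per-process region table. */
#include <stddef.h>
#include <stdint.h>

enum { PERM_READ = 1, PERM_WRITE = 2, PERM_EXEC = 4 };

struct region {
    uintptr_t base;   /* start of the region            */
    size_t    len;    /* length in bytes                */
    int       perms;  /* bitwise OR of the PERM_* flags */
};

/* Filled in by the kernel when it loads the process. */
static struct region region_table[64];
static size_t        region_count;

static void raise_violation(void) { /* notify the kernel; terminate the process */ }

/* Guard inserted before loads and stores. */
void carat_guard_access(const void *addr, size_t size, int needed_perms) {
    uintptr_t p = (uintptr_t)addr;
    for (size_t i = 0; i < region_count; i++) {
        const struct region *r = &region_table[i];
        if (p >= r->base && p + size <= r->base + r->len) {
            if ((r->perms & needed_perms) == needed_perms)
                return;               /* in range and permission matches */
            break;
        }
    }
    raise_violation();
}

/* Guard inserted before calls into privileged code. */
void carat_guard_call(const void *target,
                      const void *const *entry_points, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (entry_points[i] == target)
            return;                   /* a sanctioned kernel entry point */
    raise_violation();
}
```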

Inserting a stub before every memory access would inevitably slow down execution tremendously, so the paper proposes several optimizations. The first is to leverage hardware extensions such as Intel MPX to perform the bounds checks, eliminating the instructions that look up the region table and perform the comparison. Second, static analysis can further reduce the number of checks. For example, a loop-invariant address only needs to be checked once before the loop, and if the compiler can infer the range of addresses a loop will access, it can hoist the per-iteration checks out of the loop as a single check over that whole range.
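
The hoisting idea can be illustrated with a small sketch; carat_guard_range is an assumed helper that checks an entire address range against the region table at once:

```c
/* Hypothetical illustration of check hoisting. */
#include <stddef.h>

extern void carat_guard_access(const void *addr, size_t size, int needed_perms);
extern void carat_guard_range(const void *base, size_t len, int needed_perms);

/* Naive instrumentation: one guard per iteration. */
long sum_guarded(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        carat_guard_access(&a[i], sizeof a[i], /* PERM_READ */ 1);
        s += a[i];
    }
    return s;
}

/* After hoisting: the whole accessed range [a, a + n) is checked once. */
long sum_hoisted(const int *a, size_t n) {
    carat_guard_range(a, n * sizeof *a, /* PERM_READ */ 1);
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```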

As discussed earlier, Carat also supports relocating the regions allocated to a process. Relocation is vital to Carat's normal operation: because all processes share the same physical address space, regions must be moved from time to time to reduce fragmentation. In order to move regions, Carat tracks all pointers to non-local objects (e.g., heap allocations and global data) and rewrites them when a region is relocated. The tracking mechanism is implemented with two internal data structures. The first is an allocation map that records statically and dynamically allocated objects: it is populated when malloc() is called to obtain a pointer, and statically allocated objects are inserted at initialization time. The second is the allocation escape map, which maps every object pointer in the allocation map to the set of memory locations that store that pointer's value. The allocation escape map is updated based on the results of escape analysis when entering and exiting a function. Luckily, the paper observes that most allocations have only a small number of escapes, so the map is updated only sparsely.
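
A simplified sketch of the tracking structures, folding each object's escape set into its allocation entry and using fixed-size tables for brevity; the real runtime uses more scalable data structures, and the names here are illustrative:

```c
/* Hypothetical allocation/escape tracking. */
#include <stddef.h>
#include <stdlib.h>

struct allocation {
    void  *base;          /* address returned by malloc() or of a static object  */
    size_t size;
    void **escapes[16];   /* memory locations that hold a pointer to this object */
    size_t escape_count;
};

static struct allocation allocation_map[1024];
static size_t            allocation_count;

/* Wrapper around malloc(): every dynamic allocation is recorded. */
void *carat_malloc(size_t size) {
    void *p = malloc(size);
    if (p && allocation_count < 1024)
        allocation_map[allocation_count++] =
            (struct allocation){ .base = p, .size = size };
    return p;
}

/* Called by instrumentation when a pointer to an object is stored to
 * memory, i.e., when the pointer "escapes". */
void carat_track_escape(void *object, void **location) {
    for (size_t i = 0; i < allocation_count; i++) {
        struct allocation *a = &allocation_map[i];
        if ((char *)object >= (char *)a->base &&
            (char *)object <  (char *)a->base + a->size) {
            if (a->escape_count < 16)
                a->escapes[a->escape_count++] = location;
            return;
        }
    }
}
```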

When the kernel decides to relocate a region belonging to a process, it notifies the process by signaling the Carat runtime, which lives in one of the process's regions. If the runtime decides that the relocation target is valid, it begins the relocation by stopping all threads in the process and having them dump a snapshot of their register state. The runtime then copies the region from its current physical address to the destination and rewrites all pointers pointing into the region. The pointers are found by querying the allocation map with the region's old address range; for each object residing in the region, every escaped pointer is rewritten so that it refers to the same object after relocation. Since registers may also hold pointer values, the runtime rewrites the register dump to reflect the changes as well. After all pointers have been rewritten, the threads reload the register snapshot and resume execution.
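
The patching step might look roughly like the following sketch, which reuses the structures from the previous sketch and assumes all threads are already stopped with their registers dumped; corner cases, such as escape locations that themselves live inside the moved region, are glossed over:

```c
/* Hypothetical region relocation with pointer patching. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct allocation { void *base; size_t size; void **escapes[16]; size_t escape_count; };
extern struct allocation allocation_map[];
extern size_t            allocation_count;

static int in_range(const void *p, const void *base, size_t len) {
    return (const char *)p >= (const char *)base &&
           (const char *)p <  (const char *)base + len;
}

void carat_relocate_region(void *old_base, void *new_base, size_t len,
                           uintptr_t *reg_dump, size_t nregs) {
    ptrdiff_t delta = (char *)new_base - (char *)old_base;

    /* 1. Copy the region's bytes to the destination. */
    memmove(new_base, old_base, len);

    /* 2. Rewrite every escaped pointer that pointed into the old region. */
    for (size_t i = 0; i < allocation_count; i++) {
        struct allocation *a = &allocation_map[i];
        if (!in_range(a->base, old_base, len))
            continue;
        for (size_t e = 0; e < a->escape_count; e++) {
            void **loc = a->escapes[e];
            *loc = (char *)*loc + delta;
        }
        a->base = (char *)a->base + delta;   /* keep the map consistent */
    }

    /* 3. Patch any saved register that held a pointer into the region. */
    for (size_t r = 0; r < nregs; r++)
        if (in_range((const void *)reg_dump[r], old_base, len))
            reg_dump[r] += (uintptr_t)delta;
}
```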

The authors also build a prototype OS kernel to validate the feasibility of the design. The kernel still runs on the x86 architecture and hence does not eliminate the MMU hardware (the paper mentions that the MMU can be turned off in 32-bit mode, but this is no longer supported in 64-bit mode). Instead, the kernel initializes an identity mapping from virtual to physical addresses using the largest available page size, 1GB. This way, although the MMU is still functioning, it is extremely unlikely to become an actual bottleneck, since TLB coverage is maximized. The authors also extend the kernel to support the process abstraction and per-process regions, and the kernel itself is compiled with the same compiler so that kernel code is also protected against access violations. In the final operating model, all processes share the same identity-mapped address space. Each process consists of one or more regions, which are contiguous ranges of memory assigned to the process with the proper permissions, and Carat's software instrumentation guarantees that processes remain isolated from one another by checking every memory instruction for permission. Processes still request OS services using system calls; however, since processes and the OS run at the same privilege level, they are isolated from the kernel by checking the integrity of control-flow transfers at runtime.
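
The identity mapping can be sketched as follows for x86-64 four-level paging with 1GB pages, assuming at most 512GB of physical memory; this illustrates the idea rather than the kernel's actual code:

```c
/* Hypothetical identity-map setup with 1GB pages (covers the low 512GB). */
#include <stdint.h>

#define PTE_PRESENT  (1ull << 0)
#define PTE_WRITABLE (1ull << 1)
#define PTE_HUGE     (1ull << 7)   /* PS bit: 1GB page at the PDPT level */
#define GB           (1ull << 30)

/* One top-level table (PML4) and one PDPT, both 4KB-aligned. */
static uint64_t pml4[512] __attribute__((aligned(4096)));
static uint64_t pdpt[512] __attribute__((aligned(4096)));

void build_identity_map(void) {
    /* Each PDPT entry maps one 1GB chunk of physical memory to itself. */
    for (uint64_t i = 0; i < 512; i++)
        pdpt[i] = (i * GB) | PTE_PRESENT | PTE_WRITABLE | PTE_HUGE;

    /* The first PML4 entry covers the low 512GB via the PDPT. */
    pml4[0] = (uint64_t)(uintptr_t)pdpt | PTE_PRESENT | PTE_WRITABLE;

    /* Loading CR3 with pml4's physical address (a privileged operation)
     * would activate the mapping; omitted here. */
}
```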