SpaceJMP: Programming with Multiple Virtual Address Spaces
Link: https://dl.acm.org/citation.cfm?id=2872366
Year: ASPLOS 2016
Keyword: Virtual Memory
This paper proposes SpaceJMP, a software-only framework for supporting address space switching in operating system kernels. Traditionally, a virtual address space (VAS) is a second-class object in the operating system: it is just one of the resources held by a process. This model worked well until recently, when big-data workloads of tens of TBs began to emerge. The paper identifies three scenarios in which a single VAS may not be sufficient for such workloads. First, while workload sizes keep growing, the size of the virtual address space has remained unchanged over the past few years. On current platforms, most VASes have only 48 bits, which can address 256TB of virtual memory. If the working set grows beyond this limit, the memory allocator can no longer allocate from the VAS because addresses are exhausted (not to mention that the OS, stack, and code themselves occupy part of the VAS).
The second scenario occurs when an address space migrates between machines or process sessions. For example, when a non-volatile memory (NVM) device is installed in the system and mapped into part of the address space, pointer-based data structures stored on it must either use a special pointer representation (e.g. the pointer stores the offset from the start of the mapped region rather than a VA) or be mapped at the same VA in every session. This is because ordinary pointers store virtual addresses, which are offsets from address zero rather than from the start of the mapped region. When the device is installed on another machine (or even mapped by another process session), the base address of the region may differ from what it used to be, causing program errors since all pointers now refer to undefined objects. If, on the other hand, the address space of the NVM device can always start at absolute zero (i.e. it has its own address space), pointer semantics remain consistent no matter how the VAS migrates.
The last scenario is when processes share part of their address space. This has already been implemented on current systems. The paper, however, claims that these interfaces are either inefficient (overhead of building page tables) or difficult to use (coupled with the file system interface, for example, in mmap()).
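The pointer problem in the second scenario can be illustrated with a small simulation. This is only a sketch, not code from the paper: the region is modeled as a Python bytearray, and the names `build_list`, `walk`, `OLD_BASE`, and `NEW_BASE` are hypothetical. A linked list whose `next` fields hold absolute virtual addresses breaks once the region is mapped at a different base, while the offset-based variant keeps working.

```python
import struct

NODE = struct.Struct("<QQ")   # one node: (value, next), two 64-bit words
NULL = 0xFFFFFFFFFFFFFFFF     # sentinel marking the end of the list


def build_list(values, base, absolute):
    """Lay out a linked list in a fresh region that is mapped at `base`.

    With absolute=True each `next` field holds a virtual address
    (base + offset); otherwise it holds only the offset into the region.
    """
    region = bytearray(NODE.size * len(values))
    for i, value in enumerate(values):
        if i + 1 < len(values):
            nxt = (i + 1) * NODE.size + (base if absolute else 0)
        else:
            nxt = NULL
        NODE.pack_into(region, i * NODE.size, value, nxt)
    return region


def walk(region, base, absolute):
    """Traverse the list, assuming the region is currently mapped at `base`."""
    out = []
    addr = base if absolute else 0          # the head node sits at offset 0
    while addr != NULL:
        offset = addr - base if absolute else addr
        if not 0 <= offset <= len(region) - NODE.size:
            raise ValueError(f"pointer {addr:#x} falls outside the mapped region")
        value, addr = NODE.unpack_from(region, offset)
        out.append(value)
    return out


# Hypothetical base addresses of the same region in two different sessions.
OLD_BASE, NEW_BASE = 0x7F00_0000_0000, 0x7E00_0000_0000

relative = build_list([1, 2, 3], OLD_BASE, absolute=False)
print(walk(relative, NEW_BASE, absolute=False))   # offsets survive remapping

absolute = build_list([1, 2, 3], OLD_BASE, absolute=True)
print(walk(absolute, OLD_BASE, absolute=True))    # fine while the base is unchanged
try:
    walk(absolute, NEW_BASE, absolute=True)
except ValueError as err:
    print("after remapping:", err)
```

Giving the region its own address space corresponds to the case where the base never changes, so plain absolute pointers remain valid without any offset arithmetic on each dereference.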
The paper proposes decoupling the one-to-one correspondence between processes and VASes, so that a VAS becomes a first-class object that a process can create explicitly. Furthermore, a process may detach from its current VAS and attach to a different one, which is equivalent to “swapping out” data that is not currently in use in a more coarse-grained (and therefore more efficient) manner, enabling the same range of addresses to be reused.
No matter how VASes are switched in and out, some basic address mappings must be kept consistent, such as user application code, the application stack, and OS data structures; otherwise, the behavior is undefined. To this end, the paper proposes encapsulating these important mappings as “segments”. A segment can be thought of as a collection of VA-to-PA mappings over a contiguous range of VAs. An application’s context in the traditional sense consists of several segments at non-overlapping addresses. Since a VAS can be shared between processes, segments are further divided into two kinds: public and private. Address mappings contained in public segments are visible to all processes sharing a VAS. No two public segments can overlap in the addresses they map, nor can they overlap with private mappings. Private mappings, on the other hand, are process-specific. Each process can have its own private mappings at the same addresses without colliding with those of other processes, which gives processes a chance to “paste” their own mappings, such as code and stack, onto a shared VAS. Private mappings of different processes never collide, while within a single process they must still map distinct addresses.
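The overlap rules for public and private segments can be written down as a small model. This is a sketch under assumptions, not the paper's data structures: the `Segment` and `VAS` classes and their method names are hypothetical, and segments are reduced to a base and a size.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    name: str
    base: int
    size: int

    def overlaps(self, other):
        # Two half-open ranges [base, base + size) intersect
        return (self.base < other.base + other.size
                and other.base < self.base + self.size)


@dataclass
class VAS:
    public: list = field(default_factory=list)
    private: dict = field(default_factory=dict)   # pid -> private segments

    def attach_public(self, seg):
        # Visible to every process, so it must avoid all existing segments
        others = self.public + [s for segs in self.private.values() for s in segs]
        if any(seg.overlaps(o) for o in others):
            raise ValueError(f"public segment {seg.name} overlaps an existing mapping")
        self.public.append(seg)

    def attach_private(self, pid, seg):
        # Only checked against public segments and this process's own segments
        mine = self.private.setdefault(pid, [])
        if any(seg.overlaps(o) for o in self.public + mine):
            raise ValueError(f"private segment {seg.name} overlaps for pid {pid}")
        mine.append(seg)


vas = VAS()
vas.attach_public(Segment("shared-heap", 0x1000_0000, 0x10_0000))
# Two processes paste their code at the same base address without conflict.
vas.attach_private(1, Segment("code-p1", 0x40_0000, 0x1000))
vas.attach_private(2, Segment("code-p2", 0x40_0000, 0x1000))
try:
    vas.attach_public(Segment("late-public", 0x40_0800, 0x1000))
except ValueError as err:
    print(err)
```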
In SpaceJMP, segments and VASes are both created via library APIs. A segment is created by specifying its base address and size; once created, it can be used to represent an address range mapped by a VAS. VASes are created in a similar way (without specifying a base address or size). Creating either a VAS or a segment returns an object handler, which serves as the argument for future operations on the object. A process may attach to an existing VAS using the vas_attach() interface, specifying the VAS by its object handler. This function returns a session handler, whose usage is explained below. After attaching to a VAS and before switching to it, the process can “paste” private segments into the session using seg_attach(). The arguments to seg_attach() are the segment’s object handler and the session handler of the current process. This attach operation lets the process keep access to its execution context after switching, because the same VAs are mapped to the same physical pages in the new VAS. A process can also attach globally visible public segments to a VAS using seg_attach(), as long as it passes the VAS’s object handler instead of the session handler as the VAS argument. Attaching a public segment may not always succeed, since a previously attached segment may already occupy that part of the address space. A process switches into the new VAS using vas_switch(), which is similar to a context switch except that it never switches out the current control flow or execution stack, and no register context is saved. VASes are maintained by the OS as first-class objects, which means that even when all processes have detached from a VAS, it still exists within a namespace and can be found by calling vas_find().
The implementation of SpaceJMP involves creating and destroying extra page tables for processes. Instead of strictly one page table per process, each process in SpaceJMP can possess multiple page tables, one for each VAS it has attached to (these page tables may be cached by the kernel to make future attachments faster). When a VAS is created, an empty page table is initialized as the base mapping (i.e. all mappings are invalid until segments are pasted). When a public segment is attached, the page table of the VAS is modified to incorporate the corresponding entries from the page table of the calling process. Private segments, in contrast, are highly likely to collide among processes (since application code is mapped at the same base address in most systems), so attaching one cannot change the VAS’s page table globally. Instead, the kernel performs something similar to “copy-on-write”: it duplicates the current page table of the VAS and then copies entries from the calling process’s current page table into the duplicate. Recall that a private segment is attached using the session handler rather than the VAS handler; the OS simply associates the customized page table with the session handler, and every time the process switches into the VAS through that session, the page table from the session object is used. Furthermore, the duplicated page table is updated if public segments are attached after it has been created.
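The page table handling described above can be approximated with dictionaries standing in for page tables. This is a simplified sketch, not the kernel implementation: the class `VASPageTables` and its method names are hypothetical, a page table is a flat mapping from virtual page number to physical frame number, and the duplicate is a full copy rather than anything shared at the hardware level.

```python
class VASPageTables:
    def __init__(self):
        self.base = {}       # VAS-wide table: virtual page -> physical frame
        self.sessions = {}   # session handler -> private duplicate of the table

    def attach_public(self, proc_table, pages):
        """Copy the caller's entries into the VAS table and every duplicate."""
        entries = {vp: proc_table[vp] for vp in pages}
        self.base.update(entries)
        for table in self.sessions.values():
            table.update(entries)

    def attach_private(self, session, proc_table, pages):
        """Duplicate the VAS table on first use, then add the caller's entries."""
        table = self.sessions.get(session)
        if table is None:
            table = self.sessions[session] = dict(self.base)
        table.update({vp: proc_table[vp] for vp in pages})

    def table_for(self, session):
        """Table to install when switching in through this session."""
        return self.sessions.get(session, self.base)


tables = VASPageTables()
# Both processes map their code at virtual page 0x400, backed by different frames.
tables.attach_private("session-A", {0x400: 7}, [0x400])
tables.attach_private("session-B", {0x400: 9}, [0x400])
# A public segment attached later reaches the base table and both duplicates.
tables.attach_public({0x10000: 42}, [0x10000])
print(tables.table_for("session-A"))
print(tables.table_for("session-B"))
print(tables.base)
```

Keeping one duplicate per session is what lets colliding private mappings coexist, at the cost of touching every duplicate whenever a public segment is attached.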