BabelFish: Fuse Address Translation for Containers

- Hide Paper Summary

Paper Title: BabelFish: Fuse Address Translation for Containers
Link: https://dl.acm.org/doi/10.1109/ISCA45697.2020.00049
Year: ISCA 2020
Keyword: Virtual Memory; Linux; Paging; MMU; Containers; BabelFish

Back

Highlights:

Containerized processes often share code and data due to using fork() and running on the same binaries, libraries, and middleware. This causes redundant translation entries in the TLB, which can be removed by allowing sharing lower levels of the page table, and correspondingly, the TLB entries, across processes.
Assign each process a CCID to group them into the same domain for sharing translation entries. Lookup logic can use either the conventional ASID, or the CCID for tag comparison. This still allows processes with different CCIDs to use distinct translation entries.
Even for shared TLB entries, exceptions can be allowed using a bit mask tracking processes that do not actually share the entry. This is useful for the common scenario where most processes share a translation entry, but a few of them actually makes private copies of the page due to CoW.
Lower level page table entries can be shared by adding reference counters to these entries. The “O” bit and the ORPC bit are also tracked in these shared entries.

Comments:

In the base version (without the PC bit mask), do you assume that the TLB set index is generated solely with the lower bits of VA? Or it is generated by a function taking both VA and ASID/CCID? I am asking this, because if it is the former case, then all entries with the same VA will be mapped to the same set, which sounds very unlikely because this causes unnecessary set conflicts. In the latter case, the lookup latency can potentially be doubled, because the set index using ASID and CCID can be different. Two probes must be made in this case, CCID first, and ASID later if the previous one misses.
The above issue is even more prominent in the version with PC bit mask, because if a process shares the same CCID with other processes, but it has a private entry, then the first lookup must be performed using CCID and the VA to generate the index, which will hit the entry, and the PC mask will be checked. Only at this moment can we know that the VA has a private entry, and the second probe is performed using ASID and the VA. OK, actually I am wrong here. The OS may just set the PC mask index register of the process to a special value to indicate that it does not belong to any exception. In this case, we do not need a hit of the shared entry to determine whether it is part of the exception.
Where are the CCID field stored? In the page table? What if there are not sufficient number of vacant bits in the current Intel PTE? Obviously software has no way to set the CCID field in the TLB, as the TLB lookup and page walk are fully controlled by the MMU.
Is it a hard requirement that all processes sharing translation entries must use the same page table? I guess it is positive, because otherwise, the TLB may fetch redundant entries with different ASIDs but identical VAs, in which case it also needs to identify redundant entries and merge them.

This paper proposes BabelFish, a virtual memory optimization that aims at reducing duplicated TLB entries and page table entries. BabelFish is motived by the fact that containerized processes often share physical pages and the corresponding address mappings. On current TLB architectures, these mappings will be cached by the TLB as distinct entries, because of the ASID field for eliminating homonym or expensive TLB flushes on context switches. BabelFish reduces the degree of redundancy in both TLB caching and page table entries in the main memory by allowing single TLB entries and single page table entries to be shared across processes, with probable exceptions tracked by additional metadata structures.

Container is a lightweight mechanism for isolating processes running in the same OS, which has drawn an increasing amount of interest in microservice and serverless due to its faster loading time compared with virtual machines. Each containerized process may have its own namespaces and illusion of exclusive ownership to resources such as CPU, memory, and the file system. Processes in containers are, in fact, still Linux processes, and have their own address spaces with virtual-to-physical mapping. The paper observes that containerized processes typically have many identical VA to PA mappings and permission bits despite that they are actually in different address spaces. The paper identifies several factors that contribute to this observation. First, in microservice and serverless architectures, the number of service instances are adjusted based on the dynamic load, and it is common that many instances of the same service are started to handle requests. All of the instances share the same underlying binary and the libraries, which are also usually mapped to the same virtual address at each process. Second, the paper claims that container processes are created with the fork() system call, which produces a child process sharing the VA to PA mapping of the parent process in a Copy-on-Write (CoW) manner. Many pages will not be CoW’ed, and remains being shared and read-only between the two processes, which are also at the same virtual address and mapped to the same physical address. Third, these processes can map certain shared files to the address using mmap() and MAP_SHARED flag, meaning that the content of the file is mapped to the address space and is backed by the same physical pages. If mmap() picks the same virtual address for the mapping, all sharers will also observe the same VA to PA translation entries. Lastly, cloud providers who is responsible for maintaining the containerized environments may also add their own middleware for container management and pricing. The code and data of the middleware can also be shared by all processes.

BabelFish assumes a two-level TLB architecture, in which each entry consists of at least the VA, which is used as the lookup key, the PA, a set of permission bits, and an ASID field to distinguish between the entries from different processes. The organization of the TLB is orthogonal to the topic. The paper also assumes Intel architecture’s four-level, radix tree page table, and an MMU that could perform page walks. A Page Walk Cache (PWC) may also be present to reduce main memory accesses from the page walk by caching the in-memory translation entries from all levels.

BabelFish works by assigning processes that are likely to share translation entries (e.g., processes fork()’ed from the same Zygote) with the same CCID (Container Context ID). The CCID of the process is stored in a context register that is set by the OS on every context switch. TLB entries are also extended with two extra fields: a CCID field that stores the CCID of the processes that may use this translation entry, and an Ownership bit indicating whether the page is privately owned by a process, if set, or shared among processes having the same CCID. In the base version, the TLB lookup logic is modified as follows. When an entry is under comparison in the middle of the lookup, the Ownership bit is examined. If the bit is set, then the ASID is compared with the ASID of the current context (CR3 register value), indicating that the entry is exclusively used by a context. Otherwise, if the ownership bit is clear, the CCID field of the entry is compared with the CCID value of the current context. In either case, a hit is indicated, if both the virtual addresses and the ASID / CCID field match.

The paper also considers the case where a majority of the processes with the same CCID share the same translation entry, but a small set of processes, with the same CCID, may have their private entries for the same virtual address. This scenario will occur if a small subset of the processes write to the shared page, and create their own private translations due to CoW. To support this, BabelFish allows a few “exceptions” per entry such that those exceptions will use their private entry during address translation. The exceptions are marked using a 32-bit “Private Copy (PC)” bit mask per TLB entry, and each bit represents whether a process has a private translation. Mapping from processes to bit mask locations are tracked by the process context, i.e., by a context register. The bit offset is sent to the TLB lookup logic on lookups. During lookup, if a shared entry is hit (Ownership bit is clear), but the PC bit for the process is set, then the lookup logic still uses the ASID for comparison.

The OS is responsible for assigns processes in the same CCID to bit indices in the PC mask, and once the assignment is made, it is not to be changed in the remaining lifetime of the process (which are typically short-lived, as serverless functions are small). At most 32 processes can have exceptions on any of the page they map. On context switches, the OS sets the context register storing the index into the PC mask of the process being scheduled in (if there is none, then the process does not belong to the exceptions, in which case the bit index is not used). In order to track the PC bit mask for each page, the OS maintains a separate metadata structure, called MaskPage. This structure contains of bit masks of all pages ever mapped by the CCID, and will be walked by the MMU on a TLB miss to fetch the bits into the TLB entry.

To avoid accessing the bit mask on every TLB lookup, and fetching the PC bit mask from MaskPage on evert TLB miss, the paper also proposes adding an additional ORPC bit to both the TLB entry and the PTE. The ORPC is simply a bitwise OR of all 32 bits in the PC mask, and it servers as a shortcut telling the TLB hardware that no exception exists on this address. The ORPC bit is set by the OS in the PTE when the PC mask is updated, and fetched into the TLB with regular page walks. The lookup logic uses the ORPC bit to decide whether the PC mask field should be accessed on lookups to reduce unnecessary read activity (e.g., the PC mask can be stored in a separate physical bank, and only accessed when ORPC bit is one). Page walkers will also not walk the MaskPage structure, if the ORPC bit is found to be zero during a page walk.

On a page CoW, the normal procedure is followed, i.e., the OS allocates a new physical page, copies the content of the shared page, and updates (or creates, if the entry does not exist yet) the page table entries of both processes. With BabelFish, one addition step is needed to assign the process an index into the PC mask, if not yet, and then update the PC mask in MaskPage. A TLB shootdown is then conducted to invalidate the out-of-dated cached TLB entry, which is also mandatory without BabelFish.

BabelFish also requires processes in the same CCID that share a translation entry to also share the page table. BabelFish allows multiple middle-level page table entries (must be of the same type) to point to the same lower-level part, essentially sharing these tables across the processes owning the upper-level tables. The shared PTEs should have their Ownership bit cleared, and ORPC bit maintained based on the value of the PC mask of the page. Reference counters should also be maintained such that the shared entries can only be recollected after all the sharer processes have exited.

The paper also noted that, by sharing lower-level page table entries, BabelFish also reduces the number of minor page misses, which are caused by the page table entries not being populated after a fork(). As long as one of the processes sharing the entries have populated it, no minor page faults on the same address will happen for future accesses from other processes.