ATTC (@C): Addressable-TLB based Translation Coherence



Paper Title: ATTC (@C): Addressable-TLB based Translation Coherence
Link: https://dl.acm.org/doi/10.1145/3410463.3414653
Year: PACT 2020
Keyword: ATTC; TLB Shootdown; Virtualization




Highlight:

  1. TLB shootdown has become more complicated in recent architectures, because (1) the 2D nested page table walk design introduces more shootdown scenarios, e.g., both the host and the guest can update the mapping; (2) existing software solutions do not support fine-grained, precise tracking of the paging history; and (3) in hybrid NVM-DRAM systems, frequent page migration between the two media aggravates the situation, as every page migration operation requires a TLB shootdown.

  2. We can address the TLB coherence problem in hardware by adding a level-3 TLB, the ATLB, in main memory. TLB coherence messages can then be sent as regular cache coherence messages keyed by the memory address of the ATLB entry. In other words, the in-memory TLB structure becomes the point of coherence: every update to a translation entry (i.e., to the page table) is also reflected in the in-memory TLB as a memory update. The memory update is then captured by regular cache coherence and delivered to every sharer core that caches the same translation entry, which triggers TLB invalidation.

  3. Cache coherence only operates on physical addresses. However, we can design the ATLB such that the physical address of an entry encodes both (partial) virtual address and VMID information; a minimal sketch of one possible encoding follows this list.
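
To make highlight 3 concrete, below is a minimal C sketch of one possible encoding, assuming an ATLB whose set index concatenates the VMID with low-order gVA page-number bits. The base address, field widths, and block size are all illustrative assumptions, not parameters from the paper.

```c
#include <stdint.h>

/* Illustrative geometry: 4-bit VMID + 12 low gVA bits form a 16-bit set
 * index, and each set occupies exactly one 64 B cache block, so one
 * coherence message maps to one ATLB set. All constants are assumed. */
#define ATLB_BASE      0x100000000ULL   /* assumed physical base of the ATLB */
#define ATLB_VMID_BITS 4
#define ATLB_GVA_BITS  12
#define ATLB_BLOCK     64ULL            /* one set per cache block */

/* Forward: (VMID, gVA) -> physical address of the ATLB set. Because the
 * index is a plain concatenation, the physical address itself encodes
 * both disambiguation fields. */
static inline uint64_t atlb_set_pa(uint16_t vmid, uint64_t gva)
{
    uint64_t vpn = gva >> 12;           /* 4 KiB pages assumed */
    uint64_t idx = ((uint64_t)(vmid & ((1u << ATLB_VMID_BITS) - 1))
                        << ATLB_GVA_BITS)
                 | (vpn & ((1ULL << ATLB_GVA_BITS) - 1));
    return ATLB_BASE + idx * ATLB_BLOCK;
}

/* Inverse: physical address of an ATLB block -> (VMID, partial gVA).
 * This is what a cache controller can recover from an invalidation
 * message without any extra state. */
static inline void atlb_decode_pa(uint64_t pa, uint16_t *vmid,
                                  uint64_t *partial_vpn)
{
    uint64_t idx = (pa - ATLB_BASE) / ATLB_BLOCK;
    *partial_vpn = idx & ((1ULL << ATLB_GVA_BITS) - 1);
    *vmid        = (uint16_t)(idx >> ATLB_GVA_BITS);
}
```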

Comments:

  1. Modern TLB entries have two disambiguation fields. One is the VMID, which distinguishes between VMs and the VMM; this field the paper discusses. The other is the PCID, which distinguishes between processes within the same VM instance or within the VMM/host OS; this field the paper neglects. How does the PCID field affect the design of ATTC?

  2. The paper does not discuss how the ATLB is populated. Is it populated automatically by the MMU page walker upon the completion of page walks? Or is it populated by the guest OS? It seems to me that the paper adds the ATLB as a level-3 in-memory TLB; if this impression is correct, then the ATLB should be populated by the 2D page table walker.

  3. If the cache block containing the ATLB entry is evicted from the private hierarchy (or from the LLC, depending on whether the coherence protocol allows silent L2 eviction), then cache coherence will no longer be able to reach the private hierarchy that potentially holds the TLB entry. How does ATTC deal with this scenario? One possible solution is to mark cache blocks containing ATLB entries in the L2 cache and invalidate the corresponding TLB entries when a marked block is evicted. However, this approach would make the L2 essentially inclusive of the TLB, which may incur unexpected performance issues, and it introduces extra tagging per cache block.

  4. If an entry is evicted from the ATLB as part of the MMU’s page walk process, does the MMU also evict all locally cached TLB entries for correctness? The paper should have added a discussion of this scenario.

This paper proposes ATTC, a hardware TLB coherence scheme that eliminates the overhead of software TLB shootdown in virtualized environments. ATTC is built upon cache coherence, reusing the existing cache coherence network to perform TLB coherence in hardware. Compared with prior schemes that also rely on cache coherence, ATTC handles virtualized environments better and incurs only minimal hardware additions to the TLB structure.

The paper is motivated by the inefficiency of existing software solutions for maintaining TLB coherence. Conventionally, when the OS kernel updates a page table entry in a way that mandates a TLB coherence action, the kernel must explicitly notify the other cores that can potentially cache the same translation entry in their local TLBs, and wait for their responses to complete the TLB shootdown. This process is implemented in pure software using inter-processor interrupts (IPIs) and is blocking, which incurs non-negligible overhead (a simplified sketch of the conventional flow follows this paragraph). The problem is further exacerbated by the nested page tables introduced for virtualization support. In particular, with nested page tables, the TLB shootdown mechanism must deal with two radically different scenarios. The first scenario is when the guest OS updates the guest page table. In this case, the guest OS tracks the virtual cores on which the translation entry can potentially be cached, and issues IPIs to those cores. The second scenario is when the host OS (or hypervisor) updates the translation from guest physical addresses (gPA) to host physical addresses (hPA). In this case, the set of cores that can potentially cache the translation is unclear to the host OS: the host OS only tracks the set of virtual cores allocated to a particular VM, not the cores on which each individual process running in the VM instance is scheduled. Besides, the host OS cannot determine the gVA that corresponds to the gPA being updated, as doing so requires an inverse mapping of the inner (guest) page table. As a result, the host OS is unable to determine the virtual address to be invalidated (which must be a gVA, because the TLB is indexed by gVA). Current OS implementations address this issue by flushing the entire TLB on all virtual cores allocated to the VM instance.
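
For concreteness, here is a simplified, self-contained C sketch of the conventional IPI-based shootdown described above. All function and type names are illustrative assumptions; real kernel implementations are considerably more involved.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_CPUS 64

/* A simplified shootdown request; real kernels batch ranges and use
 * per-CPU data structures. */
struct shootdown_request {
    uint64_t start, end;   /* virtual address range to invalidate */
    int      pending_acks; /* cores that have not yet responded   */
};

/* Assumed platform hooks (illustrative, not real APIs). */
void send_ipi(int cpu, struct shootdown_request *req);
void local_tlb_invalidate(uint64_t start, uint64_t end);
bool cpu_may_cache_translation(int cpu, uint64_t start);

/* Initiator side: runs after the page table entry has been updated.
 * The initiator blocks until every targeted core acknowledges; this
 * synchronous wait is the main source of shootdown overhead. */
void tlb_shootdown(uint64_t start, uint64_t end)
{
    struct shootdown_request req = { start, end, 0 };

    for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
        if (cpu_may_cache_translation(cpu, start)) {
            __atomic_fetch_add(&req.pending_acks, 1, __ATOMIC_RELAXED);
            send_ipi(cpu, &req);
        }
    }
    while (__atomic_load_n(&req.pending_acks, __ATOMIC_ACQUIRE) > 0)
        ; /* spin until all remote cores have flushed */
}

/* Remote side: the IPI handler invalidates its local TLB entries for
 * the range and then acknowledges the initiator. */
void shootdown_ipi_handler(struct shootdown_request *req)
{
    local_tlb_invalidate(req->start, req->end);
    __atomic_fetch_sub(&req->pending_acks, 1, __ATOMIC_RELEASE);
}
```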

The paper also points out that while prior works appear to substantially reduce the overhead of software TLB shootdowns, these mechanisms have been largely rendered less effective by nested page tables: prior works mostly assume a single-level translation between VA and PA, and are generally difficult to extend to 2D page tables. Furthermore, prior works tend to extend TLB entries with more metadata bits, which results in higher hardware cost and more design complications.

Lastly, as byte-addressable NVM is gradually adopted as main memory as a replacement for DRAM, the paper observes that on future systems, pages will need to be migrated constantly between NVM and DRAM. Page migration incurs more frequent TLB shootdowns, which makes an efficient solution all the more valuable.

Based on the above observations, the paper proposes ATTC, which we describe as follows. In ATTC, a level-3 TLB, called the ATLB, is added to main memory. The ATLB is organized as a set-associative main memory cache, but it stores translation entries from gVA to hPA produced by 2D page table walks. Although the paper does not specify the optimal parameters for the ATLB, it is suggested that the ATLB should be at least a few MBs in size and have more sets than the L2 TLB. Each ATLB entry stores a gVA to hPA mapping, just like a regular hardware TLB entry. When a TLB lookup misses the local hierarchy, the MMU further performs a lookup in the ATLB. The ATLB is indexed using both the VMID and the gVA, such that translation entries of different VM instances can coexist peacefully in the ATLB (a sketch of one possible organization follows).
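
The following C sketch shows one possible ATLB organization and the MMU-side lookup, under the same assumed index scheme as the earlier encoding sketch. The entry layout, sizes, and field widths are illustrative assumptions, not parameters from the paper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative ATLB organization (sizes are assumptions): 2^16 sets of
 * 4 ways, where one set occupies a single cache block in main memory. */
#define ATLB_SETS (1u << 16)
#define ATLB_WAYS 4

/* One entry caches a complete gVA -> hPA translation produced by a 2D
 * (nested) page walk, tagged with the VMID and the gVA bits that are
 * not implied by the set index. */
struct atlb_entry {
    uint64_t gva_tag;   /* high-order gVA page-number bits */
    uint16_t vmid;      /* VM identifier */
    uint64_t hppn;      /* host physical page number */
    bool     valid;
};

struct atlb_entry atlb[ATLB_SETS][ATLB_WAYS];   /* lives in main memory */

/* Lookup performed by the MMU after a miss in the on-core TLBs. The set
 * index concatenates the VMID with low gVA bits (as sketched earlier),
 * so entries of different VM instances land in disjoint sets. */
bool atlb_lookup(uint16_t vmid, uint64_t gva, uint64_t *hppn_out)
{
    uint64_t vpn = gva >> 12;                    /* 4 KiB pages assumed */
    uint64_t set = (((uint64_t)vmid << 12) | (vpn & 0xFFF)) % ATLB_SETS;

    for (int way = 0; way < ATLB_WAYS; way++) {
        struct atlb_entry *e = &atlb[set][way];
        if (e->valid && e->vmid == vmid && e->gva_tag == vpn >> 12) {
            *hppn_out = e->hppn;   /* hit: fill the on-core TLBs */
            return true;
        }
    }
    return false;                  /* miss: fall back to a full 2D walk */
}
```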

The ATTC design performs hardware TLB shootdown using regular cache coherence and leverages the in-memory ATLB as the point of coherence. More specifically, when a core performs a page walk and brings the translation entry into its local TLB hierarchy, the core also reads the ATLB and brings the corresponding cache block into its local cache hierarchy (as a natural result of the ATLB lookup). As long as the cache block stays in the private hierarchy, the core that caches the translation entry in its local TLBs is bound to receive cache coherence messages when the ATLB entry containing the translation is modified by another core. Based on this observation, the TLB shootdown process of the guest OS is implemented as follows. First, when the guest OS updates a page table entry, it also issues a write to update the corresponding ATLB entry. This write operation is intercepted by the MMU (via a simple address check on whether the write lies within the ATLB’s address range), and the MMU performs sanity checks to ensure that the core has permission to update the ATLB entry. When the write operation is handled by regular cache coherence, all other cores that have read the same entry earlier receive a cache invalidation message containing the physical address of the ATLB entry. On receiving the message, the local cache controller checks whether the physical address lies within the ATLB’s address range, and if so, the cache controller computes the set index from the entry’s physical address and the starting address of the ATLB. The set index of the ATLB is designed such that it encodes both the partial gVA of the entry and the VMID. After computing the set index, the cache controller issues invalidation messages to the TLB using the partial gVA and the VMID (a sketch of this receive-side handling follows). False matches can arise as a result of using only partial gVAs to perform the tag match; however, the paper notes that this case is very rare and does not affect performance.
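
Below is a minimal sketch of the receive-side handling at a private cache controller, under the same assumed ATLB layout as before (base address, 4 MiB size, 64 B blocks, and a VMID-plus-low-gVA-bits index). The hook name tlb_invalidate_partial is an assumption, standing in for the controller-to-TLB invalidation path.

```c
#include <stdint.h>

/* Assumed ATLB layout (illustrative constants, as in earlier sketches). */
#define ATLB_BASE  0x100000000ULL
#define ATLB_SIZE  (4ULL << 20)     /* assumed 4 MiB ATLB */
#define ATLB_BLOCK 64ULL
#define GVA_BITS   12

/* Assumed hardware hook: probe the local TLBs with a partial gVA and a
 * VMID, invalidating every entry whose low VPN bits and VMID match.
 * False matches on the partial gVA cause benign extra invalidations. */
void tlb_invalidate_partial(uint16_t vmid, uint64_t partial_vpn);

/* Called by the private cache controller for every incoming coherence
 * invalidation; pa is the physical address carried by the message. */
void on_coherence_invalidation(uint64_t pa)
{
    if (pa < ATLB_BASE || pa >= ATLB_BASE + ATLB_SIZE)
        return;                     /* ordinary data block: no TLB action */

    /* Invert the index encoding to recover the disambiguation fields. */
    uint64_t idx         = (pa - ATLB_BASE) / ATLB_BLOCK;
    uint64_t partial_vpn = idx & ((1ULL << GVA_BITS) - 1);
    uint16_t vmid        = (uint16_t)(idx >> GVA_BITS);

    tlb_invalidate_partial(vmid, partial_vpn);
}
```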

When the host OS updates the gPA to hPA mapping, however, the ATLB cannot be directly used to perform invalidation, since the ATLB is indexed by the gVA. To fill the gap, ATTC also introduces an inverse mapping table, named INVTBL, which maps an hPA to the ATLB set index of the gVA that translates to it. The INVTBL is also a set-associative structure stored in main memory. The INVTBL is populated by the page table walker after a 2D page walk completes, and it is maintained inclusive of the ATLB, i.e., when an INVTBL entry is invalidated, the corresponding ATLB entry is also invalidated (using the set index stored in the INVTBL entry). When the host OS updates the outer page table, it performs a dummy write operation to the INVTBL entry after calculating the entry’s address from the old hPA. The dummy write operation is intercepted by the hardware MMU and redirected to the corresponding ATLB entry, whose set index is stored in the INVTBL entry. The write to the ATLB entry triggers cache coherence messages in the same manner as in the previous case, which eventually invalidates all local TLB entries affected by the outer page table update (a sketch of this indirection follows).
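
To illustrate the indirection, here is a hedged C sketch of what happens on a host-side update; the entry layout, table sizes, and the atlb_touch_set hook are assumptions based on the description above, not details from the paper.

```c
#include <stdbool.h>
#include <stdint.h>

/* One INVTBL entry maps a host physical page back to the ATLB set that
 * holds the translation for its gVA. Layout is an assumption. */
struct invtbl_entry {
    uint64_t hppn;        /* tag: host physical page number          */
    uint64_t atlb_set;    /* set index of the ATLB entry for its gVA */
    bool     valid;
};

#define INVTBL_SETS (1u << 16)
#define INVTBL_WAYS 4

struct invtbl_entry invtbl[INVTBL_SETS][INVTBL_WAYS]; /* in main memory */

/* Assumed hook: perform the (dummy) write to the given ATLB set, which
 * regular cache coherence turns into invalidation messages at every
 * sharer core, as in the guest-side case. */
void atlb_touch_set(uint64_t atlb_set);

/* When the host OS remaps host page 'hppn', it cannot name the gVA
 * directly; instead the INVTBL entry for the old hPA is consulted and
 * the write is redirected to the recorded ATLB set. */
void host_invalidate(uint64_t hppn)
{
    uint64_t set = hppn % INVTBL_SETS;
    for (int way = 0; way < INVTBL_WAYS; way++) {
        struct invtbl_entry *e = &invtbl[set][way];
        if (e->valid && e->hppn == hppn) {
            atlb_touch_set(e->atlb_set); /* triggers the TLB shootdown  */
            e->valid = false;            /* INVTBL stays inclusive of   */
        }                                /* the ATLB: both are dropped  */
    }
}
```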