Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB



Paper Title: Rethinking TLB Designs in Virtualized Environments: A Very Large Part-of-Memory TLB
Link: https://ieeexplore.ieee.org/document/8192494
Year: ISCA 2017
Keyword: TLB; POM-TLB;




Highlights:

  1. Simple and intuitive design, especially the 4-way set-associative cache-like structure.

Questions

  1. The paper did not explain how dirty/accessed bits are maintained across the different levels of the hierarchy, including the L2/LLC copies and the in-memory POM-TLB. For example, if a write access bypasses the cache and goes directly to the POM-TLB, how would the dirty bit be kept consistent? This is non-trivial, since the dirty bit dictates whether the page must be written back to disk when it is swapped out, and hence affects correctness.

  2. The paper did not mention how the bypass predictor's state is updated. Whether an entry exists in the L2/LLC cannot be known without probing the cache, which is exactly what a "not in the cache" prediction avoids. I can only imagine there being a bit in the TLB entry to indicate whether it has been cached, or the MMU probing the cache in the background to verify the existence of such entries.

This paper proposes the Part-of-Memory TLB (POM-TLB), a novel design that adds an L3 TLB to the existing address translation hierarchy. The paper sets its context under virtualization, in which address translation can be a major bottleneck due to 2-D page table walks and frequent context switches between VMs. To address this problem, previous research has proposed several solutions, including adding a larger L2 TLB to increase address coverage, using multiple hardware page walkers to serve requests from different cores concurrently (processors stall on a TLB miss), and adding a cache dedicated to the intermediate entries of the page table walk. These solutions, however, either add non-negligible hardware overhead and verification cost, or pose new challenges and trade-offs. For example, a larger L2 TLB is expected to cache more entries at the same time. This, however, does not necessarily imply better performance, since a larger L2 TLB takes longer to access, and this access is on the critical path of address translation.

This paper takes a different approach by adding an L3 TLB in DRAM to reduce the frequency of page walks. This is especially beneficial under virtualization, since a 2-D page walk can incur up to 24 memory accesses (20 to the host page table and 4 to the guest page table). Reducing the number of page walks therefore saves many unnecessary memory accesses, decreasing both the bandwidth requirement and the latency.
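As a sanity check on these numbers, assuming 4-level guest and host page tables (as on x86-64), the worst-case access count of a 2-D walk works out as follows:

```latex
% Each of the 4 guest page-table levels is addressed by a guest-physical
% address that needs its own 4-level host walk, plus 1 access to read the
% guest entry itself; the final guest-physical address of the data page
% needs one more 4-level host walk.
\underbrace{4 \times (4 + 1)}_{\text{4 guest levels, each nested in a host walk}}
\;+\; \underbrace{4}_{\text{host walk of the final guest PA}} \;=\; 24
```

This matches the split quoted above: 4 accesses to the guest page table and 4 x 4 + 4 = 20 accesses to the host page table.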

The L3 TLB is organized as follows. A chunk of memory is statically allocated from the physical address space in DRAM for storing translation entries. Instead of caching intermediate results of the page table walk as in MMU cache proposals, the POM-TLB only stores TLB entries in the same format as the hardware TLBs. A typical TLB entry consists of the VA, the PA, attribute bits (permissions), and other metadata (such as the ASID and the virtual machine ID). The paper assumes a 16-byte TLB entry, which means that a single 64-byte cache line holds four entries. The POM-TLB is organized as a 4-way set-associative cache and also operates like one. The lower bits of the virtual page number are used as an index to select a set; the 64-byte set is then read from DRAM and associatively searched using the higher bits to find a matching entry. If a matching entry is found, the result is returned to the higher-level TLBs. If the POM-TLB misses, a page walk is initiated as usual. The physical base address of the POM-TLB can be programmed and configured as part of the bootstrapping process. The MMU uses the base address plus the offset of the set (64B * set index) to generate the address of the TLB entries. The paper chose a 4-way set-associative organization because most DRAMs have a burst size of 64 bytes, exactly the size of four entries; this way, only one DRAM read command is needed to access a single set.
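A minimal sketch of the address generation and in-set search is shown below. The paper does not specify the entry layout or the exact tag/index bit split at this level of detail, so the struct fields, set count, and index bits here are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define POM_TLB_WAYS  4           /* four 16-byte entries per 64-byte set      */
#define POM_TLB_SETS  (1 << 20)   /* illustrative; the real size is configured */
#define PAGE_SHIFT    12          /* 4KB small pages                           */

/* Illustrative 16-byte entry layout; actual field widths are an assumption. */
typedef struct {
    uint64_t vpn_tag;   /* upper VPN bits plus ASID/VMID, packed as the tag    */
    uint64_t ppn_attr;  /* physical page number plus permission/attribute bits */
} pom_tlb_entry_t;

/* Base physical address of the POM-TLB region, programmed at boot time. */
static uint64_t pom_tlb_base;

/* Compute the physical address of the 64-byte set that may hold this VA. */
static uint64_t pom_tlb_set_addr(uint64_t va)
{
    uint64_t vpn       = va >> PAGE_SHIFT;
    uint64_t set_index = vpn & (POM_TLB_SETS - 1);  /* low VPN bits pick the set */
    return pom_tlb_base + set_index * 64;           /* base + 64B * set index    */
}

/* Associatively search the four entries of one set for a matching tag. */
static bool pom_tlb_search_set(const pom_tlb_entry_t set[POM_TLB_WAYS],
                               uint64_t tag, uint64_t *ppn_attr_out)
{
    for (int way = 0; way < POM_TLB_WAYS; way++) {
        if (set[way].vpn_tag == tag) {   /* high VPN bits (+ IDs) must match */
            *ppn_attr_out = set[way].ppn_attr;
            return true;                 /* hit: return translation upward   */
        }
    }
    return false;                        /* miss: fall back to a page walk   */
}
```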

To reduce the access latency of POM-TLB entries, the MMU injects the read request into the L2 cache every time it reads the POM-TLB. The L2 cache may then keep the TLB set as regular data, which allows lower-latency re-access of the same entry as long as it has not been evicted. Most L2 caches nowadays are physically tagged, so the MMU can simply use the generated physical address as the request address.
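A simplified view of this fetch path is sketched below; the cache- and DRAM-probe functions are hypothetical interfaces standing in for hardware behavior, not real APIs.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical interfaces modeling the memory hierarchy. */
extern bool cache_probe_line(uint64_t paddr, uint8_t line[64]); /* L2, then LLC */
extern void dram_read_line(uint64_t paddr, uint8_t line[64]);   /* one 64B burst */

/* Fetch one 64-byte POM-TLB set by its physical address. */
static void fetch_pom_tlb_set(uint64_t set_paddr, uint8_t set[64])
{
    /* The set read is a regular physically-addressed data access, so it can
     * hit in L2/LLC if the same set was brought in recently.                 */
    if (cache_probe_line(set_paddr, set))
        return;
    /* Otherwise read the set from DRAM; the line may be installed into L2 so
     * that subsequent lookups of the same set are served at cache latency.   */
    dram_read_line(set_paddr, set);
}
```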

One problem with the baseline POM-TLB design presented above is the handling of different page sizes, i.e., huge pages. One possible approach is to allocate two non-overlapping memory regions from DRAM, one for regular 4KB pages and another for 1GB huge pages. Without any additional mechanism, every translation request must probe the cache and the POM-TLB with both possibilities, since the indexing bits of the two page sizes differ for the same VA. Traditionally this could be done either in parallel, which does not lengthen the critical path but increases L2 access traffic, or serially, at the cost of longer translation latency. The paper proposes adding a hardware predictor to solve this issue, sketched below. The predictor has 512 single-bit entries, and 9 bits of the VA to be translated are used as the index into it. If the bit indicates a small page (bit value 0), the address of the TLB entry is generated using the base address of the small-page TLB; otherwise the MMU uses the huge-page TLB base address. Note that if an entry of the predicted size cannot be found in the L2, the LLC, or the POM-TLB of that size, the cache is not re-probed using the other size's address. Instead, the page walker is invoked; the page walk identifies the correct page size and sends feedback to the predictor to update its entry. This does not affect correctness, since the correct information can always be found in the page table, and the TLB is just a cache of its entries. The paper reports a >95% success rate for the page size predictor.
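A minimal sketch of the page size predictor follows. The paper does not pin down exactly which 9 VA bits index the predictor, so the index function below (low-order VPN bits) is an assumption, as are the variable names.

```c
#include <stdint.h>
#include <stdbool.h>

#define SIZE_PRED_ENTRIES 512                 /* 512 single-bit entries        */
#define SIZE_PRED_MASK    (SIZE_PRED_ENTRIES - 1)

static uint8_t size_pred[SIZE_PRED_ENTRIES];  /* 0 = small page, 1 = huge page */

/* Base addresses of the two POM-TLB regions, programmed at boot time. */
static uint64_t pom_tlb_small_base;
static uint64_t pom_tlb_huge_base;

/* Which 9 VA bits index the predictor is an assumption (low VPN bits here). */
static unsigned size_pred_index(uint64_t va)
{
    return (uint32_t)(va >> 12) & SIZE_PRED_MASK;
}

/* Pick the POM-TLB base address for this VA based on the predicted size. */
static uint64_t predict_pom_tlb_base(uint64_t va)
{
    return size_pred[size_pred_index(va)] ? pom_tlb_huge_base
                                          : pom_tlb_small_base;
}

/* On a POM-TLB miss the page walker learns the true page size and feeds it
 * back; a misprediction only costs the extra walk, never correctness.        */
static void size_pred_update(uint64_t va, bool is_huge_page)
{
    size_pred[size_pred_index(va)] = is_huge_page ? 1 : 0;
}
```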

In addition to the page size predictor, the paper also proposes a cache bypass predictor. The motivation is that sometimes the frequency of TLB-set accesses in the L2 is far lower than the frequency of data accesses; in that case only a few (or even no) TLB entries will be cached, and probing the L2/LLC with the generated address is simply a waste of time. To recognize this pathological case when it happens, another 512-entry, single-bit predictor is added to decide whether an access should bypass the cache hierarchy and go directly to the POM-TLB. Unlike the page size predictor, which is indexed with bits of the VA to be translated, the bypass predictor is indexed with 9 bits of the generated TLB entry address, which is only available after the page size prediction (the predicted size is needed to choose the POM-TLB base address).
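Since the paper does not spell out how the bypass predictor is trained (see Question 2 above), the sketch below assumes a simple scheme: index with 9 bits of the TLB set's physical address, and update only when the caches were actually probed.

```c
#include <stdint.h>
#include <stdbool.h>

#define BYPASS_PRED_ENTRIES 512
#define BYPASS_PRED_MASK    (BYPASS_PRED_ENTRIES - 1)

static uint8_t bypass_pred[BYPASS_PRED_ENTRIES]; /* 1 = skip L2/LLC, go to DRAM */

/* Indexed with 9 bits of the generated TLB-set physical address; which bits
 * exactly is an assumption (the 64B line-offset bits are skipped here).      */
static unsigned bypass_pred_index(uint64_t set_paddr)
{
    return (uint32_t)(set_paddr >> 6) & BYPASS_PRED_MASK;
}

/* Decide whether to send this POM-TLB lookup straight to DRAM. */
static bool should_bypass_caches(uint64_t set_paddr)
{
    return bypass_pred[bypass_pred_index(set_paddr)] != 0;
}

/* Hypothetical training rule, since the paper leaves this open: when the
 * caches were actually probed, remember whether the probe hit. Correcting a
 * stale "bypass" bit would need an extra mechanism (e.g., occasional forced
 * probes or a cached bit in the TLB entry), which is omitted here.           */
static void bypass_pred_update(uint64_t set_paddr, bool probed, bool hit_in_cache)
{
    if (probed)
        bypass_pred[bypass_pred_index(set_paddr)] = hit_in_cache ? 0 : 1;
}
```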