CSALT: Context Switch Aware Large TLB
Link: https://ieeexplore.ieee.org/document/8686492
Year: MICRO 2018
Keyword: TLB; LRU; Cache Replacement
This paper proposes CSALT, a cache partitioning scheme that lets entries of a unified L3 TLB share the lower-level data caches with regular data. The paper observes that on modern virtualized platforms, running multiple virtual machines can overburden the TLB with excessive misses, resulting in a massive number of page walks. A page walk over today's two-dimensional (nested) page tables can take, in the worst case, 24 memory accesses: the four guest page table entries plus the final guest physical address each require a 4-level host page table walk (20 accesses), and the four guest page table entries themselves must also be read (4 accesses). This latency can easily become a performance bottleneck on a saturated system.
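To make the arithmetic above concrete, here is a minimal sketch (with level counts assumed to match x86-64 4-level paging) that reproduces the worst-case access count of a nested walk:

```cpp
#include <cstdio>

// Worst-case memory accesses of a nested (2-D) page walk: each of the
// guest_levels guest page table entries, plus the final guest physical
// address, needs a full host walk; the guest entries themselves add
// guest_levels more accesses.
int nested_walk_accesses(int guest_levels, int host_levels) {
    int host_accesses  = (guest_levels + 1) * host_levels; // 5 * 4 = 20
    int guest_accesses = guest_levels;                     // 4
    return host_accesses + guest_accesses;                 // 24
}

int main() {
    std::printf("%d\n", nested_walk_accesses(4, 4)); // prints 24
    return 0;
}
```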
Previous solutions focus on reducing the number of page walks required for a translation and on reducing the latency of each page walk access. To avoid page walks, which require one memory access per level of the page table (typically four), prior work proposes an in-memory L3 TLB that stores translation entries just as a hardware TLB does. This L3 TLB can be as large as 16MB, a capacity that exceeds any hardware implementation. Other work proposes caching intermediate page table entries in the hardware data caches, so that accessing an entry can be as fast as a regular cache access. Using part of the data cache to store TLB entries, however, can hurt performance, since TLB misses may pollute the cache with extra TLB entries. As we will see later, CSALT addresses this cache pollution problem by partitioning the cache into two parts, dedicated to data and TLB entries respectively.
In this paper, we assume that TLB entries are cached by the L2 and the LLC, sharing the same storage as data. The paper also assumes an underlying L3 TLB in main memory, but claims this is not strictly necessary: the cache partitioning scheme works with any design that caches page table entries in the data cache. Only final translation entries (i.e., from VA to PA) are stored in the cache; intermediate page table entries may instead be cached by a dedicated cache inside the page walker.
To solve the cache pollution problem when data and TLB entries share the same cache, CSALT partitions the cache into two parts, one for data and the other for TLB entries. The partitioning parameter N dictates that, in a W-way set-associative cache, ways 0 to N - 1 are dedicated to data, while ways N to W - 1 are dedicated to TLB entries. When a memory access from upper levels misses the cache, the newly inserted block may only be allocated from the data partition if the request is a regular load or store, or only from the TLB partition if the request targets the memory region reserved for the L3 TLB (i.e., it is a hardware-generated request fetching an L3 TLB entry). When a block must be evicted, the cache controller prioritizes evicting blocks that sit in the wrong partition, i.e., a data block in the TLB partition or a TLB block in the data partition. In either case, the distribution of blocks converges to a state where all blocks in the data partition are data blocks, and likewise for the TLB partition. We will see in later sections that a block can end up in the wrong partition when the parameter N is adjusted.
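A minimal sketch of this victim-selection policy is shown below; the block layout, the `is_tlb` flag, and the LRU age field are assumptions for illustration, not the paper's hardware design.

```cpp
#include <vector>
#include <cstdint>

struct Block {
    bool valid   = false;
    bool is_tlb  = false;   // true if the block holds L3 TLB entries
    uint64_t age = 0;       // larger = less recently used
};

// Pick a victim way in one set of a W-way cache. Ways [0, N) form the data
// partition, ways [N, W) the TLB partition. Blocks sitting in the "wrong"
// partition are evicted first; among equals, the LRU block is chosen.
int pick_victim(const std::vector<Block>& set, int N, bool fill_is_tlb) {
    int lo = fill_is_tlb ? N : 0;
    int hi = fill_is_tlb ? (int)set.size() : N;

    int victim = -1;
    bool victim_wrong = false;
    for (int w = lo; w < hi; ++w) {
        if (!set[w].valid) return w;                 // invalid way: use it directly
        bool wrong = (set[w].is_tlb != fill_is_tlb); // block of the other type
        if (victim < 0 ||
            (wrong && !victim_wrong) ||
            (wrong == victim_wrong && set[w].age > set[victim].age)) {
            victim = w;
            victim_wrong = wrong;
        }
    }
    return victim;
}
```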
CSALT does not rely on a statically determined N. Instead, the parameter N is determined dynamically based on profiling results. The profiling process compares the performance of the current partitioning against potentially better partitioning parameters N’ within a time window called an epoch. At the end of the epoch, if some N’ is indeed better, the parameter N is updated to N’, and the hardware converges to the new partitioning. The profiling is based on the concept of an LRU stack. The LRU stack is an ordered list in which a more recently accessed block is ordered before a less recently accessed block: the most recently used block is at the head of the list, and the least recently used block is at the tail (when LRU evicts a block from the cache, the block at the tail is selected). Each way in a set occupies one position in the LRU stack, so the stack size equals the number of ways. Each position in the LRU stack has a counter recording the number of hits observed at that position: when a cache access hits a block, we increment the counter of the position the block occupies at the time of the hit, before the block is promoted to the head of the stack. The meaning of these counters is that, for a cache managed by LRU, the sum of the counters at positions 0 to N - 1 is exactly the number of cache hits we would observe on the same access trace if ways N to W - 1 were “disabled”. In other words, the LRU stack lets us estimate the number of cache hits we would retain if we dedicated part of the ways to other purposes.
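Below is a minimal sketch of one set's LRU-stack profiling as described above, with per-position hit counters in the style of utility-based partitioning; all names are assumed for illustration.

```cpp
#include <vector>
#include <cstdint>
#include <algorithm>

// One set's LRU stack with per-position hit counters.
// order[0] is the MRU way, order[W-1] the LRU way.
struct LruStack {
    std::vector<int>      order;  // order[i] = way currently at LRU position i
    std::vector<uint64_t> hits;   // hits[i]  = hits recorded at position i

    explicit LruStack(int ways) : order(ways), hits(ways, 0) {
        for (int i = 0; i < ways; ++i) order[i] = i;
    }

    // On a cache hit to `way`: record a hit at the position the block occupies
    // right now, then promote the block to the MRU position.
    void on_hit(int way) {
        auto it = std::find(order.begin(), order.end(), way);
        hits[it - order.begin()]++;
        std::rotate(order.begin(), it, it + 1);
    }

    // Estimated hits the same trace would produce with only the first n ways:
    // the sum of the counters at positions 0 .. n-1.
    uint64_t hits_with_ways(int n) const {
        uint64_t sum = 0;
        for (int i = 0; i < n && i < (int)hits.size(); ++i) sum += hits[i];
        return sum;
    }
};
```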
CSALT works as follows. Within each epoch, two LRU stacks are maintained, one for the data partition and one for the TLB partition. At system startup, N is set to half the total number of ways (i.e., the cache is divided evenly). On every cache hit, the counter at the corresponding LRU position of that partition’s stack is incremented. At the end of the epoch, the hardware evaluates a target function for all candidate values of N (from 0 to W - 1, where W is the number of ways). The target function is the total number of hits that would be retained if the parameter were set to N: the sum of the counters at LRU positions 0 to N - 1 in the data partition’s stack, plus the sum of the counters at LRU positions 0 to W - N - 1 in the TLB partition’s stack. The hardware then selects the N that maximizes the target function and uses it as the N for the next epoch.
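The end-of-epoch selection can be sketched as below, reusing the hypothetical `LruStack` from the previous sketch; the candidate range follows the description above.

```cpp
// Evaluate the target function for every candidate N and return the best one.
// `data` and `tlb` are the two per-partition LRU stacks; W is the number of ways.
int choose_partition(const LruStack& data, const LruStack& tlb, int W) {
    int best_n = 0;
    uint64_t best_score = 0;
    for (int n = 0; n <= W - 1; ++n) {
        // Hits retained with n data ways and (W - n) TLB ways.
        uint64_t score = data.hits_with_ways(n) + tlb.hits_with_ways(W - n);
        if (score > best_score) { best_score = score; best_n = n; }
    }
    return best_n;   // becomes N for the next epoch
}
```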
In reality, a data cache miss and a TLB entry miss carry different penalties. This, however, is not reflected in the target function above, where the hit counters are simply summed. To differentiate the importance of data hits and TLB hits, the target function can be enhanced by multiplying each partition’s counter sum by a “hit contribution” weight. For example, at the L2 cache, if a data block misses, we pay the penalty of accessing the LLC or even DRAM. This penalty can be computed from the LLC hit rate, the LLC latency, and the DRAM latency; the LLC hit rate is already available from performance counters in the PMU. The ratio between the L2 miss penalty and the L2 access latency is used as the hit contribution (the “performance gain” in the paper), meaning that the longer an L2 data miss takes, the more beneficial it is to allocate more ways to the data partition. In the final target function, after summing the counters at positions 0 to N - 1 of the data LRU stack, the sum is multiplied by this hit contribution; the same is done for the TLB partition. The maximization of the target function remains unchanged and yields the parameter N for the next epoch.
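A minimal sketch of this weighted variant follows, again reusing the hypothetical `LruStack`; the miss-penalty formula and the latency/hit-rate inputs are illustrative assumptions, not values from the paper.

```cpp
// Inputs assumed to come from static latencies and PMU counters.
struct L2Params {
    double l2_lat;        // L2 access latency (cycles)
    double llc_lat;       // LLC access latency (cycles)
    double dram_lat;      // DRAM access latency (cycles)
    double llc_hit_rate;  // measured LLC hit rate, 0.0 .. 1.0
};

// Hit contribution: expected L2 miss penalty divided by the L2 access latency.
double hit_contribution(const L2Params& p) {
    double miss_penalty = p.llc_hit_rate * p.llc_lat
                        + (1.0 - p.llc_hit_rate) * (p.llc_lat + p.dram_lat);
    return miss_penalty / p.l2_lat;
}

// Criticality-weighted version of the end-of-epoch selection.
int choose_partition_weighted(const LruStack& data, const LruStack& tlb, int W,
                              double data_weight, double tlb_weight) {
    int best_n = 0;
    double best_score = -1.0;
    for (int n = 0; n <= W - 1; ++n) {
        double score = data_weight * (double)data.hits_with_ways(n)
                     + tlb_weight  * (double)tlb.hits_with_ways(W - n);
        if (score > best_score) { best_score = score; best_n = n; }
    }
    return best_n;
}
```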