SIPT: Speculatively Indexed, Physically Tagged Caches
- Hide Paper Summary
Link: https://ieeexplore.ieee.org/document/8327003/
Year: 2018
Keyword: Cache Hierarchy; Speculative Index
This paper proposes SIPT, a cache design that uses speculative bits obtained from virtual addresses for set selection, without incurring synonym and homonym problems as in virtually indexed caches. In a traditional virtually indexed, physically tagged L1 cache, the cache lookup uses the bits that do not change during the translation as the set index. The translation and set selection can thus be parallized, and TLB lookup is likely not on the critical path of the memory operation. The downside of this approach is that the L1 way size is limited. Assuming 4KB page size and 64 byte cache line size, at most 6 bits can be used as the set index. The maximum number of sets is therefore only 64, and the maximum size of each way is 4KB. If we want to build a larger L1 cache, the only viable way is to increase the number of ways in a set, which also increases the latency and energy consumption because more tags and data needs to be read and compared. Nowadays, typical L1d cache is eight way set-associative, and the size is 32KB.
Instead of keeping on adding ways in a set, the cache controller can use more bits to select sets before the physical address is available from the TLB. The cache controller “guesses” the bits that may change during the translation. If the guessed bits (usually 1 - 3 bits) do not match the actual physical address, then the speculation fails, and the cache lookup operation restarts. All cycles and energy for the failed lookup are wasted. If, however, the speculation is correct, then we essentially have a larger L1 cache without serializing address translation and set selection.
There are different levels of speculation. In its simplest form, the cache controller just speculates that the bits in the virtual address will not change, and always use the virtual address to generate the index. This scheme in fact works well for some workloads, because the operating system may have optimizations such as page coloring that tries to restrict the pattern that virtual pages can be mapped to physical pages. For other workloads, however, performance degrades severely as a consequence of wrong speculation.
In addition to the static predictor (always unchanged) as described in the previous paragraph, we can build a more complicated predictor using perceptron. The perceptron predictor works similarly to a branch predictor using global history. Here the global history is the result of previous speculations (bits changed/unchanged), represented as a bit array, where 1 means correct speculation and 0 means incorrect speculation. We keep h bits in the global history G. The prediction table has 64 entries, indexed by PC slices of load/store instructions. Each entry in the prediction table consists of (h + 1) parameters, w0, w1, …, wh. On prediction, the cache controller computes p = w0 + w1 * G1 + … + wh * Gh. If p is non-negative, the controller speculates. Otherwise, it waits for address translation. This scheme does not eliminate TLB lookup from the critical path entirely, as the controller still has to wait for the physical address if it decides not to speculate. It is effective in reducing the overhead of misspeculation, though.
Even if the perceptron predictor gives positive result on speculation, possibilties exist that the bits in the physical address do not match the bits in the virtual address. If this happens, the misspeculation penalty still applies. To avoid misspeculating the value of the index (rather than whether or not to speculate), the delta between the index bits from the virtual address with the bits in the physical address is also predicted using a table. The index delta buffer (IDB) servers a similar purpose to the branch target buffer in a branch predictor. On each prediction, the cache controller uses the current PC to index into IDB, and fetches the index delta. The delta is then added onto the speculative bits (carries are ignored) to form the index that we use to select sets. After the physical address is generated, we update the IDB entry to the actual delta of speculative bits between virtual and physical addresses. In the paper, the IDB has the same number of entries as the perceptron prediction table. Note that IDB is not needed if only a single bit is to be predicted, because flipping the bit is sufficient.
The reason that the index can be predicted using deltas is within the memory allocator in the operating system. In Linux, for example, the page allocator uses buddy system, which breaks large chunks into smaller ones if existing chunks could not satisfy a physical memory request. That is to say, physical memory allocated by the virtual memory manager is likely to be consecutive, when a chunk of virtual address is populated with physical pages during the initialization phase of data structures. The delta between the virtual chunk and the physical chunk is therefore a constant. By remembering the constant delta in the last translation, there is high probability that the next few translations will also observe the same delta. Even if this assumption fails to be true in some rare cases, the prediction should also work well. This is because load/store instructions usually demonstrate high spatial locality, i.e. nearby instructions are prone to access data items on the same page. If several accesses hit the same virtual page, their virtual to physical mapping must be the same, and the index delta prediction is always successful. These two observations together explains the high accuracy of the index delta prediction described in the paper.