Ripple: Profile-Guided Instruction Cache Replacement for Data Center Applications
- Hide Paper Summary
Link: https://conferences.computer.org/iscapub/pdfs/ISCA2021-4ghucdBnCWYB7ES2Pe4YdT/333300a734/333300a734.pdf
Year: ISCA 2021
Keyword: Ripple; i-cache; Cache Replacement
Highlight:
-
The effect of i-cache prefetching heavily depends on the replacement algorithm. All existing algorithms studies in this paper are non-satisfactory for various reasons.
-
The paper observes that the ideal replacement algorithm (MIN) works well with prefetching, but it is unable to be implemented as hardware component due to lack of future knowledge.
-
To address the above limitation, the paper proposes Ripple which uses off-line basic block analysis. Ripple runs simulation on basic block traces, and simulates the ideal replacement algorithm. It then attempts to insert i-cache flush instructions at certain places in the program, such that the behavior of the instrumented binary mimics the behavior of the ideal algorithm.
-
The locations of inserting eviction instructions are selected by finding basic blocks that are executed between the last usage of a block and its eviction in the ideal algorithm. One of the blocks in this “eviction window” is selected as the “cue block” based on the corelation between the execution of the basic block and the eviction of the cache block (the higher the better).
-
Some blocks could not be chosen as the “cue block” if the corelation between the basic block and the eviction is weak. This can be recognized using coverage and accuracy.
Comments:
- I am no expert in cache replacement, so I will not make comments on the technical soundness of the details of the algorithm. My only comment is that, maybe the authors can make it clear that the off-line simulation uses a i-cache simulator, and the replacement algorithm is the ideal algorithm. Otherwise it is difficult to understand the second half of the paper.
This paper presents Ripple, a novel instruction cache replacement algorithm using off-line optimality analysis. The paper is motivated by the fact that modern data center applications can incur a heavy burden on the instruction cache, which is caused by numerous software stacks and many modules for different purposes within each stack. The paper shows that a typical data application can execute up to several MBs of code in the working set. On current platforms with tens of KBs of L1 instruction cache, this will result in 23% to 80% vacant pipeline slots.
The paper also noted that, despite best efforts from previous proposals that attempt to address the i-cache problem, such as specialized prefetchers, these proposals often only give sub-optimal improvements, because of not being able to understand the essential problem of high i-cache miss rates. As a result, these prefetchers can bring many unnecessary cache blocks into the i-cache, potentially evicting useful blocks, causing what we call “cache pollution”. In addition, specialized hardware prefetchers need extra hardware resource on the chip, which can be expensive in some cases.
Hardware prefetchers, however, may actually become helpful when combined with the correct replacement algorithm. The paper experimented with an ideal algorithm that always evicts blocks that will be used in the furthest future, and gives priority to prefetched blocks over regular blocks (i.e., the well-known MIN replacement algorithm). Results show that performance can be improved with the proper replacement algorithm compared with the baseline using LRU.
The paper makes two observations about a good replacement algorithm. First, the replacement algorithm should evict cache blocks that are prefetched, but not used in the future. In MIN this can be inferred accurately from the execution history, which is assumed to be known. Second, the algorithm should not evict blocks that are “hard to prefetch”, which is defined as cache blocks that cannot be prefetched with good accuracy, e.g., due to indirect branches (depending on the prefetcher). Unfortunately, existing algorithms could not easily fulfill the task in these two observations, as they are most likely invoked on-the-fly, with information only on the current state of the system, and a limited execution history.
The paper then investigated a few popular eviction algorithms, and compared their performance implication with hardware prefetchers. Sadly, none of the existing algorithms studied by this paper performs well. Among the many algorithms, GHRP predicts a cache block to be either dead or alive based on control flow information, and evicts blocks that are deemed dead. The paper points out that certain implementational issues prevent GHRP from outperforming the baseline LRU. But even after correcting the issue, GHRP only outperforms LRU by a negligible amount. Hawkeye/Harmony is originally designed for data caches, and it predicts a block to be either “cache-friendly” or “cache-averse” using the PC of the memory instruction. This will not work for i-cache evictions, since the i-cache is driven by changing PC values. The paper also experimented with RRIP family, and without surprise, results are also not good. The problem with RRIP is that RRIP is specifically designed to address the scan pattern which will incur pathological cases with LRU. RRIP avoids the problem by treating newly inserted blocks only as “friendly” after the first reference after insertion. This assumption, however, does not work for i-cache, since scan is a relatively rare pattern for execution. The paper summarizes at the end of this section that i-cache usage follows a different pattern than data caches. Most prominently, even the same i-cache blocks will demonstrate varying re-reference intervals during different stages of program execution. This renders conventional algorithms depending on the assumption of fixed re-use behavior on the same address useless.
The paper then presents Ripple. Ripple is a software implemented replacement algorithm that relies on special i-cache flush instructions to mimic the behavior of the ideal algorithm, which has already been shown to exhibit the best performance improvement. The biggest advantage of Ripple is that, instead of running the algorithm on-the-fly as the binary executes, which is limited by on-chip computing resources, decision response time (usually a few cycles), and knowledge of future references, Ripple collects execution profile (mainly basic block information), and analyzes the profile off-line by simulating the ideal algorithm (i.e., MIN). Once the globally ideal trace has been known, Ripple then inserts cache block eviction instructions in certain basic blocks to achieve the same effect of the ideal algorithm in the run time, i.e., blocks that should have been evicted will actually be evicted during the execution at roughly the same point.
Ripple operates as follows. First, the application is profiled with a basic block profiling tool, and a list of basic blocks in the execution order is output. Second, Ripple runs an off-line i-cache simulator with basic blocks addresses and sizes. Note that since basic blocks are code regions that are always executed from the beginning to the end, the simulation can be as simple as inputting all addresses in all basic blocks in the execution order to the simulator. The replacement algorithm for i-cache used in the simulation is the ideal algorithm, and processor core is not simulated. Then, for all cache block addresses that have been evicted at least once, Ripple identifies their eviction windows. An eviction window is defined as the set of basic blocks that are executed between the last usage of a block and its eviction. Note that the same block could be evicted multiple times, and therefore have multiple eviction windows.
The implication of eviction windows is that, if the block is to be evicted as in the ideal algorithm, then the eviction could happen by inserting an i-cache flush instruction in any of these blocks in the window. To decide which basic block is the best one, Ripple uses two types of per-basic block statistics. The first one is the absolute number of times the basic block occurs in the trace. The second one is the number of times that the basic block appears in the block’s window. Ripple computes the ratio between the second and the first statistics number, and selects the block with the maximum ratio as the cue block. The heuristics here is that, by maximizing the ratio between the two numbers, we maximize the chance that whenever the cur block is executed, the eviction of the block under question would follow. Ripple then inserts a cache block flush instruction into the cur block using the address of the cache block. This process is repeated for all blocks that have at least one eviction window.
The paper also noted that not all blocks with the maximum ratio can be chosen as the cue block. If the absolute value of the ratio is too low, indicating that there is not strong relationship between the basic block and the evicted cache block, the flush should not be inserted to avoid disrupting normal caching in a different context. In other words: A threshold should be given as the minimum ratio computed above. Cur blocks under the threshold will not be selected for inserting evictions.
To help finding the threshold, which is program-specific, the paper hence defines two metrics: Coverage and Accuracy. Coverage is defined as the number of manual evictions made by Ripple divided by the number of evictions that would be made by an ideal algorithm. The higher the coverage is, the better the coverage is, meaning that Ripple can mimic most of the evictions done by the ideal algorithm. Accuracy, on the other hand, is defined as the ratio between the evictions (by running the code with the current manual insertion of evictions) that will also be made by the ideal algorithm and total number of evictions made by manual evictions. Accuracy reflects how many evictions are actually useful. The paper shows that there is a clear trade-off between these two, and that the threshold should be given to maximize overall performance improvement. Unluckily, the paper does not give a detailed explanation on how the threshold is selected, but just mentions that the threshold should be chosen such that both accuracy and coverage are at acceptable values. The paper reports that empirical values of a good threshold are between 40% andd 60%.
The paper also noted that Intel already proposes adding a new i-cache flush instructions in future server platforms. Ripple could be implemented with this new instruction without any hardware modification. The authors also propose their own instruction for better i-cache management, namely invalidate, which differs from existing cache flush instructions by only evicting a block from the i-cache, but still keeping it within the rest of the hierarchy.