Improving the Utilization of Micro-Operation Caches in x86 Processors



Paper Title: Improving the Utilization of Micro-Operation Caches in x86 Processors
Link: https://www.microarch.org/micro53/papers/738300a160.pdf
Year: MICRO 2020
Keyword: Microarchitecture; uop cache




Highlight:

  1. The uop cache is an important component of the x86 frontend that stores decoded uops. Uops are organized into entries, which correspond closely to Prediction Windows (PWs) but are smaller, since various factors can cause a PW to be divided into several entries.

  2. The uop cache is compressible, because entries are variably sized and typically much smaller than the slot size. This resembles data cache compression, in which cache lines also become variably sized after compression.

  3. CLASP performs a simple form of uop cache compression by storing consecutive entries of the same PW in the same slot, based on the observation that entries split only by an i-cache line boundary always map to adjacent sets. This requires no extra tag storage and no special handling for invalidation, since the compressed entry can be found in either set A or set (A - 1), where A is the set index the entry would use if inserted uncompressed.

  4. Compaction performs general uop cache compression by fitting multiple entries into the same slot. Tags are duplicated per slot, just like in a regular cache compression scheme. The paper also proposes three insertion schemes for compacted entries to maximize the locality of a physical slot.

Questions

  1. The paper never gives a precise definition of a PW, while suggesting that the uop cache is addressed using the base address of the PW. This raises several questions: (1) A PW can be stored in multiple entries due to size limitations or i-cache line boundaries; how are these multiple entries addressed individually if they belong to the same PW? Shouldn’t they conflict, if all entries are tagged with the same PW base address? My best guess is that each entry is still tagged with the physical address of the instruction corresponding to its first uop, and that each entry also records the address of the next instruction. (2) How are set indices generated from a uop cache entry’s address? For CLASP to work, entries in the same PW that are delimited by an i-cache line boundary must be mapped to adjacent sets in the uop cache, which suggests a different addressing scheme than the paper describes.

  2. The paper also does not give a full description of how metadata is encoded in the uop cache, which makes it hard to understand how uop entries are recognized and delivered to the backend. For example, CLASP requires the cache controller to recognize two entries that are consecutive but were divided by an i-cache line boundary. How is this achieved? Is there any special encoding involved?

This paper proposes two techniques for optimizing the performance of the micro-operation (uop) cache on x86 processors. The uop cache is an important component of the frontend pipeline, since it stores decoded uops and feeds them to the backend without involving the instruction cache and the decoder. This reduces pipeline depth, instruction delivery latency, and energy consumption, all of which help performance. Current implementations of the uop cache suffer from external fragmentation, which stems from the fact that uop cache entries are variably sized. A fragmented uop cache hurts performance, since it reduces the effective size of the cache, which can in turn stall the backend pipeline due to insufficient instruction bandwidth.

The paper assumes the following uop cache architecture. The uop cache itself is implemented like a regular cache, with sets, ways, and fixed-size (64-byte) slots. Uops, once generated by the instruction decoder, are first aggregated in an accumulation buffer, after which they are inserted into the uop cache as individual entries. A uop cache entry consists of a stream of uops translated from consecutive instructions. One of the most important features of uop entries is that they are variably sized, i.e., an entry does not always consist of a fixed number of uops, even though the uops themselves are of equal length for easier processing, just like RISC instructions. These entries are sub-sequences of the actual dynamic uop stream being executed. A uop cache entry starts when a branch changes control flow to a target instruction, or when the previous entry has been terminated and a new entry begins naturally. A uop entry ends under several conditions: (1) when the next uops come from instructions in a different i-cache line, i.e., the decoder terminates the current entry when the current i-cache line is exhausted; this avoids mapping more than one i-cache line to the same uop cache slot, in which case, when the i-cache invalidates a line due to self-modifying code, the uop cache would need to be searched extensively for all occurrences of uops corresponding to the invalidated line; (2) when a taken branch uop occurs; the branch decision is an output of the branch predictor and is used by the accumulation buffer to delimit uop cache entries; (3) when the maximum number of uops or of certain fields (immediate values, offset values) is reached, i.e., when the entry is full.
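
The following is a minimal sketch of how the accumulation buffer might apply these termination conditions. The numeric limits and the uop field names are illustrative assumptions, not values from the paper.

```python
ICACHE_LINE_BYTES = 64   # assumed i-cache line size
MAX_UOPS_PER_ENTRY = 8   # illustrative structural limits, not the paper's exact values
MAX_IMM_FIELDS = 4

def accumulate_entries(uop_stream):
    """Group a decoded uop stream into uop-cache entries.

    Each uop is a dict with 'pc' (instruction address), 'imm_fields'
    (immediate/offset fields it needs), and 'taken_branch' (predicted taken).
    """
    entry, imm_count = [], 0
    for uop in uop_stream:
        crosses_line = entry and (uop["pc"] // ICACHE_LINE_BYTES
                                  != entry[-1]["pc"] // ICACHE_LINE_BYTES)
        out_of_room = (len(entry) >= MAX_UOPS_PER_ENTRY
                       or imm_count + uop["imm_fields"] > MAX_IMM_FIELDS)
        if entry and (crosses_line or out_of_room):   # conditions (1) and (3)
            yield entry
            entry, imm_count = [], 0
        entry.append(uop)
        imm_count += uop["imm_fields"]
        if uop["taken_branch"]:                        # condition (2): entry ends after a taken branch
            yield entry
            entry, imm_count = [], 0
    if entry:
        yield entry
```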

The uop cache is indexed using the physical address of the prediction window (PW), which is a consecutive range of instructions that the branch predictor has predicted to be executed next. A PW can be divided into multiple uop entries, each stored in the uop cache individually. The paper makes a critical observation: if two uop cache entries come from the same PW, representing a consecutive uop stream, but are delimited by an i-cache line boundary, then these two entries must map to adjacent sets in the uop cache (although the paper never elaborates on the index generation scheme for uop caches).
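
Since the paper does not spell out the index function, the following is purely a guess at one scheme consistent with that observation: derive the set index from the address bits just above the 64-byte line offset, so that the two halves of a line-split PW fall into consecutive sets.

```python
ICACHE_LINE_BYTES = 64
NUM_SETS = 32            # assumed uop cache geometry

def uop_cache_set(entry_start_pa):
    # Take the index bits just above the 64-byte line offset, so two entries
    # of the same PW that were split at a line boundary land in adjacent sets.
    return (entry_start_pa // ICACHE_LINE_BYTES) % NUM_SETS

# Example: a PW starting at 0x10F30 crosses into the line at 0x10F40; the two
# resulting entries map to consecutive sets.
assert uop_cache_set(0x10F30) + 1 == uop_cache_set(0x10F40)
```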

At runtime, the uop cache is probed by the uop buffer connecting the frontend and the backend, in parallel with the loop uop buffer and the decoder. If the uops to be executed next are found in the uop cache or the loop uop buffer, the decoder is not used, and uops are fed directly from one of these two structures. On the other hand, if both miss, the decoder is activated and processes the instruction stream fetched from the i-cache. In the latter case, more pipeline stages are involved, which implies a larger branch misprediction penalty. In addition, the decoder consumes more power than the two small structures, making the frontend less energy efficient.
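
A sketch of this path selection is below. The objects and their lookup/fetch/decode methods are hypothetical interfaces used only to illustrate the priority of the delivery paths; real hardware probes the structures in parallel rather than sequentially.

```python
def deliver_uops(pc, loop_buffer, uop_cache, icache, decoder):
    """Probe the cheap uop sources first; fall back to the decode path."""
    uops = loop_buffer.lookup(pc)
    if uops is None:
        uops = uop_cache.lookup(pc)
    if uops is None:
        # Miss in both structures: the longer, more power-hungry decode path.
        uops = decoder.decode(icache.fetch(pc))
    return uops
```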

The paper points out, however, that the existing uop cache architecture suffers from a severe fragmentation problem due to its variably sized entries. Entries are often significantly smaller than the slot size because of the entry termination conditions described above. Fragmentation lowers the effective size of the uop cache, since only a small fraction of the cache storage is actually utilized. In addition, it lowers the power efficiency of the frontend, since the decoder must be relied upon to provide uops to the backend more frequently.

The paper proposes two techniques for reducing uop cache fragmentation. The first, called Cache Line Boundary Agnostic uop Cache Design (CLASP), attempts to store two entries in the same slot if they are consecutive in the same PW but were divided by an i-cache line boundary. Recall that entries generated under this condition are always stored in two consecutive sets of the uop cache. When such an entry is to be inserted, the uop cache controller first generates the set index A as usual, and then checks both set A and set (A - 1). If set (A - 1) contains an entry satisfying the condition, and the total size of the two entries fits in a single slot, the controller places the second entry into that slot in set (A - 1). To address the invalidation problem, when invalidating the PW's second uop cache entry, both set A and set (A - 1) are searched so that the entry is not missed. To check both sets in the same cycle, which is critical for fast delivery of uop entries, the paper proposes implementing the uop cache tag store as two interleaved banks.
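
The following sketch illustrates the CLASP insertion path under an assumed dictionary-based cache model; all field names and the trivial victim choice are illustrative, not the paper's implementation. Lookup and invalidation would symmetrically probe both set A and set (A - 1).

```python
SLOT_BYTES = 64
NUM_SETS = 32
NUM_WAYS = 8

# Each way is either None or a dict describing the entries packed into one
# 64-byte slot. All field names are assumptions made for illustration.
uop_cache = [[None] * NUM_WAYS for _ in range(NUM_SETS)]

def clasp_insert(entry):
    """If the entry's predecessor in the same PW (split off only by an i-cache
    line boundary) sits in set A-1 with enough spare room, co-locate them;
    otherwise fall back to a normal insertion into set A."""
    a = entry["set_index"]
    for way in uop_cache[(a - 1) % NUM_SETS]:
        if (way is not None
                and way["pw_id"] == entry["pw_id"]
                and way["ends_at_line_boundary"]
                and way["bytes"] + entry["bytes"] <= SLOT_BYTES):
            way["entries"].append(entry)
            way["bytes"] += entry["bytes"]
            way["ends_at_line_boundary"] = entry["ends_at_line_boundary"]
            return
    # Normal path: pick an invalid way if possible, else way 0 stands in for
    # the real replacement victim (policy not shown).
    ways = uop_cache[a]
    victim = next((i for i, w in enumerate(ways) if w is None), 0)
    ways[victim] = {"pw_id": entry["pw_id"],
                    "ends_at_line_boundary": entry["ends_at_line_boundary"],
                    "bytes": entry["bytes"],
                    "entries": [entry]}
```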

The second technique, Compaction, is similar to data cache compression: it stores multiple variably sized entries in the same slot. Address tags and other metadata bits are duplicated for each entry a slot can hold, and all of them are checked in parallel on a set lookup. Each tag also contains an additional field storing the offset of its entry within the physical slot. When a new entry is to be inserted into the uop cache, the controller checks whether any existing slot can accommodate the new entry without evicting a resident entry. If so, the entry is inserted into an appropriate slot, with its tag initialized properly and the extra field pointing to the entry's offset. Otherwise, an existing entry is evicted as usual.
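
A minimal sketch of the Compaction insertion check, using the same assumed dictionary-based model as above (field names and the victim choice are illustrative only):

```python
SLOT_BYTES = 64

def compaction_insert(cache_set, entry):
    """cache_set is a list of ways; a valid way is a dict holding the entries
    packed into its 64-byte slot and the bytes already used. Each packed
    entry keeps its own tag plus an offset into the slot."""
    for way in cache_set:
        if way is not None and way["bytes"] + entry["bytes"] <= SLOT_BYTES:
            entry["offset"] = way["bytes"]       # per-entry offset within the slot
            way["entries"].append(entry)
            way["bytes"] += entry["bytes"]
            return
    # No slot can accommodate the entry without eviction: evict as usual
    # (way 0 stands in for the real replacement victim).
    entry["offset"] = 0
    cache_set[0] = {"bytes": entry["bytes"], "entries": [entry]}
```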

The paper then proposes three insertion schemes that determine which slot an entry is compacted into. The first scheme prioritizes, among feasible candidates, the slot closest to the MRU position. The reasoning is to group entries with similar expected reuse into the same slot, so that evicting the slot will not accidentally throw away a recently used entry. The second scheme prioritizes a slot already holding an entry of the same PW, since execution is highly localized within a PW (recall that a PW is a consecutive range of instructions). To achieve this, each uop cache entry, as well as the uops in the accumulation buffer, is tagged with a PW ID, and during insertion the controller favors slots storing an entry with the same PW ID as the incoming one. The third scheme is more aggressive: instead of passively looking for a candidate slot, the uop cache controller proactively moves existing entries around within the same set to force entries from the same PW into the same slot. This may incur extra write operations to the data bank during insertion, but the paper claims that this is infrequent due to the increased locality.
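
The sketch below illustrates how the first two prioritization schemes could choose among candidate slots; the scheme names, field names, and tie-breaking are my own shorthand rather than the paper's exact policies, and the third (forced-compaction) scheme is omitted since it additionally rewrites existing slots.

```python
def pick_slot(candidates, entry, scheme):
    """Choose among ways that already have room for `entry`.

    candidates: list of (way, lru_distance) pairs, where a smaller
    lru_distance means the way is closer to the MRU position.
    """
    if not candidates:
        return None
    if scheme == "replacement-aware":
        # Group entries with similar expected reuse: prefer the way nearest MRU.
        return min(candidates, key=lambda c: c[1])[0]
    if scheme == "pw-aware":
        # Prefer a way that already holds an entry of the same prediction window.
        same_pw = [c for c in candidates
                   if any(e["pw_id"] == entry["pw_id"] for e in c[0]["entries"])]
        return (same_pw[0] if same_pw else candidates[0])[0]
    return candidates[0][0]
```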