IMP: Indirect Memory Prefetcher
Link: https://dl.acm.org/citation.cfm?id=2830807
Year: MICRO 2015
Keyword: Cache; Prefetching
This paper proposes the Indirect Memory Prefetcher (IMP) to address the common problem of slow indirect memory operations. Indirect memory operations are of the form A[B[i]], where B is an array of indices that is read with a regular pattern, and A stores data that is accessed through those indices without any regular pattern. This scenario is common in compressed representations of sparse data structures. In practice, both arrays are pre-computed at the beginning of an iteration and remain unchanged during the iteration, because the compact form is difficult to modify. Access patterns like this suffer from low spatial locality, especially when A is large, because two elements of A accessed consecutively are unlikely to be stored physically close to each other. In addition, a classical stream prefetcher, which detects regular access patterns as access streams, cannot help either: even if B itself is accessed regularly (e.g., with a linear scan), the contents of B[i] usually do not translate into a regular pattern of accesses to A, making it hard for a hardware prefetcher to work.
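As a concrete illustration, the sketch below shows sparse matrix-vector multiplication over a CSR-format matrix, a typical source of this pattern; here col_idx plays the role of B (scanned linearly) and the dense vector x plays the role of A (read indirectly). All names are illustrative, not from the paper:

```c
#include <stddef.h>

/* Sparse matrix-vector multiply in CSR form: col_idx is read with a
 * linear scan, while x is accessed indirectly through col_idx with no
 * regular pattern. */
void spmv_csr(size_t n_rows, const size_t *row_ptr,
              const size_t *col_idx, const double *val,
              const double *x, double *y)
{
    for (size_t r = 0; r < n_rows; r++) {
        double sum = 0.0;
        for (size_t i = row_ptr[r]; i < row_ptr[r + 1]; i++)
            sum += val[i] * x[col_idx[i]];   /* A[B[i]]: irregular reads */
        y[r] = sum;
    }
}
```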
In order to also prefetch array A efficiently, IMP takes advantage of the following observation: the address of A[B[i]] is computed by a simple formula, &A[B[i]] == sizeof(A[0]) * B[i] + &A[0]. Given that the element size of A is a compile-time constant and the base address of A is fixed at the beginning of the iteration, this equation contains only one variable, B[i], and two unknown constants, sizeof(A[0]) and &A[0]. A stream prefetcher can easily prefetch the contents of array B, since B is accessed regularly. By monitoring cache miss events, we can likely also observe the target addresses that a perfect prefetcher would have prefetched (if multiple accesses incur no cache miss, that indicates the locality of the accesses is already good and prefetching is unnecessary). The prefetching problem then essentially boils down to the following question: given a line equation y = ax + b, where x is B[i] and y is &A[B[i]], how many data points do we need to fix the parameters a and b? The answer is two. After computing a and b, for further accesses of A[B[i]], we can use the value of B[i + 1] as the input x to generate the next address to prefetch.
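A minimal software sketch of this line-fitting step, assuming two observed (index, miss address) pairs. The helper name and the general division are illustrative; the actual hardware only tests a fixed set of candidate sizes, as described in the next paragraph:

```c
#include <stdint.h>
#include <stdbool.h>

/* Fit y = a*x + b from two observations, where each x is an index
 * value read from B and each y is a candidate miss address for A[x].
 * On success, a = sizeof(A[0]) and b = &A[0]. */
static bool fit_indirect_params(uint64_t x1, uint64_t y1,
                                uint64_t x2, uint64_t y2,
                                uint64_t *elem_size, uint64_t *base)
{
    int64_t dx = (int64_t)(x2 - x1);
    int64_t dy = (int64_t)(y2 - y1);
    if (dx == 0 || dy % dx != 0)
        return false;                     /* cannot fit a line      */
    int64_t a = dy / dx;                  /* candidate sizeof(A[0]) */
    if (a <= 0)
        return false;
    *elem_size = (uint64_t)a;
    *base      = y1 - (uint64_t)a * x1;   /* b = y - a*x = &A[0]    */
    return true;
}

/* Once a and b are fixed, the next prefetch target is simply
 * base + elem_size * B[i + 1]. */
```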
One simplification is made in the paper to make the hardware implementation more practical: for the performance-critical parts of a program, most compilers align data items stored in an array. It is therefore very likely that sizeof(A[0]) takes one of a small set of values, e.g., 4 (C int or float), 8 (C long or double), 16 (vector types), or 1/8 (bit fields, where B[i] is an offset in bits and is converted to an offset in bytes by dividing it by 8). The hardware therefore does not need an expensive multiplier or divider, because all of these scalings can be implemented with simple left and right bit shifts.
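A sketch of how the scaling reduces to shifts under this simplification; the function and the candidate encoding are illustrative:

```c
#include <stdint.h>

/* Compute sizeof(A[0]) * B[i] for the four hard-coded element sizes
 * without a multiplier: each candidate size is a power of two, so the
 * offset is a single shift. */
static uint64_t scaled_offset(uint64_t index, int candidate)
{
    switch (candidate) {
    case 0: return index << 2;   /* 4-byte elements (int, float)    */
    case 1: return index << 3;   /* 8-byte elements (long, double)  */
    case 2: return index << 4;   /* 16-byte elements (vector types) */
    case 3: return index >> 3;   /* bit fields: bit offset -> bytes */
    default: return 0;
    }
}

/* A candidate base address is then miss_addr - scaled_offset(B[i], c)
 * for each candidate c. */
```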
The architecture of IMP is as follows. IMP builds on existing stream prefetcher designs, in which stream information is stored in a small associative buffer called the stream table. The stream table maintains the state necessary to detect streaming memory access patterns; several streams can be active at the same time, each occupying its own entry. To detect more complicated patterns, IMP extends the stream prefetcher with an Indirect Pattern Detector (IPD). The IPD is organized as a table, similar to the stream table, and each IPD entry is assigned to one indirect memory reference of the form A[B[i]].

After a stream has been detected by the stream detector, the IPD begins monitoring memory references made by the stream. For a reference to B[i] detected by the stream detector, the IPD checks whether it is already in the table and, if not, allocates a new entry. The PC of the accessing instruction is used to disambiguate between different streams; if two accessing instructions have the same PC, they are assumed to constitute the same stream even if the pattern changes, because this can be the result of a nested loop. After observing the read of B[i], the IPD keeps monitoring cache miss events following the read. When a miss is detected, its address is sent to the IPD, which computes candidate base addresses of A (i.e., &A[0]) using the four hard-coded element sizes mentioned in the previous paragraph, and stores the results in a hardware table. If multiple cache misses occur after reading B[i], the first few of them are captured and used to compute candidate base addresses (so the table has multiple entries, one for each cache miss following the same B[i] read). On a second access to an element of the same stream (i.e., B[i + k]), the IPD repeats the above process, keeping the base addresses computed from B[i + k] and its cache miss addresses in another table. The IPD then walks the two tables with a state machine and compares the base addresses stored at corresponding locations. If two values match, we have high confidence that an indirect pattern has been detected. The element size and base address are then associated with the stream that reads B[i] by inserting them into columns of the stream table. In addition, the current value of B[i] is stored in a column of the stream table; this value is later used to check whether prefetching would work as expected. If no indirect access pattern has been detected by the third access to B, IMP assumes that the pattern does not exist and frees the entry from the IPD.
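The following sketch models the candidate-matching step in software, assuming the IPD captures the first few misses after each of the two index reads; the structure, field names, and table sizes are all assumptions for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_SIZES      4   /* the four hard-coded element sizes        */
#define MISSES_TRACKED 4   /* misses captured per index read (assumed) */

/* One IPD entry. Each cell holds a candidate &A[0], computed as
 * miss_addr - scaled_offset(index, c) for one captured miss and one
 * candidate element size c. */
struct ipd_entry {
    uint64_t pc;                                /* identifying PC */
    uint64_t base1[MISSES_TRACKED][NUM_SIZES];  /* from B[i]      */
    uint64_t base2[MISSES_TRACKED][NUM_SIZES];  /* from B[i + k]  */
};

/* Walk the two candidate tables: a base address that repeats for the
 * same element-size candidate across both index reads signals an
 * indirect pattern. */
static bool find_indirect_match(const struct ipd_entry *e,
                                int *size_idx, uint64_t *base)
{
    for (int c = 0; c < NUM_SIZES; c++)
        for (int m1 = 0; m1 < MISSES_TRACKED; m1++)
            for (int m2 = 0; m2 < MISSES_TRACKED; m2++)
                if (e->base1[m1][c] == e->base2[m2][c]) {
                    *size_idx = c;
                    *base     = e->base1[m1][c];
                    return true;
                }
    return false;
}
```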
After the indirect access pattern has been detected, the next step is to verify that the base address and element size of A were computed correctly. Each entry in the stream table has a saturating counter associated with it. Every time the stream is accessed, the stored value of B[i] is used to compute the expected target address, and this address is compared with every memory access (or every cache miss? The paper does not elaborate). If there is a match, meaning the prediction succeeded, the saturating counter is incremented; otherwise, if no match is found before the next access to B, the counter is decremented. Once the saturating counter reaches a threshold, IMP starts prefetching as follows. On an access to the stream that reads B[i], the prefetcher issues a memory read for B[i + Δ], computes the target address, and prefetches it. For memory blocks prefetched into the cache, IMP also tracks their usage by the processor. The value of Δ starts small and is incremented gradually as the prefetcher warms up; Δ is considered tuned once a prefetched block is used shortly after being inserted into the cache, at which point the prefetcher stops incrementing it.
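A sketch of the confirmation and distance-tuning logic; the counter parameters, field names, and function boundaries are assumptions, not values from the paper:

```c
#include <stdint.h>
#include <stdbool.h>

#define CONF_MAX    3   /* saturating-counter ceiling (assumed)    */
#define CONF_THRESH 2   /* confidence needed to prefetch (assumed) */

struct stream_entry {
    uint64_t base;        /* detected &A[0]              */
    uint64_t elem_size;   /* detected sizeof(A[0])       */
    uint64_t last_index;  /* most recent value of B[i]   */
    int      confidence;  /* saturating counter          */
    int      delta;       /* current prefetch distance Δ */
};

/* Called for each observed access while verifying: the prediction is
 * confirmed when the expected target address is actually touched. */
static void on_observed_access(struct stream_entry *s, uint64_t addr)
{
    if (addr == s->base + s->elem_size * s->last_index &&
        s->confidence < CONF_MAX)
        s->confidence++;
}

/* Called when the next read of B arrives without any match seen. */
static void on_prediction_failure(struct stream_entry *s)
{
    if (s->confidence > 0)
        s->confidence--;
}

static bool may_prefetch(const struct stream_entry *s)
{
    return s->confidence >= CONF_THRESH;
}

/* Ramp Δ while prefetched blocks are not yet timely; stop growing it
 * once a block is used shortly after insertion into the cache. */
static void tune_delta(struct stream_entry *s, bool block_was_timely)
{
    if (!block_was_timely)
        s->delta++;
}
```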
IMP can work with either physical or virtual addresses. The former cannot prefetch across page boundaries (because consecutive virtual pages may not be mapped to contiguous physical page frames), while the latter requires that the prefetcher state be cleared or saved on a context switch. The paper suggests deploying IMP with virtual prefetching. It is therefore necessary to place the IMP near the address translation facility, since each computed target must be translated before it can be prefetched.
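A sketch of why virtual prefetching wants to sit near the translation facility: every computed target is a virtual address and must be translated before the cache can be probed. The helper interfaces below are hypothetical, not from the paper:

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed hardware interfaces (hypothetical): */
extern bool tlb_lookup(uint64_t vaddr, uint64_t *paddr); /* near-TLB probe */
extern void cache_prefetch(uint64_t paddr);              /* issue prefetch */

/* Issue one indirect prefetch for &A[B[i + Δ]]. On a TLB miss the
 * prefetch is simply dropped rather than triggering a page walk. */
static void issue_prefetch(uint64_t base, uint64_t elem_size,
                           uint64_t next_index)
{
    uint64_t vaddr = base + elem_size * next_index;
    uint64_t paddr;
    if (tlb_lookup(vaddr, &paddr))
        cache_prefetch(paddr);
}
```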