Page Size Aware Cache Prefetching
Link: https://ieeexplore.ieee.org/document/9923823/
Venue: MICRO 2022
Keywords: Prefetching; Set Dueling; Huge Page
Highlights:
- Spatial prefetchers can prefetch across 4KB page boundaries if huge pages are used
- The prefetcher can be notified of the page size by passing one bit from the MMU that indicates the page size
- Set dueling can be used to determine which page size to use if both are available
Comments:
- The paper underestimates the hardware overhead by not counting the per-block bits needed to indicate whether a block was brought in by a prefetch. The overhead is relatively small, though, since these bits are only needed for the statically dedicated sets.
This paper proposes the Page Size Propagation Module (PPM), an extension to existing spatial cache prefetchers that enables prefetching on 2MB huge pages. The paper is motivated by the fact that 2MB huge pages are prevalent in most workloads, yet existing spatial prefetchers cannot prefetch across the regular 4KB page boundary. The paper addresses this problem by passing page size information from the MMU to the prefetchers at lower-level caches, such that a prefetcher can dynamically decide whether or not to prefetch across 4KB page boundaries.
This paper focuses on improving spatial prefetchers, one of the two major classes of prefetchers, which predict future memory access addresses based on past spatial access patterns (e.g., deltas). Compared with temporal prefetchers, which record histories of past cache misses and replay those accesses in the hope that future execution will reproduce the past pattern, spatial prefetchers have three apparent advantages. First, spatial prefetchers require much less metadata, as they only maintain metadata describing the pattern; temporal prefetchers, on the contrary, need to keep track of full access traces in order to reproduce them. Secondly, spatial prefetchers can prevent compulsory misses, since the predicted addresses need not have occurred in the previous execution history. By contrast, temporal prefetchers must observe a trace at least once before they can replay it, and hence suffer compulsory misses when an address is accessed for the first time. Lastly, spatial prefetchers also have better row buffer locality, as prefetch requests are spatially close to recently accessed addresses. Temporal prefetchers, on the other hand, provide no such guarantee and therefore exhibit a lower row buffer hit ratio, resulting in higher latency and energy consumption.
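For concreteness, here is a minimal sketch of a generic single-delta spatial prefetcher (not the paper's design; all names and constants are illustrative). It shows the small per-page metadata footprint: one entry per page rather than a recorded miss trace.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Generic single-delta spatial prefetcher (illustrative, not the paper's design).
// Per 4KB page it stores only the last block offset and last delta -- far less
// metadata than the full miss traces a temporal prefetcher must record.
class DeltaPrefetcher {
  struct Entry { int last_offset = -1; int last_delta = 0; };
  std::unordered_map<uint64_t, Entry> table_;  // keyed by 4KB page number

public:
  // Observe one demand access; return candidate prefetch addresses, if any.
  std::vector<uint64_t> on_access(uint64_t addr) {
    constexpr int kPageBits = 12, kBlockBits = 6;       // 4KB pages, 64B blocks
    constexpr int kBlocksPerPage = 1 << (kPageBits - kBlockBits);
    uint64_t page = addr >> kPageBits;
    int offset = static_cast<int>((addr >> kBlockBits) & (kBlocksPerPage - 1));

    std::vector<uint64_t> candidates;
    Entry& e = table_[page];
    if (e.last_offset >= 0) {
      int delta = offset - e.last_offset;
      int next = offset + delta;
      // If the delta repeats, predict the next block -- but only inside the
      // same 4KB page, which is exactly the restriction the paper relaxes.
      if (delta != 0 && delta == e.last_delta && next >= 0 && next < kBlocksPerPage)
        candidates.push_back((page << kPageBits) |
                             (static_cast<uint64_t>(next) << kBlockBits));
      e.last_delta = delta;
    }
    e.last_offset = offset;
    return candidates;
  }
};
```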
Despite these advantages, the paper identifies a major limitation of spatial prefetchers in the existing literature: their inability to prefetch across 4KB page boundaries. Two factors contribute to this uniform design decision. First, with regular 4KB-granularity address mapping, pages that are consecutive in the virtual address space are not necessarily consecutive in the physical address space. Consequently, in order to prefetch across page boundaries in the physical address space (which is mandatory, as lower-level caches are tagged with physical addresses), the prefetcher must have access to translation metadata. However, translation metadata is usually accessible only to the L1 cache, since the TLB physically resides at that level. Secondly, without translation metadata, speculatively allowing prefetchers to read past virtual page boundaries raises serious security concerns, as the prefetcher can become a side channel that breaks address space isolation and leaks data not belonging to the current process.
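The restriction can be summarized by the following illustrative check (a hypothetical helper, not from the paper): without translation metadata, any candidate that leaves the current 4KB physical frame must be dropped.

```cpp
#include <cstdint>
#include <optional>

// Illustrative helper (not from the paper): a physically-indexed prefetcher
// forms candidates by plain address arithmetic, but the next physical frame
// may belong to an unrelated virtual page or another process. Without
// translation metadata, a candidate that leaves the 4KB frame is dropped.
constexpr uint64_t kPageSize = 4096;

std::optional<uint64_t> try_form_candidate(uint64_t paddr, int64_t delta_bytes) {
  uint64_t candidate = paddr + delta_bytes;
  if (candidate / kPageSize == paddr / kPageSize)
    return candidate;   // same 4KB frame: safe to issue without a TLB lookup
  return std::nullopt;  // crosses the frame boundary: must be dropped
}
```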
For these reasons, the paper investigates the feasibility of prefetching on huge pages, a feature supported by recent x86 architectures. With huge pages enabled, either the application programmer or the operating system kernel can choose to map physical memory at a larger granularity, such as 2MB or 1GB (the latter is only available through the hugetlbfs interface). The paper also makes several observations on the interaction between huge pages and spatial prefetching. First, the paper evaluates several benchmarks with Transparent Huge Page (THP) enabled. Results show that most of the workloads can efficiently utilize huge page support during the entire execution, indicating the feasibility of performing prefetching at huge page granularity. Secondly, the paper conducts experiments in which the prefetcher is allowed to prefetch across 4KB page boundaries when the page size permits, while pattern information is still maintained at 4KB granularity. The results show that most workloads gain extra benefits due to the increased prefetching opportunity. Lastly, the paper modifies the prefetcher design to store pattern information at 2MB granularity whenever the page size allows. Results show further improvement for most workloads, because (1) less metadata is needed to track pattern information, and (2) patterns that cross 4KB page boundaries can be discovered, as they now map to the same entry. All three observations support the claim that prefetching at a larger page granularity can potentially improve performance.
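To make the last observation concrete, here is a tiny sketch (with an assumed 256-entry pattern table and illustrative addresses) of how tracking at 2MB granularity merges accesses that would otherwise be split across separate 4KB entries:

```cpp
#include <cstdint>
#include <cstdio>

// Index into an assumed 256-entry pattern table at either granularity.
uint64_t table_index(uint64_t paddr, bool huge_page) {
  const int page_bits = huge_page ? 21 : 12;  // 2MB vs. 4KB page number
  return (paddr >> page_bits) & 0xFF;
}

int main() {
  // Two addresses in the same 2MB page but in adjacent 4KB sub-pages.
  uint64_t a = 0x200000, b = 0x201000;
  std::printf("4KB granularity: %llu vs %llu\n",  // distinct entries: pattern split
              (unsigned long long)table_index(a, false),
              (unsigned long long)table_index(b, false));
  std::printf("2MB granularity: %llu vs %llu\n",  // same entry: pattern visible
              (unsigned long long)table_index(a, true),
              (unsigned long long)table_index(b, true));
}
```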
Based on these observations, the paper proposes the Page Size Propagation Module (PPM) to exploit the opportunity of prefetching across 4KB page boundaries. The PPM design assumes a hardware spatial prefetcher located at the L2 data cache, although the design itself is not restricted to L2 prefetchers. In addition, PPM is compatible with most prefetchers, but the paper assumes that pattern information is tracked by a hardware table indexed with the lower bits of the page address (the exact bits selected depend on the page size). The most critical goal of PPM is to propagate page size information from the MMU to the prefetcher. To achieve this goal, PPM adds an extra bit to the L1 MSHR that stores page size information: a zero indicates that the cache miss occurs on a regular 4KB page, and a one indicates that it occurs on a 2MB page. More page sizes can be supported by adding extra bits to the MSHR. With the extra bit added, when the MMU performs address translation for a memory access, it also sends the page size information to the L1 cache controller. If the access misses in the L1 cache, the bit is inserted into the MSHR and then passed to the L2 cache along with the request. On receiving the request, the prefetcher uses the page address to index into the hardware table that stores pattern information, selecting the index bits according to the value of the page size bit.
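Below is a minimal sketch of this propagation path as described above; the structure names and the vector representation of the MSHR are illustrative, since the paper specifies only the extra bit itself.

```cpp
#include <cstdint>
#include <vector>

// The MMU reports the translated page size alongside the physical address.
struct Translation { uint64_t paddr; bool is_huge; };

// PPM adds one bit per MSHR entry (more bits would support more page sizes).
struct MSHREntry   { uint64_t paddr; bool is_huge; };
struct L2Request   { uint64_t paddr; bool is_huge; };

std::vector<MSHREntry> l1_mshr;

// On an L1 miss, the page-size bit is recorded in the MSHR and forwarded to
// the L2 cache with the request, so the prefetcher there can pick index bits.
L2Request on_l1_miss(const Translation& t) {
  l1_mshr.push_back({t.paddr, t.is_huge});
  return {t.paddr, t.is_huge};
}

// The L2 prefetcher selects page-number bits by the propagated size bit
// before indexing its pattern table (assumed 256 entries).
uint64_t pattern_index(const L2Request& r) {
  return (r.paddr >> (r.is_huge ? 21 : 12)) & 0xFF;
}
```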
While the above naive approach works well in some cases, the authors also note that prefetching at 2MB granularity is not always desirable, as it can sometimes decrease performance. To address this problem, PPM leverages a technique called set dueling to select the proper page granularity when both are available. The mechanism works as follows. PPM adds a three-bit saturating counter whose highest bit is used to select the prefetching granularity. A small number of L2 cache sets are statically dedicated to 4KB and 2MB pages, i.e., accesses that hit these sets always trigger prefetching (if any) at the corresponding granularity. When a demand access hits a prefetched block (marked with a per-block bit), the saturating counter is incremented by one if the block is in a 4KB-dedicated set, and decremented by one if it is in a 2MB-dedicated set. Accesses to the remaining sets then use the highest bit of the saturating counter to determine the page size when both sizes are available.
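A minimal sketch of this selector follows, assuming a particular leader-set mapping, initial counter value, and bit polarity (the description above does not pin down which value of the high bit selects which size, so that mapping is an assumption):

```cpp
#include <cstdint>

class DuelingSelector {
  int counter_ = 4;  // 3-bit saturating counter (0..7), start near the midpoint

public:
  // Assumed static leader-set mapping: one set in every 32 duels for each size.
  static bool is_4kb_leader(uint32_t set) { return set % 32 == 0; }
  static bool is_2mb_leader(uint32_t set) { return set % 32 == 1; }

  // Called when a demand access hits a block marked (per-block bit) as
  // prefetched: a hit in a leader set is evidence for that set's granularity.
  void on_prefetched_hit(uint32_t set) {
    if (is_4kb_leader(set) && counter_ < 7) ++counter_;
    else if (is_2mb_leader(set) && counter_ > 0) --counter_;
  }

  // Follower sets consult the counter's highest bit when both sizes are
  // available; we assume a set high bit means 4KB prefetching is winning.
  bool prefer_4kb() const { return (counter_ & 0b100) != 0; }
};
```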