Jenga: Software-Defined Cache Hierarchy



Paper Title: Jenga: Software-Defined Cache Hierarchy
Link: https://dl.acm.org/doi/10.1145/3140659.3080214
Year: ISCA 2017
Keyword: Cache; Jenga; Software-Defined Cache




Highlight:

  1. Allows dynamic reconfiguration of the distributed LLC by changing how addresses are mapped to slices. Conventional designs use a static hash function; this paper maps addresses through two software-configurable arrays.

  2. Adds distinct address classes and allows each class to be handled differently by the cache hierarchy.

Questions

  1. How does the OS know that another thread has accessed the page for reclassification? What kind of hardware is required to support this? Same question for detecting global access on process-local and thread-local pages.

  2. Why are private cache copies invalidated on page reclassification? As I understand it, reclassification only changes the storage location at the shared level, not at the private level.

  3. How does the OS detect the working set size of a thread? And how does the OS determine which cache size is the best fit? Obviously you cannot simply use the working set size directly, as it may overwhelm the cache with useless content.

  4. The OS/software part is not described clearly. Discussions of Jenga and Jigsaw are mixed together without distinguishing which feature belongs to which design. As a result, the software modeling part is complicated and difficult to understand. For example, the paper uses the term “most favorable banks”, but never gives a metric for evaluating banks. Latency might be a simple metric, but latency alone will not work, since you also need a target cache size, which is again never discussed.

This paper proposes Jenga, a software-defined, configurable cache hierarchy. The paper points out that the conventional, rigid cache hierarchy has two disadvantages. First, as working sets grow larger, the cache hierarchy can actually hinder performance if most accesses miss in the LLC. In this case, a request must first perform a lookup in the LLC, which turns out to be a miss, and then access main memory. The cycles spent on accessing the LLC in the common case of a miss are wasted, and could be avoided if the request bypassed the LLC. Second, excessive hierarchy levels consume energy without providing proportional benefits. Even if the latency of accessing these cache levels can be hidden or overlapped using various techniques, the energy consumption cannot be hidden, and only grows worse as cache sizes increase.

The paper also makes three observations on application performance and applications' preferred cache hierarchies. First, applications require a broad range of configurations to achieve optimal performance. These configurations vary not only in the size and type of a single level, but also in the number of levels in the hierarchy. Second, a single rigid hierarchy is not sufficient to achieve optimal performance for all applications. Some applications prefer a large, monolithic single-level hierarchy, while others prefer a two-level hierarchy with different LLC sizes. A rigid hierarchy would only benefit a few applications while hurting performance for the rest. The same holds for energy consumption. The last observation is that applications are quite sensitive to configuration changes, especially to the number of levels. The paper reports that an application can lose up to 20% of its optimal performance if the cache hierarchy is not ideally configured.

We next describe the baseline architecture of Jenga. Although Jenga does not depend on a particular topology, the paper, for ease of illustration, assumes that the system consists of a mesh of processors. Each node of the mesh consists of a processor core and its private L1 and L2 caches. Nodes communicate via an on-chip network; each node is equipped with a NoC router that supports routing packets between nodes in any direction. Each node also contains part of the distributed L3, which is implemented as a set of SRAM banks. In conventional hierarchies, these banks would act as L3 slices, serving requests for addresses that are mapped to the node by a static hash function. Cache coherence between private L1 and L2 cache line copies is handled using separate directory banks alongside the SRAM data banks. These directory banks track the sharing status of addresses stored in the data banks. The paper also notes that embedding the directory in the SRAM data banks is not an economical choice, given that most cache lines will not have any sharing status, since the L3 is much larger than the L1 and L2. In addition to the L3, the paper also assumes that die-stacked DRAM modules are available on-chip. These DRAM modules can be assigned the role of either L3 or L4, depending on application needs.
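For contrast with Jenga's configurable mapping, here is a minimal sketch of how a conventional design might statically hash a physical address to one of the distributed L3 slices. The hash function and the slice count are illustrative placeholders, not taken from the paper:

```c
#include <stdint.h>

#define NUM_L3_SLICES 36   /* e.g., a 6x6 mesh; illustrative value */

/* Conventional design: the address-to-slice mapping is a fixed hash baked
 * into hardware, so every core sees the same, unchangeable L3 layout. */
static inline uint32_t static_slice_of(uint64_t paddr)
{
    uint64_t line = paddr >> 6;               /* 64-byte cache line address */
    /* Simple XOR-fold over the line-address bits; real chips use their own
     * (often undocumented) hash, this is only a stand-in. */
    uint64_t h = line ^ (line >> 12) ^ (line >> 24);
    return (uint32_t)(h % NUM_L3_SLICES);
}
```

The point of Jenga is precisely that this function is no longer fixed: software can redefine, per thread, which banks an address may map to.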

Jenga allows system software to configure the cache hierarchy with a combination of hardware address mapping, performance monitoring counters, and OS modules that decide the optimal configuration. Jenga allows each thread to have its own hierarchy configuration, "virtualizing" the hardware by providing the illusion that different threads can use different address mappings to cache banks, although the cache itself remains invisible to applications. Jenga enables reconfiguration at the L3 level and an optional L4 level. The L3 can be built from both SRAM and DRAM banks, while the L4 can only be built from DRAM banks.

We first introduce the address mapping hardware. Jenga supports three caching policies for each thread: thread-local, process-local, and global. Each type reflects one possible usage scenario of the cached data, which may affect the allocation policy. For example, cache lines of thread-local type need no coherence at all, so coherence entries need not be maintained for them in the directory. The paper also suggests that thread-local and process-local cache lines should be cached in a bank as close to the executing core as possible in order to reduce access time. The placement of global data is less of a concern, since it will be accessed by threads running on different cores anyway. The page table entry has an attribute field storing the caching policy type, which is fetched by the MMU page table walker on a TLB miss. The TLB is also extended with a field to store the caching policy of each page. The caching policy is also included in the fetch request sent to the L3 cache for address computation.
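A minimal sketch of how the page-level caching policy might be carried through the page table walker and TLB. The field names, widths, and the enum encoding are my own assumptions, not the paper's:

```c
#include <stdint.h>

/* Per-page caching policy, as described in the paper. */
typedef enum {
    POLICY_THREAD_LOCAL  = 0,  /* no coherence tracking needed */
    POLICY_PROCESS_LOCAL = 1,  /* shared by threads of one process */
    POLICY_GLOBAL        = 2,  /* shared across processes */
} cache_policy_t;

/* Hypothetical TLB entry extended with a small policy field that the
 * page table walker fills in on a TLB miss. */
typedef struct {
    uint64_t vpn;          /* virtual page number */
    uint64_t pfn;          /* physical frame number */
    uint8_t  perms;        /* r/w/x permission bits */
    uint8_t  policy;       /* cache_policy_t; forwarded with L3 requests */
} tlb_entry_t;

/* On an L2 miss, the policy selects which virtual hierarchy (VHT entry)
 * the request uses. */
static inline uint8_t policy_of_access(const tlb_entry_t *e)
{
    return e->policy;
}
```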

To support per-thread virtual caching policies, Jenga also extends the processor context with a Virtual Hierarchy Table (VHT), which stores the address mappings for requests going into the L3 cache. The VHT resides in the L2 cache controller and is swapped in and out on context switches. Each VHT consists of three entries, one for each of the three caching policies. Each entry consists of two Virtual Hierarchy Descriptors (VHDs), where one is used for normal operation and both are used during a configuration change, as we will see below. Each VHD itself consists of two arrays of N items each. The first array contains the bank IDs that constitute the L3 cache, and the second array contains the bank IDs for the L4 cache. Note that a bank ID can appear multiple times, since the array is not used as a set but as a hash table, so a bank that occupies more slots receives a larger share of the address space. Each VHD also includes a configurable hash function. A requested address is hashed to one element of each array, which together define the access path of the request.
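A rough sketch of the VHT layout as I understand it from the description above. The number of buckets N and the hash parameterization are placeholders; the paper does not give an exact encoding:

```c
#include <stdint.h>

#define VHD_SLOTS    128        /* N: number of hash buckets; assumed value */
#define INVALID_BANK 0xFFFF     /* marks "no L4, go straight to memory" */

/* One Virtual Hierarchy Descriptor: two bucket arrays (L3 and optional L4)
 * plus a configurable hash. A bank ID may appear in many buckets, which is
 * how software gives a bank a larger share of the address space. */
typedef struct {
    uint16_t l3_bank[VHD_SLOTS];   /* bank ID serving each bucket at L3 */
    uint16_t l4_bank[VHD_SLOTS];   /* bank ID at L4, or INVALID_BANK */
    uint64_t hash_seed;            /* parameter of the configurable hash */
} vhd_t;

/* One VHT entry per caching policy; two VHDs so that a "shadow"
 * configuration can be staged and swapped in during reconfiguration. */
typedef struct {
    vhd_t   vhd[2];
    uint8_t active;                /* which VHD is currently the normal one */
    uint8_t migrating;             /* set while banks are being copied */
} vht_entry_t;

/* Whole table: one entry each for thread-local, process-local, global. */
typedef struct {
    vht_entry_t entry[3];          /* indexed by cache_policy_t */
} vht_t;
```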

The L2 cache controller is responsible for computing the hash and the access path on a cache miss, after which the information is stored in the request message and dispatched to the next-level L3 bank indicated by the VHD. On receiving the request, the L3 (and potentially L4) controller performs a lookup in its local bank and forwards the request to the next level on a miss. For the L3, if no L4 bank is specified in the request, the request is forwarded directly to main memory, bypassing the L4 level.
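A sketch of the lookup-path computation on an L2 miss, using the VHT/VHD structures assumed above. The hash is again only a placeholder:

```c
/* Hypothetical request descriptor built by the L2 controller and carried
 * with the miss so that downstream banks know where to forward. */
typedef struct {
    uint64_t line_addr;
    uint16_t l3_bank;    /* first stop */
    uint16_t l4_bank;    /* second stop, or INVALID_BANK to bypass L4 */
} l3_request_t;

static inline uint32_t vhd_bucket(const vhd_t *d, uint64_t line_addr)
{
    /* Placeholder for the configurable hash function. */
    uint64_t h = (line_addr ^ d->hash_seed) * 0x9E3779B97F4A7C15ull;
    return (uint32_t)(h % VHD_SLOTS);
}

/* On an L2 miss: pick the VHD for the page's policy, hash the address,
 * and record the L3/L4 banks the request will visit. */
static l3_request_t build_request(const vht_t *vht, uint8_t policy,
                                  uint64_t line_addr)
{
    const vht_entry_t *e = &vht->entry[policy];
    const vhd_t *d = &e->vhd[e->active];
    uint32_t b = vhd_bucket(d, line_addr);

    l3_request_t req = {
        .line_addr = line_addr,
        .l3_bank   = d->l3_bank[b],
        .l4_bank   = d->l4_bank[b],  /* INVALID_BANK => straight to memory */
    };
    return req;
}
```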

To change the configuration, system software first writes the updated configuration into the "shadow" VHD array, and then notifies the hardware that migration should be performed. The hardware, after receiving the notification, starts copying cache lines from the source SRAM bank to the destination bank. During migration, both banks are considered valid for serving read requests. Write requests, however, must block until migration completes. To avoid blocking writers for too long, the paper also suggests that when a write hits a line that has not yet been migrated, that line is given priority, so the write can be performed as soon as the line has been migrated. After hardware migration completes, the shadow array becomes the new normal array, and the previous normal array becomes the shadow array.
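A simplified sketch of the reconfiguration handshake from the software side, following the structures assumed earlier. The notification and completion-polling hooks are hypothetical stand-ins for whatever MMIO or ISA mechanism the hardware actually exposes:

```c
/* Hypothetical hooks into the migration hardware. */
extern void jenga_notify_migrate(vht_entry_t *e);
extern int  jenga_migration_done(const vht_entry_t *e);

/* Software-side view of a reconfiguration of one VHT entry. */
void reconfigure(vht_entry_t *e, const vhd_t *new_config)
{
    vhd_t *shadow = &e->vhd[1 - e->active];

    /* 1. Stage the new mapping in the shadow VHD. */
    *shadow = *new_config;

    /* 2. Tell the hardware to start copying lines from old banks to new
     *    ones. Reads may be served from either mapping while this runs;
     *    writes to not-yet-migrated lines stall until their line moves. */
    e->migrating = 1;
    jenga_notify_migrate(e);

    /* 3. Wait for hardware migration to finish, then flip the roles:
     *    the shadow VHD becomes the normal one. */
    while (!jenga_migration_done(e))
        ;
    e->active = 1 - e->active;
    e->migrating = 0;
}
```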

Pages can also be reclassified. The paper suggests that all pages start as thread-local. When another thread of the same process accesses the page, it is promoted to process-local. When a process-local or thread-local page is accessed by another process, it is promoted to global. On each reclassification, the page table entry is updated, followed by a TLB shootdown. In addition, the OS evicts all cache line copies in the shared level for pages whose caching policy has changed, since a policy change may require cache lines to move from one bank to another. Note that, unlike cache line migration, page reclassification is not supported in hardware, since the paper argues that reclassifications are infrequent events.
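A sketch of the OS-side promotion path described above. The page descriptor and the page-table, shootdown, and eviction helpers are hypothetical names standing in for the kernel's own primitives:

```c
#include <stdint.h>

/* Minimal page descriptor for this sketch; a real kernel has its own. */
struct page {
    uint64_t       pfn;
    cache_policy_t policy;   /* from the enum sketched earlier */
};

/* Hypothetical kernel helpers. */
extern void update_pte_policy(struct page *pg, cache_policy_t p);
extern void tlb_shootdown(struct page *pg);
extern void flush_shared_level(struct page *pg);

/* Promote a page's caching policy when a "wider" sharer touches it.
 * Hardware does not do this; the OS detects the access and performs
 * the steps below. */
void reclassify_page(struct page *pg, cache_policy_t new_policy)
{
    if (new_policy <= pg->policy)
        return;                          /* only promote, never demote */

    pg->policy = new_policy;
    update_pte_policy(pg, new_policy);   /* rewrite the PTE attribute field */
    tlb_shootdown(pg);                   /* invalidate stale TLB entries */

    /* Evict the page's lines from the shared (L3/L4) level: the new policy
     * maps them through a different VHD, possibly onto different banks. */
    flush_shared_level(pg);
}
```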

Each node in the mesh is equipped with performance monitoring counters that measure the access latency, access frequency, and other statistics of its local SRAM bank. The system software collects data from these counters to decide whether reconfiguration is necessary, and how to reconfigure. The software part has three important components, which we discuss below. The first component reads the performance counters and computes the access latency curve for all SRAM and DRAM banks. The curve reflects the run-time access latency of each bank under the current configuration. Banks are allocated to cores by greedily giving the most favorable bank to a core in a round-robin manner, until all banks have been allocated. The second component then determines the optimal topology of the virtual cache. It evaluates the expected latency of a few candidate configurations using the latency curves computed in the previous step. To reduce the computational complexity, the evaluation does not cover all possible cases, since small changes from the current configuration are unlikely to make a difference; the paper therefore proposes that the evaluation points be spaced exponentially far from the current configuration. Both one-level and two-level hierarchies are evaluated, and the best of them is selected. The last component evaluates bandwidth effects using a queuing model. It attempts to migrate a bank incrementally to another, and evaluates the bandwidth-induced latency before and after the migration. If latency decreases, the migration is actually performed. Bandwidth-induced latency is modeled from the access frequency with an M/D/1 queuing model, where SRAM is considered to have infinite bandwidth (i.e., its latency is not affected by bandwidth), and DRAM banks are assumed to deliver half of their peak bandwidth, due to scheduling, refresh, power saving, etc. This component avoids creating access hotspots at large-capacity DRAM banks.
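A small sketch of the bandwidth component, modeling bandwidth-induced queuing delay at a bank with an M/D/1 queue. The function name and the way "half of peak bandwidth" is folded into the service time are my own assumptions; SRAM banks are treated as bandwidth-unlimited, as the paper suggests:

```c
#include <math.h>

/* Expected latency at a bank under an M/D/1 model.
 * lambda : measured access rate to the bank (requests/cycle).
 * service: deterministic service time per request at peak bandwidth (cycles).
 * Returns service time plus queuing delay, or INFINITY if saturated. */
double bank_latency(double lambda, double service, int is_dram)
{
    if (!is_dram)
        return service;                        /* SRAM: no bandwidth penalty */

    /* DRAM is assumed to deliver half of peak bandwidth (refresh,
     * scheduling, power saving), i.e., twice the service time. */
    double eff_service = 2.0 * service;
    double rho = lambda * eff_service;         /* utilization */
    if (rho >= 1.0)
        return INFINITY;                       /* saturated bank */

    /* M/D/1 mean waiting time: Wq = rho * S / (2 * (1 - rho)). */
    double wq = rho * eff_service / (2.0 * (1.0 - rho));
    return eff_service + wq;
}
```

The bandwidth-aware step would compare this modeled latency before and after tentatively shifting part of a virtual cache off a heavily loaded DRAM bank, and keep the move only if the modeled latency drops.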