Project PBerry: FPGA Acceleration for Remote Memory



Paper Title: Project PBerry: FPGA Acceleration for Remote Memory
Link: https://dl.acm.org/citation.cfm?id=3321424
Year: HotOS 2019
Keyword: FPGA; Remote Memory




This paper presents Project PBerry, an FPGA-accelerated scheme for implementing efficient remote memory. The paper identifies two major problems with current implementations of distributed remote memory, which are based on software handling of page faults and demand paging. The first problem is write amplification. When a remote page is evicted from the current host's main memory, it must be transferred back to the remote machine at page granularity. This is because current hardware does not support cache-line-level dirty data tracking; page-level tracking is implemented in the MMU, which sets the "dirty" bit in the page table entry when a store instruction first accesses the page. If only a few cache lines in the page are modified before it is evicted, transferring the entire page to the remote host wastes bandwidth. The paper suggests that for certain workloads, the majority of pages have fewer than 7 dirty cache lines. The second problem is the software overhead of handling remote pages. Two page faults are required to handle a remote page correctly: one to trigger the transfer of the remote page (after which the page is mapped write-protected), and another to release the write protection after the page is added to the write set. The second page fault is unavoidable in a software-only design, since user-level remote memory libraries have no permission to access the page table directly.
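For concreteness, here is a minimal user-space sketch of the software-only baseline the paper critiques, built on mprotect and a SIGSEGV handler. fetch_remote_page() and add_to_write_set() are hypothetical stand-ins for the remote memory library's network calls; the point is simply that one remote store costs two traps.

```c
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096UL
#define NPAGES    16

static char *region;            /* simulated remote-backed region */
static int   resident[NPAGES];  /* which pages have been fetched */

/* Hypothetical stand-ins for the remote-memory library's network calls. */
static void fetch_remote_page(void *page) { memset(page, 0, PAGE_SIZE); }
static void add_to_write_set(void *page)  { (void)page; /* record dirty page */ }

static void handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    char  *page = (char *)((uintptr_t)info->si_addr & ~(PAGE_SIZE - 1));
    size_t idx  = (size_t)(page - region) / PAGE_SIZE;

    if (!resident[idx]) {
        /* Fault 1: demand-fetch the whole 4 KB page, then leave it
           read-only so that the first store traps again. */
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
        fetch_remote_page(page);
        mprotect(page, PAGE_SIZE, PROT_READ);
        resident[idx] = 1;
    } else {
        /* Fault 2: the first store hit the read-only page; add it to the
           write set and lift protection. This second trap is pure overhead. */
        add_to_write_set(page);
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
    }
}

int main(void) {
    region = mmap(NULL, NPAGES * PAGE_SIZE, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {0};
    sa.sa_sigaction = handler;
    sa.sa_flags     = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    region[0] = 42;   /* one access, two faults: fetch, then write-protect */
    return 0;
}
```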

Project PBerry assumes a cache-coherent FPGA with a hardware component that can be configured either as a memory controller handling requests to and from the upper levels of the hierarchy, or as a cache controller that maintains cache line states collectively with the other bus agents. The FPGA is connected to the system bus via one of the existing coherent bus protocol implementations. The paper proposes two schemes for connecting the FPGA module (PBF) to the system bus. In the first scheme, the PBF acts as a memory controller, observing cache line evictions from the upper-level caches. At system startup, part of the physical address space is mapped to the FPGA device, such that upper-level controllers send cache lines in this address range to the FPGA rather than the regular memory controller (or they rely on snooping; I am not an expert on this, so I assume upper-level cache controllers either "know" which address maps to which destination, or they put the address on a broadcast bus and wait for a lower-level device to take responsibility). The OS is also configured such that remote pages are mapped to this physical address range. When an evicted line arrives, the FPGA stores the data of the evicted cache line in its internal memory and tracks metadata (e.g. tags, sizes) in a buffer accessible to the OS (note that most coherence protocols only write back dirty cache lines, and simply discard clean lines on eviction). Read requests that miss the upper-level caches are also forwarded to the FPGA device; we postpone the handling of read requests to later sections.
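The paper does not specify the layout of the OS-visible metadata buffer, so the following is only a plausible sketch: a shared ring of dirty-line records, with the FPGA as producer and the OS daemon as consumer. All names and sizes are illustrative.

```c
#include <stdint.h>

#define LINE_SHIFT  6        /* 64-byte cache lines */
#define BUF_ENTRIES 4096

struct dirty_line_record {
    uint64_t line_addr;      /* physical address >> LINE_SHIFT */
};

struct pbf_buffer {
    volatile uint32_t head;  /* advanced by the FPGA (producer) */
    volatile uint32_t tail;  /* advanced by the OS daemon (consumer) */
    struct dirty_line_record rec[BUF_ENTRIES];
};

/* FPGA side, conceptually: called when a dirty write back arrives. */
void pbf_log_eviction(struct pbf_buffer *b, uint64_t paddr) {
    uint32_t h = b->head;
    if ((uint32_t)(h - b->tail) == BUF_ENTRIES)
        return;              /* buffer full; could fall back to page-level dirty */
    b->rec[h % BUF_ENTRIES].line_addr = paddr >> LINE_SHIFT;
    b->head = h + 1;
}

/* OS daemon side: drain one record to build per-page write sets. */
int pbf_poll(struct pbf_buffer *b, uint64_t *out_paddr) {
    if (b->tail == b->head)
        return 0;
    *out_paddr = b->rec[b->tail % BUF_ENTRIES].line_addr << LINE_SHIFT;
    b->tail = b->tail + 1;
    return 1;
}
```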

One potential problem with this scheme is that the memory controller on the FPGA only has partial information about dirty lines, because dirty line tracking is triggered only by cache write backs. The processor caches, in the meantime, may still hold dirty lines that have not been written back yet. To gather complete information, the FPGA device needs to inject read-shared requests into the upper-level caches in order to force write backs (most protocols perform a dirty write back on the M-to-S state transition). Another problem is that only a small subset of the physical address space can be mapped to the FPGA, since the FPGA has limited on-board memory. Addresses mapped to local DRAM cannot be tracked, because their write backs will not be sent to the FPGA.
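The forced write back could look like the following sweep, run before the OS daemon snapshots the write set. issue_read_shared() is a hypothetical primitive on the FPGA's coherence port.

```c
#include <stdint.h>

#define LINE_SIZE 64

/* Hypothetical coherence-port primitive: a read-shared request forces any
   upper-level cache holding the line in M state to write it back (M -> S). */
static void issue_read_shared(uint64_t paddr) { (void)paddr; }

/* Sweep the tracked range so dirty lines still sitting in processor caches
   are flushed and land in the dirty-line buffer. */
void pbf_scrub(uint64_t base, uint64_t len) {
    for (uint64_t a = base; a < base + len; a += LINE_SIZE)
        issue_read_shared(a);
}
```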

The second scheme configures the FPGA device as a cache controller operating at the same level as the LLC. The FPGA in this case does not store data; instead, it stores only address tags and coherence states, which allows a higher tracking capacity for the same amount of on-board memory. Tracking can begin without any prior configuration of the MMU or the OS: the FPGA device starts tracking by issuing read-shared requests to all other LLC controllers. After obtaining read permission from the directory, it adds the tags (and only the tags) into its internal memory, as if it were a large cache. When any of the tracked cache lines is modified by another processor, an invalidation request is forwarded to the FPGA, since all copies of the line are in shared state and must be invalidated before the write. The FPGA device then adds the address tag to the dirty data buffer. In this scheme, each line's writes can only be tracked once, since the tag is invalidated by the first write; the FPGA must issue read-shared requests again to start the next round of tracking.
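A data-less tag store for this scheme might look as follows. issue_read_shared() and append_dirty() are again hypothetical hooks, and the direct-mapped indexing is purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define NTAGS (1u << 20)             /* tracked lines: tags only, no data */

static uint64_t tags[NTAGS];
static bool     valid[NTAGS];

/* Hypothetical hooks into the coherence port and the OS-visible buffer. */
static void issue_read_shared(uint64_t line) { (void)line; }
static void append_dirty(uint64_t line)      { (void)line; }

static inline uint32_t slot(uint64_t line) { return (uint32_t)(line % NTAGS); }

/* Round start: obtain the line in S state and remember its tag, as if the
   FPGA were a huge cache that discards the data. */
void start_tracking(uint64_t line) {
    issue_read_shared(line);
    tags[slot(line)]  = line;
    valid[slot(line)] = true;
}

/* Coherence hook: a core wants to write, so our shared copy is invalidated.
   This first write is the only one observed until the tag is re-armed. */
void on_invalidation(uint64_t line) {
    uint32_t s = slot(line);
    if (valid[s] && tags[s] == line) {
        append_dirty(line);
        valid[s] = false;            /* call start_tracking() for the next round */
    }
}
```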

The second scheme tracks the entire physical address space without special MMU or OS support. It is, however, infeasible for now, since it requires reconfiguring the internal cache structure of the FPGA to act as a "virtual" cache that stores no data. The paper subtly suggests that this may involve more work than the first scheme, and does not discuss it further in later sections. In the following discussion, we also focus only on the first scheme, since it integrates better with remote memory access.

As mentioned earlier, when a cache line is written back to the FPGA from the upper-level caches, the line address is added to a buffer shared between the FPGA and the OS daemon. When multiple GBs of memory are tracked this way, the storage requirement for address tags inevitably grows, and may eventually exceed the capacity of on-board memory. In addition, since we assume that most remote pages are accessed only sparsely, tracking "dense" pages (in terms of their dirty line distribution) should not be the top priority of this scheme.

The paper proposes using two bloom filters to identify dense pages. A cache line bloom filter tracks the approximate set of cache lines that have already been added to the buffer. A page-level counting bloom filter tracks the number of cache lines in each page that have already been tracked. When a new cache line is to be added, we first test it against the cache line bloom filter (which may report false positives). If the test is negative, we set the bloom filter bits, append the tag to the buffer, and increment the counter in the corresponding entry of the counting bloom filter. If the counter value exceeds a threshold K, meaning the page has more than K dirty cache lines, the entire page is marked as dirty without any further fine-grained tracking, in order to save space. If the cache line bloom filter test is positive, we still append the tag to the buffer, since the test result might be a false positive; no other action is needed, because this is most likely just the same dirty line being repeatedly written and evicted.
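A minimal sketch of this two-filter bookkeeping, assuming 64-byte lines, 4 KB pages, a single hash function per filter (a real design would use several), and illustrative sizes and threshold:

```c
#include <stdbool.h>
#include <stdint.h>

#define K_DENSE   7                    /* dirty-line threshold per page */
#define LINE_BITS (1u << 20)           /* cache line bloom filter, in bits */
#define PAGE_CTRS (1u << 16)           /* counting bloom filter entries */

static uint8_t line_bf[LINE_BITS / 8]; /* 1 bit per hashed cache line */
static uint8_t page_cbf[PAGE_CTRS];    /* small saturating counters */

/* Hypothetical downstream actions. */
static void append_to_buffer(uint64_t line_addr) { (void)line_addr; }
static void mark_page_dense(uint64_t page_addr)  { (void)page_addr; }

/* Single multiplicative hash, for brevity. */
static uint32_t hash64(uint64_t x, uint32_t mod) {
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
    return (uint32_t)(x % mod);
}

void track_eviction(uint64_t line_addr) {
    uint64_t page_addr = line_addr >> 12;           /* 4 KB pages */
    uint32_t bit  = hash64(line_addr, LINE_BITS);
    bool     seen = line_bf[bit / 8] & (1u << (bit % 8));

    /* Always append: a positive test may be a false positive, and if it is
       a true positive, it is just the same line evicted again. */
    append_to_buffer(line_addr);
    if (seen)
        return;                        /* do not double-count this line */

    line_bf[bit / 8] |= (uint8_t)(1u << (bit % 8));
    uint32_t c = hash64(page_addr, PAGE_CTRS);
    if (page_cbf[c] < 255 && ++page_cbf[c] > K_DENSE)
        mark_page_dense(page_addr);    /* fall back to page-granular dirty */
}
```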

The paper also proposes eliminating these page faults entirely by allowing the FPGA itself to request pages from remote hosts. This requires that the OS map remote virtual pages to the physical address range handled by the FPGA, instead of marking them not-present as in the software-only scheme. This way, memory instructions on these pages do not incur costly page faults. Instead, the access misses the entire cache hierarchy and is forwarded to the FPGA. The FPGA sends a data fetch packet to the remote host, requesting only the cache line being accessed by the processor. As the paper points out, this scheme is similar to the critical-word-first technique for handling cache misses (i.e. the requested word is forwarded to the register via a bypass path while the rest of the line is loaded in the background). The FPGA can then migrate the remote page in the background without blocking the loading instruction, which removes the page transfer from the critical path.
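The fault-free read path could be sketched as below. remote_fetch(), reply_to_cache_miss(), and schedule_background() are hypothetical PBF primitives, not APIs from the paper.

```c
#include <stdint.h>

#define LINE_SIZE 64
#define PAGE_SIZE 4096

/* Hypothetical PBF primitives: a network fetch, a reply on the coherence
   port, and a deferred-work queue. */
static void remote_fetch(uint64_t raddr, void *dst, uint32_t len)
    { (void)raddr; (void)dst; (void)len; }
static void reply_to_cache_miss(uint64_t paddr, const void *line)
    { (void)paddr; (void)line; }
static void schedule_background(void (*fn)(uint64_t), uint64_t arg) { fn(arg); }

static uint8_t page_buf[PAGE_SIZE];

/* Background task: migrate the rest of the page into local memory. */
static void migrate_page(uint64_t page_addr) {
    remote_fetch(page_addr, page_buf, PAGE_SIZE);   /* simplified */
}

/* Entry point: a load missed every cache level and reached the PBF. */
void pbf_handle_miss(uint64_t paddr) {
    uint8_t line[LINE_SIZE];

    /* Critical-word-first at page scale: fetch only the 64 B line the
       processor is stalled on and answer the miss immediately. */
    remote_fetch(paddr, line, LINE_SIZE);
    reply_to_cache_miss(paddr, line);

    /* The remaining lines of the page arrive off the critical path. */
    schedule_background(migrate_page, paddr & ~(uint64_t)(PAGE_SIZE - 1));
}
```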

To further overlap normal execution with background page migration, the paper also suggests that the OS move its page table onto the FPGA. When an access is made to a remote virtual page, the MMU must initiate a page walk to obtain the VA-to-PA mapping. The FPGA snoops on the page walker's memory traffic and derives the physical address, so it can start transferring the page in the background even before the cache miss request is issued.
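A sketch of that page walk snoop, assuming an x86-64-style PTE layout; the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_FRAME_MASK 0x000ffffffffff000ULL  /* bits 12..51, x86-64-style PTE */

/* Hypothetical helpers: a range check against the PBF-mapped physical
   window, and the migration kick-off from the previous sketch. */
static bool paddr_is_pbf_mapped(uint64_t paddr)     { (void)paddr; return true; }
static void start_background_migration(uint64_t pa) { (void)pa; }

/* Called for every leaf PTE the hardware page walker loads. If the frame
   lies in the PBF window, the target page is remote, so migration can begin
   before the data-side cache miss even reaches the FPGA. */
void pbf_snoop_pte(uint64_t pte) {
    uint64_t frame = pte & PTE_FRAME_MASK;
    if (paddr_is_pbf_mapped(frame))
        start_background_migration(frame);
}
```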