Buri: Scaling Big-Memory Computing with Hardware-Based Memory Expansion



Paper Title: Buri: Scaling Big-Memory Computing with Hardware-Based Memory Expansion
Link: https://dl.acm.org/doi/10.1145/2808233
Year: TACO 2015
Keyword: Compression; Memory Compression; Buri




Highlights:

  1. This is the first paper/article I have read suggesting that a new page should not be compressed on allocation; instead, a buffer should be used to accumulate at least some cache lines before the page is actually allocated from storage

Questions:

  1. Does the OS assume a fixed-size shadow address space, or can the shadow address space change dynamically? In the latter case, the mapping table size should also be adjusted dynamically, since that table is indexed by shadow space page ID

  2. I don’t get why uncompressed lines are stored in the overflow area. Maybe it is to reduce power consumption? But storing compressed lines, even those that do not reach 4:1 compression, could still help reduce storage.

This journal article proposes Buri, a hardware main memory compression scheme designed for big-data workloads. The paper observes that, given the high memory demand of big-memory workloads, simply scaling up the number of cores and memory modules is no longer a feasible or financially viable way to extend the amount of memory in a system. Hardware memory compression, on the other hand, reduces the amount of physical storage required by these workloads without installing more components.

The paper then identifies a few difficulties in designing a hardware memory compression scheme. First, compression adds one extra level of indirection to the address translation path, namely from the uncompressed address space to the compressed memory space. To accommodate the extra indirection level, there are two design options. The first is to perform both translations via the MMU, and let the OS set up an explicit mapping from the uncompressed to the compressed address space. This option introduces considerable changes throughout the hardware and software stack, which is incompatible with existing hardware abstractions and may cause some of them to fail (e.g., DMA, huge pages). The second option is to explicitly maintain the three-level translation scheme, and isolate the translation from the uncompressed to the compressed address space within memory components such as the memory controller or the DRAM module. This way, existing components work correctly without any modification, which makes commercial adoption much easier.

The second difficulty is the more complicated memory management of compressed blocks. In the uncompressed address space, physical memory is managed by the OS in units of fixed-size pages, while in the compressed address space, pages are no longer uniformly sized after compression. In order to conserve memory, the allocation strategy should dynamically adjust the number of physical blocks assigned to a logical page according to the runtime compression ratio. The memory allocator should therefore be able to allocate free memory either incrementally or in variably sized blocks.

The third difficulty is the modification of existing hardware components, such as the TLB and cache. As stated in the above discussion, if the design combines VA and uncompressed PA, both cache tagging and the TLB need to change. Cache tags can no longer use the uncompressed PA, since the address generation unit outputs PAs in the compressed space. Meanwhile, the TLB must translate directly from VA to compressed PA, and is burdened with the extra responsibility of maintaining translation information for compression. This discussion, in fact, further confirms that explicitly separating uncompressed and compressed PAs helps isolate hardware changes and reduces both hardware and software migration costs.

Buri addresses the above challenges with the following high-level design ideas. First, Buri adopts the three-level address mapping scheme, explicitly acknowledging an intermediate, uncompressed PA between the VA and the compressed PA, called the “shadow address space”. Upper-level components, such as the OS, cache, and TLB, always use the shadow address space for space management and address translation, just as in a conventional system without compression. Translation is performed by the memory controller using a dedicated mapping table in an unmapped part of physical memory, the contents of which are also entirely managed by hardware. Second, Buri allocates memory incrementally in fixed-size chunks. Only one size class is maintained, which avoids the complexity of splitting and merging size classes to fulfill allocation requests. The physical storage of a 4KB compressed page is then represented by four pointers to non-contiguous blocks. Note that the page size of storage allocation is orthogonal to the translation page size, which is defined by the page table structure; the memory controller always compresses shadow address space data at the granularity of 4KB pages, regardless of the OS’s paging policy. Lastly, Buri isolates hardware changes to the memory controller and the extra components added to it. All upper-level components use the abstraction of the shadow address space without any special customization, except that the OS should support memory module hot-plug to handle dynamic compression ratio changes.
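
To make the four-pointer representation concrete, here is a rough C sketch; `page_desc_t` and `body_addr` are my own hypothetical names, and the paper's real metadata entry (described below) carries more fields:

```c
#include <stdint.h>

/* Hypothetical per-page descriptor: a 4KB shadow page is backed by up to
 * four non-contiguous 1KB blocks in compressed memory, allocated
 * incrementally as the page's compressed footprint grows. */
typedef struct {
    uint32_t block_ptr[4];   /* 1KB-block numbers in compressed memory */
    uint8_t  block_count;    /* how many of the four are allocated     */
} page_desc_t;

/* Compressed physical byte address of byte offset `off` within the page
 * body, assuming the body is laid out linearly across the allocated
 * blocks (block 0 holds bytes 0-1023, block 1 holds 1024-2047, ...). */
static inline uint64_t body_addr(const page_desc_t *d, uint32_t off)
{
    return ((uint64_t)d->block_ptr[off >> 10] << 10) | (off & 0x3FF);
}
```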

We next describe Buri’s data and metadata layout. At a high level, every aligned 4KB page is compressed with a variant of LCP, and stored in units of 1KB blocks. In LCP, the compression ratio of a 4KB page is pre-determined before any cache line is written. Compressed lines are then mapped into the page linearly, with each line’s offset calculated from the compression ratio. If a compressed line cannot fit into its regular slot, the line is appended at the end of the current page, with an extra level of indirection indicating the offset of the overflowed line. In the original LCP design, each page also contains a metadata header for storing pointers to these overflowed lines; a cache line lookup must first check this header to see whether the line resides in its conventional slot or in the overflow area, necessitating an extra DRAM read. In this paper, the indirection pointers are instead stored in the metadata entry, which fits into a 64-byte DRAM read unit, further optimizing performance. The paper suggests that a static compression ratio of 4:1 be used globally. Cache lines that are not compressible to one fourth of their original size are always stored in the overflow area. This design decision not only strikes a balance between design complexity and compression effectiveness, but also eliminates recompression, during which cache lines would be copied around. In future designs, this compression ratio might also be adjusted dynamically based on runtime profiling results.
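
Under the static 4:1 layout, locating a line’s regular slot is pure arithmetic. A minimal sketch, assuming 64-byte cache lines; the function names are mine, not the paper’s:

```c
#include <stdint.h>

#define LINE_SIZE       64
#define TARGET_RATIO    4                          /* static 4:1 compression */
#define SLOT_SIZE       (LINE_SIZE / TARGET_RATIO) /* 16B regular slot       */
#define LINES_PER_PAGE  (4096 / LINE_SIZE)         /* 64 lines per 4KB page  */

/* Byte offset of line i's regular slot within the compressed page body.
 * With 64 lines x 16B slots, the regular slots fill exactly one 1KB block. */
static inline uint32_t regular_slot_offset(unsigned i)
{
    return i * SLOT_SIZE;
}

/* Lines that do not compress to <= SLOT_SIZE are marked as overflowed and
 * stored in the overflow area appended after the regular slots; their
 * locations come from indirection pointers kept in the metadata entry. */
static inline uint32_t overflow_area_start(void)
{
    return LINES_PER_PAGE * SLOT_SIZE;             /* = 1024 bytes */
}
```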

Each logical 4KB page in the shadow address space can be backed by at most four 1KB blocks. These four block pointers are stored in the metadata area, together with the per-cache-line status of the page, such that the offset of a compressed line can be calculated with at most one DRAM read of the metadata entry. Physical storage is always allocated at 1KB block granularity. Buri divides usable memory into 16MB chunks, within which 1KB blocks form free lists for fast allocation. Chunk headers, stored in the metadata area before the mapping table, form a higher-level free list, each containing a pointer to the head of the block free list within its chunk. This two-level scheme trades off chunk header metadata overhead against free block search cost. On block allocation, if the allocation is the initial block of a page, the memory controller scans the chunk free list and selects a free block. On an incremental allocation, i.e., if a page already has some blocks and more are requested, the memory controller first attempts to allocate from the same chunk to improve locality; if this is not possible, blocks from other chunks are used instead. Block release is simply the reverse of block allocation: the block is returned to its chunk’s block free list if the chunk already has free blocks; otherwise, the chunk header is updated in the metadata area to point to the released block. The next-block pointer of the free list is stored in-place at the beginning of each free block.
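
A rough sketch of how the two-level free list could work; all structure and function names here are hypothetical, and a real controller would keep these pointers in the metadata area and inside the free blocks themselves rather than in C structs:

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE  (1u << 10)     /* 1KB allocation granularity           */
#define CHUNK_SIZE  (16u << 20)    /* 16MB chunks, each with its own list  */
#define NULL_BLOCK  0xFFFFFFFFu

/* Hypothetical chunk header (kept in the metadata area): the head of the
 * in-chunk block free list. The next-pointer of each free block is stored
 * in-place at the start of the free block itself. */
typedef struct chunk {
    uint32_t free_head;            /* block number of first free 1KB block */
    struct chunk *next_chunk;      /* chunk-level free list                */
} chunk_t;

extern uint8_t *dram_base;         /* hypothetical: base of compressed memory */

static uint32_t *block_next_ptr(uint32_t blk)
{
    return (uint32_t *)(dram_base + (uint64_t)blk * BLOCK_SIZE);
}

/* Pop one block from a chunk's free list; NULL_BLOCK if the chunk is full. */
static uint32_t alloc_block_from(chunk_t *c)
{
    uint32_t blk = c->free_head;
    if (blk != NULL_BLOCK)
        c->free_head = *block_next_ptr(blk);   /* next pointer read in-place */
    return blk;
}

/* Incremental allocation prefers the chunk already backing the page (for
 * locality), then falls back to scanning the chunk-level free list. */
uint32_t alloc_block(chunk_t *preferred, chunk_t *chunk_free_list)
{
    uint32_t blk;
    if (preferred && (blk = alloc_block_from(preferred)) != NULL_BLOCK)
        return blk;
    for (chunk_t *c = chunk_free_list; c != NULL; c = c->next_chunk)
        if ((blk = alloc_block_from(c)) != NULL_BLOCK)
            return blk;
    return NULL_BLOCK;             /* compressed memory exhausted */
}
```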

As is the case for many previously proposed compression schemes, Buri uses a direct-mapped translation table at the beginning of physical DRAM to perform shadow-to-physical translation at the granularity of 4KB logical pages. Each 4KB logical page has an entry in the mapping table, which contains the allocation status, four block pointers, compression metadata, and per-cache-line status. As discussed above, the allocation status stores the current size of the compressed page, plus some control bits (dirty, valid, etc.). The four block pointers constitute the body of the compressed page, and are allocated incrementally. Per-line status is a two-bit field indicating the type of each line (normal, zero, overflow, invalid). The physical address of a compressed line can thus be calculated from this entry alone, without consulting the page body itself. Buri also features a metadata cache and a pointer cache for overflowed lines to reduce DRAM accesses; most translation requests should be filtered by these two caches.
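
A speculative C rendering of one mapping-table entry; the field widths are my guesses, and the point is only that a line lookup needs nothing beyond this 64-byte entry:

```c
#include <stdint.h>

#define LINES_PER_PAGE 64

enum line_status { LINE_NORMAL = 0, LINE_ZERO = 1, LINE_OVERFLOW = 2, LINE_INVALID = 3 };

/* Hypothetical layout of one mapping-table entry (one per 4KB shadow page).
 * Everything needed to locate a compressed line fits in one 64B DRAM read. */
typedef struct {
    uint32_t block_ptr[4];                     /* incrementally allocated 1KB blocks */
    unsigned block_count : 3;                  /* current compressed size in blocks  */
    unsigned valid       : 1;
    unsigned dirty       : 1;
    uint8_t  line_status[LINES_PER_PAGE / 4];  /* 2 bits per line, packed            */
    /* ... compression metadata and overflow indirection pointers ... */
} map_entry_t;

/* Extract the 2-bit status of line i from the packed array. */
static inline enum line_status get_line_status(const map_entry_t *e, unsigned i)
{
    return (enum line_status)((e->line_status[i / 4] >> ((i % 4) * 2)) & 0x3);
}
```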

Initially, all metadata entries are set as invalid, and none of them is backed by any physical storage. When a page in the shadow address space is first accessed, perhaps because the OS has just allocated a new page there, a page fault-like event is triggered on the memory controller. The controller first activates a 4KB buffer to accumulate cache lines from the upper levels before sending them for compression. This amortizes the overhead of page allocation and metadata updates across the first few cache lines of a compressed page. The buffer is retired after some time, or when another controller page fault occurs. The compressor processes each line in the buffer and computes the compressed size. If all lines fit into their physical slots (recall that Buri assumes a static 4:1 compression ratio), only one 1KB block is allocated. Otherwise, two or more blocks are allocated, with overflowed lines stored after the regular slots.
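
A hedged sketch of the buffer retire path, assuming overflowed lines are stored uncompressed; `compress_line_size` and `allocate_and_write_page` are hypothetical stand-ins for the controller datapath:

```c
#include <stdint.h>

/* Hypothetical helpers standing in for the controller's datapath. */
extern uint32_t compress_line_size(const uint8_t *line64);
extern void     allocate_and_write_page(uint64_t page_id, unsigned blocks,
                                        const uint8_t *page_data);

/* Fill buffer for a freshly touched shadow page: lines accumulate here so
 * that block allocation and metadata updates are paid once per page rather
 * than once per line. */
typedef struct {
    uint64_t page_id;        /* shadow page being accumulated     */
    uint8_t  data[4096];
    uint64_t line_present;   /* bitmap: which 64B lines arrived   */
} fill_buffer_t;

/* Retire the buffer (on timeout or the next controller page fault):
 * compress every accumulated line and size the allocation accordingly. */
void retire_fill_buffer(const fill_buffer_t *fb)
{
    uint32_t overflow_bytes = 0;
    for (unsigned i = 0; i < 64; i++) {
        if (!(fb->line_present & (1ull << i)))
            continue;
        if (compress_line_size(&fb->data[i * 64]) > 16)
            overflow_bytes += 64;   /* overflowed lines kept uncompressed */
    }
    /* Regular slots fill exactly one 1KB block (64 lines x 16B); overflow
     * lines, if any, need further blocks, up to the four-block cap. */
    unsigned blocks = 1 + (overflow_bytes + 1023) / 1024;
    allocate_and_write_page(fb->page_id, blocks, fb->data);
}
```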

When a dirty line is written back, its compressed size may have changed such that it no longer fits in its physical slot. In this case, the slot stores a pointer into the overflow area, at which location the line is written. The metadata entry is also updated to reflect the line’s status change. If the overflow area itself overflows, one more block is requested from the free list, as shown above. If the page already has four blocks, the page is stored uncompressed, since compression no longer reduces storage usage.
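
The writeback path could then look roughly like this; again, every helper here is a hypothetical stand-in, not the paper’s interface:

```c
#include <stdint.h>

enum line_status { LINE_NORMAL, LINE_ZERO, LINE_OVERFLOW, LINE_INVALID };

/* Minimal slice of the mapping-table entry sketched earlier. */
typedef struct {
    uint32_t block_ptr[4];
    unsigned block_count;
} map_entry_t;

/* Hypothetical stand-ins for the controller's datapath operations. */
extern uint32_t compress_line_size(const uint8_t *line64);
extern void     write_regular_slot(map_entry_t *e, unsigned line,
                                   const uint8_t *data, uint32_t size);
extern int      overflow_space_left(const map_entry_t *e);
extern uint32_t alloc_one_block(void);
extern void     write_overflow(map_entry_t *e, unsigned line, const uint8_t *data);
extern void     store_page_uncompressed(map_entry_t *e);
extern void     set_line_status(map_entry_t *e, unsigned line, enum line_status s);

/* Writeback of a dirty line: keep it in place if it still fits its 16B
 * slot, otherwise push it to the overflow area, growing the page by one
 * block when needed and giving up on compression at the four-block cap. */
void writeback_line(map_entry_t *e, unsigned line, const uint8_t *data)
{
    uint32_t size = compress_line_size(data);
    if (size <= 16) {
        write_regular_slot(e, line, data, size);
        set_line_status(e, line, LINE_NORMAL);
        return;
    }
    if (!overflow_space_left(e)) {
        if (e->block_count == 4) {          /* compression no longer pays off */
            store_page_uncompressed(e);
            return;
        }
        e->block_ptr[e->block_count++] = alloc_one_block();
    }
    write_overflow(e, line, data);          /* slot keeps an indirection pointer */
    set_line_status(e, line, LINE_OVERFLOW);
}
```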

The article also proposes a few optimizations for better performance. First, zero cache lines are not physically stored at all. Once a zero line is detected on a write request, the per-line status field is set to record its presence. On future accesses to that line, the page body is not touched, which reduces the number of memory accesses when many cache lines are all-zero. The second optimization is prefetching; both page-level and line-level prefetching are proposed. For page-level prefetching, a global predictor generates the next page to be accessed, whose metadata entry is then prefetched into the buffer. For line-level prefetching, when a compressed line is read in place, the controller always prefetches the data in the next slot if the slot contains an indirection pointer. This saves one extra memory access for the pointer when access locality is high. The last optimization is in memory request scheduling, which prioritizes blocks whose metadata is already in the cache.
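
The zero-line read path, sketched under the same assumptions (`line_status_of` hitting the metadata cache is my invention):

```c
#include <stdint.h>
#include <string.h>

enum line_status { LINE_NORMAL, LINE_ZERO, LINE_OVERFLOW, LINE_INVALID };

/* Hypothetical: status lookup hits the metadata cache; body reads hit DRAM. */
extern enum line_status line_status_of(uint64_t page_id, unsigned line);
extern void read_line_from_body(uint64_t page_id, unsigned line, uint8_t out[64]);

/* Zero-line short-circuit: an all-zero line is synthesized from its 2-bit
 * status alone, so the page body in DRAM is never touched. */
void read_line(uint64_t page_id, unsigned line, uint8_t out[64])
{
    if (line_status_of(page_id, line) == LINE_ZERO) {
        memset(out, 0, 64);
        return;
    }
    read_line_from_body(page_id, line, out);
}
```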