Software-Defined Address Mapping: A Case on 3D Memory

- Hide Paper Summary

Paper Title: Software-Defined Address Mapping: A Case on 3D Memory
Link: https://dl.acm.org/doi/10.1145/3503222.3507774
Year: ASPLOS 2022
Keyword: 3D Memory; HBM; HMC; DRAM; SDAM

Back

Highlights:

3D Stacked DRAM has many more channels and much smaller rows, which makes channel contention a problem especially on strided patterns.
Global address mapping schemes based on bit flip frequency does not work well because there can be mixed patterns.
Hardware can be designed in a way that it leverages pattern information from software in the unit of 2MB chunks, and applies different address generation schemes within a chunk in order to minimize channel contention.
Software can find the access pattern by first identifying the major variables in the program, and then monitoring the DRAM access pattern of these variables. Then according to the pattern, these variables will be allocated on different chunks, and the pattern will be notified to the hardware on a per-chunk basis.

Comments:

Typo “verse versa” on page 5 under “Functional correctness guarantee”.
Does the approach work on existing binaries? I think it is possible by using an instrumented malloc() library, but the design is meant to be used in a scenario that you have the binary and can conveniently modify it, or at least the memory allocation sites of major variables.
What happens after profiling? Do you identify the major variables, and manually change the source code such that these variables are allocated to the corresponding chunks? Can this be further automated such that the information is passed to the allocator by the profiler?
The results look promising, so I guess the method works well. But if you change the allocation pattern, wouldn’t you also change the access pattern? So to which extent are the profiling results representative? I think if the major variables are big arrays, then the pattern is unlikely to change, and the problem is just a matter of allocating big arrays from the corresponding chunk pool.
The method seems to work better for large array-of-struct allocation. But how does it work for smaller allocations? What if the program’s working set consists of millions or billions of small objects? In this case, the objects will be grouped together on allocator’s internal arenas. How does software profiling figure out the pattern in this case? Note that variables are heavily aliased, if pointers are used to traverse through these objects.
Does the 2MB chunk need to be aligned? I think the paper suggests so by referring to OS buddy allocator. In this case, the design would face the same external fragmentation problem as 2MB huge pages do.

This paper proposes Software-Defined Address Mapping (SDAM), a software-hardware co-design aimed at optimizing address mapping for 3D stacked memory. The paper is motivated by the channel contention problem in high-performance 3D memory introduced by the static physical-to-hardware address mapping scheme. The paper proposes a virtualized, dynamic address remapping scheme that performs address mapping in the unit of chunks. The per-chunk address mapping is configurable, such that software components can cluster data with similar access patterns to the same chunk. Channel contention is reduced by performing different address mapping for different chunks based on their access patterns.

SDM is based on 3D stacked DRAM modules. One of the biggest difference between 3D DRAM and the regular DRAM is that the controller has considerably more channels (32 channels), while having much smaller rows (256 bytes). This property opens the opportunity of Channel-Level Parallelism (CLP), where memory requests are interleaved on different channels, such that these requests can be serviced concurrently. With all channels working in parallel, 3D DRAM can deliver much higher peak bandwidth than traditional DRAM modules.

Existing 3D DRAM device uses a simple and static address mapping scheme that is initialized at boot time. Bit slices of physical addresses are used to address channel, bank, and row. The paper presents an example of existing physical address mapping, where the lower bits are used to address channel, the middle bits for bank, and high bits for row. Such scheme, however, is vulnerable to access patterns that cause channel contention. For example, if the channel is addressed by the low 4 bits, then an access stride of 16 will cause all memory requests to be handled by channel zero, serializing all memory requests, which causes under-utilization and lowers the throughput.

Prior works leverage CLP in both hardware and software. Hardware approaches monitor the address patterns for memory requests, and monitors the access pattern using the frequency of bit flips as a metric. Bits that flip the most frequently will be used to constitute the channel address, such that consecutive memory accesses are likely to be mapped to different channels, which maximizes CLP. Software approaches track memory access pattern at page level, and changes the physical address mapping of pages that cause channel contention.
The paper notes that neither approach is optimal, because the hardware approach only supports a global policy, which is insufficient to capture all access patterns, and the software approach is only able to keep track of access patterns at page granularity, which might be fine for regular DRAM, but 3D memory has row size much smaller than a page.

The paper also makes three observations regarding memory access pattern. First, when several different simple patterns are mixed together, the overall access pattern will be dependent on all the individual ones. This suggests that an effective scheme must be able to adapt to changing patterns, rather than only using a global pattern. The second observation is that when multiple access patterns co-exist, no global address mapping scheme works optimally. This necessitates that a per-pattern address mapping policy, because it is difficult to come up with a one-fit-all policy. Lastly, the paper also observes that in SPEC, most memory accesses are only caused by a small number of language-level variables, and these variables constitute a large fraction of the application’s memory footprint. It is therefore sufficient to only focus on these “major variables” and capture their access pattern.

The paper proposes SDAM that works as a software-hardware co-design. From software’s perspective, data items that have similar access patterns are identified by offline profiling, and then allocated on the same 2MB memory chunks. Meanwhile, the software notifies the memory controller of the access pattern of a particular chunk, and the memory controller stores the mapping from chunks to patterns in a special table, called the Chunk Mapping Table (CMT). On memory accesses, chunk information is retrieved from the CMT, and the address for accessing the device is generated using pattern information stored in the CMT by a hardware unit, the Address Mapping Unit (AMU).

On the hardware side, memory access pattern is tracked at 2MB chunk granularity. Chunks are registered to hardware via software interfaces, and they are stored in the CMT. CMT entries are retrieved on memory accesses to obtain the pattern. A small cache may also be added to reduce the latency of CMT access. The physical storage of the DRAM device is also divided into 2MB chunks, and address mapping works independently within each chunk according to the pattern. On each memory access, the CMT entry is retrieved, and the channel is selected based on the pattern information stored in the CMT entry. The rest of the address is also generated by the AMU, after which the request is handled by the selected channel.

On the software side, both the OS and the user-space heap memory allocator are modified to be aware of chunks. The OS gives away physical pages to allocators in 2MB chunks (e.g., as a huge page), and tracks chunk’s pattern information in its internal virtual memory data structure. The user-space allocator maintains a heap for each of the 256 patterns. The implementation of the allocator is unchanged, except that they request memory from the OS in 2MB chunks, and that they accept pattern information from the application, and passes them to the OS when allocating pages (i.e., on mmap() system calls). Applications allocate memory by calling malloc() with an extra argument describing pattern information. The allocator fulfills the allocation from one of the 256 heaps based on the pattern information, such that objects that share an access pattern will also be on the same 2MB chunk.

When the pattern is simple, programmer may enter the pattern information at call sites of malloc(). Otherwise, the paper proposes using offline profiling with k-means or machine learning to infer the pattern. The profiling process identifies allocation sites in the program, and associates them with the addresses being accessed. Then a clustering algorithm (e.g., k-means or more complicated ML algorithms) is used to identify the access patterns. If a major variable is deemed to have a pattern, then the allocation site of that major variable will be modified to pass the pattern information to the allocator. The modified program can then enjoy the benefits of SDAM as a result of lowered channel contention on the DRAM.