A Case for Richer Cross-Layer Abstractions: Bridging the Semantics Gap with Expressive Memory

Paper Title: A Case for Richer Cross-Layer Abstractions: Bridging the Semantics Gap with Expressive Memory
Link: https://ieeexplore.ieee.org/document/8416829
Venue: ISCA 2018
Keyword: Tagged Architecture; XMEM

Highlight:

  1. The main memory can be metadata-tagged with a direct-mapped array reserved from the physical address space plus a tag cache (the ALB in this paper) on the cache side. The tagging overhead is negligible for an intermediate granularity (512B) and reasonably small tags (1B): one byte per 512-byte chunk is only about 0.2% of physical memory.

  2. The paper demonstrates a good division of responsibility between software and hardware. To keep XMem extensible, the OS is responsible for knowing all attributes and for initializing each component's PAT. This is better than hard-coding the attributes in hardware, because updating the OS driver and loader for each new attribute is far easier than updating the hardware components.

  3. Attributes (memory tags) can be statically determined, recognized by the compiler, and stored in the application binary. This simplifies the creation and maintenance of attributes: they are created in advance and communicated to the hardware before execution.

Comments:

  1. The paper claims up front that XMem should support flexible granularity, yet in the final design the granularity turns out to be fixed and fairly coarse (512B chunks). What if I need per-cache-line metadata? Can that be emulated in XMem?

This paper proposes Expressive Memory (XMem), a memory framework for passing high-level software semantics to the underlying hardware by tagging the address space with metadata. The paper is motivated by the fact that program metadata can be helpful in many scenarios, while in reality it is difficult to obtain or infer from untagged memory. The paper uses two examples to justify the motivation. The first example, cache optimization, requires that programmers make assumptions about the cache configuration and hard-code the optimization for operations such as matrix multiplication. If the actual hardware configuration differs from the assumption, the optimization will not work. It would hence be nice if software could provide locality information at run time to aid cache controllers in making eviction decisions. The second example, DRAM page placement, requires knowledge of the access patterns of mapped pages, so that pages that are often accessed together can be allocated on different DRAM banks, exploiting inter-bank parallelism for higher access throughput. This high-level semantic information is only obtainable from the software; if the OS and the DRAM controller knew it, they could make better placement decisions.

The paper identifies three challenges in implementing XMem. First, the tagging granularity should be flexible, as semantic metadata applies to both very large and very small data items. Neither per-address nor per-page granularity would work well, as neither strikes a good balance between metadata overhead and flexibility. Second, the design itself should not restrict what kind of information is conveyed from the software to the memory hierarchy, since XMem is meant to be a framework generally applicable to all scenarios where high-level semantics are needed, not a specialized solution to a few pre-defined problems. This implies that the design should be flexible enough to allow arbitrary information to be passed, and to be extended in the future without changing the hardware. Lastly, the semantics of memory regions should be conveniently changeable at run time, by either unmapping/remapping a region or enabling/disabling its semantic information. This is critical to the usability of XMem, as memory behavior often changes across different stages of application execution.

At a high level, XMem represents a set of semantic information as an “atom”. The contents of an atom must be determined statically at compilation time and remain immutable during execution. Atoms are identified by an application-level atom ID, which is assigned statically at compilation time, since the compiler can see the full set of atoms. Atoms with identical contents that appear at different locations in the program are deduplicated by the compiler, so that they share the same atom ID. Applications may define atoms anywhere in the program and refer to them using an 8-bit handle, which is simply the compiler-assigned atom ID. At compilation time, atoms are stored in a special segment of the compiled binary, which is recognized by the OS binary loader. The paper suggests that the OS dynamic loader read the atoms statically embedded in the binary and use them to initialize a hardware table, as we will see below.
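To make the atom life cycle concrete, here is a minimal sketch of how an application might use XMem, assuming a C interface along the lines of the library calls described below. The CREATE-style constructor, the xmem.h header, and the attribute strings are illustrative assumptions, not the paper's exact API:

```c
#include <stdint.h>
#include "xmem.h" /* hypothetical XMemLib header */

int main(void) {
    static double table[8192]; /* 64 KiB, read-mostly lookup table */

    /* Assumed CREATE-style constructor: resolved at compile time, the
     * attributes land in the binary's atom segment, and the returned
     * handle is the statically assigned 8-bit atom ID. */
    uint8_t hot_ro = CREATE("ACCESS_PATTERN=STREAMING, RW=READ_ONLY");

    MAP(hot_ro, table, sizeof(table)); /* many-to-one: other ranges may map to the same atom */
    ACTIVATE(hot_ro);                  /* marks the atom active */

    /* ... accesses to `table` now carry the atom's semantics ... */

    DEACTIVATE(hot_ro);                /* semantics suspended; mapping stays intact */
    UNMAP(hot_ro, table, sizeof(table));
    return 0;
}
```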

The paper does not specify the contents of atoms, as XMem is designed to be easily extensible. It does, however, recommend a few frequently used properties, such as data type, access pattern, read-/write-only permissions, access frequency, working set size, and reuse distance. All of these properties can be represented by either small integers or bit masks. Compilers should be able to identify all of the properties and transform them into a binary format that can be communicated to the hardware.
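As a rough sketch, each of these recommended properties fits in a byte or two; the field layout below is an assumed encoding, not the one in the paper:

```c
#include <stdint.h>

/* Hypothetical packed encoding of one atom's attributes. Every field is
 * a small integer or a bit mask, so a whole atom fits in a few bytes. */
typedef struct {
    uint8_t  data_type;      /* e.g., 0 = int32, 1 = float, 2 = double, ... */
    uint8_t  access_pattern; /* e.g., 0 = sequential, 1 = strided, 2 = irregular */
    uint8_t  rw_flags;       /* bit 0 = readable, bit 1 = writable */
    uint8_t  access_freq;    /* quantized hotness bucket */
    uint8_t  reuse_distance; /* quantized reuse-distance bucket */
    uint16_t working_set;    /* working set size, in 512B chunks */
} atom_attr_t;
```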

Atoms are stored in a centralized fashion in the memory hierarchy in a Global Attribute Table (GAT). The GAT is per-context state and hence needs to be saved and restored at context switches. As mentioned earlier, the GAT is initialized by the OS at binary loading time: the OS reads the atom segment of the binary and transfers all atoms into the GAT in the order specified by the compiler. Applications can hence refer to atoms using the compiler-assigned index, which is also the index into the hardware GAT. Components in the memory hierarchy may leverage the GAT in two ways: they can either read information directly from the GAT, or use a Private Attribute Table (PAT) to store component-specific information. The OS is also responsible for initializing each PAT with the information that the corresponding component is interested in.
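Conceptually, both tables are small arrays indexed by the 8-bit atom ID. The sketch below reuses the attribute record from the previous snippet and invents an LLC-specific PAT entry purely for illustration:

```c
#include <stdint.h>

/* Minimal stand-in for the packed attribute record sketched above. */
typedef struct { uint8_t raw[8]; } atom_attr_t;

/* Global Attribute Table: one attribute record per possible 8-bit atom
 * ID, saved and restored with the rest of the context on a switch. */
static atom_attr_t GAT[256];

/* Example Private Attribute Table for the LLC controller: at load time
 * the OS distills each atom into the single hint this component consumes
 * (the insertion-priority encoding is an assumed example). */
static uint8_t llc_pat[256]; /* llc_pat[atom_id] = cache insertion priority */
```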

Applications assign semantics to addresses by mapping those addresses to atoms. The mapping between addresses and atoms is many-to-one: an address can be mapped to at most one atom, while an atom can describe the semantics of many addresses. At the application level, XMem provides two library calls, MAP() and UNMAP(), to associate addresses with atoms and to remove the association, respectively. At the hardware level, these two calls are implemented with special instructions that take the atom ID and the address range to be mapped as operands, and store the mapping into a hardware-managed structure called the Atom Address Map (AAM). The AAM is a reserved region of physical memory that tracks the atom ID for the entire physical address space at 512-byte granularity. It is organized as a direct-mapped, flat array, where each 512-byte chunk of physical memory maps linearly to an 8-bit atom ID. The AAM is updated when MAP() and UNMAP() are called, and is consulted when physical memory is accessed. To avoid accessing the AAM on every main memory access, the paper proposes adding an Atom Lookaside Buffer (ALB) to the cache hierarchy as a fast store of frequently used AAM entries. The ALB is organized similarly to the TLB, using page-granularity mapping, and each ALB entry consists of an array of atom IDs, one 8-bit ID for each 512-byte chunk of the page.
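Since the AAM is a flat array, the lookup is a single index computation, and the storage overhead follows directly: one byte per 512B chunk is 1/512 ≈ 0.2% of physical memory (e.g., 32 MiB for 16 GiB of DRAM). A sketch with illustrative names:

```c
#include <stdint.h>

#define CHUNK_SIZE 512u /* tagging granularity in bytes */

static uint8_t *AAM; /* base of the reserved physical region, one ID per chunk */

/* Read the atom ID covering a physical address (what the hardware does,
 * via the ALB on the common path). */
static inline uint8_t aam_lookup(uint64_t paddr) {
    return AAM[paddr / CHUNK_SIZE];
}

/* Effect of the MAP instruction: tag every chunk the range touches
 * (size > 0 assumed). If atom ID 0 is taken to mean "no atom", then
 * UNMAP is simply aam_map(paddr, size, 0). */
static void aam_map(uint64_t paddr, uint64_t size, uint8_t atom_id) {
    uint64_t first = paddr / CHUNK_SIZE;
    uint64_t last  = (paddr + size - 1) / CHUNK_SIZE;
    for (uint64_t c = first; c <= last; c++)
        AAM[c] = atom_id;
}
```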

To support efficient enabling and disabling of individual atoms, which is useful because memory properties can change dynamically at run time, the paper also proposes an Atom Status Table (AST) for tracking the status of atoms. Since the atom ID is only 8 bits, the AST only needs to support 256 atoms, and can be implemented conveniently as a 256-bit mask. XMem provides instructions to set or clear a bit in the AST, which are wrapped in the ACTIVATE() and DEACTIVATE() library calls.
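Because the AST is just a 256-bit mask indexed by atom ID, ACTIVATE() and DEACTIVATE() reduce to single-bit updates; a sketch:

```c
#include <stdint.h>

static uint64_t AST[4]; /* 4 x 64 = 256 status bits, one per atom ID */

static inline void ast_activate(uint8_t id)   { AST[id >> 6] |=  1ull << (id & 63); }
static inline void ast_deactivate(uint8_t id) { AST[id >> 6] &= ~(1ull << (id & 63)); }
static inline int  ast_is_active(uint8_t id)  { return (AST[id >> 6] >> (id & 63)) & 1; }
```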

The operation of XMem is simple. When main memory is accessed as a result of a cache miss, the AAM is also accessed in parallel (which will rarely cause a second memory access, since the ALB filters out most requests) to retrieve the atom ID associated with the address being accessed. The atom ID is propagated to each level of the cache hierarchy and can be utilized by various memory components in multiple ways. The contents of the atoms are initialized by the OS loader before execution, and each component may define its own attributes in its PAT. The OS is assumed to be aware of all components in the hierarchy, and is responsible for initializing the GAT and all PATs using the static atom information stored in the compiled binary.
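Putting the pieces together, the per-miss flow can be summarized as follows (the ALB interface and the function names are assumptions carried over from the earlier sketches):

```c
#include <stdbool.h>
#include <stdint.h>

/* From the earlier sketches / an assumed ALB interface. */
bool    alb_lookup(uint64_t paddr, uint8_t *id); /* true on ALB hit */
uint8_t aam_lookup(uint64_t paddr);
int     ast_is_active(uint8_t id);

/* On a cache miss, fetch the atom ID for the missed physical address and
 * attach it to the request if the atom is currently active. */
uint8_t xmem_tag_for_miss(uint64_t paddr) {
    uint8_t id;
    if (!alb_lookup(paddr, &id)) /* fast path: the ALB filters most lookups */
        id = aam_lookup(paddr);  /* slow path: read the in-memory AAM */
    return ast_is_active(id) ? id : 0; /* 0 = no active atom */
}
```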