Rethinking the Memory Hierarchy for Modern Languages



- Hide Paper Summary
Paper Title: Rethinking the Memory Hierarchy for Modern Languages
Link: https://ieeexplore.ieee.org/document/8574542
Year: MICRO 2018
Keyword: Cache; Hotpads



Back

Highlight:

  1. Creating objects directly in the cache rather than alloc it first in VA space can help reducing compulsory cache misses. This is also leveraged in page overlays.

  2. It is good thinking to treat caches as independent devices that have their own address spaces, rather than transparent faster storage for the PA space. This brings more interesting use cases and more algorithms that are originally designed for distributed systems.

  3. A good way to reduce backward references in the memory hierarchy is like what this paper did: Forbidding any pointer from lower level to higher level. This way when higher level objects move in the cache, there will not be any massive pointer change in the lower level which is also significantly larger and impractical to scan.

  4. I like the way data isolation is done in this paper. Locations of canonical objects are used as the identities of the object, which requires an update (or redirections) on all referencing pointers when canonical objects are moved to a new location. Non-canonical objects, however, are fully isolated by not being able to pointed to by pointers. Their locations are, in fact, abstracted away by the usage of the associative tag array on each level. Moving non-canonical object copies are therefore localized to only the current level, since the controller can change the mapping quickly by an atomic update to the array

Questions

  1. This design is really just a mixture of over-thinking and ad-hoc bricklaying without any elegant philosophy. I understand that the author may have obtained the initial idea from software generational GC and felt motivated to port it to hardware, treating the memory hierarchy as heaps of different generations. But really, GC and memory allocation is not a thing that you should perform purely on hardware. Also the design is full of loopholes and bugs (see below). There are numerous corner cases that cannot be handled efficiently. This is “innovation just because you want to be different” without any meaningful contribution.

  2. The paper solves backward pointer updating problems with a brute-force scan of the tag array. Even if this is doable in hardware-only pads, how can you achieve that for main memory GC? Or objects shall not be moved for main memory GC?

  3. If sub-objects are allowed, canonical addresses are no longer unique (i.e. a byte can be represented by multiple canonical addresses) How does this work? What if different pads cache different sub-objects that overlap with each other? Does hardware compute segment intersection?

  4. By lazily rewriting upper level pointers when a canonical object is evicted to lower level, the object has two de-facto identities, one in the old location, another in the new location. How Hotpads prevent this from happening? i.e. what if two canonical pointers, to the two locations respectively, are compared?

  5. The canonical pointer in the header of object copies serve as backward pointers. Are these pointers updated when a canonical object moves? Are they modified during GC?

  6. Non-canonical pointers rewriting that are the results of L1 access should not be written back to lower levels. What if such a pointer is written from a register into the pointer field of the object? If the rewrite is to be executed as data array update, how would L1 restore the previous value? What if the pointer update is truly a pointer update? One way is to discard pointer rewriting results and let it be rolled back naturally in lower levels. But, what if a canonical pointer is first written to a pointer field as a normal update, and then rewritten due to refill from lower levels? How do we recover the dirty pointer value (its original value is never stored in any lower level)?

  7. When a canonical object is GC’ed at lower level, how are copies of the object invalidated?

  8. Overall, the non-atomic nature of pointer rewriting when a canonical object moves (during GC) will introduce really obsecure data and control races in practice in a multicore environment when mixed with coherence.

This paper proposes Hotpads, an eccentric memory hierarchy architecture that optimizes for small object allocation, garbage collection in managed languages, memory safety, and reducing associative lookup costs in conventional cache hierarchy. The paper is motived by three important observations on modern programming languages. First, modern languages are mostly memory safe with object abstraction. The memory layout of objects are opaque to application code, which cannot be directly addressed using pointer cast or pointer arithmetic, but only accessed with pre-defined methods. This reduces chances that unintended wild pointers, buffer overflows or malicious attacks from corrupting the state of objects. The architecture, however, failed to provide such protection on ISA level, forcing language runtimes to implement their own memory protection mechanism. Second, most modern languages rely on background automatic garbage collection to recycle dead objects, which takes a software thread to periodically scan objects in the background, which also potentially moves objects around for compaction. The usage of GC algorithms is already a feature in these languages that is so common to justify a hardware implemented version be embedded in the cache system. Lastly, the paper also points out that conventional cache systems require an associative lookup of the tag array for each cache access, which has large power overhead. The paper seeks to reduce such overhead by using pointers that directly point to cache locations, rather than to the underlying physical address space. Cache accesses, in most cases, are just to follow the pointer to the data array and access its content without an associative address tag lookup.

The paper describes a complicated memory hierarchy that operates at object granularity. The most prominent feature of Hotpads is that pointers do not store linear addresses in the virtual memory space, assuming a uniform address space between DRAM and cache. Instead, caches are more similar to independent storage devices called “pads”, which must be explicitly addressed. This is different from the conventional transparent abstraction of a faster main memory, in which each access to the cache must involve translation between the pointer address and the cache’s internal address. In addition, objects are no longer backed by the main memory, i.e. all objects must have a main memory address in order to be uniquely identified across the system. In Hotpads, each cache device has its own address space. Objects can be backed by any of the storage device in the hierarchy without explicitly allocating storage in other components of the hierarchy. Objects are accessed by creating copies from their backing store to the L1 pad. Active pointers referring to the objects are also rewritten to point to the L1 copy of the object for fast, direct access, without any associative lookup. Lastly, Hotpads implement memory allocation and garbage collection in hardware. The data array of a pad is maintained as a heap which always allocates from the head. Object allocation is as simple as incrementing the head pointer of the pad. In case of storage exhaustion, pads at each level will perform per-object hardware controlled garbage collection and object eviction automatically independently without software intervention. Objects allocated by software therefore do not need to be freed explicitly. In order to identify blocks to be freed, the hardware periodically runs a mark-and-sweep garbage collection algorithm using pointer values in the register file as root pointers. Live blocks that are not frequently used are also evicted. The remaining blocks are then moved to form a compacted chunk, which frees storage in the data array for future allocation. To aid hardware GC, Hotpads are type-aware, and marks pointer fields in an object explicitly. Type checks for pointers are performed when they are accessed with special instructions. Furthermore, the GC process leverages pointer fields to traverse between live objects. The language compiler should cooperate to identify pointer field while compiling or initializing an object.

The data flow of Hotpads is more similar to the one in a tiered distributed system than in the conventional cache. The system is still divided into different levels as the conventional cache, with higher levels featuring smaller but faster storage, and lower levels with larger but slower storage. Each level of the pads, including the main memory, has a private address space that can be directly addressed by upper level requests without address translation. Pointers may store addresses to any of these levels as long as the object to be pointed to fulfills certain properties (discussed later). In Hotpads, objects start their life cycle in the L1 pad when the application allocates a chunk of memory using alloc instruction. This instruction simply increments the head pointer of the pad by the specified amount, and returns the internal hardware address (i.e. offset in the data array) of the allocated chunk in the destination register. At this moment, the object has not yet been allocated any storage in any lower level pads and the main memory. This is the biggest difference between Hotpads and a conventional memory hierarchy, in which objects are always allocated in the virtual address space as its “home address” serving as the unique identifier of the object, reserving storage in the main memory before they are brought into the cache hierarchy. As time passes, if the object is “dead”, indicating that no other object contains any reference to it, the object will be GC’ed from the L1 pad. If, on the other hand, the object survives long enough to be evicted to lower level pads, it will be allocated a new hardware address in the lower level and then copied there. When an object in the lower level is to be accessed via a pointer, it must be brought into the L1 pad by copying its content to the L1 pad, and rewriting the pointer used for the access to directly point to the L1 copy. Note that such pointer rewriting is not really necessary for the correctness of operation, but merely as a critical performance optimization. Without pointer writing, the identify of the object needs to be translated into the hardware address of the L1 pad on each access, which requires associative lookups of the tag array. The paper eliminates such lookup by directly using hareware addresses in pointer values, and rewriting the pointer to directly address the L1 pad when the object it points to has been copied to the L1.

Each object are allowed to have only one canonical copy, which is the copy of the object in the lowest level. The hardware address of an object’s canonical copy serves as a globally unique identifier to that object. All copies of the object in upper levels of the hierarchy, expect the L1, are non-canonical objects, which may not be pointed to by any of the pointers in the runtime data structure or register file. This restriction is to avoid unnecessary pointer aliasing which complicates pointer comparison, and to ensure that the most up-to-date version of the copy is accessed. Non-canonical objects in non-L1 pads can only be accessed via an associative tag lookup (see below), and must not be pointed to by any pointers. As already discussed above, this isolates canonical object copies from being accidentally accessed, and localizes the effect of moving these canonical pointers, since the only access path to these objects are via the tag array lookup. The most up-to-date copy is always the one in the L1, or in the highest levels of the hierarchy. To help locate the canonical copy for objects, each non-canonical copy of an object is accompanied by the canonical pointer of the object in its header as its identity. When a canonical copy of the object moves down the hierarchy, all upper level pointers to that canonical copy must be invalidated, and then updated to the new location of the canonical copy in the lower level. The paper proposes that this be done lazily, and in a delayed manner: When the canonical copy is transferred to a lower level, the old address of the object in the current level is still preserved, but the valid bit is set to zero. The new canonical pointer of the object is stored in the old location, indicating that the object has been moved, and the search process should use the new canonical address. During a later GC, this “stub” object will be identified and reclaimed as garbage. All canonical pointers that still use the old canonical address of the object in upper levels will be rewritten with the new canonical address of the object during GC.

One of the most important features of Hotpads is pointer rewriting, which happens when: (1) an object, no matter canonical or not, in lower levels of the hierarchy addressed by a canonical pointer is copied to the L1 pad; and (2) when a canonical object’s “stub” is GC’ed in lower levels. In the first case, the pointer that is used to address the canonical copy of the object is written with the L1 pad’s address for reducing tag lookups. In the second case, the canonical pointers are written with the new canonical address of the object after it has moved down. Note that pointer rewrite is merely a measure for optimizing L1 pad accesses. It does not modify the object.
Pointer fields and register file values that are written due to refills from lower levels will not be written back to lower levels when the object is written back. The hardware simply will not mark these pointer fields as dirty, and let the fields be rolled back naturally to the old value as the write back occurs.

To help finding the copy of an object using its canonical address as identity, each pad is also equipped with a tag array just like regular cache. The tag array maps canonical addresses of objects to their hardware addresses, if these objects’ copy exist in the current pad (excluding canonical copies, which can be directly addressed in the corresponding level). The tag array is implemented as a hardware hash table (the paper claimed it to be like the one in V-Way cache, but in fact it is just a hash table). Evictions may also happen in the tag array, in which case the object mapped by the tag entry is also evicted. As already stated above, if the object to be evicted is a canonical copy, the storage it used to occupy must be preserved, and the new canonical address after the eviction is stored there for lazy pointer rewriting.

Hotpads also changes the pointer format to enable more efficient addressing within the cache hierarchy. Instead of always using the flat address which is also the backing store of all objects in a conventional memory hierarchy, pointers in Hotpad store hardware addresses of objects. Only L1 addresses (both canonical and non-canonical) and canonical addresses below L1 are allowed to be used as a pointer value. Non-canonical copies of objects can only serve pad access misses from upper levels, but not be pointed to and addressed directly. The higher 14 bits of the pointer stores the object size. One extra bit in the pointer is dedicated to denote whether the pointer is canonical or not. The rest 48 bits store the hardware address of each level. The identity of the hardware device to which the address points should also be encoded in the pointer, such that the identity of the device can be known given the pointer value. The paper suggests that the 48 bit value domain be partitioned such that different devices occupy an non-overlapping partition. Addresses within a certain device are always linearly mapped to the 48 bit value domain.

Two kinds of pointers can be used for L1 pad access, one for fast path access and the other for slow path. The slow path access uses non-canonical pointers pointing to lower level pads, which happens only for the first access of the object using that pointer before the object is evicted. The L1 pad must check, like in conventional caches, whether the canonical pointer’s object has already been cached in the pad by performing a tag comparison. If the comparison indicates a miss, the canonical address is forwarded to lower level pads, which also perform similar checks until the canonical level is reached, in which case the pointer is directly used to address the storage. Objects not in the L1 pad, no matter canonical or not, will be copied to the L1 pad for access. The copied object is non-canonical. After copying the object, the non-L1 canonical pointer used to access the pointer should be rewritten with the non-canonical pointer in the L1 pad. The next time this pointer is used for access, no tag lookup is performed, and the L1 data array can be directly addressed. All objects are stored with runtime type information, conventional coherence bits, as well as a “canonical bit” denoting whether the object is canonical or just a copy. In addition, non-canonical objects are always stored with a header indicating its canonical address. This canonical address serves as the identity of the copy.

Hotpads maintains the data array as a heap and allocates object from all levels of pads by incrementing the heap top pointer.
Garbage collection is therefore necessary to reclaim dead objects that are not referenced by any other live objects. The GC process, however, involves moving canonical objects around in the pad address space, causing massive pointer updates for all objects that hold a pointer to the canonical object to be moved. To localize GC and avoid such massive pointer updates, Hotpads enforces the invariant that no lower level objects may contain canonical pointers to upper levels (non-canonical pointers to upper levels are impossible, too, since they can only be a result of pointer rewriting and later eviction of the pointer-holding object. But the rewritten pointer field will be discarded when the object is evicted). Canonical pointers from upper to lower levels are allowed, though. When an upper level canonical object is moved in a certain pad for GC, only pointers from upper levels pads and from the register file need to be rewritten, upper bounding the total number of rewrites to the total number of bytes in the upper level storage. We do not describe the GC process in details in this summary. Overall, the GC algorithm moves around objects that are still alive using mark-and-sweep, and stores them compactly as a single chunk in the dara array. To help identify pointer fields in objects, Hotpads adds one metadata field per two words (each object must be at least two words in size, one for the header, the other for data). The metadata field tracks dirty and valid bits as in a conventional cache at finer granularity. Furthermore, one bit is added to indicate whether the word stores a pointer value. The replacement policy also requires several bits for coarse grained LRU. The GC starts with pointer values in the register file as roots (including the stack pointer, which implicitly marks the entire execution stack for scanning as well), and runs BFS accelerated by a small hardware FIFO buffer to mark all canonical and non-canonical objects as “live”. Note that GC only works for canonical objects, since they are allocated “permanently”. Non-canonical objects, on the contrary, are just cached copies of some lower level canonical objects. These cached copies will be invalidated when their canonical objects are GC’ed. No pointer rewriting happens for GC’ed canonical objects, since these objects are not referenced anymore. Pointers are rewritten, however, for canonical and non-canonical objects that are moved for compaction. The pointer rewriting is performed using a temporary renaming table allocated in the data array. Afer the GC, all canonical and non-canonical pointers will reference the newly moved objects rather than their old locations. If the GC is executed on lower level pads other than L1, the renaming process should also propagate to upper levels. The controller at upper levels scans the metadata array and rewrites the field if it is a pointer and its value matches one of the old locations of objects being moved during GC.

Along with GC, objects are also evicted out of the pad if they are not recently used. This mimics generational GC in which infrequently accessed objects are evicted to larger, older heaps, while short-lived objects are always allocated from the smaller and younger heap. Generational GC accelerates the GC process since most objects die young, and therefore, most storage can be reclaimed by only performing GC on the younger heap (higher level pads in our case). Recall that no canonical pointer is allowed from lower level pads to higher level pads to avoid massive pointer updates, when an object is evicted, the hardware must check its pointer fields (using the per-double word metadata bits), and evict canonical objects that are still in the L1 pad. Canonical pointers are also rewritten when the canonical object is evicted during GC. If a non-canonical object is evicted, the pointer MUST NOT be rewritten to the new non-canonical address (non-canonical objects to non-L1 pads are forbidden to avoid complicated aliasing problems). Instead, the canonical address from the object to be moved is fetched, and the pointer value is stored to the canonical value.

GC in the main memory are delegated to software threads due to its complexity. The paper, however, does not give detailed description on how main memory GC is performed, and how address is rewritten for moved in-memory objects.

The paper also discusses second-order issues such as coherence, partial order accesses, compatible mode to support old block interface, etc. Unfortunately, the paper did not elaborate these aspects of Hotpads well, and some mechanisms seem broken and buggy.

Hotpads enhances the ISA by the addition of a few special instructions. To enforce data isolation between pointers and normal data, pointer fields must be accessed with ldptr and stptr for reads and writes respectively. The hardware checks the per double-word metadata field to determine whether the access is legal or not. Furthermore, before accessing an object via pointer, the object must be fetched to the L1 pad followed by a pointer rewriting. This semantics is conveyed to the cache system using derefptr instruction, indicating that the program intends to access the object being pointed to. Normal data accesses still use ld and st instructions, with an optional displacement value as immediate value or register offset. Accesses using ld and st to pointer fields are prohibited to protect pointer semantics. Pointer comparison is non-trivial in Hotpads due to pointer aliasing. Canonical addresses can be directly compared, since they represent the identity of objects. Non-canonical addresses, which must be in the L1, can also be numerically compared, since objects can only have at most one copy in the L1 cache. Canonical and non-canonical comparison is more tricky, since they might actually refer to the same object. In this case, the hardware fetches the canonical address from the non-canonical object’s header, after which canonical addresses can be compared.