Distributed Shared Persistent Memory
Link: https://dl.acm.org/doi/10.1145/3127479.3128610
Year: SoCC 2017
Keyword: NVM; Redo Logging; Shadow Paging; Hotpot
Highlight:
- Using a three-phase commit protocol to redo the commit if the system crashes halfway.
- Using shadow paging to update a remote chunk that is cached on the current node. This way, all partial updates can be rolled back to the last commit checkpoint.
Questions
- Some important abstractions should be elaborated in more detail. For example: (1) How are the directories recording which VA is mapped to which PA on which machine maintained? When a chunk migrates, how are these directories updated? (2) Are chunks or pages allocated at initialization once and for all, or can chunks be allocated incrementally? (3) Is there a unified physical address space, or does the resource manager always return node and chunk IDs directly? (4) Does DSPM assume that local NVM capacity must be larger than the dirty working set? (5) How do NVM chunk allocation and deallocation work?
This paper presents Distributed Shared Persistent Memory (DSPM), an architecture and software stack for large-scale distributed NVM deployments. Prior proposals already implement distributed shared memory over DRAM and distributed storage arrays over DRAM and disks. The paper points out, however, that these techniques cannot be directly ported to distributed NVM for two reasons. First, although conventional DSM implements software-based coherence, caching, and so on, critical features for NVM are still missing, such as distributed object naming and replication support. Second, distributed storage systems are mostly implemented as file system services or object-based stores, and the software overhead for maintaining these abstractions is too large given the relatively fast access speed of NVM.
DSPM, in contrast, combines features from both sides to obtain a design that is both fast and reliable. It borrows object naming, data replication, and distributed commit from storage architectures to manage persistent data in a consistent and reliable manner. Meanwhile, DSPM borrows coherence and byte-granularity access from conventional DSM, enabling fast data access via memory instructions over the network.
We first describe the baseline distributed system. DSPM is implemented as a kernel-level software stack, which manages virtual address translations, page faults, and runtime metadata. DSPM assumes a distributed system in which computation nodes communicate via a high-bandwidth, low-latency network such as InfiniBand. Communication is abstracted into RPC requests and responses over RDMA, implemented by a customized networking stack. DSPM nodes serve as both computation and storage nodes and are equipped with NVM devices. DRAM is also installed in order to run the OS and maintain volatile state. DSPM assumes that tasks can be started on any node in the system, and each task can consist of multiple threads. All threads of a task, however, must execute on the same machine, to avoid both radical changes to the OS and an over-complicated commit protocol. The NVM address space is divided into a shared part and a private part. Shared NVM is publicly accessible by all nodes under the abstraction of a unified per-process virtual address space, while private NVM is used both as a cache for frequently accessed pages and as storage for inactive replicas.
DSPM nodes maintain per-process metadata for the mapping between virtual pages and physical resources. The metadata tracks, at a potentially coarser granularity (the paper suggests 4MB chunks), the ownership and sharing status of virtual addresses. A virtual address chunk can be mapped to either a local physical chunk or a remote physical chunk; in the latter case, the remote node ID and chunk ID are stored in the metadata. This metadata is stored persistently in local NVM, so that the recovery handler can infer the state of each chunk before a crash and deal with chunks at different stages of modification. DSPM leverages the existing paging mechanism to translate virtual address accesses. Chunks that are not cached locally are marked invalid in the page table, which raises a page fault on the first access. The page fault handler first queries the metadata table to locate the chunk, then fetches it from the remote node, after which the chunk is installed in the private NVM and the page table entry is updated to point to the local page.
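To make the fetch-on-fault path concrete, here is a minimal Python sketch of the per-process chunk metadata and the read-fault handler it drives. All names (`ChunkMeta`, `handle_read_fault`, `fetch_from_owner`) are hypothetical, and the in-memory dictionaries stand in for structures that the real system keeps persistently in NVM.

```python
from dataclasses import dataclass
from typing import Dict, Optional

CHUNK_SIZE = 4 << 20  # 4 MB chunks, the granularity suggested in the summary

@dataclass
class ChunkMeta:
    state: str                 # "redundant" | "committed" | "dirty" | "uncached"
    local_addr: Optional[int]  # slot in the private NVM cache, if present
    owner_node: int            # node ID of the owner copy
    owner_chunk: int           # chunk ID on the owner node

# Per-process table: virtual chunk number -> metadata.
# In the real system this table is stored persistently in local NVM.
chunk_table: Dict[int, ChunkMeta] = {}
private_nvm: list = []         # stand-in for the private NVM cache

def fetch_from_owner(node: int, chunk: int) -> bytes:
    # Stand-in for the RDMA-based RPC to the owner node.
    return bytes(CHUNK_SIZE)

def handle_read_fault(vaddr: int) -> int:
    """Resolve a faulting read: locate the chunk, fetch it if uncached,
    and return the local slot index backing the access."""
    meta = chunk_table[vaddr // CHUNK_SIZE]
    if meta.local_addr is None:
        data = fetch_from_owner(meta.owner_node, meta.owner_chunk)
        private_nvm.append(data)             # install into the private NVM cache
        meta.local_addr = len(private_nvm) - 1
        meta.state = "committed"             # clean, mapped, counts as a replica
        # The real fault handler would now fix up the page table entry so that
        # subsequent accesses hit the local copy directly.
    return meta.local_addr

chunk_table[0] = ChunkMeta("uncached", None, owner_node=2, owner_chunk=17)
print(handle_read_fault(0x1000))   # first access fetches and installs the chunk
```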
The paper does not mention whether a unified physical address space is maintained for all processes, or whether virtual addresses from processes are directly translated into node IDs and chunk IDs. The paper also does not mention who is responsible for maintaining the global mapping from a process's virtual chunks to node and chunk IDs. In either case, a remote chunk fetch request must eventually be translated into a node ID and a chunk ID and sent to the corresponding node. In the discussion below, we assume that each process maintains its own virtual-chunk-to-ID mapping for better scalability, i.e., there is no unified physical address space, and memory allocation directly returns the two IDs. The paper does indicate, however, that each node maintains metadata for the chunks in its public NVM. For each chunk, the node tracks the sharers with a bit vector, as well as the locking status of the chunk for commit operations.
Chunks are assigned to processes via the resource allocation service, which maintains a global resource map tracking the allocation status of chunks on all nodes. When a request is received, the service picks a chunk in a round-robin manner to evenly distribute load, and returns the node ID and chunk ID to the requestor. Chunk release simply marks the chunk as available in the resource map.
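A minimal sketch of such an allocation service, assuming it simply hands out (node ID, chunk ID) pairs round-robin over nodes; the class and method names are invented for illustration.

```python
from typing import Dict, List, Tuple

class ChunkAllocator:
    """Global resource map: tracks free chunks per node, allocates round-robin."""
    def __init__(self, chunks_per_node: Dict[int, int]):
        self.free: Dict[int, List[int]] = {
            node: list(range(count)) for node, count in chunks_per_node.items()
        }
        self.nodes = sorted(self.free)
        self.cursor = 0  # round-robin position over nodes

    def allocate(self) -> Tuple[int, int]:
        for _ in range(len(self.nodes)):
            node = self.nodes[self.cursor]
            self.cursor = (self.cursor + 1) % len(self.nodes)
            if self.free[node]:
                return node, self.free[node].pop()   # (node ID, chunk ID)
        raise MemoryError("no free chunks left in the cluster")

    def release(self, node: int, chunk: int) -> None:
        # Release simply marks the chunk as available again in the resource map.
        self.free[node].append(chunk)

alloc = ChunkAllocator({0: 4, 1: 4, 2: 4})
print([alloc.allocate() for _ in range(3)])   # one chunk from each node in turn
```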
Each chunk can have several replicas in different states. The copy indicated by the node ID and chunk ID is regarded as the "owner" chunk, which is responsible for replying to chunk fetch requests and handling chunk commits. A chunk is fetched when it is first accessed by the computation, and the fetched copy is stored in the private part of the local NVM. A chunk stored in private NVM (i.e., a non-owner copy) can be in one of three states: redundant, committed, or dirty. A redundant copy is clean and has not yet been mapped; redundant copies are "pushed" from the owner node to achieve a certain replication level (N copies, where N can be set by the user). A committed copy is mapped by the page table and is still clean; committed copies are read-only and also count as valid replicas. Both redundant and committed copies are clean and reflect the most up-to-date committed version of the data. The owner node marks every other node that holds a committed or redundant copy in its sharer list. When the redundancy level of a chunk drops below the threshold N, the owner node proactively pushes copies to other nodes to restore the number of replicas to N; these pushed copies start in the redundant state.
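The chunk states and the owner's re-replication rule can be summarized with the sketch below; the `OwnerChunk` bookkeeping and the assumption that only sharers count toward N are illustrative choices, not details from the paper.

```python
from enum import Enum, auto
from typing import List, Set

class CacheState(Enum):
    REDUNDANT = auto()   # clean, pushed for redundancy, not yet mapped
    COMMITTED = auto()   # clean, mapped read-only, still counts as a replica
    DIRTY     = auto()   # modified locally via shadow paging, not yet committed

class OwnerChunk:
    """Owner-side view of one chunk: sharer bit vector plus re-replication."""
    def __init__(self, replication_level: int, all_nodes: List[int]):
        self.N = replication_level
        self.all_nodes = all_nodes
        self.sharers: Set[int] = set()   # nodes holding a clean copy

    def add_sharer(self, node: int) -> None:
        self.sharers.add(node)

    def drop_sharer(self, node: int) -> List[int]:
        """A sharer evicted its clean copy; return the nodes to push fresh
        redundant copies to so the replica count returns to N."""
        self.sharers.discard(node)
        missing = self.N - len(self.sharers)
        candidates = [n for n in self.all_nodes
                      if n not in self.sharers and n != node]
        return candidates[:max(0, missing)]  # pushed copies start as REDUNDANT

chunk = OwnerChunk(replication_level=2, all_nodes=[1, 2, 3, 4])
chunk.add_sharer(1); chunk.add_sharer(2)
print(chunk.drop_sharer(2))   # one replacement target is chosen, e.g. [3]
```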
On a page fault, a redundant chunk can also satisfy the request locally without contacting the owner node. In this case, the page fault handler changes its state to committed and maps the chunk with read-only permission. Read-only chunks must not be written to and only fulfill read requests; a write instruction therefore raises another page fault. The page fault handler, knowing that the chunk is in the committed state, duplicates the chunk in the private cache and updates the page table entry to point to the new copy with full permission. The new copy enters the dirty state and remains there across later memory operations until the chunk is committed. In the meantime, the original committed or redundant copy is kept in the local cache as the undo image, in case a crash happens before the dirty chunk is committed.
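A sketch, assuming hypothetical names, of this write-fault path: the committed copy is duplicated into a dirty shadow copy, and the clean version is retained as the rollback image.

```python
class CachedChunk:
    """One chunk in the private NVM cache of a non-owner node."""
    def __init__(self, data: bytes):
        self.state = "committed"        # clean, mapped read-only
        self.clean = bytearray(data)    # committed image, kept as the undo copy
        self.dirty = None               # shadow copy, created on the first write

def handle_write_fault(chunk: CachedChunk) -> bytearray:
    """Return the buffer a faulting write should land in."""
    if chunk.state in ("committed", "redundant"):
        # Shadow paging: writes go to a private duplicate, never to the clean
        # image, so a crash before commit can simply drop the dirty copy.
        chunk.dirty = bytearray(chunk.clean)
        chunk.state = "dirty"
        # The real handler would also repoint the PTE to the duplicate and
        # grant write permission; here we just hand back the writable buffer.
    return chunk.dirty

c = CachedChunk(b"\x00" * 16)
buf = handle_write_fault(c)
buf[0] = 0xFF                  # the update lands in the shadow copy
assert c.clean[0] == 0x00      # the committed image is untouched until commit
```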
A chunk can also migrate from its current owner to another node if the migration reduces the number of remote requests. By placing a chunk close to, or on, the node that frequently accesses and commits it, the system sees less traffic and lower access latency. The paper suggests that chunk owners keep statistics on commit and fetch requests from all sharers of the chunk. The chunk is migrated to the most frequent requestor if the cost (measured in bytes copied) is outweighed by the benefit (fewer remote requests) of migration. Note that after migration, the owner address of the chunk has changed; all sharers of the chunk must be notified of the new address and update their metadata. The old copy can only be freed after all sharers have acknowledged the address change.
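The migration decision might look like the following sketch; the byte-based cost model and all names are assumptions, since the paper's exact formula is not reproduced in this summary.

```python
from collections import Counter
from typing import Optional

CHUNK_BYTES = 4 << 20            # assumed 4 MB chunk
BYTES_PER_REQUEST = CHUNK_BYTES  # assume a fetch/commit moves about one chunk image

class MigrationTracker:
    """Owner-side statistics on fetch/commit requests for one chunk."""
    def __init__(self):
        self.requests = Counter()  # requesting node ID -> request count

    def record(self, node: int) -> None:
        self.requests[node] += 1

    def migration_target(self) -> Optional[int]:
        """Migrate to the heaviest requestor only if the saved traffic
        outweighs the one-time cost of copying the chunk."""
        if not self.requests:
            return None
        node, count = self.requests.most_common(1)[0]
        benefit = count * BYTES_PER_REQUEST   # remote traffic avoided afterwards
        cost = CHUNK_BYTES                    # bytes copied to the new owner
        return node if benefit > cost else None

t = MigrationTracker()
for _ in range(5):
    t.record(7)                # node 7 keeps fetching/committing this chunk
print(t.migration_target())   # 7: migrating the chunk next to node 7 pays off
```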
Chunks can also be evicted if the private NVM runs out of space. Only redundant and committed chunks can be evicted; dirty chunks must remain in the cache. DSPM gives priority to redundant chunks, since they are not expected to be accessed in the near future. Although evictions never write back data (all committed and redundant chunks are clean), the owner must still be notified so it can update its sharer bit vector. If the number of replicas drops below the threshold N, the owner pushes a copy to another node to maintain the replication level.
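A sketch of this eviction policy under the stated rules (redundant first, then committed, never dirty); the helper names are hypothetical.

```python
from typing import Dict, Optional

def pick_victim(cached: Dict[int, str]) -> Optional[int]:
    """cached maps virtual chunk number -> state; prefer redundant victims."""
    for preferred in ("redundant", "committed"):
        for vchunk, state in cached.items():
            if state == preferred:
                return vchunk
    return None   # only dirty chunks remain, so nothing can be evicted

def evict(vchunk: int, cached: Dict[int, str], notify_owner) -> None:
    # Clean chunks need no write-back, but the owner must be told so that it
    # can clear the sharer bit and, if needed, re-replicate up to N copies.
    notify_owner(vchunk)
    del cached[vchunk]

cache = {1: "dirty", 2: "committed", 3: "redundant"}
victim = pick_victim(cache)                     # 3: the redundant copy goes first
evict(victim, cache, lambda v: print("notify owner of chunk", v))
print(cache)                                    # {1: 'dirty', 2: 'committed'}
```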
Applications make their private updates publicly available by committing dirty data. Note that DSPM does not perform coherence actions on every write to keep all copies up to date, for two reasons: (1) coherence is a relatively long-latency process, which should be performed only sparingly; (2) since DSPM uses shadow paging to buffer dirty data locally, it is unnecessary to propagate every write to remote nodes, as the clean copy is also present on the same node. In practice, most applications have clearly defined "commit points", at which the image is in a consistent state and can be made available to other processes. In addition, on crash recovery, the system state is only recovered to the most recent commit point. Application developers should define commit points based on the semantics of the application and issue commit commands to DSPM via system calls.
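The following sketch shows how an application might use commit points, with `dspm_commit` as a hypothetical stand-in for the commit system call mentioned above (the real interface is not reproduced in this summary).

```python
def dspm_commit(dirty_chunks):
    """Hypothetical wrapper around the commit system call: makes the listed
    dirty chunks globally visible and crash-consistent."""
    print("committing chunks", sorted(dirty_chunks))

# The application batches several updates (which land in local dirty shadow
# chunks) and only declares a commit point once its state is consistent.
RECORDS_PER_CHUNK = 8          # hypothetical layout of records within chunks
touched = set()
for record_id in range(20):
    # ... update the persistent structure for record_id here ...
    touched.add(record_id // RECORDS_PER_CHUNK)

# Commit point: nothing written above is visible to other nodes, or survives
# a crash, until this call returns.
dspm_commit(touched)
```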
The commit process is triggered by a commit command and consists of three phases. In the first phase, the committing node persistently logs the addresses of the dirty chunks to be committed, then sends a commit-prepare message, carrying the updated chunk images, to all owner nodes. In the second phase, each owner node first writes a redo log entry in case the commit crashes, and then notifies all sharers to update their cached copies with the committed data. Sharers can update their copies directly without redo logs (if a sharer holds a dirty copy, the committed data is written as a new redundant copy instead). After all sharers have finished updating, the owner replays the redo log into the chunk image but keeps the log. In the third phase, after the committing node has received replies from all owner nodes, it deletes its own redo log entry and notifies all owners to remove their redo entries as well. Dirty chunks transition to the committed state on the committing node, and locks are released to allow read and write access to the committed data.
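A simplified, single-process simulation of the three phases. Locking, failure handling, and the exact message formats are omitted or assumed; only the ordering (log, then prepare/propagate/apply, then cleanup) follows the description above.

```python
class Owner:
    """Owner node's handling of one chunk during a commit."""
    def __init__(self, chunk_id):
        self.chunk_id = chunk_id
        self.image = b"old"
        self.redo_log = None
        self.sharer_caches = []     # each sharer cache: dict chunk_id -> bytes

    def prepare(self, new_image):
        self.redo_log = new_image           # phase 2a: persist the redo log first
        for cache in self.sharer_caches:    # phase 2b: update every sharer's copy
            cache[self.chunk_id] = new_image
        self.image = self.redo_log          # phase 2c: replay the log into the chunk
        return "ack"                        # the log is kept until phase 3

    def cleanup(self):
        self.redo_log = None                # phase 3: safe to drop the redo log

def commit(committing_log, dirty, owners):
    committing_log.extend(c for c, _ in dirty)          # phase 1: log dirty chunk IDs
    acks = [owners[c].prepare(image) for c, image in dirty]
    assert all(a == "ack" for a in acks)
    committing_log.clear()                              # phase 3: drop the local log,
    for c, _ in dirty:                                  # then tell owners to do the same
        owners[c].cleanup()

owners = {7: Owner(7)}
owners[7].sharer_caches.append({7: b"old"})
commit(committing_log=[], dirty=[(7, b"new")], owners=owners)
assert owners[7].image == b"new" and owners[7].sharer_caches[0][7] == b"new"
print("commit applied at the owner and all sharers")
```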
One of the most notable differences between DSPM and a single-node memory system is that DSPM supports multiple concurrent writers to the same chunk. Under Multiple-Writer, Multiple-Reader (MWMR) mode, these writers do not contend with each other on every write. Instead, due to the shadow-paging update protocol, multiple nodes can hold dirty copies of the chunk that contain different values. At commit time, these copies are sent to the owner node as discussed in the previous paragraph. The protocol must be careful not to create inconsistent orderings when two commits happen concurrently. For example, if nodes A and B both write chunks X and Y, which reside on different owners, and the owner of X applies A's commit last while the owner of Y applies B's commit last, then the final image is not serializable, since A's and B's updates are each only partially visible. The three-phase commit protocol therefore requires that a committing node acquire locks on all involved owner nodes before the commit process begins. A concurrent, conflicting committing node must wait until the current commit finishes before it can initiate its own.
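A sketch of this all-owners locking rule: the data-updating phases run only after every involved owner chunk is locked, and a conflicting committer releases everything and retries. The deterministic lock order and the back-off loop are assumptions, not necessarily the paper's exact mechanism.

```python
import threading
import time

# One lock per (owner node, chunk ID); in the real system these live on owners.
owner_locks = {("nodeX", 1): threading.Lock(), ("nodeY", 2): threading.Lock()}

def try_lock_all(chunks):
    """Lock every involved owner chunk, or lock nothing at all."""
    held = []
    for key in sorted(chunks):                 # deterministic order avoids deadlock
        lock = owner_locks[key]
        if lock.acquire(blocking=False):
            held.append(lock)
        else:
            for h in held:                     # conflict with a concurrent commit:
                h.release()                    # back off and let it finish first
            return None
    return held

def commit_with_locks(chunks, run_commit_phases):
    while (held := try_lock_all(chunks)) is None:
        time.sleep(0.001)                      # wait for the conflicting commit
    try:
        run_commit_phases()                    # the three phases run only while locked
    finally:
        for h in held:
            h.release()

commit_with_locks([("nodeX", 1), ("nodeY", 2)],
                  lambda: print("both owners locked; updates applied atomically"))
```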
DSPM also supports Single-Writer, Multiple-Reader (SWMR) mode. Under this mode, the per-chunk lock is acquired when a chunk is fetched for write. The lock is released only after commit. The first step of the commit process for SWMR can be skipped, as the lock has already been acquired.
Two types of crashes can happen. In the first type, the data stored on a node's NVM devices is corrupted or lost. In this case, the rest of the DSPM cluster identifies chunks that become "ownerless" due to the crash and migrates ownership of these chunks to one of the nodes holding a replica. In the second type, the node loses all progress of the computation, including volatile state in the cache hierarchy and DRAM. In this case, the crashed node's state rolls back to the last consistent commit point as follows (other nodes are not affected). If the node was committing, the commit process is re-executed by replaying the redo log stored in NVM. If the node was one of the owner nodes participating in a commit, it replays its redo log, if one exists, copying the redo image into the chunk to complete the commit. If the node was neither committing nor participating in a commit, recovery simply drops all dirty chunks from the local cache, restoring the state to the consistent image at the last commit point.
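A sketch of this role-dependent recovery decision, using a hypothetical in-memory stand-in for the persistent metadata and logs the recovery handler would actually inspect.

```python
from types import SimpleNamespace

def recover(node) -> None:
    """Recover a crashed node based on what survives in its persistent state."""
    if node.commit_log:
        # The node was in the middle of committing: re-drive the commit by
        # replaying its persistent redo log (placeholder here).
        print("re-executing commit for chunks", node.commit_log)
    elif node.owner_redo_log:
        # The node was participating in someone else's commit as an owner:
        # finish applying the redo image to the owned chunk.
        for chunk_id, image in node.owner_redo_log.items():
            node.owned_chunks[chunk_id] = image
    else:
        # Neither committing nor participating: drop all dirty chunks, which
        # rolls the node back to the last consistent commit point.
        node.cache = {v: s for v, s in node.cache.items() if s != "dirty"}

node = SimpleNamespace(
    commit_log=[],                       # persistent log of a commit in progress
    owner_redo_log={},                   # chunk ID -> redo image, if participating
    owned_chunks={},                     # owner copies in the public NVM
    cache={1: "dirty", 2: "committed"},  # private NVM cache states
)
recover(node)
print(node.cache)   # {2: 'committed'}: uncommitted dirty state was discarded
```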
DSPM assumes that applications organize their data structures as named "datasets". The name of a dataset acts as a handle that other processes sharing the dataset use to open it. DSPM uses a persistent hash table to maintain the mapping between datasets and their names, and it also records the virtual address range a dataset should be mapped into as one of its properties. Since a dataset can be opened on multiple nodes, pointer semantics must be identical across those nodes; DSPM ensures this by always mapping the same dataset to the same virtual address on all nodes. A centralized service, called the Central Dispatcher (CD), manages virtual address allocation for all nodes. A virtual address range is only usable after being granted by the CD, which prevents two datasets from being mapped to the same virtual address.
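A sketch of dataset naming and virtual-address-range assignment through a central dispatcher. The fixed-size, non-overlapping ranges and all names here are assumptions; only the rule that the same dataset maps to the same virtual address everywhere comes from the summary.

```python
class CentralDispatcher:
    """Hands out non-overlapping virtual address ranges, one per dataset, so
    that every node maps a dataset at the same virtual address."""
    def __init__(self, base=0x7000_0000_0000, range_size=1 << 30):
        self.range_size = range_size
        self.next_base = base
        self.datasets = {}   # name -> (start VA, size); persistent in the real system

    def open_dataset(self, name: str, size: int):
        """Return (start VA, size) for `name`, creating the mapping on first open."""
        if name not in self.datasets:
            if size > self.range_size:
                raise ValueError("dataset does not fit in one VA range")
            self.datasets[name] = (self.next_base, size)
            self.next_base += self.range_size   # ranges never overlap across datasets
        return self.datasets[name]

cd = CentralDispatcher()
print(hex(cd.open_dataset("graph", 64 << 20)[0]))   # first open grants a range
print(hex(cd.open_dataset("graph", 64 << 20)[0]))   # re-open returns the same VA
```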