pLock: A Fast Lock for Architectures with Explicit Inter-Core Message Passing
Link: https://dl.acm.org/citation.cfm?doid=3297858.3304030
Year: ASPLOS 2019
Keyword: SW26010; Synchronization; Lock; Message Passing
This paper presents pLock, a distributed locking scheme for high performance computing (HPC) implemented on the SW26010 architecture. Typical HPC architectures use Explicit Message Passing (EMP) for inter-core communication: data shared between cores must be explicitly sent and received with hardware primitives rather than implicitly acquired through a cache coherence protocol. In fact, most HPC platforms do not implement cache coherence or a shared address space for efficiency reasons. As a result, these platforms support the abstraction of critical sections using one or more dedicated lock servers. Before entering a critical section, a core must send a request to the lock server; the request is granted if the lock is currently unheld, or otherwise queued in the lock server's requestor list. When a thread finishes the critical section, it sends an unlock message to the lock server, which either releases the lock or grants it to the next requestor in the list. For N cores each executing S critical sections, this incurs 3NS messages (one request, one grant, and one release per critical section).
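As a concrete reference point, below is a minimal sketch (in C) of this canonical lock-server loop for a single lock. The `emp_send()`/`emp_recv()` wrappers, the message layout, and the queue size are assumptions for illustration only, not SW26010's actual primitives; the point is the three-message request/grant/release pattern per critical section.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_CORES 64

typedef enum { MSG_LOCK, MSG_GRANT, MSG_UNLOCK } msg_type_t;

typedef struct {
    msg_type_t type;
    uint8_t    sender;   /* id of the core that sent this message */
} msg_t;

/* Hypothetical wrappers around the ISA-level send/receive primitives. */
void  emp_send(uint8_t dst_core, const msg_t *m);
msg_t emp_recv(void);

/* Event loop of a dedicated lock-server core (one lock, for brevity). */
void lock_server_loop(uint8_t my_id) {
    bool     held = false;
    uint8_t  queue[MAX_CORES];           /* FIFO of waiting requestors */
    unsigned head = 0, tail = 0;

    for (;;) {
        msg_t m = emp_recv();
        if (m.type == MSG_LOCK) {
            if (!held) {                          /* free: grant immediately */
                held = true;
                msg_t grant = { MSG_GRANT, my_id };
                emp_send(m.sender, &grant);
            } else {                              /* busy: queue the requestor */
                queue[tail++ % MAX_CORES] = m.sender;
            }
        } else if (m.type == MSG_UNLOCK) {
            if (head != tail) {                   /* hand off to the next waiter */
                uint8_t next = queue[head++ % MAX_CORES];
                msg_t grant = { MSG_GRANT, my_id };
                emp_send(next, &grant);
            } else {
                held = false;                     /* no waiters: lock becomes free */
            }
        }
    }
}
```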
This paper observes that this classical EMP scheme is suboptimal for two reasons. First, the 3NS message overhead can be reduced further with a cleverer lock granting scheme. Second, on an architecture with non-uniform communication latency, every lock acquisition and release takes longer for cores that are far from the lock server, which further degrades performance. The paper addresses the first problem by allowing locks to be passed directly between peers, reducing the lock granting message overhead, and the second by adding local lock servers.
The paper is based on the SW26010 platform, which consists of “core groups”. A core group is formed by connecting 64 cores in an 8 * 8 mesh, in which each non-boundary core is directly connected to its neighbors in the grid. Although each core has its own private and shared cache, no cache coherence is implemented within the core group. Cores explicitly send and receive messages using hardware primitives built into the ISA. In addition, no routing is implemented on the on-chip network; consequently, a core can only communicate directly with other cores in the same row or the same column. Each message transferred on the network is 256 bits long.
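Because every transfer is a single 256-bit message, a lock protocol message has to be compact. The struct below is a purely illustrative layout, not the paper's wire format; it only shows that a message type, core ids, a lock id, and even a 64-bit bitmap of waiting cores (one bit per core in a 64-core group) comfortably fit in one message.

```c
#include <stdint.h>

/* Illustrative 256-bit lock message layout (assumed, not from the paper). */
typedef struct {
    uint8_t  type;       /* LOCK / GRANT / UNLOCK                         */
    uint8_t  sender;     /* requesting or granting core id (0..63)        */
    uint16_t lock_id;    /* which lock this message refers to             */
    uint32_t reserved;
    uint64_t waiters;    /* bitmap of waiting cores (used by the chaining
                            scheme described below)                       */
    uint64_t pad[2];     /* fill the rest of the 256-bit message          */
} lock_msg_t;            /* 32 bytes = 256 bits */

_Static_assert(sizeof(lock_msg_t) == 32, "must fit one 256-bit message");
```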
The paper solves the first problem with a technique called the “chaining lock”. As in the canonical EMP lock scheme, the lock server processes lock requests from clients and maintains a list of lock requestors. Initially the list is empty; as lock requests arrive, it gradually grows. The trick of the chaining lock is that when the lock is granted to one of the requestors in the list, the lock server piggybacks the list of the other waiting clients onto the grant message. When the current lock holder finishes the critical section, instead of sending a lock release message back to the lock server, it selects one client from the waiting list and passes the rest of the list down to that client together with a lock granted message. On receiving the grant message from its peer, the selected client can enter the critical section as if the lock had been granted by the master lock server. This scheme essentially combines the lock release and lock grant sequence into one message, saving (NS - 1) messages (there are (NS - 1) lock passings, and each saves one message). The net result is that only (2NS + 1) messages are sent with the chaining lock (NS requests, one initial grant from the server, (NS - 1) peer-to-peer grants, and one final release), an almost 33% reduction.
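A hedged sketch of the client-side release under this scheme follows: if the grant carried a non-empty waiter list, the holder forwards the grant (with the shrunken list) directly to a peer; only when the list is empty does a release go back to the master server. The names, the bitmap encoding of the waiter list, and the "pick the lowest-numbered waiter" policy are assumptions for illustration.

```c
#include <stdint.h>

/* Illustrative message type; the real wire format is not specified here. */
typedef enum { MSG_GRANT, MSG_UNLOCK } chain_type_t;

typedef struct {
    chain_type_t type;
    uint8_t      sender;
    uint64_t     waiters;   /* bitmap of cores still waiting for this lock */
} chain_msg_t;

void emp_send(uint8_t dst_core, const chain_msg_t *m);  /* hw primitive stub */

/* Leave the critical section the "chaining" way: if any waiters were
 * piggybacked onto our grant, forward the lock (and the remaining list)
 * straight to one of them; otherwise return the lock to the server. */
void chained_unlock(uint8_t my_id, uint8_t server_id, uint64_t waiters) {
    chain_msg_t m = { .sender = my_id };

    if (waiters != 0) {
        uint8_t next = (uint8_t)__builtin_ctzll(waiters);  /* pick a waiter */
        m.type    = MSG_GRANT;
        m.waiters = waiters & ~(1ULL << next);  /* pass the rest of the list on */
        emp_send(next, &m);        /* release + grant combined in one message */
    } else {
        m.type    = MSG_UNLOCK;    /* nobody waits: tell the master server */
        m.waiters = 0;
        emp_send(server_id, &m);
    }
}
```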
The second problem is solved with another technique called the hierarchical lock. The observation is that cores that are close to each other can communicate with lower latency. If cores are grouped into small clusters based on distance, then lock passing within a cluster can be more efficient than going through a global lock server. To achieve this, a few local lock servers are designated in addition to the master lock server. Each core cluster has a local lock server, to which all local requests are sent. Local lock servers maintain a lock table just like the master lock server. On receiving a lock request, the local server checks its lock table to see whether the lock has been granted to any other client in the cluster. If so, the local server adds the requestor to the lock's waiting list, and when the lock is released, it grants the lock directly to the next requestor without having to contact the master lock server. On the other hand, if the local lock server does not have the lock in its table, it sends a lock request to the master lock server as if the local server were a client. Once the lock is granted by the master server, it can be further granted by the local lock server to clients in the cluster. In this case, clients in the cluster can acquire the lock without sending a message to the faraway master server, benefiting from the lower local communication latency.
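The request-handling path of a local lock server might look roughly like the sketch below. The table layout, helper names, and the choice of core 0 as the master server are assumptions for illustration, not the paper's implementation.

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_LOCKS 16

typedef enum { MSG_LOCK, MSG_GRANT, MSG_UNLOCK } msg_type_t;
typedef struct { msg_type_t type; uint8_t sender; uint16_t lock_id; } msg_t;

void emp_send(uint8_t dst_core, const msg_t *m);   /* hw primitive stub */

typedef struct {
    bool     owned_by_cluster;   /* has the master granted this lock to us?  */
    bool     busy;               /* is a local client inside the C.S.?       */
    bool     requested;          /* is a request to the master outstanding?  */
    uint64_t waiters;            /* bitmap of local cores waiting            */
} local_entry_t;

static local_entry_t local_table[MAX_LOCKS];
static const uint8_t MASTER_SERVER = 0;   /* assumed id of the master server */

void handle_local_request(const msg_t *req, uint8_t my_id) {
    local_entry_t *e = &local_table[req->lock_id];

    if (e->owned_by_cluster && !e->busy) {
        /* The cluster already holds the lock and nobody is in the critical
         * section: grant locally, no message leaves the cluster. */
        e->busy = true;
        msg_t grant = { MSG_GRANT, my_id, req->lock_id };
        emp_send(req->sender, &grant);
    } else if (e->owned_by_cluster) {
        /* Held by another local client: queue until the local release. */
        e->waiters |= 1ULL << req->sender;
    } else {
        /* Not owned by this cluster: remember the waiter and, if not already
         * done, ask the master server for the lock as if we were a client. */
        e->waiters |= 1ULL << req->sender;
        if (!e->requested) {
            e->requested = true;
            msg_t up = { MSG_LOCK, my_id, req->lock_id };
            emp_send(MASTER_SERVER, &up);
        }
    }
}
```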
Although hierarchical locking does not decrease the total number of messages sent over the network, it does localize them, so that a core communicates mostly with nearby cores. In addition, the paper suggests applying the chaining lock to this scheme as well, so that locks can be passed between clients within a cluster, and also between local lock servers, without the redundant release messages, which results in both shorter latency and fewer messages.