Lightweight Locking for Main Memory Database Systems
Link: https://dl.acm.org/citation.cfm?id=2448947
Year: VLDB 2012
Keyword: Very Lightweight Locking; VLL
Classical lock-based transactional systems often suffer from a performance bottleneck caused by contention at the lock manager. Locks are essential for systems that use locking protocols such as Two-Phase Locking (2PL) to produce serializable schedules. The paper observes that the lock manager is typically implemented as a centralized hash table: lock entries are hashed into buckets, and each entry maintains lists of the transactions currently holding the lock as well as those blocked by it. To protect the consistency of the lock table itself, each bucket and lock entry carries a latch, which serializes insert, delete, and read operations on the table. For a main-memory database deployed on a multicore platform, such a centralized lock manager is neither efficient nor scalable. One or more linked lists must be traversed to find a lock entry and the identities of the threads related to the lock, which costs cycles; worse, the linked structure is not cache friendly and is prone to high cache miss ratios. As for scalability, the hash table itself is a centralized structure guarded by lightweight latches, and worker threads must acquire and release these latches every time the lock table is accessed. This causes frequent cache line invalidations and hence degrades performance.
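For contrast with VLL below, here is a schematic of the classical lock-manager layout just described: a latched hash table whose entries chain holder and waiter lists. This is only an illustrative sketch, not the paper's code; all type and field names are made up.

```cpp
#include <cstdint>
#include <list>
#include <mutex>
#include <unordered_map>

enum class LockMode { SHARED, EXCLUSIVE };

struct LockRequest {
    uint64_t txn_id;   // transaction holding or awaiting the lock
    LockMode mode;
};

struct LockEntry {
    std::mutex latch;                 // serializes access to this entry
    std::list<LockRequest> granted;   // current holders: pointer-chasing,
    std::list<LockRequest> waiting;   // cache-unfriendly linked lists
};

struct LockManager {
    std::mutex table_latch;                         // one per bucket in real systems
    std::unordered_map<uint64_t, LockEntry> table;  // lock key -> lock entry
};
```

Every lock acquisition walks these lists under a latch, which is exactly the per-request overhead VLL sets out to remove.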
This paper tackles the lock manager's efficiency and scalability problems with per-tuple very lightweight locks (VLL). As the first step, VLL eliminates the lock manager entirely. Instead, each data item (e.g., a row or tuple) is extended with a small lock field that is invisible to users. This removes the need to traverse a linked list on every lock request and also reduces the probability of expensive cache misses. The next step is to eliminate the linked structures that record the current holders of a lock and the threads waiting for it. VLL uses two counters per lock: one counts transactions that have requested the lock in shared mode, the other counts exclusive-mode requests. Note that even though an exclusive lock can be held by at most one thread at any moment, a single bit is not enough, because, as explained later, each counter tracks both active and waiting requests for that mode. In practice, the two counters can be packed into a single 64-bit integer, split evenly. A thread requesting the lock in either shared or exclusive mode atomically increments the corresponding counter (e.g., via a Compare-and-Swap (CAS) loop or fetch-and-add) and reads the value from before the increment. If that prior value indicates the lock can be granted (no exclusive requests for a shared request; no requests at all for an exclusive request), the thread proceeds without blocking. Otherwise, the request returns failure. The next paragraph describes how VLL cooperates with the concurrency control layer to provide a lightweight yet efficient locking service.
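A minimal sketch of the per-tuple counter pair, assuming the two counts are packed into one 64-bit word (exclusive count in the high half, shared count in the low half). The type and method names are illustrative; the paper only specifies the counter semantics, and `fetch_add` here stands in for the CAS-style atomic increment described above.

```cpp
#include <atomic>
#include <cstdint>

// Per-tuple lock word: CX (exclusive requests) in the high 32 bits,
// CS (shared requests) in the low 32 bits.
struct VLLLock {
    std::atomic<uint64_t> word{0};

    static constexpr uint64_t CX_ONE = uint64_t{1} << 32;
    static constexpr uint64_t CS_ONE = 1;

    // Request in shared mode; true means granted immediately.
    bool try_lock_shared() {
        uint64_t prev = word.fetch_add(CS_ONE, std::memory_order_acq_rel);
        return (prev >> 32) == 0;   // granted iff no exclusive requests existed
    }

    // Request in exclusive mode; true means granted immediately.
    bool try_lock_exclusive() {
        uint64_t prev = word.fetch_add(CX_ONE, std::memory_order_acq_rel);
        return prev == 0;           // granted iff no requests of any kind existed
    }

    void unlock_shared()    { word.fetch_sub(CS_ONE, std::memory_order_acq_rel); }
    void unlock_exclusive() { word.fetch_sub(CX_ONE, std::memory_order_acq_rel); }
};
```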
VLL requires that transactions pre-declare their lock sets before execution. This can be achieved by speculatively running the transaction without concurrency control and gathering the set of locks it touches during that run. If a lock outside the declared set is requested during normal execution, the transaction must abort and retry with a more conservative (larger) lock set. The following text assumes transactions already know their lock sets. Before a transaction is submitted for execution, it first enters a critical section, in which it enqueues itself into a global transaction queue. The order in which transactions enter the queue defines the order in which their locks should be granted and released. Since transactions only access the queue inside the critical section, the global ordering is well-defined and no race condition can occur. Still within the critical section, the transaction requests every lock in its lock set as described in the previous paragraph; note that the counters are incremented even for requests that fail, so a blocked transaction's requests stay registered. If any lock request fails, the transaction exits the critical section and blocks itself. If all locks are acquired successfully, the transaction exits the critical section and is submitted for execution.
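The admission step might look like the following sketch, which reuses the VLLLock type above; `Txn`, `queue_latch`, and `txn_queue` are hypothetical names for the pre-declared lock sets, the critical-section latch, and the global queue.

```cpp
#include <deque>
#include <mutex>
#include <vector>

struct Txn {
    std::vector<VLLLock*> read_set;    // locks requested in shared mode
    std::vector<VLLLock*> write_set;   // locks requested in exclusive mode
    bool blocked = false;
};

std::mutex queue_latch;     // guards the critical section around the queue
std::deque<Txn*> txn_queue; // global queue; position defines lock priority

// Returns true if txn got all its locks and can run now; otherwise it
// remains in the queue marked blocked, its requests still registered.
bool begin_transaction(Txn* txn) {
    std::lock_guard<std::mutex> cs(queue_latch);
    txn_queue.push_back(txn);                // fixes the global order
    bool granted = true;
    for (VLLLock* l : txn->write_set)
        granted &= l->try_lock_exclusive();  // counter bumps even on failure
    for (VLLLock* l : txn->read_set)
        granted &= l->try_lock_shared();
    txn->blocked = !granted;
    return granted;
}
```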
When a transaction finishes, it releases the locks in its lock set by atomically decrementing the corresponding counters. Releasing locks this way, however, does not wake blocked threads: they never subscribed to an unlock event and have no way to be notified. VLL deals with this problem by trying to unblock the transaction at the head of the global queue. The reasoning is that, since transactions are ordered by the queue, the transaction at the head can always be scheduled: no other transaction could be holding a lock it is waiting for. Deadlock is also not a problem, because transactions request their entire lock sets under the protection of the critical section. The only way to become blocked is to develop forward dependencies on transactions that are already ahead in the queue, and those transactions are always unblocked earlier than the current one, so a dependency cycle, and hence a deadlock, cannot form.
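A matching release-side sketch under the same assumptions; the `schedule()` hook is hypothetical, standing in for whatever mechanism hands a runnable transaction to a worker thread.

```cpp
#include <algorithm>

void finish_transaction(Txn* txn) {
    std::lock_guard<std::mutex> cs(queue_latch);
    for (VLLLock* l : txn->write_set) l->unlock_exclusive();
    for (VLLLock* l : txn->read_set)  l->unlock_shared();
    txn_queue.erase(std::find(txn_queue.begin(), txn_queue.end(), txn));
    // The new head waits on no one ahead of it, so if it is blocked
    // it is now safe to run.
    if (!txn_queue.empty() && txn_queue.front()->blocked) {
        txn_queue.front()->blocked = false;
        // schedule(txn_queue.front());   // hypothetical scheduler hand-off
    }
}
```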
The paper suggests that the size of the global queue should not exceed a certain threshold. If too many transactions are waiting to be scheduled, a transaction entering the queue later will find it difficult to be scheduled immediately, because it must not conflict with any transaction already in the queue. In practice, the system uses the number of blocked transactions in the queue as the indicator: once it exceeds a threshold, the system stops admitting new transactions and instead prioritizes finding transactions that can be unblocked. The method for finding such transactions is introduced in the next paragraph.
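The throttle itself is a simple policy; a sketch under the same assumptions (the threshold value is arbitrary, and a real system would maintain the blocked count incrementally rather than rescanning the queue):

```cpp
#include <cstddef>

constexpr std::size_t kMaxBlocked = 100;  // arbitrary tuning knob

// Caller holds queue_latch.
bool should_admit_new_txn() {
    std::size_t blocked = 0;
    for (Txn* t : txn_queue) blocked += t->blocked ? 1 : 0;
    return blocked <= kMaxBlocked;
}
```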
Selective Contention Analysis (SCA) is proposed as an optimization of VLL that allows transactions to be unblocked early. An SCA thread searches from the head of the queue for a transaction that is currently blocked but is actually ready to execute. This is possible when every lock the transaction is waiting for has been released, yet the transaction has not reached the head of the queue. Recall that all transactions are ordered by the queue, and dependencies only point from a transaction toward transactions closer to the head. To identify a transaction with no outstanding dependency, we only need to check its lock set against those of the transactions ahead of it in the queue: its exclusive requests must overlap no earlier request of either mode, and its shared requests must overlap no earlier exclusive request. To avoid expensive exact set unions and intersections, Bloom filters are used to summarize which locks are currently requested. The SCA thread scans the queue from the head, accumulating the lock sets of the transactions it passes into two Bloom filters, one for shared locks and one for exclusive locks. Each blocked transaction under examination intersects its own lock set's Bloom filter with the cumulative filters via bit-wise AND; if the relevant intersections are empty, it can be unblocked. Bloom filters admit false positives, so a transaction may remain blocked even though it is eligible for scheduling. This, however, only affects performance, not correctness. As an optimization, since a transaction's lock set never changes, the SCA thread can cache the two Bloom filters for each transaction in the queue once they are computed, avoiding the overhead of re-hashing the locks in the lock set.
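A toy rendering of the SCA scan, reusing the `Txn` and `txn_queue` names above. The single-hash, fixed-size "Bloom filter" here is a deliberate simplification (a real implementation would size the filters and choose hash functions carefully), but the conflict test matches the rule stated above.

```cpp
#include <bitset>
#include <functional>

// Degenerate single-hash Bloom filter over lock addresses.
struct Bloom {
    std::bitset<1024> bits;
    static std::size_t slot(const VLLLock* l) {
        return std::hash<const VLLLock*>{}(l) % 1024;
    }
    void add(const VLLLock* l)               { bits.set(slot(l)); }
    bool may_contain(const VLLLock* l) const { return bits.test(slot(l)); }
};

// Caller holds queue_latch. Returns a blocked transaction that conflicts
// with nothing ahead of it in the queue, or nullptr if none is found.
Txn* find_runnable() {
    Bloom shared_ahead, excl_ahead;   // cumulative filters over earlier txns
    for (Txn* t : txn_queue) {
        if (t->blocked) {
            bool conflict = false;
            for (VLLLock* l : t->write_set)   // writes conflict with any earlier request
                conflict |= shared_ahead.may_contain(l) || excl_ahead.may_contain(l);
            for (VLLLock* l : t->read_set)    // reads conflict with earlier writes only
                conflict |= excl_ahead.may_contain(l);
            if (!conflict) return t;  // false positives keep it blocked; never unsafe
        }
        for (VLLLock* l : t->read_set)  shared_ahead.add(l);
        for (VLLLock* l : t->write_set) excl_ahead.add(l);
    }
    return nullptr;
}
```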