MaaT: Effective and Scalable Coordination of Distributed Transactions in the Cloud
Link: http://www.vldb.org/pvldb/vol7/p329-mahmoud.pdf
Year: VLDB 2014
Keyword: MaaT; Concurrency Control; OCC; Interval-Based CC
Conflict Resolution: N/A
Version Management: N/A
R/W Bookkeeping: Software
This paper proposes MaaT, a distributed concurrency control protocol that uses interval-based validation. Transactions are logically ordered by timestamps, which are integers whose values denote the serialization order of transactions. Each transaction in the system carries two timestamps: a lower bound (lb), representing the earliest time at which the transaction could commit, and an upper bound (ub), representing the latest time at which it could commit. As the transaction reads or writes data items, the CC manager adjusts both lb and ub such that the ordering between the current transaction and other transactions is consistent with the read and write operations. If at any point during execution the transaction's lb and ub cross, i.e. the lb becomes larger than the ub, an ordering conflict is detected: no serial schedule can order the current transaction consistently with other transactions given the existing execution history. The current transaction must abort, roll back all changes it has made, and retry. If the interval [lb, ub) is still valid at the end of the transaction, the CC manager picks an arbitrary timestamp in this range as the commit timestamp (ct) of the transaction and commits all changes. Data items are tagged with two timestamps: a read timestamp (rts), the commit timestamp of the latest committed transaction that read the item, and a write timestamp (wts), the commit timestamp of the latest committed transaction that wrote the item. Both timestamps are updated when a transaction commits.
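The following is a minimal sketch of this interval bookkeeping, assuming simplified in-memory structures; the field names lb, ub, rts and wts follow the summary, while the classes and helper functions are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of interval-based validation state (not the paper's code).
import math
from dataclasses import dataclass

@dataclass
class Transaction:
    tid: int
    lb: float = 0.0          # earliest timestamp at which the txn may commit
    ub: float = math.inf     # latest timestamp at which the txn may commit

    def interval_valid(self) -> bool:
        # The commit interval [lb, ub) is non-empty only while lb < ub.
        return self.lb < self.ub

@dataclass
class DataItem:
    value: object = None
    rts: float = 0.0         # commit ts of the latest committed reader
    wts: float = 0.0         # commit ts of the latest committed writer

def on_read(txn: Transaction, item: DataItem):
    # Serialize the reader after the latest committed writer of the item.
    txn.lb = max(txn.lb, item.wts)
    return item.value

def choose_commit_ts(txn: Transaction) -> float:
    # Any timestamp inside [lb, ub) works; pick the lower bound for simplicity.
    assert txn.interval_valid()
    return txn.lb
```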
Under a distributed setting, the database is divided into multiple partitions, each hosted by a backend data server. User requests are forwarded to one of the front-end servers, which implement the concurrency control logic and are responsible for issuing data requests to the backend. To reduce synchronization, each backend server only maintains state for its own partition; global transaction state is exchanged over the links between servers. Compared with a non-distributed database, the network connection and latency between servers pose a different set of problems and hence shift the overall design goals. For example, instead of minimizing data transfer between processing nodes as in a non-distributed architecture, distributed designs focus more on reducing the number of communication round trips. This paper proposes five design goals that a distributed database CC scheme should generally meet: high throughput, efficient CPU utilization, scalability, no thrashing, and liveness.
One of the important features of MaaT is the use of soft locks. Like ordinary mutual exclusion locks, soft locks can be acquired in two modes: read (shared) mode and write (exclusive) mode. Unlike ordinary locks, however, soft locks never block a transaction when a conflicting mode is requested. Instead, the list of conflicting lock owners is returned, which serves as the basis for conflict resolution during validation. Soft locks are acquired in a two-phase manner as in S2PL; once the transaction commits or aborts, all of its soft locks are released.
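A minimal sketch of such a non-blocking lock is given below: acquisition never waits, it simply records the new owner and hands back the conflicting owners observed at that moment. The class and field names are illustrative assumptions.

```python
# Illustrative soft lock: never blocks, returns conflicting owners instead.
from dataclasses import dataclass, field

@dataclass
class SoftLock:
    readers: set = field(default_factory=set)   # TIDs holding the read lock
    writers: set = field(default_factory=set)   # TIDs holding the write lock

    def acquire_read(self, tid: int) -> set:
        conflicts = set(self.writers)            # uncommitted writers, UW(x)
        self.readers.add(tid)                    # never blocks
        return conflicts

    def acquire_write(self, tid: int) -> tuple:
        ur = set(self.readers)                   # uncommitted readers, UR(y)
        uw = set(self.writers)                   # uncommitted writers, UW(y)
        self.writers.add(tid)                    # never blocks
        return ur, uw

    def release(self, tid: int):
        self.readers.discard(tid)
        self.writers.discard(tid)
```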
Another important feature is the transaction table. Every data server has a transaction table which records the status of transactions on its partition. Each table only maintains transaction status that is consistent with the accesses observed locally on that server, and is not kept coherent across servers. This way, expensive broadcasts during the validation stage, which some earlier interval-based distributed protocols require, are avoided.
The details of the CC scheme are as follows. On system initialization, all data items have their wts and rts set to zero; no centralized counter dispensing timestamps is required. When a transaction begins, its lb and ub are initialized to 0 and +∞ respectively, and the transaction is assigned a unique transaction ID (TID) as its system-wide identifier. A new entry is also created in the transaction table, but only at the server responsible for running CC for the transaction; entries on other servers are created lazily, once the transaction becomes known to those servers.
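A sketch of the per-server transaction table and transaction begin is shown below, continuing the illustrative structures from the earlier sketches; the entry layout, the state constants, and the lazy creation helper are assumptions made for illustration.

```python
# Illustrative transaction-table bookkeeping on one data server.
import math
from dataclasses import dataclass

RUNNING, VALIDATED, COMMITTED, ABORTED = range(4)

@dataclass
class TxnEntry:
    tid: int
    lb: float = 0.0
    ub: float = math.inf
    state: int = RUNNING

class DataServer:
    def __init__(self):
        self.txn_table = {}     # tid -> TxnEntry, reflects local accesses only
        self.items = {}         # key -> data item (value, rts, wts, soft lock)

    def ensure_entry(self, tid: int) -> TxnEntry:
        # Entries are created lazily, the first time this server hears of the txn.
        return self.txn_table.setdefault(tid, TxnEntry(tid))

def begin_transaction(coordinator: DataServer, tid: int) -> TxnEntry:
    # lb = 0, ub = +inf; no centralized timestamp counter is consulted.
    return coordinator.ensure_entry(tid)
```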
On a transactional read, the client server sends a read request to the destination server. The destination server adds the transaction to its transaction table if it is not already there, then returns the value of the data item together with the item's wts and a list of conflicting lock owners denoted UW(x). The data item is also read locked by the reading transaction. By reading the wts of the data item, the transaction serializes after committed transactions that have written the item, since their value has been observed. By returning the list of write lock holders, the transaction serializes before uncommitted writers of the item, because their uncommitted values have not been observed. By read locking the data item, the transaction guarantees that later uncommitted writes will also be serialized after it when the writer transaction validates. After receiving the response from the destination server, the lb of the transaction is adjusted to the maximum of the current lb and the returned wts.
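A sketch of this read path, under the illustrative structures from the earlier sketches (soft-locked items with a wts, transactions carrying an lb and a per-key UW set), might look as follows; the message layout is an assumption, not the paper's wire format.

```python
# Illustrative read handling: server side and client-side bound adjustment.
def server_handle_read(server, tid, key):
    server.ensure_entry(tid)                 # lazily add the txn to the local table
    item = server.items[key]
    uw = set(item.lock.writers)              # UW(x): current uncommitted writers
    item.lock.readers.add(tid)               # soft read lock, never blocks
    return item.value, item.wts, uw          # reply sent to the client server

def client_after_read(txn, key, value, wts, uw):
    txn.lb = max(txn.lb, wts)                # serialize after committed writers
    txn.uw[key] = uw                         # kept for the pre-write message
    return value
```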
Transactional write operations are processed locally: they are buffered in a write queue and forwarded to later transactional reads that hit the same data item. No communication with the remote server is required for writes.
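A small sketch of this local buffering with read-your-own-writes behavior is below; the class and method names are illustrative assumptions.

```python
# Illustrative per-transaction write buffer: writes stay local until pre-write.
class WriteBuffer:
    def __init__(self):
        self.pending = {}                    # key -> buffered (uncommitted) value

    def write(self, key, value):
        self.pending[key] = value            # no remote message is sent

    def read_through(self, key, remote_read):
        # Later reads of the same key are served from the buffer ("forwarded");
        # otherwise fall back to the remote read path.
        if key in self.pending:
            return self.pending[key]
        return remote_read(key)
```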
After the transaction body finishes execution, the client server validates the transaction before committing. A pre-write message is sent to the data servers that the transaction has accessed. The message consists of the transaction's write set and all UW(x) sets collected during the read phase. The data server handles the pre-write message as follows. First, for every item in the write set, the server fetches the corresponding rts and wts; the write synchronizes with committed reads and writes by adjusting the transaction's lb to the maximum of the rts, the wts, and the old lb. Second, the server collects all conflicting lock owners for elements in the write set: uncommitted readers (i.e. transactions that have read locked the item), denoted UR(y), and uncommitted writers (i.e. transactions that have write locked the item), denoted UW(y). Finally, the server write locks the data item so that later reads and writes from other transactions synchronize with it.
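The per-key handling of the pre-write message could be sketched as follows, reusing the illustrative item, lock, and table structures from the earlier sketches; the function decomposition is an assumption for readability.

```python
# Illustrative pre-write handling for a single key of the write set.
def server_handle_prewrite_key(server, tid, key):
    entry = server.ensure_entry(tid)
    item = server.items[key]

    # 1. Synchronize with committed reads and writes of the item.
    entry.lb = max(entry.lb, item.rts, item.wts)

    # 2. Collect conflicting soft-lock owners for use during validation.
    ur = set(item.lock.readers)              # UR(y): uncommitted readers
    uw = set(item.lock.writers)              # UW(y): uncommitted writers

    # 3. Soft write lock the item so later accesses synchronize with this txn.
    item.lock.writers.add(tid)
    return ur, uw
```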
The data server then proceeds to validate the transaction. The validation algorithm in MaaT tries to reduce aborts by serializing transactions whenever possible instead of forcing an abort whenever a conflict is detected. The serialization consists of three parts. First, the current transaction serializes before uncommitted writers of data items it has read. This is achieved by checking the status of the transactions in the UW(x) sets: if any of them is committed or validated, the ub of the current transaction is adjusted to the minimum of the lb of the writer transaction and the old ub of the current transaction; if a writer is still uncommitted, it is put into a per-transaction set, after(T), so the adjustment can be done later. The second part serializes the current transaction after all uncommitted readers of data items it intends to write. Analogously, the status of transactions in UR(y) is checked: if they have committed or validated, the lb of the current transaction is adjusted to the maximum of the ub of the reader transaction and the old lb of the current transaction; uncommitted readers are added to another per-transaction set, before(T). The third part serializes writes against writes of other transactions. Since write operations do not observe each other's values, conflicting writes can be serialized in arbitrary order. MaaT uses the heuristic that if a conflicting transaction has already committed its write, the current transaction serializes after it and performs its write after validation; if the conflicting transaction has not yet committed when the current transaction validates, we assume the former is lagging behind and serialize it after the latter, putting the conflicting transaction into the after(T) set as well.
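The three-part adjustment could be sketched as below, operating on a server's local transaction table and reusing the state constants from the earlier sketch. Deferring the adjustment of still-uncommitted peers via before(T)/after(T) follows the summary; the concrete data layout is an assumption.

```python
# Illustrative validation of txn `tid` on one data server.
def validate(server, tid, uw_read, ur_write, uw_write):
    me = server.txn_table[tid]
    before, after = set(), set()

    # 1. Serialize before uncommitted writers of items this txn has read.
    for other in uw_read:
        peer = server.txn_table.get(other)
        if peer is None:
            continue
        if peer.state in (VALIDATED, COMMITTED):
            me.ub = min(me.ub, peer.lb)
        else:
            after.add(other)                 # adjust the peer after we commit

    # 2. Serialize after uncommitted readers of items this txn will write.
    for other in ur_write:
        peer = server.txn_table.get(other)
        if peer is None:
            continue
        if peer.state in (VALIDATED, COMMITTED):
            me.lb = max(me.lb, peer.ub)
        else:
            before.add(other)

    # 3. Write-write conflicts: serialize after committed writers, before the rest.
    for other in uw_write:
        peer = server.txn_table.get(other)
        if peer is None:
            continue
        if peer.state == COMMITTED:
            me.lb = max(me.lb, peer.ub)
        else:
            after.add(other)

    return me.lb < me.ub, before, after
```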
After validation, the data server checks whether the lb and the ub of the current transaction cross. If they do, the transaction cannot commit. Otherwise, a commit interval is chosen as the range of valid commit timestamps. To avoid aborting running transactions too aggressively by directly claiming the whole [lb, ub) range, the data server chooses a sub-range within [lb, ub) and sends it back to the client server in a commit message. The state of the transaction is also changed to VALIDATED so it will not be aborted by the validation of other transactions.
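This final step might look like the sketch below, again reusing the state constants from the earlier sketches. How the sub-range within [lb, ub) is narrowed is a policy decision; the simple split used here is purely illustrative.

```python
# Illustrative end-of-validation step on one data server.
import math

def finish_validation(entry):
    if entry.lb >= entry.ub:
        entry.state = ABORTED
        return None                              # no valid commit timestamp exists
    entry.state = VALIDATED                      # protects us from others' validations
    lo = entry.lb
    # Claim only part of the interval instead of the whole [lb, ub).
    hi = lo + 1 if entry.ub == math.inf else (lo + entry.ub) / 2
    return (lo, hi)                              # proposed commit range for this server
```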
After receiving the commit message from all data servers, the client server performs the last step of validation by intersecting the ranges from all servers. If the final range is still non-empty, the transaction is ready to commit. The client server sends a commit confirmation message, consisting of the read set and write set of the transaction and the final commit timestamp (selected from the range), to the data servers. On receiving the message, a data server applies the writes to its data items, adjusts their wts and rts using the commit timestamp, and sets the lb and ub of the current transaction in its transaction table to the commit timestamp. The lb and ub of transactions in before(T) and after(T) are also adjusted accordingly, to reflect the fact that the dependencies are materialized once the transaction has committed. All soft locks acquired by the transaction are released.
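A sketch of the final commit step is below: the client server intersects the returned ranges, and each data server installs the writes, updates its metadata, clamps the bounds of the peers recorded in before(T)/after(T), and releases the soft locks. The split into these two helpers and their parameter lists are assumptions.

```python
# Illustrative commit: range intersection (client server) and apply (data server).
def intersect_ranges(ranges):
    lo = max(r[0] for r in ranges)
    hi = min(r[1] for r in ranges)
    return (lo, hi) if lo < hi else None         # None means the txn must abort

def server_apply_commit(server, tid, read_set, write_set, ct, before, after):
    entry = server.txn_table[tid]
    entry.lb = entry.ub = ct                     # the committed position is now fixed
    entry.state = COMMITTED
    for key in read_set:
        item = server.items[key]
        item.rts = max(item.rts, ct)
        item.lock.readers.discard(tid)           # release soft read lock
    for key, value in write_set.items():
        item = server.items[key]
        item.value = value
        item.wts = max(item.wts, ct)
        item.lock.writers.discard(tid)           # release soft write lock
    # Clamp peers recorded during validation so the dependencies keep holding.
    for other in before:                         # peers that must commit before us
        peer = server.txn_table.get(other)
        if peer:
            peer.ub = min(peer.ub, ct)
    for other in after:                          # peers that must commit after us
        peer = server.txn_table.get(other)
        if peer:
            peer.lb = max(peer.lb, ct + 1)
```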
The entry in the transaction table cannot be deleted immediately after the transaction commits, because UW(x), UR(y) and UW(y) do not include all dependencies the transaction has established with other uncommitted transactions. As an example, consider the case where another transaction T' write locks an item just after transaction T read locked it. T knows nothing about T' when it performs validation; the serialization between T and T' on this data item is actually performed by T'. If T removed its entry right after committing, T' would not be able to find T's entry when it validates, leading to incorrect serialization because the WAR dependency between T' and T would not be enforced. To solve this garbage collection problem, transaction T' either increments a reference counter on T's entry when it locks the data item in a conflicting mode, or registers itself in T's transaction state. If reference counting is used, T's entry can be safely recycled once the counter drops to zero after T commits (after T releases its soft locks there can be no further increments). If T' registers the dependency explicitly, then in the last step of T's commit sequence, T adjusts the lb and ub of the registered transactions to be consistent with the dependency and then immediately removes its entry from the table.
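The reference-counting variant could be sketched as follows: a transaction that acquires a conflicting soft lock pins the owner's entry, and the entry is reclaimed only when the pin count drops to zero after the owner commits. The class and method names are assumptions for illustration.

```python
# Illustrative reference-counted garbage collection of transaction-table entries.
class GCEntry:
    def __init__(self, tid):
        self.tid = tid
        self.pins = 0
        self.committed = False

class GCTable:
    def __init__(self):
        self.entries = {}

    def pin(self, tid):
        # Called by a transaction that acquired a conflicting soft lock on an
        # item still locked by `tid`; keeps tid's entry alive past its commit.
        self.entries.setdefault(tid, GCEntry(tid)).pins += 1

    def unpin(self, tid):
        e = self.entries.get(tid)
        if e is None:
            return
        e.pins -= 1
        self._maybe_reclaim(e)

    def mark_committed(self, tid):
        e = self.entries.get(tid)
        if e is not None:
            e.committed = True
            self._maybe_reclaim(e)

    def _maybe_reclaim(self, e):
        # Safe once the owner has committed and released its soft locks: no new
        # pins can arrive, so a zero count means no validator still needs the entry.
        if e.committed and e.pins == 0:
            del self.entries[e.tid]
```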