Why 2PC
Databases usually run on a single machine. Single instance databases are not scalable because they cannot be horizontally scaled to handle more load when they hit a vertical scaling ceiling. Distributed Databases emerged to solve this by allowing a single logical database run on multiple machines.
Relational Databases typically provide Database Transaction guarantee. Multiple commands executed within a database transaction become an atomic unit - they either succeed or fail as a whole.
In distributed databases, data needs to be broken into different partitions, and each partition will be stored on a separate machine, through a process called Sharding. This complicates transaction management as the atomicity has to now span multiple machines, often over unreliable network.
Two phase commit is therefore devised to allow distributed databases to offer users the ability to use transaction, even if the commands require access to data in different machines.
Two Phase Commit
There are two main actors in a two phase commit (2PC) process - a coordinator and participants. In a distributed database cluster, one of the instances will act as the 2PC coordinator, the other instances will act as the participant.
Prepare Phase
The coordinator instance has visibility to which participants need to be involved in the transaction, and what are the queries that need to be performed. When a distributed transaction is initiated, the coordinator will execute the first phase of 2PC - the prepare phase in the following order.
Persisting transaction information: The coordinator instance will persist the transaction metadata in its persistent logs. This is to ensure that if the coordinator crashes, it is able to recover its state.
Broadcast prepare message: After coordinator logs the transaction, it will send a prepare message to all involved participants. The message contains the queries each participant needs to execute.
Participant Execution: When the participant receives the prepare message, it will place a lock on the affected rows and execute these queries. The changes are persisted on its Write Ahead Log (WAL) for durability, but these changes are not committed. If this execution fails, an error message is returned to the coordinator. If the execution succeeds, a success message is returned to the coordinator. When a participant returns a success message to the coordinator, it gives up its rights to abort the transaction unilaterally.
Response handling when there’s error: Assuming there is no network disruption, when all participants have returned a response to the coordinator, the coordinator will perform the next step. If any of the participant returns an error response or timeout, the coordinator sends an abort message to all the participants which triggers a rollback. Similar to the prepare message, this abort message is also persisted in the coordinator’s logs. The abort message will reset the entire cluster back to the initial consistent state. The transaction is aborted.
Response handling when there’s no error: If all participants return a success response, the coordinator will move onto the commit phase. It will send a commit message to the involved participants, and persist the commit message in its logs. This marks the point of no return.
Commit Phase
The commit phase is the successful scenario where a distributed transaction can proceed.
Participant Execution: upon receiving the commit message, all participants will commit the transaction’s changes to the disk and release the locks. It will then return a final acknowledgement message to the coordinator.
Returns result to client: when the coordinator receives a success message from all the participants, it will return a success message back to the client who started the transaction.
Locking in 2PC
Participants need to place locks on affected rows to the prevent them being overwritten by other write requests. A system implementing 2PC will likely have
- strict two-phase locking: acquire all necessary locks before performing an operation.
- hierarchical locking: provides shared and exclusive locks for transactions to use
- timeout: if a lock is held for too long, it’s eventually released, this is risky if the coordinator decides to commit soon after
Drawbacks
Single Points of Failure
Both the coordinator and participants do not have high availability setup in a traditional distributed database. This means the system cannot tolerate any fault:
- coordinator failure: unable to orchestrate nor complete transactions
- participant failure: unable to proceed in the
PREPAREphase
A sudden coordinator failure after the PREPARE phase can also cause all participants to go into a “DOUBT” state, where all the locks are held but there’s no COMMIT message to proceed.
We can gain higher availability if we run replicas for both the coordinator and participants. This is the key decision made in distributed database like Spanner and Yugabyte.
Lock Contention
The above discussion assumes write requests from two transactions will modify the same record. This requires synchronisation of the write requests through locking.
Although this is what the database user sees (updating a row modifies that row), the underlying data structure does not need to conform to this. By keeping different versions of the same record in the database, and allowing concurrent transactions to only operate on versions visible to them, transactions can be parallelised. This database configuration is referred to as MVCC.