Transaction level fault tolerance in YaXAHA Cluster

March 27, 2022

Kirill Zinov

YaXAHA Cluster does not have a single entry point, so the failure of one node does not cause a cluster-wide outage or data loss. If a cluster node becomes unavailable, the client's request is processed by another node.

In relational databases, everything revolves around transactions. Every time we buy a product, many transactions run in the seller's database, each doing its job: checking the availability of the product, reserving it for us, confirming payment, and so on, right up to confirming receipt of the order.

Therefore, from the very beginning, we have made a great effort to ensure that YaXAHA Cluster provides maximum fault tolerance and logical data integrity at the transaction level.

Transaction start

In replication from one master to multiple followers, a transaction starts on the master copy of the database, which takes the entire load and is therefore a single point of failure. In YaXAHA Cluster, by contrast, any healthy cluster node can start a transaction. This removes the single point of failure and distributes the load across the entire cluster.

Ultimately, it does not matter how the client application learns about the several entry points it can use to connect to the cluster. This can be a comma-separated list of hosts in the connection URI, or it can be done through DNS. What matters is that if part of the cluster becomes unavailable, the client application can repeat its request to an available cluster node.
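The retry behavior described above can be sketched on the client side in a few lines of Python. This is only an illustration under stated assumptions: `send_with_failover`, `parse_hosts`, and `NodeUnavailable` are hypothetical names, and `send` stands in for whatever database driver or transport the application really uses.

```python
from typing import Callable, List, Optional, Sequence


class NodeUnavailable(Exception):
    """Raised by the (hypothetical) transport when a node cannot be reached."""


def parse_hosts(host_list: str) -> List[str]:
    """Split a comma-separated host list into individual entry points."""
    return [h.strip() for h in host_list.split(",") if h.strip()]


def send_with_failover(hosts: Sequence[str], request: str,
                       send: Callable[[str, str], str]) -> str:
    """Try each entry point in order and return the first successful reply."""
    last_error: Optional[Exception] = None
    for host in hosts:
        try:
            return send(host, request)
        except NodeUnavailable as exc:
            last_error = exc  # this node is down, move on to the next one
    raise NodeUnavailable(f"no cluster node reachable (last error: {last_error})")
```

With the host list "node1,node2,node3" and node1 down, the request would simply be served by node2. A real client would usually rely on its driver's own multi-host support instead of a hand-written loop.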

Transaction execution

So, the client has successfully sent a request to one of the cluster nodes. This node, which we call the originator-node, initiates the execution of all operations included in the transaction.

At this point, the leader-node is notified about the new transaction and takes the role of transaction manager. The leader-node is periodically re-elected by a vote among all healthy cluster nodes. This manager node keeps a record of transactions and makes decisions within the boundaries of the cluster, but it is not the first point to which the client sends its requests.

Now that the leader-node has registered the transaction and assigned it a unique number, the transaction begins executing in parallel on the other cluster nodes. When a node finishes executing the transaction, it reports that it is ready to commit. As soon as the originator-node receives the required minimum number of such notifications from other nodes of the cluster, the client gets a response that the transaction completed successfully.
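The bookkeeping on the originator-node can be pictured with a small Python sketch. It is not YaXAHA Cluster's actual implementation: the class name `OriginatorTransaction`, its fields, and the `required_acks` parameter are assumptions made for illustration, and the sketch only counts notifications without doing any networking or storage.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class OriginatorTransaction:
    txn_id: int         # unique number assigned by the leader-node
    required_acks: int  # assumed: minimum number of "ready" notifications
    ready_nodes: Set[str] = field(default_factory=set)

    def on_ready(self, node: str) -> bool:
        """Record a ready-to-commit notification from another node.

        Returns True once the client can be told the transaction succeeded.
        """
        self.ready_nodes.add(node)  # repeated notifications from one node count once
        return self.is_confirmed()

    def is_confirmed(self) -> bool:
        return len(self.ready_nodes) >= self.required_acks
```

For example, with `required_acks=2` the first notification from node "b" is not enough, but a second one from node "c" confirms the transaction, even if other nodes are still executing it.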

Thus, the cluster combines the benefits of synchronous and asynchronous database replication. It does not accumulate every delay that occurs along the way in synchronous replication. Instead, replication is considered complete as soon as the nodes that report back fastest have succeeded. At the same time, it retains the reliability of synchronous replication while leaving plenty of room for scaling.
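A rough way to see the latency effect is to compare waiting for every node with waiting only for the fastest few. The Python sketch below is a simplified model, not a measurement: the function name `commit_latency`, the `required_acks` parameter, and the delay values are all made up for illustration.

```python
from typing import Sequence


def commit_latency(node_delays_ms: Sequence[float], required_acks: int) -> float:
    """Time until the required number of nodes has reported back.

    Because nodes work in parallel, this is the k-th smallest delay,
    where k is required_acks.
    """
    if not 1 <= required_acks <= len(node_delays_ms):
        raise ValueError("required_acks must be between 1 and the number of nodes")
    return sorted(node_delays_ms)[required_acks - 1]


# Illustrative delays for four nodes, one of them slow.
delays = [12, 240, 15, 18]
wait_for_all = commit_latency(delays, required_acks=len(delays))  # 240
wait_for_two = commit_latency(delays, required_acks=2)            # 15
```

In this model, waiting for all nodes is held back by the single slow node, while waiting for the two fastest nodes is not.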

Kirill Zinov

CIO / Principal Software Engineer