Circular replication failure recovery
One more scenario deserves coverage: how to recover from a failure in a circular topology, thus achieving a replication array with the major features of a full-fledged cluster. The principle is the same as before, but the practice must be multiplied by the number of nodes (Figure 8).
Figure 8. A heartbeat check in circular replication
Instead of having only one table who, add one per node, plus one federated table master_who pointing to its corresponding master. The structure of the table is the same as the one you have seen before; what changes is the connection string. To set the appropriate table in each node, use a convenient stored procedure:
create procedure make_connections()
begin
    drop table if exists master_who;
    case @@server_id
        when 10 then
            CREATE TABLE master_who (server_id int not null primary key)
            ENGINE = federated
            CONNECTION = 'mysql://username:password@earth:3306/replica/who';
        when 20 then
            CREATE TABLE master_who (server_id int not null primary key)
            ENGINE = federated
            CONNECTION = 'mysql://username:password@water:3306/replica/who';
        when 30 then
            CREATE TABLE master_who (server_id int not null primary key)
            ENGINE = federated
            CONNECTION = 'mysql://username:password@air:3306/replica/who';
        when 40 then
            CREATE TABLE master_who (server_id int not null primary key)
            ENGINE = federated
            CONNECTION = 'mysql://username:password@fire:3306/replica/who';
        else
            select "unhandled server id " as "error";
    end case;
end
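To deploy this, you would create the procedure on every node and invoke it once. A minimal sketch, assuming the replica database used in the earlier examples:

```sql
-- Run once on each node, after creating the procedure.
use replica;
call make_connections();

-- Sanity check: while this node's master is reachable, reading
-- the federated table returns the master's server_id.
select server_id from master_who;
```

Because the procedure branches on @@server_id, the same script can be deployed unchanged to all four nodes; each one ends up with a master_who table pointing at its own master.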
The event is exactly the same as in the normal replication scheme. The procedure it calls is different:
create procedure check_master()
deterministic
begin
    declare master_dead boolean default false;
    declare curx cursor for
        select server_id from replica.master_who;
    declare continue handler for SQLSTATE 'HY000'
        set master_dead = true;
    open curx;
    if (master_dead) then
        stop slave;
        case @@server_id
            when 10 then change master to master_host = 'fire';
            when 20 then change master to master_host = 'earth';
            when 30 then change master to master_host = 'water';
            when 40 then change master to master_host = 'air';
            else
                -- report the error in a log table
                insert into check_master_log
                    values (now(), @@server_id, "not handled server id");
        end case;
        start slave;
        alter event check_master_conn disable;
    end if;
end
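For completeness, the event that drives this check could look like the following sketch. The name check_master_conn comes from the procedure above (which disables it after a failover); the 30-second interval is an assumption, to be tuned to your tolerance for failover delay:

```sql
-- The event scheduler must be running for this to fire:
-- set global event_scheduler = 1;
create event check_master_conn
    on schedule every 30 second
    enable
    do call check_master();
```

Each node runs its own copy of this event, so every node independently polls its own master through its federated master_who table.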
The failure scenario is similar to the previous one. Unlike the previous example, where all slaves had the same failure mechanism and acted together to resume the replication, in this case only one node has to do something. The others will continue as usual (Figure 9).
Note that in the examples throughout this article, I use the server hostnames for clarity. In production servers, though, I always use the server's IP address, both for performance reasons (it saves some lookup time) and because I can often use a dedicated high-speed line to connect servers for replication.
Figure 9. A node failure in circular replication
Node 40 detects that node 30 is no longer active. Thus its check_master procedure switches the master to node 20, air, where business continues as usual (Figure 10).
Figure 10. Recovering from a node failure in circular replication
Notice that, after recovering from the failure, master checking does not resume on node 40. This is the same principle I mentioned before when discussing chain failures in normal replication. In both cases, it's possible to set up a second level of events and procedures to be called after the first recovery, so that the system can survive a second failure as well. I leave this task as an exercise for the industrious reader.
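Until such a second level exists, the ring must be closed again by hand once the failed node returns. A manual recovery sketch for this scenario, using the hostnames from the examples above:

```sql
-- On node 40, once node 30 (fire) is back in the ring:
stop slave;
change master to master_host = 'fire';   -- point back at the restored master
start slave;

-- Resume automatic master checking, which check_master disabled
-- during the failover.
alter event check_master_conn enable;
```

The restored node itself also needs its slave thread restarted against its original master (air, in this scenario) before the circle is fully healed.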