ONLamp.com
oreilly.comSafari Books Online.Conferences.

advertisement


Advanced MySQL Replication Techniques
Pages: 1, 2, 3, 4, 5, 6

Circular replication failure recovery

It's important to cover something else: how to recover from a failure in a circular topology, thus achieving a replication array with the major features of a full-fledged cluster. The principle involved is the same, but multiply the practice by the number of nodes (Figure 8).

A heartbeat check in circular replication
Figure 8. A heartbeat check in circular replication

Instead of having only one table who, add one per node and one federated table master_who pointing to its corresponding master. The structure of the table is the same as what you have seen before. What changes is the connection string. To set the appropriate table in each node, use a convenient stored procedure.

create procedure make_connections()
begin
    drop table if exists master_who;
    case @@server_id
        when 10 then
            CREATE TABLE master_who
                (
                    server_id int not null primary key
                ) ENGINE = federated
                CONNECTION = 'mysql://username:password@earth:3306/replica/who';
        when 20 then
            CREATE TABLE master_who
                (
                    server_id int not null primary key
                ) ENGINE = federated
                CONNECTION = 'mysql://username:password@water:3306/replica/who';
        when 30 then
            CREATE TABLE master_who
                (
                    server_id int not null primary key
                ) ENGINE = federated
                CONNECTION = 'mysql://username:password@air:3306/replica/who';
        when 40 then
            CREATE TABLE master_who
                (
                    server_id int not null primary key
                ) ENGINE = federated
                CONNECTION = 'mysql://username:password@fire:3306/replica/who';
    else
        select "unhandled server id " as "error";
    end case;
end

The event is exactly the same in the normal replication scheme. The called procedure is different:

create procedure check_master()
deterministic
begin
    declare master_dead boolean default false;
    declare curx cursor for
        select server_id from replica.master_who;
    declare continue handler for SQLSTATE 'HY000'
        set master_dead = true;
    open curx;

    if (master_dead) then
        stop slave;
        case @@server_id
        when 10 then
            change master to
                master_host     = 'fire';
        when 20 then
            change master to
                master_host     = 'earth';
        when 30 then
            change master to
                master_host     = 'water';
        when 40 then
            change master to
                master_host     = 'air';
           else
               --  report the error in a log table
               insert into check_master_log values (now(), @@server_id,
                      "not handled server id");
        end case;
        start slave;
        alter event check_master_conn disable;
    end if;
end

The failure scenario is similar to the previous one. Unlike the previous example, where all slaves had the same failure mechanism and acted together to resume the replication, in this case only one node has to do something. The others will continue as usual (Figure 9).

Note that in the examples throughout this article, I use the server hostnames for clarity. In production servers, though, I always use the server IP address, for performance reasons (it will save some lookup time) and also because I often can use a dedicated high-speed line to connect servers for replication.

A node failure in circular replication
Figure 9. A node failure in circular replication

Node 40 detects that node 30 is no longer active. Thus its check_master will switch the master to node 20, air, for which business continued as usual (Figure 10).

Recovering from a node failure in circular replication
Figure 10. Recovering from a node failure in circular replication

Notice that, after recovering from failure, master checking does not resume in node 40. It's the same principle I have mentioned before, when discussing chain failures in normal replication. In both cases it's possible to set a second level of events and procedures to be called after the first recover, so that the system will survive to a second failure as well. I leave this task as an exercise to the industrious reader.

Pages: 1, 2, 3, 4, 5, 6

Next Pagearrow





Sponsored by: