ONLamp.com
Building a Self-Healing Network

Creating the Glue

Now it's time to build the glue that attaches your NAGIOS monitoring to your Cfengine instance and enables self-healing. When NAGIOS detects a state change for a service check, it can call a custom script, known as an event handler. When that script receives a critical state, it invokes cfrun to execute Cfengine on the remote host.



This script, handle_cfrun.sh, goes into /usr/lib/nagios/plugins, or wherever you have configured NAGIOS's USER1 directory. Once it's in place, make sure it is executable by the NAGIOS user. Also be sure to set the HOME variable in the script to the home directory of the NAGIOS user.

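#!/bin/sh
# NAGIOS event handler. Arguments, in order:
#   $1 service state, $2 state type, $3 host name, $4 attempt number
# On a critical/hard or the third critical/soft, fire off cfrun.
HOME=/var/log/nagios
export HOME
# Keep only the short host name (everything before the first dot).
HOST=`echo "$3" | cut -f1 -d.`

case "$1" in
"CRITICAL")
  case "$2" in
  "SOFT")
    case "$4" in
    "3")
      /usr/sbin/cfrun -f "$HOME/cfrun.hosts" -T "$HOST"
      ;;
    esac
    ;;
  "HARD")
    /usr/sbin/cfrun -f "$HOME/cfrun.hosts" -T "$HOST"
    ;;
  esac
  ;;
esac
exit 0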
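
To check the handler's decision logic without touching a live server, the nested case statements can be mirrored in a small dry-run function. This is only a sketch: should_run_cfrun is a hypothetical helper, not part of the original script, and it prints a verdict instead of calling cfrun.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of handle_cfrun.sh).
# Prints "run" where the handler would call cfrun, "skip" otherwise.
should_run_cfrun() {
    state=$1
    stype=$2
    attempt=$3
    if [ "$state" != "CRITICAL" ]; then
        echo skip
    elif [ "$stype" = "HARD" ]; then
        echo run
    elif [ "$stype" = "SOFT" ] && [ "$attempt" = "3" ]; then
        echo run
    else
        echo skip
    fi
}

# Walk through a few state changes.
should_run_cfrun CRITICAL SOFT 1
should_run_cfrun CRITICAL SOFT 3
should_run_cfrun CRITICAL HARD 4
should_run_cfrun OK HARD 1
```

The four calls print skip, run, run, and skip, matching the rule stated in the script's comment: act on a hard critical or on the third soft critical, and ignore everything else.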

Next, modify your NAGIOS services.cfg so that NAGIOS calls the external script whenever a service check produces a state change. Add these lines to services.cfg, either in your generic service template or in each service definition:

event_handler_enabled 1

event_handler handle_cfrun

Now modify your misccommands.cfg file to establish the proper mapping between our event handler and the script that's called. Add the following to the end of your misccommands.cfg file:

define command{
    command_name    handle_cfrun
    command_line    $USER1$/handle_cfrun.sh \
        $SERVICESTATE$ $STATETYPE$ \
        $HOSTNAME$ $SERVICEATTEMPT$
}

The \ in the listing indicates that the following line is a continuation rather than a new line.
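
The order of the macros on the command line determines which positional parameter each value lands in inside the script. The following sketch simulates one invocation; the argument values, including the host name webserver.example.com, are made up for illustration.

```shell
#!/bin/sh
# Example values standing in for the four macros; the host name is made up.
set -- CRITICAL HARD webserver.example.com 3

state=$1      # $SERVICESTATE$
stype=$2      # $STATETYPE$
host=$3       # $HOSTNAME$
attempt=$4    # $SERVICEATTEMPT$

# Same shortening the handler applies to its third argument.
short=$(echo "$host" | cut -f1 -d.)

echo "state=$state type=$stype host=$short attempt=$attempt"
```

If you reorder the macros in the command definition, the case statements in the handler must change to match.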

Run service nagios restart to activate the changes you've made to your NAGIOS configuration. However, you must also configure Cfengine on the web server to authenticate and authorize NAGIOS to run cfrun from the monitor server.

Cfengine's remote security model is based on public/private key pairs, each associated with a user ID and a host IP address. The remote access configuration here means your monitor system can invoke cfagent to execute the installed policy and nothing more. So, you must generate a key pair for NAGIOS and place the public half of it on your web server.

Become the NAGIOS user via su - nagios. Next, run cfkey. This creates a public/private key pair, and the output indicates where that key pair lives. Copy the .pub half of that key pair into the /var/cfengine/ppkeys directory on the web server. The file must have a special name. The format for public keys is username-ip.addr.pub, so replace the username with nagios and the IP address with your monitor's IP address, creating a file such as nagios-192.168.0.10.pub. Next, edit your cfservd.conf file and add nagios to the AllowUsers directive, then make sure that the IP address of your monitor server appears in the ACL for the cfagent binary.

      AllowUsers = ( nagios root )
 ...
      /var/cfengine/bin/cfagent 127.0.0.1 192.168.0.10
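
Because a misnamed key file is an easy mistake to make, it can help to generate the name rather than type it. This is a minimal sketch of the username-ip.addr.pub rule described above; ppkey_name is a hypothetical helper, not a Cfengine tool.

```shell
#!/bin/sh
# Hypothetical helper: build the public key file name from a user and an IP.
ppkey_name() {
    user=$1
    ip=$2
    printf '%s-%s.pub\n' "$user" "$ip"
}

# Example using the monitor address from the text.
ppkey_name nagios 192.168.0.10
```

This prints nagios-192.168.0.10.pub, the name to use when copying the key into /var/cfengine/ppkeys on the web server.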

Finally, create a file cfrun.hosts in NAGIOS's home directory containing:

webserver

Now check that your authentication works. Run su - nagios, then execute:

cfrun -f ~/cfrun.hosts webserver

Type yes if you're asked to accept a key. You should get a response that indicates success rather than failure.

That's it. You can test your configuration now. Try it by simulating a crash. On the web server, run:

killall -QUIT httpd

Now tail -f /var/log/messages on your monitor server. You should see messages like this, and you can verify the availability of your server yourself:

Nov 26 14:12:32 monitor nagios:
  SERVICE ALERT: webserver;HTTP;CRITICAL;SOFT;1;Connection refused
Nov 26 14:12:32 monitor nagios:
  SERVICE EVENT HANDLER: webserver;HTTP;CRITICAL;SOFT;1;handle_cfrun
Nov 26 14:12:33 webserver cfservd[7845]:
  Executing command /var/cfengine/bin/cfagent --no-splay --inform
Nov 26 14:13:32 monitor nagios:
  SERVICE ALERT: webserver;HTTP;OK;SOFT;2;HTTP OK HTTP/1.1 200 OK - 271
  bytes in 0.002 seconds
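
The alert lines carry their fields separated by semicolons, in the same order the event handler receives them. If you want to pick those fields out of a log line in a script, something like the following sketch would do it; parse_alert is a hypothetical helper and assumes the whole alert sits on one line.

```shell
#!/bin/sh
# Hypothetical helper: split a "SERVICE ALERT" log line into its fields.
parse_alert() {
    # Drop everything up to and including the "SERVICE ALERT: " tag.
    rest=${1#*"SERVICE ALERT: "}
    # Split the remainder on semicolons.
    old_ifs=$IFS
    IFS=';'
    set -- $rest
    IFS=$old_ifs
    echo "host=$1 service=$2 state=$3 type=$4 attempt=$5"
}

parse_alert "Nov 26 14:12:32 monitor nagios: SERVICE ALERT: webserver;HTTP;CRITICAL;SOFT;1;Connection refused"
```

For the sample line this prints host=webserver service=HTTP state=CRITICAL type=SOFT attempt=1.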

Adding to the System

It is easy to expand on this initial configuration to cover almost any service environment. An easy first step is to put all the daemons on which you depend into the processes section of your cfagent.conf. Some candidates include sendmail, xinetd, sshd, and so on.

Another way to expand the system is to configure Cfengine to detect certain error states, set classes based on those states, and then execute actions based on those classes. The following snippets from a Cfengine configuration demonstrate the principle. In this case, cfagent calls an external program to detect whether the HTTP server has hung. If so, it forces a restart.

actionsequence = ( shellcommands processes )

AddInstallable = ( httpHung )

shellcommands:
  "/some/path/check_httpd_hung.sh" define=httpHung

processes:
  httpHung::
    "httpd" restart "/usr/sbin/httpd" signal=term useshell=false
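
The article does not show check_httpd_hung.sh itself. As a rough sketch of what such a check might look like, the following treats a timed-out HTTP request as a hang. Everything here is an assumption: the function names, the URL, the 10-second limit, and the idea that the script should exit with status 0 when the server is hung so that the class gets defined. The only fixed fact it relies on is that curl exits with status 28 when an operation times out.

```shell
#!/bin/sh
# Hypothetical sketch of a hang check; the original script is not shown.

# Map a curl exit status to a verdict. 28 is curl's "operation timed out".
verdict_for() {
    case "$1" in
    28) echo hung ;;
    *)  echo not-hung ;;
    esac
}

# Probe the local server. The URL and the 10-second limit are placeholders.
probe() {
    curl --silent --output /dev/null --max-time 10 http://localhost/
    verdict_for "$?"
}

# Dry run on sample curl exit codes: success, connection refused, timeout.
verdict_for 0
verdict_for 7
verdict_for 28
```

The dry run prints not-hung, not-hung, and hung. A real check would end with a line such as [ "$(probe)" = hung ] so that the exit status reflects the verdict. Note that a refused connection (curl status 7) is deliberately not treated as a hang, since the processes section already restarts a daemon that is not running.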

Summary

I've shown how to build self-healing functionality into a network by using Cfengine and NAGIOS. This capability is easy to implement and easy to extend to more complex failure situations. Real-world experience has shown a five-minute failure-to-recovery time for this system. Although that is not instant, it is on par with response times when humans are part of the response cycle. The system is secure, easily maintainable, and within reach of most system administrators.

Greg Retkowski is a network engineering consultant with over 10 years of experience in UNIX/Linux network environments.

