ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)


Building a Self-Healing Network

by Greg Retkowski
05/25/2006

Computer immunology is a hot topic in system administration. Wouldn't it be great to have our servers solve their own problems? System administrators would be free to work proactively, rather than reactively, to improve the quality of the network.

This is a noble goal, but few solutions have made it out of the lab and into the real world. Most real-world environments automate service monitoring, then notify a human to repair any detected fault. Other sites invest a large amount of time creating and maintaining a custom patchwork of scripts for detecting and repairing frequently recurring faults. This article demonstrates how to build a self-healing network infrastructure using mature open source software components that are widely used by system administrators. These components are NAGIOS and Cfengine.

NAGIOS is a network monitoring system with a web-based interface that tracks the health of servers and the services they provide. It does this by periodically polling the server/service with a health-checking script. If it detects what it believes is a failure state based on repeated health-check failures, it will note the specific server and take actions such as paging and emailing system administrators.

Cfengine is a policy engine that detects the delta (difference) between a system's current configuration state and its optimal configuration state as defined by policy. It was developed by Mark Burgess of Oslo University College. Cfengine has many functions that facilitate self-healing. However, Cfengine runs only periodically because its delta detection process is too computationally intensive to run continuously. In most deployments, Cfengine runs once an hour.

By combining these two software packages, you can create a self-healing capability on your network. First, configure NAGIOS to do health checking on a server and, in the event of a failure, to invoke Cfengine on the remote server to repair the fault. The system will operate in a secure manner with little system or network overhead.

Implementation

The network for the example configuration is fairly straightforward and you'll find it easy to tailor to your specific environment. The network has a monitor host (named monitor, at 192.168.0.10) running NAGIOS, and a web server (named webserver, at 192.168.0.20) running an Apache HTTP server. The goal is for the Apache server to continue to serve pages to hypothetical users, and for any fault that occurs to be rectified in short order. For clarity I've split these functions across two hosts, but there is no reason that both functions could not run on the same host.

The example network runs Fedora Core 3. Installation should be very similar on any other Red Hat/RPM-based system. If you are comfortable with installing and configuring software on your preferred flavor of Linux, you can easily accommodate other distributions. The configurations should work across all platforms with few modifications once the software is installed.

The concept is simple. NAGIOS detects a fault with the HTTP service. As part of its event handling system, it requests remote execution of Cfengine via the cfrun utility. Cfengine runs, detects the missing httpd process, and restarts it. Voilà!

Download & Installation

Both NAGIOS and Cfengine are available from the DAG Repository for all versions of Red Hat and Fedora. If your package manager is configured for DAG, it's as simple as:

yum -y install nagios nagios-plugins cfengine

For the web server (assuming you also need Apache):

yum -y install cfengine httpd

To find out how to configure your package manager to use DAG, visit the Dag FAQ. If you're a build-from-source person, visit the Cfengine and NAGIOS websites to download the source tarballs directly. The Cfengine Wiki has more details on other subjects.

Configuring Cfengine

In this case you will set up a very simple Cfengine instance whose sole purpose is to restart a failed HTTP server. Cfengine can do many more worthwhile things, and I recommend Luke A. Kanies' excellent articles Introducing Cfengine and Integrating Cfengine with CVS.

Cfengine keeps its configuration data in /var/cfengine/inputs. There are a few key files you will put into this directory to get your Cfengine instance up and running. On your web server, cfagent.conf should contain:

control:
  actionsequence = ( processes )
  smtpserver = ( localhost ) # used by cfexecd
  sysadm = ( root@localhost ) # where to mail output

processes:
  "httpd" restart "/usr/sbin/httpd" useshell=false

cfservd.conf should be:

control:
cfrunCommand = ( "/var/cfengine/bin/cfagent" )
AllowUsers = ( root )
admit:
/var/cfengine/bin/cfagent 127.0.0.1

cfrun.hosts must read:

webserver

Make sure that your Cfengine config parses properly by running Cfengine from the command line:

/usr/sbin/cfagent -qIv

You'll see verbose output. Without the v flag, the only output is whatever indicates a difference between system state and Cfengine policy. For example, if you execute:

killall httpd;/usr/sbin/cfagent -qI

you'll see that cfagent restarts the httpd daemon. Now that you have it installed, start up all your Cfengine services:

for i in cfenvd cfservd cfexecd; do 
  chkconfig $i on; service $i restart;done

Now your Cfengine config works: it returns your system to the desired state, a live httpd server, via Cfengine policy. This Cfengine rule, executed once an hour by cfexecd, will restart the httpd server if it's down. However, if you want automated dynamic response to failure, you need to integrate a second part to monitor the httpd server and kick off Cfengine when a failure occurs.

Configuring NAGIOS

Two articles by Oktay Altunergil cover NAGIOS in depth. The first, Installing Nagios, covers installing NAGIOS from source. The second, Nagios, Part 2, has an in-depth discussion of the configuration files that are at the heart of NAGIOS's behavior.

The configuration files for NAGIOS typically live in /etc/nagios. The hosts.cfg file defines which hosts NAGIOS should monitor. This file simply defines the web server and its IP address.

# Generic host definition template
define host{
    name                          generic-host ; Host template
    notifications_enabled         1
    event_handler_enabled         1
    flap_detection_enabled        1
    process_perf_data             1
    retain_status_information     1
    retain_nonstatus_information  1
    register                      0 ; DONT REGISTER THIS TEMPLATE
}

# our apache server host definition
define host{
    use                     generic-host ; template to use
    host_name               webserver

    alias                   Our apache webserver
    address                 192.168.0.20
    check_command           check-host-alive
    max_check_attempts      10
    notification_interval   120
    notification_period     24x7
    notification_options    d,u,r
}

services.cfg contains definitions of which services to monitor for each host. This file checks the reachability (via ping) and the availability of the HTTP server.

# Generic service definition template
define service{
    name             generic-service ; This is a template.
    active_checks_enabled           1
    passive_checks_enabled          1
    parallelize_check               1
    obsess_over_service             1
    check_freshness                 0
    notifications_enabled           1
    event_handler_enabled           1
    flap_detection_enabled          1
    process_perf_data               1
    retain_status_information       1
    retain_nonstatus_information    1
    register                        0       ; DONT REGISTER TEMPLATE
}

# Service definition
define service{
    use                             generic-service ; Name of template
    host_name                       webserver
    service_description             PING
    is_volatile                     0
    check_period                    24x7
    max_check_attempts              3
    normal_check_interval           2
    retry_check_interval            1
    contact_groups                  admins
    notification_interval           120
    notification_period             24x7
    notification_options            c,r
    check_command                   check_ping!100.0,20%!500.0,60%
}


# Service definition
define service{
    use                             generic-service ; Name of template
    host_name                       webserver
    service_description             HTTP
    is_volatile                     0
    check_period                    24x7
    max_check_attempts              3
    normal_check_interval           2
    retry_check_interval            1
    contact_groups                  admins
    notification_interval           120
    notification_period             24x7
    notification_options            w,u,c,r
    check_command                   check_http
    event_handler_enabled           1
    event_handler                   handle_cfrun
}
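
The check intervals in this service definition bound how long an outage can go unnoticed. As a rough worked example, the sketch below estimates the worst-case delay before the service reaches a HARD state. It assumes NAGIOS's default interval_length of 60 seconds (so the interval values are minutes) and that recovery is triggered once max_check_attempts checks have failed; the function name is illustrative and not part of NAGIOS.

```shell
# Estimate worst-case minutes from failure to a HARD CRITICAL state.
# Arguments mirror normal_check_interval, retry_check_interval,
# and max_check_attempts from the service definition.
worst_case_detection() {
    normal=$1; retry=$2; attempts=$3
    # Up to one full normal interval may pass before the first failed check;
    # each remaining attempt then waits one retry interval.
    echo $(( normal + (attempts - 1) * retry ))
}

worst_case_detection 2 1 3   # values from the HTTP service above; prints 4
```

The time cfagent needs to run and the next check that confirms recovery come on top of this estimate.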

The configuration file contacts.cfg defines who to contact when a monitoring event occurs and how to make the contact. A basic configuration simply mails root.

define contact{
    contact_name                    nagios
    alias                           Nagios Admin
    service_notification_period     24x7
    host_notification_period        24x7
    service_notification_options    w,u,c,r
    host_notification_options       d,u,r
    service_notification_commands   notify-by-email,notify-by-epager
    host_notification_commands      host-notify-by-email,host-notify-by-epager
    email                           root@localhost.localdomain
    pager                           root@localhost.localdomain
}

contactgroups.cfg defines groupings of contacts.

define contactgroup{
        contactgroup_name       admins
        alias                   Apache Server Administrators
        members                 nagios
}

The hostgroups.cfg file contains a mapping of hosts to groups. You only have one host in its own group, associated with your one contact group.

define hostgroup{
        hostgroup_name  webserver
        alias           Apache Web Servers
        contact_groups  admins
        members         webserver
}

Zero out the files dependencies.cfg and escalations.cfg (for example, cp /dev/null to each of these) since you don't need these files in this configuration.
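
One way to do this is sketched below. The helper name is made up for illustration, and the paths in the usage comment assume the /etc/nagios layout described earlier.

```shell
# Truncate each named file to zero length, creating it if it does not exist.
zero_out() {
    for f in "$@"; do
        : > "$f" || return 1
    done
}

# Example (adjust paths to your installation):
# zero_out /etc/nagios/dependencies.cfg /etc/nagios/escalations.cfg
```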

Finally, edit cgi.cfg. If you are in a lab or isolated environment, set use_authentication=0. Otherwise, set up an appropriate htaccess configuration for your /nagios/ directory with sane values. For more information on how NAGIOS manages CGI security, review the NAGIOS CGI Authentication Documentation.

Start up your NAGIOS server: service nagios start.
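
Before starting or restarting, it can help to let NAGIOS verify its own configuration with its -v option. The wrapper below is only a sketch: the function name, the NAGIOS_BIN override, and the default path /etc/nagios/nagios.cfg are assumptions to adjust for your installation.

```shell
# Run the NAGIOS configuration check (-v) and summarize the result.
# NAGIOS_BIN and the default config path are assumptions, not fixed names.
check_nagios_config() {
    cfg=${1:-/etc/nagios/nagios.cfg}
    if "${NAGIOS_BIN:-nagios}" -v "$cfg"; then
        echo "config OK"
    else
        echo "config has errors"
        return 1
    fi
}

# Example: check_nagios_config && service nagios start
```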

Go to http://monitor/nagios/ and click service checks. After a few moments, you should see the HTTP and PING checks in the green. One final note: if you have just installed Apache on your web server, make sure there's a /var/www/html/index.html document so that the server returns 200 OK. Otherwise, it will return 403 Forbidden, which will cause health checking to fail.
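
A minimal sketch for making sure a placeholder page exists on the web server follows. The function name and the page body are illustrative; the default document root is the one mentioned above.

```shell
# Create a placeholder index.html in the document root if none exists.
# An existing page is left untouched.
ensure_index() {
    docroot=${1:-/var/www/html}
    if [ ! -e "$docroot/index.html" ]; then
        echo '<html><body>It works.</body></html>' > "$docroot/index.html"
    fi
}

# Example: ensure_index /var/www/html
```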

You've now created a very vanilla NAGIOS and Cfengine environment. This is something you may have already put into place in your network. But hold on to your hat--here's where I make it interesting.

Creating the Glue

Now it's time to build the glue that attaches your NAGIOS monitoring to your Cfengine instance and enables self-healing. When NAGIOS detects a state change for a service check, it can call a custom script. If NAGIOS calls the script with a critical error, the script invokes cfrun to execute Cfengine on the remote host.

This script, handle_cfrun.sh, goes into /usr/lib/nagios/plugins, or wherever you have configured NAGIOS's USER1 directory. Once it's in place, make sure it is executable by the NAGIOS user. Also be sure to set the HOME variable in the script to the home directory of the NAGIOS user.

#!/bin/sh
# On a critical/hard or the third critical/soft, fire off cfrun.
# Arguments: $1 = service state, $2 = state type, $3 = host name, $4 = attempt
HOME=/var/log/nagios
export HOME
# Strip any domain part from the host name.
HOST=`echo "$3" | cut -f1 -d.`

case "$1" in
"CRITICAL")
  case "$2" in
  "SOFT")
    case "$4" in
    "3")
      /usr/sbin/cfrun -f "$HOME/cfrun.hosts" -T "$HOST"
      ;;
    esac
    ;;
  "HARD")
    /usr/sbin/cfrun -f "$HOME/cfrun.hosts" -T "$HOST"
    ;;
  esac
  ;;
esac
exit 0
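
To reason about when the handler fires without touching a live host, its decision logic can be restated as two small functions. This is a sketch for experimentation only: short_host and would_fire are hypothetical names, and would_fire takes the state, state type, and attempt number (it omits the host name that the real script receives as its third argument).

```shell
# Strip any domain part, as the handler does with its third argument.
short_host() {
    echo "$1" | cut -f1 -d.
}

# Return 0 when the handler would call cfrun for the given
# service state ($1), state type ($2), and attempt number ($3).
would_fire() {
    [ "$1" = "CRITICAL" ] || return 1
    case "$2" in
        HARD) return 0 ;;
        SOFT) [ "$3" = "3" ] ;;
        *)    return 1 ;;
    esac
}

# Example: would_fire CRITICAL SOFT 1 || echo "handler would do nothing"
```

Note that short_host keeps only the text before the first dot, so it is meant for host names rather than IP addresses.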

Next, modify your NAGIOS services.cfg so that when a state change occurs over the course of a service check, it will call the external script. Add these lines to your NAGIOS services.cfg, either on your generic template, or for each service:

event_handler_enabled 1

event_handler handle_cfrun

Now modify your misccommands.cfg file to establish the proper mapping between our event handler and the script that's called. Add the following to the end of your misccommands.cfg file:

define command{
    command_name    handle_cfrun
    command_line    $USER1$/handle_cfrun.sh \
        $SERVICESTATE$ $STATETYPE$ \
        $HOSTNAME$ $SERVICEATTEMPT$
}

The \ in the listing indicates that the following line is a continuation rather than a new line.

service nagios restart will activate the changes you've made to your NAGIOS configuration. However, you must also configure Cfengine on the web server to authenticate and authorize NAGIOS to run cfrun from the monitor server.

Cfengine's remote security model is based on public/private key pairs, which are associated with a user ID and a host IP address. The remote access configuration here means your monitor system can only invoke cfagent to execute the installed policy and nothing more. So, you must generate a key pair for NAGIOS and place the public side of it onto your web server.

Become the NAGIOS user via su - nagios. Next, run cfkey. This will create a public/private key pair and the output will indicate where that key pair lives. Copy the .pub side of that key pair into the /var/cfengine/ppkeys directory on the web server. It must have a special name. The format for public keys is username-ip.addr.pub, so replace the username with nagios and the IP address with your monitor's IP address, creating a file such as nagios-192.168.0.10.pub. Next, edit your cfservd.conf file on the web server and add nagios to the AllowUsers directive, then make sure that the IP address of your monitor server appears in the ACL for the cfagent binary.

      AllowUsers = ( nagios root )
 ...
      /var/cfengine/bin/cfagent 127.0.0.1 192.168.0.10
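
The key file name can be built mechanically from the user name and the monitor's IP address. The helper below is a small sketch with a made-up name; the scp line in the comment is only an illustration, and KEYFILE stands for whatever path cfkey reported on your system.

```shell
# Build the file name expected for a user's public key: username-ip.addr.pub
ppkey_name() {
    echo "$1-$2.pub"
}

ppkey_name nagios 192.168.0.10   # prints nagios-192.168.0.10.pub

# Illustrative copy step (KEYFILE is a placeholder for the path cfkey printed):
# scp "$KEYFILE" root@webserver:/var/cfengine/ppkeys/"$(ppkey_name nagios 192.168.0.10)"
```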

Finally, create a file cfrun.hosts in NAGIOS's home directory containing:

webserver

Now check that your authentication works. Run su - nagios, then execute:

cfrun -f ~/cfrun.hosts webserver

Type yes if you're asked to accept a key. You should get a response that indicates success rather than failure.

That's it. You can test your configuration now. Try it by simulating a crash. On the web server, run:

killall -QUIT httpd

Now tail -f /var/log/messages on your monitor server. You should see messages like this, and you can verify the availability of your server yourself:

Nov 26 14:12:32 monitor nagios:
  SERVICE ALERT: webserver;HTTP;CRITICAL;SOFT;1;Connection refused
Nov 26 14:12:32 monitor nagios:
  SERVICE EVENT HANDLER: webserver;HTTP;CRITICAL;SOFT;1;handle_cfrun
Nov 26 14:12:33 webserver cfservd[7845]:
  Executing command /var/cfengine/bin/cfagent --no-splay --inform
Nov 26 14:13:32 monitor nagios:
  SERVICE ALERT: webserver;HTTP;OK;SOFT;2;HTTP OK HTTP/1.1 200 OK - 271
  bytes in 0.002 seconds
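
When watching the log, it can be handy to pull individual fields out of the alert lines. The sketch below assumes the semicolon-separated layout shown in the sample output above; alert_field is a hypothetical helper, not a NAGIOS tool.

```shell
# Print one semicolon-separated field of a NAGIOS "SERVICE ALERT" log line.
# Fields: 1=host 2=service 3=state 4=state type 5=attempt 6=plugin output
alert_field() {
    echo "$1" | sed 's/^.*SERVICE ALERT: //' | cut -d';' -f"$2"
}

line='SERVICE ALERT: webserver;HTTP;CRITICAL;SOFT;1;Connection refused'
alert_field "$line" 3   # prints CRITICAL
```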

Adding to the System

It is easy to expand on this initial configuration to cover almost any service environment. An easy first step is to put all the daemons on which you depend into the processes section of your cfagent.conf. Some candidates include sendmail, xinetd, sshd, and so on.
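
If the list of daemons grows, the corresponding rules can be generated rather than typed. The following sketch prints a processes section in the same form as the httpd rule shown earlier. The function name is made up, and the daemon paths are assumptions that you should check against your own system (some daemons also need extra start-up arguments).

```shell
# Print a Cfengine processes rule for each "name:command" pair given.
process_rules() {
    echo "processes:"
    for pair in "$@"; do
        name=${pair%%:*}   # text before the first colon
        cmd=${pair#*:}     # text after the first colon
        printf '  "%s" restart "%s" useshell=false\n' "$name" "$cmd"
    done
}

# Paths below are typical locations, not verified for every distribution.
process_rules httpd:/usr/sbin/httpd sshd:/usr/sbin/sshd xinetd:/usr/sbin/xinetd
```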

Another way to expand the system is to configure Cfengine to detect certain error states, set classes based on those states, and then execute actions based on those classes. The following example snippets from a Cfengine configuration demonstrate the principle. In this case, cfagent calls an external program to detect whether the HTTP server has hung. If so, it forces a restart.

actionsequence = ( shellcommands processes )

AddInstallable = ( httpHung )

shellcommands:
  "/some/path/check_httpd_hung.sh" define=httpHung

processes:
  httpHung::
    "httpd" restart "/usr/sbin/httpd" signal=term useshell=false

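The external check itself is left to you. One possible shape for it is sketched below, under loud assumptions: that cfagent defines the class when the command exits with status 0 (verify this against the documentation for your Cfengine version), that pgrep and curl are installed, and that "hung" means an httpd process exists but a local HTTP request fails within ten seconds. The function name and the two overridable probe variables are hypothetical.

```shell
# Probes are kept in variables so they can be swapped out or tested.
PROC_CHECK=${PROC_CHECK:-"pgrep -x httpd"}
HTTP_CHECK=${HTTP_CHECK:-"curl --silent --fail --max-time 10 --output /dev/null http://localhost/"}

# Return 0 (hung) only when an httpd process exists but the HTTP probe fails.
httpd_is_hung() {
    # No process at all: not hung; the plain restart rule covers that case.
    $PROC_CHECK >/dev/null 2>&1 || return 1
    # Process exists and answers: healthy.
    if $HTTP_CHECK >/dev/null 2>&1; then
        return 1
    fi
    return 0
}

# In a standalone check script, finish with:
# httpd_is_hung; exit $?
```
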
Summary

I've illustrated self-healing functionality for networks by using Cfengine and NAGIOS. This capability is easy to implement and easily extended to more complex failure situations. Real-world experience has shown a five-minute failure-to-recovery time for this system. Although that is not instant, it is on par with response times when humans are part of the response cycle. The system is secure, easily maintainable, and implementable by most system administrators.

Greg Retkowski is a network engineering consultant with over 10 years of experience in UNIX/Linux network environments.



Copyright © 2009 O'Reilly Media, Inc.