ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)
 See this if you're having trouble printing code examples


Essential System Administration, 3rd Edition

Top Five Open Source Packages for System Administrators

by Æleen Frisch, author of Essential System Administration, 3rd Edition
12/05/2002

This is the fourth installment of a five-part series in which I introduce my current list of the most useful and widely applicable open source administrative tools. In general, these tools can make your job easier no matter what Unix operating system your computers run.

#2: Nagios

The second place in my top five tools list goes to Nagios, written by Ethan Galstad. Nagios is a feature-rich network monitoring package. Its displays provide current information about system or resource status across an entire network. In addition, it can also be configured to send alerts and perform other actions when problems are detected. This week, we'll look at the sort of monitoring that Nagios provides and also briefly discuss configuring the package.

Note: Nagios was formerly known as Netsaint. Netsaint configuration files are compatible with Nagios, although Nagios has adopted a new, simpler syntax. You can also convert Netsaint configurations files with the included convertcfg utility.

In This Series

Number Five: Amanda
The countdown begins with Amanda, an enterprise backup utility.

Number Four: LDAP
The countdown continues with LDAP, a protocol that supports a directory service.

Number Three: GRUB
The countdown continues with GRUB, the GRand Unified Bootloader.

What Nagios Can Do

Nagios monitors a wide variety of system properties, including system- performance metrics such as load average and free disk space; the presence of important services like HTTP and SMTP; and per-host network availability and reachability. It also allows the system administrator to define what constitutes a significant event on each host--for example, how high a load average is "too high"--and what to do when such conditions are detected.

In addition to detecting problems with hosts and their important services, Nagios also allows the system administrator to specify what should be done as a result. A problem can trigger an alert to be sent to a designated recipient via various communication mechanisms (such as email, Unix message, pager). It is also possible to define an event handler: a program that is run when a problem is detected. Such programs can attempt to solve the problem encountered, and they can also proactively prevent some serious problems when they get triggered by warning conditions.

The information that Nagios collects is displayed in a series of automatically generated Web pages. This format is quite convenient in that it allows a system administrator to view network status information from various points throughout the network.

Figure 1 illustrates the top-level Nagios display, known as the "Tactical Overview."


Figure 1. Nagios Tactical Overview display

Related Reading

Essential System Administration
Tools and Techniques for Linux and Unix Administration
By Æleen Frisch

The narrow column on the left of the display lists links to all of the possible Nagios displays (the one for the current display has been highlighted in the illustration). The Tactical Overview shows very general statistics about the overall network status. In this case, 20 hosts are being monitored, and 16 are currently up. Three hosts are down, and one is unreachable from the monitoring system, presumably because the gateway to it is down. Of the problems on the three hosts that are down, one has been acknowledged by a system administrator. The display also indicates that there are three services that have "critical" status (probably indicating a failure), and two others are in a "warning" state.

Each of the problem indicator displays also functions as a link to another Web page giving details about that particular item.

Figure 2 illustrates a Nagios Status Overview display. The three sections display summary status information about the hosts being monitored (upper left), services being monitored (upper right), and a further status breakdown by host group (lower portion of the boxed section of the figure). Once again, each item contains links to more detailed views of its current information. In this case, the hosts that are being monitored have been configured into four groups for Nagios reporting purposes. Three of the groups contain hosts in the same physical location within the company, and the final group, Printers, contains network printers that are being monitored. The system administrator is free to group hosts and devices in ways that make sense for her needs.


Figure 2. Nagios Status Overview and details for the Printers group

The display at the bottom of Figure 2 shows the most important part of the detailed display that results when one clicks on the Printers link in the upper display. It lists each printer separately, along with its device status and services status. In this example, at the moment, one of the four printers is down (the printer named ingres).

Figure 3 illustrates the detailed display that can be obtained for an individual host (or device). Here we see some detailed information about a host named leah. Once again, there are several sections to the display. The host name and IP address appear in the upper left of the display, along with an icon that the system administrator has assigned to this host. Here, the icon suggests that the system's operating system is some version of Windows; conventionally, icons are keyed to the operating system type. The table in the upper right gives some overall uptime and reachability statistics about the host over the period that the current monitoring session has been running.


Figure 3. Detailed Host Status information about host Leah

The table below the operating system icon, titled "Host State Information" provides information about the current status of the host, including whether or not it is up, how long it has been that way, when it was last checked, and the command used to perform the check, and the settings of various configuration parameters (such as host notifications and event handler).

The box titled "Host Commands" contains a series of links, which allow the system administrator to perform many different monitoring-related actions on this host. The various items are described in Table 1. Examining the list will give you further details about Nagios' capabilities.

Table 1. Available actions in the Nagios Host Information display

Item Meaning
Disable checks of this host Stop monitoring this host for availability.
Acknowledge this host problem Respond to a current problem (discussed below).
Disable notifications for this host Don't send alerts if this host is unavailable.
Delay next host notification Delay the next alert for host unavailability.
Schedule downtime for this host. Cancel scheduled downtime for this host Define or cancel schedule downtime. During downtime, host unavailability is not considered a problem
Disable notifications for all services on this host. Enable notifications for all services on this host. Don't/do send alerts if a service on this host fails.
Schedule an immediate check of all services on this host Check all services as soon as possible (rather than waiting for their next scheduled time).
Disable checks of all services on this host
Enable checks of all services on this host
Disable or enable checking service health on this host.
Disable event handler for this host Prevent the event handler from running when a problem is detected on this host.
Disable flap detection for this host Don't try to detect flaps (rapid up-down or on-off oscillations) on this host or its services.

The second menu item allows you to acknowledge any current problem. Acknowledging simply means "I know about the problem, and it is being handled." Nagios marks the corresponding event as such, and future alerts are suppressed until the item returns to its normal state. This process also allows you to enter a comment explaining the situation, an action that is helpful when more than one administrator regularly examines the monitoring data.

If you don't like all of these table-oriented status displays, Nagios also has the capability to use graphical ones. For example, Figure 4 illustrates a map created for the small network being monitored here. The map is laid out to indicate three separate groups of hosts, with host taurus serving as a gateway between the group at the upper left and the ones at the bottom of the window.


Figure 4. A Nagios map

Much more complex network topologies can be represented in an analogous way. See the Nagios Web site for example screen shots.

Configuring Nagios

Initially, configuring Nagios can seem daunting, and there is a fair amount of startup overhead to getting things going. But keep in mind that:

Nagios uses the following configuration files:

The package provides sample starter versions of all of these file. We will consider some aspects of these file types in the remainder of this article.

Nagios configuration files are generally stored in /usr/local/nagios/etc.

The nagios.cfg File

This configuration file contains directives that apply to the entire Nagios monitoring system. Here is an annotated sample version illustrating some of its most important features:

# File locations
log_file=/usr/local/nagios/var/nagios.log
cfg_file=/usr/local/nagios/etc/checkcommands.cfg
cfg_file=/usr/local/nagios/etc/misccommands.cfg
cfg_file=/usr/local/nagios/etc/hosts.cfg
resource_file=/usr/local/nagios/etc/resource.cfg
lock_file=/usr/local/nagios/var/nagios.lock
... 

The first part of the configuration file specifies various file locations, including the general log file, files holding service check command and notification and event handler command definitions (checkcommands and misccommands). Other cfg_file directives are used by the administrator to specify the object definition files in use at that site (indicated by the one in red). Locations for other types of files follow. The lock file holds the PID of the current nagios process.


# Logging settings
log_rotation_method=d 
log_archive_path=/usr/local/nagios/var/archives
use_syslog=1
log_host_retries=1
log_event_handlers=1
...

These directives specify logging settings, including how often logs are rotated (here, daily), the archive directory for old files, whether to log significant problems to syslog as well, and whether to log individual event types.


# Global settings
nagios_user=nagios
nagios_group=nagios
date_format=us
admin_email=nagadmin
admin_pager=19995551212

These lines specify various global settings, including the user/group as which the nagios daemon runs, the output format for dates (here, US style), and the administrator's email address. The final item sets the value of the $ADMINPAGER$ macro, which can be used in command definitions.


# Package-wide event handlers
enable_event_handlers=1
global_host_event_handler=global-event-command
global_service_event_handler=global-svc-command

Settings related to event handlers. You can optionally define a single event handler for all host failures and service failures in this file if appropriate. Commands are defined in an object configuration file.


# Concurrent checks and time-outs
max_concurrent_checks=0
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
...

These directives control the number of maximum checks that can be made at the same time (0 means an unlimited number), as well as time-outs for various types of commands (values in seconds).


# Retained status information
retain_state_information=1
retention_update_interval=60
use_retained_program_state=1

These lines tell Nagios to retain information about host and service status between sessions, saving the values every 60 seconds, and reloading them when the facility starts up.


# Passive service checks
accept_passive_service_checks=1
check_service_freshness=1

These directives enable "passive checks": status data produced by external commands which Nagios imports periodically.


# Save Nagios data for later use
process_performance_data=1
host_perfdata_command=process-host-perfdata
service_perfdata_command=process-service-perfdata

These directives allow you to save Nagios data externally for long term analysis or other purposes. The commands specified here must be defined in some object configuration file. The simplest such command simply writes the command's output to an external file: e.g., echo $OUTPUT$ >> file, but you can perform whatever action is appropriate (e.g., send the data to an RRDTool or other database).

Note that the directives appear in a slightly different order in the sample nagios.cfg file provided with the package.

Object Configuration Files

The bulk of Nagios configuration occurs in the object configuration files. These files define hosts and services to be monitored, how various status conditions should be interpreted, and what actions should be taken when they occur. These files are used to define the following items:

The items in red will need to be defined for virtually every Nagios installation; the ones in black are optional. In the sample Nagios configuration provided with the package, each type of object is defined in a separate configuration file (named after the object type, excluding any spaces). However, you can arrange your definitions in any form that makes sense to you.

Hosts and Host Groups

All of these items are defined via templates: named sets of attributes and settings that can be easily applied to any number of actual objects. For example, here is a template definition for hosts:

define host{
; Template name
   name                      normal  
; This is only a template (not a real host)
   register                       0 

; Host notifications are enabled
   notifications_enabled          1  
; Command to check if host is available
   check_command   check-host-alive  
; Recheck failures this many times
   max_check_attempts            
; Repeat failure notifications every 2 hours
   notification_interval        120  
; When to check (time period name)
   notification_period         24x7  
; Notify when down, unreachable and on recovery
   notification_options       d,u,r  
; Host event handler is enabled
   event_handler_enabled          1  
; Event handler command (defined elsewhere)
   event_handler            host-eh  
; Flap detection is disabled
   flap_detection_enabled         0  
; Save performance data
   process_perf_data              1 
; Save status information across restarts 
   retain_status_information      1  
}

This template defines a variety of host-monitoring settings (which are explained in the comments following the semicolons). Here is a host definition that uses this template:

define host{
; Template on which to base host
   use                        normal  
; Note the attribute is not "name" as above
   host_name                  beulah  
; Longer description
   alias            beulah: SuSE 8.1  
; IP address
   address              192.168.1.44  
; Overrides template value
   max_check_attempts              8  
}

Other hosts may be defined in a similar way. Host definitions themselves can also be used as templates, provided that a name attribute is included.

Once hosts have been defined, they may be placed into host groups via directives like this one:

define hostgroup{
   hostgroup_name      bldg2
   alias               Building 2
   contact_groups      admins1
   members             beulah,callisto,ariadne,leah,lovelace,valley
}

This definition creates the host group named bldg2, consisting of six hosts (all previously defined via define host directives). The contact_groups attribute specifies who to send notifications to, and it is defined elsewhere (as we'll see).

You can use as many host groups as you want to. Hosts can be part of multiple host groups, and host groups themselves may be nested.

Services

Here are two service templates and a service definition:

define service{  ; Define defaults for all services
   name                  generic
   register                    0
; Check service every 30 minutes
   normal_check_interval      30  
; Retry failing checks every 3 minutes, up to 5 times
   retry_check_interval        3  
   max_check_attempts          5
   event_handler_enabled       1
   check_period             24x7
; Repeat notifications for failures every 2 hours
   notification_interval     120  
   notification_period     6to22
; Notify contacts about critical failures/recoveries
   notification_options      c,r  
   notifications_enabled       1
   contact_groups         admins
} 

define service{  ; Define the SMTP service
   use                    generic
   name              generic-smtp
   register                     0

   service_description Check SMTP
   check_command       check_smtp
   event_handler          eh_smtp
   contact_groups      mailadmins
}

define service{  ; Define services to be monitored
   use               generic-SMTP
; Monitor SMTP for all hosts in this host group
   host_groups          mailhosts  
}

The first template (generic) defines some settings, which can be applied to a variety of service types. The second template, generic-SMTP, uses the first template as a starting point and adds to them in order to create a generic SMTP monitoring service. Specifically, it defines a check command, an event handler, and a contact group that are appropriate for the SMTP service. The final define service stanza sets up SMTP monitoring for all of the hosts in the mailhosts host group.

Contacts and Contact Groups

Here are two stanzas defining a contact and a contact group:

define contact{
   contact_name                    nagadmin
   alias                           Nagios Admin
; When to notify about service problems
   service_notification_period     6to22  
; When to notify about host problems
   host_notification_period        24x7   
; Notify on critical problems and recoveries
   service_notification_options    c,r    
; Notify on host down and recoveries
   host_notification_options       d,r    
   service_notification_commands   notify-by-email
   host_notification_commands      host-notify-by-epager
   email                           nagios-admins@ahania.com
   pager                           $ADMINPAGER$
}

define contactgroup{
   contactgroup_name               mailadmins
   alias                           Mail Admins
   members                         mailadm,chavez,catfemme
}

The first stanza defines a contact named nagadmin. It also defines what events to notify this contact about and the time periods during which notifications should be sent. The commands to use to generate the alerts are also specified, along with arguments to them (see below).

Time Periods

Time period definitions are quite simple. Here are the definitions of the two time periods we have used so far:

define timeperiod{
   timeperiod_name 24x7
   alias           24 Hours A Day, 7 Days A Week
   sunday          00:00-24:00
   monday          00:00-24:00
   tuesday         00:00-24:00
   wednesday       00:00-24:00
   thursday        00:00-24:00
   friday          00:00-24:00
   saturday        00:00-24:00
}

define timeperiod{
   timeperiod_name 6to22
   alias           Weekdays, 6 AM to 10 PM
   Monday          06:00-22:00
   Tuesday         06:00-22:00
   Wednesday       06:00-22:00
   Thursday        06:00-22:00
   Friday          06:00-22:00
}

Note that only the applicable days need be included in the definition.

Commands

The commands referred to in many of the preceding object definitions also must be defined. For example, here is the SMTP service check command definition:

define command{
   command_name  check_smtp
   command_line  $USER1$/check_smtp -H $HOSTADDRESS$
}

This command runs the check_smtp script stored in the directory defined in the macro $USER1$ (defined in the resource.cfg file--see below); this macro conventionally holds the path to the Nagios plug-ins directory. The command is passed the option -H, followed by the IP address of the host to be checked (the latter is expanded from the built-in $HOSTADDRESS$ macro).

You can determine the syntax for any plug-in by running it with the --help option. You can also extend Nagios by adding custom plug-ins of your own. See the documentation for details on how to accomplish this.

Event handers are defined in the same way, as in this example:

define command{
   command_name  eh_smtp
   command_line  /usr/local/nagios/eh/fix_mail $HOSTADDRESS$ $STATETYPE$
}

Here, we define the command named eh_smtp. It specifies the full path to a program to run, passing two arguments: the host's IP address and the value of the $STATETYPE$ macro. This item is set to HARD for critical failures and SOFT for warnings.

Here are the definitions of commands used for notifications (we've wrapped the command_line setting for clarity):

define command{
   command_name  notify-by-email
   command_line  /usr/bin/printf "%b" "***** Nagios 1.0 *****\n\n
                 Notification Type: $NOTIFICATIONTYPE$\n\n
                 Service: $SERVICEDESC$\n
                 Host: $HOSTALIAS$\n
                 Address: $HOSTADDRESS$\n
                 State: $SERVICESTATE$\n\n
                 Date/Time: $DATETIME$\n\n
                 Additional Info:\n\n$OUTPUT$" | 
     /usr/bin/mail -s "** $NOTIFICATIONTYPE$
     alert - $HOSTALIAS$/$SERVICEDESC$ 
                 is $SERVICESTATE$ **" $CONTACTEMAIL$
}

This command constructs a simple email message using the printf command and many built-in Nagios macros. It then sends the message using the mail command, specifying the recipient as the $CONTACTEMAIL$ macro. The latter contains the value of the corresponding email attribute for the host or service that is generating the alert.

The cgi.cfg File

The cgi.cfg configuration file has several different functions with the Nagios system. Among the most important is authentication, allowing Nagios and its data to be restricted to appropriate people. Here are some sample directives related to authorization:

use_authentication=1
authorized_for_configuration_information=netsaintadmin,root,chavez
authorized_for_all_services=netsaintadmin,root,chavez,maresca

The first entry enables the access control mechanism. The next two entries specify users who are allowed to view Nagios configuration information and services status information (respectively). Note that all users also must be authenticated to the Web server using the usual Apache htpasswd mechanism.

This same configuration file is also used to store settings for icon-based status displays, as in these examples:

hostextinfo[janine]=;redhat.gif;;redhat.gd2;;168,36;,,;
hostextinfo[ishtar]=;apple.gif;;apple.gd2;;125,36;,,;

These entries specify extended attributes for the hosts defined in the entries labeled janine and ishtar. The filenames in this example specify images files for the host in status tables (GIF format--see Figure 3) and in the status map (GD2 format), and the two numeric values specify the device's location--for example, x and y coordinates--within the 2D status map. (Figure 4 provides an example status map display).

The resource.cfg File

The final configuration file we will consider is the resource.cfg file. It is used to define site-specific macros, conventionally named $USER1$ through $USER32$:

# $USER1$ = path to plugins directory
$USER1$=/usr/lib/nagiosplugins    
...

# Store a username and password (hidden)
$USER3$=administrator
$USER4$=somepassword

The first macros defines the path to the Nagios plug-ins directory; this usage is assumed by the supplied sample configuration files.

The other two macros are used in this case to store a username and password. These items can be used in command definitions for added security. The resource.cfg file itself can be protected against all non-root access without compromising the ability of CGI programs to run successfully.

Pre-Checking a Nagios Configuration

Since Nagios configuration is somewhat involved, the package provides a command that can be used to verify it prior to running the program. Here is an example of its use:

# cd /usr/local/nagios/etc
# /usr/local/nagios/bin/nagios -v nagios.cfg

This will check the Nagios configuration, which uses nagios.cfg as its main configuration file.

Essential System Administration Pocket Reference

Related Reading

Essential System Administration Pocket Reference
Commands and File Formats
By Æleen Frisch

Resources

For more information about Nagios, including installation instructions, and how to initiate and manage monitoring, consult the following sources:

If you liked this article and would like to receive the free ESA3 newsletter, you can sign up here.

You may also like AEleen's latest book, the just-released System Administration Pocket Reference.

Æleen Frisch has been a system administrator for over 20 years, tending a plethora of VMS, Unix, Macintosh, and Windows systems. If you liked this article and would like to receive the free ESA3 newsletter, you can sign up at http://www.aeleen.com/esa3_news.htm.


O'Reilly & Associates recently released (August 2002) Essential System Administration, 3rd Edition.


Return to ONLamp.com.

Copyright © 2009 O'Reilly Media, Inc.