Top Five Open Source Packages for System Administrators
by Æleen Frisch, author of Essential System Administration, 3rd Edition
This is the fourth installment of a five-part series in which I introduce my current list of the most useful and widely applicable open source administrative tools. In general, these tools can make your job easier no matter what Unix operating system your computers run.
Second place in my top five tools list goes to Nagios, written by Ethan Galstad. Nagios is a feature-rich network monitoring package. Its displays provide current information about system or resource status across an entire network. It can also be configured to send alerts and perform other actions when problems are detected. This week, we'll look at the sort of monitoring that Nagios provides and briefly discuss configuring the package.
Note: Nagios was formerly known as Netsaint. Netsaint configuration files are compatible with Nagios, although Nagios has adopted a new, simpler syntax. You can also convert Netsaint configuration files with the included
In This Series
Number Five: Amanda
Number Four: LDAP
Number Three: GRUB
What Nagios Can Do
Nagios monitors a wide variety of system properties, including system performance metrics such as load average and free disk space; the presence of important services like HTTP and SMTP; and per-host network availability and reachability. It also allows the system administrator to define what constitutes a significant event on each host--for example, how high a load average is "too high"--and what to do when such conditions are detected.
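To make the idea of a per-host threshold concrete, here is a minimal Python sketch of a load-average check. The function names and the default threshold values are illustrative assumptions, not part of Nagios; the only convention borrowed from Nagios is the plugin-style status codes (0 for OK, 1 for WARNING, 2 for CRITICAL).

```python
import os

# Status codes following the Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify_load(load, warn, crit):
    """Map a load average to a status using per-host thresholds."""
    if load >= crit:
        return CRITICAL
    if load >= warn:
        return WARNING
    return OK


def check_local_load(warn=4.0, crit=8.0):
    """Check the one-minute load average of the local machine.

    The default thresholds are hypothetical; what counts as
    "too high" depends on the host being monitored.
    """
    one_minute, _, _ = os.getloadavg()  # available on Unix systems
    status = classify_load(one_minute, warn, crit)
    message = f"LOAD {STATUS_NAMES[status]} - load average: {one_minute:.2f}"
    return status, message
```

A busy compute server might reasonably use much higher thresholds than a desktop workstation, which is why the thresholds are parameters rather than constants.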
In addition to detecting problems with hosts and their important services, Nagios allows the system administrator to specify what should be done as a result. A problem can trigger an alert to be sent to a designated recipient via various communication mechanisms (such as email, a Unix message, or a pager). It is also possible to define an event handler: a program that is run when a problem is detected. Such programs can attempt to solve the problem encountered, and they can also proactively prevent some serious problems when they are triggered by warning conditions.
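The division of labor between alerts and event handlers can be sketched as follows. This is a hedged illustration in Python, not Nagios code: the function `react_to_check`, the handler table, and the rule that only critical problems alert a person are all assumptions made for the example. It shows how a warning condition can trigger a preventive action while a critical one also notifies someone.

```python
def react_to_check(service, status, notify, handlers):
    """Decide what to do with one check result.

    service  -- name of the checked service (e.g. "disk")
    status   -- "OK", "WARNING", or "CRITICAL"
    notify   -- callable that sends an alert message (hypothetical)
    handlers -- dict mapping (service, status) to a callable that
                tries to fix or prevent the problem (hypothetical)
    Returns the list of actions taken, for logging.
    """
    actions = []
    if status == "OK":
        return actions

    # Run an event handler if one is defined for this condition.
    handler = handlers.get((service, status))
    if handler is not None:
        handler()
        actions.append(f"handler:{service}:{status}")

    # Only critical problems alert a person in this sketch.
    if status == "CRITICAL":
        notify(f"{service} is {status}")
        actions.append("notified")
    return actions
```

For instance, a handler registered for `("disk", "WARNING")` could clear temporary files before the disk actually fills, which is the proactive use mentioned above.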
The information that Nagios collects is displayed in a series of automatically generated Web pages. This format is quite convenient in that it allows a system administrator to view network status information from various points throughout the network.
Figure 1 illustrates the top-level Nagios display, known as the "Tactical Overview."
Figure 1. Nagios Tactical Overview display
The narrow column on the left of the display lists links to all of the possible Nagios displays (the one for the current display has been highlighted in the illustration). The Tactical Overview shows very general statistics about the overall network status. In this case, 20 hosts are being monitored, and 16 are currently up. Three hosts are down, and one is unreachable from the monitoring system, presumably because the gateway to it is down. Of the problems on the three hosts that are down, one has been acknowledged by a system administrator. The display also indicates that there are three services that have "critical" status (probably indicating a failure), and two others are in a "warning" state.
Each of the problem indicator displays also functions as a link to another Web page giving details about that particular item.
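The distinction between "down" and "unreachable" hosts can be illustrated with a short sketch. It assumes that each host lists the gateways (parents) that sit between it and the monitoring system and that these relationships contain no cycles; the data layout and the function name are hypothetical, and the actual Nagios logic may differ in detail.

```python
def classify_hosts(parents, responding):
    """Label each host UP, DOWN, or UNREACHABLE.

    parents    -- dict: host -> list of gateway hosts between it and
                  the monitoring system (empty list = directly attached)
    responding -- set of hosts that answered their last check
    """
    states = {}

    def state_of(host):
        if host in states:
            return states[host]
        if host in responding:
            states[host] = "UP"
        elif not parents.get(host):
            # No gateway in between, so the host itself must be down.
            states[host] = "DOWN"
        elif any(state_of(p) == "UP" for p in parents[host]):
            # A path to the host works, yet it does not answer.
            states[host] = "DOWN"
        else:
            # Every gateway is down or unreachable, so we cannot tell.
            states[host] = "UNREACHABLE"
        return states[host]

    for host in parents:
        state_of(host)
    return states
```

With this rule, a host behind a failed gateway is reported as unreachable rather than down, so the administrator's attention goes to the gateway first.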
Figure 2 illustrates a Nagios Status Overview display. The three sections display summary status information about the hosts being monitored (upper left), services being monitored (upper right), and a further status breakdown by host group (lower portion of the boxed section of the figure). Once again, each item contains links to more detailed views of its current information. In this case, the hosts that are being monitored have been configured into four groups for Nagios reporting purposes. Three of the groups contain hosts in the same physical location within the company, and the final group, Printers, contains network printers that are being monitored. The system administrator is free to group hosts and devices in ways that make sense for her needs.
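A per-group status breakdown like this one amounts to a simple tally. The sketch below assumes a hypothetical mapping from group names to member hosts and another from hosts to their current state; the group and host names are invented for illustration, as is the choice to count hosts without a recorded state as "PENDING".

```python
from collections import Counter


def summarize_groups(groups, host_states):
    """Count host states within each host group.

    groups      -- dict: group name -> list of member host names
    host_states -- dict: host name -> "UP", "DOWN", or "UNREACHABLE"
    Hosts with no recorded state are counted as "PENDING".
    """
    summary = {}
    for group, members in groups.items():
        summary[group] = Counter(
            host_states.get(host, "PENDING") for host in members
        )
    return summary


# Hypothetical example data: one location group and a printer group.
groups = {
    "Building A": ["alpha", "beta"],
    "Printers": ["laser1", "laser2", "laser3", "laser4"],
}
host_states = {
    "alpha": "UP", "beta": "UP",
    "laser1": "UP", "laser2": "DOWN", "laser3": "UP", "laser4": "UP",
}
for group, counts in summarize_groups(groups, host_states).items():
    print(group, dict(counts))
```

Because a host can appear in more than one list, the same structure supports grouping by location, by function, or by any other scheme that suits the site.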
Figure 2. Nagios Status Overview and details for the Printers group
The display at the bottom of Figure 2 shows the most important part of the detailed display that results when one clicks on the Printers link in the upper display. It lists each printer separately, along with its device status and services status. In this example, at the moment, one of the four printers is down (the printer named
Figure 3 illustrates the detailed display that can be obtained for an individual host (or device). Here we see some detailed information about a host named leah. Once again, there are several sections to the display. The host name and IP address appear in the upper left of the display, along with an icon that the system administrator has assigned to this host. Here, the icon suggests that the system's operating system is some version of Windows; conventionally, icons are keyed to the operating system type. The table in the upper right gives some overall uptime and reachability statistics about the host over the period that the current monitoring session has been running.
Figure 3. Detailed Host Status information about host Leah
The table below the operating system icon, titled "Host State Information," provides information about the current status of the host, including whether or not it is up, how long it has been that way, when it was last checked, the command used to perform the check, and the settings of various configuration parameters (such as host notifications and event handler).
The box titled "Host Commands" contains a series of links, which allow the system administrator to perform many different monitoring-related actions on this host. The various items are described in Table 1. Examining the list will give you further details about Nagios' capabilities.
Table 1. Available actions in the Nagios Host Information display
|Disable checks of this host||Stop monitoring this host for availability.|
|Acknowledge this host problem||Respond to a current problem (discussed below).|
|Disable notifications for this host||Don't send alerts if this host is unavailable.|
|Delay next host notification||Delay the next alert for host unavailability.|
|Schedule downtime for this host. Cancel scheduled downtime for this host.||Define or cancel scheduled downtime. During downtime, host unavailability is not considered a problem.|
|Disable notifications for all services on this host. Enable notifications for all services on this host.||Don't/do send alerts if a service on this host fails.|
|Schedule an immediate check of all services on this host||Check all services as soon as possible (rather than waiting for their next scheduled time).|
|Disable checks of all services on this host. Enable checks of all services on this host.||Disable or enable checking service health on this host.|
|Disable event handler for this host||Prevent the event handler from running when a problem is detected on this host.|
|Disable flap detection for this host||Don't try to detect flaps (rapid up-down or on-off oscillations) on this host or its services.|
The second menu item allows you to acknowledge any current problem. Acknowledging simply means "I know about the problem, and it is being handled." Nagios marks the corresponding event as such, and future alerts are suppressed until the item returns to its normal state. This process also allows you to enter a comment explaining the situation, an action that is helpful when more than one administrator regularly examines the monitoring data.
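The way acknowledgments, scheduled downtime, and disabled notifications suppress alerts can be summarized in a few lines. This is a simplified sketch of the behavior described above, not Nagios's actual logic; the class, its fields, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HostRecord:
    name: str
    state: str = "UP"               # "UP" or "DOWN"
    acknowledged: bool = False      # set by an administrator
    comment: str = ""               # explanation entered with the ack
    in_downtime: bool = False       # inside a scheduled downtime window
    notifications_enabled: bool = True


def acknowledge(host, comment):
    """Mark the current problem as known and being handled."""
    if host.state != "UP":
        host.acknowledged = True
        host.comment = comment


def update_state(host, new_state):
    """Record a check result; recovery clears any acknowledgment."""
    host.state = new_state
    if new_state == "UP":
        host.acknowledged = False
        host.comment = ""


def should_notify(host):
    """Return True if an alert should be sent for this host right now."""
    return (
        host.state != "UP"
        and host.notifications_enabled
        and not host.in_downtime
        and not host.acknowledged
    )
```

Clearing the acknowledgment on recovery matters: if the same host fails again later, that is a new problem and alerts resume.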
If you don't like all of these table-oriented status displays, Nagios can also produce graphical ones. For example, Figure 4 illustrates a map created for the small network being monitored here. The map is laid out to indicate three separate groups of hosts, with host taurus serving as a gateway between the group at the upper left and the ones at the bottom of the window.
Figure 4. A Nagios map
Much more complex network topologies can be represented in an analogous way. See the Nagios Web site for example screen shots.