Published on ONLamp.com (http://www.onlamp.com/)


Monitoring RAID with NetSaint

by Dan Langille
03/17/2005

In my previous article, I talked about my RAID-5 installation. It has been up and running for a few days now. I'm pleased with the result. However, RAID can fail. When it does, you need to take action before the next failure. Two failures close together, no matter how rare that may be, will involve a complete reinstall. [1]

I have been using NetSaint since first writing about it back in 2001. NetSaint development has continued under a new name: Nagios. I continue to use NetSaint; it does what I need.

The monitoring consists of three main components:

  1. NetSaint (which I assume you have installed and configured). I'm guessing my tools will also work with Nagios.
  2. netsaint_statd, which provides remote monitoring of hosts, as patched with my change.
  3. check_adptraid.pl, the plugin that monitors the RAID status.

With these simple tools, you'll be able to monitor your RAID array.

[1] For my setup, at least. You might know of RAID setups that allow for multiple failures, but mine does not.

Monitoring the Array

Monitoring the health of your RAID array is vital to the health of your system. Fortunately, Adaptec has a tool for this. It is available within the FreeBSD sysutils/asr-utils port. After installing the port, it took me a while to figure out what to use and how to use it. Compounding the problem, a runtime error took me on a little tangent before I could get it running. I will show you how to integrate this utility into your NetSaint configuration.
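If you have the ports tree installed, building the utility is the usual two-step (this assumes an up-to-date ports collection):

# cd /usr/ports/sysutils/asr-utils
# make install clean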

My first few attempts at running the monitoring tool failed, with this result:

# /usr/local/sbin/raidutil -L all
Engine connect failed: Open

After some Googling, I found that the problem was shared memory: with PostgreSQL running, raidutil could not acquire the segments it needed. I hunted around, asked questions, and found a few knobs and switches:

# grep SHM /usr/src/sys/i386/conf/LINT
options         SYSVSHM         # include support for shared memory
options         SHMMAXPGS=1025  # max amount of shared memory pages (4k on i386)
options         SHMALL=1025     # max number of shared memory pages system wide
options         SHMMAX="(SHMMAXPGS*PAGE_SIZE+1)"
options         SHMMIN=2        # min shared memory segment size (bytes)
options         SHMMNI=33       # max number of shared memory identifiers
options         SHMSEG=9        # max shared memory segments per process

These kernel options are also available as sysctl values:

$ sysctl -a | grep shm
kern.ipc.shmmax: 33554432
kern.ipc.shmmin: 1
kern.ipc.shmmni: 192
kern.ipc.shmseg: 128
kern.ipc.shmall: 8192
kern.ipc.shm_use_phys: 0
kern.ipc.shm_allow_removed: 0

I started playing with kern.ipc.shmmax but failed to find anything useful, even at some very large values. I suspect someone will suggest appropriate settings. I found the solution by reducing the number of PostgreSQL connections, changing the value of max_connections from 40 to 30 in /usr/local/pgsql/data/postgresql.conf. The following command signaled the PostgreSQL postmaster to pick up the change:

$ kill -HUP `cat /usr/local/pgsql/data/postmaster.pid`
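Incidentally, if you would rather raise the shared-memory ceiling than reduce PostgreSQL's appetite, the sysctls above can be changed at runtime and made permanent in /etc/sysctl.conf. The value below is only an example; as I mentioned, this route did not solve my problem:

# sysctl kern.ipc.shmmax=67108864
# echo 'kern.ipc.shmmax=67108864' >> /etc/sysctl.conf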

Now that raidutil can run, the output should resemble:

$ sudo raidutil -L all
RAIDUTIL  Version: 3.04  Date: 9/27/2000  FreeBSD CLI Configuration Utility
Adaptec ENGINE  Version: 3.04  Date: 9/27/2000  Adaptec FreeBSD SCSI Engine

#  b0 b1 b2  Controller     Cache  FW    NVRAM     Serial     Status
---------------------------------------------------------------------------
d0 -- -- --  ADAP2400A      16MB   3A0L  CHNL 1.1  BF0B111Z0B4Optimal

Physical View
Address    Type              Manufacturer/Model         Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
d0b1t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal
d0b2t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Replaced Drive
d0b3t0d0   Disk Drive (DASD) ST380011 A                 76319MB   Optimal

Logical View
Address       Type              Manufacturer/Model      Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0     RAID 5 (Redundant ADAPTEC  RAID-5         228957MB  Reconstruct 94%
 d0b0t0d0    Disk Drive (DASD) ST380011 A              76319MB   Optimal
 d0b1t0d0    Disk Drive (DASD) ST380011 A              76319MB   Optimal
 d0b2t0d0    Disk Drive (DASD) ST380011 A              76319MB   Replaced Drive
 d0b3t0d0    Disk Drive (DASD) ST380011 A              76319MB   Optimal


Address    Max Speed  Actual Rate / Width
---------------------------------------------------------------------------
d0b0t0d0   50 MHz     100 MB/sec    wide
d0b1t0d0   50 MHz     100 MB/sec    wide
d0b2t0d0   50 MHz     100 MB/sec    wide
d0b3t0d0   10 MHz     100 MB/sec    wide

Address    Manufacturer/Model        Write Cache Mode (HBA/Device)
---------------------------------------------------------------------------
d0b0t0d0   ADAPTEC  RAID-5           Write Back / --
 d0b0t0d0  ST380011 A                -- / Write Back
 d0b1t0d0  ST380011 A                -- / Write Back
 d0b2t0d0  ST380011 A                -- / Write Back
 d0b3t0d0  ST380011 A                -- / Write Back

#  Controller     Cache  FW    NVRAM     BIOS   SMOR      Serial
---------------------------------------------------------------------------
d0 ADAP2400A      16MB   3A0L  CHNL 1.1  1.62   1.12/79I  BF0B111Z0B4

#  Controller      Status     Voltage  Current  Full Cap  Rem Cap  Rem Time
---------------------------------------------------------------------------
d0 ADAP2400A       No battery

Address    Manufacturer/Model        FW          Serial        123456789012
---------------------------------------------------------------------------
d0b0t0d0   ST380011 A                3.06 1ABW6AY1             -X-XX--X-O--
d0b1t0d0   ST380011 A                3.06 1ABEYH4P             -X-XX--X-O--
d0b2t0d0   ST380011 A                3.06 1ABRWK0E             -X-XX--X-O--
d0b3t0d0   ST380011 A                3.06 1ABRDS5E             -X-XX--X-O--

Capabilities Map:  Column 1 = Soft Reset
                   Column 2 = Cmd Queuing
                   Column 3 = Linked Cmds
                   Column 4 = Synchronous
                   Column 5 = Wide 16
                   Column 6 = Wide 32
                   Column 7 = Relative Addr
                   Column 8 = SCSI II
                   Column 9 = S.M.A.R.T.
                   Column 0 = SCAM
                   Column 1 = SCSI-3
                   Column 2 = SAF-TE
   X = Capability Exists, - = Capability does not exist, O = Not Supported

The output shows the controller details, the physical and logical views of the drives, bus speeds, write cache modes, battery status, and a capabilities map for each drive. It is a subset of this information that you can use to determine whether all is well with the RAID array. My next task was to experiment and determine what raidutil reports when the array is in different states.

Note: I did not actually replace d0b2t0d0 as the output above indicates. As part of my RAID testing, I shut down the system, disconnected the power to one drive, started the system, verified that it still ran, shut down again, reconnected the drive, powered up again, and started to rebuild the array.

Know Your RAID

I'm sure that each RAID utility will have different responses to different situations. I investigated what raidutil reports about my Adaptec 2400A by disconnecting a drive from the array, booting, and then rebuilding the array. The conditions reported allowed me to customize my scripts.

Normal

Here is what raidutil reports when all is well:

# /usr/local/bin/raidutil -L logical
Address    Type              Manufacturer/Model         Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 5 (Redundant ADAPTEC  RAID-5            228957MB  Optimal

Degraded

I shut down the system, removed the power from one drive, and then rebooted. Here is what raidutil reported:

# /usr/local/bin/raidutil -L logical
Address    Type              Manufacturer/Model         Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 5 (Redundant ADAPTE  RAID-5            228957MB  Degraded

This is the normal situation when a disk has died or, in this case, has been removed from the array.

After I added the disk back in, raidutil reported the same status. To recover an array, you must rebuild it!

Reconstruction

You can also use raidutil to start the rebuilding process. This will sync up the degraded drive with the rest of the array. This can be a lengthy process, but it is vital. Start rebuilding with this command:

$ /usr/local/bin/raidutil -a rebuild d0 d0b0t0d0

where d0b0t0d0 is the address supplied in the above raidutil output.

After rebuilding has started, raidutil will report:

# /usr/local/bin/raidutil -L logical
Address    Type              Manufacturer/Model         Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0   RAID 5 (Redundant ADAPTE  RAID-5            228957MB  Reconstruct 0%

The percentage will slowly creep up until all disks are resynced.
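If you want to watch the rebuild without retyping the command, a simple loop does the job; the five-minute sleep is an arbitrary choice:

# while true; do raidutil -L logical; sleep 300; done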

Using netsaint_statd

The scripts supplied with netsaint_statd come in two types:

  1. netsaint_statd, the daemon, which runs on each machine you wish to monitor
  2. the check_*.pl client scripts, which the NetSaint machine uses to query the daemon

Install the daemon on every machine you wish to monitor. I downloaded the netsaint_statd tarball and untarred it into the directory /usr/local/libexec/netsaint/netsaint_statd on my RAID machine. Strictly speaking, the check_*.pl scripts do not need to be on the RAID machine, only the netsaint_statd daemon; you can remove them if you want. I have them only on the NetSaint machine.
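Unpacking goes something like this, assuming you have already fetched the tarball; exactly where the files land depends on how the tarball was rolled, so adjust to taste:

# mkdir -p /usr/local/libexec/netsaint/netsaint_statd
# cd /usr/local/libexec/netsaint/netsaint_statd
# tar xzf /path/to/netsaint_statd.tar.gz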

I use the following script to start it at boot time:

$ less /usr/local/etc/rc.d/netsaint_statd.sh
#!/bin/sh
# start the netsaint_statd daemon at boot time
case "$1" in
    start)
        /usr/local/libexec/netsaint/netsaint_statd/netsaint_statd
        ;;
esac
exit 0

Then I started up the script:

# /usr/local/etc/rc.d/netsaint_statd.sh start
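To confirm the daemon really is listening, sockstat will show its socket; because the daemon is a Perl script, it appears under perl:

# sockstat -4 -l | grep perl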

The RAID machine has the netsaint_statd script running as a daemon waiting for incoming requests. Now I can move my attention to the NetSaint machine.

This post on remote monitoring by RevDigger is the basis for what I did to set up netsaint_statd.

I installed the netsaint_statd tarball into the same directory on the NetSaint machine. When you install it, remember that it needs the check_*.pl scripts this time.

Now that NetSaint has the tools, you need to tell it about them. I added this to the end of my /usr/local/etc/netsaint/commands.cfg file:

# netsaint_statd remote commands
command[check_rload]=$USER1$/netsaint_statd/check_load.pl \
    $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$
command[check_rprocs]=$USER1$/netsaint_statd/check_procs.pl \
    $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$
command[check_rusers]=$USER1$/netsaint_statd/check_users.pl \
    $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$
command[check_rdisk]=$USER1$/netsaint_statd/check_disk.pl \
    $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$
command[check_rall_disks]=$USER1$/netsaint_statd/check_all_disks.pl \
    $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$
command[check_adptraid.pl]=$USER1$/netsaint_statd/check_adptraid.pl \
    $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$
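These commands rely on the $USER1$ macro, which NetSaint expands from its resource.cfg file. For the layout described here, it needs to point at the directory above netsaint_statd; if your resource.cfg differs, adjust accordingly:

$USER1$=/usr/local/libexec/netsaint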

Here are the entries I added to /usr/local/etc/netsaint/hosts.cfg to add monitoring for the machine named polo. Specifically, I wanted to monitor the load, the number of processes, the number of users, and disk space.

service[polo]=LOAD;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rload!3
service[polo]=PROCS;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rprocs!
service[polo]=USERS;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rusers!4
service[polo]=DISKSALL;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rall_disks

Then I restarted NetSaint:

% /usr/local/etc/rc.d/netsaint.sh restart

After the restart, I began to see those services on my NetSaint web site. This is great!

RAID Notification Overview

Persuading NetSaint to monitor my RAID array was not as simple as configuring it to monitor a regular disk. I was already using netsaint_statd to monitor remote machines. I have them all set up so I can see load, process count, users, and disk space usage. I extended netsaint_statd to monitor RAID status.

This additional feature involved a few distinct steps:

  1. Creating a Perl script for use by netsaint_statd to monitor the RAID
  2. Extending netsaint_statd to use that script
  3. Adding RAID to the services monitored by NetSaint

RAID Perl script

As the basis for the Perl script, I used check_users.pl as supplied with netsaint_statd to create check_adptraid.pl. I installed that script into the same directory as all the other netsaint_statd scripts (/usr/local/libexec/netsaint/netsaint_statd).
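Putting it in place is just a copy, plus making sure it is executable:

# cp check_adptraid.pl /usr/local/libexec/netsaint/netsaint_statd/
# chmod 755 /usr/local/libexec/netsaint/netsaint_statd/check_adptraid.pl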

If you look at this script, you'll see that it looks for the three major status values:

if ($servanswer =~ m%^Reconstruct%) {
    # a rebuild is under way: worth watching, but not an emergency
    $state  = "WARNING";
    $answer = $servanswer;
} elsif ($servanswer =~ m%^Degraded%) {
    # a drive has failed or been removed: no redundancy left
    $state  = "CRITICAL";
    $answer = $servanswer;
} elsif ($servanswer =~ m%^Optimal%) {
    # all drives are healthy
    $state  = "OK";
    $answer = $servanswer;
} else {
    # anything unrecognized is treated as CRITICAL, to be safe
    $state  = "CRITICAL";
    $answer = $servanswer;
}

I decided that degraded and unknown results would be CRITICAL, optimal would be OK, and reconstruction would be a WARNING.
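Remember that NetSaint learns the service state from the plugin's exit code, not from its text output. The tail of the script therefore maps $state to the usual numeric codes; here is a minimal sketch of that mapping, using the conventional %ERRORS hash found in NetSaint plugins:

# exit codes NetSaint understands: 0 = OK, 1 = WARNING, 2 = CRITICAL
my %ERRORS = ('OK' => 0, 'WARNING' => 1, 'CRITICAL' => 2, 'UNKNOWN' => -1);
print "$answer\n";
exit $ERRORS{$state};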

The next step was to modify netsaint_statd to use this newly added script.

netsaint_statd patch

Apply the netsaint_statd patch like this:

$ cd /usr/local/libexec/netsaint/netsaint_statd
$ patch < path.to.patch.you.downloaded

Now that you have modified the daemon, kill it and restart it:

# ps auwx | grep netsaint_statd
root 28778 0.0 0.5 3052 2460 ?? Ss 6:56PM 0:00.32 /usr/bin/perl /usr/local/libexec/netsaint/netsaint_statd/netsaint_statd
# kill -TERM 28778
# /usr/local/etc/rc.d/netsaint_statd.sh start
#

Add RAID to the Services Monitored by NetSaint

The remote RAID box is ready to tell you all about the RAID status. Now it's time to test it.

# cd /usr/local/libexec/netsaint/netsaint_statd
# perl check_adptraid.pl polo
Reconstruct 85%

That looks right to me! Now I'll show you what I added to NetSaint to use this new tool.

First, I added the service definition to /usr/local/etc/netsaint/hosts.cfg:

service[polo]=RAID;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_adptraid.pl

I set up a new contact group (raid-admins) because I want to receive notifications via text message on my cell phone when the RAID array has a problem.

The contact group I created was:

contactgroup[raid-admins]=RAID Administrators;danphone,dan

In this case, I want notifications to go to the contacts danphone and dan.

Here are the contacts that relate to the above contact group; the lines below may be wrapped, but in NetSaint there should be only two lines:


contact[dan]=Dan Langille;24x7;24x7;1;1;0;1;1;0;notify-by-email;
     host-notify-by-email;dan;
contact[danphone]=Dan Langille;24x7;24x7;1;1;0;1;1;0;notify-xtrashort;
     notify-xtrashort;dan;6135551212@pcs.example.com;

With these in place, NetSaint will email me and send a message to my cell phone.

After restarting NetSaint, I saw Figure 1.

Figure 1. A NetSaint warning.

If your RAID is really important to you, then you will definitely want to test the notification via cell phone. I did. I know it works. I hope it goes unused.

Got Monitor?

I've said it before, and you'll hear it again: you must monitor your RAID array to get its full benefit. By using NetSaint and the above scripts, you should have plenty of time to replace a dead drive before the array is destroyed. That notification alone could save you several hours.

Happy RAIDing.

Dan Langille runs a consulting group in Ottawa, Canada, and lives in a house ruled by felines.
