Monitoring RAID with NetSaint
Know Your RAID
Each RAID utility will respond differently to different situations, so I
investigated what raidutil reports about my Adaptec 2400A. I did that by
disconnecting a drive from the array, booting, and then rebuilding the
array. The statuses reported at each stage allowed me to customize my
scripts.
Normal
Here is what raidutil reports when all is well:
# /usr/local/bin/raidutil -L logical
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 RAID 5 (Redundant ADAPTEC RAID-5 228957MB Optimal
Degraded
I shut down the system, removed the power from one drive, and then rebooted.
Here is what raidutil reported:
# /usr/local/bin/raidutil -L logical
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 RAID 5 (Redundant ADAPTE RAID-5 228957MB Degraded
This is the normal situation when a disk has died or, in this case, has been removed from the array.
After I added the disk back in, raidutil reported the same
status. To recover an array, you must rebuild it!
Reconstruction
You can also use raidutil to start the rebuilding process. This
will sync up the degraded drive with the rest of the array. This can be a
lengthy process, but it is vital. Start rebuilding with this command:
$ /usr/local/bin/raidutil -a rebuild d0 d0b0t0d0
where d0b0t0d0 is the address supplied in the above
raidutil output.
After rebuilding has started, raidutil will report:
# /usr/local/bin/raidutil -L logical
Address Type Manufacturer/Model Capacity Status
---------------------------------------------------------------------------
d0b0t0d0 RAID 5 (Redundant ADAPTE RAID-5 228957MB Reconstruct 0%
The percentage will slowly creep up until all disks are resynced.
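The three listings above differ only in the last column, so a monitoring script needs just that value. Here is a minimal sketch of one way to extract it, assuming the layout shown above (one line per array, with the status following a capacity given in MB). The function name raid_status is my own invention and is not part of raidutil.

```shell
#!/bin/sh
# Sketch: pull the Status column out of a `raidutil -L logical` listing.
# Assumes the layout shown above: a header line, a dashed rule, then one
# line per array whose address looks like d0b0t0d0.

raid_status() {
    # Reads the listing on stdin and prints everything after the capacity
    # field (the token ending in "MB"), e.g. "Optimal" or "Reconstruct 0%".
    awk '/^d[0-9]+b[0-9]+t[0-9]+d[0-9]+/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9]+MB$/) {
                status = ""
                for (j = i + 1; j <= NF; j++)
                    status = status (status == "" ? "" : " ") $j
                print status
                next
            }
        }
    }'
}

# Typical use on the RAID machine:
#   /usr/local/bin/raidutil -L logical | raid_status
```

Keying on the capacity field rather than a fixed column number keeps the sketch working even though the Type and Manufacturer/Model columns contain spaces.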
Using netsaint_statd
The scripts supplied with netsaint_statd come in two types:
- Scripts that fetch information from a remote machine
- A daemon that processes incoming requests and supplies the information
The daemon is netsaint_statd. Install it on every machine you
wish to monitor. I downloaded the netsaint_statd tarball
and untarred it to the directory
/usr/local/libexec/netsaint/netsaint_statd on my RAID machine.
Strictly speaking, the check_*.pl scripts do not need to be on the
RAID machine; only the netsaint_statd daemon does. You can remove them
if you want. I have them only on the NetSaint machine.
I use the following script to start it at boot time:
$ less /usr/local/etc/rc.d/netsaint_statd.sh
#!/bin/sh
case "$1" in
start)
        /usr/local/libexec/netsaint/netsaint_statd/netsaint_statd
        ;;
esac
exit 0
Then I started up the script:
# /usr/local/etc/rc.d/netsaint_statd.sh start
The RAID machine has the netsaint_statd script running as a
daemon waiting for incoming requests. Now I can move my attention to the
NetSaint machine.
This post on remote monitoring by RevDigger is the basis for what I did to
set up netsaint_statd.
I installed the netsaint_statd tarball into the same directory
on the NetSaint machine. When you install it, remember that it needs the
check_*.pl scripts this time.
Now that NetSaint has the tools, you need to tell it about them. I added this to the end of my /usr/local/etc/netsaint/commands.cfg file:
# netsaint_statd remote commands
command[check_rload]=$USER1$/netsaint_statd/check_load.pl \
$HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$
command[check_rprocs]=$USER1$/netsaint_statd/check_procs.pl \
$HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$
command[check_rusers]=$USER1$/netsaint_statd/check_users.pl \
$HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$
command[check_rdisk]=$USER1$/netsaint_statd/check_disk.pl \
$HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$
command[check_rall_disks]=$USER1$/netsaint_statd/check_all_disks.pl \
$HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$
command[check_adptraid.pl]=$USER1$/netsaint_statd/check_adptraid.pl \
$HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$
Here are the entries I added to /usr/local/etc/netsaint/hosts.cfg
to add monitoring for the machine named polo. Specifically, I
wanted to monitor the load, the number of processes, the number of users, and disk
space.
service[polo]=LOAD;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rload! 3
service[polo]=PROCS;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rprocs!
service[polo]=USERS;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rusers! 4
service[polo]=DISKSALL;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rall_disks
Then I restarted NetSaint:
% /usr/local/etc/rc.d/netsaint.sh restart
After the restart, I began to see those services in my NetSaint web interface. This is great!
RAID Notification Overview
Persuading NetSaint to monitor my RAID array was not as simple as
configuring it to monitor a regular disk. I was already using netsaint_statd to
monitor remote machines. I have them all set up so I can see load, process
count, users, and disk space usage. I extended netsaint_statd to
monitor RAID status.
This additional feature involved a few distinct steps:
- Creating a Perl script for use by netsaint_statd to monitor the RAID
- Extending netsaint_statd to use that script
- Adding RAID to the services monitored by NetSaint
RAID Perl script
As the basis for the Perl script, I used check_users.pl as
supplied with netsaint_statd to create check_adptraid.pl.
I installed that script into the same directory as all the other
netsaint_statd scripts
(/usr/local/libexec/netsaint/netsaint_statd).
If you look at this script, you'll see that it looks for the three major status values:
if ($servanswer =~ m%^Reconstruct%) {
    $state = "WARNING";
    $answer = $servanswer;
} else {
    if ($servanswer =~ m%^Degraded%) {
        $state = "CRITICAL";
        $answer = $servanswer;
    } else {
        if ($servanswer =~ m%^Optimal%) {
            $state = "OK";
            $answer = $servanswer;
        } else {
            $answer = $servanswer;
            $state = "CRITICAL";
        }
    }
}
I decided that degraded and unknown results will be CRITICAL, optimal will
be OK, and reconstruction will be a WARNING.
The next step was to modify netsaint_statd to use this newly
added script.
netsaint_statd patch
Apply the netsaint_statd patch like this:
$ cd /usr/local/libexec/netsaint/netsaint_statd
$ patch < path.to.patch.you.downloaded
Now that you have modified the daemon, kill it and restart it:
# ps auwx | grep netsaint_statd
root 28778 0.0 0.5 3052 2460 ?? Ss 6:56PM 0:00.32 /usr/bin/perl /usr/local/libexec/netsaint/netsaint_statd/netsaint_statd
# kill -TERM 28778
# /usr/local/etc/rc.d/netsaint_statd.sh start
#
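If you restart the daemon often, reading the PID out of the ps listing by eye gets tedious. The following is a small sketch of how the lookup could be scripted. It assumes ps auwx-style output with the PID in the second column, and statd_pids is a hypothetical helper name of my own.

```shell
#!/bin/sh
# Sketch: find the netsaint_statd PID(s) without reading ps output by eye.
# statd_pids is a hypothetical helper; it expects `ps auwx`-style lines on
# stdin, with the PID in the second column.

statd_pids() {
    # Match the daemon's full path and skip any grep or awk processes,
    # whose command lines may also mention the name.
    awk '/netsaint_statd\/netsaint_statd/ && $0 !~ /grep|awk/ { print $2 }'
}

# Typical use (as root):
#   for pid in $(ps auwx | statd_pids); do kill -TERM "$pid"; done
#   /usr/local/etc/rc.d/netsaint_statd.sh start
```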
Add RAID to the Services Monitored by NetSaint
The remote RAID box is ready to tell you all about the RAID status. Now it's time to test it.
# cd /usr/local/libexec/netsaint/netsaint_statd
# perl check_adptraid.pl polo
Reconstruct 85%
That looks right to me! Now I'll show you what I added to NetSaint to use this new tool.
First, I added the service definition to /usr/local/etc/netsaint/hosts.cfg:
service[polo]=RAID;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_adptraid.pl
I have set up a new notification_group (raid-admins) because I
want to receive notifications via text message to my cell phone when the RAID
array has a problem.
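The service line packs a lot into one row of semicolon-separated fields, so it is easy to put a value in the wrong slot. Here is a small sketch for checking a line before restarting NetSaint. The field positions are my reading of the service[] format (field 7 is the notification group, field 14 the check command), so treat them as an assumption and confirm them against your NetSaint documentation. service_field is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: pick fields out of a NetSaint service line. Field positions are
# an assumption based on the lines shown in this article: field 7 is the
# notification group, field 14 the check command.

service_field() {
    # $1 = full service line, $2 = 1-based field number after the "=".
    printf '%s\n' "$1" | sed 's/^[^=]*=//' | cut -d';' -f"$2"
}

# Example:
#   line='service[polo]=RAID;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_adptraid.pl'
#   service_field "$line" 7     # prints raid-admins
#   service_field "$line" 14    # prints check_adptraid.pl
```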
The contact group I created was:
contactgroup[raid-admins]=RAID Administrators;danphone,dan
In this case, I want notifications to go to the contacts danphone
and dan.
Here are the contacts that relate to the above contact group; the lines below may be wrapped, but in NetSaint there should be only two lines:
contact[dan]=Dan Langille;24x7;24x7;1;1;0;1;1;0;notify-by-email;
host-notify-by-email;dan;
contact[danphone]=Dan Langille;24x7;24x7;1;1;0;1;1;0;notify-xtrashort;
notify-xtrashort;dan;6135551212@pcs.example.com;
This shows that the script will email me and send a message to my cell phone.
After restarting NetSaint, I saw Figure 1.

Figure 1. A NetSaint warning
If your RAID is really important to you, then you will definitely want to test the notification via cell phone. I did. I know it works. I hope it goes unused.
Got Monitor?
I've said it before, and you'll hear it again: you must monitor your RAID to get its full benefit. By using NetSaint and the above scripts, you should have plenty of time to replace a dead drive before the array is destroyed. That notification alone could save you several hours.
Happy RAIDing.
Dan Langille runs a consulting group in Ottawa, Canada, and lives in a house ruled by felines.




