So you've deployed your networks, systems, applications, and all of the lubricant necessary to make them work. (Usually, that lubricant comes in the form of aspirin and/or alcohol for the technician, but that's another story altogether.)
Once everything is handily deployed, you suddenly find yourself thinking, "Whew, that's a job well done." But then it dawns on you: not only are you not done, but you are now stuck tending the monster you've just created. Dr. Frankenstein found himself in a similar situation at one point.
What you need is a way to make sure these resources are available to your users, as well as to provide yourself a handy way to consolidate the critical information from each of those devices into a single console. This will allow you to verify system availability, provide a single clearinghouse of all the information you need to do your job, and enable you to keep your sanity intact.
OpenNMS was designed from the ground up to be a one-for-one replacement for HP's OpenView, IBM's Tivoli, CA's Unicenter, and the like. With that in mind, the OpenNMS team designed it as a network management tool, complete with SNMP hooks and a system-monitoring tool that can measure the availability of critical network services. It also has a configurable, event-driven messaging subsystem that allows you to plug in event streams from other sources, such as vulnerability information from Nessus, tailed log files, and /proc-based monitors. And in good open source fashion (OpenNMS is released under the terms of the GPL), the product was designed to leverage preexisting tools where it made sense. Therefore, the SNMP performance data storage and graphing system uses RRDTool (MRTG anyone?), the Web server/JSP container/servlet engine is Apache's Jakarta Tomcat, and the underlying RDBMS is PostgreSQL.
Built from the ground up in Java, the project has covered a lot of territory in a short period of time. Version 0.4 was the first public release at the end of 2000, with 0.9.6 currently available and 1.0 slated for sometime in April. But enough about releases.
Good question. The simple fact is that the people who rely on your network, most likely, don't actually care about the network structure. They probably do care about the services provided by it, namely Web servers, databases, and email, and perhaps other services that make accessing those possible, such as DNS and DHCP.
As the network administrator, you have different needs. You need a tool that can help manage the network infrastructure as a means of providing access to these network services. OpenNMS' approach is to manage each device as a host of specific services, whether they be simple network connectivity (e.g., ICMP pings) or complex Web transactions. Once installed and minimally configured, OpenNMS will automatically discover everything attached to the network and scan those devices for the services they support (e.g., HTTP on port 80, SMTP on port 25, etc.). And those scans go deeper than a rudimentary port scan, actually exercising the protocol instead of just issuing a socket connect.
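To make the difference between "the port is open" and "the protocol answers" concrete, here is a minimal Python sketch. It is an illustration only, not OpenNMS code: the function names and the three-second default timeout are my own assumptions, and it knows about just two protocols.

```python
import socket

def looks_like_smtp(banner: bytes) -> bool:
    """An SMTP server greets a new connection with a 220 reply line."""
    return banner.startswith(b"220")

def looks_like_http(status_line: bytes) -> bool:
    """An HTTP server answers a request with a status line starting 'HTTP/'."""
    return status_line.startswith(b"HTTP/")

def _recv(sock: socket.socket, size: int = 256) -> bytes:
    """Read up to `size` bytes; treat a timeout or reset as 'nothing received'."""
    try:
        return sock.recv(size)
    except OSError:
        return b""

def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Return 'SMTP', 'HTTP', 'open' (connects, protocol unknown), or 'closed'."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "closed"  # a plain port scan would stop at this point
    with sock:
        # SMTP servers speak first, so wait for a greeting banner.
        if looks_like_smtp(_recv(sock)):
            return "SMTP"
        # HTTP servers wait for the client, so send a minimal request.
        try:
            sock.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
        except OSError:
            return "open"
        if looks_like_http(_recv(sock)):
            return "HTTP"
        return "open"
```

Calling `probe("192.168.0.10", 25)` against a hypothetical mail host would return `"SMTP"` only if something on that port actually greets you like a mail server, which is the whole point of exercising the protocol.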
Once discovered, the services are committed to the database and scheduled to be polled every five minutes by default. This verifies that they are still available. If they don't respond, "critical services" are checked to begin the problem-isolation process. Following that, an appropriate event that reflects the isolated problem is generated. In turn, this event can be configured to create an outage record, invoke some sort of auto-response action (i.e., run a script), create a trouble-ticket, send a page or email notification, and/or eventually end up in OpenNMS' event browser.
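The problem-isolation step can be pictured with a few lines of Python. This is a rough sketch under my own assumptions, not the OpenNMS implementation: ICMP is assumed to be the critical service, and the event labels `interfaceDown` and `serviceDown:<name>` are made up for the example.

```python
def isolate(poll_results: dict, critical: str = "ICMP") -> list:
    """Turn one interface's poll results into the smallest useful set of events.

    poll_results maps a service name to True (responded) or False (no response).
    """
    failed = sorted(name for name, ok in poll_results.items() if not ok)
    if not failed:
        return []
    # If the critical service is down too, blame the interface itself and
    # report a single event instead of one per failed service.
    if critical in failed:
        return ["interfaceDown"]
    # Otherwise the interface is reachable, so each failed service is its own problem.
    return ["serviceDown:" + name for name in failed]
```

With `{"ICMP": False, "HTTP": False, "SMTP": False}` the function yields one interface-level event; with ICMP still answering, it yields one event per failed service.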
But identifying outages is the simple stuff. Once an outage has been determined, what then? OpenNMS implements intelligent behaviors, such as dynamically changing the polling interval based on the duration of an outage. For example, it can poll every 30 seconds during the initial five minutes of an outage, back off to one-minute intervals for the next hour, and then correlate the polls to determine the root cause. In practice, that means it would generate one message that says a network interface went down, instead of five separate messages that say various services on that interface were unavailable. But before we can get any deeper into the workings of OpenNMS, let's talk about installing the package(s).
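The interval schedule in that example boils down to a small lookup. Here is a Python sketch of it; the names are hypothetical, and the fallback to the normal five-minute rate once the hour is up is my assumption, since the example above doesn't say what happens next.

```python
NORMAL_INTERVAL = 300  # seconds; the five-minute default polling rate

# (outage age in seconds below which the step applies, polling interval in seconds)
OUTAGE_STEPS = [
    (5 * 60, 30),            # first five minutes of an outage: every 30 seconds
    (5 * 60 + 60 * 60, 60),  # the following hour: every minute
]

def next_poll_interval(outage_age):
    """Return seconds until the next poll.

    outage_age is None while the service is up, otherwise the number of
    seconds since the outage began.
    """
    if outage_age is None:
        return NORMAL_INTERVAL
    for limit, interval in OUTAGE_STEPS:
        if outage_age < limit:
            return interval
    # Assumption: once the listed steps are exhausted, return to the normal rate.
    return NORMAL_INTERVAL
```

Keeping the steps in a table rather than in nested `if` statements mirrors the idea that this behavior is configuration, not code.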
The easiest way to get this stuff on a machine is to run, as root on your Linux box (which will need a minimum of 256MB of RAM, 50MB of free disk space under /opt, and a 500MHz processor or better), the command:
lynx -source install.opennms.org | sh
And hope like hell you are running an RPM-capable system. Assuming you are, the online installer will check to see what is installed on your system and download the necessary prerequisites, including PostgreSQL, RRDTool, Tomcat, and some other incidentals. If you aren't using Red Hat or Mandrake, or you aren't comfortable running a downloaded script with superuser privileges, there are instructions at the project Web site to help you build from source on your own. We'll leave that as an exercise for the reader.
Once installed, take a look in your /opt/OpenNMS directory. You should find a ./etc directory that includes a boatload of XML-based config files. While you could easily write a book on all of the configuration options, I'll focus on a few key parameters in three files:
First, discovery-configuration.xml controls what TCP/IP addresses will be discovered by the discovery process. The file supports the configuration of address ranges, which are defined with <begin> and <end> tags, and yes, you can configure multiple ranges (for multiple, non-contiguous IP networks), as well as exclude ranges (for eliminating discovery of DHCP ranges, as an example). And fortunately, the product ships with an example configuration that is pretty intuitive, so being the smart person you are, you'll figure this part out.
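If the interplay of include and exclude ranges feels abstract, the logic amounts to the following Python sketch. It models only the decision, not the XML parsing, and the function names and sample addresses are illustrative assumptions rather than anything OpenNMS ships.

```python
from ipaddress import IPv4Address

def in_range(address: str, begin: str, end: str) -> bool:
    """True if address lies between begin and end, inclusive."""
    return IPv4Address(begin) <= IPv4Address(address) <= IPv4Address(end)

def should_discover(address: str, include_ranges, exclude_ranges) -> bool:
    """An address is a discovery candidate if some include range covers it
    and no exclude range does."""
    included = any(in_range(address, b, e) for b, e in include_ranges)
    excluded = any(in_range(address, b, e) for b, e in exclude_ranges)
    return included and not excluded

# Two non-contiguous networks, minus a (hypothetical) DHCP pool.
INCLUDE = [("192.168.0.1", "192.168.0.254"), ("10.0.0.1", "10.0.0.50")]
EXCLUDE = [("192.168.0.100", "192.168.0.150")]
```

Here `should_discover("192.168.0.120", INCLUDE, EXCLUDE)` is false because the address sits inside the excluded DHCP pool, even though its network is included.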
But while one might assume otherwise, the discovery process does nothing more than discover that an interface responded to a ping test. In fact, discovered interfaces are not even added to the database. Instead, the discovery process, when it discovers a new interface, simply generates an event that is picked up by the Capabilities Checking daemon, capsd. In turn, capsd scans the interface for the services it provides, and once a complete profile for that interface has been ascertained, it is then, and only then, committed to the database. Which means to get nodes into the database, you need to be able to manhandle (or womanhandle, as appropriate) the capsd-configuration.xml file.
The configuration file capsd-configuration.xml provides a comprehensive list of all the capsd plug-ins which will be doing the service scanning (we can safely leave these alone for now), as well as the TCP/IP address ranges we are interested in managing. This means that to complete discovery for even a simple environment, you will need to configure your network ranges in at least the discovery-configuration.xml and capsd-configuration.xml files. While a pain, this duality in configuration provides some subtle yet powerful default node-handling options.
Want to find out if new nodes are added to the network without automatically scheduling them for polling? Thought so. In the <ip-management> elements at the end of the file where the ranges are specified, you have the opportunity to set a policy to determine whether interfaces contained in that block will be managed (polled) or unmanaged (not polled). As before, you can specify ranges, as well as specific addresses. Pretty cool, huh? The pain associated with doing this configuration right one time is far outweighed by the fact that you shouldn't ever have to touch this again, or at least not until your network changes drastically or grows significantly.
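Putting the last few paragraphs together, the path from a ping response to a database record looks roughly like the Python sketch below. Treat it as a mental model only: the `Interface` record, the dict standing in for the database, the first-match-wins rule order, and the managed-by-default fallback are all assumptions I've made for the example.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address

@dataclass
class Interface:
    address: str
    services: list
    managed: bool

def is_managed(address: str, rules, default: bool = True) -> bool:
    """rules is a list of (begin, end, managed) tuples; the first match wins."""
    ip = IPv4Address(address)
    for begin, end, managed in rules:
        if IPv4Address(begin) <= ip <= IPv4Address(end):
            return managed
    return default  # assumption: addresses outside every rule are managed

def handle_new_interface(address: str, scan, rules, database: dict) -> Interface:
    """What happens after discovery reports that `address` answered a ping.

    scan is any callable that returns the list of service names found there.
    """
    services = scan(address)  # the service scan runs first
    record = Interface(address, services, is_managed(address, rules))
    database[address] = record  # committed only once the profile is complete
    return record

def pollable(database: dict) -> list:
    """Only managed interfaces are handed to the poller."""
    return sorted(addr for addr, rec in database.items() if rec.managed)
```

Note that an unmanaged interface still lands in the database with its service profile, so you find out that it exists; it just never gets scheduled for polling.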
Each of the <protocol-plugin> elements referenced earlier in that file carries its own complement of attributes and properties to allow you to get in and tune/tweak each plug-in in considerable depth. While it's quite handy, that tuning is well outside the scope of this article. Just know that it's there if you need it.
Finally, we have the poller-configuration.xml file, which controls the polling subsystem (duh). While sporting IP address ranges and file structures similar to the previous two config files we've looked at, this file is notably more complex. And as always, with complexity comes flexibility.
The poller-configuration.xml file consists of a series of <package> elements. These packages are functional groupings of TCP/IP address collections and services you wish to poll, along with specific configurations for those services. For example, if you wanted to monitor the entire 192.168.0.0/24 Class C network with pings to interfaces once every five minutes, but wanted to poll all of the nodes with a fourth octet greater than 200 once every minute (convoluted, yes, but bear with me), you could create the following approximation:
<poller-configuration>
  <package name="FiveMinutePings">
    <include-range begin="192.168.0.1" end="192.168.0.200"/>
    <service name="ICMP" interval="300000">
      <parameter key="retry" value="3"/>
      <parameter key="timeout" value="3000"/>
    </service>
  </package>
  <package name="OneMinutePings">
    <include-range begin="192.168.0.201" end="192.168.0.254"/>
    <service name="ICMP" interval="60000">
      <parameter key="retry" value="3"/>
      <parameter key="timeout" value="3000"/>
    </service>
  </package>
</poller-configuration>
The only thing we've really done is create multiple <package> elements, each with its own address range, and within each <service> element for ICMP, set the interval attribute (the time between polls in milliseconds) to 300000 for the five-minute package and 60000 for the one-minute package. Easy enough. For now, we'll bypass the power of the rules engine (invoked by the <filter> element) and the <outage-calendar> functionality, despite their coolness and power.
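To see how packages map an address to a polling interval, here is a small Python sketch that reads a trimmed copy of the two packages (retry and timeout parameters omitted, both ranges kept inside 192.168.0.0/24 as the prose describes) and looks up the interval for a given address. The lookup function is hypothetical, and returning the first matching package is my assumption, not documented OpenNMS behavior.

```python
import xml.etree.ElementTree as ET
from ipaddress import IPv4Address

POLLER_XML = """
<poller-configuration>
  <package name="FiveMinutePings">
    <include-range begin="192.168.0.1" end="192.168.0.200"/>
    <service name="ICMP" interval="300000"/>
  </package>
  <package name="OneMinutePings">
    <include-range begin="192.168.0.201" end="192.168.0.254"/>
    <service name="ICMP" interval="60000"/>
  </package>
</poller-configuration>
"""

def polling_interval_ms(xml_text: str, address: str, service: str):
    """Return (package name, interval in ms) for the first package whose
    include-range covers the address and which polls the service, else None."""
    ip = IPv4Address(address)
    root = ET.fromstring(xml_text)
    for package in root.findall("package"):
        in_package = any(
            IPv4Address(r.get("begin")) <= ip <= IPv4Address(r.get("end"))
            for r in package.findall("include-range")
        )
        if not in_package:
            continue
        for svc in package.findall("service"):
            if svc.get("name") == service:
                return package.get("name"), int(svc.get("interval"))
    return None
```

An address such as 192.168.0.50 resolves to the five-minute package, 192.168.0.230 to the one-minute package, and anything outside both ranges to `None`, meaning nothing polls it.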
At this point, you should have your bearings enough to get a rudimentary configuration in place and to let discovery, capsd, and the polling subsystem do their thing for you. The next part is simple: View the results.
Assuming you've gotten everything installed already, you can start up OpenNMS with:
And once it returns for you, you can point a browser at:
You should find things beginning to populate in 5-10 minutes. This includes seeing nodes starting to be populated into categories on the main panel, events being displayed in the events browser, and perhaps even outages appearing in the "Nodes with Outages" table.
Yes, there is a degree of voodoo involved in getting the user interface configured for your environment, and we haven't even talked about SNMP yet, but the good news is that there's time for that later. In the interim, you will be able to start basic monitoring of your environment for service availability. And you can rest easy knowing that OpenNMS ships with sensible defaults and a lot of preconfiguration, so you may never need to touch many of the things you haven't yet gotten around to.
Shane O'Donnell serves as OpenNMS project manager and chief architect, drawing on extensive experience in the network management industry. He holds an M.S. in something vaguely relevant like Computer Science or something.
Copyright © 2009 O'Reilly Media, Inc.