ONLamp.com



Eleven Metrics to Monitor for a Happy and Healthy Squid

10. Free Disk Space

Disk space is another finite resource consumed by Squid. When you run Squid on a dedicated system, controlling the disk usage is relatively easy. If you have other applications using the same partitions as Squid, you need to be a little more careful. We need to worry about disk space for two reasons: the disk cache and Squid's log files.



If Squid gets a "no space left on device" error while writing to the disk cache, it resets the cache size and keeps going. In other words, this is a non-fatal error. The new cache size is set to what Squid believes is the current size, which also causes Squid to start removing existing objects to make room for new ones. Running out of space when writing a logfile, however, is a fatal error. The Squid process exits rather than continuing to operate without the ability to log important information.

Free disk space information is only available through the cache manager. Furthermore, Squid only tells you about the cache_dir directories. It won't tell you about the status of the partition where you store your log files (unless that partition is also a cache directory). Thus, you may want to develop your own simple script to monitor free space on your logging partition.
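Here is a minimal Python sketch of such a check. It uses the standard library's shutil.disk_usage(), which on Unix is built on statvfs(). The log directory path and the 90 percent threshold are assumptions; substitute your own logging partition and limit.

```python
import shutil

# Assumed values: set LOG_DIR to the directory holding Squid's log files
# and choose your own alert threshold. Neither value comes from Squid.
LOG_DIR = "/usr/local/squid/var/logs"
THRESHOLD_PCT = 90.0


def percent_used(path):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    # Like df, divide by used + free, where free is the space available to
    # unprivileged users, so blocks reserved for root are left out.
    return 100.0 * usage.used / (usage.used + usage.free)


def check(path=LOG_DIR, threshold=THRESHOLD_PCT):
    """Return (percent_used, ok); ok becomes False once usage reaches threshold."""
    pct = percent_used(path)
    return pct, pct < threshold
```

A cron job could call check() every few minutes and send an alert (or exit non-zero for a monitoring agent) whenever ok is False.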

The storedir cache manager page has a section like this for each cache directory:

Store Directory #0 (diskd): /cache0/Cache
FS Block Size 1024 Bytes
First level subdirectories: 16
Second level subdirectories: 64
Maximum Size: 15360000 KB
Current Size: 13823540 KB
Percent Used: 90.00%
Filemap bits in use: 774113 of 2097152 (37%)
Filesystem Space in use: 14019955/17370434 KB (81%)
Filesystem Inodes in use: 774981/4340990 (18%)
Flags:
Pending operations: 0
Removal policy: lru
LRU reference age: 22.46 days

We are particularly interested in two lines: the "Percent Used" and "Filesystem Space in use" lines.

The "Percent Used" line shows how much space Squid has used, compared to the size you specified on the cache_dir line. This will normally be equal to, or less than, the value for cache_swap_low.
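As a worked check against the sample output above, "Percent Used" is consistent with Current Size divided by Maximum Size (an inference from the numbers shown, not something the storedir page states):

```latex
\text{Percent Used} = \frac{\text{Current Size}}{\text{Maximum Size}}
  = \frac{13\,823\,540~\text{KB}}{15\,360\,000~\text{KB}} \approx 0.89997 \approx 90.00\%
```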

The "Filesystem Space in use" line shows how much space is actually used on this partition. Squid gets this information from the statvfs() system call, so it should match what you see when you run df from your shell. This is the important value to watch: if it reaches 100 percent, Squid will receive "no space left on device" errors.

11. Hit Ratio

Cache hit ratio is another metric that can vary a lot from time to time. Its high variability means that it is not always a good indicator of a problem. A sudden drop in hit ratio might mean that one of the cache clients is a crawler, or some other client that adds no-cache directives to its requests. Perhaps the best reason to monitor it is simply to understand how many requests are served directly from the cache (in case the boss asks you to justify Squid's existence).

You can get the hit ratio, calculated over the last five minutes, by requesting this SNMP OID:

enterprises.nlanr.squid.cachePerf.cacheProtoStats.cacheMedianSvcTable.cacheMedianSvcEntry.cacheRequestHitRatio.5

The same information is available on the cache manager "info" page:

# squidclient mgr:info | grep 'Request Hit Ratios'
Request Hit Ratios:     5min: 29.8%, 60min: 44.1%
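If you scrape the info page rather than using SNMP, a small parser can extract both intervals. Because a single low reading means little on its own, this sketch compares the 5-minute value with the 60-minute value and reports a drop only when the gap exceeds a threshold. The function names and the 15-point default are illustrative assumptions.

```python
import re

# Matches the "Request Hit Ratios" line from squidclient mgr:info.
HIT_RE = re.compile(r"Request Hit Ratios:\s*5min:\s*([\d.]+)%,\s*60min:\s*([\d.]+)%")


def parse_hit_ratios(text):
    """Return (five_min, sixty_min) percentages from mgr:info output, or None."""
    m = HIT_RE.search(text)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2))


def sudden_drop(five_min, sixty_min, max_drop=15.0):
    """True when the 5-minute ratio is more than max_drop points below the hourly one."""
    return (sixty_min - five_min) > max_drop
```

With the sample line above, the gap is about 14.3 points, so the default threshold does not fire.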

My Squid-rrd Monitoring Utility

For better or worse, the cache manager currently provides more useful information than Squid's SNMP implementation. However, the cache manager output was designed to be human-readable. Writing a bunch of software to grep for all of the relevant information and extract the values would be tedious, especially since I have already done it for you.

I have a Perl script, recently enhanced by Dan Kogai, that issues cache manager requests and stores the values in an RRD database. If you don't know about RRDtool yet, you should. It is Tobi Oetiker's successor to MRTG. It's very cool.

My Perl script runs periodically from cron. It makes cache manager requests and uses regular expressions to parse the output for certain metrics. The extracted values are stored in various RRD files. I also provide a template CGI script that displays the RRD data.
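The following is not the squid-rrd code; it is a minimal Python sketch of the same fetch, parse, and store cycle. It assumes squidclient and rrdtool are on the PATH and that the RRD files already exist with data sources matching each regex. The file names hit_ratio.rrd and cache0_fs.rrd are hypothetical. It relies on rrdtool update accepting values in the form N:value:value, where N means "now".

```python
import re
import subprocess

# Hypothetical metric table: (cache manager page, RRD file, regex). Each capture
# group becomes one data source, in the order the RRD file was created with.
# The regexes match the sample output shown in this article.
METRICS = [
    ("info", "hit_ratio.rrd",
     re.compile(r"Request Hit Ratios:\s*5min:\s*([\d.]+)%,\s*60min:\s*([\d.]+)%")),
    # Only the first cache_dir section of the storedir page is matched here.
    ("storedir", "cache0_fs.rrd",
     re.compile(r"Filesystem Space in use:\s*(\d+)/(\d+) KB")),
]


def fetch(page):
    """Fetch one cache manager page with squidclient (e.g. mgr:info)."""
    result = subprocess.run(["squidclient", "mgr:" + page],
                            capture_output=True, text=True, check=True)
    return result.stdout


def build_updates(pages):
    """Turn {page: text} into rrdtool update argument lists."""
    cmds = []
    for page, rrd_file, pattern in METRICS:
        m = pattern.search(pages.get(page, ""))
        if m is None:
            continue  # skip rather than store a bogus value
        # "N" tells rrdtool to timestamp the update with the current time.
        cmds.append(["rrdtool", "update", rrd_file, "N:" + ":".join(m.groups())])
    return cmds


def collect():
    """One cron run: fetch each page once, then push values into the RRD files."""
    pages = {page: fetch(page) for page in {p for p, _, _ in METRICS}}
    for cmd in build_updates(pages):
        subprocess.run(cmd, check=True)
```

A cron entry could call collect() every five minutes. Keeping the parsing (build_updates) separate from the I/O (fetch, collect) makes the regular expressions easy to test against saved cache manager output.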

You can find my code and documentation at www.squid-cache.org/~wessels/squid-rrd. I've included some of the graphs below. You can view more graphs (and look at the full-size versions of the ones below) by visiting my stats page for the IRCache proxies at www.ircache.net/Cache/Statistics/Vitals/rrd/cgi.

Figure 1
These two graphs show memory usage and page-fault rate for a one-month period. You can clearly see when Squid was restarted because the memory usage goes down. It slowly climbs back up as Squid runs. You can also see that the page-fault rate increases as the memory consumption increases.

Figure 2
These five graphs show various metrics for a 24-hour period. You can see that an increase in load causes corresponding increases in CPU usage, file descriptor usage, and, to some extent, response times. The file descriptor graph shows a brief spike during the late evening hours.

Duane Wessels discovered Unix and the Internet as an undergraduate student studying physics at Washington State University.


O'Reilly & Associates published Squid: The Definitive Guide in January 2004.




