Eleven Metrics to Monitor for a Happy and Healthy Squid
by Duane Wessels, author of Squid: The Definitive Guide
03/25/2004
In this article, I'll show you how to stay on top of Squid's performance. If you follow this advice, you should be able to discover problems before your users begin calling you to complain.
Squid provides two interfaces for monitoring its operation: SNMP and the cache manager. Each has its own set of advantages and shortcomings.
SNMP is nice because it is familiar to many of us. If you already
have some SNMP software deployed in your network, you may be able
to easily add Squid to the other services that you already monitor.
Squid's SNMP implementation is disabled by default at compile time.
To use SNMP, you must pass the --enable-snmp option to
./configure like this:
./configure --enable-snmp ...
The downside to SNMP is that you can't use it to monitor all of the metrics that I talk about in this article. Squid's MIB has remained almost unchanged since it was first written in 1997. Some of the things that you should monitor are only available through the cache manager interface.
The cache manager is a set of "pages" that you can request
from Squid with a special URL syntax. You can also use
Squid's cachemgr.cgi utility to view the information through
a web browser. As you'll see in the examples, it is a little
awkward to use the cache manager for periodic data collection.
I have a solution to this problem, which I'll describe at the end
of the article.
1. Process Size
Squid's process size has a direct impact on performance. If the process becomes too large to fit entirely in memory, your operating system swaps portions of it to disk. Performance then degrades quickly, and you'll see response times increase. Squid's process size can be difficult to control at times. It depends on the number of objects in your cache, the number of simultaneous users, and the types of objects that they download.
Squid has four ways to determine its process size. One or more
of them may not be supported on your particular operating system.
They are: getrusage(), mallinfo(),
mstats(), and sbrk().
The getrusage() function reports the "Maximum Resident Set Size" (Max RSS).
This is the largest amount of physical memory that the process
has ever occupied. This is not always the best metric, because
once the process grows larger than your memory's capacity,
the Max RSS value stops increasing. In other words, Max RSS
never exceeds your physical memory size, no matter how big
the Squid process becomes.
The mallinfo() and mstats() functions are features of some
malloc (memory allocation) libraries. They are a good indication
of process size, when available. The mstats() function is
unique to the GNU malloc library.
The sbrk() function also provides a good indication of process
size and seems to work on most operating systems.
Unfortunately, the only metric available as an SNMP object
is the getrusage() Max RSS value. You can get it with
this OID under the Squid MIB:
enterprises.nlanr.squid.cachePerf.cacheSysPerf.cacheMaxResSize
To get the other process size metrics, you'll need to use the cache manager. Request the "info" page and look for these lines:
# squidclient mgr:info | less
...
Process Data Segment Size via sbrk(): 959398 KB
Maximum Resident Size: 924516 KB
...
Total space in arena: 959392 KB
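If you want a script to act on these numbers, the first step is to pull them out of the info page. Here is a minimal Python sketch, assuming the page text has already been captured (for example, from the squidclient command above). The function name parse_sizes_kb and the 1 GB limit are illustrative assumptions, not part of Squid.

```python
import re

# Matches info-page lines of the form "Some Label: 12345 KB".
SIZE_LINE = re.compile(r"^\s*(?P<label>[^:]+):\s+(?P<kb>\d+)\s+KB\s*$")


def parse_sizes_kb(info_text):
    """Return {label: size in KB} for every 'label: N KB' line."""
    sizes = {}
    for line in info_text.splitlines():
        match = SIZE_LINE.match(line)
        if match:
            sizes[match.group("label").strip()] = int(match.group("kb"))
    return sizes


# The three lines shown above, as captured text.
SAMPLE = """\
Process Data Segment Size via sbrk(): 959398 KB
Maximum Resident Size: 924516 KB
Total space in arena: 959392 KB
"""

sizes = parse_sizes_kb(SAMPLE)

# Hypothetical limit: assume the machine has 1 GB of physical memory.
LIMIT_KB = 1024 * 1024
too_big = [label for label, kb in sizes.items() if kb > LIMIT_KB]
```

With the sample values, too_big is empty because every figure is still under the assumed 1 GB limit. Substitute your own system's memory size for LIMIT_KB.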
You can also use the high_memory_warning directive in
squid.conf to warn you if the process size exceeds a limit that you
specify. For example:
high_memory_warning 500
2. Page Fault Rate
As I mentioned in the discussion about memory usage, Squid's performance suffers when the process size exceeds your system's physical memory capacity. A good way to detect this is to monitor the process's page-fault rate.
A page fault occurs when the program needs to access an area of memory that was swapped to disk. Page faults are blocking operations. That is, the process pauses until the memory area has been read back from disk. Until then, Squid cannot do any useful work. A low page-fault rate, say, less than one per second, may not be noticeable. However, as the rate increases, client requests take longer and longer to complete.
When using SNMP, Squid only reports the page-fault counter, rather
than the rate. The counter is an ever-increasing value reported
by the getrusage() function. You can calculate the rate by comparing
values taken at different times. Programs such as RRDTool and MRTG
do this automatically. You can get the page fault count by requesting
this SNMP OID:
enterprises.nlanr.squid.cachePerf.cacheSysPerf.cacheSysPageFaults
Alternatively, you can get it from the cache manager's info page:
# squidclient mgr:info | grep 'Page faults'
Page faults with physical i/o: 2712
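If you collect the counter yourself rather than through RRDTool or MRTG, you have to do the rate calculation by hand. The sketch below shows one way to do it in Python. The function name counter_rate and the sample values are hypothetical. It assumes that a counter that goes backwards means Squid was restarted, since the getrusage() counter belongs to the running process.

```python
def counter_rate(prev_count, prev_time, curr_count, curr_time):
    """Return events per second between two samples, or None if unusable."""
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        # Samples are out of order or were taken at the same instant.
        return None
    delta = curr_count - prev_count
    if delta < 0:
        # The counter went backwards; assume a restart and skip this interval.
        return None
    return delta / elapsed


# Hypothetical samples taken 300 seconds apart.
rate = counter_rate(2712, 0.0, 2756, 300.0)

# A restart between samples yields no rate for that interval.
after_restart = counter_rate(2756, 300.0, 12, 600.0)
```

Returning None instead of a negative number keeps a restart from showing up as a bogus dip in a graph.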
You can also get the rate, calculated over five- and 60-minute intervals, by requesting other cache manager pages:
# squidclient mgr:5min | grep page_fault
page_faults = 0.146658/sec
# squidclient mgr:60min | grep page_fault
page_faults = 0.041663/sec
The high_page_fault_warning directive in squid.conf will warn you
if Squid detects a high page fault rate. You specify a limit on
the mean page-fault rate, measured over a one-minute interval. For
example:
high_page_fault_warning 10
3. HTTP Request Rate
The HTTP request rate is a simple metric. It is the rate of requests made by clients to Squid. A quick glance at a graph of request rate versus time can help answer a number of questions. For example, if you notice that Squid suddenly seems slow, you can determine whether or not it is due to an increase in load. If the request rate seems normal, then the slowness must be due to something else.
Once you get to know what your daily load pattern looks like, you can easily identify strange events that may warrant further investigation. For example, a sudden drop in load may indicate some sort of network outage, or perhaps disgruntled users who have figured out how to bypass Squid. Similarly, a sudden increase in load might mean that one or more of your users has installed a web crawler or has been infected with a virus.
As with page faults, SNMP gives you only the HTTP request counter, not the rate. Use this OID:
enterprises.nlanr.squid.cachePerf.cacheProtoStats.cacheProtoAggregateStats.cacheProtoClientHttpRequests
The cache manager reports this information in a variety of ways:
# squidclient mgr:info | grep 'Number of HTTP requests'
Number of HTTP requests received: 535805
# squidclient mgr:info | grep 'Average HTTP requests'
Average HTTP requests per minute since start: 108.4
# squidclient mgr:5min | grep 'client_http.requests'
client_http.requests = 3.002991/sec
# squidclient mgr:60min | grep 'client_http.requests'
client_http.requests = 2.636987/sec
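Instead of running a separate grep for each metric, you can parse a whole 5min or 60min page in one pass. The following Python sketch assumes the "name = value/sec" line format shown in the examples above and skips any line that does not match it. The function name parse_rates is an illustrative assumption.

```python
def parse_rates(page_text):
    """Return {name: rate per second} for lines like 'name = 1.5/sec'."""
    rates = {}
    suffix = "/sec"
    for line in page_text.splitlines():
        name, sep, value = line.partition("=")
        value = value.strip()
        if not sep or not value.endswith(suffix):
            continue  # not a per-second rate line
        try:
            rates[name.strip()] = float(value[:-len(suffix)])
        except ValueError:
            continue  # value was not a plain number
    return rates


# Lines taken from the examples in this article.
SAMPLE = """\
client_http.requests = 3.002991/sec
page_faults = 0.146658/sec
"""

rates = parse_rates(SAMPLE)

# Convert to requests per minute, for comparison with the
# "Average HTTP requests per minute since start" figure.
requests_per_minute = rates["client_http.requests"] * 60
```

Here the five-minute rate works out to roughly 180 requests per minute, noticeably higher than the 108.4 per-minute average since startup shown above.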
4. ICP Request Rate
If you have neighbor caches using ICP, you'll probably want to monitor the ICP request rate as well. While there aren't any significant performance issues related to ICP queries, this will at least tell you if neighbor caches are up and running.
To track ICP traffic via SNMP, use this counter OID:
enterprises.nlanr.squid.cachePerf.cacheProtoStats.cacheProtoAggregateStats.cacheIcpPktsRecv
Note that the SNMP counter includes both queries and responses that your Squid cache receives. There is no SNMP object that will give you only the queries. You can get only received queries from the cache manager, however. For example:
# squidclient mgr:counters | grep icp.queries_recv
icp.queries_recv = 8595602




