Profiling LAMP Applications with Apache's Blackbox Logs
by Chris Josephes
Most people use the Apache web server for web services because it can scale to match growth. It works well on both low- and high-end server hardware, and it has an extensible API for adding new modules. Also, since HTTP is a stateless protocol, you can add more web servers to handle the same content.
If your CGI scripts are taking too long, you can modify your code to turn your Apache server into an application server that runs the scripts in the background. If you want redundancy, you can put multiple servers in a load-balanced pool. If your outgoing bandwidth costs are too high, you can use a distributed caching service like Akamai or Mirror Image.
But by adding more components to your web server environment, you increase the chance that something will go wrong. When something does go wrong, the complexity makes the problem analysis more difficult.
In almost all cases, when an Apache web server performs outside of its normal specifications, a good administrator will check the web server log. This follows from the most common theory: that a web server's slow performance is due to the server being hammered with requests. Maybe a web designer is now serving a five-megabyte MPEG file, or maybe a worm is running a distributed denial-of-service attack against your site. In situations like these, checking the log file is a good idea.
My current work assignment puts me in a team of administrators handling roughly 20 web servers. The servers are load-balanced and hand out both static and dynamic content. We proxy almost everything through a distributed caching service to reduce our bandwidth costs and to improve client load times.
It isn't foolproof. We see balanced traffic 99.9% of the time, but sometimes a group of cache servers fails, or new content goes live without properly set Expires headers. Sometime during my first month, I experienced my first unexpected traffic surge, and my boss asked me to see if I could find anything in the log files.
Log File Analysis
When I started my analysis, I assumed that most of the content would be cached. I should have been seeing a small number of hits from unique IP addresses during the allowed cache lifetime of each object. It was also possible that a cache server would expire an object early if it ran out of disk space, or would try to reload an expired object that had not actually changed. With that in mind, I whipped up a couple of log-file parsing scripts.
Just to review, here is a tiny clip of a web server log using the common log file format.
10.0.1.51 - - [15/Feb/2004:14:47:33 -0600] "GET / HTTP/1.1" 200 2401
10.0.1.22 - - [15/Feb/2004:14:49:07 -0600] "GET /manual/ HTTP/1.1" 200 701
10.0.1.51 - - [15/Feb/2004:14:49:11 -0600] "GET /robots.txt HTTP/1.1" 200 300
The most obvious information you can get from each entry is the source IP of the request, the request string, and the time of the request. With a parsing script, you can also find the size of the largest file served, the highest number of requests that came in during one single second, a breakdown of all the source IPs, and the ratio of 200 to 304 hits.
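A parsing script along these lines can be quite short. The following is a minimal sketch (mine, not the article's actual tool) that pulls a few of the metrics mentioned above out of a common-format log: hits per source IP, a status-code breakdown, and the largest response served.

```python
import re
from collections import Counter

# Matches the common log format: host, identity, user, timestamp,
# request line, status code, and response size.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def summarize(lines):
    """Summarize an access log: per-IP hit counts, a status-code
    breakdown, and the size of the largest response served."""
    hits = Counter()      # requests per source IP
    statuses = Counter()  # responses per status code (e.g. 200 vs. 304)
    largest = 0           # largest response body, in bytes
    for line in lines:
        m = LOG_PATTERN.match(line)
        if not m:
            continue  # skip malformed entries
        hits[m.group('host')] += 1
        statuses[m.group('status')] += 1
        if m.group('size') != '-':
            largest = max(largest, int(m.group('size')))
    return hits, statuses, largest
```

Run against the three sample entries above, this reports two hits from 10.0.1.51, one from 10.0.1.22, three 200 responses, and 2401 bytes as the largest object. Bucketing hits by the timestamp field would similarly give the peak requests-per-second figure.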
After some checking I discovered that a few objects had a much higher hit ratio than others. Those hits all had unique IP addresses, and none of those addresses had hit other URLs handled by our server. My conclusion was that the cache servers were being bypassed for a few specific objects, which would explain the high traffic load. Later, we discovered a group of URLs was posted to a public web site that incorrectly pointed directly to us.
Since that incident, I have worked on more tools for log-file hit analysis. I took older log files and ran them through more parsing scripts to find out as much as I could about our traffic patterns.
Unfortunately, I learned that there was performance data I would never be able to find in the log file. I couldn't see whether web clients were using keep-alives, which child process handled each request, or how many different clients were coming from the same source IP. While the common log-file format (and the combined format) are great for hit tracking, they aren't suitable for gathering hardcore performance data.
Fortunately, the Apache web server is known for being incredibly flexible. The mod_log_config module makes it easy to put together a log file format that will give you the data you need. The Apache manual lists all of the logging directives you can use to build your own custom format.
The Blackbox Format
In this article I will discuss a format that uses the best logging directives Apache offers to create a running record of what the server is doing as incoming clients access it. The format is called Blackbox.
A black box is a device that records information on the performance of an aircraft. Airspeed, direction, and status indicators combine to give an accurate account of what happened during a flight. If the aircraft develops problems (or, in the worst case, crashes), technicians can use the data in the black box to find out what went on within the aircraft.
This log format is meant to supplement the common access log file, not to replace it. Since Apache can handle multiple log files, administrators can have Apache write to a second log file by adding two configuration directives.
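As an illustration of the two-directive setup, here is a hedged sketch: the LogFormat string and the file names below are my own examples using standard mod_log_config escapes, not the article's actual Blackbox definition.

```apache
# Keep the standard common-format access log in place.
CustomLog logs/access_log common

# Define a second, performance-oriented format and write it to its own file.
# %h = client host, %t = time, %r = request line, %>s = final status,
# %B = response bytes, %T = seconds taken to serve the request.
# This exact format string is illustrative, not the Blackbox format itself.
LogFormat "%h %t \"%r\" %>s %B %T" perflog
CustomLog logs/blackbox_log perflog
```

Because each CustomLog directive is independent, the familiar access log keeps feeding existing hit-analysis tools while the second file accumulates the extra data.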