ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)
 See this if you're having trouble printing code examples


Profiling LAMP Applications with Apache's Blackbox Logs

by Chris Josephes
04/22/2004

Most people use the Apache web server for web services because it can scale to match growth. It works well on low- and high-end server hardware and it has an extensible API for adding new modules. Also, since HTTP is a stateless protocol, you can add more web servers to handle the same content.

If your CGI scripts are taking too long, you can modify your code to turn your Apache server into an application server that runs the scripts in the background. If you want redundancy, you can put multiple servers in a load-balanced pool. If your outgoing bandwidth costs are too high, you can use a distributed caching service like Akamai or Mirror Image.

But by adding more components to your web server environment, you increase the chance that something will go wrong. When something does go wrong, the complexity makes the problem analysis more difficult.

In almost all cases when an Apache web server is performing outside of normal specifications, a good administrator will check the web server log. This comes from the most common theory that a web server's slow performance is due to the server being hammered with requests. Maybe a web designer is now serving a five megabyte MPEG file, or maybe there is a worm out there that is running a distributed DOS attack against your site. In these types of situations, checking the log file is a good thing.

My current work assignment puts me in a team of administrators handling roughly 20 web servers. The servers are load-balanced and hand out both static and dynamic content. We proxy almost everything through a distributed caching service to reduce our bandwidth costs and to improve client load times.

It isn't foolproof. We usually see balanced traffic 99.9% of the time, but sometimes a group of cache servers will fail, or new content may be added that does not have properly set expires headers. Sometime during my first month, I experienced my first unexpected traffic surge, and my boss asked me to see if I could find anything in the log files.

Log File Analysis

When I started my analysis, I made the assumption that most of the content will be cached. I should be seeing a lot of hits from unique IP addresses during the allowed cache lifetime of each object. It is also possible that the cache server will expire the object itself if it ran out of disk space, or the cache tries to reload an expired object that may not have changed. With that in mind, I whipped up a couple of log-file parsing scripts.

Just to review, here is a tiny clip of a web server log using the common log file format.

10.0.1.51 - - [15/Feb/2004:14:47:33 -0600] "GET / HTTP/1.1" 200 2401
10.0.1.22 - - [15/Feb/2004:14:49:07 -0600] "GET /manual/ HTTP/1.1" 200 701
10.0.1.51 - - [15/Feb/2004:14:49:11 -0600] "GET /robots.txt HTTP/1.1" 200 300

The most obvious information you can get from each entry is the source IP of the request, the request string, and the time of the request. With a parsing script, you can also find the size of the largest file served, the highest number of requests that came in during one single second, a breakdown of all the source IPs, and the ratio of 200 to 304 hits.

Related Reading

Apache Cookbook
By Ken Coar, Rich Bowen

After some checking I discovered that a few objects had a much higher hit ratio than others. Those hits all had unique IP addresses, and none of those addresses had hit other URLs handled by our server. My conclusion was that the cache servers were being bypassed for a few specific objects, which would explain the high traffic load. Later, we discovered a group of URLs was posted to a public web site that incorrectly pointed directly to us.

Since that incident, I worked on more tools to handle log-file hit analysis. I took older log files and ran them through more parsing scripts to find out as much as I could about our traffic patterns.

Unfortunately, I learned that there was performance data out there that I would never be able to find in the log file. I couldn't see whether keep-alives were being used by the web clients, which child processes handled which request, and I couldn't determine how many different clients were coming from the same source IP. While the common log-file format (and the combined format) are great for hit tracking, they aren't suitable for getting hardcore performance data.

Fortunately, the Apache web server is known for being incredibly flexible. The mod_log_config module makes it easy to put together a log file format that will give you the data you need. The Apache manual lists all of the logging directives you can use to build your own custom format.

The Blackbox Format

In this article I will discuss a format that uses the best logging directives of Apache to create a running record of what Apache is doing when incoming clients access the server. The name of the format is called Blackbox.

A black box is a device that records information on the performance of an aircraft. Airspeed, direction, and status indicators combine to give an accurate account of what happened during a flight. If the aircraft develops problems (or in the worse case, crashes) technicians can use the data in the black box to find out what went on within the aircraft.

This log format is meant to supplement the common access log file, not to replace it. Since Apache can handle multiple log files, administrators can have Apache write to the second log file by adding two configuration directives.

What to Log

The directives listed below are most compatible with Apache 2.0 with the mod_logio module. Later in the article, I will discuss cross-compatibility with Apache 1.3 and other server environments.

Source IP, Time, and Request Line

%a, %t, and %r

These directives are already used in the common log file format. They are the three most obvious request metrics to track.

When logging the remote host, it is important to log the client IP address, not the hostname. To do this, use the %a directive instead of %h. Even if HostnameLookups are turned on, using %a will only record the IP. For the purposes of the Blackbox format, reverse DNS should not be trusted.

The %t directive records the time that the request started. It could be modified using a strftime format, but it would be better to keep it as is. That makes it easier to correlate lines between the Blackbox log file and the common log file.

The %r directive is the first line of text sent by the web client, which includes the request method, the full URL, and the HTTP protocol. It is possible to break up this data using individual directives. For example, you could log a URL without a query string. Again, it's better to keep the request line intact for comparison.

Process id and Thread id

%{pid}P and %{tid}P

When the Apache server starts, it spawns off child processes to handle incoming requests. As it runs, it shuts down older processes and adds new ones. Apache can add additional child processes if it needs to keep up with a high demand. By recording the process id and thread id (if applicable), you will have a record of which child process handled an incoming client.

You can also track the number of Apache processes for a given time and determine when a child process shut down. If you are running an application handler (mod_perl, mod_python), recording the PID will make it easier to find out what hits a child process was handling when debugging an application error.

Connection Status

%X

The connection status directive tells us detailed information about the client connection. It returns one of three flags: X if the client aborted the connection before completion, + if the client has indicated that it will use keep-alives (and request additional URLs), or - if the connection will be closed after the request.

Keep-Alive is an HTTP 1.1 directive that informs a web server that a client can request multiple files during the same connection. This way a client doesn't need to go through the overhead of re-establishing a TCP connection to retrieve a new file.

For Apache 1.3, use the %c directive in place of the %X directive.

Status Codes

%s and %>s

There's nothing really new about this directive, since it's already used in the common log file format. The CLF records the status code — after any redirections take place — with %>s. For the Blackbox format, we will want to record the status code before and after the redirection took place.

Time to Serve Request

%T and %D

The common log file format cannot accurately determine the amount of time it takes to serve a file. Some parsing programs will try to make estimates based on the timestamp on hits from the same source, but it is very unreliable, especially if the hits are being made in parallel.

These two directives will give you the exact metrics you need. The %T directive will report the time in seconds it took to handle the request while the %D directive will report the same time in microseconds.

Apache 1.3 does not support the %D directive.

Bytes Sent and Received

%I, %O, and %B

Apache 2.0 includes the optional mod_logio module which can report on how many bytes of traffic a client sent and how many bytes the server returned.

The %b directive does a good job, but it only reports the bytes returned in the requested object, excluding the bytes from the HTTP headers. The header traffic is usually small, but you may want to record it to get a better idea of outgoing TCP traffic for a given interface can be like. Recording the incoming bytes is helpful when your users are uploading files with the PUT or POST methods.

Use %I to record incoming bytes, %O to record outgoing bytes, and %B to record outgoing content bytes. In cases where no content is returned, the %B directive returns a zero, whereas %b returns a dash. Since we're dealing with integer values, it's better to use %B.

Client Source Port

%{REMOTE_PORT}e or %S

Logging the client source TCP port can provide some useful network data and can help us associate a single client with multiple requests.

If two clients from the same IP address make simultaneous connections, the common log file format cannot distinguish between those clients. Otherwise, if the client uses keep-alives, then every hit made from a single TCP session will be associated by the same client port number.

The port information can indicate how many connections our server is handling at once, which may help in tuning server TCP/IP settings. It will also identify which client ports are legitimate requests if the administrator is examining a possible SYN-attack against a server.

This is the part where Apache doesn't really help you out. The only native way to record the source port is by using %{REMOTE_PORT}e, which grabs the value of the REMOTE_PORT environment variable, but environment variables only come into play when mod_cgi or mod_cgid is handling the request.

The only way to record the source port for every request is by modifying the source code of mod_log_config.c or by creating a custom logging module. In both cases, a new format directive could then record the source port of the client connection.

If you want to record the remote port, here's the source code modification for Apache 2.0. All of the changes are made in modules/loggers/mod_log_config.c in the source distribution.

Find the function definition for log_remote_address. Add the following function after it:

/*
 * log_remote_port patch
 */

static const char *log_remote_port(request_rec *r, char *a)
{
        apr_port_t rport;
        apr_sockaddr_port_get(&rport, r->connection->remote_addr);
        return apr_itoa(r->pool, rport);
}

In layman's terms, the function sets up a variable to hold the port number, which we obtain from a sockaddr_in structure, and then convert it into a string to return.

Next, find a function called log_pre_config at the end of the file. We need to insert two lines into that function to register the function we just wrote.

static int log_pre_config(apr_pool_t *p, apr_pool_t *plog, apr_pool_t *ptemp)
{
        .....
        log_pfn_register(p, "s", log_status, 1);
        /* log_remote_port patch */
        log_pfn_register(p, "S", log_remote_port, 0);
    }
    return OK
}

Save the code changes, then recompile Apache.

What Not to Log

At this point there are 14 fields in our log file format, which is getting pretty close to the limits of readability. Before considering making the format bigger, let's consider what doesn't need to be logged.

A lot of data in the combined log file format is client supplied, which means our log analysis depends on a client being truthful. The HTTP username, ident username, HTTP referrer, and user agent strings all come from data sent from the client. For example, some robots pretend that they are really a web browser, so you can't tell when your site is being spidered. It's difficult to verify this data, so it's best to ignore it.

Keep logging the data in the combined log file format. If you need to refer to the data at a later time, it shouldn't be too difficult to match a Blackbox log entry to a combined log entry.

Putting it Together

The following is the configuration syntax for the Blackbox log format. It's designed to group similar directives together and still be readable when using the tail command.

Apache 2.0 Configuration

# Blackbox log format
<IfModule mod_logio.c>
LogFormat "%a/%S %X %t \"%r\" %s/%>s %{pid}P/%{tid}P %T/%D %I/%O/%B" blackbox
</IfModule>

<IfModule !mod_logio.c>
LogFormat "%a/%S %X %t \"%r\" %s/%>s %{pid}P/%{tid}P %T/%D 0/0/%B" blackbox
</IfModule>

CustomLog /var/opt/apache2/logs/blackbox blackbox

Apache 1.3 Configuration

# Blackbox log format
LogFormat "%a/%S %c %t \"%r\" %s/%>s %P/0 %T/0 0/0/%B" blackbox
CustomLog /var/opt/apache/logs/blackbox blackbox

If you decide not to use the remote port logging patch, replace %S with 0, otherwise Apache will refuse to run.

The formats above include all of the directives to log and still provide a cross-compatibility between the Apache 1.3 and 2.0 server releases. Even though 1.3 does not support threading or newer 2.0 logging directives, any parsing script will work with both versions.

Once the directives are in place, restart Apache. You will need to modify whatever scripts you use for log file rotation to take the Blackbox log file into account. You might want to consider retaining at least 10 days of logging data for your records.

Working with the Blackbox Format

Now that you're logging the data, you'll need to find a way to put it to work.

The Perl Cookbook featured a field extractor regular expression for the common log file format. That same code can be modified to extract all of the fields in the Blackbox format.

#!/usr/local/bin/perl

# Our pattern matching format
# A "\s" is used at the end of lines 1 and 2 to catch the space character
# in the format

$Fmt=<<EOFMT
^([^/]+)/(\\d+)\\ (\\S)\\ \\[([^:]+):(\\d+:\\d+:\\d+)\\ ([^\\]]+)\\]\\s
"(\\S+)\\ (.*?)\\ (\\S+)"\\ (\\d+)/(\\d+)\\ (\\d+)/(\\d+)\\s
(\\d+)/(\\d+)\\ (\\d+)/(\\d+)/(\\d+)\$
EOFMT
;

# Read a piped log file from standard input

while (<STDIN>)
{
        ($remoteIp,
         $remotePort,
         $connectionStatus,
         $date,
         $time,
         $gmtOffset,
         $requestMethod,
         $url,
         $protocol,
         $statusBeforeRedir,
         $statusAfterRedir,
         $processId,
         $threadId,
         $seconds,
         $microseconds,
         $bytesIn,
         $bytesOut,
         $bytesContent)=/$Fmt/x;
}

There are a few things to keep in mind when parsing your log files.

Apache only logs attempted client requests. If someone connects to port 80 and doesn't send any data, it won't be logged. At a minimum, a client needs to send some text and one carriage return. Also, the pattern above assumes a proper request line will be transmitted by the client. If a client only transmits garbage, the regex will fail.

The timestamps refer to the time the request was started, so it is possible to see log file entries that are out of order, especially if you are dealing with a mix of large files and small files being served.

The %T and %D directives refer to the time it takes to handle the entire transaction, including the amount of time it takes for the client to transmit data. If you see wide variances in the amount of time it takes to serve the same file, it may be related to a web client network problem.

Listed below are a few ideas for potential analysis tools that you can develop from the Blackbox log.

Performance Graphing

Long-term collection of Blackbox data is better suited to a graphing environment like RRDTool. You can graph metrics including bytes in and out, maximum clients per second, and child process lifetime.

To do this, you need to write a program that continually scans the Blackbox log in short intervals (5 or 10 minutes), grabs all new entries, and then imports the data into the RRD file.

RRDTool collects time-based sampling data for any duration. One data file can keep data for durations of a week, a month, or even a year. You can merge data files into a single graph if you want to report on the performance of a group of load-balanced servers.

See the RRDTool web site for more information.

Flight Recorder

You have almost enough data in the Blackbox format to see exactly how a single client handled a series of requests. You can extract a full HTTP session by filtering on the remote port and IP.

You could also program a HTTP client to replay the exact same data if you wanted to try and simulate the client actions. The downside is that you won't have a record of all of the client headers passed, such as cookies or authentication data. Plus the timestamp resolution is to the second, which may not be a 100% match with when the original client transmitted the request.

Final Thoughts

Everyone has been logging web server traffic as long as the web has been around, but the emphasis has always been aimed toward the content served, not the server itself. There are tools and modules out there that can monitor performance, but most of them generate reports of data you can already find just by logging it.

The Blackbox format is a simple alternative since it doesn't require additional modules. Further, you can examine the data without a running web server. All the work to get it up and running requires the addition of Apache logging directives and an optional patch to the source code.

Chris Josephes works as a system administrator for Internet Broadcasting.


Return to the Apache DevCenter.

Copyright © 2009 O'Reilly Media, Inc.