ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)


Analyzing Web Logs with AWStats

by Sean Carlos
12/01/2005

A crucial, if often overlooked, aspect of running a successful web site is the study of activity occurring within the site. The information gleaned provides valuable input to continuous improvement initiatives, ranging from site architecture and content enhancements to traffic generation. This is the first of a two-part series exploring how to use the open source tool AWStats to perform web server log file analysis. This first part shows how to prepare a sample web log file, perform a basic installation of AWStats, generate reports, and review web analytics terminology; the second part will focus on report interpretation. My aim is to clear away some of the common misconceptions around hits, pages, and visits. The insight will provide a basis for creating a setup to meet production requirements.

Installing AWStats

Web log analysis can be resource-intensive and usually takes place on a system different from the production web server(s). This separation also allows for the flexibility inherent in heterogeneous architectures, where web servers might be running Linux while log analysis tools run under Windows or vice versa. I've assumed a minimalist scenario in which you have AWStats installed on a desktop workstation for ad hoc analysis. While AWStats will run on any platform that supports a recent Perl interpreter, this article covers AWStats 6.4 using either Linux or Windows.

Binary executables for Linux (.rpm) and Windows (.exe) are available from the AWStats project home page and the AWStats project on SourceForge. Download and run the executable appropriate for your workstation. In the case of a Windows install, a script will prompt you for information about your web environment. Answer N to skip this step, and press Enter until the command window closes.

Once the installation finishes, you should find the AWStats programs and documentation on your hard drive, likely in /usr/local/awstats/ or C:\Program Files\AWStats\. Now check that Perl is available. From the system command prompt, type:

$ perl -v

You should see version information if you already have Perl installed. AWStats will stop if the version is lower than 5.005_03; the latest version (5.8.x) is recommended, as it offers performance improvements. To install or update Perl, get a version for Linux from Perl for Linux or for Windows from ActiveState's ActivePerl.

Preparing Web Server Log File Data

To produce reports, you need at least a day of web server log file data. If you are using an Apache server, ensure that you have set the web server logging format to Combined. In the case of Microsoft's IIS web server, set your format to a modified version of the W3C Extended Log File Format, following the instructions in AWStats IIS configuration Part B, Step 1. These configurations add necessary data elements such as user agent (browser) and referring site to the base log configuration. For other web servers, consult the AWStats LogFormat parameter values to get a list of data elements required for complete reporting.

Restart the web server for the new logging values to take effect (after saving the old logs, if needed). If you have access to data from a production web server that you cannot restart, you can use the data as is, with two caveats. If you are not logging all the required data elements, such as user agent, the relevant AWStats reports will be empty. In addition, you must manually map each field being logged using the LogFormat parameter; otherwise, most of your data file will appear as corrupted to AWStats.

Once logging has run for at least a calendar day, copy the log file(s) to the system on which you installed AWStats, using one of the following commands:

$ cp /var/log/httpd/access_log /tmp/access.log

# or

> copy C:\WINDOWS\system32\Logfiles\W3SVC1\ex050623.log C:\temp\access.log

Adjust the origin locations as needed based on your web server configuration.

You can also combine multiple logs from different dates using the type (Windows) or cat (Linux) utility (in a production setting, turn the filename into a parameter). Be careful to combine the files in chronological order:

$ cat logfile1 logfile2 logfile3 > access.log

In the case of multiple servers in load balancing, merge the logs with the AWStats logresolvemerge.pl utility.
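
To make the idea of a chronological merge concrete, here is a minimal Python sketch. It is not a substitute for logresolvemerge.pl and does not reproduce its behavior; it only assumes that each input is already sorted by time, that every line carries a Combined-format timestamp such as [08/Jun/2005:19:03:22 +0200], and that month names are in English. The function names and the sample records are hypothetical.

```python
import heapq
import re
from datetime import datetime

# Matches the bracketed timestamp of a Combined-format record,
# e.g. [08/Jun/2005:19:03:22 +0200]
TIMESTAMP = re.compile(r"\[([^\]]+)\]")

def record_time(line):
    """Return the timezone-aware timestamp of one log line.

    Raises an exception if the line has no parsable timestamp.
    """
    match = TIMESTAMP.search(line)
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")

def merge_logs(*logs):
    """Merge several already-sorted iterables of log lines by timestamp."""
    return heapq.merge(*logs, key=record_time)

# Hypothetical records from two load-balanced servers.
server_a = [
    'host1 - - [08/Jun/2005:19:03:22 +0200] "GET / HTTP/1.1" 200 4544 "-" "-"',
    'host1 - - [08/Jun/2005:19:05:10 +0200] "GET /about.html HTTP/1.1" 200 3120 "-" "-"',
]
server_b = [
    'host2 - - [08/Jun/2005:19:04:01 +0200] "GET / HTTP/1.1" 200 4544 "-" "-"',
]

merged = list(merge_logs(server_a, server_b))
for line in merged:
    print(line)
```

The record from the second server lands between the two records from the first, which is the ordering AWStats needs before it processes the combined file.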

Creating an AWStats Configuration File

A sample AWStats configuration file, awstats.model.conf, comes with the AWStats installation. Copy the file, changing model to the name of the domain to analyze. While custom dictates the use of a domain name, in reality it can be anything. This example analyzes data from www.antezeta.com, so model becomes antezeta:

$ cp /etc/awstats/awstats.model.conf /etc/awstats/awstats.antezeta.conf

> copy "C:\Program Files\AWStats\wwwroot\cgi-bin\awstats.model.conf" ^
    "C:\Program Files\AWStats\wwwroot\cgi-bin\awstats.antezeta.conf"

Open the resulting file in your favorite text editor. Change each of the following values as necessary (where antezeta.com represents your domain):

SiteDomain="www.antezeta.com"
HostAliases="www.antezeta.com localhost 127.0.0.1"
LogType=W

Set the parameter LogFormat to 1 for Apache, 2 for Microsoft IIS < 6.0, or the field list "date time cs-method cs-uri-stem cs-username c-ip cs-version cs(User-Agent) cs(Referer) sc-status sc-bytes" for IIS 6.x. For other web servers, see the documentation in the configuration file.

LogFormat=1

Set the parameter DNSLookup to 1 unless your web server already performs reverse DNS lookups on client IP addresses (that is, translating a host IP address such as 123.456.789.012 into user34.adsl.myisp.com or similar). Because reverse DNS lookup is slow and would delay user navigation, web servers do not usually perform it.

DNSLookup=1

Save the file.

Building and Updating the AWStats Statistics Database

AWStats uses intermediary files to produce its reports--one for each month of each year for each configuration file you have created. These files represent a compact, optimized version of raw web server log file data, based on preference settings in the AWStats configuration file. Run the command appropriate for your operating system to generate a statistics file for the web log saved earlier in the temporary directory (replace antezeta with your domain name):

$ perl /usr/local/awstats/wwwroot/cgi-bin/awstats.pl -config=antezeta \
    -update -LogFile=/tmp/access.log

> perl "C:\Program Files\AWStats\wwwroot\cgi-bin\awstats.pl" -config=antezeta ^
    -update -LogFile=C:\temp\access.log

You should see output similar to this Windows example:

Update for config "C:\Program Files\AWStats\wwwroot\cgi-bin/
    awstats.antezeta.conf"
With data in log file "C:\temp\access.log"...
Phase 1 : First bypass old records, searching new record...
Searching new records from beginning of log file...
Phase 2 : Now process new records (Flush history on disk after 20000 hosts)...
Jumped lines in file: 0
Parsed lines in file: 539
 Found 1 dropped records,
 Found 4 corrupted records,
 Found 0 old records,
 Found 534 new qualified records.

This will generate a statistics file awstatsMMYYYY.antezeta.txt in the same directory as awstats.pl (unless you gave a different value to DirData in awstats.antezeta.conf):

Directory of C:\Program Files\AWStats\wwwroot\cgi-bin
06/23/2005 03:51 PM 6,633 awstats062005.antezeta.txt

where MM is the month and YYYY the year of the web server log data. Should the input data bridge two months, the statistics database will consist of two statistics files.
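
The naming pattern can be sketched in a few lines of Python, which may help when scripting cleanup or reprocessing. This only mirrors the awstatsMMYYYY.config.txt pattern described above; the function names are hypothetical and are not part of AWStats.

```python
from datetime import date

def stats_filename(log_date, config):
    """Name of the monthly statistics file, following awstatsMMYYYY.config.txt."""
    return f"awstats{log_date.month:02d}{log_date.year}.{config}.txt"

def stats_files_for(dates, config):
    """Distinct statistics file names (sorted by name) for a set of record dates."""
    return sorted({stats_filename(d, config) for d in dates})

print(stats_filename(date(2005, 6, 23), "antezeta"))

# Log data that bridges two months maps to two statistics files.
print(stats_files_for([date(2005, 6, 30), date(2005, 7, 1)], "antezeta"))
```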

Rerun the previous command against the same log file to see how AWStats handles data it has already processed. Instead of 534 new records, you have 534 old ones:

Update for config "C:\Program Files\AWStats\wwwroot\cgi-bin/
    awstats.antezeta.conf"
With data in log file "C:\temp\access.log"...
Phase 1 : First bypass old records, searching new record...
Searching new records from beginning of log file...
Jumped lines in file: 0
Parsed lines in file: 539
 Found 1 dropped records,
 Found 4 corrupted records,
 Found 534 old records,
 Found 0 new qualified records.

AWStats, noticing it received an old file, correctly ignores the old data. However, AWStats is less flexible when it comes to processing log files out of order--it must process them chronologically. If you skip a day's processing, AWStats will ignore it if you try to process it after processing successive days. The solution is to delete that month's statistics file and reprocess the log data for the entire month to date. Similarly, some AWStats configuration file changes affect statistics file generation. If your log files are not large and you have doubts, delete the statistics file(s) and reprocess your logs.
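
One way to picture this behavior is as a "high-water mark": records no newer than the latest one already processed count as old. The Python sketch below is only an illustration of that idea under this assumption; it is not AWStats's actual code, and the function name and sample timestamps are hypothetical.

```python
from datetime import datetime

def split_old_and_new(record_times, last_processed):
    """Classify record timestamps against the newest one already processed."""
    old, new = [], []
    for ts in record_times:
        # Assumption: anything not strictly newer than the mark counts as "old".
        if last_processed is None or ts > last_processed:
            new.append(ts)
        else:
            old.append(ts)
    return old, new

# Hypothetical record times for three consecutive days.
june_20 = [datetime(2005, 6, 20, 9, 0), datetime(2005, 6, 20, 17, 30)]
june_21 = [datetime(2005, 6, 21, 8, 15)]
june_22 = [datetime(2005, 6, 22, 11, 45)]

mark = None
for day in (june_20, june_22):  # June 21 is skipped by mistake
    old, new = split_old_and_new(day, mark)
    mark = max(new, default=mark)

# Processing the skipped day afterwards: every record is older than the mark.
late_old, late_new = split_old_and_new(june_21, mark)
print(len(late_old), len(late_new))
```

Under this model the skipped day contributes nothing, which is why deleting the month's statistics file and reprocessing in order is the way out.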

"Corrupted" record tips

Log retention tip

Storing the original log files for extended periods is a good practice, unless legal or company policy dictates otherwise. Access to historical logs lets you regenerate your reports if you subsequently make a configuration file change or decide to migrate to another web log analysis tool.

Producing the First Reports

After you have created a statistics database, it's possible to run reports. While AWStats supports a very nice on-demand web CGI interface, it's easy to create static HTML reports to avoid having to reconfigure your web server. The following commands will generate the reports in the /tmp or C:\temp directory:

$ perl "/usr/local/awstats/tools/awstats_buildstaticpages.pl" \
    -config=antezeta -lang=en \
    -awstatsprog="/usr/local/awstats/wwwroot/cgi-bin/awstats.pl" \
    -dir="/tmp" \
    -diricons="/usr/local/awstats/wwwroot/icon"

> perl "C:\Program Files\AWStats\tools\awstats_buildstaticpages.pl" ^
    -config=antezeta ^
    -lang=en -awstatsprog="C:\Program Files\AWStats\wwwroot\cgi-bin\awstats.pl" ^
    -dir="C:\temp" -diricons="../Program%20Files/AWStats/wwwroot/icon"

AWStats creates the HTML reports in the temp directory specified by -dir; the main index file is awstats.config.html (for this example, awstats.antezeta.html). Open it in a web browser.

Report graph tip

Should the report graphs be clear rather than colored, verify the directory specified with the -diricons parameter. This value is hardcoded in the HTML files. In the Windows example above, we had to encode the space in the directory name with the %20 notation. We also used HTML forward slashes rather than Windows backslashes.
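
If you script the report generation, the same two conversions (backslashes to forward slashes, spaces to %20) can be done programmatically. The following is a small Python sketch using the standard library; the helper name is hypothetical.

```python
from urllib.parse import quote

def diricons_value(windows_path):
    """Turn a relative Windows path into the URL-style form used for -diricons."""
    # Swap backslashes for forward slashes, then percent-encode the rest
    # (quote leaves "/" and "." untouched and turns a space into %20).
    return quote(windows_path.replace("\\", "/"))

print(diricons_value(r"..\Program Files\AWStats\wwwroot\icon"))
```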

Report Foundations: Hits, Pages, Sessions, and Visitors

To put the created reports in context, begin by looking at the raw log data, and from that define basic web analytics terminology.

Anatomy of a web server log file

Using the configuration format specified earlier, each web log will have multiple lines of text, each containing nine fields of data. To understand the work AWStats has to perform, consider how a record looks:

Table 1. Web server log record (line) example

1. Host (user) IP
   Example: d81-211-134-62.cust.tele2.it
   Explanation: There has been a DNS lookup in this case. The web server can do it, but you can also do it later, if you do it at all. Judging from the user's host, there is a reasonable probability that the request came from Italy. (However, if the host were something like proxy.alitalia.it, the user might have been working for Alitalia in Boston!)

2. RFC 1413 identity (username) of the client determined by identd
   Example: -
   Explanation: Rarely used. PC clients do not usually run identd. A dash is a placeholder in the absence of a value.

3. Authenticated user (login name)
   Example: -
   Explanation: The login name for a web server-required login. This is not usually present--most web sites use application server logins, not web server logins.

4. The date and time that the server finished processing the request
   Example: [08/Jun/2005:19:03:22 +0200]
   Explanation: Time includes UTC (Coordinated Universal Time) offset.

5. The user request
   Example: GET / HTTP/1.1
   Explanation: In this case, the client requested the top-level default document / (index.html) using the GET method of the HTTP protocol version 1.1.

6. Response status sent to client
   Example: 200
   Explanation:
     • 1xx--informational
     • 2xx--successful
     • 3xx--redirection
     • 4xx--client error
     • 5xx--server error

7. Bytes sent, excluding HTTP headers
   Example: 4544

8. Referer (sic) URL, if any
   Example: http://www.antezeta.com/about.html
   Explanation: The URL from which the client made the request. This field is blank if the user directly types a URL, chooses a bookmark, or uses privacy software that blocks the information from being sent.

9. User-Agent identification as reported by the user agent. This usually includes operating system and browser names and versions.
   Example: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050513 Fedora/1.0.4-1.3.1 Firefox/1.0.4
   Explanation: This is a Firefox 1.0.4 browser on a Fedora Linux system. Note: some browsers, such as Opera, let the user choose which identification to send. A user can claim to use Microsoft Internet Explorer 6 even while using Opera. This impostor functionality is a response to all the poorly designed "Optimized for browser x" sites that refuse to work with other, legitimate standards-compliant browsers.

This one web server entry, a successful request for http://www.antezeta.com/, represents what is commonly called a hit. An anonymous user navigated from the page http://www.antezeta.com/about.html.
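
To give a feel for the parsing work involved, here is a minimal Python sketch that splits one Combined-format record into its nine fields. The sample line is assembled from the example values in Table 1; the regular expression and the field names are my own and do not reflect how AWStats itself parses logs. Quoted fields containing escaped quotes are not handled.

```python
import re

# One named group per field of the Combined log format.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_record(line):
    """Return a dict of the nine fields, or None if the line does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None

line = (
    'd81-211-134-62.cust.tele2.it - - [08/Jun/2005:19:03:22 +0200] '
    '"GET / HTTP/1.1" 200 4544 "http://www.antezeta.com/about.html" '
    '"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050513 '
    'Fedora/1.0.4-1.3.1 Firefox/1.0.4"'
)

record = parse_record(line)
for name, value in record.items():
    print(f"{name}: {value}")
```

A line that does not fit the pattern yields None, which loosely corresponds to the "corrupted" records AWStats reports when the log does not match the configured LogFormat.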

Hits everywhere

Consider that the web site's home page in the above example is actually a group of files--one text file (index.html), one style sheet to indicate formatting (CSS), six image files (GIF, ICO, and PNG), and some dynamic client-side logic (JavaScript) stored in two separate files on the web server. Simply calling up the home page will result in ten file requests to the web server, and thus ten hits:

Qty  Item
1    HTML text file; for example, index.html
1    CSS formatting instructions file
6    GIF, ICO, and PNG image files
2    JavaScript (js) client logic instruction files
10   Total hits

Probably the most common web metric bandied about, "hits" is also the most meaningless.

Hit
A hit is a successful request for an object from a web server. Success usually merits a status code of 200 or, for objects that are identical to those already in a user's cache, 304.

Along with bandwidth consumption, hits can be useful as an input for server sizing and capacity planning. While people make much of hits to tout the success of a site, hits have no intrinsic business value. Representations to the contrary probably indicate a misunderstanding of how little hits reveal as a business measure.
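
The definition translates into a one-line test. The Python sketch below counts hits for a single home page view like the one described above; the request paths are hypothetical, and an extra failed request is included to show that it is not counted.

```python
# Status codes treated as a successful request, per the definition of a hit.
HIT_STATUSES = {200, 304}

def is_hit(status):
    """True if a response status counts as a hit."""
    return status in HIT_STATUSES

# Hypothetical (path, status) pairs for one home page view.
requests = (
    [("/index.html", 200), ("/site.css", 200)]
    + [(f"/img/{n}.png", 200) for n in range(5)]
    + [("/favicon.ico", 304)]            # already in the browser cache
    + [("/menu.js", 200), ("/stats.js", 200)]
    + [("/missing.gif", 404)]            # failed request: not a hit
)

hits = sum(is_hit(status) for _, status in requests)
print(f"{len(requests)} requests, {hits} hits")
```

One visitor looking at one page produces ten hits here, which is why the number says so little about audience size.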

Turning to Pages

As the internet has matured, more sophisticated attention turned from hits to pages. Unfortunately, this opened a new can of worms: there is no standard definition of a page. A web server log file simply contains information on objects requested from the web server. It is up to the web server log file analysis software to give semantic meaning to those objects.

Page
Generally a page is a content object that a user viewed, such as an HTML file, a word processing document, or an Adobe Acrobat PDF file.

AWStats works by exclusion in defining a page. By default, any object accessed by a user on your web server is a page unless it has a filename suffix of css, js, class, gif, jpg, jpeg, png, bmp, or ico. You must explicitly add any other objects you do not want to count as pages in AWStats reports. For example, add ZIP archives and Flash animation files to this list by adding their suffixes to the NotPageList directive in the AWStats configuration file:

NotPageList="css js class gif jpg jpeg png bmp ico swf zip tgz gz tar"

Then AWStats will count everything but the following as pages:

Table 3. Files not counted as pages
Suffix                        Description
css                           Cascading Style Sheet formatting instruction files
js                            JavaScript dynamic program logic
class                         Java program files
gif, jpg, jpeg, png, and bmp  Various image/photo formats
ico                           An image icon file; many sites have a company logo saved as favicon.ico; many browsers use this in bookmarks (favorites) and tabs
swf                           Shockwave Flash animation
zip, tgz, gz, and tar         Archive formats created by PKZip, WinZip, tar, gzip, or similar

One advantage to this approach is that if you are using a CGI to generate dynamic pages, you do not have to worry about each CGI query counting as a page--this will be automatic.
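
The exclusion rule is easy to express in code. The following Python sketch classifies a requested URL as a page unless its suffix appears in the list from the NotPageList example above. The function name is hypothetical, and comparing suffixes case-insensitively is an assumption of this sketch, not a documented AWStats behavior.

```python
import posixpath
from urllib.parse import urlsplit

# Suffixes taken from the NotPageList example above.
NOT_PAGE_LIST = "css js class gif jpg jpeg png bmp ico swf zip tgz gz tar".split()

def is_page(url):
    """True unless the URL path ends in an excluded suffix."""
    path = urlsplit(url).path                      # drop any query string
    suffix = posixpath.splitext(path)[1].lstrip(".").lower()  # assumption: ignore case
    return suffix not in NOT_PAGE_LIST

for url in ["/", "/about.html", "/cgi-bin/search.pl?q=awstats",
            "/logo.GIF", "/downloads/site.tar.gz"]:
    print(url, "->", "page" if is_page(url) else "not a page")
```

Note that the CGI request counts as a page without any extra configuration, because its suffix is simply not on the exclusion list.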

Counting tips

Visitors and sessions

While the concept of a page is open to some interpretation, the concept of a visitor (and a visit, also known as a session) is more difficult to define. Log data neither defines nor tracks a visitor entity. Several heuristic approaches can be used to extrapolate individual visitors from server log data, each approach adding an additional level of refinement.

Visitor
By convention, a visitor is at least the IP address (host) from which the web requests originate. Many commercial tools use cookies to increase the accuracy of this approach. AWStats does not yet use cookies to increase the accuracy of visitor recognition. This is an often-requested AWStats enhancement. Perhaps a Perl programmer reading this will take on the challenge.
Visit
A visit constitutes all activity occurring without a break of more than 30 minutes. Thus, if you request a page and then wait 29 minutes before requesting a new page, both page requests take place during the same visit (or session). However, if you request the subsequent page 30 minutes and 1 second later, that is a new visit. AWStats currently considers a visitor session break to be 60 minutes. Hopefully, this will be configurable in a future version.
Session
Synonym for visit.
Unique visitors
The count of visitors after removing duplicate visits.
Authenticated visitors
Users who have logged in with a username and password. This can be a web server-controlled login or an application server-level login. Web log analysis tools like AWStats track logins at the web server level. The application level login is more common.
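
These definitions can be turned into a simple heuristic. The Python sketch below identifies a visitor by host only and starts a new visit whenever the gap between two requests from the same host exceeds the timeout. It is an illustration of the definitions, not AWStats's implementation; the function name and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def count_visits(requests, timeout=timedelta(minutes=30)):
    """Return {host: number of visits} for (host, datetime) page requests."""
    last_seen = {}
    visits = defaultdict(int)
    for host, when in sorted(requests, key=lambda r: r[1]):
        previous = last_seen.get(host)
        # A first request, or a break longer than the timeout, opens a new visit.
        if previous is None or when - previous > timeout:
            visits[host] += 1
        last_seen[host] = when
    return dict(visits)

# Hypothetical page requests: host "a" pauses 29 minutes, then 30 min 1 s.
requests = [
    ("a", datetime(2005, 6, 8, 10, 0, 0)),
    ("b", datetime(2005, 6, 8, 10, 5, 0)),
    ("a", datetime(2005, 6, 8, 10, 29, 0)),
    ("a", datetime(2005, 6, 8, 10, 59, 1)),
]

by_30 = count_visits(requests)
by_60 = count_visits(requests, timeout=timedelta(minutes=60))
print("30-minute break:", by_30, "unique visitors:", len(by_30))
print("60-minute break:", by_60, "unique visitors:", len(by_60))
```

With a 30-minute break, host "a" makes two visits; with the 60-minute break the same requests collapse into one, so the choice of timeout directly changes the visit count while the number of unique visitors stays the same.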

Some significant problems are inherent in tracking visitors and their visits with web log analysis software such as AWStats.

Despite these limitations in heuristic approaches, the concept of visitors and sessions (each individual visit) remains a valid tool as an indication of overall user behavior and trends.

Table 4. Visits and unique visitors
Visitor No.  Visits (sessions)  Unique visitors
1            2                  1
2            1                  1
3            12                 1
Total (3)    15                 3

Bandwidth consumption

Bandwidth consumption is of interest to technical staff, as there is usually an economic cost associated with its use. On a more granular level, large individual file sizes will indicate performance issues, especially for dial-up users.

Bandwidth
The total file size sent from the web server to the end user. This does not include HTTP headers in served objects, HTTP request headers from users, or bytes needed by the underlying network protocols.
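
In terms of the log record, bandwidth is the sum of the bytes-sent field (field 7 in Table 1). Here is a minimal Python sketch with hypothetical data; it assumes Apache's convention of logging a dash when no response body was sent.

```python
def bandwidth(byte_fields):
    """Sum the bytes-sent field of each record, skipping the "-" placeholder."""
    return sum(int(value) for value in byte_fields if value != "-")

# Hypothetical (path, bytes sent) pairs as they would appear in the log.
requests = [("/", "4544"), ("/about.html", "3120"), ("/logo.png", "-")]

total = bandwidth(size for _, size in requests)
print(f"{total} bytes sent")
```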

The final part of this series will look at the reports we generated, using the definitions above to identify business and technical metrics to watch.

Sean Carlos is president of Antezeta, an internet consultancy focusing on Merit-Based™ search engine optimization, search engine marketing, web analytics, and web site usability.



Copyright © 2009 O'Reilly Media, Inc.