ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)


Analyzing Web Logs with AWStats, Part 2

by Sean Carlos
01/09/2006

Part one of this series showed how to perform a basic installation of the web server log analysis tool AWStats, generate sample reports, and understand basic web analytics terminology. This second article delves into the reports, noting metrics worth watching for technical and business staff.

AWStats Summary and When Reports

AWStats provides several summary reports that show the overall website trends over different time intervals for the reporting period, usually monthly.

Summary

The General Summary report breaks down unique visitors, number of visits, pages, hits and bandwidth by human and non-human visitors for the current reporting interval. Metrics to watch are:

Unique visitors
The overall number of users coming to a site is useful to track the general trend over time. Declines, unless due to seasonal fluctuations, are trouble indicators. Is a competitor drawing away your potential traffic? Did site changes remove you from search engines? Did you make changes in your marketing activity?
Unique visitors/Number of visits ratio
This answers the question "Are visitors returning?" If they are not, they might not be finding what they were looking for--did you tell a search engine that you offer poker when you really sell widgets? Can a mere mortal survive your navigation system?
Pages
This is most useful relative to the number of visitors.
Pages/visitors ratio
This indicates the extent to which your site engaged a visitor. Generally, the higher the better, but a high number can also indicate the presence of convoluted site flows and processes that force motivated visitors to slog through page after page of obstacles in order to accomplish their goals.
Monthly history

This provides a breakdown of visitors and pages, month by month, for the current year. In month-to-month comparisons, do not forget to adjust for the variable number of days in a month. Keep an eye on the Unique visitors/Number of visits and Pages/visitors ratios, which you must calculate manually.

Days of month
Over the course of a month, unusual peaks may indicate your site is receiving promotion from elsewhere--review referral traffic. Abnormally low traffic, if not on a holiday, may indicate difficulty in reaching your site, whether it was down or performing poorly. Review monitoring reports if available.
Days of week
This is useful to see what days are most popular. For marketing, it indicates when the greatest number of eyeballs is generally present. For technical staff, this can suggest when to perform scheduled maintenance. There is a slight skew because not every day of the week occurs the same number of times in a given month (30 and 31 not being evenly divisible by 7).
Hours
This is a more granular breakdown of general peaks as in the days of the week report. Hourly peaks are useful to technical staff responsible for capacity planning. Sometimes an hourly breakdown for a specific day is necessary. To generate this, see the AWStats unofficial daily reports feature documentation.

Day-to-day and month-to-month trends are useful in evaluating the impact of marketing initiatives and/or external traffic drivers.

AWStats also supports yearly reporting intervals, using the command line -month=all -year=YYYY date syntax or through the AllowFullYearView configuration option for the on-demand CGI interface. Daily reports are available as an unofficial feature. Changes planned for version 6.5 should facilitate reporting on different time intervals.
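
As a minimal sketch of the configuration side of yearly reporting, the fragment below sets AllowFullYearView in the AWStats configuration file. The value meanings given in the comments follow the sample configuration file shipped with AWStats 6.x; treat them as an assumption and verify them against the comments in your own copy.

```conf
# Assumed value meanings (check your sample configuration file):
#   0 = full-year view never allowed
#   1 = allowed from the command line only (-month=all)
#   2 = allowed from the command line and from the CGI interface
AllowFullYearView=2
```

A full-year report has to read every monthly statistics file, so it will probably take noticeably longer to build than a single-month report.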

User Provenance: Traffic Building and Monitoring

Qualified traffic building (and monitoring) is an essential marketing activity for most sites, regardless of business model. On the technical side, changes in traffic patterns are useful for technical capacity planning and management. Traffic comes from bookmarks or direct URL entry, search engines, and links on external sites; the sections below cover each in turn.

The relevant AWStats reports are in the "Referers" section. The misspelling is historical: it comes from the HTTP specification, which names the request header "Referer."

Bookmarks or Direct URL Entry

AWStats reports on visitors where a referring URL is missing from the first page request of a visit/session. The Referrers report calls this "Direct address/Bookmarks." Some privacy software, such as Norton Internet Security, can block transfer of referring URL information, meaning that those visitors will appear in this report even if they came from a link on another site or a search engine.

Search Engine Usage Reports

Search engine reports show which search engines and queries brought visitors to the site. The main report contains the top listings; each section has a link to the complete listing for the reporting period.

Links from an Internet Search Engine
This answers the question, "Which search engines are sending us the most traffic?" See the related "Search engine crawlers" report to ensure that search engines are indexing your site.
Search Key Phrases
The top ten multiple-word combinations entered by users.
Search Keywords
The top ten individual words entered by users. AWStats builds this list by splitting the key phrases into single words and aggregating the counts.

Search activity information is extremely useful in validating and refining merit-based search engine optimization efforts. Key phrases identify the language used by site visitors, language that is usually rather colloquial compared to the jargon often prevalent in site copy. Consider revising the site copy to ensure it contains the language used by your target audience while maintaining a professional tone. The absence of keyword synonyms is more likely to indicate that these words are missing from your site's content than that internet users do not search for them.

Referrers (a.k.a. "Referers")

While most indirect traffic comes through search engines, traffic can also come from external site inbound links--links due to compelling content, advertising agreements, etc. Monitor inbound links from external sites to:

By default, AWStats shows the specific page that referred to your site. It is also possible to aggregate the referrers by domain by taking advantage of AWStats' custom report feature. Add the following directives to your AWStats configuration file:

ExtraSectionName1="Referring Sites by domain - Top 25"
ExtraSectionCodeFilter1="200 304"
ExtraSectionFirstColumnTitle1="Site"
ExtraSectionFirstColumnValues1="REFERER,^http:\/\/([^\/]+)\/|^HTTP:\/\/([^\/]+)\/"
ExtraSectionFirstColumnFormat1="<a href='http://%s/' title='http://%s/' target='_blank'>%s</a>"
ExtraSectionStatTypes1=PHBL
ExtraSectionAddAverageRow1=1
ExtraSectionAddSumRow1=1
MaxNbOfExtra1=25
MinHitExtra1=1

This section will appear only for data processed after you update the configuration file. To generate this report retroactively, you must delete the existing AWStats statistics files and regenerate them from the original logs, because the reports run from those statistics files.

Geographic Provenance

To the extent it's possible to associate a visitor's host name with a physical location, it is possible to report on the geographic provenance. By default, AWStats offers country-level reporting.

Countries
The countries report generally indicates the countries of traffic origin. As noted earlier, some users may access the internet through large company proxies, masking their true location. Nevertheless, country information can assist strategic planning relative to foreign markets--should a commerce site accept foreign payment methods? Should you translate the site into local languages? Are you big in Japan? Tip: If country information is missing, you probably haven't enabled reverse DNS lookups (the DNSLookup option) for your logs. Alternatively, you can use the MaxMind GeoIP plugin.
Region
City
ISP
Organization
Additional geographic granularity is in theory possible by purchasing commercial AWStats plugins from MaxMind. There are, however, limitations based on how visitors connect to the internet through ISPs. U.S. visitors may appear disproportionately in Virginia, having entered the internet through AOL proxies located there.
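
The tip in the Countries entry above can be sketched as the following configuration fragment. DNSLookup and LoadPlugin are standard AWStats directives, but the database path is a placeholder, and the plugin line assumes that the MaxMind GeoIP Perl module and its country database are already installed on your system.

```conf
# Option 1: resolve visitor IP addresses to host names during the update,
# so that AWStats can infer a country from the domain suffix.
# Reverse lookups can slow log processing considerably.
DNSLookup=1

# Option 2 (assumption: GeoIP module and database installed; path is a placeholder):
# map IP addresses to countries without any DNS lookup.
#LoadPlugin="geoip GEOIP_STANDARD /pathto/GeoIP.dat"
```

With the plugin in use, DNSLookup can usually stay at 0, which keeps update runs fast.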

User Behavior Within the Site Reports

Several marketing reports assist in the understanding of how users behave once they have arrived at your site.

Visit duration
This is the time from the first page request up until the last page request, without a break longer than 60 minutes (the session expiration time). In general, the longer the visit, the better your site holds a visitor's attention. The actual duration is always longer than reported, as web servers cannot track how long the visitor stayed on the final page before typing in a new URL or closing the browser. A high share of very short visits suggests your site is not capturing the attention of your visitors.
Pages-URL
The main report shows the top ten pages seen; a link calls up the full list for the reporting period. This is useful as a gauge of which pages are more and less popular.
Entry pages
The top site access pages are an indicator of which pages search engines and external sites target. Make sure these pages speak to an audience that arrives at them directly, without having seen your home page first.
Exit pages
These are the top pages where a visitor has abandoned the site. In the best case, the top pages are the conclusion of a natural process flow and simply indicate opportunities to entice the visitor to explore the site further. If the page is at the beginning or in the middle of a logical process flow, you have direct evidence that something impedes the conversion of a visitor into a customer. Review the page to determine what is driving visitors away--a form with 30 fields or the lack of a visible "continue" button when using Firefox, perhaps?

Tip: Consider extending AWStats by using custom reports such as ExtraSection to monitor specific site pages and/or directories. The following example, added to your AWStats configuration file, will track the most-visited first- and second-level site directories. For sites that have placed business content in distinct directories, this type of report provides overall performance at a glance.

ExtraSectionName2="Top 50 first and second level directories"
ExtraSectionCodeFilter2="200 304"
ExtraSectionCondition2="URL,^\/.*"
ExtraSectionFirstColumnTitle2="Directory"
ExtraSectionFirstColumnValues2="URL,(^(\/[\w]+\/[\w]+\/)|^(\/[\w]+\/))"
ExtraSectionStatTypes2=PHB
ExtraSectionAddAverageRow2=0
ExtraSectionAddSumRow2=0
MaxNbOfExtra2=50
MinHitExtra2=1

For each line, change the suffix 2 to 1 (for example, ExtraSectionName2= becomes ExtraSectionName1=) if you do not already have an ExtraSection enabled. In addition to the two examples here, there are six examples in the AWStats online documentation topic "ExtraSection," and additional samples in the AWStats web analytics resource center.

Site Development and Management Reports

Several technical reports assist site development and quality control.

Operating systems and versions
This provides insight into the operating systems (and in the detail report, which versions of them) that visitors use to access the site. Use in combination with the browser report to identify where to concentrate site testing efforts.
Browsers
The top browsers used to visit the site. Use this to prioritize testing efforts.
HTTP status codes

Most AWStats reports work from successful requests--status 200 or 304. This report contains the others. Monitor it for potential problems. The most common are:

401 Unauthorized
For sites with a server-based login to a reserved site area, this indicates failed logins.
404 Document Not Found
This indicates a request for an object not found on the web server. The cause may be a file left out when the site was deployed to production, an incorrect link, an outdated link from an external site (consider contacting the site to update the link), or an attack attempt.
500 Internal Server Error
This usually indicates an incorrectly configured web server or the failure of the web server to call an external program or application server.

Tip: Consider creating a custom report on the log field user agent to report on browser and operating system combinations.
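
A minimal sketch of such a report follows. It reuses the ExtraSection directives shown earlier and assumes that sections 1 and 2 are already taken, hence the suffix 3. The section title and the catch-all regular expression are illustrative choices; it also assumes your log format records the user agent (as the combined log format does).

```conf
# Hypothetical third custom section: raw user agent strings, which
# carry both the browser and the operating system of the visitor.
ExtraSectionName3="Browser and operating system combinations - Top 25"
ExtraSectionCodeFilter3="200 304"
ExtraSectionFirstColumnTitle3="User agent"
# UA selects the user agent log field; (.*) keeps the whole string.
ExtraSectionFirstColumnValues3="UA,(.*)"
ExtraSectionStatTypes3=PHB
ExtraSectionAddAverageRow3=0
ExtraSectionAddSumRow3=0
MaxNbOfExtra3=25
MinHitExtra3=1
```

Because user agent strings vary widely, the raw list can be long; a narrower regular expression that captures only the tokens you care about will give a more readable report.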

AWStats Non-Human Activity Reports

We tend to think of interactive activity when we think of requests to our websites, but behind the scenes there is also a lot of automated, non-human traffic. This breaks down into four basic types: search engine crawlers, off-line downloading tools, exploit attacks, and monitoring scripts.

The term robot, implying automation, refers to any of the four types. Crawler or spider refers to the undirected activity typical of search engine indexing tools: they follow links from one site to another and links within a site trawling for new content and other sites. Exploit attacks usually try to issue commands in an attempt to gain system access.

The good news is that AWStats can recognize most non-human traffic automatically and separate it from the general interactive activity reports.

Search Engine Crawlers

Crawler traffic is highly beneficial--it is the ongoing updating of your content in search engine indexes. The ability to monitor this traffic is essential as part of an overall search engine optimization strategy. Many organizations invest in paid inclusion or keywords without first having exploited the greater benefits of organic merit-based search engine optimization (SEO). Monitor this traffic to ensure Google and other bots are updating their indexes on a regular basis. The relevant AWStats report is Robots/Spiders visitors.

Off-line Download

Off-line downloading tools, such as Wget and HTTrack, will download content within a domain or subdirectory of a domain, as specified by the human user who launches the tool. While your server logs these requests, you do not really know whether a user will ever look at all of the pages, nor how many times the user will consult the pages off-line. From a business point of view, off-line downloading could represent monitoring by your competition. The relevant AWStats report is Browsers.

Attacks

Some site traffic consists of automated attempts to exploit weaknesses in web servers in an attempt to hijack the server. AWStats currently tracks five types of attacks on Microsoft IIS. If you don't use IIS, you can disable the report. The relevant AWStats report is Worm/Virus attacks.
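
As a sketch of how to disable the report, the fragment below uses the two worm-related directives as they appear in the AWStats 6.x sample configuration file. The directive names and value meanings are given from that sample file; confirm them against the version you run.

```conf
# Assumed meaning: 0 turns off worm detection while parsing logs.
LevelForWormsDetection=0
# Assumed meaning: 0 hides the Worm/Virus attacks section in reports.
ShowWormsStats=0
```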

Monitoring Scripts

Many sites employ automated virtual transactions to monitor specific processes in their website. The usual practice is to filter this traffic from your web statistics. To this end, AWStats provides two configuration directives. You can use SkipHosts if all of the traffic (and just that traffic) comes from a specific IP address, or SkipUserAgents if the "robot" performing the transaction identifies itself with a particular name.
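
A minimal sketch of both directives follows. The IP address and the agent name are hypothetical placeholders for your own monitoring service; both directives take a space-separated list of values.

```conf
# Hypothetical monitoring host: ignore every request from this address.
SkipHosts="192.0.2.10"

# Hypothetical agent name: ignore requests whose user agent contains it.
SkipUserAgents="examplemonitor"
```

Use one directive or the other as appropriate. SkipHosts is the safer choice only when the address is dedicated to monitoring, since it also discards any human traffic arriving from the same address.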

A Note on Measuring Non-Human Traffic and Page Tagging

One criticism leveled at web server log file data analysis is that the presence of non-human traffic distorts the statistics. The primary alternative method, page tagging, works by including page tags that should call the counting server only when a normal browser, not a robot, visits the page. In theory, this excludes non-human traffic. Page tag vendors tout this as beneficial. Unfortunately, this approach misses information essential to the management of most sites. In particular, visibility of search engine crawler activity is an essential ingredient of an overall search engine strategy. AWStats offers the best of both worlds--it captures automated traffic and reports on it, but keeps this data separate from interactive human user reports. Web log analysis can also report on objects that you cannot readily tag, such as images and binary document files.

Parting Words

These articles have only touched the surface of what is possible with web analytics and AWStats. The following resources may help you integrate web log analysis with AWStats into your website management.

Final Tips

Additional Resources

Support Options

Measurement Guidelines

The following provide more exhaustive information on web analytics terminology and its usage.

Known Robots

Web Caching

Improper cache management, all too common, can affect both correct content delivery and web statistics.

Alternative Open Source Web Log Analysis Tools

There are two significant open source alternatives to AWStats.

None of the leading open source web analytics tools includes clickstream (path) analysis, a feature usually found in "enterprise-class" commercial solutions. StatViz, available for multiple platforms, may help fill this void. I have written rudimentary StatViz installation and configuration instructions for Linux to facilitate StatViz evaluation.

AWStats is Thanks to ...

AWStats' principal author is Laurent Destailleur, eldy@users.sourceforge.net. To ensure that he maximizes his time dedicated to improving AWStats, you should use the community email addresses rather than writing him directly.

Sean Carlos is president of Antezeta, an internet consultancy focusing on Merit-Based™ search engine optimization, search engine marketing, web analytics, and web site usability.


Copyright © 2009 O'Reilly Media, Inc.