Published on ONLamp.com (http://www.onlamp.com/)


Datamining Apache Logs with PostgreSQL

by Robert Bernier
03/03/2005

The hardest thing I had to learn, when I started working with SQL, was to speak English well. Data processing using a procedural or object-oriented programming language involves breaking down the entire operation into multiple steps. These functions, or methods, massage the data into a form that makes it easier to understand the logic of the process. Results from one function can pass into functions that eventually produce the desired result. In other words, GIGO (garbage in, garbage out).

However, performing data manipulation using SQL in a similar fashion is simply not possible. Composing SQL statements is very much an act of oratory. The process is similar to that of explaining your actions to another person.

This article plumbs the depths of your mastery of the spoken word by showing how you can compose the necessary instructions to datamine an Apache system log file that you've uploaded into a PostgreSQL database. I'll use only those statements defined in the SQL-92 and SQL-99 standards, and no user-defined functions.

For the purposes of this article, I assume that you have a working PostgreSQL database engine. Refer either to your distribution's documentation or the PostgreSQL home page to learn how to set up your system.

Working with psql, the PostgreSQL Console Client

There are two kinds of commands demonstrated: SQL and psql meta commands. SQL commands always end with a semicolon. psql meta commands always begin with a backslash. SQL commands are case insensitive, while psql meta commands are case sensitive.
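
For example, here is one of each kind as you would type it at the psql prompt (a minimal illustration; any SQL statement or meta command would do):

-- An SQL command ends with a semicolon and is case insensitive:
SELECT current_date;

-- A psql meta command begins with a backslash and is case sensitive;
-- \dt lists the tables in the current database:
\dt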

The Apache Log File

Properly mined, system log files provide an important source of information to the sysadmin, the webmaster, and the nontechnical members of the team such as those in sales and senior management. The challenge is in translating the data into a form that reveals significant patterns.

Here's an example of a record that syslog captures from Apache (this is one line):

66.194.6.81 - - [10/Nov/2004:15:50:06 -0500]  \
	"GET /en/ HTTP/1.1" 200 13463 "-" \
	"Mozilla/5.0 (compatible; Konqueror/3.0-rc2; i686 Linux; 20020613)"

Syslog data can be quite detailed, so finding what you want is a question of setting up the configuration instructions correctly. Refer to the syslog and Apache documentation for more information.

Datamining the Log File, the Theory

The first step is uploading the Apache log file into a temporary session table on a PostgreSQL database. This table will disappear as soon as the psql client has shut down its connection to the database. Each line of data in the log file is one record in the table.

The second step is processing the table by parsing critical data into another table with its own columns. In this case, the table consists of three columns: the IP address, the date stamp, and a copy of the individual record from the first table.

Log in to the template1 database using the console client psql as the superuser, postgres, and create a database called apachelog:

$ psql -U postgres template1

CREATE DATABASE apachelog;

\c apachelog postgres

Note: There is always a database named template1 in a PostgreSQL cluster.

Create a temporary session table called tmp_apache, consisting of one column named record with the text datatype:

CREATE TEMPORARY TABLE tmp_apache(record TEXT);

Note: Remember, this table will disappear as soon as you log out of your session because it's a temporary table.

Copy into the table tmp_apache the data from the log file directly:

\copy tmp_apache from 'my_apachelog.txt'

You can review the first record of the log with the SQL command:

SELECT record FROM tmp_apache LIMIT 1;

Parsing the Log File

Create a table called http_log. Give it three columns: log_date, ip_addr, and record:

CREATE TABLE http_log(log_date DATE,ip_addr CIDR,record TEXT);

Suppose you want to explain the above statement to another person. You can restate it as:

CREATE a TABLE called http_log with the columns log_date of data type DATE, ip_addr of data type CIDR and record of data type TEXT.

The next step is to extract the IP address and timestamp for each record:

INSERT INTO http_log(log_date,ip_addr,record)
     SELECT CAST(substr(record,strpos(record,'[')+1,20) AS date),
            CAST(substr(record,0,strpos(record,' ')) AS cidr),
            record
FROM tmp_apache;

There are two parts to this statement. The first part is the inner SELECT query, and the second portion is the outer INSERT statement.

The SELECT statement returns three columns of information that originates from the single column in tmp_apache. The query identifies both the IP address and the date stamp, which it then parses. Because the functions used can parse only text strings, the query must cast the results appropriately to insert them into a column of another data type.

There are two PostgreSQL functions used in this SELECT query: substr() and strpos().

The function strpos() returns the integer value (the location) of a particular character in a text string. Here, it identifies the date stamp by searching the record for the first instance of a square bracket ([). Refer to the HTTP log; this occurs at the beginning of the date.

The second function, substr(), returns a substring from a longer text string. It takes three parameters: the text string to search, the start position of the substring, and the length of the substring. The date stamp is always 20 characters long in the example log file.

The same parsing process repeats for the IP address. The IP address is the first piece of data at the beginning of the record. The functions therefore identify the location of the leftmost white space in the record.
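
If you want to see these two functions at work before running the INSERT, you can call them against the sample record as a literal string (a stand-alone check; the literal below is the example record from earlier with the user-agent fields trimmed, and the positions quoted apply only to that sample):

SELECT strpos('66.194.6.81 - - [10/Nov/2004:15:50:06 -0500] "GET /en/ HTTP/1.1" 200 13463', '[');
-- returns 17, the position of the opening square bracket

SELECT substr('66.194.6.81 - - [10/Nov/2004:15:50:06 -0500] "GET /en/ HTTP/1.1" 200 13463', 17 + 1, 20);
-- returns 10/Nov/2004:15:50:06, the 20-character date stamp

SELECT substr('66.194.6.81 - - [10/Nov/2004:15:50:06 -0500] "GET /en/ HTTP/1.1" 200 13463', 0, 12);
-- returns 66.194.6.81, because the first white space sits at position 12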

The INSERT INTO statement takes the results from the SELECT query and adds them directly into the three columns of the http_log table.

The following psql commands output the results into a text file named results.txt, then launch vim on it by using the bang (!) command:

\o results.txt
SELECT * FROM http_log;
\o
\!vim results.txt

Answering Questions

It's now time to answer questions. There's a saying: "The perfect question supplies its own answer." So too is it true to say that a well-composed query statement explains the kind of information it should return.

Remember, you can output all query results to a text file with the psql \o meta command. The meta command \e enters editor mode, permitting you to compose your SQL statements in a formatted manner. The default editor is vi, but you can change this by setting one of the editor environment variables: PSQL_EDITOR, EDITOR, or VISUAL. For example, this command changes the default editor to Midnight Commander (assuming you're using the bash shell):

$ export EDITOR="mc -e"

List the IP addresses that have visited the site

SELECT ip_addr FROM http_log;

SELECT all records in the column ip_addr FROM the table http_log.

List the distinct IP addresses (removing duplicates) that have visited the site

SELECT DISTINCT ON(ip_addr) ip_addr FROM http_log;

Note: DISTINCT ON is a nonstandard clause found only in PostgreSQL.

SELECT unique and DISTINCT records ON the column ip_addr and return the records from the column ip_addr FROM the table http_log;

Count the total number of hits recorded in the http_log table

SELECT count(*) FROM http_log;

SELECT and count all records FROM the table http_log.

Count all IP addresses (including repetitive hits from the same IP address) that have visited the site for each day

SELECT log_date,count(ip_addr) AS "Visits"
FROM http_log
GROUP BY log_date
ORDER BY log_date DESC;

SELECT to return data from the column log_date and count the number of ip_addr for each date. Label the IP address counted column's title AS Visits FROM the table http_log. GROUP the results by the log_date and sort them in ORDER BY the log_date in DESCending order.

GROUP BY is a term, also called a clause, used when one or more of the returned columns is the product of a function that processes a range, or group, of values. For example, this last query has a column that counts the number of IP addresses, count(ip_addr), for each logged date, log_date. count() is an aggregate function; an aggregate function can be any kind of function that operates on a group of values, such as obtaining the average, maximum, or minimum value of a given group.
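
count() is only one such aggregate. As a sketch of the same grouping with a few other aggregate functions thrown in (it reuses the http_log table built above):

SELECT log_date,
       count(ip_addr)      AS "Visits",
       max(length(record)) AS "Longest record",
       min(length(record)) AS "Shortest record"
FROM http_log
GROUP BY log_date
ORDER BY log_date DESC;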

Count the unique IP addresses (distinct hits) that have visited the site for each day

SELECT log_date,count(ip_addr) AS "Visits"
FROM (
            SELECT DISTINCT ON(log_date,ip_addr)
             log_date,ip_addr FROM http_log
           ) AS http_counter_tmp      
GROUP BY log_date
ORDER BY log_date DESC;

SELECT all data from the columns log_date and count the number of ip_addr for each date. Label the column for the counted numbers AS Visits. This data comes FROM another query that will SELECT DISTINCT values ON the columns log_date and ip_addr, respectively, for the data returned from the columns log_date and ip_addr FROM the table http_log. Save this subselect AS a temporary table called http_counter_tmp. GROUP the results BY log_date and ORDER them BY log_date in a DESCending manner.

Did you notice how the last query used two SELECTs? A nested SELECT statement is called a subselect, and it is part of the SQL standard. One of the advantages of a subselect lies in the DBA's ability to group and express complex, multiple operations as a single statement. Among other things, this makes it easier to understand the logic behind an operation as you read the instructions with minimal extraneous documentation.
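
As an aside, you can phrase the same question without a subselect by counting distinct values directly. This sketch uses the standard count(DISTINCT ...) form instead of PostgreSQL's DISTINCT ON:

SELECT log_date, count(DISTINCT ip_addr) AS "Visits"
FROM http_log
GROUP BY log_date
ORDER BY log_date DESC;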

Count the daily number of downloads of mydoc.pdf

SELECT log_date, count(ip_addr) AS "mydoc.pdf downloads"
FROM http_log 
WHERE record ~ 'mydoc.pdf'
GROUP BY log_date
ORDER BY log_date DESC;

SELECT all data from the columns log_date and count the number of ip_addr for each date. Label the counted column's title AS mydoc.pdf downloads. The columns come FROM the table http_log. Return data from all rows WHERE the record column contains the text string mydoc.pdf. GROUP the results BY log_date and ORDER them BY the log_date too in DESCending order from the newest to the oldest dates.

Find any exploit attempts that use very long strings

SELECT
     length(record) AS "Record Length",
     ip_addr,
     log_date 
FROM http_log 
ORDER BY length(record) DESC, log_date DESC;

Find exploit attempts by SELECTing the character length of the record, labeled AS the Record Length. Include also the ip_addr and the log_date FROM the table http_log. ORDER the results BY the length of the record string and the log_date both in a DESCending manner.
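
If the full listing is too long to scan comfortably, a variant that keeps only the ten longest records narrows the list of suspects (a sketch; LIMIT appears earlier in this article as well):

SELECT length(record) AS "Record Length", ip_addr, log_date
FROM http_log
ORDER BY length(record) DESC, log_date DESC
LIMIT 10;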

Note: You should be able to scroll through the results using the psql console client if your binaries have readline library support. Otherwise, use the meta command \o to output the results into a text file that you can review with the appropriate editor.

Generating the Report

Reports should be brief and yet furnish complex information in a clear format. They are, however, not necessarily as easy to make as they are to read.

This example report combines the results of three other tables, which a separate generate-tables script creates.

The report has four columns:

  1. The log date.
  2. The daily number of visits by a unique IP address.
  3. The number of downloads of a PDF document, "SRAA whitepaper," on the web site.
  4. The number of downloads for a second PDF document, "Informix whitepaper."

What makes this example interesting is that it assumes that only one of the three tables, T1, has inputs for every day. Otherwise the queries could have been simpler. The other two, T2 and T3, will have infrequent inputs: some days there will be downloads, and other days there won't. The challenge is to put all the data for all three tables on the same row for the same day. The DISTINCT clause used in tables T2 and T3 prevents counting the same person more than once if he downloads a PDF document multiple times on the same day.
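
The generate-tables script itself doesn't appear in this article, but here is a rough sketch of what it might do. The column names match those used by the report query below, while the whitepaper file names are hypothetical, and count(DISTINCT ...) stands in for whatever form of DISTINCT the real script uses to avoid counting the same visitor twice:

SELECT log_date, count(DISTINCT ip_addr) AS visits
INTO TEMP TABLE t1
FROM http_log
GROUP BY log_date;

SELECT log_date, count(DISTINCT ip_addr) AS "SRAA Whitepaper"
INTO TEMP TABLE t2
FROM http_log
WHERE record ~ 'sraa_whitepaper.pdf'
GROUP BY log_date;

SELECT log_date, count(DISTINCT ip_addr) AS "Informix Whitepaper"
INTO TEMP TABLE t3
FROM http_log
WHERE record ~ 'informix_whitepaper.pdf'
GROUP BY log_date;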

The report uses two queries, lest the SQL statement grow too long to understand easily. The first statement saves the results in the table webhits_tmp. The second query statement generates the actual report.

SELECT DISTINCT ON(t1.log_date,visits,"SRAA pdf","Informix pdf") 
   t1.log_date, 
   t1.visits,
   CASE WHEN t1.log_date=t2.log_date THEN t2."SRAA Whitepaper" END AS "SRAA pdf",
   CASE WHEN t1.log_date=t3.log_date THEN t3."Informix Whitepaper" 
        END AS "Informix pdf"
INTO TEMP TABLE webhits_tmp
FROM t1,t2,t3;

Here's one way to express the above query:

SELECT only records that have unique and DISTINCT values ON the columns t1.log_date, visits, SRAA pdf, and Informix pdf, respectively. Return data from the columns t1.log_date, t1.visits, "SRAA pdf", and "Informix pdf". In the CASE WHEN t1.log_date is equal to t2.log_date, THEN return t2."SRAA Whitepaper" and END the expression AS "SRAA pdf". In the CASE WHEN t1.log_date is equal to t3.log_date, THEN return t3."Informix Whitepaper" and END the expression AS "Informix pdf". Place the results of the statement INTO the TEMPorary TABLE webhits_tmp. Use the columns FROM the tables t1, t2, and t3, respectively.

Here's the second statement:

SELECT DISTINCT ON(log_date,visits) * 
FROM webhits_tmp 
ORDER BY log_date DESC;

The explanation is:

SELECT only records that have unique and DISTINCT values ON the columns log_date and visits. Return all the columns FROM the table webhits_tmp and ORDER the output BY log_date in a DESCending manner.

Here's what the report might look like:

  log_date  | visits | SRAA pdf | Informix pdf
------------+--------+----------+--------------
 2004-12-17 |    320 |       10 |
 2004-12-16 |    540 |       30 |           10
 2004-12-15 |    510 |          |
            |    500 |        2 |
            |    600 |       15 |

Caveats

Keep in mind that the solutions presented in this article are not necessarily the best way to achieve your goals. Excellent SQL statements not only return accurate data, but they also do it in the fastest possible manner.

You can often phrase a SQL statement in any number of ways, and it will still return the same result. However, some statements will run faster than others, because the database engine performs the logic of solving your query differently. Composing a fast query requires an intimate knowledge of your database and the type of hardware you run.
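
One way to compare two phrasings of the same question is to ask PostgreSQL how it actually executes each of them. For example, prefixing one of the earlier queries with EXPLAIN ANALYZE returns the plan and timing the engine used (a minimal sketch):

EXPLAIN ANALYZE
SELECT log_date, count(DISTINCT ip_addr) AS "Visits"
FROM http_log
GROUP BY log_date;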

Conclusion

Composing good SQL statements involves not only a logical approach to the problem but also the capacity to express the solution in a concise manner that actually makes sense when you open your mouth.

Robert Bernier is the PostgreSQL business intelligence analyst for SRA America, a subsidiary of Software Research America (SRA).



Copyright © 2009 O'Reilly Media, Inc.