I once did a contract administrating the network of a small startup company. The owner's total existence revolved around the email generated from his various business relationships and he never made a decision unless the issues were first detailed in an email. My boss once confided in me of his hatred of his email client because it limited him in the way he could organize his information.
I was so impressed by his desperation that I found myself constantly thinking about his dilemma. In time the answer came to me. He needed a data mine.
What is data mining anyway? Data mining is a class of database applications that look for hidden patterns in a group of data.
In writing this article it quickly became apparent that you might need several pieces of unrelated technical know-how to create a data mine so I ended up asking myself "who am I writing this article for?". Do I write for the Perl programmer or the DBA? Programming and managing data don't use the same set of skills but they are nevertheless both required. So where does this article go?
This article is for those who need a guide to generating information from existing data and are looking for ideas on how to do it. Thomas Edison once said that genius was 1 percent inspiration and 99 percent perspiration. Here's your 1 percent.
The objective is to data mine the email that resides in a Mozilla mail box.
We are especially interested in MS Word .doc attachments. We'll
parse the email and upload it into a relational database where we can then
analyze it.
The Mozilla mailboxes follow the mbox format.
Perl has several excellent email parsing modules available on the CPAN. We'll also use the PostgreSQL database and its console-based client psql, as well as the command-line utilities find, file, and antiword.
Perl-wise, you'll need intermediate programming experience. You'll need to know how to execute command-line utilties within a Perl script. You should be able to install and use Perl modules from the CPAN, and you'll need to be familiar with using PostgreSQL from Perl.
Regarding PostgreSQL, you should know how to install PostgreSQL on Linux either by using precompiled binaries such as RPMs, Debian packages, and tar balls or by compiling it from source code. If someone else has installed this already, great! You should be familiar with the SQL 92 and SQL 99 standards in that you can create tables, views, constraints, and user-defined functions.
On the command line, you need to know how to pipe standard input and output from one process to the next.
I assume you have Perl 5.8.x installed with access to the CPAN configured
correctly. You also need a standard installation of Postgres with minimal
security options configured. The example Perl script will connect to the
database via a socket connection, so uncomment the line tcpip_socket =
true in postgreql.conf.
Furthermore, the script assumes that the connection requires a password:
# TYPE DATABASE USER IP-ADDRESS IP-MASK METHOD
local all all trust
host all all 127.0.0.1 255.255.255.255 password
I'll assume that we'll use the PostgreSQL superuser account,
postgres. As well, the mbox file we'll process is
Sent and is in the same directory as the parser script.
The Mozilla mail client stores its mailboxes under the user's home
directory, usually in ~/.mozilla. Figure 1 shows the different
mailboxes of an existing profile. The mboxes are those files without any
extensions. Another hint is to look at the file size; mailboxes are normally
the largest files present.

Figure 1. Screenshot of a Mozilla mailbox.
The following command lists all the mailboxes in my Mozilla profile account:
$ find ~/.Mozilla/default/rfv2udn7.slt/Mail/ -type f -not \
-iname "*.*" -not -iname ".*" -print
Before we can create the right database schema, we need to decide what kind of information to extract from these messages. There are many options, but the following assumptions will suffice:
With that in mind, let's set up the database. In a console, create a
database called email:
$ psql -U postgres -command "create database email" template1
If this is your first time using PostgreSQL, alter the username
postgres and give it the password of 123. By default, this
username has no assigned password.
$ psql -U postgres --command "alter user postgres with password '123' " email
The most important column in the tables is the messageid, which is a unique alphanumeric code that is part of the email message. The tables have been defined such that no data population can occur unless the messageid already exists in the table mailid.
The message body and attachments are to be stored as large objects. Many of the tables have a column of type oid, which refers to the actual data that is located in the system catalog pg_largeobject.
Indexes are created by default in Postgres for each primary key that is defined in a table.
The create tables script will clean the
database email and create the tables. Invoke it by executing it from the
console:
$ psql -U postgres -f createTables.sql email
|
The mbox parser script first defines some global variables and then establishes a database connection with the PostgreSQL server:
my $dbh = DBI->connect($dsn,$user,$pass);
die $DBI::errstr unless defined $dbh;
$dbh->{PrintError} = 0;
The script then reads the mailbox and breaks down the component parts of each individual message:
my $mb = Mail::MboxParser->new(\@myarray,
decode => 'ALL',
parseropts => $parseropts);
A while loop processes messages from the mailbox and identifies the interesting parts of the message:
my $id = $msg->id;
my $from = $msg->from->{email};
my $subject = $msg->header->{subject};
The MboxParser module identifies the main body of the
message:
my $body = $msg->body($msg->find_body);
The main body and attachments of a message are referenced in the
main and attachment tables with an oid. Prior to
version 7.1, PostgreSQL cannot insert data into a row that exceeds the default
data page size of 8192 bytes. To get around this, we'll instead insert this
information as large objects. Upload a large object by using
psql:
my $myoid=`psql -U postgres --command "\\lo_import \'$body_file\' \'$mymailbox\' " $db`;
my @oidarray = split(/ /,$myoid);
This method requires less coding and is simpler to understand than is using
libpq's more efficient large object call.
We are now able to insert the email message into the main
table:
$sql = "INSERT INTO main VALUES( ?, ?, ?, @oidarray[1])";
$sth = $dbh->prepare($sql) or die $DBI::errstr;
$sth->execute( $id, $from, $subject ) or die $DBI::errstr;
This snippet of code generates the list of email recipients:
#POPULATE TABLE 'mailto'
for my $msg_TO ($msg->to) {
$sql = "INSERT INTO mailto VALUES( ?, ? )";
$sth = $dbh->prepare($sql) or die $DBI::errstr;
$sth->execute( $id, $msg_TO->{email} ) or die $DBI::errstr;
}
The most interesting part of the program identifies MS Word attachments:
# DETERMINE ATTACHMENT MIME-TYPES
my $decodedattachment = `echo "$attachment" | ./decode.pl |
tee $attachment_file | file -i -`;
my @test0 = split(/ /, $decodedattachment);
my $mime_type = $test0[1];
chomp $mime_type;
# VALIDATING MSWORD DOCS
if ($mime_type eq "application/MS Word") {
# CONVERT INTO READABLE TEXT
`antiword $attachment_file > $MS Word_file;
mv -f $MS Word_file $attachment_file`;
} else {
# NOT MSWORD DOCS,
# USE THE ENCODED MIME VERSION IN THE FILE
`echo "$attachment" > $attachment_file`;
}
decode.pl decodes the MIME Base64 message back into its
original binary encoding. There is also a PostgreSQL function for this. The
program then pipes the encoded output into the file utility to
determine if it really is a MS Word document. At the same time, it saves the
decoded attachment as the temporary file /tmp/attachment.
Any attachment that is identified as a MS Word document with a MIME type of
application/MS Word runs through the antiword utility
to turn it into an ASCII text document. The program then uploads this as a
large object.
All other attachments are saved in the database as large objects in their original MIME encodings.
Now that the information is in the database, the next trick is to retrieve it in a way that makes sense. I've created the following queries, based on my own database that I used to develop this article.
First, let's summarize all the email messages that have both a main body and an attachment. (Refer to the table creation create tables script for table details).
SELECT m.from_email AS From,
m.subject AS Subject,
m.mailbody AS Message,
a.attachment AS Attachment
FROM main m,
attachment a
WHERE m.messageid=a.message
This will produce output similar to:
| From | Subject | Message | Attachment |
|---|---|---|---|
robert.bernier5@sympatico.ca |
message 1 |
60227 |
60230 |
robert.bernier5@sympatico.ca |
message 2 |
60233 |
60236 |
casestudy@postgresql.org |
message 3 |
60239 |
60242 |
casestudy@postgresql.org |
message 4 |
60245 |
60248 |
The Message and Attachments are oids, or reference numbers, to the large
objects now stored in the pg_largeobject catalog.
Now, let's list all messages with attachments and identify those attachments that were MS Word documents:
SELECT m.messageid AS "Message ID",
a.attachment AS "Attachment oid",
CASE
WHEN a.mime_type='application/MS Word'
THEN 'true'
ELSE 'false'
END
AS "MS Word attachment"
FROM main m, attachment a
WHERE m.messageid=a.messageid;
This will produce output similar to:
| Message ID | Attachment oid | MS Word attachment |
|---|---|---|
3FF79F08.6060208@sympatico.ca |
60230 |
true |
3FF79F29.2090101@sympatico.ca |
60236 |
false |
3FF7A511.2030104@postgresql.org |
60242 |
false |
3FF7A52A.8040203@postgresql.org |
60248 |
true |
You can return the ASCII text conversion of any MS Word attachment from a query similar to:
SELECT encode(lo.data,'escape') AS "My Document"
FROM pg_largeobject lo,
attachme1t a
WHERE a.messageid = '3FF7A52A.8040203@postgresql.org'
AND a.attachment = lo.loid;
The encode function casts My Document from a
bytea to a text type by removing all escape sequences
that the lo_import function originally inserted.
To fetch results with more detail, you'll have to search specific columns for patterns. There are two ways to search text, with the SQL LIKE expression or with PostgreSQL's POSIX-style regular expressions.
LIKEThis query will search for the phrase "These functions read and write" in a particular Word document.
SELECT m.from_email AS "From",
a.description AS "File name",
CASE
WHEN encode(lo.data,'escape') LIKE '%These functions read and write%'
THEN 'true'
ELSE 'false'
END AS "Search results"
FROM pg_largeobject lo,
attachment a,
main m
WHERE a.messageid = '3FF7A52A.8040203@postgresql.org'
AND a.attachment = lo.loid
AND a.messageid = m.messageid;
Here are the results from my sample tables:
| From | File name | Search results |
|---|---|---|
casestudy@postgresql.org |
doc2.doc |
true |
This query will count the number of emails with attachments and whose from
address begins with the letters ca and end in
org.
SELECT count(m.from_email)
FROM main m,
attachment a
WHERE m.messageid = a.messageid
AND m.from_email ~ '^ca'
AND m.from_email ~ 'org$';
In the interests of keeping the Perl script as simple as possible, I've assumed that all email attachments arrive encoded in Base64. As well, since I wrote the script to follow the flow of this article, the code is a bit clumsy.
The database definitions here only demonstrate where you can make a start. Making a database that flies like a jet is an article in itself.
Save yourself time and grief by saving SQL queries and definitions in a file
and then invoking it through the psql client. If you're paranoid, you can put these files under version control
Invoke the create tables script to clean out the database as you rerun the Perl scripts.
Antiword is a bit quirky in that it will not parse files that
it considers too small.
Viewing the Mozilla mbox through its client can be deceiving. It never really removes email from the box even if you've deleted it, so you may well see email in your database that you thought you destroyed months before.
You can export large objects from the database by using the
lo_export function in the psql client. For example
\lo_export 123 temp.txt will save the large object with the oid
number of 132 to a file named temp.txt.
There are two parts to a successful data mine; the mechanism that prepares the document and the algorithm that looks for patterns. We've looked at one way getting email into a database but frankly, I haven't given that much coverage on the power of regular expressions in a database environment, although some of you Perl fiends have a pretty good idea of the possibilities.
Robert Bernier is the PostgreSQL business intelligence analyst for SRA America, a subsidiary of Software Research America (SRA).
Return to ONLamp.com.
Copyright © 2007 O'Reilly Media, Inc.