ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)
 See this if you're having trouble printing code examples


Data Mining Email

by Robert Bernier
04/08/2004

I once did a contract administrating the network of a small startup company. The owner's total existence revolved around the email generated from his various business relationships and he never made a decision unless the issues were first detailed in an email. My boss once confided in me of his hatred of his email client because it limited him in the way he could organize his information.

I was so impressed by his desperation that I found myself constantly thinking about his dilemma. In time the answer came to me. He needed a data mine.

What is data mining anyway? Data mining is a class of database applications that look for hidden patterns in a group of data.

In writing this article it quickly became apparent that you might need several pieces of unrelated technical know-how to create a data mine so I ended up asking myself "who am I writing this article for?". Do I write for the Perl programmer or the DBA? Programming and managing data don't use the same set of skills but they are nevertheless both required. So where does this article go?

This article is for those who need a guide to generating information from existing data and are looking for ideas on how to do it. Thomas Edison once said that genius was 1 percent inspiration and 99 percent perspiration. Here's your 1 percent.

The Scenario

The objective is to data mine the email that resides in a Mozilla mail box. We are especially interested in MS Word .doc attachments. We'll parse the email and upload it into a relational database where we can then analyze it.

The Tools

The Mozilla mailboxes follow the mbox format. Perl has several excellent email parsing modules available on the CPAN. We'll also use the PostgreSQL database and its console-based client psql, as well as the command-line utilities find, file, and antiword.

The Developer's Skillset

Perl-wise, you'll need intermediate programming experience. You'll need to know how to execute command-line utilties within a Perl script. You should be able to install and use Perl modules from the CPAN, and you'll need to be familiar with using PostgreSQL from Perl.

Regarding PostgreSQL, you should know how to install PostgreSQL on Linux either by using precompiled binaries such as RPMs, Debian packages, and tar balls or by compiling it from source code. If someone else has installed this already, great! You should be familiar with the SQL 92 and SQL 99 standards in that you can create tables, views, constraints, and user-defined functions.

On the command line, you need to know how to pipe standard input and output from one process to the next.

The Assumptions

I assume you have Perl 5.8.x installed with access to the CPAN configured correctly. You also need a standard installation of Postgres with minimal security options configured. The example Perl script will connect to the database via a socket connection, so uncomment the line tcpip_socket = true in postgreql.conf.

Furthermore, the script assumes that the connection requires a password:

# TYPE    DATABASE   USER    IP-ADDRESS   IP-MASK        METHOD

local     all        all                                  trust
host      all        all     127.0.0.1    255.255.255.255 password

I'll assume that we'll use the PostgreSQL superuser account, postgres. As well, the mbox file we'll process is Sent and is in the same directory as the parser script.

Part 1. Searching for Mailboxes

The Mozilla mail client stores its mailboxes under the user's home directory, usually in ~/.mozilla. Figure 1 shows the different mailboxes of an existing profile. The mboxes are those files without any extensions. Another hint is to look at the file size; mailboxes are normally the largest files present.

screen shot of Mozilla mail box
Figure 1. Screenshot of a Mozilla mailbox.

The following command lists all the mailboxes in my Mozilla profile account:

$ find ~/.Mozilla/default/rfv2udn7.slt/Mail/ -type f -not \
	-iname "*.*" -not -iname ".*" -print

Part 2. Preparing the Database

Before we can create the right database schema, we need to decide what kind of information to extract from these messages. There are many options, but the following assumptions will suffice:

The Database

With that in mind, let's set up the database. In a console, create a database called email:

$ psql -U postgres -command "create database email" template1

If this is your first time using PostgreSQL, alter the username postgres and give it the password of 123. By default, this username has no assigned password.

 $ psql -U postgres --command "alter user postgres with password '123' " email

The Tables

The most important column in the tables is the messageid, which is a unique alphanumeric code that is part of the email message. The tables have been defined such that no data population can occur unless the messageid already exists in the table mailid.

The message body and attachments are to be stored as large objects. Many of the tables have a column of type oid, which refers to the actual data that is located in the system catalog pg_largeobject.

Indexes are created by default in Postgres for each primary key that is defined in a table.

The create tables script will clean the database email and create the tables. Invoke it by executing it from the console:

$ psql -U postgres -f createTables.sql email

Part 3. Developing the Email Parser

Install the following modules from the CPAN, along with their dependencies:

The mbox parser script first defines some global variables and then establishes a database connection with the PostgreSQL server:

my $dbh = DBI->connect($dsn,$user,$pass);
die $DBI::errstr unless defined $dbh;
$dbh->{PrintError} = 0;

The script then reads the mailbox and breaks down the component parts of each individual message:

my $mb = Mail::MboxParser->new(\@myarray,
	decode     => 'ALL',
	parseropts => $parseropts);

A while loop processes messages from the mailbox and identifies the interesting parts of the message:

my $id      = $msg->id;
my $from    = $msg->from->{email};
my $subject = $msg->header->{subject};

The MboxParser module identifies the main body of the message:

my $body = $msg->body($msg->find_body);

The main body and attachments of a message are referenced in the main and attachment tables with an oid. Prior to version 7.1, PostgreSQL cannot insert data into a row that exceeds the default data page size of 8192 bytes. To get around this, we'll instead insert this information as large objects. Upload a large object by using psql:

my $myoid=`psql -U postgres --command "\\lo_import \'$body_file\' \'$mymailbox\' " $db`;
my @oidarray = split(/ /,$myoid);

This method requires less coding and is simpler to understand than is using libpq's more efficient large object call.

We are now able to insert the email message into the main table:

$sql = "INSERT INTO main VALUES( ?, ?, ?, @oidarray[1])";
$sth = $dbh->prepare($sql) or die $DBI::errstr;
$sth->execute( $id, $from, $subject ) or die $DBI::errstr;

This snippet of code generates the list of email recipients:

#POPULATE TABLE 'mailto'

for my $msg_TO ($msg->to) {
	$sql = "INSERT INTO mailto VALUES( ?, ? )";
	$sth = $dbh->prepare($sql) or die $DBI::errstr;
	$sth->execute( $id, $msg_TO->{email} ) or die $DBI::errstr;
}

The most interesting part of the program identifies MS Word attachments:

# DETERMINE ATTACHMENT MIME-TYPES

my $decodedattachment = `echo "$attachment" | ./decode.pl |
	tee $attachment_file | file -i -`;
my @test0     = split(/ /, $decodedattachment);
my $mime_type = $test0[1];
chomp $mime_type;

# VALIDATING MSWORD DOCS

if ($mime_type eq "application/MS Word") {

	# CONVERT INTO READABLE TEXT
	`antiword $attachment_file > $MS Word_file;
	mv -f $MS Word_file $attachment_file`;

} else {

	# NOT MSWORD DOCS,
	# USE THE ENCODED MIME VERSION IN THE FILE
	`echo "$attachment" > $attachment_file`;
}

decode.pl decodes the MIME Base64 message back into its original binary encoding. There is also a PostgreSQL function for this. The program then pipes the encoded output into the file utility to determine if it really is a MS Word document. At the same time, it saves the decoded attachment as the temporary file /tmp/attachment.

Any attachment that is identified as a MS Word document with a MIME type of application/MS Word runs through the antiword utility to turn it into an ASCII text document. The program then uploads this as a large object.

All other attachments are saved in the database as large objects in their original MIME encodings.

Part 4. Querying the Database

Now that the information is in the database, the next trick is to retrieve it in a way that makes sense. I've created the following queries, based on my own database that I used to develop this article.

First, let's summarize all the email messages that have both a main body and an attachment. (Refer to the table creation create tables script for table details).

SELECT m.from_email AS From,
       m.subject    AS Subject,
       m.mailbody   AS Message,
       a.attachment AS Attachment
FROM   main       m,
       attachment a
WHERE  m.messageid=a.message

This will produce output similar to:

From Subject Message Attachment
robert.bernier5@sympatico.ca message 1 60227 60230
robert.bernier5@sympatico.ca message 2 60233 60236
casestudy@postgresql.org message 3 60239 60242
casestudy@postgresql.org message 4 60245 60248

The Message and Attachments are oids, or reference numbers, to the large objects now stored in the pg_largeobject catalog.

Now, let's list all messages with attachments and identify those attachments that were MS Word documents:

SELECT m.messageid  AS "Message ID",
       a.attachment AS "Attachment oid",
	CASE
		WHEN a.mime_type='application/MS Word'
		THEN 'true'
 		ELSE 'false'
	END
                    AS "MS Word attachment"
FROM   main m, attachment a
WHERE  m.messageid=a.messageid;

This will produce output similar to:

Message ID Attachment oid MS Word attachment
3FF79F08.6060208@sympatico.ca 60230 true
3FF79F29.2090101@sympatico.ca 60236 false
3FF7A511.2030104@postgresql.org 60242 false
3FF7A52A.8040203@postgresql.org 60248 true

You can return the ASCII text conversion of any MS Word attachment from a query similar to:

SELECT encode(lo.data,'escape') AS "My Document"
FROM   pg_largeobject lo,
       attachme1t     a
WHERE  a.messageid  = '3FF7A52A.8040203@postgresql.org'
AND    a.attachment = lo.loid;

The encode function casts My Document from a bytea to a text type by removing all escape sequences that the lo_import function originally inserted.

Searching for Patterns

To fetch results with more detail, you'll have to search specific columns for patterns. There are two ways to search text, with the SQL LIKE expression or with PostgreSQL's POSIX-style regular expressions.

Using LIKE

This query will search for the phrase "These functions read and write" in a particular Word document.


SELECT m.from_email  AS "From",
       a.description AS "File name",
	CASE
		WHEN encode(lo.data,'escape') LIKE '%These functions read and write%'
		THEN 'true'
  		ELSE 'false'
	END AS "Search results"
FROM pg_largeobject lo,
     attachment      a,
     main            m
WHERE a.messageid  = '3FF7A52A.8040203@postgresql.org'
AND   a.attachment = lo.loid
AND   a.messageid  = m.messageid;

Here are the results from my sample tables:

From File name Search results
casestudy@postgresql.org doc2.doc true

Regular Expressions

This query will count the number of emails with attachments and whose from address begins with the letters ca and end in org.

SELECT count(m.from_email)
FROM   main       m,
       attachment a
WHERE  m.messageid  = a.messageid
AND    m.from_email ~ '^ca'
AND    m.from_email ~ 'org$';

Caveats and Suggestions

In the interests of keeping the Perl script as simple as possible, I've assumed that all email attachments arrive encoded in Base64. As well, since I wrote the script to follow the flow of this article, the code is a bit clumsy.

The database definitions here only demonstrate where you can make a start. Making a database that flies like a jet is an article in itself.

Save yourself time and grief by saving SQL queries and definitions in a file and then invoking it through the psql client. If you're paranoid, you can put these files under version control

Invoke the create tables script to clean out the database as you rerun the Perl scripts.

Antiword is a bit quirky in that it will not parse files that it considers too small.

Viewing the Mozilla mbox through its client can be deceiving. It never really removes email from the box even if you've deleted it, so you may well see email in your database that you thought you destroyed months before.

You can export large objects from the database by using the lo_export function in the psql client. For example \lo_export 123 temp.txt will save the large object with the oid number of 132 to a file named temp.txt.

Conclusion

There are two parts to a successful data mine; the mechanism that prepares the document and the algorithm that looks for patterns. We've looked at one way getting email into a database but frankly, I haven't given that much coverage on the power of regular expressions in a database environment, although some of you Perl fiends have a pretty good idea of the possibilities.

References

Robert Bernier is the PostgreSQL business intelligence analyst for SRA America, a subsidiary of Software Research America (SRA).


Return to ONLamp.com.

Copyright © 2009 O'Reilly Media, Inc.