In my last article on data mining email, I described how you could upload a Mozilla mailbox into a PostgreSQL database. With the email uploaded, I showed how to search through the email looking for text strings using both the standard SQL and PostgreSQL's POSIX regular expressions. As a grand finale, you then saw how you could extract Microsoft Word document attachments and perform text searches on those too.
In the true spirit of hacking, this article will consider the next step of data mining: performing in-depth searches on email by tweaking an existing mail system. The objective is to develop a solution requiring no more expertise than that of an intermediate-level DBA.
To avoid a learning curve for our user base, I intend to use an existing email client rather than create one. The next requirement is to select a protocol. I chose IMAP because it centralizes the administrative control of all user accounts onto one server. The third requirement is to choose a mechanism that gives the email client user the ability to carry out complex queries: The solution is to implement a relational database management system as the back end to the IMAP server. This DBMS must have the ability to create, store, and execute procedures. The last requirement is that the solution must be capable of backups and can upload the email, in the form of a data dump, to another machine.
Now that I've defined what I want, let me explain how I'll achieve it.
The only way that an ordinary IMAP-enabled email client can carry out user-defined queries is if the server's configuration allows it to recognize certain IMAP instructions as extra, nonstandard ones.
To keep this article short and sweet, I'll demonstrate two specific database queries my Mozilla email client invokes. The client will inform the server to:
Obviously, I'll have to either create an IMAP service from scratch or hack an existing mail system. I've opted for the latter.
Robert Bernier will present Getting Smart with Search Technologies.
This open panel discusses the current and future techniques for searching, collating, and using electronic based information.
|
Searching Freshmeat for "IMAP" and "database" returned 22 projects. Seven of the projects dealt with IMAP servers, but only three of those had promise. In the end, I chose DBmail out of my final list of three projects. The other two choices were BEJY and OMMS.
DBmail is an application suite that manages email accounts using both the POP3 and IMAP4rev1 protocols. DBmail stores the emails using either MySQL or PostgreSQL as the database back end.
Hacking DBmail is very easy, as it treats the database management system as no more than a glorified storage box. All of the application logic executes in the front end of the IMAP service. As an email client interacts with the service, DBmail processes the email request by sending SQL statements to the back end. This means that if I hack the database end, I can avoid rewriting any C code, instead adding SQL instructions to redefine one set of email client instructions and execute another set of instructions instead.
Of the two database engines, only PostgreSQL has the flexibility to make this hack possible because of its comprehensive support of procedures and triggers.
Procedures somewhat resemble functions in the C programming language. PostgreSQL procedures support many programming languages, including Perl, Tcl, Python, Java, Ruby, and R. Procedures are especially useful when they are combined with triggers.
Triggers provide the ability to execute functions when certain
events take place. These events are the SQL commands that manipulate the
contents of tables, such as INSERT, DELETE, and
UPDATE. Triggers can execute procedures before or after an event
takes place. For example, you can create a function that, when triggered,
records the username and date stamp for the particular action it carried out
and saves it in another table.
For the purposes of this article, I've made several assumptions:
postgres, carries out all the
actions.There's no DBmail documentation describing how the mail system works, so you'll need to read the source code directly and analyze it in the same manner outlined in this article.
You should become familiar with the other DBmail utilities in order to use them in a production environment.
It's important that you appreciate how you'll spend your time when reverse-engineering somebody else's work. In writing this article, for example, I spent about 10 percent of my time developing the source code examples and another 10 percent writing up the article. The rest of the time I spent figuring out how DBmail works.
To keep things as simple as possible, we'll assume that you have
administrative rights on your machine and are comfortable with downloading,
compiling, and configuring source code. Remember to refer to the
INSTALL text file as well as the other source documentation for
further guidance.
The PostgreSQL installation is straightforward. Download the source code from one of the PostgreSQL mirrors, and compile the binary with the following commands:
$ configure --with-perl
$ make
% make install
Note: You'll need to have Perl already installed on your server before you
can configure the source code using the --with-perl switch.
We need the --with-perl switch in order to use the untrusted
procedural language plplerlu. An untrusted language allows a user to access the
server and run scripts as the PostgreSQL superuser. I'll use plperlu to invoke
a couple of command-line utilities to convert Microsoft Word attachments to plain-text
attachments.
You can install binaries rather than compile the source.
For Red Hat RPMs, you can install the appropriate binaries with the command:
% rpm -Uvh postgresql*.rpm
You can do the following for Debian installations:
% apt-get install postgresql
% apt-get install postgres-client
For the purposes of this article I'll use the superuser account,
postgres, as DBmail's user account. Don't do this on a real
production server!
If you decide to compile the source code, download either the stable version or the latest experimental version from DBmail's download page. Keep in mind that PostgreSQL should already be running before you compile and run DBmail. The two versions of DBmail each use a different set of commands to build, compile, and install the binaries.
The 1.2.x series begins the build process by invoking an interactive shell:
$ build.sh
Compile it with the command:
$ make clean all
Then install the binaries:
% install-dbmail.sh
The newer, experimental 2.0rc.x version uses a different set of commands:
$ configure --with-pgsql
$ make
% install-dbmail.sh
Now we must finish our configuration by creating the database, attaching the procedural language plperlu to the database, configuring DBmail, and uploading some test emails into the database.
DBmail provides a script to prepare the
PostgreSQL database. Find it in the DBmail source tree under the directory
/sql/postgresql. I created a database called imap_mail and invoked
the SQL script using the following command:
$ psql -U postgres -d imap_mail -f ./create_tables.pgsql
These two commands add plpgsql and plperlu to the database imap_mail:
$ createlang -U postgres -d imap_mail plpgsql
$ createlang -U postgres -d imap_mail plperlu
Note: This data dump of the imap_mail database creates the database, the DBmail tables, the DBmail email user account, the functions, the triggers, the procedural languages plpgsql and plperlu, and the emails I used to develop this article. All you need to do is type:
$ psql -U postgres -d template1 -f imap_mail.dump
DBmail uses the configuration file dbmail.conf in
the directory /etc/dbmail.
We're now ready to start up the DBmail service by invoking the command:
$ bmail-imapd
This dbmail script is a copy of the Debian init script and permits you to run DBmail as a stand-alone service.
DBmail connects to the PostgreSQL server exclusively on a TCP socket connection on the default port of 5432.
DBmail is a little different from your standard IMAP server in that it runs
as a service independent of the superserver. Therefore, you must ensure that
nothing is already active on IMAP port 143 by removing all references to port
143 from your inetd.conf file. You can reload the superserver to
reflect these changes by using a command similar to:
% /etc/init.d/inetd reload
Check that the service is up and running. You can verify this by running
netstat -tp and netstat -tpl, as shown in Figure
1.

Figure 1. DBmail is up and is
connected to the PostgreSQL server
With the Dbmail server running, you can now create email user accounts.
Invoke the command dbmail-adduser as root:
% dbmail-adduser a mail_user 123 0 25000000
user@crunch.net user@crunch.org @crunch.com
This command creates an email account named mail_user with a
password of 123. The next number, 0, represents the user's
assigned user ID. The next number, 25mb, represents his email limit
(stating a limit of 0 will impose no limit). The remaining strings
are his email addresses. Notice that the last domain doesn't include a
username; in this case, the server will redirect all email addressed to
nonexistent user accounts in the domain crunch.com to
mail_user.
Remember to use the same username and password for your client that you
created with the dbmail-adduser command when you configured your
email client.
You can add email into the database from an existing mailbox using the
formail utility, which comes bundled with procmail, and the
dbmail-stmp utility, which uploads email into the DBmail mail
system:
$ cat mymailbox | formail -s dbmail-smtp -u mail_user
|
If I've learned one thing, it's that you should never remove somebody else's code from an existing application. Adding new source code is a safer technique; I've found that I can still do what I want and avoid the monumental effort of understanding somebody else's work. It also makes it much easier to port my work to newer versions of the application.
You can do two things to help understand how DBmail works:
An entity relationship diagram, or ERD, is a graphical representation of the database. An ERD typically shows the column names, their data types, and the relationships between the tables through the use of foreign keys. Figure 2 shows the ERD for the DBmail/PostgreSQL database.

Figure 2. Entity relationship diagram
for the email database imap_mail
The easiest way to obtain the ERD is to use the psql client to
read the relationships between the tables using the meta command
\d. Figure 3 shows a screenshot of the table description for
messages.

Figure 3. Using psql to obtain a table
description
I need to work with only a few of the DBmail tables: mailboxes,
messages, and messageblks.
PostgreSQL provides you the ability to record all SQL statements via the
syslog daemon. To invoke it, adjust the data cluster's configuration file
postgresql.conf by inserting the following lines:
syslog = 2
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'
log_statement = true
log_timestamp = true
The changes take effect once you've reloaded the service:
% /etc/init.d/postgresql reload
The DBmail error message levels are configured in the "dbmail.conf" file. Syslog places them in the mail log files ie:
/var/log/mail.warn/var/log/mail.err/var/log/mail.logYou can read the syslog files using tail and
grep:
% tail -n 100 /var/log/messages | grep postgres
The biggest challenge was understanding how DBmail worked; the rest was pretty easy.
This first hack searches all email for the text
roman. Once it finds a message that matches, the hack moves the
message from the INBOX to a subfolder named v_1.
The search begins when the user creates the subfolder v_1. The
IMAP server recieves the instruction to create the folder from the client and
sends the appropriate SQL to the back end. The trigger tr_v_1,
attached to the table mailboxes, executes the function
f_v_1() when an INSERT attempts to create the folder
INBOX/v_1. The text search begins here. A SQL CASE statement
tests each email in the INBOX. When the function finds an email with the
desired text, it calls another function, f_update_messages, to
update the messages table with an UPDATE to the message's location from
INBOX to INBOX/v_1.
The trigger tr_v_1 attaches the function f_v_1()
to the mailboxes table. The function will trigger on every INSERT
made to this table:
CREATE TRIGGER tr_v_1 AFTER INSERT ON mailboxes
FOR EACH ROW EXECUTE PROCEDURE f_v_1();
The function f_v_1() starts by testing the new name of the new
folder:
IF NEW.name = ''INBOX/v_1'' THEN
.
.
END IF;
If the name of the created folder is v_1
(INBOX/v_1), it performs a text search for the strings
roman and rome.
I've used the PERFORM statement instead of SELECT because I'm interested in
the execution of the function f_update_messages() rather than any
text string that the CASE statement may return. The CASE statement allows me to
test for the desired text string that's located in the messageblk
table by JOINing on the mailbox table:
PERFORM
CASE
WHEN messageblk ~* ''roman'' OR messageblk ~* ''rome''
THEN f_update_messages(NEW.mailbox_idnr,mblk.message_idnr)
ELSE ''other''
END AS "search results"
FROM messageblks mblk, messages m, mailboxes mb
WHERE
mblk.message_idnr = m.message_idnr AND
m.mailbox_idnr = mb.mailbox_idnr;
This calls the function f_update_messages, passing the mailbox
identity number and the messageblks identity number as parameters. It then
reassigns new identity numbers that effectively reassign the email's location
from INBOX to the newly created subfolder v_1:
UPDATE messages SET mailbox_idnr=new_mailbox_idnr
WHERE message_idnr=new_message_idnr;
This second hack converts all email with a Microsoft Word attachment to text and appends it to the original email.
This hack starts out the same way the first one does, as the attempt of the
DBmail server to INSERT data into the table messageblks triggers
it. DBmail updates the email record located in this table when the client moves
email from one folder to another.
The f_v_2 function performs multiple actions:
The conversion operations use the command-line utilities
uudecode and antiword, which execute within the
plperlu function f_doc2txt(text).
The trigger tr_v_2 attaches the function f_v_2()
to the table messageblks. The function will execute for every
INSERT made to this table:
CREATE TRIGGER tr_v_2 BEFORE INSERT ON messageblks
FOR EACH ROW EXECUTE PROCEDURE f_v_2();
The function f_v_2() starts by identifying the mailbox's
folder name:
SELECT INTO mbx_name mailboxes.name
FROM mailboxes, messages
WHERE
mailboxes.mailbox_idnr=messages.mailbox_idnr AND
messages.message_idnr=NEW.message_idnr;
DBmail stores messages in two parts: the header and the message body.
This function identifies emails' Microsoft Word attachments while it moves them to
v_2. A positively identified message will fill in several
variables with information about all of the message's parts. This in essence
reconstructs the message:
SELECT INTO mail_header messageblk
FROM messageblks
WHERE messageblk_idnr=NEW.messageblk_idnr-1;
IF mail_header ~ ''multipart/mixed''
AND NEW.messageblk ~ content_type THEN
mime_boundary=trim(both ''"'' from
substring(mail_header from ''".*"''));
NEW.messageblk = f_get_document(NEW.messageblk,mime_boundary);
END IF;
The function f_get_document extracts the Microsoft Word MIME
attachment:
CREATE OR REPLACE FUNCTION f_get_document(text,text)
RETURNS TEXT AS'
DECLARE
mail_attachment ALIAS FOR $1;
mime_boundary ALIAS FOR $2;
mime_header TEXT;
str TEXT;BEGIN
mime_header = ''--''||mime_boundary||''\nContent-Type: text/plain; \
name=document.txt\n'';
mime_header = mime_header||''Content-Transfer-Encoding: 7bit\n'';
mime_header = mime_header||''Content-Disposition: inline; \
filename="document.txt"\n'';
str=substring(mail_attachment from \
''Content-Type: application/MS Word.*''||mime_boundary);
str=substring(str from ''\\n\\n.*\\n\\n'');
str=split_part(str,''\\n\\n'',2);
RETURN mime_header||f_doc2txt(str)||''--''||mime_boundary||''--'';
END;
'LANGUAGE plpgsql;
Notice that the function adds the header and MIME boundary to the Word
document, then turns the document into a text file by calling the function
f_doc2txt(). The Base64 encoding attachment, str, is
the function's parameter:
RETURN mime_header||f_doc2txt(str)||''--''||mime_boundary||''--'';
The function f_doc2txt() is short. It performs minor
formatting to prepare the attachment for the uudecode utility which turns it
back into its binary form. It then pipes the binary stream into
antiword, which converts it into a text stream:
CREATE OR REPLACE FUNCTION f_doc2txt(text) RETURNS TEXT AS'
use strict;
my $str="begin-base64 600 zzz\n".$_[0]."====";
$str= `echo "$str" | uudecode -o /dev/stdout | antiword - -i0`;
return $str;
' LANGUAGE plperlu;
The function returns this where it replaces the Word attachment with a
plain-text translation in the f_v_2() function:
NEW.messageblk = f_get_document(NEW.messageblk,mime_boundary);
.
.
.
RETURN NEW;
Read the SQL source code at the same time you read this article.
The DBmail documentation can help you configure mail agents such as Postfix to pipe email directly into the database.
DBmail is changing rapidly, so refer to the home site for the latest news.
Don't be bashful! Go to the PostgreSQL site to learn more about triggers and the amazing number of currently supported programming languages.
There's so much you can do to get the most out of your email. By using triggers in your database engine, you can enhance any existing behavior into new and creative solutions. All it takes is a little imagination.
Here are a couple of ideas to whet your appetite:
I began this article intending to highlight data-mining strategies, but as you can see it's hard to stick to this subject without also considering reverse-engineering techniques.
dbmail man pages, specifically:
man dbmail-adduser, which adds mail users to the Dbmail
mail systemman dbmail-imapd, the imap4r1 front end for the Dbmail
mail systemman dbmail-maintenance, which performs cleanup and integrity checksman dbmail-pop3d, the POP3d front end for the Dbmail
mail systemman dbmail-smtp, which inserts messages into the Dbmail
mail systemman uudecodeman antiwordRobert Bernier is the PostgreSQL business intelligence analyst for SRA America, a subsidiary of Software Research America (SRA).
Return to ONLamp.com
Copyright © 2007 O'Reilly Media, Inc.