Creating Custom Email Queriesby Robert Bernier
In my last article on data mining email, I described how you could upload a Mozilla mailbox into a PostgreSQL database. With the email uploaded, I showed how to search through the email looking for text strings using both the standard SQL and PostgreSQL's POSIX regular expressions. As a grand finale, you then saw how you could extract Microsoft Word document attachments and perform text searches on those too.
In the true spirit of hacking, this article will consider the next step of data mining: performing in-depth searches on email by tweaking an existing mail system. The objective is to develop a solution requiring no more expertise than that of an intermediate-level DBA.
The Design Requirements
To avoid a learning curve for our user base, I intend to use an existing email client rather than create one. The next requirement is to select a protocol. I chose IMAP because it centralizes the administrative control of all user accounts onto one server. The third requirement is to choose a mechanism that gives the email client user the ability to carry out complex queries: The solution is to implement a relational database management system as the back end to the IMAP server. This DBMS must have the ability to create, store, and execute procedures. The last requirement is that the solution must be capable of backups and can upload the email, in the form of a data dump, to another machine.
Now that I've defined what I want, let me explain how I'll achieve it.
The only way that an ordinary IMAP-enabled email client can carry out user-defined queries is if the server's configuration allows it to recognize certain IMAP instructions as extra, nonstandard ones.
To keep this article short and sweet, I'll demonstrate two specific database queries my Mozilla email client invokes. The client will inform the server to:
- Perform a text search on email located in the INBOX. The idea is to transfer the emails that match the search criteria into a new subfolder called v_1.
- Convert emails with Microsoft Word attachments to plain text by dragging them from the INBOX and dropping them into a user-created subfolder called v_2.
Obviously, I'll have to either create an IMAP service from scratch or hack an existing mail system. I've opted for the latter.
Robert Bernier will present Getting Smart with Search Technologies.
This open panel discusses the current and future techniques for searching, collating, and using electronic based information.
Searching Freshmeat for "IMAP" and "database" returned 22 projects. Seven of the projects dealt with IMAP servers, but only three of those had promise. In the end, I chose DBmail out of my final list of three projects. The other two choices were BEJY and OMMS.
DBmail is an application suite that manages email accounts using both the POP3 and IMAP4rev1 protocols. DBmail stores the emails using either MySQL or PostgreSQL as the database back end.
Hacking DBmail is very easy, as it treats the database management system as no more than a glorified storage box. All of the application logic executes in the front end of the IMAP service. As an email client interacts with the service, DBmail processes the email request by sending SQL statements to the back end. This means that if I hack the database end, I can avoid rewriting any C code, instead adding SQL instructions to redefine one set of email client instructions and execute another set of instructions instead.
Of the two database engines, only PostgreSQL has the flexibility to make this hack possible because of its comprehensive support of procedures and triggers.
Procedures somewhat resemble functions in the C programming language. PostgreSQL procedures support many programming languages, including Perl, Tcl, Python, Java, Ruby, and R. Procedures are especially useful when they are combined with triggers.
Triggers provide the ability to execute functions when certain
events take place. These events are the SQL commands that manipulate the
contents of tables, such as
UPDATE. Triggers can execute procedures before or after an event
takes place. For example, you can create a function that, when triggered,
records the username and date stamp for the particular action it carried out
and saves it in another table.
For the purposes of this article, I've made several assumptions:
- I tested this code with only one user account.
- There is only one Microsoft Word attachment per email
- The postgres superuser,
postgres, carries out all the actions.
There's no DBmail documentation describing how the mail system works, so you'll need to read the source code directly and analyze it in the same manner outlined in this article.
You should become familiar with the other DBmail utilities in order to use them in a production environment.
It's important that you appreciate how you'll spend your time when reverse-engineering somebody else's work. In writing this article, for example, I spent about 10 percent of my time developing the source code examples and another 10 percent writing up the article. The rest of the time I spent figuring out how DBmail works.
Setting Up the Environment
To keep things as simple as possible, we'll assume that you have
administrative rights on your machine and are comfortable with downloading,
compiling, and configuring source code. Remember to refer to the
INSTALL text file as well as the other source documentation for
The PostgreSQL installation is straightforward. Download the source code from one of the PostgreSQL mirrors, and compile the binary with the following commands:
$ configure --with-perl $ make % make install
Note: You'll need to have Perl already installed on your server before you
can configure the source code using the
We need the
--with-perl switch in order to use the untrusted
procedural language plplerlu. An untrusted language allows a user to access the
server and run scripts as the PostgreSQL superuser. I'll use plperlu to invoke
a couple of command-line utilities to convert Microsoft Word attachments to plain-text
You can install binaries rather than compile the source.
For Red Hat RPMs, you can install the appropriate binaries with the command:
% rpm -Uvh postgresql*.rpm
You can do the following for Debian installations:
% apt-get install postgresql % apt-get install postgres-client
For the purposes of this article I'll use the superuser account,
postgres, as DBmail's user account. Don't do this on a real
If you decide to compile the source code, download either the stable version or the latest experimental version from DBmail's download page. Keep in mind that PostgreSQL should already be running before you compile and run DBmail. The two versions of DBmail each use a different set of commands to build, compile, and install the binaries.
The 1.2.x series begins the build process by invoking an interactive shell:
Compile it with the command:
$ make clean all
Then install the binaries:
The newer, experimental 2.0rc.x version uses a different set of commands:
$ configure --with-pgsql $ make % install-dbmail.sh
Now we must finish our configuration by creating the database, attaching the procedural language plperlu to the database, configuring DBmail, and uploading some test emails into the database.
DBmail provides a script to prepare the
PostgreSQL database. Find it in the DBmail source tree under the directory
/sql/postgresql. I created a database called imap_mail and invoked
the SQL script using the following command:
$ psql -U postgres -d imap_mail -f ./create_tables.pgsql
These two commands add plpgsql and plperlu to the database imap_mail:
$ createlang -U postgres -d imap_mail plpgsql $ createlang -U postgres -d imap_mail plperlu
Note: This data dump of the imap_mail database creates the database, the DBmail tables, the DBmail email user account, the functions, the triggers, the procedural languages plpgsql and plperlu, and the emails I used to develop this article. All you need to do is type:
$ psql -U postgres -d template1 -f imap_mail.dump
DBmail uses the configuration file dbmail.conf in
We're now ready to start up the DBmail service by invoking the command:
This dbmail script is a copy of the Debian init script and permits you to run DBmail as a stand-alone service.
DBmail connects to the PostgreSQL server exclusively on a TCP socket connection on the default port of 5432.
DBmail is a little different from your standard IMAP server in that it runs
as a service independent of the superserver. Therefore, you must ensure that
nothing is already active on IMAP port 143 by removing all references to port
143 from your
inetd.conf file. You can reload the superserver to
reflect these changes by using a command similar to:
% /etc/init.d/inetd reload
Check that the service is up and running. You can verify this by running
netstat -tp and
netstat -tpl, as shown in Figure
Figure 1. DBmail is up and is connected to the PostgreSQL server
With the Dbmail server running, you can now create email user accounts.
Invoke the command
dbmail-adduser as root:
% dbmail-adduser a mail_user 123 0 25000000 email@example.com firstname.lastname@example.org @crunch.com
This command creates an email account named
mail_user with a
123. The next number,
0, represents the user's
assigned user ID. The next number,
25mb, represents his email limit
(stating a limit of
0 will impose no limit). The remaining strings
are his email addresses. Notice that the last domain doesn't include a
username; in this case, the server will redirect all email addressed to
nonexistent user accounts in the domain
Testing the Setup
Remember to use the same username and password for your client that you
created with the
dbmail-adduser command when you configured your
You can add email into the database from an existing mailbox using the
formail utility, which comes bundled with procmail, and the
dbmail-stmp utility, which uploads email into the DBmail mail
$ cat mymailbox | formail -s dbmail-smtp -u mail_user
Pages: 1, 2