ONLamp.com
oreilly.comSafari Books Online.Conferences.

advertisement


Creating Custom Email Queries

by Robert Bernier
07/22/2004

In my last article on data mining email, I described how you could upload a Mozilla mailbox into a PostgreSQL database. With the email uploaded, I showed how to search through the email looking for text strings using both the standard SQL and PostgreSQL's POSIX regular expressions. As a grand finale, you then saw how you could extract Microsoft Word document attachments and perform text searches on those too.

In the true spirit of hacking, this article will consider the next step of data mining: performing in-depth searches on email by tweaking an existing mail system. The objective is to develop a solution requiring no more expertise than that of an intermediate-level DBA.

The Design Requirements

To avoid a learning curve for our user base, I intend to use an existing email client rather than create one. The next requirement is to select a protocol. I chose IMAP because it centralizes the administrative control of all user accounts onto one server. The third requirement is to choose a mechanism that gives the email client user the ability to carry out complex queries: The solution is to implement a relational database management system as the back end to the IMAP server. This DBMS must have the ability to create, store, and execute procedures. The last requirement is that the solution must be capable of backups and can upload the email, in the form of a data dump, to another machine.

The Solution

Now that I've defined what I want, let me explain how I'll achieve it.

The only way that an ordinary IMAP-enabled email client can carry out user-defined queries is if the server's configuration allows it to recognize certain IMAP instructions as extra, nonstandard ones.

To keep this article short and sweet, I'll demonstrate two specific database queries my Mozilla email client invokes. The client will inform the server to:

  1. Perform a text search on email located in the INBOX. The idea is to transfer the emails that match the search criteria into a new subfolder called v_1.
  2. Convert emails with Microsoft Word attachments to plain text by dragging them from the INBOX and dropping them into a user-created subfolder called v_2.

Obviously, I'll have to either create an IMAP service from scratch or hack an existing mail system. I've opted for the latter.


O'Reilly Open Source Convention -- July 26-30, 2004 Portland, OR.

Robert Bernier will present Getting Smart with Search Technologies. This open panel discusses the current and future techniques for searching, collating, and using electronic based information.
Register now!

Searching Freshmeat for "IMAP" and "database" returned 22 projects. Seven of the projects dealt with IMAP servers, but only three of those had promise. In the end, I chose DBmail out of my final list of three projects. The other two choices were BEJY and OMMS.

DBmail is an application suite that manages email accounts using both the POP3 and IMAP4rev1 protocols. DBmail stores the emails using either MySQL or PostgreSQL as the database back end.

Hacking DBmail is very easy, as it treats the database management system as no more than a glorified storage box. All of the application logic executes in the front end of the IMAP service. As an email client interacts with the service, DBmail processes the email request by sending SQL statements to the back end. This means that if I hack the database end, I can avoid rewriting any C code, instead adding SQL instructions to redefine one set of email client instructions and execute another set of instructions instead.

Of the two database engines, only PostgreSQL has the flexibility to make this hack possible because of its comprehensive support of procedures and triggers.

Procedures somewhat resemble functions in the C programming language. PostgreSQL procedures support many programming languages, including Perl, Tcl, Python, Java, Ruby, and R. Procedures are especially useful when they are combined with triggers.

Triggers provide the ability to execute functions when certain events take place. These events are the SQL commands that manipulate the contents of tables, such as INSERT, DELETE, and UPDATE. Triggers can execute procedures before or after an event takes place. For example, you can create a function that, when triggered, records the username and date stamp for the particular action it carried out and saves it in another table.

Caveats

For the purposes of this article, I've made several assumptions:

  • I tested this code with only one user account.
  • There is only one Microsoft Word attachment per email
  • The postgres superuser, postgres, carries out all the actions.

There's no DBmail documentation describing how the mail system works, so you'll need to read the source code directly and analyze it in the same manner outlined in this article.

You should become familiar with the other DBmail utilities in order to use them in a production environment.

It's important that you appreciate how you'll spend your time when reverse-engineering somebody else's work. In writing this article, for example, I spent about 10 percent of my time developing the source code examples and another 10 percent writing up the article. The rest of the time I spent figuring out how DBmail works.

Setting Up the Environment

To keep things as simple as possible, we'll assume that you have administrative rights on your machine and are comfortable with downloading, compiling, and configuring source code. Remember to refer to the INSTALL text file as well as the other source documentation for further guidance.

Preparing PostgreSQL

The PostgreSQL installation is straightforward. Download the source code from one of the PostgreSQL mirrors, and compile the binary with the following commands:

$ configure --with-perl
$ make
% make install

Note: You'll need to have Perl already installed on your server before you can configure the source code using the --with-perl switch.

We need the --with-perl switch in order to use the untrusted procedural language plplerlu. An untrusted language allows a user to access the server and run scripts as the PostgreSQL superuser. I'll use plperlu to invoke a couple of command-line utilities to convert Microsoft Word attachments to plain-text attachments.

You can install binaries rather than compile the source.

For Red Hat RPMs, you can install the appropriate binaries with the command:

% rpm -Uvh postgresql*.rpm

You can do the following for Debian installations:

% apt-get install postgresql
% apt-get install postgres-client

For the purposes of this article I'll use the superuser account, postgres, as DBmail's user account. Don't do this on a real production server!

Preparing DBmail

If you decide to compile the source code, download either the stable version or the latest experimental version from DBmail's download page. Keep in mind that PostgreSQL should already be running before you compile and run DBmail. The two versions of DBmail each use a different set of commands to build, compile, and install the binaries.

The 1.2.x series begins the build process by invoking an interactive shell:

$ build.sh

Compile it with the command:

$ make clean all

Then install the binaries:

% install-dbmail.sh

The newer, experimental 2.0rc.x version uses a different set of commands:

$ configure --with-pgsql
$ make
% install-dbmail.sh

Final Configurations

Now we must finish our configuration by creating the database, attaching the procedural language plperlu to the database, configuring DBmail, and uploading some test emails into the database.

DBmail provides a script to prepare the PostgreSQL database. Find it in the DBmail source tree under the directory /sql/postgresql. I created a database called imap_mail and invoked the SQL script using the following command:

$ psql -U postgres -d imap_mail -f ./create_tables.pgsql

These two commands add plpgsql and plperlu to the database imap_mail:

$ createlang -U postgres -d imap_mail plpgsql
$ createlang -U postgres -d imap_mail plperlu

Note: This data dump of the imap_mail database creates the database, the DBmail tables, the DBmail email user account, the functions, the triggers, the procedural languages plpgsql and plperlu, and the emails I used to develop this article. All you need to do is type:

$ psql  -U postgres -d template1 -f imap_mail.dump

DBmail uses the configuration file dbmail.conf in the directory /etc/dbmail.

We're now ready to start up the DBmail service by invoking the command:

$ bmail-imapd

This dbmail script is a copy of the Debian init script and permits you to run DBmail as a stand-alone service.

DBmail connects to the PostgreSQL server exclusively on a TCP socket connection on the default port of 5432.

DBmail is a little different from your standard IMAP server in that it runs as a service independent of the superserver. Therefore, you must ensure that nothing is already active on IMAP port 143 by removing all references to port 143 from your inetd.conf file. You can reload the superserver to reflect these changes by using a command similar to:

% /etc/init.d/inetd reload

Check that the service is up and running. You can verify this by running netstat -tp and netstat -tpl, as shown in Figure 1.

netstat of dbmail
and postgres processes
Figure 1. DBmail is up and is connected to the PostgreSQL server

With the Dbmail server running, you can now create email user accounts. Invoke the command dbmail-adduser as root:

% dbmail-adduser a mail_user 123 0 25000000
	user@crunch.net user@crunch.org @crunch.com

This command creates an email account named mail_user with a password of 123. The next number, 0, represents the user's assigned user ID. The next number, 25mb, represents his email limit (stating a limit of 0 will impose no limit). The remaining strings are his email addresses. Notice that the last domain doesn't include a username; in this case, the server will redirect all email addressed to nonexistent user accounts in the domain crunch.com to mail_user.

Testing the Setup

Remember to use the same username and password for your client that you created with the dbmail-adduser command when you configured your email client.

You can add email into the database from an existing mailbox using the formail utility, which comes bundled with procmail, and the dbmail-stmp utility, which uploads email into the DBmail mail system:

$ cat mymailbox | formail -s dbmail-smtp -u mail_user

Pages: 1, 2

Next Pagearrow





Sponsored by: