ONLamp.com
oreilly.comSafari Books Online.Conferences.

advertisement


Creating Custom Email Queries
Pages: 1, 2

Understanding the Job Ahead

If I've learned one thing, it's that you should never remove somebody else's code from an existing application. Adding new source code is a safer technique; I've found that I can still do what I want and avoid the monumental effort of understanding somebody else's work. It also makes it much easier to port my work to newer versions of the application.

You can do two things to help understand how DBmail works:

  1. Create an entity relationship diagram for the PostgreSQL database.
  2. Configure PostgreSQL to echo SQL commands to syslog and see what's going on.

An entity relationship diagram, or ERD, is a graphical representation of the database. An ERD typically shows the column names, their data types, and the relationships between the tables through the use of foreign keys. Figure 2 shows the ERD for the DBmail/PostgreSQL database.

entity
relationship diagram
Figure 2. Entity relationship diagram for the email database imap_mail

The easiest way to obtain the ERD is to use the psql client to read the relationships between the tables using the meta command \d. Figure 3 shows a screenshot of the table description for messages.

table structure
for messages
Figure 3. Using psql to obtain a table description

I need to work with only a few of the DBmail tables: mailboxes, messages, and messageblks.

PostgreSQL provides you the ability to record all SQL statements via the syslog daemon. To invoke it, adjust the data cluster's configuration file postgresql.conf by inserting the following lines:

syslog          = 2
syslog_facility = 'LOCAL0'
syslog_ident    = 'postgres'
log_statement   = true
log_timestamp   = true

The changes take effect once you've reloaded the service:

% /etc/init.d/postgresql reload

The DBmail error message levels are configured in the "dbmail.conf" file. Syslog places them in the mail log files ie:

  • /var/log/mail.warn
  • /var/log/mail.err
  • /var/log/mail.log

You can read the syslog files using tail and grep:

% tail -n 100 /var/log/messages | grep postgres

The Hacks

The biggest challenge was understanding how DBmail worked; the rest was pretty easy.

Hack 1: Text Searching

This first hack searches all email for the text roman. Once it finds a message that matches, the hack moves the message from the INBOX to a subfolder named v_1.

The search begins when the user creates the subfolder v_1. The IMAP server recieves the instruction to create the folder from the client and sends the appropriate SQL to the back end. The trigger tr_v_1, attached to the table mailboxes, executes the function f_v_1() when an INSERT attempts to create the folder INBOX/v_1. The text search begins here. A SQL CASE statement tests each email in the INBOX. When the function finds an email with the desired text, it calls another function, f_update_messages, to update the messages table with an UPDATE to the message's location from INBOX to INBOX/v_1.

The trigger tr_v_1 attaches the function f_v_1() to the mailboxes table. The function will trigger on every INSERT made to this table:

CREATE TRIGGER tr_v_1 AFTER INSERT ON mailboxes
	FOR EACH ROW EXECUTE PROCEDURE f_v_1();

The function f_v_1() starts by testing the new name of the new folder:

IF NEW.name = ''INBOX/v_1'' THEN
.
.
END IF;

If the name of the created folder is v_1 (INBOX/v_1), it performs a text search for the strings roman and rome.

I've used the PERFORM statement instead of SELECT because I'm interested in the execution of the function f_update_messages() rather than any text string that the CASE statement may return. The CASE statement allows me to test for the desired text string that's located in the messageblk table by JOINing on the mailbox table:

PERFORM
    CASE 
     WHEN messageblk ~* ''roman'' OR messageblk ~* ''rome''
      THEN f_update_messages(NEW.mailbox_idnr,mblk.message_idnr)
     ELSE ''other''
    END AS "search results"
    FROM messageblks mblk, messages m, mailboxes mb
    WHERE
      mblk.message_idnr = m.message_idnr AND
      m.mailbox_idnr = mb.mailbox_idnr;

This calls the function f_update_messages, passing the mailbox identity number and the messageblks identity number as parameters. It then reassigns new identity numbers that effectively reassign the email's location from INBOX to the newly created subfolder v_1:

UPDATE messages SET mailbox_idnr=new_mailbox_idnr
	WHERE message_idnr=new_message_idnr;

Hack 2: Converting Word Attachments to Plain Text

This second hack converts all email with a Microsoft Word attachment to text and appends it to the original email.

This hack starts out the same way the first one does, as the attempt of the DBmail server to INSERT data into the table messageblks triggers it. DBmail updates the email record located in this table when the client moves email from one folder to another.

The f_v_2 function performs multiple actions:

  • Identifies whether a message has attachments
  • Identifies whether the attachment is a Microsoft Word document
  • Isolates and extracts the Word document from the email
  • Converts it from its Base64 encoding to its binary form
  • Converts the binary form to plain text
  • Appends the plain-text transformation to the original email

The conversion operations use the command-line utilities uudecode and antiword, which execute within the plperlu function f_doc2txt(text).

The trigger tr_v_2 attaches the function f_v_2() to the table messageblks. The function will execute for every INSERT made to this table:

CREATE TRIGGER tr_v_2 BEFORE INSERT ON messageblks
	FOR EACH ROW EXECUTE PROCEDURE f_v_2();

The function f_v_2() starts by identifying the mailbox's folder name:

SELECT INTO mbx_name mailboxes.name
   FROM mailboxes, messages
   WHERE
    mailboxes.mailbox_idnr=messages.mailbox_idnr AND
    messages.message_idnr=NEW.message_idnr;

DBmail stores messages in two parts: the header and the message body.

This function identifies emails' Microsoft Word attachments while it moves them to v_2. A positively identified message will fill in several variables with information about all of the message's parts. This in essence reconstructs the message:

SELECT INTO mail_header messageblk
FROM messageblks
WHERE messageblk_idnr=NEW.messageblk_idnr-1;
IF mail_header ~ ''multipart/mixed''
	AND NEW.messageblk ~ content_type THEN
		mime_boundary=trim(both ''"'' from
		substring(mail_header from ''".*"''));
	NEW.messageblk = f_get_document(NEW.messageblk,mime_boundary);
END IF;

The function f_get_document extracts the Microsoft Word MIME attachment:

CREATE OR REPLACE FUNCTION f_get_document(text,text)
	RETURNS TEXT AS'
		DECLARE
		mail_attachment ALIAS FOR $1;
		mime_boundary ALIAS FOR $2;
		mime_header TEXT;
		str TEXT;BEGIN 
		mime_header = ''--''||mime_boundary||''\nContent-Type: text/plain; \
			name=document.txt\n'';
		mime_header = mime_header||''Content-Transfer-Encoding: 7bit\n'';
		mime_header = mime_header||''Content-Disposition: inline; \
			filename="document.txt"\n'';

		str=substring(mail_attachment from \
			''Content-Type: application/MS Word.*''||mime_boundary);
		str=substring(str from ''\\n\\n.*\\n\\n'');
		str=split_part(str,''\\n\\n'',2);

		RETURN mime_header||f_doc2txt(str)||''--''||mime_boundary||''--'';
END;
'LANGUAGE plpgsql;

Notice that the function adds the header and MIME boundary to the Word document, then turns the document into a text file by calling the function f_doc2txt(). The Base64 encoding attachment, str, is the function's parameter:

RETURN mime_header||f_doc2txt(str)||''--''||mime_boundary||''--'';

The function f_doc2txt() is short. It performs minor formatting to prepare the attachment for the uudecode utility which turns it back into its binary form. It then pipes the binary stream into antiword, which converts it into a text stream:

CREATE OR REPLACE FUNCTION f_doc2txt(text) RETURNS TEXT AS'
use strict;

my $str="begin-base64 600 zzz\n".$_[0]."====";
$str= `echo "$str" | uudecode -o /dev/stdout | antiword - -i0`;
return $str;
' LANGUAGE plperlu;

The function returns this where it replaces the Word attachment with a plain-text translation in the f_v_2() function:

NEW.messageblk = f_get_document(NEW.messageblk,mime_boundary);
.
.
.
RETURN NEW;

Tips

Read the SQL source code at the same time you read this article.

The DBmail documentation can help you configure mail agents such as Postfix to pipe email directly into the database.

DBmail is changing rapidly, so refer to the home site for the latest news.

Don't be bashful! Go to the PostgreSQL site to learn more about triggers and the amazing number of currently supported programming languages.

Conclusion

There's so much you can do to get the most out of your email. By using triggers in your database engine, you can enhance any existing behavior into new and creative solutions. All it takes is a little imagination.

Here are a couple of ideas to whet your appetite:

  • Sort and cross-index your email under several categories or searches, and place them in several folders at the same time.
  • Translate email into another language by piping your correspondence through an online translation facility.
  • Create and index your email, similar to Google, using PostgreSQL's tsearch advanced text-searching capabilities.

I began this article intending to highlight data-mining strategies, but as you can see it's hard to stick to this subject without also considering reverse-engineering techniques.

References

  • dbmail man pages, specifically:

    • man dbmail-adduser, which adds mail users to the Dbmail mail system
    • man dbmail-imapd, the imap4r1 front end for the Dbmail mail system
    • man dbmail-maintenance, which performs cleanup and integrity checks
    • man dbmail-pop3d, the POP3d front end for the Dbmail mail system
    • man dbmail-smtp, which inserts messages into the Dbmail mail system
  • man uudecode
  • man antiword
  • DBmail
  • PostgreSQL
  • RFC 3501, the IMAP protocol

Robert Bernier is the PostgreSQL business intelligence analyst for SRA America, a subsidiary of Software Research America (SRA).


Return to ONLamp.com



Sponsored by: