O'Reilly Databases

oreilly.comSafari Books Online.Conferences.

We've expanded our coverage and improved our search! Search for all things Database across O'Reilly!

Search Search Tips

advertisement
AddThis Social Bookmark Button

Print Subscribe to Databases Subscribe to Newsletters

Data Mining Email
Pages: 1, 2

Part 3. Developing the Email Parser

Install the following modules from the CPAN, along with their dependencies:
  • Mail::MboxParser, a simple interface that provides read-only access to Unix-mailboxes
  • DBD::Pg, the database interface between Perl and the PostgreSQL database.



The mbox parser script first defines some global variables and then establishes a database connection with the PostgreSQL server:

my $dbh = DBI->connect($dsn,$user,$pass);
die $DBI::errstr unless defined $dbh;
$dbh->{PrintError} = 0;

The script then reads the mailbox and breaks down the component parts of each individual message:

my $mb = Mail::MboxParser->new(\@myarray,
	decode     => 'ALL',
	parseropts => $parseropts);

A while loop processes messages from the mailbox and identifies the interesting parts of the message:

my $id      = $msg->id;
my $from    = $msg->from->{email};
my $subject = $msg->header->{subject};

The MboxParser module identifies the main body of the message:

my $body = $msg->body($msg->find_body);

The main body and attachments of a message are referenced in the main and attachment tables with an oid. Prior to version 7.1, PostgreSQL cannot insert data into a row that exceeds the default data page size of 8192 bytes. To get around this, we'll instead insert this information as large objects. Upload a large object by using psql:

my $myoid=`psql -U postgres --command "\\lo_import \'$body_file\' \'$mymailbox\' " $db`;
my @oidarray = split(/ /,$myoid);

This method requires less coding and is simpler to understand than is using libpq's more efficient large object call.

We are now able to insert the email message into the main table:

$sql = "INSERT INTO main VALUES( ?, ?, ?, @oidarray[1])";
$sth = $dbh->prepare($sql) or die $DBI::errstr;
$sth->execute( $id, $from, $subject ) or die $DBI::errstr;

This snippet of code generates the list of email recipients:

#POPULATE TABLE 'mailto'

for my $msg_TO ($msg->to) {
	$sql = "INSERT INTO mailto VALUES( ?, ? )";
	$sth = $dbh->prepare($sql) or die $DBI::errstr;
	$sth->execute( $id, $msg_TO->{email} ) or die $DBI::errstr;
}

The most interesting part of the program identifies MS Word attachments:

# DETERMINE ATTACHMENT MIME-TYPES

my $decodedattachment = `echo "$attachment" | ./decode.pl |
	tee $attachment_file | file -i -`;
my @test0     = split(/ /, $decodedattachment);
my $mime_type = $test0[1];
chomp $mime_type;

# VALIDATING MSWORD DOCS

if ($mime_type eq "application/MS Word") {

	# CONVERT INTO READABLE TEXT
	`antiword $attachment_file > $MS Word_file;
	mv -f $MS Word_file $attachment_file`;

} else {

	# NOT MSWORD DOCS,
	# USE THE ENCODED MIME VERSION IN THE FILE
	`echo "$attachment" > $attachment_file`;
}

decode.pl decodes the MIME Base64 message back into its original binary encoding. There is also a PostgreSQL function for this. The program then pipes the encoded output into the file utility to determine if it really is a MS Word document. At the same time, it saves the decoded attachment as the temporary file /tmp/attachment.

Any attachment that is identified as a MS Word document with a MIME type of application/MS Word runs through the antiword utility to turn it into an ASCII text document. The program then uploads this as a large object.

All other attachments are saved in the database as large objects in their original MIME encodings.

Part 4. Querying the Database

Now that the information is in the database, the next trick is to retrieve it in a way that makes sense. I've created the following queries, based on my own database that I used to develop this article.

First, let's summarize all the email messages that have both a main body and an attachment. (Refer to the table creation create tables script for table details).

SELECT m.from_email AS From,
       m.subject    AS Subject,
       m.mailbody   AS Message,
       a.attachment AS Attachment
FROM   main       m,
       attachment a
WHERE  m.messageid=a.message

This will produce output similar to:

From Subject Message Attachment
robert.bernier5@sympatico.ca message 1 60227 60230
robert.bernier5@sympatico.ca message 2 60233 60236
casestudy@postgresql.org message 3 60239 60242
casestudy@postgresql.org message 4 60245 60248

The Message and Attachments are oids, or reference numbers, to the large objects now stored in the pg_largeobject catalog.

Now, let's list all messages with attachments and identify those attachments that were MS Word documents:

SELECT m.messageid  AS "Message ID",
       a.attachment AS "Attachment oid",
	CASE
		WHEN a.mime_type='application/MS Word'
		THEN 'true'
 		ELSE 'false'
	END
                    AS "MS Word attachment"
FROM   main m, attachment a
WHERE  m.messageid=a.messageid;

This will produce output similar to:

Message ID Attachment oid MS Word attachment
3FF79F08.6060208@sympatico.ca 60230 true
3FF79F29.2090101@sympatico.ca 60236 false
3FF7A511.2030104@postgresql.org 60242 false
3FF7A52A.8040203@postgresql.org 60248 true

You can return the ASCII text conversion of any MS Word attachment from a query similar to:

SELECT encode(lo.data,'escape') AS "My Document"
FROM   pg_largeobject lo,
       attachme1t     a
WHERE  a.messageid  = '3FF7A52A.8040203@postgresql.org'
AND    a.attachment = lo.loid;

The encode function casts My Document from a bytea to a text type by removing all escape sequences that the lo_import function originally inserted.

Searching for Patterns

To fetch results with more detail, you'll have to search specific columns for patterns. There are two ways to search text, with the SQL LIKE expression or with PostgreSQL's POSIX-style regular expressions.

Using LIKE

This query will search for the phrase "These functions read and write" in a particular Word document.


SELECT m.from_email  AS "From",
       a.description AS "File name",
	CASE
		WHEN encode(lo.data,'escape') LIKE '%These functions read and write%'
		THEN 'true'
  		ELSE 'false'
	END AS "Search results"
FROM pg_largeobject lo,
     attachment      a,
     main            m
WHERE a.messageid  = '3FF7A52A.8040203@postgresql.org'
AND   a.attachment = lo.loid
AND   a.messageid  = m.messageid;

Here are the results from my sample tables:

From File name Search results
casestudy@postgresql.org doc2.doc true

Regular Expressions

This query will count the number of emails with attachments and whose from address begins with the letters ca and end in org.

SELECT count(m.from_email)
FROM   main       m,
       attachment a
WHERE  m.messageid  = a.messageid
AND    m.from_email ~ '^ca'
AND    m.from_email ~ 'org$';

Caveats and Suggestions

In the interests of keeping the Perl script as simple as possible, I've assumed that all email attachments arrive encoded in Base64. As well, since I wrote the script to follow the flow of this article, the code is a bit clumsy.

The database definitions here only demonstrate where you can make a start. Making a database that flies like a jet is an article in itself.

Save yourself time and grief by saving SQL queries and definitions in a file and then invoking it through the psql client. If you're paranoid, you can put these files under version control

Invoke the create tables script to clean out the database as you rerun the Perl scripts.

Antiword is a bit quirky in that it will not parse files that it considers too small.

Viewing the Mozilla mbox through its client can be deceiving. It never really removes email from the box even if you've deleted it, so you may well see email in your database that you thought you destroyed months before.

You can export large objects from the database by using the lo_export function in the psql client. For example \lo_export 123 temp.txt will save the large object with the oid number of 132 to a file named temp.txt.

Conclusion

There are two parts to a successful data mine; the mechanism that prepares the document and the algorithm that looks for patterns. We've looked at one way getting email into a database but frankly, I haven't given that much coverage on the power of regular expressions in a database environment, although some of you Perl fiends have a pretty good idea of the possibilities.

References

  • Perl
  • CPAN
  • PostgreSQL
  • Antiword
  • Create table script
  • Mail parsing script
  • MIME decoding script

Robert Bernier is the PostgreSQL business intelligence analyst for SRA America, a subsidiary of Software Research America (SRA).


Return to ONLamp.com.


Comments on this article

1 to 6 of 6
1 to 6 of 6


Tagged Articles

Be the first to post this article to del.icio.us

Related to this Article

Data Jujitsu: The Art of Turning Data into Product Data Jujitsu: The Art of Turning Data into Product
November 2012
$0.00 USD

Designing Great Data Products Designing Great Data Products
March 2012
$0.00 USD

Sponsored Resources

  • Inside Lightroom
Advertisement
O'reilly

© 2013, O’Reilly Media, Inc.

(707) 827-7019 (800) 889-8969

All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.

About O'Reilly

  • Academic Solutions
  • Jobs
  • Contacts
  • Corporate Information
  • Press Room
  • Privacy Policy
  • Terms of Service
  • Writing for O'Reilly

Community

  • Authors
  • Community & Featured Users
  • Forums
  • Membership
  • Newsletters
  • O'Reilly Answers
  • RSS Feeds
  • User Groups

Partner Sites

  • makezine.com
  • makerfaire.com
  • craftzine.com
  • igniteshow.com
  • PayPal Developer Zone
  • O'Reilly Insights on Forbes.com

Shop O'Reilly

  • Customer Service
  • Contact Us
  • Shipping Information
  • Ordering & Payment
  • The O'Reilly Guarantee