Python DevCenter

Processing Mailbox Files with mailbox.py

by A. M. Kuchling
06/28/2007

Despite competition from the Web and instant messaging, email is still a primary communication medium for most Internet users, and many people have sizable archives of email messages. Sometimes these archived messages reside on an IMAP server, in which case you can use Python's imaplib module for scripted access to email. More often, messages will be downloaded to your computer and stored locally on disk.

Archived mail can be stored in many different file formats. The mailbox module in the Python standard library supports reading and modifying five of them, all primarily used on Unix systems.

The mailbox module was greatly enhanced in Python 2.5. For a long time the mailbox module only supported reading mailboxes, not modifying them. Gregory K. Johnson, as his project for Google's 2005 Summer of Code, wrote code for adding and deleting messages; these new features went into Python 2.5, released in September 2006.

Supported Formats

Five mailbox formats are supported; each format is implemented by a different class.

  • The mbox format, supported by most Unix mail readers. Class name: mbox.
  • The Babyl format used by the RMAIL mail user agent for Emacs. Class name: Babyl.
  • The MMDF format used by the MMDF mail transfer agent. The MMDF program no longer seems to be actively developed (the last release was in 2000), but many Unix mail user agents still support MMDF-style mailboxes. Class name: MMDF.
  • The MH format used by the NMH mail user agent. This format began with the MH mail reader, which was implemented as a collection of executable commands to be invoked from a Unix shell. MH itself is no longer maintained, but there are several descendants (NMH, exmh) that are still being developed. Class name: MH.
  • Maildir was introduced by the qmail mail transfer agent and is now supported by many MTAs and MUAs. The format was designed for robustness, avoiding the need to lock the mailbox when adding and removing messages. Class name: Maildir.

The mbox, Babyl, and MMDF formats each store an entire mailbox in a single file, so modifying such a mailbox can require rewriting a substantial portion of that file. MH and Maildir mailboxes are directories that store each message in its own file. MH gives each message a sequentially numbered filename (1, 2, 3, ...), so mailboxes may need to be renumbered periodically to remove the gaps left by deleted messages. Maildir uses structured filenames containing the hostname, timestamp, and process ID.
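The difference in on-disk layout is easy to see by creating one mailbox of each kind. The following is a minimal sketch rather than code from the article: it is written for Python 3 (an assumption, since the article targets Python 2.5), the paths and message text are made up for illustration, and it relies on the add() method introduced in Python 2.5.

```python
import mailbox
import os
import tempfile

# Hypothetical scratch directory and message text, for illustration only.
base = tempfile.mkdtemp()
text = 'From: alice@example.com\nSubject: layout demo\n\nHello.\n'

# mbox: every message is appended to one file.
mbox_path = os.path.join(base, 'demo.mbox')
box = mailbox.mbox(mbox_path)
box.add(text)
box.flush()
box.close()

# Maildir: a directory with tmp/, new/, and cur/ subdirectories;
# each message is stored in its own file.
maildir_path = os.path.join(base, 'demo-maildir')
maildir = mailbox.Maildir(maildir_path)
maildir.add(text)
maildir.close()

mbox_is_file = os.path.isfile(mbox_path)
maildir_subdirs = sorted(os.listdir(maildir_path))
new_message_files = os.listdir(os.path.join(maildir_path, 'new'))
```

After this runs, mbox_path names a regular file, while maildir_path names a directory whose new/ subdirectory holds one file for the single message that was added.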

Which mailbox format should you use for a new application? If you're writing a program that fits into a larger mail-processing system, your choice may be constrained by compatibility with existing code or mailboxes. If you're not so restricted, I strongly suggest that you use Maildir for three reasons:

  • Unlike the single-file formats, modifying a mailbox doesn't require copying and rewriting a sizable fraction of the mailbox's data, so Maildir is faster.
  • Your data is safer because locks aren't necessary. Incorrect use of locks can result in serious corruption and data loss.
  • Storing each message in a separate file makes Unix tools such as grep more useful. It's also easier to alter messages in such mailboxes, because manually deleting or editing message files doesn't break any invariants or internal data structures.

Most of the examples in this article will demonstrate the generic interface and will work with any mailbox format. I'll also show a few format-specific methods.

Basic Interface

The mailbox module is part of Python's standard library, but adding and deleting messages requires using Python 2.5. As of this writing, the Ubuntu Feisty and Fedora 7 Linux distributions use Python 2.5 as their standard version. Many distributors are still using older versions of Python; Mac OS includes Python 2.3, for example. Whatever your platform, you can download the latest version from http://www.python.org.

You open a mailbox by creating an instance of the corresponding class.

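import mailbox

# Open a Maildir mailbox, representing each message as a MaildirMessage.
src_mbox = mailbox.Maildir('maildir-mbox', factory=mailbox.MaildirMessage)
# Open an mbox file; create=False raises NoSuchMailboxError if it is missing.
dest_mbox = mailbox.mbox('new-m-box', create=False)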

All mailbox classes accept the same constructor parameters, Mailbox(path, factory=None, create=True):

  • path: the path to the mailbox file or directory.
  • factory: a function or class that accepts a file-like object containing a message and returns an object representing the message. Other modules in the Python standard library make a few different choices available, or you can provide your own custom class.
  • create: by default, if the specified path doesn't exist, an empty mailbox will be created. If this automatic creation is undesirable, you can pass a false value for this parameter and a mailbox.NoSuchMailboxError will be raised if the mailbox doesn't exist.
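The create and factory parameters can be exercised with a short experiment. This is a sketch under stated assumptions, not code from the article: it targets Python 3 (the article targets Python 2.5), the path is a throwaway temporary file, and subject_only is a hypothetical factory invented here to show that a factory may return any representation it likes.

```python
import email
import mailbox
import os
import tempfile

# Hypothetical path that does not exist yet.
path = os.path.join(tempfile.mkdtemp(), 'archive.mbox')

# create=False: refuse to create a missing mailbox.
try:
    mailbox.mbox(path, create=False)
    opened = True
except mailbox.NoSuchMailboxError:
    opened = False
exists_after_failed_open = os.path.exists(path)


def subject_only(fileobj):
    """Hypothetical factory: represent a message by its Subject header."""
    return email.message_from_bytes(fileobj.read())['Subject']


# create=True (the default): the empty mailbox is created on demand.
box = mailbox.mbox(path, factory=subject_only)
exists_after_create = os.path.exists(path)

key = box.add('Subject: hello\n\nBody text.\n')
represented_as = box[key]   # whatever the factory returned
box.close()
```

In real code the factory would more likely be one of the message classes discussed in this article; returning a bare string simply makes the factory's role visible.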

Python has two sets of classes for working with email messages. One set is provided by the older rfc822 module. The other set is provided by the email package, which is more modern and provides easier access to MIME features. The email package is developed by Barry Warsaw, who also maintains the Mailman list manager; email is therefore well-tested and strives for RFC compliance.

For backward compatibility with previous versions of the mailbox module, some older classes still default to using a class from the rfc822 module, rfc822.Message, to represent messages. I won't discuss these older mailbox classes in this article because they could only read mailboxes, not write to them. Most of the classes covered here will use the newer class, email.message.Message, to represent messages. (Later, we'll learn that some mailbox formats have their own message classes to add format-specific methods; these custom classes are all subclasses of email.message.Message.)

There's one important exception that doesn't default to the newer module: the Maildir class. The Maildir class didn't need a complete rewrite to support modifying mailboxes, so its default message class was left as rfc822.Message to avoid breaking existing code.

When using Maildir, which class should you use? The default rfc822.Message class is fine for many purposes, but I recommend using the MaildirMessage class provided by the mailbox module, because it derives from email.message.Message and also adds Maildir-specific methods. To do this, you must open a Maildir mailbox like this:

src_mbox = mailbox.Maildir('maildir-archive',
                           factory=mailbox.MaildirMessage) 
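As a rough illustration of what the Maildir-specific methods offer, the sketch below stores a message with a flag and reads the flag back. It is not taken from the article: it assumes Python 3, uses a temporary directory and invented message text, and fetches the message with get_message(), which reads the subdirectory and flags back from the message's filename.

```python
import email.message
import mailbox
import os
import tempfile

# Hypothetical scratch Maildir, for illustration only.
path = os.path.join(tempfile.mkdtemp(), 'maildir-archive')
box = mailbox.Maildir(path, factory=mailbox.MaildirMessage)

msg = mailbox.MaildirMessage(
    'From: alice@example.com\nSubject: flags demo\n\nHi.\n')
msg.set_subdir('cur')   # file the message under cur/ rather than new/
msg.add_flag('S')       # 'S' is the Maildir flag for "seen"
key = box.add(msg)

fetched = box.get_message(key)
is_email_message = isinstance(fetched, email.message.Message)
subject = fetched['Subject']       # ordinary email API
subdir = fetched.get_subdir()      # Maildir-specific
flags = fetched.get_flags()        # Maildir-specific
box.close()
```

The fetched object answers both the generic header lookups inherited from email.message.Message and the Maildir-only questions about subdirectory and flags.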

Returning to the generic interface: messages are retrieved by unique keys, so mailbox objects support some of the same methods as Python's dictionaries:

# Raises KeyError if there's no message with the given key.
msg = src_mbox[key]

# 'msg' is set to None if there's no message with that key.
msg = src_mbox.get(key)

The mailbox's .keys(), .values(), and .items() methods return lists of keys, messages, or (key, message) pairs. .iterkeys(), .itervalues(), and .iteritems() are variants that return iterators instead of lists.
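The dictionary-style methods can be tried on a small throwaway mailbox. This is a minimal sketch assuming Python 3 (the article targets Python 2.5); the path, the message text, and the key 'no-such-key' are invented for illustration.

```python
import mailbox
import os
import tempfile

# Hypothetical scratch mailbox with two messages.
path = os.path.join(tempfile.mkdtemp(), 'dict-demo.mbox')
box = mailbox.mbox(path)
first = box.add('Subject: first\n\nOne.\n')
second = box.add('Subject: second\n\nTwo.\n')

keys = sorted(box.keys())
subjects = sorted(msg['Subject'] for msg in box.values())
pairs = sorted((key, msg['Subject']) for key, msg in box.items())

# .get() returns None for an unknown key; indexing raises KeyError.
missing = box.get('no-such-key')
try:
    box['no-such-key']
    raised = False
except KeyError:
    raised = True

box.close()
```

The keys returned by add() are the same keys later reported by .keys() and accepted by indexing and .get().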

You can also iterate over a mailbox directly using the for statement:

for msg in src_mbox:
    # do something with each message 'msg'
    ...

Note that mailbox iteration is different from dictionary iteration; dictionaries iterate over the keys, but mailboxes iterate over the values (that is, the messages themselves).
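The contrast with dictionaries can be checked directly. The sketch below assumes Python 3 (the article targets Python 2.5) and uses a temporary mbox file with made-up messages.

```python
import mailbox
import os
import tempfile

# Hypothetical scratch mailbox with two messages.
path = os.path.join(tempfile.mkdtemp(), 'iteration-demo.mbox')
box = mailbox.mbox(path)
box.add('Subject: one\n\nFirst.\n')
box.add('Subject: two\n\nSecond.\n')

# Iterating over a mailbox yields message objects...
subjects = [msg['Subject'] for msg in box]

# ...whereas iterating over a dictionary yields its keys.
headers = {'Subject': 'one'}
dict_items = [item for item in headers]

# To walk the keys of a mailbox, ask for them explicitly.
mailbox_keys = list(box.iterkeys())
box.close()
```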

Another cautionary note: a new Message object is created every time you retrieve the message for a particular key, which requires some disk I/O and parsing work. If speed is paramount, take care not to fetch the same message repeatedly; fetch it once and keep a reference to the result.

This also means that modifying a Message object returned from a mailbox method doesn't modify the on-disk mailbox, and future retrievals of that message won't see any changes made to the object. You must explicitly write changes back to the mailbox (later, we'll see how that's done).
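Both points can be observed in a few lines. This is a sketch assuming Python 3 (the article targets Python 2.5), with an invented temporary mailbox; it deliberately stops short of writing the change back.

```python
import mailbox
import os
import tempfile

# Hypothetical scratch mailbox holding a single message.
path = os.path.join(tempfile.mkdtemp(), 'copy-demo.mbox')
box = mailbox.mbox(path)
key = box.add('Subject: original\n\nBody.\n')

# Each retrieval parses the stored message again and builds a new object.
first_copy = box[key]
second_copy = box[key]
distinct_objects = first_copy is not second_copy

# Changing the in-memory object leaves the stored message untouched.
first_copy.replace_header('Subject', 'changed in memory')
in_memory_subject = first_copy['Subject']
stored_subject = box[key]['Subject']
box.close()
```

Here stored_subject still reflects the message as it was added, because nothing has written first_copy back to the mailbox.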

Working with email.message.Message Objects

Assuming you're using the email.message.Message class, you have a rich API for examining and modifying email messages.

Messages are represented as a tree of Message objects. Instances of Message consist of a set of email headers (Subject, From, To, etc.) and a payload. The payload can be either a string, which will be the body of the message, or a list of Message objects, which are treated as the entities making up a MIME multipart message.
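The two kinds of payload can be seen by parsing a plain message and a multipart one. The following sketch assumes Python 3 (the article targets Python 2.5); the addresses, the boundary string XYZ, and the message bodies are invented for illustration.

```python
import email

# A simple message: the payload is a string holding the body.
simple = email.message_from_string('Subject: plain\n\nJust text.\n')
simple_is_multipart = simple.is_multipart()
simple_payload = simple.get_payload()

# A multipart message: the payload is a list of Message objects.
raw = (
    'From: alice@example.com\n'
    'To: bob@example.com\n'
    'Subject: tree demo\n'
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    '\n'
    '--XYZ\n'
    'Content-Type: text/plain\n'
    '\n'
    'First part.\n'
    '--XYZ\n'
    'Content-Type: text/plain\n'
    '\n'
    'Second part.\n'
    '--XYZ--\n'
)
multi = email.message_from_string(raw)
subject = multi['Subject']
multi_is_multipart = multi.is_multipart()
parts = multi.get_payload()
part_bodies = [part.get_payload().strip() for part in parts]
```

Each element of parts is itself a Message with its own headers and payload, which is what makes the structure a tree.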

Message instances can be converted to their string representation by calling Python's built-in str() function. (A debugging tip: simply printing a message with print msg will result in the contents of the message being printed; this can result in a lot more output than you expect. Use print repr(msg) to get output of the form <email.message.Message instance at 0x572b48>, which is more helpful when trying to dump variables.)
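The difference between the two representations is sketched below. The example assumes Python 3, where print is a function and the exact repr() text differs slightly from the Python 2.5 output quoted above; the message itself is made up.

```python
import email

msg = email.message_from_string('Subject: hello\n\nBody text.\n')

# str() serializes the whole message: headers, blank line, and body.
as_text = str(msg)

# repr() gives only a short description of the object.
short = repr(msg)

body_in_text = 'Body text.' in as_text
body_in_repr = 'Body text.' in short
```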
