Despite competition from the Web and instant messaging, email is still a primary communication medium for most Internet users, and many people have sizable archives of email messages. Sometimes these archived messages reside on an IMAP server, in which case you can use Python's imaplib module for scripted access to email. More often, messages will be downloaded to your computer and stored locally on disk.
Archived mail can be stored in many different file formats. The mailbox module in the Python standard library supports reading and modifying five of them, all of which are primarily used on Unix systems.
The mailbox module was greatly enhanced in Python 2.5. For a long time the mailbox module only supported reading mailboxes, not modifying them. Gregory K. Johnson, as his project for Google's 2005 Summer of Code, wrote code for adding and deleting messages; these new features went into Python 2.5, released in September 2006.
Five mailbox formats are supported; each format is implemented by a different class.
mbox
Babyl
MMDF
MH
Maildir
The contents of the mbox, Babyl, and MMDF formats are all stored in a single file; therefore modifying such mailboxes can require rewriting a substantial portion of the file. MH and Maildir mailboxes use a directory and store each message in a single file. MH gives each message a sequentially numbered filename (1, 2, 3, ...), so mailboxes may need to be periodically renumbered to remove gaps where messages were deleted. Maildir uses structured filenames containing the hostname, timestamp, and process ID.
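To make the single-file versus directory distinction concrete, here is a minimal sketch that writes the same two messages to an mbox file and to a Maildir directory and then inspects what ends up on disk. It assumes Python 3 syntax rather than the Python 2.5 used elsewhere in this article, and the paths and message texts are invented for the illustration.

```python
import mailbox
import os
import tempfile

# Work in a throwaway directory so no real mail is touched.
workdir = tempfile.mkdtemp()

# mbox: one file holds every message.
mbox_path = os.path.join(workdir, 'archive.mbox')
mbox = mailbox.mbox(mbox_path)
mbox.add('From: a@example.com\nSubject: first\n\nBody one\n')
mbox.add('From: b@example.com\nSubject: second\n\nBody two\n')
mbox.close()

# Maildir: a directory containing tmp/, new/, and cur/ subdirectories,
# with one file per message.
maildir_path = os.path.join(workdir, 'archive-maildir')
maildir = mailbox.Maildir(maildir_path)
maildir.add('From: a@example.com\nSubject: first\n\nBody one\n')
maildir.add('From: b@example.com\nSubject: second\n\nBody two\n')
maildir.close()

mbox_is_file = os.path.isfile(mbox_path)
maildir_subdirs = sorted(os.listdir(maildir_path))
# Messages added as plain strings are delivered into new/.
message_files = os.listdir(os.path.join(maildir_path, 'new'))

print(mbox_is_file, maildir_subdirs, len(message_files))
```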
Which mailbox format should you use for a new application? If you're writing a program that fits into a larger mail-processing system, your choice may be constrained by compatibility with existing code or mailboxes. If you're not so restricted, I strongly suggest that you use Maildir for three reasons:
grep more useful. It's also easier to alter messages in such mailboxes, because manually deleting or editing message files doesn't break any invariants or internal data structures.
Most of the examples in this article will demonstrate the generic interface and will work with any mailbox format. I'll also show a few format-specific methods.
The mailbox module is part of Python's standard library, but adding and deleting messages requires using Python 2.5. As of this writing, the Ubuntu Feisty and Fedora 7 Linux distributions use Python 2.5 as their standard version. Many distributors are still using older versions of Python; Mac OS includes Python 2.3, for example. Whatever your platform, you can download the latest version from http://www.python.org.
You open a mailbox by creating an instance of the corresponding class.
import mailbox
src_mbox = mailbox.Maildir('maildir-mbox', factory=mailbox.MaildirMessage)
dest_mbox = mailbox.mbox('new-m-box', create=False)
All mailbox classes support the same initial parameters, Mailbox(path, factory=None, create=True):
path: the path to the mailbox file or directory.
factory: a function or class that can take a string containing a message and will return an object representing the message. Other modules in the Python standard library make a few different choices available, or you can provide your own custom class.
create: by default, if the specified path doesn't exist, an empty mailbox will be created. If this automatic creation is undesirable, you can pass a false value for this parameter and a mailbox.NoSuchMailboxError will be raised if the mailbox doesn't exist.
Python has two sets of classes for working with email messages. One set is provided by the older rfc822 module. The other set is provided by the email package, which is more modern and provides easier access to MIME features. The email package is developed by Barry Warsaw, who also maintains the Mailman list manager; email is therefore well-tested and strives for RFC compliance.
For backward compatibility with previous versions of the mailbox module, some older classes still default to using a class from the rfc822 module, rfc822.Message, to represent messages. I won't discuss these older mailbox classes in this article because they could only read mailboxes, not write to them. Most of the classes covered here will use the newer class, email.message.Message, to represent messages. (Later, we'll learn that some mailbox formats have their own message classes to add format-specific methods; these custom classes are all subclasses of email.message.Message.)
There's one important exception that doesn't default to the newer module: the Maildir class. The Maildir class didn't need a complete rewrite to support modifying mailboxes, so its default message class was left as rfc822.Message to avoid breaking existing code.
When using Maildir, which class should you use? The default rfc822.Message class is fine for many purposes, but I recommend using the MaildirMessage class provided by the mailbox module, because it derives from email.message.Message and also adds Maildir-specific methods. To do this, you must open a Maildir mailbox like this:
src_mbox = mailbox.Maildir('maildir-archive',
                           factory=mailbox.MaildirMessage)
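The effect of the create parameter is easy to check in isolation. The following sketch assumes Python 3 syntax, and the path name is invented; it tries to open a Maildir that doesn't exist with create=False, and then opens it again with the default create=True.

```python
import mailbox
import os
import tempfile

workdir = tempfile.mkdtemp()
missing_path = os.path.join(workdir, 'no-such-maildir')

# With create=False, opening a nonexistent mailbox raises an error
# instead of silently creating an empty one.
try:
    mailbox.Maildir(missing_path, factory=mailbox.MaildirMessage,
                    create=False)
    opened = True
except mailbox.NoSuchMailboxError:
    opened = False

exists_after_failed_open = os.path.exists(missing_path)

# With the default create=True, the directory structure is created.
box = mailbox.Maildir(missing_path, factory=mailbox.MaildirMessage)
exists_after_create = os.path.isdir(missing_path)
box.close()

print(opened, exists_after_failed_open, exists_after_create)
```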
To return to the generic interface, messages are retrieved by unique keys, and mailbox objects therefore support some of the same methods as Python's dictionaries:
# Raises KeyError if there's no message with the given key.
msg = src_mbox[key]
# 'msg' is set to None if there's no message with that key.
msg = src_mbox.get(key)
The mailbox's .keys(), .values(), and .items() methods return lists of keys, messages, or (key, message) pairs. .iterkeys(), .itervalues(), and .iteritems() are variants that return iterators instead of lists.
You can also iterate over a mailbox directly using the for statement:
for msg in src_mbox:
    # do something with each message 'msg'
    ...
Note that mailbox iteration is different from dictionary iteration; dictionaries iterate over the keys, but mailboxes iterate over the values (that is, the messages themselves).
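A short sketch of the dictionary-style interface follows. It assumes Python 3 syntax and uses a temporary mbox file with made-up messages; the point is that keys() yields keys, while a plain for loop over the mailbox yields the messages themselves.

```python
import mailbox
import os
import tempfile

box = mailbox.mbox(os.path.join(tempfile.mkdtemp(), 'demo.mbox'))
key_a = box.add('Subject: alpha\n\nFirst body\n')
key_b = box.add('Subject: beta\n\nSecond body\n')

# keys() returns the keys, as with a dictionary.
keys = list(box.keys())

# Iterating over the mailbox itself yields messages, not keys.
subjects_by_iteration = [msg['Subject'] for msg in box]

# The same messages, fetched explicitly by key.
subjects_by_key = [box[key]['Subject'] for key in keys]

# get() returns None for a missing key instead of raising KeyError.
missing = box.get('no-such-key')
box.close()

print(keys, subjects_by_iteration, missing)
```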
Another cautionary note: a new Message object is created every time you retrieve the message for a particular key, requiring some amount of disk I/O and parsing work. If speed is paramount, be careful to not repeatedly fetch the same message via .get(); fetch it once only.
This also means that modifying a Message object returned from a mailbox method doesn't modify the on-disk mailbox, and future retrievals of that message won't see any changes made to the object. You must explicitly write changes back to the mailbox (later, we'll see how that's done).
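Both cautions can be seen in a few lines. This is only a sketch, assuming Python 3 syntax and a temporary mbox file: two retrievals of the same key give two separate objects, and an edit to one of them is invisible to the mailbox until the object is stored back.

```python
import mailbox
import os
import tempfile

box = mailbox.mbox(os.path.join(tempfile.mkdtemp(), 'copies.mbox'))
key = box.add('Subject: original\n\nBody\n')

# Each retrieval parses the message again and builds a new object.
first = box[key]
second = box[key]
distinct_objects = first is not second

# Editing the in-memory object leaves the mailbox untouched.
first.replace_header('Subject', 'edited')
subject_after_edit = box[key]['Subject']

# Only storing the object back changes what later retrievals see.
box[key] = first
subject_after_store = box[key]['Subject']
box.close()

print(distinct_objects, subject_after_edit, subject_after_store)
```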
Assuming you're using the email.message.Message class, you have a rich API for examining and modifying email messages.
Messages are represented as a tree of Message objects. Instances of Message consist of a set of email headers (Subject, From, To, etc.) and a payload. The payload can be either a string, which will be the body of the message, or a list of Message objects, which are treated as the entities making up a MIME multipart message.
Message instances can be converted to their string representation by calling Python's built-in str() function. (A debugging tip: simply printing a message with print msg will result in the contents of the message being printed; this can result in a lot more output than you expect. Use print repr(msg) to get output of the form <email.message.Message instance at 0x572b48>, which is more helpful when trying to dump variables.)
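The tree structure is easiest to see with a small multipart message. The sketch below assumes Python 3 syntax, and the message text is invented; it parses a two-part message and examines the payloads.

```python
import email

# A hand-written two-part MIME message (illustrative content only).
raw = (
    'From: amk@example.com\n'
    'Subject: Tree demo\n'
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/mixed; boundary="BOUNDARY"\n'
    '\n'
    '--BOUNDARY\n'
    'Content-Type: text/plain\n'
    '\n'
    'Plain text part\n'
    '--BOUNDARY\n'
    'Content-Type: text/html\n'
    '\n'
    '<p>HTML part</p>\n'
    '--BOUNDARY--\n'
)
msg = email.message_from_string(raw)

# For a multipart message the payload is a list of Message objects.
top_is_multipart = msg.is_multipart()
parts = msg.get_payload()
part_types = [part.get_content_type() for part in parts]

# For a leaf part the payload is a string holding the body.
first_body = parts[0].get_payload()

# walk() visits the root and then every part beneath it.
walked_types = [part.get_content_type() for part in msg.walk()]

print(top_is_multipart, part_types, walked_types)
```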
Headers are accessed by treating the Message object like a dictionary. The Message object preserves the case of header names, but headers are retrieved case-insensitively. Some usage examples:
print 'Number of headers:', len(msg)
# Retrieve the message ID
msg_id = msg['Message-ID']
print msg_id
# Equivalent: retrieval is case-insensitive.
msg_id = msg['message-id']
print msg_id
# Retrieve subject header, with a default value if
# the header isn't present.
subject = msg.get('Subject', 'No subject provided')
# Retrieve Cc header, returning None if it's not present.
cc = msg.get('Cc')
# Check if a header is present
if 'X-Virus-Scan' not in msg:
    print 'Doing virus scan...'
    # Add header value
    msg['X-Virus-Scan'] = 'OK'
There can be multiple header lines using the same field name; the "Received" header is the most common example. When there are multiple header lines, the get() method will return a single arbitrarily chosen line. The get_all() method returns a list of all header values. Assigning a header with msg[name] = value never overwrites or deletes existing lines; it will always add a new header line.
Here are some examples using the Received header:
# Get list of received headers
recv_trail = msg.get_all('Received')
for line in recv_trail:
    print line
# Add a new received line; this line will come
# last when the message headers are converted to
# a string.
msg['Received'] = 'from host1 by host2'
# Delete all received headers
del msg['Received']
# Replace the Subject header
msg.replace_header('Subject', '***SPAM*** ' + subject)
See the email package's documentation for a full list of methods and attributes.
Putting everything together for an example, the following script uses the mailbox module and Andrew Dalke's PyRSS2Gen to generate an RSS feed from a mailbox.
#!/usr/bin/env python2.5
import sys, mailbox, datetime
from email import utils
import PyRSS2Gen
if len(sys.argv) == 1:
    print 'Usage: %s <maildir-1> <maildir-2> ...' % sys.argv[0]
    sys.exit(1)
# Create RSS feed
feed = PyRSS2Gen.RSS2(title='Mailbox feed',
                      link='http://maildir-feed.example.com',
                      description=('Contains mailboxes: ' +
                                   ' '.join(sys.argv[1:])
                                   ))
# Loop over specified mailboxes
for filename in sys.argv[1:]:
    mbox = mailbox.Maildir(filename)
    for msg in mbox:
        subject = msg.get('Subject', "")
        guid_hdr = msg['Message-ID']
        # Parse the date, turning it into a datetime object.
        date_hdr = msg.get('Date')
        if date_hdr is None:
            date = datetime.datetime.now()
        else:
            (y, month, d,
             h, min, sec,
             _, _, _, tzoffset) = utils.parsedate_tz(date_hdr)
            date = datetime.datetime(y, month, d, h, min, sec)
        # Create RSS item and add it to the feed
        item = PyRSS2Gen.RSSItem(pubDate=date, title=subject,
                                 guid=PyRSS2Gen.Guid(guid_hdr, isPermaLink=False))
        feed.items.append(item)
# Write generated RSS to stdout
feed.write_xml(sys.stdout, encoding='utf-8')
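The script above unpacks tzoffset but never uses it, so each date is taken exactly as written in the header, and an unparseable Date header would make the tuple unpacking fail. If you want all dates normalized to UTC, one possible approach is sketched here. It assumes Python 3 syntax, and the helper name parse_date_utc is hypothetical, not part of the email package.

```python
import datetime
from email import utils

def parse_date_utc(date_hdr):
    """Return a naive UTC datetime for a Date header, or None if the
    header cannot be parsed. (Hypothetical helper for illustration.)"""
    parsed = utils.parsedate_tz(date_hdr)
    if parsed is None:
        return None
    # mktime_tz() applies the header's timezone offset and returns
    # seconds since the epoch in UTC.
    timestamp = utils.mktime_tz(parsed)
    utc = datetime.datetime.fromtimestamp(timestamp, datetime.timezone.utc)
    return utc.replace(tzinfo=None)

# 01:35:15 at UTC-4 corresponds to 05:35:15 UTC.
example = parse_date_utc('Thu, 21 Jun 2007 01:35:15 -0400')
print(example)
```

A caller could fall back to the current time when the helper returns None, as the script already does for a missing Date header.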
The examples so far have only examined mailboxes without changing their contents. Let's look at how to add, change, and remove messages from a mailbox.
Before making any alteration to a mailbox, always call the mailbox's lock() method to acquire a lock on the mailbox. When the changes are complete, call the flush() method to write changes to disk and the unlock() method to release the lock on the mailbox.
Different mailbox classes will make changes to the underlying disk files at different times. For the single-file mailbox formats, new messages are added immediately but deleted messages aren't removed until you call flush(). On the other hand, directory-based formats, such as Maildir and MH, make all their changes immediately and the flush() method doesn't actually do anything. Thanks to Maildir's lock-free design, lock() and unlock() also don't have to do anything.
It's good practice to always call these methods, even if some or all of these methods are no-ops. Someone might come along and modify your code, or pass in an mbox object where you're expecting a Maildir object. People are very protective of their e-mail, so you should always be careful to avoid duplicating or, worse, deleting messages.
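The lock, modify, flush, unlock sequence can be sketched as follows. This assumes Python 3 syntax on a Unix-like system and uses a temporary mbox file; the try/finally arrangement is one reasonable way to make sure the lock is released, not the only one.

```python
import mailbox
import os
import tempfile

box = mailbox.mbox(os.path.join(tempfile.mkdtemp(), 'locked.mbox'))

box.lock()              # acquire the lock before touching the mailbox
try:
    key = box.add('Subject: locked write\n\nBody\n')
    box.flush()         # force pending changes out to disk
finally:
    box.unlock()        # release the lock even if an error occurred

count = len(box)
box.close()
print(count, 'message in mailbox')
```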
Messages are added by calling the mailbox's add(msg) method. The msg parameter can be one of several different types:
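a string containing the message.
a file-like object from which the message can be read.
an instance of mailbox.Message or email.message.Message.
To copy messages from one mailbox to another, you could write: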
src_mbox.lock()   # Optional -- but a good idea
dest_mbox.lock()  # Not optional!
count = 0         # Number of messages copied so far
try:
    for msg in src_mbox:
        new_key = dest_mbox.add(msg)
        count += 1
finally:
    src_mbox.close()
    dest_mbox.close()
print count, 'messages copied'
The close() method does three things: it calls the flush() method to force any unwritten changes to disk, then calls the unlock() method to free the mailbox lock, and, finally, closes any open files.
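As a rough illustration of the different argument types that add() accepts, the following sketch adds one message as a string, one as an email.message.Message object, and one from a file-like object. It assumes Python 3 syntax, where a file-like object should deliver bytes; the path and message contents are invented.

```python
import email.message
import io
import mailbox
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'dest.mbox')
dest = mailbox.mbox(path)
dest.lock()
try:
    # 1. A string containing the whole message.
    dest.add('Subject: from a string\n\nBody\n')

    # 2. A Message object built in memory.
    built = email.message.Message()
    built['Subject'] = 'from a Message object'
    built.set_payload('Body\n')
    dest.add(built)

    # 3. A file-like object (binary mode under Python 3).
    dest.add(io.BytesIO(b'Subject: from a file-like object\n\nBody\n'))
finally:
    dest.close()    # flushes, unlocks, and closes the file

# Reopen the mailbox to confirm what was written.
reopened = mailbox.mbox(path)
subjects = [msg['Subject'] for msg in reopened]
reopened.close()
print(subjects)
```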
Messages can be deleted by a del mbox[key] statement or by calling the remove(key) method. The following example deletes all messages that have been marked as spam:
src.lock()
try:
    for key, msg in src.iteritems():
        subject = msg.get('Subject', 'No subject provided')
        if subject.startswith('***SPAM***'):
            print 'Deleting', subject
            del src[key]
finally:
    src.close()
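A cautious variant, sketched here under the assumption of Python 3 syntax, collects the keys to delete first and removes them afterwards, so the mailbox is never modified in the middle of an iteration. It also shows discard(), which behaves like remove() but ignores keys that don't exist. The mailbox path and subjects are made up.

```python
import mailbox
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'inbox.mbox')
box = mailbox.mbox(path)
for subject in ('***SPAM*** Buy now', 'Meeting notes', '***SPAM*** You won'):
    box.add('Subject: %s\n\nBody\n' % subject)

box.lock()
try:
    # First pass: decide what to delete.
    spam_keys = [key for key, msg in box.items()
                 if msg.get('Subject', '').startswith('***SPAM***')]
    # Second pass: delete by key.
    for key in spam_keys:
        box.remove(key)
    # discard() does not raise KeyError for a missing key.
    box.discard('no-such-key')
finally:
    box.close()

reopened = mailbox.mbox(path)
remaining = [msg['Subject'] for msg in reopened]
reopened.close()
print(remaining)
```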
Because Message instances are newly generated every time a message is retrieved, modifying the instance doesn't affect the contents of the mailbox. To change the contents of a message, you must use dictionary-style assignment (dest_mbox[key] = new_msg) to update the message. The following example removes Re: prefixes from subject lines in a mailbox:
try:
    src.lock()
    for key, msg in src.iteritems():
        subject = msg.get('Subject', '')
        if subject.startswith('Re: '):
            msg.replace_header('Subject', subject[4:])
            src[key] = msg
finally:
    src.close()
Some of the mailbox formats support additional information attached to each message:
In the mbox and MMDF formats, messages are separated by From lines that aren't part of the message headers or body. These From lines contain the envelope sender (the sender address supplied in the SMTP transaction) and the time the message was received. (These From lines may not necessarily have the same value as the RFC-2822 From header of the message.)
The get_from() method returns the contents of the From line (not including the From prefix), and set_from(from_addr, [time_value]) sets a new value for the line. To write the change to disk, the modified message object must be stored in the mailbox again:
msg = mbox_mailbox[key] # Retrieve message
from_ = msg.get_from()
# Returns a value such as "amk@example.com Thu Jun 21 01:35:15 2007"
# A value of True records the current time as the timestamp.
msg.set_from('bjm@example.com', True)
# Or you can supply a tuple suitable for passing to time.strftime().
msg.set_from('bjm@example.com', (2007, 6, 21, 1, 48, 53, 3, 172, 0))
mbox_mailbox[key] = msg # Store message
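Here is a self-contained version of the same idea, assuming Python 3 syntax and a temporary mbox file. The message is added as a plain string without a From line of its own, so the mailbox supplies a default envelope sender; the sketch then replaces it and checks that the new value survives a round trip to disk.

```python
import mailbox
import os
import tempfile

box = mailbox.mbox(os.path.join(tempfile.mkdtemp(), 'envelope.mbox'))
key = box.add('From: amk@example.com\nSubject: envelope demo\n\nBody\n')

msg = box[key]
# No envelope sender was supplied, so a default one was written.
original_from = msg.get_from()

# Set a new envelope sender with an explicit timestamp tuple.
msg.set_from('bjm@example.com', (2007, 6, 21, 1, 48, 53, 3, 172, 0))
box[key] = msg      # store the modified message

stored_from = box[key].get_from()
box.close()
print(stored_from)
```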
Some mail readers that use mbox format follow a convention of using either the Status or X-Status fields to record which messages have been read, answered, or marked as deleted. For example, D stands for deleted messages, R for read, and A for answered messages. Multiple flags can be set on a message at the same time. The get_flags() method returns a string of characters containing the flags that have been set. The set_flags(flag_string) method takes a string and sets the specified flags, unsetting all other flags. For example:
flags = msg.get_flags()
if 'R' not in flags:
    # Unread message
    print msg
    msg.set_flags('R' + flags)
    mbox_mailbox[key] = msg
The Maildir format also supports setting single-character flags on messages, but the flag characters are different: S is for seen messages, R is for replied, and T is for trashed. msg.get_flags() still returns a string containing the currently set flags, and msg.set_flags(flag_string) still replaces them all. To change individual flags while leaving the others untouched, msg.add_flag(flag_str) sets the supplied flags and msg.remove_flag(flag_str) removes them.
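A minimal sketch of Maildir flag handling follows. It assumes Python 3 syntax, where a Maildir opened without a factory argument returns MaildirMessage objects, and it uses a temporary directory on a Unix-like system.

```python
import mailbox
import os
import tempfile

box = mailbox.Maildir(os.path.join(tempfile.mkdtemp(), 'flags-maildir'))
key = box.add('Subject: flag demo\n\nBody\n')

msg = box[key]
initial_flags = msg.get_flags()     # no flags on a fresh message

msg.add_flag('S')       # mark as seen
msg.add_flag('R')       # mark as replied
msg.remove_flag('R')    # change of mind: not replied after all
box[key] = msg          # store the message so the flags are recorded

stored_flags = box[key].get_flags()
box.close()
print(repr(initial_flags), repr(stored_flags))
```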
When using the Maildir format, messages are initially written into a tmp/ subdirectory, and once the message file has been completely written, it's moved into either the new/ or cur/ subdirectory. The get_subdir() method of a MaildirMessage instance returns the name of the subdirectory containing the message, and the set_subdir(new_dir) method records a new directory for the message. You still must store the modified message in the Maildir instance by doing maildir_mbox[key] = msg.
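Moving a message from new/ to cur/ might look like the sketch below. It assumes Python 3 syntax and a temporary Maildir on a Unix-like system; the directory name is invented.

```python
import mailbox
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'subdir-maildir')
box = mailbox.Maildir(path)
key = box.add('Subject: subdir demo\n\nBody\n')

msg = box[key]
initial_subdir = msg.get_subdir()   # messages added as strings land in new/

msg.set_subdir('cur')   # record the new location on the message object
box[key] = msg          # storing the message moves the file

stored_subdir = box[key].get_subdir()
files_in_new = os.listdir(os.path.join(path, 'new'))
files_in_cur = os.listdir(os.path.join(path, 'cur'))
box.close()
print(initial_subdir, stored_subdir)
```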
MH mailboxes support the creation of sequences, which are subsets of the messages in the mailbox. You might have one sequence that lists personal e-mails and another that contains work-related messages, for example. Sequences are identified by strings. The MH format defines a few standard sequence names such as unseen, flagged, and replied.
Messages are added to and removed from sequences by calling add_sequence(seqname) and remove_sequence(seqname) methods on the message objects. To write the change to disk, the modified message object must be stored in the mailbox again:
msg = mh_mailbox[key] # Retrieve message
msg.add_sequence('work')
msg.remove_sequence('unseen')
mh_mailbox[key] = msg # Store message
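For a runnable version, the following sketch creates a temporary MH mailbox, tags a message with a sequence, and reads the sequences back from both the mailbox and the message. It assumes Python 3 syntax, and the sequence name 'work' is just an example.

```python
import mailbox
import os
import tempfile

box = mailbox.MH(os.path.join(tempfile.mkdtemp(), 'mh-demo'))
key = box.add('Subject: sequence demo\n\nBody\n')

msg = box[key]
msg.add_sequence('work')
box[key] = msg          # storing the message records the sequence on disk

# The mailbox maps each sequence name to a list of message keys.
sequences = box.get_sequences()
# A freshly retrieved message lists the sequences it belongs to.
stored = box[key].get_sequences()
box.close()
print(sequences, stored)
```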
The author would like to thank the following people for commenting on the first draft of this article: Aahz, Tal Einat, Jeffrey C. Jacobs, and Roy Smith. Any errors are the responsibility of the author.
A. M. Kuchling has 11 years of experience as a software developer and is a long-time member of the Python development community. Some of his Python-related work includes writing and maintaining several standard library modules, writing a series of "What's new in Python 2.x" articles and other documentation, planning the 2006 and 2007 PyCon conferences, and acting as a director of the Python Software Foundation. Andrew graduated with a B.Sc. in Computer Science from McGill University in 1995. His web page is at http://www.amk.ca.
Copyright © 2007 O'Reilly Media, Inc.