Mail-Filtering Techniques

by Emmanuel Dreyfus

Internet email used to be a great tool, but it's currently crippled by annoyances -- unsolicited commercial email (also known as spam), viruses, and denial-of-service mail floods. Filtering email has become common. Today, it's hardly possible to use email and make your address public without some sort of spam and virus filtering.

Some filtering can be done on the client and some must be done on the server. This article studies how to filter email efficiently and without sacrificing reliability. A second part will focus on how to write a mail filter for Sendmail, the most comprehensive and widespread mail server on the Internet.

Internet Mail Background

Internet email today is built around three protocols: Simple Mail Transfer Protocol (SMTP), Post Office Protocol version 3 (POP3), and Internet Message Access Protocol version 4 (IMAP4). Older protocols for distributing email were common in the past. Some are still in operation for certain setups, but we will not cover them.

Several kinds of software implement the three protocols above. The Mail User Agent (MUA) is the mail client that the end user sees. MUAs include software such as Eudora, Netscape mail, Mozilla Thunderbird, Pegasus mail, or the infamous Outlook Express, a popular target for Windows viruses. There are also several webmail packages available, where the MUA runs on a web server, with just its user interface running on the user's machine through a web browser.

An MUA sends messages to mail servers using SMTP, and retrieves its mail from mail servers using POP3 or IMAP4. POP3 and IMAP4 serve the same purpose, with IMAP4 being more recent and more feature-rich than POP3.

The mail server runs a Mail Transfer Agent (MTA), such as Sendmail, Postfix, Qmail or Exim. Its job is to receive messages through SMTP and to route them to their destinations. If the destination is a local mailbox, the MTA uses a Mail Delivery Agent (MDA) to drop the message in a mailbox. If the destination is another machine, the MTA uses SMTP to contact another MTA on the destination mail server. This MTA will in turn use an MDA to store the message in a mailbox.

When it has to deliver mail to a remote domain, an MTA needs to know the address of that domain's mail server. This information is available through the MX (Mail eXchanger) record of the DNS. DNS acts as a directory, explaining how to send mail to any mail-enabled domain. The mail server listed in the MX record is usually known as the MX server.
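As a rough illustration of how an MTA might use MX records, here is a minimal Python sketch. It assumes the DNS lookup has already been done and that the records arrive as (preference, hostname) pairs; the function name and the example.org hostnames are made up for the example. The lowest preference value is tried first, and a domain with no MX record at all is contacted directly.

```python
from typing import List, Tuple


def order_mx_hosts(domain: str, mx_records: List[Tuple[int, str]]) -> List[str]:
    """Return the hosts to try for `domain`, best candidate first.

    `mx_records` is assumed to hold (preference, hostname) pairs that were
    already fetched from the DNS; this sketch performs no lookup itself.
    """
    if not mx_records:
        # No MX record: fall back to the domain name itself.
        return [domain]
    # Lower preference values are tried first. A real MTA would also
    # randomize hosts that share the same preference value.
    return [host for _preference, host in sorted(mx_records)]
```

Calling order_mx_hosts("example.org", [(20, "backup.example.org"), (10, "mx1.example.org")]) puts mx1.example.org ahead of the backup server.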

Of course, in the real world, things are usually much more complicated. There can be multiple front-end MX servers that relay mail to multiple mail servers in the inner network.

Filtering Mail Today

Mail filtering can be done at three different levels.

Filtering at the MUA Level

The user's MUA can filter out viruses using anti-virus software, and spam using various techniques, including learning filters that try to learn what the user considers spam. While this method is the most flexible for the user, it suffers from drawbacks: the client has to download junk messages before it can filter them, and the filtering tools must be maintained on every client machine.

Filtering at the MDA Level

MDA-level filtering solves those two problems. Because it happens on the server, it can destroy junk messages before the client has to download them. Maintaining centralized tools is also much easier.

MDA-level filtering has been the most popular way of filtering on the mail server for a while. It is easy because any MTA has to call an external program for local mail delivery. On UNIX systems, this means invoking a command such as mail, mail.local, or procmail. Filtering is easy -- just invoke a filter instead of the MDA and have the filter invoke the real MDA after it has completed its job.
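The wrapper idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the path to the real MDA, the placeholder junk test, and all function names are invented for the example, and dropping junk silently is just one possible policy.

```python
import subprocess
from typing import Callable

# Assumption: the real MDA lives here and reads the message on stdin.
REAL_MDA = ["/usr/bin/procmail"]


def looks_like_junk(raw_message: bytes) -> bool:
    """Placeholder test; a real filter would call a scanner here."""
    return b"X-Junk-Test: yes" in raw_message


def deliver_with_mda(raw_message: bytes) -> int:
    """Hand the message to the real MDA and return its exit status."""
    return subprocess.run(REAL_MDA, input=raw_message).returncode


def filter_and_deliver(
    raw_message: bytes,
    is_junk: Callable[[bytes], bool] = looks_like_junk,
    deliver: Callable[[bytes], int] = deliver_with_mda,
) -> int:
    """Act as the MDA: filter first, then invoke the real MDA."""
    if is_junk(raw_message):
        # Exit status 0 tells the MTA the message was handled,
        # so the junk message is dropped without any notification.
        return 0
    return deliver(raw_message)
```

The MTA would be configured to run a small script that reads the message from standard input and exits with the value returned by filter_and_deliver().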

This approach worked for some time, but turned out to have one major drawback: there's no user interaction at the MDA level! The filter cannot ask the user if it is safe to destroy a given message that could be spam or contain a virus. When the MDA finds a suspicious message, it must notify the sender or the receiver so that the mail system remains reliable. If it notifies the receiver, the user will be flooded by notification of non-delivery instead of being flooded by viruses and spam. This changes foreign junk mail into locally generated junk mail and does not really solve the problem.

If the MDA notifies the sender, then we hit a loophole in SMTP: it does not require sender authentication. It is trivial to forge an email with a random source address. Nowadays, any spam or virus will have a forged return address. Sending a notification to a forged sender results in mail being sent to a nonexistent address in the best case, and to a person that did not send the spam or virus, in the worst case. This is not acceptable.

The only other option when working at the MDA level is to drop junk messages silently. This is not satisfying on the reliability front, since a false positive will be dropped without notification.

The other big problem with MDA-level filtering is that chaining different filters (for instance, an anti-virus and an anti-spam), is not straightforward at all. You must tell the first filter to invoke the second instead of the real MDA and the second to invoke the real MDA. This can be quite complicated and difficult to troubleshoot.

MX-Level Filtering

Fortunately, a solution exists to these problems. We cannot trust the sender's email address, so we must avoid relying on it for notification of non-delivery. If filtering occurs at the MTA level on the domain's MX, then we are talking directly with whatever is delivering the message: a real MTA on a real mail server, a spam engine, or a virus.

SMTP works with the concept of message responsibility. A server receives a message and flushes it to disk. Then it tells the sending server that it accepted the message. This transfers responsibility for the message to the receiver, and the sender may remove the message from its mail queue.

If for some reason (disk full, system crash, load too high, network outage, recipient unknown) the receiver MTA does not tell the sender that it accepted the message, the message remains the sender's responsibility. If the problem was permanent (recipient unknown, for instance), then the sender will have to send a notification of non-delivery. If the error was temporary, then the sender ought to retry sending the message later.
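In SMTP terms, the receiver's answer is a three-digit reply code: 2xx means accepted, 4xx means temporary failure, and 5xx means permanent failure. The following Python sketch shows how a sending MTA might map that answer to the behavior described above. The function name and the action strings are illustrative only.

```python
from typing import Optional


def sender_action(reply_code: Optional[int]) -> str:
    """Decide what the sending MTA does after a delivery attempt.

    `reply_code` is the SMTP reply to the end of the message, or None
    when no reply was received (crash, network outage, timeout).
    """
    if reply_code is None:
        # No acknowledgment: the message is still our responsibility.
        return "retry later"
    if 200 <= reply_code < 300:
        # The receiver took responsibility for the message.
        return "remove from queue"
    if 400 <= reply_code < 500:
        # Temporary error (disk full, load too high).
        return "retry later"
    if 500 <= reply_code < 600:
        # Permanent error (recipient unknown).
        return "send non-delivery notification"
    raise ValueError(f"unexpected SMTP reply code: {reply_code}")
```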

If we refuse a message that comes from a spam engine or a virus, we directly tell the spam engine or the virus that we refuse the message. The sender is not a real MTA, so it likely does not do error handling. Its job is just to flood the Internet with junk mail. It will probably generate no notification at all. This is good.

If the sender is a real MTA, it will send a delivery status notification to the return address it holds for the message, which should be the address of the actual message sender. We have exactly what we want.

Limitations of MX Filtering

There are, however, some minor problems with MX-level filtering.

Survey of Mail Filtering Techniques

Various filtering techniques have been invented to work around spam floods. In turn, spammers invented various techniques to work around the anti-spam techniques. Here are the most widely used anti-spam techniques:

Local Blacklists

This is probably the first technique ever used. The system administrator maintains a list of spammers' IP addresses.

On the pro side, it's useful against open relays and new spammers that don't yet know how to distribute their attacks.

On the con side, blacklists are hard to maintain. Spammers are a fast-moving target. They typically don't reuse the same IP twice in a row when sending spam to a domain.
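A local blacklist check can be as simple as the following Python sketch, which uses the standard ipaddress module. The listed networks are documentation addresses chosen for the example, not real spam sources.

```python
import ipaddress

# Hypothetical entries maintained by the system administrator.
BLACKLIST = [
    ipaddress.ip_network("192.0.2.0/24"),     # a whole network
    ipaddress.ip_network("198.51.100.7/32"),  # a single host
]


def is_blacklisted(client_ip: str, blacklist=BLACKLIST) -> bool:
    """Return True if the connecting IP falls inside a listed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in blacklist)
```

The maintenance burden mentioned above shows up here as the need to keep BLACKLIST current by hand.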

Distributed Blacklists

The next step in the spam war was distributed blacklists. Sites can share blacklists and they are usually implemented through DNS. Many sites will reject messages from an IP after it appears in a distributed blacklist.
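DNS-based blacklists are conventionally queried by reversing the octets of the client IP address and appending the name of the list's DNS zone; if the resulting name resolves, the address is listed. Here is a minimal Python sketch of that convention for IPv4. The zone dnsbl.example.org is a placeholder, not a real list.

```python
import ipaddress
import socket


def dnsbl_query_name(client_ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a blacklist zone."""
    octets = str(ipaddress.IPv4Address(client_ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(client_ip: str, zone: str = "dnsbl.example.org") -> bool:
    """Return True if the blacklist zone has an entry for the address.

    Performs a real DNS lookup; a resolution failure counts as not listed.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(client_ip, zone))
    except socket.gaierror:
        return False
    return True
```

For example, checking 192.0.2.99 against the placeholder zone means looking up 99.2.0.192.dnsbl.example.org.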

On the pro side, this is simple to use.

There are a few cons, however:


Whitelists

Some heavily spammed sites refuse anything except what comes from friendly IP addresses.

On the pro side, this removes 100 percent of the spam from other IP addresses.

On the con side, it cuts you off from potentially legitimate people not on the whitelist.

Content Filtering and Bayesian Filtering

While all the previous techniques rely on lists of IP addresses, content filtering tries to identify the message as spam by analyzing the content. Spam messages usually contain commercial messages and forged sender information, which makes classification possible. Bayesian filtering uses feedback from users about what is and isn't spam, and tries to score words as spammish or non-spammish. It then scores messages based on how spammish they look.
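To make the scoring idea concrete, here is a toy naive Bayes word scorer in Python. It is a teaching sketch, not the algorithm of any particular filter: the class name, the whitespace tokenizer, and the add-one smoothing are all choices made for the example.

```python
import math
from collections import Counter


class WordScorer:
    """Toy Bayesian scorer trained on user feedback ("spam" or "ham")."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        """Record the words of a message the user labeled."""
        self.words[label].update(text.lower().split())
        self.messages[label] += 1

    def _word_probability(self, word: str, label: str, vocabulary: int) -> float:
        # Add-one smoothing so unseen words never give a zero probability.
        total = sum(self.words[label].values())
        return (self.words[label][word] + 1) / (total + vocabulary + 1)

    def spam_probability(self, text: str) -> float:
        """Return a score between 0 and 1; higher means more spammish."""
        vocabulary = len(set(self.words["spam"]) | set(self.words["ham"]))
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for word in text.lower().split():
            log_odds += math.log(
                self._word_probability(word, "spam", vocabulary)
                / self._word_probability(word, "ham", vocabulary)
            )
        return 1 / (1 + math.exp(-log_odds))
```

After training on a few labeled messages, a message made of words seen mostly in spam scores above 0.5 and one made of words seen mostly in legitimate mail scores below it.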

The pro argument is compelling. This technique is very promising, because it can adapt to individual views of what is spam. Moreover, because of the learning approach, the filters can evolve if the nature of the spam changes.

On the con side, spammers quickly learned to work around Bayesian filtering by inserting "positive" words into their messages, and by disguising bad words. Fighting on this front would ultimately require the filter to do semantic analysis of messages, something that is not really practical yet.


PGP Signatures

This is an extreme measure -- "don't accept any mail unless a trusted PGP key has signed it or the signer's key is in a trusted PKI repository."

It has a big pro, in that it removes 100 percent of the anonymous spam. (Unless spammers invade PGP keyrings or steal keys, it's very handy.)

On the con side, you can only receive messages from people who sign their emails.

Per-Recipient Addresses

This scheme uses dozens of email addresses, a different one for each person or entity with whom you exchange mail. When you start receiving spam on an address, you drop it.
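One way to picture the bookkeeping is the Python sketch below. It assumes the mail server supports "plus" sub-addressing (user+tag@domain), which is only one possible way to create such addresses; the class and method names are invented for the example.

```python
import re


class AddressBook:
    """Track one address per correspondent and which ones were dropped."""

    def __init__(self, user: str, domain: str):
        self.user = user
        self.domain = domain
        self.issued = {}      # address -> correspondent it was given to
        self.revoked = set()  # addresses that started receiving spam

    def address_for(self, correspondent: str) -> str:
        """Create the address to hand out to one correspondent."""
        tag = re.sub(r"[^a-z0-9]+", "-", correspondent.lower()).strip("-")
        address = f"{self.user}+{tag}@{self.domain}"
        self.issued[address] = correspondent
        return address

    def revoke(self, address: str) -> None:
        """Drop an address once spam arrives on it."""
        self.revoked.add(address)

    def accepts(self, address: str) -> bool:
        """Return True if mail to this address should still be delivered."""
        return address in self.issued and address not in self.revoked
```

Because each address maps back to a single correspondent, looking up a spammed address in issued shows who it was originally given to.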

On the pro side, if you receive spam on one address, you have a pretty good idea who sold your address.

There are two cons:

Sender Acknowledgment

Every time you receive an email, a robot handles it. The robot queues the message and sends a challenge to the sender. The challenge is usually just a message with a cookie that asks the sender to reply to confirm that she actually sent the message. When the robot receives the acknowledgment, it delivers the original mail to you.
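The robot's queue can be sketched in Python as follows. Sending the challenge email itself is left out; the class name and the use of a random token as the cookie are assumptions made for the example.

```python
import secrets
from typing import Optional


class ChallengeQueue:
    """Hold incoming messages until the sender confirms them."""

    def __init__(self):
        self.pending = {}  # cookie -> queued message

    def hold(self, sender: str, message: str) -> str:
        """Queue a message and return the cookie to send in the challenge.

        The caller is expected to mail the cookie to `sender`.
        """
        cookie = secrets.token_hex(16)
        self.pending[cookie] = message
        return cookie

    def confirm(self, cookie: str) -> Optional[str]:
        """Release the message matching an acknowledged cookie.

        Returns None for an unknown or already used cookie.
        """
        return self.pending.pop(cookie, None)
```

A message whose sender never answers simply stays in pending, so a real robot would also need to expire old entries.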

The pros seem good. Fans claim that this removes 100 percent of spam. And it does tend to hide the fallout when people forge your address in spam and viruses.

The cons are many, however:

Real-Time Sender Address Checking

When a message arrives, the mail server tries to validate the sender address before accepting it. This validation attempts to send a message to the sender address. If the mail server for the sender address's domain responds with an invalid address error, the server can reject the original message.
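Such a check might look like the following Python sketch built on the standard smtplib module. It assumes the sender domain's MX host is already known, stops after the RCPT command without sending any message body, and uses an empty envelope sender for the probe; the function name and the smtp_factory parameter are inventions of the example.

```python
import smtplib


def sender_rejected(address: str, mx_host: str, smtp_factory=smtplib.SMTP) -> bool:
    """Return True if the sender's own mail server refuses the address.

    `mx_host` is assumed to be the MX server of the address's domain.
    `smtp_factory` exists so the function can be exercised without a network.
    """
    server = smtp_factory(mx_host)
    try:
        server.helo()
        server.mail("")  # empty envelope sender, as used for bounces
        code, _text = server.rcpt(address)
    finally:
        server.quit()
    # Only a permanent error proves the address invalid. A 2xx or 4xx
    # answer proves nothing, since some servers accept now and reject later.
    return 500 <= code < 600
```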

The pro of this approach is that it can remove or reduce forgery.

The con is that some servers will accept a message even if the address is invalid, rejecting the message later.


Greylisting

The idea behind greylisting is that spammers generally do not try to resend a message if they receive a temporary failure error. When the mail server receives a message, it refuses it with a temporary error and remembers the delivery attempt for the recipient email address, source email address, and source IP. The next time the sender server attempts to send the message, the destination server will accept it. If the message was spam, then the sender will probably never try to resend it.

A server that does greylisting may also refuse a message until some time has elapsed since the first attempt. This forces spammers to stay at the same IP address for a while before the receiver will accept their junk mail.
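The mechanism fits in a short Python sketch. The class name, the in-memory dictionary, and the five-minute default delay are assumptions for the example; a production greylist would persist its data and expire old entries.

```python
import time
from typing import Optional


class Greylist:
    """Remember (source IP, sender, recipient) triplets and delay new ones."""

    def __init__(self, min_delay: float = 300.0):
        self.min_delay = min_delay  # seconds to wait after the first attempt
        self.first_seen = {}        # triplet -> time of first attempt

    def check(self, client_ip: str, sender: str, recipient: str,
              now: Optional[float] = None) -> str:
        """Return "tempfail" (answer with a 4xx error) or "accept"."""
        now = time.time() if now is None else now
        triplet = (client_ip, sender, recipient)
        first = self.first_seen.get(triplet)
        if first is None:
            # First attempt for this triplet: remember it and refuse.
            self.first_seen[triplet] = now
            return "tempfail"
        if now - first < self.min_delay:
            # Retried too soon: keep refusing until the delay has elapsed.
            return "tempfail"
        return "accept"
```

A legitimate MTA that retries after the delay gets through on a later attempt, while a sender that never retries is never accepted.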

There are two pros:

The con is that it introduces some delay on legitimate mail delivery.

Sender Policy Framework (SPF)

SPF is not a spam filtering technique. It is an anti-forgery technique. Using SPF, a domain can publish the list of machines that can send email on behalf of the domain. The list can be closed (hosts not listed by SPF records may not send legitimate mail from the domain), or open (hosts not listed by SPF records may send legitimate mail from the domain).
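SPF records are published as DNS text records such as "v=spf1 ip4:192.0.2.0/24 -all". The Python sketch below evaluates a deliberately tiny subset of the syntax: only ip4 mechanisms and the final all mechanism, with every other term ignored. Reading a closed list as a record ending in -all and an open list as one ending in ?all is an interpretation made for the example, as are the function name and sample records.

```python
import ipaddress

# Result associated with each qualifier prefix; "+" is the default.
RESULTS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def check_spf(record: str, client_ip: str) -> str:
    """Evaluate a simplified SPF record against the connecting IP."""
    terms = record.split()
    if not terms or terms[0] != "v=spf1":
        return "none"  # not an SPF record
    address = ipaddress.ip_address(client_ip)
    for term in terms[1:]:
        qualifier = "+"
        if term[0] in RESULTS:
            qualifier, term = term[0], term[1:]
        if term.startswith("ip4:"):
            network = ipaddress.ip_network(term[4:], strict=False)
            if address in network:
                return RESULTS[qualifier]
        elif term == "all":
            return RESULTS[qualifier]
        # Other mechanisms (a, mx, include, ...) are skipped in this sketch.
    return "neutral"
```

With the closed record above, a host inside 192.0.2.0/24 gets "pass" and any other host gets "fail"; replacing -all with ?all turns that last answer into "neutral".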

SPF will never stop the forgery of domains that don't implement SPF. SPF can, however, be used as a tool to reduce the side effects of other filtering techniques. For example, you can skip greylisting for SPF-compliant senders.

Emmanuel Dreyfus is a system and network administrator in Paris, France, and is currently a developer for NetBSD.


Copyright © 2017 O'Reilly Media, Inc.