Internet email used to be a great tool, but it's currently crippled with annoyances -- unsolicited commercial email (also known as spam), viruses, and denial-of-service mail floods. Filtering email has become common. Today, it's hardly possible to use email and make your address public without some sort of spam and virus filtering tools.
Some filtering can be done on the client and some must be done on the server. This article studies how to filter email efficiently and without sacrificing reliability. A second part will focus on how to write a mail filter for Sendmail, the most comprehensive and widespread mail server on the Internet.
Internet email today is built around three protocols: Simple Mail Transfer Protocol (SMTP), Post Office Protocol version 3 (POP3), and Internet Message Access Protocol version 4 (IMAP4). Older protocols for distributing email were common in the past. Some are still in operation for certain setups, but we will not cover them.
Several kinds of software implement the three protocols above. The Mail User Agent (MUA) is the mail client that the end user sees. MUAs include software such as Eudora, Netscape mail, Mozilla Thunderbird, Pegasus mail, or the infamous Outlook Express, a popular target for Windows viruses. There are also several webmail packages available, where the MUA runs on a web server, with just its user interface running on the user's machine through a web browser.
An MUA sends messages to mail servers using SMTP, and retrieves its mail from mail servers using POP3 or IMAP4. POP3 and IMAP4 serve the same purpose, with IMAP4 being more recent and more feature-rich than POP3.
The mail server runs a Mail Transfer Agent (MTA), such as Sendmail, Postfix, Qmail or Exim. Its job is to receive messages through SMTP and to route them to their destinations. If the destination is a local mailbox, the MTA uses a Mail Delivery Agent (MDA) to drop the message in a mailbox. If the destination is another machine, the MTA uses SMTP to contact another MTA on the destination mail server. This MTA will in turn use an MDA to store the message in a mailbox.
When it has to reach a mail server for a remote domain, an MTA needs to know the address of the mail server for that domain. This information is available through the MX (Mail eXchanger) record in the DNS. DNS acts as a directory, explaining how to send mail to any mail-enabled domain. The mail server listed in the MX record is usually known as the MX server.
Of course, in the real world, things are usually much more complicated. There can be multiple front-end MX servers that relay mail to multiple mail servers in the inner network.
Mail filtering can be done at three different levels.
The user's MUA can filter out viruses using anti-virus software, and spam using various techniques, including learning filters that try to learn what the user considers spam. While this method is the most flexible for the user, it suffers from several drawbacks:
The client must download (at least part of) all messages to filter. This can be quite annoying when dealing with virus floods, especially for users who connect via dial-up.
On large networks, the system administrator must ensure that anti-virus definitions are up to date on many workstations. This can range from painful for a large corporation to impossible for an Internet Service Provider (ISP).
MDA-level filtering solves those two problems. Because it happens on the server, it can destroy junk messages before the client has to download them. Maintaining centralized tools is also much easier.
MDA-level filtering has been the most popular way of filtering on the mail server for a while. It is easy because any MTA has to call an external program for local mail delivery. On UNIX systems, this means invoking a separate delivery command. To filter, just invoke a filter instead of the MDA and have the filter invoke the real MDA after it has completed its job.
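The wrapper idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `looks_like_junk` and its `X-Hypothetical-Junk` marker are placeholders for a real scanner, and the real MDA command line is assumed to be passed as the wrapper's arguments.

```python
import subprocess
import sys
from typing import Sequence


def looks_like_junk(message: bytes) -> bool:
    """Placeholder check (hypothetical); a real filter would call a scanner."""
    return b"X-Hypothetical-Junk: yes" in message


def filter_and_deliver(message: bytes, mda_command: Sequence[str]) -> int:
    """Filter the message, then pipe it to the real MDA.

    Returns the exit status that the wrapper should report to the MTA.
    """
    if looks_like_junk(message):
        # Reporting success without delivering means the message is dropped.
        return 0
    # Hand the untouched message to the real MDA on its standard input.
    result = subprocess.run(list(mda_command), input=message)
    return result.returncode


def main() -> int:
    # Assumption: the MTA feeds the message on stdin and the wrapper's
    # arguments are the real MDA command line.
    return filter_and_deliver(sys.stdin.buffer.read(), sys.argv[1:])
```

Installed as the delivery program, the script would end with `sys.exit(main())`, so that a failure of the real MDA is still reported to the MTA.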
This approach worked for some time, but turned out to have one major drawback: there's no user interaction at the MDA level! The filter cannot ask the user if it is safe to destroy a given message that could be spam or contain a virus. When the filter finds a suspicious message, it must notify the sender or the receiver so that the mail system remains reliable. If it notifies the receiver, the user will be flooded by notifications of non-delivery instead of being flooded by viruses and spam. This changes foreign junk mail into locally generated junk mail and does not really solve the problem.
If the MDA notifies the sender, then we hit a loophole in SMTP: it does not require sender authentication. It is trivial to forge an email with a random source address. Nowadays, any spam or virus will have a forged return address. Sending a notification to a forged sender results in mail being sent to a nonexistent address in the best case, and to a person that did not send the spam or virus, in the worst case. This is not acceptable.
The only other option when working at the MDA level is to drop junk messages silently. This is not satisfying on the reliability front, since a false positive will be dropped without notification.
The other big problem with MDA-level filtering is that chaining different filters (for instance, an anti-virus filter and an anti-spam filter) is not straightforward at all. You must tell the first filter to invoke the second instead of the real MDA, and the second to invoke the real MDA. This can be quite complicated and difficult to troubleshoot.
Fortunately, a solution exists to these problems. We cannot trust the sender's email address, so we must avoid relying on it for notification of non-delivery. If filtering occurs at the MTA level on the domain's MX, then we are talking directly with whatever is sending the message: a real MTA on a real mail server, a spam engine, or a virus.
SMTP works with the concept of message responsibility. A server receives a message and flushes it to disk. Then it tells the sending server that it accepted the message. This transfers the responsibility for the message to the receiver, and the sender may remove the message from its mail queue.
If for some reason (disk full, system crash, load too high, network outage, recipient unknown) the receiver MTA does not tell the sender that it accepted the message, the message remains the sender's responsibility. If the problem was permanent (recipient unknown, for instance), then the sender will have to send a notification of non-delivery. If the error was temporary, then the sender ought to retry sending the message later.
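From the sending MTA's point of view, this rule can be sketched as a small decision function. SMTP reply codes in the 2xx range mean acceptance, 4xx a temporary failure, and 5xx a permanent failure; the function name and the returned labels are illustrative assumptions, not part of any real MTA.

```python
from typing import Optional


def sender_action(reply_code: Optional[int]) -> str:
    """Decide what a sending MTA does after the end of the message data.

    reply_code is the receiver's SMTP reply, or None if no reply arrived
    (crash, network outage, timeout).
    """
    if reply_code is None:
        # No acknowledgment: the message is still the sender's responsibility.
        return "retry_later"
    if 200 <= reply_code < 300:
        # Accepted: responsibility moved to the receiver.
        return "remove_from_queue"
    if 400 <= reply_code < 500:
        # Temporary error (disk full, load too high): keep it queued.
        return "retry_later"
    if 500 <= reply_code < 600:
        # Permanent error (recipient unknown): tell the message's sender.
        return "notify_sender"
    raise ValueError(f"unexpected final reply code: {reply_code}")
```

For example, `sender_action(550)` yields `"notify_sender"`, while `sender_action(451)` and a missing reply both yield `"retry_later"`.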
If we refuse a message that comes from a spam engine or a virus, we directly tell the spam engine or the virus that we refuse the message. The sender is not a real MTA, so it likely does not do error handling. Its job is just to flood the Internet with junk mail. It will probably generate no notification at all. This is good.
If the sender is a real MTA, it will send a delivery status notification to the envelope sender, which should be the address of the person who actually sent the message. We have exactly what we want.
There are, however, some minor problems with MX-level filtering.
It works only at the MX level. If the domain MX accepts a junk email with a forged sender address, there is no point in refusing it at another internal mail server, even at the MTA level. The sending server will be your domain's MX, and it will send a delivery status notification to the forged address.
If spam or a virus is sent to a mailing list where a recipient has a filtering MX, then the list owner will receive the delivery status notification. This happens because the list server accepted the junk email once and sent it to the list. The problem does not exist on the filtering MX but on the list server, and should really be fixed there. The only other way to avoid this problem is to drop messages silently, which is unacceptable on the reliability front.
Additionally, list maintainers can use MDA- or MUA-based filters to deal automatically with delivery status notifications. These use a standard format that is easy to handle automatically.
When messages are relayed through forwarding to a filtering MX, we have the same issue. The mail server that forwards the junk mail will send a delivery status notification to a possibly forged address. Again the problem is not on the filtering MX, but on the forwarding host that accepted some junk mail, so it should be fixed there.
The same problem occurs again when receiving spam or viruses from an open relay. Again, the fix is the same. Fix or blacklist the open relay.
Various filtering techniques have been invented to cope with spam floods. In turn, spammers invented various techniques to work around the anti-spam techniques. Here are the most widely used anti-spam techniques:
Blacklisting is probably the first technique ever used. The system administrator maintains a list of spammers' IP addresses and refuses mail coming from them.
On the pro side, it's useful against open relays and new spammers that don't yet know how to distribute their attacks.
On the con side, blacklists are hard to maintain. Spammers are a fast-moving target. They typically don't reuse the same IP twice in a row when sending spam to a domain.
The next step in the spam war was distributed blacklists. Sites can share blacklists and they are usually implemented through DNS. Many sites will reject messages from an IP after it appears in a distributed blacklist.
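A DNS-based blacklist is usually queried by reversing the octets of the client's IPv4 address and looking up the resulting name under the blacklist's zone; if the name resolves, the address is listed. The sketch below shows that convention. The zone `dnsbl.example.org` is a made-up placeholder, and a real deployment would use the zone and return-code policy documented by the chosen blacklist.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a blacklist zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str = "dnsbl.example.org") -> bool:
    """Return True if the blacklist zone has an address record for this IP.

    The default zone is hypothetical. A failed lookup is treated as
    "not listed", which also covers an unreachable blacklist.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

Treating a failed lookup as "not listed" is a deliberate choice here: if the blacklist is unreachable, mail keeps flowing instead of being refused.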
On the pro side, this is simple to use.
There are a few cons, however:
You're trusting someone else's idea of what is spam.
Blacklists can be poisoned with wrong information. Spammers can spread viruses that send spam through an ISP's SMTP server, causing that server to appear on a blacklist even though you really want to accept mail from it.
Distributed blacklists are susceptible to Distributed Denial of Service (DDoS) attacks.
Some heavily spammed sites use a whitelist: they refuse anything except what comes from a list of friendly IP addresses.
On the pro side, this removes 100 percent of the spam from other IP addresses.
On the con side, it cuts you off from potentially legitimate people not on the whitelist.
While all the previous techniques rely on lists of IP addresses, content filtering tries to identify the message as spam by analyzing the content. Spam messages usually contain commercial messages and forged sender information, which makes classification possible. Bayesian filtering uses feedback from users about what is and isn't spam, and tries to score words as spammish or non-spammish. It then scores messages based on how spammish they look.
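To make the word-scoring idea concrete, here is a deliberately small naive Bayes-style scorer. It is a sketch, not how any particular filter works: the tokenizer, the smoothing constants, and the class name `WordScorer` are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter


class WordScorer:
    """Toy learning filter: counts in how many spam/ham messages a word appears."""

    def __init__(self) -> None:
        self.spam_words: Counter = Counter()
        self.ham_words: Counter = Counter()
        self.spam_msgs = 0
        self.ham_msgs = 0

    @staticmethod
    def tokens(text: str) -> set:
        # Each distinct lowercase word counts once per message.
        return set(re.findall(r"[a-z']+", text.lower()))

    def train(self, text: str, is_spam: bool) -> None:
        """Record user feedback about one message."""
        if is_spam:
            self.spam_words.update(self.tokens(text))
            self.spam_msgs += 1
        else:
            self.ham_words.update(self.tokens(text))
            self.ham_msgs += 1

    def spam_probability(self, text: str) -> float:
        """Combine per-word evidence into a score between 0 and 1."""
        log_odds = math.log((self.spam_msgs + 1) / (self.ham_msgs + 1))
        for word in self.tokens(text):
            # Add-one smoothing so unseen words never give a zero probability.
            p_spam = (self.spam_words[word] + 1) / (self.spam_msgs + 2)
            p_ham = (self.ham_words[word] + 1) / (self.ham_msgs + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in range
        return 1.0 / (1.0 + math.exp(-log_odds))
```

After training on a handful of messages marked by the user, `spam_probability` returns values near 1 for text dominated by words seen mostly in spam and near 0 for text dominated by words seen mostly in legitimate mail.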
The pro argument is compelling. This technique is very promising, because it can adapt to individual views of what is spam. Moreover, because of the learning approach, the filters can evolve if the nature of the spam changes.
On the con side, spammers quickly learned to work around Bayesian filtering by inserting "positive" words into their messages, and by disguising bad words. Fighting on this front would ultimately require the filter to do semantic analysis of messages, something that is not really practical yet.
Requiring signed mail is an extreme measure: "don't accept any mail unless a trusted PGP key has signed it or the signer's key is in a trusted PKI repository."
It has a big pro, in that it removes 100 percent of the anonymous spam. (Unless spammers invade PGP keyrings or steal keys, it's very handy.)
On the con side, you can only receive messages from people that sign their emails.
Another scheme uses dozens of email addresses, a different one for each person or entity with whom you exchange mail. When you start receiving spam on an address, you drop it.
On the pro side, if you receive spam on one address, you have a pretty good idea who sold your address.
There are two cons:
There's quite a bit of overhead on new legitimate senders to send you a message.
It's a pain to manage when exchanging messages with many people.
In a challenge/response system, a robot handles every email you receive. The robot queues the message and sends a challenge to the sender. The challenge is usually just a message with a cookie that asks the sender to reply to confirm that she actually sent the message. When the robot receives the acknowledgment, it delivers the original mail to you.
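The robot's bookkeeping can be sketched as a queue keyed by cookie. This is an illustrative model only: the class and method names are invented, sending the challenge mail itself is left out, and a real robot would also expire old entries.

```python
import secrets
from typing import Dict, Optional, Tuple


class ChallengeQueue:
    """Holds incoming messages until their sender confirms a cookie."""

    def __init__(self) -> None:
        # cookie -> (sender address, original message)
        self.pending: Dict[str, Tuple[str, str]] = {}

    def hold(self, sender: str, message: str) -> str:
        """Queue a message and return the cookie to put in the challenge."""
        cookie = secrets.token_hex(16)  # unguessable random value
        self.pending[cookie] = (sender, message)
        return cookie

    def confirm(self, cookie: str) -> Optional[str]:
        """Release the original message if the cookie is known, else None."""
        entry = self.pending.pop(cookie, None)
        return entry[1] if entry is not None else None
```

A cookie can be redeemed only once, and an unknown cookie releases nothing, so a forged acknowledgment cannot free someone else's queued message without guessing the random value.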
The pros seem good. Fans claim that this removes 100 percent of spam. And it does tend to hide the fallout when people forge your address on spam and viruses.
The cons are many, however:
The sender needs to send an acknowledgment message ("Yes, I really did send you a message!"), and asking for one is a bit rude.
When you receive viruses with forged sender addresses, someone who has nothing to do with the message will receive an acknowledgment request.
Messages with the wrong sender address will spawn undeliverable junk acknowledgment requests.
This causes extra delay on legitimate mail delivery.
You cannot use this technique on messages sent from other robots, such as errors from mail servers.
When a message arrives, the mail server tries to validate the sender address before accepting it. This validation attempts to send a message to the sender address. If the mail server of the sender address's domain responds with an invalid address error, the server can reject the original message.
The pro of this approach is that it can remove or reduce forgery.
The con is that some servers will accept a message even if the address is invalid, rejecting the message later.
The idea of greylists is that spammers never try to resend a message if they receive a temporary failure error. When the mail server receives a message, it refuses with a temporary error and remembers the delivery attempt for the recipient email address, source email address, and source IP. The next time the sender server attempts to send the message, the destination server will accept it. If the message was spam, then the sender will probably never try to resend it.
A server that does greylisting may also refuse a message until some time has elapsed since the first attempt. This forces spammers to stay at the same IP address for a while before the receiver will accept their junk mail.
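The greylisting decision, including the minimum delay, fits in a short class. The sketch below keeps its state in memory and uses illustrative names and a five-minute default delay chosen arbitrarily; a real server would persist the triplets and expire old ones.

```python
import time
from typing import Dict, Optional, Tuple


class Greylist:
    """Remembers (source IP, sender, recipient) triplets and when they first appeared."""

    def __init__(self, min_delay: float = 300.0) -> None:
        self.min_delay = min_delay  # seconds a sender must wait before retrying
        self.first_seen: Dict[Tuple[str, str, str], float] = {}

    def check(self, source_ip: str, sender: str, recipient: str,
              now: Optional[float] = None) -> str:
        """Return "tempfail" to refuse temporarily, or "accept"."""
        if now is None:
            now = time.time()
        key = (source_ip, sender, recipient)
        if key not in self.first_seen:
            # First attempt: remember it and refuse with a temporary error.
            self.first_seen[key] = now
            return "tempfail"
        if now - self.first_seen[key] < self.min_delay:
            # Retried too soon: keep refusing until the delay has elapsed.
            return "tempfail"
        return "accept"
```

Because the source IP is part of the key, a sender that retries from a different address starts over, which is what forces spammers to stay at the same IP address for a while.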
There are two pros:
As of today, it removes 99 percent of the spam with no false positives.
Spammers trying to slip junk past greylisting servers will have to keep the same address for some time, thus improving the efficiency of blacklists.
The con is that it introduces some delay on legitimate mail delivery.
Sender Policy Framework (SPF) is not a spam filtering technique. It is an anti-forgery technique. Using SPF, a domain can publish the list of machines that can send email on behalf of the domain. The list can be closed (hosts not listed by SPF records may not send legitimate mail from the domain), or open (hosts not listed by SPF records may send legitimate mail from the domain).
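In a published SPF record, the closed or open nature of the list shows up in the qualifier on the final `all` term: `-all` tells receivers to fail unlisted hosts, while `~all` and `?all` leave the question open. The helper below reads that qualifier from a record string. It is only a sketch: it ignores `redirect` and `include`, and the records in the usage note are made-up examples using documentation addresses.

```python
def spf_default_result(record: str) -> str:
    """Return the SPF result applied to hosts matched only by the "all" term."""
    results = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF version 1 record")
    for term in terms[1:]:
        term = term.lower()
        qualifier = "+"  # a mechanism without a qualifier means "pass"
        if term[0] in results:
            qualifier, term = term[0], term[1:]
        if term == "all":
            return results[qualifier]
    # Without an "all" term, unmatched hosts get a neutral result.
    return "neutral"
```

With these assumptions, `spf_default_result("v=spf1 mx ip4:192.0.2.0/24 -all")` gives `"fail"`, which corresponds to a closed list, and `spf_default_result("v=spf1 mx ?all")` gives `"neutral"`, which corresponds to an open one.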
SPF will never stop the forgery of domains that don't implement SPF. It can, however, be used as a tool to reduce the side effects of other filtering techniques. For example, you can skip greylisting for SPF-compliant senders.
Emmanuel Dreyfus is a system and network administrator in Paris, France, and is currently a developer for NetBSD.
Copyright © 2009 O'Reilly Media, Inc.