In the first part of this series, we surveyed spam filtering techniques: specifically, where in the electronic mail framework filtering measures operate, and which filtering techniques are currently available.
This article focuses on the development of a spam filter, through the
example of milter-greylist, a greylisting plugin for Sendmail. We assume that
the reader knows the C programming language reasonably well. A basic
understanding of TCP/IP is also useful.
Sendmail made MTA-level filtering easy by introducing the Milter API. Milter
is a contraction of the term "mail filter." Milters are small daemons that
communicate with Sendmail through UNIX sockets or TCP/IP connections. They are
easy to configure; you just need to add a few lines to the
sendmail.cf configuration file. Here is an example for double
filtering by milter-regex and milter-greylist:
O InputMailFilters=regex,greylist
Xregex, S=local:/var/run/milter-regex/sock, F=T
Xgreylist, S=local:/var/milter-greylist/sock, F=T
O Milter.macros.connect=j, _, {daemon_name}, {if_name}, {if_addr}, {client_addr}
O Milter.macros.envfrom=i, {mail_mailer}, {mail_host}, {mail_addr}
O Milter.macros.envrcpt={rcpt_mailer}, {rcpt_host}, {rcpt_addr}
The first line lists the milters to invoke for each message. Here,
filtering first uses regex, then greylist. Those names must correspond to the
next lines, which start with an X.
The X lines define each milter property: how to contact the milter (here, a
local UNIX socket) and what should happen if the milter fails.
(F=T means a temporary error, F=R means a permanent
error, and no F= means pass through as if the filter did not
exist.) Timeout values are optional.
The remaining lines select which Sendmail macros to export to the milter. We will see how to use them when we deal with the actual implementation.
Milters can run on the same machine as Sendmail or on other hosts across the network. This makes it possible to build highly scalable setups, with farms of milter machines and load distributed through rotating DNS or TCP redirection.
Many milters are already available for anti-spam, anti-virus, archival, accounting, and various other purposes. Here is a set of my favorites:
milter-regex
filters mail by applying regular expressions. It can filter out files based on
headers (the Win32 header, for instance) or by extension. Here is a sample of a
milter-regex config file:
reject "Sorry, we do not accept ZIP archives anymore"
body /^(Content-Type: [^;]*; | )name=".*\.zip"/ie
body /^(Content-Disposition: attachment; | )filename=".*\.zip"/ie
It is also extremely useful when dealing with distributed denial-of-service
attacks. If you can find a common pattern in the junk messages, you can filter
them out with milter-regex.
milter-greylist
is an anti-spam tool I wrote. It uses the
greylist method,
and for now, it stops all of the spam without a single false positive.
The principle is simple: on temporary errors, real MTAs wait for a while and
retry sending the message. Spam engines do not. When milter-greylist receives a
message, it refuses it with a temporary error, storing a tuple (source IP,
sender email, recipient email) in a table. On the next attempt, if it finds
the tuple in the table, it accepts the message.
Of course, spammers can start resending their messages. If this happens some day, we can force each message to wait for one hour before being accepted. If the spammer stays at the same address for one hour, the odds are good he will appear in a DNS-based blacklist before the second attempt.
White-listing and auto-white-listing can also reduce the delay on legitimate mail.
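The greylisting decision described above can be sketched in a few lines of C. This is only an illustration of the principle, not milter-greylist's actual code: the gl_ names, the fixed-size table, and the linear search are all assumptions made for the example, and the sketch ignores locking and whitelists.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define GL_ADDRLEN 64   /* hypothetical maximum address length */
#define GL_MAXENT  128  /* hypothetical fixed table size */

enum gl_verdict { GL_TEMPFAIL, GL_ACCEPT };

struct gl_entry {
	uint32_t ip;                    /* source IPv4 address */
	char     from[GL_ADDRLEN + 1];  /* envelope sender */
	char     rcpt[GL_ADDRLEN + 1];  /* envelope recipient */
	time_t   first_seen;            /* time of the first attempt */
};

struct gl_table {
	struct gl_entry ent[GL_MAXENT];
	size_t          count;
	time_t          delay;          /* minimum wait before acceptance */
};

/*
 * Decide what to do with one delivery attempt. An unknown tuple is
 * recorded and refused temporarily; a known tuple is accepted once
 * `delay` seconds have elapsed since it was first seen.
 */
enum gl_verdict
gl_check(struct gl_table *t, uint32_t ip, const char *from,
    const char *rcpt, time_t now)
{
	struct gl_entry *e;
	size_t i;

	for (i = 0; i < t->count; i++) {
		e = &t->ent[i];
		if (e->ip == ip && strcmp(e->from, from) == 0 &&
		    strcmp(e->rcpt, rcpt) == 0)
			return (now - e->first_seen >= t->delay) ?
			    GL_ACCEPT : GL_TEMPFAIL;
	}

	/* First attempt: remember the tuple if there is room left. */
	if (t->count < GL_MAXENT) {
		e = &t->ent[t->count++];
		e->ip = ip;
		snprintf(e->from, sizeof(e->from), "%s", from);
		snprintf(e->rcpt, sizeof(e->rcpt), "%s", rcpt);
		e->first_seen = now;
	}
	return GL_TEMPFAIL;
}
```

With delay set to 0, the second attempt for a tuple is accepted immediately; with delay set to 3600, retries are refused until one hour after the first attempt.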
milter-sender is
a real-time, sender-address validator. It works by trying to send a message to
the sender address of each incoming message. If it receives a temporary error,
it temporarily refuses the incoming message. If it receives a permanent error,
it refuses the incoming message permanently, and so on.
j-chkmail checks
the message for forbidden attachment files and will refuse them. It is very
useful against viruses, and risks fewer false positives than the one-line
regular expression matching done by milter-regex.
There are also various milters to interface Sendmail with AMaViS, SpamAssassin, and many other tools. Web sites such as milter.org feature lists of available milters.
Milters are linked with libmilter, which handles the burden of the
communication with Sendmail. Milter authors just have to use the Milter API,
by including <libmilter/mfapi.h> and by linking with
libmilter. Because libmilter relies on libpthread, libpthread is required in
milter linkage as well.
Writing a milter tends to be surprisingly simple. Start by writing a daemon
that will parse its command-line options, detach to the background, open log
files, and so on. In order to specify the socket that will be used to
communicate with Sendmail, use smfi_setconn():
smfi_setconn(socket)
where socket is a string, usually taken from the command line,
that identifies the location of the socket. For a local socket, you can just
use a filesystem path.
The other required operation is to fill a struct, smfiDesc, with
a collection of callbacks and pass it to libmilter through
smfi_register():
struct smfiDesc smfilter =
{
"greylist", /* filter name */
SMFI_VERSION, /* version code */
SMFIF_ADDHDRS, /* flags */
mlfi_connect, /* connection info filter */
NULL, /* SMTP HELO command filter */
mlfi_envfrom, /* envelope sender filter */
mlfi_envrcpt, /* envelope recipient filter */
NULL, /* header filter */
NULL, /* end of header */
NULL, /* body block filter */
mlfi_eom, /* end of message */
NULL, /* message aborted */
mlfi_close, /* connection cleanup */
};
/* (some code) */
if (smfi_register(smfilter) == MI_FAILURE) {
fprintf(stderr, "%s: smfi_register failed\n", argv[0]);
exit(EX_UNAVAILABLE);
}
Once this is done, the program hands control over to libmilter for good by
calling smfi_main():
return smfi_main();
From now on, every time the server handles an email, libmilter will call
the callbacks we registered through smfi_register(). For instance,
in this example, mlfi_connect() is registered as the connection
callback. Therefore, each time an SMTP client
connects to the machine, libmilter will invoke our
mlfi_connect() function.
Here is the mlfi_connect() function for milter-greylist:
sfsistat
mlfi_connect(ctx, hostname, addr)
SMFICTX *ctx;
char *hostname;
_SOCK_ADDR *addr;
{
struct mlfi_priv *priv;
struct sockaddr_in *addr_in;
if ((priv = malloc(sizeof(*priv))) == NULL)
return SMFIS_TEMPFAIL;
smfi_setpriv(ctx, priv);
bzero((void *)priv, sizeof(*priv));
priv->priv_whitelist = EXF_UNSET;
addr_in = (struct sockaddr_in *)addr;
if ((addr_in != NULL) && (addr_in->sin_family == AF_INET))
priv->priv_addr.s_addr = addr_in->sin_addr.s_addr;
return SMFIS_CONTINUE;
}
We have an opaque context pointer that libmilter will hand us on each
callback for the same SMTP connection. libmilter uses it to store various
pieces of information about the connection, including a user private pointer that we can
use to store our own data. smfi_setpriv() and
smfi_getpriv() set and retrieve this private pointer,
respectively.
milter-greylist's mlfi_connect() starts by allocating some
private memory for a mlfi_priv structure, which is defined like
this:
struct mlfi_priv {
struct in_addr priv_addr;
char priv_from[ADDRLEN + 1];
time_t priv_elapsed;
int priv_whitelist;
char *priv_queueid;
};
Our goal is to retrieve the tuple (source IP, sender email, recipient
email), so mlfi_priv has some storage for this information. In
mlfi_connect(), we store the client IP address in the
priv_addr field of mlfi_priv.
Before moving further, let us look at the anatomy of an SMTP transaction. Lines starting with >>> are sent from the client to the server, and lines starting with <<< are sent from the server to the client.
<<< 220 mx1.example.net ESMTP Sendmail 8.12.10/jtpda-5.4 ready at Fri, 26 Mar 2004 15:23:56 +0100 (CET)
>>> HELO mail.example.com
<<< 250 mx1.example.net Hello mail.example.com [192.0.2.26], pleased to meet you
>>> MAIL FROM: <John.Smith@example.com>
<<< 250 2.1.0 <John.Smith@example.com>... Sender ok
>>> RCPT TO: <Reginald.Wesson@example.net>
<<< 250 2.1.5 <Reginald.Wesson@example.net>... Recipient ok
>>> DATA
<<< 354 Enter mail, end with "." on a line by itself
>>> From: <John.Smith@example.com>
>>> To: <Reginald.Wesson@example.net>
>>> Date: Fri, 26 Mar 2004 15:23:57 +0100 (CET)
>>> Subject: Test
>>>
>>> This is a test message
>>> .
<<< 250 2.0.0 i2QENuV9026193 Message accepted for delivery
>>> QUIT
<<< 221 2.0.0 mx1.example.net closing connection
After mlfi_connect(), libmilter will invoke the following
callbacks:
mlfi_envfrom(), after the MAIL FROM command is sent.
mlfi_envrcpt(), after the RCPT TO command is sent.
mlfi_eom(), after the DATA command is finished.
mlfi_close(), at connection close time.
Additionally, the following checkpoints could have callbacks, if we had registered them (they are the NULL entries in the smfiDesc structure above):
HELO command
header
end of headers
body block
message aborted
The Milter API
documents all of the possible callbacks. In each of the callbacks, it is
possible to call smfi_getpriv() to fetch the pointer to our
private data, so we can read and modify it.
In each callback, the return value can cause Sendmail to reject the message
either permanently (SMFIS_REJECT) or temporarily
(SMFIS_TEMPFAIL). Returning SMFIS_CONTINUE carries on
the transaction.
Depending on the callback, rejecting can have different meanings. For
example, mlfi_envrcpt() is recipient-oriented. It can be called
several times for a message that has several recipients. Rejecting one
recipient will remove that recipient from the recipient list, but the message
will still go through for the other ones.
In message-oriented callbacks, such as mlfi_eom(), rejecting
causes the message to be rejected for all of the recipients.
Whatever happens to the message, the mlfi_close() callback will
be called. This is the place to de-allocate private data. Failure to do so
will cause a memory leak that will eventually crash the milter:
sfsistat
mlfi_close(ctx)
SMFICTX *ctx;
{
struct mlfi_priv *priv;
if ((priv = (struct mlfi_priv *) smfi_getpriv(ctx)) != NULL) {
free(priv);
smfi_setpriv(ctx, NULL);
}
return SMFIS_CONTINUE;
}
We complete our tuple in the mlfi_envrcpt() callback. We
already have the source IP and the sender email stored in the
mlfi_priv structure, and now we finally receive one recipient address.
This is the time for various checks, such as the whitelist check that
milter-greylist's except_filter() function performs. This function
is worth a few words. It walks a chained list of exceptions, looking for an
entry matching the recipient address or the source IP:
LIST_FOREACH(ex, &except_head, e_list) {
if (ex->e_type != E_RCPT)
continue;
if (emailcmp(rcpt, ex->e_rcpt) == 0) {
found = 1;
break;
}
}
The LIST_FOREACH macro comes from
<sys/queue.h>, along with a few other macros for defining
and walking different kinds of chained lists. These macros are extremely
useful, since they greatly reduce the opportunities to write bugs in
chained-list code.
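As an illustration of these macros, here is a minimal, self-contained sketch of a recipient exception list. It is not taken from milter-greylist: the struct layout and the except_add() and except_match() helpers are hypothetical, and strcasecmp() stands in for the emailcmp() function used above.

```c
#include <stdlib.h>
#include <strings.h>
#include <sys/queue.h>

/* Hypothetical exception entry, reduced to a recipient address. */
struct except {
	const char *e_rcpt;
	LIST_ENTRY(except) e_list;  /* links to the neighbouring entries */
};

/* Declares the list head type, struct except_list. */
LIST_HEAD(except_list, except);

/* Add a recipient at the front of the list; returns -1 if malloc fails. */
int
except_add(struct except_list *head, const char *rcpt)
{
	struct except *ex;

	if ((ex = malloc(sizeof(*ex))) == NULL)
		return -1;
	ex->e_rcpt = rcpt;  /* the caller keeps ownership of the string */
	LIST_INSERT_HEAD(head, ex, e_list);
	return 0;
}

/* Return 1 if rcpt matches an entry (ignoring case), 0 otherwise. */
int
except_match(struct except_list *head, const char *rcpt)
{
	struct except *ex;

	LIST_FOREACH(ex, head, e_list) {
		if (strcasecmp(rcpt, ex->e_rcpt) == 0)
			return 1;
	}
	return 0;
}
```

The list head must be initialized with LIST_INIT(&head) before the first insertion. The macros take care of all the pointer manipulation, so the code never touches a next pointer directly.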
Whether you use chained lists or fixed-size tables, you cannot freely read
and write data shared among threads in a milter, because the code runs in a
multi-threaded environment. Each time Sendmail handles a new message, it will
make a new connection to the milter, where libmilter spawns a new thread to
handle it. The milter may be processing several messages simultaneously.
It is therefore not safe to operate on shared data; another thread might be writing while we read, thus causing bugs. For instance, if we walk a chained list while another thread removes an item from it, we might jump out of the list and crash.
The workaround is locking. Each time we need to read some global data, we use a read lock. Each time we write to it, we use a write lock. The difference between read locks and write locks is that many threads can share a read lock, whereas only one thread can have a write lock.
In milter-greylist, we use lock macros to avoid bloating the code:
#define WRLOCK(lock) if (pthread_rwlock_wrlock(&(lock)) != 0) { \
syslog(LOG_ERR, "%s:%d pthread_rwlock_wrlock failed: %s", \
__FILE__, __LINE__, strerror(errno)); \
exit(EX_SOFTWARE); \
}
Before using the lock, it must be initialized. Do this by using
pthread_rwlock_init() before calling smfi_main().
There are many other problems caused by multi-threading. For instance,
milter-greylist has to write its database to a file when it is modified, so
that after a restart it can resume operation where it halted. It is not
possible to dump the database to a file from a callback, because another thread
could attempt to do this at the same time. To work around this problem,
dump.c devotes a single dumper thread to this operation. This
thread starts (using pthread_create() from main())
before the smfi_main() call.
The dumper thread sleeps on a flag, using pthread_cond_wait().
Each time another thread modifies the database, it wakes the dumper thread by
calling pthread_cond_signal(), and the dumper thread handles the
job of flushing data to disk.
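The pattern can be sketched as follows. This is a simplified illustration rather than the content of dump.c: the dump_ names are hypothetical, the disk write is replaced by a counter, and a quit flag is added so that the example can terminate cleanly.

```c
#include <pthread.h>

static pthread_mutex_t dump_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  dump_cv  = PTHREAD_COND_INITIALIZER;
static int dump_dirty;  /* set when the database has unsaved changes */
static int dump_quit;   /* set to ask the dumper thread to exit */
static int dump_count;  /* number of dumps performed (illustration only) */

/* Stand-in for the real work of writing the database to a file. */
static void
dump_to_disk(void)
{
	dump_count++;  /* only the dumper thread ever writes this */
}

/* Body of the dumper thread: sleep until there is something to save. */
void *
dumper(void *arg)
{
	pthread_mutex_lock(&dump_mtx);
	for (;;) {
		/* The loop guards against spurious wakeups. */
		while (!dump_dirty && !dump_quit)
			pthread_cond_wait(&dump_cv, &dump_mtx);
		if (!dump_dirty)
			break;  /* quit requested and nothing pending */
		dump_dirty = 0;
		pthread_mutex_unlock(&dump_mtx);
		dump_to_disk();  /* slow I/O happens without the mutex */
		pthread_mutex_lock(&dump_mtx);
	}
	pthread_mutex_unlock(&dump_mtx);
	return arg;
}

/* Called by any callback thread after it has modified the database. */
void
dump_request(void)
{
	pthread_mutex_lock(&dump_mtx);
	dump_dirty = 1;
	pthread_cond_signal(&dump_cv);
	pthread_mutex_unlock(&dump_mtx);
}

/* Ask the dumper to finish any pending work and exit. */
void
dump_shutdown(void)
{
	pthread_mutex_lock(&dump_mtx);
	dump_quit = 1;
	pthread_cond_signal(&dump_cv);
	pthread_mutex_unlock(&dump_mtx);
}
```

Because only one thread ever writes the file, two dumps can never overlap. Several requests that arrive while a dump is in progress collapse into a single follow-up dump.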
Last but not least, a milter must only call thread-safe functions from
libraries. Any function that uses global variables or static memory is
thread-unsafe. For instance, you have to use inet_ntop(3) instead
of inet_ntoa(3).
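Here is a small sketch of the difference. inet_ntoa(3) returns a pointer to static storage that every thread shares, whereas inet_ntop(3) writes into a buffer supplied by the caller. The addr_to_str() wrapper is a hypothetical helper introduced for the example.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/*
 * Format an IPv4 address into a buffer owned by the caller, so that
 * no storage is shared between threads.
 * Returns buf on success, NULL on failure (for instance, if the
 * buffer is too small).
 */
const char *
addr_to_str(const struct in_addr *addr, char *buf, size_t len)
{
	return inet_ntop(AF_INET, addr, buf, (socklen_t)len);
}
```

A callback would typically declare char buf[INET_ADDRSTRLEN] on its own stack and pass it along with sizeof(buf).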
Thread unsafety can be hard to anticipate. For instance, if your libc features a
BIND4-based DNS resolver, using DNS resolver functions will lead to trouble.
This kind of problem can be quite hard to predict, especially when linking
with third-party libraries.
Fortunately, once it occurs, this kind of problem is easy to track down. After
receiving a few messages, the milter will hang. At that time, if you attach
gdb(1) to it and type the bt command (which prints a
stack backtrace), you will always see it stuck in the same code path. This code path
is likely to contain a thread-unsafe function. Here is an example:
# ps -ax | grep milter-greylist
13694 ?? S 0:00.13 milter-greylist -p /var/milter-greylist/sock
# gdb milter-greylist
(gdb) attach 13694
0x4193f238 in recvfrom () from /usr/lib/libc.so.12
(gdb) bt
#0 0x4193f238 in recvfrom () from /usr/lib/libc.so.12
#1 0x418a43c0 in __pth_sc_recvfrom () from /usr/pkg/lib/libpthread.so.20
#2 0x418a2cfc in pth_recvfrom_ev () from /usr/pkg/lib/libpthread.so.20
#3 0x418a2a7c in pth_recv_ev () from /usr/pkg/lib/libpthread.so.20
#4 0x418a2a50 in pth_recv () from /usr/pkg/lib/libpthread.so.20
#5 0x418a4444 in recv () from /usr/pkg/lib/libpthread.so.20
#6 0x418a10fc in pth_poll_ev () from /usr/pkg/lib/libpthread.so.20
#7 0x418a0d44 in pth_poll () from /usr/pkg/lib/libpthread.so.20
#8 0x418a3d24 in poll () from /usr/pkg/lib/libpthread.so.20
#9 0x418799fc in res_send () from /usr/lib/libresolv.so.1
#10 0x41877ef4 in res_query () from /usr/lib/libresolv.so.1
#11 0x4184377c in SPF_dns_lookup_resolv (spfdcid=0x190caa0,
domain=0x182b290 "example.com", rr_type=16, should_cache=1)
at spf_dns_resolv.c:139
#12 0x4183fc64 in SPF_dns_lookup (spfdcid=0x0, domain=0x1a1df98 "",
rr_type=64, should_cache=2) at spf_dns.c:57
#13 0x4184291c in SPF_get_spf (spfcid=0x1987c00, spfdcid=0x190caa0,
domain=0x182b290 "example.com", c_results=0x1a1f8c8) at spf_get_spf.c:76
#14 0x418423ec in SPF_result (spfcid=0x1987c00, spfdcid=0x190caa0, domain=0x0)
at spf_result.c:376
#15 0x180b0e4 in spf_alt_check (in=0x0,
fromp=0x190c940 "<John.Doe@example.com>") at spf.c:126
#16 0x18022ec in mlfi_envfrom (ctx=0x0, envfrom=0x182b250)
at milter-greylist.c:178
#17 0x180e820 in st_sender ()
#18 0x180de14 in mi_engine ()
#19 0x180c4dc in mi_handle_session ()
#20 0x180bd50 in mi_thread_handle_wrapper ()
#21 0x4189bf7c in pth_spawn_trampoline () from /usr/pkg/lib/libpthread.so.20
#22 0x41898990 in pth_mctx_set_bootstrap () from /usr/pkg/lib/libpthread.so.20
#23 0x418988dc in pth_mctx_set_trampoline () from /usr/pkg/lib/libpthread.so.20
#24 0x7fffefdc in ?? ()
Note that if, when typing bt, you see no function name, make
sure the program was built with -g and that the binary was not
stripped at installation.
The last function invoked before the libpthread machinery is
res_send(3). A quick search on the Internet shows that this
function is not thread-safe in BIND4, which is what causes the problem. You
must use a BIND8 resolver to work around it.
From time to time, it is necessary to read some of Sendmail's macros by
using smfi_getsymval(). This is how, for example,
mlfi_envrcpt() reads the message queue ID:
if ((priv->priv_queueid = smfi_getsymval(ctx, "{i}")) == NULL) {
syslog(LOG_DEBUG, "smfi_getsymval failed for {i}: %s",
strerror(errno));
priv->priv_queueid = "(unknown id)";
}
This can read only macros explicitly exported in sendmail.cf
using the O Milter.macros configuration lines.
In order to make debugging easier, milter-greylist adds an
X-Greylist header to any handled message that explains whether the
message was delayed and for how long, whether the message is white-listed and why, and
so on. smfi_addheader(), called from mlfi_eom(), handles
this, since libmilter only allows headers to be added from the end-of-message
callback. This function takes the opaque context pointer, the header name, and
the header value as arguments.
Milter is a scalable, easy-to-use solution for MTA-level filtering. The API is straightforward and hides very few pitfalls. It is easy to get started and to develop complex filtering techniques. Having it is a real asset in the battle against spam and viruses.
milter-greylist was really easy to implement. It took under a week to
produce something that works (with a few bugs), and less than a month to
complete version 1.0. I hope this article will help potential developers to
produce more milters.
Thanks to John Klos for reviewing this article.
Emmanuel Dreyfus is a system and network administrator in Paris, France, and is currently a developer for NetBSD.
Copyright © 2007 O'Reilly Media, Inc.