

OpenBSD PF Developer Interview

Federico: Mike, how did you get the idea to include p0f features in PF?

MF: One of my coworkers, Greg Taleck, added p0f features to NFR's IDS to resolve traffic ambiguities. And then a damn SMTP worm hit. That annoyed me and I wanted to filter all Windows boxes from connecting to my mail server for the duration of the worm. So I talked to Michal Zalewski, who wrote p0f v1, and he was cool with integrating p0f into PF, but the guy who had been maintaining p0f never responded to relicense the fingerprints. Michal then started p0f v2, which was not encumbered with the maintainer's copyright; I got it working in PF; and then blocked all Windows boxes from connecting to my mail server for the duration of the worm. Hurray! Never underestimate the annoyed developer.

Federico: What are you working on for 3.5?

MF: Working on brewing my own beer. Made a pretty good Boston-style ale, needed a little more hopping though. A golden ale is next. Theo has been calling me a nasty hobbittsesss lately too, so I've been working on growing hair on my feet. Hopefully, I'll get back to TCP scrubbing and normalization in time for 3.5.

HB: well, the focus this time is obviously bgpd.

bgp, the Border Gateway Protocol, is what ISPs speak to each other to announce reachability of their networks through certain paths. A bgp daemon announces its own networks to its neighbors, and its neighbors announce their own networks plus all the networks, and the paths to reach them, that they learned from their respective neighbors. In the usual so-called full-mesh setup, that results in bgpd having a table of about 130 thousand networks (prefixes), and multiple paths to reach each. Of those it picks the "best" path (the algorithm for that decision is actually rather easy) and enters the resulting route into the kernel routing table.

Now, that is a bit more complicated than described here, and it is quite obvious that keeping these huge tables and working on them with reasonable performance is not that easy.

There are a few more or less free bgp implementations, but they all have major design flaws, and the resulting runtime problems. As I've been bitten by those I had been considering doing a bgpd for some time, but was a bit scared by the project's size. When I was in Calgary in September I finally talked to Theo about it, who tricked me into starting coding. Back in Germany I finally did, mid-November, and much to my surprise I had a fully working bgp session engine, fully implementing the Finite State Machine described in RFC 1771 as its core, within 9 days, and had sessions established and held up to other bgp speakers. We found a few bugs later, but it is basically still what I had then. I talked to a few people and showed code, and fortunately, Claudio Jeker joined. He did an incredible amount of work implementing what we call the RDE, the Route Decision Engine, which holds the tables of prefixes and paths. At the same time I started working on the code to interface with the kernel routing table, which includes holding an internal view of it.

Well, nowadays we are feature complete for the basics.

We have no showstopper bugs we are aware of, heck, I am not aware of any bug right now (tho', let me assure you, there are a few). We learn routes, sync the one picked as best into the kernel routing table, can send them to our neighbors, and can announce our own networks. We have a control utility, bgpctl, too, which can be used to gather and show run-time data, take single sessions up/down, reload configuration, etc. And we have something that I have not seen anywhere before: we can couple and decouple the internal view from the kernel routing table.

So you can start up decoupled, adjust your settings while evaluating the internal view of the routing table, and then, after you are satisfied, you can issue a bgpctl fib couple and the routes enter the kernel. In the same vein, a bgpctl fib decouple removes them again, leaving the kernel routing table as it was before coupling. Oh, and as opposed to the other implementations, bgpd notices when you statically enter routes into the kernel routing table and doesn't mess with them. It even tracks interfaces showing up and being removed at runtime, as is possible with PCMCIA and USB-based ones, and cloneable devices like tun and vlan. For most Ethernet devices it can even notice when you pull the cable (or the link gets lost for other reasons) and react accordingly.
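In practice, the couple/decouple workflow might look like the following (a sketch; bgpctl subcommands other than fib couple and fib decouple, such as show rib, are assumptions and may vary between releases):

# bgpctl show rib
# bgpctl fib couple
# bgpctl fib decouple

Here show rib would inspect bgpd's internal view of the learned prefixes while the kernel routing table stays untouched; only the couple pushes the best routes into the kernel, and decouple removes them again.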

bgpd is 11500 lines of code as of tonight, of which about 500 are manpages. And it is very fast...

CB: I'm working on many little enhancements in the way PF deals with interfaces.

That includes better support for dynamic/cloneable interfaces, the ability to lock states to a single interface or a group of interfaces, better handling of interface aliases, and other related things. I believe there were 12 little points in the commit message. :)

RM: I've been mainly working on the components necessary to deploy OpenBSD in high availability and load balancing configurations, including the Common Address Redundancy Protocol (CARP), which handles IP address failover, and pfsync enhancements, which synchronise state between two or more firewalls. I also added source IP tracking, which keeps track of states by source IP address, but this work was actually done before 3.4, at the hackathon in Calgary.

CEA: As you may have noticed, I have moved away from pf to privilege separation and bpf. I already worked on privsep for named in 3.5, and now there are at least the DHCP tools waiting for privilege separation. Henning is already working on dhclient. If I can find some time, I want to design some kind of framework for developing userland proxies.

Federico: The OpenBSD 3.5 release page lists six new PF improvements. Could you each explain your own work?

  1. Atomic commits of ruleset changes (reduce the chance of ending up in an inconsistent state).

    CB: This change ensures that when you type pfctl -f pf.conf, then the entire content of pf.conf will be loaded into PF kernel memory, or nothing at all if there are errors. Before that change, it was possible in rare circumstances that only half of the pf.conf ruleset would be loaded inside the kernel.

    So for example, you could have the new RDR rules loaded, but not FILTER rules.

    Or, if your main pf.conf contains load anchor entries, and some of the anchor files had a syntax error, then only part of the anchors would be loaded.

    This change does not bring any new functionality to PF, but it makes pfctl -f more reliable in case of errors (syntax errors, pfctl gets kill(1)ed, not enough memory is available, ...).
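A simple way to exercise this from the command line (a sketch; -n asks pfctl to parse the ruleset without loading it):

# pfctl -nf /etc/pf.conf
# pfctl -f /etc/pf.conf
# pfctl -sr

The first command reports any syntax errors without touching the kernel; the second loads the whole ruleset or, on error, nothing at all; the third shows the rules actually loaded in the kernel.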

  2. A 30% reduction in the size of state table entries.

RM: Basically I found a little trick of storing the tree indexes inside the state structure, rather than having separate tree nodes that point to the state structure. It's actually a pretty obvious thing in retrospect, but nobody had really considered it. For the end user, all this means is that they can have more states in the same amount of memory.

  3. Source-tracking (limit the number of clients and states per client).

    RM: Source IP tracking allows you to create an entry for the source of connections and link states to it. This is useful for a number of reasons: first, it allows you to use a round-robin address allocation mechanism for translation or redirection, but ensure that the connections for a particular client are always mapped the same way. This functionality is important for some applications or protocols which rely on source address for identification, or in the case of server balancing, where the application keeps state across multiple connections, so the client must always connect to the same server.

Second, it allows you to set limits on how many distinct sources can connect to a service, and how many simultaneous connections each source can have. This can be used to limit connections from internal clients, or to mitigate certain kinds of denial-of-service attacks.
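In pf.conf, these limits might be expressed like this (a sketch; $ext_if and the numbers are placeholders):

pass in on $ext_if proto tcp from any to any port www \
    keep state (source-track rule, max-src-nodes 100, max-src-states 3)

This allows at most 100 distinct source addresses to hold states created by this rule, with at most 3 simultaneous states each.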

  4. Sticky-address (the flexibility of round-robin with the benefits of source-hash).

RM: When sticky-address is enabled, we create a source-tracking entry for each source IP address, and states are associated with it. In this entry, we store the translation address that was selected by round robin, and subsequent connections from this source that hit the nat or rdr rule will get this translation address rather than the next round-robin address. The source-tracking entries last at least as long as there are states associated with them, plus an additional configurable lifetime.

    So if you're redirecting traffic to a pool of web servers, and the first time a client connects, they get redirected to server 4, all connections afterward from that client will hit server 4, so long as the source-tracking entry exists.

    This is very similar in behaviour to source-hash, except it removes the restriction that the pool must be specified as a CIDR netblock; it can be a list of addresses, including network blocks, or more powerfully, it can be a table.
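Ryan's web server example might look like this in pf.conf (a sketch; $ext_if, the addresses, and the table name are placeholders):

rdr on $ext_if proto tcp from any to any port www \
    -> { 10.0.0.1, 10.0.0.2, 10.0.0.3, 10.0.0.4 } round-robin sticky-address

or, using a table instead of a fixed list:

rdr on $ext_if proto tcp from any to any port www \
    -> <webservers> round-robin sticky-address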

  5. Invert the socket match order when redirecting to localhost (prevents the potential security problem of mis-identifying remote connections as local).

    DH: It is common practice to redirect incoming TCP connections to local daemons using pf, for instance to force HTTP connections through a proxy, or to redirect spam to a tarpit.

Often, the daemon was bound to 127.0.0.1 and the redirection used 127.0.0.1 as the replacement destination. While using the loopback address is convenient in such cases (it's always present), that can have security implications.

Many daemons assume that the loopback interface is isolated from the real network, i.e., that connections to sockets bound to 127.0.0.1 are local, and may grant some privileges based on this assumption.

When pf redirects foreign connections to the loopback address, it violates that assumption: suddenly, foreign peers might be able to connect to daemons listening on loopback sockets.

    To deal with this potential risk, the network code has been changed so that foreign connections to loopback addresses are first matched against listeners on unbound sockets (listening on any address). Only if no such socket is found, the connection is matched against a specific loopback listener.

So, if you're running a daemon listening on both 127.0.0.1 and ANY, and use pf to redirect external connections to 127.0.0.1, these connections will now connect to the ANY socket, instead of the 127.0.0.1 one, where the daemon might wrongly assume a local connection.

This problem only occurs with daemons that follow this pattern (listen on 127.0.0.1 in addition to other addresses, and treat connections to 127.0.0.1 as privileged local connections); many daemons don't.
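A typical redirection of the kind Daniel describes, here sending known spammers to a tarpit (a sketch modeled on a spamd-style setup; $ext_if, the table name, and the port are placeholders):

rdr on $ext_if proto tcp from <spammers> to any port smtp \
    -> 127.0.0.1 port 8025

Before the inverted match order, such redirections risked letting remote peers reach daemons that trusted any connection arriving on 127.0.0.1.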

  6. Significant improvements to interface handling.

    CB: Let's look at the commit message, since it describes things pretty clearly:

    1) PF should do the right thing when unplugging/replugging or cloning/destroying NICs.

    2) Rules can be loaded in the kernel for not-yet-existing devices (USB, PCMCIA, Cardbus). For example, it is valid to write: "pass in on kue0" before kue USB is plugged in.

    3) It is possible to write rules that apply to group of interfaces (drivers), like "pass in on ppp all".

    4) There is a new ":peer" modifier that completes the ":broadcast" and ":network" modifiers.

    5) There is a new ":0" modifier that will filter out interface aliases. Can also be applied to DNS names to restore original PF behaviour.

Applied to a DNS name, pass in from name:0 will only select the first IP returned by the resolver, while a plain pass in from name will select all IPs. Similarly, pass in from fxp0:0 or pass in from (fxp0:0) will not take into account address aliases on fxp0.

6) The dynamic interface syntax (foo) has been vastly improved, and now supports multiple addresses, v4 and v6 addresses, and all userland modifiers, like "pass in from (fxp0:network)".

    Specifying pass [...] from (ifspec) is now equivalent in all cases to pass [...] from ifspec, except that the ifspec -> IP address resolution is done in the kernel, i.e., will adapt automatically to interface address changes (dhcp, hot plug removal, whatever).

    7) Scrub rules now support the !if syntax.

    scrub in on !fxp0 now works.

    8) States can be bound to the specific interface that created them or to a group of interfaces for example:

    pass all keep state (if-bound)
    pass all keep state (group-bound)
    pass all keep state (floating)

    9) The default value when only keep state is given can be selected by using the "set state-policy" statement.

If you put set state-policy if-bound, then all rules declared with keep state, like pass out on fxp0 keep state, will be if-bound.

    10) "pfctl -ss" will now print the interface scope of the state.

    Another piece I wrote on the pf@ mailing list gives a few more details about state binding.
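Several of these features combined in one pf.conf sketch (interface names are placeholders; the numbers in the comments refer to the points above):

set state-policy if-bound                     # 9) default for plain "keep state"
pass in on kue0 all                           # 2) valid before the USB NIC is plugged in
pass in on ppp all                            # 3) matches all interfaces of the ppp driver
pass in from (fxp0:network) to any            # 6) resolved in the kernel, tracks changes
scrub in on !fxp0                             # 7) negated interface in a scrub rule
pass out on fxp0 all keep state (if-bound)    # 8) state bound to fxp0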

Federico: The 3.5 presentation page says "Interface 'cloning', accessed by ifconfig(8) commands create and destroy. For example, `ifconfig vlan100 create'." How does it work?

HB: That's a very cool addition. Let's take vlan, for example.

Previously, you had a fixed number of vlan interfaces in your kernel config. If you needed more, you needed a new kernel and a reboot. Now, you don't have any vlan interface by default — but the kernel has a "template". You create the interfaces as needed on the fly. So, when you configure your first vlan, you could do something along the lines of

# ifconfig vlan0 create
# ifconfig vlan0 vlan 100 vlandev fxp0 up

Of course, you can collapse those into one, but it is even nicer: ifconfig creates the interface for you when you configure it, without an explicit create:

# ifconfig vlan0 vlan 100 vlandev fxp0 up

is sufficient. When you don't need the interface any more, you just destroy it, and it is gone:

# ifconfig vlan0 destroy

Federico: The 3.5 presentation page says "authpf(8) now tags traffic in pflog(4) so that users may be associated with traffic through a NAT setup." How does it work?

DH: This is best explained with the example in the authpf(8) man page. You can use the following in authpf.rules (the ruleset which is loaded for each user who authenticates)

nat on $ext_if from $user_ip to any tag $user_ip -> $ext_addr
pass in quick on $int_if from $user_ip to any
pass out log quick on $ext_if tagged $user_ip keep state

Nothing special about the usage of tag/tagged here, except that we use a macro that gets expanded to the user's IP address, so NATed connections from that address get tagged with it.

The point of adding a unique per-user tag on the internal interface is so that we can pass connections on the external interface, after translation, with a unique rule as well. Without tags, connections from different source addresses would all pass by the same rule on the external interface.

The reason for this construct is that tcpdump on pflog0 shows anchor and ruleset name of the rule that created the matched state, and the ruleset name conveniently contains the user name and pid of the authpf process authenticating the user, for example

# tcpdump -n -e -ttt -i pflog0
Oct 31 19:42:30.296553 rule 0.bbeck(20267).1/0(match): pass out on fxp1: > S 2131494121:2131494121(0) win 16384 <mss 1460,nop,nop,sackOK> (DF)

The bbeck part is the name of the user that created the connection. This information can be used for logging, accounting or debugging.

Federico: Finally, OpenBSD introduced new tools for filtering gateway failover. Quoting from the 3.5 presentation page:

1) CARP (the Common Address Redundancy Protocol) carp(4) allows multiple machines to share responsibility for a given IP address or addresses. If the owner of the address fails, another member of the group will take over for it.

Ryan, could you explain the new Common Address Redundancy Protocol (CARP)?

RM: The Common Address Redundancy Protocol allows multiple hosts to transfer an IP address amongst each other, ensuring that this address is always available. CARP is much like VRRP, although it improves on it in many ways: it supports IPv6 addresses, provides strong authentication via a SHA1 HMAC, and supports a limited degree of load balancing via an "arp balancing" feature.
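Setting up a CARP interface might look like this (a sketch; the vhid, password, addresses, and fxp0 are placeholders):

# ifconfig carp0 create
# ifconfig carp0 vhid 1 pass secret carpdev fxp0 192.168.1.10 netmask 255.255.255.0

Run the same vhid and password on each host in the group; the master advertises, and a backup takes over the address if the advertisements stop.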

CARP is the direct result of our frustration with the current IETF standards process: Cisco maintains that they hold a patent which covers VRRP and none of the right people at the IETF are willing to stand up and tell them their patent is irrelevant. It's a specific case of the general problem of vendors involving themselves in the standards process, then producing patents after the standard is finalised. The same sort of thing is happening with the various IPSec standards. We'd like very much for the IETF to put an end to this, and use a non-RAND intellectual property policy, much as the w3c has done. An open standard is not really an open standard if you have to enter into licensing agreements to use it.

Author's note: The OpenBSD web site has some interesting commentary on Cisco Patents.

2) Additions to the pfsync(4) interface allow it to synchronise state table entries between two or more firewalls which are operating in parallel, allowing stateful connections to cross any of the firewalls regardless of where the state was initially created.

Federico: Ryan, how would state table synchronization work?

RM: The pfsync protocol works by sending out state creations, updates, and deletions via multicast on a specified interface. Other firewalls listen for such messages and import the changes into their state tables. There is some additional complexity, of course: we implement some methods for minimizing pfsync traffic, and a mechanism for recovering from missed messages.

The net benefit of all this is that you can have two firewalls running in parallel and have one firewall backup for the other. In many situations this will be combined with CARP.
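On each firewall, pfsync is attached to an interface on a (preferably dedicated) sync network, along these lines (a sketch; the parameter is named syncdev in later OpenBSD releases and may differ in the 3.5-era syntax, and fxp1 is a placeholder):

# ifconfig pfsync0 syncdev fxp1
# ifconfig pfsync0 up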

I've written an article that gives an overview on why pfsync and CARP are necessary, how they work, examples of how they can be used, and a sample configuration.

If you're considering building yourself a redundant firewall cluster, you'll probably want to read this.

Federico Biancuzzi is a freelance interviewer. His interviews have appeared in various online publications as well as the Polish print magazine BSD Magazine and the Italian print magazine Linux&C.

