ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)


OpenBSD PF Developer Interview, Part 2

by Federico Biancuzzi
05/06/2004

OpenBSD celebrated release 3.5 on 1 May 2004. In honor of this release, Federico Biancuzzi interviewed the developers of OpenBSD's PF, a powerful and flexible packet filtering interface. This is the second half of an interview that began in PF Developer Interview, Part 1.

Federico Biancuzzi: A recent research paper by Paul Watson entitled Slipping in the Window: TCP Reset Attacks caused a lot of rumors in the media. After the public presentation at the CanSecWest conference, we understood that the real situation was not so terrible for every OS. How does OpenBSD reduce the risk of this type of attack?

Henning Brauer: OpenBSD is not vulnerable to these attacks. We have used random ephemeral ports since 1996. We have required RSTs to be right on the edge of the window since 1998.

Federico: How can PF protect a vulnerable server from this type of attack?

HB: The scrubber has just been modified to drop RSTs which are not right on the edge of the window. A special form of NAT where only the src port is modified is being worked on.
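To make the difference concrete, here is a minimal sketch comparing a loose RST check (any sequence number inside the receive window is accepted) with a strict one (only the exact next expected sequence number is accepted). It assumes that "right on the edge of the window" means the next expected sequence number; the function names are illustrative and are not taken from the OpenBSD or PF source.

```python
MOD = 2 ** 32  # TCP sequence numbers wrap around at 32 bits


def in_window(seq, rcv_nxt, rcv_wnd):
    """Loose check: seq falls anywhere inside the receive window."""
    return (seq - rcv_nxt) % MOD < rcv_wnd


def rst_on_edge(seq, rcv_nxt):
    """Strict check (assumed meaning of 'edge'): seq is exactly rcv_nxt."""
    return seq % MOD == rcv_nxt % MOD


def guesses_needed(rcv_wnd, strict):
    """Rough upper bound on blind guesses to hit an acceptable sequence number.

    With the loose check an attacker can step through the sequence space
    one window at a time; with the strict check every value must be tried.
    """
    if strict:
        return MOD
    return -(-MOD // rcv_wnd)  # ceiling division
```

With a 65,536-byte window the loose check can be beaten in about 65,536 guesses per port pair, while the strict check leaves the full 32-bit space to search. Random ephemeral ports multiply either figure by the number of candidate source ports.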

Federico: During the end of April most PF developers flew to Ryan McBride's house to start a short PF hackathon. (See Figure 1 for a photo from Ryan himself.) What are the new features and enhancements that were planned or already committed for the next release?

Figure 1. OpenBSD hackers hard at work at a recent hackathon.

HB: We had an extremely productive four-day hacking party at Ryan's house in the woods near Sechelt (British Columbia, Canada — about 2 hours from Vancouver), yes.

It was really a network hackathon, not limited to pf. Besides a lot of smaller things, the stuff we hacked on includes:

Ongoing work includes a rework of how anchors work — they can be nested soon, and the ruleset name (which is really just one level of nesting) will go away.

Federico: What type of platform do you use for PF development? Does the code have any optimizations for a particular architecture, such as 64-bit (Athlon64, Sparc64, G5, Prescott)?

HB: I mostly hack pf on my i386 notebook and, less often, my sparc workstation. Tests on sparc64 are mandatory as this is the most picky platform, especially due to its memory alignment requirements. I occasionally test on alpha, hppa, amd64, mvme68k, mvme88k, mac68k and VAX as well.

pf operates completely in MI (Machine Independent) land; there is no MD (Machine Dependent) code whatsoever.

Ryan McBride: I develop on sparc64 and i386. sparc is used for testing, and I have VAX and cats boxes sitting here waiting to be set up for testing (too slow for real development).

Cedric Berger: Mostly i386 gear. When doing big changes, I usually try to compile the code on a Sparc64 box since, unlike i386, it uses 64-bit, big-endian, and gcc3. If PF code works on Sparc64 and i386, it has a good chance of working well on the other architectures.

Federico: Which platform would you suggest to deploy an OpenBSD based firewall for various scenarios (home, office, enterprise)?

HB: Well, basically, any.

VAX may not be a good choice for filtering at gigabit speeds, obviously. Among the modern archs, sparc64, alpha, and amd64 are all good choices. i386 and powerpc are less so, because I like W^X purity, but they should not pose performance problems either.

RM: For small sites, the Soekris 4501 or 4801 make excellent devices. For bigger links, an AMD64 based system is an excellent choice. Even if you're only dealing with 100Mbit links, using gigabit cards like em(4) or sk(4) will improve performance as these cards are designed to handle much higher numbers of packets per second than 100Mbit cards.
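The packets-per-second point can be illustrated with some back-of-the-envelope arithmetic. The sketch below computes the theoretical maximum frame rate of an Ethernet link from standard framing overhead (8 bytes of preamble and start delimiter, 12 bytes of inter-frame gap). It only shows how quickly packet rates climb with small frames; it says nothing about how any particular card or driver behaves.

```python
# Per-frame overhead on the wire, in bytes (standard Ethernet figures).
PREAMBLE_AND_SFD = 8
INTERFRAME_GAP = 12


def max_packets_per_second(link_bits_per_second, frame_bytes=64):
    """Theoretical maximum frames per second for a given frame size."""
    wire_bits = (frame_bytes + PREAMBLE_AND_SFD + INTERFRAME_GAP) * 8
    return link_bits_per_second // wire_bits


print(max_packets_per_second(100_000_000))         # minimum-size frames, 100 Mbit
print(max_packets_per_second(1_000_000_000))       # minimum-size frames, 1 Gbit
print(max_packets_per_second(100_000_000, 1518))   # full-size frames, 100 Mbit
```

A 100 Mbit link carrying only minimum-size frames tops out near 148,809 packets per second, against roughly 8,127 with full-size frames, so a firewall's per-packet cost matters far more than raw bandwidth.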

Federico: Reading the pf@ mailing list I found that the code has some limit regarding the hardware usage. What type of limits does PF present in the 3.5 release? How many concurrent states and how much RAM can PF handle?

HB: That thread is full of lies and uninformed guesses. Ignore it.

pf itself doesn't impose many limits. We have the settable state and fragment limits to prevent pool exhaustion; the amount of memory available for the pools used by pf varies depending on the hardware.

I don't have exact numbers, but 50,000 state entries are not a problem on an i386 with 128 MB.

That said, there is ongoing work which changes the way OpenBSD handles kernel memory used for the network stack (pf is not special here). This will allow for more efficient usage, backpressure when needed, and more total memory available to the network stack including pf, thus allowing for much bigger state tables etc.

CB: With this patch by Mike Frantzen, you can use up to 768MB of RAM on i386 to store table entries, versus 64MB previously. We're looking for a similar improvement for the state table, but that's a bit more difficult.

Federico: PF is going to be ported to other BSDs. Daniel, are you proud of this? Are you working on these ports?

Daniel Hartmeier: I've been working with Pyun YongHyeon and Max Laier, who do the FreeBSD port. They did all of the work; I merely tried to answer questions and sometimes aid in debugging. This has proven valuable to the base source, as several bugs were found and fixed in the process. Maintaining several ports will cause additional work, of course, but the additional user base producing feedback makes it worth the effort.

Federico: Now that PF has been imported into the FreeBSD base system, do you think that some people will consider moving from OpenBSD to FreeBSD for its SMP support and better performance?

HB: Not at all. First, the performance advantages of FreeBSD, that is largely a myth. All BSDs are rather close to each other with respect to general performance.

Second, I don't expect any OpenBSD/pf user to switch, as pf on OpenBSD will always have some advantages — as we develop pf here, we can embed it much deeper. What I expect is FreeBSD users switching to pf from ipf and ipfw.

RM: On most firewalls, the CPU is not the bottleneck, so adding a second one will not help (and may even slow things down). That being said, SMP support is being worked on actively for OpenBSD, and will likely appear in 1 or 2 releases, depending on how long it takes to get it right.

CB: Competition is good. If FreeBSD is better than OpenBSD for some application, I will use FreeBSD for that application. If OpenBSD performance lacks in some areas, we will try to fix it. That being said, I believe OpenBSD is clearly the best choice for a firewall now, for various reasons.

Federico: Why does PF still miss one basic feature like IPFilter return-icmp-as-dest?

HB: Because nobody thought it was needed? There were no requests for this whatsoever, and none of us needed it, so nobody wrote it. That nobody has requested it after, what is it now, 2.5 years?, is a strong sign it is not needed at all.

CB: Because nobody in the developer or user community cared enough to send a working patch.

Implementing that functionality is not easy, but I'm now looking at implementing return-icmp for pure bridges, when there is no routing table (return-rst for bridges has just been committed). It is very likely that once I have return-icmp working on bridges, adding return-icmp-as-dest support will be trivial.

Federico: Why doesn't PF come with an internal version number to track various updates typical of -stable branches?

HB: What for?

OpenBSD 3.5 is OpenBSD 3.5 is OpenBSD 3.5, period.

Can Erkin Acar: Since the kernel and userland are always synchronized, there is not much point in adding a version identifier. For external utilities, such as pftop (which still compiles on OpenBSD 3.0), the OpenBSD release numbers are usually sufficient.

Federico: What type of support does PF provide for IPv6? Are there any interesting features specifically for IPv6? What features are still missing?

HB: PF has full IPv6 support; there's nothing really special or different compared to v4.

RM: We're missing IPv6 fragment reassembly support, but this is being worked on actively and will probably be included in 3.6. pfsync does not support IPv6 as a transport.

Federico: Since the ALTQ merge in the 3.3 release, a lot of people enjoyed shaping the bandwidth with PF rules. Is there anything new for ALTQ, like other types of queues?

HB: Kenjiro and I added HFSC for 3.4 and polished things a little, and I rewrote the ID allocator. No real changes were made after that in the queueing arena, and none are currently planned. In our eyes the current state is quite fine.

Federico: How do PF and IPsec interact? What type of problems have you resolved and what still needs to be solved?

DH: pf sees IPsec encapsulated traffic both encapsulated on the real interface as well as decapsulated on enc(4). Filtering and translation can be done on either, with various effects. Apart from that, pf doesn't treat IPsec traffic differently from other protocols. It doesn't filter on SPI or other IPsec specific fields. UDP encapsulation for NAT traversal has recently been added, but that's outside of pf, in IPsec/stack code.

Several special requirements, like static source ports for isakmpd, have been addressed, so pf basically works at least as well with IPsec as any other packet filter not doing IPsec protocol inspection.

HB: They do not interact more than pf interacts with anything else network related: pf passes or blocks IPsec traffic.

CB: They interact pretty well. You can filter ESP packets on the real interface or decrypted packets on enc0. Nat/rdr is possible on enc0, but that's tricky. What I'd like to do for the next release is to remove the need for the "pass on enc0 proto ipencap all" rule, which is just wrong.

Federico: WiFi networks are becoming widely (and wildly) used. What can PF do for a wireless network? Is there any new idea specific to wireless filtering?

HB: I don't see wireless much different than wired networks with respect to pf. authpf can be especially neat in wireless networks, but it already is neat in wired ones too.

Federico: If I'm not wrong, tools that use raw access to network data bypass PF because the filtering happens after. How can this be solved? Is this a behavior you want to change?

HB: This is not true.

It is true that bpf is outside pf. This is actually very good for debugging.

We might add a possibility for bpf-based tools to request to be hooked in before pf. It might be useful for the dhcp programs. But then, that is not a real-world problem. I have privilege-revoked dhcpd and dhcrelay so that they don't run as root anymore, and canacar@ helped out with bpf write filters (we have read filters already) and with locking the bpf device so that no changes to those filters are possible anymore. Especially for dhcpd, that means that one very worrisome piece of code is now locked away so nicely that you don't have to worry much anymore. And of course, besides the privdrop and bpf security work, we cleaned that mess up big time...

The most worrisome of those programs is now dhclient, which is scary, huge, and still runs as root, even though we have already cut about half of its code out. I have it running privilege separated on my machine already...

RM: I don't see this as a problem, and don't think that this will be changed.

CEA: This is by design, and I do not want/see this behavior changing. We have introduced bpf security extensions to solve this problem on a case-by-case basis. We are going through every program in the tree and modifying them to use the security extensions and drop/separate privileges. At some point we may also start looking at critical applications in the ports tree.

Federico: What type of bpf security extensions have been introduced?

CEA: bpf is a device designed to capture packets from an interface. It has a filter language for selecting a subset of packets to be read, used mainly for performance reasons. bpf also contains some functionality for injecting packets into the network.

Programs use bpf by opening one of the /dev/bpfX devices, and obtaining a file descriptor. The access to the devices is restricted to root by default (through file permissions). The problem happens when a program wishes to drop privileges, or use privilege separation, after obtaining a bpf descriptor.

Even with dropped privileges, a program can change the filters and the interface, and thus sniff any interface on the host. Furthermore, if the descriptor was opened with write access (some daemons require this, and libpcap does this by default), it is possible to inject packets into any of the available interfaces.

This had to be solved before any bpf-using program can be safely privilege separated. Two security mechanisms were introduced:

  1. write filtering allows setting bpf filters for write operations

  2. locking prevents "dangerous" changes to the descriptor such as modifying the read/write filters, and changing the interface. Obviously the descriptor cannot be unlocked once it is locked.

If the descriptor is properly configured and locked before dropping privileges, an exploit will not be able to further compromise the system through the bpf descriptor.
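The two mechanisms can be modelled in a few lines. The sketch below is a toy, in-memory stand-in for a bpf descriptor, not the kernel interface: the class and method names are hypothetical, and filters are plain Python callables rather than bpf programs. It only illustrates the semantics described above, namely that writes pass through a write filter and that a locked descriptor refuses further changes and cannot be unlocked.

```python
class LockedError(Exception):
    """Raised when a locked descriptor is asked to change its configuration."""


class ToyBpfDescriptor:
    """Hypothetical model of a bpf descriptor with write filtering and locking."""

    def __init__(self, interface):
        self.interface = interface
        self.read_filter = lambda packet: True    # accept everything by default
        self.write_filter = lambda packet: True
        self.locked = False

    def _require_unlocked(self):
        if self.locked:
            raise LockedError("descriptor is locked")

    def set_interface(self, interface):
        self._require_unlocked()
        self.interface = interface

    def set_read_filter(self, predicate):
        self._require_unlocked()
        self.read_filter = predicate

    def set_write_filter(self, predicate):
        self._require_unlocked()
        self.write_filter = predicate

    def lock(self):
        # Deliberately no unlock(): locking is one-way.
        self.locked = True

    def write(self, packet):
        """Return True if the packet would be injected, False if filtered out."""
        return bool(self.write_filter(packet))
```

A daemon following this pattern would open the descriptor as root, install its read and write filters, call `lock()`, and only then drop privileges. Code that later takes over the process can still use the descriptor, but only within the limits that were fixed before the lock.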

Federico: I've read this thread on the misc@ mailing list and I'm wondering what are the advantages of tcpdump privilege separation?

CEA: Network data is untrusted, and parsing them into a readable form is difficult and error-prone, especially for complex or obscure protocols, thus tcpdump (and most other sniffers) are complex and potentially dangerous pieces of code. A look at recently discovered vulnerabilities in such programs should give an idea. Even saved binary files may not be safe, and could act as time bombs. Privilege separation is used in these programs to isolate the dangerous packet parsing code into an unprivileged chroot jail.

At this point, running tcpdump as root in OpenBSD is much safer than running it unprivileged, since being root allows it to properly privsep. Hopefully this will be improved to cover unprivileged use, possibly using setuid after we resolve some signal issues. Yes, there is an irony here, making a program setuid root to make it safer :)
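The general shape of this kind of privilege separation can be sketched as follows. This is not tcpdump's implementation: the parser is a trivial stand-in, the parameter names are hypothetical, and the chroot and uid/gid changes are attempted only when the process really is root and a jail directory is supplied. It also shows why root is needed in the first place, since chroot and setuid are privileged operations.

```python
import os


def parse_untrusted(packet):
    """Stand-in for the risky parsing code: render the bytes as hex text."""
    return packet.hex()


def parse_in_child(packet, jail_dir=None, unpriv_uid=None, unpriv_gid=None):
    """Parse packet in a forked child and return the text it produced."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: confine itself first, then touch the untrusted data.
        status = 1
        try:
            os.close(read_fd)
            if os.geteuid() == 0 and jail_dir is not None:
                os.chroot(jail_dir)
                os.chdir("/")
                if unpriv_gid is not None:
                    os.setgroups([])
                    os.setgid(unpriv_gid)
                if unpriv_uid is not None:
                    os.setuid(unpriv_uid)
            # Small outputs only; a real program would loop on partial writes.
            os.write(write_fd, parse_untrusted(packet).encode())
            status = 0
        finally:
            os._exit(status)  # never fall back into the parent's code path
    # Parent: keeps its privileges and never parses the packet itself.
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    _, wait_status = os.waitpid(pid, 0)
    if wait_status != 0:
        raise RuntimeError("parser child failed")
    return b"".join(chunks).decode()
```

If the parser is exploited, the attacker ends up in a process with no privileges and an empty filesystem view, while the parent only ever sees the child's output as plain bytes.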

Federico: Can, could you provide a short history of pflogd?

CEA: The logging mechanism in pf is ingenious. Saving the raw packet dumps loses minimum information, (usually) uses less space than ASCII logs, and allows the logs to be analyzed using a variety of tools, including passive OS fingerprinting recently added to our tcpdump, and all the other cool analyzers/sniffers available.

The first version of pflogd was imported into the tree about 2 months after the pflog interface. It had basic functionality: dump the logs to a file in binary tcpdump format, make sure the existing file has the correct header before appending, handle SIGALRM for flushing logs, and SIGHUP for re-opening the log files so that log rotation works correctly.
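The "correct header before appending" check can be pictured with the classic tcpdump (pcap) file header, which is 24 bytes long. The sketch below is a simplified illustration rather than pflogd's code: it assumes little-endian files only (real files use the writer's native byte order), a hypothetical default snapshot length of 96, and link type 117, which is the value libpcap lists for DLT_PFLOG.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4
# magic, version major, version minor, thiszone, sigfigs, snaplen, linktype
PCAP_HEADER = struct.Struct("<IHHiIII")
DLT_PFLOG = 117  # assumption: the official pflog link type in libpcap


def make_header(snaplen=96, linktype=DLT_PFLOG):
    """Build a pcap file header for a fresh log file."""
    return PCAP_HEADER.pack(PCAP_MAGIC, 2, 4, 0, 0, snaplen, linktype)


def header_is_compatible(data, snaplen=96, linktype=DLT_PFLOG):
    """Return True if an existing file's header matches what we would write."""
    if len(data) < PCAP_HEADER.size:
        return False
    magic, major, minor, _zone, _sigfigs, file_snaplen, file_linktype = (
        PCAP_HEADER.unpack(data[:PCAP_HEADER.size])
    )
    return (
        magic == PCAP_MAGIC
        and (major, minor) == (2, 4)
        and file_snaplen == snaplen
        and file_linktype == linktype
    )
```

Appending records to a file whose header advertises a different link type or snapshot length would leave readers misinterpreting every record that follows, which is why the check has to come first.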

At the last hackathon, right after OpenBSD 3.3 was released, I added support for the new pflog format. pflogd supports both, but refuses to overwrite an old file and outputs a 'Move away' warning to the syslog.

After OpenBSD 3.4 was released, the bpf extensions were ready, so pflogd was privilege separated. The privileged parent handles the bpf descriptor stuff and the opening/positioning of log files, while the child, running chrooted under the _pflogd user, is used for logging.

Later, (January 2004) it was noticed that the pflogd files may become corrupted if the partition gets full, or after an unclean shutdown. In this case some or all of the appended logs would become unreadable. pflogd now scans the complete log file, and detects corruption, and gives the "Move away" warning, refusing to append until the log is moved or rotated away.
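A scan of this kind amounts to walking the file record by record and checking that each one is complete. The sketch below assumes the classic pcap layout (a 24-byte file header followed by records with a 16-byte header holding seconds, microseconds, captured length, and original length) and little-endian byte order. It is an illustration of the idea, not pflogd's actual routine.

```python
import struct

GLOBAL_HEADER_LEN = 24
# ts_sec, ts_usec, incl_len (bytes stored), orig_len (bytes on the wire)
RECORD_HEADER = struct.Struct("<IIII")


def scan_records(data, snaplen):
    """Return (number of complete records, True if the file looks intact)."""
    if len(data) < GLOBAL_HEADER_LEN:
        return 0, False
    offset = GLOBAL_HEADER_LEN
    count = 0
    while offset < len(data):
        if offset + RECORD_HEADER.size > len(data):
            return count, False          # record header cut short
        _sec, _usec, incl_len, orig_len = RECORD_HEADER.unpack_from(data, offset)
        if incl_len > snaplen or incl_len > orig_len:
            return count, False          # implausible lengths: likely garbage
        offset += RECORD_HEADER.size
        if offset + incl_len > len(data):
            return count, False          # packet data cut short
        offset += incl_len
        count += 1
    return count, True
```

A file that ends in the middle of a record, as can happen on a full partition or after an unclean shutdown, is reported as not intact, which is the cue to refuse appending until the log has been moved or rotated away.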

Future: I have a (half forgotten) diff that handles the infamous "move away" part by renaming the existing log if a problem is detected. I have also not yet abandoned my plans for having ASCII pflogs. tcpdump is safe, and more powerful than anything I could put into pflogd, but lacks the rotation functionality.

Federico: Can, could you provide a short history of the PF logs format?

CEA: The pf logs contain a header and the logged packet itself. In the initial version, the header length was fixed, and very simple. It contained "interface, direction, action, a rule number" and a sub reason (why passed or dropped). This old format contained an unofficial link type (an identifier that determines the interface type and header format) and having a fixed length with no empty fields, it was not extendable.

After 3.3, we improved the format to contain anchor and ruleset names and a header length field, which will allow the format to be extended later as required. We also changed the link type to the official ID obtained from the libpcap/tcpdump maintainers.
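Why a header length field makes a format extendable can be shown with a toy parser. The layout below is hypothetical and much simpler than the real pflog header (one byte each for header length, direction, action, and reason, then a 4-byte rule number); the only point is that a reader which honours the length field keeps finding the logged packet even when newer writers append fields it does not know about.

```python
import struct

# HYPOTHETICAL layout, for illustration only (not the real pflog header):
# header length, direction, action, reason, rule number (network byte order)
BASE = struct.Struct("!BBBBI")


def split_log_record(record):
    """Return (known header fields, logged packet bytes)."""
    if len(record) < BASE.size:
        raise ValueError("record shorter than the base header")
    hdr_len, direction, action, reason, rule = BASE.unpack_from(record, 0)
    if hdr_len < BASE.size or hdr_len > len(record):
        raise ValueError("bad header length")
    fields = {
        "direction": direction,
        "action": action,
        "reason": reason,
        "rule": rule,
    }
    # Skip the whole header, including any trailing fields added later.
    return fields, record[hdr_len:]
```

With a fixed-length header and no length field, the same reader would have to hard-code where the packet starts, and any added field would shift the packet out from under it.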

Federico: The PF log format has changed over time. Do other operating systems or common software such as ethereal and the standard tcpdump support this format?

CEA: I have contributed patches to ethereal, and snort also supports the format. The standard tcpdump has included support for the new format since tcpdump_3_8rel2 (although they dropped backward compatibility with the old format).

Federico: Mike, you wrote most of the stateful engine. How did you audit it?

MF: Auditing is more of a continual process than one would think. Whenever something that you never thought of before comes out, you have to go through all your code again to make sure you're not affected. Often that is a self sustaining process. Inspiration often strikes during the audit and you think of more gotchas so you re-audit and think of more gotchas to audit for. I end up staring [at] a lot of weird and funky traffic at work; corner cases of corner cases. Most of my audits are walking those weird packets or connections through PF's state engine to make sure we don't block a valid packet.

Federico: Mike, you wrote most of the scrub feature. What improvements are planned for the future? What about TCP scrubbing and normalization?

MF: Niels Provos did the initial fragmentation and flag scrubbing support based on Paxson and Handley's USENIX paper. My planned improvements are in the normalization of TCP segments. The TCP protocol was designed so that there are only two active participants in any given connection. Making PF more active in the TCP stream will be done very gradually since mistakes lead to massive ACK storms or connections mysteriously freezing.

Federico: Mike, your job focuses on IDS development. Do you have any ideas for making PF capable of interacting with an IDS like Snort?

MF: There are a few levels of interaction between a firewall and an IDS. I believe Snort has been able to do intrusion detection on any packet logged by PF for a year or two now (PF logged packets appear on the pflog0 virtual interface, which you can monitor with many libpcap based programs like tcpdump or ethereal).

There are already two ways to emulate Linux's DIVERT sockets and turn an IDS into an IPS (Intrusion Prevention System). One could use PF to route the packets to a tunnel device and read them there. Or one could block the packet in PF and watch the full packet show up on the pflog0 logging interface.

And a passive IDS running on the firewall could easily tell PF to kill all of an attacker's connections and add his IP address to a blacklist or even redirect any new connections from him to a honeypot.
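As a rough sketch of that last idea, an IDS hook could build the two pfctl invocations involved: adding the address to a table and killing the states that originate from it. The table name "blacklist" is hypothetical, the sketch assumes the ruleset already blocks traffic from that table, and it only constructs the command lines; actually running them would need root on the firewall.

```python
import ipaddress


def blacklist_commands(attacker, table="blacklist"):
    """Return pfctl command lines that blacklist attacker and drop its states."""
    # Validate first so that nothing unexpected reaches a command line.
    addr = str(ipaddress.ip_address(attacker))
    return [
        ["pfctl", "-t", table, "-T", "add", addr],  # block new connections
        ["pfctl", "-k", addr],                      # kill existing states
    ]


for command in blacklist_commands("192.0.2.7"):
    print(" ".join(command))
```

Adding the address to the table before killing the states matters: otherwise the attacker could simply reconnect in the gap between the two steps.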

The tools are all there. All it takes is someone to add the code to Snort. Someone whose employer doesn't compete with Snort :-).

Federico: Are there any plans to develop application proxies for common protocols like HTTP, SMTP and POP3?

DH: They are easily done in userland, see ftp-proxy(8). We agree to not do it in kernel. If there is enough demand for protocols besides FTP, someone will step up and do it, I'm sure.

HB: No.

ftp-proxy is needed due to the nature of ftp with its two connections, the place for other proxies is in ports IMHO. That said, some stuff is imaginable in-tree, and if somebody steps up and writes good code, this is certainly welcome. Whether it turns into a port then or goes into the main tree, I can't say yet.

RM: In the kernel? Certainly not; the risk of compromise is too great. In userland, we already have ftp-proxy, and there are 3rd party applications which handle many of the other protocols: apache or squid for HTTP; MTAs such as sendmail are basically SMTP proxies by nature.

CEA: These protocols are quite firewall friendly since they use well defined ports. These protocols could benefit from content filtering or caching. OpenBSD already has spamd which is a proxy for spam :) and there are a number of http proxies in the ports tree. I am not aware of any specific plans, but someone might just decide to write one.

CB: I've no idea.

Federico: Some OpenBSD developers wrote various tools to analyze PF logs and statistics (pfstat, Hatchet,...). Is there any project to create a global and unique graphical interface to work with PF?

HB: No.

RM: Not by any OpenBSD developers. Doing a good GUI configuration tool for PF is very difficult because there are so many options, and laying them out intuitively in a graphical interface is nearly impossible.

Federico: Can, could you explain pftop?

CEA: It started as a text-mode realtime display tool for active pf states. It improved quickly with feedback and patches from the community. It now has rule and queue pages, and can compute per state throughput. I should probably find some time to catch up with the new pf features in 3.5 though. pftop is available in the ports tree as sysutils/pftop.
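Per-state throughput of this kind can be derived from two snapshots of the state table's byte counters. The sketch below shows one plausible way to do it and is not taken from pftop's source: the state keys and the dictionary-based snapshots are hypothetical stand-ins for whatever the tool reads from the kernel.

```python
def per_state_rates(previous, current, interval_seconds):
    """Return bytes per second for every state present in both snapshots.

    previous and current map a state key (any hashable identifier) to the
    cumulative byte count seen for that state.
    """
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    rates = {}
    for key, byte_count in current.items():
        if key in previous:
            rates[key] = (byte_count - previous[key]) / interval_seconds
    return rates
```

States that appear only in the newer snapshot are skipped, since there is no earlier counter to subtract; they get a rate on the next refresh.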

Federico: Please tell us the whole truth about PF performance tweaking. What are the right settings to build a stable and network optimized kernel?

HB: Use GENERIC.

I totally don't get that tweak tweak tweak attitude. GENERIC works fine in almost all circumstances. If you have a real problem with GENERIC, as in a problem that shows up during real world usage, then post to misc@ and you'll get help.

Of course there are cases where you need to tune; I have machines where I need to crank nmbclusters a little. BUT: the key point is, as long as GENERIC works and doesn't show a problem, that is the best you can get. And making GENERIC work for more and more scenarios is one of the major things we have been doing for several releases, by switching from compile time options to runtime controllable stuff (mostly sysctl) or at least allowing changes from config/ukc.

In fact, a lot of knobs should not ever be touched. This especially is true for nkmemclusters. The myth about that being needed comes from old days where we had a leak in the routing table code under some circumstances... nowadays you will in almost all cases shoot yourself in the foot by mucking with nkmemclusters. You increase it, something else needs more kernel memory, and you run out. You will crash. So don't touch it.

OpenBSD 3.6 will come with big improvements in that area.

Federico: How will PF evolve? More features or performance tweak?

DH: I think the addition of new features will slow down over time; most things are covered by now. New features have to be justified against the instability that changes introduce. Changes were frequent at first, and I don't mind changes settling down. That allows [us] to more carefully search for performance improvements and plain bugs.

MF: My PF wishlist is almost empty. I imagine most of the next big evolutions will be from spontaneous ideas someone has in the bar or us feeding off ideas of the others during the next PF hackathon.

HB: There's ongoing work on bpf security. We are also looking at further flexibility in the language and some internal changes that solve little problems.

There are no "big" new features planned for pf; maybe some of the stuff we do outside pf gets to interact with it in some way. We've been at the "pf is done" point quite a few times now, and there have been great ideas later on. pf development slowed down, and will slow down even more — not because we don't have enough developers or something, but simply because it is, well, pretty much done.

Compatibility becomes a much larger issue with every release that contains pf and with each and every pf installation, that's an area where I think we have to get better.

CB: The one thing I'd like to have for 3.6 is the ability for firewalls working in a bridged environment to send IP packets (i.e., firewalls without IP address and/or routing table). Currently, features like syn-proxy, return-icmp, return-rst don't work on such a firewall because PF does not know how and where to send packets. Fortunately, I think there are good 95% solutions to that problem, which is probably the most requested feature on PF lists.

Besides that, I'm not aware of any "big" things that would come, but we've been saying that for every past release, as far back as I remember :).

RM: Because of the very clean initial design, PF has never had real performance problems — in the vast majority of PF deployments, the CPU sits essentially idle. So there's not much incentive for developers to spend long hours squeezing a bit more performance out of the code — such efforts would likely increase the complexity and thus the chance of bugs.

With regards to new features, it is hard to say; PF is becoming fairly feature complete and eventually development will slow down to a maintenance mode, like many other areas of OpenBSD. On the other hand, put 2 PF developers in a room together, and they immediately begin to come up with new crazy ideas. So I don't see it stopping within the next few releases.

However, I think the bulk of new work being done will not be in adding new features, but in the following 2 areas: fine tuning the features which already exist as we learn more about how they are used in production deployments, and making internal changes which simplify or reduce the code base. The latter is very boring for users, but actually quite important in reducing the potential for bugs.

CEA: I think, with the recent failover work, the feature range is quite complete. There would definitely be some performance tweaks, and improvements/additions to supporting userland tools before new features.

Federico Biancuzzi is a freelance interviewer. His interviews have appeared in publications such as ONLamp.com, LinuxDevCenter.com, SecurityFocus.com, NewsForge.com, Linux.com, TheRegister.co.uk, ArsTechnica.com, the Polish print magazine BSD Magazine, and the Italian print magazine Linux&C.



Copyright © 2009 O'Reilly Media, Inc.