BSD DevCenter


Puffy's Marathon: What's New in OpenBSD 4.2

by Federico Biancuzzi

OpenBSD is famous for its focus on security. Today, November 1st, the team is proud to announce Release 4.2.

Even though security is still there, this release comes with some amazing performance improvements: basic benchmarks showed PF being twice as fast, a rewrite of the TLB shootdown code for i386 and amd64 cut the time to do a full package build by 20 percent (mostly because all the forks in configure scripts have become much cheaper), and the improved frequency scaling on MP systems can help save nearly 20 percent of battery power.

And then the new features: FFS2, support for the Advanced Host Controller Interface, IP balancing in CARP, layer 7 manipulation with hoststated, Xenocara, and more!

Federico Biancuzzi interviewed 23 developers and assembled this huge interview...

There has been a lot of work to improve performance in PF and networking! What results have you achieved and how?

Henning Brauer: Network data travels through the system in so-called mbufs: preallocated, fixed-size buffers, 256 bytes on OpenBSD. They are chained together, and instead of carrying the data themselves they can point to mbuf clusters of 2 KB each.

PF needs to keep track of various things it does to packets like the queue ID for ALTQ on the outbound interface, the tags for the tag/tagged keywords, the routing table ID, route-to loop prevention, and quite a bit more. Previously we used mbuf tags for that. mbuf tags are arbitrary data attached to a packet header mbuf. They use malloc'd memory. And that turned out to be a bottleneck. So I finally did what I wanted to do for some time (and that Theo, Ryan, and I discussed before)—put this extra information directly into the packet header mbuf, not mbuf tags, and thus get rid of the need to malloc memory for each packet handled by PF.

Since PF has its tentacles everywhere in the network stack, changing this was a big undertaking, but it turned out to make things way easier in many cases and even fix some failure modes (we cannot run out of memory for the mbuf tags any more).

In our tests with a Soekris 4801 as a bridge with the simplest possible ruleset (just one rule: "pass all"), this change doubled performance: the forwarding rate went from 29 to 58 Mbit/s.

What about other PF optimizations?

Henning Brauer: Packet forwarding can skip the IPsec stack if no IPsec flows are defined. This is simply a shortcut: if there are no IPsec flows in the system, we do not need to descend into IPsec land. This yields a further 5 percent improvement in packet forwarding performance.

Also, quite some time ago, someone discovered that firewalls replied with RST or ICMP to packets with an invalid protocol checksum. Since an end host wouldn't have replied due to the checksum error, you could spot the firewall. Because of that, we were verifying the protocol checksum for each and every packet in PF. I changed it to only do so if we are actually about to send an RST back. Voila, a 10 percent higher forwarding rate.

How does your work on improving PF performance affect the two settings for state binding (if-bound/floating)? What is the default in 4.2? Which setting is faster?

Henning Brauer: I have completely rewritten the code dealing with interface-bound states. Previously, there was one state table (well, a pair really) per interface, plus a global one. So for each packet, we first had to do a lookup in the state table on the interface the packet was coming in or going out on, and then, if we didn't have a match, in the global table. Ryan split a state entry into the "key" (everything needed to find a state) and the state info itself.

In the second step I changed things to allow more than one state to attach to a state key entry; states are now a linked list on the state key. When inserting, we always insert if-bound states first and floating ones last. At lookup time, we now only have to search the global state table (there is no other anymore), and then start walking the list of states associated with the key found, if any. We take the first one and check whether the interface the state is bound to matches the one we're currently dealing with or is unset (aka floating state). If so, we're done. If not, we get the next state and repeat until we find one; if we don't, there is no match. This works because we make sure that if-bound states always come before floating ones. In the normal case with floating states there is only one entry and we're done. This change increased forwarding performance by more than 10 percent in our tests.

As you might guess, this was not a simple change. Ryan and I have been thinking about it, discussing and developing the concept for more than a year—well, this is only part of it really.

Defaults have not changed; floating states are default, and there are very, very, very few reasons to change that, ever. There is no performance difference between the two.

"Improvement in the memory pool handling code, removing time from the pool header leads to better packet rates." How did you spot this, and who fixed it?

David Gwynne: That was something I found at the hackathon this year while profiling the kernel. Ted Unangst fixed it for me.

The memory for packets in the kernel is allocated out of pools. Every time a chunk of memory was returned to a pool, the hardware clock was read to record when the pool was last used. Reading the hardware clock is very slow on some machines, which in turn causes things that use pools a lot to slow down. Since I was trying to move a lot of packets through a box, I noticed it. After I described the problem to Ted, he decided that the timestamping was unnecessary and removed it.

"Make the internal random pool seed only from network interrupts, not once per packet due to drivers using interrupt mitigation more now." Another step to improve networking speed! Who worked on this?

David Gwynne: This was something I found while profiling the kernel at the Calgary hackathon this year. This time it was fixed by Ryan McBride.

The kernel is responsible for providing random numbers, but to do that effectively it has to collect randomness from suitable sources. One of those sources is the time at which interrupts occur. Network interrupts are timestamped when they signal that a packet has been received, and the time is read from the hardware clock which, as we've said before, can be really slow. The problem is that modern network cards are capable of receiving many packets per interrupt, which meant we were reading the hardware clock several times per interrupt instead of once as intended.

Reducing the number of these clock reads means we can spend more time processing the packets themselves, which in turn increases throughput. Ryan managed to do this by modifying the network stack to defer the stirring of the kernel random pool to the softnet interrupt handler.

When a network card's interrupt handler sends a packet to the stack, the packet is quickly analyzed to figure out whether it is one we're interested in. If it is, we put it on a queue to be processed and a soft interrupt is raised. When the hardware interrupt is finished, the soft network interrupt is called and all the packets in that queue are processed. So for every hardware interrupt that occurs, we end up doing one softnet interrupt too. By sticking the stirring of the random pool at the top of the softnet handler, Ryan got us back to reading the clock once per interrupt instead of once per packet. Throughput went up again.

"Enable interrupt holdoff on sis(4) chips that support it. Significant performance gain for slower CPU devices with sis(4), such as Soekris." Would you like to tell us more about this?

Chris Kuethe: Quite a number of network adapters have a configurable mechanism to prevent the machine from being run into the ground under network load. This is known as holdoff, mitigation, or coalescing. The general idea is that the network adapter does not raise an interrupt as soon as a frame arrives; rather, the interrupt is delayed a short time (usually one frame or a few hundred microseconds) in case another frame arrives very soon thereafter.

Picking a good delay value, or set of conditions under which to signal the arrival of a frame, is not easy. Too much holdoff and network performance is severely degraded; too little and no benefit will be noticed. When ping times go up and TCP stream speeds go down, you're delaying too much.

In the case of the Soekris (or anything else that uses sis(4)), interrupt holdoff was not enabled. By enabling holdoff, we allow the network controller to delay and buffer a few frames. This spreads the cost of the interrupt across several packets.

What challenges does 10 Gb Ethernet support present?

David Gwynne: Our biggest challenge at the moment is finding developer time. We (always) have a lot of work to do but the time to do it is hard to find sometimes.

On a more technical level, supporting 10 Gb hardware is about as hard as it is to support any other new hardware. Someone has to sit down and figure the chip out and move data on and off it. That is possible using a driver from another operating system, but it is way easier if you have documentation. If you have documentation the driver is usually faster to develop and it always ends up being higher quality and more reliable. Fortunately a significant proportion of the vendors in the 10 Gb space are happy to provide documentation for their hardware.

Supporting the movement of packets through our network stack to the hardware at 10 Gb speeds is a problem we've always had. We've always wanted things to go faster, 10 Gb is just another level to strive for. Having said that though, moving packets in and out of boxes that fast makes problems more noticeable and therefore more attackable. 10 Gb hardware is also getting a lot smarter about how it moves packets from the chip to the computer itself. Some of those mechanisms are worth pursuing, others aren't.

One of the common mechanisms is offloading some or all of the work the network stack does onto the hardware. This ranges from offloading TCP segmentation (TSO and LSO) all the way up to full TCP Offload Engines (TOE). We actually like the OpenBSD network stack though, so we aren't going to implement support for this.

The other popular mechanism is to parallelize the movement of packets on and off the card (i.e., on "old" network cards you can only have one CPU operating on the hardware at a time, while a lot of 10 Gb cards provide multiple interfaces for this same activity, meaning you can have several CPUs moving packets on and off the chip at the same time). Supporting this obviously provides a big challenge to OpenBSD since we have the Big Giant Lock on SMP systems. Only one CPU can be running in the kernel at a time, so you can only have that one CPU dealing with the network card.

4.2 brings a new 10 Gb driver for Tehuti Network controllers (tht(4)), and a lot of improvements in the kernel and the network stack that help throughput. These improvements help all network cards though, not just 10 Gb ones.

What does this release offer to Wi-Fi users?

Jonathan Gray: In 4.2 the main additional wireless hardware support comes in the form of support for Marvell 88W8385 802.11g-based Compact Flash devices in malo(4). This is of particular interest for Zaurus users wanting faster network I/O. Beyond that, it was mostly 802.11 stack/driver bug fixes. Two new drivers recently hit the development branch: Damien Bergamini's iwn(4) driver for Intel 4965AGN Draft-N devices, and a port of Sepherosa Ziehau's bwi(4) driver for Broadcom AirForce/AirPort Extreme devices from DragonFly. However, these were too late for 4.2 and will appear in 4.3.

Did you improve isakmpd interoperability?

Todd T. Fries: There are two important isakmpd(8) interoperability fixes new with the 4.2 release. One permits interoperability with other IKE implementations that re-key on udp port 4500, instead of expecting port 500 re-keying to occur. The other permits key exchange with RSA signature authentication to work with Cisco IOS. Both expand on the wide range of IKE implementations isakmpd(8) is already interoperable with.

"Provide software HMAC for glxsb CPUs, so IPSec can use the crypto HW." How much does this feature improve performance concretely?

Markus Friedl: It improves IPsec performance on a 500 MHz machine from 17 Mbit/s to 30 Mbit/s with AES/SHA1 and PF enabled. This does not affect OpenSSH, since OpenSSH could use the hardware before this change.

Why have you replaced the random timestamps and ISN generation code for TCP with an RFC 1948-based method?

Markus Friedl: Machines are getting faster and doing more TCP connections per second, so TCP client port reuse is getting much more likely. Both random timestamps and random ISNs make it hard for the TCP server to distinguish "old" TCP segments from new connections. Using an RFC 1948-based method restores monotonic timestamps and ISNs for the same 4-tuple, making it possible for the TCP server to allow early port reuse.

You fixed a really old bug in the socket code. Would you like to tell us more about it?

Dimitry Andric: This is actually something that I didn't find myself; I just happened to see the issue come along on the FreeBSD CVS commit list. It's a very tricky macro bug that has existed since the very first revision of the sblock() macro, and was apparently never noticed until recently.

It replaces this version of a macro in src/sys/sys/socketvar.h:

#define sblock(sb, wf) ((sb)->sb_flags & SB_LOCK ? \
                (((wf) == M_WAITOK) ? sb_lock(sb) : EWOULDBLOCK) : \
                ((sb)->sb_flags |= SB_LOCK), 0)

with this fixed version:

#define sblock(sb, wf) ((sb)->sb_flags & SB_LOCK ? \
                (((wf) == M_WAITOK) ? sb_lock(sb) : EWOULDBLOCK) : \
                ((sb)->sb_flags |= SB_LOCK, 0))

Here sb is a pointer to struct sockbuf, and wf is an int ("waitfor").

The only difference is moving that next-to-last right parenthesis. But it changes the entire meaning of the macro! The original version will always return 0, since the ", 0" is the last part of the complete expression. This was not what was intended: it should only return 0 directly in the case where sb didn't have its SB_LOCK flag set.

If you'd write this as a much clearer inline function, without the ?: operator, the original would become:

inline int sblock(struct sockbuf *sb, int wf)
{
        if (sb->sb_flags & SB_LOCK) {
                if (wf == M_WAITOK) {
                        (void) sb_lock(sb); // return value gets ignored
                } else {
                        (void) EWOULDBLOCK; // value gets ignored
                }
        } else {
                sb->sb_flags |= SB_LOCK;
        }
        return 0; // always succeeds! yeah right :)
}

while the fixed version would become:

inline int sblock(struct sockbuf *sb, int wf)
{
        if (sb->sb_flags & SB_LOCK) {
                if (wf == M_WAITOK) {
                        return sb_lock(sb);
                } else {
                        return EWOULDBLOCK;
                }
        } else {
                sb->sb_flags |= SB_LOCK;
                return 0;
        }
}
This is a good example of why the ?: operator should be used with caution, at least in complicated expressions with macros.
