 Published on ONLamp.com (http://www.onlamp.com/)


OpenBSD 3.8: Hackers of the Lost RAID

by Federico Biancuzzi
10/20/2005

It's release time again for OpenBSD! The upcoming 3.8 will include some wonderful features for network gurus (trunking, tracking wireless roaming users, interface groups, a new ipsec configuration tool, and failover of ipsec links), a great rework of malloc() that will provide further security protections by default, and the first version of bioctl--a universal RAID management interface.

With so many goodies to taste, Federico Biancuzzi contacted a large group of developers and assembled this long, long interview.

OpenBSD 3.8 is the first release that supports network interface aggregation, using the virtual trunk(4) interface. How does it work?

Reyk Floeter: With trunk(4), you can combine one or more ports into one virtual network interface. A port can be any physical Ethernet or wireless interface, and it's even possible to add other trunks as ports. The trunk driver sends outgoing traffic over the attached ports using an algorithm that depends on the trunk protocol in use. The first release in OpenBSD 3.8 supports a simple round-robin protocol; outgoing packets are distributed over the ports in a circular way. Furthermore, incoming packets from any attached port are forwarded into the receive queue of the virtual trunk interface. trunk(4) is supported by most current intelligent network switches and some other operating systems, but everyone uses a different name; e.g., HP calls a trunk a Trunk and Cisco calls it EtherChannel, while Linux uses bonding as its name.

trunk(4) provides several possible benefits. The first one is slightly improved performance, because the traffic can be distributed over several physical network interfaces. You could get more than 100Mbit/s with a Fast Ethernet trunk, or even more than 1Gbit/s with a gigabit trunk. The most interesting feature of trunk is failover on layer 2. The trunk will continue to work if you remove the network cable of an attached port, as long as there are other running ports attached to the trunk. The interface link states are used to detect inactive ports and to skip them in the round-robin scheduling.

Example of a trunk using two sk(4) gigabit adapters:

# ifconfig sk0 up
# ifconfig sk1 up
# ifconfig trunk0 create trunkport sk0 trunkport sk1
# ifconfig trunk0 192.168.1.200 netmask 255.255.255.0 up
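
The round-robin behaviour described above, including the skipping of ports whose link is down, can be modelled in a few lines of Python. This is only an illustrative sketch and not the trunk(4) driver code; the class and method names are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Port:
    name: str
    link_up: bool = True  # stands in for the interface link state


class RoundRobinTrunk:
    """Toy model of a trunk that spreads outgoing packets over its ports."""

    def __init__(self, ports: List[Port]):
        self.ports = list(ports)
        self._next = 0  # index of the port to try first for the next packet

    def pick_port(self) -> Optional[Port]:
        """Return the next active port in circular order, or None if all are down."""
        count = len(self.ports)
        for offset in range(count):
            index = (self._next + offset) % count
            if self.ports[index].link_up:
                # Continue after the chosen port next time.
                self._next = (index + 1) % count
                return self.ports[index]
        return None
```

With two active ports, successive calls to pick_port() alternate between them; once one port's link goes down, every packet goes out on the remaining port.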

What improvements and new features do you plan to develop with it?

Reyk Floeter: I already improved some minor things like interface flags and ifconfig behaviour. Now it's possible to add VLANs to the trunk with a full-size MTU, if all the attached ports support oversized Ethernet frames (802.1Q adds a four-byte tag to the Ethernet header for VLAN ID and priority). With the current trunk driver in OpenBSD 3.8, the MTU of a VLAN is limited to 1,496 bytes.
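
The 1,496-byte figure follows from simple arithmetic: the standard Ethernet MTU is 1,500 bytes, and the 802.1Q tag takes 4 of them unless the parent interface accepts oversized frames. A minimal sketch, with hypothetical function and parameter names:

```python
ETHERNET_MTU = 1500  # standard Ethernet payload size in bytes
VLAN_TAG_LEN = 4     # size of the 802.1Q tag in bytes


def vlan_mtu(parent_mtu: int, parent_takes_oversized_frames: bool) -> int:
    """Return the MTU available to a VLAN stacked on a parent interface."""
    if parent_takes_oversized_frames:
        # The tag fits in the extra room, so the VLAN keeps the full MTU.
        return parent_mtu
    # Otherwise the tag has to come out of the normal frame size.
    return parent_mtu - VLAN_TAG_LEN
```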

An interesting improvement of trunk(4) in OpenBSD-current is full support for multicast. With multicast support you'll be able to use things like IPv6, hostapd, or even carp(4). Using the CARP protocol over trunk is a very interesting feature to extend the failover capabilities of OpenBSD. For example, a redundant firewall setup using carp + pfsync(4) will support layer 2 redundancy with trunk as well; that's a way to keep your system "always on."

Work on IEEE 802.3ad LACP is in progress because it's important for interoperability with some network gear. It will be an optional trunk protocol in addition to the default round-robin scheduler. While starting the implementation of LACP, I noticed that I don't like it very much because it's a typical overengineered IEEE approach ... but it will be there, and I think it will be the most simplified but fully functional implementation.

Marc Espie recently had an interesting idea for using trunk to realize roaming between wireless and wired networks. This sounds weird, but the idea is simple--many people are using their wireless notebooks to do common work like hacking, browsing, and any kind of communications. But sometimes it is necessary to make some bulk transfers, like large file downloads or even system updates. I'm currently extending trunk(4) with an "active failover" mode, using multiple ports with a hierarchic priority; e.g., my ThinkPad's em(4) Gigabit Ethernet interface will be used as the master port with the highest priority, and the ath(4) wireless interface will be the failover interface in a virtual trunk. Whenever I unplug the Ethernet cable, the trunk will switch exclusively to the wireless interface, and vice versa, without resetting the running network connections. From the network's point of view, this will be the same as a wireless station roaming between different APs in one collision domain.
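
The planned "active failover" mode boils down to always using the highest-priority port whose link is up. The following sketch illustrates that selection rule only; the numeric priorities and all names are assumptions made for the example, not details of the trunk(4) implementation.

```python
from typing import NamedTuple, Optional, Sequence


class Port(NamedTuple):
    name: str
    priority: int   # hypothetical scale: a larger value means more preferred
    link_up: bool


def active_port(ports: Sequence[Port]) -> Optional[str]:
    """Return the name of the preferred port that currently has link."""
    candidates = [p for p in ports if p.link_up]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.priority).name
```

With em0 as the master and ath0 as the failover port, active_port() returns "em0" while the cable is plugged in and "ath0" as soon as the wired link goes down.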

How is your battle to get free access to specifications and redistributable firmware for WiFi chipsets going? Does this release include any new drivers or firmware?

Reyk Floeter: OpenBSD 3.8 will not add any new wireless network drivers. Work has been concentrated on improving the existing wireless drivers and our interpretation of the net80211 layer. Some commands have been added to ifconfig, and some cleanup has been done. It's funny that we actually removed more lines from net80211 than we added during the last release cycle.

Nevertheless, the battle for new wireless drivers is not over. Work on some new drivers and new chipset revisions is in progress. We got some amazing support from some vendors during the last month, and we'll be able to support their a/b/g chipsets very soon. But I have to note that we haven't received any documentation or hardware from wireless chipset vendors in the US yet.

You have implemented the Inter-Access Point Protocol (IAPP). Now we can use hostapd to handle communication between different 802.11 wireless access points running in Host AP mode. This means that we could create a network of APs and track roaming users. How does it work concretely?

Reyk Floeter: I started the implementation of hostapd(8) some months ago to add this IAPP protocol. It's still a very basic implementation and does not support the full protocol yet. Currently, an access point using hostapd sends an IAPP notification to an internal multicast or broadcast group every time a new station has been successfully associated. Other hostapds listening to this multicast or broadcast group will be able to remove any allocated resources for this station, which will improve their availability. Additionally, it's possible to do central logging of any station movements in the wireless network in real time, using a hostapd listening to the IAPP messages.
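
The notification scheme can be illustrated with a small in-process simulation. It is a sketch under heavy assumptions: a shared Python list stands in for the multicast or broadcast group, no IAPP packet format is modelled, and every class and method name is invented for the example.

```python
class AccessPoint:
    """Toy access point that announces new associations to its peers."""

    def __init__(self, name, group):
        self.name = name
        self.stations = set()
        self.group = group        # shared list standing in for the multicast group
        group.append(self)

    def associate(self, station_mac):
        self.stations.add(station_mac)
        # Announce the new association to everyone else in the group.
        for peer in self.group:
            if peer is not self:
                peer.on_iapp_add(station_mac, self.name)

    def on_iapp_add(self, station_mac, new_ap):
        # The station has moved elsewhere: free whatever we held for it.
        self.stations.discard(station_mac)


class MovementLog:
    """Passive listener that records station movements centrally."""

    def __init__(self, group):
        self.events = []
        group.append(self)

    def on_iapp_add(self, station_mac, new_ap):
        self.events.append((station_mac, new_ap))
```

When a station associates with a second access point, the first one drops its entry for that station, and the log records both associations in order.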

Does it offer any new opportunity for WiFi attackers?

Reyk Floeter: No, it will not make it worse than IEEE 802.11 already is. But hostapd(8) provides some great features to improve wireless network security. After the c2k5 hackathon, I added support for "Event Rules." They provide a powerful mechanism to trigger certain actions when receiving specified IEEE 802.11 frames. Human-readable event rules in hostapd.conf(5), similar to the packet filter rules in pf.conf(5), will be helpful to implement proactive wireless network monitoring, also known as WIDS/WIPS (wireless intrusion detection/prevention system). With hostapd(8) I keep track of what's going on in my wireless networks using a central logging server listening to the multicast group. In some setups, where nobody else is allowed to run an access point, I even use hostapd to send deauthentication frames to stations associated to rogue access points and to detect and log the rogue access points.

Reyk talked about network interface aggregation using the virtual trunk(4) interface. I think there is a conceptually similar feature that permits the creation of interface groups via ifconfig. What cool things can we do with it?

Henning Brauer: Well, while trunk groups interfaces to aggregate bandwidth, interface groups provide an administrative grouping. Interfaces can join and leave named groups any time, via ifconfig:

# ifconfig sk0 group somegroup
# ifconfig sk0 -group somegroup

Interfaces can be in more than one group, and of course a group can contain more than one interface. Now, pf can filter based on the group names. You could, for example, have your external interface on a typical firewall join a group ext, and have pf filter on the group ext instead of the interface. That way, your ruleset is hardware independent--the group assignment goes to the hostname.if files, which are machine dependent anyway. If you do the same for your internal interface it makes even more sense; if you add a second internal one, say, a wireless card, you just make it join the group--no need to modify the ruleset.

We do maintain some groups by default. Cloneable interfaces--or, rather, the cloned ones--are part of a group named by the driver. For example, all loopback interfaces are members of the lo group, all tun interfaces are part of the tun group, and so on. And there is a group called egress that follows the IPv4 and IPv6 default routes--it contains all interfaces that default routes point to, which in the typical case is your "external" interface. But it is also very handy for notebooks with mixed wireless/wired usage.
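
The idea of writing rules against a group name and letting membership decide which interfaces they cover can be sketched as a small lookup table. This is only a conceptual model with invented names; it does not reflect how ifconfig or pf store or resolve groups internally.

```python
class InterfaceGroups:
    """Toy registry mapping group names to member interfaces."""

    def __init__(self):
        self._members = {}

    def join(self, ifname, group):
        self._members.setdefault(group, set()).add(ifname)

    def leave(self, ifname, group):
        self._members.get(group, set()).discard(ifname)

    def expand(self, name):
        """Return the interfaces a rule written against `name` applies to.

        A known, non-empty group expands to its members; any other name is
        treated as a plain interface name.
        """
        members = self._members.get(name)
        if members:
            return sorted(members)
        return [name]
```

Adding a wireless card to an existing internal group then changes what expand() returns for that group, while the rule that names the group stays untouched.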

The ifstated daemon runs commands in response to network state changes. For example, it can ensure that CARP interfaces stay in sync. If I remember correctly, this tool has been part of the source tree for a couple of releases, but it was not linked to the build process. Why? Did you need any particular feature that is only available since 3.8?

Marco Pfatschbacher: No, we didn't add any new ifstated features in 3.8. The reason it wasn't included until 3.8 was some discussion about the config file syntax and its actual practical use. The initial version's only purpose was to trigger commands in response to CARP state changes. Later a complete state engine was added, along with the ability to run external test commands. The idea was that a host having more than one CARP interface can monitor its surrounding network and give up its master state in case of an error. You can find details on Ryan's web site.

Once CARP gained support for monitoring the link state of its physical interfaces, there was a simpler and much nicer solution to this problem.

However, we still found it to be quite a useful program, so we kept it. Just to give you some examples:

Currently I'm extending ifstated to react to address changes on interfaces. This is especially useful for the new in-kernel pppoe device, where you can trigger some commands after you receive a new IP address following a disconnect from your ISP.

The upcoming 3.8 will include sasyncd, an IPSec SA synchronization daemon for failover gateways. How does it interact with PF, CARP, and isakmpd to provide failover capabilities?

Håkan Olsson: sasyncd currently relies on a specified CARP interface for master and backup state; i.e., when the interface changes state, so does sasyncd.

The sasyncd process synchronizes IPsec SA data to other sasyncd peers (SA creation, deletion, etc.). Then, in the kernel, pfsync(4) has been extended to also send replay-counter updates for these SAs to the other peers. Without this, the remote VPN peer would not accept any IPsec packets from the "new host" in case of a failover.

In 3.8, isakmpd(8) needs to be configured to only respond to negotiations in a failover scenario, i.e., it should not initiate them. This requirement should go away soon.
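
Why the replay counters have to be synchronized can be seen in a simplified model of anti-replay checking. Real IPsec uses a sliding window; the sketch below reduces that to a strict "must be higher than anything seen so far" test, and all class names are invented for the example.

```python
class ReplayCheck:
    """Remote VPN peer: accepts only sequence numbers above the highest seen."""

    def __init__(self):
        self.highest_seen = 0

    def accept(self, seq):
        if seq <= self.highest_seen:
            return False  # looks like a replayed packet, so drop it
        self.highest_seen = seq
        return True


class Gateway:
    """Sending side of one SA; the counter increases with every packet."""

    def __init__(self):
        self.send_counter = 0

    def next_seq(self):
        self.send_counter += 1
        return self.send_counter
```

A backup gateway that takes over with a counter of zero would have its first packets rejected by the remote peer, while one whose counter was kept in step with the master continues where the master left off.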

The new tool called ipsecctl sounds like an isakmpd replacement with an easier configuration. What type of features does it provide? How does it interact with isakmpd?

Hans-Joerg Hoexer: ipsecctl(8) is an IPsec management tool for both static and automatic keying and acts as a front end to isakmpd(8). It is still in development, but 3.8 ships a version already usable for simple setups.

Up to now, we had the ipsecadm(8) utility for static keying. All parameters are passed as command-line options. Thus one had to write a shell script to set up ipsec(4). See /usr/share/ipsec/rc.vpn. For isakmpd(8), we have two configuration files, isakmpd.conf(5) and isakmpd.policy(5). Even for simple setups, those quickly get complex.

To simplify this, we came up with ipsecctl(8). It is intended to provide a uniform way of configuring ipsec(4) and to completely obsolete ipsecadm(8) and isakmpd.conf(5)/isakmpd.policy(5).

We decided to use a language derived from pf.conf(5) (see ipsec.conf(5)): Rules define which packets will go through ipsec(4), which security services will be applied, and how keys are established. Care is taken that only a minimal set of parameters needs to be specified, and reasonable default values are used otherwise.

For example:

esp from 192.168.3.14 to 192.168.3.12 spi 0xdeadbeef:0xbeefdead \
    authkey file "auth14:auth12" enckey file "enc14:enc12"

This rule creates an IPsec tunnel between the hosts 192.168.3.14 and 192.168.3.12 using ESP with static keys read from some files. No authentication and encryption algorithms are specified; thus ipsecctl(8) will use HMAC-SHA2-256 and AES counter mode as strong default algorithms.

For automatic keying, ipsecctl(8) generates proper configurations and feeds them to isakmpd(8) using its FIFO interface. Thus it is not necessary anymore to use isakmpd.conf(5). For example, to set up a VPN between the networks 10.1.1.0/24 and 10.1.2.0/24, one can use this rule:

ike esp from 10.1.1.0/24 to 10.1.2.0/24 peer 192.168.3.2

Again, ipsecctl(8) will choose good default values for authentication and encryption (3DES-SHA1 for phase 1 and AES-128 and HMAC-SHA2-256 for phase 2), SA lifetimes, and so on.

Current development focuses mainly on improving interaction with isakmpd(8).

I think watchdogd is another good tool for a server, but I'm wondering--are watchdog timers part of common motherboards?

Alexander Yurchenko: Most chipset vendors include a watchdog timer in their integrated circuits, although not every motherboard manufacturer actually uses it. For example, the popular Intel 6300ESB chipset has a watchdog timer, and OpenBSD 3.8 includes a driver for it--ichwdt(4)--but you should refer to your motherboard manual to see if you can benefit from it.

One tool to manage them all. That should be bioctl, a RAID management interface. This is the first version of the tool, so, what type of features does it provide already?

Marco Peereboom: It provides the bare necessities to do RAID management without rebooting. As a matter of fact, Theo wrote a very informative email to misc@ in which he explains the functionality.

The idea is to use the BIOS of the RAID card to create and delete RAID volumes and set some values like rebuild rate, alarm enable/disable, etc. Then use bioctl(8) to monitor the RAID HBA while running inside the OS. When a RAID volume fails, we can replace the bad disk with a new one or rely on the RAID card to start rebuilding on a hot spare. After a failed disk is replaced, bioctl(8) provides the mechanics to make the unused disk into a hot spare. This and a few more options (e.g., enable/disable alarm) and commands (e.g., blink disk, create hot spare) essentially provide a full-fledged RAID management solution. Vendor solutions have a lot more options, but after evaluating those and really thinking about this, we came to the conclusion that those solutions are too complex and riddled with useless functionality. Don't get me wrong; we have plenty of work ahead of us. However, from a "What do we really use?" perspective, I think we are pretty darn close.

My experience has shown the following usage pattern:

  1. Create RAID volumes in BIOS
  2. Install OS
  3. Install software to do something
  4. Monitor OS & hardware

When upgrading becomes a necessity, people usually do this:

  1. Practice upgrade on secondary server
  2. Schedule downtime
  3. Execute upgrade
  4. Test and redeploy server
  5. Resume "Monitor OS & hardware"

Most likely failure scenario:

  1. Disk goes bad
  2. Hot spare kicks in and rebuilds from parity, essentially replacing the bad disk
  3. Operator physically replaces bad disk
  4. Operator makes newly inserted disk a hot spare
  5. Resume "Monitor OS & hardware"

bioctl(8) provides all necessary functionality to perform these steps.

Most of the monitoring magic is handled by the RAID firmware. So what we really had to do is come up with a "common language" that should work on all RAID controllers. The API we came up with is simple and should pretty much translate into just about any RAID card I have played with. Undoubtedly we'll run into some issues, but we'll deal with those when they appear. What makes this an interesting exercise is that the more RAID cards we support, the easier this'll become.

David Gwynne: This is minimal compared to the management tools provided by vendors, which provide things like the ability to create and destroy RAID sets all the way up to flashing the firmware on your controller. Since all this functionality is present in the controller's BIOS when you boot the machine, we considered most of this to be just fluff when you're actually in the operating system and running a production server. If you're going to modify the RAID sets and change those parameters, your machine is going to be out of service in a maintenance window, and rebooting isn't a problem. The real problem when you're running a machine is how to tell when your RAID sets are degraded and how to fix them. So keeping that in mind, we made a conscious decision to support the bare minimum that will be common across all controllers.

Is there any vendor that chose to contribute with hardware or specifications?

Marco Peereboom: LSI has been very nice in providing hardware, certain pieces of documentation, and engineering help. In the end, to make all this happen, there was quite a bit of reverse engineering done as well.

OpenBSD has not received any documentation or help from other vendors. This is really sad if you think about it. It is apparent that a lot of OpenBSD users want RAID and RAID management; not many days go by without someone on the lists asking about what products to buy and how to manage them. The answers have been pretty boring since everyone is pointed to a single vendor. The vendors that have not cooperated with OpenBSD are clearly losing business.

I honestly do not understand why vendors are so secretive about all this. All these products are essentially the same. Here is what a common command looks like:

  1. Send some command to firmware
  2. Wait for completion (either polled or via callback)
  3. Parse results

If you look at the code involved in getting this done all the way from userland into the RAID firmware, you'll realize that it is really trivial. Someone explain to me why setting a handful of values in a structure is considered IP.
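
The three steps listed above (send a command, wait for completion, parse the results) might look roughly like this when written out. Everything here is hypothetical: the FakeFirmware class, the opcode, and the two-byte reply layout are stand-ins chosen for illustration and do not describe any real controller.

```python
import time


class FakeFirmware:
    """Stand-in for a RAID controller that completes after a few polls."""

    def __init__(self, polls_until_done=3):
        self._remaining = polls_until_done
        self._reply = None

    def submit(self, opcode):
        # Hypothetical reply: byte 0 = volume count, byte 1 = degraded count.
        self._reply = {"opcode": opcode, "status": 0, "payload": b"\x02\x01"}

    def done(self):
        self._remaining -= 1
        return self._remaining <= 0

    def read_reply(self):
        return self._reply


def run_command(fw, opcode, poll_interval=0.0, max_polls=100):
    fw.submit(opcode)                      # 1. send some command to firmware
    for _ in range(max_polls):             # 2. wait for completion (polled)
        if fw.done():
            break
        time.sleep(poll_interval)
    else:
        raise TimeoutError("firmware did not complete the command")
    reply = fw.read_reply()                # 3. parse results
    if reply["status"] != 0:
        raise RuntimeError("firmware reported an error")
    payload = reply["payload"]
    return {"volumes": payload[0], "degraded": payload[1]}
```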

David Gwynne: Personally I would appreciate documentation from a vendor more than hardware. Our community seems to be enthusiastic enough about what we're doing that they'll chip in for hardware when it is needed.

What is in the pipeline for future releases?

David Gwynne: At the moment, I am going over the ami driver and trying to clean it up and optimize it a bit. There are changes coming in the bio and sensor frameworks that will make things better; however, to most people these changes are mostly boring and transparent.

Marco Peereboom: On my own list I have adding mpt(4) to bio(4) and streamlining ami(4) support. In the not too distant future, SAS (Serial Attached SCSI) products will start to arrive in the field. I want to add support for those products as well.

In other words, I am not going to be bored. As a matter of fact, there is so much work in this area that we really could use some help. If there are any folks out there with the skills and time to work on this, please let me know.

You provide a couple of drivers to monitor hard disk status via the SCSI Enclosure Services command set. A lot of people use SATA instead, because of its good performance and lower price. Does SATA offer similar features too?

Marco Peereboom: Well, SES and SAF-TE do not really monitor disk status. The only disk-related thing they monitor is "insertion" and "removal." SES and SAF-TE are used to provide the missing link in SCSI hot plug. Upon insertion or removal of a disk they will set or clear a bit, respectively, to indicate what the current slot status is. A RAID or SCSI controller will retrieve the slot status page and discover that something has been inserted or removed. Whenever something is inserted, it runs the so-called spin-up code to make the drive accessible. If something is removed, the RAID firmware will look to see if the removed disk was part of some RAID set and act accordingly.
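
The slot-status mechanism amounts to comparing two bitmasks. The sketch below assumes one bit per slot, as in the description above, and makes no attempt to model the real SES or SAF-TE page layouts; the function name is invented for the example.

```python
def slot_changes(previous, current, slots):
    """Compare two slot-status bitmasks and report what changed.

    Bit n set means a disk is present in slot n. Returns two lists of slot
    numbers: (inserted, removed).
    """
    inserted, removed = [], []
    changed = previous ^ current          # bits that differ between the pages
    for slot in range(slots):
        bit = 1 << slot
        if changed & bit:
            if current & bit:
                inserted.append(slot)     # bit was set: run spin-up code
            else:
                removed.append(slot)      # bit was cleared: check RAID sets
    return inserted, removed
```

For a four-slot enclosure going from 0b0101 to 0b0011, the function reports slot 1 as inserted and slot 2 as removed.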

All that said, SES and SAF-TE are more useful than just that. The other major component is monitoring environmentals. They measure and report temperatures, fan speeds, power supply status, etc. This data is conveniently available in sysctl(8) and can therefore be monitored with sensorsd(8).

In the near future, SAS and SATA2 drives will be used in mixed configurations since SAS can handle SATA2 devices transparently. The net result is that at some point one can have a SAS transport from the RAID card to an external enclosure while the disks in that enclosure are SATA2. In this scenario, if there is an SES device on that enclosure, it will be available. So technically it isn't a SATA2 device; however, it is available and provides the same functionality. One word of caution though: SATA2 is cheap for a reason, so keep that in mind when making a purchase decision.

David Gwynne: I am aware that there is such a thing as a SATA SAF-TE device, but I don't have one and I haven't looked into them. I would be extremely surprised if they worked, since our enclosure drivers attach to scsibus and all our SATA drivers are supported by the pciide driver (which doesn't attach a scsibus). Unless the firmware in SATA variants of ami controllers emulates a SCSI bus over SATA for the enclosure, our drivers cannot attach.

Does RAIDframe work with bioctl? Or maybe do you plan to replace RAIDframe?

David Gwynne: The software RAID framework is configured and managed by raidctl, which is completely separate from bioctl and its support for hardware RAID controllers. That said, it is possible that the hooks that bioctl uses could be implemented in RAIDframe to allow bioctl to monitor it as well. I don't see the benefit in doing so, though.

There is also a general consensus that RAIDframe is Not Good(tm) and needs to be either rewritten or hacked to bits into something simpler and smaller. I'd predict that it is more likely that ccd(4) will be extended to support RAID 5 as well as RAID 0 and 1 before RAIDframe has any work done on it. If that happens, then RAIDframe could just go away and no one would miss it.

This is all just talk at the moment, and as such totally unreliable as an indicator of future work. This isn't a priority for anyone I've spoken to, so don't be surprised if it doesn't happen in the short term or at all.

"wd(4) disks have the security feature frozen before being attached to prevent malicious users setting a password that would prevent the contents of the drive from being accessed." Does this mean that we cannot set a password anymore with atactl?

Jonathan Gray: Modern ATA disks have what is known as the security feature set. This allows passwords to be set on drives which prevent the contents from being accessed without the right password.

The problem with this was brought to our attention by the c't magazine article entitled "How ATA security functions jeopardize your data," which outlines how this can be abused.

In practice, the security feature set turns out to be a bad idea because it is nearly always on by default. If someone has the equivalent of root access for just a moment, they can set a password that will prevent the data on the drives from being usable. You have to either erase the drive or be prepared to pay a large amount of money to a data recovery company that has broken the system to get a usable drive again.

There is a workaround the standard allows us, which is turning off the ability to set passwords until the next boot cycle. Ideally BIOS implementations would deal with this and disable the security feature set by default, but most currently do not. So we take matters into our own hands and disable the security feature set on all ATA drives in the kernel before the rest of the system can use them.

So yes, no more password setting with atactl, but this turns out to be no great loss.

The man page of the new aps driver for the built-in accelerometer found in some IBM ThinkPad laptops states, "As IBM provides no documentation, it is not known what all the available sensors are used for." I thought IBM was an open source-friendly vendor, especially since they adopted Linux. How did you develop the driver?

Jonathan Gray: IBM only seems to be involved in open source to the extent that it suits them. This for the most part seems to mean on the server side of things. IBM employee Mark Smith and his friend Anurag Sharma reverse-engineered the Windows driver to figure out how parts of it are supposed to work; the driver is based on information in the document Mark has on his site. It kind of highlights how bad things are when an IBM employee has to reverse-engineer an IBM product to figure out how it works.

What is possibly more worrying are the standards bodies who either don't let people access standards at all (e.g., SD, where you need to be a member corporation and have signed an NDA) or hold them ransom (e.g., T13/ATA, PCMCIA/PC Card).

Is the ThinkPad Active Protection System effective? Did you run any tests?

Jonathan Gray: The driver that will ship with 3.8 is largely a fancy toy. At one point I had it acting as an additional mouse, moving the cursor when you tilt the laptop; while this was a nice way of testing things, it is impractical to use, so the code was not committed. Other people have written userspace programs that do things like show the laptop orientation or lock the laptop if it is moved, and the sensorsd(8) daemon can react to changed sensor values with whatever command the user likes.

I have code that can be used by any driver to park the heads of all ATA disks attached to the system, and relevant changes in aps(4) to use this. This is the real reason for the sensor being present in the hardware: to park disk heads in the event of a fall. These changes will likely make their way into our next release.

I'm not in any great hurry to try drop-testing my recently purchased ThinkPad from any great height, as they aren't the cheapest things to replace ...

A new tool called stat has been included. It displays file status obtained from stat(2) or lstat(2). What can you do with it?

Otto Moerbeek: stat(1) is a tool to display the raw information on a file: its modification date, link count, owner, and more such things. Of course, there's this other well-known command, ls(1), that offers basically the same information. But ls(1) is mostly intended for human consumption, while stat(1) is nicer for scripting: it is capable of producing all available information in one call, using an output format that is directly digestible by shell scripts. And if you, like me, are working on userland tools, you sometimes need the raw information to check if a command like tar(1) is doing the right thing.
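
To give a feel for what "raw information in a script-friendly format" means, here is a rough Python analogue built on os.lstat(). It is not stat(1) and does not reproduce its options or exact output; the function name and the choice of fields are assumptions made for the example.

```python
import os


def stat_as_shell_vars(path):
    """Return selected lstat(2) fields as NAME=value lines, one per field."""
    st = os.lstat(path)
    fields = ("st_mode", "st_nlink", "st_uid", "st_gid", "st_size", "st_mtime")
    # int() drops the fractional part of the timestamp so every value is a
    # plain integer that a shell script can use directly.
    return "\n".join(f"{name}={int(getattr(st, name))}" for name in fields)
```

A script could read these lines and compare, say, the link count or owner of a file before and after running a command like tar(1).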

stat(1) comes from the NetBSD project. We did some rewriting to make it use safe string operations like strlcpy and snprintf. We do not want unsafe string operations in the base source tree.

What is the purpose of xidle?

Federico G. Schwindt: Back in 2002, xautolock was removed from the base X system due to license issues. The author didn't want to change it, so we were forced to take that direction. Shortly after, it was added back to the ports tree, but I missed its functionality in the base set. At the same time, I wanted to do some X programming, so xidle(1) was born.

Basically, it locks your X session when you're away. It also allows you to lock it by moving the mouse to a predefined corner. In reality it can run any program, but xlock(1) is the default, hence the term locking. And of course, it has a BSD license.

It seems you fixed several bugs in pax. This is an ancient tool, but you still find something to fix inside it. Were these pathname races and potential buffer handling problems a heritage of Berkeley's codebase?

Otto Moerbeek: Partly. Some of the fixes were done to fix bugs that were not in the original but introduced later. The pathname race fixes were to remove the unsafe open(), write(), close(), chown(), chmod() constructs and replace them with the safer open(), write(), fchown(), fchmod(), close() sequence. The first is a variation on the more well-known traditional symlink race, which happens between the test for existence and actual creation of a file.
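
The safer sequence can be illustrated with Python's thin wrappers around the same system calls. This is a sketch of the pattern, not the pax code: the function names are invented, and the use of O_EXCL is an extra assumption of the example. The key point is that ownership and mode are set through the open descriptor, so nobody can swap the path for something else between the write and the chown/chmod.

```python
import os
import stat
import tempfile


def create_file(path, data, mode, uid=-1, gid=-1):
    """open(), write(), fchown(), fchmod(), close() on a single descriptor."""
    # O_EXCL is an addition for this sketch: fail rather than reuse a path
    # that already exists. Start with restrictive permissions.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        view = memoryview(data)
        while view:                       # handle short writes
            written = os.write(fd, view)
            view = view[written:]
        os.fchown(fd, uid, gid)           # -1 leaves the respective id unchanged
        os.fchmod(fd, mode)               # applies to the file we opened
    finally:
        os.close(fd)


def demo():
    """Create a file in a scratch directory and report its content and mode."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "extracted")
        create_file(path, b"hello\n", 0o640)
        with open(path, "rb") as f:
            content = f.read()
        return content, stat.S_IMODE(os.stat(path).st_mode)
```

In the unsafe variant, chown() and chmod() take the pathname again after close(), which leaves a window in which the name can be made to point at a different file.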

Looking at the release page, I read, "libc(3) source code has been converted to ANSI C." What type of nonstandard code was cleaned? And by the way, what is the origin of OpenBSD's libc?

Otto Moerbeek: There are three C standards: the original from Kernighan and Ritchie from 1978 (K&R), the ANSI C standard from 1989, and C99, the most recent. A large part of the OpenBSD source tree was originally written in K&R C, but a lot of files have been converted to ANSI C in recent years. The C library (libc) was done for this release.

An important feature that ANSI C adds is prototypes. Prototypes are vital to catch "misunderstandings" between the caller of a function and the implementation. While this problem was partly solved by introducing prototypes in the header files, the actual implementation of libc was mostly written without them. The C library mostly originates from the 4.4BSD-Lite distribution, which was developed before ANSI C compilers were generally available.

Converting code to ANSI C is not only important to catch errors in argument handling. Having our code in a uniform format and style also helps in auditing; every program uses libc, and making it as bug-free as possible is very important. Auditing should not be hindered by having it in a nonstandard form, which might hurt the developer's eyes ;-)

What is the status of wide-character and locale support?

Marc Espie: Half the work is done. We stopped quite a bit before 3.8, in order to allow for reasonably comprehensive testing.

Right now:

I should add that implementing locale support looks a lot like a Chinese puzzle: you've got all these pieces that have to fit, and you have to find a correct building order, figuring out what you can have without destabilizing the whole. In the case at hand, this involved making sure that GNU-configure would not think we have full locale support yet, and also making some VERY intrusive changes to the libc (much to Theo's worries). But ... it works.

The utility gzsig lets you create and verify cryptographic signatures built into gzip file headers. I'm wondering if you plan to use it in the future to provide binary patches, or sign packages and release sets.

Marc Espie: No, I don't. Frankly, the concept of gzsig is something I stumbled upon a few years ago, and there was the beginning of a gzsig implementation in the old package tools. But it's gone from the new tools. gzsig has one major limitation: you have to download the whole archive in order to verify the signature. You can't do anything with a partial package, can't start scripts. So, the solution pkg_add is heading towards is signing the packing list, and making sure there's enough info in it so that you can check things one component at a time. Basically, you don't need a full sig for everything. If the packing list is signed, and it contains crypto hashes for all files, then it will work. And you have just-in-time signing, which is a very important feature.

Parts of it are working, parts are not finished, and I'm waiting to see what crypto hash we're going to use, since md5 and sha1 are basically out now.

It's just one feature among a lot of things happening in pkg_add. In 3.8, you'll notice the tools have become MORE sturdy. We've weeded out a lot of fringe cases, and the tools behave correctly in a lot more cases. And there's a new update option. In 3.8, it only helps you figure out what packages to replace. In -current, it's fully enabled, and it works.

We've got more checks. The packages have gone through extensive automatic consistency checks. And there's more planned. The quality is going up. Crypto signatures are just one component. The framework is designed to incorporate them when they're ready. But it won't be gzsig.

You worked on signal delivery, and improved signal handlers' reliability. What type of problems did you fix?

Otto Moerbeek: These fixes were typical teamwork. We had a known problem on sparc64 that sometimes cron(8) would not run after booting the machine. This had annoyed me for a long while, so at some point I decided to hunt the problem down. I changed rc to ktrace cron, and began a rebooting session to catch the error. After I had a ktrace of a failing cron startup, I saw that the problem occurred when a signal was delivered to a child process before the child returned from fork(). That enabled me to write a regress test that demonstrated the bug (called earlysig) without requiring a reboot, which put Mark Kettenis on track to actually analyze and fix the problem.

The macppc and i386 problems had to do with signal handlers using floating point. Matthieu Herrb had been debugging a problem with the X server on i386 without much progress, and at one point the hypothesis was that if floating-point computations were done inside a signal handler, the main program's floating-point registers could get trashed. Again I was able to write a regress test (fpsig) to show the problem, and further testing showed that it also occurred on macppc. Michael Shalayeff and Dale Rahn made the actual fixes for these bugs.

I read that realpath(3) is now thread-safe. Does this release include any improvements for SMP systems?

Niklas Hallqvist: There is often confusion between threading issues and MP issues. These are actually orthogonal, at least in userland. No matter how many (or few) actual execution units are available, if the execution model contains threads (lightweight execution contexts), the code executed needs to be "thread safe" in order to be correct. This is true even for uniprocessor systems. On the other hand, if you have an execution model that does not contain threads but rather just processes (i.e., execution contexts that do not share main memory resources), you don't need to have thread-safe code in order to execute your processes on many processors.
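
One common way a library routine ends up not being thread-safe is a shared static buffer, and that is independent of how many processors exist. The sketch below illustrates the pattern with hypothetical functions (label_static() and label_r()); it is not a description of what was changed in realpath(3).

```c
#include <stdio.h>
#include <string.h>

/* Not thread-safe: every caller shares one static buffer. */
const char *label_static(int id)
{
    static char buf[32];

    snprintf(buf, sizeof(buf), "item-%d", id);
    return buf;
}

/* Thread-safe variant: the caller supplies the storage. */
char *label_r(int id, char *buf, size_t len)
{
    snprintf(buf, len, "item-%d", id);
    return buf;
}
```

Two calls to label_static() return the same pointer, so a second thread calling it would overwrite the first thread's result even on a single CPU. With label_r(), each thread passes its own buffer and the results cannot interfere.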

In OpenBSD, so far, this orthogonality is even clearer, since we don't map threaded processes onto several processors. The scheduling entity is the process, not the thread. This may seem suboptimal, and it is, but it is the conservative approach. You don't need to make code thread-safe in order to take advantage of multiprocessing. Thus, performance comes at a low cost. I am not saying we are not going to provide thread libraries that will take advantage of SMP; I'm just saying that was not the primary target. My impression is that people have been happy that so few things broke when we implemented SMP. We are not here to provide users with the fanciest performance figures on earth; we are here because we personally want better performance without risking functionality.

As to the improvement part, yes, there have been bug fixes done, making SMP machines even more stable, but no new functionality has been added, as far as I can recall.

Quoting from the talk that Ted Unangst will give at EuroBSDCon: "The existing userland pthreads library in use by OpenBSD is hampered by poor performance, inability to utilize multiple CPUs, and unnecessary complexity. A replacement library, rthreads, utilizes a modified rfork() system call to create kernel threads. It is both simpler and more scalable than the library it replaces." Could you share some details about it?

Ted Unangst: First, rthreads is not included in 3.8; it's not clear when it will be incorporated into OpenBSD. rthreads started as an experiment to see how much effort would be involved in developing support for kernel-aware threads. It turns out that if you don't overcomplicate things, it's remarkably simple. Initially it seemed that we should support the M:N (or scheduler activation) model, because it was the "right way" to do things. After some more consideration, it became clear that you can get 95 percent of the way there with 1:1 threads, at about 20 percent of the complexity. Although rthreads is not finished, it currently provides a substantial portion of the pthreads API.

Could you talk about the new malloc(3) implementation and how it improves security?

Theo de Raadt: Traditionally, Unix malloc(3) has always just "extended the brk", which means extending the traditional Unix process data segment to allocate more memory. malloc(3) would simply extend the data segment, and then calve off little pieces to requesting callers as needed. It also remembered which pieces were which, so that free(3) could do its job.

The way this was always done in Unix has had a number of consequences, some of which we wanted to get rid of. In particular, malloc and free have not been able to provide strong protection against overflows or other corruption.
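
A toy model shows why the traditional layout offers so little protection. In this sketch a static array stands in for the process data segment and brk_offset for the break; toy_malloc() is a hypothetical bump allocator with none of the bookkeeping a real malloc(3) needs. The point is only that consecutive allocations sit directly next to each other.

```c
#include <stddef.h>
#include <string.h>

/* Toy model: this array stands in for the process data segment. */
static unsigned char segment[4096];
static size_t brk_offset;          /* models the current "break" */

/* Extend the break and hand out the new piece; no free(), no bookkeeping. */
void *toy_malloc(size_t size)
{
    void *p;

    if (size > sizeof(segment))
        return NULL;
    size = (size + 15) & ~(size_t)15;      /* round up to a multiple of 16 */
    if (size > sizeof(segment) - brk_offset)
        return NULL;
    p = segment + brk_offset;
    brk_offset += size;
    return p;
}
```

If a caller allocates two 16-byte pieces and then writes 20 bytes into the first, the last four bytes land at the start of the second piece. Nothing stops the write, and nothing detects it afterwards.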

Our malloc implementation is a lot more resistant (than Linux) to "heap overflows in the malloc arena", but we wanted to improve things even more.

Starting a few months ago, the following changes were made:

Other results:

To some of you, this will sound like what the Electric Fence toolkit used to be for. But these features are enabled by default. Electric Fence was also very slow. It took nearly three years to write these OpenBSD changes, since performance was a serious consideration. (Early versions caused a nearly 50 percent slowdown.)

Our changes have tremendous benefits, but until some bugs in external packages are found and fixed, there are some risks as well. Some software making incorrect assumptions will be running into these new security technologies.

We expect that our malloc will find more bugs in software, and this might hurt our user community in the short term. We know that what this new malloc is doing is perfectly legal but that realistically some open source software is of such low quality that it is just not ready for these things to happen.

We ask our users to help us uncover and fix more of these bugs in applications. Some will even be exploitable. Instead of saying that OpenBSD is busted in this regard, please realize that the software which is crashing is showing how shoddily it was written. Then help us fix it. For everyone ... not just OpenBSD users.

Do you plan to make other modifications to memory management functions?

Thierry Deval: malloc has been changed to use the mmap/munmap pair of memory-mapping functions, but the page accounting is still done by keeping a private list of page mappings. And page guarding is still done the old way, by mapping an extra page that gets mprotect(2)ed to forbid any access. We think that page guarding can simply be ensured by modifying mmap(2) so that it always returns nonadjacent memory regions, thus removing that responsibility from malloc(3). Moreover, keeping the list of page mappings is an expensive operation, and we should look at some way to improve it.

Federico Biancuzzi is a freelance interviewer. His interviews have appeared in publications such as ONLamp.com, LinuxDevCenter.com, SecurityFocus.com, NewsForge.com, Linux.com, TheRegister.co.uk, ArsTechnica.com, the Polish print magazine BSD Magazine, and the Italian print magazine Linux&C.


Copyright © 2009 O'Reilly Media, Inc.