BSD DevCenter


Puffy's Marathon: What's New in OpenBSD 4.2

CPU frequency and voltage can now be scaled on all CPUs when running GENERIC.MP on a multiprocessor i386 or AMD64 machine with Enhanced Speedstep or Powernow. How much power can we save when using the battery?

Gordon Willem Klok: Potentially a great deal of power. Before the release of 4.1 I disabled many of the hw.setperf methods (such as Enhanced SpeedStep and PowerNow) in multiprocessor kernels. This was necessary because, without SMP awareness, twiddling hw.setperf either manually or via apmd on your behalf was essentially playing Russian roulette: the transition would be attempted only on whichever processor sysctl or apmd happened to run on, and given the nature of SpeedStep and PowerNow in their current form, likely nothing would happen. So if you were using a multiprocessor kernel you had no opportunity to save power, whereas with these methods there is the potential to save a great deal of power, generate less heat, run quieter, etc.

I collected some decidedly unscientific results using my ThinkPad X60 with the battery removed, measuring the draw from the wall with a kilowatt meter. With no power management in use my laptop draws about 28 watts when idle and as much as 49 watts with both cores going full tilt. With hw.setperf set to zero (a core frequency of 1 GHz versus the full speed of 2 GHz), this laptop draws about 22 watts, saving about 6 watts or about 18 percent. What is even more interesting is that when going full tilt at 1 GHz, the peak draw is only 28 watts, the same as the idle draw at full speed. Translating this into increased runtime is tricky and I didn't have time to conduct proper tests, but if we assume that an 18 percent power saving translates into a similar increase in runtime, and assume a runtime of 5 hours (not unlikely for an X60) at full speed, you can buy yourself almost another hour by running at the lowest hw.setperf setting.
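The arithmetic behind that estimate can be laid out as a short C sketch. The wattage and runtime figures are the ones quoted above; the function names are made up for illustration, and the second runtime model (fixed battery energy) is an assumption added here, not something the interview states. Taken at face value, 6 watts out of 28 works out to a little over 21 percent, so the 18 percent used in the estimate is, if anything, on the cautious side. Wall draw with the battery removed is also only a rough proxy for battery drain.

```c
/* Figures quoted in the interview (ThinkPad X60, battery removed). */
#define IDLE_FULL_W  28.0   /* idle draw with no power management   */
#define IDLE_LOW_W   22.0   /* idle draw with hw.setperf set to 0   */
#define BASE_HOURS    5.0   /* assumed runtime at full speed        */

/* Power saved, as a percentage of the full-speed idle draw. */
static double
saving_percent(double full_w, double low_w)
{
	return (full_w - low_w) / full_w * 100.0;
}

/* Model used in the interview: runtime grows by the same percentage
 * as the power saving. */
static double
runtime_same_percent(double hours, double pct)
{
	return hours * (1.0 + pct / 100.0);
}

/* Alternative model (an assumption): the battery holds a fixed amount
 * of energy, so runtime scales inversely with the power drawn. */
static double
runtime_fixed_energy(double hours, double full_w, double low_w)
{
	return hours * full_w / low_w;
}
```

With the quoted 18 percent, the first model gives 5.9 hours, which matches "almost another hour"; the fixed-energy model applied to the raw wattages gives a somewhat larger figure.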

Did you update the hw.setperf sysctl too?

Gordon Willem Klok: It was not necessary to change the hw.setperf sysctl at this time to accomplish multiprocessor-aware frequency and voltage scaling. The hw.setperf code that first handles a request is fairly simple: it checks the arguments, discarding values that are out of range (less than zero or greater than 100), and calls a function pointer that points at the routine that actually performs the transitions.

What I did was add a function that, after the underlying hw.setperf mechanism has been set up, stores a pointer to that mechanism's routine and substitutes its own. When hw.setperf is adjusted, the mp_setperf mechanism executes the underlying mechanism on all processors in the system. Going forward, the mechanism will likely need to be altered or the design philosophy of hw.setperf changed; at the very least, AMD is moving to a model where every core in a system can be running at a different operating frequency, and this will require some rethinking.
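The pointer-substitution scheme described here can be sketched in plain C. This is a user-space illustration under stated assumptions, not the OpenBSD kernel code: every identifier below is hypothetical, the CPU count is fixed at two, and the loop that "visits" each CPU stands in for whatever cross-processor call a real kernel would use.

```c
#include <stddef.h>

#define NCPUS 2                        /* hypothetical CPU count */

/* Pointer the front end calls; NULL until a mechanism is set up. */
static void (*setperf_hook)(int);

/* Saved pointer to the single-CPU mechanism (e.g. a SpeedStep routine). */
static void (*underlying_setperf)(int);

static int current_cpu;                /* stand-in for "the CPU we run on" */
static int cpu_level[NCPUS];           /* level last applied on each CPU   */

/* Hypothetical single-CPU mechanism: only affects the CPU it runs on. */
static void
est_setperf_sketch(int level)
{
	cpu_level[current_cpu] = level;
}

/* Wrapper: run the saved mechanism once on every CPU. */
static void
mp_setperf_sketch(int level)
{
	for (int i = 0; i < NCPUS; i++) {
		current_cpu = i;       /* real code would do a cross-CPU call */
		underlying_setperf(level);
	}
	current_cpu = 0;
}

/* Called after the underlying mechanism has been set up:
 * remember its routine, then substitute the wrapper. */
static void
mp_setperf_init_sketch(void)
{
	if (setperf_hook == NULL)
		return;
	underlying_setperf = setperf_hook;
	setperf_hook = mp_setperf_sketch;
}

/* Front end: discard out-of-range values, then dispatch via the pointer. */
static int
setperf_request(int level)
{
	if (level < 0 || level > 100 || setperf_hook == NULL)
		return -1;
	setperf_hook(level);
	return 0;
}
```

Because the front end only ever calls through the pointer, neither it nor the single-CPU routine needs to change; the wrapper is the only multiprocessor-aware piece.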

How does apmd(8) manage CPU throttling on MP systems?

Gordon Willem Klok: As the hw.setperf mechanism retained the same semantics as far as the userland interface was concerned, Nikolay Sturm (sturm@) and a fellow by the name of Simon Effenberg only had to tweak apmd slightly to handle the MP case. Instead of having apmd consider only the idle time of a single processor, it looks at the average idle time of all the CPUs in a system and makes its transition decisions accordingly.
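A minimal sketch of that decision rule, assuming idle time is available as a percentage per CPU. The thresholds, step size, and function names are invented for illustration and are not apmd's actual values or code.

```c
/* Hypothetical policy knobs; not apmd's real thresholds. */
#define IDLE_LOW    20     /* below this average idle, speed up   */
#define IDLE_HIGH   80     /* above this average idle, slow down  */
#define PERF_STEP   50     /* change applied to hw.setperf        */

/* Mean idle percentage (0..100) across all CPUs. */
static int
average_idle(const int *idle_pct, int ncpu)
{
	int sum = 0;

	if (ncpu <= 0)
		return 0;
	for (int i = 0; i < ncpu; i++)
		sum += idle_pct[i];
	return sum / ncpu;
}

/* Pick the next hw.setperf value from the current one and the average. */
static int
next_perf(int perf, int avg_idle)
{
	if (avg_idle < IDLE_LOW)
		perf += PERF_STEP;
	else if (avg_idle > IDLE_HIGH)
		perf -= PERF_STEP;
	if (perf < 0)
		perf = 0;
	if (perf > 100)
		perf = 100;
	return perf;
}
```

With one CPU 10 percent idle and the other 90 percent idle, the average is 50 and the level stays put, whereas a daemon that looked only at the mostly idle CPU would step down while the other core was still busy.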

It seems that some aspects of sensorsd have been redesigned to be more user-friendly. Could you give us an overview of these changes?

Constantine A. Murenin: sensorsd was originally written when the sensors framework didn't support as many features as it does today. With OpenBSD 4.2, sensorsd is revamped to be more in touch with the recent and not-so-recent features of the framework.

For example, in sensorsd.conf(5) we now support matching by sensor type, so that a single rule can be written to apply to all temperature sensors (e.g., "temp:low=15C:high=65C").

People with server-grade hardware will be happy to know that sensorsd now requires zero configuration in order to report status changes of smart sensors, that is, those that automatically provide sensor status themselves, like IPMI or bio(4)-based sensors, as well as all timedelta sensors. All that's needed to enable sensorsd in such cases is to append sensorsd_flags="" to /etc/rc.conf.local.

Some improvements were made for monitoring consumer-grade sensors, too. For example, if you have an lm(4)-based Winbond chip that does fan-speed controlling, then previously you might have noticed that sensorsd was totally ignoring certain fanrpm sensors if they were marked as invalid at the time when sensorsd was started. (The reason they might have been marked as invalid in the first place is that physical sensors in certain fans don't produce valid readings if the voltage is too low, even if the fan itself is still spinning.) With the 4.2 sensorsd, if you ask it to monitor a sensor that is periodically marked as invalid, then it will be reported as such, and value-based monitoring of such a sensor will resume as soon as the invalid flag is reset by the driver.

Some other related features and cleanups went in along the way, including usability improvements in the log format, the ability to set manual boundaries for any kind of sensor, outstanding documentation updates, and overall polishing.

I saw that you retired the cats platform, removed support for 80386 processors in the i386 platform code, and at the same time you are adding support to additional models of hppa and alpha systems. I think these are niche platforms, so I am wondering how do you choose what is worth your time?

Bob Beck: People choose what is worth their time based on what they want to work on, and what improves the project.

We have active developers with hppa and alpha systems, and not to be forgotten, supporting these architectures not only renders easily available old hardware useful, but also helps us keep our code quality up in general. All the world not being a 32-bit Intel machine...

cats, on the other hand, isn't easily available, and nobody wants to support it. There are faster arm platforms we do support that we focus our attention on.

80386 is different: it's not a separate arch, but rather a level of support in the i386 arch. It was clear nobody had actually tested OpenBSD on an 80386 system in a number of years, and none of us were willing to do so. Given that a genuine 80386 probably wouldn't run for a lot of other reasons (memory, etc.), it didn't make sense to maintain lowest-common-denominator support for the 80386 when it gets in the way of doing other things in the kernel.

Artur Grabowski: By what people want to work on. As far as I know, cats was so annoying that people hated the machines and wanted them to die (which they apparently did; catching fire was apparently popular). The 386, meanwhile, was mostly broken anyway and had a lot of baggage cluttering code we were working on, and no one wanted to make sure it continued to work.

Sun is finally sharing some docs. Did you take advantage of them to support PCIe UltraSPARC IIIi machines like the V215 and V245?

Mark Kettenis: It's really great that Sun is releasing more documentation for their hardware now. This will benefit all open source operating systems running on Sun UltraSPARC hardware, but one look at their wiki makes it immediately clear that OpenBSD played a major role in getting Sun to publish these documents.

Unfortunately that documentation wasn't available when I wrote the pyro(4) driver that was needed to support the V215 and V245. Instead I had to read lots of OpenSolaris code, and fill in some blanks myself. Because I didn't have the documentation available, the work took longer than necessary, and some interesting hardware features are missing from the driver.

I hope to revisit pyro(4) soon now that the docs are out there. But lately I've been rather busy with another feature that will make some sparc64 users happy (and for which the currently released docs are also a big help).

What is the Advanced Host Controller Interface?

David Gwynne: It is a specification that describes the interface a SATA controller should present to the operating system. It is similar to the PCI IDE controller specification in that many different vendors may have different chips all presenting a common interface, which are all supported by the one driver in an operating system. AHCI is the same idea; it just supports a different class of device, namely SATA, while the PCI IDE specification only dealt with IDE devices or devices that worked in an IDE-compatible way. AHCI can be considered necessary since the SATA specification provides some advanced features (e.g., hotplug, bus expanders, and command queueing) that cannot fit into the existing PCI IDE interface.

There was a lot of work on the ahci(4) driver to get native support for some SATA controllers instead of going over pciide(4). What differences and advantages can we expect to see?

David Gwynne: Because the interface AHCI presents to the OS is so different to the one the PCI IDE specifies, it makes sense to have a separate and native driver for it. wdc(4), pciide(4) and ahci(4) can be considered equivalent because they provide the same functionality, namely taking commands from the operating system and putting them on the ATA devices that are hooked up to them. The difference between ahci(4) and the pciide(4) and wdc(4) drivers is where they get those commands from.

pciide(4) and wdc(4) both take their commands from wd(4) and atapiscsi(4), which are drivers that natively talk ATA commands. These four drivers are all that there is to support all the IDE controllers, and they're very tightly woven together. They were written a long time before SATA and some of its new features were ever considered, and because of this they lack the capability to support it.

On the other hand, some of the features that SATA offers sound an awful lot like what SCSI has had for years, and which our SCSI midlayer has been doing as a matter of course in that same time. Things like hotplug and command queueing are things that just work in SCSI land.

So I made the decision that rather than spending months refactoring the IDE code and potentially breaking support for everyone's IDE hardware (which is a lot of people), I would write a SCSI to ATA translation layer aptly called atascsi. It sits between the SCSI midlayer (which is basically the scsibus(4) device driver) and the ATA controller that uses it, and just turns SCSI commands into ATA commands. The rest of the semantics, such as command queueing and so on, are all handled by the existing infrastructure in the midlayer.
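To make the idea of "turning SCSI commands into ATA commands" concrete, here is a small C sketch that maps a SCSI READ(10) or WRITE(10) command block onto the matching 48-bit ATA DMA command. The opcodes and the CDB byte layout are the standard ones; the struct and function names are hypothetical, and this is not the atascsi source, which has to handle many more commands and details.

```c
#include <stdint.h>

/* Standard opcodes. */
#define SCSI_READ_10        0x28
#define SCSI_WRITE_10       0x2a
#define ATA_READ_DMA_EXT    0x25
#define ATA_WRITE_DMA_EXT   0x35

/* Hypothetical, simplified view of an ATA command. */
struct ata_cmd_sketch {
	uint8_t  command;       /* ATA command opcode        */
	uint64_t lba;           /* starting logical block    */
	uint16_t sector_count;  /* number of sectors to move */
};

/*
 * Translate a 10-byte SCSI CDB into an ATA command.
 * Returns 0 on success, -1 for anything this sketch does not handle.
 */
static int
scsi_to_ata_sketch(const uint8_t cdb[10], struct ata_cmd_sketch *out)
{
	uint16_t count;

	switch (cdb[0]) {
	case SCSI_READ_10:
		out->command = ATA_READ_DMA_EXT;
		break;
	case SCSI_WRITE_10:
		out->command = ATA_WRITE_DMA_EXT;
		break;
	default:
		return -1;
	}

	/* Bytes 7-8: transfer length in blocks, big-endian. */
	count = (uint16_t)((cdb[7] << 8) | cdb[8]);

	/*
	 * A zero length means "no data" in READ(10)/WRITE(10) but
	 * 65536 sectors in the ATA EXT commands, so it must not be
	 * passed through unchanged. The sketch simply rejects it.
	 */
	if (count == 0)
		return -1;

	/* Bytes 2-5: logical block address, big-endian. */
	out->lba = ((uint64_t)cdb[2] << 24) | ((uint64_t)cdb[3] << 16) |
	    ((uint64_t)cdb[4] << 8) | (uint64_t)cdb[5];
	out->sector_count = count;
	return 0;
}
```

Everything above the translation (queueing, device attachment, hotplug handling) stays in the SCSI midlayer, which is the point of the design described here.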

The other advantage of atascsi is that it can be reused on other ATA controllers. In this release both ahci(4) and sili(4) for Silicon Image 3124/3132/3531 controllers use atascsi. Also because of atascsi, all the devices on these controllers appear as SCSI devices, i.e., sd(4) will attach to disks instead of wd(4).

You have included FFS2. What features does it provide?

Otto Moerbeek: The two most important benefits FFS2 provides are support for large (greater than 1 TB) filesystems, and much, much quicker newfs(8) times. The code is mostly taken from FreeBSD with some parts from NetBSD. The on-disk layout is largely the same, but we did not test whether existing file systems can be interchanged with other BSDs. I know that NetBSD implements endian swapping for their filesystems, something we do not. So probably you'll see some differences there. Snapshots and background file system checks are something we have not implemented yet. Userland utilities that manipulate on-disk data structures directly, like dump(8), restore(8) and fsck_ffs(8), have been converted to understand the FFS2 format. Also, the disklabels are now capable of partitioning very large disks, up to 128 petabytes. Partitions can also be that large, in theory at least.

What is the default filesystem in OpenBSD 4.2 for a fresh installation?

Otto Moerbeek: FFS1 remains the default filesystem for the foreseeable future. There are a couple of reasons for that: an important reason is that for small filesystems, FFS2 does not provide any benefit. Also, the boot code for the various platforms is not yet capable of understanding FFS2. So if you want to use an FFS2 filesystem, you'll need to create it using newfs -O2.

To convert an existing filesystem to FFS2, you'll need to tar, newfs -O2, and untar. But remember that the boot media do not support FFS2 yet, so filesystems containing the base system should remain FFS1.

I read that "some parts of the system are not 64-bit disk block clean yet, so partition larger than 2TB cannot be used at the moment." Is there anything users could do to help you extend the support?

Otto Moerbeek: Obviously by testing disks up to 2TB. Note that FFS1 can be used also on large disks, as long as the filesystem size stays below 1TB. A little warning: 1TB filesystems take a lot of time and memory to run fsck_ffs(8) on. Large block and fragment sizes can help solve that, at the cost of some wasted disk space. To make really large filesystems work in practice, a solution to the huge time and memory requirements for filesystem checking has to be implemented.
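The size limits mentioned in these answers (1 TB, 2 TB, 128 petabytes) line up with simple powers of two. The C sketch below shows the arithmetic, assuming 512-byte sectors and sector counts that are 31, 32, and 48 bits wide respectively. Those widths are assumptions made here to reproduce the quoted figures; the interview does not state them, and the function name is hypothetical.

```c
#include <stdint.h>

#define SECTOR_SIZE 512ULL            /* assumed sector size in bytes */

/*
 * Largest addressable size in bytes for a sector count that is
 * `bits` wide. Valid for bits <= 54 so the result fits in 64 bits.
 */
static uint64_t
max_bytes(unsigned int bits)
{
	return ((uint64_t)1 << bits) * SECTOR_SIZE;
}

/* Binary unit sizes, for converting the results. */
#define TIB ((uint64_t)1 << 40)
#define PIB ((uint64_t)1 << 50)
```

Under these assumptions, `max_bytes(31) / TIB` is 1, `max_bytes(32) / TIB` is 2, and `max_bytes(48) / PIB` is 128, matching the FFS1 limit, the current 2TB ceiling, and the new disklabel limit in that order.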

How is the work on bio(4) going? It seems you have ported it to all the platforms!

Marco Peereboom: Bio(4) has been moving along. We have many more supported RAID cards. The one that is still glaringly missing is mpi(4). Dlg and I have both been telling each other "that we will do it soon" but neither of us has found time to do it.

Bio(4) is starting to show some limitations that we want to solve. Softraid(4) is pushing against some of them, like "creating disks," and the general consensus is that something needs to be done; however, what "it" is has not been determined at this time.

This release comes with softraid(4) enabled in GENERIC so people can test. What can users do to help you? Send dmesg? Run any particular tool?

Marco Peereboom: Testing is always appreciated. I have received some pretty darn good test reports in the past and have been able to fix those bugs, so keep them coming.

softraid(4) is not enabled in GENERIC despite popular belief. Theo and I agree that it needs to do more before we can move forward and enable it. A second complication is that not all architectures are "ready" to run with softraid(4). We accomplished a lot during the last hackathon in moving in that direction but some older arches will need some love from the likes of Miod before softraid(4) can be enabled.

Theo and Tom have been doing some necessary groundwork to enable booting of softraid(4). I can't stress enough the crazy diffs Theo has been committing in the disklabel stuff. Tom on the other hand has been doing bootloader work as well. This is still not completed but it does bring us closer to a true booting softraid(4) implementation.

The glaring missing feature at this time is rebuilds. This feature is still brewing in my head. Surprisingly this is one of the most complex problems to solve in the softraid(4) stack. It is inherently racy and I don't want it to stand in the way of normal operations. Sitting at the boot prompt for several hours while rebuilding is unacceptable. Also unacceptable is calling something a background rebuild while the machine is essentially rendered useless due to performance issues. I also found out that people are attached to their data so maintaining data integrity is also high on the list :-) I have some ideas on how to solve this problem but have not made any serious attempts at implementing them yet.

What's also still missing are some additional disciplines like RAID 0, RAID 5, Concat etc. Other ideas that are floating are adding AOE (hi tedu!) and maybe iSCSI disciplines but that is further out.

What is the plan for the basic support for crypto(9) backed RAID in softraid(4)?

Marco Peereboom: Currently crypto(9) support is disabled for various reasons, the biggest one being that we have not figured out how to do key management yet. Tedu and I have been floating some ideas along the lines of keeping the key on a separate disk. For example, one can keep the key in some metadata on a USB key, and softraid(4) will decrypt only while the USB key remains inserted in the machine. As soon as the USB key is pulled out, softraid(4) would shut down the disk and make it unavailable. The main idea here is that the key is not physically part of the machine softraid(4) is running on, and when separated both are useless. There are various hurdles to overcome that are being thought through.

Also problematic at this time is that the crypto thread is not running when softraid(4) is loaded. Obviously this causes hangs during boot time because the decrypting job never finishes. I have been talking to Theo on how to solve this problem and various scenarios are being explored...

Federico Biancuzzi is a freelance interviewer. His interviews have appeared in publications such as the Polish print magazine BSD Magazine and the Italian print magazine Linux&C.
