The day has come...
FreeBSD is back to its incredible performance and now can take advantage of multi-core/multi-CPU systems very well... so well that some benchmarks on both Intel and AMD systems showed release 7.0 being faster than Linux 2.6 when running PostgreSQL or MySQL.
Federico Biancuzzi interviewed two dozen developers to discuss all the cool details of FreeBSD 7.0: networking and SMP performance, SCTP support, the new IPSEC stack, virtualization, monitoring frameworks, ports, storage limits and a new journaling facility, what changed in the accounting file format, jemalloc, ULE, and more.
It seems network performance is much better in 7.0!
Andre Oppermann: In general it can be said that the FreeBSD 7.0 TCP stack is three to five times faster (either in increased speed where not maxed out, or reduced CPU usage). It is no problem to fill either 1 Gb/s or 10 Gb/s links to the max.
How did you reach these results?
Andre Oppermann: Careful analysis and lots of code profiling.
What type of improvements would TCP socket buffers auto-sizing offer? In which context would this feature show its best result?
Andre Oppermann: In a world of big files (video clips, entire DVDs, etc.), fast network connections (ADSL, VDSL, cable, fiber to the home), and global distribution, the traditional TCP default configuration was hitting the socket buffer limit. Because TCP offers reliable data transport, it has to keep a buffer of data sent until the remote end has acknowledged reception of it. It takes the ACK a full round trip (RTT, as seen with ping) to make it back. Thus for fast connections over large distances, like from the U.S.A. to Europe or Asia, we need large socket buffers to keep all the unacknowledged data around.

FreeBSD had a default 32 K send socket buffer. This supports a maximal transfer rate of only slightly more than 2 Mbit/s on a 100 ms RTT trans-continental link. Or at 200 ms, just above 1 Mbit/s. With TCP send buffer auto-scaling in its default settings, it supports 20 Mbit/s at 100 ms and 10 Mbit/s at 200 ms (socket buffer at 256 KB) per TCP connection. That's an improvement of a factor of 10, or 1000%. If you have very fast Internet connections very far apart you may want to further adjust the defaults upwards.

The nice thing about socket buffer auto-tuning is the conservation of kernel memory, which is in somewhat limited supply. The buffers are adjusted dynamically, based on actual measured connection parameters. For example, an SSH session on a 20 Mbit/s 100 ms link will not adjust upwards because the initial default parameters are completely sufficient and do not slow down the session. On the other hand, a 1 GB file transfer on the same connection will cause the tuning to kick in and quickly increase the socket buffers to the max.

The socket buffer auto-tuning was extensively tested on the European half of ftp.FreeBSD.ORG. From there I was able to download a full ISO image at close to 100 Mbit/s (my local connection speed) with auto-tuning. Before, it would only go up to around 30 Mbit/s.
A few more performance-relevant things I've changed/added to 7.0:
All this stuff accumulates quite a bit. ;-) And FreeBSD wasn't bad at all before. It just became even better than it was.
Other than that, I've done a lot of code overhaul and refactoring, primarily in tcp_input.c and tcp_output.c, to make it more readable and maintainable again. This work is still ongoing. However, it has already attracted increased interest from network researchers who have to modify the code for their experimental features. The cleanup makes it much more accessible again.
Direct dispatch of inbound network traffic. What is it?
Robert Watson: Direct dispatch is a performance enhancement for the network stack. In older versions of the BSD network stack, work is split over several threads when a packet is received:
Direct dispatch allows the ithread to perform full protocol processing through socket delivery. This results in significantly reduced latency by avoiding enqueue/dequeue and a context switch. It can also introduce new opportunities for parallelism: there's one ithread per device, so rather than a single thread doing all IP and TCP processing for input, it now happens in multiple device-specific threads. Finally, it eliminates a possible drop point -- when the "netisr queue" overflowed, we would drop packets -- now the queue doesn't exist, the drop point is pushed back into hardware. This means we don't do link layer processing for a packet unless we will also do IP layer processing, so when the system is under very high load, we don't waste CPU on packets that would otherwise be dropped later because TCP/IP can't keep up.
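The drop point Watson mentions can be illustrated with a toy fixed-size queue (a simplified sketch; the names and the queue depth are invented, not the kernel's actual netisr implementation):

```c
#include <stddef.h>

#define QLEN 4  /* toy queue depth; the real netisr queue was larger */

/* In the old model the ithread enqueues here and a netisr thread
 * dequeues later.  When the queue is full the packet is dropped,
 * even though link-layer work was already spent on it.  Direct
 * dispatch removes this queue (and its drop point) entirely. */
struct pkt_queue {
    int pkts[QLEN];
    size_t head, count;
};

/* returns 1 on success, 0 when the packet must be dropped */
static int netisr_style_enqueue(struct pkt_queue *q, int pkt)
{
    if (q->count == QLEN)
        return 0;                               /* overflow: drop */
    q->pkts[(q->head + q->count) % QLEN] = pkt;
    q->count++;
    return 1;
}
```

With direct dispatch, the equivalent of this enqueue is replaced by a direct function call into IP-layer processing, so no CPU is wasted on packets that would have been dropped here.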
Like all optimizations, it comes with some trade-offs, so you can disable it and restore netisr processing for the input path using a sysctl. However, for many workloads it can result in a significant performance improvement, especially where latency is an issue.
You added support for TSO (TCP/IP segmentation offload) and LRO (Large Receive Offload) hardware on gigabit and faster cards. Does this mean that when these features are active the hardware partially bypasses FreeBSD's TCP/IP stack? What about bugs in hardware?
Andre Oppermann: LRO was added by Kip Macy with Andrew Gallatin and Jack Vogel.
With TSO only a small part of the TCP stack is bypassed. It's the part where a large amount of data from a socket write is split up into network-MTU-sized packets. TSO can handle up to 64 KB sized writes. We give this large chunk and tell the network card to chop it up into smaller packets for the wire. All the headers are prepared by our TCP stack. The TSO hardware in the network card then only has to increment the TCP header fields for each packet sent until all are done. This process is rather straightforward. TSO is only used for standard bulk sending of data. All special cases, like retransmits and so on, are handled completely within our stack. Bugs in hardware can and do happen. We've done extensive testing and found a specific network card where we had to disable TSO because it wasn't correctly implemented.
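What the card does can be sketched as follows (an illustrative model, not driver code; the struct and function names are invented):

```c
#include <stdint.h>

/* One output segment: the sequence number the template TCP header is
 * bumped to, plus the payload length carried by this packet. */
struct tso_seg {
    uint32_t seq;
    uint32_t len;
};

/* Model of TSO: chop a large write (up to 64 KB) into MSS-sized
 * packets, incrementing the TCP sequence number for each one.
 * Returns the number of segments produced. */
static int tso_split(uint32_t start_seq, uint32_t payload_len,
                     uint32_t mss, struct tso_seg *out, int max_out)
{
    int n = 0;
    for (uint32_t off = 0; off < payload_len && n < max_out; n++) {
        uint32_t left = payload_len - off;
        out[n].seq = start_seq + off;    /* header field the card bumps */
        out[n].len = left < mss ? left : mss;
        off += out[n].len;
    }
    return n;
}
```

For a full 64 KB write with a typical Ethernet MSS of 1460 bytes this yields 45 wire packets from a single pass through the stack's output path.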
LRO is actually not implemented in the network card hardware but in the device driver. All modern gigabit and higher speed network cards batch up received packets and issue only one interrupt for them. The driver then sends them up our network stack. In many cases, especially in LAN environments, a large number of successive packets belong to the same connection. Instead of handing up each packet individually LRO will perform some checks on the packets (port and sequence numbers among others) and merge successive packets together into one. The TCP stack then sees it as one large packet instead of many small ones. The performance benefit is the reduced overhead for entering the TCP stack.
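The checks Oppermann describes might look like this (a hedged sketch; the field and function names are invented, and a real LRO path checks more, such as TCP flags and options):

```c
#include <stdint.h>

/* Minimal per-packet state a software LRO path would inspect. */
struct flow_pkt {
    uint32_t saddr, daddr;   /* IP source/destination addresses */
    uint16_t sport, dport;   /* TCP source/destination ports */
    uint32_t seq;            /* TCP sequence number */
    uint32_t len;            /* payload length */
};

/* Two received packets may be coalesced into one only if they belong
 * to the same connection and the second follows the first exactly
 * in TCP sequence space. */
static int lro_can_merge(const struct flow_pkt *head,
                         const struct flow_pkt *next)
{
    return head->saddr == next->saddr && head->daddr == next->daddr &&
           head->sport == next->sport && head->dport == next->dport &&
           head->seq + head->len == next->seq;
}
```

When the check passes, the payloads are appended to one packet and the stack is entered once instead of once per wire packet, which is where the savings come from.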
Could you tell us more about the new rapid spanning tree and link aggregation support?
Andrew Thompson: The FreeBSD bridge now supports the rapid spanning tree protocol which provides much faster spanning tree convergence. The new topology will be active in under a second in most cases compared to 30-60 seconds for legacy STP. This makes it an excellent choice for Ethernet redundancy and is the standard used by modern switches. Progress is being made to implement the VLAN-aware MST protocol extension.
The link aggregation support came from the trunk(4) framework on OpenBSD and was extended to include NetBSD's LACP/802.3ad (IEEE aggregation protocol standard) implementation. This framework allows for different operating modes to be selected including Failover, EtherChannel, and (of course) LACP, for the purpose of providing fault-tolerance and/or high-speed links.
Both are documented in the FreeBSD Handbook and are straightforward to set up.
What is the status of wireless support in FreeBSD 7.0?
Andrew Thompson: The wireless networking code has had a major update for 7.0. The most visible change is in scanning, which has been split out to support background scanning: the scan cache is updated during inactivity so the client can roam to the strongest AP. The scanning policies have also been modularized.
The new code has working 802.11n support, although no drivers have been released yet. Changes have also been made to allow future vap support, which gives multi-bss/multi-sta on supporting hardware; the vap work is ongoing and may be released this year. Benjamin Close added the new wpi(4) driver for the Intel 3945 wireless card, and the new USB drivers zyd(4) and rum(4) were ported over by Weongyo Jeong and Kevin Lo respectively.
The majority of the net80211 work was done by Sam Leffler with contributions from Kip Macy, Sepherosa Ziehau, Max Laier, Kevin Lo, myself, and others.
Beyond the big profiling work in the networking subsystem, you also added an implementation of the Stream Control Transmission Protocol (SCTP). Would you like to explain to us what it is and who could take advantage of it?
Randall Stewart: SCTP is a general-purpose transport protocol developed in the IETF. It is basically a "next-gen" TCP. It can be used almost anywhere you would use TCP, but there are some differences.
There are various tutorials you can find on SCTP. There is also an RFC (3286). These can give the interested person an "overview" of the protocol.
And an introduction with a really nice comparison of SCTP, TCP, and UDP can be found online.
Currently, if you are in Europe and you send an SMS message, you are using SCTP; or if you are in China and you make a phone call, you are using SCTP. SCTP is at the bottom of the "IP over SS7" stack known as sigtran. There are other places it is used as well; I know about some of them (not all) :-). For instance, it's a required part of the "IP-Fix" protocol, which is the standardized version of "reliable netflow." You may also see it used in some instances for web access. This is still in its early stages but it can provide some enormous benefits. The web server at www.sctp.org, for example, does both SCTP and TCP.
The University of Delaware's PEL lab (Protocol Engineering Lab) is doing some interesting work in pushing this forward; they have some very interesting videos showing the differences between TCP and SCTP. There is other information around as well on their main web site (for instance, patches for both Firefox and Apache).
Basically you can think of SCTP as a "super TCP" that adds a LOT of features that make it so applications can do "more" with less work. So why did we put it in FreeBSD? Well, let me turn the question around: why would you expect FreeBSD NOT to have the "next generation" version of TCP available in its stack?
I believe we are actually "first" to make it part of the shipping kernel. In Linux you can enable it as a module, but there are extra steps you must take. For FreeBSD it's just there, like TCP.
How does the new in-kernel Just-In-Time compiler for Berkeley Packet Filter programs work?
Jung-uk Kim: Berkeley Packet Filter (BPF) is a simple filter machine (src/sys/net/bpf_filter.c), which executes a filter program. The in-kernel BPF JIT compiler turns this filter program into a series of native machine instructions when the filter program is loaded. Then, instead of emulating the filter machine, the pre-compiled code is executed to evaluate each packet. In layman's terms:
JVM : Java JIT compiler ~= BPF : BPF JIT compiler
Please see bpf(4) and bpf(9) for more information.
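To make the analogy concrete, here is a toy interpreter for a three-opcode subset of the BPF filter machine (illustrative only; the real instruction set is documented in bpf(4)). The JIT compiler's job is to replace exactly this kind of dispatch loop with equivalent native instructions:

```c
#include <stddef.h>
#include <stdint.h>

enum bpf_op { LD_HALF, JEQ, RET };   /* tiny subset of classic BPF */

struct bpf_insn_toy {
    enum bpf_op op;
    uint32_t    k;        /* constant operand */
    uint8_t     jt, jf;   /* relative jump offsets for true/false */
};

/* Interpret a filter program against a packet; the return value is
 * the number of bytes to accept (0 means drop the packet). */
static uint32_t bpf_run(const struct bpf_insn_toy *prog,
                        const uint8_t *pkt, size_t len)
{
    uint32_t acc = 0;
    for (size_t pc = 0;; pc++) {
        const struct bpf_insn_toy *i = &prog[pc];
        switch (i->op) {
        case LD_HALF:                 /* load big-endian u16 at offset k */
            if (i->k + 2 > len)
                return 0;
            acc = (uint32_t)pkt[i->k] << 8 | pkt[i->k + 1];
            break;
        case JEQ:                     /* conditional relative jump */
            pc += (acc == i->k) ? i->jt : i->jf;
            break;
        case RET:
            return i->k;
        }
    }
}
```

A filter equivalent to tcpdump's `ip` loads the Ethernet type field at offset 12 and compares it with 0x0800; a JIT would emit one native load, compare, and branch instead of looping over these structs for every packet.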
The BPF JIT compiler for i386 was ported from WinPcap 3.1 from the NetGroup at Politecnico di Torino. Then amd64 support was added by me. This feature first showed up in WinPcap 3.0 alpha 2, according to the change log.
Could you tell us more about the migration from KAME IPsec to Fast IPsec?
Bjoern A. Zeeb: In November 2005 KAME announced "mission completed" on their excellent, highly appreciated contributions to the FreeBSD Project, developing and deploying an IPv6/IPsec reference implementation. As a consequence maintainership of their code was handed over to the different BSD projects.
FreeBSD already had a second IPsec implementation done by Sam Leffler which was derived from KAME and OpenBSD work but is fully SMP safe. That means it can exploit multiple cores/CPUs of modern hardware to improve performance. Fast IPsec also uses the crypto(4) framework supporting crypto accelerator cards and supports features like the virtual enc(4) interface permitting filtering of IPsec traffic.
George Neville-Neil (you can find his BSDTalk on IPsec online) implemented and committed IPv6 support for Fast IPsec, which was the last missing key feature.
With FreeBSD 7 and onwards, Fast IPsec, now simply called IPSEC, is the only implementation supported. To use IPsec, people will need to update their kernel configuration files according to the information given in the ipsec(4) manual page.
With the current implementation FreeBSD is well positioned, and we are looking forward to integrating more contributed work to implement new standards, to further improve scaling, etc. during the next months.
Platforms & Management
How does FreeBSD 7.0 handle virtualization? Would you still suggest using Jail, or is virtualizing the whole OS better?
Alexander Leidinger: Work is underway to get FreeBSD running as a guest in XEN (will not be part of 7.0, but if it is ready it could be part of 7.1 or 7.2 or 7.x).
Running FreeBSD in VMware/QEMU/... is possible (not only with 7.0).
Using a jail or virtualization is not an either-or question. It depends on your usage scenario. So yes, we (still) suggest using jails, but there are also valid uses for virtualization (be it running FreeBSD in VMware/QEMU/... or in hardware-assisted Xen).
How do you handle power management?
Nate Lawson: ACPI in FreeBSD 7.0 has been mostly an incremental improvement over 6.x. We've updated the ACPI-CA layer to fix various compatibility problems with non-standard BIOS. Cpufreq support has been available since 6.x but it is now enabled by default. The automatic management of CPU frequency (through powerd(8)) is also the same as 6.x. Internal work now allows rc.suspend and rc.resume to run to completion before the kernel suspends, similar to apmd(8). Most of the work has been on trying to fix bugs and increase stability.
You can see the cpufreq(4) man page for the list of SpeedStep and other features that are supported. 6.x and 7.x are equivalent, I believe.
Why have you added FireWire support to the boot loader?
Hidetoshi Shimokawa: FireWire support in the boot loader is limited to dcons(4), which is for kernel debugging. Unfortunately, it supports neither sbp(4) (SCSI over FireWire) nor fwip(4) (IP over FireWire) for booting FreeBSD over them.
dcons(4) exploits the remote DMA feature of FireWire OHCI (fwohci(4)) and provides a very convenient way to do kernel debugging compared with a legacy serial console.
As kernel developers or system administrators, we sometimes need to interact with the loader or kernel at a very early stage of the boot process over a serial console or dcons(4). This is why I added FireWire support to loader(8).
If you would like to try kernel debugging using dcons(4), see the wiki page.
It is very useful for laptops without serial ports. dcons(4) support in the loader is limited to the i386 and amd64 platforms.
What difficulties did you have to overcome to support legacy-free hardware (e.g. MacBook Pro)?
Rui Paulo: Legacy-free means running an operating system using the new "standards", i.e. the old keyboard controller is gone, BIOS is replaced by EFI, ACPI is now required and all basic peripherals are now USB.
First of all, FreeBSD is not yet legacy-free on the amd64/i386 architectures. We still run on MacBooks using the BIOS. There's preliminary support for EFI on i386. We still have one or two problems left regarding the USB keyboard and the touchpad, but I'm confident we'll fix those in time. We had problems in the beginning regarding how the hardware in the MacBooks is initialized by the EFI, but I think we have overcome those by now.
For testing, I used my first gen MacBook and asked on the mailing lists for more testers.
There's a wiki page about this project that I try to keep as up-to-date as possible.
Hi-def audio?! Did I hear correctly? :)
Alexander Leidinger: This is not only new in 7.0; it should also be available in 6.3 (also true for snd_emu10kx, snd_envy24, and snd_envy24ht). The new thing about sound in 7.0 is, I think, the internal improvements in the sound infrastructure. There was a lot of work from Ariff to get lower latency and fewer audible hiccups. He also prepared the way for multi-channel audio. Multi-channel audio is not there yet, but the sound infrastructure is improved in a way that makes implementing multi-channel audio less work now.
A lot of people use FreeBSD as a server, so I think they will be very happy to hear that this release supports IPMI (Intelligent Platform Management Interface) and monitoring of system hardware. Would you like to tell us what features and hardware are supported?
Doug Ambrisko: My employer IronPort Systems needed a way to leverage industry-standard hardware reporting. Our servers had IPMI support. I looked at some existing projects like IPMItool, FreeIPMI, Doug White's IPMI work, and OpenBSD's. IPMI stands for Intelligent Platform Management Interface. In simple terms it provides access to a microcontroller that can read temperature values and fan speed, report hardware events, log other system events including events from the OS, control power, and, in version 2.0, do SOL via an industry standard. SOL is Serial over LAN. With this, an IPMI-compliant tool like IPMItool can communicate with the IPMI controller. The IPMI controller can be configured to be connected to an onboard serial port. Then you effectively have a built-in terminal server.
So coming back to the start: I wanted to support IPMI and leverage other people's work as much as possible. There were a lot of people working on IPMItool, and if you look around, a lot of system vendors use it. It was modular, with various interfaces. It didn't have a direct I/O interface, but I think that is a good thing. FreeIPMI was a little hard to use at the time and didn't have some of the features that IPMItool had. OpenBSD had a built-in interface which IMHO wouldn't leverage other people's work as much as IPMItool or FreeIPMI.
One thing I've learnt when trying to get something onto FreeBSD is to look to Linux and leverage their work. So looking at IPMItool, I figured out the minimum that I needed to do to provide an OpenIPMI-compatible interface. It looked fairly straightforward, so I worked at getting something to work. Having the kernel control access to the HW is important, since you have to pass messages through a few I/O or memory addresses. If someone else starts banging on the same resources then you have corrupted messages. So the kernel ensures only one thing is talking to the IPMI controller at a time. I think that in theory we could multiplex talking via the optional IPMB device. Once I got an OpenIPMI-like interface working, we just had to teach IPMItool and FreeIPMI how to detect that the driver is there.
The next part was to add the various methods of talking to the IPMI controller. They can be KCS, SMIC, BT, and SSIF. KCS is the most popular. We support KCS, SMIC, and SSIF. Then there are the various methods that tell us how to talk to the IPMI controller. This can be described via SMBIOS (which is related to DMI). Other methods include PCI attachment, ACPI, and now a hints file. I ran into a server that had an IPMI controller but didn't provide any locating services, so we can manually specify it in that case. Having all of the various attachments can make attachment tricky, since the controller could be owned by ISA, ACPI, or PCI, or be reached through SMBus (which is the onboard i2c interface). John Baldwin helped a lot in refactoring code, cleaning up some of my prototype hacks, and adding his own. He got SSIF up and running. So Yahoo! helped a lot.
Once I got the base driver working, I started adding more features, such as tying it into the watchdog kernel interface.
Fooling around with things is interesting. I made a simple port proxy program so that I could attach GDB to a SOL session talking to the kernel gdb. Then I could do real remote debugging of a machine that was a few floors down and only reachable via the network.
Anything that is added to IPMItool or FreeIPMI we can use. There are some limitations, though, since I haven't implemented the event notification service. Still, a lot of useful work is done by others that we can leverage by making a relatively simple driver.
Another advantage of the OpenIPMI-like driver is that HW vendors can re-compile their binary tools to run on FreeBSD. I really should create a Linux ioctl shim, but I haven't needed to do that. It would be really simple, but I haven't got around to it.
There are some things that are in the wings, such as IPMB support and sending panic information to the IPMI SEL (System Event Log). I also have some code to dump an abbreviated kernel backtrace into it. This can be useful if the disk dies and you can't core dump to a disk and are not watching the console for panic messages. Although these are nice-to-have features, they aren't totally done yet, but sample code could be made available to others to polish off.
Again, thanks to IronPort Systems and Yahoo! for working on the FreeBSD drivers. Obviously, we thank the people who work on IPMItool and FreeIPMI. Lastly, we can't forget the people who got the OpenIPMI driver into Linux, which we could emulate and which made the other stuff just work on FreeBSD!
This release introduces a new platform: ARM. Not everyone has one of these boxes to play with, so could you tell us a bit of the features of the hardware and how FreeBSD can take advantage of them?
Olivier Houchard: Just to nitpick, technically the ARM port first appeared in RELENG_6 :) ARM is neat in that it's a simple and easy-to-understand processor. Unlike i386, each ARM CPU is very different, so supporting a new CPU is a bit of work, mainly device driver writing. It generally comes as a SoC with a lot of devices embedded: you may find a PCI bus, a USB bus, an Ethernet adapter, etc. Most of them are specialized, and I'm mostly interested in network and I/O CPUs, because that's an area where FreeBSD can shine. As for the performance work on 7.0, it had an impact on ARM, but not as big as on other platforms, mainly because that work was essentially directed at improving SMP performance. There is SMP ARM hardware, but I wasn't able to get any yet :)

I started the ARM port for no good reason :). I was looking for a "big" FreeBSD-related project, to learn more about FreeBSD internals, and adding support for a new platform seemed like a good choice at the time. I just picked ARM because it was the last big architecture nobody was working on with affordable hardware, and because NetBSD already supported it, so it would make the porting work easier.
Warner Losh: The ARM platform actually spans a wide range of potential devices. There are fast ARM processors that are used in things like SAN devices, network processing engines and the like. There are slower ARM devices that sip the power so are useful for applications where power usage is at a premium. These include battery powered devices as well as control systems where heat needs to be kept to a minimum. Given this wide range of devices, it is hard to comment more specifically.
I can comment on the ARM-based product that I helped bring to market. This product was the control processor for an advanced, flexible timing platform. This timing platform decoded signals from GPS and produced timing clocks and pulses that were aligned to the official UTC time standard. The processor booted off an SD card like you'd find in any digital camera today. We stored configuration information about the cards that were plugged in, as well as user settings, in I2C EEPROMs. We interfaced to the GPIO pins of the processor to detect cards arriving and departing, as well as for more mundane tasks like lighting LEDs. This project was based on the Atmel AT91RM9200 processor. Although it runs at only 180 MHz, we found FreeBSD to be quite responsive for the needs that we had. Our entire control program, complete with SNMP management interface, typically only used about 10-15% of the CPU.
Other ARM-based devices, from vendors such as Intel, Broadcom, Marvell, Samsung, and Cirrus Logic, are also supported, either by the mainline code or by code that's in the experimental part of FreeBSD, on the glide path to being committed. These devices are used in everything from handheld electronic devices to network engines that act as access points. Network storage solutions are also available.
The performance profile of these systems is similar between 6.x and 7.0. The major improvements in SMP scalability do not seem to have had an impact on the low-end, single-processor devices. These embedded devices have enough memory and CPU cycles to run well under FreeBSD.
The reason that I worked on the ARM port (along with many others, including Olivier Houchard, Bruce Simpson, Sam Leffler, Bernd Walter, and a few my feeble memory can't recall at the moment) is twofold. First, my company needed a UNIX-like operating system, free from the GPL, that ran on a low-powered device. The ARM AT91RM9200 from Atmel was selected because it offered all the I/O pins and serial buses that were necessary to interface with the rest of the timing engine (done in an FPGA). Second, for a while I've been hearing from people that they needed a lower-power, lower-cost solution that runs FreeBSD than even the very nice low-end Soekris boxes. I felt that by working on different embedded platforms I would help people do that and increase FreeBSD's reach into the embedded world.
I have recently changed employment. My main job is to spearhead efforts to bring FreeBSD to the MIPS and PowerPC world. Working with the community and other companies in this space, our aim is to make FreeBSD on these platforms robust, scalable, and secure. While this work isn't going to be in 7.0, FreeBSD's reach into the embedded world will continue. The increasingly aggressive enforcement of the GPL by the Software Freedom Law Center and others has created a demand for software with a license that's easier to abide by. In addition, FreeBSD 7.0's scaling work helps it to be competitive in the higher end embedded space where everything is moving to multi-core designs.
Is FreeBSD ready for 64bit systems? Which version (amd64 or i386) would you suggest for x86-64 systems?
Ken Smith: The baseline system has been 64 bit ready for a long time now. But as we all know people who only care about the baseline system and nothing else are very rare—virtually everybody wants to run stuff on top of the baseline system. Until x86-64 systems became mainstream lots of the application software had 64-bit related bugs in it (or flat out wouldn't compile) but that has been changing now. It's gotten to the point this question isn't as simple to answer as it had been before. If you are running applications on servers my advice remains the same as it has been for a while now—give your set of applications a try on amd64 and if they seem to work consider going that way.
Several of the servers I run for work are running amd64. For desktop workstation type stuff I used to strictly recommend sticking to i386 because so much of the stuff typically used in that environment was either unavailable or buggy in 64-bit mode. I still recommend you stick to i386 if you're particularly conservative and/or tend to use lots of "multimedia/entertainment" type stuff (especially browser plug-ins, sound codecs, etc). But using amd64 has definitely become a viable alternative; if you're adventurous, give it a try. My primary workstation at work is running amd64. So far the only glitch I tripped across was with the ancient thing I use to read usenet news with (xrn, don't ask why...) and I was actually able to fix it with about 15 minutes of digging around (yay Open Source). I was even able to get the backup package we use to work despite it being an i386 binary targeted at an older release (amd64 comes with 32-bit libraries that can be used to support i386 binaries, including the older sets of libraries available in the "compat" ports).
Mark Linimon: The state of the ports on amd64 does lag behind i386 a bit. Often this is due to the authors of the original software making assumptions such as that an int can hold a pointer. If we are able to, we create patches and try to get the authors to accept them. This has become much easier over the past 2 years as more and more people are running in 64-bit mode. During this time we have seen a shift from our user base being 95% i386 and 5% amd64 to something more like 85% i386 and 15% amd64. (These figures are guesstimates, based on statistics gleaned from the problem report database.)
Work is ongoing to alert port maintainers about problems on amd64 and create HTML reports showing the differences between the various build environments.
In certain cases drivers are not available for amd64 (in particular, third-party video drivers). Therefore, the suitability for 64-bit mode may depend on your application (e.g., if you are using FreeBSD as a workstation).
What can you tell us about the ports system and the ports collection?
Mark Linimon: 7.0 will ship with nearly 18,000 different ports. (The version of the ports tree that will ship with 7.0 is frozen except for security updates; the current version of the tree already has more than 18,000).
Binary packages are being built for the following build environments: amd64-5, amd64-6, amd64-7, amd64-8, i386-5, i386-6, i386-7, i386-8, sparc64-6, and sparc64-7. (We are limited in how many sparc packages we can build by the amount of hardware we have available). By "-8" we mean "freebsd-current" here (e.g., what at some point will become 8.0). Within a few months, however, FreeBSD 5.X will no longer be supported by the security team, and package builds will stop at that time. By that time we will be recommending that everyone not using FreeBSD in some kind of embedded application should have already moved to 6.2, 6.3, or 7.0.
Ongoing work is being done to monitor the quality of the package builds and provide feedback to maintainers and committers to try to get problems resolved as quickly as possible.
This release includes gcc 4.2. Why did you choose to upgrade? What changes should users expect to see?
Mark Linimon: The 7.X branch of FreeBSD will have a support lifetime of several years. Towards the end of that, the gcc developers will have almost assuredly dropped support for the older versions. This seemed the best version for us to be using during that timespan.
gcc 4.2 is more strict about the code that it accepts. Because of this, we have had to modify a number of ports and send the patches upstream. In a few cases (e.g., where the port is no longer being actively developed) we chose instead to add a dependency on an older gcc version from ports. We prefer to avoid this whenever possible.
Which X Window systems do you support? And which Window Manager (Gnome? KDE 3/4)? What about 3D desktops and 3D acceleration?
Mark Linimon: At the moment we have XFree86 and xorg; however, due to lack of interest, we intend to drop XFree86 once 7.0 is released. There are a variety of window managers, the most well-known being Gnome and KDE. At the moment we only have KDE 3 in the ports collection; active work is being done by the KDE team to do all the necessary upgrades for KDE 4.
There are a few popular applications that people could be interested in using, for example Skype and Flash. Is this possible at the moment?
Alexander Leidinger: Both depend upon the linuxulator. Currently the default Linux kernel emulation is 2.4-based. In 7.0 there are a lot of improvements to the linuxulator, so that we are able to emulate parts of a Linux 2.6 kernel, but there are known bugs and some missing features, so it is not enabled by default. There may even be a bug which prevents some programs (some games and maybe some other programs) from running in the default 2.4 emulation mode; you will need to check the errata page of 7.0 when it is out to see whether the linuxulator bug is listed there (in case we were too late to get the fix into 7.0 in time for the release). That being said: there are a lot of people using Skype even with the bug still present in the kernel. Personally, I use the 2.6 emulation with acroread just fine.
Flash is a different kind of beast. It's possible to use Flash 7 (install nspluginwrapper and follow the instructions in the message which is displayed after the installation; it works fine for me). For Flash 9 we need the 2.6 emulation, but unlike acroread, Flash 9 seems to demand some 2.6 features which are not stable enough yet (apart from that, Flash 9 itself also doesn't seem really stable on linux, so the problems accumulate). Bottom line: Flash 7 works, but is not used that much on high profile websites anymore; Flash 8/9 is used on more and more high profile websites, but doesn't work stably yet.
Some readers might not know that FreeBSD can run Linux apps using a Linux ABI compatibility layer, called Linuxulator. What is the situation in 7.0?
Alexander Leidinger: The default is 2.4.2 emulation. The target is 2.6.16 emulation. There are some known problems with 2.6.16, so it is not the default yet. A lot of compatibility problems are fixed (bugfixes and new stuff), even in 2.4.2 emulation. Several of the bugfixes will also be in 6.3, but not the 2.6.16 parts.
We didn't do performance tests, so I don't know about performance improvements specific to the linuxulator, but the performance improvements for FreeBSD itself surely will improve the corresponding linuxulator parts.
There is a wiki with development info, but this is mainly for the 2.6.16 part. Some bugs listed there are also in 2.4.2, but they have been there more or less since 3.x. The big subpage with the colored OK/failed test results doesn't show the severity of the failed tests, so just because there's a red marker, it doesn't mean it is a big problem (or a problem at all) in FreeBSD. A lot of tests...
What limits does FreeBSD 7.0 have when dealing with storage?
Pawel Jakub Dawidek: FreeBSD 7.0 is really good at working with large file systems. UFS2 is a 64-bit file system, so it should be enough for anyone. The only problem is fsck, which can take many hours to complete for really large UFS file systems. This is of course where gjournal comes in. FreeBSD 7.0 also has support for Sun's revolutionary file system called ZFS, which makes FreeBSD a great choice as a file server. I could talk about how cool ZFS is for hours, so I'll just stop here :) I'm not sure about the FAT32 file system; I use it rarely and only for small file systems. I also suggest not going back to UFS1.
Craig Rodrigues: I am not an MSDOS-FS expert, but I have committed some fixes in this area. In FreeBSD 7, it should be possible to mount a 500 GB disk by passing "-o large" to mount, i.e. "mount -t msdosfs -o large".
Is this the first FreeBSD release that includes a journaling facility (gjournal)?
Pawel Jakub Dawidek: Yes, FreeBSD 7.0 will be the first FreeBSD release with gjournal support. I was hoping to include gjournal in FreeBSD 6.3, but unfortunately I ran out of time.
gjournal is not a separate filesystem, and it actually works below the file system layer, so I am wondering what performance it provides and in which context it should be used?
Pawel Jakub Dawidek: You are right, gjournal offers block level journaling and is file system independent. You can use gjournal without any file system on top of it, and with a really small amount of work you can use it with any file system FreeBSD has. Currently only UFS support is implemented. gjournal is just another GEOM class, which allows data to be written in transactions. In the case of UFS, we start a transaction, modify the file system, and close the transaction by synchronizing the file system. Every few seconds (5 seconds by default) gjournal closes the transaction and starts copying the changes from the journal to the destination provider in the background. In the meantime a new transaction is in progress. This allows really fast recovery from a power failure or a system crash.
Because gjournal operates below the file system layer it cannot recognize whether a given write request contains data or metadata, so it just journals everything. This of course has a performance impact, so I did some optimizations to mitigate it. gjournal does some work to optimize written data: for example, it tries to combine smaller requests into larger ones to minimize the number of I/O requests sent to disk, and it also sorts the requests to avoid head seeking as much as possible.
All this makes gjournal performance really interesting. A single stream of writes runs about half as fast as UFS without gjournal, because there is not much to optimize. On the other hand, many processes running in parallel and doing small writes can work even twice as fast as UFS without gjournal (my test was to untar the FreeBSD source tree in eight processes in parallel).
unionfs has been fixed. What was the problem?
Daichi GOTO: There were several known problems in the unionfs implementation of FreeBSD up until 6.2-RELEASE. The specification is ambiguous and its locking implementation was buggy. Because of these issues, mounting unionfs with a cd9660 file system as the lower layer had caused problems.
Unionfs makes it possible to mount one file system on top of another. For example, you can mount a memory file system on top of a CD-ROM. As a result, it looks as if you could write to files on the CD-ROM.
Changes are only made to the upper file system layer and no changes are made to the lower one. Therefore, you can use it to keep modifications without changing the lower layers. For a more detailed explanation have a look at Section 6.7 on page 256 of "The Design and Implementation of the FreeBSD Operating System" by Marshall Kirk McKusick and George V. Neville-Neil.
We made a new unionfs for FreeBSD. The most valuable code of our new implementation has already been merged into FreeBSD 8, FreeBSD 7.0, and FreeBSD 6.3.
Solving the "ambiguous specification problems" involves discussions about what the appropriate behaviour is. Because the specification of unionfs is ambiguous about its behavior, it is difficult to implement appropriately. Therefore, I have proposed different options for different situations. The new implementation includes an option that allows unionfs to change its behavior in three ways: [traditional mode], [transparent mode], and [masquerade mode]. [transparent mode] seems to be the most reasonable default behavior. It fixes most of the problems in the original implementation.
Why did we rewrite from scratch? The original (old) unionfs implementation of FreeBSD up until 6.2-RELEASE could deadlock easily in many scenarios. Fixing it was harder than rewriting it ;)
What's new in FreeBSD 7.0 from a security standpoint?
Robert Watson: While security auditing was available as an experimental feature in FreeBSD 6.2, it is significantly enhanced in FreeBSD 7.0. The most important change is that it is now available out-of-the-box without a kernel recompile. Administrators can turn it on with a simple rc.conf entry and start the audit daemon (or reboot). There are also a number of other improvements, such as an XML printing mode, which allows praudit(8) to generate an XML version of a trail, and improved support for auditing Linux-emulated processes, which make Audit a more accessible and usable service. Some of these improvements, including XML printing, will also appear in FreeBSD 6.3.
The priv(9) work is quite exciting, but for most users won't make an immediate difference in system behavior in 7.0. This work classified all kernel privilege checks into a set of specific privileges (around 200 of them), and introduced new kernel interfaces to check for them. While the base system doesn't yet make use of this, third party TrustedBSD MAC Framework security modules, such as SEBSD and mac_privs, can now modify the operating system privilege policy, granting extra privileges or restricting them. This work is also the foundation for a great deal of future work, such as the ability to grant specific privileges to specific users, or limit or expand the set of privileges available in a Jail. I hope to see features like this begin to appear in 7.1, and really take flight in the 8.x release series.
How did you change the accounting file format?
Diomidis Spinellis: The accounting facility of FreeBSD stores a record for each process that terminates. This record includes the name of the command, its user, system, and elapsed time, as well as the user and group id under which it was executed. I revised the accounting record format to store time values with microsecond precision. Historically, the time values were stored in a bespoke 13 bit fraction floating point format. The smallest time that could be stored in that format was fifteen milliseconds. With modern GHz processors the vast majority of processes execute in less than a millisecond, and therefore their accounted time values were recorded as zero.
For the new file format I adopted the IEEE 754 "float" format for storing time and usage values. For performance reasons, we don't use any floating point arithmetic in the kernel. Therefore, I wrote bit twiddling code in C that compresses the time values stored in the kernel structures into floating point numbers. Adopting the IEEE floating point format greatly increases the range and precision of the numbers, and also simplifies the processing of accounting records by third party tools. In the past, processing the accounting records meant decoding those strange 13-bit floating point numbers. Now you can just read the data into a plain C floating point variable and work with that.
Despite the many changes, the new record format and the tools for examining the last commands and for summarizing the accounting data (lastcomm and sa) maintain backwards compatibility with the original accounting format. The new records are also versioned, which means that future improvements can be gracefully integrated.
Performance & Concurrency
What features does the new performance measurement framework provide?
Joseph Koshy: First, permit me to offer a minor clarification: HWPMC(4), LIBPMC(3) and PMCSTAT(8) are not new in 7.0. They were first added to the tree before FreeBSD 6 was branched and have been under development since (work is by no means finished).
Profiling of dynamically loaded objects is present in 7.0 (i.e., shared libraries and dlopen()ed objects in userland, and of course kernel modules). I should also mention the bug fixes :).
HWPMC(4) and LIBPMC(3) work together to offer a platform over which applications that use in-CPU performance monitoring counters can be built. The platform "virtualizes" the hardware PMCs in the system. It allows multiple processes to concurrently allocate PMCs and use these to measure the performance of specific processes or the system as a whole. Measurement can be in the form of counting of hardware events or profiling based on the measured hardware events.
HWPMC(4) is the part that runs in the kernel while LIBPMC(3) offers the userland programming API. The PMCSTAT(8) command line tool was the proof-of-concept for the platform.
You can use PMCSTAT(8) today to answer the following broad questions:
Low operational overhead was one of the design goals of the platform. Another was to support measurement of the system "as a whole," i.e., to measure the kernel and userland together. Ease of use was another design goal, as was support for SMP platforms. These characteristics appear to be the major ones that account for the popularity of the platform.
See also the full list of features.
I read that you added the support for Message Signaled Interrupts (MSI) and Extended Message Signaled Interrupts (MSI-X). Could you give us some details?
John Baldwin: MSI is an alternate method for PCI devices to post interrupts to CPUs. MSI interrupts are different from legacy PCI interrupts (also known as INTx interrupts) in several ways. First, legacy PCI interrupts are managed via extra side-band signals that are not part of the normal PCI bus (address and data signals). Legacy PCI interrupts are also limited in that each PCI device can only have a single interrupt.
MSI interrupts are actually implemented as memory write operations on the normal PCI bus similar to normal PCI DMA transactions. One of the benefits of this is that MSI interrupts do not require an interrupt controller external to the CPU like an 8259A or an I/O APIC. Instead, some chipset device in the PCI bus hierarchy is responsible for accepting the MSI transactions and forwarding them to the CPU appropriately. On an Intel chipset this is normally done in the north bridge (or equivalent) where an MSI message is transformed into an APIC message and sent directly to the local APIC(s). An additional benefit of this difference is that because MSI messages are normal PCI bus transactions they are subject to the regular PCI transaction ordering rules. As a result, when an MSI message arrives at a CPU and triggers an interrupt handler, any PCI transactions performed by the interrupting device prior to the interrupt are known to be complete. For the legacy PCI interrupt case this is not guaranteed. Thus, interrupt handlers for legacy PCI interrupts must always start with a read from a register on the PCI device itself that forces any pending PCI transactions to complete. One other benefit of this approach is that PCI devices no longer share interrupt lines which can result in lower overhead for interrupt handling.
Another advantage of MSI interrupts is that MSI interrupts allow for multiple, distinct interrupts for a given PCI device. This can be used to provide optimized interrupt handlers for common interrupt conditions. Not having to perform a read from a register on the device can work with this to help even more. For example, a PCI NIC may support having three separate MSI messages for transmit complete interrupts, receive complete interrupts, and everything else. The interrupt handler for the first message could simply walk the transmit ring cleaning up descriptors for transmitted packets. That handler would not have to query any of the PCI device's registers or look at the receive ring, it would simply access the transmit ring in memory. Similarly, the interrupt handler for the second message would just manage the receive ring and nothing else. The interrupt handler for the third message would be tasked with handling any other events (link state changes, etc.) and would have to read an interrupt status register from the PCI device to determine what interruptible conditions are asserted. Contrast this with a legacy PCI interrupt handler which would have to always read the interrupt status register to determine what conditions need to be handled. By having leaner and distinct interrupt handlers for the common cases, the MSI case can process packets with lower latency.
FreeBSD 6.3 and 7.0 support both MSI and MSI-X interrupts for PCI devices. Note that devices must support at least one of the MSI or MSI-X capabilities (this can be determined via pciconf -lc). Also, FreeBSD only enables MSI and MSI-X interrupts on systems with either a PCI-express or PCI-X chipset. In addition, PCI device drivers have to be updated to support MSI interrupts before they will be used. In simple cases these changes can be very small. Some drivers that currently support MSI include bce(4), bge(4), cxgb(4), em(4), mpt(4), and mxge(4).
One other note is that in 6.3 MSI is not enabled by default. You have to set a couple of tunables to enable it.
What is going to change with the new malloc() library?
Jason Evans: jemalloc is designed to support heavily multi-threaded applications on multi-processor systems. Since the malloc(3) API is set in stone, jemalloc has to pull some unusual tricks to scale without the application code changing. At the most basic level, jemalloc accomplishes this by multiplexing allocation requests across a set of completely independent memory arenas.
The idea of multiple arenas is not new; Larson and Krishnan published a paper on this approach over a decade ago. However, jemalloc takes the approach to its logical conclusion by improving how threads are assigned to arenas, as well as adding dynamic arena load balancing (dynamic load balancing is implemented in FreeBSD-current, and I plan to merge it to RELENG_7 before FreeBSD 7.1 is released).
Although my initial focus for jemalloc was multi-threading scalability, it is worth mentioning that at this point, jemalloc is faster than phkmalloc, the previous allocator, for pretty much all uses, single-threaded applications included. Also, the memory layout algorithms substantially reduce fragmentation in many cases.
libthr becomes the default threading library. What changes?
David Xu: libthr uses a 1:1 threading mode, while libkse uses an M:N threading mode. libthr is now the default threading library in 7.0. From the user's point of view, you won't see any difference between libthr and libkse; performance is the only exception, as libthr performs better for many applications, for example the MySQL database server. On SMP machines it massively outperforms libkse. Developers and users should not worry about the change, since both thread libraries support the POSIX threading specification.
How did you improve kernel locking performance so much?
Attilio Rao: During the FreeBSD 7-CURRENT lifetime my efforts in the SMP support area were focused mainly on two tasks: rewriting the sx locking primitive and decomposing the sched_lock global spinlock. The former was necessary in order to reduce the overhead of a widely used and very expensive primitive, while the latter ranked as a priority task in order to help the ULE scheduler implement per-CPU runqueues and to remove a big scalability bottleneck.
sx locks are a special kind of synchronizing primitive which can be acquired in two different ways: the shared way, allowing multiple threads to hold the lock concurrently for "read purposes" on the protected data, and the exclusive way, which basically allows the lock to behave like a mutex. They are widely used in the FreeBSD kernel mainly for two reasons: a "historical" one, as in the pre-FreeBSD 7 era they were the only primitive allowing shared acquisition between concurrent threads, and a "pragmatic" one, as they offer an optimized way to perform unbounded sleeps (so some subsystems are forced to use them).
Old sx locks were affected by some peculiar problems: instead of using the sleepqueue(9) primitive directly for unbounded sleeping, they went through the heavier msleep(9) primitive; also, access to the lock structure itself was serialized by a front-end mutex, penalizing truly shared accesses. In order to solve these problems, a mechanism very similar to the newly added rwlocks has been implemented: the lock operation has been split into a "fast path" case, which consists of a single atomic operation, and a "hard path" case, which accesses the sleepqueues directly. So usually, in the uncontested case, the locking/unlocking operation is only a pair of atomic operations, in contrast to what happened before (four atomics in the best case, plus other work on the lock structure itself which is no longer necessary with the new implementation). The sx lock improvements have been used as a base for improving performance in other parts of the system (probably the most relevant example is rwatson's improved filedesc locking, which entirely replaces the old-ish msleep approach with a set of sx lock operations).
sched_lock was a spinlock introduced at the beginning of the SMPng project with the purpose of protecting the scheduler-specific information on SMP systems. One of the most prominent ULE features is probably the ability to exploit more than one runqueue, exactly one per core. In order to make this effective on an SMP system, and thus allow truly concurrent lookups on the runqueues, the global scheduler lock needed to be decomposed into targeted locks. The chosen approach, which was partially inherited from Solaris, implements a lock for each runqueue or sleepqueue and is basically referred to as the "per-container approach." Jeff Roberson led the effort to break up sched_lock and to implement a generic layer for easy lock switching between the containers. I helped Jeff in this work by locking some parts independently (like the ldt handling mechanism on ia32, VM statistics, time accounting, and others), submitting some code and bugfixes, and offering reviews.
I heard that there was a lot of work on the new scheduler, ULE, but that it will not be the default scheduler for 7.0. Would you like to tell us a bit more about its evolution, features, and performance?
Jeff Roberson: ULE was started in the 5.0-CURRENT timeframe as an experimental scheduler with the simple goal of exploiting CPU affinity on SMP machines. This optimization prefers to run a thread on the CPU it most recently ran on, which allows it to make use of warm caches. At that time ULE was a bit premature, as the kernel was not scalable enough for the scheduler to make many improvements, and it suffered from some design problems, especially with nice.
During the 6 and 7 development cycles many developers contributed to significantly improving our kernel scalability. At this time I also fixed the problems with nice and other interactivity complaints, and began working in earnest on the affinity and load balancing issues again. With the aid of Attilio Rao, Kris Kennaway, and John Baldwin I decomposed our single giant scheduler spinlock into per-CPU locks. With these changes ULE was finally able to shine and help the rest of the kernel show what it was capable of. On 8-processor machines we are now competitive with all major operating systems we have benchmarked, in a wide variety of server workloads.
Work has not stalled in 8.0, however. We have seen great gains on 16-way machines by implementing CPU-topology-aware scheduling. These algorithms know where the caches and buses are in the system and which CPUs share them. This enables us to make the best choices with regard to CPU affinity and load balancing. Work is also underway on a system for CPU binding and CPU sets, which may be used to restrict jails to certain CPUs, for example. Much of this work is most generously being sponsored by Nokia. This and other improvements may be backported to 7.1, where ULE will likely become the default scheduler.
Who could try it on 7.0 and see a great performance improvement?
Jeff Roberson: Users with 4 or more processors and server type workloads will see the most improvement. Desktop use, batch compiles, and the like really have not benefited very much from affinity because they did not suffer from a lack of it before. However, ULE does offer superior interactivity on desktop systems and many users prefer it for this reason.
Federico Biancuzzi is a freelance interviewer. His interviews appeared on publications such as ONLamp.com, LinuxDevCenter.com, SecurityFocus.com, NewsForge.com, Linux.com, TheRegister.co.uk, ArsTechnica.com, the Polish print magazine BSD Magazine, and the Italian print magazine Linux&C.
Copyright © 2009 O'Reilly Media, Inc.