FreeBSD's SMPng
by Federico Biancuzzi
Over the past five years, the FreeBSD developer team has worked very hard to improve performance on multiprocessor systems. Their goal was to remove the big kernel lock used in the 4.x branch, and replace it with fine-grained SMP support. This project, often referred to as SMPng ("SMP next generation"), was a very big effort and took four releases (from 5.0 to 5.3) to reach stable status.
Federico Biancuzzi interviewed FreeBSD Core member Scott Long about the SMPng technology, the current implementation status, future goals, and plans.
Could you introduce yourself?
Scott Long: I'm 30 years old and live near Boulder, Colorado with my wife of seven years and my two kids, aged four and six. I presently contract for Sandvine, Inc., of Waterloo, Canada, on FreeBSD system support. Before that I was part of the Open Source group at Adaptec for nearly five years. When I'm not working on FreeBSD, I enjoy biking, hiking, and camping in the nearby mountains.
What is your role in the project?
Scott Long: I've been using FreeBSD since its inception and have been a contributing member since 2000. I'm currently leading the release engineering of the 5.x release series, and I was recently voted into the FreeBSD Core Team.
My primary interest has been storage. I maintain most of the FreeBSD RAID drivers, and I wrote the UDF filesystem support. I've also written several smaller sound and USB drivers.
As some readers may not know the difference between various locks, can you summarize them?
Scott Long: A spinlock is a mutex that will cause the CPU to spin and repeatedly retry without interruption if the lock is not immediately available. The CPU will not do any other work until it acquires the lock.
A sleep lock is a mutex that will put the current thread/process to sleep if the lock is not immediately available. The thread will be woken back up when the holder of the lock releases the lock. In the [meantime], the CPU will stay busy by running other threads.
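The difference between the two acquisition styles can be sketched in userland. This is only an analogy written with Python's standard threading module, not FreeBSD kernel code; the function names and the max_spins budget are made up for the illustration.

```python
import threading


def spin_acquire(lock, max_spins=1_000_000):
    """Spin-style acquire: keep retrying instead of going to sleep.

    Returns the number of failed attempts before success, or None if the
    spin budget (an assumption of this sketch) runs out first.
    """
    for failed in range(max_spins):
        if lock.acquire(blocking=False):  # try once, never sleep
            return failed
    return None


def sleep_acquire(lock):
    """Sleep-style acquire: the caller is suspended until the lock is free,
    so other threads can use the CPU in the meantime."""
    lock.acquire()
```

On an uncontended lock both styles succeed immediately; the difference only shows up under contention, where the spinning caller burns CPU time and the sleeping caller gives it away.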
An SX (shared-exclusive) lock is a sleep lock that allows multiple threads to hold it concurrently in the "read" state, but only one thread to hold it in the "write" state. "Read" and "write" refer to the operations that will be done on the resources that are protected by the lock.
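A minimal sketch of the shared/exclusive idea, again as a userland Python analogy rather than the kernel's own implementation. The class and method names are hypothetical, and this toy version makes no attempt at fairness, so a steady stream of readers could starve a writer.

```python
import threading


class SXLock:
    """Toy shared/exclusive lock (illustrative only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # threads currently holding the lock shared
        self._writer = False   # True while one thread holds it exclusively

    def slock(self, blocking=True):
        """Acquire in the shared ("read") state."""
        with self._cond:
            while self._writer:
                if not blocking:
                    return False
                self._cond.wait()          # sleep until the writer leaves
            self._readers += 1
            return True

    def sunlock(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()    # a waiting writer may proceed

    def xlock(self, blocking=True):
        """Acquire in the exclusive ("write") state."""
        with self._cond:
            while self._writer or self._readers > 0:
                if not blocking:
                    return False
                self._cond.wait()          # sleep until nobody holds it
            self._writer = True
            return True

    def xunlock(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

With this sketch, two calls to slock() succeed back to back, while xlock(blocking=False) returns False until every shared holder has called sunlock().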
"Blocking" is synonymous with sleeping. When a thread blocks, it is put onto a sleep queue in the scheduler and will stay there until it is woken up by an event like a lock being released.
Counting semaphores are also available. These are sleep locks that allow a predefined number of threads to hold the lock concurrently.
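The "predefined number of holders" property of a counting semaphore can be observed with a small experiment. The sketch below uses Python's threading.Semaphore as a stand-in; the function name, the worker count, and the 10 ms of simulated work are assumptions of the example.

```python
import threading
import time


def peak_holders(limit, workers):
    """Send `workers` threads through a semaphore of size `limit` and
    return the highest number that held it at the same time."""
    sema = threading.Semaphore(limit)
    stats_lock = threading.Lock()
    state = {"inside": 0, "peak": 0}

    def worker():
        with sema:                       # sleeps if `limit` holders exist
            with stats_lock:
                state["inside"] += 1
                state["peak"] = max(state["peak"], state["inside"])
            time.sleep(0.01)             # pretend to use the resource
            with stats_lock:
                state["inside"] -= 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]
```

However many workers are started, the returned peak never exceeds the limit; with a limit of 1 the semaphore behaves like an ordinary sleep mutex.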
How does FreeBSD 4 SMP work?
Scott Long: The traditional method of SMP was to allow CPUs to run multiple userland processes concurrently, but allow only one CPU to be in the kernel at a time. This was done with the "mplock," a spin lock that guarded the entry to the kernel.
What type of problems did this have?
Scott Long: While this approach worked well for workloads that were largely done in userland and were largely computational, it worked very poorly for workloads that made heavy use of I/O, network, and other kernel services.
How does FreeBSD 5 SMPng solve them?
Scott Long: [In] FreeBSD 5, the mplock is gone. In its place is a set of locking and synchronization primitives that allow the various kernel services to provide their own synchronization and multiprocessor safety. Multiple processes are allowed to be in the kernel at the same time.
The second big change is that interrupt servicing is largely now done in special kernel process contexts called "interrupt threads." This allows a driver to use sleep locks in its interrupt handler without blocking the entire CPU while it is sleeping.
For areas that have not been made explicitly MPSAFE yet, a global mutex called "Giant" exists and provides protection similar to the mplock in 4.x. So the fundamental goal right now is to make the existing code MPSAFE and push Giant out of the way as much as possible.
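The split between MPSAFE code and code still under Giant can be pictured as follows. This is a userland Python analogy, not kernel code; the Subsystem class and the subsystem names in the usage note are purely illustrative.

```python
import threading

GIANT = threading.Lock()   # stand-in for the single global Giant mutex


class Subsystem:
    """A kernel service that is either MPSAFE or still guarded by Giant."""

    def __init__(self, name, mpsafe):
        self.name = name
        # An MPSAFE subsystem brings its own lock; a legacy one borrows Giant.
        self.lock = threading.Lock() if mpsafe else GIANT

    def try_enter(self):
        """Non-blocking attempt to start working inside this subsystem."""
        return self.lock.acquire(blocking=False)

    def leave(self):
        self.lock.release()
```

Two MPSAFE subsystems can be entered at the same time because their locks are independent, whereas two legacy subsystems serialize on Giant even if they share no data. Pushing Giant "out of the way" amounts to moving subsystems from the second group into the first.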
What are these interrupt threads?
Scott Long: An interrupt thread is a normal thread that stays within the kernel. [Its] sole purpose is to provide an execution context for a driver interrupt handler. This allows a driver to do things that normally would not be allowed in a traditional interrupt context, like block for resources and/or locks.
The traditional interrupt context is used to schedule the ithread (which is why the scheduler uses a spinlock). Some drivers that are sensitive to latency and don't require blocking can run their interrupt handler directly in the interrupt context and avoid using an ithread. This is very uncommon, however.
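The division of labor described here, where the low-level interrupt context only schedules work and the handler runs in a thread that is allowed to block, resembles a queue feeding a worker thread. The following Python sketch is an analogy under that assumption; the class, method, and event names are invented for the example.

```python
import queue
import threading


class InterruptThread:
    """Toy ithread: runs a driver handler in ordinary thread context."""

    def __init__(self, handler):
        self._handler = handler
        self._pending = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def hard_interrupt(self, event):
        """Stand-in for the low-level interrupt context: queue and return."""
        self._pending.put(event)

    def _run(self):
        while True:
            event = self._pending.get()   # sleeps until there is work
            if event is None:             # shutdown marker (sketch only)
                return
            self._handler(event)          # free to block on sleep locks

    def shutdown(self):
        self._pending.put(None)
        self._thread.join()


def demo():
    driver_lock = threading.Lock()        # a sleep lock owned by the driver
    handled = []

    def handler(event):
        with driver_lock:                 # blocking here is acceptable
            handled.append(event)

    ithread = InterruptThread(handler)
    for irq in ("rx", "tx", "rx"):
        ithread.hard_interrupt(irq)
    ithread.shutdown()
    return handled
```

Calling demo() returns the events in the order they were raised, since a single thread drains the queue.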
What do you think about DragonFlyBSD's SMP technology? How does it compare to SMPng?
Scott Long: The DragonFly approach appears to be very similar to Mach and AmigaOS in that concurrency comes from queuing and passing messages between subsystems rather than synchronizing access to shared data structures. Although their work is still in its infancy, we will definitely keep a close eye on it and see how it progresses.
What about NetBSD/OpenBSD SMP?
Scott Long: I'm not as familiar with OpenBSD, but NetBSD appears to be using the same approach as FreeBSD 4.x. There were recent benchmarks that claim that NetBSD 2.0 is now faster than FreeBSD 5.x. These tests were apparently only done on UP, so they show the cost of the SMPng locks on FreeBSD 5.x. There is quite a bit of work going on at the moment to address this.
Could FreeBSD SMPng deliver better performance than Linux SMP?
Scott Long: One of the design goals of SMPng is to encourage the use of sleep locks instead of spin locks. Sleep locks allow a blocked process to relinquish the CPU to another task, which should help concurrency and scalability. It's hard to say yet whether this will give better real-world benefit until we get further along with subsystem locking and fine tuning.
The Linux SMP approach appears to be focused mainly on using spinlocks for thread synchronization. This has the benefit of making it fairly easy and straightforward to lock code, but results in CPUs wasting time spinning when they could be doing other useful work. Most uses of [Linux] spinlocks also block interrupts while the lock is held, so latency increases. There is ongoing work to implement kernel preemption in Linux to help mitigate these issues, but it seems to often be the source of controversy and problems.
It seems that parts of the code and ideas came from BSD/OS 5. How much has changed while porting to FreeBSD?
Scott Long: Certain subsystems, like the "WITNESS" lock debugging subsystem, came largely from BSD/OS. Most other work, however, was done independently and only conceptually shared between the two.
Can FreeBSD users be sure that no license, copyright, or patent problems will arise in future regarding that code?
Scott Long: The code sharing that was done was done under an explicit agreement within BSDi and Wind River Systems that allowed it to happen.
What type of work was needed beyond the kernel-level hacking?
Scott Long: The SMPng work is mostly confined to the kernel. The new KSE threading package does take advantage of SMPng, but it's not an integral part of SMPng.
How does KSE (Kernel Scheduler Entities) interact with SMPng?
Scott Long: KSE is a derivative of the Scheduler Activation work that was done at the University of Washington back in the 1990s. It takes the classic "libc_r" pthreads library that did all threading in userland and replaces it with a model that shares thread management and scheduling between the kernel and userland, and allows multiple threads in a process to use multiple CPUs at once. It also solves the problem in libc_r of a thread that blocks in the kernel causing all threads in the process to block.
Under libc_r, a multithreaded application could only ever use one CPU. This is because libc_r multiplexed all of the userland threads into a single scheduling entity in the kernel. With KSE, the kernel provides multiple scheduling entities to the userland scheduler to multiplex as it sees fit. This means that true parallelism can now be used. Of course, this was already possible with the "LinuxThreads" package, but this package is typically more heavyweight than KSE.
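A back-of-the-envelope model shows why the number of kernel scheduling entities matters. The Python sketch below is a deliberately simplified assumption: it ignores blocking system calls and scheduling overhead, assigns each user thread greedily to the least-loaded entity, and assumes each entity can run on its own CPU.

```python
import heapq


def makespan(thread_work, kernel_entities):
    """Time to finish all user threads when they are multiplexed onto
    `kernel_entities` independently schedulable entities."""
    loads = [0] * kernel_entities
    heapq.heapify(loads)
    for work in thread_work:
        lightest = heapq.heappop(loads)        # least-loaded entity so far
        heapq.heappush(loads, lightest + work)
    return max(loads)
```

With four threads of four time units each, a single entity (the libc_r situation) needs 16 units, while two entities need 8. The model says nothing about how KSE actually distributes threads; it only illustrates the ceiling that a single kernel entity imposes.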
Was the ULE scheduler introduced for SMPng needs?
Scott Long: The ULE scheduler was developed to help increase interactivity under heavy loads and attempt to keep the CPUs more efficiently scheduled and prioritized. It is also a research vehicle for advanced scheduling algorithms on complex CPU topologies. It came along well after the original SMPng design work.
The ULE scheduler was the default for much of 2004, but was turned off for the 5.3 release. The reason is that while it has a lot of potential, it also wasn't being maintained well and had developed some serious stability problems. Jeff Roberson has started examining the problems and making changes that have been reported to help it. Hopefully it can be made the default again in the future.
But back to your question, the ULE scheduler was designed to work harder at keeping multiple CPUs as busy as possible. Under the traditional 4BSD scheduler, if a CPU went idle, it would stay idle until the next clock tick, typically 10ms later. If an interrupt came in during that time and an ithread (interrupt thread; a kernel thread that exclusively handles device driver interrupt processing) needed to be scheduled to service it, the idle CPU wouldn't wake up to handle it until the next tick. Part of the ULE design was focused on fixing this problem. This has actually also been addressed recently in the 4BSD scheduler with the "IPI wakeup" mechanism. With this, the scheduler will send a wakeup interrupt (IPI) to an idle CPU to get it executing the scheduled thread right away. This feature is enabled in 5.3.
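The cost of waiting for the next clock tick is easy to quantify. The sketch below assumes the 10 ms tick mentioned above and idealizes the IPI wakeup as instantaneous; the function and parameter names are illustrative.

```python
def wakeup_delay_ms(arrival_ms, tick_ms=10, ipi_wakeup=False):
    """Delay before an idle CPU notices work that arrived at `arrival_ms`.

    Without an IPI the CPU waits for the next tick boundary; with an IPI
    it is woken right away (modeled here as zero delay).
    """
    if ipi_wakeup:
        return 0
    return (-arrival_ms) % tick_ms   # time remaining until the next tick
```

An interrupt arriving 3 ms into a tick waits 7 ms without the IPI. Averaged over arrivals at each whole millisecond of a tick, the delay is 4.5 ms, which is the latency the IPI wakeup mechanism is meant to remove.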
The other design goal of ULE was to have it map out and understand the CPU topology and make good scheduling choices for features like HyperThreading. Unfortunately, to my knowledge this work is not yet complete.
Does SMPng improve performance on Intel Hyper-Threading capable CPUs?
Scott Long: As of right now, very little. The scheduler really needs to be aware of HyperThreading and schedule threads and processes appropriately so that the caches and TLBs can be shared and not get thrashed. The ULE scheduler will fill this role in the future, but it's not there yet.
Modern CPUs have a big L2 cache, often one or two megabytes. Is SMPng able to share the load among CPUs to gain cache affinity?
Scott Long: Again, the FreeBSD schedulers are fairly traditional and don't fully optimize for CPU topology. Work is ongoing on this.
It seems that Intel and AMD will start selling dual-core CPUs soon. Have you already received any dual-core CPUs to work on?
Scott Long: The project itself hasn't received any advanced hardware yet, but we keep close ties with Intel and AMD and hope to get hardware in the future. When Opteron was still under development, AMD was very generous with loaning development hardware to various groups in the project.
What type of limits does SMPng have? How many CPUs can it support?
Scott Long: It supports the standard Intel MP limits of eight CPUs on x86 and [AMD64]. Performance doesn't scale as well as we would like when adding CPUs, but it is something that we will be focusing on in the near future.
FreeBSD/ia64 is still maturing and unfortunately is not terribly stable under SMP, so it's impossible to comment on scalability there. The sparc64 port runs well on most of the common Ultra, Netra, and Blade lines (except for the UltraSparcIII series) which typically have one to four CPUs. I'm not aware of anyone running it on any of the higher-end systems.
Have you any benchmarks of SMPng scalability? I'd like to know how much performance will grow for every couple of CPUs that I add.
Scott Long: The focus so far with SMPng has been correctness. This doesn't always translate into better scalability at first, but is required for long-term success. Now that more pieces are locked and proven correct, we are starting to shift our focus to performance and scalability. 5.3 is a big release for this as we are starting to see some respectable performance benefits. We expect the performance and scalability to improve with subsequent releases.
Does SMPng worsen the performance of single-CPU systems?
Scott Long: We are trying very hard to address this. Right now, UP performance is slightly decreased, but as we work more to optimize the locking it will get better. SMPng also included a lot of work in kernel preemption and reduced latency for priority inversion of processes, so the end result should be a more responsive system on UP.
There will always be room for more improvements. 5.3 is the first milestone in making SMPng a reality. With the network stack locking mostly done, a lot of effort has shifted towards measuring and improving performance for both UP and SMP. This includes measuring the cost of locks, optimizing code flow to avoid locks where possible, and batching data flow to amortize the cost of locks as much as possible. 5.4 and beyond will definitely benefit from this work.
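Batching to amortize lock cost can be illustrated by counting lock acquisitions. In this Python sketch the CountingLock wrapper, the deliver function, and the idea of "packets" as list items are all assumptions made for the example, not a description of the FreeBSD network stack.

```python
import threading


class CountingLock:
    """A mutex that records how many times it has been acquired."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1
        return self

    def __exit__(self, *exc):
        self._lock.release()
        return False


def deliver(packets, batch_size):
    """Append packets to a shared queue, taking the lock once per batch.

    Returns the delivered packets and the number of lock acquisitions.
    """
    lock = CountingLock()
    delivered = []
    for start in range(0, len(packets), batch_size):
        batch = packets[start:start + batch_size]
        with lock:                   # one acquisition covers the whole batch
            delivered.extend(batch)
    return delivered, lock.acquisitions
```

For 100 packets, a batch size of 1 takes the lock 100 times and a batch size of 10 takes it 10 times, with the same data delivered in the same order. The trade-off, not modeled here, is that each packet may wait longer for its batch to fill.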
Can we build a kernel without SMPng for single-CPU systems?
Scott Long: SMPng is part of the fundamental design for 5.x, and it cannot be compiled out. The kernel can be compiled for either SMP or UP, with the UP configuration reducing the cost for locks. The default shipped configuration is SMP to maximize compatibility.
FreeBSD 5 supports a larger number of platforms than the 4.x branch. Are there any big differences in the SMPng code?
Scott Long: There are differences in some of the atomic operation and locking primitives, but the APIs built on those primitives are consistent for the rest of the kernel. As long as a programmer follows the APIs properly, the code should work fine and behave the same on any platform.
Is there any optimization for 64-bit platforms?
Scott Long: There are certain optimizations in the virtual memory management that can help 64-bit CPUs be quite a bit faster in this area.
What type of link is there between the busdma project and SMPng?
Scott Long: None. busdma is the API to abstract driver/hardware DMA operations so that a driver will work on non-i386 architectures without modification. ARM, MIPS, and sparc64 are examples of architectures where special handling is needed to make host memory available to peripherals.
The busdma API has certain programming requirements that must be followed in order to lock a driver correctly. Because of this, the notion of having a driver compliant to the API and having it locked (for SMPng) was combined into a single status page.
Imagine a home user with a dual-CPU workstation with KDE/GNOME and all of the other common applications. What type of experience should he expect moving from 4.10 to 5.3?
Scott Long: Much better interactivity for threaded processes, and much better interactivity and responsiveness between processes and with input devices. Audio and video streaming tasks that before could get choppy if the system was doing other tasks will definitely perform much better now.
When that user installs FreeBSD 5.3, what type of visible changes should he expect to see? I read that it's now possible to measure time used by each interrupt with ps. Anything else?
Scott Long: All of the binaries in the /bin and /sbin directories are now dynamically linked. The /rescue directory has statically linked copies of many critical binaries and can be used to rescue a system that is broken due to shared library problems.
top will show stats of each thread in a process if given the appropriate flags (this defaults to off to prevent breaking scripts that might not handle it correctly). The interrupt time that top reports is the amount of time spent in the ithreads and in the low-level interrupt handlers.
Kernels and modules are now stored in /boot instead of /. /boot/kernel is the default directory that is searched by the loader for these. The loader now also has a simple menu for booting single-user or booting with various options disabled.
Power management is now radically different. ACPI is now preferred to APM; most ports have been converted to use ACPI, but some legacy ones might not have been converted.
PCCard peripherals are now handled by the cardbus subsystem instead of the old pccard subsystem. The old one still exists for migration purposes, but is largely unmaintained. The cardbus subsystem allows 32-bit cards to operate natively. 16-bit cards should continue to work.
The 5.3 release includes a multithreaded network stack. Which other re-engineered subsystems are ready?
Scott Long: The GEOM storage layer is new for 5.x and is inherently multithreaded. This allows storage drivers, RAID modules, and other data transforms that are SMP-aware to operate without the Giant lock.
Unix pipes and domain sockets can also operate without the Giant lock, allowing interprocess communication to be fairly efficient. The VM system is undergoing quite a bit of work to make it SMP-aware. The results here are less tangible but result in a lot of small improvements around the kernel.
With regard to packet filtering, through PF, IPF, or IPFW, does SMPng make any difference?
Scott Long: Some cases of firewall-based forwarding and routing will improve since they will be able to operate in the kernel without the Giant mutex and independent of other CPUs.
The PF filter is the only one that runs without the Giant lock. It is an adoption of the same codebase that OpenBSD developed several years ago and is the one with the most active development in FreeBSD. Because it does not require the Giant lock, it has the potential to be the fastest of the three.
IPFW is the original FreeBSD packet filter that was developed many years ago and thus is the one that most people are familiar with. It's actually in its second incarnation as IPFW2. There is also an IPv6 version called IP6FW. A very nice feature of this package is the "dummynet" facility. This allows for simple traffic shaping and bandwidth management on any network interface.
IPFilter was introduced several years ago as an alternative to IPFW. The author of it is currently working on MPSAFE patches for it that should help performance.
Talking about the SMPng roadmap, what type of work is in store for future 5.x releases?
Scott Long: After the 5.3 release, the 5.x series will take on the 5-STABLE label and will focus mainly on incremental features, performance, and bug fixes. New development will shift to the 6-CURRENT stream, though pieces that can integrate back to 5-STABLE will be allowed to do so once they have proven themselves. I would expect more storage and network drivers to become MPSAFE, as well as possibly some peripheral systems like SCSI and maybe USB. We will also look at tweaking performance where we can.
We are planning on the 5.4 release in late February. The 5.5 release will likely be about four months after that, but it's also subject to some new plans that we have for 6.0.
After quite a bit of recent discussion, we decided to speed up the overall development cycle of FreeBSD and focus on doing major releases every 12-18 months, and minor releases at four-month intervals in between. This means that we will start preparing for 6.0 in May/June of 2005. We will create the RELENG_6 CVS branch then, spend one to three months after that on bug-fixes and QA, and then release 6.0 sometime around July/August of 2005. This schedule is still under quite a bit of development at the moment, so I won't promise any more than that.
With the new schedule, the primary focus will be on doing releases that are more timely and reliable from a scheduling point of view. One of the problems with 5.x development has been that we left users without support for new hardware while we delayed each release for more features or tweaks.
Which subsystems will have the biggest performance boost?
Scott Long: The largest focus right now is on the network stack. This will allow network packets to flow through the kernel in both directions in parallel, and will lower latency involved in receiving and processing inbound packets.
Several storage drivers, including the ATA driver, are MPSAFE now and should show benefits similar to those I mentioned for the network stack. Kernel services such as pipes, sockets, and VM are also MPSAFE now.
The ultimate goal with SMPng is to have a system that allows maximum useful work on all CPUs. Synchronizing access to resources is an unavoidable cost in SMP, so the focus is to allow CPUs to quickly switch to other tasks when the current task has to wait for a resource.
The SMPng work started with having the entire kernel protected with two locks, the Giant sleep mutex (which covered the entire kernel) and the scheduler spinlock (which provided synchronization during hardware interrupts). From there the effort has been to reduce the scope of the Giant lock by providing specific locking to individual drivers and subsystems. The eventual goal is to remove the Giant lock entirely.
Who worked or is working on SMPng? Who did what?
Scott Long: The original SMPng design meeting was attended by the following (taken from Steve Passe's SMPng page): Don Brady, Ramesh?, Ted Walker, Jeffrey Hsu, Chuck Paterson, Jonathan Lemon, Matt Dillon, Paul Saab, Kirk McKusick, Peter Wemm, Jayanth?, Doug Rabson, Jason Evans, David Greenman, Justin Gibbs, Greg Lehey, Mike Smith, Alfred Perlstein, David O'Brien, and Ceren Ercen.
Jason Evans led the architecture work until 2001. John Baldwin took over from there and has been leading the work ever since. The main SMPng project page lists many others who have [worked] on it.
Is there any trick or caveat that people developing a multithreaded application for FreeBSD 5-STABLE should know? Can you give any suggestions to optimize the code for SMPng?
Scott Long: The KSE threading library is now enabled as the default pthread library. The old libc_r can be used in [its] place to aid with debugging. GDB support for KSE is still evolving, so switching to libc_r for debugging is often helpful.
Right now, there is little way for an application to interact with SMPng properties. In the future there will likely be ways to group threads to specific CPUs in order to help with cache locality and parallelization.
Federico Biancuzzi is a freelance interviewer. His interviews have appeared in publications such as ONLamp.com, LinuxDevCenter.com, SecurityFocus.com, NewsForge.com, Linux.com, TheRegister.co.uk, ArsTechnica.com, the Polish print magazine BSD Magazine, and the Italian print magazine Linux&C.