Performance & Concurrency
What features does the new performance measurement framework provide?
Joseph Koshy: First, permit me to offer a minor clarification: HWPMC(4), LIBPMC(3) and PMCSTAT(8) are not new in 7.0. They were first added to the tree before FreeBSD 6 was branched and have been under development since (work is by no means finished).
Profiling of dynamically loaded objects is present in 7.0 (i.e., shared libraries and dlopen()ed objects in userland, and of course kernel modules). I should also mention the bug fixes :).
HWPMC(4) and LIBPMC(3) work together to offer a platform over which applications that use in-CPU performance monitoring counters can be built. The platform "virtualizes" the hardware PMCs in the system. It allows multiple processes to concurrently allocate PMCs and use these to measure the performance of specific processes or the system as a whole. Measurement can be in the form of counting of hardware events or profiling based on the measured hardware events.
HWPMC(4) is the part that runs in the kernel while LIBPMC(3) offers the userland programming API. The PMCSTAT(8) command line tool was the proof-of-concept for the platform.
You can use PMCSTAT(8) today to answer the following broad questions:
- What is the system doing, i.e., what hardware events dominate the observed behaviour of the system?
- Which part of the code is associated with the observed behaviour of hardware?
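As a concrete sketch of how these two questions map onto the tool (event aliases and flags as I recall them from pmcstat(8); treat the exact spellings as illustrative and check the manual page for your release):

```
# Which events dominate: system-wide sampling with a top(1)-style display
pmcstat -S instructions -T

# Which code is responsible: sample one command, then post-process the
# log into gprof(1) input
pmcstat -P instructions -O /tmp/sample.out ./myprog
pmcstat -R /tmp/sample.out -g
```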
Low operational overhead was one of the design goals of the platform. Another was to support measurement of the system 'as a whole,' i.e., measuring the kernel and userland together. Ease of use was another design goal, as was support for SMP platforms. These characteristics appear to be the major ones that account for the popularity of the platform.
See also the full list of features.
I read that you added the support for Message Signaled Interrupts (MSI) and Extended Message Signaled Interrupts (MSI-X). Could you give us some details?
John Baldwin: MSI is an alternate method for PCI devices to post interrupts to CPUs. MSI interrupts are different from legacy PCI interrupts (also known as INTx interrupts) in several ways. First, legacy PCI interrupts are managed via extra side-band signals that are not part of the normal PCI bus (address and data signals). Legacy PCI interrupts are also limited in that each PCI device can only have a single interrupt.
MSI interrupts are actually implemented as memory write operations on the normal PCI bus similar to normal PCI DMA transactions. One of the benefits of this is that MSI interrupts do not require an interrupt controller external to the CPU like an 8259A or an I/O APIC. Instead, some chipset device in the PCI bus hierarchy is responsible for accepting the MSI transactions and forwarding them to the CPU appropriately. On an Intel chipset this is normally done in the north bridge (or equivalent) where an MSI message is transformed into an APIC message and sent directly to the local APIC(s). An additional benefit of this difference is that because MSI messages are normal PCI bus transactions they are subject to the regular PCI transaction ordering rules. As a result, when an MSI message arrives at a CPU and triggers an interrupt handler, any PCI transactions performed by the interrupting device prior to the interrupt are known to be complete. For the legacy PCI interrupt case this is not guaranteed. Thus, interrupt handlers for legacy PCI interrupts must always start with a read from a register on the PCI device itself that forces any pending PCI transactions to complete. One other benefit of this approach is that PCI devices no longer share interrupt lines which can result in lower overhead for interrupt handling.
Another advantage of MSI interrupts is that MSI interrupts allow for multiple, distinct interrupts for a given PCI device. This can be used to provide optimized interrupt handlers for common interrupt conditions. Not having to perform a read from a register on the device can work with this to help even more. For example, a PCI NIC may support having three separate MSI messages for transmit complete interrupts, receive complete interrupts, and everything else. The interrupt handler for the first message could simply walk the transmit ring cleaning up descriptors for transmitted packets. That handler would not have to query any of the PCI device's registers or look at the receive ring, it would simply access the transmit ring in memory. Similarly, the interrupt handler for the second message would just manage the receive ring and nothing else. The interrupt handler for the third message would be tasked with handling any other events (link state changes, etc.) and would have to read an interrupt status register from the PCI device to determine what interruptible conditions are asserted. Contrast this with a legacy PCI interrupt handler which would have to always read the interrupt status register to determine what conditions need to be handled. By having leaner and distinct interrupt handlers for the common cases, the MSI case can process packets with lower latency.
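The handler split described above can be sketched in plain userland C. This is a toy simulation, not driver code: the struct, field names, and the "slow register read" counter are all invented to make the contrast between the two handler styles visible.

```c
#include <assert.h>
#include <stdint.h>

#define TX_RING 8

/* Hypothetical transmit descriptor; 'done' would be set by the device. */
struct tx_desc { int done; int inuse; };

struct nic {
    struct tx_desc tx[TX_RING];
    uint32_t isr;        /* interrupt status register (a device access) */
    int      isr_reads;  /* how many slow PCI register reads we did */
};

/* MSI-style handler: a dedicated vector for TX completions, so it can
 * walk the in-memory ring without touching any device register. */
static int
msi_tx_handler(struct nic *n)
{
    int cleaned = 0;
    for (int i = 0; i < TX_RING; i++)
        if (n->tx[i].inuse && n->tx[i].done) {
            n->tx[i].inuse = 0;
            cleaned++;
        }
    return cleaned;
}

/* Legacy INTx-style handler: must first read the status register, both
 * to learn what happened and to flush pending PCI transactions. */
static int
intx_handler(struct nic *n)
{
    n->isr_reads++;      /* simulated slow read across the PCI bus */
    if (n->isr & 0x1)
        return msi_tx_handler(n);
    return 0;
}
```

The MSI-style path never increments `isr_reads`, which is exactly the latency win described above.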
FreeBSD 6.3 and 7.0 support both MSI and MSI-X interrupts for PCI devices. Note that devices must support at least one of the MSI or MSI-X capabilities (this can be determined via pciconf -lc). Also, FreeBSD only enables MSI and MSI-X interrupts on systems with either a PCI-express or PCI-X chipset. In addition, PCI device drivers have to be updated to support MSI interrupts before they will be used. In simple cases these changes can be very small. Some drivers that currently support MSI include bce(4), bge(4), cxgb(4), em(4), mpt(4), and mxge(4).
One other note is that in 6.3 MSI is not enabled by default. You have to set a couple of tunables to enable it.
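For reference, assuming the stock tunable names (hw.pci.enable_msi and hw.pci.enable_msix; verify against your release's documentation), the /boot/loader.conf entries would look like:

```
# /boot/loader.conf -- enable MSI and MSI-X on FreeBSD 6.3
hw.pci.enable_msi="1"
hw.pci.enable_msix="1"
```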
What is going to change with the new malloc() library?
Jason Evans: jemalloc is designed to support heavily multi-threaded applications on multi-processor systems. Since the malloc(3) API is set in stone, jemalloc has to pull some unusual tricks to scale without the application code changing. At the most basic level, jemalloc accomplishes this by multiplexing allocation requests across a set of completely independent memory arenas.
The idea of multiple arenas is not new; Larson and Krishnan published a paper on this approach over a decade ago. However, jemalloc takes the approach to its logical conclusion by improving how threads are assigned to arenas, as well as adding dynamic arena load balancing (dynamic load balancing is implemented in FreeBSD-current, and I plan to merge it to RELENG_7 before FreeBSD 7.1 is released).
Although my initial focus for jemalloc was multi-threading scalability, it is worth mentioning that at this point, jemalloc is faster than phkmalloc, the previous allocator, for pretty much all uses, single-threaded applications included. Also, the memory layout algorithms substantially reduce fragmentation in many cases.
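The arena-multiplexing idea can be sketched in a few lines of C. The hash and arena count here are illustrative only; real jemalloc caches the chosen arena in thread-local storage and, as noted above, later adds dynamic load balancing.

```c
#include <assert.h>
#include <stdint.h>

#define NARENAS 4   /* illustrative; jemalloc sizes this from the CPU count */

/* Hypothetical mapping of a thread to an arena.  Each arena has its own
 * lock and data structures, so threads hashed to different arenas never
 * contend with each other on malloc/free. */
static unsigned
arena_for_thread(uintptr_t tid)
{
    return (unsigned)((tid * 2654435761u) % NARENAS); /* Knuth-style hash */
}
```

With the malloc(3) API unchanged, this per-thread dispatch is the "unusual trick" that lets unmodified multi-threaded programs scale.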
libthr becomes the default threading library. What changes?
David Xu: libthr uses a 1:1 threading model, while libkse uses an M:N threading model. libthr is now the default threading library in 7.0. From the user's point of view, you won't see any difference between libthr and libkse; performance is the only exception. libthr performs better for many applications, for example the MySQL database server, and on SMP machines it massively outperforms libkse. Developers and users should not worry about the change, since both thread libraries support the POSIX threading specification.
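Because the change sits below the POSIX API, ordinary pthread code like the following sketch runs unmodified on either library; under libthr each pthread is simply backed by its own kernel thread.

```c
#include <pthread.h>
#include <stddef.h>

#define NWORKERS 4

/* Trivial per-thread work: bump the counter this thread was handed. */
static void *
worker(void *arg)
{
    *(int *)arg += 1;
    return NULL;
}

/* Spawn NWORKERS threads, join them, and sum their counters.  Under
 * libthr each pthread_create() produces a kernel thread (1:1); under
 * libkse several pthreads could share kernel threads (M:N).  The source
 * is identical either way. */
static int
run_workers(void)
{
    pthread_t tids[NWORKERS];
    int counters[NWORKERS] = {0};
    int total = 0;

    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tids[i], NULL, worker, &counters[i]);
    for (int i = 0; i < NWORKERS; i++) {
        pthread_join(tids[i], NULL);
        total += counters[i];
    }
    return total;
}
```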
How did you improve kernel locking performance so much?
Attilio Rao: During the FreeBSD 7-CURRENT lifetime my efforts in the SMP support area were focused mainly on two tasks: rewriting the sx locking primitive and decomposing the sched_lock global spinlock. The former was necessary in order to reduce the overhead of a widely used and very expensive primitive, while the latter was a priority in order to help the ULE scheduler implement per-CPU runqueues and to remove a big scalability bottleneck.
sx locks are a special kind of synchronizing primitive that can be acquired in two different ways: the shared way, which allows multiple threads to hold the lock concurrently for read access to the protected data, and the exclusive way, which basically makes the lock behave like a mutex. They are widely used in the FreeBSD kernel for two main reasons: a "historical" one, as in the pre-FreeBSD 7 era they were the only primitive allowing shared acquisition between concurrent threads, and a "pragmatic" one, as they offer an optimized way to perform unbounded sleeps (so some subsystems are forced to use them).
The old sx locks suffered from some peculiar problems: instead of using the sleepqueue(9) primitive directly to handle unbounded sleeps, they went through the heavier msleep(9) primitive; in addition, access to the lock structure itself was serialized by a front-end mutex, penalizing genuinely shared accesses. To solve these problems, a mechanism very similar to the newly added rwlocks was implemented: the lock operation was split into a "fast path," consisting of a single atomic operation, and a "hard path," which accesses the sleepqueues directly. Usually, in the uncontested case, a lock/unlock pair is now just two atomic operations, as opposed to what happened before (four atomics in the best case, plus other work on the lock structure that is no longer necessary with the new implementation). The sx lock improvements have been used as a base for improving performance in other parts of the system; probably the most relevant example is rwatson's improved filedesc locking, which entirely replaces the old msleep approach with a set of sx lock operations.
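The fast-path/hard-path split can be illustrated with a toy lock word and C11 atomics. This is only a sketch of the idea, not the kernel's implementation: on contention the real primitive falls back to the sleepqueue(9)-based hard path instead of failing.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy sx-style lock word: 0 = free, -1 = exclusively held,
 * n > 0 = n shared holders. */
typedef struct { atomic_long word; } toy_sx;

/* Shared acquire: one successful CAS and we are done. */
static int
toy_sx_try_slock(toy_sx *sx)
{
    long v = atomic_load(&sx->word);
    while (v >= 0) {
        if (atomic_compare_exchange_weak(&sx->word, &v, v + 1))
            return 1;
    }
    return 0;   /* lock is exclusively held: would enter the hard path */
}

/* Exclusive acquire: single CAS from free to held. */
static int
toy_sx_try_xlock(toy_sx *sx)
{
    long expected = 0;
    return atomic_compare_exchange_strong(&sx->word, &expected, -1);
}

/* Shared release: one atomic decrement. */
static void
toy_sx_sunlock(toy_sx *sx)
{
    atomic_fetch_sub(&sx->word, 1);
}
```

In the uncontested case each operation above is a single atomic instruction, which is exactly the property the rewrite was after.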
sched_lock was a spinlock introduced at the beginning of the SMPng project to protect scheduler-specific information on SMP systems. Probably the most prominent ULE feature is its ability to exploit more than one runqueue, exactly one per core. To make this effective on an SMP system, allowing truly concurrent lookups on the runqueues, the global scheduler lock had to be decomposed into targeted locks. The chosen approach, partially inherited from Solaris, implements a lock for every runqueue and sleepqueue, and is usually referred to as the "per-container approach." Jeff Roberson led the effort of breaking up sched_lock and implementing a generic layer for easy lock switching between containers. I helped Jeff in this work by locking some parts independently (such as the ldt handling mechanism on ia32, VM statistics, time accounting, and others), contributing miscellaneous code, submitting bugfixes, and offering reviews.
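The per-container idea can be sketched in userland C with one spin lock per toy runqueue; struct layout and names here are hypothetical, and the kernel's real locks are far more elaborate.

```c
#include <assert.h>
#include <stdatomic.h>

#define NCPU 4

/* Toy per-CPU runqueue, each guarded by its own spin lock instead of a
 * single global sched_lock. */
struct runq {
    atomic_flag lock;
    int         nrunnable;
};

static struct runq runqs[NCPU] = {
    { ATOMIC_FLAG_INIT, 0 }, { ATOMIC_FLAG_INIT, 0 },
    { ATOMIC_FLAG_INIT, 0 }, { ATOMIC_FLAG_INIT, 0 },
};

static void
runq_lock(struct runq *rq)
{
    while (atomic_flag_test_and_set(&rq->lock))
        ;                          /* spin; contention is now per queue */
}

static void
runq_unlock(struct runq *rq)
{
    atomic_flag_clear(&rq->lock);
}

/* Enqueue work on one CPU's runqueue: only that queue's lock is taken,
 * so CPUs manipulating different queues never contend. */
static void
runq_add(int cpu)
{
    runq_lock(&runqs[cpu]);
    runqs[cpu].nrunnable++;
    runq_unlock(&runqs[cpu]);
}
```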
I heard that there was a lot of work on the new scheduler, ULE, but that it will not be the default scheduler for 7.0. Would you like to tell a bit more about its evolution, features, and performance?
Jeff Roberson: ULE was started in the 5.0-CURRENT timeframe as an experimental scheduler with the simple goal of exploiting CPU affinity on SMP machines. This optimization prefers to run a thread on the CPU it most recently ran on, which allows it to make use of warm caches. At that time ULE was a bit premature, as the kernel was not scalable enough for the scheduler to yield many improvements, and it suffered from some design problems, especially with nice.
During the 6 and 7 development cycles many developers contributed to significantly improving our kernel scalability. At this time I also fixed the problems with nice and other interactivity complaints, and began working in earnest on the affinity and load balancing issues again. With the aid of Attilio Rao, Kris Kennaway, and John Baldwin, I decomposed our single giant scheduler spinlock into per-CPU locks. With these changes ULE was finally able to shine and to help the rest of the kernel show what it could do. On 8-processor machines we are now competitive with all the major operating systems we have benchmarked, across a wide variety of server workloads.
Work has not stalled for 8.0, however. We have seen great gains on 16-way machines by implementing CPU-topology-aware scheduling. These algorithms know where the caches and buses are in the system and which CPUs share them. This enables us to make the best choices with regard to CPU affinity and load balancing. Work is also underway on a system for CPU binding and CPU sets, which may be used to restrict jails to certain CPUs, for example. Much of this work is most generously being sponsored by Nokia. This and other improvements may be backported to 7.1, where ULE will likely become the default scheduler.
Who could try it on 7.0 and see a great performance improvement?
Jeff Roberson: Users with 4 or more processors and server type workloads will see the most improvement. Desktop use, batch compiles, and the like really have not benefited very much from affinity because they did not suffer from a lack of it before. However, ULE does offer superior interactivity on desktop systems and many users prefer it for this reason.
Federico Biancuzzi is a freelance interviewer. His interviews have appeared in publications such as ONLamp.com, LinuxDevCenter.com, SecurityFocus.com, NewsForge.com, Linux.com, TheRegister.co.uk, ArsTechnica.com, the Polish print magazine BSD Magazine, and the Italian print magazine Linux&C.