Published on Linux DevCenter

Vanishing Features of the 2.6 Kernel

by Jerry Cooperstein

Many developers are eagerly awaiting the 2.6 Linux kernel. The feature freeze has passed, with a code freeze planned for January and final release slated for the second quarter of 2003. There is considerable excitement about anticipated enhancements, especially regarding scalability and performance.

However, some developers may first notice what doesn't work anymore. Some techniques and APIs have been removed, and existing device drivers and modular plugins may no longer work. At the same time, it will take some time to take advantage of new features and to find replacements for old ones.

Some deprecated techniques, such as task queues, have finally been eliminated. Other facilities, including in-kernel Web acceleration, have been supplanted by newer advances. Other changes, notably banishing the system call table from the list of exported symbols available to modules, have flowed more from philosophical and licensing issues than from technical considerations.

Export of the System Call Table

The Linux kernel has a monolithic architecture; it is one big program. All parts of the kernel are visible to each other unless their scope has been explicitly limited. Arguments are passed on the stack, as in any other C program. At the same time, Linux makes extensive use of modules: facilities that may be loaded and unloaded dynamically. (These are often, but not always device drivers.) Modules can only see explicitly exported symbols (functions, variables, etc.). Unless the kernel or a previously loaded module includes the statement EXPORT_SYMBOL(foobar);, the module cannot refer to foobar().
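As a sketch of what the export mechanism looks like in practice (the function name foobar is the article's placeholder; this is an illustrative kernel-code fragment, not a complete module):

```c
/* In the kernel proper, or in an already-loaded module: */
int foobar(int x)
{
        return x + 1;
}
EXPORT_SYMBOL(foobar);  /* without this line, no module can resolve foobar() */
```

Any module loaded afterward may then call foobar() directly, exactly as if it were an ordinary function in the same program.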

Extensive modularization does not render the kernel any less monolithic. The critical difference between monolithic and microkernels stems from how components communicate with each other. As long as the Linux kernel prefers function calls to message passing, its basic structure will remain monolithic.

The system call table is a vector containing the addresses of the functions executed whenever a system call is made from user space. When invoking a system call, the kernel receives the number of the call, the number of arguments, and the arguments themselves. It uses the call number as an offset into the table and places the arguments in the registers; they're not passed on the stack. Then it jumps to the appropriate address to execute the system call.

Exporting the system call table allows modules to substitute system calls with replacements of their own devising. Replacing the basic kernel read() system call requires only a simple code fragment:

extern void *sys_call_table[];

read_save = sys_call_table[__NR_read];
sys_call_table[__NR_read] = read_sub;

where read_sub() has been defined somewhere in the module (the call-number macro is __NR_read in the kernel headers) and the pointer to the original system call has been saved so that it can be restored upon module unloading:

sys_call_table[__NR_read] = read_save;

So what is wrong with this technique?


On the practical side, it is easy to incur race conditions, especially on multi-processor systems where the replacement happens while an application is using the system call. Various locking techniques can offer some protection, but the details are non-trivial. However, the abolition of this method is not primarily due to practical difficulties.

Some system calls penetrate deep into the kernel's heart. Binary-only modules, whose source is not available under a GPL-compatible license, have long enjoyed the use of this technique, since exported symbols have been visible to all modules.

The rules governing binary modules and GPL violations have always been fuzzy. Some argue that it is permissible for any such module to restrict itself to exported symbols. Others maintain it depends on whether or not the module fiddles with core kernel facilities. The line between central and peripheral matters has always been very gray.

To sharpen this delineation, the 2.4.10 version of modutils, which handles loading and unloading of modules, introduced module licenses. In addition, the EXPORT_SYMBOL_GPL macro, introduced in the 2.4.11 kernel, created two classes of exported symbols. Only modules with an acceptable open-source license have access to symbols exported under the GPL. All previously exported symbols were grandfathered in.
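In outline, the two-class scheme looks like this (an illustrative fragment with an invented symbol name; a symbol exported with the GPL-only macro resolves only for modules that declare a compatible license):

```c
/* Exporter side -- visible only to GPL-compatible modules: */
int core_helper(void);
EXPORT_SYMBOL_GPL(core_helper);

/* Importer side -- the module must declare its license,
 * or core_helper() will not resolve at load time: */
MODULE_LICENSE("GPL");
```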

This led to some loud arguments. Perhaps if the macro had been called EXPORT_SYMBOL_INTERNAL, it would have signaled an intent to differentiate between modules implementing central and peripheral kernel facilities, rather than making a choice based on the kernel programmer's licensing philosophy.

Choosing to use EXPORT_SYMBOL_GPL(sys_call_table) would have satisfied many objections. Instead, the more draconian choice of embargoing all export of the system call table occurred. Red Hat did this in the patched 2.4.18 kernel shipped with Red Hat Linux 8.0, and Linus Torvalds did the same in the 2.5.41 development kernel. As a result, a module can no longer replace a system call through the simple code above. In its place, support has been added for registering new system calls dynamically, and this facility may continue to grow.

Most observers foresee a tightening of the limits on binary modules. This may very well break some rather expensive commercial Linux products, but that doesn't seem to bother most kernel developers. Reminding the purveyors of binary modules that they continue to operate at the pleasure of the Linux kernel developers and their open-source licenses is seen to be a necessary (even enjoyable) task. It has probably always been true that the only way to protect investment in Linux deployment of drivers and other kernel facilities (not applications) is to go open source, even if that is difficult for commercial enterprises to absorb. Recent developments seem to re-emphasize this.

Bye Bye Task Queues

The kernel often has to defer some tasks, scheduling them for later execution, though often as soon as possible. Commonly this is in interrupt service routines, where work is often divided between a "top half" and a "bottom half."

A typical top half stores incoming data and primes a device to be ready for new interrupts as quickly as possible. Other interrupts may even be disabled during this process, which highlights the necessity of quick execution. The bottom half performs less time-critical processing, such as filtering, copying to user space, etc. This helps increase device throughput.

Ancient kernels used a fixed set of 32 bottom half queues. These were superseded by the use of task queues, several of which were maintained by the kernel. Others could be created for specific purposes.

The task queue implementation has inherent limitations, and its use has been deprecated throughout the 2.4 series. For one thing, only one task queue could run at a time systemwide. For another, most task queues ran out of process context, which made it impossible for them to sleep. It also exposed them to all the other vulnerabilities of code running at "interrupt time."

The 2.4 kernel introduced tasklets as a partial replacement. Multiple tasklets of different types can run simultaneously. Tasklets always run on the CPU that scheduled them, which minimizes cache thrashing and, since it serializes things, simplifies re-entrancy problems and race conditions. However, tasklets still run in an interrupt-like context.
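The tasklet interface itself is small. The fragment below is an illustrative sketch (handler and tasklet names invented); note that because the handler runs in an interrupt-like context, it must not sleep:

```c
#include <linux/interrupt.h>

/* Runs later, in interrupt-like context: no sleeping here. */
static void my_handler(unsigned long data)
{
        /* less time-critical processing goes here */
}
DECLARE_TASKLET(my_tasklet, my_handler, 0);

/* Typically called from the interrupt handler's top half: */
/*     tasklet_schedule(&my_tasklet); */
```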

In the 2.5 kernel, task queues were gradually minimized and written out of existence. A new replacement called work queues took their place. Tasks are still placed on queues, but they run in process context, which permits sleeping. Unlike task queues, each work queue is tied to a set of threads, one per CPU, so a sleeping task doesn't block other work. One can also specify a minimum time period before the task is performed.

As a wrinkle, only GPL-licensed modules will be able to create their own work queues. Other modules will have to live with a default work queue maintained by the kernel.
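A hedged sketch of the early-2.6 work queue interface (handler and queue names invented; check the headers of your kernel version, since this API was still settling). Because the handler runs in process context, it may sleep:

```c
#include <linux/workqueue.h>

/* Runs in process context, so sleeping is permitted. */
static void my_work_handler(void *data)
{
        /* heavier, possibly blocking, processing goes here */
}

static struct work_struct my_work;

/* Setup and scheduling, e.g. from module init or an interrupt handler: */
/*     INIT_WORK(&my_work, my_work_handler, NULL);                     */
/*     schedule_work(&my_work);          use the kernel's default queue */
/*     schedule_delayed_work(&my_work, HZ);  run no sooner than 1 second */

/* Creating a private queue -- available to GPL-licensed modules only: */
/*     struct workqueue_struct *wq = create_workqueue("mywq");         */
```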


In-Kernel Web Server Acceleration

Web servers often dish up static file content, and each request requires at least two system calls and context switches. The 2.4 kernel included khttpd, an in-kernel Web accelerator, to handle such requests directly within the kernel. More complicated requests, and requests with questionable security, were passed off to an external Web server such as Apache.

While khttpd could raise the number of handled requests by a factor of two to five, a later kernel patch called TUX (also known as the Red Hat Content Accelerator), written by Ingo Molnar of Red Hat, achieved much higher speeds and included more advanced features. TUX can coordinate with kernel- and user-space modules and daemons that provide dynamic content. It also provides caching and can send a mixture of dynamically generated and pre-generated objects. TUX uses zero-copy networking, can run its own CGI engine, and can even serve as an FTP server.

As a result, khttpd became less popular and less well maintained. Many developers assumed TUX would take its place in the 2.6 kernel.

At the same time, important advances in user-space Web servers have helped them reach performance levels previously available only to in-kernel Web accelerators. For example, the x15 server from Chromium uses a small pool of threads (4-8 per CPU) to control all network connections and network and disk I/O. Real-time signals notify the server whenever data appears on a socket or when output is possible on the connection; no polling ever occurs. x15 avoids launching a thread for each connection and also benefits from zero-copy networking and other kernel enhancements. Dan Kegel has written an excellent summary of some of the issues involved in what he calls the C10K problem.

Many developers had been unhappy about pushing Web services into the kernel, feeling it was a slippery slope. Why not absorb all sorts of user-space facilities inside the kernel? Returning these features to user-space is thus quite welcome. TUX can also be applied as a patch and is available on all Red Hat systems.

Other Issues

The three issues we have highlighted are likely to affect a considerable number of users. Other worthy changes include:

  1. kdev_t, which encodes device major and minor numbers, has morphed from what was effectively a 16-bit quantity into a structure. Eventually the structure will contain more device-specific information. The major and minor bit fields should expand to a total of 32 bits, which will permit more devices to be registered uniquely.

  2. The API for block drivers has undergone a significant overhaul as part of the major enhancement of I/O operations.

  3. The remaining pcibios functions have been exterminated.

  4. The kiobuf/kiovec mechanism of pinning down user pages to permit direct access has been replaced by the get_user_pages() function.

  5. Kernel building and its interface have been reworked. The old Tcl/Tk graphical interface to xconfig has been replaced with a prettier and more functional GUI based on the Qt graphical libraries.
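For item 4 above, the approximate 2.6-era prototype of get_user_pages() is shown below as a sketch (consult <linux/mm.h> in your kernel tree for the authoritative declaration, since this interface has changed over time):

```c
/* Pins len pages of the user address range starting at start in the
 * given mm, filling pages[] with the pinned struct page pointers: */
int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
                   unsigned long start, int len, int write, int force,
                   struct page **pages, struct vm_area_struct **vmas);
```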

Housecleaning is almost an obsession in Linux: features which have grown old or weak are euthanized, good ideas which no one ever used are obliterated, and sometimes mistakes are surgically removed before they grow out of control. This helps keep the kernel lean and understandable. Its growth in size is probably entirely due to new hardware devices and architectures, rather than new general features. Perhaps other deprecated features are yet to be removed before the new kernel debuts.

Jerry Cooperstein is a senior consultant and Linux training specialist at Axian Inc. in Beaverton, Oregon, and lives in Corvallis, Oregon.


Copyright © 2009 O'Reilly Media, Inc.