ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)
 See this if you're having trouble printing code examples


IRIX Binary Compatibility, Part 5
Reverse Engineering Threading

by Emmanuel Dreyfus
12/19/2002

Welcome back to our series on IRIX binary compatibility. In this part, we will study IRIX and NetBSD threading models. We will also examine how it is possible to emulate IRIX native threads on NetBSD, though NetBSD does not support a similar feature for its native binaries.

Introduction to Multithreading

Traditional UNIX systems implement multitasking by using multiple processes. A process is the unit of execution. It has an execution context and a virtual memory space, both private and specific to the process.

This scheme has some drawbacks. Consider the need for efficient interprocess communication. Traditionally, the programmer needs to arrange for an area of memory to be shared between processes. This makes the program more complex. Another problem is that switching virtual memory spaces on process switches can be expensive.

The solution to these problems is multithreading. In UNIX terminology, multithreading is the ability to run multiple execution contexts within the same virtual address space. Each execution context is known as a thread. The set of threads sharing the same address space is usually known as the process. Since memory is shared among the different threads of a process, communication between threads is transparent and efficient.

There are two popular ways of implementing threads on a UNIX system: kernel assisted threads and user threads.

With user threads, the kernel is not aware that there are multiple execution contexts in the process. Everything is done in userland, by timely switching between the threads. The advantage of this method is that there is no need for kernel support, which means that it is easy to implement user threads on any UNIX system. Additionally, there is an existing standard for the programming interface: POSIX.1c. Because of these two advantages, user threads are often referred as portable threads or POSIX threads.

In This Series

IRIX Binary Compatibility, Part 6
With IRIX threads emulated, it's time to emulate share groups, a building block of parallel processing. Emmanuel Dreyfus digs deep into his bag of reverse engineering tricks to demonstrate how headers, documentation, a debugger, and a lot of luck are helping NetBSD build a binary compatibility layer for IRIX.

IRIX Binary Compatibility, Part 4
Emmanuel Dreyfus tackles the chore of emulating IRIX signal handling on NetBSD.

IRIX Binary Compatibility, Part 3
Emmanuel Dreyfus shows us some of the IRIX oddities, the system calls that you will not see anywhere else.

IRIX Binary Compatibility, Part 2
Emmanual Dreyfus shows us how he implemented the things necessary to start an IRIX binary. These things include the program's arguments, environment, and for dynamic binaries, the ELF auxiliary table, which is used by the dynamic linker to learn how to link the program.

IRIX Binary Compatibility, Part 1
This article details the IRIX binary compatibility implementation for the NetBSD operating system. It covers creating a new emulation subsystem inside the NetBSD kernel as well as some reverse engineering to understand and reproduce how IRIX internals work.

The big disadvantage of user threads is that because the kernel has no knowledge of the existence of the threads, when one thread makes a blocking system call, the process, and therefore all of the threads in that process, are blocked. Also, because the kernel only sees one process, user threads cannot take advantage of multiple processors.

Kernel assisted threads are designed to address these two drawbacks. With this scheme, the kernel is responsible for scheduling the different threads within a process. Because it is aware of their existence, the kernel can schedule other threads of a process when a thread makes a blocking system call. The kernel is also able to balance multiple threads of a single process among multiple processors. One big disadvantage of kernel assisted threads is that it requires kernel support, which means that some UNIX kernels will not support them. Kernel assisted threads are not easily portable. This is why kernel assisted threads are also known as native threads.

Another big drawback is that there is no standardized programming interface for handling native threads, and each OS has its own set of system calls syntax and semantics. For instance, Linux creates native threads using Linux uses clone(2) and IRIX uses sproc(2).

One side note about kernel processes. Kernel processes are UNIX processes running entirely in kernel mode and never executing in user mode. Examples of kernel processes are the pagedaemon(8) on NetBSD, whose job is to swap out pages that have been unused for some time. Because it only manipulates kernel structures, pagedaemon does not need to run in user mode, therefore it is named as a kernel process. The problem is that kernel processes are sometimes referred as kernel threads because they are different execution contexts all sharing the kernel virtual memory space. This creates confusion with native threads, which do exactly the same thing though they share a user virtual memory space. In order to avoid confusion, we will try to speak about kernel processes and kernel assisted threads (or native threads) using these names in this document.

Thread Models in IRIX and NetBSD

IRIX supports portable threads, of course, but because this is all implemented in userland, it is transparent for our IRIX emulation on NetBSD. We do not have to set up anything to support that.

Native threads are implemented through sproc(2) system calls and variants: nsproc(2), sprocsp(2), and so on. Variants differ from the original system call by offering options such as the possibility to choose where the new thread's stack will be allocated. IRIX has special terminology for native threads created by sproc(2): they constitute a share group.

Here is sproc(2)'s prototype:

pid_t sproc (void (*entry)(void *), unsigned inh, void *arg);

The first argument is the thread entry point. Once the kernel has created the new thread, it returns to userland and resumes execution at the address pointed to by the entry argument.

inh stands for inherit. This integer value holds a few flags which specify whether various resources should be shared or not between the thread and its parent. Sharable resources include: address space, UID/GID, process limits, file descriptors, umask value, and working directory.

Finally, arg is a pointer to the thread entry function's arguments. There is no argument counter, as the parent and the child are supposed to know each other well enough to be aware of what is pointed to by arg.

Now, what about NetBSD? Of course NetBSD supports user threads, although there is no built-in user thread library. If you need user threads, you have to install a third-party thread library. The NetBSD package collection includes multiple user thread libraries: devel/unproven-pthreads, devel/pth, and devel/mit-pthreads. Of course all the packages work. There are three implementations available because none is perfect, so we leave the choice to the user.

With native threads, things become more difficult since NetBSD does not support native threads yet. Given this situation, emulating IRIX native threads may seem an impossible task. Fortunately, it is not, and we will explain why a bit later. For now, let us explain why NetBSD is so poor regarding threading.

We explained before that both user threads and native threads have their drawbacks. In an ideal world, we would want both a standardized API, and a thread model where threads can take advantage of multiple processors and are not blocked because of a blocking system call issued by another thread.

NetBSD is now close to be that ideal world thanks to Nathan J. Williams, who is working on an hybrid thread model that maps user threads to kernel threads. This concept is called Scheduler Activations and was first introduced by Sun Microsystems in Solaris.


O'Reilly Gear.

There is no thread capability built into NetBSD, whether user or native, just because we want the ideal world for NetBSD and nothing else. But before we enter the perfect world, we want to emulate IRIX native threads.

In fact, NetBSD does support native threads. NetBSD just lacks the API to spawn native threads from userland. From the kernel, and especially from a foreign OS emulation subsystem, there is no problem: you can create a native thread. This has been used for Linux emulation first since a lot of Linux applications require native threads through the clone(2) system call. In order to emulate Linux's clone(2), NetBSD needed some support for native threads. We end up with this strange situation where the NetBSD kernel is able to emulate features that it does not support natively.

In the NetBSD kernel sources, native threads are supported through the fork1(9) kernel function, found in sys/kern/kern_fork.c. fork1(9) has some options, including the possibility to share the virtual memory space with the parent.

It is worth mentioning that the NetBSD fork(2) system call implementation is just a wrapper to fork1(9) with appropriate arguments:

int sys_fork(struct proc *p, void *v, register_t *retval)
{  
        return (fork1(p, 0, SIGCHLD, NULL, 0, NULL, NULL, retval, NULL));
}

In this file, there is also a sys___clone() function, which is a Linux clone(2) compatible system call. It is not exported in the kernel API, but if you ever want to natively build a Linux application that uses clone(2) on NetBSD, consider adding sys___clone() to NetBSD native's syscall.master and rebuild a kernel (note that you also need to patch the libc to make the system call available).

Having described threads and the NetBSD implementation, let us move on to something more serious: the actual implementation.

Everything is done in sys/compat/irix/irix_prctl.c. Here we have the various sproc(2) variant implementations: irix_sys_pidsprocsp(), irix_sys_sprocsp() and irix_sys_sproc(), respectively implementing pidsprocsp(2), sprocsp(2), and sproc(2). All these functions call irix_sproc() with appropriate arguments. irix_sproc() does the actual work, which is mostly to call fork1(9).

Lists, Locks and Reference Counts

But things are not that simple. For various reasons, we need to be able to walk through the threads list in a share group. Therefore, we need a chained list of processes in the share group.

There are handy macros in the NetBSD headers <sys/queue.h> that are really helpful here. Documentation can be found in the queue(3) manual page. There are macros to declare the list, to walk through it, to insert elements in it, and so on. All the pointer arithmetic is done in the macro, and the programmer can focus on the actual use of the list. This is wonderful.

Creating the list is easy. To store the pointers to next/previous elements in the list, we can use a field of the emulation specific part of struc proc (it is pointed to by the p_emuldata field, and we created a struct irix_emuldata to describe it. Check the previous article about signal emulation if you need details about this). The only problem is deciding where to store the head of the list. Storing it in the irix_emuldata structure is not a good idea, since we have no guarantee at all that the processes holding the head of the queue will not terminate before the others.

The solution is to allocate a structure, which describes the share group. It is defined in sys/compat/irix/irix_prctl.h and is called struct irix_share_group.

/* IRIX share group structure */
struct irix_share_group {
        LIST_HEAD(isg_head, irix_emuldata) isg_head;    /* list head */
        struct lock isg_lock;                           /* list lock */
        int isg_refcount;
};

Here we have the list head, defined using the LIST_HEAD macro, one lock on the list and a reference counter.

The List

The list declaration will expand in the declaration of a structure. The first argument to LIST_HEAD is the structure name. This name does not really matter; we use the name of the field in struct irix_share_group, as this is more convenient. The second argument does matter; it is the type of the list elements. Here we specify that our list is made of struct irix_emuldata elements. (Note that 'struct' is omitted. The macro will add it.)

In the struct irix_emuldata, we have one field that holds the list pointers. The declaration is done with another macro:

LIST_ENTRY(irix_emuldata) ied_sglist;

The Lock

The lock is used because several processes could try to manipulate the list at once. This can of course happen on a multiprocessor system, where the two processors may be in kernel mode at the same time, trying to manipulate the same kernel structure. It can also happen on a uniprocessor system, too. Imagine we have to walk the share group list to apply some collective operation. For some reason this operation may cause the process to sleep, awaiting some particular event. During the sleep, other processes will be scheduled, and they may try to modify the list too. When our first process wakes up, it may be working on a list element that no longer exists, which is quite likely to cause a panic.

The solution to this is to use a lock. Locks are manipulated through the lockmgr(9) function. Simplest uses are to get a shared or exclusive lock and to release a lock. Here is an example of lockmgr(9) use:

(void)lockmgr(&lock, LK_SHARED, NULL);
/* Do something */
(void)lockmgr(&lock, LK_RELEASE, NULL);

We use shared locks when we want to read the list. Several processes may be doing this at once. They will all ask for a shared lock and get it.

Exclusive locks are requested when we want a process to modify the list. Upon exclusive lock request, lockmgr(9) will sleep until all shared locks are released, which guarantees that no reading process is still reading. Once exclusive lock is acquired by one process, any other process calling lockmgr(9) for a shared or exlcusive lock will go to sleep, thus ensuring that no process can read while our process is writing.

The Reference Count

The reference count enables us to track the irix_share_group structure usage. When a process is added to the list, we increase the count. When a process is released, we decrease the count. When the count hits zero, we can free the structure, since nobody is using it anymore.

One of the first jobs of irix_sproc() is to check if the irix_share_group structure has already been allocated. If this is not the case, the structure is allocated and the calling process is added to the list. The child process will be added later, as we cannot add it before it is actually created.

Parent and Child Life in the Kernel

fork1() has an argument named func which is used to specify the child entry point. This can be a NULL pointer if we want to return immediately to userland, or it can be the address of a kernel function used to perform some custom tasks before returning to userland. Since we need to arrange the child environment and to add it to the share group list, we use this feature. The child entry function is called irix_sproc_child() and is defined in sys/compat/irix/irix_prctl.c.

It is worth mentioning that the child entry function must be a kernel function. Passing a userland address here will cause a panic. To return to a custom userland address after fork1(9), the solution is to specify a kernel child entry function and to modify the child registers saved onto the stack in that function so that, on return to userland, the process will execute our userland function.

While executing irix_sproc_child(), we need some information from the parent. (The kernel stores all fundamental information about a process in the proc structure, so we read that.) This can lead to some problems if we just let the parent exit irix_sproc() while irix_sproc_child() is executing. If the parent returns to userland and catches a deadly signal, our child can be left manipulating a stale proc structure. This is very likely to cause a panic.

This situation can arise easily in a multiprocessor system where the second CPU will execute the child while the first CPU executes the parent. But it can happen on a uniprocessor system as well. If the child sleeps awaiting some resource, the kernel may schedule the parent to run and resume the child later.

The solution is to force the parent to sleep on a flag and make the child awaken the parent when it will no longer be fatal for the child to be orphaned. The address of the flag is passed to the child using the arg argument to fork1(9). This argument is a pointer that gets passed to the child entry function. For irix_sproc_child(), we defined it as a pointer to the irix_sproc_child_args structure (defined in sys/compat/irix/irix_prctl.c):

struct irix_sproc_child_args {
        struct proc **isc_proc;
        void *isc_entry;
        void *isc_arg;
        size_t isc_len;
        int isc_inh;
        struct proc *isc_parent;
        struct irix_share_group *isc_share_group;
        int isc_child_done;
};

The structure is allocated in irix_sproc() and is used to transmit a lot of information between the parent and the child: the child proc structure pointer (isc_proc), the entry, arg, len, and inh arguments that were given to the sproc(2) system call (isc_entry, isc_arg, isc_len, and isc_inh), the parent proc structure address (isc_parent), the share group structure address (isc_share_group), and the flag on which the parent is sleeping, awaiting for the child to finish building itself (isc_child_done).

Now that you have the whole picture, let us move to something even more interesting. irix_sproc_child() needs to set up the stack and registers just like IRIX does. Of course this is not documented anywhere, and this is why things are getting more interesting. Time has come to reverse engineer IRIX's sproc(2) implementation. But this will come in the next part, so stay tuned.

Emmanuel Dreyfus is a system and network administrator in Paris, France, and is currently a developer for NetBSD.


Return to the BSD DevCenter.

Copyright © 2009 O'Reilly Media, Inc.