BSD DevCenter

IRIX Binary Compatibility, Part 6

The IRIX Process Data Area (PRDA)

The first bug appeared when trying to run the IRIX version of Photoshop. It manifested itself as an unexpected SIGSEGV, which usually deeply depresses me. Tracing which bug in the emulation subsystem caused a segmentation fault can be extremely difficult, since the emulation inconsistency can be quite far away from the segmentation fault. It is not easy, but at least we can try.

Here is the kernel trace before the segmentation fault, obtained with ktrace:

 2811 AdobePhotoshop3. CALL  access(0x110437f8,0x3)
 2811 AdobePhotoshop3. NAMI  "/emul/irix/var/tmp"
 2811 AdobePhotoshop3. NAMI  "/var/tmp"
 2811 AdobePhotoshop3. RET   access 0
 2811 AdobePhotoshop3. CALL  getpid
 2811 AdobePhotoshop3. RET   getpid 2811/0xafb
 2811 AdobePhotoshop3. CALL  access(0x110437f8,0)
 2811 AdobePhotoshop3. NAMI  "/emul/irix/var/tmp/photoAAAa000fv"
 2811 AdobePhotoshop3. NAMI  "/var/tmp/photoAAAa000fv"
 2811 AdobePhotoshop3. RET   access -1 errno 2 No such file or directory
 2811 AdobePhotoshop3. PSIG  SIGSEGV caught handler=0x41444c4 mask=(13,20) 

Because the access system call just returns 0 or -1, I could not see any way of returning a result that could cause a segmentation fault. It had to be caused by something else.

Using gdb, we can disassemble the code leading to the SIGSEGV delivery:

Program received signal SIGSEGV, Segmentation fault.
0xfb06420 in ?? ()
(gdb) x/4i $pc-16
0xfb06414:      sw      $s1,40($sp)
0xfb06418:      sw      $s0,36($sp)
0xfb0641c:      lui     $t6,0x20
0xfb06420:      lw      $t7,3584($t6)
(gdb) info reg
         zero       at       v0       v1       a0       a1       a2       a3
 R0  00000000 00000000 11042eb8 0000024a 11042eb8 00000000 0000002d 00000001 
           t0       t1       t2       t3       t4       t5       t6       t7
 R8  0fb4f9c0 0fb4f9c0 00870000 00000000 00000000 0fb063f0 00200000 00000000 

The instruction that caused the exception is lw $t7,3584($t6). It was supposed to load the word at address T6+3584 (that is, T6+0xe00) into T7. Given that T6 is 0x200000, we end up accessing 0x200e00, where no memory is mapped, hence the SIGSEGV.

Reading the code, it is interesting to note that T6 has just been initialized with a constant value: lui stands for load upper immediate. It places a 16-bit constant in the upper half of a register and clears the lower half, hence lui $t6,0x20 just caused T6 to be filled with 0x200000. This attempt to read data at 0x200e00 is not a consequence of some data handed out by the emulation subsystem.
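The arithmetic can be double-checked with a few lines of C. This is only a sketch of the usual MIPS semantics (lui puts its immediate in bits 31..16, lw adds a sign-extended 16-bit offset to the base register); the function names are made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Effective address of "lui reg,imm" followed by "lw x,off(reg)". */
uint32_t lui_lw_address(uint16_t imm, int16_t off)
{
    uint32_t base = (uint32_t)imm << 16;    /* lui: imm goes to bits 31..16 */

    return base + (uint32_t)(int32_t)off;   /* lw: add sign-extended offset */
}

/* The pair seen in the disassembly: lui $t6,0x20 ; lw $t7,3584($t6) */
void print_fault_address(void)
{
    printf("0x%x\n", (unsigned)lui_lw_address(0x20, 3584));
}
```

Calling print_fault_address() prints 0x200e00, the address of the faulting access.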

The idea that a page of memory should magically be mapped at that address seems a bit odd, but it is worth writing a program to check. Here is code to do the job:

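/* magicpage.c -- check for a mapping at 0x200000 */
#include <stdio.h>
#include <signal.h>

char *checkpoint = (char *)0x200000;

int main(int argc, char **argv) {
    volatile char c;    /* volatile, so the read is not optimized away */
    int sgn = 1;

    /* One extra argument makes the scan walk toward lower addresses. */
    if (argc == 2)
        sgn = -1;

    /* Read one byte per page until an unmapped page raises SIGSEGV. */
    while (1) {
        printf("trying %p\n", checkpoint);
        c = *checkpoint;
        checkpoint = (char *)
            ((unsigned long)checkpoint + (sgn * 0x1000));
    }
    return 0;
}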

With this program and another version that tries descending addresses, we confirm that the IRIX kernel maps one page of memory at 0x200000 on program startup.
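A variant of this check can survive the fault instead of dying on the first unmapped page. The sketch below is not from the original test program: it assumes a POSIX system with sigaction(2) and sigsetjmp(3), and the name probe_readable() is invented here. It installs a temporary handler, tries to read one byte, and reports whether the read faulted.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>

static sigjmp_buf probe_env;

/* Jump back into probe_readable() when the read faults. */
static void probe_handler(int signo)
{
    (void)signo;
    siglongjmp(probe_env, 1);
}

/* Return 1 if one byte at addr can be read, 0 if the read faults. */
int probe_readable(const volatile char *addr)
{
    struct sigaction sa, old_segv, old_bus;
    volatile int ok = 0;

    sa.sa_handler = probe_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);

    /* The second argument saves the signal mask for siglongjmp(). */
    if (sigsetjmp(probe_env, 1) == 0) {
        char c = *addr;     /* faults if no readable page is mapped */
        (void)c;
        ok = 1;
    }

    /* Put the previous handlers back. */
    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    return ok;
}
```

With such a helper, a loop over page-aligned addresses could print which pages around 0x200000 are mapped without the process being killed.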

The next step is to dump this area and to have the NetBSD kernel prepare it just like the IRIX kernel does. This could have been a difficult job, but fortunately the "magic page" is documented. I have to thank Chuck Silvers for pointing this out to me. We had an email exchange about how to handle some issues related to the "magic page", and Chuck called it the PRDA. I asked him why he used this name, and Chuck told me this was in the sproc(2) man page.

The sproc(2) man page explains that when you create a thread sharing the virtual memory space with the parent, everything is shared except the Process Data Area (PRDA). A reference to <sys/prctl.h> is also given to get more information about the PRDA. In this header file, we can find the actual definition of the structures contained in the PRDA, making it much easier to emulate.

The real problem now is to implement this feature correctly on NetBSD. Mapping and filling the PRDA at process creation time is easy, but having our sproc(2) emulation share the whole virtual memory space except for the PRDA is more difficult. We will cover this in the next section.

Private Mappings in Share Groups

We have to handle shared virtual memory spaces that contain one private page. In fact, we have to handle potentially multiple private pages, since the PRDA is not the only situation where this property is needed. In the IRIX mmap(2) man page, we can see that there is a MAP_LOCAL option used to make a private mapping within the share group shared virtual memory space.

Sharing the whole virtual memory space is simple: there is a field in the proc structure called p_vmspace (defined in <sys/proc.h>). This field is a pointer to a struct vmspace (defined in <uvm/uvm_extern.h>), which describes a process virtual address space. When we want to share the whole virtual address space, we share the same struct vmspace among different processes.

The struct vmspace contains a substructure called vm_map (defined in <uvm/uvm_map.h>), which in turn contains the list of map entries describing the various mappings in the process virtual address space. To share only some pages, we must have different vmspace structures with different lists of mappings. Where they describe shared regions, the map entries linked in the lists will be the same for all the processes in the share group. For private regions, each process will have its own map entries.

The real problem, since we have different lists for each process, is keeping the lists synchronized. If a process modifies the mapping in a shared region, the modification must be visible to all other processes. What are the ways of modifying the virtual address space mappings?

Through system calls, virtual memory mappings are affected by mmap(2), munmap(2), break(2), shmsys(2), mprotect(2), plock(2), mpin(2), munpin(2), memcntl(2), ptrace(2), and syssgi(2) commands such as SGI_ELFMAP (see part 3 of this series to learn about SGI_ELFMAP). For each of these system calls, if the operation is successful, the map entry list must be kept in sync.

Additionally, memory mappings can be affected by page fault handling. We have to handle these as well, and keep the map entries in sync within the share group each time a member takes a page fault. This is a first for the NetBSD emulation subsystem, which has only been used to emulate system calls and signal handling so far.

We therefore had to add per-emulation page fault handling to the emulation subsystem. The goal is exactly the same as for system call emulation: emulation-independent code has hooks to handle emulation-specific behavior. A hook is usually a pointer in struct emul, which points to an emulation-specific function or piece of data for each emulation. As an example, here is how the MIPS system call handler handles error codes (sys/arch/mips/mips/syscall.c:syscall_plain()):

if (p->p_emul->e_errno)
    error = p->p_emul->e_errno[error];

p is a pointer to the current process, and error is the error code that the kernel wants to return to userland. e_errno is a field in struct emul that points to an emulation-dependent array, which defines the translation from native NetBSD error codes to emulated error codes.
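The table lookup can be illustrated in userland with a cut-down model. Everything in this sketch is hypothetical: struct toy_emul only mimics the e_errno field, and the table values are invented, not real IRIX error numbers.

```c
#include <stddef.h>

/* Hypothetical, cut-down stand-in for struct emul: only the errno table. */
struct toy_emul {
    const int *e_errno;     /* NULL when no translation is needed */
};

/* Made-up translation table, indexed by native error code. */
const int toy_foreign_errno[] = { 0, 1, 2, 103, 104 };

/*
 * Translate a native error code for the given emulation.
 * The caller must keep error within the bounds of the table.
 */
int toy_translate_errno(const struct toy_emul *e, int error)
{
    if (e->e_errno)
        error = e->e_errno[error];
    return error;
}
```

A native process would use an emulation whose e_errno is NULL and get its error codes back unchanged; an emulated process gets the value stored in its table.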

Let us examine trap handling now. We introduced an e_fault field to struct emul which points to a function responsible for trap handling. Native NetBSD processes will want to use uvm_fault() and IRIX will want to use irix_vm_fault(), which is implemented in sys/compat/irix/irix_prctl.c. In the MIPS trap handler (sys/arch/mips/mips/trap.c), we have:

if (p->p_emul->e_fault)
	rv = (*p->p_emul->e_fault)(p, va, 0, ftype);
else
	rv = uvm_fault(map, va, 0, ftype);

It is worth mentioning that the prototype of irix_vm_fault() differs a bit from that of uvm_fault():

int irix_vm_fault __P((struct proc *, vaddr_t, vm_fault_t, vm_prot_t));
int uvm_fault __P((struct vm_map *, vaddr_t, vm_fault_t, vm_prot_t));

This is because we need the struct proc pointer in irix_vm_fault(), and changing the uvm_fault() prototype would have required invasive changes to NetBSD's virtual memory subsystem. Because there is no strong requirement to have the same prototype, we used a slightly different one. The struct vm_map pointer can easily be derived from struct proc (it is just &p->p_vmspace->vm_map), so the struct proc pointer can simply replace the vm_map argument in the irix_vm_fault() prototype.
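The shape of this hook can be modeled outside the kernel. The following is a simplified sketch, not the NetBSD code: all toy_ names are invented, the fault type and protection arguments are dropped, and the assumption that the sync only happens after a successful fault is mine.

```c
#include <stddef.h>

struct toy_map { int faults; };                 /* stand-in for struct vm_map */
struct toy_vmspace { struct toy_map vm_map; };  /* map embedded in the vmspace */
struct toy_proc;
typedef int (*toy_fault_fn)(struct toy_proc *, unsigned long);
struct toy_emul { toy_fault_fn e_fault; };      /* NULL for native processes */
struct toy_proc {
    struct toy_vmspace *p_vmspace;
    const struct toy_emul *p_emul;
    int syncs;                                  /* counts share group syncs */
};

/* Native handler: works directly on a map. */
int toy_uvm_fault(struct toy_map *map, unsigned long va)
{
    (void)va;
    map->faults++;
    return 0;
}

/* Emulation handler: takes the process, derives the map, then syncs. */
int toy_irix_vm_fault(struct toy_proc *p, unsigned long va)
{
    int rv = toy_uvm_fault(&p->p_vmspace->vm_map, va);

    if (rv == 0)        /* assumption: sync only after a successful fault */
        p->syncs++;
    return rv;
}

/* Trap handler: use the emulation hook if there is one. */
int toy_trap(struct toy_proc *p, unsigned long va)
{
    if (p->p_emul->e_fault)
        return (*p->p_emul->e_fault)(p, va);
    return toy_uvm_fault(&p->p_vmspace->vm_map, va);
}
```

Because the hook receives the process rather than the map, the emulation handler has access to whatever per-process data it needs to find the other members of the share group.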

irix_vm_fault() just calls uvm_fault() and then synchronizes the virtual address space mappings across the share group. The next point is how the actual sync is done. Whether requested from a memory-related system call or from irix_vm_fault(), the sync is performed by the irix_vm_sync() function (implemented in sys/compat/irix/irix_prctl.c).

irix_vm_sync() takes the struct proc pointer of the modified process as an argument. For each process in the share group, it unmaps shared regions and remaps them as in the modified process. The unmapping is done by uvm_unmap(9), and the copy of the modified process mapping is done by uvm_map_extract(9).
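The effect of such a sync can be shown with a toy model in which an address space is just a small array of page slots. This is a sketch under heavy simplification: the toy_ names are invented, a plain assignment stands in for the unmap and remap steps, and the shared/private information is passed as a simple per-page flag array.

```c
#include <stddef.h>

#define TOY_NPAGES 8

/* One share group member: which object is mapped at each page (0 = none). */
struct toy_member {
    int mapping[TOY_NPAGES];
};

/*
 * Propagate the shared part of the modified member's address space to
 * every other member. shared[i] != 0 means page i is shared in the
 * group; private pages of the other members are left untouched.
 */
void toy_vm_sync(const struct toy_member *modified,
    struct toy_member **group, size_t nmembers,
    const int shared[TOY_NPAGES])
{
    size_t m, i;

    for (m = 0; m < nmembers; m++) {
        if (group[m] == modified)
            continue;           /* nothing to do for the source itself */
        for (i = 0; i < TOY_NPAGES; i++) {
            if (shared[i])      /* "unmap and remap as in modified" */
                group[m]->mapping[i] = modified->mapping[i];
        }
    }
}
```

After a call, every member sees the same mappings as the modified member on shared pages, while each keeps its own content on private pages such as the PRDA.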

We have to keep track of which regions are shared and which are private. This information could have been added to the VM map entries, but adding emulation-specific data there is not good practice. We therefore use a linked list whose head is in struct emuldata:

LIST_HEAD(ied_shared_regions, irix_shared_regions_rec)
    ied_shared_regions; /* list of (un)shared memory regions */

The irix_shared_regions_rec structure is defined in sys/compat/irix/irix_prctl.h like this:

struct irix_shared_regions_rec {
    vaddr_t isrr_start;
    vsize_t isrr_len;
    int     isrr_shared;    /* shared or not shared */
    LIST_ENTRY(irix_shared_regions_rec) isrr_list;
};

This list is modified when we create the PRDA and when the MAP_LOCAL option is requested in irix_sys_mmap(). When we modify the list, we have to compute region intersections to avoid having information about the same region twice. This is done through irix_isrr_insert() in sys/compat/irix/irix_prctl.c.

There is a debug function in the same file, irix_isrr_debug(), which is useful for checking what happens to the list. If the kernel is built with the DEBUG_IRIX option, this function is called each time the list is modified. Here is some output from the debug function, which helps to explain what is going on:

At process creation time:

isrr for pid 233
  addr = 0x0, len = 0x80000000, shared = 1

After the PRDA is created:

isrr for pid 233
  addr = 0x0, len = 0x200000, shared = 1
  addr = 0x200000, len = 0x1000, shared = 0
  addr = 0x201000, len = 0x7fdff000, shared = 1

After mapping some data with MAP_LOCAL:

isrr for pid 233
  addr = 0x0, len = 0x200000, shared = 1
  addr = 0x200000, len = 0x1000, shared = 0
  addr = 0x201000, len = 0x2fe0f000, shared = 1
  addr = 0x30010000, len = 0x5000, shared = 0
  addr = 0x30015000, len = 0x4ffeb000, shared = 1
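The splitting visible in this output can be reproduced with a small userland model. This is a sketch, not the code of irix_isrr_insert(): the struct and function names are invented, a hand-rolled singly linked list replaces the LIST macros, and adjacent regions with the same flag are not merged.

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy stand-in for irix_shared_regions_rec; all names are hypothetical. */
struct region {
    unsigned long start;
    unsigned long len;
    int shared;
    struct region *next;
};

static struct region *region_new(unsigned long start, unsigned long len,
    int shared)
{
    struct region *r = malloc(sizeof(*r));

    if (r == NULL)
        abort();
    r->start = start;
    r->len = len;
    r->shared = shared;
    r->next = NULL;
    return r;
}

/*
 * Record [start, start + len) with the given flag in a sorted list and
 * return the new head. Existing regions that overlap the new one are
 * trimmed or split around it.
 */
struct region *region_insert(struct region *head, unsigned long start,
    unsigned long len, int shared)
{
    unsigned long end = start + len;
    struct region *out = NULL, **tail = &out, *r, *next;
    int placed = 0;

    for (r = head; r != NULL; r = next) {
        unsigned long rstart = r->start, rend = r->start + r->len;

        next = r->next;
        if (rstart < start) {           /* piece left of the new region */
            unsigned long e = rend < start ? rend : start;
            *tail = region_new(rstart, e - rstart, r->shared);
            tail = &(*tail)->next;
        }
        if (!placed && rend > start) {  /* the new region itself */
            *tail = region_new(start, len, shared);
            tail = &(*tail)->next;
            placed = 1;
        }
        if (rend > end) {               /* piece right of the new region */
            unsigned long s = rstart > end ? rstart : end;
            *tail = region_new(s, rend - s, r->shared);
            tail = &(*tail)->next;
        }
        free(r);
    }
    if (!placed)                        /* new region lies past the end */
        *tail = region_new(start, len, shared);
    return out;
}

/* Print the list in the same layout as the debug output. */
void region_dump(const struct region *head)
{
    for (; head != NULL; head = head->next)
        printf("  addr = 0x%lx, len = 0x%lx, shared = %d\n",
            head->start, head->len, head->shared);
}
```

Starting from a single shared region (0x0, 0x80000000), inserting a private region (0x200000, 0x1000) and then another one (0x30010000, 0x5000) yields the same three-entry and five-entry lists as the debug output above.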

With information from this list, irix_vm_sync() is able to perform the sync, sharing mappings only for shared regions. The code is getting quite complicated, but we are getting close to genuine IRIX behavior. irix_vm_sync() now has to compute the intersection of shared and private regions across the share group, and synchronize the memory mappings accordingly.

We end up with quite a horrible piece of code. On a plain page fault, we have to walk through multiple linked lists, which hurts performance. But we have no choice: we want to emulate an odd feature, so we get odd code.

In the next part, if I am courageous enough to write about it, we will look at the emulation of an IRIX pseudo-device driver that is used to implement pollable semaphores. This includes reverse engineering the driver entry points, since almost no documentation is available about it, and, of course, implementing the driver in NetBSD.

Emmanuel Dreyfus is a system and network administrator in Paris, France, and is currently a developer for NetBSD.
