ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)


IRIX Binary Compatibility, Part 6

by Emmanuel Dreyfus
04/03/2003

In a previous article, we studied the IRIX threading model, focusing on how it was possible to emulate it on NetBSD. We now have a good idea of how to launch a native thread on NetBSD, but we still have to discover undocumented IRIX secrets such as the stack layout and the register setup when the native thread is launched by the IRIX kernel. To discover this, we will reverse engineer sproc(2).

The end of this part is about the emulation of IRIX oddities called share groups. We have to play a bit more than usual with the NetBSD virtual memory subsystem in order to get the work done. Working on IRIX turns into an adventure.

Reverse Engineering sproc(2)

We want the CPU register and stack setup of a process just created by sproc(2) on IRIX. We already faced this kind of situation when we had to discover the initial stack and register setup on program startup. If you have forgotten how we handled this situation, please go back to part 2 of this series.

Of course the first idea is to use the same trick: if we use gdb to break at the beginning of the entry function in userland, we will be able to dump the stack and registers. We could imagine that we would set the breakpoint before entering sproc(2) and then continue to see the break in the child after the sproc(2) call.

Things are a bit more complicated now: we would like to set a breakpoint in the child process before it has even been created. This is not possible.

There is a technique for handling this kind of problem, which is to prepare an infinite empty loop at the beginning of the entry function. That way the child process gets caught on userland return, and we can attach gdb to it while it is running. We can see that with the following sample program:

/* sprocchild.c -- A sproc child test program */
#include <stdio.h>
#include <sys/types.h>
#include <sys/prctl.h>

void entry(void *);

int main(void) {
    pid_t pid;

    pid = sproc(entry, PR_SADDR, (void *)0x42534400);
    printf("parent: sproc() returned %d\n", pid);
    return 0;
}

void entry(void *args) {
    while(1); /* infinite loop */

    printf("child: args = %p\n", args);
    return; 
}

Note we gave the arg argument a funky value so that we can easily recognise it later. Everything is ready; let's start the game!

$ gdb ./sprocchild
(gdb) b sproc
Breakpoint 1 at 0x400aec
(gdb) r       
Starting program: ./sprocchild 
Breakpoint 1 at 0xfa5c0e0: file sproc.s, line 53. 

Breakpoint 1, _sproc () at sproc.s:58
58      sproc.s: No such file or directory.
Current language:  auto; currently asm
(gdb) show reg
(gdb) info reg   
        zero       at       v0       v1       a0       a1       a2       a3
 R0  00000000 100040b8 00000004 00000000 00400cc0 00000040 42534400 7fff2f7c 
(snip)

In registers A0 to A2, we find the arguments to sproc(). The entry function is at 0x400cc0, the inh flag is 0x40, and we recognize our arg argument: 0x42534400. Let's explore the entry function:

(gdb) x/10i $a0
0x400cc0 <entry>:       lui     $gp,0xfc1
0x400cc4 <entry+4>:     addiu   $gp,$gp,-19600
0x400cc8 <entry+8>:     addu    $gp,$gp,$t9
0x400ccc <entry+12>:    addiu   $sp,$sp,-32
0x400cd0 <entry+16>:    sw      $ra,28($sp)
0x400cd4 <entry+20>:    sw      $gp,24($sp)
0x400cd8 <entry+24>:    sw      $a0,32($sp)
0x400cdc <entry+28>:    b       0x400cdc <entry+28>
0x400ce0 <entry+32>:    nop

At 0x400cdc we have our infinite loop: a jump (b stands for the MIPS branch instruction) that loops to the current address. Let us remember this address and move forward. We are looking for the system call. Where is it?

 (gdb) x/4i $pc
0xfa5c0e0 <_sproc+20>:  b       0xfa5c138 <_nsproc+24>
0xfa5c0e4 <_sproc+24>:  li      $t0,1129
0xfa5c0e8 <_sprocsp>:   lui     $gp,0x10
0xfa5c0ec <_sprocsp+4>: addiu   $gp,$gp,-15880

No system call here, just a jump to another place. Obviously we are in a libc stub. We want to find the system call itself. Using the si (stands for stepi) command, we execute the branch instruction, and have a look at the destination:

 (gdb) si  
_nsproc () at sproc.s:110
110     in sproc.s
(gdb) x/400i $pc
0xfa5c138 <_nsproc+24>: lw      $t9,-31396($gp)
0xfa5c13c <_nsproc+28>: sw      $s0,56($sp)
0xfa5c140 <_nsproc+32>: sw      $s1,52($sp)
(snip)
0xfa5c1f4 <_nsproc+212>:        move    $s2,$a0
0xfa5c1f8 <_nsproc+216>:        lw      $v0,32($sp)
0xfa5c1fc <_nsproc+220>:        syscall
(snip)

This is probably what we are looking for. Now we can try to break at 0xfa5c1fc and check if we do get there or not.

 (gdb) b *0xfa5c1fc
Breakpoint 2 at 0xfa5c1fc: file sproc.s, line 155.
(gdb) c
Continuing.

Breakpoint 2, _nsproc () at sproc.s:155
155     in sproc.s
(gdb) info reg
        zero       at       v0       v1       a0       a1       a2       a3
 R0  00000000 0fb4f3d8 00000469 000000c9 0fa5c270 00000040 42534400 7fff2f7c 
(snip)

We got it! Maybe you remember that on the MIPS, the V0 register holds the system call number. IRIX system calls have an offset of 1000, so 0x469 (1129) is system call number 129, also known as sproc (remember, these are listed in IRIX's /usr/include/sys.s and in NetBSD's sys/compat/irix/syscalls.master).

The second and third arguments to sproc have been left untouched, but the libc stub changed the pointer to the entry function. It was 0x400cc0 when we entered the libc stub, and it is now 0x0fa5c270. Where is this going?

 (gdb) x/4i $a0
0xfa5c270 <_nsproc+336>:        lui     $gp,0x10
0xfa5c274 <_nsproc+340>:        addiu   $gp,$gp,-16272
0xfa5c278 <_nsproc+344>:        addu    $gp,$gp,$s2
0xfa5c27c <_nsproc+348>:        lw      $t9,-29592($gp)

In fact, the libc stub requests the sproc(2) system call to return to another part of the stub. It will probably jump to the entry function at 0x400cc0; since the stack and registers may have changed in the meantime, we do not want to follow this path. No problem, we just have to change A0 to go directly to our infinite loop at 0x400cdc.

 (gdb) set $a0=0x400cdc
(gdb) info reg
        zero       at       v0       v1       a0       a1       a2       a3
 R0  00000000 0fb4f3d8 00000469 000000c9 00400cdc 00000040 42534400 7fff2f7c
(snip)
(gdb) c
Continuing.
parent: sproc() returned 761096

Program exited normally.

Our program went into the sproc() system call and then followed its normal code path and exited, after giving us the child PID. Now the child should be hung in the infinite loop, waiting for us to attach to it with gdb.

 (gdb) attach 761096
Attaching to program `./sprocchild', process 761096
Retry #1:
Retry #2:
Retry #3:
Retry #4:
[New Process 761096]
Symbols already loaded for /usr/lib/libc.so.1
entry () at sprocchild.c:16
16              while(1);
Current language:  auto; currently c
(gdb) x/3i $pc
0x400cdc <entry+28>:    b       0x400cdc <entry+28>
0x400ce0 <entry+32>:    nop
0x400ce4 <entry+36>:    nop

The child hung here just after the return to userland, so we have the virgin CPU registers and stack exactly as the kernel just prepared them. This is wonderful.

 (gdb) info reg
        zero       at       v0       v1       a0       a1       a2       a3
 R0  00000000 fffffffe 00000000 00000001 42534400 00000000 00000000 00000000 
          t0       t1       t2       t3       t4       t5       t6       t7
 R8  00000000 00000000 00000000 00000000 00000001 0000000b 00000001 ffffffff 
          s0       s1       s2       s3       s4       s5       s6       s7
 R16 00400cc0 00000040 0fa5c270 00000001 00000000 00000000 00000000 00000000 
          t8       t9       k0       k1       gp       sp       fp       ra
 R24 00000000 00000000 00000000 00000001 0fb582e0 7bff7fc0 00000000 00000000 
          pc    cause      bad       hi       lo      fsr      fir
     00400cdc 80008000 00000000 00000009 00000001 00000000 00000000 
(gdb) x/20w $sp-16
0x7bff7fb0:     0x00000000      0x00000000      0x00000000      0x00000000
0x7bff7fc0:     0x00000000      0x00000000      0x00000000      0x00000000
0x7bff7fd0:     0x00000000      0x00000000      0x00000000      0x00000000
0x7bff7fe0:     0x00000000      0x00000000      0x00000000      0x00000000
0x7bff7ff0:     0x00000000      0x00000000      0x00000000      0x00000000

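At least the stack setup will not be difficult to emulate. On the register front, the dump shows what irix_sproc_child() must prepare: PC holds the entry address passed to the system call, A0 holds the arg argument (0x42534400), SP points into the child's new stack, and RA is zero.

Other values seem meaningless; they are equal to the registers' values in the parent or set to zero.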

irix_sproc_child() uses the registers saved on the trap frame to set up the register values. We already saw how this works in part 4 of this series, when we studied signal delivery emulation. Here is a code snippet from irix_sproc_child() that does this.

struct frame *tf = (struct frame *)p2->p_md.md_regs;

tf->f_regs[PC] = (unsigned long)isc->isc_entry;
tf->f_regs[RA] = 0;

The last job of irix_sproc_child() is to map the new process stack. Once everything is done, the parent awakens, and the child_return() function is called to return to userland. The trap machinery will restore the register values we prepared in the trap frame.

This implementation led to a fair emulation of sproc(2); however, some bugs are awaiting us at the next stage, in the Process Data Area.

The IRIX Process Data Area (PRDA)

The first bug appeared when trying to run the IRIX version of Photoshop. It manifested itself as an unexpected SIGSEGV, which usually deeply depresses me. Tracing which bug in the emulation subsystem caused a segmentation fault can be extremely difficult, since the emulation inconsistency can be quite far away from the segmentation fault. It is not easy, but at least we can try.

Here is the kernel trace before the segmentation fault, obtained with ktrace:

 2811 AdobePhotoshop3. CALL  access(0x110437f8,0x3)
 2811 AdobePhotoshop3. NAMI  "/emul/irix/var/tmp"
 2811 AdobePhotoshop3. NAMI  "/var/tmp"
 2811 AdobePhotoshop3. RET   access 0
 2811 AdobePhotoshop3. CALL  getpid
 2811 AdobePhotoshop3. RET   getpid 2811/0xafb
 2811 AdobePhotoshop3. CALL  access(0x110437f8,0)
 2811 AdobePhotoshop3. NAMI  "/emul/irix/var/tmp/photoAAAa000fv"
 2811 AdobePhotoshop3. NAMI  "/var/tmp/photoAAAa000fv"
 2811 AdobePhotoshop3. RET   access -1 errno 2 No such file or directory
 2811 AdobePhotoshop3. PSIG  SIGSEGV caught handler=0x41444c4 mask=(13,20) code=0x200e00

Because the access system call just returns 0 or -1, I could not see any way of returning a result that could cause a segmentation fault. It had to be caused by something else.

Using gdb, we can disassemble the code leading to the SIGSEGV delivery:

Program received signal SIGSEGV, Segmentation fault.
0xfb06420 in ?? ()
(gdb) x/4i $pc-16
0xfb06414:      sw      $s1,40($sp)
0xfb06418:      sw      $s0,36($sp)
0xfb0641c:      lui     $t6,0x20
0xfb06420:      lw      $t7,3584($t6)
(gdb) info reg
         zero       at       v0       v1       a0       a1       a2       a3
 R0  00000000 00000000 11042eb8 0000024a 11042eb8 00000000 0000002d 00000001 
           t0       t1       t2       t3       t4       t5       t6       t7
 R8  0fb4f9c0 0fb4f9c0 00870000 00000000 00000000 0fb063f0 00200000 00000000 
(snip)

The instruction that caused the exception is lw $t7,3584($t6). It was supposed to load the word at address T6+3584 (3584 is 0xe00 in hexadecimal) into T7. Given that T6 is 0x200000, we end up accessing 0x200e00, where no memory is mapped, hence the SIGSEGV.

Reading the code, it is interesting to note that T6 has just been initialized with a constant value: lui stands for load upper immediate, which places a 16-bit constant in the upper half of the register and clears the lower half, hence lui $t6,0x20 just caused T6 to be filled with 0x200000. This attempt to read data at 0x200e00 is not a consequence of some data handed out by the emulation subsystem.

The idea that a page of memory should magically be mapped at that address seems a bit odd, but it is worth writing a program to check it. Here is code to do the job:

/* magicpage.c -- check for a mapping at 0x200000 */
#include <stdio.h>
#include <signal.h>

char *checkpoint = (char *)0x200000;

int main(int argc, char **argv) {
    char c;
    int sgn = 1;

    if (argc == 2)
        sgn = -1;

    while (1) {
        printf("trying %p\n", checkpoint);
        c = *checkpoint;
        printf("ok\n");
        checkpoint = (char *)
            ((unsigned long)checkpoint + (sgn * 0x1000));
    }
    return 0;
}

Running this program once without arguments (ascending addresses) and once with an argument (descending addresses), we confirm that the IRIX kernel maps one page of memory at 0x200000 on program startup.

The next step is to dump this area and to have the NetBSD kernel prepare it just like the IRIX kernel does. This could have been a difficult job, but fortunately the "magic page" is documented. I have to thank Chuck Silvers for pointing this out to me. We had an email exchange about how to handle some issues related to the "magic page", and Chuck called it the PRDA. I asked him why he used this name, and Chuck told me this was in the sproc(2) man page.

The sproc(2) man page explains that when you create a thread sharing the virtual memory space with the parent, everything is shared except the Process Data Area (PRDA). A reference to <sys/prctl.h> is also given to get more information about the PRDA. In this header file, we can find the actual definition of the structures contained in the PRDA, making it much easier to emulate.

The real problem now is to implement this feature correctly on NetBSD. Mapping and filling the PRDA at process creation time is easy, but having our sproc(2) emulation share the whole virtual memory space except for the PRDA is more difficult. We will cover this in the next section.

Private Mappings in Share Groups

We have to handle shared virtual memory spaces that contain one private page. In fact, we potentially have to handle multiple private pages, since the PRDA is not the only situation where this property is needed. In the IRIX mmap(2) man page, we can see that there is a MAP_LOCAL option used to make a private mapping within the share group's shared virtual memory space.

Sharing the whole virtual memory space is simple: there is a field in the proc structure called p_vmspace (this is defined in <sys/proc.h>). This field is a pointer to a struct vmspace (defined in <uvm/uvm_extern.h>) which describes a process virtual address space. When we want to share the whole virtual address space, we share the same struct vmspace among different processes.

The struct vmspace contains a substructure called vm_map (defined in <uvm/uvm_map.h>) which in turn contains the list of map entries, describing the various mappings in the process virtual address space. To share only some pages, we must have different vmspace structures with different lists of mappings. The map entries linked in the lists will be the same for all the processes in the share group when they describe the shared regions. For private regions, each process will have its own map entries.

The real problem, since we have different lists for each process, is keeping the lists synchronized. If a process modifies the mapping in a shared region, the modification must be visible to all other processes. What are the ways of modifying the virtual address space mappings?

Through system calls, virtual memory mappings are affected by mmap(2), munmap(2), break(2), shmsys(2), mprotect(2), plock(2), mpin(2), munpin(2), memcntl(2), ptrace(2), and syssgi(2) commands such as SGI_ELFMAP (see part 3 of this series to learn about SGI_ELFMAP). For each of these system calls, if the operation is successful, the map entry list must be kept in sync.

Additionally, memory mappings can be affected by page fault handling. We have to handle these as well and keep the map entries in sync within the share group each time a member takes a page fault. This is a first for the NetBSD emulation subsystem, which has only been used to emulate system calls and signal handling so far.

We therefore had to add per-emulation page fault handling to the emulation subsystem. The goal is exactly the same as for system call emulation: emulation-independent code has some hooks to handle emulation-specific behavior. The hook is usually a pointer in struct emul, which points to some emulation-specific function or data for each emulation. As an example, here is how the MIPS system call handler handles error codes (sys/arch/mips/mips/syscall.c:syscall_plain()):

if (p->p_emul->e_errno)
    error = p->p_emul->e_errno[error];

p is a pointer to the current process, and error the error code that the kernel wants to return to userland. e_errno is a field in struct emul which points to an emulation dependent array, defining the translation between native NetBSD error codes and emulated error codes.

Let us examine trap handling now. We introduced an e_fault field to struct emul which points to a function responsible for trap handling. Native NetBSD processes will want to use uvm_fault() and IRIX will want to use irix_vm_fault(), which is implemented in sys/compat/irix/irix_prctl.c. In the MIPS trap handler (sys/arch/mips/mips/trap.c), we have:

if (p->p_emul->e_fault) 
	rv = (*p->p_emul->e_fault)(p, va, 0, ftype);
else
	rv = uvm_fault(map, va, 0, ftype);

It is worth mentioning that the prototype of irix_vm_fault() differs a bit from that of uvm_fault().

int irix_vm_fault __P((struct proc *, vaddr_t, vm_fault_t, vm_prot_t));
int uvm_fault __P((struct vm_map *, vaddr_t, vm_fault_t, vm_prot_t));

This is because we need the struct proc pointer in irix_vm_fault(), and changing the uvm_fault() prototype would have caused too many invasive changes in NetBSD's virtual memory subsystem. Because there is no strong requirement to have the same prototype, we used a slightly different one. The struct vm_map pointer can be easily derived from struct proc (it is just &p->p_vmspace->vm_map), so the struct proc can just replace the vm_map argument in the irix_vm_fault() prototype.

irix_vm_fault() just calls uvm_fault() and then synchronizes the virtual address space mappings across the share group. The next point is how the actual sync is done. Whether requested from a memory-related system call or from irix_vm_fault(), the sync is implemented by the irix_vm_sync() function (implemented in sys/compat/irix/irix_prctl.c).

irix_vm_sync() takes the struct proc pointer of the modified process as an argument. For each process in the share group, it unmaps shared regions and remaps them as in the modified process. The unmapping is done by uvm_unmap(9), and the copy of the modified process mapping is done by uvm_map_extract(9).

We have to keep track of which region is shared and which region is private. This information could have been added to the VM map entries, but adding emulation-specific data there is not good practice. We therefore use a linked list whose head is in struct emuldata:

LIST_HEAD(ied_shared_regions, irix_shared_regions_rec)
    ied_shared_regions; /* list of (un)shared memory regions */

The irix_shared_regions_rec structure is defined in sys/compat/irix/irix_prctl.h like this:

struct irix_shared_regions_rec {
    vaddr_t isrr_start;
    vsize_t isrr_len;
    int     isrr_shared;    /* shared or not shared */
#define IRIX_ISRR_SHARED 1 
#define IRIX_ISRR_PRIVATE 0
    LIST_ENTRY(irix_shared_regions_rec) isrr_list; 
};

This list is modified when we create the PRDA and when the MAP_LOCAL option is requested in irix_sys_mmap(). When we modify the list, we have to compute region intersections to avoid having information about the same region twice. This is done through irix_isrr_insert() in sys/compat/irix/irix_prctl.c.

There is a debug function in the same file, irix_isrr_debug(), which is useful for checking what happens to the list. If the kernel is built with the DEBUG_IRIX option, this function is called each time the list is modified. Here is some output from the debug function, which helps to explain what is going on:

At process creation time:

isrr for pid 233
  addr = 0x0, len = 0x80000000, shared = 1

After the PRDA is created:

isrr for pid 233
  addr = 0x0, len = 0x200000, shared = 1
  addr = 0x200000, len = 0x1000, shared = 0
  addr = 0x201000, len = 0x7fdff000, shared = 1

After mapping some data with MAP_LOCAL:

isrr for pid 233
  addr = 0x0, len = 0x200000, shared = 1
  addr = 0x200000, len = 0x1000, shared = 0
  addr = 0x201000, len = 0x2fe0f000, shared = 1
  addr = 0x30010000, len = 0x5000, shared = 0
  addr = 0x30015000, len = 0x4ffeb000, shared = 1

With information from this list, irix_vm_sync() is able to perform the sync, sharing mappings only for shared regions. The code is getting quite complicated, but we are getting close to genuine IRIX behavior. irix_vm_sync() now has to compute the intersection of shared and private regions across the share group, and make the memory mapping synchronize accordingly.

We end up with quite a horrible piece of code. On a plain page fault, we have to walk through multiple linked lists, which is a pain on the performance front. But we have no choice: we want to emulate an odd feature, so we get odd code.

In the next part, if I am courageous enough to write about it, we will look at the emulation of an IRIX pseudo-device driver that is used to implement pollable semaphores. This includes reverse engineering the driver entry points, since almost no documentation is available about it, and, of course, implementing the driver in NetBSD.

Emmanuel Dreyfus is a system and network administrator in Paris, France, and is currently a developer for NetBSD.



Copyright © 2009 O'Reilly Media, Inc.