IRIX Binary Compatibility, Part 6
The IRIX Process Data Area (PRDA)
The first bug appeared when trying to run the IRIX version of
Photoshop. It manifested itself as an unexpected SIGSEGV,
which usually deeply depresses me. Tracing which bug in the emulation
subsystem caused a segmentation fault can be extremely difficult, since
the emulation inconsistency can be quite far away from the segmentation
fault. It is not easy, but at least we can try.
Here is the kernel trace before the segmentation fault, obtained with
ktrace:
2811 AdobePhotoshop3. CALL access(0x110437f8,0x3)
2811 AdobePhotoshop3. NAMI "/emul/irix/var/tmp"
2811 AdobePhotoshop3. NAMI "/var/tmp"
2811 AdobePhotoshop3. RET access 0
2811 AdobePhotoshop3. CALL getpid
2811 AdobePhotoshop3. RET getpid 2811/0xafb
2811 AdobePhotoshop3. CALL access(0x110437f8,0)
2811 AdobePhotoshop3. NAMI "/emul/irix/var/tmp/photoAAAa000fv"
2811 AdobePhotoshop3. NAMI "/var/tmp/photoAAAa000fv"
2811 AdobePhotoshop3. RET access -1 errno 2 No such file or directory
2811 AdobePhotoshop3. PSIG SIGSEGV caught handler=0x41444c4 mask=(13,20) code=0x200e00
Because the access system call just returns 0 or -1, I could not see any way of returning a result that could cause a segmentation fault. It had to be caused by something else.
Using gdb, we can disassemble the code leading to the
SIGSEGV delivery:
Program received signal SIGSEGV, Segmentation fault.
0xfb06420 in ?? ()
(gdb) x/4i $pc-16
0xfb06414: sw $s1,40($sp)
0xfb06418: sw $s0,36($sp)
0xfb0641c: lui $t6,0x20
0xfb06420: lw $t7,3584($t6)
(gdb) info reg
zero at v0 v1 a0 a1 a2 a3
R0 00000000 00000000 11042eb8 0000024a 11042eb8 00000000 0000002d 00000001
t0 t1 t2 t3 t4 t5 t6 t7
R8 0fb4f9c0 0fb4f9c0 00870000 00000000 00000000 0fb063f0 00200000 00000000
(snip)
The instruction that caused the exception is lw
$t7,3584($t6). It was supposed to load into T7 the word stored at the
address T6+3584 (3584 is 0xe00 in hexadecimal). Given that T6 is
0x200000, we end up accessing 0x200e00, where no memory is mapped, hence
the SIGSEGV. This is the same address as the code=0x200e00 reported in
the kernel trace.
Reading the code, it is interesting to note that T6 has just been
initialized with a constant value: lui stands for load upper
immediate, hence lui $t6,0x20 just placed 0x20 in the upper 16 bits of
T6 and cleared the lower 16 bits, filling T6 with 0x200000. This attempt
to read data at 0x200e00 is therefore hardwired into the program; it is
not a consequence of some data handed out by the emulation subsystem.
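The address arithmetic above can be checked with a few lines of C. This is only a sketch of how the two instructions compute the address; the function names are made up for the illustration, and the constants are the ones shown in the gdb session.

```c
#include <assert.h>
#include <stdint.h>

/* Value left in a 32-bit register by "lui rt,imm": the 16-bit immediate
 * goes into the upper half and the lower half is cleared. */
uint32_t
mips_lui(uint16_t imm)
{
	return (uint32_t)imm << 16;
}

/* Effective address of "lw rt,offset(base)": the base register plus the
 * sign-extended 16-bit offset. */
uint32_t
mips_lw_address(uint32_t base, int16_t offset)
{
	return base + (uint32_t)(int32_t)offset;
}

/* Recompute the faulting address from the two disassembled instructions. */
uint32_t
faulting_address(void)
{
	uint32_t t6 = mips_lui(0x20);			/* lui $t6,0x20 */
	uint32_t addr = mips_lw_address(t6, 3584);	/* lw $t7,3584($t6) */

	assert(addr == 0x200e00);	/* the code= value in the ktrace output */
	return addr;
}
```

Calling faulting_address() yields 0x200e00, the address reported with the SIGSEGV.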
The idea that a page of memory should magically be mapped at that address seems a bit odd, but it is worth writing a program to check it. Here is code to do the job:
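/* magicpage.c -- check for a mapping at 0x200000 */
#include <stdio.h>

/* volatile, so that the compiler cannot optimize the read away */
volatile char *checkpoint = (volatile char *)0x200000;

int main(int argc, char **argv) {
    char c;
    int sgn = 1;

    /* With one argument, scan downward instead of upward */
    if (argc == 2)
        sgn = -1;
    while (1) {
        printf("trying %p\n", (void *)checkpoint);
        fflush(stdout);  /* keep the output if SIGSEGV kills us */
        c = *checkpoint; /* faults if no page is mapped here */
        (void)c;
        printf("ok\n");
        checkpoint = (volatile char *)
            ((unsigned long)checkpoint + (sgn * 0x1000));
    }
    return 0;
}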
Running this program in both directions, ascending by default and
descending when it is given an argument, we confirm that the IRIX kernel
maps one page of memory at 0x200000 on program startup.
The next step is to dump this area and to have the NetBSD kernel
prepare it just like the IRIX kernel does. This could have been a
difficult job, but fortunately the "magic page" is documented. I have to
thank Chuck Silvers for pointing this out to me. We had an email exchange
about how to handle some issues related to the "magic page", and Chuck
called it the PRDA. I asked him why he used this name, and Chuck told me
this was in the sproc(2)
man page.
The sproc(2) man page explains that when you create a
thread sharing the virtual memory space with the parent, everything is
shared except the Process Data Area (PRDA). A reference to
<sys/prctl.h> is also given to get more information about
the PRDA. In this header file, we can find the actual definition of the
structures contained in the PRDA, making it much easier to emulate.
The real problem now is to implement this feature correctly on NetBSD.
Mapping and filling the PRDA at process creation time is easy, but having
our sproc(2) emulation share the whole virtual memory space
except for the PRDA is more difficult. We will cover this in the next
section.
Private Mappings in Share Groups
We have to handle shared virtual memory spaces that contain one private
page. In fact, we have to handle potentially multiple private pages,
since the PRDA is not the only situation where this property is needed. In
the IRIX mmap(2) man page, we can see that there is a
MAP_LOCAL option used to make a private mapping within the
share group shared virtual memory space.
Sharing the whole virtual memory space is simple: there is a field in
the proc structure called p_vmspace (this is defined in
<sys/proc.h>). This field is a pointer to a struct
vmspace (defined in
<uvm/uvm_extern.h>) which describes a process
virtual address space. When we want to share the whole virtual address
space, we share the same struct vmspace among different
processes.
The struct vmspace contains a substructure called
vm_map (defined in
<uvm/uvm_map.h>) which in turn contains the list of map entries,
describing the various mappings in the process virtual address space. To
share only some pages, we must have different vmspace structures with
different lists of mappings. The map entries linked in the lists will be
the same for all the processes in the share group when they describe the
shared regions. For private regions, each process will have its own map
entries.
The real problem, since we have different lists for each process, is keeping the lists synchronized. If a process modifies the mapping in a shared region, the modification must be visible to all other processes. What are the ways of modifying the virtual address space mappings?
Through system calls, virtual memory mappings are affected by
mmap(2), munmap(2), break(2),
shmsys(2), mprotect(2), plock(2),
mpin(2), munpin(2), memcntl(2),
ptrace(2), and syssgi(2) commands such as
SGI_ELFMAP (see part 3 of this series to learn
about SGI_ELFMAP). For each of these system calls, if the
operation is successful, the map entry list must be kept in sync.
Additionally, memory mappings can be affected by page fault handling. We have to handle these as well and keep the map entries in sync within the share group each time a member takes a page fault. This is a first for the NetBSD emulation subsystem, which has only been used to emulate system calls and signal handling so far.
We therefore had to add per-emulation page fault handling to the
emulation subsystem. The goal is exactly the same as for system call
emulation: emulation-independent code has some hooks to handle
emulation-specific behavior. The hook is usually a pointer in
struct emul, which points to some emulation-specific function
or data for each emulation. As an example, here is how the MIPS system
call handler handles error codes (sys/arch/mips/mips/syscall.c:syscall_plain()):
if (p->p_emul->e_errno)
    error = p->p_emul->e_errno[error];
p is a pointer to the current process, and
error the error code that the kernel wants to return to
userland. e_errno is a field in struct emul
which points to an emulation dependent array, defining the translation
between native NetBSD error codes and emulated error codes.
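The same pattern can be tried outside the kernel. The following is a minimal sketch, not the NetBSD code: struct emul_sketch, the two emulation instances and the table contents are invented for the illustration (the values are not the real IRIX error numbers), and only the e_errno test is taken from the snippet above.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NERRNO 4		/* hypothetical size of the toy error space */

/* Toy stand-in for struct emul: only the field discussed in the text. */
struct emul_sketch {
	const char *e_name;
	const int *e_errno;	/* native -> emulated table, or NULL */
};

/* Illustrative table only; these are not the real IRIX values. */
static const int sketch_errno_table[SKETCH_NERRNO] = { 0, 1, 2, 11 };

/* A native emulation needs no translation, hence the NULL table. */
const struct emul_sketch emul_native_sketch = { "native", NULL };
const struct emul_sketch emul_foreign_sketch = { "foreign", sketch_errno_table };

/* Same shape as the test quoted from syscall_plain(): translate only
 * when the emulation provides a table. */
int
translate_error(const struct emul_sketch *e, int error)
{
	assert(error >= 0 && error < SKETCH_NERRNO);
	if (e->e_errno)
		error = e->e_errno[error];
	return error;
}
```

With these toy tables, translate_error(&emul_native_sketch, 3) returns 3 unchanged, while translate_error(&emul_foreign_sketch, 3) returns 11.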
Let us examine trap handling now. We added an e_fault
field to struct emul, which points to a function responsible
for page fault handling. Native NetBSD processes will want to use
uvm_fault() and IRIX processes will want to use
irix_vm_fault(), which is implemented in sys/compat/irix/irix_prctl.c.
In the MIPS trap handler (sys/arch/mips/mips/trap.c),
we have:
if (p->p_emul->e_fault)
    rv = (*p->p_emul->e_fault)(p, va, 0, ftype);
else
    rv = uvm_fault(map, va, 0, ftype);
It is worth mentioning that the prototype of irix_vm_fault()
differs a bit from that of uvm_fault():
int irix_vm_fault __P((struct proc *, vaddr_t, vm_fault_t, vm_prot_t));
int uvm_fault __P((struct vm_map *, vaddr_t, vm_fault_t, vm_prot_t));
This is because we need the struct proc pointer in
irix_vm_fault(), and changing the uvm_fault()
prototype would have caused too many invasive changes in NetBSD's virtual
memory subsystem. Because there is no strong requirement to have the same
prototype, we used a slightly different one. The struct
vm_map pointer can be easily derived from struct proc
(it is just &p->p_vmspace->vm_map, since vm_map is embedded in
struct vmspace), so the struct
proc can just replace the vm_map argument in the
irix_vm_fault() prototype.
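To make the hook mechanism concrete, here is a small userland sketch of the dispatch. All the types and functions are simplified stand-ins invented for the example (the real prototypes take an extra fault type argument, and the real handlers do actual VM work); only the shape of the test on e_fault and the derivation of the map from the process follow the text.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long vaddr_sketch_t;

struct vm_map_sketch { int faults; };		/* counts faults handled */
struct vmspace_sketch { struct vm_map_sketch vm_map; };
struct proc_sketch;

struct fault_emul_sketch {
	/* NULL for the native emulation, as in the trap handler test */
	int (*e_fault)(struct proc_sketch *, vaddr_sketch_t, int);
};

struct proc_sketch {
	const struct fault_emul_sketch *p_emul;
	struct vmspace_sketch *p_vmspace;
	int synced;	/* how many share group syncs were requested */
};

/* Stand-in for uvm_fault(): works on a map. */
int
generic_fault(struct vm_map_sketch *map, vaddr_sketch_t va, int ftype)
{
	(void)va;
	(void)ftype;
	map->faults++;
	return 0;
}

/* Stand-in for irix_vm_fault(): works on a process, derives the map. */
int
emul_fault(struct proc_sketch *p, vaddr_sketch_t va, int ftype)
{
	int rv = generic_fault(&p->p_vmspace->vm_map, va, ftype);

	/* Assumption: the sync is only worth doing if the fault succeeded. */
	if (rv == 0)
		p->synced++;
	return rv;
}

/* Emulation-independent code: use the hook when there is one. */
int
handle_fault(struct proc_sketch *p, vaddr_sketch_t va, int ftype)
{
	assert(p != NULL && p->p_emul != NULL);
	if (p->p_emul->e_fault)
		return (*p->p_emul->e_fault)(p, va, ftype);
	return generic_fault(&p->p_vmspace->vm_map, va, ftype);
}
```

A process whose emulation leaves e_fault set to NULL goes straight to the generic handler, while a process whose emulation sets e_fault to emul_fault also gets its sync counter incremented.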
irix_vm_fault() just calls uvm_fault() and
then synchronizes the virtual address space mappings across the share
group. The next point is about how the actual sync is done. Whether
requested from a memory related system call or from irix_vm_fault(), the
sync is implemented by the irix_vm_sync() function (implemented in
sys/compat/irix/irix_prctl.c).
irix_vm_sync() takes the struct proc pointer
of the modified process as an argument. For each process in the share
group, it unmaps shared regions and remaps them as in the modified
process. The unmapping is done by
uvm_unmap(9), and the copy of the modified process
mapping is done by
uvm_map_extract(9).
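The effect of such a sync can be modeled with a toy address space. This sketch is a deliberate simplification and not the kernel code: each share group member is reduced to an array holding one object identifier per page plus a per-page shared flag, and copying an array slot stands in for the unmap and extract steps. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NPAGES 8		/* toy address space: eight pages */

struct member_sketch {
	int map[SKETCH_NPAGES];		/* object mapped at each page, 0 = none */
	int shared[SKETCH_NPAGES];	/* 1 = shared with the group, 0 = private */
};

/* Propagate the mappings of src to every other member of the group,
 * touching only the pages that both sides consider shared. */
void
vm_sync_sketch(const struct member_sketch *src,
    struct member_sketch **group, size_t nmembers)
{
	size_t i, pg;

	assert(src != NULL && group != NULL);
	for (i = 0; i < nmembers; i++) {
		struct member_sketch *dst = group[i];

		if (dst == src)
			continue;
		for (pg = 0; pg < SKETCH_NPAGES; pg++) {
			/* Private on either side: leave the page alone. */
			if (!src->shared[pg] || !dst->shared[pg])
				continue;
			/* Stands in for "unmap, then remap as in src". */
			dst->map[pg] = src->map[pg];
		}
	}
}
```

After a member changes one of its shared pages and vm_sync_sketch() runs, the other members see the new mapping, while the contents of their private pages (the PRDA, for instance) are left untouched.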
We have to keep track of which regions are shared and which regions are
private. This information could have been added to the VM map entries,
but adding emulation-specific data there is not good practice. We
therefore use a linked list whose head is in struct
emuldata:
LIST_HEAD(ied_shared_regions, irix_shared_regions_rec)
ied_shared_regions; /* list of (un)shared memory regions */
The irix_shared_regions_rec structure is defined in
sys/compat/irix/irix_prctl.h like this:
struct irix_shared_regions_rec {
    vaddr_t isrr_start;
    vsize_t isrr_len;
    int isrr_shared; /* shared or not shared */
#define IRIX_ISRR_SHARED 1
#define IRIX_ISRR_PRIVATE 0
    LIST_ENTRY(irix_shared_regions_rec) isrr_list;
};
This list is modified when we create the PRDA and when the
MAP_LOCAL option is requested in
irix_sys_mmap(). When we modify the list, we have to compute
region intersections to avoid having information about the same region
twice. This is done through irix_isrr_insert() in
sys/compat/irix/irix_prctl.c.
There is a debug function in the same file,
irix_isrr_debug(), which is useful for checking what happens
to the list. If the kernel is built with the DEBUG_IRIX
option, this function is called each time the list is modified. Here is
some output from the debug function, which helps to explain what is going
on:
At process creation time:
isrr for pid 233
addr = 0x0, len = 0x80000000, shared = 1
After the PRDA is created:
isrr for pid 233
addr = 0x0, len = 0x200000, shared = 1
addr = 0x200000, len = 0x1000, shared = 0
addr = 0x201000, len = 0x7fdff000, shared = 1
After mapping some data with MAP_LOCAL:
isrr for pid 233
addr = 0x0, len = 0x200000, shared = 1
addr = 0x200000, len = 0x1000, shared = 0
addr = 0x201000, len = 0x2fe0f000, shared = 1
addr = 0x30010000, len = 0x5000, shared = 0
addr = 0x30015000, len = 0x4ffeb000, shared = 1
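The splitting visible in this debug output can be reproduced with a short sketch. It is not the irix_isrr_insert() code: to stay self-contained it uses a fixed-size array instead of the kernel list macros, does not merge adjacent records with the same flag, and all the names are hypothetical. It assumes the records are sorted by address and do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAXREC 16	/* arbitrary capacity for the example */

struct region_sketch {
	uint32_t start;
	uint32_t len;
	int shared;		/* 1 = shared, 0 = private */
};

struct region_list_sketch {
	struct region_sketch rec[SKETCH_MAXREC];
	size_t count;
};

/* Append [start, end) to the list; empty ranges are ignored. */
static void
region_append(struct region_list_sketch *l, uint32_t start, uint32_t end,
    int shared)
{
	if (end <= start)
		return;
	assert(l->count < SKETCH_MAXREC);
	l->rec[l->count].start = start;
	l->rec[l->count].len = end - start;
	l->rec[l->count].shared = shared;
	l->count++;
}

/* Insert a region, trimming or splitting whatever it overlaps. */
void
region_insert(struct region_list_sketch *l, uint32_t start, uint32_t len,
    int shared)
{
	struct region_list_sketch out = { .count = 0 };
	uint32_t end = start + len;
	int placed = 0;
	size_t i;

	assert(len > 0 && end > start);
	for (i = 0; i < l->count; i++) {
		uint32_t rs = l->rec[i].start;
		uint32_t re = rs + l->rec[i].len;
		int rshared = l->rec[i].shared;

		/* Part of the old record below the new region. */
		region_append(&out, rs, re < start ? re : start, rshared);
		/* The new region itself, emitted once and in address order. */
		if (!placed && re > start) {
			region_append(&out, start, end, shared);
			placed = 1;
		}
		/* Part of the old record above the new region. */
		region_append(&out, rs > end ? rs : end, re, rshared);
	}
	if (!placed)
		region_append(&out, start, end, shared);
	*l = out;
}
```

Starting from an empty list, inserting (0x0, 0x80000000, shared), then (0x200000, 0x1000, private), then (0x30010000, 0x5000, private) produces the same five records as the last listing above.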
With information from this list, irix_vm_sync() is able to
perform the sync, sharing mappings only for shared regions. The code is
getting quite complicated, but we are getting close to genuine IRIX
behavior. irix_vm_sync() now has to compute the intersection
of shared and private regions across the share group, and synchronize
the memory mappings accordingly.
We end up with quite a horrible piece of code. On a plain page fault, we have to walk through multiple linked lists, which hurts performance. But we have no choice: we want to emulate an odd feature, so we get odd code.
In the next part, if I am courageous enough to write about it, we will look at the emulation of an IRIX pseudo-device driver that is used to implement pollable semaphores. This includes reverse engineering the driver entry points, since almost no documentation is available about it, and, of course, implementing the driver in NetBSD.
Emmanuel Dreyfus is a system and network administrator in Paris, France, and is currently a developer for NetBSD.