BSD DevCenter
Big Scary Daemons

System Panics, Part 2: Recovering and Debugging

04/04/2002

This is Part 2 in a two-part series on system panics. In his first column, Michael Lucas talked about how to prepare a FreeBSD system in case of a panic. In this column, he talks about what to do when the worst happens.

Preparing for a crash immediately after you install a system is an excellent way to reduce stress. When your computer panics, would you rather have all the crash information at your fingertips, or would you prefer frantically reading the documentation and trying to set up the debugger? Last time, we discussed building a debugging kernel and setting up your system to save a crash dump after a panic. Let's hope you'll never need any of this. If you do suffer a crash, however, here's how to get some useful information out of it.

Let's assume that you've followed all the advice in the previous column. savecore(8) should have copied a dump of your crashed kernel to /var/crash.

If you take a look in /var/crash, you'll see the files kernel.0 and vmcore.0. (Each subsequent crash dump will get a consecutively higher number, e.g., kernel.1 and vmcore.1.) The vmcore.0 file is the actual memory dump. The kernel file is a copy of the crashed kernel. You want to be sure to use the debugging kernel instead of this one, however. If you look in your kernel compile directory (/sys/compile/MACHINENAME), you'll see a file called kernel.debug. This kernel file contains the symbols we discussed in the previous article. To make your life slightly easier, you might copy this file to /var/crash/kernel.debug.0. This will help you keep track of your debug kernels and the crashes they are associated with.
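If crash dumps start piling up, the numbering convention is easy to check mechanically. Here is a minimal sketch in Python, not anything FreeBSD ships, that pairs each vmcore.N with the kernel file you would hand to the debugger. The function name pair_dumps is made up for illustration, and kernel.debug.N is only the naming habit suggested above, not something savecore(8) creates for you.

```python
import re

# Matches the numbered files found in /var/crash, plus the kernel.debug.N
# copies suggested in the article (a manual convention, not a savecore one).
DUMP_FILE = re.compile(r"^(vmcore|kernel\.debug|kernel)\.(\d+)$")


def pair_dumps(filenames):
    """Map each dump number to the (kernel, vmcore) files to give the debugger."""
    found = {}
    for name in filenames:
        match = DUMP_FILE.match(name)
        if match:
            kind, number = match.group(1), int(match.group(2))
            found.setdefault(number, {})[kind] = name

    pairs = {}
    for number, files in sorted(found.items()):
        if "vmcore" not in files:
            continue  # no memory dump with this number, nothing to debug
        # Prefer the kernel with symbols; fall back to the plain copy.
        kernel = files.get("kernel.debug") or files.get("kernel")
        pairs[number] = (kernel, files["vmcore"])
    return pairs
```

Feed it the output of os.listdir("/var/crash") and any dump that falls back to a plain kernel.N is one where you forgot to save the debugging kernel.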

This process is an excellent opportunity to use script(1). This program copies everything that appears on your screen and makes it simple to keep a record of your debugging session (or, indeed, anything else you do). After you start the script, start the gdb debugger. gdb takes three arguments: a -k to configure the debugger appropriately for kernel work; the name of a file containing the kernel with symbols; and the name of the memory dump.

# gdb -k kernel.debug.0 vmcore.0

Once you do that, gdb will spit out its copyright information, the panic message, and a copy of the memory dumping process. We've seen an example of a panic earlier, so I won't repeat it now. What is new is the debugger prompt you get back at the end of all this:

(kgdb)

You've now gotten further than any number of people who have system panics. Pat yourself on the back. To find out exactly where the panic happened, type where and press Enter.

(kgdb) where
#0  dumpsys () at ../../../kern/kern_shutdown.c:505
#1  0xc0143119 in db_fncall (dummy1=0, dummy2=0, dummy3=0,
    dummy4=0xe0b749a4 " \0048\200%") at ../../../ddb/db_command.c:551
#2  0xc0142f33 in db_command (last_cmdp=0xc0313724, cmd_table=0xc0313544,
    aux_cmd_tablep=0xc030df2c, aux_cmd_tablep_end=0xc030df30)
    at ../../../ddb/db_command.c:348
#3  0xc0142fff in db_command_loop () at ../../../ddb/db_command.c:474
#4  0xc0145393 in db_trap (type=12, code=0) at ../../../ddb/db_trap.c:72
#5  0xc02ad0f6 in kdb_trap (type=12, code=0, regs=0xe0b74af4)
    at ../../../i386/i386/db_interface.c:161
#6  0xc02ba004 in trap_fatal (frame=0xe0b74af4, eva=40)
    at ../../../i386/i386/trap.c:846
#7  0xc02b9d71 in trap_pfault (frame=0xe0b74af4, usermode=0, eva=40)
    at ../../../i386/i386/trap.c:765
#8  0xc02b9907 in trap (frame={tf_fs = 24, tf_es = 16, tf_ds = 16, tf_edi = 0,
      tf_esi = 0, tf_ebp = -524858548, tf_isp = -524858592,
      tf_ebx = -525288192, tf_edx = 0, tf_ecx = 1000000000, tf_eax = 0,
      tf_trapno = 12, tf_err = 0, tf_eip = -1071645917, tf_cs = 8,
      tf_eflags = 66182, tf_esp = -1070136512, tf_ss = 0})
    at ../../../i386/i386/trap.c:433
#9  0xc01ffb23 in vcount (vp=0xe0b0bd00) at ../../../kern/vfs_subr.c:2301
#10 0xc01a5e58 in spec_close (ap=0xe0b74b94)
    at ../../../fs/specfs/spec_vnops.c:591
#11 0xc01a55f1 in spec_vnoperate (ap=0xe0b74b94)
    at ../../../fs/specfs/spec_vnops.c:121
#12 0xc0207454 in vn_close (vp=0xe0b0bd00, flags=3, cred=0xc32cce00,
    td=0xe0a8d360) at vnode_if.h:183
#13 0xc0207fab in vn_closefile (fp=0xc3369080, td=0xe0a8d360)
    at ../../../kern/vfs_vnops.c:757
#14 0xc01b1d50 in fdrop_locked (fp=0xc3369080, td=0xe0a8d360)
    at ../../../sys/file.h:230
#15 0xc01b155a in fdrop (fp=0xc3369080, td=0xe0a8d360)
    at ../../../kern/kern_descrip.c:1538
#16 0xc01b152d in closef (fp=0xc3369080, td=0xe0a8d360)
    at ../../../kern/kern_descrip.c:1524
#17 0xc01b114e in fdfree (td=0xe0a8d360) at ../../../kern/kern_descrip.c:1345
#18 0xc01b5173 in exit1 (td=0xe0a8d360, rv=256)
    at ../../../kern/kern_exit.c:199
#19 0xc01b4ec2 in sys_exit (td=0xe0a8d360, uap=0xe0b74d20)
    at ../../../kern/kern_exit.c:109
#20 0xc02ba2b7 in syscall (frame={tf_fs = 47, tf_es = 47, tf_ds = 47,
      tf_edi = 135227560, tf_esi = 0, tf_ebp = -1077941020,
      tf_isp = -524857996, tf_ebx = -1, tf_edx = 135044144,
      tf_ecx = -1077942116, tf_eax = 1, tf_trapno = 12, tf_err = 2,
      tf_eip = 134865696, tf_cs = 31, tf_eflags = 663, tf_esp = -1077941064,
      tf_ss = 47}) at ../../../i386/i386/trap.c:1049
#21 0xc02ae06d in syscall_with_err_pushed ()
#22 0x80503a5 in ?? ()
#23 0x807024a in ?? ()
#24 0xbfbfffb4 in ?? ()
#25 0x807daaf in ?? ()
#26 0x807d6eb in ?? ()
#27 0x80630c1 in ?? ()
#28 0x8062fed in ?? ()
#29 0x805ea4c in ?? ()
#30 0x8065949 in ?? ()
#31 0x806544d in ?? ()
#32 0x806dc17 in ?? ()
#33 0x80616b7 in ?? ()
#34 0x80613f0 in ?? ()
#35 0x8048135 in ?? ()
(kgdb)

Whoa! This is definitely scary-looking stuff. If you copied this and the output of uname -a into an email and sent it to hackers@FreeBSD.org, various developers would take note and help you out. They'd probably write you back and tell you other things to type at the kgdb prompt, but you'd definitely get developer attention. You'd be well on your way to getting the problem solved, and helping the FreeBSD folks squash a bug.

If you're not familiar with programming, nobody would blame you if you stopped here. You're better than that, though, and smarter. I know you are. So, without further ado, let's see what we can learn from the debug message and try to figure out some things to include in that first email. Without being intimate with the kernel, you can't solve the problem yourself, but you might be able to help narrow things down a little.

The first thing to realize is that the debugger backtrace lists the function calls made by the kernel, in reverse order. The first line, numbered #0, is the last thing the kernel did, and each higher-numbered line is the function that called the one above it. When someone says "before" or "after," they're almost certainly talking about chronological order and not the order things appear in the debugger.

When does a system panic? Well, panicking is a choice that the kernel makes. If the system reaches a condition that it doesn't know how to handle, or fails its own internal consistency checks, it will panic. In these cases, the kernel will call a function called either trap or (if you have INVARIANTS in your kernel) panic. You'll see variants on these, such as db_trap, but you just want the plain old, unadorned trap or panic. Look through your gdb output for either of these. In the example above, there's a trap in line 8. We see other types of trap on lines 4-7, but no plain, straightforward trap statements. These other traps are "helper" functions, called by trap to try to figure out what exactly happened and what to do about it.

Whatever happened right before line 8 chose to panic. In line 9, we see:

#9 0xc01ffb23 in vcount (vp=0xe0b0bd00) at ../../../kern/vfs_subr.c:2301

The hex numbers don't mean much, but this panicked in vcount. If you try man vcount, you'll see that vcount(9) is a standard kernel function. (Section 9 of the manual covers kernel internals; system calls live in section 2.) The panic occurred while executing code that was compiled from the file /usr/src/sys/kern/vfs_subr.c, on line 2301. (All paths in these dumps should be under /usr/src/sys.) This gives a developer a very good idea of where to look for this problem.
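Turning the relative path in a frame into a file you can actually open is simple enough to sketch. This Python fragment assumes, as stated above, that the paths belong under /usr/src/sys. The names source_location and SYS_ROOT are hypothetical, chosen for this example.

```python
import posixpath
import re

SYS_ROOT = "/usr/src/sys"  # assumption taken from the article's own note

# The tail of a frame: " at <file>:<line>" at the very end of the line.
LOCATION = re.compile(r" at (\S+):(\d+)$")


def source_location(frame_line, sys_root=SYS_ROOT):
    """Turn the 'at file:line' tail of a frame into (path, line_number)."""
    match = LOCATION.search(frame_line)
    if match is None:
        return None  # frames such as "in ?? ()" carry no source location
    path, line = match.group(1), int(match.group(2))
    if not path.startswith("../"):
        # A bare name such as vnode_if.h is left alone; this sketch
        # cannot tell which directory it lives in.
        return path, line
    # Drop the leading "../" hops and re-root the rest under sys_root.
    while path.startswith("../"):
        path = path[3:]
    return posixpath.join(sys_root, path), line
```

Given the vcount frame, this returns /usr/src/sys/kern/vfs_subr.c and 2301, which is exactly what you would pass to your editor.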

Let's go up and look at line 9. Use the up command and the number of lines you want to move.

(kgdb) up 9
#9  0xc01ffb23 in vcount (vp=0xe0b0bd00) at ../../../kern/vfs_subr.c:2301
2301            SLIST_FOREACH(vq, &vp->v_rdev->si_hlist, v_specnext)
(kgdb)

Here we see the actual line of vfs_subr.c that was compiled into the panicking code. You don't need to know what SLIST_FOREACH is. (It's a macro, by the way.) Getting this far is pretty good, but there's still a little more information you can squeeze out of this dump without knowing exactly how the kernel works.
