ONLamp.com    
 Published on ONLamp.com (http://www.onlamp.com/)


Big Scary Daemons

BSD Tricks: Unprepared Disaster Recovery

12/07/2000

A friend called me at work the other day. Some time ago, I'd built him a FreeBSD system as an inexpensive e-mail and NAT box. He was sitting at the console looking at a "Changing root device to /dev/wd0s1a" message. The box had crashed. He'd been sitting there for a while, power cycling and hoping something would magically change.

The local system administrator, while a decent Novell admin, knew nothing about Unix. He had the root password, of course. Heaven knew what had happened to the box since I put it in.

We quickly found that the system wouldn't even boot into single-user mode. Since I'd last seen the system, e-mail and Web access had become mission-critical. Backups? Surely you jest.

And to top it off, the company had vital mail messages stored on the system.

There are worse disasters, but this was bad enough. That evening I sat down with the machine to see what could be done.

It was a 3.2-stable system, with a kernel dated May 1999. Fortunately, I keep all my old FreeBSD CDs. I booted off the 3.2 install disk and tried to enter "fixit" mode. The second CD, labeled "live filesystem," apparently wasn't; sysinstall kept complaining that the disk contained the Alpha install bits. Finally I booted off the 3.3 CD and selected fixit mode. Sysinstall complained that the version on the system didn't match the version on the disk, then went straight ahead and gave me a command prompt.

The first thing to do when using the fixit image is to mount the disk. A mount /dev/ad0s1 /mnt didn't work, but mount /dev/wd0s1 /mnt did. Back in 3-stable, /dev/wd* was the norm. This was typical of the problems I experienced; while FreeBSD's user interface has remained fairly constant, many of the more sophisticated administration aspects have changed.

At first glance, /mnt looked like a standard root directory. However, /mnt/dev was missing. If the system lacked device nodes, well, it wouldn't be able to change the root to one, now could it? Naively, I thought that all I'd need to do was recreate the /dev entries and we'd be in business.

Of course, locate doesn't work on the fixit image. But find / -name MAKEDEV -print does. I copied MAKEDEV under /mnt/dev and ran sh MAKEDEV all.
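As a rough sketch of that search-and-copy step: the helper name find_makedev is hypothetical, and it assumes that find and head are both available in the rescue environment, which the text above only confirms for find.

```shell
# Hypothetical helper: print the path of the first MAKEDEV script found
# under the directory given as $1. Errors from unreadable directories
# are discarded.
find_makedev() {
    find "$1" -name MAKEDEV -print 2>/dev/null | head -n 1
}

# At a fixit prompt, the sequence described above would look roughly
# like this (shown as comments, since it needs the damaged disk on /mnt):
#   src=$(find_makedev /)
#   mkdir -p /mnt/dev
#   cp "$src" /mnt/dev/MAKEDEV
#   cd /mnt/dev && sh MAKEDEV all
```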

MAKEDEV complained that it couldn't find /sbin/mknod. Sure enough, there was no /sbin/mknod on the fixit image. There was a /mnt/sbin/mknod, however, and the fixit vi worked just fine. I quickly whacked /mnt/dev/MAKEDEV into shape and tried again.

It ran, but complained repeatedly about chown: wheel: illegal group name. Although /mnt/dev contained a few devices, it wasn't nearly what you'd expect. I shrugged, and decided to try it anyway.

The system didn't boot. I rebooted again, this time off the fixit image.

Once I hit Alt-F4 to go to the fixit command prompt, the answer literally stared me in the face. The first screen you see on the fixit image recommends symlinking /mnt/etc/group and /mnt/etc/*pwd.db to /etc, so tar can restore file permissions properly. Sure enough, I did that and MAKEDEV ran without a hitch.
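A minimal sketch of that symlinking step follows. The function name link_account_files and its two arguments are hypothetical; at the fixit prompt the arguments would be /mnt/etc and /etc, assuming the damaged root is mounted on /mnt as described above.

```shell
# Hypothetical helper: symlink the group file and the password databases
# from the mounted disk's etc directory ($1) into the running
# environment's etc directory ($2), so names like "wheel" resolve.
link_account_files() {
    mnt_etc=$1
    live_etc=$2
    for f in "$mnt_etc"/group "$mnt_etc"/*pwd.db; do
        # Skip unmatched glob patterns and files that are missing.
        [ -e "$f" ] || continue
        ln -sf "$f" "$live_etc/$(basename "$f")"
    done
}

# At the fixit prompt (not run here):
#   link_account_files /mnt/etc /etc
```

With the links in place, chown and tar can map group and user names to IDs, which is what the chown: wheel: illegal group name complaint was about.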

I tried to reboot into single-user mode, and was quite surprised that it worked. An fsck showed that the file systems were clean, so I Ctrl-D'd back into multiuser mode.

That's when I saw just how bad things were. Not only did the various startup scripts fail when their binaries weren't found, I was left staring at a screen full of console messages:

can't exec getty '/usr/libexec/getty' for /dev/ttyv0: no such file or directory

This couldn't be good. I rebooted into single-user mode again so I could get a command prompt, mounted some file systems, and went looking.

Huge swaths of files were just flat-out gone. If I had to guess, I'd say that someone logged in as root and typed rm -rf /*, and the system just kept eating itself until it lost some vital file and crashed.

I was in luck, though. Since /etc was intact, I tarred it up. And /var/mail still had lots of large files, so I tarred that up, too.
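For illustration, a small sketch of archiving those two trees in one pass. The helper name backup_tree and the archive location are hypothetical (the article does not say where the tarballs were kept), and the sketch assumes a tar that supports -C, as GNU tar and BSD tar do.

```shell
# Hypothetical helper: archive one or more directories, given relative
# to a root directory, into a single tar file.
#   $1 = archive to create, $2 = root to archive from, rest = paths
backup_tree() {
    archive=$1
    root=$2
    shift 2
    # -C switches to the root first, so the archive stores relative paths.
    tar -cf "$archive" -C "$root" "$@"
}

# On the damaged box this might be (archive path is an assumption):
#   backup_tree /some/safe/place/rescue.tar / etc var/mail
```

Storing relative paths makes it easy to unpack the archive into a scratch directory later and copy back only selected files.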

This time, I booted off the 3.4 CD. (I don't have a 3.5 disk, sadly.) Instead of doing a fixit attempt, I did a fresh install. The disk partitions were still present; I just had to label their mount points. My struggles with /dev weren't fruitless; when you install over an existing root partition, sysinstall assumes that /dev has good entries.

An hour later, the system rebooted into a clean 3.4-RELEASE box. Most of the customizations in /etc/ were still intact. I reinstalled everything listed under /var/db/pkg, copied some select files over from my backup /etc, and rebooted one more time to a repaired system.
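As a sketch of how that package list could be read off the disk: the package database keeps one directory per installed package, named after the package and its version. The helper name list_recorded_packages is hypothetical.

```shell
# Hypothetical helper: print the name of every package recorded in a
# package database directory such as /var/db/pkg, one per line.
list_recorded_packages() {
    for d in "$1"/*/; do
        # The trailing slash in the glob matches directories only; this
        # check skips the unmatched pattern when the directory is empty.
        [ -d "$d" ] || continue
        basename "$d"
    done
}

# On the rebuilt system (not run here):
#   list_recorded_packages /var/db/pkg
```

The resulting names can then be reinstalled one at a time from the install media or the ports tree.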

The next day, my friend picked up the box at work, took it to the office, and plugged it in. To my surprise, it found the ISDN modem, dialed out, logged on, and took its assigned IP. The only problem was that I hadn't reconfigured popper, a fix that took about five minutes over an ssh connection. They had four working hours of downtime. I had four hours of headaches and a final rush of success.

The fixit image isn't easy to use; you can't just do a "repair installation" and rebuild your system. Still, if you read your error messages and apply a little common sense, it's entirely possible to restore a badly fried FreeBSD box to service in just a few hours.

Michael W. Lucas



Copyright © 2009 O'Reilly Media, Inc.