System Failure and Recovery Practice
This time, we're going to get rid of bash, which can't be fixed by booting into single-user mode.
While writing this article, I discovered a bug in the UML block driver which causes COW files not to work properly when they aren't mounted as the root filesystem. So, we are going to dispense with them for the time being.
Copy root_fs to no_bash, boot it up, log in, and get rid of bash:
% cp root_fs no_bash
% linux ubd0=no_bash
usermode:~# rm /bin/bash
Then halt UML. If the halt hangs, halt it with the mconsole.
Let's boot it up again and see how it does without a shell:
It boots very quickly and it's impossible to log in:
INIT: cannot execute "/etc/init.d/rcS"
INIT: Entering runlevel: 2
INIT: cannot execute "/etc/init.d/rc"
Debian GNU/Linux 2.2 (none) ttys/0
(none) login: root
Unable to determine your tty name.
So, we need to shut it down with the mconsole and figure out how to fix it.
We're going to simulate booting from a rescue disk. We'll do this by using root_fs as the rescue disk, assigning it to disk 0, and moving the damaged filesystem to disk 1:
% linux ubd0=root_fs ubd1=no_bash
So, log in, mount the damaged filesystem on /mnt, and make sure that bash is missing:
usermode:~# mount /dev/ubd/1 /mnt
usermode:~# ls /mnt/bin/bash
ls: /mnt/bin/bash: No such file or directory
OK, this is now easy to fix. We can just copy the shell from the rescue disk:
usermode:~# cp -p /bin/bash /mnt/bin/bash
usermode:~# ls -l /bin/bash /mnt/bin/bash
-rwxr-xr-x 1 root root 461400 Feb 20 2000 /bin/bash
-rwxr-xr-x 1 root root 461400 Feb 20 2000 /mnt/bin/bash
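The ls -l comparison matches only size and timestamp. As a hedged aside, the sketch below rehearses the same repair on the host, using two scratch directories in place of the rescue and damaged filesystems, and adds a byte-for-byte check with cmp. The directory names and the stand-in file are assumptions for illustration, not part of the UML setup.

```shell
#!/bin/sh
# Rehearse the "copy the shell from the rescue disk" repair in scratch directories.
# rescue/ and damaged/ are hypothetical stand-ins for / and /mnt inside UML.
set -e
workdir=$(mktemp -d)
mkdir -p "$workdir/rescue/bin" "$workdir/damaged/bin"

# Stand-in for /bin/bash on the rescue disk: a small executable script.
printf '#!/bin/sh\necho shell ok\n' > "$workdir/rescue/bin/bash"
chmod 755 "$workdir/rescue/bin/bash"

# The damaged tree has no shell, as in the article.
if [ -e "$workdir/damaged/bin/bash" ]; then
    echo "unexpected: shell already present" >&2
    exit 1
fi

# cp -p preserves mode and modification time, as in the article's command.
cp -p "$workdir/rescue/bin/bash" "$workdir/damaged/bin/bash"

# cmp -s exits 0 only if the two files are byte-for-byte identical.
if cmp -s "$workdir/rescue/bin/bash" "$workdir/damaged/bin/bash" &&
   [ -x "$workdir/damaged/bin/bash" ]; then
    repair_status=ok
else
    repair_status=failed
fi
echo "repair: $repair_status"
rm -rf "$workdir"
```

Inside UML, the equivalent check would be cmp /bin/bash /mnt/bin/bash after the copy.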
Now, you can halt UML and boot it on no_bash to confirm that it again boots OK.
Backups, backups, backups
For our finale, we are going to make a backup of the filesystem and destroy enough of it that fixing it requires restoring the backup. The backup device will be an empty file that's large enough to hold our filesystem:
% dd if=/dev/zero of=backup seek=600 bs=$((1024*1024)) count=1
My filesystem is just over 500MB, so I created a 600MB backup file to allow for any overhead of the backup format. Replace the 600 in the seek argument with whatever size is appropriate for you. Now copy root_fs to trashed and boot it up with backup as disk 1.
% cp root_fs trashed
% linux ubd0=trashed ubd1=backup
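Because the dd command that creates the backup file writes a single block after seeking past the start, the file ends up seek plus count blocks long, and most of it is a hole that takes little real disk space on filesystems that support sparse files. A small sketch, run on the host and scaled down to a few megabytes, shows the size arithmetic and how a larger seek grows the file later. The seek values 6 and 8 and the scratch path are assumptions chosen so that it runs quickly.

```shell
#!/bin/sh
# Create a small sparse "backup" file with dd, then grow it with a larger seek.
set -e
workdir=$(mktemp -d)
backup="$workdir/backup"      # hypothetical scratch path
mb=$((1024*1024))

# Same form as the article's command, with seek=6 instead of seek=600.
dd if=/dev/zero of="$backup" seek=6 bs="$mb" count=1 2>/dev/null
size_before=$(( $(wc -c < "$backup") ))

# Grow the file by writing one block at a larger offset.
# conv=notrunc is an added precaution so existing contents are left alone.
dd if=/dev/zero of="$backup" seek=8 bs="$mb" count=1 conv=notrunc 2>/dev/null
size_after=$(( $(wc -c < "$backup") ))

echo "before: $((size_before / mb)) MB, after: $((size_after / mb)) MB"
rm -rf "$workdir"
```

By the same arithmetic, seek=600 with count=1 produces a file of 601 blocks of 1MB each.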
Log in, and make the backup on /dev/ubd/1. I'm using tar here. If you favor a different backup tool, feel free to use it. Notice that we're not creating a filesystem on this device. It's being used as a raw data device in exactly the same way as a tape.
If it fails with an I/O error, the backup file you created was too small. You can extend it by simply running dd on the file with a larger seek argument and retrying the backup.
usermode:~# tar clf /dev/ubd/1 /
tar: Removing leading '/' from member names
tar: Removing leading '/' from link names
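Treating the device as a tape works because a tar archive is a self-delimiting byte stream: tar stops reading at the end-of-archive marker, so the unused, zero-filled remainder of the device is ignored. The sketch below imitates that on the host with a regular file standing in for /dev/ubd/1; the scratch paths and the sample file are assumptions for illustration. Since tar would truncate a regular file if it wrote to it directly, the archive is piped through dd with conv=notrunc so the file keeps its full size, as a device would.

```shell
#!/bin/sh
# Write a tar archive onto a zero-filled "raw device" file and read it back.
set -e
workdir=$(mktemp -d)
device="$workdir/backup"      # hypothetical stand-in for /dev/ubd/1
mkdir -p "$workdir/src/etc"
echo "hello" > "$workdir/src/etc/motd"

# A 2MB zero-filled "device" with no filesystem on it.
dd if=/dev/zero of="$device" bs=1024 count=2048 2>/dev/null

# Write the archive at the start of the device without shrinking it.
(cd "$workdir/src" && tar cf - .) | dd of="$device" conv=notrunc 2>/dev/null

device_size=$(( $(wc -c < "$device") ))

# Reading back: tar lists the members and ignores the trailing zeros.
listing=$(tar tf "$device")
echo "$listing"
rm -rf "$workdir"
```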
When it's done, we will make "trashed" live up to its name:
usermode:~# rm -rf /bin /lib /usr/lib
Remove anything you like. Feel free to corrupt things, too. When you're done having fun, shut it down, using the mconsole if necessary.
Now, it's time to fix it back up. Boot UML with root_fs as disk 0, backup as disk 1 again, and trashed as disk 2:
% linux ubd0=root_fs ubd1=backup ubd2=trashed
Now, log in, mount the damaged filesystem on /mnt, cd to it, and restore the backup:
usermode:~# mount /dev/ubd/2 /mnt
usermode:~# cd /mnt
usermode:/mnt# tar xpf /dev/ubd/1
tar: : Cannot mkdir: No such file or directory
tar: Error exit delayed from previous errors
It succeeded, despite the error:
usermode:/mnt# ls bin
arch dd fgrep ls pidof run-parts touch ...
Now, you can check that it is fixed by halting UML and booting it on "trashed" again and seeing that it's fine.
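The whole backup, destroy, and restore cycle can also be rehearsed on the host without booting UML. This is a minimal sketch under stated assumptions: scratch directories stand in for the filesystem, a regular archive file stands in for the backup device, and a pristine copy is kept only so that the result can be checked with diff.

```shell
#!/bin/sh
# Rehearse backup, destruction, and restore on a scratch directory tree.
set -e
workdir=$(mktemp -d)
mkdir -p "$workdir/trashed/bin" "$workdir/trashed/lib" "$workdir/trashed/etc"
echo "ls"   > "$workdir/trashed/bin/ls"
echo "libc" > "$workdir/trashed/lib/libc.so"
echo "host" > "$workdir/trashed/etc/hostname"

# Keep a pristine copy purely for verification at the end.
cp -Rp "$workdir/trashed" "$workdir/pristine"

# Back up with relative member names, so there is no leading '/' to strip.
(cd "$workdir/trashed" && tar cf "$workdir/backup.tar" .)

# Make the tree live up to its name.
rm -rf "$workdir/trashed/bin" "$workdir/trashed/lib"

# Restore in place: cd to the damaged tree, then extract with permissions (p).
(cd "$workdir/trashed" && tar xpf "$workdir/backup.tar")

# diff -r exits 0 only if both trees hold the same files with the same contents.
if diff -r "$workdir/pristine" "$workdir/trashed" >/dev/null; then
    restore_status=ok
else
    restore_status=failed
fi
echo "restore: $restore_status"
rm -rf "$workdir"
```

Extracting from inside the damaged tree mirrors the cd /mnt step: the archive's member names are relative, so they land wherever tar is run.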
Hopefully this article has convinced you that UML can be a valuable system administration tool. I've demonstrated the creation of, and recovery from, several different types of sysadmin catastrophes.
Obviously, this is only a tiny sample of the possible disasters that can happen. You can ensure that you are prepared for them by making them happen and figuring out how to fix them. It is possible to make them happen on a physical machine, but it should be apparent that simulating them with UML is far more convenient, and almost completely authentic. The devices may have different names, but the procedures are exactly the same as on a physical machine.
With the publication of this article, I am inaugurating the Sysadmin Disaster of the Month on the UML web site at http://user-mode-linux.sourceforge.net/sdotm.html. I will present a disaster and take submissions of solutions. I will arbitrarily choose a winner each month based on criteria such as originality, subtlety, brevity, and parsimony. I will also take submissions of proposed disasters. If you have a disaster that you'd like featured, submit it, along with a proposed solution, if you have one.