Posts Tagged ‘dump’

A bug in restore in Centos4.1 and probably RHEL 4 update 1

Sunday, February 26th, 2006

I’ve been to Hostopia today. The land of hosting servers. I’ve had an emergency job on one Linux server, due to a mistake I’ve made. It appears that the performance hindrance of using raid0 instead of raid1 (Centos/RH default raid setup is raid0 and not raid1, which led me to this mistake) for the root partition is terrible.

I tend to setup servers in the following way:

Small (100MB) raid1 partition (/dev/sda1 and /dev/sdb1, usually) for /boot.

Two separated partitions for swap (/dev/sda2 and /dev/sdb2), each just half the required total swap.

One large raid1 (/dev/sda3 and /dev/sdb3) containing LVM, which, in turn, holds the “/” and the rest of the data partitions, if required.

In this specific case, I’ve made a mistake and was not aware of it on time. I’ve setup the large LVM over a stripe (raid0) by mistake. I’ve had degraded performance on the server, and all disk access were slow. Very slow. Since it is impossible to break such a raid array without loosing data, I’ve had to backup the data currently there, and make sure I would be able to restore it. It’s an old habit of mine to use dump and restore. Both ends of the procedure worked so far perfectly, on all *nix operating systems I’ve had experience with. I’ve dumped the data, using one of the swap partitions as a container (formatted as ext3, of course), and was ready to continue.

I’ve reached the server farm, where all hosting servers stood in long rows (I’m so sorry I did not take a picture. Some of those so called “servers” had color leds in their fans!), and got busy on this specific server. Had to backup all from the start, as it failed to complete before (and this time, I’ve done so to my laptop through the 2nd NIC), and then I’ve booted into rescue mode, destroyed the LVM, destroyed the raid (md device), and recreated them. It went fine, except that restore failed to work. The claim was “. is not the root” or something similar. Checking restore via my laptop worked fine, but the server itself failed to work. Eventually, after long waste of time, I’ve installed minimal Centos4.1 setup on the server, and tried to restore through overwrite from within a chroot environment. It failed as well. Same error message. I’ve suddenly decided to check if I’ve had an update to the dump package, and there was. Installing it solved the issue. I was able to restore the volume (using the “u” flag, to overwrite files), and all was fine.

I’ve wasted over an hour over this stupid bug. Pity.

Keeping the static copy of the up-to-date restore binary. Now I will not have these problems again. I hope 🙂