Posts Tagged ‘RAID’

XenServer 6.0 with DRBD

Wednesday, January 18th, 2012

DRBD is a shared-SAN-like solution with several great benefits, among them no single point of failure and very low cost (local storage and a network cable). Its main disadvantages are the need to constantly monitor it and make sure it does what's expected, and that in some cases performance might be affected greatly.

If you need a XenServer pool with VM XenMotion (it used to be called LiveMigration; I liked that name better…), but you cannot afford or do not want classic shared storage acting as a single point of failure, DRBD could be for you. You have to understand the limitations, however.

The most important limitation is with data consistency. If you aim at using it as Active/Active, as I have, you need to make sure that under any circumstance you will not have split brain, as it will mean losing data (you will recover to an older point in time). If you aim at Active/Passive, or all your VMs will run on a single host, then the danger is lower, however – for A/A, and VMs spread across both hosts – the danger is imminent, and you should be aware of it.

This does not mean that you will have to run crying in case of split brain. It means you might be required to export/import VMs to maintain consistent data, and that you will have a very long downtime. Kinda defies the purpose of XenMotion and all…

Using the DRBD guide here, you will find an excellent solution, but not a complete one. I will describe my additions to this document.

So, first, you need to download the DRBD packages. I have re-packaged them, as the available ones did not match XenServer with the XS60E003 update. You can grab this particular tar.gz here: drbd-8.3.12-xenserver6.0-xs003.tar.gz . I did not use DRBD 8.4.1, as it has shown great instability and liked getting split-brained all the time. We don't want that on our system, do we?

Make sure you have defined the private link between your hosts, both as a network interface, as described, and in both servers' /etc/hosts file. It will make things easier later. Verify that each host's hostname matches the configuration file, or DRBD will not start.
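
For example, with the hostnames and replication addresses used in my drbd.conf below, both hosts' /etc/hosts would contain something like:

10.1.1.1    xenserver1
10.1.1.2    xenserver2

and running hostname on each host must return the name used in its "on …" section (xenserver1 or xenserver2 here), or drbdadm will refuse to bring the resource up.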

Next, follow the mentioned guide.

Unlike this guide, I did not define DRBD to be Active/Active in the configuration file. I noticed that upon reboot of the pool master (and always the master), probably due to timing issues, since the XE Toolstack had not released the DRBD device, it would start in split-brain mode, and I was unable to handle it correctly. No matter how early I tried to set the service to start, it would always come up in split-brain mode.

The workaround was to let it start in passive (secondary) mode; while it is a read-only device, the XE Toolstack cannot use it. Then I wait (in /etc/rc.local) for it to complete its sync, and plug the PBD.

You will need each host's PBD uuid for this specific SR.

You can do it by running:

SR_UUID=`xe sr-list name-label=drbd-sr1 --minimal`
for i in `xe host-list --minimal | tr ',' ' '` ; do
    echo -n "host `xe host-param-get param-name=hostname uuid=$i`  "
    echo "PBD `xe pbd-list sr-uuid=$SR_UUID host-uuid=$i --minimal`"
done

This will result in a line per host with the DRBD PBD uuid. Replace drbd-sr1 with your actual DRBD SR name.

You will require this info later.

My drbd.conf file looks like this:

# You can find an example in  /usr/share/doc/drbd.../drbd.conf.example

#include "drbd.d/global_common.conf";
#include "drbd.d/*.res";

resource drbd-sr1 {
    protocol C;
    startup {
        degr-wfc-timeout 120; # 2 minutes.
        outdated-wfc-timeout 2; # 2 seconds.
        #become-primary-on both;
    }

    handlers {
        split-brain "/usr/lib/drbd/notify.sh root";
    }

    disk {
        max-bio-bvecs 1;
        no-md-flushes;
        no-disk-flushes;
        no-disk-barrier;
    }

    net {
        allow-two-primaries;
        cram-hmac-alg "sha1";
        shared-secret "Secr3T";
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-1pri consensus;
        after-sb-2pri disconnect;
        #after-sb-2pri call-pri-lost-after-sb;
        max-buffers 8192;
        max-epoch-size 8192;
        sndbuf-size 1024k;
    }

    syncer {
        rate 1G;
        al-extents 2099;
    }

    on xenserver1 {
        device /dev/drbd1;
        disk /dev/sda3;
        address 10.1.1.1:7789;
        meta-disk internal;
    }
    on xenserver2 {
        device /dev/drbd1;
        disk /dev/sda3;
        address 10.1.1.2:7789;
        meta-disk internal;
    }
}

I did not force them both to become primary, as split-brain handling in A/A mode is very complex. I have forced them to start as secondary.
Then, in /etc/rc.local, I have added the following lines:

echo 1 > /sys/devices/system/cpu/cpu1/online
while grep sync /proc/drbd > /dev/null 2>&1
do
        sleep 5
done
/sbin/drbdadm primary all
/opt/xensource/bin/xe pbd-plug uuid=dfb02709-2483-a11a-cb0e-eac0fb05d8e2

This performs the following:

  • Adds an additional core to Domain 0, to reduce the chance of CPU overload with DRBD
  • Waits for any sync to complete (if DRBD failed, the script will continue, but you will have a split brain, or no DRBD at all)
  • Brings the DRBD device to primary mode. I had only one DRBD device, but this can be performed selectively for each device
  • Plugs the PBD which, until this point in the boot sequence, was unplugged. An important note – replace the uuid with the one discovered above for each host – each host should plug its own PBD.

To sum it up – until the sync has completed, the PBD will not be plugged, and until then, no VMs can run on this SR. Split-brain handling for an A/P configuration is so much easier.
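
For reference, manual split-brain recovery in an A/P setup like this goes roughly as follows (a sketch based on the DRBD 8.3 documentation; drbd-sr1 is the resource name from the configuration above, and the SR's PBD must not be plugged on the host whose data you discard):

# On the host whose changes you are willing to throw away ("victim"):
drbdadm secondary drbd-sr1
drbdadm -- --discard-my-data connect drbd-sr1
# On the host that keeps its data ("survivor"):
drbdadm connect drbd-sr1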

Some additional notes:

  • I have failed horribly when the interconnect cable was down. I did not implement hardware fencing mechanisms, but it would probably be a very good practice for production systems. Disconnecting the cross cable will result in a split brain.
  • For this system to be worthy, it has to have external monitoring. DRBD must be monitored at all times (a minimal check is sketched right after this list).
  • Practice and document cases of single node failure, both nodes failure, host master failure, etc. Make sure you know how to react before it happens in real-life.
  • Performance was measured on a Linux RHEL6 VM to be about 82MB/s. The hardware it was tested on was a Dell PE R610 with a very nice RAID5 array, etc. When the 2nd host was down, performance went up to about 450MB/s, so the replication link bandwidth, in this particular case, matters.
  • Performance test was done using the command:
    dd if=/dev/zero bs=1M of=/tmp/test_file.dd oflag=direct
    Without the oflag=direct, the test would mostly load the OS write cache rather than the disk itself (at least – not immediately).
  • I did not test random-access performance.
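
As a starting point for that external monitoring, a minimal check could look like the sketch below. It only asserts that the resource is Connected and fully UpToDate (relying on the /proc/drbd format of DRBD 8.3), and exits non-zero otherwise, so it can be wired into a cron job or a monitoring agent:

#!/bin/bash
# Minimal DRBD health check (sketch) - parses /proc/drbd of DRBD 8.3
STATUS=`cat /proc/drbd`
echo "$STATUS" | grep -q "cs:Connected" || { echo "DRBD is not connected"; exit 2; }
echo "$STATUS" | grep -q "ds:UpToDate/UpToDate" || { echo "DRBD is not fully synced"; exit 2; }
echo "DRBD OK"
exit 0
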
Hope it helps

Acquiring and exporting external disk software RAID and LVM

Wednesday, August 22nd, 2007

I had one of my computers die a short while ago. I wanted to get the data inside its disk into another computer.

Using the magical and rather cheap USB2SATA adapter I was able to connect the disk. However, the disk was part of a software mirror (md device) and had LVM on it. Gets a little complicated? Not really:

(connect the device to the system)

Now we need to query which device it is:

dmesg

It is quite easy. In my case it was /dev/sdk (don't ask). It showed something like this:

usb 1-6: new high speed USB device using address 2
Initializing USB Mass Storage driver…
scsi5 : SCSI emulation for USB Mass Storage devices
Vendor: WDC WD80 Model: WD-WMAM92757594 Rev: 1C05
Type: Direct-Access ANSI SCSI revision: 02
SCSI device sdk: 156250000 512-byte hdwr sectors (80000 MB)
sdk: assuming drive cache: write through
SCSI device sdk: 156250000 512-byte hdwr sectors (80000 MB)
sdk: assuming drive cache: write through
sdk: sdk1 sdk2 sdk3
Attached scsi disk sdk at scsi5, channel 0, id 0, lun 0

This is good. The original system was RH4, so the standard structure is /boot on the first partition, swap and then one large md device containing LVM (at least – my standard).

Let's list the partitions, just to be sure:

# fdisk -l /dev/sdk

Disk /dev/sdk: 80.0 GB, 80000000000 bytes
255 heads, 63 sectors/track, 9726 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdk1   *           1          13      104391   fd  Linux raid autodetect
/dev/sdk2              14         144     1052257+  82  Linux swap
/dev/sdk3             145        9726    76967415   fd  Linux raid autodetect

Good. As expected. Let’s activate the md device:

# mdadm --assemble /dev/md2 /dev/sdk3
mdadm: /dev/md2 has been started with 1 drive (out of 2).

It’s going well. Now we have the md device active, and we can try to scan for LVM:

# pvscan

PV /dev/md2 VG SVNVG lvm2 [73.38 GB / 55.53 GB free]

Activating the VG is a desired action. Notice the name – SVNVG (a note at the bottom):

# vgchange -a y /dev/SVNVG
3 logical volume(s) in volume group "SVNVG" now active

Now we can list the LVs and mount them on our desired location:

# lvs
LV       VG    Attr   LSize  Origin Snap%  Move Log Copy%
LogVol00 SVNVG -wi-a-  2.94G
LogVol01 SVNVG -wi-a-  4.91G
VarVol   SVNVG -wi-a- 10.00G

Mounting:

mount /dev/SVNVG/VarVol /mnt/

and it’s all ours.

To disconnect this disk, we need to reverse the above process.

First, we will umount the volume:

umount /mnt

Now we need to disable the Volume Group:

# vgchange -a n /dev/SVNVG
0 logical volume(s) in volume group "SVNVG" now active

0 logical volumes active means we were able to disable the whole VG.

Disable the MD device:

# mdadm --manage -S /dev/md2

Now we can disconnect the physical disk (actually, the USB) and continue with our life.

A note: RedHat systems name their volume groups using the default name VolGroup00. You cannot have two VGs with the same name! If you activate a VG which originated from an RH system and used the default name, and your current system uses the same default, you need to connect the disk to an external system (non-RH would do fine) and change the VG name using vgrename before you can proceed.
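
A sketch of that rename, done on the external system (the UUID is a placeholder; take the real one from the vgs output):

# Both VGs may show up as VolGroup00 - list them together with their UUIDs
vgs -o vg_name,vg_uuid
# Rename the foreign VG by its UUID to avoid the name clash, then activate it
vgrename <UUID-of-the-foreign-VG> VolGroup00_old
vgchange -a y VolGroup00_old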

Installing RHEL4 on HP DL140 G3 with the embedded RAID enabled

Friday, July 6th, 2007

While DL140 G3 is quite a new piece of hardware, RHEL4, even with the later updates, is rather old.

When you decide to install RHEL4 on a DL140 G3 server, my first recommendation is this: if you decide to use the embedded SATA-II RAID controller – don't. This is a driver-based RAID, much like the win-modem devices of the past. Some major parts of its operation are based on calculations done through the driver, directly on the host CPU. It has no advantages compared to software RAID, and its major disadvantage is its immobility – unlike a software-based mirror, this array cannot "migrate" to another server, unless that server is of the same type of hardware. I am not sure about it, but it might also require a close enough firmware version as well.

If it happens that your boss believes in win-RAID devices (despite the note above), or for some other reason you decide to use this win-RAID, here are the steps to install the system.

1. Download the latest driver disk image for RHEL4 from HP site.

2. If you have the privilege of having an NFS server, uncompress the image and put it on that server, where it can be accessed over the network.

3. Test that you can mount it from another server, and verify you can reach the image file. Debugging NFS misconfiguration mid-install can waste lots of time.

4. If you don’t have the privilege, I hope you have a USB floppy. Put the image on a floppy disk:

gzip -dc /path/to/compressed/image/file.gz | dd of=/dev/fd0

(this assumes the floppy is /dev/fd0 and not /dev/sdX, as it tends to be with USB floppies)

5. Boot the server with the first RHEL4 CD in the drive, or with PXE, or whatever is your favorite method. In the initial boot prompt type:

linux text dd=nfs:server:/path/to/nfs/disk/disk_image.dd

This assuming that the name of the file (including its full path) is /path/to/nfs/disk/disk_image.dd. For floppy users, type dd=floppy instead.

6. RHEL will boot, loading the "ahci" module (which is bad) during its startup. It will ask you to select through which network card the system is to reach the NFS server. I assume you have a working DHCP server at your site.

7. As soon as you are able to use the virtual terminal (Ctrl+F2) switch to it.

8. Run the following commands:

cd /tmp
mkdir temp
cd temp
gzip -S .cgz -dc /tmp/ramfs/DD-0/modules.cgz | cpio -id

cd to the modules directory and look at the modules. Figure out which module fits your running kernel; you can do this by using the 'uname -r' command.
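
For example (just to match things up; the installer runs the single-CPU kernel, so the version uname reports is the non-smp one):

# note the exact version string of the running kernel
uname -r
# list the kernel versions the driver disk provides modules for
ls /tmp/temp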

9. Run the following commands

rmmod ahci
rmmod adpahci
insmod KERNEL_VER/ARCH/adpahci.ko

Replace KERNEL_VER with your running(!!!) single-CPU kernel version, and replace ARCH with your architecture, either i386 or x86_64

10. Using Ctrl+F1 return to your running installer. Continue installation until the end but do not reboot the system when done.

11. When installation is done, before the reboot, return to the virtual console using Ctrl+F2.

12. Run the following commands to prepare your system for a happy reboot:

cp /tmp/temp/KERNEL_VER/ARCH/adpahci.ko /mnt/sysimage/lib/modules/KERNEL_VER/kernel/drivers/scsi/

cp /tmp/temp/KERNEL_VERsmp/ARCH/adpahci.ko /mnt/sysimage/lib/modules/KERNEL_VERsmp/kernel/drivers/scsi/

Notice that we’ve copied both the single CPU (UP) and the SMP versions.

13. Edit modprobe.conf of the system-to-be and remove the line containing “alias scsi_hostadapter ahci” from the file.
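
If you prefer doing that non-interactively, a one-liner such as this should do (a sketch; it assumes the alias line looks exactly as quoted above):

sed -i '/alias scsi_hostadapter ahci/d' /mnt/sysimage/etc/modprobe.conf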

14. Chroot into the system-to-be, and build your initrd:

chroot /mnt/sysimage
cd /boot
mv initrd-KERNEL_VER.img initrd-KERNEL_VER.img.orig
mv initrd-KERNEL_VERsmp.img initrd-KERNEL_VERsmp.img.orig
mkinitrd /boot/initrd-KERNEL_VER.img KERNEL_VER
mkinitrd /boot/initrd-KERNEL_VERsmp.img KERNEL_VERsmp
exit

If things went fine so far, you are now ready to reboot. Use Ctrl+F1 to return to the installation (anaconda) console, and reboot the system.

Notes:

1. You need to download the “Driver Diskette” from HP site.

2. The latest Driver Diskette will support only Update3 and Update4 based systems. At this time, Update5 has no modules by HP yet. You can compile your own, but this is not in our scope.

3. Avoid using floppies at all cost.

4. Do not install the system in full GUI mode. On the model I installed, the VGA (a Matrox device) had a bug that did not allow reaching the virtual text consoles – it disconnected the VGA. If you use a GUI installation, you will be required to reboot the system into rescue mode and do steps 7 to 14 then.

5. The emphasis on the word smp is meant to help you not forget it. This is the more important one.

6. On the system itself, using Xorg, I was able to reach max resolution of 640×480 even with the display drivers supplied by HP. I was able to reach 1024×768 only when using 256 colors.

Upgrading Ubuntu 6.06 to 6.10 with a non-standard software RAID configuration

Sunday, October 29th, 2006

A copy of a post made in the Ubuntu forums, with the hope it will help others in a situation such as the one I've been in. Available through here

In afterthought, I think the header should say non-common, rather than non-standard…

Hi all.

I have passed through this forum yesterday, searching for a possible cause for a problem I have had.
The theme was almost well known – after the upgrade from 6.06 to 6.10 the system won't boot. Lilo (I use Lilo, not Grub) would load the kernel, and after a while, never reaching "init", the system would wait for something, I didn't know what.

I tried disconnecting USB, setting different boot parameters, and even verified (using the kernel messages) that the disks didn't change their locations (they did not, although I have an additional IDE controller). Alas, it didn't seem good.

The weird thing is that during this wait (there was keyboard response, but the system didn't move on…), the disk LED flashed at a fixed rate. I thought it might have to do with a RAID rebuild; however, from the live-cd, there were no signs of such a case.

Then, on one of the "live-cd" boots, accessing the system via "chroot" again, I decided to open the initrd used by the system, in an effort to dig into the problem.
Using the following set of commands did it:
cp /boot/initrd.img-2.6.17-10-generic /tmp/initrd.gz
cd /tmp && mkdir initrd.out && cd initrd.out
gzip -dc ../initrd.gz | cpio -id

Following this, I had the initrd image extracted, and I could look into it. Just to be on the safe side, I looked into the image's /etc, and found there the mdadm/mdadm.conf file.
This file had different (wrong!) UUIDs for my software RAID setup (compared with "mdadm --detail /dev/md0" for md0, etc).
I have located the origin of this file to be the system's real /etc/mdadm/mdadm.conf, which originated a while ago, before I made many manual modifications (changed disks, destroyed some of the md devices, etc). I fixed the system's real /etc/mdadm/mdadm.conf file to reflect the current system, and recreated the initrd.img file (now with the up-to-date mdadm.conf). Updated Lilo, and the system was able to boot correctly this time.
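
For reference, recreating it inside the chroot goes roughly like this (a sketch; review the output of mdadm --detail --scan and remove the stale ARRAY lines by hand rather than trusting it blindly):

# append ARRAY lines with the correct UUIDs, then edit out the old, wrong ones
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
# rebuild the initramfs for the running kernel and reinstall the boot loader
update-initramfs -u -k 2.6.17-10-generic
lilo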

The funny thing is that even the previous kernel, which had its initrd.img built long ago and which had worked fine for a long while, failed to complete the boot process altogether on the upgraded system.

My system relevant details:

/dev/hda+/dev/hdc -> /dev/md2 (/boot), /dev/md3

/dev/hde+/dev/hdg -> /dev/md1
/dev/md1+/dev/md1 -> LVM2, including the / on it

Lilo is installed on both /dev/hda and /dev/hdc.

HP-UX and Software Raid1

Tuesday, August 8th, 2006

I installed an HP-UX 11i v2 on a PA-RISC server today, and it went quite fine. I used the "Technical Environment" DVDs for the installation. I was unable to find, however, the RAID1 (mirror) tools for the LVM.

Symptoms: there is no "-m" parameter to "lvextend". According to the documentation (or even better, HP Forum1 and HP Forum2), it is plain simple, using lvextend. Only here did I figure out that it is part of the LVM package for Enterprise Servers.

I finally found it on the DVD called "Mission Critical Operating Environment DVD 1", inside a bundle called "HPUX11i-OE-Ent". I selected "LVM" from the list there, installed it, let the system recompile the kernel, and rebooted. Then lvextend started accepting the "-m" flag.

Per the posts mentioned above, I ran:

for LVOL in `ls /dev/vg00/l*` ; do
    lvextend -m 1 $LVOL
done

Took a while, but at least it worked.
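
To verify the result, something like this quick check will do (a sketch; the exact lvdisplay field wording may differ between HP-UX releases):

for LVOL in `ls /dev/vg00/l*` ; do
    echo $LVOL
    lvdisplay $LVOL | grep -i "Mirror copies"
done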