Posts Tagged ‘Xen’

XenServer 6.0 with DRBD

Wednesday, January 18th, 2012

DRBD is a low-cost shared-SAN-like solution, which has several great benefits, among which are no single point of failure, and very low cost (local storage and network cable). Its main disadvantages are in the need to constantly monitor it, and make sure it does what’s expected. Also – in some cases – performance might be affected greatly.

If you need XenServer pool with VMs XenMotion (used to call it LiveMigration. I liked it better then…), but you cannot afford or do not want classic shared storage acting a single point of failure, DRBD could be for you. You have to understand the limitations, however.

The most important limitation is with data consistency. If you aim at using it as Active/Active, as I have, you need to make sure that under any circumstance you will not have split brain, as it will mean losing data (you will recover to an older point in time). If you aim at Active/Passive, or all your VMs will run on a single host, then the danger is lower, however – for A/A, and VMs spread across both hosts – the danger is imminent, and you should be aware of it.

This does not mean that you will have to run crying in case of split brain. It means you might be required to export/import VMs to maintain consistent data, and that you will have a very long downtime. Kinda defies the purpose of XenMotion and all…

Using the DRBD guid here, you will find an excellent solution, but not a complete one. I will describe my additions to this document.

So, first, you need to download the DRBD packages. I have re-packaged them, as they did not match XenServer with XS60E003 update. You can grub this particular tar.gz here: drbd-8.3.12-xenserver6.0-xs003.tar.gz . I did not use DRBD 8.4.1, as it has shown great instability and liked getting split-brained all the time. Don’t want it with our system, do we?

Make sure you have defined the private link between your hosts, both as a network interface, as described, and in both servers’ /etc/hosts file. It will be easier later. Verify that the host hostname matches the configuration file, else DRBD will not start.

Next, follow the mentioned guide.

Unlike this guide, I did not define DRBD to be Active/Active in the configuration file. I have noticed that upon reboot of the pool master (and always it), probably due to timing issues, as the XE Toolstack did not release the DRBD device, it would have started in split-brain mode, and I was incapable of handling it correctly. No matter when I have tried to set the service to start, as early as possible, it would have always start in split-brain mode.

The workaround was to let it start in passive mode, and while being read-only device, XE Toolstack cannot use it. Then I wait (in /etc/rc.local) for it to complete sync, and connect the PBD.

You will need each host PBD for this specific SR.

You can do it by running:

for i in `xe host-list --minimal` ; do 
echo -n "host `xe host-param-get param-name=hostname uuid=$i`  "
echo "PBD `xe pbd-list sr-uuid=$(xe  sr-list name-label=drbd-sr1 --minimal) --minimal`"

This will result in a line per host with the DRBD PBD uuid. Replace drbd-sr1 with your actual DRBD SR name.

You will require this info later.

My drbd.conf file looks like this:

# You can find an example in  /usr/share/doc/drbd.../drbd.conf.example

#include "drbd.d/global_common.conf";
#include "drbd.d/*.res";

resource drbd-sr1 {
protocol C;
startup {
degr-wfc-timeout 120; # 2 minutes.
outdated-wfc-timeout 2; # 2 seconds.
#become-primary-on both;

handlers {
    split-brain "/usr/lib/drbd/ root";

disk {
max-bio-bvecs 1;

net {
cram-hmac-alg "sha1";
shared-secret "Secr3T";
after-sb-0pri discard-zero-changes;
after-sb-1pri discard-secondary;
after-sb-1pri consensus;
after-sb-2pri disconnect;
#after-sb-2pri call-pri-lost-after-sb;
max-buffers 8192;
max-epoch-size 8192;
sndbuf-size 1024k;

syncer {
rate 1G;
al-extents 2099;

on xenserver1 {
device /dev/drbd1;
disk /dev/sda3;
meta-disk internal;
on xenserver2 {
device /dev/drbd1;
disk /dev/sda3;
meta-disk internal;

I did not force them both to become primary, as split-brain handling in A/A mode is very complex. I have forced them to start as secondary.
Then, in /etc/rc.local, I have added the following lines:

echo 1 > /sys/devices/system/cpu/cpu1/online
while grep sync /proc/drbd > /dev/null 2>&1
        sleep 5
/sbin/drbdadm primary all
/opt/xensource/bin/xe pbd-plug uuid=dfb02709-2483-a11a-cb0e-eac0fb05d8e2

This performs the following:

  • Add an additional core to Domain 0, to reduce chances of CPU overload with DRBD
  • Waits for any sync to complete (if DRBD failed, it will continue, but you will have a split brain, or no DRBD at all)
  • Brings the DRBD device to primary mode. I have had only one DRBD device, but this can be performed selectively for each device
  • Reconnects the PBD which, till this point in the boot sequence, was disconnected. An important note – replace the uuid with the one discovered above for each host – each host should unplug its own PBD.

To sum it up – until sync has been completed, the PBD will not be plugged, and until then, no VMs can run on this SR. Split brain handling for A/P configuration is so much easier.

Some additional notes:

  • I have failed horribly when the interconnect cable was down. I did not implement hardware fencing mechanisms, but it would probably be a very good practice for production systems. Disconnecting the cross cable will result in a split brain.
  • For this system to be worthy, it has to have external monitoring. DRBD must be monitored at all times.
  • Practice and document cases of single node failure, both nodes failure, host master failure, etc. Make sure you know how to react before it happens in real-life.
  • Performance was measured on a Linux RHEL6 VM to be about 82MB/s. The hardware it was tested on was Dell PE R610 with a very nice RAID5 array, etc. When the 2nd host was down, performance resulted in abour 450MB/s, so the bandwidth, in this particular case, matters.
  • Performance test was done using the command:
    dd if=/dev/zero bs=1M of=/tmp/test_file.dd oflag=direct
    Without the oflag=direct, the system will overload the disk write cache of the OS, and not the disk itself (at least – not immediately).
  • I did not test random-access performance.
Hope it helps

Oracle VM post-install check list

Saturday, May 22nd, 2010

Following my experience with OracleVM, I am adding my post-install steps for your pleasure. These steps are not mandatory, by design, but will help you get up and running faster and easier. These steps are relevant to Oracle VM 2.2, but might work for older (and newer) versions as well.

Define bonding

You should read more about it in my past post.

Define storage multipathing

You can read about it here.

Define NTP

Define NTP servers for your Oracle VM host. Make sure the daemon ‘ntpd’ is running, and following an initial time update, via

ntpdate -u <server>

to set the clock right initially, perform a sync to the hardware clock, for good measures

hwclock –systohc

Make sure NTPD starts on boot:

chkconfig ntpd on

Install Linux VM

If the system is going to be stand-alone, you might like to run your VM Manager on it (we will deal with issues of it later). To do so, you will need to install your own Linux machine, since Oracle supplied image fails (or at least – failed for me!) for no apparent reason (kernel panic, to be exact, on a fully MD5 checked image). You could perform this action from the command line by running the command

virt-install -n linux_machine -r 1024 -p –nographics -l nfs://iso_server:/mount

This directive installs a VM called “linux_machine” from nfs iso_server:/mount, with 1GB RAM. You will be asked about where to place the VM disk, and you should place it in /OVS/running_pool/linux_machine , accordingly.

It assumes you have DHCP available for the install procedure, etc.

Install Oracle VM Manager on the virtual Linux machine

This should be performed if you select to manage your VMs from a VM. This is a bit tricky, as you are recommended not to do so if you designing HA-enabled server pool.

Define autostart to all your VMs

Or, at least, those you want to auto start. Create a link from /OVS/running_pool/<VM_NAME>/vm.cfg to /etc/xen/auto/

The order in which ‘ls’ command will see them in /etc/xen/auto/ is the order in which they will be called.

Disable or relocate auto-suspending

Auto-suspend is cool, but your default Oracle VM installation has shortage of space under /var/lib/xen/save/ directory, where persistent memory dumps are kept.  On a 16GB RAM system, this can get pretty high, which is far more than your space can contain.

Either increase the size (mount something else there, I assume), or edit /etc/sysconfig/xendomains and comment the line  with the directive XENDOMAINS_SAVE= . You could also change the desired path to somewhere you have enough space on.

Hashing this directive will force regular shutdown to your VMs following a power off/reboot command to the Oracle VM.

Make sure auto-start VMs actually start

This is an annoying bug. For auto-start of VMs, you need /OVS up and available. Since it’s OCFS2 file system, it takes a short while (being performed by ovs-agent).

Since ovs-agent takes a while, we need to implement a startup script after it and before xendomains. Since both are markes “S99” (check /etc/rc3.d/ for details), we would add a script called “sleep”.

The script should be placed in /etc/init.d/

# sleep     Workaround Oracle VM delay issues
# chkconfig: 2345 99 99
# description: Adds a predefined delay to the initialization process


case "$1" in
start) sleep $DELAY
exit 0

Place the script as a file called “sleep” (omit the suffix I added in this post), set it to be executable, and then run

chkconfig –add sleep

This will solve VM startup problems.

Fix /etc/hosts file

If you are into multi-server pool, you will need that the host name would not be defined to address. By default, Oracle VM defines it to match, which will result in a poor attempt to create multi-server pool.

This is all I have had in mind for now. It should solve most new-comer issues with Oracle VM, and allow you to make good use of it. It’s a nice system, albeit it’s ugly management.

Update the OracleVM

You could use Oracle’s unbreakable network, if you are a paying customer, or you could use the Public Yum Server for your system.

Updates to Oracle VM Manager

If you won’t use Oracle Grid Control (Enterprise Manager) to manage the pool, you will probably use Oracle VM Manager. You would need to update the ovs-console package, and you will probably want to add tightvnc-java package, so that IE users will be able to use the web-based VNC services. You would better grub these packages from here.

Quickly install Xen Community Linux VM

Saturday, December 5th, 2009

On RHEL-type of systems, with virt-manager (libvirt), you can make use of virt-manager to easy your life. I, for myself, prefer to work with ‘xm‘ tools, but for the initial install, virt-manager is the quickest and most simple available tool.

To install a new Linux VM, all you need to follow this flow

Create an LV for your VM (I use LVs because it’s easier to manage). If not LV, use a file. To create an LV, run the following command

lvcreate -L 10G -n new_vm1 VolGroup00

I assume that the name you wish to grant is ‘new_vm1’ (better maintain order there, else you will find yourself with hundreds of small LVs you have no idea what to do with), and that the name of the volume group is ‘VolGroup00’. Change to different values to match your environment.

Next, make sure you have your ISO contents unpacked (you can use loop device) and exported via NFS (my favorite method).

To mount a CD/DVD ISO, you should use ‘mount’ command with the ‘loop’ options. This would look like this:

mount -o loop my_iso.iso /mnt/temp

Again, I assume the name of the ISO is my_iso.iso and that the target directory /mnt/temp is available.

Now, export your newly created directory. If you have NFS already running, you can either add to /etc/exports the newly mounted directory /mnt/temp and restart the ‘nfs’ service, or you could use ‘exportfs’ to add it:

exportfs -o no_root_squash *:/mnt/temp

would probably do the trick. I added ‘no_root_squash’ to make sure no permission/access problems present themselves during the installation phase. Test your export to verify it’s working.

Now you could begin your installation. Run the following command:

virt-install -n new_vm1 -r 512 -p -f /dev/VolGroup00/new_vm1 –nographics nfs://nfs_server:/mnt/temp

The name follows the ‘-n’ flag. The amount of RAM to give is 512MB. The -p means it’s paravirtualized. The -f shows which device will be the block device, and the last argument is the source of the installation. Do not use local files, as the VM installer should be able to access the installation source.

Following that, you should have a very nice TUI installation experience.

Now – let’s make this machine ‘xm’ compatible.

Currently, the VM is virt-manager compatible. It means you need virt-manager to start/stop it correctly. Since I prefer ‘xm’ commands, I will show you how to convert this machine to VM.

First – export its XML file:

virsh dumpxml new_vm1 > /tmp/new_vm1.xml

virsh domxml-to-native xen-xm /tmp/new_vm1.xml > /etc/xen/new_vm1

This should do the trick.

Now you can turn the newly created VM off, and remove the VM from virt-manager using

virsh undefine new_vm1

and you’re back to ‘xm’-only interface.

XenServer “Internal error: Failure… no loader found”

Saturday, October 24th, 2009

It has been long since I had the time to write here. I have recently been involved more and more with XenServer virtualization, as you might see in the blogs, and following a solution to a rather common problem, I have decided to post it here.

The problem: When attempting to boot a Linux VM on XenServer (5.0 and 5.5), you get the following error message:

Error: Starting VM ‘Cacti’ – Internal error: Failure(“Error from xenguesthelper: caught exception: Failure(\”Subprocess failure: Failure(\\\”xc_dom_linux_build: [2] xc_dom_find_loader: no loader found\\\\n\\\”)\”)”)

This is very common with Linux VMs which were converted from physical (or other, non-PV virtualization) to XenServer.

This will probably either happen during the P2V process, or after a successful update to the Linux VM.

The cause is that the original kernel, non PV-aware one, has not been removed, and GRUB likes to load from it. XenServer will use the GRUB menu, but will not display it to us to select our desired kernel.

With no chance to intervene, XenServer will attempt to load a PV-enabled machine using non-PV kernel, and will fail.

Preventing the problem is quite simple – remove your non-PV kernel (non-xen) so that future updates will not attempt to update it as well and set it to be the default kernel. Very simple.

Solving the problem in less than two minutes is a bit more tricky. Let’s see how to solve it.

All operations are performed from within the control domain. This guide does not apply to StorageLink or NetApp/Equalogic devices, as they behave differently. This applies only to LVM-over-something, whatever it may be.

First, we will need to find the name of the VDI we are to work on. Use xe in the following manner, using the VM’s name:

xe vbd-list vm-name-label=Cacti

uuid ( RO)             : 128f29dc-4a14-1a2d-75d1-8674d3d2403b
vm-uuid ( RO): eae053de-4a20-28a5-f335-f5a18dd79993
vm-name-label ( RO): Cacti
vdi-uuid ( RO): 90524af4-5b20-4412-9bfe-f1fe27f220b1
empty ( RO): false
device ( RO): xvda

uuid ( RO)             : de177727-b28a-8b79-e73e-d08366d56277
vm-uuid ( RO): eae053de-4a20-28a5-f335-f5a18dd79993
vm-name-label ( RO): Cacti
vdi-uuid ( RO): <not in database>
empty ( RO): true
device ( RO): xvdd

It is very common that xvdd is used for CDROM, so we can safely ignore the second section. The first section is the more interesting one. There is a correlation between the name of the VDI and the name of the LVM on the disk. We can find this specific LV using the following command. Notice that the name of the VDI is used here as the argument for the ‘grep’ command:

lvs | grep 90524af4-5b20-4412-9bfe-f1fe27f220b1

LV-90524af4-5b20-4412-9bfe-f1fe27f220b1 VG_XenStorage-4aa20fc2-fd92-20c2-c549-bed2597c622b -wi-a-  10.00G

We now have our LV path! As you can see, its status is offline. We need to set it to online state. Using both the LV and the VG name, we can do it like that:

lvchange -ay /dev/VG_XenStorage-4aa20fc2-fd92-20c2-c549-bed2597c622b/LV-90524af4-5b20-4412-9bfe-f1fe27f220b1

Now we can access the volume. We can actually check that the problem is the one we look for, using pygrub:

pygrub /dev/VG_XenStorage-4aa20fc2-fd92-20c2-c549-bed2597c622b/LV-90524af4-5b20-4412-9bfe-f1fe27f220b1

We should now see the GRUB menu of the VM at question. If you don’t see any menu, either you have missed a step or used the wrong disk.

The menu should show you all the list of kernels. The default one is the one highlighted, and if it doesn’t include the word “xen” with it, most likely that we have found the problem.

We now need to change to a PV-capable kernel. We will need to access the “/boot” partition of the Linux VM, and change GRUB’s options there.

First we map the disk to a loop device, so we can access its partitions:

losetup /dev/loop1 /dev/VG_XenStorage-4aa20fc2-fd92-20c2-c549-bed2597c622b/LV-90524af4-5b20-4412-9bfe-f1fe27f220b1

Notice that you need to use the entire path to the LV, that the LV is online, and that loop1 is not in use. If it is, you will have a message saying something like “LOOP_SET_FD: Device or resource busy”

Now we need to access its partitions. We will map them using ‘kpartx’ to /dev/mapper/ devices. Notice we’re using the same loop device name:

kaprtx -a /dev/loop1

Now, new files present themselves in /dev/mapper:

ls -la /dev/mapper/
total 0
drwxr-xr-x  2 root root     220 Oct 24 12:39 .
drwxr-xr-x 14 root root   16560 Oct 24 12:31 ..
crw——-  1 root root  10, 62 Sep 29 10:15 control
brw-rw—-  1 root disk 252,  5 Oct 24 12:39 loop1p1
brw-rw—-  1 root disk 252,  6 Oct 24 12:39 loop1p2
brw-rw—-  1 root disk 252,  7 Oct 24 12:39 loop1p3

Usually, the first partition represents /boot, so we can now mount it and work on it:

mount /dev/mapper/loop1p1 /mnt

All we need to do is edit /mnt/grub/menu.lst to match our requirements, and then wrap everything back up:

umount /mnt

kpartx -u /dev/loop1

losetup -d /dev/loop1

We don’t have to change the LV to offline, because the XenServer will activate it if it’s not, however, we could do it, to be on the safe side:

lvchange -an /dev/VG_XenStorage-4aa20fc2-fd92-20c2-c549-bed2597c622b/LV-90524af4-5b20-4412-9bfe-f1fe27f220b1

Now we can activate the VM, and see it boot successfully.

This whole process takes several minutes the first time, and even less later.

I hope this helps.

Xen guests cannot serv NFS requests

Tuesday, March 3rd, 2009

This sounds weird, but I have witnessed it today, and had to work rather hard to figure the cause of the problem.

When using ” Intel Corporation 82575EB Gigabit Network Connection (rev 02)” (as lspci reports), TCP offload causes problems.


  • The host can communicate with the guest flawlessly (including HTTP get for larger than 2.6k files)
  • Other external hosts/guests report NFS timeout during mount attempt
  • Other external hosts/guests take a long while running “showmount -e” on the target guest
  • Pings work flawlessly
  • HTTP get from external nodes halts at about 2660 bytes, to which it reaches almost immediately
  • VLAN tagged interfaces on other than the default VLAN (1) do not experience these problems (cause – unknown to me at the moment).

The solution is simple – disable the offload from the NIC, and be happy. You could do it using the following line:

ethtool -K eth0 tx off

This should do the trick. It is required only on Dom0, and was tested to work well with my own method of configuring bonds and VLAN tags, as described in this post.

I was able to find the required hint in this post, under comment by Alejandro Anadon, and from there, directly to Xen’s FAQ site and to the  solution mentioned above.

Due to ETHTOOL_OPTS parameter limitations, I have placed (in an ugly manner, I know) the relevant ethtool commands in /etc/rc.local – in contradiction to this great article which shows the correct way of doing this action. Seems to be solved since.