Posts Tagged ‘LVM’

XenServer 6.5 PCI-Passthrough

Thursday, April 16th, 2015

Searching the web for how to perform PCI passthrough on XenServer mostly turns up information about previous versions. Since I have just completed setting up PCI passthrough on XenServer version 6.5 (with the recent update 8, just to give you some notion of the exact time frame), I am sharing it here.

Hardware: Cisco UCS blades, with fNIC. I wish to pass through two FC HBAs into a VM (it is going to act as a backup server, and I need it to access the FC tape). While all the other XenServers in this pool have four (4) FC HBAs, this particular XenServer node has six (6). I intend to use the first four for SR communication and the remaining two for PCI passthrough.

This is the output of ‘lspci | grep Fibre’:

0b:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
0c:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
0d:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
0e:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
0f:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)
10:00.0 Fibre Channel: Cisco Systems Inc VIC FCoE HBA (rev a2)

So, I want to pass through 0f:00.0 and 10:00.0. I had to add to /boot/extlinux.conf the following two entries after the word ‘splash’ and before the three dashes:

pciback.hide=(0f:00.0)(10:00.0) xen-pciback.hide=(0f:00.0)(10:00.0)

Initially, and contrary to the documentation, the parameter pciback.hide had no effect. As soon as the VM started, the command ‘multipath -l’ would hang forever (or until a hard reset of the host).
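
Both entries follow the same pattern, so they can be generated from the list of PCI addresses instead of being typed by hand. A minimal sketch; pciback_hide_args is a hypothetical helper name, not a XenServer tool:

```shell
# Hypothetical helper: build both boot parameters from a list of PCI addresses.
pciback_hide_args() {
    hide=""
    for bdf in "$@"; do
        hide="${hide}(${bdf})"
    done
    printf 'pciback.hide=%s xen-pciback.hide=%s\n' "$hide" "$hide"
}

pciback_hide_args 0f:00.0 10:00.0
```

The printed line is what gets pasted into /boot/extlinux.conf after the word ‘splash’.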

To apply the settings above, run ‘extlinux -i /boot’ and then reboot. I ran extlinux for good measure; I don’t think it is needed, but I did not read anything about it.

Now, when the host is back, we need to add the devices to the VM. Make sure that the VM is in ‘off’ state before doing that. Your command would look like this:

xe vm-param-set uuid=<VM UUID> other-config:pci=0/0000:0f:00.0,0/0000:10:00.0

The expression ‘0/0000’ is required. You can search for its purpose; in most cases, however, your value will look exactly like mine: ‘0/0000’.
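
The value is each PCI address prefixed with ‘0/0000:’ and joined with commas. A small sketch that builds it; pci_other_config is a hypothetical helper name, and the fixed ‘0/0000:’ prefix is an assumption that matches my case (‘lspci -D’ prints the addresses with their domain part, if you want to check that yours is 0000 as well):

```shell
# Hypothetical helper: turn PCI addresses into the other-config:pci value.
# Assumes the "0/0000:" prefix applies to every device.
pci_other_config() {
    value=""
    for bdf in "$@"; do
        value="${value:+${value},}0/0000:${bdf}"
    done
    printf '%s\n' "$value"
}

pci_other_config 0f:00.0 10:00.0
```

The output can be pasted after ‘other-config:pci=’ in the command above. Afterwards, ‘xe vm-param-get uuid=<VM UUID> param-name=other-config’ should show the pci key.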

Since my VM runs Windows, this is almost the end: start the VM and, if it boots correctly, install the Cisco VIC driver in it, as if it were a physical host. You’re done.

XenServer – increase LVM over iSCSI LUN size – online

Wednesday, September 4th, 2013

I tested the following procedure and found it to work. The XenServer version in this particular case is 6.1; however, I believe the method is generic enough to work on every version of XS, assuming you're using iSCSI and LVM (i.e. not NetApp, CSLG, NFS and the like). It might serve as a general guideline for Fibre Channel as well, but I did not test that, so I have no idea how it will work. It should work with some modifications when using multipath; regarding multipath, you can find some notes on increasing multipath disks in this blog. Check the comments too - they might offer a better and simpler way of doing it.

So - let's begin.

First - increase the size of the LUN through the storage. For NetApp, it involves something like:

lun resize /vol/XenServer/luns/SR1.lun +1t

You should always make sure your storage volume, aggregate, raid group, pool or whatever is capable of holding the data, or - if using thin provisioning - that a well tested monitoring system is available to alert you when running low on storage disk space.

Now, we should identify the LUN. From now on - every action should be performed on all XS pool nodes, one after the other.

cat /proc/partitions

We should keep the output of this command somewhere. We will use it later on to identify the expanded LUN.

Now - let's scan for storage changes:

iscsiadm -m node -R

Now, running the previous command again will give a slightly different output. We can now identify the modified LUN:

cat /proc/partitions
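
Comparing the two outputs by eye gets tedious on hosts with many LUNs. A minimal sketch of the comparison, assuming both outputs were saved to files; changed_devices and the file names in the usage note are hypothetical:

```shell
# Hypothetical helper: print "<name> <old blocks> <new blocks>" for every
# device whose size differs between two saved copies of /proc/partitions.
changed_devices() {
    awk '
        # First file: remember the block count of every device line.
        NR == FNR { if ($1 ~ /^[0-9]+$/) before[$4] = $3; next }
        # Second file: report devices whose block count has changed.
        $1 ~ /^[0-9]+$/ && ($4 in before) && before[$4] != $3 {
            print $4, before[$4], $3
        }
    ' "$1" "$2"
}
```

Usage would be along the lines of ‘cat /proc/partitions > /tmp/part.before’ before the rescan, ‘cat /proc/partitions > /tmp/part.after’ after it, and then ‘changed_devices /tmp/part.before /tmp/part.after’.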

We should increase it in size. XenServer uses LVM, so we should harness it to our needs. Let's assume that the modified disk is /dev/sdd.

pvresize /dev/sdd

After completing this task on all pool hosts, we should run the sr-scan command, either from the CLI or through the GUI. When the scan completes, the new size will show.
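
The steps above can be sketched as two small functions. This is only a sketch under assumptions: the function names and the DRY_RUN switch are mine, the device path and SR UUID must be replaced with your own, and nothing here copes with multipath.

```shell
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Per-host part: rescan the iSCSI sessions, then grow the PV.
# Run this on every pool host, one after the other.
grow_pv_on_host() {
    run iscsiadm -m node -R
    run pvresize "$1"
}

# Pool-wide part: rescan the SR once all hosts are done.
rescan_sr() {
    run xe sr-scan uuid="$1"
}
```

For example, ‘DRY_RUN=1 grow_pv_on_host /dev/sdd’ only prints the two commands, which is a cheap way to review them before running anything for real.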

Hope it helps!

Recovery of a StorageRepository (SR) in XenServer, part one

Wednesday, February 6th, 2013

In this part I will discuss a possible solution to a problem I encountered several times already – failure to understand XenServer use of LVM, but first – a little explanation of the topic.

XenServer makes extensive use of LVM technology in order to support the storage requirements of virtual disks. It is utilized in two ways: LVMoISCSI/LVMoHBA and ext. In both cases, XenServer defines the initial layout as an LVM framework. The LVM, except for the system disk, is positioned directly on the whole disk, and not on the first partition. I imagine that the desire to avoid dealing with GPT/Basic/other partitioning schemes is the root of this notion. While it does solve the disk partitioning method problem, it creates a different one: the PEBKC problem (Problem Exists Between Keyboard and Chair). Failing to understand that there is no partition on the disk, and that the data is structured directly on it, is the cause of relatively frequent deletion of the LVM structure, as it is replaced by a partitioning layout.

It usually happens in one of two ways. The first is that the LUN/disk is exposed directly to a Windows machine, which asks joyfully if one would like to ‘sign the partition’. If one does so, a basic partitioning structure is created, and the LVM data structure is overwritten by it. The second is a little less common, and involves a lack of understanding of the LVM structure as employed by XenServer when performing disk tasks as the root user directly on the XenServer host. In this case, the user is not aware of the data structure, and might be tempted to partition the disk and, God forbid, even format the created partition. The result would be a total loss of the SR.

This was about how data is structured and how it is erased or damaged.

I was surprised to discover the ‘easy’ method of recovery from a partitioning table layer over the LVM metadata. I assume that no one has attempted to format the resulting partition(s), but stopped only at creating the partition layout and attempting to understand why it doesn’t work anymore in XenServer.

The easy way, which will be discussed here, is the first of two articles I intend on writing about LVM recovery. If this ‘easy’ method works for you – no need to try your luck with the more complex one.

So, to work. If someone has created a partition layout, overwriting the LVM metadata structure as explained earlier, the symptom is that the disk has one or more partitions. For example, the output of ‘cat /proc/partitions’ would look like this (snipping the irrelevant parts):

   8    16  156290904 sdb
   8    17  156288321 sdb1

As is clearly visible, the second line (sdb1) should not be there. The output of ‘fdisk -l /dev/sdb’ showed (again – snipping the irrelevant parts):

/dev/sdb1               1       19457   156288321   83  Linux

It proves someone has manually attempted to partition the disk. Had a mount command worked (example: ‘mount /dev/sdb1 /mnt’) my response e-mail message would go like this: “Sorry. The data was overwritten. Can’t do anything about it”, however, this was not the case. Not this time.

The magic trick I used was to remove the partition entirely, freeing the disk to be identified as LVM, if it could – I wasn’t sure it would – and then take some recovery actions.

First – fdisk to remove the partition:

fdisk /dev/sdb << EOF
d
w
EOF

Now, a pvscan operation could take place. The following command returned the correct value – a PV ID which wasn’t there before, meaning that the PV information was still intact:
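
A check along these lines would confirm it. This is a minimal sketch, assuming plain ‘pvscan’ and its usual ‘PV <device> VG <name> ...’ output lines; pv_is_listed is a hypothetical helper name:

```shell
# Hypothetical helper: reads pvscan output on stdin and succeeds only
# if the given device appears there as a PV.
pv_is_listed() {
    awk -v dev="$1" '
        $1 == "PV" && $2 == dev { found = 1 }
        END { exit !found }
    '
}

# Usage on the host (needs root):
#   pvscan | pv_is_listed /dev/sdb && echo "PV metadata still there"
```

If the device is not listed, the LVM label is most likely gone as well, and this easy method will not be enough.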


Now, a simple ‘SR Repair’ operation could take place.

My next article in this series will show a more complex method of recovery to employ when this ‘easy’ one doesn’t work.

XenServer 6.0 with DRBD

Wednesday, January 18th, 2012

DRBD is a low-cost shared-SAN-like solution, which has several great benefits, among which are no single point of failure, and very low cost (local storage and network cable). Its main disadvantages are in the need to constantly monitor it, and make sure it does what’s expected. Also – in some cases – performance might be affected greatly.

If you need a XenServer pool with XenMotion for your VMs (it used to be called LiveMigration; I liked that better then…), but you cannot afford, or do not want, classic shared storage acting as a single point of failure, DRBD could be for you. You have to understand the limitations, however.

The most important limitation is with data consistency. If you aim at using it as Active/Active, as I have, you need to make sure that under any circumstance you will not have split brain, as it will mean losing data (you will recover to an older point in time). If you aim at Active/Passive, or all your VMs will run on a single host, then the danger is lower, however – for A/A, and VMs spread across both hosts – the danger is imminent, and you should be aware of it.

This does not mean that you will have to run crying in case of split brain. It means you might be required to export/import VMs to maintain consistent data, and that you will have a very long downtime. Kinda defies the purpose of XenMotion and all…

Using the DRBD guide here, you will find an excellent solution, but not a complete one. I will describe my additions to this document.

So, first, you need to download the DRBD packages. I have re-packaged them, as they did not match XenServer with the XS60E003 update. You can grab this particular tar.gz here: drbd-8.3.12-xenserver6.0-xs003.tar.gz . I did not use DRBD 8.4.1, as it has shown great instability and liked getting split-brained all the time. Don’t want it with our system, do we?

Make sure you have defined the private link between your hosts, both as a network interface, as described, and in both servers’ /etc/hosts file. It will be easier later. Verify that the host hostname matches the configuration file, else DRBD will not start.

Next, follow the mentioned guide.

Unlike this guide, I did not define DRBD to be Active/Active in the configuration file. I noticed that upon reboot of the pool master (and always the master), probably due to timing issues, as the XE Toolstack did not release the DRBD device, it would start in split-brain mode, and I was unable to handle it correctly. No matter when I set the service to start, even as early as possible, it would always start in split-brain mode.

The workaround was to let it start in passive mode; while it is a read-only device, the XE Toolstack cannot use it. Then I wait (in /etc/rc.local) for it to complete the sync, and connect the PBD.

You will need each host's PBD UUID for this specific SR.

You can do it by running:

for i in `xe host-list --minimal | tr ',' ' '` ; do
echo -n "host `xe host-param-get param-name=hostname uuid=$i`  "
echo "PBD `xe pbd-list host-uuid=$i sr-uuid=$(xe sr-list name-label=drbd-sr1 --minimal) --minimal`"
done

This will result in a line per host with the DRBD PBD uuid. Replace drbd-sr1 with your actual DRBD SR name.

You will require this info later.

My drbd.conf file looks like this:

# You can find an example in  /usr/share/doc/drbd.../drbd.conf.example

#include "drbd.d/global_common.conf";
#include "drbd.d/*.res";

resource drbd-sr1 {
  protocol C;
  startup {
    degr-wfc-timeout 120; # 2 minutes.
    outdated-wfc-timeout 2; # 2 seconds.
    #become-primary-on both;
  }

  handlers {
    split-brain "/usr/lib/drbd/ root";
  }

  disk {
    max-bio-bvecs 1;
  }

  net {
    cram-hmac-alg "sha1";
    shared-secret "Secr3T";
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-1pri consensus;
    after-sb-2pri disconnect;
    #after-sb-2pri call-pri-lost-after-sb;
    max-buffers 8192;
    max-epoch-size 8192;
    sndbuf-size 1024k;
  }

  syncer {
    rate 1G;
    al-extents 2099;
  }

  on xenserver1 {
    device /dev/drbd1;
    disk /dev/sda3;
    meta-disk internal;
  }
  on xenserver2 {
    device /dev/drbd1;
    disk /dev/sda3;
    meta-disk internal;
  }
}

I did not force them both to become primary, as split-brain handling in A/A mode is very complex. I have forced them to start as secondary.
Then, in /etc/rc.local, I have added the following lines:

echo 1 > /sys/devices/system/cpu/cpu1/online
while grep sync /proc/drbd > /dev/null 2>&1
do
        sleep 5
done
/sbin/drbdadm primary all
/opt/xensource/bin/xe pbd-plug uuid=dfb02709-2483-a11a-cb0e-eac0fb05d8e2

This performs the following:

  • Adds an additional core to Domain 0, to reduce chances of CPU overload with DRBD
  • Waits for any sync to complete (if DRBD failed, it will continue, but you will have a split brain, or no DRBD at all)
  • Brings the DRBD device to primary mode. I have had only one DRBD device, but this can be performed selectively for each device
  • Reconnects the PBD which, till this point in the boot sequence, was disconnected. An important note – replace the uuid with the one discovered above for each host – each host should plug its own PBD.

To sum it up – until sync has been completed, the PBD will not be plugged, and until then, no VMs can run on this SR. Split brain handling for A/P configuration is so much easier.
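
The wait loop in /etc/rc.local has no upper bound, so a sync that never finishes keeps the PBD unplugged forever without any notice. A minimal sketch of the same wait with a timeout; the function name, its parameters and the default values are assumptions of mine, not part of DRBD or XenServer:

```shell
# Hypothetical variant of the rc.local wait loop, with a timeout.
# $1 = status file (default /proc/drbd)
# $2 = maximum seconds to wait (default 3600)
# $3 = seconds between polls (default 5)
# Returns 0 once no sync is in progress, 1 if the timeout is reached.
wait_for_drbd_sync() {
    status_file=${1:-/proc/drbd}
    timeout=${2:-3600}
    interval=${3:-5}
    waited=0
    while grep -q sync "$status_file" 2>/dev/null; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

In rc.local this could be used as ‘wait_for_drbd_sync && /sbin/drbdadm primary all’, followed by the pbd-plug. As with the original loop, a missing /proc/drbd counts as "no sync in progress", so a DRBD that failed to start is not detected here.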

Some additional notes:

  • I have failed horribly when the interconnect cable was down. I did not implement hardware fencing mechanisms, but it would probably be a very good practice for production systems. Disconnecting the cross cable will result in a split brain.
  • For this system to be worthy, it has to have external monitoring. DRBD must be monitored at all times.
  • Practice and document cases of single node failure, both nodes failure, host master failure, etc. Make sure you know how to react before it happens in real-life.
  • Performance was measured on a Linux RHEL6 VM to be about 82MB/s. The hardware it was tested on was Dell PE R610 with a very nice RAID5 array, etc. When the 2nd host was down, performance rose to about 450MB/s, so the bandwidth, in this particular case, matters.
  • Performance test was done using the command:
    dd if=/dev/zero bs=1M of=/tmp/test_file.dd oflag=direct
    Without the oflag=direct, the system will overload the disk write cache of the OS, and not the disk itself (at least – not immediately).
  • I did not test random-access performance.
Hope it helps

LVM Recovery

Friday, May 29th, 2009

A friend of mine made a grave mistake – partitioning a disk that had a Linux LVM PV directly on it, without any partition table. Oops.

When dealing with multi-Tera sized disks, one gets to encounter limitations not known on smaller scales – the 2TB limitation. A normal (MBR) partition table can map only around 2TB, meaning that to create larger partitions, or even smaller partitions which lie beyond that limit, you have to take one of two actions:

  • Use a GPT partition table, which is meant for large disks, and partition the disk to the size limits you desire
  • Define an LVM PV directly on the block device (the command would look like ‘pvcreate /dev/sdb’ – see? No partitions)

“Surprisingly” and for no good reason, it appears that the disk which was used completely for the LVM PV suddenly had a single GPT partition on it. Hmmmm.

This is/was a single disk in a two-PV VG containing a single LV spanning the entire VG space. Following the “mysterious” actions, the VG refused to start, claiming that it could not find PV with PVID <some UID>.

This is a step where one should stop and call a professional if he doesn’t know for sure how to continue. These following actions are very risky to your data, and could result in you either recovering from tapes (if they exist) or seeking a new job, if this is/was some mission-critical data.

First – go to /etc/lvm/archive and find the latest file named after the VG which has been destroyed. Look into it – you should see the PV is in there. Search for the PV by the UID that the logs reported as missing.

Second – you need to remove the GPT partition from the disk. The PV will be recreated exactly as it was supposed to be before. Replace /dev/some_disk with your own device file.
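
Reading the archive file by eye works, but the PV entries are easy to confuse with the VG and LV ids. A minimal sketch that pulls out only the PV UUIDs, assuming the usual archive layout in which each PV block holds an ‘id = "..."’ line followed by a ‘device = "..."’ hint; list_archived_pvs is a hypothetical helper name:

```shell
# Hypothetical helper: print "<PV UUID> <device hint>" for every PV
# recorded in an LVM metadata archive file.
list_archived_pvs() {
    awk '
        # Remember the most recent id; VG and LV ids get overwritten
        # before any device line is reached.
        $1 == "id" && $2 == "=" { id = $3; gsub(/"/, "", id) }
        # Only PV blocks carry a device hint, so print on that line.
        $1 == "device" && $2 == "=" { dev = $3; gsub(/"/, "", dev); print id, dev }
    ' "$1"
}
```

Something like ‘list_archived_pvs "$(ls -t /etc/lvm/archive/VG_TEST_*.vg | head -n 1)"’ would run it against the newest archive of a VG named VG_TEST. The device column is only a hint recorded at the time of the backup; the UUID is what matters.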

fdisk /dev/some_disk



Third – Reread the VG archive file, to be on the safe side. Verify again that the PV you are about to recreate is the one you intend to recreate. When done, run the following command:

pvcreate -u <UID> /dev/some_disk

Again – the name of the device file has been changed in this example to prevent copy-paste incidents from happening.
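
Newer LVM2 releases are known to refuse ‘-u’ unless it comes together with ‘--restorefile’ (or ‘--norestorefile’). A minimal sketch for that case, which only prints the command so that it can be reviewed before being run; the helper name and the archive file name are hypothetical placeholders:

```shell
# Hypothetical helper: print (not run) a pvcreate command that recreates
# a PV with a known UUID, using the VG archive file as the restore file.
pvcreate_restore_cmd() {
    uuid=$1
    archive=$2
    dev=$3
    printf 'pvcreate --uuid %s --restorefile %s %s\n' "$uuid" "$archive" "$dev"
}

pvcreate_restore_cmd '<UID>' /etc/lvm/archive/VG_TEST_archive.vg /dev/some_disk
```

Printing instead of executing is deliberate: this is the one step that writes to the disk, and it deserves a second look.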

Fourth – Run vgcfgrestore with the name of the VG as parameter. This command would restore your meta information into the PV and VG.

vgcfgrestore VG_TEST

Fifth – Activate the VG:

vgchange -ay VG_TEST

Now the volumes should be up, and you have the ability to attempt to mount these volumes.

Notice that the data might be corrupted in some way. Running fsck is recommended, although time-consuming.

Good luck!