Posts Tagged ‘storage’

Hot-resize disks on Linux

Monday, April 6th, 2020

After major investigations around, I came to the conclusion that a full guide describing the procedure required for online disk resize on Linux (especially – expanding disks). I have created a guide for RHEL5/6/7/8 (works the same for Centos or OEL or ScientificLinux – RHEL-based Linux systems) which takes into account the following four scenarios:

  • Expanding a disk where there is a filesystem directly on disk (no partitioning used)
  • Expanding a disk where there is LVM PV directly on disk (no partitioning used)
  • Expanding a disk where there is a filesystem on partition (a single partition taking all the disk’s space)
  • Expanding a disk where there is an LVM PV on partition (a single partition taking all the disk’s space)

All four scenarios were tested with and without use of multipath (device-mapper-multipath). Also – notes about using GPT compared to MBR are given. The purpose is to provide a full guideline for hot-extending disks.

This document does not describe the process of extending disks on the storage/virtualisation/NAS/whatever end. Updating the storage client configuration to refresh the disk topology might differ in various versions of Linux and storage communication methods – iSCSI, FC, FCoE, AoE, local virtualised disk (VMware/KVM/Xen/XenServer/HyperV) and so on. Each connectivity/OS combination might require different refresh method called on the client. In this lab, I use iSCSI and iSCSI software initiator.

The Lab

A storage server running Linux (Centos 7) with targetcli tools exporting 5GB (or more) LUN through iSCSI to Linux clients running Centos5, Centos6, Centos7 and Centos8, with the latest updates (5.11, 6.10, 7.7, 8.1). See some interesting insights on iSCSI target disk expansion using linux LIU (targetcli command line) in my previous post.

The iSCSI clients all see the disk as ‘/dev/sda’ block device. When using LVM, the volume group name is tempvg and the logical volume name is templv. When using multipath, the mpath name is mpatha. On some systems the mpath partition would appear as mpatha1 and on others as mpathap1.

iSCSI client disk/partitions were performed like this:

Centos5:

* Filesystem on disk

1
2
mkfs.ext3 /dev/sda
mount /dev/sda /mnt

* LVM on disk

1
2
3
4
5
pvcreate /dev/sda
vgcreate tempvg /dev/sda
lvcreate -l 100%FREE -n templv tempvg
mkfs.ext3 /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

* Filesystem on partition

1
2
3
parted -s /dev/sda "mklabel msdos mkpart primary 1 -1"
mkfs.ext3 /dev/sda1
mount /dev/sda1 /mnt

* LVM on partition

1
2
3
4
5
6
parted -s /dev/sda "mklabel msdos mkpart primary 1 -1 set 1 lvm on"
pvcreate /dev/sda1
vgcreate tempvg /dev/sda1
lvcreate -l 100%FREE -n templv tempvg
mkfs.ext3 /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

Centos6:

* Filesystem on disk

1
2
mkfs.ext4 /dev/sda
mount /dev/sda /mnt

* LVM on disk

1
2
3
4
5
pvcreate /dev/sda
vgcreate tempvg /dev/sda
lvcreate -l 100%FREE -n templv tempvg
mkfs.ext4 /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

* Filesystem on partition

1
2
3
parted -s /dev/sda "mklabel msdos mkpart primary 1 -1"
mkfs.ext4 /dev/sda1
mount /dev/sda1 /mnt

* LVM on partition

1
2
3
4
5
6
parted -s /dev/sda "mklabel msdos mkpart primary 1 -1 set 1 lvm on"
pvcreate /dev/sda1
vgcreate tempvg /dev/sda1
lvcreate -l 100%FREE -n templv tempvg
mkfs.ext4 /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

Centos7/8:

* Filesystem on disk

1
2
mkfs.xfs /dev/sda
mount /dev/sda /mnt

* LVM on disk

1
2
3
4
5
pvcreate /dev/sda
vgcreate tempvg /dev/sda
lvcreate -l 100%FREE -n templv tempvg
mkfs.xfs /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

* Filesystem on partition

1
2
3
parted -a optimal -s /dev/sda "mklabel msdos mkpart primary 1 -1"
mkfs.xfs /dev/sda1
mount /dev/sda1 /mnt

* LVM on partition

1
2
3
4
5
6
parted -a optimal -s /dev/sda "mklabel msdos mkpart primary 1 -1 set 1 lvm on"
pvcreate /dev/sda1
vgcreate tempvg /dev/sda1
lvcreate -l 100%FREE -n templv tempvg
mkfs.xfs /dev/tempvg/templv
mount /dev/tempvg/templv /mnt

Some variations might exist. For example, use of ‘GPT’ partition layout would result in a parted command like this:

1
parted -s /dev/sda "mklabel gpt mkpart ' ' 1 -1"

Also, for multipath devices, replace the block device /dev/sda with /dev/mapper/mpatha, like this:

1
parted -a optimal -s /dev/mapper/mpatha "mklabel msdos mkpart primary 1 -1"

There are several common tasks, such as expanding filesystems – for XFS, using xfs_growfs <mount target> ; for ext3fs and ext4fs using resize2fs <device path>. Same goes for LVM expansion – using pvresize <device path>, followed by lvextend command, followed by the filesystem expanding command as noted above.

The document layout

The document will describe the client commands for each OS, sorted by action. The process would be as following:

  • Expand the visualised storage layout (storage has already expanded LUN. Now we need the OS to update to the change)
  • (if in use) Expand the multipath device
  • (if partitioned) Expand the partition
  • Expand the LVM PV
  • Expand the filesystem

Actions

For each OS/scenario/mutipath combination, we will format and mount the relevant block device, and attempt an online expansion.

Operations following disk expansion

Expanding the visualised storage layout

For iSCSI, it works quite the same for all OS versions. For other transport types, actions might differ.

1
iscsiadm -m node -R

Expanding multipath device

If using multipath device (device-mapper-multipath), an update to the multipath device layout is required. Run the following command (for all OSes)

1
multipathd -k"resize map mpatha"

Expanding the partition (if disk partitions are in use)

This is a bit complicated part. It differs greatly both in the capability and the commands in use between different versions of operation systems.

Centos 5/6

Online expansion of partition is impossible, except if used with device-mapper-multipath, in which case we force the multipath device to refresh its paths to recreate the device. It will result in an I/O error if there is only a single path defined. For non-multipath setup, a umount and re-mount is required. Disk partition layout cannot be read while the disk is in use.

Without Multipath
1
2
fdisk /dev/sda # Delete and recreate the partition from the same starting point
partprobe # Run when disk is not mounted, or else it will not refresh partition size
With Multipath
1
2
3
4
5
6
fdisk /dev/mapper/mpatha # Delete and recreate the partition from the same starting point
partprobe
multipathd -k"reconfigure" # Sufficient for Centos 6
multipathd -k"remove path sda" # Required for Centos 5
multipathd -k"add path sda" # Required for Centos 5
# Repeat for all sub-paths of expanded device

Centos 7/8

Without Multipath
1
2
fdisk /dev/sda # Delete and recreate partition from the same starting point. Sufficient for Centos 8
partx -u /dev/sda # Required for Centos 7
with Multipath
1
2
fdisk /dev/mapper/mpatha # Delete and recreate the partition from the same starting point. Sufficient for Centos 8
kpartx -u /dev/mapper/mpatha # Can use partx

Expanding LVM PV and LV

1
pvresize DEVICE
Device can be /dev/sda ; /dev/sda1 ; /dev/mapper/mpatha ; /dev/mapper/mpathap1 ; /dev/mapper/mpatha1 – according to the disk layout and LVM choice. lvextend -l +100%FREE /dev/tempvg/templv

Expanding filesystem

For ext3fs and ext4fs
1
resize2fs DEVICe
Device can be /dev/sda ; /dev/sda1 ; /dev/mapper/mpatha ; /dev/mapper/mpathap1 ; /dev/mapper/mpatha1 – according to the disk layout and LVM choice.
For xfs
1
xfs_growfs /mnt

Additional Considerations

MBR vs GPT

On most Linux versions (For Centos – up and including version 7) the command ‘fdisk’ is incapable of handling GPT partition layout. If using GPT partition layout, use of gdisk is recommended, if it exists for the OS. If not, parted is a decent although somewhat limited alternative.

gdisk command can also modify a partition layout (at your own risk, of course) from MBR to GPT and vice versa. This is very useful in saving large data migrations where legacy MBR partition layout was used on disks which are to be expanded beyond the 2TB limits.

GPT backup table is located at the end of the disk, so when extending a GPT disk, it is require to repair the GPT backup table. Based on my lab tests – it is impossible to both extend the partition and repair the GPT backup table location in a single call to gdisk. Two runs are required – one to fix the GPT backup table, and then – after the changes were saved – another to extend the partition.

Storage transport

I have demonstrated use of iSCSI software initiator on Linux. Different storage transport exist – each may require its own method of ‘notifying’ the OS of changed storage layout. See RedHat’s article about disk resizing (RHN access required). This article explains how to refresh the storage transport for a combination of various transports and RHEL versions. and sub-versions.

targetcli extend fileio backend

Friday, April 3rd, 2020

I am working on an article which will describe the procedures required to extend LUN on Linux storage clients, with and without use of multipath (device-mapper-multipath) and with and without partitioning (I tend to partition storage disks, even when this is not exactly required). Also – it will deal with migration from MBR to GPT partition layout, as part of this process.

During my lab experiments, I have created a dedicated Linux storage machine for this purpose. This is not my first, of course, and not likely my last either, however, one of the challenges I’ve had to confront was how to extend or resize in general an iSCSI LUN from the storage point of view. This is not as straight-forward as one might have expected.

My initial setup:

  • Centos 7 or later is used.
  • Using targetcli command-line (meaning – using LIO mechanism).
  • I am using ZFS for the purpose of easily allocating block devices and files on filesystems. This is not a must – LVM can do just right.
  • targetcli is using automatic saveconfig (default configuration).

I will not go over the whole process of setting up and running iSCSI target server. You can find this in so many guides around the web, such as this and that, as well as so many more. So, skipping that – we have a Linux providing three LUNs to another Linux over iSCSI. Currently – using a single network link.

Now comes the interesting part – if I want to expand/resize my LUN on the storage, there are several branches of possibilities.

Assuming we are using the ‘block’ backstore – there is nothing complicated about it – just extend the logical volume, or the ZFS volume, and you’re done with that. Here is an example:

LVM:

lvextend -L +1G /dev/storageVG/lun1

ZFS:

zfs set volsize=11G storage/lun1 # volsize should be the final size

Extremely simple. Starting at this point, LIO will know of the updated sizes, and will just notify any relevant party. The clients, of course, will need to rescan the iSCSI storage, and adept according to the methods in use (see my comment at the beginning of this post about my project).

It is as simple as that if using ‘fileio’ backstore with a block device. Although this is not the best recommended setup, it allows for (default) more aggressive write-back cache, and might reduce disk load. If this is how your backstore is defined (fileio + block device) – same procedure applies as before – extend the block device, and everyone is notified about it.

It becomes harder when using a real file as the ‘fileio’ backstore. By default, fileio will create a new file when defined, or use an existing one. It will use thin provisioning by default, which means it will not have the exact knowledge of the file’s size. Extending or shrinking the file, except for the possibility of data corruption, would have no impact.

Documentation about how to do is is non-existing. I have investigated it, and came to the following conclusion:

  • It is a dangerous procedure, so do it at your own risk!
  • It will result in a short IO failure because we will need to restart the service target.service

This is how it goes. Follow this short list and you shall win:

  • Calculate the desired size in bytes.
  • Copy to a backup the file /etc/target/saveconfig.json
  • Edit the file, and identify the desired LUN – you can identify the file name/path
  • Change the size from the specified size to the desired size
  • Restart the target.service service

During the service restart all IO would fail, and client applications might get IO errors. It should be faster than the default iSCSI retransmission timeout, but this is not guaranteed. If using multipath (especially with queue_if_no_path flag) the likeness of this to affect your iSCSI clients is nearly zero. Make sure you test this on a non-production environment first, of course.

Hope it helps.

HA ZFS NFS Storage

Tuesday, January 29th, 2019

I have described in this post how to setup RHCS (Redhat Cluster Suite) for ZFS services, however – this is rather outdated, and would work with RHEL/Centos version 6, but not version 7. RHEL/Centos 7 use Pacemaker as a cluster infrastructure, and it behaves, and configures, entirely differently.

This is something I’ve done several times, however, in this particular case, I wanted to see if there was a more “common” way of doing this task, if there was a path already there, or did I need to create my own agents, much like I’ve done before for RHCS 6, in the post mentioned above. The quick answer is that this has been done, and I’ve found some very good documentation here, so I need to thank Edmund White and his wiki.

I was required to perform several changes, though, because I wanted to use IPMI as the fencing mechanism before using SCSI reservation (which I trust less), and because my hardware was different, without multipathing enabled (single path, so there was no point in adding complexity for no apparent reason).

The hardware I’m using in this case is SuperMicro SBB, with 15x 3.5″ shared disks (for our model), and with some small internal storage, which we will ignore, except for placing the Linux OS on.

For now, I will only give a high-level view of the procedure. Edmund gave a wonderful explanation, and my modifications were minor, at best. So – this is a fast-paced procedure of installing everything, from a thin minimal Centos 7 system to a running cluster. The main changes between Edmund version and mine is as follows:

  • I used /etc/zfs/vdev_id.conf and not multipathing for disk names aliases (used names with the disk slot number. Makes it easier for me later on)
  • I have disabled SElinux. It is not required here, and would only increase complexity.
  • I have used Stonith levels – a method of creating fencing hierarchy, where you attempt to use a single (or multiple) fencing method(s) before going for the next level. A good example would be to power fence, by disabling two APU sockets (both must be disconnected in parallel, or else the target server would remain on), and if it failed, then move to SCSI fencing. In my case, I’ve used IPMI fencing as the first layer, and SCSI fencing as the 2nd.
  • This was created as a cluster for XenServer. While XenServer supports both NFSv3 and NFSv4, it appears that the NFSD for version 4 does not remove file handles immediately when performing ‘unexport’ operation. This prevents the cluster from failing over, and results in a node reset and bad things happening. So, prevented the system from exporting NFSv4 at all.
  • The ZFS agent recommended by Edmund has two bugs I’ve noticed, and fixed. You can get my version here – which is a pull request on the suggested-by-Edmund version.

Here is the list:

yum groupinstall “high availability”
yum install epel-release
# Edit ZFS to use dkms, and then
yum install kernel-devel zfs
Download ZFS agent
wget -O /usr/lib/ocf/resource.d/heartbeat/ZFS https://raw.githubusercontent.com/skiselkov/stmf-ha/e74e20bf8432dcc6bc31031d9136cf50e09e6daa/heartbeat/ZFS
chmod +x /usr/lib/ocf/resource.d/heartbeat/ZFS
systemctl disable firewalld
systemctl stop firewalld
systemctl disable NetworkManager
systemctl stop NetworkManager
# disable SELinux -> Edit /etc/selinux/config
systemctl enable corosync
systemctl enable pacemaker
yum install kernel-devel zfs
systemctl enable pcsd
systemctl start pcsd
# edit /etc/zfs/vdev_id.conf -> Setup device aliases
zpool create storage -o ashift=12 -o autoexpand=on -o autoreplace=on -o cachefile=none mirror d03 d04 mirror d05 d06 mirror d07 d08 mirror d09 d10 mirror d11 d12 mirror d13 d14 spare d15 cache s02
zfs set compression=lz4 storage
zfs set atime=off storage
zfs set acltype=posixacl storage
zfs set xattr=sa storage

# edit /etc/sysconfig/nfs and add to RPCNFSDARGS “-N 4.1 -N 4”
systemctl enable nfs-server
systemctl start nfs-server
zfs create storage/vm01
zfs set [email protected]/24,async,no_root_squash,no_wdelay storage/vm01
passwd hacluster # Setup a known password
systemctl start pcsd
pcs cluster auth storagenode1 storagenode2
pcs cluster setup –start –name zfs-cluster storagenode1,storagenode1-storage storagenode2,storagenode2-storage
pcs property set no-quorum-policy=ignore
pcs stonith create storagenode1-ipmi fence_ipmilan ipaddr=”storagenode1-ipmi” lanplus=”1″ passwd=”ipmiPassword” login=”cluster” pcmk_host_list=”storagenode1″
pcs stonith create storagenode2-ipmi fence_ipmilan ipaddr=”storagenode2-ipmi” lanplus=”1″ passwd=”ipmiPassword” login=”cluster” pcmk_host_list=”storagenode2″
pcs stonith create fence-scsi fence_scsi pcmk_monitor_action=”metadata” pcmk_host_list=”storagenode1,storagenode2″ devices=”/dev/sdb,/dev/sdc,/dev/sdd,/dev/sde,/dev/sdf,/dev/sdg,/dev/sdh,/dev/sdi,/dev/sdj,/dev/sdk,/dev/sdl,/dev/sdm,/dev/sdn,/dev/sdo,/dev/sdp” meta provides=unfencing
pcs stonith level add 1 storagenode1 storagenode1-ipmi
pcs stonith level add 1 storagenode2 storagenode2-ipmi
pcs stonith level add 2 storagenode1 fence-scsi
pcs stonith level add 2 storagenode2 fence-scsi
pcs resource defaults resource-stickiness=100
pcs resource create storage ZFS pool=”storage” op start timeout=”90″ op stop timeout=”90″ –group=group-storage
pcs resource create storage-ip IPaddr2 ip=1.1.1.7 cidr_netmask=24 –group group-storage

# It might be required to unfence SCSI disks, so this is how:
fence_scsi -d /dev/sdb,/dev/sdc,/dev/sdd,/dev/sde,/dev/sdf,/dev/sdg,/dev/sdh,/dev/sdi,/dev/sdj,/dev/sdk,/dev/sdl,/dev/sdm,/dev/sdn,/dev/sdo,/dev/sdp -n storagenode1 -o on
# Checking if the node has reservation on disks – to know if we need to unfence
sg_persist –in –report-capabilities -v /dev/sdc

Connecting EMC/NetApp shelves as JBOD to a Linux machine

Wednesday, April 29th, 2015

Let’s say you have old shelves of either EMC or NetApp with SAS or SATA disks in them. And let’s say you want to connect them via FC to a Linux machine and have some nice ZFS machine/cluster, or whatever else. There are few things to know, and to attend in order for it to work.

The first one is the sector size. For NetApp – this applies only to non SATA disks (I don’t know about SSDs, though), and for EMC this could apply, as far as I noticed, to all disks – sector size is not 512 bytes, but 520 – the additional 8 bytes are used for block checksum. Linux does not handle well 520 blocks – the following error message will appear in the logs:

Unsupported sector size 520.

To solve it, we will need to identify the disks – using sg3_utils (in Centos-like – yum install sg3_utils) and then modify them to block size of 512 bytes. To identify the disks, run:

sg_scan -i
/dev/sg0: scsi0 channel=3 id=0 lun=0
HP P410i 3.66 [rmb=0 cmdq=1 pqual=0 pdev=0xc]
/dev/sg1: scsi0 channel=0 id=0 lun=0
HP LOGICAL VOLUME 3.66 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg2: scsi3 channel=0 id=0 lun=0 [em]
hp DVD A DS8A5LH 1HE3 [rmb=1 cmdq=0 pqual=0 pdev=0x5]
/dev/sg3: scsi1 channel=0 id=0 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg4: scsi1 channel=0 id=1 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg5: scsi1 channel=0 id=2 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg6: scsi1 channel=0 id=3 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg7: scsi1 channel=0 id=4 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg8: scsi1 channel=0 id=5 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg9: scsi1 channel=0 id=6 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg10: scsi1 channel=0 id=7 lun=0
SEAGATE SX3500071FC DA04 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg11: scsi1 channel=0 id=8 lun=0
FUJITSU MXW3300FE 0906 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg12: scsi1 channel=0 id=9 lun=0
FUJITSU MXW3300FE 0906 [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg13: scsi1 channel=0 id=10 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg14: scsi1 channel=0 id=11 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg15: scsi1 channel=0 id=12 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg16: scsi1 channel=0 id=13 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]
/dev/sg17: scsi1 channel=0 id=14 lun=0
SEAGATE SX3300007FC D41B [rmb=0 cmdq=1 pqual=0 pdev=0x0]

So, for each sg device (member of our batch of disks) we need to modify the sector size.

Two ways to do so – the first suggested by this post here, is by using sg_format in the following manner:

sg_format –format –size=512 /dev/sg2

Another post suggested using a dedicated program called ‘setblocksize’. I followed this one, and it worked fine. I had to power cycle the disks before the Linux could use them.

I did notice that disk performance were not bright. I got about 45MB/s write, and about 65-70 MB/s read for large sequential operations, using something like:

dd bs=1M if=/dev/sdf of=/dev/null bs=1M count=10000
dd bs=1M if=/dev/null of=/dev/sdf oflag=direct count=10000 # WARNING – this writes on the disk. Do not use for disks with data!

Fairly disappointing. Also, using multipath, when the shelf is connected to one FC port, and then back to another, showed me that with the setting:

path_grouping_policy multibus

I got about 10MB/s less compared to using “failover” flag (the default for Centos 6). Whatever modification I did to the multipathd.conf, I was unable to exceed this number when using multiple access. These results were consistent when using multibus or group_by_serial, however, when a single path was active and the other was passive, It clearly showed better. I did modify rr_min_io and rr_min_io_rq, but with no effect.

The low disk performance could suggest I need to flush the original disk firmware, however, I am not sure I will do so. If anyone is reading this and had different results – I would love to hear about it.

Redhat Cluster and Citrix XenServer

Thursday, April 9th, 2015

I wanted to write down a guide for RHCS on RHEL/Centos6 and XenServer.

If you want to do that, you need to go through two major challenges which you will encounter. I want to save on the search and sum it all up together here.

The first difficulty is the shared disk. In order to set up most common cluster scenarios, you will need a shared storage. You could either map the VMs to an iSCSI LUNs external to the environment, however, if you do not have such infrastructure (either because everything is based on SAS/FC, or you do not have the ability to set up iSCSI storage with reasonable level of availability), you will want XenServer to allow you to share the VDI between two VMs.

In order to do so, you will need to add a flag to all your pool’s XenServers, and to create the VDI in a specific method. First – the flag – you need to create a file in /etc/xensource called “allow_multiple_vdi_attach”. Do not forget to add it to all your XenServers:

touch /etc/xensource/allow_multiple_vdi_attach

Next, you will need to create your VDI as “raw” type. This is an example. You need to change the SR UUID to the one you use:

xe vdi-create sm-config:type=raw sr-uuid=687a023b-0b20-5e5f-d1ef-3db777ce7ae4 name-label=”My Raw LVM VDI” virtual-size=8GiB type=user

You can find Citrix article about it here.

Following that, you can complete your cluster setup and configuration. I will not add details about it here, as this is not the focus of this article. However, when it comes to fencing, you will need a solution. The solution I used was a fencing agent which was written specifically for XenServer using XenAPI, by using the agent called fence-xenserver. I did not use the fencing agents repository (which this page also points to), because I was unable to compile the required components to run on Centos6. They just don’t compile well. This is, however, a simple Python script which actually works.

In order to make it work, I did the following:

  • Extracted the archive (version 0.8)
  • Placed fence_cxs* in /usr/sbin, and removed their ‘.py’ suffix
  • Placed XenAPI.py as-is in /usr/sbin
  • Verified /usr/sbin/fence_cxs* had execution permissions.

Now, I needed to add it to the cluster configuration. Since the agent cannot handle accessing a non-pool master, it had to be defined for each pool member (I cannot tell in advance which of them is going to have the pool master role when a failover should happen). So, this is my cluster.conf relevant parts:

<fencedevices>
<fencedevice agent=”fence_cxs_redhat” login=”root” name=”xenserver01″ passwd=”password” session_url=”https://xenserver01″/>
<fencedevice agent=”fence_cxs_redhat” login=”root” name=”xenserver02″ passwd=”password” session_url=”https://xenserver02″/>
<fencedevice agent=”fence_cxs_redhat” login=”root” name=”xenserver03″ passwd=”password” session_url=”https://xenserver03″/>
<fencedevice agent=”fence_cxs_redhat” login=”root” name=”xenserver04″ passwd=”password” session_url=”https://xenserver04″/>
</fencedevices>
<clusternodes>
<clusternode name=”clusternode1″ nodeid=”1″>
<fence>
<method name=”xenserver01″>
<device name=”xenserver01″ vm_name=”clusternode1″/>
</method>
<method name=”xenserver02″>
<device name=”xenserver02″ vm_name=”clusternode1″/>
</method>
<method name=”xenserver03″>
<device name=”xenserver03″ vm_name=”clusternode1″/>
</method>
<method name=”xenserver04″>
<device name=”xenserver04″ vm_name=”clusternode1″/>
</method>
</fence>
</clusternode>
<clusternode name=”clusternode2″ nodeid=”2″>
<fence>
<method name=”xenserver01″>
<device name=”xenserver01″ vm_name=”clusternode2″/>
</method>
<method name=”xenserver02″>
<device name=”xenserver02″ vm_name=”clusternode2″/>
</method>
<method name=”xenserver03″>
<device name=”xenserver03″ vm_name=”clusternode2″/>
</method>
<method name=”xenserver04″>
<device name=”xenserver04″ vm_name=”clusternode2″/>
</method>
</fence>
</clusternode>
</clusternodes>

Attached xenserver-fencing-cluster.xml for clarity (WordPress makes a mess out of that)

Note that I used four (4) entries, since my pool has four hosts. Also note the VM name (it is case sensitive), and your methods – one for each host, since you don’t want them running in parallel, but one at a time. Failover time is between 5-15 seconds on my tests, depending on who is the actually pool master (xenserver04 takes the longest, obviously). I did not test it with pool master down (before or without HA kicking in), nor with the hosts down and thus TCP timeout is longer (than when attempting to connect a host which responds immediately that it is not the pool master). However, if ILO fencing takes about 30-60 seconds, I am not complaining about the current timeouts.